
GATKwr17-01-Intro To Variant Discovery

The document outlines best practices for variant discovery, focusing on genetic changes relative to a reference genome, including germline and somatic variants. It details the steps for data generation, including library preparation and sequencing, followed by analysis workflows for variant discovery and refinement. The document emphasizes the importance of data pre-processing and specific workflows tailored to different types of variants.

Uploaded by

danyalhamzah


GATK

Best Practices for Variant Discovery

Introduction to Variant Discovery

All you need to know to get started

http://software.broadinstitute.org/gatk/
Discover “variants” relative to a reference genome

• Genetic changes in individuals relative to a reference genome

• Germline (inherited)
• Somatic (cancer)

• Reference genome = a standardized genomic sequence

• Human genome reference sequence

• Current standard: hg19 / b37
• New standard (on the rise): hg38

• Other organisms
• Many have a fully assembled reference available
• Many still do not -> out of luck
Different types of variants

Germline / Somatic

SNP/SNV Indel CNV/CNA SV

Different types of experimental design

[Figure: a gene region (Intergenic, Exon I, Intron I, Exon II, Intergenic) with variant sites, comparing whole genome coverage with exome (+ gene panels) coverage]
Part 1

BEST PRACTICES
Data generation 1: Library prep
Varies depending on experimental design

1) Extract nucleic acids (DNA or RNA) from blood, tissue, saliva
2) RNA: Make cDNA (RT-PCR)
3) Shear dsDNA into fragments
4) Attach adapters to the fragments (-> library)
Data generation 2: Sequence the library

[Figure: library preps are loaded onto the lanes of an Illumina flowcell; sequencing produces an enormous pile of short reads. Image credit: Bitesizebio.com]

HTS machine processes a flowcell containing lanes;
each lane constitutes a read group (RG)
(unless multiplexed)
Raw sequence: typically in FASTQ format

• Sequence name (read name, group, etc.)
• Sequence
• + (optional: sequence name again)
• Associated quality scores

Example record

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88

Phred value = −10 * log10(ε), where ε is the probability that the base call is an error
-> ASCII code translates to Phred-scale Q scores

Accuracy            Error              Phred
90% confidence      10% error rate     Q10
99% confidence      1% error rate      Q20
99.9% confidence    0.1% error rate    Q30

20 and higher is generally a good score

Format specification: http://maq.sourceforge.net/fastq.shtml
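The Phred relationship above can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the offset of 33 (the Sanger / Phred+33 convention) is an assumption, since the slide does not state which encoding the example record uses.

```python
import math

# Assumption: Sanger / Phred+33 encoding; older Illumina files may use other offsets.
PHRED_OFFSET = 33


def error_to_phred(error_probability):
    """Phred value = -10 * log10(error probability)."""
    return -10 * math.log10(error_probability)


def phred_to_error(q):
    """Invert the Phred formula: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=PHRED_OFFSET):
    """Translate each ASCII character of a FASTQ quality line to a Q score."""
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    # Quality line from the example record on the slide.
    scores = decode_qualities(";;3;;;;;;;;;;;;7;;;;;;;88")
    print(scores)
    print(min(scores), "is the lowest Q score in this read")
```

Under the Phred+33 assumption, most bases in the example read decode to Q26, above the "20 and higher" rule of thumb.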
Part 2

ANALYSIS WORKFLOWS
Data pre-processing

Data Pre-Processing (FASTQ -> BAM) -> Variant Discovery (BAM -> VCF) -> Callset Refinement

Step 1: Map the reads produced by the sequencer to the reference

Mapping and alignment algorithms

• BWA for DNA

• STAR for RNAseq

[Figure: enormous pile of short reads from HTS + reference genome -> reads mapped to reference]
Output format: Sequence/Binary Alignment Map (SAM/BAM)

HEADER containing metadata (sequence dictionary, read group definitions, etc.)

RECORDS containing structured read information (1 line per read record)

SLX1:1:127:63:4 99 1 10052169 60 23M6N10M = 14 10 GAAGATACTGGTT 768832'48:::: SM:Z:JPTGBMN01 …

Fields, left to right: read name, flags, reference contig, position, MAPQ, CIGAR, mate information (3 columns), read sequence, quality scores, metadata

• Added mapping info summarizes position, quality, and structure for each read

http://samtools.github.io/hts-specs/SAMv1.pdf
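To make the record layout concrete, here is a small sketch that splits one SAM line into the eleven mandatory fields defined in the SAM specification and decodes the flag bits. The function and dictionary names are illustrative only; real SAM records are tab-separated, so the slide's example is rebuilt with tabs here, and the truncated trailing metadata ("…") is simply left out.

```python
# The eleven mandatory SAM columns, in order, as named in the SAM specification.
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}

# Flag bits from the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_record(line):
    """Split one tab-separated SAM record into named fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    record["tags"] = columns[len(SAM_FIELDS):]
    return record


def decode_flag(flag):
    """List the names of all bits set in a SAM flag value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


if __name__ == "__main__":
    example = "\t".join([
        "SLX1:1:127:63:4", "99", "1", "10052169", "60", "23M6N10M",
        "=", "14", "10", "GAAGATACTGGTT", "768832'48::::", "SM:Z:JPTGBMN01",
    ])
    record = parse_sam_record(example)
    print(record["rname"], record["pos"], record["cigar"])
    print(decode_flag(record["flag"]))
```

In practice a library such as pysam or htsjdk would be used instead of hand-rolled parsing; the sketch only shows how the columns map to the labels on the slide.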
CIGAR summarizes alignment structure

CIGAR = Concise Idiosyncratic Gapped Alignment Report

read1 99 ref 2 30 3M1D2M1I1M = 14 20 CATCTAG *
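A short sketch of how a CIGAR string such as the one in the example record can be interpreted. The operator sets follow the SAM specification (M, I, S, =, X consume read bases; M, D, N, =, X consume reference bases); the function names are illustrative.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")       # operators that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operators that advance along the reference


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operator) pairs."""
    tokens = CIGAR_TOKEN.findall(cigar)
    # Reject strings containing anything other than well-formed tokens.
    if "".join(length + op for length, op in tokens) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in tokens]


def read_length(cigar):
    """Number of read bases described by the CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


def reference_span(cigar):
    """Number of reference positions covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


if __name__ == "__main__":
    cigar = "3M1D2M1I1M"
    print(parse_cigar(cigar))
    print(read_length(cigar), reference_span(cigar))
```

For 3M1D2M1I1M the read length is 7, matching the 7-base sequence CATCTAG: the deletion consumes a reference position but no read base, and the insertion does the opposite.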

At Broad: Unmapped BAM instead of FASTQ

Special workflow using Picard tools for improved data management

[Figure: unmapped BAM + reference genome -> BWA MEM -> raw mapped BAM; MergeBamAlignment combines the raw mapped BAM with the unmapped BAM -> mapped, cleaned, sorted BAM]

Step 2: Mark duplicates to mitigate duplication artifacts

Duplicates = non-independent measurements of a sequence fragment

-> Must be discounted (marked and then ignored) to assess support for alleles correctly

Tool: Picard MarkDuplicates

[Figure: mapped reads stacked against the reference; highlighted bases = sequencing error propagated in duplicates]
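The idea behind duplicate marking can be sketched as follows. This is a deliberately simplified illustration, not Picard's implementation: it assumes each read is a plain dictionary with hypothetical keys (contig, start, strand, mate_start, qualities), groups reads that share the same alignment coordinates, keeps the one with the highest summed base quality, and flags the rest.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the indices of reads considered duplicates.

    Each read is a dict with hypothetical keys: contig, start, strand,
    mate_start, and qualities (a list of integer base qualities).
    """
    groups = defaultdict(list)
    for index, read in enumerate(reads):
        # Reads from the same original fragment map to the same coordinates.
        key = (read["contig"], read["start"], read["strand"], read["mate_start"])
        groups[key].append(index)

    duplicates = set()
    for indices in groups.values():
        # Keep the read with the best total base quality as the representative.
        best = max(indices, key=lambda i: sum(reads[i]["qualities"]))
        duplicates.update(i for i in indices if i != best)
    return duplicates


if __name__ == "__main__":
    reads = [
        {"contig": "1", "start": 100, "strand": "+", "mate_start": 300, "qualities": [30, 30, 30]},
        {"contig": "1", "start": 100, "strand": "+", "mate_start": 300, "qualities": [20, 20, 20]},
        {"contig": "1", "start": 150, "strand": "+", "mate_start": 350, "qualities": [25, 25, 25]},
    ]
    print(sorted(mark_duplicates(reads)))
```

Marking rather than deleting keeps the data intact: downstream tools can skip flagged reads while the original records remain available.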

Step 3: Local realignment around indels corrects mapping errors

BEFORE: several consecutive SNPs found only on reads ending on the right of the 7bp homopolymer run, and several consecutive SNPs found only on reads ending on the left of the homopolymer

AFTER: adding a 1-bp insertion brings sanity to the entire alignment

Optional with assembly-based variant callers
Step 4: Base Recalibration (BQSR) corrects for machine errors

• Sequencers make systematic errors in base quality scores

• BQSR corrects the quality scores (not the bases)

Example of bias: qualities reported depending on nucleotide context

[Figure: Empirical − Reported Quality by dinucleotide (AA, AG, CA, CG, GA, GG, TA, TG). Original: RMSE = 4.188; recalibrated: RMSE = 0.281]
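The core idea of recalibration, comparing reported qualities with empirically observed error rates per context, can be sketched like this. It is a toy model under stated assumptions, not GATK's algorithm: the input is a hypothetical list of (reported Q, dinucleotide, is_mismatch) observations taken at sites not known to be variant, only two covariates are used, and the +1/+2 pseudo-counts are an arbitrary smoothing choice.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Compute an empirical Phred quality per (reported Q, dinucleotide) bin.

    observations: iterable of (reported_q, dinucleotide, is_mismatch) tuples.
    Assumption: the caller has already excluded known variant sites, so a
    mismatch with the reference is treated as a sequencing error.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for reported_q, dinucleotide, is_mismatch in observations:
        bin_counts = counts[(reported_q, dinucleotide)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        # Pseudo-counts avoid log10(0) in bins with no observed errors.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = -10 * math.log10(error_rate)
    return table


if __name__ == "__main__":
    # 998 bases reported as Q30 in an AG context, 9 of which mismatch.
    obs = [(30, "AG", i < 9) for i in range(998)]
    for (reported, dinuc), empirical in empirical_qualities(obs).items():
        print(dinuc, "reported", reported, "empirical", round(empirical, 1))
```

In this made-up example the bases were reported as Q30 but behave like Q20, which is the kind of "Empirical − Reported" gap the plot shows before recalibration.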
Special handling for RNAseq splice junctions
Variant Discovery

Data Pre-Processing (FASTQ -> BAM) -> Variant Discovery (BAM -> VCF) -> Callset Refinement

Best Practices workflows by variant type

• Germline SNPs & Indels
• Somatic SNVs & Indels
• Somatic CNVs
Discovery of germline short variants is done on cohorts

• Single genome in isolation: almost never useful

• Family or population data add valuable information
– rarity of variants
– de novo mutations
– ethnic background
Visualization of reads at a probable SNP site in IGV

[Figure: IGV screenshot annotated with: depth of coverage; probable C/T SNP; first and second read from the same fragment; individual reads aligned to the genome; reference genome. Non-reference bases are colored; reference bases are grey.]
Short variants are reported in VCF: Variant Call Format

Header

##fileformat=VCFv4.1
##reference=1000GenomesPilot-NCBI36
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

Records

20 14370 rs6054257 G A 29 PASS DP=14;AF=0.5 GT:GQ:DP 0/0:48:1 1/0:48:8 1/1:43:5
20 1230237 . T . 47 PASS DP=13 GT:GQ:DP 0/0:54:7 0/0:48:4 0/0:61:2
20 1234567 . GT G 50 PASS DP=9 GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

Format specification in

https://samtools.github.io/hts-specs/VCFv4.2.pdf
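As an illustration of how the record columns fit together, the sketch below parses the first record from the slide into site-level fields, INFO annotations, and per-sample genotype fields. The function names are illustrative, and the record is rebuilt with tabs because VCF is a tab-delimited format; for real work a dedicated parser (for example pysam or htsjdk) is the safer choice.

```python
def parse_info(info):
    """Turn 'DP=14;AF=0.5;DB' into {'DP': '14', 'AF': '0.5', 'DB': True}."""
    result = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, separator, value = item.partition("=")
        result[key] = value if separator else True  # flags carry no value
    return result


def parse_vcf_record(line, sample_names):
    """Parse one tab-separated VCF data line into a nested dictionary."""
    columns = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filters, info, fmt = columns[:9]
    format_keys = fmt.split(":")
    samples = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, columns[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": float(qual),
        "filter": filters,
        "info": parse_info(info),
        "samples": samples,
    }


if __name__ == "__main__":
    names = ["NA00001", "NA00002", "NA00003"]
    line = "\t".join(["20", "14370", "rs6054257", "G", "A", "29", "PASS",
                      "DP=14;AF=0.5", "GT:GQ:DP", "0/0:48:1", "1/0:48:8", "1/1:43:5"])
    record = parse_vcf_record(line, names)
    print(record["chrom"], record["pos"], record["ref"], record["alt"])
    for name in names:
        print(name, record["samples"][name]["GT"])
```

The FORMAT column acts as a key for every sample column that follows: GT:GQ:DP says each sample entry lists genotype, genotype quality, and read depth in that order.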
Joint analysis from early stages empowers discovery

Individual callsets: underpowered analysis
Joint callset: empowered analysis

Discovery is empowered at difficult sites

• Sample #1 or Sample #N alone:
– weak evidence for variant
– may miss calling the variant

• Both samples seen together:
– unlikely to be artifact
– call the variant more confidently

And we get full information at all sites of interest

• Analyzed individually (Sample A, Sample B):
– No call for either sample
– Very different reasons!

• In joint analysis with other samples:
– Hom-ref call and no-call genotypes emitted
Traditional multi-sample calling approach: very inefficient

It gives us the right answers, but…

• Compute requirements scale very badly with number of samples!!!
• Want to add new samples? Got to re-run pipeline from scratch! The N+1 problem!

GVCF workflow solves both problems, yields same results

• Compute requirements scale linearly with number of samples
• Want to add a new sample? Just call it by itself then re-genotype the cohort at will!
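A back-of-the-envelope way to see the N+1 problem is to count how many times a sample has to go through the expensive per-sample calling step when a cohort grows one sample at a time. The cost model below is an assumption made purely for illustration (one unit of work per sample per run, with the cost of the joint genotyping step ignored); the slides do not quantify the scaling.

```python
def traditional_sample_passes(n_samples):
    """Total per-sample passes if the whole cohort is re-called from scratch
    every time one more sample is added (1 + 2 + ... + N)."""
    return sum(range(1, n_samples + 1))


def gvcf_sample_passes(n_samples):
    """Total per-sample passes if each sample is called once, by itself,
    and only the joint genotyping is repeated."""
    return n_samples


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, traditional_sample_passes(n), gvcf_sample_passes(n))
```

Under this toy model, growing a cohort to 100 samples costs 5050 per-sample passes with the traditional approach and 100 with the GVCF workflow.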
Best Practices workflows by variant type

• Germline SNPs & Indels
• Somatic SNVs & Indels
• Somatic CNVs

Key challenges: tumor heterogeneity and contamination

[Figure: a tumor sample containing a mixture of normal, dysplastic, and cancer cells, shown alongside tissue-adjacent normal and blood normal samples]

Adapted from https://science.education.nih.gov/supplements/nih1/cancer/guide/understanding1.html

Amount of signal may be comparable to noise

[Figure: signal vs. noise for germline and somatic variants; for somatic variants the signal is reduced by heterogeneity and contamination]

Expectation for germline variants:
+ AF expected to follow ploidy

Expectation for somatic variants:
+ no reliance on ploidy for AF
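The contrast between the two expectations can be made concrete with a simple allele fraction (AF) calculation. The model below is a common simplification and an assumption of this sketch, not something stated on the slides: the locus is copy-neutral and diploid in both tumor and normal cells, normal cells carry no copies of the somatic allele, `purity` is the fraction of cells in the sample that are tumor cells, and `clonal_fraction` is the fraction of tumor cells carrying the variant.

```python
def expected_germline_af(alt_copies, ploidy=2):
    """Germline AF follows ploidy: 0, 0.5 or 1.0 for a diploid genome."""
    return alt_copies / ploidy


def expected_somatic_af(purity, clonal_fraction, alt_copies=1, ploidy=2):
    """Expected AF of a somatic variant in a mixed sample.

    Only tumor cells (purity) that belong to the variant-carrying subclone
    (clonal_fraction) contribute alternate alleles.
    """
    return purity * clonal_fraction * alt_copies / ploidy


if __name__ == "__main__":
    print("germline het:", expected_germline_af(1))
    # 40% tumor cells in the sample, variant present in half of them.
    print("somatic het:", expected_somatic_af(purity=0.4, clonal_fraction=0.5))
```

With 40% purity and a subclone making up half the tumor, a heterozygous somatic variant is expected at about 10% AF, which can be hard to tell apart from sequencing noise at modest depth.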
Best Practices workflows by variant type

• Germline SNPs & Indels
• Somatic SNVs & Indels
• Somatic CNVs

Copy number: it’s all about coverage

• Collect proportional coverage
• Normalize to remove noise
• Identify segment boundaries
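The three steps can be sketched on toy data as follows. This is a hedged illustration only: normalizing against a single matched normal and calling a boundary wherever adjacent bins differ by a fixed threshold are stand-ins chosen for brevity, not the denoising or segmentation methods used by GATK. All names and the `min_jump` threshold are hypothetical.

```python
import math


def proportional_coverage(counts):
    """Step 1: express each bin's read count as a fraction of the sample total."""
    total = sum(counts)
    return [count / total for count in counts]


def log2_copy_ratio(tumor_counts, normal_counts):
    """Step 2: normalize tumor coverage against a normal sample, bin by bin.

    Assumes every bin has a non-zero count in both samples.
    """
    tumor = proportional_coverage(tumor_counts)
    normal = proportional_coverage(normal_counts)
    return [math.log2(t / n) for t, n in zip(tumor, normal)]


def segment_boundaries(ratios, min_jump=0.3):
    """Step 3: report bin indices where the ratio jumps by at least min_jump."""
    return [
        index
        for index in range(1, len(ratios))
        if abs(ratios[index] - ratios[index - 1]) >= min_jump
    ]


if __name__ == "__main__":
    tumor = [100, 100, 200, 200]   # read counts per bin; last two bins amplified
    normal = [100, 100, 100, 100]
    ratios = log2_copy_ratio(tumor, normal)
    print([round(r, 3) for r in ratios])
    print(segment_boundaries(ratios))
```

Because coverage is expressed as a proportion of each sample's total, differences in overall sequencing depth between tumor and normal cancel out before the ratio is taken.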

Callset Refinement

Data Pre-Processing (FASTQ -> BAM) -> Variant Discovery (BAM -> VCF) -> Callset Refinement

PROJECT DEPENDENT

https://software.broadinstitute.org/gatk/best-practices/
Part 3

PIPELINES AT BROAD

Pipelines implemented on local systems at Broad

[Figure: “The Picard Pipeline” (reads -> BAMs) and “FireHose”, shown as an implementation of the GATK Best Practices for pre-processing and germline SNPs & indels]

Pipelines implemented on Google Cloud

[Figure: “Genomes On The Cloud” (GOTC; reads -> BAMs -> WGS germline SNPs & indels) and “FireCloud / Workbench”]

https://software.broadinstitute.org/firecloud/
GOTC WDL script shared at
https://github.com/broadinstitute/wdl/tree/develop/scripts/broad_pipelines
Write your own pipelines in WDL!

Tired of these options for writing pipelines? Meet the Workflow Description Language

(low-tech) Plain Old Scripts
(high-tech) Domain-Specific Languages

Finally a workflow language meant to be read and written by humans

WDL makes it easy to:

- Describe analysis tasks
- Chain tasks into workflows
- Specify advanced behaviors like parallelization

https://software.broadinstitute.org/wdl/
