COMP90016 2023 08 Variant Calling II


Computational Genomics

Lecture 8
Variant Calling II
Dr Khalid Mahmood

Before watching this lecture, make sure you are familiar with:
1 Intro & Genomics I
2 Genomics II
3 Sequencing technologies
5 Sequence alignment
7 Variant calling I

Today: 8 Variant calling II
Overview

● Improving variant calls

● Inheritance, haplotypes, phasing
● Structural variants
● (Maybe) variant calling in cancer
Improving high-throughput variants
In real-world variant calling from NGS, we usually take extra steps
to decrease false positives.

● Before alignment, trim/filter poor quality reads

● Before variant calling, improve the alignment

● After variant calling, filter variants

[Figure: GATK Best Practices protocol, germline short variant discovery workflow. Broad Institute]
Improving alignment
● A sequencing experiment results in a large
volume of sequencing reads
● Raw reads are not yet mapped to a reference; alignment does that
● Reads can contain errors and technical
artifacts
○ e.g. a molecule sequenced multiple times
will result in duplicate reads
● We need to pre-process the aligned reads
so they are ready for variant calling
Improving alignment
Before variant calling, improve the alignment:

● Duplicate read removal

○ remove experimental artefacts

● Local realignment
○ fix reads misaligned around indels

● Base quality score recalibration

○ correct FASTQ base qualities based on
alignment
Duplicate removal
Sequencing experiments can lead to one read being measured
many times. Typical sources of duplicate reads include:

● Optical duplicates: the camera/scanner/software on the
sequencing instrument can read one sequence cluster as
several.

● PCR duplicates / amplification bias: in replicating
('amplifying') our DNA sample to get enough DNA to image,
some sequences are amplified much more than the rest. Short
fragments of DNA are particularly favored.
Duplicate removal
Duplicated errors can look like variants:
Duplicate removal
The usual method is to remove reads that map to
the same location. This can accidentally remove
non-artifactual reads.

● less of a problem with paired-end reads
○ as both reads must match
● a problem for targeted, high-depth sequencing
○ e.g. high-depth exome, or smaller
capture regions
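The idea of "reads that map to the same location" can be sketched as below. This is a minimal illustration, not how any particular tool works: the read dictionaries and their keys (name, chrom, pos, strand, mate_pos, mean_qual) are hypothetical, and keeping the highest-quality copy is just one reasonable choice.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    Each read is a dict with hypothetical keys: name, chrom, pos, strand,
    mate_pos (None for single-end reads) and mean_qual.
    """
    groups = defaultdict(list)
    for read in reads:
        # Reads (or pairs) sharing all of these coordinates are treated as
        # copies of the same original molecule.
        key = (read["chrom"], read["pos"], read["strand"], read["mate_pos"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the copy with the highest mean base quality; flag the rest.
        best = max(group, key=lambda r: r["mean_qual"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "mate_pos": 350, "mean_qual": 30},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "mate_pos": 350, "mean_qual": 36},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "+", "mate_pos": 410, "mean_qual": 28},
]
duplicate_names = mark_duplicates(reads)
print(sorted(duplicate_names))
```

Note how r3 survives: it starts at the same position as r1 and r2, but its mate maps elsewhere. With single-end reads (mate_pos of None) all three would share one key, which is why paired-end data lose fewer genuine reads.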
Local realignment
Common problem: read aligned as a set of SNVs instead of an
indel.
● mapping may not have been optimal
● reads are aligned one at a time for
efficiency
● indels near the end of reads get ignored
to improve alignment score

This is a very important issue in calling both
SNVs and indels.
We can address it with local realignment or,
alternatively, local assembly.
Local realignment
Once reads are aligned, we can look locally at sets of reads in
the same region, and realign.
Check regions around known indels (e.g. dbSNP) and called
indels.
Choose the consensus variant(s) that minimise the total cost of
the read alignments.
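A toy version of "choose the consensus that minimises the total cost" might look like this. It is a sketch under strong assumptions: the sequences are made up, each read is placed without gaps on each candidate haplotype, the cost is simply the number of mismatches, and reads are assumed to be no longer than the candidates.

```python
def best_ungapped_cost(read, haplotype):
    """Fewest mismatches over all gap-free placements of read on haplotype."""
    n = len(read)
    return min(
        sum(a != b for a, b in zip(read, haplotype[i:i + n]))
        for i in range(len(haplotype) - n + 1)
    )


def choose_consensus(reads, candidates):
    """Return (name of cheapest candidate, total cost per candidate)."""
    totals = {
        name: sum(best_ungapped_cost(read, hap) for read in reads)
        for name, hap in candidates.items()
    }
    return min(totals, key=totals.get), totals


reference = "ACGTTAGCCATGGATCCA"
candidates = {
    "reference": reference,
    # Candidate with a 3 bp deletion (the bases at indexes 7 to 9 removed).
    "deletion": reference[:7] + reference[10:],
}
# Made-up reads that all span the deletion.
reads = ["GTTAGTGGA", "TAGTGGATC", "AGTGGATCC"]

best, totals = choose_consensus(reads, candidates)
print(best, totals)
```

Against the reference each read needs several mismatches, which is exactly the "set of SNVs instead of an indel" problem; against the deletion candidate every read fits perfectly.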
Filtering variants
In a perfect world our excellent experiments and realistic variant-
calling algorithm would give us the best possible prioritisation of
variant calls. No filtering necessary!

In reality we usually filter our variant call list after variant calling.
Filtering - improving specificity
In many research projects the motivation is to narrow the list of
calls to the most likely candidates for further investigation.

● Hard filtering
○ Filter on quality metrics, strand biases (overall reassess
supporting evidence)

● Confirm calls with other algorithms
○ i.e. take the intersection of multiple variant callers
○ Calls made by several callers usually validate well with Sanger sequencing

Generally more specificity => less sensitivity,
i.e. with filtering, we will lose at least some real variants
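Both filtering strategies can be sketched in a few lines. Everything specific here is an assumption made for illustration: the call dictionaries, the field names (qual, alt_fwd, alt_rev) and the thresholds are hypothetical, and real pipelines tune such cut-offs per data set.

```python
def hard_filter(calls, min_qual=30.0, max_strand_bias=0.9):
    """Keep calls that pass illustrative quality and strand-bias thresholds."""
    kept = []
    for call in calls:
        fwd, rev = call["alt_fwd"], call["alt_rev"]
        total = fwd + rev
        # Fraction of alt-supporting reads that sit on the dominant strand.
        bias = max(fwd, rev) / total if total else 1.0
        if call["qual"] >= min_qual and bias <= max_strand_bias:
            kept.append(call)
    return kept


def site(call):
    """Identify a call by its location and alleles."""
    return (call["chrom"], call["pos"], call["ref"], call["alt"])


caller_a = [
    {"chrom": "chr1", "pos": 102, "ref": "A", "alt": "G", "qual": 50, "alt_fwd": 6, "alt_rev": 5},
    {"chrom": "chr1", "pos": 104, "ref": "A", "alt": "T", "qual": 12, "alt_fwd": 3, "alt_rev": 4},
    {"chrom": "chr1", "pos": 250, "ref": "C", "alt": "T", "qual": 60, "alt_fwd": 10, "alt_rev": 0},
    {"chrom": "chr2", "pos": 77, "ref": "G", "alt": "A", "qual": 45, "alt_fwd": 4, "alt_rev": 4},
]
caller_b = [
    {"chrom": "chr1", "pos": 102, "ref": "A", "alt": "G"},
    {"chrom": "chr1", "pos": 250, "ref": "C", "alt": "T"},
]

filtered_a = hard_filter(caller_a)
# Intersection: keep only sites that a second caller also reports.
consensus = {site(c) for c in filtered_a} & {site(c) for c in caller_b}
print(sorted(consensus))
```

Each step shrinks the list: four calls become two after hard filtering and one after intersection. If the chr2 call was real, it has been lost, which is the specificity/sensitivity trade-off in miniature.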
Variant calling in practice
Summary:
● Sequencing is not perfect
○ errors, artifacts, incorrect qualities
● High-throughput alignment is not perfect
○ misalignment, particularly around indels
● Need further steps to reduce false positives
● Improve alignment before calling variants
○ remove duplicates, local realignment, BQSR
● Filter called variants to reduce false positives

● Other approaches include looking at biological
consequences of variants, population frequency
of variants, etc.
Ploidy, haplotypes, phasing
Ploidy
Humans are diploid: two
copies of each chromosome.

This isn't true of all creatures;
bacteria are haploid (just one
copy) and many plants are
polyploid (more than two copies).

So humans have two copies of
each gene. These two copies
will be slightly different.

Image from Wikimedia Commons

Inheritance and ploidy
Humans inherit one copy of each chromosome from each
parent.
We can do this because we are
diploid.

You will have a maternal copy and a
paternal copy of each chromosome.

Sex chromosomes (X and Y) are the
exception; the other chromosomes
are called autosomes.
e.g. men have one X and one Y.

Images adapted from Stanford at The Tech,
http://genetics.thetech.org/
Variants and ploidy
Since we have two copies of each gene, we could have zero,
one or two copies of a variant in that gene.

Having the variant on both copies is a homozygous variant.
Having it on just one copy is a heterozygous variant.
Having no copies, i.e. being the same as the reference genome,
is sometimes called "wild-type".
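In VCF files these three cases appear in the GT field, where 0 means the reference allele and 1 the first alternate allele. A minimal sketch of classifying such a genotype string (the function name is ours, not part of any library):

```python
def classify_genotype(gt):
    """Classify a diploid VCF-style GT string such as '0/1' or '1|1'."""
    # '/' separates unphased alleles and '|' phased ones; treat them alike.
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2:
        raise ValueError("expected a diploid genotype")
    if "." in alleles:
        return "missing"
    if alleles[0] != alleles[1]:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous variant"


for gt in ("0/0", "0/1", "1/1"):
    print(gt, classify_genotype(gt))
```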
Haplotypes and phasing
We've already discussed genotypes: AA, AB, BB.

Genotype of a (diploid) organism:
the set of homozygous and heterozygous variants.

However, this doesn't give us all the information.

Phasing
Example: say I have two heterozygous SNVs
(in my diploid genome):

chrom  pos  ref  alt
chr1   102  A    G
chr1   104  A    T

The heterozygous genotypes mean each variant
only appears on one of my two copies of
chromosome 1.
Phasing
So there are two possible situations.
Say the reference genome is: CCACAATTA
And the SNVs are:              G T

An individual's two chromosomes could look like

CCGCAATTA    (variants on different copies)
CCACTATTA
or
CCGCTATTA    (variants on the same copy)
CCACAATTA
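The two situations can be generated mechanically. In this sketch the window is assumed to start at position 100, which is what makes positions 102 and 104 line up with the two A's in the example; the helper name is ours.

```python
def apply_snvs(reference, start, snvs):
    """Return the reference window with (position, alt) SNVs applied.

    start is the genomic coordinate of the first base in the window.
    """
    seq = list(reference)
    for pos, alt in snvs:
        seq[pos - start] = alt
    return "".join(seq)


reference = "CCACAATTA"
start = 100  # assumed coordinate of the first base shown
snv1, snv2 = (102, "G"), (104, "T")

# One variant on each copy of the chromosome.
different_copy = (
    apply_snvs(reference, start, [snv1]),
    apply_snvs(reference, start, [snv2]),
)
# Both variants on one copy, leaving the other copy as reference.
same_copy = (
    apply_snvs(reference, start, [snv1, snv2]),
    apply_snvs(reference, start, []),
)
print(different_copy)
print(same_copy)
```

Both pairs give exactly the same unphased genotype calls (heterozygous at 102 and at 104), which is why genotypes alone cannot tell the two situations apart.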
Phasing
If a variant is phased it means we know which copy of the
chromosome it appears on (relative to other variants).
This can have significant consequences.

e.g. indels (altered codon boundaries affecting the reading frame):

Reference:
CCACAATT -> CCA.CAA.TTA…

Phased to the same chromosome (the frame is restored after the second insertion):
CCACAATT    -> CCA.CAA.TTA…
CCGACAGGATT -> CCG.ACA.GGA.TTA…

vs.

Phased on different chromosomes (both copies are frameshifted):
CCGACAATT  -> CCG.ACA.ATT.A…
CCACAGGATT -> CCA.CAG.GAT.TA…
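The codon strings in this example can be reproduced with a few lines. The trailing A on each sequence is the next base implied by the "TTA…" in the example, and the two helper functions are our own illustration.

```python
def codons(seq):
    """Split a sequence into codons, reading from the first base."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def keeps_frame(reference, haplotype):
    """True if the net length change is a multiple of 3."""
    return (len(haplotype) - len(reference)) % 3 == 0


reference = "CCACAATTA"
both_insertions = "CCGACAGGATTA"  # +1 bp and +2 bp on the same copy
first_only = "CCGACAATTA"         # +1 bp only
second_only = "CCACAGGATTA"       # +2 bp only

for hap in (reference, both_insertions, first_only, second_only):
    print(".".join(codons(hap)), keeps_frame(reference, hap))
```

Only the copy carrying both insertions gets back into frame downstream; when the insertions sit on different copies, each copy has a net shift of 1 or 2 bases.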
Phasing
If variants are close enough together we can use overlapping
reads or read pairs for phasing.
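A rough sketch of read-based phasing for the two SNVs from the earlier example: count how many reads covering both sites carry the alleles together. The reads are invented, assumed to be aligned without gaps, and given as (start position, sequence) pairs.

```python
from collections import Counter


def phase_from_reads(reads, site1, site2):
    """Count reads supporting each phasing of two heterozygous SNVs.

    reads: (start, sequence) tuples for gap-free alignments.
    site1, site2: (position, ref_allele, alt_allele).
    """
    counts = Counter()
    for start, seq in reads:
        observed = []
        for pos, ref, alt in (site1, site2):
            offset = pos - start
            if not 0 <= offset < len(seq):
                break  # the read does not cover this site
            base = seq[offset]
            if base not in (ref, alt):
                break  # neither allele: probably an error, skip the read
            observed.append(base == alt)
        else:
            # alt+alt or ref+ref on one read: the variants share a copy.
            label = "same copy" if observed[0] == observed[1] else "different copies"
            counts[label] += 1
    return counts


reads = [
    (100, "CCGCAAT"),
    (101, "CACTATT"),
    (100, "CCACTAT"),
    (102, "GCAATTA"),
    (103, "CAATTA"),  # starts after position 102, so it is uninformative
]
support = phase_from_reads(reads, (102, "A", "G"), (104, "A", "T"))
print(support)
```

Only reads (or read pairs) that span both sites are informative, which is why this works only when the variants are close together.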
Haplotypes
A haplotype (short for haploid genotype) is a set of
genotypes that are phased so that we know which variants
are on the same DNA molecules.

Some SNPs tend to be inherited together, so are
statistically associated with one another.

That is, we have some knowledge from the population
about how these SNPs are likely to be phased.
Haplotypes
You may also see this called linkage disequilibrium, which
means that alleles at the SNPs occur together more (or less)
often than expected under random association.
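Two standard summary statistics for this are D = p_AB - p_A * p_B and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)), where p_A and p_B are the alternate allele frequencies at the two SNPs and p_AB is the frequency of haplotypes carrying both. A small sketch with made-up haplotypes:

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two biallelic SNPs.

    haplotypes: list of (a, b) pairs, one per chromosome copy, where
    1 = alternate allele and 0 = reference allele at each SNP.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined when either SNP is monomorphic in the sample.
    r_squared = d * d / denom if denom else float("nan")
    return d, r_squared


always_together = [(1, 1)] * 4 + [(0, 0)] * 4
independent = [(1, 1), (1, 0), (0, 1), (0, 0)] * 2

linked_d, linked_r2 = linkage_disequilibrium(always_together)
free_d, free_r2 = linkage_disequilibrium(independent)
print(linked_d, linked_r2)
print(free_d, free_r2)
```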

We can think of the sets of statistically linked genotypes as the
set of known human haplotypes.
Haplotype reference data for humans is provided by the
International HapMap Project.
One use of known haplotypes is genotype imputation:
● measure the genotypes of some known SNPs
● use human haplotype information to deduce likely genotypes
of other, unmeasured SNPs in the individual
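The imputation steps above can be illustrated with a toy reference panel. This is only a sketch: the panel and genotypes are invented, and real imputation tools use probabilistic models over much larger panels rather than exact matching.

```python
from itertools import combinations_with_replacement


def impute(panel, observed):
    """Return every full genotype consistent with the measured SNPs.

    panel: known haplotypes as tuples of 0/1 alleles, one entry per SNP.
    observed: {snp_index: alt_allele_count} for the measured SNPs only.
    """
    candidates = set()
    for h1, h2 in combinations_with_replacement(panel, 2):
        # A diploid genotype is the allele count summed over two haplotypes.
        genotype = tuple(a + b for a, b in zip(h1, h2))
        if all(genotype[i] == count for i, count in observed.items()):
            candidates.add(genotype)
    return candidates


# Hypothetical panel over three SNPs; SNP 0 and SNP 1 always travel together.
panel = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
# We measured SNP 0 (heterozygous) and SNP 2 (homozygous reference).
observed = {0: 1, 2: 0}

possible = impute(panel, observed)
print(possible)
```

Because SNP 1 is in complete linkage with SNP 0 in this panel, the only consistent genotype has SNP 1 heterozygous too, even though it was never measured.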
Structural Variants
Detecting structural variants
Structural variants
● duplications
● insertions
● deletions
● translocations
● inversions
● complex events

Deletions and duplications give rise to copy number variation (CNV).

SVs generally mean large changes (larger than indels). They can be
larger than our read length.
Detecting structural variants
In NGS we have two main types of evidence:
● For CNV:
○ read depth (or SNP array intensity)
○ B-allele frequencies
● For all SVs:
○ breakpoints

Some methods are based on one or the other.

Software in this field is less mature than for SNV/indel
detection.
Detecting SVs: CNV
Can use read depth to look for breakpoints and/or to
characterise SV type.
Only useful for deletions/duplications
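A bare-bones read-depth approach: compare the depth in each window with the typical depth across the region. The window depths and ratio thresholds below are hypothetical; in a diploid genome a one-copy deletion is expected near a ratio of 0.5 and a one-copy duplication near 1.5.

```python
from statistics import median


def call_cnv_windows(depths, loss_ratio=0.75, gain_ratio=1.25):
    """Label each window by its depth relative to the median depth."""
    baseline = median(depths)
    if baseline == 0:
        raise ValueError("no coverage to normalise against")
    calls = []
    for depth in depths:
        ratio = depth / baseline
        if ratio <= loss_ratio:
            calls.append("deletion")
        elif ratio >= gain_ratio:
            calls.append("duplication")
        else:
            calls.append("normal")
    return calls


# Hypothetical mean depth in ten consecutive windows.
depths = [30, 31, 29, 15, 14, 30, 46, 44, 31, 30]
window_calls = call_cnv_windows(depths)
print(window_calls)
```

The boundaries between runs of differently labelled windows give approximate breakpoints, at the resolution of the window size.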

[Figures: read depth and B-allele frequency (% reads for alternate allele) across a deletion and a duplication. Alkan et al., Genome structural variation and genotyping, Nature Reviews 12, 363-376 (2011); Yau et al., Genome Biology 2010, 11:R92]
Detecting SVs: breakpoints
A breakpoint is a discontinuity in the genome: a point at which,
relative to the reference genome, the DNA has been broken and
reassembled.

Two parts of the genome that are not adjacent in the reference
may be adjacent at the breakpoint.
Detecting SVs: breakpoints
Using anomalous alignments:
● paired-end mapping: read pairs do not map at expected
distance or orientation
○ can span larger distances using short reads
○ gives only approximate breakpoint location
○ dependent on uncertainty in insert size
● split-read mapping: read maps over breakpoint, so ends of
read align to different parts of reference genome
○ need long enough reads (>200bp)
○ gives single-base resolution of breakpoint
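Paired-end evidence boils down to checking each pair against the expected insert size and orientation. The sketch below assumes a standard inward-facing paired-end library (leftmost read on the + strand, rightmost on the - strand); the insert-size mean, standard deviation and 3 SD cut-off are hypothetical values.

```python
def classify_pair(left_strand, right_strand, insert_size,
                  mean_insert=400, sd_insert=50, n_sd=3):
    """Classify a read pair by orientation and insert size.

    left_strand / right_strand: strand ('+' or '-') of the leftmost and
    rightmost mapped read of the pair.
    """
    orientation = left_strand + right_strand
    if orientation in ("++", "--"):
        return "inversion-like"  # both reads face the same way
    if orientation == "-+":
        return "tandem-duplication-like"  # reads face away from each other
    # '+-' is the expected orientation, so only the distance can be odd.
    if insert_size > mean_insert + n_sd * sd_insert:
        return "deletion-like"
    if insert_size < mean_insert - n_sd * sd_insert:
        return "insertion-like"
    return "concordant"


print(classify_pair("+", "-", 410))
print(classify_pair("+", "-", 900))
print(classify_pair("-", "+", 500))
print(classify_pair("+", "+", 800))
```

The wider the insert-size distribution, the larger a deletion must be before a pair stands out, which is the "dependent on uncertainty in insert size" point above.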
Detecting SVs: breakpoints
If there is a variant, we expect reads to be split and/or paired reads
to map discordantly:

[Figure: split-read mapping and paired-end mapping, no variant vs. deletion. A. Quinlan and I. Hall, Trends in Genetics 28, 43-53 (2012)]

Detecting SVs: breakpoints
Remember that in the data, we typically only get to see the
alignment of the reads to the reference.

[Figure: what the evidence looks like in the alignments, no variant vs. deletion. A. Quinlan and I. Hall, Trends in Genetics 28, 43-53 (2012)]

Detecting SVs: breakpoints
Other SV signatures:

[Figures: read-pair alignments on the alternate (Alt) and reference (Ref) sequences]

● Duplication (tandem)
● Inversion
○ Evidence: 2 read pairs – both have wider insert size and both reads are facing the same way
Detecting SVs
Basic approach:
● Look for breakpoints and/or CNV
○ breakpoints: paired-end reads / split read mapping
○ CNV: read depth / B-allele frequencies
○ Tools such as GRIDSS, Manta and DELLY use both
paired-end and split-read evidence
● Deduce SV events
○ which breakpoints are linked?
○ what are the likely SV types based on distances and
orientations?
● Possibly, try de novo assembly using reads from the relevant
regions
○ can be effective
○ repeat regions cause problems
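One simple way to ask "which breakpoints are linked?" is to cluster discordant pairs that look alike and sit close together, then keep clusters with enough support. This is a deliberately naive sketch with invented coordinates and thresholds, not the method used by any of the tools named above.

```python
def cluster_discordant(pairs, max_gap=500, min_support=2):
    """Group discordant pairs that plausibly support the same event.

    pairs: (left_pos, right_pos, label) tuples on one chromosome, where
    label is the kind of anomaly (e.g. 'deletion-like').
    """
    clusters = []
    for left, right, label in sorted(pairs, key=lambda p: (p[2], p[0])):
        last = clusters[-1] if clusters else None
        if (last is not None and last["label"] == label
                and left - last["lefts"][-1] <= max_gap
                and abs(right - last["rights"][-1]) <= max_gap):
            last["lefts"].append(left)
            last["rights"].append(right)
        else:
            clusters.append({"label": label, "lefts": [left], "rights": [right]})
    return [
        {
            "label": c["label"],
            "support": len(c["lefts"]),
            # Approximate event interval: between the innermost reads.
            "interval": (max(c["lefts"]), min(c["rights"])),
        }
        for c in clusters
        if len(c["lefts"]) >= min_support
    ]


pairs = [
    (1000, 3000, "deletion-like"),
    (1050, 3080, "deletion-like"),
    (1100, 3020, "deletion-like"),
    (9000, 9900, "deletion-like"),
    (5000, 5600, "inversion-like"),
]
events = cluster_discordant(pairs)
print(events)
```

The two isolated pairs are dropped for lack of support. Requiring several independent pairs per event is a common guard against chimeric or mismapped reads, at the cost of missing events in low-coverage regions.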
Detecting SVs: summary

Brief Funct Genomics, Volume 14, Issue 5, September 2015, Pages 305–314, https://doi.org/10.1093/bfgp/elv014
The content of this slide may be subject to copyright: please see the slide notes for details.
Summary overview of SV types and their mechanisms: https://doi.org/10.1038/nrg3373
