COMP90016 2023 08 Variant Calling II

Computational Genomics

Lecture 8
Variant Calling II
Dr Khalid Mahmood

Before watching this lecture, make sure you are familiar with:
● 1 Intro & Genomics I
● 2 Genomics II
● 3 Sequencing technologies
● 5 Sequence alignment
● 7 Variant calling I

Today: 8 Variant calling II
Overview

● Improving variant calls


● Inheritance, haplotypes, phasing
● Structural variants
● (Maybe) variant calling in cancer
Improving high-throughput variants
In real-world variant calling from NGS, we usually take extra steps
to decrease false positives.

● Before alignment, trim/filter poor quality reads

● Before variant calling, improve the alignment

● After variant calling, filter variants


GATK best practices protocol

Germline short variant discovery workflow

Broad Institute
Improving alignment
● A sequencing experiment produces a large volume of sequencing reads
● Raw reads are not yet mapped to a reference
● Reads can contain errors and technical artifacts
○ e.g. a molecule sequenced multiple times will result in duplicate reads
● After alignment, we need to pre-process the aligned reads so they are ready for variant calling
Improving alignment
Before variant calling, improve the alignment:

● Duplicate read removal


○ remove experimental artefacts

● Local realignment
○ fix reads misaligned around indels

● Base quality score recalibration


○ correct FASTQ base qualities based on
alignment
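The idea behind base quality score recalibration can be sketched as follows. This is a heavily simplified illustration, not the GATK implementation: it groups aligned bases only by their reported quality (real tools use further covariates), and the function name, input format and add-one smoothing are assumptions made for the example. For each reported quality, it counts how often the base disagrees with the reference at sites not known to be variant, and converts that empirical error rate back to a Phred score, Q = -10 log10(error rate).

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Return {reported_quality: empirical_quality}.

    observations: iterable of (reported_quality, is_mismatch) pairs, one per
    aligned base, taken only from sites NOT known to be variant (assumption).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        if is_mismatch:
            errors[reported_q] += 1

    table = {}
    for reported_q, n in totals.items():
        # Add-one smoothing (an assumption) avoids an infinite quality
        # when no errors were observed for this bin.
        error_rate = (errors[reported_q] + 1) / (n + 2)
        table[reported_q] = round(-10 * math.log10(error_rate))
    return table


# Hypothetical data: bases reported as Q30 that mismatch far too often.
example = [(30, True)] * 9 + [(30, False)] * 89
recalibrated = empirical_qualities(example)
```

In this made-up example the bases reported as Q30 are wrong about one time in ten, so their recalibrated quality is much lower than reported.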
Duplicate removal
Sequencing experiments can lead to one read being measured
many times. Typical sources of duplicate reads include:

● Optical duplicates: the camera/scanner/software on the


sequencing instrument can read one sequence cluster as
several.

● PCR duplicates / amplification bias: in replicating


(‘amplifying’) our DNA sample to get enough DNA to image,
we amplify one sequence much more than the rest. Short
fragments of DNA are particularly favored.
Duplicate removal
Duplicated errors can look like variants: an error that is copied into every duplicate read appears to be supported by many reads.
Duplicate removal
The usual method is to remove reads that map to the same location. This can accidentally remove non-artifactual reads.

● less of a problem with paired-end reads
○ as both reads must match
● a problem for targeted sequencing at high depth
○ e.g. high-depth exome, or smaller capture regions
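A minimal sketch of position-based duplicate marking, under stated assumptions: reads are plain dictionaries with hypothetical keys (`name`, `chrom`, `pos`, `strand`, `mate_pos`, `mean_qual`), and within each group of reads sharing the same coordinates we keep the one with the highest mean base quality. Including the mate position in the key reflects the point above that both reads of a pair must match.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    reads: list of dicts with keys name, chrom, pos, strand, mate_pos and
    mean_qual (a hypothetical record layout for this sketch).
    """
    groups = defaultdict(list)
    for read in reads:
        # Reads (pairs) count as duplicates only if both ends coincide.
        key = (read["chrom"], read["pos"], read["strand"], read["mate_pos"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["mean_qual"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "mate_pos": 350, "mean_qual": 30},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "mate_pos": 350, "mean_qual": 35},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "+", "mate_pos": 420, "mean_qual": 28},
]
flagged = mark_duplicates(reads)
```

Here `r1` and `r2` share both ends, so the lower-quality one is flagged; `r3` starts at the same place but its mate maps elsewhere, so it is kept.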
Local realignment
Common problem: read aligned as a set of SNVs instead of an
indel.
● mapping may not have been optimal
● reads are aligned one at a time for
efficiency
● indels near the end of reads get ignored
to improve alignment score

This is a very important issue in calling both SNVs and indels.
We can address it with local realignment or, alternatively, local assembly.
Local realignment
Once reads are aligned, we can look locally at sets of reads in
the same region, and realign.
Check regions around known indels (e.g. dbSNP) and called
indels.
Choose the consensus variant(s) that minimise the total cost of
the read alignments.
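The "minimise the total cost" step can be illustrated with a toy example. This is a sketch under strong simplifying assumptions, not how any particular realigner works: each candidate is a short local haplotype sequence (the reference, or the reference with a candidate indel applied), each read is placed without gaps at its best position, and cost is simply the number of mismatches. All names and sequences are made up.

```python
def best_placement_cost(read, haplotype):
    """Fewest mismatches over all ungapped placements of read on haplotype."""
    n = len(read)
    if n > len(haplotype):
        return n  # read cannot be placed; charge the maximum cost
    return min(
        sum(a != b for a, b in zip(read, haplotype[i:i + n]))
        for i in range(len(haplotype) - n + 1)
    )


def choose_consensus(reads, candidates):
    """Return the name of the candidate haplotype with the lowest total cost."""
    def total_cost(name):
        return sum(best_placement_cost(r, candidates[name]) for r in reads)
    return min(candidates, key=total_cost)


candidates = {
    "reference": "AAACCCGGGTTT",
    "deletion": "AAACCCTTT",  # reference with GGG deleted
}
reads = ["ACCCTT", "CCCTTT", "AACCCT"]
chosen = choose_consensus(reads, candidates)
```

Aligned one at a time against the reference, each of these reads looks like it carries a few mismatches; considered together, they are all explained with zero mismatches by the deletion haplotype.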
Filtering variants
In a perfect world our excellent experiments and realistic variant-calling
algorithm would give us the best possible prioritisation of variant calls.
No filtering necessary!

In reality we usually filter our variant call list after variant calling.
Filtering - improving specificity
In many research projects the motivation is to narrow the list of
calls to the most likely candidates for further investigation.

● Hard filtering
○ Filter on quality metrics and strand biases (overall, reassess the
supporting evidence)
● Confirm call with other algorithms
○ i.e. take the intersection of multiple variant callers
○ Usually validates well using Sanger sequencing

Generally more specificity => less sensitivity,
i.e. with filtering, we will lose at least some real variants.
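Both filtering strategies can be sketched in a few lines. The record layout (dicts with `chrom`, `pos`, `ref`, `alt`, `qual`, `strand_bias`, `depth`) and all threshold values are illustrative assumptions, not recommended settings; in practice thresholds are chosen per project and per caller.

```python
def hard_filter(variants, min_qual=30.0, max_strand_bias=60.0, min_depth=10):
    """Keep variants passing simple fixed thresholds (values are illustrative)."""
    return [
        v for v in variants
        if v["qual"] >= min_qual
        and v["strand_bias"] <= max_strand_bias
        and v["depth"] >= min_depth
    ]


def caller_intersection(*call_sets):
    """Return the (chrom, pos, ref, alt) keys reported by every caller."""
    keys = [
        {(v["chrom"], v["pos"], v["ref"], v["alt"]) for v in calls}
        for calls in call_sets
    ]
    return set.intersection(*keys)


caller_a = [
    {"chrom": "chr1", "pos": 102, "ref": "A", "alt": "G", "qual": 50.0, "strand_bias": 2.0, "depth": 40},
    {"chrom": "chr1", "pos": 500, "ref": "C", "alt": "T", "qual": 12.0, "strand_bias": 80.0, "depth": 6},
]
caller_b = [
    {"chrom": "chr1", "pos": 102, "ref": "A", "alt": "G", "qual": 45.0, "strand_bias": 1.0, "depth": 38},
]
passing = hard_filter(caller_a)
agreed = caller_intersection(caller_a, caller_b)
```

Both approaches discard the low-quality call at position 500. If that call were real, it would be lost, which is the sensitivity cost noted above.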
Variant calling in practice
Summary:
● Sequencing is not perfect
○ errors, artifacts, incorrect qualities
● High-throughput alignment is not perfect
○ misalignment, particularly around indels
● Need further steps to reduce false positives
● Improve alignment before calling variants
○ remove duplicates, local realignment, BQSR
● Filter called variants to reduce false positives

● Other approaches include looking at the biological consequences of
variants, the population frequency of variants, etc.
Ploidy, haplotypes, phasing
Ploidy
Humans are diploid: two copies of each chromosome.

This isn't true of all organisms; bacteria are haploid (just one copy)
and plants can be polyploid (lots of copies).

So humans have two copies of each gene. These two copies can differ
slightly.

Image from Wikimedia Commons


Inheritance and ploidy
Humans inherit one copy of each chromosome from each
parent.
We can do this because we are
diploid.

You will have a maternal copy and a paternal copy of each chromosome.

The sex chromosomes (X and Y), unlike the autosomes, are exceptions,
e.g. men have one X and one Y.

Images adapted from Stanford at The Tech, http://genetics.thetech.org/
Variants and ploidy
Since we have two copies of each gene, we could have zero,
one or two copies of a variant in that gene.

Having the variant on both copies is a homozygous variant.
Having it on just one copy is a heterozygous variant.
Having no copies, i.e. being the same as the reference genome,
is sometimes called "wild-type".
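As a small illustration, the three cases can be read off a VCF-style genotype string such as `0/1`, where `0` is the reference allele and `1` the variant allele. This sketch assumes a diploid, biallelic site; the function name is made up for the example.

```python
import re


def zygosity(genotype):
    """Classify a diploid, biallelic VCF-style genotype such as '0/1' or '1|1'."""
    alleles = re.split(r"[/|]", genotype)
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        raise ValueError(f"expected a diploid biallelic genotype, got {genotype!r}")
    variant_copies = alleles.count("1")
    labels = ["wild-type (homozygous reference)", "heterozygous", "homozygous variant"]
    return labels[variant_copies]
```

For example, `zygosity("0/1")` gives `"heterozygous"`. The `|` separator is the VCF convention for a phased genotype; the zygosity is the same either way.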
Haplotypes and phasing
We've already discussed genotypes (e.g. AA, AB, BB):

Genotype of a (diploid) organism:
the set of homozygous and heterozygous variants.

However, this doesn't give us all the information.
Phasing
Example: say I have two heterozygous SNVs:
(in my diploid genome)

chr1 102 A G
chr1 104 A T

The heterozygous genotypes mean each variant only appears on one of
my two copies of chromosome 1.
Phasing
So there are two possible situations.
Say the reference genome is: CCACAATTA
And the SNVs are:              G T

An individual's two chromosomes could look like

Different copies:
CCGCAATTA
CCACTATTA
or
Same copy:
CCGCTATTA
CCACAATTA
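The two situations can be generated mechanically from the reference and the two SNVs. In this sketch the reference string is assumed to start at position 100 of chr1, which is what makes positions 102 and 104 line up with the example; the function names are made up.

```python
def apply_snvs(reference, start, snvs):
    """Return reference with the given (pos, ref, alt) SNVs applied.

    start is the genomic position of the first base of reference.
    """
    bases = list(reference)
    for pos, ref, alt in snvs:
        index = pos - start
        if bases[index] != ref:
            raise ValueError(f"reference base at {pos} is {bases[index]}, not {ref}")
        bases[index] = alt
    return "".join(bases)


def phasing_options(reference, start, snv_a, snv_b):
    """Return the two possible chromosome pairs for two heterozygous SNVs."""
    return {
        "different copies": (
            apply_snvs(reference, start, [snv_a]),
            apply_snvs(reference, start, [snv_b]),
        ),
        "same copy": (
            apply_snvs(reference, start, [snv_a, snv_b]),
            reference,
        ),
    }


options = phasing_options("CCACAATTA", 100, (102, "A", "G"), (104, "A", "T"))
```

Both entries of `options` give exactly the same unphased genotype calls (two heterozygous SNVs), which is why the genotypes alone cannot distinguish them.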
Phasing
If a variant is phased it means we know which copy of the
chromosome it appears on (relative to other variants).
This can have significant consequences.

e.g. indels (altered codon boundaries affecting the reading frame):

Reference:
CCACAATT -> CCA.CAA.TTA…

Phased to the same chromosome:
CCACAATT -> CCA.CAA.TTA…
CCGACAGGATT -> CCG.ACA.GGA.TTA…

vs. phased on different chromosomes:
CCGACAATT -> CCG.ACA.ATT.A…
CCACAGGATT -> CCA.CAG.GAT.TA…
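The codon views in this example can be reproduced with a short sketch. The sequences below include the trailing A that appears in the codon view (an assumption about how the example continues), and the frame check simply tests whether the net length change on one chromosome copy is a multiple of three.

```python
def codons(sequence):
    """Split a sequence into codons from its first base, joined with dots."""
    return ".".join(sequence[i:i + 3] for i in range(0, len(sequence), 3))


def frame_preserved(reference, haplotype):
    """True if the net indel length on this haplotype is a multiple of 3."""
    return (len(haplotype) - len(reference)) % 3 == 0


reference = "CCACAATTA"
both_insertions = "CCGACAGGATTA"  # +1 and +2 on the same chromosome copy
first_insertion = "CCGACAATTA"    # +1 only
second_insertion = "CCACAGGATTA"  # +2 only

views = {name: codons(seq) for name, seq in [
    ("same", both_insertions),
    ("different 1", first_insertion),
    ("different 2", second_insertion),
]}
```

When the two insertions sit on the same copy they add three bases in total, so the downstream frame is restored and the other copy is untouched. On different copies each insertion shifts the frame on its own chromosome, so both copies of the gene are disrupted.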
Phasing
If variants are close enough together we can use overlapping
reads or read pairs for phasing.
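Read-based phasing of two nearby heterozygous variants can be sketched as a simple vote. The input format is an assumption for illustration: one `(allele_at_site_1, allele_at_site_2)` tuple per read or read pair that covers both sites, with each allele recorded as `"ref"` or `"alt"`. Real phasing tools also weigh base and mapping qualities.

```python
from collections import Counter


def phase_from_reads(observations):
    """Decide whether two heterozygous variants are on the same copy.

    observations: list of (allele_at_site_1, allele_at_site_2) tuples, one per
    read or read pair spanning both sites; alleles are 'ref' or 'alt'.
    """
    counts = Counter(observations)
    # Reads carrying both or neither variant support "same copy".
    same = counts[("alt", "alt")] + counts[("ref", "ref")]
    # Reads carrying exactly one variant support "different copies".
    different = counts[("alt", "ref")] + counts[("ref", "alt")]
    if same == different:
        return "unresolved"
    return "same copy" if same > different else "different copies"


spanning_reads = [("alt", "alt"), ("ref", "ref"), ("alt", "alt"), ("alt", "ref")]
phase = phase_from_reads(spanning_reads)
```

The single discordant observation here is outvoted, as would be expected for a sequencing error. With no spanning reads at all the function returns `"unresolved"`, which is the situation for variants that are too far apart.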
Haplotypes
A haplotype (short for haploid genotype) is a set of
genotypes that are phased so that we know which variants
are on the same DNA molecules.

Some SNPs tend to be inherited together, so are statistically
associated with one another.

That is, we have some knowledge from the population about how these
SNPs are likely to be phased.
Haplotypes
You may also see this called linkage disequilibrium, which means that
alleles at the SNPs occur together more (or less) often than expected
under random association.

We can think of the sets of statistically linked genotypes as the set
of known human haplotypes.
Haplotype reference data for humans is provided by the
International HapMap Project.
One use of known haplotypes is genotype imputation:
● measure the genotypes of some known SNPs
● use human haplotype information to deduce likely genotypes
of other, unmeasured SNPs in the individual
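A toy version of imputation, to make the idea concrete. This is a deliberately naive sketch: it works on a single phased haplotype, requires exact matches to a small reference panel, and takes a majority vote, whereas real imputation software uses probabilistic models. The SNP names and the panel are made up.

```python
from collections import Counter


def impute_allele(panel, observed, target):
    """Guess the allele at an unmeasured SNP from matching panel haplotypes.

    panel: list of dicts mapping SNP name -> allele (reference haplotypes).
    observed: dict of measured SNP name -> allele for one haplotype.
    target: name of the unmeasured SNP.
    Returns the most common allele among matching haplotypes, or None.
    """
    matching = [
        hap for hap in panel
        if all(hap[snp] == allele for snp, allele in observed.items())
    ]
    if not matching:
        return None
    return Counter(hap[target] for hap in matching).most_common(1)[0][0]


panel = [
    {"snp1": "A", "snp2": "G", "snp3": "T"},
    {"snp1": "A", "snp2": "G", "snp3": "T"},
    {"snp1": "A", "snp2": "G", "snp3": "C"},
    {"snp1": "C", "snp2": "G", "snp3": "C"},
]
guess = impute_allele(panel, {"snp1": "A", "snp2": "G"}, "snp3")
```

Three panel haplotypes match the measured alleles, and two of them carry T at `snp3`, so T is the imputed allele. The 2-to-1 split also shows why imputed genotypes come with uncertainty.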
Structural Variants
Detecting structural variants
Structural variants
● duplications
● insertions
● deletions
● translocations
● inversions
● complex events

Deletions and duplications give rise to copy number variation (CNV).

SVs generally mean large changes (not small indels). They can be
larger than our read length.
Detecting structural variants
In NGS we have two main types of evidence:
● For CNV:
○ read depth (or SNP array intensity)
○ B-allele frequencies
● For all SVs:
○ breakpoints

Some methods are based on one or the other.

Software in this field is less mature than for SNV/indel detection.
Detecting SVs: CNV
We can use read depth to look for breakpoints and/or to
characterise SV type.
This is only useful for deletions/duplications.

[Figure: read depth and B-allele frequency (% reads for alternate allele) across a deletion and a duplication]

Alkan et al., Genome structural variation and genotyping, Nature Reviews 12, 363-376 (2011)
Yau et al., Genome Biology 2010, 11:R92
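A minimal sketch of read-depth CNV calling, under simplifying assumptions: depth has already been summarised per fixed-size window, the median window is taken to represent the normal two copies, and no correction is made for GC content or mappability. The window depths are made up.

```python
from statistics import median


def call_cnv(window_depths, ploidy=2):
    """Label each window as deletion, normal or duplication from read depth."""
    baseline = median(window_depths)
    if baseline == 0:
        raise ValueError("median depth is zero; cannot estimate copy number")
    calls = []
    for depth in window_depths:
        # Depth scales roughly linearly with the number of copies present.
        copies = round(ploidy * depth / baseline)
        if copies < ploidy:
            calls.append("deletion")
        elif copies > ploidy:
            calls.append("duplication")
        else:
            calls.append("normal")
    return calls


depths = [30, 31, 29, 15, 14, 30, 45, 46, 30]
calls = call_cnv(depths)
```

Windows at about half the typical depth are consistent with one copy (a heterozygous deletion) and windows at about 1.5 times the typical depth with three copies. The positions where the label changes give only an approximate breakpoint, limited by the window size.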
Detecting SVs: breakpoints
A breakpoint is a discontinuity in the genome: a point at which,
relative to the reference genome, the DNA has been broken and
reassembled.

Two parts of the genome that are not adjacent in the reference
may be adjacent at the breakpoint.
Detecting SVs: breakpoints
Using anomalous alignments:
● paired-end mapping: read pairs do not map at expected
distance or orientation
○ can span larger distances using short reads
○ gives only approximate breakpoint location
○ dependent on uncertainty in insert size
● split-read mapping: read maps over breakpoint, so ends of
read align to different parts of reference genome
○ need long enough reads (>200bp)
○ gives single-base resolution of breakpoint
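The paired-end signal can be sketched as a classifier for a single read pair. Assumptions for this illustration: both reads map to the same chromosome, the library is standard paired-end with the upstream read on the forward strand and the downstream read on the reverse strand, the distance between the two mapped positions stands in for the insert size, and the insert size mean and standard deviation are made-up values.

```python
def classify_pair(pos1, strand1, pos2, strand2,
                  mean_insert=400, sd_insert=50, n_sd=3):
    """Classify one read pair by mapped distance and orientation.

    Strands are '+' or '-'. Insert size parameters are illustrative.
    """
    # Order the reads so that "left" is the upstream one on the reference.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    span = right_pos - left_pos

    if left_strand == right_strand:
        return "inversion candidate"           # both reads face the same way
    if left_strand == "-" and right_strand == "+":
        return "tandem duplication candidate"  # reads face away from each other
    if span > mean_insert + n_sd * sd_insert:
        return "deletion candidate"            # maps further apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion candidate"           # maps closer than expected
    return "concordant"
```

The `n_sd` cut-off shows why this evidence depends on uncertainty in insert size: with a broad insert size distribution, only large deletions push a pair outside the expected range. A single discordant pair is weak evidence, so callers typically require several pairs supporting the same event.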
Detecting SVs: breakpoints
If there is a variant, we expect reads to be split and/or paired reads
to map discordantly:

[Figure: split-read mapping and paired-end mapping, no variant vs. deletion]

A. Quinlan and I. Hall, Trends in Genetics 28, 43-53 (2012)


Detecting SVs: breakpoints
Remember that in the data, we typically only get to see the
alignment of the reads to the reference.

[Figure: what the evidence looks like in the alignment, no variant vs. deletion]

A. Quinlan and I. Hall, Trends in Genetics 28, 43-53 (2012)


Detecting SVs: breakpoints
Other SV signatures:

Duplication (tandem)
[Figure: read-pair alignments, Alt vs. Ref]
Detecting SVs: breakpoints
Other SV signatures:

Inversion
[Figure: read-pair alignments, Alt vs. Ref]
Evidence: 2 read pairs – both have wider insert size and both reads are facing the same way
Detecting SVs
Basic approach:
● Look for breakpoints and/or CNV
○ breakpoints: paired-end reads / split read mapping
○ CNV: read depth / B-allele frequencies
○ Tools such as GRIDSS, Manta and DELLY use both paired-end and
split-read evidence
● Deduce SV events
○ which breakpoints are linked?
○ what are the likely SV types based on distances and
orientations?
● Possibly, try de novo assembly using reads from the relevant
regions
○ can be effective
○ repeat regions cause problems
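The "which breakpoints are linked?" step can be sketched as clustering discordant read pairs that tell the same story. Everything here is an illustrative assumption rather than how GRIDSS, Manta or DELLY work: each pair is a `(left_pos, right_pos, signature)` tuple whose signature has already been assigned from distance and orientation, pairs are merged if both ends lie within `max_gap` of the previous pair in the cluster, and an event needs at least `min_support` pairs.

```python
def cluster_evidence(pairs, max_gap=500, min_support=2):
    """Group discordant pairs into candidate SV events.

    pairs: list of (left_pos, right_pos, signature) tuples.
    Returns a list of dicts with type, start, end and support.
    """
    clusters = []
    for signature in sorted({p[2] for p in pairs}):
        same_type = sorted(p for p in pairs if p[2] == signature)
        current = [same_type[0]]
        for pair in same_type[1:]:
            previous = current[-1]
            close = (pair[0] - previous[0] <= max_gap
                     and abs(pair[1] - previous[1]) <= max_gap)
            if close:
                current.append(pair)
            else:
                clusters.append(current)
                current = [pair]
        clusters.append(current)

    events = []
    for cluster in clusters:
        if len(cluster) >= min_support:
            events.append({
                "type": cluster[0][2],
                # The event lies between the innermost supporting reads.
                "start": max(p[0] for p in cluster),
                "end": min(p[1] for p in cluster),
                "support": len(cluster),
            })
    return events


pairs = [
    (1000, 5000, "deletion"),
    (1050, 5100, "deletion"),
    (1100, 5050, "deletion"),
    (20000, 20900, "inversion"),
]
events = cluster_evidence(pairs)
```

The three deletion-like pairs merge into one event spanning roughly 1100 to 5000, while the lone inversion-like pair is dropped for lack of support. The reported coordinates are only approximate; split reads or local assembly would be needed for base-level breakpoints.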
Detecting SVs: summary

Brief Funct Genomics, Volume 14, Issue 5, September 2015, Pages 305–314, https://doi.org/10.1093/bfgp/elv014
The content of this slide may be subject to copyright: please see the slide notes for details.
Summary overview of SV types and their mechanisms: https://doi.org/10.1038/nrg3373
