COMP90016 2023 07 Variant Calling I
Lecture 7
Variant Calling I
Dr Khalid Mahmood
Before watching this lecture, make sure you are familiar with… Today
● Genomic resources
Somatic variants
● Variants you acquire during your life
● Can differ between cells
ACGA → ATGA
SNPs are single-nucleotide differences from the reference genome
that are relatively common in a given population.
Biological effects of a SNV
[Figure: variant consequences along a transcript (non-coding and coding regions). Source: VEP, ensembl.org]
Missense mutation: a non-synonymous variant that changes one
amino acid to another. Its impact on protein function depends on
the nature of the amino acid change, its location, and how
conserved the affected region is.
AGG -> AAG changes Arginine (R) to Lysine (K)
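The codon change above can be checked against the standard genetic code. This is a minimal sketch: `classify_codon_change` is a hypothetical helper written for this example (it is not part of VEP), and it only looks at a single codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon substitution by its effect on the protein (hypothetical helper)."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


print(CODON_TABLE["AGG"], CODON_TABLE["AAG"])  # R K
print(classify_codon_change("AGG", "AAG"))     # missense
```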
ACTATGAGATTACA Reference
AC---GAGATTACA 3bp deletion
ACTATGAGATTATGTGCA 4bp insertion
An indel whose length is a multiple of 3 bp (3 bp, 6 bp, 9 bp…)
keeps the reading frame. A 3 bp indel affects only one or two codons:
AGG.GAG.CCT.GCA.GCC.CTT.GGC.C
AGG.GCT.GCA.GCC.CTT.GGC.C
Biological effect of an indel
A 1 bp (or 2 bp, etc.) indel, called a frameshift, can disrupt a large
stretch of the protein-coding sequence, changing the downstream amino acids:
ATG.GAG.CCT.GCA.GCC.CTT.GAC - M E P A A L D
ATG.GCC.TGC.AGC.CCT.TGA - M A C S P Stop
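The frameshift example above can be reproduced by translating both sequences. This is a sketch only: `translate` and `is_frameshift` are hypothetical helpers, and the mutated sequence is built by assuming a 2 bp deletion after the fourth base, which is one way to arrive at the second sequence shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq: str) -> str:
    """Translate codon by codon, stopping after the first stop codon.

    A trailing partial codon (1 or 2 leftover bases) is ignored.
    """
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


def is_frameshift(indel_length: int) -> bool:
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


reference = "ATGGAGCCTGCAGCCCTTGAC"
mutated = reference[:4] + reference[6:]  # assumed 2 bp deletion ("AG")

print(translate(reference))  # MEPAALD
print(translate(mutated))    # MACSP*
print(is_frameshift(2), is_frameshift(3))  # True False
```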
Just like with SNVs, indels in non-coding regions can also have
biological effects:
● Indels in UTRs may affect gene regulation
● Indels in splice sites can affect splicing
Example variants in humans
● 1 bp deletion (CYP3A5*7): sodium transport gene
○ linked to pre-disposition to hypertension
Image: http://en.wikipedia.org/wiki/Copy-number_variation
Types of variants: CNV (continued)
When a CNV is present, the amount
of DNA from the duplicated/deleted
regions changes, so our algorithms
can look for the corresponding change in read depth.
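One simple way to turn that observation into a number is to compare depth in a region with a baseline. The sketch below assumes read depth scales linearly with copy number and that the baseline region is diploid; `estimate_copy_number` is a hypothetical helper, and real CNV callers also correct for biases that are ignored here.

```python
def estimate_copy_number(region_depth: float, baseline_depth: float,
                         baseline_copies: int = 2) -> float:
    """Estimate copy number from read depth, assuming depth is proportional to copy number."""
    if baseline_depth <= 0:
        raise ValueError("baseline_depth must be positive")
    return baseline_copies * region_depth / baseline_depth


# Made-up depths for illustration.
print(estimate_copy_number(45.0, 30.0))  # 3.0 (consistent with a duplication)
print(estimate_copy_number(15.0, 30.0))  # 1.0 (consistent with a deletion)
```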
[Figure: UCSC genome browser view showing a gene and annotation tracks]
Methods and resources
[Figure labelled "Gene". Source: datacarpentry.org]
Variant detection from NGS data
Input:
● a reference genome
● set of reads aligned to reference genome
Output:
● a set of variant calls
● a quality measure for each variant call (confidence
information for each variant)
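To make the input and output concrete, here is a deliberately naive sketch that counts bases at each position and reports sites with enough non-reference support. It assumes ungapped reads given as hypothetical `(start, sequence)` pairs, uses arbitrary thresholds, and uses the alternate-allele fraction as a crude stand-in for a quality measure. Real variant callers are considerably more sophisticated.

```python
from collections import Counter


def naive_snv_calls(reference, aligned_reads, min_depth=4, min_alt_fraction=0.3):
    """Return (position, ref_base, alt_base, alt_fraction) for candidate SNVs.

    aligned_reads: list of (start, sequence) pairs, 0-based and ungapped
    (an assumption of this sketch).
    """
    # Build a per-position count of observed bases (a simple pileup).
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        alt_counts = {b: n for b, n in counts.items() if b != reference[pos]}
        if not alt_counts:
            continue
        alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
        fraction = alt_n / depth
        if fraction >= min_alt_fraction:
            calls.append((pos, reference[pos], alt_base, fraction))
    return calls


# Made-up reference and reads; four of six reads carry a T at position 1.
reference = "ACGATTGCA"
reads = [(0, "ACGA"), (0, "ATGA"), (1, "TGAT"), (1, "CGAT"), (0, "ATGAT"), (1, "TGATT")]
calls = naive_snv_calls(reference, reads)
print(calls)
```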
Variant detection from NGS data
A variant is defined by:
[Figure labels: reference; likely variant (SNV); insertion; variant score]
Variant detection from NGS data
Variant detection is non-trivial because of errors in the data
● read errors (experimental errors)
● alignment errors
Specificity
How many false variants do we correctly reject?
○ High specificity = fewer false positives
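As a formula, specificity is TN / (TN + FP), where TN counts non-variant sites correctly rejected and FP counts false variant calls. A minimal sketch with made-up counts:

```python
def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of truly non-variant sites that are correctly rejected: TN / (TN + FP)."""
    total = true_negatives + false_positives
    if total == 0:
        raise ValueError("no negative sites to evaluate")
    return true_negatives / total


print(specificity(true_negatives=990, false_positives=10))  # 0.99
```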
[Figure annotation: better algorithm/data]
VCF is:
● human-readable
○ text-based, tab-separated
● machine-readable
● self-describing
○ header lines describe fields
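Because VCF is tab-separated text with header lines, a few lines of code are enough to read the fixed columns. The sketch below is illustrative only: `parse_vcf` is a hypothetical helper, the example record is made up, and production code would normally use an established VCF library.

```python
def parse_vcf(lines):
    """Yield one dict per variant record from an iterable of VCF text lines."""
    columns = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or meta-information header line
        if line.startswith("#"):
            columns = line.lstrip("#").split("\t")  # column header line
            continue
        if columns is None:
            raise ValueError("data line found before the #CHROM header line")
        yield dict(zip(columns, line.split("\t")))


# Made-up example content.
example = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tC\tT\t50\tPASS\tDP=30",
]
records = list(parse_vcf(example))
for record in records:
    print(record["CHROM"], record["POS"], record["REF"], record["ALT"], record["QUAL"])
```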
Phred score (Q)   Probability of error   Accuracy
10                1 in 10                90%
20                1 in 100               99%
30                1 in 1000              99.9%
40                1 in 10,000            99.99%
50                1 in 100,000           99.999%
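The rows above follow the Phred scale, Q = -10 log10(P), where P is the probability that the call is wrong. A short sketch of the conversion in both directions (the function names are chosen for this example):

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P to a Phred-scaled quality Q = -10 log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(q, f"1 in {round(1 / p):,}", f"{(1 - p) * 100:.3f}%")
```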
AA AB BB
BB is a homozygous variant
AB is a heterozygous variant
AA is homozygous reference (i.e. no variant)
Genotyping
Recall that in sequencing:
● we make lots of copies of the DNA
● we don't manage to read all of them (we sample them randomly)
The random processes introduce noise into read depth.
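A small simulation shows how random sampling alone makes depth uneven. It is a toy model under stated assumptions: read start positions are drawn uniformly, reads have a fixed length, and all sizes are made up. `simulate_depth` is a hypothetical helper.

```python
import random


def simulate_depth(genome_length: int, n_reads: int, read_length: int,
                   rng: random.Random) -> list:
    """Place reads at uniformly random start positions and return per-base depth."""
    depth = [0] * genome_length
    for _ in range(n_reads):
        start = rng.randrange(genome_length - read_length + 1)
        for pos in range(start, start + read_length):
            depth[pos] += 1
    return depth


rng = random.Random(0)  # fixed seed so the run is repeatable
depth = simulate_depth(genome_length=200, n_reads=100, read_length=20, rng=rng)
mean_depth = sum(depth) / len(depth)
print("mean depth:", mean_depth)
print("min depth:", min(depth), "max depth:", max(depth))
```

The mean depth is fixed by the design (100 reads × 20 bp over 200 bp), but individual positions scatter around it.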
Allele frequencies
We expect to see read counts matching the possible genotypes -
approximately!
[Figure: percent alternate alleles expected for each genotype. Diploid: AA, AB, BB. Triploid: AAA, AAB, ABB, BBB.]
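The expected percentage for each genotype is simply the share of alternate (B) alleles it contains. A minimal sketch, with `expected_alt_fraction` as a hypothetical helper:

```python
def expected_alt_fraction(genotype: str, alt_allele: str = "B") -> float:
    """Expected fraction of reads carrying the alternate allele, ignoring noise and errors."""
    return genotype.count(alt_allele) / len(genotype)


for genotype in ("AA", "AB", "BB", "AAA", "AAB", "ABB", "BBB"):
    print(genotype, f"{expected_alt_fraction(genotype):.0%}")
```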
Genotyping
Given all this, the task of calling variants and genotyping requires
a more sophisticated approach.
P(Genotype | Data)
$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$
Bayes theorem
Bayes theorem considers prior probabilities.
e.g. if I see a crop circle in my field, what is the chance I have
been visited by aliens?
Let's say aliens always leave crop circles (100%) – P(crop circle|aliens)
Aliens are unlikely – P(aliens)
But crop circles happen occasionally – P(crop circle)
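Plugging numbers into the crop-circle example makes the effect of the prior visible. Only the 100% likelihood comes from the example; the other two probabilities below are invented purely for illustration.

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence


p_circle_given_aliens = 1.0  # from the example: aliens always leave crop circles
p_aliens = 1e-6              # hypothetical prior: aliens are unlikely
p_circle = 1e-3              # hypothetical: crop circles happen occasionally

p_aliens_given_circle = posterior(p_circle_given_aliens, p_aliens, p_circle)
print(f"P(aliens | crop circle) is about {p_aliens_given_circle:.4f}")
```

Even with a likelihood of 100%, the posterior stays small because the prior is tiny compared with how often crop circles occur.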
For a given data set D (reads), P(D) is just a constant: it is the
same for AA, AB and BB.
So instead of calculating P(D) directly, we can treat it as a
normalisation factor, chosen so that:
P(AA|D) + P(AB|D) + P(BB|D) = 1
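In code, this means computing P(D|G) P(G) for each genotype and dividing by their sum. The input values below are made up for illustration, and `normalise` is a hypothetical helper.

```python
def normalise(unnormalised: dict) -> dict:
    """Scale the values of P(D|G) * P(G) so that the posteriors sum to 1."""
    total = sum(unnormalised.values())  # plays the role of P(D)
    return {genotype: value / total for genotype, value in unnormalised.items()}


# Hypothetical values of P(D|G) * P(G) for each genotype.
scores = {"AA": 0.0001, "AB": 0.03, "BB": 0.0099}
posteriors = normalise(scores)
print(posteriors)
```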
Genotyping: P(D|G)
$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$
What are
P(Data | CC)
P(Data | CT)
P(Data | TT) ?
Genotyping: P(D|G)
For heterozygous genotype AB, we are choosing each read
randomly from A and B. This gives a binomial distribution with
p=0.5 (like coin-tossing).
[Figure: binomial distribution. x-axis: observed allele fraction (the different possible values of the data); y-axis: probability of observing that data]
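The coin-tossing model for a heterozygous site can be written out directly. This sketch ignores sequencing error, and `het_likelihood` is a name chosen for the example.

```python
from math import comb


def het_likelihood(alt_reads: int, depth: int) -> float:
    """P(alt_reads out of depth | AB) under Binomial(depth, 0.5), ignoring sequencing error."""
    return comb(depth, alt_reads) * 0.5 ** depth


depth = 10
for alt_reads in range(depth + 1):
    fraction = alt_reads / depth
    print(f"{fraction:.1f}", f"{het_likelihood(alt_reads, depth):.4f}")
```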
Genotyping: P(D|G)
For homozygous genotypes (AA or BB) we expect all data to be
A (for AA) or B (for BB) except where there is a sequencing
error.
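For homozygous genotypes the same binomial form applies, with the per-read probability of the unexpected allele set to the sequencing error rate. The sketch assumes a simple two-allele model in which every error turns A into B or B into A, and a made-up error rate of 1%.

```python
from math import comb


def hom_ref_likelihood(alt_reads: int, depth: int, error_rate: float = 0.01) -> float:
    """P(data | AA): a read shows B only through a sequencing error."""
    return (comb(depth, alt_reads)
            * error_rate ** alt_reads
            * (1 - error_rate) ** (depth - alt_reads))


def hom_alt_likelihood(alt_reads: int, depth: int, error_rate: float = 0.01) -> float:
    """P(data | BB): a read shows A only through a sequencing error."""
    return hom_ref_likelihood(depth - alt_reads, depth, error_rate)


print(hom_ref_likelihood(alt_reads=0, depth=10))  # all reads A: likely under AA
print(hom_ref_likelihood(alt_reads=5, depth=10))  # half the reads B: very unlikely under AA
print(hom_alt_likelihood(alt_reads=10, depth=10)) # all reads B: likely under BB
```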
Genotyping: P(D|G)
P(D|G) will give a more definite prediction of which allele fraction
to expect if we have
● lower sequencing error
● higher read depth
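One way to see this is through the spread of the observed allele fraction, which under a binomial model has standard deviation sqrt(p(1-p)/n) for depth n. A short sketch (`allele_fraction_sd` is a hypothetical helper):

```python
from math import sqrt


def allele_fraction_sd(depth: int, p: float = 0.5) -> float:
    """Standard deviation of the observed allele fraction under Binomial(depth, p)."""
    return sqrt(p * (1 - p) / depth)


# Heterozygous site (p = 0.5): higher depth gives a tighter distribution.
for depth in (10, 40, 160):
    print(depth, round(allele_fraction_sd(depth), 3))

# Homozygous site: p is the error rate, so a lower error rate also tightens it.
print(allele_fraction_sd(100, p=0.01) > allele_fraction_sd(100, p=0.001))  # True
```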
P(AA) = 0.25
P(AB) = 0.5
P(BB) = 0.25
$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$
● P(D|G): binomial distribution + error model for sequencing technology
● P(G): prior probability of each genotype based on knowledge of sample's biology
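Putting the pieces together gives a small genotype caller for a single biallelic site. This is a sketch under stated assumptions, not a production method: it uses the priors 0.25 / 0.5 / 0.25 from the lecture, a made-up 1% error rate, and a two-allele error model in which the per-read probability of seeing B is the error rate for AA, 0.5 for AB, and one minus the error rate for BB.

```python
from math import comb

PRIORS = {"AA": 0.25, "AB": 0.5, "BB": 0.25}


def likelihood(genotype: str, alt_reads: int, depth: int, error_rate: float = 0.01) -> float:
    """P(D|G): binomial probability of seeing alt_reads B-reads out of depth."""
    p_alt = {"AA": error_rate, "AB": 0.5, "BB": 1 - error_rate}[genotype]
    return comb(depth, alt_reads) * p_alt ** alt_reads * (1 - p_alt) ** (depth - alt_reads)


def genotype_posteriors(alt_reads: int, depth: int, priors: dict = PRIORS,
                        error_rate: float = 0.01) -> dict:
    """P(G|D) for each genotype, using the sum over genotypes as P(D)."""
    joint = {g: likelihood(g, alt_reads, depth, error_rate) * priors[g] for g in priors}
    total = sum(joint.values())  # P(D), the normalisation factor
    return {g: value / total for g, value in joint.items()}


post = genotype_posteriors(alt_reads=6, depth=20)
best = max(post, key=post.get)
print(best, post)
```

With 6 of 20 reads supporting B, the heterozygous genotype gets almost all of the posterior probability, even though the observed fraction is 30% rather than 50%.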