COMP90016 2023 07 Variant Calling I


Computational Genomics

Lecture 7
Variant Calling I
Dr Khalid Mahmood

Before watching this lecture, make sure you are familiar with:
1 Intro & Genomics I
2 Genomics II
3 Sequencing technologies
5 Sequence alignment

Today: 7 Variant calling I
Overview
● What is a genetic variant?

● Types of genetic variants


○ SNVs, indels, structural variants

● Genomic sequencing data

● Genomic resources

● Sensitivity and specificity

● Variant calling and genotyping


Variants
Genetic variant: a difference between genome sequences.

In most cases this means a difference in the DNA sequence or structure compared to a reference genome.

Humans share ~99.8% identical DNA.

Primate genomes are also very similar, e.g. human and chimp genomes are 99% identical.


Variants
There is a high degree of similarity, but the human genome is large: 3 billion nucleotides.

This results in approximately 4-5 million variants between any individual and the reference genome.

This seemingly small number of variants likely explains a significant proportion of phenotypic diversity among humans.
Variants
Given the size of the differences between genomes and the possible combinations of alleles, the total number of known variants in humans keeps growing as more sequencing data becomes available.

According to dbSNP, there are ~700 million distinct known genetic variants in the human genome.

A significant proportion of the variants are common, i.e. seen in more than 1% of the population.

Sherry, S.T., Ward, M. and Sirotkin, K. (1999) dbSNP: Database for Single Nucleotide Polymorphisms and Other Classes of Minor Genetic Variation. Genome Res., 9, 677–679.
Variants
Germline variants
● Variants you are born with; part of your genetic code
● Should be in the DNA of all cells of your body
● Can be very common (e.g. half the population)

Somatic variants
● Variants you acquire during your life
● Can differ between cells

Each human has ~3M germline variants when compared to “the” human reference genome.
What is variant calling?
CGTGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATA
GGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTAAGTAGTGCTTGTGCTCATC
TCCTTGGCTGTGATACGTGGCCGGCCCTCGCTCCAGCAGCTGGACCCCTACCTGCCGTCTGCTGCCA/TCGGAGCCCAAA
GCCGGGCTGTGACTGCTCAGACCAGCCGGCTGGAGGGAGGGCC/GCTCAGCAGGTCTGGCTTTGGCCCTGGGAGAGCAGG
TGGAAGATCAGGCAGGCCATCGCTGCCACAGAACCCAGTGGATTGGCCTAGGTGGGATCTCTGAGCTCAACAAGCCCTCT
CTGGGTGGTAGGTGCAGAGACGGGAGGGGCAGAGCCGCAGGCACAGCCAAGAGGGCTGAAGAAATGGTAGAACGGAGCAG
CTGGTGATGTGTGGGCCCACCGGCCCCAGGCTCCTGTCTCCCCCCAGGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAG
CATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTG/AAGTGTCCCCAGTGTTGCAGAGGT
GAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCTCTACGGGGCCGGCGTCTCCTGTCTCCTGGAGAGG
CTTCGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCATCTGGAGCCCTGCTGCTTGCGGTGGCCTATAAAGCC
TCCTAGTCTGGCTCCAAGGCCTGGCAGAGTCTTTCCCAGGGAAAGCTACA/TAGCAGCAAACAGTCTGCATGGGTCATCC
CCTTCACTCCCAGCTCAGAGCCCAGGCCAGGGGCCCCCAAGAAAGGCTCTGGTGGAGAACCTGTGCATGAAGGCTGTCAA
CCAGTCCATAGGCAAGCCTGGCTGCCTCCAGCTGGGTCGACAGACAGGGGCTGGAGAAGGGGAGAAGAGGAAAGTGAGGT
TGCCTGCCCTGTCTCCTACCTGAGGCTGAGGAAGGAGAAGGGGATGCACTGTTGGGGAGGCAGCTGTAACTCAAAGCCTT
AGCCTCTGTTCCCACGAAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACAC
CCGGCACCCTGTCCTGGACACGCTGTTGGCCTGGATCTGAGCCCTGGTGGAGGTCAAAGCCACCTTTGGTTCTGCCATTG
CTGCTGTGTGGAAGTTCACTCCTGCCTTTTCCTTTCCCTAGAGCCTCCACCACCCCGAGATCACATTTCTCACTGCCTTT
TGTCTGCCCAGTTTCACCAGAAGTAGGCCTCTTCCTGACAGGC/TAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTT
GCTCTGCCCGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACAAAGGTGAAACCCAGGAGAGTGT
GGAGTCCAGAGTGTTGCCAGGACCCAGGCACAGGCATTAGTGCCCGTTGGAGAAAACAGGGGAATCCCGAAGAAATGGTG
GGTCCTGGCCATCCGTGAGATCTTCCCAGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGG
TGGAGCCGTCCCCCCATGGAGCACAGGCA/GGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTC
CTTGAAGCTGGTCTCCACACAGTGCTGGTTCCGTCACCCCCTCCCAAGGAAGTAGGTCTGAGCAGCTTGTCCTGGCTGTG
TCCATGTCAGAGCAACGGCCCAAGTCTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTCCCAGTCGTCCTCG
TCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAAGGCTCCTCCGGGCCC
CTCACCAGCCCCAGGTCCTTTCCCAGAGATGCCTGGAGGGAAAAGGCTGAGTGAGGGTGGTTGGTGGGAAACCCTGGTTC
CCCCAGCCCCCGGA/CGACTTAAATACAGGAAGAAAAAGGCAGGACAGAATTACAAGGTGCT
Common types of variants
● SNVs: single-nucleotide variants.
● Indels: (small) insertions or deletions.
● SNVs and indels range in length from 1bp to 1kb.
● Other types of variants include:
○ CNV: copy-number variation.
○ SV: structural variation.
● These impact larger regions of DNA:
○ extra/missing DNA
○ rearranged regions etc.

Examples:
Single-nucleotide variants (SNVs): ctccgag vs ctctgag
Insertions/deletions (Indels): ctc--ag vs ctctgag
Structural variants (SVs): ctccgag vs ctctaggtaaagag
Why study human genetic variants?

DNA controls all cells in our body.

Genetic variants are responsible for genetic diversity


● i.e. biological differences, or phenotype

Some variants are associated with disease


● germline variants may be associated with disease or
increased risk
● somatic variants can explain tumour aetiology in cancer
Types of variants: SNVs
SNV: single nucleotide variant

This is a one-letter substitution:

ACGA → ATGA

If the variant is found at appreciable frequency in a population, it is called a SNP: single nucleotide polymorphism.

SNPs are differences from the reference genome that are relatively common in a given population.
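
As a minimal sketch of what "a one-letter substitution" means computationally, the snippet below compares two sequences that are assumed to be already aligned and of equal length, and reports each mismatch. The function name `find_snvs` is hypothetical; real variant callers work from many noisy reads rather than a single clean sequence.

```python
def find_snvs(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, alternate base) per mismatch.

    Assumes the two sequences are already aligned and contain no indels.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(
            zip(reference.upper(), sample.upper()), start=1
        )
        if ref != alt
    ]


# The substitution from the slide: ACGA -> ATGA
print(find_snvs("ACGA", "ATGA"))  # [(2, 'C', 'T')]
```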
(A) Biological effects of a SNV
[Figure: a transcript with non-coding and coding regions and an SNV at an overlapping position. Source: VEP, ensembl.org]
Biological effects of a SNV
Missense mutation: a non-synonymous variant that changes one amino acid to another. This can affect protein function depending on the nature of the amino acid change, its location, how conserved the affected region is, etc.
AGG -> AAG changes Arginine (R) to Lysine (K)

Silent mutation: a synonymous mutation that does not alter the amino acid, but in some cases can still have a phenotypic effect, e.g. by disrupting transcription or splicing.
GCG -> GCA both code for the amino acid Alanine (A)


Biological effects of a SNV
Codon table
Biological effects of a SNV
Nonsense mutation: a non-synonymous variant that changes an amino acid codon to a STOP codon, resulting in premature termination of translation.
TCA -> TAA changes Serine (S) to stop_gained
This truncates the protein prematurely and is likely to affect protein function; it can also prevent translation via nonsense-mediated decay.
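
The three coding consequences described so far (missense, silent, nonsense) can be illustrated with a small sketch that looks up the reference and alternate codons in the standard genetic code. The function name `classify_coding_snv` and the labels it returns are illustrative assumptions, not the output of any particular annotation tool such as VEP.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snv(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change; '*' in the table denotes a STOP codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # silent mutation
    if alt_aa == "*":
        return "nonsense"     # stop gained
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


for ref, alt in [("AGG", "AAG"), ("GCG", "GCA"), ("TCA", "TAA")]:
    print(ref, "->", alt, classify_coding_snv(ref, alt))
```

The three example pairs are the codon changes used on the slides.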

Non-coding variants: depending on where the variant falls relative to the transcript, non-coding variants can have biological impact, e.g.
● SNVs in UTRs may affect gene regulation
● SNVs in splice sites (edges of introns) can affect splicing
● SNVs upstream of a gene can affect promoter binding
(B) Types of variants: Indels
Indel: insertion or deletion:

ACTATGAGATTACA         Reference
AC---GAGATTACA         3bp deletion
ACTATGAGATTATGTGCA     4bp insertion

Indels usually refer to short modifications.

Large insertions/deletions are treated separately.


Biological effect of an indel
Indels in coding regions can have multiple consequences.
Recall that there are three reading frames in RNA:
AGG.CAG.CCT.GCA.GCC.CTT.GGC.C
A.GGC.AGC.CTG.CAG.CCC.TTG.GCC
AG.GCA.GCC.TGC.AGC.CCT.TGG.CC

An indel with length divisible by 3 (3bp, 6bp, 9bp…) keeps the reading frame. A 3bp indel affects one or two codons, e.g. this 3bp deletion:
AGG.GAG.CCT.GCA.GCC.CTT.GGC.C
AGG.GCT.GCA.GCC.CTT.GGC.C
Biological effect of an indel
An indel whose length is not divisible by 3 (1bp, 2bp, etc.), called a frameshift, can disrupt a large stretch of the protein-coding sequence, changing the downstream amino acids:

ATG.GAG.CCT.GCA.GCC.CTT.GAC - M E P A A L D
ATG.GCC.TGC.AGC.CCT.TGA - M A C S P Stop

This often also leads to a premature STOP codon.
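
The frameshift example above can be reproduced with a short sketch: translate the reference sequence, delete 2bp, and translate again. The helper `translate` is a hypothetical name, and the deletion coordinates are inferred from the two sequences shown on the slide.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate from the first base, stopping after the first STOP ('*')."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


reference = "ATGGAGCCTGCAGCCCTTGAC"
# Delete the 2 bases "AG" at 0-based positions 4-5 (a frameshift).
frameshifted = reference[:4] + reference[6:]

print(translate(reference))     # MEPAALD
print(translate(frameshifted))  # MACSP*
```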

Just like with SNVs, indels in non-coding regions can also have
biological effects:
● Indels in UTRs may affect gene regulation
● Indels in splice sites can affect splicing
Example variants in humans
● 1 bp deletion (CYP3A5*7): sodium transport gene
○ linked to predisposition to hypertension

● 1 bp deletion (HERC2): E3 ubiquitin-protein ligase
○ affects expression of neighboring OCA2 gene
○ linked with eye colour (blue)

● Missense variant (Gly551Asp) in CFTR:
○ associated with CFTR-linked disorders such as cystic fibrosis

● BRCA1/BRCA2 genes: help repair damaged DNA
○ ~50% of women who inherit a damaging variant in these genes will develop cancer by age 80
(C) Types of variants: structural variation
SV: structural variation

A broad category, including


● CNVs - large insertions and deletions
● translocations
● inversions
Image: http://compbio.cs.brown.edu/projects/structvar/
Types of variants: CNV
CNV: copy number variation

Large-scale duplications or deletions of portions of the genome.

Somatic CNV may be called CNA: copy number alteration.

Image: http://en.wikipedia.org/wiki/Copy-number_variation
Types of variants: CNV (continued)
When a CNV is present, the amount of DNA from the duplicated/deleted regions changes, so our algorithms can look for a corresponding change in read depth.

These methods won’t work for translocations or inversions. For these structural rearrangements we usually look for the breakpoints.
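
A rough sketch of the read-depth idea: compare the depth in each genomic window to the typical depth and flag windows that deviate strongly. The window depths, the thresholds and the function names are all illustrative assumptions; production CNV callers additionally correct for GC content, mappability and sample purity.

```python
import math
from statistics import median


def depth_log2_ratios(window_depths: list[float]) -> list[float]:
    """log2(depth / median depth) per window; about 0 means typical copy number."""
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    return [
        math.log2(d / baseline) if d > 0 else float("-inf")
        for d in window_depths
    ]


def call_cnv_windows(window_depths, gain=0.4, loss=-0.6):
    """Label each window 'gain', 'loss' or 'normal' (thresholds are assumed).

    For a diploid genome, one extra copy gives log2(3/2) ~ 0.58 and one
    lost copy gives log2(1/2) = -1, so the cut-offs sit between those
    values and zero.
    """
    calls = []
    for ratio in depth_log2_ratios(window_depths):
        if ratio >= gain:
            calls.append("gain")
        elif ratio <= loss:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls


# Hypothetical mean read depth in ten consecutive windows.
depths = [30, 31, 29, 45, 46, 30, 15, 14, 30, 29]
print(call_cnv_windows(depths))
```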

[Figure: aligned sequence reads]


Biological effect of CNV
Gene duplications or deletions
● impact on expression of particular RNA and proteins which
○ can affect biological function of proteins
○ can affect regulation or interaction with other genes
● deletion of regions containing genes
○ often just one copy of a gene is lost
○ if the two copies had different variants (heterozygous),
this is loss of heterozygosity (LOH)

CNV is implicated in some diseases, e.g. autism, schizophrenia, bipolar disorder

Chiang et al. (2017) Nat Genet.


Methods and resources
Databases and resources:

○ dbSNP – a public archive of human SNVs and indels

○ gnomAD – an aggregation of >150,000 exomes/genomes used to estimate population-level frequencies of genetic variants
Methods and resources
[Figure: UCSC genome browser views showing the reference, gene and annotation tracks]
Methods and resources
Variant detection using NGS

datacarpentry.org
Variant detection from NGS data
Input:
● a reference genome
● set of reads aligned to reference genome

Output:
● a set of variant calls
● measure of quality of the variant call (as confidence
information for each variant)
Variant detection from NGS data
A variant is defined by:

(1) CHROM: chromosome

(2) POS: base position of the reference sequence

(3) REF: reference allele

(4) ALT: variant allele
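
These four fields are enough to represent a variant in code. The sketch below is one possible representation, assuming a simple length comparison of REF and ALT to label the variant type; the class name, the `kind` property and the example coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str  # CHROM: chromosome
    pos: int    # POS: 1-based position on the reference sequence
    ref: str    # REF: reference allele
    alt: str    # ALT: variant allele

    @property
    def kind(self) -> str:
        """Rough variant type based only on allele lengths."""
        if len(self.ref) == 1 and len(self.alt) == 1:
            return "SNV"
        if len(self.ref) < len(self.alt):
            return "insertion"
        if len(self.ref) > len(self.alt):
            return "deletion"
        return "multi-base substitution"


# Hypothetical variants for illustration.
for v in (Variant("20", 14370, "G", "A"),
          Variant("20", 2000, "ACTA", "A"),
          Variant("20", 3000, "A", "ATGTG")):
    print(v.chrom, v.pos, v.ref, v.alt, v.kind)
```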


Variant detection from NGS data
[Figure: aligned sequence reads shown against the reference, with labels "Possible sequencing error" and "Likely variant (SNV)"]
Variant detection from NGS data
[Figure: aligned reads showing a deletion, an SNV and an insertion, with a variant score]
Variant detection from NGS data
Variant detection is non-trivial because of errors in the data
● read errors (experimental errors)
● alignment errors

Errors are rare, but so are variants, e.g. humans have about 1 germline variant per 1000bp.

Somatic variants (important in cancer research) are usually even rarer: 1 somatic variant per 1,000,000bp.

An algorithm must trade off sensitivity against specificity.


Sensitivity and specificity
Sensitivity
How many real variants do we succeed in finding?
○ High sensitivity = fewer false negatives

Specificity
How many non-variant sites do we correctly reject?
○ High specificity = fewer false positives

False discovery rate (FDR)
What fraction of the variants we call are not real?
Sensitivity and specificity
              Variant      No variant
Called        TP           FP             N_called
Not called    FN           TN             N_not called
              N_variant    N_no variant   N_total sites

TP = true positives     FP = false positives
TN = true negatives     FN = false negatives

Sensitivity = TP / (TP+FN)
Specificity = TN / (TN+FP)
True positive rate (TPR) = TP / (TP+FN) = sensitivity
False positive rate (FPR) = FP / (FP+TN) = 1 - specificity
False discovery rate (FDR) = FP / (FP+TP)
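
The formulas above translate directly into code. The counts in this sketch are hypothetical, chosen to show that because true variants are rare, a caller can have very high specificity and still a noticeable false discovery rate.

```python
def calling_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute the slide's metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # = true positive rate
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),           # = 1 - specificity
        "fdr": fp / (fp + tp),
    }


# Hypothetical counts: 1,000,000 sites, of which 1,000 carry a real variant.
metrics = calling_metrics(tp=900, fp=100, fn=100, tn=998_900)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
```

With these made-up counts, specificity is above 99.98% while 10% of the calls are still false positives.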
Sensitivity vs specificity
[Figure: sensitivity vs specificity plot, annotated "Better algorithm/data". Source: SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics. 2010 26(6): 730–736.]
VCF - Variant Call Format
Standard file format for variants.
Developed for the 1000 Genomes Project.

VCF is:
● human-readable
○ text-based, tab-separated
● machine-readable
● self-describing
○ header lines describe fields

Binary equivalent is BCF.


VCF files
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 14370 . G A 29 PASS NS=3;DP=14;DB GT:GQ:DP 0|0:48:1
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP 0|0:49:3
20 1110696 . A G,T 67 PASS NS=2;DP=10;AA=T GT:GQ:DP 1|2:21:6
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP 0|0:54:7
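
To make the column layout concrete, here is a minimal sketch of parsing one VCF data line by hand. It assumes tab-separated fields, as in real VCF files (the columns appear space-separated above only because of how the slide text is laid out), and the function name `parse_vcf_record` is hypothetical. For real work, an established VCF library is a safer choice than a hand-written parser.

```python
def parse_vcf_record(line: str) -> dict:
    """Parse one tab-separated VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": [] if alt == "." else alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": flt,
        "INFO": {},
        "SAMPLES": [],
    }
    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags (e.g. DB).
    for item in info.split(";"):
        if item == ".":
            continue
        key, sep, value = item.partition("=")
        record["INFO"][key] = value if sep else True
    # FORMAT names the per-sample sub-fields, e.g. GT:GQ:DP.
    if len(fields) > 9:
        keys = fields[8].split(":")
        for sample in fields[9:]:
            record["SAMPLES"].append(dict(zip(keys, sample.split(":"))))
    return record


line = "20\t1110696\t.\tA\tG,T\t67\tPASS\tNS=2;DP=10;AA=T\tGT:GQ:DP\t1|2:21:6"
rec = parse_vcf_record(line)
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["SAMPLES"][0]["GT"])
```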
Variant quality scores: Phred-like
Quality   Chance it's wrong   Accuracy

10        1 in 10             90%
20        1 in 100            99%
30        1 in 1000           99.9%
40        1 in 10,000         99.99%
50        1 in 100,000        99.999%

QUAL is a Phred-scaled quality score (reported as an integer) for the assertion made in ALT:

Q = -10 log10(P)

Q = Phred quality score, P = probability that the call in ALT is wrong
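
The Phred relationship is easy to check numerically. This sketch converts in both directions; the function names are illustrative.

```python
import math


def phred_to_prob(q: float) -> float:
    """Probability that the call is wrong, given Phred score Q."""
    return 10 ** (-q / 10)


def prob_to_phred(p: float) -> float:
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_prob(q)
    print(f"Q={q}: P(wrong)={p:g}, accuracy={100 * (1 - p):g}%")

# A QUAL of 29 corresponds to an error probability of about 0.0013.
print(round(phred_to_prob(29), 4))
```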


Variant calling
Recall the evidence required to identify variant sites from sequencing data:

● Depth of coverage – number of reads at the position
● Bases in the region that mismatch the reference
● Sequencing quality of the bases
● Mapping quality of the reads

But in many cases we use a probabilistic variant calling approach known as genotyping. This means we try to calculate the most likely genotype at a given site.
Genotyping
Reminder: humans are diploid
(two copies of each chromosome)
So you could have 0, 1 or 2 copies of a variant.
These are the three possible genotypes:

AA AB BB

Haploid species: genotypes are A or B (wild-type or variant)


Triploid: possible genotypes are AAA, AAB, ABB, BBB.
And so on.
Genotyping
Diploid genotypes AA, AB, or BB

A stands for the reference allele (C,G,A,T)


B stands for the alternate allele

BB is a homozygous variant
AB is a heterozygous variant
AA is homozygous reference (i.e. no variant)
Genotyping
Recall that in sequencing:
● we make lots of copies of the DNA
● don't manage to read all of them (sample randomly)
The random processes introduce noise into read depth.
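Allele frequencies
We expect to see read counts matching the possible genotypes - approximately!

[Figure: expected percent alternate alleles by genotype. Diploid: AA (0%), AB (50%), BB (100%). Triploid: AAA (0%), AAB (~33%), ABB (~67%), BBB (100%).]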
Genotyping
Given all this, the task of calling variants and genotyping requires a more sophisticated approach.

The aim is to improve accuracy and report the most likely variant and genotype combination.

One of the initial approaches used Bayes' theorem.

A general approach to single-nucleotide polymorphism discovery. Nat Genet (1999)


Genotyping
Genotyping or genotype calling = choosing a genotype given the data

Usual strategy: find the probability of each genotype given the observed data:

P(Genotype | Data)

Then take the most likely genotype.


Genotyping – Data (D)
Example data D:
At this site or column:
Read depth N=18
Reference allele A = C (11 reads)
Alternate allele B = T (7 reads)

We might guess that P(CT | Data) is going to be higher than P(CC | Data) or P(TT | Data).
Bayesian Genotyping
We want:
P(Genotype | Data)
But it is easier to calculate:
P(Data | Genotype)

Then use Bayes' theorem:

P(G|D) = P(D|G) P(G) / P(D)
Bayes theorem
Bayes' theorem considers prior probabilities.
e.g. if I see a crop circle in my field, what is the chance I have been visited by aliens?
Let's say aliens always leave crop circles (100%) – P(crop circle|aliens)
Aliens are unlikely – P(aliens)
But crop circles happen occasionally – P(crop circle)

P(aliens|crop circle) = P(crop circle|aliens) P(aliens) / P(crop circle)

where P(crop circle|aliens) is, let's say, 100%; P(aliens) is extremely small; P(crop circle) is small-ish
=> very small probability (other possible explanations exist)
Genotyping
P(G|D) = P(D|G) P(G) / P(D)
The terms are:
P(D|G) - probability of seeing the set of reads (observed bases), for the given genotype

P(G) - prior probability of the genotype, our prior expectation of seeing this genotype

P(D) - probability of seeing this particular set of reads


Genotyping
P(G|D) = P(D|G) P(G) / P(D)
P(D) = probability of seeing this data

For a given data set D (reads), this is just a constant - the same for AA, AB and BB.
So instead of calculating P(D) directly, we can treat it as a normalisation factor, chosen so that:
P(AA|D) + P(AB|D) + P(BB|D) = 1
i.e. P(D) = P(D|AA) P(AA) + P(D|AB) P(AB) + P(D|BB) P(BB)
Genotyping: P(D|G)
P(G|D) = P(D|G) P(G) / P(D)

Start by calculating

P(D|G) = probability of seeing the reads given the genotype

i.e. for our data D, what are P(D|AA), P(D|AB), P(D|BB)?


Genotyping: P(D|G)
Example data D:
At this site, read depth N=18
Reference allele A = C
Alternate allele B = T

There are 11 C’s & 7 T’s

What are
P(Data | CC)
P(Data | CT)
P(Data | TT) ?
Genotyping: P(D|G)
For heterozygous genotype AB, we are choosing each read
randomly from A and B. This gives a binomial distribution with
p=0.5 (like coin-tossing).
[Figure: probability of observing the data for each possible observed allele fraction]
Genotyping: P(D|G)
For homozygous genotypes (AA or BB) we expect all data to be
A (for AA) or B (for BB) except where there is a sequencing
error.
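
The likelihoods for the running example (11 C reads, 7 T reads) can be sketched with a plain binomial model. This is a simplified illustration, not the model of any specific genotyper: it assumes a single per-read error rate (set here to a hypothetical 1%) under which a read from a homozygous site shows the other allele, and it ignores base and mapping qualities.

```python
from math import comb


def genotype_likelihoods(n_ref: int, n_alt: int,
                         error_rate: float = 0.01) -> dict[str, float]:
    """P(D|G) for G in AA, AB, BB under a binomial model.

    Each read shows the alternate allele with probability:
      AA: error_rate (only via a sequencing error)
      AB: 0.5        (coin toss between the two alleles)
      BB: 1 - error_rate
    """
    n = n_ref + n_alt
    p_alt = {"AA": error_rate, "AB": 0.5, "BB": 1.0 - error_rate}
    return {
        genotype: comb(n, n_alt) * p ** n_alt * (1 - p) ** n_ref
        for genotype, p in p_alt.items()
    }


# Example data D: depth 18, reference C seen 11 times, alternate T seen 7 times.
likelihoods = genotype_likelihoods(n_ref=11, n_alt=7)
for genotype, value in likelihoods.items():
    print(f"P(D|{genotype}) = {value:.3g}")
```

Under these assumptions P(D|AB) is about 0.12, many orders of magnitude larger than either homozygous likelihood.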
Genotyping: P(D|G)
P(D|G) will give a more definite prediction of which allele fraction
to expect if we have
● lower sequencing error
● higher read depth

In practice genotyping software uses a model for the expected sequencing errors, based on the sequencing technology, e.g.
● Illumina: per-base sequencing errors like C->G, getting more likely at the ends of reads
Genotyping: prior P(G)
P(G|D) = P(D|G) P(G) / P(D)
P(G) = prior probability of genotype G

If the genome were random, this would be

P(AA) = 0.25
P(AB) = 0.5
P(BB) = 0.25

This is assuming that variants are as likely as non-variants.


Genotyping: prior P(G)
The genome is not random! It's constrained by evolution.
Most of our genomes are very similar, i.e. we expect to see the reference genome sequence more often.

Each human has about 1 SNP per kilobase, i.e. a 1/1000 chance of a variant at each site.
If each chromosome’s allele is chosen randomly from the population then instead of 0.5 and 0.25:

P(AB) ~ 1/1000 P(BB) ~ 1/1,000,000 (so P(AA) ~ 0.999)
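
Combining these priors with a binomial likelihood gives the posterior P(G|D). The sketch below assumes the slide's priors, a hypothetical 1% per-read error rate, and a simple binomial read model; dividing by the sum over genotypes plays the role of P(D).

```python
from math import comb


def genotype_posteriors(n_ref: int, n_alt: int, error_rate: float = 0.01,
                        het_prior: float = 1e-3,
                        hom_alt_prior: float = 1e-6) -> dict[str, float]:
    """P(G|D) for AA, AB, BB via Bayes' theorem with a binomial likelihood."""
    priors = {
        "AA": 1.0 - het_prior - hom_alt_prior,
        "AB": het_prior,
        "BB": hom_alt_prior,
    }
    # Probability that a single read shows the alternate allele.
    p_alt = {"AA": error_rate, "AB": 0.5, "BB": 1.0 - error_rate}
    n = n_ref + n_alt
    unnormalised = {
        g: comb(n, n_alt) * p_alt[g] ** n_alt * (1 - p_alt[g]) ** n_ref * priors[g]
        for g in priors
    }
    p_data = sum(unnormalised.values())  # P(D), the normalisation factor
    return {g: value / p_data for g, value in unnormalised.items()}


deep = genotype_posteriors(n_ref=11, n_alt=7)      # the slide's example, N=18
shallow = genotype_posteriors(n_ref=1, n_alt=1)    # same idea at depth 2

print("depth 18:", max(deep, key=deep.get), f"{deep['AB']:.6f}")
print("depth 2: ", max(shallow, key=shallow.get), f"{shallow['AB']:.4f}")
```

At depth 18 the heterozygous genotype wins despite its small prior. At depth 2, one alternate read out of two is still better explained by a sequencing error at a homozygous reference site, which is why read depth matters.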


Genotyping: prior P(G)
In other words, variants are rare enough that we need a lot of
read evidence to be confident what we are seeing is real.

If the sequencing error rate is high compared to the rate of true variants, or if we don’t have high enough read depth, it is difficult to tell the difference between false positives and true variants.
Genotyping

[Figure: effect of read depth. Increasing read depth -> higher confidence. Source: SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics. 2010 26(6): 730–736.]
Genotyping: summary

P(G|D) = P(D|G) P(G) / P(D)

P(D|G): binomial distribution + error model for sequencing technology
P(G): prior probability of each genotype based on knowledge of sample’s biology
P(D): just a constant, for any given D. Probability of getting the data we got, out of all possible data, randomly.
