Lecture 9 Scoring Matrices
Which PAM matrix to use?
Human beta globin (NP_000509.1) and chimp beta globin (XP_508242.1):
100% amino acid identity.
> 1: Alignment of two residues occurs more often than expected by chance (e.g., a
conservative substitution of serine for threonine)
The relatedness matrix is symmetric
Salient Differences between PAM and Relatedness
PAM Matrix (Asymmetric)
● The PAM matrix P is a Markov transition matrix that describes the
probability of one amino acid changing into another over a given evolutionary
timescale.
● The elements of P, say P(A→B), represent the probability of amino acid A
mutating into B.
● Due to unequal amino acid frequencies and directional mutation rates,
P(A→B) ≠ P(B→A), making P asymmetric.
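To make the Markov picture concrete, here is a minimal base R sketch. It assumes a toy three-letter alphabet with invented one-step probabilities (not real PAM1 values), a rows = "from", columns = "to" convention, and a hypothetical helper `mat_power`. It shows that each row sums to 1, that the matrix need not be symmetric, and that a longer timescale corresponds to a higher power of the one-step matrix.

```r
# Toy alphabet; these probabilities are invented for illustration only.
aa <- c("A", "B", "C")
P1 <- matrix(c(0.98, 0.015, 0.005,
               0.03, 0.96,  0.01,
               0.01, 0.02,  0.97),
             nrow = 3, byrow = TRUE,
             dimnames = list(from = aa, to = aa))

# Each row is a probability distribution over outcomes, so it sums to 1.
stopifnot(all(abs(rowSums(P1) - 1) < 1e-12))

# Hypothetical helper: n-step matrix by repeated multiplication.
mat_power <- function(P, n) {
  out <- diag(nrow(P))
  for (i in seq_len(n)) out <- out %*% P
  dimnames(out) <- dimnames(P)
  out
}

# Fifty steps of the toy process (analogous to going from PAM1 to PAM50).
P50 <- mat_power(P1, 50)

# Asymmetry: the two directions of change have different probabilities.
print(c(A_to_B = P1["A", "B"], B_to_A = P1["B", "A"]))
print(round(P50, 3))
```

The n-step matrix is still a valid transition matrix (rows sum to 1), but the diagonal shrinks as n grows, which is why higher PAM numbers correspond to larger evolutionary distances.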
Construction of the Relatedness Matrix (Symmetric)
● The relatedness matrix is derived from the PAM matrix using the
Using scores to align sequences
Intuition:
How will we use today’s learning?
pam100 <- read.table(system.file("matrices/pam/pam100", package = "seqinr"),
as.is = TRUE)
print(pam100)
library(Biostrings)
print(alignment)
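The intuition behind using scores can be sketched in base R: the score of an ungapped alignment is the sum of the substitution scores of its aligned residue pairs. The 4 x 4 table below is a hand-typed excerpt of BLOSUM62 (a matrix covered later in this lecture) for A, S, T and W; the function name `score_ungapped` and the example sequences are made up for illustration.

```r
# Hand-typed excerpt of BLOSUM62 for four residues (illustration only).
aa <- c("A", "S", "T", "W")
S <- matrix(c( 4,  1,  0, -3,
               1,  4,  1, -3,
               0,  1,  5, -2,
              -3, -3, -2, 11),
            nrow = 4, byrow = TRUE, dimnames = list(aa, aa))

# Hypothetical helper: sum the pair scores over the columns of an
# ungapped alignment of two equal-length sequences.
score_ungapped <- function(x, y, S) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  stopifnot(length(xs) == length(ys))
  sum(S[cbind(xs, ys)])  # look up each (row, column) residue pair by name
}

good <- score_ungapped("ASTW", "ATTW", S)  # 4 + 1 + 5 + 11
poor <- score_ungapped("ASTW", "WTSA", S)  # -3 + 1 + 1 - 3
print(c(good = good, poor = poor))
```

An alignment algorithm searches over possible alignments (including gaps, which this sketch ignores) for the one that maximizes this sum.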
BLOSUM: Henikoff and Henikoff (1992, 1996)
BLOCKS database: over 500 groups of local multiple alignments (blocks) of
distantly related protein sequences.
Henikoffs’ score
General form of
substitution matrices
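The formula itself does not survive in these notes. As background (the standard log-odds form, not copied from the slide), a substitution score compares how often residues i and j are observed aligned with how often that would happen by chance:

```latex
s_{ij} \;=\; \frac{1}{\lambda}\,\log\frac{q_{ij}}{p_i\,p_j}
```

Here q_ij is the observed frequency of the aligned pair (i, j), p_i and p_j are the background frequencies of the two residues, and λ is a positive scaling constant. An odds ratio above 1 gives a positive score (the pair occurs more often than expected by chance), and a ratio below 1 gives a negative score. BLOSUM scores use base-2 logarithms scaled to half-bit units and rounded to integers.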
BLOSUM62
Default scoring matrix for the BLAST protein search programs at NCBI
Merges all proteins in an alignment that share 62% or greater amino acid
identity into one sequence.
E.g., if a block of aligned globin orthologs has members with 62%, 80%, and
95% amino acid identity, all are weighted (grouped) as one sequence.
Useful for scoring proteins that share less than 62% identity, because the
matrix is weighted more heavily by sequences that share less than 62% identity.
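The grouping step can be sketched in base R. This is a minimal illustration, assuming a toy gap-free block of invented sequences and single-linkage grouping at the 62% threshold. The actual BLOCKS pipeline is not reproduced here. Each cluster counts as one sequence, so every member gets weight 1 / (cluster size).

```r
# Toy gap-free block; the sequences are invented for illustration.
block <- c(s1 = "AVLKDEGH", s2 = "AVLKDEGY", s3 = "AVLRDEGY", s4 = "GILRNQAY")

# Fraction of identical positions between two equal-length sequences.
identity_frac <- function(a, b) {
  mean(strsplit(a, "")[[1]] == strsplit(b, "")[[1]])
}

n <- length(block)
ident <- matrix(0, n, n, dimnames = list(names(block), names(block)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    ident[i, j] <- identity_frac(block[[i]], block[[j]])
  }
}

# Single-linkage clustering: sequences end up in the same cluster if they
# are connected by a chain of pairs with >= 62% identity (distance <= 0.38).
hc <- hclust(as.dist(1 - ident), method = "single")
cl <- cutree(hc, h = 1 - 0.62)

# Each cluster contributes as one sequence: weight = 1 / cluster size.
w <- 1 / tabulate(cl)[cl]
names(w) <- names(block)
print(cl)
print(round(w, 3))
```

In this toy block s1, s2 and s3 collapse into one cluster while s4 stays alone, so the divergent sequence s4 carries as much weight as the three similar ones combined.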
BLOSUM 62
Summary of Henikoffs’ Paper
● BLOSUM performed dramatically better than PAM matrices
● Especially useful for identifying weakly scoring alignments
● BLOSUM62 performed slightly better than BLOSUM60 or
BLOSUM70
● BLOSUM50 and BLOSUM90 are other commonly used scoring
matrices in BLAST searches.
● The FASTA family of sequence comparison programs uses
BLOSUM50 as a default
Salient differences between PAM and BLOSUM
1. PAM: explicit evolutionary model (i.e., replacements are counted on the
branches of a phylogenetic tree). BLOSUM: no phylogenetic tree.
2. PAM: global alignment matrix, includes both highly conserved and highly
mutable regions. BLOSUM: only highly conserved regions, taken from a series of
alignments that are not allowed to contain gaps.
3. BLOSUM: relatedness is contextual to the specific group of sequences.
4. PAM: higher numbers in the name denote larger evolutionary distance.
BLOSUM: a higher number implies higher sequence similarity and
therefore smaller evolutionary distance.
Notice the sequence
Assignment 2
● Background: The pyruvate decarboxylase (PDC) gene is key for increasing CO2
production by yeast and may have a role in improving bread leavening and winemaking
● Access the NCBI database (https://www.ncbi.nlm.nih.gov/) and download the DNA
sequence of the Saccharomyces cerevisiae PDC gene.
● Use BLAST (Basic Local Alignment Search Tool) to compare the PDC gene
sequence across different Saccharomyces cerevisiae strains and identify SNPs
● Use a translation tool (e.g., ExPASy Translate tool) to translate the normal and
SNP-containing PDC gene sequences into amino acid sequences
● Select a SNP within the PDC gene. Utilize tools like SIFT (Sorting Intolerant From
Tolerant) or PolyPhen (Polymorphism Phenotyping) to predict the impact of the
SNP on the PDC protein function
Evaluation Criteria
● Understanding of the biological concepts and bioinformatics tools.
● Accuracy in data retrieval and analysis.
● Depth of analysis in the impact of SNPs on protein function.
● Quality and clarity of writing.
● Proper citation of sources and presentation of data.