Introduction To Bioinformatics: Sequence Alignment

This document discusses protein sequence alignment tools like BLAST and PSI-BLAST, how they use substitution matrices like PAM and BLOSUM to score alignments based on amino acid similarities, and statistical measures like E-values to assess the significance of matches found in database searches. It also covers concepts like affine gap penalties, filtering of low-complexity sequences, and how PSI-BLAST can be used to find more distant homologs through iterative profile searches.
Copyright
© Attribution Non-Commercial (BY-NC)

Introduction to bioinformatics

Sequence Alignment
Part 3
WHAT'S TODAY?
• MORE BLAST ….
- Similarity scores for protein sequences
- Gaps
- Statistical significance (e-value)
Protein Sequence Alignment
Rule of thumb:
Proteins are likely homologous if at least 25% identical (over a length > 100)
DNA sequences are likely homologous if at least 70% identical
Protein Pairwise Sequence Alignment
• The alignment tools are similar to the DNA alignment tools
• BLASTN for nucleotides
• BLASTP for proteins

• Main difference: instead of scoring match (+2) and mismatch (-1), we have similarity scores:
• Score s(i,j) > 0 if amino acids i and j have similar properties
• Score s(i,j) is ≤ 0 otherwise

• How should we score s(i,j)?


The 20 Amino Acids
Chemical Similarities Between Amino Acids

Group            Letters   Amino acids
Acids & Amides   DENQ      (Asp, Glu, Asn, Gln)
Basic            HKR       (His, Lys, Arg)
Aromatic         FYW       (Phe, Tyr, Trp)
Hydrophilic      ACGPST    (Ala, Cys, Gly, Pro, Ser, Thr)
Hydrophobic      ILMV      (Ile, Leu, Met, Val)
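
As a rough illustration of a similarity score s(i,j), the five chemical groups above can be turned into a toy scoring function. This is only a sketch: the function name `toy_score` and the numeric values (+2, +1, -1) are hypothetical and are not taken from any real substitution matrix.

```python
# Chemical groups as listed on the slide (one-letter amino acid codes).
GROUPS = {
    "acids_amides": "DENQ",
    "basic": "HKR",
    "aromatic": "FYW",
    "hydrophilic": "ACGPST",
    "hydrophobic": "ILMV",
}

# Map each amino acid to the name of its group.
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def toy_score(a, b, identical=2, similar=1, dissimilar=-1):
    """Hypothetical s(i,j): positive for identical or same-group residues,
    negative otherwise. The numeric values are illustrative assumptions."""
    if a == b:
        return identical
    if GROUP_OF[a] == GROUP_OF[b]:
        return similar
    return dissimilar


print(toy_score("D", "E"))  # Asp vs Glu: same group
print(toy_score("D", "W"))  # Asp vs Trp: different groups
```

Real matrices such as PAM and BLOSUM replace these hand-picked values with scores derived from observed substitution frequencies.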


Sequence Alignment based on AA
similarity
TQSPSSLSASVGDTVTITCRASQSISTYLNWYQQKP----GKAPKLLIYAASSSQSGVPS
|| + |||| +|| ||| | +| | | | |
TQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS

RFSGSGSGTDFTLTINSLQPEDFATYYCQ---------------QSYSTPHFSQGTKLEI
| | | +| | | +|+ || || |+ + | | || | +
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL

---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN---------NFYPREAKVQWKVD
++||| | + ++ | | | + ||++|+|
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID

| = identity + = similarity
Amino Acid Substitutions Matrices
• When scoring protein sequence alignments
it is common to use a 20 × 20 matrix
representing all pairwise comparisons:
Substitution Matrix
Given an alignment of closely related sequences
we can score the relation between amino acids
based on how frequently they substitute each other

M G Y D E
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
M G Y E E

In the fourth column, E & D are found in 7/8 of the sequences.
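
The counting idea can be sketched in a few lines. The eight sequences are the ones from the slide; tallying every pair of sequences within a column is one simple way to count co-occurring residues, chosen here for illustration and not necessarily the exact procedure used to build any published matrix. The helper names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# The eight aligned sequences from the slide.
alignment = [
    "MGYDE", "MGYDE", "MGYEE", "MGYDE",
    "MGYQE", "MGYDE", "MGYEE", "MGYEE",
]


def column_counts(seqs, col):
    """Count how often each residue appears in one alignment column."""
    return Counter(s[col] for s in seqs)


def pair_counts(seqs, col):
    """Count unordered residue pairs over all pairs of sequences in a column."""
    pairs = Counter()
    for s, t in combinations(seqs, 2):
        pairs[tuple(sorted((s[col], t[col])))] += 1
    return pairs


col4 = column_counts(alignment, 3)   # fourth column (0-based index 3)
pairs4 = pair_counts(alignment, 3)
print(col4)
print(pairs4)
```

A pair such as (D, E) that is seen often in columns of closely related sequences ends up with a high substitution score.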
Amino Acid Matrices
Symmetric matrix of 20x20 entries: entry (i,j) = entry (j,i).
Entry (i,j): the score of aligning amino acid i against amino acid j.
Entry (i,i) is greater than any entry (i,j), j ≠ i.
PAM - Point Accepted Mutations
• Developed by Margaret Dayhoff, 1978.
• Analyzed very similar protein sequences
• Proteins are evolutionarily close.
• Alignment is easy.
• Point mutations - mainly substitutions
• Accepted mutations - by natural selection.
• Used global alignment.
• Counted the number of substitutions (i,j) per amino acid pair: Many
i<->j substitutions => high score s(i,j)
• Found that the most common substitutions involved
chemically similar amino acids.
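
The step from "many i<->j substitutions" to "high score s(i,j)" is usually expressed as a log-odds ratio: how often a pair is observed compared with how often it would occur by chance. The sketch below shows that generic calculation only. It is not Dayhoff's exact procedure, the function name is hypothetical, and the pair counts are made-up numbers for illustration.

```python
import math
from collections import Counter


def log_odds_scores(pair_counts):
    """Log-odds score per unordered residue pair.

    pair_counts maps sorted (a, b) tuples to observed aligned-pair counts.
    """
    total = sum(pair_counts.values())
    # Residue frequencies implied by the pairs (each pair holds two residues).
    residues = Counter()
    for (a, b), n in pair_counts.items():
        residues[a] += n
        residues[b] += n
    p = {aa: n / (2 * total) for aa, n in residues.items()}

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total
        # Chance frequency of an unordered pair: p_a^2, or 2*p_a*p_b if a != b.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Made-up counts: D/E and W/F substitute often, D/W and E/F rarely.
counts = Counter({
    ("D", "D"): 20, ("E", "E"): 20, ("D", "E"): 20,
    ("W", "W"): 15, ("F", "F"): 15, ("F", "W"): 8,
    ("D", "W"): 1, ("E", "F"): 1,
})
scores = log_odds_scores(counts)
print(round(scores[("D", "E")], 2), round(scores[("D", "W")], 2))
```

Pairs seen more often than chance get a positive score, and pairs seen less often than chance get a negative one.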
PAM 250

• Similar amino acids are close to each other.


• Regions define conserved substitutions.
Example: Asp & Glu
[Figure: structures of aspartate (Asp, D) and glutamate (Glu, E); the glutamate side chain has one more CH2 group than the aspartate side chain]
Score = 3
Selecting a PAM Matrix
• Low PAM numbers: short sequences, strong local
similarities.
• High PAM numbers: long sequences, weak
similarities.
– PAM120 recommended for general use (40% identity)
– PAM60 for close relations (60% identity)
– PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices
– PAM40, PAM120, PAM250 recommended
BLOSUM
• Blocks Substitution Matrix
– Steven and Jorja G. Henikoff (1992)
• Based on BLOCKS database (www.blocks.fhcrc.org)
– Families of proteins with identical function
– Highly conserved protein domains
• Ungapped local alignment to identify motifs
– Each motif is a block of local alignment
– Counts amino acids observed in same column
– Symmetrical model of substitution

[Figure: example blocks of ungapped local alignment]
AABCDA… BBCDA
DABCDA. A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA… BBCCC
BLOSUM Matrices

• Different BLOSUMn matrices are calculated independently from BLOCKS
• BLOSUMn is based on sequences that are at most n percent identical.
Selecting a BLOSUM Matrix
• For BLOSUMn, a higher n is suitable for
sequences which are more similar
– BLOSUM62 recommended for general use
– BLOSUM80 for close relations
– BLOSUM45 for distant relations
Summary:
• BLOSUM matrices are based on the
replacement patterns found in more highly
conserved regions of the sequences without
gaps
• PAM matrices are based on mutations observed
throughout a global alignment, which includes
both highly conserved and highly mutable
regions
Gap Scores
• Example showed -1 score per indel
– So gap cost is proportional to its length
• Biologically, indels occur in groups
– We want our gap score to reflect this
• Standard solution: affine gap model
– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to algorithm
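
One common way to make this change to the algorithm is to keep three dynamic-programming tables instead of one (often attributed to Gotoh). The sketch below computes only the best global alignment score. It assumes simple match/mismatch scoring and the convention that a gap of length k costs gap_open + k × gap_extend. Conventions differ between tools, and all parameter values here are illustrative.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=2, mismatch=-1, gap_open=5, gap_extend=1):
    """Best global alignment score under an affine gap model.

    Assumed convention: a gap of length k costs gap_open + k * gap_extend.
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap (gap in b)
    # Y: b[j-1] aligned to a gap (gap in a)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + i * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + j * gap_extend)

    first = gap_open + gap_extend  # cost of the first position of a new gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Either open a new gap or extend the one already open.
            X[i][j] = max(M[i - 1][j] - first,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - first)
            Y[i][j] = max(M[i][j - 1] - first,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - first)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("AAAA", "AA"))
```

With these values, aligning AAAA to AA scores best with a single gap of length 2 (two matches minus 5 + 2 × 1), which beats two separate gaps of length 1.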
Scoring system = Substitution Matrix + Gap Penalty
Gap penalty
• We expect to penalize gaps
• Scoring for gap opening & for extension
– Insertions and deletions are rare in evolution
– But once they are created, they are easy to extend
– Gap-extension penalty < gap-open penalty
• Default gap parameters are given for each matrix:
– PAM30: open=9, extension=1
– PAM250: open=14, extension=2
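
To see what these parameters mean for gaps of different lengths, here is a small sketch that uses the two defaults quoted above. It assumes that a gap of length k costs open + k × extension. Some tools instead charge the opening cost for the first position and the extension cost only from the second position on, so treat the convention as an assumption.

```python
# Default (open, extension) pairs as quoted on the slide.
DEFAULT_GAPS = {"PAM30": (9, 1), "PAM250": (14, 2)}


def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of one gap of the given length (assumed: open + length * extend)."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


for matrix, (gap_open, gap_extend) in DEFAULT_GAPS.items():
    costs = [affine_gap_cost(k, gap_open, gap_extend) for k in (1, 2, 5)]
    print(matrix, costs)
```

One gap of length 5 is much cheaper than five separate gaps of length 1, which reflects the observation that indels occur in groups.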
Low Complexity Sequences
• AAAAAAAAAAA

• ATATATATATATA

• CAGCAGCAGCAG

Low-complexity sequences can produce significant hits that are not true homologues!

How does BLAST deal with low-complexity sequences?

By default, low-complexity regions are filtered out and replaced by XXXXX
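
The masking idea can be sketched with a sliding window and Shannon entropy: a window made of only one or two distinct letters has low entropy and gets replaced by X. BLAST's own filters (SEG for proteins, DUST for nucleotides) are more elaborate. The function names, window size and threshold below are arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the letter composition of a string."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=1.6, mask_char="X"):
    """Mask every window whose entropy falls below the threshold."""
    masked = list(seq)
    for start in range(max(len(seq) - window + 1, 0)):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


for s in ["AAAAAAAAAAAA", "ATATATATATATA", "CAGCAGCAGCAG",
          "MKTAYIAKQRQISFVKSHFSRQ"]:
    print(s, "->", mask_low_complexity(s))
```

The three repeats from the slide are masked completely, while the mixed protein fragment in the last line (a made-up example) is left unchanged.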
Statistical significance
E-value
• The number of hits (with the same similarity score) one can
"expect" to see just by chance when searching the given
string in a database of a particular size.
• The higher the E-value, the less significant the match
– “sequences with E-value of less than 0.01 are almost always
found to be homologous”
• The lower bound is 0; the best hits have E-values closest to 0
Expectation Values

• Increases linearly with length of query sequence
• Increases linearly with length of database
• Decreases exponentially with score of alignment
• Bit score (S)
– Similar to alignment score
– Normalized
– Higher means more significant
• E value:
– Number of hits of score ≥ S expected by chance
– Based on random database of similar size
– Lower means more significant
– Used to assess the statistical significance of the
alignment
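
These relationships follow from the Karlin-Altschul formula E = K · m · n · e^(−λS), where m is the query length, n the database length and S the raw alignment score. The bit score S' = (λS − ln K) / ln 2 folds the parameters in, so that E = m · n · 2^(−S'). The sketch below uses placeholder values for λ and K, since the real values depend on the scoring matrix and gap penalties. It also ignores the length corrections that BLAST applies.

```python
import math


def e_value(raw_score, query_len, db_len, lam=0.267, K=0.041):
    """E = K * m * n * exp(-lambda * S). lam and K are placeholder values."""
    return K * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=0.267, K=0.041):
    """Normalized score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, query_len, db_len):
    """E = m * n * 2^(-S'), no matrix-specific parameters needed."""
    return query_len * db_len * 2 ** (-bits)


m, n = 200, 1_000_000  # hypothetical query and database lengths
for score in (40, 60, 80):
    print(score, round(bit_score(score), 1), e_value(score, m, n))
```

Doubling the query or database length doubles E, while each extra bit of score halves it.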
Remote homologues
• Sometimes BLAST isn’t enough.
• Large protein family, and BLAST only
gives close members. We want more distant
members

PSI-BLAST
PSI-BLAST
• Position Specific Iterated BLAST
Regular BLAST → Construct profile from BLAST results → BLAST profile search → Final results
PSI-BLAST
• Advantage: PSI-BLAST looks for sequences that
are close to ours, and learns from them to
extend the circle of friends
• Disadvantage: if a WRONG sequence is included
in the profile, the search drifts towards unrelated
sequences. This gets worse with each
iteration
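
The "learning" step is the profile: a table of position-specific scores built from the aligned hits. The sketch below builds such a table from a small ungapped alignment (the MGYDE example from earlier in these slides) and scores candidate sequences against it. It assumes a uniform background and add-one pseudocounts, which is far simpler than PSI-BLAST's actual weighting scheme. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Position-specific log-odds scores against a uniform background.

    Assumes equal-length, ungapped sequences (a simplification).
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, seq):
    """Sum the position-specific scores of a sequence of the same length."""
    return sum(column[aa] for column, aa in zip(profile, seq))


hits = ["MGYDE", "MGYDE", "MGYEE", "MGYDE",
        "MGYQE", "MGYDE", "MGYEE", "MGYEE"]
profile = build_profile(hits)
print(profile_score(profile, "MGYDE"), profile_score(profile, "MGYWE"))
```

Because the profile is rebuilt from the hits of each round, one unrelated sequence accepted into `hits` shifts the scores at every position, which is how the drift described above sets in.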