Sequence Alignment Algorithms: DEKM Book Notes From Dr. Bino John and Dr. Takis Benos

The document discusses various sequence alignment algorithms and concepts. It begins with an overview of global and local alignment approaches, including the Needleman-Wunsch and Smith-Waterman algorithms. It then covers scoring schemes like affine gap penalties. Finally, it discusses database searching tools like BLAST that use heuristics to rapidly identify similar sequences.

Uploaded by

Aashutosh Sharma
Sequence Alignment Algorithms

DEKM book
Notes from Dr. Bino John
and Dr. Takis Benos

1
To Do
• Global alignment
• Local alignment
• Gaps
– Affine Gaps
– Algorithm (blackboard)
• Statistical Significance
– Notes (blackboard)
• Read up on database searches
– BLAST
– FASTA
– CS tricks: suffix tree, …
• PSSMs and Multiple Sequence Alignments

2
Why compare sequences?
• Given a new sequence, infer its function based
on similarity to another sequence

• Find important molecular regions – conserved across species

3
Why compare sequences? Do more..
• Determine the evolutionary constraints at
work
• Find mutations in a population or family of
genes
• Find similar looking sequence in a database
• Find secondary/tertiary structure of a
sequence of interest – molecular modeling
using a template (homology modeling)

4
Sequence alignment
• Are two sequences related?
– Align sequences or parts of them
– Decide whether the alignment arose by chance or reflects an evolutionary link
• Issues:
– What sorts of alignments to consider?
– How to score an alignment and hence rank?
– Algorithm to find good alignments
– Evaluate the significance of the alignment

5
Dynamic Programming
We apply dynamic programming when:
• There is only a polynomial number of
subproblems
– Align x1…xi to y1…yj

• Original problem is one of the subproblems
– Align x1…xM to y1…yN

• Each subproblem is easily solved from smaller subproblems
7
Global alignment
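Figure: DP matrix with sequence S along the rows (i-1, i) and sequence T along the columns (j-1, j); cell M_{i,j} is filled from its three neighbours.

Score(S_i, T_j): substitution score from PAM, BLOSUM, or a DNA matrix
γ: gap penalty

\[
M_{i,j} = \max
\begin{cases}
M_{i-1,j-1} + \mathrm{Score}(S_i, T_j) \\
M_{i,j-1} + \gamma \\
M_{i-1,j} + \gamma
\end{cases}
\]

Needleman & Wunsch, 1970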
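
A minimal sketch of the recurrence above, assuming the simple scores used on the following slides (match = 1, mismatch = 0, gap = 0) instead of a PAM/BLOSUM matrix. The function name and the tie-breaking order in the traceback are illustrative choices, not part of the original notes.

```python
def needleman_wunsch(s, t, match=1, mismatch=0, gap=0):
    """Global alignment by dynamic programming.

    Returns (score, aligned_s, aligned_t). `gap` is added per gap
    position, so a real penalty would be a negative number.
    """
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    # Borders: a prefix aligned against nothing is all gaps.
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + gap

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    # Fill: each cell depends on its three neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + sub(i, j),
                          M[i][j - 1] + gap,
                          M[i - 1][j] + gap)

    # Traceback from the bottom-right corner.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + sub(i, j):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return M[m][n], "".join(reversed(a)), "".join(reversed(b))


score, a, b = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(score)
print(a)
print(b)
```

With these scores the optimal value is simply the largest number of matches; several alignments can share it, so the printed alignment depends on the traceback order.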

8
Alignment: adding scores (cntd)

Score(match) = 1
Score(mismatch) = 0
Score(gap) = 0

13
Alignment: adding scores

14
Alignment: adding scores (cntd)

Alignment:
(Seq #1) A
         |
(Seq #2) A

15
Alignment: adding scores (cntd)

Alignment:
(Seq #1) T A
           |
(Seq #2) - A

16
Alignment: adding scores (cntd)

Alignment:
(Seq #1) G A A T T C A G T T A
         |   |   | |   |     |
(Seq #2) G G A - T C - G - - A

6 matches, 1 mismatch, 4 gaps
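
The counts on this slide can be checked with a few lines of code. This is a small sketch, assuming the alignment is given as two equal-length strings with "-" for gaps; the function name is hypothetical.

```python
def score_alignment(a, b, match=1, mismatch=0, gap=0):
    """Count matches, mismatches and gap columns, and total the score."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    total = matches * match + mismatches * mismatch + gaps * gap
    return matches, mismatches, gaps, total


print(score_alignment("GAATTCAGTTA", "GGA-TC-G--A"))
```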

17
Local alignment
Given two sequences, S and T, find two substrings (contiguous
subsequences), s and t, whose alignment has the highest “score”
amongst all such pairs.

Question: Why do we need local alignment, if we have the global one?

20
Local alignment: an example

EGR4_HUMAN KA [FACPVESCVRSFARSDELNRHLRIH] TGHKP [FQCRICLRNFSRSDHLTSHVRTH] TGEKP [FACDV--CGRRFARSDEKKRHSKVH]


EGR4_RAT KA [FACPVESCVRTFARSDELNRHLRIH] TGHKP [FQCRICLRNFSRSDHLTTHVRTH] TGEKP [FACDV--CGRRFARSDEKKRHSKVH]
EGR3_HUMAN RP [HACPAEGCDRRFSRSDELTRHLRIH] TGHKP [FQCRICMRSFSRSDHLTTHIRTH] TGEKP [FACEF--CGRKFARSDERKRHAKIH]
EGR3_RAT RP [HACPAEGCDRRFSRSDELTRHLRIH] TGHKP [FQCRICMRSFSRSDHLTTHIRTH] TGEKP [FACEF--CGRKFARSDERKRHAKIH]
EGR1_HUMAN RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH]
EGR1_MOUSE RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH]
EGR1_RAT RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH]
EGR1_BRARE RP [YACPVETCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACEI--CGRKFARSDERKRHTKIH]
EGR2_RAT RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]
EGR2_XENLA RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]
EGR2_MOUSE RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]
EGR2_HUMAN RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]
EGR2_BRARE RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDF--CGRKFARSDERKRHTKIH]
MIG1_KLULA -- [-------------------------] ---RP [YVCPICQRGFHRLEHQTRHIRTH] TGERP [HACDFPGCSKRFSRSDELTRHRRIH]
MIG1_KLUMA -- [-------------------------] ---RP [YMCPICHRGFHRLEHQTRHIRTH] TGERP [HACDFPGCAKRFSRSDELTRHRRIH]
MIG1_YEAST -- [-------------------------] ---RP [HACPICHRAFHRLEHQTRHMRIH] TGEKP [HACDFPGCVKRFSRSDELTRHRRIH]
MIG2_YEAST -- [-------------------------] ---RP [FRCDTCHRGFHRLEHKKRHLRTH] TGEKP [HHCAFPGCGKSFSRSDELKRHMRTH]
[ ] :* [. * * * * * :* . *:* *] ***:* [. * * : *:**** .** : *]

21
Local alignment (cntd)
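Figure: DP matrix with sequence S along the rows (i-1, i) and sequence T along the columns (j-1, j); cell M_{i,j} is filled from its three neighbours or reset to 0.

Score(S_i, T_j): substitution score from PAM, BLOSUM, or a DNA matrix
γ: gap penalty

\[
M_{i,j} = \max
\begin{cases}
0 \\
M_{i-1,j-1} + \mathrm{Score}(S_i, T_j) \\
M_{i,j-1} + \gamma \\
M_{i-1,j} + \gamma
\end{cases}
\]

Smith & Waterman, 1981

Similarity scoring, expected value:
• negative for random alignments
• positive for highly similar sequences

22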
The Smith-Waterman Algorithm
1. Initialization
F(0,0) = F(0,j) = F(i,0) = 0

2. Iteration
for i=1,…,M
for j=1,…,N
- calculate optimal F(i,j)
- store Ptr(i,j)

3. Termination
• Find the end of the best alignment, F_OPT = max_{i,j} F(i,j), and trace back, OR
• Find all alignments with F(i,j) > threshold and trace back
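
The three steps above can be sketched as follows. This is an illustrative version only: it uses a linear gap penalty and made-up scores (match = 2, mismatch = -1, gap = -2) in place of a substitution matrix, and it reports just the single best alignment rather than everything above a threshold.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Local alignment. Returns (best_score, aligned_s, aligned_t)."""
    m, n = len(s), len(t)
    # 1. Initialization: first row and column are 0.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)

    # 2. Iteration: fill F(i,j) and remember where it came from.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [
                (0, None),                       # start a new alignment here
                (F[i - 1][j - 1] + sub, "diag"),
                (F[i - 1][j] + gap, "up"),
                (F[i][j - 1] + gap, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # 3. Termination: trace back from the best cell until a 0 is reached.
    a, b = [], []
    i, j = best_pos
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif move == "up":
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("TTTACGTGGG", "CCCACGTAAA"))
```

In the toy input the only shared region is ACGT, so the local alignment ignores the unrelated flanks that a global alignment would be forced to include.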

23
Local vs. global alignment

24
Local vs. global alignment (cntd)

25
Local alignment (cntd)
Characteristics of local alignments:
• The alignment can start/end at any point in the
matrix.

• No negative scores in the alignment.

• The mean value of the scoring matrix (e.g. PAM, BLOSUM) should be
negative, but there should be positive scores in the scoring matrix.

26
Scoring the gaps more accurately
• A naive model: the gap penalty γ(n) is linear in the gap length n
But nature “prefers” to place gaps where other gaps exist

• Convex gap penalty function γ(n):
γ(n+1) - γ(n) ≤ γ(n) - γ(n-1)

Time O(N²M), space O(NM) (assume N > M)

27
Scoring gaps: affine gaps
• Affine gaps: a compromise between linear and convex gap penalties

γ(n) = -d - e * (n-1)

d: gap initiation penalty
e: gap extension penalty

Figure: γ(n) plotted against gap length n, with initial step d and slope e.
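
A small worked example of the affine formula, with a linear penalty for comparison. The parameter values (d = 12, e = 2 for the affine case, 8 per position for the linear case) are assumptions chosen only for illustration.

```python
def linear_gap(n, g=8):
    """Linear penalty: every gap position costs the same (g is assumed)."""
    return -g * n


def affine_gap(n, d=12, e=2):
    """Affine penalty gamma(n) = -d - e*(n-1); d and e are assumed values."""
    return -d - e * (n - 1)


# One gap of length 4 versus four separate gaps of length 1.
one_long = affine_gap(4)
four_short = 4 * affine_gap(1)
print(one_long, four_short)

# Under a linear penalty the two cases cost exactly the same.
print(linear_gap(4), 4 * linear_gap(1))
```

Because opening a gap costs more than extending one, the affine scheme scores one long gap better than several short ones, which is the behaviour the previous slide asks for.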

28
Database searches

38
DNA and protein databases
• EMBL/GenBank/DDBJ database of nucleic acids

39
DNA and protein databases
• EMBL/GenBank/DDBJ database of nucleic acids (cntd)

40
DNA and protein databases
• SWISS-PROT & TrEMBL database of proteins

41
DNA and protein databases
• SWISS-PROT & TrEMBL database of proteins

42
Database searches
• Database searching consists of many pairwise alignments combined in
one search.
• It helps determine the function and the evolutionary relationships of a
sequence.
• Heuristic algorithms are used instead of DP. Why?
• Size of SWISS-PROT + TrEMBL (Rel. 9.5):
3.9M entries or 1,276M residues.
• Exact algorithms take O(NM) time per pairwise alignment.
• Heuristic methods look at only a small fraction of the search space,
chosen so that it includes all (or most) of the high-scoring pairs.
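
A back-of-the-envelope estimate of the exact-search cost. Only the database size comes from the slide; the query length and the cell-update rate are assumptions picked for illustration.

```python
DB_RESIDUES = 1_276_000_000   # from the slide: 1,276M residues
QUERY_LEN = 300               # assumption: a typical protein query
CELLS_PER_SECOND = 1e9        # assumption: DP cell updates per second

# Summing O(N*M) over every database entry gives query length
# times the total number of residues in the database.
cells = QUERY_LEN * DB_RESIDUES
seconds = cells / CELLS_PER_SECOND
print(f"{cells:.3e} DP cells, about {seconds:.0f} s per query")
```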

43
BLAST algorithm
• Basic Local Alignment Search Tool - The method:

• For each “word” (of fixed length) in the query sequence, make a list
of all neighbouring “words” that score above some threshold.

• Scan the database for these words.

• Perform (ungapped) “hit extension” until score < threshold.

• Stop at maximum scoring extension.
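
The scan and extension steps can be sketched as below. This is a simplification, not BLAST itself: seeds are exact word matches rather than neighbouring words, the scores (match = 2, mismatch = -1) are made up, and extension stops once the running score falls more than an assumed `x_drop` below the best seen so far. All function and parameter names are hypothetical.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) for every shared word of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [(qi, sj)
            for sj in range(len(subject) - w + 1)
            for qi in index.get(subject[sj:sj + w], [])]


def extend_hit(query, subject, qi, sj, w=3, match=2, mismatch=-1, x_drop=3):
    """Ungapped extension of a seed in both directions.

    Returns (best_score, query_start, query_end) with a half-open range.
    """
    def s(a, b):
        return match if a == b else mismatch

    best = run = sum(s(query[qi + k], subject[sj + k]) for k in range(w))
    right = 0
    k = w
    while qi + k < len(query) and sj + k < len(subject):
        run += s(query[qi + k], subject[sj + k])
        if run > best:
            best, right = run, k - w + 1   # keep the maximum-scoring end
        if best - run > x_drop:
            break
        k += 1

    run = best
    left = 0
    k = 1
    while qi - k >= 0 and sj - k >= 0:
        run += s(query[qi - k], subject[sj - k])
        if run > best:
            best, left = run, k
        if best - run > x_drop:
            break
        k += 1
    return best, qi - left, qi + w + right


query = "CPICHRAFHRLEHQTRHMRIHTGEKPHAC"
subject = "CPLCDKAFHRLEHQTRHIRTHTGEKPHAC"
seeds = find_seeds(query, subject)
print(extend_hit(query, subject, *seeds[0]))
```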

44
BLAST algorithm (cntd)
• An example:

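Query: CPICHRAFHRLEHQTRHMRIHTGEKPHAC

Query word: HMR. Neighbouring words scored against HMR with BLOSUM62:

Word   Score
HMR    18
HHR    -2+13 = 11
HIR    +1+13 = 14
HAR    -1+13 = 12
…      …

Selection (words scoring above the threshold): HMR, HIR, …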
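
The word scores in this example can be reproduced with the handful of BLOSUM62 entries involved. Only those entries are included here, so the lookup works for this example and nothing else; the threshold T = 13 is an assumption, since the slide does not state one.

```python
# The few BLOSUM62 entries needed for the HMR example (not the full matrix).
BLOSUM62_PARTIAL = {
    ("H", "H"): 8, ("M", "M"): 5, ("R", "R"): 5,
    ("M", "H"): -2, ("M", "I"): 1, ("M", "A"): -1,
}


def pair_score(a, b):
    """Symmetric lookup in the partial matrix."""
    if (a, b) in BLOSUM62_PARTIAL:
        return BLOSUM62_PARTIAL[(a, b)]
    return BLOSUM62_PARTIAL[(b, a)]


def word_score(query_word, candidate):
    """Sum of position-by-position substitution scores."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


T = 13  # assumed threshold, for illustration only
candidates = ["HMR", "HHR", "HIR", "HAR"]
scores = {w: word_score("HMR", w) for w in candidates}
selected = [w for w in candidates if scores[w] >= T]
print(scores)
print(selected)
```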

45
BLAST algorithm (cntd)
• An example:

Query: CPICHRAFHRLEHQTRHMRIHTGEKPHAC
                       H+R
Sbjct: CPLCDKAFHRLEHQTRHIRTHTGEKPHAC

46
BLAST algorithm (cntd)
• An example:

Query: CPICHRAFHRLEHQTRHMRIHTGEKPHAC
       CP+C +AFHRLEHQTRH+R HTGEKPHAC
Sbjct: CPLCDKAFHRLEHQTRHIRTHTGEKPHAC

47
BLAST algorithm (cntd)
• The idea: a high-scoring alignment is very likely to contain a short
stretch of very high-scoring matches.

• Word length: 3 (proteins) and 11 (DNA).

• HSP (high-scoring segment pair): multiple HSPs can be reported for
each database entry.

• Gapped alignments: more recent BLAST versions perform gapped alignments.

48
BLAST flavours
Program   Query              Database
BLASTN    DNA                DNA
BLASTP    Protein            Protein
BLASTX    DNA (translated)   Protein
TBLASTN   Protein            DNA (translated)
TBLASTX   DNA (translated)   DNA (translated)

49
FASTA algorithm
• The method:
• For each pair of sequences (query, subject), identify all
identical “word” matches of (fixed) length.
• Look for diagonals with many mutually supporting
“word” matches.
• The best diagonals are used to extend the word matches
to find the maximal scoring (ungapped) regions.
• Join ungapped regions, using gap costs.
• Align the two (sub)regions using full dynamic
programming techniques.
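
The first two steps (identical word matches, then diagonals with many supporting matches) can be sketched as below. This covers only the diagonal-counting idea, not the rescoring, joining, or final DP stages; the function name is hypothetical and the sequences are the query/subject pair from the BLAST example.

```python
from collections import Counter


def diagonal_hits(query, subject, k=2):
    """Count identical k-word matches on each diagonal (i - j)."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            diagonals[i - j] += 1
    return diagonals


query = "CPICHRAFHRLEHQTRHMRIHTGEKPHAC"
subject = "CPLCDKAFHRLEHQTRHIRTHTGEKPHAC"
diagonals = diagonal_hits(query, subject)
# The diagonal with the most word matches is the best candidate region.
print(diagonals.most_common(3))
```

A word match at query position i and subject position j lies on diagonal i - j, so matches that belong to the same ungapped alignment pile up on one diagonal.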

50
FASTA algorithm (cntd)

51
FASTA algorithm (cntd)
• The idea: a high scoring match alignment is very likely to contain a short
stretch of identities.

• Word length: 2 (proteins) and 4-6 (DNA).

• HSP: usually one (extended) gapped alignment is presented.

52
FASTA flavours
Program   Query     Database
FASTA3    DNA       DNA
FASTA3    Protein   Protein
FASTX3    DNA       Protein
TFASTA3   Protein   DNA

53
