Sequence Alignment Algorithms: DEKM Book Notes From Dr. Bino John and Dr. Takis Benos

The document discusses various sequence alignment algorithms and concepts. It begins with an overview of global and local alignment approaches, including the Needleman-Wunsch and Smith-Waterman algorithms. It then covers scoring schemes like affine gap penalties. Finally, it discusses database searching tools like BLAST that use heuristics to rapidly identify similar sequences.

Uploaded by

Aashutosh Sharma

Sequence Alignment Algorithms

DEKM book
Notes from Dr. Bino John
and Dr. Takis Benos

1
To Do
• Global alignment
• Local alignment
• Gaps
– Affine Gaps
– Algorithm (blackboard)
• Statistical Significance
– Notes (blackboard)
• Read up on database searches
– BLAST
– FASTA
– CS tricks: suffix tree, …
• PSSMs and Multiple Sequence Alignments

2
Why compare sequences?
• Given a new sequence, infer its function based on similarity to another sequence
• Find important molecular regions – conserved across species

3
Why compare sequences? (cntd)
• Determine the evolutionary constraints at work
• Find mutations in a population or family of genes
• Find similar-looking sequences in a database
• Find the secondary/tertiary structure of a sequence of interest – molecular modeling using a template (homology modeling)

4
Sequence alignment
• Are two sequences related?
– Align sequences or parts of them
– Decide whether the alignment arose by chance or reflects an evolutionary link
• Issues:
– What sorts of alignments to consider?
– How to score an alignment and hence rank?
– Algorithm to find good alignments
– Evaluate the significance of the alignment

5
Dynamic Programming
We apply dynamic programming when:
• There is only a polynomial number of subproblems
– Align x1…xi to y1…yj
• The original problem is one of the subproblems
– Align x1…xM to y1…yN
• Each subproblem is easily solved from smaller subproblems
7
Global alignment
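Scoring matrices: PAM, BLOSUM, DNA matrix
[Figure: DP matrix M with sequence S along the rows (i-1, i) and sequence T along the columns (j-1, j); cell M_{i,j} is filled from its three neighbours.]

$$
M_{i,j} = \max
\begin{cases}
M_{i-1,j-1} + \mathrm{Score}(S_i, T_j) \\
M_{i,j-1} + \gamma \\
M_{i-1,j} + \gamma
\end{cases}
\qquad (\gamma:\ \text{gap penalty})
$$

Needleman & Wunsch, 1970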
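
The recurrence above can be sketched in Python as follows. This is a minimal illustration rather than the lecturers' implementation: the function name `needleman_wunsch`, the match/mismatch scores standing in for a PAM/BLOSUM lookup, and the default gap score of -2 are all assumptions made for the example.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap score (gap is negative).

    Returns (score, aligned_s, aligned_t). The scoring values are
    illustrative assumptions, not taken from the slides.
    """
    m, n = len(s), len(t)
    # M[i][j] = best score for aligning s[:i] with t[:j]
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap          # s[:i] aligned against gaps only
    for j in range(1, n + 1):
        M[0][j] = j * gap          # t[:j] aligned against gaps only

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + sub(i, j),  # align S_i with T_j
                          M[i][j - 1] + gap,            # gap in S
                          M[i - 1][j] + gap)            # gap in T

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + sub(i, j):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return M[m][n], "".join(reversed(a)), "".join(reversed(b))
```

For example, `needleman_wunsch("TA", "A")` aligns `TA` over `-A`, the same shape as the two-column alignment shown on a later slide. Time and space are both O(NM).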

8
Alignment: adding scores (cntd)

Score(match) = 1
Score(mismatch) = 0
Score(gap) = 0
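
As a small sketch of how this scheme adds up, the function below scores an already-aligned pair of rows column by column. The name `score_alignment` and the use of `-` as the gap symbol are assumptions for illustration.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=0):
    """Score two aligned rows of equal length; '-' marks a gap.

    Defaults follow the slide: match = 1, mismatch = 0, gap = 0.
    Returns (total_score, counts).
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            counts["gap"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    total = (counts["match"] * match
             + counts["mismatch"] * mismatch
             + counts["gap"] * gap)
    return total, counts
```

With these particular scores the total is simply the number of matching columns, so maximizing it is equivalent to finding a longest common subsequence of the two sequences.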

13
Alignment: adding scores

14
Alignment: adding scores (cntd)

Alignment:
(Seq #1) A
         |
(Seq #2) A

15
Alignment: adding scores (cntd)

Alignment:
(Seq #1) T A
           |
(Seq #2) - A

16
Alignment: adding scores (cntd)

6 matches, 1 mismatch, 4 gaps

17
Local alignment
Given two sequences, S and T, find two
subsequences, s and t, whose alignment has the
highest “score” amongst all subsequence pairs.

Question: Why do we need local alignment, if we have the global one?

20
Local alignment: an example

EGR4_HUMAN KA [FACPVESCVRSFARSDELNRHLRIH] TGHKP [FQCRICLRNFSRSDHLTSHVRTH] TGEKP [FACDV--CGRRFARSDEKKRHSKVH]
EGR4_RAT KA [FACPVESCVRTFARSDELNRHLRIH] TGHKP [FQCRICLRNFSRSDHLTTHVRTH] TGEKP [FACDV--CGRRFARSDEKKRHSKVH]
EGR3_HUMAN RP [HACPAEGCDRRFSRSDELTRHLRIH] TGHKP [FQCRICMRSFSRSDHLTTHIRTH] TGEKP [FACEF--CGRKFARSDERKRHAKIH]
EGR3_RAT RP [HACPAEGCDRRFSRSDELTRHLRIH] TGHKP [FQCRICMRSFSRSDHLTTHIRTH] TGEKP [FACEF--CGRKFARSDERKRHAKIH]
EGR1_HUMAN RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH]
EGR1_MOUSE RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH]
EGR1_RAT RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH]
EGR1_BRARE RP [YACPVETCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACEI--CGRKFARSDERKRHTKIH]
EGR2_RAT RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]
EGR2_XENLA RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]
EGR2_MOUSE RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]
EGR2_HUMAN RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]
EGR2_BRARE RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDF--CGRKFARSDERKRHTKIH]
MIG1_KLULA -- [-------------------------] ---RP [YVCPICQRGFHRLEHQTRHIRTH] TGERP [HACDFPGCSKRFSRSDELTRHRRIH]
MIG1_KLUMA -- [-------------------------] ---RP [YMCPICHRGFHRLEHQTRHIRTH] TGERP [HACDFPGCAKRFSRSDELTRHRRIH]
MIG1_YEAST -- [-------------------------] ---RP [HACPICHRAFHRLEHQTRHMRIH] TGEKP [HACDFPGCVKRFSRSDELTRHRRIH]
MIG2_YEAST -- [-------------------------] ---RP [FRCDTCHRGFHRLEHKKRHLRTH] TGEKP [HHCAFPGCGKSFSRSDELKRHMRTH]
[ ] :* [. * * * * * :* . *:* *] ***:* [. * * : *:**** .** : *]

21
Local alignment (cntd)
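Scoring matrices: PAM, BLOSUM, DNA matrix
[Figure: DP matrix M with sequence S along the rows (i-1, i) and sequence T along the columns (j-1, j); cell M_{i,j} is filled from its three neighbours.]

$$
M_{i,j} = \max
\begin{cases}
0 \\
M_{i-1,j-1} + \mathrm{Score}(S_i, T_j) \\
M_{i,j-1} + \gamma \\
M_{i-1,j} + \gamma
\end{cases}
\qquad (\gamma:\ \text{gap penalty})
$$

Smith & Waterman, 1981
Similarity scoring: the expected value should be negative for random alignments and positive for highly similar sequences.
22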
The Smith-Waterman Algorithm
1. Initialization
F(0,0) = F(0,j) = F(i,0) = 0

2. Iteration
   for i = 1,…,M
      for j = 1,…,N
         - calculate optimal F(i,j)
         - store Ptr(i,j)

3. Termination
• Find the end of the best alignment with F_OPT = max_{i,j} F(i,j) and trace back, OR
• Find all alignments with F(i,j) > threshold and trace back
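
The three steps can be sketched in Python as below. This is an illustrative version under stated assumptions: the name `smith_waterman`, the match/mismatch scores in place of a substitution matrix, and the linear gap score are not from the slides, and only the first termination option (single best alignment) is implemented.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best_score, aligned_s, aligned_t).

    Scoring values are illustrative assumptions.
    """
    m, n = len(s), len(t)
    # 1. Initialization: first row and column stay at 0.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)

    # 2. Iteration: fill F(i,j) and remember which move produced it.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [
                (0, None),                       # start a new alignment here
                (F[i - 1][j - 1] + sub, "diag"),
                (F[i][j - 1] + gap, "left"),
                (F[i - 1][j] + gap, "up"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # 3. Termination: trace back from the best cell until a 0 cell is reached.
    a, b = [], []
    i, j = best_pos
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif move == "left":
            a.append("-"); b.append(t[j - 1]); j -= 1
        else:
            a.append(s[i - 1]); b.append("-"); i -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))
```

Calling `smith_waterman("XXACGTXX", "YYACGTYY")` recovers the shared core `ACGT` and ignores the unrelated flanks, which a global alignment would be forced to include.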

23
Local vs. global alignment

24
Local vs. global alignment (cntd)

25
Local alignment (cntd)
Characteristics of local alignments:
• The alignment can start/end at any point in the matrix.
• No negative scores in the alignment: every matrix cell is at least 0.
• The mean value of the scoring matrix (e.g. PAM, BLOSUM) should be negative, but there should be positive scores in the scoring matrix.
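
One way to check the last condition is to compute the expected score of a random column. The sketch below reads "mean value" as the expectation under background residue frequencies, which is an interpretation; the function name `expected_score`, the uniform DNA frequencies and the +1/-1 scoring are assumptions for illustration.

```python
def expected_score(score, freqs):
    """Expected score of aligning two residues drawn independently from freqs."""
    return sum(freqs[a] * freqs[b] * score(a, b)
               for a in freqs for b in freqs)


def simple_dna_score(a, b):
    # Hypothetical scoring: +1 for identity, -1 otherwise.
    return 1 if a == b else -1


uniform_dna = {base: 0.25 for base in "ACGT"}
mean = expected_score(simple_dna_score, uniform_dna)
```

Here `mean` is negative (4 matching pairs at +1 against 12 mismatching pairs at -1, each with probability 1/16), while the matrix still contains positive entries, so both conditions hold. With the match = 1, mismatch = 0 scheme from the earlier slides the expectation would be positive, and local alignments would tend to grow to cover the whole sequences.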

26
Scoring the gaps more accurately
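• A naive model: the gap penalty γ(n) is linear in the gap length n
– But nature “prefers” to place gaps where other gaps exist
• Convex gap penalty function γ(n): each additional gap position costs no more than the previous one (with γ(n) read here as the size of the penalty)

$$
\gamma(n+1) - \gamma(n) \le \gamma(n) - \gamma(n-1)
$$

– Time O(N^2 M), space O(NM) (assume N > M)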

27
Scoring gaps: affine gaps
• Affine gaps: a compromise between linear and convex gap penalties

$$
\gamma(n) = -d - e\,(n-1)
$$

d: gap initiation penalty
e: gap extension penalty
[Figure: plot of γ(n) against gap length n, annotated with d and e.]
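
A short sketch of the affine formula, compared with a linear penalty. The values d = 10 and e = 1 are hypothetical, chosen only to make the difference visible; the function names are likewise illustrative.

```python
def affine_gap(n, d=10, e=1):
    """Score of one gap of length n >= 1: gamma(n) = -d - e*(n-1)."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return -d - e * (n - 1)


def linear_gap(n, g=10):
    """Linear alternative: every gap position costs the same."""
    return -g * n


one_long_gap = affine_gap(5)          # a single gap of length 5
five_short_gaps = 5 * affine_gap(1)   # five separate gaps of length 1
```

Under the affine scheme one gap of length 5 scores -14, while five isolated single-position gaps score -50; under the linear scheme both cases score -50. This is how the affine model expresses the preference for placing gaps next to existing gaps.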

28
Database searches

38
DNA and protein databases
• EMBL/GenBank/DDBJ database of nucleic acids

39
DNA and protein databases
• EMBL/GenBank/DDBJ database of nucleic acids (cntd)

40
DNA and protein databases
• SWISS-PROT & TrEMBL database of proteins

41
DNA and protein databases
• SWISS-PROT & TrEMBL database of proteins

42
Database searches
• Database searching consists of many pairwise alignments combined in one search.
• It helps determine function and evolutionary relationships.
• Heuristic algorithms are used instead of DP. Why?
• Size of SWISS-PROT + TrEMBL (Rel. 9.5): 3.9M entries or 1,276M residues.
• Exact algorithms take O(NM) time.
• Heuristic methods can look at a small fraction of the search space that will include all (or most) of the high-scoring pairs.
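
To make the O(NM) cost concrete, the following back-of-the-envelope sketch counts DP cells for one query against the database size quoted above. The query length of 300 residues and the rate of 10^9 cell updates per second are assumptions for illustration only.

```python
def dp_cells(query_len, db_residues):
    """Number of DP matrix cells filled when aligning a query to every entry."""
    return query_len * db_residues


db_residues = 1_276_000_000   # 1,276M residues, from the slide
query_len = 300               # hypothetical query length
cells = dp_cells(query_len, db_residues)

cells_per_second = 1e9        # assumed throughput, not a measured figure
seconds = cells / cells_per_second
```

Under these assumptions a single query needs about 3.8 x 10^11 cell updates, which is several minutes of pure DP per search and grows linearly with the database.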

43
BLAST algorithm
• Basic Local Alignment Search Tool. The method:
– For each “word” (of fixed length) in the query sequence, make a list of all neighbouring “words” that score above some threshold.
– Scan the database for these words.
– Perform (ungapped) “hit extension” until the score falls below a threshold.
– Stop at the maximum-scoring extension.

44
BLAST algorithm (cntd)
• An example:

Query: CPICHRAFHRLEHQTRHMRIHTGEKPHAC

Word HMR and its neighbouring words, scored against HMR with BLOSUM62:
HMR   18
HHR   -2+13
HIR   +1+13
HAR   -1+13
…     …
Selection (words scoring above the threshold): HMR, HIR, …
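
The word-neighbourhood step for this example can be sketched as follows. Only the handful of BLOSUM62 entries needed for the words shown are included, and the threshold of 13 is an assumption chosen so that the result matches the selection on the slide; the function names are illustrative.

```python
# Subset of BLOSUM62, limited to the pairs used in this example.
BLOSUM62_SUBSET = {
    ("H", "H"): 8, ("M", "M"): 5, ("R", "R"): 5,
    ("M", "H"): -2, ("M", "I"): 1, ("M", "A"): -1,
}


def pair_score(a, b):
    """Symmetric lookup; raises KeyError for pairs outside the subset."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(word, candidate):
    """Sum of per-position substitution scores for two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(word, candidate))


def neighbours(word, candidates, threshold):
    """Candidates whose score against `word` reaches the threshold."""
    return [c for c in candidates if word_score(word, c) >= threshold]


selected = neighbours("HMR", ["HMR", "HHR", "HIR", "HAR"], threshold=13)
```

With this threshold `selected` contains HMR (18) and HIR (14), while HAR (12) and HHR (11) are dropped. A real implementation would enumerate all 20^3 candidate words against the full matrix rather than a hand-picked list.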

45
BLAST algorithm (cntd)
• An example:

Query: CPICHRAFHRLEHQTRHMRIHTGEKPHAC
                       H+R
Sbjct: CPLCDKAFHRLEHQTRHIRTHTGEKPHAC

46
BLAST algorithm (cntd)
• An example:

Query: CPICHRAFHRLEHQTRHMRIHTGEKPHAC
       CP+C +AFHRLEHQTRH+R HTGEKPHAC
Sbjct: CPLCDKAFHRLEHQTRHIRTHTGEKPHAC
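
Ungapped hit extension from the HMR/HIR seed can be sketched as below. This is a simplified illustration, not BLAST's actual code: it scores identities as +1 and everything else as -1 instead of using BLOSUM62, and the stopping rule (give up once the running score falls more than `drop` below the best seen) and its value are assumptions.

```python
QUERY = "CPICHRAFHRLEHQTRHMRIHTGEKPHAC"
SUBJECT = "CPLCDKAFHRLEHQTRHIRTHTGEKPHAC"


def extend_hit(query, subject, q_pos, s_pos, word_len,
               match=1, mismatch=-1, drop=3):
    """Extend a word hit in both directions without gaps.

    Returns (score, q_start, q_end) with q_end exclusive.
    """
    def pair(a, b):
        return match if a == b else mismatch

    def extend(qi, si, step):
        running = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += pair(query[qi], subject[si])
            length += 1
            if running > best:
                best, best_len = running, length   # new maximum-scoring extension
            elif best - running > drop:
                break                              # score dropped too far: stop
            qi += step
            si += step
        return best, best_len

    seed = sum(pair(query[q_pos + k], subject[s_pos + k])
               for k in range(word_len))
    right, right_len = extend(q_pos + word_len, s_pos + word_len, +1)
    left, left_len = extend(q_pos - 1, s_pos - 1, -1)
    return seed + left + right, q_pos - left_len, q_pos + word_len + right_len


# The seed word starts at index 16 (0-based) in both sequences.
score, start, end = extend_hit(QUERY, SUBJECT, 16, 16, 3)
```

With this toy scoring the extension covers `QUERY[6:29]` and stops before the `HR`/`DK` mismatches on the left. Under BLOSUM62 conservative substitutions such as R/K score positively (the `+` marks in the alignment above), which is why the extension on the slide runs across the whole segment.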

47
BLAST algorithm (cntd)
• The idea: a high-scoring alignment is very likely to contain a short stretch of very high-scoring matches.
• Word length: 3 (proteins) and 11 (DNA).
• HSP (high-scoring segment pair): multiple HSPs can be reported for each database entry.
• Gapped alignments: more recent BLAST versions perform gapped alignments.

48
BLAST flavours
Program   Query              Database
BLASTN    DNA                DNA
BLASTP    Protein            Protein
BLASTX    DNA (translated)   Protein
TBLASTN   Protein            DNA (translated)
TBLASTX   DNA (translated)   DNA (translated)

49
FASTA algorithm
• The method:
– For each pair of sequences (query, subject), identify all identical “word” matches of (fixed) length.
– Look for diagonals with many mutually supporting “word” matches.
– The best diagonals are used to extend the word matches to find the maximal scoring (ungapped) regions.
– Join ungapped regions, using gap costs.
– Align the two (sub)regions using full dynamic programming techniques.
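
The first two steps (identical word matches and their diagonals) can be sketched as follows. The function name `diagonal_hits` is an assumption, and the later steps (rescoring, joining regions, banded dynamic programming) are deliberately left out.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, k=2):
    """Count identical k-word matches per diagonal (query index - subject index)."""
    # Index every k-word of the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Scan the subject; each shared word votes for the diagonal it lies on.
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals


hits = diagonal_hits("CPICHRAFHRLEHQTRHMRIHTGEKPHAC",
                     "CPLCDKAFHRLEHQTRHIRTHTGEKPHAC", k=2)
best_diagonal, support = hits.most_common(1)[0]
```

For the query/subject pair used in the BLAST example, diagonal 0 collects far more word matches than any other, so it would be the first candidate for extension. Unlike BLAST, only exact word identities count at this stage.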

50
FASTA algorithm (cntd)

51
FASTA algorithm (cntd)
• The idea: a high-scoring alignment is very likely to contain a short stretch of identities.
• Word length: 2 (proteins) and 4-6 (DNA).
• HSP: usually one (extended) gapped alignment is presented.

52
FASTA flavours
Program   Query              Database
FASTA3    DNA                DNA
FASTA3    Protein            Protein
FASTX3    DNA (translated)   Protein
TFASTA3   Protein            DNA (translated)
