Sequence Alignment (Chapter 6) : The Biological Problem

The document discusses sequence alignment and related concepts. It introduces global sequence alignment, which finds the optimal alignment between two sequences by assigning scores to matches and mismatches. The optimal alignment is computed using dynamic programming to evaluate all possible alignments and choose the one with the highest score. Local sequence alignment is also mentioned as another type of sequence alignment.

Uploaded by

Jahir Hasan

Sequence Alignment (chapter 6)

• The biological problem
• Global alignment
• Local alignment
• Multiple alignment

Introduction to bioinformatics, Autumn 2006 22


Background: comparative genomics
• Basic question in biology: what properties are shared among organisms?
• Genome sequencing allows comparison of organisms at DNA and protein levels
• Comparisons can be used to
  − Find evolutionary relationships between organisms
  − Identify functionally conserved sequences
  − Identify corresponding genes in human and model organisms: develop models for human diseases


Homologs
• Two genes or characters gB and gC that evolved from the same ancestor gA are called homologs
• Homologs usually exhibit conserved functions
• Close evolutionary relationship => expect a high number of homologs

gA = agtgtccgttaagtgcgttc
gB = agtgccgttaaagttgtacgtc
gC = ctgactgtttgtggttc


Sequence similarity
• Intuitively, similarity of two sequences refers to the degree of match between corresponding positions in the sequences

  agtgccgttaaagttgtacgtc
  ctgactgtttgtggttc

• What about sequences that differ in length?


Similarity vs homology
• Sequence similarity is not sequence homology
  − If the two sequences gB and gC have accumulated enough mutations, the similarity between them is likely to be low

#mutations  sequence                 #mutations  sequence
0           agtgtccgttaagtgcgttc     64          acagtccgttcgggctattg
1           agtgtccgttatagtgcgttc    128         cagagcactaccgc
2           agtgtccgcttatagtgcgttc   256         cacgagtaagatatagct
4           agtgtccgcttaagggcgttc    512         taatcgtgata
8           agtgtccgcttcaaggggcgt    1024        acccttatctacttcctggagtt
16          gggccgttcatgggggt        2048        agcgacctgcccaa
32          gcagggcgtcactgagggct     4096        caaac

Homology is more difficult to detect over greater evolutionary distances.

Similarity vs homology (2)
• Sequence similarity can occur by chance
  − Similarity does not imply homology
• Similarity is an expected consequence of homology


Orthologs and paralogs
• We distinguish between two types of homology
  − Orthologs: homologs from two different species
  − Paralogs: homologs within a species

[Figure: gene tree. Labels: Organism A (gA); "Gene A is copied within organism A" (gA, gA’); Organism B (gB); Organism C (gC)]


Orthologs and paralogs (2)
• Orthologs typically retain the original function
• In paralogs, one copy is free to mutate and acquire new function (no selective pressure)

[Figure: gene tree. Labels: Organism A (gA); "Gene A is copied within organism A" (gA, gA’); Organism B (gB); Organism C (gC)]


Sequence alignment
• Alignment specifies which positions in two sequences match

acgtctag        acgtctag        acgtctag
||                 |||||        || |||||
actctag-        -actctag        ac-tctag

2 matches       5 matches       7 matches
5 mismatches    2 mismatches    0 mismatches
1 not aligned   1 not aligned   1 not aligned
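The counts listed for the three alignments can be checked mechanically. The following is a minimal Python sketch; the function name alignment_stats and the use of "-" as the gap symbol are choices made for this illustration, not part of the course material.

```python
def alignment_stats(top: str, bottom: str, gap: str = "-") -> dict:
    """Count matches, mismatches and unaligned (gap) columns of an alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    stats = {"matches": 0, "mismatches": 0, "not_aligned": 0}
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            stats["not_aligned"] += 1   # one position has no partner
        elif x == y:
            stats["matches"] += 1
        else:
            stats["mismatches"] += 1
    return stats


# The three candidate alignments of acgtctag and actctag from the slide.
for bottom in ("actctag-", "-actctag", "ac-tctag"):
    print(bottom, alignment_stats("acgtctag", bottom))
```

The same two sequences give very different counts depending on where the gap is placed, which is why the choice of alignment matters.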


Mutations: Insertions, deletions and substitutions

acgtctag
   |||||
-actctag

• Indel: insertion or deletion of a base with respect to the ancestor sequence
• Mismatch: substitution (point mutation) of a single base
• Insertions and/or deletions are called indels
  − We can’t tell whether the ancestor sequence had a base or not at indel position


Problems
• What sorts of alignments should be considered?
• How to score alignments?
• How to find optimal or good scoring alignments?
• How to evaluate the statistical significance of scores?

In this course, we discuss the first three problems.

The course Biological sequence analysis tackles all four in depth.


Global alignment
• Problem: find optimal scoring alignment between two sequences (Needleman & Wunsch 1970)
• We give a score for each position in the alignment
  − Identity (match): +1
  − Substitution (mismatch): −µ
  − Indel: −δ

WHAT
||
WH-Y

S(WHAT/WH-Y) = 1 + 1 − δ − µ
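The position-by-position scoring can be written out as a short Python sketch. The function name alignment_score and the parameter names mu (mismatch penalty) and delta (indel penalty) are assumptions made for this illustration; the values used in the call are example values only.

```python
def alignment_score(top: str, bottom: str, mu: float, delta: float,
                    gap: str = "-") -> float:
    """Sum per-column scores: +1 for a match, -mu for a mismatch, -delta for an indel."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            score -= delta          # indel column
        elif x == y:
            score += 1              # identity
        else:
            score -= mu             # substitution
    return score


# WHAT / WH-Y has two matches, one indel and one mismatch: 1 + 1 - delta - mu.
print(alignment_score("WHAT", "WH-Y", mu=1, delta=2))
```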


Representing alignments and scores

Alignment:
WHAT
||
WH-Y

Path through the alignment matrix (X marks the cells on the path):
      -   W   H   A   T
  -
  W       X
  H           X   X
  Y                   X


Representing alignments and scores

Alignment:
WHAT
||
WH-Y

Cumulative scores along the path:
      -   W   H   A     T
  -   0
  W       1
  H           2   2−δ
  Y                     2−δ−µ

Global alignment score S_{3,4} = 2 − δ − µ


Dynamic programming
• How to find the optimal alignment?
• We use previous solutions for optimal alignments of smaller subsequences
• This general approach is known as dynamic programming


Filling the alignment matrix
Consider the alignment process at the shaded square (row H, column H).
Case 1. Align H against H (match or substitution).
Case 2. Align H in WHY against – (indel) in WHAT.
Case 3. Align H in WHAT against – (indel) in WHY.

[Figure: alignment matrix with columns -, W, H, A, T and rows -, W, H, Y; arrows labelled Case 1, Case 2 and Case 3 lead into the shaded square]


Filling the alignment matrix (2)
Scoring the alternatives.
Case 1. S_{2,2} = S_{1,1} + s(2, 2)
Case 2. S_{2,2} = S_{1,2} − δ
Case 3. S_{2,2} = S_{2,1} − δ
s(i, j) = 1 for matching positions,
s(i, j) = −µ for substitutions.
Choose the case (path) that yields the maximum score.
Keep track of path choices.

[Figure: alignment matrix with columns -, W, H, A, T and rows -, W, H, Y; arrows labelled Case 1, Case 2 and Case 3 lead into the shaded square]


Global alignment: formal development
A = a_1a_2a_3…a_n, B = b_1b_2b_3…b_m

Example alignment:
b_1 b_2 b_3 b_4 -
-   a_1 -   a_2 a_3

• Any alignment can be written as a unique path through the matrix
• Score for aligning A and B up to positions i and j:
  S_{i,j} = S(a_1a_2a_3…a_i, b_1b_2b_3…b_j)

[Figure: alignment matrix with columns 0 to 4 labelled -, b_1, b_2, b_3, b_4 and rows 0 to 3 labelled -, a_1, a_2, a_3]

Scoring partial alignments
• Alignment of A = a_1a_2a_3…a_n with B = b_1b_2b_3…b_m can end in three ways
  − Case 1: (a_1a_2…a_{i-1}) a_i
            (b_1b_2…b_{j-1}) b_j
  − Case 2: (a_1a_2…a_{i-1}) a_i
            (b_1b_2…b_j)     –
  − Case 3: (a_1a_2…a_i)     –
            (b_1b_2…b_{j-1}) b_j


Scoring alignments
• Scores for each case:
  − Case 1: (a_1a_2…a_{i-1}) a_i
            (b_1b_2…b_{j-1}) b_j
    s(a_i, b_j) = +1 if a_i = b_j, −µ otherwise
  − Case 2: (a_1a_2…a_{i-1}) a_i
            (b_1b_2…b_j)     –
  − Case 3: (a_1a_2…a_i)     –
            (b_1b_2…b_{j-1}) b_j
    For cases 2 and 3: s(a_i, –) = s(–, b_j) = −δ


Scoring alignments (2)
• First row and first column correspond to initial alignment against indels:
  S_{i,0} = −iδ
  S_{0,j} = −jδ
• Optimal global alignment score S(A, B) = S_{n,m}

          0     1     2     3     4
          -     b_1   b_2   b_3   b_4
0   -     0     −δ    −2δ   −3δ   −4δ
1   a_1   −δ
2   a_2   −2δ
3   a_3   −3δ


Algorithm for global alignment
Input sequences A, B, n = |A|, m = |B|
Set S_{i,0} := −δi for all i
Set S_{0,j} := −δj for all j
for i := 1 to n
    for j := 1 to m
        S_{i,j} := max{ S_{i-1,j} − δ, S_{i-1,j-1} + s(a_i, b_j), S_{i,j-1} − δ }
    end
end

Algorithm takes O(nm) time and space.
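The pseudocode above can be turned into a runnable sketch. The Python below fills the matrix with the same recursion and adds a traceback to recover one optimal alignment; the function name global_alignment, the parameter names mu and delta, and the tie-breaking order in the traceback (diagonal first, then up, then left) are assumptions made for this illustration. Integer penalties are assumed so that the equality tests in the traceback are exact.

```python
def global_alignment(a: str, b: str, mu: int = 1, delta: int = 2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # S[i][j] = best score for aligning a[:i] with b[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -delta * i
    for j in range(1, m + 1):
        S[0][j] = -delta * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else -mu
            S[i][j] = max(S[i - 1][j] - delta,      # a_i against an indel
                          S[i - 1][j - 1] + s,      # a_i against b_j
                          S[i][j - 1] - delta)      # b_j against an indel

    # Traceback from the bottom-right corner to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = 1 if a[i - 1] == b[j - 1] else -mu
            if S[i][j] == S[i - 1][j - 1] + s:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] - delta:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = global_alignment("ATCGT", "TGGTG", mu=1, delta=2)
print(score)
print(row_a)
print(row_b)
```

When several alignments share the optimal score, a different tie-breaking order would return a different but equally good alignment.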


Global alignment: example
µ = 1, δ = 2

       -    T    G    G    T    G
  -    0   -2   -4   -6   -8  -10
  A   -2
  T   -4
  C   -6
  G   -8
  T  -10                        ?


Global alignment: example (2)
µ = 1, δ = 2

       -    T    G    G    T    G
  -    0   -2   -4   -6   -8  -10
  A   -2   -1   -3   -5   -7   -9
  T   -4   -1   -2   -4   -4   -6
  C   -6   -3   -2   -3   -5   -5
  G   -8   -5   -2   -1   -3   -4
  T  -10   -7   -4   -3    0   -2

Optimal alignment:
ATCGT-
 | ||
-TGGTG




Local alignment: rationale
• Otherwise dissimilar proteins may have local regions of similarity
  -> Proteins may share a function

[Figure: Human bone morphogenic protein receptor type II precursor (left) has a 300 aa region that resembles a 291 aa region in TGF-β receptor (right). The shared function here is protein kinase.]


Local alignment: rationale
[Figure: sequences A and B with their regions of similarity marked]

• Global alignment would be inadequate
• Problem: find the highest scoring local alignment between two sequences
• Previous algorithm with minor modifications solves this problem (Smith & Waterman 1981)

From global to local alignment
• Modifications to the global alignment algorithm
  − Look for the highest-scoring path in the alignment matrix (not necessarily through the matrix)
  − Allow preceding and trailing indels without penalty


Scoring local alignments
A = a_1a_2a_3…a_n, B = b_1b_2b_3…b_m

Let I and J be intervals (substrings) of A and B, respectively: I ⊆ A, J ⊆ B

Best local alignment score:
  max { S(I, J) : I ⊆ A, J ⊆ B }
where S(I, J) is the score for substrings I and J.


Allowing preceding and trailing indels
• First row and column initialised to zero: M_{i,0} = M_{0,j} = 0

          0     1     2     3     4
          -     b_1   b_2   b_3   b_4
0   -     0     0     0     0     0
1   a_1   0
2   a_2   0
3   a_3   0

Example alignment with preceding indels:
b_1 b_2 b_3
-   -   a_1


Recursion for local alignment
• M_{i,j} = max {
      M_{i-1,j-1} + s(a_i, b_j),
      M_{i-1,j} − δ,
      M_{i,j-1} − δ,
      0
  }

       -   T   G   G   T   G
  -    0   0   0   0   0   0
  A    0   0   0   0   0   0
  T    0   1   0   0   1   0
  C    0   0   0   0   0   0
  G    0   0   1   1   0   1
  T    0   1   0   0   2   0


Finding best local alignment
• Optimal score is the highest value in the matrix: max_{i,j} M_{i,j}
• Best local alignment can be found by backtracking from the highest value in M

       -   T   G   G   T   G
  -    0   0   0   0   0   0
  A    0   0   0   0   0   0
  T    0   1   0   0   1   0
  C    0   0   0   0   0   0
  G    0   0   1   1   0   1
  T    0   1   0   0   2   0
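The local alignment recursion and the backtracking step can be sketched in Python as follows. The function name local_alignment and the parameter names match, mu and delta are assumptions made for this illustration; a separate match score is included because later examples may use a match score other than +1. Integer scores are assumed so that the equality tests in the traceback are exact, and when several cells share the highest value the first one encountered is used.

```python
def local_alignment(a: str, b: str, match: int = 1, mu: int = 1, delta: int = 2):
    """Return (score, aligned_a, aligned_b) for a best local alignment."""
    n, m = len(a), len(b)
    # M[i][j] = best score of a local alignment ending at a_i, b_j (never below 0).
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else -mu
            M[i][j] = max(M[i - 1][j - 1] + s,
                          M[i - 1][j] - delta,
                          M[i][j - 1] - delta,
                          0)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Backtrack from the highest value until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else -mu
        if M[i][j] == M[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] - delta:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_alignment("ATCGT", "TGGTG"))
print(local_alignment("ACCTAAGG", "GGCTCAATCA", match=2, mu=1, delta=2))
```

The only differences from the global algorithm are the zero in the maximum, the zero-initialised borders, and starting the traceback at the highest cell rather than the bottom-right corner.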


Local alignment: example
        0   1   2   3   4   5   6   7   8   9  10
        -   G   G   C   T   C   A   A   T   C   A
0   -   0   0   0   0   0   0   0   0   0   0   0
1   A   0
2   C   0
3   C   0
4   T   0
5   A   0
6   A   0
7   G   0
8   G   0


Local alignment: example
Scoring
Match: +2
Mismatch: -1
Indel: -2

        0   1   2   3   4   5   6   7   8   9  10
        -   G   G   C   T   C   A   A   T   C   A
0   -   0   0   0   0   0   0   0   0   0   0   0
1   A   0   0   0   0   0   0   2   2   0   0   2
2   C   0   0   0   2   0   2   0   1   1   2   0
3   C   0   0   0   2   1   2   1   0   0   3   1
4   T   0   0   0   0   4   2   1   0   2   1   2
5   A   0   0   0   0   2   3   4   3   1   1   3
6   A   0   0   0   0   0   1   5   6   4   2   3
7   G   0   2   2   0   0   0   3   4   5   3   1
8   G   0   2   4   2   0   0   1   2   3   4   2

Highest-scoring local alignment (score 6):
C T – A A
C T C A A


Non-uniform mismatch penalties
• We used uniform penalty for mismatches:
  s(’A’, ’C’) = s(’A’, ’G’) = … = s(’G’, ’T’) = −µ
• Transition mutations (A->G, G->A, C->T, T->C) are approximately twice as frequent as transversions (e.g. A->T, T->A, A->C, G->T)
  − use non-uniform mismatch penalties

       A      C      G      T
A      1     -1     -0.5   -1
C     -1      1     -1     -0.5
G     -0.5   -1      1     -1
T     -1     -0.5   -1      1
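A non-uniform scoring function of this kind can be sketched in Python. The function name substitution_score and its keyword parameters are assumptions made for this illustration; the default values are the ones in the table above.

```python
# Transitions are purine<->purine (A, G) or pyrimidine<->pyrimidine (C, T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def substitution_score(x: str, y: str, match: float = 1.0,
                       transition: float = -0.5,
                       transversion: float = -1.0) -> float:
    """Score a pair of bases, penalising transitions less than transversions."""
    if x == y:
        return match
    if (x, y) in TRANSITIONS:
        return transition
    return transversion


# Rebuild the 4x4 table row by row.
bases = "ACGT"
for x in bases:
    print(x, [substitution_score(x, y) for y in bases])
```

Such a function could replace the uniform s(a_i, b_j) inside the alignment recursion without any other change to the algorithm.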

Gaps in alignment
• Gap is a succession of indels in alignment
  C T – – – A A
  C T C G C A A
• Previous model scored a length k gap as w(k) = −δk
• Replication processes may produce longer stretches of insertions or deletions
  − In coding regions, insertions or deletions of codons may preserve functionality


Gap open and extension penalties (2)
• We can design a score that allows the penalty for opening a gap to be larger than the penalty for extending it:
  w(k) = −α − β(k − 1)
• Gap open cost α, gap extension cost β
• Our previous algorithm can be extended to use w(k) (not discussed on this course)
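The difference between the two gap models can be illustrated with a small Python sketch. The function names linear_gap and affine_gap, and the parameter names delta (per-indel penalty), alpha (gap open cost) and beta (gap extension cost), are assumptions made for this illustration, as are the numeric values in the loop.

```python
def linear_gap(k: int, delta: float) -> float:
    """Every indel in the gap costs the same: w(k) = -delta * k."""
    return -delta * k


def affine_gap(k: int, alpha: float, beta: float) -> float:
    """Opening costs alpha, each further indel costs beta: w(k) = -alpha - beta * (k - 1)."""
    if k < 1:
        raise ValueError("a gap has length at least 1")
    return -alpha - beta * (k - 1)


# With alpha > beta, one long gap is cheaper than several short ones.
for k in (1, 2, 3, 6):
    print(k, linear_gap(k, delta=2), affine_gap(k, alpha=4, beta=1))
```

With these example values a single gap of length 6 scores -9, while two separate gaps of length 3 score -6 each, -12 in total, so the affine model favours fewer and longer gaps.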




Multiple alignment
• Consider a set of n sequences (listed below)
  – Orthologous sequences from different organisms
  – Paralogs from multiple duplications
• How can we study relationships between these sequences?

aggcgagctgcgagtgcta
cgttagattgacgctgac
ttccggctgcgac
gacacggcgaacgga
agtgtgcccgacgagcgaggac
gcgggctgtgagcgcta
aagcggcctgtgtgcccta
atgctgctgccagtgta
agtcgagccccgagtgc
agtccgagtcc
actcggtgc


Optimal alignment of three sequences
• Alignment of A = a_1a_2…a_i and B = b_1b_2…b_j can end either in (–, b_j), (a_i, b_j) or (a_i, –)
• 2^2 − 1 = 3 alternatives
• Alignment of A, B and C = c_1c_2…c_k can end in 2^3 − 1 = 7 ways: (a_i, –, –), (–, b_j, –), (–, –, c_k), (–, b_j, c_k), (a_i, –, c_k), (a_i, b_j, –) or (a_i, b_j, c_k)
• Solve the recursion using three-dimensional dynamic programming matrix: O(n^3) time and space
• Generalizes to n sequences but is impractical even with a moderate number of sequences
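The count of possible last columns can be checked by enumeration. In the Python sketch below, each column is a tuple of booleans (True for a residue, False for an indel), and the all-indel column is excluded; the function name last_columns is an assumption made for this illustration.

```python
from itertools import product


def last_columns(num_sequences: int) -> list:
    """All ways an alignment of num_sequences sequences can end.

    Each tuple says, per sequence, whether the last column holds a residue (True)
    or an indel (False). A column made only of indels is not allowed.
    """
    return [col for col in product((True, False), repeat=num_sequences) if any(col)]


for k in (2, 3, 10):
    print(k, len(last_columns(k)))   # 2**k - 1 alternatives per matrix cell
```

Each cell of the dynamic programming matrix must consider all of these alternatives, and the matrix itself has one dimension per sequence, which is why exact multiple alignment quickly becomes impractical.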


Multiple alignment in practice
• In practice, real-world multiple alignment problems are usually solved with heuristics
• Progressive multiple alignment
  − Choose two sequences and align them
  − Choose a third sequence with respect to the two previous sequences and align the third against them
  − Repeat until all sequences have been aligned
  − There are different options for how to choose sequences and score alignments


Multiple alignment in practice
• Profile-based progressive multiple alignment: CLUSTALW
  − Construct a distance matrix of all pairs of sequences using dynamic programming
  − Progressively align pairs in order of decreasing similarity
  − CLUSTALW uses various heuristics to improve accuracy
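The first two steps (scoring all pairs with dynamic programming, then ordering the pairs by decreasing similarity) can be sketched in Python. This is not CLUSTALW's actual procedure: it uses plain global alignment scores with example penalties instead of CLUSTALW's distance measure, and the function names global_score and pairs_by_similarity are assumptions made for this illustration.

```python
from itertools import combinations


def global_score(a: str, b: str, mu: int = 1, delta: int = 2) -> int:
    """Global alignment score (match +1, mismatch -mu, indel -delta), keeping two rows only."""
    prev = [-delta * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-delta * i]
        for j in range(1, len(b) + 1):
            s = 1 if a[i - 1] == b[j - 1] else -mu
            cur.append(max(prev[j] - delta, prev[j - 1] + s, cur[j - 1] - delta))
        prev = cur
    return prev[-1]


def pairs_by_similarity(seqs: list) -> list:
    """Return (score, i, j) for every pair of sequences, most similar pair first."""
    scored = [(global_score(seqs[i], seqs[j]), i, j)
              for i, j in combinations(range(len(seqs)), 2)]
    return sorted(scored, key=lambda t: -t[0])


for score, i, j in pairs_by_similarity(["ACGT", "ACGA", "TTTT"]):
    print(i, j, score)
```

A progressive aligner would start from the pair at the top of this list and add the remaining sequences one at a time.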


Additional material
• R. Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis
• Course Biological sequence analysis in Spring 2007
