Introduction-To-Computational Biology

This document provides an overview of sequence alignment techniques in bioinformatics. It discusses how the eyeless gene in Drosophila melanogaster controls eye development and other genes through homeobox domains. It then covers the importance of sequence alignment for predicting gene function and evolution. Both global alignment techniques like Needleman-Wunsch and local alignment approaches such as Smith-Waterman are introduced. Key concepts around homology, orthology, and paralogy are also summarized.

Uploaded by

RONAK LASHKARI

Introduction to Bioinformatics

LECTURE 3: SEQUENCE ALIGNMENT

* Chapter 3: All in the family


3.1 Eye of the tiger

* In 1994 Walter Gehring et al. (University of Basel) turned the gene "eyeless" on in various places on Drosophila melanogaster

* Result: eyes formed in multiple places on the fly

* "eyeless" is a master regulatory gene that controls approximately 2000 other genes

* Switching "eyeless" on induces the formation of an eye


Figure: Eyeless Drosophila
Figure: Mutant Drosophila melanogaster with the gene "eyeless" turned on
Figure: Homeoboxes and master regulatory genes

HOMEOBOX

A homeobox is a DNA sequence found within genes that are involved in the regulation of development (morphogenesis) of animals, fungi and plants.

Figure: Drosophila melanogaster: HOX homeoboxes
Figure: Drosophila melanogaster: PAX homeoboxes
Figure: Homeoboxes and master regulatory genes

3.2 On sequence alignment

Sequence alignment is the most important task in bioinformatics!

Sequence alignment is important for:

* prediction of function
* database searching
* gene finding
* sequence divergence
* sequence assembly

3.3 On sequence similarity

Homology: genes that derive from a common ancestor gene are called homologs

Orthologous genes are homologous genes in different organisms that diverged through speciation

Paralogous genes are homologous genes in one organism that derive from gene duplication

Gene duplication: one gene is duplicated into multiple copies, which are therefore free to evolve and assume new functions
Figure: HOMOLOGOUS and PARALOGOUS
Figure: HOMOLOGOUS and PARALOGOUS versus ANALOGOUS

Causes for sequence (dis)similarity

substitution (point mutation): a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)

insertion: at a certain location one new nucleotide is inserted in between two existing nucleotides (e.g.: AA → AGA)

deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)

indel: an insertion or a deletion
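
The three edit operations can be illustrated on plain strings. This is a minimal sketch; the helper names `substitute`, `insert` and `delete` are hypothetical and chosen only for this illustration. Positions are 0-based.

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the nucleotide at index pos with base."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq: str, pos: int, base: str) -> str:
    """Insert base so that it ends up at index pos."""
    return seq[:pos] + base + seq[pos:]


def delete(seq: str, pos: int) -> str:
    """Remove the nucleotide at index pos."""
    return seq[:pos] + seq[pos + 1:]


print(substitute("ATA", 1, "G"))  # AGA
print(insert("AA", 1, "G"))       # AGA
print(delete("ACTG", 2))          # ACG (shown as AC-G in an alignment)
```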



3.4 Sequence alignment: global and local

Find the similarity between two (or more) DNA sequences by finding a good alignment between them.

The biological problem of sequence alignment

Alignment of DNA-sequence-1 (top) and DNA-sequence-2 (bottom):

tcctctgcctctgccatcat---caaccccaaagt
|||| ||| ||||| |||||   ||||||||||||
tcctgtgcatctgcaatcatgggcaaccccaaagt

Sequence alignment - definition

Sequence alignment is an arrangement of two or more sequences, highlighting their similarity.

The sequences are padded with gaps (dashes) so that, wherever possible, columns contain identical characters from the sequences involved.

tcctctgcctctgccatcat---caaccccaaagt
|||| ||| ||||| |||||   ||||||||||||
tcctgtgcatctgcaatcatgggcaaccccaaagt

Algorithms

Needleman-Wunsch
Pairwise global alignment only.

Smith-Waterman
Pairwise, local (or global) alignment.

BLAST
Pairwise heuristic local alignment

Pairwise alignment

Pairwise sequence alignment methods are concerned with finding the best-matching piecewise local or global alignments of protein (amino acid) or DNA (nucleic acid) sequences.

Typically, the purpose of this is to find homologues (relatives) of a gene or gene product in a database of known examples.

This information is useful for answering a variety of biological questions:

1. The identification of sequences of unknown structure or function.

2. The study of molecular evolution.

Global alignment

A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment.

Global alignments are useful mostly for finding closely related sequences.

As these sequences are also easily identified by local alignment methods, global alignment is now somewhat deprecated as a technique.

Further, there are several complications to molecular evolution (such as domain shuffling) which limit the usefulness of these methods.

Global Alignment

Find the global best fit between two sequences

Example: the sequences s = VIVALASVEGAS and t = VIVADAVIS align like:

         V I V A L A S V E G A S
A(s,t) = | | | |   |   |       |
         V I V A D A - V - - I S

The dashes mark indels.

The Needleman-Wunsch algorithm

The Needleman-Wunsch algorithm (1970, J Mol Biol. 48(3):443-53) performs a global alignment on two sequences (s and t) and is applied to align protein or nucleotide sequences.

The Needleman-Wunsch algorithm is an example of dynamic programming, and is guaranteed to find the alignment with the maximum score.

It works for DNA sequences as well as for protein sequences.

Alignment scoring function

The cost of aligning two symbols x_i and y_j is given by the scoring function σ(x_i, y_j)

Alignment cost

The cost of the entire alignment, summed over its c columns (x_i and y_i are the two symbols in column i):

$$M = \sum_{i=1}^{c} \sigma(x_i, y_i)$$

A simple scoring function

σ(-,a) = σ(a,-) = -1

σ(a,b) = -1 if a ≠ b

σ(a,b) = 1 if a = b


The substitution matrix

A more realistic scoring function is given by a biologically inspired substitution matrix:

     A    G    C    T
A   10   -1   -3   -4
G   -1    7   -5   -3
C   -3   -5    9    0
T   -4   -3    0    8

Examples of such matrices for amino acids:

* PAM (Point Accepted Mutation) (Margaret Dayhoff)
* BLOSUM (BLOck SUbstitution Matrix) (Henikoff and Henikoff)

Scoring function

The cost for aligning the two sequences s = VIVALASVEGAS and t = VIVADAVIS :

         V I V A L A S V E G A S
A(s,t) = | | | |   |   |       |
         V I V A D A - V - - I S

is, under the simple scoring function (+1 for a match, -1 for a mismatch, -1 for a gap):

M(A) = 7 matches + 2 mismatches + 3 gaps = 7 - 2 - 3 = 2
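
The arithmetic can be checked with a few lines of Python. This is a minimal sketch under the simple scoring function (+1 match, -1 mismatch, -1 gap); the names `sigma` and `alignment_score` are hypothetical helpers introduced for this illustration.

```python
GAP = "-"


def sigma(a: str, b: str) -> int:
    """Simple scoring function: -1 for a gap or mismatch, +1 for a match."""
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else -1


def alignment_score(row_s: str, row_t: str) -> int:
    """Sum sigma over the columns of an alignment given as two padded rows."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(sigma(a, b) for a, b in zip(row_s, row_t))


# The two rows of A(s,t) from the slide, written without spaces.
print(alignment_score("VIVALASVEGAS", "VIVADA-V--IS"))  # 2
```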

Optimal global alignment

The optimal global alignment A* between two sequences s and t is the alignment A(s,t) that maximizes the total alignment score M(A) over all possible alignments.

A* = argmax M(A)

Finding the optimal alignment A* looks like a combinatorial optimization problem:
i. generate all possible alignments
ii. compute the score M of each alignment
iii. select the alignment A* with the maximum score M*
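
To see why step i is impractical, one can count the candidate alignments. The sketch below assumes that an alignment is a sequence of columns, each being a pair of symbols, a gap in s, or a gap in t, and that different column orders count as different alignments; other counting conventions give different numbers. The function name `count_alignments` is hypothetical.

```python
def count_alignments(m: int, n: int) -> int:
    """Count alignments of sequences of lengths m and n (column-by-column view).

    The last column is either a symbol pair, a symbol of s against a gap,
    or a gap against a symbol of t, which gives the three-term recurrence.
    """
    counts = [[1] * (n + 1) for _ in range(m + 1)]  # one way when either is empty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (counts[i - 1][j - 1]
                            + counts[i - 1][j]
                            + counts[i][j - 1])
    return counts[m][n]


for length in (1, 2, 3, 10, 20):
    print(length, count_alignments(length, length))
```

The count grows exponentially with sequence length, which is why exhaustive enumeration is replaced by dynamic programming.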

Local alignment

Local alignment methods find related regions within sequences; these regions can consist of a subset of the characters within each sequence.

For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B.

This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins (which is known as domain shuffling) can be identified as being related.

This is not possible with global alignment methods.

The Smith-Waterman algorithm

The Smith-Waterman algorithm (1981) determines similar regions between two nucleotide or protein sequences.

Smith-Waterman is also a dynamic programming algorithm and is a variation of Needleman-Wunsch. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme).

However, the Smith-Waterman algorithm is demanding of time and memory resources: in order to align two sequences of lengths m and n, O(mn) time and space are required.

As a result, it has largely been replaced in practical use by the BLAST algorithm; although not guaranteed to find optimal alignments, BLAST is much more efficient.

Optimal local alignment

The optimal local alignment A* between two sequences s and t is the optimal global alignment A(s(i1:i2), t(j1:j2)) of the sub-sequences s(i1:i2) and t(j1:j2), for the choice of i1, i2, j1 and j2 that maximizes the alignment score.

Sequence alignment - meaning

Sequence alignment is used to study how sequences, such as protein or DNA sequences, evolved from a common ancestor.

Mismatches in the alignment correspond to mutations, and gaps correspond to insertions or deletions.

Sequence alignment also refers to the process of finding significant alignments in a database of potentially unrelated sequences.


3.5 Statistical analysis of alignments

This works identically to the statistical analysis used in gene finding:

* Generate randomized sequences based on the second string
* Determine the optimal alignments of the first sequence with these randomized sequences
* Compute a histogram and rank the observed score in this histogram
* The relative position defines the p-value.
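
The procedure above can be sketched as a permutation test. The code is an illustration under several assumptions that are not in the slides: global alignment with the simple scoring function (+1 match, -1 mismatch, -1 gap), 200 permutations, a fixed random seed, and the p-value taken as the fraction of randomized scores that reach the observed score. The names `nw_score` and `permutation_p_value` are hypothetical.

```python
import random


def nw_score(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score, keeping only two table rows in memory."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * gap]
        for j, b in enumerate(t, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def permutation_p_value(s: str, t: str, n_perm: int = 200, seed: int = 0):
    """Return (observed score, fraction of shuffled-t scores >= observed)."""
    rng = random.Random(seed)
    observed = nw_score(s, t)
    letters = list(t)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(letters)  # randomized sequence with the same composition as t
        if nw_score(s, "".join(letters)) >= observed:
            hits += 1
    return observed, hits / n_perm


observed, p = permutation_p_value("VIVALASVEGAS", "VIVADAVIS")
print(observed, p)
```

Shuffling preserves the letter composition of t, so the histogram reflects scores expected from composition alone.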


Figure: Histogram of scores of randomly generated strings using permutation of original sequence t (annotation: original sequence s)


3.6 BLAST: fast approximate alignment

• Fast but heuristic
• Most used algorithm in bioinformatics
• Verb: to blast


3.7 Multiple sequence alignment

Determine the best alignment between multiple (more than two) DNA sequences.

Multiple alignment

Multiple alignment is an extension of pairwise alignment to incorporate more than two sequences into an alignment.

Multiple alignment methods try to align all of the sequences in a specified set.

The most popular multiple alignment tool is CLUSTAL.

Multiple sequence alignment is computationally difficult and is classified as an NP-hard problem.


3.8 Computing the alignments

* Needleman-Wunsch (NW) and Smith-Waterman (SW) are both based on Dynamic Programming (DP)
* A recursive relation breaks down the computation

Dynamic Programming Approach to Sequence Alignment

The dynamic programming approach to sequence alignment builds the alignment step by step, always extending the best result computed so far.

Try to align two sequences by inserting gaps at different locations, so as to maximize the score of the alignment.

The score is determined by the "match award", the "mismatch penalty" and the "gap penalty". The higher the score, the better the alignment.

If both penalties are set to 0, the method finds an alignment with the maximum number of matches.
Maximum match = the largest number of characters of one sequence that can be matched with those of another sequence while allowing for all possible deletions.

It is used to compare the similarity between two DNA or protein sequences, to predict the similarity of their functions.

Examples: Needleman-Wunsch (1970), Sellers (1974), Smith-Waterman (1981)

The Needleman-Wunsch algorithm

Scores for aligned characters are specified by the substitution matrix σ(a,b): the similarity of characters a and b.

The Needleman-Wunsch algorithm
For example, if the substitution matrix was

     A    G    C    T
A   10   -1   -3   -4
G   -1    7   -5   -3
C   -3   -5    9    0
T   -4   -3    0    8

then the alignment:

AGACTAGTTAC
CGA---GACGT

with a gap penalty of -5, would have the following score...
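
The score of this alignment can be computed column by column. The sketch below encodes the matrix from the slide as a nested dictionary; the names `SUBST`, `GAP_PENALTY` and `score_alignment` are hypothetical and used only for this illustration.

```python
# Substitution matrix from the slide, indexed as SUBST[a][b].
SUBST = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}
GAP_PENALTY = -5


def score_alignment(row_a: str, row_b: str) -> int:
    """Add the substitution score or the gap penalty for every column."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += GAP_PENALTY
        else:
            total += SUBST[a][b]
    return total


# Columns: -3 +7 +10 -5 -5 -5 +7 -4 +0 -1 +0
print(score_alignment("AGACTAGTTAC", "CGA---GACGT"))  # 1
```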


The Needleman-Wunsch algorithm

1. Create a table M of size (m+1)x(n+1) for sequences s and t of lengths m and n, with rows i = 0..m and columns j = 0..n.

2. Fill the first column and the first row with the cumulative gap scores:

$$M_{i,0} = \sum_{k=1}^{i} \sigma(s_k, -), \qquad M_{0,j} = \sum_{k=1}^{j} \sigma(-, t_k)$$

3. Starting from the top left, compute each entry using the recursive relation:

$$M_{i,j} = \max \left\{ \begin{array}{l} M_{i-1,j-1} + \sigma(s_i, t_j) \\ M_{i-1,j} + \sigma(s_i, -) \\ M_{i,j-1} + \sigma(-, t_j) \end{array} \right\}$$

4. Perform the trace-back procedure from the bottom-right corner.

The Needleman-Wunsch algorithm
Once the matrix M is computed, the entry in the bottom-right corner is the maximum score over all alignments. To find an alignment that actually gives this score, start from the bottom-right cell and compare its value with the three possible sources in the recursive relation (the diagonal, upper and left neighbours) to see which one it came from. If it came from the diagonal, then s_i and t_j are aligned; if it came from above, then s_i is aligned with a gap; and if it came from the left, then t_j is aligned with a gap.
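
The table fill and the trace-back can be sketched as follows. This is an illustrative implementation rather than a reference one: it assumes the simple scoring function (+1 match, -1 mismatch, -1 gap) and breaks ties in the order diagonal, up, left, so other equally good alignments may exist. The names `sigma` and `needleman_wunsch` are hypothetical.

```python
def sigma(a: str, b: str) -> int:
    """Simple scoring function: -1 for a gap or mismatch, +1 for a match."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def needleman_wunsch(s: str, t: str, score=sigma):
    """Return (optimal global score, padded row for s, padded row for t)."""
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # first column: s aligned with gaps only
        M[i][0] = M[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, n + 1):            # first row: t aligned with gaps only
        M[0][j] = M[0][j - 1] + score("-", t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                          M[i - 1][j] + score(s[i - 1], "-"),
                          M[i][j - 1] + score("-", t[j - 1]))
    # Trace back from the bottom-right corner to the top-left corner.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + score(s[i - 1], "-"):
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return M[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))


best, top, bottom = needleman_wunsch("VIVALASVEGAS", "VIVADAVIS")
print(best)
print(top)
print(bottom)
```

The code indexes Python strings from 0, so `s[i - 1]` plays the role of s_i in the recursive relation.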

The Smith-Waterman algorithm
1. Create a table M of size (m+1)x(n+1) for sequences s and t of lengths m and n, with rows i = 0..m and columns j = 0..n.

2. Fill the first row and the first column with zeros: M_{0,j} = 0 and M_{i,0} = 0.

3. Starting from the top left, compute each entry using the recursive relation:

$$M_{i,j} = \max \left\{ \begin{array}{l} M_{i-1,j-1} + \sigma(s_i, t_j) \\ M_{i-1,j} + \sigma(s_i, -) \\ M_{i,j-1} + \sigma(-, t_j) \\ 0 \end{array} \right\}$$

4. Perform the trace-back procedure from the maximum element in the table to the first zero element on the trace-back path.
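
The four steps can be sketched in Python as follows. As before, this is an illustration under assumptions not stated on the slide: the simple scoring function (+1 match, -1 mismatch, -1 gap), the first maximum found is used as the starting point, and ties in the trace-back are broken in the order diagonal, up, left. The names `sigma` and `smith_waterman` are hypothetical.

```python
def sigma(a: str, b: str) -> int:
    """Simple scoring function: -1 for a gap or mismatch, +1 for a match."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def smith_waterman(s: str, t: str, score=sigma):
    """Return (best local score, padded local row for s, padded local row for t)."""
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay zero
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(0,
                          M[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                          M[i - 1][j] + score(s[i - 1], "-"),
                          M[i][j - 1] + score("-", t[j - 1]))
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)
    # Trace back from the maximum element until a zero entry is reached.
    row_s, row_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        if M[i][j] == M[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + score(s[i - 1], "-"):
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


# The shared region ACGT is embedded in unrelated flanking characters.
print(smith_waterman("xxxACGTyyy", "zzACGTww"))  # (4, 'ACGT', 'ACGT')
```

The only differences from the global version are the extra 0 in the maximum, the zero initialisation, and the start and stop points of the trace-back.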


EXAMPLE: Eyeless Gene Homeobox

Compare the gene eyeless of Drosophila melanogaster with the human gene aniridia. They are master regulatory genes producing proteins that control a large cascade of other genes. Certain segments of the genes eyeless of Drosophila melanogaster and human aniridia are almost identical. The most important of such segments encodes the PAX (paired-box) domain, a sequence of 128 amino acids whose function is to bind specific sequences of DNA. Another common segment is the HOX (homeobox) domain, which is thought to be part of more than 0.2% of the total number of vertebrate genes.

58
END of LECTURE 3
