ISC 211
Introduction to
Bioinformatics
Lecture 6 – Sequence Analysis
Dr. Athira B
Asst. Professor, CSE
IIIT Kottayam
Suppose you are given a set of new DNA
sequences and asked to identify their
functional, structural, or biological features.
How can you do this analysis?
One solution is to compare them with existing
known sequences: how similar are they?
How do we check this similarity?
Sequence Analysis
Process of subjecting a DNA, RNA or peptide sequence to any of a
wide range of analytical methods to understand its features, function,
structure, or evolution.
Objectives:
To find similarity, often to infer whether sequences are related (homologous)
To identify intrinsic features of the sequence such as active sites,
post-translational modification sites, gene structures, reading
frames, distributions of introns and exons, and regulatory elements
To identify sequence differences and variations such as point
mutations and single nucleotide polymorphisms (SNPs), which can
serve as genetic markers
To reveal the evolution and genetic diversity of sequences and
organisms
To identify molecular structure from sequence alone
Methods
Sequence alignment: pairwise and multiple sequence alignment
Comparison against large databases
Sequence Alignment
Procedure of comparing two or more sequences by
searching for a series of individual characters or
character patterns
Identify same characters in the same row
Alignment can be local/global
Sequence Alignment
Biological Problem
Sequence alignment is a way of arranging protein (or DNA)
sequences to identify regions of similarity that may be a
consequence of evolutionary relationships between the sequences.
Genome sequencing allows comparison of organisms at DNA and
protein levels
Comparisons can be used to
Find evolutionary relationships between organisms
Identify functionally conserved sequences
Identify corresponding genes in human and model organisms:
develop models for human diseases
Sequence Homology
Homology: genes that derive from a common ancestral gene are
called homologs
Orthologous genes are homologous genes in different organisms
Paralogous genes are homologous genes in one organism that
derive from gene duplication
Gene duplication: one gene is duplicated into multiple copies, which
are therefore free to evolve and assume new functions
Sequence similarity
Intuitively, the similarity of two sequences refers to the
degree of match between corresponding positions in the
sequences
Sequence similarity is not sequence homology
Homology is more difficult to detect over greater
evolutionary distances
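A minimal sketch of the "degree of match between corresponding positions" idea, assuming two sequences of equal length that are compared position by position. The function name percent_identity is illustrative, not a standard API.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    # Count positions where the characters are identical.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


print(percent_identity("ATAGC", "AGAGC"))  # 80.0 (4 of 5 positions match)
```

A high value here measures similarity only; it does not by itself show that the sequences are homologous.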
Causes of Gene (dis) similarity
Mutation (substitution): a nucleotide at a certain location is replaced by
another nucleotide (e.g.: ATA → AGA)
Insertion: at a certain location one new nucleotide is inserted
between two existing nucleotides (e.g.: AA → AGA)
Deletion: at a certain location one existing nucleotide is
deleted (e.g.: ACTG → AC-G)
Indel: an insertion or a deletion
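The three edit operations above can be sketched on plain Python strings. The helper names (substitute, insert, delete) and the 0-based positions are assumptions made for illustration.

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the nucleotide at index pos with base."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq: str, pos: int, base: str) -> str:
    """Insert base before index pos."""
    return seq[:pos] + base + seq[pos:]


def delete(seq: str, pos: int) -> str:
    """Remove the nucleotide at index pos."""
    return seq[:pos] + seq[pos + 1:]


print(substitute("ATA", 1, "G"))  # AGA
print(insert("AA", 1, "G"))       # AGA
print(delete("ACTG", 2))          # ACG
```

Note that the substitution and the insertion both end in AGA here: the same sequence can arise through different events, which is why an alignment is needed to propose which positions correspond.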
Sequence Alignment
Find the similarity between two (or more) DNA sequences by
finding a good alignment between them
An alignment specifies which positions in the two sequences match
Sequence Alignment
A sequence alignment is an arrangement of two or more
sequences that highlights their similarity.
The sequences are padded with gaps (dashes) so that, wherever
possible, columns contain identical characters from the
sequences involved
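As a small illustration of gap padding, the sketch below takes two already-aligned rows (equal length, with "-" for gaps) and marks the columns that contain identical characters. The function name match_line and the "|" marker are illustrative choices.

```python
def match_line(row1: str, row2: str) -> str:
    """Return a marker string: '|' under identical columns, ' ' elsewhere."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "|" if a == b and a != "-" else " " for a, b in zip(row1, row2)
    )


row1 = "ACTG"
row2 = "AC-G"  # the deletion example, padded with a gap
print(row1)
print(match_line(row1, row2))
print(row2)
```

The middle line printed is "|| |": three columns hold identical characters and one column pairs T with a gap.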
Sequence Alignment
Pairwise Sequence Alignment: methods concerned with finding
the best-matching piecewise (local) or global alignments of two protein
(amino acid) or DNA (nucleic acid) sequences.
Global Alignment: an alignment in which all the characters in
both sequences participate.
Local Alignment: an alignment of only those regions of the two
sequences that are most similar to each other.
Algorithms
Needleman-Wunsch
Pairwise global alignment only.
Smith-Waterman
Pairwise local alignment.
BLAST
Pairwise heuristic local alignment
The Needleman-Wunsch algorithm
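The Needleman-Wunsch algorithm (1970, J Mol Biol. 48(3):443-53)
performs a global alignment of two sequences (s and t) and can be
applied to protein or nucleotide sequences.
The Needleman-Wunsch algorithm is an example of dynamic
programming, and is guaranteed to find the alignment with the
maximum score under the chosen scoring scheme.
E.g., for sequences x and y, the score matrix F is filled using the standard recurrence
\[ F(i,j) = \max \{\, F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) + d,\; F(i,j-1) + d \,\} \]
where s(x_i, y_j) is the substitution cost and d is the gap penalty
(taken here as a negative value, e.g. -5, so it is added)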
Dynamic Programming-steps
1. Initialization of the score matrix
2. Calculation of scores and filling the traceback matrix
3. Deducing the alignment from the traceback matrix
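The three steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name, the simple match/mismatch scoring (in place of a full substitution matrix), the default score values, and the tie-breaking order (diagonal, then up, then left) are all assumptions.

```python
def needleman_wunsch(s, t, match=2, mismatch=-1, gap=-2):
    """Global alignment of s and t; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)

    # Step 1: initialise the score matrix F and the traceback matrix T.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    T = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], T[i][0] = i * gap, "U"   # leading gaps in t
    for j in range(1, m + 1):
        F[0][j], T[0][j] = j * gap, "L"   # leading gaps in s

    # Step 2: fill both matrices cell by cell.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            diag = F[i - 1][j - 1] + sub
            up = F[i - 1][j] + gap
            left = F[i][j - 1] + gap
            best = max(diag, up, left)
            F[i][j] = best
            # Assumed tie-breaking: prefer diagonal, then up, then left.
            T[i][j] = "D" if best == diag else ("U" if best == up else "L")

    # Step 3: follow the traceback pointers from the bottom-right corner.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if T[i][j] == "D":
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif T[i][j] == "U":
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = needleman_wunsch("GAATTC", "GATAC")
print(score)  # 5
print(top)
print(bottom)
```

Several alignments can share the maximum score; which one is printed depends on the tie-breaking order chosen in step 2.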
Let’s work on this simple example
Input: AAG (sequence #1) , AGC (sequence #2)
Gap penalty = -5
Step 1
Step 2
Final Table
Exercise
Seq 1: GAATTC Seq 2 : GATAC
Match = 2 , mismatch = -1, gap = -2
Smith-Waterman: local alignment
Example 1 :
Seq 1: GAATTC Seq 2 : GATAC
Match = 2 , mismatch = -1, gap = -2
Example 2
Seq 1: GAATTCAT Seq 2 : CCTCATG
Starting score: 0, match = 2, mismatch = -1, gap = -2
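A minimal sketch of Smith-Waterman local alignment applied to Example 2, assuming simple match/mismatch scoring and a linear gap penalty. The function name and the traceback order (diagonal, then up, then left) are illustrative assumptions. Compared with global alignment, cell scores are never allowed to drop below 0 and the traceback starts at the highest-scoring cell.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Local alignment of s and t; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    # First row and column start at 0 (the "starting score").
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(
                0,                      # restart the local alignment
                H[i - 1][j - 1] + sub,  # match or mismatch
                H[i - 1][j] + gap,      # gap in t
                H[i][j - 1] + gap,      # gap in s
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("GAATTCAT", "CCTCATG"))  # (8, 'TCAT', 'TCAT')
```

With these scores, the best local alignment for Example 2 is the shared stretch TCAT, four matches worth 2 each.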