Sequence Alignment

Uploaded by

aimalktk02

Assignment 2

NEXT GENERATION SEQUENCING


WHAT ARE THE MAIN STEPS?
WHAT IS THE BENEFIT?
Sequence Alignment

• Sequence alignment is the process by which two or more biological
sequences are matched to show optimal similarity.
• DNA sequence alignments, RNA sequence alignments and protein
sequence alignments are routinely performed.
• Sequence alignment is useful for inferring function, structure and
evolutionary information.
• As expected, sequence comparison through sequence alignment is central
to most bioinformatics analyses.
• It is the first step towards understanding the evolutionary relationship
and the pattern of divergence between two sequences.
• The relationship between two sequences also helps predict the potential
function of an unknown sequence, thereby indicating protein family
relationship.
Sequence Alignment

• As new biological sequences are generated at exponential rates,
sequence comparison is becoming increasingly important for drawing
functional and evolutionary inferences about a new protein by comparing
it with proteins already in the database.
• The most fundamental process in this type of comparison is
sequence alignment.
• Sequence Alignment is an important first step toward structural and
functional analysis of newly determined sequences.
• Sequence identity means the same residues being present at
corresponding positions in two sequences being compared. For
proteins, it means the same amino acids; for nucleic acids, it means
the same bases.

• Sequence similarity means similar residues being present at
corresponding positions in the two sequences being compared. For
nucleic acids, sequence similarity and sequence identity are the
same. However, for proteins, sequence similarity involves amino
acids with similar physicochemical and functional properties.
DNA Sequence similarity Vs sequence Identity

• For DNA and RNA nucleotide sequences, the two terms have the same
meaning, because two bases at a position either match or do not.
Calculating seq.ID and seq.Sim in DNA and RNA strands

% Seq. ID = (No. of identical residues / Total no. of residues in the shorter seq.) x 100

Calculate for AB, BC and AC.
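
The formula above can be sketched in Python as follows. This is a minimal sketch that assumes the two sequences are already lined up position by position without gaps. The function name and the example sequences are hypothetical and are not taken from the slides.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity of two ungapped, position-matched sequences.

    The denominator is the length of the shorter sequence,
    as in the formula on the slide.
    """
    shorter = min(len(seq_a), len(seq_b))
    if shorter == 0:
        raise ValueError("both sequences must be non-empty")
    # zip() stops at the end of the shorter sequence
    identical = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * identical / shorter


# Hypothetical example: 7 of 8 positions are identical
print(percent_identity("ATGCATGC", "ATGAATGC"))  # 87.5
```

For nucleotide sequences the same value also serves as the % similarity, since identity and similarity coincide for DNA and RNA.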


ASSIGNMENT

Calculate the % similarity and % identity between all pairs of taxon sequences

T1-T2 T2-T3 T3-T4 T4-T5
T1-T3 T2-T4 T3-T5
T1-T4 T2-T5
T1-T5
Protein Sequence similarity Vs sequence Identity
Protein Scoring Systems
• Amino acids have different biochemical and physical properties
that influence their relative replaceability in evolution.
Protein homology
• Similar substitutions are also referred to as conservative
substitutions. A conservative amino acid substitution is not expected
to disrupt the structural/functional attributes of the protein.
• Sequence homology is an evolutionary term. Sequences are called
homologous if they have a common evolutionary origin, that is, if they
are derived from a common ancestral sequence.
• So, sequences are either homologous or not homologous and there is no
quantitation of homology. However, even now, expressions like “high
homology,” “significant homology,” and even specifying a “% homology”
are very widely used.
Calculating seq.ID and seq.Sim in Proteins

• In proteins, sequence identity and sequence similarity are different
things. Examples of similar amino acid pairs:

Alanine - Glycine
Aspartic acid - Glutamic acid
Lysine - Arginine
Threonine - Serine
Calculating seq.ID and seq.Sim in Proteins

Seq identity = (number of identical amino acids / total number of amino acids) x 100
Seq similarity = (number of similar amino acids / total number of amino acids) x 100
Find out the % similarity and % identity of the two peptide sequences
Seq.1 M C G T Q K H D L G V Y F H R P Q D Y
Seq.2 M C A T Q H H D I G T Y F H K P Q E W
Pairwise Alignment
• The alignment of two sequences (DNA or protein) is a
relatively straightforward computational problem.

1. Two sequences can always be aligned.

2. Sequence alignments have to be scored.

3. Often there is more than one solution with the same score.
Types of Alignments
• The overall goal of pairwise sequence alignment is to find the best pairing
of two sequences, such that there is maximum correspondence among
residues.

• To achieve this goal, one sequence needs to be shifted relative to the
other to find the position where maximum matches are found.

• There are two different alignment strategies that are often used:
1. Global Alignment
2. Local Alignment
Global Alignment
• A global sequence-alignment method aligns and compares two sequences
along their entire length, and comes up with the best alignment that
displays the maximum number of nucleotides or amino acids aligned.
• The algorithm that drives global alignment is the Needleman-Wunsch
algorithm. The Needleman–Wunsch algorithm is an algorithm used in
bioinformatics to align protein or nucleotide sequences. It was one of the
first applications of dynamic programming to compare biological
sequences.
• A global alignment algorithm starts at the beginning of the two
sequences and adds gaps to each until the end of one is reached.
• Global alignment works best when the sequences are similar in
character and length. Because global alignment displays the best alignment
between two sequences using the entire sequence, it may miss a small
region of biological importance.
Local Alignment

• In contrast to global alignment, local sequence alignment is intended
to find the most similar regions in two sequences being aligned. The
algorithm that drives local alignment is the Smith-Waterman
algorithm.
• A local alignment algorithm finds the region of highest similarity
between two sequences and builds the alignment outward from this
region.
• Local alignment is useful for sequences that are not similar in
character and length, yet are suspected to contain small regions of
similarity, such as biologically important motifs.
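
A minimal sketch of the Smith-Waterman idea follows. Scores are never allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values (match = 1, mismatch = -1, gap = -2) and the test sequences are assumptions chosen for illustration. This version uses a simple linear gap penalty.

```python
def smith_waterman(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; every cell is floored at 0 (local alignment)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Hypothetical sequences sharing only the short motif ACGT
print(smith_waterman("GGGACGTCCC", "TTACGTAA"))  # (4, 'ACGT', 'ACGT')
```

The two sequences differ everywhere except the shared motif, so a global alignment would be dominated by mismatches while the local alignment reports just the motif.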
Local and global Alignment in DNA or RNA
Local and global Alignment in proteins
Methods of Alignment

• By hand - slide sequences on two lines of a word processor
• Dot plot
  • with windows
• Dynamic programming
  • Needleman-Wunsch, Smith-Waterman (slow, optimal)
• Heuristic methods (fast, approximate)
  • BLAST and FASTA
DNA sequence Alignment “By Hand”
Working by hand

SEQ1. TATGTCG
SEQ2. TCAGTGC

match =
mismatch =
gap =
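
When working by hand, each candidate alignment is judged by counting its columns. The sketch below tallies matches, mismatches and gaps for two alignment rows of equal length, with "-" marking a gap. The function name is hypothetical, and the example shows only the simplest arrangement of the two sequences above, with no gaps inserted.

```python
def tally_alignment(row1: str, row2: str) -> dict:
    """Count match, mismatch and gap columns of an alignment."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            counts["gap"] += 1       # a residue paired with a gap
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# The two sequences written directly under each other, no gaps
print(tally_alignment("TATGTCG", "TCAGTGC"))
# {'match': 3, 'mismatch': 4, 'gap': 0}
```

Sliding one sequence or inserting "-" characters and re-running the tally shows how gaps trade mismatches for gap columns.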
DNA sequence Alignment “by Dot matrix”

• The basic sequence alignment method is the dot matrix or dot plot
method. In this method, two sequences being compared are written
in the vertical and horizontal axes of the matrix.
• Then each residue is scanned and each match is given a dot;
mismatches are left blank. When enough dots are lined up, they are
connected to form a diagonal.
ATTGTAC
ATGTTAC
A shift down in the diagonal means a gap must be added in sequence 1.
A shift right in the diagonal means a gap must be added in sequence 2.
A break in the diagonal means there is a mismatch.
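
The dot matrix method can be sketched as below for the two example sequences. Placing the first sequence on the horizontal axis and the second on the vertical axis is an assumption, and the function names are hypothetical. No window or filtering is applied, so every single-residue match gets a dot.

```python
def dot_matrix(seq_h: str, seq_v: str) -> list:
    """Boolean matrix: rows follow seq_v, columns follow seq_h."""
    return [[h == v for h in seq_h] for v in seq_v]


def render_dot_plot(seq_h: str, seq_v: str) -> str:
    """Text rendering: '*' for a match, '.' for a blank cell."""
    lines = ["  " + " ".join(seq_h)]
    for v, row in zip(seq_v, dot_matrix(seq_h, seq_v)):
        lines.append(v + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render_dot_plot("ATTGTAC", "ATGTTAC"))
```

Runs of `*` along a diagonal correspond to stretches that match without gaps. Scattered single dots are chance matches, which is why a window is often used in practice.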
SCORING MATRIX, ALIGNMENT SCORE, AND STATISTICAL SIGNIFICANCE OF SEQUENCE ALIGNMENT
• For both nucleic acids and proteins, the alignment score is calculated
using a scoring matrix.
• A scoring matrix is a set of values representing the likelihood of one
residue being substituted by another during sequence divergence
through evolution.
• This is why the scoring matrix is also known as the substitution
matrix.
Match: 1
Mismatch: -1

      A   C   A   G
  A
  C
  T
  G
Alignment Scoring function
• In the case of sequence alignment, dynamic programming involves
setting up a two-dimensional matrix in which one sequence is listed
vertically and the other sequence is listed horizontally; then
calculating the scores, one row at a time.
• For example:
• a match = 1,
• a mismatch = -1,
• a gap = -2
• A 100% perfect alignment will produce a diagonal straight line (with a
negative slope) spanning from the top left to bottom right.
• If the alignment is not perfect, gaps are introduced in the matrix. For
the sequence represented horizontally, gaps are introduced vertically,
and for the sequence represented vertically, gaps are introduced
horizontally, and the alignment is determined by a trace back step.
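
The matrix fill and trace back described above can be sketched as a small Needleman-Wunsch implementation using the example scores (match = 1, mismatch = -1, gap = -2). This is an illustrative sketch with a linear gap penalty. When several alignments share the best score it returns only one of them, preferring the diagonal move.

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned a, aligned b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # First column and first row: leading gaps only
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    # Fill the matrix one row at a time
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align two residues
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right corner to the top-left
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return F[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

In the example, three matches (+3) and one gap (-2) give the final score of 1.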
Draw the scoring matrix of the given sequences

A1

A2

A3
ALIGNMENT ALGORITHMS, GAPS, AND GAP PENALTIES
• An algorithm is a step-by-step procedure that utilizes a finite number
of instructions for automated reasoning and the calculation of a
function.
• The algorithm that drives global alignment is the Needleman-Wunsch
algorithm, and the algorithm that drives local alignment is the
Smith-Waterman algorithm.
• Both these algorithms are examples of dynamic programming.
Dynamic programming is a method for solving complex problems by
breaking them down into simpler sub-problems.
Alignment Scoring function
• In both global and local alignment, the final output is given an
alignment score.
• Gaps have to be introduced to improve the alignment. Gaps are
introduced because one of the sequences may have gained or lost
sequence characteristics (insertion/deletion) during evolution that
did not happen with the other sequence.
• The gap penalty value is subtracted from the gross alignment score
to obtain the final alignment score. The insertion of no more than 1
gap per 20 amino acid residues is ideal but that is not possible in
most cases.
• For each gap opened, a gap-opening penalty value is assigned, and
for each gap extended, a gap-extension penalty value is assigned.
• A gap-opening penalty is usually much higher than a gap-extension
penalty. Often, default values of -10 for the gap-opening penalty and -1
for the gap-extension penalty are used.
Alignment Score

• An adjustable differential penalty for gap opening and gap extension is
called an affine gap penalty.
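
An affine gap penalty can be sketched as below, using the default values mentioned above (-10 to open, -1 to extend). It assumes the first position of a gap pays the opening penalty and each further position pays the extension penalty. Some tools instead charge the extension penalty on every position, including the first. The function names are hypothetical.

```python
import re


def affine_gap_penalty(length: int, gap_open=-10, gap_extend=-1) -> int:
    """Penalty for one gap of the given length (assumed convention:
    opening penalty for the first position, extension for the rest)."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def total_gap_penalty(aligned_row: str, gap_open=-10, gap_extend=-1) -> int:
    """Sum the affine penalties of every run of '-' in one alignment row."""
    return sum(
        affine_gap_penalty(len(run.group()), gap_open, gap_extend)
        for run in re.finditer(r"-+", aligned_row)
    )


print(affine_gap_penalty(1))             # -10
print(affine_gap_penalty(3))             # -12
print(total_gap_penalty("AC---GT-A"))    # -22
```

One gap of length 3 costs -12, whereas three separate single-residue gaps would cost -30. Affine penalties therefore favour a few long gaps over many short ones.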
