Unit 3

This document provides an overview of sequence alignment in bioinformatics, detailing the concepts of sequence similarity, identity, and homology, as well as the types of alignments such as pairwise and multiple sequence alignment. It discusses various alignment algorithms and scoring matrices like PAM and BLOSUM, along with tools like BLAST and CLUSTAL W. The unit aims to enhance understanding of sequence comparison and its applications in biological research.
UNIT 3

SEQUENCE ALIGNMENT

Structure
3.1 Introduction
    Expected Learning Outcomes
3.2 Sequence Similarity, Identity, and Homology
    Sequence Similarity
    Sequence Identity
    Sequence Homology
3.3 Alignment Types
    Pairwise and Multiple Sequence Alignment
    Local and Global Alignment
3.4 Alignment Scoring Matrices
    PAM
    BLOSUM
3.5 Sequence Alignment Tools and Software
    BLAST and Types
    CLUSTAL W
3.6 Summary
3.7 Terminal Questions
3.8 Answers
3.9 Further Readings

3.1 INTRODUCTION
In the previous unit, you learned about the sequences and structures of
proteins and nucleic acids, along with biological databases. As you know,
amino acids are the building blocks of proteins. Any language has an
alphabet, and the proper arrangement of its letters in various combinations
forms words and sentences with meaning. Language helps us to communicate
with each other as well as to update our knowledge. Similarly, the
arrangement of amino acids gives rise to numerous functional proteins,
enzymes, receptors, etc. in biological systems. These combinations of amino
acids and nucleotides play a major role in the proper functioning of proteins
and genes. It is interesting to know that specific protein sequences remain
largely the same in many organisms, although a few substitutions, insertions,
or deletions may produce a mutated protein. If the sequence is exactly the
same in two different organisms, the protein function is expected to be the
same as well. Various tools and software are available to compare these
sequences.
BBCS-185 Bioinformatics Skill Enhancement Course

In both animals and plants, several proteins and enzymes are involved in
biochemical pathways, signaling pathways, and other functions. Comparing the
sequences of proteins or genes of one organism with those of another
animal/species is called sequence comparison. In this unit, you will learn
more about sequence alignment, its types, and the algorithms involved in it.
In addition, you will come across new terms, software, and tools. You will
also learn the applications of sequence alignment in proteins and nucleic
acids, which is essential in the field of biology and allied subjects.

Expected Learning Outcomes

After studying this unit, you shall be able to:

differentiate between similarity, identity, and homology of sequences;
describe alignment types like pairwise and multiple sequence alignment; and
explain algorithms, amino acid substitution matrices (PAM and BLOSUM), BLAST
and CLUSTALW.

3.2 SEQUENCE SIMILARITY, IDENTITY, AND HOMOLOGY
In general, similarity, identity, and homology may look alike, but they differ
considerably in meaning. Let us assume that the word AMINOACID is matched
with the word AMINOACID; each letter then matches 100% with the
corresponding letter of the other word.

AMINOACID  Seq-1
|||||||||
AMINOACID  Seq-2

In the above example, the first word is named Seq-1 and the second word
Seq-2, and both words match. It is a simple example of how a sequence of
letters forms words. Now, let us see the similarity of sequences in the next
subsection.

3.2.1 Sequence Similarity
You have seen the example of the alignment (matching) of letters in the word
AMINOACID. Now, you will learn about sequence similarity in proteins and
genes. In the previous unit, we discussed the Human Genome Project as well as
public biological databases like NCBI, GenBank, SwissProt, and EMBL-EBI,
where genome and protein sequences of most plants, animals, fungi, and
bacteria are available. Sequences are available in various formats in
specific databases. Among them, the FASTA format is the most popular, as
shown in Fig. 3.1 for the enzyme hexokinase. You will learn more about how to
obtain a FASTA file while performing exercise number 7 in this course.
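
A FASTA record is plain text: a header line that begins with ">" followed by one or more lines of sequence. As a minimal sketch of how such text can be read, the Python function below splits FASTA text into header/sequence pairs. The function name parse_fasta and the example record are illustrative assumptions; the record is made up and is not a real hexokinase entry.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())  # a sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record, made up for illustration only.
example = """>demo_id example enzyme [Example organism]
MIAAQLLAYY
FTELKDDQ
"""

for header, seq in parse_fasta(example):
    print(header, len(seq))
```

The header line is kept as one string here; real database headers carry the identifier, protein name, and organism in a database-specific layout.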

Fig. 3.1: FASTA format of hexokinase of Homo sapiens, retrieved from GenBank.
(Labels in the figure: enzyme name and organism; GenBank ID.)

In the above sequence format, you can see the name of the enzyme, the
organism, and the GenBank ID at the top of the sequence. The sequence starts

organism. When we compare the sequence of hexokinase between two different
animals, the sequences match 100% only if they have the same number of amino
acid residues and the same type of amino acid at every position. In another
set of organisms, the match may be less than 100% due to differences in the
number and type of amino acid residues. In a few animals, we may find
mutations in the sequence; the protein may still perform a normal or similar
function.

Let us see the following example to understand the concept of similarity.

ATCGGC  Seq-1
||||
ATCGCG  Seq-2

The sequences Seq-1 and Seq-2 are not the same, but we can call them similar.
The first four positions match, and there are mismatches only at the last two
positions (G/C and C/G).
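
The comparison in the two examples above can be expressed as a percentage of matching positions. The following is a minimal sketch, assuming the two sequences are already aligned and have the same length; the function name percent_identity is an illustrative choice.

```python
def percent_identity(seq1, seq2):
    """Percentage of positions at which two equal-length sequences match."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)


print(percent_identity("AMINOACID", "AMINOACID"))      # 100.0
print(round(percent_identity("ATCGGC", "ATCGCG"), 1))  # 66.7
```

For ATCGGC and ATCGCG, four of the six positions match, which gives about 66.7%.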

For instance, in proteins, an amino acid may be replaced by another amino acid
with similar chemical and physical properties; such mismatches often do not
alter the functionality of the protein. In Fig. 3.2 you can find an example of
a sequence alignment for the protein histone H1 among different mammals. The
amino acids that are constant throughout the alignment are called conserved
residues, and the amino acids that vary in the alignment are referred to as
non-conserved residues.

Fig. 3.2: Amino acid sequence alignment.
(Source: https://en.wikipedia.org/wiki/Sequence_alignment)

SAQ 1
i) Which types of sequences are available in the NCBI database?
ii) Which mammalian sequence is more similar to human histone
sequence?
iii) What do you mean by conserved residues in a multiple sequence
alignment?

3.2.2 Sequence Identity
Learners often confuse similarity and identity, but there is a slight
difference between them. Identity refers to the positions at which two aligned
sequences have exactly the same nucleotide or amino acid; it is usually
expressed as the percentage of such positions. In other words, the characters
of the two sequences match exactly at those positions. Similarity, in
contrast, describes a resemblance between sequences: it also counts positions
where the residues differ but have similar properties.
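
The difference between identity and similarity can be illustrated with a short calculation. The sketch below assumes two aligned protein sequences of equal length and uses a simplified grouping of amino acids by side-chain property; the grouping, the function name, and the two example peptides are illustrative assumptions and not a standard scoring scheme.

```python
# Simplified, illustrative grouping of amino acids by side-chain property.
GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ", "C", "G", "P"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def identity_and_similarity(seq1, seq2):
    """Return (% identity, % similarity) for two aligned, equal-length proteins."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("need two non-empty sequences of equal length")
    identical = similar = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif a in GROUP_OF and GROUP_OF[a] == GROUP_OF.get(b):
            similar += 1  # different residues from the same property group
    n = len(seq1)
    return 100.0 * identical / n, 100.0 * similar / n


# Made-up peptides: 7 of 11 positions are identical, the other 4 are
# substitutions within the same property group.
identity, similarity = identity_and_similarity("AVLTSHYILRS", "AILTSKYVLRT")
print(round(identity, 1), round(similarity, 1))
```

Alignment tools usually derive similarity from a substitution matrix rather than from fixed groups; the groups are used here only to keep the sketch short.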

Various tools and software are available to calculate sequence identity over
the length of the sequences. Among them, BLAST is one of the most powerful and
widely used. You will learn how to perform this using online tools in
exercise 9 of this course.

In the given example (Fig. 3.3), the DNA polymerase sequence of Hepatitis B
virus is taken as the query sequence and aligned with the subject sequence (a
sequence from the database). When this sequence was subjected to alignment,
the query and subject sequences had similarity and identity percentages of
98% and 97%, respectively. You can observe that a few amino acids do not match
exactly with the lower sequence. The big boxes mark regions with higher
sequence identity than the small boxes; the amino acids in the small boxes are
neither identical nor similar. You can also observe some gaps between the
sequences. The alignment of sequences is carried out using various matrices
and algorithms. Sequence identity plays a major role in evolutionary tree
generation. It helps in understanding the lineage of a specific species and
its relationship with other organisms. Sequence identity is also essential to
acquire information about the working mechanisms of various proteins,
enzymes, receptors, and cellular responses.

Fig. 3.3: Sequence identity of DNA polymerase in Hepatitis B Virus.

SAQ 2
i) What is sequence identity?

ii) What is a sequence in the context of sequence alignment?

3.2.3 Sequence Homology
In simple words, homology describes similarity due to shared ancestry. It is
one of the common terms used in bioinformatics when comparing two or more
protein or nucleotide sequences. Sequence homology is related to protein
stability and functionality, so it is very important to know the sequence
homology. Most of the time, similarity throughout the length of the sequences
is taken as evidence of homology (Table 3.1). If two sequences match 100%, the
structure and function of the proteins are expected to be the same. If two or
more sequences share a common ancestor, we call them homologous sequences.
Fig. 3.4 shows an example of structural homology, which plays an important
role in understanding evolutionary biology.

Fig. 3.4: Structural homology (http://www.bio.miami.edu/dana/160/160S11_3.html)

Table 3.1: The differences between similarity and homology.

S.No. | Similarity | Homology
1. | Similarity refers to the likeness or % identity between two sequences | Homology refers to shared ancestry
2. | Similarity means sharing a statistically significant number of bases or amino acids | Two sequences are homologous if they are derived from a common ancestral sequence
3. | Similarity does not imply homology | Homology usually implies similarity

You are advised to watch the video at the given YouTube link to know more
details about sequence similarity:
https://www.youtube.com/watch?v=K6ldxHPzI5A

SAQ 3
i) Write the differences between similarity and homology.

ii) What is homology?

3.3 ALIGNMENT TYPES
Till now, you have learned about sequence alignment with respect to amino
acids and nucleotides with suitable examples. Now, we will discuss alignment
over the whole sequence and fragment-based alignment. There are two types of
alignment based on the number of sequences: 1. Pairwise sequence alignment and
2. Multiple sequence alignment. These alignment types help us to understand
the phylogeny of species or the genetic relationships between various gene
sequences. The percentage of similarity indicates how closely or distantly the
sequences are related.

3.3.1 Pairwise and Multiple Sequence Alignment
In this section, we will learn about alignment regions. The main purpose of
pairwise sequence alignment is to identify the regions of similarity between
two sequences. Such regions may indicate functional or structural
relationships between proteins or genes and may lead to finding evolutionary
relationships between the two sequences.

Seq1  AVLTSHYILRS - 11
      |||||||||||
Seq2  AVLTSHYILRS - 11

Different methods are used for the pairwise alignment of nucleotide and
protein sequences. Let us learn them one by one:

1) Dot Plot: It is a graphical method for comparing two sequences to identify
regions of maximum similarity and dissimilarity, depicted by the presence and
absence of dots.
In this plot, one sequence is written along the top and the other along the
side of a grid. If an amino acid of one sequence matches exactly with an amino
acid of the other sequence, a dot is placed in the respective box. The same
procedure is followed for nucleotide sequences (Fig. 3.5).

Fig. 3.5: Dotplot of amino acids

Seq-1  GGAAT
       |||||
Seq-2  GGAAT

If any nucleotide/amino acid does not match, a gap is noticed within the
alignment.

As you know, most gene sequences are very long. In such cases, the plot
appears as in Fig. 3.6. In this figure, the X-axis is Seq1 and the Y-axis is
Seq2, with the total number of amino acids in both the sequences being 200.

Fig. 3.6: Dotplot of amino acid Seq1 and Seq2 with 200 amino acid residues.

2) Dynamic Programming: This method breaks a problem into small sub-problems
and uses the solutions of the sub-problems to compute the solution of the
larger one. Algorithms like Needleman-Wunsch and Smith-Waterman are used here.
(Watch the video at this YouTube link to know more about the above alignment
types: https://www.youtube.com/watch?v=ipp-pNRIp4g)

3) Heuristic Method: When a single sequence is to be compared against a whole
database, heuristic methods like BLAST and FASTA are used.
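
The dot plot described as the first method above can be sketched in a few lines of Python. This is a minimal illustration that prints a text grid, with "*" standing for a dot and "." for an empty box; the function name dot_plot and the symbols are illustrative choices.

```python
def dot_plot(seq_x, seq_y):
    """Return the dot-plot grid as text lines: '*' for a match, '.' for none."""
    lines = ["  " + " ".join(seq_x)]  # column labels from the first sequence
    for y in seq_y:
        row = ["*" if x == y else "." for x in seq_x]
        lines.append(y + " " + " ".join(row))
    return lines


for line in dot_plot("GGAAT", "GGAAT"):
    print(line)
```

For two identical sequences, the matches along the main diagonal form an unbroken line; repeated letters such as GG and AA add extra dots next to the diagonal.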

We are going to study various alignments like local and global alignment in the
next sections of this unit.

We have discussed the importance of multiple sequence alignment. We can align
more than two sequences by using software or online tools. These alignments
may be global or local alignments. Multiple sequence alignment (MSA, Fig. 3.7)
of proteins/genes can be performed in most software by collecting the
sequences in FASTA format. The output of the alignment can be seen in the form
of trees or aligned sequences.

Fig. 3.7: Multiple sequence alignment (MSA).

Phylogenetic trees are generated by various tools and software to determine
evolutionary relationships based on multiple sequence alignments of amino acid
residues or nucleotides. An alignment of more than two sequences is called a
multiple sequence alignment (MSA). MSA is commonly used in phylogenetic
analysis, protein structure prediction and comparison, and the identification
of conserved domains, regions, and active/inhibitory sites of enzymes. MSA
assumes that the sequences are homologous, that is, that they descend from a
common ancestor. The algorithms try to align homologous positions or conserved
regions by considering function and structure.
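
Once a multiple sequence alignment is available, conserved positions can be found by checking each column. The following is a minimal sketch, assuming the aligned sequences are given as equal-length strings with "-" for gaps; the function name conserved_columns and the toy alignment are illustrative.

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns where every sequence has the same residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for i in range(length):
        column = {seq[i] for seq in alignment}
        # A column made only of gaps is not counted as conserved.
        if len(column) == 1 and "-" not in column:
            conserved.append(i)
    return conserved


# Toy alignment, made up for illustration.
msa = ["IAGCW", "IAGCT", "IIGCT"]
print(conserved_columns(msa))  # [0, 2, 3]
```

Here the first, third, and fourth columns (I, G, and C) are conserved, while the second and fifth columns vary.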

To know more about the topic, you are advised to visit the following video
links:
https://www.youtube.com/watch?v=S07kIY2ihq8

https://www.youtube.com/watch?v=TZaA_-4j19w

SAQ 4
i) What is MSA?

3.3.2 Local and Global Alignment
There are two types of alignment based on the length of the sequence region to
be aligned. We will discuss in detail the local and global alignment of
sequences, along with the algorithms that implement them. Usually, multiple
sequence alignment algorithms assume that the sequences are similar along
their whole length, and they behave like global alignment algorithms. They
also assume that there are not many long insertions and deletions of
residues/nucleotides. Thus the algorithms will work for some sequences, but
not for others.

These algorithms can deal with sequences that are quite different, but, as in
the pairwise case, when the sequences are very different they might have
problems creating a good alignment. A good alignment should align the
homologous positions or the positions with the same structure or function.
Global Alignment: In the sequence analysis of proteins or genes, global
alignment is most suitable for sequences of similar length. Such an alignment
is performed from the beginning to the end of the sequences. In such cases,
gaps may be created during the alignment process.
The Needleman-Wunsch algorithm (an algorithm is a formula or set of steps to
solve a problem) was developed by Saul B. Needleman and Christian D. Wunsch in
1970. It is a dynamic programming algorithm for the global alignment of
nucleotide or protein sequences. It was the first algorithm of its kind for
the alignment of two protein sequences and the first application of dynamic
programming to biological sequence analysis. The Needleman-Wunsch algorithm
finds the best-scoring global alignment between the two sequences. Global
alignments are most useful when the two sequences being compared are of
similar length and not too divergent.
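
The dynamic programming idea behind Needleman-Wunsch can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes a simple scoring scheme of +1 for a match, -1 for a mismatch, and -1 for each gap position (these values and the function name are illustrative choices), and it returns only one of the possibly several optimal alignments.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_seq1, aligned_seq2) for one optimal global alignment."""
    n, m = len(seq1), len(seq2)
    # score[i][j] holds the best score for aligning seq1[:i] with seq2[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # seq1 prefix aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # seq2 prefix aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align two residues
                              score[i - 1][j] + gap,       # gap in seq2
                              score[i][j - 1] + gap)       # gap in seq1

    # Trace back from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                top.append(seq1[i - 1])
                bottom.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("GGAAT", "GGAAT"))  # (5, 'GGAAT', 'GGAAT')
```

Because the traceback always runs from the last cell to the first, every residue of both sequences appears in the result, which is what makes the alignment global.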
Local Alignment: Local alignment finds the regions of highest similarity
between two sequences, even when the sequences are dissimilar overall or
differ in length.
Global and local alignment are carried out by different algorithms; both use
scoring matrices to align two different series of characters or patterns
(sequences), and both are based on dynamic programming. Many proteins exhibit
modular architectures. In searching databases for similar sequences, it is
useful to find sequences that have similar domains or functional motifs. Smith
& Waterman (1981) published an application of dynamic programming to find
optimal local alignments. The algorithm is similar to Needleman-Wunsch, except
that negative cell values are reset to zero, and the traceback procedure
starts from the highest-scoring cell, anywhere in the matrix, and ends when
the path encounters a cell with a value of zero.
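
The two differences from Needleman-Wunsch described above (resetting negative values to zero, and tracing back from the highest-scoring cell until a zero is reached) can be sketched as follows. The scoring values (+2 match, -1 mismatch, -1 gap), the function name, and the example sequences are illustrative assumptions.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_part1, aligned_part2) for one optimal local alignment."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(0,  # negative values are reset to zero
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            top.append(seq1[i - 1])
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Made-up sequences that share only the motif GGAAT.
print(smith_waterman("TTTGGAATCCC", "AAAGGAATAAA"))  # (10, 'GGAAT', 'GGAAT')
```

Only the shared motif is reported; the dissimilar flanking regions are left out, which is the behaviour that makes local alignment useful for finding common domains or motifs.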
If we take a small fragment of one sequence and align it with a small region
of the other sequence, it is a local alignment. Similarly, performing a
complete alignment throughout the sequence length is known as global alignment
(Fig. 3.8 and 3.9).


Fig. 3.8: Line diagram representing global and local alignment.
(Source: https://www.researchgate.net/figure/Global-alignment-vs-Local-alignment_fig1_322704711)

Fig. 3.9: Global alignment and local alignment.
(Source: https://www.majordifferences.com/2016/05/difference-between-global-and-local.html)

Gap penalty

Sequence alignments usually require the insertion of gaps, reflecting
insertion or deletion mutations. If a nucleotide or amino acid in one sequence
is aligned to a gap in the target sequence, then this should be penalized like
a mismatch. However, gaps at the ends of sequences should perhaps not incur
any penalty. Moreover, a single insertion or deletion mutation could result in
a contiguous gap of multiple residues. Therefore, a single gap that is 3
residues long should incur a smaller penalty than 3 separate gaps of one
residue each. An affine gap penalty scheme heavily penalizes opening a gap,
but extending a preexisting gap incurs a much lower penalty per additional
residue. You will learn more about global and local alignment and gap penalty
concepts while performing exercise number 9 on BLAST.
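
The effect of an affine gap penalty can be checked with a small calculation. In the sketch below, the penalty values (5 for opening a gap, 1 for each extension) are hypothetical, and the convention that the opening charge covers the first gap position is an assumption; some programs instead charge the opening penalty plus an extension penalty for every position.

```python
def affine_gap_cost(length, gap_open=5, gap_extend=1):
    """Cost of one gap: an opening charge for the first position, then a
    smaller extension charge for each additional position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_cost(3)          # 5 + 1 + 1 = 7
three_short_gaps = 3 * affine_gap_cost(1)  # 5 + 5 + 5 = 15
print(one_long_gap, three_short_gaps)
```

With these values, one gap of three residues costs 7, while three separate one-residue gaps cost 15, so the scheme favours a single contiguous gap.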

3.4 ALIGNMENT SCORING MATRICES
In the previous section we have seen the alignment of sequences, as pair or
multiple. As we have discussed, Needleman-Wunsch and Smith-Waterman

dot plots where one sequence of residues matches with another sequence of
residues, shown as a dot. But in the case of a matrix, a positive score for a
match and a penalty for a mismatch will be assigned as per sequence similarity
(Refer a video link). For nucleotide sequence alignments, the simplest scoring
matrix awards +1 for a match, and -1 for a mismatch. The blastn (will be
discussed in the next section) algorithm at NCBI scores +5 for a match and -4
for a mismatch. These scoring matrices treat all mutations (mismatches)
equally. In reality, transitions (pyrimidine to pyrimidine and purine to
purine) occur much more frequently than transversions (pyrimidine to purine
and vice versa) (Refer Fig. 3.10 below).

For aligning non-protein coding DNA sequences, a transition/transversion
scoring matrix may be more appropriate. For aligning DNA sequences that encode
proteins, alignment of the protein amino acid sequences will almost always be
more reliable.

Fig. 3.10: Transitions and transversions in genetic mutations.
(Source: https://www.differencebetween.com/difference-between-transition-and-vs-transversion/)
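
A scoring scheme that separates transitions from transversions can be sketched as follows. The score values (+1 for a match, -1 for a transition, -2 for a transversion) and the function names are illustrative assumptions, not the values used by any particular program.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of bases, penalizing transversions more heavily."""
    if a == b:
        return match
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def alignment_score(seq1, seq2):
    """Sum the pair scores of two aligned, gap-free sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(nucleotide_score(a, b) for a, b in zip(seq1, seq2))


print(alignment_score("ACGT", "GCGT"))  # A/G is a transition:   -1 + 3 = 2
print(alignment_score("ACGT", "CCGT"))  # A/C is a transversion: -2 + 3 = 1
```

Both example pairs differ at one position, but the alignment with the transition receives the higher score.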

For protein sequence alignments, the scoring matrices are more complicated.
The goal is to reflect evolutionary processes. Some amino acid sequence
changes can arise from a single nucleotide change, whereas other amino acid
changes require two nucleotide changes. Some amino acid changes are less
likely to affect protein structure or function than other amino acid changes.

SAQ 5
What is the use of alignment scoring matrix?


3.4.1 PAM (Point Accepted Mutations)


Dayhoff used alignments of highly conserved proteins to assess which amino
acid changes were likely to be accepted as Point Accepted Mutations. From
this data, she devised a 20 x 20 amino acid substitution matrix for PAM-1, a
unit of evolutionary change resulting in 1 accepted mutation per 100 amino
acids. From there she calculated other matrices such as PAM-2 or PAM-30 or
PAM-250. The substitution matrices are converted to scoring matrices by
converting substitution probabilities to log-odds ratios for each cell.
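
The two steps described above, deriving higher PAM matrices from PAM-1 and converting probabilities to log-odds scores, can be sketched with a toy example. Higher PAM matrices are obtained by multiplying PAM-1 by itself repeatedly. The sketch below uses a hypothetical two-letter alphabet instead of the 20 amino acids, with made-up probabilities and background frequencies, and it assumes scores of the form 10 x log10(probability / background frequency) rounded to integers.

```python
import math

# Hypothetical PAM-1 for a two-letter alphabet: each residue has a 1% chance
# of being replaced by the other. The values are made up for illustration.
PAM1 = [[0.99, 0.01],
        [0.01, 0.99]]
FREQS = [0.5, 0.5]  # hypothetical background frequencies


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Return PAM-n, obtained by multiplying PAM-1 by itself n times."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


def log_odds(matrix, freqs):
    """Convert substitution probabilities to rounded log-odds scores."""
    size = len(matrix)
    return [[round(10 * math.log10(matrix[i][j] / freqs[i]))
             for j in range(size)] for i in range(size)]


pam2 = pam_n(PAM1, 2)
print(pam2)
print(log_odds(pam2, FREQS))
```

In the resulting score matrix, keeping the same residue gets a positive score and a substitution gets a negative score, because a substitution is observed less often than expected by chance.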

For example, an alignment of multiple sequences may be as follows:

IAGCW
IAGCT
IIGCT

Dayhoff constructed a phylogenetic tree from such alignments and counted the
substitutions along the branches of the tree (Fig. 3.11). The tree is chosen
so that it minimizes the number of changes needed to explain the sequences. To
know more about the PAM concept, watch the video:
https://www.youtube.com/watch?v=8avcQRxaLBw

Fig. 3.11: Phylogenetic tree for substitution matrix for three sequences.

3.4.2 BLOSUM (BLOcks SUbstitution Matrix)
In the previous section, you learnt a brief introduction to the PAM matrices.
Now, we will discuss the BLOSUM matrices, another family of substitution
matrices used in sequence alignment. These matrices are used to score
alignments between evolutionarily divergent protein sequences. They are based
on local alignments. BLOSUM matrices were introduced by Steven Henikoff and
Jorja Henikoff (Fig. 3.12). They scanned the BLOCKS database for highly
conserved regions of protein families and then calculated a log-odds score for
each of the 210 possible pairs of the 20 standard amino acids (190
substitution pairs plus 20 identical pairs).

The main difference between PAM and BLOSUM matrices is that BLOSUM matrices
are based on observed alignments; they are not extrapolated from comparisons
of closely related proteins, as the PAM matrices are. Scoring plays a major
part in alignment. For nucleotide sequences, all matches are given the same
score and all mismatches the same penalty (typically +1 or +5 for matches, and
-1 or -4 for mismatches). But it is different for proteins. Substitution
matrices for amino acids are more complicated than those for nucleotides and
take into account the factors that might affect the frequency with which any
amino acid is substituted for another. The objective is to provide a
relatively heavy penalty for aligning two residues together if they have a low
probability of being homologous (correctly aligned by evolutionary descent).
Two forces drive the amino acid substitution rates away from uniformity: as
discussed in the previous sections, substitutions occur at different
frequencies, and some are less functionally tolerated than others.

Fig. 3.12: BLOSUM matrices between two protein sequences.

For the calculation of the BLOSUM matrix, the following equation is used:

\[ S_{ij} = \frac{1}{\lambda} \log \left( \frac{p_{ij}}{q_i \, q_j} \right) \]

Here, p_{ij} is the probability of two amino acids i and j replacing each
other in a homologous sequence, and q_i and q_j are the background
probabilities of finding the amino acids i and j in any protein sequence. The
factor \lambda is a scaling factor, set such that the matrix contains easily
computable integer values.
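
A log-odds score of the form S_ij = (1/lambda) x log(p_ij / (q_i x q_j)) can be computed in a few lines. The sketch below assumes this standard form with a natural logarithm and sets lambda = ln(2)/2, which gives scores in half-bit units; the probabilities in the example are hypothetical and are chosen only so that the arithmetic is easy to follow.

```python
import math


def blosum_score(p_ij, q_i, q_j, lam=math.log(2) / 2):
    """Log-odds score (1/lam) * ln(p_ij / (q_i * q_j)), rounded to an integer."""
    return round(math.log(p_ij / (q_i * q_j)) / lam)


# Hypothetical probabilities, not taken from a real BLOSUM table.
print(blosum_score(0.02, 0.05, 0.05))      # pair seen 8x more often than chance
print(blosum_score(0.0025, 0.05, 0.05))    # pair seen as often as chance
print(blosum_score(0.000625, 0.05, 0.05))  # pair seen 4x less often than chance
```

A pair that occurs more often than expected by chance receives a positive score, a pair that occurs exactly as often as expected scores zero, and a rarely observed pair receives a negative score.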

Several BLOSUM matrices are available; the following three are commonly used,
and the choice depends on how closely related the proteins to be aligned are.

BLOSUM80: more related proteins

BLOSUM62: midrange

BLOSUM45: distantly related proteins

Among these three, BLOSUM62 is the most widely used.

There are various online tools and software available for sequence alignment
with BLOSUM matrices as a weight matrix.

Clustal W is a well-known online sequence alignment tool. You can browse the
following link, https://www.genome.jp/tools-bin/clustalw, to see the BLOSUM
matrix option, as shown in Fig. 3.13. While performing exercise 10, you will
learn more about multiple sequence alignment using Clustal W.

Fig. 3.13: The Clustal W online tool has a parameter section with the BLOSUM
matrix for pairwise and multiple sequence alignments.
(Source: https://www.genome.jp/tools-bin/clustalw)

To know further details on this topic, follow the given YouTube link:
https://www.youtube.com/watch?v=ZUjAKgVrir4

3.5 SEQUENCE ALIGNMENT TOOLS AND SOFTWARE
So far we have studied various concepts and theories behind sequence
alignment; now let us explore the tools available to perform sequence
alignment.

As mentioned in the previous section, Clustal W is a good alignment software
for pairwise and multiple sequence alignment. Apart from this software, a
large number of academic and commercial sequence alignment programs are
available. Most of them work on Linux (such as Ubuntu) and Solaris operating
systems, whereas only a limited number work on Windows XP, NT, and Server. The
output of sequence analysis is usually presented as a graphical representation
of the data, along with matrices and values.

The sequence alignment tools and software reduce the time and enhance the
effectiveness of analysis. The alignment analysis provides the information
needed to make a proper decision and move further in understanding
protein/gene function or the relation of sequences with one another. In this
section, we will get to know more about online software based on alignment
types. Watch the video at the link provided to know more about this topic:
https://www.youtube.com/watch?v=uGhZygAMQik

3.5.1 BLAST and Types
The main function of the Basic Local Alignment Search Tool (BLAST) is to find
regions of similarity between given sequences. This program compares
nucleotide or protein sequences and calculates the statistical significance of
the matches. BLAST can be used to infer functional and evolutionary
relationships between sequences as well as to help identify members of gene
families. The BLAST program can be accessed at
https://blast.ncbi.nlm.nih.gov/Blast.cgi.

There are four main types of BLAST programs available:

1. Nucleotide BLAST
2. Protein BLAST
3. BLASTx
4. tBLASTn

The main webpage of BLAST is shown in Fig. 3.14.

Let us discuss all the BLAST types in detail.

1. BLASTn (Nucleotide BLAST): Compares one or more nucleotide query sequences
to a subject nucleotide sequence or a database of nucleotide sequences. This
is useful when trying to determine evolutionary relationships among different
organisms.

2. BLASTp (Protein BLAST): Compares one or more protein query (target)
sequences with existing protein sequences or a database of protein sequences.
This is useful when trying to identify a new or unknown protein.

Fig. 3.14: BLAST home webpage with Nucleotide, Protein, blastx and tblastn
links on it.
3. BLASTx (translated nucleotide sequence searched against protein
sequences): Compares a nucleotide query sequence that is translated in six
reading frames (resulting in six protein sequences) against a database of
protein sequences.
Because blastx translates the query sequence in all six reading frames and
provides combined significance statistics for hits to different frames, it is
particularly useful when the reading frame of the query sequence is unknown
or it contains errors that may lead to frame shifts or other coding errors. Thus,
BLASTx is often the first analysis performed with a newly determined
nucleotide sequence.
4. tBLASTn (protein sequence searched against translated nucleotide
sequences): Compares a protein query sequence against the six-frame
translations of a database of nucleotide sequences. tBLASTn is useful for
finding homologous protein-coding regions in unannotated nucleotide sequences
such as expressed sequence tags (ESTs) and draft genome records (HTG), located
in the BLAST databases.

ESTs are short, single-read cDNA (complementary DNA) sequences. They comprise
the largest pool of sequence data for many organisms and contain portions of
transcripts from many uncharacterized genes. Since ESTs have no annotated
coding sequences, there are no corresponding protein translations in the BLAST
protein databases. Hence, a tblastn search is the only way to search for these
potential coding regions at the protein level. The HTG sequences, draft
sequences from various genome projects or large genomic clones, are another
large source of unannotated coding regions.

Apart from the above BLAST types, a few more BLAST programs are available for
standalone systems as well as cloud-based platforms. Some of them are as
follows:
1. SmartBLAST: To find proteins highly similar to the query sequence.
2. Primer-BLAST: To design primers specific to a given PCR (polymerase chain
reaction) template.
3. Global Align: To compare two sequences across their entire length with the
Needleman-Wunsch algorithm.
4. CD Search: To find the conserved domains in the given sequence.
5. IgBLAST: To analyse immunoglobulin and T-cell receptor sequences.
6. VecScreen: To search sequences for vector contamination. This tool is used
for molecular biology experiments.
7. CDART: To find sequences with similar conserved domain architecture. You
have to remember the difference between CD Search and CDART in this case.
8. Multiple Alignment: To align sequences using domain and protein
constraints.
9. MOLE-BLAST: To establish taxonomy for uncultured or environmental sequences
(Fig. 3.15).

All the above tools are available at https://blast.ncbi.nlm.nih.gov/Blast.cgi
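
The six-frame translation used by BLASTx (on the query) and tBLASTn (on the database), as described above, can be sketched in Python. The sketch below assumes the standard genetic code and plain A/C/G/T input; the function names, the frame labels (+1 to +3 and -1 to -3), and the example sequence are illustrative choices and do not come from the BLAST software.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; codons with unknown letters become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate three frames of the given strand and three of its reverse complement."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Made-up nine-base sequence: ATG GCC TAA reads as Met-Ala-stop in frame +1.
for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six protein strings can then be compared with protein sequences, which is why the reading frame of the nucleotide sequence does not need to be known in advance.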

Fig. 3.15: Specialized searches with BLAST programmes at the NCBI-BLAST
webpage.

SAQ 6
i) What is BLAST?

ii) What are BLASTn and BLASTp?

iii) How many types of BLAST tools are available?

Watch the YouTube video available at the provided link to know more details:
https://www.youtube.com/watch?v=jDHrHfx0cpw

3.5.2 Clustal W
Till now, you have studied the BLAST program, which finds sequences similar to
a query sequence within specific databases. The simultaneous alignment of
multiple nucleotide or amino acid sequences is now an essential tool in
molecular biology. Multiple sequence alignments are used for the following:
i) To find diagnostic patterns to characterise protein families.
ii) To detect or demonstrate homology between new sequences and existing
families of sequences.
iii) To help predict the secondary and tertiary structure of new sequences,
and to suggest oligonucleotide primers for PCR (Polymerase Chain Reaction).
iv) As an essential prelude to molecular evolutionary analysis.
There are many variations of the Clustal software, few listed below:

Clustal: The original software for multiple sequence alignments, created by Des Higgins in 1988, was based on deriving phylogenetic trees from pairwise comparisons of amino acid or nucleotide sequences.

ClustalV: The second generation of the Clustal software was released in 1992 and was a rewrite of the original Clustal package. It introduced phylogenetic tree reconstruction on the final alignment, the ability to create alignments from existing alignments, and the option to create trees from alignments using a method called Neighbor-joining.
BBCS-185 Bioinformatics Skill Enhancement Course
ClustalW: The third generation, released in 1994, greatly improved upon previous versions. It refined the progressive alignment algorithm in various ways, allowing individual sequences to be weighted down or up according to similarity or divergence, respectively, in a partial alignment. It also included the ability to run the program in batch mode from the command line.

ClustalX: This version, released in 1997, was the first to have a graphical user
interface.

Clustal Omega: The current standard version.

Clustal 2: The updated versions of both ClustalW and ClustalX, with higher accuracy and efficiency.

3.6 SUMMARY
In this unit, we have studied the basics of sequence alignment along with the
programs or tools used to perform sequence alignment.

Sequence alignment plays a major role in identifying ancestry or phylogeny.

It helps to predict the function of a newly sequenced gene by aligning it with sequences in existing databases such as NCBI, EMBL and DDBJ.

Multiple sequence alignment (MSA) of genes or proteins helps determine evolutionary relationships and reconstruct phylogeny. Scientists can also predict new members of gene families.

It helps to identify structurally or functionally similar regions within proteins by comparison with existing databases.

There are two types of alignment: 1. global alignment and 2. local alignment.

Global alignment uses the Needleman-Wunsch algorithm; local alignment uses the Smith-Waterman algorithm.

Sequence identity compares the aligned residues or nucleotides and is expressed as a percentage or as a score.

Identity and homology are distinct concepts: identity measures how many positions match, whereas homology refers to the shared ancestry of the sequences.

BLAST (Basic Local Alignment Search Tool) is a basic alignment tool, and several types exist depending on the query and the database searched.

Dot plots, dynamic programming and heuristic methods are used to identify similar patterns between sequences.

PAM (Point Accepted Mutation) is one type of scoring matrix. Another well-known scoring matrix is BLOSUM (BLOcks SUbstitution Matrix).

BLOSUM62 is widely used in multiple sequence alignment as well as in the development of phylogenetic trees.

Clustal W is a multiple sequence alignment tool.

3.7 TERMINAL QUESTIONS


1. Describe the importance of sequence alignment.

2. Differentiate between identity and homology.

3. Write a note on pairwise and multiple sequence alignment with suitable examples.

4. Explain the role of the dot plot in sequence alignment.

5. What is an MSA? Explain its role in phylogenetic analysis.

6. Explain BLAST and its types.

7. Describe the tools used in multiple sequence alignment.

3.8 ANSWERS
Self Assessment Questions
1. i) The NCBI database consists of plant, animal, fungal and bacterial genome sequences, protein sequences, gene sequences, etc.

ii) Chimp

iii) The amino acids that are constant throughout the alignment

2. i) The presence of the same nucleotides or amino acids at the same positions in two sequences is known as identity.

ii) Sequences present in the database

3. i) Refer to Table 3.1

ii) Homology refers to shared ancestry

4. i) Multiple sequence alignment

5. i) To understand the evolutionary process

6. i) Basic Local Alignment Search Tool

ii) Nucleotide and protein BLAST types

iii) Four

Terminal Questions
1. i) Sequence alignment plays a major role in identifying ancestors.


ii) It helps to predict the function of a newly sequenced gene by aligning it with sequences in existing databases such as NCBI, EMBL and DDBJ.

iii) Scientists can predict new members of gene families.

iv) Multiple sequence alignment (MSA) of genes or proteins helps determine evolutionary relationships and reconstruct phylogeny.

v) It helps to identify structurally or functionally similar regions within proteins by comparison with existing databases (refer to Section 3.1).

2. The differences between similarity and homology are as follows:

S.No. | Similarity | Homology
1. | Similarity refers to the likeness or % identity between two sequences | Homology refers to shared ancestry
2. | Similarity means sharing a statistically significant number of bases or amino acids | Two sequences are homologous if they are derived from a common ancestral sequence
3. | Similarity does not imply homology | Homology usually implies similarity

3. i) Definition of pairwise and multiple sequence alignment.

ii) Importance of both alignments.

iii) Explanation with two or more sequences as per requirement (refer to Section 3.3).

4. i) Importance of dot plots in sequence alignment (refer to Section 3.3.1).

ii) Alignment patterns as per sequence alignment matching.

iii) Draw the dot plot by considering two sequences of your interest.

5. i) Definition of MSA (refer to Section 3.3.2).

ii) Explanation of the development of phylogenetic trees.

iii) Distances of trees, orthologs and paralogs.

6. i) Definition of BLAST and its uses (refer to Section 3.5.1).

ii) Classification of BLAST types: BLASTp, BLASTn, BLASTx, TBLASTn.

7. ClustalW, ClustalX, Clustal Omega, MEGA (refer to Section 3.5.2).


3.9 FURTHER READINGS


1. Sequence Alignment: Methods, Models, Concepts, and Strategies by Michael S. Rosenberg, University of California Press, 2009.

2. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins by Andreas Baxevanis and Francis Ouellette, Wiley-Interscience, 2005.

3. Introduction to Bioinformatics by T. K. Attwood and D. J. Parry-Smith, 8th reprint, Pearson Education, 2004.

4. Bioinformatics: Sequence and Genome Analysis by D. W. Mount, 2nd edition, CBS Publication, 2005.

5. Fundamental Concepts of Bioinformatics by D. E. Krane and M. L. Raymer, Pearson Publication, 2006.

6. Bioinformatics: Tools & Applications by D. Edwards, J. Stajich and D. Hansen, Springer, 2009.

7. Bioinformatics: Databases, Tools & Algorithms by O. Bosu and S. K. Thukral, Oxford University Press, 2007.

8. Bioinformatics: Methods and Applications - Genomics, Proteomics and Drug Discovery by S. C. Rastogi, N. Mendiratta and P. Rastogi, PHI Learning Pvt. Ltd., 2015.

9. Multiple Sequence Alignment: Methods and Protocols by Kazutaka Katoh, Springer US, 2020.

10. Essential Bioinformatics by Jin Xiong, Cambridge University Press, 2006.

Video: https://www.youtube.com/watch?v=K6ldxHPzI5A

https://www.youtube.com/watch?v=A4JrzGon8mQ


Exercise 7
MOLECULAR FILE FORMATS -
FASTA, GENBANK,
GENPEPT, GCG, CLUSTAL,
SWISS-PROT, PIR

Structure
7.1 Introduction
    Expected Learning Outcomes
7.2 Procedure
7.3 Summary
7.4 Lab Exercises

7.1 INTRODUCTION
In this exercise, you will practise downloading sequences in different file formats such as FASTA, GenBank, GenPept, GCG, CLUSTAL, SWISS-PROT and PIR, which are maintained by different biological databases and used for sequence analysis. You have learned about biological databases in Unit 2 of this course. The major objective of this exercise is to familiarize you with the various file formats that are regularly used in bioinformatics.

FASTA format

FASTA format is extensively used in most bioinformatics experiments. Sequences in FASTA format can be downloaded from NCBI, EBI, UniProt, and other databases. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol at the beginning. An example sequence in FASTA format is shown below.

Example of FASTA format

>gi|532319|pir|TVFV2E|TVFV2E envelope protein

ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLN
GSYSEN
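The layout described above (one ">" description line followed by sequence lines) can be read with a few lines of Python. This is only a minimal sketch: the function name parse_fasta is a hypothetical helper written for this illustration, not part of any database tool, and the input is the example record shown above.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLN
GSYSEN
"""

records = parse_fasta(example)
for description, sequence in records:
    print(description)
    print(sequence)
```

Note that the wrapped sequence lines are joined into one string, which is how most analysis programs treat a FASTA record.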

GENBANK and GENPEPT format

NCBI maintains the GenBank and GenPept formats, which store information on DNA and protein sequences respectively. From a GenBank or GenPept record it is easy to find the basic information about a sequence, such as the source organism, the authors who sequenced it, coding information, and other details. The GenBank or GenPept sequence format (GenBank flat file format) consists of three parts: the header, the features, and the sequence. The start of the annotation section (header and features) is marked by a line beginning with the word "LOCUS". The start of the sequence section is marked by a line beginning with the word "ORIGIN", and the end of the section is marked by a line containing only "//". The header section consists of initial and basic information; the feature section consists of source, CDS, gene and RNA features; the actual sequence starts after ORIGIN.
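The three markers described above (LOCUS, ORIGIN and "//") are enough to separate the annotation from the sequence. The following Python sketch illustrates the idea under stated assumptions: split_genbank is a hypothetical helper, and the record is a made-up, heavily shortened example written only for this illustration, not a real GenBank entry.

```python
def split_genbank(text):
    """Split one GenBank/GenPept flat-file record into annotation and sequence."""
    annotation = []
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of the record
        if line.startswith("ORIGIN"):
            in_sequence = True  # the sequence section starts on the next line
            continue
        if in_sequence:
            # Sequence lines carry position numbers and spaces; keep letters only.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("LOCUS") or annotation:
            annotation.append(line)
    return "\n".join(annotation), "".join(sequence_parts)


# Made-up record for illustration only (not a real database entry).
record = """LOCUS       DEMO0001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record for illustration only.
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

annotation, sequence = split_genbank(record)
print(annotation)
print(sequence)
```

A real record has many more header and feature lines, but the same three markers delimit its sections.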

Example of GenBank flat file format

Example of GenPept file format



GCG (Genetics Computer Group) format

GCG assists molecular biologists by developing practical tools that implement the most important bioinformatics techniques. GCG has been located in the Department of Genetics at the University of Wisconsin-Madison since 1982. Files in GCG format are most commonly obtained from the commercial GCG package. A sequence file in GCG format contains one sequence. It begins with annotation lines; the start of the sequence is marked by a line ending with two dot ("..") characters. This line also contains the sequence identifier, the sequence length, and a checksum. This format should only be used if the file was created with the GCG package.

Example of GCG format

PIR format

The Protein Information Resource (PIR) is an integrated public resource of protein informatics that supports genomic and proteomic research and scientific discovery. PIR maintains the Protein Sequence Database (PSD), an annotated protein database containing (as of Dec, 2021) over 283000 sequences covering the entire taxonomic range.

A sequence in PIR format consists of:

One line starting with

a. a ">" (greater-than) sign, followed by

b. a two-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX), followed by

c. a semicolon, followed by

d. the sequence identification code (the database ID-code).

One line containing a textual description of the sequence.

One or more lines containing the sequence itself. The end of the sequence is marked by a "*" (asterisk) character.

Optionally, this can be followed by one or more lines describing the sequence.

A file in PIR format may comprise more than one sequence. The PIR format is also often referred to as the NBRF (National Biomedical Research Foundation) format.

>P1;CRAB_CHICK

ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).

MDITIHNPLV RRPLFSWLTP SRIFDQIFGE HLQESELLPT SPSLSPFLMR

SPFFRMPSWL ETGLSEMRLE KDKFSVNLDV KHFSPEELKV KVLGDMIEIH

GKHEERQDEH GFIAREFSRK YRIPADVDPL TITSSLSLDG VLTVSAPRKQ

SDVPERSIPI TREEKPAIAG SQRK*

>P1;CRAB_HUMAN

ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN) (ROSENTHAL FIBER).

MDIAIHHPWI RRPFFPFHSP SRLFDQFFGE HLLESDLFPT STSLSPFYLR

PPSFLRAPSW FDTGLSEMRL EKDRFSVNLD VKHFSPEELK VKVLGDVIEV

HGKHEERQDE HGFISREFHR KYRIPADVDP LTITSSLSSD GVLTVNGPRK

QVSGPERTIP ITREEKPAVT AAPKK*
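The PIR rules listed above (type code, semicolon, identifier, description line, sequence ending in "*") can be turned into a small reader. The sketch below is illustrative only: parse_pir is a hypothetical helper name, and the input is the CRAB_CHICK record from the example above with the blank lines removed.

```python
def parse_pir(text):
    """Parse PIR/NBRF text into a list of record dictionaries."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith(">"):
            i += 1  # skip optional description lines after a sequence
            continue
        # First line: ">" + two-letter type code + ";" + identification code.
        seq_type, _, identifier = lines[i][1:].partition(";")
        # Second line: free-text description of the sequence.
        description = lines[i + 1] if i + 1 < len(lines) else ""
        i += 2
        residues = []
        while i < len(lines) and not lines[i].startswith(">"):
            line = lines[i]
            i += 1
            residues.append(line.replace(" ", "").rstrip("*"))
            if line.endswith("*"):
                break  # the asterisk marks the end of the sequence
        records.append({
            "type": seq_type,
            "id": identifier,
            "description": description,
            "sequence": "".join(residues),
        })
    return records


example = """>P1;CRAB_CHICK
ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDITIHNPLV RRPLFSWLTP SRIFDQIFGE HLQESELLPT SPSLSPFLMR
SPFFRMPSWL ETGLSEMRLE KDKFSVNLDV KHFSPEELKV KVLGDMIEIH
GKHEERQDEH GFIAREFSRK YRIPADVDPL TITSSLSLDG VLTVSAPRKQ
SDVPERSIPI TREEKPAIAG SQRK*
"""

pir_records = parse_pir(example)
for rec in pir_records:
    print(rec["type"], rec["id"], len(rec["sequence"]))
```

The spaces that group residues in blocks of ten are layout only, so the sketch removes them before joining the lines.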

SWISS-PROT

SWISS-PROT is a curated, annotated protein sequence database that strives to provide a high level of annotation.

the relevant entries deposited in SWISS-PROT and TrEMBL. Furthermore, the final data can be downloaded in different file formats, such as FASTA, GFF, and flat text. This set of sequences can be used for the analysis of DNA-binding proteins. In a similar way, sequences for any kind of protein can be easily obtained from SWISS-PROT.

CLUSTAL

CLUSTAL is an older multiple sequence alignment program; CLUSTALW is its improved version. The output of a CLUSTALW multiple sequence alignment is written in CLUSTALW format. A sample CLUSTALW format is shown below.

Expected Learning Outcomes


After performing this exercise you shall be able to:

describe the file formats used by various biological databases;

download sequences in these file formats from databases to perform bioinformatics experiments; and

differentiate between various file formats.

7.2 PROCEDURE
Step 1: Open the GenBank website from the following URL:

https://www.ncbi.nlm.nih.gov/genbank/ (Fig. 7.1).

Fig. 7.1: Screenshot showing GenBank page on NCBI.



Step 2: Type the sequence name, sequence ID or any relevant keyword in the text box (Fig. 7.2).

Fig. 7.2: Screenshot showing search option on GenBank.

For GenPept format, select Protein in the dropdown box of the NCBI homepage, type a keyword and click on the Search button (Fig. 7.3).

Fig. 7.3: Screenshot showing dropdown box on NCBI page.

Step 3: On pressing the Search button, the result page (summary page) is displayed (Fig. 7.4).

Fig. 7.4: Screenshot showing search results on NCBI page.



Step 4: Select the required sequence by double-clicking its accession number or ticking the checkbox next to the appropriate sequence. Then go to the display button and select GenBank format if it is a nucleotide sequence, GenPept format if it is a protein sequence, or FASTA format to retrieve the sequence (Fig. 7.5 A and B).

A)

B)

Fig. 7.5: A) showing GenBank Format B) GenPept format.

Step 5: Copy and save the required protein or nucleotide sequence for further analysis (Fig. 7.6).
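The steps above use the NCBI web pages. As an optional alternative that this exercise does not cover, the same records can be requested through NCBI's E-utilities efetch service. The sketch below only builds the request URL; the helper name efetch_url and the small RETTYPE table are assumptions made for this illustration, and no network request is sent.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Hypothetical lookup table for this sketch: (database, format) -> rettype.
RETTYPE = {
    ("nuccore", "genbank"): "gb",
    ("nuccore", "fasta"): "fasta",
    ("protein", "genpept"): "gp",
    ("protein", "fasta"): "fasta",
}


def efetch_url(accession, database, file_format):
    """Build an efetch URL that returns one record as plain text."""
    rettype = RETTYPE[(database, file_format)]
    query = urlencode({
        "db": database,
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"


url = efetch_url("NM_001297740", "nuccore", "genbank")
print(url)
```

Opening the printed URL in a browser, or reading it with urllib.request.urlopen, should return the record as text, provided the service is reachable.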

Fig. 7.6: Screenshot showing GenBank sequence.

7.3 SUMMARY
You have learned about biological databases in Unit 2; the data they hold can be retrieved and viewed in different formats.

In the current exercise you have learned to download sequences in different file formats such as FASTA, GenBank and GenPept.

You have also studied formats such as GCG, CLUSTAL, SWISS-PROT and PIR, which are maintained by different biological databases and used for sequence analysis.

7.4 LAB EXERCISES


1. Download the sequence NM_001297740 in GenBank and FASTA formats.

2. Search NCBI for a Covid-related protein sequence and download any one sequence in GenPept format.


Exercise 8
MOLECULAR VIEWER BY
VISUALIZATION
SOFTWARE: PYMOL

Structure
8.1 Introduction
    Expected Learning Outcomes
8.2 Procedure
8.3 Summary
8.4 Lab Exercises

8.1 INTRODUCTION
In the previous exercises you have learned how to access databases for protein and nucleic acid structures. However, to analyse these structures we need to view them using tools known as visualisation tools or software.

In this exercise, you will learn to use the PyMOL program to visualize 3-D structures of molecules. PyMOL is a powerful tool for viewing and analyzing the structures of proteins, DNA, and other macromolecules. It is a stand-alone molecular visualization program based on Python. PyMOL is used to generate high-quality molecular graphics images and animations for journal publications describing new macromolecular structures and interactions. PyMOL was developed by Warren DeLano. It is open source, but not free in all forms: students and educators can use a current free version in the classroom, anybody can obtain outdated binary releases, and certain Linux distributions provide PyMOL packages built from the open-source code.

Expected Learning Outcomes


After performing this exercise you shall be able to:

download and view the 3-D structure of biological macromolecules using PyMOL;

prepare and save the 3-D molecule for publishing in a journal or dissertation; and

explain the applications of PyMOL.



8.2 PROCEDURE
Step 1: Download PyMOL from the website (http://www.pymol.org/educational). Register as a student using the link provided on that page; you will eventually be sent a link with a username and password. This allows you to download the software for your personal computer or Mac system; follow the instructions to install the software.

Step 2: Download the 3-D structure of protein 6YI3 in PDB format from the PDB database, as you have learned in Exercise 6.

Step 3: Double-click the PyMOL icon on your desktop. PyMOL brings up two windows.

i) The first window is the external GUI (Graphical User Interface); it contains the menu options as well as buttons for advanced visualization (Fig. 8.1).

Fig. 8.1: Screenshot showing external GUI of PyMOL.

ii) The second (bottom) window is the Viewer, the area where molecules will be displayed. The bottom window also contains an internal GUI that lists molecular objects once you have loaded a protein structure. The bottom of this GUI has a matrix displaying the current mouse configuration, namely which mouse button combinations control which functions (Fig. 8.2).

Fig. 8.2: Screenshot showing PyMOL internal GUI.

Step 4: Choose File > Open and select the PDB file 6YI3.pdb that you have already downloaded. The PDB file will load and the protein will be displayed in the Viewer.

Step 5: To change the representation of the molecule, use the object control panel shown on the right side of the Viewer.

Clicking an object's name will un-display the corresponding molecule(s) (make them temporarily invisible).

The ASHLC menu stands for Action, Show, Hide, Label and Color.

Click S and then cartoon. The molecule is now shown as both a cartoon and a wireframe. Remove the wireframe by clicking H and then lines.

Step 6: To change the background color to white follow this menu cascade
(Fig. 8.3):

Display > Background > White

molecule.

Step 7: To save the image in the present view, follow the options under File > Save. The PyMOL view is saved as a PNG image (Fig. 8.4).

Fig. 8.4: Screenshot Showing how to save and rename the image.

Step 8: You can also use the command line for commands such as Save, Viewport, Zoom, Ray and Select. To learn how to execute these commands, follow the additional study material links provided at the end of this exercise, and practise the other options in detail.
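The menu actions in Steps 4 to 7 have command-line equivalents. As a rough sketch, the Python snippet below assembles those PyMOL commands into script text; build_pymol_script is a hypothetical helper, and the file names, image size and selection name are assumptions chosen for illustration. The snippet itself only prints text and does not need PyMOL to run.

```python
def build_pymol_script(pdb_file, image_file, width=800, height=600):
    """Return PyMOL commands that load a structure and save a rendered image."""
    commands = [
        f"load {pdb_file}",              # Step 4: open the PDB file
        "hide lines",                    # Step 5: remove the wireframe
        "show cartoon",                  # Step 5: show the cartoon representation
        "bg_color white",                # Step 6: white background
        "select backbone_ca, name CA",   # Select: hypothetical selection name
        f"viewport {width}, {height}",   # Viewport: size of the viewer window
        "zoom",                          # Zoom: fit the molecule in the view
        "ray",                           # Ray: ray-trace the current view
        f"png {image_file}",             # Save: write the image as PNG
    ]
    return "\n".join(commands) + "\n"


script = build_pymol_script("6YI3.pdb", "6YI3.png")
print(script)
```

Each printed line can be typed at the PyMOL command prompt in turn, which gives the same result as the menu route described in the earlier steps.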

8.3 SUMMARY
PyMOL is a powerful tool to visualize and analyze the structures of proteins, DNA, and other biological molecules.
You have learned how to view the 3-D structure of proteins in different poses.
PyMOL can be used to generate high-quality molecular graphics images and animations for journal publications.

8.4 LAB EXERCISES


1. Download 7C8K and use commands to analyse the structure and create images.

2. Open the 7C8K structure in PyMOL and create an animation for PowerPoint.

Additional study material link

https://bioquest.org/nimbios2010/wp-content/blogs.dir/files/2010/07/pymol_tutorial3.pdf

https://sites.pitt.edu/~epolinko/IntroPyMOL.pdf


Exercise 9
BLAST SUITE OF TOOLS FOR
PAIRWISE ALIGNMENT

Structure
9.1 Introduction
    Expected Learning Outcomes
9.2 Procedure
9.3 Summary
9.4 Lab Exercises

9.1 INTRODUCTION
In Unit 3 of this course you have learnt about sequence similarity searching using the Basic Local Alignment Search Tool (BLAST), and in Exercises 5 and 7 you have practised sequence retrieval. These sequences will be used in this exercise to perform database similarity searches using the BLAST tool. BLAST is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that are similar to the query (the unknown sequence in question). There are many different types of BLAST available from the BLAST web page. Selecting the required one depends on the type of sequence you are searching with and on the desired database. Different types of BLAST are given below:

a) Nucleotide BLAST: search a nucleotide database using a nucleotide query.
Algorithms: blastn, megablast, discontiguous megablast

b) Protein BLAST: search a protein database using a protein query.
Algorithms: blastp, psi-blast, phi-blast, delta-blast

c) blastx: search a protein database using a translated nucleotide query.

d) tblastn: search a translated nucleotide database using a protein query.

e) tblastx: search a translated nucleotide database using a translated nucleotide query.
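The list above amounts to a lookup on the query type and the database type. The Python sketch below expresses that rule as a table; the function choose_blast and its translated flag are hypothetical names introduced only for this illustration and are not part of the BLAST software.

```python
# (query type, database type, compare translated sequences?) -> program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "protein", False): "blastx",
    ("protein", "nucleotide", False): "tblastn",
    ("nucleotide", "nucleotide", True): "tblastx",
}


def choose_blast(query_type, database_type, translated=False):
    """Return the BLAST program for a given query and database type.

    translated=True means a nucleotide query and a nucleotide database
    are both translated and compared at the protein level (tblastx).
    """
    key = (query_type, database_type, translated)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for combination {key}")
    return BLAST_PROGRAMS[key]


print(choose_blast("nucleotide", "nucleotide"))
print(choose_blast("nucleotide", "protein"))
print(choose_blast("protein", "nucleotide"))
```

For example, a newly sequenced DNA fragment searched against a protein database corresponds to blastx, because the query is translated before comparison.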

Expected Learning Outcomes


After performing this exercise you shall be able to:

perform database similarity searches with the BLAST tool and interpret the results;

analyse protein and DNA sequence similarity searches for an unknown sequence obtained after sequencing; and

describe the importance of the BLAST tool in bioinformatics.

9.2 PROCEDURE
Step 1: Open the basic BLAST search page from the following URL:
http://blast.ncbi.nlm.nih.gov/Blast.cgi

From the "Program" menu select the appropriate program (Nucleotide BLAST, Protein BLAST, blastx, tblastn) (Fig. 9.1).

Fig. 9.1: Screenshot showing BLAST page.

2. Open the FASTA format sequence retrieved in Exercise 7 in a text editor as plain text (Fig. 9.2).

Fig. 9.2: Screenshot showing FASTA sequence.



3. Enter the gi number or accession number, or copy the entire sequence and paste it in FASTA format into the search box provided (Fig. 9.3).

Fig. 9.3: Screenshot showing BLAST suite.

5. Make sure you have selected the correct BLAST program and select the nr (non-redundant) database (Fig. 9.4).

Fig. 9.4: Screenshot showing programme selection on BLAST.

7. Note down the default parameter set and click the "BLAST" button (Fig. 9.5).


8. BLAST will tell you it is working on your search.

Fig. 9.5: Screenshot showing algorithm parameters on BLAST.

9. Once your results are computed they will be presented in the window (Fig.
9.6).


10. Copy and save the results and discuss or interpret your results.

9.3 SUMMARY
In the current exercise you have learnt how to use the BLAST tool with different programs such as blastn, blastp, blastx and tblastn to analyse unknown nucleotide and protein sequences obtained after sequencing a sample.

The BLAST tool is widely used in bioinformatics for different applications, such as finding similar sequences and using their similarity scores to understand whether they are closely or distantly related.

From the E-value you will be able to judge whether a match to the query sequence is biologically significant or not; the results also give the source, coding sequence and other information.

9.4 LAB EXERCISES


1. Conduct a blastn search for the given query NM_001297740 and discuss the result.

2. Perform a protein BLAST (blastp) search for the query Chain E, Spike protein S1; copy the result and interpret it.

3. Run a blastx search for the given query FJ436056; tabulate and discuss the results.

4. Execute a tblastn search for the given query PWZ18702 and interpret the results.


Exercise 10
MULTIPLE SEQUENCE
ALIGNMENT USING
CLUSTALW

Structure
10.1 Introduction
    Expected Learning Outcomes
10.2 Procedure
10.3 Understanding Output
10.4 Summary
10.5 Lab Exercises

10.1 INTRODUCTION
In Unit 3 of this course you have learnt about sequence alignment using CLUSTALW; in this exercise, you shall practise performing it. Multiple Sequence Alignment (MSA) is the alignment of three or more biological sequences of similar length. From the output of MSA applications, homology can be inferred and the evolutionary relationship between the sequences can be studied. ClustalW is a free online tool, available through the European Bioinformatics Institute (EBI), that is used to align multiple sequences and generate phylogenetic trees. The improved version of CLUSTAL is Clustal Omega. When you input the sequences to be aligned, Clustal Omega generates a sequence alignment and a rooted phylogram or cladogram.

Expected Learning Outcomes


After performing this exercise you shall be able to:

perform alignment of more than two sequences and find out the similarity between those sequences;

use ClustalW to generate a multiple sequence alignment and construct a phylogenetic tree; and

explain the functional relationship between these aligned sequences.

10.2 PROCEDURE
Step 1: Retrieve three or more required sequences (nucleic acid or protein) from the desired sequence databases. Some example sequences are shown below:

>AAZ67055.1 M protein [Bat SARS CoV Rp3/2004]

MAENGTISVEELKRLLEQWNLVIGFIFLAWIMLLQFAYSNRNRFLYIIKLVFLWL
LWPVTLACFVLAAVYRINWVTGGIAIAMACIVGLMWLSYFVASFRLFARTRSM
WSFNPETNILLNVPLRGTILTRPLMESELVIGAVIIRGHLRMAGHSLGRCDIKD
LPKEITVATSRTLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHSGSND
NIALLVQ

>QHU79197.1 M protein [Human Severe acute respiratory syndrome corona virus 2]

MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFLYIIKLIFLW
LLWPVTLACFVLAAVYRINWITGGIAIAMACLVGLMWLSYFIASFRLFARTRS
MWSFNPETNILLNVPLHGTILTRPLLESELVIGAVILRGHLRIAGHHLGRCDIKD
LPKEITVATSRTLSYYKLGASQRVAGDSGFAAYSRYRIGNYKLNTDHSSSSD
NIALLVQ

>AVN89369.1 M protein [C Middle East respiratory syndrome-related corona virus]

MSNMTQLTEAQIIAIIKDWNFAWSLIFLLITIVLQYGYPSRSMTVYVFKMFVLW
LLWPSSMALSIFSAVYPIDLASQIISGIVAAVSAMMWISYFVQSIRLFMRTGSW
WSFNPETNCLLNVPFGGTTVVRPLVEDSTSVTAVVTNGHLKMAGMHFGAC
DYDRLPNEVTVAKPNVLIALKMVKRQSYGTNSGVAIYHRYKAGNYRSPPITA
DIELALLRA

Step 2: The software tools required for multiple sequence alignment are available at the following URL: https://www.ebi.ac.uk/Tools/msa/clustalo/ (Fig. 10.1).

Fig. 10.1: Showing the CLUSTAL omega homepage.

Step 3: Enter your input sequences: paste a set of nucleic acid or protein sequences in a supported format, or upload a file (Fig. 10.2).

Fig. 10.2: Showing multiple sequence alignment step on CLUSTAL omega.

Step 4: Set your output format and keep the default multiple sequence alignment options (Fig. 10.3).

Fig. 10.3: Showing output parameters window.

Step 5: Submit your job. Running a tool is usually an interactive process; the results are delivered directly to the browser when they become available. To be informed when the results are ready, tick the box "Be notified by email" (Fig. 10.4).

Fig. 10.4: Showing the submit option on CLUSTAL omega.


10.3 UNDERSTANDING OUTPUT


The ClustalW output gives you two main forms of results: the multiple sequence alignment and a phylogram/cladogram.

The score table is the first section of the page, below the results summary box. The score table shows the scores of the pairwise alignments of all sequences (Fig. 10.5).

Fig. 10.5: Showing scores table.
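To get a feel for what a pairwise score expresses, percent identity can be computed by hand from two aligned sequences. The Python sketch below uses one common definition (identical positions divided by positions where neither sequence has a gap); this is an assumption for illustration and may differ from the exact scoring the Clustal tools report. The aligned sequences are made-up toy examples.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns containing a gap
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Made-up aligned sequences ("-" marks a gap), for illustration only.
alignment = {
    "seq1": "MAEN-GTISV",
    "seq2": "MADSNGTITV",
    "seq3": "MSNMTQ-LTE",
}

# Score every pair, as the score table does for all input sequences.
for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
    print(name_a, name_b, round(percent_identity(seq_a, seq_b), 1))
```

With three sequences there are three pairs, which is why a score table for n sequences has n(n-1)/2 distinct entries.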

Take a screenshot of this table, or download it by right-clicking the Output File (.output) found in the result summary box at the top of the page (Fig. 10.6).

Fig. 10.6: Showing how to save the output file.

CLUSTAL omega aligns all of the input sequences; an HTML text version is listed just below the Scores Table. A more extensive view of the alignment can be seen using JalView. A coloured version of an amino acid alignment can also be displayed (Fig. 10.7).


Normal View of Alignment Coloured View of Alignment

In the row below the last sequence of the alignment, there may be symbols such as:

" * " - the residues or nucleotides in that column are identical in all sequences

" : " - conserved substitutions have been observed, according to the colour data

" . " - semi-conserved substitutions are observed

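The " * " rule described above is simple enough to reproduce by hand: a column receives an asterisk only when every sequence carries the same residue there. The Python sketch below marks only these fully identical columns; the " : " and " . " symbols depend on residue similarity groups that are not reproduced here. The function name identity_line and the toy alignment are assumptions made for this illustration.

```python
def identity_line(aligned):
    """Return a string with '*' under columns identical in all sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned):
        # A column of gaps does not count as a conserved position.
        same = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if same else " ")
    return "".join(marks)


# Made-up toy alignment, for illustration only.
toy = ["MAENGT", "MADNGT", "MSENGT"]
for seq in toy:
    print(seq)
print(identity_line(toy))
```

Reading the printed last row against the sequences above it shows which positions are conserved across all three.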
The generated phylogenetic tree is at the very bottom of the results page, under Guide Tree. The tree can be viewed as a phylogram or a cladogram.

A phylogram explicitly represents the number of sequence character changes


through the horizontal branch length. The sum of the horizontal distances
between two leaves is the predicted evolutionary difference in sequences. A
cladogram only depicts branching patterns, not evolutionary time by branch
length (Fig. 10.8 A and B).


Fig. 10.8: A) Phylogram B) Cladogram.
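In a phylogram, the evolutionary difference between two leaves is read by adding the horizontal branch lengths along the path that connects them. The Python sketch below shows that sum on a tiny tree. The tree shape, node names and branch lengths are made-up values for illustration only, not results from Clustal Omega, and leaf_distance is a hypothetical helper.

```python
# Each node maps to (parent, branch length to parent); the root has no entry.
# Made-up tree and branch lengths, for illustration only.
tree = {
    "bat": ("n1", 0.05),
    "human": ("n1", 0.04),
    "n1": ("root", 0.30),
    "mers": ("root", 0.45),
}


def path_to_root(tree, node):
    """Map every ancestor of node (and node itself) to its distance from node."""
    path = {}
    distance = 0.0
    while node is not None:
        path[node] = distance
        parent, length = tree.get(node, (None, 0.0))
        distance += length
        node = parent
    return path


def leaf_distance(tree, a, b):
    """Sum of branch lengths on the path between leaves a and b."""
    up_a = path_to_root(tree, a)
    up_b = path_to_root(tree, b)
    # Dictionaries keep insertion order, so the first shared node found
    # while walking up from a is the most recent common ancestor.
    ancestor = next(node for node in up_a if node in up_b)
    return up_a[ancestor] + up_b[ancestor]


print(round(leaf_distance(tree, "bat", "human"), 2))
print(round(leaf_distance(tree, "bat", "mers"), 2))
```

A cladogram drawn from the same tree would show the same branching order but would not carry these lengths, so no such distance could be read from it.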

10.4 SUMMARY
Multiple sequence alignment is the alignment of three or more biological sequences of similar length.

From the output of MSA applications, homology can be inferred and the evolutionary relationship between the sequences can be studied.

In the current exercise you have learnt to use the multiple sequence alignment tool Clustal Omega to analyse evolutionary relationships among sequences and to interpret relationships among the sequences or organisms through a phylogenetic tree.

10.5 LAB EXERCISES


1. Retrieve any three or more protein sequences from a protein database, copy the sequences in FASTA file format, align the sequences with each other using Clustal Omega and report the pairwise scores.
