Unit 3
SEQUENCE ALIGNMENT
Structure
3.1 Introduction 3.4 Alignment Scoring Matrices
Expected Learning Outcomes PAM
3.1 INTRODUCTION
In the previous unit, you learned about the sequences and structures of
proteins and nucleic acids, along with biological databases. As you know,
amino acids are the building blocks of proteins. Any language has an
alphabet, and arranging its letters in the proper order forms words and
sentences with meaning. Language helps us to communicate with each other and
to update our knowledge. Similarly, different arrangements of amino acids
produce the numerous functional proteins, enzymes, receptors, etc. found in
biological systems. The order of amino acids in proteins and of nucleotides in
genes plays a major role in their proper functioning. It is interesting to
know that specific protein sequences remain the same in many organisms,
whereas a few substitutions, insertions, or deletions may give rise to a
mutated protein. If a sequence is exactly the same in two different organisms,
the protein function is very likely to be the same as well. There are various
tools and software available to compare these sequences.
BBCS-185 Bioinformatics Skill Enhancement Course
In both animals and plants, several proteins and enzymes are involved in
biochemical pathways, signaling pathways, and other functions. Comparing the
protein or gene sequences of one organism with those of another organism or
species is called sequence comparison. In this unit, you will learn about
sequence alignment, its types, and the algorithms involved in it. Along the
way, you will come across new terms, software, and tools. You will also learn
the applications of sequence alignment to proteins and nucleic acids, which
are essential in the field of biology and allied subjects.
A M I N O A C I D   Seq-1
| | | | | | | | |
A M I N O A C I D   Seq-2
The above example shows two words that match at every letter; the first word
is named Seq-1 and the second Seq-2. It is a simple example of how a sequence
of letters forms a word. Now, let us see the similarity of sequences in the
next subsection.
In the above sequence format, you can see the name of the enzyme,
organism, and Genbank ID at the top of the sequence. The sequence starts
A T C G G C   Seq-1
| | | |
A T C G C G   Seq-2
Seq-1 and Seq-2 are not the same, but we can call them similar: the first
four positions match, and only the last two positions are mismatched (G/C and
C/G).
For instance, in proteins, a mismatch between amino acids that have similar
chemical and physical properties often does not alter the functionality of the
protein. In Fig. 3.2 you can find an example of a sequence alignment of the
protein histone H1 among different mammals. The amino acids that are constant
throughout the alignment are called conserved residues, and the amino acids
that vary in the alignment are referred to as non-conserved residues.
SAQ 1
i) Which type of sequences are available in NCBI database?
ii) Which mammalian sequence is more similar to human histone
sequence?
iii) What do you mean by conserved residues in a multiple sequence
alignment?
There are various tools and software available to calculate sequence identity
over the length of sequences. Among them, BLAST is a powerful and widely used
tool. You will learn how to perform a BLAST search using online tools in
exercise 9 of this course.
In the given example (Fig. 3.3), the DNA polymerase sequence of Hepatitis B
virus is taken as the query sequence and aligned with a subject sequence (a
sequence from the database). When the query was subjected to alignment, the
query and subject sequences showed similarity and identity percentages of 98%
and 97%, respectively. You can observe that a few amino acids do not match
exactly between the upper and lower sequences. The big boxes mark regions of
higher sequence identity than the small boxes; the amino acids in the small
boxes are neither identical nor similar. You can also observe some gaps
between the sequences. The alignment of sequences is carried out using various
matrices and algorithms. Sequence identity plays a major role in evolutionary
tree generation. It helps in understanding the ancestry of a specific species
and its relationship with other organisms. Sequence identity is also essential
for acquiring information about the working mechanisms of various proteins,
enzymes, receptors, and cellular responses.
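The identity percentage discussed above can be illustrated with a short Python sketch. It assumes the two sequences are already aligned (equal length, with "-" marking gaps) and divides the number of identical columns by the alignment length; this is one common convention, and alignment tools may use a different denominator. The function name and the example fragments are hypothetical and are not taken from Fig. 3.3.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Return percent identity over the columns of two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    # A column counts as identical only if both residues match and neither is a gap.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)


# Hypothetical aligned fragments, for illustration only.
query = "MPLSYQHFRK"
subject = "MPLSYQHFRR"
print(f"{percent_identity(query, subject):.1f}% identity")
```

Similarity percentages are computed in the same way, except that columns containing chemically similar residues are also counted, which requires a substitution matrix.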
SAQ 2
i) What is sequence identity?
You are advised to watch the video at the following YouTube link to know more
about sequence similarity:
https://www.youtube.com/watch?v=K6ldxHPzI5A
SAQ 3
i) Write the differences between similarity and homology.
Seq1 AVLTSHYILRS - 11
     |||||||||||
Seq2 AVLTSHYILRS - 11
Different methods are used for the pairwise alignment of nucleotide and
protein sequences. Let us learn them one by one:
G G A A T
Seq-1 G G A A T
| | | | |
Seq-2 G G A A T
If any nucleotide or amino acid does not match, a gap is seen in the
alignment.
As you know, most gene sequences are very long. In such cases, the plot
appears as in Fig. 3.6. In this figure, the X-axis is Seq1 and the Y-axis is
Seq2, with the total number of amino acids in both the sequences being 200.
Fig. 3.6: Dotplot of amino acid Seq1 and Seq2 with 200 amino acid residues.
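A dot plot of the kind described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the simplest rule (a dot wherever the two residues are identical, with no window or threshold filtering); the function name `dot_plot` is hypothetical.

```python
def dot_plot(seq_x: str, seq_y: str) -> list[str]:
    """Return the dot matrix as text rows: '*' for a match, '.' otherwise."""
    rows = []
    for y in seq_y:  # one row per residue of the Y-axis sequence
        rows.append("".join("*" if x == y else "." for x in seq_x))
    return rows


seq1 = "GGAAT"  # X-axis
seq2 = "GGAAT"  # Y-axis
print("  " + seq1)
for residue, row in zip(seq2, dot_plot(seq1, seq2)):
    print(residue + " " + row)
```

Identical sequences give an unbroken main diagonal of dots. Dots off the diagonal come from repeated residues, which is why plots of long sequences are usually filtered with a sliding window.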
We are going to study various alignments like local and global alignment in the
next sections of this unit.
To know more about the topic you are advised to visit the following video links:
https://www.youtube.com/watch?v=S07kIY2ihq8
https://www.youtube.com/watch?v=TZaA_-4j19w
SAQ 4
i) What is MSA?
These algorithms can deal with sequences that are quite different, but, as in
the pairwise case, when the sequences are very different they might have
problems creating a good alignment. A good alignment should align the
homologous positions or the positions with the same structure or function.
Global Alignment: In the sequence analysis of proteins or genes, global
alignment is most suitable for sequences of about the same length. The
alignment is performed from the beginning to the end of the sequences, so gaps
may be introduced during the alignment process.
The Needleman-Wunsch algorithm (an algorithm is a formula or set of steps to
solve a problem) was developed by Saul B. Needleman and Christian D. Wunsch in
1970. It is a dynamic programming algorithm for the global alignment of
nucleotide or protein sequences. It was the first algorithm of its kind for
aligning two protein sequences and the first application of dynamic
programming to biological sequence analysis. The Needleman-Wunsch algorithm
finds the best-scoring global alignment between the two sequences. Global
alignments are most useful when the two sequences being compared are of
similar length and not too divergent.
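The dynamic programming idea behind global alignment can be sketched as follows. This is a minimal illustration, not the original 1970 formulation: it assumes a simple scoring scheme (match +1, mismatch -1, and a linear gap penalty of -1 per gap position), and all of these values are assumptions chosen for the example.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])  # residue of a aligned to a gap
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")  # residue of b aligned to a gap
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ATCGGC", "ATCGCG")
print(best)
print(top)
print(bottom)
```

Because the traceback always starts in the last cell and ends in the first, every residue of both sequences appears in the result, which is what makes the alignment global.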
Local Alignment: Sequences that differ in length, or that are dissimilar
overall, can still be compared by local alignment. Local alignment finds the
regions of highest similarity within the sequences.
Both alignment methods are implemented by different algorithms, and both use
scoring matrices to align two different series of characters or patterns
(sequences). Global and local alignment methods both rely on dynamic
programming to align two sequences. Many proteins exhibit modular
architectures. In searching databases for similar sequences, it is therefore
useful to find sequences that share similar domains or functional motifs.
Smith & Waterman (1981) published an application of dynamic programming to
find optimal local alignments. The algorithm is similar to Needleman-Wunsch,
except that negative cell values are reset to zero, and the traceback
procedure starts from the highest-scoring cell, anywhere in the matrix, and
ends when the path encounters a cell with a value of zero.
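The two changes that turn global alignment into local alignment (resetting negative cells to zero, and tracing back from the highest-scoring cell until a zero is reached) can be sketched as below. The scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for the example and are not taken from the 1981 paper.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Negative values are reset to zero, so an alignment can start anywhere.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two otherwise unrelated sequences that share the motif GGAAT.
print(smith_waterman("TTTGGAATCCC", "AAGGAATAA"))  # (10, 'GGAAT', 'GGAAT')
```

Only the shared region is reported; the unrelated flanking residues are left out, which is why local alignment suits database searches for common domains or motifs.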
If we take a small fragment of one sequence as the target and align it with a
small region of the other sequence, it is a local alignment. Similarly,
performing the alignment throughout the complete length of the sequences is
known as global alignment (Fig. 3.8 and 3.9).
(Source: https://www.researchgate.net/figure/Global-alignment-vs-Local-alignment_fig1_322704711)
(Source: https://www.majordifferences.com/2016/05/difference-between-global-and-local.html)
Gap penalty
(Source: https://www.differencebetween.com/difference-between-transition-and-vs-transversion/)
For protein sequence alignments, the scoring matrices are more complicated.
The goal is to reflect evolutionary processes. Some amino acid sequence
changes can arise from a single nucleotide change, whereas other amino acid
changes require two nucleotide changes. Some amino acid changes are less
likely to affect protein structure or function than other amino acid changes.
SAQ 5
What is the use of alignment scoring matrix?
IAGCW
IAGCT
I IGCT
Dayhoff constructed phylogenetic trees from the sequences and counted the
substitutions along the branches of each tree (Fig. 3.11). Such a tree
minimizes the number of changes in the sequence matrix. To know more about the
PAM concept, watch the video: https://www.youtube.com/watch?v=8avcQRxaLBw
BLOSUM matrices differ slightly from PAM matrices: they are based on observed
alignments and are not extrapolated from comparisons of closely related
proteins. Scoring plays a major part in alignment. For nucleotide sequences,
all matches are given the same score and all mismatches the same penalty
(typically +1 or +5 for matches, and -1 or -4 for mismatches). But it is
different for proteins. Substitution matrices for amino acids are more
complicated than those for nucleotides, because they must reflect everything
that might affect the frequency with which any amino acid is substituted for
another. The objective is to provide a relatively heavy penalty for aligning
two residues
Here, p_ij is the probability of two amino acids i and j replacing each other
in a homologous sequence, and q_i and q_j are the background probabilities of
finding the amino acids i and j in any protein sequence. The scaling factor
(conventionally written λ) is set such that the matrix contains easily
computable integer values.
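The definitions above match the standard log-odds form of a substitution score. As a reconstruction, assuming the usual BLOSUM notation (the symbols S_ij for the score and λ for the scaling factor are assumptions, since the equation itself does not appear in the text above), the score can be written as:

```latex
S_{ij} = \frac{1}{\lambda}\,\log\!\left(\frac{p_{ij}}{q_i\,q_j}\right)
```

A positive score means that the pair i, j is observed in homologous sequences more often than expected by chance, a score of zero means it occurs at the chance frequency, and a negative score means it occurs less often than expected.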
BLOSUM62: midrange
There are various online tools and software available for sequence alignment
with BLOSUM matrices as the weight matrix.
Clustal W is a well-known online sequence alignment tool. You can browse the
following link, https://www.genome.jp/tools-bin/clustalw, to find the BLOSUM
matrix option, as shown in Fig. 3.13. While performing exercise 10, you will
learn more about multiple sequence alignment using Clustal W.
Fig. 3.13: The Clustal W online tool, with a parameter section offering the
BLOSUM matrix for pairwise and multiple sequence alignments.
The sequence alignment tools and software reduce the time required and enhance
the effectiveness of analysis. Alignment analysis provides the information
needed to make a proper decision on how to proceed in understanding protein or
gene function, or the relationship of sequences with one another. In this
section, we will get to know more about online software based on alignment
types. Watch the video at the link provided to know more about this topic:
https://www.youtube.com/watch?v=uGhZygAMQik
1. Nucleotide BLAST
2. Protein BLAST
3. BLASTx
4. tBLASTn
Fig. 3.14: BLAST home webpage with Nucleotide, Protein, blastx and tblastn
links on it.
3. BLASTx (translated nucleotide sequence searched against protein
sequences): Compares a nucleotide query sequence that is translated in six
reading frames (resulting in six protein sequences) against a database of
protein sequences.
Because blastx translates the query sequence in all six reading frames and
provides combined significance statistics for hits to different frames, it is
particularly useful when the reading frame of the query sequence is unknown
or it contains errors that may lead to frame shifts or other coding errors. Thus,
BLASTx is often the first analysis performed with a newly determined
nucleotide sequence.
4. tBLASTn (protein sequence searched against translated nucleotide
sequences): Compares a protein query sequence against the six-frame
translations of a database of nucleotide sequences. tblastn is useful for
finding homologous protein-coding regions in unannotated nucleotide
sequences such as expressed sequence tags (ESTs) and draft genome records
(HTG), located in the BLAST databases.
ESTs are short, single-read cDNA (complementary DNA) sequences. They comprise
the largest pool of sequence data for many organisms and contain portions of
transcripts from many uncharacterized genes. Since ESTs have no annotated
coding sequences, there are no corresponding protein translations in the
BLAST protein databases. Hence, a tblastn search is the only way to search for
these potential coding regions at the protein level. The HTG sequences, draft
sequences from various genome projects or large genomic clones, are another
large source of unannotated coding regions.
Apart from the above BLAST types, a few more BLAST programs are available for
standalone systems as well as cloud-based platforms. Some of them are as
follows:
1. SmartBLAST: To find proteins highly similar to the query sequence.
2. Primer-BLAST: To design primers specific to a given PCR (polymerase chain
reaction) template.
3. Global Align: To compare two sequences across their entire length with the
Needleman-Wunsch algorithm.
4. CD Search: To find the conserved domains in the given sequence.
5. IgBLAST: To analyse immunoglobulin and T-cell receptor sequences.
6. VecScreen: To search sequences for vector contamination. This tool is
used for molecular biology experiments.
7. CDART: To find sequences with similar conserved domain architecture. You
have to remember the difference between CD Search and CDART in this case.
8. Multiple Alignment: To align sequences using domain and protein
constraints.
9. MOLE-BLAST: To establish taxonomy for uncultured or environmental
sequences (Fig. 3.15).
All the above tools are available at https://blast.ncbi.nlm.nih.gov/Blast.cgi
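The six-frame translation that blastx and tblastn rely on, described above, can be sketched in Python. This is a minimal illustration and not the BLAST implementation: it assumes the standard genetic code, writes stop codons as "*" and unrecognised codons as "X", and uses hypothetical function names and frame labels (+1 to +3 for the given strand, -1 to -3 for the reverse complement).

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the first base; incomplete codons are dropped."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict[str, str]:
    """Return the protein translation of each of the six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCATTGTAATGGGCCGCTGA").items():
    print(frame, protein)
```

Each of the six protein strings is then compared with protein sequences, which is why a frame shift in the nucleotide sequence moves a hit from one frame to another instead of hiding it.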
SAQ 6
i) What is BLAST?
Watch the YouTube video at the following link to know more:
https://www.youtube.com/watch?v=jDHrHfx0cpw
3.5.2 Clustal W
So far, you have studied the BLAST programs, which find sequences similar to a
query sequence within specific databases. The simultaneous alignment of
multiple nucleotide or amino acid sequences is now an essential tool in
molecular biology. Multiple sequence alignments are used for the following:
i) To find diagnostic patterns that characterise protein families.
ii) To detect or demonstrate homology between new sequences and existing
families of sequences.
iii) To help predict the secondary and tertiary structure of new sequences,
and to suggest oligonucleotide primers for PCR (Polymerase Chain Reaction).
iv) As an essential prelude to molecular evolutionary analysis.
There are many variations of the Clustal software; a few are listed below:
ClustalV: The second generation of the Clustal software was released in 1992
and was a rewrite of the original Clustal package. It introduced phylogenetic
tree reconstruction on the final alignment, the ability to create alignments
from existing alignments, and the option to create trees from alignments using
a method called Neighbor-joining.
ClustalW: The third generation, released in 1994, greatly improved upon
previous versions. It improved the progressive alignment algorithm in various
ways, allowing individual sequences to be weighted down or up according to
similarity or divergence, respectively, in a partial alignment. It also
included the ability to run the program in batch mode from the command line.
ClustalX: This version, released in 1997, was the first to have a graphical
user interface.
Clustal 2: The updated versions of both ClustalW and ClustalX, with higher
accuracy and efficiency.
3.6 SUMMARY
In this unit, we have studied the basics of sequence alignment along with the
programs or tools used to perform sequence alignment.
BLAST (Basic Local Alignment Search Tool) is a basic alignment tool, and
several variants exist depending on the type of query and the database
searched.
3.8 ANSWERS
Self Assessment Questions
1. i) The NCBI database contains genome sequences of plants, animals, fungi,
and bacteria, as well as protein and gene sequences, etc.
ii) Chimp
iii) Four
Terminal Questions
1. i) Sequence alignment plays a major role in identifying ancestors.
3.3.1).
Video: https://www.youtube.com/watch?v=K6ldxHPzI5A
https://www.youtube.com/watch?v=A4JrzGon8mQ
Exercise 7
MOLECULAR FILE FORMATS -
FASTA, GENBANK,
GENPEPT, GCG, CLUSTAL,
SWISS-PROT, PIR
Structure
7.1 Introduction 7.2 Procedure
7.1 INTRODUCTION
In this exercise, you will practice downloading sequences in different file
formats such as FASTA, GenBank, GenPept, GCG, CLUSTAL, SWISS-PROT, and PIR,
which are maintained by different biological databases and used for sequence
analysis. You have learned about biological databases in Unit 2 of this
course. The major objective of this exercise is to familiarize you with the
various file formats that are regularly used in bioinformatics.
FASTA format
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLN
GSYSEN
NCBI maintains the GenBank and GenPept formats, which store information on DNA
and protein sequences, respectively. It is easy to find all the basic
information about a sequence, such as the source organism, the authors who
sequenced it, coding information, and other details, from a GenBank or GenPept
record. The GenBank or GenPept sequence format (GenBank Flat File Format)
consists of three parts: the header, the features, and the sequence. The start
of the annotation section (header and features) is marked by a line beginning
with the word "LOCUS". The start of the sequence section is marked by a line
beginning with the word "ORIGIN", and the end of the section is marked by a
line containing only "//". The header section consists of initial and basic
information; the feature section consists of source, CDS, gene, and RNA
features; the actual sequence starts after ORIGIN.
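The three markers described above ("LOCUS", "ORIGIN", and "//") are enough to separate the annotation from the sequence. The following Python sketch assumes a single record per input; the function name and the example record are hypothetical and greatly simplified compared with a real GenBank entry.

```python
def split_genbank_record(text: str):
    """Split one GenBank flat-file record into (annotation_lines, sequence)."""
    annotation, sequence_parts = [], []
    in_annotation = in_sequence = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            in_annotation = True  # annotation section starts here
        if line.startswith("ORIGIN"):
            in_annotation, in_sequence = False, True  # sequence section starts
            continue
        if line.strip() == "//":
            break  # end of the record
        if in_sequence:
            # Sequence lines carry position numbers and spaces; keep letters only.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif in_annotation:
            annotation.append(line)
    return annotation, "".join(sequence_parts).upper()


# Hypothetical, heavily shortened record for illustration only.
record = """LOCUS       EXAMPLE01                 12 bp    DNA     linear   UNK
DEFINITION  Hypothetical record for illustration only.
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 atgcatgcat gc
//
"""
annotation, sequence = split_genbank_record(record)
print(len(annotation), "annotation lines")
print(sequence)
```

For real work, an established parser is usually preferable to hand-written code, because actual records contain many more fields and continuation lines.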
PIR format
b. a two-letter code describing the sequence type (P1, F1, DL, DC, RL,
RC, or XX), followed by
c. a semicolon, followed by
One or more lines contain the sequence itself. The end of the sequence is
marked by a "*" (asterisk) character.
Optionally, this can be followed by one or more lines describing the sequence.
A file in PIR format may comprise more than one sequence. The PIR format is
also often referred to as the NBRF (National Biomedical Research Foundation)
format.
>P1;CRAB_CHICK
>P1;CRAB_HUMAN
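Reading such entries can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the text after the semicolon on the ">" line is taken as the sequence identifier, exactly one description line is assumed to follow the header line, and the sequence runs until the terminating "*". The function name and the example entries are hypothetical; the sequences are made up and are not the real CRAB_CHICK or CRAB_HUMAN sequences.

```python
def parse_pir(text: str) -> list[dict]:
    """Parse PIR/NBRF text into dicts with type, id, description and sequence."""
    entries = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue  # skip anything between entries
        seq_type, _, identifier = line[1:].partition(";")
        # Assumption: exactly one description line follows the header line.
        description = next(lines, "").strip()
        residues = []
        for seq_line in lines:
            cleaned = seq_line.strip().replace(" ", "")
            residues.append(cleaned.rstrip("*"))
            if cleaned.endswith("*"):
                break  # the asterisk marks the end of the sequence
        entries.append(
            {
                "type": seq_type,
                "id": identifier.strip(),
                "description": description,
                "sequence": "".join(residues),
            }
        )
    return entries


# Hypothetical entries for illustration only.
example = """>P1;EXAMPLE_1
Hypothetical protein one
MDITIHNPLI RRPLFSW*
>P1;EXAMPLE_2
Hypothetical protein two
MDIAIHHPWI
RRPFFPF*
"""
for entry in parse_pir(example):
    print(entry["type"], entry["id"], len(entry["sequence"]))
```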
SWISS-PROT
CLUSTAL
7.2 PROCEDURE
Step 1: Open the GenBank website from the following URL
Step 2: Type the sequence name or sequence ID or relevant text in the text
box or enter any keyword (Fig. 7.2).
A)
B)
5. Copy and save the required protein or nucleotide sequence for further
analysis (Fig. 7.6).
7.3 SUMMARY
You have learned about biological databases in theory unit 2, and the
data will be retrieved and viewed in different formats.
2. Search for a Covid-related protein sequence in NCBI and download any one
sequence in GenPept format.
Exercise 8
MOLECULAR VIEWER BY
VISUALIZATION
SOFTWARE: PYMOL
Structure
8.1 Introduction 8.3 Summary
8.2 Procedure
8.1 INTRODUCTION
In the previous exercises you have learned how to access databases for
protein and nucleic acid structures. However, to analyse these structures we
need to view them using certain tools that are known as visualisation tools or
software.
In this exercise, you will learn to use the PyMOL program to visualize 3-D
structures of molecules. PyMOL is a powerful tool for viewing and analyzing
the structures of proteins, DNA, and other macromolecules. It is a stand-alone
molecular visualization program based on Python. PyMOL is used to generate
high-quality molecular graphics images and animations, such as those used in
journal publications describing new macromolecular structures and
interactions. PyMOL was developed by Warren DeLano. It is open source, but not
free in all forms: students and educators can use a current free version in
the classroom, anybody can obtain outdated binary releases, and certain Linux
distributions provide PyMOL packages built from the open-source code.
8.2 PROCEDURE
Step 1: Download PyMOL from the website
(http://www.pymol.org/educational), register as a student from the link at the
will eventually send you a link with a username and password. This allows you
to download the software for your Personal Computer or Mac system and
follow the instructions to install the software.
Step 3: By double clicking on the PyMOL icon on your desktop PyMOL brings
up two Windows.
ii)
area where molecules will be displayed. The bottom window also contains
molecular objects
once you have loaded a protein structure. The bottom of this GUI has a
matrix displaying the current mouse configuration, namely what mouse
button combinations control which functions (Fig. 8.2).
Step 5: To change the representation of the molecule, the right side of the
Viewer shows the object control panel.
un-display the
corresponding molecule(s) (temporarily invisible).
Step 6: To change the background color to white follow this menu cascade
(Fig. 8.3):
molecule.
Step 7: To save the image in the present view follow the options File > Save
PyMOL
saved as a PNG image (Fig. 8.4).
Fig. 8.4: Screenshot Showing how to save and rename the image.
Step 8: You can use the command line for commands such as Save, Viewport,
Zoom, Ray, and Select. To learn how to execute these commands, follow the
additional study material links provided at the end of this exercise, and also
practice the other options in detail.
8.3 SUMMARY
PyMOL is a powerful tool to visualize and analyze the structures of
proteins, DNA, and other biological molecules.
You have learned how to view the 3-D structure of proteins in different
poses.
Images and structures can be used to generate high-quality molecular
graphic images and animations used for journal publications.
https://bioquest.org/nimbios2010/wp-content/blogs.dir/files/2010/07/pymol_tutorial3.pdf
https://sites.pitt.edu/~epolinko/IntroPyMOL.pdf
Exercise 9
BLAST SUITE OF TOOLS FOR
PAIRWISE ALIGNMENT
Structure
9.1 Introduction 9.3 Summary
9.2 Procedure
9.1 INTRODUCTION
In Unit 3 of this course, you learnt about sequence similarity searching using
the Basic Local Alignment Search Tool (BLAST), and in the previous exercises 5
and 7 you practiced sequence retrieval. Those sequences will be used in this
exercise to perform database similarity searches with BLAST. BLAST is an
algorithm for comparing primary biological sequence information, such as the
amino-acid sequences of different proteins or the nucleotides of DNA
sequences. A BLAST search enables a researcher to compare a query sequence
(the unknown sequence in question) with a library or database of sequences,
and to identify library sequences that are similar to the query.
There are many different types of BLAST available from the BLAST web page.
Selecting the required one depends on the type of sequence you are searching
with and on the desired database. Different types of BLAST are given below:
9.2 PROCEDURE
Step 1: Open the basic BLAST search page from the following URL:
http://blast.ncbi.nlm.nih.gov/Blast.cgi
2. Open your FASTA format sequence, retrieved in exercise 7, in a text editor
as plain text (Fig. 9.2).
3. Enter the gi number or accession number, or copy the entire sequence in
FASTA format and paste it in the search box provided (Fig. 9.3).
5. Make sure you have selected the correct BLAST program and select nr
(non redundant) database (Fig. 9.4).
7. Write down default parameter set and click the "BLAST button" (Fig. 9.5).
9. Once your results are computed they will be presented in the window (Fig.
9.6).
10. Copy and save the results and discuss or interpret your results.
9.3 SUMMARY
In the current exercise, you have learnt how to use the BLAST tool with
different programs, such as blastn, blastp, blastx and tblastn, for the
analysis of unknown nucleotide and protein sequences obtained after
sequencing a sample.
2. Perform protein blast (blastp) for the query Chain E, Spike protein S1
copy the result and interpret.
3. Run a blastx search for the given query FJ436056; tabulate and discuss the
results.
4. Execute the tblastn search for the given query PWZ18702 and interpret
the results.
Exercise 10
MULTIPLE SEQUENCE
ALIGNMENT USING
CLUSTALW
Structure
10.1 Introduction 10.3 Understanding Output
10.1 INTRODUCTION
In Unit 3 of this course, you learnt about sequence alignment using CLUSTALW;
in this exercise, you will practice performing it. Multiple Sequence Alignment
(MSA) is the alignment of three or more biological sequences of similar
length. From the output of MSA applications, homology can be inferred and the
evolutionary relationship between the sequences can be studied. ClustalW is a
free online tool, available through the European Bioinformatics Institute
(EBI), that is used to align multiple sequences and generate phylogenetic
trees. The improved version of CLUSTAL is Clustal Omega. If you input the
desired sequences to align, Clustal Omega generates a sequence alignment and a
rooted phylogram or cladogram.
perform alignment of more than two sequences and find out the
similarity between those sequences;
10.2 PROCEDURE
STEP 1- Retrieve required sequences (Nucleic acid or Protein) three or more
from desired sequence databases. Some example sequences are shown
below:
MAENGTISVEELKRLLEQWNLVIGFIFLAWIMLLQFAYSNRNRFLYIIKLVFLWL
LWPVTLACFVLAAVYRINWVTGGIAIAMACIVGLMWLSYFVASFRLFARTRSM
WSFNPETNILLNVPLRGTILTRPLMESELVIGAVIIRGHLRMAGHSLGRCDIKD
LPKEITVATSRTLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHSGSND
NIALLVQ
MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFLYIIKLIFLW
LLWPVTLACFVLAAVYRINWITGGIAIAMACLVGLMWLSYFIASFRLFARTRS
MWSFNPETNILLNVPLHGTILTRPLLESELVIGAVILRGHLRIAGHHLGRCDIKD
LPKEITVATSRTLSYYKLGASQRVAGDSGFAAYSRYRIGNYKLNTDHSSSSD
NIALLVQ
MSNMTQLTEAQIIAIIKDWNFAWSLIFLLITIVLQYGYPSRSMTVYVFKMFVLW
LLWPSSMALSIFSAVYPIDLASQIISGIVAAVSAMMWISYFVQSIRLFMRTGSW
WSFNPETNCLLNVPFGGTTVVRPLVEDSTSVTAVVTNGHLKMAGMHFGAC
DYDRLPNEVTVAKPNVLIALKMVKRQSYGTNSGVAIYHRYKAGNYRSPPITA
DIELALLRA
STEP 2 - The software tools required for multiple sequence alignment are
available at the following URL: https://www.ebi.ac.uk/Tools/msa/clustalo/
(Fig. 10.1).
STEP 3 - Enter your input sequences: paste a set of nucleic acid or protein
sequences in a supported format, or upload a file (Fig. 10.2).
STEP 4 - Set your output format and keep the default multiple sequence
alignment options (Fig. 10.3).
The score table is the first section of the page, below the results summary
box. The score table shows the scoring of the pairwise alignment of all
sequences (Fig. 10.5).
Take a screenshot of this table, or download it by right-clicking the Output
File (.output) found in the results summary box at the top of the page
(Fig. 10.6).
Clustal Omega aligns all of the input sequences; an HTML text version is
listed just below the Scores Table. A more extensive view of the alignment can
be seen using JalView. Under a
a coloured version of an amino acid alignment (Fig. 10.7).
In the row below the last sequence of the alignment, there may be symbols
like:
" : " conserved substitutions have been observed, according to the colour
data
The generated phylogenetic tree is at the very bottom of the results page.
there
Guide Tree. The tree can be viewed as a phylogram or a cladogram.
A)
B)
10.4 SUMMARY
Multiple Sequence alignment is aligning of three or more biological
sequences of similar length.
From the output of MSA applications, homology can be inferred and the
evolutionary relationship between the sequences studied.
In the current exercise you have learnt to use the multiple sequence
alignment tool Clustal Omega for analysing evolutionary relationships
among sequences and interpret relationships among the sequences or
organisms through a phylogenetic tree.