BBCS 185
BIOINFORMATICS
Indira Gandhi
National Open University
School of Sciences
BLOCK 1
BIOINFORMATICS
Programme and Course Design Committee
Prof. Bechan Sharma, Dept. of Biochemistry, University of Allahabad
Prof. Ranjit K. Mishra, Dept. of Biochemistry, University of Lucknow
Prof. Reena Gupta, Dept. of Biotechnology, H.P. University, Shimla
Prof. D.V. Devaraju, Dept. of Biochemistry, Bangalore University
Prof. Sanjeev Puri, UIET, Panjab University
Prof. Seemi Farhat Basir, Dept. of Bio Sciences, Jamia Milia Islamia
Dr. Sunita Joshi, Dept. of Biochemistry, Daulat Ram College, University of Delhi
Prof. Vijayshri, Former Director, School of Sciences, IGNOU, New Delhi
Faculty Members (IGNOU)
Dr. Parvesh Bubber, Biochemistry, SOS
Dr. M. Abdul Kareem, Biochemistry, SOS
Dr. Arvind Kumar Shakya, Biochemistry, SOS
Dr. Maneesha Pandey, Biochemistry, SOS
Dr. Seema Kalra, Biochemistry, SOS
This course is broadly divided into 3 theory units and 10 practical (hands-on)
exercises. We have arranged the content so that it provides the theoretical
background on each topic, followed by hands-on experience through exercises.
This is one of the stand-alone courses in your program that relies heavily on
computer skills. For this reason, we have designed a few exercises in which
you will learn the fundamentals of Microsoft Office and frequently used internet-based
terminology, along with their applications. All the software, tools and applications
described in this course are freely available on the internet. Hence, you are advised
to go through the course content, watch the corresponding video links provided,
and follow the instructions given to perform the exercises. A thorough
understanding of this course will help you build a career in the field of
computational biology. Since bioinformatics is an emerging, interdisciplinary field,
researchers from many disciplines, including mathematics, physics, basic and applied
biology, computer science, and statistics, have contributed to the
development of this subject. Bioinformatics has vast applications in the fields of
medicine and pharmacy, with special emphasis on drug design and development.
We believe that after completing all these exercises you will be able to exhibit
the fundamental bioinformatics skills expected from an undergraduate learner.
perform activities like exploring biological databases and retrieving the information
present in them;
distinguish between different biological databases;
download and use the various file formats for the purpose of performing bioinformatics
exercises;
Block
1
BIOINFORMATICS
UNIT 1
Introduction to Bioinformatics
Exercise 1: Microsoft Office - Word, Excel, PowerPoint
Exercise 2: Introduction to Internet: LAN, WAN, Web Browsers, Search Engines
Exercise 3: Basics of Electronic Mail, Creating an Email Account, Sending
and Receiving Email
UNIT 2
Biological Databases and Data Retrieval
Exercise 4: Databases: NCBI, PDB, SCOP, PubMed, GenBank, UniProt
Exercise 5: Retrieval of Protein and Gene Sequences from NCBI
Exercise 6: Accessing Protein Structure from PDB
UNIT 3
Sequence Alignment
Exercise 7: Molecular File Formats - FASTA, GenBank, GenPept, GCG,
CLUSTAL, Swiss-Prot, PIR
Exercise 8: Molecular Viewer by Visualization Software: PyMOL
Exercise 9: BLAST Suite of Tools for Pairwise Alignment
Exercise 10: Multiple Sequence Alignment Using CLUSTALW
Unit 1 Introduction to Bioinformatics
UNIT 1
INTRODUCTION TO
BIOINFORMATICS
Structure
1.1 Introduction
Expected Learning Outcomes
1.2 Basics of Computer Operations
1.3 Internet Usage
1.4 Microsoft Office Basics
1.5 Historical Background
1.6 Role of Supercomputers in Biology
1.7 Scope of Bioinformatics
1.8 Applications of Bioinformatics
1.9 Programming Languages in Bioinformatics
1.10 Important Terms used in Bioinformatics
1.11 Summary
1.12 Terminal Questions
1.13 Answers
1.14 Further Readings
1.1 INTRODUCTION
Biological data are being produced at an enormous rate. Managing and
interpreting these data is a great challenge for biologists. Computers are being
used to collect, store, retrieve, analyze and integrate biological and genetic
information. The stored biological data can then be used for the prediction of
disease, drug discovery, biomarker identification, disease diagnosis,
patient survival analysis, and so on. The vast application of computers for
handling biological data led to the development of a new field of study, known
as 'Biological Informatics', or 'Bioinformatics'.
SAQ 1
i) Why is bioinformatics called an interdisciplinary field of science?
define bioinformatics;
A computer has four basic operations: Input, Processing, Storage, and Output
(Fig. 1.2).
Protein Data Bank (PDB): It stores and provides 3D structure data for large
biological molecules such as proteins, DNA, and RNA. URL: https://www.rcsb.org/
BBCS-185 Bioinformatics Skill Enhancement Course
Features of MS Office
Microsoft Word
Text formatting such as defining font size, font type, font styles, color,
etc.
Microsoft Excel
Drag and Drop feature helps us reposition data and text by simply
dragging the data using a mouse.
Microsoft PowerPoint
SAQ 2
i) What are the basic computational operations?
You will learn the details while performing the hands-on sessions provided.
c. Drug discovery & designing: This field of study deals with screening small
compounds/ligands that have the potential to be used as drugs for
disease treatment and prevention, using in-silico methods.
Two main approaches are adopted in computer-aided drug design (CADD): i)
structure-based and ii) ligand-based. In a structure-based approach, the target
receptor is treated as a fixed structure whose binding cavities are
identified, and ligands (small molecules/drug candidates) are docked into them. In
ligand-based approaches, the structure of the target receptor is not known,
so the receptor is treated as flexible; that is, the binding cavities of the
receptor may be known or unknown.
SAQ 3
i) Fill in the blanks:
a) ………………………………………….. is known as the “Father &
Mother of Bioinformatics”.
b) A computer that holds immense processing power is called
………………………………………………..
c) ………………………………….. deals with the study of the complete set of
genes and genetic elements.
d) Some of the existing bioinformatics suites and software that use
multithreading approaches are ……………., …………………,
………………….. and …………………..
ii) What are the scopes of bioinformatics?
iii) Write a short note on supercomputers.
Molecular medicine
Many diseases are associated with some alteration in the genome. Therefore, the
human genome has a profound impact on biomedical research and clinical
medicine. With the availability of a complete human genome, it is possible to
search for genes that are directly associated with various diseases and
discover the molecular basis of these diseases. The discovery of the
molecular basis of a disease enables better treatment and diagnosis of
the disease, and the development of molecular medicines. Advances
in molecular medicine support better drug discovery, personalized
medicine, preventive medicine, and gene therapy.
Microorganisms are found everywhere and can survive extreme heat, cold,
radiation, acidity, salt, and pressure. Researchers have begun to understand
these microbes at a fundamental level by studying their genetic material. The
bioinformatics tools and techniques are applied in the field of microbial
genomic applications, including:
Waste cleanup
Antibiotic resistance
Bio-weapon creation
Biotechnology
Agriculture
The sequencing of plant and animal genomes has greatly benefitted the field
of agricultural research. Bioinformatics tools and techniques are applied to search
for genes within these genomes and to find their functions. Some of the
applications of bioinformatics in agriculture research are as follows:
Crop improvement
Veterinary Science
With the genomes of several farm animals, including cows, sheep, and
pigs, now sequenced, a better understanding of these organisms is expected to have
a large impact on production and the health of livestock, and ultimately to
benefit human beings.
SAQ 4
i) Mention a few real-life applications of bioinformatics research.
R is mainly used for statistical data analysis and visualization, while
Python, Java and BASH are used to develop new tools and software.
Moreover, when talking about software development, Java along with C and
Some of the important and prominent terms that are used in bioinformatics are
described in the following section. These terms will be frequently used while
performing lab exercises and research activities in this field.
cDNA library A collection of complementary DNA (cDNA) sequences prepared
from mRNA; it represents the genes expressed in the source cells.
Functional genomics The study of genes and how they code for their
respective proteins, which in turn play crucial roles in
biological, cellular and chemical processes in the body.
Gene family A set of similar and related genes that code for similar
proteins.
Genomic library A collection of cloned DNA fragments that together
represent the entire genome of an organism.
sequence
Optimal Alignment An alignment between sequences that has the highest
possible score.
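To make the idea of an optimal alignment concrete, the following minimal Python sketch computes the highest possible global alignment score for two short sequences by dynamic programming (the Needleman-Wunsch approach). The function name and the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen only for illustration; real tools use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    Illustrative scoring scheme (an assumption, not a standard):
    match = +1, mismatch = -1, gap = -2.
    """
    # prev[j] = best score for aligning the part of `a` seen so far with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACT"))
```

Under this scoring scheme, aligning ACGT with ACT gives three matches and one gap, and no other alignment of the two sequences scores higher.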
SAQ 5
i) Define bioinformatics.
a) PSSM
b) BLAST
c) UPGMA
d) ORF
1.11 SUMMARY
Bioinformatics is an interdisciplinary science combining biology,
computer science, statistics, and information technology (IT).
Bioinformatics lets us understand biology in terms of molecules and
apply IT techniques to understand and organize the information
associated with these molecules on a large scale.
The Internet is the key platform for bioinformatics, providing
(i) online bioinformatics resources (databases
and tools) such as NCBI, PDB, PubChem, and BLAST; (ii) scientific
literature databases such as PubMed and PubMed Central; (iii) web-based
platforms for high-end bioinformatics computing; and (iv) bioinformatics
courses and tutorials.
1.13 ANSWERS
Self Assessment Questions
1. i) Bioinformatics is the field of science in which biology, computer
science, and information technology merge to form a single
discipline. Hence, it is an interdisciplinary field of study where
biologists, computer scientists, statisticians, and data scientists
work together. Bioinformatics lets us understand biology in terms
of molecules and apply Information Technology (IT) techniques to
understand and organize the information associated with these
molecules on a large scale. The fundamental question
addressed is how to describe, analyze, simulate, and predict
the dynamics of biological processes by using IT tools.
iv) a) MS Powerpoint
b) Internet
b) Supercomputer
c) Genomics
iii) Yes. The Human genome has profound effects and impacts on
biomedical research and clinical medicine. With the availability of a
complete human genome, it is possible to search for genes that are
directly associated with various diseases and discover the
molecular basis of these diseases. The discovery of the molecular
ii) Pharmacokinetics:
Terminal Questions
1. Refer Section 1.2.
Exercise 1
MICROSOFT OFFICE- WORD,
EXCEL, POWERPOINT
Structure
1.1 Microsoft Office Microsoft Word
After practicing MS Office, learners will be able to simplify basic office
tasks and improve work productivity. Each application is designed to address
specific tasks, such as word processing, data management, and making
presentations.
understand and know how to use the most common Microsoft Office
programs;
Microsoft Word (MS Word) is a word processing program developed by
Microsoft. It is used to type, edit,
format, retrieve, save and print documents. MS Word can be used to
create and edit letters, articles, newsletters, and flyers; to edit and
format existing documents; to make a text
document interactive using different features and tools; to prepare graphical
documents containing images; and to detect grammatical
errors in a text document. It is widely used by authors and researchers.
Step 1: Double-click on the MS Word icon from the computer desktop or from
your 'Start' menu to open Microsoft Word.
If the MS Word icon is not on the desktop, go to the Start menu: Click ►Start
►Programs ►Microsoft Word
Step 2: Click on the blank document, a new blank document will open up
ready for you to start typing (Fig. 1.1).
Step 3: When you open a blank document, the flashing cursor will be at the
start of your document, ready for you to start typing. As you type, the cursor
will also move with each letter (Fig. 1.3).
Fig. 1.2: Screenshot showing various options and tools available on MS word.
Step 4: The mouse can be used to move around a document. Select the text
that you wish to edit or reformat. To make the selected text
bold, click B; for italics, click I; to underline, click U (Fig. 1.3).
Step 5: To copy and paste text, select the text so that it is highlighted, then
copy it by clicking on the copy icon at the left-hand side of the formatting
ribbon. Click Paste to insert the copied text in its new place in your document (Fig.
1.4).
Fig. 1.4: Screen shot showing how to copy and paste the text.
Step 6: To center, left align, right align or justify text, select the text that you
wish to change by using the mouse, then click on the corresponding icon
(‘center text’, ‘left align’, ‘right align’ or ‘justify’) in the formatting ribbon at the top of
the document (Fig. 1.5).
Step 7: To save a document, click File in the top left-hand corner of the
screen and choose "Save" from the menu. Once you have typed in the name of
your document, click Save (Fig. 1.6).
Up to now, you have seen how to create, edit and save a Word
file. Practice these steps and learn more.
Exercise 1 Microsoft Office- Word, Excel, Power Point
Step 1: Double-click on the MS Excel icon from the desktop. If the MS Excel
icon is not on the desktop, go to the Start menu: Click ►Start ►Programs ►
MS Excel
Step 2: Click File, and then click New. Under New, click Blank
workbook; a new Excel file will open up ready for you to start using (Fig. 1.7).
Fig. 1.7: Screenshot showing a blank Excel workbook.
Step 3: Click an empty cell, for example cell A1 on a new sheet. Cell A1 is
in the first row of column A. Type text or a number in the cell. Press Enter or
press Tab to move to the next cell (Fig. 1.8).
Step 5: Excel can also do arithmetic, using formulas to add, subtract,
multiply, or divide your numbers. Pick a cell, and then type an equal sign
(=). Type a combination of numbers and calculation operators, like the plus
sign (+) for addition, the minus sign (-) for subtraction, the asterisk (*) for
multiplication, or the forward slash (/) for division (Fig. 1.10).
Step 6: Click the Save button on the Quick Access Toolbar, or press Ctrl+S.
The first time you save the file: under Save As, pick where to
save your workbook, and then browse to a folder. In the File name box, enter
a name for your workbook. Click Save.
Step 7: Click File, and then click Print, or press Ctrl+P. Preview the pages by
clicking the Next Page and Previous Page arrows (Fig. 1.11).
Fig. 1.11: This screen will appear once you save the Excel file.
Step 1: To start MS PowerPoint, double-click on the PowerPoint icon from your
desktop.
If the MS PowerPoint icon is not on the desktop, go to the Start menu: Click
Start ►Programs ►Microsoft PowerPoint.
Step 2: Click on New and then on Blank presentation. A new PowerPoint file will open
up ready for you to start using (Fig. 1.12).
Step 3: To add text to a slide, click inside the provided text box (Click to
add title, subtitle, text, etc.).
Once the cursor is blinking you can begin to type your text (Fig. 1.13).
Step 4: To insert a new slide, click INSERT ► select NEW SLIDE ► click
title slide, title and content slide, etc. You can also use the keyboard shortcut
CTRL+M.
Step 5: To delete a slide, click on the slide you wish to delete in the Slide
menu and press the DELETE key on your keyboard.
Step 7: To save your work, click ►File ►Save from the Menu Bar.
LAB EXERCISES
Exercise 2 Introduction to Internet: Lan, Wan, Web browsers, Search engines
Exercise 2
INTRODUCTION TO INTERNET:
LAN, WAN, WEB
BROWSERS, SEARCH
ENGINES
Structure
2.1 Introduction 2.3 Web browsers
2.1 INTRODUCTION
The word Internet is derived from 'interconnected network': it is the network
that links web servers and other computers worldwide. It is often loosely called
the World Wide Web (WWW), although strictly speaking the Web is one of the
services that run on the Internet. Using the
Internet, you can send electronic mail, chat with colleagues around the world,
and obtain information on a wide variety of subjects. The Internet is a global and
public network that supports communication worldwide using common
protocols. In this exercise, you will also learn about LAN, WAN,
and search engines. This basic information about the internet and browsers
will help you to perform better in the upcoming exercises of this course.
Data transfer rates are generally high, and they range from 100 Mbps to
1000 Mbps.
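As a worked illustration of what these transfer rates mean in practice, the short Python sketch below estimates how long a file would take to move across a link at the two quoted speeds. The function name and the 100 MB file size are assumptions made for the example; the calculation takes 1 megabyte as 8 megabits and ignores protocol overhead, so real transfers will be somewhat slower.

```python
def transfer_time_seconds(file_size_megabytes, link_speed_mbps):
    """Estimate the ideal transfer time for a file over a network link.

    Assumes 1 megabyte = 8 megabits and no protocol overhead.
    """
    size_in_megabits = file_size_megabytes * 8
    return size_in_megabits / link_speed_mbps


# Hypothetical example: a 100 MB sequence file at the two quoted rates.
for speed in (100, 1000):
    seconds = transfer_time_seconds(100, speed)
    print(f"{speed} Mbps: {seconds} s")
```

At 100 Mbps the ideal time for the 100 MB file is 8 seconds, and at 1000 Mbps it is 0.8 seconds.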
Procedure:
Step 1: There are many different search engines you can use; some of the
most popular include Google, Yahoo!, and Bing. To perform a search,
open a search engine in your web browser, type one or more
keywords, and then press Enter on your keyboard (Fig. 2.1).
Step 2: After you run a search, you'll see a list of relevant websites that match
your search terms; these are called search results. If you find a site that is relevant to
your interest, you can open that link (Fig. 2.2).
Step 3: You may be looking for something more specific, like a news
article, picture, or video. Most search engines have links at the top of the
page that allow you to perform these specialized searches (Fig. 2.3).
2.5 SUMMARY
The Internet is an interconnected network of web servers worldwide; it is
often loosely called the World Wide Web.
Using the Internet, you can send electronic mail, chat with colleagues
around the world, and obtain information on a wide variety of subjects
using the World Wide Web.
In this exercise you have studied terms like LAN and WAN, and gained an
overview of web browsers and search engines.
Reference: https://edu.gcfglobal.org/en/internetbasics/using-search-engines/1/
Exercise 3 Basics of Electronic Mail, Creating An Email Account, Sending And Receiving Email
Exercise 3
BASICS OF ELECTRONIC MAIL,
CREATING AN EMAIL
ACCOUNT, SENDING
AND RECEIVING EMAIL
Structure
3.1 Introduction 3.2 Procedure
3.1 INTRODUCTION
In the previous exercises, you have learned about Microsoft Office and the
Internet. In this exercise, you’ll learn about electronic mail and its applications.
ELECTRONIC MAIL
create email;
3.2 Procedure:
CREATING AN EMAIL ACCOUNT
To sign up for Gmail, create a Google Account. You can use the username
and password to sign in to Gmail.
Step 2: Fill in the details to create an account, follow the instructions stepwise,
and use the account you created to sign in to Gmail (Fig. 3.1).
Step 1: Open your Gmail account by signing in to your email service, so that you are on
the main page of your mail account (Fig. 3.2).
Step 3: A new blank email window will open. In the ‘To’ box, type or add the
email address of the recipient.
Step 4: If you want to include someone else in your email to ‘keep them in the
loop’, you can click on Cc (carbon copy) or Bcc (blind carbon copy), which will
open another field. Adding an email address to the ‘Cc’ field will allow all the
other recipients to see that email address. If an email address is added to the
‘Bcc’ field, no other recipient can see the address. If you are sending the same
email to multiple people, it’s a good idea to put all the email addresses in the
‘Bcc’ field to keep your ‘mailing list’ confidential.
Step 5: In the subject box, type a relevant subject that gives the recipient
a glimpse of the topic of your email.
Step 6: Type the text of your email in the message box. You can change the font
style, colour and size using the formatting icons (Fig. 3.4).
Fig. 3.4: Screenshot showing “To”, “Cc” and “Bcc” option of an email.
Step 7: After typing the text in the message box, click the blue Send button at
the bottom of the compose window.
Step 8: The email you’ve sent will now be saved in the ‘Sent Mail’ folder on
your Gmail dashboard.
Step 9: By using the “Attach files” option you can attach images, files, etc. from
your computer.
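The roles of the ‘To’, ‘Cc’ and ‘Bcc’ fields described in the steps above can also be sketched in code. The following minimal Python example uses the standard library's email.message.EmailMessage class to assemble a message without sending it. The function name build_email and the example.org addresses are hypothetical, and the handling of Bcc shown here (keeping those addresses out of the visible headers and only in the delivery list) is one common convention rather than a description of how Gmail works internally.

```python
from email.message import EmailMessage


def build_email(sender, to, cc, bcc, subject, body):
    """Build an email and the full list of addresses it should be delivered to.

    Bcc addresses are kept out of the message headers, so the other
    recipients cannot see them; they appear only in the delivery list.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(to)
    if cc:
        msg["Cc"] = ", ".join(cc)  # visible to every recipient
    msg["Subject"] = subject
    msg.set_content(body)
    recipients = list(to) + list(cc) + list(bcc)
    return msg, recipients


# Hypothetical addresses, for illustration only.
message, delivery_list = build_email(
    sender="learner@example.org",
    to=["tutor@example.org"],
    cc=["classmate@example.org"],
    bcc=["archive@example.org"],
    subject="Exercise 3 submission",
    body="Please find my report for Exercise 3.",
)
print(message["To"], "|", message["Cc"])
print(delivery_list)
```

The Bcc address appears in the delivery list but nowhere in the message text, which is why the other recipients cannot see it.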
3.3 SUMMARY
E-mail enables internet users to send messages to other
internet users anywhere in the world.
In the current exercise you have learned how to create, send and
receive an e-mail.
You have also learned how to attach files using the “Attach files” option.
Unit 2 Biological Databases and Data Retrieval
UNIT 2
BIOLOGICAL DATABASES AND
DATA RETRIEVAL
Structure
2.1 Introduction 2.4 Small Molecular Databases
2.1 INTRODUCTION
In the previous unit, you have learned about the basics of computers and their
applications in the field of biology, that is, bioinformatics. In this unit, you’ll be
studying biological databases. In these biological databases, information
related to DNA, RNA, proteins, and other biomolecules is stored in a
systematic way on servers known as data servers. Scientists, academicians,
and researchers working across the globe can retrieve these biological
data whenever they need them for analysis.
because primary databases are also curated to ensure that the data in
them is consistent and accurate.
Synonyms
Primary database: Archival database
Secondary database: Curated database; knowledgebase
Source of data
Primary database: Direct submission of experimentally-derived data from researchers
Secondary database: Results of analysis, literature research and interpretation, often of
data in primary databases
Up to now, you have studied the types of biological databases based on
their sources. Now let us learn about nucleotide databases.
GenBank – It is an integral part of the main biological database resource, i.e., NCBI
(National Center for Biotechnology Information). NCBI has a tool called Entrez, which
helps to retrieve data from GenBank.
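Besides the Entrez web pages, NCBI also offers a programmatic interface to Entrez, the E-utilities. As a hedged sketch of how a GenBank nucleotide record could be requested in FASTA format, the Python example below only builds the request address and does not contact the server. The function name is hypothetical and the accession NM_000546 is used purely as an example identifier.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (programmatic access to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def genbank_fasta_url(accession):
    """Build an EFetch address that requests a nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",     # Entrez nucleotide database (includes GenBank)
        "id": accession,     # accession number of the record
        "rettype": "fasta",  # return the sequence in FASTA format
        "retmode": "text",   # plain text rather than XML
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Example accession, used here only for illustration.
print(genbank_fasta_url("NM_000546"))
```

Pasting the printed address into a web browser should return the sequence as plain text, which is the same record that Entrez displays on the NCBI website.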
Swiss-Prot – This database is owned by EMBL and maintained by SIB.
TrEMBL – It contains the largest number of translated sequences.
SAQ 1
Fill in the blanks:
PDB is a part of the Worldwide Protein Data Bank which collects, organizes,
and disseminates data on biological macromolecular structures like proteins,
enzymes, and DNA/RNA.
2. CATH
3. PDBSUM
a. All alpha
b. All beta
c. Alpha or beta
e. Multi-domain folds
2.2.5 CATH
The CATH database (http://www.cathdb.info/) is a free, publicly available
online resource that provides information on the evolutionary relationships of
protein domains. It was created in the mid-1990s by Professor Christine
Orengo and colleagues, and continues to be developed by the Orengo group
at University College London.
all alpha, all beta, a mixture of alpha and beta, or little secondary structure; at
the Architecture (A) level, information on the arrangement of secondary structures
in three-dimensional space is used; at the Topology/fold (T) level, information
on how the secondary structure elements are connected and arranged is
used; and assignments are made to the Homologous superfamily (H) level if
there is sufficient evidence that the domains are related by evolution, i.e. they
are homologous. To browse the classification hierarchy, visit the CATH
hierarchy web page (Fig. 2.4).
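The four levels of the hierarchy are reflected in the dotted numeric codes that CATH assigns to domain superfamilies, with one number per level (C.A.T.H). The minimal Python sketch below splits such a code into its levels. The function name and the dictionary of class names are assumptions written for this illustration, and the code 3.40.50.300 is used only as an example.

```python
# Names of the four main CATH classes, keyed by the first number of the code.
CATH_CLASSES = {
    "1": "Mainly Alpha",
    "2": "Mainly Beta",
    "3": "Alpha Beta",
    "4": "Few Secondary Structures",
}


def parse_cath_code(code):
    """Split a CATH superfamily code such as '3.40.50.300' into its levels."""
    parts = code.split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected a code of the form C.A.T.H, got {code!r}")
    c, a, t, _h = parts
    return {
        "class": c,
        "class_name": CATH_CLASSES.get(c, "Other"),
        "architecture": f"{c}.{a}",
        "topology": f"{c}.{a}.{t}",
        "homologous_superfamily": code,
    }


print(parse_cath_code("3.40.50.300"))
```

Each successive level narrows the previous one, so the topology code always begins with the architecture code, which in turn begins with the class number.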
SAQ 2
Define the following terms:
i) PDB
ii) RCSB
iii) SCOP
2.4.1 PubChem
The PubChem database is a primary source of information on chemicals, drugs, and
their derivatives. It is a freely accessible chemical information resource
and the largest collection of its kind in the world. We can search for
chemicals by molecular formula, name, structure, and other identifiers.
Further, one can find chemical and physical properties, safety and toxicity
information, biological activities, literature citations, patents, and more. New
chemicals/substances are added regularly as and when new information becomes
available from the literature or from experimental results. It is particularly useful for
finding vendor-supplied or new chemicals. Many scientists
screen molecules from the PubChem database for the treatment of various
diseases such as cancer, tuberculosis (TB), Alzheimer’s disease, osteoporosis,
atherosclerosis, cardiovascular diseases, etc. PubChem is an open chemistry database
maintained by the National Institutes of Health (NIH). “Open” means
that anyone can submit scientific data to this database and become a data provider.
This database is important and useful for scientists, students, and the
general public. Each month, its website and programmatic services provide data
on compounds to several million users worldwide.
(https://pubchemdocs.ncbi.nlm.nih.gov/)
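The programmatic services mentioned above include a web interface called PUG REST, in which a compound and the information wanted are written directly into the address. The Python sketch below only builds such an address for a compound name and does not send the request. The function name and the default property list are illustrative assumptions, and curcumin is used simply as an example compound.

```python
from urllib.parse import quote

# Base address of the PubChem PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(compound_name,
                         properties=("MolecularFormula", "MolecularWeight")):
    """Build a PUG REST address asking for properties of a named compound."""
    safe_name = quote(compound_name, safe="")  # encode spaces and symbols
    property_list = ",".join(properties)
    return (f"{PUG_REST_BASE}/compound/name/{safe_name}"
            f"/property/{property_list}/JSON")


# Example compound name, for illustration only.
print(pubchem_property_url("curcumin"))
```

Opening the printed address in a browser should return the requested properties in JSON format, the same information shown on the compound's PubChem page.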
SAQ 3 Do as directed
i) Write a short note on chemical databases used in drug design and drug
discovery.
ii) Define the term curated database. List a few chemical databases
developed using the curation method.
We will now discuss basic software used for the visualization of
biomolecules. Various file formats are available for viewing these molecules
in three-dimensional space. This means that the position of each atom must be defined
relative to the origin along the X, Y, and Z axes. The format most commonly used
in basic viewing software is the PDB (Protein Data Bank) format. It has a standard layout,
as follows. While performing exercises 7 and 8 you will learn more
about these tools and file formats.
Notice that each line or record begins with the record type ATOM. The atom
serial number is the next item in each record.
(Source:
https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html)
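Because the PDB format is column-based, each item in an ATOM record can be read from fixed character positions. The Python sketch below parses one such record, following the column positions defined in the PDB file format specification. The sample line is an illustrative record written in that layout, built piece by piece so that the columns are visible; the function and variable names are hypothetical.

```python
# An illustrative ATOM record, assembled field by field (78 characters).
SAMPLE_LINE = (
    "ATOM  "      # columns 1-6:   record type
    "    1"       # columns 7-11:  atom serial number
    " "           # column 12:     blank
    " N  "        # columns 13-16: atom name
    " "           # column 17:     alternate location indicator
    "THR"         # columns 18-20: residue name
    " "           # column 21:     blank
    "A"           # column 22:     chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code and blanks
    "  17.047"    # columns 31-38: X coordinate
    "  14.099"    # columns 39-46: Y coordinate
    "   3.625"    # columns 47-54: Z coordinate
    "  1.00"      # columns 55-60: occupancy
    " 13.79"      # columns 61-66: temperature factor
    "          "  # columns 67-76: blanks
    " N"          # columns 77-78: element symbol
)


def parse_atom_record(line):
    """Extract the main fields of a PDB ATOM/HETATM record by column position."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue_name": line[17:20].strip(),
        "chain_id": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


print(parse_atom_record(SAMPLE_LINE))
```

Python slices count from zero, so columns 7-11 of the record correspond to line[6:11], and so on for the other fields.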
1. RasMol
Most of the protein structure databases available today are well
equipped with graphical visualization tools. A commonly used tool for
academic and research purposes is the RasMol software. This is a molecular
graphics program intended to visualize proteins, nucleic acids and small
molecules whose 3-D structures are available. In order to display a molecule,
RasMol requires an atomic coordinate file that specifies the position of every
atom in the molecule through its 3-D Cartesian coordinates (Fig. 2.5). RasMol
accepts this coordinate file in a variety of formats, including the Protein Data
Bank (PDB) format. The visualization tool gives the user a choice of color
schemes and molecular representations: wireframe, cylinder (Dreiding) stick
bonds, alpha-carbon trace, space-filling (CPK) spheres, macromolecular
ribbons (either smooth-shaded solid ribbons or parallel strands), hydrogen
bonding and dot surface. Additional features such as text labeling for selected
atoms, different color schemes for different parts of the molecule, zoom,
rotation, etc. have made this one of the most popular visualization
tools.
Website: http://www.openrasmol.org/
Fig. 2.5: RasMol software displaying the crystal structure of crambin (PDB ID: 1CRN).
2. Chime
Chime and Protein Explorer are derivatives of RasMol that allow visualization
of structures inside web browsers, while RasMol runs independently outside a
web browser. Hence, Chime should be used only online, when connected to
the Internet. Another feature of Chime is that only certain molecules that are
allowed by the company can be seen, unlike RasMol where any protein
molecule with atomic coordinates can be seen.
Nowadays, tools such as Jmol, which runs on the Java platform, and JSmol, its
JavaScript counterpart, are also widely used. Jmol can be downloaded on personal
computers to view molecules like proteins, DNA and RNA.
3. MolMol
MolMol stands for Molecule analysis and Molecule display. This is also free
software with a lot of features that are not found in RasMol and Chime. MolMol
is a molecular graphics program for display, analysis and manipulation of
three-dimensional structures of biological macromolecules, with special
emphasis on nuclear magnetic resonance (NMR) solution structures of
proteins and nucleic acids. MolMol can be reached at:
www.mol.biol.ethz.ch/wuthrich/software/molmol
4. Pymol
5. SPDBV
2.6 SUMMARY
Biological databases are used to store experimental data in various formats
that can be accessed through the internet.
SCOP has five sub-classes. 1. All alpha, 2. All Beta, 3. Alpha or Beta 4.
Alpha and Beta 5. Multi-domain fold.
2.8 ANSWERS
Self Assessment Questions
1. i) Nucleotide Database at NCBI
ii) DNA
iii) European Bioinformatics institute
iv) National Centre for Biotechnology
v) PIR
2. i) Protein Data Bank
ii) Research Collaboratory for Structural Bioinformatics
iii) Structural Classification of Proteins
3. i) Refer Section 2.4.1 to 2.4.3.
ii) Refer Metabolic Pathway Databases under section 2.3.
Terminal Questions
1. There are various principles for the biological databases or
characteristics.
Synonyms
Primary database: Archival database
Secondary database: Curated database; knowledgebase
Source of data
Primary database: Direct submission of experimentally-derived data from researchers
Secondary database: Results of analysis, literature research and interpretation, often of
data in primary databases
Examples
Primary database: ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress and GEO
(functional genomics data); Protein Data Bank (PDB; coordinates of three-dimensional
macromolecular structures)
Secondary database: InterPro (protein families, motifs and domains); UniProt
Knowledgebase (sequence and functional information on proteins); Ensembl (variation,
function, regulation and more layered onto whole genome sequences)
Exercise 4
DATABASES: NCBI, PDB, SCOP,
PUBMED, GENBANK,
UNIPROT
Structure
4.1 Introduction 4.2 Databases and Retrieval
4.1 INTRODUCTION
In this exercise, you will learn about biological databases that are widely used
in the field of bioinformatics.
Databases are systematic collections of logically related data. Software
packages are used for defining and managing databases. Owing to the
exponential growth in biological data, publicly accessible databases now hold
a great deal of information about biomolecules. Data are often no longer
published in the conventional way but are submitted directly to databases.
Generally, biological databases can be classified into sequence databases,
structural databases, genome databases, proteome databases, specialized
databases, etc.
You can access NCBI to learn about its popular resources. Further,
you will learn how to use NCBI and GenBank to access nucleotide and
protein sequences in exercises 5 and 7.
PUBMED
than 32 million citations (abstracts) for biomedical literature from MEDLINE, life
science journals, and online books. Citations do not include full-text journal
articles but may include links to full-text content from PubMed Central (PMC)
and publisher websites. PubMed is developed
and maintained by the National Center for Biotechnology Information (NCBI),
at the U.S. National Library of Medicine (NLM), located at the National
Institutes of Health (NIH).
Procedure
Step 2: Type your text query in the search panel (for example, coronavirus).
Step 3: Select the appropriate abstract from the PubMed summary web page
(Fig. 4.3).
Step 4: Copy and save the relevant bibliography search for further use.
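The keyword search described in the steps above can also be prepared from a
script. The following is a minimal sketch that only builds the request URL for
the NCBI E-utilities esearch service and does not send it. The function name
build_pubmed_search_url and the default number of results are assumptions made
for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities esearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(query, max_results=20):
    """Return an esearch URL that looks up `query` in PubMed.

    The function name and the default value are illustrative only.
    """
    params = {
        "db": "pubmed",         # database to search
        "term": query,          # free-text query, as typed in the search panel
        "retmax": max_results,  # maximum number of record IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_pubmed_search_url("curcumin antioxidants"))
```

Opening the printed URL in a browser returns the PubMed IDs of the matching
citations, which correspond to the entries listed on the summary page.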
GenBank
So far, you have learned about the NCBI, PubMed and GenBank databases and how
to access citations and abstracts from PubMed. To become more familiar with
the procedure, repeat the exercise with different text searches, such as an
author name or keywords like antioxidants, curcumin or cholesterol. In the
next subsection you will learn about the Protein Data Bank, which is widely
used for 3-D protein structure-related information. You will learn how to use
GenBank to access nucleotide sequences in Exercises 5 and 7.
PDB
The Protein Data Bank (PDB) is a repository for the 3-D structural data of
large biological molecules, such as proteins and nucleic acids. The data,
typically obtained by X-ray crystallography or NMR spectroscopy and
submitted by biologists and biochemists from around the world, are freely
accessible on the Internet via the websites of its member organisations,
such as the Research Collaboratory for Structural Bioinformatics (RCSB). To
access the PDB database and retrieve structural information, follow the web
link https://www.rcsb.org/ (Fig. 4.5).
You can browse the PDB to become familiar with a structural database. You
will learn how to use the PDB to access and download 3-D structures of
proteins and DNA in Exercise 6.
SCOP
Procedure:
Step 2: Type the protein name, relevant text or a keyword in the text box
(Fig. 4.7).
Step 4: Choose the appropriate link to display the functional information
(Fig. 4.9).
Fig. 4.9: Appropriate links showing family and super family have been encircled.
UNIPROT
Procedure:
Step 2: Type the protein name, relevant text or a keyword in the text box
(Fig. 4.11).
Step 4: Choose the first sequence by double-clicking its accession number,
then go to the display button and select FASTA format to retrieve the
sequence (Fig. 4.13).
Step 5: Copy and save the protein sequence for further analysis (Fig. 4.14).
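Once a sequence has been saved in FASTA format, it can be read back with a few
lines of code. The sketch below assumes the usual UniProt header layout
(database, accession and entry name separated by "|" after the ">" sign). The
function name and the example record are made up for illustration and do not
correspond to a real entry.

```python
def parse_uniprot_fasta(text):
    """Split one UniProt FASTA record into its header fields and sequence."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    # Assumed header layout: >db|Accession|EntryName Description ...
    identifier, _, description = lines[0][1:].partition(" ")
    database, accession, entry_name = identifier.split("|")
    return {
        "database": database,            # 'sp' = Swiss-Prot, 'tr' = TrEMBL
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
        "sequence": "".join(lines[1:]),  # join the wrapped sequence lines
    }


if __name__ == "__main__":
    # Shortened, made-up record used only to show the layout.
    example = ">sp|P00000|DEMO_HUMAN Demo protein OS=Homo sapiens\nMKTAYIAKQR\nQISFVKSHFS\n"
    record = parse_uniprot_fasta(example)
    print(record["accession"], record["entry_name"], len(record["sequence"]))
```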
4.3 SUMMARY
Databases are systematic collections of logically related data.
In general, biological databases can be classified into sequence
databases, structural databases, genome databases, proteome databases,
specialized databases, etc.
4. Open the SCOP database, search with any keyword or text, and write down
the functional aspect, name of the protein, family, class and domain.
Exercise 5
RETRIEVAL OF GENE
SEQUENCES FROM NCBI
Structure
5.1 Introduction 5.2 Procedure
5.1 INTRODUCTION
In the previous exercise, you learned about different databases. In this
exercise, you will study the retrieval of protein and gene sequences from
the NCBI database. You studied the NCBI database in theory in Unit 2 and
learned about NCBI resources such as GenBank and GenPept in Exercise 4. In
this section, we shall access sequences from GenBank and GenPept, which will
be used in various sequence analysis techniques.
explore and retrieve gene information from NCBI Gene database; and
5.2 PROCEDURE
Step 1: Access the home page of NCBI from the following web link
https://www.ncbi.nlm.nih.gov/ (Fig. 5.1).
2. Click on the drop-down menu and select “Gene” (Fig. 5.2).
3. Type the relevant text or keyword in the search box (for example, gene
name or species name) (Fig. 5.3).
Scroll down and click on the required file format (FASTA or GenBank format).
6. Copy and save the required gene sequence for further analysis (Fig.
5.6).
5.3 SUMMARY
NCBI hosts systematic collections of logically related biological data,
such as sequence databases, genome databases, and specialized
databases.
Both gene and protein sequences can be retrieved from NCBI database
for further analysis.
Exercise 6
ACCESSING PROTEIN
STRUCTURE FROM PDB
Structure
6.1 Introduction
6.2 Procedure
6.3 Summary
6.1 INTRODUCTION
In the previous exercise, you learned how to retrieve protein and gene
sequences from the NCBI database. In this exercise, you will explore the
steps involved in downloading protein structures from the PDB database. 3-D
structures from the Protein Data Bank (PDB) are used to understand structural
information such as the binding site of a protein or DNA, the active site of
enzymes, DNA-protein interactions, and protein-protein interactions, and this
has applications in drug design. You learned about the PDB database in Unit 2
and accessed the PDB website in Exercise 4.
Protein structure helps us understand how a protein works, and that
information can be used to inhibit, regulate, or modify protein function, and
to predict which molecules bind to that protein. It also helps us to
understand various biological interactions, assist drug discovery, or even
design novel proteins as therapeutic molecules. Similarly, to understand the
biological function of DNA, we need to study its molecular structure. The PDB
is a repository for the 3-D structural data of large biological molecules,
such as proteins and nucleic acids. In this exercise, we shall learn how to
download a protein structure from the PDB.
6.2 PROCEDURE
Step 1:
1. Open the PDB from the following URL: https://www.rcsb.org/ (Fig. 6.1).
2. Enter a PDB ID, molecule name or author name in the text box provided and
click on the search button (Fig. 6.2).
3. From the summary page click on PDB ID 7LYJ and download the
macromolecular 3D structure in PDB format (Fig. 6.3 and 6.4).
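A file downloaded in PDB format is plain text in which each atom occupies one
fixed-width ATOM (or HETATM) line. The sketch below reads the atom name,
residue, chain and coordinates from such lines, using the column positions of
the PDB file format. The helper make_atom_line and the two atoms in the
example are made up for illustration and are not taken from entry 7LYJ.

```python
def parse_atom_line(line):
    """Read the fixed-width fields of one ATOM/HETATM record."""
    return {
        "atom_name": line[12:16].strip(),    # columns 13-16
        "residue": line[17:20].strip(),      # columns 18-20
        "chain": line[21],                   # column 22
        "residue_number": int(line[22:26]),  # columns 23-26
        "x": float(line[30:38]),             # columns 31-38
        "y": float(line[38:46]),             # columns 39-46
        "z": float(line[46:54]),             # columns 47-54
    }


def read_atoms(pdb_text):
    """Return parsed records for every ATOM or HETATM line in the text."""
    return [
        parse_atom_line(line)
        for line in pdb_text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
    ]


def make_atom_line(serial, name, residue, chain, number, x, y, z):
    """Build a made-up ATOM line with each field in its proper columns."""
    return (
        f"ATOM  {serial:>5} {name:<4} {residue:>3} {chain}{number:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


if __name__ == "__main__":
    demo = "\n".join([
        "HEADER    DEMO STRUCTURE",
        make_atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614),
        make_atom_line(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842),
        "END",
    ])
    for atom in read_atoms(demo):
        print(atom)
```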
6.3 SUMMARY
The PDB is the database from which we can access the 3-D structures of
proteins.
You have acquired the skills to access PDB pages and learned how to
search for the desired protein.
UNIT 3
SEQUENCE ALIGNMENT
Structure
3.1 Introduction 3.4 Alignment Scoring Matrices
3.1 INTRODUCTION
In the previous unit, you learned about the sequences and structures of
proteins and nucleic acids along with biological databases. As you know,
amino acids are the building blocks of proteins. Any language has an
alphabet, and combining and arranging its letters properly forms words and
sentences with meaning. Language helps us to communicate with each other as
well as to update our knowledge. Similarly, the arrangement of amino acids
gives rise to numerous functional proteins, enzymes, receptors, etc. in
biological systems. These combinations of amino acids and nucleotides play a
major role in the proper functioning of proteins and genes. It is interesting
to know that specific protein sequences remain the same in many organisms,
but a few additions, deletions or insertions may give rise to a mutated
protein. If the sequence is exactly the same in two different organisms, the
protein function is expected to be the same as well. There are various tools
and software available to compare these sequences.
In both animals and plants, several proteins and enzymes are involved in
biochemical pathways, signaling pathways, and other functions. Comparing the
sequences of proteins and genes of one animal or species with those of
another is called sequence comparison. In this unit, you will learn more
about sequence alignment, its types, and the algorithms involved in it. In
addition, you may come across new terms, software, and tools. You will also
learn the applications of sequence alignment in proteins and nucleic acids,
which are essential in the field of biology and allied subjects.
A M I N O A C I D - Seq-1
| | | | | | | | |
A M I N O A C I D - Seq-2
In the above example, the first word is named Seq-1 and the second word
Seq-2, and the two words match at every letter. It is a simple example of how
a sequence of letters forms words. Now, let us see the similarity of
sequences in the next subsection.
In the above sequence format, you can see the GenBank ID, the name of the
enzyme, and the organism at the top of the sequence. The sequence entry
starts with the ‘>’ sign, followed by the GenBank ID, enzyme name, and
scientific name of the organism. When we compare the sequence of hexokinase
between two different animals, the sequences match 100% if both the number
and the type of amino acid residues are the same. In another pair of
organisms, the match may be less than 100% because of differences in the
number and type of amino acid residues. In a few animals, we may find
mutations in the sequence, yet the protein still performs a normal or similar
function.
A T C G G C - Seq-1
| | | |
A T C G C G - Seq-2
Seq-1 and Seq-2 are not the same, but we can call them similar: four of the
six positions match, and only the last two positions (G/C and C/G) are
mismatched.
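The comparison of Seq-1 (ATCGGC) and Seq-2 (ATCGCG) can be expressed as a
number. The following is a minimal sketch that counts matching positions in
two already aligned sequences of equal length. The function name is an
assumption made for illustration, and a gap character would simply be counted
as a mismatch here.

```python
def percent_identity(seq1, seq2):
    """Percentage of aligned positions that carry the same residue."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)


if __name__ == "__main__":
    # Four of the six positions match in the example above.
    print(round(percent_identity("ATCGGC", "ATCGCG"), 1))
    # Every position matches in the AMINOACID example.
    print(percent_identity("AMINOACID", "AMINOACID"))
```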
For instance, in proteins, a mismatch between amino acids with similar
chemical and physical properties often does not alter the function of the
protein. In Fig. 3.2 you can find an example of a sequence alignment of the
protein histone H1 among different mammals. The amino acids that are constant
throughout the alignment are called conserved residues, and the amino acids
that vary in the alignment are referred to as non-conserved residues.
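Conserved residues can be located by checking each column of an alignment.
The sketch below reports the columns in which every sequence carries the same
residue. The function name and the three short sequences are made up for
illustration and are not the histone H1 sequences of Fig. 3.2.

```python
def conserved_positions(alignment):
    """Return 1-based column numbers where all sequences share one residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for index, column in enumerate(zip(*alignment), start=1):
        # A column is conserved when it holds one distinct residue and no gap.
        if len(set(column)) == 1 and column[0] != "-":
            conserved.append(index)
    return conserved


if __name__ == "__main__":
    demo_alignment = ["MKVLA", "MRVLA", "MKILA"]
    print(conserved_positions(demo_alignment))
```

In this made-up alignment the first, fourth and fifth columns are conserved,
while the second and third columns vary.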
SAQ 1
i) Which type of sequences are available in NCBI database?
ii) Which mammalian sequence is more similar to human histone
sequence?
iii) What do you mean by conserved residues in a multiple sequence
alignment?
There are various tools and software available to calculate sequence identity
over the full length of sequences. Among them, BLAST is one of the most
powerful and widely used. You will learn how to perform this calculation
using online tools in Exercise 9 of this course.
In the given example (Fig. 3.3), the DNA polymerase sequence of Hepatitis B
virus is taken as the query sequence and aligned with a subject sequence (a
sequence from the database). In this alignment, the query and subject
sequences showed a similarity of 98% and an identity of 97%. You can observe
that a few amino acids do not match exactly with those in the lower sequence.
The big boxes mark regions with higher sequence identity than the small
boxes; the amino acids in the small boxes are neither identical nor similar.
You can also observe some gaps between the sequences. The alignment of
sequences is carried out using various matrices and algorithms. Sequence
identity plays a major role in evolutionary tree generation. It helps in
understanding the lineage of a species and its relationship with other
organisms. Sequence identity is also essential to acquire information about
the working mechanisms of various proteins, enzymes, receptors, and cellular
responses.
SAQ 2
i) What is sequence identity?
You are advised to watch the video at the following YouTube link to know more
details about sequence similarity:
https://www.youtube.com/watch?v=K6ldxHPzI5A
SAQ 3
i) Write the differences between similarity and homology?
Seq1 AVLTSHYILRS - 11
     |||||||||||
Seq2 AVLTSHYILRS - 11
Different methods are used for pairwise alignment of nucleotide and protein
sequences. Let us learn them one by one:
G G A A T
Seq-1 G G A A T
| | | | |
Seq-2 G G A A T
If a nucleotide or amino acid does not match, a gap is seen within the
alignment.
As you know, most gene sequences are very long. In such cases, the plot
appears as in Fig. 3.6. In this figure, the X-axis is Seq1 and the Y-axis is
Seq2, with the sequences being 200 amino acid residues long.
Fig. 3.6: Dotplot of amino acid Seq1 and Seq2 with 200 amino acid residues.
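A dot plot can be produced with a few lines of code. The sketch below places
Seq1 along the X-axis and Seq2 along the Y-axis, and marks every pair of
identical residues with "*". The function name and the text layout are
assumptions made for illustration; plotting software draws the same matrix as
a figure.

```python
def dot_plot(seq1, seq2):
    """Return the dot matrix as text: columns follow seq1, rows follow seq2."""
    rows = ["  " + " ".join(seq1)]  # header row with seq1 along the X-axis
    for residue2 in seq2:
        marks = ["*" if residue1 == residue2 else "." for residue1 in seq1]
        rows.append(residue2 + " " + " ".join(marks))
    return "\n".join(rows)


if __name__ == "__main__":
    print(dot_plot("GGAAT", "GGAAT"))
```

For two identical sequences the main diagonal is fully marked. Repeated
residues, such as the two G and two A residues here, add extra marks next to
the diagonal.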
We are going to study various alignments like local and global alignment in the
next sections of this unit.
To know more about the topic, you are advised to visit the following video
links:
https://www.youtube.com/watch?v=S07kIY2ihq8
https://www.youtube.com/watch?v=TZaA_-4j19w
SAQ 4
i) What is MSA?
These algorithms can deal with sequences that are quite different but, as in
the pairwise case, when the sequences are very different they may have
problems creating a good alignment. A good alignment should align the
homologous positions, or the positions with the same structure or function.
Global Alignment: In the sequence analysis of proteins or genes, global
alignment is most suitable for sequences of the same or similar length. The
alignment is performed from the beginning to the end of the sequences, and
gaps may be introduced during the alignment process.
The Needleman-Wunsch algorithm (an algorithm is a formula or set of steps to
solve a problem) is a dynamic programming algorithm for global sequence
alignment, developed by Saul B. Needleman and Christian D. Wunsch in 1970. It
can be used to align nucleotide or protein sequences. It was the first
algorithm of its kind for the alignment of two protein sequences and the
first application of dynamic programming to biological sequence analysis. The
Needleman-Wunsch algorithm finds the best-scoring global alignment between
the two sequences. Global alignments are most useful when the two sequences
being compared are of similar length and not too divergent.
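The dynamic programming idea behind global alignment can be sketched in a few
lines. The code below fills the Needleman-Wunsch score matrix and returns only
the best global score, without tracing back the alignment itself. The scores
used (match +1, mismatch -1, gap -1) and the function name are assumptions
chosen for illustration.

```python
def needleman_wunsch_score(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of seq1 and seq2 (illustrative scores)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq2
                score[i][j - 1] + gap,       # gap in seq1
            )
    # Global alignment: the answer is in the bottom-right cell.
    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GGAAT", "GGAAT"))
    print(needleman_wunsch_score("ATCGGC", "ATCGCG"))
```

Because every residue of both sequences must be placed in the alignment, the
score is read from the last cell of the matrix.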
Local Alignment: Local alignment compares sequences that may be dissimilar
overall by finding the regions of highest similarity between them, rather
than aligning them from end to end.
Both methods of alignment are implemented by algorithms that use scoring
matrices to align two different series of characters or patterns (sequences),
and both are based on dynamic programming. Many proteins exhibit modular
architectures. In searching databases for similar sequences, it is useful to
find sequences that have similar domains or functional motifs. Smith &
Waterman (1981) published an application of dynamic programming to find
optimal local alignments. The algorithm is similar to Needleman-Wunsch,
except that negative cell values are reset to zero, and the traceback
procedure starts from the highest-scoring cell, anywhere in the matrix, and
ends when the path encounters a cell with a value of zero.
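The two changes that turn global into local alignment can be seen in code. The
sketch below fills a Smith-Waterman matrix, resets negative values to zero and
returns the highest cell value found anywhere in the matrix. The traceback is
omitted, and the scores (match +2, mismatch -1, gap -2) and the function name
are assumptions chosen for illustration.

```python
def smith_waterman_score(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of seq1 and seq2 (illustrative scores)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # The first row and column stay at zero: a local alignment may start anywhere.
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                0,                           # negative values are reset to zero
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq2
                score[i][j - 1] + gap,       # gap in seq1
            )
            best = max(best, score[i][j])    # highest cell anywhere in the matrix
    return best


if __name__ == "__main__":
    # Only the shared GGG region contributes to the local score.
    print(smith_waterman_score("AAAGGGTTT", "CCGGGCC"))
```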
If we take a small fragment of one sequence as the target and align it with a
small region of the other sequence, it is a local alignment. Performing the
alignment over the complete length of the sequences is known as global
alignment (Fig. 3.8 and 3.9).
(Source: https://www.researchgate.net/figure/Global-alignment-vs-Local-alignment_fig1_322704711)
(Source: https://www.majordifferences.com/2016/05/difference-between-global-and-local.html)
Gap penalty
(Source: https://www.differencebetween.com/difference-between-transition-and-vs-transversion/)
For protein sequence alignments, the scoring matrices are more complicated.
The goal is to reflect evolutionary processes. Some amino acid sequence
changes can arise from a single nucleotide change, whereas other amino acid
changes require two nucleotide changes. Some amino acid changes are less
likely to affect protein structure or function than other amino acid changes.
SAQ 5
What is the use of alignment scoring matrix?
IAGCW
IAGCT
I IGCT
Dayhoff constructed a phylogenetic tree and counted the substitutions along
its branches (Fig. 3.11). The tree is chosen so that it minimizes the number
of changes between the sequences. To know more about the PAM concept,
watch the video: https://www.youtube.com/watch?v=8avcQRxaLBw
PAM and BLOSUM matrices differ slightly: BLOSUM matrices are derived directly
from observed alignments and are not extrapolated from comparisons of closely
related proteins, as PAM matrices are. Scoring plays a major part in
alignment. For nucleotide sequences, all matches are given the same score and
all mismatches the same penalty (typically +1 or +5 for matches, and -1 or -4
for mismatches). But it is different for proteins. Substitution matrices for
amino acids are more complicated than those for nucleotides, because they
reflect everything that might affect the frequency with which any amino acid
is substituted for another. The objective is to provide a relatively heavy
penalty for aligning two residues
Here, “pij” is the probability of two amino acids i and j replacing each other in
a homologous sequence, and qi and qj are the background probabilities of
finding the amino acids i and j in any protein sequence. The factor λ is a
scaling factor, set such that the matrix contains easily computable integer
values.
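With the symbols defined above, a substitution score is usually written in
log-odds form as S_ij = (1/λ) log(p_ij / (q_i q_j)), rounded to the nearest
integer. The sketch below evaluates this expression. The use of base-2
logarithms, the value λ = 0.5 and the probabilities in the example are
assumptions chosen for illustration, not values from a published matrix.

```python
import math


def log_odds_score(p_ij, q_i, q_j, lam=0.5):
    """Rounded log-odds score S_ij = (1/lam) * log2(p_ij / (q_i * q_j))."""
    if min(p_ij, q_i, q_j) <= 0:
        raise ValueError("probabilities must be positive")
    return round(math.log2(p_ij / (q_i * q_j)) / lam)


if __name__ == "__main__":
    # A pair seen twice as often as expected by chance gets a positive score.
    print(log_odds_score(0.02, 0.1, 0.1))
    # A pair seen half as often as expected by chance gets a negative score.
    print(log_odds_score(0.005, 0.1, 0.1))
```

A positive score means the two amino acids replace each other more often than
expected by chance, and a negative score means the replacement is rarer than
expected.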
BLOSUM62: midrange
There are various online tools and software available for sequence alignment
with BLOSUM matrices as a weight matrix.
Clustal W is a well-known online sequence alignment tool. You can browse the
following link https://www.genome.jp/tools-bin/clustalw to find the BLOSUM
matrix option, as shown in Fig. 3.13. While performing Exercise 10, you will
learn more about multiple sequence alignment using Clustal W.
Fig. 3.13: The Clustal W online tool has a parameter section with the BLOSUM
matrix for pairwise and multiple sequence alignments.
Sequence alignment tools and software reduce the time needed for analysis and
make it more effective. The alignment analysis provides the information
needed to decide how to proceed in understanding the function of a protein or
gene, or its relation to others. In this section, we will get to know more
about online software based on alignment types. Watch the video at the link
provided to know more about this topic:
https://www.youtube.com/watch?v=uGhZygAMQik
1. Nucleotide BLAST
2. Protein BLAST
3. BLASTx
4. tBLASTn
Fig. 3.14: BLAST home webpage with Nucleotide, Protein, blastx and tblastn
links on it.
3. BLASTx (translated nucleotide sequence searched against protein
sequences): Compares a nucleotide query sequence that is translated in six
reading frames (resulting in six protein sequences) against a database of
protein sequences.
Because blastx translates the query sequence in all six reading frames and
provides combined significance statistics for hits to different frames, it is
particularly useful when the reading frame of the query sequence is unknown
or it contains errors that may lead to frame shifts or other coding errors. Thus,
BLASTx is often the first analysis performed with a newly determined
nucleotide sequence.
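The six-frame translation that blastx applies to a nucleotide query can be
illustrated as follows. The sketch uses the standard genetic code and
translates three frames of the given strand and three frames of its reverse
complement. The function names and the short test sequence are assumptions
made for illustration; this is not the BLAST implementation itself.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons are generated in TCAG order at each position.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; 'X' marks codons with unknown bases."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate the three forward and three reverse reading frames."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six protein sequences is then compared with the protein database,
which is why the reading frame of the query need not be known in advance.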
4. tBLASTn (protein sequence searched against translated nucleotide
sequences): Compares a protein query sequence against the six-frame
translations of a database of nucleotide sequences. tBLASTn is useful for
finding homologous protein-coding regions in unannotated nucleotide
sequences such as expressed sequence tags (ESTs) and draft genome
records (HTG), located in the BLAST databases.
ESTs are short, single-read cDNA (complementary DNA) sequences. They comprise
the largest pool of sequence data for many organisms and contain portions of
transcripts from many uncharacterized genes. Since ESTs have no annotated
coding sequences, there are no corresponding protein translations in the
BLAST protein databases. Hence, a tblastn search is the only way to search
for these potential coding regions at the protein level. The HTG sequences,
draft sequences from various genome projects or large genomic clones, are
another large source of unannotated coding regions.
Apart from the above BLAST types, there are a few more BLAST programs
available for standalone systems as well as cloud-based platforms. Some more
BLAST programs are as follows:
1. SmartBLAST: To find proteins highly similar to the query sequence.
2. Primer-BLAST: To design primers specific to a given PCR (polymerase chain
reaction) template.
3. Global Align: To compare two sequences across their entire length using
the Needleman-Wunsch algorithm.
4. CD-Search: To find the conserved domains in the given sequence.
5. IgBLAST: To analyse immunoglobulin and T-cell receptor sequences.
6. VecScreen: To search sequences for vector contamination. This tool is
used for molecular biology experiments.
7. CDART: To find sequences with similar conserved domain architecture. You
have to remember the difference between CD-Search and CDART in this case.
8. Multiple Alignment: To align sequences using domain and protein
constraints.
9. MOLE-BLAST: To establish taxonomy for uncultured or environmental
sequences (Fig. 3.15).
All the above tools are available at https://blast.ncbi.nlm.nih.gov/Blast.cgi
SAQ 6
i) What is BLAST?
Watch the YouTube video at the link below to know more details:
https://www.youtube.com/watch?v=jDHrHfx0cpw
3.5.2 Clustal W
Till now, you have studied the BLAST programs, which find sequences similar
to a query sequence within specific databases. The simultaneous alignment
of multiple nucleotide or amino acid sequences is now an essential tool in
molecular biology. Multiple sequence alignments are used:
i) To find diagnostic patterns to characterise protein families.
ii) To detect or demonstrate homology between new sequences and existing
families of sequences.
iii) To help predict the secondary and tertiary structure of new sequences,
and to suggest oligonucleotide primers for PCR (Polymerase Chain Reaction).
iv) As an essential prelude to molecular evolutionary analysis.
There are many variations of the Clustal software; a few are listed below:
ClustalV: The second generation of the Clustal software was released in 1992
and was a rewrite of the original Clustal package. It introduced phylogenetic
tree reconstruction on the final alignment, the ability to create alignments
from existing alignments, and the option to create trees from alignments
using a method called Neighbor-joining.
ClustalW: The third generation, released in 1994, greatly improved upon
previous versions. It improved the progressive alignment algorithm in
various ways, for example by allowing individual sequences to be weighted
down or up according to similarity or divergence, respectively, in a partial
alignment. It also included the ability to run the program in batch mode from
the command line.
ClustalX: This version, released in 1997, was the first to have a graphical user
interface.
Clustal_2: The updated versions of both ClustalW and ClustalX with higher
accuracy and efficiency.
3.6 SUMMARY
In this unit, we have studied the basics of sequence alignment along with the
programs or tools used to perform sequence alignment.
BLAST (Basic Local Alignment Search Tool) is a basic alignment tool, and it
has several types depending on the kind of query and the database searched.
3.8 ANSWERS
Self Assessment Questions
1. i) The NCBI database contains genome sequences of plants, animals, fungi
and bacteria, as well as protein sequences, gene sequences, etc.
ii) Chimp
iii) Four
Terminal Questions
1. i) Sequence alignment plays a major role in identifying ancestors.
3.3.1).
Video: https://www.youtube.com/watch?v=K6ldxHPzI5A
https://www.youtube.com/watch?v=A4JrzGon8mQ
Exercise 7
MOLECULAR FILE FORMATS -
FASTA, GENBANK,
GENPEPT, GCG, CLUSTAL,
SWISS-PROT, PIR
Structure
7.1 Introduction 7.2 Procedure
7.1 INTRODUCTION
In this exercise, you will practice and download different file formats such as
FASTA, GenBank, GenPept, GCG, CLUSTAL, SWISS-PROT, PIR which are
maintained by different biological databases and used for sequence analysis.
You have learned about biological databases in unit 2 of this course. The
major objective of this exercise is to familiarize you with various file formats
that are regularly used in bioinformatics.
FASTA format
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLN
GSYSEN
NCBI maintains the GenBank and GenPept formats, which store information on
DNA and protein sequences, respectively. They make it easy to find the basic
information about a sequence, such as the source organism, the authors who
sequenced it, coding information, and other details. The GenBank or GenPept
sequence format (GenBank flat file format) consists of three parts: the
header, the features, and the sequence. The start of the annotation
section (header and features) is marked by a line beginning with the word
"LOCUS". The start of the sequence section is marked by a line beginning with
the word "ORIGIN", and the end of the section is marked by a line containing
only "//". The header section consists of initial and basic information; the
feature section consists of the source, CDS, gene, and RNA features; and the
actual sequence starts after ORIGIN.
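The three markers described above (LOCUS, ORIGIN and "//") are enough to split
a flat file into its annotation and sequence sections. The following is a
minimal sketch under that assumption. The function name and the short record
in the example are made up for illustration and do not correspond to a real
GenBank entry.

```python
def parse_genbank(text):
    """Split one GenBank/GenPept flat-file record into its main parts."""
    locus = None
    annotation = []
    residues = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):        # a line with only '//' ends the record
            break
        if line.startswith("ORIGIN"):    # the sequence section starts here
            in_sequence = True
            continue
        if in_sequence:
            # Keep the blocks of letters and skip the position numbers.
            residues.extend(part for part in line.split() if part.isalpha())
        else:
            if line.startswith("LOCUS"):
                fields = line.split()
                locus = fields[1] if len(fields) > 1 else None
            annotation.append(line)
    return {
        "locus": locus,
        "annotation": annotation,
        "sequence": "".join(residues).upper(),
    }


if __name__ == "__main__":
    # Made-up record, shortened to show only the layout described above.
    demo = "\n".join([
        "LOCUS       DEMO0001                  24 bp    DNA     linear",
        "DEFINITION  Made-up record for illustration.",
        "FEATURES             Location/Qualifiers",
        "     source          1..24",
        "ORIGIN",
        "        1 atggccattg taatgggccg ctga",
        "//",
    ])
    record = parse_genbank(demo)
    print(record["locus"], record["sequence"])
```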
PIR format
b. a two-letter code describing the sequence type (P1, F1, DL, DC, RL,
RC, or XX), followed by
c. a semicolon, followed by
One or more lines containing the sequence itself follow. The end of the
sequence is marked by a "*" (asterisk) character.
Optionally, this can be followed by one or more lines describing the sequence.
A file in PIR format may comprise more than one sequence. The PIR format is
also often referred to as the NBRF (National Biomedical Research
Foundation) format.
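The layout described above can be read with a short parser. The sketch below
assumes that each record has a header line (">", the two-letter type code, a
semicolon and an identification code), one description line, and sequence
lines ending with "*". The function name and the two records in the example
are made up for illustration.

```python
def parse_pir(text):
    """Parse PIR/NBRF records into a list of dictionaries."""
    records = []
    lines = text.splitlines()
    index = 0
    while index < len(lines):
        line = lines[index].strip()
        if not line.startswith(">"):
            index += 1                    # skip blank or trailing comment lines
            continue
        # Header: '>' + two-letter type code + ';' + identification code.
        type_code, _, identifier = line[1:].partition(";")
        description = lines[index + 1].strip() if index + 1 < len(lines) else ""
        index += 2
        residues = []
        while index < len(lines):
            chunk = lines[index].strip()
            if chunk.startswith(">"):     # next record began without '*'
                break
            index += 1
            residues.append(chunk.rstrip("*").replace(" ", ""))
            if chunk.endswith("*"):       # asterisk marks the end of the sequence
                break
        records.append({
            "type": type_code,
            "id": identifier,
            "description": description,
            "sequence": "".join(residues),
        })
    return records


if __name__ == "__main__":
    demo = (
        ">P1;DEMO_ONE\nFirst made-up protein\nMKTAYIAK\nQRQISF*\n"
        ">P1;DEMO_TWO\nSecond made-up protein\nMDIAIHHP*\n"
    )
    for record in parse_pir(demo):
        print(record["type"], record["id"], record["sequence"])
```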
>P1;CRAB_CHICK
>P1;CRAB_HUMAN
SWISS-PROT
CLUSTAL
7.2 PROCEDURE
Step 1: Open the GenBank website from the following URL
Step 2: Type the sequence name or sequence ID or relevant text in the text
box or enter any keyword (Fig. 7.2).
5. Copy and save the required protein or nucleotide sequence for further
analysis (Fig. 7.6).
7.3 SUMMARY
You learned about biological databases in theory Unit 2; their data can be
retrieved and viewed in different formats.
2. Search for a COVID-related protein sequence in NCBI and download any one
sequence in GenPept format.
Exercise 8
MOLECULAR VIEWER BY
VISUALIZATION
SOFTWARE: PYMOL
Structure
8.1 Introduction
8.2 Procedure
8.3 Summary
8.1 INTRODUCTION
In the previous exercises, you learned how to access databases for protein
and nucleic acid structures. However, to analyse these structures we need to
view them using tools known as visualisation tools or software.
In this exercise, you will learn to use the PyMOL program to visualize 3-D
structures of molecules. PyMOL is a powerful tool for viewing and analyzing
the structures of proteins, DNA, and other macromolecules. It is a
stand-alone molecular visualization program based on Python. PyMOL is used to
generate high-quality molecular graphics images and animations for journal
publications describing new macromolecular structures and interactions. PyMOL
was developed by Warren DeLano. It is open source, but not free in all forms:
students and educators can use a current free version in the classroom,
anybody can obtain outdated binary releases, and certain Linux distributions
provide PyMOL packages built from the open-source code.
8.2 PROCEDURE
Step 1: Download PyMOL from the website
(http://www.pymol.org/educational) and register as a student from the link at
the bottom of the page. You will need to fill out the form, and the automated
system will eventually send you a link with a username and password. This
allows you to download the software for your personal computer or Mac. Follow
the instructions to install the software.
Step 3: Double-click on the PyMOL icon on your desktop. PyMOL brings up two
windows.
i) The top window constitutes the “External GUI (Graphical User Interface),”
and contains the menu options as well as buttons for advanced
visualization (Fig. 8.1).
ii) The bottom window contains the “Visualization Area,” which is the main
area where molecules will be displayed. The bottom window also contains
another “Internal GUI.” This GUI will contain a list of molecular objects
once you have loaded a protein structure. The bottom of this GUI has a
matrix displaying the current mouse configuration, namely what mouse
button combinations control which functions (Fig. 8.2).
Step 4: To open the PDB file, select “File→Open” in the external GUI window,
and select the PDB file 6YI3.pdb that you have already downloaded. The PDB
file will load, representing the protein
Step 5: To change the representation of the molecule, use the object control
panel shown on the right side of the Viewer.
The first name is always “all.” Clicking on a name un-displays the
corresponding molecule(s) (makes them temporarily invisible).
To make a cartoon representation, click S and select Cartoon. The molecule is
now shown as both a cartoon and a wireframe. Remove the wireframe by clicking
H and selecting lines.
Step 6: To change the background color to white follow this menu cascade
(Fig. 8.3):
Fig. 8.3: Screenshot showing how to change the background color of the
molecule.
Step 7: To save the image in the present view, follow the options File > Save
Image…
Replace the default name “PyMOL” with the file name you want; the image will
be saved as a PNG image (Fig. 8.4).
Fig. 8.4: Screenshot Showing how to save and rename the image.
Step 8: You can also use the command line for commands such as Save,
Viewport, Zoom, Ray, and Select. To execute these commands, follow the
additional study material links provided at the end of this exercise, and
practice the other options in detail.
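The menu actions of Steps 4 to 7 have command-line equivalents in PyMOL. The
sketch below does not run PyMOL; it only writes out a list of PyMOL commands
(load, hide, show, bg_color, viewport, zoom, ray and png) that can be typed
one by one at the PyMOL command prompt. The function name, the viewport size
and the image file name figure.png are assumptions made for illustration.

```python
def build_pymol_script(pdb_file, image_file, width=800, height=600):
    """Return PyMOL commands that mirror Steps 4 to 7 as plain text."""
    commands = [
        f"load {pdb_file}",             # Step 4: open the PDB file
        "hide lines",                   # Step 5: remove the wireframe
        "show cartoon",                 # Step 5: show the cartoon
        "bg_color white",               # Step 6: white background
        f"viewport {width}, {height}",  # size of the viewer window
        "zoom",                         # fit the molecule in the view
        "ray",                          # ray-trace for a smoother image
        f"png {image_file}",            # Step 7: save the image as PNG
    ]
    return "\n".join(commands) + "\n"


if __name__ == "__main__":
    print(build_pymol_script("6YI3.pdb", "figure.png"))
```

Typing the commands instead of clicking through the menus makes it easy to
repeat the same view for another structure by changing only the file names.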
8.3 SUMMARY
PyMOL is a powerful tool to visualize and analyze proteins, DNA, and
other biological molecules structures.
You have learned how to view the 3-D structure of proteins in different
poses.
Images and structures can be used to generate high-quality molecular
graphic images and animations used for journal publications.
https://bioquest.org/nimbios2010/wp-content/blogs.dir/files/2010/07/pymol_tutorial3.pdf
https://sites.pitt.edu/~epolinko/IntroPyMOL.pdf
Exercise 9
BLAST SUITE OF TOOLS FOR
PAIRWISE ALIGNMENT
Structure
9.1 Introduction
9.2 Procedure
9.3 Summary
9.1 INTRODUCTION
In Unit 3 of this course you learnt about sequence similarity searching using
the Basic Local Alignment Search Tool (BLAST), and in Exercises 5 and 7 you
practiced sequence retrieval. The sequences retrieved there will be used in
this exercise to perform database similarity searches using BLAST. BLAST is
an algorithm for comparing primary biological sequence information, such as
the amino-acid sequences of different proteins or the nucleotide sequences of
DNA. A BLAST search enables a researcher to compare a query sequence (the
unknown sequence in question) with a library or database of sequences, and
to identify library sequences that are similar to the query.
There are many different types of BLAST available from the BLAST web page.
Which one to select depends on the type of query sequence you have and on
the type of database you want to search. Different types of BLAST are given
below:
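As a rough illustration, the choice of program can be written as a lookup on the query type and the database type. The sketch below covers only the four programs used in this exercise (blastn, blastp, blastx and tblastn); tblastx, which translates both the nucleotide query and the nucleotide database, is left out. The function names and the 90% threshold used to guess the sequence type are assumptions made for this example, not part of BLAST itself.

```python
# (query type, database type) -> BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated in six frames
    ("protein", "nucleotide"): "tblastn",  # database is translated in six frames
}


def guess_sequence_type(sequence):
    """Crude heuristic: a sequence made almost only of A/C/G/T/U/N is nucleotide."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty sequence")
    nucleotide_like = sum(c in "ACGTUN" for c in letters)
    return "nucleotide" if nucleotide_like / len(letters) >= 0.9 else "protein"


def choose_blast_program(query, database_type):
    """Pick the BLAST program for a query sequence and a database type."""
    return BLAST_PROGRAMS[(guess_sequence_type(query), database_type)]


print(choose_blast_program("ATGGCGTTAGCATCG", "protein"))        # nucleotide query
print(choose_blast_program("MAENGTISVEELKRLLEQ", "nucleotide"))  # protein query
```

The heuristic can misclassify very short or unusual sequences, so always confirm the program selected on the BLAST page yourself.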
9.2 PROCEDURE
Step 1: Open the basic BLAST search page from the following URL:
http://blast.ncbi.nlm.nih.gov/Blast.cgi
2. Open the FASTA format sequence retrieved in Exercise 7 as plain text in a
text editor (Fig. 9.2).
3. Enter the gi number or accession number, or copy the entire sequence and
paste it in FASTA format into the search box provided (Fig. 9.3).
5. Make sure you have selected the correct BLAST program, and select the nr
(non-redundant) database (Fig. 9.4).
7. Write down the default parameter set and click the "BLAST" button (Fig. 9.5).
9. Once your results are computed, they will be presented in the window (Fig.
9.6).
10. Copy and save the results, then discuss and interpret them.
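The same search can, in principle, be described as a web request instead of a form. The sketch below reads one FASTA record and builds the request parameters without sending anything over the network. The parameter names (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE) are given as recalled from NCBI's BLAST URL API and should be checked against the current NCBI documentation; the example record and the request ID EXAMPLE_RID are hypothetical.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def read_fasta_record(text):
    """Split one FASTA record into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' header line")
    return lines[0][1:], "".join(lines[1:])


def build_submit_query(sequence, program="blastp", database="nr"):
    """URL-encoded parameters for submitting a search (assumed CMD=Put)."""
    return urlencode({"CMD": "Put", "PROGRAM": program,
                      "DATABASE": database, "QUERY": sequence})


def build_result_query(request_id, format_type="Text"):
    """URL-encoded parameters for fetching results by request ID (assumed CMD=Get)."""
    return urlencode({"CMD": "Get", "RID": request_id, "FORMAT_TYPE": format_type})


# Hypothetical FASTA record, split over two sequence lines.
example = ">example_query hypothetical protein\nMAENGTISVEELKRLLEQ\nWNLVIGF\n"
header, sequence = read_fasta_record(example)
print(header)
print(BLAST_URL + "?" + build_submit_query(sequence))
print(BLAST_URL + "?" + build_result_query("EXAMPLE_RID"))
```

For this exercise the web form remains the intended route; the sketch only shows which pieces of information (program, database, query) a BLAST search needs.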
9.3 SUMMARY
In the current exercise you have learnt how to use the BLAST tool with
different programs, such as blastn, blastp, blastx and tblastn, to
analyse unknown nucleotide and protein sequences obtained after
sequencing a sample.
2. Perform a protein BLAST (blastp) for the query Chain E, Spike protein S1;
copy the result and interpret it.
3. Run a blastx search for the given query FJ436056; tabulate and discuss
the results.
4. Execute a tblastn search for the given query PWZ18702 and interpret
the results.
Exercise 10
MULTIPLE SEQUENCE
ALIGNMENT USING
CLUSTALW
Structure
10.1 Introduction
10.2 Procedure
10.3 Understanding Output
10.4 Summary
10.1 INTRODUCTION
In Unit 3 of this course you learnt about sequence alignment using
CLUSTALW; in this exercise you will practice performing it.
Multiple Sequence Alignment (MSA) is the alignment of three or more
biological sequences of similar length. From the output of MSA applications,
homology can be inferred and the evolutionary relationships between the
sequences can be studied. ClustalW is a free online tool, available through
the European Bioinformatics Institute (EBI), that is used to align multiple
sequences and generate phylogenetic trees. The improved version of CLUSTAL is
Clustal Omega. When you input the sequences you want to align, Clustal Omega
generates a sequence alignment and a rooted phylogram or cladogram.
perform alignment of more than two sequences and find out the
similarity between those sequences;
10.2 PROCEDURE
STEP 1 - Retrieve three or more required sequences (nucleic acid or protein)
from the desired sequence databases. Some example sequences are shown
below:
MAENGTISVEELKRLLEQWNLVIGFIFLAWIMLLQFAYSNRNRFLYIIKLVFLWL
LWPVTLACFVLAAVYRINWVTGGIAIAMACIVGLMWLSYFVASFRLFARTRSM
WSFNPETNILLNVPLRGTILTRPLMESELVIGAVIIRGHLRMAGHSLGRCDIKD
LPKEITVATSRTLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHSGSND
NIALLVQ
MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFLYIIKLIFLW
LLWPVTLACFVLAAVYRINWITGGIAIAMACLVGLMWLSYFIASFRLFARTRS
MWSFNPETNILLNVPLHGTILTRPLLESELVIGAVILRGHLRIAGHHLGRCDIKD
LPKEITVATSRTLSYYKLGASQRVAGDSGFAAYSRYRIGNYKLNTDHSSSSD
NIALLVQ
MSNMTQLTEAQIIAIIKDWNFAWSLIFLLITIVLQYGYPSRSMTVYVFKMFVLW
LLWPSSMALSIFSAVYPIDLASQIISGIVAAVSAMMWISYFVQSIRLFMRTGSW
WSFNPETNCLLNVPFGGTTVVRPLVEDSTSVTAVVTNGHLKMAGMHFGAC
DYDRLPNEVTVAKPNVLIALKMVKRQSYGTNSGVAIYHRYKAGNYRSPPITA
DIELALLRA
STEP 2 - The software tools required for multiple sequence alignment are
available at the following URL: https://www.ebi.ac.uk/Tools/msa/clustalo/ (Fig.
10.1).
STEP 3 - Enter your input sequences: paste a set of nucleic acid or protein
sequences in a supported format, or upload a file (Fig. 10.2).
STEP 4 - Set your output format and set the multiple sequence alignment
options to their defaults (Fig. 10.3).
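Sequences pasted or uploaded in Step 3 are easiest to handle in FASTA format, where every sequence starts with a ">" header line. The sketch below shows one way to prepare such input from raw sequences. The names seq1 and seq2, the short example sequences and the 60-character line width are assumptions chosen for illustration.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text with wrapped sequence lines."""
    parts = []
    for name, sequence in records:
        cleaned = "".join(sequence.split()).upper()  # drop spaces and line breaks
        parts.append(f">{name}")
        # cut the sequence into lines of at most `width` characters
        parts.extend(cleaned[i:i + width] for i in range(0, len(cleaned), width))
    return "\n".join(parts) + "\n"


# Hypothetical short sequences; real input would be the retrieved sequences.
records = [("seq1", "MAENG TISVE\nELKRL"), ("seq2", "MADSNGTITV")]
print(to_fasta(records, width=10))
```

The text produced this way can be pasted into the input box or saved to a file and uploaded.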
10.3 UNDERSTANDING OUTPUT
The score table is the first section of the page, below the results summary box.
The score table shows the scores of the pairwise alignments of all the sequences
(Fig. 10.5).
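To get a feel for what a pairwise score means, the sketch below computes a simple percent identity for every pair of sequences in an alignment. It assumes that "-" marks a gap and that columns with a gap in either sequence are skipped; the scores reported by Clustal Omega may be calculated differently. The three short aligned sequences are made up for the example.

```python
from itertools import combinations


def percent_identity(a, b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap
        compared += 1
        matches += (x == y)
    return 100.0 * matches / compared if compared else 0.0


def score_table(alignment):
    """Return {(name1, name2): percent identity} for every pair of sequences."""
    return {(n1, n2): round(percent_identity(s1, s2), 1)
            for (n1, s1), (n2, s2) in combinations(alignment.items(), 2)}


# Hypothetical aligned sequences of equal length.
alignment = {
    "seqA": "MA-ENGTISV",
    "seqB": "MADSNGTITV",
    "seqC": "MS-NMTQLTE",
}
for pair, score in score_table(alignment).items():
    print(pair, score)
```

Higher values indicate more similar pairs, which is the same way the score table on the results page is read.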
Take a screenshot of this table, or download it by right-clicking the Output File
(.output) found in the results summary box at the top of the page (Fig. 10.6).
Clustal Omega aligns all of the input sequences; an HTML text version of the
alignment is listed just below the score table. A more extensive view of the
alignment can be seen using Jalview. Under alignment, you can click “Show
Colors” to view a coloured version of an amino acid alignment (Fig. 10.7).
In the row below the last sequence of the alignment, there may be symbols
like:
" : " – conserved substitutions have been observed, according to the colour
data
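The symbol row can be reproduced approximately with a few lines of code. In the sketch below, "*" marks a column in which all residues are identical and ":" marks a column in which all residues fall within one group of strongly similar amino acids. The group list is given as recalled from the ClustalW documentation and should be treated as an assumption. The "." symbol for weakly similar groups is not implemented, and the example sequences are made up.

```python
# Groups of strongly similar residues (assumed, as recalled from ClustalW docs).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]


def consensus_line(aligned):
    """Return one symbol per alignment column: '*', ':' or a space."""
    symbols = []
    for column in zip(*aligned):
        residues = set(column)
        if "-" in residues:
            symbols.append(" ")   # a column with a gap gets no symbol
        elif len(residues) == 1:
            symbols.append("*")   # identical in all sequences
        elif any(residues <= set(group) for group in STRONG_GROUPS):
            symbols.append(":")   # conserved substitution
        else:
            symbols.append(" ")
    return "".join(symbols)


# Hypothetical aligned protein sequences of equal length.
aligned = ["MKLSV", "MRITV", "MKAAV"]
for row in aligned:
    print(row)
print(consensus_line(aligned))
```

Comparing the printed symbol row with the columns above it shows why a column receives a particular symbol.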
The generated phylogenetic tree is at the very bottom of the results page.
You’ll notice above this there is a “Guide Tree” section. You can save the
Guide Tree. The tree can be viewed as a phylogram or a cladogram.
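A saved tree is typically a short text file. Assuming it is in Newick format, which is a common format for such trees, the branch lengths attached to each sequence can be read out as sketched below. A phylogram draws branches in proportion to these lengths, whereas a cladogram shows only the branching order. The tree string and sequence names are made up for the example, and the pattern only handles simple names that start with a letter.

```python
import re

# name:length pairs; names start with a letter, so unnamed internal nodes are skipped
LEAF_PATTERN = re.compile(r"([A-Za-z][\w.|]*):([0-9.eE+-]+)")


def leaf_branch_lengths(newick):
    """Return {sequence name: branch length} for the leaves of a Newick tree."""
    return {name: float(length) for name, length in LEAF_PATTERN.findall(newick)}


# Hypothetical tree with three sequences.
tree = "((seqA:0.10,seqB:0.20):0.05,seqC:0.30);"
for name, length in leaf_branch_lengths(tree).items():
    print(name, length)
```

In this made-up tree, seqA and seqB are grouped together and seqC joins them at the root; the numbers matter only for the phylogram view.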
10.4 SUMMARY
Multiple sequence alignment is the alignment of three or more biological
sequences of similar length.
From the output of MSA applications, homology can be inferred and the
evolutionary relationships between the sequences can be studied.
In the current exercise you have learnt to use the multiple sequence
alignment tool Clustal Omega to analyse evolutionary relationships
among sequences, and to interpret relationships among the sequences or
organisms through a phylogenetic tree.