
BBCS 185

The document outlines the Bioinformatics course (BBCS-185) offered by Indira Gandhi National Open University, designed to enhance skills in bioinformatics through theoretical units and practical exercises. It covers essential topics such as biological databases, data retrieval, and sequence alignment, while emphasizing the interdisciplinary nature of bioinformatics, integrating biology, computer science, and statistics. The course aims to equip learners with fundamental bioinformatics skills applicable in various fields, including medicine and drug development.


BBCS-185

BIOINFORMATICS
Indira Gandhi
National Open University
School of Sciences

BLOCK 1
BIOINFORMATICS
Programme and Course Design Committee
Prof. Bechan Sharma, Dept. of Biochemistry, University of Allahabad
Prof. Ranjit K. Mishra, Dept. of Biochemistry, University of Lucknow
Prof. Reena Gupta, Dept. of Biotechnology, H.P. University, Shimla
Prof. D.V. Devaraju, Dept. of Biochemistry, Bangalore University
Prof. Sanjeev Puri, UIET, Panjab University
Prof. Seemi Farhat Basir, Dept. of Bio Sciences, Jamia Millia Islamia
Dr. Sunita Joshi, Dept. of Biochemistry, Daulat Ram College, University of Delhi
Prof. Vijayshri, Former Director, School of Sciences, IGNOU, New Delhi

Faculty Members (IGNOU)
Dr. Parvesh Bubber, Biochemistry, SOS
Dr. M. Abdul Kareem, Biochemistry, SOS
Dr. Arvind Kumar Shakya, Biochemistry, SOS
Dr. Maneesha Pandey, Biochemistry, SOS
Dr. Seema Kalra, Biochemistry, SOS

Course Preparation Team


Content Editor
Dr. Venkata Rao, Principal Scientist, CSIR-CIMAP, Allalasandra, Yelahanka, Bengaluru

Content Writers
Dr. Khalid Raza (Unit 1), Assistant Prof., Computer Science, Jamia Millia Islamia, New Delhi
Dr. K.V Swami (Unit 2 and 3), Associate Prof., MIT-ADT University, Pune
Dr. K.R. Dasegowda (Exercise 1-10), Assistant Prof., Biotechnology, REVA University, Bengaluru

Course Coordinator: Dr. M. Abdul Kareem ([email protected])


Cover page and Graphic Design input: Dr. M. Abdul Kareem

Print Production Team


Sh. Rajiv Giridhar, AR (Pub.), MPDD, IGNOU
Sh. Hemant, S.O (Pub.), MPDD, IGNOU
Acknowledgement: Mr. Sumit Verma for CRC and word processing.
January, 2022
© Indira Gandhi National Open University, 2021
ISBN:
Disclaimer: Any materials adapted from web-based resources in this module are being used for educational
purposes only and not for commercial purposes.
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without
permission in writing from the Copyright holder.
Further information on the Indira Gandhi National Open University courses may be obtained from the University’s
office at Maidan Garhi, New Delhi-110 068 or the official website of IGNOU at www.ignou.ac.in.
Printed and published on behalf of Indira Gandhi National Open University, New Delhi by Prof. Sujatha Varma,
Director, SOS, IGNOU.
Printed at
BBCS-185: BIOINFORMATICS
Dear Learners, welcome to the Skill Enhancement Course (SEC) - Bioinformatics
(BBCS-185). This is a 4-credit theory course offered in the fourth semester of the B.Sc.
(Hons.) Biochemistry programme. The major aim of this course is to inculcate basic
skills pertaining to bioinformatics. You have already studied the fundamentals of
bioinformatics in Unit 11 of the course Proteins (BBCCT-105) in your second semester. In
this course, you shall learn more tools and applications of bioinformatics, also known as
computational biology.

This course is broadly divided into 3 theory units and 10 practical or hands-on
exercises. We have arranged the content so that each topic first gives you the
theoretical background and then hands-on experience through exercises.

This is one of the standalone courses in your programme that relies heavily on
computer skills. Owing to this, we have designed a few exercises where
you will learn the fundamentals of Microsoft Office, frequently used internet-based
terminology, and their applications. All the software, tools, and applications
described in this course are freely available on the internet. Hence, you are advised
to go through the course content, watch the corresponding video links
provided, and follow the instructions given to perform the exercises. A thorough
understanding of this course will help in building your career in the field of
computational biology. Since bioinformatics is one of the emerging fields of allied
sciences, researchers from various disciplines, including mathematicians, physicists,
basic and applied biologists, computer scientists, and statisticians, have contributed to the
development of this subject. Bioinformatics has vast applications in the fields of
medicine and pharmacy, with special emphasis on drug design and development.

We believe that after completing all these exercises you will be in a position to exhibit
the fundamental bioinformatics skills expected from an undergraduate learner.

Expected Learning Outcomes


After completing this course you should be able to:

• define the basic terminology frequently used in the field of bioinformatics;

• explain the applications of Microsoft Office;

• create and use email;

• search for the desired data using the existing search engines;

• perform activities like exploring biological databases and retrieving the information
present in them;

• distinguish between different biological databases;

• access the protein and DNA sequences from online resources;

• download and use the various file formats for the purpose of performing bioinformatics
exercises;

• describe and perform sequence alignment; and

• list the applications of bioinformatics.

We hope you will find this course quite interesting to study.


Best wishes:

BLOCK 1
BIOINFORMATICS
UNIT 1
Introduction to Bioinformatics
Exercise 1: Microsoft Office- Word, Excel, PowerPoint
Exercise 2: Introduction to Internet: LAN, WAN, Web Browsers, Search Engines
Exercise 3: Basics of Electronic Mail, Creating an Email Account, Sending
and Receiving Email

UNIT 2
Biological Databases and Data Retrieval
Exercise 4: Databases: NCBI, PDB, SCOP, PubMed, GenBank, UniProt
Exercise 5: Retrieval of Protein and Gene Sequences from NCBI
Exercise 6: Accessing Protein Structure from PDB

UNIT 3
Sequence Alignment
Exercise 7: Molecular File Formats - FASTA, GenBank, GenPept, GCG,
CLUSTAL, Swiss-Prot, PIR
Exercise 8: Molecular Viewer by Visualization Software: PyMOL
Exercise 9: BLAST Suite of Tools for Pairwise Alignment
Exercise 10: Multiple Sequence Alignment Using CLUSTALW
Unit 1 Introduction to Bioinformatics

UNIT 1
INTRODUCTION TO
BIOINFORMATICS

Structure
1.1 Introduction
    Expected Learning Outcomes
1.2 Basics of Computer Operations
1.3 Internet Usage
1.4 Microsoft Office Basics
1.5 Historical Background
1.6 Role of Supercomputers in Biology
1.7 Scope of Bioinformatics
1.8 Applications of Bioinformatics
1.9 Programming Languages in Bioinformatics
1.10 Important Terms used in Bioinformatics
1.11 Summary
1.12 Terminal Questions
1.13 Answers
1.14 Further Readings

1.1 INTRODUCTION
Biological data are being produced at an enormous rate. Managing and
interpreting these data is a great challenge for biologists. Computers are being
used to collect, store, retrieve, analyze and integrate biological and genetic
information. These stored biological data can then be used for the prediction of
disease, drug discovery, biomarker identification, disease diagnosis,
patient survival analysis, and so on. The vast application of computers for
handling biological data led to the development of a new field of study, known
as 'Biological Informatics', or 'Bioinformatics'.

The term 'Bioinformatics' was coined by Paulien Hogeweg and Ben Hesper in
1970, making Bioinformatics a field parallel to biophysics and biochemistry.
However, the term came into wide use only in the late 1980s. Bioinformatics is the field
of science in which biology, computer science, and information technology
merge to form a single discipline. Hence, it is an interdisciplinary field of study
where biologists, computer scientists, statisticians, and data scientists work
together (Fig. 1.1). Bioinformatics lets us understand biology in terms of
molecules and apply Information Technology (IT) techniques to understand
and organize the information associated with these molecules on a large
scale. The fundamental question it addresses is: how do we
describe, analyze, simulate, and predict the dynamics of biological
processes by using IT tools? Defining Bioinformatics is a non-trivial task, as its
definition varies from person to person, depending on how each person
perceives the field. A rough definition can be "the application of
computers to handle biological information". However, there are also
standard definitions given by organizations and researchers.

Fig. 1.1: Bioinformatics as an interdisciplinary field of study.

Hence, one of the possible definitions of bioinformatics can be: "an
interdisciplinary field of science, combining computer science, mathematics,
statistics, biology, and engineering, whose aim is to develop methods,
software tools, and databases to understand, analyze and interpret biological
data with the objective to understand the biological phenomenon and discover
new biological insights". Some authors have also stated that "the
association of biology and computer science created a new discipline, called
Bioinformatics". Hence, we can say that among many different disciplines,
computer science and biology are the major contributors to bioinformatics,
where biology is the problem domain and computer science is the solution
provider.

SAQ 1
i) Why is bioinformatics called an interdisciplinary field of science?

ii) Who coined the term “Bioinformatics”?

iii) What are the essential components of bioinformatics?


Expected Learning Outcomes


After studying this unit, you should be able to:

• define bioinformatics;

• describe important terminologies used in bioinformatics;

• list programming languages and role of supercomputers in
bioinformatics; and

• explain scope and applications of bioinformatics.


1.2 BASICS OF COMPUTER OPERATIONS
Today’s modern computer is a versatile machine. It can perform numerous tasks
such as running games, playing music and movies, word processing,
spreadsheet processing, calculations, bulk data storage, quick information
retrieval, online transaction processing, and many other kinds of information
processing. At first glance, identifying the basic operations of a
computer seems confusing because of this exhaustive list of tasks.
However, at the basic level, a computer carries out a “data
processing” task irrespective of the application.

A computer has four basic operations: Input, Processing, Storage, and Output
(Fig. 1.2).

Input: Input is a basic operation of a computer. It is the act of
feeding data and instructions into the computer. The data and/or
instructions are supplied to the computer in a predefined format. A computer
system comprises different functional units, and the Input Unit handles the
input operations. Data commonly enter a computer through input devices such as
the keyboard, mouse, microphones, cameras, scanners, and
sensors, or from disks and network connections.

Processing: Data processing is the central operation of a
computer. This is the operation where the computer does the data crunching,
for instance, doing arithmetic calculations, or processing electric signals
received from the microphone and images from the camera and combining them
to produce a video clip. The central processing unit (CPU) performs the data
processing tasks and directs the operation of input and output devices. The
CPU comprises an arithmetic and logic unit (ALU), a control unit (CU), cache
memory, and different registers. The ALU does the arithmetic and
comparisons, while the CU controls the operations. During processing, data
and instructions are held in the primary memory such as random access
memory (RAM). After the completion of processing, the final output is sent to
storage units.

Storage: Data and instructions (computer programs) entered into the
computer must be stored somewhere before actual processing starts.
Similarly, the output produced by the computer needs to be stored before it is
sent to the output device. Also, intermediate results produced by the computer
must be stored for further processing. Thus, storage is an important operation
of a computer. Storage devices are classified as “internal storage” or
“external storage” based on whether they are inside the main machine or not.
Further, storage devices are also classified as “primary storage” or
“secondary storage” based on their closeness to the CPU or their use as
backup media. Primary storage is also called primary memory, and secondary
storage is also called backup storage, or secondary memory. Some of the
common storage devices used in a computer system are registers, cache,
RAM/ROM, magnetic disk, optical disk, flash disk, and network-based
(such as cloud) storage.

Output: This is where the result of processing goes. The output operation
presents the processed results to the user in a suitable form. The devices that
can output information are called Output Devices. Output devices can display
information on a screen, print it on paper, or send it to other computers via network
connections. Output devices also display error messages and
dialog boxes asking for more information to be input. Some of the common
output devices are monitors, speakers, projectors, printers, plotters, etc.

Fig. 1.2: Basic operations of a computer.
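
The four operations described above can be sketched as a small Python program. This is only an illustrative sketch, not part of the course exercises: the function names (read_input, process, store, output) and the toy DNA sequences are assumptions chosen for the example, and the "processing" step simply computes the fraction of G and C bases in each sequence.

```python
# A minimal sketch of Input -> Processing -> Storage -> Output.
# All names and the toy data below are hypothetical, chosen for illustration.

def read_input(raw_text):
    """Input: accept data in a predefined format (one DNA sequence per line)."""
    return [line.strip().upper() for line in raw_text.splitlines() if line.strip()]

def process(sequences):
    """Processing: compute the GC fraction of each sequence."""
    results = {}
    for seq in sequences:
        gc_count = sum(1 for base in seq if base in "GC")
        results[seq] = gc_count / len(seq)
    return results

def store(results, storage):
    """Storage: keep the results in a dictionary standing in for memory or disk."""
    storage.update(results)
    return storage

def output(storage):
    """Output: present the stored results in a readable form."""
    return [f"{seq}: GC = {fraction:.2f}" for seq, fraction in storage.items()]

raw = "atgc\nggcc\naatt\n"          # toy input, as if typed on a keyboard
memory = {}                         # stands in for a storage unit
report = output(store(process(read_input(raw)), memory))
for line in report:
    print(line)
```

Each function corresponds to one box in Fig. 1.2; in a real computer these roles are played by hardware units rather than functions.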

1.3 INTERNET USAGE


The Internet is the global structure of interconnected computer networks that
provides myriad information and communication facilities. It uses the standard
TCP/IP (Transmission Control Protocol/Internet Protocol) suite to
communicate between networks and the various attached devices. The Internet is
a fast-growing and highly transformative technology that has become a
commodity today. It carries numerous types of information resources and
services that include hypertext documents (web pages), web applications of
the World Wide Web (WWW), email, file sharing, and a variety of web-based
services. The number of Internet users is increasing rapidly. In 2021, the
total number of Internet users worldwide was estimated to be 4,832.89 million,
which is projected to reach 5,631.54 million by 2025.
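
To put the two figures above in perspective, the short Python sketch below computes the total and the average yearly growth they imply. The variable names are illustrative, and the assumption that growth compounds evenly over the four years is ours, not a claim made by the source of the estimates.

```python
# Growth implied by the two user estimates quoted above (in millions).
users_2021 = 4832.89
users_2025 = 5631.54
years = 2025 - 2021

# Total growth over the whole period, as a fraction of the 2021 figure.
total_growth = (users_2025 - users_2021) / users_2021

# Average yearly growth, assuming (for illustration) even compounding.
yearly_growth = (users_2025 / users_2021) ** (1 / years) - 1

print(f"Total growth 2021-2025: {total_growth:.1%}")
print(f"Average yearly growth:  {yearly_growth:.1%}")
```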

Usage of Internet in Bioinformatics

As far as the resources of biological information and bioinformatics are
concerned, the information superhighway, i.e. the Internet, plays a vital role.
The Internet is the key platform on which Bioinformatics tools are
delivered. Some of the facilities provided by the Internet for
Bioinformatics are as follows:

• Online bioinformatics resources (databases and tools) such as NCBI
(National Center for Biotechnology Information), Protein Data Bank
(PDB), PubChem, Basic Local Alignment Search Tool (BLAST), etc.

• Scientific literature databases such as PubMed, PubMed Central, etc.

• Web-based platform for high-end bioinformatics computing.

• Bioinformatics courses and tutorials.

Some of the important Internet resources of Bioinformatics are listed in
Table 1.1.

Table 1.1: Summary of some of the important online bioinformatics resources

Resource Name | Description | Website
National Center for Biotechnology Information (NCBI) | Houses a series of biological databases (such as GenBank for DNA sequences, PubChem for molecules, PubMed for biomedical literature, etc.), bioinformatics tools (such as BLAST for sequence alignment, Entrez as a search engine, etc.), and services. | https://www.ncbi.nlm.nih.gov/
Protein Data Bank (PDB) | Stores and provides 3D structure data for large biological molecules such as proteins, DNA, and RNA. | https://www.rcsb.org/
UniProt | Provides a comprehensive, high-quality, and freely accessible resource of protein sequence and functional information. | https://www.uniprot.org/
EMBL-EBI | The European Bioinformatics Institute (EBI) is a part of the European Molecular Biology Laboratory. It makes public biological data available to the scientific community through a range of services and tools and provides professional training in bioinformatics. | https://www.ebi.ac.uk/
ChemSpider | A free chemical structure database that provides access to over 100 million structures from hundreds of data sources. | http://www.chemspider.com/
Kyoto Encyclopedia of Genes and Genomes (KEGG) | A database resource for understanding high-level functions and utilities of the biological system from molecular-level information. | https://www.genome.jp/kegg/
Expasy | An extensive and integrative bioinformatics resource portal by the Swiss Institute of Bioinformatics (SIB) that provides access to over 160 databases and software tools. | https://www.expasy.org/


1.4 MICROSOFT OFFICE BASICS


Microsoft Office (or MS Office) is a suite of software and services developed by
Microsoft Corporation for office or business applications. It was primarily
developed to automate manual office work. MS Office was designed for
personal computers, and is available for the Windows and Mac operating
systems. Its initial version contained only MS Word, MS Excel, and MS
PowerPoint, but over the years MS Access, OneNote, Outlook, and Publisher
were added. Each of these applications serves a specific purpose.
For instance, MS Word lets users create text documents; MS Excel
supports the creation of simple to complex financial spreadsheets; MS PowerPoint
allows the creation of multimedia presentations; MS Access lets us manage
database applications; MS Publisher supports the design of publishing and
marketing materials; and OneNote helps users organize their notes.

MS Office is available as both a desktop version and an online version. The
desktop version is the most widely used. The online offering is available from
the cloud as a lighter version (Office Web Apps) and a full version (Office 365).
Office Online runs within a web browser, and Office mobile apps are
available for platforms such as Android and iOS.

Features of MS Office

Microsoft Word

• Creating text documents and editing them when required.

• Defining page layout (landscape or portrait), page size, page margins,
etc.

• Text formatting such as defining font size, font type, font styles, color,
etc.

• Text may be formatted in column style.

• Facility to add Tables.

• Insertion of graphical pictures, and images from a file or clipart gallery.

• Header and footer, page numbering.

• Spelling and grammar check.

• Word count and other statistics.

• Facility of macros to automate some functions.

• Online help for any option.

Microsoft Excel

• A general-purpose electronic spreadsheet for financial computation and
data analytics.

• Helps prepare anything from a simple family budget to a complex accounting
ledger for a business.

• AutoSum (summing up the selected values), List AutoFill (automatically
fills the data as per our choice), AutoShapes (drawing geometrical
shapes), etc.

• A rich set of predefined statistical, financial, and accounting functions.

• Drag and Drop feature helps us reposition data and text by simply
dragging the data using a mouse.

• Charts feature helps us present a graphical representation of data in the
form of Bar, Line, Pie charts, Boxplot, etc.

• PivotTable allows data analysis and the generation of various
reports such as statistical reports, periodic financial statements, etc.

Microsoft PowerPoint

• Helps us create a presentation (collection of slides containing
information on a topic) for business meetings, marketing, training, and
educational purposes.

• Wizard helps us go through the presentation creation process.

• Design templates, with a predefined background and font style, facilitate
faster creation of attractive presentations.

• Preloaded themes that allow us to change from simple color to complete
format and layouts.

• Slide animation and multimedia content.

• Various slide presentation options, progression, navigation, etc.

SAQ 2
i) What are the basic computational operations?

ii) Name some online software/tools/databases that are used in
bioinformatics research.

iii) Differentiate between MS Word and MS Excel.

iv) Fill in the blanks:

a) …………………. helps us create a presentation (collection of slides
containing information on a topic) for business meetings, marketing,
training, and educational purposes.

b) The …………………….. is the global structure of interconnected
computer networks that provides myriad information and communication
facilities.

c) Storage devices are also classified as ……………….. or
……………………………. based on their closeness to the CPU or
their use as backup media.


You will learn these in detail while performing the hands-on exercises provided.

1.5 HISTORICAL BACKGROUND


The foundations of bioinformatics were laid in the early 1960s, when in silico
approaches were first applied to various biological problems, from
sequence analysis to structure prediction, drug discovery and design, and
biological database creation and data collection. Margaret Oakley Dayhoff is
known as the "Mother and Father of Bioinformatics" for her pioneering work in the
early years of the field, including the concept of evolutionary substitution matrices for protein
sequences. After the successful completion of the Human Genome Project
(1990-2003), bioinformatics came to be regarded as a full-fledged domain of
science that further expanded into various sub-fields of study. In effect,
the Human Genome Project gave a push to the field of bioinformatics and
placed it on an equal footing with its counterpart subjects of
physics, chemistry, biology, statistics, mathematics, and computer science.

Owing to the breakthroughs of the 20th century, the field further moved on to the
concepts of Big data, artificial intelligence (AI), translational bioinformatics,
computational intelligence (CI), virtual intelligence, deep learning (DL), etc.
There was a huge expansion of data from all the domains of science that
needed to be stored and managed. Therefore, further branches were
added to the field of bioinformatics, namely systems biology, genomics,
proteomics, transcriptomics, metabolomics, epigenomics, nutrigenomics,
pathway analysis, synthetic biology, computer-aided drug discovery and
designing (CADD), system modeling, and nanoinformatics. All these fields
have grown since then and are still expanding and contributing significantly to
scientific knowledge.

1.6 ROLE OF SUPERCOMPUTERS IN BIOLOGY
The term “supercomputer” refers to a computer with immense
processing power. A supercomputer is mainly deployed for complex,
time-consuming, and demanding problems that arise in scientific and
engineering domains. It is a 5th generation type of computer and helps
researchers, government officials, academia, and industries to handle very large
data sets and perform rapid computations efficiently. In bioinformatics,
supercomputers are used for demanding tasks such as molecular simulation of
large systems, sequencing, molecular modeling, etc. These tasks benefit from
parallel execution such as multithreading.
Some of the existing bioinformatics suites and software that use a multithreading
approach to execute tasks faster are BBMap, Bowtie2, BWA,
Velvet, IDBA, SPAdes, Clustal Omega, MAFFT, SINA, and GROMACS.
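
The idea of multithreading mentioned above, splitting independent pieces of work across several workers, can be sketched in Python with the standard concurrent.futures module. This is a toy illustration under stated assumptions: the gc_fraction function and the four short sequences are hypothetical, and the tools listed above implement threading in their own compiled code rather than in this way. Note also that in standard Python, threads do not usually speed up pure-Python calculations like this one; the sketch shows only how work is divided and collected.

```python
from concurrent.futures import ThreadPoolExecutor

def gc_fraction(seq):
    """Return the fraction of G and C bases in one sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)

# Toy data (hypothetical); real workloads involve millions of reads.
sequences = ["ATGCGC", "AAAA", "GGCC", "ATAT"]

# Each sequence is an independent task, so the tasks can be handed to a
# pool of worker threads. pool.map returns results in the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(gc_fraction, sequences))

# The same work done one sequence at a time, for comparison.
serial_results = [gc_fraction(seq) for seq in sequences]

print(parallel_results == serial_results)
```

The design point is that the tasks do not depend on one another, which is what makes them suitable for parallel execution on a machine with many processing cores.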

1.7 SCOPE OF BIOINFORMATICS


The scope of bioinformatics today is multi-faceted. Bioinformatics has
extended into various domains such as genomics (the study of the
complete set of genes and genetic elements), proteomics (the study of all the
proteins and proteomes), computational drug discovery and
designing (CADD), and systems biology. Moreover, it supports different kinds of
data analysis and provides tools for modeling, visualizing, and understanding
data. The aim of bioinformatics is to organize data and convert it into
meaningful and significant pieces of knowledge.

The scope of Bioinformatics in various domains can be divided as follows:

a. Genomics & Proteomics: The field of bioinformatics that deals with the
analysis and understanding of nucleotide and protein sequences, structures, and
functions.

b. Biological-based software development: This field covers the
development of new biological software and tools that help
non-computational researchers/students to analyze and interpret data easily.

c. Drug discovery & designing: This field of study deals with screening small
compounds/ligands that could be used as drug compounds for
disease treatment and prevention, using in silico methods.
There are mainly two types of approaches adopted in CADD: i)
structure-based and ii) ligand-based. In a structure-based approach, the 3D
structure of the target receptor is known; its binding cavities are
identified and ligands (small molecules/drug candidates) are docked into them. In
ligand-based approaches, the structure of the target receptor is not known,
so candidate molecules are assessed using information from ligands that are
already known to act on the target.

d. Systems biology: The field of bioinformatics that allows modeling, simulation,
and understanding of complex biological systems: molecules, cells, tissues,
organs, and whole organisms. Bioinformatics has helped to combine computational
tools and software for understanding the many layers of biological systems, and thus
helps in discovering the essential characteristics of molecules,
cells, tissues, and organisms that, when combined, function holistically as a single
system.
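
As an illustration of the ligand-based approach in item c above, the sketch below ranks candidate molecules by their similarity to a ligand that is assumed to be a known active. The Tanimoto coefficient used here is one common similarity measure and is our choice for the example; the feature names and the candidate molecules are entirely hypothetical, and real tools derive such features (fingerprints) from chemical structures.

```python
def tanimoto(features_a, features_b):
    """Tanimoto similarity of two feature sets: shared features / all features."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical feature sets, invented for illustration only.
known_active = {"aromatic_ring", "hydroxyl", "amine", "carbonyl"}
candidates = {
    "candidate_1": {"aromatic_ring", "hydroxyl", "amine"},
    "candidate_2": {"carbonyl", "halogen"},
}

# Rank candidates from most to least similar to the known active ligand.
scores = {name: tanimoto(known_active, feats) for name, feats in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(name, round(scores[name], 2))
```

No receptor structure is used anywhere in this sketch, which is the defining feature of a ligand-based method.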

SAQ 3
i) Fill in the blanks:
a) ………………………………………….. is known as the “Father &
Mother of Bioinformatics”.
b) A computer that holds immense processing power is called a
………………………………………………..
c) ………………………………….. deals with the study of the holistic
genes and genetic elements.
d) Some of the existing bioinformatics suites and software that use
multithreading approaches are ……………., …………………,
………………….. and …………………..
ii) What are the scopes of bioinformatics?
iii) Write a short note on supercomputers.


1.8 APPLICATIONS OF BIOINFORMATICS


The field of bioinformatics has many beneficial applications in diverse fields,
including the following:

Comparative and evolutionary studies

Comparing the genomes of different species is a useful way to unravel the
evolutionary relationship between the species. With the help of comparative
genomics, it is possible to find conserved sequences between species that
represent the functional parts of the genome. For instance, when the genomes of
humans and mice were compared, it was found that about 5% of the sequence
was conserved, revealing the minimum amount of functional DNA in both the
genomes.
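
A simple way to quantify conservation between two sequences is percent identity over an alignment. The sketch below is a minimal illustration, assuming the two sequences have already been aligned (gaps shown as "-"); the function name and the two short sequences are hypothetical and are not real human or mouse data.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned (non-gap) positions where the two sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue            # skip positions where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared

# Toy aligned sequences (hypothetical).
species_1 = "ATGCC-TGA"
species_2 = "ATGACGTGA"

identity = percent_identity(species_1, species_2)
print(f"Percent identity: {identity:.1f}%")
```

Producing the alignment itself is a separate step, covered in Unit 3; here the gap positions are simply ignored when counting matches, which is one of several conventions in use.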

Molecular medicine

Many diseases are associated with alterations in the genome. Therefore, knowledge
of the human genome has a profound impact on biomedical research and clinical
medicine. With the availability of a complete human genome, it is possible to
search for genes that are directly associated with various diseases and
discover the molecular basis of these diseases. The discovery of the
molecular basis of a disease enables better treatment and diagnosis of
the disease, and the development of molecular medicines. Advances
in molecular medicine help in better drug discovery, personalized
medicines, preventive medicines, and gene therapy.

Microbial genome applications

Microorganisms are found everywhere and can survive extreme heat, cold,
radiation, acidity, salt, and pressure. Researchers have begun to understand
these microbes at a fundamental level by studying their genetic material.
Bioinformatics tools and techniques are applied in microbial
genomics for purposes including:

• Waste cleanup

• Climate change study

• Alternative energy source

• Antibiotic resistance

• Bio-weapon creation

• Forensic analysis of microbes, and

• Biotechnology

Agriculture

The sequencing of plant and animal genomes has greatly benefited the field
of agriculture. Bioinformatics tools and techniques are applied to search for
genes within these genomes and to find their functions. Some of the
applications of bioinformatics in agriculture research are as follows:

• Crop improvement

• Insect resistance and control

• Improved nutritional quality

Veterinary Science

With the genomes of several farm animals, including cows, sheep, and
pigs, now sequenced, a better understanding of these organisms is expected to have
a large impact on production and the health of livestock, which would
ultimately benefit human beings.

Computer-Aided Drug Design

Drug design is the process of finding new medications based on knowledge of
a biological target. Basically, drug design is a process of designing molecules
that are complementary in shape and charge to the target biomolecules with
which they interact and bind. Drug design mostly relies on computer modeling
techniques, called computer-aided drug design (CADD). CADD is applied
to speed up and ease hit identification and hit-to-lead selection, to optimize
absorption, distribution, metabolism, excretion, and toxicity profiles, and to avoid
safety issues. Structure-based design is the most common approach. Structure-based
drug design relies on knowledge of the 3D structure of the target biomolecule.
Other commonly used computational approaches are ligand-based
drug design, quantitative structure-activity relationships (QSAR), and
quantitative structure-property relationships (QSPR).

SAQ 4
i) Mention a few real-life applications of bioinformatics research.

ii) Write a short note on computer-aided drug designing (CADD).

iii) Can bioinformatics-based research help in improving the medical
infrastructure and in providing person-centric/personalized treatment to
patients?

1.9 PROGRAMMING LANGUAGES IN BIOINFORMATICS
Programming languages are essential for a bioinformatician, as for any
software engineer, because different aspects of research require algorithms to
be implemented. Whether one is developing a new method or choosing a
specific software or tool to carry out an analysis task,
programming languages play a crucial role in bioinformatics. Some of the
commonly used programming languages in bioinformatics are Python,
Perl, Java, R, and BASH. Out of these, Python is a very popular high-level
programming language.

R is mainly used for statistical data analysis and visualization, while
Python, Java, and BASH are used to develop new tools and software. For
software development, bioinformaticians also use C and C++ alongside Java,
mainly where speed and efficiency matter. MATLAB is also used to apply
machine learning algorithms to various biological problems. A
bioinformatician is generally expected to learn at least three languages; the
combination most widely used by bioinformatics learners and researchers is
Python, R, and BASH.
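To give a feel for why Python is popular for everyday sequence work, here is a minimal sketch that computes the GC content (percentage of G and C bases) of a DNA string. The function name gc_content and the example sequence are illustrative assumptions and are not part of the course material.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Count guanine and cytosine bases and express them as a percentage.
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


if __name__ == "__main__":
    example = "ATGCGCTA"  # hypothetical sequence used only for illustration
    print(f"GC content: {gc_content(example):.1f}%")  # prints "GC content: 50.0%"
```

The whole task fits in a few readable lines using only built-in string methods, which is one reason beginners in bioinformatics often start with Python.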

1.10 IMPORTANT TERMS USED IN BIOINFORMATICS
Bioinformatics is a multidisciplinary field of science that brings together
concepts from computer science, biology, physics, chemistry, statistics, and
mathematics to help in the acquisition, storage, analysis, and dissemination
of biological data. It deploys computational programs for a myriad of
applications, such as predicting gene and protein functions, inferring
phylogeny and ancestral history, analyzing nucleic acid and protein
sequences, predicting the structures of biomacromolecules, drug discovery and
design, and assessing protein structural stability using molecular dynamics
simulation.

Some of the important and prominent terms that are used in bioinformatics are
described in the following section. These terms will be frequently used while
performing lab exercises and research activities in this field.

Algorithm: A fixed, step-by-step procedure for solving a problem, typically
carried out by a computer program.

Alignment: A procedure for lining up sequences (nucleic acids or proteins) to
achieve the highest level of similarity, in order to assess the degree of
homology and the conservation of residues.

Amino acid sequence: The order of amino acid residues forming a protein.

Annotation: Adding important information about the gene(s) or amino acid
sequences (proteins) to a newly identified raw sequence in a
database/repository.

Assembly: Arranging sequenced stretches of DNA in their correct chromosomal
order.

Base: One of the nitrogenous units that form DNA or RNA. Examples: A, T, C,
U, G.

Base pair: Two complementary nitrogenous bases (A with T or U, and G with C)
held together by hydrogen bonds.

Base sequence: The order of bases in a DNA molecule.

BLAST: The Basic Local Alignment Search Tool (BLAST) is a sequence similarity
search tool that finds sequences (nucleic acids or proteins) similar to a
query sequence and reports the percentage identity between them. Some of the
variants of BLAST programs are BLASTn, BLASTp, tBLASTn, tBLASTx, BLASTx,
PSI-BLAST, MegaBLAST, and SmartBLAST.

BLOSUM: Blocks Substitution Matrix (BLOSUM) is a substitution matrix that
provides a score for each possible amino acid substitution, derived from the
substitution frequencies observed in conserved, ungapped local alignments
(blocks) of related proteins. The most commonly used matrix is BLOSUM62,
which is built from blocks in which sequences sharing more than 62% identity
are clustered together.

cDNA library: A collection of complementary DNA (cDNA) sequences prepared
from mRNA, representing the expressed genes.

Conservation: Residues (amino acids/bases) that do not change, that is, show
no insertions or deletions (indels) and no substitutions, over the course of
evolution. They are the stretches that reflect the original DNA or protein
sequence.

Contig: A set of overlapping DNA segments that together represent a
continuous stretch of a particular chromosome.

Data warehouse: A collection that brings together different repositories,
databases, and data tables to aid in understanding a particular topic, such
as a gene, chromosome, or protein.

Diploid: A complete set of paired chromosomes. For example, the diploid
human genome consists of 46 chromosomes, 23 from each parent.

Directed sequencing: Sequencing DNA successively from adjacent stretches of a
particular chromosome.

Disease-associated genes: Alternative forms of genes (alleles) that carry a
specific disease-related DNA sequence.

Domain: A stretch of amino acid residues that can fold independently and has
a specific function.

Draft sequence: A rough, unfinished sequence made up mostly of fragments
about 10,000 base pairs long whose approximate chromosomal locations are
known.

E-value: The expectation value (E-value) is a parameter that gives the number
of hits with an equal or better score that one can "expect" to see by chance
when searching a set of sequences. The lower the E-value, the more
significant the match.

E. coli: Escherichia coli is a common bacterium used as a model organism
because of its small size and rapid growth within a few hours.


EST: An Expressed Sequence Tag (EST) is a short stretch of cDNA sequence used
to identify and locate expressed genes.

FASTA: A heuristic similarity searching algorithm that uses short matching
'seeds' to find local alignments and then extends the search along the entire
query sequence.

Functional genomics: The study of genes and how they code for their
respective proteins, which in turn play crucial roles in biological, cellular,
and chemical processes in the body.

Gap: A space added in a sequence alignment to compensate for insertions and
deletions. Gaps in alignments usually correspond to loops or bulges within
protein structures.

GenBank: An open-access database of annotated nucleotide sequences and their
protein translations, produced and maintained by the National Center for
Biotechnology Information (NCBI) as part of the International Nucleotide
Sequence Database Collaboration (INSDC).

Gene chip technology: The development of complementary DNA (cDNA)
microarrays from a large number of genes.

Gene expression: The process by which a gene is activated in a cell to
produce RNA and proteins.

Gene family: A set of similar and related genes that form similar proteins.

Gene mapping: Identifying the specific positions of genes on a
chromosome/DNA, along with the distances and linkage between them.

Gene prediction: A computer-based prediction of possible genes in a DNA
sequence.

Genome: The complete genetic composition of an individual, including its DNA,
chromosomes, genes, and other genetic elements.

Genomic library: A collection of stored copies of DNA that together represent
the entire genome of an individual.

Genomics: The study of genes, DNA, genetic elements, and their function.

Genotype: The genetic makeup of an individual.

Global alignment: An alignment of two sequences (DNA/protein) over their
entire length; it is suited to sequences that are similar along most of their
length.

Highly conserved sequence: A sequence that is found in similar form in
different species.

High-throughput sequencing: Also called next-generation sequencing (NGS). It
is a rapid way of determining the order of bases in DNA.

Homolog: A gene/protein that is related to another gene/protein because of
common ancestry.

Homologous chromosomes: A pair of chromosomes, one inherited from each
parent, that carry the same genes in the same order.

Homology: Similarity between sequences that results from descent from a
common ancestor.

High-scoring segment pair (HSP): Ungapped alignments that have the highest
alignment scores in a search. These are usually obtained by local alignment.

Human Genome Initiative (HGI): A global scientific research project that
started in 1986 with the aim of sequencing human DNA, identifying its base
pairs, and mapping the entire human genome from both a physical and a
functional point of view.

In vitro: Studies carried out in a laboratory, outside a living organism.

In vivo: Studies carried out inside an organism.

In silico: Studies carried out using computers.

Kilobase (kb): A unit of length equal to 1000 nucleotides in DNA.

K-value: A statistical parameter used in BLAST to convert a raw score (S) to
a bit score (S').

Linkage: The close proximity of markers on a chromosome; closely linked
markers tend to be inherited together.

Local alignment: An alignment of the most similar short stretches of two
sequences (DNA/protein) rather than of their entire length.

Messenger RNA (mRNA): The RNA that carries the genetic information required
to form proteins.

Microarray: A large group of microscopic DNA spots bound to a solid surface.
It is used to determine gene/protein expression.

Model organisms: Organisms used for experiments and research in a laboratory.

Modeling: The use of in silico analysis to develop or predict various
biological or biochemical structures.

Motif: A short, highly conserved stretch of residues in a protein sequence.

Multiple Sequence Alignment (MSA): Aligning more than two sequences
(DNA/protein) to identify the best matching residues and the percentage
similarity. Commonly used tools for MSA are Clustal W, Clustal Omega,
T-Coffee, etc.

Molecular docking: A computational approach that predicts how a small
molecule (ligand, drug compound, ion, etc.) binds to a target receptor
(protein/DNA). AutoDock Vina is a commonly used docking program.

Molecular dynamics simulation: A computational approach used to check the
stability of a target receptor (protein structure) or a docked complex
(protein-drug). GROMACS, NAMD (with VMD), and Desmond (Schrödinger) are some
of the software packages used to run molecular dynamics simulations.

Open reading frame (ORF): The sequence of either DNA or RNA that lies between
the start codon and the termination codon.

Operon: A group of functionally related genes that share a single promoter.

Optimal alignment: The alignment between sequences that has the best
possible score.

Orthologous: Homologous sequences present in different species that were
derived from a common ancestor through speciation; they often retain a
similar function.

P-value: A measure of the probability that an alignment with a given score
occurs by random chance. P-values closer to zero (0) are highly significant.
P-values and E-values are different ways of expressing the reliability of a
sequence alignment.

Point Accepted Mutation (PAM): Also called Percent Accepted Mutation. The
concept was described and developed by Margaret Oakley Dayhoff, known as the
Mother of Bioinformatics. A point accepted mutation is the substitution of a
single amino acid residue in the primary structure of a protein by another
residue that has been accepted by natural selection over the course of
evolution. A PAM matrix is essentially a look-up table in which each pair of
residues is assigned a score computed from the number of times that
substitution has occurred in proteins.

Pharmacogenomics: The study of how an individual's genetic makeup affects the
response to an administered drug.

Pharmacokinetics: The study of how the human body acts on an administered
drug, measured in terms of drug absorption, distribution, metabolism,
excretion, and possible toxicity in the human body.

Pharmacodynamics: The study of how a drug acts on the human body (its
biological and pharmacological effects) after it is administered.


Phenotype: The observable physical features of an organism.

Proteomics: A large-scale systematic study of proteins.

Pseudogene: A non-functional gene that does not code for any functional
protein.

PSI-BLAST: Position-Specific Iterative BLAST is a similarity search tool that
searches iteratively, using a profile (PSSM) to find the best possible hits
for a protein query sequence.

Position-specific scoring matrix (PSSM): A matrix that gives the log-odds
score for finding a particular amino acid residue at each position of a
protein sequence. It is also called a profile.

Phylogenetics: The study of the evolutionary ancestry of an individual or a
species.

Query: The input (for example, a sequence) that the user submits for
analysis or for searching a database.

Quantitative structure-activity relationship (QSAR): A computational method
in drug discovery and design that relates the properties of chemical
compounds to their biological activities.

Scaffold: A chain of contigs that lie in the correct order but are not
connected into one continuous stretch of sequence.

Seed alignment: An alignment containing only one of each pair of homologues.

Sequence assembly: A method that determines the order of multiple sequenced
DNA fragments.

Sequencing: Determining the order of bases in DNA or RNA, or of amino acid
residues in a protein molecule.

Sequencing technology: The technology used to determine the sequence of
nucleotides in DNA, such as Sanger sequencing and next-generation sequencing
(NGS).

Structural genomics: A sub-field of genomics that deals with the
identification of the three-dimensional (tertiary) structures of biomolecules
using in silico and in vitro techniques.

Substitution matrix: A matrix that describes the rate at which any base
(DNA/RNA) or amino acid (protein) gets replaced by another over evolutionary
time.

Tandem repeat sequences: A pattern of nucleotides repeated one after another
in a DNA sequence.

Transcription factor: A protein that controls gene expression by binding to a
DNA sequence.


Transcriptome: The complete set of transcripts (mRNAs and other RNAs)
produced from the activated genes of a cell or organism.

Unweighted pair group method with arithmetic mean (UPGMA): A bottom-up
hierarchical clustering method used in phylogenetic analysis.

Wild type: The typical, unmodified form of an organism as it occurs in
nature.
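Several of the terms above (alignment, gap, global alignment, optimal alignment) can be tied together with a small sketch. The code below computes the optimal global alignment score of two sequences by dynamic programming, in the style of the Needleman-Wunsch algorithm (the unit itself does not name this algorithm). The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration; real tools use substitution matrices such as BLOSUM62 for proteins.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Return the best global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            # Align a[i-1] with b[j-1] (match or mismatch).
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            # Align a[i-1] with a gap in b.
            up = score[i - 1][j] + gap
            # Align b[j-1] with a gap in a.
            left = score[i][j - 1] + gap
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]


if __name__ == "__main__":
    # ACGT vs A-GT: three matches (+3) and one gap (-2)
    print(global_alignment_score("ACGT", "AGT"))  # prints 1
```

The value in the bottom-right cell is the score of the optimal alignment over the full length of both sequences. A local alignment differs mainly in that scores are not allowed to fall below zero and the best cell anywhere in the table is reported.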

SAQ 5
i) Define bioinformatics.

ii) Differentiate between: Pharmacokinetics & pharmacodynamics

iii) Expand the following:

a) PSSM

b) BLAST

c) UPGMA

d) ORF

iv) What is BLOSUM?

1.11 SUMMARY
• Bioinformatics is an interdisciplinary science combining biology,
computer science, statistics, and information technology (IT).
Bioinformatics lets us understand biology in terms of molecules and
apply IT techniques to understand and organize the information
associated with these molecules on a large scale.

• A computer system has four fundamental operations: input, processing,
storage, and output. Today, the Internet is widely used for various
applications, including in the field of bioinformatics.

• The Internet serves as the key platform for bioinformatics, providing
(i) online bioinformatics resources (databases and tools) such as NCBI,
PDB, PubChem, and BLAST, (ii) scientific literature databases such as
PubMed and PubMed Central, (iii) web-based platforms for high-end
bioinformatics computing, and (iv) bioinformatics courses and tutorials.

• Computer programming languages are essential components of
bioinformatics, where methods and algorithms are implemented as
computer programs. Some of the commonly used programming languages in
bioinformatics are Python, Perl, Java, and R. Of these, Python is a very
popular high-level programming language, while Perl is losing popularity
among bioinformaticians. As most bioinformatics algorithms are very
complex and computation is carried out on large-scale biological data,
they often require high-performance computing (HPC), such as a
supercomputer or a computer cluster.

• Bioinformatics has a multitude of real applications, some of which are:
Comparative and evolutionary studies, Molecular medicine (drug
discovery, personalized medicines, preventive medicines, and gene
therapy), Microbial genome applications (Waste cleanup, Climate
change study, Alternative energy source, Antibiotic resistance,
Bio-weapon creation, Forensic analysis of microbes, and Biotechnology),
Agriculture (Crop improvement, Insect resistance and control, and
Improved nutritional quality), Veterinary Science, and Computer-Aided
Drug Design.

1.12 TERMINAL QUESTIONS


1. Define the following terms: Input, Output, Processing and Storage.

2. Write the important features of MS Office.

3. Explain the scope of bioinformatics.

4. List the applications of bioinformatics.

1.13 ANSWERS
Self Assessment Questions
1. i) Bioinformatics is the field of science in which biology, computer
science, and information technology merge to form a single
discipline. Hence, it is an interdisciplinary field of study where
biologists, computer scientists, statisticians, and data scientists
work together. Bioinformatics lets us understand biology in terms
of molecules and apply Information Technology (IT) techniques to
understand and organize the information associated with these
molecules on a large scale. The fundamental questions it addresses
are how to describe, analyze, simulate, and predict the dynamics of
biological processes using IT tools.

ii) The term 'Bioinformatics' was coined by Paulien Hogeweg and
Ben Hesper in 1970, making Bioinformatics a field parallel to
biophysics and biochemistry. However, this term became visible in
the late 1980s.

iii) Bioinformatics is the field of science in which biology, computer
science, and information technology merge to form a single
discipline.

2. i) A computer has four basic operations: Input, Processing, Storage,
and Output.

ii) Databases: National Center for Biotechnology Information (NCBI),
Protein Data Bank (PDB), PubChem, etc.

Tools: Basic Local Alignment Search Tool (BLAST), MUSCLE, T-Coffee, etc.

iii) Microsoft Word

• Creating text documents and editing them later.
• Defining page layout (landscape or portrait), page size, page
margins, etc.
• Text formatting such as defining font size, font type, font
styles, color, etc.
• Formatting text in columns.
• Adding tables.
• Inserting graphics and images from a file or clipart gallery.
• Headers and footers, page numbering.
• Spelling and grammar check.
• Word count and other statistics.
• Macros to automate some functions.
• Online help for any option.

Microsoft Excel

• A general-purpose electronic spreadsheet for financial
computation and data analytics.
• Helps prepare anything from a simple family budget to a complex
accounting ledger for a business.
• AutoSum (summing up the selected values), AutoFill
(automatically fills in data as per our choice), AutoShapes
(drawing geometrical shapes), etc.
• A rich set of predefined statistical, financial, and accounting
functions.
• The Drag and Drop feature helps us reposition data and text by
simply dragging the data using a mouse.
• The Charts feature helps us present a graphical representation of
data in the form of Bar, Line, Pie charts, Boxplot, etc.
• PivotTable allows performing data analysis and generating various
reports such as statistical reports, periodic financial statements,
etc.

iv) a) MS PowerPoint

b) Internet

c) Primary storage and Secondary storage

3. i) a) Margaret Oakley Dayhoff

b) Supercomputer

c) Genomics

d) BBMap, Bowtie2, BWA and Velvet.

ii) Bioinformatics has extended into various domains, such as genomics
(the study of the complete set of genes and genetic elements),
proteomics (the study of all proteins and proteomes), computer-aided
drug discovery and design (CADD), and systems biology. Moreover, it
enables different kinds of data analysis and provides tools for
modeling, visualizing, and understanding data. The aim of
bioinformatics is to organize raw data and convert it into meaningful
and significant pieces of knowledge.

iii) The term "supercomputer" refers to a computer with immense
processing power. A supercomputer is mainly deployed for complex,
time-consuming, and computationally demanding problems in scientific
and engineering domains. It is a 5th generation type of computer and
helps researchers, government agencies, academia, and industry to
handle very large data sets and to perform rapid computations
efficiently. In bioinformatics, supercomputers are used for demanding
tasks such as molecular simulation of large systems, analysis of
sequencing data, molecular modeling, etc.

4. i) Bioinformatics has a multitude of real applications, some of which
are: Comparative and evolutionary studies, Molecular medicine
(drug discovery, personalized medicines, preventive medicines,
and gene therapy), Microbial genome applications (Waste cleanup,
Climate change study, Alternative energy source, Antibiotic
resistance, Bio-weapon creation, Forensic analysis of microbes,
and Biotechnology), Agriculture (Crop improvement, Insect
resistance and control, and Improved nutritional quality), Veterinary
Science, and Computer-Aided Drug Design.

ii) Computer-Aided Drug Design

Drug design is the process of finding new medications based on
knowledge of a biological target. Basically, it is the process of
designing molecules that are complementary in shape and charge to the
target biomolecules with which they interact and bind. Drug design
relies heavily on computer modeling techniques, collectively called
computer-aided drug design (CADD). CADD is applied to speed up and ease
hit identification and hit-to-lead selection, to optimize absorption,
distribution, metabolism, excretion, and toxicity (ADMET) profiles, and
to avoid safety issues. Structure-based drug design, the most widely
used approach, relies on knowledge of the 3D structure of the target
biomolecule. Other commonly used computational approaches are
ligand-based drug design, quantitative structure-activity relationships
(QSAR), and quantitative structure-property relationships (QSPR).

iii) Yes. The human genome sequence has had profound effects on
biomedical research and clinical medicine. With the availability of a
complete human genome, it is possible to search for genes that are
directly associated with various diseases and to discover the
molecular basis of these diseases. Discovering the molecular basis of a
disease enables better diagnosis and treatment, and the development of
molecular medicines. Advances in molecular medicine help in better drug
discovery, personalized medicines, preventive medicines, and gene
therapy.

5. i) Bioinformatics is a multidisciplinary field of science that
encapsulates the concepts of computer science, biology, physics,
chemistry, statistics, and mathematics, helping in the acquisition,
storage, analysis, and dissemination of biological data.

ii) Pharmacokinetics: The study of how the human body acts on an
administered drug, measured in terms of drug absorption, distribution,
metabolism, excretion, and possible toxicity in the human body.

Pharmacodynamics: The study of how a drug acts on the human body (its
biological and pharmacological effects) after it is administered.

iii) a) PSSM: Position-specific scoring matrix

b) BLAST: Basic local alignment search tool

c) UPGMA: Unweighted pair group method with arithmetic mean

d) ORF: Open reading frame

iv) Blocks Substitution Matrix (BLOSUM) is a substitution matrix that
provides a score for each possible amino acid substitution, derived
from the substitution frequencies observed in conserved, ungapped local
alignments (blocks) of related proteins. The most commonly used matrix
is BLOSUM62, which is built from blocks in which sequences sharing more
than 62% identity are clustered together.

Terminal Questions
1. Refer to Section 1.2.

2. Refer to Section 1.4.

3. Refer to Section 1.7.

4. Refer to Section 1.8.

1.14 FURTHER READINGS


1. Jin Xiong. Essential Bioinformatics. Cambridge University Press.

2. Jean-Michel Claverie, Cedric Notredame. Bioinformatics For Dummies,
2nd Ed.

3. Rastogi et al. Bioinformatics: Methods and Applications: Genomics,
Proteomics and Drug Discovery, 4th Ed. PHI.

4. Zhumur Ghosh and Bibekanand Mallick. Bioinformatics: Principles and
Applications. Oxford University Press, India.

Exercise 1
MICROSOFT OFFICE- WORD,
EXCEL, POWERPOINT

Structure
1.1 Microsoft Office
    Expected Learning Outcomes
1.2 Components of Microsoft Office
    Microsoft Word
    Microsoft Excel
    Microsoft PowerPoint

1.1 MICROSOFT OFFICE


This is the first exercise in the practical component of the Bioinformatics
course. Its purpose is to make you familiar with computers: you need to learn
and understand the basic operations of a computer in order to analyze,
annotate, and store data. Microsoft Office (MS Office) is a collection of
computer programs designed for offices, students, teachers, researchers,
businesses, and other users. It was created by the Microsoft Corporation and
first released in 1990. MS Office makes it easy to perform fundamental office
tasks and increases productivity. Each program is specialized for a certain
activity, such as word processing, data management, or making presentations.
Word, Excel, and PowerPoint are the most commonly used Microsoft Office
programs.

After practicing with MS Office, learners will be able to simplify basic
office tasks and improve their work productivity.

Expected Learning Outcomes


After studying this unit, you should be able to:

• understand and know how to use the most common Microsoft Office
programs;

• explain that the primary objective of MS Word is to enable the user to
create and edit documents;

• create, edit, save, and print visual presentations; and

• manage and store data in a spreadsheet.

1.2 COMPONENTS OF MICROSOFT OFFICE
The major components are Microsoft Word, Excel, and PowerPoint. Let us go
through them one by one.

1.2.1 Microsoft Word

Microsoft Word (MS Word) is a word processing program developed by
Microsoft. It is used to type, edit, format, retrieve, save, and print
documents. MS Word can be used to create and edit letters, articles,
newsletters, and flyers; to edit and format existing documents; to make a
text document interactive using different features and tools; to prepare
graphical documents containing images; and to detect grammatical errors in a
text document. It is widely used by authors and researchers.

Step 1: Double-click on the MS Word icon on the computer desktop to open
Microsoft Word.

If the MS Word icon is not on the desktop, go to the Start menu: Click ►Start
►Programs ►Microsoft Word

Step 2: Click on Blank document. A new blank document will open up, ready for
you to start typing (Fig. 1.1).

Fig. 1.1: Screenshot showing MS Word blank document.


Step 3: When you open a blank document, the flashing cursor will be at the
start of your document, ready for you to start typing. As you type, the cursor
moves along with each letter (Fig. 1.2).

Fig. 1.2: Screenshot showing various options and tools available in MS Word.

Step 4: The mouse can be used to move around a document. Select the text
that you wish to edit or reformat. To make the selected text bold, click B;
to italicize it, click I; to underline it, click U (Fig. 1.3).

Fig. 1.3: Word document with sample text.

Step 5: To copy and paste text, select the text so that it is highlighted and
copy it by clicking on the Copy icon at the left-hand side of the formatting
ribbon. Then place the cursor where you want the text to go and click Paste to
insert the copied text in its new place (Fig. 1.4).


Fig. 1.4: Screenshot showing how to copy and paste text.

Step 6: To left align, center, right align, or justify text, select the text
that you wish to change using the mouse, then click on the corresponding icon
('left align', 'center text', 'right align', or 'justify') in the formatting
ribbon at the top of the document (Fig. 1.5).

Fig. 1.5: Screenshot showing the text alignment options.

Step 7: To save a document, click File in the top left-hand corner of the
screen and choose Save from the menu. Once you have typed in the name of
your document, click Save (Fig. 1.6).

Fig. 1.6: Screenshot showing how to save MS Office file.

Up to now, you have seen how to create, edit, and save a Word file. Practice
these steps to learn more.

1.2.2 Microsoft Excel

Microsoft Excel is an electronic spreadsheet program developed by Microsoft.
A spreadsheet consists of a number of columns and rows, where each
intersection of a column and a row is a “cell.” Excel enables users to store,
organize, calculate, and manipulate data with formulas.

Step 1: Double-click on the MS Excel icon on the desktop. If the MS Excel icon
is not on the desktop, go to the Start menu: Click ►Start ►Programs ►
MS Excel

Step 2: Click File, and then click New. Under New, click Blank workbook. A
new Excel file will open up, ready for you to start using (Fig. 1.7).

Fig. 1.7: Screenshot showing a blank Excel workbook.

Step 3: Click an empty cell, for example cell A1 on a new sheet (cell A1 is
in the first row of column A). Type text or a number in the cell. Press Enter
or Tab to move to the next cell (Fig. 1.8).

Fig. 1.8: Screenshot showing columns and rows on an Excel sheet.

Step 4: A fast way to add numbers is to use AutoSum. Select the cell to the
right of or below the numbers you want to add. Click the Home tab, and then
click AutoSum in the Editing group. AutoSum adds up the numbers and shows the
result in the cell you selected (Fig. 1.9).

Fig. 1.9: Screenshot of the AutoSum option.

Step 5: Excel can do other math as well, using formulas to add, subtract,
multiply, or divide your numbers. Pick a cell, and then type an equal sign
(=). Type a combination of numbers and calculation operators, like the plus
sign (+) for addition, the minus sign (-) for subtraction, the asterisk (*) for
multiplication, or the forward slash (/) for division (Fig. 1.10).

Fig. 1.10: Screenshot showing a calculation with numbers in Excel.
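
For learners who go on to the programming languages used in bioinformatics, the same arithmetic can be sketched in Python. This is only an illustrative comparison: the list below stands in for three spreadsheet cells, and its values are made up for the example.

```python
# Hypothetical contents of cells A1, A2, and A3
values = [12, 8, 5]

total = sum(values)                  # comparable to AutoSum over A1:A3
difference = values[0] - values[1]   # like typing =A1-A2
product = values[0] * values[1]      # like typing =A1*A2
quotient = values[0] / values[1]     # like typing =A1/A2

print(total, difference, product, quotient)  # prints 25 4 96 1.5
```

The operators +, -, *, and / have the same meaning in Python as in an Excel formula; the main difference is that Excel recalculates a cell automatically whenever the cells it refers to change.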



Step 6: Click the Save button on the Quick Access Toolbar, or press Ctrl+S.

If you are saving the file for the first time: under Save As, pick where to
save your workbook, and then browse to a folder. In the File name box, enter
a name for your workbook. Click Save.

Step 7: Click File, and then click Print, or press Ctrl+P. Preview the pages by
clicking the Next Page and Previous Page arrows (Fig. 1.11).

Fig. 1.11: This screen will appear once you save the Excel file.

1.2.3 Microsoft PowerPoint

Microsoft PowerPoint, popularly known as PPT, is a powerful slide show
presentation program included in the Microsoft Office suite. It allows the
user to create slides with animations, recordings, narrations, transitions,
and other features in order to present information in an attractive,
multimedia-rich way. Some features of PowerPoint are: customizing color
schemes; adding animation effects; notes and handout masters; creating,
editing, and importing charts; creating and editing tables; managing
hyperlinks; and creating custom shows.

Step 1: To start MS PowerPoint, double-click on the PowerPoint icon on your
desktop.

If the MS PowerPoint icon is not on the desktop, go to the Start menu: Click
►Start ►Programs ►Microsoft PowerPoint.

Step 2: Click on New and then Blank presentation. A new PowerPoint file will
open up, ready for you to start using (Fig. 1.12).

Fig. 1.12: Screenshot showing a new PowerPoint file.

Step 3: To add text to a slide, click inside the provided text box (Click to
add title, Click to add subtitle, Click to add text, etc.).
Once the cursor is blinking, you can begin to type your text (Fig. 1.13).

Fig. 1.13: Screenshot showing text slide of PPT.



Step 4: To insert a new slide, click INSERT ► NEW SLIDE, and then choose a
layout such as Title Slide or Title and Content. You can also use the keyboard
shortcut CTRL+M.

Step 5: To delete a slide, click the slide you wish to delete in the Slide
menu, and then press the DELETE key on your keyboard.

Step 6: To insert images saved on your computer, click INSERT ► scroll
to PICTURES ► select FROM FILE ► navigate to the folder or storage
area/medium where your picture is located ► click the picture file name ►
click INSERT (Fig. 1.14).

Fig. 1.14: Screenshot showing how to insert image in a PPT.

Step 7: To save your work, click File ► Save from the menu bar.

LAB EXERCISES

Up to now, you have studied various components of MS Office. Practice the
given sample exercises to gain more experience and understanding.
1. Prepare a document containing a paragraph in Normal style, set in “Times New
Roman” at 12-point size, with 1.5 line spacing and a 0.6 cm left indent.
2. Prepare a document by inserting pictures, shapes, and SmartArt.
3. Prepare a PowerPoint presentation about the sales details of a company.
4. Create a new workbook, enter the values in the exact cell locations, and use
AutoFill to put the employee numbers into cells.


Exercise 2
INTRODUCTION TO INTERNET:
LAN, WAN, WEB
BROWSERS, SEARCH
ENGINES

Structure
2.1 Introduction
    Expected Learning Outcomes
2.2 Frequently used Terminology
2.3 Web browsers
2.4 Search Engines
2.5 Summary

2.1 INTRODUCTION
The word Internet is short for “interconnected network”: it is a global
network of computer networks. The World Wide Web (WWW) is the collection of
web pages and web servers reached over the Internet; the two terms are often
used interchangeably, although the Web is only one of the services that the
Internet carries. Using the Internet, you can send electronic mail, chat with
colleagues around the world, and obtain information on a wide variety of
subjects. The Internet is a global, public network that supports communication
between computers worldwide using common protocols. In this exercise, you
will also learn about LAN, WAN, web browsers, and search engines. This basic
information about the Internet and browsers will help you to perform better in
the upcoming exercises of this course.

Expected Learning Outcomes


After performing this exercise you shall be able to:

• define terms like LAN, WAN and WWW;

• describe a search engine; and

• retrieve data by using a search engine.

2.2 FREQUENTLY USED TERMINOLOGY



LAN stands for local area network. It is a private network in which a group
of devices, i.e. two or more computers, are connected. It is contained
within a small geographic area, usually the same building, and ranges from a
home network with one user to an enterprise network with thousands of users
and devices. Examples include home WiFi networks, small business networks,
and office or school networks. Some characteristic features of a LAN are as
follows:

• Network size is limited to a small geographical area of up to a few
kilometres.

• Data transfer rates are generally high, typically ranging from 100 Mbps to
1000 Mbps.

• The number of computers connected to a LAN is usually restricted.

WAN stands for wide area network. It is a large network covering a large
geographic area. WANs facilitate communication, the sharing of information,
and much more between devices around the world through a WAN provider. The
Internet, which covers the entire globe, is an example of a WAN. A WAN uses
various links, such as private lines, Multiprotocol Label Switching (MPLS),
virtual private networks (VPNs), wireless (cellular) connections, and the
Internet, to connect smaller metropolitan and campus networks in diverse
locations into a single, distributed network. LANs are typically faster and
more secure than WANs, but WANs enable more widespread connectivity.
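
To get a feel for what the LAN data rates quoted above mean in practice, the
short Python sketch below estimates how long a file would take to travel over
a link. It is only an idealised calculation: the function name and the 500 MB
file size are illustrative assumptions, and real networks are slower because
of protocol overhead and shared traffic.

```python
def transfer_time_seconds(file_size_megabytes: float, link_speed_mbps: float) -> float:
    """Ideal time to move a file over a link, ignoring all overhead."""
    # Link speeds are quoted in megabits per second, file sizes in megabytes,
    # so convert bytes to bits first (1 byte = 8 bits).
    file_size_megabits = file_size_megabytes * 8
    return file_size_megabits / link_speed_mbps


# Compare the two ends of the LAN range mentioned in the text.
for speed in (100, 1000):
    seconds = transfer_time_seconds(500, speed)
    print(f"500 MB at {speed} Mbps: about {seconds:.0f} s")
```

A tenfold increase in link speed cuts the ideal transfer time tenfold, which
is why large data files are best downloaded over a fast, wired LAN connection
where one is available.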

2.3 WEB BROWSERS


A web browser is application software that allows users to find, access,
display, and view websites. Common web browsers include Internet
Explorer, Google Chrome, and Mozilla Firefox. Internet Explorer was once
the most widely used web browser, attaining a peak usage share of about 95%
by 2003. Its usage declined with the launch of Firefox (2004) and Google
Chrome (2008), and with the growing popularity of mobile operating systems
such as Android and iOS, which do not support Internet Explorer.

2.4 SEARCH ENGINES


Search engines are now part of our daily lives. A search engine is a website
designed to carry out web searches: it searches the World Wide Web for the
information described in a textual search query. The results are generally
presented as a list, and may include web pages, images, videos, infographics,
articles, research papers, and other types of files. Search engines keep this
information up to date by continuously running algorithms that crawl and
index the Web. Popular web search engines include Google, Bing, and Yahoo!;
sites such as YouTube, Facebook, and Amazon (a product-based search engine)
offer similar search facilities over their own content. Google Search is the
most widely used search engine in the world and one of the most popular
products from Google, which holds almost 70 percent of the search engine
market.


Procedure:

Step 1: There are many different search engines you can use; some of the
most popular are Google, Yahoo!, and Bing. To perform a search, open a
search engine in your web browser, type one or more keywords, and then press
Enter on your keyboard (Fig. 2.1).

Fig. 2.1: Screen shot showing Google search engine.

Step 2: After you run a search, you'll see a list of websites that match
your search terms; these are called search results. If you find a site that is
relevant to your interest, you can open that link (Fig. 2.2).

Fig. 2.2: Screen shot showing search results.

Step 3: You may be looking for something more specific, like a news
article, picture, or video. Most search engines have links at the top of the
page that allow you to perform these unique searches (Fig. 2.3).


Fig. 2.3: Screenshot showing images, videos and other options.
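
When you press Enter in Step 1, the browser turns your keywords into a web
address and sends it to the search engine. The Python sketch below shows the
idea using only the standard library. The function name is hypothetical, and
the base address and the `q` parameter are assumptions based on the addresses
commonly seen in the browser's address bar after a Google search; the code
only builds the address and does not contact any server.

```python
from urllib.parse import urlencode


def build_search_url(keywords: str, base: str = "https://www.google.com/search") -> str:
    """Return the web address for a keyword search (illustrative only)."""
    # urlencode escapes characters that cannot appear in a web address;
    # spaces between keywords become '+' signs.
    return f"{base}?{urlencode({'q': keywords})}"


print(build_search_url("protein data bank glucagon"))
```

Pasting the printed address into a browser's address bar should have the same
effect as typing the keywords into the search box.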

2.5 SUMMARY
• The Internet is the interconnected network of computers and web servers
worldwide; the World Wide Web is the best-known service that runs on it.

• Using the Internet, you can send electronic mail, chat with colleagues
around the world, and obtain information on a wide variety of subjects
using the World Wide Web.

• In this exercise you have studied terms like LAN and WAN, and gained an
overview of web browsers and search engines.

• You have also learned how to access information using the Google search
engine.

Reference: https://edu.gcfglobal.org/en/internetbasics/using-search-engines/1/


Exercise 3
BASICS OF ELECTRONIC MAIL,
CREATING AN EMAIL
ACCOUNT, SENDING
AND RECEIVING EMAIL

Structure
3.1 Introduction
    Expected Learning Outcomes
3.2 Procedure
3.3 Summary

3.1 INTRODUCTION
In the previous exercises, you learned about Microsoft Office and the
Internet. In this exercise, you’ll learn about electronic mail and its
application.

ELECTRONIC MAIL

When programmers first started connecting computers into networks, one of the
first services to emerge was electronic mail (e-mail). Using e-mail, people
started to communicate in groups. E-mail is one of the most popular Internet
services. It enables an Internet user to send messages to another Internet
user anywhere in the world. E-mail messages contain not only text: images,
audio, and video data can also be included as attachments. E-mail is a useful
communication tool that avoids many of the problems that arise while
communicating with people over the phone. E-mail can also be used to send a
message to multiple people on a mailing list at the same time. Popular free
e-mail providers include Gmail, Outlook, ProtonMail, AOL, Zoho Mail, iCloud
Mail, Yahoo! Mail, and GMX. One of the most well-known and widely used e-mail
services is Gmail. E-mails are delivered extremely fast when compared to
traditional post.

Expected Learning Outcomes


After performing this exercise you shall be able to:

• create and send an e-mail to communicate quickly with other people
worldwide;

• create an e-mail account;

• send and receive e-mail; and

• explain the importance of e-mail.

3.2 PROCEDURE
CREATING AN EMAIL ACCOUNT

To sign up for Gmail, create a Google Account. You can then use its username
and password to sign in to Gmail.

Step 1: Browse to the Google Account creation web page.

Step 2: Fill in the details to create an account, follow the instructions step
by step, and use the account you created to sign in to Gmail (Fig. 3.1).

Fig. 3.1: Screenshot showing the Google account creation page.

SENDING AND RECEIVING EMAIL



Step 1: Open your Gmail account by signing in to your email service so that
you are on the main page of your mail account (Fig. 3.2).

Fig. 3.2: Screenshot showing Gmail login.

Step 2: Click the Compose button to create a new e-mail (Fig. 3.3).

Fig. 3.3: Screenshot showing various options available in Gmail.

Step 3: A new blank email window will open. In the ‘To’ box, type or add the
email address of the recipient.

Step 4: If you want to include someone else in your email to ‘keep them in the
loop’, you can click Cc (carbon copy) or Bcc (blind carbon copy), which will
open another field. An email address added to the ‘Cc’ field is visible to all
the other recipients. An email address added to the ‘Bcc’ field cannot be seen
by any other recipient. If you are sending the same email to multiple people,
it’s a good idea to put all the email addresses in the ‘Bcc’ field to keep
your ‘mailing list’ confidential.

Step 5: In the Subject box, type a relevant subject that gives the recipient
a glimpse of the topic of your email.

Step 6: Type the text of your email in the message box. You can change the
font style, colour, and size using the formatting icons (Fig. 3.4).

Fig. 3.4: Screenshot showing the “To”, “Cc” and “Bcc” options of an email.

Step 7: Once you have typed your text in the message box, click the blue Send
button at the bottom of the compose window.

Step 8: The email you’ve sent will now be saved in the ‘Sent Mail’ folder on
your Gmail dashboard.

Step 9: To attach images, files, etc. from your computer, use the “Attach
files” option in the compose window before you click Send.
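
The difference between the ‘To’, ‘Cc’ and ‘Bcc’ fields described in Step 4
can also be seen when an email is put together by a program. The sketch below
uses Python's standard `email` library to build a message without sending it.
The function name and all addresses are made-up examples, and leaving the Bcc
addresses out of the message headers is one common way of keeping them hidden,
not necessarily how Gmail itself works internally.

```python
from email.message import EmailMessage


def build_email(sender, to, cc, bcc, subject, body):
    """Build a message and the full delivery list (nothing is sent)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(to)
    if cc:
        msg["Cc"] = ", ".join(cc)  # visible to every recipient
    msg["Subject"] = subject
    msg.set_content(body)
    # Bcc addresses are deliberately NOT written into the headers, so no
    # recipient can read them; they appear only in the delivery list.
    recipients = list(to) + list(cc) + list(bcc)
    return msg, recipients


msg, recipients = build_email(
    sender="student@example.com",
    to=["teacher@example.com"],
    cc=["classmate@example.com"],
    bcc=["archive@example.com"],
    subject="Lab exercise 3",
    body="Here is my report for Exercise 3.",
)
print(msg)          # headers and body as the recipients would see them
print(recipients)   # everyone the message would be delivered to
```

The printed message shows the ‘To’ and ‘Cc’ addresses but no trace of the
‘Bcc’ address, even though that address is still on the delivery list.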

3.3 SUMMARY
• E-mail enables an Internet user to send messages to another Internet user
anywhere in the world.

• In the current exercise you have learned how to create an e-mail account,
and how to send and receive e-mail.

• You have also learned how to attach files using the “Attach files” option.


UNIT 2
BIOLOGICAL DATABASES AND
DATA RETRIEVAL

Structure
2.1 Introduction
    Expected Learning Outcomes
2.2 Classification of Biological Databases
    Classification Based on Source
    Nucleotide Databases
    Protein Database
    Structural Databases
    CATH
2.3 Classification of Biological Databases Based on Nature of Data
2.4 Small Molecular Databases
    PubChem
    Drugbank
    ZINC Database
    Cambridge Structural Database (CSD)
2.5 Structure Viewing tools and File Formats
2.6 Summary
2.7 Terminal Questions
2.8 Answers

2.1 INTRODUCTION
In the previous unit, you learned about the basics of computers and their
application in the field of biology, that is, bioinformatics. In this unit,
you will study biological databases. In these databases, information related
to DNA, RNA, proteins, and other biomolecules is stored in a systematic way on
servers called data servers. Scientists, academicians, and researchers across
the globe can retrieve these biological data whenever they need them for
analysis.

In the BBCCT-101 course (Molecules of Life), you studied various
biomolecules and their significance. You are also aware that these molecules
interact with each other through various metabolic reactions.

Some important characteristic features of biological databases include:

1. They store data in electronic form in various formats.
2. Each entry in the database is assigned a unique number or ID that is not
repeated for any other entry; in other words, the database is non-redundant.
3. Data sharing: the data can be downloaded from the websites of the
biological databases or by FTP (File Transfer Protocol).
4. They are well structured and searchable, and the information is updated
periodically in line with publications and innovations in the scientific
world.
5. The unique IDs are also quoted in research publications and books, so an
entry can be traced from the literature and vice versa. This is called a
cross-reference.
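
The ideas of a unique ID, non-redundancy, and cross-referencing can be
illustrated with a very small program. The Python sketch below is a toy model
only: the class name, the accession number TOY0001, the sequence, and the
literature reference are all invented for illustration and do not correspond
to any real database entry.

```python
class ToyDatabase:
    """A toy sequence database keyed by unique accession numbers."""

    def __init__(self):
        self._entries = {}

    def add(self, accession, sequence, cross_refs=()):
        # Non-redundancy: the same accession number cannot be used twice.
        if accession in self._entries:
            raise ValueError(f"duplicate accession: {accession}")
        self._entries[accession] = {
            "sequence": sequence,
            "cross_refs": list(cross_refs),  # e.g. IDs quoted in publications
        }

    def get(self, accession):
        # Retrieval by unique ID, as when searching a real database.
        return self._entries[accession]


db = ToyDatabase()
db.add("TOY0001", "ATGGCCATTGTA", cross_refs=["hypothetical-paper-1"])
print(db.get("TOY0001"))
```

Trying to add a second entry with the accession TOY0001 raises an error,
which mirrors the rule that an ID is never repeated.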

Expected Learning Outcomes


After studying this unit, you should be able to:

• define biological databases;

• classify biological databases;

• list the applications of biological databases in research and data
retrieval through web links;

• describe chemical, biological and structural databases; and

• explain the availability of online visualization tools and offline software.

2.2 CLASSIFICATION OF BIOLOGICAL DATABASES
In Unit 11 of the BBCCT-105 (Proteins) course you came across some basic
concepts of biological databases, hence you are advised to recall those
concepts before proceeding further in this unit. The classification of
biological databases is very simple and is based on the source and nature of
data collection.

2.2.1 Classification Based on Source


1. Primary databases: These databases are constructed from data collected
in laboratory experiments. After the experiments, the data are validated
and analyzed before being uploaded to the database, which is a crucial
step in data collection. They are classified by the type of biological
molecule: nucleic acid databases (GenBank, EMBL, DDBJ, NDB), protein
databases (PIR, Swiss-Prot, TrEMBL, PDB), metabolic pathway databases
(KEGG, EcoCyc, and MetaCyc), and small molecule databases (PubChem,
DrugBank, ZINC, CSD).

2. Secondary Databases: These databases are constructed from primary
biological databases, with additional information.
Secondary databases comprise data derived from the results of
analyzing primary data available in the primary databases. They are
often referred to as curated databases, but this is a bit of a misnomer
because primary databases are also curated to ensure that the data in
them is consistent and accurate.

Secondary databases often draw upon information from numerous
sources, including other databases (primary and secondary), controlled
vocabulary, and scientific literature. They are highly curated and often
use a complex combination of computational algorithms and manual
analysis and interpretation to derive new knowledge from published
data.

Secondary databases have become the molecular biologist’s reference
library over the past decade or so, providing a wealth of (often daunting)
information on just about any gene or gene product that has been
investigated by the research community. The potential for mining this
information to make new discoveries is vast.

Table 2.1: Differences between primary and secondary databases
(Source: https://www.ebi.ac.uk/)

                Primary database                  Secondary database

Synonyms        Archival database                 Curated database;
                                                  knowledgebase

Source of data  Direct submission of              Results of analysis, literature
                experimentally-derived data       research and interpretation,
                from researchers                  often of data in primary
                                                  databases

Examples        ENA, GenBank and DDBJ             InterPro (protein families,
                (nucleotide sequence);            motifs and domains);
                ArrayExpress and GEO              UniProt Knowledgebase
                (functional genomics data);       (sequence and functional
                Protein Data Bank (PDB;           information on proteins);
                coordinates of                    Ensembl (variation, function,
                three-dimensional                 regulation and more layered
                macromolecular structures)        onto whole genome sequences)

Up to now you have studied the types of biological databases based on their
source. Now let us look at nucleotide databases.

2.2.2 Nucleotide Databases


It is well known that DNA and RNA are the major nucleic acids. You have
studied the structure of these nucleic acids in Units 13 and 14 of BBCCT-101.
Each protein or enzyme is coded by a specific gene or genes, which in turn are
DNA sequences. A database in which such gene coding sequences are stored is
called a nucleotide database. These databases are repositories for storing
and retrieving data in terms of the nucleotides of various genomes (a genome
is the set of chromosomes of an organism).

Let us see some examples of these nucleotide databases:

GenBank – It is an integral part of the main biological database resource,
NCBI (National Center for Biotechnology Information). NCBI has a tool called
Entrez, which helps to retrieve data from GenBank.

EMBL – The European Molecular Biology Laboratory database is available at the
European Bioinformatics Institute (EBI). SRS (Sequence Retrieval System) is a
tool for retrieving desired protein/DNA/gene sequences from this database.

DDBJ – The DNA Data Bank of Japan is another nucleotide database.

These three nucleotide databases are interconnected with each other by
data sharing, and allow access to the data through web links and data
servers (Fig. 2.1).

Swiss-PROT – This database is owned by EMBL and maintained by SIB.

TrEMBL – It contains the largest number of translated sequences.

Fig. 2.1: Graphical representation of Nucleotide databases.
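
Besides the Entrez web pages, NCBI offers a programmatic interface to Entrez
known as E-utilities, in which a record is requested through a web address.
The Python sketch below only builds such an address for a nucleotide record in
FASTA format; it does not download anything. The function name is
hypothetical, and NM_000546 is used simply as an example accession number, so
check the current E-utilities documentation before relying on the exact
parameters.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "efetch" service (assumed from NCBI's
# published documentation; verify before use).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, database: str = "nuccore") -> str:
    """Return the address that requests one sequence record as FASTA text."""
    params = {
        "db": database,      # which Entrez database to query
        "id": accession,     # unique ID of the record
        "rettype": "fasta",  # return the sequence in FASTA format
        "retmode": "text",   # as plain text
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"


print(build_efetch_url("NM_000546"))
```

Opening the printed address in a web browser is expected to display the
sequence record, which is the same data you would reach by searching the
accession number on the NCBI website.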

2.2.3 Protein Database


PIR – The Protein Information Resource is located at the NBRF (National
Biomedical Research Foundation). It holds comprehensive protein information,
such as the source of the protein and the crystal structures available in the
Protein Data Bank (with ID). PIR is divided into four sections:

PIR1 – This database is fully classified and annotated.

PIR2 – It is a basic database with preliminary protein information.

PIR3 – This database has unverified entries.

PIR4 – A database of genetically engineered sequences. This helps us to
understand the possibilities of engineering proteins for research activities.

SAQ 1
Fill in the blanks:

i) The ____________ tool is used to retrieve data from GenBank.

ii) DDBJ is a ____________ database.

iii) The EMBL database is maintained by ____________.

iv) The full form of NCBI is ____________.

v) The ____________ database is related to the classification of protein
families.

2.2.4 Structural Databases


The Protein Data Bank (PDB) is served by several member databases.

PDB is part of the Worldwide Protein Data Bank (wwPDB), which collects,
organizes, and disseminates data on biological macromolecular structures such
as proteins, enzymes, and DNA/RNA.

PDBj (Protein Data Bank Japan) maintains a centralized PDB archive of
macromolecular structures and provides integrated tools, in collaboration with
the Research Collaboratory for Structural Bioinformatics (RCSB) and the
Biological Magnetic Resonance Data Bank (BMRB) in the USA, and the PDBe in
Europe.

RCSB: Research Collaboratory for Structural Bioinformatics (RCSB PDB).

Secondary structural Databases

Some examples of secondary structural databases are:

1. SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/, Fig. 2.2)

2. CATH

3. PDBsum

1. SCOP (Structural Classification of Proteins) was started by the Laboratory
of Molecular Biology, MRC, Cambridge, UK. The aim of this database is to
classify protein 3D structures in a hierarchical scheme. The levels of
the SCOP hierarchy are species, proteins, families, superfamilies,
folds, and classes. Each class is further divided according to
structural organization. At the top of the hierarchy, folds are grouped
into five classes (Fig. 2.3):

a. All alpha

b. All beta

c. Alpha/beta (alpha and beta elements interspersed)

d. Alpha+beta (alpha and beta elements segregated)

e. Multi-domain folds

Fig. 2.2: Screenshot showing SCOP home page.


Fig. 2.3: SCOP hierarchy chart showing structure-based and evolution-based
classification.

2.2.5 CATH
The CATH database (http://www.cathdb.info/) is a free, publicly available
online resource that provides information on the evolutionary relationships of
protein domains. It was created in the mid-1990s by Professor Christine
Orengo and colleagues, and continues to be developed by the Orengo group
at University College London.

Experimentally determined protein structures are taken from the Protein Data
Bank and split into their consecutive polypeptide chains, where applicable.
Protein domains are identified within these chains using a mixture of
automatic methods and manual curation. The domains are then classified within
the CATH structural hierarchy, whose four levels give the database its name:

C. Class: domains are assigned according to their secondary structure
content, i.e. all alpha, all beta, a mixture of alpha and beta, or little
secondary structure.

A. Architecture: information on the arrangement of the secondary structure
in three-dimensional space is used.

T. Topology/fold: information on how the secondary structure elements are
connected and arranged is used.

H. Homologous superfamily: domains are assigned to this level if there is
sufficient evidence that they are related by evolution, i.e. they are
homologous.

To browse the classification hierarchy, visit the CATH hierarchy web page
(Fig. 2.4).
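
Each CATH superfamily is identified by a four-part numeric code that follows
the C, A, T, H order, for example 3.40.50.720. The Python sketch below splits
such a code into its levels. The function and type names are hypothetical, the
class labels simply paraphrase the four classes described above, and the
numbering of the classes from 1 to 4 is stated here as an assumption to be
checked against the CATH website.

```python
from typing import NamedTuple

# Class labels paraphrased from the text; numbering assumed to be 1 to 4.
CLASS_LABELS = {
    1: "all (mainly) alpha",
    2: "all (mainly) beta",
    3: "mixture of alpha and beta",
    4: "little secondary structure",
}


class CathCode(NamedTuple):
    cath_class: int    # C level
    architecture: int  # A level
    topology: int      # T level
    superfamily: int   # H level


def parse_cath_code(code: str) -> CathCode:
    """Split a code such as '3.40.50.720' into its four hierarchy levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))


example = parse_cath_code("3.40.50.720")
print(example)
print("Class:", CLASS_LABELS[example.cath_class])
```

Reading the code from left to right moves from the broadest level (class) to
the most specific one (homologous superfamily), so two domains that share
their first three numbers have the same fold.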

Additional sequence data for domains with no experimentally determined
structures are provided by sister resources like Gene3D, which are used to
populate the homologous superfamilies. Protein sequences from
UniProtKB and Ensembl are scanned against CATH HMMs to predict
domain sequence boundaries and make homologous superfamily
assignments/groups.

Fig. 2.4: CATH home page.

Learners can explore more about proteins, enzymes, and receptors by using
various sub-search methods such as 3D structure, protein evolution, protein
function, and conserved sites. You can also find the latest information about
the development of and updates to the database under the Learn more tab of
the CATH homepage. You may download the complete database from
the download link. This will help learners to understand protein
classification through the structural organization of proteins.

SAQ 2
Define the following terms:

i) PDB

ii) RCSB

iii) SCOP


2.3 CLASSIFICATION OF BIOLOGICAL DATABASES BASED ON NATURE OF DATA
Up to now, you have learned about biological databases and their
classification based on source. Now you will learn how to classify biological
databases based on the nature of their data. There are currently five types of
databases:

1. Sequence databases: These databases consist of DNA, RNA, and protein
sequences. You can access a gene or protein sequence by searching for its
name or unique ID in the respective database. Examples: EMBL (European
Molecular Biology Laboratory), NCBI (National Center for Biotechnology
Information).

2. Structural databases: These are specialized databases of
protein/DNA structures derived from X-ray or NMR experiments or from
theoretical models. Some structural databases hold the crystal structures of
chemicals. Examples: Protein Data Bank (PDB), Cambridge Crystallographic
Data Centre (CCDC).

3. Literature databases: These databases are very important for the
development and advancement of science and technology as well as other
disciplines. They help researchers, academicians, and scientists
to search for information and to download it in electronic form, for example
as HTML (HyperText Markup Language), PDF (Portable Document Format),
JPEG (Joint Photographic Experts Group), and other formats. Examples:
PubMed, Medline, National Digital Library.

4. Gene expression databases: These databases help us to understand
gene function, such as the up-regulation or down-regulation of cellular
activities. For example, the Gene Expression Database (GXD) is a community
resource for gene expression information from the laboratory mouse. GXD
stores and integrates different types of expression data and makes these data
freely available in formats appropriate for comprehensive analysis. Such
databases help to relate the expression and control of one gene to
other genes through various systems biology software. Nowadays,
scientists are working on gene molecular networks.

5. Metabolic pathway databases: These are curated databases of experimentally
elucidated metabolic pathways from all domains of life, and they are well
maintained and regularly updated. As you know from carbohydrate metabolism
(BBCCT-109), a metabolic pathway involves many enzymatic reactions before the
final product is formed. The main characteristics of metabolic pathway
databases are as follows:

1) They serve as an online encyclopedia of metabolism

2) They help to predict metabolic pathways in sequenced genomes

3) They support metabolic engineering via an enzyme database

4) Their metabolite database aids metabolomics research

Examples: KEGG (Kyoto Encyclopedia of Genes and Genomes), MetaCyc, BioCyc

2.4. SMALL MOLECULAR DATABASES


You have studied various biological databases, such as NCBI, GenBank,
Swiss-Prot, EMBL, EBI, DDBJ, CATH, and SCOP, in the previous sections. All of
these databases deal with proteins and nucleic acids, which are molecules of
high molecular weight, so they are called macromolecular databases. Low
molecular weight molecules are stored in and retrieved from databases known as
small molecular databases. Most of the entries are organic molecules such as
drugs, antibiotics, and peptides, along with elements and other compounds. We
will now discuss small molecular databases in detail.

2.4.1 PubChem
The PubChem database is a primary source for various chemicals, drugs, and
derivatives. It is a freely accessible chemical information resource and the
largest of its kind in the world. We can search for chemicals by molecular
formula, name, structure, and other identifiers. Further, one can find
chemical and physical properties, safety and toxicity information, biological
activities, literature citations, patents, and more. New chemicals and
substances are added regularly as and when new information becomes available
from the literature or from experimental results. PubChem is very useful for
finding vendor-supplied or new chemicals. Many scientists screen molecules
from the PubChem database for the treatment of various diseases such as
cancer, tuberculosis (TB), Alzheimer’s, osteoporosis, atherosclerosis, and
cardiovascular diseases. It is an open chemistry database managed by the
National Institutes of Health (NIH). “Open” means that anyone can submit
scientific data to this database and become a provider. This database is
important and useful for scientists, students, and the general public. Each
month its website and programmatic services provide data about compounds to
several million users worldwide.
(https://pubchemdocs.ncbi.nlm.nih.gov/)
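
The programmatic services mentioned above include a web interface called PUG
REST, in which a compound is looked up through a structured web address. The
Python sketch below only constructs such an address for a compound name; it
does not contact PubChem. The function name is hypothetical, and the address
pattern and property names follow the PUG REST documentation as best recalled
here, so treat them as assumptions and confirm them on the PubChem
documentation site.

```python
from urllib.parse import quote

# Base address of PubChem's PUG REST service (assumed; verify before use).
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(compound_name: str,
                         properties=("MolecularFormula", "MolecularWeight")) -> str:
    """Return an address requesting selected properties of a compound as JSON."""
    # quote() escapes spaces and other characters not allowed in an address.
    name = quote(compound_name, safe="")
    wanted = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{name}/property/{wanted}/JSON"


print(pubchem_property_url("aspirin"))
```

The same search can of course be done by typing the compound name into the
search box on the PubChem home page; the address form is convenient when many
compounds have to be looked up.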

2.4.2 Drug Bank


DrugBank is a crucial database of existing drugs that are approved by the
Food and Drug Administration (FDA) to treat various diseases. It is also one
of the largest drug databases in the world. DrugBank is a curated
pharmaceutical knowledge base, with products commercially available for
precision medicine, telehealth, and drug discovery. It also provides important
drug information in a structured, unified resource. DrugBank Online is a
free-to-access website that provides highly detailed information across
multiple topics including pharmacology, chemical structures, targets,
metabolism, and toxicology. Because the data are integrated, you can search by
text, gene sequence, chemical structure, and more. Anyone can download a
comprehensive dataset, free for academic and non-commercial research. (Visit
https://www.drugbank.com/ to know more about DrugBank.)

2.4.3 ZINC Database


ZINC is a free database of commercially available compounds for virtual
screening. It contains over 230 million purchasable compounds in
ready-to-dock, 3D formats. ZINC also contains over 750 million purchasable
compounds that anyone can search for analogs in a short span of time. ZINC is
maintained by the Irwin and Shoichet Laboratories in the Department of
Pharmaceutical Chemistry at the University of California, San Francisco
(UCSF). This database is used for drug discovery by researchers, trainers,
scientists, biotech companies, research organizations, and university
scholars. (Visit https://zinc.docking.org/ to know more about the ZINC
database.)

2.4.4 Cambridge Structural Database (CSD)

Established in 1965, the CSD is the world’s repository for small-molecule
organic and metal-organic crystal structures. Containing over one million
structures from X-ray and neutron diffraction analyses, this unique database
of accurate 3D structures has become an essential resource to scientists
around the world.

Newly added entries are first checked automatically, and the chemical
information is then verified by in-house scientific editors before the entry
is released online to the public. Each crystal structure is enriched so that
it can be readily visualized and downloaded, and its physical properties
understood. This knowledge has been applied across academia and industry in
pursuit of new drugs, novel materials and a greater understanding of chemical
and crystallographic phenomena. (Visit
https://www.ccdc.cam.ac.uk/solutions/csd-core/components/csd/ to know more
about the CSD.) You will learn more about how to retrieve data from these
databases while performing exercise number 3 of this course.

SAQ 3 Do as directed
i) Write a short note on chemical databases used in drug design and drug
discovery.

ii) Define the term curated database. List a few chemical databases
developed using the curation method.

2.5. STRUCTURE VIEWING TOOLS AND FILE FORMATS
Molecular structures, whether of small organic molecules or of biological
molecules such as proteins, DNA, RNA, lipids, and carbohydrates, can be
visualized with specific software. Most molecular visualization software is
used not only for viewing but also for modifying structures and for
calculating bond lengths, angles, energy, rotatable bonds, molecular weight,
and other parameters. These parameters are important when interpreting a
structure. In addition, a few more advanced programs help us to calculate the
binding energy between a drug and its receptor, and molecular stability in
various solvents and at different pH values and temperatures.
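
A bond length is simply the straight-line distance between the coordinates of
two bonded atoms. The Python sketch below works this out for the first two
atoms (N and CA of the histidine residue) of the glucagon entry 1gcn, whose
coordinates are listed in the PDB example later in this section. The function
name is hypothetical, and the calculation assumes the coordinates are in
ångströms, as is usual for PDB files.

```python
import math


def bond_length(atom_a, atom_b):
    """Euclidean distance between two (x, y, z) coordinate triples."""
    return math.dist(atom_a, atom_b)  # available from Python 3.8


# Coordinates copied from ATOM records 1 and 2 of entry 1gcn.
n_atom = (49.668, 24.248, 10.436)   # backbone nitrogen of HIS 1
ca_atom = (50.197, 25.578, 10.784)  # alpha carbon of HIS 1

print(f"N-CA distance: {bond_length(n_atom, ca_atom):.2f}")
```

The result is close to 1.47, a typical length for a single bond between
nitrogen and carbon. Visualization programs perform the same calculation when
you click on two atoms to measure a distance.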

Here we discuss basic software used for the visualization of biomolecules.
Various file formats are available for viewing molecules in three-dimensional
space. In each of them, the position of every atom is defined relative to an
origin along the X, Y, and Z axes. The format most commonly used by basic
software is the PDB (Protein Data Bank) format, which has a standard layout,
as shown below. While performing exercises number 7 and 8, you will learn more
about these tools and file formats.

Examples of PDB Format

Glucagon is a small protein of 29 amino acids in a single chain. The first residue is the amino-terminal amino acid, histidine, which is followed by a serine residue and then glutamine. The coordinate information (entry 1gcn) starts with:

ATOM 1 N HIS A 1 49.668 24.248 10.436 1.00 25.00 N

ATOM 2 CA HIS A 1 50.197 25.578 10.784 1.00 16.00 C

ATOM 3 C HIS A 1 49.169 26.701 10.917 1.00 16.00 C

ATOM 4 O HIS A 1 48.241 26.524 11.749 1.00 16.00 O

ATOM 5 CB HIS A 1 51.312 26.048 9.843 1.00 16.00 C

ATOM 6 CG HIS A 1 50.958 26.068 8.340 1.00 16.00 C

ATOM 7 ND1 HIS A 1 49.636 26.144 7.860 1.00 16.00 N

ATOM 8 CD2 HIS A 1 51.797 26.043 7.286 1.00 16.00 C

ATOM 9 CE1 HIS A 1 49.691 26.152 6.454 1.00 17.00 C

ATOM 10 NE2 HIS A 1 51.046 26.090 6.098 1.00 17.00 N

ATOM 11 N SER A 2 49.788 27.850 10.784 1.00 16.00 N

ATOM 12 CA SER A 2 49.138 29.147 10.620 1.00 15.00 C

ATOM 13 C SER A 2 47.713 29.006 10.110 1.00 15.00 C

ATOM 14 O SER A 2 46.740 29.251 10.864 1.00 15.00 O

ATOM 15 CB SER A 2 49.875 29.930 9.569 1.00 16.00 C

ATOM 16 OG SER A 2 49.145 31.057 9.176 1.00 19.00 O

ATOM 17 N GLN A 3 47.620 28.367 8.973 1.00 15.00 N

ATOM 18 CA GLN A 3 46.287 28.193 8.308 1.00 14.00 C

ATOM 19 C GLN A 3 45.406 27.172 8.963 1.00 14.00 C

Notice that each line or record begins with the record type ATOM. The atom
serial number is the next item in each record.
(Source: https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html)
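The remaining items in each ATOM record are the atom name, residue name, chain, residue number, the X, Y and Z coordinates, occupancy, temperature factor and element. The following is a minimal Python sketch of how such records can be read and used, here to measure the distance between the N and CA atoms of the first histidine. It assumes the fields are separated by spaces, as in the excerpt above. Real PDB files use fixed column positions, so splitting on spaces is only a simplification and can fail when two columns touch. The function name parse_atom_line is chosen for illustration.

```python
import math

# Two ATOM records copied from the glucagon (1gcn) excerpt above.
RECORDS = """\
ATOM 1 N HIS A 1 49.668 24.248 10.436 1.00 25.00 N
ATOM 2 CA HIS A 1 50.197 25.578 10.784 1.00 16.00 C
"""


def parse_atom_line(line):
    """Split one ATOM record into named fields (whitespace-based simplification)."""
    t = line.split()
    return {
        "record": t[0],
        "serial": int(t[1]),
        "atom": t[2],
        "residue": t[3],
        "chain": t[4],
        "res_num": int(t[5]),
        "xyz": (float(t[6]), float(t[7]), float(t[8])),
        "occupancy": float(t[9]),
        "b_factor": float(t[10]),
        "element": t[11],
    }


atoms = [parse_atom_line(line) for line in RECORDS.splitlines()
         if line.startswith("ATOM")]

# Straight-line distance between the two atoms, in angstroms.
n_ca_distance = math.dist(atoms[0]["xyz"], atoms[1]["xyz"])
print(f"{atoms[0]['atom']}-{atoms[1]['atom']} distance: {n_ca_distance:.2f} angstrom")
```

The computed value of about 1.47 angstrom is a typical N-CA bond length, which is the kind of measurement that visualization software reports when you select two atoms.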

Many programs are available for displaying structures in three-dimensional space, from small molecules to large macromolecular assemblies. A few of them are listed below.

1. RasMol

Most protein structure databases available today are well equipped with graphical visualization tools. A commonly used tool for academic and research purposes is RasMol, a molecular graphics program intended for visualizing proteins, nucleic acids and small molecules for which a 3-D structure is available. In order to display a molecule, RasMol requires an atomic coordinate file that specifies the position of every atom in the molecule through its 3-D Cartesian coordinates (Fig. 2.5). RasMol accepts this coordinate file in a variety of formats, including the Protein Data Bank (PDB) format. The tool gives the user a choice of color schemes and molecular representations: wireframe, cylinder (Dreiding) stick bonds, alpha-carbon trace, space-filling (CPK) spheres, macromolecular ribbons (either smooth-shaded solid ribbons or parallel strands), hydrogen bonding and dot surface. Additional features such as text labeling for selected atoms, different color schemes for different parts of the molecule, zoom and rotation have made it one of the most popular visualization tools.

This standalone software can be downloaded from the RasMol website.

Website: http://www.openrasmol.org/

Fig. 2.5: RasMol software with Crystals of Crambin with PDB ID: 1CRN.

2. Chime

Chime and Protein Explorer are derivatives of RasMol that allow visualization of structures inside web browsers, while RasMol runs independently outside a web browser. Hence, Chime is intended for online use, when connected to the Internet. Another difference is that Chime displays only the molecules made available by the company that distributes it, unlike RasMol, where any protein molecule with atomic coordinates can be viewed.

You can access Chime at: www.umass.edu/microbio/chime

Nowadays, tools such as Jmol, which runs on the Java platform, and JSmol, its JavaScript-based counterpart for web browsers, are also widely used. Jmol can be downloaded to a personal computer to view molecules such as proteins, DNA and RNA.

3. MolMol

MolMol stands for Molecule analysis and Molecule display. This is also free
software with a lot of features that are not found in RasMol and Chime. MolMol
is a molecular graphics program for display, analysis and manipulation of
three-dimensional structures of biological macromolecules, with special
emphasis on nuclear magnetic resonance (NMR) solution structures of
proteins and nucleic acids. MolMol can be reached at:
www.mol.biol.ethz.ch/wuthrich/software/molmol

4. PyMOL

PyMOL is a user-friendly and popular molecular visualization tool built on an open-source foundation and maintained and distributed by Schrödinger. It is widely used in structural bioinformatics, biophysics, computer-aided drug design, and other fields of biology. It offers an advanced graphical user interface and is often used to view a protein bound to a ligand or drug in 3D space. The software allows the viewer to label atoms, bonds, distances, angles, residues, residue numbers, chains, and types of bond interactions (Fig. 2.6). Its features include modeling of proteins and DNA with secondary structure information. Users can display proteins in different forms such as ball-and-stick, wireframe, and molecular surfaces colored by atomic energy distribution. The tool is freely available to students and academic institutions subject to a license agreement or registration. The tutorial and software download are available at https://pymol.org/2/

Fig. 2.6: Visualization of a ligand as sticks within a cavity of a protein using PyMOL software.

5. SPDBV

Swiss-PdbViewer is an application that provides a user-friendly interface to analyze several proteins at the same time. The proteins can be superimposed in order to deduce structural alignments and compare their active sites or any other relevant parts. Amino acid mutations, H-bonds, angles and distances between atoms can be viewed easily through an intuitive graphic and menu interface (Fig. 2.7).

Swiss-PdbViewer was developed in 1994 by Nicolas Guex. It is closely linked to SWISS-MODEL, an automated homology modeling server developed within the Swiss Institute of Bioinformatics (SIB) at the Structural Bioinformatics Group of the Biozentrum in Basel.

Working with SWISS-MODEL and Swiss-PdbViewer together greatly reduces the amount of time required to generate models. It is possible to thread a protein primary sequence onto a 3D template and get immediate feedback on how well the threaded protein will be accepted by the reference structure before submitting a request to build missing loops and refine side-chain packing.

Fig. 2.7: Protein structure visualization with Spdbv software.

2.6 SUMMARY
• Biological databases are used to store experimental data in various formats that can be accessed through the internet.

• In biological databases, each entry is assigned a unique number or ID as its primary key to ensure non-redundancy of the data.

• Biological databases are well structured so that data can be retrieved from anywhere in the world with ease and in a short span of time.

• There are two types of biological databases:

1. Primary databases contain data collected directly from the laboratory: GenBank, EMBL, DDBJ, NDB, TrEMBL, PIR, SwissProt and PDB.

2. Secondary databases are derived from the analysis of data in primary databases: InterPro (protein families, motifs, and domains), UniProt, Ensembl and BRENDA.

• Macromolecules such as proteins, DNA and RNA have separate databases. Similarly, small molecules have their own databases. Examples of small-molecule databases are PubChem, DrugBank, ZINC and the CSD (Cambridge Structural Database).

• SCOP, CATH and PDBsum serve as secondary structural databases.

• SCOP has five main classes: 1. All alpha, 2. All beta, 3. Alpha/beta (α/β), 4. Alpha+beta (α+β), 5. Multi-domain.

• Based on the nature of the data they hold, biological databases are of five types:

1. Sequence databases: EMBL, NCBI, GenBank

2. Structural databases: RCSB-PDB and CCDC

3. Literature databases: PubMed, MEDLINE, NDL

4. Gene expression databases: GXD and Gene Expression Omnibus (GEO), a repository of high-throughput gene expression data from hybridization arrays, chips and microarrays

5. Metabolic pathway databases: KEGG (Kyoto Encyclopedia of Genes and Genomes), MetaCyc, BioCyc

• To view a protein or any other molecule in virtual space, the 3D coordinates (X, Y, and Z) of each atom with respect to the origin are required.

• RasMol is a basic, free molecular visualization program.

• PyMOL provides a well-developed graphical user interface with good rendering options and advanced features.

2.7 TERMINAL QUESTIONS

1. What are the basic principles and characteristics of an ideal biological database?

2. Write a note on primary databases and secondary databases with special emphasis on biological databases.

3. Describe nucleotide databases.

4. Explain secondary structural databases.

5. Write in detail about the classification of biological databases based on the nature of data.

2.8 ANSWERS
Self Assessment Questions
1. i) Nucleotide Database at NCBI
ii) DNA
iii) European Bioinformatics institute
iv) National Centre for Biotechnology
v) PIR

2. i) Protein Data Bank
ii) Research Collaboratory for Structural Bioinformatics
iii) Structural Classification of Proteins
3. i) Refer Section 2.4.1 to 2.4.3.
ii) Refer Metabolic Pathway Databases under section 2.3.

Terminal Questions
1. The basic principles and characteristics of biological databases are:

i) Biological databases store data in electronic form in various formats.

ii) Each entry in the database is assigned a unique number or ID that is not repeated for any other entry. This ensures non-redundancy.

iii) Biological data sharing: the data can be downloaded from the websites of the biological databases or through FTP (File Transfer Protocol).

iv) Biological databases are well structured, searchable, and updated periodically in line with new publications and innovations in the scientific world.

v) Entries are also referred to by their unique IDs in research publications and books. This is called a cross reference.

2. Differences between primary and secondary databases:

Synonyms
Primary database: Archival database
Secondary database: Curated database; knowledgebase

Source of data
Primary database: Direct submission of experimentally-derived data from researchers
Secondary database: Results of analysis, literature research and interpretation, often of data in primary databases

Examples
Primary database: ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress and GEO (functional genomics data); Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures)
Secondary database: InterPro (protein families, motifs and domains); UniProt Knowledgebase (sequence and functional information on proteins); Ensembl (variation, function, regulation and more layered onto whole genome sequences)


3. Refer section (2.2.2)

i) Importance of nucleotide databases

ii) Enlist nucleotide databases and describe in detail.

4. Refer section (2.2.4)

i) Write about SCOP,CATH, PDBSUM in detail.

ii) Classification of SCOP

iii) Few points on CATH databases.

5. Refer section (2.3)


Exercise 4
DATABASES: NCBI, PDB, SCOP, PUBMED, GENBANK, UNIPROT

Structure
4.1 Introduction
    Expected Learning Outcomes
4.2 Databases and Retrieval
4.3 Summary
4.4 Lab Exercises

4.1 INTRODUCTION
In this exercise, you will learn about biological databases that are widely used in the field of bioinformatics.
Databases are systematic collections of logically related data, and software packages are used for defining and managing them. Because of the exponential growth in biological data, publicly accessible databases now hold a large amount of information about biomolecules. Data are often no longer published only in the conventional way but are submitted directly to databases. Generally, biological databases can be classified into sequence databases, structural databases, genome databases, proteome databases, specialized databases, etc.

Expected Learning Outcomes


After performing this exercise you shall be able to:

• browse the required information from the databases;

• explain the importance of databases;

• differentiate between different biological databases; and

• list the applications of various databases for academic and scientific research.

4.2 DATABASES AND RETRIEVAL


In this section, we are going to learn about various databases that you have
studied in unit-2 of this course. However, the main focus will be on how to
access the information available in these databases using existing online
tools. You are aware that there are different databases available for individual
biomolecules like proteins, nucleic acids, and small molecules. Let us explore
them one by one.

National Center for Biotechnology Information (NCBI)

The National Center for Biotechnology Information (NCBI) is a source of public biomedical databases. It also provides software for analysing molecular and genomic data and conducts research in computational biology. NCBI maintains over 40 integrated databases for the medical and scientific sectors, as well as the general public. The GenBank nucleotide database is maintained by the NCBI, which is part of the National Institutes of Health (NIH), a federal agency of the US government. Access the NCBI databases through the following web link (https://www.ncbi.nlm.nih.gov/) and use the popular resources listed there to retrieve information and use different tools (Fig. 4.1).

Fig. 4.1: Screenshot showing NCBI database.

You can access the NCBI to learn about its popular resources. You will learn how to use NCBI and GenBank to access nucleotide and protein sequences in exercises 5 and 7.

PUBMED

PubMed is a bibliographic database and one of the popular NCBI resources. It is a free resource supporting the search and retrieval of biomedical and life sciences literature with the aim of improving health, both globally and personally. As of 2021, PubMed comprises more than 32 million citations and abstracts of biomedical literature from MEDLINE, life science journals, and online books. Citations do not include full-text journal articles but may include links to full-text content from PubMed Central (PMC) and publisher web sites. PubMed is developed and maintained by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH).

Procedure

Step 1: Open PubMed in a browser using the following URL: http://www.ncbi.nlm.nih.gov/pubmed/ (Fig. 4.2).

Fig. 4.2: Screenshot showing PubMed home page.

Step 2: Type your text query in the search panel (for example, corona virus).

Step 3: Select the appropriate abstract from the PubMed summary web page (Fig. 4.3).

Fig. 4.3: Screenshot showing PubMed search results.

Step 4: Copy and save the relevant bibliographic search results for further use.
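The same keyword search can also be requested programmatically through the NCBI E-utilities service, which is not part of the procedure above and is mentioned here only as an optional illustration. The minimal Python sketch below builds the address of an ESearch request for a PubMed query without sending it. The function name pubmed_search_url and the default of 20 results are assumptions made for this example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(query, max_results=20):
    """Build (but do not send) an ESearch request for a PubMed text query."""
    params = {"db": "pubmed", "term": query, "retmax": max_results}
    return ESEARCH_URL + "?" + urlencode(params)


url = pubmed_search_url("corona virus", max_results=5)
print(url)
```

Opening the printed address in a browser should return a list of PubMed identifiers (PMIDs) for the query, which can then be looked up in PubMed as in Step 3.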

GenBank

The GenBank nucleotide database is maintained by the NCBI. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis. The GenBank database is intended to provide and encourage access to the most up-to-date and complete DNA sequence information within the scientific community (https://www.ncbi.nlm.nih.gov/genbank/) (Fig. 4.4).

Fig. 4.4: Screenshot showing GenBank.

So far, you have learned about the NCBI, PubMed and GenBank databases and how to access citations and abstracts from PubMed. To become more familiar with the procedure, repeat the exercise with different text searches, such as an author name or keywords like antioxidants, curcumin, cholesterol, etc. In the next subsection you will learn about the Protein Data Bank, which is widely used for 3-D protein structure-related information. You will also learn how to use GenBank to access nucleotide sequences in exercises 5 and 7.

PDB

The Protein Data Bank (PDB) is a repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organisations, such as the Research Collaboratory for Structural Bioinformatics (RCSB). The PDB database is intended to provide access to 3-D structural information. To access the PDB database, follow the web link https://www.rcsb.org/ and retrieve structural information from PDB (Fig. 4.5).

You can access the PDB to understand what a structural database offers. You will learn how to use the PDB to access and download 3-D structures of proteins and DNA in exercise 6.

Fig. 4.5: Screenshot showing PDB homepage.

SCOP

The Structural Classification of Proteins (SCOP) database is maintained at the MRC (Medical Research Council) Laboratory of Molecular Biology and the Centre for Protein Engineering. It describes the structural and evolutionary relationships between proteins. Proteins are classified in a hierarchical fashion:

Family: proteins clustered into families have clear evolutionary relationships.

Superfamily: proteins whose structural and functional characteristics suggest a common evolutionary origin.

Fold: proteins have a common fold if they have the same major secondary structures in the same arrangement.

Procedure:

Step 1: Open SCOP from the following URL: https://scop.mrc-lmb.cam.ac.uk/ (Fig. 4.6).

Fig. 4.6: Screenshot showing SCOP homepage.

Step 2: Type the protein name or other relevant text in the keyword text box (Fig. 4.7).

Fig. 4.7: Screenshot showing keyword in search engine of SCOP.

Step 3: On pressing the search button, the result (summary) page is displayed. To know more about a specific protein, click on it (Fig. 4.8).

Fig. 4.8: Screenshot showing SCOP search results.

Step 4: Choose the appropriate link to display the functional information (Fig. 4.9).

Fig. 4.9: Appropriate links showing family and superfamily have been encircled.

UNIPROT

The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and its host institutions, the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR), are committed to the long-term preservation of the UniProt databases.

Procedure:

Step 1: Open the UniProt website from the following URL: https://www.uniprot.org/ (Fig. 4.10).

Fig. 4.10: Screenshot showing UniProt homepage.

Step 2: Type the protein name or other relevant text in the keyword text box (Fig. 4.11).

Fig. 4.11: Screenshot showing UniProt search column.

Step 3: On pressing the search button, the result (summary) page is displayed (Fig. 4.12).

Fig. 4.12: Screenshot showing search results on UniProt.

Step 4: Choose the first sequence by double-clicking its accession number, then go to the display button and select FASTA format to retrieve the sequence (Fig. 4.13).

Fig. 4.13: Screenshot showing display options for searched protein.

Step 5: Copy and save the protein sequence for further analysis (Fig. 4.14).

Fig. 4.14: Screenshot showing FASTA sequence of the protein.
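Once a sequence has been saved in FASTA format, it can be read back for further analysis. The following minimal Python sketch shows one way to separate the header line, which starts with ">", from the sequence lines that follow it. The function name parse_fasta is chosen for illustration, and the record used is made up for the example rather than copied from UniProt.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record for illustration only, not a real UniProt entry.
example = """>demo|X00001|DEMO_PROTEIN made-up example protein
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
header, sequence = records[0]
print(header)
print(len(sequence), "residues")
```

The sequence lines are joined into one string, so the length reported is the number of amino acid residues regardless of how the file was wrapped.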

4.3 SUMMARY
• Databases are systematic collections of logically related data. Generally, biological databases can be classified into sequence databases, structural databases, genome databases, proteome databases, specialized databases, etc.

• Databases are created and maintained so that biological data can be accessed and used appropriately.

• NCBI is a source of public biomedical databases. It maintains over 40 integrated databases for the medical and scientific sectors, as well as the general public.

• The GenBank nucleotide database is maintained by the NCBI and provides access to up-to-date DNA sequence information.

• PubMed is a bibliographic database among the popular NCBI resources, comprising more than 32 million citations and abstracts of biomedical literature.

• PDB is a repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids.

• SCOP describes structural and evolutionary relationships between proteins and classifies them hierarchically into families, superfamilies and folds.

• UniProt is a comprehensive resource for protein sequence and annotation data.

4.4 LAB EXERCISES

1. Retrieve the abstract with PMID: 32643536 from the PubMed bibliographic database and write the title of the abstract and the names of the authors.

2. Access the nucleotide sequence FJ436056 from the GenBank database in GenBank format and FASTA format and write the title, molecule type, and number of base pairs.

3. Download the 3-D structure of the protein with PDB ID 7JMO from the PDB (RCSB) database and view the structure in a viewing tool.

4. Open the SCOP database, perform any keyword or text search, and write the functional aspect, name of the protein, family, class and domain.

5. Access a globulin protein sequence from the UniProt database in FASTA format and write the title, UniProtKB ID, and organism name.


Exercise 5
RETRIEVAL OF GENE SEQUENCES FROM NCBI

Structure
5.1 Introduction
    Expected Learning Outcomes
5.2 Procedure
5.3 Summary
5.4 Lab Exercises
5.1 INTRODUCTION
In the previous exercise, you learned about different databases. In this exercise, you will study the retrieval of protein and gene sequences from the NCBI database. You studied the NCBI database in Unit 2 and learned about NCBI resources such as GenBank and GenPept in Exercise 4. In this exercise, we shall access sequences from GenBank and GenPept, which will be used in various sequence analysis techniques.

Protein sequences are the fundamental determinants of biological structure and function. The NCBI Protein database is a collection of protein sequences from different sources, including translations of annotated coding regions in GenBank (GenPept), Reference Sequences (RefSeq) and Third Party Annotation (TPA), as well as records from Swiss-Prot, the Protein Information Resource (PIR), the Protein Research Foundation (PRF) and the Protein Data Bank (PDB). The Nucleotide database is a collection of sequences from different sources, which include GenBank, RefSeq, TPA, and PDB. Genome, gene, and transcript sequence data provide the foundation for biomedical research and discovery. The Gene database can be searched by simply entering a query term, preferably the gene name or the disease name, in the query box, which will display the list of genes associated with the search. Users can also search records by Gene ID, which is a unique identifier issued by NCBI.

Expected Learning Outcomes


After performing this exercise you shall be able to:

• use the NCBI GenPept database to retrieve protein sequences;

• explore and retrieve gene information from the NCBI Gene database; and

• explain the importance of NCBI in sequence retrieval.

5.2 PROCEDURE
1. Access the home page of NCBI from the following web link: https://www.ncbi.nlm.nih.gov/ (Fig. 5.1).

Fig. 5.1: Screenshot showing NCBI home page.

2. Click on the drop-down menu and select “Gene” (Fig. 5.2).

Fig. 5.2: Screenshot showing dropdown menu on NCBI.

3. Type the relevant text or keyword in the search box (for example, a gene name or species name) (Fig. 5.3).

Fig. 5.3: Screenshot showing search box on NCBI.

4. On pressing the search button, the result (summary) page is displayed (Fig. 5.4).

Fig. 5.4: Screenshot showing summary page (results) on NCBI.

5. Choose the desired gene sequence by double-clicking its name or ID, or by ticking the check box next to the appropriate sequence. Then go to the display button, scroll down and click on the required file format (FASTA or GenBank) to retrieve the sequence (Fig. 5.5).

Fig. 5.5: Screenshot showing how to obtain FASTA or GenBank sequence.

6. Copy and save the required gene sequence for further analysis (Fig. 5.6).

Fig. 5.6: Screenshot showing FASTA sequence.
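When the accession number of a nucleotide record is already known, the record can also be requested directly from the NCBI E-utilities service instead of through the web pages. This is an optional alternative and not part of the procedure above. The minimal Python sketch below builds, but does not send, an EFetch request in either FASTA or GenBank format. The function name nucleotide_fetch_url is an assumption made for this example, and the accession FJ436056 is the one used in the lab exercises of Exercise 4.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities record retrieval endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def nucleotide_fetch_url(accession, fmt="fasta"):
    """Build (but do not send) an EFetch request for one nucleotide record.

    fmt is "fasta" for FASTA output or "gb" for the GenBank flat file.
    """
    if fmt not in ("fasta", "gb"):
        raise ValueError("fmt must be 'fasta' or 'gb'")
    params = {"db": "nuccore", "id": accession, "rettype": fmt, "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


fasta_url = nucleotide_fetch_url("FJ436056", "fasta")
genbank_url = nucleotide_fetch_url("FJ436056", "gb")
print(fasta_url)
print(genbank_url)
```

Opening either address in a browser should show the same text that the FASTA and GenBank display options give in step 5.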

5.3 SUMMARY
• NCBI hosts a systematic collection of logically related biological data in sequence databases, genome databases, specialized databases, etc.

• NCBI is a source of public biomedical databases. The GenBank nucleotide database is maintained by the NCBI and provides DNA sequence information. The GenPept protein sequence database is also maintained by NCBI.

• Both gene and protein sequences can be retrieved from the NCBI databases for further analysis.

5.4 LAB EXERCISES

1. Access a nucleocapsid phosphoprotein gene sequence from the NCBI database in GenBank format and write the title, ID, and organism name.

2. Access an envelope protein sequence from the NCBI database in GenBank format and write the title, ID, and organism name.

3. Access a COVID-19 protein sequence from the NCBI database in FASTA format and write the title, ID, and organism name.

Exercise 6
ACCESSING PROTEIN STRUCTURE FROM PDB

Structure
6.1 Introduction
    Expected Learning Outcomes
6.2 Procedure
6.3 Summary
6.4 Lab Exercises

6.1 INTRODUCTION
In the previous exercise, you learned how to retrieve protein and gene sequences from the NCBI database. In this exercise, you shall explore the steps involved in downloading protein structures from the PDB database. 3-D structures of proteins from the Protein Data Bank (PDB) are used to understand structural information such as the binding site of a protein or DNA, the active site of enzymes, DNA-protein interactions, and protein-protein interactions, and this has applications in drug design. You learned about the PDB database in Unit 2 and accessed the PDB website in Exercise 4.

Protein structure is useful for understanding how a protein works, and that information can be used to inhibit, regulate, or modify protein function and to predict which molecules bind to that protein. It also helps us to understand various biological interactions, assist drug discovery, or even design novel proteins as therapeutic molecules. Similarly, in order to understand the biological function of DNA, we need to study its molecular structure. The PDB is a repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids. In this exercise, we shall learn how to download a protein structure from PDB.

Expected Learning Outcomes


After performing this exercise you shall be able to:

• describe how to access structures of proteins from PDB;

• search for the structural data of a protein in the PDB database;

• explain the PDB file format; and

• download and save the 3-D structure of a protein in PDB format.

6.2 PROCEDURE
1. Open the PDB from the following URL: https://www.rcsb.org/ (Fig. 6.1).

Fig. 6.1: Screenshot showing PDB home page.

2. Enter a query such as a PDB ID, molecule name or author name in the text box provided. Click on the search button (Fig. 6.2).

Fig. 6.2: Screenshot showing search column on PDB homepage.

3. From the summary page, click on PDB ID 7LYJ and download the macromolecular 3D structure in PDB format (Fig. 6.3 and 6.4).

Fig. 6.3: Screenshot showing target protein (7LYJ).

Fig. 6.4: Screenshot showing how to download PDB format.

4. Open the structure file in any one of the visualizing tools (PyMOL, RasMol or Swiss-PdbViewer) to view it. You will learn about these tools in exercise number 8 of this course.
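If the PDB ID is already known, the structure file can also be fetched directly from the RCSB file server rather than through the web pages. This is an optional shortcut and not part of the procedure above. The minimal Python sketch below checks that an identifier looks like a four-character PDB ID and builds the download address for the PDB-format file without sending any request. The function name pdb_file_url is an assumption made for this example. Note that some very large structures may not be offered in PDB format.

```python
import re

# Address prefix of the RCSB file download server.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/"


def pdb_file_url(pdb_id):
    """Return the download address of the PDB-format file for a 4-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    # Classic PDB IDs are one digit followed by three letters or digits.
    if not re.fullmatch(r"[0-9][A-Z0-9]{3}", pdb_id):
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return f"{RCSB_DOWNLOAD}{pdb_id}.pdb"


url_7lyj = pdb_file_url("7lyj")
print(url_7lyj)
```

The file saved from this address is the same plain-text PDB file obtained in step 3 and can be opened in a viewing tool as in step 4.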

6.3 SUMMARY
• PDB is the database, accessed here through the RCSB website, from which we can obtain protein 3-D structures.

• In this exercise you have practised downloading a protein structure in PDB format.

• You have acquired the skills to access PDB pages and learned how to search for the desired protein.

• Files in PDB format can be visualised using visualising tools.

6.4 LAB EXERCISES

1. Access the 7LMF protein structure from the PDB database, download it in PDB format, save it as a PDB flat (text) file, and comment on a few points.

2. Download the S. cerevisiae CMG-Pol epsilon-DNA structure in PDB format, give its PDB ID and source, and comment on a few points.

3. Download the crystal structure of yeast phenylalanine tRNA in PDB format, give its PDB ID and source, and comment on a few points.
UNIT 3
SEQUENCE ALIGNMENT

Structure
3.1 Introduction
    Expected Learning Outcomes
3.2 Sequence Similarity, Identity, and Homology
    Sequence Similarity
    Sequence Identity
    Sequence Homology
3.3 Alignment Types
    Pairwise and Multiple Sequence Alignment
    Local and Global Alignment
3.4 Alignment Scoring Matrices
    PAM
    BLOSUM
3.5 Sequence Alignment Tools and Software
    BLAST and Types
    CLUSTAL W
3.6 Summary
3.7 Terminal Questions
3.8 Answers
3.9 Further Readings

3.1 INTRODUCTION
In the previous unit, you learned about the sequences and structures of proteins and nucleic acids along with biological databases. As you know, amino acids are the building blocks of proteins. Any language has an alphabet, and various combinations and proper arrangements of its letters form words and sentences with meaning. Language helps us to communicate with each other as well as to update knowledge. Similarly, the arrangement of amino acids gives rise to numerous functional proteins, enzymes, receptors, etc. in biological systems. The order of amino acids in proteins and of nucleotides in nucleic acids plays a major role in the proper functioning of proteins and genes. It is interesting to know that specific protein sequences remain the same in many organisms, while a few additions, deletions or insertions give rise to a mutated protein. If the sequence is exactly the same in two different organisms, the protein function is very likely to be the same as well. There are various tools and software available to compare these sequences.

In both animals and plants, there are several proteins and enzymes involved in biochemical pathways, signaling pathways, and other functions. Comparing the sequences of proteins and genes of one organism with those of another animal or species is called sequence comparison. In this unit, you are going to learn more about sequence alignment, its types, and the algorithms involved in it. In addition, you will come across new terms, software, and tools. You will learn the applications of sequence alignment of proteins and nucleic acids, which is essential in the field of biology and allied subjects.

Expected Learning Outcomes


After studying this unit, you shall be able to:

• differentiate between similarity, identity, and homology of sequences;

• describe alignment types like pairwise and multiple sequence alignment; and

• explain algorithms, amino acid substitution matrices (PAM and BLOSUM), BLAST and CLUSTALW.

3.2 SEQUENCE SIMILARITY, IDENTITY, AND HOMOLOGY
The terms similarity, identity, and homology may look alike, but they mean
different things. Let us start with a simple example. Assume that the word
AMINOACID is compared with the same word, AMINOACID; each letter then
matches the letter at the same position, so the match is 100%.

A M I N O A C I D   Seq-1
| | | | | | | | |
A M I N O A C I D   Seq-2

In the above example, the first word is named Seq-1 and the second word
Seq-2, and the two match at every position. It is a simple example of how a
sequence of letters forms a word and how two such sequences are compared.
Now, let us see the similarity of sequences in the next subsection.

3.2.1 Sequence Similarity


You have seen an example of the alignment, or matching, of letters in the word
AMINOACID. Now, you will learn about sequence similarity in proteins and
genes. In the previous unit, we discussed the Human Genome Project as well as
public biological databases such as NCBI, GenBank, SwissProt, and EMBL-EBI,
where genome and protein sequences of most plants, animals, fungi, and
bacteria are available. Sequences are available in various formats in specific
databases. Among them, the FASTA format is the most popular, as shown in
Fig. 3.1 for the enzyme hexokinase. You will learn more about how to obtain a
FASTA file while performing exercise number 7 in this course.
Unit 3 Sequence Alignment

Fig. 3.1: FASTA format of hexokinase of Homo sapiens retrieved from GenBank.
(Labels in the figure: enzyme name and organism; GenBank ID.)

In the above sequence format, you can see the GenBank ID, the name of the
enzyme, and the organism at the top of the sequence. This header line starts
with the '>' sign, followed by the GenBank ID, the enzyme name, and the
scientific name of the organism; the sequence itself follows on the next
lines. When we compare the hexokinase sequences of two different animals,
they match 100% only if they have the same number of amino acid residues and
the same amino acid at every position. The match may be less than 100% in
another pair of organisms because of differences in the number and type of
amino acid residues. In a few animals, we may find mutations in the sequence,
yet the protein still performs a normal or similar function.
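To make the layout of a FASTA file concrete, the following minimal Python sketch splits FASTA text into header and sequence. It is only an illustration: the function name parse_fasta, the identifier XX000001.1, and the residues shown are made up for this example and are not a real GenBank record.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined into one continuous string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record: the identifier and residues are invented for illustration.
example = """>XX000001.1 hexokinase [Homo sapiens]
MIAAQLLAYY
FTELKDDQVK
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

The header keeps everything after the '>' sign, and the sequence lines are joined so that their length can be compared between organisms.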

Let us see the following example to understand the concept of similarity.

A T C G G C   Seq-1
| | | |
A T C G C G   Seq-2

Seq-1 and Seq-2 are not the same: they match at the first four positions and
mismatch at the last two (G/C and C/G). Because most positions match, we can
still call the two sequences similar.

In proteins, for instance, two aligned amino acids may differ and yet have
similar chemical and physical properties; such mismatches often do not alter
the functionality of the protein. Fig. 3.2 shows an example of a sequence
alignment for the protein histone H1 among different mammals. The amino acids
that are constant throughout the alignment are called conserved residues, and
the amino acids that vary in the alignment are referred to as non-conserved
residues.

Fig. 3.2: Amino acid sequence alignment.
(Source: https://en.wikipedia.org/wiki/Sequence_alignment)

SAQ 1
i) Which type of sequences are available in NCBI database?
ii) Which mammalian sequence is more similar to human histone
sequence?
iii) What do you mean by conserved residues in a multiple sequence
alignment?

3.2.2 Sequence Identity


Learners often confuse similarity and identity, but there is a difference
between them. If the same nucleotide or amino acid is found at the same
position in two sequences, it is called an identity; in other words, the
characters of the two sequences match exactly at that position. Sequence
identity is usually reported as the percentage of aligned positions that match
exactly. Similarity, in contrast, describes a resemblance between sequences
and also counts positions where the residues are not identical but have
similar properties.
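The difference between identity and similarity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the grouping of amino acids in SIMILAR_GROUPS is a simplified, hypothetical grouping chosen for this example, and percentages are computed over aligned positions without gaps (other conventions for the denominator exist).

```python
# Simplified, illustrative grouping of amino acids with similar properties.
# This grouping is an assumption for the example, not a standard definition.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]


def identical(a, b):
    """Two residues are identical if they are the same letter."""
    return a == b


def similar(a, b):
    """Two residues are similar if identical or in the same property group."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def percent_match(aligned_a, aligned_b, is_match, gap="-"):
    """Percentage of gap-free aligned positions for which is_match is true."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if is_match(a, b))
    return 100.0 * hits / len(pairs)


print(percent_match("AVLK", "AILR", identical))  # identity
print(percent_match("AVLK", "AILR", similar))    # similarity
```

In this toy pair, V/I and K/R are not identical but fall in the same group, so the similarity is higher than the identity.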

Various tools and software are available to calculate sequence identity over
the length of the sequences. Among them, BLAST is one of the most powerful
and widely used. You will learn how to perform this calculation using online
tools in exercise 9 of this course.

In the given example (Fig. 3.3), the DNA polymerase sequence of Hepatitis B
virus is taken as the query sequence and aligned with a subject sequence (a
sequence from the database). In this alignment, the query and subject
sequences showed a similarity of 98% and an identity of 97%. You can observe
that a few amino acids do not match exactly with those in the lower sequence.
In the figure, the big boxes mark regions with more sequence identity than the
small boxes; the amino acids in the small boxes are neither identical nor
similar. You can also observe some gaps between the sequences. The alignment
of sequences is carried out using various scoring matrices and algorithms.
Sequence identity plays a major role in the generation of evolutionary trees.
It helps in understanding the lineage of a species and its relationship with
other organisms. Sequence identity is also essential for acquiring information
about the working mechanisms of various proteins, enzymes, receptors, and
cellular responses.

Fig. 3.3: Sequence identity of DNA polymerase in Hepatitis B Virus.

SAQ 2
i) What is sequence identity?

ii) What is “subject sequence” in sequence alignment?

3.2.3 Sequence Homology


In simple words, homology describes similarity due to shared ancestry. It is
one of the common terms used in bioinformatics when comparing two or more
protein or nucleotide sequences. Sequence homology is related to protein
stability and functionality, so it is very important to know whether
sequences are homologous. In practice, homology is usually inferred from the
similarity of sequences over their length (Table 3.1). If two homologous
sequences match 100%, the structure and function of the proteins are expected
to be the same as well. If two or more aligned sequences share a common
ancestor, we call them homologous sequences. Fig. 3.4 shows an example of
structural homology, which plays an important role in understanding
evolutionary biology.

Fig. 3.4: Structural homology (http://www.bio.miami.edu/dana/160/160S11_3.html)

Table 3.1: The differences between similarity and homology.

S.No.  Similarity                              Homology

1.     Similarity refers to the likeness or    Homology refers to shared ancestry.
       % identity between two sequences.

2.     Similarity means sharing a              Two sequences are homologous if they
       statistically significant number of     are derived from a common ancestral
       bases or amino acids.                   sequence.

3.     Similarity does not imply homology.     Homology usually implies similarity.

You are advised to watch the video in the given YouTube link to know more
details about sequence similarity:
https://www.youtube.com/watch?v=K6ldxHPzI5A

SAQ 3
i) Write the differences between similarity and homology.

ii) What is homology?

3.3 ALIGNMENT TYPES


Till now, you have learned about sequence alignment of amino acids and
nucleotides with suitable examples. Now, we will discuss alignment over the
whole sequence and fragment-based alignment. By the number of sequences
compared, there are two types of alignment: 1. pairwise sequence alignment
and 2. multiple sequence alignment. These alignment types help us to
understand the phylogeny of species or the genetic relationships between gene
sequences. The percentage of similarity indicates how closely or distantly
related the sequences are.

3.3.1 Pairwise and Multiple Sequence Alignment


In this section, we will learn about alignment regions. The main purpose of
pairwise sequence alignment is to identify the regions of similarity between
two sequences. These regions may indicate functional or structural
relationships between proteins or genes and may reveal an evolutionary
relationship between the two sequences.

Seq1 AVLTSHYILRS - 11
     |||||||||||
Seq2 AVLTSHYILRS - 11

Different methods are used for the pairwise alignment of nucleotide and
protein sequences. Let us learn them one by one:

1) Dot Plot: It is a graphical method for comparing two sequences. It
identifies regions of maximum similarity and dissimilarity, depicted by the
presence and absence of dots. One sequence is written along the top of a grid
and the other along its side. If a residue of one sequence matches exactly the
residue of the other sequence, a dot is placed in the respective box. The same
procedure is followed for amino acid and nucleotide sequences (Fig. 3.5).

Fig. 3.5: Dotplot of amino acids

Seq-1 G G A A T

| | | | |

Seq-2 G G A A T

If any nucleotide or amino acid does not match, the box is left empty, and a
gap is noticed in the diagonal of the plot.

As you know, most gene sequences are very long. In such cases, the plot
appears as in Fig. 3.6. In this figure, the X-axis is Seq1 and the Y-axis is
Seq2, and the sequences are 200 amino acid residues long.

Fig. 3.6: Dotplot of amino acid Seq1 and Seq2 with 200 amino acid residues.

2) Dynamic Programming: This method breaks a problem into small sub-problems
and uses the solutions of the sub-problems to compute the solution of the
larger one. Algorithms such as Needleman-Wunsch and Smith-Waterman are used
here. (Watch the video at the following link to learn more about the above
alignment types: https://www.youtube.com/watch?v=ipp-pNRIp4g)

3) Heuristic Method: When a single sequence is to be compared against a whole
database, heuristic methods like BLAST and FASTA are used.
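The dot plot method described in item 1 above can be sketched in Python as follows. This is a minimal illustration, not the method used by any particular dot plot program: the function names dot_plot and render are hypothetical, a '*' stands for a dot, and '.' stands for an empty box.

```python
def dot_plot(seq_x, seq_y):
    """Return the dot matrix: rows follow seq_y, columns follow seq_x."""
    return [["*" if x == y else "." for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Format the dot matrix as text with both sequences as axis labels."""
    lines = ["  " + " ".join(seq_x)]
    for y, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        lines.append(y + " " + " ".join(row))
    return "\n".join(lines)


matrix = dot_plot("GGAAT", "GGAAT")
print(render("GGAAT", "GGAAT"))
```

For two identical sequences, every box on the main diagonal holds a dot; repeated letters (here GG and AA) add dots off the diagonal as well.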

We are going to study various alignments like local and global alignment in the
next sections of this unit.

So far we have compared two sequences at a time. We can also align more than
two sequences by using software or online tools; this is called multiple
sequence alignment (MSA, Fig. 3.7). These alignments can be global or local.
An MSA of proteins or genes can be performed in most software by collecting
the sequences in FASTA format. The output of the alignment can be viewed as
aligned sequences or as a tree.

Fig. 3.7: Multiple sequence alignment (MSA).

Phylogenetic trees are generated by various tools and software to determine
evolutionary relationships based on multiple sequence alignments of amino
acid residues or nucleotides. An alignment of more than two sequences is
called a multiple sequence alignment (MSA). MSA is commonly used in
phylogenetic analysis, protein structure prediction and comparison, and the
identification of conserved domains, regions, and active/inhibitory sites of
enzymes. An MSA treats the sequences as homologous, that is, as derived from
a common ancestor. The algorithms try to align homologous positions or
conserved regions by considering function and structure.
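Once sequences have been aligned, conserved positions can be read directly from the columns of the alignment. The sketch below assumes the sequences are already aligned to the same length; the function names and the three short sequences are made up for illustration.

```python
def conserved_columns(aligned):
    """Return the 0-based indices of columns where all sequences agree.

    Columns that consist of gaps ('-') are not counted as conserved.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [
        index
        for index, column in enumerate(zip(*aligned))
        if len(set(column)) == 1 and column[0] != "-"
    ]


def conservation_line(aligned):
    """Mark conserved columns with '*' and all other columns with a space."""
    conserved = set(conserved_columns(aligned))
    return "".join("*" if i in conserved else " " for i in range(len(aligned[0])))


# Toy alignment of three already-aligned sequences (illustrative only).
toy_alignment = ["IAGCW", "IAGCT", "IIGCT"]
for sequence in toy_alignment:
    print(sequence)
print(conservation_line(toy_alignment))
```

Alignment programs print a similar marker line under the alignment so that conserved residues can be spotted at a glance.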

To know more about the topic you are advised to visit the following video links:
https://www.youtube.com/watch?v=S07kIY2ihq8

https://www.youtube.com/watch?v=TZaA_-4j19w

SAQ 4
i) What is MSA?

3.3.2 Local and Global Alignment


There are two types of alignment with respect to the length of the sequence
to be aligned. We will discuss local and global alignment of sequences in
detail, along with the algorithms that implement them. Usually, multiple
sequence alignment algorithms assume that the sequences are similar along
their whole length, so they behave like global alignment algorithms. They
also assume that there are not many long insertions and deletions of
residues/nucleotides. Thus, the algorithms will work for some sequences but
not for others. These algorithms can deal with sequences that are quite
different, but, as in the pairwise case, when the sequences are very
different they might have problems creating a good alignment. A good
algorithm should align the homologous positions or the positions with the
same structure or function.
Global Alignment: In the sequence analysis of proteins or genes, sequences of
about the same length are well suited to global alignment. Such an alignment
is performed from the beginning to the end of the sequences. Gaps may be
introduced during the alignment process.
The Needleman-Wunsch algorithm (an algorithm is a formula or set of steps to
solve a problem) was developed by Saul B. Needleman and Christian D. Wunsch
in 1970. It is a dynamic programming algorithm for the global alignment of
nucleotide or protein sequences. It was the first algorithm of its kind for
aligning two protein sequences and the first application of dynamic
programming to biological sequence analysis. The Needleman-Wunsch algorithm
finds the best-scoring global alignment between the two sequences. Global
alignments are most useful when the two sequences being compared are of
similar length and not too divergent.
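The idea of global alignment by dynamic programming can be sketched as below. This is a simplified teaching version, not the original published procedure: it assumes a simple scoring scheme (match +1, mismatch -1, gap -1, all chosen only for illustration) and returns one best-scoring alignment.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill each cell from its three neighbours (the sub-problems).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    # Trace back from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


result = needleman_wunsch("ACGT", "AGT")
print(result)
```

Because the traceback always runs from the last cell to the first, every residue of both sequences appears in the alignment, which is what makes it global.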
Local Alignment: Local alignment finds the regions of highest similarity
between two sequences, whether the sequences as a whole are similar or
dissimilar. It is therefore suitable for sequences that share only short,
highly similar regions.
Both alignment methods are implemented by algorithms that use scoring
matrices to align two different series of characters or patterns
(sequences), and both are based on dynamic programming. Many proteins exhibit
modular architectures. In searching databases for similar sequences, it is
useful to find sequences that have similar domains or functional motifs.
Smith & Waterman (1981) published an application of dynamic programming to
find optimal local alignments. The algorithm is similar to Needleman-Wunsch,
except that negative cell values are reset to zero, and the traceback
procedure starts from the highest-scoring cell, anywhere in the matrix, and
ends when the path encounters a cell with a value of zero.
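The two changes that turn global alignment into local alignment (resetting negative cells to zero, and tracing back from the highest-scoring cell until a zero is reached) can be sketched as follows. The scoring values (match +2, mismatch -1, gap -1) and the two toy sequences are assumptions made for this illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Negative values are reset to zero, so an alignment can start anywhere.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until a cell with value zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


local = smith_waterman("TTTGGAATCCC", "AAGGAATAA")
print(local)
```

Only the shared region GGAAT is reported; the unrelated flanking residues are left out of the alignment.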
If we take a small fragment of one sequence and align it with a small region
of the other sequence, it is a local alignment. Performing a complete
alignment throughout the sequence length is known as global alignment
(Fig. 3.8 and 3.9).


Fig. 3.8: Lined diagram representing Global and Local alignment.

(source: https://www.researchgate.net/figure/Global-alignment-vs-Local-
alignment_fig1_322704711)

Fig. 3.9: Global alignment and local alignment.

(Source: https://www.majordifferences.com/2016/05/difference-between-global-and-
local.html)

Gap penalty

Sequence alignments usually require the insertion of gaps, reflecting
insertion or deletion mutations. If a nucleotide or amino acid in one
sequence is aligned to a gap in the other sequence, this should be penalized,
much like a mismatch. However, gaps at the ends of sequences should perhaps
not incur any penalty. Moreover, a single insertion or deletion mutation can
result in a contiguous gap of multiple residues. Therefore, a single gap that
is 3 residues long should incur a smaller penalty than 3 separate gaps of one
residue each. An affine gap penalty scheme heavily penalizes opening a gap,
but extending an existing gap incurs a much lower penalty per additional
residue. You will learn more about global and local alignment and gap penalty
concepts while performing exercise number 9 on BLAST.
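The effect of an affine gap penalty can be checked with a small calculation. The penalty values below (10 for opening a gap, 1 for each additional residue) are assumed for illustration only, and the convention that the opening penalty covers the first residue of the gap is one of several in use.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for one contiguous gap of the given length.

    Assumed convention: gap_open covers the first gapped residue and
    gap_extend is charged for every additional residue.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_penalty(3)           # a single gap of 3 residues
three_short_gaps = 3 * affine_gap_penalty(1)   # three separate 1-residue gaps
print(one_long_gap, three_short_gaps)
```

With these values, one gap of 3 residues costs much less than three separate gaps, which reflects that a single mutation event can create a long gap.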

3.4 ALIGNMENT SCORING MATRICES


In the previous section, we have seen the alignment of sequences, in pairs or
in multiples. As we have discussed, the Needleman-Wunsch and Smith-Waterman
algorithms require a scoring matrix. “The scoring matrix is a mathematical
arrangement of numbers in a systematic way”. We have seen dot plots, where a
residue of one sequence that matches a residue of the other sequence is shown
as a dot. In a scoring matrix, by contrast, a positive score is assigned for a
match and a penalty for a mismatch, as per sequence similarity (refer to the
video link).

For nucleotide sequence alignments, the simplest scoring matrix awards +1 for
a match and -1 for a mismatch. The blastn algorithm at NCBI (discussed in the
next section) scores +5 for a match and -4 for a mismatch. These scoring
matrices treat all mutations (mismatches) equally. In reality, transitions
(pyrimidine to pyrimidine and purine to purine) occur much more frequently
than transversions (pyrimidine to purine and vice versa) (refer to Fig. 3.10
below).

Note: For aligning non-protein coding DNA sequences, a
transition/transversion scoring matrix may be more appropriate. For aligning
DNA sequences that encode proteins, alignment of the protein amino acid
sequences will almost always be more reliable.
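A transition/transversion scoring scheme can be sketched as follows. The score values (+1 for a match, -1 for a transition, -2 for a transversion) are hypothetical and chosen only to show that a transition is penalized less than a transversion; the code assumes uppercase A, C, G, T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of nucleotides as match, transition, or transversion."""
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"    # purine <-> purine or pyrimidine <-> pyrimidine
    return "transversion"      # purine <-> pyrimidine


# Illustrative scores only; real matrices use values fitted to data.
SCORES = {"match": 1, "transition": -1, "transversion": -2}


def score_alignment(aligned_a, aligned_b):
    """Sum the scores of all aligned nucleotide pairs (no gaps expected)."""
    return sum(SCORES[substitution_type(x, y)] for x, y in zip(aligned_a, aligned_b))


print(substitution_type("A", "G"), substitution_type("A", "C"))
print(score_alignment("ATCG", "GTCC"))
```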

Fig. 3.10: Transitions and transversions in genetic mutations.

(Source: https://www.differencebetween.com/difference-between-transition-and-vs-
transversion/)

For protein sequence alignments, the scoring matrices are more complicated.
The goal is to reflect evolutionary processes. Some amino acid sequence
changes can arise from a single nucleotide change, whereas other amino acid
changes require two nucleotide changes. Some amino acid changes are less
likely to affect protein structure or function than other amino acid changes.

SAQ 5
What is the use of alignment scoring matrix?


3.4.1 PAM (Point Accepted Mutations)


Dayhoff used alignments of highly conserved proteins to assess which amino
acid changes were likely to be accepted as Point Accepted Mutations. From
this data, she devised a 20 x 20 amino acid substitution matrix for PAM-1, a
unit of evolutionary change resulting in 1 accepted mutation per 100 amino
acids. From there she calculated other matrices such as PAM-2 or PAM-30 or
PAM-250. The substitution matrices are converted to scoring matrices by
converting substitution probabilities to log-odds ratios for each cell.
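Higher PAM matrices are obtained from PAM-1 by repeated matrix multiplication: PAM-n is PAM-1 multiplied by itself n times. The sketch below shows this with a toy 3-letter alphabet instead of the 20 amino acids; the probabilities in TOY_PAM1 are invented for illustration and are not Dayhoff's values.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Extrapolate a PAM-1 style probability matrix to n PAM units."""
    size = len(pam1)
    # Start from the identity matrix (no evolutionary change).
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Toy mutation probability matrix for a 3-letter alphabet (made-up values).
# Each row sums to 1; about 99% of residues stay unchanged after 1 PAM.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = pam_n(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(value, 3) for value in row])
```

After many PAM units, the probability that a residue remains unchanged drops well below its PAM-1 value, which is why matrices such as PAM-250 are suited to more distantly related sequences.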

For example, an alignment of multiple sequences may be as follows:

IAGCW

IAGCT

IIGCT

Dayhoff constructed a phylogenetic tree from such aligned sequences and
counted the substitutions along its branches (Fig. 3.11). The tree is chosen
so that it minimizes the number of changes needed to explain the sequences.
To know more about the PAM concept, watch the video:
https://www.youtube.com/watch?v=8avcQRxaLBw

Fig.3.11: Phylogenetic tree for substitution matrix for three sequences.

3.4.2 BLOSUM (BLOcks Substitution Matrix)


In the previous section, you were briefly introduced to the PAM matrices.
Now, we will discuss the BLOSUM matrices, another family of substitution
matrices used for sequence alignment. These matrices are used to score
alignments between evolutionarily divergent protein sequences. They are based
on local alignments. BLOSUM matrices were introduced by Steven Henikoff and
Jorja Henikoff (Fig. 3.12). They scanned the BLOCKS database for highly
conserved regions of protein families and then calculated a log-odds score
for each of the 210 possible pairs of the 20 standard amino acids (190 pairs
of different amino acids and 20 pairs of identical ones).

PAM and BLOSUM matrices differ in how they are derived: BLOSUM matrices are
based on observed alignments and are not, like the PAM matrices, extrapolated
from comparisons of closely related proteins. Scoring plays a major part in
sequence alignment. For nucleotide sequences, all matches are given the same
score and all mismatches the same penalty (typically +1 or +5 for matches,
and -1 or -4 for mismatches). It is different for proteins. Substitution
matrices for amino acids are more complicated than those for nucleotides and
take into account everything that might affect the frequency with which any
amino acid is substituted for another. The objective is to provide a
relatively heavy penalty for aligning two residues together if they have a
low probability of being homologous (correctly aligned by evolutionary
descent). As discussed in the previous sections, two forces drive the amino
acid substitution rates away from uniformity: some substitutions occur more
frequently than others, and some substitutions are less functionally
tolerated than others.

Fig. 3.12: BLOSUM matrix between two protein sequences.

The scores of a BLOSUM matrix are calculated using the following equation:

\[ S_{ij} = \frac{1}{\lambda} \log \left( \frac{p_{ij}}{q_i \, q_j} \right) \]

Here, S_{ij} is the score for aligning amino acids i and j, p_{ij} is the
probability of two amino acids i and j replacing each other in a homologous
sequence, and q_i and q_j are the background probabilities of finding the
amino acids i and j in any protein sequence. The factor λ is a scaling
factor, set such that the matrix contains easily computable integer values.
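The log-odds calculation can be tried out with a few numbers. In the sketch below, the probabilities are invented for illustration, and two choices are assumptions of this example: the logarithm is taken to base 2 and λ is set to 0.5, so that scores come out in half-bit units and are rounded to the nearest integer.

```python
import math


def log_odds_score(p_ij, q_i, q_j, lam=0.5):
    """Rounded log-odds score: (1 / lam) * log2(p_ij / (q_i * q_j))."""
    return round(math.log2(p_ij / (q_i * q_j)) / lam)


# Hypothetical probabilities, chosen only to show the three possible cases.
more_than_chance = log_odds_score(0.02, 0.05, 0.05)      # pair seen 8x more often
as_chance = log_odds_score(0.0025, 0.05, 0.05)           # pair seen as often as chance
less_than_chance = log_odds_score(0.000625, 0.05, 0.05)  # pair seen 4x less often
print(more_than_chance, as_chance, less_than_chance)
```

A positive score means the pair occurs in homologous sequences more often than expected by chance, zero means as often as chance, and a negative score means less often.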

Several BLOSUM matrices are available; three commonly used ones are listed
below. The matrix is chosen according to how closely related the proteins to
be aligned are.

BLOSUM80: more related proteins

BLOSUM62: midrange

BLOSUM45: distantly related proteins

Among these three, BLOSUM62 is the most widely used.

There are various online tools and software available for sequence alignment
with BLOSUM matrices as a weight matrix.

Clustal W is a well-known online sequence alignment tool. You can browse the
following link, https://www.genome.jp/tools-bin/clustalw, to see the BLOSUM
matrix option, as shown in Fig. 3.13. While performing exercise 10, you will
learn more about multiple sequence alignment using Clustal W.

Fig. 3.13: The Clustal W online tool has a parameter section with the BLOSUM
matrix for pairwise and multiple sequence alignments.
(Source: https://www.genome.jp/tools-bin/clustalw)

To know further details on this topic, follow the given YouTube link:
https://www.youtube.com/watch?v=ZUjAKgVrir4

3.5 SEQUENCE ALIGNMENT TOOLS AND SOFTWARES
So far we have studied various concepts and theories behind sequence
alignment; now let us explore the tools available to perform sequence
alignment.

As mentioned in the previous section, Clustal W is a good software for
pairwise and multiple sequence alignment. Apart from it, a large number of
academic and commercial sequence alignment programs are available. Most of
this software works on Linux (for example, Ubuntu) and Solaris operating
systems, and only a limited number of alignment programs work on Windows XP,
NT, and Server. The output of sequence analysis is usually a graphical
representation of the data, together with matrices and values.

The sequence alignment tools and software reduce the time and enhance the
effectiveness of analysis. The alignment analysis provides the information
needed to decide how to proceed in understanding the function of a
protein/gene or its relation to others. In this section, we will get to know
more about online software based on alignment types. Watch the video at the
link provided to know more about this topic:
https://www.youtube.com/watch?v=uGhZygAMQik

3.5.1 BLAST and Types


The main function of the Basic Local Alignment Search Tool (BLAST) is to find
regions of local similarity between sequences. The program compares
nucleotide or protein sequences to sequence databases and calculates the
statistical significance of the matches. BLAST can be used to infer
functional and evolutionary relationships between sequences as well as to
help identify members of gene families. The BLAST program can be accessed
from the https://blast.ncbi.nlm.nih.gov/Blast.cgi site.

Four main types of BLAST programmes are available:

1. Nucleotide BLAST (BLASTn)

2. Protein BLAST (BLASTp)

3. BLASTx

4. tBLASTn

The main webpage of BLAST is shown in Fig. 3.14.

Let us discuss all the BLAST types in details.

1. BLASTn (Nucleotide BLAST): Compares one or more nucleotide query sequences
to a subject nucleotide sequence or a database of nucleotide sequences. This
is useful when trying to determine evolutionary relationships among different
organisms.

2. BLASTp (Protein BLAST): Compares one or more protein query (target)
sequences with existing protein sequences or a database of protein sequences.
This is useful when trying to identify a new or unknown protein.

Fig. 3.14: BLAST home webpage with Nucleotide, Protein, blastx and tblastn
links on it.
3. BLASTx (translated nucleotide sequence searched against protein
sequences): Compares a nucleotide query sequence that is translated in six
reading frames (resulting in six protein sequences) against a database of
protein sequences.
Because blastx translates the query sequence in all six reading frames and
provides combined significance statistics for hits to different frames, it is
particularly useful when the reading frame of the query sequence is unknown
or it contains errors that may lead to frame shifts or other coding errors. Thus,
BLASTx is often the first analysis performed with a newly determined
nucleotide sequence.
4. tBLASTn (protein sequence searched against translated nucleotide
sequences): Compares a protein query sequence against the six-frame
translations of a database of nucleotide sequences. tBLASTn is useful for
finding homologous protein-coding regions in unannotated nucleotide
sequences such as expressed sequence tags (ESTs) and draft genome records
(HTG), located in the BLAST databases.

Note: ESTs are short, single-read cDNA (complementary DNA) sequences. They
comprise the largest pool of sequence data for many organisms and contain
portions of transcripts from many uncharacterized genes. Since ESTs have no
annotated coding sequences, there are no corresponding protein translations
in the BLAST protein databases. Hence, a tblastn search is the only way to
search for these potential coding regions at the protein level. The HTG
sequences, draft sequences from various genome projects or large genomic
clones, are another large source of unannotated coding regions.

Apart from the above BLAST types, a few more BLAST programmes are available
for standalone systems as well as cloud-based platforms. Some of them are as
follows:
1. SmartBLAST: To find proteins highly similar to the query sequence.
2. Primer-BLAST: To design primers specific to a given PCR (polymerase chain
reaction) template.
3. Global Align: To compare two sequences across their entire length with the
Needleman-Wunsch algorithm.
4. CD-Search: To find the conserved domains in the given sequence.
5. IgBLAST: To analyse immunoglobulin and T-cell receptor sequences.
6. VecScreen: To search sequences for vector contamination. This tool is used
for molecular biology experiments.
7. CDART: To find sequences with similar conserved domain architecture. You
have to remember the difference between CD-Search and CDART in this case.
8. Multiple Alignment: To align sequences using domain and protein
constraints.
9. MOLE-BLAST: To establish taxonomy for uncultured or environmental
sequences (Fig. 3.15).

All the above tools are available at https://blast.ncbi.nlm.nih.gov/Blast.cgi
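The six-frame translation that BLASTx applies to a nucleotide query, and tBLASTn to the nucleotide database, can be sketched as follows. This is an illustration, not the BLAST implementation: it assumes the standard genetic code, uses '*' for stop codons and 'X' for codons containing unexpected letters, and the short DNA string is made up.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six protein strings can then be compared with protein sequences, which is why the reading frame of the query does not need to be known in advance.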

Fig. 3.15: Specialized searches with BLAST programmes at the NCBI-BLAST
webpage.

SAQ 6
i) What is the BLAST?

ii) What are BLASTn and BLASTp?

iii) How many types of BLAST tools are available?

Watch the YouTube video available at provided link to know more details
https://www.youtube.com/watch?v=jDHrHfx0cpw

3.5.2 Clustal W
Till now, you have studied the BLAST programs, which search specific
databases for sequences that match a query sequence. The simultaneous
alignment of multiple nucleotide or amino acid sequences is now an essential
tool in molecular biology. Multiple sequence alignments are used:
i) to find diagnostic patterns to characterise protein families;
ii) to detect or demonstrate homology between new sequences and existing
families of sequences;
iii) to help predict the secondary and tertiary structures of new sequences;
iv) to suggest oligonucleotide primers for PCR (Polymerase Chain Reaction);
and
v) as an essential prelude to molecular evolutionary analysis.
There are many variations of the Clustal software; a few are listed below:

Clustal: The original software for multiple sequence alignments, created by
Des Higgins in 1988, was based on deriving phylogenetic trees from pairwise
alignments of amino acid or nucleotide sequences.

ClustalV: The second generation of the Clustal software was released in 1992
and was a rewrite of the original Clustal package. It introduced phylogenetic
tree reconstruction on the final alignment, the ability to create alignments
from existing alignments, and the option to create trees from alignments
using a method called Neighbor-joining.

ClustalW: The third generation, released in 1994, greatly improved upon
previous versions. It improved the progressive alignment algorithm in various
ways, for example by allowing individual sequences to be weighted down or up
according to similarity or divergence, respectively, in a partial alignment.
It also included the ability to run the program in batch mode from the
command line.

ClustalX: This version, released in 1997, was the first to have a graphical
user interface.

ClustalΩ (Omega): The current standard version.

Clustal2: The updated versions of both ClustalW and ClustalX, with higher
accuracy and efficiency.

3.6 SUMMARY
In this unit, we have studied the basics of sequence alignment along with the
programs or tools used to perform sequence alignment.

• Sequence alignment plays a major role in identifying ancestry or phylogeny.

• It helps to predict the function of a newly sequenced gene by aligning it
  with sequences in existing databases like NCBI, EMBL, DDBJ, etc.

• Multiple sequence alignment (MSA) of genes or proteins determines the
  evolutionary relationships or reconstruction of phylogeny. Scientists can
  predict new members of gene families.

• It helps to identify structurally or functionally similar regions within
  proteins by comparison with the existing databases.

• With respect to sequence length, there are two types of alignment:
  1. global alignment and 2. local alignment.

• Global alignment uses the Needleman-Wunsch algorithm; local alignment uses
  the Smith-Waterman algorithm.

• Sequence identity compares the aligned residues/nucleotides and is
  expressed as a percentage or as a score.

• Identity and homology are different: identity measures exact matches
  between sequences, whereas homology refers to shared ancestry.

• BLAST (Basic Local Alignment Search Tool) is a basic alignment tool, and it
  has several types depending on the query and the database searched.

• Dot plots, dynamic programming, and heuristic methods are used to identify
  similar patterns in sequences.

• PAM (Point Accepted Mutations) is one type of scoring matrix. Another
  well-known scoring matrix is BLOSUM (BLOcks Substitution Matrix).

• BLOSUM62 is widely used in multiple sequence alignment as well as in the
  development of phylogenetic trees.

• Clustal W is a multiple sequence alignment tool.

3.7 TERMINAL QUESTIONS


1. Write the importance of sequence alignment.

2. Differentiate between identity and homology.

3. Write a note on pairwise and multiple sequence alignment with suitable
   examples.

4. Explain the role of dot plots in sequence alignment.

5. What is an MSA? Explain its role in phylogenetic analysis.

6. Explain BLAST and its types.

7. Describe the tools and software used in multiple sequence alignment.

3.8 ANSWERS
Self Assessment Questions
1. i) The NCBI database contains genome sequences of plants, animals, fungi,
and bacteria, as well as protein sequences, gene sequences, etc.

ii) Chimp

iii) The amino acids that are constant throughout the alignment

2. i) The presence of the same nucleotide or amino acid at the same position
in two sequences is known as identity

ii) The sequences present in the database

3. i) Refer to Table 3.1

ii) Homology refers to shared ancestry

4. i) Multiple sequence alignment

5. i) To assign scores to matches and mismatches in a way that reflects the
evolutionary process

6. i) Basic Local Alignment Search Tool

ii) Nucleotide and protein BLAST types

iii) Four

Terminal Questions
1. i) Sequence alignment plays a major role in identifying ancestors.

ii) It helps to predict the function of a newly sequenced gene by alignment
with existing sequence databases such as NCBI, EMBL and DDBJ.

iii) Scientists can predict new members of gene families.

iv) Multiple sequence alignment (MSA) of genes or proteins helps to
determine evolutionary relationships and to reconstruct phylogeny.

v) It helps to identify structurally or functionally similar regions within
proteins by comparison with existing databases (refer to section 3.1).

2. The differences between similarity and homology are as follows:

S.No.  Similarity                             Homology

1.     Similarity refers to the likeness or   Homology refers to shared
       % identity between two sequences       ancestry

2.     Similarity means sharing a             Two sequences are homologous
       statistically significant number of    if they are derived from a
       bases or amino acids                   common ancestral sequence

3.     Similarity does not imply homology     Homology usually implies
                                              similarity

3. i) Definition of pairwise and multiple sequence alignment.

ii) Importance of both alignments.

iii) Explanation with two or more sequences as per requirement (refer to
section 3.3).

4. i) Importance of dot plots in sequence alignment (refer to section 3.3.1).

ii) Alignment patterns as per sequence alignment matching.

iii) Draw the dot plot by considering two sequences of your interest.

5. i) Definition of MSA (refer to section 3.3.2).

ii) Explanation of the development of phylogenetic trees.

iii) Distances of trees, orthologs and paralogs.

6. i) Definition of BLAST and its uses (refer to section 3.5.1).

ii) Classification of BLAST types: BLASTp, BLASTn, BLASTx, TBLASTn.

7. ClustalW, Clustal X, Clustal Omega, MEGA (refer to section 3.5.2).


3.9 FURTHER READINGS


1. Sequence Alignment: Methods, Models, Concepts, and Strategies by
Michael S. Rosenberg, University of California Press, 2009.

2. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins
by Andreas Baxevanis and Francis Ouellette, Wiley-Interscience, 2005.

3. Introduction to Bioinformatics by T. K. Attwood and D. J. Parry-Smith, 8th
reprint, Pearson Education, 2004.

4. Bioinformatics: Sequence and Genome Analysis by D. W. Mount, 2nd
edition, CBS Publication, 2005.

5. Fundamental Concepts of Bioinformatics by D. E. Krane and M. L.
Raymer, Pearson Publication, 2006.

6. Bioinformatics: Tools & Applications by D. Edwards, J. Stajich and D.
Hansen, Springer, 2009.

7. Bioinformatics: Databases, Tools & Algorithms by O. Bosu and S. K.
Thukral, Oxford University Press, 2007.

8. Bioinformatics: Methods and Applications - Genomics, Proteomics and
Drug Discovery by S. C. Rastogi, N. Mendiratta and P. Rastogi, PHI Learning
Pvt. Ltd., 2015.

9. Multiple Sequence Alignment: Methods and Protocols by Kazutaka Katoh,
Springer US, 2020.

10. Essential Bioinformatics by Jin Xiong, Cambridge University Press, 2006.

Video: https://www.youtube.com/watch?v=K6ldxHPzI5A

https://www.youtube.com/watch?v=A4JrzGon8mQ


Exercise 7
MOLECULAR FILE FORMATS - FASTA, GENBANK, GENPEPT, GCG, CLUSTAL,
SWISS-PROT, PIR

Structure
7.1 Introduction
      Expected Learning Outcomes
7.2 Procedure
7.3 Summary
7.4 Lab Exercises

7.1 INTRODUCTION
In this exercise, you will practise downloading sequences in different file
formats such as FASTA, GenBank, GenPept, GCG, CLUSTAL, SWISS-PROT and PIR,
which are maintained by different biological databases and used for sequence
analysis. You learned about biological databases in Unit 2 of this course. The
major objective of this exercise is to familiarize you with the file formats
that are regularly used in bioinformatics.

FASTA format

FASTA format is used extensively in bioinformatics. Sequences in FASTA
format can be downloaded from NCBI, EBI, UniProt and other databases. A
sequence in FASTA format begins with a single-line description, followed by
lines of sequence data. The description line is distinguished from the
sequence data by a greater-than (">") symbol at the beginning. An example
sequence in FASTA format is shown below.

Example of FASTA format

>gi|532319|pir|TVFV2E|TVFV2E envelope protein

ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLN
GSYSEN
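The layout described above (a ">" description line followed by sequence lines) can be read with a few lines of Python. This is a minimal sketch rather than a full parser; the function name parse_fasta is an assumption made for this example, and the input is the example record shown above.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no data
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # drop the ">" marker
        elif header is not None:
            chunks.append(line)            # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLN
GSYSEN
"""
records = parse_fasta(example)
description, sequence = records[0]
print(description)
print(sequence)
```

Because the sequence lines are simply joined, the line width used in the file does not affect the sequence that is recovered.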

GENBANK and GENPEPT format

NCBI maintains the GenBank and GenPept formats, which store information on
DNA and protein sequences respectively. A GenBank or GenPept record gives the
basic information about a sequence, such as the source organism, the authors
who sequenced it, coding information and other annotation. The GenBank or
GenPept sequence format (GenBank flat file format) consists of three parts:
the header, the features, and the sequence (nucleotide in GenBank, protein in
GenPept). The start of the annotation section (header and features) is marked
by a line beginning with the word "LOCUS". The start of the sequence section
is marked by a line beginning with the word "ORIGIN", and the end of the
record is marked by a line containing only "//". The header section holds the
initial, basic information; the feature section holds the source, CDS, gene
and RNA features; and the actual sequence starts after the ORIGIN line.
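The three markers described above ("LOCUS", "ORIGIN" and "//") are enough to separate the annotation from the sequence. The sketch below assumes a single record per text block; the function name split_genbank and the tiny record are made up for illustration and are not taken from GenBank.

```python
def split_genbank(text):
    """Split one GenBank/GenPept flat-file record into annotation lines and sequence."""
    annotation, sequence = [], []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of the record
        if line.startswith("ORIGIN"):
            in_sequence = True
            continue
        if in_sequence:
            # Sequence lines hold a position number and space-separated blocks.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        else:
            annotation.append(line)
    return annotation, "".join(sequence)


# Hypothetical, heavily shortened record used only for illustration.
record = """LOCUS       EXAMPLE01                 24 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Made-up example record.
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atggcgtacg ttagcctagg ctaa
//
"""
annotation, sequence = split_genbank(record)
print(annotation[0].split()[1])
print(sequence)
```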

Example of GenBank flat file format

Example of GenPept file format



GCG (Genetics Computer Group) format

GCG assists molecular biologists by developing practical tools that implement
the most important bioinformatics techniques. GCG has been based in the
Department of Genetics at the University of Wisconsin-Madison since 1982.
Files in GCG format are commonly obtained from the commercial GCG package. A
sequence file in GCG format contains one sequence. It begins with annotation
lines, and the start of the sequence is marked by a line ending with two dot
("..") characters. This line also contains the sequence identifier, the
sequence length, and a checksum. This format should only be used if the file
was created with the GCG package.

Example of GCG format

PIR format

The Protein Information Resource (PIR) is an integrated public resource of
protein informatics that supports genomic and proteomic research and
scientific discovery. PIR maintains the Protein Sequence Database (PSD), an
annotated protein database containing (as of Dec, 2021) over 283000
sequences covering the entire taxonomic range.

A sequence in PIR format consists of the following parts.

One header line made up of:

a. a ">" (greater-than) sign, followed by

b. a two-letter code describing the sequence type (P1, F1, DL, DC, RL,
RC, or XX), followed by

c. a semicolon, followed by

d. the sequence identification code (the database ID-code).

One line containing a textual description of the sequence.

One or more lines containing the sequence itself. The end of the sequence is
marked by a "*" (asterisk) character.

Optionally, this can be followed by one or more lines describing the sequence.

A file in PIR format may comprise more than one sequence. The PIR format is
also often referred to as the NBRF (National Biomedical Research
Foundation) format.

>P1;CRAB_CHICK

ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).

MDITIHNPLV RRPLFSWLTP SRIFDQIFGE HLQESELLPT SPSLSPFLMR

SPFFRMPSWL ETGLSEMRLE KDKFSVNLDV KHFSPEELKV KVLGDMIEIH

GKHEERQDEH GFIAREFSRK YRIPADVDPL TITSSLSLDG VLTVSAPRKQ

SDVPERSIPI TREEKPAIAG SQRK*

>P1;CRAB_HUMAN

ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN) (ROSENTHAL FIBER).

MDIAIHHPWI RRPFFPFHSP SRLFDQFFGE HLLESDLFPT STSLSPFYLR

PPSFLRAPSW FDTGLSEMRL EKDRFSVNLD VKHFSPEELK VKVLGDVIEV

HGKHEERQDE HGFISREFHR KYRIPADVDP LTITSSLSSD GVLTVNGPRK

QVSGPERTIP ITREEKPAVT AAPKK*
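The PIR layout described above (type code and ID on the first line, a description line, then sequence lines ending in "*") can be read as follows. This is a minimal sketch; the function name parse_pir and the dictionary keys are assumptions made for this example, and the input is the CRAB_CHICK entry shown above.

```python
def parse_pir(text):
    """Return a list of dicts with keys: type, id, description, sequence."""
    entries = []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    i = 0
    while i < len(lines):
        if not lines[i].startswith(">"):
            i += 1  # skip optional comment lines between entries
            continue
        seq_type, _, seq_id = lines[i][1:].partition(";")
        description = lines[i + 1] if i + 1 < len(lines) else ""
        i += 2
        residues = []
        while i < len(lines) and not lines[i].startswith(">"):
            chunk = lines[i].replace(" ", "")  # remove the 10-residue block spacing
            i += 1
            if chunk.endswith("*"):
                residues.append(chunk[:-1])    # "*" marks the end of the sequence
                break
            residues.append(chunk)
        entries.append({"type": seq_type, "id": seq_id,
                        "description": description,
                        "sequence": "".join(residues)})
    return entries


example = """>P1;CRAB_CHICK
ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDITIHNPLV RRPLFSWLTP SRIFDQIFGE HLQESELLPT SPSLSPFLMR
SPFFRMPSWL ETGLSEMRLE KDKFSVNLDV KHFSPEELKV KVLGDMIEIH
GKHEERQDEH GFIAREFSRK YRIPADVDPL TITSSLSLDG VLTVSAPRKQ
SDVPERSIPI TREEKPAIAG SQRK*
"""
entry = parse_pir(example)[0]
print(entry["type"], entry["id"])
print(entry["sequence"][:20])
```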

SWISS-PROT

SWISS-PROT is a curated, annotated protein sequence database that strives to
provide a high level of annotation. Searching UniProtKB with the keyword
“transcription factors” will display all the relevant entries deposited in
SWISS-PROT and TrEMBL. Furthermore, the final data can be downloaded in
different file formats, such as FASTA, GFF, and flat text. This set of
sequences can be used for the analysis of DNA-binding proteins. In a similar
way, sequences for any kind of protein can be easily obtained from SWISS-PROT.

CLUSTAL

CLUSTAL is an older multiple sequence alignment program, and CLUSTALW is an
improved version of it. The output of a CLUSTALW multiple sequence alignment
is written in CLUSTALW format. A sample CLUSTALW format is shown below.

Expected Learning Outcomes


After performing this exercise you shall be able to:

 describe the file formats used by various biological databases;

 download sequences in these file formats from databases to perform
bioinformatics experiments; and

 differentiate between various file formats.

7.2 PROCEDURE
Step 1: Open the GenBank website at the following URL:
https://www.ncbi.nlm.nih.gov/genbank/ (Fig. 7.1).

Fig. 7.1: Screenshot showing GenBank page on NCBI.



Step 2: Type the sequence name, sequence ID or any relevant keyword in the
text box (Fig. 7.2).

Fig. 7.2: Screenshot showing search option on GenBank.

For GenPept format, select "Protein" in the dropdown box on the NCBI
homepage, type a keyword and click the search button (Fig. 7.3).

Fig. 7.3: Screenshot showing dropdown box on NCBI page.

Step 3: On pressing the search button, the result page (summary page) is
displayed (Fig. 7.4).

Fig. 7.4: Screenshot showing search results on NCBI page.



Step 4: Select the required sequence by double-clicking its accession number
or ticking the checkbox next to it. Then go to the display button and select
GenBank format for a nucleotide sequence, GenPept format for a protein
sequence, or FASTA format to retrieve the sequence (Fig. 7.5 A and B).

Fig. 7.5: A) GenBank format B) GenPept format.

Step 5: Copy and save the required protein or nucleotide sequence for further
analysis (Fig. 7.6).

Fig. 7.6: Screenshot showing GenBank sequence.
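The same records can also be requested programmatically. The sketch below is not part of the procedure above: it assumes NCBI's E-utilities efetch service and only builds the request URL, without contacting the server. The function name efetch_url is hypothetical; the parameter values shown (db "nuccore" with rettype "gb" or "fasta", and db "protein" with rettype "gp" for GenPept) are the ones E-utilities documents for these formats.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, database, rettype):
    """Build an E-utilities URL that returns one record as plain text."""
    params = {"db": database, "id": accession,
              "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


# Nucleotide record in GenBank flat-file format and in FASTA format.
print(efetch_url("NM_001297740", "nuccore", "gb"))
print(efetch_url("NM_001297740", "nuccore", "fasta"))
# For a protein accession, database "protein" with rettype "gp" requests GenPept.
```

Opening one of the printed URLs in a browser should return the record as text; NCBI applies usage limits to this service, so it is best used sparingly.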



7.3 SUMMARY
 You learned about biological databases in Unit 2; the data they hold can
be retrieved and viewed in different formats.

 In this exercise you have learned to download sequences in different file
formats such as FASTA, GenBank and GenPept.

 Other formats such as GCG, CLUSTAL, SWISS-PROT and PIR are
maintained by different biological databases and are also used for sequence
analysis.

7.4 LAB EXERCISES


1. Download the sequence NM_001297740 in GenBank and FASTA formats.

2. Search NCBI for a COVID-related protein sequence and download any one
sequence in GenPept format.


Exercise 8
MOLECULAR VIEWER BY VISUALIZATION SOFTWARE: PYMOL

Structure
8.1 Introduction
      Expected Learning Outcomes
8.2 Procedure
8.3 Summary
8.4 Lab Exercises

8.1 INTRODUCTION
In the previous exercises you learned how to access databases of protein and
nucleic acid structures. However, to analyse these structures we need to view
them using tools known as visualisation tools or software.

In this exercise, you will learn to use the PyMOL program to visualize the
3-D structures of molecules. PyMOL is a powerful tool for viewing and
analyzing the structures of proteins, DNA, and other macromolecules. It is a
stand-alone molecular visualization program built on Python. PyMOL is used to
generate high-quality molecular graphics images and animations for journal
publications describing new macromolecular structures and interactions. PyMOL
was developed by Warren DeLano. It is open source, but not free in all forms:
students and educators can use a current free version in the classroom,
anybody can obtain outdated binary releases, and certain Linux distributions
provide PyMOL packages built from the open-source code.

Expected Learning Outcomes


After performing this exercise you shall be able to:

 download and view the 3-D structure of biological macromolecules using
PyMOL;

 prepare and save the 3-D molecule for publishing in a journal or
dissertation; and

 explain the applications of PyMOL.



8.2 PROCEDURE
Step 1: Download PyMOL from the website (http://www.pymol.org/educational)
and register as a student using the link at the bottom of the page. You will
need to fill out the form, and the automated system will eventually send you
a link with a username and password. This allows you to download the software
for your personal computer or Mac; follow the instructions to install it.

Step 2: Download the 3-D structure of protein 6YI3 in PDB format from the PDB
database, as you learned in Exercise 6.

Step 3: Double-click the PyMOL icon on your desktop. PyMOL brings up two
windows.

i) The top window constitutes the “External GUI (Graphical User Interface),”
and contains the menu options as well as buttons for advanced
visualization (Fig. 8.1).

Fig. 8.1: Screenshot showing external GUI of PyMOL.

ii) The bottom window contains the “Visualization Area,” which is the main
area where molecules will be displayed. The bottom window also contains
another “Internal GUI.” This GUI will contain a list of molecular objects
once you have loaded a protein structure. The bottom of this GUI has a
matrix displaying the current mouse configuration, namely what mouse
button combinations control which functions (Fig. 8.2).

Fig. 8.2: Screenshot showing PyMOL internal GUI.

Step 4: To open the PDB file, select “File→Open” in the external GUI window,
and select the PDB file 6YI3.pdb that you have already downloaded. The PDB
file will load and the protein will be displayed.

Step 5: To change the representation of the molecule, use the object control
panel shown on the right side of the Viewer.

The first name is always “all.” Clicking on a name will un-display the
corresponding molecule(s) (make them temporarily invisible).

The ASHLC menu is an abbreviation for Action, Show, Hide, Label and Color.

To make a cartoon representation of the molecule, click S and select
Cartoon. The molecule is now shown as both a cartoon and a wireframe.
Remove the wireframe by clicking H and selecting lines.

Step 6: To change the background color to white, follow this menu cascade
(Fig. 8.3):

Display > Background > White

Fig. 8.3: Screenshot showing how to change the background color of the
molecule.

Step 7: To save the image in the present view, follow the options File > Save
Image…

Replace the default name “PyMOL” with the file name you want; the image will
be saved as a PNG image (Fig. 8.4).

Fig. 8.4: Screenshot showing how to save and rename the image.

Step 8: You can also use the command line for operations such as Save,
Viewport, Zoom, Ray and Select. To learn how to execute these commands,
follow the additional study material links provided at the end of this
exercise, and practise the other options in detail.
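Steps 4 to 7 can also be expressed as PyMOL command-line text. The sketch below only assembles that text in Python so that it can be inspected or saved to a script file; it does not start PyMOL. The function name build_pymol_script and the output file name are assumptions made for this example, while load, hide, show, bg_color, zoom, ray and png are standard PyMOL commands.

```python
def build_pymol_script(pdb_file, image_file):
    """Return PyMOL command-line text mirroring Steps 4 to 7."""
    commands = [
        f"load {pdb_file}",    # Step 4: open the PDB file
        "hide lines",          # Step 5: remove the wireframe
        "show cartoon",        # Step 5: cartoon representation
        "bg_color white",      # Step 6: white background
        "zoom",                # fit the molecule in the viewer
        "ray",                 # render a ray-traced image
        f"png {image_file}",   # Step 7: save the image as PNG
    ]
    return "\n".join(commands) + "\n"


print(build_pymol_script("6YI3.pdb", "6YI3.png"))
```

Each printed line can be typed at the PyMOL prompt one at a time, which is a convenient way to check what every command does before combining them.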

8.3 SUMMARY
 PyMOL is a powerful tool to visualize and analyze the structures of
proteins, DNA, and other biological molecules.
 You have learned how to view the 3-D structure of proteins in different
poses.
 The structures can be used to generate high-quality molecular graphics
images and animations for journal publications.

8.4 LAB EXERCISES


1. Download 7C8K, use commands to analyse the structure and create images.

2. Open the 7C8K structure in PyMOL and create an animation for PowerPoint.

Additional study material links

https://bioquest.org/nimbios2010/wp-content/blogs.dir/files/2010/07/pymol_tutorial3.pdf

https://sites.pitt.edu/~epolinko/IntroPyMOL.pdf


Exercise 9
BLAST SUITE OF TOOLS FOR PAIRWISE ALIGNMENT

Structure
9.1 Introduction
      Expected Learning Outcomes
9.2 Procedure
9.3 Summary
9.4 Lab Exercises

9.1 INTRODUCTION
In Unit 3 of this course you learnt about sequence similarity searching using
the Basic Local Alignment Search Tool (BLAST), and in Exercises 5 and 7 you
practised sequence retrieval. Those sequences will be used in this exercise
to perform database similarity searches using the BLAST tool. BLAST is an
algorithm for comparing primary biological sequence information, such as the
amino-acid sequences of different proteins or the nucleotides of DNA
sequences. A BLAST search enables a researcher to compare a query sequence
with a library or database of sequences, and to identify library sequences
that are similar to the query (that is, the unknown) sequence. Many different
types of BLAST are available from the BLAST web page. Which one to select
depends on the type of sequence you are searching with and on the database to
be searched. The different types of BLAST are given below:

a) Nucleotide blast - Search a nucleotide database using a nucleotide query
Algorithms: blastn, megablast, discontiguous megablast

b) Protein blast - Search a protein database using a protein query
Algorithms: blastp, psi-blast, phi-blast, delta-blast

c) blastx - Search a protein database using a translated nucleotide query

d) tblastn - Search a translated nucleotide database using a protein query

e) tblastx - Search a translated nucleotide database using a translated
nucleotide query
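The choice among the programs listed above depends only on the molecule type of the query and of the database, and on whether nucleotide sequences are translated. A minimal sketch of that decision is given below; the function name choose_blast_program and its arguments are hypothetical and are not part of the BLAST web page.

```python
def choose_blast_program(query_type, database_type, translated=False):
    """Map query/database molecule types to a BLAST program name.

    query_type and database_type are "nucleotide" or "protein".
    translated=True means nucleotide query and database are both translated.
    """
    pair = (query_type, database_type)
    if pair == ("nucleotide", "nucleotide"):
        return "tblastx" if translated else "blastn"
    if pair == ("protein", "protein"):
        return "blastp"
    if pair == ("nucleotide", "protein"):
        return "blastx"    # the nucleotide query is translated
    if pair == ("protein", "nucleotide"):
        return "tblastn"   # the nucleotide database is translated
    raise ValueError(f"unsupported combination: {pair}")


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("protein", "nucleotide"))
```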

Expected Learning Outcomes


After performing this exercise you shall be able to:

 perform database similarity searches with the BLAST tool and interpret
the results;

 analyse the similarity of an unknown protein or DNA sequence obtained
after sequencing; and

 describe the importance of the BLAST tool in bioinformatics.

9.2 PROCEDURE
Step 1: Open the basic BLAST search page at the following URL:
http://blast.ncbi.nlm.nih.gov/Blast.cgi

From the "Program" menu, select the appropriate program (Nucleotide BLAST,
Protein BLAST, blastx, tblastn) (Fig. 9.1).

Fig. 9.1: Screenshot showing BLAST page.

Step 2: Open the FASTA-format sequence that you retrieved in Exercise 7 in a
text editor as plain text (Fig. 9.2).

Fig. 9.2: Screenshot showing FASTA sequence.



Step 3: Enter the gi number or accession number, or copy the entire sequence
and paste it in FASTA format into the search box provided (Fig. 9.3).

Fig. 9.3: Screenshot showing BLAST suite.

Step 5: Make sure you have selected the correct BLAST program, and select the
nr (non-redundant) database (Fig. 9.4).

Fig. 9.4: Screenshot showing programme selection on BLAST.

Step 7: Note down the default parameter set and click the "BLAST" button
(Fig. 9.5).

Fig. 9.5: Screenshot showing algorithm parameters on BLAST.

Step 8: BLAST will tell you that it is working on your search.

Step 9: Once your results are computed, they will be presented in the window
(Fig. 9.6).

Fig. 9.6: Screenshot showing “results” after BLAST.

Step 10: Copy and save the results, then discuss or interpret them.

9.3 SUMMARY
 In this exercise you have learnt how to use the BLAST tool with different
programs, such as blastn, blastp, blastx and tblastn, to analyse unknown
nucleotide and protein sequences obtained after sequencing a sample.

 BLAST is widely used in bioinformatics for applications such as finding
sequences similar to a query and using their similarity scores to judge
whether they are closely or distantly related.

 The E-value lets you judge whether a match to the query sequence is
biologically significant; the results also give the source, coding sequence
and other information.
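The link between a similarity score and the E-value mentioned above can be illustrated with the standard relation between a bit score S' and the expected number of chance hits, E = m × n × 2^(-S'), where m is the query length and n is the total database length. The sketch below uses that relation directly and is only an approximation of what BLAST reports, because BLAST applies effective lengths; the function name and the numbers are assumptions made for this example.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical numbers: a 1024-residue query against a 1024-residue database.
low = expect_value(10, 1024, 1024)
high = expect_value(40, 1024, 1024)
print(low)
print(high)
```

A higher bit score gives a smaller E-value, and a larger database gives a larger E-value for the same score, which is why the E-value rather than the raw score is used to judge significance.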

9.4 LAB EXERCISES


1. Conduct a blastn search for the given query NM_001297740 and discuss the
result.

2. Perform a protein BLAST (blastp) search for the query "Chain E, Spike
protein S1"; copy the result and interpret it.

3. Run a blastx search for the given query FJ436056; tabulate and discuss the
results.

4. Execute a tblastn search for the given query PWZ18702 and interpret
the results.


Exercise 10
MULTIPLE SEQUENCE ALIGNMENT USING CLUSTALW

Structure
10.1 Introduction
      Expected Learning Outcomes
10.2 Procedure
10.3 Understanding Output
10.4 Summary
10.5 Lab Exercises

10.1 INTRODUCTION
In Unit 3 of this course you learnt about sequence alignment using CLUSTALW;
in this exercise you will practise performing it. Multiple Sequence Alignment
(MSA) is the alignment of three or more biological sequences of similar
length. From the output of MSA applications, homology can be inferred and the
evolutionary relationships between the sequences can be studied. ClustalW is
a free online tool, available through the European Bioinformatics Institute
(EBI), that is used to align multiple sequences and generate phylogenetic
trees. The improved version of CLUSTAL is Clustal Omega. When you input the
sequences to be aligned, Clustal Omega generates a sequence alignment and a
rooted phylogram or cladogram.

Expected Learning Outcomes


After performing this exercise you shall be able to:

 perform alignment of more than two sequences and find out the
similarity between those sequences;

 use ClustalW to generate a multiple sequence alignment and construct a
phylogenetic tree; and

 explain the functional relationship between these aligned sequences.

10.2 PROCEDURE
Step 1: Retrieve three or more of the required sequences (nucleic acid or
protein) from the desired sequence databases. Some example sequences are
shown below:

>AAZ67055.1 M protein [Bat SARS CoV Rp3/2004]

MAENGTISVEELKRLLEQWNLVIGFIFLAWIMLLQFAYSNRNRFLYIIKLVFLWL
LWPVTLACFVLAAVYRINWVTGGIAIAMACIVGLMWLSYFVASFRLFARTRSM
WSFNPETNILLNVPLRGTILTRPLMESELVIGAVIIRGHLRMAGHSLGRCDIKD
LPKEITVATSRTLSYYKLGASQRVGTDSGFAAYNRYRIGNYKLNTDHSGSND
NIALLVQ

>QHU79197.1 M protein [Human Severe acute respiratory syndrome coronavirus 2]

MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFLYIIKLIFLW
LLWPVTLACFVLAAVYRINWITGGIAIAMACLVGLMWLSYFIASFRLFARTRS
MWSFNPETNILLNVPLHGTILTRPLLESELVIGAVILRGHLRIAGHHLGRCDIKD
LPKEITVATSRTLSYYKLGASQRVAGDSGFAAYSRYRIGNYKLNTDHSSSSD
NIALLVQ

>AVN89369.1 M protein [C Middle East respiratory syndrome-related coronavirus]

MSNMTQLTEAQIIAIIKDWNFAWSLIFLLITIVLQYGYPSRSMTVYVFKMFVLW
LLWPSSMALSIFSAVYPIDLASQIISGIVAAVSAMMWISYFVQSIRLFMRTGSW
WSFNPETNCLLNVPFGGTTVVRPLVEDSTSVTAVVTNGHLKMAGMHFGAC
DYDRLPNEVTVAKPNVLIALKMVKRQSYGTNSGVAIYHRYKAGNYRSPPITA
DIELALLRA

Step 2: The software tools required for multiple sequence alignment are
available at the following URL: https://www.ebi.ac.uk/Tools/msa/clustalo/
(Fig. 10.1).

Fig. 10.1: Showing the CLUSTAL Omega homepage.

Step 3: Enter your input sequences: paste a set of nucleic acid or protein
sequences in a supported format, or upload a file (Fig. 10.2).

Fig. 10.2: Showing multiple sequence alignment step on CLUSTAL Omega.

Step 4: Set your output format and keep the default multiple sequence
alignment options (Fig. 10.3).

Fig. 10.3: Showing output parameters window.

Step 5: Submit your job. Running a tool is usually an interactive process:
the results are delivered directly to the browser when they become available.
You can also be notified by email when the job is finished by ticking
the box "Be notified by email" (Fig. 10.4).

Fig. 10.4: Showing the submit option on CLUSTAL Omega.


10.3 UNDERSTANDING OUTPUT


The ClustalW output gives you two main forms of results: the multiple
sequence alignment and a phylogram/cladogram.

The score table is the first section of the page, below the results summary
box. It shows the scores of the pairwise alignments of all the sequences
(Fig. 10.5).

Fig. 10.5: Showing scores table.

Take a screenshot of this table, or download it by right-clicking the Output
File (.output) found in the result summary box at the top of the page
(Fig. 10.6).

Fig. 10.6: Showing how to save the output file.

Clustal Omega aligns all of the input sequences; an HTML text version of the
alignment is listed just below the Scores Table. A more extensive view of the
alignment can be seen using Jalview. Under alignment, you can click “Show
Colors” to view a coloured version of an amino acid alignment (Fig. 10.7).


Fig. 10.7: Showing “normal” and “coloured” views of the alignment results.

In the row below the last sequence of the alignment, there may be symbols
such as:

" * " – the residues or nucleotides in that column are identical in
all sequences

" : " – conserved substitutions have been observed, according to the colour
data

" . " – semi-conserved substitutions are observed
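The " * " symbol can be reproduced from the aligned rows alone, since it only requires every sequence to carry the same residue in a column. The sketch below marks only those fully identical columns; the " : " and " . " symbols depend on residue-group tables that are not reproduced here. The function name identity_line and the three short aligned rows are hypothetical and are used only for illustration.

```python
def identity_line(aligned):
    """Return a string with '*' under columns where all rows share one residue."""
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # A column is marked only if it holds a single residue and no gap.
        marks.append("*" if len(residues) == 1 and "-" not in residues else " ")
    return "".join(marks)


# Hypothetical aligned rows, not taken from the example sequences.
rows = ["MKT-LV", "MKTALV", "MRTALV"]
for row in rows:
    print(row)
print(identity_line(rows))
```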

The generated phylogenetic tree is at the very bottom of the results page.
Above it you will notice a “Guide Tree” section; you can save the Guide Tree.
The tree can be viewed as a phylogram or a cladogram.

A phylogram explicitly represents the number of sequence character changes
through the horizontal branch lengths. The sum of the horizontal distances
between two leaves is the predicted evolutionary difference between the
sequences. A cladogram depicts only the branching pattern; its branch lengths
do not represent evolutionary time (Fig. 10.8 A and B).

Fig. 10.8: A) Phylogram B) Cladogram.
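The statement that the distance between two leaves of a phylogram is the sum of the branch lengths along the path joining them can be checked on a small tree. The sketch below stores each node's parent and the length of the branch leading to it; the tree, its node names and its branch lengths are hypothetical and do not come from the Clustal Omega output.

```python
def leaf_distance(parent, a, b):
    """Sum the branch lengths on the path between nodes a and b.

    parent maps each node to (its parent, length of the branch to that parent);
    the root has no entry.
    """
    def distances_to_ancestors(node):
        dist, total = {node: 0.0}, 0.0
        while node in parent:
            node, length = parent[node]
            total += length
            dist[node] = total
        return dist

    from_a = distances_to_ancestors(a)
    node, total = b, 0.0
    while node not in from_a:   # climb from b until a shared ancestor is found
        node, length = parent[node]
        total += length
    return from_a[node] + total


# Hypothetical tree: root -> X (0.125), root -> C (0.75), X -> A (0.25), X -> B (0.5)
tree = {"A": ("X", 0.25), "B": ("X", 0.5), "X": ("root", 0.125), "C": ("root", 0.75)}
print(leaf_distance(tree, "A", "B"))
print(leaf_distance(tree, "A", "C"))
```

In a cladogram the same function would not be meaningful, because only the branching order is drawn and the branch lengths carry no information.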

10.4 SUMMARY
 Multiple sequence alignment is the alignment of three or more biological
sequences of similar length.

 From the output of MSA applications, homology can be inferred and the
evolutionary relationships between the sequences can be studied.

 In this exercise you have learnt to use the multiple sequence
alignment tool Clustal Omega to analyse evolutionary relationships
among sequences and to interpret relationships among the sequences or
organisms through a phylogenetic tree.

10.5 LAB EXERCISES


1. Retrieve any three or more protein sequences from a protein database,
copy the sequences in FASTA file format, align the sequences with each
other and report the pairwise scores using Clustal Omega.
