The Bioinformatics Lab Manual serves as an introductory guide to bioinformatics, detailing its significance in managing biological data and various databases such as NCBI, EMBL, and DDBJ. It covers essential topics including nucleotide and protein databases, data retrieval methods, and the importance of bioinformatics in genomic research and disease understanding. Authored by Tulsi Chaturvedi under Dr. Anuja Mishra's mentorship, the manual aims to provide foundational knowledge for graduate and postgraduate students in the field.
Bioinformatics Lab Manual

PREFACE

Bioinformatics, a multidisciplinary field at the intersection of biology, computer science, and
mathematics, has become an essential pillar in modern scientific research. Its rise parallels the exponential
growth in biological data, driven by advances in high-throughput sequencing, genomics, proteomics, and
other molecular technologies.

This Bioinformatics laboratory manual is intended as a concise introduction to Bioinformatics, both as a
subject and as an emerging field. It also introduces the organizations, databases, and tools of
bioinformatics, both established and emerging.
This manual is authored by Tulsi Chaturvedi under the mentorship of Dr. Anuja Mishra, whose
guidance was instrumental in its writing. It was prepared with graduate and postgraduate students in
mind, and it aims to give beginners and intermediate learners a solid foundation in, and essential
information about, the field of Bioinformatics.

Tulsi Chaturvedi
M.Sc Bioinformatics
Department of Biotechnology
GLA University, Mathura

INTRODUCTION TO BIOINFORMATICS
Bioinformatics is an interdisciplinary field that combines biology, computer science, and mathematics to
analyze and interpret biological data. At its core, bioinformatics aims to manage and analyze the vast
amounts of data generated by modern biological research, especially in genomics, proteomics, and other
"omics" sciences.

About NCBI (National Center for Biotechnology Information)


● It is a key resource for bioinformatics, providing tools, databases, and services to
support biological research.
● Founded in 1988 as part of the U.S. National Library of Medicine (NLM) under the National
Institutes of Health (NIH).
● NLM offers extensive resources for medical research and education. It also provides access to a
vast collection of biomedical literature and information.
● NIH is the U.S. medical research agency; it conducts and funds medical research to improve
public health.

Fig.1 Homepage of NCBI https://www.ncbi.nlm.nih.gov/.

Importance of Bioinformatics:

Bioinformatics plays a critical role in modern biology and medicine due to its ability to manage and
analyze the vast and complex data generated by biological research. Its importance spans several key
areas:

1. Managing Big Data in Biology

● Modern techniques like next-generation sequencing (NGS) and high-throughput proteomics
generate massive amounts of data. Bioinformatics provides the tools and methods to organize,
store, and analyze this data efficiently.

● Without bioinformatics, it would be impossible to handle the millions of DNA or protein
sequences being produced by research labs globally, leading to inefficient and incomplete
biological research.

2. Accelerating Genomic Research

● Bioinformatics has revolutionized genomics by allowing rapid assembly, annotation, and
comparison of genomes. The Human Genome Project, for example, was made possible because
of bioinformatics.

● Understanding genomes helps scientists discover gene functions, map disease-related mutations,
and trace evolutionary history. It has applications in everything from cancer research to
biodiversity studies.

3. Understanding Complex Diseases

● Many diseases, like cancer, diabetes, and neurodegenerative disorders, are complex and involve
multiple genetic and environmental factors. Bioinformatics helps in analyzing large-scale datasets
to uncover these complex relationships.

● By integrating data from genomics, transcriptomics, and proteomics, bioinformatics provides
insights into disease mechanisms, helping researchers develop more targeted therapies.

4. Facilitating Evolutionary and Ecological Studies

● Bioinformatics is used to compare DNA and protein sequences across species, helping scientists
understand evolutionary relationships, gene function, and biodiversity.

● This information helps in studying species evolution, identifying conserved genes essential for
life, and developing conservation strategies for endangered species.

MAJOR TYPES OF DATABASES


1. Nucleotide Sequence Database
2. Protein Database (Sequence, Structure)

Nucleotide Databases – Overview:

Nucleotide databases are essential resources for the storage and retrieval of nucleotide sequences, which
form the basis for much of bioinformatics research. These databases, including GenBank, EMBL-EBI,
and DDBJ, enable the collection, curation, and sharing of nucleotide sequences across global research
communities.

● Central to the annotation and analysis of DNA sequences.
● Enables global collaboration in genomics and comparative studies.
● Facilitates studies in gene function, regulation, and molecular evolution.
● Supports the development of tools for sequence alignment and phylogenetics.
● Critical for the identification of disease-associated genetic variants.
● Ensures the accessibility and interoperability of genomic data.
● Plays a fundamental role in genomic discovery and translational medicine.

European Molecular Biology Laboratory (EMBL):

The European Molecular Biology Laboratory (EMBL) is a leading international research institution that
fosters cutting-edge research in molecular biology, bioinformatics, and computational biology.

● Home to the European Bioinformatics Institute (EBI), which is central to global bioinformatics
infrastructure.
● Supports interdisciplinary research and data-sharing efforts across Europe.
● Plays a pivotal role in developing computational tools and resources for bioinformatics.
● Drives projects related to genome analysis, protein structure prediction, and molecular biology.
● Promotes data sharing and collaboration across biological and computational sciences.
● Crucial in the development of Europe’s contribution to bioinformatics research.
● Facilitates multi-omics data integration and systems biology research.

DNA Data Bank of Japan (DDBJ):

The DNA Data Bank of Japan (DDBJ) is one of the world’s leading nucleotide sequence databases,
sharing the responsibility with GenBank (NCBI) and EMBL for collecting and distributing DNA
sequence data. Its focus is on ensuring accessibility and the exchange of genomic data globally.

● Key player in the International Nucleotide Sequence Database Collaboration (INSDC).
● Provides essential infrastructure for managing global nucleotide sequence data.
● Facilitates international collaborations in genomic research.
● Supports projects in functional genomics, molecular evolution, and phylogenetics.
● Ensures data consistency and availability in bioinformatics research.
● Contributes to the integration of genomic data in global databases.
● Promotes open access to genomic data for research and healthcare advancements.

EXPERIMENT: 1
GenBank:

INTRODUCTION:
● GenBank is the NIH genetic sequence database, an annotated collection of all publicly available
DNA sequences.
● It is part of the International Nucleotide Sequence Database Collaboration (INSDC), which comprises the
DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA) at EMBL-EBI, and GenBank at NCBI.
● These three organizations exchange data on a daily basis.
● A GenBank release occurs every two months and is available from the FTP site.

GLOSSARY TERMS:

a. Header Section-
● LOCUS: Includes the locus name, sequence length, molecule type, GenBank division,
and modification date.
● DEFINITION: A brief description of the sequence.
● ACCESSION: Unique identifier for the sequence.
● VERSION: The accession number followed by a version suffix (accession.version); the suffix
increases each time the sequence data changes.
● KEYWORDS: Optional keywords that describe the entry.
● SOURCE: The common name of the organism from which the sequence originates.
● ORGANISM: The formal scientific name of the organism, followed by its taxonomic
lineage.

b. Reference Section-
● Citations and reference numbers of the publications, or the PUBMED identifier, related to
the sequence.
● Includes TITLE and AUTHORS as well.

c. Feature Section-
● Contains annotations about the coding sequence, genes, regulatory elements, etc.
● Each feature includes qualifiers (e.g., gene names, product descriptions) and location
information (e.g., where the gene starts and ends on the sequence).

d. Sequence Section-
● The raw sequence is displayed at the bottom of the record, after the ORIGIN keyword.

e. End Section-
● //: This marks the end of the GenBank entry.
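
The sections described above can be read programmatically. The following is a minimal sketch, assuming a simplified flat-file record; the record `EX000001` is a made-up example (not a real GenBank entry), and the function name `parse_genbank` is hypothetical. It only pulls out a few header keywords and the sequence that follows ORIGIN.

```python
# Hypothetical, abbreviated record used only for illustration (not a real GenBank entry).
SAMPLE_RECORD = """\
LOCUS       EX000001                  24 bp    DNA     linear   SYN 01-JAN-2024
DEFINITION  Example synthetic construct, partial sequence.
ACCESSION   EX000001
VERSION     EX000001.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

HEADER_KEYWORDS = ("LOCUS", "DEFINITION", "ACCESSION", "VERSION", "SOURCE", "ORGANISM")


def parse_genbank(text):
    """Return selected header fields and the sequence from one flat-file record."""
    fields = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_sequence:
            # Sequence lines look like "1 atggccattg ..."; drop the position number.
            sequence_parts.extend(line.split()[1:])
            continue
        parts = line.strip().split(None, 1)
        keyword = parts[0] if parts else ""
        value = parts[1] if len(parts) > 1 else ""
        if keyword == "ORIGIN":
            in_sequence = True
        elif keyword in HEADER_KEYWORDS:
            fields[keyword] = value
    fields["SEQUENCE"] = "".join(sequence_parts)
    return fields


record = parse_genbank(SAMPLE_RECORD)
locus_tokens = record["LOCUS"].split()
locus_name, declared_length = locus_tokens[0], int(locus_tokens[1])
print(locus_name, declared_length, record["VERSION"], record["SEQUENCE"])
```

For real records, an established parser is a safer choice than this sketch, since actual entries contain multi-line fields and a feature table that are ignored here.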

GenBank file format

EXPERIMENT: 2

AIM: To retrieve a gene sequence from GenBank and save it in FASTA format.

INTRODUCTION:
● FASTA is a text-based format for representing nucleotide or protein sequences.
● Sequences are stored in a simple plain text file.
● A FASTA file begins with a single-line description, preceded by a ">" symbol (called a header).
● The sequence data follows the header and consists of the nucleotide or amino acid sequence.
● It is widely used in bioinformatics for sequence alignment, storage, and analysis.
● Sequence lines are conventionally wrapped at 60 to 80 characters, although many tools also accept the
whole sequence on a single line.
● Programs like BLAST and CLUSTAL accept FASTA-format input for sequence comparison.
● FASTA format can handle multiple sequences in a single file.
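
The format rules listed above can be sketched as a small parser. This is a minimal illustration, not a replacement for an established library; the function name `parse_fasta` and the two example records are made up for this sketch.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record file; the identifiers and sequences are made up.
EXAMPLE_FASTA = """>seq1 example nucleotide sequence
ATGGCCATTG
TAATGGGCCG
>seq2 example protein sequence
MKTAYIAKQR
"""

fasta_records = parse_fasta(EXAMPLE_FASTA)
for name, sequence in fasta_records:
    print(name, len(sequence))
```

Wrapped sequence lines are joined back into a single string, which is why the line length of the file does not affect the parsed result.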

Steps:
● Type NCBI in the web browser and click search, then click National Center for Biotechnology
Information, or go directly to the URL: https://www.ncbi.nlm.nih.gov/
● The NCBI homepage will appear.
● Click the All Databases drop-down menu, scroll through the list, and select Nucleotide.
● Type Collagen in the search box and click Search.
● A list of search results will be displayed.
● Click the accession number of the gene of interest.
● A new page will open showing the entry for the collagen gene in detail.
● Click FASTA; the sequence appears in FASTA format on a new page.
● Copy the FASTA sequence, paste it into Notepad, and save the file.
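
The same retrieval can also be scripted through NCBI's E-utilities `efetch` service instead of the web interface. The sketch below is an assumption-laden illustration: the helper names `build_efetch_url` and `fetch_fasta` are hypothetical, and the accession `NM_000088` is used only as a placeholder (substitute the accession you selected). Only the URL is built when the block runs; the download function needs network access and is not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession):
    """Build an efetch URL that requests a nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",      # the Nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_BASE + "?" + urlencode(params)


def fetch_fasta(accession, path):
    """Download the record and save it to `path` (requires network access)."""
    with urlopen(build_efetch_url(accession)) as response:
        text = response.read().decode("utf-8")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text


# Placeholder accession, used only to show the shape of the request.
example_url = build_efetch_url("NM_000088")
print(example_url)
```

Calling `fetch_fasta("NM_000088", "collagen.fasta")` would then write the FASTA text to a local file, which corresponds to the copy-and-paste step in the manual procedure.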

Interpret the Output:

● When the retrieved sequence is used as the query in a FASTA similarity search, the results are
typically broken down into several parts:
● Top Hits: A ranked list of sequences from the database that most closely match your query, along
with statistical significance values.
● Alignment View: Detailed alignments of the query with the top matching sequences, including
scoring information.
● Statistical Scores:
○ E-value: The number of expected hits of similar quality (or better) by chance. Lower values
indicate more significant matches.
○ Percent identity: Percentage of identical matches in the alignment.
○ Z-score: The number of standard deviations by which an alignment score exceeds the mean score
of unrelated sequences. Higher values indicate stronger similarity.
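
Percent identity is the simplest of these scores to compute by hand. The sketch below assumes two sequences that are already aligned (gaps written as "-") and divides the number of identical positions by the number of alignment columns; search tools may use a slightly different denominator, so treat this as one common convention. The function name `percent_identity` and the example strings are made up.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # a column of two gaps carries no information
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up pairwise alignment: 10 columns, 8 identical positions.
identity = percent_identity("ACGT-ACGTA", "ACGTTACCTA")
print(identity)
```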

EXPERIMENT: 3
GRAPHICS

The Graphic Summary feature in the NCBI Nucleotide Database is a highly useful tool that provides a visual
representation of the annotations and features of a nucleotide sequence. This graphical representation
allows researchers to quickly and efficiently analyze the structural organization of the sequence and gain
insights into its functional elements.

● The Graphic Summary provides an interactive view of nucleotide sequences and their associated
features, such as genes, coding sequences (CDS), exons, introns, and regulatory regions. By
displaying these features along the length of the nucleotide sequence, users can easily navigate
and understand the genomic context.
● Features in the Graphic Summary are color-coded to help users distinguish between different
types of annotations. For example, coding regions might be highlighted in one color, while
regulatory elements or non-coding regions are shown in another. This visual distinction aids in
quickly identifying specific genomic elements.
● The Graphic Summary tool is interactive, allowing users to zoom in on specific regions of interest
or zoom out to view the broader genomic landscape. This feature is particularly useful for
navigating large sequences, such as whole chromosomes or long contigs, and focusing on areas
with specific annotations like exons or protein-coding genes.
● By clicking on specific features within the Graphic Summary, users can access detailed
information about those regions. For instance, clicking on a gene within the graphic will provide
links to its corresponding gene records, mRNA sequences, and protein products, integrating
different levels of biological data.
● The Graphic Summary is integrated with other NCBI tools and databases, such as BLAST,
GenBank, and RefSeq, allowing users to cross-reference sequence alignments, genetic
variations, and evolutionary relationships. This integration facilitates comprehensive
bioinformatics analyses.
● The tool displays multiple layers of annotations simultaneously, including gene predictions,
expressed sequence tags (ESTs), single nucleotide polymorphisms (SNPs), and functional
elements like promoters or enhancers. This layered view provides a detailed understanding of the
functional elements embedded within the nucleotide sequence.
● In addition to nucleotide sequences, the Graphic Summary can display comparative genomic
information, including syntenic regions and evolutionarily conserved elements across different
species. This feature is particularly useful for studying evolutionary conservation and functional
genomics.
● The Graphic Summary is designed for ease of use, with a clean and intuitive interface.
Researchers can adjust settings to display different levels of detail, hide or show specific
annotations, and customize the display based on their specific needs or research focus.

● When interacting with a nucleotide sequence, the Graphic Summary also provides links to
associated protein structures (via the PDB), functional motifs (via CDD), and gene expression
data. This helps in understanding the biological significance of the annotated regions in the
context of molecular function and regulation.
● The Graphic Summary tool is widely used in comparative genomics to visualize syntenic blocks,
track evolutionarily conserved genes, and explore genomic variations among species. By providing
a visual context for sequence comparison, the tool facilitates the identification of functionally
important genomic regions conserved across species or strains.

EXPERIMENT: 4

Introduction to Protein Database:

A protein database is a digital repository that stores protein-related data, including sequences, structures,
annotations, and functional information. These databases are essential for researchers studying protein
biology and conducting computational analyses.

Some examples-

1. NCBI Protein: A comprehensive collection of protein sequences from various sources, including
translations of nucleotide sequences from GenBank.
2. UniProt (Universal Protein Resource): Provides high-quality protein sequences with detailed
functional annotations, including domain structures, active sites, and pathway information.
3. Protein Data Bank (PDB): Focuses on 3D structural data for proteins and nucleic acids.
4. Pfam: Specializes in protein families and domains.

Glossary terms table-

Term | Definition | Key Uses
Protein Database | Repositories of protein sequences and related data. | Functional annotation, evolutionary studies, and disease research.
GenPept | Annotated protein format from NCBI. | Detailed sequence data including domains, active sites, and biological functions.
Identical Proteins | Proteins with identical amino acid sequences across species or strains. | Conserved function analysis, evolutionary studies.
FASTA | A format for storing sequence data (text-based). | Input for sequence alignment, homology searches, and phylogenetic analysis.
Graphics | Visual representations of protein features, structures, or relationships. | Interpretation of protein sequences, interactions, and 3D structures.

EXPERIMENT: 5

GenPept

GenPept is a data format provided by NCBI for protein sequences and their annotations. It
contains translated protein data derived from nucleotide sequences in GenBank, with
added functional and structural insights. Key components of a GenPept entry:

● Protein Sequence: Displayed as amino acid chains in single-letter code.
● Annotations: Details on active sites, conserved domains, signal peptides, and
modifications.
● Gene and Organism Information: Links the protein to its corresponding gene and
the organism of origin.

EXPERIMENT: 6

Identical Proteins

The term identical proteins refers to proteins that share the exact same amino acid
sequence. These identical sequences might occur in:

● Different organisms: Representing conserved proteins with similar functions
across species.
● Different strains: Found in distinct strains or isolates of the same species.
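
Because identity here means an exact match of the amino acid sequence, identical proteins can be found by grouping records on the sequence string itself. The sketch below is illustrative only: the record names and sequences are made up, and the function name `group_identical` is hypothetical; it is not how any NCBI resource is implemented.

```python
from collections import defaultdict


def group_identical(protein_records):
    """Group record names that share exactly the same amino acid sequence."""
    groups = defaultdict(list)
    for name, sequence in protein_records:
        groups[sequence.upper()].append(name)
    # Keep only sequences that occur in more than one record.
    return [names for names in groups.values() if len(names) > 1]


# Made-up records: two strains share one sequence, a third differs by one residue.
example_proteins = [
    ("strainA_protein1", "MKTAYIAKQR"),
    ("strainB_protein1", "MKTAYIAKQR"),
    ("speciesC_protein1", "MKTAYLAKQR"),
]

identical_groups = group_identical(example_proteins)
print(identical_groups)
```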

EXPERIMENT: 7

FASTA

FASTA is a universal file format used for storing nucleotide or protein sequences. It is widely used for
bioinformatics tools and data sharing. A FASTA file consists of:

1. Header line: Starts with a > symbol, followed by a description or sequence identifier (e.g.,
>sp|P68871|HBB_HUMAN Hemoglobin subunit beta).
2. Sequence: Written in single-letter amino acid (protein) or nucleotide code (DNA/RNA).
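
The example header above follows the pipe-separated layout used for UniProt entries: a database tag, an accession, and an entry name, followed by a free-text description. A minimal sketch of splitting such a header, assuming exactly that layout (the function name `parse_uniprot_header` is hypothetical, and headers from other sources will not match this pattern):

```python
def parse_uniprot_header(header_line):
    """Split a header such as '>sp|P68871|HBB_HUMAN Hemoglobin subunit beta' into parts."""
    text = header_line.lstrip(">")
    identifier, _, description = text.partition(" ")
    database, accession, entry_name = identifier.split("|")
    return {
        "database": database,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


parsed_header = parse_uniprot_header(">sp|P68871|HBB_HUMAN Hemoglobin subunit beta")
print(parsed_header["accession"], parsed_header["entry_name"])
```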

FASTA is the standard input format for tools like:

● BLAST: To find similar sequences.
● Clustal Omega: For multiple sequence alignment.
● MAFFT: For multiple sequence alignment.

EXPERIMENT: 8

Graphics:

In bioinformatics, graphics refers to visual representations of sequence and structural data. Graphics are
critical for interpreting complex biological information. Common applications include:

1. Sequence Features: Tools like NCBI’s Graphics Viewer provide visual layouts of sequence and
protein features (e.g., exons, conserved domains, and mutations).
2. 3D Protein Structures: Visualization software like PyMOL, Chimera, and iCn3D allows
exploration of protein tertiary and quaternary structures.
3. Phylogenetic Trees: Graphical outputs that display evolutionary relationships.
4. Molecular Interactions: Tools like STRING visualize protein-protein interaction networks.

EXPERIMENT: 9

NCBI BLAST:

BLAST (Basic Local Alignment Search Tool) is one of the most widely used bioinformatics tools for
sequence analysis. It identifies regions of local similarity between sequences, which helps in
understanding functional, structural, or evolutionary relationships.
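
To make "local similarity" concrete, the sketch below scores the best local alignment between two short sequences. Note the swap: this is the Smith-Waterman dynamic-programming method, the exact approach that BLAST approximates with faster heuristics; it is not BLAST itself. The scoring values (match +2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + substitution
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # A local alignment may restart anywhere, so scores never drop below zero.
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


# Made-up sequences that share only the short region "ACG".
local_score = smith_waterman_score("TTTACGTTT", "GGGACGGGG")
print(local_score)
```

The flanking regions do not lower the score, which is the defining property of a local (as opposed to global) alignment.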

Here’s an overview of the four main types of BLAST:

Advanced Variants

1. MegaBLAST:
○ Optimized for highly similar sequences.
○ Used for large-scale searches like genome assembly comparison.
2. PSI-BLAST (Position-Specific Iterative BLAST):
○ Builds a position-specific scoring matrix (PSSM) for detecting distant
relationships.
3. PHI-BLAST (Pattern-Hit Initiated BLAST):
○ Combines pattern matching with BLAST for sequences with specific motifs.

EXPERIMENT: 10

PubChem Overview:

PubChem is a free, comprehensive database maintained by the National Center for
Biotechnology Information (NCBI), a part of the U.S. National Library of Medicine (NLM). It
provides information on the chemical structures, properties, biological activities, safety,
toxicity, and applications of chemical substances. It serves as a vital resource for
researchers, educators, and students in chemistry, biology, pharmacology, and related
fields.

Integration with NCBI databases enables cross-disciplinary research by linking chemical
information to genomic and proteomic data.

It also facilitates the identification of chemical-protein interactions and the study of pathways
involving specific compounds.

EXPERIMENT: 11

RCSB-PDB Overview:

RCSB PDB (Research Collaboratory for Structural Bioinformatics Protein Data Bank).

The RCSB PDB is a globally recognized, open-access repository for 3D structural data of biological
macromolecules, such as proteins, nucleic acids, and their complexes. It is an essential resource for
researchers in structural biology, biochemistry, bioinformatics, and related fields.

The RCSB PDB is part of the Worldwide Protein Data Bank (wwPDB) consortium, which also includes
the Protein Data Bank in Europe (PDBe), Protein Data Bank Japan (PDBj), and the Biological Magnetic
Resonance Data Bank (BMRB).

EXPERIMENT: 12

Aim: To perform a multiple sequence alignment (MSA) using ClustalW, a widely used
algorithm for aligning multiple biological sequences (such as protein or nucleotide sequences).

Multiple Sequence Alignment (MSA) is a bioinformatics technique used to align three or more
biological sequences (DNA, RNA, or protein sequences) in such a way that similar regions are aligned
across all sequences. The goal is to identify conserved sequences, structural motifs, functional regions,
and evolutionary relationships between the sequences. MSA is a critical step in many areas of
bioinformatics, molecular biology, and evolutionary biology.
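
Once an alignment has been produced, conserved positions can be read off column by column. The sketch below does not perform the alignment; it assumes the sequences are already aligned to equal length (gaps written as "-") and reports the columns where every sequence carries the same residue. The function name `conserved_columns` and the three short sequences are made up.

```python
def conserved_columns(aligned_sequences):
    """Return the 0-based indices of columns that are identical and gap-free."""
    if not aligned_sequences:
        return []
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for index in range(length):
        column = {seq[index] for seq in aligned_sequences}
        if len(column) == 1 and "-" not in column:
            conserved.append(index)
    return conserved


# Made-up alignment of three short protein fragments.
example_alignment = [
    "MKT-AYIAK",
    "MKTQAYLAK",
    "MKT-AYIAR",
]

conserved = conserved_columns(example_alignment)
print(conserved)
```

Alignment viewers mark such fully conserved columns in a similar way; here the result is simply a list of positions.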

Common Tools for Multiple Sequence Alignment:

1. ClustalW/Clustal Omega:
○ ClustalW is one of the oldest and most widely used MSA tools. It uses a
progressive alignment approach and can handle both nucleotide and protein
sequences.
○ Clustal Omega is the successor to ClustalW; it is faster and can align larger
datasets more efficiently.
2. MAFFT:
○ MAFFT (Multiple Sequence Alignment by Fast Fourier Transform) is another
popular tool that uses both progressive and iterative refinement methods. It is
known for being fast and capable of handling very large datasets.
3. MUSCLE:
○ MUSCLE (Multiple Sequence Comparison by Log-Expectation) is an iterative
method that is often preferred for its high accuracy, especially with protein
sequences.
4. T-Coffee:
○ T-Coffee (Tree-based Consistency Objective Function for Alignment Evaluation)
is known for its accuracy and ability to combine results from different alignment
methods.
5. Kalign:
○ Kalign is a fast progressive alignment tool that uses approximate string matching (the
Wu-Manber algorithm) to estimate distances between sequences, which makes it suitable
for large datasets.

Phylogenetic Tree: An Overview

A phylogenetic tree is a branching diagram that represents the evolutionary relationships among various
species, genes, or proteins. It is a critical tool in bioinformatics, evolutionary biology, and comparative
genomics to study how organisms or sequences have evolved from a common ancestor.

Key Components of a Phylogenetic Tree

1. Nodes:
○ Internal Nodes: Represent hypothetical ancestors of the organisms or sequences.
○ External Nodes (Leaves): Represent the species, genes, or sequences being compared.
2. Branches:
○ Connect nodes, representing evolutionary pathways.
○ Branch lengths may indicate the extent of evolutionary change or time.
3. Root:
○ Represents the most recent common ancestor of all entities in the tree.
○ A tree without a root is referred to as an unrooted tree.
4. Topology:
○ The branching pattern of the tree, which shows relationships but not necessarily
evolutionary distance.
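
The components listed above map naturally onto a nested data structure. The following sketch assumes a made-up rooted tree with three leaves and invented branch lengths; each node is written as (name, branch length to its parent, list of children). The function names are hypothetical.

```python
# Made-up rooted tree: leaves have an empty list of children.
EXAMPLE_TREE = ("root", 0.0, [
    ("ancestor_AB", 0.5, [
        ("species_A", 0.25, []),
        ("species_B", 0.5, []),
    ]),
    ("species_C", 1.0, []),
])


def leaf_distances(node, distance_so_far=0.0):
    """Map each leaf (external node) to the summed branch length from the root."""
    name, branch_length, children = node
    total = distance_so_far + branch_length
    if not children:
        return {name: total}
    distances = {}
    for child in children:
        distances.update(leaf_distances(child, total))
    return distances


def count_internal_nodes(node):
    """Count nodes that have at least one child (the root is included)."""
    _, _, children = node
    if not children:
        return 0
    return 1 + sum(count_internal_nodes(child) for child in children)


root_to_leaf = leaf_distances(EXAMPLE_TREE)
internal_count = count_internal_nodes(EXAMPLE_TREE)
print(root_to_leaf, internal_count)
```

The nesting alone encodes the topology; the branch lengths add the evolutionary distance that topology by itself does not show.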

EXPERIMENT: 13

Aim:

Introduction

ExPASy is a web-based resource developed by the Swiss Institute of Bioinformatics (SIB). It offers a
range of tools for bioinformatics analysis, with a focus on protein sequences. ExPASy allows users to
perform sequence alignments, identify domains and motifs, predict secondary and tertiary structures, and
examine protein families.

In this experiment, we will:

● Perform a sequence search and retrieve protein data.
● Use tools to analyze protein properties such as molecular weight, isoelectric point (pI),
and secondary structure.
● Investigate possible post-translational modifications (PTMs) using ExPASy tools.
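
Of the properties listed above, molecular weight is the most direct to estimate from a sequence: sum the average masses of the amino acid residues and add one water molecule for the free chain ends. The sketch below is an illustration under that assumption and is not the ExPASy implementation; the function name `molecular_weight` is hypothetical, and results may differ slightly from any given web tool.

```python
# Average residue masses in daltons (amino acid mass minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528


def molecular_weight(protein_sequence):
    """Estimate the average molecular weight (Da) of an unmodified protein chain."""
    sequence = protein_sequence.upper()
    if not sequence:
        return 0.0
    unknown = set(sequence) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError("unsupported residue codes: " + ", ".join(sorted(unknown)))
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER_MASS


# Glycylglycine (two glycine residues) as a small check.
dipeptide_weight = molecular_weight("GG")
print(round(dipeptide_weight, 2))
```

Post-translational modifications change the mass, so a value computed this way applies only to the unmodified chain.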
