Report
PREFACE
This Bioinformatics laboratory manual is intended to serve as a concise introduction to Bioinformatics as
a subject and an emerging field. It also gives a detailed introduction to the organizations,
databases, and established and emerging tools in bioinformatics.
This manual was authored by Tulsi Chaturvedi under the mentorship of Dr. Anuja Mishra, whose
guidance was instrumental in writing it. It was prepared with graduate and postgraduate students
in mind. It aims to provide a solid foundation for beginners as well as intermediate learners in
the field of Bioinformatics and to equip readers with essential information about the field.
Tulsi Chaturvedi
M.Sc Bioinformatics
Department of Biotechnology
GLA University, Mathura
Bioinformatics Lab Manual
INTRODUCTION TO BIOINFORMATICS
Bioinformatics is an interdisciplinary field that combines biology, computer science, and mathematics to
analyze and interpret biological data. At its core, bioinformatics aims to manage and analyze the vast
amounts of data generated by modern biological research, especially in genomics, proteomics, and other
"omics" sciences.
Importance of Bioinformatics:
Bioinformatics plays a critical role in modern biology and medicine due to its ability to manage and
analyze the vast and complex data generated by biological research. Its importance spans several key
areas:
● Understanding genomes helps scientists discover gene functions, map disease-related mutations,
and trace evolutionary history. It has applications in everything from cancer research to
biodiversity studies.
● Many diseases, like cancer, diabetes, and neurodegenerative disorders, are complex and involve
multiple genetic and environmental factors. Bioinformatics helps in analyzing large-scale datasets
to uncover these complex relationships.
● Bioinformatics is used to compare DNA and protein sequences across species, helping scientists
understand evolutionary relationships, gene function, and biodiversity.
● This information helps in studying species evolution, identifying conserved genes essential for
life, and developing conservation strategies for endangered species.
Nucleotide databases are essential resources for the storage and retrieval of nucleotide sequences, which
form the basis for much of bioinformatics research. These databases, including GenBank, the European
Nucleotide Archive (ENA) at EMBL-EBI, and DDBJ, enable the collection, curation, and sharing of
nucleotide sequences across global research communities.
The European Molecular Biology Laboratory (EMBL) is a leading international research institution that
fosters cutting-edge research in molecular biology, bioinformatics, and computational biology.
● Home to the European Bioinformatics Institute (EBI), which is central to global bioinformatics
infrastructure.
● Supports interdisciplinary research and data-sharing efforts across Europe.
● Plays a pivotal role in developing computational tools and resources for bioinformatics.
● Drives projects related to genome analysis, protein structure prediction, and molecular biology.
● Promotes data sharing and collaboration across biological and computational sciences.
● Crucial in the development of Europe’s contribution to bioinformatics research.
● Facilitates multi-omics data integration and systems biology research.
The DNA Data Bank of Japan (DDBJ) is one of the world’s leading nucleotide sequence databases,
sharing the responsibility with GenBank (NCBI) and EMBL for collecting and distributing DNA
sequence data. Its focus is on ensuring accessibility and the exchange of genomic data globally.
EXPERIMENT: 1
GenBank:
INTRODUCTION:
● GenBank is the NIH genetic sequence database, an annotated collection of all publicly available
DNA sequences.
● It is part of the International Nucleotide Sequence Database Collaboration, which comprises the
DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and
GenBank at NCBI.
● These three organizations exchange data on a daily basis.
● A GenBank release occurs every two months and is available from the FTP site.
GLOSSARY TERMS:
a. Header Section-
● LOCUS: Includes the locus name, sequence length, molecule type, GenBank division,
and modification date.
● DEFINITION: A brief description of the sequence.
● ACCESSION: Unique identifier for the sequence.
● VERSION: The accession number followed by a version suffix (accession.version), which
is incremented whenever the sequence is revised.
● KEYWORDS: Optional keywords that describe the entry.
● SOURCE: The common name of the organism from which the sequence originates.
● ORGANISM: The formal scientific name of the organism, followed by its taxonomic
lineage.
b. Reference Section-
● Citations and reference numbers of the publications, or the PUBMED identifiers, related to
the sequence.
● Also includes the AUTHORS and TITLE of each publication.
c. Feature Section-
● Contains annotations about coding sequences, genes, regulatory elements, etc.
● Each feature includes qualifiers (e.g., gene names, product descriptions) and location
information (e.g., where the gene starts and ends on the sequence).
d. Sequence Section-
● The raw nucleotide or protein sequence is displayed at the bottom of the record, after the
ORIGIN keyword.
e. End Section-
● //: This marks the end of the GenBank entry.
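As a rough illustration of the record layout described above, the following Python sketch splits a GenBank-style flat-file record into its header fields and its sequence. The record text is a shortened, invented example (the accession X00000 and its contents are hypothetical), and the parser only handles the simple cases shown here; it is not a complete GenBank parser.

```python
# Minimal sketch: read keyword fields and the sequence from a GenBank-style record.
# SAMPLE_RECORD is an invented, abbreviated entry used only for illustration.
SAMPLE_RECORD = """\
LOCUS       X00000                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   X00000
VERSION     X00000.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_genbank(text):
    """Return (header_fields, sequence) for one simple GenBank-style record."""
    fields = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("ORIGIN"):      # the sequence section starts here
            in_sequence = True
            continue
        if in_sequence:
            # Drop the leading position number and the spaces between blocks.
            sequence_parts.extend(line.split()[1:])
        elif line[:1].isupper():
            # Top-level keywords start in column 1 (LOCUS, DEFINITION, ...).
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
        elif line.strip().startswith("ORGANISM"):
            # ORGANISM is an indented sub-keyword of SOURCE.
            fields["ORGANISM"] = line.strip()[len("ORGANISM"):].strip()
    return fields, "".join(sequence_parts).upper()


header, seq = parse_genbank(SAMPLE_RECORD)
print(header["ACCESSION"], header["VERSION"], len(seq))
```

For real records, an established library parser is usually a safer choice than hand-written code, because actual entries contain multi-line fields and nested feature qualifiers that this sketch ignores.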
EXPERIMENT: 2
AIM: To retrieve a gene sequence from GenBank and save it in FASTA format.
INTRODUCTION:
● FASTA is a text-based format for representing nucleotide or protein sequences.
● Sequences are stored in a simple plain text file.
● A FASTA file begins with a single-line description, preceded by a ">" symbol (called a header).
● The sequence data follows the header and consists of the nucleotide or amino acid sequence.
● It is widely used in bioinformatics for sequence alignment, storage, and analysis.
● Sequence lines are conventionally wrapped at no more than 80 characters, although a sequence
may also be written as one continuous string.
● Programs like BLAST and CLUSTAL use the FASTA format for sequence comparison.
● FASTA format can handle multiple sequences in a single file.
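The format rules listed above can be sketched as a small Python reader. This is a minimal illustration, assuming well-formed input; the function name parse_fasta and the two example sequences are made up for this sketch.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]             # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)           # a sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Two invented sequences in one file, the first wrapped over two lines.
EXAMPLE = """>seq1 hypothetical example
ATGGCCATTG
TAATGGGCCG
>seq2 another hypothetical example
GGGAAATTTCCC
"""

for name, sequence in parse_fasta(EXAMPLE):
    print(name, len(sequence))
```

Joining the wrapped lines back into one string is what lets downstream tools treat each record as a single continuous sequence.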
Steps:
● Search for NCBI in a web browser and open the National Center for Biotechnology Information
site, OR go directly to the URL: https://www.ncbi.nlm.nih.gov/
● The NCBI homepage will appear.
● Open the All Databases drop-down menu and select Nucleotide.
● Type Collagen in the search box and click Search.
● A list of results will be displayed; click the accession number of the gene of interest.
● A new page will open showing the entry for the collagen gene in detail.
● Click FASTA; the sequence is displayed in FASTA format.
● Copy the FASTA sequence, paste it into Notepad, and save the file.
Note: FASTA is also the name of a sequence similarity search program. The results of a FASTA search
are typically broken down into several parts:
● Top Hits: A ranked list of sequences from the database that most closely match your query, along
with statistical significance values.
● Alignment View: Detailed alignments of the query with the top matching sequences, including
scoring information.
● Statistical Scores:
○ E-value: The number of hits of similar quality (or better) expected by chance. Lower values
indicate more significant matches.
○ Percent identity: Percentage of identical positions in the alignment.
○ Z-score: A normalized similarity score; higher values indicate that the match stands out more
strongly from the scores of unrelated database sequences.
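The manual retrieval steps in this experiment can also be scripted. The sketch below assumes the NCBI E-utilities efetch service, which is not mentioned in the manual itself, and uses a placeholder accession number; the function names are made up for this example. Only the URL is built when the block runs; the download function needs network access and is left uncalled.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the NCBI E-utilities efetch endpoint is used for retrieval.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession):
    """Build an efetch URL requesting a nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",       # the NCBI Nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accession, out_path):
    """Download the record and save it to out_path (requires network access)."""
    with urlopen(build_efetch_url(accession)) as response:
        text = response.read().decode("utf-8")
    with open(out_path, "w") as handle:
        handle.write(text)
    return text


# Placeholder accession; replace it with the accession number found in the search.
print(build_efetch_url("ACCESSION_HERE"))
```

Saving the response to a plain text file gives the same result as copying the FASTA view into Notepad by hand.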
EXPERIMENT: 3
GRAPHICS
The Graphic Summary feature in the NCBI Nucleotide database is a highly useful tool that provides a visual
representation of the annotations and features of a nucleotide sequence. This graphical representation
allows researchers to quickly and efficiently analyze the structural organization of the sequence and gain
insights into its functional elements.
● The Graphic Summary provides an interactive view of nucleotide sequences and their associated
features, such as genes, coding sequences (CDS), exons, introns, and regulatory regions. By
displaying these features along the length of the nucleotide sequence, users can easily navigate
and understand the genomic context.
● Features in the Graphic Summary are color-coded to help users distinguish between different
types of annotations. For example, coding regions might be highlighted in one color, while
regulatory elements or non-coding regions are shown in another. This visual distinction aids in
quickly identifying specific genomic elements.
● The Graphic Summary tool is interactive, allowing users to zoom in on specific regions of interest
or zoom out to view the broader genomic landscape. This feature is particularly useful for
navigating large sequences, such as whole chromosomes or long contigs, and focusing on areas
with specific annotations like exons or protein-coding genes.
● By clicking on specific features within the Graphic Summary, users can access detailed
information about those regions. For instance, clicking on a gene within the graphic will provide
links to its corresponding gene records, mRNA sequences, and protein products, integrating
different levels of biological data.
● The Graphic Summary is integrated with other NCBI tools and databases, such as BLAST,
GenBank, and RefSeq, allowing users to cross-reference sequence alignments, genetic
variations, and evolutionary relationships. This integration facilitates comprehensive
bioinformatics analyses.
● The tool displays multiple layers of annotations simultaneously, including gene predictions,
expressed sequence tags (ESTs), single nucleotide polymorphisms (SNPs), and functional
elements like promoters or enhancers. This layered view provides a detailed understanding of the
functional elements embedded within the nucleotide sequence.
● In addition to nucleotide sequences, the Graphic Summary can display comparative genomic
information, including syntenic regions and evolutionarily conserved elements across different
species. This feature is particularly useful for studying evolutionary conservation and functional
genomics.
● The Graphic Summary is designed for ease of use, with a clean and intuitive interface.
Researchers can adjust settings to display different levels of detail, hide or show specific
annotations, and customize the display based on their specific needs or research focus.
● When interacting with a nucleotide sequence, the Graphic Summary also provides links to
associated protein structures (via the PDB), functional motifs (via CDD), and gene expression
data. This helps in understanding the biological significance of the annotated regions in the
context of molecular function and regulation.
● The Graphic Summary tool is widely used in comparative genomics to visualize syntenic blocks,
track evolutionarily conserved genes, and explore genomic variations among species. By providing
a visual context for sequence comparison, the tool facilitates the identification of functionally
important genomic regions conserved across species or strains.
EXPERIMENT: 4
A protein database is a digital repository that stores protein-related data, including sequences, structures,
annotations, and functional information. These databases are essential for researchers studying protein
biology and conducting computational analyses.
Some examples-
1. NCBI Protein: A comprehensive collection of protein sequences from various sources, including
translations of nucleotide sequences from GenBank.
2. UniProt (Universal Protein Resource): Provides high-quality protein sequences with detailed
functional annotations, including domain structures, active sites, and pathway information.
3. Protein Data Bank (PDB): Focuses on 3D structural data for proteins and nucleic acids.
4. Pfam: Specializes in protein families and domains.
EXPERIMENT: 5
GenPept
GenPept is a data format provided by NCBI for protein sequences and their annotations. It
contains translated protein data derived from nucleotide sequences in GenBank, with
added functional and structural insights. Key components of a GenPept entry:
EXPERIMENT: 6
Identical Proteins
The term identical proteins refers to proteins that share the exact same amino acid
sequence. These identical sequences might occur in:
EXPERIMENT: 7
FASTA
FASTA is a universal file format used for storing nucleotide or protein sequences. It is widely used for
bioinformatics tools and data sharing. A FASTA file consists of:
1. Header line: Starts with a > symbol, followed by a description or sequence identifier (e.g.,
>sp|P68871|HBB_HUMAN Hemoglobin subunit beta).
2. Sequence: Written in single-letter amino acid (protein) or nucleotide code (DNA/RNA).
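To show how the header line carries structured information, here is a minimal Python sketch that splits a UniProt-style header such as the example above into its parts. The function name and the returned dictionary keys are illustrative choices, and the sketch assumes the three-field "db|accession|entry name" layout seen in that example; other databases use different header layouts.

```python
def parse_uniprot_header(header_line):
    """Split a header of the form '>db|accession|entry_name description'."""
    if not header_line.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    # The identifier is everything up to the first space; the rest is free text.
    identifier, _, description = header_line[1:].partition(" ")
    parts = identifier.split("|")
    if len(parts) != 3:
        raise ValueError("expected three '|'-separated identifier fields")
    database, accession, entry_name = parts
    return {
        "database": database,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


info = parse_uniprot_header(">sp|P68871|HBB_HUMAN Hemoglobin subunit beta")
print(info["accession"], info["entry_name"])
```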
EXPERIMENT: 8
Graphics:
In bioinformatics, graphics refers to visual representations of sequence and structural data. Graphics are
critical for interpreting complex biological information. Common applications include:
1. Sequence Features: Tools like NCBI’s Graphics Viewer provide visual layouts of sequence
features (e.g., exons, conserved domains, and mutations).
2. 3D Protein Structures: Visualization software like PyMOL, Chimera, and iCn3D allows
exploration of protein tertiary and quaternary structures.
3. Phylogenetic Trees: Graphical outputs that display evolutionary relationships.
4. Molecular Interactions: Tools like STRING visualize protein-protein interaction networks.
EXPERIMENT: 9
NCBI BLAST:
BLAST (Basic Local Alignment Search Tool) is one of the most widely used bioinformatics tools for
sequence analysis. It identifies regions of local similarity between sequences, which helps in
understanding functional, structural, or evolutionary relationships.
Advanced Variants
1. MegaBLAST:
○ Optimized for highly similar sequences.
○ Used for large-scale searches like genome assembly comparison.
2. PSI-BLAST (Position-Specific Iterative BLAST):
○ Builds a position-specific scoring matrix (PSSM) for detecting distant
relationships.
3. PHI-BLAST (Pattern-Hit Initiated BLAST):
○ Combines pattern matching with BLAST for sequences with specific motifs.
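The idea of a position-specific scoring matrix mentioned under PSI-BLAST can be sketched in a few lines of Python. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes an ungapped alignment of equal-length sequences, a uniform background frequency for all 20 amino acids, and a fixed pseudocount, and the four short sequences are invented.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    pssm = []
    for col in range(length):
        column = [s[col] for s in aligned_seqs]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            # Pseudocounts keep unseen residues from getting a score of -infinity.
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, seq):
    """Sum the column scores for a sequence of the same length as the matrix."""
    return sum(pssm[i][aa] for i, aa in enumerate(seq))


# Invented ungapped alignment used only to demonstrate the calculation.
ALIGNED = ["MKVL", "MKIL", "MRVL", "MKVL"]
PSSM = build_pssm(ALIGNED)
print(score_sequence(PSSM, "MKVL") > score_sequence(PSSM, "AAAA"))
```

Residues that are frequent in a column receive positive scores and rare ones negative scores, which is why a sequence resembling the alignment scores higher than an unrelated one.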
EXPERIMENT: 10
PubChem Overview:
EXPERIMENT: 11
RCSB-PDB Overview:
RCSB PDB (Research Collaboratory for Structural Bioinformatics Protein Data Bank).
The RCSB PDB is a globally recognized, open-access repository for 3D structural data of biological
macromolecules, such as proteins, nucleic acids, and their complexes. It is an essential resource for
researchers in structural biology, biochemistry, bioinformatics, and related fields.
The RCSB PDB is part of the Worldwide Protein Data Bank (wwPDB) consortium, which also includes
the Protein Data Bank in Europe (PDBe), Protein Data Bank Japan (PDBj), and the Biological Magnetic
Resonance Data Bank (BMRB).
EXPERIMENT: 12
algorithm for aligning multiple biological sequences (such as protein or nucleotide sequences).
Multiple Sequence Alignment (MSA) is a bioinformatics technique used to align three or more
biological sequences (DNA, RNA, or protein sequences) in such a way that similar regions are aligned
across all sequences. The goal is to identify conserved sequences, structural motifs, functional regions,
and evolutionary relationships between the sequences. MSA is a critical step in many areas of
bioinformatics, molecular biology, and evolutionary biology.
1. ClustalW/Clustal Omega:
○ ClustalW is one of the oldest and most widely used MSA tools. It uses a
progressive alignment approach and can handle both nucleotide and protein
sequences.
○ Clustal Omega is an updated, faster version of ClustalW and can align larger
datasets more efficiently.
2. MAFFT:
○ MAFFT (Multiple Alignment using Fast Fourier Transform) is another
popular tool that uses both progressive and iterative refinement methods. It is
known for being fast and capable of handling very large datasets.
3. MUSCLE:
○ MUSCLE (Multiple Sequence Comparison by Log-Expectation) is an iterative
method that is often preferred for its high accuracy, especially with protein
sequences.
4. T-Coffee:
○ T-Coffee (Tree-based Consistency Objective Function for Alignment Evaluation)
is known for its accuracy and ability to combine results from different alignment
methods.
5. KAlign:
○ A tool based on an alignment algorithm that employs a scoring system derived
from global statistics.
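Once any of the tools above has produced an alignment, conserved positions can be read off column by column. The Python sketch below assumes the alignment is already available as equal-length strings with "-" marking gaps; the three short sequences are invented, and requiring identical residues in every row is only the strictest possible definition of conservation.

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns where every sequence has the same residue."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for i in range(length):
        column = {row[i] for row in alignment}
        # A column counts as conserved only if it holds one residue and no gap.
        if len(column) == 1 and "-" not in column:
            conserved.append(i)
    return conserved


# Invented alignment of three short nucleotide sequences.
ALIGNMENT = [
    "ATG-CCA",
    "ATGTCCA",
    "ATGACGA",
]

print(conserved_columns(ALIGNMENT))
```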
A phylogenetic tree is a branching diagram that represents the evolutionary relationships among various
species, genes, or proteins. It is a critical tool in bioinformatics, evolutionary biology, and comparative
genomics to study how organisms or sequences have evolved from a common ancestor.
1. Nodes:
○ Internal Nodes: Represent hypothetical ancestors of the organisms or sequences.
○ External Nodes (Leaves): Represent the species, genes, or sequences being compared.
2. Branches:
○ Connect nodes, representing evolutionary pathways.
○ Branch lengths may indicate the extent of evolutionary change or time.
3. Root:
○ Represents the most recent common ancestor of all entities in the tree.
○ A tree without a root is referred to as an unrooted tree.
4. Topology:
○ The branching pattern of the tree, which shows relationships but not necessarily
evolutionary distance.
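The components listed above map naturally onto a small data structure. In the Python sketch below, a leaf is a string and an internal node is a tuple of its children; this nested-tuple representation and the four-species example tree are illustrative choices, and branch lengths are left out. The tree is written out in Newick notation, a common text format for phylogenetic trees.

```python
# A leaf is a string; an internal node is a tuple of child nodes.
# The outermost tuple is the root. The example tree is for illustration only.
TREE = (("Human", "Chimpanzee"), ("Mouse", "Rat"))


def leaves(node):
    """Return the external nodes (leaves) in left-to-right order."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node:
        result.extend(leaves(child))
    return result


def count_internal_nodes(node):
    """Count internal nodes, including the root."""
    if isinstance(node, str):
        return 0
    return 1 + sum(count_internal_nodes(child) for child in node)


def to_newick(node):
    """Write the topology in Newick notation (without the trailing semicolon)."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


print(to_newick(TREE) + ";")
print(leaves(TREE), count_internal_nodes(TREE))
```

Because no branch lengths are stored, this structure captures only the topology of the tree, that is, which groups share a common ancestor.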
EXPERIMENT: 13
Aim:
Introduction
ExPASy is a web-based resource developed by the Swiss Institute of Bioinformatics (SIB). It offers a
range of tools for bioinformatics analysis, with a focus on protein sequences. ExPASy allows users to
perform sequence alignments, identify domains and motifs, predict secondary and tertiary structures, and
examine protein families.