Bioinformatics Practical File

The document introduces bioinformatics databases including NCBI, DDBJ, and UniProt, explaining their roles in storing and analyzing biological data. It details various databases within NCBI, such as GenBank and PubMed, and describes tools like Entrez and BLAST for data retrieval and sequence alignment. Additionally, it outlines the functionalities of DDBJ and UniProt, emphasizing their contributions to the scientific community in managing protein and genetic information.

Uploaded by

kuhukapoor0304
Copyright
© All Rights Reserved

EXPERIMENT-01

AIM: Introduction to bioinformatics databases: NCBI, DDBJ, UniProt, PDB.


THEORY:
BIOINFORMATICS:-
 In biology, bioinformatics is defined as the use of computers to store, retrieve, analyse or
predict the composition or structure of biomolecules.
 Bioinformatics is the application of computational techniques and information technology to
the organisation and management of biological data. Classical bioinformatics deals primarily
with sequence analysis.

NCBI:-
 The National Center for Biotechnology Information (NCBI) (Fig1.1: NCBI) is part of the United
States National Library of Medicine (NLM), a branch of the National Institutes of
Health (NIH). It is approved and funded by the government of the United States.
 The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation
sponsored by US Congressman Claude Pepper.
 The NCBI houses a series of databases relevant to biotechnology and biomedicine and is an
important resource for bioinformatics tools and services.
 Major databases include GenBank for DNA sequences and PubMed, a bibliographic
database for biomedical literature. Other databases include the NCBI Epigenomics database.
 All these databases are available online through the Entrez search engine.
 NCBI was directed by David Lipman, one of the original authors of the BLAST sequence
alignment program and a widely respected figure in bioinformatics.

Fig1.1: Homepage of NCBI

DATABASES OF NCBI:
 Assembly - A database providing information on the structure of assembled genomes,
assembly names and other meta-data, statistical reports, and links to genomic sequence
data.
 Bookshelf - A collection of biomedical books that can be searched directly or from linked
data in other NCBI databases. The collection includes biomedical textbooks, other scientific
titles, genetic resources such as GeneReviews, and NCBI help manuals.
 ClinVar - A resource to provide a public, tracked record of reported relationships between
human variation and observed health status with supporting evidence. Related information
in the NIH Genetic Testing Registry (GTR), MedGen, Gene, OMIM, PubMed and other
sources is accessible through hyperlinks on the records.
 Computational Resources from NCBI's Structure Group - A centralized page providing
access and links to resources developed by the Structure Group of the NCBI Computational
Biology Branch (CBB). These resources cover databases and tools to help in the study of
macromolecular structures, conserved domains and protein classification, small molecules
and their biological activity, and biological pathways and systems.
 GenBank - The NIH genetic sequence database, an annotated collection of all publicly
available DNA sequences. GenBank is part of the International Nucleotide Sequence
Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European
Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations
exchange data on a daily basis. GenBank consists of several divisions, most of which can be
accessed through the Nucleotide database. The exceptions are the EST and GSS divisions,
which are accessed through the Nucleotide EST and Nucleotide GSS databases, respectively.
 Gene Expression Omnibus (GEO) Database - A public functional genomics data repository
supporting MIAME-compliant data submissions. Array- and sequence-based data are
accepted and tools are provided to help users query and download experiments and curated
gene expression profiles.
 Genetic Testing Registry (GTR) - A voluntary registry of genetic tests and laboratories, with
detailed information about the tests such as what is measured and analytic and clinical
validity. GTR also is a nexus for information about genetic conditions and provides context-
specific links to a variety of resources, including practice guidelines, published literature, and
genetic data/information. The initial scope of GTR includes single gene tests for Mendelian
disorders, as well as arrays, panels and pharmacogenetic tests.
 Genome - Contains sequence and map data from the whole genomes of over 1000
organisms. The genomes represent both completely sequenced organisms and those for
which sequencing is in progress. All three main domains of life (bacteria, archaea, and
eukaryota) are represented, as well as many viruses, phages, viroids, plasmids, and
organelles.
 Nucleotide Database - A collection of nucleotide sequences from several sources, including
GenBank, RefSeq, the Third Party Annotation (TPA) database, and PDB. Searching the
Nucleotide Database will yield available results from each of its component databases.
 Online Mendelian Inheritance in Man (OMIM) - A database of human genes and genetic
disorders. NCBI maintains current content and continues to support its searching and
integration with other NCBI databases. However, OMIM now has a new home at omim.org,
and users are directed to this site for full record displays.
 Protein Database - A database that includes protein sequence records from a variety of
sources, including GenPept, RefSeq, Swiss-Prot, PIR, PRF, and PDB.
 PubMed - A database of citations and abstracts for biomedical literature from MEDLINE and
additional life science journals. Links are provided when full text versions of the articles are
available via PubMed Central or other websites.
 RefSeqGene - A collection of human gene-specific reference genomic sequences. RefSeqGene
sequences are a subset of NCBI’s RefSeq database and are defined based on review from curators
of locus-specific databases and the genetic testing community. They form a stable
foundation for reporting mutations, for establishing consistent intron and exon numbering
conventions, and for defining the coordinates of other biologically significant variation.
RefSeqGene is a part of the Locus Reference Genomic (LRG) Collaboration.

Fig1.2: Collection of all resources in NCBI

DATABASE RETRIEVAL TOOL –

Entrez –

 Entrez is a molecular biology database system that provides integrated access to
nucleotide and protein sequence data, gene-centered and genomic mapping information,
3D structure data, PubMed MEDLINE, and more.

 The system is produced by the National Center for Biotechnology Information (NCBI) and is
available via the Internet.

 Entrez covers over 20 databases including the complete protein sequence data from PIR-
International, PRF, Swiss-Prot, and PDB and nucleotide sequence data from GenBank that
includes information from EMBL and DDBJ.

 The Entrez retrieval system uses an intuitive user interface for rapidly searching sequence
and bibliographic data. A unique feature of the system is its use of precomputed similarity
searches for each record to create links to "neighbors" or related records in other Entrez
databases. These links facilitate integrated access across the various databases. An Entrez
global query provides search capability for a subset of Entrez databases at one time.

 Results may be viewed in various formats including FlatFile, FASTA, XML, and others. A
graphical interface provides easy visualization of complete genomes or chromosomes, as
well as biological annotation on individual sequences.

 Entrez also allows batch downloads of large search results.


 Entrez is available via the World Wide Web at http://www.ncbi.nlm.nih.gov/Entrez/.
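
Besides the web interface, Entrez records can be requested programmatically through NCBI's E-utilities. The sketch below only builds an EFetch request URL and makes no network call. The helper name `efetch_url` is hypothetical, and the accession NM_000207 (human insulin mRNA) is assumed purely for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, record_id, rettype="fasta", retmode="text"):
    """Build an EFetch URL for one record (hypothetical helper, no request is sent)."""
    params = {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Illustrative accession only: NM_000207 is assumed here as an example record.
url = efetch_url("nucleotide", "NM_000207")
print(url)
```

Passing `rettype="gb"` instead of `"fasta"` would ask for the GenBank flat-file view of the same record.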

BLAST (BASIC LOCAL ALIGNMENT SEARCH TOOL):-

 The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between
sequences.
 The program compares nucleotide or protein sequences to sequence databases and
calculates the statistical significance of matches.
 BLAST can be used to infer functional and evolutionary relationships between sequences
as well as help identify members of gene families.
 BLAST executables for local use are provided for Solaris, LINUX, Windows, and MacOSX
systems. Pre-formatted databases for BLAST nucleotide, protein, and translated searches
also are available for downloading under the db subdirectory.
 FTP: BLAST Databases - Sequence databases for use with the stand-alone BLAST programs.
The files in this directory are pre-formatted databases that are ready to use with BLAST.
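
To make "regions of local similarity" concrete, here is a minimal sketch of the Smith-Waterman algorithm, the exact dynamic-programming method for local alignment. It is not BLAST itself: BLAST is a faster heuristic and also reports statistical significance, which this sketch does not. The scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


result = smith_waterman("TTTACGTTT", "GGACGTGG")
print(result)
```

The two input strings share only the short region ACGT, and that shared region is what the local alignment reports.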

FILE FORMATS:-
1. FASTA-
 FASTA format is a simple way of representing nucleotide sequences of nucleic acids or amino
acid sequences of proteins.
 It is a very basic format with a minimum of two lines. The first line, referred to as the
comment (or header) line, starts with ‘>’ and gives basic information about the sequence.
There is no set format for the comment line.
 Any other line that starts with ‘;’ is ignored, although such lines are not a common feature
of FASTA files. After the comment line, the nucleic acid or protein sequence is given in the
standard one-letter code. Tabs, spaces, asterisks, etc. within the sequence are ignored.
 File extensions: file.fa, file.fasta, file.fsa
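
The FASTA rules above can be sketched as a small parser for a single record. The function name and the sample record are hypothetical; the parser keeps only letters and gap characters from sequence lines, which is one simple way to drop spaces, tabs and asterisks.

```python
def parse_fasta_record(text):
    """Parse a single-record FASTA string into (header, sequence)."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # blank lines and ';' lines are skipped
        if line.startswith(">"):
            if header is not None:
                raise ValueError("more than one record found")
            header = line[1:].strip()
        else:
            if header is None:
                raise ValueError("sequence data found before the '>' line")
            # Keep letters and '-' only; spaces, tabs, digits and '*' are dropped.
            chunks.append("".join(ch for ch in line if ch.isalpha() or ch == "-"))
    if header is None:
        raise ValueError("no '>' header line found")
    return header, "".join(chunks).upper()


# Hypothetical record used only to exercise the parser.
example = ">demo_seq hypothetical example\n; a comment line\nATG GCC\nTTT*\n"
header, sequence = parse_fasta_record(example)
print(header, sequence)
```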

2. GenBank-
 The GenBank format allows for the storage of information in addition to a DNA/protein
sequence. It holds much more information than the FASTA format.
 The first section of a GenBank entry contains the LOCUS, DEFINITION, ACCESSION and VERSION
fields. The final detail is the actual sequence, which is introduced by the ORIGIN keyword.
These five elements are the essential parts of the GenBank format.
 The non-essential parts of the entry contain what is commonly known as metadata, and can
include more detailed information about the organism, cross-references to other databases,
and even a list of publications in which the entry is featured.
 The FEATURES part of the entry describes important characteristics of the entry’s sequence
such as presence of coding sequences, proteins, etc. This section is less human-friendly, and
it may contain fields that do not make any sense to the untrained eye.
 Finally, at the end of the file, we find the actual sequence that could be DNA or protein. The
last line of the entry has a “//”. These two characters indicate the end of the entry/file.
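
As a rough illustration of the five essential elements, the sketch below pulls LOCUS, DEFINITION, ACCESSION, VERSION and the sequence out of one entry. The record `SAMPLE` is invented for this example and is not a real GenBank entry; the parser is deliberately minimal and ignores continuation lines and the FEATURES table.

```python
ESSENTIAL = ("LOCUS", "DEFINITION", "ACCESSION", "VERSION")

# Invented record, laid out like a GenBank flat file for demonstration only.
SAMPLE = """\
LOCUS       DEMO0001  12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical demo record.
ACCESSION   DEMO0001
VERSION     DEMO0001.1
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 atggccattg ta
//
"""


def parse_genbank_minimal(text):
    """Extract the essential fields and the sequence from one GenBank entry."""
    fields = {}
    seq_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-entry marker
        if in_origin:
            # Sequence lines hold a position number and blocks of bases; keep letters only.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
        else:
            parts = line.split(None, 1)
            # Top-level keywords start in the first column of the line.
            if parts and not line[0].isspace() and parts[0] in ESSENTIAL:
                fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    fields["SEQUENCE"] = "".join(seq_parts).upper()
    return fields


record = parse_genbank_minimal(SAMPLE)
print(record["ACCESSION"], record["SEQUENCE"])
```

For real work, an established parser such as the one in Biopython would be a safer choice than this hand-written sketch.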

3. FlatFile-
 A flat file stores data in a plain text file rather than in linked database tables.
 In its generic form, each line of the text file holds one record, with fields separated by
delimiters such as commas or tabs.
 While it lacks the structure and robustness of relational databases, its simplicity and
portability make it a popular choice for data exchange between programs and systems
that don't need to maintain relationships between records.
 In the context of the National Center for Biotechnology Information (NCBI), the GenBank
flat file is the plain-text form of a sequence record: each entry spans many lines,
begins with a LOCUS line and ends with “//”.
 Submission tools can preview how submitted data will appear as a GenBank flat-file
record in the NCBI Nucleotide database.

4. Multi-FASTA-
 The Multi-FASTA format, often referred to as the MULTIFASTA format, is an
extension of the FASTA format used in bioinformatics. It is a text-based format for
representing multiple nucleotide sequences or peptide (amino acid) sequences in a
single file.
 Each sequence in a Multi-FASTA file starts with a description line (also known as the
FASTA definition line), which begins with a greater-than character (">") followed by
a unique sequence identifier and a description.
 The lines immediately following the description line contain the sequence data.
 This format is particularly useful when dealing with genome or protein data sets that
contain more than one related sequence, and it is widely used in genome sequencing
projects.
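
A minimal sketch of reading a Multi-FASTA file follows: each ">" line starts a new record, and the first word after ">" is taken as the identifier. The function name and the two demo records are hypothetical.

```python
def iter_multifasta(text):
    """Yield (identifier, description, sequence) for each record in the text."""
    ident, desc, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if ident is not None:
                yield ident, desc, "".join(chunks)  # emit the previous record
            header = line[1:].split(None, 1)
            ident = header[0] if header else ""
            desc = header[1] if len(header) > 1 else ""
            chunks = []
        elif ident is not None:
            chunks.append(line)
    if ident is not None:
        yield ident, desc, "".join(chunks)  # emit the last record


# Hypothetical two-record file content.
demo = ">seq1 first demo\nATGC\nGGTA\n>seq2\nMKV\n"
records = list(iter_multifasta(demo))
print(records)
```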

SPECIALIZED TOOLS:-
 ORF Finder: ORF (Open Reading Frame) Finder is a graphical analysis tool that finds all open
reading frames of a selectable minimum size in a user’s sequence or in a sequence already in
the database.
 e-PCR: Electronic polymerase chain reaction is a computational procedure used to identify
sequence-tagged sites (STSs) within DNA sequences.
 Spidey: An mRNA-to-genomic alignment program that uses the local alignment tools BLAST
and Dot View to find its alignments.
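
The idea behind ORF Finder can be sketched in a few lines. This is a simplified illustration, not NCBI's implementation: it scans only the three forward-strand frames, treats ATG as the only start codon, and counts the stop codon in the ORF length. The name `find_orfs` and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_len=30):
    """Return (frame, start, end, orf) tuples for forward-strand ORFs.

    start and end are 0-based with end exclusive; the ORF includes its stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if pos + 3 - start >= min_len:
                    orfs.append((frame, start, pos + 3, dna[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


orfs = find_orfs("CCATGAAATTTTAGGG", min_len=9)
print(orfs)
```

A fuller version would also scan the reverse complement to cover all six reading frames.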

DNA Data Bank of Japan (DDBJ):-


• The DNA Data Bank of Japan (DDBJ) (Fig1.3: DDBJ) (https://www.ddbj.nig.ac.jp/index-e.html)
is a biological database that collects DNA sequences.

• It is located at the National Institute of Genetics (NIG) in Japan.


• It is also a member of the International Nucleotide Sequence Database Collaboration or
INSDC.

• It exchanges its data with the European Molecular Biology Laboratory at the European
Bioinformatics Institute and with GenBank at the National Center for Biotechnology
Information on a daily basis.

• Thus, these three databanks contain the same data at any given time.

• DDBJ began data bank activities in 1986 at NIG and remains the only nucleotide sequence
data bank in Asia. Although DDBJ mainly receives its data from Japanese researchers, it can
accept data from contributors from any other country.

Fig1.3: Homepage of DDBJ

RESOURCES AT DDBJ –
A. Retrieval of data-
• Getentry: It provides a gateway through which data can be retrieved from databases at
DDBJ with the help of unique identifiers such as accession numbers. Getentry is the DDBJ flat
file search system, by accession numbers.

• SRS (sequence retrieval system): A data integration, analysis and display tool where data is
retrieved through key words.

Sequence database: DDBJ, DDBJNEW, DAD, SWISSPROT, PIR

Sequence-related database: PROSITE, PROSITEDOC, ENZYME

Protein tertiary structure database: PDB

• TX (Taxonomy) Search: This is a retrieval system developed by DDBJ for a Taxonomy
database, a unified database of DDBJ, GenBank and EMBL.

B. Submission of Data –
• SAKURA: It is a nucleotide sequence data submission system through the World Wide Web
server at DDBJ. Researchers can enter and submit nucleotide and translated amino acid
sequences through this system. Other than sequence submission, the system also supports
addition of functions and features of the sequence, name of the user, affiliations, addresses
and references.

• Mass Submission System (MSS): It is the recommended submission system when a long
nucleotide sequence or complex genome data has to be submitted.

C. Databases at DDBJ –
1. Genomes TO Protein structures and functions (GTOP) :
 It contains data analysis of proteins identified by the various genome projects.
 This database mainly uses sequence homology analysis and uses information of the
three-dimensional structures.
 Prediction of 3D structure.
 Sequence homology search of PDB using REVERSE PSI-BLAST.
 Classification of proteins into families.
 Sequence homology search of Swiss-Prot with the use of BLAST.

2. Light Balance for Remote Analogous Proteins (LIBRA) :


 A tool for analysis of protein structures and sequences.
 It evaluates the compatibility of protein structure and a sequence.
 Stability analysis of mutant proteins using 3D profiles.
 Structure prediction using threading.
 Sequence homology search by threading.

UNIPROT:-
• The mission of UniProt (Fig1.4: UniProt) (http://www.uniprot.org) is to provide the scientific
community with a comprehensive, high-quality and freely accessible resource of protein
sequence and functional information.

• The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference
Clusters (UniRef), and the UniProt Archive (UniParc).

• UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the SIB
Swiss Institute of Bioinformatics and the Protein Information Resource (PIR).
Fig1.4: Homepage of UniProt

UniProt has four components optimized for different uses:

1. UniProt Knowledgebase (UniProtKB) is an expertly curated database, a central access point
for integrated protein information with cross-references to multiple sources.
 UniProtKB consists of two sections - UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
 The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of
functional information on proteins, with accurate, consistent and rich annotation.
Each entry holds, at a minimum, the amino acid sequence, protein name or
description, taxonomic data and citation information.
 In UniProtKB, annotation consists of the description of the following: function(s),
enzyme-specific information, biologically relevant domains and sites, post-
translational modifications, subcellular location(s), tissue specificity,
developmentally specific expression, structure, interactions, splice isoform(s),
diseases associated with deficiencies or abnormalities, etc.
 After a careful inspection of the sequences, the annotator selects the reference
sequence, does the corresponding merging, and lists the splice and genetic variants
along with disease information when available. Any discrepancies between the
different sequence sources are also annotated.
 UniProtKB/TrEMBL contains computationally analyzed records enriched with
automatic annotation and classification. The computer-assisted annotation is
created using automatically generated rules as in Spearmint, or manually curated
rules based on protein families, including HAMAP family rules, Rule Base rules and
PIRSF classification-based name rules and site rules.
 UniProtKB/TrEMBL contains the translations of all coding sequences (CDS) present in
the EMBL/GenBank/DDBJ Nucleotide Sequence Databases, the sequences of PDB
structures and data derived from amino acid sequences that are directly submitted
to the UniProt Knowledgebase or scanned from the literature.
2. UniProt Archive (UniParc) is the main sequence storehouse: a comprehensive repository
that reflects the history of all protein sequences.
 UniParc houses all new and revised protein sequences from various sources to
ensure that complete coverage is available at a single site. It includes not only
UniProtKB but also translations from the EMBL-Bank/DDBJ/GenBank Nucleotide
Sequence Databases, the Ensembl database of eukaryotic genomes, the H-
Invitational Database (H-Inv), the International Protein Index (IPI), the Protein Data
Bank (PDB), Protein Research Foundation (PRF), NCBI's Reference Sequence
Collection (RefSeq), model organism databases FlyBase, SGD, TAIR Arabidopsis
thaliana and WormBase, TROME and protein sequences from the European,
American and Japanese Patent Offices.
 To avoid redundancy, sequences are handled as strings—all sequences 100%
identical over the entire length are merged, regardless of source organism.
 The basic information stored within each UniParc entry is the identifier, the
sequence, cyclic redundancy check number, source database(s) with accession and
version numbers, and a time stamp. If a UniParc entry does not have a cross-
reference to a UniProtKB entry, the reason for the exclusion of that sequence from
UniProtKB is provided (e.g. pseudogene).
3. UniProt Reference Clusters (UniRef) merge closely related sequences based on sequence
identity to speed up searches. UniRef provides clustered sets of all sequences from the
UniProt Knowledgebase (including splice forms as separate entries) and selected UniProt
Archive records to obtain complete coverage of sequence space at resolutions of 100%, 90%
and 50% identity while hiding redundant sequences. The UniRef clusters provide a
hierarchical set of sequence clusters where each individual member sequence can exist in
only one UniRef cluster at each resolution and have only one parent or child cluster at
another resolution.
4. The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository
specifically developed for metagenomic and environmental data. UniMES currently contains
data from the Global Ocean Sampling Expedition (GOS). UniMES uniquely provides free
access to the array of genomic information gathered from the sampling expeditions,
enhanced by links to further analytical resources.
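
The UniParc rule described above, where sequences identical over their entire length are merged regardless of source, amounts to using the sequence string itself as the key. The sketch below shows that idea with invented records. `zlib.crc32` is used here only as a stand-in checksum and is an assumption of this example; it is not the checksum algorithm UniParc uses.

```python
import zlib


def merge_identical(records):
    """Merge records that share exactly the same sequence string.

    records: iterable of (source_db, accession, sequence).
    Returns one dict per distinct sequence, in first-seen order.
    """
    merged = {}
    for source_db, accession, sequence in records:
        key = sequence.upper()
        if key not in merged:
            merged[key] = {
                "sequence": key,
                # Stand-in checksum for illustration only.
                "checksum": format(zlib.crc32(key.encode("ascii")), "08X"),
                "sources": [],
            }
        merged[key]["sources"].append((source_db, accession))
    return list(merged.values())


# Invented source names and accessions.
demo_records = [
    ("DB_A", "A1", "MKVL"),
    ("DB_B", "B7", "MKVL"),
    ("DB_A", "A2", "MKVI"),
]
entries = merge_identical(demo_records)
for entry in entries:
    print(entry["sequence"], entry["sources"])
```

Keeping the list of sources on each merged entry mirrors how a UniParc entry records the source databases and accessions a sequence came from.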

PROTEIN DATA BANK (PDB):-


• PDB is an archive of experimentally determined 3D structures of biological macromolecules
(Fig1.5: PDB).

• It was established in 1971 at Brookhaven National Laboratory and is now managed in the USA
by the Research Collaboratory for Structural Bioinformatics (RCSB).

• The database contains atomic coordinates, bibliographic citations, primary and secondary
structure information, NMR experimental data and crystallographic structure factors. It is a
source of primary data from which secondary (derived) data can be made available to the
global scientific community.
Fig1.5: Homepage of PDB

DATA RETRIEVAL FROM PDB:-


• The structure of macromolecules can be downloaded from the download portal on the
home page.

• Multiple files can be downloaded with information on various aspects of a macromolecule,
such as coordinates, experimental data, sequence, etc.

• The files are usually downloaded in one of three formats: PDB format, mmCIF format and
PDBML/XML format. They are available either uncompressed or as zip files to reduce the size
of the files.
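
The legacy PDB format stores each atom on a fixed-width ATOM or HETATM line, so coordinates can be read by column position. The sketch below parses one such line. The sample line is assembled from its fields so the column widths are explicit, and its values are illustrative rather than taken from a specific entry.

```python
# One ATOM line built field by field (widths follow the fixed-column layout).
SAMPLE_LINE = (
    "ATOM  "      # columns 1-6: record name
    + "    1"     # columns 7-11: atom serial number
    + " "
    + " N  "      # columns 13-16: atom name
    + " "         # column 17: alternate location
    + "MET"       # columns 18-20: residue name
    + " "
    + "A"         # column 22: chain identifier
    + "   1"      # columns 23-26: residue sequence number
    + "    "      # column 27 insertion code plus three unused columns
    + "  27.340"  # columns 31-38: x
    + "  24.430"  # columns 39-46: y
    + "   2.614"  # columns 47-54: z
    + "  1.00"    # columns 55-60: occupancy
    + "  9.67"    # columns 61-66: temperature factor
)


def parse_atom_line(line):
    """Return a dict for an ATOM/HETATM line, or None for any other record."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


atom = parse_atom_line(SAMPLE_LINE)
print(atom)
```

The mmCIF format is token-based rather than column-based, so this slicing approach applies only to the legacy PDB format.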

DATA DEPOSITION:-
• AutoDep Input Tool (ADIT): Developed by RCSB, this tool helps in the deposition of structures.
It checks the format of coordinate files, structure factor files, etc.

• PDB_extract: An online tool that assembles statistical data from various output files.

• Ligand Depot: A data resource that integrates information about small molecules bound to
macromolecules.

• Validation server: The RCSB validation server checks the format of coordinate and structure
factor files.

PDB BETA:-
It is an enhanced and revised version of PDB which provides a powerful portal for studying the
structures of biological macromolecules, their relationship to sequence, function and disease.

Features include:
 Enhanced web design and improved navigation.
 The search tab offers new ways of accessing the RCSB PDB database, i.e., new browsing
options through categories.
 Standardization of database.
 Improved ligand searching.

Gene Retrieval Through Various Tools

Fig1.6: INS gene retrieval through FASTA

Fig1.7: INS gene retrieval through GenBank


Fig1.8: INS gene retrieval through DDBJ via GEO of NCBI

Fig1.9: INS gene retrieval through UniProt

Fig1.10: Gene retrieval through PDB
