Bioinformatics Biological Database

Biological databases store vast amounts of biological data in an organized, computer-readable format. There are two main types - primary databases contain experimentally derived data directly submitted by researchers, like sequences and structures. Secondary databases derive new knowledge by analyzing primary data using computational algorithms and manual curation. Examples of primary databases include GenBank, EMBL, and Swiss-Prot for sequences and structures, while secondary databases include PROSITE for motifs, SMART and CATH for domains, and KEGG for pathways. Biological databases are important for efficiently searching, managing, and deriving new insights from vast and complex biological data.

Uploaded by

Sir Rutherford

BIOLOGICAL DATABASES
A database is an organized collection of data.
For instance, a list with some of the movies that we like would be a movie database:

ID      title                             year  director
movie1  The Player                        1992  Robert Altman
movie2  Cookie's Fortune                  1999  Robert Altman
movie3  The Man Who Shot Liberty Valance  1962  John Ford

•Entities: The kinds of things that we want to store in a database. E.g.: genes, DNA sequences, bibliographical references.

•Records: The particular things stored in the database. E.g.: the gene BRCA1.

•Identifiers or keys: The unique name that identifies a record.

•Fields: The properties that an entity has. E.g.: the name, sequence and mutations of the gene.

If the different entities are stored in different tables, and the records in those tables are related by their unique identifiers, that structure comprises a relational database.

Databases usually provide mechanisms to store, search, retrieve and modify the data.
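The relational idea can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table and column names (directors, movies, director_id) are assumptions made for the example, and the rows are taken from the movie list above.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two entities, each in its own table; director_id relates their records.
conn.executescript("""
CREATE TABLE directors (
    director_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE movies (
    movie_id    TEXT PRIMARY KEY,   -- identifier (key) of each record
    title       TEXT NOT NULL,      -- field
    year        INTEGER,            -- field
    director_id INTEGER REFERENCES directors(director_id)
);
""")

# Store: one row per record.
conn.executemany("INSERT INTO directors VALUES (?, ?)",
                 [(1, "Robert Altman"), (2, "John Ford")])
conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?)", [
    ("movie1", "The Player", 1992, 1),
    ("movie2", "Cookie's Fortune", 1999, 1),
    ("movie3", "The Man Who Shot Liberty Valance", 1962, 2),
])

# Search and retrieve: join the two tables through the shared identifier.
rows = conn.execute("""
    SELECT m.title, m.year
    FROM movies AS m
    JOIN directors AS d ON d.director_id = m.director_id
    WHERE d.name = ?
    ORDER BY m.year
""", ("Robert Altman",)).fetchall()

print(rows)
```

Storing the director once and referring to it by identifier avoids repeating the name in every movie record, which is the main reason for splitting entities across tables.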
Biological databases:

Biological data are complex, exception-ridden, vast and incomplete.

Therefore, several databases have been created and interpreted to ensure unambiguous results.

A collection of biological data arranged in computer-readable form, which speeds up search and retrieval and is convenient to use, is called a biological database.

A good database must have up-to-date information.


Importance of biological databases:

A range of information, such as biological sequences, structures, binding sites, metabolic interactions, molecular action, functional relationships, protein families, motifs and homologs, can be retrieved using biological databases.

The main purpose of a biological database is to store and manage biological data and information in computer-readable form.
Primary databases

Primary databases are also called archival databases.

They are populated with experimentally derived data such as nucleotide sequences, protein sequences or macromolecular structures.

Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.

Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.
Secondary databases:

Secondary databases comprise data derived from the results of analysing primary data.

Secondary databases often draw upon information from numerous sources, including other databases (primary and secondary), controlled vocabularies and the scientific literature.

They are highly curated, often using a complex combination of computational algorithms and manual analysis and interpretation to derive new knowledge from the public record of science.
However, many data resources have both primary and secondary characteristics.

For example, UniProt accepts primary sequences derived from peptide sequencing experiments.

However, UniProt also infers peptide sequences from genomic information, and it provides a wealth of additional information, some derived from automated annotation (TrEMBL), and even more from careful manual analysis (Swiss-Prot).

There are also specialized databases, which cater to a particular research interest.

For example, FlyBase, the HIV sequence database and the Ribosomal Database Project are databases that specialize in a particular organism or a particular type of data.
Primary database vs. secondary database

•A primary database contains only sequence or structural information.

•A database derived from the analysis or treatment of primary data is a secondary database. Secondary databases are very important for inferring protein function.
Examples of some primary biological databases:
GenBank

One of the fastest-growing repositories of known nucleotide sequences, GenBank (Genetic Sequence Databank) has a flat-file structure.

A GenBank flat file is an ASCII text file, readable by both humans and computers. Besides sequence data, GenBank files contain information such as accession numbers and gene names, phylogenetic classification and references to published literature.
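Because a flat file is plain text with a keyword at the start of each line, it can be read with very little code. The sketch below parses a shortened, made-up record laid out like a GenBank flat file; the record contents and the function name parse_flat_file are assumptions for illustration, and only top-level keywords and the sequence are extracted.

```python
# A shortened, made-up record laid out like a GenBank flat file (illustrative only).
RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   EXAMPLE01
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

def parse_flat_file(text):
    """Return (fields, sequence) for one flat-file record."""
    fields = {}
    seq_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):           # end-of-record marker
            break
        if in_sequence:
            # Sequence lines: drop the position number and the spaces.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):     # the sequence starts on the next line
            in_sequence = True
        elif line and not line[0].isspace():
            # Top-level keywords start in the first column; indented
            # sub-keywords such as ORGANISM are skipped in this sketch.
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    return fields, "".join(seq_parts).upper()

fields, sequence = parse_flat_file(RECORD)
print(fields["ACCESSION"], len(sequence), sequence)
```

For real records, an established parser (for example the GenBank reader in Biopython's SeqIO module) is usually a better choice, since it also handles the feature table, multi-line values and references.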

This database has been developed and maintained at the NCBI, Bethesda, MD, USA, as part of the International Nucleotide Sequence Database Collaboration (INSDC).

It is an open-access sequence database.

It coordinates with individual laboratories and other sequence databases such as EMBL and DDBJ.

It is an annotated collection of all nucleotide sequences that are available to the public.

The nucleotide database at NCBI was divided into three databases: Core Nucleotide, Expressed Sequence Tag (EST) and Genome Survey Sequence (GSS).

The Core Nucleotide database holds most of the nucleotide sequences used; it contains all nucleotide records that are not in the EST and GSS databases.

Sequences can be submitted to GenBank using the BankIt, Sequin and tbl2asn tools.
EMBL (European Molecular Biology Laboratory)

•A comprehensive database of DNA and RNA sequences, the EMBL nucleotide sequence database is collected from the scientific literature and patent offices and is directly submitted by researchers. EMBL has been prepared in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ).

•It was established in 1980.

•It is maintained by the EBI (European Bioinformatics Institute).


Swiss-Prot

This is a curated protein sequence database that offers a high level of integration with other databases and has a very low level of redundancy.

Swiss-Prot strives to provide protein sequences with a high level of annotation (for instance, the description of protein function, domain structure and post-translational modifications).

It was established in 1986 and has been maintained collaboratively since 1987 by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library.

TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all translations of EMBL nucleotide sequence entries that are not yet integrated into Swiss-Prot.

Currently, Swiss-Prot has 0.5 million and TrEMBL has 7.6 million sequences.
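TrEMBL entries are described above as translations of nucleotide entries. As a rough illustration of what translating a coding sequence involves, the sketch below applies the standard genetic code to a short DNA string. The function name translate and the example sequence are assumptions for illustration; this is not the TrEMBL pipeline, which works from annotated coding regions and also has to handle alternative genetic codes.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding DNA sequence (reading frame 1) up to the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing ambiguous bases are reported as X.
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":                       # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCATTGTAATGGGCCGCTGA"))
```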
Protein Information Resource (PIR)

•PIR is an integrated public bioinformatics resource to support genomic and proteomic research and scientific studies.

•Nowadays, PIR offers a wide variety of resources, such as PIRSF, iProClass and iProLINK, mainly oriented to assisting the propagation and consistency of protein annotations.
Examples of Some Secondary Biological Databases
Motif Databases

•A protein sequence motif is a set of conserved amino acid residues that are important for protein function and are located within a certain distance from one another. These motifs usually provide clues to the functions of otherwise uncharacterised proteins.

•The PROSITE database consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them.

•PRINTS is a database of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family.
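A PROSITE pattern is written in a compact notation: elements are separated by "-", "x" means any residue, "[ST]" means S or T, "{P}" means any residue except P, and "(2,4)" gives a repeat range. The sketch below converts this core subset of the notation into a Python regular expression and scans a sequence with it. The function name prosite_to_regex and the test sequence are assumptions for illustration, and rarer syntax (such as anchors placed inside square brackets) is not handled.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern string to a Python regex string."""
    pattern = pattern.rstrip(".")               # patterns end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):             # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):               # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4).
        core, _, repeat = element.partition("(")
        repeat = "{" + repeat.rstrip(")") + "}" if repeat else ""
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

# The N-glycosylation site pattern, applied to a made-up sequence.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
hits = [(m.start(), m.group()) for m in re.finditer(regex, "MKNGSAANPTQ")]
print(regex, hits)
```

Positions reported by re.finditer are zero-based, whereas sequence positions in database entries are normally counted from one.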
Domain Databases

•A protein domain is an independently folded, structurally compact unit that forms a stable three-dimensional structure and shows a certain level of evolutionary conservation. Typically, a conserved domain contains one or more motifs.

•ProDom is a protein domain database automatically generated from the Swiss-Prot and TrEMBL sequence databases.

•SMART is a highly reliable and sensitive tool for domain identification.

•COG is a database and a convenient tool for motif and domain identification.
3D Structure databases

•PDB (Protein Data Bank) is the main primary database for 3D structures of biological macromolecules determined by X-ray crystallography and NMR. It also accepts the experimental data used to determine the structures, and homology models.

•SCOP (Structural Classification of Proteins database) classifies protein 3D structures in a hierarchical scheme of structure classes. All the protein structures in PDB are classified here, and the updated new structures are deposited in PDB.

•The CATH database (Class, Architecture, Topology, Homologous superfamily) contains a hierarchical classification of protein domain structures.
Protein Data Bank

•PDB (Protein Data Bank) is a repository for 3D structural data of proteins and nucleic acids obtained by X-ray crystallography or NMR spectroscopy.

•The Research Collaboratory for Structural Bioinformatics (RCSB) PDB provides a variety of tools and resources for studying the structures of biological macromolecules and their relationships to sequence, function and disease.
Gene expression databases

•GEO (Gene Expression Omnibus) is a curated online resource and a gene expression/molecular abundance repository for browsing, querying and retrieving gene expression data.

•GXD (Gene Expression Database) is a community resource for gene expression information.

•MGED (Microarray Gene Expression Data) contains microarray data generated by functional genomics and proteomics experiments.

•ArrayExpress, from the European Bioinformatics Institute, is a repository for transcriptomics data.
Metabolic pathway databases

•The KEGG PATHWAY database contains graphical pathway maps for all known metabolic pathways from various organisms.

•EcoCyc is an E. coli database that stores information regarding the genome and biochemical machinery of E. coli.

•LIGAND is a chemical database for enzyme reactions at the Institute for Chemical Research, Kyoto. It is a composite database currently consisting of the COMPOUND, DRUG, GLYCAN, REACTION, RPAIR and ENZYME databases.

•MetaCyc is a non-redundant database of experimentally elucidated metabolic pathways.

•BRENDA is an enzyme database that contains information on all aspects of enzymes and enzymatic reactions.
Genome databases

•Genome databases give comprehensive information on the heritable properties of an organism. These databases help to identify genes and predict their functions. A few genome databases have links with specific organism databases.

•GOLD (Genomes Online Database at the University of Illinois, USA) contains a list of all the complete and ongoing genome projects worldwide.

•Genomes at NCBI (National Center for Biotechnology Information, USA).

•TIGR database (TDB), at The Institute for Genomic Research, Rockville, MD, USA.
Virological databases

A virological database contains all the sequences and related information of viruses of animals, plants, bacteria, fungi and archaea; for example, the HIV protease database.

The International Committee on Taxonomy of Viruses (ICTV) authorises and organises the taxonomic classification of viruses. ICTVdB contains taxonomic information for thousands of virus species.
World biodiversity databases

•Taxonomic databases are built to document all known species and make them available and accessible worldwide.

•These databases contain taxonomic hierarchies, species names, synonyms, descriptions, illustrations and references. For example: CCINFO, STRAIN and ALGAE.
Databases for various model organisms

•Escherichia coli - E. coli Genome Centre (University of Wisconsin, USA), the E. coli Index (University of Birmingham, UK)

•Arabidopsis thaliana - TAIR (The Arabidopsis Information Resource)

•Homo sapiens - Human Genome Resources at NCBI, USA

•Oryza sativa (rice) - RGP (Rice Genome Research Programme, Japan)

•Drosophila melanogaster - FlyBase (Drosophila Genome Database)

•Mus musculus (mouse) - Mouse Genome Informatics

•Danio rerio (zebrafish) - ZFIN (Zebrafish Information Network at the University of Oregon, USA)

•S. cerevisiae (baker's yeast) - SGD (Saccharomyces Genome Database at Stanford, USA)
Annotation of Genes:

In molecular biology, the genome is the basic genetic material of an organism and typically consists of DNA.

Genomes include genes (coding regions) and non-coding regions.

The coding regions are of particular interest because they actively influence the basic life processes.

Genes contain the biological information required to build and maintain an organism. Gene annotation can be defined simply as the process of making a nucleotide sequence meaningful.

Gene annotation involves taking the raw DNA sequence produced by genome sequencing projects and adding the layers of analysis and interpretation necessary to extract biologically significant information and place the derived details into context.

Annotation is the process by which pertinent information about these raw DNA sequences is added to the databases.
Accession number

Accession numbers are unique identifiers which permanently identify sequences in the database.

Accession numbers are assigned and communicated to authors within two working days of the receipt of submission.
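In GenBank, a sequence identifier is often written as accession.version (for example U49845.1), where the accession stays fixed and the version number increases when the sequence itself is revised. The sketch below splits such an identifier into its two parts. It assumes only the classic nucleotide accession layouts (one letter plus five digits, or two letters plus six digits); other layouts exist and are not covered, and the function name split_accession is hypothetical.

```python
import re

# Assumed layouts: 1 letter + 5 digits, or 2 letters + 6 digits,
# optionally followed by ".version".
ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")

def split_accession(identifier):
    """Return (accession, version); version is None when no suffix is given."""
    match = ACCESSION_RE.match(identifier.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised accession: {identifier!r}")
    accession, version = match.groups()
    return accession, int(version) if version else None

print(split_accession("U49845.1"))
print(split_accession("af123456"))
```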
