Database

The document provides an overview of biological databases, including their definitions, classifications, and purposes. It details various types of databases such as bibliographic, genome, sequence, DNA, protein, metabolic, disease, expression, and chemical databases, along with examples and their functions. Additionally, it discusses the submission processes for researchers to contribute their data to these databases.

Uploaded by

sadiquraga
BCH 418

PRINCIPLES OF BIOINFORMATICS

DR. A. Dandare
Dept. of Biochemistry & Molecular Biology
Usmanu Danfodiyo University Sokoto
Database
▪ A database is a computerized archive used to store and organize data in such a way
that information can be retrieved easily via a variety of search criteria.

▪ The chief objective of the development of a database is to organize data in a set of
structured records to enable easy retrieval of information.

▪ Biological database: a collection of data that is structured, searchable, updated
periodically and cross-referenced.
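As a minimal sketch of these ideas, the Python snippet below stores a few structured records and retrieves them by different search criteria. The record fields, the identifiers and the `search` helper are illustrative assumptions, not part of any real database.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One structured entry in a toy biological database (hypothetical schema)."""
    record_id: str
    organism: str
    description: str
    cross_refs: list = field(default_factory=list)  # IDs of related records


# Hypothetical records; the IDs and contents are made up for illustration.
RECORDS = [
    Record("DEMO0001", "Homo sapiens", "insulin precursor", ["DEMO0003"]),
    Record("DEMO0002", "Mus musculus", "insulin precursor"),
    Record("DEMO0003", "Homo sapiens", "insulin receptor", ["DEMO0001"]),
]


def search(records, **criteria):
    """Return records whose fields contain every given value (case-insensitive)."""
    hits = []
    for rec in records:
        if all(str(value).lower() in str(getattr(rec, name)).lower()
               for name, value in criteria.items()):
            hits.append(rec)
    return hits


human_hits = search(RECORDS, organism="Homo sapiens")
insulin_hits = search(RECORDS, description="insulin precursor")
print([r.record_id for r in human_hits])
print([r.record_id for r in insulin_hits])
```

The same `search` function accepts several criteria at once, for example `search(RECORDS, organism="homo", description="receptor")`, which is the sense in which a database supports retrieval "via a variety of search criteria".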
Biological databases are developed to perform several functions:

i. They aid in the organisation of biological experiments and analysis.

ii. They make biological data available to scientists in one place and help them
obtain data for their research and for cross-validation.

iii. They are available in a computer-readable format and thus form the first
fundamental step of biological data analysis.
Classification of Biological Databases
Biological databases are broadly classified into nine categories based on the
composition of their data types.

1. Bibliographic Databases
A bibliographic database is a scientific literature database consisting of numerous
research papers and articles from various journals.

PubMed, available at NCBI, is a widely used bibliographic database. It is
maintained by the National Library of Medicine (NLM) and contains more than 12.8
million abstracts from 4,400 biomedical and biochemical journals.

MEDLINE is the NLM's premier bibliographic database, covering the fields of
human medicine, nursing, dentistry, veterinary medicine, health care systems,
and pre-clinical sciences. It covers 4,800 biomedical journals.
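A bibliographic database such as PubMed can also be queried programmatically. The sketch below only builds a search URL for the NCBI E-utilities `esearch` service and does not send any request. The helper name `build_pubmed_query` and the example search term are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(term, max_results=20):
    """Build an esearch URL that asks PubMed for record IDs matching `term`."""
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": term,            # free-text query
        "retmax": max_results,   # maximum number of IDs to return
        "retmode": "json",       # ask for a JSON response
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = build_pubmed_query("biological databases", max_results=5)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching PubMed identifiers. Automated use should follow NCBI's usage guidelines.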
2. Genome databases
▪ Genome databases give comprehensive information on the heritable properties of an
organism. These databases help to identify genes and predict their functions.
A few genome databases have links with specific organism databases.

▪ GOLD (Genomes Online Database at the University of Illinois, USA)
contains a list of all the complete and ongoing genome projects worldwide.

▪ Genomes at NCBI (National Center for Biotechnology Information, USA).


3. Sequence Databases

▪ The RefSeq database, for example, is an open-access, annotated and curated collection of publicly
available nucleotide sequences (DNA, RNA) and their protein products.

▪ The National Center for Biotechnology Information Reference Sequence (NCBI RefSeq) database
provides curated non-redundant sequences of genomic regions, transcripts and proteins for
taxonomically diverse organisms including Archaea, Bacteria, Eukaryotes, and Viruses.

▪ The RefSeq database is derived from the sequence data available in the redundant archival
database GenBank. RefSeq sequences include coding regions, conserved domains,
variations, etc.
▪ Nucleic acid sequence databases include GenBank, EMBL-Bank (European Molecular Biology
Laboratory), DDBJ (DNA Data Bank of Japan), etc.

▪ Protein sequence and structure databases include Entrez Protein, Swiss-Prot, Protein Data Bank (PDB),
Molecular Modelling Database (MMDB), Gene3D and the EMBL Macromolecular Structure Database.
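RefSeq records for genomic regions, transcripts and proteins can be told apart by the prefix of their accession numbers. The sketch below covers only a subset of the prefixes. The function name `classify_refseq` is an assumption for illustration, and the accessions in the loop are used only as example strings.

```python
# A subset of RefSeq accession prefixes and the molecule type each denotes.
REFSEQ_PREFIXES = {
    "NC_": "genomic (complete molecule)",
    "NG_": "genomic (region)",
    "NM_": "transcript (mRNA)",
    "NR_": "transcript (non-coding RNA)",
    "NP_": "protein",
    "XM_": "transcript (predicted mRNA)",
    "XR_": "transcript (predicted non-coding RNA)",
    "XP_": "protein (predicted)",
}


def classify_refseq(accession):
    """Return the molecule type implied by a RefSeq-style accession prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown or not a RefSeq accession")


for acc in ["NM_000546", "NP_000537", "NC_000001", "U49845"]:
    print(acc, "->", classify_refseq(acc))
```

An accession without one of these prefixes, such as a plain GenBank accession, falls through to the "unknown" branch. This mirrors the distinction between the curated RefSeq collection and the archival database it is derived from.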
DNA Databases
▪ A DNA database centers on managing DNA data from many species or from
specific species.

▪ The primary functions of human DNA databases include the establishment of
the reference genome.

▪ A representative example of a DNA database is GenBank, a collection of
all publicly available DNA sequences.

▪ GenBank contains over 184 billion nucleotide bases in more than 179
million sequences.
Protein Databases

▪ The purposes of constructing protein databases include the collection of universal
proteins, identification of protein families and domains, reconstruction of phylogenetic trees,
and profiling of protein structures.

▪ A representative example of a protein database is PDB, the main primary database for 3D
structures of biological macromolecules determined by X-ray crystallography and NMR.

▪ PDB contains more than 105,465 biological macromolecular structures, of which 27,393
entries belong to human (http://www.rcsb.org/pdb).

▪ Another example is the Universal Protein Resource (UniProt), a collaborative project between
EMBL-EBI, the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR).

▪ UniProt provides a comprehensive, high-quality, and freely accessible resource of protein
sequence and functional information.
4. Metabolic Databases
Metabolic databases contain data on biological pathways and enzymes in different organisms.
Pathway databases
▪ Pathway databases contain biological pathways for metabolic, signalling, and
regulatory pathway analysis.

▪ A representative example is KEGG pathway, a curated biological pathway resource on
the molecular interaction and reaction networks.

▪ The KEGG pathway database contains graphical pathway maps for all known metabolic
pathways from various organisms.

▪ KEGG pathway integrates many entities that are stored in KEGG sibling databases,
including genes, proteins, RNAs, chemical compounds, and chemical reactions.
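The way a pathway record integrates entities from sibling databases can be sketched as a set of small tables linked by identifiers. Everything below is a toy model: the identifiers (`g1`, `c1`, `r1`, `p1`), the table layout and the `resolve_pathway` helper are assumptions for illustration and do not reflect KEGG's actual schema or identifiers. Cofactors such as ATP are omitted.

```python
# Hypothetical "sibling" tables keyed by made-up identifiers.
GENES = {"g1": "hexokinase gene", "g2": "glucose-6-phosphate isomerase gene"}
COMPOUNDS = {
    "c1": "glucose",
    "c2": "glucose 6-phosphate",
    "c3": "fructose 6-phosphate",
}
REACTIONS = {
    "r1": {"substrates": ["c1"], "products": ["c2"], "genes": ["g1"]},
    "r2": {"substrates": ["c2"], "products": ["c3"], "genes": ["g2"]},
}
PATHWAYS = {"p1": {"name": "toy glycolysis (first two steps)", "reactions": ["r1", "r2"]}}


def resolve_pathway(pathway_id):
    """Expand a pathway record by following its references into sibling tables."""
    pathway = PATHWAYS[pathway_id]
    steps = []
    for rid in pathway["reactions"]:
        rxn = REACTIONS[rid]
        steps.append({
            "reaction": rid,
            "substrates": [COMPOUNDS[c] for c in rxn["substrates"]],
            "products": [COMPOUNDS[c] for c in rxn["products"]],
            "genes": [GENES[g] for g in rxn["genes"]],
        })
    return {"name": pathway["name"], "steps": steps}


resolved = resolve_pathway("p1")
for step in resolved["steps"]:
    print(step["substrates"], "->", step["products"], "via", step["genes"])
```

Keeping genes, compounds and reactions in separate tables means each entity is stored once and can be reused by many pathways.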
5. Disease databases
▪ These are dedicated sources of disease-related information. For example, OMIM (Online Mendelian
Inheritance in Man) provides data about human genes and genetic disorders.

▪ The Genetic Association Database is another popular disease database containing data on human
genetic association studies of complex diseases and disorders.

▪ This database helps in rapidly identifying medically relevant polymorphisms from large volumes of
polymorphism and mutational data. It has significant therapeutic value.

▪ The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium
(ICGC) are further examples of disease databases.

▪ TCGA aims to collect a wide diversity of omics data (including exome, SNP, mRNA, miRNA,
and methylation) for more than 20 different types of human cancer.

▪ ICGC aims to obtain a comprehensive description of genomic, transcriptomic, and epigenomic
changes in 50 different tumor types and/or subtypes.
6. Expression databases
Expression databases can be used for various purposes, including archiving expression
data (e.g., GEO), detecting differential and baseline expression (e.g., Expression Atlas),
exploring tissue-specific gene expression and regulation (e.g., TiGER), and profiling
expression information based on both RNA and protein data (e.g., Human Protein Atlas).

➢ A representative example of an expression database is the Human Protein Atlas.

➢ It encompasses expression profiles for a large majority of human protein-coding
genes based on both RNA (transcriptome analysis based on 213 tissue and cell line
samples) and protein data (proteome analysis based on 24,028 antibodies)
(http://www.proteinatlas.org).
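To make "differential expression" concrete, the sketch below computes a log2 fold change between two groups of samples. This is a generic textbook calculation, not the method used by any of the databases named above. The gene names, the expression values and the pseudocount of 1 are assumptions for illustration, and real analyses add normalisation and statistical testing.

```python
import math


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of mean expression in two groups; pseudocount avoids log(0)."""
    mean_treated = sum(treated) / len(treated)
    mean_control = sum(control) / len(control)
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


# Made-up expression values for two hypothetical genes.
expression = {
    "geneA": {"control": [7.0, 7.0, 7.0], "treated": [31.0, 31.0, 31.0]},
    "geneB": {"control": [15.0, 15.0], "treated": [15.0, 15.0]},
}

for gene, groups in expression.items():
    lfc = log2_fold_change(groups["treated"], groups["control"])
    print(gene, round(lfc, 2))
```

A positive value means higher expression in the treated group, a negative value means lower expression, and a value near zero means no change.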
Classification of Databases Based on Data Source
1. Primary databases

Primary databases, also called archival databases, accept original data from researchers
with relatively little checking or validation. They contain original submissions from researchers.

They are populated with experimentally derived data such as nucleotide sequences, protein
sequences or macromolecular structures.

▪ Once given a database accession number, the data in primary databases are never changed:
they form part of the scientific record.

Examples

▪ ENA, GenBank and DDBJ (nucleotide sequence).

▪ ArrayExpress Archive and GEO (functional genomics data).

▪ Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures).
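The archival behaviour described above, where a submission receives an accession number and the stored record is never changed afterwards, can be sketched as a store that accepts new records but refuses in-place edits. The class name, the `DEMO` accession prefix and the example sequence are assumptions for illustration. How real archives handle corrections is outside the scope of this sketch.

```python
class PrimaryArchive:
    """Toy archive: accessions are assigned once and stored records never change."""

    def __init__(self, prefix="DEMO"):
        self._prefix = prefix
        self._records = {}

    def submit(self, data):
        """Store a submission as-is and return its new accession number."""
        accession = f"{self._prefix}{len(self._records) + 1:06d}"
        self._records[accession] = data
        return accession

    def get(self, accession):
        """Retrieve the record exactly as it was submitted."""
        return self._records[accession]

    def update(self, accession, data):
        """Refuse in-place edits: the archived record is part of the scientific record."""
        raise PermissionError(f"{accession} is archived and cannot be modified")


archive = PrimaryArchive()
acc = archive.submit("ATGGCCATTGTAATGGGCCGC")
print(acc, archive.get(acc))
```

Note that `submit` performs no validation of the data, echoing the point that primary databases accept submissions with relatively little checking.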


2. Secondary databases
▪ Secondary databases comprise data derived from the results of analysing primary data.
▪ Secondary databases often draw upon information from numerous sources, including other
databases (primary and secondary), controlled vocabularies and the scientific literature.
▪ They are highly curated, often using a complex combination of computational algorithms and
manual analysis and interpretation to derive new knowledge from the public record of science.
Examples
▪ InterPro (protein families, motifs and domains)
▪ UniProt Knowledgebase (sequence and functional information on proteins)
▪ Ensembl (variation, function, regulation and more layered onto whole genome sequences)

Many data resources, however, have both primary and secondary characteristics. For example,
UniProt accepts primary sequences derived from peptide sequencing experiments, but it
also infers peptide sequences from genomic information and provides a wealth of additional
information, some derived from automated annotation (TrEMBL) and even more from careful manual
analysis (Swiss-Prot).
Classification of Biological Databases (continued)
7. Chemical Databases:

▪ These databases store chemical information on various molecules. Examples:

▪ PubChem, at NCBI, contains substance descriptions of small molecules with fewer than 1000
atoms and 1000 bonds.

▪ ChEMBL is a large-scale bioactivity database containing binding, functional, and in vivo
absorption, distribution, metabolism, excretion, and toxicity (ADMET) information about
drug-like bioactive compounds.

▪ ChEMBL data are manually curated from the published literature together with data drawn
from other databases. ChEMBL data are standardized for use in many types of chemical
biology and drug-discovery research problems.

▪ The ChEMBL database can be accessed from a web-based interface where a variety of search
and browsing functions are provided.

▪ ChEMBL data are freely available from its FTP site in Oracle, MySQL,
PostgreSQL, structure-data file (SDF), FASTA and RDF formats.
Submission to Database
Investigators are encouraged to submit their newly obtained sequences directly to a member of the
International Nucleotide Sequence Database Collaboration such as

1. National Center for Biotechnology Information (NCBI), which manages GenBank
(http://www.ncbi.nlm.nih.gov)

2. The DNA Data Bank of Japan (DDBJ; http://www.ddbj.nig.ac.jp)

3. The European Molecular Biology Laboratory (EMBL)/EBI Nucleotide Sequence Database
(http://www.embl-heidelberg.de)

• The simplest way of submitting sequences is through the website http://ncbi.nlm.nih.gov/,
using a Web form page called BankIt.
The submission can be annotated with information about the sequence, such as mRNA start
and coding regions.
Ways of submission to Databases
▪ The submitted form is transformed into GenBank format and returned to the submitter
for review before it is added to GenBank.
▪ The other method of submission is to use Sequin (formerly called Authorin), which
runs on personal computers and UNIX machines.
▪ The programme provides an easy-to-use graphical interface and can manage large
submissions such as genomic sequences.
▪ It is described and demonstrated at http://www.ncbi.nlm.nih.gov/Sequin/index.html

▪ Completed files, using the appropriate or standard format:
▪ Files containing only sequence characters.
▪ Based on American Standard Code for Information Interchange (ASCII).
▪ Can be sent by email to [email protected]
▪ Or mailed on diskette to GenBank Submissions, NCBI, National Library of Medicine,
Bldg. 38A, Room 8N-803, Bethesda, Maryland 20894.
▪ The sequence then becomes publicly available.
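Before sending a sequence-only file, a submitter could check that it is plain ASCII and contains only sequence characters. The sketch below is one possible check and is not part of any NCBI tool. Treating the IUPAC nucleotide codes as the allowed alphabet, and the helper name `check_sequence_text`, are assumptions for illustration.

```python
# IUPAC nucleotide codes (upper case); an assumption about what counts as
# "sequence characters" for a nucleotide submission.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def check_sequence_text(text):
    """Return a list of problems found in a plain-text, sequence-only file."""
    problems = []
    if not text.isascii():
        problems.append("file contains non-ASCII characters")
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Collect every character on the line that is not an allowed code.
        bad = sorted(set(line.strip().upper()) - IUPAC_NUCLEOTIDES)
        if bad:
            problems.append(f"line {line_no}: unexpected characters {bad}")
    return problems


good = "ATGGCCATTGTAATGGGCCGC\nTGAAAGGGTGCCCGATAG\n"
bad = "ATGGCC ATT\n>header line\n"
print(check_sequence_text(good))
print(check_sequence_text(bad))
```

An empty list means the text passed both checks. Any header line, internal space or other stray symbol is reported with its line number.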
