Database
PRINCIPLES OF BIOINFORMATICS
DR. A. Dandare
Dept. of Biochemistry & Molecular Biology
Usmanu Danfodiyo University Sokoto
Database
▪ A database is a computerized archive used to store and organize data in such a way
that information can be retrieved easily using a variety of search criteria.
▪ Biological databases make biological data available to scientists in one place and
help them obtain data for their research and for cross-validation.
▪ Biological databases are available in computer-readable format and thus form the first
fundamental step of biological data analysis.
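As a minimal sketch of retrieval "via a variety of search criteria", the Python example below uses the standard-library sqlite3 module with an in-memory table. The table name, the column names, the `search` helper and the three records are hypothetical placeholders and are not taken from any real biological database.

```python
import sqlite3

# In-memory database; the table layout and records are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (accession TEXT PRIMARY KEY, organism TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        ("DEMO0001", "Homo sapiens", "insulin precursor"),
        ("DEMO0002", "Mus musculus", "insulin precursor"),
        ("DEMO0003", "Homo sapiens", "hemoglobin subunit beta"),
    ],
)

def search(organism=None, keyword=None):
    """Return accessions matching any combination of the given criteria."""
    clauses, params = [], []
    if organism is not None:
        clauses.append("organism = ?")
        params.append(organism)
    if keyword is not None:
        clauses.append("description LIKE ?")
        params.append(f"%{keyword}%")
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    rows = conn.execute(
        "SELECT accession FROM entries" + where + " ORDER BY accession", params
    )
    return [row[0] for row in rows]

by_organism = search(organism="Homo sapiens")
by_keyword = search(keyword="insulin")
by_both = search(organism="Homo sapiens", keyword="insulin")
```

The same stored records answer three different questions (by organism, by keyword, or both), which is the property the definition above describes.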
Classification of Biological Database
Biological databases are broadly classified into nine categories based on the
composition of the data types.
1. Bibliographic Database
A bibliographic database is a scientific literature database consisting of numerous
research papers and articles from various journals.
▪ The RefSeq database, for example, is an open-access, annotated and curated collection of publicly
available nucleotide sequences (DNA, RNA) and their protein products.
▪ The National Center for Biotechnology Information Reference Sequence (NCBI RefSeq) database
provides curated non-redundant sequences of genomic regions, transcripts and proteins for
taxonomically diverse organisms including Archaea, Bacteria, Eukaryotes, and Viruses.
▪ The RefSeq database is derived from the sequence data available in the redundant archival
database GenBank. RefSeq records include annotation of coding regions, conserved domains,
variations, etc.
▪ Nucleic acid sequence databases include GenBank, EMBL-Bank (European Molecular Biology
Laboratory) and DDBJ (DNA Data Bank of Japan).
▪ Protein sequence and structure databases include Entrez Protein, Swiss-Prot, Protein Data Bank (PDB),
Molecular Modelling Database (MMDB), Gene3D and the EMBL Macromolecular Structure Database.
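RefSeq accessions begin with a two-letter prefix and an underscore that indicate the type of molecule the record describes. The sketch below maps a subset of these prefixes to a short description. The function name and the dictionary are illustrative, the list of prefixes is not exhaustive, and the accession strings in the usage lines are used only for their prefixes.

```python
# Subset of RefSeq accession prefixes; not an exhaustive list.
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA model",
    "XR": "predicted non-coding RNA model",
    "XP": "predicted protein model",
}

def refseq_molecule_type(accession):
    """Describe a RefSeq-style accession by inspecting its prefix only."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        # RefSeq accessions always contain an underscore after the prefix.
        return "not a RefSeq-style accession"
    return REFSEQ_PREFIXES.get(prefix.upper(), "prefix not covered by this sketch")

transcript_type = refseq_molecule_type("NM_000546.6")
protein_type = refseq_molecule_type("NP_000537.3")
other_type = refseq_molecule_type("U12345")
```

The underscore is a quick way to tell a RefSeq record from an archival GenBank record, whose accessions do not contain one.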
DNA Databases
▪ A DNA database manages DNA data from many species or from a few
specific species.
▪ GenBank contains over 184 billion nucleotide bases in more than 179
million sequences.
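To give a feel for the scale of these figures, the short calculation below estimates the average length of a GenBank record. Both inputs are the rounded lower bounds quoted above ("over" and "more than"), so the result is only a rough indication.

```python
# Rounded lower-bound figures quoted in the text above.
total_bases = 184_000_000_000      # "over 184 billion nucleotide bases"
total_sequences = 179_000_000      # "more than 179 million sequences"

# Rough average record length in bases.
average_length = total_bases / total_sequences
print(f"Average record length: about {average_length:.0f} bases")
```

On these figures the average record is on the order of a thousand bases long.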
Protein Databases
▪ A representative example of a protein database is PDB, the main primary database for 3D
structures of biological macromolecules determined by X-ray crystallography and NMR.
▪ PDB contains more than 105,465 biological macromolecular structures, of which 27,393
entries belong to human (http://www.rcsb.org/pdb).
▪ Another example is the Universal Protein Resource (UniProt), a collaborative project between
EMBL-EBI, the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR).
▪ The KEGG pathway database contains graphical pathway maps for all known metabolic
pathways from various organisms.
▪ KEGG pathway integrates many entities that are stored in KEGG sibling databases,
including genes, proteins, RNAs, chemical compounds, and chemical reactions.
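The idea of a pathway record that integrates entities stored in sibling databases can be sketched as a record that holds cross-references by identifier. The class, field names and identifiers below are hypothetical placeholders chosen for illustration. They do not reproduce the actual KEGG schema or real KEGG identifiers.

```python
from dataclasses import dataclass, field

@dataclass
class PathwayMap:
    """A pathway entry that points to entities kept in sibling tables."""
    pathway_id: str
    name: str
    gene_ids: set = field(default_factory=set)
    compound_ids: set = field(default_factory=set)
    reaction_ids: set = field(default_factory=set)

def pathways_containing(pathways, compound_id):
    """Return the IDs of all pathways that reference the given compound."""
    return sorted(p.pathway_id for p in pathways if compound_id in p.compound_ids)

# Placeholder records; every identifier here is made up.
demo_pathways = [
    PathwayMap("path:demo1", "demo pathway 1",
               gene_ids={"gene:g1", "gene:g2"},
               compound_ids={"cpd:x1", "cpd:x2"},
               reaction_ids={"rn:r1"}),
    PathwayMap("path:demo2", "demo pathway 2",
               gene_ids={"gene:g3"},
               compound_ids={"cpd:x2", "cpd:x3"},
               reaction_ids={"rn:r2"}),
]

shared = pathways_containing(demo_pathways, "cpd:x2")
unique = pathways_containing(demo_pathways, "cpd:x1")
```

Because the pathway stores only identifiers, the details of each gene, compound or reaction stay in their own database and can be looked up when needed.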
5. Disease databases
▪ These are dedicated sources of disease-related information. For example, OMIM (Online Mendelian
Inheritance in Man) provides data about human genes and genetic disorders.
▪ The Genetic Association Database is another popular disease database, containing data on human
genetic association studies of complex diseases and disorders.
▪ This database helps in rapidly identifying medically relevant polymorphisms from large volumes of
polymorphism and mutation data, and therefore has significant therapeutic value.
▪ The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium
(ICGC) are further examples of disease databases.
▪ TCGA aims to collect a wide diversity of omics data (including exome, SNP, mRNA, miRNA,
and methylation) for more than 20 different types of human cancer.
Primary databases, also called archival databases, accept original data from researchers
with relatively little checking or validation. They contain original submissions from researchers
and are populated with experimentally derived data such as nucleotide sequences, protein
sequences or macromolecular structures.
▪ Once given a database accession number, the data in primary databases are never changed:
they form part of the scientific record.
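The archival behaviour described above, in which a record is fixed once it has an accession number, can be sketched as an append-only store. The class name, the method names and the "DEMO" accession format are hypothetical and do not correspond to any real database's interface.

```python
class PrimaryArchive:
    """Append-only store: records are added but never modified in place."""

    def __init__(self):
        self._records = {}
        self._next_number = 1

    def submit(self, sequence):
        """Store a new submission and return its newly assigned accession."""
        accession = f"DEMO{self._next_number:06d}"
        self._next_number += 1
        self._records[accession] = sequence
        return accession

    def fetch(self, accession):
        """Return the record exactly as it was submitted."""
        return self._records[accession]

    def overwrite(self, accession, sequence):
        """Refuse any attempt to change an archived record."""
        raise PermissionError(f"{accession} is archived and cannot be changed")

archive = PrimaryArchive()
first = archive.submit("ATGGCC")
second = archive.submit("ATGTTT")
```

A corrected sequence would go in through `submit` as a new record, so the original submission stays available as part of the scientific record.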
Examples
3. However, many data resources have both primary and secondary characteristics. For example,
UniProt accepts primary sequences derived from peptide sequencing experiments. However, UniProt
also infers peptide sequences from genomic information, and it provides a wealth of additional
information, some derived from automated annotation (TrEMBL), and even more from careful manual
analysis (Swiss-Prot).
7. Chemical Databases:
▪ PubChem, maintained by NCBI, contains substance descriptions of small molecules with fewer
than 1000 atoms and 1000 bonds.
▪ ChEMBL data are manually curated from the published literature, together with data drawn
from other databases. The data are standardized for use in many types of chemical
biology and drug-discovery research problems.
▪ The ChEMBL database can be accessed through a web-based interface that provides a variety
of search and browsing functions.
▪ ChEMBL data are freely available from its FTP site in Oracle, MySQL,
PostgreSQL, structure-data file (SDF), FASTA and RDF formats.
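Of the formats listed above, SDF is plain text in which each compound record ends with a line containing only `$$$$`, and the first line of a record holds the molecule name. The sketch below splits SDF text into records on that delimiter. The function name is illustrative, and the sample text is a made-up stand-in with the molfile body omitted, not real ChEMBL data.

```python
def split_sdf_records(text):
    """Split SDF text into records, using the '$$$$' line as the terminator."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Lines after the last '$$$$' do not form a complete record and are dropped.
    return records

# Made-up sample: two records with placeholder names and no real structure data.
sample = (
    "demo_molecule_1\n"
    "  (molfile body omitted)\n"
    "$$$$\n"
    "demo_molecule_2\n"
    "  (molfile body omitted)\n"
    "$$$$\n"
)

records = split_sdf_records(sample)
names = [record.splitlines()[0] for record in records]
```

For real analyses a cheminformatics toolkit would normally do the parsing. The sketch only shows how the file is organized.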
Submission to Database
Investigators are encouraged to submit their newly obtained sequences directly to a member of the
International Nucleotide Sequence Database Collaboration, such as GenBank at NCBI
(http://www.ncbi.nlm.nih.gov).