Introduction To Databases
As biology becomes an increasingly data-rich science, the need for storing and
communicating large datasets has grown tremendously (e.g. nucleotide and protein
sequences, three-dimensional structures from X-ray crystallography and NMR). A biological
database is a collection of data that is organized so that its contents can easily be accessed,
managed and updated. Biological databases play a fundamental role in bioscience, particularly
in bioinformatics. They give scientists access to sequence and structure data
for tens of thousands of sequences from a broad range of organisms. Biological databases
are an invaluable resource in support of biological research.
Biological data are obtained at every level, from nucleotides to networks, which gives rise to
diverse classes of biological databases, such as the sequence and structure databases described
later in this chapter.
Definition of a Biological Database
A biological database is a collection of data that is organized so that its contents can easily be
accessed, managed, and updated.
Why do we need biological databases?
One of the hallmarks of modern genomic research is the generation of enormous amounts of
raw sequence data. As the volume of genomic data grows, sophisticated computational
methodologies are required to manage the data. Thus, the very first challenge in the genomics
era is to store and handle the overwhelming volume of information through the establishment
of computer databases. The development of databases to handle the vast amount of molecular
biological data is thus a fundamental task of bioinformatics. This chapter introduces some
basic concepts related to databases, in particular, the types, designs, and architectures of
biological databases.
• Availability of biological data to the scientific community: Databases store, organize and share
data in a structured and searchable manner to facilitate data retrieval and visualization.
• Databases are composed of tables of data: A table is the same thing as a spreadsheet: a set of
rows and columns.
• Each table has several records (rows): A record stores all the information for a given
individual.
• Each record has several fields (columns): A field is an individual piece of data, a single
attribute of the record.
• Each record has a unique identifier, the primary key: A primary key serves to identify the
data stored in this record across all the tables in the database.
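The table, record, field and primary key ideas above can be sketched with Python's built-in sqlite3 module. This is only an illustrative sketch: the table name, column names, accession numbers and sequences below are invented for the example and do not come from any real biological database.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sequence (
        accession TEXT PRIMARY KEY,  -- primary key: unique identifier of a record
        organism  TEXT NOT NULL,     -- field: source organism
        molecule  TEXT NOT NULL,     -- field: 'DNA' or 'protein'
        residues  TEXT NOT NULL      -- field: the sequence itself
    )
    """
)

# Each tuple is one record (row); each position is one field (column).
records = [
    ("DEMO0001", "Escherichia coli", "DNA", "ATGACCATGATTACG"),
    ("DEMO0002", "Homo sapiens", "protein", "MVLSPADKTNVKAAW"),
]
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?, ?)", records)


def fetch_by_accession(accession):
    """Return the record with this primary key as a tuple, or None if absent."""
    cur = conn.execute(
        "SELECT accession, organism, molecule, residues "
        "FROM sequence WHERE accession = ?",
        (accession,),
    )
    return cur.fetchone()


def insert_is_rejected(record):
    """Return True if the database refuses the record (e.g. duplicate key)."""
    try:
        conn.execute("INSERT INTO sequence VALUES (?, ?, ?, ?)", record)
    except sqlite3.IntegrityError:
        return True
    return False
```

Calling fetch_by_accession("DEMO0002") retrieves exactly one record, while insert_is_rejected(("DEMO0001", "x", "DNA", "A")) shows that the database refuses a second record with an existing primary key.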
Sequence databases:
Contain nucleic acid and protein sequence information.
Structure databases:
Contain three-dimensional structures of proteins, nucleic acids, and macromolecular complexes.
These databases are important tools that help scientists analyze and explain a host of
biological phenomena, from the structure of biomolecules and their interactions to the whole
metabolism of organisms and the evolution of species. This knowledge supports the fight
against diseases, assists in the development of medications and the prediction of certain
genetic diseases, and helps in discovering basic relationships among species in the history of
life.
Sequences and structures are only two of the several different types of data required in the
practice of modern biology. Other important data types include metabolic pathway
networks and molecular interactions, mutations and polymorphisms in molecular sequences
and structures, organelle structures and tissue types, genetic maps, physicochemical
data, gene and mRNA expression profiles, and two-dimensional gel electrophoresis images of
protein expression.
Sequence and structure databases can be further classified into:
Primary
Secondary
Composite
Primary databases:
Consist of experimentally derived data alone, such as nucleotide sequences, protein sequences
and three-dimensional structures.
Examples include UniProtKB for protein sequences, GenBank and DDBJ for nucleotide
sequences, and the Protein Data Bank for protein structures.
Secondary databases:
Contain data derived from the analysis or treatment of primary data, such as
secondary structures, hydrophobicity plots, conserved sequences, signature sequences and
domains.
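As a small illustration of how secondary data are derived from primary data, the sketch below computes a sliding-window hydropathy profile (the numbers behind a hydrophobicity plot) from a protein sequence using the Kyte-Doolittle scale. The function name, the default window size and the example sequence are assumptions made for this example, not part of any particular database's pipeline.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Average hydropathy over each window of consecutive residues.

    The window size is an assumed default. Non-standard residue codes
    raise KeyError because they have no value on this scale.
    """
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    if window < 1 or window > len(scores):
        raise ValueError("window must be between 1 and the sequence length")
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(scores) - window + 1)
    ]


# Hypothetical primary record (a protein sequence) and its derived profile.
example_profile = hydropathy_profile("MVLSPADKTNVKAAW")
```

A sequence of length n yields n - window + 1 averaged values; plotting them against residue position gives the hydrophobicity plot that a secondary database might store.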
A secondary structure database contains detailed information on PDB entries in an organized
way, for example the structural classification of proteins by class, fold, superfamily, etc.
Most secondary databases are created and hosted by researchers at their individual
laboratories. Examples: SCOP, developed at Cambridge University; CATH, developed at
University College London; and BMCD, developed at NIST, USA.
Composite databases:
A composite database merges a variety of different primary database sources, which avoids the
need to search multiple resources. Different composite databases use different combinations of
primary databases and different criteria in their search algorithms.
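The merging idea can be sketched as follows, assuming the simplest possible criterion: two entries are treated as the same if their sequences are identical. Real composite databases apply their own, usually more elaborate, criteria; the source names, accession numbers and sequences here are hypothetical.

```python
def build_composite(*sources):
    """Merge (source_name, records) pairs into one non-redundant list.

    Each record is an (accession, residues) tuple. Entries with identical
    sequences are collapsed into one entry that keeps a cross-reference
    to every source it came from. Exact identity is an assumed criterion.
    """
    merged = {}
    for source_name, records in sources:
        for accession, residues in records:
            key = residues.upper()
            entry = merged.setdefault(key, {"sequence": key, "cross_refs": []})
            entry["cross_refs"].append((source_name, accession))
    return list(merged.values())


# Two hypothetical primary sources that share one sequence.
source_a = ("SourceA", [("A001", "MKTAYIAK"), ("A002", "MVLSPADK")])
source_b = ("SourceB", [("B17", "MKTAYIAK"), ("B18", "GSHMLEDP")])

composite = build_composite(source_a, source_b)
```

Four input records collapse into three composite entries, and the shared sequence carries cross-references to both sources, so a single search covers both.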
The National Center for Biotechnology Information (NCBI), which hosts nucleotide and protein
databases, also provides OMIM (Online Mendelian Inheritance in Man), an online,
comprehensive, authoritative compendium of human genes and genetic phenotypes.
Current Status
The Database Issue of the journal “Nucleic Acids Research” is freely available, and categorizes
many of the publicly available online databases related to biology and bioinformatics.
According to the 21st Nucleic Acids Research Database Issue, published in 2014, there
are 1552 databases that are publicly accessible online [ref]. The more recent 22nd Nucleic Acids
Research Database Issue reports the addition of 58 new molecular biology databases and
updates on 115 existing databases (Nucleic Acids Research, 2015, Vol. 43, Database issue,
D1–D5).
About Me
www.linkedin.com/in/shradheya-r-r-gupta-54492984
Enjoy learning!