Lecture 4 Biological Databases
Biological Databases & Role in Bioinformatics
by
Dr. Aditya Kumar Padhi, Ph.D.
• Types of databases
• Pitfalls
• Information retrieval
• Conclusion
Need for Databases
• One of the hallmarks of modern genomic research is the generation of enormous
amounts of raw sequence data (DNA & Protein).
• Thus, the very first challenge in the genomics era is to store and handle the
staggering volume of information through the establishment and use of computer
databases.
• We will go through some basic concepts related to databases, the types, designs,
and architectures of biological databases.
Database
• A database is a computerized archive used to store and organize data in such a way that
information can be retrieved easily via a variety of search criteria.
• Databases are composed of computer hardware and software for data management.
• The chief objective of the development of a database is to organize the data in a set of structured
records for easy retrieval of information.
• Although data retrieval is the main purpose of all databases, biological databases often have a
higher-level requirement, known as knowledge discovery: the identification of connections
between pieces of information that were not known when the information was first entered.
• For example, databases containing raw sequence information can perform extra computational
tasks to identify sequence homology or conserved motifs. These features facilitate the discovery
of new biological insights from raw data.
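A minimal sketch of this kind of extra computational task, assuming a tiny in-memory set of sequence records. The accession labels, the sequences and the TATA-like pattern are made up for illustration and do not come from any real database.

```python
import re

# Hypothetical raw sequence records (accession -> DNA sequence); not real entries.
records = {
    "SEQ001": "GGCTATAAAGGCTTACG",
    "SEQ002": "AATGCGTATAGGCAG",
    "SEQ003": "CCGGTTAACCGGTTAA",
}

# A simple TATA-box-like pattern, chosen only for illustration.
motif = re.compile(r"TATA[AT]A")


def find_motif(records, motif):
    """Return {accession: [start positions]} for records that contain the motif."""
    hits = {}
    for accession, sequence in records.items():
        positions = [m.start() for m in motif.finditer(sequence)]
        if positions:
            hits[accession] = positions
    return hits


hits = find_motif(records, motif)
print(hits)
```

The connection "these records share a motif" was never entered by anyone; it is derived from the stored raw data, which is the sense of knowledge discovery described above.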
Types of Databases
• To facilitate the access and retrieval of data, sophisticated computer software programs for
organizing, searching, and accessing data have been developed.
• These systems contain not only raw data records but also operational instructions to help
identify hidden connections among data records.
• RDBMS (relational database management systems)
• OODBMS (object-oriented database management systems)
RDBMS
• Originally, databases all used a flat-file format, which is a long text file that contains many
entries separated by a delimiter, a special character such as a vertical bar (|).
• The text file can be considered a single table. Thus, to search a flat file for a particular piece of
information, a computer has to read through the entire file (obviously an inefficient process).
• Instead of using a single table as in a flat-file database, relational databases use a set of tables to
organize data. Each table, also called a relation, is made up of columns and rows.
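To make the flat-file limitation concrete, here is a small sketch, assuming a made-up three-entry file with fields separated by a vertical bar. The names, states and courses are hypothetical. Note that the search has to visit every line.

```python
# Hypothetical flat file: one entry per line, fields separated by a vertical bar.
flat_file = """\
Alice|Texas|Biology
Bob|Ohio|Chemistry
Carol|Texas|Mathematics
"""


def search_flat_file(text, state):
    """Scan every line and return (name, course) pairs for the given state."""
    matches = []
    for line in text.splitlines():
        name, home_state, course = line.split("|")
        if home_state == state:
            matches.append((name, course))
    return matches


texas = search_flat_file(flat_file, "Texas")
print(texas)
```

The cost of this scan grows with the size of the file, which is why a single long text file becomes impractical for large collections.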
RDBMS
• Example of constructing a relational database for five students’ course information originally expressed
in a flat file.
• By creating 3 different tables linked by common fields, data can be easily accessed and reassembled.
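The idea of three tables linked by common fields can be sketched with Python's built-in sqlite3 module. The table layout, column names and student data below are assumptions for illustration, since the original figure is not reproduced here.

```python
import sqlite3

# In-memory database; the schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT, state TEXT);
    CREATE TABLE courses (course_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES students(student_id),
        course_id INTEGER REFERENCES courses(course_id)
    );
    INSERT INTO students VALUES (1, 'Alice', 'Texas'), (2, 'Bob', 'Ohio'), (3, 'Carol', 'Texas');
    INSERT INTO courses VALUES (10, 'Biology'), (20, 'Chemistry'), (30, 'Mathematics');
    INSERT INTO enrollment VALUES (1, 10), (2, 20), (3, 30), (3, 10);
    """
)

# Join the three tables on their common fields to reassemble the answer.
rows = conn.execute(
    """
    SELECT DISTINCT c.title
    FROM students AS s
    JOIN enrollment AS e ON e.student_id = s.student_id
    JOIN courses AS c ON c.course_id = e.course_id
    WHERE s.state = ?
    ORDER BY c.title
    """,
    ("Texas",),
).fetchall()

texas_courses = [title for (title,) in rows]
print(texas_courses)
conn.close()
```

Each fact is stored once (a student's state, a course title), and the join reassembles them on demand instead of repeating them in every line of a flat file.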
OODBMS
• Programming languages like C++ are used to create object-oriented databases.
• The objects are linked by a set of pointers defining predetermined relationships between the
objects.
• Question: which courses are students from Texas taking?
• Example of the construction and query of an OODBMS using the same student information.
Objects are constructed and are linked by pointers shown as arrows. Finding specific
information relies upon simple navigation through the objects by way of pointers.
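The pointer-navigation idea can be sketched in Python, where object references play the role of pointers. This is not a real OODBMS; the class names and student data are hypothetical and only illustrate how a query becomes a walk from object to object.

```python
from dataclasses import dataclass, field


@dataclass
class Course:
    title: str


@dataclass
class Student:
    name: str
    courses: list = field(default_factory=list)  # references to Course objects


@dataclass
class State:
    name: str
    students: list = field(default_factory=list)  # references to Student objects


# Build the objects and link them; the links are predetermined relationships.
biology = Course("Biology")
chemistry = Course("Chemistry")
mathematics = Course("Mathematics")

alice = Student("Alice", [biology])
bob = Student("Bob", [chemistry])
carol = Student("Carol", [mathematics, biology])

texas = State("Texas", [alice, carol])
ohio = State("Ohio", [bob])

# Answer the question by following references: state -> students -> courses.
texas_course_titles = sorted(
    {course.title for student in texas.students for course in student.courses}
)
print(texas_course_titles)
```

No table is searched here. The answer is reached only because the links from state to student to course were defined in advance, which is both the strength and the rigidity of this design.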
Data types in Biology
[Figure: primary data (a sequence, e.g., the DNA sequence AATGCGTATAGGCAG) is stored in a primary database.]
• Gene sequences
• Amino acid sequences in proteins
• Motifs and domains in proteins
• Structural data from X-ray diffraction (XRD) and NMR
• Metabolic pathways
• Protein-protein interactions
• Gene expression data (e.g., from DNA microarrays)
• Thus, storing and handling this staggering volume of information is a major challenge of the
current genomics era.
• Biological databases address this challenge: they allow data indexing and help remove data
redundancy.
Components of a biological database
• Similar to other databases, a biological database also has certain basic components.
a) Entity - An entity refers to the item we want to store in a database. e.g., DNA sequences, Genes,
Bibliographic references, etc.
b) Fields - The properties of an entity are called fields. e.g., Gene name, gene sequence, mutation (if
any), etc.
c) Records - A record is the combination of all the fields for a given entity, e.g., the record
for the gene BRCA1 in GenBank.
d) Identifier - The unique name that identifies a record, e.g., an accession number.
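The four components above can be sketched as a small Python data structure. The class name, field names, accession labels and sequences are hypothetical placeholders, not a real database schema or real gene data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GeneRecord:
    """One record: all fields stored for a single gene entity."""

    accession: str                  # identifier: unique name of the record
    gene_name: str                  # field
    sequence: str                   # field
    mutation: Optional[str] = None  # field, present only if known


# Hypothetical entries; the accessions and sequences are placeholders.
entries = [
    GeneRecord("DEMO0001", "geneA", "ATGGCGTACGTTAG"),
    GeneRecord("DEMO0002", "geneB", "ATGCCCTTTAAATGA", mutation="C5T"),
]

# Index the records by identifier so that lookup does not require a full scan.
database = {record.accession: record for record in entries}

record = database["DEMO0002"]
print(record.gene_name, record.mutation)
```

The dictionary keyed by accession plays the role of an index: given the identifier, the whole record with all its fields is retrieved directly.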
• Primary:
• Once given a database accession number, the data in primary databases are never changed: they
form part of the scientific record.
• Example: 1) Swiss-Prot, PIR (protein sequences), 2) GenBank, DDBJ (nucleotide sequences), 3) Protein Data
Bank (protein 3D structures).
• Secondary:
• Secondary databases are derived from the analysis or treatment of primary data.
• They are highly curated, often using a complex combination of computational algorithms and manual
analysis.
• Example: 1) InterPro (protein families, motifs and domains), 2) UniProt Knowledgebase (sequence and
functional information of proteins), 3) Ensembl (variation, function, regulation and more layered onto whole
genome sequences).
Classification of databases
• Nucleotide databases
• Protein databases
UniProt
RCSB PDB
CATH
• A free, publicly available online resource that provides information on the evolutionary
relationships of protein domains.
SCOPe
STRING
OMIM
India is not lagging behind!
Suggested reading:
1. A repository of web-based bioinformatics resources developed in India, Abhishek Agarwal, Piyush Agrawal, Aditi Sharma, Vinod Kumar, Chirag Mugdal, Anjali Dhall, Gajendra P.S. Raghava, bioRxiv 2020.01.21.855627; doi: https://doi.org/10.1101/2020.01.21.855627
2. https://www.natureasia.com/en/nindia/article/10.1038/nindia.2015.118
3. https://bioinformaticsreview.com/20190210/india-ranks-4th-among-the-top-20-bioinformatics-database-contributors-in-the-world/
Thank you