MBG1002 Biological Databases Week II
MBG1002
Genomics data mining
(Biological Databases)
Sandra Porter
• What is a Database?
• Definition 1 (of bioinformatics): the collection, classification, storage, and analysis of biochemical and biological
information using computers especially as applied to molecular genetics and genomics
(Merriam-Webster dictionary)
• A collection of data that is
• structured
• searchable (index) → table of contents
• updated periodically (release) → new edition
• cross-referenced (hyperlinks) → links with other db data
Bioinformatics
Bioinformatics database: bioinformatics is a field that works on problems at the intersection of
biology, computer science, and statistics; a bioinformatics database stores the resulting information in a structured form
Archana Bhardwaj
• Also includes the associated tools (software) needed for
db access, db updating, db information insertion, and db
information deletion
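The four operations listed above (access, update, insertion, deletion) can be sketched with Python's built-in `sqlite3` module. This is only a minimal illustration: the `sequences` table, its columns, and the `DEMO…` accessions are hypothetical and do not reflect the schema of any real biological database.

```python
import sqlite3

# Hypothetical schema for illustration only; not the layout of any real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences ("
    " accession TEXT PRIMARY KEY,"
    " organism TEXT NOT NULL,"
    " residues TEXT NOT NULL)"
)
# An index makes the table searchable by organism without scanning every row.
conn.execute("CREATE INDEX idx_organism ON sequences (organism)")

# Insertion: add made-up records.
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("DEMO0001", "Homo sapiens", "ATGGCC"),
        ("DEMO0002", "Saccharomyces cerevisiae", "ATGTTT"),
        ("DEMO0003", "Homo sapiens", "ATGAAA"),
    ],
)

# Updating: correct the residues of one record.
conn.execute(
    "UPDATE sequences SET residues = ? WHERE accession = ?",
    ("ATGGCCTAA", "DEMO0001"),
)

# Deletion: remove one record.
conn.execute("DELETE FROM sequences WHERE accession = ?", ("DEMO0002",))

# Access: query by organism.
human = conn.execute(
    "SELECT accession, residues FROM sequences WHERE organism = ? ORDER BY accession",
    ("Homo sapiens",),
).fetchall()
print(human)
```

Public databases are normally reached through their web interfaces rather than direct SQL, so this only shows the kind of operations the associated software has to support.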
E. Hemond
L., Ghambir
• Biological Databases
• More than 1000 different databases
• Generally accessible through the web
• Variable size: <100 kB to >10 GB
• DNA: >10 GB
• Protein: 1 GB
• 3D structure: 5 GB
• Update frequency: daily to annually
• Different kinds of bioinformatic databases
• General Purpose
• Data type specific (structure, expression)
• Organism specific (human, yeast)
• Pathway Information
• Specialized data
• Categories of Databases
• Bibliography
• Sequences (DNA, protein)
• Genomics
• Protein domain/family
• Mutation/polymorphism
• Proteomics (2D gel, MS)
• 3D structure
• Metabolic networks
• Regulatory networks
• others
[Figure: Other Databases. Credit: Archana Bhardwaj]
• Bibliographic databases
https://www.ebi.ac.uk/services
https://www.ddbj.nig.ac.jp/index-e.html
• General Sequence Databases – disadvantages
• Huge amount of data
• Sequence redundancy (archive: entries are never removed, so the data are highly
redundant)
• Sequence accuracy (many errors in sequences, in
annotations, and in CDS attributions)
• Sequence annotations (no consistency of annotations;
most annotations are done by the submitters)
• Heterogeneity of the quality
• Sequence contamination
• The solution:
• Highly curated databases of non-redundant sequences
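One small part of what "non-redundant" means can be sketched in Python: collapsing entries whose sequences are exactly identical. The function name and the sample records are hypothetical, and real curation involves far more than this (near-duplicates, conflicting annotations, manual review).

```python
def remove_exact_duplicates(records):
    """Keep the first (name, sequence) pair for each distinct sequence.

    Comparison is case-insensitive; only exact duplicates are collapsed.
    """
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique


# Made-up records: entry_b repeats entry_a in lower case.
records = [("entry_a", "ATGGCC"), ("entry_b", "atggcc"), ("entry_c", "ATGAAA")]
unique = remove_exact_duplicates(records)
print(unique)
```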
https://www.researchgate.net/figure/How-complete-and-reusable-are-publicly-archived-data-in-ecology-and-evolution-The_fig6_283717086
https://www.ncbi.nlm.nih.gov/refseq/
• Sequence format: GenBank format
• Sequence format: FASTA format
>Sequence Name
[sequence …….]
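The FASTA layout above (a `>` header line followed by one or more sequence lines) is simple enough to read with a few lines of Python. The sketch below is a minimal reader written for illustration; `parse_fasta` and the sample records are hypothetical, and it assumes header names are unique. For real work an established library such as Biopython is the usual choice.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its joined sequence."""
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped; collect them all.
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Made-up example input with a wrapped sequence.
example = """>demo_seq_1 hypothetical example
ATGGCC
TAA
>demo_seq_2
ATGAAA
"""
fasta_records = parse_fasta(example)
print(fasta_records)
```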
http://www.rcsb.org
• Genomic databases
• Contain information on genes, gene location (mapping),
gene nomenclature, and links to sequence databases