MBG1002 Biological Databases Week II

The document provides an overview of biological databases, defining them as collections of biochemical and biological information that are structured, searchable, and updated periodically. It highlights the explosive growth of biological data due to advancements in Next-Generation Sequencing technologies and the importance of various types of databases, including sequence and genomic databases. Additionally, it discusses the challenges associated with general sequence databases, such as data redundancy and accuracy, and emphasizes the need for highly curated databases.

Uploaded by

efebrek

Introduction to Bioinformatics

MBG1002
Genomics data mining
(Biological Databases)

Assistant Prof. Cemalettin Bekpen

Sandra Porter
• What is a Database ?
• Definition 1: the collection, classification, storage, and analysis of biochemical and biological
information using computers especially as applied to molecular genetics and genomics
(Merriam-Webster dictionary)
• A collection of data that is:
• structured
• searchable (index) → table of contents
• updated periodically (release) → new edition
• cross-referenced (hyperlinks) → links with other db data
Bioinformatics
Bioinformatics database: bioinformatics is a field that works on problems at the
intersection of biology, computer science, and statistics; a bioinformatics database
stores this information in a database structure


Archana Bhardwaj
• Also includes the associated tools (software) needed to
access and update the db, and to insert or delete db
information

• Why biological databases ?


• Explosive growth in biological data
• Data (sequences, 3D structures, 2D gel analysis, Expression
analysis) are no longer published in a conventional manner,
but directly submitted to databases
• Essential tools for biological research, as classical publications
used to be !
https://www.enago.com/academy/biological-databases-an-overview-and-future-perspectives/
Bioinformatic Databases
The amount of data deposited in
databanks is increasing rapidly due
to the availability of NGS (Next-
Generation Sequencing)
technologies.

E. Hemond
L., Ghambir
• Biological Databases
• More than 1000 different databases
• Generally accessible through the web
• Variable size: <100kb to > 10 Gb
• DNA > 10 Gb
• Protein: 1Gb
• 3D structure: 5Gb
• Update frequency: daily to annually
• Different kinds of bioinformatic databases
• General Purpose
• Data type specific (structure, expression)
• Organism specific (human, yeast)
• Pathway Information
• Specialized data
• Categories of Databases
• Bibliography
• Sequences (DNA, protein)
• Genomics
• Protein domain/family
• Mutation/polymorphism
• Proteomics (2D gel, MS)
• 3D structure
• Metabolic networks
• Regulatory networks
• others
Other Databases
Databases

Archana Bhardwaj
• Bibliographic databases

• Bibliographic reference databases contain citations and
abstracts of published life science articles

• Example: PubMed, developed by the National Center
for Biotechnology Information

• PubMed provides access to bibliographic information
such as MEDLINE, PreMEDLINE, HealthSTAR, and to
integrated molecular biology databases (composite db)
• PubMed (Medline)
• MEDLINE covers the fields of medicine, nursing, dentistry,
veterinary medicine, the health care system, and the preclinical
sciences
• Contains citations from approximately 5,200 journals worldwide in
37 languages; 60 languages for older journals
• Contains over 20 million citations from 1948 to the present
• Contains links to biological db and to some journals
• New records are added to PreMEDLINE daily
• A search by subject: “Aging”
• A search by authors: “Bekpen”
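The two searches above can also be expressed as query URLs. The following is a minimal sketch, assuming the NCBI E-utilities ESearch endpoint and the PubMed `[Author]` field tag, neither of which is mentioned in the slides. The helper name `pubmed_search_url` is made up for illustration, and the code only builds the URLs without sending any request.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (an assumption; the slides only show the web interface).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, field=None, retmax=20):
    """Build an ESearch URL for PubMed; `field` is an optional PubMed field tag."""
    # A field tag restricts the search, e.g. "Bekpen[Author]".
    query = f"{term}[{field}]" if field else term
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Search by subject and by author, mirroring the two slide examples.
subject_url = pubmed_search_url("Aging")
author_url = pubmed_search_url("Bekpen", field="Author")
print(subject_url)
print(author_url)
```

Opening either URL in a browser should return an XML list of matching PubMed identifiers; fetching the records themselves would need a further request.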
• Sequence Databases
• Sequences (Gene, RNA, Protein, Genome)
• Accession number (AC)
• References
• Taxonomic data
• Annotation
• Keywords
• Cross references
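The fields listed above can be pictured as one record per sequence entry. This is only an illustrative sketch: the class name `SequenceRecord`, its attribute names, and the accession `EXAMPLE0001` are hypothetical and do not reproduce the actual schema of GenBank, EMBL-EBI, or DDBJ.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """Hypothetical in-memory model of one sequence database entry."""
    accession: str                                         # accession number (AC)
    sequence: str                                          # gene, RNA, protein, or genome sequence
    organism: str = ""                                     # taxonomic data
    references: list = field(default_factory=list)         # literature references
    annotation: dict = field(default_factory=dict)         # feature name -> description
    keywords: list = field(default_factory=list)
    cross_references: dict = field(default_factory=dict)   # other db name -> identifier


# Made-up example entry; the accession is not a real database identifier.
records = [
    SequenceRecord(
        accession="EXAMPLE0001",
        sequence="ATGGCC",
        organism="Homo sapiens",
        keywords=["example"],
    )
]

# Index the entries by accession number so that lookups are direct.
by_accession = {record.accession: record for record in records}
print(by_accession["EXAMPLE0001"].organism)
```

Indexing by accession number reflects why the accession is the stable handle for an entry: every other field can be revised while the key stays the same.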
Sources of data: research groups (direct submission),
literature supplementary information, genome sequencing
institutes, patents
Main databases: GenBank, EMBL-EBI, DDBJ
(These three databases exchange information routinely)
Benson DA, et al., 2012
https://www.ebi.ac.uk/

https://www.ebi.ac.uk/services
https://www.ddbj.nig.ac.jp/index-e.html
• General Sequence Databases – disadvantages
• Huge amount of Data
• Sequence redundancy (archive: nothing is ever removed, so it is highly
redundant)
• Sequence accuracy (many errors in sequences, in
annotations, and in CDS attributions)
• Sequence annotations (no consistency of annotations;
most annotations are done by the submitters)
• Heterogeneity of the quality
• Sequence contamination
• The solution:
• Highly curated
databases of non-
redundant sequences
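To make the idea of redundancy concrete, the sketch below removes entries whose sequences are exactly identical. It is a deliberately simplified illustration with made-up entry names; curated resources rely on expert review and much more than exact-duplicate removal, so this should not be read as how any real database is built.

```python
def remove_exact_duplicates(records):
    """Keep the first (identifier, sequence) pair for each distinct sequence.

    Sequences are compared case-insensitively; near-duplicates are NOT detected.
    """
    seen = set()
    unique = []
    for identifier, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((identifier, sequence))
    return unique


# Hypothetical archive in which two submitters deposited the same sequence.
archive = [
    ("entry1", "ATGGCCATT"),
    ("entry2", "atggccatt"),
    ("entry3", "ATGGCCAAA"),
]
non_redundant = remove_exact_duplicates(archive)
print(non_redundant)
```

Here `entry2` is dropped because it repeats `entry1`, while `entry3` is kept because it differs by two bases.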

https://www.researchgate.net/figure/How-complete-and-reusable-are-publicly-archived-data-in-ecology-and-evolution-The_fig6_283717086
https://www.ncbi.nlm.nih.gov/refseq/
• Sequence format: GenBank format
• Sequence format: FASTA format
>Sequence Name,
[sequence …….]
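As a rough illustration of the FASTA layout shown above (a header line starting with `>` followed by one or more sequence lines), here is a minimal parser sketch. The function name `parse_fasta` and the example records are made up for illustration; established libraries handle many more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Hypothetical two-record example.
example = """>seq1 hypothetical example
ATGGCC
ATTGTA
>seq2
GGGCCC
"""
parsed = parse_fasta(example)
print(parsed)
```

The wrapped lines of `seq1` are joined into one string, which is the main thing to remember when reading FASTA files by hand.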
http://www.rcsb.org
• Genomic databases
• Contain information on genes, gene location (mapping),
gene nomenclature, and links to sequence databases

• Exist for most organisms important for life science
research

• Examples: UCSC, GDB (human), MGI (mouse), FlyBase
(Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B.
subtilis), etc.
https://genome.ucsc.edu
https://genome-euro.ucsc.edu/cgi-bin/hgGateway?redirect=manual&source=genome.ucsc.edu
http://www.informatics.jax.org
https://www.ncbi.nlm.nih.gov/sra
https://www.ncbi.nlm.nih.gov/genbank/wgs/
https://www.ncbi.nlm.nih.gov/genbank/tsa/
• Summary

• What is the best db for sequence analysis?

• Which contains the highest-quality data?

• Which is the most comprehensive?

• Which is the most up to date?

• Which is the least redundant?

• Which is the best indexed (allows complex queries)?

• Which web server responds most quickly?


https://academic.oup.com/nar/article/47/D1/D1/5280358
https://academic.oup.com/nar/article/47/W1/W1/5524725
