Lec4 Databases

The document discusses several types of biological databases including: 1) Primary databases that contain original experimental data submissions controlled by submitters, as well as derivative databases derived from primary data controlled by third parties like NCBI. 2) Relational databases that organize information into tables with rows and columns to reduce data redundancy. 3) Major database providers like NCBI, EBI, and GenomeNet that host biological data. 4) Specific database types like nucleotide databases that store nucleic acid sequences, protein databases that include Uniprot, PDB, and Pfam, and pathway databases like KEGG.

Uploaded by

Ayesha Khan
Copyright
© All Rights Reserved

Biological Databases

Zoya Khalid
Data vs. Information

• Information is produced by processing data.
• Information is used to reveal the meaning in data.
• Accurate, timely and relevant information is the key to good decision making.
• Good decision making is the key to organizational survival.
What is a database?

• A structured collection of information.
• Consists of basic units called records or entries.
• Each record consists of fields, which hold pre-defined data related to the record.
• For example, a protein database would have protein entries as records and protein properties as fields (e.g., name of protein, length, amino-acid sequence).
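The record/field idea above can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord`, the entry name, and the sequence are made up for the example and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """One record (entry) of a toy protein database; each attribute is a field."""
    name: str
    sequence: str  # amino-acid sequence in one-letter codes

    @property
    def length(self) -> int:
        # Derived field: the number of residues in the sequence.
        return len(self.sequence)


# Hypothetical entry with a made-up name and sequence.
record = ProteinRecord(name="example protein", sequence="MKTAYIAKQR")
print(record.name, record.length)
```

Storing the length as a derived property rather than a separate value is one possible design choice: it keeps the length from drifting out of sync with the sequence.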
Types of databases

• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
– Examples: GenBank, Trace, SRA, SNP, GEO

• Derivative Databases
– Derived from primary data
– Content controlled by a third party such as NCBI (e.g. via algorithms)
– Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO datasets, UniGene, HomoloGene, Structure, Conserved Domain
A flat-file database
Why Flat Files?

• Flat files are the universal mechanism for moving data from one database or system to another.
• Two common types of flat files are CSV (comma-separated values) files and files that use another delimiter, such as a tab or a pipe character.
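As a minimal sketch of reading a delimited flat file, the Python snippet below parses pipe-delimited rows with the standard `csv` module. The rows are the ones used in this lecture's example; the column names in `FIELDS` are an assumption, since the flat file itself carries no header line.

```python
import csv
import io

# Pipe-delimited flat file; rows taken from the example in this lecture.
FLAT_FILE = """\
Omar|Farooq|Computer Science|NUCES|Islamabad
Hadiya|Ali|Electrical Engineering|FAST|Islamabad
Ahmed|Khan|Dept of Computer Science|NUCES|Isb
Ahmed|Khan|Dept of Management|NUST|Islamabad
"""

# Assumed column names (the file has no header row).
FIELDS = ["first_name", "last_name", "department", "institution", "address"]


def parse_flat_file(text, delimiter="|"):
    """Return one dict per line, keyed by the assumed field names."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS, delimiter=delimiter)
    return list(reader)


records = parse_flat_file(FLAT_FILE)
for rec in records:
    print(rec["last_name"], rec["institution"], rec["address"])
```

Passing `delimiter=","` or `delimiter="\t"` to the same function would read CSV or tab-delimited files.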
Relational databases
Relational database

• A relational database consists of relations (tables) containing attributes (fields or columns). Each row in a table is known as a record or tuple.
• Information should be ‘normalized’ so that it is non-redundant. This means that every row should be unique, although this ideal is not always observed.
Flat file (pipe-delimited):

Omar|Farooq|Computer Science|NUCES|Islamabad
Hadiya|Ali|Electrical Engineering|FAST|Islamabad
Ahmed|Khan|Dept of Computer Science|NUCES|Isb
Ahmed|Khan|Dept of Management|NUST|Islamabad

Single table:

First Name  Last Name  Institution  Department                Address
Omar        Farooq     NUCES        Computer Science          Islamabad
Hadiya      Ali        FAST         Electrical Engineering    Islamabad
Ahmed       Khan       NUCES        Dept of Computer Science  Isb
Ahmed       Khan       NUST         Dept of Management        Islamabad

Table Professor (primary key: Prof_id; foreign key: Contact_id)

Prof_id  First_name  Last_name  Contact_id
1        Omar        Farooq     1
2        Hadiya      Ali        2
3        Ahmed       Khan       1
4        Ahmed       Khan       3

Table Contacts (primary key: Contact_id)

Contact_id  Institution  Department              Address
1           NUCES        Computer Science        Islamabad
2           FAST         Electrical Engineering  Islamabad
3           NUST         Management              Islamabad
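The Professor/Contacts design can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the lowercase table and column names and the choice of SQLite are for illustration only, and the rows are the ones from the example tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE contacts (
    contact_id  INTEGER PRIMARY KEY,
    institution TEXT NOT NULL,
    department  TEXT NOT NULL,
    address     TEXT NOT NULL
);
CREATE TABLE professor (
    prof_id    INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL,
    contact_id INTEGER NOT NULL REFERENCES contacts(contact_id)
);
""")

# Each contact is stored once; professors point to it through contact_id.
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?, ?)",
    [
        (1, "NUCES", "Computer Science", "Islamabad"),
        (2, "FAST", "Electrical Engineering", "Islamabad"),
        (3, "NUST", "Management", "Islamabad"),
    ],
)
conn.executemany(
    "INSERT INTO professor VALUES (?, ?, ?, ?)",
    [
        (1, "Omar", "Farooq", 1),
        (2, "Hadiya", "Ali", 2),
        (3, "Ahmed", "Khan", 1),
        (4, "Ahmed", "Khan", 3),
    ],
)

# A join rebuilds the single-table view from the two normalized tables.
rows = conn.execute("""
    SELECT p.first_name, p.last_name, c.institution, c.department
    FROM professor AS p
    JOIN contacts AS c ON p.contact_id = c.contact_id
    ORDER BY p.prof_id
""").fetchall()

for row in rows:
    print(row)
```

Because the NUCES contact details live in a single row of `contacts`, correcting them once updates every professor who references that row.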
Types of databases
Database providers

• The National Center for Biotechnology Information (NCBI) offers data banks, databases and tools (USA).
• The European Bioinformatics Institute (EBI) performs a similar function in Europe.
• GenomeNet gathers several databases from Japan.
Data quality

• How is the data entered?
– Step-by-step protocol
• What is the evidence?
– Automatic validation
– Manual curation
• How new is the data?
• Can the data be secret?
• Redundant or non-redundant?
summary
NCBI
European Bioinformatics Institute
GenomeNet
NAR database issue
Nucleotide databases
• International Nucleotide Sequence Database Collaboration (INSDC)
– GenBank
– EMBL (European Nucleotide Archive, ENA)
– DDBJ
• The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the scientific community and making it freely available.
– The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence.
– These data are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single-pass sequences), the extent of sequence annotation, and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome).
GenBank entry
Genome specific databases
Protein Databases

• Sequences are in UniProt
• Structures are in PDB
• Enzyme classifications: EC numbers
• Protein families: Pfam, InterPro, etc.
UniProt

• UniProtKB: Protein knowledgebase, consisting of two sections:
– Swiss-Prot, which is manually annotated and reviewed.
– TrEMBL, which is automatically annotated and is not reviewed.
• Includes complete and reference proteome sets.
• UniRef: Sequence clusters, used to speed up sequence
similarity searches.
• UniParc: Sequence archive, used to keep track of sequences
and their identifiers.
• Supporting data
– Literature citations, keywords, subcellular locations, cross-referenced
databases and more.
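The idea of a sequence archive that tracks sequences and their identifiers can be illustrated with a toy Python structure. This is not how UniParc is implemented; the function name, the identifiers and the sequences below are all made up for the example.

```python
from collections import defaultdict


def build_archive(entries):
    """Group source identifiers by exact sequence, so each sequence is stored once."""
    archive = defaultdict(set)
    for identifier, sequence in entries:
        archive[sequence.upper()].add(identifier)
    return dict(archive)


# Hypothetical (identifier, sequence) pairs from two made-up source databases.
entries = [
    ("dbA:0001", "MKTAYIAKQR"),
    ("dbB:X17", "MKTAYIAKQR"),  # same sequence submitted to a second source
    ("dbA:0002", "MSLLTEVETP"),
]

archive = build_archive(entries)
for sequence, identifiers in archive.items():
    print(sequence, sorted(identifiers))
```

Keying the archive on the sequence itself makes it non-redundant at the sequence level while still recording every identifier under which that sequence appears.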
UniProt
PDB
PDB
Pfam

Multiple Sequence Alignment and HMMs


KEGG
https://www.genome.jp/kegg/pathway.html
