
Disclaimer

The document provides a disclaimer regarding the non-commercial and educational use of content, acknowledging potential copyright issues. It details various primary protein databases, including their purposes, structures, and examples such as PIR, SWISS-PROT, and UniProt. Additionally, it discusses the Protein Data Bank (PDB) and its significance in storing 3-D structural data of biological molecules.

Uploaded by

Sneha Ardeshna

Disclaimer

It is hereby declared that this content is produced for non-commercial, scholastic and research
purposes only.

We acknowledge that some of the content or images provided in this channel's videos may have been obtained through
routine Google image searches, and a few of them may be under copyright protection. Any such usage is completely
inadvertent.

It is quite possible that we failed to give full scholarly credit to the copyright owners. We believe that the
non-commercial, educational-only use of the material may allow the video in question to fall under fair use of such
content. However, we honour the copyright holders' rights, and the video shall be deleted from our channel if
any such claim is received by us or reported to us.
Primary Protein Databases

Department of Microbiology
Unit no 3: Introduction to databases
Subject name and code: Bioinformatics & Biostatistics; 02MB0301
Dr. Purvi M. Rakhashiya
Primary Protein databases

• The PRIMARY databases hold experimentally determined protein
sequences and sequences inferred from the conceptual translation of
nucleotide sequences. An inferred sequence may or may not be backed by
experimentally derived information; it has arisen as a result of
interpretation of the nucleotide sequence information.
• There are a number of primary protein sequence and structure
databases, and each requires some specific consideration. The
following are the types, with examples:

• Protein sequence databases: PIR, MIPS, SWISS-PROT, TrEMBL, NRL-3D
• Protein structure databases: PDB
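
As a rough illustration of the conceptual translation mentioned above, the following Python sketch reads a coding sequence codon by codon using the standard genetic code. The function name translate and the example sequence are made up for illustration. Real annotation pipelines also deal with alternative genetic codes, reading-frame selection and ambiguous bases, which this sketch largely ignores.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# (first base varies slowest, third base fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Conceptually translate a coding sequence into a protein sequence.

    Translation starts at the first base, stops at the first stop codon,
    and ignores an incomplete trailing codon. Unknown codons become "X".
    """
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for start in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[start:start + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical example sequence: ATG GCC AAA TGA
print(translate("ATGGCCAAATGA"))  # MAK
```

The result is an inferred sequence: nothing in this step confirms that the protein is actually expressed, which is why such entries need annotation and review.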
Protein Information Resource (PIR)

• The Protein Sequence Database was developed at the National
Biomedical Research Foundation (NBRF) in the early 1960s by
Margaret Dayhoff as a collection of sequences for investigating
evolutionary relationships among proteins.
• Since 1988, the Protein Sequence Database has been maintained
collaboratively by PIR-International, a consortium that includes the
Protein Information Resource (PIR) at the NBRF, the Japan International
Protein Information Database (JIPID), and the Martinsried
Institute for Protein Sequences (MIPS).
• In its current form, the database is split into four distinct sections,
designated PIR1-PIR4, which differ in terms of the quality of data and
level of annotation provided: PIR1 contains fully classified and
annotated entries; PIR2 includes preliminary entries, which have not
been thoroughly reviewed and may contain redundancy; PIR3
contains unverified entries, which have not been reviewed; and PIR4
entries fall into one of four categories:
• (i) conceptual translations of artefactual sequences;
• (ii) conceptual translations of sequences that are not transcribed or
translated;
• (iii) protein sequences or conceptual translations that are extensively
genetically engineered; or
• (iv) sequences that are not genetically encoded and not produced on
ribosomes.
Martinsried Institute for Protein Sequences (MIPS)
• The Martinsried Institute for Protein Sequences (MIPS) collects and
processes sequence data for the PIR-International Protein Sequence
Database project.
• The database is distributed with PATCHX, a supplement of unverified
protein sequences from external sources.
UniProt/SWISS-PROT

• SWISS-PROT is a protein sequence database which, from its inception in
1986, was produced collaboratively by the Department of Medical
Biochemistry at the University of Geneva and the EMBL; after 1994, the
collaboration moved to EMBL's UK outstation, the EBI.
• In April 1998, a further change saw a move to the Swiss Institute of
Bioinformatics (SIB); hence the database is now maintained
collaboratively by SIB and EBI/EMBL.
• The database endeavours to provide high-level annotations, including
descriptions of the function of the protein, and of the structure of its
domains, its post-translational modifications, variants, and so on
• SWISS-PROT aims to be minimally redundant, and is interlinked to many
other resources. In 1996, a computer-annotated supplement to SWISS-
PROT was created, termed TrEMBL, which is described in more detail
below
• First, however, we will take a close look at the structure of SWISS-PROT
entries.
The structure of SWISS-PROT entries

The structure of the database, and the quality of its
annotations, set SWISS-PROT apart from other
protein sequence resources and have made it the
database of choice for most research purposes.

By mid-1998, the database contained about 70,000
entries from more than 5,000 different species, the
bulk of these coming from just a small number of
model organisms (e.g., Homo sapiens,
Saccharomyces cerevisiae, Escherichia coli, Mus
musculus and Rattus norvegicus).
UniProt

• UniProt is produced by the UniProt Consortium, a collaboration
between the European Bioinformatics Institute (EBI), the Swiss
Institute of Bioinformatics (SIB) and the Protein Information Resource
(PIR). UniProt comprises four components:
1. The UniProt Knowledgebase (UniProtKB)
2. UniProt Reference Clusters (UniRef)
3. UniProt Archive (UniParc)
4. UniProt Metagenomic and Environmental Sequences (UniMES)
Sources for curation in UniProtKB

The default raw sequence data for UniProtKB are:
• DDBJ/ENA/GenBank coding sequence (CDS) translations,
• the sequences of PDB structures,
• sequences from Ensembl and RefSeq,
• data derived from amino acid sequences that are directly
submitted to UniProtKB or scanned from the literature.
UniProt Knowledgebase

• The UniProt Knowledgebase, and in particular UniProtKB/Swiss-Prot, is
used to access functional information on proteins. Every UniProtKB
entry contains the amino acid sequence, protein name or description,
taxonomic data and citation information; in addition, as much
annotation as possible is added. This includes widely accepted biological
ontologies, classifications and cross-references, as well as clear
indications of the quality of annotation in the form of evidence
attribution to experimental and computational data.
• UniProtKB/Swiss-Prot contains high-quality, manually annotated and
non-redundant protein sequence records. Manual annotation consists
of analysis, comparison and merging of all available sequences for a
given protein.
• UniProtKB/TrEMBL contains high-quality computationally analysed
records enriched with automatic annotation and classification.
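
In practice, the two UniProtKB sections can be told apart from the FASTA header of a downloaded entry, where the prefix sp marks UniProtKB/Swiss-Prot and tr marks UniProtKB/TrEMBL. The Python sketch below parses such a header. The function name parse_uniprot_fasta_header is an assumption made for this example, the first header is written out from the general pattern and should be treated as illustrative, and the second header is entirely made up.

```python
import re

# Pattern: >db|Accession|EntryName ProteinName OS=... OX=... [GN=...] PE=... SV=...
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")
SECTIONS = {"sp": "UniProtKB/Swiss-Prot", "tr": "UniProtKB/TrEMBL"}


def parse_uniprot_fasta_header(header):
    """Split a UniProtKB FASTA header into a few named fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {header!r}")
    db, accession, entry_name, rest = match.groups()
    return {
        "section": SECTIONS[db],
        "accession": accession,
        "entry_name": entry_name,
        # The protein name is everything before the first "OS=" field.
        "protein_name": rest.split(" OS=")[0],
    }


reviewed = parse_uniprot_fasta_header(
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha "
    "OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
)
# Made-up accession and organism, used only to show the "tr" prefix.
unreviewed = parse_uniprot_fasta_header(
    ">tr|X0X0X0|X0X0X0_EXAMP Uncharacterized protein "
    "OS=Example organism OX=0 PE=4 SV=1"
)
print(reviewed["section"], reviewed["accession"])
print(unreviewed["section"], unreviewed["accession"])
```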
UniProt Reference Clusters (UniRef)

• Three UniRef databases (UniRef100, UniRef90 and UniRef50) merge
sequences automatically across species.
• UniRef100 is based on all UniProtKB records. It also contains selected
UniParc records, including Ensembl protein translations from chicken,
cow, dog, fly, Fugu, human, mouse, rat, Tetraodon, Xenopus and
zebrafish. UniRef100 is produced by clustering all these records by
sequence identity. Identical sequences and sub-fragments are presented
as a single UniRef100 entry with the accession numbers of all the merged
entries, the protein sequence, and links to the corresponding UniProtKB and
archive records.
• UniRef90 and UniRef50 are built from UniRef100 to provide records with
mutual sequence identity of 90% or more, or 50% or more, respectively,
with links to the corresponding UniProtKB records.
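
The merging of identical sequences and sub-fragments described for UniRef100 can be sketched in a few lines of Python. This is a toy under stated assumptions, not the actual UniRef procedure: the function name, the accessions and the sequences are invented, a fragment is treated simply as an exact substring of a longer sequence, and the 90% and 50% identity levels are not covered because they would require sequence alignment.

```python
def cluster_identical_and_fragments(records):
    """Group sequences that are identical to, or exact sub-fragments of,
    a longer representative sequence.

    records maps a (hypothetical) accession to its protein sequence.
    Returns a list of clusters, longest representative first.
    """
    clusters = []
    # Visit longer sequences first so that representatives exist
    # before their fragments are seen; ties are broken by accession.
    ordered = sorted(records.items(), key=lambda item: (-len(item[1]), item[0]))
    for accession, sequence in ordered:
        for cluster in clusters:
            if sequence in cluster["sequence"]:  # identical or sub-fragment
                cluster["members"].append(accession)
                break
        else:
            clusters.append({"sequence": sequence, "members": [accession]})
    return clusters


# Invented accessions and sequences, for illustration only.
records = {
    "A1": "MKTAYIAKQR",
    "A2": "MKTAYIAKQR",  # identical to A1
    "A3": "TAYIAK",      # sub-fragment of A1
    "B1": "MSLLNVPAGK",  # unrelated sequence
}
for cluster in cluster_identical_and_fragments(records):
    print(cluster["sequence"], cluster["members"])
```

Each resulting cluster keeps one sequence plus the accessions of all merged entries, which mirrors the description of a single UniRef100 entry above.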
UniProt Archive (UniParc)

• UniParc is designed to capture all publicly available protein sequence
data and contains all the protein sequences from the main publicly
available protein sequence databases. This makes UniParc the most
comprehensive publicly accessible non-redundant protein sequence
database.
• A protein sequence may exist in several databases and more than once
in a given database, thus creating redundant information. UniParc
overcomes this problem by storing each unique sequence only once,
and assigning it a unique UniParc identifier.
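
The idea of storing each unique sequence once and giving it its own identifier can be sketched as a small Python class. Everything here is hypothetical: the class name SequenceArchive, the SEQ000001-style identifiers and the database names are invented for the example and do not reproduce the real UniParc identifier scheme or data model.

```python
class SequenceArchive:
    """Toy archive that stores each distinct sequence exactly once."""

    def __init__(self):
        self._id_by_sequence = {}
        self._sources_by_id = {}

    def add(self, sequence, database, accession):
        """Register a sequence seen in a source database.

        Returns the archive identifier, reusing the existing one if the
        same sequence has already been stored.
        """
        key = sequence.strip().upper()
        archive_id = self._id_by_sequence.get(key)
        if archive_id is None:
            # Made-up identifier format, for illustration only.
            archive_id = f"SEQ{len(self._id_by_sequence) + 1:06d}"
            self._id_by_sequence[key] = archive_id
            self._sources_by_id[archive_id] = set()
        self._sources_by_id[archive_id].add((database, accession))
        return archive_id

    def sources(self, archive_id):
        """List the (database, accession) pairs that share this sequence."""
        return sorted(self._sources_by_id[archive_id])

    def __len__(self):
        return len(self._id_by_sequence)


archive = SequenceArchive()
first = archive.add("MKTAYIAKQR", "DatabaseA", "A0001")
second = archive.add("mktayiakqr", "DatabaseB", "B0042")  # same sequence
third = archive.add("MSLLNVPAGK", "DatabaseA", "A0002")
print(first, second, third, len(archive))
```

Keeping the cross-references alongside the single stored copy is what lets a non-redundant archive still point back to every source record.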
UniProt Metagenomic and Environmental Sequences (UniMES)

• The availability of metagenomic data has necessitated the creation of
a separate database, UniMES, to store sequences which are
recovered directly from environmental samples.
• The predicted proteins from this dataset are combined with
automatic classification by InterPro, an integrated resource for
protein families, domains and functional sites, to enhance the
original information with further analysis.
TrEMBL

• TrEMBL (Translated EMBL) was created in 1996 as a computer-annotated
supplement to SWISS-PROT.
• The database benefits from the SWISS-PROT format, and contains
translations of all coding sequences (CDS) in EMBL.
• TrEMBL has two main sections, designated SP-TrEMBL and REM-TrEMBL:
• SP-TrEMBL (SWISS-PROT TrEMBL) contains entries that will eventually be
incorporated into SWISS-PROT, but that have not yet been manually
annotated.
• REM-TrEMBL (REMaining TrEMBL) contains sequences that are not destined to be included in
SWISS-PROT; these include immunoglobulins and T-cell receptors,
fragments of fewer than eight amino acids, synthetic sequences, patented
sequences, and codon translations that do not encode real proteins.
• TrEMBL was designed to address the need for a well-structured SWISS-PROT-like
resource that would allow very rapid access to sequence data from the
genome projects, without having to compromise the quality of SWISS-PROT
itself by incorporating sequences with insufficient analysis and annotation.
NRL-3D
• The NRL-3D database is produced by PIR from sequences extracted
from the Brookhaven Protein Data Bank (PDB).
• The titles and biological sources of the entries conform to the
nomenclature standards used in the PIR.
• Bibliographic references and MEDLINE cross-references are included,
together with secondary structure, active site, binding site and modified
site annotations, and details of experimental method, resolution,
R-factor, etc. Keywords are also provided.
• NRL-3D is a valuable resource, as it makes the sequence information in
the PDB available both for keyword interrogation and for similarity
searches.
• The database may be searched using the ATLAS retrieval system, a
multi-database information retrieval program specifically designed to
access macromolecular sequence databases.
Protein structure database
• The Protein Data Bank (PDB) is a repository for the 3-D
structural data of large biological molecules, such as
proteins and nucleic acids. The data, typically obtained by
X-ray crystallography or NMR spectroscopy and submitted
by biologists and biochemists from around the world, can
be accessed at no charge on the internet. The PDB is
overseen by an organization called the Worldwide Protein
Data Bank, wwPDB.
• The PDB file format is a widely used text format for protein
structure files. It describes how the atoms of the molecules are
arranged in the 3-D structure of a protein. The file contains
hundreds or thousands of lines called records, each of which
describes one aspect of the structure.
• File formats: PDB, mmCIF, XML.
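
To make the idea of a record concrete, the Python sketch below reads one ATOM record of a PDB-format file using its fixed column positions. The function name parse_atom_record is an assumption for this example, and the sample line carries made-up coordinates rather than data from a real entry. Only a few of the fields are extracted.

```python
# A sample ATOM record assembled field by field so that the fixed
# column layout is visible. All values are made up for illustration.
SAMPLE_ATOM_LINE = (
    "ATOM  "        # columns 1-6: record name
    + "    1"       # columns 7-11: atom serial number
    + " "           # column 12: unused
    + " N  "        # columns 13-16: atom name
    + " "           # column 17: alternate location indicator
    + "MET"         # columns 18-20: residue name
    + " "           # column 21: unused
    + "A"           # column 22: chain identifier
    + "   1"        # columns 23-26: residue sequence number
    + "    "        # columns 27-30: insertion code and unused
    + "  27.340"    # columns 31-38: x coordinate (angstroms)
    + "  24.430"    # columns 39-46: y coordinate
    + "   2.614"    # columns 47-54: z coordinate
    + "  1.00"      # columns 55-60: occupancy
    + "  9.67"      # columns 61-66: temperature factor
    + " " * 10      # columns 67-76: unused
    + " N"          # columns 77-78: element symbol
)


def parse_atom_record(line):
    """Extract selected fields from one ATOM or HETATM record."""
    if not line.startswith(("ATOM  ", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    # Python slices are 0-based and end-exclusive, so column n is index n - 1.
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


atom = parse_atom_record(SAMPLE_ATOM_LINE)
print(atom["name"], atom["res_name"], atom["chain_id"], atom["x"], atom["y"], atom["z"])
```

For real work, an established structure-parsing library is a safer choice than hand-written slicing, particularly for mmCIF files, which are not column-based.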
