Introduction To Databases - NCBI, PDB and Uniprot
AIM
THEORY
NCBI
The National Center for Biotechnology Information (NCBI) is part of the United States National Library of
Medicine (NLM), a branch of the National Institutes of Health. The NCBI is located in Bethesda, Maryland
and was founded in 1988 through legislation sponsored by Senator Claude Pepper. The NCBI houses a
series of databases relevant to biotechnology and biomedicine and is an important resource for
bioinformatics tools and services. Major databases include GenBank for DNA sequences and PubMed, a
bibliographic database for the biomedical literature. Other databases include the NCBI Epigenomics
database. All these databases are available online through the Entrez search engine.
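As a rough illustration of how an Entrez search can be expressed programmatically, the sketch below builds a query URL for the NCBI E-utilities ESearch endpoint. It only constructs the URL and does not contact the server. The helper name build_esearch_url and the example search terms are assumptions made for this illustration; db, term, retmax and retmode are standard ESearch parameters.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=5):
    """Return an ESearch URL for the given Entrez database and search term.

    build_esearch_url is a hypothetical helper written for this handout.
    """
    params = {
        "db": db,           # Entrez database name, e.g. "pubmed" or "nucleotide"
        "term": term,       # free-text query
        "retmax": retmax,   # maximum number of record IDs to return
        "retmode": "json",  # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Example queries (the search terms are arbitrary illustrations).
pubmed_url = build_esearch_url("pubmed", "hemoglobin")
nucleotide_url = build_esearch_url("nucleotide", "BRCA1 human", retmax=10)
```

Opening either URL in a browser should return a list of matching record identifiers, which can then be looked up in the chosen database.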
The NCBI has software tools that are available by WWW browsing or by FTP. For example, BLAST is a
sequence similarity searching program. BLAST can do sequence comparisons against the GenBank DNA
database in less than 15 seconds. NCBI has developed many databases that are widely used tools
for biological searches today.
The NCBI website brings these together with search tools that offer a number of options,
including sequence databases (nucleotide and protein), research papers, genomes, BLAST, documents,
and resources such as PubMed, PubChem and SNP.
PDB (Protein Data Bank)
The Protein Data Bank (PDB) is a crystallographic database for the three-dimensional structural
data of large biological molecules, such as proteins and nucleic acids. The data, typically
obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cryo-electron
microscopy, and submitted by biologists and biochemists from around the world, are freely
accessible on the Internet via the websites of its member organisations (PDBe, PDBj, and
RCSB). The PDB is overseen by an organization called the Worldwide Protein Data Bank,
wwPDB.
The PDB is a key resource in areas of structural biology, such as structural genomics. Most
major scientific journals, and some funding agencies, now require scientists to submit their
structure data to the PDB. Many other databases use protein structures deposited in the PDB. For
example, SCOP and CATH classify protein structures, while PDBsum provides a graphic
overview of PDB entries using information from other sources, such as Gene Ontology.
The PDB database is updated weekly. Likewise, the PDB holdings list is also updated weekly.
As of 27 December 2015, the breakdown of current holdings is as follows:
These data show that most structures are determined by X-ray diffraction, but about 10% of
structures are now determined by protein NMR. When using X-ray diffraction, approximations
of the coordinates of the atoms of the protein are obtained, whereas estimations of the distances
between pairs of atoms of the protein are found through NMR experiments. Therefore, the final
conformation of the protein is obtained, in the latter case, by solving a distance geometry
problem. A few proteins are determined by cryo-electron microscopy.
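To make the distance geometry idea concrete, the following minimal sketch recovers coordinates for three points from their pairwise distances alone, which is the smallest possible instance of the problem. It works in two dimensions and assumes exact distances; real NMR data give many approximate distance restraints in three dimensions and require much more elaborate methods. The function names are hypothetical and chosen only for this illustration.

```python
import math


def embed_three_points(d_ab, d_ac, d_bc):
    """Place points A, B, C in the plane so their pairwise distances match.

    A is fixed at the origin and B on the positive x-axis; C then follows
    from the law of cosines. This is a toy sketch, not an NMR method.
    """
    if d_ab <= 0:
        raise ValueError("d_ab must be positive")
    a = (0.0, 0.0)
    b = (d_ab, 0.0)
    # x-coordinate of C from the law of cosines.
    x = (d_ab ** 2 + d_ac ** 2 - d_bc ** 2) / (2.0 * d_ab)
    y_squared = d_ac ** 2 - x ** 2
    if y_squared < -1e-9:
        raise ValueError("distances violate the triangle inequality")
    c = (x, math.sqrt(max(y_squared, 0.0)))
    return a, b, c


def distance(p, q):
    """Euclidean distance between two points in the plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Recover coordinates from three distances, then check them.
a, b, c = embed_three_points(3.0, 4.0, 5.0)
recovered = (distance(a, b), distance(a, c), distance(b, c))
```

The recovered coordinates are only defined up to rotation, translation and reflection, which is why the sketch fixes A at the origin and B on the x-axis.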
Examples of protein structures from the PDB created with UCSF Chimera.
UniProt
The mission of UniProt is to provide the scientific community with a comprehensive,
high-quality and freely accessible resource of protein sequence and functional information.
Many entries are derived from genome sequencing projects, and the database contains a large
amount of information about the biological function of proteins derived from the research
literature.
UniProt provides four core databases: UniProtKB (with sub-parts Swiss-Prot and TrEMBL),
UniParc, UniRef, and UniMES.
UniProtKB
UniProtKB/Swiss-Prot
Annotated entries undergo quality assurance before inclusion into UniProtKB/Swiss-Prot. When
new data becomes available, entries are updated.
UniProtKB/TrEMBL
UniParc
UniProt Archive (UniParc) is a comprehensive and non-redundant database, which contains all
the protein sequences from the main, publicly available protein sequence databases.[17] Proteins
may exist in several different source databases, and in multiple copies in the same database. In
order to avoid redundancy, UniParc stores each unique sequence only once. Identical sequences
are merged, regardless of whether they are from the same or different species. Each sequence is
given a stable and unique identifier (UPI), making it possible to identify the same protein from
different source databases. UniParc contains only protein sequences, with no annotation.
Database cross-references in UniParc entries allow further information about the protein to be
retrieved from the source databases. When sequences in the source databases change, these
changes are tracked by UniParc and a history of all changes is archived.
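The way UniParc stores each unique sequence once while keeping cross-references to every source can be sketched with two dictionaries. This is a simplified illustration, not UniParc's actual implementation: the class name SequenceArchive, the counter-based identifier scheme, and the database names and accessions in the example are all hypothetical.

```python
class SequenceArchive:
    """Toy non-redundant archive: one identifier per unique sequence."""

    def __init__(self):
        self._id_by_sequence = {}  # sequence -> identifier
        self._xrefs_by_id = {}     # identifier -> list of (database, accession)

    def add(self, source_db, accession, sequence):
        """Register a sequence from a source database and return its identifier."""
        seq = sequence.upper()
        if seq not in self._id_by_sequence:
            # Hypothetical identifier scheme: a simple counter after "UPI".
            new_id = f"UPI{len(self._id_by_sequence) + 1:010X}"
            self._id_by_sequence[seq] = new_id
            self._xrefs_by_id[new_id] = []
        upi = self._id_by_sequence[seq]
        xref = (source_db, accession)
        if xref not in self._xrefs_by_id[upi]:
            self._xrefs_by_id[upi].append(xref)
        return upi

    def cross_references(self, upi):
        """Return the source records that share this sequence."""
        return list(self._xrefs_by_id[upi])


archive = SequenceArchive()
first = archive.add("DB_A", "X1", "MKTAYIAKQR")   # hypothetical records
second = archive.add("DB_B", "Y9", "MKTAYIAKQR")  # same sequence, other source
third = archive.add("DB_A", "X2", "MKTAYIAKQL")   # different sequence
```

Here the first two records receive the same identifier because their sequences are identical, and the cross-reference list for that identifier points back to both source entries.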
UniRef
The UniProt Reference Clusters (UniRef) consist of three databases of clustered sets of protein
sequences from UniProtKB and selected UniParc records. The UniRef100 database combines
identical sequences and sequence fragments (from any organism) into a single UniRef entry. The
sequence of a representative protein, the accession numbers of all the merged entries and links to
the corresponding UniProtKB and UniParc records are displayed. UniRef100 sequences are
clustered using the CD-HIT algorithm to build UniRef90 and UniRef50. Each cluster is
composed of sequences that have at least 90% or 50% sequence identity, respectively, to the
longest sequence. Clustering sequences significantly reduces database size, enabling faster
sequence searches.
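The clustering step can be pictured as a greedy procedure in the spirit of CD-HIT: process sequences from longest to shortest, and let each one join the first cluster whose representative it matches at or above the identity threshold. The sketch below is a heavily simplified assumption-laden version. In particular, identity is computed here as the fraction of matching positions in an ungapped comparison from the first residue, whereas CD-HIT itself uses word filtering and alignment. The function names and example sequences are hypothetical.

```python
def identity(seq, representative):
    """Fraction of positions in seq that match the representative.

    Simplifying assumption: ungapped comparison starting at position 0,
    with seq no longer than the representative.
    """
    if not seq:
        return 0.0
    matches = sum(1 for a, b in zip(seq, representative) if a == b)
    return matches / len(seq)


def greedy_cluster(sequences, threshold):
    """Cluster sequences greedily; the longest member represents each cluster."""
    clusters = []  # list of (representative, members)
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append((seq, [seq]))
    return clusters


example = ["MKTAYIAKQR", "MKTAYIAKQL", "MKTAY", "GGGGGGGGGG"]
clusters_90 = greedy_cluster(example, 0.9)   # 90% identity threshold
clusters_100 = greedy_cluster(example, 1.0)  # identical sequences and fragments only
```

With the 90% threshold the two near-identical sequences and the fragment share one cluster, while the 100% threshold keeps the two full-length variants apart; fewer clusters means fewer representatives to search.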
UniMES
UniProtKB contains protein sequences from known species. Data arising from metagenomics
studies, by contrast, comes from environmental (i.e., uncultured) samples, so the species may
not be known or may not yet be identified. UniMES was developed for this data. Data from
UniMES is not included in UniProtKB or UniRef, but is included in UniParc. As of July 2012,
UniMES contains only data from the Global Ocean Sampling Expedition (GOS).