Introduction To Databases - NCBI, PDB and Uniprot

The document introduces several important biological databases - NCBI, PDB, and UniProt. It describes what each database contains, how it is structured and curated. NCBI contains genes and literature. PDB contains 3D protein structures. UniProt contains protein sequences and functional annotations from literature.

Uploaded by

Mehak Mattoo

EXPERIMENT 2

AIM

Introduction to different databases: NCBI, PDB and UniProt.

THEORY

A database is a collection of information that is organized so that it can easily be accessed,
managed, and updated. In one view, databases can be classified according to types of content:
bibliographic, full-text, numeric, and images.

NCBI
The National Center for Biotechnology Information (NCBI) is part of the United States National Library of
Medicine (NLM), a branch of the National Institutes of Health. The NCBI is located in Bethesda, Maryland,
and was founded in 1988 through legislation sponsored by Senator Claude Pepper. The NCBI houses a
series of databases relevant to biotechnology and biomedicine and is an important resource for
bioinformatics tools and services. Major databases include GenBank for DNA sequences and PubMed, a
bibliographic database for the biomedical literature. Other databases include the NCBI Epigenomics
database. All these databases are available online through the Entrez search engine.
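
As an illustration of how Entrez can be searched from a program, the sketch below builds a query URL for ESearch, one of the NCBI E-utilities. It only constructs the URL and does not contact NCBI. The function name build_esearch_url and the example search terms are assumptions made for this write-up, not part of the original experiment.

```python
from urllib.parse import urlencode

# Base URL of the ESearch endpoint of the NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=5):
    """Return an ESearch URL that looks up `term` in one Entrez database.

    `database` is an Entrez database name such as "pubmed",
    "nucleotide" or "protein". This helper is hypothetical and only
    assembles the query string; it performs no network request.
    """
    params = {
        "db": database,          # which Entrez database to search
        "term": term,            # the search text
        "retmax": max_results,   # maximum number of IDs to return
        "retmode": "json",       # ask for a JSON response
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Example searches (the terms are arbitrary illustrations).
print(build_esearch_url("pubmed", "protein structure"))
print(build_esearch_url("nucleotide", "hemoglobin"))
```

Opening one of the printed URLs in a browser should return the matching record identifiers for that database, which is roughly what the Entrez search box does behind the scenes.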

The NCBI provides software tools that are available through a web browser or by FTP. For example, BLAST is a
sequence similarity searching program that can compare a query sequence against the GenBank DNA
database in less than 15 seconds. NCBI has developed many databases that are very useful
tools for biological searches today.

The NCBI website offers search tools covering a range of resources, including sequence
databases (nucleotide and protein), research papers, genomes, BLAST, documents and other resources
(PubMed, PubChem, dbSNP).

PDB (Protein Data Bank)
The Protein Data Bank (PDB) is a crystallographic database for the three-dimensional structural
data of large biological molecules, such as proteins and nucleic acids. The data, typically
obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cryo-electron
microscopy, and submitted by biologists and biochemists from around the world, are freely
accessible on the Internet via the websites of its member organisations (PDBe, PDBj, and
RCSB). The PDB is overseen by an organization called the Worldwide Protein Data Bank,
wwPDB.

The PDB is a key resource in areas of structural biology, such as structural genomics. Most
major scientific journals, and some funding agencies, now require scientists to submit their
structure data to the PDB. Many other databases use protein structures deposited in the PDB. For
example, SCOP and CATH classify protein structures, while PDBsum provides a graphic
overview of PDB entries using information from other sources, such as Gene Ontology.

The PDB database is updated weekly, as is the PDB holdings list.
As of 27 December 2015, the breakdown of current holdings is as follows:

Experimental Method   Proteins   Nucleic Acids   Protein/Nucleic Acid complexes   Other    Total
X-ray diffraction        95636            1694                             4817       4   102151
NMR                       9840            1135                              231       8    11214
Electron microscopy        666              29                              227       0      922
Hybrid                      83               3                                2       1       89
Other                      170               4                                6      13      193
Total                   106395            2865                             5283      26   114569

91,748 structures in the PDB have a structure factor file.
8,531 structures have an NMR restraint file.
2,289 structures in the PDB have a chemical shifts file.
901 structures in the PDB have a 3DEM map file deposited in EM Data Bank
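
The share of each experimental method can be worked out directly from the Total column of the holdings table above. The short sketch below does this arithmetic. The dictionary simply restates the per-method totals from the table, and the function name method_shares is an assumption made for this write-up.

```python
# Total number of structures per experimental method, copied from the
# Total column of the PDB holdings table (27 December 2015).
totals_by_method = {
    "X-ray diffraction": 102151,
    "NMR": 11214,
    "Electron microscopy": 922,
    "Hybrid": 89,
    "Other": 193,
}


def method_shares(totals):
    """Return each method's percentage of all deposited structures."""
    grand_total = sum(totals.values())
    return {method: 100.0 * count / grand_total
            for method, count in totals.items()}


shares = method_shares(totals_by_method)
for method, percent in shares.items():
    print(f"{method:<20} {percent:6.2f}%")
```

Running this gives a little under 90% for X-ray diffraction and a little under 10% for NMR.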

These data show that most structures are determined by X-ray diffraction, but about 10% of
structures are now determined by protein NMR. When using X-ray diffraction, approximations
of the coordinates of the atoms of the protein are obtained, whereas estimations of the distances
between pairs of atoms of the protein are found through NMR experiments. Therefore, the final
conformation of the protein is obtained, in the latter case, by solving a distance geometry
problem. A few structures are determined by cryo-electron microscopy.
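
To give a feel for what solving a distance geometry problem means, the sketch below rebuilds 3D coordinates for four points from nothing but their pairwise distances. It is a toy illustration under strong assumptions: all six distances are known exactly and the first three points are not collinear. Real NMR data provide incomplete and approximate distance restraints for thousands of atoms, so actual structure calculation is far more involved. The function names and the example coordinates are made up for this write-up.

```python
import math


def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def embed_four_points(d):
    """Place four points in 3D so that they reproduce the distances d[i][j].

    Point 0 goes to the origin, point 1 on the x-axis, point 2 in the
    xy-plane and point 3 above that plane. Assumes exact distances and
    that points 0, 1 and 2 are not collinear.
    """
    d01, d02, d03 = d[0][1], d[0][2], d[0][3]
    d12, d13, d23 = d[1][2], d[1][3], d[2][3]

    p0 = (0.0, 0.0, 0.0)
    p1 = (d01, 0.0, 0.0)

    # Point 2: its x follows from the distances to points 0 and 1.
    x2 = (d01 ** 2 + d02 ** 2 - d12 ** 2) / (2 * d01)
    y2 = math.sqrt(max(d02 ** 2 - x2 ** 2, 0.0))
    p2 = (x2, y2, 0.0)

    # Point 3: x from points 0 and 1, y from point 2, z from point 0.
    x3 = (d01 ** 2 + d03 ** 2 - d13 ** 2) / (2 * d01)
    y3 = (d03 ** 2 - d23 ** 2 + x2 ** 2 + y2 ** 2 - 2 * x2 * x3) / (2 * y2)
    z3 = math.sqrt(max(d03 ** 2 - x3 ** 2 - y3 ** 2, 0.0))
    return [p0, p1, p2, (x3, y3, z3)]


# Made-up "atoms", used only to generate a consistent set of distances.
true_points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.4, 1.2, 0.0), (0.7, 0.5, 1.1)]
dist = [[distance(p, q) for q in true_points] for p in true_points]

model = embed_four_points(dist)
for i in range(4):
    for j in range(i + 1, 4):
        print(i, j, round(dist[i][j], 3), round(distance(model[i], model[j]), 3))
```

The two printed columns of distances should agree, showing that the rebuilt model is consistent with the input distances. The model is determined only up to rotation, translation and mirror image, which is also true of real distance geometry solutions.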
Figure: Examples of protein structures from the PDB, created with UCSF Chimera.

UniProt
The mission of UniProt is to provide the scientific community with a comprehensive, high-quality
and freely accessible resource of protein sequence and functional information. Many entries are
derived from genome sequencing projects, and the database contains a large amount
of information about the biological function of proteins derived from the research literature.

UniProt provides four core databases: UniProtKB (with sub-parts Swiss-Prot and TrEMBL),
UniParc, UniRef, and UniMES.

UniProtKB

UniProt Knowledgebase (UniProtKB) is a protein database partially curated by experts,
consisting of two sections: UniProtKB/Swiss-Prot (containing reviewed, manually annotated
entries) and UniProtKB/TrEMBL (containing unreviewed, automatically annotated entries).
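
In FASTA files downloaded from UniProtKB, the two sections can be told apart from the header line: Swiss-Prot entries begin with "sp" and TrEMBL entries with "tr", followed by the accession and the entry name, separated by "|". The sketch below reads that prefix. The function name is an assumption made for this write-up, and the second example header is invented purely for illustration.

```python
# Meaning of the database prefix in a UniProtKB FASTA header.
SECTIONS = {
    "sp": "UniProtKB/Swiss-Prot (reviewed)",
    "tr": "UniProtKB/TrEMBL (unreviewed)",
}


def parse_uniprot_fasta_header(header):
    """Split a UniProtKB FASTA header into section, accession and entry name."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    # The identifier is the first whitespace-delimited token after '>'.
    identifier = header[1:].split(None, 1)[0]
    prefix, accession, entry_name = identifier.split("|")
    if prefix not in SECTIONS:
        raise ValueError(f"unknown UniProtKB prefix: {prefix}")
    return {
        "section": SECTIONS[prefix],
        "accession": accession,
        "entry_name": entry_name,
    }


reviewed = parse_uniprot_fasta_header(
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
)
# Invented header, not a real UniProtKB entry.
unreviewed = parse_uniprot_fasta_header(
    ">tr|X0X0X0|X0X0X0_EXAMPLE Uncharacterized protein OS=Example organism OX=0 PE=4 SV=1"
)
print(reviewed["section"], reviewed["accession"])
print(unreviewed["section"], unreviewed["accession"])
```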

UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot is a manually annotated, non-redundant protein sequence database. It
combines information extracted from scientific literature and biocurator-evaluated computational
analysis. The aim of UniProtKB/Swiss-Prot is to provide all known relevant information about a
particular protein. Annotation is regularly reviewed to keep up with current scientific findings.
The manual annotation of an entry involves detailed analysis of the protein sequence and of the
scientific literature. Sequences from the same gene and the same species are merged into the
same database entry. Differences between sequences are identified, and their cause is documented.
Relevant publications are identified by searching databases such as PubMed. The full text of
each paper is read, and information is extracted and added to the entry. Annotation arising from
the scientific literature includes, but is not limited to:

• Protein and gene names
• Function
• Enzyme-specific information such as catalytic activity, cofactors and catalytic residues
• Subcellular location
• Protein-protein interactions
• Pattern of expression
• Locations and roles of significant domains and sites
• Ion-, substrate- and cofactor-binding sites
• Protein variant forms produced by natural genetic variation, RNA editing, alternative
  splicing, proteolytic processing, and post-translational modification

Annotated entries undergo quality assurance before inclusion into UniProtKB/Swiss-Prot. When
new data becomes available, entries are updated.

UniProtKB/TrEMBL

UniProtKB/TrEMBL contains high-quality computationally analyzed records, which are
enriched with automatic annotation. It was introduced in response to increased dataflow resulting
from genome projects, as the time- and labour-consuming manual annotation process of
UniProtKB/Swiss-Prot could not be broadened to include all available protein sequences.[10] The
translations of annotated coding sequences in the EMBL-Bank/GenBank/DDBJ nucleotide
sequence database are automatically processed and entered in UniProtKB/TrEMBL.
UniProtKB/TrEMBL also contains sequences from PDB, and from gene prediction, including
Ensembl, RefSeq and CCDS.[16]

UniParc

UniProt Archive (UniParc) is a comprehensive and non-redundant database, which contains all
the protein sequences from the main, publicly available protein sequence databases.[17] Proteins
may exist in several different source databases, and in multiple copies in the same database. In
order to avoid redundancy, UniParc stores each unique sequence only once. Identical sequences
are merged, regardless of whether they are from the same or different species. Each sequence is
given a stable and unique identifier (UPI), making it possible to identify the same protein from
different source databases. UniParc contains only protein sequences, with no annotation.
Database cross-references in UniParc entries allow further information about the protein to be
retrieved from the source databases. When sequences in the source databases change, these
changes are tracked by UniParc and a history of all changes is archived.
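
The idea of storing each unique sequence once and keeping cross-references to the source databases can be sketched with a small dictionary-based archive. This is only a toy model of the principle described above and not how UniParc is implemented. The class name, the counter-based identifiers and the example records are all assumptions made for this write-up.

```python
class SequenceArchive:
    """Toy non-redundant archive: one identifier per unique sequence."""

    def __init__(self):
        self._id_by_sequence = {}   # sequence -> archive identifier
        self._cross_refs = {}       # archive identifier -> [(database, id), ...]

    def add(self, source_db, source_id, sequence):
        """Register a record and return the identifier of its sequence."""
        sequence = sequence.upper()
        upi = self._id_by_sequence.get(sequence)
        if upi is None:
            # Illustrative identifier built from a simple counter.
            upi = f"UPI{len(self._id_by_sequence) + 1:010X}"
            self._id_by_sequence[sequence] = upi
            self._cross_refs[upi] = []
        ref = (source_db, source_id)
        if ref not in self._cross_refs[upi]:
            self._cross_refs[upi].append(ref)
        return upi

    def cross_references(self, upi):
        """Return the source records that share this sequence."""
        return list(self._cross_refs[upi])

    def __len__(self):
        return len(self._id_by_sequence)


archive = SequenceArchive()
# Made-up records: the first two carry the same sequence.
first = archive.add("DatabaseA", "rec1", "MKTAYIAKQR")
second = archive.add("DatabaseB", "rec7", "MKTAYIAKQR")
third = archive.add("DatabaseA", "rec2", "MSLLTEVETP")

print(first, archive.cross_references(first))
print(third, archive.cross_references(third))
print("unique sequences stored:", len(archive))
```

The two records with identical sequences receive the same identifier, and the cross-references show which source records to consult for annotation, since the archive itself stores none.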

UniRef

The UniProt Reference Clusters (UniRef) consist of three databases of clustered sets of protein
sequences from UniProtKB and selected UniParc records. The UniRef100 database combines
identical sequences and sequence fragments (from any organism) into a single UniRef entry. The
sequence of a representative protein, the accession numbers of all the merged entries and links to
the corresponding UniProtKB and UniParc records are displayed. UniRef100 sequences are
clustered using the CD-HIT algorithm to build UniRef90 and UniRef50. Each cluster is
composed of sequences that have at least 90% or 50% sequence identity, respectively, to the
longest sequence. Clustering sequences significantly reduces database size, enabling faster
sequence searches.
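
The clustering idea can be sketched with a simple greedy procedure: sequences are processed from longest to shortest, and each one joins the first cluster whose representative it matches at or above the identity threshold, otherwise it starts a new cluster. This is a simplified stand-in, not the CD-HIT algorithm itself. In particular, identity is measured here by comparing positions from the start of the sequences without any alignment, which is an assumption made to keep the example short. The sequences are made up.

```python
def identity(seq, representative):
    """Fraction of positions in `seq` that match `representative`.

    Simplifying assumption: positions are compared from the first
    residue with no gaps, so this is not a true alignment identity.
    `seq` must be non-empty and no longer than `representative`.
    """
    matches = sum(1 for a, b in zip(seq, representative) if a == b)
    return matches / len(seq)


def greedy_cluster(sequences, threshold):
    """Group sequences around the longest members as representatives."""
    clusters = []  # list of (representative, members)
    # Longest first; ties broken alphabetically for a stable order.
    ordered = sorted({s for s in sequences if s}, key=lambda s: (-len(s), s))
    for seq in ordered:
        for representative, members in clusters:
            if identity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


# Made-up sequences, including one fragment ("MKTAY").
example = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLLLL", "GGGGGGGGGG", "MKTAY"]

for threshold in (1.0, 0.9, 0.5):
    clusters = greedy_cluster(example, threshold)
    print(f"threshold {threshold}: {len(clusters)} clusters")
    for representative, members in clusters:
        print("   ", representative, members)
```

Lowering the threshold merges more sequences into each cluster, which mirrors the relationship between UniRef100, UniRef90 and UniRef50 described above.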

UniRef is available from the UniProt FTP site.

UniMES

The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository
specifically developed for metagenomic and environmental data.[20] The predicted proteins from
this dataset are combined with automatic classification by InterPro to enhance the original
information with further analysis.

UniProtKB contains protein sequences from known species. Data arising from metagenomics
studies, in contrast, come from environmental (i.e., uncultured) samples, so the species may not be
known or may not yet have been identified. UniMES was developed for these data. Data from UniMES
are not included in UniProtKB or UniRef, but are included in UniParc. As of July 2012, UniMES
contains only data from the Global Ocean Sampling Expedition (GOS).
