UNIT II
DEFINE BIOLOGICAL DATABASES (2 marks)
Biological databases are libraries of life-science information, collected from scientific experiments,
published literature, high-throughput experimental technology and computational analyses.
They contain information from research areas including genomics, proteomics, metabolomics,
microarray gene expression and phylogenetics.
Information in biological databases includes gene function, structure, localization (both cellular and
chromosomal), clinical effects of mutations, and similarities of sequences and structures.
Bioinformatics combines the tools and techniques of biology, computer science, information
technology, statistics and mathematics. Genome sequencing laboratories originally developed
bioinformatics tools for:
Accumulation of sequencing data
Sequence assembly
Annotation
Curation and analysis
Database Record:
A typical database record contains three sections:
The header includes a description of the sequence, its organism of origin, related literature references and
cross-links to related sequences in other databases.
The locus field contains a unique identifier that summarizes the function of the sequence in abbreviated form;
it is followed by an accession number in the accession field.
The organism field contains the binomial name of the organism and its full taxonomic classification.
The feature table contains a description of the features in the record, such as coding sequences, exons, repeats
and promoters for nucleotide sequences, and domains, structural elements and binding sites for protein
sequences. If the feature table includes a coding DNA sequence (CDS), a link to the translated protein
sequence is also given in the feature description.
The sequence itself, which is the part most easily analyzed by computer.
Unrecognized characters in a query sequence should either be removed or replaced by appropriate letter codes
(e.g., N for an unknown nucleic acid residue or X for an unknown amino acid residue).
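The clean-up rule above can be sketched in a few lines of Python. This is only an illustration: the function name
clean_sequence and the two alphabets are assumptions made for this example, and the nucleotide alphabet is
deliberately kept small (real tools usually accept the full set of IUPAC ambiguity codes).

```python
# Hypothetical helper; the alphabets below are simplifying assumptions.
NUCLEOTIDES = set("ACGTUN")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBZX")


def clean_sequence(raw: str, is_protein: bool = False) -> str:
    """Remove layout characters and replace unrecognized letters with N or X."""
    alphabet = AMINO_ACIDS if is_protein else NUCLEOTIDES
    unknown = "X" if is_protein else "N"
    cleaned = []
    for ch in raw.upper():
        if ch.isspace() or ch.isdigit():
            continue  # spaces, line breaks and position numbers are removed
        cleaned.append(ch if ch in alphabet else unknown)
    return "".join(cleaned)


print(clean_sequence("acgt jz 10 acg"))            # nucleotide query
print(clean_sequence("MKJ LV", is_protein=True))   # protein query
```

Characters that carry no sequence information are dropped, while any other unrecognized letter is kept as a
placeholder so that residue positions are not shifted.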
There are also some unusual letters used in databases, in analysis programs and in protein sequences.
They are:
ONE-LETTER CODE    THREE-LETTER CODE    MEANING
B                  Asx                  Asparagine or aspartic acid
Z                  Glx                  Glutamine or glutamic acid
X                  Xaa                  Any residue
-                  ---                  No corresponding residue (gap)
Example: Every molecule of a given protein, such as insulin or myoglobin, contains the same number of amino
acids (also called residues) in the same proportions. The composition of an insulin protein can be written as
Insulin = (30 glycine + 44 alanine + 5 tyrosine + 14 glutamine + ……………)
Insulin was the first protein whose amino acid sequence was determined, in 1951.
Example: insulin = MALWMRLLPLLALLALWGPDP……….
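As a small illustration of how a sequence string relates to a composition formula, the sketch below counts the
residues in the insulin fragment quoted above. Only the leading residues given in the text are used (the rest
is elided there), so the counts describe that fragment and not the whole protein; the function name
composition is an assumption for this example.

```python
from collections import Counter

# Leading residues of insulin exactly as quoted in the text; the remainder is elided.
insulin_prefix = "MALWMRLLPLLALLALWGPDP"


def composition(sequence: str) -> dict[str, int]:
    """Return residue counts, most frequent residue first."""
    return dict(Counter(sequence).most_common())


counts = composition(insulin_prefix)
for residue, n in counts.items():
    share = 100 * n / len(insulin_prefix)
    print(f"{residue}: {n} ({share:.1f}%)")
```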
A good reference for the various genome sequencing projects is the website
www.genomeonline.org. Currently, 2210 projects are listed, of which 470 are completed and published.
PUBLISHED GENOMES AND ONGOING GENOME SEQUENCING
Prokaryotic genomes (completed genomic sequence and whole genome shotgun)
Bacterial: 711
Archaeal: 35
Eukaryotic genomes (completed genomic sequence and whole genome shotgun): 133
WHAT ARE THE PUBLIC BIOINFORMATICS DATABASES? (5 marks)
Database type    Example    Note
                 GenBank    One of the largest public sequence databases
                 DDBJ       DNA Data Bank of Japan
>P1;CATPAA
Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmid
MEKKITGYTTVDISQWHRKEHFEAFQSVAQCTYNQTVQLD
ITAFLKTVKKNKHKFYPAFIHILARLMNAHPEFRMAMKDGE
LVIWDSVHPCYTVFHEQTETFSSLWSEYHDDFRQFLHIYSQ
DVACYGENLAYFPKGFIENMFFVSANPWVSFTSFDLNVANM
DNFFAPVFTMGKYYTQGDKVLMPLAIQVHHAVCDGFHVGR
MLNELQQYCDEWQGGA
C; Species: Escherichia coli
R; Shaw, W.V., Packman, L.C., Burleigh, B.D., Dell, A., Morris, H.R., and Hartley, B.S.
Nature 282, 870-872, 1979 (Plasmid JR66b, complete sequence with experimental details)
A; The chloramphenicol binding site may include regions near residues 31 and 192-196. Lys-136 may be involved in
the formation of salt bridges between the chains.
R; Alton, N.K., and Vapnek, D.
Nature 282, 864-869, 1979 (Sequence translated from the nucleotide sequence for the transposable genetic element
Tn9)
A; Residues 77-219 correspond to a probable fusidic acid resistance protein.
R; Marcoli, R., Iida, S., and Bickle, T.A.
FEBS Lett. 110, 11-14, 1980 (Sequence translated from the nucleotide sequence for the transposon Tncam204,
derived from the R plasmid NR1 [=R100])
C; This enzyme, a type I variant mediated by an R plasmid in E. coli, exists as a tetramer of identical chains.
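The layout of the record above (an identifier line starting with ">", a title line, the sequence, then
annotation lines tagged C;, R; or A;) can be read with a short parser. The sketch below is a minimal
illustration that assumes exactly this layout; the function name parse_pir and the dictionary keys are
hypothetical, and the sample record is abbreviated from the entry above to keep the example short.

```python
import re

# Annotation lines in the record above start with one capital letter and a semicolon.
ANNOTATION = re.compile(r"^[A-Z];")


def parse_pir(text: str) -> dict:
    """Split a PIR-style record into identifier, title, sequence and annotations."""
    lines = [ln.rstrip() for ln in text.strip().splitlines()]
    seq_type, entry_id = lines[0].lstrip(">").split(";", 1)
    record = {"type": seq_type, "id": entry_id, "title": lines[1].strip(),
              "sequence": "", "annotations": []}
    in_annotations = False
    for line in lines[2:]:
        if ANNOTATION.match(line):
            in_annotations = True
            record["annotations"].append(line)
        elif in_annotations:
            # An untagged line continues the previous annotation.
            record["annotations"][-1] += " " + line.strip()
        else:
            # Sequence lines: drop spaces and an optional "*" terminator.
            record["sequence"] += line.replace(" ", "").rstrip("*")
    return record


# Abbreviated copy of the record above (first and last sequence lines only).
example = """>P1;CATPAA
Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmid
MEKKITGYTTVDISQWHRKEHFEAFQSVAQCTYNQTVQLD
MLNELQQYCDEWQGGA
C; Species: Escherichia coli
A; Residues 77-219 correspond to a probable
fusidic acid resistance protein.
"""

rec = parse_pir(example)
print(rec["id"], len(rec["sequence"]), len(rec["annotations"]))
```

For real work an established parser is preferable; this sketch only shows how the three parts of a record can
be told apart.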
The Protein Research Foundation (PRF) at Osaka, Japan, has recorded 51984 accesses since October 30, 1996. This
comprehensive protein resource offers several levels of research facilities in the field of proteomics, including its own
protein database. The PRF can be reached at http://www.prf.or.jp/en/
REM-TrEMBL contains sequences that are not destined to be included in SWISS-PROT. These
include immunoglobulins, T-cell receptors, fragments of fewer than eight amino acids, synthetic
sequences, patented sequences and codon translations (which do not encode real proteins).
EXPLAIN IN DETAIL ABOUT EMBL (The European Molecular Biology Laboratory) (10 marks)
INTRODUCTION
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl), maintained at the European
Bioinformatics Institute (EBI) near Cambridge, UK, is a comprehensive collection of nucleotide sequences
and annotation from available public sources. The database is part of an international collaboration with
DDBJ (Japan) and GenBank (USA).
European Bioinformatics Institute (EBI):
The European Bioinformatics Institute (EBI) is an outstation of the European Molecular Biology
Laboratory (EMBL) in Heidelberg, Germany.
It is located on the Wellcome Trust Genome Campus near Cambridge, UK.
The EBI Genomes pages provide access to, and statistics for, completed genomes, as well as information
about ongoing projects.
Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system
that produces and maintains automatic annotation of eukaryotic genomes.
What is the mission of EBI? (2 marks)
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is Europe's primary
nucleotide sequence resource.
This database is the European part of the International Nucleotide Sequence Database
Collaboration (INSDC), an international collaboration with DDBJ (Japan) (2) and GenBank
(USA) (3). Data are exchanged on a daily basis between the collaborating institutes.
The data in the EMBL Nucleotide Sequence Database originate from a combination of
large-scale genome sequencing projects, direct submissions from individual scientists and the
European Patent Office.
There is a quarterly release of the whole database and new and updated records are
distributed daily.
Size of the EMBL:
(i) Over the last year, the size of the EMBL Nucleotide Sequence Database has increased from 27.2
million entries in Release 76 (September 2003) to 42.3 million entries in Release 80.
(ii) Presently, it holds over a million entries covering 15500 species.
(iii) Homo sapiens, Caenorhabditis elegans, Saccharomyces cerevisiae, Mus musculus and
Arabidopsis thaliana constitute more than 50% of the database.
(iv) SRS (Etzold 1996) links the principal DNA and protein sequence databases with motif, structure and
mapping databases.
(v) Sequence entries have links to MEDLINE.
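The growth quoted in point (i) can be turned into a percentage with a one-line calculation. The sketch below
uses only the two entry counts given in the text; the variable names are chosen for this example.

```python
# Entry counts quoted in point (i), in numbers of entries.
entries_release_76 = 27.2e6
entries_release_80 = 42.3e6

# Relative growth between the two releases.
growth = (entries_release_80 - entries_release_76) / entries_release_76
print(f"Growth from Release 76 to Release 80: {growth:.1%}")
```

The database therefore grew by roughly 55% between the two releases.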
How to submit new sequences to the EMBL nucleotide sequence database? (2 marks)
The primary tool for submission of nucleotide sequence data is Webin. For alignment data, it
is Webin-Align. Projects with large-scale submissions can open a project account allowing
direct updates.
Information for submitters can be found here:
http://www.ebi.ac.uk/embl/Documentation/information_for_submitters.html. For submission
guidelines please see http://www.ebi.ac.uk/embl/Submission/.
Webin:
Webin is the preferred submission tool for nucleotide sequences and biological information. It should
also be used for TPA (Third Party Annotation) submissions. Webin allows fast submission of single,
multiple and very large numbers of sequences (bulk submissions) and is available at
http://www.ebi.ac.uk/embl/Submission/webin.html.
List out the data in the EMBL Nucleotide Sequence Database (5 marks)
Data in the EMBL Nucleotide Sequence Database are grouped into divisions, according to either the
methodology used in their generation (e.g. EST and HTG divisions) or taxonomic origin of the sequence
source (e.g. HUM and PRO divisions). There are also some specialized entry types.
New developments:
Sequence length limit
In the past, the sequence length of a database record was limited to 350 000 base pairs. In June 2004, this
restriction was lifted and entries of any length are now permitted in the database. Complete genomic units
such as entire chromosomes can now be represented in a single entry.
The SCOP (Structural Classification of Proteins) database is maintained at the MRC Laboratory of
Molecular Biology and Centre for Protein Engineering at Cambridge, United Kingdom. It can be reached
at http://scop.mrc-lmb.cam.ac.uk/scop/.
The SCOP database, created by manual inspection abetted by a battery of automated
methods, aims to provide a detailed and comprehensive description of the structural and evolutionary
relationships between all proteins whose structure is known, including all entries in the Protein
Data Bank (PDB).
It provides a broad survey of all known protein folds, detailed information about the close relatives of
any particular protein, and a framework for future research and classification.
Following classification by class, SCOP additionally classifies protein structures at a number of
hierarchical levels to reflect both evolutionary and structural relationships, namely family,
superfamily and fold.
It provides links to structure files that can be opened with RasMol or the Chime plug-in, and links back
to the PDB for downloading structures.
At the top level, known proteins are generally grouped by their secondary structure characteristics into
all-alpha, all-beta, coiled coil, small proteins with structured metal ions, and various types of mixed
alpha-beta structures. These major types are called classes within SCOP.
The next layer of classification, the fold level, is a mixture of topology and similarity in domains of
known function. For example, one fold is called 'Globin-like' and another 'four helical up and down
bundle'.
In SCOP, proteins are classified in a hierarchical fashion to reflect their structural and evolutionary
relatedness. Below the class level, the principal levels of the hierarchy are (i) fold, (ii) superfamily
and (iii) family.
SCOP is accessible for keyword interrogation through the MRC laboratory web server.
Specialized databases
Antibody Central: Antibody information database and search resource
BIOMOVIE (ETH Zurich): Movies related to biology and biotechnology
CGAP: Cancer genes (National Cancer Institute)
Clone Registry: Clone collections (National Center for Biotechnology Information)
Connectivity Map: Transcriptional expression data and correlation tools for drugs
CTD: The Comparative Toxicogenomics Database describes chemical-gene-disease interactions
DBGET: H. sapiens (Univ. of Kyoto)
DiProDB: A database to collect and analyse thermodynamic, structural and other dinucleotide properties
Dryad: A repository of data underlying scientific publications in evolution, ecology and related fields
Edinburgh Mouse Atlas
GreenPhylDB: A phylogenomic database for plant comparative genomics
ACEDB DATABASE
The first genome database, ACEDB (A Caenorhabditis elegans DataBase), and the methods to
access this database were developed by Mike Cherry and colleagues (Cherry and Cartinhour, 1993).
This database was accessible through the internet and allowed retrieval of sequences, information
about genes and mutants, investigators' addresses and references.
Similar databases were subsequently developed using the same methods for Arabidopsis thaliana and
Saccharomyces cerevisiae. Presently, there are a large number of such publicly available databases.
E.COLI DATABASE
There are several databases for Escherichia coli.
The Coli Genetic Stock Center (CGSC) maintains a database of E. coli genetic information,
including genotypes and reference information for the strains in the CGSC collection, gene names,
properties and linkage maps, gene product information and information on specific mutations.
The E. coli Database Collection (ECDC) is another example. The Encyclopedia of E. coli Genes and
Metabolism (EcoCyc) is a database of E. coli genes and metabolic pathways.
The Kyoto Encyclopedia of Genes and Genomes (KEGG) was developed for basic research and practical
applications.
KEGG is a database of biological systems that integrates genomic, chemical and systemic functional
information.
KEGG provides a reference knowledge base for linking genomes to life through the process of
PATHWAY mapping. KEGG organizes five types of data into a comprehensive system:
Expressed sequence tags (ESTs) may be used to identify gene transcripts, and are instrumental in gene
discovery and gene sequence determination.
The identification of ESTs has proceeded rapidly, with approximately 65.9 million ESTs now
available in public databases (e.g. GenBank 18/6/2010, all species).
Currently, GenBank divides ESTs into three major categories: human, mouse and other.
An EST is produced by one-shot sequencing of a cloned mRNA (i.e. sequencing several hundred
base pairs from an end of a cDNA clone taken from a cDNA library).
The resulting sequence is a relatively low quality fragment whose length is limited by current
technology to approximately 500 to 800 nucleotides.
Because these clones consist of DNA that is complementary to mRNA, the ESTs represent portions
of expressed genes.
They may be present in the database as either cDNA/mRNA sequence or as the reverse complement
of the mRNA, the template strand.
ESTs can be mapped to specific chromosome locations using physical mapping techniques, such as
radiation hybrid mapping, HAPPY mapping or FISH.
Alternatively, if the genome of the organism from which the EST originated has been sequenced, one can
align the EST sequence to that genome using a computer.
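Because an EST may be stored either as the mRNA-sense sequence or as its reverse complement, a
computer search has to try both strands of the genome. The sketch below is a simplified illustration using
exact string matching on a made-up toy genome; the function names are assumptions for this example, and
real EST alignment relies on dedicated alignment programs that tolerate mismatches, gaps and introns.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def locate_est(est: str, genome: str):
    """Return (strand, 0-based start) of the first exact match, or None."""
    est, genome = est.upper(), genome.upper()
    pos = genome.find(est)
    if pos != -1:
        return "+", pos
    pos = genome.find(reverse_complement(est))
    if pos != -1:
        return "-", pos
    return None


toy_genome = "TTGACCATGGCTAGCTAGGATCCAA"   # invented sequence for illustration
print(locate_est("ATGGCTAG", toy_genome))  # EST stored in the same sense
print(locate_est("CTAGCCAT", toy_genome))  # EST stored as the reverse complement
```

Both queries point to the same genomic position; only the reported strand differs.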
EST contigs
Because of the way ESTs are sequenced, many distinct expressed sequence tags are often partial
sequences that correspond to the same mRNA of an organism.
In an effort to reduce the number of expressed sequence tags for downstream gene discovery
analyses, several groups assembled expressed sequence tags into EST contigs.
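The idea of assembling overlapping ESTs into contigs can be sketched with a greedy merge: repeatedly join
the two fragments with the longest suffix-prefix overlap. This is a toy illustration under strong assumptions
(error-free reads, one strand only, invented sequences, hypothetical function names); it is not the method used
by any particular EST contig resource.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_ests(ests: list[str], min_len: int = 3) -> list[str]:
    """Greedily merge the best-overlapping pair until no overlap remains."""
    contigs = list(ests)
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Two overlapping fragments of one transcript plus one unrelated fragment.
toy_ests = ["ATGGCTAGC", "TAGCTTGAC", "GGGAAACCC"]
contigs = merge_ests(toy_ests)
print(contigs)
```

The two overlapping tags collapse into one longer contig, so three input sequences become two, which is
the reduction in redundancy that EST contigs are meant to achieve.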
Examples of resources that provide EST contigs include:
Top Ten Organisms for which ESTs have been sequenced (dbEST release 050903, May 2003)
UNIGENE
The goal of the UniGene (unique gene) project is to create one unique entry for each gene and to collect all the ESTs
associated with that gene. For example, in the case of RBP4, there is only one UniGene entry.