UNIT-II
DEFINE BIOLOGICAL DATABASES (2 marks)
Biological databases are libraries of life-science information, collected from scientific experiments,
published literature, high-throughput experimental technology and computational analyses.
They contain information from research areas including genomics, proteomics, metabolism,
microarray, gene expression and phylogenetics.
Information in biological databases includes gene function, structure, localization (both cellular and
chromosomal), clinical effects of mutations, and similarities of sequences and structures.
Bioinformatics combines the tools and techniques of biology, computer science, information
technology, statistics and mathematics. Genome sequencing laboratories originally developed
bioinformatics tools for
 Accumulation of sequencing data
 Sequence assembly
 Annotation
 Curation and analysis

BIOLOGICALLY IMPORTANT DATABASES


Biological databases store data related to biology. The data may come from the literature, sequencing projects or
structure determination. There are a large number of databases for storing different types of sequences. They
are as follows.
 Nucleotide sequence databases
Genbank, EMBL, DDBJ
 Protein sequence databases
SWISS-PROT, PIR, TrEMBL, MIPS, NRL-3D, GenPept and so on.
 Complete Genomic databases
GDB, MGD, SGD and so on.
 Specialized databases
ESTs, STSs, REPBASE and so on.
 3D structure databases
PDB
 Retrieval system databases
SRS, Entrez, DBGET and so on.
 Classifying proteins based on structural similarity databases
CATH, SCOP, and FSSP
By choosing a suitable scoring matrix (such as BLOSUM or PAM) and a search algorithm (such as FASTA or BLAST), one
can search these databases with a query sequence to find the best homologous sequence.

What is the Importance of Biological Databases? (2 marks)


 A database is a logically coherent collection of related data with inherent meaning, built for a certain application.
It is composed of entries, which are discrete coherent parcels of information. It is a general repository of information
and contains records to be processed by a program. Its contents can easily be accessed, managed and
updated.
 Databases can be searched or cross-referenced either over the Internet or using downloaded versions on local
computers or computer networks by multiple users. Databases are electronic filing cabinets, a convenient
and efficient method of storing vast amounts of information.
 Databases are needed to collect and preserve data, to make data easy to find and search, to standardize data
representation and to organize data into knowledge.
 The primary goals of databases are i) minimizing data redundancy and ii) achieving data independence.
 Information available in these databases can be searched, compared, retrieved and analyzed.
 Databases are essential for managing similar kinds of data and for developing a network to access them across the
globe.

Explain the Classification of biological Databases: (5 marks)


There are many different database types, depending both on the nature of the information being stored and on
the manner of data storage. Databases are broadly classified into two types, namely,
 Generalized databases and
 Specialized databases.
Databases of DNA, protein, carbohydrate etc. are called generalized databases.
Databases of expressed sequence tags (EST), genome survey sequences (GSS), single nucleotide polymorphisms
(SNP), sequence tagged sites (STS), RNA etc. are some of the specialized databases.
Generalized databases are again broadly classified into sequence databases and structure databases.
Sequence databases contain the individual sequence records of either nucleotides
or amino acids (proteins).
Structure databases contain the individual records of experimentally solved structures
of macromolecules (for example, protein 3D structures).
Sequence databases are of two types: nucleic acid databases and protein sequence databases.
(i) The nucleic acid databases may be divided into primary databases, secondary (value-added) databases
and composite databases.
Primary databases contain the data in their original form, taken as such from the source, e.g., GenBank (DNA),
DDBJ (Japan; DNA), GSDB (NCGR, USA; DNA), PIR/NBRF (USA; protein), SWISS-PROT
(Switzerland; protein) and PDB (BNL, USA; 3D structure).
Secondary or value-added databases contain annotated data and information, e.g., OMIM (Online Mendelian
Inheritance in Man; gene and clinical data), GDB (Genome Data Base; human), PROSITE and BLOCKS (protein motifs),
KEGG and EcoCyc (metabolism), SCOP and CATH.
A composite database amalgamates a variety of different primary database sources into one. Some important
composite databases are NRDB (Non-Redundant Data Base), OWL (a non-redundant composite protein sequence
database, not the Web Ontology Language), MIPSX (from the Martinsried Institute for Protein Sequences; it
comprises information from PIR 1-4, MIPS preliminary translations, MIPSH, NRL-3D, SWISS-PROT, EMTrans
(an automated translation of EMBL), GBTrans (translated GenBank entries), Kabat and PseqIP) and
SWISS-PROT+TrEMBL.
Other specialized databases are: Kabat (immunology proteins), LIGAND (enzyme reaction ligands), Klotho
(biochemical compounds) and PKR (Protein Kinase Resource, SDSC; protein kinases).

Define Database Entries: (2 marks)


 Database entries comprise new experimental results and supplementary information in the form of annotations.
Annotations include information about the source of the data and the methods used to determine them.
 They identify the investigators responsible for the discovery and cite relevant publications. They provide links
to connected information in other databanks. Curators in databanks base their annotations on the analysis of
the sequence by computer programs.
 To make sure that all the fundamental data related to DNA and RNA are freely available, scientific journals
require deposition of new nucleotide sequences in a database as a condition for publication of an article.
Similar conditions apply to amino acid sequences, and to nucleic acid and protein structures.

What are Sequence Formats? (2 marks)


 Many databases and software applications are designed to work with sequence data, and this requires a
standard format for inputting nucleic acid and protein sequence information.
 Three of the most common sequence formats are NBRF/PIR (National Biomedical Research
Foundation/Protein Information Resource), FASTA and GDE.
 Each of these formats has facilities not only for representing the sequence itself, but also for inserting a unique
code to identify the sequence and for making comments, which may include, for example, the name of the
sequence, the species from which it was derived, and an accession number for GenBank or another
appropriate database.
 NBRF/PIR format begins with either >P1; for protein or >N1; for nucleic acid.
 FASTA format begins with only ‘>’, and the GDE format begins with ‘%’.
 A feature table (lines beginning FT) is a component of the annotation of an entry that reports properties of
specific regions, for instance coding sequences (CDS).
 The feature table may indicate regions that perform or affect function, that interact with other molecules, that
affect replication, that are involved in recombination, that are a repeated unit, that have secondary or tertiary
structure and that are revised or corrected.
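The header prefixes listed above (>P1; or >N1; for NBRF/PIR, > for FASTA and % for GDE) are enough to guess the format of a sequence file from its first line. The following is a minimal Python sketch of that idea; the function name detect_sequence_format is hypothetical and the check is deliberately simplistic (it looks only at the first non-blank line).

```python
def detect_sequence_format(text):
    """Guess the sequence format from the first non-blank line."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # NBRF/PIR must be tested before FASTA because both start with '>'.
        if line.startswith(">P1;"):
            return "NBRF/PIR (protein)"
        if line.startswith(">N1;"):
            return "NBRF/PIR (nucleic acid)"
        if line.startswith(">"):
            return "FASTA"
        if line.startswith("%"):
            return "GDE"
        return "unknown"
    return "unknown"


print(detect_sequence_format(">P1;CATPAA\nChloramphenicol acetyltransferase"))
print(detect_sequence_format(">seq1 demo\nACGT"))
```

The order of the tests matters: a PIR header would also satisfy the plain FASTA test, so the more specific prefixes are checked first.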

Database Record:
A typical database record contains three sections:
 The header includes a description of the sequence, its organism of origin, allied literature references and cross
links to related sequences in other databases.
 The locus field contains a unique identifier summarizing the function of the sequence in abbreviated form and is
followed by an accession number in the accession field.
 The organism field contains the binomial name of the organism and its full taxonomic classification.
 The feature table contains a description of the features in the record, such as coding sequences, exons, repeats,
promoters, etc., for nucleotide sequences, and domains, structural elements, binding sites, etc., for protein
sequences. If the feature table includes a coding DNA sequence (CDS), a link to the translated protein
sequence is also given in the feature description.
 The sequence itself, which is the part most easily analyzed by computer.

PRIMARY DATABASES (10 marks)


As early as the 1980s, sequence information was becoming abundantly available. This was the beginning of
primary databases. The major primary databases for nucleic acids and proteins are as follows:

NUCLEIC ACIDS      PROTEIN
EMBL               PIR
GenBank            MIPS
DDBJ               SWISS-PROT
                   TrEMBL
                   NRL-3D

NUCLEIC ACID SEQUENCE DATABASES: (10 marks)


DNA and proteins are complicated 3D molecules, composed of thousands to millions of atoms bonded together. Both
are polymers: chains of repeating chemical units (monomers) held together by a common backbone.
In DNA, four nucleotide monomers (A, T, C and G) are used to build the polymer
chain. There are three principal nucleic acid sequence databases, maintained by
three premier institutes that are considered the authorities on nucleotide sequence databases:
(i) The EMBL (European Molecular Biology Laboratory),
(ii) The NCBI – GenBank (National Center for Biotechnology Information) (refer printout notes page
no. 6 to 9) and
(iii) The DNA Data Bank of Japan (DDBJ).
These institutes form the International Nucleotide Sequence Database Collaboration, under which they exchange
nucleotide sequence data on a daily basis.

Define Sequence Database Formats: (2 marks)


 In the sequence databases, each sequence is an “entry” and each entry is divided into “fields”. Fields are used
to create indices for relational databases.
 Each field is essentially a “table” and the field values are indices. Entry names are usually NOT stable, whereas
accession numbers (which are unique) SHOULD be stable. Date fields record when an entry was created and updated.
 The most commonly used and internationally accepted sequence format in bioinformatics is the FASTA format.

Explain about FASTA Format: (5 marks)


 A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The
description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column. It is
recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA
format (a single-line description followed by the sequence data) is:
>gi|532319|pir|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNESNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
APTEVRRYIGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXXXVQSQHILLAGILQQQKNL
LLAVEAQQMLKLTIWGVK
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these
exceptions:
 Lower-case letters are accepted and are mapped into upper-case.
 A single hyphen or dash can be used to represent a gap of indeterminate length.
 In amino acid sequences, U and * are acceptable letters.
 Before submitting a request, any numerical digits in the query sequence should either be removed or replaced
by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
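The FASTA rules described above (a ">" description line, sequence lines that follow it, lower case mapped to upper case, digits removed) can be sketched in a few lines of Python. This is an illustrative parser only, not the code used by any particular database; the function name parse_fasta and the demo record are hypothetical.

```python
import re


def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    description, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            # Remove digits and whitespace, then map lower case to upper case.
            chunks.append(re.sub(r"[\d\s]", "", line).upper())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Hypothetical demo record with lower-case letters, spaces and position numbers.
demo = ">seq1 hypothetical demo protein\nmkt ay iak\n11 QRQISFVK\n"
print(parse_fasta(demo))
```

Here digits are simply removed; replacing them with N or X, the other option mentioned above, would be an equally valid choice.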

EXPLAIN THE ANALYSIS METHODS OF SEQUENCES (5 marks)


(I) ANALYSING DNA AND RNA SEQUENCES:
Deoxyribonucleic acid (DNA), which makes up our genes, is a large molecule consisting of a chain of
four constituents called nucleotides. A nucleotide is made up of one phosphate group linked to a
pentose sugar and one of four nitrogenous organic bases, symbolized by the letters A, G, C and T. DNA is very stable and
resistant. The following table lists the one-letter codes of the International Union of Pure and Applied
Chemistry (IUPAC) used to work with DNA sequences.
The nucleic acid codes supported are:
A → adenosine            M → AC (amino)
C → cytidine             S → GC (strong)
G → guanosine            W → AT (weak)
T → thymidine            B → GTC
U → uridine              D → GAT
R → GA (purine)          H → ACT
Y → TC (pyrimidine)      V → GCA
K → GT (keto)            N → AGCT (any)
- → gap of indeterminate length
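The ambiguity codes in the table above can be stored as a simple lookup from each code to the set of bases it stands for. The sketch below, with the hypothetical helper name matches, checks whether a concrete DNA sequence is compatible with a pattern written in these codes; it is an illustration of the table, not part of any database software.

```python
# Each IUPAC code maps to the bases it may represent (taken from the table above).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT", "B": "GTC", "D": "GAT",
    "H": "ACT", "V": "GCA", "N": "AGCT",
}


def matches(pattern, sequence):
    """True if every base of sequence is allowed by the code at that position."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC_NUCLEOTIDES.get(code, "")
        for code, base in zip(pattern.upper(), sequence.upper())
    )


print(matches("RYN", "GTA"))  # R allows G, Y allows T, N allows A
print(matches("RYN", "CTA"))  # R does not allow C
```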
In the late 1970s, molecular biologists learned to determine the sequence of DNA molecules and so gained access to
gene nucleotide sequences. Compared with the 20 amino acids of proteins, the four nucleotides
allowed much simpler and faster reading (work for which F. Sanger received a Nobel Prize), and this quickly led to
complete sequence determination.
Ribonucleic acid (RNA) is a much more active nucleic acid, which is synthesized and degraded
constantly as it makes copies of genes available to the cell factory. In bioinformatics, the differences between
RNA and DNA are:
1. RNA differs from DNA by one nucleotide (uracil, U, replaces thymine, T) and
2. RNA comes as a single strand.
(II) ANALYSING PROTEIN SEQUENCES
Proteins are made up of building blocks called amino acids. Amino acids are complex organic molecules
made up of carbon, hydrogen, oxygen, nitrogen and, in some cases, sulphur atoms. Proteins are macromolecules made
up of a large number of amino acids (typically from 100 to 500), picked from a selection of 20 “flavours” with
names such as alanine, glycine, tyrosine etc. The 20 amino acids and their official codes are listed below.
For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid
codes are:
S.NO  ONE-LETTER CODE  THREE-LETTER CODE AND NAME
1 A Ala Alanine
2 R Arg Arginine
3 N Asn Asparagine
4 D Asp Aspartic acid
5 C Cys Cysteine
6 Q Gln Glutamine
7 E Glu Glutamic acid
8 G Gly Glycine
9 H His Histidine
10 I Ile Isoleucine
11 L Leu Leucine
12 K Lys Lysine
13 M Met Methionine
14 F Phe Phenylalanine
15 P Pro Proline
16 S Ser Serine
17 T Thr Threonine
18 W Trp Tryptophan
19 Y Tyr Tyrosine
20 V Val Valine

There are also some unusual letters used in the databases, in analysis programs or in protein sequences.
They are:
ONE-LETTER CODE   THREE-LETTER CODE   MEANING
B                 Asx                 Asparagine or aspartic acid
Z                 Glx                 Glutamine or glutamic acid
X                 Xaa                 Any residue
-                 ---                 No corresponding residue (gap)
Example: Every molecule of a given protein, such as insulin or myoglobin, contains the same number of amino acids
(also called residues) in the same proportions. A protein can therefore be described by a composition formula, written in the form
Insulin = (30 glycine + 44 alanine + 5 tyrosine + 14 glutamine + ……………)
The first amino acid sequence of a protein, insulin, was determined in 1951.
Example: insulin = MALWMRLLPLLALLALWGPDP……….
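The one-letter codes and the idea of an amino acid composition can both be illustrated with a short Python sketch. The dictionary below is typed from the table of 20 standard residues; the function names to_three_letter and composition are hypothetical, and the input is only the 21-residue fragment quoted above, not a complete protein.

```python
from collections import Counter

# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence):
    """Spell out a one-letter sequence; unknown letters become Xaa."""
    return " ".join(THREE_LETTER.get(res, "Xaa") for res in sequence.upper())


def composition(sequence):
    """Count how often each residue occurs in the sequence."""
    return Counter(sequence.upper())


fragment = "MALWMRLLPLLALLALWGPDP"
print(to_three_letter(fragment[:4]))         # Met Ala Leu Trp
print(composition(fragment).most_common(1))  # [('L', 8)]
```

The composition discards the order of the residues, which is why the sequence itself, and not the formula, is what the databases store.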
A good reference for various genome sequencing projects is the website
www.genomeonline.org. At the time of writing, 2210 projects were listed, of which 470 were completed and published.
PUBLISHED GENOMES AND ONGOING GENOME SEQUENCING
Prokaryotic genomes: Completed genomic sequence and whole genome shotgun
 Bacterial: 711
 Archaeal: 35
Eukaryotic genomes: Completed genomic sequence and whole genome shotgun
 133
WHAT ARE THE PUBLIC BIOINFORMATICS DATABASES? (5 marks)
Database type                 Example                         Note
Nucleotide sequence           GenBank                         One of the largest public sequence databases
                              DDBJ                            DNA DataBank of Japan
                              EMBL                            European Molecular Biology Laboratory
                              MGDB                            Mouse Genome Database
                              NDB                             Nucleic Acid Database
Protein sequence              SWISS-PROT                      Swiss Institute for Bioinformatics and European Bioinformatics Institute
                              TrEMBL                          Annotated supplement to SWISS-PROT
                              TrEMBL new                      Weekly, pre-processed update to TrEMBL
                              PIR                             Protein Information Resource
3D structures                 PDB                             Protein Data Bank
                              MMDB                            Molecular Modelling Database
                              Cambridge Structural Database   For small molecules
Enzymes and compounds         LIGAND                          Chemical compounds and reactions
Sequence motifs (alignment)   PROSITE                         Sequence motifs
                              BLOCKS                          Derived from PROSITE
                              PRINTS                          A superset of BLOCKS
                              Pfam                            Protein families database of alignments and hidden Markov models
                              ProDOM                          Protein domains
Pathways and complexes        Pathway, KEGG                   Metabolic and regulatory pathway maps
Molecular disease             OMIM                            Online Mendelian Inheritance in Man
Biomedical literature         PubMed                          Contains Medline
                              Medline                         Medical literature
ENUMERATE PROTEIN DATABASE: (10 marks)


(1) SWISS-PROT is a protein sequence database. The Department of Medical Biochemistry, Geneva,
initiated SWISS-PROT in 1986. The Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics
Institute (EBI) now maintain the SWISS-PROT protein sequence data bank together. The SIB and EBI
make a major effort to annotate, describe and distribute to the scientific community a database with a high level of
annotation (such as the description of the function of a protein, its domain structure, post-translational
modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other
databases.
 SWISS-PROT is hosted on the Expert Protein Analysis System (ExPASy) proteomics server. ExPASy
also maintains PROSITE, a protein families and domains database; SWISS-2DPAGE, a two-dimensional
polyacrylamide gel electrophoresis database; SWISS-3DIMAGE, a database of 3D images of proteins
and other biological macromolecules; ENZYME, an enzyme nomenclature database; and other
such protein-related databases and analysis tools.
SWISS-PROT can be reached at: http://www.expasy.ch/sprot/sprot-top.html
Swiss Prot Protein Entry Format
ID   T160_HUMAN     STANDARD;      PRT;   513 AA.
AC   Q92993; Q13430;
DT   15-JUL-1998 (Rel. 36, Created)
DT   15-JUL-1998 (Rel. 36, Last sequence update)
DT   30-MAY-2000 (Rel. 39, Last annotation update)
DE   60 KDA TAT INTERACTIVE PROTEIN.
GN   TIP60
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   TISSUE=Lymphoblast;
RX   MEDLINE; 96182937.
RA   Kamine J., Elangovan B., Subramanian T., Coleman D.,
RA   Chinnadurai G.,
RT   “Identification of a cellular protein that specifically interacts
RT   with the essential cysteine region of the HIV-1 Tat transactivator”;
RL   Virology 216:357-366 (1996).
CC   -!- FUNCTION: BINDS TO THE TAT PROTEIN OF THE HUMAN IMMUNODEFICIENCY
CC       VIRUS (HIV). SPECIFIC BINDING OF TIP60 TO TAT MIGHT BE AN
CC       IMPORTANT FEATURE FOR EFFICIENT TAT TRANSACTIVATION OF HIV GENE
CC       EXPRESSION. INTERACTS WITH HIV-1 TAT.
CC   -!- SIMILARITY: BELONGS TO THE MYST (SAS/MOZ) FAMILY.

CC This SWISS-PROT entry is copyright. It is produced through collaboration


CC or send an email to [email protected]).
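In the entry above, every line starts with a two-letter line code (ID, AC, DT, DE, OS and so on) that names the field. A minimal Python sketch of reading such an entry is shown below. The function name parse_swissprot_lines is hypothetical, the sample is a shortened copy of the entry above, and a real parser would need to handle many more line types.

```python
from collections import defaultdict


def parse_swissprot_lines(entry_text):
    """Group the lines of a SWISS-PROT style entry by their two-letter code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if len(line) < 2 or line.startswith("//"):
            continue  # skip blank lines and the entry terminator
        code, value = line[:2], line[2:].strip()
        fields[code].append(value)
    return dict(fields)


sample = """ID   T160_HUMAN     STANDARD;      PRT;   513 AA.
AC   Q92993; Q13430;
DE   60 KDA TAT INTERACTIVE PROTEIN.
GN   TIP60
OS   Homo sapiens (Human).
"""

fields = parse_swissprot_lines(sample)
# Accession numbers are separated by semicolons on the AC line(s).
accessions = [a.strip() for a in " ".join(fields["AC"]).split(";") if a.strip()]
# The ID line ends with the sequence length followed by 'AA.'.
length = int(fields["ID"][0].split()[-2])
print(accessions, length)
```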

(2) EXPLAIN ABOUT PROTEIN INFORMATION RESOURCES (PIR): (5 marks)


 The Protein Information Resource is an integrated public resource of functional annotation of protein
data. It aims to support genomic/proteomic research and scientific discovery.
 Started by the National Biomedical Research Foundation (NBRF), Georgetown, USA, in 1984, the PIR
provides databases and analysis tools to support research on molecular evolution, functional genomics
and computational biology.
 It is structured into two major groups, PIR-PSD (Protein Sequence Database) and iProClass (protein
classification database). As of September 14, 2001, the PIR contained 246303 entries.
 PIR can be reached at: http://www-nbrf.georgetown.edu/pir/
 The PIR is an effective combination of a carefully curated database, information retrieval access
software, and a workbench for investigations of sequences. The PIR also produces the Integrated
Environment for Sequence Analysis (IESA). Its functionality includes browsing, searching and
similarity analysis, and links to other databases including PDB, KEGG, WIT and BRENDA.
 PIR-International (an association of macromolecular sequence data collection centres) includes the Japan
International Protein Information Database (JIPID) and the Martinsried Institute for Protein Sequences (MIPS).
 The PIR database is split into four sections, PIR1 to PIR4.
 PIR1 to PIR4 differ in quality of data and level of annotation.
 PIR1 - Fully classified and annotated entries
 PIR2 - Preliminary entries
 PIR3 - Unverified entries
 PIR4 - Conceptual translations of artefactual sequences, conceptual translations of sequences
not transcribed or translated, and conceptual translations of genetically engineered sequences
that are not encoded or produced on ribosomes.
PIR FORMAT:
The following is an example of a PIR-formatted Protein sequence obtained from the PIR protein library PROTEIN
using the COPY command of the program PSQ. The documentation comments following the sequence are in the PIR-
NBRF format.

>P1;CATPAA
Chloramphenicol acetyltransferase (EC 2.3.1.28) - E. coli plasmid
MEKKITGYTTVDISQWHRKEHFEAFQSVAQCTYNQTVQLD
ITAFLKTVKKNKHKFYPAFIHILARLMNAHPEFRMAMKDGE
LVIWDSVHPCYTVFHEQTETFSSLWSEYHDDFRQFLHIYSQ
DVACYGENLAYFPKGFIENMFFVSANPWVSFTSFDLNVANM
DNFFAPVFTMGKYYTQGDKVLMPLAIQVHHAVCDGFHVGR
MLNELQQYCDEWQGGA
C;Species: Escherichia coli
R;Shaw, W.V., Packman, L.C., Burleigh, B.D., Dell, A., Morris, H.R., and Hartley, B.S.
Nature 282, 870-872, 1979 (Plasmid JR66b, complete sequence with experimental details)
A;The chloramphenicol binding site may include regions near residues 31 and 192-196. Lys-136 may be involved in
the formation of salt bridges between the chains.
R;Alton, N.K., and Vapnek, D.
Nature 282, 864-869, 1979 (Sequence translated from the nucleotide sequence for the transposable genetic element
Tn9)
A;Residues 77-219 correspond to a probable fusidic acid resistance protein.
R;Marcoli, R., Iida, S., and Bickle, T.A.
FEBS Lett. 110, 11-14, 1980 (Sequence translated from the nucleotide sequence for the transposon, Tncam204,
derived from the R plasmid NR1 [=R100])
C;This enzyme, a type I variant mediated by an R plasmid in E. coli, exists as a tetramer of identical chains.
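The layout of the record above (a >P1; header with the entry code, one description line, the sequence lines, then documentation lines such as C;, R; and A;) can be read with a short Python sketch. The function name parse_pir is hypothetical, and two points are assumptions rather than facts taken from these notes: that documentation lines always begin with a capital letter followed by a semicolon, and that an optional asterisk may mark the end of the sequence.

```python
import re


def parse_pir(text):
    """Parse one NBRF/PIR record into a dict (a sketch, not a full parser)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    if not header.startswith((">P1;", ">N1;")):
        raise ValueError("not an NBRF/PIR record")
    residues = []
    for line in lines[2:]:
        # Assumption: documentation lines look like 'C;...', 'R;...' or 'A;...'.
        if re.match(r"^[A-Z];", line):
            break
        # Assumption: an optional '*' terminates the sequence.
        residues.append("".join(line.split("*")[0].split()))
        if "*" in line:
            break
    return {
        "type": "protein" if header.startswith(">P1;") else "nucleic acid",
        "id": header[4:],
        "description": lines[1],
        "sequence": "".join(residues),
    }


# Shortened, hypothetical demo record based on the example above.
demo_pir = """>P1;CATPAA
Chloramphenicol acetyltransferase - E. coli plasmid
MEKKITGYTT VDISQW
MLNELQQYCDEWQGGA*
C;Species: Escherichia coli
"""
record = parse_pir(demo_pir)
print(record["id"], record["type"], len(record["sequence"]))
```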

(3) WHAT IS PROTEIN RESEARCH FOUNDATION (PRF)? (2 marks)

The Protein Research Foundation at Osaka, Japan, has recorded 51984 accesses since October 30, 1996. It
offers several levels of research facilities in the field of proteomics, including its own comprehensive
protein database. The PRF can be reached at http://www.prf.or.jp/en/

(4) WHAT IS MARTINSRIED INSTITUTE FOR PROTEIN SEQUENCES (MIPS)? (2 marks)


 MIPS collects and processes sequence data for the tripartite PIR-International Protein Sequence Database
Project. The database is distributed with PATCHX, a supplement of unverified protein sequences from
external sources.
 Access to the database is provided through its Web server. Results of FastA similarity searches of all proteins
within PIR-International and PATCHX are stored in a dynamically maintained database, allowing instant access
to FastA results.

(5) WHAT IS TrEMBL (Translated EMBL)? (2 marks)


 TrEMBL was created in 1996 as a computer-annotated supplement to SWISS-PROT. The database follows the
SWISS-PROT format and contains translations of all coding sequences (CDS) in EMBL.
 It has two main sections. They are
 SP-TrEMBL (SWISS-PROT TrEMBL) contains entries that will eventually be incorporated into
SWISS-PROT but have not yet been manually annotated.

 REM-TrEMBL contains sequences that are not destined to be included in SWISS-PROT. These
include immunoglobulins, T-cell receptors, fragments of fewer than eight amino acids, synthetic
sequences, patented sequences and codon translations (which do not encode real proteins).
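Translating a coding sequence (CDS) into protein, as TrEMBL does for EMBL entries, amounts to reading the DNA three bases at a time and looking up each codon. The Python sketch below uses the standard genetic code and a hypothetical function name, translate_cds; it is an illustration of the principle and not the pipeline actually used to build TrEMBL. The demo input is the first 51 bases of a human c-K-ras mRNA.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna, codon_start=1):
    """Translate a coding sequence; '*' marks a stop codon, 'X' an unknown codon."""
    seq = "".join(dna.upper().replace("U", "T").split())
    protein = []
    # Step through complete codons only, starting at the given reading frame.
    for i in range(codon_start - 1, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


cds = "atgactgaat ataaacttgt ggtagttgga gctggtggcg taggcaagag t"
print(translate_cds(cds))  # MTEYKLVVVGAGGVGKS
```

The codon_start parameter mirrors the /codon_start qualifier found in nucleotide feature tables, which says at which base the first complete codon begins.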

(6) WHAT IS NRL-3D DATABASE? (2 marks)


 The NRL-3D database is produced by PIR from sequences extracted from the Brookhaven Protein Data
Bank (PDB).
 The titles and biological sources of the entries conform to the nomenclature standards used in the PIR.
 It provides bibliographic references, MEDLINE cross-references, secondary structure, active site, binding
site and modified site annotations, keywords, details of the experimental method, resolution and R factors.
 It is a valuable resource, as it makes the sequence information in the PDB available for both keyword
interrogation and similarity searches.
 The ATLAS retrieval system, a multidatabase information retrieval program specifically designed to access
macromolecular sequence databases, is used for searching the database.

WHAT ARE THE SECONDARY DATABASES (5 marks)


A secondary database contains information derived from primary sequence data, typically in the form of regular
expressions (PATTERNS), fingerprints, BLOCKS, PROFILES or HIDDEN MARKOV MODELS. These abstractions
represent distillations of the most conserved features of multiple alignments, such that they are able to provide potent
discriminators of family membership for newly determined sequences.

SEQUENCE DATABASES (10 marks)


The Sequence databases contain the individual sequence records of either nucleotide or amino acids or
proteins. (Refer printout notes page no. 6 to 9, and 14-22)
The sequence databases are common databases and are of three types. They are Annotated, Low Annotation
and Specialised.

EXPLAIN IN DETAIL ABOUT EMBL (The European Molecular Biology Laboratory) (10 marks)
INTRODUCTION
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl), maintained at the European
Bioinformatics Institute (EBI) near Cambridge, UK, is a comprehensive collection of nucleotide sequences
and annotation from available public sources. The database is part of an international collaboration with
DDBJ (Japan) and GenBank (USA).
European Bioinformatics Institute (EBI):
 The European Bioinformatics Institute (EBI) is an outstation of the European Molecular Biology
Laboratory (EMBL), which is headquartered in Heidelberg, Germany.
 It is located on the Wellcome Trust Genome Campus near Cambridge, UK.
 The EBI genomes pages provide access to and statistics for the completed genomes, and information about
ongoing projects.
 Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system
that produces and maintains automatic annotation on eukaryotic genomes.
What is the mission of EBI? (2 marks)
 The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is Europe’s primary
nucleotide sequence resource.
 This database is the European part of an international collaboration with DDBJ (Japan) (2)
and GenBank (USA) (3) (INSDC, International Nucleotide Sequence Database
Collaboration). Data are exchanged on a daily basis between the collaborating institutes.
 The data in the EMBL Nucleotide Sequence Database originate from a combination of
large-scale genome sequencing projects, direct submissions from individual scientists and the
European Patent Office.
 There is a quarterly release of the whole database, and new and updated records are
distributed daily.
Size of the EMBL:
(i) Over the last year, the size of the EMBL Nucleotide Sequence Database has increased from 27.2
million entries in Release 76 (September 2003) to 42.3 million entries in Release 80.
(ii) Presently, databases of over a million entries and 15500 species exist.
(iii) Homo sapiens, Caenorhabditis elegans, Saccharomyces cerevisiae, Mus musculus and
Arabidopsis thaliana constitute more than 50% of the database.
(iv) SRS (Etzold 1996) links the principal DNA and protein sequence databases with motif, structure and
mapping databases.
(v) Sequence entries have links to MEDLINE.

Submissions to the EMBL nucleotide sequence database:


Why is it essential to submit new sequence? (2 marks)
 Printing sequence data as part of a publication is neither sensible nor manageable; hence journals
prefer to cite only the accession number assigned by the INSD Collaboration.
 Most journals have a mandatory submission procedure such that papers will only be accepted if they
have an accession number. The nucleotide sequence is considered part of the publication and
therefore almost all nucleotide sequences are publicly available.
 Having your sequence in the database means it is readily available to the scientific user community
 A repository of primary nucleotide sequence data that is freely accessible is essential for
computational analysis and genome research

How to submit new sequences to the EMBL nucleotide sequence database? (2 marks)
The primary tool for submission of nucleotide sequence data is Webin. For alignment data, it
is Webin-Align. Projects with large-scale submissions can open a project account allowing
direct updates.
Information for submitters can be found here:
http://www.ebi.ac.uk/embl/Documentation/information_for_submitters.html. For submission
guidelines please see http://www.ebi.ac.uk/embl/Submission/.
Webin:
Webin is the preferred submission tool for nucleotide sequences and biological information. It should
also be used for TPA submissions. Webin allows fast submissions of single, multiple and very large
numbers of sequences (bulk submissions) and is available at
http://www.ebi.ac.uk/embl/Submission/webin.html.

List out the data in the EMBL Nucleotide Sequence Database (5 marks)

Data in the EMBL Nucleotide Sequence Database are grouped into divisions, according to either the
methodology used in their generation (e.g. EST and HTG divisions) or taxonomic origin of the sequence
source (e.g. HUM and PRO divisions). There are also some specialized entry types.

Whole Genome Shotgun (WGS) Data


Methods using WGS data are used to gain a large amount of genome coverage for an organism. The
sequences of all contigs originating from one experiment are grouped in a set. WGS entries have the
standard EMBL format, with accession numbers clearly distinct from those of non-WGS entries. The
accession numbers of all entries in each WGS set share the same prefix.
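Because all entries of one WGS set share the same accession prefix, a list of accession numbers can be grouped into sets by that prefix. The Python sketch below assumes, purely for illustration, that a WGS accession consists of four letters and two digits (the shared prefix) followed by at least six further digits; this pattern, the function name group_wgs_sets and the sample accessions are assumptions and are not taken from these notes.

```python
import re
from collections import defaultdict

# Assumed WGS layout: 4 letters + 2 digits (set prefix) + 6 or more digits.
WGS_PATTERN = re.compile(r"^([A-Z]{4}\d{2})\d{6,}$")


def group_wgs_sets(accessions):
    """Group accession numbers by their shared WGS prefix."""
    sets = defaultdict(list)
    for accession in accessions:
        match = WGS_PATTERN.match(accession)
        key = match.group(1) if match else "non-WGS"
        sets[key].append(accession)
    return dict(sets)


# Hypothetical accession numbers used only to exercise the function.
sample_accessions = ["AAAA01000001", "AAAA01000002", "BBBB02000001", "M35504"]
print(group_wgs_sets(sample_accessions))
```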

Third Party Annotation (TPA) Data:


The Third Party Annotation data set was launched in response to requests from the research
community to submit entries that include either re-annotation of existing data, or combinations of novel
sequence, existing primary sequence, trace archive and WGS data.

Accessing the EMBL Nucleotide Sequence Database


The EMBL Nucleotide Sequence Database is available from the EBI via various WWW interfaces, ftp and
email (for more information see http://www.ebi.ac.uk/embl/Access).

SEQUENCE RETRIEVAL SYSTEM (SRS)


The EMBL Nucleotide Sequence Database can be accessed via the EBI SRS server at http://srs.ebi.ac.uk/.

New developments:
Sequence length limit
In the past, the sequence length of a database record was limited to 350 000 base pairs. In June 2004, this
restriction was lifted and entries of any length are now permitted in the database. Complete genomic units
such as entire chromosomes can now be represented in a single entry.

EXPLAIN IN DETAIL ABOUT DDBJ (DNA DataBank of Japan) (10 marks)


DDBJ is the DNA Data Bank of Japan. It is located at the National Institute of Genetics (NIG), in Shizuoka
Prefecture, Japan. It is a member of the International Nucleotide Sequence Database Collaboration
(INSDC).
It exchanges its data with EMBL at the European Bioinformatics Institute (EBI) and with GenBank at
the NCBI on a daily basis, so the three databases contain essentially the same data at any given time. Hence DDBJ
is one of the three major collaborators in the international nucleotide sequence database collaboration
project.
It started within the Centre for Information Biology at NIG; the centre was then reorganized as the
Centre for Information Biology and DNA Data Bank of Japan (CIB/DDBJ) in 2001. It is the sole DNA data bank
in Japan. DDBJ accepts nucleotide sequence data and issues accession numbers to the authors.
It also provides tools for retrieval and analysis of the genetic information in the database. DDBJ can
be reached at: www.ddbj.nig.ac.jp
DDBJ began databank activities in 1986 at the NIG, and it describes itself as
the only nucleotide sequence databank in Asia.
Although DDBJ mainly receives its data from Japanese researchers, it can accept data from
contributors in any other country. DDBJ is primarily funded by the Japanese Ministry of
Education, Culture, Sports, Science and Technology (MEXT). The database can be searched through web-based
tools such as FastA and BLAST.
DDBJ has an International Advisory Committee which consists of 9 members, three each
from Europe, Japan and the US. This committee advises DDBJ about its maintenance, management and future
plans once a year.
Apart from this, DDBJ also has an International Collaborative Committee, which consists of working-level
participants and advises on various technical issues related to international collaboration. The
international bodies hold the International Advisory Meeting (IAM) and the International Collaborative Meeting (ICM) at
regular intervals to exchange views and update techniques.

Example Entry #4: DDBJ - HUMCKRASA

LOCUS       HUMCKRASA     450 bp ss-mRNA            PRI       15-SEP-1990
DEFINITION  Human PR310 c-K-ras protein mRNA, 5' end.
ACCESSION   M35504
KEYWORDS    c-K-ras protein; c-myc oncogene.
SOURCE      Human (patient PR310) lung carcinoma, cDNA to mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Mammalia;
            Theria; Eutheria; Primates; Haplorhini; Catarrhini; Hominidae.
REFERENCE   1  (bases 1 to 450)
  AUTHORS   Yamamoto,F., Nakano,H., Neville,C. and Perucho,M.
  TITLE     Structure and mechanisms of activation of c-K-ras oncogenes in
            human lung cancer
  JOURNAL   Prog. Med. Virol. 32, 101-114 (1985)
  STANDARD  simple staff entry
FEATURES             Location/Qualifiers
     CDS             1..>450
                     /note="PR310 c-K-ras oncogene"
                     /codon_start=1
BASE COUNT      155 a     71 c    106 g    118 t
ORIGIN
1 atgactgaat ataaacttgt ggtagttgga gctggtggcg taggcaagag tgccttgacg
61 atacagctaa ttgacaatca ttttgtggac gaatatgatc caacaataga ggattcctac
121 aggaagcaag tagtaattga tggagaaacc tgtctcttgg atattctcga cacagcaggt
181 catgaggagt acagtgcaat gagggaccag tacatgagga ctggggaggg ctttctttgt
241 gtatttgcca taaataatac taaatcattt gaagatattc accattatag agaacaaatt
301 aaaagagtta aggactctga agatgtacct atggtcctag taggaaataa atgtgatttg
361 ccttctagaa cagtagacac aaaacaggct caggacttag caagaagtta tggaattcct
421 tttattcaaa catcagcaaa gacaagacag
//
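
A flat-file entry of this kind is simple to process with a short program. The following Python lines are a minimal sketch, not DDBJ's own software: they read the lines between ORIGIN and the // terminator, drop the position numbers and spaces, and count the bases. The function name parse_origin and the shortened record (only the first 60 bases of the entry above) are assumptions made for illustration.

```python
from collections import Counter

# Shortened, illustrative record: only the first sequence line of the
# HUMCKRASA entry is included to keep the example small.
RECORD = """\
LOCUS       HUMCKRASA
ORIGIN
        1 atgactgaat ataaacttgt ggtagttgga gctggtggcg taggcaagag tgccttgacg
//
"""


def parse_origin(flatfile_text):
    """Return the sequence found between the ORIGIN line and the // terminator."""
    in_sequence = False
    chunks = []
    for line in flatfile_text.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True
            continue
        if line.startswith("//"):
            break
        if in_sequence:
            # Drop the leading position number, keep the 10-base blocks.
            chunks.extend(line.split()[1:])
    return "".join(chunks)


sequence = parse_origin(RECORD)
base_count = Counter(sequence)
print(len(sequence), dict(base_count))
```

Applied to the complete entry, the same function would be expected to return a sequence whose length agrees with the 450 bp given on the LOCUS line, which is a quick consistency check on a downloaded record.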
DEFINE STRUCTURAL DATABASES WITH EXAMPLE (2 marks)
Structural databases are databases of macromolecular structures. Examples are PDB, CATH and
SCOP; CATH and SCOP classify protein structures.
 PDB is the main primary database for 3D structures of biological macromolecules (determined by
X-ray crystallography and NMR). PDB entries contain the atomic coordinates and structural
parameters that are either attached to the atoms (B-factors, occupancies) or computed from the
structures (secondary structure). PDB provides a primary archive of all 3D structures for
macromolecules such as proteins, RNA, DNA and various complexes. (Refer printout notes page
no. 9 & 10)
 CATH and SCOP are the two major databases that classify proteins by structure in order to identify
structural and evolutionary relationships.
 The CATH (Class, Architecture, Topology, Homologous superfamily) database is a hierarchical
classification of protein domain structures, which clusters proteins at four major structural levels.
 The SCOP (Structural Classification of Proteins) database was started with the objective of classifying
protein 3D structures in a hierarchical scheme of structural classes. It is curated: all protein
structures in the PDB are classified, and it is updated as new structures are deposited in the PDB.
It is a typical secondary database: it is based on data in a primary database (PDB), but adds
information through analysis and/or organisation, in this case the classification of protein 3D
structures into a hierarchical scheme of folds, superfamilies and families.
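
The atomic coordinates, occupancies and B-factors mentioned for PDB entries are stored in fixed-width ATOM records. The Python sketch below shows how one such line can be read by column position, using the column layout of the legacy PDB text format. The sample atom and all of its numeric values are invented for illustration, and parse_atom_line is a hypothetical helper, not part of any PDB software.

```python
# Build an illustrative ATOM line field by field so the column widths are explicit.
SAMPLE_LINE = (
    "ATOM  "            # columns 1-6: record name
    + f"{1:>5}"         # columns 7-11: atom serial number
    + " "
    + f"{' N':<4}"      # columns 13-16: atom name
    + " "               # column 17: alternate location indicator
    + "MET"             # columns 18-20: residue name
    + " "
    + "A"               # column 22: chain identifier
    + f"{1:>4}"         # columns 23-26: residue sequence number
    + " "               # column 27: insertion code
    + "   "
    + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"   # columns 31-54: x, y, z
    + f"{1.00:6.2f}{9.67:6.2f}"                   # occupancy, B-factor
)


def parse_atom_line(line):
    """Slice one ATOM record into named fields (0-based Python slices)."""
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


atom = parse_atom_line(SAMPLE_LINE)
print(atom["atom_name"], atom["residue"], atom["x"], atom["y"], atom["z"])
```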

INTRODUCTION TO STRUCTURE CLASSIFICATION


 Proteins that look the same in terms of shape and topology are classified as more closely related than
proteins that look different. This is based on visual observation and comparison with familiar objects.
 The classifications can be drawn as trees with many branches at each branch point, like
phylogenetic trees.
 Many proteins share structural similarities that reflect common evolutionary origins. The
evolutionary process involves substitutions, insertions and deletions in the amino acid sequence.

WHY SHOULD WE CLASSIFY PROTEIN STRUCTURES? (2 marks)


Classification groups together proteins with similar structures and common evolutionary origins.

NATURE OF INFORMATION PRESENTED


The nature of the information presented by a structure classification scheme is entirely dependent on
(i) the underlying philosophy of the approach, and
(ii) the methods used to identify and evaluate structural similarity.

CLASSIFICATION FOR INFORMATION



The two most important classification schemes are


(i) CATH
(ii) SCOP
These are explained below, together with PDBsum, a web-based compendium of structural information
maintained at UCL (University College London).

GIVE A BRIEF ACCOUNT ON CATH: (10 marks)


CATH: CATH is a protein structure classification database, maintained at UCL, in which protein structures
are classified hierarchically by domain. It is similar to SCOP in concept, but it divides up the
Protein Data Bank (PDB) a little differently. In CATH, proteins are hierarchically clustered at four major
levels:
 Class (C),
 Architecture (A),
 Topology (T) and
 Homologous super family (H).
 CATH is an excellent resource for examining the variety of known protein structures.
 CATH can be searched by PDB code, and proteins can be displayed within the browser page.
 Different categories within the classification are identified by both unique numbers
(by analogy with the Enzyme Commission, or E.C., system for enzymes) and descriptive names.
The word analogy also has a structural sense: analogous proteins have the same fold, but
other evidence for common ancestry is weak.
 CATH is accessible for keyword interrogation through UCL's Biomolecular Structure and
Modelling Unit Web Server.
CLASS: The classes are assigned automatically based on gross secondary structure content. Four classes of
domains are recognised: (i) mainly α, (ii) mainly β, (iii) α-β, which includes both alternating α/β and α+β
structures, and (iv) those with low secondary structure content.
ARCHITECTURE describes the gross arrangement of secondary structures, ignoring their connectivity.
Currently it is assigned manually using simple descriptions of the secondary structure arrangements (barrel,
roll, sandwich etc.).
TOPOLOGY OR FOLD: This level gathers together proteins with the same overall fold or
topology. Proteins in the same fold or topology class contain more or less the same SSEs (Secondary
Structure Elements), connected in the same way and in similar relative spatial positions.
It is achieved by means of structure comparison algorithms that use empirically derived parameters
to cluster the domains.
Examples are given in fig 1 and 2, which are part of the TIM barrel fold level. Within this level,
all proteins have the well-known TIM barrel structure: a parallel eight-stranded β barrel surrounded by α
helices.
When proteins have the same fold or topology, they can usually be structurally aligned to give large
sections of superimposed backbone structure with low Root Mean Square Deviation (RMSD) values.
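
The RMSD between two sets of N equivalent atoms is the square root of the mean squared distance between paired atoms, RMSD = sqrt((1/N) Σ |a_i − b_i|²). The Python sketch below computes this value for two coordinate lists that are assumed to be already superimposed; it does not perform the optimal superposition step that structure comparison programs carry out first. The function name rmsd and the coordinates are illustrative assumptions.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Invented backbone positions; the second set is displaced by 0.5 along y.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
structure_b = [(0.0, 0.5, 0.0), (1.0, 0.5, 0.0), (2.0, 0.5, 0.0)]
print(rmsd(structure_a, structure_b))
```

Identical structures give an RMSD of zero, and the value grows as the paired atoms move apart, which is why low RMSD over a large aligned section indicates a shared fold.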
HOMOLOGY: Homologs are related by divergent evolution from a common ancestor and
have the same fold.
SUPER FOLD: Super-folds are protein folds that seem likely to have arisen more than once in evolution.
They are thought to have advantageous physicochemical properties.
They appear in SCOP and CATH as fold or topology levels containing several homologous superfamilies.
Proteins are classified first into hierarchical levels by class, similar to the SCOP classification except
that α/β and α+β proteins are considered to be in one class. Instead of a fourth class for α+β proteins, the
fourth class of CATH comprises proteins with few secondary structures.
Following class, proteins are classified by architecture, fold, superfamily and family. Similar
structures are found by the program SSAP. CATH can be reached at
http://www.biocom.ac.uk/bsm/cath_new/
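
The unique numbers used by CATH are dotted codes with one number per level, in the order Class.Architecture.Topology.Homologous superfamily. The Python sketch below splits such a code into its four levels. The class names follow the four classes listed above; the code 3.20.20.70 is used only as an illustrative member of the TIM barrel topology (3.20.20), and parse_cath_code is a hypothetical helper, not part of the CATH software.

```python
# Class numbers follow the order of the four CATH classes described above.
CATH_CLASSES = {
    1: "Mainly alpha",
    2: "Mainly beta",
    3: "Alpha-beta",
    4: "Few secondary structures",
}


def parse_cath_code(code):
    """Split a four-part CATH code into its hierarchical levels."""
    parts = [int(part) for part in code.split(".")]
    if len(parts) != 4:
        raise ValueError("expected a code of the form C.A.T.H")
    c, a, t, h = parts
    return {
        "class": str(c),
        "class_name": CATH_CLASSES.get(c, "unknown"),
        "architecture": f"{c}.{a}",
        "topology": f"{c}.{a}.{t}",
        "homologous_superfamily": f"{c}.{a}.{t}.{h}",
    }


levels = parse_cath_code("3.20.20.70")
print(levels["class_name"], levels["architecture"], levels["topology"])
```

Two domains that share the first three numbers have the same fold; sharing all four places them in the same homologous superfamily.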

GIVE AN ACCOUNT ON SCOP DATABASE (structural classification of proteins) (10 marks)


 The SCOP (Structural Classification of Proteins) database describes the structural and evolutionary
relationships between proteins of known structure.

 It is maintained at the MRC Laboratory of Molecular Biology and Centre for Protein Engineering
at Cambridge, United Kingdom, and can be reached at http://scop.mrc-lmb.cam.ac.uk/scop/.
 The SCOP database, created by manual inspection abetted by a battery of automated
methods, aims to provide a detailed and comprehensive description of the structural and evolutionary
relationships between all proteins whose structure is known, including all entries in the Protein
Data Bank.
 It provides a broad survey of all known protein folds, detailed information about the close relatives of
any particular protein, and a framework for future research and classification.
 Following classification by class, SCOP classifies protein structures at a number of
hierarchical levels that reflect both evolutionary and structural relationships, namely family,
superfamily and fold.
 It provides links to structure files that can be opened with RasMol or the Chime plug-in, and links back
to the PDB to download structures.
 At the top level, known proteins are grouped by their secondary structure characteristics into
all-alpha, all-beta, coiled coil, small proteins with structured metal ions, and various types of mixed
alpha-beta structures. These major types are called classes within SCOP.
 The next layer of classification, the fold level, is a mixture of topology and similarity in domains of
known function. For example, one fold is called ‘Globin-like’ and another ‘four-helical up-and-down
bundle’.
 In SCOP, proteins are classified in a hierarchical fashion to reflect their structural and evolutionary
relatedness. Below the class, the principal levels in the hierarchy are (i) fold, (ii) superfamily and (iii)
family.
 SCOP is accessible for keyword interrogation through the MRC laboratory web server.
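
The SCOP hierarchy of class, fold, superfamily and family can be pictured as a nested structure. The Python sketch below holds a tiny hand-made excerpt of such a hierarchy in nested dictionaries and looks up the full lineage of a family. The excerpt, which contains only the Globin-like fold mentioned above, and the helper names are assumptions for illustration; they do not reproduce the SCOP data files or any SCOP software.

```python
# Toy excerpt: class -> fold -> superfamily -> list of families.
SCOP_EXCERPT = {
    "All alpha proteins": {
        "Globin-like": {
            "Globin-like": ["Globins"],
        },
    },
}


def iter_lineages(tree):
    """Yield (class, fold, superfamily, family) tuples for every family in the tree."""
    for scop_class, folds in tree.items():
        for fold, superfamilies in folds.items():
            for superfamily, families in superfamilies.items():
                for family in families:
                    yield (scop_class, fold, superfamily, family)


def find_family(tree, family_name):
    """Return the lineage of the named family, or None if it is not present."""
    for lineage in iter_lineages(tree):
        if lineage[3] == family_name:
            return lineage
    return None


print(" > ".join(find_family(SCOP_EXCERPT, "Globins")))
```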

SPECIALIZED DATA BASES: (10 marks)


Databases containing information that is public in nature commonly appear on the World Wide
Web. These databases, many of which are maintained by government agencies and nonprofit organizations,
can quickly provide us with information.
Specialized databases are indexes that can be searched, much like the search engines. The main
difference is that specialized databases are collections on particular subjects, such as medical journal article
abstracts and citations, company financial data, United States Supreme Court decisions, census data, patents
and so forth.
There are a number of specialized databases available on the net, such as those for Expressed
Sequence Tags (ESTs), Genome Survey Sequences (GSSs), Single Nucleotide Polymorphisms (SNPs),
Sequence Tagged Sites (STSs), RNA databases etc.
Exclusive genome databases are also available for some organisms, such as Escherichia coli,
Saccharomyces cerevisiae (baker's yeast), Arabidopsis thaliana, Oryza sativa (rice), Drosophila melanogaster
(fruit fly), Mus musculus (mouse) and Homo sapiens.
Specialized protein family databases, protein classification databases, structure databases, pathway
databases and microarray databases are also available.

Specialized databases
 Antibody Central: antibody information database and search resource
 BIOMOVIE (ETH Zurich): movies related to biology and biotechnology
 CGAP: cancer genes (National Cancer Institute)
 Clone Registry: clone collections (National Center for Biotechnology Information)
 Connectivity Map: transcriptional expression data and correlation tools for drugs
 CTD: the Comparative Toxicogenomics Database, which describes chemical-gene-disease interactions
 DBGET: H. sapiens (Univ. of Kyoto)
 DiProDB: a database to collect and analyse thermodynamic, structural and other dinucleotide
properties
 Dryad: a repository of data underlying scientific publications in evolution, ecology and related fields
 Edinburgh Mouse Atlas
 GreenPhylDB: a phylogenomic database for plant comparative genomics
 GyDB: the Gypsy Database of Mobile Genetic Elements (Universitat de València)
 Genome Database for Rosaceae: international genomics and genetics database for rosaceous
crops
 GDB: Human Genome Database (Human Genome Organization)
 HGMD: disease-causing mutations (Human Gene Mutation Database)
 HUGO (Official Human Genome Database: HUGO Gene Nomenclature Committee)
 HvrBase++: human and primate mitochondrial DNA
 INTERFEROME: the database of interferon-regulated genes
 List of SNP databases
 NCBI-UniGene (National Center for Biotechnology Information)
 OMIM: inherited diseases (Online Mendelian Inheritance in Man)
 p53: the p53 Knowledgebase
 PhenCode: linking human mutations with phenotype
 Plasma Proteome Database: human plasma proteins along with their isoforms
 PolygenicPathways: genes and risk factors implicated in Alzheimer's disease, bipolar disorder or
schizophrenia
 SHMPD: the Singapore Human Mutation and Polymorphism Database

GENOME DATABASES (10 marks)


 The Genome Database (GDB, http://www.gdb.org) is a public repository of data on human genes,
clones, STSs, polymorphisms and maps.
 GDB entries are highly cross-linked to each other, to literature citations and to entries in other
databases, including the sequence databases, OMIM, and the Mouse Genome Database.
 The database can be searched by a variety of methods, ranging from keyword searches to complex
queries.
 Major functionality extensions in the last year include the ongoing computation of integrated human
genome maps called Comprehensive Maps, and the use of those maps to support positional queries
and graphic displays.
 Genome databases contain genetic and sequence information that can be queried to retrieve the
required information.
 The genome databases provide views for a variety of genomes, complete chromosomes, sequence
maps with contigs, and integrated genetic and physical maps.
 The Genome Database (GDB) is the major human database including both molecular and mapping
data.
 The database is organized in six major organism groups: Archaea, Bacteria, Eukaryotes, Viruses,
Viroids and Plasmids, and includes complete chromosomes, organelles and plasmids as well as draft
genome assemblies.

GENOME DATABASES OF MODEL ORGANISMS AND OTHER GENOME DATABASES


S.NO  DATABASE                                  WEBSITE
1     Caenorhabditis elegans (worm) database    http://www.wormbase.org/
2     C. elegans chromosome                     ftp://ftp.sanger.ac.uk/pub/databases/c.elegans_sequences/CHROMOSOMES
3     Drosophila melanogaster chromosomes       http://flybase.bio.indiana.edu/maps/fbgrmap.html
4     Escherichia coli genome project           http://www.genetics.wisc.edu/
5     Genome databases at NCBI                  http://web.bham.ac.uk/bcm4ght6/res.html
6     Genome list at NIH                        http://molbio.info.nih.gov/molbio/db.html
7     Mitochondrial DNA database (MitBASE)      http://www.ebi.ac.uk/Research/Mitbase/mitbase.pl
8     Organelle genome sequences                http://www-nbrf.georgetown.edu/pir/genome.html

LIST OF MOLECULAR AND GENETIC DATABASES


(I) EMBL Nucleotide Sequence Data Library
(II) DDBJ
(III) GENBANK
(IV) SWISS PROT
(V) GENOME DATABASE
(VI) ONLINE MENDELIAN INHERITANCE IN MAN (OMIM)
(VII) Organism databases, including those for the bacterium E. coli, the mouse Mus musculus and the
mustard plant Arabidopsis thaliana

ACEDB DATABASE
 One of the first genome databases, ACEDB (A Caenorhabditis elegans Database), and the methods to
access it were developed by Mike Cherry and colleagues (Cherry and Cartinhour, 1993).
 This database was accessible through the internet and allowed retrieval of sequences, information
about genes and mutants, investigators' addresses and references.
 Similar databases were subsequently developed using the same methods for Arabidopsis thaliana and
Saccharomyces cerevisiae. Presently, there are a large number of such publicly available databases.

E.COLI DATABASE
There are several databases for Escherichia coli.
The E. coli Genetic Stock Center (CGSC) maintains a database of E. coli genetic information,
including genotypes and reference information for the strains in the CGSC collection, gene names,
properties and linkage maps, gene product information and information on specific mutations.
The E. coli Database Collection (ECDC) is another example. The Encyclopedia of E. coli Genes and
Metabolism (EcoCyc) is a database of E. coli genes and metabolic pathways.

MIPS YEAST DATABASE


The MIPS Yeast Database is an important source of information on the yeast genome and its products. The
Saccharomyces Genome Database (SGD) is another major yeast database.
 Two of the best curated genetic databases are FlyBase, the database for Drosophila melanogaster,
and the Mouse Genome Database (MGD). ZFIN, a database for another important model organism,
the zebrafish, Brachydanio rerio, has been implemented recently.
 Two major databases for human genes and genomics exist: GDB and McKusick's Mendelian
Inheritance in Man (MIM), a catalogue of human genes and genetic disorders that is available
online as Online Mendelian Inheritance in Man (OMIM) from the NCBI.
 Both OMIM and GDB include information on genetic variation in humans, but there are also the
human mutation server at the EBI and the SRS interface to many human mutation databases.

MITOMAP: A Human Mitochondrial genome database


 MITOMAP (http://www.MITOMAP.org), a database for the human mitochondrial genome, has grown
rapidly as interest in the role of mitochondrial DNA (mtDNA) variation in human origins, forensics,
degenerative diseases, cancer and aging has increased.
 MITOMAP is a comprehensive database of human mitochondrial DNA (mtDNA) variation and its
relationship with human evolution and disease.
 The mtDNA is a closed circular molecule of 16569 nucleotides.
 The mtDNA carries 37 genes: the 12S and 16S rRNA genes, 22 tRNA genes and 13 genes encoding
essential oxidative phosphorylation (OXPHOS) polypeptides.
 In addition, the mtDNA contains a control region of about 1000 nucleotides that encompasses transcription
and replication regulatory elements.

MOUSE GENOME DATABASE (MGD)


 The Mouse Genome Database (MGD) forms the core of the Mouse Genome Informatics (MGI) system
(http://www.informatics.jax.org), a model organism database resource for the laboratory mouse.
 MGD provides essential integration of experimental knowledge for the mouse system with information
annotated from both literature and online sources.
 MGD curates and presents consensus and experimental data representations from genotype (sequence) through
to phenotype, including highly detailed reports about genes, sequences and phenotypes.

THE INSTITUTE FOR GENOMIC RESEARCH (TIGR)


The TIGR Databases (http://www.tigr.org/tdb/) are a collection of curated databases containing
DNA and protein sequence, gene expression, cellular role, protein family and taxonomic data for microbes,
plants and humans.

OTHER SPECIALIZED DATABASES:


 KEGG: It is a public pathway database. The Pathway Database is offered by the Kyoto
Encyclopedia of Genes and Genomes (KEGG).
 The KEGG database includes a gene database with annotations of orthologs between various species
and a pathway database with hundreds of metabolic (and other) pathway maps.
 Kyoto Encyclopedia of Genes and Genomes (KEGG) is part of the research projects of the Kanehisa
Laboratories in the Bioinformatics Center of Kyoto University and the Human Genome Center of the
University of Tokyo.
 These resources help to find and characterize building blocks such as genes, proteins and chemical
substances.
 Since 1995 the KEGG project has been developing knowledge-based methods for uncovering higher-order
systemic behaviors of the cell and the organism from genomic information.

 The Kyoto Encyclopedia of Genes and Genomes is developed for basic research and practical
applications.
 KEGG is a database of biological systems that integrates genomic, chemical and systemic functional
information.
 KEGG provides a reference knowledge base for linking genomes to life through the process of
PATHWAY mapping.
KEGG organizes five types of data into a comprehensive system:
(i) Catalogues of chemical compounds in living cells
(ii) Gene catalogues
(iii) Genome maps
(iv) Pathway maps
(v) Orthologue tables
The catalogues of chemical compounds and genes contain information about particular molecules or
sequences.
Genome maps integrate the genes themselves according to their appearance on chromosomes.
Pathway maps describe potential networks of molecular activities, both metabolic and regulatory.
Orthologue tables link an enzyme in one organism to the related enzymes in other organisms.

The three graph objects in KEGG

KEGG at http://www.genome.ad.jp/kegg/ is the reference knowledge base that integrates current
knowledge on molecular interaction networks such as pathways and complexes (PATHWAY database),
information about genes and proteins generated by genome projects (GENES/SSDB/KO databases) and
information about biochemical compounds and reactions (COMPOUND/GLYCAN/REACTION databases).

ONLINE MENDELIAN INHERITANCE IN MAN (OMIM):


This is a database of genetic diseases with references to molecular medicine, cell biology, biochemistry and
clinical details of the diseases.
 For Mendelian disorders, we turn to OMIM, a comprehensive database of
human genes and genetic disorders.
 The OMIM database contains bibliographic entries for over 12,000 human diseases and relevant
genes.
 The focus of OMIM is inherited genetic diseases. As indicated by its name, the OMIM database is
concerned with Mendelian genetics. These are inherited traits that are transmitted between
generations.
 Relatively little is known about genetic mutations in complex disorders, and the database does
not include chromosomal disorders. Thus its focus is a comprehensive survey of single-gene
disorders.
 The OMIM Entrez database contains about 18 000 entries, including data on over 12 000 established
gene loci and phenotypic descriptions.
 These records link many important resources, such as locus-specific databases and GeneTests
(www.genetests.org).
 Online Mendelian Inheritance in Man (OMIM) is a nonsequence-based information resource that can
be of tremendous use to genomics researchers, physicians, and patients.
 OMIM is the electronic version of the catalogue of human genes and genetic disorders founded and
developed by Victor McKusick and colleagues at Johns Hopkins University (McKusick, 1998;
Hamosh et al., 2002).
 It provides concise textual information from the literature on most human conditions having a
genetic basis, as well as pictures illustrating the condition or disorder (where appropriate) and full
citation information.
 Since the online version of OMIM is housed at NCBI, links to Entrez are provided from all cited
references within each OMIM entry.

EXPRESSED SEQUENCE TAGS (ESTs) (10 marks)


 A more efficient method for gene identification in eukaryotic genomes is sequencing of Expressed Sequence
Tags (ESTs). ESTs are partial sequences of cDNA, reverse transcribed from mRNA, and represent a direct
supply of coding, intron-free sequences of genes.
 An expressed sequence tag or EST is a short sub-sequence of a transcribed cDNA sequence.

 They may be used to identify gene transcripts, and are instrumental in gene discovery and gene
sequence determination.
 The identification of ESTs has proceeded rapidly, with approximately 65.9 million ESTs now
available in public databases (e.g. GenBank 18/6/2010, all species).
 Currently, GenBank divides ESTs into three major categories: human, mouse and other.
 An EST is produced by one-shot sequencing of a cloned mRNA (i.e. sequencing several hundred
base pairs from an end of a cDNA clone taken from a cDNA library).
 The resulting sequence is a relatively low quality fragment whose length is limited by current
technology to approximately 500 to 800 nucleotides.
 Because these clones consist of DNA that is complementary to mRNA, the ESTs represent portions
of expressed genes.
 They may be present in the database as either cDNA/mRNA sequence or as the reverse complement
of the mRNA, the template strand.
 ESTs can be mapped to specific chromosome locations using physical mapping techniques, such as
radiation hybrid mapping, Happy mapping, or FISH.
 Alternatively, if the genome of the organism that originated the EST has been sequenced, one can
align the EST sequence to that genome using a computer.
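
Because an EST may be stored either in the mRNA sense or as its reverse complement, any comparison with a transcript has to try both orientations. The Python sketch below shows the idea with exact substring matching; real alignment tools tolerate mismatches and gaps, which this example does not attempt. The function names are hypothetical, and the transcript is simply the first 20 bases of the HUMCKRASA entry shown earlier, written in upper case.

```python
# Translation table that maps each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def strand_of(est, transcript):
    """Report whether the EST matches the transcript directly, as its reverse complement, or not at all."""
    if est in transcript:
        return "forward"
    if reverse_complement(est) in transcript:
        return "reverse"
    return None


TRANSCRIPT = "ATGACTGAATATAAACTTGT"
print(strand_of("GAATATAAAC", TRANSCRIPT))
print(strand_of("GTTTATATTC", TRANSCRIPT))
```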

EST contigs
 Because of the way ESTs are sequenced, many distinct expressed sequence tags are often partial
sequences that correspond to the same mRNA of an organism.
 In an effort to reduce the number of expressed sequence tags for downstream gene discovery
analyses, several groups assembled expressed sequence tags into EST contigs.
 Examples of resources that provide EST contigs include:
(i) TIGR gene indices
(ii) UniGene (refer printout page no. 35)
 Constructing EST contigs is not trivial and may yield artifacts (contigs that contain two distinct gene
products).
 When the complete genome sequence of an organism is available and transcripts are annotated, it is
possible to bypass contig assembly and directly match transcripts with ESTs.
 This approach is used in the TissueInfo system and makes it easy to link annotations in
the genomic database to tissue information provided by EST data.
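
The basic step in building an EST contig is to find where the end of one EST overlaps the start of another and to join the two. The Python sketch below does this for exact overlaps only; it is a simplified illustration, not the method used by the TIGR gene indices or UniGene, and real assemblers must also cope with sequencing errors and both strand orientations. The function names, the minimum overlap of three bases and the two short ESTs (cut from the start of the HUMCKRASA sequence) are assumptions.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_pair(left, right, min_overlap=3):
    """Join two ESTs into one contig, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


est_1 = "ATGACTGAATATAAAC"
est_2 = "ATAAACTTGTGGTAGT"
print(overlap_length(est_1, est_2))
print(merge_pair(est_1, est_2))
```

A short minimum overlap makes chance matches more likely, which is one way the artifacts mentioned above (contigs that mix two distinct gene products) can arise.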

What Are ESTs and How Are They Made? (2 marks)


 ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by
sequencing either one or both ends of an expressed gene.
 The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from
different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching
base pairs.
 The challenge associated with identifying genes from genomic sequences varies among organisms and is
dependent upon genome size as well as the presence or absence of introns, the intervening DNA sequences
interrupting the protein coding sequence of a gene.
 ESTs are usually short (300 – 500 base pairs) single reads from cDNA derived from mRNA, and are usually
produced in large numbers.
 They represent a snapshot of what is expressed in a given tissue and/or at a given developmental stage. They
represent tags of expression for a given cDNA library.
 These records are usually very poorly annotated and carry only library and biosource information.
 They are represented in a variety of databases, notably DDBJ/EMBL/GenBank, dbEST and UniGene.

Expressed sequence tags and encoded human proteins


 The sequences of 5′ ESTs derived from mRNAs encoding secreted proteins are disclosed.
 The 5′ ESTs may be used to obtain cDNAs and genomic DNAs corresponding to the 5′ ESTs.
 The 5′ ESTs may also be used in diagnostic, forensic, gene therapy, and chromosome
mapping procedures.
 Upstream regulatory sequences may also be obtained using the 5′ ESTs. The 5′ ESTs may
also be used to design expression vectors and secretion vectors.

Top Ten Organisms for which ESTs have been sequenced (dbEST release 050903, May 2003)

S.NO  ORGANISM                     COMMON NAME   NUMBER OF ESTs
1     Homo sapiens                 Human         5,142,390
2     Mus musculus + domesticus    Mouse         3,721,428
3     Rattus sp.                   Rat           525,556
4     Ciona intestinalis           Sea squirt    492,488
5     Gallus gallus                Chicken       418,093
6     Triticum aestivum            Wheat         340,945
7     Hordeum vulgare + subsp.     Barley        340,945
8     Bos taurus                   Cattle        319,775
9     Danio rerio                  Zebrafish     311,335
10    Glycine max                  Soybean       308,582

UNIGENE
The goal of the UniGene (unique gene) project is to create one unique entry for each gene and to collect all the ESTs
associated with that gene. For example, in the case of RBP4, there is only one UniGene entry.
