Bioinformatics - The Genomic Revolution
The science of Bioinformatics, or computational biology, is increasingly being used to improve the
quality of life as we know it. Bioinformatics has developed out of the need to understand the code of
life, DNA. Massive DNA sequencing projects have evolved and aided the growth of the science of
informatics. DNA, the basic molecule of life, directly controls the fundamental biology of life. It
codes for genes, which code for proteins, which determine the biological makeup of humans or any
other living organism. It is variations and errors in the genomic DNA that ultimately define the
likelihood of developing diseases or resistance to these same disorders.
The ultimate goal of Bioinformatics is to uncover the wealth of biological information hidden in the
mass of sequence data, to obtain a clearer insight into the fundamental biology of organisms, and to
use this information to enhance the standard of life for mankind.
It is being used now, and will be in the foreseeable future, in molecular medicine to help produce
better and more customized medicines to prevent or cure diseases. It has environmental benefits in
identifying waste cleanup bacteria, and in agriculture it can be used to produce high-yield,
low-maintenance crops. These are just a few of the many benefits Bioinformatics will help develop.
The genomic era
The genomic era has seen a massive explosion in the amount of biological information available due to
huge advances in the fields of molecular biology and Genomics.
Bioinformatics is the application of computer technology to the management and analysis of biological
data. The result is that computers are being used to gather, store, analyze and merge biological data.
Bioinformatics is an interdisciplinary research area that is the interface between the biological and
computational sciences. This new knowledge could have profound impacts on fields as varied as human
health, agriculture, the environment, energy and biotechnology.
The greatest challenge facing the molecular biology community today is to make sense of the wealth
of data that has been produced by the genome sequencing projects. Traditionally, molecular biology
research was carried out entirely at the experimental laboratory bench but the huge increase in the
scale of data being produced in this genomic era has seen a need to incorporate computers into this
research process. Sequence generation, and its subsequent storage, interpretation and analysis, are
entirely computer-dependent tasks. However, the molecular biology of an organism is a very complex
issue, with research being carried out at different levels including the genome, proteome, transcriptome
and metabolome levels. Following on from the explosion in the volume of genomic data, similar increases
in data have been observed in the fields of proteomics, transcriptomics and metabolomics.
Recent developments in sequencing technology have produced means of reading genes (DNA). The
code to describe even the smallest of organisms would fill many books. But scientists are very ambitious
people. They have started to decode "themselves" in the Human Genome Project. Large computer
databases accessible to researchers store this vast quantity of information. There are a lot of different
databases where DNA and protein sequence information is stored.
[...] 4000 hereditary diseases, including cystic fibrosis and Huntington's disease) or a result of a
response to an environmental stress which causes alterations in the genome (e.g. cancers, heart
disease, diabetes).
The completion of the human genome means that we can search for the genes directly associated with
different diseases and begin to understand the molecular basis of these diseases more clearly. This new
knowledge of the molecular mechanisms of disease will enable better treatments, cures and even
preventative tests to be developed.
1.1 More drug targets
At present all drugs on the market target only about 500 proteins. With an improved understanding of
disease mechanisms and using computational tools to identify and validate new drug targets, more
specific medicines that act on the cause, not merely the symptoms, of the disease can be developed.
These highly specific drugs promise to have fewer side effects than many of today's medicines.
1.2 Personalized medicine
Clinical medicine will become more personalized with the development of the field of
pharmacogenomics. This is the study of how an individual's genetic inheritance affects the body's
response to drugs. At present, some drugs fail to make it to the market because a small percentage of
the clinical patient population shows adverse effects to a drug due to sequence variants in their DNA.
As a result, potentially life-saving drugs never make it to the marketplace. Today, doctors have to use
trial and error to find the best drug to treat a particular patient, as those with the same clinical symptoms
can show a wide range of responses to the same treatment. In the future, doctors will be able to analyze
a patient's genetic profile and prescribe the best available drug therapy and dosage from the beginning.
1.3 Preventative medicine
With the specific details of the genetic mechanisms of diseases being unraveled, the development of
diagnostic tests to measure a person's susceptibility to different diseases may become a distinct reality.
Preventative actions such as change of lifestyle or having treatment at the earliest possible stages when
they are more likely to be successful, could result in huge advances in our struggle to conquer disease.
1.4 Gene therapy
In the not too distant future, the potential for using genes themselves to treat disease may become a
reality. Gene therapy is the approach used to treat, cure or even prevent disease by changing the
expression of a person's genes. Currently, this field is in its infancy, with clinical trials for many
different types of cancer and other diseases ongoing.
Microbial genome applications
Microorganisms are ubiquitous, that is, they are found everywhere. They have been found surviving [...]
[...] made of a variety of microbial properties in the baking, [...]
[...] complete genome sequences and their potential [...] Department of Energy (DOE) initiated the MGP
(Microbial Genome Project) to sequence genomes [...]
useful in energy production, environmental cleanup, industrial processing and toxic waste [...]
By studying the genetic material of these organisms, scientists can begin to understand [...]
at a very fundamental level and isolate the genes that give them their unique abilities to survive [...]
extreme conditions.
2.1 Waste cleanup
Deinococcus radiodurans is known as the world's toughest bacterium and it is the most radiation-resistant
organism known. Scientists are interested in this organism because of its potential usefulness in cleaning
up waste sites that contain radiation and toxic chemicals.
Microbial Genome Program (MGP) scientists are determining the DNA sequence of the genome of
C. crescentus, one of the organisms responsible for sewage treatment.
2.2 Climate change
Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil fuels for
energy, are thought to contribute to global climate change. Recently, the DOE (Department of Energy,
USA) launched a program to decrease atmospheric carbon dioxide levels. One method of doing so is
to study the genomes of microbes that use carbon dioxide as their sole carbon source.
Sources:
NCBI Education Page and NCBI Science Primer: Bioinformatics
EBI Education Page
Bibliographic Databases
Taxonomic Databases
Nucleotide Databases
Genomic Databases
Protein Databases
Microarray Databases
Scientific literature databases have been available since the 1960s.
Services that produced abstracts of scientific literature began to make their data available in
machine-readable form in the early 1960s.
You should be aware that none of the abstracting services has complete coverage. The best known is
MEDLINE, and now PUBMED, abstracting mainly medical literature.
MEDLINE is accessible through EBI's SRS. PUBMED is accessible through NCBI's ENTREZ.
EMBASE is a commercial product for the medical literature.
BIOSIS, the inheritor of the old Biological Abstracts, covers a broad biological field; the Zoological
Record indexes the zoological literature.
CAB International maintains abstract databases in the fields of agriculture and parasitic diseases.
AGRICOLA is for the agricultural field what MEDLINE is for the medical field.
Beilstein Abstracts contains abstracts from the top journals in organic and related chemistry,
published from 1980 to the present, and is available free.
The bibliographical databases are, with the exception of MEDLINE, PUBMED and Beilstein Abstracts,
only available through commercial database vendors.
Taxonomic Databases - classification of all organisms
The Taxonomy Browser is a prominent taxonomic database which is maintained by the [...]
taxonomy is hierarchical and sequence-based, aiming to centralise the classification of all organisms
represented in the databases with at least one nucleotide or protein sequence. The Taxonomy Browser
can be used to view the taxonomic position or retrieve sequence data for a particular [...]
[...] This collaboration is a joint operation by EMBL-Bank at the European Bioinformatics Institute,
the DNA Data Bank of Japan (DDBJ) at the Center for Information Biology (CIB) and GenBank at the
National Center for Biotechnology Information (NCBI).
http://www.ddbj.nig.ac.jp/
http://www.ncbi.nlm.nih.gov/Genbank/
In Europe, the vast majority of the nucleotide sequence data produced is collected, organised and
distributed by the EMBL Nucleotide Sequence Database located at the EBI in Cambridge, an
Outstation of the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany.
The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from
the scientific community and making it freely available. The databases strive for completeness, with
the aim of recording every publicly known nucleic acid sequence. These data are heterogeneous: they
vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g.
finished versus single pass sequences), the extent of sequence annotation and the intended completeness
of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a
genome). The nucleotide databases are distributed free of charge over the internet.
DDBJ, GenBank and EMBL-Bank exchange new and updated data on a daily basis to achieve optimal
synchronisation. The result is that they contain exactly the same information, except for sequences
that have been added in the last 24 hours.
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) is divided into sections that
reflect major taxonomic divisions.
These taxonomic divisions include:
Invertebrates
Other Mammals
Mus musculus
Organelles
Bacteriophage
Plants
[...] accession number, followed by [...] a version number. The accession
number part will be stable, but the version part will be incremented when the sequence [...]
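As a small illustration of the accession/version idea, the sketch below splits an identifier into a stable accession part and a numeric version part. It assumes the two parts are joined by a dot (for example `AB000001.2`); the identifier shown and the function names are hypothetical, chosen only for this example.

```python
from typing import NamedTuple


class SequenceId(NamedTuple):
    accession: str  # stable part, never changes for an entry
    version: int    # incremented whenever the sequence changes


def parse_sequence_id(identifier: str) -> SequenceId:
    """Split an 'ACCESSION.VERSION' string into its two parts."""
    accession, sep, version = identifier.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return SequenceId(accession, int(version))


def same_entry(a: str, b: str) -> bool:
    """True if two identifiers refer to the same entry, ignoring the version."""
    return parse_sequence_id(a).accession == parse_sequence_id(b).accession
```

Comparing only the accession part answers "is this the same entry?", while comparing the version part answers "has the sequence changed since I last looked?".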
Although the nucleotide sequence data are checked for integrity and obvious errors by the data library
staff, the quality of the data is the responsibility of the submitter. As a consequence, there are many
errors in the database: many sequence entries are either mislabelled, contaminated, incompletely or
erroneously annotated, or contain sequencing errors. In addition, the database is very redundant, in the
sense that the same sequence from the same organism may be included many times, simply reflecting
the redundancy of the original scientific reports.
Other types of nucleotide sequence databases
Genomes Server - this gives access to a large number of complete genomes.
UniGene - a sequence-cluster database which addresses the redundancy problem by coalescing
sequences that are sufficiently similar that one may reasonably infer that they are derived from the
same gene.
STACK - the 'Sequence Tag Alignment and Consensus Knowledgebase' is another sequence-cluster
database which addresses the same problem as UniGene.
EMBL-SVA - the 'EMBL Sequence Version Archive' server is a repository of all entries that have
been made public since release 1 of the EMBL database. It comprises more than 100 million entries
and includes entries pre-dating the first electronic release of the database in 1982.
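The idea behind sequence-cluster databases (coalescing sequences that are similar enough to be assumed to come from the same gene) can be sketched with a simple greedy procedure. This is not how UniGene or STACK are actually built; the similarity measure (`difflib.SequenceMatcher`), the threshold of 0.9 and the function names are all assumptions made for illustration, and a real pipeline would use proper sequence alignment.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_sequences(seqs: list[str], threshold: float = 0.9) -> list[list[str]]:
    """Greedy clustering: each sequence joins the first cluster whose
    representative (its first member) is at least `threshold` similar,
    otherwise it starts a new cluster."""
    clusters: list[list[str]] = []
    for seq in seqs:
        for cluster in clusters:
            if similarity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters
```

Each resulting cluster can then be represented by a single entry, which is one way of reducing the redundancy described above.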
Several specialised sequence databases are also available. Some of these deal with particular classes of
sequence:
RDP - the 'Ribosomal Database Project' provides ribosome-related data services to the scientific
community, including online data analysis, rRNA-derived phylogenetic trees, and aligned and annotated
rRNA sequences.
HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence
data and provides various tools for analysing this data.
IMGT - the 'ImMunoGeneTics database' is a database specialising in Immunoglobulins, T cell
receptors and the Major Histocompatibility Complex (MHC) of all vertebrate species.
Other nucleotide sequence databases focus on particular features such as:
TRANSFAC - contains sequence information on transcription factors and transcription factor binding
sites.
[...] for restriction enzymes and restriction enzyme sites.
[...] of major interest to geneticists, there is a long history of conventionally published
[...] of genes or mutations. In the past few years, most of these have been made available in
electronic form and a variety of new databases have been developed. These databases vary in
the classes of data captured and how these data are stored.
Genomes Server - this server gives access to hundreds of complete genome sequences, including
those from archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids and viruses.
Proteome Analysis - the Proteome Analysis database has been set up to provide comprehensive
statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
Ensembl - this is a joint project between the EBI and the Wellcome Trust Sanger Institute that aims at
developing a system that maintains automatic annotation of large eukaryotic genomes. Ensembl presents
up-to-date sequence data and the best possible automatic annotation for metazoan genomes. Available
now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae.
FlyBase - the database for Drosophila melanogaster is one of the best-curated genetic databases.
MGD - the 'Mouse Genome Database' is one of the most comprehensively curated genetic databases.
RGD - the 'Rat Genome Database' curates and integrates rat genetic and genomic data and provides
access to this data to support research using the rat as a genetic model for the study of human disease.
SGD - the 'Saccharomyces Genome Database' is another major yeast database. The MIPS yeast
database is an important resource for information on the yeast genome and its products.
SPGP - the 'S. pombe Genome Project', based at the Sanger Institute, [...]
data on the fungus Schizosaccharomyces pombe.
AceDB - this is the database for genetic and molecular data concerning Caenorhabditis elegans.
The database management system written for AceDB by R. Durbin and J. Thierry-Mieg has proved
very popular and has been used in many other species-specific databases. AceDB is now the name of
this database management system, resulting in some confusion relative to the C. elegans database. The
entire database can be downloaded from the Sanger Institute.
HIV-SD - the 'HIV Sequence Database' collects, curates and annotates HIV and SIV sequence data
and provides various tools for analysing this data.
Other examples of genomic databases
[...] strains in the CGSC collection, gene names, properties, and linkage map, gene product [...]
information on specific mutations. The 'E. coli Database collection' (ECDC), in Giessen, [...]
[...] gene-based sequence records for E. coli. EcoCyc, the 'Encyclopedia of E. coli Genes and
Metabolism', is a database of E. coli genes and metabolic pathways.
[...] Swiss-Prot gets thoroughly analysed and annotated by biologists ensuring [...]
[...] and the quality of the database.
[...] contains data that originates from a wide variety of organisms from more than 6,000
species. Half of the entries come from about 20 organisms, which are the target of many [...]
(ranked by number of entries):
Homo sapiens
Saccharomyces cerevisiae
Escherichia coli
Mus musculus
Rattus norvegicus
Bacillus subtilis
Caenorhabditis elegans
Haemophilus influenzae
Schizosaccharomyces pombe
Methanococcus jannaschii
Bos taurus
Drosophila melanogaster
Mycobacterium tuberculosis
Gallus gallus
Arabidopsis thaliana
Salmonella typhimurium
Xenopus laevis
Synechocystis sp. (strain PCC 6803)
Sus scrofa
Oryctolagus cuniculus
More detailed statistics relating to Swiss-Prot composition can be obtained from the Swiss-Prot statistics
page. The Swiss-Prot user manual is a comprehensive description of the database and its entries.
Primary protein sequence databases - TrEMBL
There is a tremendous increase in the amount of sequence data available due to technological advances
such as sequencing machines, the use of new biochemical methods such as PCR technology, as well as
the implementation of projects to sequence complete genomes.
[...] have produced a flood of sequence information. Maintaining the high quality [...]
entry, [...] that involves the extensive use [...]
[...] in 1996, [...]
[...] truncated, pseudogenes, patented, small [...]
which Swiss-Prot are not interested in annotating.
Primary protein sequence databases - The Protein Information Resource (PIR)
PIR (Barker et al., 2001) was established in 1984 by the National Biomedical Research Foundation
(NBRF) as a successor of the original NBRF Protein Sequence Database, [...]
period by the late Margaret O. Dayhoff and published as the "Atlas of Protein Sequence and Structure"
(Dayhoff et al., 1965; Dayhoff, 1979). Since 1988 the database has been maintained by PIR-International,
a collaboration between the NBRF, the Munich Information Center for Protein Sequences (MIPS)
and the Japan International Protein Information Database (JIPID).
The database is partitioned into four sections, PIR1, PIR2, PIR3 and PIR4. Entries in PIR1 are fully
classified by superfamily assignment, fully annotated and fully merged with respect to other entries in
PIR1. The annotation content as well as the level of redundancy reduction varies in PIR2 entries. Many
entries in PIR2 are merged, classified, and annotated. Entries in PIR3 are not classified, merged or
annotated. PIR3 serves as a temporary buffer for new entries. PIR4 was created to include sequences
identified as not naturally occurring or expressed, such as known pseudogenes, unexpressed ORFs,
synthetic sequences, and non-naturally occurring fusion, crossover or frameshift mutations.
PIR also provides some degree of cross-referencing to other biomolecular databases by linking to the
EMBL/DDBJ/GenBank nucleotide sequence databases, PDB, GDB, FlyBase, OMIM, SGD, and
MGD.
Specialised protein sequence databases
There are many specialised protein sequence databases; some of them are quite small and only contain
a handful of entries, and others are wider in scope and larger in size. As this category of databases is
quite changeable, any list provided here would soon be outdated. However, a document is available
which lists information sources for molecular biologists, and is kept constantly up-to-date.
A brief description of some of the specialised protein sequence databases follows:
Proteome Analysis - The Proteome Analysis database has been set up to provide comprehensive
statistical and comparative analyses of the predicted proteomes of fully sequenced organisms.
GOA - GO or Gene Ontology is an international consortium of scientists with the editorial office
based at the EBI. The goal of the GO consortium is to produce a dynamic controlled vocabulary that
can be applied to all organisms, even while knowledge of gene and protein roles in cells is still
accumulating and changing.
The Gene Ontology Annotation (GOA) project is run in conjunction with GO and [...]
controlled vocabulary to a non-redundant set of proteins described in the Swiss-Prot, [...]
MEROPS - This database (Rawlings and Barrett, 1999) provides a catalogue and [...]
classification of peptidases (i.e. all proteolytic enzymes). An index of the peptidases [...]
Very often the sequence of an unknown protein is too distantly related to any protein of known structure
to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its
sequence of a particular cluster of residue types which is commonly known as a pattern, motif, signature,
or fingerprint.
These motifs arise because of particular requirements on the structure of specific region(s) of a protein,
which may be important, for example, for their binding properties or for their enzymatic activity. These
requirements impose very tight constraints on the evolution of those limited (in size) but important
portion(s) of a protein sequence. A signature modelling such a site must be as short as possible, should
detect all or most of the sequences it is designed to describe and should not give too many false positive
results. In other words it must exhibit both high sensitivity and high specificity. There are a few databases
available, which use different methodology and a varying degree of biological information on the
characterised protein families, domains and sites.
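The two quality measures named above can be made concrete with a short sketch. It uses the standard statistical definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the function name and the idea of passing sets of sequence identifiers are assumptions made for this example, not something prescribed by any of the databases.

```python
def evaluate_signature(hits: set[str], family: set[str], database: set[str]) -> tuple[float, float]:
    """Return (sensitivity, specificity) of a signature.

    hits     -- identifiers of database sequences the signature matched
    family   -- identifiers of sequences that truly belong to the family
    database -- identifiers of every sequence that was searched
    """
    true_pos = len(hits & family)              # family members that were found
    false_neg = len(family - hits)             # family members that were missed
    false_pos = len(hits - family)             # unrelated sequences that matched
    true_neg = len(database - hits - family)   # unrelated sequences correctly ignored
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity
```

A signature that is too short tends to pick up false positives (lower specificity), while one that is too strict misses divergent family members (lower sensitivity).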
Examples of secondary protein databases include:
PROSITE - The special value of this database is the extensive documentation on many protein
families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and
patterns formulated in such a way that, with appropriate computational tools, one can rapidly and reliably
identify to which family of proteins a new sequence belongs.
The profile structure used in PROSITE is similar to but slightly more general than the one introduced by
Gribskov and co-workers (Gribskov et al., 1987). Generalised profiles are remarkably similar to the
specific type of Hidden Markov Models (HMMs) used in Pfam.
PRINTS - A different approach to pattern recognition, termed "fingerprinting", is used by this
database. Within a sequence alignment, it is usual to find not one, but several motifs that characterise the
aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a
family signature. In a database search, there is then a greater chance of identifying a distant relative,
whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level
of residues within individual motifs, and at the level of motifs within the fingerprint as a whole, renders
fingerprinting a powerful diagnostic technique.
Pfam - Another important secondary protein database is Pfam. The methodology used by Pfam to
create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely
related to profiles, but are based on probability theory methods. These allow a direct statistical approach to
identifying and scoring matches, and also to combining information from a multiple alignment with
prior knowledge.
One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the
former allow the full extent of a domain to be identified in a sequence. They are thus particularly useful
when analysing multidomain proteins. The biggest drawback of Pfam is its lack of biological information
(annotation) of the protein families.
BLOCKS - Blocks are multiply aligned ungapped segments corresponding to the [...]
conserved regions of proteins. The blocks for the Blocks Database are made automatically [...]
highly conserved regions in groups of proteins documented in InterPro.
SBASE - This is a database of protein domain library sequences that contains annotated structural,
functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence
databases and sequence pattern collections.
Secondary protein databases
These secondary protein sequence databases have become vital tools for identifying distant relationships
in novel sequences and hence for inferring protein function. These databases have evolved by using
signature-recognition methods to address different sequence analysis problems, resulting in rather
different and independent databases. To perform a comprehensive analysis, a user therefore has to know
several important things. For example, what are the resources and where can they be found? What is the
difference between them in terms of diagnostic performance and family coverage? What do the different
search outputs mean? Is it sufficient to use just one of the databases, and if so, which one?
Diagnostically, the most commonly used secondary protein databases, PROSITE, PRINTS and Pfam,
have different areas of optimum application owing to the different strengths and weaknesses of their
underlying analysis methods such as regular expressions, profiles, HMMs and fingerprints. For example,
regular expressions are likely to be unreliable in the identification of members of highly divergent
super-families whereas profiles and HMMs excel; fingerprints perform relatively poorly in the diagnosis of
very short motifs whereas regular expressions do well; and profiles and HMMs are less likely to give
specific sub-family diagnoses whereas fingerprints excel.
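To make the regular-expression approach concrete, the sketch below translates a pattern written in PROSITE-style notation into a Python regular expression and scans a sequence with it. It covers only the common elements of the notation (`-` separators, `x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` repeats, and `<` / `>` anchors at the ends of a pattern); the function names and the test sequence are hypothetical, and this is a simplified illustration rather than the scanning software PROSITE itself provides.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as x(4) or x(2,5).
        core, _, repeat = element.partition("(")
        repeat = "{" + repeat.rstrip(")") + "}" if repeat else ""
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_matches(pattern: str, sequence: str) -> list[tuple[int, str]]:
    """Return (start position, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]
```

For example, the ATP/GTP-binding site motif A (P-loop) is written `[AG]-x(4)-G-K-[ST]` in this notation and becomes the regular expression `[AG].{4}GK[ST]`. A pattern this short matches or fails outright, which is why regular expressions do well on short, well-conserved sites but struggle with highly divergent families.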
While all of the resources share a common interest in protein sequence classification, some such as Pfam
focus on divergent domains, some such as PROSITE focus on functional sites, and others such as PRINTS
focus on families, specialising in hierarchical definitions from super-family down to sub-family levels in
order to pin-point specific functions. A number of sequence cluster databases such as ProDom are also
commonly used in sequence analysis, for example to facilitate domain identification.
Secondary protein databases - InterPro
Unfortunately, these secondary databases do not share the same formats and nomenclature as each other,
which makes the use of all of them in an automated way difficult. In response to this the Swiss-Prot and
TrEMBL group at the EBI have developed the Integrated resource of Protein domains and functional
sites, more commonly known as InterPro (Apweiler et al., 1996). This database is an integration of the
PROSITE, PRINTS, Pfam and ProDom groups' databases. InterPro will allow users access to a wider,
complementary range of site and domain recognition methods in a single package.
In the task of sequence characterisation, we need more reliable, concerted methods for identifying
protein family traits and for inheriting functional annotation. This is especially important given our
dependence on automatic methods for assigning functions to the raw sequence data issuing from genome
projects. Rationalising this process by creating a single coherent resource for diagnosis and
documentation of protein families is difficult, given entirely different database formats, different search
tools and different search outputs. InterPro is an attempt to address some of these issues. This new
resource provides an integrated view of a number of commonly used pattern databases, and provides an
intuitive interface for text- and sequence-based searches.
Flat-files submitted by each of the groups were systematically merged and dismantled. Where [...]
family [...] were amalgamated, and all method-specific annotation separated out. This process
was complicated by the relationships that can exist, both between entries in the same database and
between entries in different databases. Different types of parent-child relationship were [...]
to the differentiation into 'sub-types' and 'sub-strings'. A sub-string means that a motif or motifs are
contained within a region of sequence encoded by a wider pattern. Examples would be: a PROSITE
pattern is typically contained within a PRINTS fingerprint; or a fingerprint might be contained within a
Pfam domain. A sub-type means that one or more motifs are specific for a sub-set of sequences captured
by another more general pattern. Examples would be: a super-family fingerprint may contain several
family- and sub-family-specific fingerprints; or a generic Pfam domain may include several family
fingerprints.
Structure databases
The number of known protein structures is increasing very rapidly and these are available through the
Protein Data Bank (PDB). The Nucleic Acid Database (NDB) is the database for structural information
about nucleic acid molecules. The EBI Macromolecular Structure Database (MSD) group is the
European project for the management and distribution of data on macromolecular structures. They have
close ties with the Research Collaboratory for Structural Bioinformatics (RCSB) who, in collaboration with
MSD, maintain and administer the PDB. The aim of the MSD project is to build a relational database of
the PDB, and use it to clean-up, maintain and distribute the PDB data. Clean-up in this context means
that the data relating to a single structure is internally consistent and free of errors. The Cambridge
Crystallographic Data Centre (CCDC) provides a database of structures of 'small molecules', of interest
to biologists concerned with protein-ligand interactions.
Microarrays and gene expression databases
Microarray technology makes use of the sequence resources created by the genome projects and other
sequencing efforts to answer the question: what genes are expressed in a particular cell type of an
organism, at a particular time and under particular conditions? For instance, microarrays allow comparison
of gene expression between normal and diseased (e.g., cancerous) cells. There are several names for this
technology: DNA microarrays, DNA arrays, DNA chips, gene chips and others. Sometimes a distinction is
made between these names, but in fact they are all synonyms as there are no standard definitions for which
type of microarray technology should be called by which name.
Microarray technology and applications
Microarrays exploit the preferential binding of complementary single-stranded nucleic acid sequences.
A microarray is typically a glass slide, on to which DNA molecules are attached at fixed locations
(spots). There may be tens of thousands of spots on an array, each containing a huge number of identical
DNA molecules (or fragments of identical molecules), of lengths from twenty to hundreds of
nucleotides. (According to quick napkin calculations by Wilhelm Ansorge and John Quackenbush in
Schnookeloch in Heidelberg on 4 October, 2001, the number of DNA molecules in a microarray spot is
10^7-10^8.) For gene expression
studies, each of these molecules ideally should identify one gene or one exon in the genome; however, in
practice this is not always so simple and may not even be generally possible due to families of similar
genes in a genome. Microarrays that contain all of the approximately 6000 genes of the yeast genome have
been available since 1997. The spots are either printed on the microarrays by a robot, or synthesised by
photo-lithography (similar to computer chip production) or by ink-jet printing. The [...]
right shows an illuminated microarray (enlarged). A typical dimension of such an array is [...]; the spot
diameter is of the order of 0.1 mm, and for some microarray types can be even smaller. There
are different ways in which microarrays can be used to measure gene expression levels. One of the most
popular microarray applications allows the comparison of gene expression levels in two different samples,
e.g., the same cell type in a healthy and a diseased state (see picture below).
The total mRNA from the cells in two different conditions is extracted and labelled with two different
fluorescent labels: for example a green dye for cells at condition 1 and a red dye for cells at condition 2
(to be more accurate, the labelling is typically done by synthesising single-stranded DNAs that are
complementary to the extracted mRNA by an enzyme called reverse transcriptase). Both extracts are washed over
the microarray. Labelled gene products from the extracts hybridise to their complementary sequences in
the spots due to preferential binding: complementary single-stranded nucleic acid sequences tend to
attract each other, and the longer the complementary parts, the stronger the attraction.
The dyes enable the amount of sample bound to a spot to be measured by the level of fluorescence
emitted when it is excited by a laser. If the RNA from the sample in condition 1 is in abundance, the spot
will be green; if the RNA from the sample in condition 2 is in abundance, it will be red. If both are equal,
the spot will be yellow, while if neither is present it will not fluoresce and will appear black. Thus, from
the fluorescence intensities and colours for each spot, the relative expression levels of the genes in both
samples can be estimated.
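The colour rule described above can be sketched in a few lines. This is only an illustration: the function name, the intensity floor of 50 and the two-fold threshold are assumptions chosen for the example, not values from any microarray software.

```python
import math

def spot_colour(green, red, floor=50.0, fold=2.0):
    """Classify one spot from its green (condition 1) and red (condition 2) intensities.

    `floor` and `fold` are illustrative assumptions: below `floor` in both
    channels the spot is treated as black; a ratio beyond `fold` in either
    direction is treated as a clear colour.
    """
    if green < floor and red < floor:
        return "black", None                  # neither sample hybridised
    g = max(green, 1.0)                       # avoid taking the log of zero
    r = max(red, 1.0)
    log_ratio = math.log2(r / g)              # > 0: more RNA in condition 2
    if log_ratio >= math.log2(fold):
        return "red", log_ratio
    if log_ratio <= -math.log2(fold):
        return "green", log_ratio
    return "yellow", log_ratio                # both samples roughly equal

print(spot_colour(4000, 500))   # condition 1 dominates
print(spot_colour(1000, 1000))  # equal abundance
```

Working with the logarithm of the ratio treats two-fold increases and two-fold decreases symmetrically, which is why it is a common choice for this kind of comparison.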
The raw data that are produced from microarray experiments are the hybridised microarray images. To
obtain information about gene expression levels, these images have to be analysed: each spot on the array
is identified, and its intensity is measured and compared to the background. This is called image
quantitation. Image quantitation is done by image analysis software. To obtain the final gene expression
matrix from spot quantitations, all the quantities related to some gene (either on the same array or on
arrays measuring the same conditions in repeated experiments) have to be combined and the entire matrix
has to be scaled to make different arrays comparable.
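A minimal sketch of these steps, assuming the simplest possible choices: background is subtracted from foreground, replicate spots of one gene are averaged, and each array is divided by its median. The function names and these particular choices are assumptions for illustration; real pipelines use more careful statistics.

```python
from statistics import mean, median

def net_intensity(foreground, background):
    """Spot intensity compared to its local background (never negative)."""
    return max(foreground - background, 0.0)

def combine_replicates(spots):
    """spots: iterable of (gene, foreground, background) for one array.
    Returns gene -> mean net intensity over all spots measuring that gene."""
    per_gene = {}
    for gene, fg, bg in spots:
        per_gene.setdefault(gene, []).append(net_intensity(fg, bg))
    return {gene: mean(values) for gene, values in per_gene.items()}

def scale_to_median(column):
    """Divide one array's values by their median so arrays become comparable."""
    m = median(column.values())
    if m == 0:
        raise ValueError("median intensity is zero; array cannot be scaled")
    return {gene: value / m for gene, value in column.items()}

# Hypothetical quantitation output for one array: gene1 is spotted twice.
array_spots = [("gene1", 300, 100), ("gene1", 500, 100),
               ("gene2", 700, 100), ("gene3", 150, 100)]
column = scale_to_median(combine_replicates(array_spots))
print(column)
```

Each scaled column then becomes one column of the gene expression matrix.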
Gene expression monitoring is not the only microarray application; another one is SNP detection.
ArrayExpress
Microarrays are already producing massive amounts of data. These data, like genome sequence data, can
help us to gain insights into underlying biological processes only if they are carefully recorded and
stored in databases, where they can be queried, compared and analysed by different computer software
programs. The EBI is currently establishing a public repository for microarray gene expression data,
ArrayExpress, analogous to EMBL-bank for DNA sequence data. In many respects gene expression
databases are inherently more complex than sequence databases (this does not mean that developing,
maintaining and curating the sequence databases are any less challenging). Conceptually, a gene
expression database can be regarded as consisting of three parts - the gene expression data matrix, gene
annotation and sample annotation, see picture below.
Gene expression data have meaning only in the context of the particular biological sample and the exact
conditions under which the samples were taken. For instance, if we are interested in finding out how
different cell types react to treatments with various chemical compounds, we must record unambiguous
information about the cell types and compounds used in the experiments. The EBI is participating in an
effort to develop ontologies for sample annotation; this is analogous to gene ontology for gene
description. Gene annotation can be taken care of to some extent by links to sequence databases, but the
complicated many-to-many relationships between genes in the gene expression matrix and features
on the array make it necessary to provide a full and detailed description of each feature on the array:
one gene can relate to several features on the array. The lack of standards in gene naming is another
difficulty - a table relating each array feature present in the database to the list of all synonymous names
of the respective gene is an essential part of a gene expression database.
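The three-part structure and the feature-to-gene table can be sketched as a small data structure. This is not the ArrayExpress schema; the class, field names and identifiers (`geneA`, `spot_001` and so on) are hypothetical and serve only to show the many-to-many relationships.

```python
from dataclasses import dataclass, field

@dataclass
class ExpressionDatabase:
    # Part 1: the gene expression data matrix, keyed by (gene, sample).
    matrix: dict = field(default_factory=dict)
    # Part 2: gene annotation, here reduced to the set of synonymous names.
    gene_synonyms: dict = field(default_factory=dict)
    # Part 3: sample annotation (cell type, treatment, ...).
    sample_annotation: dict = field(default_factory=dict)
    # Array description: each feature (spot) may relate to several genes.
    feature_genes: dict = field(default_factory=dict)

    def features_for(self, gene):
        """All array features that measure the given gene."""
        return sorted(f for f, genes in self.feature_genes.items() if gene in genes)

    def resolve(self, name):
        """Genes known under `name`, whether it is the key or a synonym."""
        return sorted(g for g, names in self.gene_synonyms.items()
                      if name == g or name in names)

db = ExpressionDatabase()
db.matrix[("geneA", "sample1")] = 2.5
db.gene_synonyms["geneA"] = {"aliasA", "orf0001"}
db.sample_annotation["sample1"] = {"cell_type": "hypothetical", "compound": "none"}
db.feature_genes["spot_001"] = {"geneA"}
db.feature_genes["spot_002"] = {"geneA", "geneB"}   # one feature, two similar genes
print(db.features_for("geneA"), db.resolve("aliasA"))
```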
The microarray technology is still rapidly developing, therefore it is natural that currently there are no
established standards for microarray experiments or for how the raw data should be processed. There are
also no standard measurement units for gene expression levels. In the absence of such standards, the
information about how exactly the gene expression data matrix was obtained should be kept in the
database if the data are to be properly interpreted later.
ArrayExpress stores all this information, the details of which are set out in the Minimum Information
About a Microarray Experiment (MIAME), defined by the Microarray Gene Expression Database (MGED)
consortium. MGED is a grass roots movement that was founded at a meeting at the EBI in 1999, is
supported by most of the important players in the microarray community, and has evolved far beyond the
EBI.
Another repository for gene expression data, GEO, is being developed at NCBI in the US. DDBJ in Japan
also has plans. All three groups face similar problems and are involved in MGED to some degree. A
common data exchange format, MAGE-ML, is being developed in collaboration between MGED (with
active participation of the EBI) and some major microarray companies.
Gene expression data analysis and Expression Profiler
Capturing and storing microarray data is not an end in itself. The amounts of data from even a single
microarray experiment are so large that software tools have to be used to make any sense of them.
Clustering and class prediction are typical methods currently used in gene expression data analysis (see
Microarray Data Analysis). One of the popular gene expression data analysis tools is Expression Profiler,
developed at the EBI. The Microarray Informatics Team at the EBI is actively working in many microarray
data analysis areas using this and other tools.
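As an illustration of clustering, genes can be grouped by the similarity of their rows in the gene expression matrix. The sketch below is not how Expression Profiler works; it assumes Pearson correlation as the similarity measure and a simple greedy single-linkage grouping with an arbitrary threshold of 0.9, and the profiles are made-up numbers.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles.
    Profiles must not be constant (that would divide by zero)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_by_correlation(profiles, threshold=0.9):
    """Greedy single linkage: a gene joins (and merges) every cluster that
    already holds a gene whose profile correlates with it above `threshold`."""
    clusters = []
    for gene, profile in profiles.items():
        linked = [c for c in clusters
                  if any(pearson(profile, profiles[g]) >= threshold for g in c)]
        merged = [gene]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters

# Hypothetical rows of an expression matrix (four conditions each).
profiles = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.5],
            "c": [4, 3, 2, 1], "d": [8, 6, 4, 2.5]}
print(cluster_by_correlation(profiles))
```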
An example of such research is an approach to reverse engineering of gene regulatory networks, which is
based on the hypothesis that genes that have similar expression profiles (i.e., similar rows in the gene
expression matrix) should also have similar regulation mechanisms, as there must be a reason why their
expression is similar under a variety of conditions. Therefore, if we cluster the genes by similarities in
their expression profiles and take sets of promoter sequences from genes in such clusters, some of these
sets of sequences may contain a 'signal', i.e. a specific sequence pattern such as a particular substring,
which is relevant to regulation of these genes (Vilo et al. 2000).
DNA is the main carrier of genetic information in living organisms. DNA molecules are extremely
long and consist of repeated nucleotides. Each nucleotide consists of a nitrogenous base, a sugar
moiety and a phosphate group. Adenine (A), thymine (T), guanine (G), and cytosine (C) are the four
nitrogenous bases that make up DNA. The structure of a DNA molecule is double stranded,
consisting of two DNA strands wound around each other to form a double helix. The nucleotides of the
two strands are complementary to each other such that adenine pairs with thymine (A-T), and
guanine pairs with cytosine (G-C). It is the order of these four bases that makes
all the difference.
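The A-T and G-C pairing rule means that one strand fully determines the other. A minimal sketch (the function names are chosen for this example) that derives the partner strand, remembering that the two strands run in opposite directions:

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Base-by-base partner of a strand (same left-to-right order)."""
    return "".join(COMPLEMENT[base] for base in strand.upper())

def reverse_complement(strand):
    """Partner strand written 5' to 3', i.e. the complement read backwards."""
    return complement(strand)[::-1]

print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```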
The goal of DNA sequencing is to determine the order of bases for a specific piece of DNA. The first
successful attempt at sequencing a small portion of a gene, completed in 1971, required three years of
work to determine 12 bp from the termini of lambda phage DNA.
Methods of DNA Sequencing
1) MAXAM AND GILBERT METHOD
2) SANGER METHOD
3) AUTOMATED DNA SEQUENCING
4) SEQUENCING WITH CAPILLARY GEL ELECTROPHORESIS
1. Maxam-Gilbert Chemical Sequencing
For this method, an in vitro DNA polymerase reaction is not required. First the DNA fragment to be
sequenced has to be isolated and radioactively labeled at one end only. Separate one strand from the
other to yield a population of identical strands labeled at one end. Then divide the mixture into four
samples, each of which is subjected to a different chemical reagent that destroys one or two specific
bases. The four reagents destroy only G, only C, A and G, or T and C. The loss of a base makes the
sugar-phosphate backbone more likely to break at that point. The reagent concentration is adjusted so
that only about one in 50 of the target bases is destroyed per DNA fragment. This results in a mixture of
different-sized pieces carrying the radioactive label. When these are separated in the different lanes of
a gel, they can be arranged in order of length and the base destroyed at each site can be determined by
noting in which lane or lanes the band appears. Thus, the sequence of bases in the strand can quite
simply be read from the pattern of bands on the gel.
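The lane logic can be made concrete with a toy simulation. It is a sketch under simplifying assumptions: every possible cleavage site is represented exactly once, fragment lengths are measured from the labeled end, and the zero-length fragment from the very first base is kept even though bands that close to the label cannot be resolved on a real gel. The function names are invented for this example.

```python
def maxam_gilbert_lanes(sequence):
    """Labeled fragment lengths expected in each of the four reagent lanes.
    Destroying the base at 0-based position i leaves a labeled piece of i bases."""
    lanes = {"G": set(), "A+G": set(), "C+T": set(), "C": set()}
    reagents_hitting = {"G": ("G", "A+G"), "A": ("A+G",),
                        "C": ("C", "C+T"), "T": ("C+T",)}
    for position, base in enumerate(sequence.upper()):
        for lane in reagents_hitting[base]:
            lanes[lane].add(position)
    return lanes

def read_gel(lanes):
    """Read bands from shortest to longest and call the destroyed base:
    G lane (and A+G) -> G; A+G only -> A; C lane (and C+T) -> C; C+T only -> T."""
    calls = []
    for length in sorted(set().union(*lanes.values())):
        if length in lanes["G"]:
            calls.append("G")
        elif length in lanes["A+G"]:
            calls.append("A")
        elif length in lanes["C"]:
            calls.append("C")
        else:
            calls.append("T")
    return "".join(calls)

lanes = maxam_gilbert_lanes("GATTACA")
print(lanes)
print(read_gel(lanes))
```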
In the G-specific reaction, dimethyl sulphate methylates guanine at the N-7 position, which leads to
instability of the glycosidic linkage. Then piperidine displaces the opened 7-methylguanine and
catalyses the elimination of both phosphates from the sugar. Adenine also becomes methylated at N-3
… not N-7 with piperidine to cause strand breakage. In the G+A reaction the bases are protonated and
the glycosidic bonds are broken by treatment with acid. Piperidine is finally used to eliminate …
Hydrazine is used to attack pyrimidines in the C+T reaction … and then joins with …
Only the labeled DNA fragments are detected, via autoradiography or fluorescence analysis. The DNA
sequence can be read from the bottom of the gel to the top by examination, or "reading", of the ladder
of DNA bands, giving the sequence 5' to 3'.
The technique is time-consuming, costly and tedious; however, it has the following advantages: 1) it can
be used for sequencing DNA that is unstable when cloned and DNA with regions of secondary structure;
2) it is used to analyze genomic footprinting; 3) it is used in the study of DNA methylation.
2. Sanger dideoxy DNA Sequencing
This is the most extensively used and best-known method of determining the sequence of the
bases that make up DNA. The underlying principle is that the incorporation of a dideoxynucleotide
into an extending DNA strand terminates elongation, because no further bases can be added to it.
Dideoxynucleotide sequencing is commonly called Sanger sequencing since Sanger devised the
method. This technique utilizes 2',3'-dideoxynucleotide triphosphates (ddNTPs), molecules that differ
from deoxynucleotides by having a hydrogen atom attached to the 3' carbon rather than an OH
group. These molecules terminate DNA chain elongation because they cannot form a phosphodiester
bond with the next deoxynucleotide.
In order to perform the sequencing, one must first convert double stranded DNA into single stranded
DNA. This can be done by denaturing the double stranded DNA with NaOH. A Sanger reaction
consists of the following: a strand to be sequenced (one of the single strands which was denatured using
NaOH), DNA primers (short pieces of DNA that are both complementary to the strand which is to be
sequenced and radioactively labeled at the 5' end), a mixture of a particular ddNTP (such as ddATP)
with its normal dNTP (dATP in this case), and the other three dNTPs (dCTP, dGTP, and dTTP)
(Figure 1).
The concentration of ddATP should be 1% of the concentration of dATP. The logic behind this ratio is
that after DNA polymerase is added, the polymerization will take place and will terminate whenever a
ddATP is incorporated into the growing strand. If the ddATP is only 1% of the total concentration of
dATP, a whole series of labeled strands will result. Note that the lengths of these strands are dependent
on the location of the base relative to the 5' end.
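Why a low ddATP ratio yields a whole series of fragment lengths can be seen from a small calculation. The sketch assumes, purely for illustration, that the polymerase picks ddATP with a fixed probability p = 0.01 at every position where an A is required; the true probability depends on the enzyme as well as on the concentration ratio. Fragment lengths are counted from the first base added after the primer.

```python
def fraction_terminating_at(k, p=0.01):
    """Fraction of growing strands that stop exactly at the k-th A site:
    they must pass k - 1 sites with dATP and then take up ddATP."""
    return p * (1.0 - p) ** (k - 1)

def fraction_read_through(n_sites, p=0.01):
    """Fraction of strands that pass all n_sites without terminating."""
    return (1.0 - p) ** n_sites

def fragment_lengths(new_strand, base="A"):
    """Possible lengths of terminated fragments in the reaction for `base`."""
    return [i + 1 for i, b in enumerate(new_strand.upper()) if b == base]

print(fragment_lengths("GATTACA"))      # one band per A in the new strand
print(fraction_terminating_at(1))       # share stopping at the first A
print(fraction_terminating_at(50))      # still a usable share at the 50th A
```

With a much higher ddATP share almost all strands would stop at the first few A sites, and the longer fragments needed to read further along the strand would be missing.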
This reaction is performed four times using a different ddNTP for each reaction. When these reactions
are completed, polyacrylamide gel electrophoresis (PAGE) is performed. One reaction is loaded into
one lane for a total of four lanes. The gel is transferred to a nitrocellulose filter and autoradiography is
performed so that only the bands with the radioactive label on the 5' end will appear. In PAGE, the
shortest fragments will migrate the farthest. Therefore, the bottom-most band indicates that its particular
dideoxynucleotide was added first to the labeled primer. For example, suppose the band that migrated the
farthest was in the ddATP reaction mixture. Then ddATP must have been added first to the primer,
and its complementary base, thymine, must have been the base present on the 3' end of the sequenced
strand. One can continue reading in this fashion. Note that if one reads the bases from the bottom up,
one is reading the 5' to 3' sequence of the strand complementary to the sequenced strand.
The sequenced strand can be read 5' to 3' by reading top to bottom the bases complementary to those on
the gel.
Traditional methods of manual DNA sequencing utilize radioactive isotopes such as phosphorus-32,
sulfur-35, and phosphorus-33, incorporated into specific nucleotides (A, C, T, G). Radioactively labeled
nucleotides allow for reading the sequence by a technique known as autoradiography. The gel that
contains the separated DNA segments is exposed to X-ray film for a period of time. The radiation
causes dark spots on the film that indicate its location. Next, the film is developed to reveal the pattern of
the labeled nucleotides. Since no process exists to discriminate between the different nucleotides by the
spots on the film, each labeled nucleotide must have its own lane on the gel. Therefore, four individual
lanes are required for manual sequencing in order to determine the full DNA sequence. An individual
must interpret the results of this process, and typically the results are entered into a computer for
storage and linking to other results.
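The manual interpretation of a four-lane Sanger gel amounts to sorting bands by fragment length. A toy sketch, with invented function names and idealised input (one band per position, lengths counted from the first base added to the primer):

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def sanger_lanes(new_strand):
    """Band positions per ddNTP lane for a given newly synthesised strand."""
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for i, base in enumerate(new_strand.upper()):
        lanes[base].append(i + 1)        # fragment length ending in this base
    return lanes

def read_ladder(lanes):
    """Bottom-up reading: shortest fragment first gives the new strand 5' to 3'."""
    bands = sorted((length, base) for base, lengths in lanes.items()
                   for length in lengths)
    return "".join(base for _, base in bands)

def sequenced_strand(ladder):
    """Template strand 5' to 3': complement of the ladder, read top to bottom."""
    return "".join(COMPLEMENT[base] for base in reversed(ladder))

lanes = sanger_lanes("ATGCCT")
ladder = read_ladder(lanes)
print(ladder)                    # strand complementary to the template
print(sequenced_strand(ladder))  # the template itself
```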
AUTOMATED DNA SEQUENCING
Automated deoxyribonucleic acid (DNA) sequencing reduces the volume of low-level radioactive
waste generated on campus, while providing a suitable alternative to manual DNA sequencing.
Traditional methods of manual DNA sequencing utilize radioactive isotopes to label the DNA. Automated
DNA sequencing utilizes fluorescent tracers instead of radioisotopes to label the DNA, thereby
eliminating or significantly reducing the use of radioactive materials in some research laboratories.
Automated DNA sequencing utilizes a fluorescent dye to label the nucleotides instead of a radioactive
isotope. The fluorescent dye is not an environmentally hazardous chemical and has no special
handling or disposal requirements. Instead of using X-ray film to read the sequence, a laser is used to
stimulate the fluorescent dye. The fluorescent emissions are collected on a charge-coupled device that
is able to determine the wavelength. The Perkin-Elmer Applied Biosystems (ABI) DNA sequencers
are designed to discriminate all four fluorescent dye wavelengths simultaneously, which allows for
complete DNA sequencing in one lane on the gel.
Varying degrees of automation are also available. For full automation, all that is required is to load a
sample tray with template DNA; the equipment performs the labeling and analysis. The other option is
to perform the labeling reactions with fluorescent dyes, load the samples onto a gel, and place the gel
into the DNA sequencer. The equipment performs the separation and analysis. The system
automatically identifies the nucleotide sequence and saves the information on the computer. Thus, only a
review of the data is necessary to ensure that no anomalies were misidentified by the computer.
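The idea of identifying bases from four dye signals in a single lane can be sketched as follows. This is not the algorithm used by any commercial sequencer: it assumes each peak has already been reduced to one intensity per dye, that each dye maps to exactly one base, and that a call is accepted only when the strongest signal is at least twice the runner-up (an arbitrary threshold). Ambiguous peaks are marked N so that a person can review them.

```python
def call_bases(peaks, min_ratio=2.0):
    """peaks: list of dicts mapping base -> fluorescence intensity of its dye.
    Returns the called sequence, with N where no dye clearly dominates."""
    calls = []
    for peak in peaks:
        ranked = sorted(peak.items(), key=lambda item: item[1], reverse=True)
        (best_base, best), (_, second) = ranked[0], ranked[1]
        calls.append(best_base if best >= min_ratio * second else "N")
    return "".join(calls)

# Hypothetical detector readings for three successive peaks.
peaks = [
    {"A": 900, "C": 50, "G": 40, "T": 30},
    {"A": 60, "C": 40, "G": 800, "T": 20},
    {"A": 400, "C": 380, "G": 10, "T": 5},   # two dyes nearly equal
]
print(call_bases(peaks))
```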
Benefits
Automated DNA sequencing equipment can eliminate the need for radioactive isotopes to label
DNA, thereby reducing the volume of low-level radioactive waste generated on campus. As a general
approximation, one template of manual DNA sequencing will produce 83 mL of liquid waste and
0.167 gallon of solid waste. As a result, every 45 templates processed by automated DNA sequencing …
reduces the amount of … manual DNA sequencing. The time saved is due to not having to perform
autoradiography or associated tasks required for working with radioactive materials such as
surveys, inventory/disposal documentation, etc.
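The per-template figures can be scaled to any batch size. The sketch below assumes that waste grows strictly in proportion to the number of templates, which is an approximation; the function name is chosen for this example.

```python
LIQUID_ML_PER_TEMPLATE = 83.0        # from the approximation above
SOLID_GALLONS_PER_TEMPLATE = 0.167   # from the approximation above

def manual_waste(templates):
    """Radioactive waste produced by manual sequencing of `templates` templates,
    i.e. the waste avoided if the same templates are sequenced automatically."""
    return {
        "liquid_ml": LIQUID_ML_PER_TEMPLATE * templates,
        "solid_gallons": SOLID_GALLONS_PER_TEMPLATE * templates,
    }

waste = manual_waste(45)
print(f"{waste['liquid_ml'] / 1000:.3f} L liquid, "
      f"{waste['solid_gallons']:.2f} gallons solid")
```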
SEQUENCING WITH CAPILLARY GEL ELECTROPHORESIS
… rather than slab gel electrophoresis; thus, they have a separate thin capillary gel for each DNA …
… glass slabs with a series of gel-filled glass tubes, each about the width of a human hair. …
Sequencers use multiple tiny (capillary) tubes to run standard electrophoretic separations. These
separations are much faster because the tubes dissipate heat well and allow the use of much higher electric
fields to complete sequencing in shorter times. The machines automatically load the samples, run the
separation, detect the fluorescence and clean out the capillaries between runs.
Advantages:
1. Better resolution, no running over from one lane to another.
2. Separation of bands occurs much faster, giving a 10- to 15-fold increase in speed.
Both advantages were very important for Celera's sequencing of the human genome.
Applications:
DNA sequencing, first devised in 1975, has become a powerful technique in molecular biology,
allowing analysis of genes at the nucleotide level. For this reason, this tool has been applied to many areas
of research. For example, the polymerase chain reaction (PCR), a method which rapidly produces
numerous copies of a desired piece of DNA, requires first knowing the flanking sequences of this
piece. Another important use of DNA sequencing is identifying restriction sites in plasmids. Knowing
these restriction sites is useful in cloning a foreign gene into the plasmid.
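Once a plasmid sequence is known, locating restriction sites is a simple string search. The sketch below uses the recognition sequences of three common enzymes (EcoRI, BamHI, HindIII); the plasmid sequence is a made-up toy example and the function name is an assumption. Because plasmids are circular, the search also covers sites that span the point where the written sequence starts and ends.

```python
RECOGNITION_SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def find_sites(sequence, site, circular=True):
    """0-based start positions of `site` in `sequence`.
    For a circular molecule the search wraps around the end."""
    seq = sequence.upper()
    search = seq + seq[:len(site) - 1] if circular else seq
    positions = []
    start = search.find(site)
    while start != -1:
        positions.append(start)
        start = search.find(site, start + 1)
    return positions

plasmid = "TTCAAGCTTGGATCCAAGAA"   # toy sequence, not a real plasmid
for enzyme, site in RECOGNITION_SITES.items():
    print(enzyme, find_sites(plasmid, site))
```

These three recognition sequences are palindromic (each equals its own reverse complement), so searching one strand is sufficient for them.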
Before the advent of DNA sequencing, molecular biologists had to sequence proteins directly; now
amino acid sequences can be determined more easily by sequencing a piece of cDNA and finding an
open reading frame. In eukaryotic gene expression, sequencing has allowed researchers to identify
conserved sequence motifs and determine their importance in the promoter region. Furthermore, a
molecular biologist can utilize sequencing to identify the site of a point mutation. These are only a few
examples illustrating the way in which DNA sequencing has revolutionized molecular biology.
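Finding an open reading frame in a cDNA sequence and deriving the amino acid sequence can be sketched as follows. The example assumes the standard genetic code, searches only the three forward reading frames, and takes the longest stretch from an ATG to the next in-frame stop codon; the input sequence and function names are invented for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def longest_orf(cdna):
    """Longest ATG...stop stretch (stop codon excluded) in the forward frames."""
    seq = cdna.upper()
    best = ""
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        start = None
        for index, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = index
            elif start is not None and CODON_TABLE.get(codon) == "*":
                orf = "".join(codons[start:index])
                if len(orf) > len(best):
                    best = orf
                start = None
    return best

def translate(orf):
    """Amino acid sequence of an ORF; unknown codons become X."""
    return "".join(CODON_TABLE.get(orf[i:i + 3], "X")
                   for i in range(0, len(orf) - 2, 3))

orf = longest_orf("CCATGGCTTGGTAAGG")
print(orf, translate(orf))
```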