Unit I
INTRODUCTION
Bioinformatics can draw on any kind of biological data, but it deals chiefly with two classes of biomolecules:
(i) Nucleic acids (DNA and RNA), the hereditary determinants that form the link between
one generation and the next.
(ii) Proteins, the vital molecules that decide and execute all the functions of living
organisms.
Bioinformatics therefore concentrates on genomics and proteomics.
Genomics deals with the analysis of the nucleotide sequence in a particular fragment of sequence data.
Proteomics deals with the sequence analysis and structure prediction of protein molecules.
In 1997, the genome of E. coli (4.7 Mbp) was published.
In 2001, the human genome (3000 Mbp) was published.
1) DNA level
Major applications involve cardiovascular disease, thrombosis, heart disease, obesity and HIV resistance. The aim is to
obtain the variant sequences of 100 or more random individuals, and to move from family-based linkage analyses to
association analyses. This is done by characterizing SNPs (single nucleotide polymorphisms), which aids the
identification of susceptibility genes.
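As a rough illustration of what characterizing SNPs means in practice, the sketch below compares a sample sequence against a reference and reports every single-base difference. It assumes the two sequences are already aligned and of equal length; the function name `find_snps` and the example sequences are hypothetical and chosen only for illustration.

```python
def find_snps(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch.

    Assumption: both sequences are already aligned and have equal length.
    Positions are 1-based, as is usual for sequence coordinates.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(reference.upper(), sample.upper())
    return [(i, r, s) for i, (r, s) in enumerate(pairs, start=1) if r != s]


# Hypothetical sequences, not taken from any database.
reference = "ATGCCGTAAGCT"
sample = "ATGCTGTAAGCC"
snps = find_snps(reference, sample)
print(snps)
```

Real SNP studies work on many individuals and on alignments that may contain gaps, so this only shows the core comparison step.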
2) RNA level
Simultaneous monitoring of the expression of all genes, i.e. monitoring all mRNAs at qualitative and quantitative
levels of sensitivity and distinguishing alternative splicing, with the use of DNA microarrays.
3) Monitoring level and modification state of all proteins
It is used to monitor post-translational modification of proteins and changes in genetic networks, for example phosphorylation state.
4) Identification of all basic protein shapes
Analysis of an amino acid sequence against a database of protein shapes. Some of the relevant bioinformatics databases are:
Pfam – protein multiple sequence alignments and common protein domains
SCOP – Structural Classification of Proteins
CATH – protein classification by Class, Architecture, Topology and Homology
5) Multiple sequence Alignment of Proteins
A protein family is a collection of proteins with similar structure (i.e. 3-dimensional shape), similar function, or
similar evolutionary history. Multiple sequence alignment is used to determine which family a newly sequenced
protein belongs to, since this provides hypotheses about its structure, function and evolutionary history.
6) Determining Gene Function
Bioinformatics is used to find genes in a genome, predict the gene product, and predict the gene function.
SEQUENCE FORMATS:
One major difficulty encountered in running sequence analysis software is the use of differing
sequence formats by different programs. These formats are all standard ASCII files, but they may differ in
the presence of certain characters and words that indicate where different types of information and the
sequence itself are to be found.
THE MORE COMMONLY USED SEQUENCE FORMATS ARE LISTED BELOW.
1. GenBank DNA Sequence Entry
2. European Molecular Biology Laboratory Data Library Format
3. SwissProt Sequence Format
4. FASTA Sequence Format
5. National Biomedical Research Foundation/Protein Information Resource Sequence Format
6. Stanford University/Intelligenetics Sequence Format
7. Genetics Computer Group Sequence Format
8. Format of Sequence File Retrieved from the National Biomedical Research Foundation/Protein Information Resource
9. Plain/ASCII. Staden Sequence Format
10. Abstract Syntax Notation Sequence Format
11. Genetic Data Environment Sequence Format
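To make the idea of format-specific marker characters concrete, here is a minimal sketch of a reader for the FASTA format, in which a line starting with ">" carries the description and the following lines carry the sequence. The function name `parse_fasta` and the two example records are hypothetical; real files may need extra handling (comments, very large inputs) that is left out here.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records used only to exercise the parser.
example = """>seq1 hypothetical example record
ATGCCGTA
AGCTTAGC
>seq2 another hypothetical record
GGCATTAC
"""
records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

The other formats in the list differ mainly in which keywords or characters play the role that ">" plays here, which is why conversion between formats is usually a matter of re-labelling the same information.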
SEQUENCE DATABASES
Three publicly available databases store large amounts of nucleotide and protein sequence data.
GenBank at the National Center for Biotechnology Information (NCBI) of the National Institutes of
Health (NIH) in Bethesda. http://www.ncbi.nlm.nih.gov/
The DNA Data Bank of Japan (DDBJ), associated with the Center for Information Biology. http://www.ddbj.nig.ac.jp/
The European Bioinformatics Institute (EBI) in Hinxton, England, which maintains the EMBL database. http://www.ebi.ac.uk/
GENBANK:
GenBank is an annotated collection of all publicly available nucleotide and
protein sequences, maintained by the National Center for Biotechnology Information (NCBI), a division of the
National Library of Medicine located at the National Institutes of Health (NIH) in Bethesda (Benson et al.,
2002). As of release 127.0 on December 15, 2001, it held approximately 15,850,000,000 bases
in 14,976,310 sequence records, belonging to more than 105,000 different organisms. In addition to storing
these sequences, GenBank contains bibliographic and biological annotation data. GenBank is available
free of charge from NCBI.
Amount of Sequence Data:
GenBank contains over 31 billion nucleotides from 24 million sequences (release 135.0, April 2003).
The growth of GenBank in terms of both nucleotides of DNA and number of sequences from 1982 to 2000
is summarized in the figure. From 1982 to the present, the number of bases in GenBank has doubled
approximately every 14 months.
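The doubling time can be checked roughly against the two release figures quoted in this unit (about 15.85 billion bases in December 2001 and about 31 billion in April 2003). The sketch below assumes exponential growth and an interval of 16 months between the two releases; the interval and the function name `doubling_time` are assumptions made for illustration.

```python
import math


def doubling_time(months_elapsed, start_size, end_size):
    """Doubling time in months, assuming exponential growth between two sizes."""
    return months_elapsed * math.log(2) / math.log(end_size / start_size)


bases_dec_2001 = 15.85e9   # release 127.0, as quoted above
bases_apr_2003 = 31e9      # April 2003 release, as quoted above
months_between = 16        # assumption: mid-December 2001 to mid-April 2003

estimate = doubling_time(months_between, bases_dec_2001, bases_apr_2003)
print(round(estimate, 1))
```

Under these assumptions the estimate for this short interval comes out somewhat longer than 14 months, which is of the same order as the long-run figure quoted for the whole period since 1982.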
A new release of GenBank comes out every 2 months. GenBank is part of the International
Nucleotide Sequence Database Collaboration, which comprises the European Molecular Biology
Laboratory (EMBL), the DNA Data Bank of Japan (DDBJ) and GenBank at NCBI. These three organizations
exchange data on a daily basis.
Many journals now require submission of sequence information to a database before publication,
so that an accession number may appear in the paper.
For this purpose NCBI has a WWW form called BankIt for convenient and quick submission of
sequence data, and a stand-alone submission software tool called Sequin.
ORGANISMS IN GENBANK:
Over 100,000 different species are represented in GenBank, with over 1000 new species added per month.
GenBank Records & Divisions:
Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the
source organism, bibliographic references and a table of features.
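The parts of an entry listed above appear in the GenBank flat-file format under keywords such as DEFINITION, ACCESSION, ORGANISM, REFERENCE and FEATURES, with the keyword at the start of the line and the value beginning at column 13. The sketch below collects the annotation fields that precede the feature table. The function name `parse_genbank_header` and the example record (including its accession number) are made up for illustration and do not correspond to a real database entry.

```python
def parse_genbank_header(text):
    """Collect annotation fields that precede the feature table of one record.

    Simplification: if a keyword occurs more than once (for example several
    REFERENCE blocks), only the last occurrence is kept.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break  # stop before the feature table and the sequence itself
        keyword = line[:12].strip()   # keyword occupies the first 12 columns
        value = line[12:].strip()     # value starts at column 13
        if keyword:
            current = keyword
            fields[current] = value
        elif current is not None and value:
            fields[current] += " " + value  # continuation line
    return fields


# Hypothetical record: names, accession and sequence are placeholders.
example = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example sequence for illustration.
ACCESSION   ZZ000000
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atgccgtaag cttagcggca ttac
//
"""
fields = parse_genbank_header(example)
for keyword, value in fields.items():
    print(keyword, "->", value)
```

In this sketch the ORGANISM value holds the scientific name followed by the taxonomic lineage, because the lineage is written on continuation lines under the same keyword.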
The files in the GenBank distribution have been partitioned into divisions that roughly correspond to
taxonomic groups.
GenBank staff can usually assign an accession number to a sequence submission within two working days of
receipt, and do so at a rate of almost 1600 per day.
The ACCESSION NUMBER serves as confirmation that the sequence has been submitted and
allows readers of articles, in which the sequence is cited, to retrieve the data. Direct submissions receive a
quality assurance review that includes checks for vector contamination, proper translation of coding regions,
correct taxonomy and correct bibliographic citations.
A draft of the GenBank record is passed back to the author for review before it enters the database.
Since three-dimensional atomic coordinates are not convenient for examining a structure with the
naked eye, a large number of 3D structure viewers have been designed to display these
coordinates graphically. The most common is the RasMol program. It allows the structure to be drawn using a range of
representations, including spacefill, ball-and-stick and wireframe views, as well as views that emphasize secondary
structure. There are other structural databases such as SCOP (Structural Classification of Proteins),
which classifies proteins according to structural similarity and evolutionary relationships. In SCOP one
can also see the hierarchical classification of proteins into families and superfamilies, with links to the relevant
PDB structures.
... 1,830 kb and the genome of the bacterium Mycoplasma genitalium is 580 kb. In early 1997, the complete
4,600-kb genome of E. coli was reported. Late in 1996, the complete 1,660-kb genome sequence of the
microbe Methanococcus jannaschii was reported.
This organism belongs to a group of organisms called the Archaea (archaeans), many of which live
in extreme conditions such as hot springs and deep sea vents. Methanococcus jannaschii, for example, was
isolated from a deep sea vent in the Pacific Ocean at a site where the temperature is close to the boiling
point of water and where the pressure is 1245 times greater than at sea level. A surprising 56 percent of the
1,738 genes the organism contains are entirely new to science, adding support to the growing evidence that
the Archaea represent a third kingdom of life.
A major goal of the Human Genome Project is to generate a highly detailed map of the human
genome. The progressive generation of this genetic map has come from studies of polymorphic DNA
markers; that is, loci in the genome where the different "alleles" are detectable differences in DNA length.
We have already discussed DNA markers resulting from restriction fragment length polymorphism (pp. 479-
480). The first human genetic map for the Human Genome Project, published in 1987, was based on such
RFLP markers.
The Human Genome Project is a multinational effort, begun in 1988, whose aim is to produce a
complete physical map of all human chromosomes, as well as the entire human DNA sequence. As part of
the project, the genomes of other organisms such as bacteria, yeast, flies and mice are also being studied. This
is no easy task, since the human genome is so large. So far, many virus genomes have been entirely
sequenced, but their sizes are generally in the range of 1 to 10 kilobase pairs.
The first free-living organism to be completely sequenced was the bacterium Haemophilus influenzae,
which has an 1800-kilobase-pair genome. In 1996 the whole sequence of the yeast genome, about 10 million bp,
was also determined. A report, "The US Human Genome Project: the first five years", was submitted to
Congress in February 1990. It brought the NRC and OTA evaluations up to date and set out objectives to be
fulfilled by 1995. Sequencing the human genome was postponed until sequencing costs fell to a maximum of
50 cents per base pair.
GENOME ARRANGEMENT
The following classification shows how human DNA is categorized for
identification.
The scientific objectives of the five-year plan were the following:
To set up a full human genetic map with markers at an average distance of 2 to 5 centimorgans from
each other, each marker if possible to be identified with an STS (sequence-tagged site);
To assemble an STS map of all the human chromosomes with an interval between markers of about
100,000 base pairs, and to generate sets of overlapping clones or closely spaced markers ordered without
ambiguity over a continuous distance of 2 x 10^6 bp;
To improve current methods and/or develop new methods for sequencing DNA so as to bring sequencing
costs down to a maximum of 50 cents per base pair (a decision on large-scale sequencing would have to be
taken within 4 to 5 years);
To prepare a genetic map of the mouse genome based on DNA markers, and to begin mapping one or two
chromosomes;
To sequence an aggregate of 20 x 10^6 base pairs from a variety of organisms, focusing on genomes or
fragments of 10^6 base pairs in length (the model organisms being E. coli, the yeast S. cerevisiae,
D. melanogaster, C. elegans and the laboratory mouse);
To constitute a "Joint Informatics Task Force" between the DOE and the NIH for the development of
software and databases to support large-scale sequencing and mapping projects, to create databases
with easy and up-to-date access to chromosome physical maps and sequences, and to develop the
algorithmic and analytical tools likely to be of use in the interpretation of genomic information;
To develop programs for the comprehension and handling of the ethical, legal and social problems of
the Human Genome Project (through contracts, research grants, colloquia, teaching and educational
materials), and to identify and define the major problems and develop political scenarios for handling
them;
To support the training of young scientists (pre-doctoral and post-doctoral), and to examine the needs for
other types of training;
To support innovative and risky technological development to meet the needs of the genome project.
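The scale implied by these objectives can be worked out with simple arithmetic. The sketch below uses only figures quoted in this unit: a human genome of 3000 Mbp, an STS spacing of about 100,000 base pairs, a contiguous ordered stretch read here as 2 x 10^6 bp, and a target cost of 50 cents per base pair. The constant and variable names are illustrative, and the results are order-of-magnitude estimates, not figures from the plan itself.

```python
GENOME_SIZE_BP = 3000 * 10**6    # human genome, 3000 Mbp as quoted earlier
STS_SPACING_BP = 100_000         # target interval between STS markers
CONTIG_TARGET_BP = 2 * 10**6     # continuous ordered distance in the plan
COST_PER_BP_USD = 0.50           # target maximum sequencing cost

# Number of STS markers needed to cover the genome at the target spacing.
sts_markers = GENOME_SIZE_BP // STS_SPACING_BP

# Number of STS markers falling within one contiguous ordered stretch.
markers_per_contig = CONTIG_TARGET_BP // STS_SPACING_BP

# Cost of one pass over the genome at the target price per base pair.
sequencing_cost_usd = GENOME_SIZE_BP * COST_PER_BP_USD

print(sts_markers, markers_per_contig, sequencing_cost_usd)
```

On these assumptions the map needs on the order of 30,000 STS markers, and a single pass over the genome at 50 cents per base pair costs about 1.5 billion dollars, which helps explain why large-scale sequencing was deferred until costs had fallen.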