CLASS: III B.Sc., BIOCHEMISTRY


SUBJECT: BIOINFORMATICS, IPR AND BIOSAFETY
UNIT I
DEFINE BIOINFORMATICS (2 marks)
Bioinformatics is an interdisciplinary field that uses computational techniques to analyze biological
data. It combines biology and information technology, and it encompasses any computational tools
and methods used to manage, analyze and manipulate large sets of biological data.

INTRODUCTION
Any biological data can be the subject of bioinformatics, but the field deals mainly with two classes of
biomolecules, namely
(i) Nucleic acids (DNA and RNA), the hereditary determinants that form a link between
one generation and the next
(ii) Proteins, the vital molecules that decide and execute all the functions of living
organisms
Thus bioinformatics concentrates on genomics and proteomics.
Genomics deals with the analysis of the sequence of nucleotides in a particular fragment of sequence data.
Proteomics deals with the sequence analysis and structure prediction of protein molecules.

DEFINE DATABASE (2 marks)


Databases are effectively electronic filing cabinets, a convenient and efficient method of storing vast
amounts of information.
GENERAL NOTE ON BIOINFORMATICS & ITS PRINCIPLE
The entire field of bioinformatics can be viewed from three perspectives.
Three perspectives of Bioinformatics:
Perspective I:
Analyzing DNA, RNA and protein sequences:
• Obtaining sequences
• Comparing two sequences
• Comparing a sequence to all other sequences in databases
Perspective II:
Genome-wide analysis of RNA and protein:
• Gene expression
• Microarrays
• Protein analysis and protein families
• Protein structure
• Aligning the sequences
• View of phylogenetic trees
Perspective III:
Genome analysis:
• Prokaryotes and eukaryotes
• The human genome
• Human diseases
The first perspective is the cell. The central dogma of molecular biology is that DNA is transcribed
into RNA and translated into protein. The focus of molecular biology has been on individual genes, mRNA
transcripts and proteins. A focus of the field of bioinformatics is the complete collection of DNA (the
genome); RNA (the transcriptome) and protein sequences (the proteome).
The millions of molecular sequences present both great opportunities and great challenges. A
bioinformatics approach to molecular sequence data involves the application of computer algorithms and
computer databases to molecular and cellular biology. Such an approach is sometimes referred to as
functional genomics. This typifies the essential nature of bioinformatics: biological questions can be
approached from levels ranging from single genes and proteins to cellular pathways and networks or even
whole genomic responses. Our goals are to understand how to study both individual genes and proteins and
collections of thousands of genes/proteins.
Central dogma of molecular biology
DNA → RNA → Protein → Cellular phenotype
Central dogma of genomics


Genome→Transcriptome→Proteome→Cellular phenotype
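The flow DNA → RNA → Protein can be illustrated with a minimal Python sketch. It assumes the input is the coding strand of the DNA, and it uses a deliberately tiny codon table that covers only the codons in the example (a full table has 64 entries). The function names and the example sequence are illustrative and do not come from the notes.

```python
# Toy codon table: only the codons needed for the example below.
CODON_TABLE = {
    "AUG": "M", "GCU": "A", "UGG": "W", "AAA": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",  # "*" marks a stop codon
}


def transcribe(dna_coding_strand: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T is replaced by U)."""
    return dna_coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon or the end is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # "X" stands for a codon that is missing from the toy table.
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTTGGAAATAA"        # hypothetical example sequence
mrna = transcribe(dna)         # DNA -> RNA
protein = translate(mrna)      # RNA -> protein
print(mrna, protein)
```

Real tools also handle the reverse strand, alternative genetic codes and ambiguous bases, which this sketch ignores.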
From the cell we can focus on individual organisms, which represent the second perspective of the
field of bioinformatics. Each organism changes across different stages of development and across different
regions of the body. For example, while we may sometimes think of genes as static entities that specify
features such as eye colour or height, they are in fact dynamically regulated across time and region and in
response to physiological state.
Gene expression varies in disease states or in response to a variety of signals, both intrinsic and
environmental. Many bioinformatics tools are available to study the broad biological questions relevant to
the individual. There are many databases of expressed genes and proteins derived from different tissues and
conditions. One of the most powerful applications of functional genomics is the use of DNA microarrays to
measure the expression of thousands of genes in biological samples.
At the largest scale is the tree of life, the third perspective. There are many millions of species alive
today, and they can be grouped into the three major branches of bacteria, archaea (single-celled microbes
that tend to live in extreme environments) and eukaryotes. Molecular sequence databases currently hold DNA
sequences from over 100,000 different organisms. The genomes of different species are also compared with
one another, an approach called comparative genomics.

EXPLAIN THE HISTORY OF BIOINFORMATICS (5 marks)


Bioinformatics comprises the methods used to collect, store, retrieve and correlate complex biological information. Its
history started with:
Gregor Mendel, the father of genetics, explained that traits are passed from one generation to the next by
certain factors. He imagined them as tiny particles packed inside the pollen grains.
Erwin Schrödinger (1928), Physics Nobel Prize winner, used arithmetic logic to deduce that the factors must be very small
to fit into a human chromosome. Based on the number of human traits and the size of the chromosome bands, he
concluded the size of the factors to be 1000 Å. Bioinformatics has come into existence through the growth of molecular
biology, technology and computers in the last three decades.
Paul Berg (1972) made the first recombinant DNA molecule using the enzyme ligase.
Stanley Cohen, Annie Chang and Herbert Boyer (1972) produced the first r-DNA organism. In 1973, two important
advances were made in the field of genomics:
(i) Joseph Sambrook et al. refined DNA electrophoresis using agarose gel
(ii) Stanley Cohen cloned DNA
The major causes for the birth of bioinformatics were
(i) the introduction of a method for sequencing DNA
(ii) the origin of a genetic engineering company, Genentech
In 1956
The first protein sequence, the amino acid sequence of bovine insulin consisting of 51 residues, was
reported
In 1966
The first nucleic acid sequence, that of yeast alanine tRNA with 77 bases, was reported
In 1967
Dayhoff gathered all the available sequence data to create the first bioinformatics database
In 1977
The Protein Data Bank followed with a collection of 10 X-ray crystallographic protein structures

MAJOR EVENTS IN BIOLOGY


In 1980
The first complete genome sequence for an organism (ΦX174) is published. The genome consists of 5386 base pairs,
which code for nine proteins
In 1986
The term genomics appeared, describing the mapping, sequencing and analysis of genes. The term
was coined by Thomas Roderick
In 1987
The physical map of E. coli is published by Y. Kohara et al.
In 1995
The genomes of Haemophilus influenzae (1.8 Mbp) and Mycoplasma genitalium are sequenced.
In 1996
The genome of Saccharomyces cerevisiae (baker's yeast, 12.1 Mbp) is sequenced.
In 1997
The genome of E. coli (4.7 Mbp) is published.

In 2001
The human genome (3000 Mbp) is published.

MAJOR DEVELOPMENTS IN COMPUTATIONAL TECHNOLOGY AND BIOINFORMATICS


In 1970
The details of the Needleman-Wunsch algorithm for sequence comparison are published
In 1974
Vint Cerf and Robert Kahn develop the concept of connecting networks of computers into an “Internet”
In 1976
The PROSITE database is reported by Bairoch
In 1980
The Smith-Waterman algorithm for sequence alignment is published
In 1986
The SWISS-PROT database is created by the Department of Medical Biochemistry of the University of
Geneva and the European Molecular Biology Laboratory (EMBL).
In 1988
The National Center for Biotechnology Information (NCBI) is established at the National Library of Medicine.
The Human Genome Initiative is started. Pearson and Lipman publish the FASTA algorithm for sequence
comparison
In 1990
The BLAST program is implemented by Altschul et al.
In 1991
The creation and use of Expressed Sequence Tags (ESTs) is described
In 1994
Attwood and Beck publish the PRINTS database of protein motifs.

EXPLAIN THE SCOPE OF BIOINFORMATICS (5 marks)


SCOPE OF BIOINFORMATICS
• Bioinformatics has been used to obtain the complete genomic picture of particular organisms such as
human, Drosophila, yeast, rice and E. coli.
• Genomic sequencing is done to obtain the complete genome picture of a particular species.
• It is used to find the exact number of genes causing hereditary diseases by studying the genomic data from
individuals belonging to the same family.
• It is used to create drugs tailored to a particular individual to fight disease, and to provide that
individual with the best nutrition.
• The areas in which bioinformatics has been largely applied are
a. Pharmacogenomics
b. Medical informatics
c. Functional genomics
d. Comparative genomics
e. Proteomics
f. Agri-informatics etc.
• Its applications are at different levels
a. DNA level
b. RNA level
c. Monitoring level and modification state of all proteins
d. Identification of all basic protein shapes
e. Multiple sequence alignment of proteins
f. Determine gene function

1) DNA level
Major applications involve cardiovascular disease, thrombosis, heart disease, obesity and HIV resistance. The aim is to
obtain the most variant sequences from 100 or more random individuals, moving from family-based linkage analyses to
association analyses. This is done by characterizing SNPs (single nucleotide polymorphisms), which aids the
identification of susceptibility genes.
• Pairwise sequence alignment
• Multiple sequence alignment, which arises in the study of repetitive sequences
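
Pairwise alignment can be illustrated with a small sketch of global alignment by dynamic programming, in the style of the Needleman-Wunsch algorithm named in the history section. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; real programs use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


result = needleman_wunsch("ACGT", "AGT")
print(result)
```

The table has (n+1) × (m+1) cells, so time and memory grow with the product of the two sequence lengths; this is why heuristic tools such as FASTA and BLAST are used for database searches.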

2) RNA level
Simultaneous monitoring of the expression of all genes, i.e. monitoring all mRNAs at qualitative and quantitative
sensitivity levels and distinguishing alternative splicing, with the use of DNA microarrays.
3) Monitoring level and modification state of all proteins
It is used to monitor post-translational modification of proteins and changes in genetic networks, for example the
phosphorylation state.
4) Identification of all basic protein shapes
Analysis of an amino acid sequence against a database of protein shapes. Some of the bioinformatics databases are
• Pfam – protein multiple sequence alignments and common protein domains
• SCOP – Structural Classification of Proteins
• CATH – protein classification by Class, Architecture, Topology and Homology
5) Multiple sequence Alignment of Proteins
A protein family is a collection of proteins with similar structure (i.e. 3-dimensional shape), similar function, or
similar evolutionary history. Multiple sequence alignment is used to find out which family a newly sequenced protein
belongs to, as this provides hypotheses about its structure, function and evolutionary history.
6) Determining Gene Function
Bioinformatics is used to find genes in a genome, predict the gene product, and predict the gene function.
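
As a rough illustration of gene finding, the sketch below scans the forward strand of a DNA string for open reading frames (ORFs): a start codon ATG followed in the same frame by a stop codon. This is a simplifying assumption; real gene prediction also considers the reverse strand, introns and statistical signals. The function name, the `min_codons` parameter and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for forward-strand ORFs in all three frames.

    start and end are 0-based, end-exclusive positions; the ORF includes the
    stop codon. min_codons counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGGCTTGGTAAGG"))
```

Once an ORF is found, its predicted product can be compared against protein databases to suggest a function.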

DEFINE DATABANK (2 marks)


1. It is a set of data related to a given subject and organized in such a way that it can be consulted by the
users
2. A data repository accessible by local and remote users
3. A databank may contain information on single or multiple subjects, may be organized in a rational manner,
may contain more than one database, and may be geographically distributed. More than one databank may
be required to build a comprehensive database.

ENUMERATE GENBANK (10 marks)


SUBMISSION OF SEQUENCES TO THE DATABASES:
Investigators are encouraged to submit their newly obtained sequences directly to a member of the
International Nucleotide Sequence Database Collaboration, such as the National Center for Biotechnology
Information (NCBI), which manages GenBank (http://www.ncbi.nlm.nih.gov); the DNA Databank of Japan
(DDBJ; http://www.ddbj.nig.ac.jp); or the European Molecular Biology Laboratory (EMBL)/EBI
Nucleotide Sequence Database (http://www.embl-heidelberg.de). NCBI reviews new entries and updates
existing ones, as requested. A database accession number, which is required to publish the sequence, is
provided. New sequences are exchanged daily by the GenBank, EMBL, and DDBJ databases.
VARIOUS WAYS TO SUBMIT YOUR SEQUENCES:
The simplest and newest way of submitting sequences is through the Web site
http://www.ncbi.nlm.nih.gov/ on a Web form page called BankIt.
The sequence can also be annotated with information about the sequence, such as mRNA start and
coding regions. The submitted form is transformed into GenBank format and returned to the submitter for
review before being added to GenBank.
The other method of submission is to use Sequin (formerly called Authorin), which runs on personal
computers and UNIX machines. The program provides an easy-to-use graphic interface and can manage
large submissions such as genomic sequence information. It is described and demonstrated on
http://www.ncbi.nlm.nih.gov/Sequin/index.html and may be obtained by anonymous FTP from
ncbi.nlm.nih.gov/sequin/.
Completed files can also be e-mailed to gb-sub@ncbi.nlm.nih.gov or can be mailed on diskette to
GenBank Submissions, National Center for Biotechnology Information, National Library of Medicine, Bldg.
38A, Room 8N-803, Bethesda, Maryland 20894.

SEQUENCE FORMATS:
One major difficulty encountered in running sequence analysis software is the use of differing
sequence formats by different programs. These formats are all standard ASCII files, but they may differ in
the presence of certain characters and words that indicate where different types of information and the
sequence itself are to be found.
THE MORE COMMONLY USED SEQUENCE FORMATS ARE LISTED BELOW.
1. GenBank DNA Sequence Entry
2. European Molecular Biology Laboratory Data Library Format
3. SwissProt Sequence Format
4. FASTA Sequence Format
5. National Biomedical Research Foundation/Protein Information Resource Sequence Format
6. Stanford University/Intelligenetics Sequence Format
7. Genetics Computer Group Sequence Format
8. Format of Sequence File Retrieved from the National Biomedical Research Foundation/Protein
Information Resource
9. Plain/ASCII Staden Sequence Format
10. Abstract Syntax Notation Sequence Format
11. Genetic Data Environment Sequence Format
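
Of the formats listed, FASTA is the simplest: each record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below shows how such a file might be read; the function name and the two example records are invented for illustration.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical two-record example.
example = """>seq1 demo nucleotide sequence
ATGGCTTGG
AAATAA
>seq2 demo protein sequence
MAWK
"""
print(parse_fasta(example))
```

Because every format marks the sequence differently, conversion between formats is usually the first step before running an analysis program.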

SEQUENCE DATABASES
Three publicly available databases store large amounts of nucleotide and protein sequence data:
• GenBank at the National Center for Biotechnology Information (NCBI) of the National Institutes of
Health (NIH) in Bethesda. http://www.ncbi.nlm.nih.gov/
• The DNA Data Bank of Japan (DDBJ), associated with the Center for Information Biology.
http://www.ddbj.nig.ac.jp/
• The European Bioinformatics Institute (EBI) in Hinxton, England, which maintains the EMBL
database. http://www.ebi.ac.uk/

GENBANK:
GenBank is an annotated collection of all publicly available nucleotide and protein sequences,
maintained by the National Center for Biotechnology Information (NCBI), a division of the
National Library of Medicine located at the National Institutes of Health (NIH) in Bethesda (Benson et al.,
2002). As of release 127.0 (December 15, 2001) it held approximately 15,850,000,000 bases
in 14,976,310 sequence records, belonging to more than 105,000 different organisms. In addition to storing
these sequences, GenBank contains bibliographic and biological annotation data. GenBank data are available
free of charge from NCBI.
Amount of Sequence Data:
GenBank contains over 31 billion nucleotides from 24 million sequences (release 135.0, April 2003).
The growth of GenBank from 1982 to 2000, in terms of both nucleotides of DNA and number of sequences,
is summarized in the figure. Over the period 1982 to the present the number of bases in GenBank has doubled
approximately every 14 months.
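The doubling time can be checked against any two release figures, assuming exponential growth between them. The sketch below uses the two counts quoted in these notes (about 15.85 billion bases in December 2001 and about 31 billion in April 2003, taken here as 16 months apart). The function name is illustrative, and two data points give only a rough estimate.

```python
import math


def doubling_time(start_count, end_count, elapsed_months):
    """Doubling time in months, assuming exponential growth between two counts."""
    return elapsed_months * math.log(2) / math.log(end_count / start_count)


# Figures quoted in the notes; the 16-month gap is an approximation.
estimate = doubling_time(15.85e9, 31e9, 16)
print(f"Estimated doubling time: {estimate:.1f} months")
```

Over this particular interval the estimate comes out somewhat above the long-run figure of 14 months, which is an average over the whole period since 1982.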
A new release of GenBank comes out every 2 months. GenBank is a part of the International
Nucleotide Sequence Database Collaboration, which comprises the European Molecular Biology
Laboratory (EMBL), the DNA Databank of Japan (DDBJ) and GenBank at NCBI. These three organizations
exchange data on a daily basis.
At present many journals require sequence information to be submitted to a database before
publication, so that an accession number can appear in the paper.
For this purpose NCBI has a WWW form called BankIt for convenient and quick submission of
sequence data, and a stand-alone submission program called Sequin.

ORGANISMS IN GENBANK:
Over 100,000 different species are represented in GenBank, with over 1000 new species added per month.
GenBank Records & Divisions:
Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the
source organism, bibliographic references and a table of features.
The files in the GenBank distribution have been partitioned into divisions that roughly correspond to
taxonomic groups.

BUILDING THE DATABASES:


The data in GenBank and the collaborating databases, EMBL and DDBJ, are submitted primarily by individual
authors to one of the three databases, or by sequencing centers as batches of EST, GSS, HTC, WGS or HTG
sequences.

DIRECT ELECTRONIC SUBMISSION:


All records enter GenBank as direct electronic submissions
(www.ncbi.nlm.nih.gov/Genbank/index.html).
GenBank staff can usually assign an accession number to a sequence submission within two working days of
receipt, and do so at a rate of almost 1600 per day.
The ACCESSION NUMBER serves as confirmation that the sequence has been submitted and
allows readers of articles, in which the sequence is cited, to retrieve the data. Direct submissions receive a
quality assurance review that includes checks for vector contamination, proper translation of coding regions,
correct taxonomy and correct bibliographic citations.
A draft of the GenBank record is passed back to the author for review before it enters the database.

A TYPICAL DATABASE RECORD CONTAINS THREE SECTIONS


(I) The header, which holds a set of information that includes a description of the
sequence, its organism of origin, allied literature references and cross-links to related sequences in other
databases.
A unique identifier of the record is given at the beginning of the entry, in the LOCUS field of
GenBank.
The accession number is assigned to the sequence when it is submitted and is the primary identifier
of the record. This is the number that should be cited when referring to a sequence in the literature, since the
number will always remain associated with the sequence. Another field in the databases is ORGANISM,
which contains not only the scientific name of the organism of origin of the sequence, but also its full
taxonomic classification according to a standard taxonomy maintained at NCBI.
(II) The feature table, which contains a description of features in the record, such as coding sequences,
exons, repeats and promoters for nucleotide sequences, and domains, structure elements and binding sites in
the case of protein sequences.
A CDS (coding DNA sequence) is a region of DNA that codes entirely for a protein and can
be directly translated into a protein sequence.
(III) The sequence itself, which is often more easily analyzed by computer than examined by eye.
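
The three sections can be seen by reading a record programmatically. The sketch below parses a heavily simplified, invented flat-file record (the identifier DEMO0001 and all its contents are hypothetical) and pulls out the LOCUS name, the accession number, the organism, the feature keys and the sequence. Real GenBank records have many more fields and multi-line values, which this sketch does not handle.

```python
# Hypothetical, heavily simplified record in GenBank flat-file style.
SAMPLE_RECORD = """LOCUS       DEMO0001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical demonstration sequence.
ACCESSION   DEMO0001
SOURCE      synthetic construct
  ORGANISM  synthetic construct
FEATURES             Location/Qualifiers
     CDS             1..24
ORIGIN
        1 atggcttggg ctaaatggaa ataa
//
"""


def parse_genbank(text):
    """Split a simplified flat-file record into header, features and sequence."""
    record = {"locus": None, "accession": None, "organism": None,
              "features": [], "sequence": ""}
    section = "header"
    seq_parts = []
    for line in text.splitlines():
        if line.startswith("//"):              # end-of-record marker
            break
        if line.startswith("FEATURES"):
            section = "features"
        elif line.startswith("ORIGIN"):
            section = "sequence"
        elif section == "header":
            if line.startswith("LOCUS"):
                record["locus"] = line.split()[1]
            elif line.startswith("ACCESSION"):
                record["accession"] = line.split()[1]
            elif line.strip().startswith("ORGANISM"):
                record["organism"] = line.split(None, 1)[1].strip()
        elif section == "features":
            parts = line.split()
            # Feature lines look like "CDS  1..24"; qualifiers start with "/".
            if len(parts) == 2 and not parts[0].startswith("/"):
                record["features"].append((parts[0], parts[1]))
        elif section == "sequence":
            # Keep letters only, dropping position numbers and spaces.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
    record["sequence"] = "".join(seq_parts).upper()
    return record


print(parse_genbank(SAMPLE_RECORD))
```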

THE GENBANK DATABASE RECORD STRUCTURE

SUBMISSION USING BANK IT:


About a third of author submissions are received through NCBI’s Web-based data submission tool,
BankIt (www.ncbi.nlm.nih.gov/BankIt). Using BankIt, authors enter sequence information directly into a
form and add biological annotation such as coding regions or mRNA features.
BankIt validates submissions, flagging many common errors, and checks for vector contamination
using a variant of BLAST called VecScreen.

SUBMISSION USING SEQUIN:


NCBI also offers a standalone multi-platform submission program called Sequin
(www.ncbi.nlm.nih.gov/Sequin/index.html).
Sequin handles simple sequences such as a cDNA, as well as segmented entries, phylogenetic
studies, population studies, mutation studies, environmental samples and alignments for which BankIt and
other Web-based submission tools are not well-suited. Sequin has convenient editing and complex
annotation capabilities and contains a number of built-in validation functions for quality assurance.

RETRIEVING GENBANK DATA


The sequence records in GenBank are accessible via Entrez (www.ncbi.nlm.nih.gov/sites/gquery), a
flexible database retrieval system that covers 35 biological databases. Entrez databases contain DNA and
protein sequences derived from GenBank and other sources, genome maps, population, phylogenetic and
environmental sequence sets, gene expression data, the NCBI taxonomy, protein domain information and
protein structures from the Molecular Modelling Database, MMDB. Each database is linked to the scientific
literature via PubMed and PubMed Central.
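
Besides the Web interface, Entrez can be queried from programs through NCBI's E-utilities, which are not described in these notes and are mentioned here as an addition. The sketch below only builds the request URL for the `efetch` utility and performs no network access; the accession used is an illustrative identifier, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="gb"):
    """Build an efetch URL that requests one record as plain text.

    rettype "gb" asks for the GenBank flat file, "fasta" for FASTA format.
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Illustrative accession; replace it with the record of interest.
print(efetch_url("U00096", rettype="fasta"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client, subject to NCBI's usage guidelines.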

EXPLAIN THE STRUCTURAL DATABASES OF PDB (10 marks)


Structure databases archive, annotate and distribute sets of atomic coordinates. The best-established
database for biological macromolecular structures is the Protein Data Bank (PDB).
The PDB is the major repository of protein structures (and to some
extent of nucleic acid structures). This database stores 3-dimensional atomic coordinates of proteins and
nucleic acids. The data are obtained by experimental methods such as X-ray crystallography and NMR, or by
computer modelling.
It contains structures of proteins, nucleic acids, and a few carbohydrates. Started by the late
Walter Hamilton at Brookhaven National Laboratory, Long Island, New York, USA, in 1971, the PDB is
now managed by the Research Collaboratory for Structural Bioinformatics (RCSB), a distributed organization
based at Rutgers University, in New Jersey; the San Diego Supercomputer Center, in California; and the
National Institute of Standards and Technology, in Maryland, all in the USA.
The parent web site of the Protein Data Bank is at http://www.rcsb.org. Official mirror sites exist in
Europe, Singapore, Japan and Brazil; others are distributed around the world. The home page of the PDB
contains links to the data files themselves, to expository and tutorial material including short news items and
the PDB Newsletter, to facilities for deposition of new entries, and to specialized search software for
retrieving structures.
A PDB record includes information similar to that found in the header of GenBank entries (organism
of origin, authors, literature references etc.). Sequence similarity tools such as BLAST can also be used, as
each record in the database includes sequence information.
The entry also contains secondary structure information, such as the location of helices and strands and
of disulphide bonds. The three-dimensional structure information is stored as a series of spatial coordinates for
each atom in the molecule (the position of the atom on the x, y and z axes). A PDB entry, for example the
entry for a structure of E. coli thioredoxin, contains information including:
• What protein is the subject of the entry, and what species it came from
• Who solved the structure, and references to publications describing the structure determination
• Experimental details about the structure determination, including information related to the general
quality of the result, such as the resolution of an X-ray structure determination and stereochemical statistics
• The amino acid sequence
• What additional molecules appear in the structure, including cofactors, inhibitors, and water
molecules
• Assignments of secondary structure: helix, sheet
• Disulphide bridges
• The atomic coordinates
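
The atomic coordinates are stored in fixed-width ATOM lines, so each field is read by column position rather than by splitting on spaces. The sketch below parses two such lines and computes the distance between the atoms. The column positions follow the classic PDB text format; the two sample lines and their coordinates are invented for illustration and do not come from a real entry.

```python
import math


def parse_atom_line(line):
    """Parse one fixed-width ATOM record into a dictionary."""
    return {
        "serial": int(line[6:11]),       # columns 7-11: atom serial number
        "name": line[12:16].strip(),     # columns 13-16: atom name
        "res_name": line[17:20].strip(), # columns 18-20: residue name
        "chain": line[21],               # column 22: chain identifier
        "res_seq": int(line[22:26]),     # columns 23-26: residue number
        "x": float(line[30:38]),         # columns 31-38: x coordinate (Å)
        "y": float(line[38:46]),         # columns 39-46: y coordinate (Å)
        "z": float(line[46:54]),         # columns 47-54: z coordinate (Å)
    }


def parse_atoms(pdb_text):
    """Return a list of parsed atoms, one per ATOM line."""
    return [parse_atom_line(l) for l in pdb_text.splitlines()
            if l.startswith("ATOM")]


def distance(a, b):
    """Euclidean distance between two parsed atoms, in the file's units."""
    return math.dist((a["x"], a["y"], a["z"]), (b["x"], b["y"], b["z"]))


# Hypothetical sample lines with invented coordinates.
SAMPLE_PDB = (
    "ATOM      1  N   SER A   1      21.389  25.406  -4.628  1.00 23.22           N\n"
    "ATOM      2  CA  SER A   1      22.000  24.000  -3.500  1.00 20.00           C\n"
)

atoms = parse_atoms(SAMPLE_PDB)
print(atoms[0]["name"], atoms[1]["name"], round(distance(atoms[0], atoms[1]), 3))
```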

Since raw three-dimensional atomic coordinates are not convenient for examining a structure with the
naked eye, a large number of 3D structure viewers have been designed to display these
coordinates graphically. The most common is the RasMol program. It allows the structure to be drawn using a
range of representations, including spacefill, ball and stick and wireframe views which emphasize secondary
structure. There are other structural databases, such as the SCOP (Structural Classification of Proteins) database,
which classifies proteins according to structural similarity and evolutionary relationships. In SCOP one
can also see the hierarchical classification of proteins into families and superfamilies, with links to relevant
PDB structures.

WHAT IS LITERATURE DATABASE (2 marks)


The term literature database refers to bibliographic databases as a whole. A literature database contains
information on over 12000 books, journal articles, research reports, conference papers, dissertations and
other types of literature relating to all aspects of theory and practice. It also contains abstracts and
publications.

EXPLAIN PUBMED (5 marks)


PubMed is a bibliographic database offering abstracts of scientific articles, integrated with the other information
retrieval tools of the National Center for Biotechnology Information within the National Library of Medicine
(http://www.ncbi.nlm.nih.gov/PubMed/). One very effective feature of PubMed is the option to retrieve
related articles. This is a very quick way to 'get into' the literature of a topic. Combined with the use of a
general search engine for web sites that do not correspond to articles published in journals, fairly
comprehensive information is readily available about most subjects. Almost all scientific journals now place
their tables of contents, and in many cases their entire issues, on web sites. The US National Institutes of Health
have established a centralized Web-based library of scientific articles, called PubMed Central
(http://www.pubmedcentral.nih.gov/). In collaboration with scientific journals, the NCBI is organizing the
electronic distribution of the full texts of published articles.
PubMed is a free search engine for accessing the MEDLINE database of citations, abstracts and some full-text
articles on life sciences and biomedical topics. The United States National Library of Medicine at the
National Institutes of Health maintains PubMed as part of the Entrez information retrieval system. Listing an
article or journal in PubMed is not an endorsement. In addition to MEDLINE, PubMed also offers access to
OLDMEDLINE for pre-1966 citations. This has recently been enhanced, and records from 1951 onwards, including
those from the printed indexes, are now included within the main portion.
PubMed also holds citations to all articles, even those that are out of scope (for example, covering plate
tectonics or astrophysics), from certain MEDLINE journals, primarily the most important general science and
chemistry journals, from which the life sciences articles are indexed for MEDLINE.
MEDLINE currently contains over 18 million references to journal articles in the life sciences, with
citations from over 4300 biomedical journals in 70 countries. Free access to MEDLINE is provided on the
WWW (World Wide Web) through PubMed (http://www.ncbi.nlm.nih.gov/PubMed/), which is developed
by NCBI. While PubMed and MEDLINE both provide bibliographic citations, PubMed also contains
links to online full-text journal articles. PubMed also provides access and links to the integrated
molecular biology databases maintained by NCBI.

EXPLAIN HUMAN GENOME PROJECT (10 marks)


The Human Genome Project (HGP) is a 13-year, extensive, collaborative
effort to map all of the estimated 50,000 to 100,000 human genes and to obtain the sequence of the complete
3 billion (3 × 10^9) nucleotide subunits (bases) of the genome. In the United States, the Human Genome
Project is overseen primarily by the National Center for Human Genome Research (a part of the
National Institutes of Health) and the Department of Energy. Officially, the Human Genome Project began
on October 1, 1990. The director of the project was Francis Collins, after leadership during the first two
years of the project by James D. Watson.
The Human Genome Project, also called the Genome Initiative, is a scientific research effort to analyze the
DNA of humans and of several lower organisms. Associated with the Human Genome Project are parallel
efforts to obtain gene maps and complete sequences of the genomes of a number of other model organisms,
including E. coli, yeast, Drosophila melanogaster, the plant Arabidopsis thaliana, the nematode
Caenorhabditis elegans, and mouse. Sequencing of the complete yeast genome was completed in early 1996
by an international consortium of scientists. The genome is 12,057 kb (excluding repetitive DNA), distributed
among the organism’s 16 chromosomes, and it contains an estimated 6,000 genes.
The yeast genome was the first eukaryotic genome to be sequenced completely. The complete sequences of
two prokaryotic genomes were reported in 1995: the genome of the bacterium Haemophilus influenzae is
1,830 kb and the genome of the bacterium Mycoplasma genitalium is 580 kb. In early 1997, the complete
4,600-kb genome of E. coli was reported. Late in 1996, the complete 1,660-kb genome sequence of the
microbe Methanococcus jannaschii was reported.
This organism belongs to a group of organisms called the Archaea (archaeans), many of which live
in extreme conditions such as hot springs and deep sea vents. Methanococcus jannaschii, for example, was
isolated from a deep sea vent in the Pacific Ocean at a site where the temperature is close to the boiling
point of water and where the pressure is 1245 times greater than at sea level. A surprising 56 percent of the
1,738 genes the organism contains are entirely new to science, adding support to the growing evidence that
the Archaea represent a third kingdom of life.
A major goal of the Human Genome Project is to generate a highly detailed map of the human
genome. The progressive generation of this genetic map has come from studies of polymorphic DNA
markers; that is, loci in the genome where the different “alleles” are detectable differences in DNA length.
One class of DNA markers results from restriction fragment length polymorphisms (RFLPs).
The first human genetic map for the Human Genome Project, published in 1987, was based on such
RFLP markers.
The Human Genome Project is a multinational effort, begun in 1988, whose aim is to produce a
complete physical map of all human chromosomes, as well as the entire human DNA sequence. As part of
the project, genomes of other organisms such as bacteria, yeast, flies, and mice are also being studied. This
is no easy task, since the human genome is so large. So far, many virus genomes have been entirely
sequenced, but their sizes are generally in the 1 kilobasepair to 10 kilobasepair range.
The first free-living organism to be totally sequenced was the bacterium Haemophilus influenzae,
containing an 1800 kilobasepair genome. In 1996 the whole sequence of the yeast genome, a 10 million bp
sequence, was also determined. A report, “The US Human Genome Project: the first five years”, was submitted to
Congress in February 1990. It brought the NRC and OTA evaluations up to date and set out objectives to be
fulfilled by 1995. Sequencing the human genome was postponed until sequencing costs fell to
50 cents per base pair or less.
GENOME ARRANGEMENT
The following classification shows how human DNA is classified for
identification.

The scientific objectives for the five-year plan were the following:
• To set up a full human genetic map with markers at an average distance of 2 to 5 centimorgans from
each other, each marker if possible to be identified with an STS
• To assemble an STS map of all the human chromosomes with an interval between markers of about
100,000 base pairs, and to generate sets of overlapping clones or closely spaced markers ordered without
ambiguity over a continuous distance of 2 × 10^6 bp
• To improve current methods and/or develop new methods for sequencing DNA, to bring sequencing costs
down to a maximum of 50 cents per base pair (a decision on large-scale sequencing would have to be taken
within 4 to 5 years)
• To prepare a genetic map of the mouse genome based on DNA markers
• To begin mapping one or two chromosomes
• To sequence an aggregate of 20 × 10^6 base pairs from a variety of organisms, focusing on genomes or
fragments of 10^6 base pairs in length (the model organisms being E. coli, the yeast S. cerevisiae,
D. melanogaster, C. elegans and the laboratory mouse)
• To constitute a “Joint Informatics Task Force” between the DOE and the NIH for the development of
software and databases to support large-scale sequencing and mapping projects, to create databases
with easy and up-to-date access to chromosome physical maps and sequences, and to develop the
algorithmic and analytical tools likely to be of use in the interpretation of genomic information
• To develop programs for the comprehension and handling of the ethical, legal and social problems of
the human genome project (through contracts, research grants, colloquia, teaching and educational
materials), to identify and define the major problems and to develop political scenarios for handling
them
• To support the training of young scientists (pre-doctoral and post-doctoral) and to examine the needs for
other types of training
• To support innovative and risky technological development to meet the needs of the genome project
