Bioinformatics Primer (An Introductory Handbook For Bioinformatics Practitioners)
Bio-Bio-1 Team
Preface
Contents
I Introduction...
1 Introduction to Bioinformatics
4 Introduction to Proteomics
4.1 Introduction
4.2 Protein
4.2.1 Amino Acids
4.2.2 General properties of Amino acids
4.2.2.1 Structure
4.2.2.2 Zwitter Ion
4.2.2.3 Isomerism
4.2.2.4 Classification of Amino acids
4.3 The Structure of Proteins
4.3.1 Primary Structure
4.3.2 Secondary Structure
10 Genome Mapping
10.1 Genetic Mapping
10.1.1 Landmarks of Genetic Maps
10.1.2 Linkage Analysis
10.2 Physical Mapping
10.3 Restriction Mapping
10.3.1 Historical Background
10.3.2 Restriction Map
10.3.3 Restriction Mapping Process
10.3.4 Uses of Restriction Mapping
11 Sequences Alignment
11.1 DNA & Protein Sequences Comparison and Alignment
11.1.1 Sequence Alignment
11.1.2 Motivation for Sequence Alignment
11.1.3 Similarity and Homology of Sequences
11.1.4 Type of Sequence Alignment
11.1.5 Computational Methods & Models for Sequence Alignment
11.1.5.1 Dot Matrix
11.1.5.2 Dynamic Programming
11.1.6 Importance of Sequence Alignment
11.1.7 Sequence Alignment Tools
11.2 Multiple Sequence Alignment
11.2.1 Methods for Multiple Sequence Alignment
11.2.1.1 Dynamic Programming based Models
11.2.1.2 Statistical Methods and Probabilistic Models
11.2.2 Usage of Multiple Sequence Alignment
11.2.3 Tools for Multiple Sequence Alignment
11.3 Regulatory Motif Finding
11.3.1 Gene-Regulation & Regulatory Motif
11.3.2 Motif Discovery Methods
11.3.3 Tools for Motif Finding
5.1 Bacteria
Part I
Introduction...
Introduction to
Bioinformatics
—Fokhruzzaman
Four technologies are poised to cause major disruptions to our current realities:
Information Technology (IT), Biotechnology (Bio), Nanotechnology (Nano), and
Neurotechnology (Neuro). All four have already shown tremendous potential to
influence our future. Although IT now cuts across all scientific disciplines,
its impact on Biotechnology has been so large that a new discipline,
Bioinformatics, has emerged. As the NCBI defines it, Bioinformatics is the
field of science in which Biology, Computer Science, and Information Technology
merge into a single discipline. The ultimate goal of the field is to enable
the discovery of new biological insights as well as to create a global
perspective from which unifying principles in Biology can be discerned. In this
sense, the field of Artificial Intelligence has essentially become a part of
Bioinformatics. The three major objectives of Bioinformatics are: (1) to
analyze the enormous amount of biological data, (2) to develop smarter tools to
handle the increasing complexity, and (3) to interpret the results of both
wet-lab and in-silico experiments. Some of the major Bioinformatics
applications are: (a) mapping information about different biomolecules,
(b) comparing DNA / RNA / protein sequences, (c) predicting 3-D structures of
gene products / proteins, (d) predicting functions of gene products / proteins,
and (e) designing primers.
Our future ability to make new biological discoveries will depend strongly
on our ability to combine and correlate diverse data sets along multiple
dimensions and scales, rather than on continued effort focused on traditional
areas. Sequence data will have to be integrated with structure and function
data, with gene expression data, with pathway data, with phenotypic and
clinical data, and so forth. Basic research within bioinformatics will have to
deal with these issues of systems and integrative biology in a situation where
the amount of data is growing exponentially. The large amounts of data create a critical need
1. Molecules of Life
Biochemical molecules such as deoxyribonucleic acid (DNA), ribonucleic acid
(RNA), proteins, carbohydrates, and lipids are fundamental to cellular
organization, and their complex interplay dictates various aspects of living
things. They enable the systematic execution of numerous biological processes
in a defined manner to maintain life at the cellular level (Kitano, 2002;
Noble, 2002). The genetic materials (DNA and RNA) are tightly regulated in
organisms. At any given moment, organisms have to deal with different
pressures (internal or external) by controlling various biochemical molecules,
thus maintaining a balance or, in other terms, homeostasis.
Any mutation in DNA, if not corrected by the cell's DNA repair machinery
(including the various repair polymerases), may result in the transcription of
faulty RNA and, in turn, the translation of an altered protein that lacks its
original activity. This may cause major problems such as protein misfolding and
aggregation; misfolded proteins that cannot be degraded can result in fatal
diseases. Naturally occurring single nucleotide polymorphisms (SNPs) in the
human population may influence gene function and expression in individuals.
Functional variants, that is, genetic changes such as SNPs that alter amino
acids in proteins, gene expression, or gene splicing, are of great interest.
The first step of regulation is to fix problems at the DNA level. The next
step is to mend them at the RNA level through gene splicing, and then at the
protein level via the proteasome / ubiquitin pathway. In eukaryotes
(higher-level organisms), DNA is transcribed to RNA (pre-mRNA) that consists of
introns and exons. The exons carry the code that will be translated into
protein, whereas the introns are eventually cut out through gene splicing. The
resulting RNA is referred to as messenger RNA (mRNA). This messenger RNA may or
may not get translated into a peptide that folds into a functional protein.
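The splicing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `splice`, the toy pre-mRNA, and the exon coordinates are all hypothetical, and real splicing depends on splice-site signals that are not modeled here.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of a pre-mRNA, dropping the introns between them.

    `exons` is a list of (start, end) pairs, 0-based with an exclusive end,
    given in 5'->3' order. The coordinates are assumed to be known already.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy pre-mRNA: exon 1, an intron (lower case only for readability), exon 2.
pre_mrna = "AUGGCC" + "guaaguuuucag" + "UUUUAA"
exons = [(0, 6), (18, 24)]  # hypothetical exon boundaries

mrna = splice(pre_mrna, exons)
print(mrna)  # AUGGCCUUUUAA
```

The exons are concatenated in their original order; everything outside the listed intervals is treated as intron and discarded.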
DNA replicates semi-conservatively (Meselson and Stahl, 1958). The double
helix opens up (in a fork-like fashion) and each strand serves as a parental
template for replication. DNA polymerase synthesizes the new strand in the
5' to 3' direction. Each daughter strand ends up being the complement of a
parental strand. Consequently, each replicated DNA molecule has one parental
strand and one daughter strand, hence the term semi-conservative. The genetic
make-up of an individual is termed its genotype. Most DNA sequence is conserved
among individuals, but genetic variation in about 0.1% of the DNA influences
disease risk, metabolic activity, and drug response. It is important to map the
occurrence of variation in the human genome, which can help to identify allelic
polymorphisms that result in disease. Computational techniques that can rapidly
compare entire genomes and genes will help to identify polymorphisms within a
population. Comparative genomics is a field in which DNA sequences across
several genomes are compared to understand evolutionary aspects of biological
processes.
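As a minimal sketch of what "comparing sequences to find polymorphisms" can mean in code, the snippet below lists the positions at which two sequences differ. It assumes the sequences are already aligned and of equal length (real genome comparison needs an alignment step first); the function name `point_differences` and the two short sequences are made up for illustration.

```python
def point_differences(reference, sample):
    """Return (position, reference_base, sample_base) for every mismatch.

    Assumes both sequences are already aligned and equal in length.
    Positions are 1-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample), start=1)
        if r != s
    ]


reference = "ATGGCCTTTAAGC"  # hypothetical reference sequence
sample = "ATGGCTTTTAAGC"     # hypothetical individual with one substitution

print(point_differences(reference, sample))  # [(6, 'C', 'T')]
```

Each reported tuple is a candidate single-nucleotide difference; whether it is a true polymorphism would still have to be established across a population.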
RNA consists of the nitrogenous bases adenine (A), uracil (U), cytosine (C),
and guanine (G) and can fold into a complex tertiary structure with hair-pin
bends that have unpaired bases. Recurring RNA structural motifs have been
observed and attributed to biological function. Some of the conformationally
recurring motifs include GNRA-like tetraloop, S1, S2, kink turns. Comparative
Algorithm to Discover Recurring Elements of Structure (COMPADRES) is an
automated approach to identify such recurrent motifs (Wadley and Pyle, 2004).
Some of these motifs may contact residues in proteins that are essential for bi-
ological function, for example, a pi-turn motif is found on RNA that interacts
with ribosomal protein L2.
bets. Individual properties and standard residue codes for each of these amino
acids can be obtained from the following website:
http://www.imb-jena.de/IMAGE_AA.html#Properties. Studies have shown that amino
acids can be exchanged with each other without significantly changing the
structure (Azarya-Sprinzak et al., 1997; Benner et al., 1994; Gonnet et al.,
1992; Johnson and Overington, 1993; Jones et al., 1992; Naor et al., 1996).
Such exchanges are possible because amino acids share similar physicochemical
properties, and changes within similar groups are tolerated (Taylor, 1986). The
degree of substitution at a particular residue position depends on the
functional role and the environmental location of the residue in the folded
form of the protein (Azarya-Sprinzak et al., 1997). Because of this, a number
of slightly different sequences may adopt similar structure (divergent
evolution) and function. If the sequence, structure, and function of a set of
related proteins are already known, then inference rules can be derived. These
rules can be applied to classify a new sequence for which no structure or
function is known. Such inference rules can be a set of conserved residues,
such as sequence motifs or structural motifs, that are present in all members
of a related set of proteins (Falquet et al., 2002; Guruprasad and Shivaprasad,
2000; Hofmann et al., 1999; Hutchinson and Thornton, 1996).
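The idea that substitutions within a physicochemical group tend to be tolerated can be illustrated with a small lookup. The grouping below is a deliberately simplified, illustrative one chosen for this sketch; it is not the classification of Taylor (1986) or of any other cited work, and the names `GROUPS`, `group_of`, and `is_conservative` are hypothetical.

```python
# Simplified, illustrative grouping of the 20 standard amino acids
# (one-letter codes). Other groupings are possible and equally valid.
GROUPS = {
    "nonpolar": set("GAVLIMPFW"),
    "polar": set("STCYNQ"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def group_of(residue):
    """Return the name of the group that contains the given residue."""
    for name, members in GROUPS.items():
        if residue in members:
            return name
    raise ValueError(f"unknown residue: {residue!r}")


def is_conservative(a, b):
    """True if both residues fall in the same group of this toy scheme."""
    return group_of(a) == group_of(b)


print(is_conservative("L", "I"))  # True: both nonpolar
print(is_conservative("D", "K"))  # False: negative vs. positive
```

In practice, substitution matrices derived from observed exchanges are used rather than a hard yes/no grouping, but the lookup conveys the underlying intuition.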
Making effective use of the sequence information coming out of genome projects
will require assigning structure and function. Protein sequences that have
evolved from a common parent share similar structure and function. If the
parent protein structure is known, then one can apply comparative modeling
techniques to obtain the geometric information for the unknown protein. Hence,
relating protein sequences to their structural parent or to a known fold using
computational techniques will be critical for handling biological information
effectively.
One key aspect of protein binding to the cell surface is specificity. Since the
particular molecules binding to the cell surface are intended to elicit specific
responses, they must be very selective in the pathways that they initiate. At
the same time, it is equally important to consider interactions between
different cellular pathways, as the cell must respond collectively to a variety
of stimuli at any one time. Protein signaling pathways are extremely broad and
encompass many different signal transduction pathways. For example, the Notch
signaling pathway is critical in developmental processes coordinated by signal
transducer proteins and transcriptional activators, leading to changes at the
level of gene transcription. The Notch pathway is activated upon contact with
neighboring cells expressing Notch ligands. In humans, the ligands capable of
activating Notch belong to the Delta-like and Jagged (Serrate-like) families.
These ligands are membrane bound; therefore, close cell proximity is required
for activation of the Notch pathway. In some ways, Notch signaling can be
considered a "classical" pathway: the binding of a Notch ligand to the Notch
receptor ultimately results in the translocation of the Notch intracellular
domain to the nucleus, where it affects gene transcription. Notch ligands are
single-pass transmembrane proteins that contain multiple epidermal growth
factor (EGF)-like repeats in the extracellular domain. There are several such
signaling pathways that are responsible for widely observed biological
processes.
The first organisms on earth were unicellular, such as bacteria and protozoa.
So the question that arises is what led to the evolution of multicellular
organisms. With our current knowledge of biology, we can explain the origin and
importance of cell-cell interactions. Cell-cell interactions are crucial and
are part of every aspect of the cell in eukaryotes. These interactions were
responsible for the evolution of multicellular organisms. Cell-cell interaction
means communication between cells for division, differentiation, reproduction,
migration, apoptosis, contact inhibition, etc. There are over 200 types of
cells in the human body, broadly classified on the basis of the tissue they are
present in, namely epithelial, connective, nervous, and muscle tissue.
Cooperation among cellular processes is required for the induction of an antibody response
The mechanism in both of these processes involves cell growth and differen-
tiation as well as cell-matrix interactions. Several proteins control the timing
of the events in the cell cycle, which is tightly regulated to ensure that cells di-
vide only when necessary. The loss of this regulation is the hallmark of cancer,
which is also due to loss of control in contact inhibition. Major control switches
of the cell cycle are cyclin-dependent kinases. Each cyclin-dependent kinase
forms a complex with a particular cyclin, a protein that binds and activates
the cyclin-dependent kinase. The kinase part of the complex is an enzyme that
adds a phosphate to various proteins required for progression of a cell through
the cycle. These added phosphates alter the structure of the protein and can
activate or inactivate the protein, depending on its function. There are specific
cyclin-dependent kinase/cyclin complexes at the entry points into the G1, S,
and M phases of the cell cycle, as well as additional factors that help prepare
the cell to enter S phase and M phase. Normal mammalian cells show contact
inhibition; that is, they respond to contact with other cells by ceasing cell divi-
sion. Therefore, cells can divide to fill in a gap, but they stop dividing as soon
as there are enough cells to fill the gap. This characteristic is lost in cancer
cells, which continue to grow after they touch other cells, causing a large mass
of cells to form.
and Future
V Bioinformatics Tools
25 Python - Primer Programming Language for Bioinformatics
26 Python And Bioinformatics
Thought of Mind
Chapter Layout-
• Definitions / Background / History
• Preliminaries needed for getting more use of the following text in the
Bioinformatics Primer
—Farjana Khatun
Introduction to Cell Biology.¹
2.1 Cell
Cell: Cell is the Building Block of an organism
The cell theory, developed by Matthias Jakob Schleiden and Theodor Schwann in
1839, states that all organisms are composed of one or more cells. All cells
come from preexisting cells. Vital functions of an organism occur within cells,
and all cells contain the hereditary information necessary for regulating cell
functions and for transmitting information to the next generation of cells.
The cell is the building block of all living organisms and the functional unit
of life. There are a great many different types of cells. Some organisms, such
as amoebae and bacteria, consist of a single cell. The human body consists of
different types of cells: brain cells, skin cells, liver cells, stomach cells,
etc. All these cells share common features but perform different functions.
According to structure, there are two types of cells: eukaryotic cells
(examples: fungi, mammals, birds, fish, invertebrates, mushrooms, plants, etc.)
and prokaryotic cells (examples: bacteria, cyanobacteria, etc.). Among
prokaryotes, the most widely studied organism is E. coli. Eukaryotic cells have
a well-organized, membrane-bound nucleus, whereas prokaryotic cells lack a
defined nucleus. On the basis of function, there are two types of cells:
somatic cells (forming the body of the organism) and germ cells (which give
rise to sperm and eggs, i.e., are involved in reproduction).

1 The Chapter Preamble will be written later
There are different types of organelles in the cell. Among them, the most
important organelles are -
Vesicle- secretes hormones, neurotransmitters, etc. that are packaged in the
Golgi apparatus.
Vacuole- most commonly found in plant cells. Its function is to store
nutrients, waste products, etc.
Cytoplasm- the jelly-like material that holds all the organelles of the cell.
Lysosome- the cellular digestive system, found in animal cells but rare in
plant cells. It contains digestive enzymes that break down worn-out organelles
and engulfed viruses and bacteria.
All these organelles are surrounded by the cell membrane, which is composed of
phospholipids and proteins and is semi-permeable in nature. Mitochondria and
chloroplasts (the latter found in plants, where they produce food through the
process of photosynthesis) have their own circular genomes, which are separate
and distinct from the nuclear genome of the cell.
Tissue: Tissues are collections of cells. Similar types of cells work together
to perform a specific function.
Organism: Organisms get their structural form from the combination of
different organ systems.
The above describes the hierarchy of an organism's basic structure. But how
does an organism originate from a single cell? The mechanism can be clearly
understood from the development of a human being from a single zygote cell.
................................
• Mitosis (M phase): During this phase, the cell splits itself into two
distinct cells (daughter cells). Mitosis process can be described by five
consecutive steps known as
– Prophase
– Prometaphase
– Metaphase
– Anaphase
– Telophase
There is also a resting phase (G0 phase), in which the cell leaves the cell
cycle and stops dividing. The cell-division cycle is a vital process by which a
single-celled fertilized egg develops into a mature organism, as well as the
process by which hair, skin, blood cells, and some internal organs are renewed.
(There are two types of cell division in eukaryotic cells: mitosis and meiosis.
Mitosis is the process by which a cell divides into two new cells. The genetic
material in the new cells is identical to that of the mother cell. That is why
mitosis is also called equational division. In meiosis, on the other hand, four
new cells are formed from one cell, and each carries 50% of the genetic
material of the mother cell. That is why meiosis is known as reductional
division.)
2.2 Chromosome
The chromosome is the carrier of genetic information from one generation to
the next. Chromosomes are thread-like (string-like) structures made of DNA and
protein, present in the nucleus of a cell. The shape and number of chromosomes
vary widely among organisms. For example, (i) eukaryotic cells have large
linear chromosomes whereas prokaryotic cells contain small circular
chromosomes, and (ii) humans have 46 chromosomes whereas apes and fruit flies
have 48 and 8 chromosomes respectively. A human being has 23 pairs of
chromosomes. Each pair is inherited from the parents, one chromosome from the
mother and the other from the father. One of the 23 pairs is responsible for
determining sex: XY for male and XX for female.
Chromosomes are visible under a light microscope when stained with certain
dyes that reveal a pattern of light and dark bands (karyotype analysis).
Chromosomes can be distinguished from each other on the basis of differences in
size and banding pattern. Major chromosomal abnormalities such as missing or
extra copies, gross breaks and rejoining, etc. can be detected by karyotype
analysis. Karyotype analysis reveals that
Down Syndrome is due to the presence of an extra (third) copy of
chromosome 21.
Turner Syndrome is due to the loss of one of the two sex chromosomes.
Recombination and Manipulation of Chromosome, therefore, has a pivotal
role in genetic diversity.
2.5 Nucleotide
Nucleotides are the building blocks (monomers) of DNA and RNA. Three
components are essential to form a nucleotide. These are shown below through a
flow diagram.
Nitrogen Base: There are five types of nitrogen bases, named adenine (A),
guanine (G), cytosine (C), thymine (T) and uracil (U). A, T, C, G are
found in DNA and A, U, C, G are found in RNA.
A=T C≡G
of ladder. Now we have to twist the ladder to imagine the 3-D structure
of DNA.
3’...GATGCTAGGCA...5’
5’...CTACGATCCGT...3’
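The base-pairing rule (A with T, C with G) can be checked in code for the two strands shown above. The helper names `complement` and `reverse_complement` are introduced here only for illustration; the sketch assumes upper-case DNA containing only A, C, G and T.

```python
# Translation table implementing the pairing rule A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Base-by-base complement, keeping the left-to-right order."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand):
    """Complement of the strand, reported in the opposite direction."""
    return complement(strand)[::-1]


top = "GATGCTAGGCA"     # written 3'->5', as in the text
bottom = "CTACGATCCGT"  # written 5'->3'

print(complement(top) == bottom)   # True: every position pairs correctly
print(reverse_complement(bottom))  # ACGGATCGTAG (the top strand read 5'->3')
```

Because the two strands run in opposite directions, the partner of a strand read 5' to 3' is its reverse complement, which is why both helpers are shown.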
More to Come!!!
• How a Cell Becomes a Human?
• Central Dogma of Life or Biology
• Protein Synthesis Process
• Cell Division
Chapter 3
Introduction to Genetics
and Genomics
—Farjana Khatun
The idea that chromosomes are the carriers of inheritance was expressed in
1883 by Wilhelm Roux.
The modern concept of the gene originated with a nineteenth-century
Augustinian monk, Gregor Mendel, who systematically studied heredity in pea
plants (Pisum sativum) and hypothesized a factor that conveys traits from
parent to offspring. He spent over 10 years of his life on one experiment.
Although he did not use the term gene, he explained his results in terms of
inherited characteristics. Mendel was also the first to hypothesize
• Independent assortment
Although Mendel’s work was largely unrecognized after its first publication
in 1866, it was rediscovered in 1900 by three European scientists, Hugo de Vries,
Carl Correns, and Erich von Tschermak, who had reached similar conclusions
from their own research.
Danish botanist Wilhelm Johannsen coined the word "gene" in 1909 to describe
these fundamental physical and functional units of heredity, while the related
word genetics was first used by William Bateson in 1905. The word gene was
derived from Hugo de Vries's 1889 term pangen for the same concept, itself a
derivative of the word pangenesis coined by Darwin (1868). The word pangenesis
is made from the Greek words pan (a prefix meaning "whole", "encompassing")
and genesis ("birth") or genos ("origin").
• In 1941, George Wells Beadle and Edward Lawrie Tatum showed that
mutations in genes caused errors in specific steps in metabolic pathways.
This showed that specific genes code for specific proteins, leading to the
"one gene, one enzyme" hypothesis.
Richard J. Roberts and Phillip Sharp discovered in 1977 that genes can be
split into segments. This led to the idea that one gene can make several
proteins. More recent results (as of 2003-2006) have made the notion of a gene
appear more slippery. In particular, genes do not seem to sit side by side on
DNA like discrete beads. Instead, regions of the DNA producing distinct
proteins may overlap, so that the idea emerges that "genes are one long
continuum".
Replication: DNA can create its own copy through the process of replication
or copying mechanism.
DRAW IT
DRAW IT
DRAW IT
The first step in accessing the human genome was the preparation of a map of
the individual chromosomes by karyotyping. Each chromosome has a characteristic
banding pattern when stained with special dyes. The patterns are useful as
reference points for the preparation of more detailed genetic maps. Abnormal
patterns are characteristic of some genetic disorders and several cancers.
Defining a gene is not straightforward. For example, small genes can easily
be overlooked in a nucleotide sequence, a gene may code for more than one
The completion of the human genome sequence has stimulated new approaches for
diagnosing diseases and predicting disease susceptibility. In 2006, the Genes
and Environment Initiative (GEI) was launched as a joint collaboration of the
National Institute of Environmental Health Sciences (NIEHS) and the National
Human Genome Research Institute (NHGRI) to understand the link between genes
and environment and why certain individuals develop diseases. The initiative
involves genetic studies of individuals with specific diseases and of their
personal exposure to environmental factors such as sun and chemicals, diet, and
physical activity.
The Cancer Genome Atlas (TCGA) was launched in 2006 by the NHGRI and the
National Cancer Institute with the immediate goal of compiling an atlas of
genomic changes (mutations) in three tumor types: brain cancer (glioblastoma),
lung cancer, and ovarian cancer.
3.6 Genome
The total complement of genes in an organism or cell is known as its genome.
Genes that appear together on one chromosome of one species may appear on
separate chromosomes in another species. The study of genome is known as
Genomics.
Cells or organisms with only one copy of each chromosome are called hap-
loid; those with two copies are called diploid; and those with more than two
copies are called polyploid.
• Paralogs
• Orthologs
• Xenologs
• Genetic Code
Our genotype is derived from the genotypes of our parents. Yet we are neither
an exact copy of either parent nor an easily identifiable mixture of their
characteristics. The instructions present in your genotype ultimately determine
the anatomical (related to shape and size) and physiological (related to the
functions of different parts of the body) characteristics that make you a
unique individual. Those anatomical and physiological characteristics
constitute your phenotype, i.e., appearance (e.g., hair and eye color, skin
tone, foot size, etc.) and behavior.
DNA fragment analysis can also be used to detect disease-causing genetic
aberrations such as microsatellite instability (MSI), trisomy or aneuploidy,
and loss of heterozygosity (LOH). MSI and LOH in particular have been
associated with cancer cell genotypes for colon, breast, and cervical cancer.
The most common chromosomal aneuploidy is a trisomy of chromosome 21, which
manifests itself as Down syndrome. Current technological limitations typically
allow only a fraction of an individual's genotype to be determined efficiently.
Twenty two of those pairs are called autosomal chromosomes. Most of the
genes of autosomal chromosomes affect the somatic characteristics such as the
hair color, skin pigmentation etc. The chromosomes of the 23rd pair are called
sex chromosomes; one of their functions is to determine whether the individual
is male or female.
Locus: The two chromosomes in a homologous autosomal pair have the same
structure and carry genes that affect the same traits. Suppose that one member
of the pair contains three genes in a row, the first determining hair color,
the second eye color, and the third skin pigmentation. The other chromosome
carries genes that affect the same traits, and the genes are in the same order
and located at equivalent positions on their respective chromosomes. The
position of a gene on a chromosome is called its locus.
Allele: The two chromosomes in a pair may not carry the same form of each
gene. The various forms of a given gene are called alleles. These alternate forms
of gene determine the precise effect of the gene on phenotype.
Co-dominance:
Chapter 4
Introduction to Proteomics
—Farjana Khatun
4.1 Introduction
The field of Proteomics is much broader than genomics. Primarily, the term
"Proteomics" means analysis of the protein profile of tissues. Proteome refers
to all proteins present in a species. The genome is a constant feature of an
organism, whereas the proteome varies with the nature of the tissue, the state
of development, disease, or the effect of drugs. So it varies with time. The
primary structure of proteins (sequence of amino acids) is determined by
genomic data (sequence of nucleotides) through the processes of transcription
(synthesis of mRNA from DNA) and translation (protein synthesis from mRNA)
respectively. The primary structure of the protein then gives rise to its
secondary and finally its 3-D structure through folding and post-translational
processing. Post-translational processing has a pivotal role in determining the
destination and function of all synthesized proteins. Determining the primary
structure of a protein is relatively easy. But how it acquires its 3-D
structure, and what its exact 3-D structure is, remain a headache for
researchers. Because it reveals ............
4.2 Protein
The genetic code relates each sequence of three bases (nucleotides) in a DNA
sequence to an amino acid, and so carries the information for the linear
sequence of amino acids (known as the primary structure of a protein). The
primary structure of a protein determines how it can fold and how it interacts
with other molecules in the cell to perform its function. The primary
structures of all the diverse proteins are built from 20 amino acids arranged
in a linear sequence determined by the genetic code.
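Reading a coding sequence three bases at a time can be sketched as follows. The table is the standard genetic code written in the common T, C, A, G ordering; the names `CODON_TABLE` and `translate` and the example DNA string are assumptions made for this illustration, and the sketch assumes upper-case DNA read in frame from its first base.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence codon by codon until a stop codon."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing incomplete codon
    for i in range(0, usable, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCGAAGAATCTTAA"))  # MAEES
```

The 64 codons map onto only 20 amino acids plus stop signals, so several codons encode the same amino acid.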
Each of the amino acids used for protein synthesis has the same general
structure. A carboxylic acid group, an amino group, a hydrogen atom, and a
chemical group called a side chain, which differs for each amino acid, are
attached to the α-carbon.
In proteins, these amino acids are joined into a linear polymer called a
polypeptide chain through peptide bonds between the carboxyl group of one amino
acid and the amino group of the next amino acid.
4.2.2.3 Isomerism
Of the standard α-amino acids, all but glycine can exist as either of two
optical isomers, called L- and D-amino acids, which are mirror images of each
other. While L-amino acids represent all of the amino acids incorporated into
proteins during translation in the ribosome, D-amino acids are found in some
proteins, where they are produced by enzymatic post-translational modification
after translation and translocation to the endoplasmic reticulum, as in exotic
sea-dwelling organisms such as cone snails. They are also abundant components
of the peptidoglycan cell walls of bacteria, and D-serine may act as a
neurotransmitter in the brain.
Figure: amino acid structure (NH3+, C, COO−)
Amino acids can be classified into two categories on the basis of polarity.
The graph above shows the location of the 20 amino acids in different regions
of a protein's tertiary structure. The vertical axis shows the fraction of
amino acid residues that are highly buried within the protein core
(inaccessible to water), while the horizontal axis shows the amino acid names
in one-letter code. Only a very small fraction of charged residues is buried,
while for the non-polar amino acids the fraction is very high.
The propensity of amino acid residues to be (or not to be) in contact with polar
solvent largely controls the distribution of each of the 20 amino acids within the
volume of a protein structure. Thus, most protein molecules have a hydrophobic
The primary structure is the linear order of amino acid residues along the
polypeptide chain. It arises from covalent linkage of individual amino acids
via peptide bonds.
Ala-Glu-Glu-Ser-Ser-Lys-Ala-Val-Lys-Tyr-Tyr-Thr-...
A-E-E-S-S-K-A-V-K-Y-Y-T-...
Figure 4.6: Single- and three-letter codes for amino acids of a primary sequence
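The correspondence between the two notations can be sketched in a few lines of
Python. The table below holds the standard three-letter and one-letter codes of
the 20 amino acids; the function name to_one_letter is only illustrative.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter_sequence):
    """Convert a hyphen-separated three-letter sequence to one-letter code."""
    residues = three_letter_sequence.split("-")
    return "".join(THREE_TO_ONE[residue] for residue in residues)


print(to_one_letter("Ala-Glu-Glu-Ser-Ser-Lys-Ala-Val-Lys-Tyr-Tyr-Thr"))
```

The input here is the fragment shown above, without the trailing dots.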
Figure 4.7: The primary sequences of human and sperm whale myoglobin
• Positive
• Aromatic
• Aliphatic
• Hydrophobic
• etc...
—Farjana Khatun
44 5. Some Bioinformatics Model Organisms
Bacteria were also involved in the second great evolutionary divergence, that
of the archaea and eukaryotes. Here, eukaryotes resulted from ancient bacte-
ria entering into endosymbiotic associations with the ancestors of eukaryotic
cells, which were themselves possibly related to the Archaea. This involved
the engulfment by proto-eukaryotic cells of alpha-proteobacterial symbionts to
form either mitochondria or hydrogenosomes, which are still found in all known
Eukarya (sometimes in highly reduced form, e.g. in ancient "amitochondrial"
protozoa). Later on, some eukaryotes that already contained mitochondria also
engulfed cyanobacteria-like organisms. This led to the formation of
chloroplasts in algae and plants. There are also some algae that originated
from even later endosymbiotic events. Here, eukaryotes engulfed a eukaryotic
alga that developed into a "second-generation" plastid. This is known as
secondary endosymbiosis.
5.2 Virus
The discovery of the tobacco mosaic virus by Martinus Beijerinck in 1898 began
the journey of virology. A virus is an acellular infectious agent; viruses
infect all types of organisms, including archaea, bacteria, plants and animals.
Viruses display a wide diversity of shapes and sizes. Generally viruses are much
smaller than bacteria. Most viruses that have been studied have a diameter
between 10 and 300 nanometres.
Viruses do not have their own metabolism and do not grow through cell division;
rather, they use the machinery and metabolism of a host cell to produce
multiple copies of themselves, which assemble inside the cell. It is thought
that viruses played a central role in early evolution, before the
diversification of bacteria, archaea and eukaryotes, at the time of the last
universal common ancestor of life on Earth. Viruses are still one of the
largest reservoirs of unexplored genetic diversity on Earth.
Each type of virus can infect only a limited range of hosts, and many are
species-specific. Some, such as smallpox virus, can infect only one species
(humans) and are said to have a narrow host range. Other viruses, such as
rabies virus, can infect different species of mammals and are said to have a
broad host range.
Viruses show greater genomic diversity than plants, animals, archaea and
bacteria. The genomic diversity of viruses is outlined below.
• Segmented
• Double strand with regions of single strand (e.g. Hepadnaviridae)
Bacteriophages are viruses that infect bacteria. Their genes can contribute to
the expression of host phenotypes. Bacteria protect themselves from
bacteriophages by producing enzymes, restriction endonucleases, that destroy
the DNA of bacteriophages by cleaving it.
5.3 Bacteria
The microbiologist Antonie van Leeuwenhoek first observed bacteria in 1676
using a single-lens microscope of his own design. He called them
"animalcules." In 1838, Christian Gottfried Ehrenberg introduced the name
bacterium. Bacteria are microscopic, single-celled prokaryotes surrounded by a
cell membrane made of lipid. They have neither a membrane-bound nucleus nor
organelles such as mitochondria, chloroplasts, Golgi apparatus or endoplasmic
reticulum. Bacterial cells may contain micro-compartments, such as the
carboxysome, that are enclosed by protein shells and help in metabolism.
Bacterial cell walls are made of peptidoglycan (protein and carbohydrate). On
the basis of cell wall structure there are two types of bacteria,
Gram-positive and Gram-negative. These names originate from the reaction of
the cells to the Gram stain (crystal violet, safranin).
• Gram-positive bacteria possess a thick cell wall made of many layers of
peptidoglycan and teichoic acids, and can retain the Gram stain even after
washing with alcohol or acetone.
• Gram-negative bacteria have a relatively thin cell wall consisting of a few
layers of peptidoglycan surrounded by a lipid membrane (lipopolysaccharides
and lipoproteins), and are unable to retain the stain after washing with
alcohol or acetone.
exist in bacteria which is rare in eukaryotes. Bacteria may also contain plasmids
which are small extra-chromosomal DNAs that may contain genes for antibiotic
resistance.
Some bacteria also transfer genetic material between cells. This can occur in
three main ways.
These types of gene acquisition are known as horizontal gene transfer and are
common in nature. Because of gene transfer, it is difficult to determine the
original sequences of bacteria. For example, to identify the minimal gene set
of Mycoplasma genitalium, scientists at the J. Craig Venter Institute
systematically disrupted its genes (mutating them by insertion) one by one to
observe which are essential to life and which are dispensable. They concluded
that only 381 of its 485 protein-encoding genes are essential to life.
5.5 Archaea
Another group of prokaryotes, archaea, meet these criteria but differ from
bacteria in their evolutionary history. A major step forward in the study of
bacteria was the recognition in 1977 by Carl Woese that archaea have a separate
line of evolutionary descent from bacteria. This new phylogenetic taxonomy was
based on the sequencing of 16S ribosomal RNA, and divided prokaryotes into
two evolutionary domains, Bacteria and Archaea, as part of the three-domain
system (Bacteria, Archaea and Eukarya).
5.6 Fungi
Fungi are eukaryotic organisms that include microorganisms such as yeasts and
molds, as well as mushrooms. They are classified as a kingdom separate from
plants, animals and bacteria. The study of fungi is called mycology; it is
often regarded as a branch of botany, but genetic studies have shown that fungi
are more closely related to animals than to plants.
Advances in molecular genetics have opened the way for DNA analysis to be in-
corporated into taxonomy, which has sometimes challenged the historical group-
ings based on morphology and other traits. Phylogenetic studies published in
the last decade have helped reshape the classification of Kingdom Fungi, which
is divided into one subkingdom, seven phyla, and ten subphyla.
• Using the mold Neurospora crassa, Beadle and Tatum tested their biochemical
theories and formulated the one gene-one enzyme hypothesis.
epithelial tissue are the outer layer of the skin, the inside of the mouth
and stomach, and the tissue surrounding the body’s organs.
4. Nerve Tissue - Nerve tissue contains two types of cells: neurons and
glial cells. Nerve tissue has the ability to generate and conduct electrical
signals in the body. These electrical messages are managed by nerve tissue
in the brain and transmitted down the spinal cord to the body.
Organ systems are composed of two or more different organs. There are 10
major organ systems in the human body:
• Nervous System: The main role of the nervous system is to relay electrical
signals through the body. The nervous system directs behaviour and movement
and, along with the endocrine system, controls physiological processes such
as digestion, circulation, etc. The major organs are the brain, spinal cord
and peripheral nerves.
Computing Fundamentals
for Bioinformatics
For the above problem, the pseudocode is the following cryptic lines. (If you
are from a biology background, don't worry. We will go through each line of it.)
Before explaining the next instructions, I would like to describe how a
computer treats a DNA sequence. Consider a DNA sequence like ACTCACGTAG as a
sample input.
To repeat the task of lines 3, 4, 5, 6 and 7 we use for in line 2. Line 2
works as follows:
Just after for you will find index, a variable whose initial value is zero. The
whole line means that the instructions inside will be repeated until the index
value passes DNA_Sequence.Length − 1. For our sample DNA sequence ACTCACGTAG,
the loop stops once index has passed 9 (= 10 − 1); that means when
index is 10 the instructions inside are no longer executed. But how does the
index value increase? See line 6, index ← index + 1. With each repetition it
increases by one.
Lines 3 and 4 are straightforward. They mean that if the present nucleotide
equals A (Adenine), noOfAdenine is increased by one. Be careful: line 6 is
not under the condition of line 3.
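The loop described above can be sketched in Python as follows. This is not the
book's pseudocode, and its line numbers do not match the line numbers used in
the walkthrough; the comments only point back to the step each statement
corresponds to. The names count_adenine, dna_sequence and no_of_adenine are
illustrative.

```python
def count_adenine(dna_sequence):
    """Count how many times the nucleotide A occurs in a DNA sequence."""
    no_of_adenine = 0
    index = 0  # the first nucleotide sits at index 0
    # Repeat while index has not passed Length - 1 (the "for" of line 2).
    while index <= len(dna_sequence) - 1:
        # Compare the present nucleotide with A (lines 3 and 4).
        if dna_sequence[index] == "A":
            no_of_adenine += 1
        # The index grows on every repetition, whether or not an A was found
        # (line 6 is outside the condition).
        index += 1
    # The loop has ended, so return the count (line 8).
    return no_of_adenine


print(count_adenine("ACTCACGTAG"))
```

For the sample input the A's sit at index 0, 4 and 8, so the loop body runs ten
times and the counter is increased three times.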
Simulation
Step 9: The program again goes to line 2 and finds that the index value is 2,
so for repeats.
In this way, the program executes until the index value has passed 9. Let us
look at what happens in some of the last steps.
Suppose we are now just before line 6 and the index value is 8. So, after line
6, the index value will be 9. The program goes to line 2 and finds that the
index value is 9, which is less than 10, so it repeats. In line 3, the
condition is false because DNA_Sequence[9] is G, so line 4 is skipped. In line
6 the index value becomes 10; the program goes to line 2 again and finds that
index has reached 10, so for does not repeat and the program moves to line 8,
where it returns noOfAdenine.
If you have understood everything up to this point, you have completed a great
journey towards algorithms.
62 7. Math Primer for Bioinformatics
Chapter 8
Biological Processes,
Experimental Methods &
Machinery
—Farjana Khatun
Part II
Introduction to
Bioinformatics Problems
—Saddam Hossain
70 9. DNA & Protein Sequencing
cesses can sequence 500-5,000 bases at a time. This is really very small
compared with the complete genome of any eukaryote.
The very first methods of DNA sequencing were the Maxam-Gilbert Method and the
Sanger Method. The Maxam-Gilbert Method, a chemical cleavage method, was
developed by Allan Maxam and Walter Gilbert in 1977. At around the same time
Fred Sanger devised a dideoxynucleotide chain termination method, known as the
Sanger Method [Chapter 08]. As the Maxam-Gilbert Method is an old method that
was once predominant but is no longer used frequently, it is not described in
detail in this book; one of its advantages, however, is that it permits direct
sequencing of small fragments. The Sanger method is the commonly used one, and
even the Human Genome Project was carried out with the Sanger method and
Shotgun Sequencing. The Sanger method is carried out using gel electrophoresis.
These were the two competing methods of determining DNA sequence in the early
days of bioinformatics. The Sanger method is commonly of two types: i) the
Manual Sanger Method and ii) the Automated Sanger Method. Shotgun sequencing
takes maximum advantage of the speed and low cost of automated sequencing, but
relies totally on software to assemble a jumble of sequence reads into a
coherent and accurate contig. There are many success stories for shotgun
sequencing, which is why it is one of the most popular sequencing methods to
date. The Institute for Genomic Research (TIGR) has demonstrated the power and
utility of the shotgun approach by determining the complete genomic sequences
of Haemophilus influenzae, Methanococcus jannaschii, and Mycoplasma genitalium.
The technology of DNA labeling has changed in the last fifteen years, so that
there are many more options. In this modern era there are more automated
methods, such as Automated Fluorescence Sequencing or Radioactivity-based Dye
Termination Sequencing, which are more versatile and offer much higher
throughput than the previous methods. There are other methods such as Cycle
Sequencing, Capillary Electrophoresis, etc. Newer and promising technologies
include Computational Fragment Assembly, Pyrosequencing and Single Molecule
Methods. Ultimately, all these techniques and methods provide the order of the
nucleotides in a given DNA.
Extraction of Genomic DNA: The first step is to extract high-quality DNA from
the organism to be sequenced. Different kits and protocols are available to
extract clean genomic DNA efficiently from the respective organism.
Genome Mapping: The genome map is an essential prerequisite for the DNA
sequencing task. From the genome map, the region of the genome to be sequenced
is identified first. A set of clones from this region is then selected; these
are the mapped clones. Then this region is amplified.
Library Creation: From the selected mapped clones, sets of smaller clones are
made through cloning. This pool of clones acts as a clone library for further
sequencing work.
Gel Electrophoresis: The sequences from the smaller clones are determined
here using gel electrophoresis.
Finishing: This is the stage where the final product, the sequenced DNA, is
obtained. This sequenced DNA is now ready to be processed as DNA sequence data
for further use.
Figure: Stages of DNA sequencing (Genome Mapping, Library Creation, Template
Preparation, Gel Electrophoresis, Pre-finishing, Finishing, Data Editing / DNA
Annotation)
into smaller pieces, sequence the smaller DNA fragments and reconstruct the
complete sequenced DNA from these sequenced fragments. Based on the genome
map, the genome is fragmented. Then the fragments are cloned to build a clone
library. These DNA fragments are sequenced first (based on the previously
discussed methodologies). After that, computational models and software
packages are used to identify overlapping clones with common restriction
fragments and assemble them into a contig. These contigs are edited and aligned
to the genome map. Gaps between clones are filled with other clones (such
as fosmids) in this step, or by generating PCR products from BAC clones or
genomic DNA. Contigs are assembled into the complete genome in this way.
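The idea of joining overlapping pieces into a contig can be sketched with a toy
greedy assembler in Python. This is a deliberate simplification: it merges
reads by exact suffix-prefix sequence overlap rather than by shared restriction
fragments, it ignores sequencing errors and repeats, and real assembly software
is far more elaborate. The function names and the min_len parameter are
illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the remaining pieces are separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["ACTCACG", "CACGTAG", "GTAGGCT"]))
```

When the reads do not all overlap, the function returns more than one contig,
which corresponds to the gaps that must later be filled with other clones.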
Based on the principles discussed above, there are three major strategies for
complete genome sequencing. i) Hierarchical or Clone-by-Clone, where the
genome is broken into many long pieces. Each long piece is then mapped onto
the genome, and each piece is sequenced with shotgun. This strategy has been
applied to Yeast, Worm, Human, Rat etc. ii) Walking, which is the online
version of (i). Here the genome is broken into many long pieces, shotgun
sequencing of each piece is started, and the map is constructed afterwards. The
Rice genome has been sequenced in this manner. iii) Whole genome shotgun, in
which one large shotgun pass is made over the whole genome. The genomes of many
organisms, such as Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu
etc., have been sequenced using this strategy.
the average capability for daily sequencing is about 200 samples. This
throughput is too low to support the current and upcoming demand for sequenced
DNA. A well-maintained machine is also vital to successful sequencing.
Opportunities for discovery are virtually endless, from complex diseases to pa-
leogenomics and museomics (analysis of ancient DNA), from searching for new
organisms in the deep ocean and volcanoes to manipulating valuable traits in
livestock and molecular plant breeding. This is where the challenges as well as
major opportunities lie in the future.
the DNA or mRNA sequence encoding the protein. The Edman degradation is
a very important reaction for protein sequencing, because it allows the ordered
amino acid composition of a protein to be discovered. Automated Edman se-
quencers are now in widespread use, and are able to sequence peptides up to
approximately 50 amino acids long.
The other major direct method by which the sequence of a protein can be
determined is Mass Spectrometry. This method has been gaining popularity in
recent years as new techniques and increasing computing power have facilitated
it. Mass spectrometry can, in principle, sequence any size of protein, but the
problem becomes computationally more difficult as the size increases. Peptides
are also easier to prepare for mass spectrometry than whole proteins, because
they are more soluble. One method of delivering the peptides to the spectrome-
ter is electrospray ionization, for which John Bennett Fenn won the Nobel Prize
in Chemistry in 2002.
Genome Mapping
—Saddam Hossain
A Genome Map is the guide to the Genetic Highway. Imagine that one of your
best friends has moved to Dhaka, and you are on your way to meet her at her
home. You are driving down the highway to visit her. Your favorite tunes are
playing on the radio, and you haven't a care in the world. You stop to check
your maps and realize that all you have are interdivisional highway maps, not a
single street map of the area. How will you find your friend's house?
It's going to be difficult, but eventually, you may stumble across the right
house.
This scenario is similar to the situation facing scientists searching for a spe-
cific gene somewhere within the vast genome. They have available to them two
broad categories of maps: genetic maps and physical maps. Both genetic and
physical maps provide the likely order of items along a chromosome. However,
a genetic map, like an interdivision highway map, provides an indirect estimate
of the distance between two items and is limited to ordering certain items. One
could say that genetic maps serve to guide a scientist toward a gene, just like an
interdivision map guides a driver from city to city. On the other hand, physical
maps mark an estimate of true distance, in measurements called base pairs,
between items of interest. To continue our analogy, a physical map would then
be similar to a street map, where the distance between two sites of interest
may be defined more precisely in terms of city blocks or street addresses.
Physical maps, therefore, allow a scientist to more easily home in on the
location of a gene. A closer look at how each of these maps is constructed may
be helpful in understanding how scientists use these maps to traverse that
genetic highway commonly referred to as the "genome".
A genome map helps scientists navigate around the genome. Like road maps
and other familiar maps, a genome map is a set of landmarks that tells people
where they are, and helps them get where they want to go. The landmarks
on a genome map might include short DNA sequences, regulatory sites that turn
genes on and off, and genes themselves. Often, genome maps are used to help
scientists find new genes. Road maps chart well-known territory surveyed with
astonishing precision, but a genome map is a map of a new frontier. In that
sense, a genome map is more like the maps of Bangladesh made when the
Portuguese were just beginning to explore the region. Some parts of the genome
have been mapped in great detail, while others remain relatively uncharted
territory. It may turn out that a few landmarks on current genome maps appear
in the wrong place or at the wrong distance from other landmarks. But over
time, as scientists continue to explore the genome frontier, maps will become
more accurate and more detailed. Genome mapping is a work in progress.
The ultimate goal of genome mapping is to clone genes, especially disease genes.
Once a gene is cloned, we can determine its DNA sequence and study its protein
product.
will occur, because recombination occurs only when the chiasma is located
between the two loci. To apply this basic principle to map a disease gene, we
need to analyze the pedigree and estimate the recombination frequency.
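As a rough illustration of the estimate mentioned above, the recombination
frequency is the number of recombinant offspring divided by the total number of
offspring scored, and for closely linked loci 1% recombination is commonly
taken as about 1 centimorgan (cM). The sketch below assumes the offspring have
already been classified as recombinant or non-recombinant; the function names
and counts are hypothetical.

```python
def recombination_frequency(recombinants, total_offspring):
    """Fraction of scored offspring that are recombinant for two loci."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinants <= total_offspring:
        raise ValueError("recombinants must lie between 0 and total_offspring")
    return recombinants / total_offspring


def map_distance_cm(recombinants, total_offspring):
    """Approximate genetic distance in centimorgans (1% recombination ~ 1 cM).

    The approximation is only reasonable for small distances, where double
    crossovers between the two loci are rare.
    """
    return 100.0 * recombinants / total_offspring


# Hypothetical pedigree data: 12 recombinants among 100 scored offspring.
print(recombination_frequency(12, 100))
print(map_distance_cm(12, 100))
```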
The contemporary counterpart of the genetic map is the Physical Map. It is
relatively new and has advanced rapidly in the last decade because of advances
in Clone Manipulation, High Throughput Automation and Efficient Computational
Models. In a physical map, the distances between DNA landmarks are expressed
quantitatively as a number of bases (kilobases, kb, or megabases, Mb) or
nucleotides. Physical mapping is now a central technology in deriving a
finished DNA sequence for many genomes, especially the human genome and the
genomes of many other model organisms. The physical map is of higher resolution
than the older genetic map: the genetic map resolves down to the level of
genes, whereas the physical map resolves down to the individual nucleotides of
the DNA. That is why the physical map is the core tool for relating a phenotype
to its responsible genotype (the corresponding DNA sequence). In fact, a
physical map with the distances in terms of number of bases and sequenced
fragments, in
other words the ultimate physical map of a DNA, is its Complete Sequence.
There are three general categories of techniques for building a physical map:
(1) Cytogenetic Characterization, (2) Radiation Hybrid Mapping and (3)
Restriction Mapping. Restriction mapping is the most widely used of these,
because of its rich and accurate biological model and its efficient
computational model. That is why we would like to start our journey into the
Bioinformatics Problems from here.
enzyme is isolated. The second and third letters are the initial letters of the
organism's species name. A fourth letter, if any, indicates a particular strain
of the organism. Roman numerals indicate the order in which different
endonucleases were isolated from a particular organism and strain. For example,
EcoRI is found in Escherichia coli and HindIII is from Haemophilus influenzae.
So, the name EcoRI breaks down as E = genus Escherichia, co = species coli,
R = strain RY13, I = first endonuclease isolated, and HindIII as H = genus
Haemophilus, in = species influenzae, d = strain d, III = third endonuclease
isolated.
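The naming convention can be sketched as a small parser in Python. It is only
an illustration of the rule described above: it assumes names of the simple
form genus letter, two species letters, an optional strain designation and a
Roman numeral built from I, V and X, and it will misread names that do not fit
that pattern. The function names and dictionary keys are hypothetical.

```python
import re

# genus initial, two species letters, optional strain, Roman numeral
NAME_PATTERN = re.compile(r"^([A-Z])([a-z]{2})([A-Za-z0-9]*?)([IVX]+)$")


def roman_to_int(numeral):
    """Convert a Roman numeral made of I, V and X to an integer."""
    values = {"I": 1, "V": 5, "X": 10}
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = values[ch]
        # A smaller numeral written before a larger one is subtracted (IV = 4).
        total += -value if nxt in values and values[nxt] > value else value
    return total


def parse_enzyme_name(name):
    """Split a restriction enzyme name into its naming components."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised enzyme name: {name}")
    genus, species, strain, numeral = match.groups()
    return {
        "genus": genus,
        "species": species,
        "strain": strain or None,
        "number": roman_to_int(numeral),
    }


print(parse_enzyme_name("EcoRI"))
print(parse_enzyme_name("HindIII"))
```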
As a result, each Strand of the DNA can Self-Anneal and the DNA forms a
small Cruciform Structure. This structure may help the enzyme to recognize
the sequence that it is designed to cut.
The digestion can be done in several ways: Single Digestion, in which
the DNA is digested using a single restriction enzyme; Double Digestion,
in which the DNA is digested using two different restriction enzymes at the
same time; and Multiple Digestion, in which the digestion is done in the
presence of multiple restriction enzymes. Single digests are used to determine
which fragments are in the unknown DNA, and double digests to order and
orient the fragments correctly. According to the nature of the digestion, it
can also be divided into two categories. One is Complete Digestion, in which
the lengths of the DNA fragments between each two consecutive restriction sites
are measured using gel electrophoresis; the other is Partial Digestion, in
which the distances between all pairs of restriction sites are measured.
Because the length of each DNA fragment depends upon the positions of the
restriction sites, different computational models have been developed to
reconstruct the DNA restriction map from these measurements.
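The difference between the two kinds of measurement can be seen by running the
experiment "forwards" in Python: given known restriction site positions,
compute the fragment lengths each digestion would produce. This is only a
sketch of the data that the reconstruction problem starts from, not a
reconstruction algorithm. It treats the two ends of the DNA as additional cut
points, and the function names and the example positions are made up for
illustration.

```python
from itertools import combinations


def complete_digest(sites, dna_length):
    """Fragment lengths when the enzyme cuts at every restriction site."""
    points = [0] + sorted(sites) + [dna_length]
    # Only distances between consecutive cut points are observed.
    return sorted(b - a for a, b in zip(points, points[1:]))


def partial_digest(sites, dna_length):
    """Fragment lengths when every pair of cut points can bound a fragment."""
    points = [0] + sorted(sites) + [dna_length]
    # Distances between all pairs of points are observed.
    return sorted(b - a for a, b in combinations(points, 2))


# Hypothetical DNA of length 10 with restriction sites at positions 2, 4 and 7.
print(complete_digest([2, 4, 7], 10))
print(partial_digest([2, 4, 7], 10))
```

The complete digest loses the order of the fragments, while the partial digest
yields many more lengths from which the site positions can be inferred.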
For example, we may isolate two clones for a gene that are 8 kb and 10 kb long.
We know that they overlap, because the procedure used to isolate them told us
that they have sequence in common. A restriction map tells how much they
overlap by. From the restriction map information, we can tell which parts of
the two clones are identical and which parts are different.
Cloning: A Three-Step Go
1. DNA is cleaved into smaller fragments using one of the restriction
enzymes
2. The vector DNA is cut with the same restriction enzyme
3. Cut vector DNA and cut target DNA are mixed together and DNA
Ligase is added to join the vector and target DNAs. The ligated DNAs
are propagated in E. coli for Replication.
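The three steps above can be mimicked on plain strings in Python. This is a
heavily simplified sketch: it models only one strand, treats the vector as a
linear molecule with a single site, and represents ligation as string
concatenation. It uses EcoRI, which recognises GAATTC and cuts after the first
G; the example sequences and function name are made up.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut dna at every occurrence of site, cut_offset bases into the site."""
    fragments = []
    start = 0
    position = dna.find(site)
    while position != -1:
        fragments.append(dna[start:position + cut_offset])
        start = position + cut_offset
        position = dna.find(site, position + 1)
    fragments.append(dna[start:])
    return fragments


# Step 1: cleave the target DNA; the middle fragment lies between two sites.
target = "TTGAATTCAAAAGAATTCGG"
target_fragments = digest(target)
insert = target_fragments[1]

# Step 2: cut the vector with the same enzyme so that the ends are compatible.
vector = "CCGAATTCTT"
vector_left, vector_right = digest(vector)

# Step 3: "ligate" the insert between the two vector arms.
recombinant = vector_left + insert + vector_right

print(target_fragments)
print(recombinant)
```

Because both molecules were cut with the same enzyme, the recognition site is
restored on each side of the insert in the recombinant sequence.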
clones, which are clones that have been damaged by the cloning process. Common
sources of clone damage are Deletion, in which part of the DNA supposedly
spanned by the clone is missing; Chimerism, in which the clone is actually
composed of DNA segments from non-contiguous areas of the target; and
Coligation, in which DNA from the organism used to grow the clone has been
added to the clone itself. Finally, the map can be used to verify the assembled
sequence for the target. Since the features of the physical map are based on
features of the underlying DNA sequence, it is possible and useful to compare
the sequence and the map to verify their consistency.
Sequences Alignment
—Saddam Hossain
Simply put, whenever we have a protein or DNA sequence and want to find other
sequences that look like it, the question of sequence comparison arises.
Nowadays "Alignment" is a more accepted word than "Comparison", because this
study of comparison involves aligning two or more sequences through an explicit
mapping of bases or amino acids. The exploration of DNA and protein sequences
has grown explosively since the 1970s. As a result, efficient computational
models and sequence databases became necessary for Sequence Alignment. Sequence
Comparison and Alignment is one of the central areas of study in
Bioinformatics. It is really too large a topic to cover in one section of a
chapter.
If DNA and protein sequences are represented by series of bases (A, T, C, G)
and amino acids (L, G, P etc.) respectively, the alignment of sequences Sq1 and
Sq2 is a Two Row Matrix such that the first row contains the characters of Sq1
and the second row contains the characters of Sq2, keeping the order of the
characters and interspersed with some spaces so that identical characters line
up vertically. Obviously, there may be different alignments even for a single
pair of sequences.
Hypothetical Sequences:
Sq1: ATCATTCGTGAT
Sq2: TGCAATTCGTA
Sample Alignment:
Sq1: AT-C-ATTCGTGAT
      | | |||||| |
Sq2: -TGCAATTCGT-A-
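The two-row definition can be checked mechanically. The Python sketch below
takes the two rows of the sample alignment, written with "-" for a space,
verifies that removing the spaces gives back Sq1 and Sq2, and counts the
columns in which identical characters line up. The function names are
illustrative.

```python
def is_valid_alignment(row1, row2, seq1, seq2):
    """Check that two gapped rows form an alignment of seq1 and seq2."""
    if len(row1) != len(row2):
        return False
    # Removing the spaces must give back the original sequences in order.
    if row1.replace("-", "") != seq1 or row2.replace("-", "") != seq2:
        return False
    # No column may consist of two spaces.
    return all(not (a == "-" and b == "-") for a, b in zip(row1, row2))


def count_matches(row1, row2):
    """Number of columns where the same character appears in both rows."""
    return sum(a == b and a != "-" for a, b in zip(row1, row2))


sq1 = "ATCATTCGTGAT"
sq2 = "TGCAATTCGTA"
row1 = "AT-C-ATTCGTGAT"
row2 = "-TGCAATTCGT-A-"

print(is_valid_alignment(row1, row2, sq1, sq2))
print(count_matches(row1, row2))
```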
ity, which helped scientists to propose the hypothesis that they may share a
Common Ancestor of evolutionary origin. It is believed that there is a
Natural Evolutionary Process among organisms. According to modern science,
this evolution is due to Mutation in the DNA. Mutation can arise from DNA
Replication errors that cause Substitutions, Insertions, and Deletions of
nucleotides, changing the original DNA sequence. This leads to differentiation
and divergence in the descendants of a species. Then an interesting question
comes to mind: how is a species or organism evolutionarily linked to another?
Do they have a common ancestor? Hypotheses about this can be derived from
different kinds of alignment of DNA or Protein Sequences.
Another piece of research, back in 1983, on the similarities between the
cancer-causing v-sis oncogene and a growth-stimulating hormone gave a
surprising clue to their common function. This research was the first success
story of supporting a conjecture based on sequence comparison. The v-sis
oncogene in the simian sarcoma virus causes uncontrolled cell growth leading to
cancer in monkeys. The seemingly unrelated Platelet-Derived Growth Factor
(PDGF) is a protein that stimulates and regulates cell growth. When these genes
were compared, significant similarity was found and scientists conjectured that
cancer may be caused by a normal growth gene being switched on at the wrong
time. Sequence comparison thus became established as a method for finding
functional links between proteins and DNA sequences.
Homologous genes that share a common ancestry and function in the absence of
any evidence of gene duplication are called Orthologs. When there is evidence
for Gene Duplication, the genes in an evolutionary lineage derived from one of
the copies and with the same function are also referred to as orthologs. The two
copies of the duplicated gene and their progeny in the evolutionary lineage are
referred to as Paralogs. In other cases, similar regions in sequences may not
have a common ancestor but may have arisen independently by two evolution-
ary pathways converging on the same function, called Convergent Evolution.
Such sequences are referred to as Analogous (Fitch 1970).
Local alignment is used when the sequences differ in length and are likely to
be similar in a specific area (subsequence) but dissimilar elsewhere, that is,
when they share a Conserved Region or Domain. For example, homeobox genes,
which regulate embryonic development, are present in a large variety of
species. Although homeobox genes are very different in different species, one
region of them, called the Homeodomain, is highly conserved. Local alignments
can produce high-quality alignments that allow Residue-per-Residue analysis. A
local alignment consists of paired subsequences that may be surrounded by
residues that are completely unrelated. Another obvious case where local
alignments are desired is the alignment of the nucleotide sequence of a Spliced
mRNA to its genomic sequence, where each exon would be a distinct local
alignment. Proteins that have a significant biological relationship to one
another often share only isolated regions of sequence similarity. For
identifying relationships of this nature, the ability to find local regions of
optimal similarity makes local alignment preferable to global alignment.
Depending on the number of sequences involved, sequence alignment methods are
of two types: Pairwise Sequence Alignment and Multiple Sequence Alignment.
Sequence comparison is one of the most difficult tasks for biologists. Most
computational models for sequence alignment use insertion, deletion,
substitution, or a slightly different set of operations on nucleotide bases or
amino acids to incorporate the concept of mutation. In general, finding
differences is equivalent to finding similarities, because these models score
alignments in terms of alignment distance or edit distance.
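The idea of edit distance can be made concrete with a short sketch. The
function below is an illustration rather than a method taken from this text: it
assumes unit cost for every insertion, deletion and substitution, and the name
edit_distance is chosen here for convenience.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Assumption for this sketch: every operation costs 1.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("NFLS", "NFS"))
```

A small distance corresponds to a high similarity, which is why the two views
are interchangeable in these models.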
Although computational models become more efficient every day, sequence
databases are growing even faster. As a result, methods that performed
extraordinarily well 20 years ago are now too slow to search a sequence
database of 10^9 entries. Performance is therefore gained nowadays through
parallel hardware implementations or through fast heuristics that usually work
well but are not guaranteed to find the closest match in every case. There is
considerable scope for future work on improving both performance and accuracy.
Many methods for sequence alignment have evolved. The most basic and widely
used are Dot Matrix and Dynamic Programming, which are discussed in the
following few sections.
Usage of Dot Matrix: Unless the sequences are known to be very much alike, the
dot matrix method should be used first, because it displays any possible
sequence alignments as diagonals on the matrix, which is what general
exploration of the sequences requires. A dot matrix can readily reveal the
presence of insertions/deletions and of direct and inverted repeats that are
more difficult to find by the other, more automated methods. The major
limitation of the method is that most dot matrix computer programs do not show
an actual alignment.
Dot Matrix is also used for predicting regions in RNA that are
self-complementary and that, therefore, have the potential to form secondary
structure. The major advantage of the dot matrix method for finding sequence
alignments is that all possible matches of residues between two sequences are
found, leaving the investigator the choice of identifying the most significant
ones. The sequences of the actual regions that align can then be determined
afterwards by dynamic programming.
Dot Matrix can reveal complex relationships involving multiple regions of
local similarity. In a dot-matrix representation, certain patterns of dots may
appear to sketch out a "path", but it is up to the biologist to deduce the
alignment from this information. A related graphical representation, known as
a path graph, provides an explicit representation of an alignment.
A dot matrix can also reveal the presence of repeats of the same sequence
characters many times. These repeats become apparent on the dot matrix of a
protein sequence against itself as horizontal or vertical rows of dots that some-
times merge into rectangular or square patterns.
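A minimal sketch of the dot matrix idea follows. It places a dot wherever the
residue of one sequence equals the residue of the other, with no window or
threshold filtering (real programs usually add both). The function names
dot_matrix and render are illustrative and do not come from any particular
program.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list[list[bool]]:
    """Return a matrix whose cell [i][j] is True when seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a: str, seq_b: str) -> str:
    """Draw the matrix as text: '*' marks a match, '.' a mismatch."""
    lines = ["  " + " ".join(seq_b)]
    for residue, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Comparing a sequence with itself shows the main diagonal plus any repeats.
    print(render("ACGACG", "ACGACG"))
```

In the self-comparison, the repeated unit ACG appears as additional diagonals
parallel to the main one, which is how repeats become visible.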
a sequence should be in the same column in an alignment, and which are in-
sertions in one of the sequences (or deletions on the other). This information
is important for making functional, structural, and evolutionary predictions on
the basis of sequence alignments.
It is important to bear in mind that optimal methods always report the best
alignment that can be achieved, even if it has no biological meaning. On the
other hand, when searching for local alignments there may be several significant
alignments, so it is a mistake to look only at the optimal one.
BLAST: BLAST is the most widely used tool for aligning a sequence against a
sequence database. There is a more powerful version of BLAST, named PSI-BLAST,
which can answer some biological queries that BLAST cannot. The BLAST programs
introduced a number of refinements to database searching that improved overall
search speed and put database searching on a firm statistical foundation
(Altschul et al., 1990).
sequences. This is the need and motivation behind the development of Multiple
Sequence Alignment. ”One amino acid sequence plays coy; a pair of homologous
sequences whisper; many aligned sequences shout out loud.”
Multiple Sequence Alignment is the process of aligning more than two sequences
simultaneously. As an illustration, take four hypothetical protein sequences:
SeqA, SeqB, SeqC and SeqD. Their multiple sequence alignment is shown below,
with a substitution (F/Y), a deletion (L) and an insertion (K). The
accompanying tree is an evolutionary tree for this group of sequences.
Sequences
SeqA: NFLS
SeqB: NFS
SeqC: NKYLS
SeqD: NYLS
SeqA: N * F L S
SeqB: N * F - S
SeqC: N K Y L S
SeqD: N * Y L S
The major problem with the progressive alignment method described above
is that errors in the initial alignments of the most closely related sequences are
propagated to the multiple sequence alignment. This problem is more acute
when the starting alignments are between more distantly related sequences.
Iterative models attempt to correct for this problem by repeatedly realigning
subgroups of the sequences and then by aligning these subgroups into a global
alignment.
as, if not better than, other methods. A model of a sequence family is first pro-
duced and initialized with prior information about the sequences. The model
is trained with a good number of sequences first. The trained model is then
used to produce the most probable multiple sequence alignment as posterior
information. Because the approach is based entirely on probability theory, no
sequence ordering is needed, insertion/deletion penalties are not needed, and
experimentally derived information can be used.
• HMMER
Consider a well-known case study of fruit flies. Because they lack a
sophisticated immune system, fruit flies are prone to bacterial infection.
Fortunately, they have a small set of Immunity Genes that usually remain
dormant and are switched on when the fly is infected. When these genes are
turned on, proteins are produced that destroy the pathogen and cure the
infection. Using a DNA Array, a biologist can run a lab experiment comparing
infected and uninfected flies to determine what triggers the activation of the
immunity genes. The DNA sequence that switches on an immunity gene, by
encouraging RNA Polymerase to transcribe the gene, is called a Regulatory
Motif.
Motifs are usually short sequences (5-25 bp). Graphically, a motif is
presented by a special type of diagram called a Motif Logo, which shows the
conserved and variable positions of the motif by drawing the symbols at each
position in different sizes.
53++!305))6*;4826)4+.)4+);806*;48!860))85;]8*:+*8!83(88)5*!;46(;88*96*?;8)
*+(;485);5*!2:*+(;4956*2(5*-4)88*; 4069285);)6!8)4++;1(+9;48081;8:8+1;48!8
5;4)485!528806*81(+9;48;(88;4(+?34;48)4+;161;:188;+?;
11.3. Regulatory Motif Finding 105
He assumed that the message was written in English, with each letter replaced
by a symbol, and tried to solve the puzzle, firstly applying the frequency dis-
Unfortunately, DNA texts are not that easy to decipher: neither the
Linguistics of DNA nor its Genetic Grammar is known, and there is no
dictionary of motifs in hand. The only available information is that frequent
or rare DNA words (substrings) may carry signals about the Genetic Language of
the organism.
The complications arise because we do not know the motif sequence beforehand,
we do not know where it is located relative to the start of each gene, and
motifs can differ slightly from one gene to the next. Finding a pattern
without any solid prior knowledge of it is really tough for any computational
model.
At the highest level, the Regulatory Motif Finding problem first requires a
Multiple Sequence Local Alignment to create the alignment of l-mers. Profiling
is then done to score the alignment and build a consensus motif string, which
is thought to be the ancestral motif of the corresponding function.
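The profiling step can be sketched as follows, assuming the l-mers have already
been chosen and aligned. The profile is a per-column count of each nucleotide,
the consensus takes the most frequent nucleotide in each column, and the score
is the sum of those highest counts. The function names, and the rule of
breaking ties in the order A, C, G, T, are assumptions of this sketch.

```python
from collections import Counter


def profile(lmers: list[str]) -> list[Counter]:
    """Count the nucleotides in each column of a set of aligned l-mers."""
    length = len(lmers[0])
    if any(len(lmer) != length for lmer in lmers):
        raise ValueError("all l-mers must have the same length")
    return [Counter(column) for column in zip(*lmers)]


def consensus(lmers: list[str]) -> str:
    """Most frequent nucleotide per column (ties broken in the order A, C, G, T)."""
    return "".join(max("ACGT", key=lambda base: col[base]) for col in profile(lmers))


def consensus_score(lmers: list[str]) -> int:
    """Sum over columns of the count of the most frequent nucleotide."""
    return sum(max(col.values()) for col in profile(lmers))


if __name__ == "__main__":
    aligned = ["ATGCA", "ATGGA", "TTGCA", "ATCCA"]
    print(consensus(aligned), consensus_score(aligned))
```

A higher score means the chosen l-mers agree more closely, so motif search can
be phrased as choosing the l-mers that maximise this score.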
Two distinct ways of solving this problem have been developed: the
Combinatorial Approach and Probabilistic Methods. The combinatorial approach
includes Brute Force Motif Finding, the Median String Problem, Search Trees,
Search Trees with Branch-and-Bound Techniques, Consensus-based Greedy Motif
Search and Exhaustive Motif Search models. Probabilistic approaches use
Expectation Maximization, Profile Hidden Markov Models (HMM), etc.
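As an illustration of the combinatorial approach, the sketch below solves the
Median String Problem by brute force: it enumerates all 4^l candidate l-mers
and keeps the one whose total Hamming distance to its best match in every
sequence is smallest. This is only practical for small l, and the function
names are chosen here for illustration.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def total_distance(pattern: str, sequences: list[str]) -> int:
    """Sum, over all sequences, of the distance to the best-matching l-mer."""
    l = len(pattern)
    return sum(
        min(hamming(pattern, seq[i:i + l]) for i in range(len(seq) - l + 1))
        for seq in sequences
    )


def median_string(sequences: list[str], l: int) -> str:
    """Try every l-mer over ACGT and return one with minimal total distance."""
    candidates = ("".join(letters) for letters in product("ACGT", repeat=l))
    return min(candidates, key=lambda pattern: total_distance(pattern, sequences))


if __name__ == "__main__":
    print(median_string(["CCATGACC", "TTATGATT", "GGGATGAG"], 4))
```

Branch-and-bound techniques prune this enumeration by abandoning a prefix as
soon as its distance already exceeds the best total found so far.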
• IDENTIFY
• SCAN
• MOTIFS
Chapter 12
Gene Prediction
—Saddam Hossain
From the central dogma of life, it follows that part of the raw DNA sequence
is transcribed into mRNA and eventually used to synthesize a particular
protein, through the steps of pre-transcription, transcription, splicing and
translation. The raw DNA sequence that undergoes transcription is called the
Transcribed Region or Gene Coding Segment (CDS). Between two CDSs lie
regulatory segments organized in the upstream area of a gene. The upstream
area comprises the Enhancer, Upstream Promoter, Motif, Core Promoter, GC-box,
CAAT-box, TATA-box, INR-box, Transcription Start Site (TSS), etc. This
gene-upstream region is also called the Flanking Region. Put simply, the whole
genome is an N-times repetition of Flanking Region and Coding Segment pairs.
During gene prediction or identification, finding the gene coding segment is
the central focus. A coding segment consists of two types of segments, exons
(the coding sequence of the gene) and introns (sequence that is not translated
into protein), and four types of signals: the start codon (ATG), donor splice
sites (usually GT), acceptor splice sites (usually AG) and stop codons (TAG,
TGA, TAA). There are in turn four types of exons: (i) initial exons, which
extend from a start codon to the first donor site; (ii) internal exons, which
extend from one acceptor site to the next donor site; (iii) final exons, which
extend from the last acceptor site to the stop codon; and (iv) single exons,
which are intronless and not interrupted by non-coding segments.
Except for extrinsic (evidence-based) gene prediction, all methods are based
on computing and locating the gene markers mentioned above (start and stop
codons, splice sites, promoters, etc.). This is really complex, and the
complexity increases as the non-coding (intron) regions grow and the coding
(exon) regions shrink in length. Finding splice sites is also difficult
because the dinucleotides GT and AG appear very often. As a result, gene
prediction starts with ab-initio methods, and predictions are repeatedly made
and evaluated using approximation methods; it is really hard to obtain an
exact or perfect prediction every time.
The simplest method for finding DNA sequences that encode proteins or
represent genes is to search for Open Reading Frames (ORFs). An open reading
frame is a DNA sequence that contains a contiguous set of codons, each of
which specifies an amino acid, uninterrupted by a stop codon. There are six
possible reading frames.
For example, the following sequence of DNA can be read in six reading frames.
Three in the forward and three in the reverse direction. The three reading
frames in the forward direction are shown with the translated amino acids be-
low each DNA sequence. Frame 1 starts with the ”a”, Frame 2 with the ”t”
and Frame 3 with the ”g”. Stop codons are indicated by an ”*” in the protein
sequence. The longest ORF is in Frame 1.
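A six-frame ORF search can be sketched as below. This sketch makes one common
assumption that the text above does not spell out: an ORF is taken to begin at
an ATG and to end at the first in-frame stop codon. The function names and the
example sequence are illustrative only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna: str, offset: int):
    """Yield ORFs (ATG through the first in-frame stop) in one reading frame."""
    start = None
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield dna[start:i + 3]
            start = None


def six_frame_orfs(dna: str) -> list[tuple[str, int, str]]:
    """Return (strand, frame number 1-3, ORF sequence) for all six frames."""
    dna = dna.upper()
    found = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for orf in orfs_in_frame(seq, offset):
                found.append((strand, offset + 1, orf))
    return found


def longest_orf(dna: str):
    """Return the longest ORF found in any frame, or None if there is none."""
    return max(six_frame_orfs(dna), key=lambda hit: len(hit[2]), default=None)


if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTTAGGG"))
```

Real gene finders add a minimum-length threshold, since short ORFs occur by
chance in any sequence.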
There are some issues regarding gene prediction. The first is the size of the
genome: the larger the genome, the more genes it has and the more complex they
are to find. More complex genomes also tend to have lower coding density, i.e.
fewer genes per kbp. It is assumed that long ORFs tend to be coding. As the
ratio of coding to non-coding region length decreases, exon or gene prediction
becomes more complex.
12.1. Introduction to Genome Annotation & Gene Prediction 113
The Ab-initio prediction method needs a good set of training data to evaluate
the statistical likelihood of a prediction being real. Ab-initio prediction is
never perfect; it has high false positive rates. Incorporating a similarity
test model may reduce the false positive rate, but it will increase the false
negative rate. This model is rarely used as a final product; rather, it serves
as a starting point.
known cDNA (or protein) database. Among different homology based methods
Local Alignment Methods and Pattern-based Alignment Methods are used.
• Jigsaw
• GLEAN
• Grail
• BLAST
• FASTAX
• BLAT
• WABA
• MZEF
• MZEF-SPC
• FGENESH
Chapter 13
Genome Analysis
—Saddam Hossain
The entire DNA content of a cell is known as its genome. A segment of the
genome that is transcribed into RNA is called a gene. Simply put, Genome
Analysis is the process of analysing the genome.
Segments of the genome called genes determine the sequence of amino acids in
proteins. The mechanism is simple in prokaryotic cells, where all genes are
converted into the corresponding mRNA (messenger ribonucleic acid) and then
into proteins. The process is more complex in eukaryotic cells, where only
some parts of a gene, called exons, are expressed in the form of mRNA, rather
than the full DNA sequence; the exons are interrupted in places by non-coding
DNA sequences called introns. One of the several questions posed here is how
some parts of the genome are expressed as proteins while other parts (introns
as well as intergenic regions) are not.
version of the human genome sequence. Model organisms have been sequenced
in both the plant and animal kingdoms. As we begin the new millennium, the
major goal of molecular biology is to obtain the complete sequences of as many
genomes as possible. A comparison of the genome sizes of different organisms
(Table 1) raises questions such as what types of genetic modification are
responsible for the genome of the wheat plant being four times larger, and
that of the rice plant seven times smaller, than that of humans. Mice and
humans contain roughly the same number of genes, about 28K protein-coding
regions. The chimp and human genomes differ by an average of just 2%, i.e.
just about 160 enzymes.
Genome Sequencing:
Genome Annotation:
Genome Rearrangement:
Gene Prediction:
Genome Similarity:
DNA Microarrays:
Chapter 14
Phylogenetic Analysis
—Saddam Hossain
Phylogenetic Data Types: There are two types of data that are available
and commonly used in phylogenetic analysis.
[Figure: sequences ATG and ACG descending from the hypothetical ancestor
A(T/C)G]
Sequences ATG and ACG are homologous, as they share a common (hypothetical)
ancestor A(T/C)G, and their similarity is about 67% (two of the three
nucleotides are identical).
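The percentage of nucleotide similarity quoted for ATG and ACG can be
reproduced with a few lines. The sketch assumes two sequences of equal length
that are already aligned without gaps; the function name percent_identity is
illustrative.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions at which two sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


if __name__ == "__main__":
    # Two of three positions match, so the result is two thirds of 100.
    print(round(percent_identity("ATG", "ACG"), 1))
```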
division or divergence of a single species into two or more species. As
speciation progresses, duplication and deletion may happen repeatedly and
independently in each species.
S0: CAGT
deletion(C) deletion(T)
AGT CAG
duplication(T) duplication(G)
Species-0 (S0) diverged into Species-1 and Species-2 through speciation; in
the course of speciation, gene deletion and duplication occurred, finally
resulting in two different species, Species-1 and Species-2.
Gene families are composed of homologous genes that share a common ancestor.
Each family is the result of an evolutionary process involving gene
duplication, speciation and gene deletion.
Ancestor(Root)
branch
Rooted & Unrooted Trees: A rooted phylogenetic tree presents an inference
about a common ancestor. It has a "base" node that is the ancestor of all the
organisms covered by the tree. The direction of evolution and the evolutionary
pathways can be traced from the ancestor to the organisms. An unrooted tree,
on the other hand, only establishes relationships among OTUs (operational
taxonomic units) and does not specify evolutionary pathways. Roots can be
assigned to unrooted trees by finding an outgroup. An outgroup is a species
that has unambiguously separated much earlier from the other species under
consideration.
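To give a sense of how many rooted and unrooted trees are possible, the sketch
below counts strictly bifurcating trees for n OTUs using the standard results
(2n-3)!! for rooted trees and (2n-5)!! for unrooted trees. These formulas are
supplied here as well-known results rather than quoted from this text, and the
function names are illustrative.

```python
def odd_double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1; returns 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees on n >= 2 labelled OTUs."""
    if n < 2:
        raise ValueError("need at least two OTUs")
    return odd_double_factorial(2 * n - 3)


def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees on n >= 3 labelled OTUs."""
    if n < 3:
        raise ValueError("need at least three OTUs")
    return odd_double_factorial(2 * n - 5)


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, rooted_trees(n), unrooted_trees(n))
```

Ten OTUs already allow more than two million unrooted trees, which is why
exhaustive enumeration quickly becomes impractical.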
The number of possible phylogenetic trees, both rooted and unrooted, grows
explosively with the number of OTUs. Though there is a staggering number of
possible phylogenetic trees even for a small set of data (species, organisms,
sequences), only one of these trees is the true phylogenetic tree! This is the
real challenge. Molecular data alone may never infer the correct phylogenetic
tree; more artificial, morphological, and historical data are needed to
[Figure: trees relating Primate, Gorilla, Chimpanzee and Human]
Human, Chimpanzee and Gorilla are related to one another, but no evolutionary
pathways are defined in the figure.
14.5. Approaches in Phylogenetic Analysis 125
The method starts with the most similar pair of OTUs and builds a composite
OTU from the two. From the new group of OTUs, the pair with the highest
similarity is again picked and merged into a single OTU. This process
continues until only two OTUs are left. Tree building starts with the initial
OTUs as leaf nodes, and every composite node becomes the intermediate
ancestral node of the chosen pair of OTUs. This new intermediate ancestral
node is treated as a new OTU. Tree building goes on until only two OTUs
remain, because the final two OTUs are the first descendants of the root
ancestor.
Seq1 G T A G G A T
Distance = 2 l l
Seq2 G A A A G A T
A B C
A − 2 4
B − 4
C −
Algorithm 2 UPGMA
Initialization:
    Assign each x_i to its own cluster C_i
    Define one leaf per sequence, at height 0
Iteration:
    Find two clusters C_i, C_j such that d_ij is minimum
    Let C_k = C_i ∪ C_j
    Define a node connecting C_i and C_j, at height d_ij / 2
    Delete C_i and C_j
Termination:
    When all sequences belong to one cluster
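The UPGMA steps above can be sketched in Python as follows. The distance from
a merged cluster to every other cluster is taken as the size-weighted average
of the two old distances, which is the usual UPGMA update; that rule, the
input format (a dictionary keyed by pairs of leaf names) and the function name
are assumptions of this sketch. The example reuses the A, B, C distances
2, 4, 4.

```python
from itertools import combinations


def upgma(distances: dict[tuple[str, str], float]) -> list[tuple[str, str, float]]:
    """Cluster leaves by UPGMA and return the merges as (left, right, height)."""
    d = {frozenset(pair): float(value) for pair, value in distances.items()}
    leaves = sorted({name for pair in distances for name in pair})
    size = {name: 1 for name in leaves}  # number of leaves in each live cluster
    merges = []
    while len(size) > 1:
        # Find the two live clusters with the smallest distance.
        a, b = min(combinations(sorted(size), 2), key=lambda p: d[frozenset(p)])
        new = f"({a},{b})"
        height = d[frozenset((a, b))] / 2
        # Distance from the new cluster to each remaining cluster.
        for other in size:
            if other not in (a, b):
                d[frozenset((new, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        del size[a], size[b]
        merges.append((a, b, height))
    return merges


if __name__ == "__main__":
    print(upgma({("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4}))
```

For this matrix, A and B are joined first at height 1, and the resulting
cluster is then joined with C at height 2.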
• Calculate the average distance from A to all other sequences (of cluster
X), and from B to all other sequences (of cluster X).
• Adjust the position of the common ancestor node for A and B, so that
the difference between the averages is equal to the difference between the
A and B branch lengths, while the sum of the branch lengths is the distance
between A and B, d(A, B).
A(T/C)CG A(T/C)CG
[T → C]
The concept of maximum parsimony rests on two assumptions. One is that
mutations are exceedingly rare events in evolutionary pathways. The other is
that the more unlikely events a model invokes, the less likely the model is to
be correct. As a result, the relationship that requires the fewest mutations
to explain the current state of the sequences is considered the relationship
most likely to be correct. The maximum parsimony model for phylogenetic trees
was postulated to embody this conservative principle of minimum evolution.
The preliminary step of the maximum parsimony method is to align the sequences
using a multiple sequence alignment method. Every position of the alignment
(aligned column) is called a site. For each aligned site, the phylogenetic
trees that require the smallest number of evolutionary changes to produce the
observed sequence changes are identified. This analysis is repeated for every
site of the alignment. Finally, the trees that produce the smallest number of
changes overall, for all sequences, are identified as the maximum parsimony
trees.
Fitch’s Algorithm (W. Fitch, 1971) is widely used in constructing phylogenetic
trees by the maximum parsimony method. As the number of possible trees is very
large even for a small set of sequences, enumerating all the trees, scoring
them and finding the most parsimonious one is really impractical. That is why
search strategies such as the Branch-and-Bound method and the
Nearest-Neighbour Interchange method are used to narrow down the search space
and find an optimal or suboptimal tree rather than enumerating every tree.
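The scoring step at the heart of Fitch's algorithm can be sketched as follows.
For one site it works up a given tree from the leaves: each internal node takes
the intersection of its children's state sets if that is non-empty, and
otherwise takes their union and counts one change. The tree encoding (nested
2-tuples of leaf names) and the function names are assumptions of this sketch,
and the tree search itself is not shown.

```python
def fitch_site_score(tree, leaf_states: dict[str, str]) -> int:
    """Minimum number of changes at one site on a fixed bifurcating tree."""
    changes = 0

    def state_set(node) -> set[str]:
        nonlocal changes
        if isinstance(node, str):  # a leaf: its observed character
            return {leaf_states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1  # no shared state, so one change is needed here
        return left | right

    state_set(tree)
    return changes


def parsimony_score(tree, alignment: dict[str, str]) -> int:
    """Sum of the per-site Fitch scores over all columns of an alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site_score(tree, {leaf: seq[i] for leaf, seq in alignment.items()})
        for i in range(length)
    )


if __name__ == "__main__":
    alignment = {"A": "ATG", "B": "ATG", "C": "ACG", "D": "ACG"}
    print(parsimony_score((("A", "B"), ("C", "D")), alignment))
    print(parsimony_score((("A", "C"), ("B", "D")), alignment))
```

The first tree groups identical sequences together and needs fewer changes
than the second, so it is the more parsimonious of the two.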
Maximum likelihood methods have some advantages over other methods. They show
lower variance than other methods, and they are robust and statistically well
founded. The method works well for distantly related sequences, even under
different molecular clock assumptions, and it can incorporate any desired
evolutionary model. Overall, this method is the most flexible and gives good
results under good Evolutionary Models. Its main disadvantages are that it
gives bad Approximations under bad Evolutionary Models and that it is
computationally intensive.
• DRAWGRAM/DRAWTREE
• CONSENSE
• etc...
Chapter 15
Protein Folding
—Saddam Hossain
15.1 Proteins
Proteins are the basis of cellular and molecular life. Proteins play a crucial
role in virtually all biological processes, with a broad range of functions.
Amino acids (aa) are the building blocks of proteins. There are 20 natural
amino acids (A C D E F G H I K L M N P Q R S T V W Y). A protein is a linear
chain of these amino acids joined by peptide bonds.
There are many protein folding algorithms. Brute-force algorithms are
impractical, because the underlying problem is NP-complete in simplified
lattice models. The practical algorithms are approximation algorithms that run
in polynomial time and come close to the true result with high probability.
These are not stochastic.
15.4.2.1 α-helix
This structure repeats itself every 5.4 Angstroms along the helix axis. Every
main-chain CO and NH group is hydrogen bonded to a peptide bond 4 residues
away. More precisely, there is one α-helix turn per 3.6 residues. This
structure is mainly found on Protein Surfaces.
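The two numbers above fix the geometry of an ideal helix: 5.4 / 3.6 gives a
rise of about 1.5 Angstroms per residue, and 360 / 3.6 gives a rotation of
about 100 degrees per residue. The short sketch below encodes this arithmetic;
the constant and function names are illustrative.

```python
PITCH = 5.4              # Angstroms advanced along the axis per turn (from the text)
RESIDUES_PER_TURN = 3.6  # residues per full turn (from the text)

RISE_PER_RESIDUE = PITCH / RESIDUES_PER_TURN      # about 1.5 Angstroms
ROTATION_PER_RESIDUE = 360.0 / RESIDUES_PER_TURN  # about 100 degrees


def helix_length(n_residues: int) -> float:
    """Approximate axial length, in Angstroms, of an ideal helix of n residues."""
    return n_residues * RISE_PER_RESIDUE


if __name__ == "__main__":
    print(round(RISE_PER_RESIDUE, 2), round(ROTATION_PER_RESIDUE, 1))
    print(round(helix_length(18), 1))
```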
15.5. Experimental Techniques for Structure Determination 135
15.4.2.2 β-sheets
• Requires crystals
Architecture (fold)
Topology (superfamily)
Homology (family)
A schematic view of how to proceed from the sequence to a model of the protein
is presented below. Structure prediction relies heavily on different alignment
methods. Whether a reliable alignment with a known structure can be obtained
determines which methods are to be used.
The end-to-end process of protein structure prediction from the primary
structure (amino acid sequence) can be divided into several stages. Though the
initial attempts were to predict a 2D structure for the protein, the 3D
structure is now the final requirement.
Most prediction methods are based on the primary sequence only, with an
accuracy of 64%-75%. Prediction accuracy is higher for α-helices than for
β-strands. Accuracy depends on the protein family, and predictions for
engineered proteins are less accurate.
Bottom-to-Top:
Calculate the minimal energy function
Top-to-Bottom:
Extract the optimal assignment
Time complexity:
Exponential in tree width, linear in graph size
Some Concepts:
Search sequence data banks for homologs, Search methods e.g. BLAST,
PSIBLAST, FASTA, Homologue in PDB..
primary step for tertiary structure prediction is Secondary Structure
Prediction. Protein secondary structure prediction methods predict these
structures from the primary structure (that is, from the amino acid sequence).
Nowadays, the available prediction methods have reached an average accuracy of
more than 70%.
There are two main computational alternatives beyond the experimental meth-
ods to determine or predict the secondary structures of proteins. The first
alternative is ab-initio methods and second is approximate or heuristic meth-
ods. And again the widely used approximate or heuristic methods for secondary
protein structure prediction can be categorized as Statistical Methods, Nearest
Neighbor Approach, Neural Networks Approach, Hidden Markov Model, and Sup-
port Vector Machine based methods.
The ab-initio methods determine protein structure based on sequence data
(primary structure) and the physics of molecular dynamics. The physics of
molecular dynamics consists of Newtonian physics, atomic-level forces, bond
lengths, bond angles, torsion (dihedral) angles, and equations for calculating
the energy of the most stable (minimum free energy) conformation or structure.
These functions can also depend on the amino acid sequence, temperature,
pressure, pH and other local conditions. The angle functions also depend on
the types of atoms involved and the number of free electrons available for
bonding.
The ab-initio methods start with sequence data, i.e. the primary protein
structure. They then construct a reasonable secondary structure using bond
lengths, bond angles and torsion angles. In the next step, they populate a
library of tertiary structures by generating all possible candidate tertiary
structures using molecular dynamics and Monte-Carlo methods. Monte-Carlo
methods identify conformational combinations with the lowest free energy.
After that, the best possible structures are filtered from this library of 3D
structure candidates using the Metropolis algorithm, which identifies the most
stable molecular conformations. This method is based on the assumption that
the native conformation of a protein is the conformation with the lowest free
energy. Once the top candidates are selected, they are visualized and
validated against the corresponding structures determined by experimental
methods such as NMR or X-ray crystallography. The protein structures are
compared using the Root Mean Squared Deviation (RMSD) measure.
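The RMSD measure can be sketched as below. The sketch assumes that the two
structures contain the same atoms in the same order and have already been
superimposed; finding the optimal superposition is a separate step that is not
shown. The function name and the toy coordinates are illustrative.

```python
import math

Point = tuple[float, float, float]


def rmsd(coords_a: list[Point], coords_b: list[Point]) -> float:
    """Root mean squared deviation between two equally sized coordinate sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal size")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    experimental = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(rmsd(predicted, experimental))
```

A value of zero means the two structures coincide; larger values mean the
predicted model deviates more from the experimental one.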
GOR Method: Garnier, Osguthorpe & Robson proposed the GOR method. It assumes
that amino acids up to 8 residues on each side influence the secondary
structure (SS) of the central residue. Its prediction accuracy is up to 64%.
Normally there are three output nodes in the NN model for secondary structure
prediction, each representing one class of secondary structure. Recently,
hybrid neural network models have been used to predict all three types of
secondary structure together. Most NN models work on fragment libraries, which
consist of protein sequence fragments of known structure, and the model makes
its predictions using knowledge from those fragment databases/libraries.
15.9. Performance of Structure Prediction Approaches 141
Hybrid Fuzzy Neural Network can be used for protein secondary structure pre-
diction.
The ab-initio methods use energy functions to guide their search and explore
structure space to predict the secondary structures. These techniques are
useful for predicting novel structures for which comparisons against known
structures do not yield useful information. As a result, they start with less
structural (template coverage) and sequence similarity (homology) information.
These methods can predict structures at a higher level of difficulty
(computational model) and with a higher runtime, at a resolution accuracy of
5-20 Å. Fold Recognition (FR) based models do not use homology-based
comparison; rather, they use Fold Recognition alone, with better running time
and resolution accuracy and a lower level of difficulty than ab-initio
methods. Models based on recognizing fold commonalities need higher similarity
and template coverage, but produce higher resolution accuracy. The most widely
used methods, homology-based methods, need a higher level of similarity (>30%)
and template coverage to predict structure from the library of known
structures; their running time is much lower and their accuracy is higher.
Chapter 16
Structural Bioinformatics & Drug Discovery
—Saddam Hossain
to treat diseases, especially in humans. Drugs are chemical compounds, mostly
small molecules though some are large, with specific characteristics: a drug
should be safe, effective, deliverable, available, stable and novel. Drugs
must be regulated by the Food and Drug Administration (FDA).
The modern, or rational, drug discovery process can be divided into three
major parts: the Exploratory Phase, the Drug Discovery Phase and the Drug
Development Phase. During the exploratory phase, Target Identification and
Target Validation are done. Later, in the discovery phase, Assay Development,
Lead Identification, Lead Development, Screening and Hits to Leads, and Lead
Optimization are carried out. The development phase consists of Drug
Development, Drug Testing, Preclinical Development, Clinical Trials, Drug
Toxicology and finally NDA (New Drug Application) and New Drug to Market. The
clinical trials stage may itself have several phases, such as clinical trials
I, II and III. Rational drug discovery is a very long process, running from
target identification to commercialization. It sometimes takes 15-20 years to
complete, sometimes even more, and it may cost $700-$800 million.
When the target is confirmed, modulators of the target can be identified. There
are two types of modulators for each kind of target, they are positive modulators
and negative modulators.
Introduction to
Bioinformatics
Computations
Chapter 17
Statistical and Probabilistic Methods in Bioinformatics
—Saddam Hossain
17.1 Introduction
Chapter 18
Computational Methods in Bioinformatics
—Saddam Hossain
Chapter 19
Bioinformatics Data Mining
As data sets have grown in size and complexity, direct hands-on data analysis
has increasingly been augmented with indirect, automatic data processing.
This has been aided by other discoveries in computer science, such as
• Neural Networks (NN)
• Clustering
• Genetic Algorithms (1950s)
• Decision Trees (1960s)
2. Segmentation/Clustering
3. Association
4. Summarization
19.2. Data Mining Task 159
Things to Ponder
Clustering and classification are sometimes confused. Both put data examples
into different groups. The difference is that in classification the groups are
predefined, and the task is to decide which group a new data sample should
belong to.
In clustering, the types of groups and even the number of groups are not
known, and the task is to find the best way to segment all the data.
The major problem in clustering is deciding the number of clusters. For some
methods, such as the K-Means algorithm, the user must specify the number of
clusters first. The centers of the clusters are initially chosen arbitrarily.
The data then move iteratively between clusters until the assignment
converges. If the user is not satisfied with the results, another number of
clusters is tried. This kind of method is therefore a trial-and-error process.
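The iterative procedure just described can be sketched in plain Python. The
sketch assumes that the initial centers are drawn at random from the data
points (one arbitrary choice among several), uses squared Euclidean distance,
and stops when the centers no longer change. The function name, the seed
parameter and the sample points are illustrative.

```python
import random


def k_means(points, k, iterations=100, seed=0):
    """Cluster points (tuples of floats) into k groups; return (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initial centers are data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, clusters = k_means(data, 2)
    print(sorted(centers))
```

Running the function again with a different k is the trial-and-error step
mentioned above.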
Association: Another task of data mining is to search for a set of data within
which a subset is dependent on the rest of the set. For example, x -> y means:
if sequence x is present in a specific part of a genome, then sequence y will
be present too.
2. Statistics
3. Machine Learning
would be very difficult because the "trend" of the data is usually nonlinear
and very complicated. Many parameter estimates are involved if the model is
nonlinear. Therefore, to simplify the problem, a linear model is usually used.
A linear model uses a straight line to estimate the trend of the data. This
is called linear regression. For a set of data with a nonlinear nature, it can be
assumed that the data trend is piecewise linear. That is, over a small interval,
the data trend is approximately a straight line.
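As a small illustration of fitting a straight line to data, the following Python sketch computes the ordinary least-squares slope and intercept. The function name `fit_line` and the sample data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and co-variation of x and y, around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

For piecewise linear data, the same function could be applied separately to each small interval.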
• gradient descent,
• evolutionary algorithms.
The learning procedure is repeated until the error measure reaches zero or
is minimized. After the learning procedure is completed with the training data,
the parameters are set and kept unchanged and the model can be used to pre-
dict or classify new data samples.
Different learning schemes have been developed and discussed in the machine
learning literature. Important issues include the learning speed, the guarantee
of convergence, and how the data can be learned incrementally. There are two
categories of learning schemes:
2. unsupervised learning.
Supervised learning learns from data that come with an answer; that is, the
parameters are modified according to the difference between the real output and
the desired output (the expected answer). The classification problem falls into
this category.
Various models like neural networks (NN), decision trees (DT), genetic
algorithms (GA), fuzzy systems, and support vector machines (SVM) have proved
very useful in classification and clustering problems. But machine learning
techniques usually handle relatively small data sets, because the learning
procedure is normally very time-consuming. To apply the techniques to data
mining tasks, the problem of handling large data sets must be overcome.
Some Algorithms in
Bioinformatics
20.1 BLAST
20.2 FASTA
20.3 CLUSTALW
20.4 PHD
20.5 Predator
20.6 TRILOGY
20.7 Gibbs Sampler
20.8 DALI
Part IV
Dynamic Programming
And Bioinformatics
—Saddam Hossain
Next the change for 3 Tk can be obtained by finding the optimal solution from
the options
Edit Distance: Though the term Edit Distance was first coined (Levenshtein,
1966) in the study of strings, the concept is now used in DNA sequence
alignment, with or without some customizations. The Edit Distance of two strings
is the minimum number of edit operations (insertion or deletion, i.e. indel, and
substitution of a symbol) needed to transform one string into the other. Any
alignment of the two strings implies a number of edit operations, and the Edit
Distance is the smallest such number over all alignments. Two example
alignments and their edit counts are shown below:
A-TGCA
AGTC-A
indel = 2
substitution = 1
Edit operations = 3 (= 2 + 1)
-ATGT-C
CAG--GC
indel = 4
substitution = 1
Edit operations = 5 (= 4 + 1)
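The minimum number of edit operations can be computed with dynamic programming. The sketch below is a standard Levenshtein distance with unit cost for each indel and substitution; the function name is an assumption for illustration. Note that a particular alignment only gives an upper bound: for ATGCA and AGTCA the sketch returns 2 (two substitutions), and for ATGTC and CAGGC it returns 3.

```python
def edit_distance(a, b):
    """Minimum number of indels and substitutions turning string a into b."""
    m, n = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i symbols of a
    for j in range(n + 1):
        dist[0][j] = j  # insert all j symbols of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[m][n]
```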
U=AAG
V=AGC
U=AAG-
V=-AGC
There may be many possible alignments, but the target is to find the best
(optimal) alignment. To grade an alignment, a scoring mechanism is needed. The
scoring mechanism awards a match and penalizes a substitution or an indel. The
highest scoring alignment is chosen as the best or optimal alignment.
Let us start with the assumption that 0-length subsequences are aligned, and set
ASM[0,0] = 0 as the initial alignment score.
There are three different choices/paths to update/align every cell of the
ASM[i,j] matrix. The indel and substitution penalties and the match award can be
derived from a predefined indel/substitution/match scoring matrix (for our
illustration, let us assume each indel costs 1, each substitution costs 2, and
each match awards 2 points).
Then fill the first row and the first column with insertion and deletion
operations respectively. All other cells are then filled using the recurrence
function of the score.
After completion of the ASM matrix, backtracking the path of alignment from the
end (ASM[m,n]) to the start (ASM[0,0]) gives the optimal alignment.
Figure 21.4: ASM Matrix Cell Building (This is a Temp-Pic, New To be Drawn
Later)
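The filling and backtracking of the ASM matrix can be sketched as follows. This is a minimal global alignment in the style of Needleman-Wunsch, assuming the illustrative scores from the text (match +2, substitution −2, indel −1); the function name and the tie-breaking order during backtracking are assumptions of the sketch.

```python
def align(u, v, match=2, substitution=-2, indel=-1):
    """Return (score, aligned_u, aligned_v) for an optimal global alignment."""
    m, n = len(u), len(v)
    asm = [[0] * (n + 1) for _ in range(m + 1)]  # ASM[0][0] = 0
    for i in range(1, m + 1):
        asm[i][0] = i * indel   # first column: only indels
    for j in range(1, n + 1):
        asm[0][j] = j * indel   # first row: only indels
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = match if u[i - 1] == v[j - 1] else substitution
            asm[i][j] = max(
                asm[i - 1][j - 1] + diagonal,  # align u[i-1] with v[j-1]
                asm[i - 1][j] + indel,         # u[i-1] against a gap
                asm[i][j - 1] + indel,         # v[j-1] against a gap
            )
    # Backtrack from ASM[m][n] to ASM[0][0].
    row_u, row_v = [], []
    i, j = m, n
    while i > 0 or j > 0:
        diagonal = match if i > 0 and j > 0 and u[i - 1] == v[j - 1] else substitution
        if i > 0 and j > 0 and asm[i][j] == asm[i - 1][j - 1] + diagonal:
            row_u.append(u[i - 1]); row_v.append(v[j - 1]); i -= 1; j -= 1
        elif i > 0 and asm[i][j] == asm[i - 1][j] + indel:
            row_u.append(u[i - 1]); row_v.append("-"); i -= 1
        else:
            row_u.append("-"); row_v.append(v[j - 1]); j -= 1
    return asm[m][n], "".join(reversed(row_u)), "".join(reversed(row_v))

score, aligned_u, aligned_v = align("AAG", "AGC")
```

For U = AAG and V = AGC the best score under these assumed values is 2 (two matches and two indels); several alignments reach it, and which one is returned depends on the tie-breaking order.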
To this end, we define affine gap penalties to be a linearly weighted score for
large gaps. We can set the score for a gap of length $x$ to be
$-(\rho + \sigma x)$, where $\rho > 0$ is the penalty for the introduction of
the gap and $\sigma > 0$ is the penalty for each symbol in the gap ($\rho$ is
typically large while $\sigma$ is typically small). Though this may seem to
complicate our alignment approach, it turns out that the edit graph
representation of the problem is robust enough to accommodate it.
• Critical element, which tells learning element how the algorithm performs,
and
180 22. Neural Network And Bioinformatics
quently, the neuron will fire the electrical signal down the axon. The occurrence
of action potential can be increased or decreased by changing the constitution
of various neurotransmitters.
where $X_n$ denotes the $n$th input and $W_{i,n}$ denotes the weights between
the input and hidden layers.
The hidden neurons are then used as inputs for the output $y$:
$$y_i = G\left(\sum_n V_{i,n}\, h_n\right) \qquad (22.2)$$
where $V_{i,n}$ denotes the weights between the hidden and output layers. The
activation function $F$ or $G$ is a sigmoid or logistic function, which is
usually differentiable and contributes to stability in neural network learning
(Narayanan et al., 2003a).
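The two-layer computation can be sketched in Python as below. The sketch assumes that the hidden layer is computed in the same way as the output layer, i.e. h_i = F(sum over n of W[i][n] * X[n]), and that both F and G are the logistic function; the function names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, V):
    """Forward pass: inputs x, hidden weights W, output weights V.

    W[i][n] is the weight from input n to hidden neuron i;
    V[i][n] is the weight from hidden neuron n to output neuron i.
    """
    h = [sigmoid(sum(w * xn for w, xn in zip(row, x))) for row in W]
    y = [sigmoid(sum(v * hn for v, hn in zip(row, h))) for row in V]
    return h, y

# One hidden neuron whose weighted sum is 0, feeding one output neuron.
h, y = forward([1.0, 1.0], W=[[1.0, -1.0]], V=[[2.0]])
```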
Despite the simplicity of neural network, the summation functions can be more
complex than just the simple sum of the products of inputs and their weights.
The specific algorithm to combine neural inputs is determined by the chosen
network architecture and hypothesis.
where $t_i$ is the target output and $O_i$ is the actual output. The steps used
to find the weights that minimize the error are:
• compare the actual output value with the target output value,
• modify the weights so that the actual output is closer to the target output
next time, with smaller error.
This process is repeated for all samples in the data set, and then repeated
until the output error for all the samples reaches an acceptably low value,
which indicates the end point of the training. Once the training is finished,
the trained neural network can be tested using the rest of the data set, which
was not used during the training phase. If the testing is not satisfactory,
further modification of the weights has to be done. Otherwise, the output value
of the tested data is preserved for any decision making.
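The compare-and-modify loop can be sketched for a single sigmoid neuron. The sketch assumes a squared-error measure and a plain gradient (delta rule) update; the learning rate, the epoch count and the OR training data are illustrative choices, not values from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    """Actual output O of the neuron for one input sample."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_neuron(samples, epochs=5000, rate=0.5):
    """Repeat over all samples, nudging the weights toward the targets."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = predict(weights, bias, inputs)
            # Compare actual and target output; scale by the sigmoid slope
            # (gradient of the assumed squared error 0.5 * (t - O)^2).
            delta = (target - output) * output * (1.0 - output)
            # Modify the weights so the output moves closer to the target.
            weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
            bias += rate * delta
    return weights, bias

# Logical OR as a tiny training set: (inputs, target output).
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(samples)
```

In practice a separate held-out part of the data set would be used for testing, as the text describes.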
The theory of SVMs can be applied to the classification of yeast microarray
expression data. When the misclassification rates of SVMs are compared with
those of other machine learning approaches, SVMs are found to be the best
performing methods (Brown et al., 2000). In addition to their use for evaluating
microarray expression data (Furey et al., 2000), SVMs have been shown to perform
well in multiple areas of biological analysis, including detecting remote
protein homologies (Jaakkola, 1999) and recognizing translation initiation
sites. Gene expression data is usually high dimensional, which constitutes a
serious problem for several machine learning methods. Dimensionality reduction
can be used, but it often leads to information loss and performance degradation.
Fortunately, SVMs can overcome this problem, as they can generalize well on high
dimensional data (Valentini, 2002).
The SOM consists of an input layer and a competitive output layer. The
output layer is normally organized into a two-dimensional grid of fully connected
neurons, as illustrated in Fig. 22.3. The input vectors are fed into the input
layer and mapped to competitive neurons in the output layer. The competitive
learning algorithm in the output layer ensures that similar input vectors are
mapped to neurons that are closer to each other in the grid than those of
dissimilar ones. In the SOM, input vectors in a high dimensional space are
therefore projected onto a two-dimensional output space based on their spatial
similarities. Similar input patterns are clustered into one small region in the
grid of the output layer.
The SOM is widely used as a data mining and visualization method in
bioinformatics. For analyzing gene expression data, it is a more robust and
accurate method for clustering large amounts of noisy data than hierarchical
clustering methods. In the analysis of the Stanford yeast gene expression
dataset using SOMs, the best performance of gene expression analysis was
The advantages of the SOM can be attributed to its ability to map high
dimensional data onto a more comprehensible lower dimensional space and to its
fast execution. It is potentially very useful for dealing with high
dimensionality and large-scale databases to extract information from gene
expression data. However, the effectiveness of combining it with database
queries warrants further investigation. SOM also has limitations, namely,
Related Ref:
(Adeli, 1995; Finlay and Dix, 1996; Kuonen, 2003; Narayanan et al., 2002;
Negnevitsky, 2002; Nilsson, 1996; Baldi and Brunak, 2001; Westhead et al.,
2002).
Chapter 23
—Saddam Hossain
&
—Fokhruzzaman
23.1 Introduction
A very natural question for any biologist or bioinformatician who has a DNA or
amino acid (protein) sequence in hand is what the sequence represents. For
example, is a particular DNA sequence a gene or not? Another example would be to
identify which family of proteins a given protein belongs to. In both cases, we
have a sequence of symbols from some alphabet and we are required to say
something about the structure of that sequence. Several techniques can be used
to model this kind of sequence problem; among these, Markov Chains and Hidden
Markov Models serve as probabilistic models for sequences. We will concentrate
on some famous biological problems to explain Markov Chain Models and Hidden
Markov Models, and their applications. The first of these is identifying CpG
islands in a DNA sequence. Let us define the CpG islands problem first.
190 23. Hidden Markov Model (HMM) And Bioinformatics
3′ ...A C TA G ...5′
5′ ...T G AT C ...3′
3′ ...A CG TTA...5′
3′ ...A CG T CG AG CG TACTGTTACTCAGTCTTAG...5′
Due to the methylation process, the CpG dinucleotide is rarer than would
be expected from the independent probabilities of C and G. In the human genome,
the CG dinucleotide occurs with frequency < 1%, which makes it the least
frequent dinucleotide. However, for biologically important reasons, the
methylation process is suppressed in upstream areas around genes, and hence
these areas contain a relatively high concentration of the CpG dinucleotide.
Such regions are called CpG islands; their length varies from a few hundred to a
few thousand bases in the promoter regions of genes. CpG islands in the promoter
regions of genes play an important role in the deactivation of a copy of the
X chromosome in females, in genetic imprinting, and in the deactivation of
intra-genomic parasites.
According to a recent study, human chromosomes 21 and 22 contain about 1100
CpG islands. About 56% of human genes are associated with a region of CpG
islands. The presence of a CpG island can be an indication of the start of a
gene. Therefore, identifying CpG islands helps to determine the location of
genes across the DNA. Two questions arise very frequently: i) Given a short
sequence, is it from a CpG island or not? and ii) Given a long sequence, does
it contain a CpG island or not? We will continue our discussion of Markov
chains in trying to answer these questions.
23.3. Markov Chain 191
GCCTACACAC CG CCAGTTGTGTTCCTGCTATGTCTCTAGTGATCCCTGAA
AAGTTCCAG CG TATTTTG CG AATACTCAACAGCAACATCAA CG GGCAG
CAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGGTTGACAGTA
CACTCATAGTGTTGAGGAAAGCTGA CG TTGACCTCACCAAGTGGGCAGGA
GAACTCACTGAGGATGAGATGGAA CG TGTGATGACCATTATGCAGAATCC
ATGCCAGTACAAGATCCCAGACTGGTTCTTG
holding the property $\sum_j a_{ij} = 1$. A Markov Chain Model is defined by
$(Q, A)$.
Let us examine the smallest possible Markov Chain, with two states (drawn as
circles), $Q = \{1, 2\}$, named 1 and 2. The transition probability (drawn as a
transition edge, a directed edge) of moving from state 1 to state 2 is $p$, that
is $a_{1,2} = p$, and the transition within the state itself is $1 - p$, which
is the loop transition probability $a_{1,1} = (1 - p)$. In the same manner, the
other transition probabilities are $a_{2,2} = (1 - p)$ and $a_{2,1} = p$. The
complete transition matrix is given below, and this Markov Chain Model can be
defined as $M = \{Q, A\}$.
$$A = \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix} = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}$$
[Figure: two-state Markov chain; states 1 and 2 each have a loop transition with probability (1 − p), and the transitions between the two states have probability p]
Markov Property: In the Markov Chain the transition from one state to
another happens in discrete time steps, $n = 1, 2, 3, \ldots$. If we are in
state $i$ at time step $n$, we go to state $j$ in time step $n + 1$ with
probability $a_{i,j}$ (the transition probability from state $i$ to state $j$).
In general, the state at time $n$, $q_n$, could depend on all the earlier states
$q_1, q_2, \ldots, q_{n-1}$. A Markov Chain of order 1, however, is the model in
which the transition probabilities can only "remember" one state of history:
beyond the current state it is memoryless. This "memorylessness" condition is
very important; it is called the Markov Property. Mathematically:
$$P(q_{n+1} = j \mid q_n = i, q_{n-1}, \ldots, q_1) = P(q_{n+1} = j \mid q_n = i) = a_{i,j}$$
A Markov Chain can produce a sequence through transitions from state to
state. For example, a sequence like 1221 can be obtained from the
above-mentioned Markov Chain. The corresponding transition path starts from
state 1, goes to state 2, then takes a loop transition to state 2 itself, and
returns to state 1. As each transition from one state to another is associated
with some transition probability, the whole path that generates a sequence of
states like 1221 is generated with some probability.
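As a small worked example, the probability of a path such as 1221 is the product of the transition probabilities along it. The sketch below assumes the chain is already in the first state of the path (no start probability is included); the function names and the value p = 0.3 are illustrative.

```python
def two_state_chain(p):
    """Transition matrix of the two-state chain, as nested dictionaries."""
    return {1: {1: 1 - p, 2: p}, 2: {1: p, 2: 1 - p}}

def path_probability(path, transition):
    """Product of the transition probabilities along a path of states."""
    probability = 1.0
    for current, following in zip(path, path[1:]):
        probability *= transition[current][following]
    return probability

A = two_state_chain(0.3)
# 1 -> 2 (p), 2 -> 2 (1 - p), 2 -> 1 (p)
prob_1221 = path_probability([1, 2, 2, 1], A)
```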
For our next course of study, we assume that the states are represented by the
set $Q = \{q_1, q_2, q_3, \ldots, q_n\}$, and that the sequence generated from
the Markov Chain with the corresponding states and transition matrix is
represented as $X = x_1, x_2, x_3, \ldots, x_L$ for a sequence of length $L$,
where the sequence element $x_i$ ($1 \le i \le L$) is generated by some
corresponding state $q_j$, $1 \le j \le n$.
But what is the probability of starting in state $q_1$, $p(x_1 = q_1)$? This has
to be given, so we must have a probability distribution for the starting state.
Alternatively, we can model this by explicitly adding a start state with
transition probabilities to all other states. We will always start with that
special start state. Let the start state be denoted by $q_0$ and its emitted
sequence symbol by $x_0$; then
$p(x_1 = q_1) = p(x_1 = q_1 \mid x_0 = q_0) = a_{q_0,q_1}$. So the previous
equation becomes:
Similarly, we can explicitly add an end state. Although not needed, having
an end state will help us model the length of the sequences too, because it
induces a probability distribution for the length of a path in the Markov Chain.
Therefore each state (including the start state, i.e. empty sequence) will have
a transition probability to that special end state. The probability of ending a
sequence in state $q_L$ is $a_{q_L,q_{L+1}}$, where $q_{L+1}$ denotes the end
state. Then the probability of the path will be:
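[Figures: Markov chain state diagrams: a path β, q1, ..., qL; a chain over the four nucleotides A, C, G, T; the CpG-island chain with states A+, C+, G+, T+; and the non-CpG chain with states A−, C−, G−, T−. Each diagram includes the state β.]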
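Transition probabilities of the CpG-island model, A+ (row: current nucleotide,
column: next nucleotide):
       A      C      G      T
A  0.180  0.274  0.426  0.120
C  0.171  0.368  0.274  0.188
G  0.161  0.339  0.375  0.125
T  0.079  0.355  0.384  0.182
Transition probabilities of the non-CpG model, A− (same layout):
       A      C      G      T
A  0.300  0.205  0.285  0.210
C  0.322  0.298  0.078  0.302
G  0.248  0.246  0.298  0.208
T  0.177  0.239  0.292  0.292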
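$$\begin{aligned}
p^{+}(CGCG) &= p(CGCG \mid M^{+}) \\
&= a^{+}_{q_0,q_1}\, a^{+}_{q_1,q_2}\, a^{+}_{q_2,q_3}\, a^{+}_{q_3,q_4}\, a^{+}_{q_4,q_5} \\
&= a^{+}_{q_1,q_2}\, a^{+}_{q_2,q_3}\, a^{+}_{q_3,q_4} \\
&= a^{+}_{CG}\, a^{+}_{GC}\, a^{+}_{CG} \\
&= 0.274 \times 0.339 \times 0.274 \\
&= 0.025450764
\end{aligned}$$
$$\begin{aligned}
p^{-}(CGCG) &= p(CGCG \mid M^{-}) \\
&= 0.078 \times 0.246 \times 0.078 \\
&= 0.001496664
\end{aligned}$$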
It is clear that the probability of the sequence CGCG under the CpG-island
model is higher than under the non-CpG-island model. We can also use the
log-odds ratio $\log \frac{p(x \mid M^{+})}{p(x \mid M^{-})}$ to determine
whether $x$ comes from a CpG island or not.
If $\log \frac{p(x \mid M^{+})}{p(x \mid M^{-})} > 0$, then $x$ comes from a
CpG island.
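The calculation for CGCG can be reproduced with a short Python sketch that uses the two transition tables given in this section. As in the worked example, the start and end transitions are ignored; the helper and function names are assumptions for illustration.

```python
import math

NUCLEOTIDES = "ACGT"

def table(rows):
    """Build a[current][next] from four rows given in A, C, G, T order."""
    return {a: dict(zip(NUCLEOTIDES, row)) for a, row in zip(NUCLEOTIDES, rows)}

A_PLUS = table([   # CpG-island model
    (0.180, 0.274, 0.426, 0.120),
    (0.171, 0.368, 0.274, 0.188),
    (0.161, 0.339, 0.375, 0.125),
    (0.079, 0.355, 0.384, 0.182),
])
A_MINUS = table([  # non-CpG model
    (0.300, 0.205, 0.285, 0.210),
    (0.322, 0.298, 0.078, 0.302),
    (0.248, 0.246, 0.298, 0.208),
    (0.177, 0.239, 0.292, 0.292),
])

def chain_probability(sequence, a):
    """Product of transition probabilities between consecutive symbols."""
    probability = 1.0
    for current, following in zip(sequence, sequence[1:]):
        probability *= a[current][following]
    return probability

def log_odds(sequence):
    """log p(x | M+) / p(x | M-), summed transition by transition."""
    return sum(
        math.log(A_PLUS[c][f] / A_MINUS[c][f])
        for c, f in zip(sequence, sequence[1:])
    )

p_plus = chain_probability("CGCG", A_PLUS)
p_minus = chain_probability("CGCG", A_MINUS)
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on longer sequences.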
The above strategy answers our first question: i) Given a short sequence, is it
from a CpG island or not? But what about the second question: ii) Given a long
sequence, does it contain a CpG island or not? The dual Markov Chain model that
we have developed above can also be used to find CpG islands in a long sequence
of nucleotides. How can this be done? It can be done by taking windows of small
size, say 100 nucleotides, in the long sequence. For each window (which is a
short sequence), the log-odds ratio is calculated as described above. We can
then identify the windows with a positive log-odds ratio and merge intersecting
windows to determine which parts of the long sequence are CpG islands.
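The window procedure can be sketched as follows. The sketch repeats the two transition tables so that it runs on its own; the step size, the merging of directly adjacent windows and the synthetic test sequence are assumptions made for illustration.

```python
import math

NUCLEOTIDES = "ACGT"

def table(rows):
    return {a: dict(zip(NUCLEOTIDES, row)) for a, row in zip(NUCLEOTIDES, rows)}

A_PLUS = table([(0.180, 0.274, 0.426, 0.120), (0.171, 0.368, 0.274, 0.188),
                (0.161, 0.339, 0.375, 0.125), (0.079, 0.355, 0.384, 0.182)])
A_MINUS = table([(0.300, 0.205, 0.285, 0.210), (0.322, 0.298, 0.078, 0.302),
                 (0.248, 0.246, 0.298, 0.208), (0.177, 0.239, 0.292, 0.292)])

def log_odds(sequence):
    return sum(math.log(A_PLUS[c][f] / A_MINUS[c][f])
               for c, f in zip(sequence, sequence[1:]))

def find_islands(sequence, window=100, step=1):
    """Return merged half-open intervals (start, end) of positive windows."""
    hits = [
        (start, start + window)
        for start in range(0, len(sequence) - window + 1, step)
        if log_odds(sequence[start:start + window]) > 0
    ]
    merged = []
    for start, end in hits:
        # Merge with the previous interval if they overlap or touch.
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Synthetic sequence: AT-rich flanks around a CG-rich block at 60..119.
sequence = "AT" * 30 + "CG" * 30 + "AT" * 30
islands = find_islands(sequence, window=20)
```

Because a window is flagged as a whole, the reported interval can extend somewhat beyond the true island boundaries.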
But there is still a problem with this model: it does not establish a
one-to-one correspondence between the states and the symbols of the sequence.
For instance, the symbol C can be generated by both states C+ and C−. Hence, a
sequence no longer corresponds to one path in the model, but to multiple paths.
In other words, a sequence $X = \{x_1, x_2, x_3, \ldots, x_L\}$ does not
uniquely determine the path in the model. The states are hidden in the sense
that the sequence itself does not reveal how it was generated. Therefore, we
need to develop a slightly different theory for this new model, called the
Hidden Markov Model.
[Figure: states A+, T+, C+, G+ and A−, T−, C−, G−, together with the state β]
Figure 23.11: Combined Model for CpG & non-CpG islands
HMM Parameters:
• Transition probabilities
• Emission probabilities
HMM Usage:
• Evaluate the probability of an observation sequence given the model (For-
ward)
• Find the most likely path through the model for a given observation se-
quence (Viterbi)
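The second usage, finding the most likely path, can be sketched with the Viterbi algorithm in log space. The model below is a deliberately simplified two-state toy (I for island, B for background) with made-up start, transition and emission probabilities; it is not the eight-state CpG model of this chapter, and all names and numbers are assumptions for illustration.

```python
import math

def viterbi(observations, states, start, transition, emission):
    """Most likely state path for the observations (log-space Viterbi)."""
    # best[t][s]: log probability of the best path ending in state s at time t
    best = [{s: math.log(start[s]) + math.log(emission[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            previous = max(
                states,
                key=lambda r: best[t - 1][r] + math.log(transition[r][s]),
            )
            best[t][s] = (best[t - 1][previous]
                          + math.log(transition[previous][s])
                          + math.log(emission[s][observations[t]]))
            back[t][s] = previous
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

states = ["I", "B"]
start = {"I": 0.5, "B": 0.5}
transition = {"I": {"I": 0.9, "B": 0.1}, "B": {"I": 0.1, "B": 0.9}}
emission = {
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # C/G rich
    "B": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # A/T rich
}
path = "".join(viterbi("CGCGCGATATAT", states, start, transition, emission))
```

The Forward algorithm has the same structure, with the maximum over previous states replaced by a sum of probabilities.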
23.5. HMM and Pair wise Sequence Alignment 199
• Statisticians are comfortable with the theory behind hidden Markov mod-
els
• HMMs are still very powerful modeling tools - far more powerful than
many statistical methods
HMM Disadvantages:
• P(y) must be independent of P(x), and vice versa; this usually isn't true
• Can get around it when relationships are local
• Not good for RNA folding problems
• Model may not converge to a truly optimal parameter set for a given
training set
• HMM is only as good as the training set
• More training is not always good, causes over-fitting
• Still slow in comparison to other methods
—Saddam Hossain
Genetic Programming And Bioinformatics
**********Fig***Cycle of Selection************
was then developed by other researchers. Genetic algorithm is one of the results
of such researches, developed by John Holland and his students and colleagues
in 1975.
**********Fig***Steps of GP************
24.4. Basic Genetic Algorithm 203
Survivors Selection: Usually two individuals are selected from the population
based on their fitness; they are the parents that reproduce offspring for a new
generation. It is assumed that fitter individuals have more chance of producing
fitter offspring. The new generation has the same size as the old generation;
the old generation dies and the new generation comes into its place. Each
iteration of the loop is called a Generation.
Algorithm 3 GeneticAlgorithm
1: [Start] Generate random initial population, Generation G0 .
2: [Fitness] Evaluate the fitness of each individual in the population.
3: [New Population] Create a new population by repeating following steps
until the new population is complete.
4: [Selection] Select two parents according to their fitness.
5: [Crossover] With a crossover probability, cross over the parents to form
a new offspring.
6: [Mutation] With a mutation probability, mutate new offspring.
7: [Accepting] Place new offspring in a new population.
8: [New Generation] Use new generated population, as Generation Gn for a
further run of the algorithm.
9: [Solution Test]
10: if an optimal solution (individual) has been found then
11: Go to [End]
12: end if
13: Loop back to [Fitness]
14: [End]
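The outline in Algorithm 3 can be sketched in Python as below. The sketch assumes binary encoding, fitness-proportionate selection with non-negative fitness values, single-point crossover and bit-flip mutation; all parameter names and default values are illustrative assumptions, and each pair of parents yields one child for simplicity.

```python
import random

def genetic_algorithm(fitness, length, pop_size=20, generations=50,
                      crossover_rate=0.7, mutation_rate=0.01,
                      target=None, seed=0):
    """Return the best individual (a list of bits) seen during the run."""
    rng = random.Random(seed)
    # [Start] random initial population, generation G0
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # [Solution Test] stop early if the assumed target is reached
        if target is not None and fitness(best) >= target:
            break
        # [Fitness] evaluate every individual (assumed non-negative)
        scores = [fitness(individual) for individual in population]
        new_population = []
        while len(new_population) < pop_size:
            # [Selection] fitter individuals are more likely to be parents
            parent_a, parent_b = rng.choices(
                population, weights=[s + 1 for s in scores], k=2)
            child = list(parent_a)
            # [Crossover] single point, applied with crossover_rate
            if rng.random() < crossover_rate:
                point = rng.randint(1, length - 1)
                child = parent_a[:point] + parent_b[point:]
            # [Mutation] flip each bit with mutation_rate
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            # [Accepting] place the offspring in the new population
            new_population.append(child)
        # [New Generation] the old generation is replaced
        population = new_population
        best = max(population + [best], key=fitness)
    return best

# Toy problem: maximize the number of 1 bits in a 12-bit chromosome.
best = genetic_algorithm(sum, length=12, target=12, seed=1)
```

Keeping track of the best individual seen so far is a convenience of the sketch; the basic outline only returns the final population.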
This is the most basic algorithmic outline for genetic programming, but
many things can be implemented differently for different problems. A few points
in the algorithm are important enough to discuss separately. The first is how
the chromosome or individual is created, in other words what type of encoding is
used for representing individuals. There are also two basic operators for
genetic programming. These and some other points are discussed in the next
section.
24.5.1 Encoding
The chromosome should in some way contain information about the solution it
represents. Encoding is the process or method of representing a solution or
decision variable (chromosome or individual) containing that information for
the genetic programming. The decision variables of a problem can be encoded in
various fashions, all of which are finite length strings. The usual encoding
mechanisms are Binary Encoding, Permutation Encoding, Value Encoding, Tree
Encoding, etc. Though the choice of encoding depends heavily on the problem,
Binary Encoding is the most widely used.
ChromosomeA: 1452376
ChromosomeB: 6253417
ATG
ATC ACG
Example of Tree encoding.
24.5.2 Crossover
A percentage of the population is selected for breeding and assigned random
mates. This random mating and the generation of new offspring are done through
crossover. Crossover is one of the two basic operators of genetic programming.
After the encoding has been decided, a crossover method can be chosen. Crossover
selects genes from the parent chromosomes and creates new offspring. The
simplest way of doing this is to choose some crossover point(s) at random.
Depending on this, there are several types of crossover, discussed below.
Single Point Crossover: Single Point Crossover selects one crossover point.
The string from the beginning of the chromosome to the crossover point is copied
from one parent; the rest is copied from the second parent.
Multiple Point Crossover: In this case, two or more crossover points are
selected. Genes before and after the corresponding crossover points are
exchanged to produce new offspring.
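Both crossover variants can be sketched on string chromosomes. In this illustrative sketch the crossover points are passed in explicitly (in a genetic algorithm they would be drawn at random), both complementary children are returned, and the function names are assumptions.

```python
def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length chromosomes after one point."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def multi_point_crossover(parent_a, parent_b, points):
    """Alternate the source parent at every crossover point."""
    child_a, child_b = [], []
    swap = False
    previous = 0
    for point in sorted(points) + [len(parent_a)]:
        segment_a = parent_a[previous:point]
        segment_b = parent_b[previous:point]
        if swap:
            segment_a, segment_b = segment_b, segment_a
        child_a.append(segment_a)
        child_b.append(segment_b)
        swap = not swap
        previous = point
    return "".join(child_a), "".join(child_b)

one_point = single_point_crossover("ACTGGTCA", "TTTTTTTT", 3)
two_point = multi_point_crossover("AAAAAAAA", "CCCCCCCC", [2, 5])
```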
Uniform Crossover
24.5.3 Mutation
Mutation is the second of the two basic operators of genetic programming.
After a crossover is performed, mutation takes place. Its purpose is to prevent
all solutions in the population from falling into a local optimum of the problem
being solved. Mutation randomly changes the new offspring. Depending on the
encoding, mutation can take different forms. For binary encoding, we can flip a
few randomly chosen bits from 1 to 0 or from 0 to 1. In a string of nucleotides,
it may change a randomly chosen A into C.
ACTGGTCA → CCTGGTCC
Mutation
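A mutation operator for nucleotide strings can be sketched as below. Each position is changed independently with a given mutation rate to a different randomly chosen symbol; the function name, the `rate` parameter and the alphabet default are assumptions for illustration.

```python
import random

def mutate(sequence, rate, rng, alphabet="ACGT"):
    """Replace each symbol, with probability rate, by a different symbol."""
    mutated = []
    for symbol in sequence:
        if rng.random() < rate:
            # Choose among the other symbols so the position really changes.
            symbol = rng.choice([s for s in alphabet if s != symbol])
        mutated.append(symbol)
    return "".join(mutated)

rng = random.Random(0)
offspring = mutate("ACTGGTCA", rate=0.25, rng=rng)
```

For binary encoding the same function can be used with `alphabet="01"`, which flips the selected bits.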
• Defining convergence
• Local optimisation
following section we will see how a genetic algorithm can do sequence alignment.
Let us denote one of the known sequences from the database by $U$, of length
$m$, and an input query sequence by $V$, of length $n$. The pairwise alignment
problem attempts to answer the question: "How similar are the two sequences
$U$ and $V$?" A candidate solution (aligned sequence) can be represented by
a matrix, called the pairwise alignment matrix. A pairwise alignment matrix for
sequences $U$ and $V$ is a 2-row matrix constructed with $(m + n)$ columns. Each
column contains one character of $U$ and one character of $V$, or one character
and one gap (−); that is, a column cannot be entirely composed of gaps, except
at the rightmost positions. In the pairwise alignment matrix the original
sequences may be augmented by some gaps inside them; let us denote these
augmented sequences by $U^a$ and $V^a$, of equal length $(m + n)$. So the
alignment matrix is the matrix $P$ of dimension $2 \times (m + n)$ with rows
$U^a$ and $V^a$.
$$|\Omega| = \sum_{k=0}^{l} \binom{l+k}{k} \binom{l}{k} \cong (3 + 2\sqrt{2})^{l} \ \ [\text{for very large } l] \cong 5^{l} \qquad (24.2)$$
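The sum in equation 24.2 is easy to evaluate numerically, which gives a feeling for how quickly the search space grows. The following sketch computes it directly; the function name is an assumption for illustration.

```python
from math import comb, sqrt

def alignment_count(l):
    """Evaluate the sum over k of C(l + k, k) * C(l, k) for k = 0..l."""
    return sum(comb(l + k, k) * comb(l, k) for k in range(l + 1))

counts = [alignment_count(l) for l in range(5)]
# Ratio of successive terms for large l, compared with 3 + 2*sqrt(2).
growth = alignment_count(201) / alignment_count(200)
limit = 3 + 2 * sqrt(2)
```

The ratio of successive values approaches 3 + 2√2 (about 5.83) as l grows, so the number of candidate alignments increases by a factor of more than five with each added position.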
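$$f(P_i) = \begin{cases}
w_1(U^a_i, V^a_i), & \text{if } U^a_i = V^a_i \text{ and } U^a_i \ne \text{"−"},\ V^a_i \ne \text{"−"}; \\
w_2(U^a_i, V^a_i), & \text{if } U^a_i \ne V^a_i \text{ and } U^a_i \ne \text{"−"},\ V^a_i \ne \text{"−"}; \\
w_3(U^a_i, -), & \text{if } V^a_i = \text{"−"} \text{ and } U^a_i \ne \text{"−"}; \\
w_4(-, V^a_i), & \text{if } U^a_i = \text{"−"} \text{ and } V^a_i \ne \text{"−"}; \\
f_2, & \text{if } U^a_i = V^a_i = \text{"−"}.
\end{cases} \qquad (24.3)$$
[$w_1$, $w_2$, $w_3$, $w_4$, $f_2$ are alignment score matrices or functions]
$$f(P_i) = \begin{cases}
1, & \text{if } U^a_i = V^a_i \text{ and } U^a_i \ne \text{"−"},\ V^a_i \ne \text{"−"}; \\
-1, & \text{if } U^a_i \ne V^a_i \text{ and } U^a_i \ne \text{"−"},\ V^a_i \ne \text{"−"}; \\
-2, & \text{if } V^a_i = \text{"−"} \text{ and } U^a_i \ne \text{"−"}; \\
-2, & \text{if } U^a_i = \text{"−"} \text{ and } V^a_i \ne \text{"−"}; \\
-3, & \text{if } U^a_i = V^a_i = \text{"−"}.
\end{cases} \qquad (24.4)$$
$$f(P) = \sum_{i=1}^{l} f(P_i) \qquad (24.5)$$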
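The column scores of equation 24.4 and their sum in equation 24.5 can be sketched directly in Python. The function names are assumptions for illustration, and the gap symbol is written as an ASCII hyphen.

```python
GAP = "-"

def column_score(u, v):
    """Score of one alignment column, following equation 24.4."""
    if u == GAP and v == GAP:
        return -3           # both rows hold a gap
    if u == GAP or v == GAP:
        return -2           # a symbol aligned with a gap
    return 1 if u == v else -1  # match or mismatch

def alignment_fitness(row_u, row_v):
    """Fitness of an alignment matrix: sum of its column scores (24.5)."""
    return sum(column_score(u, v) for u, v in zip(row_u, row_v))

# Two matches and two single gaps: -2 + 1 + 1 - 2
small = alignment_fitness("AAG-", "-AGC")
# The two sequences AATTCCGG and ATCG, padded with trailing gaps.
padded = alignment_fitness("AATTCCGG----", "ATCG--------")
```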
to the fitness function values. Two genetic operations, crossover and mutation,
create new chromosomes, called offspring, by altering the composition of genes.
The selection operation creates populations from generation to generation, and
chromosomes with better fitness values have higher probabilities of being
selected for the next generation. After several generations, the algorithm will
hopefully converge to the best solution.
P00 = | A A T T C C G G − − − − |   [initial individual]
      | A T C G − − − − − − − − |
This is the first individual of generation-0, P00. An initial population can
be created from it with some probability function that distributes the gaps
over each row. This is called the initial population of generation-0, Pi0.
P01 = | A A T − C C − G − T − G |
      | A − − G − − T − − C − − |
P02 = | A − T − C − A G − T C G |
      | A − − − − − T G − C − − |
Fitness Evaluation: The fitness function was chosen earlier; it is equation
24.5. The fitness scores of the initial population are measured. The best
individuals are those with the highest scores.
Selection: At any time during the genetic programming process, the solutions
that give the best scores are selected and kept in the population, while the
poor ones are automatically rejected so as to keep the population size within
some specific range. This constitutes the current generation of the population
(solutions).
[Figure (to be drawn): flowchart of the genetic algorithm for pairwise alignment, with the nodes Start; Initial Alignment, P00; Generation of Initial Population, Pr0; Selection; Crossing Over; Mutation; Fitness Evaluation; New Population; Solution; End]
Bioinformatics Tools
Python - Primer
Programming Language for
Bioinformatics
Chapter 26
220 26. Python And Bioinformatics
Chapter 27
222 27. Tools and Libraries for Bioinformatics
Part VI
—Fokhruzzaman
228 28. Prominent Research Areas in Bioinformatics
Chapter 29
Endless Horizon of
Bioinformatics: Future
Directions
—Fokhruzzaman
Chapter 30
—Fokhruzzaman
Thought of Mind
Snippets of Thought: NOT sure if it would become another chapter or not
.. But I want to propose one area to put ALL our imaginations .. like ”the
Crazy Corner with ALL Wild Imaginations” ... Say, while working on the Intro
chapter .. I was reading ... ”Protein is also a necessary component in our diet
bcz animals can not synthesize all the amino acids and must obtain essential
amino acids from food. Through the process of digestion, animals break down
ingested protein into free amino acids that can be used for protein synthesis...”
I assume the Plant kingdom (am I correct here...?) can synthesize all the amino
acids themselves.. so they don’t need external food (as such) like animals...
So may be an wild imagination could be ... we discover some Gene Mutation
Technique (or Protein Synthesis Technique using our dynamic DNA parts...) so
that we, humans, can live without external food ... I imagine our Great Saints
”knew” this Gene Mutation (??) or ”Self-Amino-Acids-Synthesis-Techniques”
like Plants long back ... ;-)...We just need to re-discover that for mere mortals
like us ... ;-).... The Global Food Industry will NOT love my Crazy Idea here
... ;-)...
Study random walk in streets and random driving / traffic behaviors using
Human Genome .... ??
Like .. Labony’s idea of discovering Gold & Diamond Mountains in outer space
through Nano Technology .... ?? Labony .. pls elaborate on this .. frankly .. I
did not remember the full idea now ...
232 30. The Crazy Corner with ALL WILD Imaginations
Appendix A
Bioinformatics
Terminologies
Appendix B
236 B. Amino Acid Lists
Appendix C
Book Layout
• Section I: Introduction...
Index
DNA Landmarks, 70
DNA Map, 70
Restriction Mapping, 70