Bioinformatics and Computational Biology: A Primer for Biologists
Basant K. Tiwary
Department of Bioinformatics, School of Life Sciences
Pondicherry University
Puducherry, India
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
The goal of writing this book is to explain the theoretical foundations and mathematical concepts of bioinformatics and computational biology to biologists who are new to the subject. The book is aimed at undergraduate and graduate students of any biological discipline, such as life sciences, genetics, medicine, agriculture, animal husbandry and bioengineering, in addition to students of bioinformatics and computational biology. Like any other technical subject, bioinformatics has its own language and terms. All terms are explained in simple language, avoiding mathematical jargon. The textbook starts with basic theoretical concepts in bioinformatics and builds on them in a stepwise fashion up to the level at which the application of bioinformatics in medicine, agriculture, animal husbandry and bioengineering becomes apparent. This work examines the underlying principles and methods of bioinformatics applied to answer real biological problems. The last four chapters are completely applied in nature, although the theoretical concepts discussed in the earlier eight chapters are a prerequisite for understanding the applications. The exercises and multiple-choice questions at the end of each chapter will reinforce the understanding of the concepts discussed in the text. The worked-out answers to the exercises in each chapter will provide a thorough understanding of the problems in bioinformatics and their stepwise solutions. Some useful references are added at the end of each chapter for further reading.
I am indebted to all my teachers who have shaped my understanding of interdisciplinary science. I thank my Ph.D. advisor, Prof. Arun K. Ray, who gave me the freedom to explore interdisciplinary and experimental biological research at Bose Institute. I express my heartfelt gratitude to Prof. Wen-Hsiung Li, James Watson Professor at the University of Chicago, for initiating me into the field of computational genomics during my stay in his lab as a visiting faculty member. I am also grateful to all research workers across the world in the field of bioinformatics and computational biology for building the knowledge base of this book. I thank all my colleagues and students at the Department of Bioinformatics, Pondicherry University, for creating a wonderful teaching-learning environment. Special thanks to Dr. R. Krishna for reading and commenting on the chapter on structural bioinformatics. I also thank the team at Springer Nature for making this academic endeavour successful.
Learning Objectives
You will be able to understand the following after reading this chapter:
1.1 Introduction
Biology has undergone a sea change in the last two decades and has transformed itself into a truly quantitative science akin to mathematics, physics and statistics. Although biology was fortunate enough to be enriched by great mathematical minds like G.H. Hardy, R.A. Fisher, J.B.S. Haldane and Sewall Wright during the last century, it remained a largely descriptive natural science with little emphasis on quantitative methods until the end of that century. The advent of fast microcomputers and advanced molecular biology techniques such as PCR, microarrays and next-generation sequencing has led to the emergence of high-throughput data in biology. Surprisingly, the generation of genomic data has surpassed astronomical data in volume and has become the most abundant form of big data available today. These drastic developments in the generation of biological data have necessitated a parallel development of computational methods to analyse the data in a meaningful way, ultimately leading to the emergence of a new discipline known as bioinformatics. Although biology and bioinformatics are currently treated as separate disciplines, the time is not far away when the thin line of division between these two disciplines will gradually disappear in favour of a more holistic biology.
Fig. 1.1 Bioinformatics as an interdisciplinary subject at the interface of biological and physical
sciences
statistical modelling with a strong foundation in probability theory, graph theory, descriptive and inferential statistics, differential equations and statistical programming in the R language and environment. He or she is also expected to regularly maintain and upgrade the systems and servers in the lab. Knowledge of a scripting language like R or Python is an additional advantage. A good understanding of core biological subjects such as genetics, genomics, biochemistry, molecular biology and evolution is a prerequisite for a bioinformatics scientist. Continuous learning of advanced technologies like next-generation sequencing and mass spectrometry is also required. Above all, a bioinformatics scientist should be a highly motivated and dedicated person with a spirit of teamwork and excellent analytical ability.
databases were integrated in 1986–1987 and still exist under the umbrella of the International Nucleotide Sequence Database Collaboration. Many scripting languages were developed in the 1980s for application in bioinformatics. These scripting languages are interpreted whenever launched and, unlike C code, do not require compilation. Perl (Practical Extraction and Reporting Language) was developed in 1987 and soon became a popular language for manipulating biological sequences. Due to its flexibility, it became very popular among bioinformaticians and was improved further in the form of BioPerl in 1996. Today, R and Python are the two major programming languages for bioinformaticians.
The first bacterial genome, that of H. influenzae, was sequenced by Craig Venter and his team in 1995. The human genome project was completed through combined private and public efforts using Sanger sequencing technology in 2003. At present, genome sequencing is much cheaper following the advent of second-generation or next-generation sequencing technologies. The 454 pyrosequencing technology was the first next-generation sequencing platform, but Illumina platforms like HiSeq soon became popular for high-throughput sequencing. This has led to the development of many alignment and assembly programs for the short reads generated by the various sequencing platforms. Today, high-performance computing facilities are provided by various public and private agencies.
Modern biology is now moving from a reductionist phase, which focuses on a single gene or a single protein, to the more holistic approach of systems biology. A mathematical model representing a biological process can be generated from genome, transcriptome, proteome or metabolome data. This endless journey of bioinformatics from sequence analysis to systems biology will continue at a much faster pace in the near future. Computational biology is likely to soon become an integral part of agriculture and medicine worldwide, enabling better food production and healthcare.
This book is intended for biologists who have a basic understanding of mathematical concepts. Bioinformatics is a diverse discipline with the potential to solve real problems in the biological, medical, agricultural, veterinary and bioengineering sciences. This book is a humble attempt to communicate the concepts of bioinformatics to experimental biologists with minimum emphasis on the underlying mathematical principles. The mathematical descriptions are simplified in favour of a better understanding of the biological processes. The first part of this book deals with fundamental concepts in biological databases, statistical computing in R, sequence alignment, structural bioinformatics and molecular evolution. A good grasp of biological databases and sequence alignment is essential for understanding any advanced field in bioinformatics. The R language and environment has become the lingua franca of bioinformatics and computational biology, with the largest collection of statistical packages available for biological data analysis and visualization. I will discuss the molecular structure and dynamics of macromolecules in connection with their
Summary
• Bioinformatics and computational biology are integral parts of holistic biology.
• Bioinformatics deals with development of computational tools for storage and
analysis of biomedical data.
• Computational biology deals with mathematical modelling and simulations for
understanding biological systems.
• Computational thinking is more important than computer programming in devel-
oping bioinformatics and computational biology skills.
Suggested Reading
Moody G (2004) Digital code of life: how bioinformatics is revolutionizing science, medicine, and
business. Wiley, London
Mount D (2004) Bioinformatics: sequence and genome analysis, 2nd edn. Cold Spring Harbor
Laboratory Press, Cold Spring Harbor
Gauthier J, Vincent AT, Charette SJ, Derome N (2019) A brief history of bioinformatics. Briefings in Bioinformatics 20(6):1981–1996
2 Biological Databases
Learning Objectives
You will be able to understand the following after reading this chapter:
2.1 Introduction
Biological data have accumulated at a fast pace in the recent past due to the advent of high-throughput and cheaper next-generation sequencing technologies. Consequently, new databases are being developed in order to manage these ever-increasing biological data. Databases are digital repositories, based on computerized software, for storing information in a system and retrieving it using search tools. Biological databases are an important component of bioinformatics research and are usually well annotated and cross-referenced to other databases. The primary objective of a database is to organize the data in a structured and searchable form allowing easy retrieval of useful data. The simplest form of a database is a single file containing multiple entries of the same set of information. Currently, there are more than a thousand biological databases providing access to multifarious omics data to biologists. Biological databases are classified based on different criteria, such as the level of data coverage and the level of data curation. Based on the extent of data coverage, biological databases fall into two main categories: comprehensive and specialized databases. Comprehensive databases such as GenBank include a variety of data collected from numerous species. On the other hand, specialized databases contain data from one particular species; for example, WormBase contains data on the nematode worm. Similarly, biological databases are classified into two groups, namely primary databases and secondary databases, based on the level of data curation. Primary databases such as GenBank are created from experimentally derived raw data generated and submitted by experimental biologists. On the other hand, secondary databases such as Ensembl (maintained at the EMBL-EBI, UK), the UCSC Genome Browser (maintained at the University of California, Santa Cruz, USA) and TIGR (maintained at The Institute for Genomic Research, Maryland, USA) are highly curated and are usually created from the analysis of various sources of primary data. Some databases such as UniProt have characteristics of both primary and secondary databases. For example, the UniProt database stores peptide sequences generated from sequencing experiments as well as sequences computationally inferred from genomic data.
Since the majority of biological databases do not contain complete information, an integrated database retrieval system like Entrez, maintained by NCBI, provides integrated access to 35 distinct databases. These databases may be grouped into six areas: Literature, Genomes, Genes, Proteins, Health and Chemicals (Table 2.1). Entrez can be searched using Boolean operators (AND, OR and NOT), and useful data can be downloaded in various formats. The most commonly used Entrez databases include PubMed, SRA, Nucleotide, Protein and Structure. Similarly, the Sequence Retrieval System (SRS) provides an integrated database retrieval system for a given search term. BioStudies (www.ebi.ac.uk/biostudies) is a recent public multimodal database launched by EMBL-EBI in 2017; it currently holds metadata descriptions from various biological studies, with a plan to archive all functional genomics data in the near future.
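The Entrez system described above can also be queried programmatically from R. The following is a minimal sketch using the third-party rentrez package (an assumption: rentrez is not part of base R and must be installed separately; the search term is purely illustrative), which searches the Nucleotide database with a Boolean phrase and fetches one record in FASTA format.
> library(rentrez)
> res <- entrez_search(db = "nucleotide", term = "TP53[Gene Name] AND Homo sapiens[Organism]", retmax = 5)
> res$ids                                             # unique identifiers of the matching records
> rec <- entrez_fetch(db = "nucleotide", id = res$ids[1], rettype = "fasta")
> cat(substr(rec, 1, 300))                            # first few lines of the FASTA record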
(a) Nucleic acid databases: All published DNA and RNA sequences are usually deposited in three parallel public databases, namely GenBank (Fig. 2.1), available at the National Center for Biotechnology Information; EMBL (European Molecular Biology Laboratory); and DDBJ (DNA Data Bank of Japan), available at the National Institute of Genetics. These are maintained in the USA, the UK and Japan, respectively, and exchange their data under the International Nucleotide Sequence Database Collaboration (INSDC) framework. The important nucleotide databases are listed in Table 2.2. GenBank has grown exponentially since its inception in 1982 and currently contains more than 2.1 billion nucleotide sequences (Fig. 2.1). The current exponential increase, with a doubling time of 20.8 months, is very close to the doubling time of 18 months proposed by Moore's law. There has been an exponential increase in the number of taxa as well in the publicly available databases (Fig. 2.2). The European Nucleotide Archive (ENA) is maintained by EMBL and provides open access to a wide range of nucleotide sequences, from raw reads to finished genome sequences. The DDBJ Sequence Read Archive (DRA), maintained by the DDBJ, provides free access to raw read data and assembled genomic data from next-generation sequencing platforms. The DNA and RNA sequences are directly submitted to these databases by researchers and are
Fig. 2.1 The growth of nucleotide sequences on a logarithmic scale in the public databases (GenBank/ENA/DRA) from June 1982 to March 2020. The central line is the best least-squares fit, representing a doubling time of 20.8 months, whereas the outer lines indicate a doubling time of 18 months
Fig. 2.2 The exponential growth in the number of taxa in the available public databases
Fig. 2.3 Partial view of GenBank home page showing information on human oxytocin mRNA
There are some useful expression databases hosting gene expression data for functional genomics studies (Table 2.4). High-throughput gene expression data derived from both microarray and RNA-Seq platforms are usually archived in two public databases: the NCBI Gene Expression Omnibus (GEO) and the EBI ArrayExpress (AE). The Gene Expression Omnibus database was started in 2000 and is maintained by NCBI. The overall structure of this database consists of platform records, sample records and series records. Each GEO DataSet is an upper-level object with a unique GEO accession number starting with the prefix GDS. DataSets are represented in terms of both experiment-centric and gene-centric information. The gene-centric information provides the quantitative expression of a single gene across DataSets and is known as a GEO Profile. In addition to gene expression data, GEO contains other functional genomic data such as copy number variations and transcription factor binding. Another public database, ArrayExpress, is maintained by EBI and was launched in
pathways from all domains of life along with chemical compounds, reactions and enzymes. They are useful in complex bioinformatics analyses such as flux balance analysis and cellular modelling. KEGG and MetaCyc are the oldest and most popular metabolic databases. MetaCyc is the largest reference database of experimentally determined metabolic pathways and enzymes from all domains of life. It currently contains 2749 pathways extracted from thousands of publications. PathoLogic, a component of the Pathway Tools program, performs computational prediction of the metabolic pathways in an organism using this reference database. The major limitation of this database is the lack of information regarding protein signalling, disease and developmental process pathways. KEGG (Kyoto Encyclopaedia of Genes and Genomes) is a comprehensive functional knowledge base of genes and genomes at the molecular level and at higher biological levels as well. The KEGG Orthology (KO) database hosts the molecular functions of various genes and proteins. The higher-level biological functions are represented as KEGG pathway maps describing the biological networks of molecular interactions and metabolic reactions. In addition, the KEGG database provides chemical information regarding metabolites and other small molecules in the form of KEGG LIGAND, and health-related information on drugs and diseases in the form of KEGG MEDICUS. However, the KEGG database does not provide information regarding lipid synthesis, cellular location or metabolite signalling. PathBank is a comprehensive pathway database of ten model organisms, including human, rat, cow and yeast. It provides detailed maps of metabolic, signalling, protein signalling, disease, drug and physiological pathways in these model organisms. Reactome is a relational database of signalling, transcriptional regulation, disease and metabolic pathways. It provides a bioinformatics tool for the analysis of pathways for systems biology research. The Human Metabolome Database (HMDB) is a comprehensive electronic knowledge base of the small-molecule metabolites present in the human body, along with hyperlinks to other useful databases such as KEGG, MetaCyc and PDB. The BiGG Models knowledge base is a high-quality genome-scale metabolic model repository containing more than 100 BiGG models of different organisms, including a human-specific model known as Recon3D. In addition, there are 515 strain-specific draft genome-scale models across three organisms. Plant Reactome is a comparative and systems-level plant pathway database that uses rice as a reference species; the pathway knowledge derived from rice is extended to 82 other plant species using gene-orthology prediction. A total of 298 reference pathways related to metabolic pathways, transcriptional networks, hormone signalling pathways and developmental processes are hosted by this database. Ingenuity Pathway Analysis (IPA) is a commercial database developed by QIAGEN for the visualization and analysis of omics data using networks and pathways.
Disease databases are among the most important resources for biomedical research (Table 2.6). The Online Mendelian Inheritance in Man (OMIM) database of NCBI contains useful information regarding human genes and genetic diseases along with gene sequences and associated phenotypes. It provides a complete description of a disease gene, the phenotypes associated with the disease gene and other genes associated with the disease. The Cancer Genome Atlas (TCGA) is a joint programme of the National Cancer Institute and the National Human Genome Research Institute launched in 2006. It hosts more than 2.5 petabytes of genomic, epigenomic, transcriptomic and proteomic data on 33 cancer types. canSAR is the largest public resource of multidisciplinary cancer data for integrative translational research and drug discovery. Similarly, CNCDatabase provides detailed information about predicted non-coding cancer drivers located in promoters, untranslated regions, enhancers and non-coding RNAs. The Human Gene Mutation Database (HGMD) is a collection of all published gene mutations involved in human inherited diseases. The GWAS Central database is the most comprehensive repository of genome-wide association study (GWAS) data, covering more than 1400 phenotypes. It provides integrative access to and graphical visualization of GWAS data collected from more than 3800 studies. HbVar is a database of human genomic sequence changes not only
There are specialized organism-specific databases available for some model animals and for economically important plants and animals (Table 2.7). In addition, specific genomic databases of pathogenic viruses are also available. The Alliance of Genome Resources Portal (www.alliancegenome.org) provides integrated access to the genomes of diverse model species used to study human biology. WormBase, which is one of the founding members of this alliance, is a repository of detailed biological
The Basic Local Alignment Search Tool (BLAST) is the most widely used search tool for sequence databases. It finds regions of local similarity (conserved sequence patterns) between a query DNA or protein sequence and the sequences in a target database. The ultimate aim of a BLAST program is to infer an evolutionary and functional relationship between two DNA or protein sequences. The original version of BLAST was launched in 1990, followed by the development of various variants of BLAST specialized for different types of databases. Table 2.8 describes various variants of BLAST along with their applications. There are three important aspects of a search process: the input query sequence, the target database and the choice of a customized BLAST program. For example, PSI-BLAST is a useful program to identify remote homologs in different species. BLAST finds a local alignment for each match in the database scoring above a certain threshold S. These matches are known as high-scoring segment pairs (HSPs) and are reported along with their E-values. An E-value of 10^-5 is usually taken as the cut-off during a BLAST search, reflecting the fact that a score at least as high as the observed one is expected to occur by chance in only one out of 100,000 matches. The Position-Specific Iterated (PSI)-BLAST searches for remote protein homologs in a protein database in multiple iterative steps using a position-specific scoring matrix. FASTA is a similarity search program like BLAST, and its input sequence format is widely known as the FASTA format. The BLAST-like alignment tool (BLAT) is another efficient program for finding sequences of high similarity (more than 95%) but has less sensitivity to divergent sequences. The BLAT algorithm keeps the index of a complete genome in computer memory, requiring about 2 GB of RAM. However, the sensitivity of this program is significantly lower than that of NCBI-BLAST.
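The E-value cut-off mentioned above can be related to the reported bit score through the standard relationship E = m × n / 2^S', where m is the query length, n is the effective database size and S' is the bit score. A minimal sketch in R (the query length, database size and bit score are hypothetical values chosen only for illustration):
> m <- 300                          # query length in residues (hypothetical)
> n <- 1e9                          # effective database size in residues (hypothetical)
> bit.score <- 55                   # bit score reported for a hit (hypothetical)
> m * n / 2^bit.score               # expected number of chance hits at this score
[1] 8.326673e-06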
2.4 Exercises
Solution
1. Retrieve the protein sequence of the human aromatase from GenBank or
Ensembl.
2. Use this protein sequence as a query sequence to perform PSI-BLAST (Figs. 2.8,
2.9, and 2.10).
2. The TP53 gene encodes a tumour suppressor protein that responds to various kinds of cellular stress. It plays an important role in the prevention of cancer and is also known to be mutated in the majority of human cancers. Find the location and number of alternative transcripts of the TP53 gene in human and infer the evolutionary history of TP53 gene gain and loss in the different genomes available in the Ensembl database.
Fig. 2.9 Part of the Iteration 1 table for the human aromatase search showing closely related sequences
Fig. 2.10 New remote homologs are identified with each iteration of a PSI-BLAST search. Part of the iteration table shows newly identified fish homologous sequences (highlighted in yellow)
Fig. 2.11 Partial view of the Ensembl genome browser showing details of the human TP53 gene
Solution
1. Search the term “TP53 gene” under human species in the Ensembl database.
2. It is located on the reverse strand of Chromosome 17: 7,661,779-7,687,538
(Fig. 2.11).
3. The TP53 gene consists of 27 alternative transcripts in human (Fig. 2.12).
Fig. 2.12 The Ensembl genome browser showing 27 alternative transcripts of TP53 gene
4. View the gene gain/loss tree to see the evolutionary history of the human TP53 gene (Fig. 2.13). The species tree shows significant gene gain events or expansions (indicated by red branch colour), gene loss events or contractions (indicated by green branch colour) and no changes (indicated by blue branch colour). Each node indicates the number of gene family members present in the respective extant or extinct species. In the TP53 gene family tree, two species, namely the Siamese fighting fish (B. splendens) and the African elephant (L. africana), have undergone remarkable expansions, with 27 members and 14 members, respectively. In addition, 16 species show significant expansions and three species show significant contractions.
Answer: 1. c 2. a 3. a 4. a 5. a 6. d 7. a 8. a 9. a 10. b
Summary
• Biological database provides a structured platform for retrieval of useful data.
• Entrez is an integrated database retrieval system for 35 diverse databases.
• GenBank, ENA, DDBJ and Ensembl are major nucleotide databases.
• UniProt Knowledgebase (UniProtKB) provides a comprehensive resource for
protein sequences.
• Protein Data Bank (PDB) is a global archive for protein structures and other
macromolecules.
• GISAID database is a comprehensive genomic resource for different variants of
human SARS-CoV-2 virus.
Suggested Reading
Letovsky SI (1999) Bioinformatics: databases and systems. Springer, Boston
Lesk AM (2004) Database annotation in molecular biology: principles and practice. Wiley,
Chichester
Revesz P (2010) Introduction to databases: from biological to Spatio-temporal. Springer, London
Annual Nucleic Acids Research database issues (2015–2021) Oxford University Press, Oxford
3 Statistical Computing Using R
Learning Objectives
You will be able to understand the following after reading this chapter:
The R language has a variety of data structures such as vectors, character strings, matrices, lists, data frames and classes. The vector is the R workhorse and consists of elements of the same data type, such as integer or character. Scalars, or individual numbers, in R are essentially one-element vectors. Character strings, in turn, are single-element vectors of character mode. An R matrix is a rectangular array of numbers; this mathematical structure has two attributes, rows and columns. An R list consists of multiple values of different data types, so different components can be packaged into a single list that is returned by an R function. The internal structure of a list can be printed with the command str(), where str stands for structure. A data frame is a two-dimensional structure with rows and columns, akin to a matrix. However, unlike in a matrix, each column may have a different mode. For example, one column may be numeric, whereas another column may consist of character strings. A class defines the type of an object in connection with object-oriented programming in R. It is usually defined by the user during R programming. Generic functions operate on an object based on its class.
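A minimal sketch (with made-up values) that creates each of these data structures and inspects them:
> x <- c(2.1, 3.5, 4.8)                                    # numeric vector
> s <- "gene"                                              # character string (one-element character vector)
> m <- matrix(1:6, nrow = 2, ncol = 3)                     # matrix with 2 rows and 3 columns
> l <- list(name = "TP53", expr = x, counts = m)           # list packaging different data types
> str(l)                                                   # print the internal structure of the list
> df <- data.frame(gene = c("TP53", "BRCA1"), expr = c(5.2, 3.9))   # data frame with columns of different modes
> class(df)                                                # the class of the object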
3.3 R Packages
Fig. 3.2 CRAN, a network of ftp and webservers providing access to R codes, documentation and
packages
process installs two basic packages, namely base and stats by default. CRAN
provides thousands of packages for a wide range of statistical analysis and graphics
(Fig. 3.2).
Any session in R starts with choosing the working directory. R imports data from a data file and builds an R object in the workspace. Here, the workspace is an abstract concept referring to the storage of data in RAM during a specific R session. The workspace can be stored permanently in a binary file known as the .RData file. In addition to the .RData file, all the commands executed in a particular R session can be saved in a simple text file known as the .Rhistory file and reloaded again. First, the current working directory should be checked with the function getwd(). The current working directory may be changed with another function, setwd(). R always searches for a file in the working directory and writes files to the same directory. The command line starts with the character > and continuation of a line is indicated by +. The generic function read.table reads data from a file and stores the contents in an R object called a data frame. We may specify the separator as tabulation characters (sep = "\t"), white spaces (sep = " ") or commas (sep = ",").
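For example, a tab-delimited expression table could be read as follows. This is a minimal sketch: the directory ~/bioinformatics, the file expression.txt and its columns are hypothetical and only illustrate the calls described above.
> setwd("~/bioinformatics")                                # choose the working directory
> getwd()                                                  # confirm the current working directory
> expr <- read.table("expression.txt", header = TRUE, sep = "\t")   # read a tab-delimited file into a data frame
> head(expr)                                               # inspect the first rows
> save.image("session.RData")                              # store the whole workspace permanently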
Here, an object gene is created containing five gene expression values from different samples. Simple summaries such as the sum, mean, median, standard deviation and variance of these values can be computed using the respective built-in functions (a minimal sketch follows this paragraph). However, data values are often stored in the form of a matrix consisting of rows and columns.
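A minimal sketch of such an object and its summaries (the five expression values are made up for illustration):
> gene <- c(7.2, 5.9, 6.4, 8.1, 6.7)      # five gene expression values in different samples
> sum(gene)
> mean(gene)
> median(gene)
> sd(gene)
> var(gene)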
>dbinom(k,n,p)
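For instance, the probability of observing exactly 3 successes in 10 trials with a success probability of 0.5 (values chosen purely for illustration) is obtained as:
> dbinom(3, 10, 0.5)
[1] 0.1171875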
The normal distribution is assumed for many statistical analyses, for example of gene expression values obtained using microarray data. The data values in a normal distribution are represented by a bell-shaped curve, or normal density, with mean mu and variance sigma squared. This curve is symmetric around the mean value. We can draw a random sample of size 100 with a population mean of 1.5 and a population standard deviation of 0.6 using the R command
>rnorm(100,1.5,0.6)
In biological research, we build a theory and would like to support our hypothesis with experimental data. Thus, the testing procedure consists of proposing a hypothesis, assuming a statistical distribution and inferring the conclusion. First, we have an idea that a certain drug is effective against a disease; this is known as the alternative hypothesis. Secondly, we propose a null hypothesis that the drug is not effective against the disease, in order to remove bias. A suitable statistic such as a t-value is computed, which is not only a function of the random variables but a function
by broadening the acceptance range, the power of the test is also reduced, making us prone to Type II errors. The power of the test, or the probability of drawing the correct conclusion, can be improved by increasing the sample size prior to conducting the experiments.
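For example, the sample size per group required to reach a desired power can be estimated with the built-in function power.t.test; the effect size, standard deviation, significance level and power below are hypothetical choices for illustration.
> power.t.test(delta = 1.5, sd = 2, sig.level = 0.05, power = 0.8)   # solves for the required n per group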
Parametric tests make some assumptions about the underlying distribution of the data; for example, that the data are collected from a population with a normal distribution. The normality of the data is assessed by the Shapiro–Wilk test, which is based on a Q-Q plot. The Shapiro–Wilk test can be computed in R using a built-in function as follows.
>shapiro.test(data)
z.value<-sqrt(n)*(mean(x)-mu0)/sigma
p.value<-2*pnorm(-abs(z.value),0,1)
Thus, the null hypothesis is rejected when the z-value is either highly positive or highly negative. When the z-value falls within the 95% confidence interval, the null hypothesis is not rejected, and consequently this region is known as the acceptance region.
>t.value<-sqrt(n)*(mean(x)-mu0)/sd(x)
>t.test(x,mu=0)
This one command will generate the t-value, the p-value and the 95% confidence interval for the population mean tested at the value 0. A large value of t indicates that the mean is different from 0, whereas a very small p-value tending towards zero supports the rejection of the null hypothesis.
If the variances of the two groups are equal, the t-value is computed in R using the following script.
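A minimal sketch of such a call, assuming the vectors x and y hold the measurements of the two groups:
> t.test(x, y, var.equal = TRUE)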
Thus, the null hypothesis of equal population means between two groups is
rejected if the p-value is less than 0.05.
>wilcox.test (x,y)
>chisq.test (data)
F = [SSB/(G − 1)] / [SSW/(N − G)]
where SSB is the between-group sum of squares, SSW is the within-group sum of squares, G is the number of groups and N is the total number of observations.
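This is the usual one-way ANOVA F statistic. In R it can be obtained directly with the built-in function aov; the expression values and groups below are made up for illustration.
> expr <- c(5.1, 5.4, 4.9, 6.8, 7.1, 6.5, 8.2, 8.0, 8.5)   # made-up measurements
> group <- factor(rep(c("A", "B", "C"), each = 3))          # three groups of three samples
> summary(aov(expr ~ group))                                # ANOVA table including the F value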
3.5.5 Graphics in R
[Fig. 3.13 panels (plots rendered in the original): a scatterplot of the data against its index, a bar plot, a normal Q-Q plot of sample quantiles against theoretical quantiles, a frequency plot and a pie chart of the nucleotides A, C, G and T]
The arguments xlab and ylab are used to label the x-axis and y-axis, respectively. Moreover, the command par is used to change the default layout of the graphics device. Finally, the plot can be exported by saving it in common file formats like jpg, postscript, tiff, png and pdf.
Some of the important plots (Fig. 3.13) that can be created using high-level
functions are described as follows.
Scatterplot This plot can be used to draw any number of points in a figure window using the command plot. It displays a cloud of points showing the relationship between two variables plotted on the x-axis and the y-axis. For example, the expression of genes in disease and healthy conditions can be visualized using a scatterplot.
Box Plot A box plot displays the distribution of numerical data based on different quartile levels using the command boxplot. These levels are the minimum, first quartile (Q1), median, third quartile (Q3) and maximum. It provides graphical information not only about the symmetry and skewness of the distribution but also about the presence of outliers.
Bar Plot Bar plots are graphical displays of categorical variables in the form of rectangles, created in R using the barplot() function. The height or length of each bar is proportional to the numerical value it represents.
Normal quantile-quantile (Q-Q) plot The normal quantile plot is a graphical display of the theoretical normal distribution on the x-axis against the dataset on the y-axis, used to assess whether the dataset deviates from the normal distribution. If the dataset follows a normal distribution, we can proceed with standard statistical analysis using the t-test or ANOVA.
Pie Chart The pie chart describes data graphically in a circular graph. Each slice of the pie represents the relative proportion of the data. A pie chart is created in R using the pie() function.
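A minimal sketch illustrating these high-level plotting functions on a made-up expression vector:
> expr <- c(22, 25, 24, 28, 30, 27, 23, 26)           # made-up expression values
> plot(expr)                                          # scatterplot against the index
> boxplot(expr)                                       # box plot of the distribution
> barplot(expr)                                       # bar plot of the values
> qqnorm(expr); qqline(expr)                          # normal Q-Q plot with a reference line
> pie(table(c("A", "C", "G", "T", "A", "G")))         # pie chart of nucleotide counts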
The base graphics system of R provides facilities to explore and plot various types of data. The most common package for general plotting beyond base graphics is lattice, developed by Deepayan Sarkar, which is an integral part of the standard R installation. It provides high-level visualization of multivariate data, and its graphical parameters can be customized for a publication-quality display.
3.6 Exercises
1. SRGAP2 and FOXP2 are the key genes involved in extraordinary cognitive
development and speech in human lineage. The expression levels of both genes
in the prefrontal cortex of human brain are given below.
1. Compute the mean and standard deviation of gene expression levels of SRGAP2
and FOXP2 genes.
2. Perform appropriate tests to check the normality and equality of variances in
expression levels of both genes.
3. Perform two-sample t-test to find significant differences in the gene expression
levels of both genes.
Solution:
>SRGAP2<-c(5.54341,5.38406,5.68797,5.48020,5.71367,5.16755,5.54771,5.50315,5.29877,5.26532,5.58837,5.93008,5.85553,5.55625)
>FOXP2<-c(2.53967,3.22162,2.35884,2.04029,2.67516,2.27110,1.62437,2.15451,1.99223,2.74845,2.78851,2.98586,2.44189,2.55143)
> mean(SRGAP2)
[1] 5.537289
> mean(FOXP2)
[1] 2.456709
> sd(SRGAP2)
[1] 0.216255
> sd(FOXP2)
[1] 0.4243888
> shapiro.test(SRGAP2)
Shapiro-Wilk normality test
data: SRGAP2
W = 0.97621, p-value = 0.9469
> shapiro.test(FOXP2)
Shapiro-Wilk normality test
data: FOXP2
W = 0.99442, p-value = 1
> var.test(SRGAP2,FOXP2)
F test to compare two variances
data: SRGAP2 and FOXP2
The p-value of the F-test is 0.02122, which is lower than the significance level of 0.05. Therefore, there is a significant difference between the two variances.
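The two-sample comparison asked for in part 3 can then be carried out with Welch's t-test, which does not assume equal variances (a minimal sketch; the output is omitted here):
> t.test(SRGAP2, FOXP2)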
2. We want to test the hypothesis that the nucleotides of a gene occur with equal probability. The null hypothesis to be tested is that the frequency of each nucleotide is 0.25. Compute the total number of nucleotide bases and the frequency of each nucleotide in the gene, and perform a chi-square test of whether the nucleotides in the gene occur with equal probability. The coding sequence of the gene is as follows.
ATGAACAGCCAGGGGTCAGCCCAGAAAGCAGGGACACTCCTCCTGTTGCTGATATCAAAC
Solution
> library(Biostrings)
>gene<-DNAString("ATGAACAGCCAGGGGTCAGCCCAGAAAGCAGGGACACTCCTCCTGTTGCTGATATCAAAC")
> length(gene)
[1] 60
> alphabetFrequency(gene, baseOnly=TRUE, as.prob=TRUE)
A C G T other
0.3000000 0.2833333 0.2500000 0.1666667 0.0000000
> gene.freq<-alphabetFrequency(gene, baseOnly=TRUE, as.prob=TRUE)
> chisq.test(gene.freq)
Chi-squared test for given probabilities
data: gene.freq
X-squared = 0.30278, df = 4, p-value = 0.9896
Answer: 1. b 2. d 3. a 4. b 5. a 6. b 7. a 8. d 9. c 10. d
Summary
• R language is an object-oriented functional programming language used for
statistical computing.
• Common data structures in the R environment are vectors, character strings,
matrices, lists, data frames and classes.
• R packages are organized libraries for different functionalities available at CRAN
and bioconductor.
• Useful statistical parametric and non-parametric tests can be performed in the R
environment.
• Both high-level and low-level graphics functions are available in the R
environment.
Suggested Reading
Gerrard P, Johnson RM (2015) Mastering scientific computing with R. Packt Publishing,
Birmingham
Sinha PP (2014) Bioinformatics with R cookbook. Packt Publishing, Birmingham
Hahne F, Huber W, Gentleman R, Falcon S (2008) Bioconductor case studies. Springer, New York
Lewis PD (2010) R for medicine and biology. Jones and Bartlett Series, Burlington
Wickham H (2016) ggplot2: elegant graphics for data analysis. Springer, Cham
4 Sequence Alignment
Learning Objectives
You will be able to understand the following after reading this chapter:
4.1 Introduction
nucleotide alignment. The best approach is to first align the sequences at the amino acid level and then reverse-translate the aligned residues to the corresponding nucleotide sequence alignment. Some programs, like DAMBE and TRANSALIGN, can perform this kind of analysis. However, it is still very difficult to find optimal alignments of non-coding nucleotide sequences such as introns and repeats.
During alignment, the alignment score for each pair of nucleotides or residues is computed using a scoring matrix. The scoring matrix provides the score of each possible pair of nucleotides or residues. For example, the NCBI-BLAST program assigns a score of +1 for identical nucleotides and −3 for non-identical nucleotides for highly conserved sequences. The scoring matrices for amino acid sequences are developed considering their observed physicochemical properties and actual substitution frequencies in nature. The point accepted mutation (PAM) matrix is derived from observed substitution rates. PAM matrices signify the substitution probabilities of amino acids over a constant unit of evolutionary change. For example, PAM-1 is one PAM unit, which represents one substitution, or accepted point mutation, per 100 amino acid residues. This matrix is usually used to compare closely related protein sequences. On the other hand, the PAM-1000 matrix may be useful for comparing distantly related sequences. The PAM-250 matrix is the one most commonly used in the alignment of protein sequences. The BLOSUM matrix is an alternative to the PAM matrix, derived from observed substitution rates among related proteins. To build a BLOSUM matrix, similar protein sequences are first clustered and then substitution rates between clusters are computed. Like PAM matrices, BLOSUM matrices are derived for protein sequences with variable degrees of similarity. However, the numbering of BLOSUM matrices is the inverse of the numbering of PAM matrices. For example, a higher-numbered BLOSUM matrix such as BLOSUM-62 is suitable for comparing closely related protein sequences, whereas a lower-numbered BLOSUM matrix such as BLOSUM-30 is appropriate for divergent protein sequences having a distant relationship. The most commonly used matrices, BLOSUM-62 and BLOSUM-80, are appropriate for comparing protein sequences having approximately 62% and 80% similarity, respectively.
A dot plot is the simplest graphical method for the visual comparison of two sequences; it uses a two-dimensional matrix to identify regions of high similarity. In a dot plot, the first sequence is plotted on the x-axis, whereas the second sequence is plotted on the y-axis. The presence of identical residues at a pair of positions in the two sequences is indicated by a dot in the plot space. Adjacent regions of high similarity appear as a diagonal in the dot plot. The abundance of dots along a diagonal line is an indicator of multiple identical amino acids in homologous sequences. Thus, this method is useful for finding similarity in large genomic sequences and for identifying internal repetitive sequences in protein sequences. However, this method is currently not very popular in bioinformatics analysis and is widely used for educational purposes only.
Fig. 4.1 Optimal global pairwise alignment using the Needleman–Wunsch algorithm. The optimal traceback path is shown in the scoring matrix table
Fig. 4.2 Three optimal local pairwise alignments using the Smith–Waterman algorithm. The lowest score in this matrix is zero; each alignment starts from the highest score and is tracked back to a score of zero. The traceback paths of the three alignments are shown in specific colours
value from the cell located diagonally above and to the left and add either the match or the mismatch score to obtain a final score indicating the alignment of the two residues. The maximum score obtained from all three possibilities is taken as the final score. Finally, the optimal path of the alignment is traced back, starting from the lower right-hand corner to the upper left-hand corner, following the alignment scores of the matrix. This algorithm finds the best alignment based on the overall alignment score. In order to avoid excessive gaps in the alignment, gaps are usually penalized by subtracting a gap penalty from the overall score. The gap penalty includes both a gap opening penalty and a gap extension penalty, collectively known as affine gap penalties. Gaps are usually found in blocks in an alignment, so a simple linear penalty for gaps is not the method of choice. Instead, affine gap penalties are implemented in many algorithms, where the gap opening penalty is always higher than the gap extension penalty. Thus, the introduction of sequential gaps with an affine gap penalty leads to a substantial reduction in the overall cost of the penalty.
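To make the scoring procedure concrete, the following is a minimal sketch in R of the Needleman–Wunsch scoring matrix with a simple linear gap penalty. The function name, the example sequences and the match, mismatch and gap scores are illustrative assumptions, not taken from the text; affine gap penalties would require additional matrices.
needleman_wunsch <- function(seq1, seq2, match = 1, mismatch = -1, gap = -2) {
  x <- strsplit(seq1, "")[[1]]
  y <- strsplit(seq2, "")[[1]]
  n <- length(x)
  m <- length(y)
  M <- matrix(0, nrow = n + 1, ncol = m + 1)
  M[, 1] <- gap * (0:n)                      # first column: gaps in seq2
  M[1, ] <- gap * (0:m)                      # first row: gaps in seq1
  for (i in 1:n) {
    for (j in 1:m) {
      s <- if (x[i] == y[j]) match else mismatch
      M[i + 1, j + 1] <- max(M[i, j] + s,        # diagonal move: align x[i] with y[j]
                             M[i, j + 1] + gap,  # vertical move: gap in seq2
                             M[i + 1, j] + gap)  # horizontal move: gap in seq1
    }
  }
  M[n + 1, m + 1]                            # optimal global alignment score
}
needleman_wunsch("GATTACA", "GCATGCT")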
Smith and Waterman (1981) proposed a modification of the Needleman–Wunsch algorithm in order to allow local alignment (Fig. 4.2). A local alignment finds the best-matching regions within the two sequences being compared. Here, the alignment path does not have to start at the lower right-hand corner and end at the upper left-hand corner. Instead, it can start and end anywhere within the matrix. A zero term is additionally included in the Smith–Waterman algorithm, so no value in the scoring matrix is below zero. If a score below zero is obtained for any move, a zero is placed in that position of the matrix. The traceback process begins from the highest value in the matrix, which represents the alignment score, and ends after reaching a value of zero.
4.3 Multiple Sequence Alignment
the alignment process. All popular multiple sequence alignment programs are listed
in Table 4.1.
Table 4.2 Common programs for visualization and editing of multiple sequence alignment
Editor Operating system URL
BIOEDIT Windows https://bioedit.software.informer.com
GENEDOC Windows https://genedoc.software.informer.com
JALVIEW Windows, Linux and Mac https://www.jalview.org/
MACCLADE Mac http://macclade.org/macclade.html
SEAVIEW Windows, Linux and Mac http://doua.prabi.fr/software/seaview
Fig. 4.3 BioEdit sequence alignment editor showing multiple sequence alignment of FOXP2
protein in mammals
4.5 Exercises
1. Find the optimal global alignment and local alignments of the two sequences "ATGAAATATACAAGTTATATCTTGGCTTTTCAGCTCTGCATCGTTTTGGGTTCTCTTGGC" and "ATGAACGCTACACACTGCATCTTGGCTTTGCAGCTCTTCCTCATGGCTGTTTCTGGCTGT" using the Biostrings package in the R environment.
Solution
We will use the pairwiseAlignment function in Biostrings to perform both optimal global and local alignments. The scores for a match and a mismatch are +1 and −1, respectively. The gap opening penalty is 5 and the gap extension penalty is 2.
>library(Biostrings)
>s1<-DNAString("ATGAAATATACAAGTTATAT")
>s2<-DNAString("ATGAACGCTACACACTGCAT")
>mat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1,
baseOnly = TRUE)
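The global and local alignments themselves can then be obtained along the following lines (a sketch; depending on the installed Biostrings version, the gap penalties may need to be supplied as negative values):
>globalAln <- pairwiseAlignment(s1, s2, substitutionMatrix = mat, gapOpening = 5, gapExtension = 2, type = "global")
>localAln <- pairwiseAlignment(s1, s2, substitutionMatrix = mat, gapOpening = 5, gapExtension = 2, type = "local")
>globalAln
>localAln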
Human:
MNSLVSWQLLLFLCATHFGEPLEKVASVGNSRPTGQQLESLGLLAPGEQSLPCTERKPAATARLSRRGTSLSPPPESSGSPQQPGLSAPHSRQIPAPQGAVLVQREKDLPNYNWNSFGLRFGKREAAPGNHGRSAGRG.
Chimpanzee:
MNSLVSWQLLLFLCATHFGEPLEEVASVGNSRPTGQQLESLGLLAPGEQSLPCTERKPAATARLSRRGTSLSPPPESSGSPQPGLSAPNSRQIPAPQGAVLVQREKDLPNYNWNSFGLRFGKREAAPGNHGRSAGRG.
Compare the similarity between human and chimpanzee sequences based on a dot
plot using seqinr package in the R environment.
>library(seqinr)
> kisspeptin<-read.fasta("kisspeptin.fasta")
> attach(kisspeptin)
> dotPlot(Human, Chimpanzee, wsize = 10, wstep = 6, nmatch = 6)
Figure 4.4 indicates very high similarity between the two neuropeptide sequences from human and chimpanzee, as the dots are located on the main diagonal of the dot plot.
Answer: 1. c 2. a 3. b 4. d 5. d 6. c 7. b 8. b 9. b 10. c
Summary
• Amino acid alignment is preferred over nucleotide alignment for protein coding
sequences.
• The scoring matrix for amino acid sequences is based on their observed physico-
chemical properties and actual substitution frequencies in nature.
• A pair-wise global alignment is performed using Needleman–Wunsch algorithm.
• A pairwise local alignment is performed using Smith and Waterman algorithm.
Suggested Reading
Nguyen K, Guo X, Pan Y (2016) Multiple biological sequence alignment: scoring functions,
algorithms and evaluation. Wiley, New York
Rosenberg MS (2009) Sequence alignment: methods, models, concepts, and strategies. University
of California Press, Berkeley
DeBlasio D, Kececioglu JD (2017) Parameter advising for multiple sequence alignment. Springer,
Cham
Chao K-M, Zhang L (2009) Sequence comparison: theory and methods. Springer, London
5 Structural Bioinformatics
Learning Objectives
You will be able to understand the following after reading this chapter:
5.1 Introduction
wide gap between the abundant protein sequence information and protein structure
information of millions of protein sequences.
A protein structure can be studied at four levels: primary, secondary, tertiary and quaternary structure. The primary structure of a protein consists of the polypeptide sequence of amino acids, which starts with the N-terminal amino acid and ends with the C-terminal amino acid. There are 20 standard amino acids present universally among all organisms. The amino acids are linked by peptide bonds, in which an amide linkage forms between the NH2 group of one amino acid and the COOH group of the adjacent amino acid. The peptide bond is unique in having a rigid, partial double-bond character with restricted rotation. The polypeptide backbone forms various conformations known as secondary structures. Alpha helices, beta-pleated sheets and turns or bends are examples of secondary structures (Fig. 5.1). The secondary structure of a protein further folds in three-dimensional space to form the tertiary structure. When a protein consists of multiple polypeptides (subunits), the subunits associate to form the quaternary structure of the functional protein.
Fig. 5.2 Schematic flowcharts of homology modelling, ab initio prediction and threading
very good model, with less than 1 Å RMSD for the main-chain atoms, using homology modelling when the identity level is above 50%. On the other hand, low-accuracy models with multiple types of structural errors are expected when the identity level is below 30%.
Structure prediction methods that do not rely on the availability of a template structure are known as de novo methods. Ab initio prediction methods are a subset of de novo methods that depend on energy functions or on information inferred from other protein structures. This approach assumes that the native state of the protein lies at the global free-energy minimum and searches the conformational space of available protein tertiary structures for low free-energy conformations of a given amino acid sequence. The Rosetta method is a popular ab initio method that relies on an ensemble of short structural fragments obtained from the Protein Data Bank (PDB). A Monte Carlo search is used to assemble the structural fragments based on a scoring function. This method has been applied widely for modelling protein structures and structurally variable regions such as insertions and loops. Another popular method is TASSER, which relies on a threading approach to find local fragments that are subsequently connected into a full-length model by means of a random walk followed by a refinement process. Although de novo methods have predicted accurate models of some small proteins, they suffer from some inherent limitations, such as huge computational demands and the poor quality of models for large proteins.
5.3.3 Threading
Protein fold recognition and threading methods are useful for generating three-dimensional models based on evolutionarily distant fold templates. Threading is the most popular approach to fold recognition; it works by fitting a sequence onto the backbone coordinates of available protein structures. Threading has become a generic term for fold recognition, although it is strictly a sub-class of fold recognition methods. Using fold recognition, we can identify proteins with known structures within the Protein Data Bank (PDB) that share common folds with the target sequence. These proteins are subsequently used as templates to model the folds of the target sequence. The main focus of this method is to assign folds to the target sequence even if there is low sequence identity to known protein structures. The basic concept of threading is just the reverse of comparative modelling. In comparative modelling, we search for a protein structure best fitting the target sequence. In contrast, in threading each available potential protein structure is fitted with the target sequence. Here, each target sequence is compared with a library of potential fold templates based on energy potentials and other similarity scores; the templates having either the lowest energy score or the best similarity score are assumed to best fit the fold of the target sequence.
model. There are four steps commonly involved in integrative modelling. First, the input data are collected from various experimental methods and other statistical measures such as atomic statistical potentials and molecular mechanics force fields. A scoring function is developed to evaluate the consistency of the model with the input data. Multiple models are generated and their scores are improved using various optimization algorithms such as Monte Carlo methods, genetic algorithms and gradient descent. The good-scoring models obtained after optimization represent the conformations of the system. Finally, the ensemble of all good-scoring models is clustered and analysed to obtain the overall precision and accuracy of the ensemble. This also gives a clue to the most informative experiment to be conducted in the next iteration.
either defines the upper and lower limit of the score or divides into different classes
requiring a specific treatment.
Some of the most common scoring functions are contact scoring functions, distance-dependent scoring functions, accessible surface scoring functions and combined scoring functions. The contact scoring function is the simplest form of scoring function, owing to the small size of its matrix, which can be filled with few experimental data. For example, a contact potential is a square two-dimensional matrix of 20 rows and 20 columns for a body definition based on the alpha carbons of the 20 standard amino acids. These functions are generally derived symmetrically for the beta carbons or side-chain centroids of the standard amino acids. Distance-dependent scoring functions are the most widely used in the prediction of protein structure. Such a function is represented by a matrix of four dimensions: the first two dimensions cover the total number of bodies defined, the third dimension distinguishes between local and non-local interactions, and the fourth dimension describes the distance ranges, or bins, used to represent interactions. Accessible surface scoring functions tend to capture the propensity of the defined bodies to interact with the solvent; they describe solvent accessibility at the residue or atomic level. The combined scoring function combines information from the accessible surface approach with the contact and distance-dependent approaches. The contact and distance-dependent approaches deal with intramolecular protein interactions, whereas the accessible surface approach is concerned with the interactions between the solvent and the protein. A normalization step is required to combine these two independent scoring functions.
There are various applications of scoring functions in protein structure prediction. A scoring function helps in selecting the most accurate model from a set of alternative models. Moreover, the fold assessment of a predicted model is performed using a scoring function to evaluate whether the fold is correct. The geometrical accuracy of the overall model, as well as of individual regions, can also be assessed using scoring functions. The stability of a protein structure can likewise be predicted using scoring functions, for example in mutant screening.
The primary goal of model assessment is to evaluate the overall accuracy of the predicted model. In addition, it also selects the most accurate model from a set of alternative models and evaluates the accuracy of different regions of the model. There are four common approaches to model assessment, namely physics-based energies, knowledge-based potentials, combined scoring functions and clustering approaches. Chemical force fields are the physics-based energy functions of particles in a system and are derived from both quantum mechanical calculations and experimental findings. The force-field energy is represented by terms for chemical bonds between interacting atoms (i.e. distances, angles and dihedral angles) and for interactions between non-bonded atoms (i.e. electrostatic and van der Waals interactions). These physics-based approaches are useful in selecting near-native structure models from a set of predicted models. The second approach is the
Fig. 5.3 The workflow of a molecular dynamic (MD) simulation of a PDB structure (PDB:2dfd)
generating a large dataset of trajectories for further graphical analysis
Fig. 5.4 The RMSD curves (in nm) of two independent MD simulation trajectories (wild-type and mutant peptides) over 35 nanoseconds
Fig. 5.5 The RMSF curves (in nm) of two independent MD simulations (wild-type and mutant peptides) showing the positional variation of the first 300 atoms
Fig. 5.6 Docking of a small ligand into the crystal structure of its receptor to form a stable complex
(PDB:5toa)
popular docking programs such as Autodock, Glide, GOLD and LigandFit have
been developed on the fundamental principles of physical chemistry. Molecular
docking is useful in finding and optimizing lead compounds and thereby proved as
a powerful tool of drug discovery. A docking program has two important
components: a docking algorithm to search the configurational and conformational
degrees of freedom and a scoring function for evaluation. The docking algorithm
tends to find the global energy minimum by searching the potential energy landscape
extensively. Only translational and rotational degrees of freedom are allowed for the
ligand interacting with receptor active site in rigid docking. In contrast, flexible
ligand docking involves torsional degrees of freedom for the ligand by allowing
variation in ligand dihedral angles. The scoring function, in general, evaluates not
only the steric complementarity between the ligand and its receptor but their
chemical complementarity as well. A popular docking program Autodock
implements simulated annealing Monte Carlo algorithms having tens of thousands
of steps during each cycle. First, the ligand is placed randomly onto the binding site
of the receptor and allowed to move towards a global energy minimum. The
structure is minimized after each move and subsequently the energy of the new
structure is measured. The simulation may consist of several cycles with decreasing
temperature in each cycle. The lowest energy in the previous cycle becomes the
starting point of the next cycle.
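The annealing cycle described above can be illustrated with a minimal sketch in R. The one-dimensional energy function, temperature schedule and step size below are invented purely for illustration; they stand in for a real force field and are not Autodock's actual parameters.

# Toy energy landscape standing in for a docking scoring function (hypothetical)
energy <- function(x) (x - 2)^2 * (x + 1)^2 + 0.5 * sin(5 * x)

simulated_annealing <- function(x0, n_cycles = 10, steps_per_cycle = 1000,
                                T0 = 5, cooling = 0.7, step_size = 0.5) {
  x_best <- x0
  for (cycle in seq_len(n_cycles)) {
    temp <- T0 * cooling^(cycle - 1)   # temperature decreases in each cycle
    x <- x_best                        # lowest-energy state of the previous cycle starts the next cycle
    for (i in seq_len(steps_per_cycle)) {
      x_new <- x + rnorm(1, sd = step_size)                  # random move of the "ligand"
      dE <- energy(x_new) - energy(x)
      if (dE < 0 || runif(1) < exp(-dE / temp)) x <- x_new   # Metropolis acceptance rule
      if (energy(x) < energy(x_best)) x_best <- x            # keep track of the lowest energy found
    }
  }
  x_best
}
simulated_annealing(x0 = runif(1, -3, 4))   # approximate position of the global energy minimum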
The protein-ligand binding affinity is measured in terms of free energy gains. The
largest contributor to this free energy gain emerges from the displacement of water
from the hydrophobic regions of the receptor. Another important source of free energy on ligand binding is the formation of hydrogen bonds between the ligand and the receptor protein, a gain achieved when the ligand displaces water bound to the receptor. The purpose of a docking program is to use a global
scoring function that can be equally applied to all protein-ligand complexes. The
scoring function, in general, consists of three force field terms for van der Waals interactions, hydrogen bonding and electrostatic interactions, and a term representing desolvation effects. The strength of the interaction between a ligand and a protein, expressed as the binding free energy ΔG_bind, can be written as

ΔG_bind = E_MM − T·S_solute + G_solvent

where E_MM is the internal (molecular mechanics) energy of the molecule determined from the bond lengths, the bond angles and the dihedral angles, together with the enthalpic contributions to the free energy from the van der Waals and electrostatic interactions. The solute entropy S_solute consists of terms for translational, rotational, vibrational and conformational entropy. The solvent free energy G_solvent includes both polar and nonpolar terms and represents both the entropic and enthalpic effects of the solvent. The accuracy of a
docking program is usually evaluated by comparison of the predicted ligand-
receptor complex structure with the experimental structure obtained by X-ray crys-
tallography. During such comparison, each ligand is docked to its own cognate
receptor conformation (self-docking). Sometimes the accuracy of the program is also tested by docking a ligand to non-cognate receptor conformations, such as the apo structure of the receptor or a structure co-crystallized with a different ligand. This method of testing the accuracy of a docking program is known as cross-docking.
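As a purely numerical illustration of the free energy decomposition discussed above, the short R sketch below combines the molecular mechanics energy, the solute entropy and the solvent free energy into a binding free energy; all values and units are invented for illustration and carry no physical meaning.

# Hedged numeric sketch of ΔG_bind = E_MM − T·S_solute + G_solvent (hypothetical values, kcal/mol)
binding_free_energy <- function(E_mm, S_solute, G_solvent, Temp = 298.15) {
  E_mm - Temp * S_solute + G_solvent
}
# E_MM: bonded + van der Waals + electrostatic terms; S_solute in kcal/(mol K); G_solvent: polar + nonpolar
binding_free_energy(E_mm = -45.2, S_solute = -0.05, G_solvent = 12.3)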
The reliability of a 3D-QSAR model depends upon the correct superposition of ligands and the available structural information regarding the target protein. The 4D-QSAR is an advanced form of multidimen-
sional QSAR (mQSAR) where energetically feasible binding modes are included in
a 4D-dataset such as different ligand conformations, ligand poses and protonation
states. Using this dataset, the actual bioactive conformation or true binding mode of
the ligand is identified based on the QSAR algorithm. Each ligand molecule is
represented as a single entity in the 3D-QSAR, whereas the ligand molecule is
represented in an ensemble of conformations, poses, protonation states, tautomeric
forms and stereoisomers in the 4D-QSAR. The binding pocket of a protein structure
is not only modified by a bound ligand (induced-fit effect) but other biochemical
properties of a receptor protein such as hydrophobicity, hydrophilicity and solvent
accessibility are also altered. In some proteins like GPCR proteins, the structure of
target protein is not known and realistic induced-fit effect cannot be determined.
Thus, a fifth dimension has been added in 5D-QSAR, where we can identify realistic induced-fit protocols in addition to the 3D topology of the binding pocket. The
protein target is strongly affected by solvation phenomena (ligand desolvation,
solvent stripping and proton transfer) when bound to a small-molecule ligand.
Thus, 6D-QSAR accounts for different solvation models simultaneously and thereby
allows more realistic simulation of the ligand-protein binding process. 6D-QSAR is
implemented in a receptor modelling program known as Quasar, which performs optimization based on a genetic algorithm. The binding energy in Quasar is determined from the ligand–receptor interaction energy, where E_ligand-receptor is composed of E_electrostatic, E_van der Waals, E_hydrogen bonding and E_polarization.
Raptor is another popular program for 6D-QSAR based on a different scoring function. The underlying scoring function is composed of directional terms for hydrogen bonding and hydrophobicity, with solvation effects considered implicitly; the binding energy in Raptor is calculated from these terms.
5.11 Exercises
> library(bio3d)
> help(package=bio3d)
> MDH<-read.pdb("2dfd")
> MDH
Call: read.pdb(file = "2dfd")
Total Models#: 1
Total Atoms#: 10213, XYZs#: 30639 Chains#: 4 (values: A B C D)
Protein Atoms#: 9195 (residues/Calpha atoms#: 1260)
Nucleic acid Atoms#: 0 (residues/phosphate atoms#: 0)
Non-protein/nucleic Atoms#: 1018 (residues: 814)
Non-protein/nucleic resid values: [ CL (7), HOH (799), MLT (4), NAD (4) ]
> bs<-binding.site(MDH)
> bs
$inds
Call: NULL
Atom Indices#: 1310 ($atom)
XYZ Indices#: 3930 ($xyz)
+ attr: atom, xyz
$resnames
[1] "LEU 12 (A)" "GLY 13 (A)" "SER 15 (A)" "GLY 16 (A)" "GLY 17 (A)"
[6] "ILE 18 (A)" "GLY 19 (A)" "TYR 38 (A)" "ASP 39 (A)" "ILE 40 (A)"
[11] "ALA 41 (A)" "PRO 81 (A)" "ALA 82 (A)" "GLY 83 (A)" "VAL 84 (A)"
[16] "PRO 85 (A)" "ARG 86 (A)" "THR 91 (A)" "ARG 92 (A)" "ASP 93 (A)"
[21] "LEU 95 (A)" "ASN 99 (A)" "ILE 102 (A)" "LEU 106 (A)" "ILE 122 (A)"
[26] "ALA 123 (A)" "ASN 124 (A)" "VAL 126 (A)" "PRO 145 (A)" "ASN 146 (A)"
[31] "VAL 151 (A)" "LEU 154 (A)" "ASP 155 (A)" "ARG 158 (A)" "HIS 182 (A)"
[36] "ALA 183 (A)" "GLY 184 (A)" "GLY 216 (A)" "THR 217 (A)" "VAL 220 (A)"
[41] "SER 228 (A)" "ALA 229 (A)" "THR 230 (A)" "MET 233 (A)" "LYS 261 (A)"
[46] "ILE 281 (A)" "LEU 12 (B)" "GLY 13 (B)" "SER 15 (B)" "GLY 16 (B)"
[51] "GLY 17 (B)" "ILE 18 (B)" "GLY 19 (B)" "TYR 38 (B)" "ASP 39 (B)"
[56] "ILE 40 (B)" "ALA 41 (B)" "PRO 81 (B)" "ALA 82 (B)" "GLY 83 (B)"
[61] "VAL 84 (B)" "PRO 85 (B)" "ARG 86 (B)" "ARG 92 (B)" "LEU 95 (B)"
[66] "ASN 99 (B)" "ILE 102 (B)" "LEU 106 (B)" "ILE 122 (B)" "ALA 123 (B)"
[71] "ASN 124 (B)" "VAL 126 (B)" "VAL 151 (B)" "LEU 154 (B)" "ASP 155 (B)"
[76] "ARG 158 (B)" "HIS 182 (B)" "ILE 191 (B)" "VAL 198 (B)" "ASP 199 (B)"
[81] "PHE 200 (B)" "LEU 205 (B)" "GLY 216 (B)" "THR 217 (B)" "VAL 220 (B)"
[86] "SER 228 (B)" "ALA 229 (B)" "THR 230 (B)" "MET 233 (B)" "LEU 12 (C)"
[91] "GLY 13 (C)" "SER 15 (C)" "GLY 16 (C)" "GLY 17 (C)" "ILE 18 (C)"
[96] "GLY 19 (C)" "TYR 38 (C)" "ASP 39 (C)" "ILE 40 (C)" "ALA 41 (C)"
[101] "PRO 81 (C)" "ALA 82 (C)" "GLY 83 (C)" "VAL 84 (C)" "PRO 85 (C)"
[106] "ARG 86 (C)" "THR 91 (C)" "ARG 92 (C)" "ASP 93 (C)" "LEU 95 (C)"
[111] "ASN 99 (C)" "ILE 102 (C)" "THR 105 (C)" "LEU 106 (C)" "ILE 122 (C)"
[116] "ALA 123 (C)" "ASN 124 (C)" "VAL 126 (C)" "VAL 151 (C)" "LEU 154 (C)"
[121] "ASP 155 (C)" "ARG 158 (C)" "HIS 182 (C)" "ALA 183 (C)" "GLY 184 (C)"
[126] "GLY 216 (C)" "THR 217 (C)" "VAL 220 (C)" "SER 228 (C)" "ALA 229 (C)"
[131] "THR 230 (C)" "MET 233 (C)" "LYS 261 (C)" "SER 262 (C)" "GLN 263 (C)"
[136] "LEU 12 (D)" "GLY 13 (D)" "SER 15 (D)" "GLY 16 (D)" "GLY 17 (D)"
[141] "ILE 18 (D)" "GLY 19 (D)" "TYR 38 (D)" "ASP 39 (D)" "ILE 40 (D)"
[146] "ALA 41 (D)" "PRO 81 (D)" "ALA 82 (D)" "GLY 83 (D)" "VAL 84 (D)"
[151] "PRO 85 (D)" "ARG 86 (D)" "THR 91 (D)" "ARG 92 (D)" "ASP 93 (D)"
[156] "LEU 95 (D)" "ASN 99 (D)" "ILE 102 (D)" "LEU 106 (D)" "VAL 121 (D)"
[161] "ILE 122 (D)" "ALA 123 (D)" "ASN 124 (D)" "VAL 126 (D)" "VAL 151 (D)"
[166] "LEU 154 (D)" "ASP 155 (D)" "ARG 158 (D)" "HIS 182 (D)" "ALA 183 (D)"
[171] "GLY 184 (D)" "ILE 191 (D)" "VAL 198 (D)" "ASP 199 (D)" "PHE 200 (D)"
[176] "LEU 205 (D)" "GLY 216 (D)" "THR 217 (D)" "VAL 220 (D)" "SER 228 (D)"
[181] "ALA 229 (D)" "THR 230 (D)" "MET 233 (D)"
$resno
[1] 12 13 15 16 17 18 19 38 39 40 41 81 82 83 84 85 86 91
[19] 92 93 95 99 102 106 122 123 124 126 145 146 151 154 155 158 182 183
[37] 184 216 217 220 228 229 230 233 261 281 12 13 15 16 17 18 19 38
[55] 39 40 41 81 82 83 84 85 86 92 95 99 102 106 122 123 124 126
[73] 151 154 155 158 182 191 198 199 200 205 216 217 220 228 229 230 233 12
[91] 13 15 16 17 18 19 38 39 40 41 81 82 83 84 85 86 91 92
[109] 93 95 99 102 105 106 122 123 124 126 151 154 155 158 182 183 184 216
[127] 217 220 228 229 230 233 261 262 263 12 13 15 16 17 18 19 38 39
[145] 40 41 81 82 83 84 85 86 91 92 93 95 99 102 106 121 122 123
[163] 124 126 151 154 155 158 182 183 184 191 198 199 200 205 216 217 220 228
[181] 229 230 233
$chain
[1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
[19] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
[37] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B" "B" "B"
[55] "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
[73] "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "C"
[91] "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C"
[109] "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C" "C"
[127] "C" "C" "C" "C" "C" "C" "C" "C" "C" "D" "D" "D" "D" "D" "D" "D" "D" "D"
[145] "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D"
[163] "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D" "D"
[181] "D" "D" "D"
> plot.bio3d(MDH$atom$b[MDH$calpha], sse=MDH, typ="l", ylab="B-factor")
Fig. 5.7 Residue B-factors for malate dehydrogenase (PDB ID 2DFD) with secondary structure
annotations in the marginal regions
> library(bio3d)
> ids <- c("2dfd","5zje")
> raw.files <- get.pdb(ids)
trying URL 'https://files.rcsb.org/download/2dfd.pdb'
Content type 'application/octet-stream' length unknown
downloaded 920 KB
trying URL 'https://files.rcsb.org/download/5zje.pdb'
Content type 'application/octet-stream' length unknown
downloaded 2.5 MB
> files <- pdbsplit(raw.files, ids)
> pdbs <- pdbaln(files,exefile="msa")
> pdbs$id <- substr(basename(pdbs$id),1,6)
> seqidentity(pdbs)
Fig. 5.8 Sequence identity between different chains of human malate dehydrogenase and lactate
dehydrogenase
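The structural comparisons summarized in Figs. 5.9, 5.10 and 5.11 can be reproduced along the following lines with standard bio3d functions; the exact arguments behind the published figures are assumptions.

rd <- rmsd(pdbs, fit = TRUE)                       # pairwise RMSD between all aligned chains (Fig. 5.9)
hist(rd, breaks = 20, xlab = "RMSD (Angstrom)")    # distribution of pairwise RMSD values (Fig. 5.10)
hc <- hclust(as.dist(rd))                          # cluster chains by structural similarity
plot(hc, labels = pdbs$id, main = "RMSD cluster dendrogram")   # Fig. 5.11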
Fig. 5.9 Root mean square deviation (RMSD) between different chains of human malate dehydro-
genase and lactate dehydrogenase
Fig. 5.10 A histogram of RMSD between different chains of human malate dehydrogenase and
lactate dehydrogenase
Fig. 5.11 A RMSD cluster dendrogram based on RMSD between different chains of malate
dehydrogenase and lactate dehydrogenase
Answer: 1. c 2. b 3. c 4. a 5. a 6. d 7. d 8. d 9. a 10. c
Summary
• Homology modelling of a protein structure is performed using an evolutionarily related template protein structure from the PDB.
• Ab initio modelling depends on energy functions or information obtained from
other protein structures.
• Threading assigns folds to the target protein based on common folds in the template PDB structures.
• Integrative modelling is an iterative process starting with a proposed initial model
followed by several rounds of refinement and validation.
• Large-scale determination of protein structures for a systems-level view of the protein structural world is known as structural genomics.
• The CATH and SCOP databases describe protein structural classifications covering all protein structures deposited in the PDB.
• Molecular dynamics simulations provide useful insights into the mechanistic basis of protein folding, protein–ligand interactions and protein dynamics.
• Molecular docking is useful in identification and optimization of lead compounds
during a drug discovery process.
Suggested Reading
Schwede T, Peitsch M (2008) Computational structural biology: methods and applications. World
Scientific, Hackensack
Buxbaum E (2007) Fundamentals of protein structure and function. Springer, Cham
Becker OM, Karplus M (2006) A guide to biomolecular simulations. Springer, Dordrecht
Rigden DJ (2009) From protein structure to function with bioinformatics. Springer, Dordrecht
Lesk AM (2001) Introduction to protein architecture: the structural biology of proteins. Oxford
University Press, Oxford
Molecular Evolution
6
Learning Objectives
You will be able to understand the following after reading this chapter:
6.1 Introduction
Motoo Kimura proposed his landmark neutral theory of molecular evolution in 1968
which is central to the principles of molecular evolution. The rate of molecular evolution among proteins varies widely, from the slowest evolving histone IV (about 10^-11 substitutions per site per year) to the fastest evolving fibrinopeptides (about 9 × 10^-9 substitutions per site per year). Based on this observation, Kimura and Ohta proposed that functionally less important molecules, or parts of such molecules, are likely to evolve faster than more important ones. In other words, functionally important
molecules or their parts are selectively constrained due to their detrimental effect on
the fitness of the organism. The major evolutionary force involved in genetic
substitutions is stochastic fixation of mutations. The majority of these substitutions
are product of random fixation of neutral and nearly neutral mutations. The synony-
mous nucleotide changes and changes in nucleotides in introns, pseudogenes, other
non-coding and nonregulatory regions are examples of neutral mutations. Some
nonsynonymous mutations having no effect on the protein functions are also treated
as neutral mutations. Since effective population size is usually very small in
Fig. 6.1 Different forms of homology between genes based on speciation and duplication events
homology due to exon shuffling. Thus, orthologous and paralogous genes should be
carefully selected for phylogenetic analysis in order to understand speciation and
duplication events, respectively. Two non-homologous enzymes may secondarily
acquire similarity at the active site during evolution due to similar functional
constraint which is often mistaken for homology. This evolutionary process is
known as convergent or parallel evolution.
The genetic or evolutionary distances between all pairs of aligned sequences are computed in the form of a distance matrix. A pair of sequences derived from a common ancestor is likely to diverge over time under the influence of various evolutionary forces. This divergence between two sequences is measured by the genetic distance. Genetic distance is an indicator of the divergence between sequences and provides useful information for the inference of a phylogenetic tree. The substitution process is generally modelled as a stochastic or random event. One needs to specify an appropriate statistical model of substitution prior to computing genetic distances between two nucleotide or protein sequences. The simplest measure of genetic distance (d) is the proportion of nucleotide sites that differ between the two sequences. This measure is also known as the p-distance or the observed distance. However, it is not informative about the actual number of substitutions that have occurred when the level of divergence between the two sequences is high. Multiple nucleotide substitutions at a particular site over an evolutionary timescale usually lead to the saturation of such sites in the DNA sequences.
Another approach to measure expected genetic distances between a pair of
sequences is to apply a likelihood function L(d). This likelihood is defined as the probability of observing the two sequences given the distance d. The value of the distance d that maximizes L(d) is known as the maximum likelihood estimate (MLE) of the genetic distance.
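Both the observed and the model-corrected distances can be computed in R with the ape package; the sketch below uses ape's bundled woodmouse alignment as a stand-in dataset rather than the sequences analysed in this chapter.

library(ape)
data(woodmouse)                                # example DNA alignment distributed with ape
dist.dna(woodmouse[1:3, ], model = "raw")      # p-distance: observed proportion of differing sites
dist.dna(woodmouse[1:3, ], model = "JC69")     # corrected distance (the MLE under the Jukes-Cantor model)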
There is a variety of nucleotide substitution rate matrices, ranging from a very simple model like the Jukes–Cantor (JC69) model to highly complex models. In the JC69 model, the equilibrium frequencies of the four nucleotides are equal (25%) and each nucleotide has the same probability of being replaced by any other. In the general time reversible (GTR) model, all eight free parameters of a reversible nucleotide rate matrix are explicitly specified. The free parameters are usually unknown and must be estimated from the data. Thus, it is always desirable to reduce the number of free parameters by
including some constraints in the underlying substitution process. The Tamura–Nei
(TN93) model considers only two independent rate parameters, the ratio (κ) between
transition and transversion rates and the ratio (γ) between two types of transition
rates. If we assume that the purine and pyrimidine transition rates are equal (γ = 1), the
Tamura–Nei model reduces to a simpler model known as the HKY85 model. The HKY85 model can be further reduced to the Kimura two-parameter (K80) model by assuming uniform nucleotide frequencies of 25% for each base. When κ = 1, the HKY85 model reduces to the F81 model and the K80 model reduces to the simplest model, the Jukes–Cantor model. It is always advisable to choose a simple model like the Jukes–Cantor or Kimura two-parameter model rather than a complex model like the GTR model for phylogenetic analysis of closely related nucleotide sequences. The best-fit evolutionary model
of a particular dataset is selected using different statistical approaches such as
hierarchical likelihood ratio test, Akaike Information Criterion (AIC) and Bayesian
Information Criterion (BIC). The likelihood ratio test (LRT) compares the log likelihoods of two nested models to find the better fit and is defined as LRT = 2(l1 − l0), where l1 and l0 are the maximum log-likelihood values under the complex (parameter-rich) and the simple (less parameter-rich) model, respectively. However,
AIC and BIC are applicable to both nested and non-nested models. A fully automated model selection process to find the best-fit model is implemented for nucleotide sequences and protein sequences in the programs jModelTest and ProtTest, respectively. ProtTest selects the best model of amino acid replacement. These
evolutionary models are based on the amino acid substitution matrices computed
from a large dataset of protein families. The common amino acid replacement
models used for protein sequence analysis are Jones–Taylor–Thornton (JTT),
Dayhoff, BLOCKS Substitution Matrices (BLOSUM62) and Whelan and Goldman
(WAG) matrices.
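The likelihood ratio test defined above takes only a few lines of R once the two log likelihoods are available; the values below are hypothetical, and the degrees of freedom correspond to comparing JC69 against HKY85.

l0 <- -2456.8                             # hypothetical maximum log likelihood of the simpler model (JC69)
l1 <- -2448.2                             # hypothetical maximum log likelihood of the richer nested model (HKY85)
LRT <- 2 * (l1 - l0)                      # likelihood ratio test statistic
pchisq(LRT, df = 4, lower.tail = FALSE)   # HKY85 has four extra free parameters; a small p-value favours the richer model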
It is well known that nucleotide substitution rates vary greatly at different
positions in DNA sequences. For instance, the third codon position mutates faster
than the first and second codon positions. This rate heterogeneity at different codon
positions may influence the measurement of genetic distance. A gamma (Γ) distribution model with mean 1 and variance 1/α accommodates varying rates over sites through the shape parameter α. The gamma distribution is bell shaped when the value of α is greater than one, representing weak rate heterogeneity over sites. On the other hand, the gamma distribution is L-shaped when the value of α is less than one, representing strong rate heterogeneity over sites. Four to eight discrete
categories of gamma distribution are usually considered in phylogenetic analysis to
approximate the continuous gamma distribution.
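The effect of the shape parameter can be visualized in R by plotting the gamma density with mean 1 (shape = rate = α), as assumed in phylogenetic rate models:

curve(dgamma(x, shape = 0.5, rate = 0.5), from = 0, to = 3, ylab = "density")  # alpha < 1: L-shaped, strong rate heterogeneity
curve(dgamma(x, shape = 5, rate = 5), from = 0, to = 3, add = TRUE, lty = 2)   # alpha > 1: bell-shaped, weak rate heterogeneity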
The first step of a distance-based method is the computation of a matrix of pairwise distances. The second step involves clustering algorithms to infer the tree
topology based on evolutionary distances. Clustering methods such as UPGMA
(unweighted-pair group method with arithmetic means) usually generate an
ultrametric tree assuming molecular clock. The ultrametric trees are the rooted
trees with all the end nodes having equal distance from the root. The tree is
constructed in sequential order by clustering those OTUs closely related to each
other followed by more distant OTUs. However, the UPGMA method is highly
sensitive to errors due to unequal rates of substitutions in different lineages. Since the
ordinary clustering methods have serious limitations, additive distance trees are the
better alternatives to ultrametric trees. Additive distance trees are unrooted trees in
which the genetic distance between a pair of OTUs is equal to sum of the lengths of
connecting branches. This kind of tree is always a better fit to the genetic distances in
the absence of clock-like behaviour in the sequence data. An additive tree can be
constructed using minimum evolution (ME) method that minimizes the length of the
tree which is the sum of the lengths of its branches. The length of the tree is
computed from a matrix of genetic distances. The main drawback of this method is that it must search all possible topologies to find the minimum-length tree, which becomes an impractical computational task with more than about ten sequences. A better solution to this computational challenge is to use a heuristic approach to find the ME tree using clustering without assuming clock-like behaviour. This improved method is known as neighbour joining (NJ) and is the most popular distance-based method used in phylogenetics. The distance-based approach is best applied to phylogenetics by combining NJ with bootstrapping. The reliability of an inferred tree,
especially of a specific clade, is usually evaluated using bootstrap analysis. Bootstrapping is a very efficient statistical method to approximate the underlying sampling distribution by resampling from the original data. In phylogenetics, bootstrapping is implemented on the original alignment by sampling nucleotide sites with replacement, reconstructing the phylogenetic tree and checking for the presence of the original nodes in the resampled set of phylogenetic trees (Fig. 6.3). This process is usually repeated 200 to 2000 times, and the proportion of bootstrap replicates containing each node is known as the bootstrap value. The bootstrap value indicates the statistical confidence supporting a monophyletic clade and is usually labelled at the corresponding node. A bootstrap value of 70% or more indicates good confidence in a specific clade, whereas clades with lower bootstrap values are treated with caution and are usually collapsed into a multifurcating phylogenetic tree. Jackknifing is
an alternative approach to bootstrapping in evaluating specific clades in a phyloge-
netic tree. Here, half of the sites are randomly purged from the original sequences to
generate numerous new samples. The proportion of resampled trees containing a given clade is known as the jackknife value, and a clade with a jackknife value of 70% or more is treated with good confidence. The NJ method is very fast even for phylogenetic analysis of
hundreds of sequences and is an appropriate method for reconstructing large trees.
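A neighbour joining tree with bootstrap support can be obtained in ape roughly as follows; the woodmouse data, the K80 model and 200 replicates are illustrative choices, not a prescription.

library(ape)
data(woodmouse)
tr <- nj(dist.dna(woodmouse, model = "K80"))        # neighbour joining tree from a distance matrix
bp <- boot.phylo(tr, woodmouse,
                 function(x) nj(dist.dna(x, model = "K80")), B = 200)   # resample sites with replacement
plot(tr)
nodelabels(bp)                                      # number of replicates (out of 200) supporting each node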
Maximum parsimony is one of the most useful methods for molecular phylogenetics; it is based on discrete character states and does not rely on an explicit model of evolution. The method searches for a tree, or a collection of trees, requiring the minimum number of genetic changes from a common ancestor to its descendants. It is equally effective whether the rate of evolution is fast or slow. The method changes the topology of a tree randomly in a stepwise manner until a parsimony score is obtained that cannot be improved further. An exhaustive search for the best tree is done by evaluating all possible trees, but this is feasible only for up to about 11 sequences. An alternative search method, known as the branch-and-bound method, is effective for alignments containing 12 to 25 sequences. The search can be improved further by adding a heuristic method or neighbour joining. A heuristic is an approximate method that attempts to find near-optimal solutions. First, an initial tree is generated using a greedy algorithm and then subjected to several rounds of perturbations to yield the most optimal tree. Stepwise addition is the most widely used greedy algorithm for generating an initial tree but rarely yields a globally optimal tree. Thus,
three common methods of tree-rearrangement perturbations or branch swapping are
nearest-neighbour interchange (NNI), subtree pruning and regrafting (SPR) and tree
bisection and reconnection (TBR). These methods are effective for an alignment up
to 100 sequences. However, a hill climbing algorithm like branch swapping is likely
to be trapped in local minima.
The major advantage associated with this method is that it is free from unrealistic
model assumptions. The limitations of this method are its high computational cost, its oversimplified implicit model of evolution and its tendency to be inaccurate when substantial evolution has occurred. The parsimony method takes into account only the informative sites during phylogenetic analysis and does not consider information from the non-informative sites. Sometimes the maximum parsimony method generates an incorrect topology due to the artificial clustering of long or short branches into a common clade, phenomena known as long-branch attraction and short-branch attraction, respectively.
In maximum likelihood analysis, the tree search also employs branch-swapping moves such as tree bisection and reconnection (TBR) and subtree pruning and regrafting (SPR). The
certainty of a maximum likelihood tree can be assessed by comparing maximum
likelihood values using likelihood ratio test (LRT). Similarly, statistical support to a
specific branch is assessed by either nonparametric bootstrapping or jackknifing.
The Bayesian phylogenetic inference is based on our prior beliefs about the topology
of a phylogenetic tree. In case the background information regarding the topology of
a tree is lacking, equal probability is assigned to each possible tree topology and such
prior probability is known as uninformative prior. We need a molecular sequence
alignment data and an evolutionary model in order to update this prior probability to
a posterior probability distribution. Bayes’ rule is implemented to compute the
posterior probability distribution. The posterior probability stipulates the probability
of an individual tree given the prior, the model and the data. Most of the posterior probability is concentrated on a single tree or a few possible trees out of a large tree space when the sequence data are informative. It is practically impossible to compute the posterior probability analytically or by simple random sampling because the posterior probability is concentrated in a very small part of a huge parameter space. Therefore, the posterior
probability distribution (Fig. 6.4) is usually estimated using the Markov Chain
Monte Carlo (MCMC) sampling. Markov chains have a tendency to converge
towards an equilibrium state irrespective of starting value. Thus, our focus is to set
up a Markov chain converging on our posterior probability distribution. All plausible
trees are sampled in order to find the range of a parameter, e.g. substitution rates in
MCMC analysis. The MCMC algorithm samples the entire distribution avoiding any
suboptimal solution. It proposes a new set of parameter values, including the tree topology, in every step (generation); a newly proposed state is always accepted if it is better than the previous state and is accepted with a probability proportional to the ratio of their posterior probabilities if it is worse. The present state is retained in case of rejection and the process is repeated. The simulation process continues for a long duration, moving from one tree topology to another. The early phase of the simulation run is known as the burn-in, and samples from this period are discarded to avoid the influence of the starting value. The
posterior clade probabilities are used as a statistical support to a clade in Bayesian
phylogenetic inference in contrast to the bootstrap value used in other phylogenetic
methods.
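The sampling idea can be caricatured with a minimal Metropolis sketch in R for a single parameter; the prior, the toy likelihood and the proposal width below are invented assumptions, and real programs such as MrBayes additionally sample tree topologies and many other parameters.

# Minimal Metropolis sampler for one substitution-rate parameter (toy model)
log_posterior <- function(rate) {
  if (rate <= 0) return(-Inf)
  dgamma(rate, shape = 2, rate = 2, log = TRUE) +                 # prior on the rate
    sum(dpois(c(3, 5, 4, 6), lambda = rate * 10, log = TRUE))     # toy likelihood of observed changes
}
n_gen <- 10000
chain <- numeric(n_gen)
rate <- 1                                              # arbitrary starting value
for (gen in seq_len(n_gen)) {
  proposal <- rate + rnorm(1, sd = 0.1)                # propose a new state
  if (log(runif(1)) < log_posterior(proposal) - log_posterior(rate)) rate <- proposal
  chain[gen] <- rate                                   # record the current state of the chain
}
posterior_sample <- chain[-(1:2000)]                   # discard the burn-in generations
quantile(posterior_sample, c(0.025, 0.5, 0.975))       # posterior summary of the rate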
The molecular clock hypothesis is based on the assumption that the rate of molecular
evolution is approximately uniform over evolutionary timescale. This hypothesis is
fully consistent with the arguments of neutral theory of evolution. Emile
Zuckerkandl and Linus Pauling proposed this concept in 1962 by comparing the
amino acid changes in the various protein sequences including haemoglobin in
different species with their ages estimated from fossil records. Overall, there was a linear trend in amino acid differences with respect to evolutionary time. Although there
was a constant rate of molecular evolution across the species, each protein had a
characteristic rate of molecular evolution (Fig. 6.5). The fibrinopeptides had a faster
rate of molecular evolution in comparison to extremely slow rates in cytochrome c
and histones. The differences in characteristic rates of evolution of these proteins can
be explained by the proportion of neutral sites in the protein sequences. The proteins
showing faster rate of evolution are likely to have a greater proportion of neutral
sites. The molecular clock ticks at irregular intervals and is probabilistic in nature.
The molecular clock estimates are prone to two types of errors. The first type of error is caused by the inherently stochastic nature of the substitution process, leading to a large imprecision in molecular date estimation. Variable mutation rates, population size and the proportion of sites under differential selection pressure are the underlying causes of the second type of error. The presence of a global molecular clock in a gene is usually tested by the relative rate test, Tajima's test and the likelihood ratio test. Since a global molecular clock is usually lacking in the majority of genomic sequences, a local molecular clock may instead be tested in specific lineages, allowing rate variation across lineages (a relaxed clock). The molecular clock is useful in estimating the date of origin of a viral disease in the absence of any fossil record. The evolutionary
history of an epidemic disease is reconstructed from available extant virus samples.
The viral molecular clocks are usually calibrated using the stored viral samples.
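Under a strict clock, a divergence date follows from a calibrated rate by simple arithmetic; the numbers below are hypothetical.

d <- 0.042      # genetic distance between two viral sequences (substitutions per site)
r <- 1.5e-3     # calibrated substitution rate (substitutions per site per year, per lineage)
d / (2 * r)     # divergence time in years; the factor 2 counts both lineages since their split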
Fig. 6.5 The molecular clock of three different proteins ticking at variable rates
The evolution of organismal complexity and the origin of novel gene function are
likely to be linked with gene duplication and whole genome duplication. Susumu Ohno proposed the hypothesis that the duplication of genes and whole genomes has played an important role in the evolution of organismal complexity. He further
suggested that the duplication of a gene is followed by retention of original function
by one copy and adoption of new function by another copy. The paralogs that
emerged in the vertebrate genome through whole genome duplication are known
as Ohnologs. The expansion in the number of genes in genomes is caused by various
processes such as whole genome duplication, tandem gene duplication and segmen-
tal duplication. Consequently, large genomes allow more functional diversification
of genes and creation of gene families leading to formation of complex gene
networks in an organism. For example, amphioxus, an ancestral cephalochordate lineage, usually has small gene families, which are singletons in many cases. On
the other hand, advanced vertebrates such as mammals have three to four copies of
genes per family. The two-rounds of genome duplication (2R) hypothesis was
proposed as an explanatory model for the presence of large number of gene families
in vertebrates. This model suggests that early genome duplication occurred in the
genome of a cephalochordate-like ancestor (1R) followed by a second round of
genome duplication (2R) leading to the formation of four sets of genomes. This 2R
model is based on the empirical evidence obtained during genomic analysis of
Natural selection frequently acts on the genes and genomic sequences in an organism
and thereby causes some genomic variations that are likely to enhance the overall
fitness of individuals carrying these variations. These molecular footprints left on the
nucleotide and protein sequences by the natural selection in the past can be identified
using various statistical approaches. The relative rate of silent and replacement
fixations in the molecular sequences is an important information to infer the past
action of natural selection. A mutation in a gene that confers a fitness benefit to the species is likely to come under positive selection, which can take different forms.
Positive selection is usually restricted to a small region of a gene and may point to a
functionally important region or sites of a selected gene. There are two primary
forms of positive selection: directional and diversifying selection. The action of
directional selection on a particular gene is manifested in the form of concerted substitution towards a specific amino acid residue, which is ultimately fixed in a population after a long span of time (a selective sweep). The emergence of a specific mutation conferring resistance in HIV-1 against antiretroviral drugs is an example of a selective sweep. On the other hand, diversifying selection maintains an amino
acid diversity at a specific site in a population. This type of natural selection operates
on the codon positions of HIV-1 that are the targets of the human immune response.
There are two major approaches to identify natural selection on a gene. In order to
perform selection analysis on a certain gene, we need a multiple sequence alignment
of orthologs from different species or strains along with underlying phylogenetic tree
of these species or strains. In case of recombination, each non-recombinant segment
of the gene is represented by a separate phylogenetic tree. One needs to be very
careful during multiple sequence alignment to preserve all codons and avoid any
frameshift. There are two methods to identify positive selection on a gene based on
the predictions of Kimura’s neutral theory of evolution. The first approach deals with
comparison of nonsynonymous and synonymous changes in a gene. The synony-
mous (silent) changes are likely to be neutral and a greater rate of nonsynonymous
(amino acid replacement) changes relative to synonymous changes indicate the
action of positive selection. It is estimated as ω or dN/dS or Ka/Ks ratio for the
coding sequences. The ω is measured as the observed number of nonsynonymous
substitutions per nonsynonymous site (dN) over the synonymous substitutions per
synonymous site (dS) (Fig. 6.6). The ω is expected to be equal to one under the
assumption of neutral evolution due to the fact that the synonymous changes are
Fig. 6.6 The directional shift in ω values away from or towards the neutral value (ω = 1) indicates the impact of natural selection on a gene. Positive selection operates when the ω value is more than one, whereas purifying selection shifts the ω value towards zero. Relaxed selection is identified in a lineage by a shift of ω towards the neutral value in comparison to another related lineage
occurring at more or less the same rate as the amino acid (nonsynonymous) changes. Essential protein sequences allow only minor replacement changes in the amino acid sequence, and the majority of amino acid
changes are eliminated by natural selection. This kind of natural selection is known
as the negative or purifying selection and estimated as an ω value significantly less
than one. An ω value close to zero indicates an exceptionally strong selective constraint on
the protein sequence. However, amino acid replacements for some proteins are also
beneficial for the organism under selection. The genes encoding these proteins show
an ω value significantly more than one and the selection operating is inferred as
positive or directional selection. A high value of ω indicates the recurrent rounds of
selection on a gene rather than a single selective sweep. The genes under positive
selection are usually involved in male reproduction (e.g. human protamine gene),
host’s immune response to pathogen (e.g. human major histocompatibility complex
(MHC) locus) and neural development in primates. The estimate of ω varies from
site to site and from branch to branch in a phylogenetic tree during a selective event.
The positive selection is estimated based on the variation of ω across the sites and/or
lineages. In site models, two maximum likelihood models, the first model (positive
selection) allowing a class of codons in the alignment to evolve with ω more than one
and the second model (neutral model) not allowing the same, are compared using
likelihood ratio tests. If the two models are found to be significantly different based on the chi-square distribution, the neutral model is rejected in favour of the positive selection model. In branch-site models, episodic positive selection is detected on specific branches of a phylogenetic tree. Relaxed selection, or the relaxation of positive or purifying selection, is also identified on a specific lineage as a shift of the ω value towards one in comparison to a closely related lineage.
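Pairwise ω values can be estimated in R from an in-frame coding alignment; the file name below is hypothetical, and the dnds() function of ape is used here simply as one convenient implementation rather than the specific method assumed elsewhere in this chapter.

library(ape)
cds <- read.dna("cds_alignment.fasta", format = "fasta")  # hypothetical codon-aware (in-frame) alignment
omega <- dnds(cds)      # pairwise dN/dS ratios
omega                   # values > 1 suggest positive selection; values << 1 suggest purifying selection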
When molecular sequences are sampled from different individuals in a popula-
tion, the ω value reflects the ratio of replacement to silent polymorphism in a
population. The ω value is likely to be one in the absence of positive and negative
selection, and any significant statistical deviation in this value indicates the action of
natural selection. Natural selection might not have fixed the beneficial mutation or
removed the deleterious mutation in a population during recent evolutionary history.
Thus, the McDonald–Kreitman test, which compares the distribution of
nonsynonymous and synonymous polymorphisms in a particular lineage with the
ratio of nonsynonymous to synonymous differences fixed between lineages or
species, is widely used to detect recent positive selection. However, saturation of
substitution rates especially at the synonymous site for deep tree branches may affect
the estimation of selection in case of fast evolving sequences. However, branch-site
models are insensitive to the biases introduced by synonymous site saturation and
are suitable for analysis of distant species. In addition, there are indices to detect substitution saturation, and the third codon position is often removed from the analysis when synonymous sites are saturated. Moreover, this test is extremely conservative in the assessment of positive selection as it provides an average estimate over the entire gene. This limitation can be overcome by using the entire phylogeny instead of sequence pairs, at an additional computational cost. Purifying or positive selection may be relaxed on certain genes, shifting the ω values towards one (the neutral value). Relaxed selection is determined based on whether the intensity
of natural selection is relaxed or intensified in two subsets of branches of a phyloge-
netic tree. The selection pressure either significantly increases or decreases in a
foreground lineage with respect to a background lineage and is measured as a model
parameter K (relaxation parameter).
The second approach is based on the predictions made by the neutral theory on
allele or haplotype frequency in a particular population or between two populations.
For example, if positive selection has swept a mutation to high frequency in a certain genomic region, the targeted genomic region is expected to show low sequence diversity, a surplus of rare alleles and a greater amount of linkage disequilibrium than expected under the neutral theory. Selection operating on a particular population in comparison to other populations eventually leads to a greater degree of population differentiation than expected under neutral evolution. The most common test statistics used to measure these patterns in a population are Tajima's D, Wright's FST and Fu's W. However, demographic history, population bottlenecks and expanding human populations may generate spurious signatures of positive selection under this approach.
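Tajima's D, for example, can be computed in R with the pegas package; the woodmouse alignment is again used only as a stand-in for population-level sequence data.

library(pegas)          # loads ape as a dependency
data(woodmouse)
tajima.test(woodmouse)  # returns D with p-values under the normal and beta approximations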
The evolution of a protein structure can be explained through a sequence space. Each
possible amino acid sequence of a protein is treated as node and a single amino acid
mutation between each sequence is represented by an edge in a sequence space.
Since each node has specific physical and functional features, a genotype-phenotype space is created comprising all possible protein sequences. The protein sequence follows
the edges in evolutionary trajectories through genotype-phenotype space. This
approach helps us in understanding the underlying evolutionary mechanism across
genotype-phenotype space in the emergence of diversity and novel functions of
The recent global pandemic of coronavirus disease 2019 (COVID-19) has reinforced
our interest in understanding the mechanisms involved in initiation and exponential
spread of viral epidemics in human population. The evolutionary properties of
viruses involved in invading a new host species and evading new vaccines are
largely unknown. The emergence of novel viruses is a microevolutionary process
often involving cross-species transmission. Natural selection is a powerful force in
virus evolution and operates on these viruses during replication and production of
progeny in the host cell. The rates of mutation and of subsequent fixation of mutations in the population (i.e. substitution) are many-fold higher in viruses than in the cellular genes of the host species. Sexual reproduction in viruses occurs through two
distinct mechanisms: recombination and reassortment. Recombination process in
viruses involves coinfection of a single host cell by two viruses and generation of a
hybrid molecule during replication. In reassortment, on the other hand, two or more segmented viruses sometimes infect a single cell and the resulting progeny virus carries segments
from multiple ancestries. Although recombination in a virus often leads to a new
beneficial genetic combination, mutation plays a crucial role in adaptive evolution in
viruses. For example, the resistance to antiviral drugs in human immunodeficiency
virus (HIV) can be attributed to mutation alone rather than recombination. The intra-
host population size of viruses is very large considering both the huge number of
viruses per host cell and the large number of infected host cells. For instance, HIV-1
can infect 10^7 to 10^8 cells in a patient, producing 10^10 virions in a single day. Viruses lack functionless regions such as introns and pseudogenes in their noncoding genomes, suggesting no significant role of genetic drift in viral evolution.
RNA viruses are fast-evolving pathogens and have the ability to accumulate rich genetic diversity in a short period of time. The underlying evolutionary pattern can
be understood using phylogenetic analysis of homologous sequences from diverse
types of extant viral species or strains. Since recombination is a common phenomenon in viruses, multiple phylogenetic trees describe the complete model of viral evolution, with one tree representing each non-recombinant fragment of homologous sequences. The failure to account for recombination may not only lead to a distorted phylogenetic analysis but also produce an incorrect estimate of natural selection. The recombination breakpoints can be detected in a sequence alignment using the GARD program, based on the phylogenetic incongruence among segments. Alternatively,
RDP is a popular program to identify breakpoints in a sequence alignment using
different methodologies. Natural selection acts on the RNA viruses with great
efficiency due to their extremely large and well-mixed populations. Thus, positive
selection is a very common mode of adaptive evolution in RNA viruses and the host
genes responding to viral infections as well. The current methods of dN/dS analysis need repeated fixation of nonsynonymous changes at a particular site in order to infer positive selection. This type of positive selection is known as diversifying selection. In contrast, selection driving a single amino acid change in a specific lineage is known as directional selection, which is hard to detect using the dN/dS method.
Directional selection is a rampant adaptive process in the emergence of new viruses.
Most mutations especially at the nonsynonymous sites in RNA viruses like HIV are
deleterious and consequently removed by the purifying selection. Although synony-
mous sites are generally treated as selectively neutral, synonymous sites in RNA
viruses may not be selectively neutral. For instance, the polymerase basic 2 (PB2) protein-encoding gene of influenza A is highly conserved, even at synonymous sites, due to its crucial role in viral packaging. A major change in the nucleotide composition of a virus is expected
when a virus jumps from one species to another species. For example, influenza A
virus underwent a major change in the nucleotide composition during transmission
from birds to humans. The nucleotide composition of viruses can vary widely within the same family due to the different hosts and vectors utilized during transmission. Epistatic interactions, although mostly antagonistic, are prevalent in viral evolution, possibly due to the very common use of overlapping reading frames.
Coronaviruses are zoonotic pathogens with a positive-sense and single-stranded
RNA genome having origin in wild animals. The genome size of coronaviruses is
extraordinarily large, with sizes up to 32 kb. The genomic organization of all coronaviruses is similar, with two large overlapping open reading frames, ORF1a and ORF1b, coding for the polyproteins pp1a and pp1ab, respectively. Further
processing of these polyproteins produces 16 non-structural proteins (nsp1-nsp16).
In addition, four types of structural proteins, namely spike, envelope, membrane and
nucleoproteins, are coded by some ORFs in the remaining part of the genome. Spike
proteins are key surface glycoproteins having very crucial role in interacting with
distinct host cell receptors. A wide range of accessory proteins is also encoded by the genomes of different viruses. Coronaviruses have lower mutation rates than other RNA viruses due to the proofreading activity of a 3′-to-5′ exoribonuclease. However, this lower mutation rate is well compensated by a high rate of virus replication in hosts. In addition, coronaviruses expand their genomes through frequent gene gains and losses, and this genome expansion is largely enabled by their exceptionally high replication fidelity. The genome
expansion has facilitated acquisition and maintenance of novel genes encoding a
large number of accessory proteins performing crucial functions such as enhanced
virulence, suppression of immune responses and adaptation in a specific host. Thus,
the gene losses and gains in coronaviruses are key evolutionary phenomena in the
emergence of novel viral phenotypes. Coronavirus genome codes for an enzyme
phosphodiesterase (PDE) which is responsible for blocking the interferon-induced
antiviral immune responses in the host and consequently increases the overall fitness
of viruses. Interestingly, like other viruses, coronaviruses frequently acquire additional genes from their hosts. For example, the viral enzyme hemagglutinin esterase
and the N-terminal domain of the spike protein are derived from cellular lectins of
the host. The structural components of the coronavirus genome have received much
evolutionary attention because these proteins are exposed on the viral surface to
induce the immune responses in the human host. The spike proteins of coronaviruses
can adapt to exploit diverse types of cellular receptors in hosts. This key adaptive
feature in coronaviruses might play an important role in frequent host jump. It is still
a mystery how these binding affinities of the spike proteins to cellular receptors have
evolved in coronaviruses. The ongoing pandemic of novel pneumonia widely known
as coronavirus disease 2019 (COVID-19) is caused by a new human
betacoronavirus, SARS-CoV-2, a close relative of SARS (severe acute respiratory
syndrome)-like viruses. There is about 79% similarity between SARS-CoV-2 and
SARS-CoV at the nucleotide level. However, this sequence similarity is a little lower, about 72%, in the spike protein. This virus has a zoonotic origin and appears to be more infectious than any other coronavirus known to date. Most of the closely
related viruses to SARS-CoV-2 are found in bats as bats are known to be a reservoir
for a wide variety of coronaviruses. The genome of this novel virus has approxi-
mately 96% similarity with the genome of a related virus (RaTG13) from horseshoe
bat. In spite of very high similarity at the nucleotide level, a number of key genomic
features differ between the two viruses. For example, SARS-CoV-2 has a polybasic
S1/S2 cleavage site insertion which might account for its high infectivity and
pathogenicity in human. Interestingly, the receptor binding domain (RBD) of these
two viruses has approximately 85% similarity and only one amino acid is common
out of six critical amino acid residues at this domain. It appears that the receptor
binding domain of SARS-CoV-2 is well optimized to bind the angiotensin
converting enzyme 2 (ACE2) receptor in human like other SARS-CoV. It is possible
that SARS-CoV-2 might have acquired the key mutations needed for efficient
human transmission in an intermediate host like pangolins. For example, the pangolin viruses not only have an amino acid sequence similarity of 97% with SARS-CoV-2 but also contain all six human-specific residues at the receptor binding domain that are optimized for binding the ACE2 receptor. However, another possibility of acquiring
some of its key mutations during a period of cryptic spread in human before its first
detection cannot be excluded. There are different variants of this virus across the
world (Fig. 6.7), and some sites in the spike glycoprotein gene have accumulated not only synonymous and missense mutations but also a few stop-gained mutations (Fig. 6.8). We can study the genetic diversity in viruses using distinct phylogenetic
clusters of genomic sequences. However, it is difficult to determine the phenotypi-
cally important mutations using genomic sequences alone; they can only be validated using clinical samples in the laboratory.
6.13 Exercises
Fig. 6.7 A view of sequence alignment of 11 strains of SARS-CoV-2 on the Ensembl COVID-19
database showing variability at some nucleotide sites. The nucleotides highlighted in green and
yellow colour are synonymous and missense variants
Fig. 6.8 A view of various variants of SARS-CoV-2 spike glycoprotein on the Ensembl COVID-
19 database
> library(Biostrings)
> GnRH<-readDNAStringSet("GnRH.fasta")
DNAStringSet object of length 9:
#Fig. 6.9
> library(msa)
> GnRH.alignment<-msa(GnRH)
> GnRH.alignment
CLUSTAL 2.1
Call:
msa(GnRH)
MsaDNAMultipleAlignment with 9 rows and 291 columns
#Fig. 6.10
Fig. 6.14 Top ten best models for the GnRH coding sequences. The model (TPM1 + I) showing
lowest AICc value is the best model
Fig. 6.15 A maximum likelihood phylogenetic tree showing bootstrap values more than 50 at
respective nodes
> library(ape)
> GnRH<-as.DNAbin(GnRH.alignment)
> distance<-dist.dna(GnRH)
> distance
#Fig. 6.11
> tree<-nj(distance)
> write.tree(tree)
#Fig. 6.12
> rooted<-root (tree, outgroup=9)
> plot(rooted)
#Fig. 6.13
> library(phangorn)
Loading required package: ape
> library(msa)
Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel
> GnRH<-readDNAStringSet("GnRH.fasta")
> GnRH.alignment<-msa(GnRH)
use default substitution matrix
> GnRH<-msaConvert(GnRH.alignment,type="phangorn::phyDat")
> distance.matrix <-dist.ml(GnRH, "F81")
> tree <- NJ (distance.matrix)
> GnRH.model <- modelTest(GnRH, tree=tree, multicore=F,model="all")
> GnRH.model[order(GnRH.model$AICc),]
#Fig. 6.14
> env <- attr(GnRH.model, "env")
> fitStart <- eval(get("TPM1+I", env), env)
> fit <- optim.pml(fitStart, rearrangement = "stochastic")
#Fig. 6.15
(c) 70%
(d) 80%
8. Which of the following is not true about maximum parsimony method?
(a) An assumption of evolutionary model is necessary
(b) It considers only informative sites
(c) It is prone to long branch attraction
(d) It is prone to short branch attraction
9. The confidence of a clade in a Bayesian inference tree is indicated by:
(a) Bootstrap
(b) Posterior probability
(c) Prior Probability
(d) None of the above
10. Discarding the early phase of MCMC simulation run in Bayesian inference is
known as:
(a) burn in
(b) burn out
(c) burning
(d) sampling
Answer: 1. d 2. c 3. a 4. d 5. a 6. d 7. c 8. a 9. b 10. a
Summary
• The substitution rate is computed in terms of accumulation of genetic differences
between individuals in an evolutionary timescale.
• Nonsynonymous mutations are subject to selective pressure and cause phenotypic changes.
• Synonymous mutations are neutral and are fixed in a population under the impact
of genetic drift.
• Neutral theory suggests that evolution is a stochastic process operating on neutral
mutations where substitutions are fixed by the random genetic drift.
• The nearly neutral theory emphasizes borderline mutations that are neither strictly neutral nor strongly selected, calling them nearly neutral mutations.
• A high similarity between two genes is an indicator of their being homologous (having a common ancestry).
• The best-fit evolutionary model of a particular dataset is selected using different
statistical approaches such as hierarchical likelihood ratio test, Akaike Informa-
tion Criterion (AIC) and Bayesian Information Criterion (BIC).
• A phylogenetic tree consists of two characteristic features, namely topology of a
tree and its branch lengths.
• The reliability of a specific clade of a phylogenetic tree is evaluated using
bootstrap analysis.
• In Bayesian inference of phylogeny, posterior probability distribution is usually
estimated using the Markov Chain Monte Carlo (MCMC) sampling.
• The molecular clock ticks at irregular intervals and is probabilistic in nature.
• The duplication of genes and whole genomes has played an important role in the evolution of organismal complexity.
• The omega (dN/dS) is measured as the observed number of nonsynonymous
substitutions per nonsynonymous site (dN) over the synonymous substitutions
per synonymous site (dS).
• The omega is expected to be equal to one under the assumption of neutral
evolution.
• An omega value significantly less than one indicates negative or purifying
selection.
• An omega value significantly more than one indicates operation of positive or
directional selection.
• The marginal stability of a protein structure is a product of neutral evolution
through balance between mutation and selection.
• The spike proteins of coronaviruses can adapt to exploit diverse types of cellular
receptors in hosts.
Suggested Reading
Li W-H (1997) Molecular evolution. Sinauer Associates Inc, Sunderland
Graur D, Li W-H (2000) Fundamentals of molecular evolution. Sinauer Associates Inc, Sunderland
Nei M, Kumar S (2000) Molecular evolution and phylogenetics. Oxford University Press, Oxford
Bromham L (2016) An introduction to molecular evolution and phylogenetics. Oxford University Press, Oxford
Next-Generation Sequencing
7
Learning Objectives
You will be able to understand the following after reading this chapter:
7.1 Introduction
The first- and second-generation sequencing technologies require prior amplification of the template DNA. The necessity of prior DNA amplification was circumvented in the third-generation sequencing technology known as the Pacific Biosciences (PacBio) platform. Moreover, it allows the detection of single molecules in real time and generates thousands of reads several kilobases long that are suitable for de novo assembly. The latest addition to sequencing technology, Oxford Nanopore, is treated as the fourth-generation sequencing technology and is capable of producing ultra-long reads at a lower cost. Overall, the second-generation technologies have high throughput with a lower error rate and cost per base. On the other hand, the third- and fourth-generation technologies offer long read lengths with short run times. Thus, the advantages of the different generations of technologies are complementary, and a hybrid sequencing approach combining data from different generations offers a better solution for whole-genome sequencing. The rapid development of next-generation technology has been complemented by ever-increasing computing power and the development of efficient algorithms for the assembly of short reads.
The most popular second-generation sequencing platforms are 454, Illumina, SOLiD and
Ion Torrent. The PacBio and the Oxford nanopore technologies are the third-
generation and the fourth-generation technologies, respectively. A comparison of
different generations of sequencing platforms is briefly described in Table 7.1. The
principles of each sequencing method with its advantages and limitations are
described as follows:
Sample preparation starts with capturing each single-stranded DNA fragment of the DNA library on an agarose bead using adaptor sequences. The fragment-bead complex is sequestered in a water-in-oil micelle containing the PCR reactants. These micelles act as microreactors for the thermal cycling (emulsion PCR) that amplifies each DNA fragment into about one million copies. Finally, the microreactor is broken and each 28 μm DNA-coated bead carrying copies of a single DNA fragment is fitted into a well of 44 μm average diameter on a fibre-optic slide. The sequencing enzyme and primer are loaded into the well. The synthesis of the complementary DNA strand is performed by flowing one unlabelled nucleotide over the well at a time. The incorporation of a nucleotide leads to the release of pyrophosphate, with light emission by the firefly enzyme luciferase in the well. The amount of light produced by the luciferase is proportional to the number of nucleotides incorporated by the DNA polymerase. The signal for nucleotide incorporation is detected by an underlying charge-coupled device (CCD) sensor on a real-time basis. Finally, about 400,000 short reads (400–500 bp in length) are produced in parallel sequencing reactions using millions of wells. The raw reads are processed by the 454 analysis software, producing about 100 Mb of data consisting of quality reads.
Denatured adapter-linked DNA libraries are attached at one end to complementary oligonucleotides fixed on a flow cell. The flow cell is a closed, microfabricated glass device having eight channels. The free end of each DNA fragment bends down and hybridizes to a complementary adaptor attached to the solid surface. The DNA strands bound to the flow cell undergo solid-phase PCR to produce clusters of clonal populations. This process is known as bridge amplification. The solid-phase amplification process undergoes several cycles, and the resulting amplified DNA is denatured to produce clusters of around 1000 single-stranded DNA molecules. The DNA polymerase, along with primers and four differently labelled reversible terminator nucleotides, is added for synthesis. The incorporation of a new nucleotide, blocked by a reversible terminator at its 3′ end, is detected by a CCD sensor based on its colour, followed by enzymatic removal of the fluorophore. This cycle is repeated several times during the synthesis process. The first machine using this process was known as the Genome Analyzer (GA), and it was capable of producing paired-end reads with lengths up to 35 bp. Now, two newer variants of the GA, HiSeq and MiSeq, are used by various laboratories with longer read lengths and depth, although the MiSeq machine has a lower throughput and a cheaper cost than the HiSeq machine.
The sample preparation of the SOLiD platform is highly similar to that of the 454 platform. The sequencing process in the SOLiD machine is performed by ligation using a DNA ligase rather than by synthesis using a DNA polymerase as in the Illumina platform. First, the genomic DNA is sheared into 200 bp fragments and oligonucleotide adaptors are added at both ends of each fragment. Single-molecule templates are attached to magnetic beads and are amplified on the beads using emulsion PCR.
The beads, each containing one amplified DNA fragment, are spread onto a glass slide surface which is subsequently loaded into a flow cell. First, a sequencing oligonucleotide primer is hybridized to an adaptor. The 5′ end of the primer ligates with an oligonucleotide hybridizing to the adjoining sequence. The ligation to the primer is subject to competition among a combination of octamers. The ligated octamer is cleaved after detection of its colour, and this ligation cycle is repeated several times.
The sample preparation and sequencing methodology are similar to 454 sequencing. The sample preparation is based on emulsion PCR, similar to sample preparation in the 454 machines. The incorporation of each nucleotide is detected by a minute semiconductor pH sensor embedded in each well of the picotitre plate. In the Ion Torrent machine, the formation of a phosphodiester bond during nucleotide incorporation leads to the release of a proton (H+ ion) during the polymerization process, instead of the pyrophosphate of the 454 platform. The change in pH caused by proton release is detected using complementary metal-oxide-semiconductor (CMOS) sensor technology.
7.2.5 PacBio
Biological nanopores are widely used for single-molecule detection and for single- and double-stranded DNA sequencing. α-Hemolysin is the most common biological nanopore suitable for single-stranded DNA sequencing. Nanopore sequencing provides high-throughput, label-free and ultra-long reads at a cheaper cost. Transmembrane nanometre-sized pores are usually embedded in a biological membrane or in a solid-state film (Fig. 7.1). These membranes separate two compartments containing conductive electrolytes. The electrolyte ions can pass through the nanopores when subjected to a voltage difference and consequently generate a signal in the form of an ionic current. The ionic current is blocked when a nanopore is occupied by a negatively charged DNA molecule. The ionic current signals generated during sequencing are segmented into discrete events. A machine learning algorithm, either a recurrent neural network (RNN) or a hidden Markov model (HMM), transforms these segmented events into consensus sequences of the template and complement strands of the double-stranded DNA molecules. Oxford Nanopore Technologies has developed two nanopore sequencing platforms, namely GridION and MinION. MinION is a small portable USB device producing cheaper and faster ultra-long-read (average length 10 kb) data without DNA amplification. The current MinION flow cells consist of 2048 protein nanopores arranged in 512 channels, allowing parallel processing of 512 DNA molecules. It is useful for generating bacterial genome sequences in a couple of days. Unfortunately, the sequencing error rate of nanopore technology is very high, in the range of 15–40% at present.
Fig. 7.1 Oxford nanopore sequencing technology is based on the changes in electrical pulses due
to blocking of ionic current in a nanopore clogged by a DNA strand
FASTA format is a text format with a single header line starting with the character ">" and is widely used to store nucleic acid and protein sequences. On the other hand, FASTQ format is the standard format for NGS data, with a single header line starting with the character "@". In addition, a Phred quality score for each nucleotide base is included in the FASTQ format. The Phred quality score assigns an error probability to each nucleotide base (Fig. 7.2). These scores are the log-transformed values of the error probabilities and are computed using the following formula:
Phred quality score Q = −10 log10 P, where P is the estimated error probability.
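As a quick illustration (a minimal sketch in base R, with illustrative values), the conversion between error probabilities and Phred scores, and the decoding of a FASTQ quality string, can be written as follows:

# Error probability to Phred score and back
phred_from_p <- function(p) -10 * log10(p)
p_from_phred <- function(q) 10^(-q / 10)
phred_from_p(0.001)    # 30, i.e. 1 error expected in 1000 bases
p_from_phred(20)       # 0.01
# Decoding a quality string, assuming the Sanger/Illumina 1.8+ ASCII offset of 33
qual <- "IIIIHHGF#"
utf8ToInt(qual) - 33   # per-base Phred scores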
Fig. 7.2 The Phred quality score (y-axis) at each nucleotide position along the reads (x-axis). Phred quantiles are indicated at the 10%, 25%, 50%, 75% and 90% quantiles using different lines
The indexed BAM file allows retrieval of the required information without uncompressing the whole file. SAMtools is used to convert SAM files to BAM files and vice versa.
The UCSC Genome Browser is a common web-based genome browser for visualization of NGS data. It uses two compressed binary file formats, BigBed and BigWig, for transfer of voluminous NGS data. GenomeStudio is a commercial software package developed by Illumina and is capable of reading and exporting all common NGS data formats. Mauve is a cross-platform package widely used for visualization of multiple genome sequences. The Integrative Genomics Viewer (IGV) is an open-source Java-based visualization tool for common NGS data formats such as FASTA, FASTQ, SAM and BAM. Another common visualization package is GenomeView, which reads BAM files and provides some advanced visualization features, for example zooming from the chromosome level down to the nucleotide level. The Generic Genome Browser (GBrowse) is a web-based genome visualization tool written in Perl which can be customized based on the needs of the user. The preferred input format for GBrowse is GFF3.
A single NGS machine is capable of producing millions of short reads. The alignment of short reads to a long reference genome sequence (reference-based assembly) or the alignment of sequencing reads to one another (de novo assembly) is a computationally demanding step in the overall NGS data analysis. The traditional alignment algorithms do not scale to such data volumes, so modified and new approaches have been developed.
The short reads are aligned to a long reference sequence using modifications of existing alignment methods, such as the seed-and-extend method and the hash table method. In addition, a new alignment method based on the Burrows–Wheeler transform has been introduced for the alignment of short reads. These methods are briefly described as follows:
Local alignment in BLAST is based on finding significant exact matches of k-mers, or seeds, known as hits, and comparing the matches using a hash table. The BLAST approach is optimized for short-read alignment by including spaced seeds in order to improve sensitivity. A spaced seed allows a few internal mismatches, especially at the third codon position of protein-coding sequences. The alignment method consists of three steps: finding the seeds, performing gap-free extension and rejecting alignments with a low score, and finally performing extension with gaps. If the seed size is very large, it is difficult to find long exact matches and consequently the method has low sensitivity. On the other hand, a very small seed size yields many random small matches, which increases the running time. Moreover, this method is unsuitable for the human genome, with its abundant repeat sequences, because of the large number
of hits generated during alignment. The popular programs for alignment based on spaced seeds are the Short Oligonucleotide Alignment Program (SOAP), Eland, MAQ, RMAP and ZOOM.
A suffix array is a simple data structure consisting of a list of the alphabetically sorted starting positions of all suffixes of a sequence. During alignment, a substring of a long reference sequence is searched by considering all suffixes starting with that substring. All suffixes are ordered alphabetically, and the resulting sorted suffixes are searched for the query substring using an efficient binary search. The reference string can also be represented by a data structure known as a trie (derived from the word retrieval), allowing faster string matching. A trie is composed of all possible substrings of a reference string. The terminal end of a substring is indicated by $ in suffix tries. The $ character is treated as smaller than all other characters and is therefore alphabetically or lexicographically first. The lexicographic order is $ < A < C < G < T for a DNA string. For instance, a suffix array is created for a DNA sequence using the following steps:
DNA sequence: A T G A T A G C A $
Position: 0 1 2 3 4 5 6 7 8 9
Suffix arrays: 9 8 5 3 0 7 2 6 4 1 (ordered alphabetically)
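A suffix array for a short string can be sketched directly in R (base R only; the radix sort is used here because it compares strings byte-wise, so the terminator $ sorts before A, C, G and T, as assumed in the text):

# All suffixes of the string, then their starting positions sorted alphabetically
suffix_array <- function(s) {
  n <- nchar(s)
  suffixes <- substring(s, 1:n, n)          # suffix starting at each position
  order(suffixes, method = "radix") - 1     # 0-based starting positions
}
suffix_array("ATGATAGCA$")
# 9 8 5 3 0 7 2 6 4 1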
A suffix tree is a more complex data structure related to the suffix array. A suffix tree of a string is a compressed trie of all suffixes of that string. A suffix tree is made from a string in three successive steps. First, all suffixes of the string are generated. Then, these suffixes are treated as individual words and a trie is built from them. Finally, the trie is compressed. However, the suffix array is a more space-efficient data structure than the suffix tree: a suffix tree requires 15–20 bytes per character, whereas about five bytes are sufficient to represent a character in a suffix array. Overall, a suffix array can store the 3.15 billion base pair human genome in about 12 GB of RAM. Segemehl and Vmatch are common alignment programs based on enhanced suffix arrays, while MUMmer and OASIS are popular programs using the suffix tree algorithm for alignment.
The BWT of the reference sequence is a compressed version of the suffix array used for compression and indexing, leading to enhanced speed and a reduced memory footprint. It is closely related to the construction of a suffix array. For example, let T be a DNA sequence string of length m. We can make rotations of T by repeatedly taking a character from one end of the string and sticking it on the other end. An m × m matrix is created by stacking these rotations vertically and then sorting the rows alphabetically into the Burrows–Wheeler matrix of T, BWM(T). Finally, the last column of BWM(T) is the Burrows–Wheeler transform of T, BWT(T). For example, the DNA string ACAGCA$ undergoes the Burrows–Wheeler transformation in the following three steps.
T = ACAGCA$

Rotations:          BWM(T):
ACAGCA$             $ACAGCA
CAGCA$A             A$ACAGC
AGCA$AC             ACAGCA$
GCA$ACA             AGCA$AC
CA$ACAG             CA$ACAG
A$ACAGC             CAGCA$A
$ACAGCA             GCA$ACA

BWT(T) = AC$CGAA (the last column of BWM(T))
Common alignment programs based on the BWT and its associated FM-index are BWA, SOAP2 and Bowtie.
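Under the same assumptions as the suffix array sketch above (base R, byte-wise sorting so that $ is lexicographically smallest), the three steps of the transform can be reproduced for short strings:

# Burrows–Wheeler transform via explicit rotations (fine for short strings;
# real aligners build it from a suffix array instead)
bwt <- function(s) {
  n <- nchar(s)
  ch <- strsplit(s, "")[[1]]
  rotations <- vapply(1:n, function(i)
    paste(ch[c(i:n, seq_len(i - 1))], collapse = ""), character(1))
  bwm <- rotations[order(rotations, method = "radix")]   # rows of BWM(T)
  paste(substring(bwm, n, n), collapse = "")             # last column = BWT(T)
}
bwt("ACAGCA$")
# "AC$CGAA"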
De novo assembly of short reads is performed without any prior information about the genome. This kind of assembly is necessary for novel genomes and even for a human genome with large-scale genomic alterations induced by cancer. Two algorithms are employed for de novo assembly of short reads: overlap–layout–consensus (OLC) and the de Bruijn graph (DBG).
The OLC algorithm is performed in three successive steps (Fig. 7.4). First, the overlapping regions between short reads are identified explicitly by aligning each read against every other read. An OLC read graph is then constructed based on the layout of all the reads along with the overlap information. The read graph treats each read as a node, and two nodes are connected if the overlap region is larger than a certain cut-off length. Thus, the total number of nodes is equal to the number of reads. The layout step involves determination of a Hamiltonian path, which is an NP-hard problem. Finally, the reads are aligned along the assembly path and a consensus sequence is inferred from the multiple sequence alignment. This algorithm is suitable for reads with low sequencing depth or base coverage (i.e. a smaller total amount of sequencing data) and is less sensitive to read errors and repeats in the genome. The required sequencing depth depends upon the size of the genome, and a species with a large genome needs more sequencing depth; for example, a human genome needs a minimum sequencing depth of 22. Newbler, Arachne, Phrap and Phusion are among the most widely used programs based on the OLC algorithm. The OLC method is a better choice than the DBG for assembly of the long and error-prone reads generated by the PacBio and Oxford Nanopore platforms due to its sensitive overlapping step.
This algorithm involves chopping short reads into strings of a specified length, known as k-mers, and implicitly detecting the neighbouring relationships among the k-mers (Fig. 7.5). A k-mer graph consists of individual nodes representing all unique k-mers and edges representing exact overlaps between the k-mers in the genome. Both the number of nodes and the number of edges are expected to be approximately equal to the genome size and are independent of the sequencing depth. However, the number of DBG nodes is likely to be much higher than the overall genome size due to the inclusion of many false k-mers introduced by sequencing errors. The contig sequence is assembled from the k-mer graph through an Eulerian path using a linear-time algorithm, which is computationally easier than finding a Hamiltonian path. There is no need to call the consensus sequence separately in the DBG algorithm because the consensus information is already available in the k-mers. This algorithm achieves higher memory and CPU efficiency than the OLC algorithm for large genomes with high coverage. Since the second-generation sequencing platforms usually have a high sequencing depth (>30X) in order to compensate for short read lengths, the DBG is a better algorithm than the OLC for large genome assembly using short reads. However, the DBG is more sensitive to read errors and repeats, unlike the OLC. Euler, Velvet and SOAPdenovo are common assembly programs developed based on the DBG algorithm.
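The chopping and linking of k-mers can be sketched in a few lines of R (a toy illustration only, with hypothetical reads; real assemblers additionally handle sequencing errors, coverage and graph simplification):

# Chop reads into k-mers and connect k-mers that overlap by k-1 bases
kmers <- function(read, k) substring(read, 1:(nchar(read) - k + 1), k:nchar(read))
reads <- c("ATGGCGT", "GGCGTGC", "CGTGCAA")      # hypothetical overlapping reads
k <- 4
nodes <- unique(unlist(lapply(reads, kmers, k = k)))
edges <- expand.grid(from = nodes, to = nodes, stringsAsFactors = FALSE)
edges <- edges[substring(edges$from, 2) == substring(edges$to, 1, k - 1) &
               edges$from != edges$to, ]
edges      # each row is a directed edge of the k-mer (de Bruijn) graph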
7.8 Scaffolding
Short reads are assembled into a large number of long and accurate contigs using either the OLC or the DBG algorithm. Scaffolding is the process of ordering and orienting contigs to match their positions in the genome. A scaffold is a collection of contigs placed in a linear order and correct orientation; however, there may be some gaps of unknown sequence between contigs. This process is performed in four successive steps: contig orientation, contig ordering, contig distancing and gap closing. The scaffolding problem is approximately solved using graph theory, where each contig is a node and linking read pairs are treated as edges. The read pairs are generated during paired-end sequencing and are also known as mates. Mates may overlap with each other in the middle of the fragment, depending on their lengths. Some of the mates at one end of a contig are likely to pair with mates in a neighbouring contig and are known as spanning pairs. The presence of spanning pairs indicates the proximity of two contigs. The distance between the contigs is estimated from the expected distance between the linking reads obtained from the fragment length distribution. This distance is included in the scaffolding graph as an edge length. Finally, one scaffold per chromosome of the genome, including gaps of appropriate lengths between the contigs, is obtained. Some of the most popular contemporary scaffolders are Bambus, Opera, ABySS, SGA and SOAPdenovo2.
7.9.2 Transcriptomics
The transcriptome refers to the complete set of RNA molecules encoded by the active genes present in a cell. Thus, the transcriptome of a cell is likely to change with its functional and temporal state. The transcriptome of a cell can be analysed in gene expression studies to identify all upregulated and downregulated genes under a particular condition. Although microarray technology, which is based on hybridization, is routinely used for global gene expression studies on thousands of target genes, NGS technology is technically superior to microarray technology because it can detect novel transcripts and splice events. Moreover, it is free from non-specific hybridization. RNA-seq is an RNA-specific next-generation sequencing protocol for analysing RNA sequences and their expression levels without any prior knowledge of the target genes. It can quantify gene expression levels accurately and identify tissue-specific transcript isoforms as well. The exact locations of exon boundaries and polymorphisms in the sequences are also detected in an RNA-seq experiment. Complementary DNA is synthesized from the RNA extracted from the cell. RNA sequencing allows us to study not only differential gene expression but also mutation, gene fusion and splicing of RNA. Moreover, it provides additional information on non-coding RNAs such as microRNAs (miRNAs) and long non-coding RNAs (lncRNAs) linked to the pathogenesis of various diseases. The abundance of a transcript is measured in terms of read counts, RPKM (reads per kilobase of transcript per million mapped reads), FPKM (fragments per kilobase of transcript per million mapped reads) or TPM (transcripts per million). RPKM and FPKM are applicable to single-end and paired-end RNA-seq, respectively; FPKM takes into account the fact that two reads can map to a single fragment. TPM measures the relative frequency of a transcript within a sample.
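The relationship between these measures can be made concrete with a small base-R sketch using hypothetical counts and transcript lengths:

# RPKM/FPKM normalize by transcript length and then sequencing depth;
# TPM normalizes by length first and then scales to one million
counts <- c(geneA = 500, geneB = 1000, geneC = 250)    # toy read counts
len_kb <- c(geneA = 2.0, geneB = 4.0, geneC = 1.0)     # transcript lengths in kb
rpkm <- (counts / len_kb) / (sum(counts) / 1e6)
rpk  <- counts / len_kb
tpm  <- rpk / (sum(rpk) / 1e6)
rpkm; tpm
sum(tpm)   # TPM always sums to one million within a sample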
7.9.3 Epigenomics
method involves digestion of the precipitated DNA fragments at either end using an exonuclease. The most dominant sequencing platform used for ChIP-seq is the Illumina platform, followed by the SOLiD platform. On the other hand, a wide range of methods such as methylation-dependent enzymatic restriction, methyl-DNA enrichment and direct bisulphite conversion are used for profiling DNA methylation. Moreover, the PacBio and Oxford Nanopore platforms can directly detect DNA methylation at the single-molecule level.
7.10 Examples
Solution
The suffixes of the DNA string are as follows:
[0] ATCGATAACGA
[1] TCGATAACGA
[2] CGATAACGA
[3] GATAACGA
[4] ATAACGA
[5] TAACGA
[6] AACGA
[7] ACGA
[8] CGA
[9] GA
[10] A
The suffixes sorted in alphabetical order (i.e. the suffix array) are as follows:
[10] A
[6] AACGA
[7] ACGA
[4] ATAACGA
[0] ATCGATAACGA
[8] CGA
[2] CGATAACGA
[9] GA
[3] GATAACGA
[5] TAACGA
[1] TCGATAACGA
Solution
DNA string:
AACTGACAGCATAAGCAT$
Rotations of the string:
AACTGACAGCATAAGCAT$
ACTGACAGCATAAGCAT$A
CTGACAGCATAAGCAT$AA
TGACAGCATAAGCAT$AAC
GACAGCATAAGCAT$AACT
ACAGCATAAGCAT$AACTG
CAGCATAAGCAT$AACTGA
AGCATAAGCAT$AACTGAC
GCATAAGCAT$AACTGACA
CATAAGCAT$AACTGACAG
ATAAGCAT$AACTGACAGC
TAAGCAT$AACTGACAGCA
AAGCAT$AACTGACAGCAT
AGCAT$AACTGACAGCATA
GCAT$AACTGACAGCATAA
CAT$AACTGACAGCATAAG
AT$AACTGACAGCATAAGC
T$AACTGACAGCATAAGCA
$AACTGACAGCATAAGCAT
Sorted Rotations:
$AACTGACAGCATAAGCAT
AACTGACAGCATAAGCAT$
AAGCAT$AACTGACAGCAT
ACAGCATAAGCAT$AACTG
ACTGACAGCATAAGCAT$A
AGCAT$AACTGACAGCATA
AGCATAAGCAT$AACTGAC
AT$AACTGACAGCATAAGC
ATAAGCAT$AACTGACAGC
CAGCATAAGCAT$AACTGA
CAT$AACTGACAGCATAAG
CATAAGCAT$AACTGACAG
CTGACAGCATAAGCAT$AA
GACAGCATAAGCAT$AACT
GCAT$AACTGACAGCATAA
GCATAAGCAT$AACTGACA
T$AACTGACAGCATAAGCA
TAAGCAT$AACTGACAGCA
TGACAGCATAAGCAT$AAC
Answer: 1. c 2. b 3. c 4. b 5. a 6. a 7. c 8. b 9. a 10. d
Summary
• The PacBio technology has a potential to sequence single molecules with longer
read length without any need for prior DNA amplification.
• FASTQ format is a standard format for the next-generation sequencing data
including a Phred quality score for each nucleotide base.
• BAM format is a binary file with compressed and indexed SAM data.
• Suffix array is a simple data structure having a list of alphabetically sorted starting
positions of the suffixes in the sequence.
• The Burrows–Wheeler transformation (BWT) of the reference sequence is a
compressed version of suffix array for compression and indexing.
• FM-index is a data structure containing only sampled positions which are small
fractions of all string positions.
• The OLC method is a better choice than the DBG for assembly of long and error-
prone reads generated using PacBio and Oxford-nanopore platforms.
• The DBG is a better algorithm than the OLC for large genome assembly using
short reads.
• Scaffold is a collection of contigs positioned in a linear order and correct
orientation.
• RNA-seq detects the accurate gene expression levels quantitatively and identifies
tissue-specific transcript isoforms.
• The abundance of a transcript is measured in terms of read counts or RPKM or
FPKM or TPM.
• The whole exome sequencing involves the sequencing of all protein encoding
genes in the human genome.
Suggested Reading
Ye SQ (2016) Big data analysis for bioinformatics and biomedical discoveries. CRC Press, Boca
Raton
Brown SM (2013) Next generation DNA sequencing informatics. Cold Spring Harbor Laboratory
Press, Cold Spring Harbor
Janitz M (2008) Next generation genome sequencing: towards personalized medicine. Wiley-
Blackwell, Somerset
Masoudi-Nejad A, Narimani Z, Hosseinkhan N (2013) Next generation sequencing and sequence
assembly: methodologies and algorithms. Springer, New York
Systems Biology
8
Learning Objectives
You will be able to understand the following after reading this chapter:
8.1 Introduction
Molecular biology has provided us with the finer details of the structure and function of genes and proteins in a cell. As a result, we can explain the molecular basis of many biological processes such as reproduction, development and the initiation of a disease. The whole-genome sequencing of humans and other organisms has contributed significantly to understanding the molecular basis of biological processes through a better understanding of the genes and proteins involved. However, this molecular-level understanding is not sufficient for a systems-level understanding of biological systems. Systems biology is an emerging approach to understanding holistic biology at the systems level. Here, we study the components of a system and their inter-relationships, the behaviour of a system on temporal and spatial scales, and the properties controlling the behaviour of the system. There are four fundamental steps in systems biology. First, we identify the components of a system and model the system. Then, we perturb the system and monitor the changes in its behaviour. Finally, we refine and test the model in an iterative manner under different conditions. Thus, a systems biology model provides new insights into the mechanistic link between genotype and phenotype.
The number of nodes and edges varies widely among biological networks. For example, there are 297 nodes (neurons) and 2345 edges (synapses) in the neural network of the small worm C. elegans. On the other hand, a human brain is composed of about 10^11 neurons, and each neuron is connected to about 7000 synapses on average.
A biological network is mathematically represented by a graph. The interaction between two nodes is in some cases represented by a directed edge when the interaction has a particular direction. However, the interaction may be two-sided in many cases and is then represented by an undirected edge between the two nodes. All nodes are connected to each other in a complete graph. The most important property of a node in a network is its degree (k), i.e. the number of links to other nodes. The average degree (<k>) of a network is a global property reflecting the overall connectivity between nodes. The overall connectivity of the network is mathematically represented in the form of an adjacency matrix with two elements, 0 and 1, indicating the absence or presence of an interaction. The adjacency matrix of an undirected network is symmetric and is typically sparse in the case of real networks. Sometimes a network consists of two disjoint sets U and V, where a node of one set can connect only to nodes of the other set. This kind of network is known as a bipartite network and is best exemplified by a disease–drug network (Fig. 8.1). In a network, the distance between two nodes is measured by the path length, where a path is the total number of links traversed between the two nodes. The geodesic path (d) is the shortest path (i.e. the path with the minimum number of links) between two nodes. The diameter (dmax) is the longest geodesic path between any two connected nodes in a network. The average path length (<d>) of a network is the average of the shortest paths between all pairs of nodes.
The clustering coefficient measures how well connected the clusters in a network are. The local clustering coefficient (C) is a measure of local density which indicates the degree to which the neighbours of a particular node are connected to each other. The global density of the whole network is represented by the average clustering coefficient (<C>), taking the average value over all nodes in the network. The clustering coefficient of small-degree nodes is significantly higher than the clustering coefficient of the hubs. Thus, the local neighbourhood of a small-degree node is dense, whereas that of a hub node is sparse.
The concept of centrality is concerned with detecting the most important nodes in a network. The degree of a node is the simplest measure of centrality and is commonly known as degree centrality. Eigenvector centrality is high for nodes having connections with other important nodes. Betweenness centrality measures how frequently a node lies on the paths between other nodes in the network. Closeness centrality indicates how quickly a node can communicate with the other nodes. The graph density is an indicator of the number of connections in a subset of nodes. Most biological networks are not dense but are found to be sparse, with a graph density of less than 0.1. A clique is a sub-graph in which all nodes are connected to all other nodes, and a clique has a clustering coefficient of 1.
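These measures can be computed directly with the igraph package (also used in the exercises of this chapter); the small undirected toy network below is purely illustrative:

library(igraph)
g <- graph_from_literal(A-B, A-C, B-C, C-D, D-E)   # small undirected toy network
degree(g)                                          # degree k of each node
mean(degree(g))                                    # average degree <k>
as_adjacency_matrix(g)                             # symmetric 0/1 adjacency matrix
distances(g)                                       # geodesic path lengths d
mean_distance(g); diameter(g)                      # <d> and dmax
transitivity(g, type = "local")                    # local clustering coefficient C
transitivity(g, type = "average")                  # average clustering coefficient <C>
betweenness(g); closeness(g)                       # betweenness and closeness centrality
eigen_centrality(g)$vector                         # eigenvector centrality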
Two network models have been proposed to represent the properties of real biological networks: the random network and the scale-free network. A random network G(N, p) is defined as a network with N nodes connected to each other randomly with probability p (Fig. 8.2). When we increase the value of p from 0 to 1, the network becomes increasingly dense, with a linear increase in the average number of edges. The random network model is also known as the Erdos–Renyi model in honour of their valuable contributions to this theory. Most nodes in this model are likely to have a degree close to the average, and the degree distribution approximates a Poisson distribution in a random network. The distance between two randomly chosen nodes in such a network is unexpectedly small; this is known as the small-world phenomenon. The average distance <d> between two random nodes is proportional to log N / log <k>. Although real networks are not random, the random model is used as a reference to explain the characteristic features of real networks.
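The small-world approximation can be checked numerically with igraph (a hedged sketch; the exact numbers will vary with the random seed):

library(igraph)
set.seed(42)
N <- 500; p <- 0.02
g_rand <- sample_gnp(N, p)          # Erdos–Renyi random network G(N, p)
k_mean <- mean(degree(g_rand))      # average degree <k>
mean_distance(g_rand)               # observed average distance <d>
log(N) / log(k_mean)                # small-world estimate log N / log <k>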
The degree distribution of real biological networks is approximated by a power
law degree distribution which is as follows:
P(k) ∝ k^(−γ)
The exponent γ is known as the degree exponent of the power-law equation. The overall behaviour of a scale-free network depends on the value of this exponent. Networks with a power-law degree distribution are known as scale-free networks (Fig. 8.3). A power-law distribution has a higher peak and a fat tail, with a characteristic straight slope on a log–log scale (Fig. 8.4). The probability of finding a high-degree or hub node is many fold higher in a scale-free network than in a random network.
Fig. 8.2 A random network consisting of 500 nodes based on Erdos–Renyi model showing
random connections between nodes
The hubs have a tendency to grow polynomially and consequently become very large in large networks. The value of the exponent γ varies between 2 and 3 in scale-free biological networks such as the protein interaction network. Scale-free networks are usually ultra-small due to the presence of hubs connecting numerous small nodes. However, when the value of the exponent γ exceeds 3, the network shows the small-world property and behaves like a random network. On the other hand, an exponent of less than 1 is not expected in a scale-free network unless hubs have many self-loops and there are multiple connections between the same pair of nodes. Hubs in a biological network prefer to connect to nodes with low degree rather than to other hubs, showing the disassortative nature of biological networks. A biological network consists of modules, and each module is characterized by a group of physically or functionally connected nodes performing a particular function.
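A scale-free network can be generated and its exponent estimated with igraph (an illustrative sketch; the fitted value of γ will vary from run to run):

library(igraph)
set.seed(42)
g_sf <- sample_pa(500, m = 2, directed = FALSE)   # Barabasi–Albert preferential attachment
fit  <- fit_power_law(degree(g_sf))
fit$alpha                                         # estimated degree exponent γ
dd <- degree_distribution(g_sf)
k  <- which(dd > 0)
plot(k, dd[k], log = "xy", xlab = "k", ylab = "P(k)")   # straight line on a log-log scale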
The interactions between coexpressed genes may be inferred using quantitative measures such as the correlation coefficient or a mutual information approach applied to global gene expression data. The correlation coefficient measures the linear relationship between two variables, whereas mutual information measures non-linear and non-continuous dependency between two variables. Weighted correlation network analysis builds on such pairwise measures to construct coexpression networks.
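A minimal sketch of the correlation-based approach (simulated expression values, base R) is shown below; mutual-information-based inference follows the same pattern with a different pairwise score:

set.seed(1)
expr <- matrix(rnorm(5 * 20), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("sample", 1:20)))
cc  <- cor(t(expr))        # gene-by-gene Pearson correlation matrix
adj <- abs(cc) > 0.5       # adjacency: an edge where |r| exceeds a chosen cut-off
diag(adj) <- FALSE
adj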
Fig. 8.3 A scale-free network consisting of 500 nodes based on Barabasi–Albert model showing
hub nodes with many connections
Fig. 8.4 The degree distribution of scale-free network on a log-log scale showing characteristic
straight slope. k is the degree of a node in log scale and Pk represents probability of degree in a log
scale. Red straight line indicates the power law fit
The total number of elementary modes (EMs) in a cellular network indicates the overall redundancy and robustness available to perform a certain function. Extreme pathways are subsets of the elementary modes. When both internal and external reactions are irreversible in a metabolic network, the set of elementary modes becomes identical to the set of
extreme pathways. A GSMM of an organism is built in two major steps (Fig. 8.5).
The first step is the automatic reconstruction of a gap-filled draft model based on the genome annotation of a species. The second step is the manual refinement of the draft model using available experimental biochemical data. The building of a GSMM starts with downloading the genome sequence of the particular species. Then, the genes are annotated using bidirectional BLAST with a high matching length of the query sequence (70%), moderate amino acid sequence identity (40%) and a very low e-value (<1 × 10^-30) in the first model. In the second model, the KEGG Automatic Annotation Server (KAAS) is used for functional annotation of all amino acid query sequences. A draft model is developed based on both annotated models. This draft model is further refined using biochemical information from public databases such as KEGG (filling of missing genes), MetaCyc (direction of reactions), CELLO (subcellular localization) and TCDB (transport reactions). Subsequently, manual corrections such as gap filling, deletion of erroneous reactions, a check for mass–charge balance and the addition of species-specific information are performed in sequential
phases. Further, a software environment like the COBRA Toolbox or the sybil package in R simulates the SBML (Systems Biology Markup Language) model with flux balance analysis (FBA) using a desired objective function such as growth rate or product formation (Fig. 8.6). First, all metabolic reactions are mathematically represented in the form of a stoichiometric matrix containing the stoichiometric coefficient of each reaction. Each reaction is defined by a minimum allowable flux (lower bound) and a maximum allowable flux (upper bound). The second step is to define a phenotype in the form of an objective function. For example, if the objective function is biomass production (i.e. the conversion of metabolites into biomass), an artificial biomass reaction based on experimental measurements is added as an extra column in the stoichiometric matrix. As a result, we can predict the maximum growth rate of an organism by computing the conditions allowing maximum flux through the biomass reaction. Thus, the objective function is a quantitative indicator of the relative contribution of each reaction to the phenotype. The optimization problem in flux balance analysis is solved using linear programming, where an objective function (Z) is either minimized or maximized. The objective function is computed as Z = Σ civi, i.e. each reaction flux (vi) is multiplied by a known constant (ci) and the products are summed. FBA finds a solution at steady state (Sv = 0), where each reaction is constrained by its upper and lower bound values.
Fig. 8.6 The metabolic reconstruction is a list of stoichiometrically balanced biochemical reactions. This reconstruction is converted into a stoichiometric matrix of size m × n, where each row represents a metabolite and each column a reaction. Here, two additional reactions are added to the matrix to represent growth (the biomass reaction, used as the objective function) and the exchange reaction of glucose between the inside and outside of the cell. VBiomass is the objective function to be maximized during optimization. The flux through each reaction under steady state satisfies Sv = 0, which is defined by a system of linear equations
A solution space is defined by the set of all flux distributions that satisfy the mass-balance constraints and the upper and lower bounds of each reaction. The optimization of a biological objective function such as ATP utilization or biomass production finds an optimal flux distribution within this solution space (Fig. 8.7). The flux distribution obtained by FBA is not a unique optimal solution, because a biological system can achieve the same objective value using alternative pathways. The exploration of such alternatives is known as flux variability analysis (FVA), which computes the maximum and minimum possible fluxes through each reaction while keeping the objective function at its optimal value. Finally, the model is validated by comparing the in silico results with experimentally observed phenotypes. All GSMM models have some missing reactions and are thus incomplete due to gaps in our knowledge.
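The optimization at the heart of FBA can be illustrated with a deliberately tiny linear programme (a sketch using the lpSolve package rather than sybil, with a hypothetical three-metabolite toy network):

library(lpSolve)
# Columns: uptake, v1, v2, biomass; rows: metabolites A and B
S <- matrix(c(1, -1, -1,  0,
              0,  1,  1, -1), nrow = 2, byrow = TRUE,
            dimnames = list(c("A", "B"), c("uptake", "v1", "v2", "biomass")))
obj   <- c(0, 0, 0, 1)                  # maximize flux through the biomass reaction
const <- rbind(S, c(1, 0, 0, 0))        # steady state Sv = 0 plus an uptake limit
dirs  <- c("=", "=", "<=")
rhs   <- c(0, 0, 10)
sol <- lp("max", obj, const, dirs, rhs) # fluxes are non-negative by default
sol$objval                              # maximal biomass flux (10 in this toy case)
sol$solution                            # one optimal flux distribution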
Since not all enzymes are active in every cell type or under every culture condition, specific algorithms are necessary to simulate a particular condition in silico. Biological, physical or chemical constraints derived from experimental data obtained under a particular condition are sometimes used to build a biologically meaningful model known as a context-specific or condition-specific GSMM. Here, the omics (i.e. transcriptomics, proteomics and metabolomics) data obtained under a particular condition are mapped onto the template framework of a general-purpose model. There are specific algorithms to reconstruct cell-specific or strain-specific models (Table 8.1).
Fig. 8.7 The flux distribution of a metabolic network may lie at any point in an unconstrained
space. After defining constraints in the model, any flux distribution is feasible only within the
allowable space. Further, the objective function is optimized using linear programming to find a
single optimal flux distribution lying on the edge of the allowable space
These models do not maintain a steady state, unlike GSMM, and allow us to study the system as a dynamic system. They require knowledge of the initial concentrations of metabolites and the kinetic reaction coefficients. Therefore, this framework can capture
the metabolic state of a small system, whereas GSMM reflects the entire metabolic
profile of a cell. The overall size of a kinetic model is much smaller than a genome-
scale model. Here, the focus is to model few reactions in deeper mechanistic details.
For instance, we can develop a kinetic model describing the rate and affinities of
multiple reactions. A set of reactants participating in metabolic reactions in a cell is
defined by deterministic methods such as non-linear differential equations or partial
differential equations. However, some biological processes with few participating
molecules such as cell signalling and gene expression are better explained using
stochastic kinetic models rather than conventional deterministic models. The deter-
ministic equations describe the alterations in intracellular metabolite concentration
over a defined time period. The computational solutions obtained by kinetic
modelling helps us in understanding the dynamics of a biological process. However,
the biological information obtained is very limited in kinetic models due to small
number of reactions (often less than 20) included in the model. These models include
various reaction kinetic parameters like enzyme concentration (E0), turnover number
(Kcat) and Michaelis constant (Km), which are measured experimentally using bio-
chemical methods. In case the estimation of kinetic parameters is not possible,
computational methods such as parametric estimation methods and Monte Carlo
methods are implemented to obtain the value of the kinetic parameters. The kinetic
model can be integrated with the genome-scale models forming large kinetic
genome-scale models. Overall, a kinetic model often has a large number of
parameters. The kinetic models of metabolic reactions are constructed using various
computational approaches such as mechanistic modelling, thermokinetic modelling,
modular and convenience kinetic rate law modelling, ensemble modelling, optimi-
zation and risk analysis of complex living entities (ORACLE), structural kinetic
modelling (SKM) and mass action stoichiometric simulation (MASS). Most of these
approaches excluding mechanistic and ensemble models can be applied to develop a
large-scale genome model. However, both mechanistic and ensemble methods have
a potential to build a kinetic model with a maximum of 50 reactions. For example, a
whole cell kinetic erythrocyte model of many healthy individuals was developed
recently using the MASS approach. This kinetic model is useful in predicting
susceptibility to the hemolytic anaemia often induced by an antiviral drug.
8.7 Motifs
Motifs are recurring patterns found in significantly higher frequency in the real
network in comparison to an ensemble of randomized networks (Fig. 8.8).
Randomized networks retain all the characteristic features of the real network, such as the total number of nodes and edges, except that the connections between nodes are made on a random basis.
Fig. 8.8 Common network motifs in the biological networks: (a) Positive and negative
autoregulation (b) Coherent feed-forward loop (c) Incoherent feed-forward loop (d) Single-input
module (e) Multi-output feed-forward loop (f) Bi-fan and (g) Dense overlapping regulons (DOR)
A coherent FFL filters out brief, spurious input signals and thereby stabilizes cellular gene expression under ever-changing stimuli. On the other hand, the incoherent FFL helps in the generation of pulses and speeds up the response time. It seems that the same FFLs have been rediscovered by convergent evolution again and again in the biological networks of different organisms due to their vital cellular functions.
In addition, there are two larger families of motifs known as single-input module
(SIM) and dense overlapping regulons (DOR). In SIM network motif, a single
master transcription factor performs dynamical function by controlling several target
genes concomitantly. It regulates the temporal expression of genes in a defined order
based on the requirement for their protein products in the cell. In the SIM circuit, the gene activated first is deactivated last, and this temporal order is known as last-in-first-out (LIFO). This order is observed experimentally in the arginine biosynthesis pathway of E. coli, where individual genes are expressed at intervals of 5–10 min. However, in some cases the same activation and deactivation order of genes is desirable, and this is known as the first-in-first-out (FIFO) order. The multi-output FFL can
generate the temporal FIFO order. For example, the multi-output FFL regulates
the expression of motor proteins of flagella in E. coli. Bi-fan is a four-node
overlapping pattern where two transcription factors X1 and X2 control jointly the
expression of two target genes Z1 and Z2. The dense overlapping regulons (DOR) is a
set of input transcription factors regulating through a dense overlapping wiring to
another set of target genes. It is a combinatorial decision-making device having
multiple input functions in order to regulate a series of target genes. There is a large
number of DORs regulating hundreds of genes in the transcriptional networks of
E. coli and yeast. The DORs are often composed of other common motifs like SIMs
and FFLs.
The neuronal network of C. elegans contains FFLs similar to those of transcriptional networks, although the former operates on a much larger spatial scale than the latter. Another interesting difference is the presence of multi-input FFLs in the neuronal network instead of the multi-output FFLs of the transcriptional network.
8.8 Robustness
8.9 Exercises
Fig. 8.9 A scale-free network with 25 nodes. Node 2 is the hub node in the network
#Fig. 8.9
> scale.free.adjacency<-get.adjacency(scale.free)
> scale.free.adjacency
25 x 25 sparse Matrix of class "dgCMatrix"
#Fig. 8.10
> degree(scale.free)
[1] 1 12 3 2 1 1 2 1 1 1 3 3 1 1 1 2 3 2 1 1 1 1 1 1 1
> degree_distribution(scale.free)
[1] 0.00 0.64 0.16 0.16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04
Fig. 8.10 The adjacency matrix of the network is a binary matrix (0, 1) where each entry in the
diagonal is zero. Each dot indicates zero in the matrix
(c) In silico gene knockout involves disabling the reactions catalysed by the gene
product through setting the upper and lower bound to zero. Find the effects of
single-gene knockouts on the objective function of biomass production.
(d) FVA estimates the range of feasible fluxes through each metabolic reaction.
Compute and plot flux variability in order to find the flux values for an
objective function setting at 80% of maximal biomass production.
Fig. 8.12 A stoichiometric matrix showing stoichiometric coefficient of each reaction (row) and
metabolite (column) pair
#Fig. 8.11
> cg <- gray(0:8/8)   # a gradient of grey levels used to shade the matrix image
> image(S(model), col.regions = c(cg, rev(cg)))
#Fig. 8.12
>findExchReact(model)
#Fig. 8.13
> opt <- optimizeProb(model, algorithm = "fba", retOptSol = TRUE)
> opt
solver: glpkAPI
method: simplex
algorithm: fba
number of variables: 95
number of constraints: 72
return value of solver: solution process was successful
solution status: solution is optimal
value of objective function (fba): 0.873922
value of objective function (model): 0.873922
> opt<-oneGeneDel(model)
#Fig. 8.14
> opt <- fluxVar(model, percentage = 80, verboseMode = 0)
calculating 190 optimizations ... OK
>summaryOptsol(opt, model)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-35.34 0.00 0.00 18.82 10.13 1000.00
> plot(opt)
#Fig. 8.15
Fig. 8.15 Flux variability analysis showing minimum and maximum flux values of each metabolic
reaction
V1 + V2 + V3 → M → V4 + V5 + V6
dm/dt = V1 + V2 + V3 − V4 − V5 − V6
where V1 is the net rate of the first reaction, V2 is the rate of the second reaction, and so on.
For example, consider another reaction: 3X + Y → 4Z.
Alternatively, this reaction may be written as 0 = −3X − Y + 4Z, where −3, −1 and +4 are the stoichiometric coefficients. All reactants have negative signs and all products have positive signs.
We can write a general equation using the symbol c for a stoichiometric coefficient:
0 = cxX + cyY + czZ + …
Summary
• Systems biology is an emerging discipline to understand the holistic biology at
the systems level.
• Complex biological systems exhibit complex dynamic behaviour at different
scales of biological organization.
• A good computational model only captures the fundamental features of a system
ignoring other irrelevant details.
• The biological networks with a power-law degree distribution are known as scale-
free networks.
• Flux-balance analysis estimates the flow of metabolites in a metabolic network
and predicts the growth rate of an organism or rate of production of a useful
metabolite.
• Motifs are recurring patterns found in higher frequency in the real network in
comparison to an ensemble of randomized networks.
• Coherent FFL is the most abundant type of feed-forward loop in the biological
networks.
• Robustness allows certain changes in the structure and components of the system
in order to maintain the specific function of the system.
Suggested Reading
Newman MEJ (2010) Networks: An introduction. Oxford University Press, Oxford
Covert MW (2015) Fundamentals of systems biology: from synthetic circuits to whole-cell models. CRC Press, Boca Raton
Alon U (2019) An introduction to systems biology: Design principles of biological circuits, 2nd
edn. Chapman and Hall/CRC, Boca Raton
Walhout AJM, Vidal M, Dekker J (2013) Handbook of systems biology. Academic Press, Cambridge
Clinical Bioinformatics
9
Learning Objectives
You will be able to understand the following after reading this chapter:
9.1 Introduction
Complex diseases such as cancer, diabetes and mental disorders have not only a genetic basis but are also induced by various epigenetic and environmental factors. The symptoms associated with a certain disease are the result of perturbations in molecular interaction networks in a multi-scale system. Computational models have the capacity to capture the overall complexity associated with a disease. A comparative study of a disease model alongside a normal healthy model helps us understand the mechanistic details of pathogenesis at a deeper level. These models are generally developed in the form of biological networks, genome-scale models and kinetic models.
Clinical bioinformatics has emerged as a powerful discipline due to the availability of big data. Advanced technologies such as NGS, MRI and MS produce a flood of big data which need to be processed and analysed computationally on a regular basis. For example, metabolic profiling gives a holistic picture of all metabolites present in a sample. There is a need to integrate all the data from different sources in order to provide a unified understanding of the data. There are four kinds of data types: structured data, unstructured data, semi-structured data and time-series data. Structured data are stored in a relational database or spreadsheet and account for almost 20% of the data used in an AI project. The majority of the data used in AI are unstructured data without any predefined formatting, such as text files and images. However, some data are a hybrid of structured and unstructured data, known as semi-structured data, such as the XML and JSON formats, and constitute about 5–10% of the data used in an AI project. The fourth type is time-series data, which can be a combination of structured, unstructured and semi-structured data. Big data is a critical part of an AI project. Big data is characterized by three V-features: volume, variety and velocity. The volume of big data is often on the scale of terabytes and is usually unstructured. Moreover, big data is highly diverse, consisting mainly of unstructured data along with structured and semi-structured data. These data are created at an extremely high velocity and are processed in memory instead of on disk. Other Vs associated with big data are veracity (accuracy of data), value (data from reliable sources), variability (data changing over time) and visualization (graphical representation of the data).
Some reads cannot be mapped to the reference genome and are therefore labelled as unmapped reads. These unmapped reads are broken into two to three smaller parts, which are finally aligned to the reference sequence using the Smith–Waterman algorithm in order to identify indels in the test genome. Structural variations are large insertions, deletions and other rearrangements, sometimes more than 100 kbp in size. The different kinds of structural variations in the human genome are known as deletion, duplication, insertion, inversion and translocation. Methods to detect structural variations are based on paired-end mapping, split reads, de novo assembly of the genome and read depth. Copy number variation (CNV) is a kind of structural variation involving either a duplication or a deletion of a genomic segment of more than 1 kbp in size. Many diseases are known to be associated with CNVs in humans, such as cancer, autism, schizophrenia and Alzheimer disease. Methods for CNV detection are largely based on either a significant increase or a significant decrease in the read-depth signal. Popular R packages for CNV detection are CNV-seq, readDepth and SeqCBS.
Although whole-genome sequencing is most comprehensive method to identify
different kinds of genomic variations, targeted sequencing is the most cost-effective
method to detect Mendelian disorders and complex diseases. The first step in
targeted sequencing is known as target enrichment which is based either on
PCR-amplicons or hybridization capture. Targeted sequencing consists of two
forms: whole-exome sequencing and targeted deep sequencing. Targeted deep
sequencing is further classified into sequencing with multi-gene panel and designed
regions. Whole-exome sequencing (WES) covers only 2% of the genome encoding
protein sequences but about 85% of genomic variations associated with diseases are
localized in this region. It has successfully identified the causal gene for Miller syndrome in a very small population consisting of only four individuals, and mutations
for autism in a large population consisting of 2500 families. However, regulatory
regions of the genome located in introns cannot be explored using WES approach.
Targeted deep sequencing requires the prior knowledge of the disease and is a better
technology than WES in dealing with regulatory regions. Multi-gene panel sequenc-
ing is based on array hybridization and covers large number of candidate genes along
with introns. The number of genes varies between 70 and 377 in genes panels for
diseases such as muscular dystrophy, cardiomyopathy and epilepsy. Similarly, a
desired gene or a region can be captured on custom basis using PCR-amplicons or
hybridization. Targeted sequencing is often used as a validation step after WES
because an average coverage of 1200-fold can be achieved using this approach. The
size of an NGS sample varies from a small number of cases to a large population. A small NGS sample is adequate for a common variant, but a large population-based study is required for rare variants of complex diseases. Family-based whole-genome
or whole-exome sequencing needs about 200 pedigrees for efficient detection of
Mendelian disorders. However, this family-based approach is not suitable for com-
plex diseases. Case-control deep sequencing needs 125 to 500 case-control pairs for
high detection power but is not sufficient for a novel gene discovery. However, a
case-control whole-genome or whole-exome sequencing using 500–1000 case-
control pairs is capable for discovery of novel genes. Most cost-efficient approach
Network models are very useful for understanding the overall pathobiology of a disease. The initiation and progression of a disease can be explicitly captured in the form of a network module. Each disease is represented by a well-defined disease module, and the genes involved in a disease are likely to be connected within it. This module consists of an ensemble of directly connected genes performing a certain biological function. Any disruption in this disease module may lead to the manifestation of disease. The directly interacting neighbours of a disease-causing gene are likely to play some role in the pathogenesis as well. These network models are
developed using the known interactions from literature, RNA-Seq, microarray and
yeast two-hybrid data. If the candidate genes of a disease are known, their inclusion
in the model makes the modelling process more accurate and effective. There are
four types of statistical methods employed in reconstruction of a gene network from
microarray gene expression data. These are probabilistic network-based methods
such as Bayesian networks, correlation-based methods, partial correlation-based
methods and information theory-based methods.
We can identify a disease module by including the neighbouring genes of a candidate gene in the network. The average degree of disease genes is higher than the average degree of control genes. Moreover, disease genes involved in a certain disease are likely to interact with the disease genes of other diseases. The average shortest path between disease proteins is lower in the protein–protein interaction network. A disease gene typically does not encode a hub protein and is localized in the periphery of the network. However, a static disease network cannot capture the dynamic changes
occurring during the progression of a disease. The dynamic rewiring in the molecular
interactions during different stages of a disease is compared with the healthy state
using differential network analysis (Fig. 9.1). It is expected that different networks
Fig. 9.1 Gene coexpression network of cancer biomarker genes in healthy ovary and ovarian
cancer (stage II and IV) showing differential connectivity among genes
The term machine learning was coined by Arthur L. Samuel in 1959, and its ultimate goal is to develop a predictive model by training one or more algorithms on a dataset. Here, we train the model rather than explicitly programming a computational task. First, the order of the data is randomized and an appropriate algorithm is selected by trial and error. The training data, which constitute about 70% of the complete dataset, are used to learn the relationships in the algorithm. The training phase is followed by evaluation of the model's accuracy using the remaining 30% of the dataset. The parameters of the algorithm are adjusted to fine-tune the model and improve the results. Some hyperparameters, which cannot be learned directly from the training process, are also adjusted during fine-tuning. The algorithms used are usually complex, but there is no need for the practitioner to implement them from scratch, as plenty of implementations are already available in R and Python. Although there are hundreds of machine learning algorithms available, they can be categorized into four major classes, which are as follows:
Fig. 9.2 A SVM model classifying healthy (circles) and disease (triangles) samples using radial
basis function (RBF) kernel. The maximum margin hyperplane divides both samples and solid data
points close to the hyperplane are support vectors
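The 70/30 workflow, together with an RBF-kernel SVM like the one shown in Fig. 9.2, can be sketched with the e1071 package (an assumed package choice; the built-in iris data stand in for healthy and disease samples):

library(e1071)
set.seed(1)                                  # randomize the order reproducibly
idx   <- sample(nrow(iris), size = 0.7 * nrow(iris))
train <- iris[idx, ]                         # 70% training data
test  <- iris[-idx, ]                        # 30% held out for evaluation
fit  <- svm(Species ~ ., data = train, kernel = "radial")
pred <- predict(fit, newdata = test)
mean(pred == test$Species)                   # accuracy on the test set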
suitable for perceptual problems which need feature engineering, such as image classification.
Deep learning is a subfield of machine learning which analyses datasets using neural networks mimicking the human brain. The word "deep" refers to the number of hidden layers in the neural network, which is directly related to the learning power of the algorithm; a neural network with a single hidden layer is not treated as deep. Here, the goal of learning is to predict a response
or classify a response based on certain attributes. An Artificial Neural Networks
(ANN) is a function consisting of units called neurons (also known as perceptrons or
nodes). Typical feed-forward neural networks have three layers, namely an input
layer, a hidden layer and an output layer (Fig. 9.3). The number of neurons in an
input layer corresponds to the number of features or variables to be fed in the
networks. Similarly, the number of neurons in the output layer corresponds to the
number of items predicted or classified. The hidden layer neurons perform non-linear
transformation of the input attributes. The relative importance of each neuron in a
network is indicated by its value and weight. Each neuron has an activation function
and a threshold value to activate the neuron. Thus, a neuron performs summation of
weights of all inputs and subsequent activation function in order to pass the output to
the next layer. The net input is calculated as the weighted sum of all inputs:
y_in = Σ w_i x_i
where x_i are the input values and w_i the corresponding connection weights.
Fig. 9.3 The architecture of an artificial neural network (ANN) consisting of input layer, hidden
layers and output layer
The value and weight of each neuron is passed to a hidden layer of neurons which uses a function to produce a final output. The activation function is a non-linear function whose output typically varies between −1 and +1, reflecting the non-linear nature of the real world. Each neuron in the network usually has the same activation function. The output is computed by applying the activation function over the net input as follows:
Y = F(y_in)
The most common activation function is the sigmoid, which produces an output value between 0 and 1. The highest accuracy achievable is 1 for a perfect model. Bias is another constant used in the calculation of the function and is similar to an intercept in linear regression. It allows the activation function of the network to shift either upwards or downwards. This type of network is known as a feed-forward neural network. However, this model is rather simple and can be improved further by adding multiple hidden layers, resulting in a multilayer perceptron (MLP). An MLP is endowed with the property of backpropagation. The loss function, or objective function, compares the output prediction with the true target using a distance score. This score provides a feedback signal to adjust the weights so as to reduce the loss. The adjustment is done by an optimizer using the backpropagation algorithm. Backpropagation involves adjusting the weights in a neural network according to the errors and then iterating with the new values to optimize the model. For example, one of the inputs may yield an output value of 0.7, indicating an error of 0.3 (1 − 0.7). After backpropagation, the output value may improve to 0.75. Thus, training continues till the output value reaches close to 1. Initially, errors are large due to the large weights of the input nodes. After a few iterations, the error gradually decreases towards an optimum at the bottom of the error curve. The most
common types of neural networks are recurrent neural networks (RNN),
convolutional neural network (CNN) and generative adversarial network (GAN).
A recurrent neural network has a function which processes the current input together with prior inputs across time. A convolutional neural network analyses complex data, such as an image, section by section. In a GAN, two neural networks compete with each other through a feedback loop and create new objects.
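The net input, sigmoid activation and error described above can be traced with a few lines of R; the input values, weights and bias below are arbitrary toy numbers, not values from the text.

# Forward pass through a single neuron with a sigmoid activation (toy values)
sigmoid <- function(x) 1 / (1 + exp(-x))

x <- c(0.5, 0.2, 0.9)        # input values
w <- c(0.4, -0.6, 0.8)       # connection weights
b <- 0.1                     # bias, analogous to a regression intercept

y.in <- sum(w * x) + b       # net input: weighted sum of inputs plus bias
y    <- sigmoid(y.in)        # output passed to the next layer
error <- 1 - y               # error against a target output of 1
c(net.input = y.in, output = y, error = error)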
The scope of deep learning applications is still limited, as deep learning requires a large amount of input data and high computational power. It is a challenging task to select the number of hidden layers and the hyperparameters needed to develop the right model. Usually, comparatively simple machine learning models are found to be more effective than complex deep learning models when only a small amount of input data is available.
9.7 Genome-Wide Association Mapping
Humans show only about 0.1% variation in nucleotide bases across their genomes. A variation at a single nucleotide position between two individuals in a population is known as a single-nucleotide polymorphism (SNP). The presence of one SNP in a gene can lead to a monogenic disease like sickle cell anaemia. Moreover, some
complex diseases such as diabetes and cardiovascular disease are caused by the
epistatic interaction among multiple genes and the interaction between the genes and
the environment. However, there are two major challenges in understanding the
genetic basis of these complex diseases. The discovery of a large number of genetic variants, along with their modified phenotypic manifestations under different environmental conditions, is the first challenge. The second challenge is to define a correct phenotype for a study and to perform proper statistical analysis of the data. An estimated ten million SNPs exist in the human population. Current commercially available
SNP arrays provide a comprehensive genomic coverage of about 660,000 SNPs in
different populations. The study design of a GWAS involves cases having the disease and controls without the disease, taking into account other traits contributing to confounding effects.
The SNPs and samples with low quality are usually detected through low call rates
and subsequently removed from analysis. In addition, the SNPs with low minor
allele frequency and the SNPs deviating from Hardy–Weinberg equilibrium are
removed from study in order to avoid false positive findings. It is expected that
both allele and genotype frequencies remain stable in a particular population as per
requirement of Hardy–Weinberg equilibrium. However, it is advisable to first
analyse all SNPs and then examine the validity of Hardy–Weinberg equilibrium
for those SNPs associated with the phenotype. A single SNP analysis is conducted
using a statistical test to compare the null hypothesis (i.e. there is no association
between the SNPs and the phenotype) with the alternative hypothesis (i.e. there is an
association between the SNPs and the phenotype). A small p-value obtained during
the test leads to the rejection of the null hypothesis and indicates that there is a
significant association between the SNP and the phenotype. The genetic effect of the
SNP is modelled on a continuous phenotype trait using a linear regression model.
The result of the statistical analysis can be visualized using a Manhattan plot highlighting the genomic regions, showing the −log10(p-value) of the association (Fig. 9.4). Principal component analysis (PCA) is usually used to summarize the genome-wide variability of SNPs, creating principal components of all SNPs in
the genome. The first principal component generally captures the population sub-
structure based on ethnicity. There are many programs publicly available for
genome-wide association mapping. Some R packages are available on bioconductor
for GWAS analysis and analysis of structural variations (Tables 9.1 and 9.2). PLINK is another popular program for genome-wide association mapping, including basic analyses such as haplotype analysis and LD estimation. The GWAS
Central is a comprehensive database for comparing genotype and phenotype from
various genome-wide association studies and appropriate data sets can be retrieved
from this database for analysis.
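A hedged sketch of the single-SNP regression scan and Manhattan plot described above is shown below; the genotype matrix, phenotype and map positions are simulated, SNP 100 is planted as a true association purely for illustration, and the plot is drawn with the manhattan() function of the qqman package.

# Single-SNP association scan with an additive linear model and a Manhattan plot
library(qqman)

set.seed(7)
n.snp <- 500
geno  <- matrix(rbinom(200 * n.snp, size = 2, prob = 0.3), ncol = n.snp)  # 0/1/2 coding
pheno <- rnorm(200) + 0.5 * geno[, 100]           # SNP 100 truly associated (simulated)

p.values <- apply(geno, 2, function(g)
  summary(lm(pheno ~ g))$coefficients[2, 4])       # p-value of the SNP effect per marker

gwas <- data.frame(SNP = paste0("rs", 1:n.snp),
                   CHR = rep(1:5, each = 100),
                   BP  = rep(1:100, times = 5) * 1e4,
                   P   = p.values)
manhattan(gwas, suggestiveline = FALSE)            # -log10(p) by genomic position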
Fig. 9.4 Manhattan plot of a genome-wide association analysis of a disease. X-axis and Y-axis
show chromosomal positions and -log10 p values, respectively. The horizontal red line indicates the
significance threshold of genome-wide association
9.8 Exercises
1. The genotypes of a SNP in a sample are AA, AT, TT, AA, AA, AA, AT, AA, AT,
AA, TT, AT. Using the R package genetics, compute the allele frequency,
genotype frequency and heterozygosity of the genotypes. Plot the frequency of
three types of genotypes. Test whether these genotypes are in Hardy–Weinberg
equilibrium.
Solution First, the data is converted to genotypes and frequency is computed using
the function “summary ()”
>library(genetics)
>data<-c("A/A","A/T","T/T","A/A","A/A","A/A","A/T","A/A","A/
T","A/A", "T/T", "A/T" )
> geno <- genotype(data)
> summary(geno)
Number of samples typed: 12 (100%)
Allele Frequency: (2 alleles)
Count Proportion
A 16 0.67
T 8 0.33
Genotype Frequency:
Count Proportion
A/A 6 0.50
A/T 4 0.33
T/T 2 0.17
Heterozygosity (Hu) = 0.4637681
Poly. Inf. Content = 0.345679
>plot(geno)
HWE.test.genotype(x = geno)
Raw Disequlibrium for each allele pair (D)
A T
A -0.05555556
T -0.05555556
Scaled Disequlibrium for each allele pair (D')
A T
A -0.5
T -0.5
Correlation coefficient for each allele pair (r)
A T
A 0.25
T 0.25
Observed vs Expected Allele Frequencies
Obs Exp Obs-Exp
A/A 0.5000000 0.4444444 0.05555556
T/A 0.1666667 0.2222222 -0.05555556
A/T 0.1666667 0.2222222 -0.05555556
T/T 0.1666667 0.1111111 0.05555556
Overall Values
Value
D -0.05555556
D' -0.50000000
r 0.25000000
Confidence intervals computed via bootstrap using 1000 samples
Observed 95% CI NA's Contains Zero?
Overall D -5.555556e-02 (-1.875000e-01, 8.333333e-02) 0 YES
Overall D' -5.000000e-01 (-2.840000e+00, 3.333333e-01) 1 YES
Overall r 2.500000e-01 (-3.333333e-01, 8.222222e-01) 1 YES
Overall R^2 6.250000e-02 ( 7.061648e-05, 6.760494e-01) 1 *NO*
Significance Test:
Exact Test for Hardy-Weinberg Equilibrium
data: geno
N11 = 6, N12 = 4, N22 = 2, N1 = 16, N2 = 8, p-value = 0.5176
2. Using the example data in the R package snpStats, compute linkage disequilibrium (LD) among the SNPs in the European and African populations based on D′ and R².
> library(snpStats)
Loading required package: survival
Loading required package: Matrix
> data(ld.example)
> ceph.1mb
A SnpMatrix with 90 rows and 603 columns
Row names: NA06985 ... NA12892
Col names: rs5993821 ... rs5747302
> head(support.ld)
dbSNPalleles Assignment Chromosome Position Strand
rs5993821 G/T G/T chr22 15516658 +
rs5993848 C/G C/G chr22 15529033 +
rs361944 C/G C/G chr22 15544372 +
rs361995 C/T C/T chr22 15544478 +
rs361799 C/T C/T chr22 15544773 +
rs361973 A/G A/G chr22 15549522 +
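To complete the comparison, the ld() function of snpStats can compute pairwise D′ and R² within a chosen window (depth). The call below assumes that the African (YRI) SnpMatrix yri.1mb is also loaded by data(ld.example), as in the package vignette; the depth of 100 SNPs is an arbitrary choice.

> ld.ceph <- ld(ceph.1mb, depth = 100, stats = c("D.prime", "R.squared"))
> ld.yri <- ld(yri.1mb, depth = 100, stats = c("D.prime", "R.squared"))
> image(ld.ceph$D.prime)   # LD pattern around the diagonal in the European sample
> image(ld.yri$D.prime)    # LD is generally weaker in the African sample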
Answer: 1. d 2. a 3. c 4. c 5. b 6. a 7. c 8. a 9. b 10. d
Summary
• A comparative study of a disease model and a normal healthy model provides the
mechanistic details of pathogenesis at a deeper level.
• Big data is characterized by three V-features: volume, variety and velocity.
• The human diseases associated with the copy number variation (CNV) are cancer,
autism, schizophrenia and Alzheimer disease.
• Whole-exome sequencing (WES) covers only 2% of the human genome encoding
protein sequences and detects 85% of the genomic variations associated with
various diseases.
• The initiation and progression of a disease can be explicitly captured in form of a
computational network module.
• A disease gene does not encode a hub protein and is localized in the periphery of
the network.
• A genome-scale metabolic model representing a specific type of cancer cell may
help in the identification of a drug target.
• A bipartite network shows direct connection between a drug network and its
target protein network.
• Support vector machine (SVM) is a machine learning classification method for
development of molecular biomarkers to discriminate disease samples from
healthy samples.
• A single SNP analysis compares the null hypothesis (i.e. there is no association
between the SNPs and the phenotype) with the alternative hypothesis (i.e. there is
an association between the SNPs and the phenotype).
Suggested Readings
Wang X, Baumgartner C, Shields DC, Deng H-W, Beckmann JS (2016) Application of clinical
bioinformatics. Springer, Dordrecht
Trent RJA (2014) Clinical bioinformatics. Humana Press, Totowa
Liang K-H (2013) Bioinformatics for biomedical science and clinical applications. Woodhead
Publishing, Oxford
Raza K, Dey N (2021) Translational bioinformatics applications in healthcare. CRC Press, Boca
Raton
Agricultural Bioinformatics
10
Learning Objectives
You will be able to understand the following after reading this chapter:
10.1 Introduction
The world population is estimated to reach about nine billion by the year 2050, and consequently a very steep increase in food production will be needed to feed the projected population. Large-scale innovations in agriculture are therefore necessary to boost both its productivity and its sustainability. Crops have a vital role in the global economy, being a major source of nutrients for the ever-increasing world
population. In fact, crop genomics is expected to play a significant role in the second
green revolution. The available genomic databases useful in crop genomics research
are listed in Table 10.1. The agricultural system is a complex system where the
biology of a crop interacts with the environment and management practices. Bioin-
formatics can play a pivotal role in enhancing the agricultural productivity due to
rapid progress in both computing power and next-generation sequencing technol-
ogy. The interaction and behaviour of the overall agricultural system is better
modelled through integrating the data with computational analysis. We can capture
Fig. 10.1 Structural variants like CNV and PAV cause variations in the functional gene content of
crop plants. Here, CNV in the genome is represented by three copies of Gene 2 and two copies of
Gene 3. The PAV indicates loss of one copy of each gene with gain of another gene, Gene 4 in the
genome
a wide variety of real-time big data of different genomic and environmental variables
using existing sensor technologies. Some of the promising areas of agriculture
having significant attention of bioinformatics research are pan-genome of crops,
assembly of plant genomes, identification of homeologs, genomic selection and
modelling of agricultural systems.
Modern crops harbour a continuum of genomic variations ranging from small to large: single-nucleotide polymorphisms (SNPs), insertions and deletions (indels) and structural variants (SVs). SNPs can have a major effect on the functional gene content of a crop genome, either through premature stop codons or through changes in key functional sites. Similarly, small indels can have a major impact on genome variation by causing frameshift changes and premature stop codons in genes. The structural variants such as copy number variants
(CNVs) and presence/absence variants (PAVs) are crucial determinants of useful
agronomical traits (Fig. 10.1). Both natural and artificial selection operate on these
genomic variations to increase the overall fitness of the genotype and crop
Fig. 10.2 The pan-genome of crop plants exhibits variable portions of core genes and dispensable
genes. The pan-genome size of each species is indicated by the number of pan-genes in a particular
species at the centre of the ring
Table 10.2 Structural variations in the genes involved in the abiotic stress
Type of SV Gene Crop Trait
PAV Sub1A Rice Tolerance to submergence
PAV Pup1 Rice Tolerance to phosphorus starvation
CNV MATE1 Maize Tolerance to aluminium
CNV FR-2 Wheat Tolerance to cold climate
CNV FR-H2 Barley Tolerance to frost
PAV GmCHX1 Soybean Tolerance to salinity
CNV Ppd-B1 Wheat Sensitivity to photoperiod
Table 10.3 Structural variations in the genes involved in the biotic stress
Type of SV Gene Crop Trait
PAV Pikm1-TS Rice Resistance to blast disease
PAV Pi21 Rice Resistance to blast disease
PAV qHSR1 Maize Resistance to head smut
PAV Lr10 Wheat Resistance to leaf rust
CNV Rhg1 Soybean Resistance to cyst nematode
PAV R1, ELR Potato Resistance to late blight
PAV Yr36 Wheat Resistance to stripe rust
associated traits such as grain quality or fruit quality. For example, a 1212-bp
deletion in GW5 gene can alter the grain weight and width in rice plants. Elongated
fruit shape in tomato is due to copy number variation of SUN gene. Since structural
variations create differences in the gene content between different individual lines,
the heterosis effect in hybrids may be due to passage of complementary genes from
the individual parental lines. Structural variations might have played a significant
role in the domestication of crop plants. For example, during domestication of maize
plant, both increase in apical dominance and decrease in tiller number occurred due
to insertion of transposable elements at the tb1 locus.
Pan-genomic study of crop wild relatives will reveal the full landscape of
biodiversity in the species and this untapped resource can be utilized for boosting
the crop productivity. In fact, wild relatives of crop plants are already used for backcrossing during conventional breeding. These pan-genomic studies in crop species can be linked to QTL mapping and GWAS to identify useful elite genes for further
breeding strategies. Thus, a comprehensive genome resource of crop plants will be
created through pan-genome studies in near future which can be further exploited by
plant breeders for enhanced agricultural productivity.
Computational analysis of the pan-genome is a prerequisite for fully understanding the genomic landscape of a crop species. The pan-genome of a species is computationally represented by a data structure known as a coordinate system, where all genetic variants are explicitly represented in the form of a sequence coordinate graph. Although many software tools are available for pan-genome analysis in prokaryotes, recently developed software such as Pangloss for pan-genome analysis in eukaryotes will be highly useful for crop species. Pangloss is written in the Python language to
characterize the pan-genome of eukaryotes for gene prediction, gene annotation and
functional analyses. The total diversity of a crop species, in the form of a pan-genome, can be visualized as a phylogenetic tree. A phylogenetic tree from whole-
genome data can be reconstructed using both DNA sequences and the gene content.
In a sequence-based phylogenetic tree reconstruction, genomic sequences from all
variants are first aligned using a multiple sequence alignment and a phylogenetic tree
is then reconstructed using the evolutionary distances. In the gene content approach, the presence or absence of a gene in the genome is scored as 1 or 0, respectively, and a matrix of 1s and 0s is constructed to represent the total pan-genomic profile of a crop species. The distances between different variants are then measured in terms of the Jaccard distance or the Manhattan distance. The Manhattan distance is the sum of absolute gene-wise differences between two genomes, whereas the Jaccard distance is based on the overlap (intersection) between two genomes in terms of presence or absence of genes. If two genomic sequences are
identical, both the Manhattan distance and the Jaccard distance are 0.0. Conversely, if there is no overlap at all, both measures equal 1.0 (with the Manhattan distance scaled by the total number of genes). A
distance-based phylogenetic approach such as neighbour joining or UPGMA will
generate an evolutionary tree exhibiting relationship among different variants of the
pan-genome in a crop species based on presence or absence of individual genes.
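These gene-content distances are easy to compute in R. In the sketch below the presence/absence matrix is simulated, dist(..., method = "binary") gives the Jaccard distance, the Manhattan distance is scaled by the number of genes, and the neighbour-joining tree is built with the ape package; all values are illustrative.

# Gene presence/absence (PAV) matrix and distance-based pan-genome tree (toy data)
library(ape)

set.seed(3)
pav <- matrix(rbinom(6 * 20, 1, 0.7), nrow = 6,
              dimnames = list(paste0("line", 1:6), paste0("gene", 1:20)))  # 1 = present, 0 = absent

d.manhattan <- dist(pav, method = "manhattan") / ncol(pav)  # gene-wise differences, scaled to 0-1
d.jaccard   <- dist(pav, method = "binary")                 # Jaccard distance on gene content

tree <- nj(d.jaccard)              # neighbour-joining tree of the pan-genome variants
plot(tree)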
Plants have evolved a large and complex genome for their survival in the terrestrial
environment. The genome of A. thaliana, the first plant species to be sequenced, and subsequently the genomes of some economically important species like rice, maize and papaya, were determined using the Sanger method. The advent of next-generation sequencing technology has
speeded up the pace of genome sequencing with a reduced cost. The second-
generation deep sequencing of crops is challenging due to large genome size,
duplications and repeat contents. The genome assembly process consists of a
combination of sequencing and computation. The reads generated by sequencing are combined using a computer program called an assembler. Therefore, assembly and analysis of short reads in a plant genome need substantial bioinformatics skill. The short reads generated through second-generation technology need high coverage of 50X to 100X; even this high coverage of 100X may be insufficient for
deciphering complex plant genomes. On the other hand, long-read technologies do
not need such high coverage and a coverage of 20X to 30X may be sufficient to get a
good assembly of the genome. Thus, second-generation deep sequencing technology
has produced draft genomes of many plant species that lack almost 20% of the vital genomic information about the species. Such a draft genome cannot be used reliably to understand the repetitive part of the genome or to discriminate between functional genes and pseudogenes.
In fact, a plant genome appears like gene islands surrounded by repeat sequences, which make up more than 80% of the genome. Transposons have played a profound role in
the evolution of plant species and transposable elements (TE) constitute the major
portion of the repetitive sequence in a plant genome. Small-genome crops like rice have only about 17% transposon content, whereas the large genome of maize consists of almost 85% transposable elements. The abundant presence of repetitive transposons in a crop
species poses a critical challenge in assembly of their genome. The short reads have
less power to resolve the repeats in the plant genome. Thus, longer reads generated
by single-molecule sequencing must be combined with short reads to resolve the
repetitive DNA. Repeats longer than the read length create gaps during the assembly process, which in turn can be resolved by paired-end sequencing.
Polyploidy or fusion of two or more genomes in a species has played a significant
role in the evolution of wide diversity in plants. It is a result of either whole-genome
duplication known as autopolyploidy or crossing of two species followed by dupli-
cation known as allopolyploidy. Genome duplications produce new genes with new
functions and thereby generate novel phenotypes. The polyploid plants have better
adaptive capability in an ever-changing environment due to their genomic plasticity.
A majority of crop plants like wheat, potato, sugarcane and banana are polyploids.
The redundancy due to presence of two or more copies of genes can adversely affect
the accuracy of the genome assembly. Gene duplication is another major force for
creating new genes in a genome in form of paralogs. It is difficult to distinguish
alleles from paralogs in the genome assembly of natural heterozygotes. We always
look for lineage-specific genes in a crop species for functional studies. Sometimes,
these lineage-specific genes are simply artefacts of misassemblies. These artefacts
can be avoided by developing some novel algorithms to identify real lineage-specific
genes. The development of de novo approaches in assembly such as de Bruijn graphs
combined with Eulerian paths has facilitated assembly of plant genomes with long
repeats. With the advent of third-generation single-molecule sequencing such as
PacBio, the assembly of plant genomes with long reads having an average length of
10,000 bp can circumvent the problem of long repeats. However, the high error rate
(5–15%) of this newer technology is a major constraint in the application to crop
genomes.
The homeologous genes in plants are the products of allopolyploidy and have a
common ancestry like other homologous genes. Homeologs are gene pairs that diverged by speciation and were later reunited in the genome of one species by allopolyploidy. In simple words, we can define homeologs as the orthologs between the subgenomes of a species. The subgenomes are the individual genomes in an allopolyploid species, each inherited from a different ancestral species. Thus, homeologs have combined evolutionary and functional features of orthologs (derived by speciation) and of ohnologs plus paralogs (derived by duplication) (Fig. 10.3). The
homeologous genes are expected to follow the same order (collinearity) between the ancestral genome and the descendant genomes. This is known as positional homeology, analogous to positional orthology, and such homeologous genes maintain a one-to-one relationship.
Fig. 10.3 Venn diagram showing relationship of homeologs with orthologs, ohnologs and
paralogs. Homeologs are located at the intersection of orthologs and ohnologs plus paralogs
The best bidirectional hits (BBH) approach, based on reciprocal best hits in BLAST searches, works well in inferring such one-to-one homeologs. The Orthologous MAtrix (OMA) database has implemented another graph-based approach for the identification of homeologs in wheat. First, mutually closest homologs are chosen based on evolutionary distance, taking into account both differential gene loss and many-to-many relationships among genes. In parallel to the earlier phylogeny-based approach, orthologs between different subgenomes are detected using the standard pipeline and then relabelled as homeologs. Thus, the OMA approach is a better graph-based approach because it relies on evolutionary distances rather than alignment scores. However, the quality of assembly and annotation of allopolyploid crop genomes needs to be improved for highly accurate homeolog inference.
Crop breeding depends upon the repetitive cycles of phenotypic selection followed
by crossing in each generation to produce a superior genotype. Marker-assisted selection (MAS) has been used in the recent past for the improvement of common crops by tracking underlying major genes, but it fails to capture minor gene effects in the
breeding population. With the availability of whole genomic sequences of various
crop species, the minor-effect genetic variants associated with agronomic traits can
be identified across the genome. The genomic selection operates on these genome-
wide genetic variants circumventing the need for repeated cycles of phenotype
selection. It selects the best candidates as parents for the next breeding cycle using
predicted breeding value which includes their genotypes, their phenotypes and the
genotypes of their relatives (Fig. 10.4). The breeding value of an individual is
measured by the average performance of its progenies. Single-nucleotide
polymorphisms (SNPs) are the variations at the nucleotide level in a population
and are extensively used for identification of more than 10,000 quantitative trait loci
(QTL) having economic importance. In a typical genomic selection program, there
Fig. 10.4 Schematic diagram showing development of prediction equations from phenotype and
genotype in form of thousands of SNPs in the reference population and subsequent application of
prediction equations in the selection of candidates using thousands of SNPs data for computing
genomic estimated breeding value
are two distinct but related populations: the training population and the breeding population. Both the genome-wide marker genotypes and the phenotypes of individuals in the training population are known, whereas for individuals in the breeding population only the genome-wide marker genotypes are known, without any knowledge of their phenotypes. The genetic values of individuals in the breeding
population need to be predicted during genomic selection. A prediction model is
developed from the training population to predict the genomic estimated breeding
values (GEBV) of individuals in the breeding population. This approach has a
greater power in capturing the effects of small-effect loci as well. Overall, this
model captures all the additive genetic variance of a particular trait of economic
importance. Genomic selection has a high accuracy of genomic-enabled prediction
for simple traits with high heritability. Moreover, it also has a potential to improve
the complex traits with low heritability. Thus, genomic selection can play a signifi-
cant role in enhancing genetic gain per unit time and cost in a breeding population
through accelerated breeding cycles.
Since the variance of a complex trait is modelled in different forms in each type of prediction model, the predictive performance of different models in the field varies widely. The general model for whole-genome regression analysis can be
formulated as:
y = Xb + Wa_m + e
y = Xb + Zu + e
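Reading these as the marker-effect formulation (as in ridge-regression BLUP) and the breeding-value formulation (as in G-BLUP), respectively, a hedged sketch of the first model with the rrBLUP package on simulated training data might look as follows; marker coding, population sizes and effect variances are arbitrary assumptions.

# Ridge-regression BLUP with rrBLUP: estimate marker effects, then predict GEBVs
library(rrBLUP)

set.seed(11)
n.ind <- 150; n.mark <- 300
M <- matrix(sample(c(-1, 0, 1), n.ind * n.mark, replace = TRUE),
            nrow = n.ind)                          # marker matrix coded -1/0/1
u <- rnorm(n.mark, sd = 0.1)                       # simulated true marker effects
y <- as.vector(M %*% u + rnorm(n.ind))             # phenotypes of the training population

fit  <- mixed.solve(y, Z = M)                      # fits y = Xb + Wa_m + e
gebv <- M %*% fit$u                                # genomic estimated breeding values
cor(gebv, M %*% u)                                 # prediction accuracy against true values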
Plant phenotyping is a core activity of plant breeding, in which elite plants are selected for subsequent crossing and genetic gain. Traditional phenotyping methods focus on one or a few traits in a particular environment and often lead to a marginal annual genetic gain (0.5–1%) in the productivity of major cereal crops. Crop phenomics deals with the collection of multi-dimensional phenotypic data at various scales, such as the cell, organ, plant and population levels (Table 10.4). Automated phenotyping platforms are available for generating high-throughput data from individual plants. The most common phenotyping indices of individual plants are leaf length, leaf area and fruit volume.
A genomic selection model can be applied for genetic gain under variable seasons or field conditions. However, combining genomic data with phenomic data obtained from repeated experiments in variable environments has a better potential for genetic gain in crop breeding. In the phenome-to-genome approach, SNPs and genomic regions are associated with agronomic phenotypic traits using high-throughput phenotyping tools. The final statistical model consists of three components: genotype, environment and management (G × E × M).
Image-based phenotyping is used frequently in both laboratory and field
environments. In image-based phenotyping, plant phenotypes are divided into four groups: colour, texture, quantity and geometry. Each plant is imaged by top and side cameras in three different wavelength bands, namely visible, near-infrared and fluorescence, to obtain thousands of images during the whole phenotyping period. The visible range provides colour images reflecting the nutrition, health, growth and biomass status of an individual plant. The near-infrared range provides a quantitative measure of the water content, and fluorescence imaging senses chlorophyll and other fluorophores in a plant. Machine learning and deep learning techniques are widely used in the
image-based phenotyping such as object detection and classification. The phenotype
image analysis consists of four successive steps: pre-processing, segmentation,
feature extraction and post-processing. The first step involves preparation of the image for analysis, such as outlier detection, assessment of trait reproducibility and normalization of the phenotypic profile. It is followed by the second step of segmentation, which divides the image into foreground (the plant) and background (the imaging chamber). Feature extraction then selects an optimal set of explanatory variables in a stepwise manner using variance inflation factors to obtain a list of phenotypic traits. In general,
about 400 phenotypic traits are extracted from the image of each crop plant. The final
step of post-processing is summarization of all computed results.
software programs are known as decision support systems. The first crop system models were statistical models that predicted the response of a system, such as crop yield, from past data sets. However, statistical models are not suitable for predicting the impact of unseen climate changes such as an increase in atmospheric CO2 concentration. Statistical models are generally useful when sufficient historical data sets are available for prediction.
The dynamic system simulation models are widely used to understand crop and farming systems in response to certain external changes such as weather or management practices. The typical output of a dynamic model is a daily series of outputs for a specific crop over a period of time. These models are highly complex, having numerous descriptive variables and parameters and long run times. Thus, simpler summary models suitable for certain situations are derived from a complex model. A common crop system model, the Agricultural Production Systems sIMulator (APSIM), is a complex dynamic model which predicts the yield of a crop as a function of time and space based on several inputs such as weather and soil properties. This model can be used from the R environment through a package called apsimr. Similarly, the World Food Studies (WOFOST) crop growth simulation model and the Light interception and utilization (LINTUL) model can also be implemented in R using the packages Rwofost and LINTUL-package, respectively. AquaCrop is a generic mechanistic crop growth model, covering emergence to maturity, developed by the Food and Agriculture Organization (FAO). It is the most widely used generic model to simulate the growth of different crops under
variable climates and geographical locations. AquaCropR is an R package built upon
the AquaCrop software with some additional functions. An R package for agricul-
tural data sets known as agridat is available in the R environment.
(continued)
10.8 Exercises
1. The R package ZeBook contains a dynamic model of crop growth for maize
cultivated in potential conditions. Three state variables, namely leaf area index
(LAI), total biomass (B) and cumulative thermal time since plant emergence
(TT) indicate the overall crop growth. Find the parameters and compute the
growth of crop in terms of three state variables from day 100 to 150. Plot the
increase in total biomass during this period.
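A possible solution sketch is shown below. The helper names maize.define.param(), maize.weather() and maize.model(), the chosen weather site and year, and the weather_EuropeEU dataset are assumptions about how ZeBook exposes its maize model; check ?maize.model in your installation before running.

> library(ZeBook)
> param <- maize.define.param()["nominal", ]      # nominal parameter values of the model
> weather <- maize.weather(working.year = 2010, working.site = 1,
+                          weather_all = weather_EuropeEU)
> model <- maize.model(param, weather, sdate = 100, ldate = 150)
> head(model)                                      # TT, LAI and B from day 100 onwards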
>plot(model$B,type="o",pch=16)
2. The alpha lattice design of spring oats is an available dataset with the R package
agridat. This dataset contains 72 observations on five variables, namely plot
number (plot), replicate (rep), incomplete block (block), genotype (gen) and dry
matter yield (yield). Estimate the genetic effects using the best linear unbiased
prediction (BLUPs) and heritability for yield in the R environment.
Fig. 10.6 Increase in total biomass from day 100 to day 150
> library(agridat)
> library(lme4)
> library(emmeans)
> data(john.alpha)
> dat <- john.alpha
> model <- lm(yield ~ 0 + gen + rep, data=dat) # Randomized Complete Block
(RCB) design
> model1 <- lm(yield ~ 0 + gen + rep + rep:block, dat) # Intra-block analysis
> model2<-lmer(yield~0 + gen+rep+(1|rep:block),dat)# Combined inter-intra
block analysis
> anova(model2)
Analysis of Variance Table
npar Sum Sq Mean Sq F value
gen 24 380.44 15.8515 185.9959
rep 2 1.57 0.7851 9.2124
> means <- data.frame(rcb=coef(model)[1:24],
+ ib=coef(model1)[1:24],
+ intra=fixef(model2)[1:24]) # Variety means
> head(means)
rcb ib intra
genG01 5.201233 5.268742 5.146433
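For the heritability part of the question, one common approach (a sketch, not the only possibility) is to refit the model with genotype as a random effect, extract the variance components and compute broad-sense heritability on an entry-mean basis with the three replicates of the alpha design:

> model3 <- lmer(yield ~ rep + (1|gen) + (1|rep:block), data = dat)
> vc <- as.data.frame(VarCorr(model3))
> var.g <- vc$vcov[vc$grp == "gen"]        # genetic variance
> var.e <- vc$vcov[vc$grp == "Residual"]   # residual variance
> H2 <- var.g / (var.g + var.e / 3)        # entry-mean heritability with 3 replicates
> H2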
Answer: 1. c 2. b 3. b 4. a 5. d 6. b 7. b 8. d 9. a 10. a
Summary
• The structural variants in crops such as copy number variants (CNVs) and
presence/absence variants (PAVs) are crucial determinants of useful agronomical
traits.
• The percentage of dispensable genes in a crop species reflects the diversity among
individuals of a particular species.
• Structural variation is a major causative factor in generating dispensable genes in
crops.
• The dispensable genome of a crop is enriched with genes associated with biotic
and abiotic stress resistance.
• The second-generation deep sequencing of crops is challenging due to large
genome size, duplications and repeat contents.
• The homeologous genes in plants are the products of allopolyploidy.
• Genomic selection has a high accuracy of genomic-enabled prediction for simple
traits with high heritability.
• The most common predictive models for quantitative traits are the genomic best
linear unbiased prediction (G-BLUP) and the ridge-regression BLUP
(RR-BLUP).
• The crop phenomics deals with collection of multi-dimensional phenotypic data
of crops at various scales using automated platforms.
• The dynamic system simulation models are widely used to understand crop and
farming system models in response to certain external changes.
Suggested Readings
Wallach D, Makowski D, Jones J, Brun F (2013) Working with dynamic crop models. Academic
Press, Amsterdam
Isik F, Holland J, Maltecca C (2017) Genetic data analysis for plant and animal breeding. Springer,
Cham
Normanly J (2012) High-throughput phenotyping in plants: methods and protocols. Humana Press,
Totowa
Busch W (2017) Plant genomics: methods and protocols. Humana Press, Totowa
Farm Animal Informatics
11
Learning Objectives
You will be able to understand the following after reading this chapter:
11.1 Introduction
Thus, this modular format consists of several modules and submodules with clear documentation, which are readable and readily adaptable to a new setup. Individual modules
can be adapted and improved further in a particular farm condition without
disrupting other modules of a system. For example, a whole-farm dairy systems
model consists of four integrated biophysical modules of Animal, Manure, Soil and
Crop, and Feed storage. In addition, there are three system balance modules of water,
energy and economics. The Animal module consists of three primary submodules:
Animal Life cycle, Nutrition & Production and Management & Facilities (Fig. 11.1).
The Animal Life Cycle submodule is a stochastic Monte Carlo model of events in
life of a dairy cow from birth to culling or death. The Animal module can simulate
stochastically the daily growth, production and reproduction of individual cows
based on the inputs regarding feed, weather and management. The stochastic
modelling simulates the probabilistic distribution of events for each individual
cow in a herd accommodating interactions among cows. The Nutrition and Produc-
tion submodule can predict the optimal diet of a cow for a desired milk production
using linear programming. The Management and Facility submodule is a barn-level
simulation including interaction of human intervention and environment with the
cow herd. The Soil and Crop module simulates soil temperature, hydrology, nutrient
dynamics, pasture growth and animal feeding using various climate inputs.
Interestingly, the feed storage module simulates the nitrogen and carbon loss
during harvesting and storage of animal feeds. Finally, the Water Balance module is
an all-encompassing module gathering information from four biophysical modules
with water use and generation in cattle management.
The efficiency and robustness of farm animals have been a focus of attention in the field of animal nutrition. Productive animals are known to have better feed efficiency and, as a result, a better margin over feed cost. Although selection can significantly improve the genotype of a farm animal with respect to some selected traits, other non-selected traits remain fragile under fluctuating environmental perturbations. The efficiency of a system is defined as the ratio between the fluxes
of outputs and inputs. For example, feed efficiency of a cow refers to production of
Although sheep and goats are both small ruminants, different feeding strategies are needed for them due to their specific physiological requirements. For example, dairy sheep produce wool in addition to milk. Sheep and goats eat more feed as a percentage of body weight for regular maintenance. Thus, high-producing dairy sheep and goats have a feed intake of 4% to 7% of body weight, whereas it rarely exceeds 4% of body weight in a dairy cow. Production efficiency of an animal is the maximization of its product, such as milk or meat, relative to the inputs used, and is usually represented by the feed efficiency. The feed conversion ratio is generally the reciprocal of the feed efficiency. A higher value of feed efficiency is desirable for an animal; conversely, a lower feed conversion ratio is preferred during an animal production process. Various nutritional
models have been developed to enhance the production efficiency of goats and
sheep. The current nutritional models are more comprehensive having many mecha-
nistic components such as animal, dietary and environmental variables. The energy
and nutrient requirement of sheep and goats can be predicted more accurately under
diverse environmental and management practices. These nutrient models can be
further improved by real-time monitoring of small ruminants along with environ-
mental and production variables using modern sensor technology.
A laying hen has the potential to produce at least 300 eggs annually. The weight of an egg is determined not only by the age and genetic potential of the hen but also by the nutrients fed during the laying period. Knowledge of the amino acid and energy requirements of a laying hen is necessary to predict the potential body weight at first laying and the subsequent potential egg output over a time period. Energy constitutes about 70% of the costs incurred on
the poultry feed. Methionine is the primary limiting amino acid in the feed of a
Fig. 11.2 Determination of optimum economic nutrient level for laying hen using optimization
laying hen. The age and body weight of an individual bird at first laying determine its future laying performance in terms of egg number and egg weight. This characteris-
tic feature of growth can be manipulated using different lighting and nutritional
regimes. The change in photoperiod especially the initial and final photoperiod has a
strong influence on the gonadal maturity of a laying hen. The daily intake of amino
acids and energy is largely used by a laying hen for maintenance. The body protein
content of a laying hen is found to be maximal at the age of sexual maturity and
remains comparatively stable throughout the entire laying period. A mathematical
model can predict food intake of a hen based on its body protein weight and potential
daily egg output. The model computes the energy requirements of maintenance and
egg output. However, the potential growth and egg output differ markedly between
growing and reproducing birds. An individual hen responds linearly to the increas-
ing amount of a limiting nutrient to its maximum genetic potential. The response to
nutrients are represented in terms of an economically important output such as egg
output. However, such responses are generally curvilinear when applied to a group
of birds. A simulation model can provide the answer for optimum economic nutrient
level for a group of laying birds considering marginal costs incurred and revenue
generated (Fig. 11.2). First, a feed formulation is passed to a laying hen model to
simulate the performance of hen. The costs and revenues are also calculated. During
the optimization process, feed formulations are altered following certain rules. This
optimization process iterates several times till it improves the value of objective
function using linear programming. Thus, egg producers can maximize their profits
by relying on this optimization process. In addition, it excludes the necessity for an
expensive and tedious long-term laying trial to measure the response of laying hens
to various feeds with differential nutrients.
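A toy illustration of the linear-programming step with the lpSolve package is given below; the ingredient set, costs and nutrient requirements are invented numbers used only to show the mechanics of a least-cost feed formulation, not nutritional recommendations.

# Least-cost feed formulation by linear programming (illustrative numbers only)
library(lpSolve)

cost <- c(maize = 0.30, soybean.meal = 0.55, limestone = 0.10)  # cost per kg of ingredient
nutrients <- matrix(c(13.5,  9.0, 0.02,    # maize: energy (MJ/kg), protein (%), calcium (%)
                       9.8, 44.0, 0.30,    # soybean meal
                       0.0,  0.0, 38.0),   # limestone
                    nrow = 3)              # rows = nutrients, columns = ingredients
requirement <- c(11.0, 17.0, 3.5)          # minimum level of each nutrient per kg of feed

sol <- lp(direction = "min", objective.in = cost,
          const.mat = rbind(nutrients, rep(1, 3)),
          const.dir = c(">=", ">=", ">=", "="),
          const.rhs = c(requirement, 1))   # ingredient proportions must sum to 1
sol$solution                               # optimal proportion of each ingredient
sol$objval                                 # cost of the least-cost formulation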
y_t = f(t)
There are many empirical models used for fitting lactation curve data. Early models emphasized the deterministic component of the lactation curve.
Fig. 11.3 The lactation curve of a dairy cattle showing ascending, peak and declining phases
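One classical empirical form is the Wood gamma-type curve, y_t = a·t^b·exp(−c·t), which reproduces the ascending, peak and declining phases shown in Fig. 11.3. A sketch of fitting it with nls() to simulated test-day records is given below; all parameter values are illustrative assumptions.

# Fitting the Wood lactation curve to simulated daily milk yields
set.seed(5)
t <- 5:305                                         # days in milk
y <- 15 * t^0.25 * exp(-0.003 * t) + rnorm(length(t), sd = 1.5)

fit <- nls(y ~ a * t^b * exp(-c * t),
           start = list(a = 10, b = 0.2, c = 0.002))
coef(fit)                                          # estimated a, b and c

plot(t, y, pch = 16, cex = 0.4,
     xlab = "Days in milk", ylab = "Milk yield (kg/day)")
lines(t, predict(fit), lwd = 2)                    # ascending, peak and declining phases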
The phenotype is the measurement of some features of an animal and has been the basis of all genetic improvement. High-throughput phenotyping is very crucial in
closing the gap between genotype and phenotype. The real-time acquisition of high
dimensional phenotypic data such as physiological or behavioural associated with
production on individual animal scale is known as livestock phenomics. The live-
stock phenomics is more challenging than plant phenomics because animals have a longer generation interval than crops and, unlike crops in a field, they change their location frequently. Phenomics data can be measured at different levels from
molecular level to morphological level. Molecular measurements include transcripts,
proteins and metabolites expressed in different cells or tissues at different time
points. Morphometric, behavioural and physiological data are other higher levels
of phenomics data. Recent developments in sensor technology have given us
opportunity to measure difficult or previously unmeasurable phenotyping traits in
farm animals. The knowledge of most appropriate phenotype such as feed efficiency
in a grazing livestock system will give deep insights into the biology involved in
manifestation of this phenotype. For example, interactions between genotypes and
environment and pleiotropic effects of genes can be well understood using
phenomics approaches. A sensor mounted on an animal allows comprehensive
two- and three-dimensional measurements or images of animal behaviour in a
pasture. Physiological traits such as body temperature, heart rate, respiratory rate
and rumen function can be monitored on a real-time basis for each individual animal. Sometimes, the radiant temperature in the farm environment is monitored using thermal cameras. Many phenotypes of economic importance in animals such as feed
conversion efficiency, disease resistance and reproductive potential are extremely
difficult to measure in a large pastoral environment. These complex traits are derived
from individual components and their interactions. The rapid generation of multi-
sensor high-throughput phenotypic data in an animal farm over time poses a grand
challenge in big data management. Further, the computational analysis of these big
phenotypic data for biological interpretation is necessary to enhance the farm
productivity. In dairy industry, the performance of a Holstein breed animal in the
USA is measured by 42 traits of economic importance. The success of a genomic
selection program solely depends upon large number of phenotypic measurements.
Fig. 11.4 Genomic selection in farm animals is based on linkage disequilibrium (LD) between
single-nucleotide polymorphism (SNP) and quantitative trait loci (QTL)
11.7 Exercises
> X<-lactation.calf.model2(lactation.define.param()["nominal",],300,0.1)
>X
Fig. 11.6 Body weights of chicken at weight weeks linked with male and female parents
Answer: 1. d 2. b 3. b 4. c 5. b 6. d 7. c 8. c 9. d 10. d
Summary
• Empirical model is derived from experimental data to infer relationships among
different components at a single level.
• Mechanistic model is a process-based model with several individual components
and their specific interactions.
• Homeorhetic regulations control the basic life processes associated with repro-
duction and growth of an animal.
• Homeostatic regulations operate on the adaptive changes in animals under altered
nutritional environment.
• A mathematical model can predict food intake of a hen based on its body protein
weight and potential daily egg output.
• The shape of a lactation curve is characterized by an ascending phase after
parturition reaching its maximum peak followed by a declining slope till the
dry-off of the dairy cattle.
• Bovine Genome Database (BGD) is a web-based resource providing access to
bovine genome assemblies and annotations.
• Livestock phenomics data can be measured at different levels from molecular
level to morphological level.
• The genomic selection is based on the LD between SNP and QTL.
• Genomic BLUP (GBLUP) is a common genomic-relationship method widely
used in animal breeding.
Suggested Reading
Khatib H (2015) Molecular and quantitative animal genetics. Wiley-Blackwell, Hoboken
Mondal S, Singh RL (2020) Advances in animal genomics. Academic Press, London
Isik F, Holland J, Maltecca C (2017) Genetic data analysis for plant and animal breeding. Springer,
Cham
Mrode RA (2014) Linear models for the prediction of animal breeding values, 3rd edn. CABI
Publishing, Wallingford
Computational Bioengineering
12
Learning Objectives
You will be able to understand the following after reading this chapter:
12.1 Introduction
Fig. 12.1 The control system of human thermoregulation using closed-loop feedback
using either of two deterministic approaches: continuous time or integer time points. The state variables are functions of a real-valued time variable (t) in a continuous-time system, whereas integer time points (i.e. t = 1, 2, 3, ...) define a discrete-time system. However, deterministic models, although useful, often fail to capture the inherent complexity of a biological system. Therefore, stochastic models such as hidden Markov models are also developed to represent a biological system.
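The difference between the two deterministic formulations can be seen on a toy state variable. Below, logistic growth (purely an illustrative choice, not from the text) is iterated in discrete time and, for comparison, integrated in continuous time with the deSolve package.

# The same first-order system in discrete and continuous time (illustrative rates)
library(deSolve)

# Discrete-time form: x[t+1] = x[t] + r * x[t] * (1 - x[t]/K)
r <- 0.3; K <- 100; x <- numeric(50); x[1] <- 5
for (t in 1:49) x[t + 1] <- x[t] + r * x[t] * (1 - x[t] / K)

# Continuous-time form: dx/dt = r * x * (1 - x/K), solved with an ODE integrator
growth <- function(time, state, parms)
  with(as.list(c(state, parms)), list(r * x * (1 - x / K)))
out <- ode(y = c(x = 5), times = 0:49, func = growth, parms = c(r = 0.3, K = 100))

plot(0:49, x, xlab = "time", ylab = "state variable x")   # discrete-time trajectory
lines(out[, "time"], out[, "x"])                          # continuous-time trajectory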
There is a huge demand for bio-based products such as biofuels, solvents, organic
acids, polymers, and food supplements across the world. In order to produce them in
large quantity at low cost, three performance metrics of industrial production,
namely titre (the final concentration of the product after bioprocess), yield and
productivity, should be very high. The minimum target for titre is about
100 grams per litre in most cases but a higher target of more than 200 grams per
litre is feasible in some cases. The yield of a product is defined as mole or gram of the
product obtained from mole or gram of the consumed substrate. On the other hand,
productivity is defined in two forms: specific productivity and volumetric productiv-
ity. The specific productivity is the amount of the product produced in terms of mole
or gram per cell per unit time, whereas the volumetric productivity is the amount of
product produced per volume per unit time. The importance of any parameter in
decision-making depends on the nature of bio-product and the bioprocess. A high
titre is not only a useful parameter for cost-effectiveness, it also facilitates separation
and purification of the product. A higher yield of a product is vital for production of
bulk chemicals as the major cost is incurred in procuring carbon substrates. Produc-
tivity covers the overall production cost incurred in a bioprocess such as fermenter
size and equipment depreciation cost per year.
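As a simple worked example (all numbers invented), the three performance metrics can be computed directly from the quantities measured at the end of a run:

# Titre, yield and productivity for an illustrative fermentation run
product.g   <- 5000      # grams of product at harvest
substrate.g <- 12000     # grams of substrate consumed
volume.L    <- 50        # working volume of the fermenter
time.h      <- 48        # duration of the bioprocess
cells.g     <- 400       # dry cell weight in the fermenter

titre <- product.g / volume.L                                # g/L; target often > 100 g/L
yield <- product.g / substrate.g                             # g product per g substrate
volumetric.productivity <- product.g / (volume.L * time.h)   # g per litre per hour
specific.productivity   <- product.g / (cells.g * time.h)    # g per g cells per hour
c(titre = titre, yield = yield,
  volumetric = volumetric.productivity, specific = specific.productivity)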
A simple biological module can be represented as a promoter with its expressed gene and regulatory proteins with their DNA-binding sites. These simple modules in turn
give rise to a gene regulatory network which produces an output signal after complex
computation in a biological organism when challenged by various inputs. This
computation is based on various engineering principles such as positive feedback,
negative feedback and feed-forward loop.
The performance of a biological system can be predicted in silico using compu-
tational modelling. First, the possible enzyme pathways involved in the biosynthesis
of a desired metabolite are identified. A set of reactions in individual paths are
stoichiometrically balanced and added to get the net chemical reaction for a path. A
metabolic engineer redesigns the existing network pathways in order to change the
metabolic flux rates. If one wishes to increase the production of a particular metabo-
lite, an attempt will be made to increase the flux of the metabolite from its precursor
without affecting the remainder of the metabolic pathways. The membrane flux of a
precursor molecule in a cell is increased by increasing the concentration of that
molecule in culture media in case of passive diffusion. In contrast, the concentration
or efficiency of a transporter protein is increased for active transport of precursor
across the cell membrane. The precursor molecule in the cell is converted to a
desired metabolite through multi-step enzymatic reactions. For an enzymatic reac-
tion, if substrate concentration is much lower than the Michaelis constant, the
reaction flux will be proportional to the substrate concentration. For example, consider a precursor molecule X which enters a cell through a transport process and is converted to either Y or Z inside the cell via enzymatic reactions, as illustrated in Fig. 12.2. If we want to overproduce metabolite Y, the mass flux from
X to Y needs to be increased through enhanced activity of enzyme catalysing this
reaction. This enhanced enzyme activity can be achieved by alteration of specific
activity of the enzyme (kcat / Km) or alteration of the concentration of the intracellular
various nutrients such as glycerol and glucose in chemostat cultures for improved
growth. Similarly, the effects of environmental stresses such as high temperature and
osmotic pressure or tolerance to high ethanol concentration on the adaptive labora-
tory evolution of microbial cells have tremendous industrial applications. Computa-
tional models are useful in understanding the evolution of laboratory strains.
However, the major challenges are the high computational cost of simulation and
the unknown functions of a significant proportion of genes (~30%) in a model species
like E. coli or S. cerevisiae. The effects of various environmental stressors on
microbial cell are also very difficult to integrate in the model. In spite of limitations,
computational models can play a significant role in improving laboratory strains
through adaptive laboratory evolution.
The directed evolution has been applied on the artificial evolution of enzymes to
catalyse the industrial reactions. Frances Arnold received the Nobel prize in chemis-
try in 2018 for her pioneering work on the directed evolution. Darwinian evolution
favours the fit organisms by accumulating beneficial mutations. In directed evolu-
tion, macromolecules are purposefully evolved towards new desired properties. The
evolutionary processes are implemented in three successive steps in directed evolu-
tion: mutation, selection and amplification. First, a natural macromolecule whose
properties are similar to the desired one is taken as a precursor molecule for
redesigning. Alternatively, a new macromolecule is created de novo from a collec-
tion of random sequences and the process is known as protein design. During
redesigning, catalytic activity of an enzyme can be transplanted on another enzyme
with same folds and mechanistic functions. Even a few mutations at the active site of
the enzyme facilitate the emergence of novel functions. Computational redesign of an
extant protein is generally associated with the experimental evolution of the protein.
For example, the computational design starts with simulated docking of a target
ligand into the active site of a protein retrieved from the protein data bank (PDB).
The residues around this pocket are randomized and the design algorithm repeatedly
searches for a conformational space of the side chains and ligands at the minimum
energy level for each sequence. Finally, only a few proteins are experimentally
produced from a list of sequences with minimum energy predicted by the computa-
tional design algorithm. Novel receptors for proteins and small molecules have been
designed using this approach. A variety of experimental methods such as error-prone
PCR, degenerate codons, DNA shuffling and recombination are used in vitro to
create diversity in the molecule. Directed evolution is an established method for
enzyme engineering. The specificity, stability and reaction conditions of many
enzymes have been custom-made for commercial exploitation. Lipases, generated
through directed evolution, are produced commercially as an additive to laundry
detergents to break down lipids. An enzyme becomes several hundred or thousand
times more effective than the template enzyme after a few iterative cycles of directed
evolution.
Devices Devices are made of different parts performing a user-defined function. For
example, a simple device protein generator consists of promoter, ribosome binding
site, protein coding sequence and terminator sequence. The devices have logic gates
having binary states 1 (condition satisfied) and 0 (condition not satisfied). A logic
gate takes multiple binary inputs and produces a single output in form of either
0 (OFF) or 1 (ON). There are different types of logic gates in a device such as AND
gate (all input conditions must be satisfied for the ON state), OR gate (at least one input condition must be satisfied for the ON state), NOT gate (a single input is inverted with a
compatible promoter/repressor pair) and negated gate (a combination of NOT gate
with AND/OR gate).
Systems Simple systems are developed from one or more devices. Examples of simple biological systems include feedback loops, genetic toggle switches, oscillators and repressilators. The effect of the present output on the future behaviour of a system is known as feedback. When the present output increases the future output of the system, the feedback is positive. Conversely, when the present output decreases the future output of the system, the feedback is negative. In synthetic biology, activators and repressors are used to create positive and negative feedback, respectively. A switch is a device which can be turned ON or OFF in response to an input signal. Toggle switches have only two stable steady states, ON and OFF, without any intermediate state. Toggle switches in synthetic biology are designed by integrating two repressors and two constitutive promoters, where each promoter transcribes a repressor which in turn inhibits the opposite promoter. An oscillator is a device showing a cyclical pattern (oscillations) in its output around an unstable steady state. There are two essential elements in a biological oscillator: an inhibitory feedback loop and a source of delay in the feedback loop. A repressilator is a kind of oscillator made by combining multiple inverters. For example, three genes may be combined in a feedback loop in such a way that each gene represses the next gene in the loop and is itself repressed by the preceding gene.
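The oscillatory behaviour of such a loop can be explored with a small simulation. The following is a minimal sketch in R using the deSolve package and the standard dimensionless three-gene repressilator equations, in which each mRNA is repressed by the protein of the preceding gene; the parameter values are illustrative only and are not taken from the text.

## Three-gene repressilator (standard dimensionless form), solved with deSolve
library(deSolve)

repressilator <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    ## each mRNA is repressed by the protein of the preceding gene in the loop
    dm1 <- -m1 + alpha / (1 + p3^n) + alpha0
    dm2 <- -m2 + alpha / (1 + p1^n) + alpha0
    dm3 <- -m3 + alpha / (1 + p2^n) + alpha0
    ## proteins track their own mRNA
    dp1 <- -beta * (p1 - m1)
    dp2 <- -beta * (p2 - m2)
    dp3 <- -beta * (p3 - m3)
    list(c(dm1, dm2, dm3, dp1, dp2, dp3))
  })
}

parms <- c(alpha = 216, alpha0 = 0.216, beta = 5, n = 2)   # illustrative values
state <- c(m1 = 1, m2 = 4, m3 = 10, p1 = 2, p2 = 5, p3 = 8)
out   <- ode(y = state, times = seq(0, 60, by = 0.05),
             func = repressilator, parms = parms)

## the three repressor proteins should show sustained, phase-shifted oscillations
matplot(out[, "time"], out[, c("p1", "p2", "p3")], type = "l",
        lty = 1, xlab = "time", ylab = "protein level")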
Synthetic organisms assembled from genetic parts and devices, unlike their electronic counterparts, do not always exhibit predictable behaviour because of stochastic noise and the uncertain environment of a new chassis. Sometimes, engineered cells show low productivity owing to metabolic burden, that is, the share of cellular resources consumed by the engineered metabolic pathways. A user-defined function, such as the synthesis of a new product in an engineered cell, may jeopardize the overall metabolic robustness optimized by natural selection over evolutionary timescales.
12.8 Exercises
1. The R package BiGGR contains the BiGG model (iND750) of a yeast cell. Yeast performs alcoholic fermentation: glycolysis produces pyruvic acid from glucose, which is then converted to ethanol. Plot the metabolites and reactions involved in glycolysis in the yeast model using the BiGGR package.
Solution
(a) The complete graph of the genome-scale metabolic model of yeast is shown in Fig. 12.4.
(b) The k-shortest path between extracellular glucose and intracellular glucose is shown in Fig. 12.5; the flux of this exchange reaction is 1 unit.
(c) The k-shortest path from extracellular glucose to the objective function of maximum biomass, along with the fluxes, is shown in Fig. 12.6.
Fig. 12.5 The k-shortest path between extracellular and intracellular glucose
Fig. 12.6 The k-shortest path from extracellular glucose to maximum biomass of yeast along with
fluxes
Fig. 12.7 The k-shortest path from extracellular glucose to maximum biomass of yeast, showing zero fluxes after knocking down the glucose transport (uniport) reaction
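A possible way to reproduce the glycolysis plot required in part (a) with BiGGR is sketched below. This is a sketch under stated assumptions: the bundled dataset is assumed to be named S.cerevisiae_iND750, the glycolytic reactions are assumed to carry the standard BiGG identifiers listed, and the sbml2hyperdraw call follows the pattern of the package vignette; verify all of these against data(package = "BiGGR") and the help pages before running.

## Sketch: plot the glycolysis subnetwork of the iND750 yeast model with BiGGR
## (dataset name and reaction identifiers are assumptions; check before use)
library(BiGGR)
data("S.cerevisiae_iND750")

glycolysis.rxns <- c("R_HEX1", "R_PGI", "R_PFK", "R_FBA", "R_TPI",
                     "R_GAPD", "R_PGK", "R_PGM", "R_ENO", "R_PYK")

## build an SBML sub-model restricted to these reactions
sbml.model <- buildSBMLFromReactionIDs(glycolysis.rxns, S.cerevisiae_iND750)

## draw the metabolites and reactions as a hypergraph; estimated fluxes,
## if available, can be overlaid via the rates argument (see ?sbml2hyperdraw)
graph <- sbml2hyperdraw(sbml.model, layoutType = "dot",
                        plt.margins = c(20, 80, 20, 80))
plot(graph)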
Answer: 1. b 2. a 3. a 4. c 5. d 6. a 7. a 8. b 9. b 10. a
Summary
• A cell can be perceived as a dynamic system characterized by a set of state
variables.
• Both biological systems and artificial systems are non-linear dynamical systems.
• A metabolic engineer redesigns existing network pathways in order to change metabolic flux rates.
• The best strategy for metabolic engineering is to make minimal changes to the overall metabolic network of the cell.
• Computational models are useful in understanding the evolution of laboratory
strains.
• Directed evolution has been applied to the artificial evolution of industrial enzymes.
• Synthetic biology deals with either the construction of a new artificial system or the redesign of a natural biological system.
• Common examples of simple biological systems are feedback loops, genetic
toggle switches, oscillators and repressilators.
• Toggle switches have only two stable steady states, ON and OFF, without any intermediate state.
• Synthetic Biology Open Language (SBOL) is a standard format to support the
specification and exchange of information regarding biological design.
Suggested Reading
Filipovic N (2020) Computational bioengineering and bioinformatics: computer modelling in
bioengineering. Springer, Cham
Filipovic N (2019) Computational modeling in bioengineering and bioinformatics. Academic Press,
London
Zhang G (2017) Computational bioengineering. CRC Press, Boca Raton
Smolke C (2018) Synthetic biology: parts, devices and applications. Wiley-Blackwell, Weinheim