Advances in Bioinformatics (Springer, 2021)

Vijai Singh • Ajay Kumar
Editors

Advances in Bioinformatics

Editors

Vijai Singh
Department of Biosciences
Indrashil University
Mehsana, Gujarat, India

Ajay Kumar
Department of Biotechnology
Rama University
Kanpur, Uttar Pradesh, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore
Pte Ltd. 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by
similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Acknowledgement
I would like to express my sincere gratitude and deep appreciation to Dr. J.S. Yadav,
Director (Research), Indrashil University, India, for extending his outstanding sup-
port and motivation to complete this book. I would like to give many thanks to
co-editor Dr. Ajay Kumar Singh of this book who gave me outstanding personal and
professional support as well as inspiration to finish this book.
I am delighted to thank all the authors for their excellent contributions to this
book. I would like to thank Dr. Bhavik Sawhney (Associate Editor—Biomedicine)
and Ms. Veena Perumal (Production Editor) from Springer for their excellent
management of this project.
I would like to thank Prof. Chaitanya G. Joshi, Prof. Rakesh Rawal, Prof.
Bharat Maitreya, Prof. Pawan K. Dhar, Dr. Poonam Bhargava, Dr. Madhvi Joshi,
Prof. S. R. Dave, Dr. Bhabatosh Das, Dr. Pablo Carbonell, Dr. Javin P. Oza, Dr.
Rupesh Maurya, Dr. Shams Yazdani, Dr. Indra Mani, Dr. Satya Prakash, Dr. Vimal
C. Pandey, Dr. Dinh Toi Chu, Dr. Suresh Ramakrishna, Prof. Nisheeth Agrawal,
and those whose names do not feature here but have directly or indirectly contributed
in shaping this project. I greatly appreciate the support of my students Mr. Nisarg
Gohil, Mr. Khushal Khambhati, and Ms. Gargi Bhattacharjee, whose discussion and
comments helped to shape this book.
I wish to express my gratitude to my beloved wife Pritee Singh for her endless
support, patience, and inspiration. Lots of affection for my kids Aaradhya and Ayush
who missed me during this project. I would like to warmly thank the faculty and staff
of Indrashil University for providing a great working environment.
I am aware that, despite our best efforts, some errors may have crept in during the
compilation of this first version. I would be delighted to receive feedback from
readers to further improve future editions. Last but not least, my
sincere thanks to GOD for his supreme POWER for endowing me to live with joy
and victory in the shape of this book.
Vijai Singh
About the Editors
Ajay Kumar is Professor and Head of the Department of Biotechnology in the
Faculty of Engineering and Technology at Rama University, Kanpur, Uttar Pradesh.
He has 15 years of post-PhD experience in academics and research and has held
several key positions in different universities and engineering institutes. He received
his M.Tech. in Biotechnology (2001) from the Institute of Engineering and Technology,
Lucknow (UP), and his PhD (2006) from the ICAR—Central Institute for Research on
Goats, Mathura. His research interests include computational vaccine and drug
development, genomics and proteomics, and fermentation technology. He has
published more than 90 research papers in international and national journals. He
authored many books and chapters. He serves as an editor and reviewer of several
peer-reviewed journals. He is a member of professional bodies such as the International
Association of Engineers (IAENG) and INSA. He has rendered consultancy
services in the field of vaccine research.
1 An Introduction and Applications of Bioinformatics

Henny Patel, Dhruti Bhatt, Shreya Shakhreliya, Navya L. Lam, Rupesh Maurya, and Vijai Singh
Abstract
In the past few decades, bioinformatics has been extensively explored in many
areas of biological sciences. It combines the principles of biology, computer
science, mathematics, physics, and statistics to analyze and interpret biological
data. It uses computation power, algorithm and software for extracting knowledge
from biological data for analysis, prediction, imaging, and visualization purpose.
In this chapter, we highlight recent developments in this field and how the
potential of bioinformatics has harnessed in multiple disciplines.
Keywords
Bioinformatics · Data analysis · Algorithm · Drug discovery · Data · Modelling
1.1 Introduction
There is a constant need to develop and establish technologies to gain knowledge
and to improve human and animal life. Over the past few decades, a number of
breakthrough technologies have been developed, including DNA sequencing (Sanger
et al. 1977; Gohil et al. 2019), genome sequencing, proteomics, genome annotation
and assembly (Gibson et al. 2010), protein sequencing, genome synthesis (Gibson
et al. 2010), genome editing (Cong et al. 2013; Singh et al. 2017, 2018a; Gohil et al.,
2021; Bhattacharjee et al. 2020), which have immensely enhanced the quality of
research. The advent of bioinformatics has
further made it possible to accelerate and improve wet laboratory experiments
greatly. Bioinformatics is a combination of biology, computer science, mathematics,
statistics, physics, and engineering principles that helps in analyzing and interpreting
biological data (especially large and complex) with the assistance of software tools
(Fig. 1.1).
Bioinformatics involves biological investigation that utilizes computer program-
ming and algorithms as part of its techniques. Additionally, in the field of genomics,
there is repetitive use of particular analysis “pipelines.” Scientists commonly use
bioinformatics tools for identifying candidate genes and single-nucleotide
polymorphisms (SNPs) (Cargill et al. 1999; Bhattacharjee et al. 2019) for studying
characteristics of population, different adaptations, genetic disorders, or desirable
properties (mainly in agricultural species). Bioinformatics also contributes to understanding
the organizational principles of nucleic acid and protein sequences, the study of the
latter being known as proteomics (Graves and Haystead 2002).
In several areas of biology, bioinformatics has already proven its significance.
In experimental molecular biology, image and signal processing (bioinformatics
methodologies) aids in extracting significant results from large amounts of raw
data. Bioinformatics is also important in the field of genetics, where it assists in
sequencing and annotating genomes and visualizing their mutations. It contributes to
text mining of the biological literature and supports the development of biological and gene
ontologies for organizing and querying biological data. Analysis of gene and protein
expression and regulation is now possible thanks to the development of bioinformatics
tools. It also assists in the comparison, analysis, and interpretation of genetic and genomic
data (especially for studying the evolutionary aspects of molecular biology). In combination,
it supports the analysis and classification of the biological pathways and networks that
are central to systems biology. It helps in the simulation and modelling of DNA, RNA,
proteins, and biomolecular interactions in the field of structural biology. This chapter
covers the basics of bioinformatics and its use in different fields, contributing overall
to a clearer understanding of biological sciences.
In the past decade, bioinformatics and its applications have been widely used for better
understanding and have helped to accelerate research in many areas, including DNA
sequencing, gene and genome analysis, evolutionary biology, immunoinformatics, gene
expression, proteomics, and many others (Fig. 1.2).
Fred Sanger and colleagues sequenced the genome of bacteriophage ΦX174 (Sanger et al.
1977). This was a major breakthrough, and subsequently a number of genes were
sequenced. DNA sequences were used for the identification of genes and organisms. It
was later used for the construction of evolutionary relationships among species. DNA
sequences can be analyzed with a number of tools and software packages. By 2008, the
Basic Local Alignment Search Tool (BLAST) could be used to search sequences from
more than 260,000 species, comprising more than 190 billion nucleotides (Benson et al.
2008). Still, a major part of biodiversity remains completely unexplored, and new DNA
sequencing approaches are being developed to accelerate research.
For analyzing sequences, one can obtain data from GenBank, a repository of publicly
available DNA sequences. Similarly, DNA sequences can be checked and verified using
the BLAST program (Singh et al. 2018b). Development of
algorithms has assisted with base calling for several experimental approaches to
DNA sequencing.
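As a minimal illustration of this workflow, the sketch below uses Biopython to submit a nucleotide sequence to NCBI's online BLAST service and print the top hits. The query string, the database choice (nt), and the hit limit are placeholder values chosen for this example, not data taken from the chapter.

```python
from Bio.Blast import NCBIWWW, NCBIXML

# Placeholder query: a short nucleotide fragment (any plain sequence string works)
query = "ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGG"

# Submit the query to NCBI's online blastn service against the nt database
handle = NCBIWWW.qblast("blastn", "nt", query, hitlist_size=5)
record = NCBIXML.read(handle)

# Report the best-scoring HSP of each returned hit
for alignment in record.alignments:
    hsp = alignment.hsps[0]
    print(alignment.title[:60], "E =", hsp.expect, "identities =", hsp.identities)
```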
Once a complete genome has been sequenced, a major challenge is its annotation and
analysis. In the genomics context, annotation is the process of identifying genes and
other biological features throughout a DNA sequence. It is important to automate this
process, as most genomes are too large to annotate by hand, and there is a continuing
need to annotate as many genomes as technically practicable; annotation therefore
often becomes a major bottleneck. The fact that specific genes
computational biology is now highly popular in the chemical and medical industries and
in bioinformatics (Mitchell and Taylor 1999).
This informatics enables the life scientists to
In the 1970s–1980s, comparison of viral genomes became the starting point of
comparative genomics (Koonin and Galperin 2002). The basis of comparative
genomics is to compare genomic features such as nucleotide sequences, genome
size, genes, orthology, regulation of gene expression, and other genomic
structural features between two different organisms (Xia 2013). Similarity is the key
concept in comparative genomics (Primrose and Twyman 2009). DNA evolution
proceeds at different organizational levels: at the first level it acts on nucleotides
through point mutations, while at the second level chromosomal duplication, gene
transfer, mutation, and transposition lead to rapid speciation and to genome-level
processes such as endosymbiosis, polyploidy, and hybridization (Brown 2002;
Chen and Ni 2006).
Tools used for comparing genomes include the UCSC Genome Browser (large-scale
sequencing references and genome draft assemblies), Ensembl (genome databases for
eukaryotes and vertebrates), Map Viewer (sequencing data and genome mapping),
VISTA (visualization of results based on DNA alignments), and the BlueJay Genome
Browser (visualization of multi-scale genomic changes) (Soh et al. 2012). With
mathematical models, statistics (Bayesian analysis), and algorithms (Markov chain
Monte Carlo algorithms), developers address the complexity of evolving genomic
sequences. Major studies depend on detecting homology between DNA sequences in
order to assign them to protein families (Carter et al. 2002).
1.4.1 Pan-genomics
The pan-genome is the complete set of genes of all strains of a species. Tettelin et al.
(2005) introduced the concept of pan-genomics, which was eventually established in
bioinformatics. At the beginning, however, it involved only closely related strains of a
species. The pan-genome is divided into two parts: first, the core genome, the set of
genes shared by every genome under study (for example, housekeeping genes), and
second, the dispensable or flexible genome, comprising genes present not in all but
only in one or several strains. Characterization of a bacterial species' pan-genome is
possible using bioinformatics tools such as Bacterial Pan Genome Analysis (BPGA),
an ultra-fast software package for comprehensive pan-genome analysis of
microorganisms (Chaudhari et al. 2016).
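The core/dispensable split described above can be illustrated with plain set operations; the sketch below uses made-up gene lists for three hypothetical strains rather than real BPGA output.

```python
# Toy presence/absence data: gene sets for three hypothetical strains
strains = {
    "strain_A": {"dnaA", "gyrB", "rpoB", "blaTEM"},
    "strain_B": {"dnaA", "gyrB", "rpoB", "mecA"},
    "strain_C": {"dnaA", "gyrB", "rpoB"},
}

pan_genome = set.union(*strains.values())          # genes found in any strain
core_genome = set.intersection(*strains.values())  # genes shared by every strain
dispensable = pan_genome - core_genome             # genes present in only some strains

print("pan-genome:  ", sorted(pan_genome))
print("core genome: ", sorted(core_genome))
print("dispensable: ", sorted(dispensable))
```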
Studying complex diseases has become easy since the introduction of next-
generation sequencing. Nowadays, one can easily access adequate sequence data
for mapping the genes of infertility (Aston 2014), breast cancer (Véron et al. 2014),
or Alzheimer’s disease (Tosto and Reitz 2013). Genome-wide association studies
are a valuable method to identify the mutations underlying certain complex
diseases (Londin et al. 2013). With the help of these studies, thousands of DNA
variants associated with such diseases and traits have been identified
(Hindorff et al. 2009). In addition, one of the most important applications is the
potential for genes to be used in prognosis, diagnosis, or accurate treatment. Several
studies have addressed methods of selecting the genes to be used, as well as the
issues and disadvantages of using genes to determine the presence or prognosis
of disease (Bejar 2014).
Gene expression can be measured at the mRNA level with different noise-prone
techniques such as DNA microarrays or biochips (assemblies of microscopic DNA
spots attached to a solid surface for hybridization of two DNA strands); expressed
sequence tag (EST) sequencing (short sub-sequences of cloned cDNA); serial analysis
of gene expression (SAGE) tag sequencing (whose output is a list of short tag
sequences and the number of times each occurs); massively
are used to identify regulatory regions bound by transcription factors and to analyze
and understand the dynamics of gene regulation (Fogel and Corne 2002).
For example, in unicellular organisms, one can compare the cell division cycle with
various stress factors (starvation, temperature, heat shock, and many more). For the
determination of co-expressed genes, clustering algorithms are applied to expression
data. Commonly used clustering algorithms include k-means clustering, hierarchical
cluster analysis (HCA), self-organizing feature maps (SOFM), and consensus
clustering (cluster ensembles), as sketched below.
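A minimal sketch of the clustering idea, assuming scikit-learn and SciPy are available; the toy expression matrix below stands in for real microarray or RNA-Seq measurements.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# Toy expression matrix: rows = genes, columns = conditions (e.g. time points)
rng = np.random.default_rng(0)
expr = np.vstack([
    rng.normal(5.0, 0.3, size=(10, 6)),  # one putative co-expression group
    rng.normal(1.0, 0.3, size=(10, 6)),  # a second group
])

# k-means clustering into two co-expression groups
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expr)

# hierarchical cluster analysis (average linkage), cut into two clusters
hc_labels = fcluster(linkage(expr, method="average"), t=2, criterion="maxclust")

print("k-means labels:     ", km_labels)
print("hierarchical labels:", hc_labels)
```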
Structural bioinformatics is the field of bioinformatics that analyzes and predicts the
3-D structures of proteins, RNA, and DNA. Its major emphasis is on developing new
methods for handling macromolecular structure data in order to solve specific research
problems and to create a better understanding of biological molecules. It mainly
describes structural interactions in terms of spatial coordinates. Prediction of protein
structure is the foremost application in bioinformatics (Gu and Bourne 2011). Structural
databases include the Protein Data Bank (PDB), the Nucleic Acid Database (NDB), the
Critical Assessment of protein Structure Prediction (CASP) Prediction Center, the
Protein Common Interface Database (ProtCID), the Uppsala Electron Density Server
(EDS), and several others (Luscombe et al. 2001). Understanding structure helps in
understanding protein function and disease conditions (Sudha et al. 2014).
In bioinformatics, homology is used for determining the role of specific genes and
which parts of a protein sequence are vital for structure formation and interaction with
other molecules. For example, if the sequence of gene “A” is known and is homologous
to that of gene “B”, it can be assumed that gene A may share the function of gene
B. Homology modelling uses this reasoning to predict a protein structure from a known
homologous protein structure. An example is the homology between hemoglobin in
humans and leghemoglobin in legumes, which have the same function, i.e., to carry
oxygen to cells. Although their amino acid sequences differ, their structures are similar,
reflecting their shared function and common ancestry. Other applications of structural
bioinformatics include target selection, trial mapping, the analysis of X-ray
crystallography (Ilari and Savino 2008) and NMR spectroscopy data (Sugimoto et al.
2012), and the virtual screening of models in drug discovery (Chou 2004).
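As a small practical illustration, the sketch below uses Biopython's Bio.PDB module to download one hemoglobin entry from the PDB and count the residues in each chain; the entry code 1A3N is simply an example choice.

```python
from Bio.PDB import PDBList, PDBParser

# Download the coordinate file for PDB entry 1A3N (deoxyhemoglobin, example choice)
pdb_path = PDBList().retrieve_pdb_file("1A3N", pdir=".", file_format="pdb")

# Parse the structure and summarize its chains
structure = PDBParser(QUIET=True).get_structure("1A3N", pdb_path)
model = next(iter(structure))  # first model is sufficient for X-ray entries
for chain in model:
    residues = [r for r in chain if r.id[0] == " "]  # keep standard residues only
    print(f"chain {chain.id}: {len(residues)} residues")
```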
A field of science that studies both immunogenetics and immunology data with the
help of bioinformatics tools is known as immunoinformatics. This helps to identify
and analyze epitopes (antigenic part of protein) in protein for development of antigen
Acknowledgements The authors would like to thank Ms. Gargi Bhattacharjee and
Mr. Nisarg Gohil for their support and proofreading of this chapter.
References
Alcalá-Fdez J, Sanchez L, Garcia S et al (2009) KEEL: a software tool to assess evolutionary
algorithms for data mining problems. Soft Comput 13:307–318
Andersson DI, Hughes D (2009) Gene amplification and adaptive evolution in bacteria. Annu Rev
Genet 43:167–195
Aston KI (2014) Genetic susceptibility to male infertility: news from genome-wide association
studies. Andrology 2:315–321
Bai JP, Alekseyenko AV, Statnikov A et al (2013) Strategic applications of gene expression: from
drug discovery/development to bedside. AAPS J 15:427–437
Barrett T, Wilhite SE, Ledoux P et al (2012) NCBI GEO: archive for functional genomics data
sets—update. Nucleic Acids Res 41:D991–D995
Bejar R (2014) Clinical and genetic predictors of prognosis in myelodysplastic syndromes.
Haematologica 99:956–964
Benson DA, Karsch-Mizrachi I, Lipman DJ et al (2008) GenBank. Nucleic Acids Res 36:D25–D30
Gohil N, Panchasara H, Patel S et al (2019) Molecular biology techniques for the identification and
genotyping of microorganisms. In: Tripathi V, Kumar P, Tripathi P, Kishore A (eds) Microbial
genomics in sustainable agroecosystems. Springer, Singapore, pp 203–226
Graham CH, Ferrier S, Huettman F et al (2004) New developments in museum-based informatics
and applications in biodiversity analysis. Trends Ecol Evol 19:497–503
Grau J, Ben-Gal I, Posch S et al (2006) VOMBAT: prediction of transcription factor binding sites
using variable order Bayesian trees. Nucleic Acids Res 34:W529–W533
Graves PR, Haystead TA (2002) Molecular biologist’s guide to proteomics. Microbiol Mol Biol
Rev 66:39–63
Gu J, Bourne PE (2011) Structural bioinformatics, 2nd edn. Wiley, Hoboken
Hall DA, Ptacek J, Snyder M (2007) Protein microarray technology. Mech Ageing Dev
128:161–167
Hindorff LA, Sethupathy P, Junkins HA et al (2009) Potential etiologic and functional implications
of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci
106:9362–9367
Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Mol
Biol Evol 23:254–267
Hye-Jung EC, Jaswinder K, Martin K et al (2014) Second-generation sequencing for cancer genome
analysis. In: Dellaire G, Berman JN, Robert JA (eds) Cancer genomics. Academic Press,
Boston, USA, pp 13–30
Ilari A, Savino C (2008) Protein structure determination by x-ray crystallography. In: Keith JM
(ed) Bioinformatics. Humana Press, pp 63–87
International Sheep Genomics Consortium, Archibald AL, Cockett NE, Dalrymple BP et al (2010)
The sheep genome reference sequence: a work in progress. Anim Genet 41:449–453
Jespersen MC, Peters B, Nielsen M et al (2017) BepiPred-2.0: improving sequence-based B-cell
epitope prediction using conformational epitopes. Nucleic Acids Res 45:W24–W29
Kanehisa M, Furumichi M, Tanabe M et al (2017) KEGG: new perspectives on genomes, pathways,
diseases and drugs. Nucleic Acids Res 45:D353–D361
Koonin E, Galperin MY (2002) Sequence—evolution—function: computational approaches in
comparative genomics
Kukurba KR, Montgomery SB (2015) RNA sequencing and analysis. Cold Spring Harb Protoc
2015:951–969
Lewin HA, Robinson GE, Kress WJ et al (2018) Earth BioGenome Project: sequencing life for the
future of life. Proc Natl Acad Sci 115:4325–4333
Li FW, Harkess A (2018) A guide to sequence your favorite plant genomes. Appl Plant Sci 6:e1030
Londin E, Yadav P, Surrey S et al (2013) Use of linkage analysis, genome-wide association studies,
and next-generation sequencing in the identification of disease-causing mutations.
Pharmacogenomics. Methods Mol Biol 1015:127–146
Luscombe NM, Greenbaum D, Gerstein M (2001) What is bioinformatics? An introduction and
overview. Yearb Med Inf 1:83–100
Mani N, Mani A (2017) Design of cellular quantum-inspired evolutionary algorithms with random
topologies. In: Bhattacharya S, Maulik U, Dutta P (eds) Quantum inspired computational
intelligence, pp 111–146
Mitchell M, Taylor CE (1999) Evolutionary computation: an overview. Ann Rev Ecol Syst
30:593–616
Morris JS, Baladandayuthapani V (2017) Statistical contributions to bioinformatics: design,
modeling, structure learning, and integration. Stat Model 17:245–289
Orengo CA, Pearl FMG, Bray JE et al (1999) The CATH Database provides insights into protein
structure/function relationships. Nucleic Acids Res 27:275–279
Pagel M (2006) Evolution, bioinformatics and evolutionary bioinformatics online 2
Pennacchio LA, Bickmore W, Dean A et al (2013) Enhancers: five essential questions. Nat Rev
Genet 14:288–295
Primrose SB, Twyman R (2009) Principles of genome analysis and genomics. Wiley
Pruitt KD, Tatusova T, Maglott DR (2005) NCBI Reference Sequence (RefSeq): a curated
non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 33:
D501–D504
Romero IG, Ruvinsky I, Gilad Y (2012) Comparative studies of gene expression and the evolution
of gene regulation. Nat Rev Genet 13:505–516
Sanger F, Air GM, Barrell BG et al (1977) Nucleotide sequence of bacteriophage φX174 DNA. Nature
265:687–695
Sherry ST, Ward MH, Kholodov M et al (2001) dbSNP: the NCBI database of genetic variation.
Nucleic Acids Res 29:308–311
Singh H, Raghava GP (2001) ProPred: prediction of HLA-DR binding sites. Bioinformatics
17:1236–1237
Singh H, Raghava GP (2003) ProPred1: prediction of promiscuous MHC Class-I binding sites.
Bioinformatics 19:1009–1014
Singh H, Ansari HR, Raghava GPS (2013) Improved method for linear B-cell epitope prediction
using antigen’s primary sequence. PLoS One 8:e62216
Singh V, Braddick D, Dhar PK (2017) Exploring the potential of genome editing CRISPR-Cas9
technology. Gene 599:1–18
Singh V, Gohil N, Ramírez García R et al (2018a) Recent advances in CRISPR-Cas9 genome
editing technology for biological and biomedical investigations. J Cell Biochem 119:81–94
Singh V, Gohil N, Ramírez-García R (2018b) New insight into the control of peptic ulcer by
targeting the histamine H2 receptor. J Cell Biochem 119:2003–2011
Soh J, Gordon PM, Sensen CW (2012) The Bluejay genome browser. Curr Protoc Bioinf 37:10–19
Spears WM (1995) Adapting crossover in evolutionary algorithms. In: Evolutionary programming,
pp 367–384
Stalker J, Gibbins B, Meidl P et al (2004) The Ensembl Web site: mechanics of a genome browser.
Genome Res 14:951–955
Sudha G, Nussinov R, Srinivasan N (2014) An overview of recent advances in structural bioinfor-
matics of protein–protein interactions and a guide to their principles. Prog Biophys Mol Biol
116:141–150
Sugimoto M, Kawakami M, Robert M et al (2012) Bioinformatics tools for mass spectroscopy-
based metabolomic data processing and analysis. Curr Bioinf 7:96–108
Swofford DL (1993) PAUP: Phylogenetic Analysis Using Parsimony, version 3.1.1. Laboratory of
Molecular Systematics, Smithsonian Institution, Washington D.C
Tamura K, Stecher G, Peterson D et al (2013) MEGA6: molecular evolutionary genetics analysis
version 6.0. Mol Biol Evol 30:2725–2729
Tettelin H, Masignani V, Cieslewicz MJ et al (2005) Genome analysis of multiple pathogenic
isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl
Acad Sci 102:13950–13955
Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680
Tosto G, Reitz C (2013) Genome-wide association studies in Alzheimer’s disease: a review. Curr
Neurol Neurosci Rep 13:381
Venter JC, Adams MD, Myers EW et al (2001) The sequence of the human genome. Science
291:1304–1351
Véron A, Blein S, Cox DG (2014) Genome-wide association studies and the clinic: a focus on breast
cancer. Biomarkers Med 8:287–296
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev
Genet 10:57–63
Wiens JJ (1995) Polymorphic characters in phylogenetic systematics. Syst Biol 44:482–500
Xia X (2013) What is comparative genomics? In: Comparative genomics. Springer, Berlin,
Heidelberg, pp 1–20
2 Bioinformatics Tools and Software

Aeshna Gupta, Disha Gangotia, and Indra Mani
Abstract
Keywords
2.1 Introduction
A large amount of “omic” data has been produced by the onset of immense
technological advances in science. Comprehension of the massive amount of this
sequence and structure data being produced at numerous levels of biological systems
is the principal task (Pevsner 2015). This is where “Bioinformatics” comes into play.
It is an interdisciplinary field that can be referred to as the use of computational
algorithms to assemble, evaluate, comprehend, visualize, and archive data associated
with biomolecules (Luscombe et al. 2001; Pevsner 2015). Various fields of modern
biology like genomics, transcriptomics, proteomics, genetics, and evolution are
incorporated within bioinformatics (Kumar and Chordia 2017). Applications of
bioinformatics range from genome sequencing and the prediction of genes and their
functions to protein analysis, such as prediction of protein structure and function,
phylogenetic studies, the design of drugs and vaccines, the identification of organisms,
and the support and advancement of research in biotechnology. Ultimately,
bioinformatics facilitates the discovery of new biological insights (Kumar and Chordia
2017). Due to the rapid increase of biological data in the form of sequences, structures,
pathways, and interactions, biological science has become a data-rich science.
The post-genomics revolution period has witnessed a large amount of data that
relates to the analysis of DNA, RNA, and proteins, the complex networks and
ecosystems in which living organisms engage, and the crucially important
metadata, which puts “omics” data in context. The 2020 coronavirus pandemic
demonstrated the importance of rapid data analysis and interpretation in controlling
the spread of disease, with data being shared quickly and openly; this further highlights
the significance of bioinformatics in data sharing and analysis (Bickerton 2020).
Metagenomics (a culture-independent molecular approach) is the process of sequencing
DNA from the genomes of all species in a given sample and is one of the
common methods for the study of the structure and function of the microbiome
population. Researchers are discovering new metagenome-encoded genes, many of
which may be of biotechnological or pharmaceutical concern. Such kind of analysis
requires complex bioinformatics tools to analyze the data (Roumpeka et al. 2017).
Several methodologies are used to deduce different levels of microbiome knowl-
edge. These techniques include analysis of the 16S ribosomal RNA (16S rRNA)
gene, analysis of the whole genome shotgun (WGS; metagenome), and analysis of
the whole transcriptome shotgun known as metatranscriptome (Niu et al. 2018).
Some bioinformatics tools used for analyzing 16S rRNA are QIIME, UPARSE,
MOTHUR, DADA2, and minimum entropy decomposition (MED) (Niu et al.
2018). Metagenomic analysis tools include MetaPhlAn2 (Truong et al. 2016),
Kraken (Wood and Salzberg 2014), CLARK (Ounit et al. 2015), FOCUS (Silva
et al. 2014), SUPER-FOCUS (Silva et al. 2016), and MG-RAST (Meyer et al. 2008).
data. The following are some widely used tools for submitting
sequence data to major biological databases like NCBI (National Centre for Bio-
technology Information), EMBL (European Molecular Biology Laboratory), and
DDBJ (DNA Data Bank of Japan).
2.3.1.1 BankIt
BankIt is a web-based tool to submit sequence data to NCBI-GenBank (https://www.
ncbi.nlm.nih.gov/WebSub/). It can be used to submit a single sequence, a few
unrelated sequences (with different features and/or source information), or even a
large set of sequences (Fetchko and Kitts 2011).
2.3.1.2 SPIN
SPIN (https://www.ebi.ac.uk/swissprot/Submissions/spin/) is a web interface
provided by UniProt (Universal Protein Resource) for submission of protein
sequences for which data at the protein level is present. Any protein sequence that
is determined either through Edman degradation or through mass spectrometry
results interpreted manually can be submitted. Sequence data obtained from peptide
mass fingerprinting or other mass spectrometry procedures based on database searches,
or translated nucleotide sequences, are not accepted. Through this service, newly
sequenced proteins become accessible to the research community, and scientists can
obtain UniProt accession numbers that can be used in the literature (Pichler et al. 2018).
2.3.1.3 WEBIN
WEBIN (https://www.ebi.ac.uk/ena/submit/sra/#home) is EMBL's recommended
web-based submission platform for nucleotide sequences and biological annotation
data (biology associated with sequence data). This tool enables the submission of
single, multiple, or very large numbers of sequences (bulk sequences) (Stoesser et al.
2002).
2.3.1.4 Sequin
Sequin (https://www.ncbi.nlm.nih.gov/Sequin/) is a standalone software to submit
and update sequence data to the GenBank, EMBL, or DDBJ sequence database. It is
devised by the NCBI. It is ideal for processing simple submissions containing a
single sequence of short mRNA. Complicated entries containing long sequences
with or without gaps, various annotations, or phylogenetic and population studies
can likewise be submitted with the assistance of this software (Stoesser et al. 2002).
2.3.1.5 SAKURA
SAKURA is a data submission system developed by DDBJ. It is a World Wide Web
(WWW) interface-oriented system. Submitters have an opportunity to “pause and
resume” their work while using this tool, in which the typed-in information is
temporarily kept on the server for one month, unless the session is intentionally
terminated by the submitter. Compared with E-mail submissions, errors are checked
more thoroughly by SAKURA. The system classifies errors into three types (mandatory,
illegal, and semantic) and issues error and warning messages whenever appropriate
(Yamamoto et al. 1996).
Retrieval of data is as important as submission and one of the main objectives of any
database is to provide the users with the required information. Any database contains
immense amounts of information, retrieving which is also a critical task depending
on the right use of search strings.
2.3.2.1 Entrez
Entrez (http://www.ncbi.nlm.nih.gov/Entrez/) is a text-based search and retrieval
platform implemented by NCBI that offers interconnected links to nucleotide and
protein sequence information, information on gene and genome mapping, structural
data, biomedical literature, etc. It comprises over 20 databases, including nucleotide
sequence data from GenBank (which incorporates information from EMBL and DDBJ)
and complete protein sequence data from PIR (Protein Information Resource)-
International, PRF (Protein Research Foundation), Swiss-Prot, and PDB (Protein Data
Bank). Database documents containing biological sequences and 3-D structural data,
or abstracts from the scientific literature, can be retrieved using simple Boolean
queries (a type of search that allows users to combine keywords or phrases with
operators such as “and,” “not,” and “or” to produce more relevant results)
(Schuler et al. 1996). A single, well-defined object (e.g. a particular protein sequence
or PubMed citation) is recognized by a unique ID (UID) (Ostell 2002). Results may
be viewed in various formats like flat-file, FASTA, XML, etc. Entire genomes or
chromosomes, as well as biological annotation on individual sequences can be
visualized via a graphical interface.
However, documents identified in this manner are not endpoints in themselves.
Instead, they serve as entry points for further exploration with the help of hypertext
links. For example, cross-reference between a sequence and the corresponding
research article in which it was reported, or between a protein sequence and the
sequence of the gene encoding it, is possible when using Entrez (Schuler et al. 1996).
The extent of such hypertext links can also be expanded to include external services,
such as organism-specific genome databases through a service called LinkOut
(Sayers et al. 2009). Henceforth, connections between different data that may
propose future analyses or help in understanding of the available information can
be deduced through Entrez.
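A minimal sketch of an Entrez query from a script, using Biopython's Bio.Entrez wrapper around the same E-utilities; the Boolean search term, the e-mail address, and the record choice are placeholders for this example.

```python
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.org"  # NCBI asks for a contact address (placeholder)

# Boolean query combining keywords with AND, as described above
search = Entrez.esearch(db="nucleotide",
                        term="insulin[Gene Name] AND Homo sapiens[Organism]",
                        retmax=5)
ids = Entrez.read(search)["IdList"]

# Fetch the first matching GenBank record and parse it with SeqIO
fetch = Entrez.efetch(db="nucleotide", id=ids[0], rettype="gb", retmode="text")
record = SeqIO.read(fetch, "genbank")
print(record.id, record.description, len(record.seq), "bp")
```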
2.3.2.2 SRS
Sequence retrieval system (SRS), developed at the European Bioinformatics Insti-
tute (EBI) at Hinxton, UK, is a homogeneous interface to approximately
80 biological databases. It is suitable for flat-file databases, such as the EMBL
nucleotide sequence database or the Swiss-Prot database of protein sequences. It
comprises databases of sequence and protein 3-D structure data, information about
2.3.2.3 Getentry
Getentry (http://getentry.ddbj.nig.ac.jp/top-e.html) is used to retrieve data from
various databases of DDBJ. Unique identifiers required for retrieval through
Getentry can be accession number, gene name, etc.
The worldwide Protein Data Bank (wwPDB) is a partnership of servers for the
collation, maintenance, and dissemination of macromolecular structure data. It
contains structures of biological macromolecules determined by NMR (nuclear
magnetic resonance), X-ray or neutron diffraction, and cryo-electron microscopy
(Abriata 2017). The current wwPDB members include the Research Collaboratory
for Structural Bioinformatics PDB (RCSB PDB) in the USA, the PDB in Europe
(PDBe), PDB Japan (PDBj), and the Biological Magnetic Resonance Bank (BMRB,
University of Wisconsin in the USA). Entries in PDB include structures of isolated
proteins, nucleic acids, their complexes with each other as well as with other small
molecules like cofactors, substrate analogues, regulators, inhibitors, etc. (Abriata
2017). For automated and precise structure deposition, various tools have been
designed by PDB. The deposition process consists of assembling and entering data
(coordinates and structure-factor files, source and sequence of the macromolecules in
the structure, citations) and finally submitting it to the PDB (Yang et al. 2004).
Following are some of the tools for deposition of structure data to PDB:
2.3.3.1 ADIT
ADIT “Auto Dep Input Tool” (http://deposit.pdb.org/adit) is an integrated structural
data collection, editing, evaluation, and deposition software system for the PDB.
Three tasks can be performed during an ADIT session: (a) a data-format pre-check in
which the format of the coordinate data file is checked to ensure that it conforms with
either PDB or mmCIF (macromolecular Crystallographic Information File) format;
(b) validation, which requires verifying the accuracy of data with known standards
and generating a report; (c) actual deposition of a structure. All categories in ADIT
should be completed correctly during the deposition process and reviewed before
submission. A PDB ID is allocated to the entry upon the structure’s successful
deposition (Yang et al. 2004).
2.3.3.2 pdb_extract
pdb_extract is an application that, at each level of the process of structure determina-
tion, can retrieve information from the output of standard crystallographic programs.
In order to construct two mmCIF data files, one with structure factors and the other
with structure details, along with its coordinates, files containing the necessary
information are merged. These two mmCIF data files have now become fully
prepared for deposition and authentication. Three versions of pdb_extract are avail-
able: a web interface (http://pdb-extract.rutgers.edu), a standalone application, and
part of the CCP4 package (Collaborative Computational Project No. 4 - Software for
Macromolecular X-Ray Crystallography) (Yang et al. 2004).
2.3.3.3 AutoDep
AutoDep (https://www.ebi.ac.uk/pdbe/deposition) is a web-based tool developed by
EBI (European Bioinformatics Institute) for the submission of X-ray crystallography
and NMR spectroscopy structures and data to the PDB and BMRB.
2.3.3.4 EMDep
EMDep (https://www.ebi.ac.uk/pdbe-emdep/emdep/) is a web-based tool developed
by EBI that enables the submission of data to the Electron Microscopy Data Bank
(EMDB), an archive of high-resolution 3D cryo-electron microscopy data. EMDB
contains 3D maps (volumes), masks, images, and bibliographic citations, as well as
processed primary data. The deposition system allows users to deposit 3D maps to
EMDB and associated coordinate data to the PDB.
2.3.3.5 OneDep
The OneDep system (http://deposit.wwpdb.org) is a common portal for the deposition
of atomic coordinates and related experimental data, derived from the three currently
accepted structure determination techniques, to the PDB archive. It was conceived by
the wwPDB partners as a multinational collaboration. Depending
on the geographical location of the depositor, the structure is allocated to one of the
wwPDB sites for processing: RCSB PDB, PDBe, or PDBj (Young et al. 2017).
With the use of sequence alignment, the structural and functional aspects of a novel
sequence can be easily predicted. The higher the sequence similarity, the greater the
chance that two sequences share a similar structure or function. Sequence alignments
are of two sorts, i.e., comparing two sequences (pairwise) or more than two (multiple)
over a series of characters. Alignment of three or more protein or nucleotide sequences
is referred to as multiple sequence alignment (MSA). Genes that are similar are the
ones that may be conserved among different species (Troy et al. 2001). A minimal
pairwise-alignment sketch is given below.
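The sketch below shows a pairwise global alignment with Biopython's PairwiseAligner; the two short sequences and the scoring values are arbitrary illustration choices.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"          # Needleman-Wunsch-style global alignment
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

seq1 = "ACGTTGCACGT"
seq2 = "ACGTGCACGTA"

alignments = aligner.align(seq1, seq2)
best = alignments[0]
print("score:", best.score)
print(best)                      # prints the aligned sequences with gaps
```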
2.3.4.1 BLAST
BLAST stands for Basic Local Alignment Search Tool, developed by Stephen Altschul
and colleagues at NCBI in 1990 (Altschul et al. 1990). It is one of the most commonly used
programs for sequence analysis based on pairwise sequence alignment. It carries
out alignment as well as provides statistical information about the alignment.
22 A. Gupta et al.
Pairwise sequence alignment helps to identify regions which are similar between the
two biological sequences. The similarities obtained may be indicative of the func-
tional, structural, and evolutionary relationships. The algorithm of BLAST is based
on a heuristic word method. This technique works by discovering short stretches of
identical or nearly identical letters (words) in the two sequences, on the premise that
at least one word (a short stretch of characters) should be common to two related
sequences. Once the word matches have been identified, extending the regions of
similarity from these words leads to a longer alignment. In addition, finding a region of
high sequence similarity and then joining neighboring high-scoring regions leads to the
development of a full alignment (Xiong 2006). Michael J Conway showed that BLAST
analysis of the cDNA pool
(Xiong 2006). Michael J Conway showed that BLAST analysis of the cDNA pool
of cell line of Carassius auratus, commonly known as crucian carp and head kidney
tissue of Ctenopharyngodon idella, commonly known as grass carp, indicated that the
sequences belong to SARS-like coronaviruses and that such sequences have diverged
evolutionarily in other species. It is therefore likely that SARS-like coronaviruses in
aquatic habitats are widespread environmental pathogens (Conway 2020).
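To make the heuristic word method more concrete, the toy sketch below indexes all words of a fixed length in a query, finds exact word hits in a subject sequence, and performs a simplified ungapped extension; the word size, scores, and drop-off value are arbitrary and much smaller than BLAST's defaults.

```python
from collections import defaultdict

def find_word_hits(query, subject, w=4):
    """Index all length-w words of the query, then scan the subject for exact
    word matches -- the 'seeds' that BLAST extends into alignments."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], [])]

def extend_right(query, subject, i, j, match=1, mismatch=-2, drop=3):
    """Ungapped extension to the right of a seed, stopping once the running
    score falls 'drop' below the best score seen (a simplified X-drop rule)."""
    score = best = end = 0
    for k in range(min(len(query) - i, len(subject) - j)):
        score += match if query[i + k] == subject[j + k] else mismatch
        if score > best:
            best, end = score, k + 1
        if best - score >= drop:
            break
    return best, query[i:i + end]

q = "ACGTTGCACGTAGCT"
s = "TTACGTTGCAAGTAGCTA"
for i, j in find_word_hits(q, s):
    print((i, j), extend_right(q, s, i, j))
```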
Types of BLAST
Different variants of the BLAST program have been developed, namely blastn, blastp,
blastx, tblastn, and tblastx. These programs differ in the type of query and database
sequences (protein or nucleotide) and are accessible via https://blast.ncbi.nlm.nih.gov/Blast.cgi.
Table 2.1 summarizes the query and subject sequences of the different BLAST programs.
Blastn compares a nucleotide query against a nucleotide sequence database. For protein
queries, blastp scans the protein sequence database. Blastx takes a nucleotide query,
translates it into all six reading frames, and compares the translated protein sequences
against a protein database. Tblastn compares a protein query against a nucleotide
database translated in all six frames; it is ideal for finding protein homologs encoded in
newly sequenced genomes. Tblastx translates both the nucleotide query and the
nucleotide database and compares the translated sequences (Berkeley Library,
University of California, 2020, https://guides.lib.berkeley.edu/ncbi). In addition, BLAST
offers programs for special purposes, e.g., bl2seq, immunoglobulin BLAST, and
VecScreen. In view of its speed, high selectivity, and adaptability, BLAST is frequently
the best option among sequence similarity search programs, and, more importantly,
this method forms the basis for genome annotation.
Table 2.1 List of different BLAST programs and their sequence searches

Type of BLAST   Query sequence   Subject sequence
BLASTn          Nucleotide       Nucleotide
BLASTp          Protein          Protein
BLASTx          Nucleotide       Protein
TBLASTn         Protein          Nucleotide
TBLASTx         Nucleotide       Nucleotide
Statistical Significance
A list of matching sequences ranked on the basis of statistical significance is given
by the BLAST output window. The scores provided help to distinguish between
sequences that are evolutionarily related and unrelated. One of the most important
statistical indicators in BLAST is the E-value or expectation value. As the name
suggests, it is the number of alignments with a given score that would be expected to
arise by random chance in a database search; it therefore indicates the probability that
a given sequence match is merely a chance occurrence. The lower the E-value, the
lower the likelihood that the database match occurred by random chance, and thus the
more significant the match (Koonin and Galperin 2003). Another statistical parameter
in BLAST is the bit score. The bit score is a normalized score for sequence comparison
that is independent of database size and query length. The bit score, expressed as “S,”
is a standardized score, communicated in bits, that lets you gauge the size of the search
space you would need to look through before you would expect to find a score as good
as or better than this one by chance (Fassler and Cooper 2011).
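The relationship between raw score, bit score, and E-value can be written down directly; the sketch below uses the standard Karlin–Altschul formulas, with placeholder values for the statistical parameters λ and K, the raw score, and the search-space sizes.

```python
import math

def bit_score(raw_score, lam=0.625, K=0.041):
    """Normalized bit score S' = (lambda*S - ln K) / ln 2.
    lambda and K depend on the scoring system; the defaults here are placeholders."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least S' in a search
    space of size m x n: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bits)

s_bits = bit_score(48)
print(f"bit score = {s_bits:.1f}",
      f"E-value = {e_value(s_bits, query_len=300, db_len=2_000_000):.2e}")
```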
Fig. 2.1 (a) Graphical presentation of blastp for human insulin (AAA59172.1) and (b) blastn for
human insulin (AH002844.2)
Fig. 2.2 Matching list or description for human insulin (AAA59172.1): Displaying E-value,
accession no., score, identity %
Fig. 2.3 Alignment section of protein BLAST for human insulin (AAA59172.1)
This is followed by the text description, which is further divided into three areas,
namely header, statistics, and alignment. The header section displays a single-line
description of the database sequence along with the accession number of the database
hit. The statistics part of the output includes the bit score, E-value, percentage identity,
positives, and gaps. The last section, the alignment section, of a protein BLAST result
has the query sequence on top, a matching sequence in between, and the subject
sequence aligned to it at the bottom (Fig. 2.3), while
the nucleotide BLAST has just the query and subject sequences (Fig. 2.4). When the
sequences have matching identical residues they are displayed at their respective
positions (blastp) or a vertical line representing matches is present between the two
sequences (blastn). Residues that are not identical but have similar physicochemical
properties or represent evolutionarily conserved substitutions are, in the case of amino
acids, marked with a + sign and counted as positive matches (Fig. 2.5). When two
dissimilar residues are present, it represents a
mismatch. A horizontal line represents a gap which is created in order to get the
flanking region to match (Fig. 2.6) and all the low complexity regions are masked
with Xs or Ns or displayed as lowercase letters so that they do not interfere with the
alignment (Fig. 2.7).

Fig. 2.4 Alignment section of nucleotide BLAST for human insulin (AH002844.2)
2.3.4.2 CLUSTAL W
CLUSTAL W is a multiple sequence alignment (MSA) program for aligning homologous
nucleotide or protein sequences (Thompson et al. 1994). It can be accessed
through www.ebi.ac.uk/clustalw/ on the EMBL-EBI website. The CLUSTAL and
CLUSTAL V series of programs, originally developed by Des Higgins, Fabian Sievers,
David Dineen, and Andreas Wilm, gave rise to CLUSTAL W. The “W” in CLUSTAL W
stands for “weights”, as different weights are now given to sequences and parameters at
different positions in alignments. MSA helps us to recognize the most evolutionarily
conserved regions that are essential to the function of a given gene and to recognize
changes in function, as well as their causes, at the sequence level. Additionally,
information about the structure and function of proteins can be acquired, which is
useful in examining new domains or motifs of biological importance. The algorithm is
based on the argument that groups of sequences are phylogenetically related, i.e., if
they can be aligned, there is usually an underlying phylogenetic tree. This approach is
commonly referred to as progressive alignment. Figure 2.8 briefly describes the
algorithm behind the Clustal programs (Aiyar 2000).
Certain basic features of CLUSTAL W include (1) support for more file formats
for trees, sequence data sets, and alignments; (2) optional, full dynamic program-
ming alignments for estimating the initial pairwise distances between all the
sequences; (3) neighbor-joining TM trees for the initial guide trees, used to guide
the progressive alignments; (4) sequence weighting to correct for unequal sampling
of sequences at different evolutionary distances; (5) dynamic calculation of
sequence- and position-specific gap penalties as the alignment proceeds; (6) the
use of different weight matrices for different alignments; and (7) improved facilities
for adding new sequences to an existing alignment (Higgins et al. 1996). New
options have been included in Clustal W 2.0 and 2.1 (the latest version) to permit
quicker alignment of large data sets and to improve alignment accuracy. Moreover, it is
capable of handling some very difficult protein alignment problems. The
Clustal W results can then be used for creating phylogenetic trees, which
Output Format
The output has a simple text mode interface with the sequences aligned to each other.
“*” represents presence of identical residues or nucleotides in that column while a
blank space is indicative of no match and gaps are represented using hyphens “-”
(Fig. 2.9). A “:” shows conserved substitutions and “.” represents semi-conserved
substitutions (Fig. 2.10). The aligned sequences represent regions of similarity and
could be indicative of closely related genes or common ancestry.
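Once an alignment in Clustal format has been produced, it can be inspected programmatically; the sketch below uses Biopython's AlignIO to read a hypothetical alignment file and count the fully conserved columns that Clustal would mark with "*".

```python
from Bio import AlignIO

# Read a Clustal-format alignment (the file name is a placeholder)
aln = AlignIO.read("insulin_mrna.aln", "clustal")
print(f"{len(aln)} sequences, alignment length {aln.get_alignment_length()}")

# Count columns with one identical residue in every sequence and no gaps,
# i.e. the positions Clustal marks with '*'
conserved = sum(
    1
    for col in range(aln.get_alignment_length())
    if "-" not in aln[:, col] and len(set(aln[:, col])) == 1
)
print("fully conserved columns:", conserved)
```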
2.3.4.3 CLUSTAL X
CLUSTAL X is a variant of Clustal W that has a graphical user interface developed
using the NCBI VIBRANT toolkit (Thompson et al. 1997). The current version
of Clustal X named Clustal X 2.1 (Larkin et al. 2007) is available on Linux, Mac, and
Windows (http://www.clustal.org/download/clustalx_help.html). The software is
intended to (1) carry out multiple alignments, (2) view the effects of the methods used
for the alignment, and (3) improve the alignment if possible. Clustal X has options that
are unavailable in Clustal W which help in improving the alignment (such as choosing a
part of the alignment to be realigned with different gap penalties while keeping the
remainder of the alignment fixed). Clustal X utilizes a similar technique to Clustal W
Fig. 2.9 Output of Clustal W 2.1 showing MSA between Homo sapiens insulin (AH002844.2),
Oryctolagus cuniculus insulin mRNA (M61153.1), and Octodon degus insulin mRNA (M57671.1)
(PHYLogeny Inference Package) displays trees, with branch order, branch lengths,
and sequence names, as a progression of nested parentheses.
In a study spanning 3 years (2010–2013, growing seasons from May to July), the level
of Cucumber mosaic virus (CMV) in taro was determined, and maximum likelihood
phylogenetic trees of nucleotide and amino acid sequences were generated using
Clustal X v1.8. The results showed that these CMV isolates from taro in China belong
to subgroup I (Wang et al. 2014). In another study, conducted in Fujian province to
detect the genotypes and the infection rate of Japanese encephalitis virus (JEV) in
mosquitoes, applications such as Clustal X (1.83), MegAlign, GeneDoc 3.2, and MEGA
(Molecular Evolutionary Genetics Analysis) 4.0 were used to splice sequences, deduce
amino acid sequences, and perform phylogenetic differentiation analysis of the
nucleotides (He et al. 2012).
Successful model building requires at least one experimentally determined 3-D
structure (template) that has significant sequence similarity to the target sequence.
Experimental structure determination and comparative modeling complement each
other in the exploration of protein structure space. Modeling of protein structures
generally requires extensive expertise in structural biology and the use of highly
specialized computer programs for each of the modeling steps (Tramontano et al.
2001). Several 3-D structure modeling tools are available.
2.3.5.1 SWISS-MODEL
SWISS-MODEL (www.expasy.ch/swissmod/SWISS-MODEL.html) is an automated 3D
protein structure modeling server, allowing the user to automatically submit a sequence
and get back a structure. The tool is based on homology (comparative) modeling
methods, which use experimental protein structures to build models of evolutionarily
related proteins, called targets. 3D models built by SWISS-MODEL are integrated into
the INTERPRO database (Mulder et al. 2003). The server creates a model via automated
alignment (first approach mode) or manual alignment (optimization mode). In the
former, the user submits a sequence for modeling and the server aligns the requested
sequence with PDB sequences using BLAST. A raw model is constructed after selection
of suitable templates, and structure refinement is carried out with GROMOS.
Alternatively, the user can also specify or upload structures as templates. In the
optimization mode, the user creates an alignment in the Swiss-PDB Viewer and submits
it to the model development server (Schwede et al. 2003).
SWISS-MODEL consists of the following components:
In a recent study on SARS-CoV-2, specific epitopic regions in the spike protein were
screened and evaluated for their energetics, half-maximal inhibitory concentration
(IC50), and MHC II reactivity, with some of them proving to be promising targets for
vaccination. The structural quality of the spike glycoprotein models was verified using
SWISS-MODEL, Phyre2, and PyMOL. The potential role of glycosylation of the
epitopic regions indicated significant impacts on epitope recognition, which could be
useful in developing a suitable immunization regimen against SARS-CoV-2 (Banerjee
et al. 2020). Rahman et al. screened natural compounds using an
in silico approach in order to discover potential inhibitors of the host enzyme,
transmembrane protease serine 2 (TMPRSS2), in the case of SARS-CoV-2. Viral entry
into host cells is mediated by this enzyme, and its inhibition renders the virus unable to
bind to angiotensin-converting enzyme 2 (ACE2), thereby limiting the pathogenesis of
SARS-CoV-2. SWISS-MODEL was used to construct
the 3-D structure of TMPRSS2, and validation was done by RAMPAGE (Rahman
et al. 2020).
2.3.5.2 Modeller
MODELLER is used for homology modeling of protein 3D structures (Webb and Sali
2016; Marti-Renom et al. 2000). The user provides an alignment of the sequence to be
modeled with known related structures, and MODELLER automatically computes a
model containing all non-hydrogen atoms. The tool implements comparative protein
structure modeling by satisfying spatial restraints (Sali and Blundell 1993; Fiser et al.
2000). The program models the backbone using a homology-derived restraint
technique, which relies on multiple sequence alignments to recognize highly conserved
residues between the target and template proteins. Residues conserved from the
template structures are given tight restraints, while less conserved residues, such as
loop residues, are given weaker restraints or none at all (Xiong 2006). Finally, a 3D
model is generated by satisfying all the restraints as well as possible. In addition,
MODELLER can perform several other tasks, such as de novo modeling of loops in
protein structures and optimization of protein structure models with respect to a
flexibly specified objective function. In addition,
tasks such as multiple protein sequence and structure alignment, clustering, sequence
database search, protein structure comparison, etc. can easily be performed (https://
salilab.org/modeller/). The current release of Modeller is 9.24, which was released
on Apr 9th, 2020 and can be accessed through http://bioserv.cbs.cnrs.fr/HTML
BIO/frame mod.html.
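A minimal MODELLER run follows the pattern below, sketched from the tool's standard automodel workflow; the alignment file name, template code, and target name are hypothetical, and the script assumes a licensed MODELLER installation.

```python
# Sketch of a basic comparative-modeling run with MODELLER's automodel class.
# File names and codes are placeholders, not real data from this chapter.
from modeller import *
from modeller.automodel import *

env = environ()
env.io.atom_files_directory = ['.']           # where template PDB files are kept

a = automodel(env,
              alnfile='target_template.ali',  # PIR alignment of target and template
              knowns='templateA',             # template structure code (hypothetical)
              sequence='target')              # target entry name in the alignment
a.starting_model = 1
a.ending_model = 3                            # build three candidate models
a.make()                                      # satisfy spatial restraints and write PDBs
```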
2.3.5.3 JPred
JPred (in operation since 1998) is a server for prediction of the secondary structure of
proteins. The current version is JPred v4 (http://www.compbio.dundee.ac.uk/jpred/).
JPred uses the Jnet algorithm to make more accurate predictions. In addition to protein
secondary structure, JPred also provides estimates of solvent accessibility and of
coiled-coil regions (the Lupas method) (http://www.compbio.dundee.ac.uk/jpred/about.shtml).
The Jnet algorithm of JPred 3 predicts secondary structure (α-helix, β-strand, and coil)
with 81.5% accuracy. From either a single protein sequence or a multiple sequence
alignment, JPred derives alignment profiles from which it predicts secondary structure
and solvent accessibility. The predictions can be presented as colored HTML, plain
text, PostScript, or PDF, and through the highly versatile Jalview alignment editor to
view and work with the results (Cole et al. 2008).
2.3.5.4 3D-Jigsaw
3D-JIGSAW (www.bmm.icnet.uk/servers/3djigsaw/) is an automated framework
that uses homology modeling to construct three-dimensional models of proteins.
3D-JIGSAW runs in two modes, automatic or interactive; the interactive mode allows
the user to select the templates and correct the alignments before submitting the model
(https://bip.weizmann.ac.il/toolbox/structure/3d.html). The database method is the
backbone of its loop modeling. The interactive mode also provides features to modify
alignments and to pick models, loops, and side chains. Conversely, if a submitted protein
sequence shares greater than 40 percent identity with known protein structures, the
automatic mode completes the modeling without human involvement (Xiong 2006).
2.3.5.5 ModBase
MODBASE (database access: http://modbase.compbio.ucsf.edu/modbase-cgi/
index.cgi) is a database of annotated protein structure models. The models are derived
from ModPipe, an automated modeling pipeline based on programs such as PSI-BLAST
and MODELLER. Fold assignments and alignments are also part of the database.
Theoretically calculated models form the core of MODBASE; it does not contain
experimentally determined structures, and because such models may contain significant
errors their quality needs to be assessed. In addition, information about potential ligand
binding sites, single nucleotide polymorphism (SNP) annotation, and interactions among
proteins is central to MODBASE (http://modbase.compbio.ucsf.edu/modbasecgi/display.cgi?
server=modbase&type=general).
Data annotation is very important; without it, the enormous amount of information about
nucleic acid sequences is meaningless, and it is only possible with tools and software
built on good algorithms. Data sharing is one of the most critical aspects of biological
sciences in current times. Bioinformatics is a multidisciplinary field which deals with the
computational analysis and sharing of a variety of biological data. High-throughput
in silico tools and software are the prerequisites for efficient and quick sequence retrieval,
analysis, and structure prediction, thus being indispensable for modern biological research.
References
Abriata LA (2017) Structural database resources for biological macromolecules. Brief Bioinform 18
(4):659–669
Aiyar A (2000) The use of CLUSTAL W and CLUSTAL X for multiple sequence alignment.
Methods Mol Biol 132:221–241
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J
Mol Biol 215(3):403–410
Banerjee A, Santra D, Maiti S (2020) Energetics and IC50 based epitope screening in SARS CoV-2
(COVID 19) spike protein by immunoinformatic analysis implicating for a suitable vaccine
development. J Transl Med 18(1):281
Bashiardes S, Zilberman-Schapira G, Elinav E (2016) Use of metatranscriptomics in microbiome
research. Bioinform Biol Insights 10:19–25
Biasini M, Bienert S, Waterhouse A, Arnold K, Studer G et al (2014) SWISS-MODEL: modelling
protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res 42
(W1):195–201
Blank C, Easterly C, Gruening B, Johnson J, Kolmeder CA, Kumar P, May D, Mehta S, Mesuere B,
Brown Z et al (2018) Disseminating metaproteomic informatics capabilities and knowledge
using the Galaxy-P framework. Proteomes 6:1
Cole C, Barber JD, Barton GJ (2008) The Jpred 3 secondary structure prediction server. Nucleic
Acids Res 36:W197–W201
Conway MJ (2020) Identification of coronavirus sequences in carp cDNA from Wuhan, China. J
Med Virol. https://doi.org/10.1002/jmv.25751
Etzold T, Argos P (1993) SRS-an indexing and retrieval tool for flat file data libraries. Bioinformat-
ics 9(1):49–57
Fassler J, Cooper P (2011) BLAST Glossary: The GenBank Submissions Handbook. National
Center for Biotechnology Information (US), Bethesda (MD). https://www.ncbi.nlm.nih.gov/
books/NBK62051/
Fetchko M, Kitts A (2011) What is BankIt? The GenBank Submissions Handbook. National Center
for Biotechnology Information (US), Bethesda (MD). https://www.ncbi.nlm.nih.gov/books/
NBK63590/
Fiser A, Do RK, Sali A (2000) Modeling of loops in protein structures. Protein Sci 9:1753–1773
He XX, Wang HY, Fu SH et al (2012) Zhonghua shi yan he lin chuang bing du xue za zhi =
Zhonghua shiyan he linchuang bingduxue zazhi. Chin J Exp Clin Virol 26(2):81–83
Higgins DG, Thompson JD, Gibson TJ (1996) Using CLUSTAL for multiple sequence alignments.
Methods Enzymol 266:383–402
Koonin EV, Galperin MY (2003) Sequence - evolution - function: computational approaches in
comparative genomics. Kluwer Academic, Boston. Chapter 4, Principles and Methods of
Sequence Analysis. https://www.ncbi.nlm.nih.gov/books/NBK20261/
Kumar A, Chordia N (2017) Role of bioinformatics in biotechnology. Res Rev Biosci 12(1):116
Kusonmano K, Vongsangnak W, Chumnanpuen P (2016) Informatics for metabolomics. Adv Exp
Med Biol 939:91–115
Lang J, Sun P, Lu S et al (2017) Genotypes of the Hepatitis C virus in infected patients in Hohhot,
China. Bing Du Xue Bao 33(1):61–66
Larkin MA, Blackshields G, Brown NP et al (2007) Clustal W and Clustal X version 2.0.
Bioinformatics 23(21):2947–2948
Luscombe NM, Greenbaum D, Gerstein M (2001) What is bioinformatics? A proposed definition
and overview of the field. Methods Inf Med 40:346–358
Marti-Renom MA, Stuart A, Fiser A, Sánchez R, Melo F, Sali A (2000) Comparative protein
structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29:291–325
Meyer F, Paarmann D, D’Souza M et al (2008) The metagenomics RAST server - a public resource
for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinform 9:386
Mulder NJ, Apweiler R, Attwood TK, Bairich A, Barrell D et al (2003) The InterPro Database, 2003
brings increased coverage and new features. Nucleic Acids Res 31:315–318
Niu SY, Yang J, McDermaid A, Zhao J, Kang Y, Ma Q (2018) Bioinformatics tools for quantitative
and functional metagenome and metatranscriptome data analysis in microbes. Brief Bioinform
19(6):1415–1429
Ostell J (2002, Updated 2014) The Entrez Search and Retrieval System. The NCBI handbook, 2nd
edn. National Center for Biotechnology Information (US), Bethesda (MD). https://www.ncbi.
nlm.nih.gov/books/NBK184582/
Ounit R, Wanamaker S, Close TJ, Lonardi S (2015) CLARK: fast and accurate classification of
metagenomic and genomic sequences using discriminative k-mers. BMC Genom 16(1):236
Pevsner J (2015) Bioinformatics and functional genomics, 3rd edn. Wiley, Chichester
Pichler K, Warner K, Magrane M (2018) UniProt Consortium SPIN: submitting sequences deter-
mined at protein level to UniProt. Curr Protoc Bioinformatics 62(1):e52
Rahman N, Basharat Z, Yousuf M, Castaldo G, Rastrelli L, Khan H (2020) Virtual screening of
natural products against Type II Transmembrane Serine Protease (TMPRSS2), the Priming
Agent of Coronavirus 2 (SARS-CoV-2). Molecules 25(10):2271
Riffle M, May DH, Timmins-Schiffman E, Mikan MP, Jaschob D, Noble WS, Nunn BL (2017)
MetaGOmics: a web-based tool for peptide-centric functional and taxonomic analysis of
metaproteomics data. Proteomes 6:2
Roumpeka D, Wallace RJ, Escalettes F, Fotheringham I, Watson M (2017) A review of bioinfor-
matics tools for bio-prospecting from metagenomic sequence data. Front Genet 8:1664–8021
Sadasivan J, Singh M, Sarma JD (2017) Cytoplasmic tail of coronavirus spike protein has intracel-
lular targeting signal. J Biosci 42(2):231–244
Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J
Mol Biol 234:779–815
Sayers EW, Barrett T et al (2009) Database resources of the National Center for Biotechnology
Information. Nucleic Acids Res 37(Database issue):D5–D15
Schuler GD, Epstein JA, Ohkawa H, Kans JA (1996) Entrez: molecular biology database and
retrieval system. Methods Enzymol 266:141–162
Schwede T, Kopp J, Guex N, Peitsch MC (2003) SWISS-MODEL: an automated protein
homology-modeling server. Nucleic Acids Res 31(13):3381–3385
Seifert J, Muth T (2019) Editorial for special issue: metaproteomics. Proteomes 7(1):9
Silva GG, Cuevas DA, Dutilh BE, Edwards RA (2014) FOCUS: an alignment-free model to
identify organisms in metagenomes using non-negative least squares. PeerJ 2:e425
Silva GG, Green KT, Dutilh BE, Edwards RA (2016) SUPER-FOCUS: a tool for agile functional
analysis of shotgun metagenomic data. Bioinformatics 32(3):354–361
Stoesser G, Baker W, van den Broek A et al (2002) The EMBL Nucleotide Sequence Database.
Nucleic Acids Res 30(1):21–26
Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res 22(22):4673–4680
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL X
windows interface: flexible strategies for multiple sequence alignment aided by quality analysis
tools. Nucleic Acids Res 25:4876–4882
Tramontano A, Leplae R, Morea V (2001) Analysis and assessment of comparative modeling
predictions in CASP4. Proteins 45(Suppl 5):22–38
Tröscher-Mußotter J, Tilocca B, Stefanski V, Seifert J (2019) Analysis of the bacterial and host
proteins along and across the porcine gastrointestinal tract. Proteomes 7:4
Troy CS, MacHugh DE, Bailey JF, Magee DA, Loftus RT et al (2001) Sequence-evolution -
function: computational approaches in comparative genomics. Chapter 4: principles and
methods of sequence analysis. Genetic evidence for Near-Eastern origins of European cattle.
Nature 410:1091
Truong DT, Franzosa EA, Tickle TL et al (2016) MetaPhlAn2 for enhanced metagenomic taxo-
nomic profiling. Nat Methods 13(1):101
Wang YF, Wang GP, Wang LP, Hong N (2014) First report of cucumber mosaic virus in Taro
plants in China. Plant Dis 98(4):574
Webb B, Sali A (2016) Comparative protein structure modeling using modeller. Current protocols
in bioinformatics 54. Wiley, pp 5.6.1–5.6.37
Wilmes P, Bond PL (2004) The application of two-dimensional polyacrylamide gel electrophoresis
and downstream analyses to a mixed community of prokaryotic microorganisms. Environ
Microbiol 6:911–920
Wood DE, Salzberg SL (2014) Kraken: ultrafast metagenomic sequence classification using exact
alignments. Genome Biol 15(3):R46
Xiong J (2006) Essential bioinformatics. Cambridge University Press, New York, United States of
America
Yamamoto H, Tamura T, Isono K et al (1996) SAKURA: a new data submission system of DDBJ to
meet users’ needs in the age of mass production of DNA sequences. Genome Inform 7:204–205
Yang H, Guranovic V, Dutta S, Feng Z, Berman HM, Westbrook JD (2004) Automated and
accurate deposition of structures solved by X-ray diffraction to the Protein Data Bank. Acta
Crystallogr D60:1833–1839
Young JY, Westbrook JD et al (2017) OneDep: unified wwPDB system for deposition, biocuration,
and validation of macromolecular structures in the PDB archive. Structure 25(3):536–545
Online Resources
https://www.ncbi.nlm.nih.gov/Sequin/
http://www.clustal.org/download/clustalx_help.html
https://www.genome.jp/tools-bin/clustalw
https://www.ncbi.nlm.nih.gov/
https://www.ebi.ac.uk/Tools/psa/
https://guides.lib.berkeley.edu/ncbi
https://salilab.org/modeller/
https://www.ebi.ac.uk/training/online/course/pdbe-quick-tour/submitting-data-pdb-and-emdb
https://www.ddbj.nig.ac.jp/getentry-help-e.html#About_getentry
Peter Bickerton (2020) Scientific Communications & Outreach Manager, Earlham Institute. https://
www.earlham.ac.uk/articles/why-bioinformatics-important
Role of Bioinformatics in Biological Sciences
3
Disha Gangotia, Aeshna Gupta, and Indra Mani
Abstract
Keywords
Bioinformatics · Sequence · Structure · Databases · Genomics · Proteomics
D. Gangotia
University College Dublin, Dublin, Ireland
A. Gupta · I. Mani (*)
Department of Microbiology, Gargi College, University of Delhi, New Delhi, India
3.1 Introduction
3.2.1 Genomics
Genomics is the study of whole genomes (an organism’s complete set of DNA). It
allows sequencing, assembling, and analysis of the structure and function of
genomes by using a combination of recombinant DNA technology, DNA sequenc-
ing techniques, and bioinformatics tools and software. It varies from “genetics” as
genetics explores the functioning and structure of the single gene or single gene
product at a time, whereas genomics approaches all genes and their inter-
relationships in order to understand their cumulative impact on the organism’s
growth and development. Genomics captures the availability of whole DNA
sequences for organisms and was made possible by Frederick Sanger’s initial
work (Sanger et al. 1977) as well as the more recent next-generation sequencing
(NGS) technology. Genomics is broadly divided into two main areas: structural
genomics, demonstrating the physical structure of whole genomes; and functional
genomics, depicting the transcriptome (total transcripts) and the proteome (the
complete display of encoded proteins). Schematic representation of different omics
is given in Fig. 3.1 (Virkud et al. 2019). All of these are explained in detail in the
following subdivisions.
Fig. 3.1 Overview of the major “omics” fields (Virkud et al. 2019. Adapted with permission)
(SNPs), gene regulation (e.g. promoter analysis), microarray data analysis (gene
expression studies), SAGE (serial analysis of gene expression; RNA sequencing for
global gene expression studies in a cell), mutations, epigenetics, etc. (Kaushik et al.
2018). A widely used bioinformatics resource, the Database for Annotation, Visual-
ization and Integrated Discovery (DAVID) allows characterization of functional
genes, determines genes which are functionally related, enables gene/protein
identifiers to be converted from one type to the other, and studies gene names in a
set (Dennis et al. 2003; Huang et al. 2007).
An essential component of functional genomics is the human genome project
(HGP) (Collins et al. 2003; Green et al. 2015) revealing that 3164.7 million nucleo-
tide bases with a total of approximately 20,000 genes are found in the human
genome. There are multiple fundamental strategies to functional genomics at differ-
ent stages: genomics and epigenomics (DNA), transcriptomics (RNA), proteomics
(proteins), and metabolomics (metabolites), the details of which are given in the
following sections. Thus, it is expected that a wide-ranging model of the biological
system under study will be provided by the compilation of all these data.
3.2.2 Transcriptomics
The analysis of the transcriptome of an organism, the sum of all its RNA transcripts,
is termed transcriptomics. In the information network, mRNA acts as a transient
intermediate molecule, while noncoding RNAs perform various other functions.
The field of transcriptomics involves two predominant technologies: microarrays,
which quantify the abundance of a given set of transcripts through their
hybridization with a range of complementary probes, and RNA sequencing
(RNA-Seq), which corresponds to the high-throughput sequencing of cDNA
transcripts, where the number of transcript counts is used to estimate transcript abundance.
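As a simple illustration of how transcript counts are turned into abundance estimates, the sketch below computes transcripts-per-million (TPM), one commonly used normalization, for a toy count matrix; the counts and gene lengths are invented purely for illustration.

```python
import numpy as np

# Toy RNA-Seq count matrix: rows = genes, columns = samples (illustrative values only)
counts = np.array([[500, 800],
                   [100, 120],
                   [2000, 1500]], dtype=float)
lengths_kb = np.array([2.0, 0.5, 4.0])     # gene lengths in kilobases

rpk = counts / lengths_kb[:, None]         # reads per kilobase corrects for gene length
tpm = rpk / rpk.sum(axis=0) * 1e6          # scale so each sample sums to one million
print(tpm.round(1))
```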
Large volumes of data are generated by transcriptomic analysis. In order to ensure
their usefulness to the wider scientific community, raw or processed data may be
stored in publicly accessible repositories such as Gene Expression Omnibus (GEO)
(Edgar et al. 2002), ArrayExpress (Kolesnikov et al. 2015), etc.
Assessing an organism’s gene expression patterns in various tissues,
environments, or time intervals provides data on how genes are controlled, about
functions of previously unannotated genes, and demonstrates features of the biology
of an organism (Lowe et al. 2017). A significant application of this field lies in
experimentation in diagnostics and disease profiling. The scope of using RNA-Seq
to diagnose immune-related diseases is rapidly expanding due to its capability to
distinguish populations of immune cells and sequence B-cell and T-cell receptor
repertoires (Proserpio and Mahata 2016; Byron et al. 2016).
3.2.3 Proteomics
The large-scale analysis of the total protein complement of a cell line, tissue, or
organism, i.e. its proteome, is referred to as proteomics (Wasinger et al. 1995;
Wilkins et al. 1995; Anderson and Anderson 1996). Proteomics aims not only to
characterize all proteins in a cell, but also to establish a precise three-dimensional
cell map (3-D) that indicates where proteins are localized. Proteomics is thus
considered to be the most important data set to describe a biological system as
proteins are effectors of biological function, the levels of which depend on the
corresponding levels of mRNA as well as on translational regulation of the host
(Cox and Mann 2007). Expression proteomics refers to the quantitative analysis of
protein expression between variable samples, while structural proteomics involves
identifying and locating all proteins within a protein complex, determining their
structure and analyzing all protein–protein interactions (Graves and Haystead 2002).
Figure 3.2 illustrates the flowchart of various proteomics techniques.
An enormous amount of proteomics data is obtained with the help of high-
throughput technologies. Various bioinformatics tools have been developed for
predicting 3D structures, analyzing protein domains and motifs, interpretation of
mass spectrometry results, etc. Evolutionary relationships can be inferred with the
help of sequence and structure alignment tools (Vihinen 2001; Perez-Riverol et al.
Proteomics-based techniques are used in a variety of research settings.
3.2.4 Metabolomics
3.2.5 Chemoinformatics
and programs for phylogenetic tree construction, molecular phylogeny has become
more popular. This is because molecular data are more abundant and easier to collect
than fossil records and involve no sampling bias, which helps to fill the gaps in the
fossil record; with these data it is possible to construct simpler and more accurate
phylogenetic trees (Horiike 2016).
Based on the sequence similarity of molecules such as DNA, RNA, and proteins, it is
also possible to infer evolutionary relationships between species. These similarities
can be detected by multiple sequence alignment (MSA) through programs like
Clustal W, Clustal X, and homology search (homologues are sequences that have
common ancestry) through tools like BLAST (basic local alignment search tool)
(Altschul et al. 1990). This is followed by use of methods to construct and interpret
phylogenetic trees such as unweighted pair group method with arithmetic
mean (UPGMA), neighbor-joining (NJ), maximum parsimony, maximum likeli-
hood, and Bayesian method (Horiike 2016). Figure 3.3 shows a typical bifurcating
phylogenetic tree that can be constructed for phylogenetic analysis.
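A minimal sketch of this workflow using Biopython is shown below: a small invented alignment is converted into a pairwise distance matrix, from which neighbor-joining and UPGMA trees are built. It assumes Biopython is installed; the sequences are hypothetical.

```python
from io import StringIO
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Tiny illustrative alignment (hypothetical, already aligned sequences of equal length)
aln = AlignIO.read(StringIO(""">seqA
ATGCTACGATCG
>seqB
ATGCTACGTTCG
>seqC
ATGTTACGTTCA
>seqD
TTGTTACGTTCA
"""), "fasta")

dm = DistanceCalculator("identity").get_distance(aln)   # pairwise distance matrix
constructor = DistanceTreeConstructor()
nj_tree = constructor.nj(dm)          # neighbor-joining tree
upgma_tree = constructor.upgma(dm)    # UPGMA tree
Phylo.draw_ascii(nj_tree)
```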
Tree construction methods are classified into two groups. The first is the distance-based
approach, which uses an evolutionary distance matrix; UPGMA and NJ are its
representative methods and are implemented in computational tools such as MEGA7
(Molecular Evolutionary Genetics Analysis version 7), PHYLIP (PHYLogeny Inference
Package), and Clustal X. The advantage of the distance-based approach is its short
calculation time, which allows handling of large amounts of data. The second is the
character-based approach, which uses the aligned sequences directly for tree inference.
Maximum parsimony, maximum likelihood, and Bayesian methods are its representative
methods and are implemented in PHYLIP, MEGA7, PhyML, etc., as listed in Table 3.1
(Horiike 2016).
Table 3.1 Methods for inferring phylogenetic trees (Horiike 2016)

Method | Group | Algorithm | Software
UPGMA | Distance matrix | Clustering for the shortest evolutionary distance | MEGA7
Neighbor-joining | Distance matrix | Clustering for minimum total branch distance | PHYLIP, Clustal X, MEGA7
Maximum parsimony | Character-based | Searching tree with minimum total number of character-state changes | PHYLIP, MEGA7
Maximum likelihood | Character-based | Searching tree with maximum likelihood | PHYLIP, PhyML, RAxML, FastTree, MEGA7, TOPALi v2
Bayesian | Character-based | Searching tree with maximum posterior probability | MrBayes, TOPALi v2
import and translation tools. CellDesigner, a Java-based software for the creation
and editing of biochemical networks, is another extremely useful method (Funahashi
et al. 2003).
Further, bioinformatics applications for research in systems biology use
applications to visualize network architectures and overlay virtual and experimental
data on the network. These tools include yEd graph editor for network editing and
tools such as Cytoscape (Shannon et al. 2003) and Pathway Tools cellular overview
diagram and Omics Viewer (Paley and Karp 2006) for visualization of “omics” data
in the form of biochemical networks. The ability to develop predictive, multi-scale
models helps researchers to identify new disease biomarkers, stratify patients on the
basis of specific genomic profiles, and target drugs and other therapies. Moreover,
systems biology provides the capacity for entirely new modes of assessment and
innovation in biotechnology and computer science.
structure prediction tools for 3D modeling (refer to Chap. 2, Sect. 6). To date, many
prediction methods are very advanced and widely used by biologists, but there are
still some challenges to address. Most of the prediction methods used today (including
template-free methods) rely on already known structural knowledge, which is not ideal.
Further advances in protein structure prediction can be achieved by reducing dependence
on known structures and improving first-principles approaches (Deng et al. 2018).
Fig. 3.4 Schematic representation of the integration of different data sets in bioinformatics
(Ramharack and Soliman 2018. Adapted with permission)
to manufacture more and more drugs at low risk in a short period of time has led to a
remarkable interest in bioinformatics (Ortega et al. 2012).
Computer-aided drug design (CADD) (Song et al. 2009; Cordeiro and Speck-Planche 2012)
is a distinct field dedicated to bioinformatics-assisted drug design. One of the key drivers
of current bioinformatics approaches is the prediction and detection of biologically active
candidates (Whittaker 2003), as well as the mining and storage of relevant information.
Bioinformatics-based mining and storage of the human genome sequence has made it
possible to identify and categorize the nucleotide compositions of the genes responsible
for coding target proteins, in addition to identifying new targets that offer more potential
for new drugs (Chen and Chen 2008; Katara et al. 2011). Another area where bioinformatics
plays an important role is target validation, as this helps to reduce the potential for failure
in the clinical testing and approval phases (Ratti and Trist 2001; Gilbert et al. 2003;
Whittaker 2003). It also helps to ensure that more drug candidates are successful during
the approval procedure, making the process more cost-effective (Ortega et al. 2012).
Moreover, bioinformatics will serve as a suitable interface to provide pharmaceutical
companies with new methods and opportunities to discover potential drug targets
effectively and produce novel drugs (Whittaker 2003). Figure 3.4 shows the integration
of biological data sets for a better understanding of diseases in order to design more
effective therapeutics (Ramharack and Soliman 2018).
Mathematical modeling is used to explain intra- and extracellular interactions and how
they affect cell metabolism. In the investigation and simulation of phenotypes, this
is computed as metabolite concentrations and reaction fluxes over metabolic
pathways, regulated by enzymes under different intrinsic and extrinsic conditions.
It involves translating cellular processes into mathematical problems with formal
representations at an increased level of exactness and detail, since the objective is to
capture the complexity and completeness of the behavior of a metabolic network.
Two types of approaches to the numerical representation of biological processes are
used, depending on the situation in which the mechanism is analyzed. Stationary
(steady-state) modeling considers the system operating at an equilibrium point where,
after some time, the metabolite concentrations remain constant. Dynamic modeling,
in contrast, describes the evolution of metabolite concentrations over time (Osvaldo
et al. 2018).
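For dynamic modeling, a minimal sketch of a toy two-metabolite system is shown below: a constant substrate feed is converted by a Michaelis–Menten enzyme into a product that is exported, and the concentrations are integrated over time with SciPy. All parameter values are invented for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

def toy_pathway(t, y, feed=1.0, vmax=2.0, km=0.5, k_out=0.8):
    s, p = y                                   # substrate and product concentrations
    rate = vmax * s / (km + s)                 # Michaelis-Menten conversion of S to P
    ds = feed - rate                           # constant feed minus enzymatic consumption
    dp = rate - k_out * p                      # production minus first-order export
    return [ds, dp]

sol = solve_ivp(toy_pathway, t_span=(0, 20), y0=[0.0, 0.0], dense_output=True)
print("Concentrations at t = 20:", sol.y[:, -1].round(3))   # the system approaches a steady state
```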
Constructing models that make observable predictions of cell states over time is
one of the most difficult tasks. This problem is currently addressed by new in silico
methods, such as the reconstruction of dynamic models, the use of phenotype
prediction approaches, and the design of pathways through effective algorithms for
strain optimization (Osvaldo et al. 2018). In biomedical science, systems biology
and bioinformatics methods also help in analyzing related data and properties
(e.g. genome sequences) to enable discoveries driven by modeling. This has also
facilitated the development of genome-scale networks, the simulation of complex
biological systems in silico, and the understanding of how metabolic flux
distributions shift within a specific biological network to predict cellular phenotypes
(McCloskey et al. 2013).
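A minimal flux-balance sketch for a toy three-reaction network is shown below: the steady-state mass balance (S·v = 0) is imposed and the export flux is maximized with SciPy's linear programming routine. The stoichiometry and flux bounds are invented for illustration and are not taken from any cited model.

```python
import numpy as np
from scipy.optimize import linprog

# Toy network: v1 imports metabolite A, v2 converts A -> B, v3 exports B.
# Rows of S = metabolites (A, B); columns = reactions (v1, v2, v3).
S = np.array([[1, -1, 0],
              [0, 1, -1]], dtype=float)

c = [0, 0, -1]                               # maximize v3 (linprog minimizes, hence the sign)
bounds = [(0, 10), (0, 1000), (0, 1000)]     # uptake flux v1 is capped at 10 units

res = linprog(c, A_eq=S, b_eq=[0, 0], bounds=bounds, method="highs")
print("Optimal flux distribution:", res.x.round(2))   # expected: [10, 10, 10]
```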
Furthermore, mathematical modeling of cellular metabolism, considered under many
environmental and genetic conditions, supports the tasks of metabolic engineering
(ME), involving the design of appropriate strains, the selection of optimal gene deletions,
or the regulation of expression for the overproduction of target compounds
(Stephanopoulos et al. 1998; Burgard et al. 2003), thus finding applications in the
industrial sector as well.
prerequisites for quick sequence retrieval, efficient genome analysis, structure pre-
diction, protein–protein interactions, rational drug designing, and phylogenetic
analysis, thus having the potential to generate a complete picture of an organism.
This has opened wide avenues of research in biological sciences, as highlighted
throughout the chapter. Therefore, in silico technology has revolutionized the way
we look at basic scientific research, thereby opening doors for a variety of
applications.
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J
Mol Biol 215(3):403–410
Anderson NG, Anderson NL (1996) Twenty years of two-dimensional electrophoresis: past, present
and future. Electrophoresis 17:443–453
Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181
(4096):223–230
Aslam B, Basit M, Nisar MA, Khurshid M, Rasool MH (2017) Proteomics: technologies and their
applications. J Chromatogr Sci 55(2):182–119
Bajic VB, Seah SH, Chong A, Zhang G, Koh JL, Brusic V (2002) Dragon Promoter Finder:
recognition of vertebrate RNA polymerase II promoters. Bioinformatics 18:198–199
Basuri TS, Meman AS (2011) Role of bioinformatics, cheminformatics and proteomic in biomarker
identification and drug target validation in drug delivery processes. IJPSR 2(10):2521–2533
Bayat A (2002) Science, medicine, and the future: Bioinformatics. BMJ (Clin Res Ed) 324
(7344):1018–1022
Bell MJ, Gillespie CS, Swan D, Lord P (2012) An approach to describing and analysing bulk
biological annotation quality: a case study using UniProtKB. Bioinformatics 28(18):i562–i568
Boeckmann B et al (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL
in 2003. Nucleic Acids Res 31:365–370
Burgard AP, Pharkya P, Maranas CD (2003) OptKnock: a bilevel programming framework for
identifying gene knockout strategies for microbial strain optimization. Biotechnol Bioeng
84:647–657
Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol
Biol 268:78–94
Byron SA, Van Keuren-Jensen KR, Engelthaler DM, Carpten JD, Craig DW (2016) Translating
RNA sequencing into clinical diagnostics: opportunities and challenges. Nat Rev Genet
17:257–271
Carbonell P, Currin A, Jervis AJ, Rattray NJ, Swainston N, Yan C, Takano E, Breitling R (2016)
Bioinformatics for the synthetic biology of natural products: integrating across the Design-
Build-Test cycle. Nat Prod Rep 33(8):925–932
Chen YP, Chen F (2008) Identifying targets for drug discovery using bioinformatics. Expert Opin
Ther Targ 12:383–389
Clish CB (2015) Metabolomics: an emerging but powerful tool for precision medicine. Cold Spring
Harb Mol Case Stud 1(1):a000588
Collins FS, Morgan M, Patrinos A (2003) The Human Genome Project: lessons from large-scale
biology. Science 300:286–290
Cordeiro MN, Speck-Planche A (2012) Computer-aided drug design, synthesis and evaluation of
new anti-cancer drugs. Curr Top Med Chem 12(24):2703–2704
Cox J, Mann M (2007) Is proteomics the new genomics? Cell 130(3):395–398
Cozzolino SMF, Cominetti C (2013) Biochemical and physiological bases of nutrition in different
stages of life in health and disease, 1st edn. Monole, São Paulo, Brazil
Cruz IBM, Taufer M, Schwanke CHA (2003) Genomics in the era of aging and its potential
application in gerontology and geriatrics. In: Souza ACA (ed) Institute of Geriatrics and
Gerontology PUCRS: the cradle of academic geriatrics in Brazil, 1st edn., pp 83–84
Dalmiel L, Vargas T, Molina AR (2012) Nutritional genomics for the characterization of the effect
of bioactive molecules in lipid metabolism and related pathways. Electrophoresis 15:2266–2289
Dauncey MJ (2012) Recent advances in nutrition, genes and brain health. Proc Nutr Soc 71
(4):581–591
Deng H, Jia Y, Zhang Y (2018) Protein structure prediction. Int J Mod Phys B 32(18):1840009
Dennis G Jr, Sherman BT, Hosack DA et al (2003) DAVID: database for annotation, visualization,
and integrated discovery. Genome Biol 4:P3
Desany B, Zhang Z (2004) Bioinformatics and cancer target discovery. DDT 9:18
Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: NCBI gene expression and
hybridization array data repository. Nucleic Acids Res 30:207–210
Feist AM, Henry CS, Reed JL et al (2007) A genome-scale metabolic reconstruction for Escherichia
coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Mol Syst
Biol (3): article 121
Fernandez-Suarez XM, Rigden DJ, Galperin MY (2014) The 2014 Nucleic Acids Research
Database Issue and an updated NAR online Molecular Biology Database Collection. Nucleic
Acids Res 42:D1–D6
Fialho E, Moreno FS, Ong TPP (2008) Nutrition in the post-genomics: fundamentals and
applications of omics tools. J Nutr 21(6):757–766
Funahashi A, Tanimura N, Morohashi M, Kitano H (2003) CellDesigner: a process diagram editor
for gene-regulatory and biochemical networks. BIOSILICO 1(1):159–162
Gerlt JA, Allen KN, Almo SC, Armstrong RN, Babbitt PC, Cronan JE, Dunaway-Mariano D, Imker
HJ, Jacobson MP, Minor W, Poulter CD, Raushel FM, Sali A, Shoichet BK, Sweedler JV
(2011) The enzyme function initiative. Biochemistry 50:9950–9962
Gianchandani EP, Brautigan DL, Papin JA (2006) Systems analyses characterize integrated
functions of biochemical networks. Trends Biochem Sci 31(5):284–291
Gilbert J, Henske P, Singh A (2003) Rebuilding big pharma’s business model. In vivo Bus Med Rep
21:10
Graves PR, Haystead TA (2002) Molecular biologist’s guide to proteomics. Microbiol Mol Biol
Rev 66(1):39–63
Green ED, Watson JD, Collins FS (2015) Human Genome Project: twenty-five years of big biology.
Nature 526:29–31
Grüning NM, Lehrach H, Ralser M (2010) Regulatory crosstalk of the metabolic network. Trends
Biochem Sci 35(4):220–227
Horiike T (2016) An introduction to molecular phylogenetic analysis. J Agric Sci 4:36–45
Huang DW, Sherman BT, Tan Q et al (2007) The DAVID Gene Functional Classification Tool: a
novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol
8:R183
Huang YJP, Mao B, Aramini JM, Monteliono GT (2014) Assessment of template-based protein
structure predictions in CASP10. Proteins 82(S2):43–56
Hucka M, Finney A, Bornstein BJ, Keating SM, Shapiro BE, Matthews J, Kovitz BL, Schilstra MJ,
Funahashi A, Doyle JC, Kitano H (2004) Evolving a lingua franca and associated software
infrastructure for computational systems biology: the Systems Biology Markup Language
(SBML) project. Syst Biol (Stevenage) 1(1):41–53
Hughes JP, Rees S, Kalindjian SB, Philpott KL (2011) Principles of early drug discovery. Br J
Pharmacol 162:1239–1249
Ilyin SE, Pinhasov A, Vaidya AH, Amato FA, Kauffman J, Xin H, Gordon PA (2003) Emerging
paradigms in applied bioinformatics. Biosilico 1:3
Jorgensen WL (2004) The many roles of computation in drug discovery. Science 303:1813–1818
Junqueira DM, Braun RL, Verli H (2014) Alinhamentos. In: Verli H (ed) Bioinformática da
biologia à flexibilidade molecular. SBBq, São Paulo, pp 38–61
Katara P, Grover A, Kuntal H, Sharma V (2011) In silico prediction of drug targets in Vibrio
cholerae. Protoplasma 248:799–804
Kaushik S, Kaushik S, Sharma D (2018) Functional genomics. Encyclopedia of Bioinformatics and
Computational Biology 2:118–133
Kodama Y, Shumway M, Leinonen R (2012) International Nucleotide Sequence Database C. The
Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res 40:D54–D56
Kolesnikov N, Hastings E, Keays M, Melnichuk O, Tang YA, Williams E et al (2015)
ArrayExpress updated simplifying data submissions. Nucleic Acids Res 43:D1113–D1116
Kuehnbaum NL, Britz-McKibbin P (2013) New advances in separation science for metabolomics:
resolving chemical diversity in a post-genomic era. Chem Rev 113:2437–2468
Kumar P, Periyasamy R, Das S, Neerukonda S, Mani I, Pandey KN (2014) All-trans retinoic acid
and sodium butyrate enhance natriuretic peptide receptor a gene transcription: role of histone
modification. Mol Pharmacol 85(6):946–957
Kusonmano K, Vongsangnak W, Chumnanpuen P (2016) Informatics for metabolomics. Adv Exp
Med Biol 939:91–115
Lawrance SK, Smith CL, Srivastava R, Cantor CR, Weissman SM (1987) Megabase-scale mapping
of the HLA gene complex by pulsed field gel electrophoresis. Science 235(4794):1387–1390
Liu B, Qian SB (2011) Translational regulation in nutrigenomics. Am Soc Nutr 2:511–519
Lowe R, Shirley N, Bleackley M, Dolan S, Shafee T (2017) Transcriptomics technologies. PLoS
Comput Biol 13(5):e1005457
Mahan LK, Stump SS (2005) Food, nutrition & diet therapy, 6th edn. Roca, Sao Paulo, Brazil
Manohar P, Shailendra S (2012) Protein sequence alignment: a review. World Appl Program
2:141–145
Maschio T, Kowalski T (2001) Bioinformatics – a patenting view. Trends Biotechnol 19:9
McCloskey D, Palsson BØ, Feist AM (2013) Basic and applied uses of genome-scale metabolic
network reconstructions of Escherichia coli. Mol Syst Biol 9:661
Molidor R, Sturn A, Maurer M, Trajanoski Z (2003) New trends in bioinformatics: from genome
sequence to personalized medicine. Exp Gerontol 38:1031–1036
Moura M, Finkle J, Stainbook S, Greene J, Broadbelt LJ, Tyo KE (2016) Evaluating enzymatic
synthesis of small molecule drugs. J Metab Eng 33:138–147
Ortega SS, Cara LC, Salvador MK (2012) In silico pharmacology for a multidisciplinary drug
discovery process. Drug Metabol Drug Interact 27:199–127
Osvaldo KD, Miguel R, Paulo M (2018) A review of dynamic modeling approaches and their
application in computational strain optimization for metabolic engineering. Front Microbiol
9:1690
Paley SM, Karp PD (2006) The pathway tools cellular overview diagram and omics viewer. Nucleic
Acids Res 34(13):3771–3778
Perez-Riverol Y, Alpi E, Wang R, Hermjakob H, Vizcaino JA (2015) Making proteomics data
accessible and reusable: current state of proteomics databases and repositories. Proteomics 15
(5–6):930–949
Prestridge DS (1995) Predicting Pol II promoter sequences using transcription factor binding sites. J
Mol Biol 249:923–932
Prjibelski AD, Korobeynikov AI, Lapidus AL (2019) Encyclopedia of bioinformatics and compu-
tational biology 3:292–322
Proserpio V, Mahata B (2016) Single-cell technologies to study the immune system. Immunology
147:133–140
Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O et al (2014) RefSeq:
an update on mammalian reference sequences. Nucleic Acids Res 42:D756–D763
Raman K, Chandra N (2010) Systems biology. Resonance:131–153
Ramharack P, Soliman MES (2018) Bioinformatics-based tools in drug discovery: the cartography
from single gene to integrative biological networks. Drug Discov Today 23(9):1658–1665
Ratti E, Trist D (2001) Continuing evolution of the drug discovery process. Pure Appl Chem
73:67–75
Romano JD, Tatonetti NP (2019) Informatics and computational methods in natural product drug
discovery: a review and perspectives. Front Genet 10:368
Sales NM, Pelegrini PB, Goersch MC (2014) Nutrigenomics: definitions and advances of this new
science. J Nutr Metab 202759
Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc
Natl Acad Sci U S A 74(12):5463–5467
Sasaki AA, Fernandes GF, Rodrigues AM, Lima FM, Marini MM, Dos S Feitosa L, de Melo
Teixeira M, Felipe MS, da Silveira JF, de Camargo ZP (2014) Chromosomal polymorphism in
the Sporothrix schenckii complex. PLoS One 9(1):e86819
Scherf M, Klingenhoff A, Werner T (2000) Highly specific localization of promoter regions in large
genomic sequences by PromoterInspector: a novel context analysis approach. J Mol Biol
297:599–606
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowskis B, Ideker
T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction
networks. Genome Res 13:2498–2504
Smith AAT, Belda E, Viari A, Medigue C, Vallenet D (2012) The CanOE strategy: integrating
genomic and metabolic contexts across multiple prokaryote genomes to find candidate genes for
orphan enzymes. PLoS Comput Biol 8:1–12
Song CM, Lim SJ, Tong JC (2009) Recent advances in computer-aided drug design. Brief
Bioinform 10:579–591
Stanke M, Steinkamp R, Waack S, Morgenstern B (2004) AUGUSTUS: a web server for gene
finding in eukaryotes. Nucleic Acids Res 32:W309–W312
Stephanopoulos GN, Aristidou AA, Nielsen J (1998) Metabolic engineering: principles and
methodologies. Academic Press, San Diego, CA
Strassberger V, Fugmann T, Neri D, Roesli C (2010) Chemical proteomic and bioinformatic
strategies for the identification and quantification of vascular antigens in cancer. J Proteomics
73:1954–1973
Subramanian U, Kumar P, Mani I, Chen D, Kessler I, Periyasamy R, Raghavaraju G, Pandey KN
(2016) Retinoic acid and sodium butyrate suppress the cardiac expression of hypertrophic
markers and proinflammatory mediators in Npr1 gene-disrupted haplotype mice. Physiol
Genomics 48(7):477–490
Teufel A, Krupp M, Weinmann A, Galle PR (2006) Current bioinformatics tools in genomic
biomedical research (Review). Int J Mol Med 17(6):967–973
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL X
windows interface: flexible strategies for multiple sequence alignment aided by quality analysis
tools. Nucleic Acids Res 25:4876–4882
Troy CS, MacHugh DE, Bailey JF, Magee DA, Loftus RT, Cunningham P, Chamberlain AT, Sykes
BC, Bradley DG (2001) Sequence-evolution - function: computational approaches in compara-
tive genomics. Chapter 4: principles and methods of sequence analysis. Genetic evidence for
near-eastern origins of European cattle. Nature 410:1091
Tsoka S, Ouzounis CA (2000) Recent developments and future directions in computational
genomics. FEBS Lett 480:42–48
Vihinen M (2001) Bioinformatics in proteomics. Biomol Eng 18(5):241–248
Virkud YV, Kelly RS, Wood C, Lasky-Su JA (2019) The nuts and bolts of omics for the clinical
allergist. Ann Allergy Asthma Immunol 123(6):558–563
Wang X, Liotta L (2011) Clinical bioinformatics: a new emerging science. J Clin Bioinform 1
(1):1–10
Wanjek C (2011) Systems biology as defined by NIH an intellectual resource for integrative
biology. NIH Catalyst 19:6
Wasinger VC, Cordwell SJ, Cerpa-Poljak A, Yan JX, Gooley AA, Wilkins MR, Duncan MW,
Harris R, Williams KL, Humphery- Smith I (1995) Progress with gene-product mapping of the
Mollicutes: Mycoplasma genitalium. Electrophoresis 16:1090–1094
Weber T, Blin K, Duddela S, Krug D, Kim HU, Bruccoleri R, Lee SY, Fischbach MA, Müller R,
Wohlleben W, Breitling R, Takano E, Medema MH (2015) AntiSMASH 3.0-a comprehensive
resource for the genome mining of biosynthetic gene clusters. Nucleic Acids Res 43:W237–
W243
Whittaker P (2003) What is the relevance of Bioinformatics to pharmacology? Trend Pharmacol Sci
24:434–439
Wilkins MR, Sanchez JC, Gooley AA, Appel RD, Humphery-Smith I, Hochstrasser DF, Williams
KL (1995) Progress with proteome projects: why all proteins expressed by a genome should be
identified and how to do it. Biotechnol Genet Eng Rev 13:19–50
Zhao S, Kumar R, Sakai A, Vetting MW, Wood BM, Brown S, Bonanno JB, Hillerich BS, Seidel
RD, Babbitt PC, Almo SC, Sweedler JV, Gerlt JA, Cronan JE, Jacobson MP (2013) Discovery
of new enzymes and metabolic pathways by using structure and genome context. Nature
502:698–702
Zou D, Ma L, Yu J, Zhang Z (2015) Biological databases for human research. Genom Proteom
Bioinf 13(1):55–63
Online Resources
https://www.ebi.ac.uk/training/online/course/genomics-introduction-ebi-resources/what-genomics
https://www.who.int/genomics/geneticsVSgenomics/en/
http://www.scfbio-iitd.res.in/tutorial/genomeanalysis.html
https://medical-dictionary.thefreedictionary.com/Ab+Initio+Gene+Prediction
https://isbscience.org/about/what-is-systems-biology/
https://archive.bio.org/articles/synthetic-biology-explained
https://www.jli.edu.in/blog/applications-of-clinical-bioinformatics/
https://www.ddw-online.com/informatics/p323016-application-of-bioinformatics-in-support-of-
precision-medicine.html
https://app.biorender.com/community/gallery/s-5eceae0e8b256400b59142f6-what-are-omics-
sciences
https://biocyc.org/
Protein Analysis: From Sequence
to Structure 4
Jaykumar Jani and Anju Pappachan
Abstract
Proteins are primary molecules that control most of the cellular processes. The
sequence of a protein is linked to its structure which in turn is linked to its
function. Understanding and integrating protein sequence, structure, and function
information is necessary to address many challenging areas of Biology including
protein engineering, structural biology, and drug discovery. Bioinformatics deals
with protein sequences, structures, predictions, and analysis. Accessibility of
these data and availability of high-throughput analysis tools will supplement
experimental work in order to understand proteins better. Prediction of three-
dimensional structures of proteins and studying the structural features are very
necessary to understand various diseases and aid in disease diagnosis and drug
discovery. In this chapter we discuss about various databases and in silico tools
and methods related to protein sequence and structure analysis.
Keywords
Sequence · Protein structure prediction · Protein analysis · In silico analysis ·
Protein database · Homology modelling
4.1 Introduction
Proteins are the key players that control almost all activities which sustain living
organisms. Even though the genome of an organism consists of information for
survival, proteins are the versatile macromolecules that regulate virtually all life
Fig. 4.1 Flow chart on protein sequence analysis to structure prediction and analysis
Proteins can be classified into different groups based on their structure, chemical nature,
and biological role. The complex structural details of a protein can be studied at the
primary, secondary, tertiary, and quaternary levels of structural organization (Nelson et al.
2008).
The linear linking of amino acids with each other as a chain via peptide bonds
represents the primary structure of a protein. The polypeptide chain has an
N-terminus and a C-terminus based on the presence of a free amino or carboxyl group,
respectively. The peptide bond is planar and rigid (non-rotatable) in nature because it
partially shares two pairs of electrons (partial double-bond character), whereas the
N-Cα and Cα-C bonds have some freedom to rotate (−180° to +180°), which helps
proteins to acquire a three-dimensional structure (Nelson et al. 2008).
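The rotations about the N-Cα and Cα-C bonds are the phi and psi dihedral (torsion) angles, each defined by four consecutive backbone atoms. The sketch below computes such an angle from four 3-D coordinates; the coordinates shown are invented and not taken from a real structure.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees for four points, e.g. C(i-1)-N-CA-C (phi) or N-CA-C-N(i+1) (psi)."""
    b0, b1, b2 = p0 - p1, p2 - p1, p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - np.dot(b0, b1) * b1      # component of b0 perpendicular to the central bond
    w = b2 - np.dot(b2, b1) * b1      # component of b2 perpendicular to the central bond
    return np.degrees(np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w)))

pts = [np.array(p, dtype=float) for p in
       [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (1.5, 0.0, 0.0), (2.5, 0.0, 1.0)]]
print(round(dihedral(*pts), 1))       # a value in the -180 to +180 degree range
```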
4.2.2.4 β-Turns
β-turns are irregular in shape and length; they connect two β-strands and help the
polypeptide chain to change direction. β-turns are also known as reverse turns.
They are usually found on the surface of the protein, which enables them to interact
with other proteins and molecules (Venkatachalam 1968).
Consideration of the various domains, motifs, and folds becomes very important when
predicting the structure and function of a protein, as they have evolutionarily
conserved sequences and are often found in the active site of the protein, which is
responsible for catalysis (Nelson et al. 2008).
4.2.5.1 Domain
A domain is a conserved part of the polypeptide chain that can form its three-
dimensional structure independently of the other domains in the protein. It can even
execute its function independently of the rest of the protein. A single protein may have
more than one domain. Chimeric proteins with a desired activity can be generated
through protein engineering utilizing domain swapping (Nelson et al. 2008).
4.2.5.2 Motifs
Motifs are conserved sequences of amino acids found among proteins having similar
catalytic activity. One motif may contain more than one secondary structure element,
e.g. the helix-turn-helix motif (Nelson et al. 2008).
4.2.5.3 Fold
Folds are similar to motifs and represent general protein architecture. Proteins with
the same folds show the same combinations of secondary structure (Nelson et al.
2008).
This database groups proteins based on topology, homology, class, and architecture.
The topology level clusters proteins based on their overall shape and secondary
structure. Homology-based classification groups proteins by their sequence identity
along with the protein domain similarity shared with a common ancestor. The class of
a protein is mainly determined by its secondary structure content and fold pattern and
includes all-α, all-β, α-β, etc. The architecture of a protein represents the overall
structure and shape generated by different secondary structure organizations; the
architecture level therefore groups proteins based on the arrangement of their secondary
structures in three-dimensional space (Ghoorah et al. 2015; Orengo et al.
1997).
domains at the lowest level. Set of domains are classified into families of
homologues. Families that share common structure and function are grouped into
superfamilies. Superfamilies that share a common folding topology are grouped as
folds. Each fold group may belong to one of the general classes—α, β, α + β, and
small proteins which often have minimal secondary structures. This database
classifies protein based on Family, Superfamily, fold, IUPR (Intrinsically Unstruc-
tured Protein Region), Classes, and protein type (Ghoorah et al. 2015; Murzin et al.
1995; Andreeva et al. 2020).
There are various sequence, structure, and composite databases which provide
different information regarding proteins. Sequence databases provide protein
sequence information, structure databases like PDB provide three-dimensional
structural information about proteins, and composite databases integrate information
from various primary databases. Different composite databases use different
algorithms and criteria to yield diverse information on proteins (Chen et al. 2017).
Table 4.1 gives a list of commonly used protein databases.
Later, in 2002, PIR, along with its international partners, created a single worldwide
database, UniProt, by combining PIR-PSD, Swiss-Prot, and TrEMBL (Wu et al.
2003).
4.4.2.1 PDB
The Protein Data Bank is a worldwide repository of macromolecular structures
experimentally determined by X-ray crystallography, NMR spectroscopy, and cryo-EM.
Initially, the database was created in 1971 as a joint project of the Cambridge
Crystallographic Data Centre, UK, and Brookhaven National Laboratory, USA.
In 2003 the database became an international organization. There are now four
members, PDBj, PDBe, the Research Collaboratory for Structural Bioinformatics
(RCSB), and the Biological Magnetic Resonance Data Bank (BMRB), which deal
with data deposition, data processing, and distribution. The information submitted
to the database is reviewed manually and computationally for its authenticity.
Each submitted structure is given a unique four-character accession ID called a PDB ID.
The database can be searched by protein name, PDB ID, author name, deposition date,
etc. PDB also contains information regarding protein secondary structure, experimental
procedure, experimental data, and ligands. The protein structure coordinate file can be
downloaded as a .pdb file and visualized using structure visualization software such as
Pymol, VMD, Rasmol, etc. The main purpose of the database is to provide structural
information on biologically important macromolecules. Further, some secondary and
curated databases utilize information from PDB to predict protein structure (Berman 2008).
4.4.3.1 Swiss-Prot
Swiss-Prot was designed collectively by EMBL (European Molecular Biology Laboratory)
and the Department of Medical Biochemistry at the University of Geneva. In 2002,
Swiss-Prot became part of the UniProt Knowledgebase (UniProtKB), with supplementary
information from TrEMBL and the PIR protein database. Today, UniProtKB provides
detailed information about protein function, structure, post-translational modifications,
etc., with minimum redundancy (Bairoch and Apweiler 2000).
4.4.3.2 PROSITE
PROSITE is a secondary database that contains information about conserved motifs
of proteins which relate to their biological functions. Multiple sequence alignment
(MSA) is performed by the database to provide information related to the query
sequence. When a search is made for a new protein sequence in the database, it
gives two types of information. First, it gives information about sequence patterns
and lists other proteins with the same pattern. Second, it gives details about the
protein family and its denoted biological role (Hulo et al. 2006).
4.4.3.3 PRINT
This database classifies proteins into different families based on protein fingerprints.
Fingerprints are groups of small conserved motifs identified by sequence alignment.
The motifs are not necessarily present as a contiguous sequence, but they may come
together in 3D space upon protein folding, defining the active site or interacting
site of the protein. Thus the study of fingerprints represents protein fold and function
better than a single motif (Attwood et al. 2000).
4.4.3.5 Pfam
Pfam is a protein family database. Entries in Pfam are classified as families, domains,
repeats, and motifs. A search can be made using a protein sequence, domain, keyword,
or taxonomy. As a result, it provides Pfam annotations for domain architecture,
sequence alignment, interactions with other proteins, and protein structures in PDB
(Finn et al. 2014).
The sequence of a protein determines its structure and function. A
thorough analysis of the protein sequence will throw light on its biological role,
active site, stability, post-translational modification sites, regulatory elements, etc.
Today there are several databases and tools available which predict protein features
based on sequence composition.
4.5.1.1 Clustal
Clustal includes a series of programs commonly used in bioinformatics for sequence
alignment. The program was originally developed in 1988 and is managed by
EMBL-EBI. There are many versions of Clustal based on successive upgrades of the
algorithm; Clustal Omega is the current standard version. All versions of Clustal
perform multiple sequence alignment from a series of pairwise alignments and assess
them using scores based on a scoring matrix. These scores are used by the algorithm
as distance measurements reflecting the evolutionary distance between sequences,
and the tool can build a phylogenetic tree using the neighbour-joining approach
(Chenna et al. 2003).
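In practice, Clustal Omega is often run from the command line or through a script; the sketch below calls a locally installed clustalo executable from Python. The executable being on the PATH and the input/output file names are assumptions made for illustration.

```python
import subprocess

# Align the sequences in 'unaligned.fasta' and also write the guide tree.
# File names are placeholders; 'clustalo' must be installed and on the PATH.
subprocess.run(
    ["clustalo",
     "-i", "unaligned.fasta",          # input sequences (FASTA)
     "-o", "aligned.aln",              # output alignment
     "--outfmt=clustal",               # Clustal output format
     "--guidetree-out=guide.dnd",      # save the guide tree used for progressive alignment
     "--force"],                       # overwrite existing output files
    check=True,
)
```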
BLAST
BLAST is a fast, accurate, and widely used method for finding sequence similarity
between a query sequence and the sequences available in databases. The query
sequence is searched against a specified database, and a report of those proteins in
the database that are related to the query sequence is produced. BLAST provides
different options for standard and specialized data mining. Standard BLAST includes
BLASTP (protein query against a protein database), BLASTN (nucleotide query
against a nucleotide database), and TBLASTN (protein query against a translated
nucleotide database).
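A remote BLASTP search can also be scripted; the sketch below uses Biopython's NCBIWWW.qblast interface against the nr database. It requires internet access, may take several minutes, and the short query sequence is only an illustrative placeholder.

```python
from Bio.Blast import NCBIWWW, NCBIXML

query = "MKWVTFISLLFLFSSAYS"                       # hypothetical short protein query
result_handle = NCBIWWW.qblast("blastp", "nr", query, hitlist_size=5)

record = NCBIXML.read(result_handle)               # parse the XML report
for aln in record.alignments:
    hsp = aln.hsps[0]                              # best-scoring local alignment for this hit
    print(aln.title[:60], "E-value:", hsp.expect)
```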
4.5.2.1 ProtParam
ExPASy is a bioinformatics portal which provides access to various databases and tools
in the field of life sciences, such as proteomics, genomics, transcriptomics, population
genetics, etc. The portal is operated by the Swiss Institute of Bioinformatics (SIB).
ProtParam is one of the many tools available on the ExPASy server; it calculates various
parameters of a protein, which are given below (Gasteiger et al. 2005).
Molecular Weight
The molecular weight of a protein is calculated by adding the average isotopic mass
of each amino acid in the sequence.
Theoretical pI
The isoelectric point (pI) of a protein depends on the pKa values of its amino acids.
The pKa value depends on the side chain of each amino acid. However, the pH of
the solution in which the protein is present, relative to its pI, significantly affects the
protein's net charge and solubility.
Half-life
This is the predicted time required for half of the protein to degrade after its synthesis
in the cellular system.
Instability Index
This parameter represents the stability of a given amino acid sequence in the test
tube. If the value is lower than 40 it is considered stable and if the value is greater
than 40 it is considered as unstable.
Extinction Coefficient
The extinction coefficient represents the absorbance of light by a given medium at a
particular wavelength. Experimentally, this value can be determined using a reference
of known amino acid composition. Computationally, it is predicted by analysing the
number of aromatic amino acids (and cystine residues) in a given amino acid sequence.
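Several of these ProtParam parameters can also be computed locally with Biopython's ProtParam module, as in the sketch below; the sequence is an arbitrary example and the values may differ slightly from those reported by the ExPASy server.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"  # arbitrary example
pa = ProteinAnalysis(seq)

print("Molecular weight (Da):", round(pa.molecular_weight(), 1))
print("Theoretical pI:", round(pa.isoelectric_point(), 2))
print("Instability index:", round(pa.instability_index(), 2))          # > 40 suggests instability
print("Extinction coefficients:", pa.molar_extinction_coefficient())   # (reduced Cys, cystines) at 280 nm
print("GRAVY hydropathy:", round(pa.gravy(), 3))
```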
4.5.2.2 Protein–Sol
Protein–Sol is an online open-access tool (http://protein-sol.manchester.ac.uk). This
tool predicts the solubility of a given amino acid sequence. Its algorithm calculates
35 sequence features, which include twenty amino acid composition scores, seven
other composite scores, protein length, folding propensity, disorder propensity,
β-strand propensity, Kyte–Doolittle hydropathy, pI, sequence entropy, and absolute
charge. If the predicted solubility score is >0.45, the protein is predicted to be soluble;
if the value is <0.45, the predicted solubility is lower (Hebditch et al. 2017).
Local secondary structure can be predicted from the information in the amino acid
sequence. It is the first crucial step towards tertiary structure prediction. Available
methods focus on identifying conserved local secondary structures such as helices,
strands, and turns. These structures form at an early stage of protein folding; thus, an
understanding of protein secondary structure is also essential for studying the protein
folding process. There are many prediction methods available which use different
algorithms for secondary structure prediction. The Chou–Fasman method was
considered a breakthrough, with almost 50–60% prediction accuracy. However,
recent methods have improved algorithms with an increased accuracy of up to
60–65% (Kabsch and Sander 1983). Apart from the use of the amino acid sequence
for secondary structure prediction, consideration of the protein's microenvironment
and solvent accessibility further improves prediction.
structure with higher accuracy. The accuracy of the predicted structure by this
method is up to 70% (Lin et al. 2005).
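To make the idea behind propensity-based prediction concrete, the following toy Python sketch scores a sliding window with a deliberately truncated helix-propensity table: only a few well-known Chou–Fasman values are filled in, and all other residues default to a neutral 1.0, so it illustrates the window-scanning principle rather than reproducing the published method.

```python
# Toy sliding-window helix-propensity scan; the propensity table is intentionally incomplete
# and should be filled from the published Chou-Fasman tables for any real use.
HELIX_PROPENSITY = {"E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "G": 0.57, "P": 0.57}

def helix_scores(sequence, window=6):
    # Average helix propensity over a sliding window; windows scoring > 1.0 are helix-favouring.
    scores = []
    for i in range(len(sequence) - window + 1):
        win = sequence[i:i + window]
        avg = sum(HELIX_PROPENSITY.get(aa, 1.0) for aa in win) / window
        scores.append((i, win, round(avg, 2)))
    return scores

for start, win, score in helix_scores("MAELKAAGPELLAK"):
    flag = "helix-favouring" if score > 1.0 else ""
    print(start, win, score, flag)
```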
The most successful approach for predicting protein tertiary structure is template-based homology modelling. It is based on the knowledge that homologous protein sequences fold into similar three-dimensional structures. The general criterion is that two sequences must be at least 25% identical to assume structural similarity between them. To predict the three-dimensional structure of a protein, homology modelling starts with a database search to identify homologues whose structures have been solved. Such a structure is then used as a template to predict the unknown protein structure. The amino acid sequences are aligned, and structurally conserved regions are assigned based on closely related amino acid sequences. The atomic coordinates of these regions are then used to construct a partial model of the unknown protein. Side chains that differ between the two proteins within these regions are replaced with the correct ones taken from suitable structure libraries. The gaps in this partial model are then filled by loop searching and loop modelling. At the end of this process, a complete model is obtained that may contain certain errors in bond lengths, bond angles, etc., which have to be corrected by molecular mechanics and energy minimization (Marks et al. 2012).
The main problem in three-dimensional structure prediction is the calculation of free energy and obtaining the structure with the globally lowest energy. Nevertheless, due to recent advancements in technology, several automated bioinformatics tools are now available for this. Two main types of approaches are used for protein model building: (1) template-based and (2) template-independent. Both have their advantages and disadvantages (Marks et al. 2012; Kc 2017); however, template-based methods are more accurate than the others (Kc 2017; Zhang and Skolnick 2004). A few commonly used tools are explained below. To decide on a tool for protein structure prediction, users can consult the CAMEO website (https://www.cameo3d.org/), an automated server that provides continuous assessment of protein structure prediction services (Haas et al. 2018).
SWISS-MODEL
SWISS-MODEL is a widely used modelling tool, as it is fast, accurate, and user friendly. The server consists of three integrated components: (1) the SWISS-MODEL pipeline, which contains the software and databases required for protein modelling; (2) the SWISS-MODEL Workspace, which provides a virtual workspace and handles complex tasks during model preparation; and (3) the SWISS-MODEL Repository, which provides updated information on 3-D protein models of model organisms. The structure
Modeller
Modeller was developed by the Andrej Sali laboratory at the University of California, San Francisco. The tool is used for tertiary and quaternary structure prediction. It derives important information about protein structure from experimental data generated by NMR spectroscopy, site-directed mutagenesis, fluorescence spectroscopy, image reconstructions from electron microscopy, etc. This information is used to constrain parameters such as bond lengths, bond angles, and dihedral angles during model building. To build a modelled structure, MODELLER uses the following sequential steps: (i) searching for available solved structures related to the query sequence, (ii) alignment of the query and template sequences, (iii) model preparation, and (iv) evaluation of the final model. The DOPE (Discrete Optimized Protein Energy) method is used for model evaluation. Besides model building, it also performs fold assignment, phylogenetic tree preparation, and de novo modelling of protein loops (Webb and Sali 2016).
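A minimal homology-modelling run with the MODELLER Python API might look like the sketch below; it assumes a valid MODELLER installation and licence, and the alignment file, template code, and target name are placeholders that must match the user's own PIR alignment.

```python
# Minimal MODELLER sketch; 'target_template.ali', 'templateA' and 'target' are placeholders.
# Recent MODELLER releases expose the CamelCase names used here (older ones use environ/automodel).
from modeller import Environ, log
from modeller.automodel import AutoModel

log.verbose()
env = Environ()
env.io.atom_files_directory = ['.']          # directory containing the template PDB file

model = AutoModel(env,
                  alnfile='target_template.ali',  # PIR alignment of target and template
                  knowns='templateA',             # known template structure code
                  sequence='target')              # target sequence identifier in the alignment
model.starting_model = 1
model.ending_model = 5                            # build five candidate models
model.make()                                      # candidates can then be ranked, e.g. by DOPE score
```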
I-TASSER
Developed by the Yang Zhang lab, I-TASSER models structures using a threading approach. To generate a protein model from the query sequence, it performs multiple steps. First, it searches the PDB for super-secondary structures related to the query using multiple threading approaches (LOMETS). The different fragments of the modelled structure are then combined using the Monte Carlo method, and multiple models with low energy are generated using replica exchange Monte Carlo (REMC) simulation. The coordinates of all models are clustered by the SPICKER method, and the averaged coordinates are taken forward for model preparation. Lastly, the FG-MD algorithm is used to reconstruct all atoms of the model in a low free-energy state. As the final output, five full-length models with atomic resolution and estimated accuracy are reported. If no previously solved homologous structure is available as a template, the structure is built from scratch using the ab initio approach of the QUARK tool. QUARK is an integral part of the I-TASSER structure prediction pipeline, but these steps are used only when the domains are <300 residues (Roy et al. 2010; Xu and Zhang 2012).
4.6.3 CASP
The G-factor indicates how unusual the stereo-chemical parameters of a given protein model are; the lower the G-factor, the lower the probability of the particular conformation (Wlodawer 2017).
Due to limitations of the force field and of all-atom reconstruction, the quality of a predicted structure may not be very good. The refinement of a predicted structure is therefore a necessary step in protein structure prediction. The aim of refinement is to improve the quality of the model through minor adjustments of the coordinates of backbone and side-chain atoms. Refinement helps to obtain a structure with high stereo-chemical quality that is closer to the native structure. Potential energy minimization (PEM) techniques and molecular dynamics help to obtain a structure with lower energy. FG-MD is one method that performs atomic-level molecular dynamics simulation to obtain a lower-energy structure without much change in the overall structure (Zhang et al. 2011). ModRefiner, which is also used for structure refinement, uses Monte Carlo simulation for energy minimization. This method first refines the backbone structure from the initial Cα traces; after refining the backbone at minimum energy, it performs another round of simulation to reconstruct the side-chain atoms and gives a final refined model with lower free energy. The refined model can be validated using Ramachandran analysis (Xu and Zhang 2011; Feig 2017) to see whether there are steric clashes between atoms in the structure. PROCHECK (Laskowski et al. 1993), RAMPAGE, and Moleman2 (Kleywegt and Jones 1996) are extensively used tools for structure validation.
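As a small illustration of Ramachandran-style checking, the following sketch (assuming Biopython) extracts the phi/psi backbone dihedrals of a model; 'refined_model.pdb' is a placeholder file name, and the resulting angles could then be compared against the allowed Ramachandran regions.

```python
# Extract phi/psi dihedrals from a model with Biopython; the PDB file name is a placeholder.
import math
from Bio.PDB import PDBParser, PPBuilder

structure = PDBParser(QUIET=True).get_structure("model", "refined_model.pdb")

for polypeptide in PPBuilder().build_peptides(structure):
    for residue, (phi, psi) in zip(polypeptide, polypeptide.get_phi_psi_list()):
        if phi is None or psi is None:      # chain termini have undefined angles
            continue
        print(residue.get_resname(), residue.get_id()[1],
              round(math.degrees(phi), 1), round(math.degrees(psi), 1))
```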
can be used as an input file in GROMACS. By applying scripts for the different force fields, the movement of the molecule over time can be simulated in MD run(s). The output of the simulation can be analysed and visualized with the supplementary tools provided in the MD package (Abraham et al. 2015; Hollingsworth and Dror 2018).
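A typical minimal GROMACS pipeline of this kind can be driven from Python as in the rough sketch below; it assumes the gmx binary is on the PATH and that 'protein.pdb' and a prepared 'md.mdp' parameter file exist, and it omits ion addition and equilibration steps for brevity.

```python
# Rough sketch of a minimal GROMACS workflow driven via subprocess; file names are placeholders.
import subprocess

def gmx(*args):
    cmd = ["gmx", *args]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)          # stop immediately if any step fails

gmx("pdb2gmx", "-f", "protein.pdb", "-o", "processed.gro", "-p", "topol.top",
    "-ff", "amber99sb-ildn", "-water", "spce")                   # topology with a chosen force field
gmx("editconf", "-f", "processed.gro", "-o", "boxed.gro", "-c", "-d", "1.0", "-bt", "cubic")
gmx("solvate", "-cp", "boxed.gro", "-cs", "spc216.gro", "-o", "solvated.gro", "-p", "topol.top")
gmx("grompp", "-f", "md.mdp", "-c", "solvated.gro", "-p", "topol.top", "-o", "md.tpr")
gmx("mdrun", "-deffnm", "md")                                    # run the MD simulation
```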
Proteins rarely act alone. For various metabolic and regulatory processes, they may associate with ligands, nucleic acids, or other proteins. Understanding the molecular and structural basis of these interactions is necessary for the functional elucidation of proteins. There are several in silico methods to predict and characterize protein–ligand, protein–nucleic acid, and protein–protein interaction sites. To predict the interactions of a protein with other molecules, the large amount of available structural data is being utilized to develop and improve prediction algorithms. Empirical, force-field, knowledge-based, and machine-learning scoring functions are the four types currently in use (Böhm 1994). These scoring functions use different approaches to calculate the binding energy of a protein with another molecule. SwissDock (Grosdidier et al. 2011) is a commonly used tool to study protein–ligand interactions. Manual docking and simulation studies are also helpful for understanding these interactions.
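For users who prefer a local, scriptable alternative to such web services, the hedged sketch below uses the AutoDock Vina Python bindings (a different tool from SwissDock); the receptor and ligand PDBQT files and the search-box coordinates are placeholders that must come from prior preparation steps.

```python
# Hedged local docking sketch with the AutoDock Vina Python bindings; inputs are placeholders.
from vina import Vina

v = Vina(sf_name="vina")                       # use the Vina scoring function
v.set_receptor("receptor.pdbqt")
v.set_ligand_from_file("ligand.pdbqt")
v.compute_vina_maps(center=[15.0, 12.0, 9.0], box_size=[20, 20, 20])
v.dock(exhaustiveness=8, n_poses=10)
v.write_poses("docked_poses.pdbqt", n_poses=5, overwrite=True)
print(v.energies(n_poses=5))                   # binding energies of the top poses (kcal/mol)
```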
Interacting proteins are necessary for the proper functioning of various cellular processes; examples include proteinase–inhibitor complexes, antigen–antibody interactions, various signalling complexes, RNA polymerase assembly, etc. Experimental study of protein–protein interactions (PPIs) is costly and time-consuming, and such studies can be made easier computationally. Various computational tools are available for PPI prediction; most of them primarily utilize protein sequence information for analysis (Jones and Thornton 1997). Previously studied sequence and structural information is useful for identifying surface patches of a protein that may be found at the interface site. PPIs can be studied online using fully automated tools and offline using manual docking software. These tools give information about binding geometry and binding energy (Kangueane and Nilofer 2018). Some of the available tools are PrISE, InterPreTS, iLoops, and Struct2Net, which are structure-based prediction tools; PPI spider, Path2PPI, POINeT, and RedNemo, which are PPI network prediction tools; TRI_tool, HIVsemi, ChiPPI, and InterPORC, which are model organism-based PPI prediction tools; and STRING, SPRINT, HSPPIP, BindML+, and iFrag, which are other PPI prediction tools (Kangueane and Nilofer 2018; Rao et al. 2014).
Protein–DNA interactions are very important for fundamental processes such as DNA replication, transcription, and translation. Their importance in epigenetic regulation is also now well recognized.
Transcription factors and histone proteins are examples of proteins with multiple substrate specificities, which makes them difficult to model. However, there are numerous bioinformatics tools which predict DNA–protein interactions. Mainly two approaches are used for this prediction: sequence-based and structure-based. The structure-based approach requires a protein structure to predict the interaction, whereas the sequence-based approach utilizes previously available sequence information (Sarai and Kono 2005). Examples of such tools are DBS-PSSM, DBS-Pred, DISIS, DISPLAR, DP-Bind, BindN, FoldX, and DNAbinder (Sarai and Kono 2005; Si et al. 2015).
Earlier, novel drug discovery relied either on chance or on a trial-and-error process, usually performed by high-throughput screening. However, advancements in protein structure prediction and docking algorithms have reduced the cost and time needed for this process. Bioinformatics helps in different aspects of drug discovery and development, from target selection to the prediction of a lead compound and its improvement. Protein sequence and structure analysis is important for selecting a potential drug target against a disease. Knowledge about multi-protein complexes makes it possible to target specific protein–protein interactions.
Even if the tertiary structure of a potential drug target is not available, a predicted protein structure allows us to create a hypothesis about its function, its interaction with other macromolecules, and its regulatory role in the biological system. For
Table 4.2 Drugs currently under clinical trial for COVID-19 treatment
Target: Antiviral treatment
RNA polymerase: Remdesivir, favipiravir, ribavirin, umifenovir, galidesivir, oseltamivir, sofosbuvir, methylcobalamin
3CL protease: Lopinavir/Ritonavir, Ivermectin
PL protease: Disulfiram
Protein S: Griffithsin
Miscellaneous: Resveratrol, Loperamide, Losartan, Chloroquine, Hydroxychloroquine
modelled membrane protein structures to screen for various drugs (Becker et al. 2004; Hauser et al. 2018). There are even dedicated tools for modelling GPCR-family proteins, such as GPCR-SSFE 2.0, GPCRM, and GOMoDo (Worth et al. 2017; Miszta et al. 2018; Sandal et al. 2013).
4.10 Conclusion
There are several databases of protein sequence and structures which are not only
repositories of validated and annotated data, but also provide several tools to analyse
these data. Once a new protein is discovered, the biological function can be under-
stood by sequence comparisons with homologous proteins because proteins with
related functions have related amino acid sequences. Such comparisons also throw
light on the evolution of these proteins. Families of proteins with related functions
have evolved from a common ancestor. Such proteins also show similar three-dimensional structures, which means that the three-dimensional structure of an
unknown protein can be predicted by homology modelling if a homologous structure
is already known. Due to the tremendous advances in our knowledge of protein
folding as well as machine learning tools and algorithms, protein structure prediction
methods have improved significantly in the past decade. This has facilitated the
prediction of model protein structure with greater accuracy and closer to the native
structure. These protein structures can be further analysed to understand their
structure–function relationships. One of the major applications of such studies is in
drug discovery and development. However, a few challenges need to be addressed for future development, such as modelling of multi-domain proteins, prediction of
structure involving loop-mediated interactions, simulation of macromolecular
complexes, better algorithms to understand protein folding, etc. With the advance-
ment in computational facilities and development of powerful algorithms, such in
silico analysis of protein sequences and structures can make tremendous impact on
major challenges in biology.
References
Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E (2015) GROMACS: high
performance molecular simulations through multi-level parallelism from laptops to
supercomputers. SoftwareX 1–2:19–25
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J
Mol Biol 215:403–410
Andreeva A, Kulesha E, Gough J, Murzin AG (2020) The SCOP database in 2020: expanded
classification of representative family and superfamily domains of known protein structures.
Nucleic Acids Res 48:D376–D382
Attwood TK, Croning MD, Flower DR, Lewis AP, Mabey JE, Scordis P, Selley JN, Wright W
(2000) PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res 28:225–227
Bairoch A, Apweiler R (2000) The SWISS-PROT protein sequence database and its supplement
TrEMBL in 2000. Nucleic Acids Res 28:45–48
Huang X, Pearce R, Zhang Y (2020) De novo design of protein peptides to block association of the
SARS-CoV-2 spike protein with human ACE2. Aging 12:11263
Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist
CJA (2006) The PROSITE database. Nucleic Acids Res 34:D227–D230
Jones DT (2001) Predicting novel protein folds by using FRAGFOLD. Proteins 45:127–132
Jones S, Thornton JM (1997) Analysis of protein-protein interaction sites using surface patches. J
Mol Biol 272:121–132
Joshi T, Joshi T, Sharma P, Mathpal S, Pundir H, Bhatt V, Chandra S (2020) In silico screening of
natural compounds against COVID-19 by targeting Mpro and ACE2 using molecular docking.
Eur Rev Med Pharmacol Sci 24:4529–4536
Kabsch W, Sander C (1983) How good are predictions of protein secondary structure? FEBS Lett
155:179–182
Kangueane P, Nilofer C (2018) Protein-protein and domain-domain interactions. Springer
Kc DB (2017) Recent advances in sequence-based protein structure prediction. Brief Bioinform
18:1021–1032
Kendrew JC, Bodo G, Dintzis HM, Parrish RG, Wyckoff H, Phillips DC (1958) A three-
dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature 181:662–666
Kerzmann A, Neumann D, Kohlbacher O (2006) SLICK– scoring and energy functions for protein–
carbohydrate interactions. J Chem Inf Model 46:1635–1642
Kinch LN, Kryshtafovych A, Monastyrskyy B, Grishin NV (2019) CASP13 target classification
into tertiary structure prediction categories. Proteins Struct Funct Bioinform 87:1021–1036
Kleywegt GJ, Jones TA (1996) Phi/psi-chology: Ramachandran revisited. Structure 4:1395–1400
Laskowski RA, Macarthur MW, Moss DS, Thornton JM (1993) PROCHECK: a program to check
the stereochemical quality of protein structures. J Appl Crystallogr 26:283–291
Lin K, Simossis VA, Taylor WR, Heringa J (2005) A simple and fast secondary structure prediction
method using hidden neural networks. Bioinformatics 21:152–159
Madden TL, Busby B, Ye J (2019) Reply to the paper: misunderstood parameters of NCBI BLAST
impacts the correctness of bioinformatics workflows. Bioinformatics 35:2699–2700
Malik A, Firoz A, Jha V, Ahmad S (2010) PROCARB: a database of known and modelled
carbohydrate-binding protein structures with sequence-based prediction tools. Adv Bioinform
2010
Marco W (2009) Structural bioinformatics: from the sequence to structure and function. Curr
Bioinform 4:54–87
Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Nat
Biotechnol 30:1072–1080
Miszta P, Pasznik P, Jakowiecki J, Sztyler A, Latek D, Filipek S (2018) GPCRM: a homology
modeling web service with triple membrane-fitted quality assessment of GPCR models. Nucleic
Acids Res 46:W387–W395
Mount DW (2009) Using hidden Markov models to align multiple sequences. Cold Spring Harb
Protoc, 2009, pdb.top41
Mount DW, Mount DW (2001) Bioinformatics: sequence and genome analysis. Cold Spring
Harbor Laboratory Press, Cold Spring Harbor, NY
Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins
database for the investigation of sequences and structures. J Mol Biol 247:536–540
Narayanan N, Nair DT (2020) Vitamin B12 may inhibit RNA-dependent-RNA polymerase activity
of nsp12 from the SARS-CoV-2 virus. IUBMB Life
Nelson DL, Lehninger AL, Cox MM (2008) Lehninger principles of biochemistry. Macmillan
Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM (1997) CATH – a
hierarchic classification of protein domain structures. Structure 5:1093–1109
Pauling L, Corey RB, Branson HR (1951) The structure of proteins: two hydrogen-bonded helical
configurations of the polypeptide chain. Proc Natl Acad Sci 37:205
Abstract
Keywords
Phylogenetics · Phylodynamics · Molecular clock · Bayesian analysis · Human
evolution · Virus evolution
S. Banerjee (*)
School of Life Sciences, B. S. Abdur Rahman Crescent Institute of Science and Technology,
Chennai, India
e-mail: [email protected]
5.1 Introduction
In the first half of the twentieth century, evolutionary biology was largely considered a field of analyzing fossils. But after the discovery of DNA, branches of biology such as molecular biology and genetics rapidly evolved with new perspectives. Microbial adaptation to environmental stresses such as antibiotics has been explained in the light of mutation. Evolutionary biology witnessed a paradigm shift from handling fossils to analyzing mutations in genes. During that time, another field of biology, namely bioinformatics, silently sprouted with considerable application of computational algorithms. Algorithms and statistics have become an integral part of evolutionary biology, from calculating the number of mutations at a given site to constructing the phylogenetic tree of a novel pathogenic virus. In the 1980s, the world witnessed the human immunodeficiency virus (HIV) claiming millions of lives worldwide. Computational analysis of the gene sequences of HIV subtypes helped scientists to understand the nature of the virus and its mutation rate. Since the beginning of the twentieth century, the world has witnessed several epidemics and pandemics. In 2014, Sierra Leone in West Africa witnessed a serious viral outbreak with a high mortality rate, caused by the Ebola virus. The zoonotic host of Ebola is the bat. Genome analysis of the viral strains isolated from different patient samples and bat samples revealed that the outbreak started from a single person who came into contact with an Ebola-infected bat (Futuyma and Kirkpatrick 2017). Since 2020, we have been witnessing one of the worst pandemics in human history, named Covid-19. Viral genome sequencing from patient samples followed by computational analysis revealed that the possible source is the bat and that the nearest relative of SARS-CoV-2 (the causative agent of Covid-19) is the SARS virus. This information was produced by using different computational tools such as multiple sequence alignment, phylogenetic tree construction, estimation of the time of the most recent common ancestor (tMRCA) using Bayesian analysis, etc.
Computational evolutionary biology is widely applied in evolutionary genetics for the analysis of ancient genomes of humans and other species, molecular anthropology, tracking the spread of infectious agents, genetic polymorphism detection, etc. The advancement of genome sequencing technology has yielded a huge volume of sequence data. These sequences have opened a unique opportunity for scientists to discover new insights into evolution. At the same time, the huge volume of data poses a challenge for managing and analyzing it to extract meaningful information. Advancements in novel and high-throughput sequencing techniques have opened many subfields in evolutionary biology, such as molecular anthropology, population genetics, archeogenetics, macro-evolution, etc. The analysis of sequencing data reveals information on evolutionary processes, epidemiological influences, etc. Analyzing raw genome data yields diverse information, such as mammalian promoter architecture and evolution (Carninci et al. 2006), evolutionary relationships between closely related and potentially hybridizing species like flycatchers (Nater et al. 2015), and mutations associated with clonal evolution of breast cancer (Wang et al. 2014). Genome annotation helps us to identify protein-coding genes and their respective structural and functional annotations. Functional annotation uses the identification of homologous proteins through finding the best match
Substitution models can be applied either to nucleotides in DNA/genome sequences or to amino acids in proteins. Here, nucleotide and amino acid substitution models are discussed separately.
The distance between two DNA sequences is defined as the number of substitutions per site; assuming a constant rate of evolution, the sequences diverge further with time. Substitution models essentially calculate the number of substitutions per site and thereby the distance between two sequences. In the simplest form, the extent of sequence divergence is the proportion (p) of nucleotide sites at which the two sequences differ. So, p = nd/n, where nd is the number of sites that differ between the two sequences and n is the total number of sites compared (Nei and Kumar 2000). This is called the p distance. As there are 4 different nucleotides, 16 different pairs can be formed. Of these, four are identical pairs: AA, TT, CC, and GG. Four nucleotide pairs represent transitions: AG, GA, TC, and CT. There are eight
possible transversions: AT, TA, AC, CA, GT, TG, GC, and CG. Theoretically, transversions should be twice as frequent as transitions, but in reality transitions are found to be more frequent than transversions (Nei and Kumar 2000). Modelling nucleotide substitution is therefore very important in order to estimate the number of nucleotide substitutions. Different models have been proposed for the
purpose. The standard nucleotide substitution models are—Jukes–Cantor model,
Kimura model, equal input model, Tamura model, HKY model, Tamura–Nei model,
general reversible model and unrestricted model.
V(d) = 9q(1 − q) / [(3 − 4q)^2 n]

d = −b ln(1 − p/b)

where b = (1/2)[1 − Σ(i=1 to 4) gi^2 + p^2/c] (Nei and Kumar 2000; Tajima and Nei 1984).
4. Tamura's model: In reality, nucleotide frequencies are not equal, contrary to the assumption of some substitution models. Tamura proposed a substitution model in which the substitution rates for transitions and transversions are different, denoted α and β, respectively. The AT and GC contents are also considered (θ1 and θ2, respectively), and the estimation of substitution depends on both parameters. Here, the nucleotide substitution d is determined as
d = −h ln(1 − P/h − Q) − (1/2)(1 − h) ln(1 − 2Q)

where h = 2θ(1 − θ).
Tamura’s method is able to compute different evolutionary parameters.
5. HKY model: This model was proposed by Hasegawa, Kishino, and Yano in 1985 and is based on the Markov chain principle. It was proposed to estimate the divergence of mitochondrial DNA between primates and ungulates (Hasegawa et al. 1985). The HKY model estimates genetic distances indirectly and takes into consideration multiple changes at a site. Phylogenetic relationships were established using the maximum likelihood method.
6. Tamura and Nei method: This model was proposed by Tamura and Nei in 1993 (Tamura and Nei 1993). It considers different substitution rates for transitions and transversions, with all transversions assumed to occur at the same rate. It also differentiates two kinds of transition (purine → purine and pyrimidine → pyrimidine): for purines, αAG = αGA = α1, and for pyrimidines, αCT = αTC = α2. In this model, P and Q are the transitional and transversional mutation frequencies, respectively.
7. General time reversible model: GTR has probably been the most popular substitution model over the last decade (Sumner et al. 2012). A Markov chain is called time reversible if πi qij = πj qji for all states i and j, where πi is the proportion of time the Markov chain spends in the ith state and πi qij is the flow from the ith to the jth state. The rate matrix can be written as a symmetric matrix multiplied by a diagonal matrix:
Q = {qij} =

| ·     aπC   bπA   cπG |
| aπT   ·     dπA   eπG |
| bπT   dπC   ·     fπG |
| cπT   eπC   fπA   ·   |

  | ·  a  b  c |   | πT   0    0    0  |
= | a  ·  d  e | × | 0    πC   0    0  |
  | b  d  ·  f |   | 0    0    πA   0  |
  | c  e  f  · |   | 0    0    0    πG |
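The basic quantities discussed above are straightforward to compute. The sketch below (plain Python with NumPy) calculates the p distance between two aligned sequences, applies the standard Jukes–Cantor correction d = −(3/4) ln(1 − 4p/3), and assembles a GTR rate matrix as a symmetric exchangeability matrix times a diagonal matrix of base frequencies; the base order T, C, A, G follows the matrix shown above, and the numeric values are illustrative only.

```python
import math
import numpy as np

def p_distance(seq1, seq2):
    # p = nd / n: proportion of sites differing between two aligned, gap-free sequences.
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    nd = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return nd / len(seq1)

def jukes_cantor(p):
    # Standard Jukes-Cantor correction, valid for p < 0.75.
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def gtr_rate_matrix(exchangeabilities, base_freqs):
    # Q = S * diag(pi): symmetric exchangeabilities (a..f) times base frequencies.
    # Base order assumed to be T, C, A, G, matching the matrix in the text.
    Q = np.array(exchangeabilities, dtype=float) * np.array(base_freqs, dtype=float)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # each row of a rate matrix sums to zero
    return Q

p = p_distance("ATGCTACGATCG", "ATGTTACGTTCG")
print(round(p, 3), round(jukes_cantor(p), 3))

a, b, c, d, e, f = 1.0, 2.0, 1.0, 1.0, 2.0, 1.0           # illustrative exchangeabilities
S = [[0, a, b, c], [a, 0, d, e], [b, d, 0, f], [c, e, f, 0]]
pi = [0.25, 0.25, 0.25, 0.25]                              # illustrative equal base frequencies
print(gtr_rate_matrix(S, pi))
```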
From an evolutionary point of view, studying amino acid substitution is more informative than nucleotide substitution, because amino acids are more conserved than nucleotides and give important insights into long-term evolution. There are different statistical methods to measure the distance between two amino acid sequences. This distance, also termed the evolutionary distance, is extremely important for the construction of phylogenetic trees and the estimation of divergence times. One method is to count the number of differing amino acids (nd) between two sequences. This is directly applicable only if all the peptides have the same sequence length, which rarely happens in reality (Nei and Kumar 2000). Introducing gaps in a multiple sequence alignment to represent insertion–deletion (indel) mutations is a very common practice, and in that case a direct measure of nd is not possible. However, the ratio of the number of differences to the total number of amino acids (nd/n) is a more meaningful quantity; this ratio is known as the p distance. In reality, p is not strictly proportional to time (t). One of the reasons is multiple amino acid substitutions at the same site. A Poisson distribution gives a better estimate of the relation between p and t, with P(k; t) = e^(−rt) (rt)^k / k!, where r is the rate of amino acid substitution per year at a given site. The rate is assumed to be uniform across all sites; however, this assumption does not hold true in real-life scenarios. Observations suggest that the amino acid substitution rate is higher at functionally trivial sites than at functionally important ones. It has been shown that if the substitution rate varies among sites according to a gamma distribution (Γ), the number of substitutions per site follows a negative binomial distribution (Nei and Kumar 2000).
So, f(r) = [b^a / Γ(a)] e^(−br) r^(a−1), with

Γ(a) = ∫0^∞ e^(−t) t^(a−1) dt

where a is the gamma (shape) parameter, which determines the shape of the distribution, and b is the scaling parameter.
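For amino acid sequences, the corresponding corrections are equally simple to compute; the short sketch below implements the Poisson-corrected distance d = −ln(1 − p) and the gamma distance d = a[(1 − p)^(−1/a) − 1] described in Nei and Kumar (2000), with an arbitrary example value of p.

```python
import math

def poisson_distance(p):
    # Poisson-corrected amino acid distance: d = -ln(1 - p).
    return -math.log(1.0 - p)

def gamma_distance(p, a):
    # Gamma-corrected distance with shape parameter a: d = a * ((1 - p)**(-1/a) - 1).
    return a * ((1.0 - p) ** (-1.0 / a) - 1.0)

p = 0.30                                       # example proportion of differing sites
print(round(poisson_distance(p), 4))           # -> 0.3567
print(round(gamma_distance(p, a=2.0), 4))      # -> 0.3904
```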
Fig. 5.1 Hypothetical tree with species A and B and out-group C (ancestral node O)
Dating analyses on an evolutionary time scale can be subdivided into five steps: (1) dataset assembly, (2) choice of calibration model, (3) selection of the molecular clock method and the model for rate variation, (4) analysis, and (5) interpretation (Sauquet 2013). Calibrating the molecular clock is a very important and tricky part. The variation in rates can be divided into three components: gene effects, lineage effects, and gene-by-lineage effects (Ho and Duchene 2014). Rates can be expressed as substitutions per site per year or per million years. Gene (or locus) effects affect the evolutionary rate across genes. Lineage effects are rate variations across taxa in which all genes are affected equally. Gene-by-lineage effects work in combination, with gene-specific rates varying among lineages. Molecular clocks can be classified by the number of distinct rates (k) across the number of branches (n) in the tree: for a strict molecular clock, k = 1, and for a relaxed molecular clock, n ≥ k > 1. Another type of molecular clock is the local clock, which assumes that evolutionarily close relatives share the same rate; the phylogenetic tree then contains distinct clusters of closely related lineages, and each cluster can have a different rate. The random local clock model is implemented in the phylogenetic tree construction software BEAST, where phylogeny, node times, and rates are estimated simultaneously. In this framework, a Bayesian stochastic search is implemented to infer whether a branch inherits the substitution rate of its parental node, by calculating posterior probabilities.
There are different tests of the molecular clock, such as relative rate tests and the likelihood ratio test. The relative rate test is the simplest clock hypothesis test: two species A and B are tested for evolving at the same rate relative to a third species C, which is an out-group. All rates measured here are relative to C, hence the name relative rate test. In a hypothetical tree (Fig. 5.1), O is the ancestral node, and the distances from O to A and from O to B, i.e., dOA and dOB, should be equal. The likelihood ratio test of the clock can be applied to a tree of any size. If there are s species, then under the clock model H0 a rooted tree has s − 1 internal nodes, and the number of substitutions per site is assumed to be uniform across branches.
As any method of molecular clock estimation works on a number of assumptions, the Bayesian framework is by far considered the best. It is based on Bayes' theorem, which states

P(θ|D) = P(θ) P(D|θ) / P(D)

where P(θ|D) is the posterior probability, P(θ) is the prior probability, P(D|θ) is the likelihood, and P(D) is the probability of the data. The likelihood value is estimated by a likelihood model that includes the substitution model and the choice of a specific tree. The prior distribution is specified by evolutionary models. In this framework,
hyperparameters are used to specify the priors, and the priors placed on these hyperparameters are called hyperpriors. The probabilities of the different parameters are then combined with the likelihood score to obtain the posterior probability distribution. The probability of the data, P(D), is difficult to obtain analytically; this is circumvented by the Markov chain Monte Carlo (MCMC) method, which stochastically generates a huge number of samples to determine the posterior distribution in Bayesian phylogenetics. There are different algorithms for MCMC simulation, and one of the most widely used is the Metropolis–Hastings algorithm (Bromham et al. 2018; Hastings 1970; Metropolis et al. 1953). In this algorithm, one model tree is generated with a set of values assigned to the model parameters and its likelihood is estimated. The algorithm then modifies one or more model parameters and recalculates the likelihood score and prior probability, thus generating two related states separated by at least one parameter. The ratio of the posterior probabilities of these two alternative states is calculated, and depending on this ratio the proposal is accepted or rejected; the procedure moves on step by step in this way. Different software packages use Bayesian models for molecular clock analysis, e.g. Bayesian Evolutionary Analysis Sampling Trees (BEAST), Molecular Evolutionary Genetics Analysis (MEGA), MrBayes, BayesPhylogenies, BayesTraits, etc.
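To illustrate the accept/reject mechanics described above, the following toy Python sketch implements a generic random-walk Metropolis–Hastings sampler for a single parameter; the target is a simple normal log-posterior rather than a phylogenetic model, so it only demonstrates the algorithmic idea.

```python
import math
import random

def metropolis_hastings(log_posterior, x0, n_steps=10000, step_size=0.5):
    # Generic random-walk Metropolis-Hastings sampler for a single parameter.
    x = x0
    samples = []
    for _ in range(n_steps):
        proposal = x + random.gauss(0.0, step_size)        # modify the parameter
        log_ratio = log_posterior(proposal) - log_posterior(x)
        # Accept with probability min(1, posterior ratio).
        if log_ratio >= 0 or random.random() < math.exp(log_ratio):
            x = proposal
        samples.append(x)
    return samples

# Toy target: log-posterior of a normal distribution with mean 1.0 and sd 0.5.
log_post = lambda theta: -0.5 * ((theta - 1.0) / 0.5) ** 2
draws = metropolis_hastings(log_post, x0=0.0)
burned = draws[2000:]                                      # discard burn-in
print(sum(burned) / len(burned))                           # should be close to 1.0
```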
local clock model. Different tree models are used to construct phylodynamic models. There are two types of tree models: (a) models for unstructured populations and (b) models for structured populations.
a. Models for unstructured populations: For unstructured tree models, two approaches are used. The first is the birth–death model, in which birth and death rates are assumed in order to build the tree. The second approach is the coalescent model, in which parameterization depends on the background population and changes in the effective population size.
b. Models for structured populations: For structured populations, the models are analogous to the unstructured population models, i.e. the two approaches are (1) the multistate birth–death model and (2) the structured coalescent approach. In the multistate birth–death model, using the bdmm package, BEAST v2.5 can quantify parameters such as migration rates from ancestral lineages without MCMC sampling of ancestral states. In the structured coalescent model, the MultiTypeTree package can sample the ancestral states of all lineages using MCMC sampling.
Overall, BEAST is a very important and robust tool for ancestral sequence reconstruction and other phylodynamic analyses that make use of the molecular clock.
statistical tests for the constructed tree, which include the interior branch test and the bootstrap test. Molecular clock tests include Tajima's test and an ML-based molecular clock test. MEGA also offers reconstruction of ancestral states by the maximum likelihood (ML) method, which estimates the ancestral state at each node of the phylogenetic tree. The maximum parsimony method is applied for inferring ancestral states when sequence diversity is low.
The wealth of genomic data has accelerated research in evolutionary biology. The huge amount of genomic information from different species has driven scientists to develop novel computational tools to infer useful information from this vast and rapidly growing repertoire of data. Some of these tools are IMPUTOR, DeCoSTAR, MultiTwin, POPBAM, the VISTA genome browser and tools, PipMaker, and Expasy tools including Bgee, OMA, ALF, BayeScan, etc.
IMPUTOR: Next-generation sequencing has generated huge amounts of sequence data, but the relatively short read lengths influence data quality. IMPUTOR compares sequences by constructing a high-confidence phylogenetic tree and performs imputation for a set of recombining sequences. It functions via the principle of parsimony, which assumes that ancestral sites do not undergo reversion. It imputes missing variants and also corrects non-missing sites that may arise from sequencing errors (Jobin et al. 2018).
DeCoSTAR: This software reconstructs ancestral genes and genomes, organizing the ancestral genes in the form of adjacencies. It can also improve the assembly of fragmented genomes by recognizing scaffolding fragments shaped by evolutionary pressure. Ancestral genes or domains are calculated from a phylogenetic tree that is constructed by taking into account the different events that influence gene evolution, such as gain, loss, duplication, or transfer. The software is not only able to reconstruct gene-domain events such as breakage and fusion, but is also able to handle large datasets (Duchemin et al. 2017).
MultiTwin: This software package uses a multipartite graph approach to construct and analyze evolution at different levels of organization, taking into account different levels of biological organization such as genes, genomes, communities, or environments. It comprehensively analyses sequence-based classifications. This type of graph is useful in comparative analyses of microbial genomes, gene sharing between cellular genomes, transposable elements, etc. The tool can be used to decipher pathogenicity traits in microbial communities (Corel et al. 2018).
POPBAM: Next-generation sequencing yields multiple short fragments. POPBAM is a collection of tools for evolutionary analysis of whole-genome alignments among multiple species. It uses BAM-formatted files; BAM is a compressed binary version of the sequence alignment map (SAM) format. POPBAM uses the sequence assembly file, calls variant sites, and calculates different statistical parameters related to evolutionary biology, such as nucleotide diversity and linkage disequilibrium.
(Krause et al. 2007). Genomic studies reveal that Neanderthals interbred with ancient humans and that non-African modern humans carry Neanderthal genes. The analysis revealed that modern East Asian populations carry a greater proportion of Neanderthal genes than European populations. Male hybrid sterility was found to be one of the significant deleterious traits acquired from Neanderthals (Sankararaman et al. 2014). A comparative analysis of genome copy-number variation between humans and nine other primates revealed that some genes, such as AQP7, show human-specific duplications that drive species-specific evolution (Dumas et al. 2007). Thus, different scientific works indicate that human evolutionary biology is steadily being enriched with the help of computational tools.
Genome-wide association studies (GWAS) are a very useful tool to identify gene associations with diseases. The GRASP2 database is a publicly available repository of 8.87 million SNP–disease associations (Karim et al. 2016). In another study, a novel human adenoviral pathogen causing pneumonia was identified; recombination analysis, phylogenomics, and phylodynamics revealed that it has three hosts (human, chimpanzee, and bonobo) and is able to cross species barriers (Dehghan et al. 2019). In a further study, the authors analyzed the genomes of 95 strains of human adenovirus and performed different in silico analyses, such as recombination and structural analysis, phylogenetic analysis, etc. They found that horizontal genome transfer by recombination is an important feature of adenovirus evolution and may have serious consequences for human health (Ismail et al. 2018).
Transposable elements are very important evolutionary tools. They are found in most species and regulate a variety of cellular mechanisms. Different tools and databases, such as RepBase, Dfam, GyDB, SINEbase, TREP, RiTE, and RepeatMasker, have been created to identify and catalog transposable elements (Goerner-Potvin and Bourque 2018). Nipah virus is a zoonotic virus and is considered a threat to animals and humans. In one study, the authors collected all available Nipah virus genome sequences and analyzed them using phylogenetics and molecular evolution studies. Receptor analysis and other studies indicate variability between the two strains and point towards adaptive evolution (Li et al. 2020a, 2020b). The Flaviviridae family of viruses causes major health hazards; its members are single-stranded RNA viruses. The synonymous codon usage pattern of the Flaviviridae family has been analyzed, and a correspondence analysis revealed that it is constituted of two groups (Yao et al. 2019). Ebola virus is one of the deadliest viruses currently in circulation. Researchers analyzing Ebola virus data identified positive genetic selection on the GP and L genes and suggested that there could be more strains able to achieve human-to-human transmission (Liu et al. 2015). Studying Ebola virus data and performing phylodynamic assessments of intervention strategies, hypothetical impact assessments, the role of barriers to virus transmission, etc., helps in devising strategies to stop viral spread (Dellicour et al. 2018). Simulation is an important step in predicting the evolutionary dynamics of virus genomes. One research group has created a software
5.7 Conclusion
References
Benvenuto D, Giovanetti M, Salemi M, Prosperi M, De Flora C, Junior Alcantara LC, Ciccozzi M
(2020) The global spread of 2019-nCoV: a molecular evolutionary analysis. Pathog Glob Health
114(2):64–67. https://doi.org/10.1080/20477724.2020.1725339
Bouckaert R, Vaughan TG, Barido-Sottani J, Duchene S, Fourment M, Gavryushkina A,
Drummond AJ (2019) BEAST 2.5: an advanced software platform for Bayesian evolutionary
analysis. PLoS Comput Biol 15(4):e1006650. https://doi.org/10.1371/journal.pcbi.1006650
Bromham L, Duchene S, Hua X, Ritchie AM, Duchene DA, Ho SYW (2018) Bayesian molecular
dating: opening up the black box. Biol Rev Camb Philos Soc 93(2):1165–1191. https://doi.org/
10.1111/brv.12390
Brudno M, Poliakov A, Minovitsky S, Ratnere I, Dubchak I (2007) Multiple whole genome
alignments and novel biomedical applications at the VISTA portal. Nucleic Acids Res 35:
W669–W674. https://doi.org/10.1093/nar/gkm279
Jobin M, Schurz H, Henn BM (2018) IMPUTOR: phylogenetically aware software for imputation
of errors in next-generation sequencing. Genome Biol Evol 10(5):1248–1254. https://doi.org/
10.1093/gbe/evy088
Karim S, NourEldin HF, Abusamra H, Salem N, Alhathli E, Dudley J, Kumar S (2016) e-GRASP:
an integrated evolutionary and GRASP resource for exploring disease associations. BMC
Genomics 17(Suppl 9):770. https://doi.org/10.1186/s12864-016-3088-1
Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucleic Acids Res 30(14):3059–3066.
https://doi.org/10.1093/nar/gkf436
Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through
comparative studies of nucleotide sequences. J Mol Evol 16(2):111–120. https://doi.org/10.
1007/BF01731581
Krause J, Lalueza-Fox C, Orlando L, Enard W, Green RE, Burbano HA, Paabo S (2007) The
derived FOXP2 variant of modern humans was shared with Neandertals. Curr Biol 17
(21):1908–1912. https://doi.org/10.1016/j.cub.2007.10.008
Kumar S, Nei M, Dudley J, Tamura K (2008) MEGA: a biologist-centric software for evolutionary
analysis of DNA and protein sequences. Brief Bioinform 9(4):299–306. https://doi.org/10.1093/
bib/bbn017
Lai A, Bergna A, Acciarri C, Galli M, Zehender G (2020) Early phylogenetic estimate of the
effective reproduction number of SARS-CoV-2. J Med Virol 92(6):675–679. https://doi.org/10.
1002/jmv.25723
Li K, Yan S, Wang N, He W, Guan H, He C, Su S (2020a) Emergence and adaptive evolution of
Nipah virus. Transbound Emerg Dis 67(1):121–132. https://doi.org/10.1111/tbed.13330
Li J, Zhang S, Li B, Hu Y, Kang XP, Wu XY, Jiang T (2020b) Machine learning methods for
predicting human-adaptive influenza A viruses based on viral nucleotide compositions. Mol
Biol Evol 37(4):1224–1236. https://doi.org/10.1093/molbev/msz276
Liu SQ, Deng CL, Yuan ZM, Rayner S, Zhang B (2015) Identifying the pattern of molecular
evolution for Zaire ebolavirus in the 2014 outbreak in West Africa. Infect Genet Evol 32:51–59.
https://doi.org/10.1016/j.meegid.2015.02.024
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state
calculations by fast computing machines. J Chem Phys 21(6):1087–1092
Nater A, Burri R, Kawakami T, Smeds L, Ellegren H (2015) Resolving evolutionary relationships
in closely related species with whole-genome sequencing data. Syst Biol 64(6):1000–1017.
https://doi.org/10.1093/sysbio/syv045
Nei M, Kumar S (2000) Molecular evolution and phylogenetics. Oxford University Press, Oxford
Noonan JP (2010) Neanderthal genomics and the evolution of modern humans. Genome Res 20
(5):547–553. https://doi.org/10.1101/gr.076000.108
Nuin PA, Wang Z, Tillier ER (2006) The accuracy of several multiple sequence alignment
programs for proteins. BMC Bioinformatics 7:471. https://doi.org/10.1186/1471-2105-7-471
Park Y, Casey D, Joshi I, Zhu J, Cheng F (2020) Emergence of new disease: how can artificial
intelligence help? Trends Mol Med 26(7):627–629. https://doi.org/10.1016/j.molmed.2020.04.
007
Remita MA, Halioui A, Malick Diouara AA, Daigle B, Kiani G, Diallo AB (2017) A machine
learning approach for viral genome classification. BMC Bioinf 18(1):208. https://doi.org/10.
1186/s12859-017-1602-3
Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Hohna S, Huelsenbeck JP (2012)
MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model
space. Syst Biol 61(3):539–542. https://doi.org/10.1093/sysbio/sys029
Sankararaman S, Mallick S, Dannemann M, Prufer K, Kelso J, Paabo S, Reich D (2014) The
genomic landscape of Neanderthal ancestry in present-day humans. Nature 507(7492):354–357.
https://doi.org/10.1038/nature12961
Sauquet H (2013) A practical guide to molecular dating. Comptes Rendus Palevol 12(6):355–367
Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Miller W (2000) PipMaker--a web
server for aligning two genomic DNA sequences. Genome Res 10(4):577–586. https://doi.org/
10.1101/gr.10.4.577
Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large
phylogenies. Bioinformatics 30(9):1312–1313. https://doi.org/10.1093/bioinformatics/btu033
Sumner JG, Jarvis PD, Fernandez-Sanchez J, Kaine BT, Woodhams MD, Holland BR (2012) Is the
general time-reversible model bad for molecular phylogenetics? Syst Biol 61(6):1069–1074.
https://doi.org/10.1093/sysbio/sys042
Tajima F, Nei M (1984) Estimation of evolutionary distance between nucleotide sequences. Mol
Biol Evol 1(3):269–285. https://doi.org/10.1093/oxfordjournals.molbev.a040317
Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region
of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10(3):512–526. https://doi.
org/10.1093/oxfordjournals.molbev.a040023
Wang Y, Waters J, Leung ML, Unruh A, Roh W, Shi X, Navin NE (2014) Clonal evolution in
breast cancer revealed by single nucleus genome sequencing. Nature 512(7513):155–160.
https://doi.org/10.1038/nature13600
Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24
(8):1586–1591. https://doi.org/10.1093/molbev/msm088
Yao H, Chen M, Tang Z (2019) Analysis of synonymous codon usage bias in flaviviridae virus.
Biomed Res Int 2019:5857285. https://doi.org/10.1155/2019/5857285
Zehender G, Lai A, Bergna A, Meroni L, Riva A, Balotta C, Galli M (2020) Genomic characteriza-
tion and phylogenetic analysis of SARS-COV-2 in Italy. J Med Virol. https://doi.org/10.1002/
jmv.25794
6 Web-Based Bioinformatics Approach Towards Analysis of Regulatory Sequences

B. Sharan Sharma, Sonal R. Bakshi, Preeti Sharma, and Ramtej J. Verma
Abstract
B. S. Sharma (*)
GeneXplore Diagnostics and Research Centre, Ahmedabad, Gujarat, India
Rivaara Labs Pvt Ltd, KD Hospital, Ahmedabad, Gujarat, India
S. R. Bakshi
Institute of Science, Nirma University, Ahmedabad, Gujarat, India
P. Sharma
Department of Zoology, Biomedical Technology and Human Genetics, Gujarat University,
Ahmedabad, Gujarat, India
PanGenomics International Pvt Ltd, Sterling Accuris Diagnostics, Ahmedabad, Gujarat, India
R. J. Verma
Department of Zoology, Biomedical Technology and Human Genetics, Gujarat University,
Ahmedabad, Gujarat, India
Keywords
Cis-regulatory elements · RSAT · Transcription factor · TFBS · Online
bioinformatics tools
6.1 Introduction
The mammalian genome consists of coding and non-coding DNA sequences. In the past, much attention has been paid to the coding regions of DNA, as they are the manufacturing units of functioning proteins, and the slightest variation in protein-coding regions can have devastating effects on the proper functioning of a protein and may result in disease conditions. On the other hand, mutations in regulatory elements generally alter the expression of the associated gene (increased or decreased
expression) and not the protein structure; therefore, these mutations have been considered less likely to exert a phenotypic impact. However, a rising number of studies have identified variations outside the coding regions that are robustly associated with various quantitative traits and complex diseases (Worsley-Hunt et al. 2011; Mokry et al. 2016; Chatterjee and Ahituv 2017). The primary reason is that mutations outside coding regions, within regulatory DNA sequences, can disrupt transcription factor binding sites or create new ones, eventually altering the mechanisms of transcriptional regulation. Hitherto, in the past two decades, much stress has been laid on protein-coding changes in a DNA sequence. It is now apparent that the strategic focus will be on variations within regulatory sequences as well, since precise functioning of the various classes of regulatory elements is a prerequisite for error-free expression of linked genes, and failure of this molecular machinery can have serious consequences resulting in disease conditions.
Major disciplines of '-omics' technologies have evolved rapidly over the last decade or so (Sharma et al. 2019a). Consequently, with rapid advances in sequencing and microarray technologies, the availability of genomic data has expanded tremendously. This genomic data holds many potential applications, not only for understanding normal physiology but also disease conditions. Hence, the role of bioinformatics tools and techniques is fundamental, and it would be appropriate to say that the marriage of omics with bioinformatics forges a frontier that provides scientists with unlimited opportunities for breakthrough discoveries in medical research. Omics and bioinformatics are not only used for studying the genes and biological signalling pathways responsible for human diseases but also to identify potential new targets for therapeutic drugs and therapy (Allen and Cagle 2009). The omics revolution and the rise of bioinformatics tools and techniques have upgraded our knowledge and capability to handle huge datasets. Omics combined with high-throughput studies has facilitated the mapping of a large number of genetic variants within coding as well as non-coding regulatory sequences (LaFramboise 2009; Begum et al. 2012; Gloss and Dinger 2018; Perenthaler et al. 2019).
Comparative genomic approaches using computational tools are having a notable impact on the study of regulatory regions, especially in mammals, and at present represent the most systematic and sound approaches for predicting non-coding sequences expected to control gene expression patterns. By subjecting genomic sequences to in silico comparisons and subsequent investigations, we are slowly but surely moving towards a better catalogue of recurrent regulatory motifs responsible for fundamental biological processes (Loots 2008). Analysis of the data generated through genomic projects has provided a valuable foundation for studying particular genetic variants contributing to simple and complex ailments (Hasin et al. 2017), and as our knowledge of the involvement of regulatory elements in diseases is increasing, it is only rational to develop
popular examples of online TF databases which are widely used to predict, compare, and analyse unknown and/or novel motifs and transcription factors against the transcription factor binding models stored in these databases.
In one of our recent original works, a systematic bioinformatics approach was adopted to detect and analyse motifs in human locus control regions (LCRs) (Sharma et al. 2019b). LCRs are important, though not much studied, cis-acting regulatory sequences that control the expression of linked genes in a position-independent and copy-number-dependent manner. Using the web-based RSAT suite, motifs of biological relevance could be discovered in important human LCRs to help understand their unique regulatory features. LCRs form an important component of integrating vectors owing to their unique expression-control abilities; therefore, identification of the unique regulatory signatures present within LCR sequences will contribute to the design of a new generation of regulatory elements. One such example was described in the design of a non-viral mammalian expression vector in which the primary transgene was under the transcriptional control of elements of an LCR (Sharma and Verma 2020). Such a vector design provides a framework for strong regulation with non-viral features, which confers certain advantages over viral vectors.
In order to understand the regulation of a gene, an understanding of the mechanism of DNA–protein interaction at the molecular level is important (Sharma et al. 2020). This interaction, in general, involves various classes of regulatory elements for the eventual and faithful expression of genes. This expression machinery is thus an essential aspect of cellular functioning, failure of which could lead to serious consequences resulting in disease. With advancements in computational biology and bioinformatics tools and techniques, it is now easier to discover the unique regulatory signatures (URS) of regulatory elements in order to understand this molecular machinery. Web-based bioinformatics tools such as RSAT and MEME are powerful tools for better understanding regulatory elements and predicting their potential roles in health and disease. We have used the RSAT suite in the past to study LCRs; some of the features of the RSAT suite are outlined in the next section.
described 13 new programs added to the 30 tools of the previous version. A series of protocols was also described through different publications to give step-by-step instructions about option choices and result interpretation for the popular RSAT tools (Janky and van Helden 2007; Sand and Helden 2007; Defrance et al. 2008; Sand et al. 2008; Turatsinze et al. 2008). The RSAT 2015 version offered access to a large number of genomes from all kingdoms, assisted by a new taxon-specific organization of the public servers, and was also expanded to diversify its applications, including comparison and clustering of motifs, regulatory variant analyses, and comparative genomics (Medina-Rivera et al. 2015). The 20th-anniversary article of RSAT provided updates on the novelties included in the RSAT 2018 suite and also presented various access and training modalities (Nguyen et al. 2018).
A proper sequence dataset is the starting point of any genomic analysis; the RSAT suite provides an application called 'retrieve-ensembl-seq' that significantly eases the retrieval of sequences from the Ensembl database in a user-friendly fashion (Sand et al. 2009). The RSAT suite has been useful for detecting and analysing putative cis-regulatory elements and regions enriched in such elements (Turatsinze et al. 2008). Peak-motifs of RSAT, as a comprehensive pipeline, efficiently discovers motifs and identifies putative transcription factors in ChIP-seq and similar data (ChIP-PET, ChIP-on-chip, CLIP-seq). The biological validity of peak-motifs was demonstrated by recovering the correct motifs from ChIP-seq sets corresponding to known transcription factors, and moreover by predicting specific motifs and transcription factors in an original analysis (Thomas-Chollier et al. 2012). Recently, the 'Variation-tools' program has been included in the well-maintained RSAT suite; it provides an accessible resource for expert and non-expert users to analyse regulatory variants in a web interface for as many as fifteen organisms, with the flexibility to upload personal variant and PSSM collections (Santana-Garcia et al. 2019).
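To give a flavour of the underlying matrix-based motif scanning that such suites perform, the hedged sketch below (assuming Biopython) builds a toy position-specific scoring matrix from a few hypothetical binding-site instances and scans a hypothetical promoter sequence; real analyses would instead use curated motif models, e.g. from JASPAR, or the RSAT/MEME web tools themselves.

```python
# Toy PSSM construction and scanning with Biopython; motif instances and promoter are hypothetical.
from Bio import motifs
from Bio.Seq import Seq

instances = [Seq("TATAAA"), Seq("TATAAT"), Seq("TATTAA"), Seq("TATAAA")]
motif = motifs.create(instances)

# Convert counts to a position weight matrix and then to a log-odds PSSM.
pwm = motif.counts.normalize(pseudocounts=0.5)
pssm = pwm.log_odds()

# Scan a hypothetical promoter sequence for sites scoring above a threshold.
promoter = Seq("GGCCTATAAAGGCGCGTATTATCCGG")
for position, score in pssm.search(promoter, threshold=3.0):
    strand = "+" if position >= 0 else "-"
    print(f"putative site at {position} ({strand} strand), score {score:.2f}")
```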
The development of RSAT, from the yeast genome to the high-throughput sequencing era, has impacted the analysis of regulatory elements in a user-friendly and positive way. Comprehensive identification of functional elements, most notably regulatory motifs, is of fundamental importance to biomedical research (Hashim et al. 2019), especially when their dysregulation might lead to pathological conditions. Computational identification of other functional elements, such as DNase I-hypersensitive sites, zinc finger domains, and other regulatory signatures within regulatory sequences, is an important aspect of biomedical research. The development of additional user-friendly online algorithms and tools will be instrumental in gaining regulatory insights into complex sequences.
In the current era, the analysis of complex sequences involved in gene regulation with user-friendly computational and bioinformatics tools is an advanced field of sequence analysis with tremendous scope in omics technologies. Such sequence analysis is not only a fine finishing tool for wet-lab results but also offers a cradle of
References
Allen TC, Cagle PT (2009) Bioinformatics and omics. In: Basic concepts of molecular pathology.
Springer, Boston, pp 55–60
Ambrosini G, Groux R, Bucher P (2018) PWMScan: a fast tool for scanning entire genomes with a
position-specific weight matrix. Bioinformatics 34(14):2483–2484
Ardui S, Ameur A, Vermeesch JR, Hestand MS (2018) Single molecule real-time (SMRT)
sequencing comes of age: applications and utilities for medical diagnostics. Nucleic Acids
Res 46(5):2159–2168
Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS (2009)
MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37(suppl_2):W202–
W208
Bailey TL, Johnson J, Grant CE, Noble WS (2015) The MEME suite. Nucleic Acids Res 43(W1):
W39–W49
Begum F, Ghosh D, Tseng GC, Feingold E (2012) Comprehensive literature review and statistical
considerations for GWAS meta-analysis. Nucleic Acids Res 40(9):3777–3784
Budd A (2012) Introduction to genome biology: features, processes, and structures. In: Evolution-
ary genomics. Humana Press, Totowa, pp 3–49
Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, Klingenhoff A, Frisch M, Bayerlein M,
Werner T (2005) MatInspector and beyond: promoter analysis based on transcription factor
binding sites. Bioinformatics 21(13):2933–2942
Chatterjee S, Ahituv N (2017) Gene regulatory elements, major drivers of human disease. Annu
Rev Genomics Hum Genet 18:45–63
Defrance M, Sand O, Van Helden J (2008) Using RSAT oligo-analysis and dyad-analysis tools to
discover regulatory signals in nucleic sequences. Nat Protoc 3(10):1589
Frith MC, Hansen U, Weng Z (2001) Detection of cis-element clusters in higher eukaryotic DNA.
Bioinformatics 17(10):878–889
Frith MC, Li MC, Weng Z (2003) Cluster-buster: finding dense clusters of motifs in DNA
sequences. Nucleic Acids Res 31(13):3666–3668
Gloss BS, Dinger ME (2018) Realizing the significance of noncoding functionality in clinical
genomics. Exp Mol Med 50(8):1–8
Hashim FA, Mabrouk MS, Atabany WA (2019) Comparative analysis of DNA motif discovery
algorithms: a systemic review. Curr Cancer Ther Rev 15(1):4–26
Hasin Y, Seldin M, Lusis A (2017) Multi-omics approaches to disease. Genome Biol 18(1):83
Herrmann C, Van de Sande B, Potier D, Aerts S (2012) i-cisTarget: an integrative genomics method
for the prediction of regulatory features and cis-regulatory modules. Nucleic Acids Res 40(15):
e114
Janky RS, van Helden J (2007) Discovery of conserved motifs in promoters of orthologous genes in
prokaryotes. In: Comparative genomics. Humana Press, Totowa, pp 293–308
Kato M, Tsunoda T (2007) MotifCombinator: a web-based tool to search for combinations of
cis-regulatory motifs. BMC Bioinf 8(1):100
Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, Lee R, Bessy A, Chèneby J,
Kulkarni SR, Tan G, Baranasic D (2018) JASPAR 2018: update of the open-access database of
transcription factor binding profiles and its web framework. Nucleic Acids Res 46(1):D260–
D266
Kiesel A, Roth C, Ge W, Wess M, Meier M, Söding J (2018) The BaMM web server for de-novo
motif discovery and regulatory sequence analysis. Nucleic Acids Res 46(W1):W215–W220
Kulakovskiy IV, Vorontsov IE, Yevshin IS, Sharipov RN, Fedorova AD, Rumynskiy EI,
Medvedeva YA, Magana-Mora A, Bajic VB, Papatsenko DA, Kolpakov FA (2018)
HOCOMOCO: towards a complete collection of transcription factor binding models for
human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res 46(D1):D252–D259
LaFramboise T (2009) Single nucleotide polymorphism arrays: a decade of biological, computa-
tional and technological advances. Nucleic Acids Res 37(13):4181–4193
Leporcq C, Spill Y, Balaramane D, Toussaint C, Weber M, Bardet AF (2020) TFmotifView: a
webserver for the visualization of transcription factor motifs in genomic regions. Nucleic Acids
Res 48:W208–W217
Liu ET, Pott S, Huss M (2010) Q&A: ChIP-seq technologies and the study of gene regulation. BMC Biol 8(1):56
Thomas-Chollier M, Sand O, Turatsinze JV, Janky RS, Defrance M, Vervisch E, Brohee S, van
Helden J (2008) RSAT: regulatory sequence analysis tools. Nucleic Acids Res 36(suppl_2):
W119–W127
Thomas-Chollier M, Defrance M, Medina-Rivera A, Sand O, Herrmann C, Thieffry D, van Helden
J (2011) RSAT 2011: regulatory sequence analysis tools. Nucleic Acids Res 39(suppl_2):W86–
W91
Thomas-Chollier M, Herrmann C, Defrance M, Sand O, Thieffry D, van Helden J (2012) RSAT
peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res 40(4):e31
Trindade LM, Van Berloo R, Fiers MW, Visser RG (2005) PRECISE: software for prediction of
cis-acting regulatory elements. J Hered 96(5):618–622
Turatsinze JV, Thomas-Chollier M, Defrance M, Van Helden J (2008) Using RSAT to scan genome
sequences for transcription factor binding sites and cis-regulatory modules. Nat Protoc 3
(10):1578
Van Helden J (2003) Regulatory sequence analysis tools. Nucleic Acids Res 31(13):3593–3596
Van Helden J, André B, Collado-Vides J (1998) Extracting regulatory sites from the upstream
region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 281
(5):827–842
Van Helden J, André B, Collado-Vides J (2000a) A web site for the computational analysis of yeast
regulatory sequences. Yeast 16(2):177–187
Van Helden J, Rios AF, Collado-Vides J (2000b) Discovering regulatory elements in non-coding
sequences by analysis of spaced dyads. Nucleic Acids Res 28(8):1808–1818
Worsley-Hunt R, Bernard V, Wasserman WW (2011) Identification of cis-regulatory sequence
variations in individual genome sequences. Genome Med 3(10):65
An Overview of Bioinformatics Resources
for SNP Analysis 7
Sudarkodi Sukumar, Arunika Krishnan, and Subhamoy Banerjee
Abstract
Keywords
Single nucleotide polymorphisms · Variants · SNPs · Bioinformatics · SNP
Analysis
Abbreviations
bp Base pair
CNV Copy Number Variations
cSNPs coding region SNPs
GWAS Genome Wide Association Studies
ncSNPs non-coding region SNPs
NGS Next Generation Sequencing Technology
nsSNPs non-synonymous SNPs
QTL Quantitative Trait Loci
SNPs Single Nucleotide Polymorphisms
SNVs Single Nucleotide Variants
sSNPs Synonymous SNPs
STR Short Tandem Repeats
SVM Support Vector Machine
UTR Untranslated region
VCF Variant Call Format
VNTRs Variable Number of Tandem Repeats
WGS Whole Genome Sequences
7.1 Introduction
Genetic variations are of different types such as genomic structural variations, single
nucleotide polymorphisms (SNPs), insertions/deletions or indels, block
substitutions, inversions, variable number of tandem repeats (VNTRs), and copy
number variations (CNVs). The most common and abundant form of genetic variation in the human genome is the SNP (Collins et al. 1998). SNPs are single nucleotide substitutions that occur in at least 1% of a population. Substitutions can be either transitions (purine ↔ purine or pyrimidine ↔ pyrimidine) or transversions (purine ↔ pyrimidine), although transitions are the most common (Wang et al. 1998). In the human genome, a SNP occurs on average about every 1000 base pairs (bp) (Taillon-Miller et al. 1998), and SNPs are responsible for 90% of genomic variation between individual humans (Brookes 1999). Although SNPs are mostly biallelic, there is good evidence for triallelic (Hodgkinson and Eyre-Walker 2010) as well as tetra-allelic SNPs (Phillips et al. 2015). SNPs can occur in the
coding region or in the non-coding region of a genome. Based on their likely effect
on proteins in the form of amino acid variants, the coding region SNPs (cSNPs) can
be further classified into synonymous SNPs (sSNPs) and non-synonymous SNPs
(nsSNPs) (Fig. 7.1).
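As a toy illustration of these classifications, the short sketch below labels a substitution as a transition or transversion and, for a coding change, as synonymous or non-synonymous using a small excerpt of the standard codon table. The example codons and substitutions are invented for illustration.

```python
# Minimal sketch of the SNP classifications described above. Codon translations
# follow the standard genetic code; the example codons/substitutions are made up.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Small excerpt of the standard codon table (only the codons used below).
CODON_TABLE = {"CTT": "Leu", "CTC": "Leu", "CAT": "His"}

def substitution_type(ref, alt):
    """Transition: purine<->purine or pyrimidine<->pyrimidine; otherwise transversion."""
    same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"

def coding_effect(codon, pos, alt):
    """Classify a coding SNP as synonymous (sSNP) or non-synonymous (nsSNP)."""
    mutated = codon[:pos] + alt + codon[pos + 1:]
    same_aa = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous (sSNP)" if same_aa else "non-synonymous (nsSNP)"

print(substitution_type("C", "T"))    # both pyrimidines -> transition
print(coding_effect("CTT", 2, "C"))   # CTT -> CTC (Leu -> Leu): synonymous
print(coding_effect("CTT", 1, "A"))   # CTT -> CAT (Leu -> His): non-synonymous
```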
Synonymous SNPs are substitutions that do not change the amino acid sequence of the translated protein owing to the degeneracy of the genetic code. Hence, sSNPs were long referred to as silent mutations, on the assumption that they have no downstream effects on the protein. However, studies have shown that sSNPs can indeed affect protein conformation and function through different mechanisms (Hunt et al. 2009) and are even associated with human diseases (Brest et al. 2011; Nackley et al. 2006). Interestingly, sSNPs have also found use in population genetic analysis on par with nsSNPs (Gutacker et al. 2006).
response between individuals (Peterson et al. 2013; Ahmed et al. 2016). Such SNPs are as crucial in infectious diseases as in metabolic diseases (Wang et al. 2018; Nogales and Dediego 2019), including the Covid-19 pandemic (Hou et al. 2020). It is therefore indispensable to account for these SNPs in order to offer personalized diagnostics as well as treatment options to combat disease.
It is essential to note that none of these diverse applications of SNPs would be
possible without advancements in technology that would aid in the identification,
prediction and validation of SNPs. Undoubtedly, developments in bioinformatics are
indispensable for the study of SNPs. This chapter provides a comprehensive view of the latest bioinformatics resources and software tools for SNP analysis, not just for the human genome but for the genomes of other organisms as well.
platforms, and some are available on Windows/Mac as well. There are also reviews that guide researchers in the selection of SNP calling methods and software, both in general (Altmann et al. 2012) and for specific organisms (Olson et al. 2015). Still, for experimental biologists with limited computational expertise and facilities, there are several simplified options for variant calling. GALAXY is one such genome analysis platform that includes variant calling (Blankenberg et al. 2010) and provides automated workflows. PATRIC (https://www.patricbrc.org) provides a genome analysis environment exclusively for microbes (Wattam et al. 2017); it offers the user multiple options for alignment and variant calling, enabling the construction of a simple custom SNP calling pipeline. Apart from these non-commercial open-source tools, commercial sequencing vendors offer all-in-one packages that include variant lists along with the sequencing results, for example the NGS pipeline CASAVA from Illumina. Experimentalists should nevertheless have a basic idea of the advantages and limitations of the various alignment and SNP calling algorithms in order to obtain reliable and meaningful results. A minimal command-line pipeline of this kind is sketched below.
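The following is a minimal sketch of such a pipeline driven from Python, assuming that bwa, samtools and bcftools are installed and that the reference and read files named here exist; the exact options may need adjustment for particular tool versions and datasets.

```python
import subprocess

# Minimal sketch of a read-alignment and SNP-calling pipeline, assuming
# bwa, samtools and bcftools are installed; file names are placeholders.
REF = "reference.fasta"
READS = ["sample_R1.fastq.gz", "sample_R2.fastq.gz"]

def run(cmd):
    print("running:", cmd)
    subprocess.run(cmd, shell=True, check=True)

# 1. Index the reference and align the reads (BWA-MEM), then sort and index the BAM.
run(f"bwa index {REF}")
run(f"bwa mem {REF} {READS[0]} {READS[1]} | samtools sort -o sample.sorted.bam -")
run("samtools index sample.sorted.bam")

# 2. Pile up and call variants with bcftools; keep SNPs only.
run(f"bcftools mpileup -f {REF} sample.sorted.bam | "
    "bcftools call -mv -Ov -o sample.variants.vcf")
run("bcftools view -v snps sample.variants.vcf -o sample.snps.vcf")
```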
correlate and examine a variation in its genomic context. The data from dbSNP can be downloaded in variant call format (VCF) (Danecek et al. 2011) (https://vcftools.github.io), a generic format developed specifically for storing sequence variation. Each SNP is given a unique accession ID called the reference SNP cluster ID, or rsID, and SNPs are identified by rsID not just within dbSNP but across different databases. Thus, dbSNP is undoubtedly the single largest and most useful resource for SNP data analysis. However, since 2017 dbSNP no longer accepts non-human SNP data submissions; researchers interested in non-human variation data may instead use the European Variation Archive (https://www.ebi.ac.uk/eva), a resource of the European Bioinformatics Institute that is supported by its own variant browser. The sketch below shows how the fixed VCF columns and rsIDs can be read programmatically.
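As a small illustration, the sketch below reads the fixed columns of an uncompressed VCF file and collects the records that carry a dbSNP rsID; the file name is a placeholder, and a production analysis would normally use a dedicated VCF library rather than hand parsing.

```python
# Minimal sketch: read the fixed columns of an (uncompressed) VCF file and
# collect variants that carry a dbSNP rsID. The file name is a placeholder.

def read_rsid_variants(path):
    variants = []
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):          # skip meta-information and header lines
                continue
            chrom, pos, vid, ref, alt = line.rstrip("\n").split("\t")[:5]
            if vid.startswith("rs"):          # dbSNP reference SNP cluster ID
                variants.append({"chrom": chrom, "pos": int(pos),
                                 "rsid": vid, "ref": ref, "alt": alt})
    return variants

if __name__ == "__main__":
    for v in read_rsid_variants("chr21.subset.vcf"):
        print(v["rsid"], v["chrom"], v["pos"], v["ref"], ">", v["alt"])
```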
The 1000 Genomes Project (https://www.internationalgenome.org/) (Auton et al. 2015; Sudmant et al. 2015) is an important initiative to catalogue human genetic variation on a massive scale by sequencing a large number of people from different ethnicities. The data from this project are freely available to the scientific community through various databases and, as mentioned earlier, are integrated with dbSNP. The selection of ethnic populations for this project was based on the samples of the HapMap project, a repository of human haplotype data (a haplotype being a cluster of SNPs that occur together on a chromosome). As the 1000 Genomes Project gained momentum in human population genetics and genomics research, the HapMap project was discontinued.
Apart from general SNP databases like dbSNP, many researchers have created specialized databases to suit their needs and applications. The Human Gene Mutation Database (HGMD) (Stenson et al. 2020) (http://www.hgmd.cf.ac.uk) specifically curates, from journal publications, germline mutations in nuclear genes that may cause human disease. It is available as a free public version for registered academic and non-profit institutions, updated only twice a year, whereas the HGMD Professional version, available through a subscription from QIAGEN, is updated quarterly.
The Human Genetic Variation Database (HGVD) is a reference database for genetic variation observed in a Japanese sample population (http://www.hgvd.genome.med.kyoto-u.ac.jp) (Higasa et al. 2016). The data are collated from exome sequencing of 1208 Japanese individuals performed at five research institutes in Japan. The Ethnic National Database Operating software (ETHNOS) is a useful tool that provides a platform for establishing National and Ethnic Mutation Databases (NEMDBs) (van Baal et al. 2010); NEMDBs for the Israeli, Tunisian and Egyptian populations were established with this tool. These NEMDBs, together with core databases such as OMIM, provide opportunities to create more specialized databases such as FINDbase. FINDbase (Frequency of INherited Disorders database) is an online resource collating information on the frequencies of genomic variants that act as pharmacogenomic biomarkers or cause inherited disorders (http://www.findbase.org/) (Kounelis et al. 2020).
A genome-wide association study (GWAS) is the approach of choice for detecting SNPs whose frequencies differ between diseased and healthy individuals for different diseases, although its execution poses challenges that depend on disease complexity. GWAS are thus instrumental in identifying and correlating the role of SNPs in various diseases. The database of Genotypes and Phenotypes (dbGaP) at NCBI is a repository of data on the interaction of genotypes and phenotypes collated from various genome-wide association studies (https://www.ncbi.nlm.nih.gov/gap). The GWAS Catalog is another central repository of information curated from the GWAS literature; it was started in 2008 by the NHGRI and has been maintained in collaboration with the EBI since 2010 (https://www.ebi.ac.uk/gwas/) (Buniello et al. 2019). GWAS Central is yet another repository that provides summaries of GWAS findings curated from the literature as well as documented from public domain projects (www.gwascentral.org) (Beck et al. 2020).
Ethical concerns about the sharing of human data became an important area of attention during the execution of the Human Genome Project. Today, with sequencing and variation data being handled and shared by a large number of databases worldwide, the concern over ethical integrity is even greater. In this regard, the Global Alliance for Genomics and Health (GA4GH) was established in 2013 to set standards for the secure and responsible sharing of human genomics data, paving the way for uninterrupted human health research (https://www.ga4gh.org/). It is a non-profit global organization comprising members from 500 research organizations spread across 71 countries. While GA4GH focuses on the needs of researchers, the Human Variome Project (HVP) (https://www.humanvariomeproject.org/) (Cotton et al. 2008), established much earlier, is a larger consortium and an official partner of UNESCO; spanning 81 countries, it focuses on clinical needs and lays out guidelines for handling genomic variation data, with a direct impact on disease management. Adherence to such initiatives will sustain the trust and fruitful collaboration of researchers across the globe in human genetics research. While waiting for such initiatives to gain momentum and support from the concerned community, researchers who wish to share their data in a secure environment may use the services of Café Variome (https://www.cafevariome.org/), which creates a network of clinicians and researchers who mutually trust each other to share, and receive input on, their genetic data.
The Leiden Open (source) Variation Database (LOVD) software was developed to support hassle-free creation of variation databases (https://www.lovd.nl/) (Fokkema et al. 2011). Researchers interested in creating and maintaining locus-specific sequence variation databases can use this freely available, platform-independent software. It is essential that researchers follow the nomenclature recommended by the HGVS when documenting sequence variants; the Mutalyzer tool offers a quick way to check sequence variant descriptions against the HGVS nomenclature (https://www.mutalyzer.nl) (Wildeman et al. 2008).
The human genome sequencing project amplified, in parallel, sequencing efforts in various other organisms important to humans, which in turn produced enormous amounts of sequence variation data and a multitude of non-human variation databases. The HGVS website lists a number of such non-human variation databases, such as OMIA (Online Mendelian Inheritance in Animals) (https://omia.org/); the list at HGVS gives only a few examples, as the full landscape of non-human variation databases is nearly as extensive as that for human variation. The Genome-based Mycobacterium tuberculosis variation database (GMTV)
Tools that can discern the impact of nsSNPs on protein stability, and their effect on interactions with other molecules, are needed to understand the role of such SNPs at the level of the whole cell, be it a cancer cell, an infectious bacterium or even a virus. Advanced tools that exploit complete structural features are therefore indispensable for understanding the mechanisms by which SNPs cause phenotypic changes. Structure-based methods quantify mutational effects either with potential energy functions or with machine learning approaches (Table 7.2). A schematic sketch of the machine learning route is given below.
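As a purely schematic illustration of the machine learning route (not the actual model of any of the tools discussed below), the sketch trains a regressor on structure-derived features to output a ΔΔG value; the features and training data are random placeholders standing in for curated measurements such as those in ProTherm.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Schematic sketch of machine-learning-based stability prediction: a regressor
# maps structure-derived features of a mutation to a ddG value. Features and
# training data are random placeholders, not real thermodynamic measurements.
rng = np.random.default_rng(0)
n_mutations, n_features = 200, 6   # e.g. burial, volume change, hydrophobicity change ...
X_train = rng.normal(size=(n_mutations, n_features))
y_train = rng.normal(size=n_mutations)           # placeholder ddG values (kcal/mol)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

new_mutation = rng.normal(size=(1, n_features))  # features of a candidate nsSNP
ddg = model.predict(new_mutation)[0]
# Note: sign conventions for "destabilizing" differ between published tools.
print(f"predicted ddG: {ddg:.2f} kcal/mol ({'destabilizing' if ddg > 0 else 'stabilizing'})")
```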
PolyPhen-2 (Polymorphism Phenotyping v2), the successor of PolyPhen, is a combined tool that uses sequence homology and three-dimensional structural features to predict the impact of amino acid substitutions on protein structure and function. Functional significance is predicted with a Naïve Bayes classifier, a supervised machine learning method, and the results are reported as benign, possibly damaging or probably damaging (Adzhubei et al. 2010).
SDM (Site Directed Mutator) uses a statistical potential energy function to score the effect of SNPs on protein stability and also predicts the disease-causing propensity of SNPs. SDM specifically uses environment-specific substitution tables (ESSTs) that store probability values for amino acid substitutions (Worth et al. 2011).
PoPMuSiC 2.1 is a free web server that predicts changes in protein thermodynamic stability upon mutation. It can predict the stability changes a protein may undergo for all possible amino acid mutations, at a computational speed that is unique among its counterparts and with better accuracy as well (Dehouck et al. 2011).
mCSM (mutation Cutoff Scanning Matrix) uses specific signatures such as graph-based interatomic distances, pharmacophore changes and experimental conditions to predict the impact of single point mutations on protein stability as well as changes in affinity towards other proteins and nucleic acid complexes. mCSM has been elaborated into several individual modules, such as mCSM-lig for studying protein affinity to small molecules and mCSM-NA for studying nucleic acid interactions (Pires et al. 2014a).
DUET is a consensus tool that combines the results of mCSM and SDM to predict the effect of nsSNPs on protein stability. DUET is claimed to be more powerful than the individual tools, as it optimizes the prediction by combining both sets of results in a support vector machine (Pires et al. 2014b).
MAESTRO is a versatile tool that predicts stability changes upon protein mutation and supports several other applications. MAESTRO uses an ensemble machine learning model, comprising multiple linear regression, a neural network and an SVM, to compute ΔΔG values (the difference in folding free energy between wild type and mutant) along with a confidence metric for each prediction. ProTherm, a thermodynamics database for proteins and mutants, was used to train and validate the model, and the results of MAESTRO were comparable to those of PoPMuSiC and mCSM (Laimer et al. 2015).
DynaMut analyses changes in protein dynamics caused by mutations. A consensus prediction is obtained by combining normal mode analysis (NMA) approaches with graph-based signatures to assess the effect of mutations on protein stability (Rodrigues et al. 2018).
References
Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS,
Sunyaev SR (2010) A method and server for predicting damaging missense mutations. Nat
Methods 7(4):248–249. NIH Public Access. https://doi.org/10.1038/nmeth0410-248
Ahmed S, Zhou Z, Zhou J, Chen SQ (2016) Pharmacogenomics of drug metabolizing enzymes and
transporters: relevance to precision medicine. Genomics Proteomics Bioinformatics 14(5):298–313. https://doi.org/10.1016/j.gpb.2016.03.008
Altmann A, Weber P, Bader D, Preuß M, Binder EB, Müller-Myhsok B (2012) A beginners guide
to SNP calling from high-throughput DNA-sequencing data. Hum Genet 131(10):1541–1554.
https://doi.org/10.1007/s00439-012-1213-z
Auton A, Abecasis GR, Altshuler DM, Durbin RM, Bentley DR, Chakravarti A, Clark AG,
Donnelly P, Eichler EE, Flicek P, Gabriel SB, Gibbs RA, Green ED, Hurles ME, Knoppers
BM, Korbel JO, Lander ES, Lee C, Lehrach H, Schloss JA (2015) A global reference for human
genetic variation. Nature 526(7571):68–74. https://doi.org/10.1038/nature15393
Baniecki ML, Faust AL, Schaffner SF, Park DJ, Galinsky K, Daniels RF, Hamilton E, Ferreira MU,
Karunaweera ND, Serre D, Zimmerman PA, Sá JM, Wellems TE, Musset L, Legrand E,
Melnikov A, Neafsey DE, Volkman SK, Wirth DF, Sabeti PC (2015) Development of a single
nucleotide polymorphism barcode to genotype plasmodium vivax infections. PLOS Neglected
Trop Dis 9(3):e0003539. https://doi.org/10.1371/journal.pntd.0003539
Beck T, Shorter T, Brookes AJ (2020) GWAS Central: a comprehensive resource for the discovery
and comparison of genotype and phenotype data from genome-wide association studies. Nucl
Acids Res 48(D1):D933–D940. https://doi.org/10.1093/nar/gkz895
Beres SB, Richter EW, Nagiec MJ, Sumby P, Porcella SF, DeLeo FR, Musser JM (2006) Molecular
genetic anatomy of inter- and intraserotype variation in the human bacterial pathogen group A
Streptococcus. Proc Natl Acad Sci USA 103(18):7059–7064. https://doi.org/10.1073/pnas.
0510279103
Berger J, Suzuki T, Senti KA, Stubbs J, Schaffner G, Dickson BJ (2001) Genetic mapping with SNP
markers in Drosophila. Nat Genet 29(4):475–481. https://doi.org/10.1038/ng773
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor
J (2010) Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol Chapter 19:Unit 19.10. https://doi.org/10.1002/0471142727.mb1910s89
Brest P, Lapaquette P, Souidi M, Lebrigand K, Cesaro A, Vouret-Craviari V, Mari B, Barbry P,
Mosnier JF, Hébuterne X, Harel-Bellan A, Mograbi B, Darfeuille-Michaud A, Hofman P (2011)
A synonymous variant in IRGM alters a binding site for miR-196 and causes deregulation of
IRGM-dependent xenophagy in Crohn’s disease. Nat Genet 43(3):242–245. https://doi.org/10.
1038/ng.762
Bromberg Y, Rost B (2007) SNAP: predict effect of non-synonymous polymorphisms on function.
Nucl Acids Res 35(11):3823–3835. https://doi.org/10.1093/nar/gkm238
Brookes AJ (1999) The essence of SNPs. Gene 234(2):177–186. https://doi.org/10.1016/S0378-
1119(99)00219-X
Budowle B, Van Daal A (2008) Forensically relevant SNP classes. BioTechniques 44(5):603–610.
Future Science Ltd London, UK. https://doi.org/10.2144/000112806
Buniello A, Macarthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, McMahon A,
Morales J, Mountjoy E, Sollis E, Suveges D, Vrousgou O, Whetzel PL, Amode R, Guillen
JA, Riat HS, Trevanion SJ, Hall P, Junkins H, Parkinson H (2019) The NHGRI-EBI GWAS
catalog of published genome-wide association studies, targeted arrays and summary statistics
2019. Nucl Acids Res 47(D1):D1005–D1012. https://doi.org/10.1093/nar/gky1120
Capriotti E, Altman RB, Bromberg Y (2013) Collective judgment predicts disease-associated single
nucleotide variants. BMC Genomics 14 Suppl 3(Suppl 3):S2. https://doi.org/10.1186/1471-
2164-14-s3-s2
Carter H, Douville C, Stenson PD, Cooper DN, Karchin R (2013) Identifying Mendelian disease
genes with the variant effect scoring tool. BMC Genomics 14 Suppl 3(Suppl 3):S3. https://doi.
org/10.1186/1471-2164-14-s3-s3
Chernyaeva EN, Shulgina MV, Rotkevich MS, Dobrynin PV, Simonov SA, Shitikov EA, Ischenko
DS, Karpova IY, Kostryukova ES, Ilina EN, Govorun VM, Zhuravlev VY, Manicheva OA,
Yablonsky PK, Isaeva YD, Nosova EY, Mokrousov IV, Vyazovaya AA, Narvskaya OV,
O’Brien SJ (2014) Genome-wide Mycobacterium tuberculosis variation (GMTV) database: a
new tool for integrating sequence variations and epidemiology. BMC Genomics 15(1):308.
https://doi.org/10.1186/1471-2164-15-308
Choi Y, Sims GE, Murphy S, Miller JR, Chan AP (2012) Predicting the functional effect of amino
acid substitutions and indels. PLoS ONE 7(10):e46688. https://doi.org/10.1371/journal.pone.
0046688
Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM (2012) A
program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff:
SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6(2):80–92.
https://doi.org/10.4161/fly.19695
Coll F, McNerney R, Guerra-Assunção JA, Glynn JR, Perdigão J, Viveiros M, Portugal I, Pain A,
Martin N, Clark TG (2014) A robust SNP barcode for typing Mycobacterium tuberculosis
complex strains. Nat Commun 5. https://doi.org/10.1038/ncomms5812
Collins FS, Brooks LD, Chakravarti A (1998) A DNA polymorphism discovery resource for
research on human genetic variation. Genome Res 8(12):1229–1231. https://doi.org/10.1101/
gr.8.12.1229
Collins FS, Green ED, Guttmacher AE, Guyer MS (2003) A vision for the future of genomics
research. Nature 422(6934):835–847. https://doi.org/10.1038/nature01626
Cotton RGH, Auerbach AD, Axton M, Barash CI, Berkovic SF, Brookes AJ, Burn J, Cutting G,
Den Dunnen JT, Flicek P, Freimer N, Greenblatt MS, Howard HJ, Katz M, Macrae FA,
Maglott D, Möslein G, Povey S, Ramesar RS, Watson M (2008) Genetics: The human variome
project. Science 322(5903):861–862. https://doi.org/10.1126/science.1167363
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G,
Marth GT, Sherry ST, McVean G, Durbin R (2011) The variant call format and VCFtools.
Bioinformatics 27(15):2156–2158. https://doi.org/10.1093/bioinformatics/btr330
Dehouck Y, Kwasigroch JM, Gilis D, Rooman M (2011) PoPMuSiC 2.1: A web server for the
estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinf
12:151. https://doi.org/10.1186/1471-2105-12-151
Dekkers CMJ (2012) Application of genomics tools to animal breeding. Curr Genomics 13
(3):207–212. https://doi.org/10.2174/138920212800543057
den Dunnen JT, Dalgleish R, Maglott DR, Hart RK, Greenblatt MS, Mcgowan-Jordan J, Roux AF,
Smith T, Antonarakis SE, Taschner PEM (2016) HGVS recommendations for the description of
sequence variants: 2016 update. Human Mutat 37(6):564–569. https://doi.org/10.1002/humu.
22981
Doddamani D, Khan AW, Katta MAVSK, Agarwal G, Thudi M, Ruperao P, Edwards D, Varshney
RK (2015) CicArVarDB: SNP and InDel database for advancing genetics research and breeding
applications in chickpea. Database 2015:78. https://doi.org/10.1093/database/bav078
Fokkema IFAC, Taschner PEM, Schaafsma GCP, Celli J, Laros JFJ, den Dunnen JT (2011) LOVD
v.2.0: The next generation in gene variant databases. Human Mutat 32(5):557–563. https://doi.
org/10.1002/humu.21438
Garrison E, Marth G (2016) Haplotype-based variant detection from short-read sequencing. Nat
Genet 48(6):593–599
Gutacker MM, Mathema B, Soini H, Shashkina E, Kreiswirth BN, Graviss EA, Musser JM (2006)
Single-nucleotide polymorphism–based population genetic analysis of Mycobacterium tuber-
culosis strains from 4 geographic sites. J Infect Dis 193(1):121–128. https://doi.org/10.1086/
498574
Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA (2005) Online Mendelian
inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucl
Acids Res 33(Database Iss):D514–D517. https://doi.org/10.1093/nar/gki033
Higasa K, Miyake N, Yoshimura J, Okamura K, Niihori T, Saitsu H, Doi K, Shimizu M,
Nakabayashi K, Aoki Y, Tsurusaki Y, Morishita S, Kawaguchi T, Migita O, Nakayama K,
Nakashima M, Mitsui J, Narahara M, Hayashi K, Matsuda F (2016) Human genetic variation
database, a reference database of genetic variations in the Japanese population. J Human Genet
61(6):547–553. https://doi.org/10.1038/jhg.2016.12
Hinrichs AS, Raney BJ, Speir ML, Rhead B, Casper J, Karolchik D, Kuhn RM, Rosenbloom KR,
Zweig AS, Haussler D, Kent WJ (2016) UCSC data integrator and variant annotation integrator.
Bioinformatics 32(9):1430–1432. https://doi.org/10.1093/bioinformatics/btv766
Hodgkinson A, Eyre-Walker A (2010) Human triallelic sites: evidence for a new mutational
mechanism? Genetics 184(1):233–241. https://doi.org/10.1534/genetics.109.110510
Hou Y, Zhao J, Martin W, Kallianpur A, Chung MK, Jehi L, Sharifi N, Erzurum S, Eng C, Cheng F
(2020) New insights into genetic susceptibility of COVID-19: an ACE2 and TMPRSS2
polymorphism analysis. BMC Med 18(1):216. https://doi.org/10.1186/s12916-020-01673-z
Hunt R, Sauna ZE, Ambudkar SV, Gottesman MM, Kimchi-Sarfaty C (2009) Silent (synonymous)
SNPs: should we care about them? Methods Mol Biol (Clifton, N.J.) 578:23–39. Humana Press,
Totowa, NJ. https://doi.org/10.1007/978-1-60327-411-1_2
Ioannidis NM, Rothstein JH, Pejaver V, Middha S, McDonnell SK, Baheti S, Musolf A, Li Q,
Holzinger E, Karyadi D, Cannon-Albright LA, Teerlink CC, Stanford JL, Isaacs WB, Xu J,
Cooney KA, Lange EM, Schleutker J, Carpten JD, Sieh W (2016) REVEL: an ensemble method
for predicting the pathogenicity of rare missense variants. Am J Human Genet 99(4):877–885.
https://doi.org/10.1016/j.ajhg.2016.08.016
Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson
RK, Ding L (2009) VarScan: variant detection in massively parallel sequencing of individual
and pooled samples. Bioinformatics 25(17):2283–2285. https://doi.org/10.1093/bioinformatics/
btp373
Kounelis F, Kanterakis A, Kanavos A, Pandi M, Kordou Z, Manusama O, Vonitsanos G, Katsila T,
Tsermpini E, Lauschke VM, Koromina M, Spek PJ, Patrinos GP (2020) Documentation of
clinically relevant genomic biomarker allele frequencies in the next-generation FINDbase
worldwide database. Human Mutat 41(6):1112–1122. https://doi.org/10.1002/humu.24018
Krawczak M, Ball EV, Fenton I, Stenson PD, Abeysinghe S, Thomas N, Cooper DN (2000) Human
gene mutation database - a biomedical information and research resource. Human Mutat 15
(1):45–51. https://doi.org/10.1002/(SICI)1098-1004(200001)15:1<45::AID-HUMU10>3.0.
CO;2-T
Kryukov GV, Pennacchio LA, Sunyaev SR (2007) Most rare missense alleles are deleterious in
humans: implications for complex disease and association studies. Am J Human Genet 80
(4):727–739. https://doi.org/10.1086/513473
Kwok P-Y, Chen X (2003) Detection of single nucleotide polymorphisms. Curr Issues Mol Biol 5:43–60
Lai K, Duran C, Berkman PJ, Lorenc MT, Stiller J, Manoli S, Hayden MJ, Forrest KL, Fleury D,
Baumann U, Zander M, Mason AS, Batley J, Edwards D (2012) Single nucleotide polymor-
phism discovery from wheat next-generation sequence data. Plant Biotechnol J 10(6):743–749.
https://doi.org/10.1111/j.1467-7652.2012.00718.x
Laimer J, Hofer H, Fritz M, Wegenkittl S, Lackner P (2015) MAESTRO - multi agent stability
prediction upon point mutations. BMC Bioinf 16(1):116. https://doi.org/10.1186/s12859-015-
0548-6
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M,
Fitzhugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, Levine R,
McEwan P, Morgan MJ (2001) Initial sequencing and analysis of the human genome. Nature
409(6822):860–921. https://doi.org/10.1038/35057062
Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR (2014)
ClinVar: Public archive of relationships among sequence variation and human phenotype.
Nucl Acids Res 42(D1):D980. https://doi.org/10.1093/nar/gkt1113
Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9
(4):357–359. https://doi.org/10.1038/nmeth.1923
Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of
short DNA sequences to the human genome. Genome Biol 10(3):25. https://doi.org/10.1186/gb-
2009-10-3-r25
Lappalainen I, Lopez J, Skipper L, Hefferon T, Spalding JD, Garner J, Chen C, Maguire M,
Corbett M, Zhou G, Paschall J, Ananiev V, Flicek P, Church DM (2013) DbVar and DGVa:
Public archives for genomic structural variation. Nucl Acids Res 41(D1):D936. https://doi.org/
10.1093/nar/gks1213
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics 25(14):1754–1760. https://doi.org/10.1093/bioinformatics/btp324
Li H, Ruan J, Durbin R (2008) Mapping short DNA sequencing reads and calling variants using
mapping quality scores. Genome Res 18(11):1851–1858. https://doi.org/10.1101/gr.078212.
108
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R
(2009a) The sequence alignment/map format and SAMtools. Bioinformatics 25
(16):2078–2079. https://doi.org/10.1093/bioinformatics/btp352
Li W, Raoult D, Fournier PE (2009b) Bacterial strain typing in the genomic era. FEMS Microbiol
Rev 33(5):892–916. Oxford Academic. https://doi.org/10.1111/j.1574-6976.2009.00182.x
Liu X, Jian X, Boerwinkle E (2011) dbNSFP: a lightweight database of human nonsynonymous
SNPs and their functional predictions. Human Mutat 32(8):894–899. https://doi.org/10.1002/
humu.21517
Lunter G, Goodson M (2011) Stampy: a statistical algorithm for sensitive and fast mapping of
Illumina sequence reads. Genome Res 21(6):936–939. https://doi.org/10.1101/gr.111120.110
Mansur YA, Rojano E, Ranea JAG, Perkins JR (2018) Analyzing the effects of genetic variation in
noncoding genomic regions. In: Precision medicine: tools and quantitative approaches. Elsevier
Inc, pp 119–144. https://doi.org/10.1016/B978-0-12-805364-5.00007-X
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K,
Altshuler D, Gabriel S, Daly M, DePristo MA (2010) The genome analysis toolkit: a
MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20
(9):1297–1303. https://doi.org/10.1101/gr.107524.110
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, Flicek P, Cunningham F (2016)
The ensembl variant effect predictor. Genome Biol 17(1):122. https://doi.org/10.1186/s13059-
016-0974-4
Md V, Misra S, Li H, Aluru S (2019) Efficient architecture-aware acceleration of BWA-MEM for
multicore systems. In: Proceedings - 2019 IEEE 33rd International Parallel and Distributed
Processing Symposium, IPDPS 2019, pp 314–324. https://doi.org/10.1109/IPDPS.2019.00041
Moran JL, Bolton AD, Tran PV, Brown A, Dwyer ND, Manning DK, Bjork BC, Li C,
Montgomery K, Siepka SM, Vitaterna MH, Takahashi JS, Wiltshire T, Kwiatkowski DJ,
Kucherlapati R, Beier DR (2006) Utilization of a whole genome SNP panel for efficient genetic
mapping in the mouse. Genome Res 16(3):436–440. https://doi.org/10.1101/gr.4563306
Nackley AG, Shabalina SA, Tchivileva IE, Satterfield K, Korchynskyi O, Makarov SS, Maixner W,
Diatchenko L (2006) Human catechol-O-methyltransferase haplotypes modulate protein expres-
sion by altering mRNA secondary structure. Science 314(5807):1930–1933. https://doi.org/10.
1126/science.1131262
Ng PC, Henikoff S (2003) SIFT: predicting amino acid changes that affect protein function. Nucl
Acids Res 31(13):3812–3814. https://doi.org/10.1093/nar/gkg509
Nielsen R, Paul JS, Albrechtsen A, Song YS (2011) Genotype and SNP calling from next-
generation sequencing data. Nat Rev Genet 12(6):443–451. Nature Publishing Group. https://
doi.org/10.1038/nrg2986
polymorphism database for map-based cloning of rice genes. Plant Physiol 135(3):1198–1205.
https://doi.org/10.1104/pp.103.038463
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K (2001) DbSNP:
the NCBI database of genetic variation. Nucl Acids Res 29(1):308–311. https://doi.org/10.1093/
nar/29.1.308
Shihab HA, Gough J, Cooper DN, Stenson PD, Barker GLA, Edwards KJ, Day INM, Gaunt TR
(2013) Predicting the functional, molecular, and phenotypic consequences of amino acid
substitutions using hidden markov models. Human Mutat 34(1):57–65. https://doi.org/10.
1002/humu.22225
Sobrino B, Brión M, Carracedo A (2005) SNPs in forensic genetics: a review on SNP typing
methodologies. Forensic Sci Int 154(2–3):181–194. https://doi.org/10.1016/j.forsciint.2004.10.
020
Stenson PD, Mort M, Ball EV, Chapman M, Evans K, Azevedo L, Hayden M, Heywood S, Millar
DS, Phillips AD, Cooper DN (2020) The human gene mutation database (HGMD®): optimizing
its use in a clinical diagnostic or research setting. Human Genet. Springer. https://doi.org/10.
1007/s00439-020-02199-3
Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K,
Jun G, Fritz MHY, Konkel MK, Malhotra A, Stütz AM, Shi X, Casale FP, Chen J,
Hormozdiari F, Dayama G, Chen K, Korbel JO (2015) An integrated map of structural variation
in 2,504 human genomes. Nature 526(7571):75–81. https://doi.org/10.1038/nature15394
Taillon-Miller P, Gu Z, Li Q, Hillier LD, Kwok PY (1998) Overlapping genomic sequences: a
treasure trove of single-nucleotide polymorphisms. Genome Res 8(7):748–754. https://doi.org/
10.1101/gr.8.7.748
Tan K-K, Tan Y-C, Chang L-Y, Lee KW, Nore SS, Yee W-Y, Mat Isa MN, Jafar FL, Hoh C-C,
AbuBakar S (2015) Full genome SNP-based phylogenetic analysis reveals the origin and global
spread of Brucella melitensis. BMC Genomics 16(1):93. https://doi.org/10.1186/s12864-015-
1294-x
Tang H, Thomas PD (2016) PANTHER-PSEP: predicting disease-causing genetic variants using
position-specific evolutionary preservation. Bioinformatics 32(14):2230–2232. https://doi.org/
10.1093/bioinformatics/btw222
van Baal S, Zlotogora J, Lagoumintzis G, Gkantouna V, Tzimas I, Poulas K, Tsakalidis A,
Romeo G, Patrinos GP (2010) ETHNOS: a versatile electronic tool for the development and
curation of national genetic databases. Human Genomics 4(5):361–368. https://doi.org/10.1186/
1479-7364-4-5-361
Vihinen M (2014) Variation Ontology for annotation of variation effects and mechanisms. Genome
Res 24(2):356–364. https://doi.org/10.1101/gr.157495.113
Wang DG, Fan JB, Siao CJ, Berno A, Young P, Sapolsky R, Ghandour G, Perkins N, Winchester E,
Spencer J, Kruglyak L, Stein L, Hsie L, Topaloglou T, Hubbell E, Robinson E, Mittmann M,
Morris MS, Shen N, Lander ES (1998) Large-scale identification, mapping, and genotyping of
single- nucleotide polymorphisms in the human genome. Science 280(5366):1077–1082.
https://doi.org/10.1126/science.280.5366.1077
Wang K, Li M, Hakonarson H (2010) ANNOVAR: functional annotation of genetic variants from
high-throughput sequencing data. Nucl Acids Res 38(16):e164. https://doi.org/10.1093/nar/
gkq603
Wang Y, Zhang M-M, Huang W-W, Wu S-Q, Wang M-G, Tang X-Y, Sandford AJ, He J-Q (2018)
Polymorphisms in toll-like receptor 10 and tuberculosis susceptibility: evidence from three
independent series. Front Immunol 9:309. https://doi.org/10.3389/fimmu.2018.00309
Wang Z, Moult J (2001) SNPs, protein structure, and disease. Human Mutat 17(4):263–270. https://
doi.org/10.1002/humu.22
Wattam AR, Davis JJ, Assaf R, Boisvert S, Brettin T, Bun C, Conrad N, Dietrich EM, Disz T,
Gabbard JL, Gerdes S, Henry CS, Kenyon RW, Machi D, Mao C, Nordberg EK, Olsen GJ,
Murphy-Olson DE, Olson R, Stevens RL (2017) Improvements to PATRIC, the all-bacterial
bioinformatics database and analysis resource center. Nucl Acids Res 45(D1):D535–D542.
https://doi.org/10.1093/nar/gkw1017
Wildeman M, Van Ophuizen E, Den Dunnen JT, Taschner PEM (2008) Improving sequence
variant descriptions in mutation databases and literature using the mutalyzer sequence variation
nomenclature checker. Human Mutat 29(1):6–13. https://doi.org/10.1002/humu.20654
Wilkening S, Tekkedil MM, Lin G, Fritsch ES, Wei W, Gagneur J, Lazinski DW, Camilli A,
Steinmetz LM (2013) Genotyping 1000 yeast strains by next-generation sequencing. BMC
Genomics 14(1):90. https://doi.org/10.1186/1471-2164-14-90
Worth CL, Preissner R, Blundell TL (2011) SDM - A server for predicting effects of mutations on
protein stability and malfunction. Nucl Acids Res 39(SUPPL. 2):W215–W222. https://doi.org/
10.1093/nar/gkr363
Xu C (2018) A review of somatic single nucleotide variant calling algorithms for next-generation
sequencing data. Comput Struct Biotechnol J 16:15–24. https://doi.org/10.1016/j.csbj.2018.01.003
Yandell M, Huff C, Hu H, Singleton M, Moore B, Xing J, Jorde LB, Reese MG (2011) A
probabilistic disease-gene finder for personal genomes. Genome Res 21(9):1529–1542.
https://doi.org/10.1101/gr.123158.111
Zhang F, Lupski JR (2015) Non-coding genetic variants in human disease. Human Mol Genet 24
(R1):R102–R110. Oxford University Press. https://doi.org/10.1093/hmg/ddv259
Vaccine Design and Immunoinformatics
8
Fariya Khan and Ajay Kumar
Abstract
Keywords
Immunoinformatics · Algorithm · Epitopes · T cell · B cell · Vaccine candidate ·
Molecular docking and simulation
8.1 Introduction
The rapid emergence of infectious diseases across all geographical ranges leads to disease outbreaks and thus drives the demand for advanced research and development in the public health sector. Vaccine development has become a pressing global need, not least because of growing drug resistance among infectious agents, and advanced methods are therefore required to protect public health (Pahil et al. 2017). Immunoinformatics examines infectious diseases through their mode of pathogenesis, genome information, the spread of outbreak strains, and structural modeling (Li et al. 2014). It helps identify the best vaccine candidates using computational models, without the traditional analysis that involves isolating and cultivating infectious pathogens (Fig. 8.1).
This chapter on immunoinformatics illustrates for researchers the use of advanced computer-based technology in the development of vaccines that could be effective against hypervariable pathogens. Epitope-based vaccines have become a key contributor to the shift away from the traditional vaccine design concept and towards strong long-term immunity (Tomar and De 2010). The application of new and updated tools screens a set of immunogenic antigens from a pool of genes, so that only promising epitopes are forwarded to the more rigorous assessment of binding capability (Khan et al. 2018). Screening of epitopes is the first and a very important parameter in vaccine design, and different tools are available for it (Xiang and He 2009; Irini and Darren 2007). Candidate epitopes should not be allergenic or toxic, should be highly antigenic and adhesive, and the source antigen should be stable, soluble and appropriately localized (Table 8.1).
Fig. 8.1 Flowchart representing the immunoinformatics approaches for vaccine designing
The first step in vaccine design is the selection of immunodominant epitopes from a large pool of antigens using different immunoinformatics tools. T cell epitopes can only be recognized when they are bound to MHC molecules, and tools built around these parameters have therefore been employed to predict putative T cell epitopes (Oyarzun and Kobe 2015). The accuracy of MHC-I binding epitope prediction is estimated at 90–95%, which is very effective for wide allele coverage, and several servers are available. The NetCTL 1.2 server is used to find cytotoxic T-lymphocyte (CTL) epitopes in pathogen proteins (Larsen et al. 2007). It can characterize epitopes that bind to 12 supertypes (A1, A2, A3, A24, A26, B7, B8, B27, B39, B44, B58, and B62) and works with artificial neural networks, SMM, and scoring matrices. NetMHCpan 4.1 is the most recent version of a server likewise employed for the quantitative prediction of peptides binding to MHC alleles using artificial neural networks (ANNs). This method predicts epitopes on the basis of quantitative binding affinity (BA) together with mass spectrometry eluted ligand (EL) data (Birkir et al. 2020). Peptides are classified by a threshold on the predicted rank: 0.5 for strong binding epitopes and 2 for weak binding epitopes; a toy example of this filtering step is shown below.
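As a toy illustration of this filtering step, the snippet below classifies a few peptides by an assumed percentile-rank output using the 0.5/2 thresholds; the rank values are invented and do not come from an actual NetMHCpan run.

```python
# Toy illustration of filtering predicted epitopes by percentile rank using the
# thresholds mentioned above (0.5 strong binder, 2 weak binder). Ranks are invented.

predictions = {
    "KLMDVVAAL": 0.12,
    "SLYNTVATL": 0.45,
    "GILGFVFTL": 1.60,
    "RAKFKQLLQ": 7.30,
}

def classify(rank, strong=0.5, weak=2.0):
    if rank <= strong:
        return "strong binder"
    if rank <= weak:
        return "weak binder"
    return "non-binder"

for peptide, rank in predictions.items():
    print(f"{peptide}\trank={rank}\t{classify(rank)}")
```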
Epitopes can also be analysed with the IEDB analysis tools, which use the stabilized matrix method (SMM) for identification of MHC-I binding alleles (Vita et al. 2018). Here, peptides with an IC50 of less than 22 nM were considered strong, high-affinity binders, and the analysis takes into account the MHC-I binding score, proteasomal cleavage, and TAP score. Based on these criteria, the best vaccine candidates are selected for conservancy analysis (Table 8.2).
The ProPred I server is widely used for MHC class I binding epitopes; it is based on a matrix method that facilitates prediction of the binding regions in an antigen for a total of 47 MHC-I molecules (Singh and Raghava 2003). This method uses quantitative analysis to investigate the high-affinity interaction of antigens with MHC class I alleles. Although prediction rates are higher for MHC-I methods, there are also tools for identifying epitopes that bind MHC class II alleles. The MHC2Pred tool predicts dominant epitopes that bind class II MHC alleles; it is based on an SVM method, with an estimated accuracy of about 80% for 42 alleles. MHCPred 2.0 predicts epitopes that bind both MHC class I and II alleles and generates Quantitative Structure-Activity Relationship (QSAR) models (Guan et al. 2003); it can also predict Transporter associated with antigen Processing (TAP) scores for a peptide. The performance of this tool was assessed by 5-fold cross-validation, and epitopes with an IC50 below 500 nM were considered binders.
Another immunoinformatics server, ProPred, is used to locate antigenic regions in a protein sequence and is based on a quantitative matrix method (Singh and Raghava 2001). The SVMHC server also allows prediction of T cell epitopes for both class I and class II MHC alleles, although it is mostly used for MHC-II binding prediction, including the effects of single nucleotide polymorphisms (SNPs) (Pierre and Oliver 2006).
B cell epitopes provide long-term immunity and can be exploited against several diseases, but most B cell epitopes are not continuous in sequence, so their prediction differs substantially from T cell epitope prediction (Zobayer et al. 2019). The literature indicates that continuous epitopes are easier to predict than discontinuous ones, and different tools have been designed around these two cases (Krawczyk et al. 2014). The identification of continuous B cell epitopes is based largely on physicochemical characteristics such as charge, hydrophilicity, polarity, flexibility, and secondary structure. The BcePred tool evaluates B cell epitopes on the basis of their physicochemical properties, with a reported accuracy of 58.70% at a threshold of 2.38 (Saha and Raghava 2004). The ABCPred server identifies potential epitopes with an artificial neural network, has an accuracy of 65.93%, and presents the results in tabular as well as graphical form (Saha and Raghava 2006). BepiPred is also used to predict B cell epitopes from a protein sequence using a random forest algorithm and is a very valuable tool for such analyses (Jespersen et al. 2017). A generic sliding-window propensity scan, the basic computation behind these physicochemical predictors, is sketched below.
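The following sketch performs a generic sliding-window propensity scan of the kind that underlies such physicochemical predictors; the per-residue scale, window length, cut-off and toy antigen sequence are illustrative placeholders and are not the parameters of BcePred or any other published tool.

```python
# Generic sliding-window propensity scan, the basic computation behind
# physicochemical linear B-cell epitope prediction. The per-residue scale
# below is an illustrative placeholder, not the scale of any real tool.

SCALE = {aa: score for aa, score in zip(
    "ACDEFGHIKLMNPQRSTVWY",
    [0.2, 0.1, 1.2, 1.1, -0.9, 0.5, 0.6, -1.0, 1.3, -1.1,
     -0.7, 0.9, 0.7, 1.0, 1.2, 0.4, 0.3, -0.8, -1.2, -0.5])}

def window_scores(sequence, window=7):
    """Average propensity over a sliding window centred on each residue."""
    half = window // 2
    scores = []
    for i in range(half, len(sequence) - half):
        segment = sequence[i - half:i + half + 1]
        scores.append((i, sum(SCALE[aa] for aa in segment) / window))
    return scores

antigen = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy antigen sequence
for pos, score in window_scores(antigen):
    if score > 0.5:                              # arbitrary cut-off for the example
        print(f"epitope-like region around residue {pos + 1}: score {score:.2f}")
```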
For discontinuous B cell epitopes, DiscoTope is considered a novel method that evaluates surface accessibility and amino acid propensity scores using protein three-dimensional structures (Jens Vindahl et al. 2012). B cell epitopes have a significant role in disease treatment as well as other immune therapies, and the ElliPro tool is used to identify discontinuous epitopes on the basis of 3D structure.
A major obstacle in designing a vaccine construct is understanding the functional and structural properties of the proteins involved. To address this, different computational servers have been developed for homology modeling of epitopes (Kaur et al. 2007), and several bioinformatics tools are employed to build three-dimensional models of vaccine targets and so retrieve complete information about the proteins (Aurelien et al. 2011). Modeller is the most popular tool for comparative modeling of proteins; it generates five models by aligning the target sequence against a template structure (Fiser et al. 2002). Phyre2 is used to build three-dimensional models using hidden Markov models (HMMs) as well as to identify ligand binding sites (Kelley et al. 2015). The 3D coordinate files of HLA alleles can be taken from the IPD-IMGT/HLA Database (Robinson et al. 2015). RaptorX is used to analyse the tertiary and secondary structure and functional information of the binding regions of proteins (Morten et al. 2012).
Table 8.4 List of different computational tools for modeling of epitopes and alleles
Tool | Link | Prediction mode
Modeller 9.19 | https://salilab.org/modeller/9.19/release.html | 3D modeling of proteins
PEPstrMOD | https://webs.iiitd.edu.in/raghava/pepstrmod/ | 3D modeling of epitopes
RaptorX | http://raptorx.uchicago.edu/ | 3D modeling of proteins
SWISS-MODEL | https://swissmodel.expasy.org/ | 3D modeling of proteins
I-TASSER | https://zhanglab.ccmb.med.umich.edu/I-TASSER/ | Predicts protein structure and functions
PyMOL | https://pymol.org/2/ | Molecular visualization of proteins and epitopes
Chimera 1.12 | http://www.rbvi.ucsf.edu/chimera/download.html | Molecular visualization of proteins and epitopes
Phyre2 | http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index | 3D modeling of proteins
ERRAT | https://servicesn.mbi.ucla.edu/ERRAT/ | Validation of 3D modeled structure
RAMPAGE | http://services.mbi.ucla.edu/SAVES/Ramachandran/ | Validation of 3D modeled structure
PROCHECK | https://www.ebi.ac.uk/thornton-srv/software/PROCHECK/ | Validation of 3D modeled structure
IPD-IMGT/HLA | https://www.ebi.ac.uk/ipd/imgt/hla/ | Search sequence of alleles
The designed model can then be validated with servers that provide a validation score and assess the quality of the model; servers used for validation include RAMPAGE, PROCHECK, and ERRAT. RAMPAGE performs Ramachandran plot analysis, which reads structural information of proteins by calculating torsional angles. PROCHECK also analyses model quality on the basis of stereochemistry, comparing the model against well-defined high-resolution structures (Laskowski et al. 1993) (Table 8.4).
The binding capability of epitopes to HLA alleles should be strong binder and highly
promiscuous in nature. Therefore, the combination of multiple epitopes with the help
of linkers and adjuvant will make a tremendous effect on immunity of an individual
(Fig. 8.2).
To make an effective multi-epitope vaccine, it is crucial to characterize the key antigens and epitopes that have the greatest ability to trigger immunity. Immunoinformatics tools can predict B and T cell epitopes of the targeted strain, and these epitopes are then merged with the help of linkers and a suitable adjuvant to increase the immune response (Zhang 2018). Computational algorithms therefore allow the prediction of peptides with high antigenicity scores, MHC binding scores, solubility, and epitope conservancy, and with coverage of a large proportion of the population (Nezafat et al. 2014). Linkers play a unique role in combining all the CTL, HTL, and B cell epitopes with an adjuvant, providing better flexibility and protein folding and thus making the protein more stable. Based on previous reports, the most commonly used linkers are EAAAK, AAY, and GPGPG (Pandey et al. 2018).
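To make the linker arrangement concrete, the sketch below assembles a hypothetical multi-epitope construct string in the order adjuvant, EAAAK linker, CTL epitopes joined by AAY, then HTL and B-cell epitopes joined by GPGPG; every sequence is an invented placeholder, not a validated epitope or adjuvant.

```python
# Illustrative assembly of a multi-epitope construct; all sequences below are
# hypothetical placeholders, not real adjuvants or validated epitopes.
adjuvant     = "MHHHHHHADVSEQ"                 # placeholder adjuvant sequence
ctl_epitopes = ["YLQPRTFLL", "KIADYNYKL"]      # placeholder CTL (MHC class I) epitopes
htl_epitopes = ["ISNSWLNDKVNSLLS"]             # placeholder HTL (MHC class II) epitope
bcl_epitopes = ["NITNLCPFGEVFNAT"]             # placeholder linear B-cell epitope

construct = (adjuvant
             + "EAAAK" + "AAY".join(ctl_epitopes)     # rigid linker, then CTL block
             + "GPGPG" + "GPGPG".join(htl_epitopes)   # GPGPG linkers for HTL epitopes
             + "GPGPG" + "GPGPG".join(bcl_epitopes))  # and for B-cell epitopes

print(f"construct length: {len(construct)} residues")
print(construct)
```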
Table 8.5 List of different tools for docking and molecular simulation studies
(Server name | Link | Prediction mode)
AutoDock 4.2 | http://autodock.scripps.edu/downloads | Docking of epitopes and alleles
AutoDock Vina | http://vina.scripps.edu/ | Docking of epitopes and alleles
PatchDock | https://bioinfo3d.cs.tau.ac.il/PatchDock/php.php | Docking of epitopes and alleles
SwissDock | http://www.swissdock.ch/docking | Docking of epitopes and alleles
NAMD | https://www.ks.uiuc.edu/Research/namd/ | Molecular simulation
CHARMM | https://www.charmm.org/ | Molecular simulation
MDWeb | https://mmb.irbbarcelona.org/MDWeb/ | Molecular simulation
AutoDock is used for molecular docking as well as for drug design. AutoDock Vina is the more recent, updated version for molecular docking (Trott and Olson 2010). It uses structure files of the molecules in PDBQT format, which are generated with the help of MGLTools, and writes its output as a PDBQT file. AutoDock Vina is an advanced tool that offers better performance and is faster than AutoDock 4. EpiDOCK is also used for molecular docking, mainly for MHC class II predictions, and includes 23 alleles for generating its output. It accepts an input protein sequence in single-letter code and translates it into nonamers, for which scores are generated in the output file (Atanasova et al. 2013; Patronov et al. 2011).
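A hedged sketch of a typical AutoDock Vina run driven from Python is shown below; the receptor and ligand PDBQT file names, grid-box center, and box size are placeholders that would come from MGLTools preparation and inspection of the HLA binding groove.

```python
# Sketch of running AutoDock Vina from Python; "receptor.pdbqt", "epitope.pdbqt"
# and the box parameters are placeholders, and "vina" must be on the PATH.
import subprocess

cmd = [
    "vina",
    "--receptor", "receptor.pdbqt",
    "--ligand", "epitope.pdbqt",
    "--center_x", "10.0", "--center_y", "12.5", "--center_z", "-4.0",
    "--size_x", "24", "--size_y", "24", "--size_z", "24",
    "--exhaustiveness", "8",
    "--out", "docked_poses.pdbqt",
]
subprocess.run(cmd, check=True)   # writes ranked poses with predicted affinities (kcal/mol)
```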
Along with docking, molecular dynamics simulation makes it possible to assess the stability of the complexes formed by the designed vaccine candidates, as it is important to know the molecular details of the molecules and their stability in water over a certain period of time (Xu and Zhang 2012). The NAMD tool, used together with VMD (Visual Molecular Dynamics), is designed especially for high-performance simulation of molecules (James et al. 2005; Humphrey et al. 1996). It generates its results as DCD trajectory files by retrieving the starting information from a PDB file.
MDWeb provides a user-friendly environment for molecular dynamics simulation by analyzing the trajectory input file of a molecule with the help of the GROMACS package (Adam et al. 2012). It generates results in the form of RMSD and radius-of-gyration plots, which allow the user to identify the regions with the greatest mobility and the regions that remain stable during the simulation. With the help of these plots, the stability of the selected vaccine candidate can be assessed, and the candidate can then be put forward for experimental confirmation (Table 8.5).
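The kind of RMSD and radius-of-gyration profiles described above can also be computed locally; the sketch below assumes the MDAnalysis package is installed and uses placeholder file names for a NAMD-style PDB/DCD pair.

```python
# Minimal sketch (assuming MDAnalysis >= 2.0) of extracting RMSD and radius-of-
# gyration profiles from a trajectory; file names are placeholders.
import MDAnalysis as mda
from MDAnalysis.analysis import rms

u = mda.Universe("complex.pdb", "complex.dcd")

rmsd_calc = rms.RMSD(u, select="backbone")   # backbone RMSD vs. the first frame
rmsd_calc.run()

protein = u.select_atoms("protein")
# results.rmsd columns: frame index, time, RMSD of the selection
for ts, (frame, time, rmsd_bb) in zip(u.trajectory, rmsd_calc.results.rmsd):
    print(f"frame {int(frame)}  RMSD {rmsd_bb:.2f} A  Rg {protein.radius_of_gyration():.2f} A")
```

Plotting these two series against time gives the same type of stability assessment that MDWeb reports.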
8.8 Conclusion
In silico analytical methods thus have a powerful impact on vaccine discovery, enabling the identification of novel antigens for a precise immune response. These methods have allowed researchers to strengthen broad-spectrum research on epitope-based vaccines, which has overshadowed traditional methods over the last few decades. Immunoinformatics has become a boon for medical science research on infectious diseases, cancer, personalized medicine, and allergy. Nevertheless, there are certain limitations of immunoinformatics approaches that cannot be neglected, particularly in handling real data, and predicted vaccine epitopes should therefore undergo experimental validation in animal models.
References
Adam H, Andrio P, Fenollosa C, Cicin-Sain D, Orozco M, Gelpi JL (2012) MDWeb and MDMoby:
an integrated web-based platform for molecular dynamics simulations. Bioinformatics 28
(9):1278–1279
Alessandro S, Rino R (2010) Reverse vaccinology: developing vaccines in the era of genomics.
Immunity 33(4):530–541
Angus NO, Obialor WO, Ifeanyichukwu MO, Odimegwu DC, Okoyeh JN, Emechebe GO,
Adejumo SA, Ibeanu GC (2020) Immunoinformatics and vaccine development: an overview.
Immunotargets Ther 9:13–30
Arafat RO, Pervin T, Mia M, Hossain M, Shahnaij M, Mahmud S, Kaderi Kibria KM (2017)
Vaccinomics approach for designing potential peptide vaccine by targeting Shigella spp. Serine
protease autotransporter subfamily protein SigA. J Immunol Res. https://doi.org/10.1155/2017/
6412353
Atanasova M, Dimitrov I, Flower DR, Doytchinova I (2013) EpiDOCK: a molecular docking-based
tool for MHC class II binding prediction. Protein Eng Des Sel 26(10):631–634
Aurelien G, Zoete V, Michielin O (2011) SwissDock, a protein-small molecule docking web service
based on EADock DSS. Nucl Acids Res 39:W270–W277
Backert L, Kohlbacher O (2015) Immunoinformatics and epitope prediction in the age of genomic
medicine. Genome Med 7(1):119
Birkir R, Alvarez B, Paul S, Peters B, Nielsen M (2020) NetMHCpan-4.1 and NetMHCIIpan-4.0:
improved predictions of MHC antigen presentation by concurrent motif deconvolution and
integration of MS MHC eluted ligand data. Nucl Acids Res. https://doi.org/10.1093/nar/
gkaa379
Clarisa B P-d-S, Soares I d S, Rosa DS (2018) Editorial: epitope discovery and synthetic vaccine
design. Front Immunol 9:826
Duhovny D, Nussinov R, Wolfson HJ (2002) Efficient unbound docking of rigid molecules. In: Proceedings of the 2nd Workshop on Algorithms in Bioinformatics (WABI), Rome, Italy. Lecture Notes in Computer Science, vol 2452. Springer, pp 185–200
Fiser A, Feig M, Brooks CL, Sali A (2002) Evolution and physics in comparative protein structure
modeling. Acc Chem Res. 35:413–421. https://doi.org/10.1021/ar010061h
Goodsell DS, Morris GM, Halliday RS, Huey R, Belew RK, Olson AJ (1998) Automated docking
using a Lamarckian genetic algorithm and empirical binding free energy function. J Comp
Chem 19:1639–1662
Guan P, Doytchinova IA, Zygouri C, Flower DR (2003) MHCPred: bringing a quantitative
dimension to the online prediction of MHC binding. Appl Bioinf 2:63–66
Hamrouni S, Bras-Gonçalves R, Kidar A, Aoun K, Chamakh-Ayari R, Petitdidier E, Messaoudi Y,
Pagniez J, Lemesre JL, Meddeb-Garnaoui A (2020) Design of multi-epitope peptides containing
HLA class-I and class-II-restricted epitopes derived from immunogenic Leishmania proteins,
and evaluation of CD4+ and CD8+ T cell responses induced in cured cutaneous leishmaniasis
subjects. PLoS Negl Trop Dis 14(3):e0008093
Huber Sietske R, van Beek J, de Jonge J, Luytjes W, van Baarle D (2014) T cell responses to viral
infections – opportunities for peptide vaccination. Front Immunol 5:171
Humphrey W, Dalke A, Schulten K (1996) VMD—visual molecular dynamics. J Mol Graphics
14:33–38
Irini AD, Darren RF (2007) VaxiJen: a server for prediction of protective antigens, tumour antigens
and subunit vaccines. BMC Bioinf 8:4
James CP, Braun R, Wang W, Gumbart J, Tajkhorshid E, Villa E, Chipot C, Skeel RD, Kalé L,
Schulten K (2005) Scalable molecular dynamics with NAMD. J Comp Chem 26
(16):1781–1802
Jens Vindahl K, Lundegaard C, Lund O, Nielsen M (2012) Reliable B cell epitope predictions:
impacts of method development and improved benchmarking. PLoS Comp Biol 8(12):
e1002829
Jensen KK, Andreatta M, Marcatili P, Buus S, Greenbaum JA, Yan Z, Sette A, Peters B, Nielsen M
(2018) Improved methods for predicting peptide binding affinity to MHC class II molecules.
Immunology 154(3):394–406
Jespersen MC, Peters B, Nielsen M, Marcatili P (2017) BepiPred-2.0: improving sequence-based
B-cell epitope prediction using conformational epitopes. Nucl Acids Res. https://doi.org/10.
1093/nar/gkx352
Kaur H, Garg A, Raghava GPS (2007) PEPstr: A de novo method for tertiary structure prediction of
small bioactive peptides. Protein Pept Lett 14(7):626–630
Kaur R, Arora N, Jamakhani MA, Malik S, Kumar P, Anjum F, Tripathi S, Mishra A, Prasad A
(2020) Development of multi-epitope chimeric vaccine against Taenia solium by exploring its
proteome: an in silico approach. Exp Rev Vaccines 19(1):105–114
Kelley L, Mezulis S, Yates C et al (2015) The Phyre2 web portal for protein modeling, prediction
and analysis. Nat Protocol 10(6):845–858
Khan F, Srivastava V, Kumar A (2017) Epitope-based peptides prediction from proteome of
Enterotoxigenic E coli. Int J Peptide Res Ther 24(2):323–336
Khan F, Srivastava V, Kumar A (2018) Computational identification and characterization of
potential T-Cell epitope for the utility of vaccine design against Enterotoxigenic Escherichia
coli. Int J Peptide Res Ther (Springer) 25:289–302
Krawczyk K, Liu X, Baker T, Shi J, Deane CM (2014) Improving B-cell epitope prediction and its
application to global antibody-antigen docking. Bioinformatics 30(16):2288–2294
Larsen MV, Lundegaard C, Lamberth K, Buus S, Lund O, Nielsen M (2007) Large-scale validation
of methods for cytotoxic T-lymphocyte epitope prediction. BMC Bioinf 8:424
Laskowski RA, MacArthur MW, Moss DS, Thornton JM (1993) PROCHECK - a program to check
the stereochemical quality of protein structures. J App Cryst 26:283–291
Li W, Joshi MD, Singhania S, Ramsey KH, Murthy AK (2014) Peptide vaccine: progress and
challenges. Vaccine 2(3):515–536
Lippolis JD et al (2002) Analysis of MHC class II antigen processing by quantitation of peptides
that constitute nested sets. J Immunol 169:5089–5097
Monterrubio-López GP, Ribas-Aparicio RM (2015) Identification of novel potential vaccine
candidates against tuberculosis based on reverse vaccinology. Biomed Res Int 12:1–16
Morris GM, Huey R, Lindstrom W, Sanner MF, Belew RK, Goodsell DS, Olson AJ (2009)
Autodock4 and AutoDockTools4: automated docking with selective receptor flexibility. J
Comput Chem 16:2785–2791
Morten K, Wang H, Wang S, Peng J, Wang Z, Lu H, Xu J (2012) Template-based protein structure
modeling using the RaptorX web server. Nat Protocols 7:1511–1522
Negi SS, Braun W (2009) Automated detection of conformational epitopes using phage display
peptide sequences. Bioinform Biol Insights 3:71–81
Nezafat N, Ghasemi Y, Javadi G, Khoshnoud MJ, Omidinia E (2014) A novel multi-epitope peptide
vaccine against cancer: an in silico approach. Theor Biol 349:121–134
Oyarzun P, Kobe B (2015) Computer-aided design of T-cell epitope-based vaccines: addressing
population coverage. Int J Immunogenet 42(5):313–321
Pahil S, Taneja N, Ansari HR, Raghava GPS (2017) In silico analysis to identify vaccine candidates
common to multiple serotypes of Shigella and evaluation of their immunogenicity. PLoS One
12:8
9 Computer-Aided Drug Designing

Abstract
The traditional methodology long used for novel drug discovery and development is an immensely challenging, multifaceted, and prolonged process. To overcome these limitations, a new advanced approach, referred to as computer-aided drug discovery (CADD), was developed and adopted over time. With the advancement of computational tools and methods, CADD has accelerated the traditionally time-consuming process of developing new drug entities. CADD methodologies have recently become a fundamental and indispensable tool at different junctures of drug discovery and development. Additionally, with the increasing availability and knowledge of biological macromolecule structures, as well as an exponential increase in computing power, it is now feasible to use these methods effectively without significant loss of accuracy. CADD has also paved the way for the screening of selected compounds and the synthesis of those entities for better therapeutics. CADD is therefore employed together with collective biological and chemical knowledge to rationalize lead design and optimization, and it can be used effectively at different stages of the discovery and development pipeline. Over the past decades, various CADD techniques such as homology modeling, docking, pharmacophore-modeling-based virtual screening, conformation generation, ab initio design, toxicity profiling, quantitative structure–activity relationships (QSAR), and quantitative free-energy calculation have been greatly improved. Current CADD methods are routinely utilized in academic and commercial research, and CADD has become an emerging field of interest in drug design and development.
This chapter aims to illustrate some crucial CADD techniques, also commonly referred to as in silico methods, across the diverse arena of drug discovery, focusing on some of the modern advancements.
Keywords
9.1 Introduction
The English word "drug" originated from the French word "drogue," meaning dry herb, which strongly suggests that the most primitive drugs were extracted from various plant sources (Rekker 1992; Wadud et al. 2007). A drug or medicine exerts a physiological effect when introduced into the human body. A drug may be a natural or synthetic substance that affects the structure and functioning of a living body and is used for the prevention, treatment, and diagnosis of a specific disease, resulting in the relief of discomfort. In pharmacology, a drug molecule is a chemical entity, other than an essential dietary ingredient, that produces a biological effect on a living system after administration. A medicine or pharmaceutical drug is used to cure or prevent disease. Traditionally, drugs have been acquired from medicinal plants, although they are now also obtained by organic synthesis. Pharmaceutical drugs are categorized into different drug classes: a group of related drugs that possess similar chemical structures, bind to the same biological target (revealing the same mechanism of action), show a correlated mode of action, and are used to cure the same disease (Izzo and Ernst 2001). The Anatomical Therapeutic Chemical (ATC) classification system is the most commonly used classification system, in which a specific ATC code is designated to each drug. Another broad classification system is the Biopharmaceutics Classification System, which categorizes drugs on the basis of their absorption or permeability properties and aqueous solubility (Lennernäs and Abrahamsson 2005). Broadly, any substance administered orally, injected subcutaneously, intramuscularly, or intravenously, or applied topically or to a body cavity to treat or prevent a disease or condition is termed a "drug."
Drug design is an inventive process for developing novel therapeutics directed at a chosen biological target; it is generally also termed rational drug design. Drug design has been a powerful development in medicinal chemistry, producing important and noteworthy therapeutic benefits. A drug is typically an organic substance that, once bound to its particular target site, may either stimulate or inhibit the function of a biological molecule or macromolecule, with therapeutic benefit as the outcome. The success of drug design is influenced by each stage of the drug discovery pipeline, the key phases of which are outlined below:
(a) Target identification: identification of drug targets, which are then examined for their function and for the biology associated with a specific disease or disorder (Anderson 2003).
(b) Target validation: the phase in which the association of the identified drug target with the disease is confirmed by biological assays that assess its capacity to regulate biological functions (Chen and Chen 2008).
(c) Lead identification: the discovery of a synthetic chemical moiety with a high degree of specificity, sensitivity, and potency against the identified drug target. The lead identified at this stage can be a drug-like candidate or a precursor to a drug-like entity (Kalyaanamoorthy and Chen 2011).
(d) Lead optimization: an important stage that involves improvement of the identified lead molecule through iterative cycles of evaluation of lead compound(s). Combined in vitro and in vivo experiments are carried out to filter the potential candidates and develop a safe, specific, and efficient drug. Structure–activity relationship (SAR) strategies are then applied to guide these optimization cycles.
The high budget of drug development is often ascribed to the fact that approximately 90% of the lead molecules entering clinical trials fail to obtain regulatory approval and reach the consumer market. An estimated 75% of the total cost of drug development is attributable to failures encountered along the drug discovery and development pipeline (Leelananda and Lindert 2016). Rapid developments in high-throughput screening (HTS) and combinatorial chemistry, which can screen thousands of probes robotically, have accelerated the drug discovery expedition. However, HTS is still expensive in terms of cost and experimental resources, which are not easily available in academic settings. Consequently, various computational methodologies have been developed to shorten the drug development cycle and to markedly reduce the expense and risk of failure in drug discovery and development.
The drug development procedure can be separated into two distinct phases. The preliminary drug discovery phase comprises drug target selection and, in parallel, identification of potential hit-to-lead compounds by in silico screening and/or high-throughput (virtual) screening approaches. This is followed by a lead optimization procedure to search for the best clinical candidate with increased affinity and optimized pharmacokinetic properties. The development of a new drug also includes a second phase, which is dedicated to determining the clinical utility of the candidate leads (Bleicher et al. 2003).
In the current post-genomic era, CADD tools have been utilized as a "virtual shortcut" to analyze, develop, and identify potent drug-like molecules. CADD is an interdisciplinary area in which numerous aspects of basic and applied research combine to cover most phases of drug discovery, from target identification and validation to optimization of the identified lead molecule. During the initial phase of the drug development procedure, researchers may face problems when no prior structure–activity relationship information is available (Kore et al. 2012). CADD also includes evaluation and optimization of the pharmacokinetic parameters of drug-like molecules, such as ADMET (absorption, distribution, metabolism, excretion, and toxicity), for safety reasons. Notably, the use of these computational tools and techniques in the arena of drug discovery and development is rapidly gaining popularity and is implemented frequently. Numerous terms are applied to this in silico domain, such as computational drug designing (CAD), computer-aided molecular modeling (CAMM), computer-aided drug design (CADD), computer-aided molecular design (CAMD), computer-aided rational drug discovery, rational drug design, and in silico drug design. The use of computational tools along with experimental findings plays a significant role in the novel drug development process and has trimmed down the total cost of drug discovery and development without a significant trade-off in overall accuracy.
Computer-aided drug design and development necessitate (Kapetanovic 2008):
Generally, the approaches utilized in CADD are classified into two categories: structure-based and ligand-based methods. The structure-based approach relies on the availability of the 3D structure of the target protein for the screening and identification of promising ligand molecules by calculating the interaction energies between the target and the compounds. In contrast, the ligand-based approach utilizes knowledge of active and inactive molecules with diverse chemical structures, as well as predictive models such as QSAR (quantitative structure–activity relationship) (Kalyaanamoorthy and Chen 2011). These models are further utilized for screening and identification of new chemical entities through large chemical database searches, a process called virtual screening. To summarize, the structure-based method is preferred when high-resolution 3D structural data of the drug target are accessible, and the ligand-based approach is ideal when no significant experimental structural information is available.
Structure-based drug design (SBDD) methods dock and evaluate ligands based on their interactions with the residues present in the binding/active site of the target protein (Grinter and Zou 2014; Lavecchia and Giovanni 2013). The acquisition of structural information is thus a core part of the SBDD hypothesis, which exploits a molecule's potential to form energetically favorable interactions with the target protein's binding site and the consequent biological outcome. Three-dimensional structural information on the drug target protein is therefore quintessential in SBDD; it is now routinely available through state-of-the-art technologies such as X-ray diffraction (XRD), solution nuclear magnetic resonance (NMR), and cryo-electron microscopy (cryo-EM), along with computational modeling techniques such as homology modeling and all-atom molecular dynamics (MD) simulations (Kalyaanamoorthy and Chen 2011; Lin et al. 2020; Patel et al. 2019). Packages used for binding site prediction, modeling, and homology modeling are summarized in Table 9.1. SBDD can be further divided into molecular docking and de novo ligand design approaches. Molecular docking uses the physicochemical properties of the binding site, such as hydrogen bonds, hydrophobic and electrostatic fields, and key residues, to determine whether a compound's molecular shape matches the binding site of the receptor with high binding affinity. When this approach is used to screen large databases or compound libraries, it is referred to as high-throughput virtual screening (VS) (Lavecchia and Giovanni 2013; Andricopulo et al. 2009). De novo ligand design, in contrast, utilizes information from the protein binding site to identify small fragments that align well with the molecular shape of the binding site and then links these fragments according to chemical connection rules to produce a structurally novel ligand moiety (Kutchukian and Shakhnovich 2010; Reker et al. 2014). Potential lead molecules identified by either approach are subsequently synthesized and evaluated for their biological activities.
The structure-based CADD (SB-CADD) approach depends on the capability to determine and examine 3D structures of biological entities. The underlying assumption of this method is that a compound's affinity for, and ability to bind, a given receptor/target while maintaining the preferred biological interactions depends on its competence to bind favorably to a specific interacting pocket on that target. Compounds that share these complementary interactions are expected to show similar biological effects. Consequently, novel molecules can be identified through accurate and careful investigation of the binding pocket of a specific protein. An SB-CADD project relies on prerequisite structural information about the target of interest. SB-CADD has emerged as a frequently used approach in drug discovery, thanks to improvements in the -omics era that have provided researchers with a huge repertoire of candidate drug targets (Bambini and Rappuoli 2009).
So far, many scientists and research groups all over the world have made enthusiastic efforts in the direction of virtual high-throughput ligand screening (VLS) and structure-based drug discovery (SBDD), as the most scientifically challenging and promising methodologies to identify the best lead for
Table 9.1 Tools and software packages used in various computational drug design
(Program/Server | Free/Commercial | Description | Website)

Binding/interaction site prediction:
CASTp | Free | Utilizes the alpha complex and weighted Delaunay triangulation for shape measurements | sts.bioe.uic.edu/castp/
Cavitator | Free | Grid-based geometric analysis for pocket prediction | sites.gatech.edu/cssb/cavitator/
ConCavity | Free | Uses evolutionary sequence and 3D structures | compbio.cs.princeton.edu/concavity/
eFindSite | Free | Uses a set of evolutionarily related proteins for predicting common ligand-binding sites | www.cct.lsu.edu/resources
SiteComp | Free | Uses molecular interaction fields for binding site comparison | sitecomp.sanchezlab.org
PocketFinder | Free (PyMOL plugin) | Utilizes shape descriptors for pocket identification | www.modeling.leeds.ac.uk/pocketfinder/
fpocket | Free (standalone program) | Uses alpha sphere theory | fpocket.sourceforge.net/
ProBis | Free | Local structural alignments | probis.cmm.ki.si/index.php
3DLigandSite | Free | Based on homologous structures for ligand-binding site prediction | www.sbg.bio.ic.ac.uk/~3dligandsite/
ConSurf | Free | Surface mapping | consurf.tau.ac.il/2016/

Docking:
AutoDock/AutoDock Vina | Free | Flexible side chains (genetic algorithm) | autodock.scripps.edu/
Adaptive BP-Dock | NA | Integrates perturbation response scanning (PRS) with a flexible docking protocol | (Bolia et al. 2014)
cDocker | Commercial | Uses side-chain flexibility at the atomic level with grid-based docking | accelrys.com/
Docking Server | Free/commercial | Integration of several computational chemistry tools | www.dockingserver.com/web
(continued)
pharmaceutical objectives (Bleicher et al. 2003; Foloppe et al. 2006; Klebe 2006; Miteva 2008). This structure-based approach to ligand identification also offers insight into the molecular interactions of protein–ligand complexes, permitting medicinal chemists to design precise chemical modifications around the skeleton/scaffold of the ligand (Kitchen 2017; Kitchen et al. 2004). An overview of the types of computational tools used in drug discovery, precision medicine, and chemical biology is shown in Fig. 9.2.
Where the drug target structure is unavailable, or structure prediction using approaches such as ab initio structure prediction or homology modeling is challenging, the alternative to SBDD is ligand-based drug design (LBDD). Tools for modeling packages and homology modeling are summarized in Table 9.1. LBDD depends on information about compounds that interact with the biological protein target of interest. The three-dimensional quantitative structure–activity relationship (3D-QSAR) described below is one of its most widely used techniques.
The most standard and validated methods for LBDD are (2D/3D) QSAR and pharmacophore modeling. Overall, QSAR is a computational technique for quantifying the relationship between a specific biological activity and the chemical/structural properties of a series of molecules. The underlying postulate of the QSAR approach is that related physicochemical or structural properties result in similar biological activity (Akamatsu 2002; Verma and Hansch 2009). First, a set of chemical or lead compounds showing the relevant biological activity is selected. A quantitative structure–activity relationship is then established between the physicochemical properties of the potential leads and the biological activity of interest. The generated QSAR model is further utilized to optimize the most active molecules so as to enhance the relevant biological activity. The predicted hits are then subjected to experimental testing for the desired activity. In conclusion, QSAR approaches can thus be utilized as a guiding strategy for the recognition of novel hits with modified features and improved biological activity.
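A minimal sketch of this workflow is given below, assuming RDKit and scikit-learn are available; the SMILES strings and pIC50 values are hypothetical and serve only to show how descriptors are computed and regressed against activity.

```python
# Hypothetical QSAR sketch: compute a few 2D descriptors for a small made-up
# series and fit them to invented pIC50 values with multiple linear regression.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.linear_model import LinearRegression

smiles = ["c1ccccc1O", "c1ccccc1Cl", "c1ccccc1C", "c1ccccc1N", "c1ccccc1C(=O)O"]
pic50  = np.array([4.2, 4.8, 4.5, 3.9, 4.1])        # hypothetical activities

def calc_descriptors(smi):
    mol = Chem.MolFromSmiles(smi)
    return [Descriptors.MolLogP(mol),                # hydrophobicity
            Descriptors.TPSA(mol),                   # polar surface area
            Descriptors.MolWt(mol)]                  # crude bulk/steric term

X = np.array([calc_descriptors(s) for s in smiles])
model = LinearRegression().fit(X, pic50)
print("coefficients:", model.coef_)
print("training r2 :", model.score(X, pic50))
```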
General components for developing QSAR models are as follows:
1. First, selection of a dataset of compounds: ideally the compounds belong to a congeneric series but are sufficiently chemically diverse to show a large variation in biological activity.
2. Second, identification and determination of the molecular descriptors associated with the diverse physicochemical and structural properties of the compounds.
3. Identification of the correlation between the 2D or 3D molecular descriptors and the corresponding biological activity that can explain the differences in activity within the dataset.
4. Investigation of the statistical reliability and predictive power of the derived QSAR model (Fig. 9.3).
Over the past two decades, the intellectual focal point (the center of gravity) of medicinal chemistry has shifted significantly from how to make the best compound to which compound to make. The present challenge is to assemble the knowledge needed to make choices about the use of various resources in drug design. The information contributing to drug design practice is increasingly quantitative, building upon modern improvements in molecular structure description, statistics, combinatorial mathematics, and computer simulation. Together, these fields have led to an innovative paradigm in drug design and development that has been denoted "quantitative structure–activity relationships." For over 40 years, the QSAR hypothesis has been part of the practice of pharmaceutical chemistry (Tropsha and Golbraikh 2007).
In 1868, Crum-Brown and Fraser described the idea that there is a relationship between chemical/molecular structure and the corresponding biological activity. One first formulates the "physiological activity," ϕ, as a function of the molecular/chemical structure C, as expressed in Eq. (9.1) (Rekker 1992):
ϕ = f(C)        (9.1)
A few decades later, in 1893 and around 1900, Richet and then Meyer and Overton independently correlated the activity of simple organic compounds with their water solubility, establishing a linear relationship with the oil–water partition coefficient, i.e. lipophilicity. In the 1930s, L. P. Hammett reported a correlation between the electronic characteristics of organic acids and bases and their equilibrium constants and reactivity (Hansch et al. 1991).
In 1969, Corwin Herman Hansch published a free-energy-related model to describe the relationship between biological activities and physicochemical properties (Hansch 1978). Moreover, Taft formulated a way to separate polar, resonance, and steric effects, leading to the first steric parameters (Hansch et al. 1991). Taft and Hammett together provided the mechanistic basis for the QSAR prototype given by Fujita and Hansch (Lombardo et al. 2000). They combined the hydrophobic constants with Hammett's electronic constants to obtain the linear Hansch equation and its numerous extended forms (Leo and Hansch 1999):
log 1/C = aσ + bπ + ck        (Linear form)  (9.2)
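As a toy illustration of Eq. (9.2), the snippet below evaluates the linear Hansch form for two substituents; the coefficients and substituent constants are hypothetical values chosen only to demonstrate the arithmetic.

```python
# Toy evaluation of the linear Hansch relationship of Eq. (9.2); the
# coefficients a, b, c and the substituent constants below are hypothetical.
def hansch_log_inv_c(sigma, pi, a=0.8, b=1.1, c=2.5):
    """log(1/C) = a*sigma + b*pi + c  (linear form)."""
    return a * sigma + b * pi + c

# Hypothetical Hammett sigma and hydrophobic pi constants for two substituents
for name, sigma, pi in [("R1", 0.23, 0.71), ("R2", -0.17, 0.56)]:
    print(name, round(hansch_log_inv_c(sigma, pi), 2))
```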
Principal component analysis (PCA) has been employed to resolve the issues of multiple linear regression (MLR) analysis by extracting the information carried by multiple, possibly redundant variables into a smaller number of uncorrelated variables. Hence, PCA offers a competent and effective way to reduce the number of independent variables used in QSAR model generation. This approach is of great help for datasets in which the number of molecular descriptors is comparable to the number of observations. However, the outcomes obtained from PCA can be difficult to interpret in terms of the original descriptors.
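A brief sketch of this descriptor-reduction step with scikit-learn is shown below; the random matrix stands in for a real descriptor table and is not meant to represent actual QSAR data.

```python
# Sketch of PCA-based descriptor reduction; the random matrix is a placeholder
# for a table of 50 compounds x 200 correlated descriptors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))                    # placeholder descriptor matrix

X_scaled = StandardScaler().fit_transform(X)      # put descriptors on a common scale
pca = PCA(n_components=10)                        # keep 10 orthogonal components
scores = pca.fit_transform(X_scaled)

print(scores.shape)                               # (50, 10): new, uncorrelated variables
print(pca.explained_variance_ratio_.sum())        # fraction of variance retained
```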
The quality of a QSAR model is generally assessed by two types of validation:
1. Internal validation
2. External validation
The most familiar and widely adopted internal validation process is the leave-one-out cross-validation method (Kohavi 1995). Briefly, in this approach one compound is tagged as validation data, with the remainder of the data tagged as the training set used to estimate the coefficients of the developed QSAR model. The biological activity of the left-out compound is then estimated using the model derived from the training-set compounds. The procedure is repeated for the remaining compounds until each one has been considered once as the test compound. Afterwards, the predictive ability and robustness of the derived model are evaluated from the cross-validated r², or q², computed from the following equation:
computed from the following mathematical equation:
P 2
ypred yobs
Q ¼1P
2
ðyobs ymean Þ2
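The leave-one-out procedure and the q² statistic defined above can be sketched as follows with scikit-learn; X and y are random placeholders for a descriptor matrix and measured activities.

```python
# Sketch of leave-one-out cross-validation and the q^2 statistic defined above;
# X and y are random placeholders for descriptors and activities.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = X @ np.array([0.5, -1.0, 0.3, 0.8]) + rng.normal(scale=0.2, size=30)

y_pred = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = model.predict(X[test_idx])

q2 = 1.0 - np.sum((y_pred - y) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"cross-validated q2 = {q2:.3f}")
```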
The major drawback of this procedure is that the time required to complete the calculation grows with the size of the training set. Another variation of this procedure is k-fold cross-validation: instead of leaving one molecule out at a time, this approach generates the training set by leaving out a group (fold) of compounds in each round.
Table 9.3 Tools/programs/software for calculating molecular descriptors or QSAR model analysis
(Program/Server | Free/Commercial | Description | Website)

QSAR:
McQSAR | Free | Generates multi-conformational QSAR using the genetic function approximation paradigm | users.abo.fi/mivainio/mcqsar/index.php
SYBYL-X | Commercial | Collection of QSAR packages (CoMFA, HQSAR, and Topomer CoMFA) | www.certara.com/pressrelease/certara-enhances-sybyl-x-drug-design-and-discovery-software-suite/
MOLFEAT | Free | Calculates molecular fingerprints and descriptors derived from curated QSAR models | jing.cz3.nus.edu.sg/cgi-bin/molfeat/molfeat.cgi
Open3DQSAR | Free | High-throughput chemometric analysis of molecular interaction fields (MIFs) | open3dqsar.sourceforge.net/
E-Dragon | Free | Calculates molecular descriptors to evaluate structure–activity or structure–property relationships for HTS screening | www.vcclab.org/lab/edragon/
In classical QSAR, the parameters employed fall into three broad categories:
i. Electronic
ii. Steric
iii. Hydrophobic
Since Hansch's seminal efforts, various QSAR methodologies have been generated. QSAR techniques may be distinguished by:
1. The kinds of structural parameters employed to characterize molecular identities, encompassing numerous representations of compounds, from simple chemical descriptors to 3D conformations.
2. The mathematical approach used to establish the statistical association between these structural parameters and the biological activities (Gonzalez et al. 2008).
Hansch Analysis
In 1969, Corwin Hansch applied the idea of linear free-energy relationships (LFER) to describe the effectiveness of biologically active compounds. This is one of the most promising approaches for quantifying and assessing the interaction of drug compounds with a biological system. It is also described as an extra-thermodynamic procedure that assumes additive contributions of the various substituents, through steric, electronic, hydrophobic, and dispersion effects, to the non-covalent interaction between a macromolecule (protein, enzyme, or receptor) and a drug compound. This approach relates the biological activity within a homologous series of molecules to a set of theoretical chemical parameters that define significant properties of the drug compound. Hansch demonstrated both linear and non-linear dependence of biological activity on various parameters.
Hansch suggested that the action of a drug molecule relies on two processes:
1. Transport from the point of entry into the body to the site of action of the drug, which involves passage through a series of membranes; it is therefore associated with the partition coefficient log P (lipophilicity) and can be described by random-walk theory.
2. Interaction with the target site, which in turn depends on the steric bulk of the substituent groups and the electron density on the attached groups.
Free-Wilson Analysis
The Free-Wilson method is a structure-based approach, since it integrates the contributions made by the various structural fragments to the overall biological activity (Schaper 1999). Indicator variables are used to denote the absence or presence of specific structural features.
9.3.4.2 3D-QSAR
The 3D-QSAR approach studies the quantitative relationship between the three-dimensional properties of a set of molecules and their respective biological activities by employing statistical correlation methods. 3D-QSAR uses three-dimensional characteristics, mostly the steric and electrostatic properties of compounds obtained by probe-based sampling within a molecular lattice, and establishes a correlation between these 3D descriptors and the respective biological activity (Deora et al. 2013).
Advancement in medicinal chemistry and drug development depends on the ability to decipher the molecular interactions of drug/lead compounds with their relevant biological targets/receptors. Conventional QSAR analysis describes biological activity in terms of the physicochemical properties of substituents at definite sites of the drug compounds. 3D-QSAR analysis, in contrast, applies different force-field functions that require the three-dimensional structures of a selected set of small compounds, i.e. a training set with known, reported biological activity. The selected training set must be aligned or superimposed using either experimental data from crystallographic protein–ligand complexes or a most-active-compound alignment approach. It employs the Lennard-Jones potential, a computed potential that is concerned with the complete compound rather than a single substituent. Cramer et al. named the first 3D-QSAR method CoMFA,
i.e. comparative molecular field analysis (Cramer et al. 1988). Recently other 3D
QSAR strategies have also been in use, for example, spectral structure–activity
relationship (S-SAR) (Putz and Lacrămă 2007), an adaptation of the fields for
molecular comparison (AFMoC) (Gohlke and Klebe 2002), Topomer CoMFA
(Cramer 2003), and comparative residue interaction analysis (CoRIA) (Datar
et al. 2006).
In CoMFA, differences in the biological activity of the selected molecules are associated with modifications in the electrostatic and steric fields of these molecules. The field values are then related to biological activity by PLS analysis (Chilton et al. 2017). Comparative molecular field analysis (CoMFA) is the typical, conventional 3D-QSAR technique and covers the complete procedure of drug discovery. Although CoMFA is remarkable for its high predictive power, the basic field-based data used by this approach can be degraded by noise. So far, various efforts have been made to improve the robustness of CoMFA models and their predictive accuracy by addressing numerous factors, including molecular alignment and conformation along with grid spacing (Kubinyi et al. 1998).
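The PLS step that links field values to activity can be sketched with scikit-learn as below; the field matrix is a random placeholder for the steric/electrostatic grid energies produced by a CoMFA-type calculation.

```python
# Sketch of the PLS regression step used in CoMFA-type 3D-QSAR; the "fields"
# matrix is a random placeholder for steric/electrostatic grid-probe energies.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
fields = rng.normal(size=(25, 500))          # 25 aligned molecules x 500 grid probes
activity = rng.normal(size=25)               # placeholder biological activities

pls = PLSRegression(n_components=3)          # few latent variables, as in CoMFA
pls.fit(fields, activity)
predicted = pls.predict(fields).ravel()
print("fitted r2:", pls.score(fields, activity))
```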
Application
Since the origin of CoMFA, various studies and applications of the methodology in many disciplines have been published. Several successful applications of the CoMFA method in the areas of agrochemistry (herbicides, pesticides, and insecticides), physical chemistry (capacity factors, partition coefficients, and chemical shifts), pharmacokinetics, toxicokinetics, and thermodynamics have been comprehensively appraised in many reviews (Bordás et al. 2003).
ship to offer more useful knowledge for the development of novel and potent drug-
like compounds/derivatives/analogues (Klebe and Abraham 1999).
ADMET profiling applies to drugs and other chemicals, such as pesticides, that may come into contact with or enter the body of humans or other life forms. It is very commonly used for the in silico assessment of pharmacokinetic and metabolic endpoints and in molecular modeling (Balani et al. 2005; Tetko et al. 2006). Several commercial and open-source software packages are readily available for ADMET analysis, as mentioned in Table 9.4. ADMET profiling is governed by knowledge of molecular structure and by molecular descriptors, derived from molecular graphs or other molecular representations, that discriminate between candidate and non-candidate compounds. It is incorporated into the early phase of drug design and development to speed up drug discovery and to decrease the number of molecules that would otherwise have to be removed in the late phases of drug design.
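As a minimal example of the descriptor-based filtering used in early ADMET-style profiling, the sketch below applies simple Lipinski-type thresholds with RDKit; the thresholds and the example molecule are illustrative only and are not a substitute for the dedicated ADMET tools listed in Table 9.4.

```python
# Minimal early-stage ADMET-style filter: simple Lipinski-type descriptor
# thresholds applied to an example molecule (aspirin, used only as sample input).
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

properties = {
    "MolWt": Descriptors.MolWt(mol),
    "LogP": Descriptors.MolLogP(mol),
    "HBD": Lipinski.NumHDonors(mol),
    "HBA": Lipinski.NumHAcceptors(mol),
}
violations = sum([properties["MolWt"] > 500,
                  properties["LogP"] > 5,
                  properties["HBD"] > 5,
                  properties["HBA"] > 10])
print(properties, "rule-of-five violations:", violations)
```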
The complete drug discovery procedure is extremely costly, lengthy, and time-consuming, and is a crucial problem for the biotech and pharmaceutical industries. Hence, computational/in silico approaches have been applied widely and judiciously in drug-like lead development, compiling medicinal, chemical, and pharmacological knowledge to make it successful. At present, hundreds of thousands to millions of compounds have to be examined and tested within a very short period to find novel hits; thus, highly effective methodologies are essential for today's researchers. In silico methods, involving computational tools and techniques, are therefore utilized alongside experimental work because they are cost-effective and make it less complex to perform reliable virtual screening for lead identification. These in silico methods involve several databases, similarity searches, and related screening approaches.
9.7 Conclusion
Over the past decades, computer-aided drug design and the property calculation of lead molecules have emerged as an extensively used and well-established field that supports research and development in biotechnology and the life sciences. The modern drug discovery and development cycle includes the identification of the best hits and optimization of the drug-like lead compound to enhance its affinity and specificity, to decrease possible side effects, and to maximize effectiveness, potency, efficacy, metabolic stability, half-life, and oral bioavailability. Once an initial lead that satisfies all of these essential requirements is identified, further drug discovery procedures commence before the compound enters clinical trials. These computational approaches to lead identification are favorable in terms of cost and time, and are less tedious and more environmentally friendly than the earlier lengthy, expensive, and challenging routes that resulted in a low rate of discovery of novel therapeutics. Advanced in silico approaches, namely QSAR, pharmacophore modeling, molecular docking, and pharmacokinetics prediction, play a significant role in the identification of novel drug-like compounds.
References
Abraham MH (1993) Scales of solute hydrogen-bonding: their construction and application to
physicochemical and biochemical processes. Chem Soc Rev. https://doi.org/10.1039/
CS9932200073
Acharya C, Coop A, Polli JE, MacKerell AD (2010) Recent advances in ligand-based drug design: relevance and utility of the conformationally sampled pharmacophore approach. Curr Comput Aided Drug Des 7:10–22. https://doi.org/10.2174/157340911793743547
Akamatsu M (2002) Current state and perspectives of 3D-QSAR. Curr Top Med Chem 2:1381–
1394. https://doi.org/10.2174/1568026023392887
Anderson AC (2003) The process of structure-based drug design. Chem Biol. https://doi.org/10.
1016/j.chembiol.2003.09.002
Andricopulo A, Salum L, Abraham D (2009) Structure-based drug design strategies in medicinal
chemistry. Curr Top Med Chem 9:771–790. https://doi.org/10.2174/156802609789207127
Foloppe N, Fisher LM, Howes R, Potter A, Robertson AGS, Surgenor AE (2006) Identification of
chemically diverse Chk1 inhibitors by receptor-based virtual screening. Bioorg Med Chem
14:4792–4802. https://doi.org/10.1016/j.bmc.2006.03.021
Geladi P, Kowalski BR (1986) Partial least-squares regression: a tutorial. Anal Chim Acta 185:1–
17. https://doi.org/10.1016/0003-2670(86)80028-9
Gohlke H, Klebe G (2002) Drugscore meets CoMFA: adaptation of fields for molecular comparison
(AFMoC) or how to tailor knowledge-based pair-potentials to a particular protein. J Med Chem
45:4153–4170. https://doi.org/10.1021/jm020808p
Gonzalez M, Teran C, Saiz-Urra L, Teijeira M (2008) Variable selection methods in QSAR: an
overview. Curr Top Med Chem 8:1606–1627. https://doi.org/10.2174/156802608786786552
Grinter SZ, Zou X (2014) Challenges, applications, and recent advances of protein-ligand docking
in structure-based drug design. Molecules. https://doi.org/10.3390/molecules190710150
Guner O, Clement O, Kurogi Y (2004) Pharmacophore modeling and three dimensional database
searching for drug design using catalyst: recent advances. Curr Med Chem 11:2991–3005.
https://doi.org/10.2174/0929867043364036
Hall LH, Mohney B, Kier LB (1991) The electrotopological state: an atom index for QSAR. Quant
Struct Relatsh 10:43–51. https://doi.org/10.1002/qsar.19910100108
Hansch C (1978) Recent advances in biochemical QSAR. In: Correlation analysis in chemistry.
Springer US, pp 397–438. https://doi.org/10.1007/978-1-4615-8831-3_9
Hansch C, Leo A, Taft RW (1991) A survey of hammett substituent constants and resonance and
field parameters. Chem Rev 91:165–195. https://doi.org/10.1021/cr00002a004
Izzo AA, Ernst E (2001) Interactions between herbal medicines and prescribed drugs: a systematic
review. Drugs 61:2163–2175
Jhoti H, Rees S, Solari R (2013) High-throughput screening and structure-based approaches to hit
discovery: is there a clear winner? Expert Opin Drug Discovery. https://doi.org/10.1517/
17460441.2013.857654
Kapetanovic IM (2008) Computer-aided drug discovery and development (CADDD): in silico-
chemico-biological approach. Chem Biol Interact 171:165–176. https://doi.org/10.1016/j.cbi.
2006.12.006
Kalyaanamoorthy S, Chen YPP (2011) Structure-based drug design to augment hit discovery. Drug
Discov Today. https://doi.org/10.1016/j.drudis.2011.07.006
Karelson M (2000) Molecular descriptors in QSAR/QSPR, p 35168
Kitchen DB (2017) Computer-aided drug discovery research at a global contract research organi-
zation. J Comput Aided Mol Des 31:309–318. https://doi.org/10.1007/s10822-016-9991-3
Kitchen DB, Decornez H, Furr JR, Bajorath J (2004) Docking and scoring in virtual screening for
drug discovery: methods and applications. Nat Rev Drug Discov. https://doi.org/10.1038/
nrd1549
Klebe G (2006) Virtual ligand screening: strategies, perspectives and limitations. Drug Discov
Today. https://doi.org/10.1016/j.drudis.2006.05.012
Klebe G, Abraham U (1999) Comparative molecular similarity index analysis (CoMSIA) to study
hydrogen-bonding properties and to score combinatorial libraries. J Comput Aided Mol Des
13:1–10. https://doi.org/10.1023/A:1008047919606
Koehn FE, Carter GT (2005) The evolving role of natural products in drug discovery. Nat Rev Drug
Discov. https://doi.org/10.1038/nrd1657
Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection, pp 1137–1143
Kore PP, Mutha MM, Antre RV, Oswal RJ, Kshirsagar SS (2012) Computer-aided drug design: an
innovative tool for modeling. Open J Med Chem 02:139–148. https://doi.org/10.4236/ojmc.
2012.24017
Kubinyi H, Hamprecht FA, Mietzner T (1998) Three-dimensional quantitative similarity-activity
relationships (3D QSiAR) from SEAL similarity matrices. J Med Chem 41:2553–2564. https://
doi.org/10.1021/jm970732a
Kuntz ID (1992) Structure-based strategies for drug design and discovery. Science 257:1078–1082. https://doi.org/10.1126/science.257.5073.1078
Kutchukian PS, Shakhnovich EI (2010) De novo design: balancing novelty and confined chemical
space. Expert Opin Drug Discovery. https://doi.org/10.1517/17460441.2010.497534
Langer T, Hoffmann R (2005) Virtual screening an effective tool for lead structure discovery. Curr
Pharm Des 7:509–527. https://doi.org/10.2174/1381612013397861
Lavecchia A, Giovanni C (2013) Virtual screening strategies in drug discovery: a critical review.
Curr Med Chem 20:2839–2860. https://doi.org/10.2174/09298673113209990001
Lee KH (1999) Anticancer drug design based on plant-derived natural products. J Biomed Sci.
https://doi.org/10.1007/BF02253565
Leelananda SP, Lindert S (2016) Computational methods in drug discovery. Beilstein J Org Chem.
https://doi.org/10.3762/bjoc.12.267
Lennernäs H, Abrahamsson B (2005) The use of biopharmaceutic classification of drugs in drug
discovery and development: current status and future extension. J Pharm Pharmacol 57:273–
285. https://doi.org/10.1211/0022357055263
Leo AJ, Hansch C (1999) Role of hydrophobic effects in mechanistic QSAR. Perspect Drug Discov
Des. https://doi.org/10.1023/A:1008762321231
Lin X, Li X, Lin X (2020) A review on applications of computational methods in drug screening
and design. Molecules. https://doi.org/10.3390/molecules25061375
Lokendra OK, Rachana S, Mukta Rani B (2013) Modern drug design with advancement in QSAR:
a review. Int J Res Biosci
Lombardo F, Shalaeva MY, Tupper KA, Gao F, Abraham MH (2000) ElogP (oct): a tool for
lipophilicity determination in drug discovery. J Med Chem 43:2922–2928. https://doi.org/10.
1021/jm0000822
Macalino SJY, Gosu V, Hong S, Choi S (2015) Role of computer-aided drug design in modern drug
discovery. Arch Pharm Res. https://doi.org/10.1007/s12272-015-0640-5
Milletti F, Storchi L, Sforna G, Cruciani G (2007) New and original pKa prediction method using
grid molecular interaction fields. J Chem Inf Model 47:2172–2181. https://doi.org/10.1021/
ci700018y
Miteva M (2008) Hierarchical structure-based virtual screening for drug design. Biotechnol
Biotechnol Equip 22:634–638. https://doi.org/10.1080/13102818.2008.10817525
Moses H, Dorsey ER, Matheson DHM, Thier SO (2005) Financial anatomy of biomedical research.
J Am Med Assoc 294:1333–1342. https://doi.org/10.1001/jama.294.11.1333
Myers S, Baker A (2001) Drug discovery – an operating model for a new era. Despite the advent of
new science and technologies, drug developers will need to make radical changes in their
operations if they are to remain competitive and innovative. Nat Biotechnol. https://doi.org/
10.1038/90765
Myint KZ, Xie XQ (2010) Recent advances in fragment-based QSAR and multi-dimensional
QSAR methods. Int J Mol Sci 11:3846–3866. https://doi.org/10.3390/ijms11103846
Patel B, Singh V, Patel D (2019) Structural bioinformatics. In: Essentials of bioinformatics, vol I.
Springer International Publishing, Cham, pp 169–199. https://doi.org/10.1007/978-3-030-
02634-9_9
Phoebe Chen Y-P, Chen F (2008) Identifying targets for drug discovery using bioinformatics.
Expert Opin Ther Targets 12:383–389. https://doi.org/10.1517/14728222.12.4.383
Poptodorov K, Luu T, Hoffmann RD (2006) Pharmacophore model generation software tools. In:
Pharmacophores and pharmacophore searches. Wiley, pp 15–47. https://doi.org/10.1002/
3527609164.ch2
Prathipati P, Dixit A, Saxena A (2007) Computer-aided drug design: integration of structure-based
and ligand-based approaches in drug design. Curr Comput Aided-Drug Des 3:133–148. https://
doi.org/10.2174/157340907780809516
Putz M, Lacrămă A-M (2007) Introducing spectral structure activity relationship (S-SAR) analysis.
Application to ecotoxicology. Int J Mol Sci 8:363–391. https://doi.org/10.3390/i8050363
Rekker RF (1992) The history of drug research: from Overton to Hansch. Quant Struct Relatsh
11:195–199. https://doi.org/10.1002/qsar.19920110214
Reker D, Rodrigues T, Schneider P, Schneider G (2014) Identifying the macromolecular targets of
de novo-designed chemical entities through self-organizing map consensus. Proc Natl Acad Sci
U S A 111:4067–4072. https://doi.org/10.1073/pnas.1320001111
Schaper KJ (1999) Free-Wilson-type analysis of non-additive substituent effects on THPB dopa-
mine receptor affinity using artificial neural networks. Quant Struct Relatsh 18:354–360. https://
doi.org/10.1002/(SICI)1521-3838(199910)18:4<354::AID-QSAR354>3.0.CO;2-2
Scior T, Medina-Franco J, Do Q-T, Martinez-Mayorga K, Yunes Rojas J, Bernard P (2009) How to
recognize and workaround pitfalls in QSAR studies: a critical review. Curr Med Chem 16:4297–
4313. https://doi.org/10.2174/092986709789578213
Silverman RB, Holladay MW (2015) The organic chemistry of drug design and drug action, 3rd
edn. Elsevier Inc. https://doi.org/10.1016/C2009-0-64537-2
Tetko IV, Bruneau P, Mewes HW, Rohrer DC, Poda GI (2006) Can we estimate the accuracy of
ADME-Tox predictions? Drug Discov Today. https://doi.org/10.1016/j.drudis.2006.06.013
Tintori C, Manetti F, Botta M (2010) Pharmacophoric models and 3D QSAR studies of the
adenosine receptor ligands. Curr Top Med Chem 10:1019–1035. https://doi.org/10.2174/
156802610791293118
Tropsha A, Golbraikh A (2007) Predictive QSAR modeling workflow, model applicability
domains, and virtual screening. Curr Pharm Des 13:3494–3504. https://doi.org/10.2174/
138161207782794257
Verma RP, Hansch C (2009) Camptothecins: a SAR/QSAR study. Chem Rev 109:213–235. https://
doi.org/10.1021/cr0780210
Wadud A, Prasad PVV, Rao MM, Narayana A (2007) Evolution of drug: a historical perspective.
Bull Indian Inst Hist Med Hyderabad 37:69–80
Wermuth CG (2006) Pharmacophores: historical perspective and viewpoint from a medicinal
chemist. In: Pharmacophores and pharmacophore searches. Wiley, pp 1–13. https://doi.org/
10.1002/3527609164.ch1
Wermuth CG, Ganellin CR, Lindberg P, Mitscher LA (1998) Glossary of terms used in medicinal
chemistry (IUPAC Recommendations 1998). Pure Appl Chem 70:1129–1143. https://doi.org/
10.1351/pac199870051129
Wolber G, Langer T (2005) Ligand scout: 3-D pharmacophores derived from protein-bound ligands
and their use as virtual screening filters. J Chem Inf Model 45:160–169. https://doi.org/10.1021/
ci049885e
Yang SY (2010) Pharmacophore modeling and applications in drug discovery: challenges and
recent advances. Drug Discov Today. https://doi.org/10.1016/j.drudis.2010.03.013
Zheng W, Tropsha A (2000) Novel variable selection quantitative structure-property relationship
approach based on the k-Nearest-Neighbor principle. J Chem Inf Comput Sci 40:185–194.
https://doi.org/10.1021/ci980033m
10 Chemoinformatics and QSAR
Vivek Srivastava, Chandrabose Selvaraj, and Sanjeev Kumar Singh
Abstract
Keywords
V. Srivastava
Department of Biotechnology, Faculty of Engineering and Technology, Rama University Uttar
Pradesh, Kanpur, Uttar Pradesh, India
C. Selvaraj (*) · S. K. Singh (*)
Computer Aided Drug Design and Molecular Modeling Lab, Department of Bioinformatics,
Science Block, Alagappa University, Karaikudi, Tamilnadu, India
e-mail: [email protected]
10.1 Introduction
Such approaches have significantly reduced the problems associated with the selection process, although they have not been sufficient to overcome it entirely. It has been assumed that these screening methods must become more sophisticated to identify potent lead molecules (Oprea et al. 2001; Proudfoot 2002).
QSAR has been a widely used method in the drug development process over the last few decades; it develops mathematical models that capture statistically significant relationships between chemical structure and activity using regression analysis (Cherkasov et al. 2014). QSAR modeling has since expanded and is now chiefly concerned with building predictive models of biological activity. The QSAR concept has been used extensively in various applications, including drug discovery, for relating both the biological and physicochemical properties of lead molecules to structure; when physicochemical properties are modeled, the approach is known as quantitative structure–property relationship (QSPR) modeling. QSAR is used to find the relationships between compound structure and biological activity, and it has developed to satisfy the medicinal chemist's desire to predict biological responses. QSAR models are often used for comparative analysis with molecular descriptors of various types (1D, 2D, and 3D), which result from computational procedures that describe the structure and behavior of the molecules in the system. The final output of QSAR is mainly a set of mathematical equations relating chemical structure to biological activity (Eriksson et al. 2003; Golbraikh et al. 2003; Wedebye et al. 2015). The present chapter discusses predictive QSAR modeling and development procedures for validation and application in the drug discovery process. It also discusses the various molecular descriptors used in QSAR methods, successful QSAR-based screening of compounds, applications in computer-aided drug design and environmental chemical risk assessment, and future perspectives.
In 1868, Crum-Brown and Fraser first proposed an equation that is considered the first-generation formulation of QSAR in their investigations of various alkaloids (Crum-Brown 1868). Initially, QSAR was applied to the narcotic activity of various drugs. Later, Hammett introduced a new method to account for the effect of substituents on reaction mechanisms (Hammet 1935). QSAR studies have correlated the affinities of ligands for their respective receptors, rate constants, inhibition constants, and other biological endpoints with atomic groups and with properties such as lipophilicity and the polarizability of the molecule. However, this approach has limited utility for the development of novel lead molecules owing to the lack of three-dimensional structural information. To overcome this problem, 3D-QSAR has been used as an extension of the classical Hansch and Free-Wilson methods to exploit the biological activity of molecules via chemometric techniques such as PLS and ANN. These techniques have served as valuable tools in the drug development process, especially in the pharmaceutical and agrochemical industries. Although the trial-and-error element involved in the design of novel lead molecules cannot be ignored, QSAR has undeniably reduced the number of compounds to be synthesized and improved the selection of potent lead molecules (Verma et al. 2010; Dessalew et al. 2007). Topological techniques were proposed by Enslein et al. to evaluate the structure and biological activity of molecules, together with the minimum topological difference (MTD) method of Simon. More recently, electrotopological indices have been used to encode structural information on the topological state of atoms and fragments, together with their electron valence, and have been applied to biological data (Tong et al. 1998); other developments in QSAR include inverse QSAR, HQSAR, and binary QSAR (Gao et al. 1999).
This method adds descriptors to the QSAR model one at a time using a regression approach: the descriptor that provides the greatest improvement in the fitness function is selected first, and new descriptors are then gradually added to the regression, each chosen because it contributes the highest fitness function when combined with the previously selected ones (Guyon et al. 2002). The approach has a known disadvantage: a set of descriptors that is collectively a good predictor may contain descriptors that give poor predictions individually. Nevertheless, forward selection methods have been used in several QSAR studies (Wegner et al. 2004).
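As a concrete illustration of this greedy loop, the sketch below performs forward selection of descriptors for a linear QSAR model; it assumes a NumPy descriptor matrix X and an activity vector y and uses scikit-learn's LinearRegression with cross-validated R2 as the fitness function (neither library is prescribed by the text).

```python
# Greedy forward selection of descriptors for a linear QSAR model (illustrative sketch).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_select=5, cv=5):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        # Score each candidate descriptor added to the current subset.
        scores = {
            j: cross_val_score(LinearRegression(), X[:, selected + [j]], y,
                               cv=cv, scoring="r2").mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic example: the activity depends only on descriptors 3 and 7.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(scale=0.1, size=50)
print(forward_select(X, y, n_select=2))   # typically recovers descriptors 3 and 7
```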
This method starts with all descriptors selected and removes them one by one to reduce the number of descriptors, based on their impact on the decline of an error criterion such as the sum of squares. Descriptors are removed until further elimination is no longer justified by the error criterion; to date, only a limited number of reports have used this method as a variable selection approach (Yang et al. 2003).
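The same idea can be expressed compactly with scikit-learn's SequentialFeatureSelector run in the backward direction; the snippet below is only one possible realization, shown on synthetic data.

```python
# Backward elimination of descriptors, expressed with scikit-learn's
# SequentialFeatureSelector (one possible implementation, not prescribed by the text).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 15))                      # hypothetical descriptor matrix
y = X[:, 0] - 0.5 * X[:, 4] + rng.normal(scale=0.1, size=60)

# Start from the full descriptor set and drop descriptors one by one,
# keeping the subset that least degrades cross-validated R2.
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=2,
                                     direction="backward",
                                     scoring="r2", cv=5)
selector.fit(X, y)
print(np.flatnonzero(selector.get_support()))      # indices of retained descriptors
```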
This is a well-known and extensively used method in QSAR that selects descriptors in a stepwise, step-by-step procedure. Model development starts without any descriptors in the regression equation. At each step, the descriptor that offers the highest fitness function (for example, the correlation coefficient for prediction of biological activity) is introduced, and the significance of the descriptors already present in the regression QSAR model is re-examined. The process ends when no descriptor remaining in the pool satisfies the selection criterion. It is therefore a simple and powerful way to obtain a subset of descriptors, although it does not carry over directly to artificial neural network methods.
Liu et al. (2003) proposed this technique based on the estimation of two statistics: the intercorrelation coefficient of two descriptors (Rint) and the correlation coefficient (q2) computed with the leave-one-out (LOO) cross-validation method. It was introduced to improve performance over exhaustive whole-subset searches. Its two main features, which distinguish it from other methods, are controlling the exploration of several optimal subsets through q2 in LOO cross-validation and speeding the examination of all top subsets by using Rint together with q2.
In this method, leaps-and-bounds regression is used for the selection of descriptors (Xu and Zhang 2001); this regression can quickly find the best descriptor subset without checking all possible subsets, based on the following inequality (Furnival and Wilson 1974):

RSS(A) \le RSS(A_i)

where A is any set of descriptors, RSS denotes the residual sum of squares, and Ai is a subset of A. On the basis of this inequality, only a fraction of the possible subsets needs to be evaluated to find the best subsets. All the above methods are essentially linear and share some disadvantages: they may not be effective when the relationship between activity and descriptors is nonlinear, and they can produce chance correlations when many variables are screened for inclusion in the model.
In this technique, the developed QSAR models are validated with the LOO cross-validation method: one compound is removed from the dataset at each stage, and the model is rebuilt with the remaining molecules. The resulting model is then used to predict the activity of the removed molecule. This procedure is repeated until every molecule has been left out once. The following equation is used to calculate the cross-validated squared correlation coefficient:
Q^2 = 1 - \frac{\sum (Y_{obs} - Y_{pred})^2}{\sum (Y_{obs} - \bar{Y})^2}
where Yobs represents the observed activity of the training set, Ypred the predicted activity of the training set, and \bar{Y} the mean activity of the training set. In addition, the adjusted R2 (adjR2), a modification of R2 that corrects for the number of explanatory terms in a model, is calculated. In this method, the addition of a descriptor is accepted only if it raises the value of adjR2, i.e., if the new term improves the model more than expected by chance (Roy et al. 2012). adjR2 is calculated by the following expression:
adjR^2 = \frac{(n-1)R^2 - p}{n - p - 1}
where n is the number of training compounds and p denotes the number of predictor variables used for model development. The overall significance of the regression coefficients is assessed with the variance ratio F, calculated by the following equation:
F = \frac{\sum (Y_{calc} - \bar{Y})^2 / p}{\sum (Y_{obs} - Y_{calc})^2 / (n - p - 1)}
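The sketch below, assuming NumPy arrays and scikit-learn utilities (neither prescribed by the text), computes Q2 by leave-one-out cross-validation together with adjR2 and the variance ratio F exactly as defined above; the dataset is synthetic and purely illustrative.

```python
# Leave-one-out Q2, adjusted R2 and variance ratio F for a linear QSAR model,
# following the equations above (illustrative sketch on synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 4))                       # hypothetical training descriptors
y = 1.2 * X[:, 0] - 0.8 * X[:, 2] + rng.normal(scale=0.2, size=40)

model = LinearRegression().fit(X, y)
y_calc = model.predict(X)                          # fitted values
y_loo = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())

n, p = X.shape
r2 = 1 - np.sum((y - y_calc) ** 2) / np.sum((y - y.mean()) ** 2)
q2 = 1 - np.sum((y - y_loo) ** 2) / np.sum((y - y.mean()) ** 2)
adj_r2 = ((n - 1) * r2 - p) / (n - p - 1)
f_ratio = (np.sum((y_calc - y.mean()) ** 2) / p) / (np.sum((y - y_calc) ** 2) / (n - p - 1))
print(round(r2, 3), round(q2, 3), round(adj_r2, 3), round(f_ratio, 1))
```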
This metric is used to assess the predictive ability of the established model: the activity values of a test set are used to calculate the predictive R2 (R2pred) by the following equation:
R^2_{pred} = 1 - \frac{\sum \left( Y_{pred(test)} - Y_{(test)} \right)^2}{\sum \left( Y_{(test)} - \bar{Y}_{(training)} \right)^2}
where Ypred(test) and Y(test) represent the predicted and observed activities of the test compounds, respectively, \bar{Y}(training) is the mean activity value of the training set, and R2pred is the correlation coefficient computed over all the test compounds.
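A minimal implementation of this external validation metric, assuming NumPy arrays of observed and predicted test activities and of training activities (the numbers below are illustrative only), might look as follows.

```python
# Predictive R2 for an external test set, following the equation above.
import numpy as np

def r2_pred(y_test_obs, y_test_pred, y_train):
    ss_res = np.sum((y_test_pred - y_test_obs) ** 2)
    ss_tot = np.sum((y_test_obs - y_train.mean()) ** 2)   # centred on the training mean
    return 1 - ss_res / ss_tot

y_train = np.array([5.1, 6.0, 4.8, 7.2, 6.5])        # hypothetical training activities
y_test_obs = np.array([5.5, 6.8, 4.9])                # observed test activities
y_test_pred = np.array([5.3, 6.6, 5.2])               # model predictions for the test set
print(round(r2_pred(y_test_obs, y_test_pred, y_train), 3))
```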
The Y-randomization method is mainly used to check the robustness of the developed QSAR model. It is implemented by permuting the response values (the activity, Y) while the descriptor matrix (X) is kept unchanged. The deviation of the mean squared correlation coefficient of the randomized models (Rr2) from the squared correlation coefficient of the non-randomized model is quantified by the following equation (Roy et al. 2015):
R_p^2 = R^2 \times \sqrt{R^2 - R_r^2}
Ideally, Rp2 should equal R2 when the randomized models show no correlation (Rr2 = 0); because the definition above does not satisfy this condition, Todeschini proposed the following corrected form:
{}^{c}R_p^2 = R \times \sqrt{R^2 - R_r^2}
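A possible sketch of the Y-randomization procedure and of the cRp2 metric above is given below; the number of permutations and the synthetic data are arbitrary choices made here for illustration.

```python
# Y-randomization check: refit the model on shuffled activities and compare the mean
# randomized R2r with the original R2 via the corrected cR2p metric above (a sketch).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 4))                         # hypothetical descriptors
y = 1.5 * X[:, 1] + rng.normal(scale=0.2, size=40)   # hypothetical activities

def r2_of_fit(X, y):
    m = LinearRegression().fit(X, y)
    yc = m.predict(X)
    return 1 - np.sum((y - yc) ** 2) / np.sum((y - y.mean()) ** 2)

r2 = r2_of_fit(X, y)
r2_rand = np.mean([r2_of_fit(X, rng.permutation(y)) for _ in range(100)])
c_r2p = np.sqrt(r2) * np.sqrt(max(r2 - r2_rand, 0.0))
# A cR2p value much larger than the mean randomized R2r indicates a non-chance model.
print(round(r2, 3), round(r2_rand, 3), round(c_r2p, 3))
```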
Molecular descriptors encode the structural information of molecules as sets of numerical and binary values that represent the molecular properties important for understanding the activity of the molecules. Broadly, molecular descriptors fall into two categories, depending on whether they require the 3D alignment and conformation of the molecules (Kumar Singh et al. 2007).
Constitutional descriptors are extremely fast and easy to compute and correlate with molecular properties determined by the composition of the structure. They include the total number of atoms and the counts of atoms of each element, as well as bond-related properties such as the numbers of single, double, triple, and aromatic bonds.
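For illustration, a handful of such simple counts can be obtained with the RDKit toolkit as sketched below; RDKit and the example SMILES are choices made here, not something prescribed by the text.

```python
# A few constitutional/1D descriptors computed with RDKit (used here only as an
# illustration; the text does not prescribe a particular toolkit).
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin as an example input
print("heavy atoms     :", mol.GetNumHeavyAtoms())
print("molecular weight:", round(Descriptors.MolWt(mol), 2))
print("rotatable bonds :", Descriptors.NumRotatableBonds(mol))
print("H-bond donors   :", Lipinski.NumHDonors(mol))
print("H-bond acceptors:", Lipinski.NumHAcceptors(mol))
print("aromatic rings  :", Lipinski.NumAromaticRings(mol))
```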
Geometrical descriptors provide information on the molecular surface formed by the atomic van der Waals areas, which depends on the three-dimensional arrangement of the atoms constituting the molecule, and on the molecular volume. The principal moments of the gravitation indices capture the three-dimensional organization of the molecule obtained by projection (Labute 2000).
The 3D-QSAR approach is more complex than 2D-QSAR and involves several steps to derive numerical descriptors of the molecular structure. First, the conformation of the molecule is obtained from experimental data or molecular mechanics and is then refined by energy minimization (Guner 2002; Akamatsu 2002). In the second step, these conformers are consistently aligned in space, and finally the aligned conformers are analyzed computationally to compute different descriptors. Various descriptors, including geometric, quantum-chemical, and physical characteristics, are used to describe the 3D features of the lead molecules. These molecular descriptors are combined to generate a pharmacophore model that captures the features of the molecules, such as the numbers of hydrogen bond acceptors and donors, that are crucial for the desired biological activity (Chang and Swaan 2006). The final 3D-QSAR model is obtained by evaluating the stability and statistical significance of the pharmacophore model. Several studies have reported the various techniques of 3D-QSAR modeling, which are widely used for drug design (de Groot and Ekins 2002; Van Drie 2003). Table 10.1 summarizes the classification of QSAR models according to different criteria that are widely used in drug discovery approaches.
Before descriptors are calculated, the molecules must be aligned, and the alignment procedure depends strongly on the features of the receptor for the predicted ligands. Where such data are available, the alignment of molecules is derived from receptor–ligand complexes. Computational tools are then used to overlay the conformations in space, so these methods depend heavily on atom–atom or substructure mapping (Lemmen and Lengauer 2000).
CoMFA generally uses two key energy fields, the Coulombic and van der Waals fields, to study the molecules. The molecules are aligned and placed in a 3D grid. The analysis identifies positively and negatively charged regions of the structures; hence CoMFA has been an extensively used 3D-QSAR method for relating the shape-dependent steric and electrostatic properties of lead molecules to their biological activity. Based on the 3D configurations in the grid, molecules are characterized by both steric and electrostatic potential energy values at each grid point. The method generally adopts the minimum-energy conformer of the bioactive molecules, and known crystal structures and their analogues may be used to define the bioactive conformers (Gohda et al. 2000). Partial least squares (PLS) and principal component analysis (PCA) are generally used in CoMFA model development, and the resulting model is then examined for statistical significance and robustness. The predictive stability of a CoMFA model is highly sensitive to the orientation of the bioactive conformers (Yasuo et al. 2009), and the lowest-energy conformers obtained in the absence of the receptor may be misleading (MacKerell Jr. 2004; Hasegawa et al. 2000; Koehn and Carter 2005). CoMFA uses Lennard-Jones and Coulombic functions to compute the steric and electrostatic interactions (Flower 2002a, b).
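The toy sketch below evaluates Lennard-Jones (steric) and Coulomb (electrostatic) probe energies on a coarse regular grid around a pair of aligned atoms, in the spirit of the CoMFA field calculation; the probe parameters, grid spacing, and cut-off are arbitrary illustrative values, not those of any actual CoMFA implementation.

```python
# Toy evaluation of steric (Lennard-Jones) and electrostatic (Coulomb) probe energies
# on a regular grid around aligned atoms (all parameters are illustrative).
import numpy as np

atoms = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])   # aligned atom coordinates (Angstrom)
charges = np.array([-0.4, 0.4])                         # hypothetical partial charges (e)
eps, sigma = 0.1, 3.4                                   # LJ parameters for a carbon-like probe

axis = np.arange(-4.0, 4.01, 2.0)                       # coarse 2 A grid spacing
grid = np.array([[x, y, z] for x in axis for y in axis for z in axis])

d = np.linalg.norm(grid[:, None, :] - atoms[None, :, :], axis=2)
d = np.clip(d, 1.0, None)                               # crude cut-off near atom centres
steric = np.sum(4 * eps * ((sigma / d) ** 12 - (sigma / d) ** 6), axis=1)
electro = np.sum(332.0 * charges / d, axis=1)           # +1 probe charge, kcal/mol-style units

print(grid.shape[0], "grid points")                     # steric/electro are the field columns
```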
CoMSIA is similar to CoMFA in that it also embeds the molecules in a regular grid lattice and calculates similarity indices with probe atoms. Compared with CoMFA, it uses and calculates additional fields, such as hydrophobic properties, so the probe atoms carry hydrophobic properties as well. The use of Gaussian-type functions in CoMSIA, instead of the Coulombic and Lennard-Jones functions, allows meaningful values to be obtained even at grid points located within the molecules; in CoMFA, very large values arise at such points owing to the form of the potential functions and the arbitrary cut-offs (Klebe et al. 1994). The method uses hydrophobic, hydrogen bond donor and acceptor, steric, and Coulombic contributions to compute the similarity indices (Flower 2002a, b), and a bell-shaped Gaussian function is used for the calculation of the electrostatic and steric components of the energy (Acharya et al. 2011).
This type of descriptor provides invariant information by performing principal component analysis on the coordinates of the molecule and transforming the molecule into a space that captures the variance. Several statistics are then used to characterize the proportions, variance, and symmetry of the molecule, and combinations of directional and non-directional descriptors are also defined in this approach. In the principal component analysis, the impact of each atom can be analyzed through chemical properties, and the atoms can be weighted by mass, atomic electronegativity, polarizability, and molecular electrostatic potential (Douali et al. 2003).
10.6.4 VolSurf
The VolSurf method relies on probing a grid around a molecule with specific probes, such as a hydrophobic probe. The lattice boxes are then used to compute descriptors from the 3D contour surfaces. Using several probes and cut-off values, various molecular properties such as surface area, molecular volume, and hydrophilic regions are quantified. In addition, derived quantities such as molecular globularity, which relates to the surface of the hydrophilic regions of the whole molecule, are computed, and geometry-based descriptors are also available in this method (Cruciani et al. 2000; Crivori et al. 2000).
Like VolSurf, GRIND probes a grid with specific probes, in this case to overcome the interpretability issues of alignment-free descriptors. The regions of most favorable interaction energy are selected such that the distance between the regions is large. The probe-based energies of the molecule are then computed, and in the final step the distances between grid nodes are collected into a set of bins. Each bin stores, for its distance range, the nodes with the maximum energies, and these stored values are used as numerical descriptors. The stored node information can also be used to trace the exact region of the molecule responsible for a given descriptor (Pastor et al. 2000).
QSAR is an effective technique for constructing accurate models that relate chemical structure to continuous endpoints, such as pIC50 and pEC50 values or other biological properties, using regression analysis, and to binary or categorical properties, such as active/inactive or toxic/nontoxic, using classification approaches (Cherkasov et al. 2014). Over the past years, QSAR modeling has been modified in numerous ways, ranging from 1D to nD formulations, with different methods for finding the association between chemical structure and biological activity. Classical QSAR is limited to small series of congeneric lead molecules and to simple regression methods. More recently, QSAR modeling has progressed to the modeling and screening of structurally diverse compounds from large datasets using a wide variety of machine learning techniques (Cherkasov et al. 2014; Ekins et al. 2015; Goh et al. 2017; Mitchell 2004).
QSAR-based virtual screening is used for the prediction of lead molecules from bulky chemical libraries (Cherkasov et al. 2014). The screened molecules are analyzed to reduce the number of compounds taken from a large chemical library before they are subjected to experimental assays. It is also important to consider the workflow and additional filtering steps, such as empirical rule setting (Cherkasov et al. 2014). Experimental confirmation of the computational hits should be implemented as the final essential step of QSAR-based screening, accompanied by QSAR predictions of potency and pharmacokinetic parameters. This information is essential for lead optimization and for the design of new molecule series, so that properties such as potency, pharmacokinetics, and selectivity, which depend on the effects of diverse structural modifications, can be balanced when designing new lead molecules for a specific target. Hence, the evaluation of virtual screening methods with a suitable approach is crucial before large libraries are screened, so that accurate outcomes can be generated in an actual project. Thus, several software tools and workflows are systematically evaluated with benchmark datasets. Such datasets contain both known active molecules and presumed inactive molecules called decoys (Irwin 2008; Selvaraj et al. 2014). Preferably, both active and inactive molecules are selected based on experimental assays. Receiver operating characteristic (ROC) curves are used as a common metric to evaluate the performance of virtual screening methods (Triballeau et al. 2005). In addition, enrichment factors (EF), enrichment curves (EC), and predictiveness curves, although theoretically different, all share the same objective of evaluating the capability of a method to find active compounds and to distinguish them from decoy compounds (Empereur-mot et al. 2015). The composition of the active and decoy sets has been shown to bias virtual screening assessment outcomes. A large separation between the chemical spaces occupied by the active molecules and the decoy molecules leads to an artificial overestimation of the enrichment (Bissantz et al. 2000), whereas the presence of active compounds within the decoy set may lead to an artificial underestimation of the enrichment (Good and Oprea 2008). A number of studies have examined how virtual screening performance depends on the structural properties of a target, such as its binding site, physicochemical properties, and structural flexibility (Cummings et al. 2005). Based on this information, and with the increasing number of protein families in databases, decoy sets of compounds have been made publicly accessible to provide consistent benchmarking datasets (Ibrahim et al. 2015; Mysinger et al. 2012).
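A minimal evaluation of a ranked screen with ROC AUC and an enrichment factor is sketched below on synthetic scores; scikit-learn's roc_auc_score is used for convenience, and the 1% cut-off is an arbitrary choice, not one prescribed by the text.

```python
# ROC AUC and top-fraction enrichment factor for a ranked virtual screen.
# scores are higher-is-better screening scores; labels mark actives (1) vs decoys (0).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
labels = np.array([1] * 20 + [0] * 980)
scores = np.concatenate([rng.normal(1.0, 1.0, 20), rng.normal(0.0, 1.0, 980)])

def enrichment_factor(labels, scores, fraction=0.01):
    n_top = max(1, int(round(fraction * len(labels))))
    top = np.argsort(scores)[::-1][:n_top]          # best-scored fraction of the library
    return labels[top].mean() / labels.mean()       # hit rate in top vs overall hit rate

print("ROC AUC:", round(roc_auc_score(labels, scores), 3))
print("EF(1%) :", round(enrichment_factor(labels, scores, 0.01), 1))
```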
QSAR model validation is one of the most important steps in QSAR-based applications, and several acceptable statistical criteria have recently been used for estimating performance on test compounds in QSAR modeling (Zhang et al. 2006). This critical component has become established as reliable and widely accepted practice for model development. Establishing the model's applicability domain is another major issue in QSAR analysis. A model without an applicability domain will readily return an activity estimate for any molecule, even one structurally remote from the training set compounds. Thus, omitting the applicability domain as a necessary element of QSAR model evaluation would lead to unjustified extrapolation across chemistry space and hence to inaccurate predictions (Kovatcheva et al. 2003; Shen et al. 2003, 2002). Mandel (1982) described an effective prediction domain based on the ranges of the descriptors involved in the regression analysis. In another study, Afantitis et al. (2006) developed a multiple regression model for a dataset of apoptotic agents and defined an applicability domain based on the corresponding diagonal elements of the hat matrix; this approach can identify possible leverage outliers. Netzeva et al. (2006) and Saliner et al. (2006) defined the applicability domain by the ranges of descriptors occupied by the representative points, an approach with significant drawbacks owing to its dependence on those representative points. Tong et al. (2004) used a similar applicability domain in constructing QSAR models for two datasets with the decision forest method and studied how prediction accuracy depends on the applicability domain: the accuracy was 50% for the initial applicability domain and increased when the applicability domain was increased by 30%.
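One common way to define such a domain is through the leverage (the diagonal of the hat matrix), with the warning threshold h* = 3(p + 1)/n; the sketch below applies this rule to hypothetical query compounds and is only one of several possible applicability-domain criteria.

```python
# Leverage-based applicability domain: compounds with hat values above the
# common warning threshold h* = 3(p + 1)/n are flagged as outside the domain.
import numpy as np

rng = np.random.default_rng(5)
X_train = rng.normal(size=(40, 4))                     # training descriptor matrix
X_query = rng.normal(size=(5, 4)) * 2.5                # query compounds with a wider spread

X1 = np.hstack([np.ones((len(X_train), 1)), X_train])  # add intercept column
XtX_inv = np.linalg.inv(X1.T @ X1)

def leverage(x_row):
    x = np.concatenate([[1.0], x_row])
    return float(x @ XtX_inv @ x)

h_star = 3 * (X_train.shape[1] + 1) / len(X_train)
for i, x in enumerate(X_query):
    h = leverage(x)
    status = "inside" if h <= h_star else "outside"
    print(f"compound {i}: h = {h:.3f} ({status} AD, h* = {h_star:.3f})")
```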
The evaluation of virtual screening tools with benchmarking sets started in the year 2000 with the pioneering work of Bissantz et al. (2000). The main objective of their study was the evaluation of ligand enrichment, i.e., whether the hit molecules with the best docking scores returned by the docking programs were true actives. Three docking programs, Dock (Kuntz et al. 1982), Gold (Jones et al. 1997), and FlexX (Rarey et al. 1996), combined with seven scoring functions, were assessed on two different target proteins. For each target, a dataset of 10 known active compounds and 990 small molecules presumed to be inactive (decoy compounds) was generated: (1) filters were first applied to eliminate undesired compounds, and (2) most of the decoy molecules were then randomly selected from the filtered dataset and used to assess and compare the different docking and scoring schemes. This procedure was later applied to assess whether homology models of three human G-protein coupled receptors (GPCRs) are suitable virtual screening targets (Bissantz et al. 2003). Today, there is a strong focus on virtual screening methods that retrieve new compounds from available databases while utilizing decoy sets for comparison with the actives (Kellenberger et al. 2004; Brozell et al. 2012; Neves et al. 2012; Repasky et al. 2012; Spitzer and Jain 2012). Hence, benchmarking databases are widely used to evaluate different virtual screening models and to support the identification of potent lead molecules by both ligand- and structure-based virtual screening (Allen et al. 2015; Ruggeri et al. 2011).
Considering the size of the decoy sets, Diller and Li (2003) incorporated physicochemical filters for decoy selection. In addition to about 1000 kinase inhibitors retrieved from the literature for six kinases (EGFr, VEGFr1, PDGFrβ, FGFr1, SRC, and p38), 32,000 compounds were randomly selected from the MDDR (MDL Drug Data Report). The filter was specifically intended to find decoy compounds with comparable polarity and mass. In 2003, McGovern and Shoichet developed a new benchmarking database from the MDDR in which undesired functional groups were removed; MDDR targets with at least 20 known ligands were retained, giving a target dataset (MMP3, NEP, and XO), and the remaining molecules were used as decoy sets (McGovern and Shoichet 2003). In these early benchmarking databases, the potent and highly active molecules were taken from the literature, while the decoy molecules, presumed to be inactive, were selected from large databases and filtered by specific criteria such as molecular weight and drug-likeness. Owing to the use of the MDDR and to this way of filtering the decoy compounds, these benchmarking databases show notable drawbacks: the differences between the physicochemical properties of the actives and the decoy molecules introduce bias and hence artificially good enrichment (Verdonk et al. 2004; Huang et al. 2006).
Irwin (2008) noted that decoy molecules physically similar to the active molecules significantly reduce bias, while compounds dissimilar from the actives reduce the probability that they actually bind the target protein. Based on these considerations, the DUD database was constructed for the evaluation of virtual screening methods (Huang et al. 2006). This database contains 2950 ligands and 95,326 decoy molecules for 40 protein targets from six major protein families, including serine proteases, folate enzymes, nuclear hormone receptors, and kinases. Structurally similar compounds were retrieved from the ZINC database, and the decoy molecules were selected by estimating the Tanimoto distance based on physicochemical properties; around 36 decoy molecules sharing the most comparable properties were selected for each active molecule. An evaluation with DOCK confirmed that uncorrected databases such as the MDDR lead to over-optimistic enrichment compared with an improved database such as DUD.
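The property-matching idea behind DUD can be sketched in a few lines: for each active, the candidate molecules with the most similar physicochemical property vectors are kept as decoys. The snippet below uses a standardized Euclidean distance on random property vectors as a simple stand-in for the Tanimoto-based distance mentioned above; all names and numbers are illustrative.

```python
# Toy property-matched decoy selection in the spirit of DUD: for each active,
# keep the candidates whose physicochemical property vectors are most similar.
import numpy as np

rng = np.random.default_rng(6)
actives = rng.normal(size=(5, 4))        # rows: molecules; columns: e.g. MW, logP, HBD, HBA
candidates = rng.normal(size=(500, 4))   # hypothetical decoy candidates

# Standardize the properties so that each contributes comparably to the distance.
mu, sd = candidates.mean(axis=0), candidates.std(axis=0)
A = (actives - mu) / sd
C = (candidates - mu) / sd

decoys_per_active = 36                   # DUD used roughly 36 decoys per active
for i, a in enumerate(A):
    dist = np.linalg.norm(C - a, axis=1)
    chosen = np.argsort(dist)[:decoys_per_active]
    print("active", i, "-> first decoy indices:", chosen[:5], "...")
```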
Based on the literature, the DUD database nevertheless carries several biases. The analogue bias limits the chemical space of the active molecules, which is restricted to the series of chemical compounds that have been explored in the source databases (Good and Oprea 2008). Larger structural variability makes it easier to separate active molecules from the decoy compounds and can inflate virtual screening performance, whereas a lack of structural diversity among the actives limits the evaluation of ligand-based virtual screening methods, whose task is to identify potent lead molecules that are structurally different from the reference compounds while retaining similar activity. The artificial enrichment bias reflects differences in structural complexity between the decoy compounds and the active compounds, the latter often being optimized compounds retrieved from large datasets in the patent literature (Stumpfe and Bajorath 2011). False-negative decoys, i.e., truly active molecules hidden in the random decoy set, can also distort the apparent screening performance for hit compounds (Vogel et al. 2011; Bauer et al. 2013). To eliminate at least some of these biases, new strategies have emerged for evaluating virtual screening methods.
scoring functions to find the best hit molecules. However, these alignment-based approaches require manual intervention and a substantial amount of time; hence descriptor-based approaches were introduced for faster screening. Based on the number of properties used, descriptors are classified as 1D, 2D, or 3D. 1D descriptors represent molecular properties such as mass and logP, whereas 2D descriptors, which capture linear molecular properties, come in two forms: real-valued and binary descriptors (Hessler et al. 2005; Reddy et al. 2007). These methods can also be applied by breaking molecules into fragments that are then used for virtual screening; for example, the Topomer search from Tripos is a fragment-based method that can perform an R-group search or a whole-molecule search (Tripos 2007).
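A minimal 2D binary-descriptor screen of this kind, ranking library molecules by Tanimoto similarity of Morgan fingerprints to a reference active, is sketched below; RDKit, the fingerprint settings, and the SMILES strings are illustrative assumptions, not part of the original text.

```python
# Minimal 2D binary-descriptor screen: rank library molecules by Tanimoto
# similarity of Morgan fingerprints to a reference active (RDKit assumed).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

reference = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")       # illustrative query
library = ["c1ccccc1C(=O)O", "CCOC(=O)c1ccccc1", "CCCCCC"]    # illustrative SMILES

ref_fp = AllChem.GetMorganFingerprintAsBitVect(reference, 2, nBits=2048)
ranked = []
for smi in library:
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    ranked.append((DataStructs.TanimotoSimilarity(ref_fp, fp), smi))

for sim, smi in sorted(ranked, reverse=True):
    print(round(sim, 3), smi)
```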
10.10 Inverse-QSPR/QSAR
Inverse QSAR modeling evaluates the descriptor values that a QSAR model associates with high activity and then builds small-molecule structures from these values (Skvortsova et al. 2001). Figure 10.2 represents the overall workflow of the QSAR approach. A major challenge of this method is that the numerical signature associated with high activity must be translated back into a chemical structure (Schneider and Baringhaus 2013; Speck-Planche and Cordeiro 2017). One strategy uses multiple linear regression models and builds a chemical graph that corresponds to the multiple linear regression equation (Schneider and Baringhaus 2013); for example, dedicated descriptors and algorithms have been introduced for inverse QSAR based on multiple regression equations (Faulon et al. 2003; Churchwell et al. 2004; Weis et al. 2005). Inverse QSAR is usually divided into a two-stage process and is conceptually based on the same principles as conventional QSAR, but it aims at chemical structures predicted to have higher activity values. The first stage, generating descriptor values that correspond to activities higher than those in the current training set, is itself challenging; because of the multi-parametric nature of the training data, the data are often ordered into several clusters and the high-activity compounds are optimized using Gaussian mixture models and cluster-wise multiple regression. Among the newer techniques for two-stage inverse QSAR, an autoencoder built on recurrent neural networks (RNNs) has been used to construct a continuous descriptor space by encoding a line notation of the molecules; optimized coordinates in this latent space can be decoded directly back into a line notation by the RNN. This approach does not depend on predefined descriptors and can, in principle, address both stages of inverse QSAR automatically (Gómez-Bombarelli et al. 2018).
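The first stage described above can be illustrated with a toy random search: given a fitted multiple linear regression, descriptor vectors within the training-set bounds are sampled and the one with the highest predicted activity is retained. The back-translation of that vector into a structure, the second stage, is deliberately left out; the data and bounds are illustrative assumptions.

```python
# Toy first stage of inverse QSAR: search the descriptor space (within training
# bounds) for descriptor vectors predicted to beat the best training activity.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(60, 5))                  # hypothetical training descriptors
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=60)

model = LinearRegression().fit(X, y)
lo, hi = X.min(axis=0), X.max(axis=0)

candidates = rng.uniform(lo, hi, size=(10000, X.shape[1]))   # random search in bounds
pred = model.predict(candidates)
best = candidates[np.argmax(pred)]

print("best training activity :", round(y.max(), 2))
print("best predicted activity:", round(pred.max(), 2))
print("target descriptor vector:", np.round(best, 2))
```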
Recently, the field of cheminformatics, which combines mathematics and computational resources, has emerged as a potent approach for solving chemistry-related problems and handling large datasets. Statistical QSAR/QSPR methods model chemical molecules with similar structures and properties and mainly highlight the relationship between the structure and the properties of the molecules (Nieto-Draghi et al. 2015). The identification or synthesis of novel molecules with desired properties has been pursued with the support of QSPR models. Recent advances in QSAR, tuned for inverse QSPR (i-QSPR), were primarily established to target given property values. Stochastic methods have also been used to steer structure generation using molecular identifiers such as SMILES together with neural networks, so as to obtain diverse sets of candidates.
Computing all possible combinations of atoms, graphs, fragments, and bonds is the most direct way to generate molecules; Fink et al. (2005) enumerated chemical molecules with up to 11 heavy atoms and generated a database of possible graphs and atom substitutions. Filtering is useful for removing unrealistic structures, including molecules with invalid valences. Larger databases such as GDB-13 and GDB-17 were generated in the same spirit (Blum and Reymond 2009; Ruddigkeit et al. 2012). Software such as MOLMAKER (Clark et al. 1996) and DREAM++ (Makino et al. 1999) instead performs reaction-based enumeration to produce libraries of possible products. These methodologies are also used in a two-step algorithm in which whole molecules are first generated by combinatorial association and then subjected to screening with QSPR models. To perform i-QSPR, a virtual library is built by combining fragments and is then screened to find the most potent molecules under the given application conditions. i-QSPR combined with the group contribution method (GCM) is also an extensively used approach to find novel molecules with desired properties starting from known structures. Molecule generation is described by the valences and types of the groups attached at each connection point, and the combination of these groups is controlled by about four rules; for example, an aromatic group may be attached as a single-point substituent, whereas a double-point aromatic substituent may neither form a double bond nor combine further. In some implementations, screening is performed with the GCM model itself, and improvements were proposed through bonding rules in which groups are characterized by their binding ability (Brignole et al. 1986); groups are treated as intermediates according to their connectivity, and low and high binding reactivities are used as feasibility criteria. Derringer and Markham (1985) demonstrated contribution techniques to design polymers with three specific properties (water absorption, density, and temperature) by random combination of seven functional groups, with screening applied after the addition of each fragment. Pretel et al. (1994) generated candidate solvents containing aromatic groups by designing intermediate groups having more than one free valence atom. Bolis et al. (1991) reported thermolysin enzyme inhibitors obtained by automating the group selection step and produced a classification process to identify the desired property. The use of GCM together with i-QSAR offers many advantages, but the combinatorial explosion is difficult to handle. This problem can be alleviated by setting proper generation rules and choosing specific groups for the generation; in particular, the groups must contribute compatible bonds through their valences so that the atoms can be bonded properly. GCM nevertheless has limitations: the prediction of features is constrained to the existing fragments, and the fragment size is a compromise between increasing complexity and limiting the feature space considered.
Molecular hologram QSAR (HQSAR) has recently played a key role in the identification of substructural features in molecules that are highly relevant to their biological activity. The key factor differentiating this method from others, such as Free-Wilson and CASE analysis, is that the molecular hologram is generated from fragments that include cyclic and overlapping fragments; thus each atom of a molecule occurs in multiple fragments, unlike in maximal-common-substructure approaches. HQSAR models are typically validated retrospectively on the dataset, for example by randomizing and redistributing the activity data and attempting to fit a statistical model relating the scrambled data to the molecular descriptors (Deshpande et al. 2002; Tropsha et al. 2003).
Since 1964, an extensive variety of QSAR approaches has been designed, starting with the concepts of Free and Wilson and of Hansch and Fujita. The classical 2D-QSAR methods use 2D molecular substituents and their physicochemical properties for quantitative prediction. The 3D-QSAR methods that followed are fast and well established; the first such method, comparative molecular field analysis, introduced by Cramer et al. (1988), served as a basis for the development of further advanced approaches such as CoMSIA, SOMFA, and CoMMA, and of multidimensional nD-QSAR methods such as 4D- and 5D-QSAR, devised to overcome remaining difficulties. More recently, fragment-based methods have attracted significant attention because the effective prediction of molecular fragments with essential properties and potent activities is fast and robust (Zhang et al. 2007). Several lead molecules against a variety of diseases have been predicted with QSAR approaches; for example, antimicrobial and antitumor compounds with strong activity, and series of xanthine derivatives as adenosine antagonists, have been predicted by QSAR modeling (Li et al. 2013). QSAR has also been applied to evaluate epothilones as tubulin depolymerization inhibitors (Lee and Briggs 2001). In addition, QSAR models have been used for structurally diverse antifolates such as cycloguanil, aminopterin, pyrimethamine, and 13 pyrrolo[2,3-d]pyrimidines (Santos-Filho and Hopfinger 2001). The topological polar surface area has been used in 2D-QSAR for the development of models for 14 sets of pharmacologically active compounds (Prasanna and Doerksen 2009). Hydrazone derivatives were predicted as electron acceptors for xanthine oxidase with a QSAR model, and antiviral QSAR models were implemented to predict potent lead molecules against 40 viral species with an mt-QSAR model in which Markov chain theory is used to compute novel multitarget entropy descriptors (Prusis et al. 2004; Prado-Prado et al. 2011). In the drug design setting, validation of the QSAR model is very important in order to judge whether the results satisfy expectations. R2 and Q2 are two statistical measures commonly used for validation (Catalin 2014; Tang et al. 2016), where R2 represents the coefficient of multiple determination and measures how well the model fits the data. Previous reports indicate that R2 should be greater than 0.6 for a model to be considered well fitted. Q2 represents the cross-validated squared correlation and acts as an important criterion of robustness. However, R2 values alone are not sufficient to assess a model, and models must also be tested for their ability to predict lead molecules in an external test set (Tang et al. 2016). Good predictability is indicated when the difference between R2 and Q2 is below about 0.3.
the models. The other applications of the QSAR study were described in the
investigation of antidiabetic drugs based on sitagliptin as a potent antioxidant
agent. Several descriptors such as rotatable bonds, hydrophobicity, hydrogen bond
donor, acceptor atom were used based on the QSAR equation with improved
pharmacological effect as DPP4 inhibitors sitagliptin as a novel potent molecule
(Catalin 2014). Computational methods have been used as good predictive tools for
the evaluation of inhibitory molecules, where the QSAR studies are often used with
docking methods and neural networks. The implementations of important fields such
as steric, electrostatic, and hydrophobic with QSAR were used for the prediction of
xanthine oxidoreductase inhibitors (Veerasamy et al. 2011). Another study reported
that the series of quinolones derivatives also predicted by the QSAR model with
better caspase-3 inhibitors. Based on the QSAR model the new series of quinolone
compounds and then calculated their caspase inhibitory activity (Sharma et al. 2008).
The implementation of QSAR model with radial distribution function (RDF) also
used for the prediction of potent inhibitors against HIV-1 protease. It indicates the
uses of best descriptors like RDF010u, RDF010m, F04[C-N] which play an essential
role in the enzyme binding (Ravichandran et al. 2010). The CoMFA studies with
3D-QASR were used to predict the series of compounds and have been proven as a
valuable method for constructing the predictive model and both the electrostatic and
steric fields were used as descriptors (Baraldi 1999). The molecular descriptors such
as group count, logP, solvent accessible surface area, dielectric energy were consid-
ered to compare the anticancer activity of the lead molecules (Alam and Khan 2014).
QSAR is a widely used technique in the drug design process. The classical QSAR approach provides useful correlations for congeneric series of molecules, and 3D-QSAR techniques have been broadly used and generally yield statistically robust models, although they still have limitations in describing potent molecules. QSAR-based virtual screening can lead to the identification of promising hit molecules. Nevertheless, many QSAR projects do not progress beyond the model-building stage, which reflects a poor understanding of the many interdisciplinary aspects involved and a common unawareness of best practice in the field (Tropsha 2010; Ban et al. 2017). Many studies also concentrate their efforts on the statistical cycle alone, with model validation as the leading objective; however, QSAR modeling is not limited to choosing the best statistical method. Careful selection of the statistical method and rigorous external validation are crucial steps in computation-based drug discovery approaches aimed at developing new compounds with desired properties. The development of novel machine learning algorithms and data curation techniques has therefore emerged as an alternative to classical methods for coping with the growing number of examined molecules available in the literature. Researchers must nevertheless think critically and prioritize the potent lead molecules, because the overall success of QSAR-based virtual screening approaches in drug discovery depends on this step.
References
Acharya C, Coop A, Polli JE, Mackerell AD Jr (2011) Recent advances in ligand-based drug
design: relevance and utility of the conformationally sampled pharmacophore approach. Curr
Comput Aided Drug Des 7:10–22
Afantitis A, Melagraki G, Sarimveis H, Koutentis PA, Markopoulos J, Igglessi-Markopoulou O
(2006) A novel QSAR model for predicting induction of apoptosis by 4-aryl-4H-chromenes.
Bioorg Med Chem 14:6686–6694
Akamatsu M (2002) Current state and perspectives of 3D QSAR. Curr Top Med Chem
2:1381–1394
Alam S, Khan F (2014) QSAR and docking studies on xanthone derivatives for anticancer activity
targeting DNA topoisomerase II α. Drug Des Dev Ther 8:183–195
Allen BK, Mehta S, Ember SW, Schonbrunn E, Ayad N, Schürer SC (2015) Large-scale computa-
tional screening identifies first in class multitarget inhibitor of EGFR kinase and BRD4. Sci Rep
5:16924
Augen J (2002) The evolving role of information technology in the drug discovery process. Drug
Discov Today 7:315–323
Bajorath J (2002) Integration of virtual and high-throughput screening. Nat Rev Drug Discov
1:882–894
Bajorath JR (ed) (2011) Chemoinformatics and computational chemical biology. Humana Press,
Totowa
Balaban AT (1982) Highly discriminating distance-based topological index. Chem Phys Lett
89:399–404
Ban F, Dalal K, Li H, LeBlanc E, Rennie PS, Cherkasov A (2017) Best practices of computer-aided
drug discovery: lessons learned from the development of a preclinical candidate for prostate
cancer with a new mechanism of action. J Chem Inf Model 57:1018–1028
Baraldi PG (1999) Comparative molecular field analysis (CoMFA) of a series of selective adenosine
receptor A2A antagonists. Drug Dev Res 46:126–133
Bauer MR, Ibrahim TM, Vogel SM, Boeckler FM (2013) Evaluation and optimization of virtual
screening workflows with DEKOIS 2.0 – a public library of challenging docking benchmark
sets. J Chem Inf Model 53:1447–1462
Bissantz C, Folkers G, Rognan D (2000) Protein-based virtual screening of chemical databases.
1. Evaluation of different docking/scoring combinations. J Med Chem 43:4759–4767
Bissantz C, Bernard P, Hibert M, Rognan D (2003) Protein-based virtual screening of chemical
databases. II. Are homology models of G-protein coupled receptors suitable targets? Proteins
50:5–25
Blum LC, Reymond JL (2009) 970 million druglike small molecules for virtual screening in the
chemical universe database GDB-13. J Am Chem Soc 131:8732–8733
Bolis G, Di Pace L, Fabrocini F (1991) A machine learning approach to computer-aided molecular design. J Comput Aided Mol Des 5:617–628
Brignole EA, Bottini SB, Gani R (1986) A strategy for the solvents for liquid extraction of solvents
for separation processes. Fluid Phase Equilib 29:125
Brozell SR, Mukherjee S, Balius TE, Roe DR, Case DA, Rizzo RC (2012) Evaluation of DOCK
6 as a pose generation and database enrichment tool. J Comput Aided Mol Des 26:749–773
Catalin B (2014) More effective DPP4 inhibitors as antidiabetics based on sitagliptin applied QSAR
and clinical methods. Curr Comput Aided Drug Des 10:237–249
Chang C, Swaan PW (2006) Computational approaches to modeling drug transporters. Eur J Pharm
Sci 27:411–424
Cherkasov A, Muratov EN, Fourches D, Varnek A, Baskin II, Cronin M (2014) QSAR modeling:
where have you been? Where are you going to? J Med Chem 57:4977–5010
Churchwell CJ, Rintoul MD, Martin S, Visco DP, Kotu A, Larson RS, Sillerud LO, Brown DC,
Faulon J (2004) The signature molecular descriptor. 3. Inverse-quantitative structure-activity
relationship of ICAM-1 inhibitory peptides. J Mol Graph Model 22:263–273
Clark DE, Pickett SD (2000) Computational methods for the prediction of ‘drug-likeness’. Drug
Discov Today 5:49–58
Clark DE, Firth MA, Murray CW (1996) Molmaker: de novo generation of 3D databases for use in
drug design. J Chem Inf Comput Sci 36:137
Cramer RD, Patterson DE, Bunce JD (1988) Comparative molecular field analysis (CoMFA). 1.
Effect of shape on binding of steroids to carrier proteins. J Am Chem Soc 110:5959–5967
Crivori P, Cruciani G, Carrupt PA, Testa B (2000) Predicting blood–brain barrier permeation from
three-dimensional molecular structure. J Med Chem 43:2204–2216
Cruciani G, Crivori P, Carrupt PA, Testa B (2000) Molecular interaction fields in drug discovery:
recent advances and future perspectives. J Mol Struct THEOCHEM 503:17–30
Crum-Brown AFT (1868) On the connection between chemical constitution and physiological
action. Pt 1. On the physiological action of the salts of the ammonium bases, derived from
Strychnia, Brucia, Thebia, Codeia, Morphia and Nicotia. R Soc Edin 2:151–203
Cummings MD, DesJarlais RL, Gibbs AC, Mohan V, Jaeger EP (2005) Comparison of automated
docking programs as virtual screening tools. J Med Chem 48:962–976
de Groot MJ, Ekins S (2002) Pharmacophore modeling of cytochromes P450. Adv Drug Deliv Rev
54:367–383
Dean PM (2005) Computer-aided design of small molecules for chemical genomics. Humana Press
Inc., Totowa
Derringer GC, Markham RL (1985) A computer-based methodology for matching polymer
structures with required properties. J Appl Polym Sci 30:4609–4617
Deshpande M, Kuramochi M, Karypis J (2002) Frequent substructure-based approaches for
classifying chemical compounds. In: Proc of the 8th international conference on knowledge
discovery and data mining, Edmonton
Dessalew N, Singh SK (2008) 3D-QSAR CoMFA and CoMSIA study on benzodipyrazoles as
cyclin dependent kinase 2 inhibitors. Med Chem 4:313–321
Dessalew N, Bharatam PV, Singh SK (2007) 3D-QSAR CoMFA study on aminothiazole
derivatives as cyclin-dependent kinase 2 inhibitors. QSAR Comb Sci 26:85–91
Diller DJ, Li R (2003) Kinases, homology models, and high throughput docking. J Med Chem
46:4638–4647
Douali L, Villemin D, Cherqaoui D (2003) Neural networks: accurate nonlinear QSAR model for
HEPT derivatives. J Chem Inf Comput Sci 43:1200–1207
Ekins S, Lage de Siqueira-Neto J, McCall L-I, Sarker M, Yadav M, Ponder EL (2015) Machine
learning models and pathway genome data base for Trypanosoma cruzi drug discovery. PLoS
Negl Trop Dis 9:e0003878
Empereur-mot C, Guillemain H, Latouche A, Zagury JF, Viallon V, Montes M (2015)
Predictiveness curves in virtual screening. J Cheminform 7:52
Eriksson L, Jaworska J, Worth AP, Cronin MT, McDowell RM, Gramatica P (2003) Methods for
reliability and uncertainty assessment and for applicability evaluations of classification- and
regression-based QSARs. Environ Health Perspect 111:1361–1375
Estrada E, Uriarte E (2001) Quantitative structure-toxicity relationships using TOPS-MODE.
1. Nitrobenzene toxicity to tetrahymena pyriformis. Environ Res 12:309–324
Faulon JL, Visco DP Jr, Pophale RS (2003) The signature molecular descriptor. 1. Using extended
valence sequences in QSAR and QSPR studies. J Chem Inf Comput Sci 43:707–720
Fink T, Bruggesser H, Reymond JL (2005) Virtual exploration of the small-molecule chemical
universe below 160 Daltons. Angew Chem Int Ed 44:1504–1508
Flower DR (2002a) Predicting chemical toxicity and fate. CRC Press, Roca Baton
Flower DR (2002b) Drug design: cutting edge approaches. Royal Society of Chemistry, Cambridge
Fourches D, Muratov E, Tropsha A (2015) Curation of chemogenomics data. Nat Chem Biol
11:535–535
Furnival GM, Wilson RW (1974) Regressions by leaps and bounds. Technometrics 16:499–511
Gao H, Williams C, Labute P, Bajorath J (1999) Binary quantitative structure–activity relationship
(QSAR) analysis of estrogen receptor ligands. J Chem Inf Comput Sci 39:164
Gasteiger J (2003) Handbook of chemoinformatics: from data to knowledge. Wiley, New York
Goh GB, Hodas NO, Vishnu A (2017) Deep learning for computational chemistry. J Comput Chem
38:1291–1307
Gohda K, Mori I, Ohta D, Kikuchi T (2000) A CoMFA analysis with conformational propensity: an
attempt to analyze the SAR of a set of molecules with different conformational flexibility using a
3D-QSAR method. J Comput Aided Mol Des 14:265–275
Golbraikh A, Shen M, Xiao Z, Xiao YD, Lee KH, Tropsha A (2003) Rational selection of training
and test sets for the development of validated QSAR models. J Comput Aided Mol Des
17:241–253
Gómez-Bombarelli R, Wei JN, Duvenaud D, Hernandez-Lobato JM, Sanchez-Lengeling B,
Sheberla D, Aguilera-Iparraguirre J, Hirzel TD, Adams RP, Aspuru-Guzik A (2018) Automatic
chemical design using a data-driven continuous representation of molecules. ACS Cent Sci
4:268–276
Good AC, Oprea TI (2008) Optimization of CAMD techniques 3. Virtual screening enrichment
studies: a help or hindrance in tool selection? J Comput Aided Mol Des 22:169–178
Guner OF (2002) History and evolution of the pharmacophore concept in computer-aided drug
design. Curr Top Med Chem 2:1321–1332
Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using
support vector machines. Mach Learn 46:389
Hall LH, Kier LB (2000) J Chem Inf Comput Sci 30:784–791
Hall DG, Manku S, Wang F (2001) Solution- and solid-phase strategies for the design, synthesis,
and screening of libraries based on natural product templates: a comprehensive survey. J Comb
Chem 3:125–150
Hammet LP (1935) Some relations between reaction rates and equilibrium constants. Chem Rev
17:125–136
Hasegawa K, Arakawab M, Funatsu K (2000) Rational choice of bioactive conformations through
use of conformation analysis and 3-way partial least squares modeling. Chemom Intell Lab Syst
50:253–261
Hecht P (2002) High-throughput screening: beating the odds with informatics-driven chemistry.
Curr Drug Discov 10:21–24
Helguera AM, Combes RD, Gonzalez MP, Cordeiro MN (2008) Applications of 2D descriptors in
drug design: a DRAGON tale. Curr Top Med Chem 8:1628–1655
Hessler G, Zimmermann M, Matter H, Evers A, Naumann T, Lengauer T, Rarey M (2005)
Multiple-ligand-based virtual screening: methods and applications of the MTree approach. J
Med Chem 48:6575–6584
Huang N, Shoichet BK, Irwin JJ (2006) Benchmarking sets for molecular docking. J Med Chem
49:6789–6801
Ibrahim TM, Bauer MR, Boeckler FM (2015) Applying DEKOIS 2.0 in structure-based virtual
screening to probe the impact of preparation procedures and score normalization. Aust J Chem
7:21
Irwin JJ (2008) Community benchmarks for virtual screening. J Comput Aided Mol Des
22:193–199
Jones G, Willett P, Glen RC, Leach AR, Taylor R (1997) Development and validation of a genetic
algorithm for flexible docking. J Mol Biol 267:727–748
Kapetanovic IM (2008) Computer-aided drug discovery and development (CADDD): in-silico-
chemico-biological approach. Chem Biol Interact 171:165–176
Kellenberger E, Rodrigo J, Muller P, Rognan D (2004) Comparative evaluation of eight docking
tools for docking and virtual screening accuracy. Proteins 57:225–242
Kitchen DB, Decornez H, Furr JR, Bajorath J (2004) Docking and scoring in virtual screening for
drug discovery: methods and applications. Nat Rev Drug Discov 3:935
Klebe G, Abraham U, Mietzner T (1994) Molecular similarity indices in a comparative analysis
(CoMSIA) of drug molecules to correlate and predict their biological activity. J Med Chem
37:4130–4146
Koehn FE, Carter GT (2005) The evolving role of natural products in drug discovery. Nat Rev Drug
Discov 4:206–220
Kovatcheva A, Buchbauer G, Golbraikh A, Wolschann P (2003) QSAR modeling of alpha-
campholenic derivatives with sandalwood odor. J Chem Inf Comput Sci 43:259–266
Kumar Singh S, Dessalew N, Bharatam PV (2007) 3D-QSAR CoMFA study on oxindole
derivatives as cyclin dependent kinase 1 (CDK1) and cyclin dependent kinase 2 (CDK2)
inhibitors. Med Chem 3:75–84
Kuntz ID, Blaney JM, Oatley SJ, Langridge R, Ferrin TE (1982) A geometric approach to
macromolecule-ligand interactions. J Mol Biol 161:269–288
Labute PA (2000) Widely applicable set of descriptors. J Mol Graph Model 18:464–477
Lee KW, Briggs JM (2001) Comparative molecular field analysis (CoMFA) study of epothilones-
tubulin depolymerization inhibitors: phramacophore developemt using 3D QSAR methods. J
Comput Aided Mol Des 15:41–55
Lemmen C, Lengauer T (2000) Computational methods for the structural alignment of molecules. J Comput Aided Mol Des 14:215–232
Lewis RA (2005) A general method for exploiting QSAR models in lead optimization. J Med Chem
48:1638–1648
Li P, Tian Y, Zhai H, Deng F, Xie M, Zhang X (2013) Study on the activity of non-purine xanthine
oxidase inhibitor by 3D-QSAR modeling and molecular docking. J Mol Struct 5:56–65
Liu SS, Liu HL, Yin CS, Wang LS (2003) VSMP: a novel variable selection and modeling method based on the prediction. J Chem Inf Comput Sci 43:964–969
MacKerell AD Jr (2004) Empirical force fields for biological macromolecules: overview and issues.
J Comput Chem 25:1584–1604
Makino S, Ewing TJA, Kuntz ID (1999) DREAM++: flexible docking program for virtual combi-
natorial libraries. J Comput Aided Mol Des 13:513–532
Mandel J (1982) Use of the singular value decomposition in regression-analysis. Am Stat 36:15–24
Matter H, Baringhaus KH, Naumann T, Klabunde T, Pirard B (2001) Computational approaches
towards the rational design of drug-like compound libraries. Comb Chem High Scr 4:453–475
McGovern SL, Shoichet BK (2003) Information decay in molecular docking screens against holo,
apo, and modeled conformations of enzymes. J Med Chem 46:2895–2907
Mitchell JBO (2004) Machine learning methods in chemoinformatics. Wiley Interdiscip Rev
Comput Mol Sci 4:468–481
Moustakas DT, Lang PT, Pegg S, Pettersen E, Kuntz ID, Brooijmans N, Rizzo RC (2006)
Development and validation of a modular, extensible docking program: DOCK 5. J Comput
Aided Mol Des 20:601–619
Mysinger MM, Carchia M, Irwin JJ, Shoichet BK (2012) Directory of useful decoys, enhanced
(DUD-E): better ligands and decoys for better benchmarking. J Med Chem 55:6582–6594
Nantasenamat C, Monnor T, Worachartcheewan A, Mandi P, Isarankura-Na-Ayudhya C,
Prachayasittikul V (2014) Predictive QSAR modeling of aldose reductase inhibitors using
Monte Carlo feature selection. Eur J Med Chem 76:352–359
Netzeva TI, Gallegos SA, Worth AP (2006) Comparison of the applicability domain of a quantita-
tive structure-activity relationship for estrogenicity with a large chemical inventory. Environ
Toxicol Chem 25:1223–1230
Neves MA, Totrov M, Abagyan R (2012) Docking and scoring with ICM: the benchmarking results
and strategies for improvement. J Comput Aided Mol Des 26:675–686
Nieto-Draghi C, Fayet G, Creton B, Rozanska X, Rotureau P, de Hemptinne JC, Ungerer P,
Rousseau B, Adamo C (2015) A general guidebook for the theoretical prediction of physico-
chemical properties of chemicals for regulatory purposes. Chem Rev 115:13093–13164
Oprea TI, Davis AM, Teague SJ, Leeson PD (2001) Is there a difference between leads and drugs?
A historical perspective. J Chem Inf Comput Sci 41:1308–1315
Panwar U, Singh SK (2020) Atom-based 3D-QSAR, molecular docking, DFT, and simulation
studies of acylhydrazone, hydrazine, and diazene derivatives as IN-LEDGF/p75 inhibitors.
Struct Chem 2020:1–16
Pastor M, Cruciani G, McLay I, Pickett S, Clementi S (2000) GRid-INdependent Descriptors
(GRIND): a novel class of alignment-independent three-dimensional molecular descriptors. J
Med Chem 43:3233–3243
Prado-Prado FJ, García I, García-Mera X, González-Díaz H (2011) Entropy multi-target QSAR
model for prediction of antiviral drug complex networks. Chemom Intell Lab Syst 107:227–233
Prasanna S, Doerksen RJ (2009) Topological polar surface area: a useful descriptor in 2D-QSAR.
Curr Med Chem 16:21–41
Pretel EJ, López PA, Bottini SB, Brignole EA (1994) Computer-aided molecular design of solvents
for separation processes. AICHE J 40:1349–1360
Proudfoot JR (2002) Drugs, leads, and drug-likeness: an analysis of some recently launched drugs.
Bioorg Med Chem Lett 12:1647–1650
Prusis P, Dambrova M, Andrianov V, Rozhkov E, Semenikhina V, Piskunova I, Ongwae E,
Lundstedt T, Kalvinsh I, Wikberg JES (2004) Synthesis and quantitative structureactivity
relationship of hydrazones of N-amino-N‘-hydroxyguanidine as electron acceptors for xanthine
oxidase. J Med Chem 47:3105–3110
Rarey M, Kramer B, Lengauer T, Klebe G (1996) A fast flexible docking method using an
incremental construction algorithm. J Mol Biol 261:470–489
Ravichandran V, Shalini S, Sundram KM, Dhanaraj SA (2010) QSAR study of substituted 1, 3,
4-oxadiazole naphthyridines as HIV-1 integrase inhibitors. Eur J Med Chem 45:2791–2797
Reddy AS, Pati SP, Kumar PP, Pradeep HN, Sastry GN (2007) Virtual screening in drug discovery -
a computational perspective. Curr Protein Pept Sci 8:329–351
Reddy KK, Singh SK, Dessalew N, Tripathi SK, Selvaraj C (2012) Pharmacophore modelling and
atom-based 3D-QSAR studies on N-methyl pyrimidones as HIV-1 integrase inhibitors. J
Enzyme Inhib Med Chem 27:339–347
Reddy KK, Singh SK, Tripathi SK, Selvaraj C (2013a) Identification of potential HIV-1 integrase
strand transfer inhibitors: in silico virtual screening and QM/MM docking studies. SAR QSAR
Environ Res 24:581–595
Reddy KK, Singh SK, Tripathi SK, Selvaraj C, Suryanarayanan V (2013b) Shape and
pharmacophore-based virtual screening to identify potential cytochrome P450 sterol 14-
α-demethylase inhibitors. J Recept Signal Transduction 33:234–243
Repasky MP, Murphy RB, Banks JL, Greenwood JR, Tubert-Brohman I, Bhat S, Friesner RA
(2012) Docking performance of the glide program as evaluated on the Astex and DUD datasets:
a complete set of glide SP results and selected results for a new scoring function integrating
WaterMap and glide. J Comput Aided Mol Des 26:787–799
Rester U (2006) Dock around the clock - current status of small molecule docking and scoring.
QSAR Comb Sci 25:605–615
Roy K, Mitra I, Kar S, Ojha PK, Das RN, Kabir H (2012) Comparative studies on some metrics for
external validation of QSAR model. J Chem Inf Model 52:396–408
210 V. Srivastava et al.
Roy K, Kar S, Das RN (2015) Background of QSAR and historical developments. In: Das KRKN
(ed) Understanding the basics of QSAR for applications in pharmaceutical sciences and risk
assessment. Academic, Boston, pp 1–46
Ruddigkeit L, van Deursen R, Blum LC, Reymond JL (2012) Visualization and virtual screening of
the chemical universe database GDB-17. J Chem Inf Model 52:2864
Ruggeri C, Drinkwater N, Sivaraman KK, Bamert RS, McGowan S, Paiardini A (2011) Identifica-
tion and validation of a potent dual inhibitor of the P. falciparum M1 and M17 aminopeptidases
using virtual screening. PLoS ONE 10:e0138957
Rusinko A III, Young SS, Drewry DH, Gerritz SW (2002) Optimization of focused chemical
libraries using recursive partitioning. Comb Chem High Scr 5:125–133
Saliner AG, Netzeva TI, Worth AP (2006) Prediction of estrogenicity: validation of a classification
model. Environ Res 17:195–223
Santos-Filho OA, Hopfinger AJ (2001) A search for sources of drug resistance by the 4D-QSAR
analysis of a set of antimalarial dihydrofolate reductase inhibitors. J Comput Aided Mol Des
15:1–12
Schneider G, Baringhaus KH (2013) De novo design: from models to molecules. In: De novo
molecular design. Wiley, Weinheim, pp 1–55
Schneider G, Fechner U (2005) Computer-based de novo design of drug-like molecules. Nat Rev
Drug Discov 4:649
Selvaraj C, Singh P, Singh SK (2014) Molecular insights on analogs of HIV PR inhibitors toward
HTLV‐1 PR through QM/MM interactions and molecular dynamics studies: comparative
structure analysis of wild and mutant HTLV‐1 PR. J Mol Recognit 27:696–706
Sharma S, Ravichandran V, Jain PK, Mourya VK, Agrawal RK (2008) Prediction of caspase-3
inhibitory activity of 1,3-dioxo-4-methyl-2,3- dihydro-1h-pyrrolo[3,4-c] quinolines: QSAR
study. J Enzyme Inhib Med Chem 23:424–431
Shen M, LeTiran A, Xiao Y, Golbraikh A, Kohn H, Tropsha A (2002) Quantitative structure-
activity relationship analysis of functionalized amino acid anticonvulsant agents using k nearest
neighbor and simulated annealing PLS methods. J Med Chem 45:2811–2823
Shen M, Xiao Y, Golbraikh A, Gombar VK, Tropsha A (2003) Development and validation of k-
nearest-neighbor QSPR models of metabolic stability of drug candidates. J Med Chem
46:3013–3020
Singh SK, Dessalew N, Bharatam PV (2006) 3D-QSAR CoMFA study on indenopyrazole
derivatives as cyclin dependent kinase 4 (CDK4) and cyclin dependent kinase 2 (CDK2)
inhibitors. Eur J Med Chem 41:1310–1319
Skvortsova MI, Fedyaev KS, Palyulin VA, Zefirov N (2001) Inverse structure-property relationship
problem for the case of a correlation equation containing the Hosoya index. Dokl Chem
379:191–195
Southan C, Várkonyi P, Muresan S (2009) Quantitative assessment of the expanding complemen-
tarity between public and commercial databases of bioactive compounds. J Chem 1:10
Speck-Planche A, Cordeiro MNDS (2017) Fragment-based in silico modeling of multi-target
inhibitors against breast cancer-related proteins. Mol Divers 21:511–523
Spitzer R, Jain AN (2012) Surflex-dock: docking benchmarks and real-world application. J Comput
Aided Mol Des 26:687–699
Stanton DT, Egolf LM, Jurs PC, Hicks MG (1992) Computer assisted prediction of normal boiling
points of pyrans and pyrroles. J Chem Inf Comput Sci 32:306–316
Stumpfe D, Bajorath J (2011) Applied virtual screening: strategies, recommendations, and caveats.
In: Sotriffer C (ed) Virtual screening: principles, challenges, and practical guidelines. Wiley,
Weinheim, pp 291–318
Suryanarayanan V, Kumar Singh S, Kumar Tripathi S, Selvaraj C, Konda Reddy K, Karthiga A
(2013) A three-dimensional chemical phase pharmacophore mapping, QSAR modelling and
electronic feature analysis of benzofuran salicylic acid derivatives as LYP inhibitors. Environ
Res 24:1025–1040
10 Chemoinformatics and QSAR 211
11 Computational Genomics
F. A. Araújo et al.
The first next-generation sequencing platform available to the public was the 454 GS/20 Roche machine, which offered an increase in the number of reads and in read length (up to 100 bp), as well as improved data quality (Voelkerding
et al. 2009). The much greater number of reactions generated by parallel sequencing at the micrometer scale, made possible by improvements in microfabrication and high-resolution imaging, was the point that defined the second generation of DNA sequencing (Shendure and Ji 2008).
After the 454 machine, other sequencing platforms were developed, such as Solexa, which was acquired by Illumina in 2005 (Voelkerding et al. 2009), and SOLiD from Applied Biosystems (Thermo Fisher Scientific) in 2006, all based on fluorescence detection and characterized by low cost and high sequencing coverage (McKernan et al. 2009). In addition, another platform, developed in 2010 but not based on fluorescence or luminescence, was Ion Torrent. This technology detects nucleotide incorporation through the pH change generated by the release of H+ ions during polymerization (Rothberg et al. 2011). Most of these new technologies allow the construction of paired genomic libraries, which are useful for resolving repetitive regions of the genome during de novo assembly. All of these factors contributed to the increase in projects for sequencing complete genomes (Mardis 2011; Scholz et al. 2012).
In 2009, a new way of sequencing single DNA molecules in real time was established by Pacific Biosciences, culminating in the launch of a sequencer known as SMRT (single-molecule real time). Sequencing takes place through a single DNA polymerase molecule fixed at the bottom of a ZMW (zero-mode waveguide), a detector a few nanometers in size made of a metal film deposited on a glass substrate (Van Dijk et al. 2014); this approach can generate long reads of around 10 kb (Schadt et al. 2010). However, single-molecule technology also holds great promise in the form of nanopore sequencing, an approach that was first established even before second-generation sequencing appeared (Holley et al. 1965).
In early 2012, the first nanopore sequencing platform was announced by Oxford Nanopore, introducing two main sequencers, GridION and MinION, capable of generating large amounts of data with simple sample preparation and producing long reads at a low cost. MinION is a small, portable device capable of sequencing 30 Gb of DNA, while GridION can generate up to 150 Gb of data transmitted in real time for immediate analysis. Another technology launched by the same company was PromethION, which can generate up to 8 Tb of data (van Dijk et al. 2018).
Transcriptome sequencing also makes possible the identification of genes that generally do not function independently and whose functions are not controlled directly by the promoter but by many other regulatory elements, such as enhancers, response elements, and silencers (Heather and Chain 2016; Holley et al. 1965). Beyond quantifying the heterogeneity of gene expression from cells to tissues and organs, this method is important because it offers the initial steps for the functional annotation and characterization of genes and genomes previously revealed by DNA sequencing (Altman and Raychaudhuri 2001); it supports the reconstruction of genetic interaction networks to comprehend cellular functions, growth and development, and biological systems (Hsiao et al. 2000); it produces molecular fingerprints of disease development and prognosis to identify potential targets for diagnosis and drug development (Celis et al. 2000); and it makes it possible to study the interaction between host and pathogen and to develop new strategies for therapeutic and prophylactic intervention (Manger and Relman 2000). Thus, analysis of the complete transcriptome provides a basis for exploring the regulatory pathways and genetic networks that qualitatively and quantitatively control phenotypes important for agriculture and human medicine (Jiang et al. 2015).
The sequence of bases in DNA is of huge importance, as it contains the code for the formation of many proteins and, therefore, encodes much of the complexity and diversity of life itself (Martorell-Marugán et al. 2019). The unique order of these bases in DNA creates the basic hereditary units, the genes. The human genome project initially estimated that there would be 30,000-40,000 genes in the human genome (Lander et al. 2001; Venter et al. 2001), and these estimates were later revised downward to roughly 20,000-25,000 genes (Pennisi 2003). Based on the DNA sequence, enzymes such as RNA polymerase create single-stranded messenger RNA (mRNA) that is later translated into proteins. This entire process of decoding the DNA sequence into a protein is referred to as the "central dogma" of molecular biology (Crick 1970). Depending on the organism, not all genes encode proteins; proteins, being composed of amino acids, are much more complex than nucleic acids. There are 20 main amino acids that form proteins, and each protein can combine them in different numbers and orders. The amino acid sequence of a protein is also crucial, as it determines not only the physical-chemical properties of the protein but also the different conformations that it can adopt in three-dimensional space (Anfinsen 1973). These conformations result in complex protein structures that, in turn, perform unique biological functions, such as transport, functional regulation, and homeostasis. Therefore, it is of great importance to identify the correct sequence of nucleotides in DNA/RNA and of amino acids in proteins.
The comparison of biological sequences allows us to examine the differences between organisms and species at the gene level. Comparative genomics, a branch of science that makes intensive use of bioinformatics techniques to track genes across various species and to study their similarities and differences, uses these comparisons to infer the functional and structural characteristics of newly discovered or existing proteins. The analysis of biological sequences does not differ much from the techniques used to compare strings and texts, and therefore the concept of alignment becomes very important. As sequences evolve within species and clades, mutations introduce insertions, deletions (indels), and mismatches. When comparing two biological sequences, an alignment is generated to visualize the differences between the sequences at each position (Martorell-Marugán et al. 2019).
Sequence alignment is one of the main tasks of bioinformatics. It consists of aligning a query sequence with a reference sequence, usually stored in a public sequence database, with the aim of determining whether the two have statistically significant correspondences (Gusfield 1997). It differs from the classic computational problem of exact string matching (Cormen et al. 2001), where the interest is in finding exact matches. Sequence alignment is an approximate string-matching problem, that is, string matching that allows for errors (Navarro 2001). The problem, in its most general form, is to find in a text (or sequence of characters) the positions where a certain pattern occurs, allowing a limited number of errors in the correspondences. The distance between two sequences is defined as the minimum-cost sequence of operations necessary to transform one into the other. Costs are assigned to operations according to their probability, so that the most likely operations cost less, and the objective is to minimize the total cost (Li and Homer 2010). Ultimately, the final goal of sequence alignment is to determine the similarity between parts of the genomic code. Among the known applications of this type of task are gene discovery, function prediction, and assembly of the genome sequence.
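A minimal Python sketch of this minimum-cost formulation is given below; the operation costs are illustrative and the function name is ours, so it should be read as an outline rather than a reference implementation.

# Minimal sketch (assumed costs) of the weighted edit distance described above:
# the distance between two sequences is the minimum total cost of substitutions,
# insertions, and deletions needed to transform one into the other.

def edit_distance(a, b, sub_cost=1, indel_cost=1):
    """Dynamic-programming edit distance with configurable operation costs."""
    n, m = len(a), len(b)
    # dp[i][j] = minimum cost to transform a[:i] into b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost          # delete all of a[:i]
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost          # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if a[i - 1] == b[j - 1] else sub_cost
            dp[i][j] = min(
                dp[i - 1][j - 1] + match,  # match or substitution
                dp[i - 1][j] + indel_cost, # deletion from a
                dp[i][j - 1] + indel_cost, # insertion into a
            )
    return dp[n][m]

print(edit_distance("GATTACA", "GCATGCA"))  # minimum number of edits with unit costs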
An alignment between two strings is simply a matching of pairs between the letters in each string. The alignment of nucleotide or amino acid sequences is able to reflect the evolutionary relationship between two or more homologous sequences that share a common ancestor. If the same letter is present in both sequences at a given position, the position was preserved during evolution. If the letters are different, it is possible to infer that the two strings derive from an ancestral letter (which may be one of the two or neither) (Koonin and Galperin 2013). However, homologous sequences can have different lengths, which can be partly explained by insertions or deletions in the sequences. In this way, a letter or a stretch of letters can be paired with dashes in the other sequence to represent such an insertion or deletion (Fig. 11.1).
Global Alignment
Also called end-to-end alignment, this method tries to align all the residues in each sequence. The approach is useful when the sequences being compared are similar and of approximately the same size. Needleman and Wunsch were the first to present an algorithm capable of finding the global alignment between two amino acid sequences. The algorithm is based on dynamic programming and achieves the global alignment of two sequences (Needleman and Wunsch 1970). It covers three main steps: initialization, calculation, and trace back. A matrix of dimensions i, j is initialized, where i and j are the lengths of the two sequences being compared. Next, the highest score F(i, j) for each position is calculated as

F(i, j) = max{ F(i-1, j-1) + s(x_i, y_j), F(i-1, j) - d, F(i, j-1) - d },

where s(x_i, y_j) is the match/mismatch score and d is the penalty for an insertion or deletion (gap).
After calculating the maximum score for each position in the matrix (Fig. 11.2),
the trace back starts from the last cell (bottom right) in the matrix. At each step, it
moves from the current cell to the one from which the current cell value was derived.
A match or mismatch is assigned if the maximum score was derived from a diagonal
cell. An insertion/deletion is assigned if the score was derived from the top or left
cell. After the trace back is complete, there are two sequences aligned end to end
with an optimal alignment score (Durbin et al. 1998).
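A minimal Python sketch of this procedure is shown below; it is illustrative only (the scoring values are arbitrary and the function name is ours), filling the score matrix with the recurrence above and then tracing back from the bottom-right cell.

# Minimal sketch of Needleman-Wunsch global alignment (illustrative scores).

def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    n, m = len(x), len(y)
    # Initialization: first row/column filled with cumulative gap penalties.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    # Calculation: F(i,j) = max(diagonal + s, up + gap, left + gap).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), F[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))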
Local Alignment
This type of alignment is most useful for dissimilar sequences that probably contain regions of similarity within a larger sequence context. Smith and Waterman (Smith et al. 1981) introduced a different algorithm for scoring similarities in order to find optimal local alignment subsequences, even at the cost of the global score. The algorithm achieves local alignment of sequences and is quite similar to the Needleman–Wunsch method. Local alignment can be used in situations where you want to align smaller substrings of two sequences; in the biological context, such a situation can arise during the search for a domain or motif within larger sequences. The algorithm comprises the same steps as Needleman–Wunsch, but with two main differences. First, the calculation of the maximum score also includes an option of 0:

F(i, j) = max{ 0, F(i-1, j-1) + s(x_i, y_j), F(i-1, j) - d, F(i, j-1) - d }.

Second, the trace back starts from the highest-scoring cell in the matrix and stops when a cell with a score of 0 is reached (Fig. 11.3).
Fig. 11.3 Smith–Waterman matrices with a linear gap penalty, using scores of +2 for a match, -1 for a mismatch, and -1 for a gap. The left matrix is the corresponding (n + 1) by (m + 1) score matrix for the input sequences; the right matrix is the trace back matrix, with red arrows indicating the optimal alignment path. The null pointer is represented as 0
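The following Python sketch mirrors this idea (again illustrative only, using the same +2/-1/-1 scoring as Fig. 11.3); it keeps track of the highest-scoring cell and traces back until a zero cell is reached.

# Minimal sketch of Smith-Waterman local alignment (illustrative scores,
# mirroring Fig. 11.3: +2 match, -1 mismatch, -1 gap).

def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 option lets a local alignment start anywhere.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the highest-scoring cell until a 0 cell is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), best

print(smith_waterman("TGTTACGG", "GGTTGACTA"))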
In heuristic tools such as BLAST, the use of seeds to first search for exact matches greatly speeds up the entire search process, and extending alignments without gaps loses only a small set of significant matches. BLAST's accuracy and sensitivity have made it one of the most widely used search algorithms in the biological world (Martorell-Marugán et al. 2019). A variant of BLAST called Position-Specific Iterative BLAST (PSI-BLAST) extends the basic BLAST algorithm. PSI-BLAST (Altschul et al. 1997) performs several BLAST iterations and uses the hits found in one iteration as the query profile for the next iteration. Although PSI-BLAST is slower because of the large amount of calculation required, it is considered a reliable tool for finding distant homology relationships (Martorell-Marugán et al. 2019).
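As a toy illustration of the seed idea (not BLAST's actual implementation; the k-mer size, scores, and drop-off threshold below are arbitrary), the following sketch indexes exact k-mer seeds from a target sequence and extends each hit without gaps:

# Toy illustration of the seed-and-extend idea used by heuristic search tools
# (not BLAST's actual implementation): index exact k-mer seeds from the
# target sequence, then extend each seed hit without gaps.
from collections import defaultdict

def build_kmer_index(seq, k):
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def ungapped_extend(query, target, qpos, tpos, k, match=1, mismatch=-1, drop=3):
    """Extend a seed to the right, remembering the best score seen so far."""
    score = best = k * match
    q, t = qpos + k, tpos + k
    while q < len(query) and t < len(target) and best - score < drop:
        score += match if query[q] == target[t] else mismatch
        best = max(best, score)
        q += 1; t += 1
    return best

def seed_and_extend(query, target, k=4):
    index = build_kmer_index(target, k)
    hits = []
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, tpos, ungapped_extend(query, target, qpos, tpos, k)))
    return sorted(hits, key=lambda h: -h[2])   # best-scoring hits first

print(seed_and_extend("ACGTACGTTGCA", "TTACGTACGAAGCATTACG"))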
Although BLAST and PSI-BLAST are still widely used, some recently developed methods offer results with greater precision and sensitivity. Hidden Markov Models (HMMs) have been used efficiently in numerous applications to understand and explore biological data. An example is the fast HMM-HMM sequence search tool HHblits (Remmert et al. 2012). The tool can be used as an alternative to BLAST and PSI-BLAST and is 50-100 times more sensitive. This high sensitivity can be attributed to an algorithm based on the comparison of HMM representations of the sequences. Although profile-profile or HMM-HMM alignments are very slow because of the calculations involved, the HHblits prefilter reduces the number of alignments required from millions to thousands, increasing its speed considerably. HHblits represents each sequence in the database as an HMM profile. This prefiltering reduces the number of HMM comparisons needed in the similarity search, selecting only those target sequences for which a high-scoring ungapped alignment exists and, subsequently, a Smith–Waterman alignment with a significant E-value.
There is another set of methods used to perform Multiple Sequence Alignment (MSA) while reducing the errors inherent to progressive methods; they are called iterative methods. These methods work in a similar way to progressive methods, but they repeatedly realign the initial sequences as they add new sequences to the growing MSA. A widely used iteration-based algorithm is MUSCLE (Multiple Sequence Alignment by Log-Expectation), which improves on the performance of progressive methods through a more accurate distance measure to assess the relationship between two sequences (Edgar 2004).
Both pairwise and MSA algorithms use substitution matrices to assign scores to sequence alignments. These matrices evaluate potential substitutions for protein and nucleic acid sequences. Each possible residue substitution receives a score that reflects the probability of the change. The two best-known protein substitution matrix models are Percent Accepted Mutation (PAM) (Dayhoff et al. 1978) and Blocks Substitution Matrix (BLOSUM) (Henikoff and Henikoff 1992), and a small illustration of matrix-based scoring is sketched below.
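The sketch below scores an ungapped alignment column by column with a small toy substitution matrix; the numerical values and the fixed gap penalty are illustrative only and are not the published PAM or BLOSUM entries.

# Sketch of scoring an aligned pair of sequences column by column with a
# substitution matrix. The values below are illustrative only, not published
# PAM or BLOSUM entries; in practice the full 20x20 matrix would be loaded.
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "W"): -3,
    ("G", "G"): 6, ("G", "W"): -2,
    ("W", "W"): 11,
}

def pair_score(a, b, matrix=TOY_MATRIX):
    # Substitution matrices are symmetric, so look up either orientation.
    return matrix.get((a, b), matrix.get((b, a), 0))

def alignment_score(seq1, seq2, gap=-4, matrix=TOY_MATRIX):
    assert len(seq1) == len(seq2), "aligned sequences must have equal length"
    score = 0
    for a, b in zip(seq1, seq2):
        score += gap if "-" in (a, b) else pair_score(a, b, matrix)
    return score

print(alignment_score("AGW-A", "AGWGA"))  # scores each column, penalizing gaps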
The methods mentioned above use traditional approaches to solve the MSA problem (Edgar and Batzoglou 2006). In practice, because MSA is an NP-hard problem, these methods often fail to produce optimal multiple alignments. To overcome this problem, metaheuristic approaches to the MSA problem were developed. The basic concept of metaheuristics allows the problem to be described at an abstract level while taking advantage of domain-specific knowledge in the form of heuristics, which are in turn controlled by a higher-level strategy. Metaheuristics are generally used to guide the underlying heuristics. These methods exploit the search space in
From the data generated by NGS technologies, several new applications have
emerged, such as the study of microbial communities, the discovery of structural
variants in genomes, and the analysis of gene structure and expression (Chen et al.
2017). Due to the small length of sequences generated by the most common NGS
platforms, many of these analyses begin with the computational process of sequence
assembly, which consists of grouping the generated fragments based on their base
identity (Nagarajan and Pop 2013).
There are two general approaches to assembling NGS fragments: reference-based and de novo approaches. In the reference-based assembly, a reference genome of the same organism or of a related species is used as a guide to align the reads; this is, in many cases, the scenario in the analysis of resequencing data (Pop et al. 2004).
Fig. 11.5 Main alignment concepts: matches, mismatches, insertions, deletions, and percentage of
identity
Burrows–Wheeler transform-based indexes are widely used in mapping applications (Zhang et al. 2013). In this context, two software tools stand out: BWA (Li et al. 2008) and Bowtie (Langmead et al. 2009).
Bowtie begins by constructing the FM-index for the reference sequence and then uses a modified version of the Ferragina and Manzini mapping algorithm to locate the position of the alignment. Currently, two versions can be found, Bowtie and Bowtie 2 (Langmead and Salzberg 2012). Bowtie 2 was developed mainly to handle reads longer than 50 base pairs, while the first version of Bowtie handles only sequences up to 35 base pairs.
BWA is very similar to Bowtie; it also uses a modified version of the Ferragina and Manzini (2000) mapping algorithm to find exact matches. To handle inexact matches, BWA searches for matches among subsequences of the reference sequence within a defined distance. In general, Bowtie is best suited for most analyses, while BWA performs better for longer reads (Hatem et al. 2013).
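To make the FM-index idea concrete, the sketch below builds the Burrows–Wheeler transform of a toy reference and counts exact occurrences of a pattern with backward search; it is a simplified illustration only, since real mappers such as Bowtie and BWA use compressed indexes, suffix-array sampling, and inexact-match extensions.

# Simplified sketch of the Burrows-Wheeler transform and FM-index-style
# backward search (illustrative only).

def bwt(reference):
    s = reference + "$"                      # unique end-of-string sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(bwt_str, pattern):
    # C[c]: index of the first sorted rotation starting with character c.
    first_col = sorted(bwt_str)
    C = {}
    for idx, c in enumerate(first_col):
        C.setdefault(c, idx)
    # Backward search: narrow the [top, bot) interval one pattern symbol at a time.
    top, bot = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in C:
            return 0
        top = C[c] + bwt_str[:top].count(c)   # rank queries (naive here)
        bot = C[c] + bwt_str[:bot].count(c)
        if top >= bot:
            return 0
    return bot - top

reference = "ACGTACGTGACGA"
index = bwt(reference)
print(count_occurrences(index, "ACG"))       # number of exact matches in the reference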
The de novo assembly approach is based on the overlap of reads, or parts of reads, with one another (Martin and Wang 2011). This strategy is useful for unknown genomes, such as new strains or species, and is able to represent regions that cannot be identified by reference-based assembly because they are absent from the reference genome.
To improve the accuracy of de novo assembly, it is highly recommended to trim low-quality bases (according to the Phred metric) from the ends of the reads and to discard reads with low quality scores in order to avoid misassemblies. After the launch of next-generation sequencing platforms, the challenge became to group short reads (<30 bp) based on their identity to produce long sequences (contigs); contigs, in turn, can be ordered and oriented to generate scaffolds (Baker 2012; El-Metwally et al. 2013; Martin and Wang 2011).
One of the main difficulties is the assembly of repeated regions of the genome that are longer than the read length (El-Metwally et al. 2013); in addition, some bases or regions cannot be represented in the assembled genome. These regions are called gaps and are usually represented by the character N (Baker 2012).
Currently, increasing read lengths and the use of paired libraries help to resolve some repeated regions in prokaryotes, but repeats remain a real problem during the assembly of eukaryotes because of their larger repeated regions (Nowak et al. 2019).
The main strategies used to assemble genomes are the Eulerian De Bruijn graph, the Hamiltonian De Bruijn graph, the string graph, and Overlap-Layout-Consensus (OLC) (Sohn and Nam 2018), which are implemented, with some differences, by many assemblers, such as ALLPATHS (Butler et al. 2008), Velvet (Zerbino and Birney 2008), ABySS (Simpson et al. 2009), SOAPdenovo (Luo et al. 2012), and SPAdes (Bankevich et al. 2012).
Before executing any of these approaches, the user can adopt strategies to correct errors in the reads, most of them based on the frequency of reads or k-mers to distinguish confident from erroneous sequences (Sohn and Nam 2018).
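A minimal sketch of these two ideas, counting k-mers to flag likely sequencing errors and linking the remaining k-mers into a De Bruijn graph, is shown below; the reads, k-mer size, and frequency threshold are illustrative, and real assemblers add error correction and extensive graph simplification.

# Minimal sketch of k-mer-based error flagging and De Bruijn graph construction
# (illustrative only).
from collections import Counter, defaultdict

def kmer_counts(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def de_bruijn_graph(reads, k, min_count=2):
    counts = kmer_counts(reads, k)
    # k-mers seen fewer than min_count times are treated as likely errors.
    solid = {kmer for kmer, c in counts.items() if c >= min_count}
    graph = defaultdict(set)
    for kmer in solid:
        graph[kmer[:-1]].add(kmer[1:])      # edge: prefix (k-1)-mer -> suffix (k-1)-mer
    return graph

reads = ["ACGTACGT", "CGTACGTT", "GTACGTTA", "ACGTACGA"]  # toy reads
for node, successors in de_bruijn_graph(reads, k=4).items():
    print(node, "->", sorted(successors))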
The quality of the results obtained from the genome assembly process can be evaluated based on the contig lengths, the number of bases and contigs generated, and how large the produced sequences are. The common metrics used to explore the results are: N50, which means that 50% of the bases generated by the assembly are contained in contigs with length greater than or equal to the N50 value; NG50, which is analogous to N50 but computed with respect to the length of the reference genome rather than the assembly; and L50, the number of contigs needed to reach the N50 value (Earl et al. 2011). Other metrics should also be evaluated, such as the number of bases produced, the number of contigs, and, when a reference genome is available, the number of misassemblies, which can be checked with software such as Quast (Gurevich et al. 2013).
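These metrics are straightforward to compute from a list of contig lengths, as the sketch below illustrates (toy values; tools such as Quast report them directly).

# Minimal sketch of computing N50, NG50, and L50 from contig lengths
# (illustrative values only).

def n50_l50(contig_lengths, genome_size=None):
    """Return (Nx50, L50); pass genome_size to compute NG50 instead of N50."""
    total = genome_size if genome_size is not None else sum(contig_lengths)
    half = total / 2
    cumulative = 0
    for rank, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        cumulative += length
        if cumulative >= half:
            return length, rank
    return 0, len(contig_lengths)   # assembly shorter than half the genome

contigs = [1200, 950, 700, 400, 300, 150]       # toy contig lengths
print("N50, L50:", n50_l50(contigs))
print("NG50:", n50_l50(contigs, genome_size=4500)[0])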
New hybrid strategies have been developed to take advantage of each type of assembly. Among them, it is possible to highlight techniques that combine reads and assemblies from different sequencing technologies and different assembly algorithms, which can be applied to several tasks, such as de novo assembly, sequencing error correction, and sequence quality improvement (Hatakeyama et al. 2018). This type of hybrid assembly makes use of reads from different sequencers to reconstruct the genome, mostly using overlap-layout-consensus-based methods.
Another hybrid assembly approach occurs when different assemblers are used. Rather than performing assembly from reads, this kind of hybrid strategy, also known as meta-assembly, uses assemblies generated by different assemblers, combining the results (contigs and/or scaffolds) produced by those tools to produce a new sequence. However, the concepts of hybrid assembler and hybrid assembly should not be confused. When it comes to assemblers, "hybrid" refers to the ability of an assembler to work with short and long reads, whereas in relation to the assembly process, "hybrid" refers to the use of more than one type of assembly strategy (DBG/OLC), sequencer (regardless of read type), or input dataset (reads/assemblies) (Miller et al. 2017).
After finishing the genome assembly process and obtaining a whole genome or a draft, the next step is the identification of the Open Reading Frames (ORFs), the sequences between a start and a stop codon, followed by gene annotation: the process of obtaining metadata about the gene product of each identified ORF. Gene annotation is often based on biological databases that describe the function, products, and processes in which a gene can be involved, among other information. Some gene prediction methods are based on a training dataset, so when the set of genes chosen for training is not good, it can lead to poor results. For this training task, most programs today use Markov models (HMMs, Hidden Markov Models, or IMMs, Interpolated Markov Models), for example SNAP (Korf 2004), GlimmerHMM (Majoros et al. 2004), GeneMark (Lukashin and Borodovsky 1998), and GlimmerIMM (Salzberg et al. 1998); the genes are modeled with Markov models that use a series of states to represent a generic gene structure. Training data for gene prediction and annotation programs are often chosen at random from a subset of high-quality genes that ideally represent the variation found in a genome. When programs for gene prediction and annotation are trained on a subset of grass genes chosen at random with respect to GC content, they are effectively being trained on two classes of genes at the same time, and this may result in poor output when genes are predicted in new genome sequences.
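To illustrate the first step, the sketch below scans the three forward reading frames of a toy sequence for ORFs delimited by a start and a stop codon; it is a simplification, since real gene finders also scan the reverse complement and rely on trained probabilistic models rather than simple codon scanning.

# Minimal sketch of ORF identification on the forward strand (illustrative only).
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orfs(sequence, min_length=90):
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):                       # three forward reading frames
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if codon == START and start is None:
                start = i                        # remember the first start codon
            elif codon in STOPS and start is not None:
                if i + 3 - start >= min_length:
                    orfs.append((start, i + 3, frame))
                start = None
        # ORFs without an in-frame stop codon are ignored in this sketch.
    return orfs

toy_genome = "CCATGAAACCCGGGTTTACGTAGGCATGTTTAAACTGACC"
print(find_orfs(toy_genome, min_length=9))       # (start, end, frame) tuples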
Currently, a human genome can be sequenced for less than a thousand dollars. This reduction in sequencing cost was accompanied by advances in assembly and alignment algorithms, and as a result, obtaining a high-quality draft assembly became an achievable goal for most genome projects. The bottleneck in genomic studies has therefore changed focus: genome annotation has become a challenging task because of the difficulty of collecting or predicting proteins, mainly for large genomes, requiring other data sources, such as RNA-Seq and databases, to train, optimize, and configure gene annotation tools (Yandell and Ence 2012).
Manual curation can be used to improve the quality of gene annotation, for example to assign Gene Ontology (GO) terms (Consortium 2015) or gene products based on biological annotation databases. Tools such as Blast2GO (Conesa et al. 2005) and GoFeat (Araujo et al. 2018) use already annotated genes and their sequence similarity to derive new information and insights; such computational approaches account for most of the annotations found in biological databases.
The accuracy of gene annotation is essential for downstream analyses that evaluate the genes found and their relationships within the organism, driving discoveries about functions and phenotypes that can be associated with the organism in many applications, such as pathogen–host interactions and antibiotic resistance.
One of the main limitations in the genome annotation task comes from the need for databases with existing annotations. Some areas of biology are more studied and therefore have more data (complete, better described, and sometimes curated) for known processes, in addition to many databases for specific analyses that are not integrated into large repositories such as GenBank, DDBJ, and EBI. Nowadays, with the evolution of annotation programs, most of them are automated, for example RAST (Aziz et al. 2008), Prokka (Seemann 2014), and PATRIC (Wattam et al. 2014). These pipelines basically have two tasks: searching for evidence that identifies the species' genes (e.g., ESTs (Expressed Sequence Tags), proteins, and RNA-Seq data) and characterizing these patterns against a database (e.g., InterPro (Mitchell et al. 2019), UniProt (Apweiler et al. 2004), Pfam (El-Gebali et al. 2019)) using BLAST (Altschul et al. 1990) or Diamond (Buchfink et al. 2015) (Table 11.1).
Biological networks are used in different biological sciences, such as the study of the interactome, cancer research, drug prediction, metagenome analysis, proteomic analysis, molecular interactions, and cell interactions, among other areas.
A biological network can be defined as a collection of units (biomolecules) potentially interacting as a system. In other words, a biological interaction network can be represented as an abstraction of the interactions obtained through mathematical or computational models, in which a set of nodes is connected by a set of edges that can be directed or undirected. In this type of network, the nodes can represent biomolecules (genes, proteins, neurons, organisms, cells, among others), and the edges usually represent relationships and interactions (biochemical, transcriptional, energy flow, regulation, co-expression, metabolic, among others) (Beretta et al. 2019a; Proulx et al. 2005).
The study and analysis of networks is part of network biology. This paradigm allows us to understand the complex interactions of biomolecules within cells by representing and analyzing biological systems through tools and methods derived from graph theory, mathematics, physics, statistics, machine learning, and other fields, applied to omics and biological data (Pellegrini 2019; Zhang et al. 2014).
The inference of biological networks using NGS data allows relevant information to be obtained about expression and regulation processes inside organisms. Biological interaction networks can be built using different reverse-engineering methods that use high- and low-throughput data, as well as statistical, mathematical, and computational techniques that allow the reconstruction of how the elements of biological networks are integrated as a system (Chasman et al. 2016; Tieri et al. 2019).
The power of biological networks lies in the possibility of abstracting complex biological systems in the form of a graph, which allows these systems to be analyzed and described, as well as the detection of interactions and processes that could not be discovered by studying the elements individually (Marbach et al. 2012; Pellegrini 2019).
Table 11.1 List of the main software tools developed for gene prediction and annotation since 2007

Software | Organism | Year | Type | Method
GLIMMER (Delcher et al. 2007) | Bacteria, archaea, and viruses | 2007 | Ab initio | IMM (Interpolated Markov Model)
RAST (Aziz et al. 2008) | Bacteria and archaea | 2008 | Pipeline | Glimmer + PubSEED
mGene (Schweikert et al. 2009) | Eukaryote | 2009 | Ab initio | Structural HMM (Hidden Markov Model) combined with discriminative training techniques similar to SVMs (support vector machines)
Prodigal (Hyatt et al. 2010) | Prokaryote | 2010 | Ab initio | Dynamic programming + HMM
MAKER2 (Holt and Yandell 2011) | Smaller eukaryotic and prokaryotic genomes | 2011 | Pipeline | Evidence-based, ab initio, or evidence-driven ab initio combiner
MOCAT (Kultima et al. 2012) | Prokaryote and eukaryote | 2012 | Pipeline | Uses Prodigal or MetaGeneMark
MetaGUN (Liu et al. 2013) | Smaller eukaryotic and prokaryotic genomes | 2013 | Ab initio | SVM (support vector machine)
GeneMark-ET (Lomsadze et al. 2014) | Eukaryote | 2014 | Ab initio | HMM (Hidden Markov Model)
Prokka (Seemann 2014) | Prokaryote | 2014 | Pipeline | Ab initio + evidence-based functional annotation
GASS (Wang et al. 2015) | Eukaryote | 2015 | Comparative | Shortest path model and dynamic programming
AugustusCGP (König et al. 2016) | Eukaryote | 2016 | Comparative | Logistic regression
PGAP (Tatusova et al. 2016) | Prokaryote | 2016 | Pipeline | GeneMarkS + Glimmer + extrinsic data
Funannotate (Palmer and Stajich 2017) | Specifically for fungi, higher eukaryotes | 2017 | Pipeline | EvidenceModeler + Augustus + GeneMark-ES/ET + evidence + PASA
FunGap (Min et al. 2017) | Fungi | 2018 | Pipeline | Augustus + Maker + Braker1
Vgas (Zhang et al. 2019) | Virus | 2019 | Ab initio + similarity-based | ZCURVE_V + BLASTp
230 F. A. Araújo et al.
Fig. 11.6 Types of biological interactions that can be represented by networks. Adapted from
(Koh et al. 2012)
Biological interaction networks have specific architectures and properties that enable the analysis and interpretation of the complexity of the interactions present within the different domains and elements they contain (Aoki et al. 2007; Beretta et al. 2019b). Some of the elements and properties that are part of these networks are as follows:
beginning of the appropriate metabolic and genetic responses. For the reconstruction
of this type of network, techniques such as genetic knockout have been used, which
allow studying the different responses of organisms to this stimulus (Tieri et al.
2019).
Fig. 11.7 Workflow for generating and analyzing a co-expression network. The figure shows the different kinds of analyses that can be performed with this type of network. Adapted from (van Dam et al. 2018)
In such networks, most nodes have low degrees, in contrast to a few nodes with high degrees, and the probability that one node connects to another is directly proportional to its degree. Because of this property, it is possible to identify a small set of central genes (hubs) and a broad set of genes with few interactions. This characteristic makes networks more robust (van Dam et al. 2018; Tieri et al. 2019).
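A toy example of identifying hubs by node degree is sketched below; the gene names, edges, and degree threshold are hypothetical, and the networkx library is assumed to be available.

# Toy sketch of identifying hub genes by node degree in a co-expression-style
# network (hypothetical gene names and an arbitrary degree threshold).
import networkx as nx

# Edges connect genes whose expression profiles are assumed to be correlated.
edges = [
    ("geneA", "geneB"), ("geneA", "geneC"), ("geneA", "geneD"),
    ("geneA", "geneE"), ("geneB", "geneC"), ("geneD", "geneF"),
    ("geneE", "geneF"), ("geneG", "geneF"),
]

G = nx.Graph()
G.add_edges_from(edges)

degrees = dict(G.degree())                 # degree = number of interactions per gene
hub_threshold = 3                          # arbitrary cut-off for this toy network
hubs = [gene for gene, deg in degrees.items() if deg >= hub_threshold]

print("Degrees:", degrees)
print("Hub genes:", hubs)                  # highly connected genes (e.g. geneA)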
Co-expression networks have been used to understand the relationships between gene expression levels and to study different phenomena and interactions between genes. CENs were used to identify genes related to the synthesis and metabolization of fenbendazole and flunixin meglumine in pig livers; this research found eight gene modules highly related to the transcript levels associated with the metabolism of these medications (Howard et al. 2017).
Interesting research using co-expression networks was carried out by Shaik and Ramakrishna (2013). This study presents the genes common to water-stress and bacterial-stress responses in rice and Arabidopsis. The team was able to identify several common gene modules that showed high co-expression, as well as specific hubs related to these stresses.
CENs were also used to study the genes involved in the development of the skeleton and muscle mass of mice in relation to myostatin. In the study developed by Yang et al. (2015), the researchers built co-expression networks using microarray data, which allowed them to study biological processes and metabolic pathways related to the development of muscles and skeletons in wild-type mice. This study allowed known transcriptional regulators to be confirmed and new ones to be identified.
CENs can also be used for the detection of biomarkers. The research developed by the team of Zhao and Li (2019) studied gestational diabetes mellitus and managed to identify, through co-expression networks, ten potential biomarkers that can help in the diagnosis and therapy of this disease.
In the research of Yuan et al. (2018), biomarkers for the diagnosis of adrenocortical carcinoma were analyzed. Within this study, 12 central genes (hubs) in the networks showed a correlation with the prognosis and progression of the disease. Another team (Kommadath et al. 2014) used co-expression networks to detect candidate regulatory genes that present differential expression and contribute to the shedding of Salmonella enterica in pigs.
References
Altman RB, Raychaudhuri S (2001) Whole-genome expression analysis: challenges beyond clus-
tering. Curr Opin Struct Biol 11(3):340–347
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J
Mol Biol 215(3):403–410
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped
blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res 25
(17):3389–3402
Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181
(4096):223–230
Ansorge W, Sproat BS, Stegemann J, Schwager C (1986) A non-radioactive automated method for
DNA sequence determination. J Biochem Biophys Methods 13(6):315–323
Ansorge W, Sproat B, Stegemann J, Schwager C, Zenke M (1987) Automated DNA sequencing:
ultrasensitive detection of fluorescent bands during electrophoresis. Nucleic Acids Res 15
(11):4593–4602
Aoki K, Ogata Y, Shibata D (2007) Approaches for extracting practical information from gene
co-expression networks in plant biology. Plant Cell Physiol 48(3):381–390
Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H,
Lopez R, Magrane M et al (2004) Uniprot: the universal protein knowledgebase. Nucleic Acids
Res 32(suppl_1):D115–D119
Araujo FA, Barh D, Silva A, Guimarães L, Ramos RTJ (2018) Go feat: a rapid web-based
functional annotation tool for genomic and transcriptomic data. Sci Rep 8(1):1–4
Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM,
Kubal M et al (2008) The RAST server: rapid annotations using subsystems technology. BMC
Genomics 9(1):1–15
Baker M (2012) De novo genome assembly: what every biologist should know. Nat Methods 9
(4):333–337
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko
SI, Pham S, Prjibelski AD et al (2012) Spades: a new genome assembly algorithm and its
applications to single-cell sequencing. J Comput Biol 19(5):455–477
Beretta S, Denti L, Previtali M (2019a) Graph theory and definitions. Academic Press, Cambridge,
MA
Beretta S, Denti L, Previtali M (2019b) Network properties. Academic Press, Cambridge, MA
Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using diamond. Nat
Methods 12(1):59–60
Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB
(2008) Allpaths: de novo assembly of whole-genome shotgun microreads. Genome Res 18
(5):810–820
Card DC, Schield DR, Reyes-Velasco J, Fujita MK, Andrew AL, Oyler-McCance SJ, Fike JA,
Tomback DF, Ruggiero RP, Castoe TA (2014) Two low coverage bird genomes and a
comparison of reference-guided versus de novo genome assemblies. PLoS One 9(9):e106649
Celis JE, Kruhøffer M, Gromova I, Frederiksen C, Østergaard M, Thykjaer T, Gromov P, Yu J,
Pálsdóttir H, Magnusson N et al (2000) Gene expression profiling: monitoring transcription and
translation products using dna microarrays and proteomics. FEBS Lett 480(1):2–16
Chasman D, Siahpirani AF, Roy S (2016) Network-based approaches for analysis of complex
biological systems. Curr Opin Biotechnol 39:157–166
Chen Q, Lan C, Zhao L, Wang J, Chen B, Chen YPP (2017) Recent advances in sequence
assembly: principles and applications. Brief Funct Genomics 16(6):361–378
Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M (2005) Blast2go: a universal tool
for annotation, visualization and analysis in functional genomics research. Bioinformatics 21
(18):3674–3676
Consortium GO (2015) Gene ontology consortium: going forward. Nucleic Acids Res 43(D1):
D1049–D1056
Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) The Knuth-Morris-Pratt algorithm. In:
Introduction to algorithms, 2nd edn. MIT Press, Cambridge, MA
Crick F (1970) Central dogma of molecular biology. Nature 227(5258):561–563
Dasgupta D, Yu S, Nino F (2011) Recent advances in artificial immune systems: models and
applications. Appl Soft Comput 11(2):1574–1587
Dayhoff M, Schwartz R, Orcutt B (1978) A model of evolutionary change in proteins. In: Atlas of
protein sequence and structure, vol 5. The National Biomedical Research Foundation, Silver
Spring, MD, pp 345–352
De Smet R, Marchal K (2010) Advantages and limitations of current network inference methods.
Nat Rev Microbiol 8(10):717–729
Delcher AL, Bratke KA, Powers EC, Salzberg SL (2007) Identifying bacterial genes and endosym-
biont dna with glimmer. Bioinformatics 23(6):673–679
Dey A, Saha I, Maulik U (2017) A survey on multiple sequence alignment using metaheuristics. In:
2017 7th international conference on communication systems and network technologies
(CSNT). IEEE, pp 279–284
Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic
models of proteins and nucleic acids. Cambridge University Press, Cambridge
Earl D, Bradnam K, John JS, Darling A, Lin D, Fass J, Yu HOK, Buffalo V, Zerbino DR, Diekhans
M et al (2011) Assemblathon 1: a competitive assessment of de novo short read assembly
methods. Genome Res 21(12):2224–2241
Edgar RC (2004) Muscle: multiple sequence alignment with high accuracy and high throughput.
Nucleic Acids Res 32(5):1792–1797
Edgar RC (2010) Search and clustering orders of magnitude faster than blast. Bioinformatics 26
(19):2460–2461
Edgar RC, Batzoglou S (2006) Multiple sequence alignment. Curr Opin Struct Biol 16(3):368–373
Ekblom R, Wolf JB (2014) A field guide to whole-genome sequencing, assembly and annotation.
Evol Appl 7(9):1026–1042
El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ,
Salazar GA, Smart A et al (2019) The pfam protein families database in 2019. Nucleic Acids
Res 47(D1):D427–D432
El-Metwally S, Hamza T, Zakaria M, Helmy M (2013) Next-generation sequence assembly: four
stages of data processing and computational challenges. PLoS Comput Biol 9(12):e1003345
Feng DF, Doolittle RF (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25(4):351–360
Ferragina P, Manzini G (2000) Opportunistic data structures with applications. In: Proceedings 41st
annual symposium on foundations of computer science. IEEE, pp 390–398
Fionda V (2019) Networks in biology. In: Ranganathan S, Gribskov M, Nakai K, Schön-bach C
(eds) Encyclopedia of bioinformatics and computational biology. Academic Press, Oxford, pp
915–921. https://doi.org/10.1016/B978-0-12-809633-8.20420-2
Fonseca NA, Rung J, Brazma A, Marioni JC (2012) Tools for mapping high-throughput sequencing
data. Bioinformatics 28(24):3169–3177
Gurevich A, Saveliev V, Vyahhi N, Tesler G (2013) Quast: quality assessment tool for genome
assemblies. Bioinformatics 29(8):1072–1075
Gusfield D (1997) Algorithms on strings, trees, and sequences: computer science and computational biology. ACM Sigact News 28(4):41–60
Hacia JG (1999) Resequencing and mutational analysis using oligonucleotide microarrays. Nat
Genet 21(1):42–47
Hatakeyama M, Aluri S, Balachadran MT, Sivarajan SR, Patrignani A, Grüter S, Poveda L,
Shimizu-Inatsugi R, Baeten J, Francoijs KJ et al (2018) Multiple hybrid de novo genome
assembly of finger millet, an orphan allotetraploid crop. DNA Res 25(1):39–47
Hatem A, Bozdağ D, Toland AE, Çatalyürek ÜV (2013) Benchmarking short sequence mapping
tools. BMC Bioinformatics 14(1):184
Heather JM, Chain B (2016) The sequence of sequencers: the history of sequencing DNA.
Genomics 107(1):1–8
Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl
Acad Sci 89(22):10915–10919
Higgins DG, Sharp PM (1988) Clustal: a package for performing multiple sequence alignment on a
microcomputer. Gene 73(1):237–244
Higgins DG, Bleasby AJ, Fuchs R (1992) Clustal v: improved software for multiple sequence
alignment. Bioinformatics 8(2):189–191
Hoffmann S, Otto C, Kurtz S, Sharma CM, Khaitovich P, Vogel J, Stadler PF, Hackermüller J
(2009) Fast mapping of short sequences with mismatches, insertions and deletions using index
structures. PLoS Comput Biol 5(9):e1000502
Holley RW, Apgar J, Everett GA, Madison JT, Marquisee M, Merrill SH, Penswick JR, Zamir A
(1965) Structure of a ribonucleic acid. Science 147:1462–1465
Holt C, Yandell M (2011) Maker2: an annotation pipeline and genome-database management tool
for second-generation genome projects. BMC Bioinformatics 12(1):491
Howard JT, Ashwell MS, Baynes RE, Brooks JD, Yeatts JL, Maltecca C (2017) Gene
co-expression network analysis identifies porcine genes associated with variation in
metabolizing fenbendazole and flunixin meglumine in the liver. Sci Rep 7(1):1–12
Hsiao LL, Stears RL, Hong RL, Gullans SR (2000) Prospective use of dna microarrays for
evaluating renal function and disease. Curr Opin Nephrol Hypertens 9(3):253–258
Husi H, Skipworth RJ, Fearon KC, Ross JA (2013) Lscluster, a large-scale sequence clustering and
aligning software for use in partial identity mapping and splice-variant analysis. J Proteome
84:185–189
Hyatt D, Chen GL, LoCascio PF, Land ML, Larimer FW, Hauser LJ (2010) Prodigal: prokaryotic
gene recognition and translation initiation site identification. BMC Bioinformatics 11(1):119
Jancura P, Mavridou E, Carrillo-de Santa Pau E, Marchiori E (2012) A methodology for detecting
the orthology signal in a PPI network at a functional complex level. BMC Bioinformatics 13:
S18
Jiang Z, Zhou X, Li R, Michal JJ, Zhang S, Dodson MV, Zhang Z, Harland RM (2015) Whole
transcriptome analysis with sequencing: methods, challenges and potential solutions. Cell Mol
Life Sci 72(18):3425–3439
Kambara H, Nishikawa T, Katayama Y, Yamaguchi T (1988) Optimization of parameters in a dna
sequenator using fluorescence detection. Bio/Technology 6(7):816–821
Karlebach G, Shamir R (2008) Modelling and analysis of gene regulatory networks. Nat Rev Mol
Cell Biol 9(10):770–780
Keel BN, Snelling WM (2018) Comparison of burrows-wheeler transform-based mapping
algorithms used in high-throughput whole-genome sequencing: application to illumina data
for livestock genomes1. Front Genet 9:35
Kent WJ (2002) Blat—the blast-like alignment tool. Genome Res 12(4):656–664
Koh GC, Porras P, Aranda B, Hermjakob H, Orchard SE (2012) Analyzing protein–protein
interaction networks. J Proteome Res 11(4):2014–2031
Kommadath A, Bao H, Arantes AS, Plastow GS, Tuggle CK, Bearson SM, Stothard P et al (2014)
Gene co-expression network analysis identifies porcine genes associated with variation in
salmonella shedding. BMC Genomics 15(1):1–15
König S, Romoth LW, Gerischer L, Stanke M (2016) Simultaneous gene finding in multiple
genomes. Bioinformatics 32(22):3388–3395
Koonin EV, Galperin M (2013) Sequence—evolution—function: computational approaches in
comparative genomics. Springer, Dordrecht
Korf I (2004) Gene finding in novel genomes. BMC Bioinformatics 5(1):59
Kultima JR, Sunagawa S, Li J, Chen W, Chen H, Mende DR, Arumugam M, Pan Q, Liu B, Qin J
et al (2012) Mocat: a metagenomics assembly and gene prediction toolkit. PLoS One 7(10):
e47656
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M,
FitzHugh W et al (2001) Initial sequencing and analysis of the human genome. Nature 409
(6822):860–921
Langmead B, Salzberg SL (2012) Fast gapped-read alignment with bowtie 2. Nat Methods 9(4):357
Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of
short dna sequences to the human genome. Genome Biol 10(3):R25
Li H, Homer N (2010) A survey of sequence alignment algorithms for next-generation sequencing.
Brief Bioinform 11(5):473–483
Li H, Ruan J, Durbin R (2008) Mapping short dna sequencing reads and calling variants using
mapping quality scores. Genome Res 18(11):1851–1858
Lindgreen S, Gardner PP, Krogh A (2007) Mastr: multiple alignment and structure prediction of
non-coding rnas using simulated annealing. Bioinformatics 23(24):3304–3311
Lipman DJ, Pearson WR (1985) Rapid and sensitive protein similarity searches. Science 227
(4693):1435–1441
Lischer HE, Shimizu KK (2017) Reference-guided de novo assembly approach improves genome
reconstruction for related species. BMC Bioinformatics 18(1):1–12
Liu Y, Guo J, Hu G, Zhu H (2013) Gene prediction in metagenomic fragments based on the SVM
algorithm. BMC Bioinformatics 14:S12
Lomsadze A, Burns PD, Borodovsky M (2014) Integration of mapped rna-seq reads into automatic
training of eukaryotic gene finding algorithm. Nucleic Acids Res 42(15):e119–e119
Lourenço HR, Martin OC, Stützle T (2003) Iterated local search. In: Handbook of metaheuristics.
Springer, Boston, pp 320–353
Luckey JA, Drossman H, Kostichka AJ, Mead DA, D’Cunha J, Norris TB, Smith LM (1990) High
speed dna sequencing by capillary electrophoresis. Nucleic Acids Res 18(15):4417–4421
Lukashin AV, Borodovsky M (1998) Genemark.hmm: new solutions for gene finding. Nucleic
Acids Res 26(4):1107–1115
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y et al (2012)
Soapdenovo2: an empirically improved memory-efficient short-read de novo assembler.
Gigascience 1(1):2047-217X
Majoros WH, Pertea M, Salzberg SL (2004) Tigrscan and glimmerhmm: two open source ab initio
eukaryotic gene-finders. Bioinformatics 20(16):2878–2879
Manger ID, Relman DA (2000) How the host ‘sees’ pathogens: global gene expression responses to
infection. Curr Opin Immunol 12(2):215–218
Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, Kellis M,
Collins JJ, Stolovitzky G (2012) Wisdom of crowds for robust gene network inference. Nat
Methods 9(8):796–804
Mardis ER (2011) A decade’s perspective on dna sequencing technology. Nature 470
(7333):198–203
Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet 12
(10):671–682
Martorell-Marugán J, Tabik S, Benhammou Y, del Val C, Zwir I, Herrera F, Carmona-Sáez P
(2019) Deep learning in omics data analysis and precision medicine. Codon Publications,
Brisbane, pp 37–53
Maulik U, Saha I (2009) Modified differential evolution based fuzzy clustering for pixel classifica-
tion in remote sensing imagery. Pattern Recogn 42(9):2135–2149
Maxam AM, Gilbert W (1977) A new method for sequencing dna. Proc Natl Acad Sci 74
(2):560–564
McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, Tsung EF, Clouser CR, Duncan C,
Ichikawa JK, Lee CC et al (2009) Sequence and structural variation in a human genome
uncovered by short-read, massively parallel ligation sequencing using two-base encoding.
Genome Res 19(9):1527–1541
Miller JR, Zhou P, Mudge J, Gurtowski J, Lee H, Ramaraj T, Walenz BP, Liu J, Stupar RM, Denny
R et al (2017) Hybrid assembly with long and short reads improves discovery of gene family
expansions. BMC Genomics 18(1):541
Min B, Grigoriev IV, Choi IG (2017) Fungap: Fungal genome annotation pipeline using evidence-
based gene model evaluation. Bioinformatics 33(18):2936–2937
Misra S, Narayanan R, Lin S, Choudhary A (2010) Fangs: high speed sequence mapping for next
generation sequencers. In: Proceedings of the 2010 ACM symposium on applied computing.
ACM, New York, pp 1539–1546
Mitchell AL, Attwood TK, Babbitt PC, Blum M, Bork P, Bridge A, Brown SD, Chang HY,
El-Gebali S, Fraser MI et al (2019) Interpro in 2019: improving coverage, classification and
access to protein sequence annotations. Nucleic Acids Res 47(D1):D351–D360
Mladenović N, Hansen P (1997) Variable neighborhood search. Comput Oper Res 24
(11):1097–1100
Morris AP, Zeggini E (2010) An evaluation of statistical approaches to rare variant analysis in
genetic association studies. Genet Epidemiol 34(2):188–193
Naama B, Bouzeboudja H, Allali A (2013) Application of Tabu search and genetic algorithm in
minimize losses in power system. Using the b-coefficient method. Energy Procedia 36:687–693
Nagarajan N, Pop M (2013) Sequence assembly demystified. Nat Rev Genet 14(3):157–167
Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv (CSUR) 33
(1):31–88
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the
amino acid sequence of two proteins. J Mol Biol 48(3):443–453
Nowak RM, Jastrzębski JP, Kuśmirek W, Sałamatin R, Rydzanicz M, Sobczyk-Kopcioł A, Sulima-
Celińska A, Paukszto Ł, Makowczenko KG, Płoski R et al (2019) Hybrid de novo whole-
genome assembly and annotation of the model tapeworm hymenolepis diminuta. Sci Data 6
(1):1–14
Oates CJ, Mukherjee S (2012) Network inference and biological dynamics. Ann Appl Stat 6
(3):1209
Ortuño FM, Valenzuela O, Rojas F, Pomares H, Florido JP, Urquiza JM, Rojas I (2013) Optimizing
multiple sequence alignments using a genetic algorithm based on three objectives: structural
information, non-gaps percentage and totally conserved columns. Bioinformatics 29
(17):2112–2121
Palmer J, Stajich J (2017) Funannotate: eukaryotic genome annotation pipeline
Pavlopoulos GA, Secrier M, Moschopoulos CN, Soldatos TG, Kossida S, Aerts J, Schneider R,
Bagos PG (2011) Using graph theory to analyze biological networks. BioData Min 4(1):10
Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl
Acad Sci 85(8):2444–2448
Pellegrini M (2019) Community detection in biological networks. In: Encyclopedia of bioinformat-
ics and computational biology. Elsevier, Amsterdam
Pennisi E (2003) A low number wins the GeneSweep pool. Science 300:1484
Pitkänen E, Rousu J, Ukkonen E (2010) Computational methods for metabolic reconstruction. Curr
Opin Biotechnol 21(1):70–77
Pizzuti C, Rombo SE (2014) Algorithms and tools for protein–protein interaction networks
clustering, with a special focus on population-based stochastic methods. Bioinformatics 30
(10):1343–1352
Pop M, Phillippy A, Delcher AL, Salzberg SL (2004) Comparative genome assembly. Brief Bioinform 5(3):237–248
Prober JM, Trainor GL, Dam RJ, Hobbs FW, Robertson CW, Zagursky RJ, Cocuzza AJ, Jensen
MA, Baumeister K (1987) A system for rapid dna sequencing with fluorescent chain-
terminating dideoxynucleotides. Science 238(4825):336–341
Proulx SR, Promislow DE, Phillips PC (2005) Network thinking in ecology and evolution. Trends
Ecol Evol 20(6):345–353
Raghava GP, Barton GJ (2006) Quantification of the variation in percentage identity for protein
sequence alignments. BMC Bioinformatics 7(1):1–4
Remmert M, Biegert A, Hauser A, Söding J (2012) Hhblits: lightning-fast iterative protein sequence
searching by hmm-hmm alignment. Nat Methods 9(2):173–175
Rognes T, Flouri T, Nichols B, Quince C, Mahé F (2016) Vsearch: a versatile open source tool for
metagenomics. PeerJ 4:e2584
Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, Leamon JH, Johnson K,
Milgrew MJ, Edwards M et al (2011) An integrated semiconductor device enabling non-optical
genome sequencing. Nature 475(7356):348–352
Salzberg SL, Delcher AL, Kasif S, White O (1998) Microbial gene identification using interpolated
markov models. Nucleic Acids Res 26(2):544–548
Sanger F, Coulson A (1996) A rapid method for determining sequences in DNA by primed
synthesis with DNA polymerase. Sel Pap Frederick Sanger Comment 94:382
240 F. A. Araújo et al.
Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc
Natl Acad Sci 74(12):5463–5467
Schadt EE, Turner S, Kasarskis A (2010) A window into third-generation sequencing. Hum Mol
Genet 19(R2):R227–R240
Schneeberger K, Ossowski S, Ott F, Klein JD, Wang X, Lanz C, Smith LM, Cao J, Fitz J,
Warthmann N et al (2011) Reference-guided assembly of four diverse arabidopsis thaliana
genomes. Proc Natl Acad Sci 108(25):10249–10254
Schnepp PM, Chen M, Keller ET, Zhou X (2019) Snv identification from single-cell rna sequencing
data. Hum Mol Genet 28(21):3569–3583
Scholz MB, Lo CC, Chain PS (2012) Next generation sequencing and bioinformatic bottle-necks:
the current state of metagenomic data analysis. Curr Opin Biotechnol 23(1):9–15
Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L,
Bohlen A et al (2009) mgene: accurate svm-based gene finding with an application to nematode
genomes. Genome Res 19(11):2133–2143
Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30
(14):2068–2069
Shaik R, Ramakrishna W (2013) Genes and co-expression modules common to drought and
bacterial stress responses in arabidopsis and rice. PLoS One 8(10):e77261
Shendure J, Ji H (2008) Next-generation dna sequencing. Nat Biotechnol 26(10):1135–1145
Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I (2009) Abyss: a parallel assembler
for short read sequence data. Genome Res 19(6):1117–1123
Smith TF, Waterman MS et al (1981) Identification of common molecular subsequences. J Mol Biol
147(1):195–197
Smith LM, Fung S, Hunkapiller MW, Hunkapiller TJ, Hood LE (1985) The synthesis of
oligonucleotides containing an aliphatic amino group at the 50 terminus: synthesis of fluorescent
dna primers for use in dna sequence analysis. Nucleic Acids Res 13(7):2399–2412
Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SB, Hood LE
(1986) Fluorescence detection in automated dna sequence analysis. Nature 321(6071):674–679
Sohn Ji, Nam JW (2018) The present and future of de novo whole-genome assembly. Brief
Bioinform 19(1):23–40
Suzuki S, Kakuta M, Ishida T, Akiyama Y (2014) Ghostx: an improved sequence homology search
algorithm using a query suffix array and a database suffix array. PLoS One 9(8):e103833
Swerdlow H, Gesteland R (1990) Capillary gel electrophoresis for rapid, high resolution dna
sequencing. Nucleic Acids Res 18(6):1415–1419
Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A,
Pruitt KD, Borodovsky M, Ostell J (2016) Ncbi prokaryotic genome annotation pipeline.
Nucleic Acids Res 44(14):6614–6624
Thompson JD, Higgins DG, Gibson TJ (1994) Clustal w: improving the sensitivity of progressive
multiple sequence alignment through sequence weighting, position-specific gap penalties and
weight matrix choice. Nucleic Acids Res 22(22):4673–4680
Tieri P, Farina L, Petti M, Astolfi L, Paci P, Castiglione F (2019) Network inference and
reconstruction in bioinformatics. Encycl Bioinform Comput Biol 2:805–813
Toledo-Arana A, Solano C (2010) Deciphering the physiological blueprint of a bacterial cell:
revelations of unanticipated complexity in transcriptome and proteome. BioEssays 32
(6):461–467
Tordini F, Aldinucci M, Milanesi L, Liò P, Merelli I (2016) The genome conformation as an
integrator of multi-omic data: the example of damage spreading in cancer. Front Genet 7:194
van Dam S, Vosa U, van der Graaf A, Franke L, de Magalhaes JP (2018) Gene co-expression
analysis for functional classification and gene–disease predictions. Brief Bioinform 19
(4):575–592
Van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C (2014) Ten years of next-generation sequencing
technology. Trends Genet 30(9):418–426
11 Computational Genomics 241
van Dijk EL, Jaszczyszyn Y, Naquin D, Thermes C (2018) The third revolution in sequencing
technology. Trends Genet 34(9):666–681
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans
CA, Holt RA et al (2001) The sequence of the human genome. Science 291(5507):1304–1351
Voelkerding KV, Dames SA, Durtschi JD (2009) Next-generation sequencing: from basic research
to diagnostics. Clin Chem 55(4):641–658
Wang Y, Chen L, Song N, Lei X (2015) Gass: genome structural annotation for eukaryotes based on
species similarity. BMC Genomics 16(1):150
Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, Gillespie JJ, Gough R, Hix D,
Kenyon R et al (2014) Patric, the bacterial bioinformatics database and analysis resource.
Nucleic Acids Res 42(D1):D581–D591
Wu TD, Nacu S (2010) Fast and snp-tolerant detection of complex variants and splicing in short
reads. Bioinformatics 26(7):873–881
Yandell M, Ence D (2012) A beginner’s guide to eukaryotic genome annotation. Nat Rev Genet 13
(5):329–342
Yang X, Koltes JE, Park CA, Chen D, Reecy JM (2015) Gene co-expression network analysis
provides novel insights into myostatin regulation at three different mouse developmental
timepoints. PLoS One 10(2):e0117607
Yuan L, Qian G, Chen L, Wu CL, Dan HC, Xiao Y, Wang X (2018) Co-expression network
analysis of biomarkers for adrenocortical carcinoma. Front Genet 9:328
Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de bruijn
graphs. Genome Res 18(5):821–829
Zhang J, Lin H, Balaji P, Feng WC (2013) Optimizing burrows-wheeler transform-based sequence
alignment on multicore architectures. In: 2013 13th IEEE/ACM international symposium on
cluster, cloud, and grid computing. IEEE, pp 377–384
Zhang B, Tian Y, Zhang Z (2014) Network biology in medicine and beyond. Circulation: cardio-
vascular. Genetics 7(4):536–547
Zhang KY, Gao YZ, Du MZ, Liu S, Dong C, Guo FB (2019) Vgas: a viral genome annotation
system. Front Microbiol 10:184
Zhao X, Li W (2019) Gene coexpression network analysis identified potential biomarkers in
gestational diabetes mellitus progression. Mol Gen Genom Med 7(1):e00515
A Guide to RNAseq Data Analysis Using Bioinformatics Approaches
12
Preeti Sharma, B. Sharan Sharma, and Ramtej J. Verma
Abstract
The emergence of next-generation sequencing (NGS) technologies for DNA, RNA and other small RNA molecules has given rise to raw data on a massive scale. Analysing these data and extracting a biological interpretation is challenging, and advances in computational biology and bioinformatics applications have emerged to meet this need. RNAseq enables exploration of the comprehensive expression profile of genes and quantifies the RNA content of a biological sample. In addition, RNAseq provides information on alternative splice variants, novel genes, differentially expressed genes, etc. The workflow for RNAseq data analysis comprises quality checking of the data, mapping onto a reference genome/transcriptome, read quantification, differential expression analysis and functional annotation. Various tools and software packages with different algorithms have been developed to provide biological understanding of the data and to meet the demands of the analyst. This chapter provides an overview of the tools and software that can be exploited to analyse the data for different investigations. Also, a glimpse of other RNAseq techniques, such as single-cell RNAseq and small RNA sequencing, is given as an introduction to newer forms of RNA sequencing.
P. Sharma (*)
Department of Zoology, Biomedical Technology and Human Genetics, University School of
Sciences, Gujarat University, Ahmedabad, Gujarat, India
PanGenomics International Pvt Ltd, Sterling Accuris Diagnostics, Ahmedabad, Gujarat, India
B. S. Sharma
Genexplore Diagnostics and Research Centre, Ahmedabad, Gujarat, India
Rivaara Labs Pvt Ltd, KD Hospital, Ahmedabad, Gujarat, India
R. J. Verma
Department of Zoology, Biomedical Technology and Human Genetics, University School of
Sciences, Gujarat University, Ahmedabad, Gujarat, India
Keywords
12.1 Introduction
12.2.1 SOLiD
SOLiD stands for Sequencing by Oligo Ligation and Detection, a technique developed in 2005 (Hedges et al. 2011). It is based on ligation of oligonucleotides to dsDNA strands with the help of the enzyme DNA ligase. A primer-binding adapter is bound to the target sequence on a bead, which is then amplified using emulsion PCR. A universal primer is hybridized to the adapter, followed by exposure of the beads to a library of 8-nucleotide probes tagged with four different fluorescent dyes at the 5' end and a hydroxyl group at the 3' end. Based on the complementarity of
Table 12.1 List of tools available for different analytical processes of RNAseq data analysis

1. Quality check
   FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
   Kraken: https://github.com/DerrickWood/kraken2
   HTSeq: https://htseq.readthedocs.io/en/master/
   NGS QC Toolkit: http://www.nipgr.res.in/ngsqctoolkit.html
   RNASeQC: https://github.com/getzlab/rnaseqc
2. Pre-processing
   BBDuk: https://github.com/BioInfoTools/BBMap/blob/master/sh/bbduk.sh
   Cutadapt: https://bioinformaticshome.com/tools/rna-seq/descriptions/cutadapt.html
   FASTX Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/
   SortMeRNA: https://bioinfo.lifl.fr/RNA/sortmerna/
   Trimmomatic: https://github.com/timflutre/trimmomatic
3. Alignment of reads
   Reference guided:
   Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
   Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
   Burrows-Wheeler Aligner (BWA): http://bio-bwa.sourceforge.net/
   Bayesembler: https://github.com/bioinformatics-centre/bayesembler
   Cufflinks: http://cole-trapnell-lab.github.io/cufflinks/
   IsoLasso: http://alumni.cs.ucr.edu/~liw/isolasso.html
   De novo assemblers:
   CLC Genomics Workbench: https://digitalinsights.qiagen.com/products-overview/discovery-insights-portfolio/analysis-and-visualization/qiagen-clc-genomics-workbench/
   Oases: https://github.com/dzerbino/oases
   rnaSPAdes: https://cab.spbu.ru/software/rnaspades/
   Rnnotator: https://www.osti.gov/biblio/1231732-rnnotator
   SOAPdenovo-Trans: http://sourceforge.net/projects/soapdenovotrans/
   Trans-ABySS: https://github.com/bcgsc/transabyss
   Trinity: https://github.com/trinityrnaseq/trinityrnaseq/wiki
   Velvet: https://www.ebi.ac.uk/~zerbino/velvet/
4. Assembly evaluation tools
   Busco: https://busco.ezlab.org/
   Detonate: http://deweylab.biostat.wisc.edu/detonate/
   rnaQUAST: https://github.com/ablab/rnaquast
   TransRate: https://hibberdlab.com/transrate/
the first two bases, the probes become attached to the target sequence with the help of the enzyme DNA ligase. The fluorescent tag is then cleaved from the fragment between the 5th and 6th base of the probe, which are joined by a phosphorothioate linkage. The fluorescence of the dyes generated by this cleavage is measured at different spectra. After completion of the first round of sequencing, the second round starts with a primer of length N-1, and so on. The sequence of the target is determined by measuring the fluorescence signals at each round of sequencing. Although the technique was low cost and provided results of high accuracy owing to two-base sequencing, its main disadvantages were time consumption and shorter read lengths (Wyrzykiewicz and Cole 1994).
degradation. The fragmented libraries are then ligated to complementary probes embedded on beads and mixed with PCR reagents and oil to perform emulsion PCR. Here, each microsphere of the emulsion, specifically known as an Ion Sphere Particle (ISP), is covered with multiple copies of the same DNA fragment for clonal amplification. After amplification, the ISPs carrying template fragments are enriched from the mixture using biotin-labelled magnetic beads and the rest are melted off. The positive templates are then prepared for sequencing and loaded onto Ion chips, which contain millions of microwells, each holding many copies of a single-stranded DNA template together with sequencing reagents such as DNA polymerase and dNTPs. Incorporation of the complementary nucleotide results in a change in pH, which is converted into a digital signal to obtain the sequence of the target. Because the technology is not based on fluorescence and does not require optical reading for detection, sequencing is rapid and a large number of bases are incorporated in a short time. The technology is limited in reading homopolymer stretches in the template, such as 'TTTTTT', where distinguishing between multiple identical bases becomes challenging, resulting in an increased error rate (Merriman et al. 2012).
• FASTQ is the most commonly used file format. It allows storage of sequence data together with the corresponding quality values, known as Phred scores. Files in FASTQ format carry the extension '.fq' or '.fastq'. A FASTQ record contains four lines of text. The first line starts with the sign '@' and is generally known as the sequence identifier. The second line consists of the sequence of nucleotides, i.e. A, T, G, C. The third line consists of a '+' sign, which is a separator and indicates the end of the sequence. The fourth line provides a quality score corresponding to each base of the sequence in the second line (Deorowicz and Grabowski 2011); a minimal parsing sketch is given after this list.
• FASTA is another data storage format and is available with the extensions '.fa' and '.fasta'. Sequences are recognized by a '>' sign at the beginning, followed by descriptive information about the sequence. This format is generally used for alignment or reference genome mapping by different tools and software. The sequence consists of the nucleotides A, T, G, C and N (for an undetermined base) (Gilbert 2003). The sequence can be viewed using text editor tools or a LINUX/UNIX environment.
• BAM/SAM: The raw sequence data generated by the sequencer carry no genomic information and need to be aligned to a reference genome. After mapping or aligning to a reference genome, the output is generated in SAM/BAM format. SAM is the Sequence Alignment/Map format, which stores the sequences aligned against the reference genome. A SAM file is a tab-delimited file, recognized by the '.sam' extension, and can be viewed using text editor tools (Li et al. 2009). A BAM file is the binary version of a SAM file and carries the '.bam' extension (Niemenmaa et al. 2012).
The data generated after sequencing often contain contaminants such as poor-quality reads, PCR artefacts, adapter sequences, over-represented sequences, etc., which interfere with downstream analytical operations. Hence, the data need to be quality checked to obtain clean, filtered, high-quality reads. Many software packages are available to assess read quality; they perform a quality check (QC) on the data and provide a QC report highlighting the low-quality sequencing reads that impair the data. FastQC is a commonly used tool for assessing data quality. It measures metrics associated with the data, such as read length, quality score, GC percentage, k-mers, etc., and presents the results in different modules (Andrews 2010).
The per base sequence quality module assesses the overall quality of the bases at each position of the read, represented by a box-and-whisker plot; a higher score indicates a better-quality base call. Likewise, the per sequence quality score report presents the subset of sequences having low overall quality scores. This normally constitutes a small fraction of the total sequences; a large subset with bad quality scores indicates a systematic error.
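As an illustration of what the per base sequence quality module summarizes, the short sketch below computes the mean Phred score at each read position from a toy set of quality score lists; the values are invented purely for demonstration.

```python
# Toy illustration of per-position quality summarization (made-up Phred scores).
reads_quality = [
    [34, 36, 35, 30, 28, 22, 18],   # Phred scores of read 1, one value per base
    [36, 35, 33, 31, 25, 20, 15],   # read 2
    [33, 34, 32, 29, 24, 19, 12],   # read 3
]

read_length = len(reads_quality[0])
for position in range(read_length):
    scores = [q[position] for q in reads_quality]
    mean_q = sum(scores) / len(scores)
    print(f'position {position + 1}: mean Phred = {mean_q:.1f}')
```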
The per base GC content module shows the GC content at each base position in the reads. A shift of the GC content distribution away from that of the underlying genome indicates the presence of contaminating or systematically biased sequences in the library.
Before using the data for functional annotation, differential expression analysis, etc., they must be pre-processed to remove contaminating reads. Various tools are available for this, such as the FASTX-Toolkit (Gordon and Hannon 2010), NGS toolkit (Mulcare 2004), Trimmomatic (Bolger et al. 2014), etc. The FASTX-Toolkit is a commonly used tool to separate good-quality data from bad. During filtration, the data are processed to remove low-quality bases, adapter sequences and other reads that compromise data quality.
Sequencing data are often contaminated with adapter sequences, which are synthetically designed fragments of DNA added to the target sequences. These sequences are generally removed by the sequencer after completion of the sequencing process, but occasionally they remain attached to the sequenced read and contribute background noise to the data. Tools such as Cutadapt (Martin 2011) and Trimmomatic (Bolger et al. 2014) are frequently used to remove adapter sequences.
Other contaminants are low-quality bases, i.e. those with a high probability of being called incorrectly. The quality of a base is given by its Phred score (Q score), the standard measure of base-call accuracy assigned by the sequencer. A quality score below 20 is generally considered poor, with a high chance of inaccuracy. The FASTX-Toolkit is commonly used to trim reads at a Phred score threshold of 20.
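The sketch below shows one simple way such quality trimming can work, assuming the Phred scores have already been decoded: bases are removed from the 3' end of the read until a base at or above the Q20 cut-off is reached. Real trimmers such as Trimmomatic or the FASTX-Toolkit use more elaborate strategies (e.g. sliding windows); the read and scores used here are hypothetical.

```python
# Simple 3'-end quality trimming at a Phred cut-off (illustrative only;
# real tools such as Trimmomatic use sliding-window strategies).

def trim_3prime(seq, phred, min_q=20):
    """Trim low-quality bases from the 3' end until a base >= min_q is found."""
    end = len(seq)
    while end > 0 and phred[end - 1] < min_q:
        end -= 1
    return seq[:end], phred[:end]

# Hypothetical read with deteriorating 3' quality.
seq = 'ACGTACGTAGGT'
phred = [35, 34, 36, 33, 30, 28, 27, 25, 19, 14, 10, 2]
trimmed_seq, trimmed_q = trim_3prime(seq, phred)
print(trimmed_seq)   # 'ACGTACGT': the last four low-quality bases are removed
```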
A few other sequences, such as rRNA, also act as contaminants in whole-transcriptome sequencing. rRNAFilter (Wang et al. 2017), SortMeRNA (Kopylova et al. 2012) and RiboPicker (Schmieder and Edwards 2011) are commonly used tools to remove rRNA reads.
The raw reads generated after sequencing are then mapped onto a reference genome or transcriptome of the same species or of the nearest relative, whichever is available (Roberts et al. 2011; Trapnell et al. 2010). The mapping of reads is affected by complexities of the genome, polymorphisms, gene isoforms, alternative splicing, etc.
In reference-guided assembly, the reads are mapped onto a reference genome or transcriptome, whichever is available, to assemble them into transcripts. Reads spanning splice junctions are split into parts, with one part mapping to one exonic region and the other to the region across the intron. Mapping reads onto the reference genome minimizes the complexity of the assembly, as reads are placed at their specific genomic locations (Voshall and Moriyama 2018). Several assemblers are available for reference-guided assembly, such as Bayesembler (Maretty et al. 2014), Cufflinks (Ghosh and Chan 2016), StringTie (Pertea et al. 2015), etc. Different assemblers use different strategies to assemble reads with the highest read coverage: for example, Cufflinks reports the smallest number of transcripts that can explain the mapped reads, whereas Bayesembler uses Bayesian likelihood to estimate the most likely combination of transcripts constructed for each splice junction. Other assemblers such as IsoLasso (Li et al. 2011) and iReckon (Mezlini et al. 2013) use the L1 norm and specific sparsity constraints, respectively, to obtain possible transcript combinations.
Reference-guided assemblers use the reference genome to align the reads and assemble them into transcripts; graphs are constructed and isoforms are treated as paths through the graphs (Li and Xuejun 2016). The accuracy of the assembly depends on the availability of a complete, good-quality reference genome, which is usually available for model organisms such as human, mouse, rat, Arabidopsis, Oryza, etc., but not for non-model species.
Therefore, for species with no reference genome, a de novo or reference-independent method is used to construct the transcripts. De novo assembly is based on breaking the reads into short fragments, known as k-mers, which overlap to form a de Bruijn graph structure (Martin and Wang 2011). The assembly of contigs by different algorithms depends on the chosen k-mer length: shorter k-mers generally cover the underlying sequence more completely but introduce ambiguity, because multiple reads from different transcripts share the same k-mer, whereas longer k-mers resolve this ambiguity but may not cover the entire reference genome/transcriptome.
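To illustrate the k-mer idea, the sketch below breaks a few toy reads into k-mers and links each k-mer's (k-1)-length prefix to its suffix, which is the node/edge construction underlying a de Bruijn graph. Actual assemblers add error correction, coverage weighting and graph simplification on top of this; the reads shown are hypothetical.

```python
# Toy construction of a de Bruijn graph from reads (nodes are (k-1)-mers,
# edges are k-mers); real assemblers add error correction and graph cleaning.
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(set)                 # (k-1)-mer -> set of following (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # prefix node -> suffix node
    return graph

reads = ['ATGGCGT', 'GGCGTGC', 'GTGCAAT']    # hypothetical overlapping reads
for node, successors in sorted(de_bruijn(reads, k=4).items()):
    print(node, '->', ', '.join(sorted(successors)))
```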
Various assemblers are available that differ in how they optimize the k-mer length for contig assembly. SOAPdenovo-Trans (Xie et al. 2014) and Trinity (Freedman 2016) use preferred k-mer lengths for building the de Bruijn graph. Trinity is a package of three independent software modules, Inchworm, Chrysalis and Butterfly, where Inchworm assembles the transcripts, Chrysalis forms
the de Bruijn graph by clustering those transcripts, and finally Butterfly evaluates the graphs and produces the full-length assembly (Grabherr et al. 2011). rnaSPAdes (Bushmanova et al. 2019) selects the k-mer length based on the read data. rnaSPAdes is a version of SPAdes (Bankevich et al. 2012) optimized for transcriptomes; it produces three assemblies, from which one can be chosen depending on the downstream analyses: one with all transcripts, one with long and highly expressed transcripts, and one with short and lowly expressed transcripts (Geniza and Jaiswal 2017). Another assembler, Velvet/Oases, assembles contigs from short reads based on a de Bruijn graph: Velvet assembles the contigs from the short reads, which are then clustered into loci by the Oases program (Schulz et al. 2012).
Before the data are processed for further downstream analysis, the assembled reads are checked for quality. The quality of an assembly can be evaluated using two different criteria: by calculating the number and length of the contigs, or by mapping the assembled sequences to known proteins for similarity searching. Software such as rnaQUAST (Bushmanova et al. 2016), CD-HIT (Li and Godzik 2006), TransRate (Smith-Unna et al. 2016) and Bowtie (Langmead 2010) can be used to measure the quality of the assembly through contig lengths and the N50 value of the assembly (T O'Neil and Emrich 2013). The N50 value is defined as the minimum contig length required to cover fifty percent of the genome. While the N50 value is more suitable for judging the quality of a genome assembly, a transcriptome assembly is assessed by its ExN50 value, which provides a more dynamic, expression-aware estimate from the assembled reads (Geniza and Jaiswal 2017).
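The N50 definition above can be captured in a few lines: sort the contig lengths in descending order and walk down the list until half of the total assembly length is covered. The contig lengths used below are hypothetical.

```python
# Compute N50 from a list of contig lengths (hypothetical values).

def n50(contig_lengths):
    """Smallest contig length such that contigs of at least that length
    cover >= 50% of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

lengths = [900, 700, 500, 400, 300, 200, 100, 50]   # toy contig lengths
print(n50(lengths))   # 700: the 900 bp and 700 bp contigs already cover >= 50% of 3150 bp
```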
ExN50 computes the N50 over the most highly expressed transcripts that account for a given fraction (e.g. half) of the overall transcriptome expression. The other criterion, mapping the assembled sequences to known coded proteins, provides a more reliable notion of the completeness of the assembled transcripts. Similarity searches are generally performed by aligning the assembled sequences against well-annotated databases of protein sequences, conserved protein domains with functional annotation, or lineage-dependent protein databases (Nakasugi et al. 2014). These include BLAST (Altschul et al. 1990), Pfam (Finn et al. 2014), UniProt/Swiss-Prot (Apweiler et al. 2004), BUSCO (Waterhouse et al. 2018), etc. Although the protein-based similarity search is a more plausible QC metric for an assembly, its performance is limited by the relatedness of the organism in question to the sequences present in the databases: the more divergent the organism, the greater the chance of a lower percentage of matching assembled sequences and of gaps in the assembly.
12.6 Annotation
The output of differential gene expression analysis provides information on the altered expression levels of a particular set of genes; the next step is to explore the biological functions of those genes. This is done by analysing the functional aspects, interaction networks, pathways and gene ontology of the genes involved in different processes of the biological system.
For functional annotation of the genes, various databases such as the PANTHER classification system (Mi et al. 2016) and the DAVID Gene Functional Classification Tool (Sherman et al. 2007) are available, which assign particular functions to genes and categorize them into different protein classes and biological pathways based on
their over-representation analysis (ORA) in the data (Khatri et al. 2012). Based on similar biological functions, cellular localization and pathway annotation, these genes are classified into different functional categories. Genes are analysed for over-representation in a particular category by comparing their occurrence in that category with the proportion of background genes annotated to the same category. The results can be evaluated for significance by applying statistical tests such as Fisher's exact test, the hypergeometric test, etc.
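The over-representation test itself reduces to a 2 x 2 contingency table per category: genes in the list versus the background, inside versus outside the category. A hedged sketch using SciPy's Fisher's exact test is shown below; all counts are invented purely to illustrate the layout.

```python
# One-sided Fisher's exact test for over-representation of a gene category
# (counts are hypothetical, chosen only to illustrate the 2x2 layout).
from scipy.stats import fisher_exact

de_in_category = 40          # differentially expressed genes annotated to the category
de_not_in_category = 260     # remaining differentially expressed genes
bg_in_category = 300         # background genes annotated to the category
bg_not_in_category = 15400   # remaining background genes

table = [[de_in_category, de_not_in_category],
         [bg_in_category, bg_not_in_category]]

odds_ratio, p_value = fisher_exact(table, alternative='greater')  # test for enrichment only
print(f'odds ratio = {odds_ratio:.2f}, p = {p_value:.2e}')
```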
The GO terms are said to be loosely hierarchical based on the available informa-
tion regarding their biological functions and localizations. Based on this information
they can be arranged in terms of ‘parent terms’ or more specific ‘child terms’.
GO analysis also provides information on genes that are over- or under-represented under specific conditions. This is done through an enrichment analysis for the over-representation of a certain set of genes in a particular condition (Gene Ontology Consortium 2015). The results are statistically evaluated based on their p-values. Tools such as WebGestalt (Wang et al. 2013), clusterProfiler (Yu et al. 2012), GOrilla (Eden et al. 2009), WEGO (Ye et al. 2006), etc. are widely used.
Small RNAs, such as siRNA (small interfering RNA) and miRNA (microRNA), belong to a class of non-coding RNAs that play crucial roles in the regulation of gene expression. Developing high-throughput sequencing technologies have opened new prospects for exploring the world of miRNAs (Sharma et al. 2020). Despite their pivotal roles, miRNAs make up only a very small percentage of the genome. To obtain a comprehensive profile of miRNAs, deep sequencing is performed, a variant of next-generation sequencing in which a genomic region is sequenced hundreds or thousands of times, allowing detection of molecules present at very low abundance (Motameny et al. 2010).
Currently, only a small number of tools and pipelines are available for the analysis of miRNA data, which is a major challenge faced by many researchers. The analysis of miRNA data involves:
(a) Pre-processing of the raw data to filter out low-quality reads and other
non-coding RNAs such as rRNA, tRNA, snRNA, snoRNA, etc.
(b) Mapping of reads to miRBase (the largest repository of published miRNA sequences and annotations for various organisms) (Griffiths-Jones et al. 2007) to obtain the known or conserved miRNAs of an organism.
(c) Prediction of novel miRNAs in an organism based on generation of hairpin loop structures using an RNA folding algorithm.
(d) Quantification of miRNAs for detection of differentially expressed miRNAs.
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J
Mol Biol 215:403–410
Andrews S (2010) FastQC: a quality control tool for high throughput sequence data
Ansorge WJ (2009) Next-generation DNA sequencing techniques. New Biotechnol 25:195–203
Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H,
Lopez R, Magrane M (2004) UniProt: the universal protein knowledgebase. Nucleic Acids Res
32:D115–D119
Avital G, Hashimshony T, Yanai I (2014) Seeing is believing: new methods for in situ single-cell transcriptomics. Genome Biol 15:110
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko
SI, Pham S, Prjibelski AD (2012) SPAdes: a new genome assembly algorithm and its
applications to single-cell sequencing. J Comput Biol 19:455–477
Betel D, Wilson M, Gabow A, Marks DS, Sander C (2008) The microRNA.org resource: targets
and expression. Nucleic Acids Res 36:D149–D153
Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence
data. Bioinformatics 30:2114–2120
Bushmanova E, Antipov D, Lapidus A, Suvorov V, Prjibelski AD (2016) rnaQUAST: a quality
assessment tool for de novo transcriptome assemblies. Bioinformatics 32:2210–2212
Bushmanova E, Antipov D, Lapidus A, Prjibelski AD (2019) rnaSPAdes: a de novo transcriptome
assembler and its application to RNA-Seq data. GigaScience 8:giz100
Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or
without a reference genome. BMC Bioinformatics 12:323
Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or
nucleotide sequences. Bioinformatics 22:1658–1659
Li W, Jiang T (2012) Transcriptome assembly and isoform expression level estimation from biased
RNA-Seq reads. Bioinformatics 28:2914–2921
Li Z, Xuejun L (2016) A comprehensive review on RNA-seq data analysis. Trans Nanjing Univ
Aeronaut Astronaut 33(3):339–361
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R
(2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079
Li W, Feng J, Jiang T (2011) IsoLasso: a LASSO regression approach to RNA-Seq based
transcriptome assembly. J Comput Biol 18:1693–1707
Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR,
Kamitaki N, Martersteck EM (2015) Highly parallel genome-wide expression profiling of
individual cells using nanoliter droplets. Cell 161:1202–1214
Maragkakis M, Reczko M, Simossis VA, Alexiou P, Papadopoulos GL, Dalamagas T,
Giannopoulos G, Goumas G, Koukis E, Kourtis K (2009) DIANA-microT web server:
elucidating microRNA functions through target prediction. Nucleic Acids Res 37:W273–W276
Maretty L, Sibbesen JA, Krogh A (2014) Bayesian transcriptome assembly. Genome Biol 15:501
Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads.
EMBnet J 17:10–12
Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet 12:671–682
Merriman B, Ion Torrent R&D Team, Rothberg JM (2012) Progress in ion torrent semiconductor
chip based sequencing. Electrophoresis 33:3397–3417
Meyer M, Kircher M (2010) Illumina sequencing library preparation for highly multiplexed target
capture and sequencing. Cold Spring Harb Protoc 2010:pdb-prot5448
Mezlini AM, Smith EJ, Fiume M, Buske O, Savich GL, Shah S, Aparicio S, Chiang DY,
Goldenberg A, Brudno M (2013) iReckon: simultaneous isoform discovery and abundance
estimation from RNA-seq data. Genome Res 23:519–529
Mi H, Huang X, Muruganujan A, Tang H, Mills C, Kang D, Thomas PD (2016) PANTHER version
11: expanded annotation data from gene ontology and Reactome pathways, and data analysis
tool enhancements. Nucleic Acids Res 45:D183–D189
Motameny S, Wolters S, Nürnberg P, Schumacher B (2010) Next generation sequencing of
miRNAs–strategies, resources and methods. Genes 1:70–84
Mulcare D (2004) NGS toolkit. Part 8: the National Geodetic Survey. NADCON tool. Prof Surv
Mag 24(2):120–125
Nakasugi K, Crowhurst R, Bally J, Waterhouse P (2014) Combining transcriptome assemblies from
multiple de novo assemblers in the allo-tetraploid plant Nicotiana benthamiana. PLoS One 9:
e91776
Niemenmaa M, Kallio A, Schumacher A, Klemelä P, Korpelainen E, Heljanko K (2012) Hadoop-
BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics
28:876–877
Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL (2015) StringTie
enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol
33:290–295
Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y
(2012) A tale of three next generation sequencing platforms: comparison of ion torrent, pacific
biosciences and illumina MiSeq sequencers. BMC Genomics 13:1–13
Rehmsmeier M, Steffen P, Höchsmann M, Giegerich R (2004) Fast and effective prediction of
microRNA/target duplexes. RNA 10:1507–1517
Roberts A, Pimentel H, Trapnell C, Pachter L (2011) Identification of novel transcripts in annotated
genomes using RNA-Seq. Bioinformatics 27:2325–2329
Schmieder R, Edwards R (2011) Quality control and preprocessing of metagenomic datasets.
Bioinformatics 27:863–864
Schulz MH, Zerbino DR, Vingron M, Birney E (2012) Oases: robust de novo RNA-seq assembly
across the dynamic range of expression levels. Bioinformatics 28:1086–1092
Sharma P, Bhunia S, Poojary SS, Tekcham DS, Barbhuiya MA, Gupta S, Shrivastav BR, Tiwari PK
(2016) Global methylation profiling to identify epigenetic signature of gallbladder cancer and
gallstone disease. Tumor Biol 37:14687–14699
Sharma P, Kumar S, Beriwal S, Sharma P, Bhairappanavar SB, Verma RJ, Das J (2020) Compara-
tive transcriptome profiling and co-expression network analysis reveals functionally coordi-
nated genes associated with metabolic processes of Andrographis paniculata. Plant Gene
23:100234
Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J, Stephens R, Baseler MW, Lane HC,
Lempicki RA (2007) The DAVID gene functional classification tool: a novel biological module-
centric algorithm to functionally analyze large gene lists. Genome Biol 8:R183
Smith-Unna R, Boursnell C, Patro R, Hibberd JM, Kelly S (2016) TransRate: reference-free quality
assessment of de novo transcriptome assemblies. Genome Res 26:1134–1144
T O’Neil S, Emrich SJ (2013) Assessing De Novo transcriptome assembly metrics for consistency
and utility. BMC Genomics 14:465
Tomescu AI, Kuosmanen A, Rizzi R, Mäkinen V (2013) A novel min-cost flow method for
estimating transcript expression with RNA-Seq. BMC Bioinformatics 14(Suppl 5):S15
Trapnell C (2013) Cufflinks. cuffdiff (v6). Open module on GenePattern public server.
GenePattern. https://software.broadinstitute.org/cancer/software/genepattern/modules/docs/
Cufflinks.cuffdiff/6
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, Van Baren MJ, Salzberg SL, Wold BJ,
Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated
transcripts and isoform switching during cell differentiation. Nat Biotechnol 28:511–515
Voshall A, Moriyama EN (2018) Next-generation transcriptome assembly: strategies and perfor-
mance analysis. In: Bioinformatics in the era of post genomics and big data. IntechOpen,
London, pp 15–36
Wang J, Duncan D, Shi Z, Zhang B (2013) WEB-based gene set analysis toolkit (WebGestalt):
update 2013. Nucleic Acids Res 41:W77–W83
Wang Y, Hu H, Li X (2017) rRNAFilter: a fast approach for ribosomal RNA read removal without a
reference database. J Comput Biol 24:368–375
Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, Kriventseva EV,
Zdobnov EM (2018) BUSCO applications from quality assessments to gene prediction and
phylogenomics. Mol Biol Evol 35:543–548
Wong N, Wang X (2015) miRDB: an online resource for microRNA target prediction and
functional annotations. Nucleic Acids Res 43:D146–D152
Wyrzykiewicz T, Cole D (1994) Sequencing of oligonucleotide phosphorothioates based on solid-
supported desulfurization. Nucleic Acids Res 22:2667–2669
Xie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, Huang W, He G, Gu S, Li S (2014) SOAPdenovo-
Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics
30:1660–1666
Ye J, Fang L, Zheng H, Zhang Y, Chen J, Zhang Z, Wang J, Li S, Li R, Bolund L (2006) WEGO: a
web tool for plotting GO annotations. Nucleic Acids Res 34:W293–W297
Yu G, Wang L-G, Han Y, He Q-Y (2012) clusterProfiler: an R package for comparing biological
themes among gene clusters. Omics 16:284–287
Zappia L, Phipson B, Oshlack A (2018) Exploring the single-cell RNA-seq analysis landscape with
the scRNA-tools database. PLoS Comput Biol 14:e1006245
Zhao S, Zhang B, Zhang Y, Gordon W, Du S, Paradis T, Vincent M, von Schack D (2016)
Bioinformatics for RNA-seq data analysis. Bioinformatics—updated features and applications.
InTechOpen, London, pp 125–149
Zyprych-Walczak J, Szabelska A, Handschuh L, Górczak K, Klamecka K, Figlerowicz M,
Siatkowski I (2015) The impact of normalization methods on RNA-Seq data analysis. Biomed
Res Int 2015:621690
Computational Metabolomics
13
Priya Ranjan Kumar, Santosh Kumar Mishra, and Sarika Srivastava
Abstract
Keywords
Metabolomics · Metabolites · Metabolome · High-throughput · Data processing
13.1 Introduction
Small molecules produced during metabolic reactions through enzymatic activity are called metabolites. These metabolites also take part in various other catabolic and anabolic reactions and are often required for normal growth and development of cells. The entire set of metabolites present in an organism is called the metabolome (Oliver et al. 1998). Likewise, metabolomics refers to the use of scientific methods to identify and quantitatively estimate all metabolites in an organism or biological system, as well as to monitor changes in the metabolome of a whole plant or organism. In a metabolomics experiment, the following steps are performed (Goodacre et al. 2007):
Rapid improvements in MS- and NMR-based methods, and in computational tools capable of processing big data, have led to significant advances in high-throughput metabolomics (Wen and Zhu 2015). In the 1990s, GC/LC-MS (gas/liquid chromatography coupled to mass spectrometry) was the technology of choice for the analysis of metabolites across a wide range of plant species. This led to the development of many metabolomics spectral libraries that are currently used by tools and software to identify metabolites from the spectra of new biological samples; one such library is the GOLM Metabolome Database (Kopka et al. 2004). Nowadays, GC/LC-MS, NMR and electron microscopy techniques are used for the identification of metabolites from biological samples. Among these, NMR is widely used and is becoming increasingly popular in metabolomics. All metabolomics experiments generate complex and very large datasets, and handling and processing these datasets to identify metabolites is a major challenge in this area (Boccard et al. 2010). The datasets generated by experimental methods such as NMR or GC/LC-MS must first be preprocessed to obtain clean data, as they contain many biases and much noise. The clean data are then processed further using different tools and metabolite libraries for the identification of metabolites (Barnes et al. 2016).
Quantitative and chemometric approaches are the two major means of metabolomics data analysis. The quantitative approach estimates all metabolites present in the biological sample using spectral libraries prior to statistical analysis of the data, whereas in chemometric approaches the spectral intensities and patterns are first recorded and then compared statistically to find discriminating spectral features (Xia et al. 2009). The processing pipeline of NMR data for metabolite identification from a given sample is explained in this chapter.
A homonuclear NMR experiment is one in which the recorded dimensions span chemical shifts of the same type of nucleus. Examples are COSY (COrrelated SpectroscopY), TOCSY (TOtal Correlated SpectroscopY), NOESY (Nuclear Overhauser Effect SpectroscopY) and ROESY (Rotational nuclear Overhauser Effect SpectroscopY) (Keeler 2010; Gheysen et al. 2008).
A heteronuclear NMR experiment is one in which the recorded dimensions span chemical shifts of different types of nuclei. It is used to assign the spectrum of one nucleus once the spectrum of another nucleus is known. An example is HSQC (Heteronuclear Single Quantum Coherence) (Bodenhausen and Ruben 1980).
Bruker Corporation is the largest vendor of NMR instruments, so the Bruker "fid" format is a de facto standard for NMR data. Apart from this, JCAMP-DX, ASCII, CSV, Varian VNMR, JEOL, Simplot, etc. are other widely used file formats. A Bruker fid dataset consists of several different files. Each scanning session is stored in its own directory, named after the subject and its session. Within a single session directory of an experiment, there are:
1. A text file named "subject", which contains information about the experiment.
2. Subdirectories numbered 1, 2, 3, etc., containing the acquired data for each saved run of that session.
Within each numbered run subdirectory, there are:
1. A text file named "imnd", which contains the various parameters used for data acquisition from the scanner.
2. Another text file named "acqs", an acquisition parameter file containing details about that run.
3. The raw Free Induction Decay (FID) data, stored as a large binary file: a single file named "fid" for 1D NMR, or multiple "fid" files along with a serial "ser" file for 2D NMR experiments.
4. Various other files, including the "pulseprogram", "log" and "grdprog.r" gradient programs.
5. A subdirectory "pdata" (processed data) that contains any reconstructions of the data.
Open an FID data file after downloading and installing the GSim software. To do this, one needs to select the Data option available in Edit Table. A spreadsheet is built by the GSim tool to select and use the real and imaginary parts of the FID data. This file can be saved in .ascii format by pressing Ctrl+S.
Open the .ascii file in MS-Excel by choosing all file types (*.*). Click the Next button and select the Space check box. Choose the dot as decimal separator by clicking on the Advanced button and then click OK. Finally, click the "Finish" button and save the opened file in CSV format.
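If Excel is not available, the same conversion can be scripted. The sketch below assumes the GSim export is a plain text file with two space-separated columns (e.g. frequency and intensity) and writes a comma-separated copy; the file names spectrum.ascii and spectrum.csv are placeholders.

```python
# Convert a space-separated two-column export (e.g. from GSim) to CSV.
# 'spectrum.ascii' and 'spectrum.csv' are placeholder file names.
import csv

with open('spectrum.ascii') as src, open('spectrum.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for line in src:
        fields = line.split()            # split on any whitespace
        if not fields:
            continue                     # skip blank lines
        writer.writerow(fields)          # e.g. [frequency, intensity]
```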
Preprocessing of NMR data usually aims to reduce variances and unwanted influences through steps such as phase correction of each spectrum, baseline correction, etc. For NMR data, preprocessing procedures include Fourier transformation of the raw FID, phasing, noise filtering, baseline correction, normalization and conversion to magnitude spectra (Goodacre et al. 2007). NMR peak alignment, whether global or local, is also regarded as preprocessing because it compensates for chemical shift variability.
Sometimes the absorption line does not appear in the real part of the spectrum, which is undesirable because the absorption mode gives the best-resolved spectrum. The real part (Sx) of the FID is a damped cosine wave and the imaginary part (Sy) a damped sine wave. The spectrum obtained by Fourier transformation then has the absorption-mode line shape in its real part and the dispersion mode in its imaginary part. However, with a phase shift of around 45°, both the real and the imaginary part of the spectrum become mixtures of absorption and dispersion lines.
Fig. 13.1 (a) Example of a free induction decay with only one frequency component, which decays exponentially with time; (b) Fourier-transformed frequency spectrum of the FID
Fig. 13.2 Spectrum showing the phase shift effect. (a) Normal phase; (b) phase shift of 45°; (c) phase shift of 90°; (d) phase shift of 180° (Keeler 2004)
With a larger phase shift of 90°, the real part takes the form of a damped sine wave, whereas the imaginary part takes the form of a damped cosine wave, and a negative absorption line may appear in the real part for a phase shift of 180° (Keeler 2004). The effect of different phase shifts on the time-domain signal is shown in Fig. 13.2.
In practice, the real part of the spectrum is displayed after Fourier transformation of the FID. The phase is then adjusted until the spectrum appears in the absorption mode; this whole process is called phasing the spectrum.
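A small simulation makes the phasing step concrete. The sketch below builds a damped cosine FID with a deliberate phase error, Fourier transforms it with NumPy, and applies a zero-order phase correction so that the real part becomes (close to) a pure absorption line; all parameter values are arbitrary demonstration choices, not values from any particular instrument.

```python
# Simulated FID, Fourier transformation and zero-order phase correction
# (all parameters are arbitrary demonstration values).
import numpy as np

n_points, dwell = 4096, 1e-3                      # number of points, sampling interval (s)
t = np.arange(n_points) * dwell
freq, decay, phase_error = 50.0, 5.0, np.deg2rad(45)

# Complex FID: Sx + i*Sy, a damped oscillation carrying a 45-degree phase error.
fid = np.exp(1j * (2 * np.pi * freq * t + phase_error)) * np.exp(-decay * t)

spectrum = np.fft.fftshift(np.fft.fft(fid))       # frequency-domain spectrum
corrected = spectrum * np.exp(-1j * phase_error)  # zero-order phase correction

# After correction the real part at the peak approaches a pure absorption line.
peak = np.argmax(np.abs(corrected))
print('real/abs ratio at peak before:', round(float(spectrum.real[peak] / np.abs(spectrum[peak])), 3))
print('real/abs ratio at peak after :', round(float(corrected.real[peak] / np.abs(corrected[peak])), 3))
```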
Noise is also recorded along with the FID. The major contributors to this noise are the amplifiers, some of the electrical components of the spectrometer, and thermal noise in the signal detector coil. While the FID decays with time, the noise continues to be recorded at a constant level; hence, if the data are recorded for a long time, the later part of the record contains mostly noise rather than actual signal, and a weak SNR (signal-to-noise ratio) can be expected in the resulting spectrum. Since the actual metabolite signal is concentrated in the early part of the FID, the SNR may be improved by reducing the data recording time (Keeler 2004). However, one needs to be careful while shortening the acquisition time so as not to truncate the actual FID data, as shown in Fig. 13.3.
Fig. 13.3 Time dependency of the SNR in the spectrum while recording the FID. (a) A long FID recording time contributes more noise; (b) lower noise when the recording time is reduced to half; (c) taking only the first quarter of the data (Keeler 2004)
Looking at the FID shown in Fig. 13.3, it can be concluded that only the early part of the FID contains the actual signal. Hence, a mathematical weighting function can be applied that suppresses the later part of the data without affecting the early part, in order to improve the SNR of the spectrum.
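One standard choice for such a weighting function is an exponential (line-broadening) window, which emphasizes the early, signal-rich part of the FID over the noisy tail. The minimal sketch below continues the simulated-FID idea with arbitrary parameter values and an invented SNR estimate, so the numbers are illustrative only.

```python
# Exponential apodization (line broadening) to improve the SNR of a noisy FID
# (all parameters are arbitrary demonstration values).
import numpy as np

rng = np.random.default_rng(0)
n_points, dwell = 4096, 1e-3
t = np.arange(n_points) * dwell

fid = np.exp(2j * np.pi * 50.0 * t) * np.exp(-5.0 * t)        # decaying signal
fid += 0.05 * (rng.standard_normal(n_points) + 1j * rng.standard_normal(n_points))  # noise

lb = 2.0                                   # line broadening in Hz
window = np.exp(-np.pi * lb * t)           # exponential weighting of the FID

def peak_snr(spec):
    """Crude SNR estimate: peak height over noise level of a signal-free region."""
    return np.max(np.abs(spec)) / np.std(spec.real[2000:2500])

raw = np.fft.fft(fid)
apodized = np.fft.fft(fid * window)
print('SNR without apodization:', round(float(peak_snr(raw)), 1))
print('SNR with apodization   :', round(float(peak_snr(apodized)), 1))
```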
NMR analysis of biofluid samples is often affected by variations in peak position and shape that are not related to the sample itself. The main reasons for this are instrument instability and variations in the sample background matrix. These variations complicate NMR data analysis and interpretation, so peak alignment is an essential preprocessing step before any further analysis. Peak alignment is performed by shifting a spectrum sideways and comparing it with the reference spectrum until the best correlation is found (Forshed et al. 2003).
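In its simplest form this can be done by trying a range of sideways shifts and keeping the one that maximizes the correlation with the reference spectrum, as the sketch below does on two synthetic peaks; dedicated methods such as the genetic-algorithm approach of Forshed et al. (2003) or CluPA in speaq refine this idea segment-wise.

```python
# Align a shifted spectrum to a reference by maximizing correlation over a range
# of integer point shifts (synthetic Gaussian peaks used only for illustration).
import numpy as np

x = np.arange(1000)
reference = np.exp(-0.5 * ((x - 500) / 8.0) ** 2)     # reference peak at point 500
target = np.exp(-0.5 * ((x - 512) / 8.0) ** 2)        # same peak, shifted by 12 points

best_shift, best_corr = 0, -np.inf
for shift in range(-30, 31):                          # candidate sideways shifts
    shifted = np.roll(target, shift)
    corr = float(np.dot(shifted, reference))          # unnormalized correlation
    if corr > best_corr:
        best_shift, best_corr = shift, corr

aligned = np.roll(target, best_shift)
print('best shift:', best_shift)                      # expected: -12 points
```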
13.5.5 Binning/Bucketing
This method corrects for chemical shift variability across spectra. The entire spectrum is segmented into small bins, and the spectral region within each bin is then taken forward for further processing. An appropriate bin size is chosen so that the spectral peak of a compound remains within a single bin despite small chemical shift differences across spectra. For this, either the bin size in ppm or the total number of bins must be specified (Cobas 2011). Binning is most commonly performed from 0.04 to 10 ppm with a bin size of 0.04 ppm. Signals that are unrelated to the experiment, the so-called dark regions, can be excluded from binning. An example of binned spectra is shown in Fig. 13.4.
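The sketch below bins a synthetic 1D spectrum into fixed-width 0.04 ppm buckets between 0.04 and 10 ppm, summing the intensity within each bucket; the spectrum and peak position are invented, and real implementations additionally exclude the dark regions before binning.

```python
# Fixed-width binning (bucketing) of a 1D spectrum into 0.04 ppm bins
# between 0.04 and 10 ppm (synthetic data for illustration).
import numpy as np

ppm = np.linspace(10, 0, 32768)                        # chemical shift axis (descending)
intensity = np.exp(-0.5 * ((ppm - 3.2) / 0.01) ** 2)   # one synthetic peak at 3.2 ppm

edges = np.arange(0.04, 10.0 + 0.04, 0.04)             # bin edges, 0.04 ppm wide
bin_index = np.digitize(ppm, edges)                    # assign each point to a bin
binned = np.bincount(bin_index, weights=intensity, minlength=len(edges) + 1)

print('bins with signal:', int(np.count_nonzero(binned)))
print('total intensity preserved:', bool(np.isclose(binned.sum(), intensity.sum())))
```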
13.5.6 Normalization
There are a few freely available tools/R-packages which can help in 1D NMR data
processing and analysis. They are:
• NMRS,
• ChemoSpec,
• speaq,
• batman.
For 2D NMR data processing and analysis the following tools can be used:
• NMRPipe (a UNIX-based software package for data processing and analysis),
• rNMR (an R-based tool for data analysis).
13.6.1 NMRS
Bruker FID-format spectral data can be loaded directly into the NMRS tool for analysis, and the reference spectrum can also be displayed. The tool can perform many basic operations such as phase correction, adjustment of the chemical shift of a reference compound to zero ppm, baseline correction and selection of the spectral area. The NMRS package has been designed as an interactive process: by typing NMRS the user gains access to complete preprocessing of the data. NMRS depends upon a few other R packages and requires the combination of the Tool Command Language (Tcl) and the Tk GUI toolkit, referred to as Tcl/Tk. Hence, before installing the NMRS package, one needs to configure the machine with the Tcl development package (tcl-dev) and the Tk development package (tk-dev). To perform time-frequency analysis, one also needs to install the R package "Rwave". The "tkrplot" package is required for placing R graphics in a Tk widget, and "FTICRMS" for handling large matrices and data visualization (Izquierdo 2013).
13.6.2 ChemoSpec
Spectroscopic data analysis can be performed using the ChemoSpec toolkit. Spectra can be plotted using specific functions of this package, and many kinds of exploratory data analysis can be performed, such as PCA (Principal Component Analysis), model-based clustering and HCA (Hierarchical Cluster Analysis). Comparison between control and treatment group samples can also be performed with this tool (Hanson et al. 2020). The package has only a command-line interface, so it runs in the R console. It depends upon several other R packages such as R.utils, plyr, amap, baseline, pls, etc. Since it also depends upon the "rgl" package for 3D visualization, it requires OpenGL support. The function
"getManyCsv" helps in moving raw data sets into ChemoSpec; it supports only the .csv file format, with data in two columns containing frequency and intensity values. The ChemoSpec package has a large set of functions that can help in data preprocessing and analysis.
13.6.3 speaq
Metabolite quantification from NMR spectra can be done using the "speaq" tool, which uses the CluPA (Cluster-based Peak Alignment) method for peak alignment. This R package aligns the target spectrum with a reference spectrum using a top-down approach and builds a cluster tree; the spectra are then divided into small segments on the basis of the farthest clusters. It also carries out statistical analyses such as an F-statistic or a one-way ANOVA to quantify the NMR data. This package does not have any data preprocessing options (Beirnaert et al. 2019).
13.6.5 NMRPipe
NMRPipe is an extensive UNIX-based software system for the processing and analysis of NMR spectroscopic data. It helps in the processing and analysis of multidimensional NMR spectra and needs the support of the C shell and X11 graphics along with a terminal window. It supports both a GUI and a command-line interface with shell and TCL scripts, which makes it very flexible: users can write their own scripts and run them to perform any specific action on the spectral data (Delaglio et al. 1995). The package also includes many other programs such as NMRDraw, NMRWish, DYNAMO, ACME, DC, etc., which help to perform several different types of actions, such as interactive processing of data, script editing, chemical shift analysis, peak detection, etc. A conventional processing pipeline using NMRPipe is as follows:
• Initial preprocessing of low resolution data like baseline correction, phasing, etc.
• Further inspection and if required additional processing.
• Automatic peak detection.
• Data analysis.
After data processing, the peak list generated by the tool can be uploaded directly to various metabolite library servers, for example the MMCD or the BMRB databank, for metabolite profiling to identify the metabolites present in the sample.
13.6.6 rNMR
NMR spectral libraries are used for bio-profiling to identify the metabolites present in a sample, using peak lists generated from the processed spectral data. Various NMR-based metabolite libraries are freely available on the World Wide Web, to which data can be submitted to search for the metabolites present in the sample. A few metabolite spectral libraries are as follows:
NMR data of peptides, DNA, RNA and other biomolecules are stored in this database, which serves as a reference library for metabolite identification. It is a member database of the PDB (Ulrich et al. 2008).
13.7.3 NMRShiftDB
It is also a web-based NMR database that serves as a reference library for many NMR data analysis tools. It has a large collection of structures of organic compounds along with their spectra. It is an open-source database available under the GNU Free Documentation License (Steinbeck and Kuhn 2004).
13.8 Conclusion
The NMRS tool has very good options for NMR data preprocessing, but it is limited to the preprocessing task alone, whereas ChemoSpec has a wide range of data preprocessing and analysis functions. The limitation of ChemoSpec is that it only accepts raw data in CSV format with two columns of frequency and intensity, so an additional function or script is needed to generate a two-column peak list of ppm and intensity values from the FID data. It offers all the preprocessing options except fast Fourier transformation and phase correction, because these are not applied to an intensity table; the intensity table is generated only after applying the FFT to the fid data with another tool. These tools are specific to 1D NMR data, but in metabolomics 2D data are also generated to obtain more robust, higher-resolution results. Hence, tools and R packages that can process multidimensional NMR data are required. NMRPipe and rNMR are tools that handle 2D NMR data very well. NMRPipe is a UNIX-based collection of programs that allows the user to work either in command-line mode or through a GUI; it is capable of handling data from 1D to 4D and allows the user to perform all data processing and to generate a peak list from the spectra, with which one can proceed to bio-profiling using various NMR spectrum libraries such as MMCD, NMRShiftDB, etc. and to further analysis. rNMR is an R-based 1D and 2D NMR data analysis tool. It also provides a command-line interface and a graphical user interface to run the various data analysis steps. As it was mainly developed for the analysis of multiple NMR datasets, it does not have any data processing functions; with this tool the user can generate peak lists for bio-profiling and select regions of interest from multiple spectra for comparative analysis.
References
Barnes S, Benton HP, Casazza K, Cooper SJ, Cui X, Du X, Engler J, Kabarowski JH, Li S,
Pathmasiri W, Prasain JK, Renfrow MB, Tiwari HK (2016) Training in metabolomics research.
II. Processing and statistical analysis of metabolomics data, metabolite identification, pathway
analysis, applications of metabolomics and its future. J Mass Spectrom 51(8):535–548
Beirnaert C, Vu TN, Meysman P, Laukens K, Valkenborg D (2019) speaq: Tools for Nuclear
Magnetic Resonance (NMR) Spectra Alignment, Peak Based Processing, Quantitative Analysis
and Visualizations. R-Package, https://cran.r-project.org/web/packages/speaq/index.html
Boccard J, Veuthey JL, Rudaz S (2010) Knowledge discovery in metabolomics: an overview of MS
data handling. J Sep Sci 33:290–304
Bodenhausen G, Ruben DJ (1980) Natural abundance nitrogen-15 NMR by enhanced heteronuclear
spectroscopy. Chem Phys Lett 69(1):185–189
Cobas C (2011) Alignment of NMR spectra – part II: binning/bucketing. Blog on NMR analysis.
http://nmr-analysis.blogspot.com/2011/01/alignment-of-nmr-spectra-part-ii.html
Cui Q, Lewis IA, Hegeman AD, Anderson ME, Li J, Schulte CF, Westler WM, Eghbalnia HR,
Sussman MR, Markley JL (2008) Metabolite identification via the Madison metabolomics
consortium database. Nat Biotechnol 26(2):162–164
De Meyer T, Sinnaeve D, Van Gasse B, Rietzschel ER, Buyzere MLD, Langlois MR, Bekaert S,
Martins JC, Criekinge W (2010) Evaluation of standard and advanced preprocessing methods
for the univariate analysis of blood serum 1H-NMR spectra. Anal Bioanal Chem 398
(4):1781–1790
Delaglio F, Grzesiek S, Vuister GW, Zhu G, Pfeifer J, Bax A (1995) NMRPipe: a multidimensional
spectral processing system based on UNIX pipes. J Biomol NMR 6:277–293
Duer M (2004) Introduction to solid-state NMR spectroscopy. Blackwell Publishing, Hoboken, NJ,
pp 43–58
Ellinger JJ, Chylla RA, Ulrich EL, Markley JL (2013) Databases and software for NMR-based
metabolomics. Curr Metabolomics 1(1). https://doi.org/10.2174/2213235X11301010028
Forshed J, Schuppe-Koistinen I, Jacobsson SP (2003) Peak alignment of NMR signals by means of
a genetic algorithm. Anal Chim Acta 487(2):189–199
Gheysen K, Mihai C, Conrath K, Martins J (2008) Rapid identification of common hexapyranose
monosaccharide units by a simple TOCSY matching approach. Chem Eur J 14:8869–8878
Goodacre R, Broadhurst D, Smilde AK, Kristal BS, Baker DJ, Beger R, Bessant C, Connor S,
Capuani G, Craig A, Ebbels T, Kell DB, Manetti C, Newton J, Paternostro G, Somorjai R,
Sjöström M, Trygg J, Wulfert F (2007) Proposed minimum reporting standards for data analysis
in metabolomics. Metabolomics 3:231–241
Hanson BA, Bostock M, Keinsley M (2020) ChemoSpec: exploratory chemometrics for spectros-
copy. R Package. https://cran.r-project.org/web/packages/ChemoSpec/index.html
Hao J, Astle W, Iorio MD, Ebbels T (2012) BATMAN--an R package for the automated quantifi-
cation of metabolites from NMR spectra using a Bayesian Model. Bioinformatics 28
(15):2088–2090
Izquierdo JL (2013) Package ‘NMRS’. R-Package. http://www2.uaem.mx/r-mirror/web/packages/
NMRS/NMRS.pdf
Izquierdo-García JL, Rodríguez I, Kyriazis A, Villa P, Barreiro P, Desco M, Ruiz-Cabello J (2009)
Metabonomic: a novel R-package graphic user interface for the analysis of metabonomic
profiles. BMC Bioinformatics 10:363
Keeler J (2004) Chapter 4: Fourier transformation and data processing. In: Understanding NMR
spectroscopy lecture series. The James Keeler Group, Cambridge
Keeler J (2010) Understanding NMR spectroscopy, 2nd edn. Wiley, Hoboken, NJ, pp 280–299
Kopka J, Schauer N, Krueger S, Birkemeyer C, Usadel B, Bergmüller E, Dörmann P,
Weckwerth W, Gibon Y, Stitt M, Willmitzer L, Fernie AR, Steinhauser D (2004)
[email protected]: the Golm metabolome database. Bioinformatics 21(8):1635–1638
Lewis IA, Schommerand SC, Markley JL (2009) rNMR: open source software for identifying and
quantifying metabolites in NMR spectra. Magn Reson Chem 47(1):123–126
Oliver SG, Winson MK, Kell DB, Baganz F (1998) Systematic functional analysis of the yeast
genome. Trends Biotechnol 16(9):373–378
Steinbeck C, Kuhn S (2004) NMRShiftDB – compound identification and structure elucidation
support through a free community-built web database. Phytochemistry 65(19):2711–2717
Torgrip RJO, Aberg KM, Alm E et al (2008) A note on normalization of biofluid 1D 1H-NMR data.
Metabolomics 4:114–121
Ulrich EL, Akutsu H, Doreleijers JF, Harano Y, Ioannidis YE, Lin J, Livny M, Mading S,
Maziuk D, Miller Z, Nakatani E, Schulte CF, Tolmie DE, Kent WR, Yao H, Markley JL
(2008) BioMagResBank. Nucleic Acids Res 36(Database issue):D402–D408
van den Berg RA, Hoefsloot HC, Westerhuis JA, Smilde AK, van der Werf MJ (2006) Centering,
scaling, and transformations: improving the biological information content of metabolomics
data. BMC Genomics 7:142
Wang T, Shao K, Ch O, Ren Y, Mu Y, Qu L, He J, Jin C, Xia B (2009) Automics: an integrated
platform for NMR-based metabonomics spectral processing and data analysis. BMC Bioinfor-
matics 10:83
Wen B, Zhu M (2015) Applications of mass spectrometry in drug metabolism: 50 years of progress.
Drug Metab Rev 47(1):71–87
Xi Y, Rocke DM (2008) Baseline correction for NMR spectroscopic metabolomics data analysis.
BMC Bioinformatics 9:324
Xia J, Psychogios N, Young N, Wishart DS (2009) MetaboAnalyst: a web server for metabolomic
data analysis and interpretation. Nucleic Acids Res 37(Web Server issue):W652–W660
Next Generation Sequencing
14
Anchita Prasad, Harshita Bhargava, Ayam Gupta, Nidhi Shukla,
Shalini Rajagopal, Sonal Gupta, Amita Sharma, Jayaraman Valadi,
Vinod Nigam, and Prashanth Suravajhala
Abstract
A. Prasad · V. Nigam
Department of Bioengineering, Birla Institute of Technology, Mesra, Ranchi, Jharkhand, India
H. Bhargava · A. Sharma
Department of Computer Science, IIS University, Jaipur, Rajasthan, India
A. Gupta · N. Shukla · S. Rajagopal · S. Gupta
Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research (BISR),
Statue Circle, Jaipur, Rajasthan, India
J. Valadi
Department of Informatics, Shiv Nadar University, Noida, India
Department of Computer Science, Flame University, Pune, Maharashtra, India
P. Suravajhala (*)
Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research (BISR),
Statue Circle, Jaipur, Rajasthan, India
Bioclues.org, Hyderabad, India
e-mail: [email protected]
Keywords
14.1 Introduction
Increased understanding of how to interpret the human genome has provided critical evidence for genetic disorders as well as for the development of extensive treatment and diagnostic strategies. This understanding received a major boost as sequencing efforts burgeoned over the last few decades. After the completion of the human genome sequencing project in 2004 (Abdellah et al. 2004), the growing need to sequence massive numbers of genomes shifted the field from the traditional Sanger method to novel DNA sequencing techniques. In 2005, the first massively parallel DNA sequencing method appeared, ushering in the new era of next generation sequencing (NGS) technologies (Shendure 2005). NGS comprises high-throughput, massively parallel sequencing technologies that have revolutionized biological research, evolving rapidly in terms of declining sequencing cost per base and rising throughput. Owing to its throughput, scalability, and speed, NGS has enabled researchers across a wide array of biological and clinical applications (Abbasi and Masoumi 2020). Through NGS, millions of DNA
reads are sequenced in a single assay at much lower cost. Owing to these advantages,
NGS methods have been used for a wide range of applications including variant
identification using whole genome/exome resequencing, transcriptome profile anal-
ysis of tissues, microbial profiling, and detecting genetic biomarkers for disease
prognosis (Schuster 2008; Suravajhala et al. 2016). Over the last few years,
increased awareness about decoding the human genome has provided significant
evidence for detecting rare genetic disorders as well as their diagnosis and treatment
in an efficient manner. In addition, "massive" or "deep" sequencing techniques are applied to study germline DNA and to analyze cancer genomes (Pettersson et al. 2009; Stratton et al. 2009). Microbial genome sequencing has likewise yielded information about transmission and outbreaks, for example by inferring virulence, antibiotic resistance, and molecular subtypes. With NGS used to track an outbreak of methicillin-resistant Staphylococcus aureus (MRSA) among neonates (Chiu et al. 2008), a main utility of NGS in microbiology has been to steadily replace conventional, criteria-based characterization of pathogens with genomic features (Deurenberg et al. 2017). The bottom line is that NGS has become a recommended strategy for characterizing various organisms, viz. bacteria, viruses, fungi, yeasts, and parasites. Because NGS does not require the target-specific primers needed for Sanger sequencing, these techniques are available to researchers, practitioners, and academicians at a very reasonable cost and with high accuracy (Di Resta et al. 2018). The NGS
process for DNA sequencing has been explained in the following general steps
(Fig. 14.1).
14.2 Fragmentation
Adapters are introduced at both ends of the randomly generated fragments so that each fragment carries known sequences (Heather and Chain 2016). A ligase enzyme covalently links the adapters to the insert DNA fragments, producing complete library molecules. Because the adapters serve multiple functions, they enable easy sample identification and multiplexing; in some cases barcodes are also attached. For bridge amplification, the DNA fragment binds to oligos on the flow cell, forming a bridge, and a primer binding to this DNA sequence amplifies it vertically (Ambardar et al. 2016).
14.4 Sequencing
During sequencing, the polymerase adds nucleotides to the clusters formed by bridge amplification, and the signals emitted are recorded to generate the sequence reads for the DNA fragments (Buermans and den Dunnen 2014). The sequencing steps vary from one instrument to another depending on capacity and flow cells. Other parameters, viz. average depth, coverage, read size, and read chemistry, are discussed in Table 14.1 and Fig. 14.1.
The data generated by the sequencing machines can be aligned to the reference
genome sequence. Basically, from the library preparation to data analysis, there are
four different types of NGS methods, viz. whole genome sequencing (WGS), whole
exome sequencing (WES), whole transcriptome shotgun sequencing (WTSS), and
targeted/amplicon genome sequencing. If a single organism is not in question, metagenome sequencing can be used, in which several organisms are characterized at once.
As discussed earlier, various modern approaches, including Illumina (Solexa) sequencing, whole genome sequencing, targeted sequencing, amplicon sequencing, exome sequencing, de novo sequencing, transcriptomics, etc., have preferentially used short-read sequencing, while 10x Genomics, PacBio (Rhoads and Au 2015), and Oxford Nanopore use long-read sequencing chemistry (Metzker 2010).
The sequences are read from one end or from both ends depending on the read chemistry: single-read, paired-end, or mate-pair chemistry in the case of multiple paired-end reads (Head et al. 2014). There are, however, recent
technologies that have allowed us to sequence DNA and RNA more easily and cost-
effectively than the previously used Sanger sequencing (Pareek et al. 2011). This has not only helped in studying alterations harbored in the genetic and molecular pathways associated with mutated genes but also allowed us to identify the non-coding spectrum from WGS and WTSS datasets (Ansorge 2009). In addition, NGS in recent years has
made it possible to better understand the genetics behind rare diseases and imple-
ment it as a technological advancement in clinical and diagnostic practices across a
wide array of genomes (Mathur et al. 2018). NGS has allowed us to analyze diverse regions of a genome in a single reaction assay in a far more cost-effective manner and has proved an efficient tool for examining patients with genetic disorders (Depristo et al. 2011; Deurenberg et al. 2017) (Fig. 14.2). Furthermore, the
molecular and genomic data in the form of precise detection of disease biomarkers
has helped in understanding regulation, identifying heritable disorders, and depicting factors governing therapeutic response (Rabbani et al. 2014; Jamuar and Tan 2015). Moreover, a diverse variety of molecular tests make use of NGS technology, such as single- and multiple-gene panel sequencing, single-cell sequencing, WGS, WES, and cell-free DNA testing for prenatal sequencing (Van den Veyver and Eng 2015). Considering that NGS is a recent and efficient diagnostic tool in clinical genetics, there are challenges and limitations regarding how to analyze and interpret the sequencing data and communicate the findings to patients and family members. Thus, it is essential to properly understand the applications, strengths, and weaknesses of the different approaches, which we document below for each case.
Table 14.1 Different platforms of next generation sequencing
the reference data; otherwise, a de novo assembly step is carried out for data analysis (Lam et al. 2012). The predicted gene sequences are then annotated using freely available online resources such as the UniProt and NCBI NR databases, which identify known and unknown regions. From the WGS results, we can analyze read depth, gene density, insertion density, and SNP density and elucidate unexplored genomic regions. The identified unique variants may reveal novel biological pathways that lead to complex disorders, providing high-resolution insights into the affected pathways (Sanders et al. 2017).
Whole exome sequencing (WES) is one of the most broadly used NGS techniques, in which only the protein-coding regions of the genome are sequenced. The human exome comprises less than 2% of the genome yet harbors more than 85% of disease-causing variants, which makes WES a cost-efficient approach compared with whole genome sequencing (WGS). DNA libraries for the WES approach can be prepared in just one day, yielding 4–5 Gb of sequenced data per
exome. The WES utilizes exome enrichment methodology for deciphering coding
regions which can further be applied to a wide range of clinical applications,
including cancer studies, population-based studies, and genetic disorders (Gupta
et al. 2017; Mueller et al. 2018; Weigelt et al. 2018). In addition, WES has been
proven advantageous for identifying pathogenic variants in several Mendelian
phenotypes, complex disorders as well as rare disorders (Jeste and Geschwind
2014; Mathur et al. 2018). In recent years, the WES approach has been routinely applied in clinical diagnostics as a generic test for managing various disorders (Arts et al. 2019) and has been used as an efficient genomic strategy in the 1000 Genomes Project (Altshuler et al. 2012) and the Exome Aggregation Consortium (ExAC) (Lek et al. 2016) to decipher population-level risk variants and to predict
disorders linked to rare mutations. Various pipelines to perform the analyses do
exist and vary from commercial platforms to open source tools, viz. SeqMule,
Interpretomics, Qiagen/CLC Bio, and GATK, in addition to bash-based pipelines that use open-source tools developed by us (Meena et al. 2018). In contrast to targeted
sequencing, WES has several advantages: first, it allows identification of novel causal genes associated with a genetic disorder that are not included in exome-wide gene panels; second, along with small polymorphic variants it also provides genome-wide data for reliable detection of larger polymorphic sites, including copy number variants (CNVs) and regions of homozygosity (Stray-Pedersen et al. 2017; Gambin et al. 2017). Moreover, to reduce the intricacies of data analysis and accelerate the process, WES can be combined with computational, data-driven analysis of previously reported cohorts of causal genes (Neveling et al. 2013). WES delivers extensive depth of coverage for the coding regions of the genome and yields compact, manageable data that allow faster and more precise analysis than WGS methods (Gupta et al. 2020). WES allows variant detection in coding exonic sites, with the ability to extend the target regions to include untranslated regions (UTRs), in some cases microRNAs (de Carvalho et al. 2019), and even long non-coding RNAs, giving a more detailed view of gene regulation in rare disorders (Gupta et al. 2018) (Fig. 14.3).
Fig. 14.3 A schematic representation of Whole Genome and Whole Exome Sequencing workflow and analysis
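As a hedged illustration of the post-alignment filtering step that such pipelines perform before interpretation, the Python sketch below keeps only variants that pass simple quality and read-depth thresholds in a VCF file. It is not taken from any of the pipelines named above; it parses only the standard VCF columns, and the thresholds and file names are illustrative assumptions.

```python
def filter_vcf(in_vcf, out_vcf, min_qual=30.0, min_depth=10):
    """Keep variants with QUAL >= min_qual and INFO DP >= min_depth.
    A simplified sketch: production pipelines (e.g. GATK-based ones)
    apply far richer, model-based filters."""
    with open(in_vcf) as src, open(out_vcf, "w") as dst:
        for line in src:
            if line.startswith("#"):      # copy header lines unchanged
                dst.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            qual, info = fields[5], fields[7]
            # Pull the total depth (DP) entry out of the INFO column, if present
            depth = 0
            for entry in info.split(";"):
                if entry.startswith("DP="):
                    depth = int(entry[3:])
                    break
            if qual != "." and float(qual) >= min_qual and depth >= min_depth:
                dst.write(line)

# Hypothetical usage:
# filter_vcf("sample.vcf", "sample.filtered.vcf")
```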
The entire set of RNA transcripts present in a given cell at a particular stage is known as the transcriptome. Understanding the transcriptome is an essential element in understanding any particular developmental process or disease (Martin and Wang 2011). One of the most widely used methods to study differentially expressed genes is microarray technology, but it has its own limitations. Advances in sequencing technologies have revolutionized transcriptome analysis through cDNA sequencing (RNA-Seq). Because of its higher reproducibility and better resolution, RNA-Seq is widely accepted and used for many research purposes (Wang et al. 2009). The main steps of the RNA-Seq approach are (1) analysis of raw data, (2) read alignment, (3) transcriptome reconstruction, and (4) quantification and differential expression analysis (Nagalakshmi et al. 2010). The initial steps of RNA-Seq include quality checking of the raw data followed by mapping to the reference genome; if no reference genome is available, a de novo assembly approach can be used. The last step is the analysis of differentially expressed genes, which can be done using different available approaches (Garber et al. 2011) (Fig. 14.4).
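In practice the differential expression step is usually handled by dedicated packages such as DESeq2 or edgeR, both cited in the references of this chapter. Purely to illustrate the underlying idea, the hedged sketch below normalizes raw counts to counts per million and computes per-gene log2 fold changes between two conditions; the function and variable names are illustrative assumptions, and it deliberately ignores replicates, dispersion estimation, and statistical testing.

```python
import numpy as np

def simple_log2_fold_change(counts_a, counts_b, pseudocount=1.0):
    """counts_a, counts_b: raw read counts per gene for two conditions.
    Returns CPM-normalized values and per-gene log2 fold changes.
    Didactic only: DESeq2/edgeR handle replicates and dispersion properly."""
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    # Library-size normalization: counts per million mapped reads
    cpm_a = counts_a / counts_a.sum() * 1e6
    cpm_b = counts_b / counts_b.sum() * 1e6
    # Pseudocount avoids taking the log of zero for unexpressed genes
    log2_fc = np.log2((cpm_b + pseudocount) / (cpm_a + pseudocount))
    return cpm_a, cpm_b, log2_fc

# Toy example with four genes in two conditions
_, _, lfc = simple_log2_fold_change([100, 50, 0, 400], [200, 45, 30, 390])
print(np.round(lfc, 2))
```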
In 1947, Alan Turing remarked, "... what we want is a machine that can learn from experience." Today we routinely use techniques that learn from experience, collectively named machine learning (ML). Machine learning aims to forecast future possibilities in a specific problem through continuous learning. In the past decade, machine learning has vastly improved our understanding of the human genome. ML is so pervasive today that we probably use it dozens of times a day without knowing it. ML has revolutionized traditional statistical techniques by supplying new insight into data analysis. It has been deployed on a vast range of problems in different areas such as biomedicine, education, commerce, engineering, aeronautics, space science, and many more. Because ML has an enormous capability for mining a wide variety of hidden patterns from data, it has become an integral part of genomics research. Genomics presents many such problems: genome sequencing, gene editing, drug–target interaction, molecular docking, microorganism genesis, antiviral drug studies, etc. The data in such problems are massive, multidimensional, and highly varied. Data may be structured, as in a database, semi-structured, as in log files, or completely unstructured, such as videos, images, and audio. ML algorithms have been devised to penetrate such structured or unstructured, perplexing data and extract insight from it. ML not only provides predictive analysis but also assists in decision making. These algorithms fall into several broad groups:
• Association rule algorithms find relationships between the given data items. Support and confidence are the commonly used measures for association rule algorithms.
• Clustering algorithms split a dataset into different clusters using a mathematical formula (Brun et al. 2007). They are validated using three types of measures, i.e. internal, external, and relative. Internal validation assesses the clustering quality using intrinsic data (Hämäläinen et al. 2017), external validation works with extrinsic data to evaluate the level of cohesion in the grouping, and relative validation is a hybrid of the two and judges the overall quality. Internal indexes include the Dunn index, Calinski–Harabasz index, Gamma index, C-index, Davies–Bouldin index, Silhouette index, etc.; external indexes include the Jaccard coefficient, Goodman–Kruskal coefficient, F-measure, Rand statistic, purity, Fowlkes–Mallows index, entropy, etc. (Vendramin et al. 2010).
• Dimensionality reduction (DR) algorithms are essential for reducing the dimensional complexity of data (Sarwar et al. 2000). Quality assessment for such algorithms falls into two categories: local preservation criteria and global preservation criteria. The local approach focuses on preserving local neighborhoods, whereas the global approach focuses on preserving the overall structure (Gracia et al. 2014).
• Classification algorithms classify a dataset based on the class of the output variable(s). They are assessed using a confusion matrix (error matrix), a report card of the classification model that details its performance through sensitivity and specificity. It tabulates the classified data as true positives, true negatives, false positives, and false negatives, revealing the correctly and incorrectly classified instances. Accuracy and precision are also calculated from the confusion matrix; these metrics define the robustness of the model. A receiver operating characteristic (ROC) curve is additionally used to accept or reject the classification model (Cai and Dodd 2008; Greiner et al. 2000); it is drawn using the specificity and sensitivity provided by the confusion matrix (see the sketch after this list). Other metrics for decision tree algorithms are the Gini index, classification error, and entropy (Silahtaroǧlu 2009).
• Prediction algorithms are used to forecast the likelihood of an object or event. The measures for evaluating prediction are accuracy, precision, mean absolute error, root mean square error, etc. These are also used for measuring the performance of prescriptive analysis.
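The Python sketch below illustrates the classification metrics just described (confusion matrix, sensitivity, specificity, accuracy, precision, and ROC AUC) on synthetic, imbalanced data using scikit-learn; the dataset, model, and parameter choices are illustrative assumptions rather than an analysis drawn from this chapter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class data standing in for an NGS-derived feature matrix
X, y = make_classification(n_samples=1000, n_features=25,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Confusion matrix -> true/false positives and negatives
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)          # recall for the positive (minority) class
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)

# ROC AUC computed from the predicted probability of the positive class
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"accuracy={accuracy:.2f} precision={precision:.2f} AUC={auc:.2f}")
```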
These metrics measure the performance of the algorithms. They guide researchers in finalizing and verifying the ML-based model for their study, as different NGS applications require different models (Marceddu et al. 2019). Common applications of machine learning algorithms in NGS include screening of compounds, fragment-based de novo design, computational screening of molecular fragments and fragment linking to design novel inhibitors, molecular docking analysis with virtual screening, construction of homology models, design of linear discriminant analysis models, and design of analogs (van den Akker et al. 2018). New areas of genomic research include soil fertility studies with microorganisms, newborn genetic screening, precision medicine, gene-based prescription, energy healing and genetic transformation, genetic disease and vibrational therapies, and many more.
NGS data generated using different platforms have inherent applications in the identification of genes, variants including copy number variants (CNVs) and single nucleotide variants (SNVs), exomes, RNA, and small RNAs (Tripathi et al. 2016).
The enormous data generated using NGS is one of the most promising instances of
“big data” which is evident from the fact that the space needed to store 1000
genomes is approximately three terabytes. The storage space and data processing
requirements are often addressed using cloud computing solutions with Apache
Hadoop framework. The sequence compression algorithms also contribute towards
better transmission, analysis, and storage of such data (Wandelt et al. 2012). The cost
effectiveness of NGS in terms of time and throughput as compared to Sanger
sequencing has posed challenges as well as opportunities in terms of effective data
analysis and storage. With the advent of NGS techniques the focus gradually shifted
from data generation to methods which could assist in gaining insights out of this
data. The basic workflow of machine learning can be summarized as shown in
Fig. 14.6.
The first step in analyzing NGS data is the curation of a relevant dataset with respect to the underlying problem (Tripathi et al. 2016). The next step is to preprocess the data, which is itself a combination of several substeps including data cleaning, data integration, data transformation, handling of data imbalance, and dimensionality reduction. Data cleaning includes removal of out-of-date data, handling or removal of missing values and features, and identification of the
outliers. The data integration step may involve the handling of different heteroge-
neous sources of data while converting them into a uniform format. Some features in
the dataset have extreme values falling in different ranges which can be handled
using different normalization schemes for transforming the data. Data imbalance is
one of the most prominent issues when handling NGS datasets. It is especially inherent when identifying trait- or disease-related non-coding variants (Schubach et al. 2017), where the machine learning model tends to learn the majority class and thereby generates wrong predictions or classifications for instances of the minority class. The purpose of dimensionality reduction is to reduce the number of features/
variables which thereby affects the storage and time complexity involved in
processing such data. However, feature selection differs from dimensionality reduc-
tion in the way that the former deals with selecting the relevant and important
features while the latter aims at projecting the existing features to a lower dimen-
sional space. This further simplifies the data exploratory analysis step wherein the
visualization can be done in the form of plots to study or analyze the data prior to
modeling or hypothesis testing (He et al. 2017). In the next step the ML model is chosen, trained on the training dataset, and evaluated using the test set. The evaluation results can be improved further by tuning the hyperparameters associated with the ML model; the hyperparameters are structural/architectural parameters, for example the number of estimators or the tree depth for a random forest model, as illustrated in the sketch below. The final learned and tuned model should have the generalization ability to predict for any kind of input sample.
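The following is a minimal sketch of this workflow, assuming that a feature matrix X and a label vector y have already been derived from the NGS data: scaling, a simple feature-selection step, and a random forest are chained in a scikit-learn pipeline, and the hyperparameters mentioned above (number of estimators and tree depth) are tuned by cross-validated grid search. The names, grid values, and the choice of k are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def tune_random_forest(X, y):
    """Preprocess, select features, and tune a random forest on (X, y)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.25, random_state=0)

    pipe = Pipeline([
        ("scale", StandardScaler()),               # bring features onto a comparable range
        ("select", SelectKBest(f_classif, k=20)),  # simple feature selection (k <= no. of features)
        ("rf", RandomForestClassifier(random_state=0)),
    ])

    # Hyperparameter grid: number of trees and tree depth, as discussed above
    grid = {
        "rf__n_estimators": [100, 300],
        "rf__max_depth": [None, 10, 20],
    }
    search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc", n_jobs=-1)
    search.fit(X_train, y_train)

    # The held-out test set estimates the generalization ability of the tuned model
    return search.best_params_, search.score(X_test, y_test)
```

Keeping the scaler and feature selector inside the pipeline ensures they are fitted only on each training fold, avoiding information leakage into the evaluation.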
The low cost and high throughput of NGS techniques have encouraged the application of genetic tests for studying and identifying variants and mutations associated with rare diseases and Mendelian disorders (Wadapurkar and Vyas 2018). This has also paved the way to diagnose and treat human genetic disorders on the basis of an individual's genome profile. Various computational tools are used for NGS data analysis across the preprocessing, sequence alignment, post-alignment processing, variant calling (including structural variants and copy number variants), and variant functional annotation stages. A list of tools used for NGS data analysis is shown in Table 14.3, and a minimal pipeline sketch follows below.
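As a hedged sketch of how a few of these stages fit together, the snippet below drives a minimal alignment and variant-calling workflow (BWA, SAMtools, BCFtools) from Python via subprocess. The command lines are standard usage of those tools, but the file paths are placeholders, the reference must already be indexed (bwa index, samtools faidx), the tools must be installed separately, and production pipelines (for example GATK-based ones, Table 14.3) add many more steps such as read trimming, duplicate marking, and recalibration.

```python
import subprocess

def mini_variant_pipeline(ref_fa, fq1, fq2, sample="sample"):
    """Align paired-end reads, sort/index the BAM, and call variants.
    Paths and the sample name are placeholders for illustration only."""
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf"

    # 1. Alignment with BWA-MEM, piped into coordinate sorting with SAMtools
    align = f"bwa mem {ref_fa} {fq1} {fq2} | samtools sort -o {bam} -"
    subprocess.run(align, shell=True, check=True)

    # 2. Index the sorted BAM so downstream tools can access it by region
    subprocess.run(["samtools", "index", bam], check=True)

    # 3. Pile up reads against the reference and call variants with BCFtools
    call = f"bcftools mpileup -f {ref_fa} {bam} | bcftools call -mv -o {vcf}"
    subprocess.run(call, shell=True, check=True)
    return vcf

# Hypothetical usage:
# mini_variant_pipeline("ref.fa", "reads_1.fastq", "reads_2.fastq")
```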
Over the last few decades, NGS techniques and their applications have increased by leaps and bounds (Levy and Myers 2016). Outputs have risen while sequencing costs per sample run have dropped by orders of magnitude. Most sequencing platform companies have spent the past few years mainly focusing on improving accessibility and user-friendliness. Illumina's sequencing systems such as the HiSeq (Illumina 2015), MiSeq (Schirmer et al. 2015), and NextSeq all make use of reagent cartridges and reduce the hands-on time for sample library preparation. The Ion Torrent sequencing platforms have been observed to
Table 14.3 List of tools used for NGS data analyses, all URLs accessed on September 27, 2020
Tools | Function | Category | References/URL
FastQC | QC report for high-throughput sequencing data | Quality check | https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FASTX toolkit | Performs some of the preprocessing tasks such as conversion from FASTQ to FASTA format, renaming the sequence identifiers in FASTA/FASTQ files, collapsing identical sequences in FASTA/FASTQ files into a single sequence, etc. | Preprocessing | http://hannonlab.cshl.edu/fastx_toolkit/index.html
Cutadapt/sickle/PRINSEQ/Trimmomatic | Removal of adapters and low-quality reads | Preprocessing | Martin (2011), Joshi and Fass (2011), Bolger et al. (2014); http://prinseq.sourceforge.net/index.html
BWA | A software package based on the Burrows-Wheeler transform algorithm; fast and efficient for short and long reads | Sequence alignment | Li and Durbin (2009)
mrFAST | Micro-read fast alignment search tool which maps short reads from the Illumina platform | Sequence alignment | http://mrfast.sourceforge.net/
Bowtie | Memory-efficient short-read aligner, faster than BWA | Sequence alignment | Langmead et al. (2009)
HISAT2 | A graph-based alignment algorithm enabling fast and sensitive alignment | Sequence alignment | http://daehwankimlab.github.io/hisat2/
ABySS | De novo assembler | Sequence alignment | Simpson et al. (2009)
SAMtools | Allows manipulation of alignments in the SAM (sequence alignment/map) format | Post-alignment processing | http://samtools.sourceforge.net/
VarScan | Variant detection | Post-alignment processing | http://varscan.sourceforge.net/
Genome Analysis Toolkit (GATK) | Suite of tools including depth-of-coverage analyzers, unified genotype inference, and haplotype mapping | Post-alignment processing | https://gatk.broadinstitute.org/hc/en-us
be more difficult to operate than Illumina platforms. Nevertheless, the Ion-S5 system
from Thermo Fisher Scientific has been explicitly engineered to shorten the entire
sequencing methodology, from sample library preparation to data production and
interpretation (Quail et al. 2012). Looking at such improvements, higher output, reduced sequencing costs, and greater accessibility, one might imagine that all barriers to progress have been removed. But the hard work has just begun, and various challenges for NGS remain to be resolved. One of the major challenges is data storage: NGS generates a huge amount of data, and storing it can be a herculean task. For every sample, the raw files run to gigabytes depending on the application, which makes the process cumbersome. For example, the raw reads for whole genome sequencing can reach 250 GB, whereas for deep RNA-seq the raw reads in fastq format range between 20 and 25 GB. After the initial data are obtained and filtered, they can be streamlined for downstream analysis, which requires alignment to the reference genome or transcriptome of the organism of origin (Langmead et al. 2009).
This is the most time-consuming step of the analysis and requires different algorithms, which poses another problem: choosing an appropriate algorithm or tool from the existing set is not easy. Important criteria to consider when choosing an algorithm are its performance reported in publications or in in-house benchmarking studies. Another important point for NGS analysis is the smooth and fast functioning of the workflows and instruments involved in the analysis. Data analysis itself poses a further challenge, since multiple software packages are available and choosing the right one for a given need is not straightforward.
RNA-seq has revolutionized our understanding of the entire transcriptome to better
analyze differential gene expression in different experimental groups (Trapnell et al.
2010). Each RNA-seq experiment basically consists of several steps such as experi-
mental design, mapping short reads, quality control, estimating transcript abun-
dance, and analyzing differential expression. All these steps have their own challenges (Waern et al. 2011). For example, proper normalization of read counts is required for estimating transcript abundance: owing to RNA fragmentation, longer transcripts generate more reads than shorter transcripts present at the same abundance in the sample, which distorts the data analysis.
Another important point to consider during RNA-seq is non-uniformity of coverage
Application of NGS techniques has allowed researchers to generate large amounts of genomic, transcriptomic, and metagenomic data in an efficient manner. However, only a small fraction of these data has been applied for clinical and diagnostic purposes so far. Making sense of the remaining, as yet unexploited data can be managed by introducing novel and powerful NGS approaches and algorithms. The potential that NGS holds for diagnosing disorders and for clinical implementation is enormous and could be further enhanced by careful NGS data analysis. Moreover, expeditious use of diverse methods for big data analysis might improve patient diagnosis and treatment. Computational diagnostic methods such as machine and deep learning algorithms are more consistent, less biased, and less prone to human error, and utilizing machine learning algorithms as diagnostic modeling tools is advantageous in clinical practice. However, such models must be properly trained on biological systems to provide sound insights into disease mechanisms. In this manner, the modeled algorithms can be used in accordance with the analyzed disorder and concurrently trained and adjusted as the disease evolves.
References
Abbasi S, Masoumi S (2020) Next-generation sequencing (NGS). Int J Adv Sci Technol. https://
doi.org/10.1007/978-3-662-49054-9_3542-1
Abdellah Z, Ahmadi A, Ahmed S et al (2004) Finishing the euchromatic sequence of the human
genome. Nature 431:931–945. https://doi.org/10.1038/nature03001
Altshuler DM, Durbin RM, Abecasis GR et al (2012) An integrated map of genetic variation from
1,092 human genomes. Nature 491:56–65. https://doi.org/10.1038/nature11632
Ambardar S, Gupta R, Trakroo D et al (2016) High throughput sequencing: an overview of
sequencing chemistry. Indian J Microbiol 56:394–404
Ansorge WJ (2009) Next-generation DNA sequencing techniques. N Biotechnol 25:195–203
Arts P, Simons A, AlZahrani MS et al (2019) Exome sequencing in routine diagnostics: a generic
test for 254 patients with primary immunodeficiencies. Genome Med 11:38. https://doi.org/10.
1186/s13073-019-0649-3
Goff LA, Trapnell C, Kelley D (2012) CummeRbund: visualization and exploration of cufflinks
high-throughput sequencing data. R Packag version
Goldberg B, Sichtig H, Geyer C et al (2015) Making the leap from research laboratory to clinic:
challenges and opportunities for next-generation sequencing in infectious disease diagnostics.
MBio 6:e01888. https://doi.org/10.1128/mBio.01888-15
Gracia A, González S, Robles V, Menasalvas E (2014) A methodology to compare dimensionality
reduction algorithms in terms of loss of quality. Inf Sci (Ny) 270:1–27. https://doi.org/10.1016/j.
ins.2014.02.068
Greiner M, Pfeiffer D, Smith RD (2000) Principles and practical application of the receiver-
operating characteristic analysis for diagnostic tests. Prev Vet Med 45:23–41. https://doi.org/
10.1016/S0167-5877(00)00115-X
Gupta S, Chatterjee S, Mukherjee A, Mutsuddi M (2017) Whole exome sequencing: uncovering
causal genetic variants for ocular diseases. Exp Eye Res 164:139–150
Gupta S, Gupta N, Tiwari P et al (2018) Lnc-EPB41-protein interactions associated with congenital
pouch colon. Biomol Ther 8:95. https://doi.org/10.3390/biom8030095
Gupta A, Shukla N, Nehra M et al (2020) A pilot study on the whole exome sequencing of prostate
cancer in the Indian phenotype reveals distinct polymorphisms. Front Genet 11:874. https://doi.
org/10.3389/fgene.2020.00874
Hämäläinen J, Jauhiainen S, Kärkkäinen T (2017) Comparison of internal clustering validation
indices for prototype-based clustering. Algorithms 10:105. https://doi.org/10.3390/a10030105
He KY, Ge D, He MM (2017) Big data analytics for genomic medicine. Int J Mol Sci 18:412
Head SR, Kiyomi Komori H, LaMere SA et al (2014) Library construction for next-generation
sequencing: overviews and challenges. Biotechniques 56:61–77. https://doi.org/10.2144/
000114133
Heather JM, Chain B (2016) The sequence of sequencers: the history of sequencing DNA.
Genomics 107:1–8
Heyer R, Schallert K, Zoun R et al (2017) Challenges and perspectives of metaproteomic data
analysis. J Biotechnol 261:24–36
Hui AWH, Lau HW, Chan THT, Tsui SKW (2013) The human microbiota: a new direction in the
investigation of thoracic diseases. J Thorac Dis 5:127–131
Hyatt D, Chen GL, LoCascio PF et al (2010) Prodigal: prokaryotic gene recognition and translation
initiation site identification. BMC Bioinformatics 11:119. https://doi.org/10.1186/1471-2105-
11-119
Illumina (2015) HiSeq 3000/HiSeq 4000 sequencing systems. In: Illumina
Jamuar SS, Tan EC (2015) Clinical application of next-generation sequencing for Mendelian
diseases. Hum Genomics 9:10. https://doi.org/10.1186/s40246-015-0031-5
Jeste SS, Geschwind DH (2014) Disentangling the heterogeneity of autism spectrum disorder
through genetic findings. Nat Rev Neurol 10:74–81. https://doi.org/10.1038/nrneurol.2013.278
Joshi N, Fass J (2011) Sickle: a sliding-window, adaptive, quality-based trimming tool for FastQ
files (Version 1.33) [Software]. https://github.com/najoshi/sickle
Korbel JO, Abyzov A, Mu XJ et al (2009) PEMer: a computational framework with simulation-
based error models for inferring genomic structural variants from massive paired-end sequenc-
ing data. Genome Biol 10:23. https://doi.org/10.1186/gb-2009-10-2-r23
Lam HYK, Clark MJ, Chen R et al (2012) Performance comparison of whole-genome sequencing
platforms. Nat Biotechnol 30:78–82. https://doi.org/10.1038/nbt.2065
Lam KN, Cheng J, Engel K et al (2015) Current and future resources for functional metagenomics.
Front Microbiol 6:1196. https://doi.org/10.3389/fmicb.2015.01196
Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of
short DNA sequences to the human genome. Genome Biol 10:25. https://doi.org/10.1186/gb-
2009-10-3-r25
Lek M, Karczewski KJ, Minikel EV et al (2016) Analysis of protein-coding genetic variation in
60,706 humans. Nature 536:285–291. https://doi.org/10.1038/nature19057
Levy SE, Myers RM (2016) Advancements in next-generation sequencing. Annu Rev Genomics
Hum Genet 17:95–115
Li H, Durbin R (2009) Fast and accurate short read alignment with burrows-wheeler transform.
Bioinformatics 25(14):1754–1760. https://doi.org/10.1093/bioinformatics/btp324
Love M, Anders S, Huber W (2017) Analyzing RNA-seq data with DESeq2. Bioconductor
Luthra R, Chen H, Roy-Chowdhuri S, Singh RR (2015) Next-generation sequencing in clinical
molecular diagnostics of cancer: advantages and challenges. Cancers (Basel) 7:14
Mai NTH, Phu NH, Nhu LNT et al (2017) Central nervous system infection diagnosis by next-
generation sequencing: a glimpse into the future? Open Forum Infect Dis 4:046. https://doi.org/
10.1093/ofid/ofx046
Marceddu G, Dallavilla T, Guerri G et al (2019) Analysis of machine learning algorithms as
integrative tools for validation of next generation sequencing data. Eur Rev Med Pharmacol
Sci 23:8139. https://doi.org/10.26355/eurrev_201909_19034
Maron PA, Ranjard L, Mougel C, Lemanceau P (2007) Metaproteomics: a new approach for
studying functional microbial ecology. Microb Ecol 53:486–493
Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads.
EMBnet J 17:10. https://doi.org/10.14806/ej.17.1.200
Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet 12:671
Mathur P, Medicherla KM, Chaudhary S et al (2018) Whole exome sequencing reveals rare variants
linked to congenital pouch colon. Sci Rep 8:6646. https://doi.org/10.1038/s41598-018-24967-y
Meena N, Mathur P, Medicherla K, Suravajhala P (2018) A bioinformatics pipeline for whole
exome sequencing: overview of the processing and steps from raw data to downstream analysis.
Bio-Protocol 8:e2805. https://doi.org/10.21769/BioProtoc.2805
Metzker ML (2010) Sequencing technologies the next generation. Nat Rev Genet 11:31–46
Morgan JL, Darling AE, Eisen JA (2010) Metagenomic sequencing of an in vitro-simulated
microbial community. PLoS One 5:e10209. https://doi.org/10.1371/journal.pone.0010209
Moriya Y, Itoh M, Okuda S et al (2007) KAAS: an automatic genome annotation and pathway
reconstruction server. Nucleic Acids Res 35:2. https://doi.org/10.1093/nar/gkm321
Mueller JJ, Schlappe BA, Kumar R et al (2018) Massively parallel sequencing analysis of mucinous
ovarian carcinomas: genomic profiling and differential diagnoses. Gynecol Oncol 150:127–135.
https://doi.org/10.1016/j.ygyno.2018.05.008
Nagalakshmi U, Waern K, Snyder M (2010) RNA-seq: a method for comprehensive transcriptome
analysis. Curr Protoc Mol Biol 89:4.11.1–4.11.13
Neveling K, Feenstra I, Gilissen C et al (2013) A post-hoc comparison of the utility of sanger
sequencing and exome sequencing for the diagnosis of heterogeneous diseases. Hum Mutat
34:1721–1726. https://doi.org/10.1002/humu.22450
Ng PC, Kirkness EF (2010) Whole genome sequencing. Methods Mol Biol 628:215–226
Pareek CS, Smoczynski R, Tretyn A (2011) Sequencing technologies and genome sequencing. J
Appl Genet 52:413–435
Parize P, Muth E, Richaud C et al (2017) Untargeted next-generation sequencing-based first-line
diagnosis of infection in immunocompromised adults: a multicentre, blinded, prospective study.
Clin Microbiol Infect 23:574. https://doi.org/10.1016/j.cmi.2017.02.006
Pettersson E, Lundeberg J, Ahmadian A (2009) Generations of sequencing technologies. Genomics
93:105–111. https://doi.org/10.1016/j.ygeno.2008.10.003
Quail MA, Smith M, Coupland P et al (2012) A tale of three next generation sequencing platforms:
comparison of ion torrent, pacific biosciences and illumina MiSeq sequencers. BMC Genomics
13:341. https://doi.org/10.1186/1471-2164-13-341
Rabbani B, Tekin M, Mahdieh N (2014) The promise of whole-exome sequencing in medical
genetics. J Hum Genet 59:5–15. https://doi.org/10.1038/jhg.2013.114
Rhoads A, Au KF (2015) PacBio sequencing and its applications. Genomics Proteomics Bioinfor-
matics 13:278–289
Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a bioconductor package for differential
expression analysis of digital gene expression data. Bioinformatics 26:139. https://doi.org/10.
1093/bioinformatics/btp616
Salzberg SL, Breitwieser FP, Kumar A et al (2016) Next-generation sequencing in neuropathologic
diagnosis of infections of the nervous system. Neurol - Neuroimmunol Neuroinflammation 3:
e251. https://doi.org/10.1212/NXI.0000000000000251
Sanders SJ, Neale BM, Huang H et al (2017) Whole genome sequencing in psychiatric disorders:
the WGSPD consortium. Nat Neurosci 20:1661–1668. https://doi.org/10.1038/s41593-017-
0017-9
Sarwar B, Karypis G, Konstan J, Riedl J (2000) Application of dimensionality reduction in
recommender system—a case study. ACM WebKDD 2000 Web Min ECommerce Work.
https://doi.org/10.3141/1625-22
Schirmer M, Ijaz UZ, D’Amore R et al (2015) Insight into biases and sequencing errors for
amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Res 43:37. https://doi.
org/10.1093/nar/gku1341
Schloss PD, Westcott SL, Ryabin T et al (2009) Introducing mothur: open-source, platform-
independent, community-supported software for describing and comparing microbial
communities. Appl Environ Microbiol 75:7537. https://doi.org/10.1128/AEM.01541-09
Schubach M, Re M, Robinson PN, Valentini G (2017) Imbalance-aware machine learning for
predicting rare and common disease-associated non-coding variants. Sci Rep 7:2959. https://doi.
org/10.1038/s41598-017-03011-5
Schuster SC (2008) Next-generation sequencing transforms today’s biology. Nat Methods 5:16–18
Shendure J (2005) Accurate multiplex polony sequencing of an evolved bacterial genome. Science
309:1728–1732. https://doi.org/10.1126/science.1117389
Silahtaroǧlu G (2009) An attribute-centre based decision tree classification algorithm. World Acad
Sci Eng Technol 36:11282
Simpson JT, Wong K, Jackman SD et al (2009) ABySS: a parallel assembler for short read
sequence data. Genome Res 19:1117–1123. https://doi.org/10.1101/gr.089532.108
Stratton MR, Campbell PJ, Futreal PA (2009) The cancer genome. Nature 458:719–724. https://doi.
org/10.1038/nature07943
Stray-Pedersen A, Sorte HS, Samarakoon P et al (2017) Primary immunodeficiency diseases:
genomic approaches delineate heterogeneous Mendelian disorders. J Allergy Clin Immunol
139:232–245. https://doi.org/10.1016/j.jaci.2016.05.042
Suravajhala P, Kogelman LJA, Kadarmideen HN (2016) Multi-omic data integration and analysis
using systems genomics approaches: methods and applications in animal production, health and
welfare. Genet Sel Evol 48:38. https://doi.org/10.1186/s12711-016-0217-x
Thermes C (2014) Ten years of next-generation sequencing technology. Trends Genet 30:418–426.
https://doi.org/10.1016/j.tig.2014.07.001
Thomas T, Gilbert J, Meyer F (2012) Metagenomics—a guide from sampling to data analysis.
Microb Inform Exp 2:3. https://doi.org/10.1186/2042-5783-2-3
Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq.
Bioinformatics 25:1105. https://doi.org/10.1093/bioinformatics/btp120
Trapnell C, Williams BA, Pertea G et al (2010) Transcript assembly and quantification by RNA-Seq
reveals unannotated transcripts and isoform switching during cell differentiation. Nat
Biotechnol 28:511–515. https://doi.org/10.1038/nbt.1621
Tringe SG (2005) Comparative metagenomics of microbial communities. Science 308:554–557.
https://doi.org/10.1126/science.1107851
Tripathi R, Sharma P, Chakraborty P, Varadwaj PK (2016) Next-generation sequencing revolution
through big data analytics. Front Life Sci 9:119–149. https://doi.org/10.1080/21553769.2016.
1178180
van den Akker J, Mishne G, Zimmer AD, Zhou AY (2018) A machine learning model to determine
the accuracy of variant calls in capture-based next generation sequencing. BMC Genomics
19:263. https://doi.org/10.1186/s12864-018-4659-0
Van den Veyver IB, Eng CM (2015) Genome-wide sequencing for prenatal detection of fetal single-
gene disorders. Cold Spring Harb Perspect Med 5:23077. https://doi.org/10.1101/cshperspect.
a023077
Vendramin L, Campello RJGB, Hruschka ER (2010) Relative clustering validity criteria: a com-
parative overview. Stat Anal Data Min 3:209. https://doi.org/10.1002/sam.10080
Wadapurkar RM, Vyas R (2018) Computational analysis of next generation sequencing data and its
applications in clinical oncology. Informatics Med Unlocked 11:75–82. https://doi.org/10.1016/
j.imu.2018.05.003
Waern K, Nagalakshmi U, Snyder M (2011) RNA sequencing. Methods Mol Biol 3:209–235.
https://doi.org/10.1007/978-1-61779-173-4_8
Wandelt S, Rheinländer A, Bux M et al (2012) Data management challenges in next generation
sequencing. Datenbank-Spektrum 12:161–171. https://doi.org/10.1007/s13222-012-0098-2
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev
Genet 10:57–63
Wang K, Li M, Hakonarson H (2010) ANNOVAR: functional annotation of genetic variants from
high-throughput sequencing data. Nucleic Acids Res 38:164. https://doi.org/10.1093/nar/
gkq603
Weigelt B, Bi R, Kumar R et al (2018) The landscape of somatic genetic alterations in breast
cancers from ATM germline mutation carriers. JNCI J Natl Cancer Inst 110:1030–1034. https://
doi.org/10.1093/jnci/djy028
Wylie KM, Weinstock GM, Storch GA (2013) Virome genomics: a tool for defining the human
virome. Curr Opin Microbiol 16:479–484. https://doi.org/10.1016/j.mib.2013.04.006
Bioinformatics in Personalized Medicine
15
G. Sunil Krishnan, Amit Joshi, and Vikas Kaushik
Abstract
sex, racial, ethnic, genetic polymorphisms, and other ecological variables influence the immune response to a specific therapy.
Keywords
15.1 Introduction
create tools and procedures for predicting disease emergence, recurrence, progression, and response to treatment. Individualizing drug or vaccine therapy with the use of bioinformatics and pharmacogenomics tools has the prospect of transforming the health care system (Mancinelli et al. 2000). Whole-exome sequencing has proven to be a valuable method for discovering the genetic causes of rare and complex diseases (Gonzaga-Jauregui et al. 2012). The bioinformatics and genomics protocol in personalized medicine is summarized in Fig. 15.1.
Many drugs and vaccines fail to deliver an optimum response across broad populations. PM is a combinational approach to individual health care and is required to improve early disease diagnosis and treatment at the individual level. Each individual's pre- or post-disease clinical, genomic, and environmental information is unique. Genome-wide association studies have helped to identify genes important in serious adverse drug reactions (Daly and Day 2012). In the most recent decade, biomedical science has made numerous advances toward personalized medicine, including the Human Genome Project, the International HapMap Project, and genome-wide association studies (GWASs). Single nucleotide polymorphisms (SNPs) are currently recognized as the fundamental drivers of human genetic variation and are already a significant resource for mapping complex genetic traits. A great many DNA variants have been identified that are associated with diseases and traits. By joining these genetic associations with phenotypes and drug response, personalized medicine will tailor treatments to the patient's particular
genotype. Although whole genome sequences are not used in routine practice today, there are already numerous examples of personalized medicine in current practice. Figures 15.2 and 15.3 explain the various stages and steps of personalized medicine design, respectively. Chemotherapy prescriptions such as trastuzumab and imatinib target specific cancers, a targeted pharmacogenetic dosing algorithm is used for warfarin, and the frequency of adverse events is reduced by screening for vulnerable genotypes for drugs such as abacavir, carbamazepine, and clozapine.
Fig. 15.2 Stages of personalized medicine: targeted PM treatment, diagnosis of the individual disease condition, early detection of disease, prevention by behaviour, lifestyle, or medical means, and risk assessment of predisposition to disease
Personalized medicine is expected to benefit from combining genomic data with routine monitoring of physiological states by various high-throughput methods. Over the previous decade, improvements in instrument sensitivity, speed, accuracy, and throughput have been coupled with the development of technologies such as multiple reaction monitoring. Under the direction of the Human Proteome Organization, over 80% of the proteins predicted by the human genome have now been identified using either mass spectrometric or antibody-based procedures, and the remaining "missing proteins" are steadily being accounted for. Resources such as the Human MRM Atlas, a comprehensive resource intended to enable researchers to perform quantitative analysis of every human protein, are being created to facilitate reproducible transfer of quantitative assays between laboratories. Such developments and initiatives now enable both in-depth discovery and targeted/quantitative workflows, opening the way to the clinical diagnostic field. Coupled with this, the establishment of comprehensive databases and the development of powerful in silico methods are enabling effective data mining. In particular, this has enabled interactome studies that allow the identification of key signalling pathways pointing to potential new drug targets, although it is estimated that to date under 20% of the protein interactions in humans, not counting dynamic, tissue- or disease-specific interactions, have been identified (Chen et al. 2012).
Bioinformatics tools may provide better diagnoses at the genomic level, with earlier detection of disease and better-targeted therapy through efficient personalized medicine development. Omics analysis provides great assistance in the development of personalized medicine (Fig. 15.4). Bioinformatics tools help in diagnosis, intervention, drug development, therapy, and personalized vaccination. A personalized vaccine aims at optimized prevention of disease with minimized reactogenicity and side effects. Personalized vaccines are developed to take account of haplotypes and polymorphisms that can pose a risk of an adverse vaccine reaction. Variations in gender,
race, ethnicity, gene polymorphisms, and other environmental factors affect the immune response to a specific therapy.
The benefits of PM would be relevant in the management of rare and complex diseases by informing patients and care providers, accelerating research, and supporting necessary changes in policy and regulation. Recent bioinformatics research has been designing tools and experimental pipelines to investigate individual disease situations. Ongoing personalized medicine has been applied in patient care for cardiovascular diseases, Mendelian disorders, cancers, Kabuki syndrome, and genetically heterogeneous conditions (schizophrenia, sporadic autism and spectrum disorders) (Table 15.1).
The phases involved in genomics analysis (Fig. 15.5) are sequencing using next-generation approaches such as Illumina Solexa, 454 pyrosequencing, Ion Torrent, etc.; after sequencing, the genetic data sets are analyzed to detect epigenetic relationships and to determine the phylogenetic expression patterns involved, accumulating information in databases that can be used for personalized medication formulations.
Fig. 15.6 General Sequencing and bioinformatics tools used in the PM design
completely robotized where the client has no obstruction except for can assume the
part of “caretaker” to check the approval of the yields (results).
With the advancement of computational power, bioinformatics gained the potential to handle big data and to integrate large amounts of information much faster than they are produced, becoming a solution for applying high-throughput methods in clinical diagnostics and personalized medicine. For instance, several studies have shown that bioinformatics pipelines developed for the analysis of MALDI-ToF mass spectra can extract diagnostic information from urine, blood, and embryo culture media faster than it can be generated. In genomics, several bioinformatics analysis pipelines for NGS, RNA-seq, and microarrays have also been developed to extract diagnostic information from the sequencing of viruses, pathogenic microorganisms, and cancer biopsies. In addition, bioinformatics tools for processing “omics” data have also been successful in the discovery of novel drug targets for cancer treatment. Bioinformatics can further improve clinical laboratory efficiency and costs by saving time and human resources in the analysis and reporting to clinics and patients. This can be done by developing analysis pipelines with automated reporting and APIs fully dedicated to providing real-time online access, facilitating communication between laboratories, clinicians, and patients. Also, patient historical data and metadata should be kept secure and organized in a structured manner (data “warehouses”) so that they can be pulled efficiently into bioinformatics pipelines. This would allow going beyond integrative analysis of patients by having their data as a function of time, permitting more personalized monitoring of patient diagnostics and better prognostics.
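As an illustration only of the kind of automated-reporting service described above, the following minimal sketch uses the Flask web framework; the endpoint path, the report dictionary and the patient identifier are invented for the example and do not reflect any existing clinical system.

```python
# Illustrative sketch of a minimal reporting API of the kind described above.
# Endpoint name, report structure and data are invented for the example; a
# real clinical service would add authentication, audit logging and secure storage.
from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder "warehouse": report results keyed by an anonymised patient ID.
REPORTS = {
    "patient-001": {"pipeline": "NGS-germline", "status": "complete",
                    "variants_flagged": 3},
}

@app.route("/reports/<patient_id>")
def get_report(patient_id: str):
    """Return the latest pipeline report for a patient, if available."""
    report = REPORTS.get(patient_id)
    if report is None:
        return jsonify({"error": "no report found"}), 404
    return jsonify(report)

if __name__ == "__main__":
    app.run(port=5000)
```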
Tools with direct relevance to personalized medicine
15.7 Conclusions
References
Adhikari UK, Tayebi M, Rahman MM (2018) Immunoinformatics approach for epitope-based
peptide vaccine design and active site prediction against polyprotein of emerging oropouche
virus. J Immunol Res 2018:1–22
Akhtar N, Joshi A, Singh B, Kaushik V (2020) Immuno-informatics quest against COVID-19/
SARS-COV-2: determining putative T-cell epitopes for vaccine prediction. Infect Disord Drug
Targets 20:32957905. https://doi.org/10.2174/1871526520666200921154149
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J
Mol Biol 215(3):403–410
Atsaves V, Leventaki V, Rassidakis GZ, Claret FX (2019) AP-1 transcription factors as regulators
of immune responses in cancer. Cancer 11(7):1037
Auladell M, Jia X, Hensen L, Chua B, Fox A, Nguyen TH, Kedzierska K (2019) Recalling the
future: immunological memory toward unpredictable influenza viruses. Front Immunol 10
Brodin P, Davis MM (2017) Human immune system variation. Nat Rev Immunol 17(1):21–29.
https://doi.org/10.1038/nri.2016.125
Chandra H, Yadav JS (2016) Human leukocyte antigen (HLA)-binding epitopes dataset for the
newly identified T-cell antigens of mycobacterium immunogenum. Data Brief 8:1069
Chen R, Mias GI, Li-Pook-Than J, Jiang L, Lam HY, Chen R et al (2012) Personal omics profiling
reveals dynamic molecular and medical phenotypes. Cell 148(6):1293–1307
Clemens EB, Van de Sandt C, Wong SS, Wakim LM, Valkenburg SA (2018) Harnessing the power
of T cells: the promising hope for a universal influenza vaccine. Vaccine 6(2):18
Cotugno N, Ruggiero A, Santilli V, Manno EC, Rocca S, Zicari S, Amodio D, Colucci M, Rossi P,
Levy O, Martinon-Torres F, Pollard AJ, Palma P (2019) OMIC technologies and vaccine
development: from the identification of vulnerable individuals to the formulation of invulnera-
ble vaccines. J Immunol Res 2019:8732191. https://doi.org/10.1155/2019/8732191
Daly AK, Day CP (2012) Genetic association studies in drug-induced liver injury. Drug Metab Rev
44(1):116–126
Dodin G (2017) Personal genomics: new concepts for future community data banks. bioRxiv.
https://doi.org/10.1101/230516
Fink AL, Engle K, Ursin RL, Tang WY, Klein SL (2018) Biological sex affects vaccine efficacy
and protection against influenza in mice. Proc Natl Acad Sci 115(49):12477–12482
Fischinger S, Boudreau CM, Butler AL, Streeck H, Alter G (2019) Sex differences in vaccine-
induced humoral immunity. In: Seminars in Immunopathology. Springer, Berlin Heidelberg, pp
239–249
Genomes Project Consortium (2010) A map of human genome variation from population-scale
sequencing. Nature 467(7319):1061
Gfeller D, Bassani-Sternberg M (2018) Predicting antigen presentation—what could we learn from
a million peptides? Front Immunol 9:1716
Gonzaga-Jauregui C, Lupski JR, Gibbs RA (2012) Human genome sequencing in health and
disease. Annu Rev Med 63:35–61
Gunawardena S, Karunaweera ND (2015) Advances in genetics and genomics: use and limitations
in achieving malaria elimination goals. Pathogens Global Health 109(3):123–141. https://doi.
org/10.1179/2047773215Y.0000000015
Hossain MU, Omar TM, Oany AR, Kibria KK, Shibly AZ, Moniruzzaman M, Islam MM (2018)
Design of peptide-based epitope vaccine and further binding site scrutiny led to groundswell in
drug discovery against Lassa virus. 3 Biotech 8(2):81
Joshi A, Kaushik V (2020) In-silico proteomic exploratory quest: crafting T-cell epitope vaccine
against Whipple’s disease. Int J Pep Res Therapeut 18:1–11. https://doi.org/10.1007/s10989-
020-10077-9
Joshi A, Joshi BC, Mannan MA, Kaushik V (2020) Epitope based vaccine prediction for SARS-
COV-2 by deploying immuno-informatics approach. Inform Med 19:100338. https://doi.org/10.
1016/j.imu.2020.100338
Abstract
E. Emmanuel Nweke
Department of Surgery, University of Witwatersrand, Johannesburg, South Africa
E. Suleman
Veterinary Molecular Diagnostics and Vaccines, Medical Devices and Diagnostics Impact Area,
CSIR, Pretoria, South Africa
M. Du Plessis
Bioinformatics and Comparative Genomics, South African National Biodiversity Institute, Pretoria,
Gauteng, South Africa
D. B. Thimiri Govinda Raj (*)
Synthetic Nanobiotechnology and Biomachines, Centre for Synthetic Biology and Precision
Medicine, CSIR, Pretoria, South Africa
Biotechnology Innovation Centre, Rhodes University, Grahamstown, South Africa
Department of Clinical Medicine, WITS University, Johannesburg, South Africa
University of Pretoria, Pretoria, South Africa
e-mail: [email protected]
Keyword
16.1 Introduction
Computational biology has played a key role in recent advancements in biological sciences and biotechnology research (Mulder et al. 2016). Bio-IT has supported rapid progress across all these fields owing to its ability to store and analyse large datasets. This has enabled the use of computational approaches across technology innovation and development. Some recent applications of computational biology in biological sciences research and development include the Human Genome Project, the Human Epigenome Project, the Cancer Genome Atlas and the Human Microbiome Project (Hood and Rowen 2013). With advancements in data analytics, high-performance statistical analysis, blockchain analysis and artificial intelligence, computational biology has empowered and enabled research innovation at a higher level (Özdemir et al. 2017). In particular, pharmaceutical companies are applying computational biology tools such as biostatistics, artificial intelligence and pharmacogenomics in their drug discovery and development pipelines. During the recent COVID-19 pandemic, rapid progress in decoding the structural and functional relevance of the SARS-CoV-2 proteins was achieved through high-performance cloud-computing-based cryo-EM studies (Wintjens et al. 2020).
One example of computational biology applications in the microbial synthetic biology research space is where Govindaraj et al. designed and constructed a semi-synthetic baculovirus genome called SynBac using synthetic biology techniques such as homologous recombination and Cre-loxP. SynBac is currently sold on the market by Geneva Biotech, an EMBL spin-out company (SynBac™ 2020). SynBac was built based on computational analysis, gene annotation and the genome map of Autographa californica multiple nucleopolyhedrovirus (AcMNPV) (Berger and Raj 2014). Briefly, gene annotation and comparative genome analysis were performed on baculovirus genomes across the various genome families. Using the genome annotation and data mining on the functional roles of the baculovirus genes, Govindaraj et al. were able to delineate the essential and non-essential genes in the AcMNPV genome (Vijayachandran et al. 2013). Further, they used cutting-edge synthetic biology technologies such as homologous recombination and Cre-loxP to rewire the AcMNPV genome and invented SynBac (SynBac 1.0), which is currently applied for recombinant protein production (FEBS-EMBO Conference proceedings 2014). Similarly, several teams are using cutting-edge bioinformatics technologies to assist with the development of synthetic microbes for biotechnology and biological applications such as cancer biology. For example, the Synthetic Nanobiotechnology and Biomachines team at CSIR, South Africa is currently establishing a bioinformatics-driven synthetic biology approach to produce synthetic microbes for recombinant protein production at industrial scale. Another application of computational biology is in precision medicine research, where statistics,
artificial intelligence and data analysis play key roles in the clinical analysis of various patient cohorts. Some of the computational biology applications in precision medicine include the use of computational science in genomics, RNA-seq analysis, proteomics, pharmacogenomics and drug sensitivity screening. In this regard, a cancer drug sensitivity screening platform that combines a drug repurposing platform with computational analyses such as the drug sensitivity score (DSS) and automation has been established at CSIR (Raj et al. 2018a, b).
In this book chapter, we present a focused review of the gene and genome annotation analysis of microorganisms along with its applications in synthetic biology and cancer biology. Our review is organised into several subsections covering the current status of gene annotation (automated and manual) in microbes, gene features of the prokaryotic genome, comparative analysis of gene annotation between viruses, bacteria and other microbes, gene ontology along with community annotation in cancer biology, the application of microbial genomics in cancer biology and the application of synthetic microbes for cancer treatment. Finally, we provide an in-depth future perspective on the applications of genome annotation of microbes for synthetic biology and cancer biology, both in applied research and in the clinical setting.
taxa, whilst others rather focus on various biological categories, which are broadly
termed as subject-specific databases.
Given the advances in, and decreasing costs of, next-generation sequencing, many of the gene annotation algorithms have now been incorporated as part of genome annotation workflows. Whilst they are developed with genome-scale data in mind, the principles around annotating individual genes or shorter gene regions remain valid. As such, an aspect of the annotation pipeline involves the annotation of all features of a genome, including the genes. The most widely used tools in this context are PGAP (Tatusova et al. 2016) and Prokka (Seemann 2014), as observed from genome publications in current journals. As well as serving the role of genome annotation tools, they also serve as gene annotation tools, depending on the length of the input sequence.
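As a hedged illustration of how such a genome annotation tool is typically driven, the sketch below calls Prokka from Python. The input FASTA path, output directory and prefix are placeholders, and the options shown (--outdir, --prefix, --kingdom) are commonly documented Prokka flags; the exact invocation should be checked against the installed version.

```python
# Hedged sketch: calling Prokka from Python to annotate an assembled genome.
# Input/output names are placeholders; consult `prokka --help` for your version.
import subprocess

def run_prokka(contigs_fasta: str, outdir: str, prefix: str) -> None:
    cmd = [
        "prokka",
        "--outdir", outdir,      # directory for annotation outputs
        "--prefix", prefix,      # base name for output files
        "--kingdom", "Bacteria", # annotation mode for bacterial genomes
        contigs_fasta,
    ]
    subprocess.run(cmd, check=True)  # raises CalledProcessError if Prokka fails

if __name__ == "__main__":
    run_prokka("contigs.fa", "prokka_out", "mygenome")
```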
Often the genome annotation workflows have a version of the most recognisable gene annotation tools integrated in their workflows. In terms of frequency of usage, the most utilised gene annotation tools are Glimmer, GeneMark and Prodigal. Glimmer and GeneMark use various Markov models, whilst Prodigal, in turn, uses a log-likelihood function, which performs particularly well with high-GC-content genomes. Collectively, these tools are capable of identifying about 97% of genes (Hyatt et al. 2010; Lomsadze et al. 2018) in a general annotation setting. When looking at the manual curation environment, again, not much has changed in terms of the general approach. The most significant advance in manual annotation has been through utilising more annotators in the same space of time, or in the same sample annotation environment. This relates to decreasing the annotation time for manual curation by including more physical annotators. In addition, it increases the accuracy of the annotation as it represents several layers of manual checking. The approach has been validated in a study by Rödelsperger et al. (2019). The work clearly showed that community/crowd participation is a viable approach for improved manual curation. It furthermore serves to grow the community and its skill set, although a significant training component is required in order to promote uniformity.
genome of 4.64 Mb that codes for approximately 4400 genes (Blattner et al. 1997).
In contrast, Mycoplasma genitalium has a genome size of only 0.58 Mb and only
500 genes (Fraser et al. 1997). Some prokaryotes may have genomes with multiple
circular molecules (e.g. Vibrio cholerae and Deinococcus radiodurans (Heidelberg
et al. 2000; White et al. 1999)) or a combination of circular and linear molecules (e.g. Borrelia burgdorferi (Fraser et al. 1997)). Prokaryotes may also have
additional genetic material on small, circular (or linear) DNA molecules called
plasmids, that are independent of the larger genome and have genes that confer
additional properties (e.g. antibiotic resistance, ability to use other nutrients as
carbon sources, etc.). Plasmids are not essential for prokaryotic survival but they
are very beneficial and can be transferred from one prokaryote to another via
horizontal gene transfer.
Replication of prokaryotic genomes occurs at a highly conserved sequence
known as the origin of replication (oriC). Prokaryotes with circular chromosomes have a single origin site, while linear chromosomes generally have an origin site in
the middle of the chromosome. Bacterial oriC sites can be very diverse in size
(250–2000 bp), sequence composition and organisation depending on species
(Mackiewicz et al. 2004). Replication in both circular and linear bacterial genomes
generally occurs bidirectionally from the oriC site, but termination of replication in
linear chromosomes requires telomeres (multiple tandem repeats of noncoding
nucleotide sequences at the ends of the linear chromosome) to protect the ends of
the chromosome from gradual degradation since the replication enzymes are unable
to synthesise new DNA at the ends of the linear chromosome. The oriC site and
associated sequences are very useful for annotation and assembly of prokaryotic
genomes generated by Next Generation Sequencing (NGS). For example, DoriC
(Luo and Gao 2019) was initially developed as a database of bacterial replication
origins (oriC), which were determined either experimentally or as predicted by
Ori-Finder (Raj et al. 2018a). Since the initial development and launch of DoriC in
2007, the database has undergone several upgrades and improvements. The current
version, DoriC 10, has 7580, 226 and 1209 oriCs of bacteria, archaea and plasmids,
respectively, and is the most complete and scalable database of prokaryotic replica-
tion origins (Luo and Gao 2019). This database facilitates enhanced understanding
of the structure and functions of prokaryotic replication origins with many
predictions verified experimentally in the laboratory (Luo and Gao 2019; Gao and
Zhang 2008).
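A simple property that many origin-prediction approaches exploit is the strand asymmetry in G versus C content around oriC. The sketch below is only a toy illustration of this GC-skew idea, not a re-implementation of Ori-Finder or DoriC; it reports the position of the cumulative skew minimum as a rough oriC candidate, and the example sequence is synthetic.

```python
# Minimal sketch: locate a candidate oriC via cumulative GC skew.
# Illustration of the general skew-based idea only; the sequence is a toy example.

def cumulative_gc_skew(genome: str):
    """Return the running sum of (G - C) along the genome."""
    skew, running = [0], 0
    for base in genome.upper():
        if base == "G":
            running += 1
        elif base == "C":
            running -= 1
        skew.append(running)
    return skew

def candidate_oric(genome: str) -> int:
    """The minimum of the cumulative skew often lies near oriC."""
    skew = cumulative_gc_skew(genome)
    return skew.index(min(skew))

if __name__ == "__main__":
    toy_genome = "CCCCGGGGCCCCGGGGGGGG"   # placeholder sequence
    print("candidate oriC position:", candidate_oric(toy_genome))
```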
Generally, prokaryotic genes are clustered together as operons, i.e. genes that are required for a specific cellular or metabolic function are grouped together. Furthermore, aside from the coding sequences in an operon, there are additional sequences that are important for regulating the expression of genes within the operon. These include (1) a promoter region that is located at the start (5′ end) of the operon and contains sequences for initiation of transcription by RNA polymerase, (2) the open reading frames (ORFs) which contain the polycistronic gene sequences, i.e. the coding sequences for the individual genes comprising the operon, and (3) the terminator region located at the end (3′) of the operon which regulates termination of transcription. The first bacterial operon discovered was the lac operon, in E. coli, which contains all the genes required for utilising lactose as a carbon (energy) source (Jacob and Monod
1961). This groundbreaking discovery earned Francois Jacob and Jacques Monod the Nobel Prize in Physiology or Medicine in 1965. Several bioinformatics tools have been developed to identify prokaryotic operons, promoters and other regulatory sequences. More recently, Operon-mapper (http://biocomputo.ibt.unam.mx/operon_mapper/) became the first publicly available web server that predicts the operons of any bacterial or archaeal genome and requires only the genomic sequence as input data (Taboada et al. 2018). Another novel method for detecting and predicting operons, called Operon Hunter, uses visual representations of genomic fragments and a neural network architecture (i.e. machine learning) to achieve highly accurate predictions, outperforming other state-of-the-art tools (Assaf et al. 2020). Most of the current
bioinformatics tools for operon prediction require genomic sequence data. However,
the Rockhopper platform (https://cs.wellesley.edu/~btjaden/Rockhopper/) combines
genome sequencing data with RNA-seq data (which is becoming more prevalent) to
improve accuracy and specificity of computationally identified prokaryotic operons
(Tjaden 2020). Another useful tool developed for visualisation and analysis of
prokaryotic genomes, particularly identification of features such as analysis of
transcription factors (TFs), regulatory motifs (promoters, ribosome binding sites,
terminators), transcriptional regulation, etc. is BAC-Browser (Garanina et al. 2018).
BAC-Browser (http://smdb.rcpcm.org/tools/index.html) also incorporates a variety
of other free tools for primer design, visualisation and analysis.
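A common baseline idea behind operon prediction is to group neighbouring genes on the same strand whose intergenic gaps are small. The sketch below illustrates only this distance-based heuristic, not the algorithms used by Operon-mapper, Operon Hunter or Rockhopper; the gene coordinates and the 50 bp gap threshold are illustrative assumptions.

```python
# Minimal sketch of a distance-based operon grouping heuristic.
# Coordinates, strands and the gap threshold are illustrative assumptions.
from typing import List, Tuple

Gene = Tuple[str, int, int, str]  # (name, start, end, strand)

def group_operons(genes: List[Gene], max_gap: int = 50) -> List[List[str]]:
    """Group same-strand neighbouring genes separated by <= max_gap bp."""
    operons: List[List[str]] = []
    for name, start, end, strand in sorted(genes, key=lambda g: g[1]):
        if (operons
                and strand == prev_strand
                and start - prev_end <= max_gap):
            operons[-1].append(name)   # extend the current operon
        else:
            operons.append([name])     # start a new operon
        prev_end, prev_strand = end, strand
    return operons

if __name__ == "__main__":
    lac_like = [("lacZ", 100, 3100, "+"), ("lacY", 3120, 4400, "+"),
                ("lacA", 4420, 5000, "+"), ("lacI", 6000, 7100, "-")]
    print(group_operons(lac_like))  # [['lacZ', 'lacY', 'lacA'], ['lacI']]
```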
Prokaryotic genomes consist of double-stranded DNA molecules encoding numerous genes which could be transcribed from either the sense (5′ to 3′) or antisense (3′ to 5′) strand. The sequence between the promoter and terminator that contains no stop codons, and which can be transcribed into mRNA and subsequently translated into protein, is referred to as the open reading frame (ORF). Generally, ORFs begin with the start codon (ATG, encoding methionine) and end with one of three possible stop codons (TAA, TAG, TGA). Since genes may be encoded on either the sense or antisense strand and depending on the starting point chosen for a particular gene sequence, there are six possible reading frames for translating any DNA sequence into an amino acid sequence. Generally, lengthy ORFs are used, in
conjunction with other empirical or predicted evidence, to identify possible protein
coding sequences. There are many bioinformatics tools that have been developed to
address these challenges and to accurately identify the correct ORFs. One of the
most commonly used tools for prediction of prokaryotic ORFs is ORF Finder
(https://www.ncbi.nlm.nih.gov/orffinder/). However, metagenome sequencing
generates significant amounts of sequencing information, including from many
prokaryotic species that cannot be cultured. Most computational tools for finding
ORFs become computationally bottlenecked when used with metagenome data,
particularly when this data consists of unassembled reads. To address these
challenges, OrfM, a bioinformatics tool to identify ORFs in metagenomics sequence
data has recently been developed (Woodcroft et al. 2016). This tool is four to five
times faster than similar tools such as GetOrf (Tringe et al. 2005) and Translate
(unpublished, http://eddylab.org/software.html) without any decrease in accuracy.
This increase in performance significantly reduces the bottlenecks associated with
ORF identification from large datasets which are typical of metagenome studies.
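To make the six-reading-frame idea concrete, the following minimal sketch scans a sequence for ATG-to-stop ORFs on both strands. It is an illustration of the concept only, not the implementation of ORF Finder or OrfM, and the 300 nt minimum length is an arbitrary cut-off chosen for the example.

```python
# Minimal sketch of six-frame ORF scanning (illustrative only).
# The 300 nt minimum length is an arbitrary example cut-off.

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq: str, min_len: int = 300):
    """Yield (strand, frame, start, end) for ATG...stop ORFs in six frames."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i                      # open a candidate ORF
                elif codon in STOPS and start is not None:
                    if i + 3 - start >= min_len:
                        yield strand, frame, start, i + 3
                    start = None                   # close the candidate

if __name__ == "__main__":
    toy = "ATG" + "GCT" * 120 + "TAA"              # 366 nt toy ORF
    print(list(find_orfs(toy)))
```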
The advent of cheaper second generation/next generation sequencing, along with the availability of single-cell long-read technologies, has significantly changed the study of cancer cell community heterogeneity. Along with that, there has also been some development around digital cell identification technologies.
There are two processes that are currently typically used, from a physiological
perspective, to evaluate variability in cancer cell communities. In the first instance
researchers are looking at sectioning cancer tumours into different regions and then conducting sequencing at the regional scale. This approach is termed multi-region
sequencing.
The alternate approach, which allows for a finer-scale analysis, involves utilising a combination of single-cell sequencing with either long-read or transcriptome sequencing. However, the error rates currently associated with long-read technology (while improving) mean that transcriptome sequencing is leading the charge in terms of current applications. Ultimately, regardless of the approach, what is sought in the final analysis is to clearly define the differences in cell types and the abundances of the cell types (e.g. normal vs. cancerous, and within cancerous to capture the differences between those cells). This knowledge base in
turn feeds into understanding the drivers of tumour evolution.
The value of understanding the diversity in cancer cells, as defined above, is that
studies have now started revealing that there are correlations between the variations
and overall patient survival in relation to liver cancer. The study by Ma et al. (2019)
clearly demonstrates a correlation between transcriptomic diversity, which links to
genomic diversity, and predicts patient prognosis. This was achieved through a unique bioinformatics workflow which involved the use of the Seurat package (version 2.3.0) (Butler et al. 2018) in R (version 3.4.3). The computational tool takes in single-cell
RNA seq data, and allows for the identification of shared populations across varying
datasets. This provides input for modelling the overlaps and similarities versus the
differences of cancer cell signatures within the overall population. Outside of the
scope of sequence based approaches there have also been significant developments
in the implementation of digital cell identification technologies, which have found a
home in a discipline termed computational pathology. This is concerned with the
development of algorithms that facilitate the evaluation and analysis of digital
pathology images. The implementation of such an approach uses deep neural
network training models, which forms part of a semi-supervised approach for
successfully distinguishing different community cells (Javed et al. 2020). Ultimately
it translates into the construction of a library of digital imagery of cancerous cells
which feeds into a model in order to identify the cancerous cells in a novel setting, or
when scientists are exposed to classifying a novel dataset. In terms of advancing the
area of cancer research, both in the context of community variation and generally
understanding the disease, there is an obvious need for well-represented reference databases with a shared vocabulary that all researchers in the cancer/general biology space recognise. One of the older, yet still most relevant, resources relating
to ontologies remains the Gene Ontology database, which has been continuously updated to keep pace with the latest developments in the field of ontologies (Ashburner 2000; Gene Ontology Consortium 2019).
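As a hedged sketch of a single-cell clustering workflow analogous to the Seurat-based analysis described above, the steps below use the Python package Scanpy instead of Seurat; this substitution is an assumption made purely for illustration (it is not the workflow of Ma et al. 2019), and the input file name is a placeholder.

```python
# Hedged sketch of single-cell RNA-seq clustering with Scanpy (illustrative;
# analogous in spirit to a Seurat workflow, not the published analysis).
# "tumour_counts.h5ad" is a placeholder; sc.tl.leiden needs the leidenalg package.
import scanpy as sc

adata = sc.read_h5ad("tumour_counts.h5ad")          # cells x genes count matrix

sc.pp.filter_cells(adata, min_genes=200)            # basic quality filtering
sc.pp.normalize_total(adata, target_sum=1e4)        # library-size normalisation
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

sc.tl.pca(adata)                                    # dimensionality reduction
sc.pp.neighbors(adata)                              # k-nearest-neighbour graph
sc.tl.leiden(adata)                                 # graph-based clustering
sc.tl.umap(adata)                                   # 2-D embedding for plotting

print(adata.obs["leiden"].value_counts())           # cells per cluster
```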
An obvious development within the cancer research community has subsequently
been the development of ontologies with greater specificity in the field of cancer.
These represent a diverse array of areas, which is too broad to entirely define within
the scope of this work. As an example, there has been the development of ontologies
that speak to aspects such as hematologic malignancies (Serra et al. 2019), where the defining criteria use the classification of immunophenotypes as the basis for differentiation.
Using technologies such as 16S rRNA gene sequencing and high throughput array
panels, the elucidation of microbial genomics has been applied primarily in cancer
diagnosis, prognosis and in deciphering mechanisms of disease progression. Ele-
vated salivary microbiota (Prevotella melaninogenica, Streptococcus mitis and Capnocytophaga gingivalis) were found to be potential diagnostic indicators of oral cancer (Mager et al. 2005). Another study using 16S rRNA gene sequencing
showed that oral tumours had an increased abundance of Veillonella, Dialister, and
Streptococcus species (Guerrero-Preston et al. 2016). Börnigen et al. compared oral
bacteria of 121 oral cancer patients to 242 controls and found significant changes in
the microbial abundance and diversity between the two groups (Börnigen et al.
2017).
In lung cancer research, there has been several studies conducted to identify
microbial communities present in lung cancer patients utilising varying sample types
(Yan et al. 2015; Yang et al. 2018; Xu et al. 2020). Yan et al. showed through deep
sequencing of 20 lung cancer patient samples compared to ten healthy controls, the
significant abundance of Neisseria, Veillonella and Capnocytophaga in the tumour
samples (Yan et al. 2015). Consequently they proposed that these bacteria could be
potential diagnostic markers for lung cancer. In another study, 16S rRNA gene
sequencing was used to determine microbial abundance and diversity in 75 lung
cancer patients. They found that compared to 127 healthy individuals, this group of
non-smoking female cancer patients had increased levels of Blastomonas and
Sphingomonas. Furthermore, they showed that the expression of Napsin A, a well-
known immunohistochemical marker of lung adenocarcinoma, was positively
correlated to Blastomonas occurrence in the samples (Yang et al. 2018).
Metagenomic sequencing was also used to identify plausible bacterial biomarkers
for lung cancer in a study conducted by Cameron et al. The authors evaluated sputum
samples from ten suspected lung cancer patients and determined that Streptococcus
viridans was significantly present in the lung cancer positive patients compared to
those that were negative (Cameron et al. 2017). Bronchoalveolar fluid samples from
20 lung cancer patients and 8 benign diseased patients were analysed using 16S
rRNA sequencing. It was found that Veillonella and Megasphaera were significantly
abundant in lung cancer patients (Lee et al. 2016).
The role of the microbiome in the study of colorectal cancer pathogenesis is
evident (Saus et al. 2019; Song et al. 2020; Cho et al. 2014; Lin et al. 2019). 16S
rRNA gene sequencing and real-time polymerase chain reaction were applied to
faecal and mucosal samples obtained from colorectal cancer (CRC) patients. It was
observed that the abundance and diversity of microbiota, and the expression of
inflammation-associated genes, were distinct between CRC patients and healthy
individuals (Flemer et al. 2017). Another study utilised 16S rRNA gene sequencing
and gas-chromatography mass spectrometry to evaluate the microbiome and
metabolome, respectively, of faecal samples obtained from 50 CRC patients versus
50 healthy individuals. It was observed that 76 operational taxonomic units
differentiated the two groups (Yang et al. 2019a). Yu et al. conducted a
metagenome-wide association study utilising faecal samples obtained from Chinese CRC patients and 50 control samples and determined that Peptostreptococcus stomatis and Fusobacterium nucleatum were associated with CRC (Yu et al. 2017). The
alteration of intestinal microbiota was shown in 15 CRC patients, specifically
Fusobacterium, Selenomonas and Peptostreptococcus increased in abundance and diversity (Hibberd et al. 2017). The analysis of the oral microbiome was conducted
using 16S rRNA gene sequencing in a prospective cohort which included
participants from African-American and low-income groups. They observed that
Prevotella intermedia and Treponema denticola were associated with increased
CRC risk. Additionally, Bifidobacteriaceae was more abundant in CRC patients
compared to controls (Yang et al. 2019b).
In breast cancer patients, microbial dysbiosis has also been observed. One study
found that Methylobacterium radiotolerans was increased in tumours compared to
normal tissues (Xuan et al. 2014). Utilising a pan-pathogen array, distinct microbial
signature has been observed in triple negative breast cancer patients (Banerjee et al.
2015). The presence of distinct microbial signatures in breast tumours was also
confirmed by Hieken et al. (2016). Similar trends can also be observed in other
cancers such as pancreatic, oesophageal, gastric, head and neck cancers. A study
involving 361 pancreatic cancer (PC) patients and 371 controls showed that while Aggregatibacter actinomycetemcomitans and Porphyromonas gingivalis were associated with increased risk of PC, Fusobacteria was linked to decreased PC
risk (Fan et al. 2018). The widely studied Helicobacter pylori has been associated
with gastrointestinal carcinogenesis (Inamura 2020; Meng et al. 2015;
Trikudanathan et al. 2011). In gastric cancer, it promotes carcinogenesis by delivering the CagA protein into epithelial cells (Hatakeyama 2004). Coker et al. applied 16S rRNA gene sequencing to 81 gastric mucosal samples and found that as the disease progressed there was an abundance of Streptococcus anginosus, Peptostreptococcus stomatis, Parvimonas micra, Dialister pneumosintes and Slackia exigua, hinting at their potential roles in gastric cancer progression (Coker et al. 2018).
The oral microbiome was assessed in a prospective cohort of oesophageal
adenocarcinoma patients and Tannerella forsythia was found to be associated with
increased risk of the cancer (Peters et al. 2017). In the same study, it was also
Recently, synthetic biology has enabled cancer precision medicine, particularly through the application of synthetic microbes as enabling tools and technologies for cancer therapy (Courbet et al. 2016). Despite these recent developments, bacterial systems have been applied in cancer treatment for more than a century (Felgner et al. 2016). In this section of the book chapter, we focus on the historical relevance and current developments in applying synthetic microbes for cancer precision medicine, covering microbial applications for cancer treatment and how synthetic microbes have enabled cancer precision medicine and therapy.
For a long period in the past, a controversial approach was to use live bacteria for prophylactic vaccination and cancer therapy (Payette and Davis 2001). In the
studying the skin, nasal and vaginal microbiome in healthy and cancer patient cohorts in order to better understand the relationship between the microbiome and cancer (Hieken et al. 2016).
Furthermore, with a better understanding of the gut, skin, nasal and vaginal microbiomes, scientists are proposing tailor-made cancer treatments for individual patients. To elaborate, by using microbiome information there is a potential approach for applying cancer precision medicine in the clinical setting. With the recent advancement of next-generation sequencing and synthetic biology technologies, microbiome datasets have been integrated with cancer precision medicine in order to provide clinically relevant drug treatments for various patient cohorts. With a better understanding of the microbiome, there might be a potential approach to understand and address the problem of adverse drug reactions in African patient cohorts.
16.7 Conclusion
The above sections provide a brief summary of prokaryotic genome architecture and
more recently developed bioinformatics tools which are very useful for studying,
identifying and annotating various features of prokaryotic genomes. With the recent
developments of synthetic biology tools and technologies, several designer microbes
have been invented for various applications. As a future perspective, we should
focus on synthetic biology-driven designer microbes that enable improved cancer treatment and their application in cancer precision medicine.
References
Al-Hebshi NN, Nasher AT, Maryoud MY, Homeida HE, Chen T, Idris AM et al (2017) Inflamma-
tory bacteriome featuring Fusobacterium nucleatum and Pseudomonas aeruginosa identified in
association with oral squamous cell carcinoma. Sci Rep 7:1834. https://doi.org/10.1038/s41598-
017-02079-3
Ashburner M (2000) Gene Ontology: tool for the unification of biology. Nat Genet 25:25–29.
https://doi.org/10.1038/75556
Assaf R, Xia F, Stevens R (2020) Detecting operons in bacterial genomes via visual representation
learning. bioRxiv 2020:860221. https://doi.org/10.1101/860221
Banerjee S, Wei Z, Tan F, Peck KN, Shih N, Feldman M et al (2015) Distinct microbiological
signatures associated with triple negative breast cancer. Sci Rep 5:15162. https://doi.org/10.
1038/srep15162
Berger I, Raj DBTG (2014) Improved baculoviral expression system and methods of producing the
same. CA2898696A1. https://patents.google.com/patent/CA2898696A1/tr
Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M et al (1997) The complete
genome sequence of Escherichia coli K-12. Science 277:1453–1462. https://doi.org/10.1126/
science.277.5331.1453
Börnigen D, Ren B, Pickard R, Li J, Ozer E, Hartmann EM et al (2017) Alterations in oral bacterial
communities are associated with risk factors for oral and oropharyngeal cancer. Sci Rep
7:17686. https://doi.org/10.1038/s41598-017-17795-z
Hayes RB, Ahn J, Fan X, Peters BA, Ma Y, Yang L et al (2018) Association of oral microbiome
with risk for incident head and neck squamous cell cancer. JAMA Oncol 4:358–365. https://doi.
org/10.1001/jamaoncol.2017.4777
Heidelberg JF, Eisen JA, Nelson WC, Clayton RA, Gwinn ML, Dodson RJ et al (2000) DNA
sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406:477–483.
https://doi.org/10.1038/35020000
Hibberd AA, Lyra A, Ouwehand AC, Rolny P, Lindegren H, Cedgård L et al (2017) Intestinal
microbiota is altered in patients with colon cancer and modified by probiotic intervention. BMJ
Open Gastroenterol 4:e000145. https://doi.org/10.1136/bmjgast-2017-000145
Hieken TJ, Chen J, Hoskin TL, Walther-Antonio M, Johnson S, Ramaker S et al (2016) The
microbiome of aseptically collected human breast tissue in benign and malignant disease. Sci
Rep 6:30751. https://doi.org/10.1038/srep30751
Hoffman RM (2012) The preclinical discovery of bacterial therapy for the treatment of metastatic
cancer with unique advantages. Expert Opin Drug Discovery 7:73–83. https://doi.org/10.1517/
17460441.2012.644534
Hood L, Rowen L (2013) The human genome project: big science transforms biology and medicine.
Genome Med 5:79. https://doi.org/10.1186/gm483
Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ (2010) Prodigal: prokaryotic
gene recognition and translation initiation site identification. BMC Bioinfo 11:119. https://doi.
org/10.1186/1471-2105-11-119
Inamura K (2020) Gut microbiota contributes towards immunomodulation against cancer: new
frontiers in precision cancer therapeutics. Sem Cancer Biol 5:29. https://doi.org/10.1016/j.
semcancer.2020.06.006
Jacob F, Monod J (1961) Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol
3:318–356. https://doi.org/10.1016/S0022-2836(61)80072-7
Javed S, Mahmood A, Fraz MM, Koohbanani NA, Benes K, Tsang Y-W et al (2020) Cellular
community detection for tissue phenotyping in colorectal cancer histology images. Med Image
Anal 63:101696. https://doi.org/10.1016/j.media.2020.101696
Jin C, Lagoudas GK, Zhao C, Bullman S, Bhutkar A, Hu B et al (2019) Commensal microbiota
promote lung cancer development via γδ T cells. Cell 176:998–1013. https://doi.org/10.1016/j.
cell.2018.12.040
Lee SH, Sung JY, Yong D, Chun J, Kim SY, Song JH et al (2016) Characterization of microbiome
in bronchoalveolar lavage fluid of patients with lung cancer comparing with benign mass like
lesions. Lung Cancer 102:89–95. https://doi.org/10.1016/j.lungcan.2016.10.016
Lin C, Cai X, Zhang J, Wang W, Sheng Q, Hua H et al (2019) Role of gut microbiota in the
development and treatment of colorectal Cancer. Digestion 100:72–78. https://doi.org/10.1159/
000494052
Lomsadze A, Gemayel K, Tang S, Borodovsky M (2018) Modeling leaderless transcription and
atypical genes results in more accurate gene prediction in prokaryotes. Genome Res
28:1079–1089. https://doi.org/10.1101/gr.230615.117
Loshitskiy PP, Nikolov NA (2015) Magnetothermia utilization in the curing of malignancies. Part
1. Radioelectron Commun Syst 58:49–60. https://doi.org/10.3103/S0735272715020016
Luo H, Gao F (2019) DoriC 10.0: an updated database of replication origins in prokaryotic genomes
including chromosomes and plasmids. Nucleic Acids Res 47:D74–D77. https://doi.org/10.
1093/nar/gky1014
Ma L, Hernandez MO, Zhao Y, Mehta M, Tran B, Kelly M et al (2019) Tumor cell biodiversity
drives microenvironmental reprogramming in liver cancer. Cancer Cell 36:418–430. https://doi.
org/10.1016/j.ccell.2019.08.007
Mackiewicz P, Zakrzewska-Czerwinska J, Zawilak A, Dudek MR, Cebrat S (2004) Where does
bacterial replication start? Rules for predicting the oriC region. Nucleic Acids Res
32:3781–3791. https://doi.org/10.1093/nar/gkh699
Mager DL, Haffajee AD, Devlin PM, Norris CM, Posner MR, Goodson JM (2005) The salivary
microbiota as a diagnostic indicator of oral cancer: a descriptive, non-randomized study of
cancer-free and oral squamous cell carcinoma subjects. J Transl Med 3:27. https://doi.org/10.
1186/1479-5876-3-27
Meng W, Bai B, Sheng L, Li Y, Yue P, Li X et al (2015) Role of helicobacter pylori in gastric
cancer: advances and controversies. Discov Med 20:285–293
Min J-J, Kim H-J, Park JH, Moon S, Jeong JH, Hong Y-J et al (2008) Noninvasive real-time
imaging of tumors and metastases using tumor-targeting light-emitting Escherichia coli. Mol
Imaging Biol 10:54–61. https://doi.org/10.1007/s11307-007-0120-5
Mulder NJ, Christoffels A, de Oliveira T, Gamieldien J, Hazelhurst S, Joubert F et al (2016) The
development of computational biology in South Africa: successes achieved and lessons learnt.
PLoS Comput Biol 12:e1004395. https://doi.org/10.1371/journal.pcbi.1004395
Nordor AV, Bellet D, Siwo GH (2018) Cancer–malaria: hidden connections. Open Biol 8:180127.
https://doi.org/10.1098/rsob.180127
Özdemir V, Dove ES, Gürsoy UK, Şardaş S, Yıldırım A, Yılmaz ŞG et al (2017) Personalized
medicine beyond genomics: alternative futures in big data-proteomics, environtome and the
social proteome. J Neural Transm (Vienna) 124:25–32. https://doi.org/10.1007/s00702-015-
1489-y
Payette PJ, Davis HL (2001) History of vaccines and positioning of current trends. Curr Drug
Targets Infect Disord 1:241–247. https://doi.org/10.2174/1568005014606017
Peters BA, Wu J, Pei Z, Yang L, Purdue MP, Freedman ND et al (2017) Oral microbiome
composition reflects prospective risk for esophageal cancers. Cancer Res 77:6777–6787.
https://doi.org/10.1158/0008-5472.CAN-17-1296
Raj TG, Balaji DB, Giliberto M, Cremaschi A, Skånland SS, Gade A, Tjønnfjord GE et al (2018a)
Drug sensitivity screening on multiple myeloma for precision cancer therapy. Blood
132:4677–4677. https://doi.org/10.1182/blood-2018-99-110669
Raj TG, Balaji DB, Cremaschi A, Skånland SS, Gade A, Schjesvold FH, Tjønnfjord GE et al
(2018b) In-vitro drug sensitivity screening in chronic lymphocytic leukemia (CLL) primary
patient samples identifies drug candidates for precision cancer therapy. Blood 132:4676–4676.
https://doi.org/10.1182/blood-2018-99-110357
Roberts NJ, Zhang L, Janku F, Collins A, Bai R-Y, Staedtke V et al (2014) Intratumoral injection of
Clostridium novyi-NT spores induces antitumor responses. Sci Transl Med 6:249. https://doi.
org/10.1126/scitranslmed.3008982
Rödelsperger C, Athanasouli M, Lenuzzi M, Theska T, Sun S, Dardiry M et al (2019)
Crowdsourcing and the feasibility of manual gene annotation: a pilot study in the nematode
Pristionchus pacificus. Sci Rep 9:18789. https://doi.org/10.1038/s41598-019-55359-5
Saus E, Iraola-Guzmán S, Willis JR, Brunet-Vega A, Gabaldón T (2019) Microbiome and colorec-
tal cancer: roles in carcinogenesis and clinical potential. Mol Asp Med 69:93–106. https://doi.
org/10.1016/j.mam.2019.05.001
Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069.
https://doi.org/10.1093/bioinformatics/btu153
Serra LM, Duncan WD, Diehl AD (2019) An ontology for representing hematologic malignancies:
the cancer cell ontology. BMC Bioinfo 20:181. https://doi.org/10.1186/s12859-019-2722-8
Song M, Chan AT, Sun J (2020) Influence of the gut microbiome, diet, and environment on risk of
colorectal cancer. Gastroenterology 158:322–340. https://doi.org/10.1053/j.gastro.2019.06.048
SynBac™ (2020) In: Geneva Biotech [cited 2 Nov 2020]. https://geneva-biotech.com/product_
category/insect-cell-expression/synbac/
Taboada B, Estrada K, Ciria R, Merino E (2018) Operon-mapper: a web server for precise operon
identification in bacterial and archaeal genomes. Bioinformatics 34:4118–4120. https://doi.org/
10.1093/bioinformatics/bty496
Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L et al (2016) NCBI
prokaryotic genome annotation pipeline. Nucleic Acids Res 44:6614–6624. https://doi.org/10.
1093/nar/gkw569
Teif VB, Bohinc K (2011) Condensed DNA: condensing the concepts. Prog Biophys Mol Biol
105:208–222. https://doi.org/10.1016/j.pbiomolbio.2010.07.002
Tjaden B (2020) A computational system for identifying operons based on RNA-seq data. Methods
176:62–70. https://doi.org/10.1016/j.ymeth.2019.03.026
Trikudanathan G, Philip A, Dasanu CA, Baker WL (2011) Association between Helicobacter pylori
infection and pancreatic cancer. A cumulative meta-analysis. JOP 12:26–31
Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW et al (2005) Compara-
tive metagenomics of microbial communities. Science 308:554–557. https://doi.org/10.1126/
science.1107851
Vijayachandran LS, Thimiri Govinda Raj DB, Edelweiss E, Gupta K, Maier J, Gordeliy V et al
(2013) Gene gymnastics: synthetic biology for baculovirus expression vector system engineer-
ing. Bioengineered 4:279–287
Vivarelli S, Salemi R, Candido S, Falzone L, Santagati M, Stefani S et al (2019) Gut microbiota and
cancer: from pathogenesis to therapy. Cancer 11:38. https://doi.org/10.3390/cancers11010038
Wang H, Funchain P, Bebek G, Altemus J, Zhang H, Niazi F et al (2017) Microbiomic differences
in tumor and paired-normal tissue in head and neck squamous cell carcinomas. Genome Med
9:14. https://doi.org/10.1186/s13073-017-0405-5
White O, Eisen JA, Heidelberg JF, Hickey EK, Peterson JD, Dodson RJ et al (1999) Genome
sequence of the radioresistant bacterium Deinococcus radiodurans R1. Science
286:1571–1577. https://doi.org/10.1126/science.286.5444.1571
Wintjens R, Bifani AM, Bifani P (2020) Impact of glycan cloud on the B-cell epitope prediction of
SARS-CoV-2 spike protein. Vaccine 5:1–8. https://doi.org/10.1038/s41541-020-00237-9
Wood LM, Guirnalda PD, Seavey MM, Paterson Y (2008) Cancer immunotherapy using listeria
monocytogenes and listerial virulence factors. Immunol Res 42:233–245. https://doi.org/10.
1007/s12026-008-8087-0
Woodcroft BJ, Boyd JA, Tyson GW (2016) OrfM: a fast open reading frame predictor for
metagenomic data. Bioinformatics 32:2702–2703. https://doi.org/10.1093/bioinformatics/
btw241
Xu N, Wang L, Li C, Ding C, Li C, Fan W et al (2020) Microbiota dysbiosis in lung cancer:
evidence of association and potential mechanisms. Transl Lung Cancer Res 9:1554–1568.
https://doi.org/10.21037/tlcr-20-156
Xuan C, Shamonki JM, Chung A, Dinome ML, Chung M, Sieling PA et al (2014) Microbial
dysbiosis is associated with human breast cancer. PLoS One 9:e83744. https://doi.org/10.1371/
journal.pone.0083744
Yan X, Yang M, Liu J, Gao R, Hu J, Li J et al (2015) Discovery and validation of potential bacterial
biomarkers for lung cancer. Am J Cancer Res 5:3111–3122
Yang J, Mu X, Wang Y, Zhu D, Zhang J, Liang C et al (2018) Dysbiosis of the salivary microbiome
is associated with non-smoking female lung cancer and correlated with immunocytochemistry
markers. Front Oncol 8:520. https://doi.org/10.3389/fonc.2018.00520
Yang Y, Misra BB, Liang L, Bi D, Weng W, Wu W et al (2019a) Integrated microbiome and
metabolome analysis reveals a novel interplay between commensal bacteria and metabolites in
colorectal cancer. Theranostics 9:4101–4114. https://doi.org/10.7150/thno.35186
Yang Y, Cai Q, Shu X-O, Steinwandel MD, Blot WJ, Zheng W et al (2019b) Prospective study of
oral microbiome and colorectal cancer risk in low-income and African American populations.
Int J Cancer 144:2381–2389. https://doi.org/10.1002/ijc.31941
Yao L, Jermanus C, Barbetta B, Choi C, Verbeke P, Ojcius DM et al (2010) Porphyromonas
gingivalis infection sequesters pro-apoptotic bad through Akt in primary gingival epithelial
cells. Mol Oral Microbiol 25:89–101. https://doi.org/10.1111/j.2041-1014.2010.00569.x
Youssef N, Budd A, Bielawski JP (2019) Introduction to genome biology and diversity. Methods
Mol Biol 1910:3–31. https://doi.org/10.1007/978-1-4939-9074-0_1
Yu J, Feng Q, Wong SH, Zhang D, Liang QY, Qin Y et al (2017) Metagenomic analysis of faecal
microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut
66:70–78. https://doi.org/10.1136/gutjnl-2015-309800
Bioinformatics for Human Microbiome
17
Abhirupa Ghosh, Shazia Firdous, and Sudipto Saha
Abstract
Keywords
Metabolism · Microbial community · Metagenomics · Downstream analysis ·
Microbiome · Computational analysis
17.1 Introduction
Fig. 17.1 High-throughput sequencing and computational analysis allow exploration of the human microbiome and an understanding of its role in normal human physiology and disease
choosing the right tool from the array of computational tools and methods while maintaining the standards of the study. The design of the experimental methods, environmental factors and the analysis workflow can affect the final outcome.
In this chapter, the bioinformatics aspects of microbiome data analyses have been
briefly discussed, focusing on the state-of-the-art tools and workflows of both
primary and downstream analyses. The chapter also discusses the known and
putative applications of human microbiome analyses.
Fig. 17.2 Commonly used workflow for (a) metataxonomic analysis, (b) metagenomic analysis
and (c) metatranscriptomic analysis. The first step in each analysis is pre-processing of the raw reads
The initial HTS data come as raw reads in fastq format. The foremost step of analysis is quality control of the sequence reads, using a variety of computational tools to check quality, identify and remove low-quality bases and reads, low-complexity reads and artefacts like primers, adapters or barcodes, and to remove host contamination. FastQC is the most popular pre-processing tool that provides a quality control report, and MultiQC is used for merging quality control reports from multiple samples into a single report for easy comparison (Ewels et al. 2016). Read trimming and filtering tools like Trimmomatic, Trim Galore! and Cutadapt are used widely for DNA or RNA HTS data (Bolger et al. 2014). There are clusters of pre-processing tools such as the FASTX-Toolkit and BBTools for format conversion, quality reporting, quality trimming, filtering or masking nucleotides and removal of artefacts. The human metagenome has a large share of host nucleotide contamination that affects microbial profiling analysis. To overcome this, KneadData, a well-designed contaminant removal tool, is often used. Pre-processing is a crucial step and requires a trade-off between the sequence quality and the amount of information it can provide.
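The sketch below illustrates the kind of quality filtering such tools perform, using Biopython to drop reads whose mean Phred quality falls below a cut-off; it is not a substitute for Trimmomatic or Cutadapt, and the input file name and the cut-off of 20 are placeholder assumptions.

```python
# Minimal sketch of read quality filtering, illustrating the kind of
# pre-processing performed by dedicated tools. "input.fastq" and the
# quality cut-off of 20 are placeholder assumptions.
from Bio import SeqIO  # requires Biopython

def filter_by_mean_quality(in_fastq: str, out_fastq: str, min_q: float = 20.0) -> int:
    """Keep reads whose mean Phred quality is at least min_q; return count kept."""
    kept = 0
    with open(out_fastq, "w") as out:
        for rec in SeqIO.parse(in_fastq, "fastq"):
            quals = rec.letter_annotations["phred_quality"]
            if quals and sum(quals) / len(quals) >= min_q:
                SeqIO.write(rec, out, "fastq")
                kept += 1
    return kept

if __name__ == "__main__":
    print(filter_by_mean_quality("input.fastq", "filtered.fastq"))
```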
17.2.2 Metataxonomics
Fig. 17.2 (continued) including removal of poor quality reads, primers and barcodes. Detailed
methods are described in Sect. 17.2
chimera sequences is done using the VSEARCH algorithm (Rognes et al. 2016). Mothur and QIIME use a sequence alignment approach to improve clustering of the reads into OTUs (Schloss et al. 2009). OTU-based clusters are representatives of a taxonomic unit at 97% identity. Moreover, this approach fails to detect species- or strain-level taxa, includes sequencing errors and ignores SNPs. More recent algorithms such as Deblur and DADA2 use an alternative approach that incorporates error profiles and sequence variations like SNPs and is able to differentiate closely related taxa (Amir et al. 2017; Callahan et al. 2016). This de-noising approach of deriving ASVs has also been incorporated in QIIME 2 (Bolyen et al. 2019). Either of the approaches can be implemented to obtain a feature table of OTUs or ASVs with the quantitative frequency of features in each sample. As a final step of analysis, taxonomic assignment is done based on the NCBI taxonomy databases, SILVA, RDP or Greengenes (Schoch et al. 2020; Quast et al. 2013; Cole et al. 2014; DeSantis et al. 2006).
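To make the 97% OTU idea concrete, the following toy sketch performs greedy clustering of equal-length reads against the first seed they match at or above the threshold. It only illustrates the concept; VSEARCH, Mothur and QIIME use optimised alignment-based identity, and the per-position identity measure and example reads here are simplifying assumptions.

```python
# Toy sketch of greedy de novo OTU clustering at 97% identity.
# Identity is approximated by per-position matching of equal-length reads,
# purely for illustration of the clustering concept.
from typing import Dict, List

def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def greedy_otus(reads: List[str], threshold: float = 0.97) -> Dict[str, List[str]]:
    """Assign each read to the first OTU seed it matches at >= threshold."""
    otus: Dict[str, List[str]] = {}
    for read in reads:
        for seed in otus:
            if identity(read, seed) >= threshold:
                otus[seed].append(read)
                break
        else:
            otus[read] = [read]   # read becomes a new OTU seed
    return otus

if __name__ == "__main__":
    toy_reads = ["ACGT" * 25, "ACGT" * 24 + "ACGA", "TTTT" * 25]
    print({seed[:8]: len(members) for seed, members in greedy_otus(toy_reads).items()})
```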
17.2.3 Metagenomics
MetaBAT2 that combine both composition and homology-based approach (Wu et al.
2014, 2016; Wu and Scott 2012; Kang et al. 2015, 2019). The clustering of the sequences can be visualised for evaluation of binning using tools like VizBin
(Laczny et al. 2015). Elviz is another tool used for visualisation of both metagenome
assembly and binning (Cantor et al. 2015). As an optional step, reassembly of the
reads in each bin leads to production of longer contigs and helps in metagenomic
genome reconstruction. The assembly and binning are evaluated using tools like
MetaQUAST, CheckM and BUSCO (Mikheenko et al. 2016; Parks et al. 2015;
Seppey et al. 2019). Taxonomic assignment is another crucial step in the metagenomic analysis process. The taxonomic classification is done using either the raw reads or the assembled contigs. The most primitive approach is using BLAST to match each read with the sequences in GenBank; however, the method is not computationally feasible as data volumes increase. With the development of computational techniques, various programs have been developed with strategies like aligning reads to marker genes or protein sequences, k-mer mapping or genome assembly. Marker gene based approaches such as MetaPhlAn2 use a customised clade-specific gene database, while GOTTCHA uses a unique strategy of creating a database of genome signatures for taxonomic profiling. Alignment requires high computational resources; therefore, new approaches like k-mer mapping algorithms, which build a simple lookup table requiring less computational work, have been developed. Kraken and CLARK are two popular k-mer based classifiers used for faster identification of metagenomic reads. k-mers are also represented using de Bruijn graphs, as implemented in Kallisto, to find strain-level abundances. For more sensitive metagenomic classification, translated reads are compared with protein sequence databases, as done by DIAMOND, Kaiju and MEGAN.
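A conceptual sketch of the k-mer lookup idea behind classifiers such as Kraken and CLARK is shown below; real tools map k-mers to the lowest common ancestor in a taxonomy using highly optimised data structures, whereas this toy version simply votes among made-up reference sequences.

```python
# Conceptual sketch of k-mer based read classification (the general idea
# behind Kraken/CLARK, not their algorithms). Reference sequences are made up.
from collections import Counter
from typing import Dict

def kmers(seq: str, k: int):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_index(references: Dict[str, str], k: int = 8) -> Dict[str, set]:
    """Map each reference (taxon) name to the set of its k-mers."""
    return {name: set(kmers(seq, k)) for name, seq in references.items()}

def classify(read: str, index: Dict[str, set], k: int = 8) -> str:
    """Assign the read to the taxon sharing the most k-mers with it."""
    votes = Counter()
    for km in kmers(read, k):
        for taxon, ref_kmers in index.items():
            if km in ref_kmers:
                votes[taxon] += 1
    return votes.most_common(1)[0][0] if votes else "unclassified"

if __name__ == "__main__":
    refs = {"taxonA": "ATGCGTACGTTAGCATGCAAGCTT", "taxonB": "GGGGCCCCTTTTAAAAGGGGCCCC"}
    index = build_index(refs)
    print(classify("ACGTTAGCATGC", index))   # expected: taxonA
```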
17.2.4 Metatranscriptomics
The heart of the microbiome data analysis is assignment of correct taxonomy to the
reads. Both marker gene sequencing and shotgun sequencing provide information on
the microbial composition of the microbiome sample using references from public
databases. The analysis of metataxonomic data is assisted by a few comprehensive
resources such as SILVA, RDP, Greengenes and UNITE (Quast et al. 2013; Cole
et al. 2014; DeSantis et al. 2006; Balvociute and Huson 2017; Nilsson et al. 2019).
SILVA (http://www.arb-silva.de) is an updated and non-redundant database of
aligned small and large subunit rRNA gene sequences from Bacteria, Archaea and
Eukaryota. Ribosomal Database Project (RDP; http://rdp.cme.msu.edu/) contains
aligned and annotated bacterial and archaeal small subunit rRNA genes and fungal
large subunit rRNA genes. Greengenes (http://greengenes.lbl.gov) is a 16S rRNA
database that includes chimera screening as an exclusive feature. UNITE (https://
unite.ut.ee/) is a database of fungal ribosomal internal transcribed spacer (ITS)
region. Metagenomics is not restricted to a single gene; thus metagenomic classifiers use a large number of genes for taxonomic assignments. A widely used taxonomic assignment resource is the NCBI taxonomy database, which comprises all the organism names associated with sequence submissions to NCBI. For identification of genes from metagenomic reads or merged contigs, the NCBI non-redundant (nr) database is used as the reference database. Metagenomic Phylogenetic Analysis2 (MetaPhlAn2) is a
widely used tool that has its own database of clade-specific marker genes identified
from bacteria, archaea, viruses and eukaryotes for taxonomic profiling of the reads
(Segata et al. 2012).
The final outputs from HTS analysis are microbial taxonomic and active gene feature
tables. These output files are further analysed to answer scientific questions such as
diversity of microbial composition in samples or across sample groups, identification
of pathogen species, the significant microbial genes and functional pathways specific
to sample groups, identification of variants, genome structure and phylogeny. These sample groups usually represent diseased, drug- or antibiotic-exposed, or control individuals. The downstream analyses from the taxonomy and gene tables
are done using various statistical analysis and visualisation tools as tabulated in
Table 17.1.
Downstream analysis from taxonomic profiles involves finding alpha and beta
diversity and comparing them among sample groups, finding differential abundance
of taxa and correlations between taxa and metadata. Alpha diversity is measured using the Shannon diversity index and Shannon evenness index, which define the species
Table 17.1 List of some computational tools for downstream analysis and visualisation
Anvi'o (Eren et al. 2015): a platform to analyse and visualise microbial assembly and binning. http://merenlab.org/software/anvio/
BURRITO (McNally et al. 2018): an interactive visualisation tool for multiomic microbiome data that pairs taxonomic and functional information. https://github.com/borenstein-lab/burrito
Elviz (Cantor et al. 2015): an interactive web tool for visualisation of assembled metagenomes along with metadata and sequence parameters. https://genome.jgi.doe.gov/viz/
FragGeneScan (Rho et al. 2010): a method to predict genes from metagenomic data using a hidden Markov model. https://sourceforge.net/projects/fraggenescan/
GraPhlAn (Asnicar et al. 2015): a tool for visualisation of microbial genomes and metagenomes along with phylogenies, metadata and abundances. https://huttenhower.sph.harvard.edu/graphlan
HUMAnN2 (Franzosa et al. 2018): a pipeline for profiling microbial pathways and their abundance from metagenomic or metatranscriptomic sequencing data. https://huttenhower.sph.harvard.edu/humann
Krona (Ondov et al. 2011): an interactive tool to visualise hierarchies of metagenomic classifications along with the relative abundances and confidences. https://github.com/marbl/Krona/wiki
LEfSe (Segata et al. 2011): a program for biomarker discovery and identification of genomic features such as taxa, genes or pathways that differentiate between classes. https://huttenhower.sph.harvard.edu/lefse/
MaAsLin2 (Himel Mallick et al. 2021): a program for determining multivariable associations between phenotypes, environments, exposures, covariates and microbial metaomic features. https://huttenhower.sph.harvard.edu/maaslin/
MetaCHIP (Song et al. 2019): a pipeline to predict horizontal gene transfer from metagenomic data. https://github.com/songweizhi/MetaCHIP
MetaGeneMark (Zhu et al. 2010): ab initio prediction of genes from shotgun sequences. http://exon.gatech.edu/meta_gmhmmp.cgi
MetaProdigal (Hyatt et al. 2012): a program to identify genes from short sequences with high accuracy and the ability to identify sequences with alternate genetic codes. https://github.com/hyattpd/prodigal
mmvec (Morton et al. 2019): a program to predict microbe–metabolite interactions from multiomic microbiome data. https://github.com/biocore/mmvec
Phinch (Holly M Bik 2014): an interactive web-based framework to explore multiomic microbiome data. http://phinch.org/
Phyloseq (McMurdie and Holmes 2013): an R package to explore microbiome phylogenetic profiles. https://joey711.github.io/phyloseq/
PICRUSt (Douglas et al. 2018): a package to predict microbial functions from 16S rRNA analysis. http://picrust.github.io/picrust/
shortBRED (Kaminski et al. 2015): a program for profiling protein families of interest from shotgun metaomic sequencing data with high specificity. https://huttenhower.sph.harvard.edu/shortbred
(continued)
The comparison of alpha diversity among or between groups is statistically assessed using ANOVA, the Mann–Whitney U test and the Kruskal–Wallis test, and is visually represented using box plots, Venn diagrams and rarefaction curves. Beta diversity captures variation in microbial composition between samples using Bray–Curtis dissimilarity, Jaccard distance and weighted or unweighted UniFrac. It is paired with principal coordinate analysis (PCoA), non-metric multi-dimensional scaling (NMDS) and constrained PCoA (CPCoA) to obtain visual outputs, and is compared among samples or groups using scatter plots and dendrograms. Several programs are available for calculating alpha and beta diversity, such as QIIME, phyloseq, vegan and USEARCH (Edgar 2010; McMurdie and Holmes 2013). Volcano plots, Manhattan plots and tools like LEfSe are used to find differentially abundant taxa and significant determinant taxa between groups of samples (Segata et al. 2011). Correlation coefficient curves, linear fitting curves and heatmaps are used to find correlations between taxonomic profiles and metadata, while phylogenetic trees and cladograms are used to understand the phylogenetic and taxonomic hierarchy. GraPhlAn is a software package that produces attractive, publication-ready phylogenetic trees (Asnicar et al. 2015). Other popular visualisation tools are Krona, an interactive tool to explore relative abundances along with hierarchical classifications, and TIME (Temporal Insights into Microbial Ecology), which allows prediction of taxonomic markers for different sample groups (Ondov et al. 2011; Baksi et al. 2018). Horizontal gene transfer (HGT) is a crucial phenomenon in bacteria, especially in the spread of antibiotic resistance, and the human microbiome, a reservoir of microbes, can be explored with tools such as MetaCHIP that identify HGT from metagenomic datasets (Song et al. 2019).
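The sketch below illustrates, with toy count data and numpy only, the Shannon diversity index, a Shannon (Pielou-style) evenness and the Bray–Curtis dissimilarity described above; in practice these would normally be computed with QIIME, phyloseq, vegan or similar packages rather than by hand.

```python
# Sketch of the alpha/beta diversity measures described above, using numpy only.
# Input layout (rows = samples, columns = taxa, values = counts) is a toy assumption.
import numpy as np

def shannon(counts):
    p = counts / counts.sum()
    p = p[p > 0]                      # ignore absent taxa
    return -np.sum(p * np.log(p))     # Shannon diversity index H'

def evenness(counts):
    s = np.count_nonzero(counts)      # observed richness
    return shannon(counts) / np.log(s) if s > 1 else 0.0

def bray_curtis(a, b):
    return np.abs(a - b).sum() / (a + b).sum()

table = np.array([[120, 30, 0, 5],    # sample 1
                  [ 80, 60, 10, 0]])  # sample 2
for i, row in enumerate(table, 1):
    print(f"sample {i}: H'={shannon(row):.3f}, evenness={evenness(row):.3f}")
print("Bray-Curtis dissimilarity:", round(bray_curtis(table[0], table[1]), 3))
```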
Downstream analysis and metabolic pathway information can also be drawn from functional analysis of microbiome data. Marker gene analysis only gives insight into microbial composition, but tools like PICRUSt and Tax4Fun assign metabolic functions to samples by mapping the 16S reads to annotated genomes (Douglas et al. 2018; Asshauer et al. 2015). However, for detailed and accurate functional profiling, shotgun and transcriptome sequencing are highly recommended.
Genes are identified from assembled contigs of shotgun reads using tools like MetaGeneMark, FragGeneScan and MetaProdigal (Zhu et al. 2010; Rho et al. 2010; Hyatt et al. 2012). The identified genes or ORFs are further analysed to predict their function. BLASTn and BLASTp use the NCBI GenBank or UniProt databases for homology-based searches to annotate the contigs (UniProt 2019). The hidden Markov model based HMMER and the support vector machine based PhyloPythiaS(+) are also used as similarity search tools (Eddy 2008). Widely used databases for prediction of gene function, pathways or functional domains are PFAM, COG, SEED, eggNOG, KEGG and TIGRFAM (El-Gebali et al. 2019; Galperin et al. 2019; Huerta-Cepas et al. 2019; Overbeek et al. 2014; Kanehisa et al. 2017; Haft et al. 2013). A few tools, such as HUMAnN2, LEfSe and shortBRED, are available to visualise the annotated functional gene information and compare it among sample groups (Franzosa et al. 2018; Kaminski et al. 2015).
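A minimal sketch of the homology-based annotation step is given below; it assumes BLAST has already been run against nr or UniProt with tabular output (-outfmt 6, the default 12-column layout) and simply keeps the best-scoring hit per query above crude identity and e-value cut-offs. The file name and thresholds are illustrative only.

```python
# Sketch: pick the best hit per query from BLAST tabular output (-outfmt 6)
# as a crude homology-based annotation step. Default 12 columns are:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
def best_hits(blast_tab, min_identity=30.0, max_evalue=1e-5):
    best = {}
    with open(blast_tab) as handle:
        for line in handle:
            f = line.rstrip("\n").split("\t")
            if len(f) < 12:
                continue
            query, subject = f[0], f[1]
            pident, evalue, bitscore = float(f[2]), float(f[10]), float(f[11])
            if pident < min_identity or evalue > max_evalue:
                continue                                # filter weak hits
            if query not in best or bitscore > best[query][1]:
                best[query] = (subject, bitscore)       # keep the top-scoring hit
    return {q: s for q, (s, _) in best.items()}

if __name__ == "__main__":
    # "contigs_vs_nr.tsv" is a hypothetical BLAST result file
    for gene, annot in best_hits("contigs_vs_nr.tsv").items():
        print(gene, "->", annot)
```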
Microbiome data analysis is a complicated process with many overlapping steps and pressure points. Several analysis pipelines have been developed to facilitate full or partial analysis. QIIME, QIIME 2 and mothur are comprehensive amplicon sequencing analysis pipelines with integrated scripts to perform steps from quality control to diversity visualisation (Caporaso et al. 2010; Bolyen et al. 2019; Schloss 2020). Similarly, MetAMOS and ANASTASIA are pipelines for metagenomic assembly and gene annotation (Treangen et al. 2013; Koutsandreas et al. 2019). Another notable pipeline, SqueezeMeta, performs real-time analysis of metagenomic data from nanopore technology (Tamames and Puente-Sanchez 2018).
Metatranscriptomics is relatively new in the field and has very few dedicated tools and pipelines. SAMSA2 is a dedicated metatranscriptomic data analysis tool that handles everything from quality control to visualisation (Westreich et al. 2018). MetaQUBIC is a pipeline that pairs multiomic data to detect gene modules from metagenomic and metatranscriptomic data (Ma et al. 2019). Apart from these pipelines, web services such as MG-RAST, EBI Metagenomics, IMG/M and Qiita provide automated end-to-end processing of the data (Meyer et al. 2019; Mitchell et al. 2020; Chen et al. 2019; Gonzalez et al. 2018). An alternative approach is Galaxy, an open-source workflow system comprising numerous tools for each analysis step that allows customisation of frameworks such as ASaiM (Afgan et al. 2018; Batut et al. 2018).
The biggest challenge in computational analysis of microbiome data is not the choice of analysis pipeline; the experimental factors have more influence on the outcome. Since the field is evolving rapidly, it is essential that analyses are reproducible. To tackle such challenges, the metadata of the experimental procedures, including the host phenotype, sample site, sampling technique, nucleotide extraction method, primers and barcodes, must be available with the sequence data. Recommended data collection procedures should be followed, as set out in the minimum information about a marker gene (MIMARKS) and metagenome (MIMS) standards, the minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea, and the minimum information about an uncultivated virus genome (MIUViG) (Field et al. 2008; Yilmaz et al. 2011; Bowers et al. 2017; Roux et al. 2019). Experimental handling introduces contamination that modifies the microbial composition of the sample. Therefore, the use of proper controls and consistency across all similar samples should be kept in mind, following recommendations such as the International Human Microbiome Standards and the Microbiome Quality Control (MBQC) project (Sinha et al. 2017; Costea et al. 2017). Equal effort is also given to raw data storage and to standardisation and reproducibility of data analyses using cloud and open-source resources such as Qiita, EBI Metagenomics, the Sequence Read Archive (SRA) and MG-RAST (Mitchell et al. 2020; Gonzalez et al. 2018; Kodama et al. 2012; Keegan et al. 2016).
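As a purely illustrative sketch of the point about metadata, the snippet below writes a minimal tab-separated sample sheet carrying a few MIMARKS/MIMS-style fields alongside each sample; the field names and values are examples, not the official checklist terms.

```python
# Illustrative sketch only: recording a few MIMARKS/MIMS-style fields per sample
# so that analyses remain reproducible. Field names and values are hypothetical.
import csv

samples = [
    {"sample_id": "S001", "host_phenotype": "healthy", "body_site": "stool",
     "collection_date": "2020-06-01", "extraction_kit": "KitA",
     "primer_pair": "515F/806R", "barcode": "ACGTACGT"},
    {"sample_id": "S002", "host_phenotype": "IBD", "body_site": "stool",
     "collection_date": "2020-06-02", "extraction_kit": "KitA",
     "primer_pair": "515F/806R", "barcode": "TGCATGCA"},
]

with open("sample_metadata.tsv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=samples[0].keys(), delimiter="\t")
    writer.writeheader()        # one header row with the field names
    writer.writerows(samples)   # one row per sample
```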
In-depth studies of the human microbiome show that the microbes of a healthy individual differ greatly from those of a diseased individual. Characterising microbiomes based on such differential traits could therefore serve as a potential tool for identifying disease risk, prognosis, phenotype and response to treatment. In diseases such as inflammatory bowel disease (IBD), obesity, diabetes and cancer, microbiome studies help to identify the pathophysiology of disease and have inspired further studies to understand the link of the immune system and therapeutics with the human microbiome (McCarville et al. 2016; Kostic et al. 2012; Durack et al. 2018; Chen
References
Afgan E et al (2018) The galaxy platform for accessible, reproducible and collaborative biomedical
analyses: 2018 update. Nucleic Acids Res 46:W537–W544
Afiahayati KS, Sakakibara Y (2015) MetaVelvet-SL: an extension of the Velvet assembler to a de
novo metagenomic assembler utilizing supervised learning. DNA Res 22(1):69–77
Altschul SF et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
Amir A et al (2017) Deblur rapidly resolves single-nucleotide community sequence patterns.
mSystems 2:2
Asnicar F et al (2015) Compact graphical representation of phylogenetic data and metadata with
GraPhlAn. PeerJ 3:e1029
Asshauer KP et al (2015) Tax4Fun: predicting functional profiles from metagenomic 16S rRNA
data. Bioinformatics 31(17):2882–2884
Baksi KD, Kuntal BK, Mande SS (2018) TIME: a web application for obtaining insights into
microbial ecology using longitudinal Microbiome data. Front Microbiol 9:36
Balvociute M, Huson DH (2017) SILVA, RDP, Greengenes, NCBI and OTT—how do these
taxonomies compare? BMC Genomics 18(Suppl 2):114
Batut B et al (2018) ASaiM: a galaxy-based framework to analyze microbiota data. Gigascience 7
(6):057
Bay L et al (2020) Universal dermal microbiome in human skin. MBio 11(1):02945
Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence
data. Bioinformatics 30(15):2114–2120
Bolyen E et al (2019) Reproducible, interactive, scalable and extensible microbiome data science
using QIIME 2. Nat Biotechnol 37(8):852–857
Bowers RM et al (2017) Minimum information about a single amplified genome (MISAG) and a
metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol 35
(8):725–731
Callahan BJ et al (2016) DADA2: high-resolution sample inference from Illumina amplicon data.
Nat Methods 13(7):581–583
Himel Mallick, LJM, Rahnavard A, Ma S, Zhang Y, Nguyen LH, Tickle TL, Weingart G, Ren B,
Schwager E, Subramanian A, Lu Y, Waldron L, Paulson JN, Franzosa EA, Bravo HC,
Huttenhower C (2021) Multivariable association in population-scale meta-omics studies
Holly M Bik PI (2014) Phinch: an interactive, exploratory data visualization framework for -omic
datasets. In: bioRxiv
Huerta-Cepas J et al (2019) eggNOG 5.0: a hierarchical, functionally and phylogenetically
annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res
47:309–314
Huey SL et al (2020) Nutrition and the gut microbiota in 10- to 18-month-old children living in
urban slums of Mumbai, India. mSphere 5:5
Human Microbiome Project (2012a) Structure, function and diversity of the healthy human
microbiome. Nature 486(7402):207–214
Human Microbiome Project (2012b) A framework for human microbiome research. Nature 486
(7402):215–221
Hyatt D et al (2012) Gene and translation initiation site prediction in metagenomic sequences.
Bioinformatics 28(17):2223–2230
Kaminski J et al (2015) High-specificity targeted functional profiling in microbial communities with
ShortBRED. PLoS Comput Biol 11(12):e1004557
Kanehisa M et al (2017) KEGG: new perspectives on genomes, pathways, diseases and drugs.
Nucleic Acids Res 45(D1):D353–D361
Kang DD et al (2015) MetaBAT, an efficient tool for accurately reconstructing single genomes from
complex microbial communities. PeerJ 3:e1165
Kang DD et al (2019) MetaBAT 2: an adaptive binning algorithm for robust and efficient genome
reconstruction from metagenome assemblies. PeerJ 7:e7359
Karlsson F et al (2013) Assessing the human gut microbiota in metabolic diseases. Diabetes 62
(10):3341–3349
Keegan KP, Glass EM, Meyer F (2016) MG-RAST, a metagenomics Service for Analysis of
microbial community structure and function. Methods Mol Biol 1399:207–233
Knights D et al (2011) Bayesian community-wide culture-independent microbial source tracking.
Nat Methods 8(9):761–763
Kodama Y et al (2012) The sequence read archive: explosive growth of sequencing data. Nucleic
Acids Res 40:D54–D56
Kopylova E, Noe L, Touzet H (2012) SortMeRNA: fast and accurate filtering of ribosomal RNAs in
metatranscriptomic data. Bioinformatics 28(24):3211–3217
Kostic AD et al (2012) Genomic analysis identifies association of fusobacterium with colorectal
carcinoma. Genome Res 22(2):292–298
Koutsandreas T et al (2019) ANASTASIA: an automated metagenomic analysis pipeline for novel
enzyme discovery exploiting next generation sequencing data. Front Genet 10:469
Laczny CC et al (2015) VizBin—an application for reference-independent visualization and
human-augmented binning of metagenomic data. Microbiome 3(1):1
Li D et al (2015) MEGAHIT: an ultra-fast single-node solution for large and complex
metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31(10):1674–1676
Ma A et al (2019) MetaQUBIC: a computational pipeline for gene-level functional profiling of
metagenome and metatranscriptome. Bioinformatics 35(24):5397
McCarville JL, Caminero A, Verdu EF (2016) Novel perspectives on therapeutic modulation of the
gut microbiota. Therap Adv Gastroenterol 9(4):580–593
McInnes RS et al (2020) Horizontal transfer of antibiotic resistance genes in the human gut
microbiome. Curr Opin Microbiol 53:35–43
McMurdie PJ, Holmes S (2013) phyloseq: an R package for reproducible interactive analysis and
graphics of microbiome census data. PLoS One 8(4):e61217
McNally CP et al (2018) BURRITO: an interactive multi-omic tool for visualizing taxa-function
relationships in Microbiome data. Front Microbiol 9:365
Mehta RS et al (2018) Stability of the human faecal microbiome in a cohort of adult men. Nat
Microbiol 3(3):347–355
Meyer F et al (2019) MG-RAST version 4-lessons learned from a decade of low-budget ultra-high-
throughput metagenome analysis. Brief Bioinform 20(4):1151–1159
Mikheenko A, Saveliev V, Gurevich A (2016) MetaQUAST: evaluation of metagenome
assemblies. Bioinformatics 32(7):1088–1090
Mitchell AL et al (2020) MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res 48
(D1):D570–D578
Morton JT et al (2017) Balance trees reveal microbial niche differentiation. mSystems 2:1
Morton JT et al (2019) Learning representations of microbe-metabolite interactions. Nat Methods
16(12):1306–1314
Namiki T et al (2012) MetaVelvet: an extension of velvet assembler to de novo metagenome
assembly from short sequence reads. Nucleic Acids Res 40(20):e155
Nayfach S et al (2019) New insights from uncultivated genomes of the global human gut
microbiome. Nature 568(7753):505–510
Nilsson RH et al (2019) The UNITE database for molecular identification of fungi: handling dark
taxa and parallel taxonomic classifications. Nucleic Acids Res 47(D1):D259–D264
Nishijima S et al (2016) The gut microbiome of healthy Japanese and its microbial and functional
uniqueness. DNA Res 23(2):125–133
Nurk S et al (2017) metaSPAdes: a new versatile metagenomic assembler. Genome Res 27
(5):824–834
Oh M, Zhang L (2020) DeepMicro: deep representation learning for disease prediction based on
microbiome data. Sci Rep 10(1):6026
Ondov BD, Bergman NH, Phillippy AM (2011) Interactive metagenomic visualization in a web
browser. BMC Bioinformatics 12:385
Ounit R et al (2015) CLARK: fast and accurate classification of metagenomic and genomic
sequences using discriminative k-mers. BMC Genomics 16:236
Overbeek R et al (2014) The SEED and the rapid annotation of microbial genomes using
subsystems technology (RAST). Nucleic Acids Res 42:206–214
Parks DH et al (2015) CheckM: assessing the quality of microbial genomes recovered from isolates,
single cells, and metagenomes. Genome Res 25(7):1043–1055
Pasolli E et al (2019) Extensive unexplored human microbiome diversity revealed by over 150,000
genomes from metagenomes spanning age, geography, and lifestyle. Cell 176(3):649–662
Peng Y et al (2011) Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics 27
(13):94–101
Peng Y et al (2012) IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing
data with highly uneven depth. Bioinformatics 28(11):1420–1428
Qian X et al (2020) Gut microbiota in children with juvenile idiopathic arthritis: characteristics,
biomarker identification, and usefulness in clinical prediction. BMC Genomics 21(1):286
Qiao Y et al (2018) MetaBinG2: a fast and accurate metagenomic sequence classification system for
samples with many unknown organisms. Biol Direct 13(1):15
Qin J et al (2010) A human gut microbial gene catalogue established by metagenomic sequencing.
Nature 464(7285):59–65
Quast C et al (2013) The SILVA ribosomal RNA gene database project: improved data processing
and web-based tools. Nucleic Acids Res 41:590–596
Rajilic-Stojanovic M et al (2012) Long-term monitoring of the human intestinal microbiota
composition. Environ Microbiol 10:15
Relman DA, Lipsitch M (2018) Microbiome as a tool and a target in the effort to address
antimicrobial resistance. Proc Natl Acad Sci USA 115(51):12902–12910
Rho M, Tang H, Ye Y (2010) FragGeneScan: predicting genes in short and error-prone reads.
Nucleic Acids Res 38(20):e191
Ritchie ML, Romanuk TN (2012) A meta-analysis of probiotic efficacy for gastrointestinal
diseases. PLoS One 7(4):e34938
Robertson RC et al (2019) The Human Microbiome and child growth—first 1000 days and beyond.
Trends Microbiol 27(2):131–147
Rognes T et al (2016) VSEARCH: a versatile open source tool for metagenomics. Peer J 4:e2584
Routy B et al (2018) Gut microbiome influences efficacy of PD-1-based immunotherapy against
epithelial tumors. Science 359(6371):91–97
Roux S et al (2019) Minimum information about an uncultivated virus genome (MIUViG). Nat
Biotechnol 37(1):29–37
Schloss PD (2020) Reintroducing mothur: 10 years later. Appl Environ Microbiol 86(2):e02343
Schloss PD et al (2009) Introducing mothur: open-source, platform-independent, community-
supported software for describing and comparing microbial communities. Appl Environ
Microbiol 75(23):7537–7541
Schoch CL et al (2020) NCBI taxonomy: a comprehensive update on curation, resources and tools.
Database (Oxford) 2020:062
Segata N et al (2011) Metagenomic biomarker discovery and explanation. Genome Biol 12(6):R60
Segata N et al (2012) Metagenomic microbial community profiling using unique clade-specific
marker genes. Nat Methods 9(8):811–814
Seppey M, Manni M, Zdobnov EM (2019) BUSCO: assessing genome assembly and annotation
completeness. Methods Mol Biol 1962:227–245
Shi W et al (2019) gcMeta: a Global Catalogue of Metagenomics platform to support the archiving,
standardization and analysis of microbiome data. Nucleic Acids Res 47(1):637–648
Sinha R et al (2017) Assessment of variation in microbial community amplicon sequencing by the
Microbiome quality control (MBQC) project consortium. Nat Biotechnol 35(11):1077–1086
Sommer F et al (2017) The resilience of the intestinal microbiota influences health and disease. Nat
Rev Microbiol 15(10):630–638
Song W et al (2019) MetaCHIP: community-level horizontal gene transfer identification through
the combination of best-match and phylogenetic approaches. Microbiome 7(1):36
Stennett CA et al (2020) A cross-sectional pilot study of birth mode and vaginal microbiota in
reproductive-age women. PLoS One 15(4):0228574
Subramanian S et al (2014) Persistent gut microbiota immaturity in malnourished Bangladeshi
children. Nature 510(7505):417–421
Sun J et al (2020a) Role of the oral microbiota in cancer evolution and progression. Cancer Med
9:6306–6321
Sun Y et al (2020b) Population-level configurations of gut mycobiome across six ethnicities in
urban and rural China. Gastroenterology 6:31–38
Susic D et al (2020) Microbiome Understanding in Maternity Study (MUMS), an Australian
prospective longitudinal cohort study of maternal and infant microbiota: study protocol. BMJ
Open 10(9):e040189
Tamames J, Puente-Sanchez F (2018) SqueezeMeta, a highly portable, fully automatic
metagenomic analysis pipeline. Front Microbiol 9:3349
Tamburini S et al (2016) The microbiome in early life: implications for health outcomes. Nat Med
22(7):713–722
Thang MWC et al (2019) MetaDEGalaxy: galaxy workflow for differential abundance analysis of
16s metagenomic data. F1000Res 8:726
The Integrative HMP (iHMP) Research Network Consortium (2019) The Integrative Human
Microbiome Project. Nature 569(7758):641–648
Tigchelaar EF et al (2015) Cohort profile: LifeLines DEEP, a prospective, general population cohort
study in the northern Netherlands: study design and baseline characteristics. BMJ Open 5(8):
e006772
Treangen TJ et al (2013) MetAMOS: a modular and open source metagenomic assembly and
analysis pipeline. Genome Biol 14(1):R2
Truong DT et al (2015) MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Methods
12(10):902–903
UniProt C (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47(D1):
D506–D515
Valles-Colomer M et al (2019) The neuroactive potential of the human gut microbiota in quality of
life and depression. Nat Microbiol 4(4):623–632
Vangay P, Hillmann BM, Knights D (2019) Microbiome learning repo (ML repo): a public
repository of microbiome regression and classification tasks. Gigascience 8:5
Vetizou M et al (2015) Anticancer immunotherapy by CTLA-4 blockade relies on the gut
microbiota. Science 350(6264):1079–1084
Westreich ST et al (2018) SAMSA2: a standalone metatranscriptome analysis pipeline. BMC
Bioinformatics 19(1):175
Wood DE, Salzberg SL (2014) Kraken: ultrafast metagenomic sequence classification using exact
alignments. Genome Biol 15(3):46
Wu M, Scott AJ (2012) Phylogenomic analysis of bacterial and archaeal sequences with
AMPHORA2. Bioinformatics 28(7):1033–1034
Wu YW et al (2014) MaxBin: an automated binning method to recover individual genomes from
metagenomes using an expectation-maximization algorithm. Microbiome 2:26
Wu YW, Simmons BA, Singer SW (2016) MaxBin 2.0: an automated binning algorithm to recover
genomes from multiple metagenomic datasets. Bioinformatics 32(4):605–607
Yang Z et al (2016) Preliminary analysis showed country-specific gut resistome based on 1,267
feces samples. Gene 581(2):178–182
Yatsunenko T et al (2012) Human gut microbiome viewed across age and geography. Nature 486
(7402):222–227
Yilmaz P et al (2011) Minimum information about a marker gene sequence (MIMARKS) and
minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol 29
(5):415–420
Zhang L et al (2020) Characterization of antibiotic resistance and host-microbiome interactions in
the human upper respiratory tract during influenza infection. Microbiome 8(1):39
Zhu W, Lomsadze A, Borodovsky M (2010) Ab initio gene identification in metagenomic
sequences. Nucleic Acids Res 38(12):e132
Zimmermann M et al (2019) Separating host and microbiome contributions to drug pharmacoki-
netics and toxicity. Science 363(6427):9931
Neural Network Analysis
18
Amit Joshi, Jitendra Sasumana, Nillohit Mitra Ray, and Vikas Kaushik
Abstract
Neural networks play a very significant role in the analysis of protein and nucleic acid sequences, and much pattern recognition software relies on neural networks to predict biological patterns. Modern sequencing advances fuel the collection of DNA, RNA, and protein sequence data, and the complexity and enormous size of these data require the best computational algorithms for analysis and interpretation. This information assists in developing useful insights for predicting biomolecular structures and the interactions between such molecules. A neural network analysis framework is a sequence of computations that attempts to recognise hidden associations in large amounts of data through a technique that imitates the way the human brain works; in this sense, neural frameworks are networks of artificial neurons. Linear algebra based on vectors and matrices, together with topological designs, underpins the various types of neural architectures. Neural networks can adapt to changing input, so the framework produces the best result without needing to redesign the output rules. The concept of neural networks, which has its foundations in artificial intelligence, is rapidly gaining popularity in the development of in silico design systems. Here, we discuss and summarise the uses of neural networks in computational biology, with a specific focus on applications in protein and nucleic acid bioinformatics. We conclude by giving basic insights into neural networks in multiple domains of the life sciences, such as gene prediction, protein structure prediction, epitope prediction, expression and co-expression analysis, protein–protein interaction, and many other areas.
Keywords
18.1 Introduction
genomic as well as proteomic analysis. This will also resolve big data analysis problems. The aim of this chapter is therefore useful: enormous data should not become a problem for the investigator but should instead act as a boon for the life sciences and the medical world, helping to develop better and faster regimens for several diseases and to interconnect the health sector globally.
Proteins and nucleic acids are integral parts of every cellular entity in all living organisms. These biomolecules participate in almost all functional activities within cells, from metabolic reactions to the genetic expression of transcriptomes. Many life-sustaining processes, such as the cell cycle, apoptosis, enzymatic catalysis, cell signalling, adhesion, and the central dogma, depend on protein–protein and DNA–protein interactions. Proteins and nucleic acids are macromolecular heteropolymers of amino acids and nucleotides, respectively (Giorgini et al. 2020). Different proteins have different amino acid sequences; a peptide bond forms between two amino acids through bonding between the amino and carboxyl groups of adjacent residues (Fig. 18.1b, c). Similarly, in nucleic acids (DNA or RNA) phosphodiester bonds link two adjacent nucleotides (deoxyribose or ribose sugar, nitrogenous base (A, T, C, G), and phosphate group) (Fig. 18.1a). Sequencing studies produce enormous amounts of DNA, RNA, and protein sequence data, which are stored in databases such as NCBI GenBank, DDBJ, and EMBL. Similarly, structural information on proteins and the ligands interacting with them, determined by X-ray crystallographic experiments, is deposited in databases such as PubChem, ChemBridge, Maybridge, and the Protein Data Bank (RCSB PDB).
Fig. 18.1 (a) DNA structure: revealing phosphodiester bonds and hydrogen bond (b) protein
chain: primary structure (c) peptide bond between amino acids
Bioinformatics not only provides a platform to analyze protein structures in their primary, secondary, and tertiary forms but also allows users to model structural changes under the influence of molecules in the vicinity. This means that protein structures show bonding or interactions with other biomolecules, which can impart inducible changes, as in the induced-fit model of enzymatic action (Morgat et al. 2020). Computational studies have produced deep neural network applications that assist in predicting protein–protein binding pockets or interaction sites (Zeng et al. 2020). Protein contacts can even be predicted from metagenomic sequence data using residual neural networks (Wu et al. 2020). All such recent studies indicate the importance of bringing neural network bioinformatics and biochemistry together to generate an accurate and precise picture of the functional aspects of biomolecules. They also suggest that unsolved structures for known sequences could be determined by deploying neural network architectures in biochemistry and medicine.
Neural networks are part of artificial intelligence and machine learning. These networks work like the neurons of the brain and depend on weights: as training adjusts the weights, the network learns to predict suitable results or outputs. Deep learning becomes more capable as the amount of data increases; a neural network therefore improves its performance to a greater extent as it grows bigger and handles an enormous flow of information, whereas other machine learning tools tend to reach a performance plateau after a point. Activation functions play a crucial role in switching the artificial connections between neural elements on and off, allowing a systematic flow of information and deep learning in software-based systems (Khan 2020). Each neural element receives multiple inputs, multiplies them by (initially randomised) weights, adds a static bias, and passes the result to an activation function that produces the output of that element (Fig. 18.2a). The activation function can be linear (the simplest), a Heaviside step, or a sigmoid (more complex). When the final neural layer generates an output, a loss function is calculated from the inputs and outputs, and backpropagation is carried out to alter the weights so that the loss is minimised (Fig. 18.2b). Determining this overall adjustment of weights is the central operation of neural network architectures (Galushkin 2007).
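A minimal numpy sketch of this single neural element, with assumed toy inputs and a sigmoid activation, is shown below; it performs the weighted sum plus bias, applies the activation, and takes a few gradient steps on a squared loss.

```python
# Minimal numpy sketch of a single artificial neuron: weighted inputs plus bias,
# a sigmoid activation, and gradient steps on a squared loss. Inputs, target and
# learning rate are toy assumptions.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])        # inputs
w = rng.normal(size=3)                # randomly initialised weights
b = 0.0                               # static bias
target = 1.0

for step in range(100):
    y = sigmoid(w @ x + b)            # forward pass
    loss = 0.5 * (y - target) ** 2
    grad = (y - target) * y * (1 - y) # dLoss/d(pre-activation)
    w -= 0.5 * grad * x               # back-propagate into the weights
    b -= 0.5 * grad
print("output after training:", round(float(sigmoid(w @ x + b)), 3))
```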
Neural network architectures can be divided into many subtypes based on their framework, data flow, the density and weighting of artificial neurons, the number of layers, and the activation functions used (Amato et al. 2013). Common types of neural networks are the feed-forward network, multi-layer perceptron, convolutional neural network, radial basis function network, recurrent neural network, modular network, etc. (Figs. 18.2c and 18.3).
Fig. 18.2 (a) Numeric weight values are multiplied with the input data, adjusted during backpropagation to minimise loss, and passed through the activation function "F" to generate the output. (b) The input ("i/p") layer reflects the dimension of the input vector; the hidden layer contains intermediary nodes that partition the input space and combine the weighted inputs through the activation function; the output ("o/p") layer presents the final information produced by the ANN architecture. (c) Various types of neural network architectures
The feed-forward neural network was the simplest and first ANN to be deployed in bioinformatics. In such a system there is a one-way flow of information (from input to hidden to output), with no loops or closed cycles. The feed-forward network does not use backward propagation, and the weights remain static (Shao 2020). A step activation function with a 0/1 criterion is mostly used here: f(v) = 1 if v ≥ a and f(v) = 0 if v < a, where v = Σ wixi is the weighted sum of the inputs and a is the threshold. The artificial neuron is activated and outputs 1 if its weighted input is above the threshold (typically 0), and is not activated if it is below the threshold, which is represented as 0 (or sometimes -1). Such networks are fairly easy to maintain and are equipped to handle data containing a great deal of noise; they are useful for analysis because they are simple to design, fast, and reasonably robust to noise. Their main disadvantage is that they cannot be deployed for deep learning tasks, since they lack deep layering and backpropagation.
The multilayer perceptron (multilayer supervised model) is an advancement of the feed-forward neural network in which every node is fully interconnected, with multiple hidden layers placed between the input and output layers (Heidari et al. 2020). It involves both forward and backward propagation for training.
Fig. 18.3 Types of neural networks: (a) Feed forward network, (b) multi-perceptron network, (c)
radial basis network, (d) convolution network, (e) modular network, (f) recurrent neural network
Inputs are multiplied by the weights and then passed through the activation function, and backpropagation is applied so that each node adjusts its weights to reduce the loss. The weights are values learned by the network: the weights (Wi) are able to self-modify depending on the difference between the predicted outputs and the training targets. Nonlinear activation functions are used here, which makes these networks more complex and well suited to deep learning tasks. Their disadvantage is comparatively slow performance on very large data analyses.
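The following sketch puts these pieces together as a small multilayer perceptron trained by backpropagation; the architecture, data (the XOR problem), learning rate and number of epochs are toy assumptions chosen only to make the forward and backward passes explicit.

```python
# Sketch of a small multilayer perceptron trained by backpropagation (numpy only).
# Toy setup: 2 inputs, 8 hidden sigmoid units, 1 output, XOR targets, MSE loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 1.0

for epoch in range(10000):
    h = sigmoid(X @ W1 + b1)              # hidden layer (forward pass)
    out = sigmoid(h @ W2 + b2)            # output layer
    err = out - Y                         # dMSE/dout (up to a constant)
    d_out = err * out * (1 - out)         # back through the output sigmoid
    d_h = (d_out @ W2.T) * h * (1 - h)    # back through the hidden sigmoid
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;  b1 -= lr * d_h.sum(0)

print(np.round(out.ravel(), 2))           # typically approaches [0, 1, 1, 0]
```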
The convolutional neural network contains a three-dimensional arrangement of neurons rather than the standard two-dimensional one. The first layer is a convolutional layer of filters with activation maps of artificial neurons; every neuron in a convolutional filter processes data from only a small patch of the image (Chen et al. 2020). Input features are taken in batch-wise, like a filter channel: the network understands the image in parts and can repeat these operations many times to complete the processing of the full picture. Processing includes converting image values between RGB and grey scale, and further adjustments of pixel values help to identify corners so that images can be easily characterised. Propagation follows a one-directional flow; the convolutional neural network holds at least one convolutional filter followed by pooling of the information, and the output of the convolutional filters is passed to a connected neural network for classifying the images, as shown in Fig. 18.3. Filters are used to extract particular features from the image. As in the multilayer perceptron, the inputs are multiplied by weights and subjected to an activation function; convolutional networks use nonlinear activation functions followed by softmax. Convolutional neural networks show effective results in image and video recognition, semantic parsing and paraphrase detection, and they are widely deployed for machine learning analysis. Their disadvantages include design complexity and slow performance.
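The core convolution operation can be sketched in a few lines of numpy, sliding one filter over a small grey-scale array; real convolutional networks stack many such filters with pooling and fully connected layers, which this toy example omits.

```python
# Sketch of the convolution step itself: sliding a small filter over a 2-D array
# (e.g. a grey-scale image) and recording one activation per position.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # element-wise product of the filter with one image patch, then summed
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)     # simple vertical-edge detector
print(conv2d(image, edge_filter))
```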
The radial basis function (RBF) network comprises an input vector followed by a layer of RBF neurons and an output layer with one node for each class (Zaji et al. 2020). Classification is performed by measuring the input's similarity to data points from the training set: every RBF neuron stores a prototype, which is one of the examples from the training set. When a new input vector (the n-dimensional vector to be classified) needs to be categorised, every neuron computes the Euclidean distance between the input and its prototype. Each RBF neuron compares the input vector to its prototype and outputs a similarity value between 0 and 1: if the input equals the prototype, the output of that RBF neuron is 1, and as the distance between the input and the prototype grows, the response falls off exponentially towards 0, so the plot of a neuron's responses tends towards a typical bell-shaped curve. The output layer consists of a set of neurons, one per category. Applications of RBF networks are found, for example, in power restoration.
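A sketch of a single RBF layer is shown below: each neuron stores a prototype and returns a Gaussian similarity that is 1 at the prototype and decays towards 0 with Euclidean distance. The prototypes, the width parameter and the query point are toy assumptions.

```python
# Sketch of an RBF layer: each neuron stores a prototype vector and outputs a
# Gaussian similarity (1 at the prototype, decaying towards 0 with distance).
import numpy as np

def rbf_layer(x, prototypes, width=1.0):
    distances = np.linalg.norm(prototypes - x, axis=1)   # Euclidean distances
    return np.exp(-(distances ** 2) / (2 * width ** 2))  # similarity in (0, 1]

prototypes = np.array([[0.0, 0.0],
                       [1.0, 1.0],
                       [3.0, -1.0]])
x = np.array([0.9, 1.1])
print(np.round(rbf_layer(x, prototypes), 3))  # strongest response from the nearest prototype
```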
A recurrent neural network feeds information back into itself to help predict the outcome at each layer. The first layers typically show a feed-forward architecture, followed by recurrent layers that hold information from previous time steps, remembered by storage assemblies acting as memory units (Smyl 2020). Forward passes are executed in these situations, and the network retains the knowledge relevant for later use. If a prediction is wrong, the learning rate is used to make small adjustments, so the network moves stepwise towards the correct prediction during backpropagation. Its strengths are in modelling sequential data, where each example can be assumed to depend on earlier ones, and it can be combined with convolutional layers to increase pixel effectiveness. Significant drawbacks of this architecture are vanishing and exploding gradients, the difficulty of training recurrent networks, and the difficulty of processing long sequences when rectified linear units are used as the activation. LSTM (long short-term memory) networks are a kind of RNN that uses special units in addition to standard ones. LSTM units incorporate a "memory cell" that can keep information in memory for long periods of time. A set of gates is used to control when information enters the memory, when it is output, and when it is forgotten. There are three types of gates, viz. the input gate, the output gate, and the forget gate. The input gate decides how much information from the last input is kept in memory, the output gate regulates the amount of information passed to the next layer, and the forget gate controls the rate at which stored memory decays. This architecture lets LSTMs learn longer-term dependencies.
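The recurrence itself can be sketched with numpy as follows: at each time step the hidden state combines the current input with the previous hidden state, which is what lets the network retain information across a sequence. Weights and inputs here are random toy values, and the backward pass and LSTM gates are omitted.

```python
# Sketch of a vanilla recurrent update: h_t = tanh(W_x x_t + W_h h_{t-1} + b).
import numpy as np

rng = np.random.default_rng(2)
W_x = rng.normal(scale=0.5, size=(4, 3))   # input-to-hidden weights
W_h = rng.normal(scale=0.5, size=(4, 4))   # hidden-to-hidden (recurrent) weights
b = np.zeros(4)

sequence = rng.normal(size=(6, 3))         # 6 time steps, 3 features each
h = np.zeros(4)                            # initial hidden state (the "memory")
for t, x_t in enumerate(sequence):
    h = np.tanh(W_x @ x_t + W_h @ h + b)   # recurrent update carries h forward
    print(f"t={t}: h={np.round(h, 2)}")
```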
A modular neural network consists of several networks that work autonomously and perform sub-tasks. The individual networks do not generally collaborate with or signal each other during computation (Li et al. 2020); they work independently towards producing the output. An enormous and complex computational procedure can therefore be completed substantially faster by separating it into independent components. Computation speed increases because the networks do not interact with each other during processing and are only combined at the end. It is a robust and efficient type of neural network, but it sometimes suffers from moving-target problems. Modular networks are commonly used for stock market prediction, for compression of high-level input data in biological studies, and for character recognition.
Many in silico tools, servers, and algorithms (Table 18.1) are currently used in both proteomic and genomic analysis. Structural and functional aspects of interacting biomolecules within the cell can be readily assessed with neural network algorithms. Neural networks have multiple applications in bioinformatics:
Table 18.1 List of various modern in silico tools/techniques based on deep learning or neural networks (tool, source, function)

NetMHC server (Lundegaard et al. 2011): Epitope selection and prediction from bacterial and viral proteins used in vaccine designing
NeuRiPP (de los Santos 2019): Identification of genetic clusters to reveal ribosomally synthesized and post-translationally modified proteins
DeepGOPlus (Kulmanov and Hoehndorf 2020): Protein function prediction
DEEPscreen (Rifaioglu et al. 2020): Prediction of drug targets
RONN (Yang et al. 2005): Identification of disordered regions of proteins
RESCUE (Pons and Delsuc 1999): NMR spectral assignment of proteins
DeepQA (Cao et al. 2016): Estimation of single protein model quality
DeepInteract (Patel et al. 2017): Protein–protein interaction analysis
ProLanGO (Cao et al. 2017): Protein functionality assessment
DeepDrug3D (Pu et al. 2019): Drug or ligand binding pocket analysis and identification within proteins or enzymes
EpiDOCK (Atanasova et al. 2013): Molecular docking tool based on MHC class II interactions with epitopes
DeepLNC (Tripathi et al. 2016): Identification of long non-coding RNA elements
DeepRibo (Clauwaert et al. 2019): Gene annotation for prokaryotes based on ribosome profiling signals and binding site patterns
Afann (Tang et al. 2019): Alignment-free genetic sequence comparison
SECLAF (Szalkai and Grolmusz 2018): Biological sequence classification
SpliceFinder (Wang et al. 2019): Prediction of splice sites using a convolutional neural network architecture
DeepImpute (Arisdakessian et al. 2019): Imputation of single-cell RNA-seq data
DanQ (Quang and Xie 2016): Quantification of DNA function
RNAsamba (Camargo et al. 2020): Assessment of the translational potential of RNA sequences
REVEL (Ioannidis et al. 2016): Prediction of the pathogenicity of rare missense DNA variants; assists in cancer biology
(b) The SSpro8 program classifies secondary structure into the eight classes produced by the DSSP (Dictionary of Secondary Structure of Proteins) program; the types include the 3/10 helix, alpha helix, pi helix, extended strand in parallel and/or anti-parallel β-sheet conformation, isolated β-bridge, hydrogen-bonded turn, bend, and coil. The 8-state secondary structure is frequently collapsed into a 3-state secondary structure (Pollastri et al. 2002). Protein structural disorder can also be predicted using feed-forward neural networks (Li et al. 1999). Artificial neural networks are also used for determining protein functional properties, such as emulsification and foaming, to assist the food industry (Arteaga and Nakai 1993).
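A small sketch of the 8-state to 3-state reduction mentioned above is given below; several grouping conventions exist in the literature, and the mapping used here (helices H/G/I, strands E/B, everything else coil) is one common choice rather than the only one.

```python
# Sketch of one common DSSP 8-state to 3-state reduction (not the only convention).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",            # alpha, 3/10 and pi helices -> helix
    "E": "E", "B": "E",                      # extended strand and beta-bridge -> strand
    "T": "C", "S": "C", "C": "C", "-": "C",  # turn, bend, coil -> coil
}

def reduce_ss(eight_state):
    return "".join(EIGHT_TO_THREE.get(s, "C") for s in eight_state)

print(reduce_ss("CCHHHHTTEEEEESSGGGC"))  # -> CCHHHHCCEEEEECCHHHC
```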
Fig. 18.4 Neural networks in epitope based vaccine crafting for viral pathogens
18.5 Conclusion
Institutional computing facilities have improved greatly over the past decade. The amalgamation of neural networks with advanced servers will assist rapid drug discovery, effective and error-free vaccine design, fast alignments, structural predictions, physicochemical analysis of biomolecules, and more. The modern world should not lack food security or medicinal treatments; to fulfil this broad societal vision, neural networks offer the power to integrate, access, and analyze big data related to agriculture, animal husbandry, medicine, and physiology. Neural networks are a constituent of the deep learning domain of artificial intelligence and machine learning, and they are significant for analyzing relationships among the integral features of the IoT and big data (Mohammadi et al. 2018). A neural network, as the name suggests, is a web of interconnected artificial neurons joining an input layer to an output layer, and the multiple types of neural networks help users develop insight into biomolecular structures and functions. Modern fast sequencing techniques have generated enormous amounts of biological sequence data, and it is neural networks in bioinformatics that have helped researchers obtain fruitful outcomes in agriculture as well as in medicine. This is an ongoing research journey, as neural networks are still evolving and linking to ever-improving computing facilities to demonstrate the power of deep learning for data analysis rooted in big data and the IoT.
References
Amato F, López A, Peña-Méndez EM, Vaňhara P, Hampl A, Havel J (2013) Artificial neural
networks in medical diagnosis. J Appl Biomed 11(2):47–58
Arisdakessian C, Poirion O, Yunits B, Zhu X, Garmire LX (2019) DeepImpute: an accurate, fast,
and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol 20
(1):1–14
Arteaga GE, Nakai S (1993) Predicting protein functionality with artificial neural networks:
foaming and emulsifying properties. J Food Sci 58(5):1152–1156
Atanasova M, Patronov A, Dimitrov I, Flower DR, Doytchinova I (2013) EpiDOCK: a molecular
docking-based tool for MHC class II binding prediction. Protein Eng Des Sel 26(10):631–634
Babiker EAA, Almofti YA, Abd-Elrahman KA (2020) Novel T-lymphocytes vaccine candidates
against human mumps virus via reverse vaccinology. Eur J Biomed 7(1):45–63
Bain A (1873) Mind and body: the theories of their relation, vol 4. Henry S. King, London
Camargo AP, Sourkov V, Pereira GAG, Carazzolle MF (2020) RNAsamba: neural network-based
assessment of the protein-coding potential of RNA sequences. NAR Genom Bioinform 2(1):
lqz024
Cao R, Bhattacharya D, Hou J, Cheng J (2016) DeepQA: improving the estimation of single protein
model quality with deep belief networks. BMC Bioinform 17(1):495
Cao R, Freitas C, Chan L, Sun M, Jiang H, Chen Z (2017) ProLanGO: protein function prediction
using neural machine translation based on a recurrent neural network. Molecules 22(10):1732
Chen Y, Tang L, Yang X, Bilal M, Li Q (2020) Object-based multi-modal convolution neural
networks for building extraction using panchromatic and multispectral imagery.
Neurocomputing 386:136–146
Chicco D, Sadowski P, Baldi P (2014) Deep autoencoder neural networks for gene ontology
annotation predictions. In Proceedings of the 5th ACM conference on bioinformatics, computa-
tional biology, and health informatics, pp. 533–540
Clauwaert J, Menschaert G, Waegeman W (2019) DeepRibo: a neural network for precise gene
annotation of prokaryotes by combining ribosome profiling signal and binding site patterns.
Nucleic Acids Res 47(6):e36–e36
Collins FS, Morgan M, Patrinos A (2003) The human genome project: lessons from large-scale
biology. Science 300(5617):286–290
de los Santos EL (2019) NeuRiPP: neural network identification of RiPP precursor peptides. Sci
Rep 9(1):1–9
Evans RB (1990) William James, “the principles of psychology,” and experimental psychology.
Am J Psychol 103(4):433–447
Galushkin AI (2007) Neural networks theory. Springer, Berlin
Giorgini E, Biavasco F, Galeazzi R, Gioacchini G, Giovanetti E, Mobbili G et al (2020) Synthesis,
structural insights and activity of different classes of biomolecules. In: The First Outstanding
50 Years of "Università Politecnica delle Marche". Springer, Cham, pp 463–482
Heidari AA, Faris H, Mirjalili S, Aljarah I, Mafarja M (2020) Ant lion optimizer: theory, literature
review, and application in multi-layer perceptron neural networks. In: Nature-inspired
optimizers. Springer, Cham, pp 23–46
Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational
abilities. Proc Natl Acad Sci 79(8):2554–2558
Ioannidis NM, Rothstein JH, Pejaver V, Middha S, McDonnell SK, Baheti S et al (2016) REVEL:
an ensemble method for predicting the pathogenicity of rare missense variants. Am J Hum Genet
99(4):877–885
Jespersen MC, Peters B, Nielsen M, Marcatili P (2017) BepiPred-2.0: improving sequence-based
B-cell epitope prediction using conformational epitopes. Nucleic Acids Res 45(W1):W24–W29
Jiménez J, Skalic M, Martinez-Rosell G, De Fabritiis G (2018) K deep: protein–ligand absolute
binding affinity prediction via 3d-convolutional neural networks. J Chem Inf Model 58
(2):287–296
Joshi A, Joshi BC, Mannan MAU, Kaushik V (2020) Epitope based vaccine prediction for SARS-
COV-2 by deploying immuno-informatics approach. Inform Med Unlocked 19:100338
Joshi A, Kaushik V (2020) In-Silico proteomic exploratory quest: crafting T-cell epitope vaccine
against Whipple’s disease. Int J Pept Res Ther 27:169–179
Kelley DR, Reshef YA, Bileschi M, Belanger D, McLean CY, Snoek J (2018) Sequential regulatory
activity prediction across chromosomes with convolutional neural networks. Genome Res 28
(5):739–750
Khan E (2020) Neural fuzzy based intelligent systems and applications. In: Fusion of neural
networks, fuzzy systems and genetic algorithms. CRC Press, Boca Raton, FL, pp 105–140
Kulmanov M, Hoehndorf R (2020) DeepGOPlus: improved protein function prediction from
sequence. Bioinformatics 36(2):422–429
Li W, Li M, Qiao J, Guo X (2020) A feature clustering-based adaptive modular neural network for
nonlinear system modeling. ISA Trans 100:185–197
Li X, Romero P, Rani M, Dunker AK, Obradovic Z (1999) Predicting protein disorder for N-, C-and
internal regions. Genome Inform 10:30–40
Lundegaard C, Lund O, Nielsen M (2011) Prediction of epitopes using neural network based
methods. J Immunol Methods 374(1–2):26–34
McGuffin LJ, Bryson K, Jones DT (2000) The PSIPRED protein structure prediction server.
Bioinformatics 16(4):404–405
Mohammadi M, Al-Fuqaha A, Sorour S, Guizani M (2018) Deep learning for IoT big data and
streaming analytics: a survey. IEEE Commun Surv Tutorials 20(4):2923–2960
Morgat A, Lombardot T, Coudert E, Axelsen K, Neto TB, Gehant S et al (2020) Enzyme annotation
in UniProtKB using Rhea. Bioinformatics 36(6):1896–1901
Patel S, Tripathi R, Kumari V, Varadwaj P (2017) DeepInteract: deep neural network based protein-
protein interaction prediction tool. Curr Bioinform 12(6):551–557
Pedersen AG, Nielsen H (1997) Neural network prediction of translation initiation sites in
eukaryotes: perspectives for EST and genome analysis. Proc Inst Conf Intell Syst Mol Biol
5:226–233
Pollastri G, Przybylski D, Rost B, Baldi P (2002) Improving the prediction of protein secondary
structure in three and eight classes using recurrent neural networks and profiles. Proteins Struct
Funct Bioinform 47(2):228–235
Pons JL, Delsuc MA (1999) RESCUE: an artificial neural network tool for the NMR spectral
assignment of proteins. J Biomol NMR 15(1):15–26
Pu L, Govindaraj RG, Lemoine JM, Wu HC, Brylinski M (2019) DeepDrug3D: classification of
ligand-binding pockets in proteins with a convolutional neural network. PLoS Comput Biol 15
(2):e1006718
Quang D, Xie X (2016) DanQ: a hybrid convolutional and recurrent deep neural network for
quantifying the function of DNA sequences. Nucleic Acids Res 44(11):e107–e107
Rifaioglu AS, Nalbat E, Atalay V, Martin MJ, Cetin-Atalay R, Doğan T (2020) DEEPScreen: high
performance drug–target interaction prediction with convolutional neural networks using 2-D
structural compound representations. Chem Sci 11(9):2531–2557
Roche DB, Buenavista MT, McGuffin LJ (2012) FunFOLDQA: a quality assessment tool for
protein-ligand binding site residue predictions. PLoS One 7(5):e38219
Saha S, Raghava GPS (2006) Prediction of continuous B-cell epitopes in an antigen using recurrent
neural network. Proteins Struct Funct Bioinform 65(1):40–48
Shao C (2020) A quantum model of feed-forward neural networks with unitary learning algorithms.
Quantum Inf Process 19(3):102
Smyl S (2020) A hybrid method of exponential smoothing and recurrent neural networks for time
series forecasting. Int J Forecast 36(1):75–85
Stranzl T, Larsen MV, Lundegaard C, Nielsen M (2010) NetCTLpan: pan-specific MHC class I
pathway epitope predictions. Immunogenetics 62(6):357–368
Szalkai B, Grolmusz V (2018) SECLAF: a webserver and deep neural network design tool for
hierarchical biological sequence classification. Bioinformatics 34(14):2487–2489
Tang K, Ren J, Sun F (2019) Afann: bias adjustment for alignment-free sequence comparison based
on sequencing data using neural network regression. Genome Biol 20(1):1–17
Tripathi R, Patel S, Kumari V, Chakraborty P, Varadwaj PK (2016) DeepLNC, a long non-coding
RNA prediction tool using deep neural network. Network Model Anal Health Inform Bioinform
5(1):21
Wang R, Wang Z, Wang J, Li S (2019) SpliceFinder: ab initio prediction of splice sites using
convolutional neural network. BMC Bioinform 20(23):652
Wu CH, McLarty JW (2012) Neural networks and genome informatics. Elsevier, Amsterdam
Wu Q, Peng Z, Anishchenko I, Cong Q, Baker D, Yang J (2020) Protein contact prediction using
metagenome sequence data and residual neural networks. Bioinformatics 36(1):41–48
Yang ZR, Thomson R, Mcneil P, Esnouf RM (2005) RONN: the bio-basis function neural network
technique applied to the detection of natively disordered regions in proteins. Bioinformatics 21
(16):3369–3376
Zaji AH, Bonakdari H, Khameneh HZ, Khodashenas SR (2020) Application of optimized artificial
and radial basis neural networks by using modified genetic algorithm on discharge coefficient
prediction of modified labyrinth side weir with two and four cycles. Measurement 152:107291
Zeng M, Zhang F, Wu FX, Li Y, Wang J, Li M (2020) Protein–protein interaction site prediction
through combining local and global features with deep neural networks. Bioinformatics 36
(4):1114–1120
Role of Bioinformatics in MicroRNA Analysis
19
Indra Mani
Abstract
Keywords
miRNA · Tools · Databases · Bioinformatics · RNA
I. Mani (*)
Department of Microbiology, Gargi College, University of Delhi, New Delhi, India
19.1 Introduction
As per recent updates, 38,589 miRNA entries are recorded in the miRBase database (http://www.mirbase.org/, release 22.1, October 2018), a repository of annotated and published miRNA sequences. MicroRNAs (miRNAs) are endogenously expressed small (~22 nucleotide) single-stranded RNAs that bind to target mRNAs, regulate gene expression, and are able to interfere post-transcriptionally with the protein production of their targets (Bartel 2004; Lewis et al. 2005; Alvarez-Garcia and Miska 2005). The first miRNA, lin-4, was discovered in 1993 as a suppressor of lin-14 gene expression in Caenorhabditis elegans (Lee et al. 1993), and the second, let-7, was identified in the same organism as a suppressor of lin-41 expression (Reinhart et al. 2000). These miRNAs work by the same mechanism: both bind to the 3′-untranslated regions of their target genes, lin-14 and lin-41, respectively. Interestingly, more than 1000 miRNAs are encoded by the human genome; they may target approximately 60% of mammalian genes and are abundant in different human cell types (Bartel 2004; Bentwich et al. 2005; Friedman et al. 2009; Hennessy 2017; Narożna et al. 2017; Andrei et al. 2019; Hargreaves et al. 2020).
A typical structural characteristic of miRNAs is that they are initially transcribed as long primary transcripts (pri-miRNAs), which are processed into precursor stem-loop hairpin RNAs of about 70 nucleotides (Lee et al. 2004). These pre-miRNAs enter the cytoplasm from the nucleus through the nuclear transport receptor exportin-5, where Dicer processes them into mature miRNAs (about 22 nucleotides each). The mature miRNA is then incorporated into a miRNA-containing RNA-induced silencing complex (miRISC) (Yi et al. 2003; Cullen 2004; Ambros 2004). This machinery can be used to artificially induce cleavage of a target, either by altering the target or by using miRNA sequences that hybridise to it (Zeng et al. 2002; Boden et al. 2004). Translational repression or mRNA cleavage are the methods used by mature miRNAs to regulate gene expression. The involvement of miRNAs in numerous core cellular pathways as well as in many human diseases further underlines their biological significance (Bartel 2004; Alvarez-Garcia and Miska 2005; Mani et al. 2016).
The large number of available miRNA studies shows that miRNAs are involved in numerous pathways (Kontaraki et al. 2014; Huang et al. 2012; Zhu et al. 2011). However, their imbalance can cause defective cell functions and lead to disease. In addition, miRNAs are involved in the regulation of various biological processes (Fu et al. 2013; Tüfekci et al. 2014; Vishnoi and Rani 2017; Correia de Sousa et al. 2019), and it has been suggested that they could be used in prognosis, diagnosis, and therapeutics (Mani and Vasdev 2018). Remarkably, owing to a rapid increase of biological data in the form of sequences, structures, pathways, and interactions, the biological sciences have become data-rich. Various databases and tools are available to retrieve and analyse miRNAs from different organisms; the following databases and tools are discussed in detail.
19.2.1 miRBase
Various tools are available for the prediction of miRNAs; they are described in detail below.
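miRBase distributes its annotated miRNA sequences as downloadable FASTA files, such as the file of mature miRNA sequences. As a minimal, hedged sketch of how such a file might be used programmatically (the local file name mature.fa and the example identifier hsa-let-7a-5p are assumptions for illustration, not part of any miRBase tooling), the following Python code parses the file and looks up an entry:

```python
# Minimal sketch: parse a locally downloaded miRBase mature miRNA FASTA file
# and look up sequences by identifier. File name and example ID are assumptions.

def read_fasta(path):
    """Return a dict mapping the first token of each FASTA header to its sequence."""
    records, name, seq = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name:
                    records[name] = "".join(seq)
                name, seq = line[1:].split()[0], []
            elif line:
                seq.append(line)
    if name:
        records[name] = "".join(seq)
    return records

if __name__ == "__main__":
    mirnas = read_fasta("mature.fa")          # hypothetical local copy of the miRBase file
    print(len(mirnas), "mature miRNA sequences loaded")
    print(mirnas.get("hsa-let-7a-5p"))        # example identifier
```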
19.3.1 psRNATarget
Fig. 19.1 The miRBase database showing browsing of mir-121. Users can use the combined search to retrieve miRNA details. http://www.mirbase.org/index.shtml
Fig. 19.2 Database home page of the psRNATarget prediction server for miRNA prediction. http://plantgrn.noble.org/psRNATarget/
stranded precursors of plant sRNAs (Axtell 2013). psRNATarget identifies plant sRNA targets by evaluating the complementarity between the target mRNA sequence and the sRNA sequence with a scoring scheme and by considering target-site accessibility. The scoring procedure is adjustable and accounts for both canonical and non-canonical targets (Dai et al. 2018). psRNATarget has been used to predict 20,815 unigene targets (Ye et al. 2019). In addition, psRNATarget has been utilized for the analysis of miRNAs from various plants such as Hordeum vulgare (Lv et al. 2012), Coffea canephora (Loss-Morais et al. 2014), tomato (Luan et al. 2014), Beta vulgaris (Li et al. 2015), Chlamydomonas reinhardtii (Hajieghrari et al. 2016), Brassica rapa (Hajieghrari et al. 2017; Zhou et al. 2020), Arachis hypogaea (Rajendiran et al. 2019), Oryza sativa (Jabbar et al. 2019), and Passiflora edulis (Paul et al. 2020). This server (Fig. 19.2) is widely used to retrieve and analyze plant miRNAs and siRNAs.
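The complementarity-based scoring idea described above can be illustrated with a simplified sketch. The code below is not psRNATarget's algorithm; it is a toy scheme with assumed penalty values that rewards Watson-Crick pairing and penalises G:U wobble pairs, mismatches, and gaps between a small RNA and a candidate target window:

```python
# Toy complementarity score between a plant sRNA and a candidate mRNA site.
# Penalty values are illustrative assumptions, not psRNATarget's parameters.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}
WOBBLE = {("G", "U"), ("U", "G")}

def site_penalty(srna, site, mismatch=1.0, wobble=0.5, gap=2.0):
    """Lower penalty = better complementarity; srna is 5'->3', site is the target read 3'->5'."""
    penalty = 0.0
    for a, b in zip(srna.upper(), site.upper()):
        if a == "-" or b == "-":
            penalty += gap
        elif COMPLEMENT.get(a) == b:
            continue                      # perfect Watson-Crick pair
        elif (a, b) in WOBBLE:
            penalty += wobble             # G:U wobble pair
        else:
            penalty += mismatch           # mismatch
    return penalty

def best_site(srna, mrna):
    """Slide the sRNA along the mRNA read 3'->5' and return (penalty, offset) of the best window."""
    rev = mrna.upper()[::-1]              # antiparallel orientation
    if len(rev) < len(srna):
        return None
    windows = [(site_penalty(srna, rev[i:i + len(srna)]), i)
               for i in range(len(rev) - len(srna) + 1)]
    return min(windows)

# Example: let-7 against a sequence containing its perfect complementary site.
print(best_site("UGAGGUAGUAGGUUGUAUAGUU", "AACUAUACAACCUACUACCUCAAAGG"))
```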
Fig. 19.3 Database home page of RNAhybrid prediction server for miRNA prediction
http://bibiserv.techfak.uni-bielefeld.de/rnahybrid
and superovulation (Khan et al. 2015), in the invasion and metastasis of colorectal
carcinoma (Yang et al. 2015), and in colon cancer (Xiong et al. 2019). This server (Fig. 19.3) is widely used to retrieve and analyze miRNAs and to calculate free energies rapidly and accurately.
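RNAhybrid can also be run locally from the command line, which is convenient for batch analyses. The sketch below wraps such a call in Python; the flag names (-b, -e, -t, -q) follow the commonly documented RNAhybrid interface but are stated here as assumptions to be verified against the locally installed version, and the input file names are placeholders:

```python
# Hedged sketch: call a locally installed RNAhybrid binary via subprocess.
# Flag names (-b, -e, -t, -q) are assumed from the commonly documented
# interface; verify them with the help output of your installation.
import shutil
import subprocess

def run_rnahybrid(target_fasta, mirna_fasta, max_hits=1, energy_cutoff=-20):
    """Run RNAhybrid on a target/query FASTA pair and return its raw text output."""
    if shutil.which("RNAhybrid") is None:
        raise RuntimeError("RNAhybrid executable not found on PATH")
    cmd = [
        "RNAhybrid",
        "-b", str(max_hits),        # number of hits per target (assumed flag)
        "-e", str(energy_cutoff),   # free energy cutoff in kcal/mol (assumed flag)
        "-t", target_fasta,         # target sequences (assumed flag)
        "-q", mirna_fasta,          # miRNA query sequences (assumed flag)
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    print(run_rnahybrid("targets.fa", "mirnas.fa"))  # hypothetical input files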
Fig. 19.4 Database home page of miRscan prediction server for miRNA prediction
http://hollywood.mit.edu/mirscan/index.html
References
Alvarez-Garcia I, Miska EA (2005) MicroRNA functions in animal development and human
disease. Development 132(21):4653–4662
Ambros V (2004) The functions of animal microRNAs. Nature 431:350–355
Andrei D, Nagy RA, van Montfoort A, Tietge U, Terpstra M, Kok K, van den Berg A, Hoek A,
Kluiver J, Donker R (2019) Differential miRNA expression profiles in cumulus and mural
Granulosa cells from human preovulatory follicles. Microrna 8(1):61–67
Axtell MJ (2013) Classification and comparison of small RNAs from plants. Annu Rev Plant Biol
64:137–159
Bartel DP (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116
(2):281–297
Bentwich I, Avniel A, Karov Y et al (2005) Identification of hundreds of conserved and
nonconserved human microRNAs. Nat Genet 37(7):766–770
Boden D, Pusch O, Silbermann R, Lee F, Tucker L, Ramratnam B (2004) Enhanced gene silencing
of HIV-1 specific siRNA using microRNA designed hairpins. Nucleic Acids Res 32:1154–1158
Correia de Sousa M, Gjorgjieva M, Dolicka D, Sobolewski C, Foti M (2019) Deciphering miRNAs’
action through miRNA editing. Int J Mol Sci 20(24):6249
Cullen BR (2004) Transcription and processing of human microRNA precursors. Mol Cell
16:861–865
Dai X, Zhao PX (2011) psRNATarget: a plant small RNA target analysis server. Nucleic Acids Res
39:W155–W159
Dai X, Zhuang Z, Zhao PX (2018) psRNATarget: a plant small RNA target analysis server (2017
release). Nucleic Acids Res 46:W49–W54
Enright AJ, John B, Gaul U et al (2003) microRNA targets in Drosophila. Genome Biol 5:R1
Fan J, Li H, Nie X, Yin Z, Zhao Y, Zhang X, Yuan S, Li Y, Chen C, Wang DW (2018) MiR-665
aggravates heart failure via suppressing CD34-mediated coronary microvessel angiogenesis.
Aging 10(9):2459–2479
Feng X, Wu J, Zhou S, Wang J, Hu W (2018) Characterization and potential role of microRNA in
the Chinese dominant malaria mosquito Anopheles sinensis (Diptera: Culicidae) throughout
four different life stages. Cell Biosci 12(8):29
Friedman RC, Farh KK, Burge CB, Bartel DP (2009) Most mammalian mRNAs are conserved
targets of microRNAs. Genome Res 19(1):92–105
Fu G, Brkic J, Hayder H, Peng C (2013) MicroRNAs in human placental development and
pregnancy complications. Int J Mol Sci 14:5519–5544
Grey F, Antoniewicz A, Allen E, Saugstad J, McShea A, Carrington JC, Nelson J (2005) Identifi-
cation and characterization of human cytomegalovirus-encoded microRNAs. J Virol 79
(18):12095–12099
Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ (2006) miRBase: microRNA
sequences, targets and gene nomenclature. Nucleic Acids Res 34:D140–D144
Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ (2008) miRBase: tools for microRNA
genomics. Nucleic Acids Res 36:D154–D158
Hajieghrari B, Farrokhi N, Goliaei B, Kavousi K (2016) Identification and characterization of novel
miRNAs in Chlamydomonas reinhardtii by computational methods. Microrna 5(1):66–77
Hajieghrari B, Farrokhi N, Goliaei B, Kavousi K (2017) Computational identification of
microRNAs and their transcript target(s) in field mustard (Brassica rapa L.). Iran J Biotechnol
15(1):22–32
Hargreaves BKV, Roberts SE, Derfalvi B, Boudreau JE (2020) Highly efficient serum-free manip-
ulation of miRNA in human NK cells without loss of viability or phenotypic alterations is
accomplished with TransIT-TKO. PLoS One 15(4):e0231664
Hennessy E (2017) MiRNA profiling in human induced pluripotent stem cells. Methods Mol Biol
1509:47–56
Huang W, Feng Y, Liang J (2012) MicroRNA-128 regulates Isl-1 via Nkx-2.5/gsh-2 competition
during cardiac development. Circulation 126:A11394
Jabbar B, Iqbal MS, Batcho AA, Nasir IA, Rashid B, Husnain T, Henry RJ (2019) Target prediction
of candidate miRNAs from Oryza sativa for silencing the RYMV genome. Comput Biol Chem
83:107127
Khan HA, Zhao Y, Wang L, Li Q, Du YA, Dan Y, Huo LJ (2015) Identification of miRNAs during
mouse postnatal ovarian development and superovulation. J Ovarian Res 8:44
Kiriakidou M, Nelson PT, Kouranov A et al (2004) A combined computational experimental
approach predicts human microRNA targets. Genes Dev 18:1165–1178
Kontaraki JE, Marketou ME, Zacharis EA, Parthenakisand FI, Vardas PE (2014) Differential
expression of vascular smooth muscle-modulating microRNAs in human peripheral blood
mononuclear cells: novel targets in essential hypertension. J Hum Hypertens 28:510–516
Kozomara A, Birgaoanu M, Griffiths-Jones S (2019) miRBase: from microRNA sequences to
function. Nucleic Acids Res 47:D155–D162
Kozomara A, Griffiths-Jones S (2011) miRBase: integrating microRNA annotation and deep-
sequencing data. Nucleic Acids Res 39:D152–D157
Kozomara A, Griffiths-Jones S (2014) miRBase: annotating high confidence microRNAs using
deep sequencing data. Nucleic Acids Res 42:D68–D73
Krek A, Grun D, Poy MN et al (2005) Combinatorial microRNA target predictions. Nat Genet
37:495–500
Kruger J, Rehmsmeier M (2006) RNAhybrid: microRNA target prediction easy, fast and flexible.
Nucleic Acids Res 34:W451–W454
Lee RC, Feinbaum RL, Ambros V (1993) The C. elegans heterochronic gene lin-4 encodes small
RNAs with antisense complementarity to lin-14. Cell 75:843–854
Lee Y, Kim M, Han J, Yeom KH, Lee S, Baek SH, Kim VN (2004) MicroRNA genes are
transcribed by RNA polymerase II. EMBO J 23:4051–4060
Lewis BP, Burge CB, Bartel DP (2005) Conserved seed pairing, often flanked by adenosines,
indicates that thousands of human genes are microRNA targets. Cell 120:15–20
Lewis BP, Shih IH, Jones-Rhoades MW et al (2003) Prediction of mammalian microRNA targets.
Cell 115:787–798
Li JG, Ding Y, Huang YM, Chen WL, Pan LL, Li Y, Chen XL, Chen Y, Wang SY, Wu XN (2017)
FAMLF is a target of miR-181b in Burkitt lymphoma. Braz J Med Biol Res 50(6):e5661
Li JL, Cui J, Cheng DY (2015) Computational identification and characterization of conserved
miRNAs and their target genes in beet (Beta vulgaris). Genet Mol Res 14(3):9103–9108
Li W, Dong X, He C, Tan G, Li Z, Zhai B, Feng J, Jiang X, Liu C, Jiang H, Sun X (2019) LncRNA
SNHG1 contributes to sorafenib resistance by activating the Akt pathway and is positively
regulated by miR-21 in hepatocellular carcinoma cells. J Exp Clin Cancer Res 38(1):183
Lim LP, Glasner ME, Yekta S, Burge CB, Bartel DP (2003a) Vertebrate microRNA genes. Science
299(5612):1540
Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB, Bartel DP
(2003b) The microRNAs of Caenorhabditis elegans. Genes Dev 17(8):991–1008
Loss-Morais G, Ferreira DC, Margis R, Alves-Ferreira M, Corrêa RL (2014) Identification of novel
and conserved microRNAs in Coffea canephora and Coffea arabica. Genet Mol Biol 37
(4):671–682
Luan Y, Wang W, Liu P (2014) Identification and functional analysis of novel and conserved
microRNAs in tomato. Mol Biol Rep 41(8):5385–5394
Lv S, Nie X, Wang L, Du X, Biradar SS, Jia X, Weining S (2012) Identification and characterization
of microRNAs from barley (Hordeum vulgare L.) by high-throughput sequencing. Int J Mol Sci
13(3):2973–2984
Mani I, Garg R, Pandey KN (2016) Role of FQQI motif in the internalization, trafficking, and
signaling of guanylyl-cyclase/natriuretic peptide receptor-a in cultured murine mesangial cells.
Am J Physiol Renal Physiol 310(1):F68–F84
Mani I, Vasdev K (2018) MicroRNA in prognosis, diagnosis and therapy of Cancer. Cell Cell Life
Sci J 3:000134
Narożna B, Langwinski W, Jackson C, Lackie P, Holloway JW, Szczepankiewicz A (2017)
MicroRNA-328 is involved in wound repair process in human bronchial epithelial cells. Respir
Physiol Neurobiol 242:59–65
Paul S, de la Fuente-Jiménez JL, Manriquez CG, Sharma A (2020) Identification, characterization
and expression analysis of passion fruit (Passiflora edulis) microRNAs. 3 Biotech 10(1):25
Rajendiran A, Vijayakumar S, Pan A (2019) Exploring microRNAs, target mRNAs and their
functions in leguminous plant Arachis hypogaea. Microrna 8(2):135–146
Rehmsmeier M, Steffen P, Hochsmann M, Giegerich R (2004) Fast and effective prediction of
microRNA/target duplexes. RNA 10(10):1507–1517
Reinhart BJ, Slack FJ, Basson M et al (2000) The 21-nucleotide let-7 RNA regulates developmental
timing in Caenorhabditis elegans. Nature 403:901–906
Song J, Jiang X, Hu Y, Li H, Zhang X, Xu J, Li W, Zheng X, Dong S (2018) High-throughput
sequencing of putative novel microRNAs in rhesus monkey peripheral blood mononuclear cells
following EV71 and CA16 infection. Intervirology 61(3):133–142
Sun G, Wang Y, Zhang J, Lin N, You Y (2018) MiR-15b/HOTAIR/p53 form a regulatory loop that
affects the growth of glioma cells. J Cell Biochem 119(6):4540–4547
Tüfekci KU, Meuwissen RL, Genç S (2014) The role of microRNAs in biological processes.
Methods Mol Biol 1107:15–31
Vijayaraghavan B, Danabal K, Padmanabhan G, Ramanathan K (2018) Study on regulation of low
density lipoprotein cholesterol metabolism using PCSK9 gene silencing: a computational
approach. Bioinformation 14(5):248–251
Vishnoi A, Rani S (2017) MiRNA biogenesis and regulation of diseases: an overview. Methods
Mol Biol 1509:1–10
Wang Y, Peng M, Chen Y, Wang W, He Z, Yang Z, Lin Z, Gong M, Yin Y, Zeng Y (2019)
Analysis of Panax ginseng miRNAs and their target prediction based on high-throughput
sequencing. Planta Med 85(14–15):1168–1176
Xia W, Cao G, Shao N (2009) Progress in miRNA target prediction and identification. Sci China C
Life Sci 52(12):1123–1130
Xiong W, Wang X, Cai X, Xiong W, Liu Y, Li C, Liu Q, Qin J, Li Y (2019) Identification of tRNA-
derived fragments in colon cancer by comprehensive small RNA sequencing. Oncol Rep 42
(2):735–744
Yan N, Lu Y, Sun H, Tao D, Zhang S, Liu W, Ma Y (2007) A microarray for microRNA profiling in
mouse testis tissues. Reproduction 134(1):73–79
Yang B, Tan Z, Song Y (2015) Study on the molecular regulatory mechanism of MicroRNA-195 in
the invasion and metastasis of colorectal carcinoma. Int J Clin Exp Med 8(3):3793–3800
Ye J, Han W, Fan R, Liu M, Li L, Jia X (2019) Integration of transcriptomes, small RNAs, and
degradome sequencing to identify putative miRNAs and their targets related to Eu-rubber
biosynthesis in Eucommia ulmoides. Genes 10(8):623
Yi R, Qin Y, Macara IG, Cullen BR (2003) Exportin-5 mediates the nuclear export of
pre-microRNAs and short hairpin RNAs. Genes Dev 17:3011–3016
Zeng Y, Wagner EJ, Cullen BR (2002) Both natural and designed micro RNAs can inhibit the
expression of cognate mRNAs when expressed in human cells. Mol Cell 9:1327–1333
Zhou B, Leng J, Ma Y, Fan P, Li Y, Yan H, Xu Q (2020) BrmiR828 targets BrPAP1, BrMYB82,
and BrTAS4 involved in the light induced anthocyanin biosynthetic pathway in Brassica rapa.
Int J Mol Sci Jun 21(12):4326
Zhu H, Yang Y, Wang Y, Li J, Schiller PW, Peng T (2011) MicroRNA-195 promotes palmitate-
induced apoptosis in cardiomyocytes by down-regulating Sirt1. Cardiovasc Res 92(1):75–84
Bioinformatics for Image Processing
20
Dilip Kumar J. Saini and Dhirendra Siddharth
Abstract
Keywords
20.1 Introduction
The deluge of complex biomedical and biological images presents huge obstacles for the image processing community. As a natural extension of the existing field of biomedical image analysis, an emerging engineering area develops and optimises image data processing and informatics strategies to handle, capture, scan, and compare the biological information in these images. This new field is known as bioimage informatics. However, the quantity of information and the high complexity of bioimages, such as very high cell density (e.g. microglia, neurons, astrocytes) or entangled and very fast microtubular growth in a four-dimensional live-cell film (Mathews and Jezzard 2004), make it extremely difficult to apply current medical imaging techniques directly to bioimage computing problems. A single biological image stack usually contains multiple colour channels and is large. The objects of interest in an image, such as the 3D structures of neurons (Wiemer et al. 2003), can show drastic differences in intensity and morphology from image to image. It is also not unusual for millions of images to have to be handled automatically in a high-throughput manner within hours or days rather than months or years of manual labour. These difficulties call for new systems and algorithms for bioimage informatics, mainly in three areas: image processing and mining, visualisation, and image databases.
Much research in bioimage informatics is ongoing or has been carried out over the past few decades, and a series of very fruitful workshops has been organised to address the latest trends in this area. The aim of this section is to briefly examine the advancement of bioimage informatics in terms of its main methods, areas of application, resource availability, and tools.
Diagnostic imaging is the practice of making graphic pictures of the internal systems of body parts for medical treatment and diagnosis, along with a direct picture of the function of the internal tissue. The approach supports both the diagnosis of a condition and recovery from it. It builds a catalogue of the normal configuration and operation of the organs, which makes abnormalities easier to recognise. The technique covers both radiological and organic imaging using scopes, magnetic and thermal imaging, isotopes, sonography, and electromagnetic energy (X-rays). Several other devices are used to record details about the function and location of the body; compared with the modalities that generate images, those techniques have several drawbacks. Billions of images are produced annually worldwide for different diagnostic purposes, and digital images regularly serve an integral role. Analysis in medical imaging relates to image handling by computer and involves several processes, including communication, presentation, storage, and image retrieval. An image represents measured features such as the colour or illumination of a visible scene. Digital images have many benefits, such as adaptable handling, easy and economical reproduction, immediate quality appraisal, multiple copies with quality preservation, quick storage and communication, and faster and cheaper processing. Their drawbacks include the need for faster processors, the need for large-capacity memory, the loss of quality when resizing, and the risk of copyright exploitation.
Image processing is the use of computers to manipulate digital images. The technique has many benefits, including connectivity, data management, adaptability, and elasticity, and it supports many sets of operations that can be applied to images in parallel. Digital imaging methods have been used for multidimensional manipulation of 2D and 3D images in areas such as television, the humanities, therapeutic applications, and environmental enhancement. Over time, the editing of images has become cheap, simple, and quick.
Medical imaging techniques (MIT), alongside laboratory tests such as blood and specimen tests, are among the most common medical examinations. Over the past decade, medical imaging has undergone a revolution with faster, more accurate, and less invasive devices. Medical imaging techniques can be seen as instruments for learning more about neurobiology and human behaviour (Wang et al. 2016). In these techniques, energy from a source interacts with the body part being examined, a sensor or detector records the response, and an algorithm then works on the data provided by the detector and displays the result. Depending on the energy source, different tools can be used to see into the patient, for example by sensing the signal emanating from the body. Important techniques covered in this chapter include magnetic resonance imaging, computed tomography, computed thermography, single-photon emission tomography, optical imaging, X-ray radiography, radionuclide imaging, positron emission tomography (PET), ultrasonography, and elastography. Worldwide, 5 billion medical imaging examinations had been carried out by 2010 (Roobottom et al. 2010).
X-ray radiography is a medical tool that uses electromagnetic ionising radiation, for example X-rays, to examine body parts. With wavelengths between about 0.01 and 10 nanometres, X-rays are high-energy ionising radiation capable of penetrating solids. For medical imaging, X-rays travel through the body, forming a profile as they are absorbed or attenuated to various degrees depending on the atomic number and density of the different tissues (Spahn 2013). In X-ray imaging, the information is recorded by a detector.
Computed tomography (CT) is a medical method that combines a computer and a display with X-ray equipment to produce images of parts of the human body. In this method, X-ray film is replaced by a detector that measures the X-ray data. Inside the CT scanner is a spinning frame with a detector positioned on one side and the X-ray tube on the other (Xu and Tsui 2014). An X-ray beam is generated as the rotating frame spins the X-ray tube and detector around the body, and each full rotation of the detector and X-ray tube yields an image or slice. The process is repeated, and the computer displays a two-dimensional version of each slice. Three-dimensional computed tomography can be acquired by spiral CT (Xu and Tsui 2014), which collects a volume of patient anatomy data in one pass; this volume of data can then be reconstructed into three-dimensional images.
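The reconstruction step described above, in which many one-dimensional X-ray profiles acquired around the body are turned into a two-dimensional slice, can be illustrated with the classic filtered back-projection algorithm. The sketch below uses scikit-image's Radon transform utilities on a synthetic phantom; it illustrates the principle only and is not the reconstruction software used in clinical scanners:

```python
# Illustration of CT slice reconstruction by filtered back-projection,
# using scikit-image's radon/iradon on a synthetic Shepp-Logan phantom.
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, rescale

# Synthetic "patient slice" standing in for real attenuation data.
image = rescale(shepp_logan_phantom(), scale=0.4, mode="reflect")

# Simulate the scanner: one projection (profile) per gantry angle.
theta = np.linspace(0.0, 180.0, max(image.shape), endpoint=False)
sinogram = radon(image, theta=theta)

# Reconstruct the 2D slice from the projections.
reconstruction = iradon(sinogram, theta=theta)

error = np.sqrt(np.mean((reconstruction - image) ** 2))
print(f"Reconstruction RMS error: {error:.4f}")
```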
Magnetic resonance imaging (MRI) is a diagnostic technique for imaging body tissue and monitoring body chemistry using radio-frequency and magnetic fields (Caiani et al. 2006). Its use for imaging morphological changes is based on its ability to detect the magnetic spin relaxation times and proton density variations that are characteristic of the tissue in its environment. An MR scanner has three main components: a radio-frequency (RF) system, a magnetic field gradient system, and a central magnet. The principal magnet produces the main magnetic field. The gradient system contains three orthogonal gradient coils, which are used for signal localisation. The RF system has a transmitter coil, used to excite the spin system by producing a rotating magnetic field, and a receiver coil that translates the precessing magnetisation into electrical signals. The MR scanner measures these signals and reconstructs them into images. More recently, a technique designed for measuring brain activity, called functional magnetic resonance imaging (fMRI), has been developed (Ng et al. 2009).
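Reconstructed MRI volumes are commonly stored in formats such as NIfTI for further computational analysis. As a minimal, hedged sketch of how such a volume might be loaded for processing in Python, the example below uses the nibabel library; the file name is a placeholder:

```python
# Minimal sketch: load a reconstructed MRI volume stored as NIfTI
# (file name is a placeholder) and inspect it as a NumPy array.
import nibabel as nib
import numpy as np

img = nib.load("t1_scan.nii.gz")       # hypothetical structural MRI file
data = img.get_fdata()                 # voxel intensities as a float array

print("Volume shape (voxels):", data.shape)
print("Voxel size (mm):", img.header.get_zooms())
print("Mean intensity:", float(np.mean(data)))
```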
20.2.4 Ultrasonography
Ultrasonography is used to monitor biopsy and drainage procedures. The latest Doppler scanning techniques allow blood flow in veins and arteries to be measured.
20.2.5 Elastography
obtained from it (Kennedy et al. 2009). Tactile imaging (Hoshi et al. 2010) is a diagnostic imaging method that converts a signal into the sense of touch; it has features that sense the pressure applied to the body.
depends on age. The nature of their distribution makes it possible to draw conclusions regarding body organ size.
Medical imaging is the visualisation of organs, the body, or tissues for medical diagnosis, treatment, and disease tracking. Imaging techniques span the fields of optical imaging, radiology, and nuclear medicine. There are several types of medical images, such as X-rays, and diagnostic imaging now includes advanced radiation techniques for smart healthcare systems. Magnetic imaging and magnetic resonance imaging (MRI) are also kinds of medical imaging, as are molecular imaging and CT; ultrasound and MRI, unlike conventional X-rays, work without ionising radiation. MRI employs powerful magnets that produce a strong magnetic field, which forces protons in the body to align with that field. Imaging techniques that do not require ionising radiation can be used for certain types of clinical cases. Ultrasound scans, for example, use low-frequency sound waves.
The procedures, methods, and practice of medical imaging create visual representations of the body's interior for medical practice and health research. Medical imaging aims to expose internal structures concealed by the skin and bones, both for the diagnosis and for the treatment of disease. Medical images establish a baseline database of normal anatomy and physiology that enables the detection of anomalies. Although imaging of organs and tissues removed from the body for medical purposes is possible, such processes are commonly considered part of pathology rather than diagnostic imaging. This chapter therefore focuses primarily on medical image processing tools.
Medical imaging has brought a big advance to the modern medical field. The technology matters because it can be applied before an actual operation. Several kinds of medical imaging have been created over the years, and various forms of medical imaging suit various kinds of imaging tasks; each has its own merits and demerits. There are fifteen widely used, market-driven medical image processing (MIP) tools. Many technical resources are used for medical image processing; fifteen types of tools widely used by researchers are introduced in the following section.
20.3.4.1 VTK
VTK stands for the Visualization Toolkit. The toolkit is accessible to everyone as an open source framework. VTK is 3D computer graphics software supported by Kitware, with the community working to develop it further, and technical guidance and support are provided through VTK resources. In addition, VTK has a robust information visualisation framework and a 3D widget package, enables parallel processing, and connects with numerous GUI toolkit libraries such as Qt (Hanwell et al. 2015).
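As a small, hedged illustration of how the VTK pipeline is typically used from Python (assuming the vtk Python package is installed), the sketch below builds the standard source, mapper, actor, renderer, and window chain to display a simple 3D object; a real bioimaging application would replace the sphere source with a reader for its own volume or mesh data:

```python
# Minimal VTK pipeline sketch: source -> mapper -> actor -> renderer -> window.
# A real application would replace vtkSphereSource with a data reader.
import vtk

sphere = vtk.vtkSphereSource()
sphere.SetThetaResolution(32)
sphere.SetPhiResolution(32)

mapper = vtk.vtkPolyDataMapper()
mapper.SetInputConnection(sphere.GetOutputPort())

actor = vtk.vtkActor()
actor.SetMapper(mapper)

renderer = vtk.vtkRenderer()
renderer.AddActor(actor)
renderer.SetBackground(0.1, 0.1, 0.2)

window = vtk.vtkRenderWindow()
window.AddRenderer(renderer)

interactor = vtk.vtkRenderWindowInteractor()
interactor.SetRenderWindow(window)

window.Render()
interactor.Start()
```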
20.3.4.2 ITK
ITK stands for the Insight Segmentation and Registration Toolkit. The tool provides image analysis capabilities to developers (Roobottom et al. 2010). ITK is a powerful tool that provides registration algorithms and leading-edge segmentation for data of two or more dimensions (Liu et al. 2014). It is a cross-platform, open source framework.
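ITK itself is a C++ library, but it can be used conveniently from Python. The sketch below uses the SimpleITK wrapping (an assumption about the reader's setup) to read a medical image, apply Gaussian smoothing, and write the result; the file names are placeholders:

```python
# Hedged sketch using SimpleITK (the simplified Python wrapping of ITK):
# read a medical image, smooth it, and save the result. File names are placeholders.
import SimpleITK as sitk

image = sitk.ReadImage("input_ct.nii.gz")               # hypothetical input volume
smoothed = sitk.SmoothingRecursiveGaussian(image, 2.0)   # sigma in physical units
sitk.WriteImage(smoothed, "input_ct_smoothed.nii.gz")

print("Size:", image.GetSize(), "Spacing:", image.GetSpacing())
```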
20.3.4.3 FSL
FSL (the FMRIB Software Library) is produced by the FMRIB research community at Oxford, UK. FSL is a comprehensive library of analysis tools for fMRI, MRI, and DTI brain imaging data (Smith et al. 2004). It provides important algorithm libraries for MRI images and is also widely used for research.
20.3.4.4 SPM
SPM stands for Statistical Parametric Mapping and is used for statistical analyses. The software package was designed by Karl Friston. SPM is used for brain imaging analysis and studies data sequences such as MEG, EEG, and PET. SPM helps to analyse brain anomalies and to detect abnormalities in the brain (Sowell et al. 2000).
20.3.4.5 GIMIAS
GIMIAS stands for Graphical Interface for Medical Image Analysis and Simulation. It is a powerful graphical interface that supports the solution of simulation problems and complex biomedical image computing. It provides problem-specific plug-ins and is also used for research (Larrabide et al. 2009).
20.3.4.6 NiftyReg
NiftyReg is a widely used image registration package that supports both rigid and non-rigid registration. It is open source software developed by the Translational Imaging Group (TIG 2014) and gives more efficient results for medical images than many other registration packages.
20.3.4.7 Elastix
Elastix is software for solving image registration problems. It contains a collection of algorithms for image registration and is in some respects more powerful than other tools such as ITK, offering advantages such as fast configuration and additional registration methods. It is open source software and is also used for research (Kerner et al. 2015).
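Elastix is typically driven from the command line with a parameter file. The sketch below wraps such a call in Python; the flags (-f for the fixed image, -m for the moving image, -p for the parameter file, -out for the output directory) follow the commonly documented elastix interface but are assumptions here and should be checked against the installed version, and all file names are placeholders:

```python
# Hedged sketch: invoke a locally installed elastix binary for registration.
# Flags (-f, -m, -p, -out) follow the commonly documented elastix CLI and
# should be verified against your installation.
import pathlib
import shutil
import subprocess

def register(fixed_image, moving_image, parameter_file, out_dir="elastix_out"):
    """Register moving_image to fixed_image and return the output directory."""
    if shutil.which("elastix") is None:
        raise RuntimeError("elastix executable not found on PATH")
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    subprocess.run(
        ["elastix",
         "-f", fixed_image,        # fixed (reference) image  (assumed flag)
         "-m", moving_image,       # moving image to align    (assumed flag)
         "-p", parameter_file,     # registration parameters  (assumed flag)
         "-out", str(out)],        # output directory         (assumed flag)
        check=True,
    )
    return out

if __name__ == "__main__":
    register("fixed_ct.nii.gz", "moving_ct.nii.gz", "params_rigid.txt")
```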
20.3.4.8 ANTs
ANTs stands for Advanced Normalization Tools. It is very useful for interpretation and multidimensional data visualisation and can derive information from large datasets (Avants et al. 2011). ANTs is used for visualising multidimensional data and extracting information from complex datasets, and it is open source.
20.3.4.9 NiftySeg
NiftySeg contains various programs for analysis and is used for EM-based segmentation. The tool is one of the university-developed packages and is released under a BSD licence. It includes many image segmentation and analysis programs based on the EM algorithm (TIG 2014).
20.3.4.10 ITK-Snap
ITK-SNAP is a tool for the segmentation of structures in 3D medical images, developed by Paul Yushkevich. It offers semi-automatic segmentation using active contour methods, as well as manual delineation and image browsing (Yushkevich et al. 2006).
20.3.4.11 MITK
MITK stands for the Medical Imaging Interaction Toolkit. It is a development platform that combines an application framework with the Visualization Toolkit (VTK) and the Insight Toolkit (ITK). The software is released under a BSD-style licence (Lu et al. 2012).
20.3.4.12 NiftyRec
NiftyRec is a software project created at UCL, London, that provides tomographic reconstruction code (Assaf and Alexander 2014). It offers several types of registration package, both local and global, and is also used for registration of the lungs. The tool supports a free-form deformation algorithm when a block-matching approach is used.
20.3.4.13 NiftySim
NiftySim is an open source, high-performance finite element toolkit that exploits the graphics processing unit (GPU). Developed at University College London, it is a high-performance nonlinear finite element solver with simulation capabilities. A distinctive feature is its GPU-based execution option, which allows the solver to greatly outperform comparable packages (Johnsen et al. 2015).
20.3.4.14 Camino
Camino is a diffusion MRI processing software toolkit capable of creating processing pipelines that contain modules from other systems. Its maintenance and lead development are carried out by the microstructure imaging community at UCL (Cook et al. 2006).
20.3.4.15 DTI-TK
DTI-TK stands for the Diffusion Tensor Imaging Toolkit. The tool is used for atlas construction and spatial normalisation and is designed for analysing the morphometry of white matter using DTI data. A study published in NeuroImage in 2011 rated DTI-TK as the cutting-edge method in its category (Keihaninejad et al. 2013).
Table 20.2 tabulates the fifteen medical image processing tools reviewed in this section and provides a comparison among them; the comparison is based on the criteria below.
The criteria used for comparison are the most recent version supported by each tool, the availability of a graphical user interface (GUI), the medical imaging modalities supported, the implementation languages, the functions provided, the tool type, the price, and the platforms on which it runs, so that consumers' requirements can be met. Regarding the GUI, all fifteen tools provide a user interface, so a user-friendly and easy GUI view is available for beginners. The fifteen medical image processing packages support different modalities for smart healthcare: for example, VTK supports only 3D medical imaging, whereas SPM supports five forms of medical imaging (PET, MRI, CT scan, EEG, and fMRI). MITK, GIMIAS, and Elastix, meanwhile, support all forms of medical imaging.
Furthermore, each tool has distinct characteristics; for example, Camino and Elastix support both sets of functions, while NiftyReg supports only viewing and segmentation. Different programming languages are used to implement the tools, including C, C++, C#, Java, and Python.
Based on user experience and this survey, Elastix is suggested as the tool that offers the most capable image processing for medical images. It lets users download the latest technical update of the tool and its GUI program. It also supports all forms of imaging, such as MRI, CT scan, radiography, and ultrasound, promotes a broad variety of functions during image processing, and can be run as a standalone application or paired with IDE software such as Visual Studio, NetBeans, or MATLAB. Furthermore, tutorials for beginners are offered on its official website. Elastix is an open access program, with the latest version freely available. It also has extensive options for importing and exporting data for MIP, supporting Bitmap, PNG, DICOM, TIFF, and JPEG image formats.
For consumers who are looking for a full-featured interface, the Elastix MIP tool is recommended for research, while users who mainly want to explore segmentation or visualisation applications may consider Camino. Camino is similar to Elastix in aiming to handle both kinds of medical imaging and offers all of the defined functionality, but its installer support differs across operating systems and it does not have as strong a practical tool foundation.
Table 20.2 The image processing tools comparison

Image processing tool | Latest supported version | Date of last published
VTK | 6.1 | 2014
ITK | 4.0 | 2014
FSL | 5.0 | 2014
SPM | 12 | 2014
GIMIAS | 1.5 | 2013
NiftyReg | 3.1 | 2013
Elastix | 4.7 | 2014
ANTs | 2.1 | 2014
NiftySeg | 3.1 | 2014
ITK-Snap | 3.2 | 2014
MITK | 03 | 2014
NiftyRec | 1.6.9 | 2014
NiftySim | 2.0 | 2014
Camino | 2.0 | 2013
DTI-TK | 3.0 | 2011

For each tool, the comparison further records the supported imaging modalities (EEG, PET, mammogram, fMRI, ultrasound, CT scan, X-ray, MRI, system interface), the functions provided (3D images, visualisation, reconstruction, generic, registration, diffusion, simulation, segmentation), the implementation languages (VB.Net, Java, C, Python, C#, PHP, C++), the licensing and distribution type (paid, standalone tool, open source, framework, plugin/integration), and the supported platforms (Mac OS X, Windows, Linux).
Some of the MIP software is available free of charge for non-commercial use and is labelled as free in the comparison; free downloads can be found on the online platforms of the respective tools. For MIP tools that are not listed as free, the consumer has to buy a licence to use the given MIP tool.
20.4 Conclusion
This chapter has presented several medical imaging techniques and discussed how digital image processing is useful in bioinformatics. We have also discussed the advantages, disadvantages, benefits, and accuracy of these techniques. Many bioinformatics technologies and tools are applied to images, and the chapter has described some useful toolkits from which custom solutions can be created. The development of medical imaging technology has produced a large amount of data, and the various types of medical image processing technique have different constraints. Of the MIP tools available, only the fifteen discussed here were considered; these MIP tools play a very important role in bioinformatics. In the future, these techniques and tools will improve in accuracy, giving better results and detecting other diseases.
References
Asbach P, Klatt D, Schlosser B, Biermer M, Muche M, Rieger A, Loddenkemper C et al (2010)
Viscoelasticity-based staging of hepatic fibrosis with multifrequency MR elastography. Radiol-
ogy 257(1):80–86
Assaf Y, Alexander DC (2014) Advanced methods to study white matter microstructure. In: Cohen-
Adad J, Wheeler-Kingshott CAM (eds) Quantitative MRI of the spinal cord. Academic Press,
London, pp 156–163
Avants BB, Tustison NJ, Song G, Cook PA, Klein A, Gee JC (2011) A reproducible evaluation of
ANTs similarity metric performance in brain image registration. Neuroimage 54(3):2033–2044
Caiani EG, Toledo E, MacEneaney P, Bardo D, Cerutti S, Lang RM, MorAvi V (2006) Automated
interpretation of regional left ventricular wall motion from cardiac magnetic resonance images. J
Cardiovasc Magn Reson 8:427–433
Carstensen MH, Al-Harbi M, Urbain JL et al (2011) SPECT/CT imaging of the lumbar spine in
chronic low back pain: a case report. Chiropr Man Therap 19:2. https://doi.org/10.1186/2045-
709X-19-2
Cook PA, Bai Y, Nedjati-Gilani SKKS, Seunarine KK, Hall MG, Parker GJ, Alexander DC (2006)
Camino: open-source diffusion-MRI reconstruction and processing. In: 14th scientific meeting
of the international society for magnetic resonance in medicine (vol. 2759). Seattle WA, USA
Garofalakis A, Zacharakis G, Meyer H, Economou E, Mamalaki C, Papamatheakis J, Kioussis D,
Ntziachristos V, Ripollcts J (2007) Three-dimensional in vivo imaging of green fluorescent
protein–expressing T cells in mice with noncontact fluorescence molecular tomography. Mol
Imaging 6(2):96–107
Hanwell MD, Martin KM, Chaudhary A, Avila LS (2015) The visualization toolkit (VTK):
rewriting the rendering code for modern graphics cards. SoftwareX 1:9–12
Hoshi T, Takahashi M, Iwamoto T, Shinoda H (2010) Noncontact tactile display based on radiation
pressure of airborne ultrasound. IEEE Trans Haptics 3(3):155–165
Johnsen SF, Taylor ZA, Clarkson MJ, Hipwell J, Modat M, Eiben B, Ourselin S (2015) NiftySim: a
GPU-based nonlinear finite element package for simulation of soft tissue biomechanics. Int J
Comp Assisted Radiol Surg 10(7):1077–1095
Keihaninejad S, Zhang H, Ryan NS, Malone IB, Modat M, Cardoso MJ, Ourselin S (2013) An
unbiased longitudinal analysis framework for tracking white matter changes using diffusion
tensor imaging with application to Alzheimer’s disease. NeuroImage 72:153–163
Kennedy BF, Hillman TR, McLaughlin RA, Quirk BC, Sampson DD (2009) In vivo dynamic
optical coherence elastography using a ring actuator. Optical Exp 17(24):21762–21772
Kerner GS, Fischer A, Koole MJ, Pruim J, Groen HJ (2015) Evaluation of elastixbased propagated
align algorithm for VOI-and voxel-based analysis of longitudinal 18FFDG PET/CT data from
patients with non-small cell lung cancer (NSCLC). EJNMMI Res 5(1):15
Kraft O, Havel M (2012) Sentinel lymph node identification in breast cancer - comparison of planar
scintigraphy and SPECT/CT. Open Nucl Med J 4:5–13
Larrabide I, Omedas P, Martelli Y, Planes X, Nieber M, Moya JA, Bijnens BH (2009) GIMIAS: an
open source framework for efficient development of research tools and clinical prototypes. In:
Functional imaging and modeling of the heart. Springer, Berlin, pp 417–426
Larsson A (2005) Corrections for improved quantitative accuracy in SPECT and planar scinti-
graphic imaging. Print & Media, Sweden
Liu Y, Kot A, Drakopoulos F, Yao C, Fedorov A, Enquobahrie A, Clatz O, Chrisochoides NP
(2014) An ITK implementation of a physics-based non-rigid registration method for brain
deformation in image-guided neurosurgery. Front Neuroinform 8:33
Lu T, Liang P, Wu WB, Xue J, Lei CL, Li YY, Liu FY (2012) Integration of the image-guided
surgery toolkit (IGSTK) into the medical imaging interaction toolkit (MITK). J Digital Imaging
25(6):729–737
Mathews P, Jezzard P (2004) Functional magnetic resonance imaging. J Neurol Neurosurg and
Psychiatry 75(1):6–12
Ng B, Abugharbieh R, Huang X, McKeown MJ (2009) Spatial characterization of fMRI activation
maps using invariant 3-D moment descriptors. IEEE Trans Med Imaging 28(2):261–268
Nikpoor N (2009) Scintigraphy of the musculoskeletal system. In: Weissman BN (ed) Imaging of
arthritis and metabolic bone disease. W.B. Saunders, Philadelphia, pp 17–33
Ovland R (2012) Coherent plane-wave compounding in medical ultrasound imaging, Master thesis.
Norwegian University of Science and Technology
Roobottom CA, Mitchell G, Hughes GM (2010) Radiation-reduction strategies in cardiac computed
tomographic angiography. Clin Radiol 65(11):859–867
Sampson D, Kennedy K, McLaughlin R, Kennedy B (2013) Optical elastography probes mechani-
cal properties of tissue at high resolution. Biomedical Optics & Medical Imaging, SPIE
Sarvazyan A, Hall TJ, Urban MW, Fatemi M, Aglyamov SR, Garra BS (2011) An overview of
elastography–an emerging branch of medical imaging. Curr Med Imaging Rev 7:255–282
Smith SM, Jenkinson M, Woolrich MW, Beckmann CF, Behrens TE, Johansen-Berg H, Niazy RK
(2004) Advances in functional and structural MR image analysis and implementation as FSL.
Neuroimage 23(Suppl. 1):S208–S219
Sowell ER, Levitt J, Thompson PM, Holmes CJ, Blanton RE, Kornsand DS, Toga AW (2000)
Brain abnormalities in early-onset schizophrenia spectrum disorder observed with statistical
parametric mapping of structural magnetic resonance images. Am J Psychiatry 157
(9):1475–1484
Spahn M (2013) X-ray detectors in medical imaging. Nucl Instrum Methods Phys Res Sect A
731:57–6311
TIG (2014) The TIG, Image processing tool. Retrieved from http://cmictig.cs.ucl.ac.uk/wiki/index.
php/Main_Page
Tyagi S, Kumar S (2010) Clinical applications of elastography: an overview. Int J Pharma Bio Sci 1
(3)
Wang L, Alpert KI, Calhoun VD, Cobia DJ, Keator DB, King MD, Kogan A, Landis D, Tallis M,
Turner MD, Potkin SG, Turner JA, Ambite JL (2016) SchizConnect: mediating neuro-imaging
databases on schizophrenia and related disorders for large-scale integration. Neuroimage
124:1155–1167
Wiemer J, Schubert F, Granzow M et al (2003) Informatics united: exemplary studies combining
medical informatics, neuroinformatics and bioinformatics. Methods Inf Med 42(2):126–133
Xu J, Tsui BMW (2014) Quantifying the importance of the statistical assumption in statistical X-ray
CT image reconstruction. IEEE Trans Med Imaging 33(1):61–73
Yodh AG, Chance B (1995) Spectroscopy and Imaging with diffusing light. Phys Today 48
(3):34–40
Yushkevich PA, Piven J, Hazlett HC, Smith RG, Ho S, Gee JC, Gerig G (2006) User-guided 3D
active contour segmentation of anatomical structures: significantly improved efficiency and
reliability. NeuroImage 31(3):1116–1128. https://doi.org/10.1016/j.neuroimage.2006.01.015
Artificial Intelligence in Bioinformatics
21
Hari Om Sharan
Abstract
Artificial intelligence and bioinformatics have a solid link, and artificial intelligence has gradually attracted increasing attention in bioinformatics research. It has become common for researchers to deploy ready-made AI systems for classification and data mining, and in the current scenario numerous intelligent systems exist. Bioinformatics combines biology with information (intelligent) systems. Artificial intelligence can be used to examine, process, and classify biological data in a short time. Various artificial intelligence algorithms have been developed and used in bioinformatics analyses. This chapter summarizes the applications of artificial intelligence deployed in bioinformatics.
Keywords
21.1 Introduction
Biology, the study of life, is one of the most interesting aspects of science. Microlevel studies in biology include the sequencing of DNA and RNA strands, protein classification, and the analysis of gene expression on DNA microarrays, tasks to which artificial intelligence can be applied.
The combination of biology and computational intelligence is called bioinformatics. It brings together data science and biology and uses the application of
H. O. Sharan (*)
Department of Computer Science and Engineering, Faculty of Engineering and Technology, Rama
University, Kanpur, Uttar Pradesh, India
machine learning and artificial intelligence for real and important objectives (Hanif et al. 2019a).
Computational intelligence is applied in bioinformatics, and artificial intelligence and data science are used to explore biological systems and methods. With the growth of data at the molecular level, artificial intelligence, machine learning, and deep learning are used in bioinformatics for prediction, for example to predict DNA and RNA sequences (Ezziane 2006). Bioinformatics is also one of the major contributors to current innovations in artificial intelligence. Through machine learning we can develop a better understanding of biological data based on large datasets, and machine learning applications allow patterns to be predicted and detected in big data sets.
Various biological problems can be solved by implementing mathematical and statistical models, algorithms, and computational intelligence (Narayanan et al. 2002). Artificial intelligence enables a machine to work intelligently. Numerous problems in bioinformatics require new concepts or intelligent technology to be addressed and to exploit biological data. Artificial intelligence is a branch of computer science whose approaches excel at dealing with problems, pattern recognition, and prediction, and there is a lot of scope for applying artificial intelligence to prediction and pattern recognition in bioinformatics problems (Hanif et al. 2019b). Regression analysis (linear and logistic regression) and various other AI algorithms increase the capacity to solve biological problems in bioinformatics.
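As a minimal, hedged illustration of regression-based classification on biological data, the sketch below fits a logistic regression model with scikit-learn to a small synthetic "gene expression" matrix; the data are randomly generated for illustration only and do not represent any real experiment:

```python
# Toy example: logistic regression on a synthetic gene-expression matrix.
# Data are randomly generated for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_genes = 200, 50
X = rng.normal(size=(n_samples, n_genes))          # simulated expression levels
weights = rng.normal(size=n_genes)
y = (X @ weights + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)  # simulated label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```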
Artificial Intelligence can be classified into three categories: narrow artificial intelli-
gence, general artificial intelligence, and artificial super intelligence.
The main goal of artificial intelligence is to make systems intelligent so that they are able to learn, predict, and accept and execute tasks like humans or beyond the capability of humans (https://www.educba.com/importance-of-artificial-intelligence/). A few important aspects of artificial intelligence are discussed here:
Artificial intelligence can change every industry, but within limits. Artificial intelligence learns from past experience, and this is its basic limitation: there is no further method by which additional information can be incorporated. Therefore, if the input data are incorrect, the output will be affected or wrong results may be given, and any extra prediction feature has to be incorporated separately (Cannata et al. 2008). Artificial intelligence systems are trained to complete a defined task and cannot perform a different task for which they were not designed; for example, a system that plays cricket cannot solve Sudoku (Cannata et al. 2008; Hassanien et al. 2008). We can therefore say that artificial intelligence systems are designed very specifically for a singular task and are focused on performing that single task.
In the early days, technology was used only for automation and to minimize the use of paper for record keeping, but nowadays artificial intelligence is not only a theory; it has many practical applications (Hassanien et al. 2008; https://blog.adext.com/applications-of-artificial-intelligence/). A few vital applications of artificial intelligence are discussed here:
Artificial intelligence works on large data sets with speed, regression, and algorithms. It permits models to be trained automatically from features extracted from the data sets (Keedwell et al. 2002). Artificial intelligence is a wide area that includes several tools drawn from the following sub-fields:
1. Machine Learning
2. Deep Learning
3. Artificial Neural Network
4. Natural Language Processing
5. Cognitive Processing
6. Computer Vision
7. Internet of Things (IoT)
8. Graphical Processing Unit
Challenges in bioinformatics vary with the scope: some challenges are seen by biologists and some by computer scientists (Liang et al. 1998). Some of the challenges seen by both are the following:
1. Ability to predict where and when a given event will happen in a genome.
2. Ability to predict the pattern of any primary sequence.
3. Ability to predict the cellular reaction to external stimuli.
4. Determining operative protein-DNA, protein-RNA, and protein-protein recognition codes (Michalewicz 1996).
5. Precise structure prediction.
6. Rational design of small-molecule inhibitors of proteins.
7. Understanding how new proteins evolve.
The general use of bioinformatics is to extract information from biological data or biological studies through software. Bioinformatics is used in numerous fields such as molecular medicine, modern medical research, genomic investigation, drug discovery and development, prediction of protein structure, gene therapy, microbial applications, and many more (D’Haeseleer et al. 1999). Software tools are available for database management, data retrieval, and knowledge discovery, allowing useful biological and molecular data to be analyzed; bioinformatics also has many research applications (Rumelhart and McClelland 1986).
The concepts of artificial intelligence and machine learning can be used in the area of molecular biology. Since artificial intelligence was introduced into this area, many algorithms have been designed and deployed for the analysis of different data sets. The nature of artificial intelligence is such that it is useful in practice rather than only in theory. It is regular practice for most researchers to compare new approaches with older ones in order to analyze their effectiveness and efficiency on defined data sets (Machine Learning for Healthcare: On the Verge of a Major Shift in Healthcare Epidemiology). When studying molecular biology, DNA sequencing must be discussed; it is an important task in molecular biology, and in this respect DNA chips are a very important alternative method for DNA sequencing.
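Before a machine learning model can work with DNA sequence data (for example, data derived from sequencing or DNA chips), the sequence is usually converted into a numerical representation. The sketch below shows one common, simple choice, one-hot encoding of the four bases; it is an illustrative preprocessing step rather than any specific published method:

```python
# One-hot encode a DNA sequence so it can be fed to a machine learning model.
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Return a (len(seq), 4) array; unknown bases (e.g. N) become all-zero rows."""
    index = {b: i for i, b in enumerate(BASES)}
    encoded = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded

matrix = one_hot("ACGTTNAGC")
print(matrix.shape)   # (9, 4)
print(matrix)
```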
References
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. BMC Bioinf 3(35):12–16
Alga E, Tomassini M (2002) Parallelism and evolutionary algorithms. IEEE Trans Evol Comput
6:443–462
Altman R (2001) Challenges for intelligent systems in biology. IEEE Intell Syst 2(32):14–18
Azuaje F (2001) An unsupervised neural network approach for discovery of gene expression
patterns in B-cell lymphoma. Online J Bioinf 1:26–41
Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol
Biol 268:78–94
Cannata N, Schröder M, Marangoni R, Romano PA (2008) Semantic web for bioinformatics: goals,
tools, systems, applications. BMC Bioinf 9(4):1
D’Haeseleer P, Liang S, Somogyi R (1999) Gene expression analysis and modelling. Proceedings
of Pacific Symposium on Biocomputing, Hawaii, (PSB99). Available from www.cgl.ucsf.edu/
psb/psb99/genetutorial.pdf
Douzono H, Hara S, Noguchi Y (1998) An application of genetic algorithm to DNA sequencing by
oligonucleotide hybridization. Proceedings of the IEEE international joint symposia on intelli-
gence and systems Rockville Maryland, USA 5(34):92–98
Ezziane Z (2006) Applications of artificial intelligence in bioinformatics: a review. Exp Syst Appl
30:2–10. www.elsevier.com/locate/eswa
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML,
Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of
cancer: class discovery and class prediction by gene expression monitoring. Science
286:531–536
Hanif W et al (2019a) Artificial intelligence in bioinformatics ISSN 2410-955X– an international
biannually. J Rev Article 5(2):114–119
Hanif W, Afzal MA, Ansar S, Saleem M, Ikram A, Afzal S, Khan SAF, Larra SA, Noor H (2019b)
Artificial intelligence in bioinformatics. Biomed Lett 5(2):114–119
Hassanien AE, Milanova MG, Smolinski TG, Abraham A (2008) Computational intelligence in
solving bioinformatics problems: reviews, perspectives, and challenges. Springer, Berlin 2
(34):3–47
Haupt R (2007) Antenna design with a mixed integer genetic algorithm. IEEE Trans Antennas
Propag 55(3):577–582
Keedwell E, Narayanan A, Savic DA (2002) Modelling gene regulatory data using artificial neural
networks. Proceedings of the International Joint Conference on Neural Networks (IJCNN’02),
Honolulu, Hawaii, pp 183–188
Kohonen T (1982) A simple paradigm for the self-organized formation of structured feature maps.
In Amari S, Arbib M (eds) Competition and cooperation in neural nets, Lecture notes in
biomathematics. Springer
Liang S, Fuhrman S, Somogyi R (1998) REVEAL, a general reverse engineering algorithm for
inference of genetic network architectures. Pac Symp Biocomput 3:18–29
Lipman DJ, Altschul SF, Kececioglu JD (1989) A tool for multiple sequence alignment. Proc Natl
Acad Sci 86(12):4412–4415
Machine Learning for Healthcare: On the Verge of a Major Shift in Healthcare Epidemiology
Michalewicz Z (1996) Genetic algorithms + data structures = evolution programs. Springer, Berlin
1(32):3
Narayanan A, Keedwell EC, Olsson B (2002) Artificial intelligence techniques for bioinformatics.
Appl Bioinf 1:191–222
Rost B, Sander C (1994) Prediction of protein secondary structure at better than 70% accuracy. J
Mol Biol 232:584–599
Rumelhart DE, McClelland JL. and the PDP Research Group (1986) Parallel distributed processing:
volume 1 foundations. The Massachusetts Institute of Technology
Ryu J, Sung-Bae C (2002) Gene expression classification using optimal feature/classifier ensemble
with negative correlation. Proceedings of the International Joint Conference on Neural
Networks (IJCNN’02), Honolulu, Hawaii, pp 198–203
Shepherd A (1999) Protein secondary structure prediction with neural networks: A tutorial.
Available from http://www.biochem.ucl.ac.uk/~shepherd/sspred_tutorial/ss-index.html
Su T, Basu M, Toure A (2002) Multi-domain gating network for classification of cancer cells using
gene expression data. In: Proceedings of the International Joint Conference on Neural Networks
(IJCNN’02), Honolulu, Hawaii, pp 286–289
Big Data Analysis in Bioinformatics
22
Anugrah Srivastava and Advait Naik
Abstract
Biology is the science of life in nature. It concerns living things made of one cell or many cells (e.g. animals, plants, and micro-organisms). The natural sciences include numerous areas, including the study and characterization of organisms through the molecular components of their cells, and they also show how interactions within ecosystems are shaped by the species that those cells form. To improve performance on particular tasks, these sciences overlap with various sub-topics such as medicine, biochemistry, and neuroscience. Biological systems consist of various organisms working together to carry out their functions, and these systems may drive key developments in human well-being and environmental management. For example, software engineering, bioinformatics, and materials science may provide different scientific perspectives on how biological systems can change over time under a variety of conditions, and such systems offer sound answers for many natural and medical treatment frameworks. By using broad data analysis and large datasets, the performance of scientific frameworks in plant science, bioinformatics, and medical services can be enhanced. Although there is still disagreement about the precise meaning of the term big data, we understand by it data of huge volume and wide variety that are constantly refreshed and stored in many sources, as well as extraordinary developments in the storage, movement, handling, and analysis of this information. This chapter emphasizes applications of
A. Srivastava (*)
Computer Science Engineering Department, Bennett University, Greater Noida, Uttar Pradesh,
India
A. Naik
Department of Computer Engineering, Vivekanand Education Society’s Institute of Technology,
Mumbai, India
big data tools and techniques in bioinformatics. We also address big data issues and challenges in the field of bioinformatics.
Keywords
22.1 Introduction
In the systematic study "Hype Cycle for Emerging Technologies, 2015" by Gartner Inc. (Gartner 2015), which provides a graphical framework for analysing the expectations, maturity, and adoption of more than 2000 emerging technologies, the idea of big data as a separate technology for the storage and handling of immense data sets has disappeared. The company explained its decision by acknowledging that the concept of "big data" involves countless technologies that have been used successfully, have become parts of other mainstream areas and trends, and have turned into regular working tools. Today, the extraction of relevant information has to be distinguished from the more basic task of simply handling such data. The greatest wins have been gained by businesses that work closely with the consumer and can take advantage of accurate evaluation and prediction of potential buyers; this applies primarily to banking, broadcasting, retail, electricity, and utilities. Currently, we address the professional use of large amounts of data by companies in their planning and business cycles and in their capacity to support their businesses. Big data resources allow organizations to track assets proficiently, to predict events that influence their operations, and to reach decisions more quickly.
Computer science has responded to these gradual changes in public activity by developing new research directions, including web search and business data analytics. Web intelligence is an area in which research and development examines the use of artificial intelligence (information presentation, the organization and discovery of knowledge, data mining, the use of intelligent agents) and advanced information technology (wireless networks, e-mail). Business intelligence (BI) incorporates technological instruments to assist corporate executives, business managers, and other end users in making rule-based business decisions for the collection, handling, and analysis of business data. Business intelligence involves a vast variety of instruments, applications, and implementations that allow organizations to gather information from internal systems. A high value is given here to machine learning, methods for finding rules and relations in vast quantities of data, data mining, advanced methods for the interpretation and automated analysis of information, decision support systems and artificial intelligence, natural language recognition, and so on.
We can recall that in 2008, when the term "big data" first appeared, it was mainly concerned with scientific circles and, to a large degree, with bioinformatics. B. Hesper and P. Hogeweg were the first to use the term "bioinformatics", in an article written in Holland in 1970, where it was defined as "the study of information processes in biotic systems" (Hogeweg 2011). The authors considered the management of information in various systems, for instance the accumulation of information during development, the transmission of information via DNA to intercellular and intracellular processes, and the interpretation of information at different levels of life, to be a defining property of life. Bioinformatics today is a science built on the use of computer techniques to explore a variety of genomic knowledge. A major part of the progress of bioinformatics has been the rapid improvement of computer technology and computational data processing techniques, together with the development of new telecommunications technologies.
The openness of the most important biological resource, the human genome, made it accessible to scientists worldwide as a significant source of knowledge for researchers and medical policymakers, and it established bioinformatics as a collaborative science in which the achievements of individual groups are made available to other researchers promptly. With the Internet of Things, a large number of sensors now gather information rapidly: CCTV and news outlets, for example, stream data continuously, as do social, mobile, and various other applications. Such data also need to be processed incrementally, and the results are only worthwhile if the time needed for processing is limited. For instance, different sensors are used to monitor critical structures and physical conditions; sensor data form complex, unbounded streams that must be processed continuously to make control decisions. Data variety is a further issue. Traditionally, a predefined schema is imposed to manage and analyse data, and structured data are typically available as spreadsheets and relational databases. With big data, many different kinds of data are used, processed, and analysed; such formats include text, SMS messages, e-mails, tweets, blogs, website data, GPS data, images, audio, video, sensor data, documents, and social datasets. Big data therefore involves a mixture of structured, semi-structured, and unstructured information in different formats that must be stored and analysed. Despite the difficulties, big data analysis can examine vast amounts of information to discover hidden patterns, correlations, and useful knowledge in various fields, such as customer analytics, marketing, recommendation systems, social media monitoring and response, fraud detection, and the prevention of natural and man-made disasters. For example, big data analysis is used in agriculture to improve crop yield and address food-security issues, and it is used to investigate therapies and treatments for diseases such as cancer. Bioinformatics research is regarded as a field with large, growing, and complex datasets.
Bioinformatics is an interdisciplinary field that primarily involves molecular biology, computer science, mathematics, and statistics. To understand and organize data on biological molecules and to draw inferences from observations, it mainly deals with modelling biological processes at the molecular level. Bioinformatics focuses, for example, on genomics, proteomics, transcriptomics, metabolomics, and glycomics in the statistical investigation of datasets. Today, the principal growth in biological data that poses storage and processing challenges comes from high-throughput next-generation sequencing. The genome refers to the entire set of genes or the cellular genetic material (DNA) of an organism. In its first phase (2014 to 2017), the NIH Big Data to Knowledge initiative set out to address significant problems in data science and to promote data-driven discovery (National Institutes of Health 2018). The botanist H. Winkler proposed the term "genome" in 1920 to designate the set of chromosomes, starting an ever-growing family of scientific terms ending in "-ome" (Winkler 1920). Until then there were only the biome (a community of living organisms) and the rhizome (a root system), but now researchers use numerous "-omes" (Baker 2013). In many of these words the Greek suffix "-ome" roughly means "having the nature of". Simultaneous progress in computing capacity and further advances in data acquisition in various scientific disciplines, triggered by genomics, led to the creation of broad disciplines named "-omics" in bioinformatics, which analyse entire organisms in terms of their constituent molecules (DNA, RNA, proteins, metabolites, and so on). Genomics, metagenomics, transcriptomics, proteomics, metabolomics, interactomics, and other bioinformatics areas study genomes, metagenomes, transcriptomes, proteomes, metabolomes, and related data (Ohashi et al. 2015).
Every discipline of bioinformatics has its own objects of study and its own data. However, they all produce enormous amounts of information in different formats and at different levels, which must be interpreted, organized, understood, and visualized in order to expand and consolidate current knowledge. Broadly, genomics studies the structure, function, evolution, mapping, and modification of an organism's genome. The most detailed information in bioinformatics is found in DNA sequences. DNA consists of nucleotide molecules; guanine (G), adenine (A), cytosine (C), and thymine (T) encode the information contained in DNA, and the order of these bases is what the genetic code determines. DNA sequencing is the process of establishing the exact order of the bases A, G, C, and T within a DNA strand. A typical bacterial genome can contain several million bases. The human genome contains approximately 3.2 billion bases, and a single sequenced human genome is about 200 gigabytes in size (Robison 2014a). The first human genome was sequenced fully in June 2000, and about 228,000 human genomes had been sequenced by 2014 (Rosenberg 2017). More recently, more than 500,000 human genomes have been sequenced on machines from Illumina, the largest producer of DNA sequencers (Herper 2017). In short, biological data are becoming ever easier to generate by sequencing. The Cancer Genome Atlas (Li et al. 2013) and the Encyclopedia of DNA Elements (The ENCODE Project Consortium 2012) are two examples of large datasets. The European Bioinformatics Institute (EBI) maintains scientific data stores of about 40 petabytes (EMBL-European Bioinformatics Institute 2014a). Because the information is routinely collected from different sources, it is heterogeneous and held in different formats. In addition, biological and clinical data (e.g. medical imaging in health care) are being generated ever more quickly. Another characteristic of biological big data is that they are geographically scattered (Kashyap et al. 2015).
Data analysis that gathers a wealth of information from biological and biomedical sources, such as DNA sequencing, leads to predictions about human health and disease that support disease prevention and improve human well-being and lives. This is a big data problem, and institutions such as the National Institutes of Health (NIH) recognize the importance of tackling the big data problems encountered in managing and analysing biological information. In 2012, the NIH launched the Big Data to Knowledge initiative to enable creative biomedical research through innovative methods and tools that increase the usefulness of biomedical big data across data science. Massive quantities of data are involved: the amount of information generated every day in the digital world is enormous. For example, almost 500 million photographs are taken daily, around 56 million images are posted on social media, and about 200 billion messages are sent every day. Current data collections are measured in petabytes, and exabyte datasets will soon be usual. The sheer volume of information to be analysed is an important issue with big data, but there are also challenges in handling the other two characteristics, namely the velocity and the variety of the data. For example, traditional databases contain fairly static and limited stock, sales, and customer information; processing such data is not time critical, because the incoming data rate is slower than the processing time, and despite any processing delay the results are usually still useful. We present key ideas in big data research, including machine learning algorithms for both supervised and unsupervised settings. We also discuss the developments in current bioinformatics brought about by high-throughput sequencing platforms, which have expanded research capability and led to the phenomenon of big data in science; these developments draw on fields that study the modern World Wide Web, its services and frameworks, employing artificial intelligence and information technology (IT), as well as on areas concerned with decision-making problems. The need for further progress in strategies for the storage, management, analysis, and interpretation of large data is clear. Current bioinformatics faces a wide range of techniques for interpreting and presenting data, the concurrent existence of various software tools and data formats, and, above all, the problem of managing colossal amounts of heterogeneous information. New database management systems, rather than relational ones, can help to address the problem of storing big data and of keeping search-response times acceptable. Recent software developments, such as conventional and visual programming environments, are aimed at coping with the numerous genomic data formats.
Big data has strongly influenced bioinformatics in recent years. The field is vast and complex, and researchers around the world have applied a variety of big data applications and tools to bioinformatics. These methods can manage vast volumes of information using parallel and distributed computing. This chapter discusses several applications of big data, gives an overview of the current state of the art, and points to openings for future research.
Bioinformatics and the low cost of data generation are moving us into an era of "big data", characterized by voluminous, incrementally growing datasets and complex data-analysis techniques. Uncovering the architecture of DNA took years of joint work by several research groups from different nations, as did the sequencing of the human genome; present-day technologies allow a whole genome to be sequenced in a matter of days. The availability of such vast data creates extraordinary possibilities, but it also poses challenges for data mining and analysis. The machine learning techniques used in bioinformatics are typically parallel and iterative, and they can be adapted to large-scale data using suitable parallel and distributed computing. In general, however, big data tools perform batch-mode computation and are not optimized for iterative processing or for high data dependency between operations. Parallel, incremental, and multi-view machine learning algorithms have been proposed over the years. In addition, graph-based systems and large in-memory data resources have been developed to reduce input/output costs and support iterative processing. Nevertheless, mainstream big data systems are still deficient, and suitable tools are not yet available for some major bioinformatics problems, such as the fast construction of co-expression and regulatory networks and the identification of significant modules, the detection of complexes from growing protein–protein interaction data, the fast analysis of vast amounts of DNA, RNA, and protein-related data, and the fast querying of incremental and heterogeneous networks.
In this data age, information is generated by a wide range of sources, such as sensors embedded in MRI scanners and video surveillance cameras, in addition to individuals and organizations. It has been projected that the digital universe, the information we generate every year, would reach 44 zettabytes, or 44 trillion gigabytes, by 2020, many times its size in 2013 (Turner et al. 2014a, b). With the digitization of everyday devices and services, the transition to the data era has been rapid, and long-established technologies such as basic communication and video cameras have been digitized. High-end technologies are also used in scientific research; for instance, fast data-capture devices and very high-resolution satellite data recording are used for scientific analysis. Apart from the digitization of services and operations, a more recent trend is to network everything produced around us, including sensors, home appliances, cars, and power meters. Devices talk to one another to exchange the information they gather.
Bioinformatics is a fast-moving research field, and the big data challenges here differ significantly from other fundamental data problems, for instance particle-physics data gathered at CERN or high-resolution satellite data from the open data archive of NRSC/ISRO. First, the data in bioinformatics are deeply heterogeneous: many bioinformatics research questions involve various heterogeneous and autonomous sources of data for hypothesis generation and validation, and bioinformatics data are often generated by numerous uncoordinated laboratories, so that their sources represent similar types of information in different formats. Second, the huge and growing data in bioinformatics are geographically distributed around the world, in terms of both storage and computation. Although part of these data can be transmitted over the Internet, the remainder cannot be moved because of its scale, cost, security, and other ethical issues (Marx 2013); this sometimes requires the analysis to be conducted remotely and only the results to be shared. Considering volume, velocity, and variety, but also geographic distribution, the big data issues in bioinformatics can be identified. Cloud computing technologies have been used, with considerable success, to deal with these big data difficulties in bioinformatics. The usual approach is to use the cloud for both storage and computation (Marx 2013); this helps to address the big data challenges of massive, evolving, and remotely distributed information imposed by bioinformatics. Gaea, for example, can run large-scale genome analysis on a variety of cloud-based machines. Bina Technologies, a spin-off from Stanford University and UC Berkeley, offers a genome analysis system comprising a hardware appliance, the Bina box, for the preparation of genome data, and a cloud-based component for the analysis of the pre-processed data. The Bina box also reduces the size of the genome data so that it can be transferred efficiently to the cloud component. This arrangement is claimed to increase the efficiency of genome analysis substantially compared with customary methods (Rojahn 2012).
Big data analytics deals with immense, unstructured, and rapidly changing data, to which a range of sophisticated analytical methods is applied. Most organizations, associations, and governments today produce extraordinary volumes of varied data, and organizations typically gain valuable information and serious competitive benefits from big data initiatives. One of the major problems is to extract important information from such sources efficiently and quickly. Comprehensive data can be exploited by analytical tools to enhance market competitiveness and serve common goals. Recently, analytical tools have been applied to big data of high volume, velocity, and variety; notably, they need not be expensive, because some are available as open source. One of the most commonly used analysis platforms combines commodity hardware with open-source software: it stores gigantic amounts of data by distributing them across inexpensive disks and provides many resources for processing the data efficiently. All of the technologies and tools listed above must synchronize data sources both inside and outside the organization; they are fundamental parts of any big data methodology (Zakir et al. 2015). The following subsections consider a range of these tools and the big data issues that arise in bioinformatics.
In the field of bioinformatics there are many other big data issues that are yet to be addressed. In view of the recent boom in biotechnology data, many of these issues must be tackled urgently, as discussed above. We divide the problems of big data analytics in bioinformatics into seven categories, addressed in the following.
The number and size of microarray datasets are growing quickly, mainly because of reduced costs and the widespread use of microarray experiments. To capture changes in expression values over time or over different stages of a disease, microarray experiments are also conducted in gene–sample–time spaces. Big data technologies are therefore important for the fast construction of co-expression and regulatory networks from high-volume microarray data. As gene expression data are obtained at different stages of a disease over time, genes affected by the disease can be recognized and biomarkers for the disease identified. Adding time as a third dimension makes the analysis computationally far more complicated than conventional gene expression analysis.
Pathway analysis covers the association of gene products with phenotypes of interest, gene function prediction, recognition of biomarkers and features, and patient and sample classification. Genomic, metabolic, and proteomic data are growing rapidly, and big data technologies are needed to interpret these data in large quantities.
Protein–protein interaction (PPI) complexes and their variations underlie high data volumes for numerous diseases, and PPI networks are studied in various life sciences fields as large volumes of data are generated. The volume, variety, and velocity of these data pose a real problem in PPI complex analysis. A structured and scalable architecture is required to detect, validate, and classify PPI complexes quickly and accurately.
Disease networks are being constructed and continue to expand, and new networks are introduced in their own formats from various sources. The multi-way correlations between diseases in heterogeneous networks are helpful in learning disease-to-network ties. Traditional network analytics techniques are inadequate for such unstructured and heterogeneous data if the quality of the analysis is to be preserved.
The most commonly used methods for descriptive and predictive analytics on big data are supervised, unsupervised, and hybrid machine learning approaches. In addition, various mathematical methods are applied in big data analytics. The big data volume problem can be mitigated by dimensionality reduction. For dimensionality reduction, linear mapping techniques such as principal component analysis (PCA) and singular value decomposition, as well as nonlinear mapping techniques such as Sammon's mapping, kernel principal component analysis, and Laplacian eigenmaps, are commonly used. Mathematical optimization is another powerful method used in big data analytics, and several of its subfields are applied as well.
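As a concrete illustration of the dimensionality reduction idea above, the following minimal sketch applies PCA to a synthetic expression-like matrix with scikit-learn. The matrix shape, component count, and variable names are assumptions made for the example only, not values from the chapter.

```python
# Minimal PCA sketch: project high-dimensional samples onto a few components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))            # 100 samples x 2000 synthetic features

pca = PCA(n_components=10)                  # keep the 10 strongest components
X_reduced = pca.fit_transform(X)            # project samples into 10 dimensions

print(X_reduced.shape)                      # (100, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```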
Big data analytics systems have been proposed with multiple architectures. Big data solutions can be classified under three major architectures, each with its own advantages and limits; the suitability of an architecture depends on the design and requirements of the algorithm. They are discussed in the following.
In the MapReduce architecture the computation is spread across different nodes; however, the worker nodes can read from and write to the global memory only during synchronization (in Fig. 22.2 this is indicated by the thick and thin arrows). During the map phase the problem is distributed among the worker nodes, and the partial results produced by the worker nodes are saved in the global memory. During the reduce phase the partial results are combined to produce the final result, which is deposited in global storage. The phases are repeated when the intermediate results must be processed further. When the data are large and the problem is embarrassingly parallel, the MapReduce architecture works well. The architecture provides fault tolerance by re-running the work of a failed node on a different node. The design, however, has limitations on problems with high data dependency; furthermore, it cannot support iterative computation and is inefficient because of its high input/output overhead. Research has been carried out to overcome these drawbacks and enhance the efficiency of the MapReduce architecture. Twister (Ekanayake et al. 2010) optimizes iterative computations in the MapReduce architecture by avoiding a write to distributed storage after each phase, but because of its in-memory processing Twister has problems with fault tolerance. To support both in-memory processing and fault tolerance, by reconstructing a lost partition in the event of node failure, Apache Spark extends Hadoop with the resilient distributed dataset (RDD) abstraction (Zaharia et al. 2012).
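The map/reduce pattern itself is easy to sketch. The toy example below counts k-mers across a handful of DNA reads with Python's multiprocessing pool; the reads, the value of k, and the use of a local pool instead of Hadoop or Spark are assumptions made purely for illustration.

```python
# Map phase: each worker counts k-mers in one read (one input split).
# Reduce phase: the partial counts are merged into the final result.
from collections import Counter
from multiprocessing import Pool

READS = ["ATGCGT", "GCGTAA", "ATGCAT"]   # toy input splits
K = 3

def map_kmers(read):
    return Counter(read[i:i + K] for i in range(len(read) - K + 1))

def reduce_counts(partials):
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    with Pool() as pool:
        partial_counts = pool.map(map_kmers, READS)   # map tasks in parallel
    print(reduce_counts(partial_counts))              # merged k-mer counts
```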
MapReduce is not very expressive when data dependencies are high, and when the computational dependencies between data are complex, as in many statistical methods, it is not the best architecture. Fault tolerance is needed to process difficult, iterative problems efficiently, and it is also critical for scalability, as it permits unreliable networks such as the Internet to be used. To achieve this, Low et al. (2014) first introduced a fault-tolerant graph-based architecture called GraphLab, and a similar architecture was subsequently adopted by several other major big data solutions. In this architecture the algorithm is split heterogeneously across nodes, each performing certain specific tasks. The model data are split into two parts: (1) a shared (distributed) global memory and (2) a graph of compute nodes. The machine nodes are denoted by Mi and the dotted arrows indicate the node dependencies and the network communication. As in MapReduce, the computation is performed synchronously in execution intervals. The shared database is initialized with the input data; at the beginning of each cycle a node reads the shared database and then performs its computation using its own data and the data of its neighbours. The results are then combined and written back to the globally shared database for use in the next execution cycle. When a node fails during a cycle, the node is restarted and only one cycle is lost for the dependent nodes; fault tolerance is thus assured, even though efficiency is reduced by one cycle. A node that fails permanently is replaced. The architecture is expressive for problems with complex data dependence and iteration, but it requires heavy disk input/output and hence does not execute optimally; it has further been proposed to use RDDs to support in-memory processing together with fault tolerance. In addition to GraphLab, Pregel and Giraph are other major graph-based big data solutions. Graph packages, for example GraphX and the Hama graph package called Angrapa, are also built on the MapReduce architecture (Fig. 22.3).
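To make the synchronous, vertex-centric style of computation described above more concrete, the toy sketch below runs a PageRank-like iteration over a hand-made three-node graph. The graph, the damping factor, and the fixed number of supersteps are illustrative assumptions, not details taken from GraphLab or Pregel themselves.

```python
# Each "superstep": every vertex scatters its rank along outgoing edges,
# then the incoming contributions are combined and written back to the
# shared state, mirroring the execution intervals described in the text.
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}      # adjacency list
rank = {v: 1.0 / len(edges) for v in edges}            # shared state
damping = 0.85

for _ in range(20):                                    # synchronous supersteps
    incoming = {v: 0.0 for v in edges}
    for v, neighbours in edges.items():
        share = rank[v] / len(neighbours)
        for n in neighbours:
            incoming[n] += share
    rank = {v: (1 - damping) / len(edges) + damping * incoming[v] for v in edges}

print(rank)
```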
This class of problem is also tackled with the well-known Message Passing Interface (MPI) (Turner et al. 2014b). At the application level MPI has an API similar to MapReduce, and MPI can be integrated with almost all MapReduce programmes. Figure 22.4 shows the architecture used for distributed processing in large-scale, high-bandwidth, iterative applications with strong data dependence. The architecture benefits from rising computing speed, increasing network reliability, and growing bandwidth. There are three significant differences between this and the previous architectures. First, instead of using a global shared memory, the nodes exchange data directly through peer-to-peer communication. Second, operations are carried out asynchronously; the various data flows are synchronized only during their merge operations. Finally, data do not need to be stored on disk in this architecture. As memory becomes cheaper every day, in-memory processing of large data volumes becomes feasible, which increases the overall performance considerably. The key drawback of this architecture is its lack of fault tolerance: if one of the nodes stops working, the whole process must restart from the beginning. As a consequence, this architecture is unsuitable for unreliable networks such as the Internet, which in essence creates problems with scalability. However, if the network is stable and the algorithm has high data dependence, this architecture will provide more efficiency than the others. It can be implemented using MPI for processing big data on standalone clusters.
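The following minimal sketch shows the peer-to-peer message-passing style in practice, assuming the third-party mpi4py package is installed and the script is launched with something like `mpiexec -n 4 python partial_sum.py`. The partial-sum workload and file name are invented solely to illustrate the pattern.

```python
# Each process computes a partial result on its own slice of the data and the
# results are combined by direct message passing (a reduction), with no
# global shared memory involved.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()                      # id of this process
size = comm.Get_size()                      # total number of processes

local_values = range(rank * 1000, (rank + 1) * 1000)
local_sum = sum(local_values)

total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("global sum computed by", size, "processes:", total)
```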
Big data is used throughout the modern life sciences and is routinely used for advanced data analysis. Through organized statistical investigation alone (for example, the design of statistical tests and experimental conditions) it is not possible to analyse all the data that are now accessible. The amount of available information overwhelms a person's capacity to perform, let alone interpret, the results of every conceivable test; this is where machine learning techniques begin to fill the void. Machine learning is a branch of artificial intelligence (AI) and relies on the idea of building models from data and learning from them.
Supervised machine learning uses a training dataset containing inputs and outputs (labelled with the correct output) to produce a model that gives accurate predictions when new data are entered. The learning task is to infer a function that maps an input to an output based on the examples in the training data. By comparing its predicted output with the correct output, the algorithm learns to detect errors and thus iteratively refines the model until a good level of performance is reached. Most practical machine learning uses supervised learning. When the output variable is a class, the task is a classification problem, i.e. one of a fixed number of values is to be predicted; for instance, given a set of input features, the predictor should state whether a tumour is benign or malignant. Classification problems come in two kinds, binary and multi-class: in binary classification the output takes one of two possible values, normally 1 or 0, whereas in multi-class classification one of three or more classes is predicted, e.g. when the type of a disease is to be determined. Common machine learning classification algorithms include decision trees, logistic regression, naive Bayes, K-nearest neighbours, random forests, and linear SVC (support vector classifier). A regression problem is one in which the predicted variable, for instance temperature or weight, takes a real or continuous value. Common regression algorithms include linear regression, regression trees (e.g. random forest regression), and support vector regression (SVR). The simplest linear regression model attempts to find a statistical relationship between two continuous variables by fitting the line that best matches the data. In Segal et al. (2003), a variety of regression approaches for microarray data analysis, such as the support vector machine (SVM), are presented.
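A minimal sketch of the supervised workflow just described is shown below: binary classification of tumours as benign or malignant with a linear support vector classifier. scikit-learn's bundled breast-cancer dataset is used only as a convenient stand-in for real biological data; the split ratio and random seed are arbitrary choices.

```python
# Train on labelled examples, then predict labels for unseen samples.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)            # inputs and correct outputs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = SVC(kernel="linear")                           # linear support vector classifier
model.fit(X_train, y_train)                            # learn from labelled data

y_pred = model.predict(X_test)                         # predict on new data
print("accuracy:", accuracy_score(y_test, y_pred))
```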
Labelled datasets are used in supervised machine learning, while unlabelled datasets are used for unsupervised machine learning. In unsupervised machine learning the algorithm must analyse the raw data itself to discover similarities, patterns, and relationships within it. Unsupervised learning is therefore well suited to exploratory questions about data with little prior information, such as "what patterns exist in the gene expression of cancers?" The two best-known tasks are the clustering of data and the dimensionality reduction of data. Clustering is the process of finding similarities in unlabelled data so that the data can be grouped into categories. Various clustering strategies are available, each following a different concept or set of rules to define the level of similarity between data points. The clustering of genes in expression data is the most typical application of clustering in bioinformatics (Larrañaga et al. 2006): a single DNA microarray experiment typically yields an estimate of the expression levels of a huge number of genes, and clustering can be used to group genes with similar expression levels across samples. K-means clustering and hierarchical clustering are the two most commonly used clustering algorithms in machine learning. K-means clustering, a form of partitional clustering, follows the centroid model most directly. It is an iterative algorithm in which cluster membership depends on the proximity of a data point to a cluster centre: the data are partitioned into K clusters, each of which has a cluster centre called the centroid. Initially the K cluster centres are set arbitrarily and each data point is assigned to its closest centre. The K cluster centres are then recomputed from the data points assigned to each cluster, the proximity of each data point to the new centres is examined, and the cycle is repeated until no data point changes cluster membership. Another well-known clustering technique, also partitional but model based, is expectation–maximization (EM), sometimes described as soft clustering.
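The sketch below illustrates the K-means procedure described above on a synthetic gene-expression-like matrix; the matrix, the three planted groups, and K = 3 are assumptions made only for the illustration.

```python
# Cluster 300 synthetic "genes" (rows) by their expression profiles.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
expression = np.vstack([
    rng.normal(loc=0.0, size=(100, 20)),    # three groups of genes with
    rng.normal(loc=3.0, size=(100, 20)),    # shifted expression levels
    rng.normal(loc=-3.0, size=(100, 20)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(expression)     # iterative centroid updates inside

print(np.bincount(labels))                  # roughly 100 genes per cluster
```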
Deep learning is a family of algorithms that aims to emulate the human brain's ability to learn; its designs are inspired by the layered structure of biological neural networks. Artificial neural networks (ANNs) are used to perform deep learning, breaking down detailed information and solving complex problems with the immense computing power now available (e.g. GPUs). Neural networks have been around for a long time, but state-of-the-art ANNs are "deep", whereas a traditional neural network usually consists of only a couple of hidden layers. With ANNs it is feasible to build and train models efficiently over time; a deep learning model learns from data and determines on its own whether or not a prediction is accurate. Automated driving, speech translation, and automated detection of cancer cells are typical deep learning applications.
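As a small, hedged sketch of training a multi-layer neural network, the example below uses scikit-learn's MLPClassifier with three hidden layers. Serious deep learning work would normally use a dedicated framework such as TensorFlow or PyTorch and far more data; the layer sizes and dataset here are arbitrary choices kept only to make the example short and self-contained.

```python
# A small multi-layer network trained by backpropagation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = make_pipeline(
    StandardScaler(),                       # scale inputs for stable training
    MLPClassifier(hidden_layer_sizes=(64, 64, 32), max_iter=500, random_state=0),
)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```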
In recent years bioinformatics research has quickly become a big data concern. Big data are not only large in volume, velocity, and variety but also distributed and incremental. Because of these properties, conventional data analysis cannot be carried out easily or reliably. Machine learning techniques can be beneficial for managing big data analytics, as they have been developed in computer science with goals such as accuracy and efficiency. This section summarizes some of the difficulties and open problems of big data analytics research using machine learning.
The methods used for analysing and visualizing classic databases are not appropriate for big data. The volume, velocity, variety, distribution, and incrementality of such data present challenges to conventional data analysis methods. Both the volume of data generated and the speed of data transfer keep increasing: Napatech, a manufacturer of high-speed network accelerators, reported an annual growth rate of 23% through 2018 for all network traffic, and recently there has been an exponential increase in hand-held devices and their associated sensors. The rate of data production and circulation is growing alongside the rise in data volumes; the average mobile network connection speed in 2014 was 1683 kbps according to the Cisco report (Cisco 2015), which projected around 4.0 Mbps for 2019. With high data velocity, real-time analysis of big data becomes harder. While batch-mode analytics with distributed and parallel computing techniques scale to high data rates, the comparatively slow input/output operations critically limit analytical efficiency; input/output speed currently lags far behind processor speed and acts as the bounding parameter. In addition, the data produced continuously are highly heterogeneous. Conventional databases are organized according to a predefined schema, and data warehouses store and update data only after extraction–transformation–loading operations. A big data architecture, by contrast, must constantly capture information from heterogeneous sources at high speed and in varied forms, which an organized repository such as a data warehouse cannot store and retrieve dynamically at the same time. For these reasons, conventional data analysis techniques such as machine learning and statistical analysis, as originally developed, are not directly successful on large data, and machine-learning-powered analytics must therefore be rethought from a large-scale data perspective. Another huge problem for big data analysis is data protection, especially in bioinformatics and in the healthcare sector. Data sources may use anonymization or publish only partial information to protect sensitive information, and analysis of incomplete or anonymized data can be complicated and counterproductive.
Big data analytics involves massive processing of poly-structured, structured, and semi-structured data. Real-time analysis introduces an additional time-bound computing requirement. Machine learning techniques can be applied to unstructured data to identify patterns and relationships. However, conventional analysis of big data has some efficiency problems, noted below.
1. There is still no unified architecture for big data analysis that tolerates faults and is capable of handling huge, varied data both in batches and incrementally as a continuous stream.
2. The essential way to handle and control the immense volume of big data is distributed processing. Most machine learning and data mining algorithms, however, were not originally designed with such distributed processing in mind.
22.9 Conclusion
References
Baker M (2013) The ‘Oms puzzle. Nature 494:416–419
Bhaskar H, Hoyle DC, Singh S (2006) Intelligent technologies in medicine and bioinformatics.
Comput Biol Med 36:1104
Bocicor M-I, Czibula G, Czibula I-G (2011) A reinforcement learning approach for solving the
fragment assembly problem. In: 2011 13th international symposium on symbolic and numeric
algorithms for scientific computing, September 26–29, 2011. IEEE, Timisoara, Romania
Cai D, He X, Han J (2008) Srda: an efficient algorithm for large-scale discriminant analysis. IEEE
Trans Knowl Data Eng 20(1):1–12
Choudhury A, Nair PB, Keane AJ et al (2002) A data parallel approach for large-scale Gaussian
process modeling. In: SDM. SIAM, New Delhi, pp 95–111
Cisco (2015) Cisco visual networking index: global mobile data traffic forecast update, 2014–2019.
Cisco Public Information, San Francisco
Czibula G, Bocicor M-I, Czibula I-G (2017) A reinforcement learning model for solving the folding
problem. Int J Comput Appl Technol 2017:171–182
Dean J, Ghemawat S (2005) MapReduce: simplified data processing on large clusters. In:
OSDIn’04, pp 137–150
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun
ACM 51(1):107–113
Dogan B, Ölmez T (2015) A novel state space representation for the solution of 2D-HP protein
folding problem using reinforcement learning methods. Appl Soft Comput 26:213–223
Ekanayake J, Li H, Zhang B, Gunarathne T, Bae SH, Qiu J, Fox G (2010) Twister: a runtime for
iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high
performance distributed computing. ACM, New York, pp 810–818
EMBL-European Bioinformatics Institute (2014a) EMBL-EBI Annual Scientific Report 2013
EMBL-European Bioinformatics Institute (2014b) EMBL-EBI annual scientific report 2013
Fogel DB (2006) Evolutionary computation: toward a new philosophy of machine intelligence, vol
1. Wiley, New York
Gartner (2015) What’s new in Gartner’s hype cycle for emerging technologies. Accessed 17 Feb
2017
Gershenfeld N, Krikorian R, Cohen D (2004) The internet of things. Sci Am 291(4):76
Herper M (2017) Illumina promises to sequence human genome for $100 but not quite yet. Forbes,
Jersey City, NJ
Hogeweg P (2011) The roots of bioinformatics in theoretical biology. PLOS Comput Biol 7(3):
e1002021
Kashyap H, Ahmed HA, Hoque N, Roy S, Bhattacharyya DK (2015) Big data analytics in
bioinformatics: a machine learning perspective, CoRR, arXiv:1506.05101
Kaufman L, Rousseeuw PJ (1990) Finding groups in data. An introduction to cluster analysis,
Wiley series in probability and mathematical statistics. Applied probability and statistics, vol
1. Wiley, New York
Kohvaei P, (2015) Reinforcement learning techniques in RNA inverse folding, Master’s Thesis,
Albert-Ludwigs Universität Freiburg
Kraska T, Talwalkar A, Duchi JC, Griffith R, Franklin MJ, Jordan MI (2013) Mlbase: a distributed
machine-learning system. In: CIDR
Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R,
Santafé G, Pérez A, Robles V (2006) Machine learning in bioinformatics. Brief Bioinform 7
(1):86–112
Li J, Lu Y, Akbani R, Zhenlin J, Roebuck PL, Liu W, Yang J-Y, Broom BM, Verhaak RGW, Kane
DW, Wakefield C, Weinstein JN, Mills GB, Liang H (2013) TCPA: a resource for cancer
functional proteomics data. Nat Methods 10:1046–1047
Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM (2012) Distributed graphlab:
a framework for machine learning and data mining in the cloud. Proc VLDB Endow 5
(8):716–727
Low Y, Gonzalez JE, Kyrola A, Bickson D, Guestrin CE, Hellerstein J (2014) Graphlab: a new
framework for parallel machine learning. arXiv:1408.2041 (preprint)
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a
system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD interna-
tional conference on management of data. ACM, New York, pp 135–146
Marx V (2013) Biology: the big challenges of big data. Nature 498(7453):255–260
National Institutes of Health (2018) Big data to knowledge phase I & II, June 2018. Accessed June
27 2018
Nordberg H, Bhatia K, Wang K, Wang Z (2013) BioPig: a Hadoop based analytic toolkit for large-
scale sequence data. Bioinformatics 29(23):3014–3019
Ohashi H, Hesegawa M, Wakimoto K, Miyamoto-Sato E (2015) Next-generation technologies for
multiomics approaches including interactome sequencing. BioMed Res Int 2015:104209
Ovsiannikov M, Rus S, Reeves D, Sutter P, Rao S, Kelly J (2013) The quantcast file system. Proc
VLDB Endow 6(11):1092–1101
Owen S, Anil R, Dunning T, Friedman E (2011) Mahout in action. Manning, Shelter Island, NY
Pareto V (1964) Cours d'économie politique. Droz, Genève
Park BH, Kargupta H (2002) Distributed data mining: algorithms, systems, and applications. In:
Data mining handbook. Springer, Berlin, pp 341–358
Raftery AE, Gneiting T, Balabdaoui F, Polakowski M (2005) Using Bayesian model averaging to
calibrate forecast ensembles. Mon Weather Rev 133(5):1155–1174
Rana O, Walker D, Li M, Lynden S, Ward M (2000) Paddmas: parallel and distributed data mining
application suite. In: Parallel and distributed processing symposium, 2000. IPDPS 2000.
Proceedings. 14th International. IEEE, New York, pp 387–392
Robison RJ (2014a) How big is the human genome? In: Precision medicine. Elsevier, Amsterdam
Robison RJ (2014b) How big is the human genome? In: Precision medicine. Elsevier, Amsterdam
Rojahn SY (2012) Breaking the genome bottleneck. MIT Technology Review.
Rosenberg E (2017) The human genome, Ch. 11. In: It’s in your DNA. From discovery to structure,
function and role in evolution, cancer and aging. Academic Press, Cambridge, MA, pp 97–98
Segal MR, Dahlquist KD, Conklin BR (2003) Regression approaches for microarray data analysis. J
Comput Biol 10(6):961–980
Shi W, Guo YF, Jin C, Xue X (2008) An improved generalized discriminant analysis for large-scale
data set. In: Machine learning and applications, 2008. ICMLA’08. Seventh international
conference. IEEE, New York, pp 769–772
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Mass
storage systems and technologies (MSST), 2010 IEEE 26th symposium. IEEE, New York, pp
1–10
The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the
human genome. Nature 489:57–74
Turner V, Gantz J, Reinsel D, Minton S (2014a) The digital universe of opportunities: rich data and
the increasing value of the internet of things. International data corporation, white paper,
IDC_1672
Turner V, Gantz J, Reinsel D, Minton S (2014b) The digital universe of opportunities: rich data and
the increasing value of the internet of things. International data corporation, white paper,
IDC_1672
Winkler H (1920) Verbreitung und Ursache der Parthenogenesis im Pflanzen - und Tierreiche.
Verlag Fischer, Jena
Wright R, Yang Z (2004) Privacy-preserving Bayesian network structure computation on
distributed heterogeneous data. In: Proceedings of the tenth ACM SIGKDD international
conference on knowledge discovery and data mining. ACM, New York, pp 713–718
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I
(2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster comput-
ing. In: Proceedings of the 9th USENIX conference on networked systems design and
implementation. USENIX Association, Berkeley, CA, p 2
Zakir J, Seymour T, Berg K (2015) Big data analytics. IIS 16:81–90
Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large
databases. In: ACM SIGMOD record, vol 25. ACM, New York, pp 103–114
Soft Computing in Bioinformatics
23
Vivek Srivastava
Abstract
In this chapter, we explore soft computing based techniques for bioinformatics. The necessity of soft computing techniques and their suitability for solving a wide spectrum of bioinformatics problems are reviewed. The basics of soft computing techniques are discussed, and their relevance to solving many bioinformatics problems is also elaborated. Experimental results on two real-world bioinformatics datasets demonstrate the efficacy of soft computing techniques over conventional ones for biological data problems.
Keywords
Soft computing · Data · Bioinformatics · Sequences · Artificial neural network
23.1 Introduction
The human brain is an enigmatic object of study that has long been a remarkable area for researchers. Biological neurons in the human brain are responsible for transmitting information from one point to another. Crucial characteristics of human intelligence are the abilities to recognize and classify patterns. Artificial neurons, modelled on biological neurons, provide a human-like capability for recognition and learning, and an artificial neural network composed of such neurons offers a learning ability that links artificial intelligence with human intelligence. Artificial intelligence aims to demonstrate intelligent behaviour in machines similar to that of human beings. It is a broad field of research that primarily includes search methods, knowledge representation, and learning.
V. Srivastava (*)
Rajkiya Engineering College Kannauj, Kannauj, Uttar Pradesh, India
The term soft computing was coined by Zadeh (1994). Bezdek (1992) was the first to describe the synergism of fuzzy logic, artificial neural networks and genetic algorithms as soft computing. According to the widely accepted view, the most prominent parts of soft computing are fuzzy logic, neural networks, evolutionary algorithms, swarm intelligence, ant colony optimization, probabilistic reasoning, belief networks and artificial life. In this chapter, the main focus is on fuzzy systems and fuzzy clustering, neural networks and support vector machines, evolutionary algorithms, hybrid intelligent systems, ant colony optimization and particle swarm optimization. These approaches are presented in Sects. 23.3.1, 23.3.2, 23.3.3, 23.3.4, 23.3.5, and 23.3.6, respectively.
Fuzzy sets are sets whose elements have different degrees of membership, defined by a membership function. The membership function decides the membership of an element in a set: membership refers to the degree of belongingness of an element to a set, and an element can belong to different sets with different degrees of belongingness. Fuzzy sets therefore provide flexible decision boundaries. An example of a fuzzy membership function is shown in Fig. 23.1. Membership can reach a maximum value of 1, and the degree of fuzziness is controlled by the fuzzifier (m). In the present work, we focus on fuzzy clustering and its synergism with evolutionary computation. Fuzzy clustering employs a membership function that represents the fuzzy partition of the input data into clusters. A pictorial representation of fuzzy clustering is shown in Fig. 23.2, which shows three clusters with overlapping boundaries; a single point can belong to two or more clusters with different degrees of belongingness. Introducing evolutionary algorithms into fuzzy clustering yields optimized results in fuzzy classification (Hruschka et al. 2009). It is well established that fuzzy clustering offers comparatively better ability than conventional methods in solving many problems, including classification problems (Mingoti and Lima 2006). Various versions of fuzzy clustering have been applied successfully to problems with noise (Yang et al. 2011). Hence, combining fuzzy c-means clustering with evolutionary optimization is a natural direction for such data.
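A minimal numpy sketch of the fuzzy c-means idea is given below. The two-dimensional toy data, the choice of c = 2 clusters, and the fuzzifier m = 2 are assumptions made only to keep the illustration short; they are not values taken from the chapter.

```python
# Alternate between updating fuzzy memberships (rows sum to 1) and
# recomputing cluster centres as fuzzily weighted means of the data.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
c, m, n_iter = 2, 2.0, 50

centres = X[rng.choice(len(X), c, replace=False)]       # initial centres
for _ in range(n_iter):
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
    u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
    w = u ** m
    centres = (w.T @ X) / w.sum(axis=0)[:, None]

print(centres)       # two centres, near (0, 0) and (5, 5)
```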
The artificial neural network is analogous to the biological neural network as a basis for machine intelligence. The artificial neuron model was first proposed by Warren McCulloch and Walter Pitts (McCulloch and Pitts 1943). The model has two basic functions: a summation part, which aggregates the weighted inputs, and a second part, which produces the output by applying an activation function to the aggregated information. A general neural network structure is shown in Fig. 23.3. It is basically a three-layer architecture containing an input, a hidden and an output layer. Each layer may have a different number of neurons.
In each generation of an evolutionary algorithm, different offspring are generated. Various kinds of evolutionary algorithms have been proposed (Hruschka et al. 2009; Deb 2001) and successfully implemented for solving clustering problems in the literature. Evolutionary algorithms are suitable for optimizing various aspects of clustering, such as partitioning quality and the number of clusters. Several studies (Hruschka et al. 2009; Zio and Baraldi 2005; Handl and Knowles 2007) have shown that incorporating evolutionary algorithms into clustering yields better clustering solutions that are also suitable for bioinformatics problems.
Evolutionary computation generally involves genetic algorithms, evolution strategies and evolutionary programming. All three components are heavily involved in solving various kinds of bioinformatics problems (Fogel and Corne 2003). In Chiesa et al. (2020), a genetic algorithm is used to identify a robust subset of features in high-dimensional datasets; a sketch of this general idea follows.
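The sketch below illustrates genetic-algorithm feature selection in the general spirit of the work cited above; it is not the authors' actual method. Individuals are binary masks over features, fitness is cross-validated accuracy on the selected columns, and the population size, mutation rate, and classifier are arbitrary choices for the example.

```python
# Genetic algorithm over binary feature masks: selection, crossover, mutation.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n_features, pop_size, generations = X.shape[1], 20, 10

def fitness(mask):
    """Cross-validated accuracy of a classifier trained on the masked features."""
    if not mask.any():
        return 0.0
    clf = LogisticRegression(max_iter=5000)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

population = rng.random((pop_size, n_features)) < 0.5            # random masks
for _ in range(generations):
    scores = np.array([fitness(ind) for ind in population])
    parents = population[np.argsort(scores)[::-1][: pop_size // 2]]  # selection
    children = []
    for _ in range(pop_size - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]           # two parents
        cut = rng.integers(1, n_features)                            # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_features) < 0.02                         # mutation
        children.append(np.where(flip, ~child, child))
    population = np.vstack([parents, np.array(children)])

best = population[np.argmax([fitness(ind) for ind in population])]
print("selected", int(best.sum()), "of", n_features, "features")
```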
In the last few decades, hybridization of soft computing techniques has proved promising and successful (Ozbey et al. 2006; Zio and Baraldi 2005). Computationally improved systems result from the synergy of evolutionary algorithms, fuzzy logic and neural networks. A pictorial representation of hybrid soft computing is shown in Fig. 23.4: three overlapping circles represent these three techniques. The intersection of two circles represents the synergism of two techniques, giving rise to evolutionary-fuzzy, neuro-fuzzy and evolutionary-neural systems, while the intersection of all three designates the hybridization of the three techniques, which leads to evolutionary-neuro-fuzzy systems. In recent research, hybrid computational intelligence approaches have been used to resolve many different real-world concerns effectively (Sio-Iong 2009; Su 2011; Zhang et al. 2002). The techniques have also been applied to biometrics (Bhattacharjee et al. 2010; Zhang et al. 2002; Haddadnia et al. 2003). Various studies (Bhattacharjee et al. 2010; Ozbey et al. 2006; Su 2011; Zhang et al. 2002) confirm the strength and efficacy of such techniques.
Ant colony optimization (ACO) is a general population-based search method for solving difficult combinatorial problems, inspired by the pheromone-trail-laying behaviour of real ant colonies. Each ant can be regarded as a simple computing agent that incrementally builds a solution, and states are regarded as partial solutions of the problem. Identifying small subsets of highly predictive and biologically relevant genes in bioinformatics is very tedious. The ant colony algorithm (ACA) incorporates prior information and makes the search of the sample space efficient, allowing small subsets of very important biological genes to be identified without an extensive preselection of features when applied to high-dimensional datasets. Thus, ant colony optimization and ant colony algorithms are both very useful for bioinformatics.
Various recent studies demonstrate that ant colony optimization is applicable to many bioinformatics problems. In Kleinkauf et al. (2015), an ACO meta-heuristic is employed for RNA design and superior performance is obtained on biological datasets. The ACO algorithm has been applied to the 2D and 3D hydrophobic-polar protein folding problem (Shmygelska and Hoos 2005). In Do Duc et al. (2018), the authors designed an efficient ant colony optimization algorithm for protein structure prediction, and in other research (Wu 2020) ACO has also proved useful for DNA sequence alignment. A toy sketch of the basic ACO mechanism follows.
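The toy sketch below shows the core ACO loop, probabilistic solution construction guided by pheromone, followed by evaporation and deposit, on a tiny travelling-salesman instance. The city coordinates, colony size, and parameter values are invented for illustration and are unrelated to the cited bioinformatics applications.

```python
# Ants build tours biased by pheromone and edge length; good tours reinforce
# their edges, and pheromone evaporates each iteration.
import numpy as np

rng = np.random.default_rng(0)
cities = rng.random((6, 2)) * 10                        # 6 random city positions
n = len(cities)
dist = np.linalg.norm(cities[:, None] - cities[None, :], axis=2) + np.eye(n)
pheromone = np.ones((n, n))
alpha, beta, rho, n_ants, n_iter = 1.0, 2.0, 0.5, 10, 50

def build_tour():
    """One ant builds a tour, preferring short edges with much pheromone."""
    tour = [0]
    while len(tour) < n:
        i = tour[-1]
        choices = [j for j in range(n) if j not in tour]
        w = np.array([pheromone[i, j] ** alpha / dist[i, j] ** beta for j in choices])
        tour.append(int(rng.choice(choices, p=w / w.sum())))
    return tour

def tour_length(tour):
    return sum(dist[tour[k], tour[(k + 1) % n]] for k in range(n))

best_tour, best_len = None, np.inf
for _ in range(n_iter):
    tours = [build_tour() for _ in range(n_ants)]
    pheromone *= (1 - rho)                              # evaporation
    for t in tours:
        length = tour_length(t)
        if length < best_len:
            best_tour, best_len = t, length
        for k in range(n):                              # deposit along the tour
            a, b = t[k], t[(k + 1) % n]
            pheromone[a, b] += 1.0 / length
            pheromone[b, a] += 1.0 / length

print("best tour:", best_tour, "length:", round(best_len, 3))
```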
Particle swarm optimization (PSO) models social behaviours such as the strategies used for locating roosting sites, food sources or other suitable habitat in bird flocks or fish schools. PSO was originally developed by Kennedy and Eberhart (Kennedy and Eberhart 1995). The PSO algorithm maintains several candidate solutions simultaneously in the search space of the given problem; each candidate solution is evaluated by the objective function, which determines its fitness. The algorithm initially chooses candidate solutions at random in the search space, and each candidate can be viewed as a particle "flying" through the fitness landscape in search of the maximum or minimum of the objective function.
Das et al. (2008) give a broad demonstration of the applicability of swarm intelligence in bioinformatics. In other work, particle swarm optimization was combined with simulated annealing to solve the protein multiple sequence alignment problem (Chaabane 2018), and PSO with a gene-scoring strategy was employed for hybrid gene selection (Han et al. 2019). A minimal sketch of the particle update appears below.
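The following minimal numpy sketch of the PSO particle update minimizes the simple sphere function f(x) = sum(x**2). The swarm size, inertia, and acceleration coefficients are conventional textbook choices, not values from the chapter.

```python
# Each particle keeps its own best position; the swarm shares a global best.
import numpy as np

rng = np.random.default_rng(0)
dim, n_particles, n_iter = 5, 30, 200
w, c1, c2 = 0.7, 1.5, 1.5                      # inertia, cognitive, social weights

f = lambda x: np.sum(x ** 2, axis=-1)          # objective (fitness) function

pos = rng.uniform(-5, 5, (n_particles, dim))   # random initial candidate solutions
vel = np.zeros_like(pos)
pbest = pos.copy()                             # each particle's best position so far
pbest_val = f(pbest)
gbest = pbest[np.argmin(pbest_val)]            # best position found by the swarm

for _ in range(n_iter):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    vals = f(pos)
    improved = vals < pbest_val                # update personal bests
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[np.argmin(pbest_val)]        # update the swarm's best

print("best value found:", f(gbest))           # close to 0
```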
The promoter gene sequence problem (Harley and Reynolds 1987; Towell et al. 1990) contains 106 instances with 59 attributes each. Promoters initiate the process of gene expression, and the task is to predict whether a sequence is a member or non-member of the class of sequences with biological promoter activity. The attributes are non-numeric: each takes one of the values 'a', 'g', 't' and 'c' (a = adenine, g = guanine, t = thymine and c = cytosine). A snapshot of the DNA gene sequences is shown in Fig. 23.5. The class distribution in this dataset is 50% for each class, i.e. 53 instances in the positive class and 53 instances in the negative class. In this work, we selected 50% of the data for the training set and the remaining 50% for the test set. Table 23.1 summarizes the results obtained for the promoter gene dataset. The NN yields far better results than fuzzy clustering. The MLP considered consists of two NNs of size 59-18-1 each; it yields a maximum accuracy of 92.67% with 2 × 1099 learning parameters over 20,000 learning cycles. The HI-model again yields the best accuracy of 96.22% at C = 3 with 3 × 498 learning parameters in an associated NN structure of 59-8-2(3), over 4000 learning cycles on average.
It has been observed that selecting more than two members in a cluster decreases performance, and performance also degrades when more or fewer than three clusters are selected; the maximum accuracy is attained at C = 3 and MCM = 2. Compared with other existing work, the best reported accuracy is 93.33%, achieved by the decision tree technique in Noordewier et al. (1991). Hence, soft computing methods are well suited to, and yield promising results for, this bioinformatics dataset. A sketch of how such sequence attributes can be encoded for a classifier follows.
The primate splice-junction dataset contains a total of 3190 instances with 62 attributes each. Splice junctions are points on a DNA sequence at which superfluous DNA is removed during the process of protein creation in higher organisms (Noordewier et al. 1991). The main problem is to find the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out) in a given DNA gene sequence. Hence, the problem
contains three classes. The first is the intron–exon (IE) boundary, sometimes called the acceptor site. The second is the exon–intron (EI) boundary, sometimes called the donor site. The third class belongs to neither acceptors nor donors (Neither). The class distribution is as follows: 767 instances for IE, 768 instances for EI and 1655 instances for the Neither class. A snapshot of these gene sequences is shown in Fig. 23.6. The gene sequences shown in the smaller font are examples of the acceptor class, while the sequence shown in the larger font illustrates the donor class; the italicized font shows an example of the neither class.
Fig. 23.6 Splice-junction gene sequences showing acceptor, donor and neither classes
For the splice gene data, we have selected the training and testing data according to the division made in Bower and Bolouri (2004) to allow a fair comparison. The training set therefore contains 2000 instances (500 for the acceptor class, 500 for the donor class and 1000 for the neither class).
The test set contains the remaining 1190 instances (267 for the acceptor class, 268 for the donor class and 655 for the neither class). The comparative analysis for the splice gene sequences is shown in Table 23.2. The considered MLP consists of three NNs, each of size 62-20-1. It achieves an accuracy of up to 90.12% with 3 × 661 learning parameters over 30,000 learning cycles on average. The HI-model obtains a maximum accuracy of 95.6% with only 3 × 531 learning parameters in the associated NN structure of 62-8-3(3) and 6000 learning cycles on average.
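To make the three-network arrangement concrete, the sketch below trains one binary MLP per class (IE/acceptor, EI/donor, neither) and labels a test sequence with the class whose network gives the highest score. The hidden-layer width follows the 62-20-1 description above; the remaining settings are assumptions, and this is a generic one-vs-rest baseline rather than the chapter's HI-model.

import numpy as np
from sklearn.neural_network import MLPClassifier

def train_one_vs_rest(X_tr, y_tr, classes=('IE', 'EI', 'Neither')):
    """Train one 62-20-1 style binary network per splice-junction class.

    X_tr holds the 62 encoded attributes per instance; y_tr holds the
    class labels 'IE', 'EI' or 'Neither'.
    """
    nets = {}
    for c in classes:
        net = MLPClassifier(hidden_layer_sizes=(20,), activation='logistic',
                            max_iter=5000, random_state=0)
        net.fit(X_tr, (np.asarray(y_tr) == c).astype(int))  # 1 for class c
        nets[c] = net
    return nets

def predict(nets, X):
    """Assign each row of X to the class whose network scores it highest."""
    labels = list(nets)
    scores = np.column_stack([nets[c].predict_proba(X)[:, 1] for c in labels])
    return np.array([labels[i] for i in scores.argmax(axis=1)])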
In comparison with the techniques reported in Noordewier et al. (1991), the decision tree technique yields 93.6% accuracy and the SVM obtains 92% accuracy, while the assoDNA method yields 96.1% accuracy. The HI-model clearly yields better accuracy than the decision tree and SVM techniques and is competitive with assoDNA. Hence, the HI-model again emerges as a strong computing model among the mentioned techniques, even for highly complex gene sequence prediction.
23.5 Conclusions
References
Aerts S, Van Loo P, Moreau Y et al (2004) A genetic algorithm for the detection of new
cis-regulatory modules in sets of coregulated genes. Bioinformatics 20(12):1974–1976
Barbosa F, Devito K, Filho W (2009) Using a neural network for supporting radiographic diagnosis of dental caries. Int J Appl Artif Intell 23(9):872–882
Bezdek JC (1992) On the relationship between neural networks, pattern recognition and intelli-
gence. Int J Approx Reason 6:85–107
Bhattacharjee D, Basu DK, Nasipuri M, Kundu M (2010) Human face recognition using fuzzy
multilayer perceptron. Soft Comput 14:559–570
Bower JM, Bolouri H (eds) (2004) Computational modeling of genetic and biochemical networks. MIT Press, Cambridge, MA
Carter RJ, Dubchak I, Holbrook SR (2001) A computational approach to identify genes for
functional RNAs in genomic sequence. Nucleic Acids Res 29(19):3928–3938
Chaabane L (2018) A hybrid solver for protein multiple sequence alignment problem. J Bioinform Comput Biol 14(4):185–200
Chiesa M, Maioli G, Colombo GI et al (2020) GARS: genetic algorithm for the identification of a
robust subset of features in high-dimensional datasets. BMC Bioinform 21:54
Das S, Abraham A, Konar A (2008) Swarm intelligence algorithms in bioinformatics. In:
Kelemen A, Abraham A, Chen Y (eds) Computational intelligence in bioinformatics. Studies
in computational intelligence, vol 94. Springer, Berlin
Deb K (2001) Multi-objective optimization using evolutionary algorithms. Wiley, New York
Dembélé D, Kastner P (2003) Fuzzy C-means method for clustering microarray data. Bioinformat-
ics 19(8):973–980
Do Duc D, Thai Dinh P, Thi Ngoc Anh V, Linh-Trung N (2018) An efficient ant colony optimization algorithm for protein structure prediction. In: 12th International Symposium on Medical Information and Communication Technology (ISMICT), Sydney, pp 1–6
Fazendeiro P, de Oliveira JV (2007) A semantic driven evolutionary fuzzy clustering algorithm. In:
Proceedings of IEEE International Fuzzy Systems Conference. IEEE, New York, pp 1–6
Fogel GB, Corne DW (2003) Evolutionary computation in bioinformatics. Morgan Kaufmann, Burlington, MA
Gaxiola F, Melin P (2010) Modular neural networks for person recognition using segmentation and
the iris biometric measurement with image pre-processing. In: International joint conference on
neural networks (IJCNN). IEEE, New York, pp 2765–2771
Haddadnia J, Faez K, Ahmadi M (2003) A fuzzy hybrid learning algorithm for radial basis function
neural network with application in human face recognition. Pattern Recogn 36(5):1187–1202
Han F, Tang D, Sun Y, Cheng Z, Jiang J, Li Q (2019) A hybrid gene selection method based on
gene scoring strategy and improved particle swarm optimization. BMC Bioinform 20(Suppl 8):1–13
Handl J, Knowles J (2007) An evolutionary approach to multiobjective clustering. IEEE Trans Evol
Comput 11(1):56–76
Harley C, Reynolds R (1987) Analysis of E. coli promoter sequences. Nucleic Acids Res 15:2343–2361
Hruschka ER, Campello RJGB, Freitas AA, de Carvalho ACPLF (2009) A survey of evolutionary algorithms for clustering. IEEE Trans Syst Man Cybern Part C 39(2):133–155
Huang S, Cai N, Pacheco P, Narandes S, Wang Y, Xu W (2018) Applications of support vector machine (SVM) learning in cancer genomics. Cancer Genom Proteom 15(1):41–51
Kennedy J, Eberhart RC (1995) Particle swarm optimization. In: Proceedings of IEEE international
conference on neural networks, Piscataway, NJ, vol 4. IEEE, New York, pp 1942–1948
Kleinkauf R, Mann M, Backofen R (2015) antaRNA: ant colony-based RNA sequence design.
Bioinformatics 31(19):3114–3121
Krallinger M, Erhardt RA, Valencia A (2005) Text-mining approaches in molecular biology and
biomedicine. Drug Discov Today 10(6):439–445
Lancashire LJ, Lemetre C, Ball GR (2009) An introduction to artificial neural networks in
bioinformatics- application to complex microarray and mass spectrometry datasets in cancer
studies. Brief Bioinform 10(3):315–329
Larranaga P, Calvo B, Santana R et al (2006) Machine learning in bioinformatics. Brief Bioinform 7(1):86–112
Leondes CT (2003) An application of artificial neural networks to DNA sequence analysis. In:
Leondes CT (ed) Computational methods in biophysics, biomaterials, biotechnology and
medical systems. Springer, Boston, MA
Lu J, Yuan X, Yahagi T (2007) A method of face recognition based on fuzzy c-means clustering and associated sub-NNs. IEEE Trans Neural Netw 18(1):152–160
Maji P, Paul S (2017) Fundamentals of rough-fuzzy clustering and its application in bioinformatics.
In: Pattern recognition and big data. World Scientific, Singapore, pp 513–543
Mathe C, Sagot MF, Schiex T et al (2002) Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 30(19):4103–4117
McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull
Math Biophys 5:115–133
Mingoti SA, Lima JO (2006) Comparing SOM neural network with fuzzy c-means, K-means and
traditional hierarchical clustering algorithms. Eur J Oper Res 174:1742–1759
Noordewier MO, Towell GG, Shavlik JW (1991) Training knowledge-based neural networks to recognize genes in DNA sequences. In: Advances in neural information processing systems, vol 3. Morgan Kaufmann, Burlington, MA
Ozbay Y, Ceylan R, Karlik B (2006) A fuzzy clustering neural network architecture for classification of ECG arrhythmias. Comput Biol Med 36:376–388
Saha I, Mukhopadhyay A, Maulik U (2009) Combining fuzzy clustering with ANN classifier for
categorical data. In: IEEE International Advance Computing Conference, (IACC 2009). IEEE,
New York, pp 44–49
Shmygelska A, Hoos HH (2005) An ant colony optimisation algorithm for the 2D and 3D
hydrophobic polar protein folding problem. BMC Bioinform 6:30
Sio-Iong AO (2009) Hybrid intelligent regressions with neural network and fuzzy clustering. In: Ao SI, Rieger B, Chen SS (eds) Advances in computational algorithms and data analysis. Lecture notes in electrical engineering, vol 14. Springer, Dordrecht
Su ZX (2011) A hybrid fuzzy approach to fuzzy multi attribute group decision making. Int J Inf
Technol Decis Making 10(4):695–711
Tampuu A, Bzhalava Z, Dillner J, Vicente R (2019) ViraMiner: deep learning on raw DNA
sequences for identifying viral genomes in human samples. PLoS One 14(9):e0222271
Tang B, Pan Z, Yin K, Khateeb A (2019) Recent advances of deep learning in bioinformatics and
computational biology. Front Genet 10:1–10
Towell G, Shavlik J, Noordewier M (1990) Refinement of approximate domain theories by
knowledge-based artificial neural networks. In: Proceedings of the Eighth National Conference
on Artificial Intelligence (AAAI-90). AAAI, Menlo Park, CA
Tripathi BK, Kalra PK (2010a) The novel aggregation function based neuron models in complex
domain. Soft Comput 14(10):1069–1081
Tripathi BK, Kalra PK (2010b) High dimensional neural networks and applications. In: Intelligent
autonomous systems, SCI, vol 275. Springer, Berlin, pp 215–233
Tripathi BK, Kalra PK (2011a) On the learning machine for three dimensional mapping. Neural
Comput Appl 20:105–111
Tripathi BK, Kalra PK (2011b) On efficient learning machine with root-power mean neuron in
complex-domain. IEEE Trans Neural Network 22(5):727–738
Tripathi BK, Kalra PK (2011c) Complex generalized-mean neuron model and its applications. Appl
Soft Comput 11(01):768–777
Wang Z, Jiang M, Hu Y, Li H (2012) An incremental learning method based on probabilistic neural
networks and adjustable fuzzy clustering for human activity recognition by using wearable
sensors. IEEE Trans Inf Technol Biomed 16(4):691–699
Wu T (2020) DNA sequence alignment based on ants’ colony algorithm. AIP Conf Proc
2208:020030
Yang X, Zhang G, Lu J, Ma J (2011) A kernel fuzzy c-means clustering-based fuzzy support vector machine algorithm for classification problems with outliers or noises. IEEE Trans Fuzzy Syst 19(1):105–115
Yuan B, Klir GJ, Stone JF (1995) Evolutionary fuzzy c-means clustering. In: Proceedings of
International Conference on Fuzzy System. ACM, New York, pp 2221–2226
Zadeh LA (1994) Fuzzy logic, neural networks and soft computing. Commun ACM 37(3):77–84
Zamani M, Kremer SC (2013) Neural networks in bioinformatics. In: Bianchini M, Maggini M, Jain
L (eds) Handbook on neural information processing. Intelligent systems reference library, vol
49. Springer, Berlin
Zhang D, Peng H, Zhou J, Pal SK (2002) A novel face recognition system using hybrid neural and
dual Eigenspaces methods. IEEE Trans Syst Man Cybern A 32(6):787–793
Zio E, Baraldi P (2005) Evolutionary fuzzy clustering for the classification of transient in nuclear
components. Prog Nucl Energy 46(3–4):282–296