Statistical Modelling ML Principles Bioinformatics
K. G. Srinivasa
G. M. Siddesh
S. R. Manisekhar Editors
Statistical Modelling
and Machine Learning
Principles for
Bioinformatics
Techniques, Tools,
and Applications
Algorithms for Intelligent Systems
Series Editors
Jagdish Chand Bansal, Department of Mathematics, South Asian University,
New Delhi, Delhi, India
Kusum Deep, Department of Mathematics, Indian Institute of Technology Roorkee,
Roorkee, Uttarakhand, India
Atulya K. Nagar, Department of Mathematics and Computer Science,
Liverpool Hope University, Liverpool, UK
This book series publishes research on the analysis and development of algorithms
for intelligent systems with their applications to various real world problems. It
covers research related to autonomous agents, multi-agent systems, behavioral
modeling, reinforcement learning, game theory, mechanism design, machine
learning, meta-heuristic search, optimization, planning and scheduling, artificial
neural networks, evolutionary computation, swarm intelligence and other algo-
rithms for intelligent systems.
The book series includes recent advancements, modification and applications
of the artificial neural networks, evolutionary computation, swarm intelligence,
artificial immune systems, fuzzy system, autonomous and multi agent systems,
machine learning and other intelligent systems related areas. The material will be
beneficial for the graduate students, post-graduate students as well as the
researchers who want a broader view of advances in algorithms for intelligent
systems. The contents will also be useful to the researchers from other fields who
have no knowledge of the power of intelligent systems, e.g. the researchers in the
field of bioinformatics, biochemists, mechanical and chemical engineers,
economists, musicians and medical practitioners.
The series publishes monographs, edited volumes, advanced textbooks and
selected proceedings.
Editors
K. G. Srinivasa
Department of Informatics, Computer Science & Engineering
National Institute of Technical Teachers Training and Research
Chandigarh, Chandigarh, India

G. M. Siddesh
Department of Information Science and Engineering
Ramaiah Institute of Technology
Bengaluru, Karnataka, India
S. R. Manisekhar
Department of Information Science
and Engineering
Ramaiah Institute of Technology
Bengaluru, Karnataka, India
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
discovery. Finally, the proper steps to lead the experiments for using various tools and
databases with examples are addressed.
Chapter “Machine Learning for Bioinformatics” highlights various ways to
analyse bioinformatics data using machine learning. The authors highlight the
different machine learning techniques used in bioinformatics, and present two
case studies applying artificial neural networks in the bioinformatics area.
Chapter “Impact of Machine Learning in Bioinformatics Research” explains the
importance of advanced tools in the bioinformatics area for studying molecular
phenotypes, supporting drug discovery, and helping to characterize unfamiliar
diseases. The chapter also gives a detailed overview of the impact of machine
learning on the field of bioinformatics.
Chapter “Text Mining in Bioinformatics” explores the field of text mining in
bioinformatics. The authors introduce text mining and its implementation, and
extend the discussion with a few case studies of text mining using different
approaches.
Chapter “Open-Source Software Tools for Bioinformatics” provides a brief
discussion of open-source tools used in the bioinformatics area. The chapter also
explains the architectures, domains, and data management issues involved in
solving many life-science problems.
Bioinformatics
Introduction to Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
S. R. Manisekhar, G. M. Siddesh and Sunilkumar S. Manvi
Review About Bioinformatics, Databases, Sequence Alignment,
Docking, and Drug Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
P. Lakshmi and D. Ramyachitra
Machine Learning for Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
K. Aditya Shastry and H. A. Sanjay
Impact of Machine Learning in Bioinformatics Research . . . . . . . . . . . . 41
E. Naresh, B. P. Vijaya Kumar, Ayesha and Sahana P. Shankar
Text Mining in Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Minal Moharir and Preetham Maiya
Open-Source Software Tools for Bioinformatics . . . . . . . . . . . . . . . . . . . 75
T. Gururaj and A. P. Pavithra
1 Introduction
The human species, being a self-aware and intelligent life form, has always sought
to understand how the human body works: how our physical structure came to be,
which chemical processes, complex molecular interactions and physiological
mechanisms underlie it, and how the body developed and evolved.
Many comparisons were made between the human body and the intricacy of man-
made machines and tools like timepieces, mechanical toys and pneumatic machines,
even stretching to the extent of mills, factories and assembly lines during the Industrial
Revolution.
Soon, this interest began to extend towards other species: their interactions with
nature, their physiology, and how they co-inhabited, interacted with and adapted
to the same environment as us. By studying countless species, we were able to
gather some of this information, but the task was becoming more and more difficult.
As more species were discovered, the data grew larger and more complex, making
it harder to analyze.
With the dawn of the computer age, computers were becoming more accessi-
ble beyond the electrical engineering departments. Many biologists realized that
they could use various computational methods to solve the data-intensive biologi-
cal problems that came up. Soon, ecologists, biotechnologists, agricultural scientists
and medical researchers could identify and successfully classify every gene, every
protein, every control signal—and not only for humans. Bioinformatics became the
guiding path to gain this knowledge.
3 Importance of Bioinformatics
Bioinformatics is used to comprehend and analyze the regulation of cells, the
function of genes, disease, and drug design. Without the analysis methods that
Bioinformatics provides, the massive amounts of system-generated data could not
be turned into any meaningful interpretation. The modern developments in the
field of biology and medicine could not have been possible without Bioinformatics.
The demand for this skill is higher than ever, as more and more advancements are
made in the management and analysis of this data.
4 Emergence of Bioinformatics
The term “Bioinformatics” was coined in 1970 by Hesper and Hogeweg to refer
to biotic system-related information processes, which drew parallelism between
biochemistry and Bioinformatics as a field.
The concept of Bioinformatics seems to have emerged recently to assist in the
compilation and study of biological sequences; however, Bioinformatics was
conceptualized over 50 years ago, when DNA sequencing was not yet possible.
Computational methods laid the foundation for protein sequence analysis in the
early 1960s (notably, biological sequence databases, substitution models, and
de novo sequence assembly). DNA analysis emerged due to (1) molecular biological methods and
(2) computer software and technology advancements which saw the increase in
computational power due to miniaturization of parts, as well as custom software
built to handle bioinformatic calculations and tasks. It is widely accepted that the
mathematization of biology on many fronts led to the development of this new field
of Bioinformatics. The machines only helped accelerate the entire process.
To understand the basic difference between these fields, let us look at an overview
of the basic definitions of the two terms. Bioinformatics involves the application,
research, and development of computational tools, and aims to acquire, store,
organize, archive, and visualize data (not necessarily in this order). The data may
be biological, behavioural, medical, or health data. Computational Biology involves
the development and application of data-analysis methods, mathematical modelling,
and simulation techniques to study social, behavioural, and biological systems.
Table 1 discusses some of the key differences [6].
Suppose we have to execute/test an engineering method or model: we would design
it to have certain performance characteristics and to perform as intended. We then
validate this method and test its ability to solve a class of similar problems with
the help of software (that is, testing the model or method against a large amount of
data to prove its correctness and validate it), and we proceed to write papers about
the method. This would be an example of Bioinformatics.
When a method is used to answer a biological question, its success does not depend
on the computational methods/tools applied. What matters is whether the new
discovery or answer is true and validated, and whether the evidence is up to the
standard expected by the biological community. This would be a case of
computational biology.
However, there is no sharp boundary between these fields today. With the increase
in data and the advancement of technology, the line drawn between them is
becoming blurred. The major differences are still present, but the similarities
between the fields are increasing. Thus, one may argue that the distinction between
these fields varies from researcher to researcher.
One thing that changes drastically as raw data grows is the way it is analyzed and
studied. Greater computational power is required, and new computational tools and
methods have to be developed to ensure greater accuracy in predicting outcomes,
especially in biology, where getting a protein prediction right can save thousands
of lives. Thus, computational methods are an integral part of the field as a whole,
used for the management, study, and representation of data. They are interwoven
into the fabric of biology.
Let us take a look at some of the subfields of computational biology, the
advancements made in them, and how they approach biological problems.
• Computational Bio-modelling: This field tackles biological problems by building
computer models. It uses simulations to assess and evaluate the complexity of
bio-systems; approaching such problems requires highly specialized algorithms and
visualization software [7]. Robustness is tested by observing how the system
changes when exposed to different environments. Computational bio-modelling
also allows multiple users to study the huge amounts of data generated.
• Computational Genomics: This field involves the study of the genomes of cells
and organisms [8, 9]. One example is the Human Genome Project, which aims to
acquire data about the entire human genome; it will allow doctors to analyze a
patient's genome, which in turn opens the doorway to target-specific, personalized
medicine based on the patient's genetic pattern. Sequence homology is used to
compare genomes, by comparing and studying nucleotide sequences derived from
a common ancestor. This field is still developing, and researchers are trying to
understand the non-coding
areas of the genome via large consortia projects such as the Roadmap Epigenomics
Project and various other computational and statistical methods.
• Computational Neuroscience: This field studies the properties and structure of
the nervous system and can create tools for practical application by analyzing
brain data [10]. Models of the brain are used to analyze different aspects of the
nervous system. Some brain models include realistic brain models, which
represent the brain in detail, down to the cellular level. These models, though
intricate and detailed, are prone to errors, as they cannot include the cellular
structures that scientists do not yet know about; they are also very costly to
implement and require a huge amount of computational power. The other is the
simplifying brain model, which assesses the physical properties of neurological
systems and reduces the potential error by a larger factor than the realistic
brain model.
• Computational Pharmacology: In "Computational Biologists: The Next Pharma
Scientists?", Michael Price defines computational pharmacology as "the study of
the effects of genomic data to find links between specific genotypes and diseases
and then screening drug data" [11]. Pharmaceutical companies needed to move
beyond the traditional use of Excel worksheets for comparing data on drug
effectiveness, which led to the development of this subfield. Computational
methods such as machine learning are used to analyze this huge collection of
data, leading to more meaningful and efficient comparisons and, ultimately,
better drug development.
• Computational Evolutionary Biology: This field makes use of DNA data to
reconstruct the tree of life. It relies on computational phylogenetics (algorithms
and methods for phylogenetic analysis). It also fits population models to DNA data
and evaluates demographic and selective history. Predictions about which
evolutionary outcomes are likely are made by building population-genetics models
from first principles.
• Cancer Computational Biology: This field aims to determine future mutations in
cancer cells [12]. It applies algorithmic approaches to the analysis of data from
high-throughput measurements. Data is collected from RNA, DNA, and other
structures. It helps in determining the causes of tumours and cancer, and how
they are connected to the human genome.
• Computational Neuropsychiatry: This is an emerging field that uses computers
to model the brain in order to detect the mechanisms involved in mental disorders.
It also helps to explain how mental functions occur and how dysfunctions arise,
through an understanding of the underlying neuronal circuits [13].
• Computational Anatomy: This field involves the imaging of biological and
anatomical structures, via technologies such as MRI, to obtain dense 3D measure-
ments [14]. It also requires the implementation of mathematical and data analytical
modelling methods for simulation of structures.
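Among the subfields above, distance-based computational phylogenetics lends itself to a compact illustration. The sketch below, with made-up sequences, computes pairwise Hamming distances and repeatedly joins the closest pair of clusters in the spirit of UPGMA; real phylogenetic pipelines use richer evolutionary distance models and dedicated software.

```python
# Toy distance-based phylogeny: Hamming distances between equal-length
# sequences, then repeated joining of the closest pair (UPGMA-like,
# average linkage). All sequences below are illustrative inventions.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def upgma(seqs):
    # Each cluster: (tree, member_sequences); tree is a nested tuple of names.
    clusters = [(name, [seq]) for name, seq in seqs.items()]
    def avg_dist(a, b):
        return sum(hamming(x, y) for x in a[1] for y in b[1]) / (len(a[1]) * len(b[1]))
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them into one subtree
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]))
        a, b = clusters[j], clusters[i]
        clusters.pop(j); clusters.pop(i)  # remove higher index first
        clusters.append(((b[0], a[0]), b[1] + a[1]))
    return clusters[0][0]

seqs = {"human": "ACGTACGT", "chimp": "ACGTACGA", "mouse": "ACTTACCA"}
tree = upgma(seqs)
```

Here "human" and "chimp" differ at one position and therefore join first, so the resulting tree groups them as a pair before attaching "mouse".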
7 Conclusion
Biological data must be interpreted for analysis, and processed into the various
forms that can be stored in and retrieved from databases. The importance of a
particular gene or protein is recognized, with a focus on genetics. The particular
or common domain that gives access to a wide variety of resources has to be
authenticated. Bioinformatics brings applications of computer science and
information technology into the field of molecular biology. Raw data can be
stored in a database, where it is organized and manipulated so that extracted
results can be easily utilized by researchers.
Figure 1 shows some layers of the bioinformatic process. The level of detail of
the sequence information stored in biological databases depends on the particular
organism. Such databases make it easier to understand a new genome or protein
sequence, and to compare and identify sequences. For that purpose, various tools
are addressed.
Knowledge of genomics and bioinformatics tools is very important for scientists
in veterinary and animal science. Using these tools, improvements in the
productivity of farm animals will become possible in the future. As sequencing
costs fall, whole populations of individuals might be genotyped and sequenced.
Fig. 1 Overview of bioinformatics and interaction of genetics [1]

Algorithms, databases, statistics, control and system theory, circuit theory,
structural biology, information and computation theory, Web technologies, soft
computing, software engineering, data mining, image processing, and artificial
intelligence are all part of bioinformatics, and help to improve our understanding
of the development of living organisms.
Homology modeling requires sequence alignment in order to predict the structure
of a protein from a known homologous structure. The alignment helps to identify
the differences between the target sequence and the template structure. BLAST
and FASTA are the basic operations for sequence alignment. The steps of sequence
alignment are: identification of the sequence, database search, homolog detection,
sequence alignment, addition of structural information, and performing the
required operation. X-ray crystallography, NMR, and related experimental
instruments are discussed as sources of structural information. Some computational
algorithms are also addressed to align sequences effectively. Three types of
alignment are available: single, pairwise, and multiple sequence alignment.
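As a concrete illustration of pairwise alignment, here is a minimal Needleman–Wunsch global alignment (the dynamic-programming method of reference [2]) in Python. The match/mismatch/gap scores are illustrative defaults, not the parameters used by BLAST or FASTA.

```python
# Minimal Needleman-Wunsch global alignment (dynamic programming).
# Scoring values (match/mismatch/gap) are illustrative, not standard defaults.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback to recover one optimal alignment
    ai, bi, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + \
                (match if a[i - 1] == b[j - 1] else mismatch):
            ai.append(a[i - 1]); bi.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ai.append(a[i - 1]); bi.append('-'); i -= 1
        else:
            ai.append('-'); bi.append(b[j - 1]); j -= 1
    return ''.join(reversed(ai)), ''.join(reversed(bi)), score[n][m]

aln_a, aln_b, s = needleman_wunsch("GATTACA", "GCATGCU")
```

The two returned strings have equal length, with `-` marking gaps; the third value is the optimal global alignment score under the chosen scoring scheme.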
DNA sequencing and alignment, gene-expression measurement techniques, and
modeling studies of protein folding patterns produce datasets requiring terabytes
of storage; modern computational and machine learning capabilities help to reduce
the complexity of their analysis, storage, and interpretation [2]. DNA sequencing
is critical for any genomic experiment. It is costly, demands a lot of time, and
can still be technically difficult. New emerging technologies were
successful in solving most of these problems, although for animal genomic studies
cost is still a limitation. Any information generated downstream depends on the
sequencing; for this reason, the sequencing process should be as accurate as
possible [3]. Since then, the genomes of different species, including livestock
animals, have been sequenced. Much further investigation is needed to understand
more about genes, their functionality, and their involvement in disease [4].
2 Databases
Information on protein and gene sequences is stored in databases. These can be
easily manipulated, and the databases are located in different places; day to day,
the information is exchanged, updated, and synchronized [8]. A database search
for a gene depends on the following kinds of information:
• Gene homologs: taxonomic distribution, allele frequencies, and synteny.
• Genome information: chromosome location, introns, regulatory regions, and
domain sharing.
• Structural information: protein structure connections, fold types, and domain
structures.
• Expression information: tissue-specific expression, developmental stages,
phenotypes, and diseases.
• Functional information: molecular and enzymatic function, pathway or cellular
role, localization, and roles in disease.
• miRBase—microRNA database
• ncRNA Expression Database (NRED)—Database of expression data on human
and mouse long ncRNAs
• piRNABank—Web resource on classified and clustered piRNAs
• Rfam—RNA families’ database of alignments and CMs
• Sno/scaRNAbase—Curated repository for Small Nucleolar RNAs and small Cajal
body-specific RNA
• RNAcentral—Resource to organize data for non-protein coding RNA genes
• TarBase—Curated database of experimentally supported microRNA targets.
• dbSNP—http://www.ncbi.nlm.nih.gov/SNP
• dbVar—Database of Genomic Structural Variation
• ENCODE—ENCyclopedia Of DNA Elements
• HapMap—international HapMap Project
• HGBASE—Human Genic Bi-Allelic SEquences http://hgbase.cgr.ki.se
• Human Genome Segmental Duplication Database and Human Structural Variation
Database
• The SNP Consortium (TSC)—http://snp.cshl.org
Docking is used to confirm ligand binding to a receptor; usually, the receptor is
bigger than the ligand molecule. The information includes the coordinates of the
ligand atoms, and the goal is to find the lowest-energy binding site among the
docking configurations. Example programs such as FLEX and AutoDock are shown in
a later chapter of this paper. The aim is to predict the bound conformation and the
binding affinity. The interaction between the two molecules is determined, and the
overall minimum energy of complex formation is sought, together with the best
orientation of ligand binding. Various bioinformatic tools are helpful for disease
management, diagnosis, and drug discovery. Sequencing enables scientists to
identify diseases and discover drugs. Mutations and drugs are identified and
tested using different computational tools. Drug targets determine which drug
candidates enter the drug-development pipeline, with the help of bioinformatic
tools.
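The search for a minimum-energy binding configuration can be caricatured with a toy rigid-body sketch: slide a ligand over a grid of translations and keep the pose with the lowest total interaction energy. The 12-6 energy function, coordinates, and grid below are illustrative inventions; real docking programs use far richer force fields, flexible ligands, and smarter search strategies.

```python
import itertools
import math

# Toy rigid docking in 2-D: try every grid translation of the ligand and
# keep the one with the lowest total pairwise "energy" (illustrative only).
def pair_energy(r, eps=1.0, sigma=1.0):
    # Simple Lennard-Jones-like 12-6 potential: repulsive at short range,
    # weakly attractive beyond sigma.
    sr6 = (sigma / r) ** 6
    return 4 * eps * (sr6 * sr6 - sr6)

def dock(receptor, ligand, grid):
    best_energy, best_pose = math.inf, None
    for dx, dy in grid:
        e = 0.0
        for lx, ly in ligand:
            for rx, ry in receptor:
                r = math.hypot(lx + dx - rx, ly + dy - ry)
                e += pair_energy(max(r, 1e-6))  # guard against r == 0
        if e < best_energy:
            best_energy, best_pose = e, (dx, dy)
    return best_energy, best_pose

receptor = [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0)]   # hypothetical atom coordinates
ligand = [(0.0, 0.0), (1.5, 0.0)]
grid = list(itertools.product([x * 0.5 for x in range(-4, 10)], repeat=2))
energy, pose = dock(receptor, ligand, grid)
```

The exhaustive grid scan stands in for the pose search; the best pose found has a negative (favourable) interaction energy under this toy potential.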
3.1 BLAST
In the database, sequence similarity can be identified between a query sequence and
sequences by using the basic local alignment search tool (BLAST) operation. To
allow the putative gene identification, sequence homology was to be detected, which
helps to determine the protein or gene relations between them. BLAST operation uses
to identify the performance of similarity, alignment, homology, annotation between
the sequences of genes or proteins. The figure shows some of the examples of BLAST
operation [9].
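BLAST itself is a sophisticated heuristic, but its core "seed and extend" idea can be sketched: index the k-letter words of a database sequence, look up exact word hits for the query, and extend each hit outward while the letters keep matching. This ungapped sketch is illustrative only and omits BLAST's scoring matrices, X-drop extension, and statistics.

```python
from collections import defaultdict

# Greatly simplified, ungapped sketch of BLAST's seed-and-extend idea.
def build_word_index(db_seq, k=3):
    index = defaultdict(list)
    for i in range(len(db_seq) - k + 1):
        index[db_seq[i:i + k]].append(i)
    return index

def extend_hit(query, db_seq, qpos, dpos, k):
    # Extend an exact k-mer hit in both directions while letters match.
    left = 0
    while qpos - left - 1 >= 0 and dpos - left - 1 >= 0 and \
            query[qpos - left - 1] == db_seq[dpos - left - 1]:
        left += 1
    right = 0
    while qpos + k + right < len(query) and dpos + k + right < len(db_seq) and \
            query[qpos + k + right] == db_seq[dpos + k + right]:
        right += 1
    return (qpos - left, dpos - left, left + k + right)  # (qstart, dstart, length)

def blast_like(query, db_seq, k=3):
    index = build_word_index(db_seq, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            hits.add(extend_hit(query, db_seq, q, d, k))
    return sorted(hits, key=lambda h: -h[2])  # longest extensions first

db = "ACGTTGCACCTAAGCATAACTGCCATAAGACT"
hits = blast_like("GCCATAAG", db)
```

The query here occurs verbatim in the database, so the top hit recovers the full 8-letter match.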
Figure 2 shows the Web interface of the BLAST operation at NCBI, where a query
sequence is submitted for search against an NCBI database.
Tab               Explanation
Home              Home page of the BLAST link
Recent Results    Results of recent BLAST searches, retrievable in the current browser session
Saved Strategies  BLAST input parameters saved in a My NCBI account
Help              BLAST help documentation
3.2 FASTA
Fig. 4 BLAST search for unknown sequence with NCBI RefSeq RNA database
Fig. 5 BLAST search for known sequence with NCBI RefSeq RNA database
[Figure: the drug development pipeline — drug discovery → drug development and
manufacturing → clinical trials in humans → marketing application. Compliance
with regulatory requirements is necessary throughout.]
If a company wants to import or manufacture a new drug, it has to obtain
permission and a license from the licensing authority, the DCGI. Documents must
be submitted in accordance with Schedule Y of the Drugs and Cosmetics Act, 1940,
and Rules, 1945. Under the Schedule Y guidelines, clinical trials are conducted
in the interest of public health, and permission to import new drugs is granted
after the drug has been properly authorized [13]. Figures 8 and 9 show the web
pages of the drug bank, where the details of a drug can be searched by its ID and
the required information retrieved.
4 Conclusion
This paper has investigated protein and gene databases, sequence alignment,
docking, and drug discovery, indicating the various tasks involved and the
performance of a few operations. This is useful for understanding not only protein
behaviour, but also the biological activities of various genes. Through this paper,
the types of sequence alignment and the steps of the BLAST and FASTA operations
were presented. Docking helps to reveal information about the binding site of a
protein, and drug discovery clarifies the steps of the drug-development process.
Overall, an overview of some experiments and the sequential steps of some
biological tools, databases, sequence analyses, docking, and drug discovery has
been discussed.
References
1. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol
147:195–197
2. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities
in the amino acid sequence of two proteins. J Mol Biol 48:443–453
3. Thompson JD, Higgins DG, Gibson TJ (1994) Improved sensitivity of profile searches through
the use of sequence weights and gap excision. CABIOS 10:19–29
4. Morgenstern B (1999) Dialign, 2 improvement of the segment-to-segment approach to multiple
sequence alignment. Bioinformatics 15:211–218
5. Thompson JD, Linard B, Lecompte D, Poch O (2011) A comprehensive benchmark study of
multiple sequence alignment methods: current challenges and future perspectives. PLoS ONE
6:1–14
6. Lassmann T, Sonnhammer EL (2005) Kalign—an accurate and fast multiple sequence
alignment algorithm. BMC Bioinf 6:298
7. Mount DW (2004) Bioinformatics: sequence and genome analysis. Cold Spring Harbor
Laboratory Press, Cold Spring Harbor, NY
8. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W et al (1997) Gapped BLAST
and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res
25(17):3389–3402
9. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search
tool. J Mol Biol 215(3):403–410
10. Baker EJ, Lin GN, Liu H, Kosuri R (2007) NFU-enabled FASTA: moving bioinformatics
applications onto wide area networks. Biol Med 2. PMC2211279
11. Chen R, Li L, Weng Z (2003) ZDOCK: an initial-stage protein-docking algorithm. Proteins
52:80–87
12. Song CM, Lim SJ, Tong JC (2009) Recent advances in computer-aided drug design. Brief
Bioinform 10:579–591
13. Worrell J (2006) Practical issues of conducting clinical trials in India. Drug Dev
Machine Learning for Bioinformatics
1 Introduction
Other problems, such as the efficient design of primers for PCR, the analysis of
biological images, and the back-translation of proteins, are also being solved by
machine learning techniques. Machine learning involves computers learning
automatically, without explicit programming, usually from past experience. The
term learning refers to a computer program being able to learn by itself from
training data. ML utilizes statistical theory to build models by deriving
inferences from a sample [24].
The ML process consists of two main steps. The first step is to produce a model
by analyzing the big data; the second is to derive inferences from that analysis.
Apart from algorithmic efficiency, factors such as time and space complexity,
transparency, and interpretability are important for accurately predicting results.
Extracting useful knowledge from data is an iterative and interactive process
composed of many steps. In the first step, data from different sources is merged
into a single format; data-warehouse techniques can be used to identify outliers
and inconsistencies. The second step involves the selection, cleaning, and
transformation of the collected data [24].
The ANN is widely used in bioinformatics, as it was found to be efficient during
the early stages of bioinformatics research [25]. It can solve complex real-world
problems while remaining flexible, and it is resistant to noise and errors in the
training data. The major disadvantage of the ANN is that it is a black-box
approach: the functioning of each node in the neural network is hard to interpret
and validate. It also involves complex mathematical and statistical computations, which
[Figure: knowledge discovery from sequence data — raw sequence data undergoes
selection (target data), preprocessing, and transformation; mining the aligned
sequences yields a pattern such as [AT][CG][AC][ACGT]*A[TG][CG], which
interpretation and evaluation turn into knowledge.]
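The pattern-extraction step suggested by the figure can be sketched as a column-wise summary of aligned sequences: a conserved column emits its single residue, a variable column emits a character class, and a gapped column becomes a variable-length region. The gap-handling rule and example sequences below are illustrative, not the actual mining procedure.

```python
# Rough sketch of deriving a column-wise pattern from aligned sequences,
# mirroring the kind of pattern shown in the figure (illustrative only).
def column_pattern(aligned):
    length = len(aligned[0])
    assert all(len(s) == length for s in aligned), "sequences must be aligned"
    parts = []
    for col in range(length):
        residues = {s[col] for s in aligned}
        if "-" in residues:
            parts.append("[ACGT]*")       # gap column: variable-length region
        elif len(residues) == 1:
            parts.append(residues.pop())  # fully conserved position
        else:
            parts.append("[" + "".join(sorted(residues)) + "]")
    return "".join(parts)

aligned = ["ACA-T", "TCAAT", "ACACT"]
pattern = column_pattern(aligned)
```

For the three toy sequences above, the derived pattern is `[AT]CA[ACGT]*T`.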
make it hard to interpret. The choice of ANN parameters, such as weights, biases,
and the number of hidden layers, is an ongoing research issue [26].
The ANNs have found widespread applications in the prediction of protein
structure, functional prediction [27–31], and classification of proteins [32–34].
The decision tree method has been utilized in [35, 36] for classifying membrane
protein sequences based on their functional classes. Protein structure prediction
was performed by He et al. [37, 38] and Sivan et al. [39] using decision trees.
Furthermore, decision trees were used by Salzberg [40] to trace protein-coding
genes.
For practical applications, decision trees are comparatively robust, as they are
resistant to outliers and noise. However, they tend to overfit in some cases, and
optimizing decision trees is difficult in certain instances [41].
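At the heart of decision-tree induction is choosing the split with the highest information gain. The sketch below shows that computation on a made-up table of protein features; the feature names, values, and labels are hypothetical, and a real tree would recurse on the chosen split.

```python
import math
from collections import Counter

# Illustrative core of decision-tree induction: pick the feature whose
# split yields the highest information gain (all data below is made up).
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    base = entropy(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - weighted

# Toy dataset: two categorical features describing protein sequences,
# label = membrane vs. soluble protein (hypothetical values).
rows = [
    {"hydrophobic_run": "long",  "signal_peptide": "yes"},
    {"hydrophobic_run": "long",  "signal_peptide": "no"},
    {"hydrophobic_run": "short", "signal_peptide": "yes"},
    {"hydrophobic_run": "short", "signal_peptide": "no"},
]
labels = ["membrane", "membrane", "soluble", "soluble"]
best = max(["hydrophobic_run", "signal_peptide"],
           key=lambda f: information_gain(rows, labels, f))
```

In this toy table the `hydrophobic_run` feature separates the classes perfectly (gain 1 bit), so it is chosen as the root split.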
Genetic algorithms have become popular in bioinformatics research due to their
simplicity. They are well suited to bioinformatics, since most biological data is
high dimensional and GAs handle high-dimensional problems well. The major drawback
of the GA is that the changes during its evolution process are dynamic and not
transparent [42].
Parsons et al. [43], Rathee et al. [44], Alba et al. [45], and Nebro et al. [46]
have applied GAs to DNA fragment assembly. Horng et al. [47] used a GA for the
alignment of multiple molecular sequences.
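A toy version of GA-based fragment assembly can be sketched as evolving an ordering of fragments that maximizes the overlap between consecutive fragments. The fragments, fitness function, and GA parameters below are illustrative inventions, not the designs of [43–46].

```python
import random

# Toy genetic algorithm for DNA fragment assembly: evolve an ordering of
# fragments that maximizes overlap between consecutive fragments.
def overlap(a, b):
    # Longest suffix of a that is a prefix of b
    for size in range(min(len(a), len(b)), 0, -1):
        if a[-size:] == b[:size]:
            return size
    return 0

def fitness(order, frags):
    return sum(overlap(frags[order[i]], frags[order[i + 1]])
               for i in range(len(order) - 1))

def mutate(order):
    child = order[:]
    i, j = random.sample(range(len(order)), 2)
    child[i], child[j] = child[j], child[i]  # swap two positions
    return child

def assemble(frags, pop_size=30, generations=200, seed=0):
    random.seed(seed)
    base = list(range(len(frags)))
    pop = [random.sample(base, len(base)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda o: -fitness(o, frags))
        survivors = pop[:pop_size // 2]          # truncation selection
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(pop, key=lambda o: fitness(o, frags))

frags = ["ACGTC", "GTCAA", "CAATG"]  # true order reconstructs ACGTCAATG
best_order = assemble(frags)
```

With these three fragments the optimal ordering has total overlap 6 (two 3-letter overlaps), which the GA recovers easily on this tiny instance.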
The common issues faced while designing ANN for bioinformatics applications
need to be addressed. Figure 2 demonstrates the flowchart of ANN design for
bioinformatics.
Feature representation and input encoding constitute the data preprocessing stage,
and proper preprocessing largely determines the performance of the ANN. Prior
knowledge about the structure and function of sequences has to be represented to
obtain the full benefit of the ANN; this enables effective mining of the relevant
attributes in a sequence. Proteins are made up of combinations of twenty amino
acids in chains of different lengths, and a sequence is written using this defined
alphabet. In biology, however, the letters carry additional information, such as
function and residue structure. The amino acid chain exhibits a great number of
physical and chemical properties, and the structure and function of a protein are
determined by the interactions between the side chains. Prior knowledge of certain
amino acid properties, such as surface area, volume, secondary-structure
propensity, polarity, and chemical properties, is encoded when the ANN is fed
with input [49].
Certain amino acids play multiple structural or functional roles; hence, their
properties are assessed by considering the entire protein. These properties
represent global or local features of the protein context. To characterize such
attributes, hydrophobic moments, amino acid frequencies, or a hydrophobicity
profile may be computed and encoded for use in the ANN.
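Such property-based input encoding might be sketched as follows: each residue in a window around the position of interest contributes a small property vector, with zero padding past the sequence ends. The property values and window size are rough illustrative numbers, not a published encoding scheme.

```python
# Sketch of encoding a protein sequence window for an ANN input layer:
# each residue becomes [hydrophobicity, volume]. The values below are
# rough illustrative numbers covering only five residues.
HYDROPHOBICITY = {"A": 1.8, "L": 3.8, "K": -3.9, "D": -3.5, "G": -0.4}
VOLUME = {"A": 88.6, "L": 166.7, "K": 168.6, "D": 111.1, "G": 60.1}

def encode_window(seq, center, width=5):
    """Encode `width` residues around `center`, padding the ends with zeros."""
    half = width // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(seq) and seq[pos] in HYDROPHOBICITY:
            aa = seq[pos]
            features.extend([HYDROPHOBICITY[aa], VOLUME[aa]])
        else:
            features.extend([0.0, 0.0])  # padding outside the sequence
    return features

x = encode_window("ALKDG", center=0)
```

For `center=0` the first two window positions fall before the sequence start, so the vector begins with zero padding followed by the properties of A, L, and K.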
The ANN finds use in several bioinformatics applications involving proteins. These
can be grouped into the following categories:
• Protein structure prediction, covering secondary-structure content, structural
contacts, contact maps, structural domain boundaries, beta-turns, etc.
• Prediction of ligands and binding sites, which comprises forecasting the binding
residues and predicting several properties of the binding ligands.
• Prediction of protein properties, such as physicochemical properties, host
organism, and localization.
The prediction of protein secondary structure using an ANN is presented in this section. PSIPRED, proposed by Jones in 1999, is the most commonly used model for protein secondary structure prediction. The PSIPRED method is a two-stage ANN. As input, it takes a position-specific scoring matrix (PSSM) created using the PSI-BLAST algorithm. Figure 3 depicts the design of PSIPRED.
Machine Learning for Bioinformatics 33
[Fig. 3: Design of PSIPRED. An input protein sequence (e.g., …ACYMHGFDKKIPQWESDFHVCAWERIPLK…) is encoded as a window of PSSM values m_{1,1}, …, m_{1,21}; m_{2,1}, …, m_{2,21}; …; m_{15,1}, …, m_{15,21} presented to the input layer; hidden layers follow, and the output layer produces the second-stage predictions Z_{8,1}, Z_{8,2}, Z_{8,3} for the central window position.]
Figure 3 illustrates a two-stage network in which each stage is a three-layer feed-forward network. The input to the second-stage network is the output of the first-stage network. During the first stage, the input protein sequence is represented by the PSSM using a window of size 15 centred on the amino acid being predicted. For each amino acid, the PSSM contains twenty dimensions with substitution scores. The scores indicate the presence or absence of an amino acid at a particular position, derived from a set of sequences which resemble the predicted
34 K. A. Shastry and H. A. Sanjay
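The 15-residue window encoding described for PSIPRED can be sketched as follows. This is a minimal illustration under our own assumptions (an L×21 PSSM with zero padding beyond the sequence ends; the function name is ours, not from PSIPRED itself):

```python
import numpy as np

WINDOW = 15          # residues per window, as in PSIPRED
HALF = WINDOW // 2   # 7 residues of context on each side

def window_inputs(pssm):
    """pssm: (L, 21) array of per-residue PSSM scores.
    Returns one flattened 15x21 input vector per residue,
    zero-padded where the window extends past the sequence ends."""
    length, dim = pssm.shape
    padded = np.vstack([np.zeros((HALF, dim)), pssm, np.zeros((HALF, dim))])
    return np.stack([padded[i:i + WINDOW].ravel() for i in range(length)])
```

Each of the resulting 315-dimensional vectors would be fed to the first-stage network to predict the secondary-structure state of the window's central residue.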
Bioinformatics researchers often do not consider the data source when utilizing the data. If the data at the source is dirty, the accuracy of the classifier decreases. Dirty biological data can be due to many factors, such as:
i. Errors during experimentation.
ii. Erroneous interpretation by biologists.
iii. Typing mistakes due to human error.
iv. Non-standardized experimental methods (e.g., 3D structures in the PDB come from multiple sources: X-ray diffraction, theoretical modeling, nuclear magnetic resonance, etc.).
Learning algorithms need to be robust to noise when learning on a dirty biological database. ML approaches must be able to provide optimal decisions while adjusting to avoid overfitting the datasets. Since biological databases are updated on a daily basis, ML algorithms should also possess short learning and training times [24].
While interpreting and analyzing biological data, it is essential to work with domain experts, in this case biologists. To maintain data quality, the data in the databases needs to be constantly revised. The issues faced during the cleaning of gene databases are addressed by Brunak et al. [52] and Korning et al. [53]. Applying ML techniques to learn from biological databases has developed into a major challenge in bioinformatics research.
Most hypotheses in biological research are data driven and based on experimental data, and experimental research suffers from uncertainty. Hence, biological research has been transformed into a statistics-dependent field. Without the incorporation of probability theory, the degree of confidence in bioinformatics analyses is lower. A discriminative method is useful to differentiate between classes, while a generative approach is required in other cases. The choice of ML techniques in bioinformatics depends on the objectives of learning and the application tasks. If the appropriate method is implemented, more confidence is attached to the discovered knowledge, which, in turn, reduces the degree of uncertainty [24].
Approaches such as ANNs act as black boxes whose decisions are hard to understand and interpret. However, they generate better results than rule-based inductive learning approaches. Traditional rule generators like decision trees provide clear explanations understandable by humans. It remains an open research question whether to prefer a black-box approach or an approach with a clear explanation at each step [24].
5 Conclusion
The transformation of huge volumes of data into knowledge is currently the biggest challenge faced in computational biology, and machine learning techniques provide this transformation. This chapter introduced some useful ML methods such as genetic algorithms, decision trees, and artificial neural networks. From the literature, it is observed that artificial neural networks are the most widely used technique in bioinformatics research. Keeping this in mind, this chapter discussed in detail a case study of protein structure prediction using ANNs. Some of the critical research issues in bioinformatics were also discussed.
References
27. Hirst JD, Sternberg MJE (1992) Prediction of structural and functional features of protein and
nucleic acid sequences by artificial neural networks. Biochemistry 31:7211–7218
28. Qian N, Sejnowski TJ (1988) Predicting the secondary structure of globular proteins using
neural network models. J Mol Biol 202:865–884
29. Howard Holley L, Karplus M (1989) Protein secondary structure prediction with a neural
network. Proc Natl Acad Sci USA 86:152–156
30. Mathkour H, Ahmad M (2010) An integrated approach for protein structure prediction
using artificial neural network. In: International conference on computer engineering and
applications, vol 2, pp 484–488. https://doi.org/10.1109/ICCEA.2010.243
31. Chen K, Kurgan LA (2012) Neural networks in bioinformatics. In: Rozenberg G, Bäck T, Kok
JN (eds) Handbook of natural computing. Springer, Berlin, Heidelberg
32. Rossi ALD, de Oliveira Camargo-Brunetto MA (2007) Protein classification using artificial
neural networks with different protein encoding methods. In: Seventh international conference
on intelligent systems design and applications (ISDA 2007), Rio de Janeiro, pp 169–176.
https://doi.org/10.1109/isda.2007.81
33. Rossi A, Camargo-Brunetto MA (2007) Protein classification using artificial neural networks
with different protein encoding methods. https://doi.org/10.1109/isda.2007.81
34. Lee NK, Wang D, Wah Tan K (2005) Protein classification using neural networks: a review
35. Nijil RN, Mahalekshmi T (2018) Multilabel classification of membrane protein in human by
decision tree (DT) approach. Biomed Pharmacol J 11(1)
36. Siva Sankari E, Manimegalai D (2017) Predicting membrane protein types using various deci-
sion tree classifiers based on various modes of general PseAAC for imbalanced datasets. J
Theor Biol 435. https://doi.org/10.1016/j.jtbi.2017.09.018
37. He J, Hu HJ, Harrison R, Tai PC, Dong Y, Pan Y (2005) Understanding protein structure
prediction using SVM_DT. In: Chen G, Pan Y, Guo M, Lu J (eds) Parallel and distributed
processing and applications—ISPA 2005 workshops. ISPA 2005. Lecture notes in computer
science, vol 3759. Springer, Berlin, Heidelberg
38. He J, Hu H-J, Harrison R, Tai PC, Pan Y (2006) Rule generation for protein secondary structure
prediction with support vector machines and decision tree. IEEE Trans Nano Biosci 5(1):46–53.
https://doi.org/10.1109/TNB.2005.864021
39. Sivan S, Filo O, Siegelmann H (2007) Application of expert networks for predicting proteins
secondary structure. Biomol Eng 24:237–243. https://doi.org/10.1016/j.bioeng.2006.12.001
40. Salzberg S, Delcher AL, Fasman K, Henderson J (1998) A decision tree system for finding
genes in DNA. J Comput Biol 5:667–680. https://doi.org/10.1089/cmb.1998.5.667
41. Stiglic G, Kocbek S, Pernek I, Kokol P (2012) Comprehensive decision tree models in
bioinformatics
42. Bhaskara Murthy V, Pardha Saradhi Varma G (2013) Genetic algorithm—a case study in gene
identification. Int J Adv Res Comput Sci 4(5)
43. Parsons RJ, Forrest S, Burks C (1995) Mach Learn 21:11. https://doi.org/10.1007/BF00993377
44. Rathee M, Vijay Kumar TV (2014) DNA fragment assembly using multi-objective genetic
algorithms. Int J Appl Evol Comput 5(3):84–108
45. Alba E, Luque G, Khuri S (2005) Assembling DNA fragments with parallel algorithms. In: 2005
IEEE congress on evolutionary computation, Edinburgh, Scotland, vol 1, pp 57–64. https://doi.
org/10.1109/cec.2005.1554667
46. Nebro AJ, Luque G, Luna F, Alba E (2008) DNA fragment assembly using a grid-based genetic
algorithm. Comput Oper Res 35(9):2776–2790. ISSN 0305-0548. https://doi.org/10.1016/j.cor.
2006.12.011
47. Horng JT, Wu LC, Lin CM et al (2005) Soft Comput 9:407. https://doi.org/10.1007/s00500-
004-0356-9
48. Bhaskar H, Hoyle DC, Singh S (2006) Machine learning in bioinformatics: a brief survey and
recommendations for practitioners. Comput Biol Med 36:1104–1125. https://doi.org/10.1016/
j.compbiomed.2005.09.002
49. Hapudeniya M (2010) Artificial neural networks in bioinformatics. Sri Lanka J Bio-Med Inform
1:104–111. https://doi.org/10.4038/sljbmi.v1i2.1719
50. Seiffert U, Hammer B, Kaski S, Villmann T (2006) Neural networks and machine learning in
bioinformatics-theory and applications. In: European symposium on artificial neural networks,
pp 521–532
51. Bordoloi H, Sarma K (2019) Protein structure prediction using artificial neural network
52. Brunak S, Engelbrecht J, Knudsen S (1990) Cleaning up gene databases. Nature 343:123
53. Korning PG, Hebsgaard SM, Rouze P, Brunak S (1996) Cleaning the GenBank Arabidopsis
thaliana data set. Nucleic Acids Res 24:316–320
54. Sekhar SM, Siddesh GM, Manvi SS, Srinivasa KG (2019) Optimized focused web crawler with
natural language processing based relevance measure in bioinformatics web sources. Cybern
Inf Technol 19(2):146–158
55. Sekhar M, Sivagnanam R, Matt SG, Manvi SS, Gopalalyengar SK (2019) Identification of
essential proteins in yeast using mean weighted average and recursive feature elimination.
Recent Patents Comput Sci 12(1):5–10
56. Patil SB, Sekhar SM, Siddesh GM, Manvi SS (2017) A method for predicting essential proteins
using gene expression data. In: 2017 international conference on smart technologies for smart
nation (SmartTechCon). IEEE, pp 1278–1281
Impact of Machine Learning
in Bioinformatics Research
1 Introduction
The term bioinformatics was depicted by Paulien Hogeweg and Ben Hesper. Bioin-
formatics integrates classical features of different fields; those fields include as shown
in Fig. 1:
(i) Biology: which plays a salient role in capturing the correct template of the bio-
logical data, (ii) Information Engineering: which helps in generation and distribution
of the information pertaining to the biological data and also performing analysis on
it, (iii) Computer Science: which facilitates the interaction between the processes
and the data, (iv) Mathematics and Statistics: which assist in performing all the pre-
requisite mathematical operation, detailed study, and portraying the biological data.
Bioinformatics is the science of aggregating, storing, retrieving, and analyzing biological data. It also involves developing software tools in order to gain complete knowledge of the biological data [1]. The biological data mostly comes from genomics, proteomics, microarrays, systems biology, evolution, and text mining.
In recent times, the volume of biological data coming from various sources, including neuroscience, has become enormous. This enormous amount of data needs to be addressed by rigorous data analysis tools and techniques. Although bioinformatics itself serves the purpose, the speed of the process and human capacity are limiting, and hence we rely on machine learning.
Machine learning provides significant techniques that aid bioinformatics research by employing a diverse set of algorithms for data analysis and prediction [1]. In machine learning, the biological dataset is fed into the system, and the system is trained to reliably distinguish between the data and make predictions. These algorithms help the system learn from previously recorded biological datasets, which can be genomic or proteomic, and thereby help formulate classifiers and hypotheses that unravel complex relations within the biological data. Machine learning can be broadly classified into supervised learning, unsupervised learning, a mixture of both (i.e., semi-supervised learning), and reinforcement learning. A typical machine learning model is depicted in Fig. 2.
Genomic data, for example, is a sequence of the chemical letters A, C, G, and T, combinations of which make up our DNA. Around 99% of our DNA is identical between individuals, and the remaining 1% is what makes us unique. This 1% of the genome sequence accounts for either uniqueness or certain dysfunctions in a human. Defects in genomic sequences are detected by comparison with other genomic datasets.
In supervised learning, the system is exposed to a large amount of biological data along with the expected labels. This biological data is characterized by the set of attributes that represent it. The set of attributes is the feature set of the object; for a biological dataset this might include the genomic sequence from a tissue sample, or the presence or absence of an amino acid at a distinct position. The main objective of supervised learning is to build a model that can accurately predict which class a new object, with its respective feature set, belongs to.
Unsupervised learning, on the other hand, is also exposed to an enormous amount of data, but the labels are absent. The system therefore learns about the data by finding intrinsic patterns and clustering the data in accordance with its similarities, and it makes predictions on newly encountered data by learning and improving continually. In a broader sense, the objective of unsupervised learning is to group the biological data by similarity and to define the labels with which each group is associated.
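The supervised side of this contrast can be sketched with a toy learner: a nearest-centroid classifier over labelled feature vectors. The helper names are ours, and the method is only one of many possible classifiers:

```python
import numpy as np

def fit_centroids(X, y):
    """Supervised step: from labelled sample points (X_i, C_i),
    learn one centroid (mean feature vector) per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, x):
    """Assign a new object to the class with the nearest centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
```

Dropping the labels `y` and discovering the groups from the data alone would turn the same setting into the unsupervised case.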
The initial database was established soon after the first protein sequence was discovered; the first nucleic acid sequence was announced about a decade later.
44 E. Naresh et al.
Margaret Oakley Dayhoff, a chemist and a bioinformatics pioneer, collected all the recorded sequence data to construct the first-ever bioinformatics database, followed by the Protein Data Bank in 1972 and SwissProt in 1987. The structure of these original bioinformatics databases was modest: the data was organized in flat files, and the data entries were mutable. In order to ease the searching process, suitable keywords were used as lookup indices [3].
A biological database is a set of huge persistent data that can be queried, updated, retrieved, and stored [4]. Each record of a nucleotide file in a database is normally associated with the input sequence, a molecule description, the scientific name of the source organism, etc. These biological databases may be publicly accessible, or some may be private to an organization.
The most basic requirements of a biological database are as follows:
• The data quality of the biological dataset needs to be authentic.
• The database should support relevant experimental data.
• It should have consistent annotation that provides additional information about the dataset.
• The data should be available whenever it is needed.
• It should allow integration of new experimental biological datasets.
Here are some of the biological databases and their descriptions.
2.1.1 GenBank
2.1.2 EMBL
The EMBL database is an all-inclusive database which contains DNA and RNA sequences gathered from different sources of submission, including patent applications and scientific research. The EMBL database collaborates with GenBank (set up in the USA) and the DNA database set up in Japan [3].
2.1.3 SwissProt
The SwissProt protein database contains two types of data: the core information and the description of that information (annotation). It allows integration with other databases at a maximum level and maintains a minimum level of redundancy. In
recent times, the format and contents of the database have been enhanced, including cross-references to other integrated databases.
Biological databases such as genomic databases mostly contain long strings of nucleotides representing guanine, adenine, cytosine, and thymine, and may or may not contain amino acids such as glycine and serine. Each letter of a nucleotide sequence corresponds to the starting letter of the chemical compound, that is, A, G, C, and T, as shown in Fig. 5. The sequence of these letters renders a particular gene or protein. As the chemical composition is represented only by starting initials,
this improves the processing speed and requires less storage for analysis. Most databases hold information pertaining to nucleotide or protein sequencing, whereas a few databases take core information such as biochemical structures and characteristics of organisms into consideration.
Advancements in the fields of biology and chemistry sped up gene sequencing. Apart from these advancements, cloning technology also accounted for the production of new DNA sequences, the outcome of recombining foreign DNA with bacteria. Oligonucleotide synthesis, on the other hand, provided methods to fabricate small fragments of DNA by choosing other DNA sequences. All these developments made more and more sequences available to deal with and hence increased the progress. The entire procedure, from collecting blood samples to extracting DNA, is depicted in Fig. 4. As the creation of sequences increased, accumulating all these sequences from different sources by hand was no longer practical. Researchers needed access to databases of sequence information and wanted methods to extract those sequences. After the accumulation of the sequences, the analysis and structuring were yet to be accomplished.
Computer technology provided the required boost to store and organize sequence information in biological databases, with computing power and storage capacity rapidly increasing on top of all the above benefits.
Bioinformatics tools are helpful in acquiring gene and protein sequences from repositories. A repository holds genomic sequences coming from various sources such as experimental data and scholars' research papers. These tools can also be helpful in analyzing different types of sequences, either labeled or prepared from experiments performed by the researcher [3].
Tools such as BLAST help in identifying the congruency between the sequences of a test sample and the recorded samples in the database under certain constraints. BLAST is an abbreviation of basic local alignment search tool. It is an open-source algorithm that works on different operating systems such as Linux, UNIX, Mac, and Windows.
The main components of BLAST are the input, output, process, and algorithm.
Input: The inputs are sequences obtained from GenBank, in GenBank format, along with a weight matrix.
Output: BLAST supports HTML, plain-text, and XML output formats. The default is HTML, as used by its creator, NCBI. The results include a graphical representation indicating sequence hits, a tabular representation of the sequence identifiers, and sequence alignments with respect to the sequence hits. A sequence hit is a term describing a matched sequence. The most informative of these outputs is the tabular representation.
Process: The process underlying BLAST is the search for similar sequences [7], done by a procedure called seeding. Seeding is the process of tracing short string matches between the sequences; when the first match is met, BLAST starts aligning the sequences. In order to find identical matches, the algorithm breaks the nucleotide or amino acid sequence into short strings, called seeds, of minimal size. For example, if the sequence string is GLKFG (a sequence of amino acids), the short seeds built by the BLAST algorithm are GLK, LKF, and KFG. These seeds are of size three, which is minimal for a five-letter amino acid sequence. The process then continues by comparing the respective sequences, that is, the test sample and the recorded samples; when a short seed is a sequence hit (a sequence match), alignment takes place. The alignment is accepted only if the score of the seed meets the threshold T of the scoring matrix. A scoring matrix is a matrix containing optimal sequence-similarity scores; the one normally used in BLAST is BLOSUM62. The threshold T decides whether a seed is to be included or not: if the score of the seed is higher than the threshold, the seed is added to the alignment; otherwise, the seed is rejected and the alignment ceases.
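The seeding step on the GLKFG example above can be sketched in a couple of lines (the function name is ours):

```python
def seeds(sequence, k=3):
    """Break a sequence into overlapping words ("seeds") of length k,
    sliding one position at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(seeds("GLKFG"))  # ['GLK', 'LKF', 'KFG']
```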
Step 1: Remove low-complexity regions from the query sequence.
When a sequence is formed from only a few types of elements, the region is termed a low-complexity region. Low-complexity regions lead to high scores, which may mislead a program designed to find substantial sequence matches. Hence, these low-complexity regions need to be eliminated from the search. In protein sequences, these regions are masked with the letter X; in nucleic acid sequences, they are masked with the letter N.
The BLAST program honours this masking and disregards the sequences tagged with the respective letter. SEG and DUST are used to mask the low-complexity regions of protein and DNA sequences, respectively.
Step 2: Create minimal sequences of size k from the query sequence.
Set k to a small constant representing the word length. The BLAST algorithm generates every possible subsequence of the query sequence of length k, from the first letter of the query sequence to the last. Normally, k = 11 for DNA sequences.
Step 3: Enumerate viable matches.
The BLAST algorithm considers matches if and only if they are high-scoring words (sequences). Comparing each k-length word with the query words gives rise to a score, which is recorded using the scoring matrix. For protein sequences, the number of possible words of length k = 3 is 20^k (20^3 = 8000). When the match scores are recorded, they are compared with the threshold score T: scores higher than the threshold T are accepted, and the rest are rejected (Fig. 7).
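The scoring-and-thresholding of Step 3 can be sketched as follows. The toy match/mismatch scores stand in for a real BLOSUM62 lookup, and the function names and threshold value are ours:

```python
def word_score(a, b, match=5, mismatch=-4):
    """Score two equal-length words position by position
    (a toy stand-in for a BLOSUM62 substitution-matrix lookup)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def high_scoring_words(query_word, candidates, T):
    """Keep only candidate words whose score against the query word
    meets the threshold T; the rest are rejected."""
    return [w for w in candidates if word_score(query_word, w) >= T]

print(high_scoring_words("GLK", ["GLK", "GLA", "AAA"], T=6))  # ['GLK', 'GLA']
```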
Step 4: Structure an efficient search tree containing the high-scoring sequences resulting from the above step.
The resulting high-scoring sequences are now compared with the sequences in the biological database. The database sequences are scanned with the high-scoring sequences to find accurate matches. When a match is discovered, the seeding process takes place; that is, the matched sequence is aligned between the query sequence and the database sequences. Steps 3 and 4 are repeated for every k-length sequence.
Step 5: Extend the accurate matches into HSPs (high-scoring segment pairs).
In this step, the BLAST algorithm extends the aligned sequences between the database and the query sequence in both directions, starting from the position where the match was found, and does not stop until the score of the high-scoring segment pair begins to decrease.
Step 6: Enumerate all the high-scoring segment pairs in the database with a high enough score.
A cutoff score, denoted by the letter S, is fixed experimentally. If a high-scoring segment pair scores greater than the cutoff value, the HSP makes its way onto the list. The cutoff score is fixed by analyzing the distribution of alignment scores obtained by comparing random sequences, in such a way that the cutoff is the highest possible value that still assures significance for the remaining HSPs.
Step 7: The significance of each HSP is examined.
The BLAST algorithm examines the significance of the HSPs one by one statistically, using the extreme value distribution (EVD). It is known that for the scores of ungapped local (Smith-Waterman) alignments, the Gumbel extreme value distribution applies.
Accordingly, the probability of observing a score equal to or greater than x is given by Eq. (1).
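The equation referenced as Eq. (1) is lost at the page break here. In the standard Karlin-Altschul treatment of ungapped local alignment scores, the Gumbel form it presumably takes is

```latex
p(S \ge x) = 1 - \exp\!\left(-e^{-\lambda (x - \mu)}\right) \qquad (1)
```

where lambda and mu are the scale and location parameters of the extreme value distribution, estimated from the scoring system and the sequence lengths. This reconstruction is our assumption, not the book's own equation.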
Step 11: Display the Smith-Waterman gapped local alignment of the query sequence and the matched biological database sequence.
The first version of BLAST displays the ungapped alignment with the high-scoring segment pair. The advanced version, BLAST2, however, displays a single alignment along with gaps, which may include the high-scoring segment pair region.
Step 12: Report each match whose score is lower than the threshold score.
The most commonly used approach in machine learning is supervised learning. This technique supervises the system model to make predictions in accordance with the training data. The technique revolves around the training dataset, which usually involves sample data comprising the inputs and the already-known expected outputs. A data object in the training dataset is referred to as a sample point; these sample points are independent in nature. The training dataset is denoted by D = {(X1, C1), …, (Xn, Cn)}, wherein (Xi, Ci) is a sample point of the dataset D. Each sample point is associated with describing features, called a feature set, denoted by {X1, …, Xd}, along with Ci, which represents the class of interest. The dataset given to the model is proportioned as 70% training dataset and 30% testing dataset. The procedure in supervised learning is to check whether the model is making correct predictions and to improve it as and when the dataset is tested. Correct prediction accounts for the efficiency of the model [1, 5].
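The 70/30 proportioning described above can be sketched as follows (the function name is ours):

```python
import random

def train_test_split(samples, train_frac=0.7, seed=0):
    """Shuffle the sample points (X_i, C_i) and split them into
    a 70% training set and a 30% testing set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split([(i, i % 2) for i in range(10)])
```

Shuffling before splitting avoids the split inheriting any ordering present in the database.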
expression samples, we obtain the expression values and group them in the fashion described above. The target of this approach is to follow the natural hierarchy while clustering the dataset [8].
A biological data object consists of various features, and each feature represents a dimension: the more features, the higher the dimension [9]. In order to perform analysis, the high dimensionality of the biological data must be reduced in a preprocessing step, which implies reducing the size of the dataset. There are two main methods that perform this task: (i) dimension reduction and (ii) feature selection.
The objective of feature selection is to represent a high-dimensional dataset with a small number of relevant features. This approach uses a univariate or multivariate filter in the preprocessing step [10]. In these filtering methods, all the features are scored against predefined benchmarks, and their ranks are calculated. The features with the highest ranks are kept, and the features with the lowest ranks are rejected. An example of a univariate filter is the Wilcoxon rank-sum test, and an example of a multivariate filter is the Markov blanket filter.
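A univariate filter of this kind can be sketched as follows. The score used here (absolute difference of class means) is a simple stand-in for the Wilcoxon rank-sum statistic, and the function name and `keep` parameter are ours:

```python
import numpy as np

def rank_features(X, y, keep=2):
    """Score each feature independently by the absolute difference of
    its class means, rank the features, and keep the top `keep`."""
    scores = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    order = np.argsort(scores)[::-1]   # highest-scoring features first
    return order[:keep]
```

A multivariate filter would instead score features jointly, accounting for redundancy between them.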
of this approach. Similar to a standard tree structure, the decision tree consists of the same components: a root node, branches, internal nodes, and leaf nodes. Each internal node indicates a test performed on a feature, and each leaf node indicates a decision [5].
The genetic algorithm was invented by Holland. The learning agent in a genetic algorithm is grounded in the theory of biological evolution. The main notion is to uphold a population of candidate solutions to the problem at hand, wherein the solutions evolve and mature through a competition controlled by variation. This improves the performance parameter of the learning agent [5]. The population is under constant experiment through the combination of two strings or mutation within a string. The genetic algorithm focuses on making the system adaptive and sustainable in a new environment [14], and it furthermore finds the best fit among the solutions.
The algorithm is modeled on population genetics. The problem to be solved is considered first. The population, represented as bit strings, is formulated randomly from the environment where the learning agent is present. The population is exposed to variation, which may mean that a few of the population's individuals are a better fit. Evolution takes place iteratively: at each iterative phase, candidates with better survival proficiency are selected. This process leads to accepting new candidates for the next iteration and discarding the others, which are not selected. The cycle continues until an optimal solution is found. The advantages of genetic algorithms are, just like those of decision trees, that they are robust and simple and handle high-dimensional problems, whereas the disadvantage is that the dynamics at each iteration are complex to understand.
Clustering is an approach that detects similarities in the data, groups it into classes, and organizes it. In order to predict and define complex datasets, derivative clustering algorithms are used. There are two types of clustering algorithms: hierarchical clustering and k-clustering. The hierarchical clustering algorithm [15] represents the input dataset in a hierarchical perspective, as shown in Fig. 9, whereas k-clustering takes the input data and allocates each data object to the equivalent group it belongs to according to its features, as depicted in Fig. 10.
Two approaches are followed with a view to clustering the observed data. One is based on the chemical and physical concepts of the data's feature set, and the second on computational and statistical views of the input data. When a dataset is encountered, the agent will cluster it into groups of smaller size
Fig. 9 Hierarchical clustering [13]
using either of the approaches. These groups are represented by similar features of the data; hence, clustering is suitable for biological data representation because of its descriptiveness and expressiveness [5].
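The k-clustering idea can be sketched with a minimal k-means loop (our own illustration, not the chapter's algorithm): assign each data object to its nearest centre, then recompute each centre from the group mean.

```python
import numpy as np

def kmeans(X, k=2, iters=20, seed=0):
    """Minimal k-clustering sketch: alternate nearest-centre assignment
    and centre re-estimation from group means."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):           # avoid empty-cluster division
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres
```

Hierarchical clustering would instead merge (or split) groups step by step, producing the tree-shaped view of Fig. 9.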
Some serious diseases, such as autoimmune disorders, tumors, and cancer, are complex to treat since the degree of variation among affected individuals is high. Precision medicine attempts to resolve this problem by individualizing treatment and supporting the creation of medicines for each individual. It takes an individual's gene variation, lifestyle, and environment into account to predict ailment progression and stage transitions, and it aims to find an appropriate treatment for the individual's ailment.
Patient subtyping plays a prominent role in precision medicine. Here, we divide the population into subpopulations of similar
patients, and this helps in finding more accurate diagnostic and treatment strategies. This assists not only medical science but also practical examination. In reality, from a clinical perspective, fine-tuning the prediction process can cut down the uncertainty in the expected outcome of each individual's treatment.
Data integration based on supervised and unsupervised learning has been proposed in order to subgroup patients. To advance the accuracy of the agent in classifying patients, additional features such as microRNA expression (small non-coding RNA found in an organism) and methylation (a process whereby methyl groups are added to the DNA) are augmented along with the gene expression. Methylation, or any copy-number alteration, acts as a biomarker, analogous to a bookmark, for cancer subtype classification. Data integration effectively helps in subtyping the population. The main goal is to determine classes of samples which share the same or similar, yet relevant, molecular expressions [6, 13].
An intermediate-integration network fusion methodology, SNF108, is capable of incorporating microRNA expression and DNA methylation in order to recognize patient subpopulations. The process takes place in a network structure, where a patient similarity network is constructed for each view. At each iterative phase, the data arriving from the various networks is integrated into the next network, yielding classes with more similarity at each step. Finally, the iteration converges to a final fused network.
There are four main steps involved in this approach:
Step 1: Extraction
The data is collected, including the integrated dataset along with its features, and undergoes clustering and dimension reduction to find the most prominent data objects of the feature set.
Step 2: Ranking
The prototypes are ranked according to their separability scores in order to construct the network.
Step 3: Iteration process
The data from each network view is clustered according to the similarities at each iteration, and the result is further optimized and improved in each subsequent phase.
Step 4: Network fusion
The final outcomes from the various networks are clustered, which efficiently distinguishes between classes using a matrix-factorization approach.
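The iteration and fusion steps above can be sketched as iterative fusion of per-view patient-similarity networks. This is a hedged, minimal illustration in the spirit of similarity network fusion, not the authors' implementation; real SNF additionally sparsifies each network to preserve local structure, which this toy version omits, so only a few diffusion iterations are used.

```python
# Toy sketch of iterative similarity-network fusion (illustrative only).

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def row_normalize(W):
    # Each patient's similarities sum to 1.
    return [[v / sum(row) for v in row] for row in W]

def fuse(W1, W2, iterations=3):
    W1, W2 = row_normalize(W1), row_normalize(W2)
    for _ in range(iterations):
        # Iteration step: each view's network is diffused through the
        # other, reinforcing structure the two views agree on.
        W1, W2 = row_normalize(matmul(W1, W2)), row_normalize(matmul(W2, W1))
    # Fusion step: average the two converged networks into one.
    return [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(W1, W2)]

# Two views (say, gene-expression and methylation similarity) that agree
# patients {0, 1} and {2, 3} form subgroups.
view1 = [[1.0, 0.9, 0.1, 0.1], [0.9, 1.0, 0.1, 0.1],
         [0.1, 0.1, 1.0, 0.9], [0.1, 0.1, 0.9, 1.0]]
view2 = [[1.0, 0.8, 0.2, 0.1], [0.8, 1.0, 0.1, 0.2],
         [0.2, 0.1, 1.0, 0.8], [0.1, 0.2, 0.8, 1.0]]
fused = fuse(view1, view2)
print(fused[0][1] > fused[0][3])  # within-subgroup similarity dominates
```

With too many iterations this toy diffusion washes out the block structure, which is precisely why real SNF restricts diffusion to each patient's nearest neighbours.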
Familiar drugs and compounds may also be used to treat diseases other than the ones they were originally targeted at; such a process is referred to as drug repurposing.
Introduction to Kipoi
Kipoi, termed a model zoo for genomics, is a collection of models intended to promote better usage, sharing, archiving, and building of similar models. These models use deep learning and, more broadly, machine learning to comprehend how DNA sequences encode molecular phenotypes, which relate to a person's physical traits. The process of obtaining phenotypes from gene sequences is called phenotyping. Furthermore, researchers have used machine learning to model how a glitch in the code (unmatched sequences) disrupts the prototype and gives rise to disease. The Kipoi biological database consists of 2000 models from different literature sources and publications.
Fig. 11 Ziga Avsec, API for accessing Kipoi models, GitHub, 2 October 2018
Dirty biological data refers to redundant data, an issue that concerns most biologists. With the abundance of data deposited and exchanged over the network in the public domain, it is extremely hard to detect flaws and maintain the quality of biological data [13]. This type of dirty data reduces the accuracy of the agent's programs. Dirty biological data arises from several factors, including:
• Errors from experiments.
• Misleading interpretations.
• Human mistakes made during the annotation process.
• Lack of standards followed during interpretation.
Learning about dirty data, and ways to handle it through machine learning algorithms, is an advantage in the field of bioinformatics. Furthermore, machine learning approaches help the agent adapt its learning algorithm to produce an optimal solution, in other words to find the best fit. To cope with the exponential growth of biological data and its frequent updates, a short learning and training period is extremely crucial.
In order to overcome these problems, there is a need to work with specialists in the field, such as biologists, while the data is analyzed and interpreted. The data in the database has to be updated constantly to maintain its quality. Cleaning the dirty data is one solution, eliminating non-relevant data from the biological database after analysis. Applying machine learning well is one of the challenges that we ought to face in the bioinformatics field.
There is always a discussion when the subject of concern is the issue of approximation and explanation. Some machine learning approaches, such as genetic algorithms and neural networks, generate outcomes without any explanation of the learning process. This is akin to black-box testing, wherein the entire technique is hidden. Although these technologies produce accurate outcomes, it is often difficult to comprehend their strategies and interpret the relevant information. Approaches like decision trees may generate an explanation that suits human understandability and still accomplish the task, but there is far too much information at every step. That raises the question: do we need all of that information to reach a final conclusion?
According to the survey conducted, most bioinformatics tools utilize more than one machine learning technique [17–19], since relying on a single technique makes the entire process vulnerable to breakdown. Combining techniques also helps attain significant performance gains compared to the use of a single machine learning method. Although this provides the needed performance, combining these processes is a difficult task since they lack coherence, owing to the different approaches followed by the output and learning agents.
8 Conclusions
In recent times, biological research has revolved around enormous data which is increasing rapidly due to high-throughput research. The use of computational tools is crucial in this research, assisting biologists in analyzing genome sequences, defining sequence patterns, and discovering useful information in biological databases. The need for computational tools gave rise to a new field called bioinformatics, along with machine learning. Bioinformatics acts as the interface between biological research and computational research.
Bioinformatics in combination with machine learning helps maintain biological databases, creates tools that learn to make predictions, and extracts knowledge from databases. The objective of bioinformatics is to assist in pharmaceutical discovery and to discover new ailments and ways to counter them. Machine learning approaches and techniques have accelerated bioinformatics research, including the genomic and proteomic fields, molecular structure prediction, and more. Machine learning is an inexpensive and efficient way to deal with bioinformatics issues, and there is a belief that it will play a major role in the future of bioinformatics.
References
1. Lai K, Twine N, O’Brien A, Guo Y, Bauer D (2018) Artificial intelligence and machine learning
in bioinformatics. https://doi.org/10.1016/b978-0-12-809633-8.20325-7
2. Avsec Z, Kreuzhuber R, Israeli J, Xu N, Cheng J, Shrikumar A, Banerjee A, Kim DS, Urban
L, Kundaje A, Stegle O, Gagneur J (2018) Kipoi: accelerating the community exchange and
reuse of predictive models for genomics. https://doi.org/10.1101/375345
3. Gauthier J, Vincent A, Charette S, Derome N (2018) A brief history of bioinformatics. Brief
Bioinform 2018:1–16. https://doi.org/10.1093/bib/bby063
4. Li H (2008) Improvement and application of BP neural network algorithm. Chongqing Normal
University
5. Libbrecht MW, Noble WS (2015) Machine learning applications in genetics and genomics.
Nat Rev Genet 16(6):321–332
6. Tan AC, Gilbert D (2001) Machine learning and its application to bioinformatics: an overview
7. Wang F-L, Song J, Song Y (2009) Application of BP neural network in prediction of protein
secondary structure. Comput Technol Dev 19:217–219
8. Wong K-C, Li Y, Zhang Z (2016) Unsupervised learning in genome informatics. In:
Unsupervised learning algorithms. Springer, pp 405–448
9. Hajighorbani M, Reza Hashemi SM, Minaei-Bidgoli B, Safari S (2016) A review of some semi-
supervised learning methods. In: IEEE-2016, first international conference on new research
achievements in electrical and computer engineering
10. Yu X, Yu G, Wang J (2017) Clustering cancer gene expression data by projective clustering
ensemble. PLoS ONE 12(2):e0171429
11. Satpute BS, Yadav R (2017) Machine learning techniques for bioinformatics and computational
biology—a survey
12. Wang YX, Liu K, Theusch E, Rotter JI, Medina MW, Waterman MS, Huang H (2017) General-
ized correlation measure using count statistics for gene expression data with ordered samples.
Bioinformatics
13. Li Q et al (2016) A supervised method using convolutional neural networks for retinal vessel
delineation. In: International congress on image and signal processing. IEEE, pp 418–422
14. Abdurakhmonov IY (2016) Bioinformatics: basics, development, and future. https://doi.org/
10.5772/63817
15. Le J (2018) A gentle introduction to neural networks for machine learning
16. Rathore S, Habes M, Iftikhar MA, Shacklett A, Davatzikos C (2017) A review on neuroimaging-
based classification studies and associated feature extraction methods for Alzheimer’s disease
and its prodromal stages. NeuroImage
17. Tan AC, Gilbert D (2001) Machine learning and its application to bioinformatics: an overview.
August 31
18. Wong KC, Li Y, Zhang Z (2016) Unsupervised learning in genome informatics. In: Unsuper-
vised learning algorithms. Springer, pp 405–448
19. Hajighorbani M, Reza Hashemi SM, Minaei-Bidgoli B, Safari S (2016) A review of some semi-
supervised learning methods. In: IEEE-2016, First international conference on new research
achievements in electrical and computer engineering
Text Mining in Bioinformatics
1 Introduction
The literature on various subjects is increasing day by day, and with the help of the Internet, the growth is unprecedented. This is especially true in the field of biomedical sciences: since the cost of sequencing a human genome is under $1000 [1], biomedical scientists are creating large and complex multidimensional datasets. This has led to exponential growth in the literature, and this rate of growth raises the issue of utilizing the data efficiently.
The problem of effective usage of data is solved by employing text mining techniques on the available literature. Text mining is a part of data mining in which unstructured data is analyzed and useful information is extracted from it. This enables the literature to be used to promote new discoveries in less time and helps experts reach realistic diagnoses with the help of the extracted information. Large amounts of data are freely accessible through PubMed, an online search engine for MEDLINE documents.
There are many approaches to text mining in biomedical fields, among them named entity recognition (NER), document classification and clustering, relationship discovery, hedge cue detection, claim detection, information extraction and information retrieval, and question answering. The main goal of all these techniques is to understand large biomedical datasets in a very short time. These techniques require a dataset, usually in the form of structured or annotated data such as a corpus, a vector of words, or word embeddings.
Text mining is the process of extracting useful information from large unstructured data. The data sources may be papers, abstracts, HTML pages, or mails [2]. It uses pattern recognition and lexical analysis to extract information.
For named entity recognition (NER) of biomedical text, one should take the ambiguity and inconsistency of the text into consideration before constructing the model. The task also becomes more complex due to the long multi-word names usually given to medical terms. To overcome this, one requirement is a properly annotated dataset; one such resource is provided by the National Center for Biotechnology Information (NCBI), part of the United States National Library of Medicine (NLM). A good predictive model requires both a suitable dataset and a suitable architecture.
The task of named entity recognition (NER) in the biomedical field is very different from other fields such as the news domain or reviews of a subject. Many of the differences give rise to the difficulty of solving this problem in the biomedical field. Some of the difficulties faced are:
• Words may be used differently in the same sentence or across documents.
• They may be abbreviated in some places and expanded in others.
• Biomedical entities usually span more than one word.
• Many entities sound almost the same, or even have similar composition, but may belong to different classes.
These are some of the difficulties faced specifically in the biomedical domain; we will come across others, with examples, as we proceed. Hence, the methods developed are specific to the biomedical domain.
There are three general approaches to solving NER: dictionary-based, rule-based, and machine learning/deep learning. The dictionary-based approach uses a manually crafted dictionary to map words, but it needs human intervention to keep the dictionary updated and can give false results due to ambiguity in medical terms. The rule-based approach uses a set of rules to classify words; it can give better results than the dictionary-based approach, but the model is very specific to its domain and cannot be ported. Machine learning/deep learning approaches work with a wide range of datasets across domains and are generally the better approach of the three.
66 M. Moharir and P. Maiya
The approach discussed in paper [3] is entity recognition in electronic health records (EHRs). An EHR is a finding documented by doctors while treating patients. EHRs are usually riddled with incomplete information or typos, since doctors do not give them much priority, as doing so would reduce their time with patients. This makes it difficult to perform named entity recognition (NER) with a dictionary-based approach.
To overcome these difficulties, various types of dictionary-based annotation are discussed. The first is an exact match against the dictionary; this misses many words due to the presence of typos and abbreviations. The second approach is fuzzy matching of dictionary values to the text to be annotated, which takes care of most typos in the text. The third approach is stemmed matching, where stemmed words from the text are matched against stemmed words in the dictionaries; this takes care of different forms of a word, as their stemmed values are the same. All these methods are used for annotating the document. The overall picture is depicted in Fig. 2.
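The three matching strategies can be sketched as follows; the tiny dictionary, the toy suffix-stripping stemmer, and the similarity cutoff are illustrative stand-ins, not the actual system of [3], which would use a real medical vocabulary and stemmer.

```python
# Illustrative sketch of exact, fuzzy, and stemmed dictionary annotation.
import difflib

DICTIONARY = ["diabetes", "hypertension", "pneumonia"]

def stem(word):
    # Toy suffix stripper; a real system would use a Porter-style stemmer.
    for suffix in ("es", "ic", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def annotate(token):
    token = token.lower()
    if token in DICTIONARY:                        # 1. exact match
        return ("exact", token)
    close = difflib.get_close_matches(token, DICTIONARY, n=1, cutoff=0.85)
    if close:                                      # 2. fuzzy match (typos)
        return ("fuzzy", close[0])
    stems = {stem(w): w for w in DICTIONARY}
    if stem(token) in stems:                       # 3. stemmed match
        return ("stemmed", stems[stem(token)])
    return None

print(annotate("diabetes"))   # ('exact', 'diabetes')
print(annotate("diabetis"))   # typo      -> ('fuzzy', 'diabetes')
print(annotate("diabetic"))   # inflected -> ('stemmed', 'diabetes')
```

Each stage catches what the previous one missed, mirroring the escalation from exact to fuzzy to stemmed matching described above.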
The machine learning approach discussed in paper [4] studies the dataset to extract features from the text. It takes the approach used for NER in the news domain and draws a parallel to the biomedical domain. Although the entities recognized in the two domains are different, some of the approaches remain the same. The paper lists the different ways a biomedical entity can be named: it may be descriptive like 'normal thymic epithelial cells,' contain one head noun like '91 and 84 kDa proteins' (which contains two entities), have a non-standard name, be abbreviated, or have cascaded entities where one entity lies inside another.
By studying the data and listing the various forms entities may take, the paper chooses features to extract from each word: word formation pattern, morphological pattern, POS, head-noun trigger, special-noun trigger, and name alias. It then uses a hidden Markov model for entity prediction, with a k-nearest-neighbor algorithm to handle data sparseness. The approach is applied to the GENIA corpus, and the overall precision scores achieved on two versions, GENIA V1.1 and GENIA V3.0, are presented in Table 1.
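As an illustration of the first of these features, a word-formation-pattern extractor might look like the following; the pattern classes here are plausible examples, not necessarily the exact ones used in [4].

```python
# Illustrative word-formation-pattern feature for biomedical tokens.
import re

def word_formation_pattern(token):
    if re.fullmatch(r"\d+", token):
        return "AllDigits"      # e.g. '91'
    if re.search(r"\d", token):
        return "ContainsDigit"  # e.g. 'IL-2'
    if "-" in token:
        return "Hyphenated"     # e.g. 'NF-kappaB'
    if re.fullmatch(r"[A-Z]+", token):
        return "AllCaps"        # e.g. 'DNA'
    if re.fullmatch(r"[A-Z][a-z]+", token):
        return "InitCap"        # e.g. 'Protein'
    if re.fullmatch(r"[a-z]+", token):
        return "AllLower"       # e.g. 'epithelial'
    return "MixedCase"          # e.g. 'kDa'

for token in ["91", "IL-2", "NF-kappaB", "DNA", "epithelial", "kDa"]:
    print(token, "->", word_formation_pattern(token))
```

Coarse surface patterns like these let an HMM generalize to protein and gene names it has never seen, which is what makes them useful under data sparseness.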
The approach of [5] uses a conditional random field (CRF) directly. It applies the algorithm to two datasets: the GENETAG corpus for protein and gene name recognition, and the BioText corpus for disease name recognition. It uses a type of dictionary-based feature extraction for enhancing the features: instead of using a dictionary directly, it uses the UMLS Metathesaurus to approximate the feature. Direct dictionary extraction is heavily dependent on domain knowledge; using an approximate dictionary-lookup algorithm makes the feature extraction a bit more flexible. Other features are
Fig. 2 EHR named entity recognition pipeline: pre-processing of the training corpus (tokenization, lemmatization, POS tagging), annotation, annotation merging, and the final annotated corpus
Table 2 Results of evaluating different entity encoding schemes on the BioText NER task with the Hepple tagger

Encoding scheme  Precision  Recall  F1
IO               62.82      47.79   54.28
BIO              63.40      47.13   54.07
BIOEW            63.11      46.61   53.61

Table 3 Results of evaluating different entity encoding schemes on the BioText NER task with MedTagger

Encoding scheme  Precision  Recall  F1
IO               87.42      69.40   77.38
BIO              87.93      82.29   85.02
BIOEW            83.05      74.57   78.58
With the Hepple tagger (Table 2), the IO scheme gives a slightly better F-score than the BIO and BIOEW schemes. This agrees with the finding in [13], which uses the BioCreative II corpus for the gene/protein NER task, and the IO setting is therefore retained for the paper's experiments. The three schemes were further evaluated using MedTagger (Table 3). The F-scores improve with MedTagger, and here the BIO encoding scheme gives a better F-score than the IO and BIOEW schemes.
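The IO and BIO schemes differ only in whether the first token of an entity receives a distinct B- tag. A small sketch of the conversion (tag names illustrative):

```python
# Convert IO tags ('O' or 'I-TYPE') to BIO by marking the first token
# of each entity span with a 'B-' prefix.

def io_to_bio(tags):
    bio, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-") and prev != tag:
            bio.append("B-" + tag[2:])   # entity starts here
        else:
            bio.append(tag)
        prev = tag
    return bio

io = ["O", "I-Gene", "I-Gene", "O", "I-Disease"]
print(io_to_bio(io))  # ['O', 'B-Gene', 'I-Gene', 'O', 'B-Disease']
```

Note that IO cannot represent two adjacent entities of the same type as separate entities (they look like one span), which is precisely the ambiguity the B- prefix in BIO removes; BIOEW additionally marks end and single-word tokens.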
The approach in [6] is partially rule-based, combined with machine learning. It defines a set of rules for deriving features and uses support vector machines (SVMs) as the machine learning method to classify words into different labels. SVMs are supervised learning models used for classification. The model is trained on the training data, and the result is stored in this format for further classification, since training is computationally expensive and takes a long time.
The GENIA corpus with annotations was used as the dataset. The results state that for biomedical terms, the precision is 94.33% with a recall of 71.67%; for non-medical terms, the precision is 69.73% and the recall is 93.82%. The overall precision comes out to 84.34% and the overall recall to 80.76%. The main disadvantage of simpler models like this is that the scores vary a lot across different datasets.
Another approach is discussed in paper [7]. It uses a skip-chain conditional random field (CRF) instead of a standard CRF. It observes that in medical texts, words next to each other may not belong to the same class; sometimes, even though they have similar features, they belong to radically different classes, so direct use of a CRF gives wrong results. Often, related words may be far apart in the same sentence.
Hence, it employs a skip-chain CRF model which, in addition to regular connections to nearby words, has extra connections to distant words, which specifically helps in medical texts. The structure of the skip-chain CRF is shown in Fig. 4: there are edges going to distant words. For example, hOggl, mOggl, and Oggl do not appear next to each other but are related. The performance of the above scheme is tabulated in Table 4.
The approach in [8] combines several methods: a long short-term memory (LSTM) network, a bi-directional LSTM network, word embeddings, character embeddings, and a CRF. It lists the advantages of adding each of these parts. The LSTM network helps add context to the current word based on the previous words. The bi-directional LSTM also takes future words into account along with the words previously seen, and hence gives better context. Character embeddings help with the out-of-vocabulary (OOV) problem. Finally, the CRF helps in recognizing the sequences of words which form an entity.
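The CRF's job of picking the best tag sequence is typically done with Viterbi decoding. The following is a minimal, self-contained sketch in the log-score domain; the tags, scores, and the rule "I cannot follow O" are made-up illustrations, not values from [8].

```python
# Minimal Viterbi decoder: find the highest-scoring tag sequence given
# per-token emission scores and tag-to-tag transition scores (log-domain,
# so scores add).

def viterbi(emissions, transitions, tags):
    # best[t] = (score, path) of the best partial sequence ending in tag t
    best = {t: (emissions[0][t], [t]) for t in tags}
    for em in emissions[1:]:
        best = {
            t: max(((score + transitions[(p, t)] + em[t], path + [t])
                    for p, (score, path) in best.items()),
                   key=lambda x: x[0])
            for t in tags
        }
    return max(best.values(), key=lambda x: x[0])

tags = ["O", "B", "I"]
# Transition scores: everything allowed (0.0) except I directly after O.
transitions = {(p, t): 0.0 for p in tags for t in tags}
transitions[("O", "I")] = -100.0
# Per-token scores for a 3-token sentence; token 2 looks like an 'I',
# but the transition scores force a well-formed B before it.
emissions = [{"O": 5, "B": 0, "I": 0},
             {"O": 0, "B": 2, "I": 3},
             {"O": 0, "B": 0, "I": 4}]
score, path = viterbi(emissions, transitions, tags)
print(path, score)  # ['O', 'B', 'I'] 11.0
```

This is exactly why a CRF layer on top of a tagger helps: the transition scores let it reject locally plausible but globally ill-formed tag sequences.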
Pretrained word embeddings are used. For character embeddings, two vectors are computed: one reading the word forwards, vf, and one reading it backwards, vb. These vectors are concatenated with the word embeddings and fed to a bi-directional LSTM. To prevent overfitting, dropout techniques are used. The output of this network is fed to a CRF to extract the entity name. An example of the final network is shown in Fig. 5.

Fig. 4 Structure of the skip-chain CRF, with observation nodes Xt, …, Xt+10 and label nodes Yt, …, Yt+10

Table 4 Results of evaluating the skip-chain conditional random field (CRF)

Precision  Recall  F1
72.8       73.6    73.2
It presents the scores of various architectures in Table 5. The abbreviations are as follows:
LSTM  Long short-term memory
WE    Word embeddings
BLSTM Bi-directional long short-term memory
CE    Character embeddings
CRF   Conditional random field
The paper [9] introduces a novel method crafted specifically for biomedical data, with an implementation in the Python programming language. It uses TensorFlow for building the neural network and a word2vec model for representing text as vectors. It [9] combines a convolutional neural network (CNN) with POS tagging and word and character embeddings to create a new type of network named 'GRAM-CNN,' shown in Fig. 6. It hypothesizes that a long short-term memory (LSTM) network, based on a recurrent neural network (RNN), which is usually used for tagging, would not give high performance, since long sentences often contain information unrelated to the current words and would degrade the LSTM's efficiency. The approach concatenates the POS tag, word embeddings, and character embeddings with the word itself and feeds this to a CNN. Word embeddings alone cannot give a useful representation for out-of-vocabulary words; to help with this, character embeddings are used, implemented using a CNN.
Fig. 5 Bi-directional LSTM with character and word embeddings and CRF
The general architecture of the CNN is shown in Fig. 7. The output of the CNNs is then fed to a conditional random field (CRF) to give the final output. Since biomedical entities contain more than one word, the CRF is used to predict a sequence of words as an entity; to predict the sequence label, the sequence having the largest probability is chosen. Different computational methods [10–12] have also been developed for protein structure prediction and essential-protein prediction using machine learning. The prediction scores on the NCBI dataset are as follows:
Fig. 6 GRAM-CNN architecture: word embeddings are concatenated and passed through CNNs of different widths (CNN-1, CNN-3, CNN-5), and a CRF predicts entity tags (e.g. B-Gene, I-Gene) for the tokens
3 Conclusion
This chapter has discussed what text mining is and why it is important for biomedical applications. It provided a comprehensive overview of text mining methods, discussing various methods through different case studies and analyzing their results in detail.
References
1. Goldfeder RL, Wall DP, Khoury MJ, Ioannidis JPA, Ashley EA (2017) Human genome sequenc-
ing at the population scale: a primer on high-throughput DNA sequencing and analysis. Am J
Epidemiol 186(8):1000–1009
2. Zhang Y, Chen M, Liu L (2015) A review on text mining. In: 2015 6th IEEE international
conference on software engineering and service science (ICSESS), Beijing, pp 681–685
3. Quimbaya AP et al (2016) Named entity recognition over electronic health records through a
combined dictionary-based approach. Procedia Comput Sci 100:55–61
4. Zhou G, Zhang J, Su J, Shen D, Tan C (2004) Recognizing names in biomedical texts: a machine
learning approach. Bioinformatics 20(7):1178–1190
5. Kanimozhi U, Manjula D (2017) A CRF based machine learning approach for biomedical
named entity recognition. In: 2017 Second international conference on recent trends and
challenges in computational models (ICRTCCM), Tindivanam, pp 335–342
6. Ju Z, Wang J, Zhu F (2011) Named entity recognition from biomedical text using SVM. In:
2011 5th International conference on bioinformatics and biomedical engineering, Wuhan, pp
1–4
7. Liao Z, Wu H (2012) Biomedical named entity recognition based on skip-chain CRFS. In:
2012 International conference on industrial control and electronics engineering, Xi’an, pp
1495–1498
8. Gridach M (2017) Character-level neural network for biomedical named entity recognition. J
Biomed Inf 70:85–91
9. Zhu Q, Li X, Conesa A, Pereira C (2017) GRAM-CNN: a deep learning approach with local
context for named entity recognition in biomedical text. Bioinformatics 34(9):1547–1554
10. Sekhar SM, Siddesh GM, Manvi SS, Srinivasa KG (2019) Optimized focused web crawler with
natural language processing based relevance measure in bioinformatics web sources. Cybern
Inf Technol 19(2):146–158
11. Sekhar M, Sivagnanam R, Matt SG, Manvi SS, Gopalalyengar SK (2019) Identification of
essential proteins in yeast using mean weighted average and recursive feature elimination.
Recent Patents Comput Sci 12(1):5–10
12. Patil SB, Sekhar SM, Siddesh GM, Manvi SS (2017). A method for predicting essential proteins
using gene expression data. In: 2017 International conference on smart technologies for smart
nation (SmartTechCon), pp 1278–1281. IEEE
13. Sukanya M, Biruntha S (2012) Techniques on text mining. In: 2012 IEEE international
conference on advanced communication control and computing technologies (ICACCCT),
Ramanathapuram, pp 269–271
Open-Source Software Tools
for Bioinformatics
1 Introduction
T. Gururaj (B)
Research Scholar, Department of ISE, MSRIT, Visvesvaraya Technological University, Jnana
Sangama, Belagavi 560054, India
A. P. Pavithra
Associate Professor, Department of ECE, SJM Institute of Technology, Chitradurga 577502, India
T. Gururaj · A. P. Pavithra
Visvesvaraya Technological University, Belagavi, Karnataka, India
Many medical datasets are available for research, provided by diagnostic laboratories. These datasets can be used by researchers to study the architecture and data structures of biological data. Programmers can download this medical data online and apply effective algorithms to analyse the hidden information in the data for diagnosis.
Recently, bioinformatics and predictive analysis in the medical field have gained increasing attention. Signals extracted from the human brain and heart are used for information extraction, processing, and predictive analysis. Datasets from FPMS, UCSD, PhysioNet, and others can be applied to bioinformatics investigations based on machine learning tools.
Figure 1 illustrates that bioinformatics is an interdisciplinary subject, combining chemistry, engineering, biology, computer science, and statistics for analytics and prediction. Molecular biology is helpful in studying biological structures.
With computer tools and algorithms, the medical and diagnostic sciences are analysed by computer professionals. Researchers are drawn to the interdisciplinary field of bioinformatics so that their programming skills can be applied to the medical sciences.
Fig. 1 Elements of bioinformatics
Medical data can be analysed with software tools to study its structure. Some of these tools, based on specific datasets, are listed below [2].
OpenEEG (http://openeeg.sourceforge.net/doc/): OpenEEG is an open-source online tool that can be used to analyse EEG signals; its tools include EEGMIR and BrainWave Viewer. EEGMIR (https://sites.google.com/site/eegnetworks/) is an open-source tool for analysing EEG brain signals, with which the brain network can be viewed.
BioSig (http://biosig.sourceforge.net/): BioSig is an open-source project consisting of many libraries and add-ons, such as Brainathlon, NeuroServer, BrainBay, BioEra, and EEGce, with many features for signal processing. The library is useful for analysing signals including the EEG, electrooculogram (EOG), electrocorticogram (ECoG), electrocardiogram (ECG), electromyogram (EMG), respiration, and others. Additionally, it provides toolbox interfaces to popular environments such as MATLAB, Python, Octave, PHP, Perl, C, C++, Tcl, and Ruby. Medical domains such as neurophysiology, cardiovascular systems, neuroinformatics, psychology, and sleep research require brain–computer interfaces, which are effectively handled by BioSig.
GenomeTools (http://genometools.org/): GenomeTools is an open-source tool for genome analysis and biological parameter analysis. Its libraries support bioinformatics work through C-language APIs with detailed instructions, and GenomeTools can be used for deep analysis of biological structures.
Biopython for molecular biology (http://biopython.org/): Biopython provides a set of tools and libraries for computation on biological structures. Its open-source distribution is promoted by the Open Bioinformatics Foundation (OBF). Python code is used to convert bioinformatics files into data structures for analysis. International formats supported in Biopython include UniGene, PubMed, GenBank, Medline, FASTA, ClustalW, and BLAST. BioSQL (http://biosql.org) can be used with Biopython to store a biological database.
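To illustrate the kind of file-to-data-structure conversion Biopython's parsers perform, here is a minimal pure-Python FASTA reader; it is an illustrative sketch, not Biopython's actual API.

```python
# Minimal FASTA parser: turn FASTA-formatted text into a list of
# (identifier, sequence) pairs, as a model-zoo of format parsers would.

def parse_fasta(text):
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:             # flush previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)                # sequence may span lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

fasta = """>seq1 example gene
ATGGTTCAA
TTGACA
>seq2
GGCATT"""
print(parse_fasta(fasta))
# [('seq1', 'ATGGTTCAATTGACA'), ('seq2', 'GGCATT')]
```

In practice one would use Biopython's SeqIO machinery, which handles the many dialects and formats listed above; the point here is only the principle of parsing a text format into a structured record.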
Bioperl: Bioperl is open-source software, free of cost and licensed under the Perl Artistic Licence (http://www.perl.com/pub/a/language/misc/Artistic.html). The software can be downloaded from http://www.bioperl.org.
The Bioperl object model has been tested in enterprise-level applications such as Ensembl, and it offers an easy learning curve for Perl programmers. The software is able to analyse and process results from software such as ClustalW, BLAST, or the EMBOSS suite. BioCORBA can be used with Bioperl to support modules written in Java and Python; supplementary materials are available at www.genome.org.
1.2 Interoperability
In some cases, bioinformatics problems are best solved by hybrid tools. Supporting interoperability [3] between various languages enables a programmer to use different toolkits and to build components on top of work done in other languages and projects. Components written in languages such as Java, Python, and C can support a Perl program by being invoked from it, which is often referred to as shelling out.
Bioperl supports the data structures accepted by the bioinformatics community and a number of Extensible Markup Language (XML) formats.
Software interoperability is achieved not only by running external programs but also by using remote components, possibly developed in a programming language different from that of the calling component. Bioperl is compliant with the BioCORBA (Common Object Request Broker Architecture) project (http://www.biocorba.org), one of the standard techniques for using CORBA components in biological processing. BioJava and Biopython also support the BioCORBA project.
1.3 Summary
Biomedicine and bioinformatics are two research domains in which predictive analysis is applied and various applications have been developed. Data mining with machine learning techniques can be applied to signals generated by different parts of the human body. With this introduction, we will consider several tools in detail in the coming topics.
2 Bioperl
Bioperl is open-source software that has become an international standard for biologists [4], scientists and bioinformaticians. It has evolved over many years into a comprehensive library of Perl modules for the life sciences. Bioperl offers easy, consistent and reliable programming for bioinformaticians, and the complexity of a program can often be reduced to a few simple lines of code.
Open-Source Software Tools for Bioinformatics 79
The Bioperl toolkit provides reusable Perl modules that contain generalized routines for life-science data. The main objective of the toolkit is to focus on solutions that share components instead of duplicating effort [5].
The code is freely available under an open-source licence, so others can extend the routines in the Bioperl library and contribute their own routines as well. Just as the Human Genome Project embraced public data sharing, the open-source nature of the Bioperl project minimizes the time needed for solutions and new tools to reach the community. The aim of Bioperl is to allow the user to focus on the problem, such as the logic needed to filter hits in a BLAST [6] report by certain criteria, instead of the actual mechanics of parsing the BLAST report.
The Perl modules are shown in Table 1 and are organized by logical names. For instance, Bio::Search can be used to search databases, and Bio::Graphics contains the modules related to drawing. The Bioperl modules form a simplified API that provides the common Bioperl functions.
The most advanced use of Bioperl to date is in the Ensembl project [7]. Its sequence handling, sequence features and file-format parsing were developed for automatically annotating various genomes, implemented with the Bioperl sequence, feature-pair and sequence-feature objects.
Many problems in computational biology, such as file parsing and string processing, are recurring tasks for developers in bioinformatics. Since there is no standard method for processing biological data, different developers follow their own programs and their own file formats, which creates difficulties: for instance, it is difficult to process the same data in different laboratories. Although most laboratory equipment is similar and aims to solve the same problems, processing the same data across computational experiments is hard because of differing file formats and software.
The Bioperl project aims to coordinate the efforts of bioinformatics developers to build standard tools for genomic analysis. Bioperl consists of modules and online documentation, and each module consists of one or more objects.
The main objects of Bioperl are:
• Sequence
• Sequence Alignment
• BLAST
• Alignment Factories
• 3D Structure.
Bioperl's large collection of modules can be installed easily. Additional steps are provided for the external programs and Perl modules that Bioperl depends on. Typically, Perl's CPAN is used to fetch and install Bioperl.
"INSTALL" is a document that provides step-by-step instructions for installing Bioperl on various operating systems, including Windows, Linux and Mac. It is part of the Bioperl distribution and can be found at http://bioperl.org/Core/Latest/INSTALL. This location may change, but the document is easy to find from the Bioperl download page.
Reading this document first is recommended before installing Bioperl. An overview of the installation and comments on common installation issues are given below.
Installing Bioperl on the different platforms:
For Unix/Linux: download the tar file from the website, untar it and go through the configure step, following the step-by-step procedure provided in the "INSTALL" document present in the distribution.
For Microsoft Windows: with ActiveState's Perl (http://www.activestate.com), a PPM package of Bioperl is available, currently at http://bioperl.org/ftp/DIST/Bioperl-1.2.1.ppd.
Bioperl also has a CVS repository from which the current version of the modules can be fetched. Newer versions contain new features but may also contain more bugs than the older, stable releases. Details on using the CVS repository are provided on the Bioperl website. Of all these methods, the easiest and most common way for Perl programmers to install the module set is through CPAN.
Bioperl can be installed from the CPAN command line (a CPAN shell is started with perl -MCPAN -e shell). A module often requires other modules for its proper functioning, and one or more of those required modules may not be installed. CPAN can check whether the required modules are present, and the missing prerequisites can then be installed by following its prompts.
For a collection of modules as large as Bioperl, many prerequisite prompts may pop up. This behaviour is controlled by the CPAN configuration option prerequisites_policy; searching the CPAN documentation for that string shows how to configure whether prerequisites are followed automatically, asked about or ignored.
Downloads may be slow, as many prerequisites may have to be fetched over the Internet. If the system appears to hang, wait until the installation of the program completes, as it can take a long time. The whole CPAN session can be summarized in a few commands. The bundle Bundle::BioPerl contains some extra modules that are useful with Bioperl, and it is installed first.
This fetches the module code from the repository, then unpacks, builds and installs it together with its prerequisites. Next, the main Bioperl distribution is installed. The latest CPAN release at the time of writing is version 1.2.1; the Bioperl website carries news about the latest releases. Check the INSTALL file for the exact process. Perl version 5.8.0 is one of the standard supported platforms.
The installation described here was tested on a notebook computer with an Intel 686 processor running Red Hat Linux 7.2. The computer and operating system were about two years old, and the Bioperl website was checked for its recommended Linux versions.
Generally, modern computer systems are complex and change rapidly: hardware and operating systems have a replacement cycle of about two years, and the parts of a system need to stay in sync to function. Bioperl and Perl must coordinate with the installed hardware and software, and other components such as the web server, C compiler or web browser may cause problems. The system requirements should therefore be analysed before the installation process.
In this installation environment, however, no warnings appeared. The latest version of Perl was installed successfully, and the Bioperl installation was then performed by typing the installation command in the CPAN shell.
CPAN then performs a great deal of activity, downloading the distribution from the Internet and analysing the various modules. The modules are tested; a few of them may fail, in which case CPAN does not install them. After a few failures in the installation process, check whether the failures lie in peripheral components of Bioperl, which can be fixed later. To complete the installation despite some failures in the test process, CPAN can be forced to install (force install in the CPAN shell).
Bioperl provides a number of sequence objects, the most common of which is Seq. It represents a single nucleotide sequence and is implemented in the file Seq.pm. It is the most widely used Bioperl object, because every program that generates, modifies or processes DNA sequences uses a sequence object. One is created automatically when you read a sequence from a file or a database. The object has methods for reading and writing data in different file formats; formats such as raw, GenBank and FASTA are supported, as are basic operations such as sequence translation, DNA-to-RNA conversion and subsequence extraction.
An example of creating a sequence object is shown below:
$seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact',
'-desc'=>'Sample Bio::Seq object',
'-display_id' => 'something',
'-accession_number' => 'accnum',
'-alphabet' => 'dna' );
A sequence can also be read from a file and converted between formats, for instance from FASTA into raw text format [8].
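Such a conversion can be sketched in plain Python (an illustrative stand-in; Bioperl itself performs this kind of format conversion with its Bio::SeqIO modules). The sketch assumes a single-record FASTA file:

```python
# Read a FASTA file and write the bare sequence as raw text: no header
# line, no line breaks. Assumes the file holds a single sequence record.
def fasta_to_raw(fasta_path, raw_path):
    parts = []
    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith(">"):
                continue  # skip the description line and blank lines
            parts.append(line)
    with open(raw_path, "w") as out:
        out.write("".join(parts))
```

Given a file containing ">seq1" followed by wrapped sequence lines, the output file holds only the concatenated residues.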
3 Biopython
Biopython is one of the most popular bioinformatics packages for Python. Various sub-modules are provided for common bioinformatics tasks [9]. Originally developed by Chapman and Chang, it is written mainly in Python, with some C code to optimize complex computations. It is compatible with Linux, Windows, Mac, etc.
3.1 Introduction
Biopython consists of various Python modules, developed mainly to process DNA, RNA and protein sequence data, with operations such as identifying a modified protein sequence or reverse-complementing a DNA string. Many parsers are provided for common database formats such as GenBank, FASTA and SwissProt, and interfaces are provided for executing other popular bioinformatics tools, such as the NCBI Entrez utilities, within the Python environment. Biopython has sibling projects such as BioJava, Bioperl and BioRuby.
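As a small illustration of one such operation, reverse-complementing a DNA string can be sketched in plain Python (Biopython's Seq class offers the equivalent reverse_complement() method):

```python
# Map each base to its complement, then reverse the string.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGC"))  # -> GCAT
```

Applying the function twice returns the original strand, which is a convenient sanity check.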
Biopython is an international developer-association project, freely available for Python (http://www.python.org), for computing in molecular biology. The website http://www.biopython.org provides online resources for Python-based software developers in bioinformatics, including scripts, modules and weblinks. The project's aim is to make it easy to program in Python for bioinformatics and to help develop high-quality, reusable modules and scripts [10].
The main Biopython release supports a lot of functionality, including:
• Parsing bioinformatics files into Python data structures
• Iterating over the records of supported file formats one by one
• Code for fetching bioinformatics data from online resources
• Interfaces to common bioinformatics programs
• A common, standard sequence class
• Tools for performing common operations on sequences
• A connection to BioSQL, a sequence database schema also supported by the Bioperl and BioJava projects.
3.2 Goals
Biopython is developed to provide simple, standard and extensive access to bioinformatics from the Python language. Its specific goals are listed below:
• Providing standardized access to bioinformatics resources
• Establishing high-quality, reusable modules and scripts
• Fast array manipulation for clustering and classification models
• Genomic data analysis.
3.3 Advantages
Biopython requires less code and also has the following advantages:
• It provides a microarray data type used in clustering
The installation process is easy and typically requires about five minutes.
Step 1—Verifying Python Installation
Biopython is designed to support Python 2.5 or higher, so Python must first be installed on your system. Execute the following command to check the Python version:
C:\Users>python –version
Python 3.6.5
This shows which version of Python is installed. If an older version is present, download and install the latest version and run the command again.
Step 2—Installing Biopython using pip
Biopython can be installed easily with pip from the command line on all platforms. Execute the command below:
pip install biopython
Collecting biopython
Using cached https://files.pythonhosted.org/packages/6a/22/c5b6edsf3j3jlgjfik4aslhfklsuue6klahf/biopython-1.72-cp.................
Requirement already satisfied: numpy in /Library/Frameworks/Python.framework/versions/3.6/site-packages (from biopython) (1.14.2)
Installing collected packages: biopython
Successfully installed biopython-1.72
Once the command completes, any older versions of Biopython and NumPy are removed and the recent versions are installed.
Step 3—Verifying Biopython Installation
Biopython should now be installed on the system. To check that Biopython is properly installed, run the following commands on the Python console:
C:\Users\User>python
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import Bio
>>> print(Bio.__version__)
1.72
>>>
Alternatively, Biopython can be built and installed from its source code. Some use cases that demonstrate the importance of Biopython in the field include processing RNA structure, population genetics and more.
Population Genetics
Population genetics is the study of genetic variation within populations; it investigates changes in the frequencies of genes and alleles in populations over space and time.
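A toy example of the kind of computation involved is estimating allele frequencies from observed genotype counts at a two-allele locus (the counts below are invented for illustration, not real data):

```python
# Estimate allele frequencies p and q from genotype counts (AA, Aa, aa)
# at a single two-allele locus.
def allele_frequencies(n_AA, n_Aa, n_aa):
    total = 2 * (n_AA + n_Aa + n_aa)   # total number of allele copies
    p = (2 * n_AA + n_Aa) / total      # frequency of allele A
    return p, 1 - p

p, q = allele_frequencies(36, 48, 16)
print(round(p, 2), round(q, 2))  # -> 0.6 0.4
```

Biopython's Bio.PopGen sub-package provides interfaces for this domain; the sketch above only illustrates the underlying arithmetic.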
4 Conclusion
References
1. Stajich JE (2006) Open source tools and toolkits for bioinformatics: significance, and where
are we? Brief Bioinform 7(3):287–296
2. Retrieved from http://elibrary.nusamandiri.ac.id/ebook/OpenSourceForYou-February_2018.pdf
3. Stajich JE (2002) The BioPerl toolkit: perl modules for the life sciences. Genome Res
12(10):1611–1618
4. Stajich JE (2007) An introduction to BioPerl. Plant Bioinf, 535–548
5. Gremme G, Steinbiss S, Kurtz S (2013) GenomeTools: a comprehensive software library for
efficient processing of structured genome annotations. IEEE/ACM Trans Comput Biol Bioinf
10(3):645–656
6. Altschul S (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs. Nucleic Acids Res 25(17):3389–3402
7. Hubbard T (2002) The Ensembl genome database project. Nucleic Acids Res 30(1):38–41. https://doi.org/10.1093/nar/30.1.38
8. The GenomeTools Developer's Guide (n.d.) Retrieved from http://genometools.org/documents/devguide.pdf
9. Ryu T (2009) Benchmarking of BioPerl, Perl, BioJava, BioPython, and Python for primitive bioinformatics tasks and choosing a suitable language. Int J Contents 5(2)
10. Talevich E, Invergo BM, Cock PJ, Chapman BA (2012) Bio.Phylo: a unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython. BMC Bioinf 13(1)
Protein Structure Prediction and Gene
Expression Analysis
A Study on Protein Structure Prediction
1 Introduction
In the last couple of decades, the progress in the experimental determination of protein three-dimensional structure has been tremendous. But massively parallel sequencing technology has led to a rapid rise in sequence information, and the experimental determination of protein three-dimensional structure has not kept pace with it. As a result, far more protein sequences are known than protein three-dimensional structures, and the gap between the two numbers, rather than diminishing, is ever growing. Yet the information present in the amino acid sequence of most proteins is enough for their three-dimensional structure to be determined; this opens up the possibility of predicting the three-dimensional structure of a protein from its amino acid sequence.
To bridge this widening gap, computational prediction of protein structures may be a viable answer to a challenge that has stood in molecular biology for more than 40 years. Many accurate and useful three-dimensional models of proteins have already been predicted by making use of the sequence similarity between a protein and another protein whose three-dimensional structure is already known. This is often called template-based or homology modelling.
However, it is hard to predict the structure of a protein when not even a single structure from its protein family is known. So how do we predict the structure of such a protein?
There are three ways of predicting the structure of a protein.
• Comparative modelling
• Threading
• Ab initio prediction.
Comparative modelling is based on the fact that proteins which are evolutionarily related and have similar sequences, as measured by the percentage of identical residues at each position under an optimal structural superposition, have similar structures.
Comparative modelling is also known as homology modelling.
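The residue-identity criterion can be made concrete with a small sketch (plain Python, positions assumed already aligned; real tools differ in how they treat gaps and normalize the denominator):

```python
# Percent identity over two aligned sequences of equal length:
# count positions where both residues are present and identical.
def percent_identity(aligned_a, aligned_b):
    assert len(aligned_a) == len(aligned_b)
    matches = sum(
        a == b
        for a, b in zip(aligned_a, aligned_b)
        if a != "-" and b != "-"          # ignore gap positions
    )
    return 100.0 * matches / len(aligned_a)

print(percent_identity("MKT-LVI", "MKTALVI"))  # 6 of 7 positions identical
```

A higher percentage identity means the known structure is a safer template for the unknown one.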
2.2 Threading
PDB: The Protein Data Bank (PDB) is a database of three-dimensional structure data for a large number of biological molecules, which can be proteins or nucleic acids. The structural data are obtained mainly by X-ray crystallography, NMR spectroscopy and, increasingly, cryo-electron microscopy. The structures are normally submitted by biochemists and biologists from all around the world, and the database is managed by the Worldwide Protein Data Bank (wwPDB). In the last three years, around 90% of the new structures submitted to the PDB have had folds similar to ones already present in it.
Ab initio prediction is the method of working out the energetics of the various processes involved in the folding of the protein and then finding, among all the structures modelled for that protein, the one with the lowest free energy.
If there exists a protein whose tertiary structure is known to us and which shares at least 30% of its sequence with the protein whose structure is undetermined, then comparative methods can be used: they overlay the putative unknown structure onto the structure of the known one and predict the structure of the unknown. If, however, no protein matches this threshold, then other types of prediction must be applied, namely ab initio protein prediction, fold recognition and threading.
• Ab initio methods: physicochemical and neural-network algorithms are used to derive the secondary structure of a protein from its primary structure, after which the algorithms predict the tertiary folding of the protein. This method is not capable of incorporating the locations and orientations of the amino acid side chains.
3 Homology Modelling
Homology modelling was done as early as the 1960s. Originally, models made of wire and plastic were used to display the bonds and atoms: the coordinates of proteins whose structures were known were taken, the models were constructed, and where the amino acids did not match the structure, the model was modified by hand. The first paper on homology modelling was published in 1969 by Browne, co-workers and David Phillips. It described a model of α-lactalbumin based on the structure of hen egg-white lysozyme; the homology between the two proteins was 39%.
100 B. Banerjee et al.
Example: suppose we want to predict the structure of a protein A which is 150 amino acids long. The first step is to compare the sequence of protein A with all the sequences of proteins whose structures are known; these known structures are stored in the PDB. Suppose there exists a protein B which is 300 amino acids long and contains a 150-amino-acid region that matches the 150 amino acids of A with at least 50% similarity, which falls in the safe zone. Then we can take the fragment of the structure of B corresponding to the aligned region of A, and the amino acids that differ between sequences A and B can be mutated to finally arrive at our model of A. A is called the target, and its structure is of course not known at the time of modelling.
Homology Modelling Steps (see Fig. 4).
In this step, we search the PDB for homologous proteins whose structures have been determined. We can conduct this search using programs such as BLAST and FASTA, because the percentage identity between the target sequence and a possible template should be high enough to fall in the safe zone, which is one of the requirements for using these search programs.
BLAST Search
BLAST is a search algorithm which compares a query sequence with the sequences present in a library or database and returns the sequences whose resemblance to the query lies above a certain threshold. The search can be varied depending on the type of sequence we want to query: it can search protein amino acid sequences as well as DNA and RNA nucleotide sequences. It was developed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers and David J. Lipman at the National Institutes of Health, was published in the Journal of Molecular Biology in 1990 and has so far been cited more than 50,000 times.
FASTA Search
FASTA is a software package primarily used for DNA and protein sequence alignment. It takes a query sequence and searches a database for similar sequences using local sequence alignment, and it has a high speed of execution. It first marks potential matches using a pattern of word hits, word-to-word matches of a given length, before performing the more time-consuming search with a Smith–Waterman-type algorithm.
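The fast first stage can be sketched as a word lookup (an illustrative toy with a hypothetical word length k = 3; FASTA's actual lookup tables and scoring are more elaborate):

```python
# Find all shared words (k-mers) between a query and a library sequence:
# the cheap first pass a FASTA-style search runs before any costly
# Smith-Waterman alignment.
def shared_words(query, target, k=3):
    index = {}
    for i in range(len(target) - k + 1):
        index.setdefault(target[i:i + k], []).append(i)
    hits = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            hits.append((j, i))  # (position in query, position in target)
    return hits

print(shared_words("ACGTA", "TACGT"))  # -> [(0, 1), (1, 2)]
```

Diagonals with many word hits mark the regions worth aligning carefully; everything else is skipped.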
FASTA is also known for the FASTA file format, which has become widely used in the bioinformatics domain. It was developed by William R. Pearson and David J. Lipman in 1985.
To obtain the list of proteins homologous to the target protein, the program compares the target query sequence with all the sequences of known structures in the PDB using the two matrices below:
• A residue exchange matrix
• An alignment matrix.
Alignment correction
Aligning two sequences of two proteins whose percentage sequence identity is low can be difficult. One can then use other sequences from homologous proteins to find a solution. An example is shown in Fig. 5a, b: it is nearly impossible to align the sequence LTLTLTLT with the sequence YAYAYAYAY directly, so we find a third sequence, TYTYTYTYT, which can easily be aligned with both of them. In Fig. 5, model 2 is correct, because it leads to a smaller gap, compared to the huge hole associated with alignment 1.
4 Loop Modelling
After the sequence-alignment step, insertions and deletions often create regions that lead to gaps in the alignment. Loop modelling is used to fill in these gaps, but it is comparatively less accurate. Two main techniques are used to approach the problem:
• The database searching method: this involves going through the database of known protein structures, finding loops and superimposing them onto the two stem regions of the target protein. Specialized software exists for this task, such as FREAD and CODA.
• The ab initio method: various loops are generated randomly, and the method then searches among them for a loop with reasonably low energy and with ϕ and ψ angles that fall in the allowed regions of the Ramachandran plot.
• Side-chain modelling: this is important because it is used to evaluate the protein–ligand interactions at the active sites and the protein–protein interactions at the contact interface.
By searching every possible combination of the torsion angles of a side chain, we can select the one which has the lowest interaction energy with the neighbouring atoms. A rotamer library can also be used, which contains the favourable side-chain torsion angles extracted from known protein crystal structures.
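An exhaustive search of this kind can be sketched as follows (the three rotamer values per angle are typical chi-angle choices; the scoring function is a made-up stand-in, not a real interaction energy):

```python
import itertools

# Enumerate every combination of discrete rotamer values for three
# side-chain torsion angles and keep the combination with the lowest score.
def score(angles):
    # Toy stand-in for an interaction energy, minimized at chi = 60 degrees.
    return sum((a - 60) ** 2 for a in angles)

ROTAMERS = (-60, 60, 180)  # common chi-angle rotamer values, in degrees
best = min(itertools.product(ROTAMERS, repeat=3), key=score)
print(best)  # -> (60, 60, 60)
```

The combinatorial cost grows as 3^n with the number of angles, which is why real packages prune with rotamer libraries rather than enumerate blindly.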
Model Optimization
Model optimization adjusts the relative positions of the atoms so that the overall conformation of the molecule has the lowest possible potential energy. The goal of this step is to relieve steric collisions without altering the overall structure. Another way of optimizing the model is molecular dynamics simulation, in which the atoms are moved towards a global minimum by applying various simulation conditions such as heating, cooling and the inclusion of water molecules. This gives a better chance of finding the true structure.
Model Validation
Advantages
• It can find the location of the alpha carbons of key residues inside the folded protein.
• It can be used to form hypotheses about structure–function relationships.
• Mutagenesis experiments can be guided using it.
• The putative active sites, binding pockets and ligands can be identified from the positions of the conserved regions of the protein surface.
Disadvantages
5 Use Case
Fig. 7 Starting a modelling project and giving protein P61851 as the input [5]
When we give the UniProt code, the server automatically pulls the protein sequence from UniProt. The next step of homology modelling is to find good templates for our sequence, which we can do by clicking the "Search For Templates" button. The SWISS-MODEL server uses BLAST and HHblits to find suitable templates; the template search takes less than 15 min.
Once the template search is done, we land on a page showing the templates found for our protein, sorted by template score. We can also use the coverage row to select our templates and to learn a lot about the protein. Identity shows how many amino acids are exactly aligned; the higher the identity, the better. Method gives the method by which the template structure was determined, along with the template's resolution. We must make sure to choose a template with high resolution (Fig. 8).
Subsequently, select the templates to be used for modelling by clicking the check boxes, then click the "Build Model" button to build the models (Fig. 9).
After the model has finished building, we can view it (Fig. 10).
Here, we can see all the details of our model. The GMQE values are updated compared to the values shown on the template selection page, because they now also take the QMEAN value into account; SWISS-MODEL uses the QMEAN score for quality estimation.
Homology Modelling Software (see Table 1).
Threading
Threading, which is also known as fold recognition, refers to a sequence–structure alignment approach used for proteins which do not have structural homologues. It matches a protein sequence to a fold library using a so-called threading algorithm that assigns each amino acid to a position on the three-dimensional structure of a particular fold class available in the PDB. Energy minimization or molecular dynamics is then used for structural refinement and selection of the best model.
It is similar to homology modelling in the sense that both approaches try to build a structural model using experimentally solved structures as templates. But homology modelling is only possible when the 3D structure of a homologous protein is known; when it is not, the path we take is threading (Fig. 11).
From Fig. 12, we can see the growth of the Protein Data Bank entries: from the time the X-ray crystallographic analysis of myoglobin was done, there has been a continuous submission of protein structure coordinates, and over the last couple of decades there has been a steep increase, reaching a staggering number of more than 140,000 structures. At the same time, if we look at the unique folds a protein can have (Fig. 8), their number is not proportional to the number of structural coordinates being submitted.
For example, from Fig. 12 we can see that in the year 2010 we had about 70,000 structures in the PDB, whereas Fig. 8 shows 1393 folds in 2010. In the year 2015, with more than 100,000 structures, the number of folds remained the same, 1393.
Threading Steps (see Fig. 13).
Fig. 13 Threading modelling steps
Threading alignment:
After the scoring function has been designed in the previous step, it has to be optimized so that the target sequence can be aligned with each of the structure templates. This is a significant task in every threading-based structure prediction program that takes pairwise contact potentials into account; alternatively, a dynamic programming algorithm can be used for this task.
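A dynamic-programming alignment of this sort can be sketched in miniature (a Needleman–Wunsch-style global alignment score with toy match/mismatch/gap values; real threading scores add environment and contact terms):

```python
# Global alignment score by dynamic programming: dp[i][j] is the best score
# for aligning the first i letters of a with the first j letters of b.
def align_score(a, b, match=1, mismatch=-1, gap=-1):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, len(b) + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[len(a)][len(b)]

print(align_score("GATTACA", "GCATGCU"))  # -> 0
```

Tracing back through the dp table (not shown) recovers the alignment itself, which in threading maps target residues onto template positions.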
Threading prediction:
We choose the threading alignment that is statistically most likely as the threading prediction. A structural model is then constructed by taking the selected structural template and placing the backbone atoms of the target sequence at the aligned backbone positions of the template.
Ab Initio Modelling
Ab initio means "from the beginning": the method predicts the structure of the protein from scratch, using only sequence information, by seeking a thermodynamically stable, native-like state. It is very useful in situations where the query sequence has no suitable template in the protein structure library for homology modelling. Here we also apply what is called Anfinsen's theory.
Anfinsen's theory states that the protein's native structure corresponds to the state with the lowest free energy of the protein–solvent system and that this information is contained in the amino acid sequence. The basic idea is to build an empirical function that simulates the real physical forces and the potentials of possible chemical contacts within the protein.
The methodology of ab initio modelling depends heavily on Anfinsen's theory, which states that all the information needed for protein folding is contained in the sequence itself and that, therefore, the cell machinery is immaterial to the folding; this holds true for most small globular proteins. The protein's native structure corresponds to the state with the lowest free energy of the protein–solvent system. The protein energy landscape is very uneven and consists of many local minima, in which partially folded proteins with relatively higher energy are found. The global minimum representing the native structure lies deep down the energy funnel and is reached by passing through several molten-globule states, the intermediate conformational states between the unfolded state and the complete native state of the globular protein: at the very top of the funnel is the unfolded protein sequence, in between are the intermediates called molten-globule states, and at the bottom is the well-folded protein, the global minimum, which is very close to the native structure. Having understood that proteins are not just amino acids strung together by peptide bonds but hold vital information for the final protein structure, software programs are designed to use certain basic principles and concepts of physics to predict the three-dimensional structure of proteins (Fig. 14).
In physics-based force fields, interactions between atoms are modelled from quantum mechanics and the Coulomb potential, with only a few fundamental parameters, such as the electron charge and the Planck constant, pertaining to quantum mechanics; these are combined into force fields having a large number of selected atom types. Well-known examples of such all-atom physics-based force fields include AMBER, CHARMM, OPLS and GROMOS96.
Solvation energy reflects the chosen solvation treatment: implicit solvation models, such as the generalized Born (GB) model and surface-area-based models, or explicit solvation models such as TIP3P.
Energy potentials contain terms associated with bond lengths, bond angles, torsion angles, van der Waals interactions and electrostatic interactions.
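To make these terms concrete, here is a minimal sketch of how such bonded and non-bonded contributions add up. The functional forms (harmonic bond, 12-6 Lennard-Jones, Coulomb) are the standard textbook ones, but every parameter value below is invented for illustration and not taken from AMBER, CHARMM or any real force field:

```python
def bond_energy(r, r0=1.53, k=300.0):
    # Harmonic bond-stretch term: E = k * (r - r0)^2
    return k * (r - r0) ** 2

def lennard_jones(r, epsilon=0.2, sigma=3.5):
    # Van der Waals term modelled as a 12-6 Lennard-Jones potential
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def coulomb(r, q1, q2, k_e=332.06):
    # Electrostatic term between two partial charges q1, q2 at distance r
    return k_e * q1 * q2 / r

def total_energy(r_bond, r_pair, q1, q2):
    # A force-field energy is a sum of such bonded and non-bonded terms
    return bond_energy(r_bond) + lennard_jones(r_pair) + coulomb(r_pair, q1, q2)

print(total_energy(1.55, 4.0, 0.3, -0.3))
```

Real force fields sum terms of this kind over every bond, angle, torsion and atom pair in the protein, with parameters fitted per atom type.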
Knowledge-based statistical functions refer to statistical energy potentials whose empirical energy terms are derived from protein structures already reported in the Protein Data Bank. They involve less computational cost; one of the best-known software packages using this technique is I-TASSER. The methodology can be sequence-independent, using atomic interaction potentials, hydrogen-bond potentials, torsion-angle potentials and solvation potentials, or sequence-dependent, using pairwise residue contact potentials, distance-dependent atomic contact potentials and secondary structure propensities.
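The idea behind these statistical potentials is the inverse-Boltzmann relation, E = -kT ln(P_obs/P_ref): residue pairs seen in contact more often than a reference expectation get favourable (negative) energies. A minimal sketch, with invented contact counts standing in for real PDB statistics:

```python
import math

KT = 0.593  # kT in kcal/mol at roughly 298 K

def statistical_potential(observed, reference):
    # Inverse Boltzmann: E(a, b) = -kT * ln(P_obs(a, b) / P_ref(a, b))
    energies = {}
    total_obs = sum(observed.values())
    total_ref = sum(reference.values())
    for pair in observed:
        p_obs = observed[pair] / total_obs
        p_ref = reference[pair] / total_ref
        energies[pair] = -KT * math.log(p_obs / p_ref)
    return energies

# Hypothetical residue-pair contact counts "observed in known structures"
obs = {("LEU", "ILE"): 900, ("LYS", "GLU"): 700, ("LYS", "ARG"): 100}
ref = {("LEU", "ILE"): 500, ("LYS", "GLU"): 500, ("LYS", "ARG"): 500}
e = statistical_potential(obs, ref)
# Over-represented pairs (e.g. hydrophobic contacts) come out favourable
print(e)
```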
• Conformational search: identify the global minimum energy state using energy minimization, Monte Carlo, simulated annealing, MD or genetic algorithms.
The protein energy landscape is generally rugged; it contains many local energy minima and maxima, along with intermediate saddle points, but only one global minimum. The saddle points correspond to the transition states, the barriers that all the molecules must cross if they are to fold to the native state; these are the points where the gradient of the energy surface becomes zero.
In order to find a valid ab initio structure model of a protein, it is important to have a powerful conformational search method which, for a given energy function, can accurately and efficiently find the global minimum energy structure. The different methods one could possibly use are as follows:
1. Energy minimization
It is a molecular/quantum mechanical process to find the lowest energy
conformation of a protein.
Energy minimization methods:
a. First-order minimizations: steepest descent, conjugate gradient
b. Second derivative methods: Newton-Raphson method
c. Quasi-Newton methods.
2. Monte Carlo is an important optimization and sampling technique that does not compute forces but rather compares energies, using Boltzmann probabilities to accept or reject moves.
3. Molecular dynamics simulations aim to understand the properties of assemblies of molecules and the atomic-level interactions between them. Software: CHARMM and NAMD.
Model selection: choose the best native-like structure from a pool of decoy structures.
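As a toy illustration of the Monte Carlo search listed above, the sketch below applies the Metropolis criterion to a one-dimensional "energy landscape" with a local and a global minimum; uphill moves are accepted with Boltzmann probability, which lets the search escape the local minimum. The landscape, temperature and step size are all invented for illustration:

```python
import math
import random

def energy(x):
    # Toy 1-D landscape: local minimum near x = -1,
    # global minimum near x = 2 (the -0.5*x term tilts the funnel)
    return (x + 1) ** 2 * (x - 2) ** 2 - 0.5 * x

def metropolis(steps=50000, temperature=1.5, seed=0):
    rng = random.Random(seed)
    x = -1.0                                # start trapped in the local minimum
    best_x, best_e = x, energy(x)
    for _ in range(steps):
        x_new = x + rng.uniform(-0.5, 0.5)  # propose a small random move
        d_e = energy(x_new) - energy(x)
        # Metropolis criterion: always accept downhill moves,
        # accept uphill moves with Boltzmann probability exp(-dE/T)
        if d_e <= 0 or rng.random() < math.exp(-d_e / temperature):
            x = x_new
        if energy(x) < best_e:
            best_x, best_e = x, energy(x)
    return best_x, best_e

best_x, best_e = metropolis()
print(best_x, best_e)  # the best state found lies near the global minimum at x ~ 2
```

Simulated annealing is the same loop with the temperature gradually lowered as the search proceeds.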
Irrespective of the conformational search method used, we arrive at a number of possible decoy structures and then need to choose the final one; the methods that can be used for this are based on energy or on clustering.
Energy-based methods use specific potentials to identify the lowest-energy state, as for example in the software ASTRO-FOLD. In clustering methods, the cluster-centre conformation of the largest cluster is considered the one closest to the native structure; the software that best exemplifies the clustering method is Rosetta.
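The clustering idea can be sketched as follows: count, for each decoy, how many other decoys lie within a distance cutoff, and return the decoy with the most neighbours as the centre of the largest cluster. Here decoys are reduced to toy one-dimensional "coordinates" and plain absolute difference stands in for RMSD:

```python
def cluster_select(decoys, cutoff=1.0):
    """Return the decoy with the most neighbours within `cutoff`.

    This greedy rule approximates picking the centre of the largest
    cluster, as clustering-based model selection does with RMSD.
    """
    best, best_count = None, -1
    for d in decoys:
        count = sum(1 for other in decoys if abs(d - other) <= cutoff)
        if count > best_count:
            best, best_count = d, count
    return best

# Seven hypothetical decoys: a tight cluster around 5.0 plus outliers
decoys = [5.0, 5.2, 4.9, 5.1, 9.0, 1.0, 2.5]
print(cluster_select(decoys))  # -> 5.0, a member of the dense cluster
```

The intuition is that near-native conformations are sampled repeatedly, so the most populated basin is the best bet even when it is not the lowest-energy decoy.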
Below, we can see the steps followed in Rosetta and I-TASSER modelling. These
are the most common and widely used ab initio modelling algorithms.
In Fig. 15, we can see the various steps of the Rosetta protocol. In the first step, the target protein is split into fragments, and fragment structures are drawn from proteins in the PDB that are unrelated to the target; full-length models are then created from these fragments by assembly simulations guided by a knowledge-based force field. The second step is used only for refining the models selected in the first step; the models are refined at the atomic level using a physics-based potential.
In Fig. 16, we see the data flow of the I-TASSER algorithm. First, various threading programs are used to identify templates and super-secondary-structure fragments. Full-length models are reassembled from segments excised from the continuously aligned regions, while the regions left unaligned by threading are built by lattice-based ab initio simulation. The next step is to search the PDB, by structure alignment, for templates similar to those found in the first step; spatial restraints extracted from these templates assist the second-round refinement. In recent times, several developments, such as sequence-based contact prediction and segmental threading, have improved the results of distant-homology modelling (Table 3).
Ab initio Modelling Software (see Table 4).
6 Conclusion
In this chapter, we have discussed various computational methods for protein structure prediction: comparative modelling, threading and ab initio methods. Each of the discussed methods has its own advantages and disadvantages, and the selection of a method is based on data properties, such as mutual information or simple correlation coefficients, that are sufficiently indicative of the relevance of the features.
Table 3 A list of ab initio modelling algorithms along with their energy functions, conformational search methods, model selection schemes and typical CPU time per target

Algorithm            | Force field                  | Search method                        | Model selection        | CPU time per target
AMBER/CHARMM/OPLS    | Physics-based                | Molecular dynamics (MD)              | Lowest energy          | Years
UNRES                | Physics-based                | Conformational space annealing (CSA) | Clustering/free energy | Hours
ASTRO-FOLD           | Physics-based                | CSA/MD                               | Lowest energy          | Months
ROSETTA              | Physics- and knowledge-based | Monte Carlo                          | Clustering/free energy | Days
TASSER/chunk-TASSER  | Knowledge-based              | Monte Carlo                          | Clustering/free energy | Hours
I-TASSER             | Knowledge-based              | Monte Carlo                          | Clustering/free energy | Hours
QUARK                | Physics- and knowledge-based | Monte Carlo                          | Clustering/free energy | Hours
References
1. Stoker HS (2015) Organic and biological chemistry. Cengage Learning, p 371. ISBN 978-1-305-68645-8
2. Brocchieri L, Karlin S (2005) Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Res 33(10):3390–3400. https://doi.org/10.1093/nar/gki615. PMC 1150220. PMID 15951512
3. Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in
proteins. EMBO J 5(4):823–826. PMC 1166865. PMID 3709526
4. Kaczanowski S, Zielenkiewicz P (2010) Why similar protein sequences encode similar
three-dimensional structures? Theoret Chem Acc 125(3–6):643–650. https://doi.org/10.1007/
s00214-009-0656-3
5. https://swissmodel.expasy.org/
6. https://www.rcsb.org/stats/growth/overall
7. http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=fold-cath
8. https://www.researchgate.net/figure/Flowchart-of-the-ROSETTA-protocol_fig1_225193759,
https://www.researchgate.net/figure/Flowchart-of-I-TASSER-protein-structure-modelling_
fig3_225193759
9. Simons KT, Kooperberg C, Huang E, Baker D (1997) Assembly of protein tertiary structures
from fragments with similar local sequences using simulated annealing and Bayesian scoring
functions. J Mol Biol 268(1):209–225
10. Lee J, Wu S, Zhang Y (2009) Ab initio protein structure prediction. In: Rigden DJ (ed) From
protein structure to function with bioinformatics. Springer, Dordrecht
Computational Methods Used
in Prediction of Protein Structure
Poulami Majumder
1 Introduction
Protein is the basic building block of life. It is the key component of the body, responsible for various physiological and biochemical reactions, and an important subject in the bioinformatics field for understanding the biological processes of life. Predicting protein structure is vital for challenges such as drug design, medicinal applications and bio-industrial applications [1]. Efforts towards protein structure prediction have continued for over 25 years, and among the many kinds of approaches taken, computational approaches have become the most popular and useful in recent times [2]. Before looking at the different prediction approaches, we should first give an overview of the structure of protein.
Protein structure is described at four levels (Fig. 1): primary structure, secondary structure, tertiary structure and quaternary structure [3]. The primary structure of a protein is the simple linear arrangement of its amino acid residue sequence [4]. The secondary structure arises from the bonding pattern between the amide hydrogen and carbonyl oxygen atoms along the peptide backbone [5]. There are two main kinds of protein secondary structure, alpha helices and beta strands, in which the amino acid residues are linked to each other by hydrogen bonds. The alpha helix contains about 3.6 amino acid residues per turn, with hydrogen bonds formed between every fourth residue, while in beta strands two portions of the chain, each of 5–10 consecutive amino acids, run in opposite directions. H-bond interactions are formed mostly between adjacent amino acids and short loops
P. Majumder (B)
Department of Biotechnology, Maulana Abul Kalam Azad University of Technology, Kolkata,
West Bengal 700064, India
© Springer Nature Singapore Pte Ltd. 2020 119
K. G. Srinivasa et al. (eds.), Statistical Modelling and Machine Learning
Principles for Bioinformatics Techniques, Tools, and Applications, Algorithms
for Intelligent Systems, https://doi.org/10.1007/978-981-15-2445-5_8
Fig. 1 The four levels of protein structure: primary (amino acid chain sequence), secondary (alpha helix and beta pleated sheet), tertiary (3D folded structure) and quaternary (3D complex structure comprising more than one subunit)
between them [6, 7]. Secondary structure prediction is therefore closely tied to the residue patterns of alpha helices and beta strands, and mainly requires knowledge of the linear amino acid sequence, i.e. the primary protein structure [8]. The arrangement of the amino acid residues, and their size and shape, direct a ligand to fit the protein in a better way.
Three major computational strategies have been developed to predict protein structure, as follows:
• Homology modelling techniques or comparative techniques,
• Protein threading or protein fold recognition and
• Ab initio or de novo techniques.
In Fig. 2, the basic concept of protein structure prediction is illustrated schematically, based on the different protein modelling strategies stated earlier.
Fig. 2 Flowchart of protein structure prediction: protein sequence → database searching → multiple sequence alignment → fold recognition (if no fold is recognized, ab initio structure prediction) → model building → model evaluation (repeated until the model is OK) → end of process
Protein threading is also known as protein fold recognition. In this technique, known proteins with the same fold are used as templates for modelling the target protein
Table 1 Some useful computational methods based on homology modelling techniques (Courtesy
Wikipedia) [17]
Name Description/function
RaptorX One of the most popular methods. It does protein 3D modelling, detection
of remote homology and the prediction of binding site
Biskit It is an open-source software package programmed in Python. It wraps
external programs into automated workflow
ESyPred3D It is an automated homology program that helps to predict template
sequences, alignment and 3D modelling. It is majorly focused on
alignment strategy
FoldX It uses empirical force field to design algorithm for protein structure.
Energy calculations and protein design are built
HHpred It is an open-source software package. Template detection, alignment, 3D
modelling of sensitive protein structure
MODELLER This model is used to build tertiary and quaternary protein structure
Phyre and Phyre2 Free web-based service. It is one of the popular methods which helps in
residues alignment, remote template detection and 3D modelling by using
multiple templates
Prime It works on sequence alignment, secondary structure prediction, homology
modelling, protein refinement, loop-prediction and side chain prediction
Bhageerath-H This platform was established by IIT Delhi and mainly focuses on tertiary
protein structure prediction
SWISS-MODEL This homology modelling is currently most accurate method for protein
structure prediction. It works by finding the local similarity and fragment
assembly
YASARA Detection of templates, hybridization of model fragments, alignment,
ligands and oligomers designing
[18]. Protein threading differs subtly from protein homology modelling: threading targets proteins at the fold level, aligning the sequence to a template structure, whereas homology modelling handles comparatively easier targets, aligning the sequence to a template sequence only [19]. Specific interactions between the amino acids, such as hydrogen bonds, hydrophobic interactions, van der Waals interactions and electrostatic forces, affect protein folding [20]. Almost 1300 different protein folds are known to date, though new folds are discovered each year.
Protein threading is a process that comprises four major steps (Fig. 4). Those are as follows:
1. A library of core fold templates representing the template structures (from the Protein Data Bank).
2. Evaluation of the compatibility between the aligned amino acid sequence and each template fold.
3. A search for the alignment that best optimizes the fit between the target sequence and the template structure.
4. Evaluation of the best match based on statistical significance.
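Step 3, optimizing the target-sequence-to-template alignment, is typically done by dynamic programming. Below is a heavily simplified global-alignment sketch in that spirit: residues and template positions are reduced to two classes each, and an invented compatibility table (hydrophobic residues favoured in buried positions) stands in for real threading potentials:

```python
def thread_score(sequence, template_env, compat, gap=-2):
    # sequence: target residues classified as 'H' (hydrophobic) or 'P' (polar)
    # template_env: template positions classified as 'buried' or 'exposed'
    n, m = len(sequence), len(template_env)
    # dp[i][j] = best score aligning first i residues to first j positions
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + compat[(sequence[i - 1], template_env[j - 1])]
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

# Hypothetical compatibility scores: hydrophobic residues fit buried positions
compat = {("H", "buried"): 2, ("H", "exposed"): -1,
          ("P", "buried"): -1, ("P", "exposed"): 1}
seq = ["H", "H", "P", "P", "H"]
env = ["buried", "buried", "exposed", "exposed", "buried"]
print(thread_score(seq, env, compat))  # -> 8: all five residues match their environments
```

Real threading programs score with much richer potentials (contact, solvation, secondary-structure terms), but the alignment machinery is the same dynamic-programming idea.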
Fig. 4 Schematic representation of protein threading in a simpler way: search for a related template fold → fold assignment (if unsuccessful, ab initio modelling) → alignment of target sequence and template fold → comparative model building → model evaluation → model is OK → end of process
Predicting the nature of a protein's structure from its amino acid sequence alone is a really tough job. We discussed earlier protein homology modelling and threading, the two popular methods for protein structure prediction based on sequence and/or structural fold compatibility [24]. But if homologs do not exist in the available resources, or existing homologs cannot be identified, another way out must be found; in this case, ab initio modelling is used. It can predict comparatively complex protein structures, such as the tertiary structure, directly from the sequence [25].
Table 2 Some useful computational methods based on protein threading (Courtesy Wikipedia)
[17]
Name Method
HHpred Popular protein threading software, it helps in
template fold detection, alignment, 3D
modelling based on pairwise comparison of
hidden Markov models
RaptorX Single and multiple template threading, remote
template detection
Phyre and Phyre2 Remote template fold detection, sequence
alignment, multi-template threading, 3D
modelling
MUSTER It is a protein threading algorithm. It is based on
sequence profile-to-profile alignment and
dynamic programming along with multiple
alignments
BioShell It is a protein threading algorithm. It is using
adjusted profile-to-profile dynamic
programming algorithm shared with predicted
secondary protein structure
SPARKS-X It works statistically, purely based on
probability. It makes the sequence to structure
match of target protein and template protein. 3D
structure modelling done according to the
sequence and structural profiles
Building Blocks Structure Predictor (BBSP) Hybrid template-based
DeepFR Remote template selection, 3D modelling
Table 3 Some useful computational methods based on ab initio modelling (Courtesy Wikipedia)
[17]
Name Method
FALCON It predicts protein structure by position-specific hidden Markov model
through refining the dihedral angles distribution
QUARK It obeys the Monte Carlo method for complementary conformation search
and fragment assembly
I-TASSER One of the popular ab initio modelling. It involves protein threading followed
by fragment structure reassembly
Rosetta@home It designs new proteins by predicting the nature of protein–protein docking
with the help of about sixty thousand active volunteered computers. Rosetta
algorithm has been implemented in distributed-computing system
ROBETTA Combination of ab initio fragment assembly and Rosetta homology
modelling along with Ginzu domain prediction
Bhageerath A computational protocol for modelling and predicting protein structures at
the atomic level
Abalone It aims to predict protein folding and DNA-ligand complexes. It simulates
biomolecules by molecular dynamics folding and molecular graphics
program
2.4 CASP
[Figure: I-TASSER modelling pipeline — protein threading → protein clustering → lowest-energy structure → adding rotamers by Pulchra and Scwrl → final model]
In some CASP experiments, the target sequences have no templates in the PDB, so it is difficult to model them through homology modelling or threading; ab initio modelling can be the best alternative for those cases. CASP provides an overall understanding of how accurately the models reproduce the structures of the target amino acid sequences. Even homology modelling of proteins has been monitored, and improved, through CASP experiments.
Successive CASP experiments have assessed protein structure prediction at different stages and levels. Those are as follows [44–46]. All CASP rounds assess tertiary structure prediction; in addition:
• CASP5 predicted secondary protein structure and also detected disordered regions.
• CASP6 helped in protein function prediction.
• CASP7 assessed model quality and model improvement.
• CASP8 successfully matched full template models with their targets with good accuracy.
• CASP10 evaluated the nature of binding sites.
• CASP ROLL aimed to assess prediction over a larger number of targets.
There are some other protein structure prediction computational methods which
are used in secondary protein structure prediction, transmembrane protein and signal
peptide predictions. In Table 4, those computational methods are enlisted.
Table 4 Computation methods for different kinds of protein structure prediction (Courtesy
Wikipedia) [17]
Name Method description Type
Porter 5 Fast, state-of-the-art ab initio prediction of protein Secondary protein
secondary structure in 3 and 8 classes
RaptorX-SS8 Predicts both 3-state and 8-state secondary structures Secondary protein
by using conditional neural fields from PSI-BLAST
profiles
GOR Information theory/Bayesian inference Secondary protein
Jpred Multiple neural network assignment from Secondary protein
PSI-BLAST and HMMER profiles. Predicts
secondary structure and solvent accessibility
PredictProtein Profile-based neural network Secondary protein
PSIPRED Two feed-forward neural networks that perform Secondary protein
analysis of the PSI-BLAST based output
HyperChem Frequency analysis of amino acid residues observed Secondary protein
in proteins
HCAM Hydropathy Clustering Assisted Method by detection Secondary protein
of physicochemical patterns
HMMTOP Hidden Markov model Transmembrane helix
PHDhtm Multiple alignment-based neural network system Transmembrane helix
SignalP Artificial neural networks and hidden Markov models Signal peptide
3 Conclusion
This chapter provides an overview of the significant computational methods and software available to predict the structure of proteins. Amino acid sequences are the backbone of all protein structure, but it is not enough to know the names of the amino acids alone: the pattern of the sequence, the type of folding, the conformational space, concentration, free energy, pH, active binding sites and so on are also very important for predicting the function and state of a queried protein on a structural basis. Protein structure prediction fills the gap between amino acid sequences and protein structure. The prediction methods are limited to homology modelling, threading and ab initio modelling; nevertheless, their successes have drawn many researchers to proteomics, as the growing number of servers and groups taking part in community-wide prediction quality assessment experiments testifies. Still, more progress is needed in remote sequence detection and in the accurate modelling of novel protein sequences in all aspects.
References
11. Bu Z, Callaway DJ (2011) Proteins MOVE! Protein dynamics and long-range allostery in cell
signaling. Protein Structure and Diseases. Adv Protein Chem Struct Biol 83:163–221. https://
doi.org/10.1016/B978-0-12-381262-9.00005-7
12. Mittag T, Marsh J, Grishaev A, Orlicky S, Lin H, Sicheri F, Tyers M, Forman-Kay JD (2010) Structure/function implications in a dynamic complex of the intrinsically disordered Sic1 with the Cdc4 subunit of an SCF ubiquitin ligase. Structure 18(4):494–506. https://doi.org/10.1016/j.str.2010.01.020
13. Rokde CN, Kshirsagar M (2013) Bioinformatics: protein structure prediction. In: 2013
fourth international conference on computing, communications and networking technologies
(ICCCNT), Tiruchengode, pp. 1–5. https://doi.org/10.1109/icccnt.2013.6726753
14. Zhai Y, Yang B, Wang L, An B (2009) New trend of protein secondary structure prediction. In:
2009 international symposium on intelligent ubiquitous computing and education, Chengdu,
pp 121–124. https://doi.org/10.1109/iuce.2009.9
15. Rost B, Sander C (1999) Third generation prediction of secondary structure. In: Protein structure
prediction: methods and protocols. Humana Press, New Jersey, USA
16. Kaczanowski S, Zielenkiewicz P (2010) Why similar protein sequences encode similar
three-dimensional structures? Theoret Chem Acc 125(3–6):643–650. https://doi.org/10.1007/
s00214-009-0656-3
17. https://en.wikipedia.org/wiki/List_of_protein_structure_prediction_software#Ab_initio_
structure_prediction. Accessed on 23rd June 2019
18. Peng J, Xu J (2011) RaptorX: exploiting structure information for protein alignment by statistical inference. Proteins 79(Suppl 10):161–171. https://doi.org/10.1002/prot.23175
19. Peng J, Xu J (2010) Low-homology protein threading. Bioinformatics 26(12):i294–i300. https://doi.org/10.1093/bioinformatics/btq192
20. Ma J, Wang S, Xu J (2012) A conditional neural fields model for protein threading. Bioinformatics 28(12):i59–i66. https://doi.org/10.1093/bioinformatics/bts213
21. Wu S, Zhang Y (2007) LOMETS: a local meta-threading-server for protein structure prediction.
Nucleic Acids Res 35(10):3375–3382
22. Skolnick J, Kihara D, Zhang Y (2004) Development and large scale benchmark testing of the
PROSPECTOR 3.0 threading algorithm. Protein 56:502–518
23. Bryant SH, Lawrence CE (1993) An empirical energy function for threading protein sequence
through the folding motif. Proteins 16(1):92–112
24. Lee J, Wu S, Zhang Y (2009) Ab initio protein structure prediction. In: Rigden DJ (ed) From protein structure to function with bioinformatics, pp 3–25. Springer, Netherlands
25. Xu D, Zhang Y (2012) Ab initio protein structure assembly using continuous structure
fragments and optimized knowledge-based force field. Proteins 80(7):1715–1735
26. Xu D, Zhang Y (2013) Toward optimal fragment generations for ab initio protein structure
assembly. Proteins 81(2):229–239
27. Thomas PD, Dill KA (1996) Statistical potentials extracted from protein structures: how
accurate are they? J Mol Biol 257(2):457–469
28. Taylor WR, Bartlett GJ, Chelliah V et al (2008) Prediction of protein structure from ideal forms.
Proteins 70(4):1610–1619
29. Pedersen JT, Moult J (1997) Ab initio protein folding simulations with genetic algorithms:
simulations on the complete sequence of small proteins. Proteins 29:179–184
30. Melo F, Sanchez R, Sali A (2002) Statistical potentials for fold assessment. Protein Sci
11(2):430–448
31. Oldziej S, Czaplewski C, Liwo A et al (2005) Physics-based protein-structure prediction using
a hierarchical protocol based on the UNRES force field: assessment in two blind tests. Proc
Natl Acad Sci USA 102(21):7547–7552
32. Lindorff-Larsen K, Maragakis P, Piana S et al (2012) Systematic validation of protein force
fields against experimental data. PLoS ONE 7(2):e32131
33. Freddolino PL, Harrison CB, Liu Y et al (2010) Challenges in protein folding simulations:
timescale, representation, and analysis. Nat Phys 6(10):751–758
34. Zhang Y, Kihara D, Skolnick J (2002) Local energy landscape flattening: parallel hyperbolic
monte carlo sampling of protein folding. Proteins-Struct Funct Genet 48(2):192–201
35. Klepeis JL, Wei Y, Hecht MH et al (2005) Ab initio prediction of the three-dimensional structure
of a de novo designed protein: a double-blind case study. Proteins 58(3):560–570
36. Kryshtafovych A, Barbato A, Monastyrskyy B et al (2015) Methods of model accuracy esti-
mation can help selecting the best models from decoy sets: assessment of model accuracy
estimations in CASP11. Proteins 84:349–369
37. Jayachandran G et al (2006) Using massively parallel simulation and Markovian models to
study protein folding: Examining the dynamics of the villin headpiece. Published online
38. Kmiecik S, Gront D, Kolinski M, Wieteska L, Dawid AE, Kolinski A (2016) Coarse-grained protein models and their applications. Chem Rev 116(14):7898–7936. https://doi.org/10.1021/acs.chemrev.6b00163
39. Roy A, Kucukural A, Zhang Y (2010) I-TASSER: a unified platform for automated protein
structure and function prediction. Nat Protoc 5(4):725–738
40. Fujitsuka Y, Chikenji G, Takada S (2006) SimFold energy function for de novo protein structure
prediction: consensus with Rosetta. Proteins 62(2):381–398
41. Moult J et al (2007) Critical assessment of methods of protein structure prediction—Round
VII. Proteins 69(Suppl 8):3–9. https://doi.org/10.1002/prot.21767
42. Zhang Y, Skolnick J (2005) The protein structure prediction problem could be solved using
the current PDB library. Proc Natl Acad Sci USA 102(4):1029–1034. https://doi.org/10.1073/
pnas.0407152101
43. Qian B et al (2007) High-resolution structure prediction and the crystallographic phase problem.
Nature 450(7167):259–264. https://doi.org/10.1038/nature06249
44. Tress M et al (2009) Target domain definition and classification in CASP8. Proteins 77(Suppl
9):10–17. https://doi.org/10.1002/prot.22497
45. Kryshtafovych A, Monastyrskyy B, Fidelis K (2014) CASP prediction center infrastructure and
evaluation measures in CASP10 and CASP ROLL. Proteins Struct Funct Bioinform 82(Suppl
2):7–13. https://doi.org/10.1002/prot.24399
46. Kryshtafovych A et al (2007) Progress from CASP6 to CASP7. Proteins: Struct Funct Bioinf
69(Suppl 8):194–207. https://doi.org/10.1002/prot.21769
Computational Methods for Inference
of Gene Regulatory Networks from Gene
Expression Data
1 Introduction
When the genes in a cell are expressed, the encoded information directs the assembly of amino acids into proteins. A protein is a building block of the body and controls the proper functioning of various tissues, organs and organ systems; the expression of genes therefore controls the health and homeostasis of an organism. Over the last few years, the advancement of technologies like DNA microarrays and next-generation sequencing has made available to researchers a huge volume of biological data about the working of a human cell, some of it in public repositories on the Internet. Many researchers are using computational methods to derive meaningful information from this data about the health of the cell and the organism. These techniques help scientists integrate data from various sources, look at it from various angles and derive useful insights from it, for example about the diagnosis of a patient.
Gene expression works in two steps. The first is transcription, in which messenger RNA (mRNA), transfer RNA (tRNA) and ribosomal RNA (rRNA) [1] are produced by the action of the enzyme RNA polymerase; the mRNA determines the sequence of amino acids in a protein. The second step is translation, in which mature mRNA acts as a template to put together the various amino acids in the correct sequence to generate a polypeptide inside the ribosomes of a cell. The ribosomes themselves consist of proteins and rRNA. Figure 1 shows the stages of transcription and translation.
Gene expression is regulated in one or more of these ways: regulating the rate of transcription or translation, and regulating the processing and stability of RNA. Figure 2 shows the process of gene regulation in a cell in detail.
Fig. 3 Example gene regulatory network in hybrid rice. This figure is from https://commons.
wikimedia.org/wiki/File:DG_Network_in_Hybrid_Rice.png
138 N. Koul and S. S. Manvi
techniques on biological data sets like gene expression data. In a simplified model of a GRN, genes are the nodes, transcription factors are the inputs to a node, and the output is the expression level of that gene; the transcription factors serving as inputs are themselves the outputs of previous steps in gene expression. The output can be calculated by employing various modelling techniques. For example, in Boolean networks, operators like AND, OR and NOT are applied to the inputs to simulate the effect that the presence of one or more transcription factors has on a gene. Computational or mathematical modelling of gene regulatory networks from gene expression data or histone-modification ChIP-seq data, using techniques like ordinary differential equations, Bayesian networks, Gaussian networks and artificial neural networks [4], not only helps us understand the influential transcripts and their pathways but also lets us use the GRNs to predict cell health.
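The Boolean-network idea can be sketched in a few lines. Here three hypothetical genes update synchronously; gene C switches on only when its activator A is present AND its repressor B is absent. The wiring is invented for illustration:

```python
def step(state):
    # state: dict mapping gene -> bool (expressed or not); synchronous update
    return {
        "A": state["A"],                     # A sustains its own expression
        "B": not state["A"],                 # A represses B (NOT gate)
        "C": state["A"] and not state["B"],  # C needs A present AND B absent
    }

state = {"A": True, "B": True, "C": False}
for _ in range(3):
    state = step(state)
print(state)  # reaches the fixed point {'A': True, 'B': False, 'C': True}
```

Iterating the update rule until the state stops changing finds the network's attractors, which in GRN modelling are interpreted as stable cellular expression states.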
d^2y/dx^2 + 2 dy/dx + y = 0,  dy/dx(0) = 2,  y(0) = 4    (1)

d^3y/dx^3 + 3 d^2y/dx^2 + 5 dy/dx + y = sin x,  d^2y/dx^2(0) = 12,  dy/dx(0) = 2,  y(0) = 4    (2)
The order of an ODE is the order of the highest derivative in it, and its degree is the power of that highest-order derivative. For example, the ODE

x^3 d^3y/dx^3 + x^2 d^2y/dx^2 + x dy/dx + x y = e^x    (3)

has order 3 and degree 1.
And the equation

(dy/dx)^2 + (1 + x^2) dy/dx = sin x    (4)

has order 1, the order of its highest derivative, and degree 2, the power of the highest-derivative term. The derivative of a variable x with respect to another variable t is the rate of change of x with respect to t; it is written dx/dt.
In GRN modelling, x is the concentration of a transcript x in a cell and t is time, so this derivative is the rate of change of the concentration of x over time; here x is the dependent and t the independent variable. The rate of change can be positive (a rate of increase) or negative (a rate of decrease). Assuming the rate of increase of x to be proportional to the value of x, we can write dx/dt = kx, k being a non-zero, positive constant.
The equation
dx
= kx (5)
dt
is known as an ordinary differential equation consisting of first-order derivative of
x with respect to t. When t = 0, the value of x is x_0, the initial condition. If x = 0, then the derivative is 0, and this represents a condition of no change in the expression of the gene or transcript represented by the variable x. This is the trivial solution to the above equation. In case x is not equal to 0, we need to solve the differential equation to obtain the value of x at the next t or a future time value. In general, solving the differential equation means that, knowing the values of k and x_0, we can calculate the value of x at each succeeding t. We need to come up with a function, x(t), whose derivative is kx.
An ordinary differential equation can be solved numerically using Euler's method as follows. For an initial state x_0 and a small discrete step Δt, we calculate the point (t_{k+1}, x_{k+1}) from the preceding point (t_k, x_k) as follows:
Calculate the slope f(t_k, x_k).
Calculate the next point (t_{k+1}, x_{k+1}):

$$t_{k+1} = t_k + \Delta t \tag{6}$$

$$x_{k+1} = x_k + \Delta t \cdot f(t_k, x_k)$$
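A minimal sketch of Euler's method applied to Eq. (5), with illustrative values k = 0.5 and x0 = 1 (chosen here only for demonstration):

```python
# Euler's method for dx/dt = k*x (Eq. 5): step the state forward by
# x_{k+1} = x_k + dt * f(t_k, x_k), t_{k+1} = t_k + dt.

def euler(f, x0, t0, t_end, dt):
    """Integrate dx/dt = f(t, x) from t0 to t_end with step dt."""
    t, x = t0, x0
    points = [(t, x)]
    while t < t_end - 1e-12:
        x = x + dt * f(t, x)   # x update from the current slope
        t = t + dt             # advance time by one step
        points.append((t, x))
    return points

k = 0.5
traj = euler(lambda t, x: k * x, x0=1.0, t0=0.0, t_end=1.0, dt=0.1)
# The exact solution is x(t) = x0 * exp(k*t); Euler approximates it,
# and the approximation improves as dt shrinks.
```

Shrinking dt trades computation for accuracy; more elaborate schemes (e.g. Runge–Kutta) achieve better accuracy per step.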
In order to model regulatory relations between two genes using ODEs, let us assume that m1 and m2 are the concentration levels of messenger RNA1 and messenger RNA2, respectively, let k1 and k2 be two constants indicating the rates of production of the messenger RNAs, and let y1 and y2 be two constants indicating the rates of degradation of the mRNAs. The values of k1, k2, y1 and y2 are all constant. The mutual inhibitory regulation between gene 1 and gene 2 can be modelled with equations of the form

$$\frac{dm}{dt} = (\text{regulated transcription value}) - (\text{degradation value})$$

The rate of change of other products of genes is

$$\frac{dp}{dt} = (\text{translation value} + \text{diffusion value}) - (\text{degradation value})$$
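Since the chapter's specific equations for the two genes are not reproduced here, the sketch below assumes a standard Hill-type mutual-repression form for the regulated transcription term; the parameter values are illustrative only.

```python
# Two mutually inhibiting genes, integrated with Euler's method.
# Assumed (not from the chapter): Hill-type repression,
#   dm1/dt = k1 / (1 + m2**n) - y1*m1   (gene 2 represses gene 1)
#   dm2/dt = k2 / (1 + m1**n) - y2*m2   (gene 1 represses gene 2)
# k1, k2 are production rates; y1, y2 are degradation rates.

def simulate(m1, m2, k1=4.0, k2=4.0, y1=1.0, y2=1.0, n=2, dt=0.01, steps=2000):
    for _ in range(steps):
        dm1 = k1 / (1.0 + m2 ** n) - y1 * m1
        dm2 = k2 / (1.0 + m1 ** n) - y2 * m2
        m1 += dt * dm1
        m2 += dt * dm2
    return m1, m2

# Starting with gene 1 expressed, the system settles with gene 1 high
# and gene 2 repressed (a toggle-switch behaviour).
hi, lo = simulate(m1=2.0, m2=0.0)
```

With sufficiently strong production rates the system is bistable: whichever gene starts ahead wins, which is the qualitative behaviour expected of mutual inhibition.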
The classical method to solve a general linear ordinary differential equation with constant coefficients is as follows:

$$\frac{d^n y}{dx^n} + k_n \frac{d^{n-1} y}{dx^{n-1}} + \cdots + k_3 \frac{d^2 y}{dx^2} + k_2 \frac{dy}{dx} + k_1 y = F(x) \tag{11}$$
Its general solution y has two components, a homogeneous part y_H and a particular part y_P:

$$y = y_H + y_P \tag{12}$$

y_H, when substituted into the left-hand side of the equation, gives zero, so it is a solution of the equation

$$\frac{d^n y}{dx^n} + k_n \frac{d^{n-1} y}{dx^{n-1}} + \cdots + k_3 \frac{d^2 y}{dx^2} + k_2 \frac{dy}{dx} + k_1 y = 0 \tag{13}$$
The above equation can be written as

$$D^n y + k_n D^{n-1} y + \cdots + k_2 D y + k_1 y = 0 \tag{14}$$

where

$$D^n = \frac{d^n}{dx^n}, \qquad D^{n-1} = \frac{d^{n-1}}{dx^{n-1}}$$
Equation (14) can be factorized in terms of the roots r_n, r_{n−1}, ..., r_2, r_1 of its characteristic polynomial, where C_n, C_{n−1}, ..., C_2, C_1 below are constants. A factor with root r_1 gives

$$(D - r_1)y = 0 \tag{16}$$

$$\frac{dy}{dx} = r_1 y \quad \text{or} \quad \frac{dy}{y} = r_1\,dx$$

$$\ln y = r_1 x + c$$

$$y = c\,e^{r_1 x} \tag{17}$$

For a root r_1 repeated twice,

$$(D - r_1)(D - r_1)y = 0$$

If

$$(D - r_1)y = z$$

then

$$(D - r_1)z = 0$$

$$z = C_2 e^{r_1 x} \tag{19}$$

The solution of $(D - r_1)y = C_2 e^{r_1 x}$ is

$$y = (C_1 + C_2 x)e^{r_1 x}$$

Therefore, a repeated root contributes a factor of x to the corresponding term of the homogeneous solution.
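As a worked check of this method, Eq. (1) has a characteristic polynomial with a repeated root:

```latex
% Eq. (1): y'' + 2y' + y = 0 with y(0) = 4, y'(0) = 2.
(D^2 + 2D + 1)\,y = 0 \;\Longrightarrow\; (D + 1)^2 y = 0 \;\Longrightarrow\; r_1 = r_2 = -1
% Repeated root, so the homogeneous solution carries a factor of x:
y = (C_1 + C_2 x)\,e^{-x}
% Applying the initial conditions:
y(0) = C_1 = 4, \qquad y'(0) = C_2 - C_1 = 2 \;\Longrightarrow\; C_2 = 6
% Hence
y = (4 + 6x)\,e^{-x}
```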
A recurrent neural network (RNN) is a type of neural network in which the nodes are connected in the form of a directed cyclic or acyclic graph. The graph directions follow a temporal sequence. While an acyclic graph can be unfolded to see the involved states, a cyclic graph cannot be unfolded. Certain RNNs also have a gated memory unit; such RNNs are known as long short-term memory (LSTM) networks [8]. The nodes at each layer of an RNN are directionally connected to each node in the next layer. The activation function is non-linear, and the weights are real-valued. The input values, in the form of input vectors, are fed into the input nodes one by one, and a hidden node computes its activation by applying a non-linear activation function to the weighted sum of the activations of all its input units and compares it with the target activations available at some of the output nodes.
Since gene regulation is a temporal sequence of interactions, involving more than one gene influencing each other in time, we can model expression regulation using RNNs [9]. A node in the RNN models a gene; the edge weights represent the regulatory influence between the gene at the source of an edge and the gene at its destination. The values of the weights of the edges in a particular layer of this RNN represent the expression values of the genes at the sources of the edges at one point in time t_i.
At time t_{i+1} = t_i + dt, the expression level of a gene is determined by the expression levels of all genes that have an incoming edge to this gene at t_i, i.e.

$$\text{Expression of Gene}_i = \sum_{j=1}^{n} w_{ij} x_j + B_i$$

B_i is the delay parameter; the higher the B_i, the less the influence of the weight w_{ij} on the expression of gene i. The aim of training the RNN is to find optimal values of the weights and the delay parameter so as to minimize the mean square error between the actual time-
series data and the regenerated data. This error is the mean of the squared differences between the measured expression values and the values regenerated by the network.
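A single update step of this RNN model can be sketched in Python; the weights, delay parameters, and sigmoid activation below are illustrative choices, not values from the chapter.

```python
# One update step of the RNN model of a GRN: gene expression at t+dt is a
# weighted sum of all gene expressions at t plus a delay parameter B_i,
# passed through a non-linear (sigmoid) activation. Values are made up.
import math

def rnn_step(x, W, B):
    """x: expression levels at time t; W[i][j]: influence of gene j on
    gene i; B: delay parameters. Returns expression levels at t + dt."""
    n = len(x)
    out = []
    for i in range(n):
        s = sum(W[i][j] * x[j] for j in range(n)) + B[i]
        out.append(1.0 / (1.0 + math.exp(-s)))  # sigmoid activation
    return out

W = [[0.0, 1.5], [-2.0, 0.0]]  # gene 2 activates gene 1; gene 1 represses gene 2
B = [-0.5, 0.5]
x = [0.9, 0.1]
print(rnn_step(x, W, B))
```

Training would adjust W and B (e.g. by gradient descent) to minimize the mean square error against the measured time series.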
In a Bayesian network, the joint probability distribution over variables X_1, ..., X_N factorizes over the parents Pa(X_i) of each node:

$$P(X_1, X_2, X_3, \ldots, X_N) = \prod_{i=1}^{N} P\left(X_i \mid \mathrm{Pa}(X_i)\right)$$
Directed acyclic graphs can easily express the probability distributions of this
nature.
Therefore, we can formally describe a Bayesian network as a directed graph G(V, E) with two components:
i. A set of nodes X_i, where X_i belongs to V.
ii. A set of edges E, with a table of conditional probability distributions for each node linking parent nodes to child nodes.
Consider an example where we wish to diagnose whether a person is suffering from pneumonia or a common cold, represented by the symbols P and C, respectively. We have a set of symptoms to consider, e.g. runny nose, headache and hot flashes; let us denote these symptoms by the symbols N, H and F, respectively. Let us also take into consideration whether the patient has had pneumonia within the last one year, represented by the variable R.
Then, the Bayesian network for this diagnosis can be drawn as shown in Fig. 5.
Applying Bayesian Networks for Reconstruction of Gene Regulatory Networks
In order to search for a network structure in the given set of genes whose expression values are given as input, we first need to use some information-theoretic measure, like mutual information, conditional mutual information or Kullback–Leibler (KL) divergence, to identify correlations and non-linear dependencies among pairs of genes or groups of genes [13, 14].
A generalized algorithm for using a Bayesian network to infer GRNs is as follows:
i. Identify the most relevant genes for the problem being investigated by following
a feature selection method.
Fig. 5 A simplified Bayesian network to show the probability of pneumonia or a common cold
ii. Calculate conditional dependencies among the genes in this selected group using
Bayes’ rule and draw the initial network.
iii. From the calculated dependency values, treat the ones above a threshold value as valid, discard the ones below the threshold, and reconstruct the network with only the valid dependencies.
Many researchers have used supervised and unsupervised machine learning
methods to calculate the threshold and reconstruct the network.
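Steps ii and iii above can be sketched as follows. A Pearson-correlation score stands in for the information-theoretic dependency measure, and the expression matrix and threshold are invented for illustration.

```python
# Sketch of steps ii-iii: score pairwise dependencies between genes and
# keep the edges above a threshold. A correlation score substitutes for
# the conditional-dependency calculation; data and threshold are made up.

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def infer_edges(expr, threshold=0.8):
    """expr: {gene: [expression values across samples]}. Returns the gene
    pairs whose absolute pairwise score exceeds the threshold (step iii)."""
    genes = sorted(expr)
    return [(g1, g2) for i, g1 in enumerate(genes) for g2 in genes[i + 1:]
            if abs(pearson(expr[g1], expr[g2])) > threshold]

expr = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.2], "g3": [5, 1, 4, 2]}
print(infer_edges(expr))
```

A real implementation would use mutual information (as in the chapter) and would orient the retained edges with conditional-probability tables.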
4 Conclusion
References
1. Godini R, Karami K, Fallahi H (2019) Genome imprinting in stem cells: a mini-review. Gene
Expr Patterns 34:119063. ISSN 1567-133X. https://doi.org/10.1016/j.gep.2019.119063
2. Dewey GT, Galas DJ (2010) Gene regulatory networks. In: Madame curie bioscience database
(Landes Bioscience, Austin, TX). https://www.ncbi.nlm.nih.gov/books/NBK5974/
3. Chan TE, Stumpf MPH, Babtie AC (2017) Gene regulatory network inference from single-cell
data using multivariate information measures. Cell Syst 5:251–267.e3. https://doi.org/10.1016/
j.cels.2017.08.014
4. Huang J, Shimizu H, Shioya S (2003) Clustering gene expression pattern and extracting
relationship in gene network based on artificial neural networks. J Biosci Bioeng 96(5):421–428
5. Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ (2018) Next-generation
machine learning for biological networks. Cell 173:1581–1592. https://doi.org/10.1016/j.cell.
2018.05.015
1 Introduction
What does it mean for a machine to learn? In other words, what is machine learning?
Machine learning is the process of making a computer figure out, as closely as
possible, the mathematical relationship between a set of inputs and a set of outputs
for a problem under consideration, i.e. the machine tries to arrive at the best function
f (), also known as a model, between input dataset X and output set Y . Input set X
may consist of many data points x1 , x2 , . . . , xn , where each xi is a value of any data
type.
All members x_i of X are known as features of the input set; however, it has been seen that not all x_i influence the output equally, i.e. there is a subset of features in X that is sufficient to arrive at the output Y. The remaining features are irrelevant and can be ignored for the purpose of learning the relationship between X and Y. We identify the subset of relevant features and use only them to establish the relation between X and Y. This reduces the computational complexity of learning the model. Because we drop the irrelevant features, the chances of overfitting in the model are also reduced.
The process of identifying the smallest and most relevant subset of features that are
sufficient to establish a valid relation between inputs and outputs for a problem is
known as feature selection.
In the field of bioinformatics and computational biology, medical data such as gene expression data is an indicator of the function and state of all genes in an organism's genome and can be used to diagnose a disease like cancer. However, for the problem of diagnosing cancer, only a few genes out of the total genome of thousands of genes are relevant; therefore, it is imperative to identify these few genes from the expression of the entire genome, and then use their expression values to classify a patient as having the disease or not.
Gene expression data [1] is the numerical data corresponding to the expression levels of genes in the cells of an organism. It is obtained using DNA microarray technology, which involves the hybridization of an mRNA (a subcellular component) molecule to the DNA template. The amount of mRNA bound to each site on the array indicates the expression level of the genes in the sample. This data is known as the gene expression profile of the cell. It can reveal the presence of diseases and even the possibility of occurrence of certain diseases at the transcriptome level. A study of these profiles can also help in the development of drugs to target disease-causing genes and their action pathways. In addition to DNA microarray technology, we now have an advanced technology called next-generation sequencing or high-throughput sequencing, a process of determining the sequence of nucleotides in a particular portion of DNA or RNA. It can reveal abnormal or out-of-sequence nucleotides and hence indicate an abnormal bodily condition or a disease. In this chapter, we focus on the use of machine learning for feature selection from gene expression datasets.
A prominent characteristic of gene expression datasets is their very high dimensionality, of the order of the thousands of genes present in the genome of an organism, whereas the sample size is very small. Noise and redundancy further make preprocessing of the data mandatory. To counter the curse of dimensionality, feature selection is performed on the gene expression data to identify the smallest subset of genes that are accurate and sufficient markers of the disease or condition under study. Identification of such genes not only reduces the computational complexity of the entire machine-learning effort to classify or cluster the profiles but also helps in the discovery of accurate drugs and their targets in order to treat or cure diseases like cancer.
Machine-learning approaches are used to identify differentially expressed genes that are markers of a disease, to eliminate redundant genes, and to identify the most relevant genes.
Prominent Applications of Gene Expression Data
i. Identification of new genes in the genome: genes whose functionality may have been yet unknown can be uncovered during the analysis of gene expression data.
ii. As discussed, analysis of gene expression data can reveal presence of a disease
like cancer.
iii. Analysis of genes in a diseased and a control cell can help in understanding molecular pathways and mechanisms. This knowledge can be used for the synthesis of targeted drugs which can control harmful overactivity of a gene, or can promote underactive genes to work better for the health of the cell.
iv. Gene expression data also can be used to research and study the effect of various
environmental factors, toxins, stress, etc., on gene activity and thereby on health
of a cell.
3 Feature Selection
In the field of machine learning, feature selection [2] is the step that involves the elimination of attributes or features that are noisy, irrelevant or redundant for the classification of data samples within a given dataset. This elimination is often done on the basis of a score, rank or weight that represents the relevance of the feature. Feature extraction [3] is another technique of dimensionality reduction, which involves the combination or transformation of basic features into more complex features in higher dimensions using methods like principal component analysis. Both feature selection and feature extraction prepare a reduced set of the most relevant features that capture the essential properties of the data and will correctly classify a given dataset. This classification can be done in a supervised as well as an unsupervised way. Supervised learning uses datasets with output class labels present, while unsupervised learning uses data with no output class labels available. In supervised learning, a model is computed which represents the equation between the pattern in the input feature values and the output class labels. The input feature values are called training data. These training samples have their actual class labels present in the data. After training, test samples are input to the model; the test samples do not have class labels, and the model tries to predict the label for each sample in the test data based on the pattern it has learnt from the relation between sample values and output labels seen in the training data. The ratio of correctly classified samples to the total number of seen samples is called the classification accuracy of a classifier. Therefore, the correct selection of effective features is as important for higher classification accuracy as the method used for classification.
Feature selection, therefore, is a set of operations which, when performed on a dataset with n features, returns a subset of k features, such that k < n, and these k features are a sufficient and relevant subset of features for the machine-learning problem under consideration to be solved. The literature classifies feature selection methods into three categories: filter methods, wrapper methods and embedded methods. Feature selection is a well-researched area in machine learning, and improved algorithms based on evolutionary and bio-inspired computing continue to be introduced to improve the performance and accuracy of feature selection.
i. From the set of input features, X, with cardinality n, pick a subset of minimum
possible size.
ii. Evaluate the performance of this subset by determining the relation between the
inputs in this subset and the outputs.
iii. Use cross validation on test set and validation set of inputs to establish the
relationship.
iv. Now pick a subset with cardinality 1 greater than the previous subset and repeat
steps ii to iv till we find the subset with best performance.
These methods give the best subsets of features at the cost of additional
computational complexity due to the need for retraining of classifier for each subset.
These methods can search for the probable subsets of features using one of three search techniques:
i. Exhaustive search techniques like breadth-first search and depth-first search.
ii. Heuristic search techniques like forward selection, which starts from an empty subset and recursively adds a feature that optimizes the fitness function, or backward selection, which starts with the full feature set of the original data and drops one feature at a time while optimizing the fitness function. A bidirectional search uses forward and backward search simultaneously; when both FS and BS arrive at the same subset of features, the search stops. Recursive feature elimination (RFE) and Plus-L Minus-R (LRS) are examples of heuristic search techniques. Other examples include genetic algorithms, which treat all possible subsets (chromosomes) of the original feature set as the starting population of possible solutions to the feature selection problem; a set of operators is applied to this initial population to evaluate its fitness. The subsets with higher fitness act as the population for the next iteration of the algorithm, the next generation. If no subset qualifies for the next generation, new subsets are produced by recombination of subsets from the previous generation. Crossover and mutation functions are primarily used to produce the next generation of chromosomes till an optimal set is obtained. GA can be combined with other feature selection methods to generate the initial population of subsets.
iii. Random search techniques: a subset of features is randomly generated, and any of the above techniques can be applied to the subsets to arrive at optimal subsets.
In wrapper methods, a validation set is often used to compute validation accuracy using techniques like k-fold cross validation. Wrapper methods are slower than filter methods, have a tendency to overfit, and their results have high variance.
Embedded methods have feature selection integrated into the classification algorithm. A classification algorithm is used to evaluate the performance of subsets of selected genes, but the classifier in this case is part of the feature selection algorithm itself and not a separate layer. These methods have a built-in classifier and are, therefore, specific to the learning problem, e.g. the support vector machine algorithm with linear kernels; these approaches also use a penalization term to differentiate the features. Lasso with a penalty is also an example of the embedded approach, and another embedded algorithm is the decision tree. In this algorithm, a feature subset is selected in each step of the growth of the tree by dividing a bigger subset into smaller subsets. The more informative a feature, i.e. the greater its relevance, the greater the number of child nodes in its subtree belonging to the same output class. Examples of tree-based algorithms for feature selection are CART, ID3, C4.5, etc.
Thus, we can say that in embedded methods feature selection forms part of the training of the predictive model, such as an artificial neural network, decision tree or support vector machine. These methods often use backward elimination to remove irrelevant features, along with regularization terms. Embedded methods are slower and tend to over-fit in the absence of large amounts of data, but they perform better than filter methods in the presence of large amounts of input data.
There is also a technique of joining wrapper and embedded methods together, known as ensemble learning. It combines the benefits of both constituent methods and provides robust feature selection with less variance. Examples are bagging, random forest, gradient tree boosting, etc.
A subset of features is selected from among the full feature set of the input data by evaluating the fitness of the various possible subsets of features. There are various measures to compute this fitness. The fitness of a feature decides its relevance to the objective function. Interclass distance, entropy, information gain and probability of error are some of the inherent properties [5] used to rank features for their relevance.
Probability of Error
Given the objective of correctly classifying the samples in a gene expression dataset,
the goal of feature selection algorithm is to find the genes that minimize the
probability of classification error of the samples, i.e. maximize the classification
accuracy.
Divergence
Divergence measures the separation or distance between the conditional probabilities of the various classes in the output. Good features are those for which the divergence in conditional probabilities is more than a threshold value. A few examples of measures of divergence are
i. Kullback–Liebler divergence
ii. Patrick-Fisher divergence
iii. Matusita divergence
iv. Chernoff divergence
v. Bhattacharyya divergence
vi. Kolmogorov divergence.
Association or Dependence
This measures the extent of association or correlation among features, e.g. the correlation coefficient.
Interclass Distance
We assume different output classes to be widely distant in the sample space. Any features that maximize this distance are good features, e.g. using Euclidean distance.
Entropy
Let X be a random variable taking values in the set S = {x_1, x_2, ..., x_n}. If P(X) is the probability distribution of X, then the entropy of the random variable X is written H(X):

$$H(X) = -\sum_{x \in S} P(x)\log P(x)$$
Mutual Information
In order to measure the relationship between two random variables at a time, we use the concept of mutual information.
For two random variables X and Y, if P(x, y) is their joint probability distribution, then their mutual information is given by

$$MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y)\log\frac{P(x, y)}{P(x)P(y)}$$
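Both quantities can be estimated from empirical counts; the tiny discretized "gene versus class" sample below is made up for illustration.

```python
# Entropy H(X) and mutual information MI(X, Y) from empirical counts,
# matching the formulas above (logs in base 2, so the unit is bits).
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

gene = ["hi", "hi", "lo", "lo"]       # discretized expression levels
label = ["sick", "sick", "ok", "ok"]  # sample class labels
print(entropy(gene))                   # 1 bit of uncertainty
print(mutual_information(gene, label)) # gene fully determines the label
```

A gene whose discretized expression carries high mutual information with the class label is a strong candidate feature; independent variables score zero.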
Chi-Square Statistic
In this feature selection method, the gene values are converted into discrete intervals based on entropy, and then the chi-square (χ²) statistic is calculated between each gene and the output class. The chi-square statistic is used to evaluate the influence of a variable's value on the output classes.
The formula for calculating χ² of a gene is given as

$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{k} \frac{\left(A_{ij} - \frac{R_i C_j}{N}\right)^2}{\frac{R_i C_j}{N}}$$

where m is the number of intervals, k the number of classes, A_{ij} the number of samples in the i-th interval and j-th class, R_i the number of samples in the i-th interval, C_j the number of samples in the j-th class, and N the total number of samples.
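The χ² computation can be sketched directly from this formula; the contingency table below is invented.

```python
# Chi-square statistic between a discretized gene and the class label.
# table[i][j] = A_ij: number of samples in expression interval i, class j.

def chi_square(table):
    m, k = len(table), len(table[0])
    R = [sum(row) for row in table]                             # R_i
    C = [sum(table[i][j] for i in range(m)) for j in range(k)]  # C_j
    N = sum(R)
    chi2 = 0.0
    for i in range(m):
        for j in range(k):
            expected = R[i] * C[j] / N   # expected count under independence
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Two expression intervals x two classes; the counts show a strong
# association, so the statistic is large.
print(chi_square([[18, 2], [3, 17]]))
```

A value of zero means the gene's intervals and the classes are independent; genes are ranked by their χ² value and the top-scoring ones are kept.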
Information Gain
Information gain (IG) is used to measure the relevance of attribute X_i to class V. It calculates the amount of information inherent in the value of the feature, i.e. the amount of information gained by evaluating a feature that can help in the classification of a sample into the correct class.
Features that help in the correct classification of samples have the highest information; unrelated features provide no information. Information gain is equivalent to the decrease in entropy in the system, where entropy is a measure of impurity in a dataset.
Correlation-Based Methods
In these methods, various correlation coefficients [6] are calculated between each gene and the output class. A strong correlation coefficient means higher relevance of the gene; a weak correlation means less relevance, i.e. a feature with weak correlation may or may not be necessary for the correct classification of the genes. Non-linear correlation is measured by entropy, which measures the amount of uncertainty in a data value. This method ignores features with low correlation with the output or class value as irrelevant; features with high correlation among themselves are treated as redundant. A feature is relevant if it can predict the class of samples correctly in a way that has not already been achieved using another, smaller subset of features.
Algorithm
i. Let X be a two-class dataset with dimensionality p and number of samples n.
ii. Scale the input data to lie in the interval [0, 1].
iii. Initialize the scores of the p features to zero.
iv. For i = 1 to m iterations:
a. Choose a random sample S from the dataset, taking all its features as a vector.
b. Find the nearest neighbours of S, using Euclidean distance as the measure of nearness. The closest neighbour belonging to the same class as S is known as the 'near hit', and the closest neighbour belonging to a different class is known as the 'near miss'.
c. Decrease the score of a feature by the amount it differs from that feature in the near hit (the neighbouring sample of the same class as S), and increase it by the amount it differs from that feature in the near miss (the neighbouring sample of a different class):

New score of feature i = old score of feature i
− (value of feature i in sample S − value of feature i in the near hit)²
+ (value of feature i in sample S − value of feature i in the near miss)²

d. Divide the score vector by m; this gives the relevance vector of all p features, from which the genes with relevance greater than a threshold value can be selected for further work.
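A compact sketch of this Relief-style scoring scheme, with a made-up two-class dataset:

```python
# Relief feature scoring: reward features that separate classes (differ at
# the near miss) and penalize ones that vary within a class (differ at the
# near hit). The two-class dataset below is invented; feature 0 tracks the
# class while feature 1 is noise.
import random

def relief(data, labels, m=100, seed=0):
    rng = random.Random(seed)
    p = len(data[0])
    scores = [0.0] * p
    for _ in range(m):
        s = rng.randrange(len(data))          # step a: random sample
        def nearest(same_class):
            candidates = [i for i in range(len(data))
                          if i != s and (labels[i] == labels[s]) == same_class]
            return min(candidates, key=lambda i: sum(
                (data[i][f] - data[s][f]) ** 2 for f in range(p)))
        hit, miss = nearest(True), nearest(False)  # step b
        for f in range(p):                         # step c: score update
            scores[f] += ((data[s][f] - data[miss][f]) ** 2
                          - (data[s][f] - data[hit][f]) ** 2)
    return [sc / m for sc in scores]               # step d

data = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
labels = [0, 0, 1, 1]
scores = relief(data, labels)
print(scores)  # the class-tracking feature scores higher than the noise
```

Features scoring above a chosen threshold would then be retained, as in step d.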
SVM-Based Recursive Feature Elimination
i. For an input gene expression dataset X with n features and p samples, initialize the optimal subset S with all n features of the original dataset X.
ii. Train the SVM classifier using all the features in S.
iii. Compute the rank of the genes.
iv. Discard the genes with the lowest rank and retain the 'r' top-ranked genes.
v. Repeat steps ii to iv with the newly obtained gene subset as long as the performance of the classifier continues to improve.
vi. Return the subset with the optimal classifier performance.
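The recursive elimination loop above can be sketched as follows. To keep the example self-contained, a nearest-centroid weight (the difference of per-class feature means) stands in for the SVM weight vector used for ranking, and the dataset is invented.

```python
# Recursive feature elimination in the spirit of SVM-RFE (steps i-vi above).
# Assumed simplification: a feature's weight is the absolute difference of
# its class means (a nearest-centroid stand-in for the SVM weight vector);
# the lowest-weight feature is dropped each round.

def centroid_weights(data, labels, features):
    def mean(cls, f):
        vals = [row[f] for row, y in zip(data, labels) if y == cls]
        return sum(vals) / len(vals)
    return {f: abs(mean(1, f) - mean(0, f)) for f in features}

def rfe(data, labels, keep):
    features = list(range(len(data[0])))               # step i: all features
    while len(features) > keep:
        w = centroid_weights(data, labels, features)   # steps ii-iii: rank
        features.remove(min(features, key=lambda f: w[f]))  # step iv: drop
    return features                                    # step vi

data = [[0.1, 5.0, 0.2], [0.2, 5.1, 0.1], [0.9, 5.0, 0.8], [1.0, 4.9, 0.9]]
labels = [0, 0, 1, 1]
print(rfe(data, labels, keep=2))  # keeps the two class-separating features
```

In real SVM-RFE [8] the weights come from retraining a linear SVM each round, which is what gives the method its accuracy and its cost.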
Support Vector Machines (SVM) Since gene expression data is highly non-linear, it is almost impossible to separate it into classes using linear classification algorithms; therefore, we need algorithms that can map the input data into alternative dimensional spaces and then find the planes that separate the input data into different classes based on the properties of the data in the higher dimensions. The support vector machine [9] is one such algorithm. An SVM is a machine-learning algorithm that computes a hyperplane of maximum margin in a higher-dimensional space such that the values in the input dataset become linearly separable. SVM uses a kernel and a penalty term. There are four basic kernels used with SVMs: the linear kernel, polynomial kernel, radial basis function and sigmoid kernel.
Principal Component Analysis (PCA) PCA [10] is a method for feature engineering which leads to dimensionality reduction of high-dimensional data. It is an unsupervised method for feature reduction [11] which projects data from a high-dimensional space to a low-dimensional space while retaining the features that account for the maximum variance in the data. PCA involves computing mutually orthogonal principal components based on the eigenvalues of the covariance matrix of the input data.
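A minimal PCA sketch along these lines, using an eigendecomposition of the covariance matrix (the sample matrix is made up):

```python
# PCA as described above: eigendecomposition of the covariance matrix,
# then projection onto the components with the largest eigenvalues.
import numpy as np

def pca(X, n_components):
    """X: (samples, features). Returns the projection of X onto its top
    n_components principal components (directions of highest variance)."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]       # sort components by variance
    components = eigvecs[:, order[:n_components]]
    return Xc @ components

X = np.array([[2.0, 1.9], [0.0, 0.2], [1.0, 1.1], [3.0, 2.8]])
Z = pca(X, 1)  # 2-D data reduced to its single direction of max variance
print(Z.shape)
```

For gene expression matrices the same call reduces thousands of gene dimensions to a handful of components that summarize most of the variance across samples.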
5 Conclusion
In this chapter, we have seen various techniques that are used in machine learning
for the feature selection step. It has been observed that filter methods generally
work well for feature selection from gene expression data. Properties of data like
References
1. Lu Y, Han J (2003) Cancer classification using gene expression data. Inf Syst (Elsevier)
28(4):243–268
2. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using
support vector machines. Mach Learn 46(1–3):389–422
3. Srinivasa KG, Venugopal KR, Patnaik LM (2006) Feature extraction using fuzzy c-means
clustering for data mining systems. Int J Comput Sci Netw Secur 6(3A):230–236
4. Kohavi R, John G (1997) Wrappers for feature subset selection. Artif Intell J (Special issue on
relevance) 97(1–2):273–324
5. Sánchez-Maroño N, Alonso-Betanzos A, Tombilla-Sanromán M (2007) Filter methods for
feature selection—a comparative study. In: Yin H, Tino P, Corchado E, Byrne W, Yao X (eds)
Intelligent data engineering and automated learning—IDEAL 2007. IDEAL 2007. Lecture
notes in computer science, vol 4881. Springer, Berlin, Heidelberg
6. Yu L, Liu H (2003) Feature selection for high-dimensional data: a fast correlation based filter
solution. In: Proceedings of the twentieth international conference on machine learning, ICML,
pp 856–863
7. Robnik-Sikonja M, Kononenko I (2003) Theoretical and empirical analysis of ReliefF and
RRelief. Mach Learn 53:23–69
8. Lin X, Yang F, Zhou L et al (2012) A support vector machine recursive feature elimina-
tion feature selection method based on artificial contrast variables and mutual information. J
Chromatogr B Anal Technol Biomed Life Sci 10:149–155
9. Arenas-García J, Pérez-Cruz F (2003) Multi-class support vector machines: a new approach.
In: Proceeding of the IEEE international conference on acoustics, speech, and signal processing
(ICASSP ‘03), vol 2, pp 781–784
10. Song F, Mei D, Li H (2010) Feature selection based on linear discriminant analysis. In: 2010
International conference on intelligent system design and engineering application, Changsha,
pp 746–749. https://doi.org/10.1109/isdea.2010.311
11. Taveira De Souza J, Carlos De Francisco A, Carla De Macedo D (2019) Dimensionality
reduction in gene expression datasets. IEEE Access 7:61136–61144
Genomics and Proteomics
Unsupervised Techniques in Genomics
Mrinmoyee Bhattacharya
1 Introduction
M. Bhattacharya (B)
St. Joseph’s College, Bangalore, India
Microarrays [2]:
Firstly, the data under consideration has to be transformed so that it can be used by machine learning algorithms; this process is known as pre-processing. Secondly, data analysis is required to understand the problem.
The most typical applications are identification of expression patterns, classification, and genetic network induction.
Systems biology:
It is a field where machine learning and biology work together in an organized way. It is complex to model the life processes that take place inside a cell. Thus, computational methods are very helpful to model biological networks, metabolic pathways, and signal transduction networks.
Evolution:
Text mining:
The increasing amount of data is a side effect of computational techniques, so text mining is a technique used for the extraction of knowledge. Text mining is an interesting topic in computational biology which is applied in the prediction of cellular location, the analysis of protein interactions, and many other areas.
Hierarchical Clustering
This is a type of clustering algorithm which divides the objects into a tree structure consisting of nodes, where each individual node denotes a cluster. Each node may have any number of child nodes, or none.
Partition Clustering
In this type of clustering, the algorithm partitions the data objects into non-overlapping subsets in such a way that each object belongs to exactly one subset. Each subset is called a cluster.
Model-Based Clustering
These clustering algorithms use models and attempt to fit the data to them. In model-based clustering, we assume that the data come from different clusters according to probability distributions.
Various unsupervised techniques in bioinformatics are as given below [2]:
• Protein secondary structure prediction: In this, we predict the three-dimensional protein structure, that is, the folding and its secondary as well as tertiary structure, from the primary structure. Neural networks and support vector machines are used to perform this task.
• Gene recognition: It is a method of recognizing the areas of genomic DNA that encode genes. Protein-coding genes and RNA genes are included here. It may also involve predicting functional elements. Hidden Markov models are used to solve this.
• Multiple sequence alignment: It is the process of aligning a sequence set
which generally consists of the biological sequence protein, DNA or RNA. The
techniques used to solve this are clustering and hidden Markov model.
• Splice site recognition: While technology is used to sequence the genome, a vast amount of sequence data has been created. For a genome sequence, one of the main tasks is to identify all the genes. In eukaryotic genes, identifying the coding region depends on the identification of the exon–intron structure. Introns are non-coding regions of a gene whose biological function is not fully known. The borders between exons and introns are known as splice sites. There are two splice sites, one in the upstream part of the intron and the other in the downstream part. The upstream end of the intron is called the donor splice site, and the downstream end is called the acceptor splice site. These two splice sites, along with the relevant beginning and end of the intron, are called canonical splice sites.
There are numerous techniques to detect splice sites. A Markov model is used in the initial stage, and an SVM is used in the next stage.
• Microarray data—Normalization: Many biologists use microarray technology to
monitor the genome-wide expression levels of genes in an organism. A microarray is
basically a glass slide on which DNA molecules are fixed in a particular order at
specific locations known as spots or features. A
Unsupervised Techniques in Genomics 169
microarray has thousands of such spots, each containing a few million copies of a DNA
molecule that individually corresponds to a gene. The spots are printed on the glass
slide with the help of a robot or synthesized by a process called photolithography.
In order to detect differentially expressed genes, we require a reference measurement
level that does not change between the two conditions. In practice, the average
ratio of such genes often differs from 1, owing to various experimental conditions.
In large-scale microarray experiments, many sources of variation affect the
measurement of gene expression levels.
Normalization is a technique for correcting such variations so that data obtained
from two different samples can be compared. The first step of the normalization
procedure is to select a gene set consisting of genes whose expression ratio is
expected to be 1. Next, a normalization factor is calculated from this set and then
applied to the other genes in the microarray experiment. Since the normalization
process changes the data, it is carried out on the background-corrected value of each spot.
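The procedure can be sketched as follows; the spot ratios, the reference gene set, and the log-scale factor are illustrative assumptions, not the chapter's exact protocol.

```python
import numpy as np

def normalize_ratios(ratios, reference_idx):
    """Scale expression ratios so the reference genes (expected ratio 1)
    are centered on a ratio of 1 after normalization."""
    log_ratios = np.log2(np.asarray(ratios, dtype=float))
    # Normalization factor: mean log-ratio of the reference gene set.
    factor = log_ratios[reference_idx].mean()
    # Apply the factor to every gene on the array.
    return 2.0 ** (log_ratios - factor)

ratios = [2.0, 2.2, 1.8, 8.0, 0.5]   # hypothetical background-corrected spot ratios
reference = [0, 1, 2]                # genes whose ratio is expected to be 1
normalized = normalize_ratios(ratios, reference)
```

After normalization, the reference genes average to a ratio of 1 on the log scale, and the same factor rescales every other gene on the array.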
• Microarray data—Gene selection: To study gene expression, samples must be
classified, for which gene selection is an important task.
Researchers try to identify the smallest set of genes that gives good prediction
performance, e.g., for diagnosing clinical abnormalities. Univariate ranking is used
to assess the significance of each gene, with arbitrary thresholds for selecting the
number of genes. The resulting ranking criteria feed into various classification
algorithms. Common techniques for gene selection include classification algorithms
such as random forest, or feature selection using a correlation-based
feature selector combined with ML.
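A minimal sketch of random-forest-based gene selection on synthetic data (the dataset, the informative genes, and the top-k threshold are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical microarray: 60 samples x 100 genes, two diagnostic classes.
X = rng.normal(size=(60, 100))
y = np.array([0] * 30 + [1] * 30)
X[y == 1, :5] += 2.0   # make the first five genes informative

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# Rank genes by importance and keep the top k (an arbitrary threshold).
top_genes = np.argsort(forest.feature_importances_)[::-1][:5]
```

The forest's impurity-based importances serve here as the ranking criterion; in practice the threshold k would be chosen by cross-validated prediction performance rather than fixed in advance.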
• Microarray data—Prediction of therapy outcome: Therapy outcomes can be
predicted in two ways. The first is to take predictions directly
from patients, therapists, and clinical observers, whereas the second
is to derive predictive measures from the same sources. Among the many diseases
worldwide, one of the most common is depressive disorder. Many treatment options
are available, e.g., neurostimulation and pharmacological treatment, but none is
universally effective. It is therefore very important to identify factors that
predict, a priori or early in treatment, the response or potential resistance.
Most of the evidence coming from research is statistically sound, but it may not
always be clinically useful. A "clinical predictor" should be cheap, dependable,
reproducible, non-invasive, and easily accessible so that it can be used daily
for clinical purposes. There are also many methodological issues associated with
predictor or biomarker research.
• Microarray data—Dependencies between genes: The dependency of one gene
on another plays an important role, as it helps us understand different
biological mechanisms. Existing interaction measures are based on association
measures such as the Pearson correlation, which can capture monotonic or linear
dependency relationships. For nonlinear combinatorial dependency relationships,
hidden Markov models are used.
This problem is generally solved with the help of independent component analysis and
clustering [6].
170 M. Bhattacharya
study the different biochemical properties of protein. This may include the study
of binding activities of protein as well as enzyme–substrate relationships.
Reverse-phase protein microarrays (RPMA): This is a sort of microarray that iden-
tifies different proteins by probing lysate samples. Early on, some of these
microarrays were used in monitoring prostate cancer patients to recognize
historical alterations.
• Systems biology and modeling: This is the practice of modeling the dynamics
of biochemical networks, where molecules are the nodes and the connections
between them are the edges. An explicit mathematical description of the connections
and their interaction dynamics is used, which helps in checking and predicting
behavior through computer simulations.
In this type of clustering, the objects are partitioned by an iterative method into k
clusters based on their dissimilarity, as shown in Fig. 1. A predefined criterion
function J assigns the data into the k sets, which is what makes the clustering
possible. Partitioning methods are very useful for bioinformatics applications.
The different types of partitional clustering are as follows:
k-Means is one of the basic unsupervised learning procedures used to solve the
clustering problem. There are numerous ways to cluster data, but the k-Means
algorithm is used most often; it tries to maximize the similarity within each group
while keeping the groups well separated from each other. The algorithm is based on
distance calculations, typically the Euclidean distance. The Euclidean distance
between two given points is computed using the following formula:
Euclidean Distance = \sqrt{(X_1 - X_2)^2 + (Y_1 - Y_2)^2}
This formula gives the distance in two-dimensional space, but the same concept
extends to multi-dimensional space by adding more terms. In k-Means clustering,
"k" represents the number of clusters into which we want to divide our data. A
limitation of k-Means is that the data should be continuous; the algorithm will not
work for categorical data.
If the population of India is to be segmented with height in cm and weight in kg,
one can see that the distance metric discussed above is highly sensitive to the
units of the variables. Therefore, it is recommended to standardize all the data
before clustering.
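The recommendation to standardize variables such as height (cm) and weight (kg) before running k-Means can be sketched as follows (the population data are simulated; scikit-learn's StandardScaler and KMeans are used as stand-ins):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical population sample: height (cm) and weight (kg) on different scales.
height = np.concatenate([rng.normal(150, 5, 50), rng.normal(175, 5, 50)])
weight = np.concatenate([rng.normal(50, 4, 50), rng.normal(80, 4, 50)])
data = np.column_stack([height, weight])

# Standardize so that neither unit dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(data)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
```

Without the scaling step, whichever variable has the larger numeric range would dominate the distance calculation and distort the clusters.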
k-Means is an iterative process of clustering, which keeps iterating until it reaches
the best solution or clusters in our problem space.
The aim of this algorithm is to minimize an objective function known as the squared
error function, given by

J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \| x_i - v_j \|^2

where
\| x_i - v_j \| is the Euclidean distance between x_i and v_j,
c_i is the number of data points in the ith cluster, and
c is the number of cluster centers.
Let the set of data points be X = {x_1, x_2, x_3, …, x_n} and the set of centers be
V = {v_1, v_2, …, v_c}, as shown in Fig. 2.
(1) Randomly select c cluster centers.
(2) Evaluate the distance between each data point and the cluster centers.
(3) Assign each data point to the cluster center nearest to it.
(4) Evaluate the new cluster center using the equation below:

v_i = (1/c_i) \sum_{j=1}^{c_i} x_j
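Steps (1)–(4) can be sketched as a minimal k-Means loop in NumPy (a plain illustration, not the chapter's improved variant):

```python
import numpy as np

def kmeans(X, c, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # (1) Randomly select c cluster centers from the data points.
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(iters):
        # (2) Distance between every data point and every cluster center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # (3) Assign each point to its nearest cluster center.
        labels = d.argmin(axis=1)
        # (4) New center v_i = mean of the points assigned to cluster i.
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(c)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated synthetic groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(10, 1, (30, 2))])
centers, labels = kmeans(X, c=2)
```

The loop repeats steps (2)–(4) until the centers stop moving, which is the convergence criterion implied by the iterative description above.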
the existing center of the clusters and make it the tth cluster center. Thus, we see that
the objects of the dataset belong to clusters according to Eq. 1 below:
In 1990, Kaufman and Rousseeuw designed a clustering algorithm for Clustering
LARge Applications (CLARA) [13, 14]. It is basically an extension of the
k-Medoids method that reduces computing time and storage requirements when the
dataset contains a huge number of objects, i.e., more than several thousand. The
technique uses a sampling approach: instead of calculating the medoids for the full
dataset, small sample datasets are drawn, and the PAM algorithm is used to find
the optimal set of medoids for each sample [15, 16].
The quality of the resulting medoids is measured by the average dissimilarity
between every object in the whole dataset and the medoid of its cluster, which is
defined as the cost function. This process of sampling and clustering is repeated
many times to reduce the effect of sampling. Finally, we keep the collection of
medoids whose cost is minimum.
The CLARA algorithm is as follows:
Step 1: From the actual dataset, randomly create a number of subsets with a fixed
sample size.
Step 2: On every subset, run the PAM algorithm to find the corresponding
k representative objects, known as medoids. Every observation of the full
dataset is then assigned to its nearest medoid.
Step 3: The dissimilarity between each observation and the medoid closest to it is
calculated, and the mean of all these dissimilarities is taken. This measures
the quality of the clustering.
Step 4: The sample whose medoids give the minimum mean value is retained, and a
further analysis is carried out on the resulting partition.
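The four steps can be sketched as follows; the `k_medoids` helper is a simplified stand-in for PAM (farthest-point initialization plus alternating updates), not the full swap-based algorithm, and the data and sample sizes are assumptions:

```python
import numpy as np

def k_medoids(sample, k, iters=20):
    """Simplified stand-in for PAM: farthest-point initialization, then
    alternating assignment / medoid updates (not the full swap-based PAM)."""
    idx = [0]
    while len(idx) < k:
        d = np.linalg.norm(sample[:, None] - sample[idx][None], axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    medoids = sample[idx].copy()
    for _ in range(iters):
        labels = np.linalg.norm(sample[:, None] - medoids[None], axis=2).argmin(axis=1)
        for j in range(k):
            pts = sample[labels == j]
            if len(pts):
                cost = np.linalg.norm(pts[:, None] - pts[None], axis=2).sum(axis=1)
                medoids[j] = pts[cost.argmin()]
    return medoids

def clara(X, k, n_samples=5, sample_size=20, seed=0):
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(n_samples):
        # Step 1: draw a fixed-size random subset of the full dataset.
        sub = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
        # Step 2: run (simplified) PAM on the subset to get k medoids.
        medoids = k_medoids(sub, k)
        # Step 3: mean dissimilarity of ALL points to their nearest medoid.
        cost = np.linalg.norm(X[:, None] - medoids[None], axis=2).min(axis=1).mean()
        # Step 4: keep the medoid set with the minimum mean cost.
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best, best_cost

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(10, 1, (40, 2))])
medoids, cost = clara(X, k=2)
```

Because PAM runs only on each small sample, the expensive pairwise-distance work stays bounded even when the full dataset is very large, which is exactly the point of CLARA.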
v. Fuzzy k-Means:
• k-Means pursues hard clusters, where a particular point belongs to one and
only one cluster, whereas fuzzy k-Means clustering pursues soft clusters, which
may overlap.
• In fuzzy clustering, a single point can belong to two or more clusters, with
some degree of affinity to each of them.
• The degree of affinity is inversely proportional to the distance of that point
from the centroid of that cluster.
• Like k-Means clustering, it works on objects for which a distance can be
defined in an n-dimensional vector space.
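The soft-membership idea can be illustrated with the standard fuzzy c-Means membership formula (a sketch; the fuzzifier m = 2 and the toy points are assumptions):

```python
import numpy as np

def fuzzy_memberships(X, centroids, m=2.0):
    """Soft memberships via the standard fuzzy c-Means formula (m > 1):
    a point's affinity to a cluster falls off with distance to its centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)                     # guard against division by zero
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 0.0]])
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])
U = fuzzy_memberships(X, centroids)   # one row of memberships per point
```

A point sitting on a centroid gets membership close to 1 in that cluster, while the midpoint between the two centroids is shared 50/50, which is exactly the "soft cluster" behavior described above.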
vi. k-Modes:
vii. COOLCAT:
Barbara et al. [17] proposed COOLCAT, which deals with the sensitivity of k-Modes
to the way the initial objects are selected. It is an incremental algorithm that
reduces the entropy of the clusters: given a set of clusters, COOLCAT places the
next point in such a way that the overall entropy decreases. It can thus cluster
each new point without re-clustering the entire set, which makes it suitable for
processing data streams.
Step 11: The third primary cluster should contain the remaining points of distance-
array:
Step 12: Center of first primary cluster = Center of B(12, 11) = (12, 11); center of
second primary cluster = Center of A(11, 12) = (11, 12); center of third primary
cluster = Center of E(8, 3) F(7, 2) I(2, 8) and H(1, 8) = (4.5, 5.25).
F, G, I, C, D, J, and K are considered outliers, since their weights have been
incremented to 1. Since the number of candidate points is greater than K = 3, we
repeat the process from Step 4.
Step 5: Distance-array should be:
Differences Index
7.53 1
2.46 2
Step 11: The third primary cluster should contain the remaining points of distance-
array:
Step 12: Center of first primary cluster = Center of A(11, 12) and B(12, 11) = (11.5,
11.5); center of second primary cluster = Center of E(8, 3) = (8, 3); center of third
primary cluster = Center of H(1, 8) = (1, 8).
Step 13: By integrating k-Means algorithm using these detected centers to cluster all
points of S, the partition will be:
based on these detected centers. This strategy ensures that the initial
choice of centers is reasonable, since the centers are remote from each other and
lie in the densest regions. Consequently, this drawback of k-Means is overcome by
adopting the new strategy.
Case Study 2
Since all these algorithms are stochastic, we performed multiple runs over all 20
benchmarks, and each value reported is the average of 50 runs. The F-measure values
obtained by each algorithm are shown in Table 2.
The Davies–Bouldin index obtained from each algorithm is shown in Table 3. The DB
measure is lower when the clusters are compact and well separated from each other,
and higher when the clusters are spread out. Consequently, the best-quality
clustering is the one whose Davies–Bouldin index is minimum.
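The behavior of the index can be illustrated with scikit-learn's `davies_bouldin_score` (the two-blob data and the random "bad" labelling are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(4)
# Two well-separated blobs, clustered well (k-Means) and badly (random labels).
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(8, 1, (40, 2))])
good = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
bad = rng.integers(0, 2, size=len(X))

db_good = davies_bouldin_score(X, good)   # compact, well separated -> low
db_bad = davies_bouldin_score(X, bad)     # spread-out clusters -> high
```

The compact, well-separated partition scores much lower than the random one, matching the rule that the minimum-DB clustering is the best.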
These experimental results clearly reveal the difference before and after
the improvement of the k-Means algorithm. The new algorithm could be used on big
data (millions of objects) and is likely to provide very good results: on one
hand, the local-optimum problem has been fixed; on the other hand, the improved
k-Means is simple to implement, fast, and easily parallelized.
Table 2  Results of F-measure

Benchmark       Improved K-Means   K-Means   K-Means++
Alizadeh-v2     1                  0.8201    0.8518
Alizadeh-v3     0.7599             0.6771    0.6432
Armstrong-v1    0.6667             0.7215    0.7329
Armstrong-v2    0.7105             0.7567    0.8264
Bredel          0.7996             0.7024    0.6590
Chen            0.6446             0.8096    0.6895
Chowdary        0.6697             0.6697    0.6764
Dyrskjot        0.7990             0.7367    0.5376
Garber          0.6126             0.5784    0.5753
Golub-v1        0.7240             0.8460    0.8357
Golub-v2        0.8859             0.7789    0.8069
Khan            0.6926             0.6230    0.6031
Laiho           0.6771             0.7511    0.7289
Nutt-v3         1                  0.7854    0.7045
Pomeroy-v1      0.6644             0.6750    0.6572
Pomeroy-v2      0.7358             0.6273    0.5859
Shipp-v1        0.7341             0.6891    0.7134
Singh           0.6286             0.6286    0.6286
West            0.6571             0.7437    0.6607
Yeoh-v1         0.9836             0.8991    0.8143
often suggest biochemical pathways. Like most mathematical tools, cluster analysis
yields meaningful results when combined with biochemical insight.
The most commonly used mathematical technique is hierarchical clustering, which
groups genes into small clusters and then merges the clusters into higher-level
groups. A dendrogram is used to view the resulting hierarchical tree structure.
Most studies involve a series of experiments to identify genes that are
consistently coregulated under certain circumstances, such as disease state,
increasing time, or increasing dose of a drug. Each set of gene-expression levels
is compared to all the other sets in a pairwise fashion, and similarity scores are
produced in the form of statistical correlation coefficients. To order the
correlations, a node is created between the highest-scoring pair of rows, i.e.,
the geometrically closest pair. The matrix is then modified so that the joined
elements are represented as a single node, and the distances between the new node
and the other gene sequences of the matrix are computed. We do not recalculate the
entire correlation matrix but only update the rows that involve the new node.
Typically, the dendrogram contains a link for each node, where the height of the
link is proportional to the strength of the correlation. This process of creating
proportional links continues until all the genes in the experiment form a single
Table 3  Results of Davies–Bouldin index

Benchmark       Improved K-Means   K-Means   K-Means++
Alizadeh-v2     1.6366             2.2306    1.7122
Alizadeh-v3     1.6032             2.3249    1.7575
Armstrong-v1    1.9288             1.9636    1.9683
Armstrong-v2    2.0268             1.8721    2.1469
Bredel          2.1958             2.1241    1.8838
Chen            1.2488             2.4253    1.4717
Chowdary        0.8818             0.8819    0.8889
Dyrskjot        1.8901             1.7330    1.4242
Garber          1.7822             2.6147    1.7252
Golub-v1        1.8092             1.8891    1.8786
Golub-v2        1.7489             1.9410    1.9531
Khan            1.7493             1.9387    1.4035
Laiho           1.7147             1.8561    1.6953
Nutt-v3         1.7910             1.5726    1.6410
Pomeroy-v1      1.4546             1.6875    1.8476
Pomeroy-v2      1.4930             1.7259    1.6233
Shipp-v1        0.3683             1.4970    1.2275
Singh           0.8425             0.8425    0.8425
West            0.8815             2.0582    1.7026
Yeoh-v1         3.0918             2.5013    2.3834
hierarchical cluster where all the links are of appropriate length. If two nodes
are related by the same association, the tie can be resolved by some predetermined
set of rules.
In case of hierarchical clustering, the genes whose expression patterns are similar
are put into a group which is connected with branches. This is known as clustering
tree or dendrogram. The main task of hierarchical clustering algorithms is to partition
all the objects into a tree structure as shown in Fig. 4.
The methods included in hierarchical clustering are:
a. Agglomerative: Initially, many small clusters are formed, which are then merged
based on their similarity; at the end, a single cluster contains all the objects.
b. Divisive: Initially, all objects form one cluster, which is divided into smaller
clusters.
Steps in agglomerative clustering:
• Build the similarity matrix.
• From the similarity matrix, find the largest value.
• Combine the clusters having the largest value.
• Recalculate the matrix and iterate the process till all the clusters are combined
together.
With the help of linkage algorithms, we calculate the distance between clusters:
Single linkage: In case of single linkage clustering, the distance between two clusters
is the smallest distance between any two members from two different clusters.
Complete linkage: In complete linkage clustering, the distance between two clusters
is the maximum distance between any two members of the two different clusters.
Average linkage: In average linkage clustering, the distance between two clusters
is the average distance between all pairs of members from the two different
clusters; D(r, s) is calculated as:
D(r, s) = T_rs / (N_r * N_s)
where T_rs is the sum of all pairwise distances between cluster r and cluster s,
N_r is the size of cluster r, and N_s is the size of cluster s.
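Average-linkage clustering of gene-expression profiles can be sketched with SciPy (the simulated profiles and the cut into two clusters are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(5)
# Hypothetical expression profiles: two groups of coregulated genes, 6 conditions.
genes = np.vstack([rng.normal(0, 0.3, (10, 6)), rng.normal(3, 0.3, (10, 6))])

# Average linkage: inter-cluster distance T_rs / (N_r * N_s).
Z = linkage(genes, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram at 2 clusters
```

The linkage matrix `Z` encodes the full dendrogram; `fcluster` then flattens it into the desired number of groups, recovering the two coregulated sets.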
Such experiments are generally used to recognize overall similarities between
gene-expression patterns in the context of different treatment procedures; the aim
is to stratify patients based on their molecular-level responses to the treatments.
Hierarchical methods are suitable for such clustering, which is calculated on
the pairwise statistical comparison of complete scatterplots instead of individual
gene sequences.
4 Conclusion
Thus, we see that with the help of machine learning we can handle large amounts of
data. Since genomic data is relatively large in size, machine learning approaches
can simplify things and make analysis easy. There is a lot of scope for machine
learning in genomics, such as gene sequencing, gene editing, and pharmacogenomics.
In gene editing, the genes are examined to discover precise matches, and the gene
sequence is later changed according to the way it needs to be targeted.
Pharmacogenomics is another field that stands to benefit, enabling personalized
medicine: the drug given to a patient for a specific illness is adapted to the
genetic makeup of that individual. Newborn genetic screening tools can also make
use of ML methods in classifying metabolism defects. Thus, we can see that machine
learning algorithms play a vital role in genomics.
References
1. http://www.iasri.res.in/sscnars/Genetics/14supervised%20classification%20_2_pdf
2. http://www.ijcstjournal.org/volume-5/issue-6/IJCST-V5I6P21.pdf
3. Chiang D, Brown P, Eisen M (2001) Visualizing associations between genome sequences and
gene expression data using genome-mean expression profile. Bioinformatics 17
4. Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 32
5. Dalton L, Ballarin V, Brun M (2009) Clustering algorithms: on learning, validation, perfor-
mance, and applications to genomics. Curr Genomics 10(6):430–445
6. Rousseeuw P (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis. J Comput Appl Math 20(1):53–65
7. Sekhar SM, Siddesh GM, Manvi SS, Srinivasa KG (2019) Optimized focused web crawler with
natural language processing based relevance measure in bioinformatics web sources. Cybern
Inf Technol 19(2):146–158
8. Sekhar M, Sivagnanam R, Matt SG, Manvi SS, Gopalalyengar SK (2019) Identification of
essential proteins in yeast using mean weighted average and recursive feature elimination.
Recent Pat Comput Sci 12(1):5–10
9. Patil SB, Sekhar SM, Siddesh GM, Manvi SS (2017) A method for predicting essential proteins
using gene expression data. In: 2017 international conference on smart technologies for smart
nation (SmartTechCon). IEEE, pp 1278–1281
10. Kenidra B, Benmohammed M (2016) An improved K-means algorithm for gene expression
data clustering. In: Conference paper, Aug 2016, Research Gate
11. Chernoff H (1952) A measure of asymptotic efficiency for tests of a hypothesis based on the sum
of observations. Ann Math Stat 23:493–507
12. Rendon E, Abundez I, Arizmendi A (2011) Internal versus external cluster validation indexes.
Int J Comput Commun 5(1):27–34
13. Fred A, Jain A (2005) Combining multiple clusterings using evidence accumulation. IEEE
Trans Pattern Anal Mach Intell 27(6):835–850
14. Topchy A, Jain AK, Punch W (2003) Combining multiple weak clusterings. In: Third IEEE
international conference on data mining, pp 331–338
15. Lu Z, Peng Y, Xiao J (2008) From comparing clusterings to combining clusterings. In: Pro-
ceedings of the twenty-third AAAI conference on artificial intelligence. Institute of Computer
Science and Technology, Peking University, Beijing
16. Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple
partitions. J Mach Learn Res 3:583–617
17. Barbara D, Li Y, Couto J (2002) COOLCAT: an entropy-based algorithm for categorical
clustering. In: Proceedings of the International Conference on Information and Knowledge
Management, pp 582–589. https://doi.org/10.1145/584792.584
18. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0184370
19. Azuaje F (2002) A cluster validity framework for genome expression data. Bioinformatics
18(2):319–320
20. http://www.informit.com/articles/article.aspx?p=357695&seqNum=4
Supervised Techniques in Proteomics
1 Introduction
i.e., transformation from DNA to RNA and then finally to proteins. Proteins can be
classified by structural level: the primary structure is an amino acid
sequence drawn from a twenty-letter alphabet; the secondary structure is caused
by the folding of amino acid sequences into alpha helices and beta sheets; the
tertiary structure is the three-dimensional conformation of the amino acids; and
the quaternary structure arises from interactions between protein subunits to form
a larger unit of protein. The primary structure determines the secondary structure
formation and the shape of the tertiary 3D conformation, which in turn determines
the final structure of the protein conglomerate [4].
The study of protein, proteomics, can be further classified based on which aspect
of protein is being studied. Structural proteomics deals with the study of the struc-
ture of the protein and atomic resolution of three-dimensional protein structures to
understand protein sequencing. Expression proteomics [5] is the study of the
analysis and differential expression of proteins. It facilitates measuring the
relative and absolute levels of proteins in a cell under stimulation by time,
drugs, and disease. Techniques that label samples at the protein level, and
heavy-oxygen usage at the peptide level, fall under expression proteomics.
Interaction proteomics [6] deals with the interaction between proteins and the
activation of protein complexes. Biological insights and processes can be studied
when interactions between the proteins, rather than individual proteins, are
studied.
Proteomics is applied in a wide range of areas which include pharmaceutical
drug discovery [7], proteomics in chemistry, protein complex analysis, excretion
and metabolism, disease diagnosis and disease monitoring, etc. Drug screening is
used for target identification and validation. Proteomics has been used in the drug
industry for the past 20 years, and it revolves around identifying proteins
differentially expressed between the control and disease samples. It is also used
in protein expression profiling, which plays a vital role in pharmaceutical
research: protein profiles are compared between the disease and control samples in
the laboratory, and the distinguishing proteins are called disease biomarkers. The
protein profiling technique concentrates on cell surfaces and proteins. Proteomics
in chemistry is targeted at compound screening, which
is used in the discovery of new drugs using target-based screening and phenotypic
screening, which is based on phenotypes like cytokine contents from the cells. This
technique can be adopted to produce high-quality drugs. Monitoring of drug proteins
in tissues or blood can be used for Pharmaceutical Research and developments.
Machine learning techniques are being used to analyze data from biology fields
like data generation by analytical methods of metabolism and transcriptomics for
identification and classification of genes. These techniques are now even used in
proteomics to analyze the data collected about proteins and mass spectrometry.
Quantitative protein information can be obtained from biological sample by vari-
ous machine learning techniques. Some of the most commonly used techniques are
gel electrophoresis, affinity separation, mass spectrometry-based technologies, pro-
tein arrays, etc. Some of these techniques can only analyze a limited number of
proteins and specific types of proteins due to time constraints. Mass spectrometry
(MS) techniques yield a more comprehensive list of proteins than the other
techniques. This makes MS more suitable for finding proteins which have not been
considered before
and are not already listed in the protein database. Other proteomics techniques can
be combined with mass spectrometry to form a pipeline for the analysis of various
proteins. Machine learning techniques can then be applied to mass spectrometry data
for exploring proteins through database searching, preprocessing, and protein
quantitation.
Machine learning techniques are used for identifying biomarkers, measurable
indicators of a biological condition. The size of the protein dataset guides the
selection of the machine learning techniques that could be used to classify various
proteins. Various algorithmic techniques with bottom-up approaches, integrated
with machine learning methods, are used in the analysis and identification of
proteins. Mass spectrometry is a widely used technique for protein identification
through ionization and dissociation of the proteins. Greater experimental
requirements and more complex instruments, along with modern machine learning
techniques, make it possible to analyze a wide variety of proteins.
Massive growth in the amount of biological data requires efficient information man-
agement and storage along with extraction of useful information from the available
data. Development of various methodologies and tools for converting the data into
biological knowledge involves adoption of machine learning techniques to give rise
to new models to uncover this information and provide predictions of the system.
In proteomics field, computational methodologies, combinatorial techniques, and
machine learning techniques are used in protein structure and function prediction
and in management of complex experimental data. Firstly, protein-related data needs
to be preprocessed so that it is made suitable for applying machine learning algo-
rithms. Next, data needs to be analyzed based on what has to be predicted using
pattern identification [8].
The genome of any organism encodes thousands of proteins, which underlie the
functions of life. Flawed proteins result in diseases such as diabetes,
cancer, and dementia. Hence, the study of proteins plays a major role in drug
recommendations. To understand diseases and the processes of life, and to provide
appropriate treatment, it is important to analyze proteins. Mass spectrometry is
used to determine the type and quantity of proteins in biological systems.
Proteomic data can be used to train neural networks that recognize proteins in
less time and with greater accuracy [9].
Machine learning in proteomics is specifically applied in the classification of sam-
ple of various proteins and identification and investigation of potential biomarkers.
Biomarkers are quantifiable characteristics of biological processes, which do not
relate to the sense of well-being of the patient but rather help in improving the drug
development process and contribute to biological research by application of various
techniques [10]. This could be performed using laboratory-based and computational
techniques. Machine learning analysis in proteomics requires large datasets, and it
192 V. P. Kiranmai et al.
is essential to consider the types of protein samples available and their
suitability for applying machine learning techniques [11].
Increase in the amount of data that is available provides sources of information
where different mining techniques can be applied for extraction of knowledge from
the available data. This can be done using the text mining techniques. Data and text
mining has become very useful in computational biology and it has a wide range of
applications and can be put into use in cellular location prediction, functional annota-
tions, and protein interaction analysis [8]. Computational techniques of data mining
can be used to solve back translation of proteins which is a complex combinatorial
problem.
Machine learning is used in proteomics to optimize performance by learning from
datasets and examples. Optimization can improve the accuracy of prediction models
in modeling problems. Statistical methods are used to build computational models
that learn from the datasets and process the data to represent the model. The
learning algorithm should achieve high accuracy with reasonable space and time
complexity. The transformation of data into knowledge is an iterative and
interactive process.
In the iterative phase, data integration and the merging of information from
various sources are done first, followed by data cleaning, where incorrect data is
eliminated. The relevant variables present in the data are then selected,
inconsistencies and outliers are resolved, and the appropriate technique for the
data analysis, either supervised or unsupervised classification, is chosen. Among
these machine learning techniques, various models are studied to understand which
one suits the data we possess. After a model is chosen, it is evaluated from both
the computational and biological perspectives. This step is repeated until an
efficient model is acquired to compute the available protein data.
Various approaches and tools based on a wide range of algorithms have been developed.
There are several classification and biomarker analysis techniques which are very
specific to the proteomics domain [12]. In addition to these approaches, the conven-
tional machine learning techniques like KNN algorithm, logistic regression, support
vector machine (SVM), decision tree algorithms, and neural network can be applied.
These techniques fall under supervised learning techniques. Unsupervised learning
techniques like clustering and probabilistic graphical methods like Bayesian net-
works are also used in proteomics. Data preprocessing techniques like wavelet and
genetic algorithms are used for feature extraction and feature selection which is a
vital step for protein classification.
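A sketch of such a pipeline, combining univariate feature selection with an SVM on simulated proteomic data (the dataset shape, the informative features, and the choice of k are assumptions):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(6)
# Hypothetical proteomic matrix: 80 samples x 200 m/z features, two phenotypes.
X = rng.normal(size=(80, 200))
y = np.array([0] * 40 + [1] * 40)
X[y == 1, :10] += 1.5   # ten informative features

# Feature selection feeding a linear SVM, judged by cross-validation.
clf = make_pipeline(SelectKBest(f_classif, k=10), SVC(kernel="linear"))
scores = cross_val_score(clf, X, y, cv=5)
```

Wrapping selection and classification in one pipeline keeps the feature selection inside each cross-validation fold, avoiding an optimistic bias in the reported accuracy.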
phenotypes or treatments. The attributes present under each class could be identified
proteins [2]. A general classification problem has a set of attributes divided into
classes. The attributes have features associated with them, and a set of
classification rules helps distinguish these attributes among the various classes.
Supervised classification paradigms are algorithms that induce classification rules
from the data in a dataset, rules which can then be used to classify data outside
the dataset [8].
Various supervised learning algorithms such as Bayesian classifiers, rule-based learners, decision trees, random forests, and support vector machines can be applied to protein classification and biomarker identification. The initial step is to collect the data that can be analyzed and studied further. Datasets can be preprocessed, and the required features can be extracted using feature selection techniques; these features are then provided as input to the chosen machine learning algorithms according to our specifications.
In this section, we first discuss the structure of proteomic datasets and some of the proteomic datasets that are available, then how feature extraction can be performed on an available dataset, and finally we analyze the various supervised techniques.
A proteomics dataset is a data matrix Xij with a response variable in the form of proteins, consisting of i rows, which are the complete observations, and j columns, which are the proteins. Proteins are represented in "mass units", or mass over charge (m/z) units. When the dataset is subjected to classification techniques, an additional dummy variable vector Y is considered, which identifies the group membership of the observations; this is called the indicator matrix [13].
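The layout described above can be sketched as follows; the numbers here are hypothetical placeholders, not values from any real proteomics study.

```python
# A proteomics data matrix X: i rows (observations), j columns (m/z variables),
# plus an indicator vector Y giving each observation's class membership.

# Hypothetical intensities for 4 observations across 3 m/z variables.
X = [
    [0.12, 3.40, 1.10],  # observation 1
    [0.15, 3.10, 0.90],  # observation 2
    [2.20, 0.40, 5.00],  # observation 3
    [2.10, 0.55, 4.80],  # observation 4
]

# Indicator vector: 0 = control, 1 = diseased (dummy-coded class labels).
Y = [0, 0, 1, 1]

n_observations = len(X)   # i rows
n_variables = len(X[0])   # j columns (m/z units)

assert n_observations == len(Y)
print(n_observations, n_variables)  # 4 3
```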
A wide range of techniques can be applied to the available proteomic datasets. Each dataset represents a particular disease and mass spectrometric platform. Some examples of datasets that can be used follow. One pair of datasets consists of 50 cancer cases, with 50 and 45 samples, respectively. Both have undergone variable preselection: there are 39 variables in the lung cancer dataset and 109 variables in the colorectal cancer dataset. These datasets have been acquired.
194 V. P. Kiranmai et al.
Datasets are preprocessed by removing the irrelevant data. Data preprocessing plays a vital role in knowledge extraction from datasets [12]. A proteomic dataset contains a large amount of data, which needs to be preprocessed, and feature selection must be applied to pick the important and significant attributes. Some of the important data preprocessing algorithms applied to proteomics datasets are described below.
Supervised Techniques in Proteomics 195
Wavelet transformation can be used to remove the noise associated with a dataset and for feature selection in preprocessing for protein classification. Wavelet transformation is used in the analysis of biological data, highlighting the application of useful wavelets in biology. Wavelet analysis is used to preprocess the data and reduce its dimensionality. The discrete wavelet transform is applied after data binning and further compresses the dimensions of the data. This technique has been applied to 121 cancer samples with over 370,000 dimensions; using this method, the data was reduced to 6757 and further to 3382 features. Various wavelet techniques have been presented for processing MS data. The stationary discrete wavelet transform is another technique used for de-noising a dataset; it reduced the dimension from 20,000 to 1800 on a mouse tumor dataset. A support vector machine applied to the reduced features of this dataset produced an accuracy of over 99%. Wavelet-based methods can be used to preprocess mass spectrometry datasets that contain heterogeneous noise; detection performance has been shown to improve by applying local wavelet thresholding to the data. Another wavelet technique, the dual-tree complex wavelet transform, uses a symmetric Hilbert pair of wavelets; it is also used to de-noise data and has been shown to perform better than the discrete wavelet transform and the stationary wavelet transform.
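The dimension-compression step can be illustrated with a single level of the Haar discrete wavelet transform, which halves the number of data points by keeping averaged (approximation) coefficients; this is a minimal pure-Python sketch, not the full pipeline described above.

```python
import math

def haar_dwt_level(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients,
    each half the length of the input (input length must be even)."""
    assert len(signal) % 2 == 0
    approx = [(signal[i] + signal[i + 1]) / math.sqrt(2)
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / math.sqrt(2)
              for i in range(0, len(signal), 2)]
    return approx, detail

# Hypothetical 8-point spectrum fragment; keeping only the approximation
# coefficients halves the dimensionality while preserving coarse structure.
spectrum = [2.0, 2.0, 4.0, 4.0, 8.0, 8.0, 1.0, 1.0]
approx, detail = haar_dwt_level(spectrum)
print(len(approx))  # 4: dimensionality halved
print(detail)       # all zeros here, since adjacent points are equal
```

Repeating this step on the approximation coefficients compresses the data further, as in the 370,000-to-6757-to-3382 reduction described above.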
A technique has been developed by integrating a genetic algorithm and a Bayesian network for the identification and classification of Bacillus species. The integrated technique reduced the variables from 150 to 22 on a subset of the data and was effective in discovering biomarkers for spores. A mathematical model combining a genetic algorithm and a support vector machine is used to select peptide peaks; the classification technique then determines the optimal separating hyperplane. This technique is also used in serum peptide classification with high accuracy using the MALDI-TOF system. Another multivariate analysis combines a genetic algorithm with principal component analysis and linear discriminant analysis; it is used in a lipidomic approach to blood plasma classification.
Using these techniques, data can be preprocessed and the dimensionality of the data can be reduced significantly, which helps in faster and more accurate processing of datasets.
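As a rough illustration of how a genetic algorithm can search for a small variable subset, the sketch below evolves bit-mask "chromosomes" against a toy fitness function; the fitness function and all parameters are hypothetical stand-ins for a real classifier-accuracy score.

```python
import random

random.seed(0)

N_VARS = 20                  # total variables in the (toy) dataset
INFORMATIVE = {2, 5, 11}     # hypothetical truly useful variables

def fitness(mask):
    """Toy score: reward selecting informative variables, penalize subset size."""
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return 3.0 * hits - 0.1 * sum(mask)

def evolve(pop_size=30, generations=40, mut_rate=0.05):
    pop = [[random.randint(0, 1) for _ in range(N_VARS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]            # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_VARS)       # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mut_rate) for bit in child]  # mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
selected = [i for i, bit in enumerate(best) if bit]
print(selected)  # should concentrate on the informative variables
```

In a real pipeline, the fitness function would be the cross-validated accuracy of a classifier trained on the selected variables, which is what makes the 150-to-22 style reductions above possible.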
Principal component analysis (PCA) is a widely used dimension reduction technique for large and complex datasets. The goal of this technique is to summarize the data in fewer dimensions without losing any important information. This is accomplished using matrix techniques of mathematics. The data matrix X is approximated as the product of two matrices, a score matrix T and a loadings matrix P. The technique is often parameterized as k-PCA, where k is the number of components or variables extracted from the data, such that a data point is represented in a k-dimensional space. Mathematically, PCA is represented by

X = T P^T + E

and partial least squares (PLS) by

X = T P^T + E_X and y = T C^T + E_y

where T represents the scores of the latent variables onto which the data is projected, P and C are loadings, and E_X and E_y are residual matrices obtained from the original data X and y.
The above two methods provide class separation only at a qualitative level; they are not classification techniques but purely dimension reduction techniques. They are generally used together with a classification method.
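The score/loading decomposition above can be illustrated by extracting the first principal component with power iteration on the covariance matrix; this is a minimal pure-Python sketch for two variables, not a full PCA implementation.

```python
import math

def first_component(data, iters=200):
    """Leading loading vector p of mean-centered data via power iteration;
    scores are then t_i = x_i . p (one column of the score matrix T)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # covariance matrix C = X^T X / n
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]
    p = [1.0] * d
    for _ in range(iters):
        p = [sum(cov[a][b] * p[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
    scores = [sum(r[j] * p[j] for j in range(d)) for r in centered]
    return p, scores

# Hypothetical data stretched along the x = y diagonal.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
p, scores = first_component(data)
print([round(v, 2) for v in p])  # roughly the diagonal direction
```

The scores are the coordinates of each observation along this component; keeping only a few such components gives the k-dimensional representation described above.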
Linear discriminant analysis (LDA) operates on linear combinations of the new components obtained from principal component analysis or partial least squares in the dimension reduction process. A disadvantage of LDA is that it cannot deal with n << p datasets, where the variables far outnumber the samples. In such cases, PCA or PLS is applied first and a classification technique afterwards, rather than using a formal approach like LDA directly. LDA projects the known observations into a new coordinate system and then passes these values to a classifier. A model is then developed to predict the classes of the unknown observations, using the prior probabilities found from the learning set. A linear combination is used to maximize the ratio of between-class variance to within-class variance. LDA assumes that all classes have equal variance or covariance matrices and are normally distributed.
Generally, dimension reduction is performed in the feature subset selection process. This is an advantageous step before the supervised classification stage, as it reduces the cost of data acquisition, improves the overall efficiency and speed of the classification model, and increases classifier accuracy. Feature subset selection can be described as a search problem, where each state in the search space specifies a subset of possible features. Because the amount of computational data is large, exhaustively enumerating every possible feature subset is infeasible. There are four basic steps: identifying the search space starting point, the search organization, the subset evaluation function, and the halting criterion. Initially, the direction of search is chosen; for example, features are added one by one starting from an empty set. The search organization strategizes the search through the space of feature subsets. The evaluation function measures the effectiveness of the subset of features selected so far. The halting criterion stops the search through the subset space. An effective subset of the observations helps optimize time and space, improving the overall efficiency of the classification problem.
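The four steps above (starting point, search organization, evaluation, halting) map directly onto greedy forward selection; in this sketch the evaluation function is a hypothetical stand-in scoring how many of a known "useful" set a subset covers, minus a size penalty.

```python
# Greedy forward feature-subset selection.
# Starting point: empty set. Search organization: add one feature at a time.
# Evaluation: toy score (stand-in for cross-validated accuracy).
# Halting: stop when no single addition improves the score.

USEFUL = {1, 4, 7}  # hypothetical informative features

def evaluate(subset):
    return 2.0 * len(subset & USEFUL) - 0.5 * len(subset)

def forward_select(n_features):
    selected = set()
    best_score = evaluate(selected)
    while True:
        candidates = [
            (evaluate(selected | {f}), f)
            for f in range(n_features) if f not in selected
        ]
        score, feature = max(candidates)
        if score <= best_score:       # halting criterion
            return selected, best_score
        selected.add(feature)
        best_score = score

subset, score = forward_select(10)
print(sorted(subset))  # [1, 4, 7]: the useful features are picked out
```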
Protein sets that are related, having similar structures or functions, belong to the same protein group. A family of proteins such as the G-protein receptors forms a higher-level class and is the basis on which proteins are divided. Protein groups are organized into families and domains, which are used in the classification of proteins; these terms were coined by biologists for computational purposes. Various aspects, such as the description of the protein family, protein domain architectures, and species distribution, play a vital role in the classification of proteins. Classification is the step after feature extraction, in which the important and necessary features
are extracted. Various approaches can be used for classification, such as support vector machines, performing classification based on the primary structure of the proteins and the family they belong to. Using this technique, a protein sequence representation is created, which is a distributed representation of proteins. The dataset is divided into training data and test data; the system is trained using the training dataset, and the test data is then fed into the system and verified. The algorithms are chosen based on the datasets and the required accuracy. Generally, a large amount of training data is required for the machine to learn from the information present in the dataset. The machine learns the protein sequences and outputs which family an input sequence belongs to within a very limited period of time, with high accuracy depending on the classification method chosen. One of the algorithms used is the n-gram approach, which uses information from the protein with overlapping windows of three to six residues; with this technique, more accurate results are obtained for different window sizes.
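A minimal sketch of n-gram feature extraction over a protein sequence, counting overlapping residue windows; the sequence here is a made-up example, not from any real protein.

```python
from collections import Counter

def ngram_features(sequence, n):
    """Count overlapping n-residue windows (n-grams) in a protein sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

# Hypothetical peptide fragment.
seq = "MKVLAMKVL"
trigrams = ngram_features(seq, 3)
print(trigrams["MKV"])          # 2: the window "MKV" occurs twice
print(sum(trigrams.values()))   # 7 overlapping windows in a 9-residue sequence
```

Counts for several window sizes (three to six residues, as above) can be concatenated into one feature vector per sequence and fed to any of the classifiers discussed in this section.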
Protein space analysis is used on the training set to analyze the physical and chemical properties used in n-gram classification. Some techniques use other properties, such as volume and mass, to classify the data. The protein space characteristics are studied using the Lipschitz constant, a mathematical concept that deals with the uniform continuity of a function and how fast the function changes with time. In our context, it describes how quickly the physical and chemical compositions of proteins change with time and with other biological changes.
Decision trees structure the extracted data or information and discriminate the data in a tree-like structure. Decision trees improve the understandability of the classification and the identification of attributes and classes [11]. The algorithm depends on conditional statements and uses available tools to make a decision. These tools can analyze and understand graph-like models in which the problem has a set of possible outcomes, each with a chance of occurring; based on this information, the tools evaluate the event that is most likely to happen [16]. There is a variety of decision trees; simple decision trees are the easiest of all, where the classification and the relevant branches can be understood with ease. Methods like C4.5 are also used in proteomics. In this technique, we start with an empty tree and split the data iteratively, creating branches; the method is iterated until all the data is classified into one of the tree's branches according to the splitting criteria. Each data point ends at a tree leaf in the decision tree technique.
The decision tree technique fits proteomics problems well. The computational complexity of decision trees is linear in the number of input variables in the worst case. They perform considerably well when the number of input variables is large compared to the number of samples and most of the variables are irrelevant, and they are fast in practice. This technique can be used to analyze biomarkers. The algorithm can be combined with any ensemble approach, giving rise to more advanced algorithms with hybrid computational and functional properties that yield better analysis results [17].
In proteomics, decision trees have been widely used for classification with a high percentage of accuracy and precision [12]. In 2002, Bao-Ling Adam proposed a proteomic classification using the decision tree algorithm, based on a nine-protein mass pattern. The experiment was done with blood samples from two cohorts, prostate cancer (PCA) patients and healthy men, to validate a system that classified 96% of samples accurately. A year later, Markey used a classification and regression tree model to classify 41 clinical specimens with an accuracy of 90%; the size of the data plays a crucial role in the interpretation of results. Later, in 2012, an advanced decision tree was used to develop a sequence identification algorithm, which significantly improved accuracy and precision. In 2013, cleavage prediction with decision trees was introduced and became widely used on mass spectrometry-based peptide data; this algorithm helps reduce the search space and time complexity. Later, in 2018, predictive models based on the decision tree algorithm were used for the analysis of Staphylococcus aureus strains. The average accuracy across the above examples is more than 83%.
Support vector machines (SVMs) are a type of machine learning based on prediction using the linear separability between classes [11]. The SVM is a computationally effective supervised model based on statistical learning. It establishes a plane to classify the various patterns in the classes. The algorithm depends on linear separability between classes. The hyperplane used in classification is based on a transformation method and a kernel function, where the input feature vector space is transformed into a high-dimensional space [18]. This feature lets SVMs solve complex classification problems between sets.
In practical scenarios, the data points in a dataset are not linearly separable in the input space; this is where the kernel function comes in. To avoid overfitting the data points, the kernel function projects the input space into a higher-dimensional space. Kernel functions are based on Mercer's condition; mapping the input space to a higher dimension reduces the possibility of overfitting. There is a variety of kernel functions: the polynomial function, the Gaussian radial basis kernel function, and so on. The SVM technique is based on support vectors, elements of the training set that determine the decision boundaries of the classifier. Proteomics datasets contain training examples that are very difficult to classify; the SVM uses these support vectors to classify and predict the samples present in the dataset.
The support vector machine algorithm is used in proteomics due to its ability to handle high-dimensional data through its transformation function. One of its applications is in ovarian cancer biomarker discovery and classification, on data describing 37 patients suffering from papillary ovarian cancer. Quantization of the data is done using the mzMine tool, and SVMs along with feature selection methods are used for classification. An accuracy of 83% is achieved when a nonlinear SVM is used with the LOO-CV technique, and an accuracy of 97% when a nonlinear SVM is used with SVM-based feature selection techniques [8].
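The kernel functions mentioned above can be written directly; the sketch below shows the Gaussian radial basis function (RBF) kernel, with a hypothetical gamma value.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2).
    Implicitly compares points in a very high-dimensional feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

a = [1.0, 2.0]
b = [1.0, 2.0]
c = [4.0, 6.0]

print(rbf_kernel(a, b))                      # 1.0: identical points, maximal similarity
print(rbf_kernel(a, c) < rbf_kernel(a, b))   # True: distant points score lower
```

An SVM never computes the high-dimensional mapping explicitly; it only evaluates such kernel values between pairs of training points, which is why nonlinear decision boundaries remain tractable.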
Random forest is based on the decision tree algorithm, wherein multiple trees are built over the training data. Every tree is associated with a sampled subset of the attributes related to the problem [11]. Each individual tree predicts a class based on the training set; a random forest is built from all of these decision trees, and the forest's prediction is based on the predictions of the individual trees. Random forests are a learning technique used for both classification and regression. In a practical scenario, suppose there are N training samples and we use the bootstrap method, which samples with replacement from the training dataset; the training set for each tree contains about two-thirds of the distinct samples of the original dataset. We then construct a classification and regression tree (CART) [1] for every bootstrap dataset, resulting in one decision tree per bootstrap set. If there are M features in each input vector, then a random subset of m features is considered at each split for the classification. The results of all the bootstrap dataset classifications are aggregated, forming the final classification of the random forest.
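The aggregation step can be sketched as a majority vote over per-tree predictions; here the individual tree outputs are hypothetical fixed values rather than the result of actually training trees.

```python
from collections import Counter

def forest_predict(tree_predictions):
    """Aggregate individual tree votes into the random forest's prediction."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]

# Hypothetical predictions from five trees for one sample:
# three trees vote "diseased", two vote "healthy".
trees = ["diseased", "healthy", "diseased", "diseased", "healthy"]
print(forest_predict(trees))  # diseased
```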
In proteomics, which involves a large number of features per dataset, multiple decision trees can be built and classified using the random forest algorithm, considering only the relevant and important features in each dataset. In 2003, the classification performance on an ovarian cancer dataset was measured across classifiers including bagging and boosting, SVM, and the random forest algorithm; using random forest along with feature selection, the accuracy of the overall classification increased to 92%. In 2004, the random forest algorithm was applied with 1000 trees to a sample of 100 spectra, each spectrum containing 138 peaks; the error rate came to 32%, sensitivity to 76%, and specificity to 64%. Later, standardization and de-noising of the datasets were used to classify data from whole-organism bacterial specimens; using these techniques along with random forest improved the accuracy of the entire classification.
In 2008, random forest [12] was implemented to classify proteomic profiles obtained by mass spectrometry. The dataset was collected from 76 breast cancer patients; random forest classified these samples into target classes with a specificity of about 86% and a sensitivity of 82%. Later, in 2010, random forest was used on a phosphorylation dataset, with a nonlinear random forest classification combined with a discrete mapping approach. In 2014, random forest was used to classify N-glycopeptides using mass spectral features. In 2016, SVM and random forest were used to classify the geographical origins of 31 white rice samples, with accuracies of 93.66% and 93.83%, respectively. In 2017, mass spectrometry with random forest, SVM with a radial basis kernel, C5.0, averaged neural networks, and kNN yielded superior results, with an accuracy of 95%. The random forest model is more robust to outliers and mitigates overfitting, making it more suitable for use on large datasets.
The nearest neighbor algorithm classifies a given data point by assigning it a label based on the data points nearest to it. This is the basis of the k-nearest neighbor (kNN) algorithm.
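A minimal sketch of the nearest-neighbor rule with k votes; the points and labels are made-up 1-D examples.

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points.
    `train` is a list of (value, label) pairs; distance is absolute difference."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 1-D training data: low values are "control", high are "case".
train = [(0.1, "control"), (0.3, "control"), (0.2, "control"),
         (0.9, "case"), (1.1, "case"), (1.0, "case")]

print(knn_classify(train, 0.25))  # control
print(knn_classify(train, 0.95))  # case
```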
This algorithm classifies a pattern by asking and answering a set of questions related to the sequencing of the data points until a directed tree, called the classification tree, is obtained. A root node is located at the top and is connected, or directionally linked, to the other nodes [8]. This linkage goes from the root down to the leaf nodes of the tree. Classification of a pattern starts from the top of the tree and continues until the property of the pattern is satisfied. The different paths traversed in the tree lead to the different patterns that can be derived from the classification tree. In a classification tree, the links leaving a node are mutually exclusive, meaning only one link can be traversed from one node to the next. The next step is to make a decision at the subsequent node, which becomes the root of the chosen subtree; this is repeated until a leaf node is reached. The leaf node has a categorical label associated with it, and the pattern traversed through the tree is assigned the label reached on traversal.
The classification tree progressively splits the labeled training dataset into smaller and smaller subsets. In an ideal situation, the samples of a subset would all bear the same categorical label; such a subset is considered pure, and the process of dividing it further is terminated. In a practical scenario, however, there is a mixture of labels in each subset of the classification tree, so rather than splitting into ever smaller trees, we may have to accept some misclassification and some impurity in the decision. Tree growing is a recursive process, from the root to the leaves, of finding the node where the best possible split can be made. Measures such as the Gini index and the gain ratio are used to compute the best and most appropriate variable to split on at each level of the tree.
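The split-quality measures mentioned above can be made concrete; the sketch below computes Gini impurity and picks the threshold on a single hypothetical 1-D feature that minimizes the weighted impurity of the two child subsets.

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 = pure subset)."""
    n = len(labels)
    if n == 0:
        return 0.0
    props = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in props)

def best_split(values, labels):
    """Try a threshold between each pair of adjacent values; return the one
    minimizing the size-weighted Gini impurity of the two children."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v < thr]
        right = [l for v, l in pairs if v >= thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        best = min(best, (score, thr))
    return best

# Hypothetical peak intensities with class labels.
values = [0.2, 0.4, 0.5, 1.4, 1.6, 1.8]
labels = ["A", "A", "A", "B", "B", "B"]
score, threshold = best_split(values, labels)
print(score)      # 0.0: both children are pure
print(threshold)  # 0.95: halfway between 0.5 and 1.4
```

Recursively applying `best_split` to each child subset, and stopping when a subset is pure (or acceptably impure), is exactly the tree-growing process described above.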
In proteomics, this technique is widely used to classify datasets at appropriate variables by building the classification tree from the entire dataset. In 2003, a new classification tree technique was adopted to discriminate proteins using mass spectrometry [12], obtained by projecting the dataset with the wavelet transformation and then constructing the feature space with a feature selection technique. A recursive classification tree algorithm can be used to partition the feature space by dividing the sample into much smaller samples. In 2011, a classification tree model was used to discriminate patients with pulmonary tuberculosis from patients without tuberculosis, with a sensitivity of 98% and a specificity of 85%. Later, in 2014, a classification decision tree was proposed to classify acute leukemia spectra, which were further divided into five subsample groups. A maximal tree was obtained from a root node, and the proteomics-based classification was found to be consistent with the MIC-based classification technique; a comparatively good accuracy rate was obtained when this classification technique was adopted.
Machine learning algorithms have played a vital role in proteomics with respect to the classification of various proteins. Algorithms such as nearest neighbors were used in the past for predicting the secondary structure of proteins; classification trees have also been used for protein secondary structure prediction. Later, two-stage methods involving a support vector machine and a Bayesian classifier were used to predict the surface residues of proteins, which are widely visible in protein–protein interactions [8]. Another problem, automatically predicting protein sub-cellular locations using only the sequence, has been analyzed using the fuzzy k-nearest neighbor algorithm.
Another application where machine learning plays a vital role in proteomics is the extraction and analysis of genetic information from the Human Genome Project [16]; this topic is of growing importance, given the amount of research devoted to it. The analysis of proteins and their structures is specifically used in protein purification and mass spectrometry. Here, we briefly discuss how various supervised machine learning techniques are used in the field of proteins, for analyzing and understanding proteins better and for solving problems in this field using classification and statistical models.
Mass spectrometry is used to generate protein profiles of various body fluids such as saliva, urine, or serum. These measurements help in the diagnosis of various diseases, as the structure of the proteins varies between diseased and healthy patients; they also help in monitoring patients' responses to various medicines or drugs. The data acquired from patients is typically high-dimensional, consisting of several thousand variables, with relatively few samples. In these cases, classical statistical techniques such as linear discriminants and neural networks can be used to reduce the dimensions of the data. An alternative is to use advanced machine learning techniques such as kernel-based methods and tree-based ensembles to explore the datasets without any prior feature extraction and elimination.
5.1.1 Problem
The main objective of this section is to identify biomarkers of a particular disease from proteomics MS datasets, by discriminating between certain classes of the disease or by recording responses to a particular treatment. Predictive models are constructed to explore the biomarkers present in the datasets, helping in the diagnosis of diseases. The datasets are collected from past biological samples of various patients, classified based on various factors, and later processed by a mass spectrometer. The mass spectrometer provides more accurate data through its signal intensities. The number of variables, or the dimensionality of the data acquired, is very large relative to the number of patients. The choice of machine learning algorithm is decided based on the dimensionality of the dataset.
This experiment is conducted using two SELDI-TOF-MS datasets obtained from serum samples of patients. The aim of the experiment is to detect patients suffering from inflammatory disease. The samples were collected at the University Hospital of Liege from 2002.
Various techniques can be used to extract information from the datasets; supervised learning is applied based on samples described by input variables and output information. The goal of the entire process is to extract a synthetic model that can predict the output information from the input variables. A learning algorithm is used to construct the classification model. Various algorithms can be used for this process, such as neural networks, discriminant analysis, or decision trees. Based on the specificities of the application, an algorithm is picked according to the number of samples and how informative the variables are, in order to determine the biomarkers.
Decision tree techniques fit these problem characteristics well. The complexity of these methods is linear with respect to the number of input variables, and they cope when the input space dimensionality is greater than the number of samples, as many input variables are not relevant. They have better time complexity than the other supervised algorithms considered here. Being an exploratory approach, a subset of the important variables, i.e., the biomarkers, is identified easily. The basic decision tree can be integrated with other ensemble approaches such as bagging and boosting, producing better algorithms with hybrid computational and functional properties.
5.1.4 Method
Data that is still noisy after the mass spectrometer's cleaning should be filtered before applying any machine learning technique. The noise is removed using a simple m/z discretization algorithm; using a small m/z discretization step provides cleaner data points.
Model Construction
For a single decision tree, the CART algorithm is used with cost-complexity pruning. Several decision trees combined through ensemble methods are used to reduce variance and bias, improving accuracy and reliability. Notable studies have compared various tree-based ensemble methods, and no method has been found to consistently outperform another. In this problem, four methods are applied in parallel; the methods compared in the experiments are:
Bagging: Trees are built with the CART algorithm from bootstrap samples, and all tree predictions are aggregated by majority vote.
Random forest: Derived from bagging; at every node, k attributes are chosen at random, among which the split of the tree is determined.
Extra trees: The complete learning set is used at each node, and the best among k randomly generated splits is selected.
Boosting: Trees are built sequentially using CART, increasing the weights of misclassified learning set samples.
The models are validated by removing one sample from the learning set, training on the remaining n − 1 samples, and repeating this for each sample in turn. In practical scenarios, the best among several models is selected based on the cost of misclassification and the error rate. The factors on which the decision should be taken are:
Sensitivity: Percentage of samples from the target class that are correctly classified.
Specificity: Percentage of samples from the other class that are correctly classified.
Error rate: Percentage of samples that are misclassified by the model.
Based on the combination of the above three factors, the model is selected.
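The three selection factors can be computed directly from confusion-matrix counts; the numbers below are hypothetical.

```python
def model_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and error rate from confusion-matrix counts.
    tp/fn: target-class samples classified correctly / incorrectly;
    tn/fp: other-class samples classified correctly / incorrectly."""
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)      # target class correctly classified
    specificity = tn / (tn + fp)      # other class correctly classified
    error_rate = (fn + fp) / total    # all misclassified samples
    return sensitivity, specificity, error_rate

# Hypothetical counts: 40 diseased (32 caught), 60 healthy (54 recognized).
sens, spec, err = model_metrics(tp=32, fn=8, tn=54, fp=6)
print(sens)  # 0.8
print(spec)  # 0.9
print(err)   # 0.14
```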
Biomarker Selection
Biomarker identification is a procedure consisting of two steps, one after the other: first, the attributes are ranked in decreasing order of their significance; then, a subset of biomarkers is chosen using cross-validation.
Biomarker Selection
The optimal subset of biomarkers is selected based on the error estimates. A model is built with all relevant attributes, and the best subset is selected using cross-validation with the machine learning algorithm; the algorithm is then run using only the most important attributes. The accuracy of the model is computed to determine the model's accuracy curve, and the attribute subset with the maximum accuracy is chosen as the biomarkers.
Validation
Sensitivity, specificity, and error rate are widely used in eliminating attributes in model classification. The learning set consists of two or four replicas of each patient, and the repeated data points have to be removed. Data preprocessing by peak selection gives good results. Boosting is applied to improve the quality of the generated model; C4.5 is the base learner on top of which the ensemble algorithms are used, and data is preprocessed using discretization. Various methods such as kNN and SVM are compared: SVM gives better results than kNN, but the best tree-based method is better than SVM, making it the most suitable classification technique. This application has demonstrated a flexible and systematic method for extracting knowledge from a proteomic dataset. It has highlighted the use of supervised machine learning algorithms such as decision tree induction and various decision tree ensemble methods, along with pre- and post-processing stages. Various tools and methods are used to extract data from the dataset, and biomarkers are identified for use in clinical decisions.
electric noise and chemical noise which result in reducing the efficiency of the
classification. The motto of the preprocessing of the data is to purify the data.
Feature selection is used to identify the relevant features present in the dataset, and
support vector machine recursive feature elimination (SVM-RFE) is then used to select a
small subset of the input attributes with which to classify and build the learning model.
A distance metric learning technique is also used in the classification of the proteomic
mass spectrometry data points. The results of the two algorithms were compared and found
to be equivalent. SVM-RFE is chosen over plain SVM because the former performs feature
subset selection recursively, filtering out all but the most important attributes.
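The recursive elimination idea can be sketched compactly. In the sketch below, a crude mean-separation score stands in for the SVM weight vector; this scorer is an assumption for illustration only, whereas real SVM-RFE ranks features by the squared weights of a trained linear SVM:

```python
def feature_weight(col, labels):
    """|difference of class means| / feature range -- a crude, hypothetical
    stand-in for the weight a linear SVM would assign to this feature."""
    pos = [v for v, y in zip(col, labels) if y == 1]
    neg = [v for v, y in zip(col, labels) if y == 0]
    spread = (max(col) - min(col)) or 1.0
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg)) / spread

def rfe(data, labels, n_keep):
    """Recursively drop the lowest-weighted feature until n_keep remain."""
    remaining = list(range(len(data)))
    while len(remaining) > n_keep:
        weights = {j: feature_weight(data[j], labels) for j in remaining}
        remaining.remove(min(remaining, key=lambda j: weights[j]))
    return remaining
```

With real SVM-RFE, `feature_weight` would be replaced by retraining the SVM on the surviving features at every iteration.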
5.2.1 Dataset
For the ovarian cancer dataset, the LMNN classifier with energy-based classification
achieves an accuracy of about 99.3%. LMNN with the Euclidean distance achieves an
accuracy of about 84.6%, and LMNN with the Mahalanobis distance achieves an accuracy of
99%. When the same set of algorithms is applied to the breast cancer dataset, LMNN with
energy-based classification provides an accuracy of 81.8%, LMNN with the Euclidean
distance achieves 84.6%, and LMNN with the Mahalanobis distance achieves 81.7%. Overall,
LMNN with energy-based classification provides the best accuracy in model classification
compared to the other LMNN variants.
Comparing the results of these two classification approaches, we can conclude that applying
210 V. P. Kiranmai et al.
the LMNN classifier based on energy classification and the Mahalanobis distance yields
better results than applying SVM to SVM-RFE feature sets. LMNN classifiers with the
energy classification and the Mahalanobis metric give superior results in the
classification of proteomics data. Classification results were also compared when LMNN
was applied to the feature sets chosen by SVM-RFE, and it was concluded that the smaller
number of features yields better results in each experiment compared to the use of SVM.
This experiment compared two algorithms, a nearest neighbor classifier based on
supervised distance metric learning and support vector machines, for the classification
of a mass spectrometry proteomic dataset. Comparing the results of the two algorithms, it
is concluded that applying the distance metric learning algorithm to the proteomic
dataset yields superior results in comparison to the support vector machine. Further,
this experiment can be extended to the use of distance metric learning in feature
selection for the datasets.
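The Mahalanobis distance mentioned above generalizes the Euclidean distance by scaling with the inverse covariance matrix. A minimal two-dimensional sketch, with an explicit 2x2 matrix inverse for illustration only:

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from a mean, given a 2x2 covariance
    matrix: d = sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    a, b = cov[0]
    c, d = cov[1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # closed-form 2x2 inverse
    dx = [x[0] - mean[0], x[1] - mean[1]]
    d2 = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
          + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(d2)
```

With the identity covariance this reduces to the ordinary Euclidean distance; a larger variance along one axis shrinks distances along that axis.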
6 Conclusion
References
1. Can T (2013) Introduction to bioinformatics. Part of the Methods in Molecular Biology book
series (MIMB, vol 1107), pp 51–71. https://doi.org/10.1007/978-1-62703-748-8_4
2. Lesk AM (2019) Bioinformatics. https://www.britannica.com/science/bioinformatics
3. Yee A, Pardee K, Christendat D, Savchenko A, Edwards AM, Arrowsmith CH (2003) Structural
proteomics: toward high-throughput structural biology as a tool in functional genomics. https://
doi.org/10.1021/ar010126g
4. Introduction to Proteomics, Wikibooks. https://en.wikibooks.org/wiki/Proteomics/
Introduction_to_Proteomics, 2017
5. Center for Proteomics and Bioinformatics, Expression Proteomics, Western Reserve University,
Cleveland, Ohio. http://proteomics.case.edu/proteomics/expression-proteomics.html, 2010
6. Center for Proteomics and Bioinformatics, Interaction Proteomics, Western Reserve University,
Cleveland, Ohio. http://proteomics.case.edu/proteomics/interaction-proteomics.html, 2010
7. Yokota H (2019) Applications of proteomics in pharmaceutical research and development.
Appl Proteomics Pharm Res Dev
8. Larranaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armananzas R,
Santafe G, Perez A, Robles V (2005) Machine learning in bioinformatics. Brief Bioinform
9. Artificial intelligence boosts proteome research, Technical University of Munich (TUM).
https://www.sciencedaily.com/releases/2019/05/190529113044.htm, 2019
10. Strimbu K, Tavel JA (2010) What is a biomarker? Curr Opin HIV AIDS 5(6):463–466
11. Swan AL, Mobasheri A, Allaway D, Liddell S, Bacardit J (2013) Application of machine learn-
ing to proteomics data: classification and biomarker identification in postgenomics biology.
OMICS: J Integr Biol
12. Fan Z, Kong F, Zhou Y, Chen Y, Dai Y (2018) Intelligence algorithms for protein classification
by mass spectrometry. BioMed Res Int
13. Sampson DL, Parker TJ, Upton Z, Hurst CP (2011) A comparison of methods for classifying
clinical samples based on proteomics data: a case study for statistical and machine learning
approaches. PLoS ONE 6(9):e24973. https://doi.org/10.1371/journal.pone.0024973
14. Dimensionality reduction, Wikipedia. https://en.wikipedia.org/wiki/Dimensionality_
reduction, 2016
15. Sharma P (2018) The ultimate guide to 12 dimensionality reduction techniques (with Python
codes). https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-
python/
16. Naveenkumar KS, Mohammed Harun Babu R, Vinayakumar R, Soman KP (2018) Protein fam-
ily classification using deep learning. Center for Computational Engineering and Networking
(CEN)
17. Geurts P, Fillet M, de Seny D, Meuwis M-A, Malaise M, Merville M-P, Wehenkel L (2005)
Proteomic mass spectra classification using decision tree based ensemble methods. Oxford
Academic
18. He B, Zhang B (2013) Discovery of proteomics based on machine learning. Beihang University
19. Liu Q, Qiao M, Sung AH (2008) Distance metric learning and support vector machines for
classification of mass spectrometry proteomics data. In: Seventh international conference on
machine learning and applications
Visualizing Codon Usage Within
and Across Genomes: Concepts and Tools
1 Introduction
The sheer complexity and diversity of life forms on Earth hinge on a genetic code. The
latter is a set of rules that defines how a nucleotide sequence (information) is converted
into proteins (the workhorses of cellular metabolism) in the course of ribosomal protein
synthesis. In mRNA, four nucleotide bases can form 64 possible trinucleotide sequences,
or codons. Each codon corresponds to a specific amino acid (sense codons) or a stop
signal in the process of protein synthesis. There are 61 sense codons, while natural
proteins are built of only 20 amino acids. Hence, the extant genetic code could, theoret-
ically, encode three times as many different amino acids. Elucidation of the
origin of the genetic code and the driving forces behind the evolution of protein-coding
sequences is of great fundamental and applied interest. Much has been achieved in
these areas since the discovery of the mRNA triplet structure over 50 years ago, leading
to a refined set of algorithms and tools for the analysis of codon evolution [1].
Wide adoption of next-generation sequencing technologies since 2005 [2–4] has
led to tremendous growth of nucleotide sequence databases, allowing the application of
these tools to virtually any biological problem. Larger datasets might indeed provide
insights into the function and evolution of coding sequences. Yet, they pose nontrivial
challenges that require an educated choice of analytical tools and care in the interpre-
tation of results. As analysis of protein-coding sequences at the DNA and amino acid
B. Ostash (B)
Department of Genetics and Biotechnology, Ivan Franko National University of Lviv, Lviv 79005,
Ukraine
e-mail: [email protected]
M. Anisimova
School of Life Sciences and Facility Management, Institute of Applied Simulations, Zurich
University of Applied Sciences ZHAW, 8820 Wädenswil, Switzerland
Swiss Institute of Bioinformatics, Lausanne, Switzerland
levels has historically dominated the field, for a non-expert it is not always obvious what
advantages codon-oriented methods would bring. Therefore, modeling of sequence
data of high volume and complexity requires, in our opinion, more user-friendly
applications that represent the data in a visually informative way. This
should foster growing interest in codon-based studies and, most importantly,
improve or even guide our understanding of the biological problem behind the num-
bers. In this, we believe our thoughts resonate with the current state of the entire
field of biomedical data analysis [5]. This chapter aims to introduce the reader to the
spectrum of approaches available for the study of protein-coding sequences at the codon
level. Below, we describe the basic concept of each approach, list relevant visual-
ization tools and outline possible directions for future development. Throughout
the text, we emphasize the visualization part of the problem; the reader is referred
to excellent and up-to-date literature where extensive treatment of the conceptual and
mathematical issues of each approach is offered. We start with a description of the
concept of the genetic code. Although already a textbook paradigm, it remains
an area of intense research and ingenious visualization efforts. A description of pro-
grams for k-mer analysis will lay the ground for several sections devoted to codon
analysis tools, such as codon context discovery, codon indices, substitution models
and methods for the estimation of selective pressure. Future prospects and challenges for
codon-based studies and tools will conclude the chapter.
2 Genetic Code
Fig. 1 The table of the canonical genetic code provides information on the amino acid assigned to
each codon. The initiator methionine codon is shown in green
Four codons sharing the first two letters and differing in the third one (the synonymous
position) are labeled a codon box. Split codon boxes are those encoding two amino acids
(or amino acid(s) plus a stop codon; see Fig. 1). Un-split codon boxes encode a single
amino acid (e.g., Pro, Ala, Gly). The organization of the genetic code table is far from
random. In particular, amino acids similar in their physical and chemical properties tend
to group together in the table. The second codon position is a key determinant of codon
assignment. This largely confers the error minimization property on the genetic code,
as substitutions or mistranslation in the synonymous position do not change codon
meaning, while substitutions in the first position switch one amino acid to a related
one. A number of other regularities in the natural genetic code prompt additional theo-
ries about how these regularities might help maintain biological function, of which
minimization of frameshift errors during translation is perhaps the most salient [8].
The most widely used version of the genetic code is referred to as universal, standard
or canonical. Currently, over 20 variations of the canonical genetic code are known [9,
10]. Reassignment of the usual stop codons (UAA, UAG and UGA) to certain amino
acids is typically encountered within noncanonical genetic codes. For example, in
the mammalian mitochondrial genome, UGA is decoded as Trp, while the canonical
Arg triplet AGG serves as a stop codon. In a few ciliate genomes, all stop codons
are simultaneously used as sense codons, and termination of translation is context-
dependent [11]. In some yeast species, the canonical Leu codon CUG is reassigned to
serine. However, in certain species, such as Candida albicans or Ascoidea asiatica,
this codon can be read as both leucine and serine. This leads to a rather unique situa-
tion where Ser and Leu residues are incorporated randomly into proteins. An mRNA
carrying n CUG codons would give rise to 2^n different proteins. This causes a stochastic
heterogeneity of proteins within yeast cells (a so-called statistical proteome [12]) and,
consequently, phenotypic variation of the latter. The aforementioned examples attest
to the evolvability of the genetic code [13].
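The canonical table and the mitochondrial reassignments above are easy to encode programmatically. A small Python sketch: the 64-character string is the standard NCBI translation table 1, with codons enumerated in UCAG order (first base cycling slowest):

```python
from itertools import product

BASES = "UCAG"
# Standard code (NCBI transl_table=1): UUU, UUC, UUA, UUG, UCU, ...
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AA)}

SENSE = [c for c, aa in CODON_TABLE.items() if aa != "*"]
STOPS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

# Noncanonical variant from the text: mammalian mitochondria decode UGA as Trp,
# while the Arg triplets AGA and AGG serve as stop codons.
MITO = dict(CODON_TABLE)
MITO.update({"UGA": "W", "AGA": "*", "AGG": "*"})
```

The counts fall out immediately: 64 codons, 3 stops, 61 sense codons, 20 distinct amino acids.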
The decoding of mRNA is done by ribosomes and requires transfer
RNAs (tRNAs). The latter are small (usually 76 nt), highly structured and post-
transcriptionally modified molecules serving as a bridge between the nucleic acid and
protein worlds. The aminoacyl stem (AAS) and the anticodon stem loop (ASL) are two
prominent parts of tRNA with distinct biological roles. The AAS of each tRNA is
specifically charged with an amino acid by aminoacyl-tRNA synthetases (aRS),
yielding aa-tRNA. The ribosome provides an environment where formation of a mini-
helix between the codon of mRNA and the anticodon of the cognate aa-tRNA is pro-
moted, a crucial checkpoint prior to peptide bond formation [14]. In the vast
majority of organisms, the number of tRNA species is smaller than the number of
sense codons, e.g., 46 in Escherichia coli, 42 in Saccharomyces cerevisiae and 29
in Mycoplasma capricolum. Some tRNAs, referred to as isoacceptors, are
able to recognize more than one synonymous codon. Sets of tRNAs having different
anticodons but charged with the same amino acid are known as isoacceptor fami-
lies. The first two positions of a codon form Watson–Crick pairs with bases 36 and 35,
respectively, of the ASL (e.g., G·C, C·G, U·A, A·U). The third codon base, referred to
as the wobble position, may form atypical hydrogen bonds with the 34th position of
the anticodon (e.g., U·G, G·U, I·A/U/C), thanks to extensive modifications of the hete-
rocycle and carbohydrate portions of bases in the 34th and 37th positions of the ASL [15]
(Fig. 2).
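The antiparallel codon-anticodon pairing described above means that the Watson-Crick anticodon, read 5'-3', is simply the reverse complement of the codon. A one-function sketch:

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def anticodon(codon):
    """Watson-Crick anticodon read 5'->3': the reverse complement of the codon.
    Because the strands are antiparallel, the anticodon's 5'-most base (tRNA
    position 34, the wobble base) pairs with codon position 3."""
    return "".join(COMPLEMENT[b] for b in reversed(codon))
```

For example, the Leu codon UUG is read by the anticodon CAA; wobble and modified-base pairings (U·G, I·A/U/C) extend this strict one-to-one scheme.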
It has to be noted, however, that the importance of post-transcriptional tRNA
modifications goes far beyond ensuring wobble interactions [16–18]. This notion is
supported by the observation that even genomes carrying very large sets of tRNA
genes (e.g., human cells express around 300 different cytoplasmic tRNAs) and pos-
sessing a tRNA specific for each codon still maintain a sophisticated tRNA modifica-
tion machinery [19, 20]. To summarize, three layers of code can be distinguished
in the process of gene expression. The first is the genetic code that determines
the meaning of mRNA codons. The second is embodied by the 20 aminoacyl-tRNA
synthetases that control the correct charging of tRNAs. Post-transcriptional tRNA modi-
fications constitute the third code, governing correct charging of tRNA, decoding and
protein synthesis in general. For example, returning to the C. albicans case, the ambiguous
reading of CUG codons in this species is enabled at two levels. First, the tRNA-Ser(CAG)
is charged with serine (97%) and leucine (3%) because both Ser-aRS and Leu-aRS
Fig. 2 Triplet structure of protein-coding genes: basic terms. a Three consecutive bases of a codon
are numbered 1, 2 and 3. In tRNA, anticodon positions are 34, 35 and 36. Mini-helix forma-
tion between a codon and its cognate anticodon proceeds in an antiparallel fashion; i.e., the two
nucleotide strands run in opposite directions, 5′ to 3′. b The correct codon–anticodon interaction
is promoted and stabilized by post-transcriptional modifications (red ovals) of nucleoside residues
within the anticodon stem loop of tRNA. For example, modification of the uridine in position 34
allows it to form a non-Watson–Crick pair (dashed line) with the guanine in the wobble (third)
position of codon UUG. Modifications of nucleosides outside the anticodon (such as position 37)
are also important for the decoding process
The table of the genetic code (Fig. 1) is one of the earliest visual representations of codon
organization, as is the wheel of the code (Fig. 3). Both depictions provide the reader with
information on the amino acid encoded by each codon. It is customary to present the
bases in the code in the following order: U, C, A, G. Departures from this order lead to
novel codon groupings, some of which suggest evolution of the triplet code from a
doublet one, and point to roles for sixfold degenerate codon boxes in balancing the GC
content of prokaryotic genomes [22, 23]. Codons of the genetic code can also be arranged
to illustrate other aspects of the process of gene expression. It is known, for example,
that the aRSs involved in tRNA charging fall into two structurally disparate classes, I
and II, and the correspondence between aRS class and codons can be visualized in the
form of hypercubes [24].
Fig. 3 Visual representations of genetic codes: different ways for different purposes. The textbook
version in the form of a wheel (a) provides information on the amino acid assigned to each codon. A
triplet of mRNA (5′ → 3′; read from the inside) is assigned to one of the 20 canonical amino acids or
a stop codon. Chemical structures of amino acid side chains are shown on the outside. The natural
expansion of the genetic code by selenocysteine (Sec) at the opal and pyrrolysine (Pyl) at the amber
stop codons is depicted (shown in red). Codons can be arranged in four-dimensional cubes
(b) according to their relatedness and the type of aminoacyl-tRNA synthetase (class I, red; class II,
black) used for their charging. Analogously, codons can be sorted in n-dimensional graphs
according to other properties of triplets or of the respective amino acids. Part a is from [25]; part b
was adapted from [24]; https://creativecommons.org/licenses/by/4.0/
Fig. 4 Side-by-side representation of the codon table and the corresponding anticodons within the
set of tRNAs expressed in a given taxon (archaea in this case). The asterisks indicate post-
transcriptional tRNA modifications of the anticodon's first position that are necessary to
unambiguously read the respective codon box. Blue shading marks an example of a triplet whose
decoding is enabled by post-transcriptional modification of the anticodon. A question mark instead
of the AGA codon implies that it is still debatable which anticodon allows unambiguous reading of
AGA. This figure was reproduced from [26], https://creativecommons.org/licenses/by/4.0/
Comprehensive knowledge of such modifications is currently available only for a few
model organisms, limiting the usefulness of the proposed visualization to yeast, E. coli,
human mitochondria and a few archaebacteria. Many tRNA modifications are reversible
and depend on the nutritional status of the cell [28–30]. By analogy to transcriptional
regulation, a codon can be viewed as a cis-acting element of the regulatory device, and
tRNA as a trans-acting one. The interplay between these elements creates a continuum of
regulatory opportunities for the cell to respond to internal and external signals via
modulation of the efficiency of codon reading. Current studies support the notion that
adaptive translation is an essential feature of any biological system [31], yet it will be
a challenge to predict and visualize this aspect of tRNA biology in current models.
To summarize, visualization of the genetic code is a vibrant research area. In recent
years, the challenge has shifted from depiction of the encoding principle (which is
nearly universal) to the incorporation of decoding strategies into the picture. This
leads from the universal genetic code to species-level models. To fully embrace this
approach, we still need to learn a lot about tRNA and mRNA biogenesis. Interestingly,
after over 50 years of studies, genetic code visualization remains one of the least
digitized aspects of molecular biology, because the paradigm of the universal genetic
code leaves no room for automated analysis. Nowadays, the aforementioned efforts
toward the inclusion of decoding details open the door to automated analysis, for
example through database mining for information about codon frequency, types of tRNA
modification enzymes and so on, and their conversion into human-readable diagrams. We
think that the development of specialized
Fig. 5 Saccharomyces cerevisiae wheel of code with four layers of information. First, it repre-
sents codon identity. Codons are read from the center to the edge of the wheel. Amino acids corre-
sponding to un-split codon boxes are shown in red. The conventional one-letter code for amino
acids is used; the asterisk stands for a stop codon. Second, it represents known post-transcriptional
modifications (PTMs) of uridine (U34) in the first anticodon position of S. cerevisiae tRNAs. The
types of U34 modifications are shown in boxed rectangles (mcm5s2U, 5-methoxycarbonylmethyl-
2-thiouridine; mcm5U, 5-methoxycarbonylmethyluridine; ncm5U, 5-carbamoylmethyluridine;
ncm5Um, 5-carbamoylmethyl-2′-O-methyluridine; ψ, pseudouridine). Third, it denotes the global
codon usage for each codon of S. cerevisiae (counts per thousand), inserted between the circle for
the third base and that for the amino acid type. Finally, four thick red lines divide the genetic
code into three sections according to the Turner energy of anticodon–codon mini-helix formation.
The bottom section consists of AT-rich codons having low Turner energy (weak codons); the top
section consists of GC-rich codons with strong mini-helix formation (strong codons); codons in the
middle section have intermediate values of free energy. This is a modified version of figure S4Bc
downloaded from http://www-ibmc.u-strasbg.fr/spip-arn/spip.php?rubrique296&lang=fr [27]
software for building such models would be of great help, for example, in the case of
industrially important organisms where optimization of gene expression is sought.
The four-letter alphabet of the genetic code forms 64 codons; the latter encode 20
amino acids plus the translation stop signal. Consequently, there are 21^64 (~4 × 10^84)
possible genetic codes. Additional constraints on theoretical genetic codes (e.g.,
a requirement to encode no fewer than 21 elements, or specific assumptions about a
primordial code) reduce this number, albeit not significantly. Several mutually non-
exclusive theories have been put forward to explain the origin and evolution of the extant
genetic code. The current consensus view is that the code arose as a combination of a ran-
dom event (a "frozen" accident) that fixed a certain proto-code, which further expanded
and evolved to support the vital cellular process of protein synthesis [10, 32]. Thus,
the natural genetic code is likely the result of an optimization process over evolutionary
timescales, and so it should be fitter than random codes with regard to certain param-
eters. It remains an open question what goals the code is optimized for. Various
aspects of protein synthesis, such as the rate or accuracy of translation, are usually
considered the primary "suspects." It is unlikely that the genetic code (and translation
machinery in general) was shaped by a single requirement. Hence, the prevailing
approach is to treat the evolution of the genetic code as a multi-parameter optimization
problem.
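The 21^64 figure follows directly from assigning one of 21 meanings (20 amino acids or stop) to each of the 64 codons independently; Python's arbitrary-precision integers make the check trivial:

```python
# Every assignment of 21 meanings (20 amino acids + stop) to 64 codons
# counts as a conceivable genetic code.
n_codes = 21 ** 64

# The exact value has 85 decimal digits, i.e., it is on the order of 10**84,
# matching the ~4 x 10**84 figure quoted in the text.
n_digits = len(str(n_codes))
```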
A number of studies have addressed the optimality of the natural genetic code by comparing
it to alternative genetic codes. The immense number of the latter has prompted
researchers to impose certain constraints on what is considered a valid code. In
one of the simplest scenarios, the natural and alternative codes have the same
number of codons per amino acid and the same impact of codon misreading. This was
achieved by independent permutation of nucleotides in the first two codon positions
and by allowing only an A↔G permutation in the third position (to satisfy the wobble
rule stating that codons NNU/C cannot be distinguished by the translation machinery).
Thus, a set of 1152 (4! × 4! × 2) alternative codes was generated and compared
to the natural code with regard to two properties: the ability to encode arbitrary
sequences of fixed length (n-mers; see Fig. 6) and the number of sense codons read
by the ribosome upon a frameshift translation error [33].
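Under the stated constraints, the 1152 alternative codes can be enumerated explicitly. In the sketch below, each code is represented as three position-wise base relabelings; this representation is our own illustration, not the notation of [33]:

```python
from itertools import permutations

BASES = "UCAG"

# Positions 1 and 2: any of the 4! relabelings of the four bases.
# Position 3: identity or the A<->G swap only, so that the wobble rule
# (NNU and NNC are indistinguishable to the ribosome) is respected.
pos3_maps = [str.maketrans("UCAG", "UCAG"),   # identity
             str.maketrans("UCAG", "UCGA")]   # A <-> G swap

def remap(codon, p1, p2, t3):
    """Apply one alternative code (a triple of relabelings) to a codon."""
    return (p1[BASES.index(codon[0])]
            + p2[BASES.index(codon[1])]
            + codon[2].translate(t3))

alternative_codes = [(p1, p2, t3)
                     for p1 in permutations(BASES)
                     for p2 in permutations(BASES)
                     for t3 in pos3_maps]
```

The enumeration yields exactly 4! x 4! x 2 = 1152 codes, the set size used in the study.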
The probability of encoding arbitrary n-mers is higher for the natural genetic code
than for the vast majority of alternative codes. The longer the n-mer, the higher the
probability for a natural-code-based mRNA to carry such a sequence. This is thought
to stem from the fact that natural stop codons do not overlap with each other, yet they
can arise, through frameshifts, from adjacent codons (within mRNA) for abundant amino
acids. It is therefore possible that the aforementioned property of the real code arose
as a side consequence of selection for minimization of the effects of frameshift errors.
Indeed, using the natural genetic code, a ribosome would translate, on average, 15 codons
out of frame prior to reaching a stop codon. In the case of alternative codes, this
number is 23 codons. Compared to the real code, alternative ones would thus lead
to measurable fitness defects because of longer and potentially more toxic protein
products. The superiority of the real code comes from the fact that its stop codons can
be easily "hidden" within a sequence. The natural genetic code therefore appears
to display robust behavior in the face of frameshift errors. This collaterally allows
extra information unrelated to the protein-coding capacity of mRNA to be encrypted into
it (signals for DNA-binding proteins, microRNAs, splicing sites, etc.) [34].
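The out-of-frame read length is easy to explore by simulation. The sketch below assumes uniform codon usage, under which the expected number of sense codons before a stop is 61/3, about 20.3; the values of 15 and 23 quoted above arise only once realistic amino acid abundances are taken into account:

```python
import random

STOPS = {"UAA", "UAG", "UGA"}

def codons_until_stop(rng):
    """Number of sense codons read in a random frame before a stop appears
    (uniform codon usage -- a simplification; real mRNAs are biased)."""
    n = 0
    while True:
        codon = "".join(rng.choice("UCAG") for _ in range(3))
        if codon in STOPS:
            return n
        n += 1

rng = random.Random(42)
mean = sum(codons_until_stop(rng) for _ in range(20000)) / 20000
# Geometric expectation: (1 - 3/64) / (3/64) = 61/3, roughly 20.3 codons.
```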
The aforementioned study indicates that the genetic code and amino acid abundances
are coadapted. However, whether the genetic code is adapted to amino acid pools
or vice versa still remains a point of debate [35]. Redundancy of the natural genetic
code is the key reason for its capacity to carry additional layers of information.
It is possible to elaborate other strategies for the judicious generation of sets of
alternative genetic codes, which can then be compared against the real one. However, even
10^9 codes would be a minuscule fraction of all theoretically possible variants. Such
an approach therefore cannot guarantee that we understand the properties of the real
genetic code on the basis of an analysis of the entire space of codes (or even a
significant part of it). If a more extensive search of this space were undertaken, the
conclusions might prove incorrect. For example, the natural genetic code appears to be
adapted to minimize the adverse effects of frameshifts arising through mutations
and/or translational errors. One can think of the natural code as "optimized" (or fit)
with regard to these criteria as compared to over 1000 alternative codes. However,
would the natural code still represent a global peak of fitness if we compared it against
10^9 codes? This is not a trivial question to address, given the astronomical number
of possible codes. There are no reliable analytic methods to search through such vast
spaces, and researchers often employ evolutionary algorithms to discover properties
of the genetic code. Briefly, the algorithm starts with an initial population of codes,
into which random changes are introduced. Novel codes are evaluated on the basis of
defined objective functions (e.g., a fitness value for each code is determined). Fitter
variants are taken to the next round of evolution and selection, and the procedure is
iterated at the researcher's will or until no further improvement of the fitness values
is observed. This
approach has recently been implemented to assess how the optimality of the genetic
code is influenced by all possible single-point mutations that lead from
one amino acid to another [36]. To deduce the fitness effects (costs) of such changes,
the authors chose eight different physicochemical properties (objectives) of amino
acids (from the AAindex database; https://www.genome.jp/aaindex/), such as isoelectric
point, polarity, hydrophobicity and molecular weight. The algorithm started with
2800 random codes (each encoding 20 amino acids; stop codons were the same
as in the real code). The codes were permuted via the genetic operator of mutation and
then evaluated, and a fraction of the top-optimized codes was taken to the next round.
The set of codes was replenished from an archive set to keep the size of the population
constant (2800). For each objective, the fitness of a code was calculated as the sum
of squared differences between the values of the amino acid indices encoded by each pair
of codons differing in one nucleotide:
$$F_i(\mathrm{code}) = \sum_{c_1, c_2 \in C} \left[\, p_i(c_1) - p_i(c_2) \,\right]^2 \qquad (1)$$
where i is an objective index, C is the set of all pairs of codons differing in a single
nucleotide, c_1 and c_2 are codons, and p_i(c_1) and p_i(c_2) are the values of index i
for the amino acids encoded by c_1 and c_2.
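Equation (1) is straightforward to implement. A sketch with a toy three-codon code and a made-up index for the test; skipping pairs that involve stop codons is our own assumption about how such pairs are handled:

```python
def single_nt_pairs(codons):
    """All unordered pairs of codons differing in exactly one nucleotide (set C)."""
    cl = list(codons)
    for i, c1 in enumerate(cl):
        for c2 in cl[i + 1:]:
            if sum(a != b for a, b in zip(c1, c2)) == 1:
                yield c1, c2

def code_cost(code, index):
    """F_i(code): sum of squared differences of the amino acid index over all
    single-nucleotide codon pairs; pairs involving stop codons are skipped."""
    return sum((index[code[c1]] - index[code[c2]]) ** 2
               for c1, c2 in single_nt_pairs(code)
               if code[c1] != "*" and code[c2] != "*")
```

For a full code, `code` would map all 64 codons to amino acids (with "*" for stops), and one such cost is computed per physicochemical objective.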
Each code is therefore represented by a vector of eight values, and the best codes
are those minimizing the costs of amino acid replacements. Using a Pareto evolutionary
algorithm, the authors went on to show that the real genetic code is very close to the
theoretical best codes; i.e., it minimizes the costs of amino acid replacements nearly as
well as the latter. Nevertheless, there are codes more optimal than the real one with
regard to as many as all eight objectives combined. One such theoretical code is
significantly different from the natural one: only three codons (ACC, ACA and AGC)
retained the same assignment in both codes, and the codon boxes for Ser and Thr each
consisted of 16 members. A general conclusion from evolutionary searches is that the
natural genetic code does not represent even a local optimum in the code space when a
limited number of criteria is used to evaluate code optimality [37, 38]. An inevitable
shortcoming of this conclusion is that the criteria used to compare the codes are not
necessarily those for which the natural code was optimized. Perhaps the real code would
achieve top optimality if more (or biologically more relevant) criteria were taken into
account.
It is possible to raise bacterial mutants that display a level of accuracy of ribosomal
protein synthesis higher than in the wild type, yet this accuracy comes at the expense
of a diminished growth rate [39, 40]. Likewise, the current focus on the robustness or
precision of the genetic code in the face of mutations or translational errors might be
misleading. This shortcoming notwithstanding, evolutionary algorithms show great
promise for elucidating the origin and properties of the genetic code. They allow
sampling and iterative selection of codes across large swathes of the search space
and might help uncover artificial variants with novel properties. Besides their
fundamental interest, the results of such studies may find use in ongoing efforts to
create orthogonal genetic systems [41, 42].
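The mutate-evaluate-select loop described above can be condensed into a few lines. The sketch below is a generic elitist scheme, not the Pareto algorithm of [36]; the swap mutation and all parameter values are illustrative assumptions:

```python
import random

def mutate(code, rng):
    """One 'genetic operator': swap the assignments of two random codons."""
    c = dict(code)
    a, b = rng.sample(sorted(c), 2)
    c[a], c[b] = c[b], c[a]
    return c

def evolve(code, fitness, generations=200, pop_size=20, keep=5, seed=1):
    """Minimal elitist loop: mutate everyone, rank by fitness (lower is
    better), retain the elite and refill the population from mutated elites."""
    rng = random.Random(seed)
    pop = [code] * pop_size
    for _ in range(generations):
        pop = sorted(pop + [mutate(p, rng) for p in pop], key=fitness)
        pop = pop[:keep] + [mutate(rng.choice(pop[:keep]), rng)
                            for _ in range(pop_size - keep)]
    return min(pop, key=fitness)
```

Because the best individual is always retained, the returned code is never worse than the starting one; a multi-objective (Pareto) variant would keep a front of non-dominated codes instead of a single ranking.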
Fig. 7 FramePlot visualizes potential open reading frames within GC-rich nucleotide sequences.
Here, a 4020-bp fragment of the landomycin A biosynthetic gene cluster from Streptomyces cyanogenus
S136 (accession no. AF080235) was used. The application calculates the frequency of G and C bases
in the third position of triplets in all six possible reading frames. The dashed line indicates the
overall GC content of the sequence (69%); colored lines indicate the % GC in the third position of
a triplet. Above the graph, the symbols "|", ">" and "<" indicate potential stop and start codons,
respectively. The application was accessed from http://www0.nih.go.jp/~jun/cgi-bin/frameplot.pl
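The quantity FramePlot draws, the GC fraction at third codon positions of a reading frame, can be computed in a few lines (a sketch; `frame` is 0, 1 or 2 on the forward strand, and the three reverse-strand frames would use the reverse complement):

```python
def gc_third_positions(seq, frame=0):
    """Percent G+C at the third position of each triplet in a reading frame;
    high values flag likely coding regions in GC-rich genomes."""
    thirds = [seq[i + 2] for i in range(frame, len(seq) - 2, 3)]
    gc = sum(1 for b in thirds if b in "GC")
    return 100.0 * gc / len(thirds)
```

A sliding-window version of this function, plotted per frame against the sequence coordinate, reproduces the colored curves of Fig. 7.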
A number of k-mer calculators are available as desktop tools. Their extensive col-
lection is present on OmicsTools Web site (https://omictools.com/search?q=k-mer%
20analysis), and here we will review the most notable examples that deal with dif-
ferent aspects of the nucleotide sequence analysis. Corseq is the software to identify
favored codons in RNA-seq data without the need for annotated genome sequence
[53]. The software estimates transcript abundance by k-mer counting. Sequence
element enrichment (Seer) and its Python-reimplemented version pyseer identify
sequence elements (9–100 nt long) significantly enriched in certain phenotypes [54].
Seer allows alignment-free inferences of association of a certain SNP with a pheno-
typic trait, such as antibiotic resistance. Jellyfish and Gerbil permit fast, memory-
efficient counting of k-mers, which can be visualized in the form of histograms.
KAnalyze and DSK are fast k-mer counters that use low memory (especially the
latter) and can be integrated into various sequence analysis pipelines. Tallymer uses
enhanced suffix arrays to count k-mers of varying lengths. This method can be used
to determine frequencies of various repeats in genomes. The microTaboo algorithm
offers an efficient solution to the problem of finding unique (disjoint) k-mers (e.g.,
subsequences of length W that differ by more than k mismatches). This has a number
of practical applications in the areas of SNP detection and selection of probes for
diagnostic purposes, etc. In contrast to the exact k-mer counters described above, ntCard
estimates k-mer frequencies in genomic datasets from a sample distribution. This is the
preferred approach when a researcher deals with high volumes (on a terabase scale)
of sequence data [55]. The Squeakr system is designed for either exact or approximate
counting of k-mers. There are a number of benchmark studies comparing the available
k-mer counters in terms of speed, memory efficiency, scalability, etc., so that
an interested reader can choose the tool most suitable for the problem at hand [56].
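The core idea shared by these exact counters can be sketched in a few lines of Python (an illustrative toy, not a substitute for the hashing and compact data structures that tools such as Jellyfish or DSK rely on):

```python
from collections import Counter

def count_kmers(seq, k):
    """Exact k-mer counting with a sliding window (illustrative only;
    dedicated tools achieve this with far better memory efficiency)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("ATGATGCAT", 3)  # "ATG" occurs twice
```

A histogram of k-mer multiplicities, as produced by Jellyfish or Gerbil, is then just a tally over `counts.values()`.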
All the progress and available options notwithstanding, there are ample opportunities
for further improvement in the area of k-mer analysis. Poor visualization of the obtained
data is a common shortcoming of the available approaches, as they most often
return results in tabular form. This is largely explained by the fact that the development
of new k-mer counters has focused on the correction of sequencing errors. Positional
information about a k-mer is lost; namely, it is not known whether the k-mer is more
frequent at the start of genes or toward their end. To the best of our knowledge, at the
4 Codon Indices
Due to the degeneracy of the genetic code, a single amino acid sequence can, theoretically,
be "spelled out" by an astronomical number of different codon sequences, of
which a single one is actually used (Fig. 9). Each species prefers certain synonymous
codons over others to encode proteins. This non-uniform usage of synonymous
codons is known as codon usage bias (CUB). In bacteria, CUB tracks with the GC content
of the genome [59] and with tRNA abundance [60] or the copy number of tRNA genes
[61]. The latter parameter also correlates positively with the CUB observed in highly
expressed genes of some animals [62]. CUB also influences mRNA and protein folding
[63, 64]. These observations point to the possibility that, in addition to neutral
processes (such as mutation and genetic drift), CUB is a result of translational selection;
namely, codon usage and tRNA pools are coadapted to increase the speed and/or
accuracy of protein synthesis [65–67]. It is therefore necessary to have quantitative
measures of CUB, as they could help us understand the mechanisms behind CUB and
optimize coding sequences for different applications. A multitude of such measures
Fig. 9 An example of a nonapeptide sequence (blue boxes at the top): This sequence can be encoded
by 6 × 3 × 4 × 6 × 2 × 4 × 6 × 4 × 2 = 165,888 different synonymous codon sequences. One
possible path through the space of codon sequences is highlighted with green boxes
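The count in the caption is simply the product of the per-residue codon degeneracies; a minimal Python check (degeneracies taken from the figure):

```python
from math import prod

# Synonymous-codon counts of the nine residues in the Fig. 9 example
degeneracies = [6, 3, 4, 6, 2, 4, 6, 4, 2]
n_variants = prod(degeneracies)  # number of distinct synonymous codon sequences
```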
have been proposed over the last three decades. Below, we will review those that are
most frequently used or which were described recently and thus are not covered in
the exhaustive 2012 review of CUB [1].
The measures of CUB are known under the broad term of codon indices. A codon
index employs a dedicated function to represent a certain (often quite narrow) aspect
of codon usage with a single number. The codon index of a gene is the combined
result of the contributions of its constituent codons. Any amino acid encoded by more
than one codon can exhibit bias. One-codon amino acids (Met and Trp) and stop
codons are excluded from the analysis of CUB. The contribution of each codon to
the index has to be carefully weighted to avoid the distortions caused, for example, by
overrepresented amino acids. The quantification of CUB may pursue different goals.
Several of the earliest indices were designed to reveal open reading frames (ORFs)
of a nucleotide sequence. The rationale behind these approaches is that an ORF
dominated by rare codons is unlikely to be protein-coding. Weakly expressed genes
carry fewer rare codons than can be expected from background nucleotide frequencies
[68]. Likewise, codon metrics can be employed to correct sequencing errors and
filter out random ORFs. As different species exhibit a different CUB, the indices
can be used to detect genomic regions that arose by horizontal gene transfer [69].
Conversely, if some genes or gene sets across different species show similar levels
of codon sequence conservation (as described by a certain index), then this can be
construed as evidence for coevolution of abundance of functionally related proteins.
Perhaps the most frequent use of codon indices is for the prediction or visualization
of protein expression levels. Highly expressed prokaryotic genes can be deduced from
calculated codon indices, and there are extensive databases where
such information is collected for some well-studied bacteria [70]. Different indices
can be classified with regard to the evolutionary force they illuminate. Some indices,
such as the effective number of codons, report on the combined effects of mutation and
selection. Many indices specifically describe translational selection on codon usage.
Methodologically, most codon indices fall into one of two categories. The first
category calculates the deviation of the codon usage of a gene of interest from an expected
codon distribution (e.g., one calculated from nucleotide frequencies). More often, indices
compare the codon usage of a gene to the codon usage of a reference set. Still, some
indices do not fit this dichotomy, while others can be used to study codon usage against
either the expected distribution or specific sets of reference genes. Essentially, the
“cottage industry” of codon indices is based on the exploration of virtually endless
ways to select experimental datasets and strategies of their comparative analysis.
The ever-blurring boundaries between different approaches toward quantifying CUB
lead us to abandon simplistic classification. The peculiar features of each index and
alternative ways of its implementation (if available) will be discussed shortly. We
adhere to the scheme described in Chap. 13 of [1] for the notation of indices in this
section.
This codon metric requires at least two sets of sequenced genes significantly differing
in expression; hereafter, we refer to them as highly and lowly expressed genes. Then,
RSCU is determined from codon frequencies for each synonymous codon of each
amino acid:
    r_ac = o_ac / [ (1/k_a) Σ_{c∈C_a} o_ac ]                    (2)

where o_ac is the count of codon c for amino acid a in a gene or set of genes, and k_a is the
number of synonymous codons for a. The RSCU value is the ratio of the observed number
of a codon to the number expected by chance: RSCU = 1 when there is
no synonymous codon usage bias, and RSCU > 1 when the codon is used more frequently
than average for the amino acid.
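Equation (2) can be sketched in Python for the synonymous family of a single amino acid (the codon counts below are hypothetical):

```python
def rscu(codon_counts):
    """RSCU per Eq. (2): the observed count of each codon divided by the
    mean count over that amino acid's k_a synonymous codons.
    `codon_counts` holds the synonymous codons of ONE amino acid."""
    k_a = len(codon_counts)
    mean_count = sum(codon_counts.values()) / k_a
    return {c: o / mean_count for c, o in codon_counts.items()}

# Hypothetical counts for the two Lys codons in some gene set
values = rscu({"AAA": 30, "AAG": 10})  # AAA used more often than average
```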
As a reference set, Sharp and coworkers in their original 1986 work chose
genes known to be highly expressed. Recently, Paulet et al. [71] proposed using
ribosome profiling data instead of codon counts when calculating RSCU. Ribosome
profiling, or Ribo-seq, determines how many times the ribosome occupies each codon
of a transcript. Thus, one here deals with counts of ribosome occupancy on different
codons, a measure of the mRNA translation level. The authors dubbed their version
of RSCU RSCU_RS and went on to show that the two indices strongly
correlate (r > 0.97) for a number of model species, such as the worm (Caenorhabditis
elegans) and the yeast (C. albicans). The authors calculated index values for a central
segment of transcripts, from the 20th to the 200th codon. Differences between RSCU and
RSCU_RS became more apparent when shorter ranges were used (20–50; 20–100).
This is a common property of all codon indices: some minimal number of codons
(80–100) is needed to extract reliable information. There are a number of advantages
to using Ribo-seq data. One can directly assess the significance of CUB for
translation; highly translated genes can be automatically selected even if they are not
functionally annotated; most importantly, there is a possibility to assess the influence
of different conditions (nutrition, tissue-specific aspects and so on) on translation and,
therefore, on RSCU_RS. Indeed, it has already been reported that in E. coli amino acid
starvation leads to reduced translation of abundant codons from sixfold degenerate
families (e.g., the leucyl codons CUU, CUA, CUC), while the reading of rare codons
(UUR) and of CUG remains robust [28]. Hence, RSCU_RS should, in principle, be
able to capture the dynamic nature of mRNA decoding efficiency, an aspect
of biology not assessable with the other indices. An overall consensus is that highly
translated genes are enriched with certain ("optimal") codons; this seems to agree
with the original work of Sharp and Li [72].
The empirical metric CRI was deduced from the aforementioned study [28]
on codon degeneracy lifting in the face of environmental perturbations. The basic
idea of this measure was articulated in earlier works showing that favored
codons are not those most abundant in the genome, but rather those whose
cognate tRNAs are most efficiently charged under amino acid starvation [73, 74].
The CRI describes the robustness of a protein synthesis rate to a limited supply of a
certain amino acid. The competition between tRNA isoacceptors for aminoacylation
is proposed to be the key determinant of the hierarchy of mRNA translation rates
during amino acid starvation. The index is computed by summing the log weights
ω_c of the codons that belong to the subset corresponding to the limiting amino
acid (c ∈ C_a_lim):
    CRI = Σ_{c∈C_a_lim} log2 ω_c                    (3)
The ωc values for Leu and Arg were calculated based on the robustness of synthesis
of a panel of yellow fluorescent proteins (YFPs) whose coding sequences carried
different combinations of synonymous Leu and Arg codons. If, for example, a YFP
coding sequence carries 22 Leu codons (7 CTA and 15 CTG), then, setting W_c = log2 ω_c,
the synthesis rate of this YFP variant is predicted to be 7·W_CTA + 15·W_CTG.
This line of reasoning was applied to find the ω_c values. For Leu codons, these were
as follows: CTG, 1; TTG, 0.91; TTA, 0.88; CTC, 0.67; CTT, 0.61; and CTA, 0.45. The
ω_c values for codons other than those cognate to the limiting amino acid were set to 1.
The authors calculated Z-scores for CRI to estimate the deviation of observed CRI values
from the expected ones. The latter were estimated on a dataset of 4.3 million random
sequences (each of the 4300 ORFs in the E. coli genome was permuted 1000 times).
Using 92 ORF-YFP translational fusions, the authors demonstrated that among several
indices of translational efficiency, such as CAI and tAI (see below), CRI shows the
highest correlation with experimental data. Although CRI appears to model quite a
narrow biological scenario, natural ecosystems are heavily dominated by nutrient-
limited conditions; human gut [75, 76] and soil [77, 78] are two well-known examples
of such ecosystems.
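A sketch of Eq. (3) in Python, using the Leu weights quoted above; the `cri` helper and its dictionary-based interface are our own illustrative constructs, not the authors' implementation:

```python
from math import log2

# Leu weights from the text; codons of non-limiting amino acids carry omega = 1
# and therefore contribute nothing to the sum.
omega = {"CTG": 1.0, "TTG": 0.91, "TTA": 0.88,
         "CTC": 0.67, "CTT": 0.61, "CTA": 0.45}

def cri(codon_counts, omega):
    """CRI per Eq. (3): sum of log2 weights over codons of the limiting amino acid."""
    return sum(n * log2(omega[c]) for c, n in codon_counts.items())

# The YFP example from the text: 7 CTA and 15 CTG Leu codons
score = cri({"CTA": 7, "CTG": 15}, omega)  # negative: CTA is starvation-sensitive
```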
Here, the most frequent codon will have ω_ac = 1, while the ω_ac values of all other
codons will be < 1.
Fop is the ratio of the number of optimal codons within a gene (or dataset) to the total
number of synonymous codons. This is perhaps the earliest codon measure and the
first to rely on reference information. Codon optimality can be derived
from the chemical properties of nucleotides or from genome parameters such as GC content,
CUB or the copy number of tRNA genes. The latter seems to be the most important
one, for the reasons mentioned at the beginning of this section: it permits defining
translationally optimal codons as those with the highest tRNA gene
copy number. Having defined the subset of optimal codons o_opt, one calculates Fop:
    Fop = o_opt / o_tot                    (5)
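A minimal Python sketch of Eq. (5), assuming the optimal-codon set has already been defined (e.g., from tRNA gene copy numbers); the counts below are hypothetical:

```python
def fop(codon_counts, optimal_codons):
    """Fop per Eq. (5): fraction of a gene's synonymous codons that are optimal."""
    o_opt = sum(n for c, n in codon_counts.items() if c in optimal_codons)
    return o_opt / sum(codon_counts.values())

# Hypothetical gene with 100 synonymous codons, 40 of them optimal
value = fop({"CTG": 30, "AAG": 10, "CTA": 40, "AAA": 20}, {"CTG", "AAG"})
```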
This is another reference-based measure that was used to locate protein-coding genes
and detect frameshift sequencing errors. One first calculates the likelihood ratio ω^P_ac:

    ω^P_ac = f_ac / e_ac                    (6)

where f_ac is the observed frequency of codon c for amino acid a in the gene, and
e_ac is the frequency of that codon expected from background nucleotide frequencies
in the genome (computed as the product of nucleotide frequencies at the three codon
positions: e_ac = b1 b2 b3).
To determine P for a gene, the above values (in the form of log-likelihoods)
are averaged over the entire length of the gene, L codons:

    P = exp[ (1/L) Σ_{i=1}^{L} log ω^P_ac(i) ]                    (7)
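Equations (6)-(7) amount to a geometric mean of per-codon likelihood ratios; a Python sketch with hypothetical frequencies:

```python
from math import exp, log

def p_index(obs_freqs, exp_freqs):
    """Likelihood-ratio index per Eqs. (6)-(7): the geometric mean of
    f_ac / e_ac over the L codons of a gene. Each list holds, per codon
    position i, the observed and expected frequency of the codon used there."""
    L = len(obs_freqs)
    return exp(sum(log(f / e) for f, e in zip(obs_freqs, exp_freqs)) / L)

# Hypothetical three-codon gene: enrichment at one site cancels depletion at another
value = p_index([0.4, 0.2, 0.1], [0.2, 0.2, 0.2])
```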
This is the most popular CUB metric that compares the codon usage in a gene of
interest against the codon usage in highly expressed genes [72]. The latter can be
derived from experimental data; in their absence, ribosomal protein genes can be used
[79], as they constitute one of the most highly expressed groups of genes across all
domains of life. Alternatively, highly expressed genes can be inferred from prevalent
CUB for a given genome in the course of an iterative procedure outlined in [80]. For an
mRNA consisting of L codons, the CAI can be computed as follows:

    CAI_mRNA = exp[ (1/L) Σ_{a∈A} Σ_{c∈C_a} o_ac ln ω^ref_ac ]                    (8)

where C_a stands for the codons encoding amino acid a; A is the set of all amino acids;
o_ac is the observed count of codon c for amino acid a in the given mRNA; and ω^ref_ac
is the reference relative adaptiveness of codon c encoding a (cf. the relative adaptiveness
ω_ac described above), calculated on the basis of either the mRNAs of ribosomal genes
or another chosen dataset. Reference genes exhibit the highest CAI (= 1); for other genes,
CAI ≤ 1.
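A Python sketch of Eq. (8); the exponential form is assumed here (it is what makes CAI = 1 for a reference-like gene), and the weights and counts are hypothetical:

```python
from math import exp, log

def cai(codon_counts, w_ref):
    """CAI per Eq. (8): the exponential of the length-normalized sum of
    codon counts times the log of the reference relative adaptiveness.
    Met, Trp and stop codons are assumed excluded, as in the text."""
    L = sum(codon_counts.values())
    return exp(sum(o * log(w_ref[c]) for c, o in codon_counts.items()) / L)

# Hypothetical Lys-only gene fragment; AAG is the preferred reference codon
value = cai({"AAA": 5, "AAG": 15}, {"AAA": 0.25, "AAG": 1.0})
```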
Several improvements of CAI were proposed that deal with different mutational
biases, gene expression level, translation efficiency of synonymous R- or Y-ending
codons or the identification of highly expressed gene sets [81]. On the basis of these
considerations, a novel index of translation elongation (I_TE) has been proposed that
is computed similarly to CAI [82]. I_TE was applied in the re-analysis of expression
data from a library of 154 synonymous genes encoding green fluorescent protein
[83], pointing to a possible explanation as to why translation elongation was underestimated
as a contributor to protein expression in the original study. Recently, two
corrections to CUB indices, including CAI, were proposed that take into account the
fact that in all domains of life many mRNAs carry sequences that induce −1 ribosomal
frameshifting [84]. Such events incur costs to translational efficiency that may not
be reflected by traditional CUB measures. On the basis of several datasets, correction
for −1 frameshifts was shown to lead to a higher correlation between CUB and
expression level.
The tAI is a measure of the adaptation of a codon sequence to the cellular tRNA pool.
As in CAI, tAI is an average of per-codon adaptiveness values, which are computed
in two steps. First, the absolute adaptation W_c for codon c is found by summing over
the tRNA species t that recognize it:

    W_c = Σ_t (1 − s_ct) T_ct                    (9)

where s_ct is the wobble interaction weight for the codon–tRNA pair (see below) and
T_ct is the gene copy number of tRNA t. The adaptiveness of each codon is then
normalized within its synonymous family:

    ω^tAI_ac = W_ac / max_{c∈C_a} W_ac                    (10)
The tAI hinges on the fact that different tRNAs recognize a codon with different
affinities, due to wobble interactions and the different pools of anticodons available
for each codon. One needs the following pieces of information to compute this
index: the list of anticodons that recognize each codon; a complete and correct catalogue
of tRNA genes for a given genome; a reference set of highly expressed genes; and the
efficiency of codon–anticodon recognition (the wobble interaction weights s_ct). As
mentioned above, tRNA gene copy number serves as a proxy for tRNA abundance in
the original implementation of tAI. Recent technological advances allow direct
quantification of tRNA from RNA-seq data, offering more precise estimation of tAI
[85]. A version of tAI, called normalized translational efficiency (nTE), was proposed
in 2013; it takes into account both tRNA abundance and the demand for it, the latter
inferred from codon usage [86]. For a long time, s_ct was derived from gene
expression data in S. cerevisiae. In 2014, a generic approach was proposed for
species-specific estimation of the weights without resorting to gene expression data
[87]. The approach capitalizes on the assumption that highly expressed genes should
have both a higher tAI (better adaptation to the tRNA pool) and a higher CUB (a less
uniform distribution of synonymous codons). There must consequently be a significant
correlation between CUB and tAI, and one may therefore find species-specific values of
s_ct that optimize the correlation between the two metrics. The authors used a
derivative of the relative codon usage bias index, RCB [88], as the measure of CUB; its
chief advantage is that it depends only on the coding sequence.
This is an information theory-based measure of CUB. The entropy term for each amino
acid is computed as the normalized difference between the maximum and the observed
entropy:

    E_a = [max(H_a) − H_a] / max(H_a) = (log2 k_a − H_a) / log2 k_a                    (12)
    N_a = Z_a^(−1)                    (16)

N_a takes values from 1 to the number of synonymous codons k_a for amino acid
a. The effective number of codons Nc for a gene is then the sum, over all degeneracy
classes k in the set K, of the class size n_k times the class-average N_a:

    Nc = Σ_{k∈K} n_k N̄_{a=k}                    (17)

where

    N̄_{a=k} = (1/n_k) Σ_{a∈k} N_a                    (18)

Special rules regulate the calculation of Nc in cases when the codon usage of a gene
is more uniform than expected or when certain amino acids (and the respective codons)
are missing from the protein product [1]. For example, if one or a few amino acids
within class k exhibit strong CUB while the others in the same class do not, then
Nc for a gene can be calculated as the sum over individual amino acids (and not the
averages over k) [90]:

    Nc = Σ_{a∈A} N_a                    (19)
This is another metric of CUB based on Shannon's information entropy [94]. The
entropy H_a for an amino acid a encoded by a set of synonymous codons C_a from
degeneracy class k is

    H_a = − Σ_{c∈C_a} f_ac log2 f_ac                    (20)

where f_ac is the frequency with which codon c encodes amino acid a. The H_a values
for all amino acids can be converted into the information entropy of the entire gene
as follows:
    Hg = Σ_{a∈A} p(a) H_a                    (21)

where the sum runs over the 20 amino acids and p(a) is the frequency of amino acid a.
    Hm = Σ_{a∈A} p(a) log2 k_a                    (22)
The Hm values computed in this way may not represent a proper null hypothesis for
the gene under study: its codon structure is not random, as it is under selection
constraints to carry out a biological function. Therefore, it is perhaps more
logical to infer Hm from random sequences that match the GC and amino acid contents
of the gene, according to the procedure outlined in [94]. Then, the raw CUB score is:
    S_g = Hg / Hm                    (23)
This measure compares the observed CUB against the expected random distribution,
taking into account both GC and purine contents. The cosine distance metric is adopted to
measure the difference, and bootstrapping is used to assess the statistical significance
of the results [95]. The CDC value lies within the 0–1 range. The codon position-specific
nucleotide contents are used to compute the expected codon usage as follows:
    e_c = (b1 b2 b3) / Σ_{c∈C} s_c b1 b2 b3                    (26)

where s_c = 1 if c is a sense codon, and s_c = 0 otherwise.
When both the expected (e_c) and observed (f_c) codon usages are determined, CDC
computes the distance coefficient for a given gene:

    CDC = 1 − [ Σ_{c∈C} e_c f_c ] / [ √(Σ_{c∈C} e_c²) √(Σ_{c∈C} f_c²) ]                    (27)
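Equation (27) is one minus a cosine similarity; a Python sketch with toy vectors (a real computation would run over the 61 sense codons):

```python
from math import sqrt

def cdc(expected, observed):
    """CDC per Eq. (27): one minus the cosine similarity between the
    expected and observed codon-usage vectors."""
    dot = sum(e * f for e, f in zip(expected, observed))
    norms = sqrt(sum(e * e for e in expected)) * sqrt(sum(f * f for f in observed))
    return 1.0 - dot / norms

same = cdc([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])  # no deviation (≈ 0)
opposite = cdc([1.0, 0.0], [0.0, 1.0])        # maximal deviation
```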
While the other indices described above are heuristic or statistical at their core, NAI
is built upon a mechanistic model of protein translation that links a coding sequence
to its protein production cost. NAI measures the extent to which a coding
sequence is adapted to minimize the cost of nonsense errors during translation, as
compared to the distribution over all possible synonymous variants of the gene (alleles).
Starting from a set of parameters (such as protein synthesis rate and mutation rate),
one follows a four-step procedure to compute NAI, as detailed in [96]. First, per-codon
elongation and nonsense error probabilities are calculated. Second, the protein
production cost in the face of nonsense errors (η_obs) is calculated. Third, the central
moments of η_exp (including its variance) are calculated across the entire
space of synonymous variants of the given coding sequence (approximated by
continuous distributions). Fourth, NAI is computed:

    NAI = − (η_obs − η̄_exp) / √Var(η_exp)                    (28)

where η̄_exp and Var(η_exp) are the mean and variance of the cost over the synonymous
space.
This metric was derived from the large-scale expression of genes of diverse phylogenetic
origin in E. coli under the control of T7 RNA polymerase [101]. The expression
level of each protein was scored from experimental data (SDS-PAGE) on an integer
scale from none (0) to highest (5). Then, multi-parameter binary logistic regression
modeling was employed to reveal the contributions of different mRNA sequence
parameters to the expression of proteins falling into the different expression "bins"
(0–5). In particular, the analysis showed that the frequency of certain synonymous
codons is higher in different segments of the coding sequence (see below) and differs
for proteins from different "bins," leading to codon slopes: a relationship between codon
frequency and its influence on expression level. After filtering out non-contributing
and redundant factors, a logarithm of the odds of observing a 5 versus a 0 expression
score for a gene was proposed.
This is a codon metric that links codon occurrence to mRNA half-life.
Several transcriptomic and proteomic studies (see above) indicated that enrichment
of a coding sequence with optimal codons (e.g., those that are efficiently translated)
correlates with increased mRNA stability. The codon stabilization coefficient for S.
cerevisiae has been determined for each codon from experimental data [111], and
these values can be used to compute a CSC for each yeast gene (CSCg):
    CSCg = Σ_{c∈C} CSC_c (o_c / L)                    (30)

where CSC_c is the CSC value for a given codon c, o_c is the count of codon c in gene g,
and L is the length of gene g in codons. To ascertain the biological meaning of CSCg
values observed in real genes, these were compared against values computed from
shuffled genomes [112].
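A Python sketch of Eq. (30); the CSC values below are hypothetical stand-ins for the experimentally determined yeast coefficients:

```python
def csc_gene(codon_counts, csc):
    """CSCg per Eq. (30): the per-codon codon stabilization coefficients,
    weighted by each codon's frequency o_c / L in the gene."""
    L = sum(codon_counts.values())
    return sum(csc[c] * n / L for c, n in codon_counts.items())

# Hypothetical four-codon gene: three stabilizing codons, one destabilizing
value = csc_gene({"GCT": 3, "GAA": 1}, {"GCT": 0.2, "GAA": -0.1})
```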
Some of the earliest indices are already part of bioinformatics suites (such as
EMBOSS) and biological sequence analysis software packages. For example, MEGA
X [113] supports the calculation of RSCU, and DAMBE7 [114] allows computing
RSCU and several versions of CAI. INCA [115] is a software package fully devoted
to codon analysis. It computes common indices such as CAI, Nc and codon bias
and represents them as either text or plots, as shown in Fig. 10. The program CodonW
[116] calculates the following indices: RSCU, CAI, Nc, CBI and Fop. The first three
indices from this list can also be computed with the ACUA toolkit [117].
CAIJava is a Java-based program that calculates the CAI of a gene of interest using as a
reference a set of genes with the highest CAI scores (http://www.ihes.fr/~carbone/materials/description.html).
The reference set is identified through an iterative procedure
that selects coding sequences harboring very frequent codons [80]. The latter
feature distinguishes CAIJava from many other CAI calculators, which take known or
presumably highly expressed genes (usually those for ribosomal proteins) or the codon
usage table of the organism as a reference. The program CodonO computes the SCUO index
and visualizes data in the form of tables and plots [118]. ExtRamp is a software package
that identifies so-called ramp sequences in protein-coding genes [119]. The ramp
sequence is a 5′-terminal segment of the gene, adjacent to the start codon, that is
preferentially populated by slowly translated codons. This is thought to ensure a more
uniform placement of ribosomes at the translation initiation stage, thus avoiding their
collisions [120]. The software makes use of tAI and relative codon adaptiveness to
identify the ramp sequences.
A number of online tools support CUB analysis. Being perhaps the most popular
index, CAI enjoys the widest variety of tools for its calculation; see [121–123],
https://www.biologicscorp.com/tools/CAICalculator/#.XKRbIkhS_IU and
http://www.bioinformatics.nl/cgi-bin/emboss/cai. GCUA [124] compares codon and
Fig. 10 Visualizing CUB on a genome-wide scale with INCA. The entire set of Streptomyces
coelicolor A3(2) coding sequences was loaded into the program, which calculates a number of
genomic and codon usage parameters. Here, the plots depict the correlation between CAI (x-axis; 0–1)
and Nc (y-axis; 0–61) in part (a), and between CAI and GC content in the third codon position in part (b),
for all S. coelicolor genes. Genes on the minus and plus strands of the chromosome are shown as red
and green dots, respectively. By pointing to each dot, one can get information about the protein product
of the gene and the correlation coefficient between CAI and Nc. The figure was generated with INCA
version 1.20
amino acid usage of a query sequence against the codon usage table of the source organism
and presents the result as bar graphs. The stAIcalc [125] is an online tAI calculator
featuring pre-computed wobble interaction weights for 100 organisms. The
output file contains tAI values in tabular form. The calculator can be accessed
at http://tau-tai.azurewebsites.net/. A stand-alone version of the program permits
optimizing the weights for new species.
Several databases contain pre-computed indices for selected organisms. The Kazusa
database (https://www.kazusa.or.jp/codon/) is an open collection of codon usage
tables for about 36,000 organisms. The codon usage tables provide information (in
tabular form) about the absolute count of each codon in a genome and the codon
frequency per 1000 codons. However, this database is rather outdated, as it is based
on a GenBank file released in 2007. HIVE-CUT is another online repository of codon
usage tables for all (approximately 850,000 as of April 2019) genomes deposited in
GenBank [126]. HIVE-CUT requires registration and is updated every two months.
This database also contains pre-computed Nc indices for all coding sequences. STADIUM
contains species-specific tAI values pre-calculated for 148 species [127].
Besides raw data in tabular form, one can find the distribution of tAI values across
different biological pathways.
Codon usage indices have been used in a vast number of recent studies and have
become an indispensable tool for the analysis of genomic sequences [128]. An extensive
body of data shows that, by and large, there is a high correlation between a gene's
CUB, the abundance of cognate tRNAs, and mRNA and protein levels [129–133]. Codon
optimization or de-optimization has become a routine biotechnological tool to boost
or suppress, respectively, protein expression levels in prokaryotic platforms [101,
134–136]. This suggests that translational selection is an important driver of
CUB. However, it remains a formidable challenge to disentangle the contributions
of neutral and adaptive processes to the final codon structure of the genome [137,
138]. The rules, causality, benefits and risks of codon optimization remain debatable for
mammalian and plant systems [139–142]. Protein expression level is inevitably
a dynamic parameter, the net result of a multitude of biological factors [143]. There is an
inherent tension between the static estimate of expression level portrayed by a given
codon index and the expression level ranges revealed by omics approaches. This divide
is not insurmountable, as researchers continue to explore novel ways of incorporating
biological variables into CUB metrics; some examples were mentioned in the preceding
section. Besides this conceptual issue, the understanding and
applicability of CUB indices suffer from unresolved methodological shortcomings.
First of all, there is no single platform where all or the majority of the (most popular)
CUB indices can be computed for a given gene or genome. Such a platform
would facilitate the comparative analysis of the performance of these indices and of
possible confounding effects of biases other than translational selection (gene length,
GC content, etc.). Indeed, a recent benchmarking study of several indices based on
deviation from random expectation (Nc, SCUO, RCBS, CDC) revealed that different
indices produce a large heterogeneity of results when challenged with the same
synthetic datasets [94]. There is a broad range of correlation values between indices
thought to gauge the same aspect of codon function. Hence, either these indices are
sensitive to unanticipated biases, or they measure different aspects of codon usage.
More extensive benchmark studies are needed to understand CUB index behavior
more fully. A corollary is the need for more rigorous statistical validation
of CUB indices. Currently, only the CDC index comes with a means of estimating
its statistical significance [95].
Computed codon indices are most often represented in tabular form. The program
package INCA offers perhaps the richest menu of options for CUB visualization. Yet,
for many indices (e.g., CDC and iCUB) there are still no online applications, greatly
limiting their accessibility, especially to researchers not conversant with command-line
program interfaces. There is ample room for improved representation of the
popular indices as well. For example, a graphical representation of per-codon CUB
index values across the entire coding sequence is often useful [144], yet this feature
is not implemented in any visualization tool.
5 Codon Context
The first observations that codons tend to occur in a certain preferred neighborhood,
or context, which influences mRNA function, date back to the 1980s [44,
145, 146]. Today, the phenomenon of codon context is a solid line of investigation,
judging from the over 1500 entries in PubMed mentioning this term. Codon context
comes in a great variety of forms and functional consequences. The degeneracy of the
genetic code largely enables codon context, as different nucleotide sequences can be
written into coding regions without altering the protein's primary structure. Moving from
the start to the stop codon, below we succinctly describe the documented cases of
codon context. We then proceed to describe codon context-centered
tools and databases.
Fig. 11 Examples of mRNA sequence motifs that influence translation. mRNAs of different
origin carry different sequences that impact translation initiation, elongation and termination; see
the graphical legend at the bottom of the figure. Start (AUG) and stop (UGA) codons are shown in
red. Shine-Dalgarno and Kozak sequences are shown in conjunction with AUG. Sequences (codon
pairs) that inhibit translation or induce frameshifts are shown in blue below the mRNA. Sequences
that help maintain an optimal elongation rate and/or the native conformation of the head of the mRNA
(the 5′ region near the start codon) are shown in green above the mRNA. The list of sequence
motifs shown in the figure is not exhaustive and was chosen to illustrate major points of the main text
(see below)
ribosomal occupancy, implying that the primary role of such pairs is to slow down
the translation at the defined loci of mRNA. The latter might be needed to ensure
the proper protein folding, much as documented in a number of other
studies focused on individual codons [63, 170, 171] or codon clusters [172, 173].
While the identification and interpretation of non-optimal codons or codon pairs
were rather straightforward, it remains unclear as to whether there is a nonrandom
use of optimal codon pairs. The individual optimal codons were suggested to be
located preferentially within a context of structurally sensitive sites, which would
lead to protein misfolding/aggregation when mistranslated [174]. In yeast, repeated
occurrences of the same amino acid at different positions along an mRNA tend to be encoded by
codons recognized by the same tRNA. This so-called tRNA re-usage phenomenon
means that a switch to a synonymous codon read by another tRNA is less frequent
than would be expected by chance [175]. Although this case does not concern the
use of consecutive codons, it does make the point that highly expressed genes are under
pressure to place codons in a way that favors rapid translation.
Stop codon readthrough or frameshifting was recognized in early days of molecu-
lar biology [176, 177] and remains the best understood form of codon context [178].
The so-called post-stop sequence CARYYA was first implicated in UAG readthrough
in some plant viruses [179]. It is now known that this sequence enables efficient
readthrough in heterologous organisms [180]. In human cells, the sequence UGA-CUAG
is read through within the transcript for the vitamin D receptor with 6.7% efficiency [181].
Experimental approaches demonstrate the abundance of stop codon readthrough in
animals, as well as the diversity of codon contexts and secondary structure elements
implicated in the readthrough [182, 183]. A curated collection of stop codon-recoded
vertebrate genes is available at NCBI [184]. In two ciliate species, the genetic codes
have no dedicated stop codons, as the canonical ones are reassigned to glutamine
(UAA and UAG) and tryptophan (UGA). In these species, translation termination is
determined by a weakly biased usage of uridines at the flanks of the canonical stop
codons and by the proximity to a polyA tail [185]. This is a striking example of a
biological system where the default mode for stop codons is translation rather than
termination.
Of the different forms of frameshifts, −1 programmed ribosomal frameshifting clearly
depends on codon context. The sequence where the ribosome undergoes a −1 frameshift
(the slippery site) has the following composition: a heptamer N.NNW.WWH (N =
A, G, C, U; W = A, U; H = A, C, U; dots demarcate codon borders), a 1–12 nt
spacer sequence and a pseudoknot structure. On the slippery site, the ribosome is
pushed one nucleotide back and resumes translation in the −1 frame: NNN.WWW.H
and so on [186]. The well-defined and rather conserved organization of the slippery
site facilitated the development of a database of predicted −1 ribosomal frameshift
signals in eukaryotes [187].
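As a rough illustration, the heptamer pattern above can be scanned for with a regular expression. The function name and the permissive matching below are our own simplifications, not part of any published predictor: a real tool would also require the spacer and downstream pseudoknot.

```python
import re

# Candidate -1 frameshift slippery heptamers N.NNW.WWH, where
# N = A/C/G/U, W = A/U, H = A/C/U (dots mark codon borders in the 0 frame).
# The lookahead lets overlapping candidates all be reported.
SLIPPERY = re.compile(r"(?=([ACGU]{3}[AU]{3}[ACU]))")

def find_slippery_sites(mrna: str):
    """Return (offset, heptamer) pairs for every candidate slippery site."""
    mrna = mrna.upper().replace("T", "U")
    return [(m.start(), m.group(1)) for m in SLIPPERY.finditer(mrna)]

# The classical U.UUU.UUA-type slippery sequence matches the pattern:
hits = find_slippery_sites("GCAUUUUUUAGGG")
```

Because N subsumes W, the pattern is deliberately permissive; filtering on the spacer and secondary structure is what makes genome-wide predictions specific.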
Fig. 12 A codon context heat map deduced from the dicodon analysis of Streptomyces albus J1074
(a) and S. coelicolor M145 (b) ORFomes. Logs of dicodon frequencies were the input data to
build the map. Rows correspond to the 5′ codon (P site of the ribosome), and columns to the 3′ codon.
Green cells correspond to preferred contexts (occurring more frequently than expected from background
nucleotide frequencies) and red cells to rare (also referred to as “rejected” or “avoided”) ones.
Values that are not statistically significant are colored black (see also the main text). Gray cells
indicate no data (some rare codons, such as UUA, do not occur in all possible contexts). The color
scale represents the full range of values of residuals for codon context
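The input data for such a heat map can be sketched as follows. The helper below (our own naming) counts adjacent codon pairs in a set of ORFs and scores each pair as log2(observed/expected); for simplicity the expectation is the product of single-codon frequencies, a cruder background model than the nucleotide-frequency background used in the figure.

```python
from collections import Counter
from math import log2

def dicodon_log_ratios(orfs):
    """log2(observed / expected) for every adjacent codon pair observed in
    the input ORFs; positive values mark preferred contexts, negative values
    mark avoided ones."""
    codons, pairs = Counter(), Counter()
    for orf in orfs:
        cds = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
        codons.update(cds)
        pairs.update(zip(cds, cds[1:]))
    n_c, n_p = sum(codons.values()), sum(pairs.values())
    return {
        (a, b): log2((pairs[a, b] / n_p) / ((codons[a] / n_c) * (codons[b] / n_c)))
        for (a, b) in pairs
    }

ratios = dicodon_log_ratios(["ATGGCTGCTTAA", "ATGGCTAAATAA"])
```

On a real ORFome one would additionally test each cell for statistical significance, as done for the black cells of Fig. 12.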
Fig. 13 Median 64 × 64 codon heat map of 50 Streptomyces genomes. In order to identify specific
codon pair contexts, the map was filtered to display codon residuals that are above 50-fold difference.
Green cells correspond to preferred and red cells to rare contexts. All other cases were colored black.
Contexts a–d correspond to single pixels, and e and f to boxed lines of the map. See Fig. 12 for
more details
CUG are overrepresented among the codon pairs conserved across 50 genomes. Out
of 11 such pairs, nine carry either of the two aforementioned codons. The pairs can
be summarized as follows: NNU-CNG and GNA-CNN. There are also two patterns
of dependency between non-adjacent bases: CUS-NSN and CUS-NWN. Some of
these patterns have already been reported in other taxa belonging to all
three domains of life. Such cases include avoidance of CUC-CNN or preference for
NNC-ANN [191]. The specific biological reasons for the observed patterns await
verification.
While the formal definition of CUB was given in Sect. 4, here we focus on the
intrinsic CUB, caused by the general benefit of a certain codon over its synonyms in
terms of efficiency and/or accuracy of translation, or other properties (such as, but
not limited to, production or stability of mRNA). In contrast, context-dependent
CUB arises when the effects of a codon depend on the neighboring sequence, over
either a short or a long range. That kind of codon usage bias (unrelated to different
decoding or other functional properties of synonyms) was discussed in Sect. 5 and
will not be considered here. Then, we proceed to the description of the
relevant visualization tools.
concentration of cognate tRNAs in the cell. The codons selected for by the adaptive
process described above are referred to as optimal. In the scientific literature, such
codons may also be termed frequent, fast or popular, each term emphasizing
one facet of codon properties. In line with the above, infrequent (rare) codons are
often referred to as slow or non-optimal ones. It has to be noted that the meaning of
codon optimality is an evolving concept, as researchers develop novel approaches
to measure gene expression and tRNA abundance [64, 86, 106]; different sets of
codons can be viewed as optimal under different conditions [26, 86]. In the classical
view, an optimal codon improves all or a combination of the following three parameters
of protein synthesis: speed of translation, its accuracy and robustness to mistranslation.
The first two aspects have already been mentioned in Sects. 4 and 5; the latter
postulates that mistranslation is very costly to the cell, as it may lead to toxic misfolded
proteins. Consequently, a coding sequence is under constraints to use those
codons which help tolerate or minimize missense mutations [201, 202]. Current
advances in high-throughput systems biology approaches largely substantiate the
idea that optimal codons improve heterologous protein production [101]. As optimal
codons are translated faster [203–205], this would increase the translational efficiency,
i.e., the number of protein molecules produced per transcript. Nevertheless, the
effects of codon optimality on the translational efficiency of endogenous mRNAs remain
less clear-cut [89, 206]. A number of other observations point to the fact that our
understanding of the causes and effects of CUB is oversimplified. No significant differences
were revealed in the ribosome occupancy of either optimal or non-optimal codons
in several model systems [64, 86, 150], pointing to the possibility that in native genes
both classes of codons are decoded at a similar rate. A focus on the abundance of
cognate tRNA in determining the translation accuracy of a codon might also be misleading,
as the large pool of near-cognate tRNAs could outcompete the cognate tRNA [207].
As mentioned above, much of our knowledge of translational regulation comes from
bacterial systems. In animals, the strength of translational selection is weaker, yet
there is evidence for the non-neutrality of synonymous mutations [208] as well as for
their involvement in translation modulation [209, 210].
Adaptive processes other than translational selection also contribute to the CUB.
Genes might be depleted of certain codons to minimize the formation of mRNA secondary
structures near the 5′ end, facilitating translation initiation [83, 104]. A
different speed of ribosome movement along mRNA as a function of codon optimality
evokes additional forms of selection. It is known that optimal codons increase the
stability of mRNA in bacteria and eukaryotes [132], whereas non-optimal ones serve
as a signal of mRNA decay. Here, codon optimality therefore refers not only to the
translational efficiency of a certain codon, but also to its capacity to encode mRNA stability.
Rare codons are placed in protein domain linker regions, whereas optimal codons are
grouped within functional domains. Most likely, slow ribosomes on linker regions
enable proper co-translational protein folding [63, 86, 171]. In fungi, codon usage
was shown to impact transcription through mechanisms independent of mRNA
translation or stability. In particular, in Neurospora the CUB correlates with mRNA
and protein abundance, and codon optimization in this species increases mRNA and
protein levels. How codon identity influences transcription is unclear. The authors
Visualizing Codon Usage Within and Across Genomes … 251
suggest that DNA sequences overlapping mRNA coding regions contain elements
recognized by the transcriptional machinery, and synonymous replacements alter
these sequences, thus impacting transcription [211]. Finally, other as-yet-unknown
translation-independent factors could dictate codon choice, as recent work on the toxicity
of certain synonymous versions of mRNA has shown [212].
Distinct CUB patterns are revealed within coding sequences. The head of a gene
is enriched in rare codons, presumably forming a ramp for slow loading of ribosomes
onto mRNA. The latter helps avoid ribosome collisions and improves translation
initiation [120, 213]. As initiation is the rate-limiting step of translation [128], there
was a debate as to why the codon content is so evidently nonrandom along the
entire transcript. For example, the last 50 codons of eukaryotic mRNAs are often highly
optimized [120]. One possibility is that ribosome pausing on sites distal to the 5′ end
of the gene can somehow be relayed back to the head of the gene, thus preventing a
new round of initiation at the start codon [214]. Under this scenario, the initiation
and elongation steps are linked, and transcript-wide enrichment in optimal codons
has an adaptive value. Not taking this factor into account might lead to alternative
conclusions that initiation alone is the major determinant of mRNA stability [215].
In summary, experimental data reveal that synonymous substitutions in coding
regions may affect all steps of gene expression, such as replication, transcription,
translation, mRNA stability, toxicity and protein folding. Moreover, codon optimality
is proposed to be predictive of protein half-life [216]. It is impossible to develop
an experimental model where a synonymous codon replacement would be related
to a single aspect of gene expression or mRNA structure; we must deal with the inherent
complexity of biological systems. These challenges notwithstanding, there is
undoubted progress in our understanding of the relationship between codon optimality
and translational efficiency, at least in bacteria and model eukaryotic systems such as
yeast and human cell lines. An overall conclusion is that mRNAs of highly expressed
genes are composed of optimal codons that enable accurate and fast translation; the
latter is also important for mRNA stability. Codon optimality appears to be relative
to some extent, as different environmental conditions modulate the availability of
charged tRNA for a given codon [28]. Non-optimal codons are important, too. They
can be used to decrease mRNA stability where needed and to either improve
[203] or derail [217] co-translational folding. The influences that codons exert on
translation will depend on their context (see the preceding section) as well as on the
qualitative and quantitative parameters of the tRNA pool [28, 166, 218, 219].
The data on genome-wide codon usage can be inferred from an annotated genome
sequence or found in databases such as Kazusa or HIVE-CUT [126], and converted
into various types of bar charts or diagrams. For example, CUB for different genomes
can be conveniently visualized in the form of circular diagrams. The latter grasp the
Fig. 14 A circular diagram of codon usage in Streptomyces coelicolor A3(2) genome (GC content
72%). Codon counts within each gene were summed over the entire genome and converted into the
diagram. The codon usage is skewed toward GC-rich triplets
most salient features of codon usage for a given genome, such as preferred usage of
GC-rich codons in GC-rich genomes (Fig. 14).
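Summing codon counts over all coding sequences, as done for Fig. 14, can be sketched in a few lines (the function name and toy input are ours):

```python
from collections import Counter

def genome_codon_frequencies(cds_list):
    """Sum codon counts over all coding sequences and normalize to
    frequencies, the raw input for a circular codon usage diagram."""
    counts = Counter()
    for cds in cds_list:
        counts.update(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Toy "GC-rich" input: GC-rich codons dominate, as in Streptomyces.
freqs = genome_codon_frequencies(["ATGGCCGGCGCGTGA", "ATGCCGGGCGCCTGA"])
```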
Codon indices can be employed to visualize the differences in CUB between two
or more genes. In this regard, CAI remains the most popular among the reference-based
measures, while Nc or the codon bias index (CBI) [1] is often used if one prefers a
reference-free approach (see Sect. 4 for more details). Care should be taken when
interpreting the differences in codon indices for individual genes, as they may be
caused by factors unrelated to translation efficiency (see above). Ideally, measures of
CUB should be applied to genes whose evolutionary history, expression levels and
function are well understood.
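A minimal sketch of the CAI computation, assuming a reference codon-frequency table from highly expressed genes. The function and table names are ours, and the toy reference covers a single synonym family; real implementations typically exclude Met, Trp and stop codons.

```python
from math import exp, log

# Standard genetic code; codons indexed with bases ordered T, C, A, G.
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_AA = {a + b + c: AA[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}

def cai(gene_codons, ref_freqs):
    """Geometric mean of each codon's relative adaptiveness
    w = f / max(f over its synonyms), with f taken from a reference set
    of highly expressed genes."""
    w = {}
    for codon, f in ref_freqs.items():
        syn_max = max(v for c, v in ref_freqs.items()
                      if CODON_AA[c] == CODON_AA[codon])
        w[codon] = f / syn_max
    return exp(sum(log(w[c]) for c in gene_codons) / len(gene_codons))

ref = {"GGC": 0.6, "GGT": 0.2, "GGA": 0.1, "GGG": 0.1}   # glycine family only
score = cai(["GGC", "GGC", "GGA"], ref)
```

A gene composed entirely of the most adapted synonyms scores 1.0; any use of rarer synonyms pulls the geometric mean below that.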
Positional CUB has been reported in many studies, and it remains a fertile ground
for the development of dedicated visualization tools. For example, the extreme depletion
of the leucyl TTA codon in GC-rich Streptomyces genomes is accompanied by a positional
skew of this codon toward the 5′ end of the gene [220]. Based on this finding, the Web
service TTALynx was developed to visualize positional codon usage bias on a
genome-wide scale [221], as depicted in Fig. 15. The statistical significance of the
observed biases remains unaddressed. Perhaps one way to bypass this problem
is to aggregate data from multiple genomes to see the global trends. The expected
distribution can be inferred from simulated coding sequences that preserve the nucleotide
and codon biases of the genomes under study (Fig. 16).
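A TTALynx-style positional analysis can be approximated by recording the relative position of each TTA occurrence in every gene; the naming and the 0-to-1 normalization below are our own choices.

```python
def codon_relative_positions(cds_list, codon="TTA"):
    """Relative position (0 = start, 1 = end) of every occurrence of `codon`
    in each CDS; a left-skewed distribution indicates a 5'-end positional
    bias, as reported for TTA in Streptomyces."""
    positions = []
    for cds in cds_list:
        codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
        denom = max(len(codons) - 1, 1)
        positions.extend(i / denom for i, c in enumerate(codons) if c == codon)
    return sorted(positions)

pos = codon_relative_positions(["TTAGCTGCTGCT", "GCTTTAGCTGCTGCTGCT"])
```

The sorted list is exactly what is needed to draw a cumulative distribution like that of Fig. 16.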
Codon context and codon usage biases can be viewed as a horizontal “axis” of codon
changes, those that happen within a single coding sequence. In contrast, the vertical “axis”
is represented by changes that take place over time in different homologous genes.
Namely, once two copies of a gene arise due to either a duplication or a speciation
event, they start diverging through the accumulation of various random genetic changes,
such as insertions, deletions, translocations and recombination. However, the most
frequent type of rearrangement is the point mutation, a change of a single DNA
base to another one. Depending on the experimental approach, point mutation
rates in E. coli were reported to be within the range of (0.25–5.0) × 10−10 per
generation [222]. If a mutation spreads in a population and becomes fixed, because
of either genetic drift or an adaptive advantage it confers to organisms, it is called a
substitution. For a set of aligned DNA sequences coding for protein, the substitutions
can be modeled at three levels: nucleotide, amino acid and codon (Fig. 17). It
makes sense to use codon substitution models (CSMs) so that the analysis of alignments
accounts for the structure of the genetic code, unequal biases at the three codon positions
and selection on the protein [223]. All of the aforementioned aspects can be modeled
explicitly by CSMs. The latter are a recent addition to the toolkit of computational
biology compared to nucleotide and amino acid models, which is partly explained
by the absence of relevant and sizable datasets in the early days of molecular biology. The
availability of homologous genome sequences from closely related species stimulates
interest in CSMs, as they offer a detailed view of the forces that shape protein-coding
sequences. In this section, we begin with a succinct description of the founding
principles at the core of all CSMs. Then, we describe several of the most popular
CSMs as well as recent developments not covered in other reviews. An overview of
tools for simulation of coding sequences and visualization of CSMs will conclude
this section. For a deeper acquaintance with the topic of CSMs, the interested reader is
referred to [1, 224].
Fig. 16 Cumulative distribution of leucine codons and leucine codon-based quadruplets (TTAN)
within genome of Actinosynnema mirum (73% GC). See Fig. 15 for more details
The key assumption is that the future state of the chain depends only on its present state
and not on any past states (i.e., the Markovian property). Assuming that seq1 in Fig. 17c
represents a parental sequence, the state of the first codon CCT in seq2 depends only on the
present state of the first codon (CCG) in seq1. In Markov models, each character in
a sequence (nucleotide, amino acid residue, or codon) is a random variable that
undergoes substitutions independently and with the same probabilities as every other
character. In a coding sequence, each codon is described by a single Markov chain
over 61 codon states (the sense codons). A Markov model is defined by the generator
matrix Q = {q_ij} of instantaneous rates of change between sense codons i and j,
i.e., describing the pattern of change over an infinitesimally small time interval. Substitutions
involving stop codons are prohibited. To compute the probability of various transitions between
codon states over time t, the probability transition matrix P(t) can be obtained by
solving the differential equation:
dP(t)/dt = Q P(t)    (31)
At time t = 0, the probability of a change is 0 (off-diagonal values), while
the probability of no change is 1 (values on the diagonal). Therefore, P(0) is the
identity matrix I. This provides Eq. 31 with an initial condition, yielding the following
solution:
P(t) = e^{Qt}    (32)
Therefore, given a rate matrix Q, one can compute P(t) for any time t ≥ 0, which
contains the probabilities p_ij(t) of change from state i to state j over time t.
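As a numerical illustration of Eqs. 31–32 on a small (non-codon) example, P(t) can be approximated without a dedicated matrix-exponential routine via the Euler product (I + Qt/n)^n. The 4-state Jukes-Cantor-like generator below is a toy stand-in for a 61 × 61 codon matrix; the function name is ours.

```python
import numpy as np

# Toy 4-state generator: equal off-diagonal rates, rows summing to zero.
Q = np.full((4, 4), 1 / 3)
np.fill_diagonal(Q, -1.0)

def transition_matrix(Q, t, n=1 << 20):
    """Approximate P(t) = exp(Qt) as (I + Qt/n)^n for large n
    (Euler discretization of dP/dt = QP with P(0) = I)."""
    step = np.eye(len(Q)) + Q * (t / n)
    return np.linalg.matrix_power(step, n)

P = transition_matrix(Q, t=0.5)
```

With this Q, the closed-form Jukes-Cantor solution gives p_ii(t) = 1/4 + 3/4 e^{−4t/3}, which the approximation reproduces to several decimal places; rows of P sum to one, as a transition matrix requires.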
In order for Q to define the probability matrix P(t), it should satisfy the following
mathematical properties. All its off-diagonal entries have to be nonnegative, and the
sum of each row should be zero. Therefore, the diagonal entries q_ii = −Σ_{j≠i} q_ij
are negative. Q is also assumed to be independent of t (i.e., time-homogeneous).
This means that the same rate matrix is used globally over all branches of the
tree representing a given multiple sequence alignment. Most CSMs are formulated
as time-reversible, that is, the direction of change between i and j cannot be told:
π_i p_ij(t) = π_j p_ji(t). It is not biological reality but rather mathematical simplicity
that drives the widespread assumption of time homogeneity and reversibility. For
example, a full empirical irreversible model for N states (e.g., codons) has N ×
(N − 1) − 1 free parameters, whereas for a reversible model this number is
reduced to N × (N + 1)/2 − 2. In addition, a reversible and homogeneous model
does not require knowledge of the root placement, which is often non-trivial.
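The two properties above are easy to check numerically. The sketch below verifies the quoted parameter counts for N = 61 and detailed balance for a small reversible toy generator built from symmetric exchangeabilities (the construction and names are ours):

```python
# Parameter counts quoted above, as functions of the number of states N:
def irreversible_params(n):
    """Full empirical irreversible model: N(N-1) rates minus 1 for scaling."""
    return n * (n - 1) - 1

def reversible_params(n):
    """General time-reversible model: N(N-1)/2 exchangeabilities plus
    N-1 free frequencies, minus 1 for scaling."""
    return n * (n + 1) // 2 - 2

n_irrev, n_rev = irreversible_params(61), reversible_params(61)

# Detailed balance: with symmetric exchangeabilities s_ij, the rates
# q_ij = s_ij * pi_j satisfy pi_i * q_ij = pi_j * q_ji (time reversibility).
pi = [0.5, 0.3, 0.2]
s = [[0.0, 1.2, 0.7], [1.2, 0.0, 2.0], [0.7, 2.0, 0.0]]
q = [[s[i][j] * pi[j] for j in range(3)] for i in range(3)]
```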
CSMs can be parameterized in different ways (e.g., with substitution rates between
synonymous and non-synonymous codons, with transition/transversion rates, etc.),
and these parameters can be inferred from observed codon sequences. A CSM
can be used to compute the probability that, over the evolutionary time spanning
the input multiple sequence alignment, a coding sequence accumulates the substitutions
resulting in the observed sequences, given the model. This is essentially the
likelihood L of the observed data D, which is proportional to the probability of D
given the substitution model M with its parameters: L = p(D|M). Model parameters
and evolutionary divergences (i.e., phylogenetic branch lengths) can be estimated
by maximum likelihood or using the Bayesian approach [224, 226–228].
New models and methods have to be compared against existing ones to understand
the potential caveats and advantages of the former. Likelihood-based tests are a
relatively simple and popular way to compare models [229]. Models that provide a
better description of the unknown process producing the data typically have a
higher likelihood when adjusted for the number of parameters they contain. It is important
to make sure that the higher likelihood arises from the better explanatory power of
the model and not from overfitting. It is customary to compare a model with more
parameters, M_A, to a simpler null model, M_0. If the complex model can be transformed
into the simpler one by constraining some of its parameters, then the models are nested
and a likelihood ratio test (LRT) can be applied:
L(MA )
LRT = −2 log = −2(log(L(MA )) − log(L(M0 ))) (33)
L(M0 )
A p-value can be computed for the LRT statistic to assess the significance of the
difference between the optimized log-likelihoods log L(M_A) and log L(M_0).
If the models are not nested, then the Akaike or Bayesian information criterion
(AIC or BIC, respectively) can be used to rank the models, but not to reject one
model in favor of the other. For example, the AIC for model M_A would be
AIC(M_A) = 2k_A − 2 log L(M_A), where k_A is the number of free parameters of M_A.    (34)
Fig. 18 ROC curves of four M8-based and BUSTED CSMs on a dataset with codon gamma rate variation.
Sensitivity is defined as the proportion of correctly identified alignments simulated under a
model with positive selection, and specificity as the proportion of correctly identified
alignments simulated without positive selection. The dashed diagonal line shows the theoretical
performance of a random predictor, and the dashed vertical and horizontal lines indicate the theoretical
performance of a perfect predictor. This figure is part C of Fig. 1 from Davydov et al.
[231]
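A minimal sketch of both model comparisons described above, with the chi-square p-value restricted to one degree of freedom so that only the standard library is needed; the log-likelihood values and function names are invented for illustration.

```python
from math import erfc, sqrt

def lrt_pvalue(logL_alt, logL_null, df=1):
    """LRT statistic 2(log L(MA) - log L(M0)) and its p-value; for a
    chi-square with df = 1, the survival function is erfc(sqrt(x/2))."""
    if df != 1:
        raise ValueError("sketch supports one constrained parameter only")
    stat = 2.0 * (logL_alt - logL_null)
    return stat, erfc(sqrt(max(stat, 0.0) / 2.0))

def aic(logL, k):
    """Akaike information criterion 2k - 2 log L (lower is better)."""
    return 2 * k - 2 * logL

stat, p = lrt_pvalue(-1200.0, -1204.5)          # nested models
ranking = sorted([("M0", aic(-1204.5, 2)), ("MA", aic(-1200.0, 3))],
                 key=lambda x: x[1])             # non-nested: rank by AIC
better = ranking[0]
```

Here the extra parameter of M_A buys 4.5 log-likelihood units, giving a significant LRT and the lower AIC, so M_A would be preferred under both criteria.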
Two principally different types of CSMs can be distinguished. The first type is
referred to as mechanistic or parametric. These models describe all possible
types of codon substitutions with a defined and finite set of parameters. In contrast,
empirical CSMs are derived from large sets of aligned homologous sequences. It has
to be noted that the parameters of mechanistic models are themselves based on empirical
data; for example, codon frequencies can be inferred from nucleotide frequencies.
Hybrid, semiempirical CSMs are quite widespread: these are built as empirical models and
then amended with certain parameters. Parametric models are especially useful for
dissecting the factors and mechanisms of coding sequence evolution. Empirical models
are suitable in the area of phylogenetic reconstruction. An exhaustive description of
all available CSMs and their computer implementations is beyond the scope of this
section, as this topic is deeply covered in the literature [1, 224, 232]. Here, we want
to acquaint the reader with the principles of CSM development and the
ongoing efforts to more fully incorporate biological reality into the models. Historically,
parametric models were developed first and thus will be considered first;
we then proceed to empirical and semiempirical models.
The one-ratio model M_0 is the simplest parametric CSM, with the instantaneous
rates Q = {q_ij} given as follows:
q_ij = 0, if i and j differ by more than one nucleotide substitution;
q_ij = μκπ_j, if i and j differ by one substitution that is a synonymous transition;
q_ij = μκωπ_j, if i and j differ by one substitution that is a nonsynonymous transition;
q_ij = μπ_j, if i and j differ by one substitution that is a synonymous transversion;
q_ij = μωπ_j, if i and j differ by one substitution that is a nonsynonymous transversion.    (35)
where μ is the scaling constant ensuring that the average substitution rate is one:
−Σ_{i=1}^{61} π_i q_ii = 1; π_j is the equilibrium frequency of codon j calculated
from the observed nucleotide frequencies at the three codon positions; κ is
the transition/transversion rate ratio (κ = 1 means no transition bias); and ω is the
nonsynonymous/synonymous substitution rate ratio (d_N/d_S) describing selection on
nonsynonymous mutations. In M_0, double and triple substitutions within a codon are
prohibited, and π, ω and κ are estimated from the data to characterize processes at the
DNA level. The example of M_0 above represents the so-called GY-type CSM, which
assumes that the rates q_ij are proportional to the frequency of the target codon π_j. In
contrast, in MG-type CSMs q_ij is proportional to the frequency of the target nucleotide
f_x^p [1, 224]. A detailed treatment of the ways to compute codon frequencies under
different models is given in [233].
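A sketch of the GY-type M_0 generator of Eq. 35, using toy uniform codon frequencies; the variable names and construction are ours.

```python
import numpy as np

BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
SENSE = [c for c in CODONS if AA[CODONS.index(c)] != "*"]   # 61 sense codons
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def m0_rate_matrix(pi, kappa, omega):
    """GY-type M0: rates proportional to the target codon frequency pi_j,
    times kappa for transitions and omega for nonsynonymous changes;
    multi-nucleotide changes get rate 0; scaled so -sum_i pi_i q_ii = 1."""
    n = len(SENSE)
    Q = np.zeros((n, n))
    aa = {c: AA[CODONS.index(c)] for c in SENSE}
    for i, ci in enumerate(SENSE):
        for j, cj in enumerate(SENSE):
            diffs = [(x, y) for x, y in zip(ci, cj) if x != y]
            if len(diffs) != 1:
                continue                     # 0 or >1 nucleotide differences
            rate = pi[j]
            if diffs[0] in TRANSITIONS:
                rate *= kappa
            if aa[ci] != aa[cj]:
                rate *= omega
            Q[i, j] = rate
    np.fill_diagonal(Q, -Q.sum(axis=1))
    mu = -1.0 / np.dot(pi, np.diag(Q))       # scaling constant of Eq. 35
    return Q * mu

pi = np.full(61, 1 / 61)                     # toy uniform codon frequencies
Q = m0_rate_matrix(pi, kappa=2.0, omega=0.5)
```

The resulting matrix satisfies the generator properties discussed earlier: nonnegative off-diagonal rates, zero row sums, and an average rate of one under π.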
Detecting selection on coding sequences was a primary goal of CSM
development; this point will be further elaborated in the next section. The M_0 model
assumes a constant ω across sites in the input alignment, which is not biologically
justified. Selection, positive or negative, often impacts different lineages and codon
sites in different ways, since the selective pressure varies due to protein folding
constraints and changing evolutionary pressures over time. Consequently, new models
were proposed to account for variation in selection. Using the model tests mentioned
above, one can deduce whether models allowing variability in ω explain the data
better than the restrictive one-ratio model, and if so, what types of selection may
operate on the sequences. Branch models allow variation over time, assigning a
different ω to different branches of the phylogenetic tree. Site models allow ω to
vary between codons through various distributions of ω or its components, d_N and d_S.
By adding new parameters and using different parameterizations, increasingly
sophisticated CSMs can be constructed. For example, the rates of double and triple
substitutions can be set to some nonzero value [234]; different rates for different
codon pairs can be proposed; and different positions of a codon may have different rates
as well. We list some examples below.
Besides selection at the protein level, a wide spectrum of CSMs is now available,
tailored to assess other properties of coding sequences. Models have been
developed to detect selection for amino acid properties [235]. Here, amino acid
substitutions are divided into conservative (the change leads to an amino acid similar
in its physicochemical properties) and radical (the properties switch). A genetic
algorithm, known as CodonTest, has been developed to estimate such models [236].
A number of models aim to reveal site interdependency (both within [237] and
between [238] codons) and to study codon bias [239].
A one-ratio CSM can be combined with a model of selection on synonymous
substitutions that takes into account the codon-specific nonsense error rate [240]. An
interesting aspect of this nested model is that it gives clues about protein expression
levels. The latter parameter of gene expression can also be estimated using
SelAC [241]. SelAC is a CSM built using a cost-benefit function that links gene
expression level to the strength of stabilizing selection. This is a CSM where gene-
and amino acid-specific substitution matrices are nested within a substitution model.
The genome-wide parameters used by the model are: nucleotide-specific mutation
rates scaled by the effective population size; amino acid side chain physicochemical
weights; and a gamma distribution shape parameter describing the distribution of
selective strength. There is also a gene-specific parameter ψ describing the average
rate at which the protein is produced. SelAC can be seen as the latest attempt to
marry codon and population genetic models; both deal with the stochastic process of
character substitution, although the former focuses on interspecies variation, whereas
the latter explores intraspecies variation [242]. SENCA is another interesting example,
defined as a CSM that separately considers mutational biases, CUB and amino
acid preferences [243]. It represents a growing roster of mutation–selection models
that attempt to account for selective patterns at the amino acid level, with some
approaches allowing for heterogeneity in these patterns across codon sites. Basically,
a change in a nucleotide sequence is viewed as the product of the probability of a mutation
from one codon to another, the probability that the mutation becomes fixed in the
population (which, in turn, depends on population parameters, most notably the effective
population size) and a scaling constant [244, 245]. Mutation–selection models pave
the way to study selection on coding sequences, codon usage bias and codon/amino
acid interdependence. SENCA confirmed several previous observations, such as the
universal mutational bias toward AT bases in bacterial genomes. Finally, a general-purpose
parametric (GPP) model has been proposed that includes, among other features, double and triple
nucleotide substitutions.
There is not much room for visualization of the matrices describing parametric codon
models, in contrast to the empirical ones, where tabular data are a typical raw output
[225]. The patterns of character substitution are usually visualized using bubble plots,
where bubble size and color inform about the frequency and nature of the substitution,
respectively. This kind of visualization has been applied to amino
acid [233] and codon [250] substitutions. In particular, codon substitution rate parameters
can be estimated from multiple codon sequence alignments with the expectation-maximization
procedure [228] implemented in the program XRate [261, 262]. Codon
alignments can be prepared in several ways. For example, if the reading frame is
known, one can start with coding DNA sequences, translate them to amino acid
sequences, align these and then back-translate into codon alignments [263]. Today,
however, more sophisticated tools are available; MACSE builds coding alignments
that account for frameshifts and stop codons [264].
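The translate-align-back-translate route can be sketched for the simple case (known reading frame, no frameshifts); the names and the toy alignment below are ours, and tools like MACSE handle the harder cases.

```python
def back_translate(protein_alignment, cds_by_name):
    """Turn an amino acid alignment back into a codon alignment: each
    residue is replaced by its source codon, each gap by '---'."""
    codon_alignment = {}
    for name, aligned_aa in protein_alignment.items():
        cds, out, k = cds_by_name[name], [], 0
        for ch in aligned_aa:
            if ch == "-":
                out.append("---")
            else:
                out.append(cds[3 * k:3 * k + 3])
                k += 1
        codon_alignment[name] = "".join(out)
    return codon_alignment

aln = back_translate({"s1": "MA-K", "s2": "MAGK"},
                     {"s1": "ATGGCTAAA", "s2": "ATGGCAGGTAAG"})
```

Because gaps are inserted in whole-codon units, the result stays in frame and is directly usable as XRate input.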
Figure 19 presents a typical visualization of an empirical CSM inferred from a
set of 40 orthologous genes for the sporulation-specific protein SsgA in Streptomyces
[265]. Here, one can observe a unique pattern of substitution, which will be distinct for
each dataset, as Fig. 20 demonstrates. Notably, the latter shows that the substitution
pattern is different for orthologous groups of the transcriptional factor AdpA [266] from
different actinobacterial genera. Although the above figures focus on orthologs, it is
possible to visualize the substitution model for reasonably aligned sets of sequences
of any origin (such as paralogs).
Fig. 19 Bubble plot of codon relative substitution rates computed with the help of XRate for
an orthologous group of highly conserved Streptomyces sporulation-specific genes ssgA (145 codons;
accession number for S. venezuelae ssgA is ACC77837). Codons are ordered according to the biochemical
similarity of the respective amino acids. The size of a bubble reflects the rate of substitution of
one codon (rows) by another (columns). Synonymous one-nucleotide and two-nucleotide substitutions
are shown in red and blue, respectively. Non-synonymous one-nucleotide and multinucleotide
substitutions are shown in green and yellow, respectively. The blue diagram (below the matrix) shows
the overall codon usage. All plots in this and the following figures are equally scaled (gray circle, top
left corner of the plot)
264 B. Ostash and M. Anisimova
Fig. 20 Bubble plot of codon substitution rates computed with the help of XRate for orthologous
groups of genes for transcriptional factor AdpA from Streptomyces (left) and Streptosporangium
(right). For each model, 30 codon sequences were used. The size of a bubble reflects the rate of
substitution of one codon (rows) by another (columns). See Fig. 19 for color code
A cursory analysis of bubble plots reveals salient features of the evolution of the
underlying sequences. Clustering of substitutions along the diagonal reflects the
absence of substantial amino acid changes, because codons in the plot are ordered
according to their physicochemical properties. The plots depicted in Figs. 19 and
20 show that the two datasets experience substitutions to a different degree. This
may imply that respective coding sequences are under different selective pressures
to maintain their primary structure. Symmetry is another important feature of the
bubble plots. It is not apparent in the first two figures; however, pronounced
symmetry is clearly seen in the bubble plot generated from RuBisCo proteins of
flowering plants (Fig. 21). This suggests an unusual tolerance of RuBisCo to certain
kinds of substitutions; that is, forward and reverse changes happen at the same
frequency. Such a feature of essential genes may render them more robust
in the face of mutations and mistranslation [201]. Low incidence of near-diagonal
substitutions for RuBisCo genes is another distinctive feature in Fig. 21, probably
reflecting the action of stabilizing selection on RuBisCo coding sequences. Physico-
chemical properties of amino acids to a large extent dictate the substitution patterns.
Nevertheless, some of the patterns remain unexplained. For example, it is puzzling
that RuBisCo genes (rbcL) are depleted of synonymous substitutions. Given the
enormous importance of RuBisCo for the biosphere [267], it will be interesting to
study larger and taxonomically distinct rbcL datasets by comparing their patterns of
codon substitutions based on the bubble plot analysis.
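The symmetry of a substitution matrix can also be quantified rather than judged by eye. The index below is an illustrative choice (a Frobenius-norm ratio, not a statistic used in the chapter): it is 0 when every forward and reverse substitution occurs at the same rate and approaches 1 when substitutions are strongly one-directional:

```python
import math

def asymmetry_index(Q):
    """Illustrative asymmetry score for a substitution rate matrix.

    Returns ||Q - Q^T||_F / ||Q + Q^T||_F: 0.0 when forward and reverse
    substitutions occur at identical rates (perfect bubble-plot symmetry),
    approaching 1.0 when substitutions are strongly one-directional.
    Diagonal entries cancel in the numerator and do not affect the verdict.
    """
    num = den = 0.0
    n = len(Q)
    for i in range(n):
        for j in range(n):
            num += (Q[i][j] - Q[j][i]) ** 2
            den += (Q[i][j] + Q[j][i]) ** 2
    return math.sqrt(num / den) if den else 0.0

# Two toy 2x2 examples: fully reversible vs strictly one-way substitution.
symmetric = [[0.0, 1.0], [1.0, 0.0]]
one_way = [[0.0, 1.0], [0.0, 0.0]]
```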
Fig. 21 Bubble plot of codon substitution rates computed with the help of XRate for orthologous
group of rbcL genes for RuBisCo protein. The rbcL sequences from flowering land plants were
taken from [268]. Fifty codon sequences were used for model generation. See Fig. 19 for color code
and more details
Random mutations arising in individual genomes are the basic material for evolution.
If a mutation increases the reproductive success of the organism under given
conditions (i.e., confers a “fitness” benefit [269]), it will spread through the
population and become fixed faster. This process, whereby certain mutations are
fixed with a probability above the random expectation, is referred to as positive
selection. Conversely, negative (purifying)
selection eliminates detrimental mutations from the population, thereby conserving
an existing amino acid sequence. A weak purifying selection that removes alleles
with extreme deviations from the mean (optimal) fitness value creates the basis for
stabilizing selection. The latter is thought to be the main force in phenotypic
evolution, and recent works reinforce its importance in molecular evolution as
well [270].
Positive selection comes in several flavors depending on the timescale and the
origin of the sequences analyzed. Particularly, the data may come from either a sin-
gle population (e.g., mycobacteria isolated from the sputum of a patient) or different
species. Positive selection that favors non-synonymous changes across interspecies
sequences is known as diversifying selection. This type of selection can be assumed
constant over a certain time frame, e.g., if it affects similarly a protein family or a set
of related species. Within populations, positive selection can be either balancing or
directional. Balancing selection increases the polymorphism level (when the
polymorphism confers a fitness advantage on the carrier cell), while directional
selection decreases it by driving beneficial mutations to fixation. In the latter case,
the selective pressure fades away
as the mutation is no longer rare. A selective sweep is the result of positive selection
within a population, whereby a fixed beneficial mutation reduces variation at nearby
(linked) loci. Generally, interspecies and population coding sequence data pose
different kinds of problems for the detection of selective evolutionary forces.
Genetic alterations that define interspecific markers occurred a long time ago and thus
represent fixed mutations (substitutions). It is therefore relatively straightforward to
analyze the selection across species. In contrast, intraspecific genetic variation is
low and can be enriched for polymorphic sites (mutations not yet driven to fixation)
segregating in the populations. This requires a proper adjustment of the analysis
methods.
While beneficial and harmful mutations are under selective pressure, it appears
that many mutations are neutral or nearly neutral with regard to their impact on fitness
[271, 272]. Their maintenance in genes is determined by genetic drift and mutational
pressure rather than selection. Synonymous codon substitutions were one of the
earliest candidates for neutral mutations [273]. The current view of codon evolution
portrays a complex picture in which population structure and nonadaptive factors
(such as mutation bias and genetic drift) play an important role alongside selection;
the ongoing debate concerns the relative contributions of these forces in different
cases and whether the supply of mutations is sufficient to shape a gene’s structure.
Depending on the experimental model, the involvement of positive, purifying or
stabilizing forces in shaping codon usage has been suggested [67, 274], and different
genome loci can
experience a different kind and/or strength of selection [275]. No universal or taxon-
specific rules seem to exist for codon sequence evolution. Therefore, researchers
must carefully consider all possible evolutionary scenarios for the data at hand.
A single coding sequence cannot be used to pinpoint selection. Although high
values of certain codon indices, such as CAI, ENC or Fop (see above), may imply
that the gene is under selective pressure, it could be very difficult to tell which kind.
Furthermore, most codon indices rely on reference information, and so the prediction
is in fact based on more than one sequence. Codon volatility is the only notable
attempt to infer selection from a single sequence [276]. Volatility can be defined as
the probability that a point mutation in a codon leads to a non-synonymous codon.
Each codon therefore has a fixed volatility value, and gene volatility (ranging from
0.5 to 1.0) is the average over all its codons. If a coding sequence consists
of high volatility codons, then there is an increased probability that previous codon
substitutions were non-synonymous. The volatility of codons can therefore be used
as a statistic to reveal positive selection. The significance of gene volatility can
be deduced from its comparison to the bootstrap distribution of volatility values
of simulated sequences [277]. Numerous studies revealed no correlation between
codon volatility and positive selection in a number of organisms, mounting a strong
critique of this approach as a way to detect selection [278–280]. Hence, at least two
homologous coding sequences are needed to reveal selection.
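The definition above translates directly into code. A minimal sketch, which excludes mutations to stop codons as a simplifying assumption (published definitions treat them in slightly different ways):

```python
# Standard genetic code; codons ordered T,C,A,G at each position, '*' = stop.
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AA[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def volatility(codon):
    """Fraction of single-nucleotide neighbours that change the amino acid.

    Neighbours that create a stop codon are skipped (a simplifying choice).
    """
    syn = nonsyn = 0
    for pos in range(3):
        for b in BASES:
            if b == codon[pos]:
                continue
            neighbour = codon[:pos] + b + codon[pos + 1:]
            if CODE[neighbour] == "*":
                continue
            if CODE[neighbour] == CODE[codon]:
                syn += 1
            else:
                nonsyn += 1
    return nonsyn / (syn + nonsyn)

def gene_volatility(cds):
    """Average volatility over all sense codons of an in-frame CDS."""
    vals = [volatility(cds[i:i + 3]) for i in range(0, len(cds), 3)
            if CODE[cds[i:i + 3]] != "*"]
    return sum(vals) / len(vals)
```

For example, every neighbour of TGG (Trp) encodes a different amino acid, so its volatility is 1.0, whereas GGG (Gly) tolerates all third-position changes.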
A number of methods for detecting intraspecific selection are based on predicting
the frequencies of mutations (alleles) expected under the neutral regime of evolution
within and among populations. If a genomic region has undergone selective sweep,
then there would be low sequence diversity and an excess of rare mutations. Dif-
ferential pressure of selection on populations would also leave a footprint in DNA
sequences. These differences create a foundation for a suite of neutrality test statistics,
such as Tajima’s D, Fu’s W, Wright’s FST and others [281]. The problem with these
neutrality tests is that their precision and interpretation depend on the demographic
history of the population, which is often difficult to know.
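As an illustration of such a statistic, Tajima's D compares two estimators of the population mutation parameter: the mean number of pairwise differences (π) and Watterson's estimator derived from the number of segregating sites. A minimal sketch using the standard variance constants, assuming gap-free aligned sequences:

```python
import math
from itertools import combinations

def tajimas_d(seqs):
    """Tajima's D for a set of aligned, gap-free sequences.

    Compares pi (mean pairwise differences) with Watterson's theta (S/a1);
    an excess of rare variants (e.g., after a selective sweep) drives D
    negative, while balancing selection tends to push it positive.
    """
    n = len(seqs)
    # S: number of segregating (polymorphic) sites
    S = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if S == 0:
        return 0.0
    # pi: average number of pairwise differences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(s1, s2))
             for s1, s2 in pairs) / len(pairs)
    # Variance constants of Tajima (1989)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1, e2 = c1 / a1, c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))
```

As with the published statistic, interpreting the sign still requires knowledge of the demographic history, which this sketch cannot supply.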
Many approaches to detect selection on coding sequences are based on the analysis
of non-synonymous (dN) and synonymous (dS) substitutions within pairwise or
multiple alignments of homologous coding sequences. Assuming neutral evolution
as a null hypothesis, several neutrality tests were developed to detect selection that
drives species divergence. For example, the McDonald–Kreitman test (Fig. 22)
compares the number of non-synonymous changes between and within species [282].
If this number is significantly higher between species, this is construed as a signal
of diversifying selection.
Fig. 22 McDonald–Kreitman neutrality test. A toy phylogenetic tree features two clades, one
represented by species A and another by three members of species B (blue lines). The test compares
the non-synonymous/synonymous substitution ratio (dN/dS) on species branches (circles) to the
non-synonymous/synonymous polymorphism ratio (pN/pS) on intraspecies lineages (diamonds). A
graphical legend is shown to the right of the figure
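The 2 × 2 table of the McDonald–Kreitman test is commonly summarized by the neutrality index NI = (pN/pS)/(dN/dS) and by α = 1 − NI, the estimated fraction of adaptive non-synonymous divergence. A minimal sketch with hypothetical counts:

```python
def mk_summary(dn, ds, pn, ps):
    """Summarize a McDonald-Kreitman 2x2 table.

    dn, ds: non-synonymous / synonymous fixed differences between species
    pn, ps: non-synonymous / synonymous polymorphisms within species
    Returns (NI, alpha). NI < 1 (alpha > 0) suggests an excess of
    non-synonymous divergence, i.e., diversifying (positive) selection;
    NI > 1 suggests segregating slightly deleterious variants.
    """
    if 0 in (dn, ds, ps):
        raise ValueError("counts on the denominators must be non-zero")
    ni = (pn / ps) / (dn / ds)
    return ni, 1.0 - ni

# Hypothetical counts in the spirit of Fig. 22:
ni, alpha = mk_summary(dn=40, ds=20, pn=10, ps=20)
```

A significance assessment (e.g., a Fisher exact test on the table) is omitted here for brevity.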
The computation of dN/dS, or the ω-ratio, is also an established way to detect
selection on sequences, and it is most efficient when the sequences are derived from
different species. Positive or negative selection is indicated when ω > 1 or ω < 1,
respectively, whereas ω = 1 suggests that the sequences evolve neutrally. The first
methods to compute this ratio relied on counting various changes in pairwise
comparisons of codon sequences [233]. Counting approaches, however, suffer from
two problems.
First, they lack power when sequences are too similar or too divergent, and they can
be applied only to pairs of sequences rather than to multiple sequence alignments.
There is no proper framework to handle the uncertainty when the methods are
applied over a phylogeny in a pairwise fashion (although some solutions have been
proposed on the basis of visual exploration of the model fit [283, 284]). Second, in
most cases selection impacts only a few sites in a coding sequence, but counting
methods have no means to estimate site-specific ω values [285]. A recent notable
development of counting methods is renaissance counting, implemented in the package
BEAST (see below). It maps substitutions throughout the phylogeny in the 4 × 4
nucleotide space, then counts dN and dS and their “neutral” expectations, and applies
an empirical Bayes procedure to those counts to arrive at dN/dS estimates [286]. This
approach is computationally efficient when it comes to the analysis of very large
datasets and offers solutions to the aforementioned shortcomings of early counting
methods.
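The logic of the early counting approaches can be illustrated in the spirit of the Nei–Gojobori method: count the potential synonymous and non-synonymous sites of each codon, then classify the observed differences. The sketch below is a simplification (codon pairs differing at more than one position are skipped, whereas real implementations average over the possible mutational pathways):

```python
# Standard genetic code; codons ordered T,C,A,G at each position, '*' = stop.
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AA[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def syn_sites(codon):
    """Nei-Gojobori-style count of potential synonymous sites in a codon."""
    s = 0.0
    for pos in range(3):
        syn = sum(1 for b in BASES if b != codon[pos]
                  and CODE[codon[:pos] + b + codon[pos + 1:]] == CODE[codon])
        s += syn / 3.0
    return s

def count_changes(cds1, cds2):
    """Classify differences between two aligned, in-frame coding sequences.

    Returns (Nd, Sd, N, S): observed non-synonymous and synonymous
    differences, and potential non-synonymous and synonymous sites
    (averaged over the two sequences). Multi-hit codons are skipped.
    """
    nd = sd = 0
    n_sites = s_sites = 0.0
    for i in range(0, len(cds1), 3):
        c1, c2 = cds1[i:i + 3], cds2[i:i + 3]
        if "*" in (CODE[c1], CODE[c2]):
            continue
        s = (syn_sites(c1) + syn_sites(c2)) / 2.0
        s_sites += s
        n_sites += 3.0 - s
        diffs = sum(a != b for a, b in zip(c1, c2))
        if diffs == 1:
            if CODE[c1] == CODE[c2]:
                sd += 1
            else:
                nd += 1
    return nd, sd, n_sites, s_sites

# omega would then be (Nd/N) / (Sd/S), usually with a multiple-hit correction.
```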
Model-based methods of estimating the ω-ratio perform at least as well as, and often
better than, the counting methods. Even with only two sequences, maximum
likelihood methods allow for accurate inferences [287]. Model-based approaches are
naturally applicable to multiple sequence alignments, from which site-specific selection
parameters can also be inferred. Unlike the counting approaches, maximum likeli-
hood estimates have attractive statistical properties (e.g., convergence to the true
value as data size grows). It is desirable to use multiple sequence alignments instead
of pairwise ones, as the inferences will be more powerful and selection can be studied
with the more sophisticated codon models described above. A typical scheme is as
follows.
For a given dataset, infer a multiple codon sequence alignment and a phylogenetic tree
(preferably under a codon model, which can be done with the program codonPhyML
[288]). Based on the inferred alignment and phylogeny, compare two codon models:
one that allows selection (ω can be >1 for all or some sites and/or branches of the
phylogeny) and another that does not. Using statistical model comparisons, such as
the LRT (see above), one can deduce whether the selection-permitting model
describes the data better. The absence of differential selection pressure on
synonymous and non-synonymous sites is an implicit condition of many ω-based
approaches (which is not strictly obeyed; see preceding sections), although there are
also models that allow variable dN and dS. If there is reason to believe that not all
synonymous sites are equivalent, this should be properly incorporated into the
modeling process. The ω-based approaches can also be applied to intraspecies
(population)
data; here, one should be careful about the initial assumptions on effective popu-
lation sizes and mutation rates, as these could affect the conclusions. A detailed
treatment of the use of parametric codon substitution models in natural selection
studies is given in [1]. An entry-level tutorial on detecting selection in microbial
genomes using ω-based methods can be found in [289].
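The model comparison step reduces to a likelihood-ratio test on the two maximized log-likelihoods. A minimal sketch, assuming the null distribution is χ² with an even number of degrees of freedom (df = 2 is typical for site-model comparisons), for which the survival function has a closed form:

```python
import math

def lrt_pvalue(lnl_null, lnl_alt, df=2):
    """Likelihood-ratio test between nested codon models.

    The statistic 2 * (lnL_alt - lnL_null) is compared to a chi-squared
    distribution. For even df the chi-squared survival function has the
    closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!, used here to
    avoid external dependencies.
    """
    if df % 2 != 0:
        raise ValueError("this sketch supports even df only")
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0:                  # alternative model fits no better
        return 1.0
    half = stat / 2.0
    term, total = 1.0, 0.0
    for i in range(df // 2):
        if i > 0:
            term *= half / i       # next term of the truncated series
        total += term
    return math.exp(-half) * total

# E.g., the selection model improves lnL by 8.1 units over the null:
p = lrt_pvalue(lnl_null=-12345.6, lnl_alt=-12337.5, df=2)
```

Note that tests of ω > 1 on the boundary of the parameter space are often evaluated against a mixture of χ² distributions, which makes the simple χ² threshold conservative.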
Fig. 23 Fixed effects likelihood (FEL) analysis of 56 sequences, 84 codons long, encoding AraC-
type DNA binding domain of transcriptional factor AdpA from Actinobacteria. The branch leading
to the micromonosporae was selected for FEL analysis. The X-axis represents codon positions, and
Y-axis—omega value for each position. The ω values are much less than 1 (or zero if d N = 0) for
the chosen branch, indicating pervasive purifying selection (p-value threshold was set to 0.1)
9 Outlook
Fig. 24 Color-coded results of the SELECTON analysis of selective forces on the same dataset as
the one used to prepare Fig. 23
modulation of translation speed, it may also determine mRNA stability and protein
conformation. However, the enormous complexity and scale of molecular data make
it nontrivial to understand the associated biological processes and to extract general
rules. Below, we attempt to summarize the main issues in the area of codon usage
and how visualization-oriented tools may improve the situation.
Computational and experimental approaches toward better understanding of
intrinsic and context-dependent codon usage biases remain a fertile ground for
database and software development. Traditional CUB indices enjoy sustained use,
and new ones are constantly being reported. There is a clear need for more extensive
benchmarking studies of different CUB indices. This will help to understand what
they measure, how they are useful and what can be improved. There seems to be a
script or program for virtually every aspect of CUB in biological software libraries,
but not all of them are known or easily accessible to the wider community of
biologists; some of the required tools are mentioned in the respective sections of this
text. A more consistent effort in this direction will undoubtedly deepen the insights
into the underlying
biological phenomena. On a more fundamental level, CUB studies generate a flood of
data that are quite complex and often subject to disparate interpretations [298].
Much remains to be discovered in this area, and improved visualization may help to
organize and embrace the complexities of codon usage biases.
Current CSMs largely neglect functional, structural or phenotypic data associated
with the sequences under study. The evolution of coding sequences is ultimately
dependent on the fitness effects of mutations, and these effects depend on the genetic
background, starting from the intragenic level [299]. It is imperative to keep searching
for statistical and computational foundations that allow CSMs to incorporate
biochemical and structural information, or genotype–phenotype effects. The
mutation–selection framework is one example of how this can be accomplished [245],
although there is a long way to go before such models become a routine and reliable
tool. As the
sequencing data enter the big-data arena, there is also a need to increase the
computation speed of codon models. Likewise, there is a clear need for improved
software engineering quality in programs for detecting selection [300], which will pay
off in terms of their usage and the knowledge gained.
Acknowledgements B.O. thanks numerous students and coworkers who investigated various
aspects of codon usage. Work in the laboratory of B.O. was supported by the grants from Min-
istry of Education and Science of Ukraine and State Fund for Fundamental Research. M.A. thanks
the Swiss National Science Foundation for research funding (grant 31003A_182330/1).
References
1. Cannarozzi GM, Schneider A (eds) (2012) Codon evolution. Mechanisms and models. Oxford
University Press, New York, 297 p. ISBN 978-0-19-960116-5
2. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman
MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen
S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza
JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade
KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth
GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA,
Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM (2005) Genome
sequencing in microfabricated high-density picolitre reactors. Nature 437(7057):376–380
3. Rothberg JM, Leamon JH (2008) The development and impact of 454 sequencing. Nat
Biotechnol 26(10):1117–1124. https://doi.org/10.1038/nbt1485
4. Slatko BE, Gardner AF, Ausubel FM (2018) Overview of next-generation sequencing
technologies. Curr Protoc Mol Biol 122(1):e59. https://doi.org/10.1002/cpmb.59
5. O’Donoghue SI, Baldi BF, Clark SJ, Darling AE, Hogan JM, Kaur S, Maier-Hein L, McCarthy
DJ, Moore WJ, Stenau E, Swedlow JR, Vuong J, Procter JB (2018) Visualization of biomedi-
cal data. Annu Rev Biomed Data Sci 1:275–304. https://doi.org/10.1146/annurev-biodatasci-
080917-013424
6. Liu X, Zhang J, Ni F, Dong X, Han B, Han D, Ji Z, Zhao Y (2010) Genome wide exploration
of the origin and evolution of amino acids. BMC Evol Biol 15(10):77. https://doi.org/10.1186/
1471-2148-10-77
7. Jordan IK, Kondrashov FA, Adzhubei IA, Wolf YI, Koonin EV, Kondrashov AS, Sun-
yaev S (2005) A universal trend of amino acid gain and loss in protein evolution. Nature
433(7026):633–638
8. Fimmel E, Strüngmann L (2018) Mathematical fundamentals for the noise immunity of the
genetic code. Biosystems 164:186–198. https://doi.org/10.1016/j.biosystems.2017.09.007
9. Keeling PJ (2016) Genomics: evolution of the genetic code. Curr Biol 26(18):R851–R853.
https://doi.org/10.1016/j.cub.2016.08.005
10. Koonin EV, Novozhilov AS (2017) Origin and evolution of the universal genetic code. Annu
Rev Genet 27(51):45–62. https://doi.org/10.1146/annurev-genet-120116-024713
11. Heaphy SM, Mariotti M, Gladyshev VN, Atkins JF, Baranov PV (2016) Novel ciliate
genetic code variants including the reassignment of all three stop codons to sense codons
in Condylostoma magnum. Mol Biol Evol 33(11):2885–2889
12. Mühlhausen S, Schmitt HD, Pan KT, Plessmann U, Urlaub H, Hurst LD, Kollmar M (2018)
Endogenous stochastic decoding of the CUG codon by competing Ser- and Leu-tRNAs
in Ascoidea asiatica. Curr Biol 28(13):2046–2057.e5. https://doi.org/10.1016/j.cub.2018.
04.085
13. Miranda I, Rocha R, Santos MC, Mateus DD, Moura GR, Carreto L, Santos MA (2007) A
genetic code alteration is a phenotype diversity generator in the human pathogen Candida
albicans. PLoS ONE 2(10):e996
14. Väre VY, Eruysal ER, Narendran A, Sarachan KL, Agris PF (2017) Chemical and conforma-
tional diversity of modified nucleosides affects tRNA structure and function. Biomolecules
7(1):pii: E29. https://doi.org/10.3390/biom7010029
15. Agris PF, Narendran A, Sarachan K, Väre VYP, Eruysal E (2017) The importance of being
modified: the role of RNA modifications in translational fidelity. Enzymes 41:1–50. https://
doi.org/10.1016/bs.enz.2017.03.005
16. Schweizer U, Bohleber S, Fradejas-Villar N (2017) The modified base isopentenyladenosine
and its derivatives in tRNA. RNA Biol 14(9):1197–1208. https://doi.org/10.1080/15476286.
2017.1294309
17. Hori H (2017) Transfer RNA methyltransferases with a SpoU-TrmD (SPOUT) fold and
their modified nucleosides in tRNA. Biomolecules 7(1):pii: E23. https://doi.org/10.3390/
biom7010023
18. Hou YM, Masuda I, Gamper H (2019) Codon-Specific Translation by m(1)G37 Methylation
of tRNA. Front Genet 10(9):713. https://doi.org/10.3389/fgene.2018.00713
19. Pan T (2018) Modifications and functional genomics of human transfer RNA. Cell Res
28(4):395–404. https://doi.org/10.1038/s41422-018-0013-y
20. Schimmel P (2018) The emerging complexity of the tRNA world: mammalian tRNAs beyond
protein synthesis. Nat Rev Mol Cell Biol 19(1):45–58. https://doi.org/10.1038/nrm.2017.77
21. Silva RM, Paredes JA, Moura GR, Manadas B, Lima-Costa T, Rocha R, Miranda I, Gomes
AC, Koerkamp MJ, Perrot M, Holstege FC, Boucherie H, Santos MA (2007) Critical roles for
a genetic code alteration in the evolution of the genus Candida. EMBO J 26(21):4555–4565
22. Zhang Z, Yu J (2011) On the organizational dynamics of the genetic code. Genomics
Proteomics Bioinformatics. 9(1–2):21–29. https://doi.org/10.1016/S1672-0229(11)60004-1
23. Rosandić M, Paar V (2014) Codon sextets with leading role of serine create “ideal” symmetry
classification scheme of the genetic code. Gene 543(1):45–52. https://doi.org/10.1016/j.gene.
2014.04.009
24. José MV, Zamudio GS, Morgado ER (2017) A unified model of the standard genetic code. R
Soc Open Sci 4(3):160908. https://doi.org/10.1098/rsos.160908
25. Acevedo-Rocha CG, Budisa N (2016) Xenomicrobiology: a roadmap for genetic code
engineering. Microb Biotechnol 9(5):666–676. https://doi.org/10.1111/1751-7915.12398
26. van der Gulik PT, Hoff WD (2016) Anticodon modifications in the tRNA set of LUCA and
the fundamental regularity in the standard genetic code. PLoS ONE 11(7):e0158342. https://
doi.org/10.1371/journal.pone.0158342
27. Grosjean H, Westhof E (2016) An integrated, structure- and energy-based view of the genetic
code. Nucleic Acids Res 44(17):8020–8040. https://doi.org/10.1093/nar/gkw608
28. Subramaniam AR, Pan T, Cluzel P (2013) Environmental perturbations lift the degeneracy of
the genetic code to regulate protein levels in bacteria. Proc Natl Acad Sci U S A 110(6):2419–
2424. https://doi.org/10.1073/pnas.1211077110
29. Moukadiri I, Garzón MJ, Björk GR, Armengod ME (2014) The output of the tRNA modi-
fication pathways controlled by the Escherichia coli MnmEG and MnmC enzymes depends
on the growth conditions and the tRNA species. Nucleic Acids Res 42(4):2602–2623. https://
doi.org/10.1093/nar/gkt1228
30. Asano K, Suzuki T, Saito A, Wei FY, Ikeuchi Y, Numata T, Tanaka R, Yamane Y, Yamamoto
T, Goto T, Kishita Y, Murayama K, Ohtake A, Okazaki Y, Tomizawa K, Sakaguchi Y, Suzuki
T (2018) Metabolic and chemical regulation of tRNA modification associated with taurine
deficiency and human disease. Nucleic Acids Res 46(4):1565–1583. https://doi.org/10.1093/
nar/gky068
31. Kirchner S, Ignatova Z (2015) Emerging roles of tRNA in adaptive translation, signalling
dynamics and disease. Nat Rev Genet 16(2):98–112. https://doi.org/10.1038/nrg3861
32. Rogers SO (2019) Evolution of the genetic code based on conservative changes of codons,
amino acids, and aminoacyl tRNA synthetases. J Theor Biol 7(466):1–10. https://doi.org/10.
1016/j.jtbi.2019.01.022
33. Itzkovitz S, Alon U (2007) The genetic code is nearly optimal for allowing additional
information within protein-coding sequences. Genome Res 17(4):405–412
34. Itzkovitz S, Hodis E, Segal E (2010) Overlapping codes within protein-coding sequences.
Genome Res 20(11):1582–1589. https://doi.org/10.1101/gr.105072.110
35. Bollenbach T, Vetsigian K, Kishony R (2007) Evolution and multilevel optimization of the
genetic code. Genome Res 17(4):401–404
36. Wnętrzak M, Błażej P, Mackiewicz D, Mackiewicz P (2018) The optimality of the standard
genetic code assessed by an eight-objective evolutionary algorithm. BMC Evol Biol 18(1):192.
https://doi.org/10.1186/s12862-018-1304-0
37. Błażej P, Wnętrzak M, Mackiewicz D, Gagat P, Mackiewicz P (2019) Many alternative and
theoretical genetic codes are more robust to amino acid replacements than the standard genetic
code. J Theor Biol 7(464):21–32. https://doi.org/10.1016/j.jtbi.2018.12.030
38. Kuruoglu EE, Arndt PF (2017) The information capacity of the genetic code: is the natural
code optimal? J Theor Biol 21(419):227–237. https://doi.org/10.1016/j.jtbi.2017.01.046
39. Agarwal D, Gregory ST, O’Connor M (2011) Error-prone and error-restrictive mutations
affecting ribosomal protein S12. J Mol Biol 410(1):1–9. https://doi.org/10.1016/j.jmb.2011.
04.068
40. Robinson LJ, Cameron AD, Stavrinides J (2015) Spontaneous and on point: do spontaneous
mutations used for laboratory experiments cause pleiotropic effects that might confound
bacterial infection and evolution assays? FEMS Microbiol Lett 362(21):pii: fnv177. https://
doi.org/10.1093/femsle/fnv177
41. An W, Chin JW (2011) Orthogonal gene expression in Escherichia coli. Methods Enzymol
497:115–134. https://doi.org/10.1016/B978-0-12-385075-1.00005-6
42. Liu CC, Jewett MC, Chin JW, Voigt CA (2018) Toward an orthogonal central dogma. Nat
Chem Biol 14(2):103–106. https://doi.org/10.1038/nchembio.2554
43. Ishikawa J, Hotta K (1999) FramePlot: a new implementation of the frame analysis for pre-
dicting protein-coding regions in bacterial DNA with a high G+C content. FEMS Microbiol
Lett 174(2):251–253
44. Fickett JW, Tung CS (1992) Assessment of protein coding measures. Nucleic Acids Res
20(24):6441–6450
45. Azad RK, Borodovsky M (2004) Probabilistic methods of identifying genes in prokaryotic
genomes: connections to the HMM theory. Brief Bioinform 5(2):118–130
46. Pride DT, Meinersmann RJ, Wassenaar TM, Blaser MJ (2003) Evolutionary implications of
microbial genome tetranucleotide frequency biases. Genome Res 13(2):145–158
47. Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO (2004) TETRA: a web-service
and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns
in DNA sequences. BMC Bioinform 26(5):163
48. Richter M, Rosselló-Móra R, Oliver Glöckner F, Peplies J (2016) JSpeciesWS: a web server for
prokaryotic species circumscription based on pairwise genome comparison. Bioinformatics
32(6):929–931. https://doi.org/10.1093/bioinformatics/btv681
49. Wang Y, Zeng Z, Liu TL, Sun L, Yao Q, Chen KP (2019) TA, GT and AC are significantly
under-represented in open reading frames of prokaryotic and eukaryotic protein-coding genes.
Mol Genet Genomics. https://doi.org/10.1007/s00438-019-01535-1
50. Akogwu I, Wang N, Zhang C, Gong P (2016) A comparative study of k-spectrum-based error
correction methods for next-generation sequencing data analysis. Hum Genomics 10(Suppl
2):20. https://doi.org/10.1186/s40246-016-0068-0
51. Mapleson D, Garcia Accinelli G, Kettleborough G, Wright J, Clavijo BJ (2017) KAT: a K-
mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics
33(4):574–576. https://doi.org/10.1093/bioinformatics/btw663
52. Sheppard SK, Guttman DS, Fitzgerald JR (2018) Population genomics of bacterial host
adaptation. Nat Rev Genet 19(9):549–565. https://doi.org/10.1038/s41576-018-0032-z
53. Camiolo S, Porceddu A (2018) corseq: fast and efficient identification of favoured codons
from next generation sequencing reads. PeerJ. 4(6):e5099. https://doi.org/10.7717/peerj.5099
54. Lees JA, Vehkala M, Välimäki N, Harris SR, Chewapreecha C, Croucher NJ, Marttinen P,
Davies MR, Steer AC, Tong SY, Honkela A, Parkhill J, Bentley SD, Corander J (2016)
Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes.
Nat Commun 16(7):12797. https://doi.org/10.1038/ncomms12797
55. Mohamadi H, Khan H, Birol I (2017) ntCard: a streaming algorithm for cardinality estimation
in genomics data. Bioinformatics 33(9):1324–1330. https://doi.org/10.1093/bioinformatics/
btw832
56. Manekar SC, Sathe SR (2018) A benchmark study of k-mer counting methods for high-
throughput sequencing. Gigascience 7(12). https://doi.org/10.1093/gigascience/giy125
57. Fuglsang A (2004) Nucleotides downstream of start codons show marked non-randomness
in Escherichia coli but not in Bacillus subtilis. Antonie Van Leeuwenhoek 86(2):149–158
58. Rokytskyy I, Kulaha S, Mutenko H, Rabyk M, Ostash B (2017) Peculiarities of codon context
and substitution within streptomycete genomes. Visn Lviv Univ Ser Biol 75:66–74. https://
doi.org/10.30970/vlubs.2017.75.07
59. Knight RD, Freeland SJ, Landweber LF (2001) A simple model based on mutation and
selection explains trends in codon and amino-acid usage and GC composition within and
across genomes. Genome Biol 2(4):RESEARCH0010
60. Ikemura T (1985) Codon usage and tRNA content in unicellular and multicellular organisms.
Mol Biol Evol 2(1):13–34
61. Higgs PG, Ran W (2008) Coevolution of codon usage and tRNA genes leads to alternative
stable states of biased codon usage. Mol Biol Evol 25(11):2279–2291. https://doi.org/10.
1093/molbev/msn173
62. Kanaya S, Yamada Y, Kinouchi M, Kudo Y, Ikemura T (2001) Codon usage and tRNA
genes in eukaryotes: correlation of codon usage diversity with translation efficiency and with
CG-dinucleotide usage as assessed by multivariate analysis. J Mol Evol 53(4–5):290–298
63. Yu C-H, Dang Y, Zhou Z et al (2015) Codon usage influences the local rate of translation
elongation to regulate co-translational protein folding. Mol Cell 59:744–754
64. Quax TE, Claassens NJ, Söll D, van der Oost J (2015) Codon bias as a means to fine-tune
gene expression. Mol Cell 59(2):149–161. https://doi.org/10.1016/j.molcel.2015.05.035
65. Ikemura T (1981) Correlation between the abundance of Escherichia coli transfer RNAs and
the occurrence of the respective codons in its protein genes: a proposal for a synonymous
codon choice that is optimal for the E. coli translational system. J Mol Biol 151(3):389–409
66. dos Reis M, Savva R, Wernisch L (2004) Solving the riddle of codon usage preferences: a
test for translational selection. Nucleic Acids Res 32(17):5036–5044
67. Hershberg R, Petrov DA (2008) Selection on codon bias. Annu Rev Genet 42:287–299. https://
doi.org/10.1146/annurev.genet.42.110807.091442
68. Gribskov M, Devereux J, Burgess RR (1984) The codon preference plot: graphic analysis
of protein coding sequences and prediction of gene expression. Nucleic Acids Res 12(1 Pt
2):539–549
69. Garcia-Vallve S, Guzman E, Montero MA, Romeu A (2003) HGT-DB: a database of puta-
tive horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Res
31(1):187–189
70. Puigbò P, Romeu A, Garcia-Vallvé S (2008) HEG-DB: a database of predicted highly
expressed genes in prokaryotic complete genomes under translational selection. Nucleic Acids
Res 36(Database issue):D524–D527
71. Paulet D, David A, Rivals E (2017) Ribo-seq enlightens codon usage bias. DNA Res 24(3):303–310. https://doi.org/10.1093/dnares/dsw062
72. Sharp PM, Li WH (1987) The codon Adaptation Index—a measure of directional synonymous
codon usage bias, and its potential applications. Nucleic Acids Res 15(3):1281–1295
73. Dittmar KA, Sørensen MA, Elf J, Ehrenberg M, Pan T (2005) Selective charging of tRNA
isoacceptors induced by amino-acid starvation. EMBO Rep 6(2):151–157
74. Welch M, Govindarajan S, Ness JE, Villalobos A, Gurney A, Minshull J, Gustafsson C
(2009) Design parameters to control synthetic gene expression in Escherichia coli. PLoS
ONE 4(9):e7002. https://doi.org/10.1371/journal.pone.0007002
75. Wu GD, Chen J, Hoffmann C, Bittinger K, Chen YY, Keilbaugh SA, Bewtra M, Knights D,
Walters WA, Knight R, Sinha R, Gilroy E, Gupta K, Baldassano R, Nessel L, Li H, Bushman
FD, Lewis JD (2011) Linking long-term dietary patterns with gut microbial enterotypes.
Science 334(6052):105–108. https://doi.org/10.1126/science.1208344
76. Koropatkin NM, Cameron EA, Martens EC (2012) How glycan metabolism shapes the human
gut microbiota. Nat Rev Microbiol 10(5):323–335. https://doi.org/10.1038/nrmicro2746
77. Hodgson DA (2000) Primary metabolism and its control in streptomycetes: a most unusual
group of bacteria. Adv Microb Physiol 42:47–238
78. Ho A, Di Lonardo DP, Bodelier PL (2017) Revisiting life strategy concepts in environmental
microbial ecology. FEMS Microbiol Ecol 93(3). https://doi.org/10.1093/femsec/fix006
79. Nakao A, Yoshihama M, Kenmochi N (2004) RPG: the Ribosomal Protein Gene database.
Nucleic Acids Res 32(Database issue):D168–D170
80. Carbone A, Zinovyev A, Képès F (2003) Codon adaptation index as a measure of dominating
codon bias. Bioinformatics 19(16):2005–2015
81. Raiford DW, Doom TE, Krane DE, Raymer ME (2011) A genetic optimization approach for
isolating translational efficiency bias. IEEE/ACM Trans Comput Biol Bioinf 8(2):342–352
82. Xia X (2015) A major controversy in codon-anticodon adaptation resolved by a new codon
usage index. Genetics 199(2):573–579. https://doi.org/10.1534/genetics.114.172106
83. Kudla G, Murray AW, Tollervey D, Plotkin JB (2009) Coding-sequence determinants of gene
expression in Escherichia coli. Science 324(5924):255–258. https://doi.org/10.1126/science.
1170160
84. Garcia V, Zoller S, Anisimova M (2018) Accounting for programmed ribosomal frameshifting
in the computation of codon usage bias indices. G3 (Bethesda) 8(10):3173–3183. https://doi.
org/10.1534/g3.118.200185
Visualizing Codon Usage Within and Across Genomes … 277
85. Wei Y, Silke JR, Xia X (2019) An improved estimation of tRNA expression to better elucidate
the coevolution between tRNA abundance and codon usage in bacteria. Sci Rep 9(1):3184.
https://doi.org/10.1038/s41598-019-39369-x
86. Pechmann S, Frydman J (2013) Evolutionary conservation of codon optimality reveals hidden
signatures of cotranslational folding. Nat Struct Mol Biol 20(2):237–243. https://doi.org/10.
1038/nsmb.2466
87. Sabi R, Tuller T (2014) Modelling the efficiency of codon-tRNA interactions based on codon
usage bias. DNA Res 21(5):511–526. https://doi.org/10.1093/dnares/dsu017
88. Roymondal U, Das S, Sahoo S (2009) Predicting gene expression level from relative codon
usage bias: an application to Escherichia coli genome. DNA Res 16(1):13–30. https://doi.
org/10.1093/dnares/dsn029
89. Wright F (1990) The ‘effective number of codons’ used in a gene. Gene 87(1):23–29
90. Fuglsang A (2004) The ‘effective number of codons’ revisited. Biochem Biophys Res
Commun 317(3):957–964
91. Novembre JA (2002) Accounting for background nucleotide composition when measuring
codon usage bias. Mol Biol Evol 19(8):1390–1394
92. Liu X (2013) A more accurate relationship between ‘effective number of codons’ and GC3s
under assumptions of no selection. Comput Biol Chem 42:35–39. https://doi.org/10.1016/j.
compbiolchem.2012.11.003
93. Sun X, Yang Q, Xia X (2013) An improved implementation of effective number of codons (Nc). Mol Biol Evol 30(1):191–196. https://doi.org/10.1093/molbev/mss201
94. Liu SS, Hockenberry AJ, Jewett MC, Amaral LAN (2018) A novel framework for evaluating the performance of codon usage bias metrics. J R Soc Interface 15(138):20170667. https://doi.org/10.1098/rsif.2017.0667
95. Zhang Z, Li J, Cui P, Ding F, Li A, Townsend JP, Yu J (2012) Codon Deviation Coefficient: a novel measure for estimating codon usage bias and its statistical significance. BMC Bioinform 13:43. https://doi.org/10.1186/1471-2105-13-43
96. Gilchrist MA, Shah P, Zaretzki R (2009) Measuring and detecting molecular adaptation in
codon usage against nonsense errors during protein translation. Genetics 183(4):1493–1505.
https://doi.org/10.1534/genetics.109.108209
97. Chou T (2003) Ribosome recycling, diffusion, and mRNA loop formation in translational
regulation. Biophys J 85(2):755–773
98. Mitarai N, Sneppen K, Pedersen S (2008) Ribosome collisions and translation efficiency:
optimization by codon usage and mRNA destabilization. J Mol Biol 382(1):236–245. https://
doi.org/10.1016/j.jmb.2008.06.068
99. Gilchrist MA, Chen WC, Shah P, Landerer CL, Zaretzki R (2015) Estimating gene expression
and codon-specific translational efficiencies, mutation biases, and selection coefficients from
genomic data alone. Genome Biol Evol 7(6):1559–1579. https://doi.org/10.1093/gbe/evv087
100. Proshkin S, Rahmouni AR, Mironov A, Nudler E (2010) Cooperation between translating
ribosomes and RNA polymerase in transcription elongation. Science 328(5977):504–508.
https://doi.org/10.1126/science.1184939
101. Boël G, Letso R, Neely H, Price WN, Wong KH, Su M, Luff J, Valecha M, Everett JK,
Acton TB, Xiao R, Montelione GT, Aalberts DP, Hunt JF (2016) Codon influence on protein
expression in E. coli correlates with mRNA levels. Nature 529(7586):358–363. https://doi.
org/10.1038/nature16509
102. Bellaousov S, Reuter JS, Seetin MG, Mathews DH (2013) RNAstructure: web servers for RNA
secondary structure prediction and analysis. Nucleic Acids Res 41(Web Server issue):W471–
W474. https://doi.org/10.1093/nar/gkt290
103. Bentele K, Saffert P, Rauscher R, Ignatova Z, Blüthgen N (2013) Efficient translation initiation dictates codon usage at gene start. Mol Syst Biol 9:675. https://doi.org/10.1038/msb.2013.32
104. Kelsic ED, Chung H, Cohen N, Park J, Wang HH, Kishony R (2016) RNA structural determinants of optimal codons revealed by MAGE-Seq. Cell Syst 3(6):563–571.e6. https://doi.org/10.1016/j.cels.2016.11.004
105. Frumkin I, Schirman D, Rotman A, Li F, Zahavi L, Mordret E, Asraf O, Wu S, Levy SF, Pilpel
Y (2017) Gene architectures that minimize cost of gene expression. Mol Cell 65(1):142–153.
https://doi.org/10.1016/j.molcel.2016.11.007
106. Hanson G, Alhusaini N, Morris N, Sweet T, Coller J (2018) Translation elongation and mRNA
stability are coupled through the ribosomal A-site. RNA 24(10):1377–1389. https://doi.org/
10.1261/rna.066787.118
107. Arango D, Sturgill D, Alhusaini N, Dillman AA, Sweet TJ, Hanson G, Hosogane M, Sinclair
WR, Nanan KK, Mandler MD, Fox SD, Zengeya TT, Andresson T, Meier JL, Coller J,
Oberdoerffer S (2018) Acetylation of cytidine in mRNA promotes translation efficiency. Cell
175(7):1872–1886.e24. https://doi.org/10.1016/j.cell.2018.10.030
108. Schikora-Tamarit MÀ, Carey LB (2018) Poor codon optimality as a signal to degrade tran-
scripts with frameshifts. Transcription 9(5):327–333. https://doi.org/10.1080/21541264.2018.
1511676
109. Lykke-Andersen S, Jensen TH (2015) Nonsense-mediated mRNA decay: an intricate machin-
ery that shapes transcriptomes. Nat Rev Mol Cell Biol 16(11):665–677. https://doi.org/10.
1038/nrm4063
110. Radhakrishnan A, Chen YH, Martin S, Alhusaini N, Green R, Coller J (2016) The DEAD-box
protein Dhh1p couples mRNA decay and translation by monitoring codon optimality. Cell
167(1):122–132.e9. https://doi.org/10.1016/j.cell.2016.08.053
111. Presnyak V, Alhusaini N, Chen YH, Martin S, Morris N, Kline N, Olson S, Weinberg D,
Baker KE, Graveley BR, Coller J (2015) Codon optimality is a major determinant of mRNA
stability. Cell 160(6):1111–1124. https://doi.org/10.1016/j.cell.2015.02.029
112. Carneiro RL, Requião RD, Rossetto S, Domitrovic T, Palhano FL (2019) Codon stabilization
coefficient as a metric to gain insights into mRNA stability and codon bias and their rela-
tionships with translation. Nucleic Acids Res 47(5):2216–2228. https://doi.org/10.1093/nar/
gkz033
113. Kumar S, Stecher G, Li M, Knyaz C, Tamura K (2018) MEGA X: molecular evolutionary
genetics analysis across computing platforms. Mol Biol Evol 35(6):1547–1549. https://doi.
org/10.1093/molbev/msy096
114. Xia X (2018) DAMBE7: new and improved tools for data analysis in molecular biology and
evolution. Mol Biol Evol 35(6):1550–1552. https://doi.org/10.1093/molbev/msy073
115. Supek F, Vlahovicek K (2004) INCA: synonymous codon usage analysis and clustering by
means of self-organizing map. Bioinformatics 20(14):2329–2330
116. Peden JF (2005) CodonW. https://sourceforge.net/projects/codonw/. Last accessed Apr 2019
117. Vetrivel U, Arunkumar V, Dorairaj S (2007) ACUA: a software tool for automated codon
usage analysis. Bioinformation 2(2):62–63
118. Angellotti MC, Bhuiyan SB, Chen G, Wan XF (2007) CodonO: codon usage bias analysis
within and across genomes. Nucleic Acids Res 35(Web Server issue):W132–W136
119. Miller JB, Brase LR, Ridge PG (2019) ExtRamp: a novel algorithm for extracting the ramp
sequence based on the tRNA adaptation index or relative codon adaptiveness. Nucleic Acids
Res. https://doi.org/10.1093/nar/gky1193
120. Tuller T, Carmi A, Vestsigian K, Navon S, Dorfan Y, Zaborske J, Pan T, Dahan O, Furman
I, Pilpel Y (2010) An evolutionarily conserved mechanism for controlling the efficiency of
protein translation. Cell 141(2):344–354. https://doi.org/10.1016/j.cell.2010.03.031
121. Wu G, Culley DE, Zhang W (2005) Predicted highly expressed genes in the genomes of Strep-
tomyces coelicolor and Streptomyces avermitilis and the implications for their metabolism.
Microbiology 151(Pt 7):2175–2187
122. Grote A, Hiller K, Scheer M, Münch R, Nörtemann B, Hempel DC, Jahn D (2005) JCat: a
novel tool to adapt codon usage of a target gene to its potential expression host. Nucleic Acids
Res 33(Web Server issue):W526–W531
123. Puigbò P, Bravo IG, Garcia-Vallve S (2008) CAIcal: a combined set of tools to assess codon usage adaptation. Biol Direct 3:38. https://doi.org/10.1186/1745-6150-3-38
124. McInerney JO (1998) GCUA: general codon usage analysis. Bioinformatics 14(4):372–373
125. Sabi R, Volvovitch Daniel R, Tuller T (2017) stAIcalc: tRNA adaptation index calculator
based on species-specific weights. Bioinformatics 33(4):589–591. https://doi.org/10.1093/
bioinformatics/btw647
126. Athey J, Alexaki A, Osipova E, Rostovtsev A, Santana-Quintero LV, Katneni U, Simonyan V,
Kimchi-Sarfaty C (2017) A new and updated resource for codon usage tables. BMC Bioinform
18(1):391. https://doi.org/10.1186/s12859-017-1793-7
127. Yoon J, Chung YJ, Lee M (2018) STADIUM: species-specific tRNA adaptive index
compendium. Genomics Inform 16(4):e28. https://doi.org/10.5808/GI.2018.16.4.e28
128. Plotkin JB, Kudla G (2011) Synonymous but not the same: the causes and consequences of
codon bias. Nat Rev Genet 12(1):32–42. https://doi.org/10.1038/nrg2899
129. Ghaemmaghami S, Huh WK, Bower K, Howson RW, Belle A, Dephoure N, O’Shea EK,
Weissman JS (2003) Global analysis of protein expression in yeast. Nature 425(6959):737–
741
130. Ishihama Y, Schmidt T, Rappsilber J, Mann M, Hartl FU, Kerner MJ, Frishman D (2008) Protein abundance profiling of the Escherichia coli cytosol. BMC Genom 9:102. https://doi.org/10.1186/1471-2164-9-102
131. Liu Y, Beyer A, Aebersold R (2016) On the dependency of cellular protein levels on mRNA
abundance. Cell 165(3):535–550. https://doi.org/10.1016/j.cell.2016.03.014
132. Hanson G, Coller J (2018) Codon optimality, bias and usage in translation and mRNA decay.
Nat Rev Mol Cell Biol 19(1):20–30. https://doi.org/10.1038/nrm.2017.91
133. Frumkin I, Lajoie MJ, Gregg CJ, Hornung G, Church GM, Pilpel Y (2018) Codon usage of highly expressed genes affects proteome-wide translation efficiency. Proc Natl Acad Sci U S A 115(21):E4940–E4949. https://doi.org/10.1073/pnas.1719375115
134. Puigbò P, Guzmán E, Romeu A, Garcia-Vallvé S (2007) OPTIMIZER: a web server
for optimizing the codon usage of DNA sequences. Nucleic Acids Res 35(Web Server
issue):W126–W131
135. Hatfield GW, Roth DA (2007) Optimizing scaleup yield for protein production: computation-
ally optimized DNA assembly (CODA) and translation engineering. Biotechnol Annu Rev
13:27–42
136. Cheng BYH, Nogales A, de la Torre JC, Martínez-Sobrido L (2017) Development of live-attenuated arenavirus vaccines based on codon deoptimization of the viral glycoprotein. Virology 501:35–46. https://doi.org/10.1016/j.virol.2016.11.001
137. Jia W, Higgs PG (2008) Codon usage in mitochondrial genomes: distinguishing context-
dependent mutation from translational selection. Mol Biol Evol 25(2):339–351
138. Aalberts DP, Boël G, Hunt JF (2017) Codon clarity or conundrum? Cell Syst 4(1):16–19. https://doi.org/10.1016/j.cels.2017.01.004
139. Webster GR, Teh AY, Ma JK (2017) Synthetic gene design-The rationale for codon optimiza-
tion and implications for molecular pharming in plants. Biotechnol Bioeng 114(3):492–502.
https://doi.org/10.1002/bit.26183
140. Mauro VP, Chappell SA (2018) Considerations in the use of codon optimization for recom-
binant protein expression. Methods Mol Biol 1850:275–288. https://doi.org/10.1007/978-1-
4939-8730-6_18
141. Mauro VP (2018) Codon optimization in the production of recombinant biotherapeutics:
potential risks and considerations. BioDrugs 32(1):69–81. https://doi.org/10.1007/s40259-
018-0261-x
142. Mandad S, Rahman RU, Centeno TP, Vidal RO, Wildhagen H, Rammner B, Keihani S, Opazo
F, Urban I, Ischebeck T, Kirli K, Benito E, Fischer A, Yousefi RY, Dennerlein S, Rehling P,
Feussner I, Urlaub H, Bonn S, Rizzoli SO, Fornasiero EF (2018) The codon sequences predict
protein lifetimes and other parameters of the protein life cycle in the mouse brain. Sci Rep
8(1):16913. https://doi.org/10.1038/s41598-018-35277-8
143. Liu Y, Mi Y, Mueller T, Kreibich S, Williams EG, Van Drogen A, Borel C, Frank M, Germain
PL, Bludau I, Mehnert M, Seifert M, Emmenlauer M, Sorg I, Bezrukov F, Bena FS, Zhou
H, Dehio C, Testa G, Saez-Rodriguez J, Antonarakis SE, Hardt WD, Aebersold R (2019)
Multi-omic measurements of heterogeneity in HeLa cells across laboratories. Nat Biotechnol
37(3):314–322. https://doi.org/10.1038/s41587-019-0037-y
164. Peil L, Starosta AL, Lassak J, Atkinson GC, Virumäe K, Spitzer M, Tenson T, Jung K, Remme J, Wilson DN (2013) Distinct XPPX sequence motifs induce ribosome stalling, which is rescued by the translation elongation factor EF-P. Proc Natl Acad Sci U S A 110(38):15265–15270. https://doi.org/10.1073/pnas.1310642110
165. Starosta AL, Lassak J, Peil L, Atkinson GC, Virumäe K, Tenson T, Remme J, Jung K, Wilson
DN (2014) Translational stalling at polyproline stretches is modulated by the sequence context
upstream of the stall site. Nucleic Acids Res 42(16):10711–10719. https://doi.org/10.1093/
nar/gku768
166. Gamble CE, Brule CE, Dean KM, Fields S, Grayhack EJ (2016) Adjacent codons act in
concert to modulate translation efficiency in yeast. Cell 166(3):679–690. https://doi.org/10.
1016/j.cell.2016.05.070
167. McCarthy C, Carrea A, Diambra L (2017) Bicodon bias can determine the role of synony-
mous SNPs in human diseases. BMC Genom 18(1):227. https://doi.org/10.1186/s12864-017-
3609-6
168. Chevance FFV, Hughes KT (2017) Case for the genetic code as a triplet of triplets. Proc Natl Acad Sci U S A 114(18):4745–4750. https://doi.org/10.1073/pnas.1614896114
169. Ghoneim DH, Zhang X, Brule CE, Mathews DH, Grayhack EJ (2018) Conservation of location
of several specific inhibitory codon pairs in the Saccharomyces sensu stricto yeasts reveals
translational selection. Nucleic Acids Res. https://doi.org/10.1093/nar/gky1262
170. Komar AA, Lesnik T, Reiss C (1999) Synonymous codon substitutions affect ribosome traffic
and protein folding during in vitro translation. FEBS Lett 462(3):387–391
171. Zhang G, Hubalewska M, Ignatova Z (2009) Transient ribosomal attenuation coordinates
protein synthesis and co-translational folding. Nat Struct Mol Biol 16(3):274–280. https://
doi.org/10.1038/nsmb.1554
172. Buhr F, Jha S, Thommen M, Mittelstaet J, Kutz F, Schwalbe H, Rodnina MV, Komar AA (2016)
Synonymous codons direct cotranslational folding toward different protein conformations.
Mol Cell 61(3):341–351. https://doi.org/10.1016/j.molcel.2016.01.008
173. Pechmann S, Chartron JW, Frydman J (2014) Local slowdown of translation by nonoptimal
codons promotes nascent-chain recognition by SRP in vivo. Nat Struct Mol Biol 21(12):1100–
1105. https://doi.org/10.1038/nsmb.2919
174. Lee Y, Zhou T, Tartaglia GG, Vendruscolo M, Wilke CO (2010) Translationally optimal codons
associate with aggregation-prone sites in proteins. Proteomics 10(23):4163–4171. https://doi.
org/10.1002/pmic.201000229
175. Cannarozzi G, Schraudolph NN, Faty M, von Rohr P, Friberg MT, Roth AC, Gonnet P, Gonnet
G, Barral Y (2010) A role for codon order in translation dynamics. Cell 141(2):355–367.
https://doi.org/10.1016/j.cell.2010.02.036
176. Carrier MJ, Buckingham RH (1984) An effect of codon context on the mistranslation of UGU
codons in vitro. J Mol Biol 175(1):29–38
177. Buckingham RH (1994) Codon context and protein synthesis: enhancements of the genetic
code. Biochimie 76(5):351–354
178. Baranov PV, Atkins JF, Yordanova MM (2015) Augmented genetic decoding: global, local and
temporal alterations of decoding processes and codon meaning. Nat Rev Genet 16(9):517–529.
https://doi.org/10.1038/nrg3963
179. Skuzeski JM, Nichols LM, Gesteland RF, Atkins JF (1991) The signal for a leaky UAG stop
codon in several plant viruses includes the two downstream codons. J Mol Biol 218(2):365–
373
180. Chan CS, Jungreis I, Kellis M (2013) Heterologous stop codon readthrough of metazoan
readthrough candidates in yeast. PLoS ONE 8(3):e59450. https://doi.org/10.1371/journal.
pone.0059450
181. Loughran G, Jungreis I, Tzani I, Power M, Dmitriev RI, Ivanov IP, Kellis M, Atkins JF (2018)
Stop codon readthrough generates a C-terminally extended variant of the human vitamin D
receptor with reduced calcitriol response. J Biol Chem 293(12):4434–4444. https://doi.org/
10.1074/jbc.M117.818526
182. Jungreis I, Lin MF, Spokony R, Chan CS, Negre N, Victorsen A, White KP, Kellis M (2011)
Evidence of abundant stop codon readthrough in Drosophila and other metazoa. Genome Res
21(12):2096–2113. https://doi.org/10.1101/gr.119974.110
183. Jungreis I, Chan CS, Waterhouse RM, Fields G, Lin MF, Kellis M (2016) Evolutionary
dynamics of abundant stop codon readthrough. Mol Biol Evol 33(12):3108–3132
184. Rajput B, Pruitt KD, Murphy TD (2019) RefSeq curation and annotation of stop codon
recoding in vertebrates. Nucleic Acids Res 47(2):594–606. https://doi.org/10.1093/nar/
gky1234
185. Swart EC, Serra V, Petroni G, Nowacki M (2016) Genetic codes with no dedicated stop
codon: context-dependent translation termination. Cell 166(3):691–702. https://doi.org/10.
1016/j.cell.2016.06.020
186. Belew AT, Dinman JD (2015) Cell cycle control (and more) by programmed −1 ribosomal
frameshifting: implications for disease and therapeutics. Cell Cycle 14(2):172–178. https://
doi.org/10.4161/15384101.2014.989123
187. Belew AT, Hepler NL, Jacobs JL, Dinman JD (2008) PRFdb: a database of computationally predicted eukaryotic programmed −1 ribosomal frameshift signals. BMC Genom 9:339. https://doi.org/10.1186/1471-2164-9-339
188. Pinheiro M, Afreixo V, Moura G, Freitas A, Santos MA, Oliveira JL (2006) Statistical, com-
putational and visualization methodologies to unveil gene primary structure features. Methods
Inf Med 45(2):163–168
189. Moura G, Pinheiro M, Silva R, Miranda I, Afreixo V, Dias G, Freitas A, Oliveira JL, Santos
MA (2005) Comparative context analysis of codon pairs on an ORFeome scale. Genome Biol
6(3):R28
190. Moura G, Pinheiro M, Arrais J, Gomes AC, Carreto L, Freitas A, Oliveira JL, Santos MA
(2007) Large scale comparative codon-pair context analysis unveils general rules that fine-tune
evolution of mRNA primary structure. PLoS ONE 2(9):e847
191. Tats A, Tenson T, Remm M (2008) Preferred and avoided codon pairs in three domains of life. BMC Genom 9:463. https://doi.org/10.1186/1471-2164-9-463
192. Doyle F, Leonardi A, Endres L, Tenenbaum SA, Dedon PC, Begley TJ (2016) Gene- and genome-based analysis of significant codon patterns in yeast, rat and mice genomes with the CUT Codon UTilization tool. Methods 107:98–109. https://doi.org/10.1016/j.ymeth.2016.05.010
193. Alexaki A, Kames J, Holcomb DD, Athey J, Santana-Quintero LV, Lam PVN, Hamasaki-
Katagiri N, Osipova E, Simonyan V, Bar H, Komar AA, Kimchi-Sarfaty C (2019) Codon and
codon-pair usage tables (CoCoPUTs): facilitating genetic variation analyses and recombinant
gene design. J Mol Biol pii: S0022-2836(19)30228-1. https://doi.org/10.1016/j.jmb.2019.
04.021
194. Kucukyildirim S, Long H, Sung W, Miller SF, Doak TG, Lynch M (2016) The rate and spec-
trum of spontaneous mutations in Mycobacterium smegmatis, a bacterium naturally devoid
of the postreplicative mismatch repair pathway. G3 (Bethesda) 6(7):2157–2163. https://doi.
org/10.1534/g3.116.030130
195. Aslam S, Lan XR, Zhang BW, Chen ZL, Wang L, Niu DK (2019) Aerobic prokaryotes do not
have higher GC contents than anaerobic prokaryotes, but obligate aerobic prokaryotes have.
BMC Evol Biol 19(1):35. https://doi.org/10.1186/s12862-019-1365-8
196. Hershberg R, Petrov DA (2010) Evidence that mutation is universally biased towards AT in
bacteria. PLoS Genet 6(9):e1001115. https://doi.org/10.1371/journal.pgen.1001115
197. Lassalle F, Périan S, Bataillon T, Nesme X, Duret L, Daubin V (2015) GC-Content evo-
lution in bacterial genomes: the biased gene conversion hypothesis expands. PLoS Genet
11(2):e1004941. https://doi.org/10.1371/journal.pgen.1004941
198. Hildebrand F, Meyer A, Eyre-Walker A (2010) Evidence of selection upon genomic
GC-content in bacteria. PLoS Genet 6(9):e1001107. https://doi.org/10.1371/journal.pgen.
1001107
199. Bobay LM, Ochman H (2017) Impact of recombination on the base composition of bacteria
and archaea. Mol Biol Evol 34(10):2627–2636. https://doi.org/10.1093/molbev/msx189
200. Trotta E (2016) Selective forces and mutational biases drive stop codon usage in the human genome: a comparison with sense codon usage. BMC Genom 17:366. https://doi.org/10.1186/s12864-016-2692-4
201. Wilke CO, Drummond DA (2006) Population genetics of translational robustness. Genetics
173(1):473–481
202. Zhou T, Weems M, Wilke CO (2009) Translationally optimal codons associate with struc-
turally sensitive sites in proteins. Mol Biol Evol 26(7):1571–1580. https://doi.org/10.1093/
molbev/msp070
203. Yu CH, Dang Y, Zhou Z, Wu C, Zhao F, Sachs MS, Liu Y (2015) Codon usage influences
the local rate of translation elongation to regulate co-translational protein folding. Mol Cell
59(5):744–754. https://doi.org/10.1016/j.molcel.2015.07.018
204. Yan X, Hoek TA, Vale RD, Tanenbaum ME (2016) Dynamics of translation of single mRNA
molecules in vivo. Cell 165(4):976–989. https://doi.org/10.1016/j.cell.2016.04.034
205. Zhao F, Yu CH, Liu Y (2017) Codon usage regulates protein structure and function by affect-
ing translation elongation speed in Drosophila cells. Nucleic Acids Res 45(14):8484–8492.
https://doi.org/10.1093/nar/gkx501
206. Li GW, Burkhardt D, Gross C, Weissman JS (2014) Quantifying absolute protein synthe-
sis rates reveals principles underlying allocation of cellular resources. Cell 157(3):624–635.
https://doi.org/10.1016/j.cell.2014.02.033
207. Shah P, Gilchrist MA (2010) Effect of correlated tRNA abundances on translation errors and
evolution of codon usage bias. PLoS Genet 6(9):e1001128. https://doi.org/10.1371/journal.
pgen.1001128
208. Chamary JV, Parmley JL, Hurst LD (2006) Hearing silence: non-neutral evolution at
synonymous sites in mammals. Nat Rev Genet 7(2):98–108
209. Sauna ZE, Kimchi-Sarfaty C (2011) Understanding the contribution of synonymous mutations
to human disease. Nat Rev Genet 12(10):683–691. https://doi.org/10.1038/nrg3051
210. Kirchner S, Cai Z, Rauscher R, Kastelic N, Anding M, Czech A, Kleizen B, Ostedgaard
LS, Braakman I, Sheppard DN, Ignatova Z (2017) Alteration of protein function by a silent
polymorphism linked to tRNA abundance. PLoS Biol 15(5):e2000779. https://doi.org/10.
1371/journal.pbio.2000779
211. Zhou Z, Dang Y, Zhou M, Li L, Yu CH, Fu J, Chen S, Liu Y (2016) Codon usage is an
important determinant of gene expression levels largely through its effects on transcription.
Proc Natl Acad Sci U S A 113(41):E6117–E6125
212. Mittal P, Brindle J, Stephen J, Plotkin JB, Kudla G (2018) Codon usage influences fitness
through RNA toxicity. Proc Natl Acad Sci U S A 115(34):8639–8644. https://doi.org/10.1073/
pnas.1810022115
213. Weinberg DE, Shah P, Eichhorn SW, Hussmann JA, Plotkin JB, Bartel DP (2016) Improved
ribosome-footprint and mRNA measurements provide insights into dynamics and regulation
of yeast translation. Cell Rep 14(7):1787–1799. https://doi.org/10.1016/j.celrep.2016.01.043
214. Chu D, Kazana E, Bellanger N, Singh T, Tuite MF, von der Haar T (2014) Translation elon-
gation can control translation initiation on eukaryotic mRNAs. EMBO J 33(1):21–34. https://
doi.org/10.1002/embj.201385651
215. Chan LY, Mugler CF, Heinrich S, Vallotton P, Weis K (2018) Non-invasive measurement of mRNA decay reveals translation initiation as the major determinant of mRNA stability. Elife 7:e32536. https://doi.org/10.7554/elife.32536
216. Eraslan B, Wang D, Gusic M, Prokisch H, Hallström BM, Uhlén M, Asplund A, Pontén F,
Wieland T, Hopf T, Hahne H, Kuster B, Gagneur J (2019) Quantification and discovery of
sequence determinants of protein-per-mRNA amount in 29 human tissues. Mol Syst Biol
15(2):e8513. https://doi.org/10.15252/msb.20188513
217. Zhou M, Guo J, Cha J, Chae M, Chen S, Barral JM, Sachs MS, Liu Y (2013) Non-
optimal codon usage affects expression, structure and function of clock protein FRQ. Nature
495(7439):111–115. https://doi.org/10.1038/nature11833
218. Chan C, Pham P, Dedon PC, Begley TJ (2018) Lifestyle modifications: coordinating the tRNA
epitranscriptome with codon bias to adapt translation during stress responses. Genome Biol
19(1):228. https://doi.org/10.1186/s13059-018-1611-1
219. Novoa EM, Pavon-Eternod M, Pan T, de Pouplana LR (2012) A role for tRNA modifications
in genome structure and codon usage. Cell 149(1):202–213. https://doi.org/10.1016/j.cell.
2012.01.050
220. Fuglsang A (2005) Intragenic position of UUA codons in streptomycetes. Microbiology
151(Pt 10):3150–3152
221. Zaburannyy N, Ostash B, Fedorenko V (2009) TTA Lynx: a web-based service for analysis of
actinomycete genes containing rare TTA codon. Bioinformatics 25(18):2432–2433. https://
doi.org/10.1093/bioinformatics/btp402
222. Jee J, Rasouly A, Shamovsky I, Akivis Y, Steinman SR, Mishra B, Nudler E (2016)
Rates and mechanisms of bacterial mutagenesis from maximum-depth sequencing. Nature
534(7609):693–696
223. Kosiol C, Goldman N (2011) Markovian and non-Markovian protein sequence evolution:
aggregated Markov process models. J Mol Biol 411(4):910–923. https://doi.org/10.1016/j.
jmb.2011.06.005
224. Anisimova M, Kosiol C (2009) Investigating protein-coding sequence evolution with prob-
abilistic codon substitution models. Mol Biol Evol 26(2):255–271. https://doi.org/10.1093/
molbev/msn232
225. Schneider A, Cannarozzi GM, Gonnet GH (2005) Empirical codon substitution matrix. BMC Bioinform 6:134
226. Beaumont MA, Rannala B (2004) The Bayesian revolution in genetics. Nat Rev Genet
5(4):251–261
227. Eddy SR (2004) What is Bayesian statistics? Nat Biotechnol 22(9):1177–1178
228. Do CB, Batzoglou S (2008) What is the expectation maximization algorithm? Nat Biotechnol
26(8):897–899. https://doi.org/10.1038/nbt1406
229. Anisimova M, Bielawski JP, Yang Z (2001) Accuracy and power of the likelihood ratio test
in detecting adaptive molecular evolution. Mol Biol Evol 18(8):1585–1592
230. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control
19(6):716–723. https://doi.org/10.1109/TAC.1974.1100705
231. Davydov II, Salamin N, Robinson-Rechavi M (2019) Large-scale comparative analysis of
codon models accounting for protein and nucleotide selection. Mol Biol Evol pii: msz048.
https://doi.org/10.1093/molbev/msz048
232. Arenas M (2015) Trends in substitution models of molecular evolution. Front Genet 6:319. https://doi.org/10.3389/fgene.2015.00319
233. Yang Z (2006) Computational molecular evolution. Oxford University Press, Oxford, 324 p. ISBN 978-0-19-856699-1
234. Venkat A, Hahn MW, Thornton JW (2018) Multinucleotide mutations cause false inferences
of lineage-specific positive selection. Nat Ecol Evol 2(8):1280–1288. https://doi.org/10.1038/
s41559-018-0584-5
235. Liu X, Liu H, Guo W, Yu K (2012) Codon substitution models based on residue similarity
and their applications. Gene 509(1):136–141. https://doi.org/10.1016/j.gene.2012.07.075
236. Delport W, Scheffler K, Botha G, Gravenor MB, Muse SV, Kosakovsky Pond SL (2010) CodonTest: modeling amino acid substitution preferences in coding sequences. PLoS Comput Biol 6(8):e1000885. https://doi.org/10.1371/journal.pcbi.1000885
237. Huttley GA (2004) Modeling the impact of DNA methylation on the evolution of BRCA1 in
mammals. Mol Biol Evol 21(9):1760–1768
238. Mayrose I, Doron-Faigenboim A, Bacharach E, Pupko T (2007) Towards realistic codon
models: among site variability and dependency of synonymous and non-synonymous rates.
Bioinformatics 23(13):i319–i327
239. Higgs PG, Hao W, Golding GB (2007) Identification of conflicting selective effects on highly
expressed genes. Evol Bioinform Online 14(3):1–13
240. Kubatko L, Shah P, Herbei R, Gilchrist MA (2016) A codon model of nucleotide substitution
with selection on synonymous codon usage. Mol Phylogenet Evol 94(Pt A):290–297. https://
doi.org/10.1016/j.ympev.2015.08.026
Visualizing Codon Usage Within and Across Genomes … 285
241. Beaulieu JM, O’Meara BC, Zaretzki R, Landerer C, Chai J, Gilchrist MA (2019) Population
genetics based phylogenetics under stabilizing selection for an optimal amino acid sequence:
a nested modeling approach. Mol Biol Evol 36(4):834–851. https://doi.org/10.1093/molbev/
msy222
242. Higgs PG (2008) Linking population genetics to phylogenetics. Banach Center Publ
80(1):145–166
243. Pouyet F, Bailly-Bechet M, Mouchiroud D, Guéguen L (2016) SENCA: a multilayered codon
model to study the origins and dynamics of codon usage. Genome Biol Evol 8(8):2427–2441.
https://doi.org/10.1093/gbe/evw165
244. Rodrigue N, Lartillot N (2017) Detecting adaptation in protein-coding genes using a bayesian
site-heterogeneous mutation-selection codon substitution model. Mol Biol Evol 34(1):204–
214. https://doi.org/10.1093/molbev/msw220
245. Teufel AI, Ritchie AM, Wilke CO, Liberles DA (2018) Using the mutation-selection frame-
work to characterize selection on protein sequences. Genes (Basel) 9(8):pii: E409. https://doi.
org/10.3390/genes9080409
246. Dunn KA, Kenney T, Gu H, Bielawski JP (2019) Improved inference of site-specific positive
selection under a generalized parametric codon model when there are multinucleotide muta-
tions and multiple nonsynonymous rates. BMC Evol Biol 19(1):22. https://doi.org/10.1186/
s12862-018-1326-7
247. Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins.
In: Atlas of protein sequence and structure, vol 5, pp 345–352
248. Gonnet GH, Cohen MA, Benner SA (1992) Exhaustive matching of the entire protein sequence
database. Science 256(5062):1443–1445
249. De Maio N, Holmes I, Schlötterer C, Kosiol C (2013) Estimating empirical codon hidden
Markov models. Mol Biol Evol 30(3):725–736. https://doi.org/10.1093/molbev/mss266
250. Zoller S, Schneider A (2010) Empirical analysis of the most relevant parameters of codon
substitution models. J Mol Evol 70(6):605–612. https://doi.org/10.1007/s00239-010-9356-9
251. Kosiol C, Holmes I, Goldman N (2007) An empirical codon model for protein sequence
evolution. Mol Biol Evol 24(7):1464–1479
252. Doron-Faigenboim A, Pupko T (2007) A combined empirical and mechanistic codon model.
Mol Biol Evol 24(2):388–397
253. Zoller S, Schneider A (2012) A new semiempirical codon substitution model based on prin-
cipal component analysis of mammalian sequences. Mol Biol Evol 29(1):271–277. https://
doi.org/10.1093/molbev/msr198
254. Hoban S, Bertorelle G, Gaggiotti OE (2012) Computer simulations: tools for population and
evolutionary genetics. Nat Rev Genet 13(2):110–122. https://doi.org/10.1038/nrg3130
255. Arenas M (2013) Computer programs and methodologies for the simulation of DNA sequence
data with recombination. Front Genet 1(4):9. https://doi.org/10.3389/fgene.2013.00009
256. Anisimova M, Nielsen R, Yang Z (2003) Effect of recombination on the accuracy of the
likelihood method for detecting positive selection at amino acid sites. Genetics 164(3):1229–
1236
257. Dalquen DA, Anisimova M, Gonnet GH, Dessimoz C (2012) ALF—a simulation frame-
work for genome evolution. Mol Biol Evol 29(4):1115–1123. https://doi.org/10.1093/molbev/
msr268
258. Arenas M, Posada D (2014) Simulation of genome-wide evolution under heterogeneous sub-
stitution models and complex multispecies coalescent histories. Mol Biol Evol 31(5):1295–
1301. https://doi.org/10.1093/molbev/msu078
259. Mallo D, De Oliveira Martins L, Posada D (2016) SimPhy: phylogenomic simulation of gene,
locus, and species trees. Syst Biol 65(2):334–344. https://doi.org/10.1093/sysbio/syv082
260. Haller BC, Messer PW (2019) SLiM 3: forward genetic simulations beyond the Wright-Fisher
model. Mol Biol Evol 36(3):632–637. https://doi.org/10.1093/molbev/msy228
261. Klosterman PS, Uzilov AV, Bendaña YR, Bradley RK, Chao S, Kosiol C, Goldman N, Holmes
I (2006) XRate: a fast prototyping, training and annotation tool for phylo-grammars. BMC
Bioinform 3(7):428
286 B. Ostash and M. Anisimova
262. Barquist L, Holmes I (2008) xREI: a phylo-grammar visualization webserver. Nucleic Acids
Res 36(Web Server issue):W65–W69. https://doi.org/10.1093/nar/gkn283
263. Wernersson R, Pedersen AG (2003) RevTrans: multiple alignment of coding DNA from
aligned amino acid sequences. Nucleic Acids Res 31(13):3537–3539
264. Ranwez V, Douzery EJP, Cambon C, Chantret N, Delsuc F (2018) MACSE v2: toolkit for the
alignment of coding sequences accounting for frameshifts and stop codons. Mol Biol Evol
35(10):2582–2584. https://doi.org/10.1093/molbev/msy159
265. Noens EE, Mersinias V, Traag BA, Smith CP, Koerten HK, van Wezel GP (2005) SsgA-like
proteins determine the fate of peptidoglycan during sporulation of Streptomyces coelicolor.
Mol Microbiol 58(4):929–944
266. Rabyk M, Yushchuk O, Rokytskyy I, Anisimova M, Ostash B (2018) Genomic insights into
evolution of AdpA family master regulators of morphological differentiation and secondary
metabolism in Streptomyces. J Mol Evol 86(3–4):204–215. https://doi.org/10.1007/s00239-
018-9834-z
267. Wang M, Kapralov MV, Anisimova M (2011) Coevolution of amino acid residues in the key
photosynthetic enzyme Rubisco. BMC Evol Biol 23(11):266. https://doi.org/10.1186/1471-
2148-11-266
268. Kapralov MV, Filatov DA (2007) Widespread positive selection in the photosynthetic Rubisco
enzyme. BMC Evol Biol 11(7):73
269. Elena SF, Lenski RE (2003) Evolution experiments with microorganisms: the dynamics and
genetic bases of adaptation. Nat Rev Genet 4(6):457–469
270. Charlesworth B (2013) Stabilizing selection, purifying selection, and mutational bias in finite
populations. Genetics 194(4):955–971. https://doi.org/10.1534/genetics.113.151555
271. Kimura M (1991) Recent development of the neutral theory viewed from the Wrightian
tradition of theoretical population genetics. Proc Natl Acad Sci U S A 88(14):5969–5973
272. Jensen JD, Payseur BA, Stephan W, Aquadro CF, Lynch M, Charlesworth D, Charlesworth
B (2019) The importance of the Neutral Theory in 1968 and 50 years on: a response to Kern
and Hahn 2018. Evolution 73(1):111–114. https://doi.org/10.1111/evo.13650
273. Kimura M (1981) Possibility of extensive neutral evolution under stabilizing selection with
special reference to nonrandom usage of synonymous codons. Proc Natl Acad Sci U S A
78(9):5773–5777
274. Fuller ZL, Haynes GD, Zhu D, Batterton M, Chao H, Dugan S, Javaid M, Jayaseelan JC,
Lee S, Li M, Ongeri F, Qi S, Han Y, Doddapaneni H, Richards S, Schaeffer SW (2014) Evi-
dence for stabilizing selection on codon usage in chromosomal rearrangements of Drosophila
pseudoobscura. G3 (Bethesda) 4(12):2433–2449. https://doi.org/10.1534/g3.114.014860
275. Jackson BC, Campos JL, Haddrill PR, Charlesworth B, Zeng K (2017) Variation in the
intensity of selection on codon bias over time causes contrasting patterns of base compo-
sition evolution in Drosophila. Genome Biol Evol 9(1):102–123. https://doi.org/10.1093/gbe/
evw291
276. Plotkin JB, Dushoff J, Fraser HB (2004) Detecting selection using a single genome sequence
of M. tuberculosis and P. falciparum. Nature 428(6986):942–945
277. Plotkin JB, Dushoff J, Desai MM, Fraser HB (2006) Codon usage and selection on proteins.
J Mol Evol 63(5):635–653
278. Zhang J (2005) On the evolution of codon volatility. Genetics 169(1):495–501
279. Dagan T, Graur D (2005) The comparative method rules! Codon volatility cannot detect
positive Darwinian selection using a single genome sequence. Mol Biol Evol 22(3):496–500
280. O’Connell MJ, Doyle AM, Juenger TE, Donoghue MT, Keshavaiah C, Tuteja R, Spillane C
(2012) In Arabidopsis thaliana codon volatility scores reflect GC3 composition rather than
selective pressure. BMC Res Notes 17(5):359. https://doi.org/10.1186/1756-0500-5-359
281. Tajima F (1989) Statistical method for testing the neutral mutation hypothesis by DNA
polymorphism. Genetics 123(3):585–595
282. McDonald JH, Kreitman M (1991) Adaptive protein evolution at the Adh locus in Drosophila.
Nature 351(6328):652–654
Visualizing Codon Usage Within and Across Genomes … 287
283. Zhai W, Slatkin M, Nielsen R (2007) Exploring variation in the d(N)/d(S) ratio among sites
and lineages using mutational mappings: applications to the influenza virus. J Mol Evol
65(3):340–348
284. Gelman A, Meng X-L, Stern H (1996) Posterior predictive assessment of model fitness via
realized discrepancies. Stat Sin 6:733–807
285. Kosakovsky Pond SL, Frost SD (2005) Not so different after all: a comparison of methods
for detecting amino acid sites under selection. Mol Biol Evol 22(5):1208–1222
286. Lemey P, Minin VN, Bielejec F, Kosakovsky Pond SL, Suchard MA (2012) A counting
renaissance: combining stochastic mapping and empirical Bayes to quickly detect amino
acid sites under positive selection. Bioinformatics 28(24):3248–3256. https://doi.org/10.1093/
bioinformatics/bts580
287. Yang Z, Nielsen R (2000) Estimating synonymous and nonsynonymous substitution rates
under realistic evolutionary models. Mol Biol Evol 17(1):32–43
288. Gil M, Zanetti MS, Zoller S, Anisimova M (2013) CodonPhyML: fast maximum likelihood
phylogeny estimation under codon substitution models. Mol Biol Evol 30(6):1270–1280.
https://doi.org/10.1093/molbev/mst034
289. Hedge J, Wilson DJ (2016) Practical approaches for detecting selection in microbial genomes.
PLoS Comput Biol 12(2):e1004739. https://doi.org/10.1371/journal.pcbi.1004739
290. Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol
24(8):1586–1591
291. Gao F, Chen C, Arab DA, Du Z, He Y, Ho SYW (2019) EasyCodeML: a visual tool for analysis
of selection using CodeML. Ecol Evol 9(7):3891–3898. https://doi.org/10.1002/ece3.5015
292. Zhao K, Henderson E, Bullard K, Oberste MS, Burns CC, Jorba J (2018) PoSE: visualization
of patterns of sequence evolution using PAML and MATLAB. BMC Bioinform 19(Suppl
11):364. https://doi.org/10.1186/s12859-018-2335-7
293. Pond SL, Frost SD, Muse SV (2005) HyPhy: hypothesis testing using phylogenies.
Bioinformatics 21(5):676–679
294. Weaver S, Shank SD, Spielman SJ, Li M, Muse SV, Kosakovsky Pond SL (2018) Datamonkey
2.0: a modern web application for characterizing selective and other evolutionary processes.
Mol Biol Evol 35:773–777. https://doi.org/10.1093/molbev/msx335
295. Bouckaert R, Vaughan TG, Barido-Sottani J, Duchêne S, Fourment M, Gavryushkina A, Heled
J, Jones G, Kühnert D, De Maio N, Matschiner M, Mendes FK, Müller NF, Ogilvie HA, du
Plessis L, Popinga A, Rambaut A, Rasmussen D, Siveroni I, Suchard MA, Wu CH, Xie D,
Zhang C, Stadler T, Drummond AJ (2019) BEAST 2.5: an advanced software platform for
Bayesian evolutionary analysis. PLoS Comput Biol 15(4):e1006650. https://doi.org/10.1371/
journal.pcbi.1006650
296. Sealfon RS, Lin MF, Jungreis I, Wolf MY, Kellis M, Sabeti PC (2015) FRESCo: finding
regions of excess synonymous constraint in diverse viruses. Genome Biol 17(16):38. https://
doi.org/10.1186/s13059-015-0603-7
297. Stern A, Doron-Faigenboim A, Erez E, Martz E, Bacharach E, Pupko T (2007) Selecton 2007:
advanced models for detecting positive and purifying selection using a Bayesian inference
approach. Nucleic Acids Res 35(Web Server issue):W506-W511
298. Supek F, Šmuc T (2010) On relevance of codon usage to expression of synthetic and natural
genes in Escherichia coli. Genetics 185(3):1129–1134. https://doi.org/10.1534/genetics.110.
115477
299. Pokusaeva VO, Usmanova DR, Putintseva EV, Espinar L, Sarkisyan KS, Mishin AS,
Bogatyreva NS, Ivankov DN, Akopyan AV, Avvakumov SY, Povolotskaya IS, Filion GJ, Carey
LB, Kondrashov FA (2019) An experimental assay of the interactions of amino acids from
orthologous sequences shaping a complex fitness landscape. PLoS Genet 15(4):e1008079.
https://doi.org/10.1371/journal.pgen.1008079
300. Darriba D, Flouri T, Stamatakis A (2018) The state of software for evolutionary biology. Mol
Biol Evol 35(5):1037–1046. https://doi.org/10.1093/molbev/msy014
301. Abadi S, Azouri D, Pupko T, Mayrose I (2019) Model selection may not be a mandatory step
for phylogeny reconstruction. Nat Commun 10(1):934. https://doi.org/10.1038/s41467-019-
08822-w
288 B. Ostash and M. Anisimova
302. Spielman SJ, Kosakovsky Pond SL (2018) Relative evolutionary rates in proteins are largely
insensitive to the substitution model. Mol Biol Evol. https://doi.org/10.1093/molbev/msy127
303. Chionh YH, McBee M, Babu IR, Hia F, Lin W, Zhao W, Cao J, Dziergowska A, Malkiewicz
A, Begley TJ, Alonso S, Dedon PC (2016) tRNA-mediated codon-biased translation
in mycobacterial hypoxic persistence. Nat Commun 11(7):13302. https://doi.org/10.1038/
ncomms13302
304. Gingold H, Tehler D, Christoffersen NR, Nielsen MM, Asmar F, Kooistra SM, Christophersen
NS, Christensen LL, Borre M, Sørensen KD, Andersen LD, Andersen CL, Hulleman E,
Wurdinger T, Ralfkiær E, Helin K, Grønbæk K, Ørntoft T, Waszak SM, Dahan O, Pedersen
JS, Lund AH, Pilpel Y (2014) A dual program for translation regulation in cellular proliferation
and differentiation. Cell 158(6):1281–1292. https://doi.org/10.1016/j.cell.2014.08.011
Single-Cell Multiomics: Dissecting Cancer
Cancer is a heterogeneous and complex disease with its own “tumor ecosystem”, in which tumor cells interact with neighboring cells, allowing them to adapt and evolve continuously [1]. The tumor microenvironment (TME) consists of tumor cells surrounded by a plethora of other cell types, such as fibroblasts, epithelial cells, immune cells, inflammatory cells, blood and vascular networks, as well as the extracellular matrix. A schematic representation of the TME is depicted in Fig. 1. In homeostasis, the TME acts as a physical barrier against malignant cells. However, tumor evolution reprograms the adjacent TME to facilitate tumor growth and progression [2]. Cross talk between stromal cells and cancer cells is reported to result in hijacking of host metabolism, immune evasion, and eventually metastasis [1]. The active participation of the tumor microenvironment in carcinogenesis has laid the foundation for new therapeutic strategies that interfere with the cross talk between the tumor and its milieu. Here, we describe the major components of the tumor microenvironment and their potential roles in carcinogenesis.
Tumor-associated macrophages (TAMs) elicit pro-tumorigenic activity and enhance tumor invasion and metastasis, angiogenesis, and extracellular matrix remodeling by inhibiting immune surveillance. Macrophages are recruited into the microenvironment by vascular endothelial growth factor (VEGF), macrophage colony-stimulating factor (M-CSF), and monocyte chemotactic protein 1 (MCP-1) produced by tumor cells [3]. Natural killer cells, whose key feature is recognizing and eradicating malignant cells, are limited by secreted soluble factors such as prostaglandin E2 (PGE2) and TGFβ. Myeloid-derived suppressor cells (MDSCs) and regulatory T cells (Tregs) have an innate function of immune suppression, which can be activated by cytokines through signaling pathways [15, 16]. The accumulation of MDSCs is driven by tumor necrosis factor (TNF); these cells help tumor cells escape immune surveillance, inhibit adaptive immunity, and promote angiogenesis by secreting VEGF, TGFβ, and fibroblast growth factors (FGFs) [2, 3].
Tumor cells require metabolites and nutrients for their growth, which is facilitated by the formation of a complex vascular network. Tumor blood vessels are formed from pre-existing normal blood vessels and are lined by tumor endothelial cells (TECs). TECs have an altered phenotype relative to normal endothelial cells and promote angiogenesis by secreting angiogenic factors [17]. TECs show a pattern of chromosomal instability and a distinct expression pattern when compared to normal endothelial cells [18]. The genetic and epigenetic alterations in TECs aid the process of carcinogenesis through interactions with the tumor microenvironment [18].
The vascular network formed around tumor cells is inefficient and leaky in nature. To overcome hypoxic conditions, the angiogenic switch is activated during tumorigenesis to form new blood vessels [10]. Key blood/lymph-angiogenesis-related factors, including VEGFA, VEGFC, and platelet-derived growth factor A (PDGFA), are highly expressed in tumor cells. These growth factors and their corresponding receptors induce many signaling cascades and contribute to tumor-associated blood/lymph-angiogenesis [19]. Blood and lymphatic networks help tumor cells evade immune destruction. In particular, MDSCs and the immature dendritic cells present in the sentinel lymph node (SLN) restrict the operation of T cells. The SLN has physical contact with the tumor cells and provides a freeway for tumor cells to migrate into other regions [10].
The abnormal ECM, found around all stromal cell types, releases numerous growth factors, cytokines, and hormones that escalate tumor invasion and metastasis [21]. As tumor cells proliferate, the surrounding ECM undergoes significant architectural changes in a dynamic interplay between the microenvironment and resident cells. Continuous cross talk between the ECM and tumor cells promotes tumor progression by disturbing cell–cell adhesion and cell polarity and by inducing growth factor signaling. Collagen and fibronectins provide mechanical strength to cells, preventing tumor cells from migrating [22]. Cancer cells have adopted strategies to cross these barriers: as tumor cells proliferate, increased mechanical stress in the membrane results in its rupture, provoking tumor cells to escape from the primary site [22]. Understanding the key elements of the ECM will guide new therapeutic interventions for cancer.
Advances in genomic techniques have revolutionized the healthcare system by providing an individual’s genomic blueprint and have emphasized the concept of personalized medicine. Next generation sequencing technologies have catalogued information on alterations in multiple diseases, including cancer, inherited disorders, and infectious diseases. Identification of genomic variations in normal and tumor tissues has made a notable impact on disease diagnosis, prognosis, and treatment response [23]. However, one of the challenging issues with this approach is the genetic diversity that forms the basis of intratumor heterogeneity (ITH) [1, 24]. Conventional genomic, transcriptomic, and proteomic studies on bulk tumors have not been able to resolve cellular heterogeneity and complexity. Multiregion whole-exome sequencing combined with bulk sequencing has been harnessed to identify the ITH of many cancer types; however, it cannot directly dissect the cellular composition of the tumor region [25, 26]. Computational deconvolution methods have been employed to analyze gene expression data and infer the cellular composition of tumors with the help of known gene signatures or gene expression profiles of sorted cells [27]. However, these methods are limited to known cell types and cannot infer cancer and non-malignant cell-type proportions directly from tumor gene expression profiles. The advent of single-cell sequencing technology offers great potential to capture the nature of cancer heterogeneity, and recent advances in single-cell sequencing techniques have indisputably revolutionized the field of cancer research.
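The deconvolution idea described above can be sketched in a few lines: bulk expression is modeled as a weighted mixture of known cell-type signatures, and the mixture weights are recovered by least squares. The signature matrix, genes, and cell types below are invented for illustration and do not come from any published signature.

```python
import numpy as np

# Hypothetical signature matrix: expression of 4 marker genes in
# three sorted cell types (tumor, T cell, fibroblast); values invented.
S = np.array([
    [9.0, 0.5, 0.2],   # gene A: high in tumor cells
    [0.3, 8.0, 0.1],   # gene B: high in T cells
    [0.2, 0.4, 7.5],   # gene C: high in fibroblasts
    [1.0, 1.0, 1.0],   # gene D: ubiquitously expressed
])

def deconvolve(bulk, signatures):
    """Estimate cell-type fractions from a bulk expression vector by
    least squares against known signatures, then clip negative
    coefficients and renormalize so the fractions sum to one."""
    coef, *_ = np.linalg.lstsq(signatures, bulk, rcond=None)
    coef = np.clip(coef, 0, None)
    return coef / coef.sum()

# Simulate a bulk sample that is 60% tumor, 30% T cell, 10% fibroblast
true_f = np.array([0.6, 0.3, 0.1])
bulk = S @ true_f
est = deconvolve(bulk, S)
print(np.round(est, 2))  # → [0.6 0.3 0.1]
```

Published tools (e.g., CIBERSORT) use more robust regression and much larger signature matrices, but the underlying mixture model is the same, and so is the limitation noted above: cell types absent from the signature matrix cannot be recovered.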
294 J. Sambath et al.
Single-cell sequencing offers a wide range of options to characterize tumor heterogeneity. It has yielded extensive knowledge about the heterogeneous nature of cancer, rare subpopulations of cells, clonal evolution, and the development of drug resistance. Single-cell DNA sequencing studies have been applied to explore the clonal and subclonal architecture of primary tumors and the dynamics of mutation acquisition in breast, bladder, glioblastoma, colon, and hematological malignancies [1]. Single-cell epigenomic technologies have also been utilized to investigate the role of epigenetic alterations in disease progression and metastasis [28].
Bartoschek et al. performed single-cell RNA sequencing (scRNA-seq) on negatively selected mesenchymal cells for CAFs in breast cancer. The study identified three transcriptionally diverse subtypes of CAFs, namely vascular CAFs (vCAFs), matrix CAFs (mCAFs), and cycling CAFs (cCAFs), with unique functional and spatial characteristics [29]. In another study, by Li et al., two CAF subpopulations (CAF-A and CAF-B) with distinct transcriptome profiles were identified using scRNA-seq in colorectal cancer [30]. The heterogeneity of TAMs in glioma has been studied extensively by Muller et al. using scRNA-seq. Expression signatures distinguished blood-derived TAMs from brain-resident microglial TAMs, which also showed distinct phenotypes: blood-derived TAMs upregulate immunosuppressive cytokines and correlate with reduced overall survival in lower-grade glioma [31]. scRNA-seq has also been used to profile the expression patterns of cancer cells, dissect the cell populations of the tumor microenvironment, and broadly depict the tumor immune microenvironment in many cancers [32]. These rare subpopulations of cells were identified by virtue of the in-depth investigative approaches of single-cell sequencing.
2 Single-Cell Omics
Mutational events in tumor suppressor genes and oncogenes are known to aid the development of cancer. These mutations accumulate over time, confer tumorigenic properties to cells, and eventually diverge to form distinct subclones with different mutational landscapes. This forms the basis of ITH, which is considered a major hallmark of cancer. Clonal diversity in human tumors plays a key role in progression, metastasis, and the evolution of resistance to therapy. However, in-depth investigation of ITH has been hampered by technical limitations thus far. Studying clonal diversity by bulk tumor sequencing is difficult because the tumor tissue obtained for sequencing includes a mixture of tumor cells, masking mutations present at low frequency that may play an important role in tumor progression. To overcome this challenge, recent studies have profiled tumors at the single-cell level (Table 1). Single-cell sequencing, including single-cell genome, transcriptome, epigenome, and multiomics sequencing, has led to a detailed understanding of the tumor microenvironment and helps delineate ITH and clonal diversity. It also provides information on rare cell populations, which may have an oncogenic effect.
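To make the subclone idea concrete, the toy sketch below groups single cells into candidate subclones by the Jaccard similarity of their mutation sets. All cell IDs and mutation calls are invented, and real single-cell analyses use probabilistic models that account for allele dropout and sequencing error; this is only the underlying intuition.

```python
def jaccard(a, b):
    """Jaccard similarity of two mutation sets."""
    return len(a & b) / len(a | b)

def group_subclones(cells, threshold=0.5):
    """Greedy single-pass grouping: a cell joins the first subclone
    whose founding mutation set is similar enough, else it seeds a
    new subclone."""
    subclones = []  # list of (founder mutation set, [cell ids])
    for cell_id, muts in cells.items():
        for rep, members in subclones:
            if jaccard(muts, rep) >= threshold:
                members.append(cell_id)
                break
        else:
            subclones.append((set(muts), [cell_id]))
    return [members for _, members in subclones]

# Invented cells sharing truncal TP53 but splitting into two branches
cells = {
    "c1": {"TP53", "KRAS"},
    "c2": {"TP53", "KRAS", "PIK3CA"},
    "c3": {"TP53", "APC"},
    "c4": {"TP53", "APC", "SMAD4"},
}
print(group_subclones(cells))  # → [['c1', 'c2'], ['c3', 'c4']]
```

Note how the truncal TP53 mutation alone is not enough to merge the two branches, while the branch-specific mutations (KRAS vs. APC) drive the split, mirroring the truncal-versus-subclonal distinction discussed in the studies below.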
A study by Navin et al. employed a single-cell sequencing (scSeq) approach to investigate clonal evolution in breast cancer for the first time, in 2011. Profiling copy number alterations in two triple-negative breast cancer (TNBC) patients, it revealed a punctuated model of copy number evolution, in which aneuploid rearrangements occur at an early stage and are followed by stable clonal expansion [33]. Later, Wang et al. applied a single-cell whole-exome sequencing (Nuc-Seq) method to two breast cancer patients and showed that point mutations evolved gradually over time, generating extensive clonal diversity, while supporting the punctuated model of copy number evolution. These studies hint at the potential of scSeq to delineate clonal evolution and elucidate driver gene mutations with low allele frequency [34]. Single-cell sequence analysis of glioblastoma exploring clonal diversity reported convergent evolution of EGFR mutations in different subclones within a tumor [35]. In the case of renal carcinoma, single-cell genomic analysis revealed that the cells share common truncal mutations, indicating a common origin, with few rare mutations present [36]. A study by Yu et al. used a single-cell exome sequencing approach and depicted a biclonal origin model, verifying a rare mutational event in SLC12A5 and its potential oncogenic effect in colon cancer [37]. In contrast, Wu et al., combining bulk and single-cell whole-exome sequencing, demonstrated that colorectal cancer has a monoclonal origin; however, several subclones were found, based on the accumulation of novel driver mutations [38]. The genetic diversity of bladder cancer at the single-cell level has been investigated by Li et al., who employed single-cell whole-exome sequencing on muscle-invasive bladder transitional cell carcinoma and identified distinct tumor subpopulations within a tumor. The authors observed a set of 22 common mutated genes, including some driver genes, in all three identified subclones, indicating origin from a common ancestral clone [39]. DNA single-cell sequencing (DNA SCS) has helped delineate clonal evolution in hematopoietic malignancies: Gawad et al. performed targeted sequencing of 1479 cells from six childhood acute lymphoblastic leukemia (ALL) patients to measure their clonal structures [40]. Cancer stem cells play a crucial role in the initiation of tumorigenesis, progression, metastasis, and drug resistance. Single-cell sequencing can be used to characterize cancer stem cells and identify driver genes. Li et al. discovered that novel mutations in KCP, LOC440040, and LOC440563 act as drivers in renal cancer stem cells, and concluded that these three novel mutations can promote the reprogramming of renal cancer cells into cancer stem-like cells (CSCs) [41]. Single-cell sequencing of bladder cancer stem cells (BCSCs) deciphered the genetic basis and stemness-related gene mutations in BCSCs [42].
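A minimal way to see the punctuated pattern reported by Navin et al. in single-cell copy-number data is to compute, per cell, the fraction of genomic bins deviating from the diploid state: under punctuated evolution this statistic is roughly bimodal, separating near-diploid cells from cells carrying the shared aneuploid rearrangements. The profiles below are invented for illustration, not real data.

```python
def fraction_altered(profile, ploidy=2):
    """Fraction of genomic bins whose copy number deviates from
    the expected ploidy."""
    return sum(cn != ploidy for cn in profile) / len(profile)

# Invented per-cell copy numbers over 8 genomic bins (2 = diploid).
# The two tumor cells share nearly the same aneuploid segments.
cells = {
    "tumor_1":  [2, 4, 4, 1, 2, 3, 1, 4],
    "tumor_2":  [2, 4, 4, 1, 2, 3, 2, 4],
    "normal_1": [2, 2, 2, 2, 2, 2, 2, 2],
    "normal_2": [2, 2, 2, 2, 2, 1, 2, 2],
}
aneuploid = [c for c, p in cells.items() if fraction_altered(p) > 0.3]
print(aneuploid)  # → ['tumor_1', 'tumor_2']
```

The 0.3 cutoff is arbitrary; the point is that there is no intermediate population of partially rearranged cells, which is the signature of punctuated rather than gradual copy-number evolution.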
Epigenetics is defined as heritable change in gene expression without alteration of the DNA sequence. Epigenetic alterations broadly include changes in DNA methylation and histone modification [28]. These epigenetic events are dynamic and reversible. Along with genetic alterations, epigenetic aberrations also play a crucial role in carcinogenesis. In cancer cells, the pattern of epigenetic changes differs from cell to cell, leading to heterogeneity within the tumor [28]. Epigenetic alterations in cancer generally lead to silencing of tumor suppressor genes, activation of oncogenes, and dysregulation of gene expression [46]. DNA methylation in the CpG islands of gene promoter regions silences gene expression by preventing the binding of transcription machinery proteins. Several key genes have been observed to be disrupted in various cancer types, either by hypermethylation or by hypomethylation [47]. Chromatin structural changes influenced by histone modifications are another important regulatory mechanism for gene regulation and carcinogenesis. Mutations in chromatin remodeling factors and histone-modifying enzymes lead to changes in the transcriptional state of genes, thus altering the behavior of the cell [48]. Recent studies have shown that dysregulation of the epigenetic machinery of a cell can drive oncogenic transformation, progression, and treatment response [48–50].
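Promoter methylation as described above is typically quantified from bisulfite-sequencing data as a beta value per CpG (methylated reads divided by total reads). The sketch below, with invented read counts and an arbitrary cutoff, flags a promoter whose mean beta exceeds the cutoff as hypermethylated, i.e. a candidate for silencing.

```python
def beta(meth, unmeth):
    """Methylation beta value for one CpG: methylated reads / total."""
    return meth / (meth + unmeth)

def promoter_status(cpg_counts, cutoff=0.7):
    """Classify a promoter from per-CpG (methylated, unmethylated)
    read counts by its mean beta value."""
    betas = [beta(m, u) for m, u in cpg_counts]
    mean_beta = sum(betas) / len(betas)
    label = "hypermethylated" if mean_beta >= cutoff else "unmethylated"
    return label, mean_beta

# Invented counts for three CpGs in one promoter
status, b = promoter_status([(18, 2), (15, 5), (19, 1)])
print(status, round(b, 2))  # → hypermethylated 0.87
```

In practice, low-coverage CpGs are filtered out and the cutoff is chosen against a matched normal sample, but the beta-value summary is the standard starting point for calling the hyper- and hypomethylation events discussed above.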
Advances in next generation sequencing technologies, such as bisulfite sequencing for analyzing DNA methylation, DNase-seq and MNase-seq for DNA accessibility and chromatin conformation, and chromatin immunoprecipitation followed by sequencing (ChIP-seq) for the binding sites of individual factors or modified
Single-cell genomics has led to the identification of detailed mutational portraits of different subclones within a tumor. The expression profiles of these cells with distinct genetic makeup can be further investigated using a single-cell transcriptomics approach. Single-cell RNA sequencing has provided unprecedented insights into tumor heterogeneity and unraveled mysteries about cell-to-cell variation, new cell types in the tumor milieu, plasticity, and cancer stem cells. A pioneering study on single-cell RNA sequencing profiled 430 single cells isolated from five primary glioblastomas to uncover their genetic heterogeneity. The authors examined the subtype signatures established by The Cancer Genome Atlas (TCGA) in individual cells. Different glioblastoma subtypes were observed across individual cells from the same tumor, demonstrating ITH. The analysis revealed that tumors contain hybrid cell states, including progenitor and differentiated states with distinct transcriptional programs, and provided inferential evidence for dynamic transitions [56]. Another landmark study analyzed 4645 single cells from 19 melanoma patients, profiling malignant and non-malignant (immune, stromal, and endothelial) cells. The study identified a rare subpopulation with drug-resistant properties. Single-cell gene expression analysis of infiltrating T cells revealed potential biomarkers for distinguishing exhausted and cytotoxic T cells that may aid in selecting patients for immune checkpoint blockade. This study unraveled the tumor microenvironment in melanoma, providing insights into targeted and immune therapies [57]. Giustacchini et al. analyzed about 2000 single cells from chronic myeloid leukemia (CML) patients to characterize distinct molecular signatures of single-cell subpopulations in human CML samples from diagnosis through remission and disease progression. The study revealed the heterogeneity of CML single cells and identified a subgroup that persisted even after prolonged treatment [58].
Tumor microenvironment plays a major role in shaping the tumor at both primary
and metastatic sites. It provides an advantageous environment for cancer cells to evolve and promotes their growth. Single-cell RNA sequencing can shed light on the tumor microenvironment by dissecting the malignant cells, stromal cells, and immune cells within the tumor. A study by Li et al. demonstrated single-cell profiling of cells isolated from primary colorectal cancer tumors and matched normal samples. The study identified seven cell types within the tumor, namely epithelial cells, fibroblasts, endothelial cells, B cells, T cells, mast cells, and myeloid cells. The specific gene expression signatures of these cell types showed increased levels of EMT-related genes in cancer-associated fibroblasts (CAFs), which could drive the transition to an activated fibroblast state. The authors proposed that single-cell transcriptomics followed by cell-type clustering provides a more unbiased approach for identifying dysregulated genes within tumor and stromal cells [30]. A study on metastatic breast cancer cells by Lawson
et al. reported that metastatic cells from low-burden tissue possess distinct gene expression signatures for stemness, epithelial-to-mesenchymal transition, pro-survival functions, and dormancy compared with metastatic cells from high-burden tissue, which possessed properties similar to primary tumors. This study supported a hierarchical model for metastasis, in which metastases are initiated by stem-like cells that proliferate and differentiate to produce advanced metastatic disease [59].
300 J. Sambath et al.
To escape from immune surveillance, tumor cells recruit immunosuppressive cells
into the microenvironment and alter the phenotype and function of normal immune
cells [60]. Transcriptome sequencing of bulk tissue has classified tumor subtypes based on the expression signature of the bulk tumor population; however, it cannot resolve cellular diversity. Studies focusing on the immune landscape of tumors have flourished using scRNA-seq. Characterization of immune cells found in the stromal population will lead to the identification of new immunotherapeutic strategies. Zheng et al. catalogued the immune microenvironment of liver cancer and provided an immune atlas of infiltrating T cells in hepatocellular carcinoma. In this study, Tregs and exhausted CD8+ T cells were found to be more abundant in tumors than in normal peripheral blood, which can result in immune suppression in the TME [61]. Transcriptome analysis of 515 single cells from different subtypes of breast
cancer patients revealed carcinoma and non-carcinoma microenvironment cells that
included macrophages, T lymphocytes, and B lymphocytes. Profiling of infiltrating T cells revealed that infiltrating immune cells were more abundant in the TNBC subtype than in the other subtypes, with signatures of T cells with a regulatory or exhausted phenotype and macrophages with an M2 phenotype that can promote evasion of cancer cells from immune surveillance [62].
The studies mentioned here have demonstrated the potential of scRNA-seq in dissecting tumor heterogeneity. scRNA-seq studies will facilitate a better understanding of individual cell types within the tumor and their roles in carcinogenesis. This information can have an enormous impact on clinical research and on precision medicine.
Similarly, multiple single-cell omics analysis studies have been reported in various
cancers to delineate ITH.
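The cluster-then-annotate workflow underlying these scRNA-seq studies can be sketched with generic tools. The example below is a minimal, hypothetical illustration on a synthetic expression matrix, using scikit-learn PCA plus k-means rather than the graph-based clustering typically applied to real single-cell data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Simulate 300 cells from three hypothetical cell types, 50 genes each;
# each type over-expresses its own block of 10 marker genes.
cells_per_type, n_genes = 100, 50
blocks = []
for t in range(3):
    expr = rng.poisson(1.0, size=(cells_per_type, n_genes)).astype(float)
    expr[:, t * 10:(t + 1) * 10] += 5.0  # type-specific marker boost
    blocks.append(expr)
X = np.vstack(blocks)
true_type = np.repeat([0, 1, 2], cells_per_type)

# Standard pipeline: log-transform, reduce dimensionality, cluster.
pcs = PCA(n_components=10, random_state=0).fit_transform(np.log1p(X))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)

# Measure how well clusters recover the simulated types (cluster purity).
purity = sum(np.bincount(true_type[labels == k]).max()
             for k in range(3)) / len(true_type)
print(f"cluster purity: {purity:.2f}")  # close to 1.0 for well-separated types
```

On real data, the clusters would then be annotated by marker-gene expression (e.g., EPCAM for epithelial cells, PTPRC for immune cells), which is how studies such as Li et al. assign cell-type labels.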
Fig. 2 Schematic representation of clonal evolution models exhibited by various cancers. a Clonal selection model. Subclones that have acquired additional mutations are positively selected within a favorable TME. b Multiclonal invasion model. Multiple clones evolve simultaneously and contribute equally to the growth of the tumor. c Big Bang model. Tumor growth occurs in the absence of stringent selection, consistent with neutral evolution
cells from 10 patients with both DCIS and IDC regions revealed that genomic alterations in the subclones are present before they leave the duct, and it was hypothesized that all the subclones in the duct arise from a single initiating cell. The evident shift in subclonal population frequency between DCIS and IDC supports the multiclonal invasion model (Fig. 2b), in which multiple subclones escape from the duct and evoke the invasive tumor [75].
In the case of colorectal cancer, a clonal expansion model, i.e., the Big Bang model (Fig. 2c), has been proposed, which states that a single tumor-initiating cell leads to the propagation of intermixed cells of various subclones without stringent selection, consistent with effectively neutral evolution. Analysis of genomic profiles of multiregion, single-gland, and single-cell samples from 15 colorectal tumors exhibited uniformly high intratumor heterogeneity and supported the implications of the Big Bang model. In
this model, public driver alterations and private alterations that are pervasive in the final neoplasm might have occurred early during tumor growth. The Big Bang model also holds that the timing of a mutation determines its prevalence in the advanced tumor: private alterations that occur early will be found in detectable subclonal populations, whereas late mutations will become undetectable [76].
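The timing-determines-prevalence argument can be made concrete with a toy calculation, assuming an idealized neutral tumor growing by synchronous cell doubling (a deliberate simplification of the model in [76]):

```python
# Toy illustration of the Big Bang timing effect: under neutral growth by
# synchronous cell doubling, a neutral mutation arising in a single cell when
# the tumor has n cells is carried by ~1/n of the final tumor, no matter how
# large the tumor grows afterwards.

def final_frequency(n_cells_at_mutation: int) -> float:
    """Expected final frequency of a neutral mutation that arose in one cell
    when the population size was n_cells_at_mutation (pure doubling)."""
    return 1.0 / n_cells_at_mutation

detection_limit = 0.01           # e.g. ~1% cellular fraction in bulk sequencing
early = final_frequency(8)       # arose during the first few doublings -> 0.125
late = final_frequency(2 ** 20)  # arose after ~20 doublings (~1e6 cells)

print(early > detection_limit)   # True: early private mutation stays detectable
print(late > detection_limit)    # False: late mutation falls below detection
```

The detection limit here is an illustrative value, not one taken from the cited study; the point is only that frequency decays with the population size at which the mutation arose.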
The study of the clonal evolution that drives tumor progression will provide detailed information about the initial stages of cancer development and help us treat cancer efficiently. During cancer progression, each tumor cell is reported to undergo various selective pressures that form the basis of intratumor heterogeneity. Each tumor exhibits a different pattern of evolution; delineating the model of evolution will give novel insights into progression, and identifying the subpopulations that drive metastasis or resistance to anti-cancer therapy will provide new therapeutic targets and improve treatment efficiency.
Cancer drug resistance mechanisms are complex and multidimensional. Tumors are known to become resistant to cancer therapies over time, leading to recurrence and disease progression. As the realization of rising rates of cancer therapy resistance grew, there was a pressing need to understand the drug resistance mechanisms better. Several different mechanisms of cancer drug resistance have been discovered and described (Fig. 3) [77–80]. Drug efflux, drug inactivation, drug target alteration, inhibition of tumor cell death, tumor heterogeneity, DNA damage repair, epithelial-to-mesenchymal transition, epigenetic modifications, and multidrug resistance are widely studied mechanisms associated with resistance. These mechanisms can be present in isolation or in combination, which has led to the modality of combination therapy for most cancer diagnoses [77–82]. Understanding these mechanisms is of utmost importance as the mainstay for developing personalized therapeutic strategies for these patients.
Nearly two-thirds of non-small-cell lung cancer (NSCLC) patients are diagnosed
at an advanced stage where surgical resection may not be possible. These patients
need systemic therapy and in some cases radiation-based therapy. Disease relapse
and progression is a common phenomenon in NSCLC and warrants a deeper under-
standing of the pathways of resistance for better implementation of care, decreased
morbidity and mortality in this deadly disease [81]. Head and neck squamous cell
cancers (HNSCC), especially the non-HPV type are the other type of cancers that are
known to have high risk of recurrence and disease progression [82]. Oral squamous
cell carcinoma (OSCC) is one of the commonest forms of head and neck cancer and is known to be difficult to treat, with high relapse rates. Nearly sixty percent of HNSCC cases are locally advanced at diagnosis and are not resectable. These patients as
well as resected high-risk cases are treated with chemotherapy and radiation in the
definitive or adjuvant setting. Resistance to chemotherapy and radiation is one of the
prominent reasons for relapse and progression in these cases [83].
Single-Cell Multiomics: Dissecting Cancer 305
Fig. 3 Drug resistance mechanisms in cancer. Various mechanisms promote drug resistance either
directly or indirectly, independently or in combination
Several mechanisms of drug resistance have been studied and described [77–81].
Drug efflux is one of the most prominent mechanisms of resistance leading to
decreased efficacy, high relapse, and decreased survival. ATP-binding cassette (ABC)
transporter proteins are regulatory proteins on the plasma membranes of cells that
are understood to enable this efflux [84]. The three transporters that are thought to be
notorious for resistance in cancer are multidrug resistance protein 1 (MDR1), mul-
tidrug resistance-associated protein 1 (MRP1), and breast cancer resistance protein
(BCRP) [85, 86] and are known to be associated with poor clinical outcomes [86].
The ABC genes encoding MDR1, MRP1, MRP2, and BCRP are involved in chemotherapy resistance in many solid tumors, including OSCC and NSCLC [87].
Decreased drug activation and drug inactivation are other prominent mechanisms
of cancer drug resistance. The mechanism usually involves complex processes in
which drugs interact with different proteins leading to their modification, partial or
full degradation, or inactivation. Overexpression of matrix metalloproteinases (MMPs) has also been associated with drug resistance in HNSCC [88]. Epidermal growth factor receptor (EGFR) overexpression seen in HNSCC is linked with response to EGFR-directed therapy in this disease. Competitive inhibition by EGFR-blocking antibodies has been shown to lead to inactivation of EGFR inhibitors and resistance to therapy. Several other prominent protein superfamilies are involved in altered drug
activation and inactivation like uridine diphospho-glucuronosyltransferase (UGT)
which is involved in irinotecan inactivation modulated by epigenetic silencing of
UGT1A1 in colon cancer, glutathione-S-transferase (GST), and cytochrome P450
(CYP) superfamily [89].
Drug target alterations have major implications in the development of resistance.
Alterations in target receptors and expression levels could be one of the other mech-
anisms for cancer drug resistance. A classic example of target receptor alteration
would be the development of T790M mutation in epidermal growth factor receptor
(EGFR)-positive non-small-cell lung cancers seen in more than 50% of patients over
a period of time leading to drug resistance to first- and second-generation tyrosine
kinase inhibitors [90]. Another common example would be the development of resis-
tance to HER2 inhibitors in HER2-positive breast cancer due to the occurrence of
PI3KCA mutation or PTEN loss or both in these patients [91].
has been associated with immune exclusion and EMT in melanoma, lung adenocar-
cinoma, and squamous cell carcinoma [95]. Epigenetic modifications, including DNA methylation and histone acetylation, are other mechanisms that contribute to drug resistance. DNA demethylating agents and histone deacetylase inhibitors have been investigated in conjunction with chemotherapy and targeted therapy drugs in an effort to reverse the development of resistance pathways. Early findings are promising, although additional work is needed to bring these strategies into clinical utility [96].
Cancer cell heterogeneity has been studied widely and remains one of the biggest
challenges in understanding disease biology and formulating therapeutic strategies.
Proliferation and growth of drug-resistant clones alongside the response to therapy of drug-sensitive clones of cancer cells are well-understood phenomena [97]. Weak selection pressure, or a state of neutral evolution, is thought to be largely responsible for intratumoral heterogeneity. Under weak selection pressure or neutral evolution, multiple different cancer cell clones survive, which leads to drug resistance and increased metastatic potential. This condition is specifically described in certain solid tumors
[98].
Single-cell multiomics, with the help of next-generation sequencing and modern computational bioinformatics analyses, has provided a better understanding of cancer biology and pathology. Such studies will be critical in recognizing cancer-related signaling and in the development of novel therapeutics. Initially published by Tang et al. [102], the techniques have improved over time with significant advancements in
cell capture methods and library preparation techniques [103, 104]. Genetic heterogeneity has been studied widely using datasets from single-cell whole-genome and whole-exome sequencing [105]. With improved techniques of single-cell sequencing, it has become easier to study neoplastic clonal populations
with better understanding [36]. Single-cell multiomics has enabled study of the transcriptome, genome, and methylome of a single cell [106]. A multiomics study by Hou et al. utilized the technique to identify two different clones of hepatocellular carcinoma cells with distinct populations and metastatic potentials [106]. These techniques
have been also utilized to understand TME and pathways of immunotherapy resis-
tance in lung cancer [107]. Inter-tumor and intratumor heterogeneities have been
thought to be among the causes of partial response to immunotherapy. In a recent study by Ma et al., single-cell RNA sequencing was utilized to confirm the heterogeneity of immune response-related genes in lung adenocarcinoma, demonstrating how modern technology can uncover associated drug resistance mechanisms [108].
Circulating tumor cells (CTCs) are cancer cells that shed from the primary tumor and disseminate into the bloodstream. CTCs are a major prerequisite in the metastasis cascade. During cancer progression, tumor cells intravasate into blood vessels by undergoing epithelial-to-mesenchymal transition (EMT). These circulating cells leave the circulation and seed secondary sites via mesenchymal-to-epithelial transition (MET) to initiate metastasis [109–111]. Studies have mainly focused on CTC isolation and
enrichment methods and observed the correlation between CTC number and clinical
prognosis [110, 112–114]. Role of CTCs as biomarkers for diagnosis, prognosis,
and monitoring treatment response has been reported in several solid tumors [115–
117]. Mutation and expression studies of primary and metastatic tumor fostered the
advent of personalized medicine. However, the sequencing efforts made are predominantly derived from single-site biopsy samples, which give a genomic snapshot of the tumor at one time point. The continuous evolution of tumor cells is reflected as inter- and intratumor heterogeneity in the individual patient. Single-site biopsy samples cannot uncover the process of tumor evolution, and subjecting the patient to multiple-site biopsies is infeasible. CTCs, being representative of primary and metastatic tumor cell populations, could be used to study tumor heterogeneity and therapeutic resistance mechanisms across cancers. The development of single-cell sequencing techniques has facilitated characterization of these rare cancer cell populations.
Recent studies on single CTCs have given insights into the tumor heterogeneity in
various cancer types. Studies on single CTCs by Shaw et al. and Luca et al. highlighted
the existence of genetic heterogeneity within and between breast cancer patients by
observing somatic mutational status of cancer-related genes (TP53, ERBB2, EGFR,
PIK3CA, PTEN, ESR1, KRAS) [118, 119]. Gao et al. analyzed copy number alterations (CNAs) in CTCs isolated from four different types of cancer and proposed a convergent evolutionary model for the evolution of tumor CNAs. The proposed CNA evolution model contrasts with previously reported models such as gradual and punctuated evolution [120]. Ni et al. employed genomic analysis of single CTCs from lung cancer patients and identified copy number variations in cancer-related gene loci as key events that alter gene expression and give a selective advantage for metastasis [121]. Small-cell lung cancer (SCLC) is an aggressive disease with
early metastatic dissemination. A major challenge in SCLC is the early development of resistance to chemotherapy. Since SCLC is characterized by early vascular dissemination, CTCs from liquid biopsy samples can be used for diagnosis and for predicting clinical outcomes. Carter et al. generated whole-genome sequencing data for single CTCs, pooled CTCs, and matched WBCs isolated from 13 SCLC patients and carried out copy number variation analysis. Genes altered in CTCs were consistent with those in tumor cells and were not found in germline controls, suggesting that CTC-based genomic profiles could serve as an alternative to tumor profiles. The authors used CTC-based CNA data to distinguish chemosensitive from chemorefractory patients: using bioinformatics analysis, 16 CNA profiles were generated and a predictive model was developed that stratifies samples as chemosensitive or chemorefractory. The study suggested the use of non-invasive CTC-based profiling to anticipate clinical outcomes in SCLC and showed the potential of CTCs for understanding the biology of cancer [122].
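As a rough sketch of such a predictive model, the example below trains a copy-number-based stratifier on entirely synthetic data; the actual classifier, features, and cohort of [122] are not reproduced here, only the general idea of separating chemosensitive from chemorefractory profiles:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_samples, n_regions = 60, 16  # 16 copy-number features, echoing the study

# Labels: 0 = chemosensitive, 1 = chemorefractory (synthetic cohort).
y = np.array([0] * 30 + [1] * 30)

# Synthetic copy-number matrix: diploid baseline (2 copies) plus noise;
# the chemorefractory class carries simulated recurrent gains in 5 regions.
X = 2.0 + rng.normal(0.0, 0.3, size=(n_samples, n_regions))
X[y == 1, :5] += 1.0

clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"cross-validated accuracy: {acc:.2f}")  # well above chance on this toy data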
To understand the mechanism of EMT in cancer metastasis, Yu et al. characterized breast cancer CTCs and suggested that mesenchymal CTCs are highly associated with disease progression [123]. Ting et al. performed a single-cell RNA-seq study on CTCs and matched primary tumors from a mouse model of pancreatic cancer. Clustering based on the transcriptomic profiles of CTCs, primary tumors, and tumor-derived cell lines showed CTCs as a separate cluster featuring a low proliferative signature, expression of both epithelial and mesenchymal markers, and enrichment of stem
cell-associated genes. The authors observed aberrant expression of extracellular matrix genes, underscoring the role of the stromal microenvironment in the spread of cancer to distant sites [124].
A gene expression study on individual CTCs of castration-resistant prostate cancer patients showed high intercellular transcriptional heterogeneity compared to the expression profiles of single cells from prostate cancer cell lines and primary tumors. The authors also observed different androgen receptor splice variants and androgen receptor mutations in individual patients, reflecting remarkable heterogeneity. The differences in the transcriptomic profiles of CTCs from treatment-resistant and treatment-naïve patients resulted in the identification of several signaling pathways that may confer resistance to anti-androgen therapy [125]. Grillet et al. established CTC cell lines
from metastatic colorectal cancer patients characterized by a cancer stem cell (CSC) phenotype and genetic and phenotypic heterogeneity. An RNA-seq study of these cell lines showed high-level expression of CSC markers and demonstrated the ability of CTCs to develop tumors in distant organs. Expression analysis also showed enrichment of drug-metabolizing pathways, which may confer resistance to conventional cytotoxic compounds. This study demonstrated the potential use of CTCs to predict drug responses in colorectal cancer patients [126].
Technological advancements in CTC isolation and sequencing have opened a new dimension of research using CTCs as biomarkers for diagnosis, for predicting treatment response, and for studying tumor heterogeneity. The pioneering studies on single-cell sequencing of CTCs have successfully demonstrated the use of CTCs for clinical
purposes. Further development in the field of CTCs may help to understand tumor
heterogeneity and treatment resistance more deeply and facilitate precision medicine
with the identification of new therapeutic targets and therapeutic strategies.
Single-cell sequencing is a giant leap in the field of cancer research. Studies at the single-cell level have given deeper insights into the biology of cancer, intratumor heterogeneity, clonal diversity, the evolution of therapy resistance, and metastatic dissemination, and have characterized rare cells in the tumor microenvironment. Thus far, huge collaborative efforts such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have been made to collate genomic, epigenomic, transcriptomic, and proteomic alteration profiles in various cancers. Profiling in these cases predominantly employed bulk tumor sequencing; hence, these studies could not provide the resolution needed for a deeper understanding of ITH. The development
of sophisticated algorithms such as PyClone [63], SciClone [64], and GenoClone [65], which exploit variant allele frequencies and genotypes to delineate ITH, means these tools are now extensively used to analyze existing sequencing datasets. A recent pan-cancer ITH study using 3383 whole-exome datasets from nine different cancers in TCGA revealed that breast, urothelial, head and neck, and renal carcinomas had relatively lower ITH and were dominated by a single clonal expansion, whereas lung squamous cell carcinoma, lung adenocarcinoma, glioma, prostate cancer, and melanoma were predicted to be polyclonal with higher ITH. This study efficiently demonstrated the use of open-source tools to investigate ITH in conventionally sequenced bulk tumors. Multiregion sequencing of a tumor biopsy is also
a promising approach to study ITH. This technique is predominantly employed in clinical trials such as TRACERx (TRAcking non-small-cell lung Cancer Evolution through therapy [Rx]). This study has utilized multiregion sequencing along with orthogonal bulk tumor sequencing to investigate the effects of ITH on tumor progression and overall survival based on mutational evolution trajectories.
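As a minimal stand-in for what VAF-based tools like PyClone and SciClone do (the real methods use beta-binomial models with copy-number and purity correction; this sketch, on synthetic diploid data, only illustrates the core idea of clustering variant allele frequencies into candidate subclones):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
depth = 200  # sequencing depth per variant

# Simulate read counts for mutations from a clonal population (cancer cell
# fraction 1.0 -> VAF ~0.5 in a pure diploid tumor) and one subclone
# (cancer cell fraction 0.4 -> VAF ~0.2).
true_vaf = np.concatenate([np.full(40, 0.5), np.full(25, 0.2)])
alt_reads = rng.binomial(depth, true_vaf)
vaf = (alt_reads / depth).reshape(-1, 1)

# Cluster the observed VAFs; each mixture component is a candidate (sub)clone.
gm = GaussianMixture(n_components=2, random_state=0).fit(vaf)
centers = sorted(gm.means_.ravel().tolist())
print([round(c, 2) for c in centers])  # approximately [0.2, 0.5]
```

In practice the number of clusters is itself inferred (e.g., via Dirichlet process priors in PyClone), and VAFs must be converted to cancer cell fractions using local copy number and tumor purity before clusters can be read as subclones.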
References
8. Fouad YA, Aanei C (2017) Revisiting the hallmarks of cancer. Am J Cancer Res 7(5):1016–
1036
9. Hanahan D, Weinberg RA (2011) Hallmarks of cancer: the next generation. Cell 144(5):646–
674
10. Wang M et al (2017) Role of tumor microenvironment in tumorigenesis. J Cancer 8(5):761–
773
11. Thunnissen E, van der Oord K, den Bakker M (2014) Prognostic and predictive biomarkers
in lung cancer. A review. Virchows Arch 464(3):347–358
12. McNamara MG, Sahebjam S, Mason WP (2013) Emerging biomarkers in glioblastoma. Cancers (Basel) 5(3):1103–1119
13. Henry NL, Hayes DF (2012) Cancer biomarkers. Mol Oncol 6(2):140–146
14. Roma-Rodrigues C et al (2019) Targeting tumor microenvironment for cancer therapy. Int J
Mol Sci 20(4):840
15. Kim J, Bae JS (2016) Tumor-associated macrophages and neutrophils in tumor microenvi-
ronment. Mediat Inflamm 2016:6058147
16. Quail DF, Joyce JA (2013) Microenvironmental regulation of tumor progression and
metastasis. Nat Med 19(11):1423–1437
17. Hida K et al (2018) Contribution of tumor endothelial cells in cancer progression. Int J Mol
Sci 19(5):1272
18. Maishi N, Hida K (2017) Tumor endothelial cells accelerate tumor metastasis. Cancer Sci
108(10):1921–1926
19. Onimaru M, Yonemitsu Y (2011) Angiogenic and lymphangiogenic cascades in the tumor
microenvironment. Front Biosci (Schol Ed) 3:216–225
20. Hui L, Chen Y (2015) Tumor microenvironment: sanctuary of the devil. Cancer Lett 368(1):7–
13
21. Wu T, Dai Y (2017) Tumor microenvironment and therapeutic response. Cancer Lett 387:61–
68
22. Walker C, Mojares E, Del Rio Hernandez A (2018) Role of extracellular matrix in development
and cancer progression. Int J Mol Sci 19(10):3028
23. Sokolenko AP, Imyanitov EN (2018) Molecular diagnostics in clinical oncology. Front Mol
Biosci 5:76
24. Ryu D et al (2016) Deciphering intratumor heterogeneity using cancer genome analysis. Hum
Genet 135(6):635–642
25. Wei Q et al (2017) Multiregion whole-exome sequencing of matched primary and metastatic
tumors revealed genomic heterogeneity and suggested polyclonal seeding in colorectal cancer
metastasis. Ann Oncol 28(9):2135–2141
26. Varon-Gonzalez C, Navarro N (2019) Epistasis regulates the developmental stability of the
mouse craniofacial shape. Heredity (Edinb) 122(5):501–512
27. Racle J et al (2017) Simultaneous enumeration of cancer and immune cell types from bulk
tumor gene expression data. Elife 6:e26476
28. Lo PK, Zhou Q (2018) Emerging techniques in single-cell epigenomics and their applications
to cancer research. J Clin Genom 1(1)
29. Bartoschek M et al (2018) Spatially and functionally distinct subclasses of breast cancer-
associated fibroblasts revealed by single cell RNA sequencing. Nat Commun 9(1):5150
30. Li H et al (2018) Author correction: reference component analysis of single-cell transcrip-
tomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet 50(12):1754
31. Muller S et al (2017) Single-cell profiling of human gliomas reveals macrophage ontogeny
as a basis for regional differences in macrophage activation in the tumor microenvironment.
Genome Biol 18(1):234
32. Valdes-Mora F et al (2018) Single-cell transcriptomics in cancer immunobiology: the future
of precision oncology. Front Immunol 9:2582
33. Navin N et al (2011) Tumour evolution inferred by single-cell sequencing. Nature
472(7341):90–94
34. Wang Y et al (2014) Clonal evolution in breast cancer revealed by single nucleus genome
sequencing. Nature 512(7513):155–160
35. Francis JM et al (2014) EGFR variant heterogeneity in glioblastoma resolved through single-
nucleus sequencing. Cancer Discov 4(8):956–971
36. Xu X et al (2012) Single-cell exome sequencing reveals single-nucleotide mutation charac-
teristics of a kidney tumor. Cell 148(5):886–895
37. Yu C et al (2014) Discovery of biclonal origin and a novel oncogene SLC12A5 in colon cancer
by single-cell sequencing. Cell Res 24(6):701–712
38. Wu H et al (2017) Evolution and heterogeneity of non-hereditary colorectal cancer revealed
by single-cell exome sequencing. Oncogene 36(20):2857–2867
39. Li Y et al (2012) Single-cell sequencing analysis characterizes common and cell-lineage-
specific mutations in a muscle-invasive bladder cancer. Gigascience 1(1):12
40. Gawad C, Koh W, Quake SR (2014) Dissecting the clonal origins of childhood acute lym-
phoblastic leukemia by single-cell genomics. Proc Natl Acad Sci USA 111(50):17947–17952
41. Li C et al (2017) Single-cell exome sequencing identifies mutations in KCP, LOC440040,
and LOC440563 as drivers in renal cell carcinoma stem cells. Cell Res 27(4):590–593
42. Yang Z et al (2017) Single-cell sequencing reveals variants in ARID1A, GPRC5A and MLL2
driving self-renewal of human bladder cancer stem cells. Eur Urol 71(1):8–12
43. Caswell DR, Swanton C (2017) The role of tumour heterogeneity and clonal cooperativity in
metastasis, immune evasion and clinical outcome. BMC Med 15(1):133
44. Leung ML et al (2017) Single-cell DNA sequencing reveals a late-dissemination model in
metastatic colorectal cancer. Genome Res 27(8):1287–1299
45. Eirew P et al (2015) Dynamics of genomic clones in breast cancer patient xenografts at
single-cell resolution. Nature 518(7539):422–426
46. Baxter E et al (2014) Epigenetic regulation in cancer progression. Cell Biosci 4:45
47. Chatterjee A, Rodger EJ, Eccles MR (2018) Epigenetic drivers of tumourigenesis and cancer
metastasis. Semin Cancer Biol 51:149–159
48. Xi Y et al (2018) Histone modification profiling in breast cancer cell lines highlights
commonalities and differences among subtypes. BMC Genom 19(1):150
49. Li LC, Carroll PR, Dahiya R (2005) Epigenetic changes in prostate cancer: implication for
diagnosis and treatment. J Natl Cancer Inst 97(2):103–115
50. Kanwal R, Gupta S (2010) Epigenetics and cancer. J Appl Physiol 109(2):598–605
51. Litzenburger UM et al (2017) Single-cell epigenomic variability reveals functional cancer
heterogeneity. Genome Biol 18(1):15
52. Guo H et al (2013) Single-cell methylome landscapes of mouse embryonic stem cells and
early embryos analyzed using reduced representation bisulfite sequencing. Genome Res
23(12):2126–2135
53. Gaiti F et al (2019) Epigenetic evolution and lineage histories of chronic lymphocytic
leukaemia. Nature 569(7757):576–580
54. Farlik M et al (2015) Single-cell DNA methylome sequencing and bioinformatic inference of
epigenomic cell-state dynamics. Cell Rep 10(8):1386–1397
55. Buenrostro JD et al (2015) Single-cell chromatin accessibility reveals principles of regulatory
variation. Nature 523(7561):486–490
56. Patel AP et al (2014) Single-cell RNA-seq highlights intratumoral heterogeneity in primary
glioblastoma. Science 344(6190):1396–1401
57. Tirosh I et al (2016) Dissecting the multicellular ecosystem of metastatic melanoma by single-
cell RNA-seq. Science 352(6282):189–196
58. Giustacchini A et al (2017) Single-cell transcriptomics uncovers distinct molecular signatures
of stem cells in chronic myeloid leukemia. Nat Med 23(6):692–702
59. Lawson DA et al (2015) Single-cell analysis reveals a stem-cell program in human metastatic
breast cancer cells. Nature 526(7571):131–135
60. Liu Y, Cao X (2016) Immunosuppressive cells in tumor immune escape and metastasis. J Mol
Med (Berl) 94(5):509–522
61. Zheng C et al (2017) Landscape of infiltrating T cells in liver cancer revealed by single-cell
sequencing. Cell 169(7):1342–1356
62. Chung W et al (2017) Single-cell RNA-seq enables comprehensive tumour and immune cell
profiling in primary breast cancer. Nat Commun 8:15081
63. Roth A et al (2014) PyClone: statistical inference of clonal population structure in cancer.
Nat Methods 11(4):396–398
64. Miller CA et al (2014) SciClone: inferring clonal architecture and tracking the spatial and
temporal patterns of tumor evolution. PLoS Comput Biol 10(8):e1003665
65. Zou M, Jin R, Au KF (2018) Revealing tumor heterogeneity of breast cancer by utilizing the
linkage between somatic and germline mutations. Brief Bioinform
66. Zhang J et al (2014) Intratumor heterogeneity in localized lung adenocarcinomas delineated
by multiregion sequencing. Science 346(6206):256–259
67. Zhang LL et al (2017) Multiregion sequencing reveals the intratumor heterogeneity of driver
mutations in TP53-driven non-small cell lung cancer. Int J Cancer 140(1):103–108
68. Ledgerwood LG et al (2016) The degree of intratumor mutational heterogeneity varies by
primary tumor sub-site. Oncotarget 7(19):27185–27198
69. Yan T et al (2019) Multi-region sequencing unveils novel actionable targets and spatial
heterogeneity in esophageal squamous cell carcinoma. Nat Commun 10(1):1670
70. Hao JJ et al (2016) Spatial intratumoral heterogeneity and temporal clonal evolution in
esophageal squamous cell carcinoma. Nat Genet 48(12):1500–1507
71. Kim TM et al (2015) Subclonal genomic architectures of primary and metastatic colorectal
cancer based on intratumoral genetic heterogeneity. Clin Cancer Res 21(19):4461–4472
72. Ahmed M, Li LC (2013) Adaptation and clonal selection models of castration-resistant
prostate cancer: current perspective. Int J Urol 20(4):362–371
73. Horning AM et al (2018) Single-cell RNA-seq reveals a subpopulation of prostate cancer cells
with enhanced cell-cycle-related transcription and attenuated androgen response. Cancer Res
78(4):853–864
74. Casasent AK, Edgerton M, Navin NE (2017) Genome evolution in ductal carcinoma in situ:
invasion of the clones. J Pathol 241(2):208–218
75. Casasent AK et al (2018) Multiclonal invasion in breast tumors identified by topographic
single cell sequencing. Cell 172(1–2):205–217
76. Sottoriva A et al (2015) A big bang model of human colorectal tumor growth. Nat Genet
47(3):209–216
77. Sarkar S et al (2013) Cancer development, progression, and therapy: an epigenetic overview.
Int J Mol Sci 14(10):21087–21113
78. Byler S et al (2014) Genetic and epigenetic aspects of breast cancer progression and therapy.
Anticancer Res 34(3):1071–1077
79. Byler S, Sarkar S (2014) Do epigenetic drug treatments hold the key to killing cancer
progenitor cells? Epigenomics 6(2):161–165
80. Campbell PJ et al (2010) The patterns and dynamics of genomic instability in metastatic
pancreatic cancer. Nature 467(7319):1109–1113
81. Sosa Iglesias V et al (2018) Drug resistance in non-small cell lung cancer: a potential for
NOTCH targeting? Front Oncol 8:267
82. Lopez-Verdin S et al (2018) Molecular markers of anticancer drug resistance in head and neck
squamous cell carcinoma: a literature review. Cancers (Basel) 10(10):376
83. Ansell A et al (2016) Epidermal growth factor is a potential biomarker for poor cetuximab
response in tongue cancer cells. J Oral Pathol Med 45(1):9–16
84. Sauna ZE, Ambudkar SV (2001) Characterization of the catalytic cycle of ATP hydrolysis by
human P-glycoprotein. The two ATP hydrolysis events in a single catalytic cycle are kinetically
similar but affect different functional outcomes. J Biol Chem 276(15):11653–11661
85. Hilgendorf C et al (2007) Expression of thirty-six drug transporter genes in human intestine,
liver, kidney, and organotypic cell lines. Drug Metab Dispos 35(8):1333–1340
86. Haber M et al (2006) Association of high-level MRP1 expression with poor clinical outcome
in a large prospective study of primary neuroblastoma. J Clin Oncol 24(10):1546–1553
Single-Cell Multiomics: Dissecting Cancer 315
87. Friedrich RE, Punke C, Reymann A (2004) Expression of multi-drug resistance genes (mdr1,
mrp1, bcrp) in primary oral squamous cell carcinoma. In Vivo 18(2):133–147
88. Rivlin N et al (2011) Mutations in the p53 tumor suppressor gene: important milestones at
the various steps of tumorigenesis. Genes Cancer 2(4):466–474
89. Holohan C et al (2013) Cancer drug resistance: an evolving paradigm. Nat Rev Cancer
13(10):714–726
90. Kobayashi S et al (2005) EGFR mutation and resistance of non-small-cell lung cancer to
gefitinib. N Engl J Med 352(8):786–792
91. Razis E et al (2011) Evaluation of the association of PIK3CA mutations and PTEN loss
with efficacy of trastuzumab therapy in metastatic breast cancer. Breast Cancer Res Treat
128(2):447–456
92. Cook KL et al (2014) Hydroxychloroquine inhibits autophagy to potentiate antiestrogen
responsiveness in ER+ breast cancer. Clin Cancer Res 20(12):3222–3232
93. Curtin NJ (2012) DNA repair dysregulation from cancer driver to therapeutic target. Nat Rev
Cancer 12(12):801–817
94. Nogueira GAS et al (2018) Polymorphisms in DNA mismatch repair pathway genes predict
toxicity and response to cisplatin chemoradiation in head and neck squamous cell carcinoma
patients. Oncotarget 9(51):29538–29547
95. Chae YK et al (2018) Epithelial-mesenchymal transition (EMT) signature is inversely
associated with T-cell infiltration in non-small cell lung cancer (NSCLC). Sci Rep 8(1):2918
96. Bearzatto A et al (2000) Epigenetic regulation of the MGMT and hMSH6 DNA repair genes
in cells resistant to methylating agents. Cancer Res 60(12):3262–3270
97. Navin N et al (2010) Inferring tumor progression from genomic heterogeneity. Genome Res
20(1):68–80
98. McGranahan N, Swanton C (2017) Clonal heterogeneity and tumor evolution: past, present,
and the future. Cell 168(4):613–628
99. Van Allen EM et al (2015) Genomic correlates of response to CTLA-4 blockade in metastatic
melanoma. Science 350(6257):207–211
100. Rizvi NA et al (2015) Cancer immunology. Mutational landscape determines sensitivity to
PD-1 blockade in non-small cell lung cancer. Science 348(6230):124–128
101. Fares CM et al (2019) Mechanisms of resistance to immune checkpoint blockade: why does
checkpoint inhibitor immunotherapy not work for all patients? Am Soc Clin Oncol Educ Book
39:147–164
102. Tang F et al (2009) mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods
6(5):377–382
103. Zheng GX et al (2016) Haplotyping germline and cancer genomes with high-throughput
linked-read sequencing. Nat Biotechnol 34(3):303–311
104. Gierahn TM et al (2017) Seq-Well: portable, low-cost RNA sequencing of single cells at high
throughput. Nat Methods 14(4):395–398
105. Wang Y, Navin NE (2015) Advances and applications of single-cell sequencing technologies.
Mol Cell 58(4):598–609
106. Hou Y et al (2016) Single-cell triple omics sequencing reveals genetic, epigenetic, and
transcriptomic heterogeneity in hepatocellular carcinomas. Cell Res 26(3):304–319
107. Lambrechts D et al (2018) Phenotype molding of stromal cells in the lung tumor microenvi-
ronment. Nat Med 24(8):1277–1289
108. Ma KY et al (2019) Single-cell RNA sequencing of lung adenocarcinoma reveals heterogene-
ity of immune response-related genes. JCI Insight 4(4):e121387
109. Pantel K, Speicher MR (2016) The biology of circulating tumor cells. Oncogene 35(10):1216–
1224
110. Zhu Z et al (2018) Progress and challenges of sequencing and analyzing circulating tumor
cells. Cell Biol Toxicol 34(5):405–415
111. Yadavalli S et al (2017) Data-driven discovery of extravasation pathway in circulating tumor
cells. Sci Rep 7:43710
316 J. Sambath et al.
112. Krebs MG et al (2014) Molecular analysis of circulating tumour cells-biology and biomarkers.
Nat Rev Clin Oncol 11(3):129–144
113. Arya SK, Lim B, Rahman AR (2013) Enrichment, detection and clinical significance of
circulating tumor cells. Lab Chip 13(11):1995–2027
114. Khoo BL et al (2015) Short-term expansion of breast circulating cancer cells predicts response
to anti-cancer therapy. Oncotarget 6(17):15578–15593
115. Cristofanilli M et al (2004) Circulating tumor cells, disease progression, and survival in
metastatic breast cancer. N Engl J Med 351(8):781–791
116. Balakrishnan A et al (2019) Circulating tumor cell cluster phenotype allows monitoring
response to treatment and predicts survival. Sci Rep 9(1):7933
117. Lianidou ES, Markou A, Strati A (2015) The role of CTCs as tumor biomarkers. Adv Exp
Med Biol 867:341–367
118. Shaw JA et al (2017) Mutation analysis of cell-free DNA and single circulating tumor cells
in metastatic breast cancer patients with high circulating tumor cell counts. Clin Cancer Res
23(1):88–96
119. De Luca F et al (2016) Mutational analysis of single circulating tumor cells by next generation
sequencing in metastatic breast cancer. Oncotarget 7(18):26107–26119
120. Gao Y et al (2017) Single-cell sequencing deciphers a convergent evolution of copy number
alterations from primary to circulating tumor cells. Genome Res 27(8):1312–1322
121. Ni X et al (2013) Reproducible copy number variation patterns among single circulating tumor
cells of lung cancer patients. Proc Natl Acad Sci USA 110(52):21083–21088
122. Carter L et al (2017) Molecular analysis of circulating tumor cells identifies distinct copy-
number profiles in patients with chemosensitive and chemorefractory small-cell lung cancer.
Nat Med 23(1):114–119
123. Yu M et al (2013) Circulating breast tumor cells exhibit dynamic changes in epithelial and
mesenchymal composition. Science 339(6119):580–584
124. Ting DT et al (2014) Single-cell RNA sequencing identifies extracellular matrix gene
expression by pancreatic circulating tumor cells. Cell Rep 8(6):1905–1918
125. Miyamoto DT et al (2015) RNA-Seq of single prostate CTCs implicates noncanonical Wnt
signaling in antiandrogen resistance. Science 349(6254):1351–1356
126. Grillet F et al (2017) Circulating tumour cells from patients with colorectal cancer have cancer
stem cell hallmarks in ex vivo culture. Gut 66(10):1802–1810
127. Baslan T et al (2012) Genome-wide copy number analysis of single cells. Nat Protoc
7(6):1024–1041
128. Leung ML et al (2015) SNES: single nucleus exome sequencing. Genome Biol 16:55
129. Gao R et al (2017) Nanogrid single-nucleus RNA sequencing reveals phenotypic diversity in
breast cancer. Nat Commun 8(1):228
130. Macosko EZ et al (2015) Highly parallel genome-wide expression profiling of individual cells
using nanoliter droplets. Cell 161(5):1202–1214
131. Grindberg RV et al (2013) RNA-sequencing from single nuclei. Proc Natl Acad Sci USA
110(49):19802–19807
132. Shapiro E, Biezuner T, Linnarsson S (2013) Single-cell sequencing-based technologies will
revolutionize whole-organism science. Nat Rev Genet 14(9):618–630
133. Bernard V et al (2019) Single-cell transcriptomics of pancreatic cancer precursors demon-
strates epithelial and microenvironmental heterogeneity as an early event in neoplastic
progression. Clin Cancer Res 25(7):2194–2205
134. Kim C et al (2018) Chemoresistance evolution in triple-negative breast cancer delineated by
single-cell sequencing. Cell 173(4):879–893
135. Azizi E et al (2018) Single-cell map of diverse immune phenotypes in the breast tumor
microenvironment. Cell 174(5):1293–1308
136. Liu M et al (2017) Multi-region and single-cell sequencing reveal variable genomic
heterogeneity in rectal cancer. BMC Cancer 17(1):787
137. Lavin Y et al (2017) Innate immune landscape in early lung adenocarcinoma by paired single-
cell analyses. Cell 169(4):750–765
138. Dey SS et al (2015) Integrated genome and transcriptome sequencing of the same cell. Nat
Biotechnol 33(3):285–289
139. Wu L et al (2015) Full-length single-cell RNA-seq applied to a viral human cancer:
applications to HPV expression and splicing analysis in HeLa S3 cells. Gigascience 4:51
140. Dago AE et al (2014) Rapid phenotypic and genomic change in response to therapeutic
pressure in prostate cancer inferred by high content analysis of single circulating tumor cells.
PLoS ONE 9(8):e101777
141. Zhao L et al (2013) High-purity prostate circulating tumor cell isolation by a polymer
nanofiber-embedded microchip for whole exome sequencing. Adv Mater 25(21):2897–2902