João C. Setubal · Waldeyr Mendes Silva (Eds.)

Advances in Bioinformatics and Computational Biology

13th Brazilian Symposium on Bioinformatics, BSB 2020
São Paulo, Brazil, November 23–27, 2020
Proceedings

LNBI 12558
Lecture Notes in Bioinformatics 12558

Series Editors
Sorin Istrail, Brown University, Providence, RI, USA
Pavel Pevzner, University of California, San Diego, CA, USA
Michael Waterman, University of Southern California, Los Angeles, CA, USA
Editors
João C. Setubal, University of São Paulo, São Paulo, Brazil
Waldeyr Mendes Silva, Instituto Federal de Goiás, Formosa, Goiás, Brazil
Organization
Conference Chair
Daniel Cardoso Moraes de Oliveira, Fluminense Federal University, Brazil

Steering Committee
Daniel Cardoso Moraes de Oliveira, Fluminense Federal University, Brazil
João Carlos Setubal, University of São Paulo, Brazil
Luis Antonio Kowada, Fluminense Federal University, Brazil
Natália Florencio Martins, Empresa Brasileira de Pesquisa Agropecuária, Brazil
Ronnie Alves, Instituto Tecnológico Vale, Brazil
Sérgio Lifschitz, Pontifícia Universidade Católica do Rio de Janeiro, Brazil
Sérgio Vale Aguiar Campos, Federal University of Minas Gerais, Brazil
Tainá Raiol, Fundação Oswaldo Cruz, Brazil
Waldeyr Mendes Silva, Instituto Federal de Goiás, Brazil
Program Committee
Said Sadique Adi, Federal University of Mato Grosso do Sul, Brazil
Nalvo Almeida, Federal University of Mato Grosso do Sul, Brazil
Ronnie Alves, ITV Sustainable Development, Brazil
Deyvid Amgarten, University of São Paulo, Brazil
Marilia Braga, Bielefeld University, Germany
Laurent Brehelin, Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, France
Marcelo Brigido, University of Brasília, Brazil
Sérgio Campos, Federal University of Minas Gerais, Brazil
Andre Carvalho, University of São Paulo, Brazil
Ricardo Cerri, Federal University of São Carlos, Brazil
Luís Cunha, Fluminense Federal University, Brazil
Alberto Davila, FIOCRUZ-RJ, Brazil
Ulisses Dias, University of Campinas, Brazil
Additional Reviewers
Eloi Araújo
Daniel Saad Nogueira Nunes
Lucas Oliveira
de Bruijn Graph Approaches for Fragment Assembly
E. M. de Armas et al.

1 Introduction
The field of biological research has changed rapidly since the advent of massively parallel sequencing technologies, known as next-generation sequencing (NGS) [10,18]. Commercial DNA sequencing platforms include the Genome Sequencer from Roche 454 Life Sciences (www.my454.com), the Genome Analyzer platform from Illumina (www.illumina.com), the SOLiD System from Applied Biosystems (www.appliedbiosystems.com), the Pacific Biosciences (PacBio) sequencers (https://www.pacb.com), and the Oxford MinION, which uses nanopore sequencing technology (https://nanoporetech.com).
A vital characteristic of these platforms is that they do not rely on Sanger chemistry [41] as first-generation machines did. Since their arrival in the market in 2005 and their rapid development since then, they have drastically lowered the cost per sequenced nucleotide and increased throughput by orders of magnitude [36]. Their performance dramatically increased the number of generated reads (many hundreds of thousands or even millions) produced in a relatively short time [25], with good genome coverage.
2 Genome Assembly
Genome assembly may be defined as the computational process of reconstructing a whole genome, up to the chromosomal level, from numerous short sequences called reads. An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target genome. However, the vast majority of sequenced genomes are made available only in draft format, consisting only of contigs, which are contiguous stretches of DNA sequence, or scaffolds, which are chains of contigs with additional information about their relative positions and orientations.
Several challenges around the use of NGS data make it difficult to obtain an assembly. DNA sequencing technologies share the fundamental limitation that read lengths are much shorter than even the smallest genomes. The process of determining the complete DNA sequence of an organism's genome overcomes this limitation by over-sampling the target genome with short reads from random positions. Assembly software is also challenged by repeated sequences in the target genome. Genomic regions that share perfect repeats may be indistinguishable, especially if the repeats are longer than the reads. Repeat resolution becomes even harder in the presence of sequencing errors. It is therefore necessary to further increase sequence accuracy by sequencing individual genomes many times over, increasing the coverage of the sequenced reads.
In terms of computational complexity, assembly may require High Performance Computing (HPC) platforms for large genomes and for the processing of large volumes of data. Algorithms developed for these HPC platforms are typically complex and depend on pragmatic engineering and heuristics. Heuristics help overcome complicated repetition patterns in real genomes, random and systematic errors in real data, and real computers' physical limitations. Moreover, the implementations and results are tied to a suitable parameter instantiation. In the case of de novo assembly using k-mer-based algorithms (see Sect. 3), the selection of the value of k is vital.
A k-mer is a string of length k, with 1 < k < m, where m is the read length. The value k defines the minimum length of a substring that two reads must share to define an overlap, linking the two reads in the graph traversal. A larger k value gives more accuracy in resolving repeated regions of the genome, but also increases the chance of missing overlaps between reads, causing the loss of links in the graph. Consequently, it is not easy to estimate the right k value for the best assembly. The total number of k-mers present in one read is m − k + 1, while the total number of k-mers present in n reads is (m − k + 1)n. The space of distinct k-mers for a given k has size 4^k.
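As a concrete illustration of these counts, the sketch below enumerates the k-mers of a set of reads; the read strings and values are toy examples for illustration, not data from the paper.

```python
# Minimal k-mer enumeration sketch (toy data, illustrative names only).
def kmers(read, k):
    """Yield the m - k + 1 k-mers of a read of length m."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

reads = ["ACGTAC", "GTACGG"]   # n = 2 toy reads, each of length m = 6
k = 3
all_kmers = [km for r in reads for km in kmers(r, k)]
# (m - k + 1) * n k-mers in total; the distinct ones live in a 4**k space
assert len(all_kmers) == sum(len(r) - k + 1 for r in reads)
print(sorted(set(all_kmers)))
```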
The life cycle of the DBG for genome assembly can be summarized in two steps. First, construction involves generating all k-mers, creating a node per distinct k-mer and an edge between two nodes if the corresponding k-mers have a (k − 1)-overlap in at least one read. In the second step, processing is carried out by simplifying the graph and traversing it to generate contiguous genome regions called contigs.
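The construction step can be sketched in a few lines of Python; this is a simplified in-memory illustration (reverse complements, k-mer counts, and error handling are omitted), not the implementation of any particular assembler.

```python
from collections import defaultdict

def build_dbg(reads, k):
    """Construction step: one node per distinct k-mer, and an edge between
    consecutive k-mers of a read, which overlap by exactly k - 1 symbols."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            u, v = read[i:i + k], read[i + 1:i + k + 1]
            edges[u].add(v)
    return edges

print(dict(build_dbg(["ACGTAC"], 3)))
# {'ACG': {'CGT'}, 'CGT': {'GTA'}, 'GTA': {'TAC'}}
```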
One class of strategies preprocesses the reads to reduce the input size, removing redundant information and errors before the assembly process itself starts.
Optimized Data Structures for Graph Representation: To minimize the memory requirements during the identification of unique k-mers, indexes for identifying duplicate k-mers may be used. In-memory hash tables have been successfully used by many assemblers, such as [29,45,48], to identify duplicate k-mers. However, for large amounts of NGS data, hash tables do not work well because they may not fit in memory. The suffix array is another data structure used to compute overlaps. The FM-index [44] has also been used to allow a compressed representation of the input reads and fast computation of overlaps in string graphs (equivalent to overlap graphs), but it has not yet been tested in the construction of de Bruijn graphs.
Succinct data structures have also been explored to represent the de Bruijn graph [3,4,11]. In [11], a succinct bitmap is used to compress the representation of the de Bruijn graph, but its space requirements still grow as the graph becomes bigger. Other approaches are based on the idea of sparseness in genome assembly [47], where only a subset of the k-mers present in the dataset is stored. Bloom filters (BF) have been extensively explored as a way to cope with the computational demands of the DBG, at the cost of an inexact representation [9,33,50]. They are used to store the vertices (k-mers), while the edges are implicitly deduced by querying the Bloom filter for the membership of all possible extensions of a k-mer. However, the deduced edges do not correspond exactly to the edges contained in the reads. Some works have focused on mechanisms that avoid the false positives introduced by Bloom filters [9,40].
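The implicit-edge idea used by these BF-based tools can be sketched as follows; here a plain Python set stands in for the Bloom filter, so the example reproduces the querying logic but not the false positives that an actual BF would add.

```python
def neighbors(kmer, members):
    """Outgoing edges are deduced implicitly: extend the (k-1)-suffix by each
    base and keep the extensions whose membership query succeeds."""
    suffix = kmer[1:]
    return [suffix + b for b in "ACGT" if (suffix + b) in members]

stored = {"ACG", "CGT", "CGA"}        # the vertex set kept in the filter
print(neighbors("ACG", stored))       # ['CGA', 'CGT']; with a real BF, any
                                      # false positive would add a spurious edge
```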
An extra-compacted de Bruijn graph structure is introduced in [12]. It represents intermediate states of the DBG during its generation through a series of iterations in which the number of k-mers to be processed is iteratively reduced. The nodes of the extra-compacted de Bruijn graph represent the unique dk-mers, of length less than or equal to d, while the edges correspond to unique edges formed by adjacent dk-mers that share a (k − 1)-length overlap in at least one read.
Extending the Computational Resources: Some solutions propose the use of cloud-based resources to overcome memory limitations. In [25], a set of assembler experiments was designed using the GAGE datasets and a group of assembly programs running in virtual machines in the Amazon AWS environment. The financial analysis reveals that the cost of assembly increases with the complexity of the genomes to be assembled, because such an operation requires more expensive virtual machines and the assembly may run for several hours. Other solutions are based on, or combined with, parallelization techniques, such as BCALM2 [8], the use of GPUs and other memory systems as in Gerbil [19], and an FPGA-based k-mer counter [32].
External Memory Approaches: Partition-based assembly algorithms have also been proposed for external processing. For example, the Minimum Substring Partitioning (MSP) [28] technique splits the input reads into subsequences, distributes them into disk partitions, and then processes one disk partition at a time.
One group of solutions relies on hash optimization, as in Jellyfish, with its use of a bijective function, and Meraculous, which uses a lightweight hash (a combination of a hash family). A second group is best represented by Bloom filters (BF), which admit false positives (FP), or by specialized probabilistic data structures based on BF.
Table 1. Comparison of de Bruijn graph solutions. Abbreviations used for the approach classification (Cl.): KC (k-mer counter), A (assembler), S (space-efficient solution), and E (external approach by disk distribution).
Several variants of k-mer encoding are used to reduce the amount of memory needed to represent and store each k-mer. DSK (the Minia assembler's k-mer counter) [39] and ABySS [45], for example, use a classical 2-bit representation for each nucleotide base in a k-mer. Using probabilistic data structures such as Bloom filters, Minia [9], for example, achieves approximately 1.44 log2(16k/2.08) + 2.08 bits per k-mer.
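A minimal sketch of the classical 2-bit encoding follows; the function names are illustrative, and real tools pack the codes into fixed-width machine words rather than Python integers.

```python
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
DEC = "ACGT"

def pack(kmer):
    """Encode a k-mer using 2 bits per nucleotide base."""
    code = 0
    for base in kmer:
        code = (code << 2) | ENC[base]
    return code

def unpack(code, k):
    """Decode the 2-bit representation back into a k-mer string."""
    bases = []
    for _ in range(k):
        bases.append(DEC[code & 0b11])
        code >>= 2
    return "".join(reversed(bases))

c = pack("ACGT")
print(c, unpack(c, 4))  # 27 ACGT  (four bases fit in a single byte)
```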
Other solutions use strategies based on hash tables; Jellyfish [31], for example, encodes part of each k-mer as the index of the table using a bijective hash function. Meraculous [6] uses a recursive collision strategy with multiple hash functions to avoid explicitly storing the k-mers themselves. khmer [50] uses a Count-Min Sketch that stores only counts, so the k-mers must be retrieved from the original dataset, and HaVec [38] uses 5 bytes for each index in the hash table plus 2k bits per k-mer. As another strategy, KMC1 [16] stores k-mers without their p1 and p2 prefixes.
DBG construction implies a subroutine to identify the distinct k-mers and obtain their multiplicities. The problem of identifying distinct k-mers has also been addressed by k-mer counting tools [16,17,27,31,33,39]. Although k-mer counters aim at generating histograms over k-mer distributions, their processing has some similarities to obtaining the vertex set of the DBG.
Identifying distinct k-mers has been approached by sorting [16,17], hashing [27,31,39], or using Bloom filters [33], sometimes combined with parallel approaches to speed up the process [16,17]. Some of these tools [16,17,27,39] distribute the k-mers into disk partitions before counting them, loading each partition into main memory one at a time.
It is worth noting that k-mer counters have no notion of vertices and edges. Besides, to reduce the amount of data, they make assumptions such as discarding k-mers with frequencies below or above a given threshold, which is not appropriate for DBG construction.
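The disk-partitioning idea behind these counters can be sketched as below. This toy version assigns k-mers to partitions with a generic hash; production tools such as KMC and MSP-based counters use minimizers/minimum substrings instead, and the file layout here is purely illustrative.

```python
import os
from collections import Counter

def partition_kmers(reads, k, nparts, tmpdir="parts"):
    """Distribute k-mers to disk partitions, then count one partition at a
    time, so memory holds only a fraction of the distinct k-mers at once."""
    os.makedirs(tmpdir, exist_ok=True)
    files = [open(os.path.join(tmpdir, f"p{i}.txt"), "w") for i in range(nparts)]
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            files[hash(km) % nparts].write(km + "\n")
    for f in files:
        f.close()
    counts = Counter()
    for i in range(nparts):              # load one partition at a time
        with open(os.path.join(tmpdir, f"p{i}.txt")) as f:
            counts.update(line.strip() for line in f)
    return counts

print(partition_kmers(["ACGTACGT"], 3, 2))
```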
5 Conclusions
Several techniques have been proposed to construct a DBG. This paper classifies these approaches into two main categories: one for an exact and complete graph, and another for a non-exact DBG representation. Among the solutions based on probabilistic data structures for an approximate DBG representation, the most explored data structure is the Bloom filter. This data structure allows vertices to be stored independently of their number, but at the cost of false positives.
By revisiting the literature, we found some solutions that use external memory to construct a graph with an exact representation. They process all the k-mers in external memory. A new algorithm that builds an accurate representation without the need to process all k-mers was also found.
This paper presented in detail the main theoretical and practical aspects of de novo assembly, particularly the de Bruijn graph structure. We enumerated the existing approaches for reducing the memory requirements of DBG construction. Finally, we proposed a comparative view of the existing solutions in the literature.
References
1. Bankevich, A., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012). https://doi.org/10.1089/cmb.2012.0021. https://pubmed.ncbi.nlm.nih.gov/22506599
2. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun.
ACM 13(7), 422–426 (1970). https://doi.org/10.1145/362686.362692
3. Boucher, C., Bowe, A., Gagie, T., Puglisi, S., Sadakane, K.: Variable-order de
Bruijn graphs. In: Data Compression Conference Proceedings 2015 (2014). https://
doi.org/10.1109/DCC.2015.70
4. Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn graphs. In:
Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 225–235. Springer,
Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0 18
5. Butler, J.: ALLPATHS: De novo assembly of whole-genome shotgun microreads.
Genome Res. 18(5), 810–820 (2008). https://doi.org/10.1101/gr.7337908
6. Chapman, J.A., Ho, I., Sunkara, S., Luo, S., Schroth, G.P., Rokhsar, D.S.: Mer-
aculous: de novo genome assembly with short paired-end reads. PLoS ONE 6(8),
e23501 (2011). https://doi.org/10.1371/journal.pone.0023501
7. Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the rep-
resentation of de Bruijn graphs. In: Sharan, R. (ed.) RECOMB 2014. LNCS,
vol. 8394, pp. 35–55. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-
05269-4 4
8. Chikhi, R., Limasset, A., Medvedev, P.: Compacting de Bruijn graphs from
sequencing data quickly and in low memory. Bioinformatics 32(12), i201 (2016).
https://doi.org/10.1093/bioinformatics/btw279
9. Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based
on a Bloom filter. Algorithms Mol. Biol. 8(1), 22 (2013). https://doi.org/10.1186/
1748-7188-8-22
10. Claros, M.G., Bautista, R., Guerrero-Fernández, D., Benzerki, H., Seoane, P.,
Fernández-Pozo, N.: Why assembling plant genome sequences is so challenging.
Biology 1(2), 439 (2012). https://doi.org/10.3390/biology1020439
11. Conway, T.C., Bromage, A.J.: Succinct data structures for assembling
large genomes. Bioinformatics 27(4), 479–486 (2011). https://doi.org/10.1093/
bioinformatics/btq697
12. de Armas, E.M., Castro, L.C., Holanda, M., Lifschitz, S.: A new approach for de
Bruijn graph construction in de novo genome assembling. In: 2019 IEEE Interna-
tional Conference on Bioinformatics and Biomedicine, pp. 1842–1849 (2019)
13. de Armas, E.M., Ferreira, P.C.G., Haeusler, E.H., de Holanda, M.T., Lifschitz, S.:
K-mer mapping and RDBMS indexes. In: Kowada, L., de Oliveira, D. (eds.) BSB
2019. LNCS, vol. 11347, pp. 70–82. Springer, Cham (2020). https://doi.org/10.
1007/978-3-030-46417-2 7
14. de Armas, E.M., Haeusler, E.H., Lifschitz, S., de Holanda, M.T., da Silva, W.M.C.,
Ferreira, P.C.G.: K-mer Mapping and de Bruijn graphs: the case for velvet frag-
ment assembly. In: 2016 IEEE International Conference on Bioinformatics and
Biomedicine (BIBM), pp. 882–889 (2016). https://doi.org/10.1109/BIBM.2016.
7822642
15. de Armas, E.M., Silva, M.V.M., Lifschitz, S.: A study of index structures for K-
mer mapping. In: Proceedings Satellite Events of the 32nd Brazilian Symposium
on Databases. Databases Meet Bioinformatics Workshop, pp. 326–333 (2017)
16. Deorowicz, S., Debudaj-Grabysz, A., Grabowski, S.: Disk-based k-mer counting on
a PC. BMC Bioinform. 14(1) (2013). https://doi.org/10.1186/1471-2105-14-160
17. Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and
resource-frugal k-mer counting. Bioinformatics 31(10), 1569 (2015). https://doi.
org/10.1093/bioinformatics/btv022
18. El-Metwally, S., Hamza, T., Zakaria, M., Helmy, M.: Next-generation sequence
assembly: four stages of data processing and computational challenges. PLoS Com-
put. Biol. 9(12), 1–19 (2013). https://doi.org/10.1371/journal.pcbi.1003345
19. Erbert, M., Rechner, S., Müller-Hannemann, M.: Gerbil: a fast and memory-
efficient k-mer counter with GPU-support. Algorithms Mol. Biol. 12(1), 9:1–9:12
(2017). https://doi.org/10.1186/s13015-017-0097-9
20. Ghosh, P., Kalyanaraman, A.: A fast sketch-based assembler for genomes. In: Pro-
ceedings of the 7th ACM International Conference on Bioinformatics, Computa-
tional Biology, and Health Informatics. BCB ’16, pp. 241–250. ACM, New York,
NY, USA (2016). https://doi.org/10.1145/2975167.2975192
21. Ghosh, P., Kalyanaraman, A.: FastEtch: a fast sketch-based assembler for genomes.
IEEE/ACM Trans. Comput. Biol. Bioinform. 16(4), 1091–1106 (2019). https://
doi.org/10.1109/TCBB.2017.2737999
22. Gnerre, S., et al.: High-quality draft assemblies of mammalian genomes from mas-
sively parallel sequence data. Proc. Natl. Acad. Sci. U.S.A. 108(4), 1513–1518
(2011). https://doi.org/10.1073/pnas.1017351108. 21187386[pmid]
23. Jackman, S.D., Birol, I.: Assembling genomes using short-read sequencing technol-
ogy. Genome Biol. 11(1), 202 (2010). https://doi.org/10.1186/gb-2010-11-1-202.
https://www.ncbi.nlm.nih.gov/pubmed/20128932, 20128932[pmid]
24. Kelley, D.R., Schatz, M.C., Salzberg, S.L.: Quake: quality-aware detection and
correction of sequencing errors. Genome Biol. 11(11), R116 (2010). https://doi.
org/10.1186/gb-2010-11-11-r116
25. Kleftogiannis, D., Kalnis, P., Bajic, V.B.: Comparing memory-efficient genome
assemblers on stand-alone and cloud infrastructures. PLoS ONE 8(9) (2013).
https://doi.org/10.1371/journal.pone.0075505
26. Kokot, M., Dlugosz, M., Deorowicz, S.: KMC 3: counting and manipulating k-
mer statistics. Bioinformatics 33(17), 2759–2761 (2017). https://doi.org/10.1093/
bioinformatics/btx304
27. Li, Y., Yan, X.: MSPKmerCounter: a fast and memory efficient approach for K-mer
Counting. arXiv e-prints (2015)
28. Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum
substring partitioning. PVLDB 6(3), 169–180 (2013). https://doi.org/10.14778/
2535569.2448951
29. Luo, R., et al.: SOAPdenovo2: an empirically improved memory-efficient short-
read de novo assembler. GigaScience 1(1), 1–6 (2012). https://doi.org/10.1186/
2047-217X-1-18
30. Mamun, A.A., Pal, S., Rajasekaran, S.: KCMBT: a k-mer counter based on mul-
tiple burst trees. Bioinformatics 32(18), 2783 (2016). https://doi.org/10.1093/
bioinformatics/btw345
31. Marcais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting
of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011). https://doi.org/
10.1093/bioinformatics/btr011
32. McVicar, N., Lin, C., Hauck, S.: K-Mer counting using bloom filters with an
FPGA-attached HMC. In: 25th IEEE Annual International Symposium on Field-
Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA, 30
April–2 May 2017, pp. 203–210 (2017). https://doi.org/10.1109/FCCM.2017.23
33. Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using
a bloom filter. BMC Bioinform. 12(1), 333 (2011). https://doi.org/10.1186/1471-
2105-12-333
34. Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation
sequencing data. Genomics 95(6), 315–327 (2010). https://doi.org/10.1016/j.
ygeno.2010.03.001. 20211242[pmid]
35. Myers, E.W.: Toward simplifying and accurately formulating fragment assembly.
J. Comput. Biol. 2(2), 275–290 (1995). https://doi.org/10.1089/cmb.1995.2.275.
pMID: 7497129
36. Niedringhaus, T.P., Milanova, D., Kerby, M.B., Snyder, M.P., Barron, A.E.: Land-
scape of next-generation sequencing technologies. Anal. Chem. 83(12), 4327–4341
(2011). https://doi.org/10.1021/ac2010857
37. Pandey, P., Bender, M.A., Johnson, R., Patro, R.: deBGR: an efficient and near-
exact representation of the weighted de Bruijn graph. Bioinformatics 33(14), i133–
i141 (2017). https://doi.org/10.1093/bioinformatics/btx261
38. Rahman, M.M., Sharker, R., Biswas, S., Rahman, M.: HaVec: an efficient de Bruijn
graph construction algorithm for genome assembly. Int. J. Genom. 2017, 1–12
(2017). https://doi.org/10.1155/2017/6120980
39. Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low
memory usage. Bioinformatics 29(5), 652–653 (2013). https://doi.org/10.1093/
bioinformatics/btt020
40. Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve
the memory usage for de Brujin graphs. Algorithms Mol. Biol.: AMB 9, 2 (2014).
https://doi.org/10.1186/1748-7188-9-2
41. Sanger, F., Coulson, A., Barrell, B., Smith, A., Roe, B.: Cloning in single-stranded
bacteriophage as an aid to rapid DNA sequencing. J. Mol. Biol. 143(2), 161–178
(1980). https://doi.org/10.1016/0022-2836(80)90196-5
42. Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using
second-generation sequencing. Genome Res. 20(9), 1165–1173 (2010). https://doi.
org/10.1101/gr.101360.109
43. Silva, M.V.M., de Holanda, M.T., Haeusler, E.H., de Armas, E.M., Lifschitz, S.:
VelvetH-DB: Persistência de Dados no Processo de Montagem de Fragmentos
de Sequências Biológicas. In: Proceedings Satellite Events of the 32nd Brazilian
Symposium on Databases. Databases Meet Bioinformatics Workshop, pp. 334–341
(2017)
44. Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using
the FM-index. Bioinformatics (Oxford, England) 26(12), i367–i373 (2010). https://
doi.org/10.1093/bioinformatics/btq217
45. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS:
a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123
(2009). https://doi.org/10.1101/gr.089532.108
46. Titus Brown, C., Howe, A., Zhang, Q., Pyrkosz, A.B., Brom, T.H.: A reference-
free algorithm for computational normalization of shotgun sequencing data. arXiv
e-prints arXiv:1203.4802 (2012)
47. Ye, C., Sam Ma, Z., Cannon, C., Pop, M., Yu, D.: Exploiting sparseness in de novo
genome assembly. BMC Bioinform. 13(Suppl. 6), S1 (2012). https://doi.org/10.
1186/1471-2105-13-S6-S1
48. Zerbino, D.: Velvet software. EMBL-EBI. https://www.ebi.ac.uk/zerbino/velvet/
(2016). Accessed 15 June 2019
49. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using
de Bruijn graphs. Genome Res. 18(5), 821–829 (2008). https://doi.org/10.1101/
gr.074492.107
50. Zhang, Q., Pell, J., Canino-Koning, R., Howe, A.C., Brown, C.T.: These are not the
K-mers you are looking for: efficient online K-mer Counting Using a Probabilistic
Data Structure. PLoS ONE 9(7), 1–13 (2014). https://doi.org/10.1371/journal.
pone.0101271
Redundancy Treatment of NGS Contigs in Microbial Genome Finishing with Hashing-Based Approach
M. Braga et al.
1 Introduction
DNA sequencing is routinely used in various fields of biology. When Whole
Genome Sequencing is performed, the DNA is fragmented and the nucleotides are
sequenced. High-Throughput Sequencing Technology, also known as Next Gen-
eration Sequencing (NGS), allowed the parallelization of the sequencing process,
generating much more data than previous methods [1]. From the data generated by NGS technologies, several new applications have emerged. Many of these analyses begin with the computational process of sequence assembly [2]. NGS sequence assembly consists of grouping the set of sequences generated in sequencing into longer contiguous sequences, called contigs. These contigs are joined together to form even larger sequences, known as scaffolds [3].
There are two general approaches to assembling NGS fragments: reference-based and de novo. In the first approach, a reference genome of a related species is used as a guide to align the reads. De novo assembly is based only on the overlaps between reads to generate contigs [4]. These, in turn, may contain gaps (regions not represented in the assembly).
New hybrid strategies have been developed to take advantage of each type of assembly [5,6]. For example, hybrid strategies can combine reads and assemblies from different sequencing technologies and different assembly algorithms, or use assemblies generated by different assemblers, combining the results (contigs and/or scaffolds) produced by those tools to produce a new sequence [7].
Repetitive regions create ambiguous junctions in the assembly graph. When the assembler is more conservative, it will break the assembly at these junction points, leading to a precise but fragmented assembly with very small contigs.
The central problem with repetitions is that the assembler is unable to distinguish the copies from each other, which means that the regions flanking these repetitions can be mistakenly assembled. It is common for an assembler to create a chimera by joining two regions that are not close in the genome; in the end, the reads align with the wrongly assembled genome [8].
A combination of strategies exists for solving repetitive DNA problems, including the use of varied fragment libraries [12], post-processing software to detect wrong assemblies [11], and the analysis of coverage statistics to detect and resolve entanglements in DBGs.
NGS technologies remain unable to generate a single sequence per chromosome; instead, they produce a large and redundant set of reads, each read being a fragment of the complete genome [13].
Assembly algorithms explore graphs through heuristics, selecting and traversing paths and generating sequences as output. The resulting set of contigs is rarely satisfactory and usually needs to be post-processed, most of the time to discard very short contigs or contigs contained within other contigs. This procedure is performed to reduce redundancy [14].
Specific errors inherent to sequencing platforms also affect the quality of the generated assembly. On the 454 and Ion Torrent platforms, for example, errors in identifying homopolymer sequences may affect the construction of contigs in the de Bruijn graph, since k-mers derived from these regions may not agree, resulting in greater fragmentation of the assembly. Thus, it is important to consider using different assembly tools and to give preference to those with greater tolerance to the errors observed on the platform of interest.
The point is that when using a hybrid assembly approach, the problem of redundant contigs persists and even increases. When assembling with different assemblers using different methods, the contigs resulting from these assemblies are merged, generating a much larger number of contigs. These contigs generally represent different regions of the genome and lead to different gaps. The large number of contigs generated by hybrid assembly approaches requires considerable computational and human resources for analysis, especially for identifying assembly errors and eliminating bases near contig edges with a higher probability of error, which prevent overlap extension due to mismatch errors [15].
A large number of computational methods can identify redundancies in text sequences, such as biological sequences. String matching, also called pattern matching or sequence matching, is a classic computational problem. Algorithms of this nature are used to find matches between an input pattern string and a target string. These methods can locate all occurrences of a character pattern within another sequence [16].
Similarity Search. One form of searching for similarities is Nearest Neighbor Search, which consists of locating the data items whose distances to a query item are the smallest in a large dataset [19]. Similarity search has become a primary computational task in many areas, including pattern recognition, data mining, biological databases, multimedia information retrieval, machine learning, data compression, computer vision, and statistical data analysis. The concept of exact string matching rarely has meaning in these environments, while concepts such as proximity and distance (similarity/dissimilarity) are much more useful for this type of search [20].
Several methods have been developed to solve this kind of problem, and many efforts have been devoted to approximate search. Hashing techniques have been widely studied, especially Locality Sensitive Hashing, as a useful concept for these categories of applications.
In general terms, hashing is an approach that turns a data item into a small numeric representation or, equivalently, a short code consisting of a sequence of bits [21]. Nearest-neighbor search with hashing can be done in two ways: by indexing data items through hash tables that store items with the same code in the same hash bucket, or by approximating the distance between items using distances computed over the short codes [22].
The hashing approach to approximate search aims to map query items to target items so that the approximate nearest-neighbor search can be performed efficiently and accurately using the target items and possibly a small subset of the raw query items. Target items are called hash codes (also known as hash values) [23].
Locality Sensitive Hashing (LSH). The term Locality Sensitive Hashing was introduced in 1998 [24] to designate a randomized structure able to efficiently search for nearest neighbors in large spaces. It is based on the definition of LSH functions, a family of hash functions that map similar input items to the same hash code with much higher probability than dissimilar items. The first specific LSH function, MinHash, was invented by Broder [25] for detecting and grouping nearly duplicate web pages and is still one of the most studied and applied LSH methods [20].
In conventional hashing, close items, which are similar, can be mapped or scattered to different positions after hashing; in LSH, similar items remain close even after hashing.
In LSH, we call candidate pairs those item pairs that have been mapped to the same bucket. When the banding technique is applied, comparisons are performed only on candidate pairs rather than on all pairs, as in a linear search. If the goal is to find exact matches, data-processing techniques such as MapReduce [26] or Twitter Storm [27] can be used. These techniques are based on parallelism, resulting in reduced time; however, they require additional hardware. In most cases, only the most similar pairs are desired. The level of similarity is defined by some threshold, and the desired result is what is known as nearest-neighbor search; in these cases, the LSH model is the best option. To apply LSH in different applications, it needs to be developed according to the application domain [28].
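A compact sketch of MinHash signatures plus banding, in the spirit of the formulation in [39], is shown below; the shingle width, the numbers of hashes and bands, and the use of Python's built-in hash are arbitrary choices for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations

random.seed(1)
N_HASH, BANDS = 12, 4                  # 12 hash functions, 4 bands of 3 rows
ROWS = N_HASH // BANDS
SALTS = [random.getrandbits(32) for _ in range(N_HASH)]

def shingles(s, w=3):
    return {s[i:i + w] for i in range(len(s) - w + 1)}

def minhash(s):
    """Signature: for each salted hash, the minimum value over all shingles."""
    sh = shingles(s)
    return [min(hash((salt, g)) for g in sh) for salt in SALTS]

def candidate_pairs(items):
    """Banding: items whose signatures agree on a whole band share a bucket
    and become candidate pairs; all other pairs are never compared."""
    buckets = defaultdict(list)
    for name, seq in items.items():
        sig = minhash(seq)
        for b in range(BANDS):
            buckets[(b, tuple(sig[b * ROWS:(b + 1) * ROWS]))].append(name)
    return {p for names in buckets.values() if len(names) > 1
            for p in combinations(sorted(names), 2)}

contigs = {"c1": "ACGTACGTTT", "c2": "ACGTACGTTA", "c3": "GGGGCCCCAA"}
print(candidate_pairs(contigs))        # likely {('c1', 'c2')}
```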
Another important consideration is that false negatives and false positives should be avoided as far as possible. A false negative occurs when highly similar pairs are not mapped to the same bucket. A false positive happens when dissimilar pairs are mapped to the same bucket [29].
Bloom Filter. The Bloom filter is a probabilistic data structure that uses multiple hash functions to store data in a large array of bits. It was introduced in 1970 by Burton H. Bloom and is used in applications that perform membership queries on a large set of elements [30].
A Bloom filter (BF) is, therefore, a simple data structure that uses bit arrays to represent a set and determine whether or not an element is present in it. False positives are allowed: the filter can only state, with high probability, that an element is in the set. On the other hand, false negatives are not possible; that is, it is possible to know exactly when an element does not belong to the set. The space savings achieved can often outweigh the disadvantage of false positives if the probability of error is controlled [31].
A BF uses hash functions to map elements into a bit array, the filter. Membership is tested by hashing the query element and comparing the resulting positions with the bits set in the array: an element is considered part of the set only if every hash function maps it to a position whose bit is set [32].
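A minimal Bloom filter along these lines is sketched below; the bit-array size and the derivation of multiple hash functions from MD5 are illustrative choices, not those of any specific tool.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions set/check bits in a bit array.
    No false negatives; the false-positive rate shrinks as nbits grows."""
    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8 + 1)

    def _positions(self, item):
        for i in range(self.nhashes):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.nbits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True only if every hash maps to a set bit (possibly a false positive)
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
bf.add("ACGT")
print("ACGT" in bf, "TTTT" in bf)  # True, almost certainly False
```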
Repetitions in the genome, combined with short read lengths, are known to be one of the most common reasons for fragmentation of a consensus sequence, as reads that come from different copies of a repetitive region of the genome end up not being properly identified and assembled into the same contig, producing so-called repetitive contigs. This problem is already addressed with a set of strategies such as using varied fragment libraries, post-processing software to detect assembly errors, and coverage statistics analysis to detect and resolve entanglements in DBGs; however, none of these mechanisms completely solves the problem. On the other hand, sequencing errors also affect the assembly, either by generating further fragmentation or by producing abrupt ends in graph paths and ultimately in contigs. As a result, hybrid assembly approaches are often used to try to achieve greater error tolerance. In these approaches, however, a new problem arises: the production of an even larger amount of redundant contigs.
In this paper, we present a computational method for detecting and eliminating redundant contigs from microbial assemblies based on the combination of a Bloom filter and LSH, minimizing the computational effort in the genome finishing step.
The GAGE-B dataset [33] is used to evaluate large-scale genomic assembly algorithms. GAGE-B data originates from the genome sequencing of eight bacteria, with sizes between 2.9 and 5.4 Mb and GC content between 33% and 69%. The Illumina sequencing data is publicly available. For some genomes, both HiSeq and MiSeq data were available, resulting in 12 datasets (Table 1). All GAGE-B datasets underwent preprocessing steps such as adapter removal and q10 quality trimming using the Ea-utils package [34].
The de novo assembly of the GAGE-B datasets was performed with the SPAdes [35] and Fermi [36] assemblers on an AMD Opteron(TM) Processor 6376 computer with 64 CPUs and 1 TB of RAM, running CentOS release 5.10 (Linux kernel 2.6.18-371.6.1.el5).
Prior to assembling with SPAdes, the KmerGenie software [37] was used to estimate the best k-mer length for each assembly. To predict the best k value, the raw reads of each sample were used: first, we merged the forward and reverse reads into a single file; then this file was submitted to KmerGenie. For two samples, the assembly was performed with k values chosen by SPAdes itself.
After the assemblies, the contigs generated by both assemblers were submitted to the QUAST software [38] to measure assembly performance. In addition to the number of contigs and the N50 value, the largest contig, completeness (genome fraction, %), misassemblies, and the number of genes (complete and partial) were also computed.
The proposed pipeline first applies a Bloom filter with MD5, CRC32, and SHA-1 hash functions to perform exact string matching. Duplicate contigs are deleted, and a new file is generated with the unique sequences only. Then, an LSH approach performs approximate string matching to identify similar contigs. The LSH function chosen was MinHash, which behaves well with the Jaccard distance. The LSH method was implemented as described in [39]. The contigs generated in the GAGE-B assemblies were separated into different files in FASTA format and submitted to the pipeline for detection and removal of redundant copies.
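The two-stage idea, exact deduplication by hashing followed by similarity detection over the Jaccard measure, can be sketched as follows. This is a schematic reading of the pipeline described above, not the authors' BFLSH implementation: it uses the three digests named in the text as an exact-match key, and a plain shingle-based Jaccard computation in place of the MinHash/LSH stage.

```python
import hashlib
import zlib

def digest_key(seq):
    """Stage 1 key built from the three digests named above; identical
    contigs produce identical keys."""
    b = seq.encode()
    return (hashlib.md5(b).hexdigest(),
            hashlib.sha1(b).hexdigest(),
            zlib.crc32(b))

def dedup_exact(contigs):
    """Drop exact duplicates, keeping the first occurrence of each contig."""
    seen, unique = set(), []
    for c in contigs:
        k = digest_key(c)
        if k not in seen:
            seen.add(k)
            unique.append(c)
    return unique

def jaccard(a, b, w=4):
    """Stage 2 criterion: Jaccard similarity over w-shingles (threshold 75%)."""
    sa = {a[i:i + w] for i in range(len(a) - w + 1)}
    sb = {b[i:i + w] for i in range(len(b) - w + 1)}
    return len(sa & sb) / len(sa | sb)

contigs = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT"]
unique = dedup_exact(contigs)                       # removes the exact copy
print([(a, b) for i, a in enumerate(unique)
       for b in unique[i + 1:] if jaccard(a, b) >= 0.75])  # near-duplicates
```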
To compare and measure the performance of the proposed model, three other methods for reducing redundancy were implemented or used. The first, not yet published, is based on HS-BLASTN [40], which uses the Burrows-Wheeler transform as a basis for sequence alignment. The second was Simplifier [15], and finally CD-HIT [41] was applied.
All programs were run on an AMD Opteron(TM) Processor 6376 computer with 64 CPUs and 1 TB of RAM, running CentOS release 5.10 (Linux kernel 2.6.18-371.6.1.el5). The similarity percentage used as the reference to consider two contigs similar was 75%.
For the data generated by SPAdes, the most efficient method for decreasing contig redundancy was CD-HIT, which obtained the largest reduction for 10 of the 12 datasets (Table 3); however, BFLSH achieved the second lowest number of contigs for nine organisms (Table 3). For R. sphaeroides (MiSeq), BFLSH could not detect any contig redundancy in the SPAdes data, while the other methods found and considerably reduced the number of contigs. When BFLSH was applied to the contigs generated with Fermi, there was a significant elimination of redundancy (53.75%), compatible with the results obtained by the other compared methods (HS-BLASTN, Simplifier, and CD-HIT).
It can be inferred that, for this organism and in the specific case of assembly with SPAdes, the proposed hashing-based technique failed to properly map the actual similarities between the contigs into the Bloom filter; that is, it was not able to efficiently translate the proximity between the elements of the set of contigs using the Jaccard similarity employed in the LSH method.
Another hypothesis would be the use of inappropriate parameters for this isolated case. Variation in the Bloom filter's parameters, such as the number of hash functions, the size of the BF, or even the amount of difference allowed when considering two items as similar, can affect the output. LSH methods may be more or less stringent depending on the similarity parameter used; in this work, a value of 75% was used as the default for all experiments. One last possibility can still be raised: the difference between the assembly approaches used by SPAdes and Fermi. Since SPAdes is based on de Bruijn graphs, its internal way of generating contigs uses k-mers, while Fermi, being from the OLC family of assemblers, does not. This factor may have in some way affected the nature of the contigs generated by SPAdes for this organism, which, combined with the other factors, may have led to the unexpected result in this case.
The results of the comparison between BFLSH and the other methods for reducing Fermi-generated contigs are shown in Table 4.
In the comparison performed for the data generated by Fermi, the most efficient method for reducing contig redundancy was again CD-HIT, which obtained the largest reduction for all 12 datasets. However, BFLSH achieved the second lowest number of contigs for all twelve datasets in absolute terms.
5 Conclusion
A computational application has been developed that is capable of efficiently detecting and eliminating redundant NGS contigs generated by de novo assemblers for microbial genomes. The problem of redundant contigs generated in NGS data assembly has been minimized with BFLSH.
For the efficiency evaluation, BFLSH was compared with three other distinct methods and proved to be efficient. Bloom filter and LSH techniques can effectively be used to find similar items in biological sequences. The probabilistic nature of the hash functions makes false negatives possible, but these can be minimized with proper techniques.
One direction to consider in the future is to adopt a parallelization approach, which can significantly improve the time and memory efficiency, especially of the LSH step.
References
1. Ambardar, S., Gupta, R., Trakroo, D., Lal, R., Vakhlu, J.: High throughput
sequencing: an overview of sequencing chemistry. Indian J. Microbiol. 56(4), 394–
404 (2016)
2. Nagarajan, N., Pop, M.: Sequence assembly demystified. Nat. Rev. Genet. 14(3),
157–167 (2013)
3. El-Metwally, S., Hamza, T., Zakaria, M., Helmy, M.: Next-generation sequence
assembly: four stages of data processing and computational challenges. PLoS Com-
put. Biol. 9(12), e1003345 (2013)
4. Martin, J.A., Wang, Z.: Next-generation transcriptome assembly. Nat. Rev. Genet.
12(10), 671–682 (2011)
5. Goswami, M., et al.: Distance sensitive bloom filters without false negatives. In:
Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete
Algorithms, SODA ’17, USA, 2017, pp. 257–269. Society for Industrial and Applied
Mathematics (2017)
6. Tang, L., Li, M., Fang-Xiang, W., Pan, Y., Wang, J.: MAC: Merging assemblies by
using adjacency algebraic model and classification. Front. Genet. 10, 1396 (2020)
7. de Sousa Paz, H.E.: reSHAPE: montagem híbrida de genomas com foco em organismos bacterianos combinando ferramentas de novo. Dissertação (2018)
8. Treangen, T.J., Salzberg, S.L.: Repetitive DNA and next-generation sequencing:
computational challenges and solutions. Nat. Rev. Genet. 13(1), 36–46 (2011)
9. Batzer, M.A., Deininger, P.L.: Alu repeats and human genomic diversity. Nat. Rev.
Genet. 3(5), 370–379 (2002)
10. Zavodna, M., Bagshaw, A., Brauning, R., Gemmell, N.J.: The accuracy, feasibility
and challenges of sequencing short tandem repeats using next-generation sequenc-
ing platforms. PLoS ONE 9(12), e113862 (2014)
11. Phillippy, A.M., Schatz, M.C., Pop, M.: Genome assembly forensics: finding the
elusive mis-assembly. Genome Biol. 9(3), R55 (2008)
12. Wetzel, J., Kingsford, C., Pop, M.: Assessing the benefits of using mate-pairs
to resolve repeats in de novo short-read prokaryotic assemblies. BMC Bioinform.
12(1), 95 (2011)
13. Bradnam, K.R., et al.: Assemblathon 2: evaluating de novo methods of genome
assembly in three vertebrate species. GigaScience 2(1), 2047–217X (2013)
14. Nagarajan, N., Pop, M.: Parametric complexity of sequence assembly: theory and
applications to next generation sequencing. J. Comput. Biol. 16(7), 897–908 (2009)
15. Ramos, R.T.J., Carneiro, A.R., Azevedo, V., Schneider, M.P., Barh, D., Silva, A.:
Simplifier: a web tool to eliminate redundant NGS contigs. Bioinformation 8(20),
996–999 (2012)
16. Galil, Z., Giancarlo, R.: Data structures and algorithms for approximate string
matching. J. Complex. 4(1), 33–72 (1988)
17. Pandiselvam, P., Marimuthu, T., Lawrance, R.: A comparative study on string
matching algorithm of biological sequences (2014)
18. Al-Khamaiseh, K., ALShagarin, S.: A survey of string matching algorithms. Int.
J. Eng. Res. Appl. 4, 144–156 (2014)
19. Wang, J., Shen, H.T., Song, J., Ji, J.: Hashing for similarity search: a survey (2014)
20. Chauhan, S.S., Batra, S.: Finding similar items using lsh and bloom filter. In:
2014 IEEE International Conference on Advanced Communications, Control and
Computing Technologies, pp. 1662–1666 (2014)
21. Bender, M.A., et al.: Don’t thrash: how to cache your hash on flash. Proc. VLDB
Endow. 5, 1627–1637 (2012)
22. Slaney, M., Casey, M.: Locality-sensitive hashing for finding nearest neighbors [lec-
ture notes]. Signal Process. Mag. IEEE 25, 128–131 (2008)
23. Baluja, S., Covell, M.: Learning to hash: forgiving hash functions and applications.
Data Min. Knowl. Disc. 17(3), 402–430 (2008)
24. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the
curse of dimensionality. In: Conference Proceedings of the Annual ACM Sympo-
sium on Theory of Computing, pp. 604–613 (2000)
25. Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings
of the Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171),
pp. 21–29 (1997)
26. Jain, R., Rawat, M., Jain, S.: Data optimization techniques using bloom filter in
big data. Int. J. Comput. Appl. 142, 23–27 (2016)
27. Stephens, Z.D., et al.: Big data: astronomical or genomical? PLoS Biol. 13(7),
e1002195 (2015)
28. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest
neighbor in high dimensions. Commun. ACM 51(1), 117–122 (2008)
29. Ding, K., Huo, C., Fan, B., Xiang, S., Pan, C.: In defense of locality-sensitive
hashing. IEEE Trans. Neural Netw. Learn. Syst. 29(1), 87–103 (2018)
30. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun.
ACM 13(7), 422–426 (1970)
31. Broder, A., Mitzenmacher, M.: Network applications of Bloom filters: a survey. Internet Math. 1, 11 (2003)
32. Naor, M., Yogev, E.: Tight bounds for sliding bloom filters. Algorithmica 73(4),
652–672 (2015)
33. Magoc, T., et al.: GAGE-B: an evaluation of genome assemblers for bacterial organ-
isms. Bioinformatics 29(14), 1718–1725 (2013)
34. Aronesty, E.: Comparison of sequencing utility programs. Open Bioinform. J. 7(1),
1–8 (2013)
35. Bankevich, A., et al.: SPAdes: a new genome assembly algorithm and its applica-
tions to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012)
36. Li, H.: Exploring single-sample SNP and INDEL calling with whole-genome de
novo assembly. Bioinformatics (Oxford, England) 28(14), 1838–1844 (2012)
37. Chikhi, R., Medvedev, P.: Informed and automated k-mer size selection for genome
assembly. Bioinformatics 30(1), 31–37 (2013)
38. Gurevich, A., Saveliev, V., Vyahhi, N., Tesler, G.: QUAST: quality assessment tool
for genome assemblies. Bioinformatics 29(8), 1072–1075 (2013)
39. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge
University Press, Cambridge (2014)
40. Chen, Y., Ye, W., Zhang, Y., Yuesheng, X.: High speed BLASTN: an accelerated
MegaBLAST search tool. Nucleic Acids Res. 43(16), 7762–7768 (2015)
41. Li, W., Godzik, A.: CD-HIT: a fast program for clustering and comparing large
sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)
Efficient Out-of-Core Contig Generation
J. O. P. Entenza et al.
1 Introduction
Genome sequencing is the process that determines the order of nucleotides within a DNA molecule. Modern instruments split a genome into a set of many short sequences (reads) that are assembled into longer contiguous sequences, contigs, followed by the process of correctly ordering contigs into scaffolds [18].
We may associate genome assembly with the problem of finding a Hamiltonian cycle, through the Overlap-Layout-Consensus (OLC) assembly method. Alternatively, it can be modeled as the problem of finding an Eulerian cycle in methods based on the de Bruijn graph (dBG) [14]. The latter can be seen as a breakthrough for research on genome assembly, since finding a Hamiltonian cycle is an NP-complete problem [14].
When we handle actual dBGs, we must consider the existence of errors that appear due to the high error frequencies of Next-Generation Sequencing (NGS) platforms. These errors cause the dBG to be larger than the overlap graph used by the OLC method on the same reads. To remove the errors, we need some data structure representation of the dBG. Current real-world datasets pose challenging problems, as they have already reached high volumes and will continue to grow as sequencing technologies improve [19]. As a consequence, the dBG increases in complexity, leading to tangles and topological structures that are difficult to resolve [16]. Moreover, the graph has a high memory footprint for large organisms (e.g., sugarcane plants), and it becomes worse as genome datasets grow. Therefore, several research works focus on dealing with the ever-growing graph sizes
for dBG-based genome assembly [3,15]. Any proposed solution must be aware of the fact that the dBG is not entirely built before error pruning. After the splitting of the genome in the first phase, and for a natural number k chosen based on an empirical criterion, each read of size m yields m − k + 1 possible k-mers, which correspond to the nodes of the dBG. The number of k-mers depends on the adjacency determined by the read itself. A splitting error introduced in a given read may greatly increase the size of the subgraph it induces.
Roughly speaking, the dBG is a set of k-mers (subsequences of length k) [23] linked to each other according to information provided by their reads. Some k-mers can come from different reads, and the adjacency information supplied by any other read is processed only at the dBG construction phase. Thus, error pruning should happen at this particular stage. The memory needed is so significant that we must use external memory to accomplish the dBG construction.
The different approaches that deal with the dBG size aim to design lightweight data structures that reduce the memory requirements and fit the assembly graph into main memory. Although this might be efficient, the amount of memory needed increases with the size of the dataset and the DNA of the organism. While bacterial genomes currently take only a few gigabytes of RAM, species like mammals and plants require tens to hundreds of gigabytes. For instance, approximately 0.5 TB of memory is required to build a hash table of 75-mers for the human genome [19].
We propose algorithms to simplify and remove errors in the de Bruijn graph using external memory. As a result, it becomes possible to generate contigs using a fixed amount of RAM, independently of the read dataset size. There are other works addressing de Bruijn graph processing using external memory [8,11], but they focus only on constructing large de Bruijn graphs efficiently, with no error-pruning considerations. To the best of our knowledge, this is the first proposal of an external-memory approach focusing on dBG simplification and error removal for contig generation during dBG construction.
We show an algorithm that provides out-of-core contraction of unambiguous paths with an I/O cost of O(|E|/B), where E is the set of edges of the dBG and B is the size of the partition loaded into RAM each time. With the overhead of creating the new partitions, the overall I/O complexity is O((sort(|E|) + |E|/B) log Path), where Path is the length of the longest unambiguous path in the dBG. For a machine with memory M, and a dBG satisfying |E| < M^2/(4B), sort(|E|) is performed with I/O complexity O(|E|/B) [9]. Summing up, the I/O complexity of this out-of-core contraction of paths is O((|E|/B) log Path). The out-of-core graph cleaning phase, which removes tips and bubbles, is performed with a similar I/O complexity, O((|E|/B + sort(|E|)) log Path). The creation of contigs is performed by a full scan of the graph with I/O complexity O(|V|/B).
In the dBG, two nodes are linked when their k-mers overlap by k − 1 characters. The assembly aims to construct a set of contigs from the dBG G. Given a dBG as input, generating contigs is equivalent to outputting all contiguous sequences that represent unambiguous paths in the graph.
The use of the dBG to generate contigs follows a pipeline: node enumeration, compaction, and graph cleaning. In the first step, a set of distinct k-length substrings (k-mers) is extracted from the reads. Each k-mer becomes a graph node. Next, all paths in which all but the first vertex have in-degree 1 and all but the last vertex have out-degree 1 (unitigs) are compacted into a single vertex. Finally, the last step removes topological issues from the graph due to sequencing errors and polymorphism [5].
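The compaction step can be sketched in memory as below; this simplified version assumes an acyclic graph given as a successor map and ignores reverse complements, so it only illustrates the unitig rule stated above.

```python
from collections import defaultdict

def compact_unitigs(edges):
    """Merge maximal non-branching paths (unitigs) of a dBG into strings.
    `edges` maps each k-mer to a list of its successor k-mers."""
    indeg = defaultdict(int)
    for u, vs in edges.items():
        for v in vs:
            indeg[v] += 1
    unitigs = []
    # A unitig starts where the non-branching property breaks.
    starts = [u for u in edges if indeg[u] != 1 or len(edges.get(u, [])) != 1]
    for u in starts:
        for v in edges.get(u, []):
            path = u
            while indeg[v] == 1 and len(edges.get(v, [])) == 1:
                path += v[-1]            # extend by the one new character
                v = edges[v][0]
            path += v[-1]
            unitigs.append(path)
    return unitigs

g = {"ACG": ["CGT"], "CGT": ["GTA"], "GTA": ["TAC"]}
print(compact_unitigs(g))  # ['ACGTAC']
```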
The number of nodes in the graph can be huge. For instance, the genome of the white spruce has a size of 20 Gbp, generating 10.7 × 10^9 k-mers (with k = 31) and needing 4.3 TB of memory [5]. Also, the whole-genome assembly of the 22 Gbp (bp, base pairs) loblolly pine generates 13 × 10^9 k-mers and requires 800 GB of memory [5]. Theoretically speaking, a 1,000 Genomes dataset with 200 terabytes of data can generate about 2^47 or more nodes, 64–128 times larger than the problem size of the top result in the Graph 500 list [15].
Next-generation sequencing platforms do not provide error-free read data from the genome sequences. Hence, the produced data is distorted by high frequencies of sequencing errors and genomic repeats [18]. Sequencing errors compound this problem because each such error corrupts the correct genomic sequence into up to k erroneous k-mers. These erroneous k-mers introduce new vertices and edges to the graph, significantly expanding its size and creating topological artifacts such as tips and bubbles [23].
Different solutions have been proposed to address the memory issues in the genome assembly problem. One approach samples the entire k-mer set and performs the assembly process over the selected k-mers [21]. Another approach encodes the dBG into efficient data structures such as lightweight hash tables [3], succinct data structures [2], or Bloom filters [6,17]. There are also research works based on distributed-memory systems to supply the demanded processing power and memory [3,7,15].
Despite their apparent differences, all of these approaches are based exclusively on in-memory systems. Consequently, if the size of the graph exceeds the amount of memory available, it becomes necessary to increase the RAM. In the near future, the size of datasets will increase dramatically, and this situation will stress these systems [20]; increasing the amount of RAM does not guarantee that the graph will always fit. There is a need for new approaches that process all of this massive amount of information in a scalable way. We propose in this work to use an external-memory approach to process the dBG.
Our basic pipeline of de novo genome assembly can be divided into five basic operations [23]: 1) dBG construction, which builds a dBG from the DNA reads; 2) contraction of unambiguous paths, which merges unambiguous vertices into unitigs; 3) graph cleaning, which filters and removes errors such as tips and bubbles; 4) contig creation, which creates a first draft of the contigs; and 5) scaffolding, which joins the previous contigs together to form longer contigs. In this work, we address steps 2, 3, and 4 using an external-memory approach. We assume a de Bruijn graph exists and is persisted in edge-list format.
Fig. 1. Graph Contraction. Dashed arcs represent the messages, and the label
between brackets indicates whether a vertex is a tail (t) or a head (h). a) A coin flip
chooses whether each node will be t or h. If a tail vertex has out-degree = 1, it sends a
message to its neighbour. b) If an h vertex receives a message and its in-degree = 1,
both vertices are merged. c) The result after some repeated steps.
Our compaction strategy has two steps: 1) identify the nodes that act as head/tail
of each unitig path; and 2) apply the graph contraction technique over these paths
until all nodes are maximal unitigs.
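The coin-flipping contraction of Fig. 1 can be sketched in a few lines of Python. This in-memory toy version is ours, for illustration only; the actual algorithm runs out-of-core over disk partitions:

import random

def contraction_superstep(succ):
    # succ: vertex -> unique successor (None at the path's end).
    coin = {v: random.choice("th") for v in succ}
    merged = dict(succ)
    for v, nxt in succ.items():
        # A tail with out-degree 1 sends a message to its successor; if that
        # successor is a head with in-degree 1, the two vertices merge.
        if coin[v] == "t" and nxt is not None and coin[nxt] == "h":
            merged[v] = succ[nxt]   # v absorbs nxt and points past it
            merged.pop(nxt, None)
    return merged

path = {"a": "b", "b": "c", "c": "d", "d": None}
while len(path) > 1:                 # expected O(log Path) supersteps
    path = contraction_superstep(path)
print(path)                          # a single maximal unitig remains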
In graph cleaning, we aim to remove short dead-end divergences, called tips,
from the main path. One strategy consists of testing each branching node for all
possible path extensions up to a specified minimum length. If the length of the
path is less than a certain threshold (set by the user), the nodes belonging
to this path are removed from the graph [7,23].
The tip-removal process is analogous to traversing the paths from a branching
node, u_a, to a dead-end node u_e. The graph does not fit into RAM, even
after the unitig process, so we need to traverse the dBG in an I/O-efficient way
to find and remove all tips. Our algorithm is based on external-memory list
ranking from the u_e to the u_a nodes. However, we make two significant
modifications: (i) the ranking is represented by each edge/node's coverage, which
decides which path will be removed; and (ii) as two or more dead-ends could reach
the same u_a node, we need to keep a data structure in RAM to make the backward
traversal. This way, we eliminate the selected path from the branching node.
Bubbles are paths that diverge from a node and then converge into another.
The process of fixing bubbles begins by detecting the divergence points in the
graph [23]. For each point, all paths from it are detected by tracing the graph
forward until a convergence point is reached. Some assemblers restrict the size
of the bubble to n nodes, where k ≤ n ≤ 2k [7]; others use a modified version of
Dijkstra's algorithm [23]. To simulate the different external-memory approaches,
we need to identify all branching nodes u_a. We execute an I/O-efficient breadth-
first search (BFS) from u_a until we find an already-visited node, which means
that there is a bubble at some point in the search (v_b). Then, we select the branch
that will be kept and start another BFS in the backward direction (starting from
v_b). Finally, we remove the other paths until we reach u_a again. After the
execution of steps 2 and 3, the contig creation step outputs all the contigs
represented by nodes.
Processing Out-of-Core Graphs. Many graph engines implement a vertex-
centric or “think like a vertex” (TLAV) programming interface. This paradigm
iteratively executes a user-defined program over the vertices of a graph. The vertex
program is designed from the vertex's perspective, receiving as input the data
from adjacent vertices and incident edges. Execution halts after a specified number
of iterations, called supersteps, is completed. Each vertex knows the current global
superstep but has no other knowledge of the overall graph structure beyond its
immediate neighbors and the messages exchanged along its incident edges [13].
Out-of-core graph engines proceed in supersteps. In every superstep, the engine
loads one or more partitions p, based on the available RAM. Then, it processes
the vertices and edges that belong to p and saves the partitions back to disk.
A different subset of partitions is then loaded and processed, until all of them
have been treated for the given superstep. The process is repeated for the
next superstep until there are no remaining vertices to visit (see Algorithm 1
from GraphChi [9]). If the machine has sufficient RAM to store the graph and
metadata, all partitions can be kept in RAM, and there is no disk access [9].
Because all the operations related to partitions and parallel vertex processing
are fixed, from now on we will only highlight the UpdateFunction and the number
of supersteps. For simplicity, UpdateFunction(G, u) means that we apply the
function over the vertex u in the graph G.
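A minimal Python sketch of this execution model follows; load_partition, save_partition, and update_function are illustrative placeholders for the engine's actual I/O and vertex-program routines, not GraphChi's API:

def run_supersteps(partition_ids, load_partition, save_partition,
                   update_function, num_supersteps):
    # GraphChi-style loop: every superstep streams each partition through
    # RAM, applies the vertex program, and writes the partition back.
    for superstep in range(num_supersteps):
        for pid in partition_ids:
            subgraph = load_partition(pid)         # read vertices/edges of p
            for vertex in subgraph.vertices:       # placeholder attribute
                update_function(subgraph, vertex)  # sees only neighbors/messages
            save_partition(pid, subgraph)          # persist state back to disk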
4 Contig Generation
To create the partitions, it is first necessary to divide the nodes by their ID and
then sort all the edges by their destination vertex ID [9]. Thus, the I/O cost of
creating the graph is O(sort(|E|)).
Finally, at each superstep, Algorithm 2 contracts a constant fraction of the
vertices by graph contraction [22]. It is expected to take O(log Path) iterations
with high probability, where Path is the length of the longest unambiguous path
in the graph [1]. Hence, the overall expected I/O cost for simplifying the
graph is O((sort(|E|) + |E|/B) log Path). If we assume |E| < M²/(4B), then
sort(|E|) = O(|E|/B). This condition can be satisfied with a typical value of M,
say 8 GB, B in the order of kilobytes, and a graph size smaller than a petabyte
[10]. Under these conditions, the I/O cost is O((|E|/B) log Path) (Fig. 2).
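The step from the assumption |E| < M²/(4B) to sort(|E|) = O(|E|/B) follows from the standard external-memory sorting bound; a sketch of the usual argument, which the text leaves implicit, is:

\[
\mathrm{sort}(N) \;=\; \Theta\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)
\]
\[
N < \frac{M^{2}}{4B} \;\Longrightarrow\; \frac{N}{B} < \left(\frac{M}{2B}\right)^{2} \;\Longrightarrow\; \log_{M/B}\frac{N}{B} \;<\; 2\log_{M/B}\frac{M}{2B} \;\le\; 2,
\]

so the logarithmic factor is O(1) and the sort term collapses to a linear number of I/Os, O(N/B).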
To remove tips, we design a straightforward procedure that uses a few supersteps
(Algorithm 3). At this point, all nodes are contracted; thus, all terminal nodes
are potential tips and may be removed. In the first superstep, the vertices
having an in-degree or out-degree of zero and a sequence length less than 2k are
identified as potential tips (line 10), and a message is sent to their neighbors
(line 11). In the next superstep (lines 14–17), the vertices that received the
messages search for the maximal multiplicity among all neighbors and remove
those potential tips with multiplicity less than the maximal value. Removing
the tips generates new linear paths in the graph that will be contracted.
Fig. 2. Removing tips. Dotted lines show the sent messages. a) F and I are
marked as potential tips; later, they send a message to their neighbors. b) C and D
receive the messages and check the multiplicities to eliminate the real tips, so H and
I are removed. c) The graph is compressed to obtain the final result.
Note that once the initial set of tips is removed, new tips may appear. Most
assemblers therefore execute the tip-removal algorithm a fixed number of times.
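A compact Python rendering of these two supersteps follows (our in-memory sketch of the idea behind Algorithm 3, not the out-of-core implementation; vertices are dictionaries with "in"/"out" neighbor sets, a sequence, and a multiplicity):

def remove_tips(graph, k):
    # graph: dict vertex -> {"in": set, "out": set, "seq": str, "mult": int}
    # Superstep 1: terminal nodes with short sequences are potential tips;
    # each one notifies its neighbors.
    potential = {v for v, d in graph.items()
                 if (not d["in"] or not d["out"]) and len(d["seq"]) < 2 * k}
    notified = {}
    for v in potential:
        for u in graph[v]["in"] | graph[v]["out"]:
            notified.setdefault(u, []).append(v)
    # Superstep 2: each notified vertex keeps only the potential tip with
    # maximal multiplicity among its neighbors and removes the others.
    removed = set()
    for u, tips in notified.items():
        best = max(graph[t]["mult"] for t in tips)
        removed |= {t for t in tips if graph[t]["mult"] < best}
    for t in removed:
        for u in graph[t]["in"] | graph[t]["out"]:
            graph[u]["in"].discard(t)
            graph[u]["out"].discard(t)
    for t in removed:
        del graph[t]
    return removed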
I/O Analysis. The function RemoveTips needs only two supersteps to remove
any tip in the graph: one to identify all tips and another to remove them
(lines 8–17). This can be carried out with at most two full scans over all the
graph partitions; thus, the I/O cost is O(|E|/B). Although RemoveTips only
executes two supersteps, the graph contraction dominates the I/O cost (line
7). The total I/O cost is O((|E|/B + sort(|E|)) log Path), where Path represents the
longest path created after the tips are removed.
The primary approach to identify and remove bubbles is based on the
breadth-first search (BFS) algorithm. As bubbles consist of paths with very different
multiplicities, the paths with low multiplicity are deleted, and the path
with the highest multiplicity is kept to represent the bubble. In our algorithm, each
vertex manages its own history, which makes it easy to control the different paths
and to pick the right one.
The proposed algorithm has two stages: forward and backward. The forward
stage (Algorithm 4) identifies all paths that form a bubble and selects one of
them, while the backward stage (Algorithm 5) removes the redundant paths and
compacts the graph. Due to space limitations, we illustrate and explain both
algorithms through examples: see Fig. 3 for the forward stage and Fig. 4 for the
backward stage.
Fig. 3. Forward Bubble Detection. The figures only show the id pair in the sent
messages. a) A is a potential bubble beginning, so it sends messages to its outgoing
neighbors. b) B and E keep it and forward new messages, updating the vertex IDs. c)
When H, whose in-degree ≥ 2, receives a message, it detects that E belongs to a bubble
and marks the node for the backward stage.
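The core of the forward stage can be sketched as a BFS that stops at the first vertex reached twice. This in-memory Python toy is ours for illustration; the paper's version exchanges the same information as superstep messages:

from collections import deque

def forward_bubble(graph, start):
    # BFS from a divergence point; the first vertex reached a second time
    # closes a bubble and is marked for the backward stage.
    parents = {start: None}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph[v]["out"]:
            if w in parents:      # already visited: w converges the bubble
                return w
            parents[w] = v
            queue.append(w)
    return None                   # no bubble reachable from start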
I/O Analysis. In the forward stage, SelectPath (line 4) iterates through the
bubbles a constant number of times, depending on a limit value. Also, only a
small number of messages are present in the graph, each one originating from
an ambiguous vertex whose out-degree is greater than 2. Moreover, each message
is passed along an edge exactly once, as notifications are only sent along outgoing
edges. This means that only in-edges are read and out-edges are written. As this
algorithm implies an external BFS traversal, the I/O cost is
O(BPath(|V| + |E|)/B) ≈ O(BPath · |E|/B), where BPath is the longest
length among all bubbles and |E| = O(|V|) because a dBG is a sparse graph.
Then the total I/O cost is O(limit · BPath · |E|/B) = O(BPath · |E|/B).
On the other hand, the backward stage uses another BFS, but in the opposite
direction to the forward phase, so the I/O cost is the same. Additionally,
it iterates through the set of vertices (lines 6–8) and executes a contraction
on the resulting graph. Therefore, the I/O cost of the backward stage is
O((|E|/B + sort(|E|)) log BPath).
At this point, the graph consists of contracted vertices. As each vertex
represents a contig, we can output them using a full scan over the graph.
Thus, the I/O complexity is O(|V|/B), where |V| is the number of contigs.
5 Conclusions
In this paper, we have proposed out-of-core algorithms for contig generation,
one of the most critical steps of fragment assembly methods based on de Bruijn
graphs. Besides presenting these algorithms, we have made an analytical I/O
evaluation showing that it becomes feasible to process de Bruijn graphs and
obtain the needed contigs independently of the available memory.
The I/O cost analysis shows that graph simplification is one of the most expensive
steps. This is expected, because this phase involves a larger number of vertices
and edges. To deal with that, one could choose to do the assembly without
simplifying the graph. In this case, a tip is a branch with low coverage and not
just a vertex; hence, it would be necessary to apply list ranking from the dead-end
to the branching nodes.
Among other issues, it is hard to estimate the number of executions required
by each phase. This primarily depends on the number of nodes in the graph,
which itself depends on the properties of the read datasets. As these values vary
from one dataset and sequencing technology to another, assembly algorithms
execute each step a fixed, empirically chosen number of times. As future work,
we may cite the evaluation of other graph simplification approaches targeting
erroneous and non-recognizable structures, such as X-cuts.
References
1. Anderson, R.J., Miller, G.L.: A simple randomized parallel algorithm for list-
ranking. Inf. Process. Lett. 33(5), 269–273 (1990)
2. Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn Graphs. In:
Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 225–235. Springer,
Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_18
3. Chapman, J.A., et al.: Meraculous: de novo genome assembly with short paired-end
reads. PLoS One 6(8), e23501 (2011)
4. Chiang, Y.J., et al.: External-memory graph algorithms. In: Proceedings of the
ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 139–149 (1995)
5. Chikhi, R., Limasset, A., Medvedev, P.: Compacting de Bruijn graphs from
sequencing data quickly and in low memory. Bioinformatics 32(12), i201–i208
(2016)
6. Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based
on a bloom filter. Algorithm. Mol. Biol. 8(1), 22 (2013)
7. Jackman, S.D., et al.: ABySS 2.0: resource-efficient assembly of large genomes using
a Bloom filter. Genome Res. 27(5), 768–777 (2017)
8. Kundeti, V.K., et al.: Efficient parallel and out of core algorithms for constructing
large bi-directed de Bruijn graphs. BMC Bioinf. 11(1), 560 (2010)
9. Kyrola, A., Blelloch, G., Guestrin, C.: GraphChi: large-scale graph computation on
just a PC. In: USENIX Symposium on Operating Systems Design and Implemen-
tation (OSDI), pp. 31–46 (2012)
10. Kyrola, A., Shun, J., Blelloch, G.: Beyond synchronous: new techniques for
external-memory graph connectivity and minimum spanning forest. In: Gudmunds-
son, J., Katajainen, J. (eds.) Experimental Algorithms — SEA 2014. Lecture Notes
in Computer Science, vol. 8504, pp. 123–137. Springer, Cham (2014).
https://doi.org/10.1007/978-3-319-07959-2_11
11. Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum
substring partitioning. Proc. VLDB Endow. 6(3), 169–180 (2013)
12. Malewicz, G., et al.: Pregel: a system for large-scale graph processing. In: Proceedings
of the ACM SIGMOD International Conference on Management of Data, pp. 135–146 (2010)
13. McCune, R.R., Weninger, T., Madey, G.: Thinking like a vertex: a survey of vertex-
centric frameworks for large-scale distributed graph processing. ACM Comput.
Surv. (CSUR) 48(2), 25:1–25:39 (2015)
14. Medvedev, P., Georgiou, K., Myers, G., Brudno, M.: Computability of models for
sequence assembly. In: Giancarlo, R., Hannenhalli, S. (eds.) Algorithms in Bioinfor-
matics — WABI 2007. Lecture Notes in Computer Science, pp. 289–301. Springer,
Cham (2007). https://doi.org/10.1007/978-3-540-74126-8_27
15. Meng, J., Seo, S., Balaji, P., Wei, Y., Wang, B., Feng, S.: Swap-assembler 2: opti-
mization of de novo genome assembler at extreme scale. In: Proceedings of the
45th ICPP, pp. 195–204. IEEE (2016)
16. Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation
sequencing data. Genomics 95(6), 315–327 (2010)
17. Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve
the memory usage for de Bruijn graphs. Algorithm. Mol. Biol. 9(1), 2 (2014)
18. Simpson, J.T., Pop, M.: The theory and practice of genome sequence assembly.
Ann. Rev. Genomics Hum. Genet. 16, 153–172 (2015)
19. Sohn, J., Nam, J.W.: The present and future of de novo whole-genome assembly.
Briefings Bioinf. 19(1), 23–40 (2016)
20. Stephens, Z.D., et al.: Big data: astronomical or genomical? PLoS Biol. 13(7),
e1002195 (2015)
21. Ye, C., Ma, Z.S., Cannon, C.H., Pop, M., Douglas, W.Y.: Exploiting sparseness in
de novo genome assembly. BMC (BioMed Central) Bioinf. 13, S1 (2012)
22. Zeh, N.: I/O-efficient graph algorithms. In: EEF Summer School on Massive Data
Sets (2002)
23. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using
de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)
In silico Pathogenomic Analysis
of Corynebacterium Pseudotuberculosis
Biovar Ovis
1 Introduction
Pathogenic species of the genus Corynebacterium cause diseases such as caseous
lymphadenitis (CLA), ulcerative lymphangitis and diphtheria. These bacteria are
capable of producing virulence factors (VF) and pathogenicity-related proteins,
which are responsible for causing disease [2].
C. pseudotuberculosis can be divided into two biovars, equi and ovis, both causing
serious infections. This discrimination depends on the pathogen's ability to reduce
nitrate: nitrate-reductase-positive bacteria are members of biovar equi, responsible
for causing ulcerative lymphangitis in buffaloes and horses, whereas
nitrate-reductase-negative bacteria are classified in biovar ovis, the etiological
agent of CLA, forming wounds and abscesses in internal organs of small ruminants
such as goats and sheep [3–5].
2 Methods
2.1 Pan-Genomic Analysis of C. Pseudotuberculosis Biovar Ovis
Complete genomes of 33 strains of biovar ovis were downloaded and standardized
into nucleotide, amino-acid and function files (.nuc, .pep and .function). The
strains were: 1002, 1002B, 12C, 226, 267, 29156, 3995, 4202, C231, E55, E56,
FRC41, I19, MEX1, MEX25, MEX29, MIC6, N1, P54B96, PA01, PA02, PA04,
PA05, PA06, PA07, PA08, PAT10, PO22241, AN902, T1, VD57 and ft_2193-67.
The software PGAP [11] generated orthologous gene clusters among the strains.
The Gene Family method was used with a minimum blastall score of 40, 80%
identity, 90% coverage and an e-value of 0.00001.
The search for phage DNA in the genome of C. pseudotuberculosis 12C was
performed with the web tool PHAST (PHAge Search Tool), which contains an
extensive set of genes identified in prophages. The input genome must be provided
as a fasta file; the tool generates annotated homologous hits through BLAST
against different databases and a graphic visualization of the prophage genetic
content [22].
The pathogenomic analysis of the genus revealed 4 major classes of VF, comprising
12 VF types. The complete analysis identified 36 genes associated with VF in
Corynebacterium. Regarding their conservation, we observed different results for
the different strains analyzed.
The first class of VF groups the adherence proteins, which are divided into
collagen-binding proteins, SpaA-type pili, SpaD-type pili, SpaH-type pili and
surface-anchored pilus proteins. In our dataset, however, collagen-binding proteins
were not conserved in any of the strains [23].
Pili are filamentary cellular structures that provide adhesive and invasive functions
to the pathogen after infection [24]. In C. diphtheriae, pili are formed by the
sortase machinery, which is also present in other pathogenic genera such as
Streptococcus, Actinomyces and Clostridium. The three sortase-mediated pilus
assembly structures SpaA, SpaD and SpaH have similar composition and are
essential for the attachment to the host's tissues after infection. The SpaA pilus
is responsible for the adherence of the bacteria to human pharyngeal epithelial
cells, while SpaB and SpaC are responsible for binding to pharyngeal cells [25].
Of the surface-anchored pilus proteins SapA, SapD and SapE, initially characterized
as VF in the C. jeikeium K411 genome [26], only SapA was conserved in the
genomes of biovar ovis and C. ulcerans FRC11. Some adherence-related VF in
Corynebacterium are present only in C. diphtheriae, such as spaA, spaB, spaE,
spaF, spaG, spaH, srtD, srtE, sapD and sapE. It is plausible that some of these
adherence genes have been horizontally transferred to C. pseudotuberculosis [15].
The genes srtA, spaD, srtB, srtC and sapA are conserved in most strains, while
spaC is shared among nearly half of them (Table 1).
Fig. 1. Synteny analysis of main C. pseudotuberculosis VF in the genomes of strains 12C, FRC41
and 1002. Strains are compared regarding the conservation and position of important
VF clusters in their genomes.
In PAI 1, some genes stand out, such as those inserted in the fagABCD operon,
followed by the toxin gene pld. In PAI 3, some genes are related to manganese
uptake and ABC transport, as mentioned. In PAI 4, many genes are associated
with iron ABC transport, such as the ciuABCDE cluster, experimentally shown
to have a virulence function [9]. In PAI 5, a conserved genomic region stands out,
highlighting the virulence function of the PTS sugar transporter pfoS, responsible
for the regulation of toxins in Clostridium perfringens. Additionally, the proline
iminopeptidase protein was found, which is essential for the virulence of
Xanthomonas campestris and conserved in C. pseudotuberculosis [9]. In PAI
7, an adhesin-related protein is found, which may relate to the role of adherence
in the pathogenic process, since it is essential for the invasion of the bacteria into
host cells.
Further, in PAI 8 we noted two adherence-related molecules: the collagen-binding
surface protein and the sortase srtA1. In PAI 9, the iron dicitrate transport
proteins phuC and fecD are located, as well as the sortase srtA2, which conserves
its virulence function [23].
The sequences inserted in the PAIs were submitted to the MP3 tool to predict
their probable pathogenic roles. Pathogenicity potentials were calculated for all
coding sequences of the complete genome of C. pseudotuberculosis 12C, composed
of 2,220 coding sequences. Of these, 683 were classified as pathogenic by both
prediction methods (HMM and SVM), yielding an average value of 30.3% of
proteins predicted as pathogenic in silico.
The proportion of pathogenic proteins in the PAIs, however, was different. These
loci are composed of sequences with low G + C content, which resulted in a higher
predicted pathogenic content for most of these islands. We found a mean value of
48.6% of pathogenic proteins, 18.3% more than in the complete genome. The
pathogenic content varied significantly, with a standard deviation of 16.3% among
the PAIs regarding pathogenic protein content (Table 2).
To complement the analysis, we looked for prophage sequences in the genome of
the pathogen. Three incomplete prophage regions were identified, one of them
containing a phage tail gene.
4 Conclusion
Since the presence of mobile genetic elements is often related to pathogenicity
in bacterial genomes, we looked for the presence of phage DNA in the C.
pseudotuberculosis genome, identifying three prophage regions and highlighting one
phage tail gene, which is known to be related to pathogen attachment to the host
cell after infection. Finally, the main VF identified in this research were analyzed
for PPI networks, exhibiting high-confidence associations among the ciuABCDE,
hmuTUV and other clusters, suggesting combined roles in pathogenic function.
References
1. Dorella, F., Pacheco, L., Oliveira, S., et al.: Corynebacterium pseudotuberculosis: microbiol-
ogy, biochemical properties, pathogenesis and molecular studies of virulence. Vet. Res. 37(2),
201–218 (2006)
2. Araújo, C., Alves, J., Lima, A., et al.: The Genus Corynebacterium in the Genomic Era. Basic
Biology and Applications of Actinobacteria, Shymaa Enany, IntechOpen (2018)
3. Araújo, C., Blanco, I., Souza, L., et al.: In silico functional prediction of hypothetical proteins
from the core genome of Corynebacterium pseudotuberculosis. PeerJ 8, e9643 (2020)
4. Biberstein, E., Knight, H., Jang, S.: Two biotypes of Corynebacterium pseudotuberculosis.
The Vet. Rec. 89, 691–692 (1971)
5. Williamson, L.: Caseous lymphadenitis in small ruminants. The Vet. Clin. North Am. Food
Anim. Pract. 17(2), 359–371 (2001)
6. Van Dijk, E., Jaszczyszyn, Y., Naquin, D., et al.: The third revolution in sequencing technology.
Trends Genet. 34(9), 666–681 (2018)
7. Guedes, M., Souza, B., Sousa, T., et al.: Infecção por Corynebacterium pseudotuberculosis
em equinos: aspectos microbiológicos, clínicos e preventivos. Pesquisa Veterinária Brasileira
35(8), 701–708 (2015)
8. Weerasekera, D.: Characterization of virulence factors of Corynebacterium diphtheriae and
Corynebacterium ulcerans. thesis (2019)
9. Ruiz, J., D’afonseca, V., Silva, A., et al.: Evidence for reductive genome evolution and lateral
acquisition of virulence functions in two Corynebacterium pseudotuberculosis strains. PLoS
ONE 6(4), e18551 (2011)
10. Pallen, M., Wren, B.: Bacterial pathogenomics. Nature 449(7164), 835–842 (2007)
11. Zhao, Y., Wu, J., Yang, J., et al.: PGAP: Pan-genomes analysis pipeline. Bioinformatics 28(3),
416–418 (2012)
12. Quiroz-Castañeda, R.: Pathogenomics and molecular advances in pathogen identification. In:
Farm Animals Diseases, Recent Omic Trends and New Strategies of Treatment. IntechOpen
(2018)
13. Cassiday, P., Pawloski, L., Tiwari, T., et al.: Analysis of toxigenic Corynebacterium ulcerans
strains revealing potential for false-negative real-time PCR results. J. Clin. Microbiol. 46(1),
331–333 (2007)
14. Lo, B.: Diphtheria. Medscape. https://emedicine.medscape.com/article/782051-print.
Accessed 24 Aug 2020
15. Guimarães, L., Soares, S., Trost, E., et al.: Genome informatics and vaccine targets in
Corynebacterium urealyticum using two whole genomes, comparative genomics, and reverse
vaccinology. BMC Genom. 16, 5 (2015)
16. Collin, M., Fischetti, V.: A novel secreted endoglycosidase from Enterococcus faecalis with
activity on human immunoglobulin G and ribonuclease B. J. Biol. Chem. 279(21), 22558–
22570 (2004)
17. Liu, B., Zheng, D., Jin, Q., et al.: VFDB 2019: a comparative pathogenomic platform with
an interactive web interface. Nucleic Acids Res. 47(1), 687–692 (2019)
18. Soares, S., Geyik, H., Ramos, R., et al.: GIPSy: genomic island prediction software. J.
Biotechnol. 232, 2–11 (2016)
19. Veltri, D., Wight, M., Crouch, J.: SimpleSynteny: a web-based tool for visualization of
microsynteny across multiple species. Nucleic Acids Res. 44, 41–45 (2016)
20. Gupta, A., Kapil, R., Dhakan, D.B., et al.: MP3: a software tool for the prediction of pathogenic
proteins in genomic and metagenomic data. PLoS ONE 9, 4 (2014)
21. Szklarczyk, D., Gable, A., Lyon, D., et al.: STRING v11: protein–protein association net-
works with increased coverage, supporting functional discovery in genome-wide experimental
datasets. Nucleic Acids Res. 47(D1), 607–613 (2019)
22. Zhou, Y., Liang, Y., Lynch, K., et al.: PHAST: a fast phage search tool. Nucleic Acids Res.
39, 347–352 (2011)
23. Ton-That, H., Marraffini, L., Schneewind, O.: Sortases and pilin elements involved in pilus
assembly of Corynebacterium diphtheriae. Mol. Microbiol. 53, 251–261 (2004)
24. Ton-That, H., Schneewind, O.: Assembly of pili on the surface of Corynebacterium
diphtheriae. Mol. Microbiol. 50(4), 1429–1438 (2003)
25. Mandlik, A., Swierczynski, A., Das, A., et al.: Corynebacterium diphtheriae employs specific
minor pilins to target human pharyngeal epithelial cells. Mol. Microbiol. 64, 111–124 (2007)
26. Hansmeier, N., Chao, T., Daschkey, S., et al.: A comprehensive proteome map of the lipid-
requiring nosocomial pathogen Corynebacterium jeikeium K411. Proteomics 7(7), 1076–
1096 (2007)
27. Tauch, A., Kaiser, O., Hain, T., et al.: Complete genome sequence and analysis of the multire-
sistant nosocomial pathogen Corynebacterium jeikeium K411, a lipid-requiring bacterium of
the human skin flora. J. Bacteriol. 187(13), 4671–4682 (2005)
28. Schmitt, M., Drazek, E.: Construction and consequences of directed mutations affecting the
hemin receptor in pathogenic Corynebacterium species. J. Bacteriol. 183, 1476–1481 (2001)
29. Stojiljkovic, I., Perkins-Balding, D.: Processing of heme and heme-containing proteins by
bacteria. DNA Cell Biol. 21(4), 281–295 (2002)
30. Kunkle, C., Schmitt, M.: Analysis of a DtxR-regulated iron transport and siderophore
biosynthesis gene cluster in Corynebacterium diphtheriae. J. Bacteriol. 187(2), 422–433
(2005)
31. Qian, Y., Lee, J., Holmes, R.: Identification of a DtxR-regulated operon that is essential for
siderophore-dependent iron uptake in Corynebacterium diphtheriae. J. Bacteriol. 184(17),
4846–4856 (2002)
32. D’Aquino, J., Tetenbaum-Novatt, J., White, A., et al.: Mechanism of metal ion activation of
the diphtheria toxin repressor DtxR. Proc. Nat. Acad. Sci. U.S. Am. 102(51), 18408–18413
(2005)
33. Oram, D., Avdalovic, A., Holmes, R.: Analysis of genes that encode DtxR-like transcriptional
regulators in pathogenic and saprophytic corynebacterial species. Infect. Immun. 72(4), 1885–
1895 (2004)
34. McKean, S., Davies, J., Moore, R.: Expression of phospholipase D, the major virulence factor
of Corynebacterium pseudotuberculosis, is regulated by multiple environmental factors and
plays a role in macrophage death. Microbiology 153, 2203–2211 (2007)
35. Brüssow, H.: Impact of phages on evolution of bacterial pathogenicity. In: Bacterial
Pathogenomics, ASM Press (2007)
36. Guerrero, J., de Oca Jiménez, R., Dibarrat, J., et al.: Isolation and molecular characterization
of Corynebacterium pseudotuberculosis from sheep and goats in Mexico. Microbial Pathog.
117, 304–309 (2018)
37. de Sá, M.A., Gouveia, G., Krewer, C.: Distribution of PLD and FagA, B, C and D
genes in Corynebacterium pseudotuberculosis isolates from sheep and goats with caseous
lymphadenitis. Genet. Mol. Biol. 36(2), 265–268 (2013)
38. Galvão, C., Fragoso, S., de Oliveira, C., et al.: Identification of new Corynebacterium pseu-
dotuberculosis antigens by immunoscreening of gene expression library. BMC Microbiol.
17(1), 202 (2017)
39. Silva, W., Folador, E., Soares, S., et al.: Label-free quantitative proteomics of Corynebac-
terium pseudotuberculosis isolates reveals differences between Biovars ovis and equi strains.
BMC Genom. 18(1), 451 (2017)
40. Raynal, J., Bastos, B., Vilas-Boas, P., et al.: Identification of membrane-associated proteins
with pathogenic potential expressed by Corynebacterium pseudotuberculosis grown in animal
serum. BMC Res. Notes 1, 11 (2018)
Assessing the Sex-Related Genomic
Composition Difference Using
a k-mer-Based Approach: A Case
of Study in Arapaima gigas (Pirarucu)
1 Introduction
Arapaima gigas, commonly known as “Pirarucu” or “Paiche”, is the largest bony
freshwater fish in the world. It belongs to the bonytongues (Order Osteoglos-
siformes) Arapaimidae family, and has a natural habitat in the Amazon Basin
[1]. Adult specimens may weigh around 200 kg and measure about 3 m [2,3].
Due to its large size, its low-fat and low-fishbone flesh, and its physiological trait
of surfacing at intervals of about 15 min to breathe oxygen, Arapaima gigas became
vulnerable to overfishing in the Amazon region [4], leading to greater surveillance
of the marketing of Pirarucu by the Brazilian government in the early 2000s [5,6].
Some studies show that the use of Pirarucu in intensive fish farming is facilitated,
in part, by physiological characteristics that guarantee the hardiness of the species
[7]. For example, its obligate air-breathing allows this species to tolerate
environments with low concentrations of dissolved oxygen in the water [8]. In
addition to the ease of captive management, attributes such as its low fat content,
combined with the rapid growth of the species, which reaches an average of 10 kg
in its first year of life, add value for intensifying the commercial exploitation of
Arapaima gigas [5,9,10].
One of the problems related to its fishing exploitation and fish farming is
that the genetic mechanisms linked to sex differentiation in Arapaima gigas [11]
are not yet known. Since its sexual maturation occurs late, around the third to
fifth year of life, and sexual dimorphism is not a strong feature of the species
[10], sexing is still performed using laborious procedures (e.g., laparoscopy and
transrectal ultrasound). In recent decades, the rearing of Arapaima gigas in
captivity has been increasingly stimulated, both to develop research to better
understand the particularities of the species and to exploit its economic potential
[12,13]. For more sustainable management, both for captive breeding and for the
study of the species, it is of paramount importance to seek an effective and
non-invasive method to sex juvenile individuals.
For this, the establishment of a molecular genetic marker related to sexual
differentiation would be an advantageous tool. Previous analyses of the Arapaima
gigas genome found no genes associated with the sex determination system of
these individuals [14,15], and chromosomal characterization studies could not
cytologically distinguish a sex chromosome in Arapaima gigas [16,17]. In this
study, we propose to assess the genomic composition of Arapaima gigas using a
k-mer-based approach to identify regions in excess or missing in one of the sexes.
Reads were quality-trimmed in paired-end mode (v. 1.33) [19]. Datasets were
processed on the Sagarana HPC cluster, CEPAD-ICB-UFMG.
This analysis was performed in four steps: (1) k-mer counting, (2) k-mer count
normalization, (3) k-mer filtering for repetitive regions and (4) k-mer count
comparison. The k-mer counting was performed with the tool K-mer Counter
(KMC, v.3.1.1) [20], a free software written in C++ that counts k-mers (sequences
of k consecutive nucleotides) in a given genome sequencing file. The trimmed fastq
files of the six Arapaima gigas representatives were submitted as input to the KMC
algorithm, setting the k-mer size parameter (-k) to 23.
After the KMC step, the data were normalized with Quantile Normalization
(QN), a global adjustment method consisting of a non-parametric procedure that
makes two or more distributions identical in their statistical properties [21],
implemented in an in-house Perl script (v.5.16.3). Lastly, to compare the averages
of the normalized counts between male and female samples, we performed a t-test
in R to identify which k-mers presented small p-values (p ≤ 0.05). For further
procedures, we used an in-house Perl script to select k-mers extracted from a
repetitive region with a repeat unit of up to 8 nucleotides. K-mers with the same
repeat unit were merged and had their normalized counts summed.
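The authors implemented steps (2) and (4) with an in-house Perl script and R; the following Python sketch reproduces the same two operations (quantile normalization of a k-mer-by-sample count matrix, then a per-k-mer t-test) on toy data and is our illustration only:

import numpy as np
from scipy import stats

def quantile_normalize(counts):
    # Quantile normalization: give every sample (column) the same distribution,
    # namely the mean of the per-rank sorted values across samples.
    ranks = np.argsort(np.argsort(counts, axis=0), axis=0)
    mean_of_sorted = np.sort(counts, axis=0).mean(axis=1)
    return mean_of_sorted[ranks]

# Rows are k-mers; the first 3 columns are female, the last 3 male (toy values).
counts = np.array([[10, 12, 11, 30, 28, 33],
                   [50, 55, 52, 51, 49, 53]], dtype=float)
norm = quantile_normalize(counts)
for i, row in enumerate(norm):
    t, p = stats.ttest_ind(row[:3], row[3:])
    print(f"k-mer {i}: p = {p:.4f}")  # keep k-mers with p <= 0.05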
Table 1. Average number of reads and total number of k-mers in the six samples of
Arapaima gigas.
Table 2. Repeat units with a significant difference (p-value < 0.05) in the average
normalized k-mer count between female and male samples of Arapaima gigas.
Arapaima gigas plays an important role in the economy of the northern region
of Brazil [12,13]. Understanding the mechanisms of sex determination in fish is
essential for the sustainable management of ichthyofauna, whether for commercial
or conservation purposes [22]. However, elucidating these mechanisms in fish is
challenging, especially in the case of Arapaima gigas, because this species does not
have a typical sex chromosome in its genome [16,17] and the difference in genome
sequence between samples of opposite sexes seems to be minimal. Neither of the
Arapaima gigas genome assemblies [14,15] found significant differences between
the genomic content of male and female samples.
In this context, we explored other genomic features of Arapaima gigas to find
clues about the genetic factors involved in its sex-determination system. For
this, we analysed the genomic composition of Pirarucu using a k-mer-based
approach. In this study, we noticed the existence of k-mers from repeat regions
over- or underrepresented in one of the sexes, indicating potential differences in
the genetic composition between males and females of Arapaima gigas.
The difference is not very pronounced, which corroborates reports estimating
that 0.01% [14] to 0.1% [15] of the genome of this species is linked to sex
determination. The sequences reported in this work are all part of repetitive
sequences. Despite their low complexity, repetitive regions have been reported
to play an important role in sex determination [23]. In medaka, which has the
XY system, there is a large stretch of repetitive sequence in the Y-specific
region [24]. The Y chromosome of Pacific salmon also bears a specific repetitive
region (OtY1) that is used as a genetic marker to differentiate the sexes [25].
In this context, the repetitive sequences found in this study could be used to
determine the sex of individuals of Arapaima gigas. We recognize, however, the
need to perform analyses with a greater number of samples to obtain better
statistical support for our results, as well as to conduct bench trials for the
validation of the in silico analyses. Despite that, the k-mer-based method applied
in this work has proved to be an interesting strategy to help discover the
sex-determination system of Pirarucu and can be extended to other species.
4 Conclusions
This short paper reports a few repetitive genome sequences whose abundance
differs between males and females of Arapaima gigas, indicating that k-mer-based
methods are an interesting approach to assist in unraveling the sex-determination
system of this species.
References
1. Reis, R.E., Kullander, S.O., Ferraris, C.J.: Check List of the Freshwater Fishes of
South and Central America. Edipucrs, Porto Alegre (2003)
22. Martínez, P., Viñas, A.M., Sánchez, L., Díaz, N., Ribas, L., Piferrer, F.: Genetic
architecture of sex determination in fish: applications to sex ratio control in aqua-
culture. Front. Genet. 5, 340 (2014)
23. Ezaz, T., Deakin, J.E.: Repetitive sequence and sex chromosome evolution in ver-
tebrates. Adv. Evol. Biol. 2014, 9 (2014)
24. Kondo, M., et al.: Genomic organization of the sex-determining and adjacent
regions of the sex chromosomes of medaka. Genome Res. 16(7), 815–826 (2006)
25. Devlin, R.H., Biagi, C.A., Smailus, D.E.: Genetic mapping of Y-chromosomal DNA
markers in pacific salmon. Genetica 111(1–3), 43–58 (2001)
COVID-19 X-ray Image Diagnostic
with Deep Neural Networks
Gabriel Oliveira, Rafael Padilha(B) , André Dorte, Luis Cereda, Luiz Miyazaki,
Maurı́cio Lopes, and Zanoni Dias
1 Introduction
With its impact on society, public health, and economy, the outbreak of COVID-
19 (SARS-CoV-2) has shaped how the year 2020 will be remembered. The
scientific community has devoted its efforts to tracing the origin of the disease,
studying its effects on the human body, and evaluating treatment methods
[7,10,28]. Amidst the crisis, society directed its resources to identifying how
artificial intelligence could aid the fight against the pandemic [13], such as pre-
dicting mortality and growth rates [23,25], analyzing the virus genome [18], and
discovering possible drugs [2].
As the increase in COVID-19 cases overwhelms healthcare systems world-
wide, finding accurate and efficient diagnosis methods is critical to prevent fur-
ther disease spread and treat affected patients. In this sense, encouraged by
recent successes of machine learning applied to medical imaging analysis [14]—
e.g., skin lesion classification [12], brain tumor segmentation [27], and cardiac
image analysis [3]—we investigate deep learning models for the diagnosis of
COVID-19 from chest X-ray images.
2 Dataset
The dataset was obtained from COVIDx [24], which aggregates images from
several different datasets. The available images vary in resolution from 400 × 500
to 3520 × 4280 pixels and are grouped into three classes: COVID-19, Pneumonia,
and Normal (cases with no disease). Table 1 shows the distribution of patients
and chest radiography images in the training and test sets. In the Pneumonia
and COVID-19 classes, there are different X-rays of the same patient. Figure 1
shows one example of each class.
             Patients          Images
             Train     Test    Train     Test
Normal        7,966     100     7,966     100
Pneumonia     5,444      98     5,459     100
COVID-19        320      74       473     100
Total        13,730     272    13,898     300
Fig. 1. Examples of X-rays from the COVIDx dataset [24]. (a) Normal, (b) Pneumonia
and (c) COVID-19.
Some X-ray images in the dataset present artifacts and noise patterns (Fig. 2),
such as medical devices connected to the patient, the contour and volume of the
breasts, or differences in the size and shape of the lungs and rib cage depending
on whether the patient is an adult or a child. Even though these patterns might
be common in the task, they do not directly relate to the diagnostic outcome
and, hence, with enough data a model should learn to ignore them.
Fig. 2. Examples of images with different patterns, such as medical devices connected
to the patient, contour and volume of the breasts and noise.
3 Methodology
Ultimately, our goal is to accurately classify whether a chest X-ray image belongs
to a patient with COVID-19. In our evaluation, we follow the pipeline
depicted in Fig. 3, in which an input image is preprocessed and then analyzed
by a classification model.
Fig. 3. Overview of our pipeline. An input image is preprocessed and then classified
considering two scenarios. In the Multi-class Classification, the model assesses if the
X-ray image belongs to a COVID-19, a Pneumonia, or a Normal patient. On the Binary
Classification, the model decides between a COVID-19 or a non-COVID outcome.
We consider two scenarios concerning the output of our models. In the first
one, we approach the problem as a multi-class classification with three possible
outcomes—i.e., COVID-19, Pneumonia, or Normal. The rationale is that by
explicitly differentiating non-COVID cases, the model might better capture
subtle differences in each diagnosis. In the second scenario, we consider a
binary classification between COVID-19 and non-COVID images; in other words,
Normal and Pneumonia are grouped into a single class.
In the next subsections, we detail how images are preprocessed before being
classified, as well as present the classification models used in our evaluation.
Each model evaluated in our analysis expects an input image with particular
dimensions and a range of values. In this regard, an input image must be stan-
dardized before training and classification.
During training, we resize each image to the network's expected input size
and normalize its pixel values accordingly. We also apply data augmentation
techniques to increase our training set. Considering we have a small dataset
that might not realistically represent the variety of images encountered in real
scenarios, as seen in Fig. 2, data augmentation is essential to artificially add
variability in training and improve model generalization.
We employed the following augmentation techniques: random rotation in
the range [−5◦ , 5◦ ], zoom of at most 10% of the image’s dimension, vertical
and horizontal shifts up to 10% of the image’s height and width, respectively.
Considering the set of all possible transformations, each image can generate up to
10,000 slightly altered versions, virtually increasing our training set. We present
in Fig. 4 examples of these transformations.
At testing time, before being fed to the classification model, we resize and
normalize each input image, without applying any augmentation technique.
Fig. 4. Data augmentation strategies applied during training. (a) Original image, (b)
rotation, (c) zoom, (d) vertical and (e) horizontal shifts.
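As a concrete (hypothetical) rendering of the augmentation parameters above, a Keras ImageDataGenerator could be configured as below; the paper does not state which library it used, and train_dir is a placeholder directory with one subfolder per class:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=5,        # rotation in [-5, 5] degrees
                               zoom_range=0.10,         # zoom of at most 10%
                               width_shift_range=0.10,  # horizontal shift up to 10%
                               height_shift_range=0.10, # vertical shift up to 10%
                               rescale=1.0 / 255)       # pixel-value normalization

train_flow = augmenter.flow_from_directory("train_dir",  # placeholder path
                                           target_size=(224, 224),
                                           batch_size=32,
                                           class_mode="categorical")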
In recent years, the research community has proposed several CNN architec-
tures, optimization methods, and regularization techniques. Besides focusing on
improving generalization and efficacy in particular tasks, some works also aimed
at increasing the efficiency of such networks, allowing them to train faster and
run on low-powered devices.
In this work, we evaluate several architectures proposed in the past five years.
As training them from scratch would require large datasets, we use pre-trained
CNNs optimized on ImageNet [6], an object classification dataset with 14 million
images. The rationale behind this is that deep networks tend to learn similar
concepts in their initial and intermediate layers common to most visual tasks
[26]—from object recognition to medical imaging analysis.
With the knowledge previously obtained, we adapt the networks for the
COVID-19 diagnosis task. We remove their last fully-connected layer, responsible
for classifying into one of the ImageNet classes, and replace it with a new
fully-connected layer with three or two output neurons—for the Multi-class or
Binary classification scenarios, respectively—activated by a softmax operation.
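A minimal Keras sketch of this head replacement, assuming a ResNet50 backbone and a 224 × 224 input (our illustration; the paper does not publish its exact code):

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

num_classes = 3  # 3 for the Multi-class scenario, 2 for the Binary one

# ImageNet-pretrained backbone without its original 1000-way classifier.
base = ResNet50(weights="imagenet", include_top=False,
                pooling="avg", input_shape=(224, 224, 3))
outputs = Dense(num_classes, activation="softmax")(base.output)
model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])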
As each architecture imposes different constraints on the learning process,
the final characteristics learned by each of them capture distinct and often com-
plementary aspects of the training data. Because of this, we investigate if an
ensemble of networks improves the performance over individual models.
We explore three fusion approaches. In the first one, we average the answers
of all CNNs, aiming for the mean consensus between them. Secondly, instead of
merely averaging their responses, we train a meta-classifier on top of the concate-
nated answers. This meta-model learns subtle relative patterns between fused
classifiers. Finally, we take a step back and extract features from the penultimate
layer of each network, optimizing a meta-classifier on their concatenation. This
meta-model aggregates the knowledge from intermediate features and learns how
to jointly classify them.
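The three strategies can be sketched as follows, with the first being a plain average of softmax outputs and the other two a meta-classifier (here a logistic regression, as a placeholder choice, since the paper does not name the meta-model) over concatenated outputs or penultimate-layer features:

import numpy as np
from sklearn.linear_model import LogisticRegression

def average_fusion(probs_list):
    # Strategy 1: mean consensus over each network's softmax answers.
    return np.mean(probs_list, axis=0).argmax(axis=1)

def meta_fusion(train_parts, train_labels, test_parts):
    # Strategies 2 and 3: fit a meta-classifier on the concatenation of the
    # networks' outputs (or of their penultimate-layer features).
    meta = LogisticRegression(max_iter=1000)
    meta.fit(np.hstack(train_parts), train_labels)
    return meta.predict(np.hstack(test_parts))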
For comparison, we also train other machine learning classifiers as baselines for
this task. We evaluate SVM, Random Forest, XGBoost [4], and Logistic Regression.
In addition to the preprocessing described in Subsect. 3.1, we flatten each
image into a vector before feeding it to these classifiers.
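A sketch of this baseline setup (hypothetical variable names; X_imgs would hold the already resized and normalized images):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def flatten_images(images):
    # Baselines receive each image as a 1-D vector, losing spatial structure.
    return images.reshape(len(images), -1)

clf = RandomForestClassifier(n_estimators=100)
# clf.fit(flatten_images(X_train), y_train)  # X_train/y_train are placeholders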
4 Experimental Evaluation
In this section, we present the results of the different CNN architectures and
baseline classifiers evaluated in the COVIDx dataset.
To train the predictive models, we organized the original training set into
train and validation splits. Initially, we randomly sampled 473 X-ray images of
each class (the size of the smallest class), divided into 383 for training and 90 for
validation. This was done to mitigate the class imbalance of the dataset, producing
a balanced sub-sample that allows us to train each architecture efficiently. Using
this reduced balanced set, we evaluated all models in terms of accuracy and
number of parameters for the Multi-class classification scenario.
Considering the top three models, we evaluated their performance apply-
ing data augmentation and using the whole unbalanced dataset. To do so, we
employed a stratified division of the original training set, using 80% for training
and 20% for validation. In this setup, the model has more images to learn from
and generalize, even though the training step requires more time. Each model
was evaluated in the Multi-class and Binary classification scenarios, as well as
their ensemble, combined under different strategies.
Table 2. Accuracy in the balanced validation set and number of parameters for each
evaluated model.
The best baseline methods were Random Forest, XGBoost, and SVM with
RBF kernel, with accuracies ranging from 78.15% to 79.25%. Even though their
performances were superior to those of a few architectures, they are not suitable
for dealing with the spatial information of images in the same way as CNNs. By
reshaping the image into a vector, these models lose most of the spatial structure
between neighboring pixels, which often hinders performance. On the other hand,
logistic regression and linear SVM suffer from trying to fit a linear decision plane
in the high-dimensional space produced by the flattened image.
The CNNs that were outperformed by most baselines have an increased number
of parameters, which indicates that network size does not directly relate to
accuracy in this task. This is probably due to the lack of data required to
properly train networks of such complexity for this problem. The exception
was EfficientNetB7, which ranks second in both number of parameters and
accuracy. Despite its size, it outperforms several architectures in the ImageNet
recognition task [22]—both with lower and higher numbers of parameters.
This highlights its capability of learning discriminative and rich features, which
is essential when adapting it to a task with limited data.
Our experiments show that ResNet50 architecture achieved the highest accu-
racy in the validation set, with an intermediate number of parameters among
the networks evaluated. We select ResNet50, EfficientNetB7 and MobileNetV2
as the base models to perform additional explorations.
Once selected, we evaluated the top three models on the complete unbalanced
dataset. To do so, we split it into 80% for training and 20% for validation in a
stratified division. Due to the class imbalance, the model would naturally give
greater importance to the over-represented classes, i.e., Normal and Pneumonia.
To overcome this issue, we assigned weights to each category according to their
sizes: the weight of a particular class is equal to the number of images of the
largest class divided by the number of samples in that class.
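This weighting rule is simple to reproduce; using the training-set image counts from Table 1:

def class_weights(counts):
    # Weight of a class = size of the largest class / size of that class.
    largest = max(counts.values())
    return {c: largest / n for c, n in counts.items()}

print(class_weights({"Normal": 7966, "Pneumonia": 5459, "COVID-19": 473}))
# {'Normal': 1.0, 'Pneumonia': ~1.46, 'COVID-19': ~16.84}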
Besides that, we also employed the data augmentation techniques from Sub-
sect. 3.1 to increase the diversity in our training. Considering the increased
amount of data, we unfroze the initial and intermediate layers of each model,
allowing the optimization process to freely update their weights. Each CNN
was trained with the same optimizer and hyperparameters in the Multi-class
and Binary classification scenarios. We report the balanced accuracy for each
method in the top part of Table 3.
The CNNs obtained high accuracy, with ResNet50 and MobileNetV2 outper-
forming EfficientNetB7 in both classification scenarios. Similar to the previous
experiment, all three CNNs exceeded the baseline Random Forest, which was
considerably affected by the class imbalance, especially in the Binary setup.
Table 3. Balanced accuracy in the stratified validation and test sets of Multi-class
and Binary classification for individual CNNs and ensembles. We highlight the best
method and ensemble strategies in each scenario and evaluation set.
Fig. 5. Confusion matrix for ResNet50 and the best ensemble strategy in both classi-
fication scenarios for the test set.
Pneumonia, and Normal classes, while obtaining 93.50% accuracy when
distinguishing between COVID-19 and non-COVID.
As future work, additional preprocessing steps can be investigated, such as
segmenting the lung regions of the image, applying filters to reduce noise and
highlight particular artifacts, as well as using images of higher resolution.
In the post-processing step, explainability techniques such as Grad-CAM [19]
could be used to further audit and interpret the network decisions. Finally, as
this is a recent problem, datasets for this task are still limited in size and present
considerable class imbalance. Because of this, continuous data collection is
essential for future research on this problem.
References
1. Apostolopoulos, I.D., Mpesiana, T.A.: Covid-19: automatic detection from X-ray
images utilizing transfer learning with convolutional neural networks. Phys. Eng.
Sci. Med. 43(2), 635–640 (2020). https://doi.org/10.1007/s13246-020-00865-4
2. Beck, B.R., Shin, B., Choi, Y., Park, S., Kang, K.: Predicting commercially avail-
able antiviral drugs that may act on the novel coronavirus (SARS-CoV-2) through
a drug-target interaction deep learning model. Comput. Struct. Biotechnol. J. 18,
784–790 (2020)
3. Bizopoulos, P., Koutsouris, D.: Deep learning in cardiology. IEEE Rev. Biomed.
Eng. 12, 168–193 (2018)
4. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: ACM Inter-
national Conference on Knowledge Discovery and Data Mining (ACM KDD), pp.
785–794 (2016)
5. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In:
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1251–
1258 (2017)
6. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-
scale hierarchical image database. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 248–255 (2009)
7. Gao, Q., Bao, L., Mao, H., Wang, L., Xu, K., Yang, M., Li, Y., Zhu, L., Wang, N.,
Lv, Z., et al.: Development of an inactivated vaccine candidate for SARS-CoV-2.
Science 369(6499), 77–81 (2020)
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.
770–778 (2016)
9. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile
vision applications. arXiv:1704.04861 (2017)
10. Huang, C., Wang, Y., Li, X., Ren, L., Zhao, J., Hu, Y., Zhang, L., Fan, G., Xu, J.,
Gu, X., et al.: Clinical features of patients infected with 2019 novel coronavirus in
Wuhan, China. The Lancet 395(10223), 497–506 (2020)
11. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected
convolutional networks. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 4700–4708 (2017)
12. Kawahara, J., Hamarneh, G.: Multi-resolution-tract CNN with hybrid pretrained
and skin-lesion trained layers. In: International Workshop on Machine Learning in
Medical Imaging, pp. 164–171 (2016)
13. Lalmuanawma, S., Hussain, J., Chhakchhuak, L.: Applications of machine learning
and artificial intelligence for Covid-19 (SARS-CoV-2) pandemic: A review. Chaos,
Solitons & Fractals 139, 110059 (2020)
14. Litjens, G., et al.: A survey on deep learning in medical image analysis. Med. Image
Anal. 42, 60–88 (2017)
15. Narin, A., Kaya, C., Pamuk, Z.: Automatic detection of coronavirus dis-
ease (COVID-19) using X-ray images and deep convolutional neural networks.
arXiv:2003.10849 (2020)
16. Raghu, M., Zhang, C., Kleinberg, J., Bengio, S.: Transfusion: understanding trans-
fer learning for medical imaging. In: Advances in Neural Information Processing
Systems (NIPS), pp. 3347–3357 (2019)
17. Rajpurkar, P., et al.: CheXNet: radiologist-level pneumonia detection on chest X-
rays with deep learning. arXiv:1711.05225 (2017)
18. Randhawa, G.S., Soltysiak, M.P., El Roz, H., de Souza, C.P., Hill, K.A., Kari, L.:
Machine learning using intrinsic genomic signatures for rapid classification of novel
pathogens: Covid-19 case study. PLoS ONE 15(4), e0232391 (2020)
19. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-
CAM: visual explanations from deep networks via gradient-based localization. In:
IEEE International Conference on Computer Vision (ICCV), pp. 618–626 (2017)
20. Shen, W., Zhou, M., Yang, F., Yang, C., Tian, J.: Multi-scale convolutional neural
networks for lung nodule classification. In: International Conference on Information
Processing in Medical Imaging (IPMI), pp. 588–599 (2015)
21. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the incep-
tion architecture for computer vision. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 2818–2826 (2016)
22. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural
networks. In: IEEE International Conference on Machine Learning (ICML), pp.
6105–6114 (2019)
23. Tuli, S., Tuli, S., Tuli, R., Gill, S.S.: Predicting the growth and trend of COVID-
19 pandemic using machine learning and cloud computing. Internet of Things 11,
100222 (2020)
24. Wang, L., Wong, A.: COVID-Net: a tailored deep convolutional neural network
design for detection of COVID-19 cases from chest x-ray images. arXiv:2003.09871
(2020)
25. Yan, L., et al.: An interpretable mortality prediction model for COVID-19 patients.
Nat. Mach. Intell. 2, 1–6 (2020)
26. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep
neural networks? In: Advances in Neural Information Processing Systems (NIPS),
pp. 3320–3328 (2014)
27. Zhao, J., Zhang, M., Zhou, Z., Chu, J., Cao, F.: Automatic detection and classifi-
cation of leukocytes using convolutional neural networks. Med. Biol. Eng. Comput.
55(8), 1287–1301 (2016). https://doi.org/10.1007/s11517-016-1590-x
28. Zhou, P., Yang, X.L., Wang, X.G., Hu, B., Zhang, L., Zhang, W., Si, H.R., Zhu,
Y., Li, B., Huang, C.L., et al.: A pneumonia outbreak associated with a new
coronavirus of probable bat origin. Nature 579(7798), 270–273 (2020)
29. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures
for scalable image recognition. In: IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), pp. 8697–8710 (2018)
Classification of Musculoskeletal
Abnormalities with Convolutional Neural
Networks
1 Introduction
Musculoskeletal conditions are extensively present in the population, affecting
over 1.3 billion people worldwide [9]. These conditions often cause long-term
pain, directly and indirectly reducing the quality of life of those suffering from
them and of their households [33,34]. In this setting, medical imaging such as
X-rays plays an essential role as one of the main tools for abnormality detection.
Insufficient medical staff, along with the complexity of diagnosis, creates a
system prone to errors. False-negative diagnoses lead to untreated injuries and
symptoms such as chronic pain and further complications in the long term, while
false-positive diagnoses lead to unnecessary treatment. Computer-Aided Diagnosis
(CAD) systems are used to counteract these problems, improving diagnostic
accuracy, assisting decision-making, and reducing radiologists' workload.
Computer-aided diagnosis has been a topic of research since the 1960s and has evolved significantly due to advances both in medicine itself and in computer science. From a task standpoint, CAD has found application in a wide variety of medical disorders; a few of the numerous works include breast cancer [8,31], lung cancer [19], and Alzheimer's disease [6]. X-ray classification has been particularly prevalent for the chest area [2,32]. Under the same scope as this work (musculoskeletal abnormalities in the upper limb), 70 works were submitted to a
competition1 that took place using the same dataset as this work. The reported Cohen's Kappa scores range from 0.518 to 0.843. However, no further information, such as methodology, is generally available for these works.
Methodology-wise, some of the most successful classifiers include k-nearest
neighbors (KNN) [14,20], support vector machines (SVM) [7], random forests
[1,15], and neural networks [16,30]. In this work, we will evaluate the use of a
neural network classifier due to its promising performance, producing state-of-
the-art results in many other applications [4,26].
A wide range of techniques, such as deep learning, image processing, and computer vision, is applied to interpret radiographic images automatically. However, the success of deep learning models is highly dependent on the amount of available data, which creates a particular challenge for medical images due to privacy concerns and the time-consuming labeling that requires experts. In this work, we present a
method to classify normal and abnormal X-rays from the upper limb by applying
convolutional neural networks and several machine learning techniques aiming
to improve the classification and offset the lack of data. The rest of this paper
is organized as follows. Section 2 describes the settings of this work and all the
experiments executed to reach a final classifier, and Sect. 3 evaluates this clas-
sifier using an independent test set and discusses the results obtained, as well
as alternative scenarios. Section 4 concludes the paper, including an overview of
the results obtained and future work.
2 Methodology
This section goes over the dataset used in this work and details the experi-
ments executed. Over the experiments, we explore different scenarios by apply-
ing machine learning and deep learning techniques and measure the impact of
the proposed changes when comparing to previously tested scenarios, in a path
to maximize the classifier robustness.
2.1 Dataset
In this work, we used MURA (Large Dataset for Abnormality Detection in Musculoskeletal Radiographs) [23], which contains 40,005 radiographic images labeled by radiologists. The dataset is divided into 14,656 studies of the upper extremities – shoulder, humerus, elbow, forearm, wrist, hand, and finger. Each study is labeled as either normal or abnormal.
The available data is divided into validation and training sets. For testing, a third set was created to be used in a competition and is, therefore, not publicly available. To offset this fact and provide a realistic measurement of our model's performance in a real-world scenario, we split the training data to create a test set. Our goal was to create a test set of the same size as the validation set. As the provided validation set has 8.6% of the size of the training set, the new test set was created by moving, for each body part, 8.6% of the studies from the training set to the test set. The number of items in each class is described in Table 1.
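As a rough sketch of the split described above (not the authors' script), assuming the study list is held in a pandas DataFrame with a hypothetical "body_part" column:

```python
import pandas as pd

def make_test_split(studies: pd.DataFrame, ratio: float = 0.086, seed: int = 42):
    # Sample `ratio` of each body part's studies for the new test set.
    test = (studies.groupby("body_part", group_keys=False)
                   .apply(lambda g: g.sample(frac=ratio, random_state=seed)))
    train = studies.drop(test.index)  # remaining studies stay in training
    return train, test
```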
1 https://stanfordmlgroup.github.io/competitions/mura/.
Table 1. Distribution of the number of studies (images) contained in each class for
the three sets.
2.2 Experiments
The models used for classification were developed in Python, mainly using the
PyTorch framework [21]. Every sample from the dataset has its target class
defined among 14 classes (7 body parts, normal or abnormal). Before inputting
the images to the model, we normalized each image’s pixels values to the mean
and standard deviation of the ImageNet dataset [5] and resized to 224 × 224
pixels.
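A minimal torchvision sketch of this preprocessing is shown below; the ImageNet mean and standard deviation are the standard published values, while the exact pipeline used by the authors may differ.

```python
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                     # square the image
    transforms.ToTensor(),                             # [0, 255] -> [0.0, 1.0]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD), # ImageNet statistics
])
```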
We trained each model over 40 epochs with a batch size of 25. The samples are initially shuffled and reshuffled before each epoch. After each epoch, the network performance is measured with the scikit-learn library [22]. To measure binary classification performance, the 14-class output is condensed into two classes by ignoring the body part information. To determine the output of a study consisting of several images, the probability distributions output for each image in the study are averaged.
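The study-level decision could look like the sketch below; the interleaved (body part, normal/abnormal) class layout is an assumption made only for illustration.

```python
import torch

def study_prediction(image_probs: torch.Tensor) -> int:
    # image_probs: (n_images, 14) probability distributions for one study.
    mean_probs = image_probs.mean(dim=0)  # average over the study's images
    # Condense 14 classes into 2 by summing over body parts
    # (assumed layout: even indices = normal, odd indices = abnormal).
    p_normal = mean_probs[0::2].sum()
    p_abnormal = mean_probs[1::2].sum()
    return int(p_abnormal > p_normal)     # 1 = abnormal, 0 = normal
```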
Fig. 1. Three transformations to change a rectangular image into a square: (a) original image [23]; (b) Fit; (c) Pad; (d) Stretch. Fit does not deform but loses information. Pad also does not deform but adds irrelevant information. Stretch includes all pixels but deforms the image.
From these results, we could conclude that, despite having more data, Augmented Dataset B performs worse than Augmented Dataset A, but still better than the original. Moreover, since training time is approximately linear in the number of samples, using Augmented Dataset B results in longer training time. Therefore, we trained the other networks using Augmented Dataset A (Table 5). These networks were chosen based on the results of Experiment II while avoiding more than one network from the same "family."
Table 5. Performance on the validation data of the networks trained using the dataset augmented with horizontal flip only (Augmented Dataset A).
are also ultimately a weighted average, each element of the output array of each model is now independently weighted. Also, while the mere five weights in the simple weighted average were obtained using an exhaustive search, the weights in these two experiments were obtained using gradient descent. Finally, we used a support vector machine (SVM) with a radial basis function (RBF) kernel to ensemble the models. The SVM is implemented with scikit-learn [22].
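A minimal sketch of this ensemble layer, assuming the five models' 14-way probability outputs are concatenated into a single feature vector per sample:

```python
import numpy as np
from sklearn.svm import SVC

def fit_svm_ensemble(model_probs, labels):
    # model_probs: list of 5 arrays, each of shape (n_samples, 14).
    X = np.concatenate(model_probs, axis=1)  # stacked features: (n_samples, 70)
    clf = SVC(kernel="rbf")                  # RBF-kernel SVM combiner
    clf.fit(X, labels)                       # fitted on validation data (see below)
    return clf
```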
Fig. 2. Schematic representation of the two neural network architectures used in this
experiment. The number of classes shown here is 4 for simplification, while the real
number of classes is 14. The number of models n is 5 in the experiment.
Figure 3 illustrates the final methodology used. The results obtained are presented in Table 6, comparing them with the average results obtained by the models individually. All ensemble models were able to perform better than the average performance of the single models. The single models present a hard-to-avoid overfitting problem, leading to nearly 100% accuracy on the training set for all models and therefore making it impossible to find the best way to combine these predictions using the training set, since practically any combination will lead to nearly 100% accuracy. We must then tune the parameters of the ensemble layer using the validation set, which inevitably leads to a larger drop in performance when transitioning to the test set.
Fig. 3. Overview of the final classification methodology adopted. The X-ray [23] is squared, processed by five neural networks, and the five independent predictions are combined to reach a final prediction using an SVM.
Fig. 4. Grad-CAM heatmaps for each individual model and the ensemble model for an
abnormal humerus sample. Models VGG-16, EfficientNet-B7 and Inception-ResNet-v2
predict correctly, while DenseNet-161 and ResNet-152 predict incorrectly. The Ensem-
ble SVM model takes all models into consideration and outputs a correct final predic-
tion.
averaging the Grad-CAM output matrices (used to create the heatmaps) of the five models. Hence, the heatmap generated for the ensemble model is an approximation and not specific to any of the models.
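In other words, the ensemble heatmap is simply the element-wise mean of the five models' Grad-CAM matrices, as in the sketch below (assuming the maps were first resized to a common shape):

```python
import numpy as np

def ensemble_gradcam(cams):
    # cams: list of 5 Grad-CAM arrays with identical shape, e.g. (224, 224).
    return np.mean(np.stack(cams, axis=0), axis=0)
```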
Figure 5 summarizes the Kappa coefficients for all the experiments. Experi-
ment II had an average Kappa of 0.6542; the SVM (RBF) model from Experi-
ment IV improved it to 0.7345, which is our best result.
[Bar chart omitted from extraction: Kappa coefficients (roughly 0.60–0.75) per model for Experiments I–IV, covering the individual architectures (e.g., DenseNet-161, VGG-16/19, Inception-v3/v4, Inception-ResNet-v2, EfficientNet-B7, ResNet-18, ResNet-152) and the Experiment IV ensembles (Weighted, Consensus, Sparsely/Fully Connected, SVM (RBF)).]
Fig. 5. Summary of the Kappa coefficient for the experiments. The maximum value
was 0.7345.
To evaluate our model in a scenario closer to the real world, we will now
use our test set, defined in Subsect. 2.1, which has no overlap with the other
two sets. Table 7 shows the performance metrics achieved using the Ensemble
SVM model, which was the best performing model on the validation data. The
model’s performance expectedly decreased due to overfitting.
Our model performed worst on hands, with a Kappa of 0.4717, and best
on elbows, with a Kappa of 0.7921. The overall Kappa was 0.6724. In contrast,
human radiologists performed worst on fingers and best on wrists [23]. Our model outperformed two of the three radiologists evaluated on the elbow classification task, and all of them on finger classification, but falls behind on the other body parts. The MURA competition has ended, so we cannot evaluate our model on the official test set. Consequently, the human radiologists were evaluated on the official test set while our model was evaluated on our own test set, so the performance comparison might not be completely accurate.
Table 7. Performance of the Ensemble SVM model on the test data, broken down by body part.
The performance difference between the test and validation sets is likely explained by the fitting of the ensemble parameters using the validation set, as described in Experiment IV. We can verify this by running the test set through the Ensemble Consensus model, which does not have any extra parameters. Table 8 shows the results of this test. These results are similar to those obtained on the validation set (by some metrics, even better), and despite being the worst performer in Experiment IV, it performed better than the Ensemble SVM model, the best ensemble model. Therefore, we can assume that if we were able to train the single models in such a way as to reduce overfitting, we could train the ensemble parameters using the training set and reduce the gap between validation and test results. Alternatively, splitting the data into four
Table 8. Performance of the model Ensemble Consensus on the test data. The model
has similar performance on the validation set.
sets instead of three, fitting the ensemble parameters using a separate set, could also be beneficial, but would further reduce an already limited dataset.
4 Conclusion
Our work explored the classification of musculoskeletal abnormalities using convolutional neural networks and related machine learning methods. The best setting found was to stretch the input to a square, apply horizontal flip data augmentation, and ensemble a variety of architectures using a support vector machine, reaching an AUC ROC of 0.8791 and a Kappa of 0.6724.
Although the overall result was lower than that obtained by human radiolo-
gists, we were still able to achieve promising results in some scenarios. However,
a transition to the clinical setting is challenging. Besides accuracy improvement
under a controlled scenario, the algorithm would need, for example, to han-
dle unexpected inputs, provide translation and rotation invariance, and include
explainability to mitigate automation bias.
The greatest hindrance in this work was the models' overfitting; further experimentation with methods such as early stopping and regularization is needed to adequately address it. The development of larger datasets would also represent a major advance on this matter.
References
1. Alickovic, E., Subasi, A.: Medical decision support system for diagnosis of heart
Arrhythmia using DWT and random forests classifier. J. Med. Syst. 40(4), 1–12
(2016). https://doi.org/10.1007/s10916-016-0467-8
2. Arzhaeva, Y., et al.: Development of automated diagnostic tools for pneumoconiosis
detection from chest X-ray radiographs (2019)
3. Cadene, R.: Pretrained models for Pytorch (2017). https://github.com/Cadene/
pretrained-models.pytorch
4. Chiu, C.C., et al.: State-of-the-art speech recognition with sequence-to-sequence
models. In: IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), pp. 4774–4778. IEEE (2018). https://doi.org/10.1109/ICASSP.
2018.8462105
5. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale
hierarchical image database. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 248–255. IEEE (2009). https://doi.org/10.1109/CVPR.
2009.5206848
6. Ding, Y., et al.: A deep learning model to predict a diagnosis of Alzheimer disease
by using 18F-FDG PET of the brain. Radiology 290(2), 456–464 (2019). https://
doi.org/10.1148/radiol.2018180958
7. Dolatabadi, A.D., Khadem, S.E.Z., Asl, B.M.: Automated diagnosis of coronary
artery disease (CAD) patients using optimized SVM. Comput. Methods Programs
Biomed. 138, 117–126 (2017). https://doi.org/10.1016/j.cmpb.2016.10.011
24. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-
CAM: visual explanations from deep networks via gradient-based localization. In:
IEEE International Conference on Computer Vision (ICCV), pp. 618–626 (2017).
https://doi.org/10.1109/ICCV.2017.74
25. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv:1409.1556 (2014)
26. Sun, Y., Liang, D., Wang, X., Tang, X.: DeepID3: face recognition with very deep
neural networks. arXiv:1502.00873 (2015)
27. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet
and the impact of residual connections on learning. In: Proceedings of the 31st
AAAI Conference on Artificial Intelligence, pp. 4278–4284 (2017)
28. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the Incep-
tion architecture for computer vision. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 2818–2826 (2016). https://doi.org/10.1109/
CVPR.2016.308
29. Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural
networks. arXiv:1905.11946 (2019)
30. Tschandl, P., et al.: Expert-level diagnosis of nonpigmented skin cancer by com-
bined convolutional neural networks. JAMA Dermatol. 155(1), 58–65 (2019).
https://doi.org/10.1001/jamadermatol.2018.4378
31. Wang, H., Zheng, B., Yoon, S.W., Ko, H.S.: A support vector machine-based
ensemble algorithm for breast cancer diagnosis. Eur. J. Oper. Res. 267(2), 687–699
(2018). https://doi.org/10.1016/j.ejor.2017.12.001
32. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8:
hospital-scale chest X-ray database and benchmarks on weakly-supervised classifi-
cation and localization of common thorax diseases. In: IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), pp. 2097–2106 (2017). https://doi.
org/10.1109/CVPR.2017.369
33. Woolf, A.D., Pfleger, B.: Burden of major musculoskeletal conditions. Bull. World
Health Organ. 81, 646–656 (2003)
34. World Health Organisation: Musculoskeletal conditions fact sheet (2019). https://
www.who.int/news-room/fact-sheets/detail/musculoskeletal-conditions
35. Xie, Y., Richmond, D.: Pre-training on grayscale ImageNet improves medical image
classification. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11134, pp.
476–484. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11024-6 37
Combining Mutation and Gene Network
Data in a Machine Learning Approach
for False-Positive Cancer Driver Gene
Discovery
1 Introduction
Cancer is one of the main causes of death globally, being responsible for around 9.6 million deaths in 2018, according to the World Health Organization1. It is considered a complex disease and is caused by the accumulation of genetic alterations, called genetic mutations, in the human body's cells.
The investigation of mutations is crucial for understanding cancer initiation and progression. A single cell undergoes a vast number of mutations; nonetheless, not all of them lead to cancer. In this context, mutations can be classified into two types: passenger mutations, which comprise the majority of mutations in cancer cells and are not significant for cancer progression, i.e., do not confer a selective advantage to cells; and driver mutations, a small group of mutations significant for cancer, i.e., they give cancer cells a growth advantage.
Distinguishing between driver and passenger mutations is a long-standing line of investigation in Cancer Genomics. Many computational methods have been developed on this topic [5,9], based on the various types of data currently available. Among them, mutation data analysis has taken a prominent position after the advent of next-generation sequencing (NGS) technologies and thanks to projects such as TCGA (The Cancer Genome Atlas) [29], which makes large collections of mutation data available. Gene interaction information is also often explored and plays an important role in many computational methods [22]. This type of data provides essential information about complex interactions among genes and their related proteins. Such interactions are represented by complex networks, in which genes are nodes and edges connect genes that physically interact or are functionally related [16].
There are also computational methods that benefit from the simultaneous anal-
ysis of mutation and gene network data. HotNet [28], HotNet2 [17], Hierarchi-
cal HotNet [25], MUFFINN [3], nCOP [13], NetSig [12], and GeNWeMME [6] are
methods that employ network algorithms and mutation data analysis to find sig-
nificantly related mutations and to identify driver genes. Recently, machine learn-
ing algorithms, such as DriverML [11], LOTUS [4], and MoProEmbeddings [10]
have taken advantage of the massive volume of digital biological data to induce
predictive models able to suggest driver genes and find novel biological insights.
Although these methods have been extensively used for driver gene identification, they can misclassify some genes as drivers, making expert curation necessary to filter their findings [1]. This occurs because some genes present characteristics of drivers but are not actually involved in cancer initiation and progression. These genes are referred to as false-positive drivers and may even mislead a biomedical decision or compromise the performance of models that consider them as variables. Avoiding the misclassification of false-positive drivers as drivers is still an ongoing challenge. Thus, dedicated tools are required for further screening these models' findings so that possibly misclassified genes can be detected. In this work, the terms false-positive drivers and false-drivers are used as synonyms.
1 www.who.int/news-room/fact-sheets/detail/cancer.
2 Method
This section describes the steps in the development of this research. Figure 1 shows a summary of the established approach. In Step 1, cancer mutation data, gene interaction networks, and gene labels are selected from reliable and widely used sources. In Step 2, the data are preprocessed and features are extracted. Finally, in Step 3, hyper-parameter tuning is performed so that optimized models can be induced and evaluated through stratified k-fold cross-validation. Further assessment of the models' applicability is performed over new genes reported as drivers by other sources.
– Ten features were extracted for each gene from the MAF file to create a mutation data set DS_MUT. One feature is related to the gene's coverage, i.e., the number of patients in which the gene is mutated. The other nine features represent the number of mutations of each specific somatic variant type. Thus, the data set DS_MUT is composed of 19,184 samples and ten features.
– Ten features were extracted for each gene (node) in the network UGN to create a data set DS_GN. The features are the following centrality measures: degree, betweenness, closeness, eigenvector, coreness, clustering coefficient, average of neighbors' degree, leverage, information, and bridging (a sketch of computing some of these measures follows this list). Such measures consider distinct aspects of the network structure and topology to characterize the importance of a node, thus highlighting its central role according to each measure [21]. The data set DS_GN is composed of 18,959 samples and ten features.
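As a sketch, part of these measures can be computed with NetworkX as below; leverage and bridging centrality have no built-in NetworkX implementation and are omitted here, and G stands for the undirected gene network UGN.

```python
import networkx as nx
import pandas as pd

def centrality_features(G: nx.Graph) -> pd.DataFrame:
    # One row per gene (node); a subset of the ten measures listed above.
    feats = {
        "degree": nx.degree_centrality(G),
        "betweenness": nx.betweenness_centrality(G),
        "closeness": nx.closeness_centrality(G),
        "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
        "coreness": nx.core_number(G),
        "clustering": nx.clustering(G),
        "avg_neighbor_degree": nx.average_neighbor_degree(G),
    }
    return pd.DataFrame(feats)
```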
A combined data set was obtained by merging DS_MUT and DS_GN. Some genes were not contained in both data sets; therefore, only their intersection was taken. The merging leads to the data set DS_COMB, composed of 16,281 samples and 20 features.
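The merge itself amounts to an inner join on the gene identifier, as in this sketch (assuming both data frames are indexed by gene symbol):

```python
import pandas as pd

# ds_mut: 19,184 genes x 10 mutation features (DS_MUT)
# ds_gn:  18,959 genes x 10 centrality features (DS_GN)
def combine(ds_mut: pd.DataFrame, ds_gn: pd.DataFrame) -> pd.DataFrame:
    # The inner join keeps only genes present in both data sets,
    # yielding DS_COMB with 16,281 genes and 20 features.
    return ds_mut.join(ds_gn, how="inner")
```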
Supervised machine learning algorithms were trained with the data set DS_COMB_L to induce predictive models that classify genes as drivers or false-drivers. Scikit-learn [23], a Python module for machine learning, was used in all processes described in this section.
total number of estimators in the range from 20 to 500. The maximum depth was tested with limits from 3 to 20 levels, including the no-restriction scenario. Both gini and entropy criteria were evaluated. The options for the maximum number of features considered to perform a split were: None, auto, sqrt, and log2. The mean prediction accuracy was used as the comparison metric and was calculated over all folds and repetitions for each model. The best model for each algorithm was selected for further evaluation.
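A sketch of this grid search with scikit-learn is given below; the grids are abbreviated for space, and note that the "auto" option for max_features is deprecated in recent scikit-learn versions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

param_grid = {
    "n_estimators": [20, 100, 300, 500],   # tested range: 20 to 500
    "max_depth": [3, 10, 20, None],        # None = no depth restriction
    "criterion": ["gini", "entropy"],
    "max_features": [None, "sqrt", "log2"],
}
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="accuracy", cv=cv)
# search.fit(X_train, y_train); the best model is in search.best_estimator_
```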
3 Results
The optimal hyper-parameters obtained from the grid searches performed for the three labeled data sets, with different features, are provided in Table 2. Because some of the optimal parameters obtained for the SVM were at the limits of the tested ranges, these ranges were then expanded, but no significant improvement in performance was observed. The data sets are referred to as follows: DS_COMB_L, the labeled data set with features from both mutation data and gene network data; DS_MUT_L, the labeled data set with features only from mutation data; and DS_GN_L, the labeled data set with features only from gene network data.
The models induced using these hyper-parameters were further investigated
using the metrics described in Sect. 2.2. Once again, the mean of each metric was
calculated over 30 repetitions of the whole training process. Table 3 shows the
calculated averages and standard deviations for each selected evaluation metric.
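One way to obtain such repeated estimates with scikit-learn is sketched below (10 folds per repetition is an assumption):

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

def repeated_metrics(model, X, y, repeats=30):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=repeats, random_state=0)
    scoring = ["accuracy", "precision", "recall", "f1"]
    res = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Mean and standard deviation of each metric over all folds and repetitions.
    return {m: (res[f"test_{m}"].mean(), res[f"test_{m}"].std())
            for m in scoring}
```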
Table 2. Optimal hyper-parameters obtained from the grid searches.

SVM            DS_COMB_L   DS_MUT_L   DS_GN_L
Kernel         linear      linear     rbf
C              1           1          1
gamma          1           1          10

RF             DS_COMB_L   DS_MUT_L   DS_GN_L
Estimators     300         100        150
Max. depth     10          12         5
Max. features  None        auto       auto
Criterion      gini        gini       gini

Both models trained with combined features presented satisfactory results, without significant differences in performance based on the selected metrics. Moreover, models built with combined mutation and network features outperform those induced with a single source of features.
This trend is also observed when analyzing ROC curves, presented in Fig. 2.
Additionally, the DeLong test [8] was performed over the 30 repetitions to compare the ROC curves of DS_COMB_L and DS_MUT_L. A p-value < 0.05 was observed in 24 and 27 tests for SVM and RF, respectively.
Table 3. Averages and standard deviations of each evaluation metric over the 30 repetitions.

SVM
            DS_COMB_L       DS_MUT_L        DS_GN_L
Accuracy    0.850 ± 0.007   0.828 ± 0.006   0.704 ± 0.011
Precision   0.916 ± 0.005   0.898 ± 0.004   0.784 ± 0.006
Recall      0.877 ± 0.007   0.865 ± 0.009   0.827 ± 0.014
F1          0.896 ± 0.005   0.881 ± 0.004   0.805 ± 0.008

RF
            DS_COMB_L       DS_MUT_L        DS_GN_L
Accuracy    0.824 ± 0.007   0.785 ± 0.007   0.659 ± 0.009
Precision   0.931 ± 0.004   0.923 ± 0.006   0.853 ± 0.009
Recall      0.823 ± 0.009   0.774 ± 0.009   0.651 ± 0.018
F1          0.874 ± 0.005   0.842 ± 0.006   0.738 ± 0.010
References
1. Bailey, M.H., et al.: Comprehensive characterization of cancer driver genes and
mutations. Cell 173(2), 371–385.e18 (2018). https://doi.org/10.1016/j.cell.2018.
02.060
2. Cerami, E., et al.: The cBio cancer genomics portal: an open platform for explor-
ing multidimensional cancer genomics data. Cancer Discov. 2(5), 401–404 (2012).
https://doi.org/10.1158/2159-8290.CD-12-0095
3. Cho, A., Shim, J.E., Kim, E., Supek, F., Lehner, B., Lee, I.: MUFFINN: cancer
gene discovery via network analysis of somatic mutation data. Genome Biol. 17(1),
129 (2016). https://doi.org/10.1186/s13059-016-0989-x
4. Collier, O., Stoven, V., Vert, J.P.: LOTUS: a single- and multitask machine learning
algorithm for the prediction of cancer driver genes. PLoS Comput. Biol. 15(9), 1–27
(2019). https://doi.org/10.1371/journal.pcbi.1007381
5. Cutigi, J.F., Evangelista, A.F., Simao, A.: Approaches for the identification
of driver mutations in cancer: a tutorial from a computational perspective.
J. Bioinform. Comput. Biol. 18(03), 2050016 (2020). https://doi.org/10.1142/
S021972002050016X. pMID: 32698724
6. Cutigi, J.F., Evangelista, A.F., Simao, A.: GeNWeMME: a network-based com-
putational method for prioritizing groups of significant related genes in cancer.
In: Kowada, L., de Oliveira, D. (eds.) BSB 2019. LNCS, vol. 11347, pp. 29–40.
Springer, Cham (2020). https://doi.org/10.1007/978-3-030-46417-2 3
7. Das, J., Yu, H.: HINT: high-quality protein interactomes and their applications in
understanding human disease. BMC Syst. Biol. 6, 92 (2012). https://doi.org/10.
1186/1752-0509-6-92
8. DeLong, E.R., DeLong, D.M., Clarke-Pearson, D.L.: Comparing the areas under
two or more correlated receiver operating characteristic curves: a nonparametric
approach. Biometrics 44, 837–845 (1988)
9. Dimitrakopoulos, C.M., Beerenwinkel, N.: Computational approaches for the iden-
tification of cancer genes and pathways. Wiley Interdisc. Rev.: Syst. Biol. Med.
9(1), e1364 (2017). https://doi.org/10.1002/wsbm.1364
10. Gumpinger, A.C., Lage, K., Horn, H., Borgwardt, K.: Prediction of cancer
driver genes through network-based moment propagation of mutation scores.
Bioinformatics 36(Supplement 1), i508–i515 (2020). https://doi.org/10.1093/
bioinformatics/btaa452
11. Han, Y., et al.: DriverML: a machine learning algorithm for identifying driver genes
in cancer sequencing studies. Nucleic Acids Res. 47(8), e45–e45 (2019)
12. Horn, H., et al.: NetSig: network-based discovery from cancer genomes. Nat. Meth-
ods 15, 61–66 (2018). https://doi.org/10.1038/nmeth.4514
13. Hristov, B.H., Singh, M.: Network-based coverage of mutational profiles reveals
cancer genes. Cell Syst. 5(3), 221–229 (2017)
14. Jassal, B., et al.: The reactome pathway knowledgebase. Nucleic Acids Res. 48(D1),
D498–D503 (2020)
15. Keshava Prasad, T.S., et al.: Human protein reference database-2009 update.
Nucleic Acids Res. 37(Database issue), D767–D772 (2009). https://doi.org/10.
1093/nar/gkn892
16. Kim, Y., Cho, D., Przytycka, T.M.: Understanding genotype-phenotype effects
in cancer via network approaches. PLoS Comput. Biol. 12(3), e1004747 (2016).
https://doi.org/10.1371/journal.pcbi.1004747
17. Leiserson, M.D.M., et al.: Pan-cancer network analysis identifies combinations of
rare somatic mutations across pathways and protein complexes. Nat. Genet. 47(2),
106–114 (2015). https://doi.org/10.1038/ng.3168
18. Lever, J., Zhao, E.Y., Grewal, J., Jones, M.R., Jones, S.J.: CancerMine: a
literature-mined resource for drivers, oncogenes and tumor suppressors in cancer.
Nat. Methods 16(6), 505–507 (2019)
19. Luck, K., et al.: A reference map of the human binary protein interactome. Nature
580, 1–7 (2020)
20. Martínez-Jiménez, F., et al.: A compendium of mutational cancer driver genes.
Nat. Rev. Cancer 20, 1–18 (2020)
21. Oldham, S., Fulcher, B., Parkes, L., Arnatkeviciute, A., Suo, C., Fornito, A.:
Consistency and differences between centrality measures across distinct classes of
networks. PLoS ONE 14(7), 1–23 (2019). https://doi.org/10.1371/journal.pone.
0220061
22. Ozturk, K., Dow, M., Carlin, D.E., Bejar, R., Carter, H.: The emerging potential
for network analysis to inform precision cancer medicine. J. Mol. Biol. 430(18),
2875–2899 (2018)
23. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn.
Res. 12, 2825–2830 (2011)
24. Repana, D., et al.: The Network of Cancer Genes (NCG): a comprehensive cat-
alogue of known and candidate cancer genes from cancer sequencing screens.
Genome Biol. 20(1), 1 (2019). https://doi.org/10.1186/s13059-018-1612-0
25. Reyna, M.A., Leiserson, M.D.M., Raphael, B.J.: Hierarchical HotNet: identify-
ing hierarchies of altered subnetworks. Bioinformatics 34(17), i972–i980 (2018).
https://doi.org/10.1093/bioinformatics/bty613
26. Sondka, Z., Bamford, S., Cole, C.G., Ward, S.A., Dunham, I., Forbes, S.A.: The
COSMIC cancer gene census: describing genetic dysfunction across all human can-
cers. Nat. Rev. Cancer 18(11), 696–705 (2018)
27. Tamborero, D., et al.: Comprehensive identification of mutational cancer driver
genes across 12 tumor types. Sci. Rep. 3, 2650 (2013)
28. Vandin, F., Upfal, E., Raphael, B.J.: Algorithms for detecting significantly mutated
pathways in cancer. J. Comput. Biol. 18(3), 507–522 (2011). https://doi.org/10.
1089/cmb.2010.0265. pMID: 21385051
29. Weinstein, J.N., et al.: The cancer genome atlas pan-cancer analysis project. Nat.
Genet. 45(10), 1113 (2013)
Unraveling the Role of Nanobodies Tetrad
on Their Folding and Stability Assisted
by Machine and Deep Learning Algorithms
Abstract. Nanobodies (Nbs) achieve high solubility and stability due to four con-
served residues referred to as the Nb tetrad. While several studies have highlighted
the importance of the Nbs tetrad to their stability, a detailed molecular picture of
their role has not been provided. In this work, we have used the Rosetta package to
engineer synthetic Nbs lacking the Nb tetrad and used the Rosetta Energy Func-
tion to assess the structural features of the native and designed Nbs concerning
the presence of the Nb tetrad. To develop a classification model, we have bench-
marked three different machine learning (ML) and deep learning (DL) algorithms
and concluded that more complex models led to better binary classification for
our dataset. Our results show that these two classes of Nbs differ significantly in
features related to solvation energy and native-like structural properties. Notably,
the loss of stability due to the tetrad’s absence is chiefly driven by the entropic
contribution.
1 Introduction
Ever since their discovery, single-domain binding fragments of heavy-chain camelid antibodies [1], referred to as nanobodies (Nbs), have gained considerable attention in translational research as therapeutic and diagnostic tools against human diseases and pathogens [2]. Along with their small size (15 kDa) and favorable physical-chemical properties (e.g., thermal and environmental stability), Nbs display binding affinities equivalent to conventional antibodies (cAbs) [1, 3]. Moreover, their heterologous expression in bacteria overcomes cAbs production pitfalls, such as high production cost and the need for animal facilities [4, 5]. Hence, Nbs are considered a promising tool against numerous diseases. A variety of Nbs is currently being investigated at pre-clinical and clinical stages against a wide range of viral infections [6, 7].
Fig. 1. Cartoon representation of the overall topology of an Nb (PDB ID: 3DWT) [11]. The Nb domain consists of 9 β-strands linked by loop regions; 3 of these loops constitute the CDR region and are colored green, blue, and red. The framework regions separated by the hypervariable loops are colored silver. The Nb tetrad residues are highlighted in yellow. (Color figure online)
These four residues’ presence is a hallmark characteristic of Nbs as it has been shown
by several sequence alignments studies [12, 13]. The high conservation of these residues
indicates an evolutionary-driven constraint, and it highlights their pivotal role in Nbs
structure. To ascertain that changes in the Nb tetrad would negatively impact the Nb
folding, we have previously designed a Nb by altering the tetrad residues. The obtained
chimera presented low expression yields and the absence of a well-defined globular
three-dimensional structure due to aggregation (unpublished data). On the contrary,
attempts to “camelize” human/murine Abs by grafting the Nb tetrad to the Ab heavy
chain’s corresponding position has resulted in structural deformations of the framework
β-sheet, leading to scarce stability and aggregation [14]. Although it has been described
a phage-display library derived from llamas that has produced a set of stable and soluble
Nbs devoid of the Nb tetrad [15], these Nbs are unusual, and their stability should be
explained in the light of an alternative mechanism.
Given the Nb tetrad's importance in maintaining the folded structure and stability, these residues can be considered key to engineering novel Nbs. It has been shown that molecular dynamics simulations are not sufficient to capture the lower stability of aggregating Nbs, and they do not elucidate the structural and thermodynamic role of the Nb tetrad in the structures of Nbs [16]. In this study, we seek to identify the impact of the Nb tetrad from a molecular perspective. To gain insight into the thermodynamic contributions to the folded Nbs, we have used the Rosetta Energy Function components combined with machine learning to identify whether there are differences in the structural pattern of natural Nbs and the corresponding Nbs without the tetrad sidechains, obtained by replacing them with methyl groups (alanine). We benchmarked two machine learning (ML) models (Support Vector Machine [17] and Random Forest [18]) and one deep learning (DL) model (an artificial neural network, the Multilayer Perceptron [19]) to evaluate how effectively these algorithms capture the differences among the classes given the multivariate nature of the data.
2 Computational Details
2.1 Dataset Preparation
A total of 30 non-redundant X-ray derived Nb structures, with resolution lower than 3 Å,
were retrieved from the Protein Data Bank (PDB). To alleviate bad atomic contacts, the
nearest local minimum in the energy function was achieved by geometry-minimizing
their initial coordinates using the Rosetta package v. 3.10 [20] and the linear-Broyden-
Fletcher-Goldfarb-Shanno minimization flavor conditioned to the Armijo-Goldstein
rule. The minimization protocol was carried out in a stepwise fashion, where the
sidechain angles were initially geometry-minimized, followed by full rotamer pack-
ing and minimization of the orientation of sidechain, backbone, and rigid body. To
enhance sampling, χ1 and χ2 rotamer angles were used for all residues that pass an extra-chi cutoff of 1. Hydrogen placement was optimized during the protocol. For each of the minimized structures, the Nb tetrad residues were identified and replaced by alanine in the four positions using RosettaScripts, and the obtained structures were geometry-minimized according to the previously described protocol. Thus, the final
dataset consisted of 60 instances.
To evaluate folding propensity, the Nb structures were scored using the all-atom
Rosetta Energy Function 2015 (REF2015) [21] to calculate the energy of all atomic
interactions within the proteins. The REF2015 possesses 20 terms and these were used
as the features. The terms can be found in the GitHub (https://github.com/mvfferraz/
NanobodiesTetrad), and a detailed description of each term can be found in reference
[21]. The score function is a model parametrized to approximate the energy for a given
protein conformation. Thus, it consists of a weighted sum of energy terms expressed as
mathematical functions based on fundamental physical theories, statistical-mechanical
models, and observations of protein structures. The Rosetta package is a state-of-the-art tool for the modeling and design of proteins, and its empirical energy function allows for a valid assessment of the relative thermodynamic stability of folded proteins. The weights for each energy term were kept at their defaults. The parsed command lines, PDB codes, and dataset are available on GitHub.
All the algorithms were written in Python v. 3, and the Scikit-learn library [22] was employed in conjunction with the Pandas [23] and NumPy [24] packages. In addition, TensorFlow [25] and Keras [26] were used for the NN algorithm. The dataset was split into training (70%) and test (30%) sets. The feature vectors were standardized using preprocessing tools, scaling the data to zero mean and unit variance. Details of the code can be found on GitHub.
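A minimal sketch of this preparation follows; fitting the scaler on the training portion only is our assumption, as the text does not detail it.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def prepare(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                              random_state=seed)
    scaler = StandardScaler().fit(X_tr)  # zero mean, unit variance
    return scaler.transform(X_tr), scaler.transform(X_te), y_tr, y_te
```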
Random Forest (RF). RF is a meta-estimator that builds a number of decision trees on bootstrapped training samples, considering a random subset of the features at each split, and averages the trees' predictions. All parameters were kept at their defaults, save for the split-quality criterion, which was set to entropy, envisioning information gain.
Neural Network (NN). The TensorFlow library was used in conjunction with the Keras high-level application programming interface. The classification was performed using the Multi-layer Perceptron (MLP) classifier with 100 hidden layers. An MLP is a class of feed-forward artificial NN, which learns a function f(·): R^m → R^n by training on a dataset with m input dimensions and n output dimensions, and it contains hidden layers between the input and output layers. Each hidden layer transforms the values from the previous layer with a weighted linear summation followed by a non-linear activation function.
Classification performance was assessed in terms of precision, recall, and the f1 score:

Precision = tp / (tp + fp)  (1)

Recall = tp / (tp + fn)  (2)

f1 = 2 / (recall^(-1) + precision^(-1)) = tp / (tp + (1/2)(fp + fn))  (3)
The Rosetta energy terms are convenient mathematical approximations to the physics
that governs protein structure and stability. The Rosetta Energy Function (REF) ranks
the relative fitness of several amino acid sequences for a given protein structure, and it
is capable of predicting the threshold for protein stability by discriminating native-like from non-native structures in a decoy set [31]. The functional form relies upon pairwise decomposability of the energy terms. The decomposition limits the number of energetic contributions to N(N − 1)/2, where N is the number of atoms in the system.
When using the Rosetta energy function to calculate the score of a protein, i.e., the relative energy for a given conformation under specific parameters of the Hamiltonian, it yields a total of 20 energetic terms [21]. Feature selection was performed to reduce the effects of noise and irrelevant variables when constructing the models. A feature was considered relevant and non-redundant if it presented a feature importance score higher than 0.05 (Fig. 2). From this filtering, a total of 7 features were retained:
Fig. 2. Feature selection based on REF terms using Extra trees classifier. Feature importance
greater than 0.05 was regarded as a relevant feature. The red dashed line represents the threshold
for a given feature to be filtered. (Color figure online)
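A sketch of this importance-based filtering with scikit-learn's extra-trees classifier:

```python
from sklearn.ensemble import ExtraTreesClassifier

def select_features(X, y, names, threshold=0.05):
    clf = ExtraTreesClassifier(random_state=0).fit(X, y)
    # Keep the REF2015 terms whose importance exceeds the 0.05 threshold.
    return [n for n, imp in zip(names, clf.feature_importances_)
            if imp > threshold]
```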
Short descriptions of the terms were retrieved from [21]. As can be seen, almost half of the selected terms are related to the system's solvation properties. Since the replacement of hydrophilic residues with alanine increases the hydrophobic content of the Nbs lacking the Nb tetrad, these structural differences have potentially been captured by the REF. These observations corroborate the well-described importance of the Nb tetrad for solubility.
Fig. 3. Discriminant features assessed by LDA. (A) One-dimensional LDA; the bars in red are the LD values for the Nbs lacking the Nb tetrad, whereas the bars in blue are for the natural Nbs; (B) Loadings of each feature used to calculate the LDs; (C–E) ref, lk_ball_wtd, and fa_dun energy terms, respectively, for each class. A p-value < 0.05 indicates significant differences in the means. (Color figure online)
Moreover, the other two features are related to native-like conformation properties. Thus, these results show that the Nb tetrad potentially impacts the distribution of the Nbs' backbone φ and ψ angles as found in Dunbrack's rotamer library. Given the importance of torsion angles for protein folding, a putative explanation for the Nb tetrad's role in maintaining the structure arises from a geometrical issue. It must be noted that, regarding the geometric features, this effect is unlikely to be an artifact of the modeling, since the replacement by alanine residues is not expected to cause significant structural changes: due to the small size of its sidechain, alanine can be accommodated in almost any part of the protein (except when replacing tightly buried glycine residues). To ensure the averages of these features were statistically different between the classes, a two-tailed t-test was used. Figure 3(C–E) shows the distribution of the data for each class, along with averages and standard deviations. All three features presented a p-value < 0.05, indicating that these averages are statistically different.
learning (SVM and RF) and deep learning (ANN-MLP) were assessed regarding their binary classification performance. SVM is an instance-based learning model, and RF is an ensemble method. MLP is a class of NN; here, more than three hidden layers have been employed, and it therefore constitutes a DL approach.
All the models were prepared with the same data and training set. All 20 features were taken into account to carry out the classifications, since using the features selected by the extra-trees classifier resulted in poor performance (data not shown for conciseness). Since the dataset is small, it is prone to overfitting (high variance). Thus, we performed several performance evaluations. Initially, the models were compared regarding their threshold metrics, which are useful for diagnosing classification prediction errors. The scores (Fig. 4A), which are directly associated with a combination of the precision and recall values, were calculated using two approaches: 1) evaluation considering the initial training/test split; and 2) 10-fold cross-validation. In the latter flavor of evaluating estimator performance, the training set is split into k sets, and the metrics are calculated in a loop over the different generated sets; the performance is then measured as the average over the k folds. The SVM model presented a remarkable performance in properly assigning the classes, with an accuracy of 0.94 on the initial test set and an average accuracy of 0.80 when considering the ten different subsets. After SVM, MLP also presented good metrics, albeit with slightly lower values. Of the three models, the one with the poorest metrics was the RF algorithm. The two former models are more complex and robust, and the classification task is likely non-trivial, such that a simpler algorithm cannot capture the main differences between the classes. The algorithms were also compared using confusion matrices (Fig. 4B–D). The diagonal elements of the matrices give the number of correctly labeled classifications, whereas off-diagonal elements represent mislabeled classifications. The SVM and MLP algorithms outperformed the RF model. The performance metrics are summarized in Table 1 and demonstrate the SVM and MLP algorithms' efficacy on our dataset.
To identify how much the models can benefit from adding more data, learning curves were plotted. Two learning curves were constructed: 1) the train learning curve, calculated on the training set, which diagnoses how well the model is learning; and 2) the validation learning curve, calculated on a hold-out validation set, which diagnoses how well the model is generalizing. Figure 5D–F shows the learning curves for the models. For SVM, the training curve modestly decreases as more samples are added, and the learning curve
Fig. 4. Threshold metrics for the models' performance. (A) Validation scores for the SVM, RF, and MLP models considering the accuracy for the train/test split and for the 10-fold cross-validation; (B–D) Confusion matrices for SVM, RF, and MLP.
increases until reaching a plateau at a score of nearly 0.80. As can be seen, the model fits the data well, but its generalization score is slightly lower; thus, the SVM model might be slightly overfitted. However, its learning capability progressively increases as more samples are added, indicating that the number of samples is small. For MLP, a similar trend is observed. However, for the same number of samples, SVM acquires a higher score on the learning curve, suggesting better model performance. These results indicate that one source of difficulty in classifying this dataset resides in the small number of samples.
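Such curves can be produced with scikit-learn's learning_curve utility, as sketched below (the number of folds and the training sizes are assumptions):

```python
import numpy as np
from sklearn.model_selection import learning_curve

def curves(model, X, y):
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=10, train_sizes=np.linspace(0.1, 1.0, 8))
    # Mean over folds: training curve vs. (cross-)validation curve.
    return sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)
```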
Furthermore, they show that the algorithms' training and learning process is not straightforward, given that MLP presents a higher training score, suggesting that a more complex fit to the data is required. The RF model did not show sensitivity to increasing the number of samples, and a decrease in the learning curve is observed. Thus, the RF model does not benefit from increasing the dataset, and its overfitting cannot be attributed solely to the small size of the dataset, but rather to the simplicity of the algorithm for a complex classification task.
This information shows that SVM and MLP have the potential to discriminate between the classes. Such a model is of fundamental relevance for a myriad of protein design algorithms that rely on Monte Carlo sampling. Since a large number of decoys are usually generated, identifying the Nbs that possess native-like characteristics offers enormous savings in time and resources for experimental characterization. From our benchmarking, the RF model is not a proper model to learn from these data. Besides SVM
Fig. 5. Assessment of the models' performance through their characteristic curves. (A–C) ROC curves for the SVM, RF, and MLP models, respectively; (D–F) Learning curves and the validation score as a function of the number of training examples for SVM, RF, and MLP, respectively.
having a slight advantage over MLP, the latter is a promising alternative, since its training curve perfectly fits the training data and its increasing learning curve is a promising indicator of its potential. The SVM presented a satisfactory performance, and by searching for different parameter combinations, a considerable gain in predictive capacity might be observed. The ML and DL algorithms' performance confirms that there are traits that allow for the discrimination of Nbs containing the tetrad or not. Our results show that abolishing the tetrad is associated with a loss of folding stability, in agreement with literature data. This is captured by the ref term, which in turn is shown to have significant contributions from the solvation energy and torsional dihedral motion terms. Therefore, the loss of Nbs stability due to eliminating the tetrad is a mostly entropy-driven phenomenon.
4 Conclusions
We have compared the structural features, calculated by the REF's energy terms, of natural Nbs containing the Nb tetrad and a synthetic set of Nbs lacking the tetrad. Data mining analyses revealed that the two classes of nanobodies differ mainly in folding and solvation features, corroborating previous studies suggesting the tetrad's importance for stability and solubility. This work's findings expand the knowledge on the impact of the Nb tetrad from a molecular-level perspective by highlighting the importance of entropic contributions to stability.
Acknowledgements. This work has been funded by FACEPE, CAPES, CNPq, and FIOCRUZ.
We acknowledge the LNCC for the availability of resources and support.
References
1. Muyldermans, S.: Nanobodies: natural single-domain antibodies. Annu. Rev. Biochem. 82,
775–797 (2013)
2. Mir, M.A., Mehraj, U., Sheikh, B.A., Hamdani, S.S.: Nanobodies: The “Magic Bullets” in
therapeutics, drug delivery and diagnostics. Hum. Antib. 28, 29–51 (2020)
3. Vincke, C., Muyldermans, S.: Introduction to heavy chain antibodies and derived Nanobodies.
Methods Mol. Biol. 911, 15–26 (2012)
4. Morrison, C.: Nanobody approval gives domain antibodies a boost. Nat. Rev. Drug. Discov.
18, 485–487 (2019)
5. Jovčevska, I., Muyldermans, S.: The Therapeutic potential of Nanobodies. BioDrugs 34(1),
11–26 (2019). https://doi.org/10.1007/s40259-019-00392-z
6. Beghein, E., Gettemans, J.: Nanobody technology: A versatile toolkit for microscopic imag-
ing, Protein–Protein interaction analysis, and protein function exploration. Front. Immunol.
8, 771 (2017)
7. Konwarh, R.: Nanobodies: Prospects of expanding the Gamut of neutralizing antibodies
against the novel coronavirus, SARS-CoV-2. Front. Immunol. 11, 1531 (2020)
8. Revets, H., De Baetselier, P., Muyldermans, S.: Nanobodies as novel agents for cancer therapy.
Expert. Opin. Biol. Ther. 5, 111–124 (2005)
9. Muyldermans, S.: Single domain camel antibodies: Current status. J. Biotechnol. 74, 277–302
(2001)
10. Barthelemy, P.A., et al.: Comprehensive analysis of the factors contributing to the stability
and solubility of autonomous human VH domains. J. Biol. Chem. 283, 3639–3654 (2008)
11. Vincke, C., Loris, R., Saerens, D., Martinez-Rodriguez, S., Muyldermans, S., Conrath, K.:
General strategy to humanize a camelid single-domain antibody and identification of a
universal humanized nanobody scaffold. J. Biol. Chem. 284, 3273–3284 (2009)
12. Mitchell, L.S., Colwell, L.J.: Comparative analysis of nanobody sequence and structure data.
Proteins 86, 697–706 (2018)
13. Kunz, P., et al.: Exploiting sequence and stability information for directing nanobody stability
engineering. Biochim. Biophys. Acta Gen. Subj. 1861, 2196–2205 (2017)
14. Rouet, R., Dudgeon, K., Christie, M., Langley, D., Christ, D.: Fully human VH single domains
that rival the stability and cleft recognition of camelid antibodies. J. Biol. Chem. 290, 11905–
11917 (2015)
15. Tanha, J., Dubuc, G., Hirama, T., Narang, S.A., MacKenzie, C.R.: Selection by phage dis-
play of llama conventional V(H) fragments with heavy chain antibody V(H)H properties. J.
Immunol. Methods 263, 97–109 (2002)
16. Soler, M.A., de Marco, A., Fortuna, S.: Molecular dynamics simulations and docking enable
to explore the biophysical factors controlling the yields of engineered nanobodies. Sci. Rep.
6, 34869 (2016)
17. Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J., Scholkopf, B.: Support vector machines.
IEEE Intell. Syst. Appl. 13, 18–28 (1998)
18. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
19. Pal, S.K., Mitra, S.: Multilayer perceptron, fuzzy sets, and classification. IEEE Trans. Neural Netw. 3(5), 683–697 (1992)
20. Leaver-Fay, A., et al.: ROSETTA3: An object-oriented software suite for the simulation and
design of macromolecules. Methods Enzymol. 487, 545–574 (2011)
21. Alford, R.F., et al.: The Rosetta all-atom energy function for macromolecular modeling and
design. J. Chem. Theory Comput. 13, 3031–3048 (2017)
22. Pedregosa, F., et al.: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12,
2825–2830 (2011)
23. McKinney, W.: Data structures for statistical computing in python. In: Proceedings of the 9th
Python in Science Conference, pp. 56–61. Austin (2010)
24. Harris, C.R., et al.: Array programming with NumPy. Nature 585, 357–362 (2020)
25. Abadi, M., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed
systems. arXiv preprint arXiv:1603.04467 (2016)
26. Gulli, A., Pal, S.: Deep learning with Keras. Packt Publishing Ltd, Birmingham (2017)
27. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Annals Eugen. 7,
179–188 (1936)
28. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63, 3–42
(2006)
29. GraphPad Prism. GraphPad Software, San Diego, CA, USA (1994)
30. Powers, D.M.: Evaluation: From precision, recall and F-measure to ROC, informedness,
markedness and correlation. J. Mach. Learn. Technol. 2, 37–63 (2011)
31. Cunha, K.C., Rusu, V.H., Viana, I.F., Marques, E.T., Dhalia, R., Lins, R.D.: Assessing protein
conformational sampling and structural stability via de novo design and molecular dynamics
simulations. Biopolymers 103, 351–361 (2015)
Experiencing DfAnalyzer for Runtime
Analysis of Phylogenomic Dataflows
1 Introduction
Over the past years, several categories of experiments in the bioinformatics domain have become more and more dependent on complex computational simulations. One example is phylogenomic analysis, which provides the basis for evolutionary biology inferences and has been fostered by several technologies (e.g., DNA sequencing methods [10] and novel mathematical and computational algorithms). This leads to a high submission rate of protein sequences
to databases such as UniProt, which now contain millions of sequences [18] that can be used in such analyses. Due to the need to process such a volume of data, the execution of this type of experiment has become data- and CPU-intensive, thus requiring High-Performance Computing (HPC) environments to process data and analyze results in a timely manner.

This work was partially supported by CNPq, CAPES, and FAPERJ.
A phylogenomic analysis experiment may be modeled as a workflow [4]. A workflow is an abstraction that allows the user (e.g., a bioinformatician) to compose a series of activities connected by data dependencies, thus creating a dataflow. These activities are typically legacy programs (e.g., MAFFT, BLAST, etc.). Bioinformaticians often fall back on Workflow Management Systems (WfMSs), such as Galaxy [2], Pegasus [5], and SciCumulus [12], to model and manage the execution of these workflows, but they have to rewrite the workflow in the WfMS's language and adapt to the restrictions of its execution environment. Thus, WfMSs are often not used by bioinformaticians, despite collecting provenance data [6] (the derivation history of a data product, starting from its original sources, e.g., a dataset containing many DNA and RNA sequences), which is key to analyzing and reproducing results and which, allied with domain-specific and telemetry data, provides an important framework for data analytics.
For this reason, many bioinformaticians prefer to implement their experiments using scripts (e.g., Shell, Perl, Python) [9]. More recently, bioinformaticians have also started to explore efficient Data-Intensive Scalable Computing (DISC) systems and migrate their data- and CPU-intensive experiments to frameworks like Apache Spark (https://spark.apache.org/), e.g., SparkBWA [1] and ADAM (https://adam.readthedocs.io/en/latest/).
Collecting provenance data (and domain-specific data) from scripts and DISC frameworks is challenging. There are several alternatives to WfMSs focused on capturing and analyzing provenance from scripts and DISC frameworks that can be applied in phylogenomic analyses [7,15]. However, they also present some limitations. The first is that some approaches require specific languages (e.g., noWorkflow [15] works only with Python) or specific versions of the framework (e.g., SAMbA-RaP [7] requires a specific version of Apache Spark). Flexibility to define the level of granularity is also an issue. In general, automatic provenance capture generates fine-grained provenance, which commonly overwhelms bioinformaticians with a large volume of undesired data to analyze (e.g., accesses to files and databases). On the other hand, automatically capturing coarse-grained provenance may not provide enough data for analysis (even existing WfMSs provide a non-flexible level of granularity). In addition, capturing domain-specific and telemetry data is also an issue in these approaches. We consider that if an integrated database (provenance, domain-specific data, and telemetry data) is available, bioinformaticians can focus on analyzing just the relevant data, reproduce the results, and also observe specific patterns to infer that something is not going well in the script at runtime, deciding to stop it or change parameters.
To address these issues, DfAnalyzer [17] was recently proposed to provide
an agnostic way for scientists to define the granularity of the provenance data
capture. The DfAnalyzer provenance database can be queried at runtime (i.e.,
during the execution of the experiment), using a W3C PROV-compliant data
model. This portable and easy-to-query provenance database is also a step towards
the reproducibility of bioinformatics experiments, regardless of how they are
implemented, since DfAnalyzer can be coupled to scripts, Spark applications, and
existing WfMSs. In this paper, we specialize DfAnalyzer to the context of
phylogenomic experiments and present the benefits for bioinformatics data
analyses and debugging. Thus, the main contribution of this paper is an extension
of DfAnalyzer to collect telemetry (performance) data and its customization for
the bioinformatics domain.
The remainder of the paper is organized as follows. Section 2 provides
background concepts, describing the DfAnalyzer tool, and discusses related work.
Section 3 introduces the extensions and customizations of the DfAnalyzer tool and
presents an evaluation in a case study with a phylogenomic experiment. Finally,
Sect. 4 concludes the paper and points to future directions.
2 Background
2.1 DfAnalyzer: Runtime Dataflow Analysis of Scientific
Applications Using Provenance
DfAnalyzer [17] is a W3C provenance compliant system that captures and stores
provenance during the execution of experiments, regardless of how they are
implemented. DfAnalyzer is based on a formal dataflow representation to regis-
ter the flow of datasets and data elements. It allows for analyzing and debugging
dataflows at runtime. One important characteristic of DfAnalyzer is that it cap-
tures only relevant data (as defined by the user), thus avoiding overloading users
with a large volume of low level data. DfAnalyzer captures “traditional” prove-
nance data (e.g., data derivation path) but also domain-specific data through
raw data extraction, e.g., a DNA sequence or the e-value. These characteristics
are essential since experiments may generate massive datasets, while only a small
subset of provenance and domain data is relevant for analysis [14]. The original
architecture of DfAnalyzer has five components: (i) Provenance Data Extrac-
tor (PDE); (ii) Raw Data Extractor (RDE); (iii) Dataflow Viewer (DfViewer);
(iv) Query Interface (QI); and (v) Provenance Database. The first two compo-
nents are invoked by plugging calls on the script, while the other three have
independent interfaces for the user to submit data analyses at runtime.
After deploying DfAnalyzer, users are required to identify relevant data
in their own script, model these data, and instrument the code (add DfAnalyzer
calls) so that DfAnalyzer automatically captures data and populates
the database. It is worth noticing that the database tables are automatically
created based on the instrumentation performed in the code. This data
identification is based on the following concepts:
– Dataflow: a tag to identify the dataflow that is being captured;
– Transformations: the data transformations that are part of the dataflow;
– Tasks: the smallest units of processing; a transformation may be executed by
several tasks;
– Datasets: groups of data elements consumed by tasks and transformations; a
transformation consumes an output produced by another transformation;
– Data elements: the attributes that compose datasets; they represent either
domain-specific data or input parameters of the experiment.
Such information is inserted at the beginning of the script or program. Before
inserting the tags in the script, it is necessary to map these concepts to the
script's dataflow. Identifying the dataflow in the script is essential to represent
the data transformations, dependencies, and data elements that need to be stored
in the provenance database.
Listing 1 shows an example of code instrumentation in DfAnalyzer. To instru-
ment the code, we have a 3-step process: (i) import DfAnalyzer packages in the
script; (ii) define the prospective provenance (i.e., the definitions of the dataflow
and transformations—lines 2 to 17); and (iii) define the retrospective provenance
(i.e., activities and data elements to capture—lines 20 to 27). When the script
starts running, each call sends to DfAnalyzer the prospective and retrospective
data to be stored in the database. Figure 1 presents a fragment of the prove-
nance database schema. Each dataset identified in the script has an associated
table in the database. It is worth noticing that this instrumentation is per-
formed only once and the script may be executed several times after that. There
are datasets, data elements, and data transformations that are typically used in
phylogenomic workflows. To avoid this repetitive step for bioinformaticians and
allow for a consistent data representation, this work provides specialized services
for DfAnalyzer users in this domain.
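Listing 1 itself is version-dependent, so the fragment below is only a runnable stand-in for this three-step instrumentation; the helper names (dataflow, transformation, task) and the in-memory provenance_db list are our placeholders, not the actual DfAnalyzer API.

# Runnable stand-in for the three-step instrumentation described above.
# dataflow/transformation/task are OUR placeholders for the real DfAnalyzer
# calls, and provenance_db stands in for its relational database.
provenance_db = []

def dataflow(tag):
    """(ii) Prospective provenance: register the dataflow tag."""
    provenance_db.append(("dataflow", tag))
    return tag

def transformation(flow, name, inputs, outputs):
    """(ii) Prospective provenance: a transformation and its datasets."""
    provenance_db.append(("transformation", flow, name, inputs, outputs))
    return name

def task(transf, **elements):
    """(iii) Retrospective provenance: one executed task and its data elements."""
    provenance_db.append(("task", transf, elements))

flow = dataflow("phylogenomics")
align = transformation(flow, "alignment",
                       inputs=["input_sequences"], outputs=["msa"])
# ... the legacy aligner (e.g., MAFFT) would run here ...
task(align, input_fasta="seqs.fasta", msa_file="seqs.aln")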
[Figure: the phylogenomic dataflow. Input sequences pass through five steps: (1) RemovePipe, (2) multiple sequence alignment with Mafft, ClustalW, or Muscle, (3) ReadSeq format conversion, (4) Model Generator, and (5) tree generation with RaXML or MrBayes, producing phylogenomic trees.]
The parameter ngen sets the number of cycles for which the Markov Chain
Monte Carlo (MCMC) algorithm is executed; the main goal of the algorithm is to
make small random changes in some parameters, accepting or rejecting them
according to their probability. Also, nchains is the parameter related to the
number of parallel MCMCMC chains within a single run, printfreq is related to
the frequency with which information is shown on the screen, and nruns defines
how many independent analyses are started simultaneously. Concerning the MSA
programs, it was not necessary to set parameters, and they were executed in
default mode. More information can be found in the Muscle
(www.ebi.ac.uk/Tools/msa/muscle/), Mafft (mafft.cbrc.jp/alignment/software/),
ClustalW (www.genome.jp/tools-bin/clustalw), RAxML
(http://cme.h-its.org/exelixis/web/software/raxml/), and MrBayes
(http://mrbayes.sourceforge.net/commref mb3.2.pdf) documentation.
Workflow steps, programs, and parameters of Experiments A and B:

Experiment A
  Wf. step                   Program          Parameter
  Clean                      RemovePipe       Total
  Alignment                  Mafft            Default
                             ClustalW         Default
                             Muscle           Default
  Converter                  ReadSeq          Default
  Evolutive model generator  Model generator  Default
  Tree generator             RaXML            Default

Experiment B
  Wf. step                   Program          Parameter
  Clean                      RemovePipe       Total
  Alignment                  Mafft            Default
                             ClustalW         Default
                             Muscle           Default
  Converter                  ReadSeq          Default
  Evolutive model generator  Model generator  Default
  Tree generator             MrBayes          ngen 100000
                                              nchains 4
                                              printfreq 1000
                                              burnin 0
                                              nruns 2
                                              rates mrbayes 4
After the execution, the user needs to analyze which evolutionary models were
chosen for the input data. If the bioinformatician executed SciPhylomics without
DfAnalyzer, one would have to open each file to check this information, which is
time-consuming, tedious, and error-prone. Using DfAnalyzer, one only has to
submit the query shown in Listing 2. In this query, the user wants to discover
the number of times a specific evolutionary model was used, but only when the
length of the input sequence is larger than 20.
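Since Listing 2 depends on the schema generated by the instrumentation, the query below is only a sketch of what it could look like; the table and column names (omodelgenerator, evolutionary_model, sequence_length) are assumptions, and the table is assumed to have been populated by DfAnalyzer.

# Sketch of a Listing-2-style runtime query; all table/column names are
# assumptions about the schema DfAnalyzer generates from the instrumentation.
import sqlite3  # the actual DfAnalyzer database backend may differ

conn = sqlite3.connect("provenance.db")
query = """
    SELECT evolutionary_model, COUNT(*) AS times_used
    FROM omodelgenerator             -- assumed Model Generator output table
    WHERE sequence_length > 20       -- only sufficiently long input sequences
    GROUP BY evolutionary_model
    ORDER BY times_used DESC;
"""
for model, times_used in conn.execute(query):
    print(model, times_used)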
The result of the query presented in Listing 2 is shown in Fig. 3. One can
state that WAG and RtREV models are the most common ones. WAG presented
higher likelihoods than any of the other models. This type of query can be
adapted to other attributes of the database, such as quality of the generated
tree, e-value, etc. Another performed analysis was related to the execution time
of each transformation and resource usage. Capturing this type of data can
impact the performance of the experiment, since it is usually sampled at short
time intervals and may introduce overhead. Thus, in this analysis we defined a
30 s window for capturing performance data. The box-plots of the
execution time behaviour for the six variations of SciPhylomics are presented in
Fig. 4.
Fig. 4. Execution time of SciPhylomics varying the MSA programs (ClustalW, Mafft
and Muscle) and tree generator programs (MrBayes and RaXML).
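As an illustration of the windowed telemetry capture mentioned above, the sketch below samples CPU and memory usage once per 30 s window using the psutil package; the function name and record format are ours, not the extension's actual API.

# Sketch of a 30 s telemetry sampling loop; the record format is ours.
import time
import psutil  # third-party package exposing CPU/memory metrics

def sample_telemetry(windows, window_s=30):
    """Collect one (timestamp, cpu %, memory %) record per window."""
    records = []
    for _ in range(windows):
        cpu = psutil.cpu_percent(interval=window_s)  # averaged over the window
        mem = psutil.virtual_memory().percent
        records.append((time.time(), cpu, mem))      # would be stored in the DB
    return records

print(sample_telemetry(windows=2))  # two consecutive 30 s windows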
MrBayes performs model choice across a wide range of phylogenetic and
evolutionary models and is more costly than RAxML. The programs execute
different operations and generate different datasets that are composed of
different attributes as well. While RaXML does not execute the operation
"search for evolutionary model" (since the ModelGenerator program is executed),
MrBayes executes this operation, and in this experiment variation the operation
"search for evolutionary model" is executed twice. These characteristics explain
the difference in execution time in both cases, and in resource usage as well,
as shown in Fig. 5.
4 Conclusion
This paper presented an approach to capture provenance data from phylogenomic
dataflows without the need to rewrite the workflow in a specific programming
language or Workflow Management System engine. We specialized the DfAnalyzer
tool to the context of phylogenomic experiments so that, in an existing
workflow, it is possible to achieve data capture with a flexible granularity
level and to create an integrated database composed of domain-specific,
provenance, and telemetry data. DfAnalyzer is based on prospective provenance:
by modeling prospective provenance data, retrospective provenance is
automatically captured while the workflow executes, and the data created by its
transformations are stored in a relational database for further querying.
Differently from other approaches, the provenance is captured with flexible
granularity, and bioinformaticians can specify what is important for their
analysis, reducing the experiment cost in different spheres.
Another advantage of applying DfAnalyzer in the context of phylogenomic
experiments is the capability of capturing data from experiments that execute
either locally or in HPC environments: since DfAnalyzer is asynchronous and
request-based, it can execute in different environments. This asynchronous
design also ensures that the instrumentation does not introduce delays in the
workflow execution. In addition, we extended DfAnalyzer to capture telemetry
(performance) data. This way, users can perform analyses based on both
provenance data and performance metrics. We evaluated the proposed
specialization of DfAnalyzer using the previously defined SciPhylomics
workflow, and the proposed approach yielded relevant telemetry scenarios and
rich data analyses. In future work, we intend to evaluate reproducibility in
experiments based on the analysis of the provenance database.
References
1. Abuín, J.M., Pichel, J.C., Pena, T.F., Amigo, J.: SparkBWA: speeding up the
alignment of high-throughput DNA sequencing data. PLoS ONE 11(5), e0155461
(2016)
2. Afgan, E., et al.: The Galaxy platform for accessible, reproducible and collaborative
biomedical analyses: 2016 update. Nucleic Acids Res. 44(W1), W3–W10 (2016)
3. Carvalho, L.A.M.C., Wang, R., Gil, Y., Garijo, D.: NIW: converting notebooks into
workflows to capture dataflow and provenance. In: Tiddi, I., Rizzo, G., Corcho, Ó.
(eds.) Proceedings of Workshops and Tutorials of the 9th International Conference
on Knowledge Capture (K-CAP 2017), Austin, Texas, USA, 4 December 2017.
CEUR Workshop Proceedings, vol. 2065, pp. 12–16. CEUR-WS.org (2017)
4. de Oliveira, D.C.M., Liu, J., Pacitti, E.: Data-intensive workflow management: for
clouds and data-intensive and scalable computing environments (2019)
5. Deelman, E., et al.: Pegasus, a workflow management system for science automa-
tion. FGCS 46, 17–35 (2015)
6. Freire, J., Koop, D., Santos, E., Silva, C.T.: Provenance for computational tasks:
a survey. CS&E 10(3), 11–21 (2008)
7. Guedes, T., et al.: Capturing and analyzing provenance from spark-based scientific
workflows with SAMbA-RaP. Future Gener. Comput. Syst. 112, 658–669 (2020)
8. Hondo, F., et al.: Data provenance management for bioinformatics workflows using
NoSQL database systems in a cloud computing environment. In: 2017 IEEE Inter-
national Conference on Bioinformatics and Biomedicine (BIBM), pp. 1929–1934.
IEEE (2017)
9. Marozzo, F., Talia, D., Trunfio, P.: Scalable script-based data analysis workflows
on clouds. In: WORKS, pp. 124–133 (2013)
10. Masulli, F.: Comput. Methods Programs Biomed. 91(2), 182 (2008)
11. Moreau, L., et al.: PROV-DM: the PROV data model. W3C Recommendation 30,
1–38 (2013)
12. Oliveira, D., Ocaña, K.A.C.S., Baião, F.A., Mattoso, M.: A provenance-based adap-
tive scheduling heuristic for parallel scientific workflows in clouds. JGC 10(3),
521–552 (2012)
13. de Oliveira, D., et al.: Performance evaluation of parallel strategies in public clouds:
a study with phylogenomic workflows. Future Gener. Comput. Syst. 29(7), 1816–
1825 (2013)
14. Olma, M., Karpathiotakis, M., Alagiannis, I., Athanassoulis, M., Ailamaki, A.:
Slalom: coasting through raw data via adaptive partitioning and indexing. Proc.
VLDB Endow. 10(10), 1106–1117 (2017)
15. Pimentel, J.F., Murta, L., Braganholo, V., Freire, J.: noWorkflow: a tool for col-
lecting, analyzing, and managing provenance from python scripts. Proc. VLDB
Endow. 10(12), 1841–1844 (2017)
16. Pina, D.B., Neves, L., Paes, A., de Oliveira, D., Mattoso, M.: Análise de
hiperparâmetros em aplicações de aprendizado profundo por meio de dados de
proveniência. In: Anais Principais do XXXIV Simpósio Brasileiro de Banco de
Dados, pp. 223–228. SBC (2019)
17. Silva, V., de Oliveira, D., Valduriez, P., Mattoso, M.: DfAnalyzer: runtime dataflow
analysis of scientific applications using provenance. Proc. VLDB Endow. 11(12),
2082–2085 (2018)
18. The UniProt Consortium: UniProt: the universal protein knowledgebase. Nucleic
Acids Res. 45(D1), D158–D169 (2016)
Sorting by Reversals and Transpositions
with Proportion Restriction
1 Introduction
When comparing two genomes, one of the main goals is to determine the sequence
of mutations that occurred during the evolutionary process and that transforms
one genome into the other. In comparative genomics, we estimate this sequence
through genome rearrangements: evolutionary events (mutations) that affect
large segments of the genome.
Two genomes G1 and G2 can be computationally represented as the sequence
of labels assigned to their shared genes (or shared blocks of genes). Labels are
integers which, when gene orientation is known, carry a positive or negative sign.
2 Basic Definitions
This section formally presents the definitions used in the genome rearrangement
problems. Given two genomes G1 and G2 , each synteny block (common block
of genes between the two genomes) is represented by an integer that also has
a positive or negative sign to indicate its orientation, if known. Therefore, each
genome is a permutation of integers. We assume that one of them is represented
by the identity permutation ιn = (+1 +2 . . . +n) and the other is represented
by a signed (or unsigned) permutation π = (π1 π2 . . . πn ).
We define a rearrangement model M as the set of rearrangement events
allowed to compute the distance. Given a rearrangement model M and a per-
mutation π, the rearrangement distance d(π) is the minimum number of rear-
rangements of M that sorts π (i.e., that transforms π into ι). The goal of
Sorting by Genome Rearrangements problems consists in finding this distance
and a sequence of operations that attains it.
In this work, we will assume that M contains both reversals and transpositions.
Let us formally define these events. A reversal ρ(i, j), with 1 ≤ i ≤ j ≤ n,
inverts the order of the segment (πi . . . πj) and, on signed permutations, also
flips the sign of each of its elements. A transposition τ(i, j, k), with
1 ≤ i < j < k ≤ n + 1, exchanges the two adjacent segments (πi . . . πj−1) and
(πj . . . πk−1). In the Sorting by Reversals and Transpositions with Proportion
Restriction (SbRTwPR) problem, given a permutation π and a proportion
0 ≤ k ≤ 1, we seek a minimum-length sorting sequence in which at least a
fraction k of the operations are reversals; the corresponding distance is
denoted by dk(π).
Note that when k = 1 the SbRTwPR problem becomes the Sorting by Rever-
sals problem on signed [5] and unsigned [4] permutations. Moreover, when k = 0
we have the Sorting by Reversals and Transpositions problem on signed [10] and
unsigned [9] permutations.
Example 1 shows an optimal solution S for π = (−1 +4 −8 +3 +5 +2 −7 −6)
considering the SbRT and the SbWRT problems (SbWRT using costs 2 for
reversals and 3 for transpositions). Note that half of the operations in S are
reversals and half are transpositions, even using a higher cost for transpositions.
Example 1.
π  = (−1 +4 −8 +3 +5 +2 −7 −6)
π¹ = π · ρ(1, 5) = (−5 −3 +8 −4 +1 +2 −7 −6)
π² = π¹ · τ(2, 4, 9) = (−5 −4 +1 +2 −7 −6 −3 +8)
π³ = π² · τ(1, 3, 7) = (+1 +2 −7 −6 −5 −4 −3 +8)
π⁴ = π³ · ρ(3, 7) = (+1 +2 +3 +4 +5 +6 +7 +8)
S  = {ρ(1, 5), τ(2, 4, 9), τ(1, 3, 7), ρ(3, 7)}
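For concreteness, a minimal sketch (ours, not part of the paper) of how ρ and τ act on signed permutations, reproducing Example 1 step by step:

def reversal(pi, i, j):
    """rho(i, j): reverse the segment pi_i..pi_j and flip its signs (1-indexed)."""
    i, j = i - 1, j - 1
    return pi[:i] + [-x for x in reversed(pi[i:j + 1])] + pi[j + 1:]

def transposition(pi, i, j, k):
    """tau(i, j, k): exchange adjacent blocks pi_i..pi_{j-1} and pi_j..pi_{k-1}."""
    i, j, k = i - 1, j - 1, k - 1
    return pi[:i] + pi[j:k] + pi[i:j] + pi[k:]

pi = [-1, 4, -8, 3, 5, 2, -7, -6]
pi = reversal(pi, 1, 5)          # (-5 -3 +8 -4 +1 +2 -7 -6)
pi = transposition(pi, 2, 4, 9)  # (-5 -4 +1 +2 -7 -6 -3 +8)
pi = transposition(pi, 1, 3, 7)  # (+1 +2 -7 -6 -5 -4 -3 +8)
pi = reversal(pi, 3, 7)
assert pi == [1, 2, 3, 4, 5, 6, 7, 8]  # sorted into the identity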
In the following, we present breakpoints and the cycle graph, both widely
used to obtain bounds for the distance and to develop algorithms.
2.1 Breakpoints
Given a permutation π = (π1 . . . πn ), we extend π by adding the elements
π0 = 0 and πn+1 = n + 1, with these elements having positive signs when
considering signed permutations. We observe that these elements are not affected
by rearrangement events. From now on, we work on extended permutations.
Definition 5. For an unsigned permutation π, a pair of elements πi and πi+1,
with 0 ≤ i ≤ n, is a breakpoint if |πi+1 − πi| ≠ 1.
For a signed permutation π, we define the cycle graph G(π) = (V, E), such that
V = {+π0 , −π1 , +π1 , −π2 , +π2 , . . . , −πn , +πn , −πn+1 } and E = Eb ∪ Eg , where
Eb = {(−πi , +πi−1 ) | 1 ≤ i ≤ n + 1} and Eg = {(+(i − 1), −i) | 1 ≤ i ≤ n + 1}.
We say that Eb is the set of black edges and Eg is the set of gray edges.
Note that each vertex is incident to two edges (a gray edge and a black edge)
and, so, there exists a unique decomposition of the edges into cycles. The size of a
cycle C ∈ G(π) is the number of black edges in C. A cycle C is trivial if it has
size 1. If C has size less than or equal to 3, then C is called short and, otherwise,
C is called long. The identity permutation ιn is the only one with a cycle graph
containing n + 1 cycles, which are all trivial.
The number of cycles in G(π) is denoted by c(π). Given an operation γ, let
Δc(π, γ) = c(π · γ) − c(π), that is, Δc(π, γ) denotes the change in the number of
cycles after applying γ to π.
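As an illustration (ours, with our own vertex encoding), c(π) can be computed by following the unique alternating decomposition, since every vertex has exactly one black and one gray edge:

def cycles(pi):
    """c(pi) for a signed permutation pi (sketch, ours): build the black and
    gray edges of G(pi) and count the cycles of the unique decomposition."""
    n = len(pi)
    ext = [0] + list(pi) + [n + 1]          # extended permutation

    def neg(i):                             # vertex -pi_i as a (sign, value) pair
        e = ext[i]
        return ('-', e) if e >= 0 else ('+', -e)

    def pos(i):                             # vertex +pi_i
        e = ext[i]
        return ('+', e) if e >= 0 else ('-', -e)

    black, gray = {}, {}
    for i in range(1, n + 2):
        u, v = neg(i), pos(i - 1)           # black edge (-pi_i, +pi_{i-1})
        black[u], black[v] = v, u
        u, v = ('+', i - 1), ('-', i)       # gray edge (+(i-1), -i)
        gray[u], gray[v] = v, u

    seen, count = set(), 0
    for start in black:                     # each vertex has one edge of each color
        if start in seen:
            continue
        count += 1
        v, use_black = start, True          # follow edges, alternating colors
        while v not in seen:
            seen.add(v)
            v = black[v] if use_black else gray[v]
            use_black = not use_black
    return count

assert cycles([1, 2, 3]) == 4               # identity: n + 1 trivial cycles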
The cycle graph G(π) is drawn in a way to highlight characteristics of the
permutation, as shown in Fig. 1. In this representation, we draw the vertices in
a horizontal line, from left to right, following the order +π0 , −π1 , +π1 , . . . , −πn ,
+πn , −πn+1 . The black edges are horizontal lines and the gray edges are arcs.
For 1 ≤ i ≤ n + 1, the black edge (−πi , +πi−1 ) is labeled as i. We represent
a cycle C by the sequence of labels of its black edges following the order they
are traversed, assuming that the first black edge is the one with highest label
(rightmost black edge of C) and it is traversed from right to left. Assuming this
representation, if a black edge is traversed from left to right we add a minus sign
to its label (the first black edge is always positive since it is traversed from
right to left by convention).
Two black edges of a cycle C are divergent if their labels have different signs,
and convergent otherwise. A cycle C is divergent if at least one pair of black
edges of C are divergent, and it is convergent otherwise.
[Fig. 1. The cycle graph G(π) for π = (+5 +2 +4 +3 +1 +6 +7): black edges drawn as horizontal lines labeled 1 to 8 (the minus sign in label −7 indicates a black edge traversed from left to right), gray edges drawn as arcs.]
3 Approximation Algorithms
In this section, we present approximation algorithms considering both unsigned
and signed permutations.
Proof. Since S is an optimal sequence for the instance (π, k), at least |S|k of
the operations in S are reversals. By Lemmas 1 and 2, a reversal can remove up
to two breakpoints while a transposition can remove up to three. Letting φb(S)
denote the average number of breakpoints removed by an operation in S, we have
that

φb(S) ≤ (2k|S| + 3(1 − k)|S|) / |S| = 3 − k.
Proof. Since b(π) breakpoints must be removed in order to turn the permutation
π into ι and, by Lemma 3, up to 3 − k breakpoints are removed per operation
on average, the theorem follows.
This guarantees that the permutation π will be sorted. In addition, no more
than b(π) operations will be used to sort π, maintaining the approximation
factor of 3 − k. Since each operation (reversal or transposition) can be found
in linear time and |S| ≤ b(π) ≤ n + 1, the running time of Algorithm 1 is O(n²).
Proof. Since S is an optimal sequence for the instance (π, k), at least |S|k of
the operations in S are reversals. By Lemmas 4 and 5, a reversal creates at
most one new cycle, while a transposition creates at most two new cycles.
Letting φc(S) denote the average number of cycles created by an operation in S,
we have that

φc(S) ≤ (k|S| + 2(1 − k)|S|) / |S| = 2 − k.
Proof. Since (n + 1) − c(π) new cycles must be created in order to turn the
permutation π into ι and, by Lemma 6, up to 2 − k new cycles are created per
operation on average, the theorem follows.
Proof. If at any stage G(π) has a divergent cycle C, then there exists a reversal
applied to C that increases the number of cycles by one unit [10]. Otherwise,
G(π) has only convergent cycles, and one of the following is true [8]:
[Fig. 2. Reversals applied in each of the three cases described in the proof (Case 1: an oriented long cycle; Case 2: a short cycle plus the cycle closing its open gate; Case 3: a long cycle plus the cycle closing its open gate).]
If G(π) has an oriented long cycle C, then we can apply a reversal on its
black edges in such a way that it turns C into a divergent cycle C′. Since C′
is long, we can apply at least two reversals on C′ that increase the number of
cycles by one unit each (Fig. 2, Case 1).
In the other two cases we can turn the cycle C into an oriented cycle C′ by
applying one reversal to a cycle D that closes an open gate from C. If C′ is short,
we can break it into two trivial cycles with a reversal, and this second reversal
turns D into a divergent cycle D′, which guarantees that we can apply a third
reversal to D′ that increases the number of cycles by one (Fig. 2, Case 2). If
C′ is long, then we can apply at least two reversals that increase the number of
cycles by one unit each (Fig. 2, Case 3).
In the three cases above we applied three reversals that increased the number
of cycles by two, and the theorem follows.
Proof. Let S = (S1, . . . , S|S|) be the sorting sequence generated by the algorithm
without considering the substitution of transpositions by reversals applied in line
14. Let S′ be the subsequence of operations applied in the while loop of lines
2 to 10. Each operation in S′ increases the number of cycles by at least one
unit, and each operation in S \ S′ (that is, the operations applied outside the
while loop) increases the number of cycles by 2/3 on average. By the condition
of line 2, we have that |S′| ≥ (1 − k)|S| and, therefore, the average increase in
the number of cycles per operation of S is at least

((1 − k)|S| + k|S| · 2/3) / |S| = 1 − k/3.

Since these operations increase the number of cycles by at most n + 1 − c(π),
we have that |S| ≤ (n + 1 − c(π)) / (1 − k/3).
In the final sequence, we may add four operations by replacing the last two
transpositions with six reversals (only if necessary). Therefore, the size of the
final sequence is at most (n + 1 − c(π)) / (1 − k/3) + 4.
Theorem 7. Algorithm 2 is a (2 − k)/(1 − k/3)-asymptotic approximation
algorithm for SbRTwPR.
Proof. Since the algorithm only adds transpositions while the condition of line
2 is satisfied, and at most two transpositions are added to the sorting sequence
in one iteration, we guarantee that |Sρ|/|S| ≥ k by replacing the last two
transpositions by reversals. By Lemma 7 and Theorem 4, the sequence S returned
by Algorithm 2 satisfies

|S| ≤ (n + 1 − c(π)) / (1 − k/3) + 4 ≤ ((2 − k)/(1 − k/3)) dk(π) + 4.

Therefore, it is a (2 − k)/(1 − k/3)-asymptotic approximation algorithm for
SbRTwPR.
5 Conclusion
References
1. Bafna, V., Pevzner, P.A.: Sorting permutations by transpositions. In: Proceedings
of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 1995),
pp. 614–623. Society for Industrial and Applied Mathematics, Philadelphia (1995)
2. Bafna, V., Pevzner, P.A.: Sorting by transpositions. SIAM J. Discret. Math. 11(2),
224–240 (1998)
3. Bulteau, L., Fertin, G., Rusu, I.: Sorting by transpositions is difficult. SIAM J.
Discret. Math. 26(3), 1148–1180 (2012)
4. Caprara, A.: Sorting permutations by reversals and Eulerian cycle decompositions.
SIAM J. Discret. Math. 12(1), 91–110 (1999)
5. Hannenhalli, S., Pevzner, P.A.: Transforming cabbage into turnip: polynomial algo-
rithm for sorting signed permutations by reversals. J. ACM 46(1), 1–27 (1999)
6. Kececioglu, J.D., Sankoff, D.: Exact and approximation algorithms for sorting by
reversals, with application to genome rearrangement. Algorithmica 13, 180–210
(1995)
7. Oliveira, A.R., Brito, K.L., Dias, U., Dias, Z.: On the complexity of sorting by
reversals and transpositions problems. J. Comput. Biol. 26, 1223–1229 (2019)
8. Oliveira, A.R., Brito, K.L., Dias, Z., Dias, U.: Sorting by weighted reversals and
transpositions. J. Comput. Biol. 26, 420–431 (2019)
9. Rahman, A., Shatabda, S., Hasan, M.: An approximation algorithm for sorting by
reversals and transpositions. J. Discret. Algorithms 6(3), 449–457 (2008)
10. Walter, M.E.M.T., Dias, Z., Meidanis, J.: Reversal and transposition distance of
linear chromosomes. In: Proceedings of the 5th International Symposium on String
Processing and Information Retrieval (SPIRE 1998), pp. 96–102. IEEE Computer
Society, Los Alamitos (1998)
Heuristics for Breakpoint Graph
Decomposition with Applications
in Genome Rearrangement Problems
1 Introduction
The rearrangement distance between two genomes is a problem in comparative
genomics that aims to find a minimum sequence of rearrangements required to
transform one genome into the other.
Genomes in this problem are generally represented as permutations, where
each element of the permutation corresponds to a gene. When the orientation of
the genes is known, we use a plus or minus sign in each element to indicate the
orientation, and we say that the permutation is signed. Otherwise, elements do
not have signs, and we say that the permutation is unsigned.
Since we can model one of the genomes as the identity permutation (i.e.,
permutation (1 2 . . . n) or (+1 +2 . . . +n)), the problem of transforming one
genome into another by rearrangements is equivalent to that of sorting permu-
tations by rearrangements. We often assume that these permutations have two
extra elements 0 and n + 1 at the beginning and at the end, respectively.
The most studied rearrangements are the reversal, transposition, and Double
cut-and-join (DCJ) operations. The problems of Sorting by Reversals and Sorting
by DCJs are solvable in polynomial time on signed permutations [8,13], while
they are NP-hard on unsigned permutations [3]. The problems of Sorting by
Transpositions [2] and Sorting by Reversals and Transpositions [11] are also
NP-hard.
The first bounds for these problems used the concept of breakpoints, which
are pairs of elements that are adjacent in the given permutation but not in the
identity permutation. Later, improved bounds and new algorithms were devel-
oped using the breakpoint graph of a permutation [1]. We formally define the
breakpoint graph of a permutation in Sect. 2.
These bounds for the rearrangement distance are based on the number of
cycles in a maximum cardinality decomposition of the breakpoint graph into
edge-colored disjoint cycles. This decomposition is unique on signed permu-
tations or when the model allows only transpositions on unsigned permuta-
tions. When considering reversals on unsigned permutations, Caprara [3] showed
that finding a maximum cardinality decomposition of a breakpoint graph of an
unsigned permutation is NP-hard, and the same is valid for DCJs.
Using the Breakpoint Graph Decomposition as a subproblem, Bafna and
Pevzner [1] presented a 7/4-approximation algorithm for the Sorting by Reversals
problem. This factor was improved by Christie [5] to 1.5 and by Lin and Jiang [10]
to 1.4193 + ε. Based on a similar strategy, Chen [4] presented a (1.4167 + ε)-
approximation algorithm for the Sorting by DCJs problem. More recently,
Jiang et al. [9] presented a randomized FPT algorithm for the Sorting by DCJs
with an approximation factor of (4/3 + ε).
In this paper, we propose a greedy algorithm and an algorithm based on
the Tabu Search metaheuristic for the Breakpoint Graph Decomposition. We
analyze the performance of these algorithms in practice and compare them with
the algorithm of Lin and Jiang [10]. Furthermore, we present experimental results
of these algorithms applied to the genome rearrangement distance considering a
model with DCJs and a model with reversals and transpositions.
This paper is organized as follows. Section 2 introduces the concepts used
in the algorithms and formalizes the problem. Section 3 presents the heuristics
created for the Breakpoint Graph Decomposition problem. In Sect. 4, we show
the experimental results for the heuristics of Sect. 3. At last, in Sect. 5, we give
our final remarks and discuss directions of future work.
2 Preliminaries
Let π be a permutation (π1 π2 . . . πn ), where πi ∈ {1, . . . , n} and πi = πj
iff i = j, for 1 ≤ i, j ≤ n. The identity permutation ιn is the permutation
(1 2 . . . n), which is the target of the sorting by rearrangements problems.
We extend π by adding the elements π0 = 0 and πn+1 = n + 1. In the next
definitions, we assume that permutations are in their extended form.
We say that (πi, πi+1) is a breakpoint if |πi+1 − πi| ≠ 1, for 0 ≤ i ≤ n. The
number of breakpoints in a permutation π is denoted by b(π).
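As a minimal illustration (ours, not the paper's), b(π) can be computed directly from this definition:

def breakpoints(perm):
    """b(pi) of an unsigned permutation, given without the extension;
    the elements 0 and n + 1 are appended here (sketch, ours)."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(b - a) != 1)

assert breakpoints([1, 2, 3]) == 0                  # the identity has none
assert breakpoints([6, 4, 1, 7, 3, 5, 2, 8]) == 8   # the permutation of Fig. 1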
[Fig. 1. (a) The breakpoint graph of the extended permutation (0 6 4 1 7 3 5 2 8 9); (b) and (c) two of its cycle decompositions.]
As shown in Fig. 1(a), each vertex in the breakpoint graph has a maximum of two
black edges and two gray edges. Our greedy algorithm has a subroutine called
bfs_cycle that, starting from a given vertex with incident edges, performs a
breadth-first search for an alternating cycle, as we explain in detail later in this
section. Depending on which vertex the algorithm chooses to start the search,
different variations could be applied:
– using the leftmost vertex with incident edges, which we call the FIRST app-
roach;
– choosing a vertex at random among those still having incident edges, which
we call the RANDOM approach;
– executing for all vertices still having incident edges, creating a list with the
returned cycles, and, at the end, picking the smallest cycle from this list,
which we call the ALL approach.
The selected cycle is added to the list of cycles and all edges in that cycle
are removed from the breakpoint graph. These steps are repeated until there are
no edges left in the graph, in which case the greedy algorithm stops. Variations
FIRST and RANDOM take O(n²) while the variation ALL takes O(n³).
As the RANDOM variation is non-deterministic, we also developed a variation
called MAX, which runs RANDOM k times and returns the decomposition with the
largest number of cycles. The best trade-off between execution time and solution
quality was achieved using k = n√n, so the MAX variation has a time complexity
of O(n³√n).
The bfs_cycle subroutine: given a vertex v, we perform a breadth-first
search starting at v, with the constraint that we explore a new vertex u only if
the explored path from v to u is an alternating path (i.e., consecutive edges have
distinct colors).
When exploring the edges of a vertex u, if there exists an edge (u, v) such
that the path from v to u plus the edge (u, v) forms an alternating cycle, then
we stop the search, returning this cycle. Since we do this using a breadth-first
search, this subroutine finds the smallest alternating cycle starting at v.
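A compact sketch of bfs_cycle (our illustration, with our own adjacency-list representation of the edge-colored graph) is shown below; it starts with a black edge out of v and closes the cycle with a gray edge back to v, which is sufficient since alternating cycles in a breakpoint graph alternate the two colors:

from collections import deque

def bfs_cycle(adj, v):
    """Smallest alternating cycle through v, as a BFS over
    (vertex, last-edge-color) states. adj maps a vertex to a list of
    (neighbor, color) pairs with color in {'black', 'gray'}."""
    start = [(u, 'black') for u, c in adj[v] if c == 'black']
    queue = deque((u, 'black', [v, u]) for u, _ in start)
    seen = {(u, 'black') for u, _ in start}
    while queue:
        u, last, path = queue.popleft()
        for w, color in adj[u]:
            if color == last:              # edges must alternate colors
                continue
            if w == v and color == 'gray':
                return path                # closed an alternating cycle at v
            if (w, color) not in seen:
                seen.add((w, color))
                queue.append((w, color, path + [w]))
    return None                            # no alternating cycle through v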
As an example consider the breakpoint graph of Fig. 1(a). The FIRST app-
roach returns the decomposition of Fig. 1(c). It starts at vertex 0, finds the small-
est alternating cycle starting at 0, which is the cycle with vertices (0, 6, 7, 1), and
removes its edges from the graph. Then, it proceeds to find the smallest cycle
starting at vertex 6, which still has black edges. The algorithm finds the cycle
with vertices (6, 4, 3, 5) and removes its edges from the graph. The algorithm
continues until there are no edges in the graph.
Consider the same example but executing the RANDOM approach. Suppose
that the vertices 6, 1, and 0 were chosen in each iteration, in this order. Then,
the algorithm returns the decomposition of Fig. 1(b). In the first iteration, the
algorithm finds the smallest cycle starting at vertex 6, which is the cycle with
vertices (6, 4, 3, 5). In the next iteration, the algorithm finds the smallest cycle
starting at 1, which is the cycle with vertices (1, 7, 8, 2). At this point, the graph
has only one cycle that is chosen in the last iteration.
As a heuristic, our greedy algorithm described above cannot guarantee an
approximation factor. However, it is possible to adapt it in such a way that it
guarantees the same approximation as the Lin and Jiang algorithm (L&J) [10].
Let C be the cycle decomposition returned by L&J. We append to our cycle list
every cycle from C with fewer than 4 black edges (i.e., the list of cycles that is
enough to guarantee the approximation factor of 1.4193 + ε), remove their edges
from the breakpoint graph, and run the breadth-first search over this modified
breakpoint graph until no more edges exist. The L&J algorithm requires O(n³)
time to generate all possible short cycles (i.e., cycles of size 4 and 6) plus O(3^m)
time to calculate the greatest subset of disjoint short cycles, where m is the
number of short cycles, with ε = 0.
Fig. 2. (a) Example of a solution for BGD with two cycles C1 (straight edges) and
C2 (dashed edges). These cycles have three vertices (u, v, and w) in common and
they satisfy the conditions to apply the first movement. (b) Three cycles generated by
applying the first movement in the cycles C1 and C2 .
The second movement is applied to two cycles C1 and C2 that have at least
two common vertices u and v, transforming them into two distinct cycles.
Let P¹uv and Q¹vu be the two distinct paths from u to v and from v to u in
cycle C1, respectively, such that P¹uv begins with a black edge. Let P²uv and
Q²vu be the two distinct paths from u to v and from v to u in cycle C2,
respectively, such that P²uv begins with a black edge. Note that both Q¹vu and
Q²vu end with a gray edge.
The second movement can be applied if the length of P¹uv has the same parity
as the length of P²uv. This movement creates two cycles C′1 and C′2 (Fig. 3).
Fig. 3. (a) Example of a solution for BGD with two cycles C1 (dashed edges) and
C2 (straight edges). These cycles have two vertices (u and v) in common and they
satisfy the conditions to apply the second movement. (b) Two new cycles generated
by applying the second movement in the cycles C1 and C2 .
4 Experimental Results
In order to check the efficiency of our heuristics, we implemented Lin and Jiang’s
algorithm (L&J), the greedy heuristic, and the Tabu Search. To compare them,
we generated sets of permutations such that the number of breakpoints is as
large as possible (i.e., n+1 for a permutation of size n). Thus, every breakpoint
graph has n+1 black edges and every vertex has two black edges and two gray
edges, with the exception of the first and the last vertices which have only one
edge of each color. Our sets are separated by permutation size, from 10 to 100
in intervals of 10, with 100 permutations each.
Four different experiments were performed: the first two consist of (i) com-
paring the different versions of the greedy algorithm against the L&J; and (ii)
indicating to the greedy algorithm that it must contain the same short cycles as
those of L&J (so that our greedy algorithm also guarantees the same approxima-
tion factor). The other two experiments consist of running the Tabu Search on
top of the first two tests to check its improvement. Results for variant RANDOM
of the greedy algorithm are the average of 100 executions for each permutation.
Table 1 shows the average number of cycles returned by L&J and the four
variants of the greedy algorithm, namely FIRST, ALL, RANDOM and MAX. We can
see that, except for permutations of size up to 20, all variations of the greedy
algorithm returned decompositions with a greater number of cycles on average
compared to those returned by L&J. Besides, the decompositions obtained by
MAX have on average more cycles than L&J for all permutations of size greater
than or equal to 20. For permutations of size greater than or equal to 60, the
MAX heuristic returned cycle decompositions whose number of cycles are at least
50% greater on average than the cycle decompositions returned by L&J. Results
of Experiment 2 show that by using the same set of short cycles from L&J in
the cycle decomposition, the variations FIRST, ALL, and RANDOM returned cycle
decompositions with more cycles compared to their results on Experiment 1.
However, the MAX variation, which produces the best results on average, returned
cycle decompositions with a smaller number of cycles on average compared to
the results obtained in Experiment 1, for permutation sizes between 20 and 80.
Table 2 shows the average number of cycles returned by Tabu Search using
the output of L&J and the four variants of the greedy algorithm. We can see that
Tabu Search was able to improve the results of all algorithms, especially L&J,
which had a great improvement (probably because it returned the lowest average
number of cycles, with a large room for improvement). The MAX variation remains
as the one that returns the greatest number of cycles on average, and the same
behavior of Experiment 2 happened here: on average, the results of Experiment
4 are slightly worse than the results of Experiment 3, for permutation sizes
between 20 and 90.
Table 1. Average number of cycles returned by L&J and by the four variants of the
greedy algorithm on Experiments 1 and 2. Each value represents the average for 100
permutations of size n, indicated in the first column; for the RANDOM approach, we
also averaged 100 executions per permutation.
n    Experiment 1                        Experiment 2
     L&J   FIRST  ALL    RANDOM  MAX     FIRST  ALL    RANDOM  MAX
10 4.14 3.99 4.09 3.98 4.14 4.14 4.14 4.14 4.14
20 6.58 6.23 6.63 6.35 6.89 6.75 6.75 6.75 6.75
30 8.01 8.04 8.75 8.27 9.22 8.89 8.90 8.88 8.91
40 9.12 9.98 10.70 10.06 11.39 10.81 10.89 10.84 10.97
50 9.70 11.39 12.75 11.67 13.45 12.61 12.74 12.62 12.97
60 10.20 13.06 14.38 13.20 15.38 14.25 14.47 14.27 14.88
70 10.97 14.45 16.36 14.73 17.13 16.09 16.33 16.02 16.84
80 11.52 15.69 18.16 16.30 19.01 17.77 18.24 17.77 18.95
90 11.73 17.27 19.51 17.60 20.44 19.14 19.63 19.13 20.72
100 12.31 18.35 21.35 19.03 22.17 20.67 21.27 20.71 22.54
Theorem 1 (Bafna and Pevzner [1]). For any permutation π, the reversal
distance dr (π) ≥ b(π) − c(π).
Theorem 2 (Chen [4] and Yancopoulos et al. [13]). For any permutation
π, the DCJ distance dDCJ (π) ≥ b(π) − c(π). Also, given a decomposition H for
G(π), the DCJ distance dDCJ (π) ≤ b(π) − |H|.
Theorem 3 (Rahman et al. [12]). For any permutation π, the reversal and
transposition distance drt (π) ≥ (b(π) − c(π))/2. Also, given a decomposition H
for G(π), the reversal and transposition distance drt (π) ≤ b(π) − |H|.
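For reference, a small wrapper (ours, not the paper's code) showing how b(π), c(π), and the size |H| of a found decomposition plug into these bounds:

def distance_bounds(b, c, h):
    """Bounds from Theorems 1-3 (sketch, ours): b = b(pi), c = c(pi) from a
    maximum decomposition, h = |H| for the decomposition actually found."""
    dcj = (b - c, b - h)            # Theorem 2: lower and upper DCJ bounds
    rt = ((b - c) / 2, b - h)       # Theorem 3: reversal+transposition bounds
    return dcj, rt

print(distance_bounds(b=9, c=4, h=3))   # e.g. DCJ in [5, 6], RT in [2.5, 6]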
Table 2. Average number of cycles returned by the Tabu Search heuristic using the
cycle decomposition returned by each algorithm as input on Experiments 3 and 4.
Experiment 3 consists of using the Tabu Search heuristic in the solutions of Experiment
1 and Experiment 4 consists of using the Tabu Search heuristic in the solutions of
Experiment 2. Each value represents the average for 100 permutations of size n, which
is indicated in the first column. For the RANDOM approach, we also did the average of
100 executions for each permutation.
n Experiment 3 Experiment 4
L&J FIRST ALL RANDOM MAX FIRST ALL RANDOM MAX
10 4.14 4.09 4.12 4.14 4.14 4.14 4.14 4.14 4.14
20 6.76 6.50 6.80 6.89 6.89 6.77 6.77 6.81 6.81
30 8.87 8.54 9.00 9.22 9.22 9.03 9.02 9.13 9.13
40 10.85 10.50 11.12 11.39 11.40 11.08 11.14 11.33 11.33
50 12.55 12.28 13.06 13.45 13.58 12.99 13.08 13.40 13.42
60 14.39 13.97 14.90 15.40 15.61 14.75 14.85 15.34 15.41
70 16.16 15.72 16.78 17.15 17.48 16.60 16.79 17.35 17.41
80 17.90 17.18 18.67 19.11 19.63 18.39 18.77 19.42 19.54
90 19.24 18.65 20.17 20.58 21.33 19.87 20.24 21.14 21.26
100 20.72 20.29 21.96 22.33 23.27 21.56 21.99 22.94 23.14
Table 3. Results for the DCJ distance and the reversals and transpositions (RT) dis-
tance using the original L&J algorithm (Table 1) and the algorithms of the fourth
experiment (Table 2) as a subroutine for the genome rearrangements algorithms. Each
value represents the average for 100 permutations of size n, which is indicated in the
first column. For the RANDOM approach, we also did the average of 100 executions for
each permutation.
5 Conclusion
In this paper, we studied the problem of Breakpoint Graph Decomposition, which
is associated with the genome rearrangement distance on unsigned permutations.
We developed a greedy algorithm and an algorithm based on the Tabu Search
metaheuristic for this problem. In our experiments, the proposed algorithms
yielded better results than the algorithm proposed by Lin and Jiang [10]. The
Tabu Search algorithm, which is given a feasible solution and looks for better
ones through local search, was able to improve the solutions returned by all the
algorithms.
References
1. Bafna, V., Pevzner, P.A.: Genome rearrangements and sorting by reversals. SIAM
J. Comput. 25(2), 272–289 (1996)
2. Bulteau, L., Fertin, G., Rusu, I.: Sorting by transpositions is difficult. SIAM J.
Discret. Math. 26(3), 1148–1180 (2012)
3. Caprara, A.: Sorting permutations by reversals and Eulerian cycle decompositions.
SIAM J. Discret. Math. 12(1), 91–110 (1999)
4. Chen, X.: On sorting unsigned permutations by double-cut-and-joins. J. Comb.
Optim. 25(3), 339–351 (2013). https://doi.org/10.1007/s10878-010-9369-8
5. Christie, D.A.: A 3/2-approximation algorithm for sorting by reversals. In: Pro-
ceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA
1998), pp. 244–252. Society for Industrial and Applied Mathematics, Philadelphia
(1998)
6. Glover, F.W.: Tabu search - Part I. INFORMS J. Comput. 1(3), 190–206 (1989)
7. Glover, F.W.: Tabu search - Part II. INFORMS J. Comput. 2(1), 4–32 (1990)
8. Hannenhalli, S., Pevzner, P.A.: Transforming cabbage into turnip: polynomial algo-
rithm for sorting signed permutations by reversals. J. ACM 46(1), 1–27 (1999)
9. Jiang, H., Pu, L., Qingge, L., Sankoff, D., Zhu, B.: A randomized FPT approxima-
tion algorithm for maximum alternating-cycle decomposition with applications. In:
Wang, L., Zhu, D. (eds.) COCOON 2018. LNCS, vol. 10976, pp. 26–38. Springer,
Cham (2018). https://doi.org/10.1007/978-3-319-94776-1_3
10. Lin, G., Jiang, T.: A further improved approximation algorithm for breakpoint
graph decomposition. J. Comb. Optim. 8(2), 183–194 (2004). https://doi.org/10.1023/B:JOCO.0000031419.12290.2b
11. Oliveira, A.R., Brito, K.L., Dias, U., Dias, Z.: On the complexity of sorting by
reversals and transpositions problems. J. Comput. Biol. 26, 1223–1229 (2019)
12. Rahman, A., Shatabda, S., Hasan, M.: An approximation algorithm for sorting by
reversals and transpositions. J. Discret. Algorithms 6(3), 449–457 (2008)
13. Yancopoulos, S., Attie, O., Friedberg, R.: Efficient sorting of genomic permutations
by translocation inversion and block interchange. Bioinformatics 21(16), 3340–3346
(2005)
Center Genome with Respect
to the Rank Distance
1 Introduction
The rank distance between matrices has been very successfully used in coding
theory since at least 1985, when Gabidulin published his discoveries in matrix
codes [5]. Recently, applications of the rank distance to genome evolution, specifi-
cally in the area of genome rearrangements, started to emerge [9]. In this context,
the genome median problem, which takes a number of genomes A1 , A2 , . . . , Ak
and aims to find a genome M such that d(A1 , M ) + d(A2 , M ) + . . . + d(Ak , M ) is
minimized, is relevant. This problem can be stated for any genomic distance, not
just the rank distance. In many cases, the genome median problem is NP-hard,
but a number of approximate methods have been developed.
With regard to genome medians, much work has been published, especially in
the case of exactly three inputs. This is one of the seminal steps in building phylo-
genetic trees. Finding a genome median is NP-hard for several genome distances,
with the exception of SCJ and breakpoint for multichromosomal genomes [4,8].
Center genomes, also called closest genomes or minimax genomes, are also
aimed at somehow representing all the inputs, as a sort of average genome.
The center genome problem takes genome inputs A1 , A2 , . . . , Ak and looks
for a genome M minimizing max(d(A1 , M ), d(A2 , M ), . . . , d(Ak , M )). There is
an important difference between using central genomes and median genomes as
subroutines for ancestral reconstruction methods: when just two inputs are used
for the median, the solution will probably not be very relevant, because many
solutions exist, including both input genomes and anything on an optimal path
from one to the other; on the other hand, the center genome, even with just
two inputs, is already restricted enough to be relevant with respect to ancestral
genomes.
For any distance defined as the minimum number of operations, when all
operations have the same weight, clearly the theoretical lower bound for two
genomes is readily achievable: it suffices to start at one of the genomes and
walk towards the other, stopping when the right number of steps has been
performed. However, if an arbitrary number of inputs is allowed, the problem
becomes NP-hard, even for very simple distances such as the SCJ [2].
In contrast, distance measures where operations have distinct weights may
not be able to always attain the lower bound. Here we concentrate on two inputs
and examine the rank distance, which can be defined as the rank of A − B for
genomes (matrices) A and B, but also as the minimum number of cuts, joins,
and double swaps, with weights 1, 1, and 2, respectively, that bring one genome
to the other. Since we have different weights, it is not obvious the lower bound
can be achieved. In fact, we show that it cannot in the case where d(A, B) = 2n
with n odd. In all other cases, the lower bound is achieved.
The rest of this paper is organized as follows. Section 2 contains the defini-
tions used throughout the text. Section 3 presents the results. Finally, Sect. 4
summarizes our work and points to possible continuation of this research.
2 Definitions
For genomes with just one gene, we have just two extremities. There are only
two genomes: one with an adjacency linking these two extremities, and the other
with just telomeres. Here are some examples of genomes over two genes:
C = ⎡ 0 1 0 0 ⎤    D = ⎡ 1 0 0 0 ⎤
    ⎢ 1 0 0 0 ⎥        ⎢ 0 1 0 0 ⎥
    ⎢ 0 0 1 0 ⎥        ⎢ 0 0 0 1 ⎥
    ⎣ 0 0 0 1 ⎦        ⎣ 0 0 1 0 ⎦
Genome matrices are therefore square matrices of size (2n) × (2n) and have
the following properties:
– They are binary matrices, i.e., have 0's and 1's only.
– They are symmetric matrices, that is, they satisfy Aᵀ = A.
– They are orthogonal matrices, that is, they satisfy Aᵀ = A⁻¹.
– They are involutions, that is, they satisfy A² = I.
It is easy to verify that any two of the last three properties imply the third
one. For binary matrices, being an orthogonal matrix is equivalent to having
just one 1 in each row and in each column. Such binary matrices are called per-
mutation matrices. We can then say that genome matrices are permutation
matrices that are involutions.
Extremities x such that Ax = x are called telomeres of A. A genome with no
telomeres is called circular. Two genomes with exactly the same set of telomeres
are called co-tailed.
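As an illustration (ours, using numpy and a hypothetical helper with 0-indexed extremities), a genome matrix can be built from its adjacencies and the rank distance computed directly, reproducing the example of the next section:

import numpy as np

def genome_matrix(n_extremities, adjacencies):
    """Genome as a permutation matrix that is an involution: adjacent
    extremities map to each other, telomeres to themselves (helper is ours,
    with 0-indexed extremities)."""
    A = np.eye(n_extremities, dtype=int)
    for x, y in adjacencies:
        A[x, x] = A[y, y] = 0
        A[x, y] = A[y, x] = 1
    return A

A = genome_matrix(4, [(0, 1), (2, 3)])   # circular: adjacencies {1,2}, {3,4}
B = genome_matrix(4, [(0, 2), (1, 3)])   # circular: adjacencies {1,3}, {2,4}
print(np.linalg.matrix_rank(A - B))      # rank distance d(A, B) = 2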
3 Results
We recall a lower bound for the score relative to two genomes, and show exactly
the cases where it is possible to achieve such a score. We also show that, in any
case, it is always possible to find a genome within 1 unit of the lower bound.
We start by recalling the notion of intermediate genomes, defined as
genomes that appear in an optimal scenario between two genomes A and B. The
definition depends on A and B, so sometimes we will call them AB-intermediates
for improved clarity. Although initially defined for DCJ [3], the definition works
for any distance.
In addition to being optimal scenario members, intermediate genomes can
be characterized as those for which the triangle inequality becomes an equality.
They are also the medians of two genomes.
Given two genomes A and B, a center genome for them is a genome M
that minimizes the score sc(M; A, B), defined as:

sc(M; A, B) = max{d(A, M), d(B, M)}.

The triangle inequality almost immediately gives a lower bound on the score:

sc(M; A, B) ≥ ⌈d(A, B)/2⌉.
Consider, for instance, the circular genomes A, with adjacencies {1, 2} and
{3, 4}, and B, with adjacencies {1, 3} and {2, 4}:

A = ⎡ 0 1 0 0 ⎤    B = ⎡ 0 0 1 0 ⎤
    ⎢ 1 0 0 0 ⎥        ⎢ 0 0 0 1 ⎥
    ⎢ 0 0 0 1 ⎥        ⎢ 1 0 0 0 ⎥
    ⎣ 0 0 1 0 ⎦        ⎣ 0 1 0 0 ⎦

To compute their distance, let's subtract B from A:

A − B = ⎡  0  1 −1  0 ⎤
        ⎢  1  0  0 −1 ⎥
        ⎢ −1  0  0  1 ⎥
        ⎣  0 −1  1  0 ⎦
This matrix has rank 2. Both A and B are circular genomes, since they do
not have telomeres. Now for a circular genome such as A, the only genomes at
distance 1 from it are the ones obtained by cutting an adjacency, since no extra
adjacencies can be added to A. Genome A has only two adjacencies, so there are
just two genomes at distance 1 from it, namely:
A1 = ⎡ 1 0 0 0 ⎤    A2 = ⎡ 0 1 0 0 ⎤
     ⎢ 0 1 0 0 ⎥         ⎢ 1 0 0 0 ⎥
     ⎢ 0 0 0 1 ⎥         ⎢ 0 0 1 0 ⎥
     ⎣ 0 0 1 0 ⎦         ⎣ 0 0 0 1 ⎦
However, it can be readily verified that neither of these two genomes is at
distance 1 from B; in fact, both are at distance 3 from B. We conclude
that the center conjecture is not true.
Lemma 5. If A and C are co-tailed genomes and d(A, C)/2 is odd, then there
is no genome B satisfying the center lower bound.
or d(A, Bi) ≤ d(A, C)/2 + 1, since both sides are integers.
An analogous result is valid for joins, saying that joins can be brought back
through double swaps, but we won’t need it now.
Lemma 8. Let A and C be two genomes not co-tailed. Then, for every integer
i such that 0 ≤ i ≤ d(A, C) there is an intermediate genome B between A and
C with d(A, B) = i.
Proof. By induction on d(A, C). The base case is d(A, C) = 1, because A and
C are not co-tailed and hence cannot be equal. The statement is clearly true for
d(A, C) = 1 because in this case we only have two possibilities for i, namely,
i = 0 or i = 1, and we can take B = A for i = 0 and B = C for i = 1.
Now assume d(A, C) ≥ 2 and consider an integer i such that 0 ≤ i ≤ d(A, C).
Since A and C are not co-tailed, there is either a telomere in A not shared by C
or a telomere in C not shared by A. Without loss of generality, we may assume
that there is a telomere in C not shared by A, otherwise we can just exchange
A and C and i with d(A, C) − i.
Given that there is a telomere x in C that is not an A-telomere, destroying
the adjacency of x in A gives us a cut P applicable to A such that A + P is an
intermediate genome between A and C. If A + P is not co-tailed with C, we can
apply the induction hypothesis to A + P and C and get intermediate genomes
at an arbitrary distance j from A + P , provided that 0 ≤ j ≤ d(A + P, C) =
d(A, C) − 1, which will be at distance j + 1 from A. This covers all the distances
we need except 0, for which we can take B = A.
Now if A + P is co-tailed with C, then they are distinct, since d(A, A + P) = 1
and d(A, C) ≥ 2. Co-tailed genomes can be sorted by double swaps, so there is
a double swap Q applicable to A + P yielding an intermediate genome A + P + Q
between A + P and C. However, according to Lemma 7, a cut can go forward past
a double swap, which means that Q is applicable to A. The resulting genome,
A + Q, is intermediate between A and C because A + Q + P is just another way
of getting to A + P + Q, which we know is intermediate between A and C. We
can then apply the induction hypothesis to A + Q and C, which are not co-tailed
since A + Q is co-tailed with A, obtaining intermediate genomes at distances i
from A for 2 ≤ i ≤ d(A, C). For i = 0 we have A, and for i = 1 we have A + P .
This completes the induction step and the proof of our lemma.
d(A, B) = d(A, C)/2 and d(B, C) = d(A, C)/2.

However, there is a genome matrix B such that:

d(A, B) = d(A, C)/2 + 1 and d(B, C) = d(A, C)/2 − 1.
Proof. Part 1 is a consequence of Lemma 8, since 0 ≤ d(A, C)/2 ≤ d(A, C).
Part 2 is a consequence of Lemma 4. Part 3 is a consequence of Lemmas 5 and 6.
4 Conclusions
In this paper we showed that center genomes do not always attain the
theoretical lower bound in the case of two genomes with respect to the rank
distance. In spite of that, they are easy to calculate, and they provide an
attractive alternative to the median in ancestral genome reconstruction, even in
the two-input version, which is already more restrictive than its median
counterpart. Given that computing a median is NP-hard for the majority of
relevant distances, its replacement by a center solution would bring a
significant gain.
Nevertheless, it would be interesting to extend this analysis to three inputs,
and determine what happens there. Probably the arbitrary input version is NP-
hard, as similar problems with simpler distances have already been proved NP-
hard [2,7]. In addition, considering genomes with unequal gene content would
also be worthwhile.
References
1. Chindelevitch, L., Zanetti, J.P.P., Meidanis, J.: On the rank-distance median of 3
permutations. BMC Bioinform. 19(Suppl 6), 142 (2018). https://doi.org/10.1186/
s12859-018-2131-4
2. Cunha, L.F.I., Feijão, P., dos Santos, V.F., Kowada, L.A., de Figueiredo, C.M.H.:
On the computational complexity of closest genome problems. Discret. Appl. Math.
274, 26–34 (2020)
3. Feijão, P.: Reconstruction of ancestral gene orders using intermediate genomes.
BMC Bioinform. 16(Suppl 14), S3 (2015). https://doi.org/10.1186/1471-2105-16-
S14-S3
4. Feijão, P., Meidanis, J.: SCJ: a breakpoint-like distance that simplifies several rear-
rangement problems. Trans. Comput. Biol. Bioinform. 8, 1318–1329 (2011)
5. Gabidulin, E.M.: Theory of codes with maximum rank distance. Probl. Peredachi
Inf. 21(1), 3–16 (1985)
6. Meidanis, J., Biller, P., Zanetti, J.P.P.: A matrix-based theory for genome rearrange-
ments. Technical Report, IC-18-10. Institute of Computing, University of Campinas,
August 2018
7. Popov, V.: Multiple genome rearrangement by swaps and by element duplications.
Theor. Comput. Sci. 385(1–3), 115–126 (2007)
8. Tannier, E., Zheng, C., Sankoff, D.: Multichromosomal median and halving problems
under different genomic distances. BMC Bioinform. 10(1), 120 (2009). https://doi.
org/10.1186/1471-2105-10-120
9. Zanetti, J.P.P., Biller, P., Meidanis, J.: Median approximations for genomes modeled
as matrices. Bull. Math. Biol. 78(4), 786–814 (2016). A preliminary version appeared
in the Proceedings of the Workshop on Algorithms in Bioinformatics (WABI) 2013
ImTeNet: Image-Text Classification
Network for Abnormality Detection
and Automatic Reporting on
Musculoskeletal Radiographs
1 Introduction
Musculoskeletal disorders represent a major health problem that affects a large
part of the population. X-ray images are among the most commonly accessible
radiological examinations used to detect and locate abnormalities in radiographic
studies. Interpreting imaging examinations is typically a complex task, and
specialist doctors are usually responsible for conducting it [8]. The first step
is to read and analyze the images in order to build the knowledge needed to write
a report on what is in the image and to provide a diagnosis, e.g., normal or
abnormal.
Determining whether a radiographic study is normal or abnormal is a chal-
lenging problem. If a study is interpreted as normal, the possibility of disease is
excluded, which may eliminate the need for the patient to undergo other tests or procedures [13]. In Fig. 1, we present two examples of images from the MURA dataset and fracture localization using activation maps.
To assist in the diagnostic process, many automated computational methods, such as Computer-Aided Diagnosis (CAD), have been explored [10]. With advances in techniques such as natural language processing (NLP) and computer vision (CV), various deep learning methods have been developed to automatically interpret medical images and reports in order to assist clinicians who perform these tasks on a daily basis.
More recently, transformers have shown success in labeling radiological images [5,16]. However, these methods require a large amount of resources to annotate the data manually in order to obtain a higher classification score.
Several deep learning methods have been developed for the tasks of classification, localization, and interpretation of radiology images. Recent advances in deep convolutional neural network (DCNN) architectures have improved the performance of CAD systems, which support health experts [14]. Advances in hardware and software have also made it possible for the amount of medical data collected per patient to increase considerably [12].
In this work, we propose an image-text classification network, called ImTeNet, for the automatic detection and reporting of fractures on musculoskeletal radiographs. Our method uses information extracted from images and artificially generated captions to obtain a classification label. Initially, we use a caption generator model to create textual data for a dataset without text information. Then, we apply our multi-modal model, which consists of two distinct networks, DenseNet-169 and BERT, performing the image and text classification tasks, respectively, and a fusion module that receives the concatenation of the feature vectors extracted from both.
Our main contributions are summarized as follows: (i) the proposed method
is applied with a clinical objective for diagnostic abnormality recognition, (ii) dis-
tinct classifiers based on deep learning are trained with radiographs and texts for
abnormality classification and (iii) experimental results on the MURA dataset
show that the combination of image and text features can increase the classifi-
cation results.
The text is organized as follows. Section 2 describes some relevant approaches related to the topic under investigation. Section 3 presents the proposed image-text classification network.
2 Related Work
Deep learning methods are the state of the art for classification tasks in the med-
ical field. Moreover, Deep Convolutional Neural Networks (DCNNs) are widely
used mainly for image domains [3,13,21]. Many Natural Language Processing
(NLP) methods have been developed to extract structured labels from free-text
medical reports [1,12]. Our work is closely related to approaches that explore
the use of DCNNs on medical images, with the aim of extracting relevant infor-
mation, as well as approaches that explore textual medical information and the
combination of both.
Wang et al. [21] explored the use of many pre-trained DCNN models to
perform a multi-label classification of thoracic disease on chest X-ray images.
Chen et al. [3] proposed a dual asymmetric feature learning network called
DualCheXNet to explore the complementarity and cooperation between the
two networks to learn discriminative features. Rajpurkar et al. [13] proposed
the Musculoskeletal Radiographs (MURA) dataset and explored the use of a
DenseNet-169 [7] to perform a binary classification task.
Smit et al. [16] proposed a method, called CheXbert, for radiology report labeling, combining existing report labelers with hand annotations to obtain accurate results. They used a BERT [4] model and obtained state-of-the-art results for report labeling on chest X-ray datasets. Pelka et al. [12] proposed an approach to combine automatically generated keywords with radiographs, presenting a method that enables multi-modal image representations by fusing textual information with radiographic images to perform body part and abnormality recognition on the MURA dataset. In contrast to these works, our approach explores the cooperation between image and text networks to learn the classification of abnormalities in a complementary and accurate way.
3 Proposed Method
In this section, we present details of the proposed method. We provide some
notations used throughout this paper, as well as describe the different tasks
performed and how we combine each one to produce our proposed architecture.
The ROCO dataset contains 81,825 radiology images spanning several medical imaging modalities, such as X-ray, Computed Tomography (CT), Ultrasound, Mammography, Magnetic Resonance Imaging (MRI), and various others. All images in ROCO have corresponding caption information.
For this caption generation task, inspired by Pelka et al. [12], we propose an encoder-decoder framework [8,20]. As the encoder, we use a Convolutional Neural Network (CNN) model, ResNet-152 [6], pretrained on ImageNet [15]. The encoder extracts a feature vector from an input image, and this feature vector is linearly transformed to serve as the input to the decoder network. The decoder is a Long Short-Term Memory (LSTM) network, which receives as input the feature vector from the encoder and produces the image caption as output.
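As an illustration of this encoder-decoder design, a minimal PyTorch sketch follows; the module names, embedding sizes, and the absence of a training loop are our assumptions, not details from the paper:

```python
import torch
import torch.nn as nn
from torchvision import models

class EncoderCNN(nn.Module):
    """ResNet-152 backbone whose classification head is replaced by a projection."""
    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet152(weights=None)  # load ImageNet weights in practice
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.proj = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                    # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)  # (B, 2048)
        return self.proj(feats)                   # (B, embed_size)

class DecoderLSTM(nn.Module):
    """LSTM decoder that consumes the image feature as its first input step."""
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_feat, captions):        # captions: (B, T) token ids
        steps = torch.cat([img_feat.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(steps)
        return self.out(hidden)                   # (B, T + 1, vocab_size) logits
```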
(Fig. 2. Architecture overview: a DCNN encoder, linear projection, and LSTM decoder form the caption generator; the TCL and ICL branches feed the FCL fusion module, composed of the dense layers Linear(6656, 256), Linear(256, 128), and Linear(128, 1).)
Figure 2 shows the caption generation model (CG) scheme. We trained the model using a corpus of paired images and captions from the ROCO dataset (no additional datasets were used for training). After the model training step, we built a dataset, called MURA caption, by executing the caption generator on each image in MURA. At the end of this task, we have, for each sample, the original image and label from MURA together with an artificially generated caption.
Our focus in this work is to develop and evaluate an approach based on deep
learning that, using multi-modal information, can make the detection of abnor-
malities in musculoskeletal radiograph samples more accurate. The intuition
behind our proposed method is that the combination of visual and text informa-
tion will benefit the classification task, improving its results. Figure 2 illustrates
the main steps of our architecture. After the caption generation task, three steps
are applied: (i) Image Classification Level (ICL), (ii) Text Classification Level
(TCL), and (iii) Fusion Classification Level (FCL).
1. ICL Step: In this step, we trained a DCNN model, DenseNet-169 [7], pre-trained on ImageNet [15], to perform image classification; we also used this model as a feature extractor. To this model we added an attention module, Class Activation Mapping (CAM) [22], which is employed to indicate the discriminative region detected by the DCNN model to identify the correct class. After this attention module, we applied an Average-Max (AVG-MAX) pooling layer to reduce computational complexity and extract low-level (average) and high-level (max) features from the neighborhood. The AVG-MAX pooling is the concatenation of the average pooling and max pooling results.
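The AVG-MAX pooling just described admits a direct PyTorch sketch (the module name is ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AvgMaxPool(nn.Module):
    """Concatenates global average pooling (low-level, average features) with
    global max pooling (high-level, max features) over the spatial dimensions."""
    def forward(self, x):                             # x: (B, C, H, W)
        avg = F.adaptive_avg_pool2d(x, 1).flatten(1)  # (B, C)
        mx = F.adaptive_max_pool2d(x, 1).flatten(1)   # (B, C)
        return torch.cat([avg, mx], dim=1)            # (B, 2C)
```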
2. TCL Step: In this step, we proposed a Natural Language Processing (NLP) approach to extract structured labels from free-text image captions. We fine-tuned the BERT [4] model to perform text classification. The BERT architecture is based on a multilayer bidirectional transformer [18], pre-trained in an unsupervised way on two tasks: masked language modeling and next-sentence prediction [2]. Our proposed method follows the same architecture as BERT. Each image caption is tokenized, and the maximum number of tokens in each input sequence is limited to 64. The final-layer hidden state is then fed as input to each of the BERT linear heads. We changed the BERT output dimension to 1 to cover our binary classification problem. We also used this model as a feature extractor and then applied an Average-Max (AVG-MAX) pooling operation over the hidden-state layers.
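The TCL step could be sketched with the Hugging Face transformers library as below; the checkpoint name and the use of the first ([CLS]) hidden state for the linear head are our assumptions:

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextBranch(nn.Module):
    def __init__(self, checkpoint="bert-base-uncased", max_len=64):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        self.bert = AutoModel.from_pretrained(checkpoint)
        self.max_len = max_len
        self.head = nn.Linear(self.bert.config.hidden_size, 1)  # binary output

    def forward(self, captions):                  # captions: list of strings
        enc = self.tokenizer(captions, padding=True, truncation=True,
                             max_length=self.max_len, return_tensors="pt")
        hidden = self.bert(**enc).last_hidden_state  # (B, T, 768)
        return self.head(hidden[:, 0])               # (B, 1) logit per caption
```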
3. FCL Step: In this step, the output features from the ICL and TCL models are concatenated into a single feature vector used as input to the FCL. In the fusion classifier, the M input features are fed into three dense layers. As the loss function, we used the Binary Cross-Entropy Loss (BCEL).
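A sketch of the fusion classifier, using the dense-layer sizes recoverable from Fig. 2 (6656 → 256 → 128 → 1); the ReLU activations between the layers are an assumption:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenates the ICL and TCL feature vectors and applies three dense layers."""
    def __init__(self, in_features=6656):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_features, 256), nn.ReLU(),  # activations assumed
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )
        self.criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on logits

    def forward(self, img_feat, txt_feat, target=None):
        logit = self.mlp(torch.cat([img_feat, txt_feat], dim=1))
        if target is None:
            return logit
        return logit, self.criterion(logit, target)
```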
4 Experimental Setup
In this section, we present details about the dataset used in our experiments, as
well as some implementation details.
The original dataset was split into a training set (11,184 patients, 13,457 studies, 36,808 images), a validation set (783 patients, 1,199 studies, 3,197 images), and a test set (206 patients, 207 studies, 556 images).
For the purpose of comparing our results and evaluating our proposed method, as the test set is not publicly accessible, the validation set was adopted as the test set. In addition, we applied a stratified split of the original training set into 80% and 20% to generate new training and validation sets, respectively.
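Such a stratified 80/20 split can be reproduced, for instance, with scikit-learn; the arrays below are hypothetical stand-ins for the MURA images and labels:

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: one path and one binary abnormality label per radiograph.
image_paths = [f"img_{i}.png" for i in range(100)]
labels = [i % 2 for i in range(100)]

train_x, val_x, train_y, val_y = train_test_split(
    image_paths, labels, test_size=0.20, stratify=labels, random_state=42)
```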
the results of DenseNet-169 and BERT individually. For the ‘Finger’ study, the performance of ImTeNet is comparable to the fusion performance, which presents the best results. ImTeNet presents the best results for the ‘Shoulder’, ‘Humerus’, and ‘Forearm’ studies; in the latter, in particular, our ImTeNet provides the greatest gain compared to the DenseNet-169 model. For the ‘Hand’ study, the BERT model presents a negative Kappa, indicating that there is no agreement in its results; the performance of ImTeNet is comparable to that of DenseNet-169 alone, with no relevant gain, and the fusion classifier alone presents the best results. The results of the BERT model, as expected due to the caption format, are the worst in all studies.
Compared to the baseline results of DenseNet-169, the proposed method presented an overall performance gain, even with artificially generated text data in which most samples contain texts of poor quality or lacking relevant information. In general, our model was able to extract and combine discriminative features from each proposed branch (ICL and TCL). Our method obtained an average balanced accuracy and Cohen's kappa score of 0.8407 and 0.6920, respectively, a difference of 0.0138 and 0.0202 over DenseNet-169 alone. We also evaluated our proposed ImTeNet against another existing method for musculoskeletal abnormality classification [12] on the MURA dataset, using the same test set defined in Sect. 4.1.
To allow a fair comparison, we report the evaluation results of each method as accuracy scores, as shown in Table 2. The ‘Visual’ column refers to the image classification task, the ‘Text’ column to the text classification task, and the last column to the proposed method. Compared to this baseline, our proposed ImTeNet yields high performance for musculoskeletal abnormality classification on the MURA dataset. Our method achieves an accuracy score of 0.8511, a difference of 0.0356 over the best result of Pelka et al. [12].
5.1 Analysis
We analyzed specific examples from the test set and their prediction on each
level, as illustrated in Fig. 5. In addition, we applied the BERTViz [19] tool to
visualize the BERT model attention produced in each input caption.
In the first two examples, all levels were able to correctly label the input
image and text. The BERT model correctly detected relevant information and
was able to discard spurious information present in both captions. The DenseNet-
169 model could also identify relevant regions in the image. In the third example, whose caption contains poor textual information with no semantic content, the BERT model incorrectly labeled the abnormality as negative, while the DenseNet-169 labeled the abnormality as positive with high confidence. It is
also possible to observe a well-defined highlighted area of the radiograph that
was most important in making the prediction. In the last example, while the
DenseNet-169 model was unable to detect the presence of abnormalities in the
image, the BERT model was able to distinguish relevant information from the
caption and correctly label the abnormality as positive. All features and pre-
dictions produced by the DenseNet-169 and BERT models were considered by the fusion-level classifier, which could increase the classification results.
Notwithstanding, our study has some limitations. First, our approach relies on the existence of a good caption generator model. Second, and closely related to the first, our generator model should ideally be trained on musculoskeletal radiograph texts; however, MURA does not provide this type of data. We tried to supply this information by artificially creating textual data, but, when observing examples of generated captions, many samples carry no relevant information. For instance, for the caption of the third image in Fig. 5 we expected a text containing terms such as “x-ray”, “implant”, and “humerus”; instead, <start> the of the the of the the of the <end> was obtained. Third, the average length of our caption texts is lower than that of the datasets usually used with the BERT model [17]. We conjecture that longer text information could increase the BERT results.
Fig. 5. Original image, DenseNet-169 heatmap and artificially generated caption from
test set. A check mark indicates the correct label prediction for each level. The
BERTViz tool was used to visualize the BERT attention maps for each caption.
References
1. Annarumma, M., Withey, S.J., Bakewell, R.J., Pesce, E., Goh, V., Montana, G.:
Automated triaging of adult chest radiographs with deep artificial neural networks.
Radiology 291(1), 196–202 (2019)
2. Beltagy, I., Lo, K., Cohan, A.: SciBERT: A Pretrained Language Model for Scien-
tific Text. arXiv preprint arXiv:1903.10676 (2019)
3. Chen, B., Li, J., Guo, X., Lu, G.: DualCheXNet: dual asymmetric feature learning
for thoracic disease classification in chest X-rays. Biomed. Signal Process. Control
53, 101554 (2019)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of
deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805 (2018)
5. Drozdov, I., Forbes, D., Szubert, B., Hall, M., Carlin, C., Lowe, D.J.: Supervised
and unsupervised language modelling in chest X-ray radiological reports. PLoS
ONE 15(3), e0229963 (2020)
6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
7. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected
convolutional networks. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2017)
8. Jing, B., Xie, P., Xing, E.P.: On the automatic generation of medical imaging
reports. In: 56th Annual Meeting of the Association for Computational Linguistics
- Proceedings of the Conference (Long Papers), vol. 1, pp. 2577–2586 (2018)
9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization, pp. 1–15. arXiv
preprint arXiv:1412.6980 (2014)
10. Kooi, T., et al.: Large scale deep learning for computer aided detection of mam-
mographic lesions. Med. Image Anal. 35, 303–312 (2017)
11. Pelka, O., Koitka, S., Rückert, J., Nensa, F., Friedrich, C.M.: Radiology objects
in context (ROCO): a multimodal image dataset. In: Stoyanov, D., et al. (eds.)
LABELS/CVII/STENT -2018. LNCS, vol. 11043, pp. 180–189. Springer, Cham
(2018). https://doi.org/10.1007/978-3-030-01364-6_20
12. Pelka, O., Nensa, F., Friedrich, C.M.: Branding - fusion of meta data and mus-
culoskeletal radiographs for multi-modal diagnostic recognition. In: International
Conference on Computer Vision Workshop (ICCV), pp. 467–475 (2019)
13. Rajpurkar, P., et al.: MURA: large dataset for abnormality detection in muscu-
loskeletal radiographs. arXiv preprint arXiv:1712.06957 (2017)
14. Ranjan, E., Paul, S., Kapoor, S., Kar, A., Sethuraman, R., Sheet, D.: Jointly
learning convolutional representations to compress radiological images and clas-
sify thoracic diseases in the compressed domain. In: 11th Indian Conference on
Computer Vision, Graphics and Image Processing, pp. 1–8 (2018)
15. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J.
Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
16. Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A.Y., Lungren, M.P.: CheXbert:
combining automatic labelers and expert annotations for accurate radiology report
labeling using BERT. arXiv preprint arXiv:2004.09167 (2020)
17. Sun, C., Qiu, X., Xu, Y., Huang, X.: How to fine-tune BERT for text classification?
In: Sun, M., Huang, X., Ji, H., Liu, Z., Liu, Y. (eds.) CCL 2019. LNCS (LNAI),
vol. 11856, pp. 194–206. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32381-3_16
18. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances
in Neural Information Processing Systems 30, pp. 5998–6008 (2017)
19. Vig, J.: A Multiscale visualization of attention in the transformer model, pp. 1–6.
arXiv preprint arXiv:1906.05714 (2019)
20. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image cap-
tion generator. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 7(12),
3156–3164 (2015)
21. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8:
hospital-scale chest X-ray database and benchmarks on weakly-supervised clas-
sification and localization of common thorax diseases. In: IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (2017)
22. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep fea-
tures for discriminative localization. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2016)
A Scientometric Overview of Bioinformatics
Tools in the Pseudomonas Putida Genome Study
1 Introduction
The planet is fighting all types of environmental pollution, and soil is among the environments most affected by degradation. There is, therefore, a need to control soil pollution in order to maintain fertility as well as productivity [1, 2]. In this context, the development of technologies to remediate these degraded environments becomes indispensable [3].
The genus Pseudomonas, belonging to the class Gammaproteobacteria, has been extensively studied over time due to its great potential for the degradation and biotransformation of xenobiotic and recalcitrant compounds, and is likely to be used in various biotechnological processes of environmental recovery [4]. Pseudomonas putida is a ubiquitous bacterium, found mainly in soil, classified as chemoorganotrophic, and capable of metabolizing large carbon chains as well as several recalcitrant pollutants [5].
From genome sequencing, as well as from bioinformatics analyses, it is possible to identify the proteins and metabolic pathways of Pseudomonas putida applicable to environmental remediation and the degradation of xenobiotic compounds [6]. Thus, using molecular-level analyses, it is possible to study new techniques and environmental applications for this organism [7].
This work aimed to perform a systematic analysis, by means of the scientomet-
ric methodology, of the global publications on the genome of Pseudomonas putida,
analyzing mainly the bioinformatics tools used for the assembly and annotation of the
sequenced genomes.
2 Methods
The searches for the scientometric analysis were performed in the Web of Science database using the terms “Genome” AND “Pseudomonas putida” AND “soil”. Initially, 159 documents were found for the searched terms; however, manual filtering was performed in order to keep only those documents that reported the sequencing of the microorganism. After the filtering, 29 papers remained, all of them journal articles containing genome sequencing.
After filtering, the data were exported to Microsoft Excel and to the CiteSpace software in order to produce graphs of the data and connections regarding the main publishing countries, the knowledge areas involved, the most used keywords, and the number of publications and citations per year. All the documents were also read in order to extract the number of sequenced genomes and the sequencing platforms, besides the genome annotation and assembly software; these data were tabulated and plotted in Excel.
The first publication studying the genomic sequencing of the Pseudomonas putida bacterium associated with soil, within the database, appeared in 2002, followed by a gap of 11 years; publications resumed only in 2014, this time with continuous growth until 2020. The number of citations on the subject grew continuously from 2002, totaling 1,129 citations until August 2020, as shown in Fig. 1. The growth in genomic studies of this soil-related organism is due to the extremely versatile metabolism of P. putida, its great capacity for adaptation to several environments, its resistance to physical-chemical stresses, and its genes associated with the degradation of recalcitrant compounds. Accordingly, studies using this organism in biological remediation processes have maintained constant growth [8, 9]. The top five countries publishing on the subject are Spain, Japan, the USA, Germany, and France. The number of publications by country each year is also shown in Fig. 1.
From the analysis performed with the CiteSpace software, it is possible to see that Microbiology, Environmental Sciences, and Biotechnology are the main knowledge areas associated with the theme, which reflects the direction of the research, aiming at the application of P. putida in soils due to the high diversity of metabolisms of this microorganism associated with bioremediation [10]. Figure 2 presents the network of
Fig. 1. Number of publications by country and year, and the number of citations per year
Fig. 2. Network of connections over knowledge areas. Yellow circles represent the centrality.
(Color figure online)
connections related to the areas of study; the larger the font, the more frequently that area appears in these studies.
Also, the yellow circle represents the centrality, a factor that indicates the number of connections a theme makes [11]. Microbiology is the only area that presents significant centrality, due to the fact that this data set addresses the use of a microorganism; therefore, even the most specific studies remain interconnected with microbiology, making it coherent as the central knowledge area [12].
In this data set, 12 journals published the articles, most of them related to microbiology, environmental science, or genomics. The main journal was Microbiology Resource Announcements.
Fig. 3. (A) Percentage of whole/draft genomes among the 120 sequenced genomes. (B) Percentage of the main sequencing platforms used.
3.4 Assembly
In total, nine genome assembly software packages were used, the most frequent being SPAdes, one of the main bacterial genome assemblers [13]. Other software, such as CLC Genomics Workbench, Newbler, and HGAP, was also used, as shown in Fig. 4. However, many papers (6 out of 29) did not report the assembly software.
3.5 Annotation
As with the assembly, most of the articles did not report the genome annotation software (7 of the 26 documents). Five software tools were used across the 26 papers, the main one being the NCBI Prokaryotic Genome Annotation Pipeline, a tool in constant modification and evolution for the annotation process [14]. Figure 5 shows the software used, along with its percentage within the 26 papers.
4 Conclusions
Pseudomonas putida is a widely studied organism with great potential in the area of bioremediation of contaminated environments. In this context, the genomic study of this microorganism is of extreme relevance, since molecular-level analyses make it possible to know the main metabolic pathways, as well as the genes and proteins used by P. putida to perform the degradation of recalcitrant compounds.
In this way, bioinformatics tools make it possible to analyze these molecular data and allow the development of new technologies using this microorganism, besides supporting the dissemination of genomic data. Thus, this systematic study provided a view of the main bioinformatics tools used for the analysis of the P. putida genome, besides presenting the advance of research on the subject, helping possible future studies with P. putida applied to soil environmental science and other environments.
References
1. Ashraf, A.M., Maah, J.M., Yusoff, I.: Soil contamination, risk assessment and remediation,
vol. 1, no. 1, pp. 3–5. Intech (2014)
2. Ranga, P., Sharma, D., Saharan, S.B.: Bioremediation of azo dye and textile effluents using
Pseudomonas putida MTCC 244. Asian J. Microbiol. Biotechnol. Environ. Sci. 22(2), 88–94
(2019)
3. Horemans, B., Breugelmans, P., Saeys, W., Springael, D.: A soil-bacterium compatibility
model as a decision-making tool for soil bioremediation. Environ. Sci. Technol. 51(3), 1605–
1615 (2017)
4. Loh, K.C., Cao, B.: Paradigm in biodegradation using Pseudomonas putida—A review of
proteomics studies. Enzyme Microbial Technol. 43(1), 1–12 (2008)
5. Tsirinirindravo, H.L., et al.: Bioremediation of soils polluted by petroleum hydrocarbons by
Pseudomonas putida. Int. J. Innov. Eng. Sci. Res. 2(5), 9–18 (2018)
6. Iyer, R., Iken, B., Damania, A., Krieger, J.: Whole genome analysis of six organophosphate-
degrading rhizobacteria reveals putative agrochemical degradation enzymes with broad
substrate specificity. Environ. Sci. Pollut. Res. 25(14), 13660–13675 (2018)
7. Nelson, K.E., et al.: Complete genome sequence and comparative analysis of the metabolically
versatile Pseudomonas putida KT2440. Environ. Microbiol. 4(12), 799–808 (2002)
8. Crovadore, J., et al.: Whole-genome sequence of Pseudomonas putida strain 1312, a potential
biostimulant developed for agriculture. Microbiol. Resour. Announc. 10(7), 1–2 (2018)
9. Weimer, A., et al.: Industrial biotechnology of Pseudomonas putida: advances and prospects.
Appl. Microbiol. Biotechnol. 104(1), 7745–7766 (2020)
10. Abyar, H., et al.: The role of Pseudomonas putida in bioremediation of naphthalene and
copper. World J. Fish Marine Sci. 3(5), 444–449 (2011)
11. Ping, Q., He, J., Chen, C.: How many ways to use CiteSpace? A study of user interactive
events over 14 months. J. Assoc. Inf. Sci. Technol. 68(5), 1234–1256 (2017)
12. Chen, C.: Manual do CiteSpace, 1st edn. Drexel University, Philadelphia (2014)
13. Prjibelski, A., et al.: Using SPAdes de novo assembler. Curr. Prot. Bioinform. 70(1), 1–29
(2020)
14. Tatusova, T., et al.: NCBI prokaryotic genome annotation pipeline. Nucl. Acids Res. Adv.
1(1), 1–11 (2016)
Polysome-seq as a Measure
of Translational Profile
from Deoxyhypusine Synthase Mutant
in Saccharomyces cerevisiae
1 Introduction
Protein synthesis consists of decoding the messenger RNA. This process is cat-
alyzed by ribosomes and mediated by translation factors. The regulation of the
repertoire of proteins expressed in a cell is determined by the selective con-
trol of gene expression by several cellular mechanisms, such as epigenetic, tran-
scriptional, translational, post-transcriptional or post-translational modification
[4,18,21].
Saccharomyces cerevisiae strains SVL613 (MATa leu2 trp1 ura3 his3 dys1::HIS3 [DYS1/TRP1/CEN - pSV520]) and SVL614 (MATa leu2 trp1 ura3 his3 dys1::HIS3 [dys1 W75R T118A A147T/TRP1/CEN - pSV730]), DYS1 and dys1-1, respectively, were used for the RNA high-throughput experiments. Cells were grown under previously described conditions [9].
For the regulatory gene groups, we performed gene ontology (GO) analysis with biological process terms to determine whether specific biological functions were enriched, using the YeastMine database [8]. Fisher's exact test was used to test for statistically significant differences, and the Holm-Bonferroni correction procedure was applied to adjust for the effects of multiple tests [2]. GO terms were considered significant when FDR < 0.05. Gene lists of differentially expressed genes obtained from the transcriptome profile were submitted to the PSCAN (v1.5, http://159.149.160.88/pscan/) online tool.
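This kind of enrichment test can be sketched with SciPy and statsmodels; the contingency tables below are made-up numbers, not data from this study:

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# One hypothetical 2x2 table per GO term:
# [[DEGs in term, DEGs not in term], [background in term, background not in term]]
tables = {
    "GO:0030490": [[30, 470], [200, 4800]],
    "GO:0032196": [[5, 495], [300, 4700]],
}
pvals = [fisher_exact(t, alternative="greater")[1] for t in tables.values()]
reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for term, p_adj, sig in zip(tables, adjusted, reject):
    print(term, round(p_adj, 4), "enriched" if sig else "not significant")
```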
After filtering out non-expressed genes (see Methods), the table of read counts per gene contained data for 5,334 annotated S. cerevisiae ORFs. Both the transcriptional and translational profile results were highly reproducible among biological replicates for each strain (Fig. 1B and 1C; Tables 2 and 3).
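The reproducibility check amounts to pairwise Pearson correlations of log2-transformed RPM values; a sketch on synthetic stand-in data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an RPM table (genes x replicates) of one profile.
rng = np.random.default_rng(0)
base = rng.lognormal(mean=3, sigma=1, size=1000)
rpm = pd.DataFrame({f"rep{i}": base * rng.normal(1, 0.05, 1000) for i in (1, 2, 3)})

log2_rpm = np.log2(rpm + 1)  # pseudocount avoids log2(0)
print(log2_rpm.corr(method="pearson").round(2))  # pairwise replicate correlations
```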
Fig. 1. (A) Experimental approaches for studying the transcribed and recruited
mRNAs for translation. Transcriptional profile: the total RNA is extracted, the mRNAs
are separated and subjected to large-scale sequencing. Translational profile: extracts
are separated by ultracentrifugation through sucrose gradient which is then fraction-
ated while its absorbance is continuously monitored at 254 nm (A254), allowing the
separation of free RNA, the 40S and 60S ribosomal subunits, the 80S monosomes and
the polysomes. The RNA is isolated from individualized gradient fractions and pooled
for further large-scale analysis. (B) Principal Component Analysis indicating the distribution of replicates in the plane. Three independent biological replicates of the DYS1 and dys1-1 strains are represented in the distribution graphs along the two main components, from the normalized RPM values of the genes sequenced by RNA-seq for each
profile. (C) Linear correlation between replicates of log2 RPM values of genes sequenced
by RNA-seq. The linear correlation of the log2 RPM values of experimental replicates
for the transcriptional profile varied between 0.94 and 0.98 whereas for the translational
profile this value varied between 0.98 and 0.99.
Table 2. Pearson's correlation values of log2 RPM values from the transcriptional profile for each replicate
Table 3. Pearson's correlation values of log2 RPM values from the translational profile for each replicate
One technique for studying the composition of mRNAs recruited for translation by large-scale analysis is polysome profiling, which segregates polysome-associated mRNAs from ribosome-free mRNAs and is coupled with RNA-seq (Fig. 1A). Besides Polysome-seq, the Ribo-seq methodology, or ribosome profiling, is based on the sequencing of ribosome-protected mRNA fragments (RPFs) [12]. We observed high Pearson correlations between the log2 RPM wild-type data from this study and ribosome profiling wild-type data available in the literature [10,23] (Fig. 2A and 2B).
Next, we compared the wild-type strain quantification of gene expression by RNA-seq and Polysome-seq to published proteomic data [6]. The correlation and coefficient of determination between the translatome (Polysome-seq) and the normalized proteome abundances (Fig. 2C) were higher than those of the transcriptome measurements (Fig. 2D), indicating that the translatome quantification of gene expression provides a more accurate picture of protein abundance, since protein levels are regulated by (1) translation rate, (2) translation rate modulation, (3) modulation of a protein's half-life, (4) protein synthesis delay, and (5) protein transport [17,18]. Polysome-seq thus allows a better understanding of the regulatory mechanisms that involve post-transcriptional gene expression programs [11,13], beyond regulation via tuning of transcript levels alone [16], resulting in a profile of the mRNAs selectively recruited for translation.
We first calculated the gene expression level fold change (FC) between the two strains using RNA-seq and Polysome-seq data separately, and we observed similar numbers of differentially expressed genes (DEGs) for both profiles: 2,432 and 2,826 DEGs at the transcriptional and translational level, respectively (Fig. 3A and 3B). However, Polysome-seq data had a higher variance than RNA-seq data in the distribution of significant log2 FC values (Fig. 3C), a consistent result for a mutant affecting a translation factor.
To establish the relationship between mRNA and polysome-associated mRNA changes when comparing DYS1 and dys1-1, we categorized DEGs into gene expression modes by computing an analysis of partial variance with the transcriptome and translatome (Fig. 3D): (1) Homodirectional DEGs, which change significantly in both profiles in a concordant way, indicating transcriptional regulation; (2) Polysome-only DEGs, up- or downregulated polysome-associated mRNAs with no significant changes in mRNA levels, the result of a translation regulatory mode; (3) Transcriptome-only DEGs, differences in mRNA levels not followed by a significant change in polysome-associated mRNA, the result of a buffering regulatory mode; (4) Antidirectional DEGs, which change significantly in both profiles but in opposite directions. Most DEGs (67%) showed a coupled significant change, i.e., a significant homodirectional change in both the transcriptome and the translatome (Fig. 3E). This result is in accordance with the fact that, under stress conditions, differentially expressed proteins correlate strongly with the corresponding mRNA levels, indicating that transcriptional control seems to be the major driver behind changes in protein levels [14].
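A simplified, threshold-based sketch of this categorization is shown below; the study itself used an analysis of partial variance, which this toy version does not reproduce, and the column names are hypothetical:

```python
import pandas as pd

def expression_mode(row, alpha=0.05):
    """Classify a gene by its transcriptome (t) and translatome (p) behavior."""
    sig_t, sig_p = row.padj_t < alpha, row.padj_p < alpha
    if sig_t and sig_p:
        return "homodirectional" if row.lfc_t * row.lfc_p > 0 else "antidirectional"
    if sig_p:
        return "polysome only"
    if sig_t:
        return "transcriptome only"
    return "unchanged"

genes = pd.DataFrame({
    "lfc_t":  [1.2, 0.1, -0.8,  0.9],
    "lfc_p":  [1.0, 1.5,  0.1, -1.1],
    "padj_t": [0.01, 0.60, 0.03, 0.04],
    "padj_p": [0.02, 0.01, 0.50, 0.01],
})
genes["mode"] = genes.apply(expression_mode, axis=1)
print(genes["mode"].tolist())
```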
Transcriptionally regulated genes were significantly enriched for Gene Ontology (GO) biological process terms such as “maturation of SSU-rRNA” (GO:0030490), “transposition” (GO:0032196), and “RNA modification” (GO:0009451) (Table 4), and for transcription factors (TF) such as Tod6, Dot6, and Stb3 (Table 5). Additionally, BUD27, which encodes a protein that impacts the homeostasis of ribosome biogenesis by regulating the activity of the three RNA polymerases [17], is classified as a homodirectional gene, upregulated in both profiles. Taken together, these results reveal a cell response involving ribosome biogenesis, a high-energy-consumption process that requires stringent regulation to ensure proper ribosome production for cell growth and protein synthesis in different environmental and metabolic situations [17].
The results of this study illustrate the use of Polysome-seq as a measurement of the mRNAs recruited for translation. For the deoxyhypusine synthase mutant dys1-1, which impairs a protein involved in translation, we identified a pattern of gene expression control that is transcription dependent, with upregulation of ribosome synthesis as one of the cell responses to translation impairment.
Fig. 3. Volcano plots of the distribution of the transcripts differentially expressed in the transcriptional profile (A) and translational profile (B). The −log10 p-values were plotted against the differential expression between DYS1 and dys1-1 (log2 fold change). Downregulated genes are highlighted in blue (left) and upregulated genes in orange (right); the dashed horizontal line indicates an adjusted p-value of 0.05. (C) Distribution of gene expression fold change (FC) values. FC was calculated as the ratio between the number of reads in the dys1-1 and DYS1 strains, taking the average number of reads per gene among the replicates. (D) Scheme of the differential expression analysis between the transcriptional and translational profiles of the dys1-1 mutant. Genes classified as differentially expressed were called transcriptome only (blue), polysome only (orange), antidirectional (purple) - significantly opposite variations between the transcriptional and translational profiles - or homodirectional (green) - significantly convergent variations between both profiles. (E) Distribution of the log2 fold change of the transcriptional and translational profiles. Genes showing statistical differences between dys1-1 and DYS1 were simultaneously compared in the two profiles. Categories are defined in (D). (Color figure online)
References
1. de Almeida, O.P., et al.: Hypusine modification of the ribosome-binding protein
eIF5A, a target for new anti-inflammatory drugs: understanding the action of the
inhibitor GC7 on a murine macrophage cell line. Curr. Pharm. Des. 20(2), 284–292
(2014)
2. Benjamini, Y., Yekutieli, D.: The control of the false discovery rate in multiple test-
ing under dependency. Ann. Stat. (2001). https://doi.org/10.1214/aos/1013699998
3. Buskirk, A.R., Green, R.: Ribosome pausing, arrest and rescue in bacteria and
eukaryotes. Philos. Trans. R. Soc. Lond. Ser. B Biol. Sci. 372, 20160183 (2017).
https://doi.org/10.1098/rstb.2016.0183
4. Chassé, H., Boulben, S., Costache, V., Cormier, P., Morales, J.: Analysis of transla-
tion using polysome profiling. Nucleic Acids Res. (2017). https://doi.org/10.1093/
nar/gkw907
5. Chen, K.Y., Liu, A.Y.: Biochemistry and function of hypusine formation on eukary-
otic initiation factor 5A. NeuroSignals (1997). https://doi.org/10.1159/000109115
6. Csárdi, G., Franks, A., Choi, D.S., Airoldi, E.M., Drummond, D.A.: Accounting
for experimental noise reveals that mRNA levels, amplified by post-transcriptional
processes, largely determine steady-state protein levels in yeast. PLoS Genet.
11(5), e1005206 (2015). https://doi.org/10.1371/journal.pgen.1005206
7. Dever, T.E., Ivanov, I.P.: Roles of polyamines in translation. J. Biol.
Chem. 293(48), 18719–18729 (2018). https://doi.org/10.1074/jbc.TM118.003338.
http://www.jbc.org/
8. Engel, S.R., et al.: The reference genome sequence of Saccharomyces cerevisiae: then and now. G3: Genes, Genomes, Genetics 4(3), 389–398 (2014). https://doi.
org/10.1534/g3.113.008995
9. Galvão, F.C., Rossi, D., Silveira, W.D.S., Valentini, S.R., Zanelli, C.F.: The deoxy-
hypusine synthase mutant dys1-1 reveals the association of eIF5A and Asc1 with
cell wall integrity. PLoS ONE 8(4), e60140 (2013). https://doi.org/10.1371/journal.
pone.0060140
10. Heyer, E.E., Moore, M.J.: Redefining the translational status of 80S monosomes.
Cell 164(4), 757–769 (2016). https://doi.org/10.1016/j.cell.2016.01.003
11. Ingolia, N.T.: Ribosome profiling: new views of translation, from single codons
to genome scale. Nat. Rev. Genet. 15, 205–213 (2014). https://doi.org/10.1038/
nrg3645
12. Ingolia, N.T., Ghaemmaghami, S., Newman, J.R., Weissman, J.S.: Genome-wide
analysis in vivo of translation with nucleotide resolution using ribosome profiling.
Science 324(5924), 218–223 (2009). https://doi.org/10.1126/science.1168978
13. Jin, H.Y., Xiao, C.: An integrated polysome profiling and ribosome profiling
method to investigate in vivo translatome. In: Methods in Molecular Biology, vol.
1712, pp. 1–18. Humana Press Inc. (2018). https://doi.org/10.1007/978-1-4939-7514-3_1
14. Lahtvee, P.J., et al.: Absolute quantification of protein and mRNA abun-
dances demonstrate variability in gene-specific translation efficiency in yeast.
Cell Syst. 4(5), 495–504 (2017). https://doi.org/10.1016/J.CELS.2017.03.003.
https://www.sciencedirect.com/science/article/pii/S2405471217300881#mmc4
15. Landau, G., Bercovich, Z., Park, M.H., Kahana, C.: The role of polyamines in
supporting growth of mammalian cells is mediated through their requirement for
translation initiation and elongation. J. Biol. Chem. 285(17), 12474–12481 (2010).
https://doi.org/10.1074/jbc.M110.106419
16. Liu, Y., Beyer, A., Aebersold, R.: On the dependency of cellular protein levels on
mRNA abundance. Cell 165, 535–550 (2016). https://doi.org/10.1016/j.cell.2016.
03.014
17. Martínez-Fernández, V., et al.: Prefoldin-like Bud27 influences the transcription of
ribosomal components and ribosome biogenesis in Saccharomyces cerevisiae. RNA
(2020). https://doi.org/10.1261/rna.075507.120
18. Piccirillo, C.A., Bjur, E., Topisirovic, I., Sonenberg, N., Larsson, O.:
Translational control of immune responses: from transcripts to trans-
latomes. Nat. Immunol. 15(6), 503–511 (2014). https://doi.org/10.1038/ni.2891.
http://www.nature.com/doifinder/10.1038/ni.2891
19. Rossi, D., Kuroshu, R., Zanelli, C.F., Valentini, S.R.: eIF5A and EF-P: two unique
translation factors are now traveling the same road. Wiley Interdiscip. Rev.: RNA
5(2), 209–222 (2014). https://doi.org/10.1002/wrna.1211
20. Schnier, J., Schwelberger, H.G., Smit-McBride, Z., Kang, H.A., Hershey, J.W.:
Translation initiation factor 5A and its hypusine modification are essential for cell
viability in the yeast Saccharomyces cerevisiae. Mol. Cell. Biol. (1991). https://
doi.org/10.1128/MCB.11.6.3105
21. Schuller, A.P., Green, R.: Roadblocks and resolutions in eukaryotic translation.
Nat. Rev. Mol. Cell Biol. 19(8), 526–541 (2018). https://doi.org/10.1038/s41580-
018-0011-4. http://www.nature.com/articles/s41580-018-0011-4
22. Schuller, A.P., Wu, C.C.C., Dever, T.E., Buskirk, A.R., Green, R.: eIF5A functions
globally in translation elongation and termination. Mol. Cell 66(2), 194–205 (2017).
https://doi.org/10.1016/j.molcel.2017.03.003
23. Sen, N.D., Zhou, F., Ingolia, N.T., Hinnebusch, A.G.: Genome-wide analysis of
translational efficiency reveals distinct but overlapping functions of yeast DEAD-
box RNA helicases Ded1 and eIF4A. Genome Res. 25(8), 1196–1205 (2015).
https://doi.org/10.1101/gr.191601.115
Anti-CD3 Stimulated T Cell
Transcriptome Reveals Novel ncRNAs
and Correlates with a Suppressive Profile
1 Introduction
Targeting T lymphocytes was among the first monoclonal antibody (mAb)-based immunotherapies. Anti-CD3 therapy was used to control T cell activity, suppress the immune response, and substitute the polyclonal anti-thymocyte antibody preparation previously used for graft rejection [17]. Muromonab, a monoclonal antibody specific for the human CD3 antigen, was the first mAb used in clinical studies, but its use was discontinued due to its overall toxicity [21]. Nowadays, novel and less toxic CD3-specific antibodies have reemerged as promising therapeutics for controlling autoimmune diseases.
To better understand the human T cell reprogramming after anti-CD3 treat-
ment, we previously investigated the protein-coding (PTC) genes [25] regulated
ex vivo in a PBMC milieu. Based on a new transcript prediction algorithm,
we reannotated the non-coding transcriptome of human T cells treated with
anti-CD3 antibodies to unveil differentially expressed lncRNA (DEL) that may
be involved in CD3 targeted antibody therapy. We observed several novel non-
coding transcripts along with previously annotated ones, and we discuss their
possible participation in T cell fate and the suppressive phenotype.
All sequenced reads produced by Illumina were analyzed for quality control using FastQC [1]. The reads were filtered using BBDuk [3] at k = 31 against a reference of ribosomal k-mers provided by the developers. Adapters were trimmed with cutadapt [15], and reads were then aligned to the HG19/GRCh37 human genome using HISAT2 [13] at standard settings. Ryūtō [7] was run on the alignments to predict individual transcripts for each set. Transcript predictions were joined using the TACO meta-assembler [16]. Results were compared to the GENCODE V19 annotation to identify novel transcripts. Salmon [18] was used to realign the filtered reads for each sample to the full GENCODE set, and a count matrix was created from these transcripts together with the additional predicted novel transcripts. DESeq2 [14] was used to identify differentially expressed genes (DEGs) on all replicates, applying a significance threshold of 0.05 for the adjusted p-values.
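The read-processing steps above can be sketched as shell commands driven from Python; the file names are placeholders and the options shown are common invocations of these tools, not the exact parameters used in this study:

```python
import subprocess

def run(cmd):
    """Execute one pipeline stage, aborting on failure."""
    subprocess.run(cmd, shell=True, check=True)

run("fastqc sample.fastq.gz")                                     # quality control
run("bbduk.sh in=sample.fastq.gz out=filtered.fq.gz ref=ribokmers.fa.gz k=31")
run("cutadapt -a AGATCGGAAGAGC -o trimmed.fq.gz filtered.fq.gz")  # adapter trimming
run("hisat2 -x grch37_index -U trimmed.fq.gz -S aligned.sam")     # spliced alignment
# Ryūtō, TACO, Salmon, and DESeq2 follow analogously (see each tool's manual).
```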
transcripts (fold change); for the analysis of the obtained data, the RT2 Profiler PCR Array Data Analysis software (SABiosciences, Frederick, MD, USA) was used. Real-time qPCR p-values were calculated based on Student's t-test using the same software.
3 Results
The transcriptome reconstructed from the union of all mapped reads comprises 174,649 transcripts, of which about a third are protein-coding, nearly half (44.1%) correspond to known non-coding Ensembl/VEGA transcripts, and 20% are previously unannotated transcripts (Fig. 1). The distribution of Ensembl/VEGA transcript types among the known ncRNAs is summarized in Fig. 1-A. The most abundant types are transcripts with retained introns (IR) and Processed Transcripts (PT) that are not classified as belonging to one of the lncRNA classes.
3.1 LincRNAs
No significant differential expression was observed, except for GAPLINC-204 (ENST00000581442.1), which was barely significantly repressed by OKT3 (padj = 0.0135). Along with them, two novel isoforms, TU21901 and TU21904, were detected. TU21901 was the most abundant transcript predicted in this locus and is a DEL, repressed as a result of the OKT3, FvFcR, and FvFcM treatments (padj values of 0.0001, 0.0487, and 0.0502, respectively). The qPCR quantification suggested that all antibodies induced a certain repression, especially for a particular primer pair that detects all variants except GAPLINC-201 (Fig. 4).
The analysis of cDNA clones obtained from PCR for exons 1 to 4 and 3 to 4 yielded sequences that confirmed the presence of the predicted transcripts. Three independent clones could unequivocally validate TU21901 with the same exon-
exon junction. Two clones showed the unique junction of GAPLINC-204, where
exon 2 uses an alternative donor splice site compared to TU21901. Three other
predicted clones showed an exon-exon junction that is shared either by TU21904
or by the previously reported transcript GAPLINC-202. Therefore, the data
showed that along with GAPLINC-202 and GAPLINC-204, at least the novel
regulated transcript TU21901 could be found in non-stimulated T cells.
Fig. 3. The WFDC21P transcript is depicted to reveal the intron-exon structure of Lnc-DC along with TU20859, TU20860, and TU20861. (A) The transcripts on the
opposite strand of chromosome 17 were rotated to facilitate visualization. Primers used
to check for transcripts are marked in red and green. Quantitative expression by qPCR
assay of lncRNAs was performed with total RNA extracted from T cells stimulated
with anti-CD3 antibodies. The results are expressed as the fold change relative to
unstimulated T cells (n = 5; p < 0.05). (B) Expression of transcripts detected with the
primer pair for the first and third exon of Lnc-DC (red). (C) Expression of transcripts
detected with the primer pair for the second and third exon of Lnc-DC (green). (Color
figure online)
4 Discussion
The stimulation of T lymphocytes with anti-CD3 antibodies induces a change in the transcriptional landscape. In this new study, we reanalyzed the data on the non-coding transcriptome of anti-CD3-stimulated T cells.
Fig. 4. Transcriptional activity of the GAPLINC locus. (A) Novel transcripts are
depicted along with annotated transcripts. In the top, the genomic view of transcripts
with exon-intron structured. Primers used to check for transcripts are marked in green
and magenta. Quantitative expression by qPCR assay of lncRNAs was performed with
total RNA extracted from T cells stimulated with anti-CD3 antibodies. The results
are expressed as the fold change relative to unstimulated T cells (n = 5; p < 0.05). (B)
qPCR with a primer to the junction of the first and second exon and another for the
third exon of GAPLINC-204 (red), detecting all transcripts except GAPLINC-201. (C)
Expression of transcripts detected with a primer pair for the third and fourth exon of
GAPLINC-202/TU21904 (green). (Color figure online)
5 Conclusion
References
1. Andrews, S., et al.: FastQC: a quality control tool for high throughput sequence
data (2010). https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
2. Arrial, R.T., Togawa, R.C., Brigido, M.M.: Screening non-coding RNAs in tran-
scriptomes from neglected species using PORTRAIT: case study of the pathogenic
fungus Paracoccidioides brasiliensis. BMC Bioinform. 10(1), 239 (2009). https://
doi.org/10.1186/1471-2105-10-239
3. BBTools: BBDuk. http://jgi.doe.gov/data-and-tools/bb-tools/
4. Cabili, M.N., et al.: Integrative annotation of human large intergenic noncoding
RNAs reveals global properties and specific subclasses. Genes Dev. 25(18), 1915–
1927 (2011)
5. GTEx Consortium et al.: Genetic effects on gene expression across human tissues.
Nature 550(7675), 204–213 (2017)
6. Dijkstra, J.M., Ballingall, K.T.: Non-human lnc-DC orthologs encode Wdnm1-like
protein. F1000Research 3, 160 (2014)
7. Gatter, T., Stadler, P.F.: Ryūtō: network-flow based transcriptome reconstruction.
BMC Bioinform. 20(1), 190 (2019). https://doi.org/10.1186/s12859-019-2786-5
8. Hojo, M.A., et al.: Identification of a genomic enhancer that enforces proper apop-
tosis induction in thymic negative selection. Nat. Commun. 10(1), 1–15 (2019)
9. Hu, Y., et al.: Long noncoding RNA GAPLINC regulates CD44-dependent cell
invasiveness and associates with poor prognosis of gastric cancer. Cancer Res.
74(23), 6890–6902 (2014)
10. Huang, S., et al.: NEAT1 regulates Th2 cell development by targeting STAT6 for
degradation. Cell Cycle 18(3), 312–319 (2019)
11. Hudson, W.H., et al.: Expression of novel long noncoding RNAs defines virus-
specific effector and memory CD8+ T cells. Nat. Commun. 10(1), 1–11 (2019)
12. Hunt, S.E., et al.: Ensembl variation resources. Database 2018 (2018)
13. Kim, D., Langmead, B., Salzberg, S.L.: HISAT: a fast spliced aligner with low
memory requirements. Nat. Methods 12(4), 357–360 (2015). https://doi.org/10.
1038/nmeth.3317
14. Love, M.I., Huber, W., Anders, S.: Moderated estimation of fold change and dis-
persion for RNA-seq data with DESeq2. Genome Biol. 15(12), 550 (2014). https://
doi.org/10.1186/s13059-014-0550-8
15. Martin, M.: Cutadapt removes adapter sequences from high-throughput sequencing
reads. EMBnet. J. 17(1), 10–12 (2011)
16. Niknafs, Y.S., et al.: TACO produces robust multisample transcriptome assemblies
from RNA-seq. Nat. Methods 14(1), 68–70 (2017)
17. Norman, D.J., et al.: The use of OKT3 in cadaveric renal transplantation for
rejection that is unresponsive to conventional anti-rejection therapy. Am. J. Kidney
Dis. 11(2), 90–93 (1988). https://doi.org/10.1016/S0272-6386(88)80186-0
18. Patro, R., et al.: Salmon provides fast and bias-aware quantification of transcript
expression. Nat. Methods 14(4), 417–419 (2017)
19. Quinlan, A.R.: BEDTools: the Swiss-army tool for genome feature analysis. Curr.
Protocols Bioinform. 47(1), 11–12 (2014)
20. Ranzani, V., et al.: The long intergenic noncoding RNA landscape of human lym-
phocytes highlights the regulation of T cell differentiation by linc-MAF-4. Nat.
Immunol. 16(3), 318 (2015)
21. Reichert, J.M.: Marketed therapeutic antibodies compendium. MAbs 4(3), 413–
415 (2012). https://doi.org/10.4161/mabs.19931
22. Robinson, J.T., et al.: Integrative genomics viewer. Nat. Biotechnol. 29(1), 24–26
(2011)
23. Shui, X., et al.: Knockdown of lncRNA NEAT1 inhibits Th17/CD4+ T cell dif-
ferentiation through reducing the STAT3 protein level. J. Cell. Physiol. 234(12),
22477–22484 (2019)
24. Silva, H.M., et al.: Novel humanized anti-CD3 antibodies induce a predominantly
immunoregulatory profile in human peripheral blood mononuclear cells. Immunol.
Lett. 125(2), 129–136 (2009)
25. Sousa, I.G., et al.: Gene expression profile of human T cells following a single
stimulation of peripheral blood mononuclear cells with anti-CD3 antibodies. BMC
Genomics 20(1), 593 (2019). https://doi.org/10.1186/s12864-019-5967-8
26. Tooley, J.E., et al.: Changes in T-cell subsets identify responders to FcR-
nonbinding anti-CD3 mAb (teplizumab) in patients with type 1 diabetes. Eur.
J. Immunol. 46(1), 230–241 (2016)
27. Wang, P., et al.: The STAT3-binding long noncoding RNA lnc-DC controls human
dendritic cell differentiation. Science 344(6181), 310–313 (2014)
A Simplified Complex Network-Based
Approach to mRNA and ncRNA
Transcript Classification
1 Introduction
More than 150 years after the discovery of nucleic acids, interest in the study of these molecules keeps growing [1]. Since the sequencing of bacteriophage φX174 in 1977 [2], a large number of organisms have been sequenced and stored in databases. The advances in the development of sequencing technologies have led to the generation of a huge volume of biological data.
The proposed method was evaluated and compared with the methodologies CPC, CPC2, and PLEK on the CPC2 dataset [16], which covers six different species; the proposed methodology obtained adequate results, with greater accuracy than the competing methods for all the species evaluated.
Fig. 1. Graph generated from the sequence ‘ACG’, considering the nucleotides as
nodes and its neighborhood as edges.
(Figure: weighted graph with the 2-mers AC, CG, and GC as nodes, illustrating the mapping with k = 2.)
3.1 Materials
This work adopts a dataset in order to validate the proposed method as well as to compare its results with those of the competing methods. The adopted dataset was obtained from CPC2 [16] and contains six species: Arabidopsis thaliana, human, zebrafish, fruitfly, mouse, and worm. The CPC2 dataset presents messenger RNA transcripts (mRNAs), small non-coding transcripts (sncRNAs), and long non-coding transcripts (lncRNAs). In this work, the sncRNA and lncRNA sequences were grouped into an ncRNA subset. Redundant sequences were removed from the dataset. The adopted dataset and the number of samples per class and species are presented in Table 1.
Table 1. Description of the number of samples per class of the dataset adopted in
this work.
3.2 Methods
The first step is to map the sequences into complex networks. The sequences were organized in FASTA files, and the messenger RNA and non-coding RNA sequences were kept separated, i.e., previously labeled, as required by a supervised learning model.
For the mapping of sequences into complex networks, two parameters were adopted: k and step. The step parameter defines the distance traveled along the sequence to define the neighborhood at each iteration. The k parameter defines the number of nucleotides forming each k-mer. Figure 3 shows an example in which the network was generated from the sequence ‘ACCGATG’ with the parameters step = 1 and k = 3, as defined by the BASiNET [18] method.
Fig. 3. Graph generated from the sequence ‘ACCGATG’, considering k-mers, with
k = 3 as nodes and its neighborhood with step = 1 as edges.
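This mapping admits a direct sketch with networkx; the function name is ours, and edge weights count repeated neighborhoods:

```python
import networkx as nx

def sequence_to_network(seq, k=3, step=1):
    """Map a sequence to a graph whose nodes are k-mers and whose edges link
    consecutive k-mers taken at the given step; repeats increase the weight."""
    kmers = [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]
    g = nx.Graph()
    for a, b in zip(kmers, kmers[1:]):
        if g.has_edge(a, b):
            g[a][b]["weight"] += 1
        else:
            g.add_edge(a, b, weight=1)
    return g

g = sequence_to_network("ACCGATG", k=3, step=1)
print(sorted(g.nodes()))  # ['ACC', 'ATG', 'CCG', 'CGA', 'GAT'], as in Fig. 3
```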
(Fig. 4. Feature-selection pipeline: the mRNA and ncRNA sequences are mapped to networks; their adjacency matrices are binarized according to a percentage of exclusivity, subtracted, and filtered to retain only the exclusive edges.)
Fig. 5. Quantity of exclusive and non-exclusive edges (y-axis, 0 to 1000)
identified in the dataset in both classes of RNA, as a function of the
exclusivity percentage (x-axis, 0 to 60%).
It is possible to notice that the highest point on the curve of the exclusive
edges is between 40% and 60%, and that the saturation point of the exclusive
curve is between 40% and 50%. Thus, the exclusivity parameter adopted in this
work was 45% for the feature selection step.
Figure 6 presents an overview of the feature selection step. It is possible to
notice that (a) presents the exclusive edges, (b) the original network, and (c)
the network filtered by considering only the exclusive edges, i.e., the edges
that are not present in the filter (exclusive edges) are removed. This feature
selection reduces the complexity of the BASiNET method [18]. More specifically,
BASiNET considers all the network edges and applies a threshold iteratively to
remove less frequent edges (those with lower weights), so that topological
measurements are extracted at each iteration at different levels of network
resolution. The proposed feature selection approach eliminates the need to
apply this threshold, leading to a simpler and improved approach.
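A minimal sketch of this filter is shown below, under the assumption that an edge is considered exclusive to a class when it occurs in at least 45% of the networks of that class and in none of the other class; the helper names are ours, and the original implementation may differ:

library(igraph)

# Fraction of networks in which each undirected edge appears,
# using a canonical "A|B" identifier per edge.
edge_freq <- function(graphs) {
  ids <- unlist(lapply(graphs, function(g)
    apply(as_edgelist(g), 1, function(e) paste(sort(e), collapse = "|"))))
  table(ids) / length(graphs)
}

# Subtraction/filter step: edges frequent in class A and absent from class B.
exclusive_edges <- function(graphs_a, graphs_b, excl = 0.45) {
  fa <- edge_freq(graphs_a)
  fb <- edge_freq(graphs_b)
  names(fa)[fa >= excl & !(names(fa) %in% names(fb))]
}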
After the feature selection step, the next step is to extract topological
measurements from the networks. This work adopts complex network measurements
commonly used in this context [20], such as: assortativity, average degree,
maximum degree, minimum degree, average betweenness centrality, clustering
coefficient, average shortest path length, standard deviation of the degree,
frequency of motifs of size 3, and frequency of motifs of size 4. However, each
measurement has a different value range; thus, Min-Max normalization is applied
and the measurement values are rescaled to the range [0, 1].
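A sketch of this extraction with igraph is given below; the feature list follows the text above, but the names and the exact measure definitions of the original implementation may differ:

library(igraph)

network_features <- function(g) {
  c(assortativity   = assortativity_degree(g),
    avg_degree      = mean(degree(g)),
    max_degree      = max(degree(g)),
    min_degree      = min(degree(g)),
    avg_betweenness = mean(betweenness(g)),
    clustering      = transitivity(g, type = "global"),
    avg_path_length = mean_distance(g),
    sd_degree       = sd(degree(g)),
    motifs3         = count_motifs(g, size = 3),
    motifs4         = count_motifs(g, size = 4))
}

# `networks` (one igraph object per sequence) is assumed to exist.
features <- t(sapply(networks, network_features))
minmax   <- function(x) (x - min(x)) / (max(x) - min(x))
features <- apply(features, 2, minmax)  # rescale each measurement to [0, 1]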
The last step consists of classifying the sequences based on the topological
features extracted from their respective networks. For this step, supervised
learning was adopted using the Random Forest [33] classifier. The R project [34]
was adopted, and the rfUtilities [35] package was used for 10-fold
cross-validation.
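A sketch of this step is shown below; the construction of the labels vector and the use of rf.crossValidation (which cross-validates by repeatedly withholding a fraction of the data, here approximating the 10-fold setup) are our assumptions about how the cited packages were combined:

library(randomForest)
library(rfUtilities)

# `features` is the rescaled measurement matrix; `labels` holds the known
# classes of the training sequences (illustrative construction).
labels <- factor(rep(c("mRNA", "ncRNA"), times = c(nrow_mrna, nrow_ncrna)))
rf <- randomForest(x = as.data.frame(features), y = labels)
# Withhold 10% of the data in each of 10 iterations.
cv <- rf.crossValidation(rf, as.data.frame(features), p = 0.10, n = 10)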
4 Results
To evaluate the proposed approach, the adopted dataset (Sect. 3.1) was used in
the same way for all the methods. The proposed approach was performed with the
following parameter values: edge exclusivity = 45%, step = 1, and k = 3, which
were presented and justified in Sect. 3.2.
Table 2 presents the accuracy rates of the classification of mRNAs and ncRNAs
using 10-fold cross-validation. The proposed approach correctly distinguishes
ncRNA from mRNA sequences, achieving accuracy rates higher than 98.7% for all
species.
Table 2. Accuracy rates in the classification of mRNA and ncRNA sequences using
the proposed method for different species.

Species      mRNA           ncRNA
Human        6142 (100%)    12005 (99.9%)
Fruitfly     3665 (99.5%)   3551 (99.7%)
Mouse        10638 (100%)   12251 (100%)
Zebrafish    2313 (98.7%)   1528 (100%)
Arabidopsis  15930 (100%)   3834 (100%)
Worm         3523 (99.7%)   9312 (100%)
Figure 7 presents the average accuracy of the adopted methods for the
classification of the mRNA and ncRNA sequences, considering each species of the
dataset. It can be noted that CPC and PLEK present greater variations in their
results, while the proposed approach and CPC2 show more stable behaviors. The
proposed approach presents superior results when compared with all competitor
methods, indicating its suitability for the classification of mRNA and ncRNA
sequences.
Fig. 7. Comparison of the accuracy (%) of the methods (PLEK, CPC, CPC2, and the
proposed approach) considering the species (Human, Mouse, Zebrafish, Fruitfly,
Arabidopsis, and Worm) separately.
Table 3 presents the average results considering all the species available in
the dataset. It can be noted that the classification of mRNA and ncRNA sequences
by the proposed approach presents results superior to the competing methods,
again indicating its adequacy for the classification of transcripts.
The results indicated a high accuracy in the identification of RNA sequences
by the proposed approach. The average results obtained both for the species
individually and when grouped indicate the robustness of the proposed approach,
with small variations and superior accuracies when compared with the competitor
methods. Therefore, the feature selection approach based on filtering the
exclusive edges proved adequate for the correct identification of the features,
reducing the complexity of the classification while keeping a high accuracy rate
for transcript identification.
5 Conclusion
Complex network theory has been successfully applied to model various real-
world problems, in particular in bioinformatics. This work applied the theory of
complex networks to the classification of mRNA and ncRNA transcripts.
References
1. Dahm, R.: Friedrich Miescher and the discovery of DNA. Dev. Biol. 278(2), 274–
288 (2005)
2. Sanger, F., et al.: Nucleotide sequence of bacteriophage φ X174 DNA. Nature
265(5596), 687–695 (1977)
3. Hogeweg, P.: The roots of bioinformatics in theoretical biology. PLOS Comput.
Biol. 7(3), 1–5 (2011)
4. Wang, Z., Gerstein, M., Snyder, M.: RNA-Seq: a revolutionary tool for transcrip-
tomics. Nat. Rev. Genet. 10(1), 57–63 (2009)
5. Costa-Silva, J., Domingues, D., Lopes, F.M.: RNA-Seq differential expression anal-
ysis: an extended review and a software tool. PLOS One 12(12), 1–18 (2017)
6. Garber, M., Grabherr, M.G., Guttman, M., Trapnell, C.: Computational methods
for transcriptome annotation and quantification using RNA-Seq. Nat. Methods
8(6), 469–477 (2011)
7. Panwar, B., Arora, A., Raghava, G.P.S.: Prediction and classification of ncRNAs
using structural information. BMC Genomics 15(1), 1–13 (2014)
8. Wang, K.C., Chang, H.W.: Molecular mechanisms of long noncoding RNAs. Mol.
Cell 43(6), 904–914 (2011)
9. Peng, Y., Li, J., Zhu, L.: Chapter 8 - cancer and non-coding RNAs. In: Ferguson,
B.S. (ed.)Nutritional Epigenomics, volume 14 of Translational Epigenetics, pp.
119–132. Academic Press (2019)
10. Long, Y., Wang, X., Youmans, D.T., Cech, T.R.: How do lncRNAs regulate tran-
scription? Sci. Adv. 3(9), eaao2110 (2017)
11. Chen, X., Yan, G.-Y.: Novel human lncRNA-disease association inference based on
lncRNA expression profiles. Bioinformatics 29(20), 2617–2624 (2013)
12. Huarte, M.: The emerging role of lncRNAs in cancer. Nat. Med. 21(11), 1253–1261
(2015)
13. Peng, W.-X., Koirala, P., Mo, Y.-Y.: LncRNA-mediated regulation of cell signaling
in cancer. Oncogene 36(41), 5661–5667 (2017)
14. Kung, J.T., Colognori, D., Lee, J.T.: Long noncoding RNAs: past, present, and
future. Genetics 193(3), 651–669 (2013)
15. Kong, L., et al.: CPC: assess the protein-coding potential of transcripts using
sequence features and support vector machine. Nucleic Acids Res. 35(suppl 2),
W345–W349 (2007)
16. Kang, Y.J., et al.: CPC2: a fast and accurate coding potential calculator based on
sequence intrinsic features. Nucleic Acids Res. 45(W1), W12–W16 (2017)
17. Li, A., Zhang, J., Zhou, Z.: PLEK: a tool for predicting long non-coding RNAs
and messenger RNAs based on an improved k-mer scheme. BMC Bioinformatics
15(1), 311 (2014)
18. Ito, E.A., Katahira, I., Vicente, F.F.D.R., Pereira, L.F.P., Lopes, F.M.: BASiNET
- biological sequences network: a case study on coding and non-coding RNAs iden-
tification. Nucleic Acids Res. 46(16), e96–e96 (2018)
19. Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.-U.: Complex net-
works: structure and dynamics. Phys. Rep. 424(4), 175–308 (2006)
20. Costa, L.D.F., Rodrigues, F.A., Travieso, G., Villas Boas, P.R.: Characterization
of complex networks: a survey of measurements. Adv. Phys. 56(1), 167–242 (2007)
21. Diestel, R.: Graph Theory, 3rd edn. Springer-Verlag, Heidelberg (2005)
22. Barabási, A.L.: Linked: How Everything Is Connected to Everything Else and
What It Means. Plume, New York (2003)
23. Watts, D.J., Strogatz, S.H.: Collective dynamics of small-world networks. Nature
393, 440–442 (1998)
24. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science
286(5439), 509–512 (1999)
25. Newman, M.E.J.: The structure and function of complex networks. SIAM Rev.
45(2), 167–256 (2003)
26. Lopes, F.M., Cesar, R.M., da F. Costa, L.: AGN simulation and validation model.
In: Bazzan, A.L.C., Craven, M., Martins, N.F. (eds.) BSB 2008. LNCS, vol.
5167, pp. 169–173. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-
540-85557-6 17
27. Zou, Y.M.: Modeling and analyzing complex biological networks incorporating
experimental information on both network topology and stable states. Bioinfor-
matics 26(16), 2037–2041 (2010)
28. Lopes, F.M., Cesar Jr, R.M., Costa, L.D.F.: Gene expression complex networks:
synthesis, identification, and analysis. J. Comput. Biol. 18(10), 1353–1367 (2011)
29. Lopes, F.M., Martins Jr., D.C., Barrera, J., Cesar Jr., R.M.: A feature selection
technique for inference of graphs from their known topological properties: revealing
scale-free gene regulatory networks. Inf. Sci. 272, 1–15 (2014)
30. da Rocha Vicente, F.F., Lopes, F.M.: SFFS-SW: a feature selection algorithm
exploring the small-world properties of GNs. In: Comin, M., Käll, L., Marchiori, E.,
Ngom, A., Rajapakse, J. (eds.) PRIB 2014. LNCS, vol. 8626, pp. 60–71. Springer,
Cham (2014). https://doi.org/10.1007/978-3-319-09192-1 6
31. de Lima, G.V.L., Castilho, T.R., Bugatti, P.H., Saito, P.T.M., Lopes, F.M.: A com-
plex network-based approach to the analysis and classification of images. CIARP
2015. LNCS, vol. 9423, pp. 322–330. Springer, Cham (2015). https://doi.org/10.
1007/978-3-319-25751-8 39
32. de Lima, G.V., Saito, P.T., Lopes, F.M., Bugatti, P.H.: Classification of texture
based on bag-of-visual-words through complex networks. Expert Syst. Appl. 133,
215–224 (2019)
33. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3),
18–22 (2002)
34. R Core Team. R: A Language and Environment for Statistical Computing. R Foun-
dation for Statistical Computing, Vienna, Austria (2014)
35. Evans, J.S., Murphy, M.A.: rfUtilities. R package version 2.1-3 (2018)
A Systems Biology Driven Approach to Map
the EP300 Interactors Using Comprehensive
Protein Interaction Network
1 Introduction
EP300 (p300) is a ubiquitously expressed transcriptional coactivator and a member of the
EP300/CBP family of type 3 major lysine (K) acetyltransferases (KAT3), present in all
mammals and found in many multicellular organisms, such as flies, worms, and plants.
In humans, the EP300 gene comprises 31 exons on chromosome 22 (locus 22q13) and spans
approximately 90 kb. Overexpression and inappropriate activation of EP300 are associated
with malignancy, tumor size, poor differentiation, tumor progression, and poor prognosis
[1–3]. Increased expression of EP300 has been observed in advanced human malignancies,
such as liver and prostate cancers and primary human breast cancers [4]. Recent reports
highlight EP300 as a central regulator of the angiogenesis, hypoxia, and EMT pathways in
esophageal squamous carcinoma [5]. Increased expression of cancer stem cell markers and
tumorsphere formation were observed in EP300-depleted cells and diminished in
EP300-overexpressing cells [6]. Apart from cancer, EP300 is a key player in
Rubinstein-Taybi syndrome (RTS or RSTS) [7].
EP300 shares high sequence homology with CBP (CREBBP or KAT3A), and less with other
acetyltransferases [8]. The two proteins have almost 86% amino acid residue identity in
the catalytic domain, and significant sequence homology is found in several types of
protein-protein interaction motifs and other non-catalytic domains [9]. In EP300, the
acetyltransferase domain spans residues 1284 to 1673, and the IBiD (interferon binding
domain) is located at the C-terminal side. The IBiD contains an NCBD (nuclear coactivator
binding domain) and a glutamine-rich domain, followed by a proline-containing PxP motif.
There are three cysteine/histidine-rich domains (C/H): C/H1, C/H2 (which is part of the
catalytic domain), and C/H3. The C/H1 and C/H3 domains contain transcriptional adaptor
zinc fingers (TAZ1 and TAZ2), and the C/H3 domain additionally contains a ZZ zinc finger.
The C/H2 domain contains a plant homeodomain (PHD), and the interferon binding homology
domain (IHD), the KIX domain, and the bromodomain are located between the C/H1 and C/H2
domains [10, 11].
EP300 functions as an acetyltransferase, facilitating transcription through acetylation
of histones, transcription factors, sequence-specific DNA binding factors, and the basal
transcriptional machinery. During intracellular or extracellular signaling, the cell must
turn on different subsets of genes to regulate different cellular functions, which is
accomplished by acetylation of histone proteins during transcription. Most cellular
signaling pathways, such as the cAMP signaling pathway, HIF-1 signaling pathway, FoxO
signaling pathway, cell cycle, Wnt signaling pathway, Notch signaling pathway, TGF-beta
signaling pathway, adherens junction signaling, Jak-STAT signaling pathway, and DNA
damage pathways, use EP300 as a downstream effector protein [12, 13]. EP300 responds to
these signaling pathways differently, depending mainly on the cell environment and its
phosphorylation state. Various proteins such as PKC, cyclin E/CDK-2, CaMKIV, IKK, and
AKT phosphorylate EP300 at different sites, which ultimately impacts its
acetyltransferase activity. In addition, self-modification (auto-acetylation) of EP300
also influences its acetyltransferase activity. EP300 contains methylation sites near the
KIX domain and a lysine SUMOylation site near the bromodomain. EP300 also has acetylation
sites (17 lysine residues) in the regulatory loop of the acetyltransferase domain, and
their acetylation is essential for its acetyltransferase activity and for binding to
other proteins. Furthermore, EP300 binds to different proteins through its protein
interaction domains and thereby regulates a wide variety of signaling pathways [14, 15].
All these reports clearly show that EP300 regulates signaling pathways by interacting
with multiple proteins, and targeting these interactions in disease conditions could be
a good solution. To this end, all the experimentally validated datasets of EP300
interactors were collected from primary protein interaction databases, followed by
tracing their subcellular locations and functional annotations. Finally, the interactome
of EP300 with its interactors was constructed and the first-degree interactors were
identified.
206 S. Kandagalla et al.
Experimentally detected proteins interacting with EP300 were extracted from public
databases such as IntAct [16], BioGRID [17], APID [18], PINA [19], Mentha [20],
HitPredict [21], WiKi-Pi [22], PIPs [23], PPI-finder [24], and PrePPI [25]. Non-human
interactors of EP300 were excluded from the study. The gene and protein symbols were
identified using UniProt Knowledgebase (UniProtKB) ID mapping [26].
The subcellular location of the EP300 interactors was explored using the UniProtKB
database, based on the record "Subcellular location". The UniProtKB database acts as a
central hub for identifying functional information of proteins with accurate
annotations; it includes widely accepted biological ontologies, classifications, and
cross-references, as well as clear indications of the quality of annotation in the form
of evidence attribution of experimental and computational data
(https://www.uniprot.org/help/uniprotkb). The PANTHER classification system was used to
identify the protein classes of the EP300 interactors. The PANTHER (Protein Analysis
THrough Evolutionary Relationships) database contains comprehensive information on the
evolution and function of protein-coding genes from 104 completely sequenced genomes.
PANTHER classification tools allow users to classify new protein sequences and to
analyze gene lists obtained from large-scale genomics experiments [27, 28].
The primary protein interaction data of the EP300 interactors were extracted from the
STRING database v10.5 [31] with a high confidence score of 0.9. The interactions in the
STRING database are derived from different sources: text mining, experiments,
co-expression, neighborhood, gene fusion, and co-occurrence. A high confidence score
above 0.9 indicates that the interactions are strongly supported by the above-mentioned
sources. Lower-confidence interactions (score 0.7) were considered for the N4BP2, MSTO1,
MYB, HOXD10, and KLF16 interactors. The interactome of EP300 with its interactors was
constructed using Cytoscape 3.4.0 [30] based on the subcellular location, and the
first-degree interactors of EP300 were identified from the core interactome.
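As an illustration of this network analysis outside Cytoscape, the following R sketch (igraph) builds the interactome, applies the 0.9 confidence cutoff, and recovers the hubs and the first neighbors of EP300; the file name is hypothetical, the column names follow the usual STRING protein-links format, and we assume node identifiers have already been mapped to gene symbols:

library(igraph)

# STRING scales confidence scores to 0-1000, so 0.9 corresponds to 900.
links <- read.table("string_protein_links.txt", header = TRUE)
links <- subset(links, combined_score >= 900)[, c("protein1", "protein2")]
net   <- simplify(graph_from_data_frame(links, directed = FALSE))

# Node identifiers are assumed to be gene symbols (e.g., "EP300").
head(sort(degree(net), decreasing = TRUE), 10)   # top hubs by degree
first_deg <- names(neighbors(net, "EP300"))      # first-degree EP300 partners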
Finally, the analysis of the EP300 interactors present in both cytoplasm and nucleus
revealed that most of the interactors participate in the regulation of transcription,
positive regulation of the macromolecule metabolic process, and regulation of the RNA
metabolic process (Fig. 3B). Transcriptional regulator activity, transcription factor
binding, transcriptional activator activity, and transcription factor activity are the
top enriched molecular functions (Fig. 3B). Together, this analysis reveals the
functional significance of the EP300 interactors present in the different subcellular
locations. Interactors present in the cytoplasm were mainly involved in biological
processes such as response to extracellular stimulus, positive regulation of apoptosis,
positive regulation of programmed cell death, and positive regulation of cell death,
whereas interactors present in both cytoplasm and nucleus were engaged in almost the
same biological processes.
Fig. 1. Functional enrichment of EP300 interactors present in the Cytoplasm: Top annotated
EP300 interactors involved in A) Biological Process, B) Molecular function, C) Pathways (KEGG
and PANTHER), D) PANTHER protein class.
Fig. 2. Functional enrichment of EP300 interactors present in the Nucleus: Top annotated EP300
interactors involved in A) Biological Process, B) Molecular function, C) Pathways (KEGG and
PANTHER), D) PANTHER protein class.
Finally, the EP300 interactors present in both cytoplasm and nucleus are enriched in
pathways in cancer, chronic myeloid leukemia, prostate cancer, acute myeloid leukemia,
pancreatic cancer, cell cycle, and the ErbB signaling pathway, which are among the top
enriched KEGG pathways. The PDGF signaling pathway, JAK/STAT signaling pathway, B cell
activation, T cell activation, p53 pathway, EGF receptor signaling pathway, p53 pathway
feedback loops 2, and TGF-beta signaling pathway are among the top enriched PANTHER
pathways (Fig. 3C). Collectively, these analyses provide details on the pathways
associated with the EP300 interactors. The results show that, apart from normal
pathways, the EP300 interactors are also enriched in disease-related pathways such as
cancer, infection, etc. Further analysis of the EP300 interactors associated with
disease-related pathways gives broad insights into the role of EP300 and also provides
a new avenue for developing new drugs.
Fig. 3. Functional enrichment of EP300 interactors present in both Cytoplasm and Nucleus:
Top annotated EP300 interactors involved in A) Biological Process, B) Molecular function, C)
Pathways (KEGG and PANTHER), D) PANTHER protein class.
EP300 has a direct connection. UBA52, NR3C1, GRIP1, SREBF1, and JUN are the top nodes
based on degree. Further, the network of the EP300 interactors present in the cytoplasm
and nucleus was constructed; it has 1214 nodes with 5443 edges (Fig. 4A). In total, 89
nodes have a direct connection with EP300, among which TP53, JUN, AKT1, CREBBP, HDAC2,
HDAC1, and MYC are the top nodes based on degree (the complete list is provided in the
supplementary file [37]).
Altogether, the final interactome consists of 2388 nodes with 12577 edges. Among these
2388 nodes, EP300 forms direct interactions with only 186 nodes (Fig. 4C), among which
TP53, CREBBP, JUN, HDAC1, CTNNB1, MYC, PCNA, HDAC2, FOS, and KAT2B are the top
interactors. Several previous reports have shown the significance of the EP300
interactions with TP53, CREBBP, JUN, CTNNB1, and MYC in several pathophysiological
conditions [33–36], yet their role is still not clearly understood. Further in vitro
validation of these interactors is required to understand the role of EP300 in
different cancer conditions, which would ultimately help in developing novel
inhibitors; these interactors may also act as potential biomarkers.
Fig. 4. Interactome of EP300 with its interactors. A) The PPI network of EP300
interactors: nodes present in the nucleus are colored red, green nodes correspond to the
cytoplasm, and grey nodes correspond to both nucleus and cytoplasm. B) Venn diagram
showing the number of EP300 interacting partners present in the different subcellular
locations. C) First neighbors of EP300 in the interactome.
4 Conclusion
The evaluation of the EP300 interactors present in different subcellular locations
provides a broad view of the role of EP300 in complex diseases and cellular events. The
functional and pathway enrichment analysis of the EP300 interactors clearly shows their
involvement in several pathological conditions, mainly in cancer. Among the EP300
interactors, TP53, CREBBP, JUN, HDAC1, CTNNB1, MYC, PCNA, HDAC2, FOS, and KAT2B are the
top first-degree nodes; these interactors are the key players with which EP300 interacts
to perform its functions. Further in vitro validation of these interactors with EP300 is
required in different cancer conditions. Altogether, the present analysis gives a
complete overview of the EP300 interactors present in different subcellular locations.
Acknowledgement. The work was supported by Act 211 Government of the Russian Federation,
contract 02.A03.21.0011 and by the Ministry of Science and Higher Education of Russia (Grant
FENU-2020–0019).
Conflict of Interest. All authors declare that they have no competing interests.
References
1. Karamouzis, M.V., Konstantinopoulos, P.A., Papavassiliou, A.G.: Roles of CREB-binding
protein (CBP)/p300 in respiratory epithelium tumorigenesis. Cell Res. 17, 324–332 (2007).
https://doi.org/10.1038/cr.2007.10
2. Dutto, I., Scalera, C., Prosperi, E.: CREBBP and p300 lysine acetyl transferases in the DNA
damage response. Cell. Mol. Life Sci. 75(8), 1325–1338 (2017). https://doi.org/10.1007/s00
018-017-2717-4
3. Mees, S.T., Mardin, W.A., Wendel, C., et al.: EP300 - a miRNA-regulated metastasis
suppressor gene in ductal adenocarcinomas of the pancreas. Int. J. Cancer 126, 114–124 (2010).
https://doi.org/10.1002/ijc.24695
4. Yang, H., Pinello, C.E., Luo, J., et al.: Small-molecule inhibitors of acetyltransferase p300
identified by high-throughput screening are potent anticancer agents. Mol. Cancer Ther. 12,
610–620 (2013). https://doi.org/10.1158/1535-7163.MCT-12-0930
5. Bi, Y., Zhang, L., et al.: EP300 promotes tumor development and correlates with poor
prognosis in esophageal squamous carcinoma. Oncotarget 9(1), s376–s392 (2018)
6. Asaduzzaman, M., et al.: Tumour suppressor EP300, a modulator of paclitaxel resistance and
stemness, is downregulated in metaplastic breast cancer. Breast Cancer Res. Treat. 163(3),
461–474 (2017). https://doi.org/10.1007/s10549-017-4202-z
7. Babu, A., et al.: Chemical and genetic rescue of an ep300 knockdown model for Rubinstein
Taybi Syndrome in zebrafish. Biochim. Biophys. Acta - Mol. Basis Dis. 1864, 1203–1215
(2018). https://doi.org/10.1016/j.bbadis.2018.01.029.
8. Gayther, S.A., Batley, S.J., Linger, L., et al.: Mutations truncating the EP300 acetylase in
human cancers. Nat. Genet. 24, 300–303 (2000). https://doi.org/10.1038/73536
9. Chan, H.M., La Thangue, N.B.: p300/CBP proteins: HATs for transcriptional bridges and
scaffolds. J. Cell. Sci. 114, 2363–2373 (2001)
10. Ogryzko, V.V., Schiltz, R.L., Russanova, V., et al.: The transcriptional coactivators p300 and
CBP are histone acetyltransferases. Cell 87, 953–959 (1996)
11. Yang, X.J., Seto, E.: Lysine acetylation: codified crosstalk with other post-translational
modifications. Mol. Cell. 31, 449–461 (2008). https://doi.org/10.1016/j.molcel.2008.07.002
12. Vo, N., Goodman, R.H.: CREB-binding protein and p300 in transcriptional regulation. J. Biol.
Chem. 276, 13505–13508 (2001). https://doi.org/10.1074/jbc.R000025200
13. Bedford, D.C., Brindle, P.K.: Is histone acetylation the most important physiological function
for CBP and p300? Aging (Albany NY) 4, 247–55 (2012). https://doi.org/10.18632/aging.
100453
14. Attar, N., Kurdistani, S.K.: Exploitation of EP300 and CREBBP lysine acetyltransferases by
cancer. Cold Spring Harb. Perspect. Med. 7, a026534 (2017). https://doi.org/10.1101/cshper
spect.a026534
15. Dancy, B.M., Cole, P.A.: Protein lysine acetylation by p300/CBP. Chem. Rev. 115, 2419–2452
(2015). https://doi.org/10.1021/cr500452k
16. Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., et al.: IntAct: an open source molecular
interaction database. Nucleic Acids Res. 32, D452–D455 (2004). https://doi.org/10.1093/nar/
gkh052
17. Stark, C., Breitkreutz, B.J., Reguly, T., et al.: BioGRID: a general repository for interaction
datasets. Nucleic Acids Res. 34, D535–D539 (2006). https://doi.org/10.1093/nar/gkj109
18. Alonso-López, D., Gutiérrez, M.A., Lopes, K.P., et al.: APID interactomes: providing
proteome-based interactomes with controlled quality for multiple species and derived
networks. Nucleic Acids Res. 44, W529–W535 (2016). https://doi.org/10.1093/nar/gkw363
19. Cowley, M.J., Pinese, M., Kassahn, K.S., et al.: PINA v2.0: mining interactome modules.
Nucleic Acids Res. 40, D862–D865 (2012). https://doi.org/10.1093/nar/gkr967
20. Calderone, A., Castagnoli, L., Cesareni, G.: mentha: a resource for browsing integrated
protein-interaction networks. Nat. Methods 10, 690–691 (2013). https://doi.org/10.1038/
nmeth.2561
21. Patil, A., Nakai, K., Nakamura, H.: HitPredict: a database of quality assessed protein-protein
interactions in nine species. Nucleic Acids Res. 39, D744–D749 (2011). https://doi.org/10.
1093/nar/gkq897
22. Orii, N., Ganapathiraju, M.K.: Wiki-Pi: a web-server of annotated human protein-protein
interactions to aid in discovery of protein function. PLoS ONE 7, e49029 (2012). https://doi.
org/10.1371/journal.pone.0049029
23. McDowall, M.D., Scott, M.S., Barton, G.J.: PIPs: human protein-protein interaction pre-
diction database. Nucleic Acids Res. 37, D651–D656 (2009). https://doi.org/10.1093/nar/
gkn870
24. He, M., Wang, Y., Li, W.: PPI Finder: a mining tool for human protein-protein interactions.
PLoS ONE 4, e4554 (2009). https://doi.org/10.1371/journal.pone.0004554
25. Zhang, Q.C., Petrey, D., Garzón, J.I., et al.: PrePPI: a structure-informed database of protein-
protein interactions. Nucleic Acids Res. 41, D828–D833 (2013). https://doi.org/10.1093/nar/
gks1231
26. Bateman, A., Martin, M.J., O’Donovan, C., et al.: UniProt: the universal protein knowl-
edgebase. Nucleic Acids Res. 45, D158–D169 (2017). https://doi.org/10.1093/nar/gkw
1099
27. Thomas, P.D., Campbell, M.J., Kejariwal, A., et al.: PANTHER: a library of protein families
and subfamilies indexed by function. Genome Res 13, 2129–2141 (2003). https://doi.org/10.
1101/gr.772403
28. Mi, H., Dong, Q., Muruganujan, A., et al.: PANTHER version 7: improved phylogenetic
trees, orthologs and collaboration with the gene ontology consortium. Nucleic Acids Res. 38,
D204–D210 (2010). https://doi.org/10.1093/nar/gkp1019
29. Huang, D.W., Sherman, B.T., Tan, Q., et al.: The DAVID gene functional classification tool:
a novel biological module-centric algorithm to functionally analyze large gene lists. Genome
Biol. 8, R183 (2007). https://doi.org/10.1186/gb-2007-8-9-r183
30. Shannon, P., Markiel, A., Ozier, O., et al.: Cytoscape: a software environment for integrated
models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003). https://
doi.org/10.1101/gr.1239303
31. Szklarczyk, D., Morris, J.H., Cook, H., et al.: The STRING database in 2017: quality-
controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res.
45, D362–D368 (2017). https://doi.org/10.1093/nar/gkw937
32. Kong, F.Y., et al.: Bioinformatics analysis of the proteins interacting with LASP-1 and their
association with HBV-related hepatocellular carcinoma. Sci. Rep. 7, 1–5 (2017). https://doi.
org/10.1038/srep44017
33. Kaypee, S., et al.: Mutant and wild-type tumor suppressor p53 induces p300 autoacetylation.
iScience 4, 260–272 (2018). https://doi.org/10.1016/j.isci.2018.06.002
34. Attar, N., Kurdistani, S.K.: Exploitation of EP300 and CREBBP lysine acetyltransferases by
cancer. Cold Spring Harb. Perspect. Med. 7 (2017). https://doi.org/10.1101/cshperspect.a02
6534.
35. Yu, W., et al.: β-catenin cooperates with CREB-binding protein to promote the growth
of tumor cells. Cell Physiol. Biochem. 44, 467–478 (2017). https://doi.org/10.1159/000485013
36. Wang, Y.N., Chen, Y.J., Chang, W.C.: Activation of extracellular signal-regulated kinase
signaling by epidermal growth factor mediates c-Jun activation and p300 recruitment in ker-
atin 16 gene expression. Mol. Pharmacol. 69, 85–98 (2006). https://doi.org/10.1124/mol.105.
016220
37. Kandagalla, S., Grishina, M., Potemkin, V., Shekarappa, S.B., Gollapalli, P.: A systems biol-
ogy driven approach to map the EP300 interactors using comprehensive protein interaction
network (2020). https://doi.org/10.5281/ZENODO.4112838
Analyzing Switch Regions of Human Rab10
by Molecular Dynamics Simulations
1 Introduction
Rab10 is a small monomeric enzyme that belongs to the Rab GTPase family [1]. It is
responsible for regulating intracellular traffic in various pathways at different
cellular sublocations, having roles in the endoplasmic reticulum, trans-Golgi network,
endosomes, lysosomes, and primary cilium [2]. Functional deficiencies in the Rab10
pathways are implicated in ciliopathies [3], glioblastomas [4], and neurodegenerative
diseases [5]. Studies have shown that Rab10 has a relevant role in Alzheimer's disease
(AD), participating in the processing of the amyloid precursor protein (APP) and in the
production of Aβ through intracellular vesicle transport [6]. Such evidence paves the
way for new drug-targeting strategies in the treatment of AD. Thus, modulation of Rab10
activity may represent an alternative to reduce the proportion of neurotoxic Aβ, making
it a potential therapeutic target for the prevention and treatment of AD [5].
Rab GTPases regulate cellular processes by alternating between the nucleotides GTP and
GDP. When bound to GTP, their switch 1, interswitch, and switch 2 regions interact with
a series of effector proteins, promoting downstream signaling events. On the other hand,
the hydrolysis of GTP results in conformational changes in the G domain of these
enzymes, inactivating them [7]. The differences between the GDP- and GTP-bound
conformations of the G domain suggest that after GTP hydrolysis the switch 1 and
switch 2 regions show a high degree of flexibility and disorder. In contrast, these
regions are stabilized in the active state, which favors the recognition of Rab10 by
effector proteins [8].
Therefore, the present study aimed to detail the structural flexibility of Rab10 and
its switch regions through 200 ns molecular dynamics (MD) simulations. Previously,
10 ns MD simulations had been performed to investigate the internal movements of
wild-type and mutant Rab5a [9]. However, to date, no in silico study on Rab10 bound to
GTP and GDP has been carried out. Due to the unavailability of crystallographic models
of Rab10 associated with GTP and GDP, we used molecular docking to form the complexes
with these nucleotides. Thus, it was possible to analyze atomic movements using
classical mechanics and to verify whether the heuristic method used was able to
describe the active "ON" and inactive "OFF" states of this enzyme. The results
discussed here may be useful for MD studies that aim to identify potential
nucleotide-based competitive inhibitors against Rab10.
2 Methodology
2.1 Molecular Docking
The Rab10 (PDB ID: 5SZJ) [10], GDP, and GTP structures were downloaded from the Protein
Data Bank (PDB) [11]. Modeller v9.23 [12] was used to fill in the missing atoms of
Rab10. Hydrogens were added to each structure, considering the protonation state of the
atoms at physiological pH, using Open Babel 3.0.0 [13]. AutoDock Vina 1.1.2 [14] was
used to dock the nucleotides at the active site of Rab10. The grid box, with a size of
22 Å³, was centered on the average of the Cartesian coordinates of the co-crystallized
GNP nucleotide: 34.692, 26.252, and –46.027 for the x, y, and z axes, respectively. The
GNP compound was redocked to validate the docking protocol. The pose of each nucleotide
was chosen based on the lowest binding energy and the highest number of intermolecular
bonds. The interactions between the ligands and the receptor were calculated using the
Maestro 12.3 interface [15].
Energy minimization was performed using the steepest descent method with a maximum
force of 1000 kJ·mol⁻¹·nm⁻¹. After minimization, the systems were equilibrated in two
stages: a canonical NVT ensemble (constant number of particles, volume, and
temperature) followed by an isothermal-isobaric NPT ensemble (constant number of
particles, pressure, and temperature). The NVT equilibration was performed at a
constant temperature of 300 K for 500 ps, and the NPT equilibration at a constant
pressure of 1 bar and a constant temperature of 300 K for 500 ps. The production step
was carried out at 300 K for 200 ns, and the trajectories were saved every 10 ps. Root
mean square deviation (RMSD), root mean square fluctuation (RMSF), radius of gyration
(Rg), and solvent accessible surface area (SASA) analyses were used to study the
trajectories.
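The RMSD and RMSF parts of this analysis can be reproduced in R with the bio3d package, as in the sketch below; file names are hypothetical, and we assume the GROMACS trajectory has been converted to DCD format (Rg and SASA would come from the corresponding GROMACS utilities, gmx gyrate and gmx sasa):

library(bio3d)

pdb <- read.pdb("rab10_gtp.pdb")   # reference structure (hypothetical file)
trj <- read.dcd("rab10_gtp.dcd")   # trajectory converted to DCD

# Superpose all frames on the C-alpha atoms of the reference.
ca  <- atom.select(pdb, elety = "CA")
xyz <- fit.xyz(fixed = pdb$xyz, mobile = trj,
               fixed.inds = ca$xyz, mobile.inds = ca$xyz)

rd <- rmsd(xyz[1, ca$xyz], xyz[, ca$xyz])  # RMSD of each frame vs. frame 1
rf <- rmsf(xyz[, ca$xyz])                  # per-residue fluctuations (RMSF)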
3 Results and Discussion
The lowest energy values for each test were grouped and their molecular interactions
were analyzed. The most promising poses of each ligand are described in Table 1. The
comparison between the co-crystallized GNP ligand pose and all docking poses indicated
RMSD ≤ 0.60 Å. These values are lower than the tolerance level of 2.0 Å, indicating
that the docking protocol was validated. The GDP and GTP nucleotides were successfully
docked at the active site of Rab10. The complexes presented intermolecular bonds in
common: two salt bridges are formed with residue K22 and another with D125; six
hydrogen bonds involve residues G21, K22, T23, C24, D125, and K154; and two π-stacking
interactions are found with residue F34. Moreover, the presence of the γ-phosphate in
GTP guarantees four additional hydrogen bonds, with residues G19, T41, G67, and D125.
These binding modes are consistent with the interactions found in the crystal structure
of Rab10 associated with GNP [10].
Table 1. Scores from AutoDock Vina and the number of interactions calculated by Maestro.
The RMSD values (Table 2) showed significant differences when Rab10 was associated with
the tested nucleotides. The Rab10_GDP system showed greater fluctuations than
Rab10_GTP. This is explained by the absence of the γ-phosphate in GDP, which makes the
switch 1 region more flexible due to the lack of stabilizing bonds between the enzyme
and this nucleotide. In contrast, the results found for switch 2 did not reflect the
nature of the structural flexibility of Rab10. High RMSD values were expected in
switch 2 when Rab10 was associated with GDP, reflecting the high fluctuations resulting
from disordered movements. However, in this region the Rab10_GDP system showed the
lowest RMSD value. In the case of the interswitch, although the RMSD was higher in the
Rab10_GTP system, the variations in flexibility were not significant.
Table 2. Analysis of RMSD (nm) for the entire enzyme, switch 1, interswitch and switch 2.
Figure 1 shows the results of the RMSF, SASA, and Rg analyses of the enzyme. The RMSF
identifies the amino acid residues that contributed most to the fluctuations during the
simulation. As can be seen in Fig. 1A, the residues composing the switch 1 region
showed greater fluctuations when Rab10 is bound to GDP and smaller fluctuations when
bound to GTP. Although the Rab10_GDP system had predominantly higher peaks than
Rab10_GTP in the interswitch region, the difference in RMSF values between the two
systems was subtle. In the switch 2 region, the RMSF values reflected greater
fluctuations when Rab10 is associated with GTP, which is erroneous and does not
represent Rab10's biological behavior.
The radius of gyration values of switch 1 (see Fig. 1B) confirm that this region has
more disordered movements when Rab10 is associated with GDP: the Rab10_GDP system
showed an average of 0.94 ± 0.03 nm, while Rab10_GTP showed 0.91 ± 0.01 nm. Constant
Rg values indicate stably folded structures, so the switch 1 region of Rab10 has
greater flexibility when bound to GDP. The analysis of the interswitch radius of
gyration (see Fig. 1C) showed stable flexibility in both systems, with a calculated
mean of 1.07 ± 0.01 nm for the interswitch region in all systems. In switch 2 (see
Fig. 1D), the Rab10_GTP system showed greater disorder in folding movements
(0.79 ± 0.03 nm), while Rab10_GDP had better stability, with an Rg of 0.81 ± 0.01 nm.
The SASA analysis quantifies the molecular surface and describes the contact between
Rab10 and the solvent. A systematic increase in SASA indicates destabilization of the
biomolecule, which can expose its hydrophobic regions to the solvent [19]. Figure 1E
shows that the SASA of the Rab10_GDP system has predominantly higher peaks than that of
Rab10_GTP. When Rab10 is bound to GDP, the average SASA was 105.92 ± 2.40 nm²; when
bound to GTP, it was 101.91 ± 2.86 nm². Thus, we
Fig. 1. Analysis of the trajectories obtained in the MD simulation: the grey line
represents the Rab10_GTP system and the black line the Rab10_GDP system. (A) RMSF of
the entire enzyme: the switch 1 (S1) region is defined by positions 31–44, while the
interswitch (Int) and switch 2 (S2) regions span positions 45–65 and 66–82,
respectively. (B) Rg of switch 1. (C) Rg of the interswitch. (D) Rg of switch 2.
(E) SASA of the entire enzyme.
can infer that the disordered movements of switch 1 and the absence of γ-phosphate
contribute to the increase in SASA of Rab10.
4 Conclusions
In short, the MD simulations used in this study captured notable differences in the
switch 1 region of Rab10, enabling the identification of its active "ON" and inactive
"OFF" states. However, the classical mechanics method was unable to accurately predict
the disordered movements of the switch 2 region. We hypothesize that the flexibility of
the sensitive switch 1 region can be used as an indicator in in silico studies that
search for potential nucleotide-based competitive inhibitors against Rab10. Our
findings suggest that the in silico study of the flexibility of sensitive regions
involved in the on-to-off mechanism of other protein targets may be useful in the
discovery of potential drug candidates.
References
1. Yan, T., Wang, L., Gao, J., et al.: Rab10 phosphorylation is a prominent pathological feature
in Alzheimer's disease. J. Alzheimer Dis. 63(1), 157–165 (2018)
2. Chua, C.E.L., Tang, B.L.: Rab10 – a traffic controller in multiple cellular pathways and
locations. J. Cellular Phys. 233(9), 6483–6494 (2018)
3. Ordónez, A.J.L., Fernández, B., Fdez, E., et al.: RAB8, RAB10 and RILPL1 contribute to
both LRRK2 kinase–mediated centrosomal cohesion and ciliogenesis deficits. Human Mol.
Genetics 28(21), 3552–3568 (2019)
4. Shen, G., Mao, Y., Su, Z., et al.: PSMB8-AS1 activated by ELK1 promotes cell proliferation
in glioma via regulating miR-574-5p/RAB10. Biomed. Pharmacother. 122(1), 109658 (2020)
5. Ridge, P.G., Karch, C.M., Hsu, S., et al.: Linkage, whole genome sequence, and biological
data implicate variants in RAB10 in Alzheimer’s disease resilience. Genome Med. 9(1), 100
(2017)
6. Tavana, J.P., Rosene, M., Jensen, N.O., et al.: RAB10: an Alzheimer’s disease resilience locus
and potential drug target. Clin. Interv. Aging 14(1), 73–79 (2019)
7. Good, R.G., Müller, M.P., Wu, Y.: Mechanisms of action of Rab proteins, key regulators of
intracellular vesicular transport. Biol. Chem. 398(5–6), 565–575 (2017)
8. Pylypenko, O., Hammich, H., Yu, I., et al.: Rab GTPases and their interacting protein partners:
Structural insights into Rab functional diversity. Small GTPases 9(1–2), 22–48 (2018)
9. Wang, J., Chou, K.: Insight into the molecular switch mechanism of human Rab5a from
molecular dynamics simulations. Biochem. Biophys. Res. Commun. 390(3), 608–612 (2009)
10. Rai, A., Oprisko, A., Campos, G., et al.: bMERB domains are bivalent Rab8 family effectors
evolved by gene duplication. eLife 5(1), e186475 (2016)
11. Berman, H.M., Westbrook, J., Feng, et al.: The Protein Data Bank. Nucleic Acids Res. 28(1),
235–242 (2000)
12. Sali, A., Blundell, T.L.: Comparative protein modelling by satisfaction of spatial restraints.
J. Mol. Biol. 234(1), 779–815 (1993)
13. O’Boyle, N.M., Banck, M., James, C.A., et al.: Open Babel: An open chemical toolbox. J.
Cheminformatics 3(1), 33 (2011)
14. Trott, O., Olson, A.J.: AutoDock Vina: improving the speed and accuracy of docking with
a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31(1),
455–461 (2010)
15. Schrödinger Release 2020–3.: Maestro. New York NY (2020)
16. Spoel, D.V.D., Lindahl, E., Hess, B., et al.: GROMACS: Fast, flexible, and free. J. Comput.
Chem. 26(1), 1701–1718 (2005)
17. Huang, J., MacKerell Jr., A.D.: CHARMM36 all-atom additive protein force field: Validation
based on comparison to NMR data. J. Comput. Chem. 34(1), 2135–2145 (2013)
18. Vanommeslaeghe, K., MacKerell Jr., A.D.: Automation of the CHARMM General Force Field
(CGenFF) I: Bond Perception and Atom Typing. J. Chem. Inf. Model. 52(12), 3144–3154
(2012)
19. Paul, M., Panda, M.K., Thatoi, E.H.: Developing Hispolon-based novel anticancer thera-
peutics against human (NF-κβ) using in silico approach of modelling, docking and protein
dynamics. J. Biomol. Struct. Dyn. 37(15), 3947–3967 (2019)
Importance of Meta-analysis in Studies
Involving Plant Responses to Climate
Change in Brazil
1 Introduction
The number of scientific publications is increasing exponentially, and there is a
need for reviews that compile the data produced to guide several study areas [34].
For many years these reviews were carried out as narrative reviews [12]. However,
the narrative review is subjective and often not reproducible, as it can be skewed
by the point of view and preferences of each author [36]. To meet this demand,
reviews have been applying the meta-analysis methodology, which incorporates a
prior systematic review to extract information following explicit study inclusion
criteria, with all the steps documented [28].
Meta-analysis is a set of statistical methods that quantitatively compares the
results of different studies that address a common issue [10,22]. Meta-analysis is
the grandmother of 'big data' and 'open science': the implementation of meta-
analytic techniques was the first effort to collect and synthesize pre-existing
data to determine patterns, make predictions, and make evidence-based decisions
[11]. Furthermore, it provides a qualitative presentation of data that indicates
gaps in current knowledge and new research needs [33]. Meta-analysis is the
analysis of analyses [5]. After Glass [10] first used the term "meta-analysis" in
1976, the method was widely applied and developed in the areas of medicine and
sociology [41]. In the 1990s, meta-analysis began to be used in ecology and
evolutionary biology [33], but it is not yet widely known in the biological
sciences [33].
Several studies show terrestrial ecosystems' responses to climate change, mainly
due to increased atmospheric CO2. The concentration of carbon dioxide (CO2) in the
atmosphere has increased from ∼280 ppm (parts per million) to ∼410 ppm from the
industrial revolution to the present [19]. This increase in atmospheric CO2 is due
to fossil fuels, forest burning, and land-use changes [18,19]. The
Intergovernmental Panel on Climate Change (IPCC) projects that CO2 levels will
increase to 1300 ppm by 2100 [18]. However, CO2 is not only one of the leading
greenhouse gases (GHG) [19,38]; it is also an essential substrate for
photosynthesis, leading to growth and higher productivity of the ecosystem [2,23].
Plants' dry mass consists of 40% carbon fixed by photosynthesis [25]. The increase
in CO2 concentration leads to a rise in temperature, and together they induce
drastic changes in terrestrial ecosystems, such as changing the pattern of rainfall
in certain regions [23]; they can also cause tree mortality due to water losses or
forest burning [29]. Thus, most studies performed in this century try to understand
how plants respond to the increase in CO2. Indeed, several individual publications
have focused on the three variables that most affect the climate: CO2, temperature,
and water stress.
Meta-analyses on climate change have been primarily applied in studies with
temperate-climate species and have proven to be a valuable tool in this field
[2,6,47]. Curtis et al. [6] used meta-analysis to summarize more than 500 studies
on the effects of high CO2 and concluded that there is a 28% increase in tree
biomass allocation. Wand et al. [47] showed that biomass increased by 33% and 44%
under high CO2 in plants with C3 and C4 photosynthesis.
2 Methodology
2.1 How to Get the Data to Perform the Meta-analysis
The steps for performing a meta-analysis follow the scientific research procedure:
problem, question, hypothesis, method, data collection, and data analysis [27].
Data collection in meta-analysis studies is carried out by means of a systematic
review, which selects primary studies, in this case experimental studies on plant
responses to the increase in CO2 [28]. A systematic review is necessary to identify
and describe all the steps performed in selecting studies and extracting data, so
that the overall result is reproducible [28] (Fig. 1A). In this work, we describe
the history of meta-analysis using studies on the effect of elevated CO2 on
plants. These studies were obtained from a search in the Web of Science. The
keywords used in the search were: "meta-analysis" OR "meta-analytic" AND "CO2" AND
"plant," on August 18, 2020. To identify the primary studies on the effect of
increasing CO2 on plants in Brazil, a search was carried out in three databases
(Web of Science, Scielo, and the Brazilian Digital Library of Theses and
Dissertations). The words used in this search were: "elevated CO2"
The first stage of a meta-analysis is determining the type of effect size [27]
(Fig. 1B). The effect size is the basic unit of meta-analysis. It makes it
possible to standardize the results of individual studies by providing an average
estimate of the effect of elevated CO2 compared to ambient CO2 [12,14]; the effect
size on a logarithmic scale is generally used [27]. The choice of effect size for
the dataset extracted from the primary studies raises the need to check whether
there is dependence among the data. Research in biology may include more than one
analysis in the same experiment, and there are also different studies on the same
species or studies that investigate several species. When such data are included
in the meta-analysis, a hierarchical model approach is required because it
accounts for dependence in the data set [33]. After fitting the models, it is
necessary to perform a heterogeneity test to determine the variation among studies
that is not attributable to sampling variance [33].
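The effect size referred to above is, in this context, the natural-log response ratio of Hedges et al. [14], computed from the means, standard deviations, and sample sizes of the elevated (E) and ambient (C) CO2 groups:

\ln RR = \ln\!\left(\frac{\bar{X}_E}{\bar{X}_C}\right), \qquad
v = \frac{s_E^2}{n_E\,\bar{X}_E^2} + \frac{s_C^2}{n_C\,\bar{X}_C^2}

where \bar{X}, s, and n are the mean, standard deviation, and sample size of each group, and v is the sampling variance used to weight each observation.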
A publication bias test should also be applied in the meta-analysis. Publication
bias occurs when the published studies are not representative of all the studies
performed; for example, significant results that confirm the expectations of the
research are more likely to be published than non-significant results [13]. The
methods commonly used to assess publication bias are funnel plots [39] and the
Egger test [9]. Finally, it is recommended to perform a sensitivity test to check
the consistency of the data and to detect possible outliers. For the present
study, we used the "metafor" package for the analyses [46] and the ggplot2 package
for the graphics [48], both in R version 3.6.0 [42].
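A minimal sketch of this workflow with metafor is shown below; the data frame and its column names (means, SDs, and sample sizes under elevated and ambient CO2, plus study and observation identifiers) are illustrative:

library(metafor)

# Log response ratio (lnRR) and its sampling variance per observation.
dat <- escalc(measure = "ROM",
              m1i = mean_e, sd1i = sd_e, n1i = n_e,   # elevated CO2
              m2i = mean_c, sd2i = sd_c, n2i = n_c,   # ambient CO2
              data = dat)

# Hierarchical model: observations nested within studies.
res <- rma.mv(yi, vi, random = ~ 1 | study/obs, data = dat)
summary(res)                       # includes the heterogeneity (Q) test

funnel(res)                        # funnel plot for publication bias
regtest(rma(yi, vi, data = dat))   # Egger-type regression test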
Fig. 1. (A) Flowchart showing the systematic review steps, modified from Liberati
et al. [28]. Numbers in parentheses represent the data obtained in the systematic
review carried out in this work for publications on CO2 increase and its
physiological responses in plants, using individual Brazilian plant studies. After
the systematic review and data compilation, (B) flowchart describing the steps to
perform a meta-analysis, based on Lei et al. [27].
Fig. 2. (A) World history events ranging from meetings to IPCC reports that addressed
global climate change. (B) Distribution over years of published manuscripts (from 1996
to 2019) of meta-analyses performed worldwide that address plant responses to high
CO2 (236 studies).
with 120 works showing physiological aspects and some species' productivity. This
work was a milestone for studies researching climate change, with 3,157 citations
(until October 20, 2020), demonstrating the power of the quantitative synthesis of
meta-analysis.
The meta-analytic studies addressed several ecological processes and
relationships, including the effect of elevated CO2 on photosynthesis and plant
respiration, growth, competition and interaction among plants, productivity, leaf
gas exchange and conductance, soil respiration, carbon and nitrogen accumulation
in the soil, and seed production [7,30,47]. This shows that meta-analysis has been
widely used in research containing experiments with elevated CO2 in plants, and
each new publication tries to couple high CO2 with other factors such as
temperature and water conditions.
On the other hand, most meta-analyses, such as Ainsworth & Long [2], only include
experiments performed with plants from temperate climate regions. This points to
the need for studies involving tropical plant responses to climate change [2,21].
The importance of carrying out studies on tropical species is highlighted by the
fact that, of the 236 meta-analyses found in this work, only three included among
their primary studies experiments with tropical species carried out in Brazil
[4,31,40].
Fig. 3. (A) Number of times each variable was used within the 36 studies included
in this meta-analysis. (B) Effect sizes of high CO2 on plant growth. (C) Effect
sizes of high CO2 on plant photosynthesis. Data represent means ±95% CI for each
functional group analyzed. Intervals that do not overlap zero indicate a
significant difference (α < 0.05). The numbers of observations are presented in
parentheses.
Life’s habit (trees or herbs) influences the magnitude of high CO2 responses
on plant biomass. From 1,625 studies included for the meta-analysis, 36 observa-
tions were obtained to estimate the effect of biomass (growth) (Figs. 1 and 3A).
This higher growth variable number is due to the measurement facility during
experiments since a single scale can be used to obtain the variable data. This
representativeness of studies enables a more detailed analysis of the effect of high
4 Conclusion
We found a close correlation (likely involving the release of the IPCC reports)
between the rise of meta-analyses concerning the responses of plants to elevated
CO2 and the socio-political events that set sustainable development goals for the
planet during the last 50 years.
Given the importance of Brazilian biomes and agriculture for the planet, and the
importance of land use for the impacts of global climate change, we conclude that
more studies urgently need to be performed with species from the Neotropics.
Meta-analysis will undoubtedly be crucial for understanding such impacts and for
decision making. However, most studies would have to report a plethora of
variables to enable the use of this valuable statistical method.
References
1. Aguiar, S., Santos, I.D.S., Arêdes, N., Silva, S.: Redes-bioma: Informação e comu-
nicação para ação sociopolı́tica em ecorregiões. Ambiente Soc. 19(3), 233–252
(2016)
2. Ainsworth, E.A., Long, S.P.: What have we learned from 15 years of free-air
CO2 enrichment (FACE)? A meta-analytic review of the responses of photosynthe-
sis, canopy properties and plant production to rising CO2. New Phytol. 165(2),
351–372 (2005). https://doi.org/10.1111/j.1469-8137.2004.01224.x
3. Arenque, B.C., Grandis, A., Pocius, O., de Souza, A.P., Buckeridge, M.S.:
Responses of Senna reticulata, a legume tree from the amazonian floodplains, to
elevated atmospheric CO2 concentration and waterlogging. Trees 28(4), 1021–1034
(2014). https://doi.org/10.1007/s00468-014-1015-0
4. Cleveland, C.C., et al.: Relationships among net primary productivity, nutrients
and climate in tropical rain forest: a pan-tropical analysis. Ecol. Lett. 14(9), 939–
947 (2011). https://doi.org/10.1111/j.1461-0248.2011.01658.x
5. Cook, T.D.: The potential and limitations of secondary evaluations. In: Analysis
and Responsibility, Educational Evaluation (1974)
6. Curtis, P.S., Wang, X.: A meta-analysis of elevated CO2 effects on woody plant
mass, form, and physiology. Oecologia 113(3), 299–313 (1998). https://doi.org/10.
1007/s004420050381
7. Curtis, P.: A meta-analysis of leaf gas exchange and nitrogen in trees grown under
elevated carbon dioxide. Plant, Cell Environ. 19(2), 127–137 (1996). https://doi.
org/10.1111/j.1365-3040.1996.tb00234.x
8. De Souza, A.P., et al.: Elevated CO2 increases photosynthesis, biomass and pro-
ductivity, and modifies gene expression in sugarcane. Plant, Cell Environ. 31(8),
1116–1127 (2008). https://doi.org/10.1111/j.1365-3040.2008.01822.x
9. Egger, M., Smith, G.D., Schneider, M., Minder, C.: Bias in meta-analysis detected
by a simple, graphical test. Bmj 315(7109), 629–634 (1997). https://doi.org/10.
1136/bmj.315.7109.629
10. Glass, G.V.: Primary, secondary, and meta-analysis of research. Educ. Res. 5(10),
3–8 (1976). https://doi.org/10.3102/0013189X005010003
11. Gurevitch, J., Koricheva, J., Nakagawa, S., Stewart, G.: Meta-analysis and the
science of research synthesis. Nature 555(7695), 175–182 (2018). https://doi.org/
10.1038/nature25753
12. Harrison, F.: Getting started with meta-analysis. Methods Ecol. Evol. 2(1), 1–10
(2011). https://doi.org/10.1111/j.2041-210X.2010.00056.x
Meta-analysis on Plants and Climate Change in Brazil 233
13. Haworth, M., Hoshika, Y., Killi, D.: Has the impact of rising CO2 on plants been
exaggerated by meta-analysis of free air CO2 enrichment studies? Front. Plant Sci.
7, 1153 (2016). https://doi.org/10.3389/fpls.2016.01153
14. Hedges, L.V., Gurevitch, J., Curtis, P.S.: The meta-analysis of response ratios in
experimental ecology. Ecology 80(4), 1150–1156 (1999). https://doi.org/10.1890/
0012-9658(1999)080[1150:TMAORR]2.0.CO;2
15. Hedges, L.V., Pigott, T.D.: The power of statistical tests for moderators in meta-
analysis. Psychol. Methods 9(4), 426 (2004). https://doi.org/10.1037/1082-989X.
9.4.426
16. IPCC: Climate change: The ipcc 1990 and 1992 assessments (1990). https://www.
ipcc.ch/report/climate-change-the-ipcc-1990-and-1992-assessments/. Accessed 25
Aug 2020
17. IPCC: Sar climate change 1995: Synthesis report (1995). https://www.ipcc.ch/
site/assets/uploads/2018/05/2nd-assessment-en-1.pdf. Accessed 25 Aug 2020
18. IPCC: Tar climate change 2001: Synthesis report (2001). https://www.ipcc.ch/
report/ar3/syr/. Accessed 25 Aug 2020
19. IPCC: Ar4 climate change 2007: Synthesis report (2007). https://www.ipcc.ch/
report/ar4/syr/. Accessed 25 Aug 2020
20. IPCC: Global warming of 1.5 ◦ c: Special report (2018). https://www.ipcc.ch/sr15/.
Accessed 25 Aug 2020
21. Jones, A.G., Scullion, J., Ostle, N., Levy, P.E., Gwynn-Jones, D.: Completing the
face of elevated CO2 research. Environ. Int. 73, 252–258 (2014). https://doi.org/
10.1016/j.envint.2014.07.021
22. Koricheva, J., Gurevitch, J., Mengersen, K.: Handbook of meta-analysis in ecology
and evolution. Princeton University Press, New Jersey (2013)
23. Körner, C.: Plant CO2 responses: an issue of definition, time and resource supply.
New Phytol. 172(3), 393–411 (2006). https://doi.org/10.1111/j.1469-8137.2006.
01886.x
24. Körner, C.: Responses of humid tropical trees to rising CO2 . Annu. Rev. Ecol. Evol.
Syst. 40, 61–79 (2009). https://doi.org/10.1146/annurev.ecolsys.110308.120217
25. Lambers, H., Chapin III, F.S., Pons, T.L.: Plant Physiological Ecology. Springer,
New York (2008)
26. Leakey, A.D., et al.: Photosynthesis, productivity, and yield of maize are not
affected by open-air elevation of CO2 concentration in the absence of drought.
Plant Physiol. 140(2), 779–790 (2006). https://doi.org/10.1104/pp.105.073957
27. Lei, X., Peng, C., Tian, D., Sun, J.: Meta-analysis and its application in global
change research. Chin. Sci. Bull. 52(3), 289–302 (2007). https://doi.org/10.1007/
s11434-007-0046-y
28. Liberati, A., et al.: The Prisma statement for reporting systematic reviews and
meta-analyses of studies that evaluate health care interventions: explanation and
elaboration. J. Clin. Epidemiol. 62(10), e1–e34 (2009). https://doi.org/10.1016/j.
jclinepi.2009.06.006
29. Lovejoy, T.E., Nobre, C.: Amazon tipping point: last chance for action (2019).
https://doi.org/10.1126/sciadv.aba2949
30. Luo, Y., Hui, D., Zhang, D.: Elevated CO2 stimulates net accumulations of carbon
and nitrogen in land ecosystems: a meta-analysis. Ecology 87(1), 53–63 (2006).
https://doi.org/10.1890/04-1724
31. Moles, A.T., et al.: Which is a better predictor of plant traits: temperature or
precipitation? J. Veg. Sci. 25(5), 1167–1180 (2014). https://doi.org/10.1111/jvs.
12190
A Brief History of Bioinformatics Told by Data Visualization
1 Introduction
Bioinformatics is an interdisciplinary research field whose principle is the use of models and algorithms to analyze biological data and solve biologically related problems [1]. Bioinformatics' roots lie in the early 1960s, when computers, previously used for military purposes, became available to universities and research institutes. At that time, researchers began to use computers to try to answer fundamental questions in the life sciences [2].
Margaret Dayhoff was a pioneer of bioinformatics studies at that time. She proposed mathematical approaches for analyzing amino acid frequencies and mutation probabilities in biological sequences. Since the late 1950s, experimental approaches had allowed the sequencing of small proteins, such as insulin [3]. This culminated in the creation of the first database of amino acid sequences and structures, the so-called "Atlas of Protein Sequence and Structure" [4]. Dayhoff and collaborators also proposed computational methods for sequence comparison to detect homologous proteins using a substitution matrix called PAM (Percent Accepted Mutation), which contributed to the rise of the molecular evolution field [2].
Needleman and Wunsch then proposed a dynamic programming method for sequence alignment that remains the basis for the state-of-the-art methods used today [5]. By the end of the 1960s came the rise of structural bioinformatics, for example with the construction of a three-dimensional model of a cytochrome c protein [6]. At the beginning of the 1980s, the first methods for phylogenetic tree inference based on DNA sequences and maximum likelihood were proposed [7]. Moreover, BLAST, a tool for local sequence alignment, was proposed in 1990, mainly due to the increasing availability of sequences [8].
However, three main milestones led to the modern bioinformatics field: (i) DNA sequencing methods, (ii) the genome projects, and (iii) the rise of supercomputers and the Internet [2]. DNA sequencing methods have existed since the 1970s, such as the Sanger chain-termination method [9]. However, these methods were slow and expensive. For instance, the Human Genome Project (HGP) was an international research effort to map all the genes and sequences of the human genome. It started at the beginning of the 1990s but was only officially completed in 2003. The HGP gave us a genetic blueprint of the human being, but the sequencing costs were high.
The game changed when ingenious strategies were used in combination with computational approaches. Shotgun sequencing was introduced in the 1970s [10]. In 1995, a similar strategy was used to obtain the complete nucleotide sequence of the bacterium Haemophilus influenzae [11] and, in 2000, the complete genome of the fly Drosophila melanogaster, a eukaryote with a sequence length of ~120 Mb [12]. All these developments led to the emergence of Next-Generation Sequencing (NGS) platforms [13]. These technologies are characterized by high throughput, reduced processing time, and ever-lower costs. This culminated in the diffusion of genome projects, even in small and mid-sized research laboratories. Later, other high-throughput technologies appeared, such as microarrays and RNA-Seq, used to analyze gene expression, and cryo-electron microscopy, which sped up the resolution of 3D macromolecular structures.
Thus, bioinformatics went from a tool for biological analysis to an interdisciplinary research field, mainly focused on the development of new models, algorithms, tools, and types of analysis based on computational approaches to deal with biological data and extract knowledge from them. In recent years, this evolution of technological resources has generated a tidal wave of data [14]. Consequently, an unprecedented number of studies using bioinformatics approaches have been released, increasing peer-reviewed publication in the field.
Here we tell the recent history of bioinformatics based on metadata collected from scientific papers. We know that the history of bioinformatics has been told before in several publications [1, 15, 16]. However, we aimed to understand the recent state of the art of bioinformatics research, visualize changes motivated by technological evolution, and
Fig. 1. (A) Publications by journal (1998–2019). (B) Proportion of international collaborations: "Domestic" corresponds to collaborations between researchers of the same institute or the same country, "Multinational" to researchers from different countries, and "Single author" to single-author papers. (C, D) Nationality estimated from author affiliation declarations (hence, this may not represent actual nationality), as (C) counts and (D) proportions. United Kingdom (UK) data were not included in the European Union (EU) group, even before Brexit, to ease comprehension.
We analyzed the keywords reported in the papers to establish the major research topics addressed in bioinformatics publications. We obtained 7,167 unique keywords reported in the papers. Of these, eleven were reported in the top five positions from 1998 to 2019 (Fig. 2).
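A minimal sketch of this per-year ranking step follows (the papers.json file and its fields are hypothetical placeholders for the collected metadata):

```python
# Rank the top five keywords per year from paper metadata.
import json
from collections import Counter

with open("papers.json") as fh:  # list of {"year": int, "keywords": [str, ...]}
    papers = json.load(fh)

by_year = {}
for paper in papers:
    counter = by_year.setdefault(paper["year"], Counter())
    counter.update(kw.lower() for kw in paper["keywords"])

for year in sorted(by_year):
    top5 = [kw for kw, _ in by_year[year].most_common(5)]
    print(year, top5)
```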
Fig. 2. Top five keywords listed in bioinformatics papers from 1998 to 2019.

Since 2011, only a restricted set of keywords has occupied the top five list: humans, computational biology, algorithms, models, and software. Interestingly, the keyword "humans", which has led the top five since 2011, was less cited in previous years, except in 2000–2001, when it appeared in fifth position. Even though most bioinformatics research focuses on developing new algorithms, models, and software, a fact well illustrated by the other four keywords in the ranking, we expected to find papers using keywords describing application areas for these approaches. Indeed, we can observe that "humans" appears in the top five list in the years close to the first announcements of the human genome draft (2001). We hypothesized that, in the following years, bioinformatics research was characterized by considerable efforts to establish the architecture for data processing and storage, the collection and organization of data, and the development of desktop and web tools. Since many publications have accomplished these objectives, they provided scaffolds for more application-oriented studies focused on human data.
The top five ranking list tells us only the main, general topics. To tell more about the recent history of bioinformatics content, it is necessary to analyze the importance of some keywords and correlate them with historical facts. To obtain these insights, we constructed word clouds with the keywords having at least 50 occurrences in a year (available on the website). Analyzing the word clouds, we can detect when topics turned into trends or declined by comparing word-size changes across different periods. Although this analysis is limited in showing specific changes in keyword use (since it is hard to compare and track many words across different word clouds), it can identify targets for further analysis. Hence, based on this analysis, we raised some questions: which topic is the target of the greater number of studies: genome, transcriptome, or proteome? Are there keywords that are less used nowadays than in the past? What is related to the recent increase in molecular dynamics publications? Why was "artificial intelligence" widely reported in the 2000s but less so today? To try to answer these questions, we constructed several visualizations and discuss what they depict.
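A minimal sketch of this word-cloud construction, assuming the wordcloud Python package and hypothetical keyword counts for a single year:

```python
# Build a per-year word cloud from keywords with at least 50 occurrences.
from wordcloud import WordCloud

# hypothetical keyword counts for one year, e.g., 2019
counts_2019 = {"humans": 310, "computational biology": 180, "algorithms": 150,
               "models": 90, "software": 75, "rare keyword": 3}

# apply the >= 50 occurrences threshold described above
frequent = {kw: n for kw, n in counts_2019.items() if n >= 50}

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(frequent).to_file("wordcloud_2019.png")
```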
3.2 Omics
A fundamental aspect of bioinformatics is the use of computational approaches to analyze data related to the central dogma of molecular biology, i.e., the process by which information flows from DNA to RNA to protein. Thus, the genome, transcriptome, and proteome have been considered the main topics in this area, compounding the "Omics" study fields, such as genomics, transcriptomics, and proteomics (and newer study areas such as metagenomics, metabolomics, and so on).
First, we asked which of these topics is the target of the greater number of studies. In Fig. 3A, we plot the percentage of reports of each keyword. We can observe the dominance of genome studies, challenged only by proteome studies during a short period between 2007 and 2008. The first occurrence of the transcriptome keyword was in 2010. Since then, the proportion of transcriptome studies has been slightly increasing, while the proportion of proteome studies has considerably decreased. Metagenome and metabolome are recent study areas and hence still have few occurrences; we included these keywords in the plot to illustrate the interest in recent "Omics" approaches.
Fig. 3. Keyword analysis. (A) Percentage of citations of "Omics" keywords from 1998 to 2019. (B) Use of the keywords "Database Management Systems", "User-Computer Interface", "Programming Languages", "Sequence Alignment", "Protein Structure", and "Systems Biology" from 1998 to 2019. (C) Use of "Artificial Intelligence" and other correlated keywords from 1998 to 2019.
3.4 Artificial Intelligence and the Influence of Pop Culture on Science
We analyzed the use of the keyword "Artificial Intelligence" and other related keywords. We observed that this keyword's use increased from 2001 until 2008, when it declined (Fig. 3C). Publications on artificial intelligence continued to appear, but authors started to use more specific descriptors, such as "neural networks", "support vector machine", "machine learning", and "data mining".
It is interesting to note that artificial intelligence has been studied since the middle of the 20th century, as have the other listed topics. What, then, could explain the expressive increase in the use of this topic as a keyword and its subsequent reduction?
A peculiar explanation for this phenomenon is that the famous movie "A.I. Artificial Intelligence" (directed by Steven Spielberg) was released in 2001. The film's popularity may have influenced the use of the keyword "artificial intelligence" in academic works. Later, this keyword's use decreased, and more specific descriptions of the A.I. techniques used were adopted. This suggests that pop culture can influence how studies are disseminated.
Molecular dynamics simulations have been used, for example, to perform computational predictions of cancer drug resistance [27], to understand allosteric immune escape pathways in the HIV-1 envelope glycoprotein [28], and to simulate the action of enzymes used in biofuel production [29, 30]. Although these methods are known to have high computational cost requirements, the number of citations of "molecular dynamics simulation" has increased substantially since 2011 (Fig. 4).
Fig. 4. (A) Use of "molecular dynamics" as a keyword (above) compared to (B) the number of transistors in NVIDIA GPUs (below). Note that molecular dynamics researchers often prefer to publish their experiments in specialized journals, such as the Journal of Biomolecular Structure and Dynamics. Source: adapted from https://vintage3d.org/dbn.php. Accessed 16 September 2020.
We searched for a connection between the number of reports and the number of transistors in NVIDIA graphics cards from 1998 until 2020. We observed that the number of citations of molecular dynamics simulation seems to correlate with the evolution of Graphics Processing Units (GPUs). The evolution of GPUs is driven by the game industry's demand for more realistic games; however, researchers have adapted GPUs to process the force-field calculations used in molecular dynamics simulations. Thus, molecular dynamics scientists have successfully used gaming graphics cards in research applications, obtaining better results than those from the CPU-only supercomputers of a few years ago.
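Although the comparison in Fig. 4 is visual, the association can be quantified. The sketch below (with hypothetical yearly values, not the paper's data) uses a rank correlation, which tolerates the roughly exponential growth of both series:

```python
# Rank correlation between yearly MD keyword counts and GPU transistor counts.
from scipy.stats import spearmanr

md_counts = [12, 15, 21, 30, 44, 61]                        # hypothetical
gpu_transistors = [0.7e9, 1.4e9, 3.0e9, 3.5e9, 7.1e9, 8.0e9]  # hypothetical

rho, p = spearmanr(md_counts, gpu_transistors)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```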
4 Conclusion
In this paper, we presented an overview of bioinformatics publications over the last 22 years based on four high-impact bioinformatics journals. We consider this paper's main scientific contribution to be a report on the state of the art of bioinformatics publications (past and present), together with predictions based on what the data showed us. We show an increasingly collaborative world: the data reveal growth in bioinformatics' scientific production with the increasing participation of several countries. However, there is still much to improve (see Latin America's low participation in global scientific production). The keyword analysis showed changes in the major topics addressed in bioinformatics papers. We also hypothesized that the growth in molecular dynamics simulation publications follows from the correlation between published works and the recent evolution of GPUs. The analysis of the most cited programming languages showed Python, Perl, Java, and R as the most popular for bioinformatics. Additionally, we provide a web tool for exploratory data analysis as supplementary material. Here, we presented only an overview of what the data showed us, focusing on some details that caught our attention. Readers can interact with the data in the web tool, obtain insights, and perhaps reach conclusions that we may not even have imagined when writing this article. We also want to encourage (and provoke) our readers to think about bioinformatics' perspectives as a science (and their specific areas of activity). The scientist's role is to observe, question, and propose solutions that lead to society's improvement. To fulfill this role, scientists must be able to adapt to change, and knowing the history and discussing the future is a fundamental step toward this. We hope that these data visualizations can provide insights and raise more thought-provoking discussions about bioinformatics' evolution, trends, and perspectives. The data sets and other interactive visualizations are provided at http://bioinfo.dcc.ufmg.br/history/.
Acknowledgments. The authors thank the funding agencies: CAPES, FAPEMIG, and CNPq.
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Supe-
rior - Brasil (CAPES) - Finance Code 001. Project grant number 51/2013 - 23038.004007/2014-82.
References
1. Akalin, P.K.: Introduction to bioinformatics. Mol. Nutr. Food Res. 50, 610–619 (2006)
2. Hagen, J.B.: The origins of bioinformatics. Nat. Rev. Genet. 1, 231–236 (2000)
3. Moore, S., Spackman, D.H., Stein, W.H.: Automatic recording apparatus for use in the
chromatography of amino acids, pp. 1107–1115 (1958)
4. Dayhoff, M.O.: Atlas of protein sequence and structure. National Biomedical Research
Foundation (1972)
5. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities
in the amino acid sequence of two proteins. J. Molecular Biol. 48, 443–453 (1970)
6. Levinthal, C.: Molecular model-building by computer. Sci. Am. 214, 42–52 (1966)
7. Felsenstein, J.: Evolutionary trees from DNA sequences: a maximum likelihood approach. J.
Mol. Evol. 17, 368–376 (1981)
8. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)
9. Sanger, F., Nicklen, S., Coulson, A.R.: DNA sequencing with chain-terminating inhibitors.
Proc. National Acad. Sci. 74, 5463–5467 (1977)
10. Staden, R.: A strategy of DNA sequencing employing computer programs. Nucleic Acids
Res. 6, 2601–2610 (1979)
11. Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R.,
et al.: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.
Science 269, 496–512 (1995)
12. Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D., Amanatides, P.G., et al.:
The genome sequence of drosophila melanogaster. Science 287, 2185–2195 (2000)
13. Mariano, D.C.B., Pereira, F.L., Aguiar, E.L., Oliveira, L.C., Benevides, L., Guimarães, L.C.,
et al.: SIMBA: a web tool for managing bacterial genome assembly generated by Ion PGM
sequencing technology. BMC Bioinform. 17(Suppl 18), 456 (2016)
14. It’s sink or swim as a tidal wave of data approaches. Nature, 399, 517 (1999)
15. Gauthier, J., Vincent, A.T., Charette, S.J., Derome, N.: A brief history of bioinformatics.
Brief. Bioinform. 20, 1981–1996 (2019)
16. Hogeweg, P.: The Roots of Bioinformatics in Theoretical Biology. PLoS Comput. Biol. 7,
e1002021 (2011)
17. Canese, K., Weis, S.: PubMed: The Bibliographic Database. National Center for Biotechnol-
ogy Information (US) (2013). https://www.ncbi.nlm.nih.gov/books/NBK153385/. Accessed
14 Sep 2020
18. NCBI Resource Coordinators: Database resources of the national center for biotechnology
information. Nucleic Acids Res. 46, D8–D13 (2018)
19. Monastersky, R., Noorden, R.V.: 150 years of nature: a data graphic charts our evolution.
Nature 575, 22–23 (2019)
20. Carey, V.J., Gentry, J., Whalen, E., Gentleman, R.: Network structures and algorithms in
Bioconductor. Bioinformatics 21, 135–136 (2005)
21. Chen, H., Lau, M.C., Wong, M.T., Newell, E.W., Poidinger, M., Chen, J.: Cytofkit: a biocon-
ductor package for an integrated mass cytometry data analysis pipeline. PLoS Comput. Biol.
12, e1005112 (2016)
22. Fournier, F., Joly Beauparlant, C., Paradis, R., Droit, A.: rTANDEM, an R/Bioconductor
package for MS/MS protein identification. Bioinformatics 30, 2233–2234 (2014)
23. Gådin, J.R., van’t Hooft, F.M., Eriksson, P., Folkersen, L.: AllelicImbalance: an R/ biocon-
ductor package for detecting, managing, and visualizing allele expression imbalance data
from RNA sequencing. BMC Bioinform. 16, 194 (2015)
24. Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., et al.: Biocon-
ductor: open software development for computational biology and bioinformatics. Genome
Biol. 5, R80 (2004)
25. Talevich, E., Invergo, B.M., Cock, P.J., Chapman, B.A.: Bio.Phylo: a unified toolkit for
processing, analyzing and visualizing phylogenetic trees in Biopython. BMC Bioinform. 13,
209 (2012)
26. Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., et al.: Biopy-
thon: freely available Python tools for computational molecular biology and bioinformatics.
Bioinformatics 25, 1422–1423 (2009)
27. Sun, X., Hu, B.: Mathematical modeling and computational prediction of cancer drug
resistance. Brief. Bioinform. 19, 1382–1399 (2018)
28. Sethi, A., Tian, J., Derdeyn, C.A., Korber, B., Gnanakaran, S.: A mechanistic understanding
of allosteric immune escape pathways in the HIV-1 envelope glycoprotein. PLoS Comput.
Biol. 9, e1003046 (2013)
29. Costa, L.S.C., Mariano, D.C.B., Rocha, R.E.O., Kraml, J., da Silveira, C.H., Liedl, K.R., et al.:
Molecular dynamics gives new insights into the glucose tolerance and inhibition mechanisms
on β-glucosidases. Molecules 24, 3215 (2019)
30. Lima, L.H.F., Fernandez-Quintéro, M.L., Rocha, R.E.O., Mariano, D.C.B., de Melo-Minardi, R.C., Liedl, K.R.: Conformational flexibility correlates with glucose tolerance for point mutations in β-glucosidases – a computational study. J. Biomol. Struct. Dyn. 1–20 (2020)
31. Russell, P.H., Johnson, R.L., Ananthan, S., Harnke, B., Carlson, N.E.: A large-scale analysis
of bioinformatics code on GitHub. PLoS ONE 13, e0205898 (2018)
32. Ekmekci, B., McAnany, C.E., Mura, C.: An Introduction to Programming for Bioscientists:
A Python-Based Primer. PLoS Comput. Biol. 12, e1004867 (2016)
33. Mariano, D., Martins, P., Helene Santos, L., de Melo-Minardi, R.C.: Introducing programming skills for life science students. Biochem. Mol. Biol. Educ. (2019). https://doi.org/10.1002/bmb.21230
Computational Simulations
for Cyclizations Catalyzed by Plant
Monoterpene Synthases
1 Introduction
Terpenes are among the oldest and most varied plant natural products, occurring as essential oils, and some of them, as volatile compounds, play a crucial role in the response to herbivores, in interactions with other plants, and in attracting pollinators [33,36]. The benefits of terpene properties for humans extend to their use as flavors in agricultural and industrial products, as fragrances in foods and cosmetics, and in pharmaceuticals and biofuels [2,39].
The wide range of terpenes and their "chemodiversity" is expected as a characteristic of life, given the considerable biodiversity of plants and their interactions with other organisms [16,36]. This chemodiversity is related to the chemical mechanisms catalyzed by terpene synthases/cyclases (TPS), which influence the variety of terpenes. This variety may be related to their biological function, adjusting the mixture and amount of terpenes to the specificity of the target, both in communication relations and in protection against numerous predators, parasites, and competitors [16,32,36].
Terpenes are named according to the number of C5 isoprenoid units incorporated into their carbon skeletons: mono- (C10), sesqui- (C15), di- (C20), sester- (C25), tri- (C30), and sesquarterpenes (C35) [37]. The isoprenoid units can be isopentenyl diphosphate (IPP) or its allylic isomer dimethylallyl diphosphate (DMADP), which are condensed by prenyltransferases to produce larger prenyl diphosphates, such as the monoterpene precursor geranyl diphosphate (GPP), the minimum-length cyclization substrate in terpene biosynthesis [10]. Commonly, after the loss of the GPP diphosphate group and a C1-C6 bond formation, monoterpene formation proceeds through the α-terpinyl cation (Fig. 1) by means of a cascade of reactions that include C-C bond formations, Wagner-Meerwein rearrangements, allyl- and methyl-shifts caused by conformational changes of the intermediate cations, and carbocation capture by water or hydride [33].
There are two classes of TPSs, Class I and Class II, defined by catalytically essential amino acid motifs [7,27]. Class I TPSs convert linear, all-trans isoprenoids (geranyl (C10)-, farnesyl (C15)-, or geranylgeranyl (C20)-diphosphate) into numerous varieties of monoterpenes, sesquiterpenes, and diterpenes. Class I TPSs bind their substrate by coordinating a trinuclear cluster of divalent metal ions (generally Mg2+) in the catalytic site, which consists of a central cavity formed by mostly antiparallel α-helices. This catalytic site has an aspartate-rich DDxxD/E motif and often another NSE/DTE motif in the C-terminal portion [24]. Class II TPSs act by triggering GGPP protonation, which results in successive carbocations and cyclizations to form, for example, copalyl diphosphate (CPP) [27]. In Class II TPSs, the DxDD motif (distinct from the Class I DDxxD/E motif) catalyzes the reaction, also using a Mg2+ cofactor to assist substrate binding and positioning [15]. Terpene diversity can also be influenced by nucleotide changes in the alleles of TPS genes [26]. In plants, the production of terpenes can be compartmentalized (Fig. 2); monoterpenes, for example, can be produced in specialized compartments, the plastids [31].
Enzyme function prediction is particularly challenging when dealing with TPSs because of their capability to produce numerous carbon skeletons by
2 Method
Degenhardt et al. [11] described reaction mechanisms for plant mono- and sesquiterpene synthases, identifying several monoterpenes as products of these chemical rearrangements. Based on the Degenhardt et al. [11] review and other sources on plant enzymatic GPP cyclizations [6,13,17,18,36], we extended the approach presented by Silva et al. [33] for plant sesquiterpene biosynthesis by including monoterpenes.
The method formally models the chemical reactions at a mechanistic level, building pathways assembled as a chemical network. The chemical network is abstracted as a directed multi-hypergraph, where vertices correspond to molecules and hyperedges to reactions. Each vertex represents a molecule, which is abstracted as an undirected graph, where atoms are vertices and bonds are edges. In other words, each vertex of the chemical network is composed of an undirected graph representing a molecule, and the chemical reactions on these molecules are modeled as graph transformations. The accumulation of graph transformations following the provided rules composes a network, obtained in a given number of iterations, that exposes the initial, intermediate, and final compounds.
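As a minimal illustration of this two-level abstraction (a Python sketch using networkx, not the software the method is implemented in; molecule and reaction names are placeholders):

```python
# Two-level representation: each molecule is an undirected graph of atoms and
# bonds; the chemical network stores reactions as hyperedges, i.e., pairs of
# (educt set, product set) over molecule identifiers.
import networkx as nx

def molecule(atoms, bonds):
    """atoms: [(index, element)]; bonds: [(u, v, order)] -> undirected graph."""
    g = nx.Graph()
    for idx, element in atoms:
        g.add_node(idx, element=element)
    for u, v, order in bonds:
        g.add_edge(u, v, order=order)
    return g

water = molecule([(0, "O"), (1, "H"), (2, "H")], [(0, 1, 1), (0, 2, 1)])

# The network: vertices are molecules; each hyperedge maps educts to products.
network = {
    "molecules": {"H2O": water},  # further molecules are added per iteration
    "hyperedges": [(("GPP",), ("alpha-terpinyl cation", "PPi"))],
}
```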
Rule-based graph transformations can be described formally by graph grammars, which generalize the much more commonly used term-rewriting systems. Each rule describes a specific class of chemical reactions, such as a ring closure from the C1 to the C6 atom, or an allyl shift. Each rule has a pattern L of atoms and bonds that need to be present in the educts for the corresponding reaction. The matched part of the educts is then transformed as specified by the rule.
We used the double pushout (DPO) formalism [29] for graph rewriting because it is particularly suitable for modeling chemistry: it ensures the reversibility of transformations and supports well-defined atom maps [3]. Here, each rule has the form $p = (L \xleftarrow{l} K \xrightarrow{r} R)$, where L is the left graph, R is the right graph, and K is the context graph. The graph morphisms l and r describe the embedding of the context graph K into L and R, respectively.
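To make the rule anatomy concrete, the sketch below (an illustration only; the Rule encoding and the toy ring-closure example are assumptions, not the actual implementation) represents a rule by the bonds it deletes (those in L but not in K) and the bonds it adds (those in R but not in K), and applies it under a given match:

```python
# A DPO-style rule p = (L <-l- K -r-> R), encoded by deleted and added bonds;
# atoms (the context K) are preserved.
from dataclasses import dataclass, field
import networkx as nx

@dataclass
class Rule:
    name: str
    delete_edges: list = field(default_factory=list)  # bonds in L only
    add_edges: list = field(default_factory=list)     # bonds in R only

def apply_rule(rule, mol, match):
    """Apply `rule` to graph `mol` using `match`: rule atom -> graph node."""
    product = mol.copy()
    for u, v in rule.delete_edges:
        product.remove_edge(match[u], match[v])
    for u, v in rule.add_edges:
        product.add_edge(match[u], match[v])
    return product

# e.g., a hypothetical C1-C6 ring closure applied to a toy six-carbon chain:
chain = nx.path_graph(6)  # nodes 0..5 with edges along the chain
ring_closure = Rule("C1-C6 ring closure", add_edges=[("C1", "C6")])
ring = apply_rule(ring_closure, chain, {"C1": 0, "C6": 5})
print(sorted(ring.edges()))  # the new 0-5 bond closes the ring
```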
Fig. 3. Example of a rule for quenching by water, and its application to the molecules H2O and α-terpineol.
The implementation, combined with a Docker image with the whole environment ready, is available on GitHub.
The simulations produce a chemical network that exposes the inputs, intermediates, and final compounds at an atom-bond level. The results presented here, allied to biological scenarios and the primary sequences of TPSs, can enhance their functional annotation. Moreover, plant metabolic engineering could take advantage of this approach to help design genes that produce the desired assortment of monoterpenes for ecological and industrial purposes. These monoterpene results are added to the previous results for sesquiterpenes [33] to compose a terpene biosynthesis simulation system called 2Path. It also brings a Web interface that makes writing the simulations more intuitive and less laborious, since a few drag-and-drop actions and clicks are enough. All the work is available at https://github.com/waldeyr/2PathTerpenes.
References
1. Abbas, F., Ke, Y., Yu, R., Yue, Y., Amanullah, S., Jahangir, M.M., Fan, Y.:
Volatile terpenoids: multiple functions, biosynthesis, modulation and manipula-
tion by genetic engineering. Planta 246(5), 803–816 (2017). https://doi.org/10.
1007/s00425-017-2749-x
2. Aharoni, A., Jongsma, M.A., Bouwmeester, H.J.: Volatile science? metabolic engi-
neering of terpenoids in plants. Trends Plant Sci. 10(12), 594–602 (2005)
3. Andersen, J.L., Flamm, C., Merkle, D., Stadler, P.F.: Inferring chemical reaction
patterns using graph grammar rule composition. J. Syst. Chem. 4, 4 (2013)
4. Andersen, J.L., Flamm, C., Merkle, D., Stadler, P.F.: A Software Package for
Chemically Inspired Graph Transformation. In: Echahed, R., Minas, M. (eds.)
ICGT 2016. LNCS, vol. 9761, pp. 73–88. Springer, Cham (2016). https://doi.org/
10.1007/978-3-319-40530-8 5
5. Block, A.K., Vaughan, M.M., Schmelz, E.A., Christensen, S.A.: Biosynthesis and
function of terpenoid defense compounds in maize (zea mays). Planta 249(1), 21–
30 (2019)
6. Bohlmann, J., Steele, C.L., Croteau, R.: Monoterpene synthases from grand fir
(Abies grandis): cDNA isolation, characterization, and functional expression of
myrcene synthase, (-)-(4S)- limonene synthase, and (-)-(1S,5S)-pinene synthase. J.
Biol. Chemistry 272(35), 21784–21792 (1997)
7. Chen, F., Tholl, D., Bohlmann, J., Pichersky, E.: The family of terpene synthases
in plants: a mid-size family of genes for specialized metabolism that is highly
diversified throughout the kingdom. Plant J. 66(1), 212–229 (2011)
8. Chen, H., Li, G., Köllner, T.G., Jia, Q., Gershenzon, J., Chen, F.: Positive dar-
winian selection is a driving force for the diversification of terpenoid biosynthesis
in the genus oryza. BMC Plant Biol. 14(1), 1–12 (2014)
9. Chow, J.Y., et al.: Computational-guided discovery and characterization of a sesquiterpene synthase from Streptomyces clavuligerus. Proc. Natl. Acad. Sci. U.S.A. 112(18), 5661–5666 (2015)
10. Christianson, D.W.: Structural and chemical biology of terpenoid cyclases. Chem-
ical Rev. 117(17), 11570–11648 (2017)
11. Degenhardt, J., Köllner, T.G., Gershenzon, J.: Monoterpene and sesquiterpene
synthases and the origin of terpene skeletal diversity in plants. Phytochemistry
70(15–16), 1621–1637 (2009)
12. Djoumbou-Feunang, Y., Fiamoncini, J., Gil-de-la-Fuente, A., Greiner, R., Manach,
C., Wishart, D.S.: BioTransformer: a comprehensive computational tool for small
molecule metabolism prediction and metabolite identification. Journal of Chemin-
formatics 11(1), 1–25 (2019). https://doi.org/10.1186/s13321-018-0324-5
13. Dong, L., Jongedijk, E., Bouwmeester, H., Van Der Krol, A.: Monoterpene biosyn-
thesis potential of plant subcellular compartments. New Phytologist 209(2), 679–
690 (2016)
14. Duigou, T., du Lac, M., Carbonell, P., Faulon, J.L.: Retrorules: a database of
reaction rules for engineering biology. Nucleic Acids Res. 47(D1), D1229–D1235
(2018)
15. Gao, Y., Honzatko, R.B., Peters, R.J.: Terpenoid synthase structures: a so far
incomplete view of complex catalysis. Nat. Prod. Rep. 29(10), 1153 (2012)
16. Gershenzon, J., Dudareva, N.: The function of terpene natural products in the
natural world. Nat. Chem. Biol. 3(7), 408–414 (2007)
17. Godard, K.A., White, R., Bohlmann, J.: Monoterpene-induced molecular responses
in Arabidopsis thaliana. Phytochemistry 69(9), 1838–1849 (2008)
18. Gutensohn, M., et al.: Cytosolic monoterpene biosynthesis is supported by plastid-
generated geranyl diphosphate substrate in transgenic tomato fruits. Plant J.
75(3), 351–363 (2013)
19. Hastings, J., et al.: Chebi in 2016: Improved services and an expanding collection
of metabolites. Nucleic Acids Res. 44(D1), D1214–D1219 (2016)
20. Heinig, U., Gutensohn, M., Dudareva, N., Aharoni, A.: The challenges of cellular
compartmentalization in plant metabolic engineering. Current Opinion Biotechnol.
24(2), 239–246 (2013)
21. Himsolt, M.: GML: a portable graph file format. Technical report, Universität Passau (1997). http://www.fmi.uni-passau.de/graphlet/gml/gml-tr.html
22. Isegawa, M., Maeda, S., Tantillo, D.J., Morokuma, K.: Predicting pathways for
terpene formation from first principles-routes to known and new sesquiterpenes.
Chem. Sci. 5(4), 1555–1560 (2014)
23. Kanehisa, M., et al.: The kegg database. In: Novartis Foundation Symposium, pp.
91–100. Wiley Online Library (2002)
24. Kempinski, C., Jiang, Z., Bell, S., Chappell, J.: Metabolic engineering of higher
plants and algae for isoprenoid production. In: Schrader, J., Bohlmann, J. (eds.)
Biotechnology of Isoprenoids. ABE, vol. 148, pp. 161–199. Springer, Cham (2015).
https://doi.org/10.1007/10 2014 290
25. Kim, S., et al.: Pubchem 2019 update: improved access to chemical data. Nucleic
Acids Res. 47(D1), D1102–D1109 (2019)
26. Köllner, T.G., Schnee, C., Gershenzon, J., Degenhardt, J.: The variability of sesquiterpenes emitted from two Zea mays cultivars is controlled by allelic variation of two terpene synthase genes encoding stereoselective multiple product enzymes. Plant Cell 16, 1115–1131 (2004)
27. Liu, W., et al.: Structure, function and inhibition of ent-kaurene synthase from Bradyrhizobium japonicum. Sci. Rep. 4, 6214 (2014)
28. Liu, W., et al.: Structure, function and inhibition of ent-kaurene synthase from Bradyrhizobium japonicum. Sci. Rep. 4, 6214 (2014)
Oncogenic Signaling Pathways in Mucopolysaccharidoses
Abstract. Cancer cells depend on several signaling pathways and organelles, such
as the lysosomes. Defects in the activity of lysosomal hydrolases involved in gly-
cosaminoglycan degradation lead to a group of lysosomal storage diseases called
Mucopolysaccharidoses (MPS). In MPS, secondary cell disturbance affects path-
ways common to cancer. This work aims to identify oncogenic pathways related
to cancer in the different MPS datasets available in public databases and compare
the ontologies across the different types of MPS. For this, we used 12 expres-
sion datasets of 6 types of MPS. Statistical analysis was based on the hypergeometric distribution followed by FDR correction. We found several enriched pathways across the 12 MPS studies, of which 57.65% were KEGG pathways, 32.5% GO Biological Process, 2.5% GO Cellular Component, and 7.35% GO Molecular Function terms. The Hippo and MAPK signaling pathways appear in all datasets. Proteoglycans in cancer, the Rap1 signaling pathway, and the cytokine-mediated signaling pathway appear in 11 of 12 datasets. The lysosome participates in several biological processes, such as autophagy, cell adhesion and migration, and antigen presentation. These processes may also be affected in several types of cancer and in lysosomal storage diseases. Studying the tumor ontology signature in lysosomal disorders may help in understanding the mechanisms underlying lysosomal storage diseases and cancer, and may help expand therapeutic approaches for both types of diseases.
1 Introduction
Several metabolic pathways are deranged in cancer cells. The proliferation ability of
tumors depends on a cascade of signaling pathways in several cancer cell organelles, such as the lysosomes [1]. Lysosomes are cellular compartments responsible, among other functions, for the degradation of macromolecules through the acid hydrolases contained within them. Defects in these enzymes culminate in the lysosomal accumulation of intermediate metabolites or macromolecules, a condition known as lysosomal storage disease [2]. Lysosomal Storage Diseases (LSD) are a group of more than 50 rare metabolic diseases, among which we can highlight the Mucopolysaccharidoses (MPS). In MPS, secondary cell disturbance affects pathways common to cancer.
This work aims to identify oncogenic pathways related to cancer in the different
datasets of MPS available in public databases and to compare the ontologies across the
different types of MPS.
2 Methods
Gene expression analysis considered 12 datasets available at GEO (https://www.ncbi.nlm.nih.gov/geo) from six different MPS types. For RNA-seq data, we used edgeR, and for microarray data, we used R packages according to the experiment's platform. Furthermore, the data presented in this work are available in the MPSBase (https://www.ufrgs.br/mpsbase/). Statistical analysis was based on the hypergeometric distribution followed by FDR correction; a minimal sketch of this test is given after the dataset list below. We performed the enrichment analysis in Cytoscape with the BiNGO and ClueGO plugins and searched the child terms with QuickGO. We selected 12 datasets: 2 of MPS I, 1 of MPS II, 1 of MPS IIIA, 3 of MPS IIIB, 1 of MPS VI, and 4 of MPS VII. These datasets comprise RNA-seq data from an Illumina HiSeq 2500
platform of human iPSC-derived Neuronal Stem Cell (MPS I, GSE111906); Agilent-
021193 Canine (V2) microarray of Ascending Aorta, Descending Aorta and Carotid
Aorta (MPS I, GSE78889); AB SOLiD 3 Plus System (Mus musculus) of Brain sam-
ples (MPS II, GSE95224); Agilent-028005 SurePrint G3 Mouse GE 8×60K Microarray
of Brain and Blood samples (MPS IIIA, GSE97759); Agilent-012694 Whole Mouse
Genome G4122A of Lateral entorhinal cortex and Medial entorhinal cortex (MPS IIIB,
GSE15758); Affymetrix Human Exon 1.0 ST Array of iPSC-derived Neuronal Stem
Cell (MPS IIIB, GSE23075); Affymetrix Human Exon 1.0 ST Array of HeLa depleting
NAGLU (MPS IIIB, GSE32154); Affymetrix Mouse Gene 1.0 ST Array of ARSB null
mouse hepatic cells (MPS VI, GSE77689); Illumina Mouse-8 Expression BeadChip of
Descending aorta (MPS VII, GSE30657); Affymetrix Mouse Genome 430A 2.0 Array
of six brain regions (MPS VII, GSE34071); Affymetrix Mouse Exon 1.0 ST Array of
iPS embryo-derived ES cells with controls derived from B6 Blu ES cells and Mouse
embryonic fibroblast (MPS VII, GSE36017); and Affymetrix Mouse Genome 430A 2.0
Array of hippocampus (MPS VII, GSE76283).
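As referenced above, the enrichment test can be sketched as follows (a minimal illustration with hypothetical gene identifiers; it mirrors the hypergeometric test plus Benjamini-Hochberg FDR scheme described in the Methods, not the exact BiNGO/ClueGO implementation):

```python
# Hypergeometric enrichment with Benjamini-Hochberg FDR correction.
from scipy.stats import hypergeom
from statsmodels.stats.multitest import multipletests

def enrich(de_genes, pathways, background_size):
    """Return {pathway: (p, q, significant)} for over-representation of
    the differentially expressed (DE) gene set in each pathway."""
    names, pvals = [], []
    for name, members in pathways.items():
        k = len(de_genes & members)  # DE genes that fall in the pathway
        # P(X >= k) when drawing len(de_genes) genes from the background
        p = hypergeom.sf(k - 1, background_size, len(members), len(de_genes))
        names.append(name)
        pvals.append(p)
    significant, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    return {n: (p, q, s) for n, p, q, s in zip(names, pvals, qvals, significant)}

# toy usage with hypothetical identifiers
print(enrich({"A", "B", "C"}, {"Hippo signaling": {"A", "B", "X"}}, 20000))
```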
3 Results
We found 680 oncogenic enriched ontologies across the 12 MPS studies, of which 392 (57.65%) were KEGG pathways, 221 (32.5%) GO Biological Process, 17 (2.5%) GO Cellular Component, and 50 (7.35%) GO Molecular Function terms. The Hippo signaling pathway and the MAPK signaling pathway appear in all datasets. Proteoglycans in cancer, the Rap1 signaling pathway, and the cytokine-mediated signaling pathway appear in 11 of 12 datasets (see Fig. 1).
Fig. 1. Top gene ontologies of oncogenic terms in MPS. (a) KEGG pathways; (b) GO Biological Process; (c) GO Molecular Function; (d) GO Cellular Component.
Considering only the MPS types, the ontologies Axon guidance, Focal adhesion, Hippo signaling pathway, MAPK signaling pathway, Metabolic pathways, Pathways in cancer, PI3K-Akt signaling pathway, Proteoglycans in cancer, Rap1 signaling pathway, and Ras signaling pathway are present in all the MPS types found in GEO. Table 1 summarizes the most frequent oncogenic ontologies according to MPS type.
Table 1. Prevalent oncogenic enriched pathways in the datasets analyzed. Ontologies in bold appear in all MPS types.
The dataset with the most enriched pathways is GSE32154 (MPS IIIB, Homo sapiens), with 90 ontologies (see Fig. 2). The dataset with the most enriched KEGG terms is GSE30657 (MPS VII, Mus musculus), with 60 KEGG terms. GSE32154 (MPS IIIB, Homo sapiens) has the most enriched GO Biological Process terms, with 45. For GO Cellular Component, the dataset with the most enriched terms is also GSE32154, with 5 terms. Lastly, for GO Molecular Function, GSE32154 is again the dataset with the most enriched terms (10 terms found).
4 Discussion
in the composition of the extracellular matrix [7], helping to regulate processes such
as metabolic signaling, apoptosis, cell migration, adhesion, and antigen presentation, in
both cancer and MPS.
5 Concluding Remarks
Publicly available data are essential for amplifying multi-omic knowledge of complex and rare diseases. Bioinformatic approaches, such as gene enrichment analysis, may help us understand the complexity of the processes deranged in several diseases. Studying the tumor ontology signature in lysosomal disorders may help in understanding the mechanisms underlying lysosomal storage diseases and cancer, and may help expand therapeutic approaches for both types of diseases.
References
1. Cairns, R.A., Harris, I.S., Mak, T.W.: Regulation of cancer cell metabolism. Nat. Rev. Cancer
11(2), 85–95 (2011). https://doi.org/10.1038/nrc2981
2. Matte, U., Pasqualim, G.: Lysosome: the story beyond the storage. J. Inborn Errors Metab.
Screen. 4, e160044 (2016). https://doi.org/10.1177/2326409816679431
3. Kallunki, T., Olsen, O.D., Jäättelä, M.: Cancer-associated lysosomal changes: friends or foes?
Oncogene 32(16), 1995–2004 (2012). https://doi.org/10.1038/onc.2012.292
4. Fiorenza, M.T., Moro, E., Erickson, R.P.: The pathogenesis of lysosomal storage disorders:
beyond the engorgement of lysosomes to abnormal development and neuroinflammation. Hum.
Mol. Genet. 27(R2), R119–R129 (2018). https://doi.org/10.1093/hmg/ddy155
5. Martinez-Carreres, L., Nasrallah, A., Fajas, L.: Cancer: linking powerhouses to suicidal bags.
Front. Oncol. 7, 204 (2017). https://doi.org/10.3389/fonc.2017.00204
6. Sanchez-Vega, F., et al.: Oncogenic signaling pathways in the cancer genome atlas. Cell 173(2),
321–337 (2018). https://doi.org/10.1016/j.cell.2018.03.035
7. Davidson, S.M., Vander Heiden, M.G.: Critical functions of the lysosome in cancer biology.
Ann. Rev. Pharmacol. Toxicol. 57(1), 481–507 (2017). https://doi.org/10.1146/annurev-pharmtox-010715-103101
Natural Products as Potential Inhibitors
for SARS-CoV-2 Papain-Like Protease:
An in Silico Study
1 Introduction
SARS-CoV-2 is an RNA virus whose genome has 10 open reading frames that encode various structural and non-structural proteins [23]. Among these proteins, one of the most studied, together with the main protease and the spike protein, is the papain-like protease (PLpro), which is necessary to facilitate the spread of the virus and to help evade the host immune response [7]. One of the most notable aspects of SARS-CoV-2 PLpro is its role as a ubiquitin-specific protease, acting especially on the ubiquitin-like protein ISG15, which is a known
2 Methods
Our work was carried out in two stages. The first was a virtual screening (VS) with AutoDock Vina (referred to as Vina from here on) [19] and 213,038 chemical structures of NPs collected in the freely available Universal Natural Product Database - In Silico MS/MS Database (UNPD-ISDB) [1].
To ensure the reliability of the VS results, the second stage consisted of molecular docking divided into three sub-stages with Glide [5] from the Schrödinger suite. The best hits of the VS were docked with standard precision (Glide-SP), extra precision (Glide-XP), and then with the Induced Fit Docking (IFD) protocol, which allows flexibility for the residues of the pocket. Finally, the binding free energy (ΔGb) of the best 10 complexes from the IFD protocol was calculated using the physics-based MM-GBSA (Molecular Mechanics-Generalized Born Surface Area) method with the Prime [10] module of the Schrödinger suite.
Preparation of the Receptor. From the Protein Data Bank, we retrieved the structure of SARS-CoV-2 PLpro with PDB ID 6W9C (resolution 2.7 Å). For VS, all water molecules and ions were removed, and polar hydrogens were added to the receptor using AutoDock Tools v1.5.6 [14]. For molecular docking, the protease was prepared by removing unnecessary water molecules and ions with the Protein Preparation Wizard [13] module and considering the protonation states of the residues at pH 7.4 with Epik [17].
and the coordinates of the centroid of the binding site were calculated in the same way as in the VS and molecular docking stages.
Virtual Screening. Ligands and the protein in PDBQT format were used as input files. The centroid of the binding site (x = -50.1, y = 15.1, z = 34.4) and the optimal box size (40 × 40 × 40 Å), which cover the active and allosteric sites, were obtained with the POCASA server [25] and the eBoxSize script [4], respectively. Vina was run with an exhaustiveness of 24 on the Quinde I supercomputer.
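For illustration, the run could be driven by a configuration file of the kind Vina accepts, populated with the centroid and box size reported above (a sketch: the receptor and ligand file names are placeholders):

```text
# conf.txt - sketch of a Vina configuration (file names are placeholders)
receptor = PLpro_6W9C.pdbqt
ligand = natural_product.pdbqt

center_x = -50.1
center_y = 15.1
center_z = 34.4

size_x = 40
size_y = 40
size_z = 40

exhaustiveness = 24
```

Such a file would be passed to the program as "vina --config conf.txt --out poses.pdbqt", one ligand at a time.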
Molecular Docking and MM-GBSA. The best 1,000 ligands from the VS were selected according to their binding affinity values and docked with Glide-SP in the SARS-CoV-2 PLpro active and allosteric sites. From the Glide-SP results, the best 100 hits were selected and docked with Glide-XP; from these, the best 10 hits were selected and docked again with the IFD protocol, allowing flexibility for the residues close to the active and allosteric sites. For these steps, the OPLS 2005 force field was used, with the same binding site coordinates and box size as in the VS. Finally, the ΔGb of the best ligand-receptor complexes from IFD was estimated according to Eq. 1 with the Prime module, using the implicit solvation model VSGB and the OPLS 2005 force field:

ΔGb = Gcomplex − (Gprotein + Gligand)   (1)

In Eq. 1, Gcomplex, Gprotein, and Gligand are the free energies of the complex, protein, and ligand, respectively.
Validation of the Docking Protocol. Using the Open Babel obrms tool, the RMSD was calculated between the native co-crystallized ligand (TTT; 5-amino-2-methyl-N-[(1R)-1-naphthalen-1-ylethyl]benzamide) of SARS-CoV PLpro and its best re-docked conformations obtained with Vina (0.47 Å), Glide-SP (0.57 Å), and Glide-XP (0.81 Å). Since the RMSD values are less than 2.0 Å with respect to the native ligand, the protocols used in the present study are considered reliable [21].
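For reference, the plain pose-to-pose RMSD used in this kind of validation can be computed as follows (a minimal sketch assuming both conformations list the same atoms in the same order; the coordinates are toy values, not those of TTT):

```python
# RMSD between two conformations with identical atom ordering.
import numpy as np

def rmsd(a, b):
    """Root-mean-square deviation between two (N, 3) coordinate arrays."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

native = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]   # toy coordinates (Angstrom)
redocked = [[0.1, 0.0, 0.0], [1.4, 0.1, 0.0]]
print(rmsd(native, redocked))  # a pose is accepted when RMSD < 2.0 Angstrom
```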
4 Conclusion
In this study, we identified 10 molecules that can interact favorably with residues of the allosteric site of PLpro, forming complexes considered stable according to their ΔGb values. This search was performed using a library of 213,038 NP structures collected in the UNPD-ISDB, together with techniques such as virtual screening and molecular docking. These molecules are mainly glycosides and tannins, and their
References
1. Allard, P.M., et al.: Integration of molecular networking and In-Silico MS/MS frag-
mentation for natural products dereplication. Anal. Chem. 88, 3317–3323 (2016)
2. Arnold, C.: Race for a vaccine. New Sci. 245, 44–47 (2020)
3. Estrada, E.: COVID-19 and SARS-CoV-2. Modeling the present, looking at the
future. Phys. Rep. 869, 1–51 (2020)
4. Feinstein, W.P., Brylinski, M.: Calculating an optimal box size for ligand dock-
ing and virtual screening against experimental and predicted binding pockets. J.
Cheminform. 7(1), 1–10 (2015). https://doi.org/10.1186/s13321-015-0067-5
5. Friesner, R.A., et al.: Extra precision glide: docking and scoring incorporating a
model of hydrophobic enclosure for protein-ligand complexes. J. Med. Chem. 49,
6177–6196 (2006)
6. Guo, S., et al.: The anti-diabetic effect of eight Lagerstroemia speciosa leaf extracts
based on the contents of ellagitannins and ellagic acid derivatives. Food Funct. 11,
1560–1571 (2020)
7. Harcourt, B.H., et al.: Identification of severe acute respiratory syndrome coron-
avirus replicase products and characterization of papain-like protease activity. J.
Virol. 78, 13600–13612 (2004)
8. Ho, K.V., et al.: Identifying antibacterial compounds in black walnuts (Juglans
nigra) using a metabolomics approach. Metabolites 8, 58 (2018)
9. Ishii, T., Okino, T., Mino, Y., Tamiya, H., Matsuda, F.: Plant-growth regulators
from common starfish (Asterias amurensis Lütken) waste. Plant Growth Regul.
52, 131–139 (2007)
10. Jacobson, M.P., Friesner, R.A., Xiang, Z., Honig, B.: On the role of the crystal
environment in determining protein side-chain conformations. J. Mol. Biol. 320,
597–608 (2002)
11. Ji, D., Huang, Z.Y., Fei, C.H., Xue, W.W., Lu, T.L.: Comprehensive profiling and characterization of chemical constituents of the rhizome of Anemarrhena asphodeloides Bge. J. Chromatogr. B 1060, 355–366 (2017)
12. Kang, Z.Y., Zhang, M.J., Wang, J.X., Liu, J.X., Duan, C.L., Yu, D.Q.: Two new
furostanol saponins from the fibrous root of Ophiopogon japonicus. J. Asian Nat.
Prod. Res. 15, 1230–1236 (2013)
13. Madhavi Sastry, G., Adzhigirey, M., Day, T., Annabhimoju, R., Sherman, W.:
Protein and ligand preparation: parameters, protocols, and influence on virtual
screening enrichments. J. Comput. Aided. Mol. Des. 27, 221–234 (2013)
270 J. Alvarado-Huayhuaz et al.
14. Morris, G.M., et al.: AutoDock4 and AutoDockTools4: automated docking with
selective receptor flexibility. J. Comput. Chem. 30, 2785–2791 (2009)
15. Nonaka, G.I., Nakayama, S., Nishioka, I.: Tannins and related compounds. LXXXIII. Isolation and structures of hydrolyzable tannins, phillyraeoidins A–E, from Quercus phillyraeoides. Chem. Pharm. Bull. 37, 2030–2036 (1989)
16. O’Boyle, N.M., Banck, M., James, C.A., Morley, C., Vandermeersch, T., Hutchison,
G.R.: Open babel: an open chemical toolbox. J. Cheminform. 3, 33 (2011)
17. Shelley, J.C., Cholleti, A., Frye, L.L., Greenwood, J.R., Timlin, M.R., Uchimaya,
M.: Epik: a software program for pK a prediction and protonation state generation
for drug-like molecules. J. Comput. Aided. Mol. Des. 21, 681–691 (2007)
18. Shin, D., et al.: Papain-like protease regulates SARS-CoV-2 viral spread and innate
immunity. Nature (2020)
19. Trott, O., Olson, A.J.: AutoDock Vina: Improving the speed and accuracy of dock-
ing with a new scoring function, efficient optimization, and multithreading. J.
Comput. Chem. 31, 455–461 (2009)
20. Tung, N.H., Shoyama, Y.: Eastern blotting analysis and isolation of two new
dammarane-type saponins from american ginseng. Chem. Pharm. Bull. 60, 1329–
1333 (2012)
21. Wang, R., Lu, Y., Wang, S.: Comparative evaluation of 11 scoring functions for
molecular docking. J. Med. Chem. 46, 2287–2303 (2003)
22. Zhao, W., Xu, R., Qin, G., Vaisar, T., Lee, M.S.: Saponins from Mussaenda pubescens. Phytochemistry 42, 1131–1134 (1996)
23. Wu, F., et al.: A new coronavirus associated with human respiratory disease in
China. Nature 579, 265–269 (2020)
24. Yang, W.L., Tian, J., Peng, S.L., Guan, J.F., Ding, L.S.: Chemical constituents of Diuranthera inarticulata. Yao Xue Xue Bao 36, 590–594 (2001)
25. Yu, J., Zhou, Y., Tanaka, I., Yao, M.: Roll: a new algorithm for the detection of
protein pockets and cavities with a rolling probe sphere. Bioinformatics 26, 46–52
(2009)
26. Zhou, L., Cheng, Z., Chen, D.: Simultaneous determination of six steroidal saponins
and one ecdysone in Asparagus filicinus using high performance liquid chromatog-
raphy coupled with evaporative light scattering detection. Acta Pharm. Sin. B 2,
267–273 (2012)