Background: A Modeling and Simulation Web Tool For Plant Biologists
Results: An ensemble feature selection strategy for miRNA signatures is proposed. miRNAs
are chosen based on consensus on feature relevance from high-accuracy classifiers of
different typologies. This methodology aims to identify signatures that are considerably
more robust and reliable when used in clinically relevant prediction tasks. Using the
proposed method, a 100-miRNA signature is identified in a dataset of 8023 samples,
extracted from TCGA. When eight state-of-the-art classifiers are run with the 100-miRNA
signature and with the original 1046 features, global accuracy differs by only 1.4%.
Importantly, this 100-miRNA signature is sufficient to
distinguish between tumor and normal tissues. The approach is then compared against
other feature selection methods, such as UFS, RFE, EN, LASSO, Genetic Algorithms, and
EFS-CLA. The proposed approach provides better accuracy under 10-fold cross-validation
with different classifiers. It is also applied to several GEO datasets across different
platforms, with some classifiers exceeding 90% classification accuracy, which
supports its cross-platform applicability.
Outcomes: As a proof of concept, Boa for genomics (Boag) has been implemented to
analyze the metadata of RefSeq’s 153,848 annotation (GFF) and assembly (FASTA) files. Boag
improves substantially on existing solutions such as Python and MongoDB by
using a domain-specific language built on Hadoop infrastructure, giving a smaller storage
footprint that scales well and requiring fewer lines of code. We execute scripts through Boag
to answer questions about the genomes in RefSeq. We identify the largest and smallest
genomes deposited, explore exon frequencies for assemblies after 2016, identify the most
commonly used bacterial genome assembly program, and address how animal genome
assemblies have improved since 2016. Boag databases significantly reduce the
storage required for the raw data and significantly speed up queries over large
datasets, owing to the automated parallelization and distribution that the Hadoop
infrastructure provides during computations.
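
To make the kind of question concrete, the following plain-Python sketch runs two of the queries above (largest and smallest genome, most common bacterial assembly program) over a handful of records. It is not Boa or Boag syntax, and the record fields, accessions, and values are invented for illustration; it only shows the aggregation pattern that such a query expresses.

```python
from collections import Counter

# Hypothetical assembly metadata; field names and values are invented.
records = [
    {"accession": "ex1", "kingdom": "bacteria", "year": 2015,
     "total_length": 4_600_000, "assembler": "Velvet"},
    {"accession": "ex2", "kingdom": "bacteria", "year": 2017,
     "total_length": 5_100_000, "assembler": "SPAdes"},
    {"accession": "ex3", "kingdom": "bacteria", "year": 2018,
     "total_length": 1_800_000, "assembler": "SPAdes"},
    {"accession": "ex4", "kingdom": "animal", "year": 2019,
     "total_length": 2_700_000_000, "assembler": "Canu"},
]

# Largest and smallest genomes by total assembly length.
largest = max(records, key=lambda r: r["total_length"])
smallest = min(records, key=lambda r: r["total_length"])

# Most commonly used assembly program among bacterial genomes.
bacterial_assemblers = Counter(
    r["assembler"] for r in records if r["kingdom"] == "bacteria"
)
most_common_assembler = bacterial_assemblers.most_common(1)[0][0]

print(largest["accession"], smallest["accession"], most_common_assembler)
```

On a single machine this scan is sequential; the abstract attributes Boag's advantage to Hadoop parallelizing and distributing this style of computation automatically.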