Cancerdiscover: An Integrative Pipeline For Cancer Biomarker and Cancer Class Prediction From High-Throughput Sequencing Data
Research Paper
CancerDiscover: an integrative pipeline for cancer biomarker and cancer class prediction from high-throughput sequencing data
Akram Mohammed1,*, Greyson Biegert1,*, Jiri Adamec1 and Tomáš Helikar1
1 Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America
* These authors have contributed equally to this work
Correspondence to: Tomáš Helikar, email: [email protected]
Keywords: open-source; cancer classification; gene expression; machine learning; cancer biomarker
Received: September 28, 2017 Accepted: December 09, 2017 Published: December 20, 2017
Copyright: Mohammed et al. This is an open-access article distributed under the terms of the Creative Commons Attribution
License 3.0 (CC BY 3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author
and source are credited.
ABSTRACT
Figure 1: Schematic representation of the CancerDiscover pipeline. First, raw data are normalized, background correction is
performed, and the output is partitioned into training and testing sets. The test set is held in reserve for model testing while the training set
undergoes a feature selection method. Feature selection provides a list of ranked attributes that are subsequently used to rebuild the training
and testing sets. The training dataset is subsequently used to build machine learning models. Finally, the testing data set is used for model
testing.
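The workflow in Figure 1 can be illustrated with a minimal scikit-learn analogue. This is an independent sketch of the same train/test discipline, not CancerDiscover's actual implementation (which performs its own normalization and uses WEKA for model building); the toy data and feature counts are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Toy expression matrix: 60 samples x 200 probes, two classes (e.g. tumor/normal).
X = rng.normal(size=(60, 200))
y = np.array([0, 1] * 30)
X[y == 1, :5] += 2.0  # make the first 5 probes informative for class 1

# 1) Partition into training and testing sets; the test set is held in reserve.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 2) Feature selection on the TRAINING set only; the ranked attributes are then
#    used to rebuild both the training and testing sets.
selector = SelectKBest(f_classif, k=5).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# 3) Build the model on the reduced training set.
model = SVC(kernel="linear").fit(X_train_sel, y_train)

# 4) Evaluate on the held-out testing set.
print(f"test accuracy: {accuracy_score(y_test, model.predict(X_test_sel)):.2f}")
```

Selecting features on the training partition alone, as in step 2, is what keeps the held-out test accuracy an honest estimate.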
Figure 2: Model accuracies for the classification of tumor vs. normal and adenocarcinoma vs. squamous cell carcinoma:
RF represents Random Forest classifier and SVM indicates Support Vector Machine classifier. (A) Training accuracy for
Tumor vs. Normal model, (B) Training accuracy for Adenocarcinoma vs. Squamous Cell Carcinoma model, (C) Testing accuracy for
Tumor vs. Normal model, (D) Testing accuracy for Adenocarcinoma vs. Squamous Cell Carcinoma model.
Table 2 note: All the datasets contain 54,675 features, and 2 CPUs were used for the analysis. Elapsed time refers to the amount of real time spent processing that function.
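Per-step elapsed (wall-clock) time of the kind benchmarked in this section can be collected with a small timing harness. The stage functions below are hypothetical stand-ins, not CancerDiscover's actual functions.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record wall-clock (elapsed) time for one pipeline step."""
    start = time.perf_counter()
    yield
    results[label] = time.perf_counter() - start

# Hypothetical stand-ins for pipeline stages; replace with real calls.
def normalize(n):
    sum(i * i for i in range(1000 * n))

def select_features(n):
    sum(i * i for i in range(2000 * n))

def classify(n):
    sum(i * i for i in range(3000 * n))

for n_samples in (10, 50, 100):  # mirrors the benchmark's sample quantities
    results = {}
    with timed("normalization", results):
        normalize(n_samples)
    with timed("feature selection", results):
        select_features(n_samples)
    with timed("classification", results):
        classify(n_samples)
    total = sum(results.values())
    print(f"{n_samples:>4} samples: "
          + ", ".join(f"{k}={v:.4f}s" for k, v in results.items())
          + f", total={total:.4f}s")
```

Summing the per-step times gives the total elapsed time, the two quantities reported in Table 2.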
partitions, feature selection algorithms, classification algorithms, and the threshold or percentage of ranked features for additional analysis.

CancerDiscover benchmarking

Benchmarking was performed using 500 samples from Acute Myeloid Leukemia (AML) data (see Data Collection section for more details) to assess the performance of the software by running all the feature selection and classification algorithms. The following sample quantities were used: 500, 200, 100, 50, and 10. Each dataset was run through the pipeline, performing 23 feature selection algorithms (see Supplementary File 2 for the list of feature selection methods) and classification algorithms, to determine the required computational resources, such as the elapsed time for each step of the pipeline and the total elapsed time. These factors depend mainly on the size of the dataset being processed. Benchmarking was performed using computational resources at the Holland Computing Center of the University of Nebraska-Lincoln, which has 106 nodes with 4 CPUs per node. Table 2 shows the benchmarking results.

For the smallest set, containing only ten samples, 19 of the 23 possible feature selection algorithms completed processing (4 feature selection algorithms could not be completed due to the 10-fold cross-validation used). For those 19 feature selection algorithms, 585 classification models were generated (a few of the ARFF files were empty for the lower feature thresholds due to the small number of samples). The 50-sample dataset completed 20 of the 23 possible feature selection algorithms, thereby generating 665 classification models. When using 100 samples, 20 of the 23 possible feature selection algorithms were completed and subsequently utilized to generate 665 classification models. The 200-sample dataset provided 20 of the 23 possible feature selection outputs and produced 650 classification models. Lastly, the 500-sample dataset completed 20 of the possible 23 feature selection outputs and generated 665 classification models. As the datasets grew, the time required for cancer classification increased linearly (Table 2).

Comparison of CancerDiscover with other methods

We compared the performance of CancerDiscover with that of three existing methods: GenePattern [13], Chipster [12], and the method described in Aliferis et al. [21]. We used the same training and testing datasets to compare the performance of CancerDiscover with these methods. Results of this analysis are summarized in Table 3 and Supplementary Table 1 (Supplementary File 3), and discussed in detail below.

GenePattern [13] is a web-based platform that allows users to upload data and perform statistical analysis and class prediction. Due to the nature of the data used in this study, only the SVM classification suite was used to draw comparisons between CancerDiscover and GenePattern. Because GenePattern could not perform normalization and background correction for the given datasets, we used the data normalized by the CancerDiscover pipeline (using the RMA method) and provided the normalized data to the SVM classification module of GenePattern. The input data contained all probes, as GenePattern did not provide feature selection options. ML classification models were generated using the training data with
Table note: This table highlights the capabilities of tools for performing different functions necessary to generate ML models.
accuracies of 98.43% for the Tumor vs. Normal model and 99.06% for the Adenocarcinoma vs. Squamous Cell Carcinoma model. These higher accuracies could also be due to the normalization and background correction performed by the CancerDiscover pipeline. Of the three compared software tools, GenePattern's accuracies are most similar to those produced by CancerDiscover (99.21% and 99.06%, respectively). All probes were utilized in model building, since feature selection could not be performed using GenePattern. On the other hand, CancerDiscover was able to achieve similar accuracy using as few as three probes (see Supplementary Table 1 in Supplementary File 3). Finally, CancerDiscover differs from the proprietary GenePattern in that CancerDiscover is open-source; as such, its methodologies are transparent and reproducible, and the community can further expand the software.

Chipster is developed based on a client-server architecture. Data are imported on the client side, while all processing is performed on the server side using R. All data must be transferred between client and server for each analysis step, which can be very time-consuming if the datasets are large [22]. Chipster was not able to successfully perform a classification when we provided the dataset containing all probes. As a result, feature selection was performed artificially; that is, the datasets provided to Chipster contained only those probes selected by the CancerDiscover feature selection method; thus, the datasets provided included the top 3, 6, 12, 100, or 500 probes. Raw data in the form of CEL files were normalized (RMA normalization) by Chipster. The accuracy using the top 3 probes was 97.63% for the Tumor vs. Normal model and 98.82% for the Adenocarcinoma vs. Squamous Cell Carcinoma model, ranking third in the accuracy assessment (see Supplementary Table 1 in Supplementary File 3). These accuracy assessments for CancerDiscover are better than the results provided by Chipster.

Data used in this paper were also analyzed independently in Aliferis et al. [21], using two feature selection algorithms: Recursive Feature Elimination and Univariate Association Filtering. These algorithms identified 6 and 100 features, respectively, as significant for cancer vs. normal classification, and 12 and 500 features, respectively, for adenocarcinoma vs. squamous cell carcinoma classification. Aliferis et al. reported average accuracies across classification algorithms of 94.97% for the cancer vs. normal model and 96.83% for the squamous cell carcinoma vs. adenocarcinoma model. In comparison, CancerDiscover achieved 99.21% accuracy for the cancer vs. normal model and 99.06% for the adenocarcinoma vs. squamous cell carcinoma model, while using only three features. In the context of these data, CancerDiscover was more accurate while using less information than Aliferis et al.

These results demonstrate that the CancerDiscover method is complementary to some of the existing methods, such as GenePattern, Chipster, and the Aliferis et al. method, and that it is also suitable for accurate classification of other cancer types and subtypes. Although the classification accuracy of CancerDiscover was marginally higher than that of the compared methods, the strengths of CancerDiscover lie in its streamlined nature, which enables users to begin with raw data and proceed to build machine learning models within a complete pipeline. Another strength of CancerDiscover is its flexibility, allowing users to utilize various methodologies within the platform and further extend the software as a whole due to its open-source nature.

In conclusion, we have developed a comprehensive, integrative, open-source, and freely available pipeline, CancerDiscover, which enables researchers to automate the processing and analysis of high-throughput data with the objective of both identifying cancer biomarkers and classifying cancer and normal tissue samples (including cancer subtypes). Herein, we showcased the pipeline's flexibility, utility, and ease of use in generating multiple prediction models simultaneously from raw high-throughput data. CancerDiscover allows users to customize each step of the pipeline: preprocessing raw data, selecting normalization methods, data partitions, feature selection algorithms, and classification algorithms for additional analysis. The CancerDiscover pipeline was able to identify an optimal number of top-ranked features and accurately classify the sample data into its classes. Benchmarking demonstrated the high performance of the
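Accuracy figures like those compared above, together with the specificity and F1-score formulas and the stratified 10-fold cross-validation described in the sections that follow, can be reproduced on synthetic data with scikit-learn. This is an illustrative sketch, not CancerDiscover's actual implementation (which uses WEKA); the dataset parameters are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Synthetic two-class dataset standing in for tumor/normal expression profiles.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=0)

# Stratified 10-fold CV: each fold preserves class proportions; every sample is
# predicted exactly once, always by a model that did not train on it.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=cv)

# Confusion-matrix counts, then the metrics as defined in the paper.
tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

specificity = tn / (tn + fp)                        # Specificity = TN / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)  # F1 = 2 * P * R / (P + R)
print(f"specificity={specificity:.3f}, F1={f1:.3f}")
```

Because every sample is held out exactly once, the metrics computed from the pooled predictions approximate the averaged-fold estimate the pipeline reports.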
Specificity = TN / (TN + FP)

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

Model selection and accuracy estimation

The pipeline offers the flexibility to choose any k-fold cross-validation for model selection and accuracy estimation. In the case study, we used stratified 10-fold cross-validation [27, 29]. This technique separates the data into ten parts and uses nine parts for model generation, while predictions are generated and evaluated using the remaining part. This step is repeated ten times, such that each part (internal test set) is tested against the other nine parts (internal training set). After the 10-fold cross-validation, the average performance across all folds is used as an unbiased estimate of the performance of model training.

Abbreviations

HTS: high-throughput sequencing; HPC: high-performance computing; SLURM: simple linux utility for resource management; WEKA: waikato environment for knowledge analysis; ML: machine learning; SVM: support vector machine; RF: random forests; RMA: robust multi-array average; TP: true positives; TN: true negatives; FP: false positives; FN: false negatives; AML: acute myeloid leukemia; ARFF: attribute-relation file format.

Author contributions

AM designed the software pipeline. AM and GB wrote the software, developed the case study, and wrote the manuscript. TH, JA, and AM conceptualized the project, reviewed, and revised the manuscript.

ACKNOWLEDGMENTS

We would like to thank Lara Appleby and Achilles Gasper Rasquinha for the critical review of this manuscript. We would also like to acknowledge Holland

FUNDING

This work has been supported by the National Institutes of Health grant # [1R35GM119770-01 to TH].

REFERENCES

1. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007; 23: 2507–17. https://doi.org/10.1093/bioinformatics/btm344.
2. Sanborn JZ, Benz SC, Craft B, Szeto C, Kober KM, Meyer L, Vaske CJ, Goldman M, Smith KE, Kuhn RM, Karolchik D, Kent WJ, Stuart JM, et al. The UCSC cancer genomics browser: update 2011. Nucleic Acids Res. 2011; 39: D951–9. https://doi.org/10.1093/nar/gkq1113.
3. Mateo L, Guitart-Pla O, Pons C, Duran-Frigola M, Mosca R, Aloy P. A PanorOmic view of personal cancer genomes. Nucleic Acids Res. 2017; 45: W195–200. https://doi.org/10.1093/nar/gkx311.
4. Tang Z, Li C, Kang B, Gao G, Li C, Zhang Z. GEPIA: a web server for cancer and normal gene expression profiling and interactive analyses. Nucleic Acids Res. 2017; 45: W98–102. https://doi.org/10.1093/nar/gkx247.
5. Cases I, Pisano DG, Andres E, Carro A, Fernandez JM, Gomez-Lopez G, Rodriguez JM, Vera JF, Valencia A, Rojas AM. CARGO: a web portal to integrate customized biological information. Nucleic Acids Res. 2007; 35: W16–20. https://doi.org/10.1093/nar/gkm280.
6. Dietmann S, Lee W, Wong P, Rodchenkov I, Antonov AV. CCancer: a bird's eye view on gene lists reported in cancer-related studies. Nucleic Acids Res. 2010; 38: W118–23. https://doi.org/10.1093/nar/gkq515.
7. Shepherd R, Forbes SA, Beare D, Bamford S, Cole CG, Ward S, Bindal N, Gunasekaran P, Jia M, Kok CY, Leung K, Menzies A, Butler AP, et al. Data mining using the Catalogue of Somatic Mutations in Cancer BioMart. Database. 2011; 2011: bar018. https://doi.org/10.1093/database/bar018.
8. Culhane AC, Schroder MS, Sultana R, Picard SC, Martinelli EN, Kelly C, Haibe-Kains B, Kapushesky M, St Pierre AA, Flahive W, Picard KC, Gusenleitner D, Papenhausen G, et al. GeneSigDB: a manually curated database and resource for analysis of gene expression signatures. Nucleic Acids