
www.impactjournals.com/oncotarget/ Oncotarget, 2018, Vol. 9, (No. 2), pp: 2565-2573

Research Paper
CancerDiscover: an integrative pipeline for cancer biomarker
and cancer class prediction from high-throughput sequencing
data
Akram Mohammed1,*, Greyson Biegert1,*, Jiri Adamec1 and Tomáš Helikar1
1 Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America
* These authors have contributed equally to this work
Correspondence to: Tomáš Helikar, email: [email protected]
Keywords: open-source; cancer classification; gene expression; machine learning; cancer biomarker
Received: September 28, 2017     Accepted: December 09, 2017     Published: December 20, 2017
Copyright: Mohammed et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 (CC BY 3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

ABSTRACT

Accurate identification of cancer biomarkers and classification of cancer type and subtype from High Throughput Sequencing (HTS) data is a challenging problem because it requires manual processing of raw HTS data from various sequencing platforms, quality control, and normalization, all of which are tedious and time-consuming. Machine learning techniques for cancer class prediction and biomarker discovery can hasten cancer detection and significantly improve prognosis. To date, considerable research effort has been devoted to cancer biomarker identification and cancer class prediction. However, currently available tools and pipelines lack flexibility in data preprocessing and in running multiple feature selection methods and learning algorithms; a freely available and easy-to-use program is therefore strongly needed. Here, we propose CancerDiscover, an integrative open-source software pipeline that allows users to automatically and efficiently process large raw high-throughput datasets, normalize them, and select the best-performing features from multiple feature selection algorithms. Additionally, the pipeline lets users apply different feature thresholds to identify cancer biomarkers and build various training models to distinguish different types and subtypes of cancer. The open-source software is available at https://github.com/HelikarLab/CancerDiscover and is free for use under the GPL3 license.

INTRODUCTION

Classification of a tissue sample as cancer or normal and among different tissue subtypes facilitates cancer treatment, and high-throughput techniques generate massive amounts of cancer data. Machine learning (ML) has the potential to improve such classification, and the traditional motivation behind ML feature selection algorithms is to find the optimal subset of features. However, no single feature selection algorithm or classification algorithm can provide the best set of features and classifiers [1]. Therefore, there is a need to develop a pipeline that lets users apply different feature selection algorithms, feature thresholds, and various classification algorithms to generate accurate prediction models and evaluation reports that distinguish cancer from normal samples, as well as different types and subtypes of cancer.

Remarkable efforts have been made to develop gene expression analysis tools and databases for cancer high-throughput data [2–11]. Several machine learning tools have been developed to study cancer classification [12–16]. However, Classifusion [14], ESVM [15], and Prophet [16] are either unavailable, abandoned, or unmaintained. The available platforms require processed raw data that have been normalized to address various technical and statistical challenges, such as gene expression value differences within the datasets and sequencing platform bias. Moreover, different analysis steps have to be performed manually with various tools, often on different software platforms. These long and manual processing



steps are not only time-consuming but also error-prone, making high-quality, large-scale ML analyses difficult. To this end, we have developed CancerDiscover, an integrative software pipeline which, given raw, bulk high-throughput data from various platforms, can perform quality checks, normalize the data, select the most important features from multiple feature selection algorithms, and build and evaluate different machine learning models in a streamlined fashion. Unlike software tools that require manual processing and are limited in feature selection and classification algorithm options (e.g., GenePattern [13] and Chipster [12]), CancerDiscover is a fully automated pipeline, while providing users with full control over each analysis step. CancerDiscover is complementary to data repositories, data visualization tools, and software such as ONCOMINE [9], INDEED [10], and cBioPortal [11] that support data visualization and analysis of differential gene expression. Herein, we describe the open-source software and demonstrate its utility and flexibility through a case study. We also demonstrated the utility of the CancerDiscover pipeline using 2,175 gene expression samples from nine tissue types to identify tissue-specific biomarkers (gene sets) whose expression is characteristic of each cancer class, and built single-tissue and multi-tissue models [17]. Finally, we provide benchmarking statistics for CancerDiscover on datasets of varying sizes.

RESULTS AND DISCUSSION

Implementation

The purpose of the CancerDiscover pipeline is to allow users to efficiently and automatically process large high-throughput datasets by transforming various raw datasets and selecting the best-performing features from multiple feature selection algorithms. The pipeline lets users apply different feature thresholds and various learning algorithms to generate multiple prediction models that distinguish different types and subtypes of cancer. CancerDiscover takes raw datasets, normalizes them, and generates WEKA [18]-native (Attribute-Relation File Format: ARFF) input files. The pipeline is illustrated in Figure 1. CancerDiscover consists of eight components: normalization, preliminary feature vector generation, preliminary data partitioning, feature selection, feature vector generation, data partitioning, model training, and model testing. These components are organized into three scripts (masterScript_1, masterScript_2, and masterScript_3). In addition to Bash, the CancerDiscover pipeline is also implemented in SLURM (Simple Linux Utility for Resource Management) to make it available on high-performance computing clusters.

(1). Normalization: Due to the inherent differences among samples obtained from various studies, normalization and background corrections are required to remove or subdue bias in raw data for accurate models. Once raw high-throughput data are obtained, normalization and background correction are performed to remove the technical variation from noisy data and the background noise from signal intensities, and to generate the expression set matrix (for example, gene expression intensity values).
(2). Preliminary Feature Vector Generation: Next, the expression set matrix is used to create the master feature vector in the WEKA-native file format (ARFF).
(3). Preliminary Data Partitioning (Stratified): Stratified data partitioning splits the master feature vector into training and testing sets while maintaining an even class distribution. The training sets are used to construct the models after feature selection has been performed in the next step. Later, the model's accuracy is assessed with the testing set, which has not been exposed to the model, giving an honest assessment of the model. Users of CancerDiscover can specify the size of the data partition of their choice in the pipeline.
(4). Feature Selection (on the training data set only): The pipeline offers the ability to select multiple feature selection algorithms. Each of these algorithms provides a list of ranked features that distinguish different types and subtypes of cancer. Users can choose different feature thresholds and explore the relationship between model accuracy and the number of features considered by the classification algorithms. For example, the generated feature sets can be separated into different feature thresholds (including the top 1%, 10%, 33%, 66%, and 100% of the total number of ranked features, as well as the top 25, 50, and 100 ranked features). Users can also choose these thresholds arbitrarily to identify the minimum number of features needed to achieve accurate classification models. For a list of available feature selection methods, see Supplementary File 2.
(5). Feature Vector Generation: Since the classification models must be built based only on the ranked features, new feature vectors are generated based on the ranked feature sets.
(6). Data Partitioning (Stratified): Once the new feature vectors are generated, each feature vector file undergoes a second data partitioning. The partitioning seed value (an integer that defines the exact sequence of a pseudo-random number generator) is the same as the one used in the preliminary data partitioning. As such, each new feature vector is split into the same training and testing sets as in step 3. The master training and testing feature vectors and the new training and testing feature vectors differ only in the number of features; the master feature vectors contain all of the features, whereas the newly created



feature vectors contain only the features that were ranked according to the different thresholds (dimensionality reduction).
(7). Model Training: CancerDiscover provides a diverse set of machine learning classification algorithms and allows users to build models as they see fit. Each new training dataset from the above step undergoes machine learning model construction using stratified k-fold cross-validation to identify the optimal model.
(8). Model Testing: Model performance is assessed by testing accuracy on the testing dataset that was kept hidden from the model. The model can also be used to predict class labels for samples in an unknown dataset. In the case study below, we illustrate the utility of the software to classify normal vs. cancerous tissues and adenocarcinoma vs. squamous carcinoma based on gene expression data.

Installation/operation

CancerDiscover is available at https://github.com/HelikarLab/CancerDiscover. All components of the pipeline are organized into three scripts (namely masterScript_1, masterScript_2, and masterScript_3), each of which is composed of several scripts (Perl, AWK, Shell, Bash, R, and SLURM). Detailed installation and operation of the pipeline are described in Supplementary File 1. There are two versions of the CancerDiscover pipeline: a beginner version consisting of Bash scripts that can be run on a local machine, and an advanced version consisting of SLURM (Simple Linux Utility for Resource Management) scripts that can be run on a high-performance computer (HPC). SLURM is a computational architecture used to organize user requests into a queue for high-performance computing resources. Due to the sheer size of high-throughput data and the complexity of the data processing steps, it is recommended to run CancerDiscover on a high-performance compute cluster. The command-line pipeline is compatible with Linux and Mac OS X.

Case study

Two kinds of ML models were generated and tested to illustrate possible applications of the pipeline. The first model was developed to classify tissue samples as either cancerous or normal, according to their gene expression patterns. Sample distributions were as follows: 237 tumor tissue samples and 17 histologically normal tissue samples, split evenly into testing and training data sets. The Quantile Normalization Method [19] was used to normalize the data, and background correction was performed using the Robust Multichip Average (RMA) [20] method, by modifying the configuration file for the case study presented in the paper. The Filtered Attribute Evaluator combined with the Ranker method was selected (via the pipeline configuration file) to perform feature selection on the training dataset. This algorithm outputs a list of all data features ranked according to their utility in distinguishing the different

Figure 1: Schematic representation of the CancerDiscover pipeline. First, raw data are normalized, background correction is
performed, and the output is partitioned into training and testing sets. The test set is held in reserve for model testing while the training set
undergoes a feature selection method. Feature selection provides a list of ranked attributes that are subsequently used to rebuild the training
and testing sets. The training dataset is subsequently used to build machine learning models. Finally, the testing data set is used for model
testing.
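Steps 3 and 6 of the pipeline both rely on stratified partitioning with a shared seed, so that the second split reproduces the first exactly. A minimal sketch of that idea, in illustrative Python rather than the pipeline's actual Bash/Perl/R code (the function name is ours):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.5, seed=42):
    """Split sample indices into train/test sets while preserving the
    per-class proportions (an even class distribution)."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    rng = random.Random(seed)          # fixed seed -> reproducible partition
    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab][:]
        rng.shuffle(idxs)
        cut = int(round(len(idxs) * test_frac))
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return sorted(train), sorted(test)

labels = ["tumor"] * 8 + ["normal"] * 4
train, test = stratified_split(labels, test_frac=0.5, seed=7)
# Reusing the seed (as the pipeline does in its second partitioning)
# reproduces exactly the same train/test membership:
assert (train, test) == stratified_split(labels, test_frac=0.5, seed=7)
```

Because the split is stratified, the train and test sets here each contain 4 tumor and 2 normal samples.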



classes of samples; features ranked at the top of the list are most useful in distinguishing cancer from normal samples. The plates used for this case study contain approximately 10,000 full-length genes corresponding to 12,625 probes (features). The top 0.25%, 0.5%, 1%, 10%, 33%, and 100% of ranked features, as well as additional feature sets containing the top 3, 6, 12, 100, and 500 features, were used to generate several models simultaneously. Training and testing accuracies are reported in Figure 2A-2D. We selected RF and SVM as the machine learning classification algorithms for the case study.

We achieved a model training accuracy of 98.43% for the RF classifier using the top 0.25% (31 attributes) of features. Models constructed using the top 3% of ranked features reported an accuracy of 96.06%, while models using the entire list of features (100%) resulted in the lowest accuracy of 93.70%. Training accuracies for the SVM classifier were 99.21% for the models that used the top 3 features. Accuracy declined with an increasing number of features, with models that used the top 12 ranked features reporting an accuracy of 98.43%. SVM produced its lowest (though still relatively high) accuracy of 97.64% using 100 features. As few as the top 31 features are sufficient to achieve high accuracy using random forest classifiers, whereas the top 3 features are sufficient to achieve high accuracy using support vector machines.

The second set of models was also bi-class; however, these models were developed to distinguish lung sub-types (adenocarcinoma vs. squamous cell carcinoma) rather than tumor vs. normal tissue. 211 lung adenocarcinoma and squamous cell carcinoma samples were evenly split into training and testing datasets. After feature selection, the list of ranked features was used to generate models based on different feature thresholds. Testing accuracies can be seen in Figure 2B. With the entire list of ranked features, the RF testing accuracy was 91.51%, and accuracy increased as the percentage or number of ranked features decreased. The top 1% of ranked features (126 attributes) resulted in a model testing accuracy of 93.40%, while the top 0.25% (31 attributes) of ranked features resulted in a testing accuracy of 95.28%. A similar trend was seen going from the top 500 features to the top 3 features. On average, SVM testing accuracies were more consistent and higher than those based on RF. The model generated with the top 3 features resulted in an accuracy of 96.23%, while using the top 6 features led to an accuracy of 95.28%. Using 100 features resulted in a testing accuracy of 97.17%. Using the top 0.25% and 0.5% resulted in accuracies of 96.23% and 97.17%, respectively, while using the top 1% and 10% of features led to an accuracy of 98.11%. Using the top 33% of ranked features resulted in the highest testing accuracy of 99.06%. Precision, Recall, and F1-Score for the models generated using the top 3 features are reported in Table 1.

As shown in Table 1, we were able to achieve a high degree of accuracy using a small fraction of the top-ranked features (3 features). This case study illustrates the pipeline's flexibility, utility, and ease-of-use in generating several models simultaneously from raw high-throughput data. It also highlights the customization CancerDiscover allows at each step of a typical high-throughput data analysis pipeline, including data preprocessing, normalization methods, data

Figure 2: Model accuracies for the classification of tumor vs. normal and adenocarcinoma vs. squamous cell carcinoma:
RF represents Random Forest classifier and SVM indicates Support Vector Machine classifier. (A) Training accuracy for
Tumor vs. Normal model, (B) Training accuracy for Adenocarcinoma vs. Squamous Cell Carcinoma model, (C) Testing accuracy for
Tumor vs. Normal model, (D) Testing accuracy for Adenocarcinoma vs. Squamous Cell Carcinoma model.
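The sweep behind Figure 2, ranking features once and then rebuilding models on progressively smaller top-k subsets, can be sketched on synthetic data. This is only an illustration: a toy mean-difference ranker and a nearest-centroid classifier stand in for WEKA's attribute evaluators and the RF/SVM learners, and all names and data here are invented.

```python
import random

def rank_features(X, y):
    """Rank feature indices by |mean(class 0) - mean(class 1)|, a toy stand-in
    for attribute evaluators such as Filtered Attribute Evaluator + Ranker."""
    idx0 = [i for i, lab in enumerate(y) if lab == 0]
    idx1 = [i for i, lab in enumerate(y) if lab == 1]
    def score(j):
        m0 = sum(X[i][j] for i in idx0) / len(idx0)
        m1 = sum(X[i][j] for i in idx1) / len(idx1)
        return abs(m0 - m1)
    return sorted(range(len(X[0])), key=score, reverse=True)

def centroid_accuracy(Xtr, ytr, Xte, yte, feats):
    """Accuracy of a nearest-centroid classifier restricted to `feats`."""
    cent = {lab: [sum(Xtr[i][j] for i in range(len(Xtr)) if ytr[i] == lab)
                  / ytr.count(lab) for j in feats] for lab in (0, 1)}
    hits = 0
    for x, lab in zip(Xte, yte):
        v = [x[j] for j in feats]
        pred = min((sum((a - b) ** 2 for a, b in zip(v, cent[l])), l)
                   for l in (0, 1))[1]
        hits += pred == lab
    return hits / len(yte)

rng = random.Random(0)
N_FEAT, INFORMATIVE = 50, 5
def sample(lab):
    # only the first INFORMATIVE features carry class signal; the rest are noise
    return [rng.gauss(2.0 * lab if j < INFORMATIVE else 0.0, 1.0)
            for j in range(N_FEAT)]

ytr = [0, 1] * 20
yte = [0, 1] * 20
Xtr = [sample(l) for l in ytr]
Xte = [sample(l) for l in yte]

ranked = rank_features(Xtr, ytr)
for k in (3, 6, 12, N_FEAT):          # feature thresholds, as in the case study
    print(k, centroid_accuracy(Xtr, ytr, Xte, yte, ranked[:k]))
```

As in the case study, accuracy typically stays high (or improves) as the noise features are dropped, because the ranker pushes the informative features to the top of the list.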



Table 1: Accuracies of random forest models using top 3 features
Training model Precision Recall F1-Score
Tumor vs. normal 98.3 98.3 98.3
Adeno vs. squamous 97.9 98.9 98.4
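The measures in Table 1 follow directly from confusion counts (the formulas appear in Materials and Methods). A minimal sketch of the arithmetic, using invented counts rather than the study's actual confusion matrices:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# Hypothetical confusion counts, for illustration only.
tp, fp, fn = 90, 2, 3
p, r = precision(tp, fp), recall(tp, fn)
print(round(100 * p, 1), round(100 * r, 1), round(100 * f1_score(p, r), 1))
# prints: 97.8 96.8 97.3
```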

Table 2: Benchmarking results

Samples  Feature selection methods  Models generated  Normalization  Feature selection  Model train & test  Total
500      20                         665               2:05:32        21:45:59           8:05:32             31:57:03
200      20                         650               0:52:31        14:16:55           4:49:33             19:58:59
100      20                         665               0:26:56        13:31:22           3:12:30             17:00:48
50       20                         665               0:16:48        12:06:42           2:58:56             15:12:26
10       19                         585               0:07:03        10:05:17           2:14:05             12:26:25

All datasets contain 54,675 features, and 2 CPUs were used for the analysis. The Normalization, Feature selection, Model train & test, and Total columns report elapsed time (H:MM:SS), i.e., the amount of real time spent processing that function.
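As a sanity check, the Total column of Table 2 should be the sum of the three stage times; spot-checking the 500-, 200-, and 10-sample rows confirms this (times transcribed from the table; the helper name is ours):

```python
def to_seconds(hms):
    """Convert an H:MM:SS elapsed-time string (as in Table 2) to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

# (normalization, feature selection, model train & test, total) per sample count
rows = {
    500: ("2:05:32", "21:45:59", "8:05:32", "31:57:03"),
    200: ("0:52:31", "14:16:55", "4:49:33", "19:58:59"),
    10:  ("0:07:03", "10:05:17", "2:14:05", "12:26:25"),
}
for n, (norm, fs, train, total) in rows.items():
    stages = to_seconds(norm) + to_seconds(fs) + to_seconds(train)
    assert stages == to_seconds(total), n
```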

partitions, feature selection algorithms, classification algorithms, and the threshold or percentage of ranked features for additional analysis.

CancerDiscover benchmarking

Benchmarking was performed using 500 samples from Acute Myeloid Leukemia (AML) data (see the Data Collection section for more details) to assess the performance of the software by running all of the feature selection and classification algorithms. The following sample quantities were used: 500, 200, 100, 50, and 10. Each dataset was run through the pipeline, performing 23 feature selection algorithms (see Supplementary File 2 for the list) and the classification algorithms, to determine the required computational resources, namely the elapsed time for each step of the pipeline and the total elapsed time. These factors depend mainly on the size of the dataset being processed. Benchmarking was performed using computational resources at the Holland Computing Center of the University of Nebraska-Lincoln, which has 106 nodes with 4 CPUs per node. Table 2 shows the benchmarking results.

For the smallest set, containing only ten samples, 19 of the 23 possible feature selection algorithms completed processing (4 feature selection algorithms could not be completed due to the 10-fold cross-validation used). For those 19 feature selection algorithms, 585 classification models were generated (a few of the ARFF files were empty at the lower feature thresholds due to the small number of samples). The 50-sample dataset completed 20 of the 23 possible feature selection algorithms, thereby generating 665 classification models. When using 100 samples, 20 of the 23 possible feature selection algorithms were completed and subsequently used to generate 665 classification models. The 200-sample dataset provided 20 of the 23 possible feature selection outputs and produced 650 classification models. Lastly, the 500-sample dataset completed 20 of the 23 possible feature selection algorithms and generated 665 classification models. As the datasets grew, the time required for cancer classification increased linearly (Table 2).

Comparison of CancerDiscover with other methods

We compared the performance of CancerDiscover with that of three existing methods: GenePattern [13], Chipster [12], and the method described in Aliferis et al. [21]. We used the same training and test datasets to compare the performance of CancerDiscover with these methods. The results of this analysis are summarized in Table 3 and Supplementary Table 1 (Supplementary File 3), and discussed in detail below.

GenePattern [13] is a web-based platform that allows users to upload data and perform statistical analysis and class prediction. Due to the nature of the data used in this study, only the SVM classification suite was used to draw comparisons between CancerDiscover and GenePattern. Because GenePattern could not perform normalization and background correction for the given datasets, we used the data normalized by the CancerDiscover pipeline (using the RMA method) and provided the normalized data to the SVM classification module of GenePattern. The input data contained all probes, as GenePattern did not provide feature selection options. ML classification models were generated using the training data with



Table 3: Comparisons of machine learning classification components

Tool Components         CancerDiscover  GenePattern  Chipster  Aliferis
Normalization           ✓               -            ✓         -
Background correction   ✓               -            -         -
Partitioning            ✓               -            -         -
Feature selection       ✓               -            -         ✓
Modeling                ✓               ✓            ✓         ✓

This table highlights the capabilities of the tools for performing the different functions necessary to generate ML models.
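The models compared in this section were selected using stratified 10-fold cross-validation (described under Materials and Methods): the data are split into ten class-balanced parts, and each part serves once as the internal test set. A minimal sketch of that resampling scheme, in illustrative Python rather than the pipeline's WEKA implementation:

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=10, seed=1):
    """Yield (train_idx, test_idx) pairs; each fold keeps the class mix of the data."""
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for lab in sorted(by_class):
        idxs = by_class[lab][:]
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)   # deal shuffled indices round-robin
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

labels = ["adeno"] * 60 + ["squamous"] * 40
splits = list(stratified_kfold(labels, k=10))
assert len(splits) == 10
# every sample appears in exactly one held-out fold
assert sorted(i for _, test in splits for i in test) == list(range(100))
```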

accuracies of 98.43% for the Tumor vs. Normal model and 99.06% for the Adenocarcinoma vs. Squamous Cell Carcinoma model. These high accuracies could also be due to the normalization and background correction performed by the CancerDiscover pipeline. Of the three compared software tools, GenePattern's accuracies are the most similar to those produced by CancerDiscover – 99.21% and 99.06%, respectively. All probes were used in model building, since feature selection could not be performed with GenePattern. CancerDiscover, on the other hand, was able to achieve similar accuracy using as few as three probes (see Supplementary Table 1 in Supplementary File 3). Finally, CancerDiscover differs from the proprietary GenePattern in that CancerDiscover is open-source; as such, its methodologies are transparent and reproducible, and the community can further extend the software.

Chipster is built on a client-server architecture. Data are imported on the client side, while all processing is performed on the server side using R. All data must be transferred between client and server for each analysis step, which can be very time-consuming for large datasets [22]. Chipster was not able to successfully perform a classification when we provided the dataset containing all probes. As a result, feature selection was performed artificially; that is, the datasets provided to Chipster contained only those probes selected by our CancerDiscover feature selection method, i.e., the top 3, 6, 12, 100, or 500 probes. Raw data in the form of CEL files were normalized (RMA normalization) by Chipster. The accuracy using the top 3 probes was 97.63% for the Tumor vs. Normal model and 98.82% for the Adenocarcinoma vs. Squamous Cell Carcinoma model, ranking third in the accuracy assessment (see Supplementary Table 1 in Supplementary File 3). The accuracy assessments for CancerDiscover are better than the results provided by Chipster.

The data used in this paper were also analyzed independently in Aliferis et al. [21], using two feature selection algorithms: Recursive Feature Elimination and Univariate Association Filtering. These algorithms identified 6 and 100 features, respectively, as significant for cancer vs. normal classification, and 12 and 500 features, respectively, for adenocarcinoma vs. squamous cell carcinoma classification. Aliferis et al. reported average accuracies across classification algorithms of 94.97% for the cancer vs. normal model and 96.83% for the squamous carcinoma vs. adenocarcinoma model. In comparison, CancerDiscover achieved 99.21% accuracy for the cancer vs. normal model and 99.06% for the adenocarcinoma vs. squamous cell carcinoma model, while using only three features. In the context of these data, CancerDiscover was more accurate while using less information than Aliferis et al.

These results demonstrate that CancerDiscover is complementary to some of the existing methods, such as GenePattern, Chipster, and the Aliferis et al. method, and that it is also suitable for accurate classification of other cancer types and subtypes. Although the classification accuracy of CancerDiscover was marginally higher than that of the compared methods, the strength of CancerDiscover lies in its streamlined nature, which enables users to begin with raw data and proceed to build machine learning models within a complete pipeline. Another strength of CancerDiscover is its flexibility, allowing users to employ various methodologies within the platform and to extend the software as a whole thanks to its open-source nature.

In conclusion, we have developed a comprehensive, integrative, open-source, and freely available pipeline, CancerDiscover, which enables researchers to automate the processing and analysis of high-throughput data with the objective of both identifying cancer biomarkers and classifying cancer and normal tissue samples (including cancer sub-types). Herein, we showcased the pipeline's flexibility, utility, and ease-of-use in generating multiple prediction models simultaneously from raw high-throughput data. CancerDiscover allows users to customize each step of the pipeline: preprocessing raw data and selecting normalization methods, data partitions, feature selection algorithms, and classification algorithms for additional analysis. The CancerDiscover pipeline was able to identify an optimal number of top-ranked features and accurately classify the sample data into its classes. Benchmarking demonstrated the high performance of the



pipeline across datasets of varying sizes. Researchers can now use CancerDiscover for diverse projects, including biomarker identification and tissue classification, without extensive technical knowledge, while retaining significant flexibility. Another great advantage for the biomedical community is that, unlike GenePattern and Chipster, CancerDiscover is open-source, freely available, and written in a modular fashion, which opens an array of opportunities for users to tweak the software for their own needs, adding further algorithms as they become available. Next, we plan to develop a graphical user interface and a web server for CancerDiscover, with additional functionality such as querying, searching, and downloading datasets from public repositories and visualizing the outputs.

MATERIALS AND METHODS

The presented integrative pipeline consists of existing open-source software tools and utilizes publicly available datasets and various performance metrics.

Data collection

For the case study, lung cancer (tumor vs. normal and adenocarcinoma vs. squamous carcinoma) microarray gene expression data were collected from the Broad Institute Cancer Program Legacy Publication Resources database [23]. 237 tumor tissue samples and 17 histologically normal tissue samples, and 211 lung adenocarcinoma and squamous cell carcinoma samples, were used. The plate used was the Human Genome U95A oligonucleotide probe array, containing 12,625 probes. Benchmarking was performed using 500 samples of Acute Myeloid Leukemia (AML) and normal blood sample expression data downloaded from NCBI (GSE6891, GSE2677, GSE43346, GSE63270), [HG-U133_Plus_2] Human Genome U133 Plus 2.0 Array, containing 54,675 probes.

Normalization and background correction

Normalization and preprocessing are essential steps in the analysis of high-throughput data. The affy R module 1.54 [24] from Bioconductor (https://bioconductor.org/packages/release/bioc/html/affy.html) was used to remove the technical variation from noisy data and the background noise from signal intensities. This step is crucial when analyzing large amounts of data compiled from different experimental settings, where individual data files are processed to remove sample bias that could otherwise introduce a bias in the model. The affy R package provides multiple methods for normalization and background correction, which can be used within CancerDiscover via programmatic flags. For the case study given above, quantile normalization [25] and robust multichip average (RMA) [20] were used for normalization and background correction, respectively.

Machine learning algorithms and framework

Though the pipeline supports a diverse set of machine learning classifiers, we used Support Vector Machines (SVMs) and Random Forests to construct the models for the case study. These machine-learning methods were chosen because of their extensive and successful applications to datasets from the genomic and proteomic domains [26, 27]. Some of the cancer classification tasks were binary (two classes), and the others were multiclass (more than two classes). Though SVMs are designed for binary classification, they can also be used for multiclass classification via a one-versus-rest approach [28]. The one-versus-rest approach is known to be among the best-performing methods for multi-category classification of microarray gene expression data [29]. Models were also constructed using Random Forests (RF), which can solve multi-category problems natively through a direct application [30]. The Waikato Environment for Knowledge Analysis (WEKA 3-6-11) [18] is a machine learning software environment that serves as a platform for clustering and classification of high-throughput data.

Performance measure

Accuracy was defined as the overall ability of a model to predict testing data correctly. Reported measures included the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The true-positive count is the number of samples in a dataset that were correctly categorized into their classes, and the false-positive count is the number of samples that were sorted into the wrong category. The true-negative count represents the number of samples that were not classified into a class to which they do not belong, and false negatives are samples that were not classified into the class to which they do belong. Accuracy, Sensitivity (or Recall), Specificity, Precision, and F1-Score are derived from these measures as follows: accuracy is the ratio of correctly predicted samples to the total number of samples. Sensitivity is the proportion of true positives that are predicted as positives. Specificity is the proportion of true negatives that are predicted as negatives, and Precision is the ratio of true positives to the total number of true positives and false positives. Lastly, the F1-score is defined as the harmonic mean of Precision and Recall. The Accuracy, Sensitivity, Specificity, Precision, and F1-Score are given by:

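The evaluation measures described above map directly onto code. The following Python sketch is an illustration written for this text, not part of the CancerDiscover pipeline, and the example counts are hypothetical:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five evaluation measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)            # sensitivity: fraction of positives found
    specificity = tn / (tn + fp)       # fraction of negatives found
    precision = tp / (tp + fp)         # fraction of positive calls that were correct
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "recall": recall,
            "specificity": specificity, "precision": precision, "f1": f1}

# Hypothetical example: 90 AML samples called correctly, 95 normal samples
# called correctly, 5 normals mislabeled as AML, 10 AML samples missed.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print(round(metrics["accuracy"], 3))  # 0.925
```

Because Precision and Recall trade off against each other, the harmonic-mean F1-Score is reported alongside Accuracy.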
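Model selection in the pipeline relies on stratified 10-fold cross-validation, which keeps each fold's class proportions close to those of the full dataset. As a hedged illustration (plain Python with a toy dataset, not the WEKA implementation the pipeline actually uses), the fold assignment can be sketched as:

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign sample indices to k folds while preserving class balance.

    Each fold serves once as the internal test set while the remaining
    k - 1 folds form the internal training set; averaging performance
    over all k rounds gives the cross-validated estimate."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)  # deal each class out round-robin
    return folds

# Toy dataset: 20 AML and 10 normal samples split across 10 folds,
# so every fold receives 2 AML samples and 1 normal sample.
folds = stratified_folds(["AML"] * 20 + ["normal"] * 10, k=10)
```

In practice the samples would also be shuffled before dealing; the round-robin assignment is kept deterministic here for clarity.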
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Recall / Sensitivity = TP / (TP + FN)

Precision = TP / (TP + FP)

Specificity = TN / (TN + FP)

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

Model selection and accuracy estimation

The pipeline offers the flexibility to choose any k-fold cross-validation for model selection and accuracy estimation. In the case study, we used stratified 10-fold cross-validation [27, 29]. This technique separates the data into ten parts, using nine parts for model generation while predictions are generated and evaluated on the remaining part. The step is repeated ten times, such that each part (internal test set) is tested against the other nine parts (internal training set). After the 10-fold cross-validation, the average performance across all folds is used as an unbiased estimate of model performance.

Abbreviations

HTS: high throughput sequencing; HPC: high performance computing; SLURM: simple linux utility for resource management; WEKA: waikato environment for knowledge analysis; ML: machine learning; SVM: support vector machine; RF: random forests; RMA: robust multi-array average; TP: true positives; TN: true negatives; FP: false positives; FN: false negatives; AML: acute myeloid leukemia; ARFF: attribute relation file format.

Author contributions

AM designed the software pipeline. AM and GB wrote the software, developed the case study, and wrote the manuscript. TH, JA, and AM conceptualized the project, reviewed, and revised the manuscript.

ACKNOWLEDGMENTS

We would like to thank Lara Appleby and Achilles Gasper Rasquinha for their critical review of this manuscript. We would also like to acknowledge the Holland Computing Center at the University of Nebraska-Lincoln for providing high-performance computing clusters for machine learning modeling and benchmarking the pipeline.

CONFLICTS OF INTEREST

The authors declare that they have no conflicts of interest.

FUNDING

This work has been supported by the National Institutes of Health grant # [1R35GM119770-01 to TH].

REFERENCES

1. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007; 23: 2507–17. https://doi.org/10.1093/bioinformatics/btm344.

2. Sanborn JZ, Benz SC, Craft B, Szeto C, Kober KM, Meyer L, Vaske CJ, Goldman M, Smith KE, Kuhn RM, Karolchik D, Kent WJ, Stuart JM, et al. The UCSC cancer genomics browser: update 2011. Nucleic Acids Res. 2011; 39: D951–9. https://doi.org/10.1093/nar/gkq1113.

3. Mateo L, Guitart-Pla O, Pons C, Duran-Frigola M, Mosca R, Aloy P. A PanorOmic view of personal cancer genomes. Nucleic Acids Res. 2017; 45: W195–200. https://doi.org/10.1093/nar/gkx311.

4. Tang Z, Li C, Kang B, Gao G, Li C, Zhang Z. GEPIA: a web server for cancer and normal gene expression profiling and interactive analyses. Nucleic Acids Res. 2017; 45: W98–102. https://doi.org/10.1093/nar/gkx247.

5. Cases I, Pisano DG, Andres E, Carro A, Fernandez JM, Gomez-Lopez G, Rodriguez JM, Vera JF, Valencia A, Rojas AM. CARGO: a web portal to integrate customized biological information. Nucleic Acids Res. 2007; 35: W16–20. https://doi.org/10.1093/nar/gkm280.

6. Dietmann S, Lee W, Wong P, Rodchenkov I, Antonov AV. CCancer: a bird's eye view on gene lists reported in cancer-related studies. Nucleic Acids Res. 2010; 38: W118–23. https://doi.org/10.1093/nar/gkq515.

7. Shepherd R, Forbes SA, Beare D, Bamford S, Cole CG, Ward S, Bindal N, Gunasekaran P, Jia M, Kok CY, Leung K, Menzies A, Butler AP, et al. Data mining using the Catalogue of Somatic Mutations in Cancer BioMart. Database. 2011; 2011: bar018. https://doi.org/10.1093/database/bar018.

8. Culhane AC, Schroder MS, Sultana R, Picard SC, Martinelli EN, Kelly C, Haibe-Kains B, Kapushesky M, St Pierre AA, Flahive W, Picard KC, Gusenleitner D, Papenhausen G, et al. GeneSigDB: a manually curated database and resource

Res. 2012; 40: D1060–6. https://doi.org/10.1093/nar/gkr901.

9. Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R, Yu J, Briggs BB, Barrette TR, Anstet MJ, Kincead-Beal C, Kulkarni P, Varambally S, Ghosh D, Chinnaiyan AM. Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia. 2007; 9: 166–80. https://doi.org/10.1593/neo.07112.

10. Zuo Y, Cui Y, Di Poto C, Varghese RS, Yu G, Li R, Ressom HW. INDEED: Integrated differential expression and differential network analysis of omic data for biomarker discovery. Methods. 2016; 111: 12–20. https://doi.org/10.1016/j.ymeth.2016.08.015.

11. Gao J, Aksoy BB, Dogrusoz U, Dresdner G, Gross B, Sumer SO, Sun Y, Jacobsen A, Sinha R, Larsson E, Cerami E, Sander C, Schultz N. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. 2013; 6: pl1. https://doi.org/10.1126/scisignal.2004088.

12. Kallio MA, Tuimala JT, Hupponen T, Klemelä P, Gentile M, Scheinin I, Koski M, Käki J, Korpelainen EI. Chipster: user-friendly analysis software for microarray and other high-throughput data. BMC Genomics. 2011; 12: 507. https://doi.org/10.1186/1471-2164-12-507.

13. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP. GenePattern 2.0. Nat Genet. 2006; 38: 500–1. https://doi.org/10.1038/ng0506-500.

14. Agrawal U, Soria D, Wagner C. Cancer subtype identification pipeline: a classifusion approach. 2016 IEEE Congress on Evolutionary Computation (CEC). IEEE; 2016. p. 2858–65. https://doi.org/10.1109/CEC.2016.7744150.

15. Huang HL, Chang FL. ESVM: evolutionary support vector machine for automatic feature selection and classification of microarray data. Biosystems. 2007; 90: 516–28. https://doi.org/10.1016/j.biosystems.2006.12.003.

16. Medina I, Montaner D, Tárraga J, Dopazo J. Prophet, a web-based tool for class prediction using microarray data. Bioinformatics. 2007; 23: 390–1. https://doi.org/10.1093/bioinformatics/btl602.

17. Mohammed A, Biegert G, Adamec J, Helikar T. Identification of potential tissue-specific cancer biomarkers and development of cancer versus normal genomic classifiers. Oncotarget. 2017; 8: 85692–715. https://doi.org/10.18632/oncotarget.21127.

18. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software. SIGKDD Explor Newsl. 2009; 11: 10. https://doi.org/10.1145/1656274.1656278.

19. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003; 19: 185. Available from http://www.stat.berkeley.edu/~bolstad/normalize/

20. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003; 4: 249–64. https://doi.org/10.1093/biostatistics/4.2.249.

21. Aliferis CF, Tsamardinos I, Massion PP, Statnikov A, Fananapazir N, Hardin D. Machine learning models for classification of lung cancer and selection of genomic markers using array gene expression data. Am Assoc Artificial Intell. 2003; 67–71.

22. Koschmieder A, Zimmermann K, Trissl S, Stoltmann T, Leser U. Tools for managing and analyzing microarray data. Brief Bioinform. 2012; 13: 46–60. https://doi.org/10.1093/bib/bbr010.

23. Cancer Program Legacy Publication Resources. Data Identifier: Classification of Human Lung Carcinomas by mRNA Expression Profiling Reveals Distinct Adenocarcinoma Sub-classes. Available from http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi

24. Gautier L, Cope L, Bolstad BM, Irizarry RA. affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004; 20: 307–15. https://doi.org/10.1093/bioinformatics/btg405.

25. Bolstad B. Probe level quantile normalization of high density oligonucleotide array data. 2001; 1–8.

26. Statnikov A, Wang L, Aliferis C. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008; 9: 319. https://doi.org/10.1186/1471-2105-9-319.

27. Mohammed A, Guda C. Application of a hierarchical enzyme classification method reveals the role of gut microbiome in human metabolism. BMC Genomics. 2015; 16: S16. https://doi.org/10.1186/1471-2164-16-S7-S16.

28. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995; 20: 273–97. https://doi.org/10.1007/BF00994018.

29. Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005; 21: 631–43. https://doi.org/10.1093/bioinformatics/bti033.

30. Bishop CM. Pattern recognition and machine learning. Jordan M, Kleinberg J, Scholkopf B, editors. Springer; 2007. 738 p. https://doi.org/10.1117/1.2819119.
