Multimodal Non-Small Cell Lung Cancer Classification Using Convolutional Neural Networks
ABSTRACT Lung cancer is the leading cause of cancer death worldwide, and its early detection is difficult. Non-Small Cell Lung Cancer (NSCLC) is the most prevalent type of lung cancer. Differentiating between NSCLC subtypes is important for choosing the right treatment plan for the patient. Although recent research has focused on single-modality approaches, multi-omics modalities carry many underlying influences and discoveries for cancer detection and classification, so multi-omics modalities are used in this research. Previous efforts have focused either on multimodality using traditional machine learning classifiers or on a single modality using deep learning. In addition, traditional machine learning approaches are usually used for the molecular sources (RNA-Seq and miRNA-Seq). In this work, deep learning using Convolutional Neural Networks (CNNs) is applied to the above-mentioned modalities as well as to Whole Slide Images (WSIs). The classification accuracies obtained for RNA-Seq, miRNA-Seq, and WSIs are 96.79%, 98.59%, and 89.73%, respectively. The F1 scores obtained for RNA-Seq, miRNA-Seq, and WSIs are 95.238%, 99.67%, and 89.76%, respectively. Moreover, the Area Under the Curve (AUC) values obtained for RNA-Seq, miRNA-Seq, and WSIs are 100%, 99.41%, and 97.54%, respectively. These results improve on those obtained by other related works, as will be explained. With these improvements, lung cancer classification could become more accurate and the disease could be discovered at earlier stages, which is the goal of research efforts in this field.
INDEX TERMS Lung cancer, convolutional neural networks, multimodality, molecular data, whole slide
images, deep learning, multiomics.
Recent studies have focused on the genetic and molecular level for early diagnosis and early treatment planning.
Within NSCLC, two predominant subtypes exist: Lung Adenocarcinoma (LUAD) and Lung Squamous Cell Carcinoma (LUSC). LUAD typically develops in the peripheral lung tissue, whereas LUSC tends to manifest in central locations [2]. Accurate differentiation between these subtypes is crucial because they require distinct treatment approaches [3], [6].
Next Generation Sequencing (NGS) methods, such as whole-genome DNA sequencing and total RNA sequencing, are considered revolutionary technologies for studying genetic changes in cancer [7]. These technologies are promising for accurate classification of tumor cells because they can sequence thousands of genes and detect genomic and transcriptomic alterations [3]. NGS achieves this by comparing DNA and RNA in normal and cancer cells. It reveals the genetic changes that lead to abnormal activity, which in turn results in different levels of genes and, consequently, proteins within the cells [3], [8]. This is known as gene expression. Differentially expressed genes within the cell give great insight into what drives tumor growth [3], [8].
Gene expression has been the focus of recent research in the field of cancer classification [3], [9]. Cell functions are determined by individual proteins, and the synthesis of these proteins depends on which genes are expressed by the cell. Therefore, gene expression gives some insight into cell function [3]. Gene expression is the process by which the information encoded in DNA is converted into functional products such as proteins and non-coding RNA. Microarrays are limited because they capture an incomplete snapshot of the transcriptome. They also cannot detect previously unidentified genes or transcripts [3], [9]. Gene expression quantification is therefore an effective alternative. It identifies which genes are preferentially expressed in different tissues. Accordingly, this paper uses the STAR counts data provided by the TCGA program.
This paper uses two types of gene expression data: mRNA-Seq and miRNA-Seq. mRNA conveys genetic information from DNA to the ribosome, where it is translated into proteins, while miRNA is a type of non-coding RNA that plays a role in gene regulation by binding to target mRNAs and inhibiting their translation or promoting their degradation. Any abnormality in these mRNA or miRNA processes leads to altered gene expression, which can in turn activate tumor growth [3], [10]. However, cancer classification using gene expression is very challenging given the complexity and massive amount of genetic data produced [3], [9], [11], [12]. For example, the number of variants obtained from RNA sequencing is so large that it is difficult for traditional machine learning methods and bioinformatics tools to use genetic variants for disease prediction [3], [13], [14]. Gene expression data are known for their high dimensionality, with a very large number of features representing genes and a very small number of training samples representing patients [3], [13], [15]. Deep learning has recently been used to deal with this high-dimensionality problem [16].
Deep learning has been widely used with a single modality (images in most cases) for detecting different types of cancer. In this paper, multi-omics modalities are used for lung cancer classification: RNA-Seq [2], [17] and miRNA-Seq [2], [18]. Moreover, Whole Slide Images (WSIs) [2], [19] are used. In total, three modalities are used in this paper: RNA-Seq, miRNA-Seq, and WSIs.
Although this paper focuses on lung cancer, the idea itself remains effective when applied to a large number of tumor types (multi-scale, multi-modality). The architecture of the deep learning model requires minimal pre-processing for multi-layer networks [20]. It exploits the spatial relationships within the data to reduce the number of dimensions that need to be learned [15]. This approach speeds up the learning process while yielding more precise results.
Therefore, applying deep learning techniques to NGS data is a suitable solution to the problems of complexity and high dimensionality. Convolutional Neural Networks are used in this work for both slide images and molecular sources (RNA-Seq (mRNA) and miRNA-Seq) to deal with the high-dimensionality problem.
The remainder of this paper is organized as follows. The related works section reviews several past efforts to highlight the importance of reaching an accurate classification for this type of cancer. The methodology section discusses the dataset used and the preprocessing performed. The implementation and discussion section gives a detailed explanation of the convolutional neural network approach. The results are then presented and compared with those previously obtained by other researchers. Finally, possible directions for future work are given in the future work section.
II. RELATED WORKS
In recent years, there has been significant interest in using machine learning models with biological data to aid in the diagnosis and prognosis of cancer patients, particularly in the context of lung cancer. Gene expression data has been widely explored for lung cancer classification, with studies achieving high accuracy rates. For instance, Smolander et al. achieved a 95.97% accuracy in distinguishing LUAD from control samples using deep learning models applied to coding RNA data [21]. Similarly, Fan et al. used support vector machines (SVMs) with a 12-gene signature to achieve a 91% accuracy in the same classification task [22].
Multiclass classification of lung cancer subtypes has also been addressed in the literature. González et al. developed a model for classifying small-cell lung cancer (SCLC), LUAD, LUSC, and large-cell lung carcinoma (LCLC) using differentially expressed genes (DEGs) as input, achieving an accuracy of 88.23% with the random forest (RF) algorithm [23]. Additionally, Castillo-Secilla et al. attained a 95.7% accuracy in subtype classification of non-small cell lung cancer (NSCLC) using random forest [24]. Studies utilizing miRNA-Seq data for lung cancer classification have also been conducted. Ye et al. identified a 10-miRNA signature for distinguishing LUSC from control samples, achieving an F1 score of 99.4% [18].
Deep learning approaches, particularly convolutional neural networks (CNNs), have shown promise in combination with whole-slide imaging (WSI) for NSCLC subtype classification. Coudray et al. utilized CNNs with tiles extracted from WSI to classify LUAD, LUSC, and control samples, achieving an impressive AUC score of 0.978 [19]. Furthermore, Kanavati et al. used transfer learning with CNNs on manually labeled images to distinguish lung carcinoma from control samples, obtaining an AUC score of 0.988 [25].
Finally, hybrid approaches combining deep learning with traditional statistics have been explored. Graham et al. utilized tiles extracted from images along with summary statistics to classify LUAD, control, and LUSC samples, achieving an accuracy of 81% [26].
III. METHODOLOGY
This section discusses the data preparation and pre-processing methodology. Data is collected from the Genomic Data Commons (GDC) portal, and different preprocessing techniques are used to prepare the genomic data and slide images for the training phase.
A. DATA ACQUISITION AND PRE-PROCESSING
In this work two molecular modalities and one imaging modality are considered: RNA-Seq, miRNA-Seq and Whole Slide Images (WSIs). The data were collected from The Cancer Genome Atlas (TCGA) program [27], which is easily accessible from the Genomic Data Commons (GDC) portal. GDC is the National Cancer Institute’s (NCI) data sharing platform. It contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including TCGA. The data provided covers real-world scenarios, which is important when dealing with such a fatal disease.
WSI images were downloaded as a zip file, and the SVS files were extracted for the pre-processing phase. For the molecular data (RNA-Seq and miRNA-Seq), the TCGAbiolinks Bioconductor R package was used to download the data and extract the cases with the corresponding expressed genes for pre-processing. STAR counts from the TCGA project are used for analysis.
To avoid a small test set or data imbalance, no anomaly detection is used; instead, stratified k-fold cross-validation is used to obtain unbiased results. Stratified k-fold cross-validation ensures the same class proportions across the splits.
The molecular models perform binary classification, where the patient sample is either Tumor or Normal, while the WSI model performs three-class classification (LUAD, LUSC, Control). The data sample of each patient can be fed to these individual models for accurate diagnosis. First, the sample is tested on the RNA-Seq (mRNA) model and the miRNA-Seq model to be classified as normal or tumor. The Whole Slide Image then goes for further classification as LUAD, LUSC or Normal to decide the appropriate treatment plan.
Label encoding is performed using ‘LabelEncoder’ from scikit-learn to convert categorical labels into numerical values. The data is split into separate folders for training and validation to make sure that our model is never trained and tested on the same set of tiles. Pixel values are normalized to the range between 0 and 1. This normalization helps the neural network converge faster and perform better during training. No data augmentation is performed.
WSI images are so large that they cannot be fed directly to a neural network. Therefore, a magnification factor of 20x was used to obtain non-overlapping tiles of 512 × 512 pixels. Depending on the original WSI image, tens to thousands of tiles are generated, as in the literature [19].
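The stratified k-fold cross-validation mentioned in this subsection can be illustrated with a minimal pure-Python sketch. It is not the authors' code: the function name `stratified_kfold`, the round-robin assignment without shuffling, and the toy labels are assumptions made for illustration. scikit-learn provides `StratifiedKFold` for practical use, although the text does not state which implementation was used.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_kfold(labels: Sequence[str], k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, validation_indices) pairs with balanced classes."""
    if k < 2:
        raise ValueError("k must be at least 2")
    # Group sample indices by class label.
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Deal each class round-robin across the folds so that every fold keeps
    # roughly the same class proportions as the full dataset.
    # Assumption: no shuffling, to keep the sketch deterministic.
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    # Each fold serves once as the validation set; the rest is training data.
    splits = []
    for i in range(k):
        validation = sorted(folds[i])
        train = sorted(idx for j in range(k) if j != i for idx in folds[j])
        splits.append((train, validation))
    return splits


if __name__ == "__main__":
    # Hypothetical labels: 8 Tumor and 4 Normal samples.
    labels = ["Tumor"] * 8 + ["Normal"] * 4
    for train, validation in stratified_kfold(labels, k=4):
        counts = {c: sum(labels[i] == c for i in validation) for c in ("Tumor", "Normal")}
        print(validation, counts)
```

With the toy labels, every validation fold holds two Tumor samples and one Normal sample, which matches the 2:1 ratio of the full set.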
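The WSI preprocessing steps described above (non-overlapping 512 × 512 tiles, pixel normalization to the range 0 to 1, and label encoding) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: the helper names are hypothetical, partial tiles at the slide borders are assumed to be discarded, pixels are assumed to be 8-bit, and `encode_labels` only imitates the sorted class-to-integer mapping of scikit-learn's ‘LabelEncoder’. Reading the actual SVS files would need a slide-reading library, which is not shown.

```python
from typing import Iterator, List, Sequence, Tuple

TILE_SIZE = 512  # tile edge in pixels, as stated in the text


def tile_origins(width: int, height: int, tile: int = TILE_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (x, y) corner of every full, non-overlapping tile."""
    # Stepping by the tile size guarantees that tiles never overlap.
    # Assumption: partial tiles at the right and bottom edges are skipped.
    for y in range(0, height - tile + 1, tile):
        for x in range(0, width - tile + 1, tile):
            yield (x, y)


def normalize_pixels(pixels: Sequence[int]) -> List[float]:
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return [p / 255.0 for p in pixels]


def encode_labels(labels: Sequence[str]) -> Tuple[List[int], List[str]]:
    """Map class names to integers in sorted class order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


if __name__ == "__main__":
    # Hypothetical slide region of 2000 x 1100 pixels at 20x magnification.
    origins = list(tile_origins(2000, 1100))
    print(len(origins), origins[:4])
    print(normalize_pixels([0, 128, 255]))
    print(encode_labels(["LUSC", "Control", "LUAD", "LUSC"]))
```

For the hypothetical 2000 × 1100 region the grid contains three columns and two rows, so six tiles are produced. A full-size slide yields far more, in line with the tens to thousands of tiles reported in the text.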
FIGURE 6. WSI confusion matrix.
FIGURE 8. WSI training and validation loss curve.
FIGURE 10. mRNA model accuracy.
FIGURE 12. mRNA model confusion matrix.
FIGURE 14. miRNA model accuracy for the training and validation.
V. RESULTS
The results obtained by each modality are shown in Table 4. As observed, the best results are achieved in the following order: miRNA-Seq, RNA-Seq, WSI. Binary classification is performed on both molecular data types, classifying each sample as tumor or normal. Based on the results of the binary classification, the samples go through one more classification step on WSI images for more precise results. The WSI model confirms the result if the sample is normal. Otherwise, if the sample is classified as tumor, it is further classified as either LUAD or LUSC. This method helps in determining the right treatment plan.
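The two-stage decision flow described above can be sketched as a small rule-based function. The text does not specify how the two molecular predictions are combined or what happens when the models disagree, so the logical OR over the molecular outputs, the "Needs review" outcome, and all names in the sketch (`SamplePredictions`, `diagnose`) are assumptions made for illustration only.

```python
from dataclasses import dataclass

WSI_CLASSES = ("LUAD", "LUSC", "Control")


@dataclass(frozen=True)
class SamplePredictions:
    mrna_tumor: bool   # RNA-Seq (mRNA) model output: True means Tumor
    mirna_tumor: bool  # miRNA-Seq model output: True means Tumor
    wsi_class: str     # WSI model output: "LUAD", "LUSC" or "Control"


def diagnose(pred: SamplePredictions) -> str:
    """Combine the molecular and WSI predictions into one outcome."""
    if pred.wsi_class not in WSI_CLASSES:
        raise ValueError(f"unknown WSI class: {pred.wsi_class!r}")
    # Assumption: the two molecular votes are combined with a logical OR.
    molecular_tumor = pred.mrna_tumor or pred.mirna_tumor
    if not molecular_tumor:
        # The WSI model is used to confirm a Normal call.
        return "Normal" if pred.wsi_class == "Control" else "Needs review"
    # Tumor call: the WSI model decides between the two subtypes.
    if pred.wsi_class in ("LUAD", "LUSC"):
        return pred.wsi_class
    # Assumption: conflicting molecular and WSI outputs are flagged.
    return "Needs review"


if __name__ == "__main__":
    print(diagnose(SamplePredictions(True, True, "LUAD")))
    print(diagnose(SamplePredictions(False, False, "Control")))
    print(diagnose(SamplePredictions(True, False, "Control")))
```

Keeping the combination rule in one place makes it easy to swap the OR for a stricter rule, such as requiring both molecular models to agree, once the intended behavior is known.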
FIGURE 16. miRNA model confusion matrix.
VII. CONCLUSION
To summarize, deep learning using convolutional neural networks has already proven its great impact on the bioinformatics field and is expected to lead the next revolution in detecting rare diseases at very early stages. Convolutional neural networks can address the growing problem of high dimensionality, especially in omics data. Moreover, they are already known for accurate classification of high-resolution slide images. Throughout this research, we demonstrated the benefit of using convolutional neural networks in the classification of non-small cell lung cancer using omics data such as mRNA-Seq and miRNA-Seq and images such as Whole Slide Images, achieving higher results than those obtained by previous works.
[3] T. Khorshed, M. N. Moustafa, and A. Rafea, “Deep learning for multi-tissue cancer classification of gene expressions (GeneXNet),” IEEE Access, vol. 8, pp. 90615–90629, 2020, doi: 10.1109/ACCESS.2020.2992907.
[4] Types of Lung Cancer | Cancer Research U.K. Accessed: May 23, 2024. [Online]. Available: https://www.cancerresearchuk.org/about-cancer/lung-cancer/stages-types-grades/types
[5] A. M. Bode and Z. Dong, “Cancer prevention research—Then and now,” Nature Rev. Cancer, vol. 9, no. 7, pp. 508–516, Jun. 2009, doi: 10.1038/nrc2646.
[6] N. Hanna, D. Johnson, S. Temin, S. Baker, J. Brahmer, P. M. Ellis, G. Giaccone, P. J. Hesketh, I. Jaiyesimi, N. B. Leighl, G. J. Riely, J. H. Schiller, B. J. Schneider, T. J. Smith, J. Tashbar, W. A. Biermann, and G. Masters, “Systemic therapy for stage IV non–small-cell lung cancer: American society of clinical oncology clinical practice guideline update,” J. Clin. Oncol., vol. 35, no. 30, pp. 3484–3515, Oct. 2017, doi: 10.1200/jco.2017.74.6065.
[7] Z. Wang, M. Gerstein, and M. Snyder, “RNA-seq: A revolutionary tool for transcriptomics,” Nature Rev. Genet., vol. 10, no. 1, pp. 57–63, Jan. 2009, doi: 10.1038/nrg2484.
[8] J. M. Rizzo and M. J. Buck, “Key principles and clinical applications of ‘next-generation’ DNA sequencing,” Cancer Prevention Res., vol. 5, no. 7, pp. 887–900, Jul. 2012, doi: 10.1158/1940-6207.capr-11-0432.
[9] J. M. Knight, I. Ivanov, K. Triff, R. S. Chapkin, and E. R. Dougherty, “Detecting multivariate gene interactions in RNA-seq data using optimal Bayesian classification,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 15, no. 2, pp. 484–493, Mar. 2018, doi: 10.1109/TCBB.2015.2485223.
[10] S. Sleijfer, J. Bogaerts, and L. L. Siu, “Designing transformative clinical trials in the cancer genome era,” J. Clin. Oncol., vol. 31, no. 15, pp. 1834–1841, May 2013, doi: 10.1200/jco.2012.45.3639.
[11] L. E. MacConaill, “Existing and emerging technologies for tumor genomic profiling,” J. Clin. Oncol., vol. 31, no. 15, pp. 1815–1824, May 2013, doi: 10.1200/jco.2012.46.5948.
[12] The Cancer Genome Atlas Program (TCGA)—NCI. Accessed: May 23, 2024. [Online]. Available: https://www.cancer.gov/ccg/research/genome-sequencing/tcga
[13] K. R. Kukurba and S. B. Montgomery, “RNA sequencing and analysis,” Cold Spring Harbor Protocols, vol. 2015, no. 11, Nov. 2015, Art. no. pdb-top084970, doi: 10.1101/pdb.top084970.
[14] J. E. Dancey, P. L. Bedard, N. Onetto, and T. J. Hudson, “The genetic basis for cancer treatment decisions,” Cell, vol. 148, no. 3, pp. 409–420, Feb. 2012, doi: 10.1016/j.cell.2012.01.014.
[15] C. Peng, X. Wu, W. Yuan, X. Zhang, Y. Zhang, and Y. Li, “MGRFE: Multilayer recursive feature elimination based on an embedded genetic algorithm for cancer classification,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 18, no. 2, pp. 621–632, Mar. 2021, doi: 10.1109/TCBB.2019.2921961.
[16] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015, doi: 10.1038/nature14539.
[17] D. Castillo, J. M. Galvez, L. J. Herrera, F. Rojas, O. Valenzuela, O. Caba, J. Prados, and I. Rojas, “Leukemia multiclass assessment and classification from microarray and RNA-seq technologies integration at gene expression level,” PLoS ONE, vol. 14, no. 2, Feb. 2019, Art. no. e0212127, doi: 10.1371/journal.pone.0212127.
[18] Z. Ye, B. Sun, and Z. Xiao, “Machine learning identifies 10 feature miRNAs for lung squamous cell carcinoma,” Gene, vol. 749, Jul. 2020, Art. no. 144669, doi: 10.1016/j.gene.2020.144669.
[19] N. Coudray, P. S. Ocampo, T. Sakellaropoulos, N. Narula, M. Snuderl, D. Fenyö, A. L. Moreira, N. Razavian, and A. Tsirigos, “Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning,” Nature Med., vol. 24, no. 10, pp. 1559–1567, Oct. 2018, doi: 10.1038/s41591-018-0177-5.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[21] J. Smolander, A. Stupnikov, G. Glazko, M. Dehmer, and F. Emmert-Streib, “Comparing biological information contained in mRNA and non-coding RNAs for classification of lung cancer patients,” BMC Cancer, vol. 19, no. 1, Dec. 2019, Art. no. 1176, doi: 10.1186/s12885-019-6338-1.
[22] Z. Fan, W. Xue, L. Li, C. Zhang, J. Lu, Y. Zhai, Z. Suo, and J. Zhao, “Identification of an early diagnostic biomarker of lung adenocarcinoma based on co-expression similarity and construction of a diagnostic model,” J. Transl. Med., vol. 16, no. 1, Jul. 2018, Art. no. 205, doi: 10.1186/s12967-018-1577-5.
[23] S. González, D. Castillo, J. M. Galvez, I. Rojas, and L. J. Herrera, “Feature selection and assessment of lung cancer sub-types by applying predictive models,” in Proc. Int. Work-Conf. Artif. Neural Netw., in Lecture Notes in Computer Science: Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, vol. 11507, 2019, pp. 883–894, doi: 10.1007/978-3-030-20518-8_73.
[24] D. Castillo-Secilla, J. M. Gálvez, F. Carrillo-Perez, M. Verona-Almeida, D. Redondo-Sánchez, F. M. Ortuno, L. J. Herrera, and I. Rojas, “KnowSeq R-Bioc package: The automatic smart gene expression tool for retrieving relevant biological knowledge,” Comput. Biol. Med., vol. 133, Jun. 2021, Art. no. 104387, doi: 10.1016/j.compbiomed.2021.104387.
[25] F. Kanavati, G. Toyokawa, S. Momosaki, M. Rambeau, Y. Kozuma, F. Shoji, K. Yamazaki, S. Takeo, O. Iizuka, and M. Tsuneki, “Weakly-supervised learning for lung carcinoma classification using deep learning,” Sci. Rep., vol. 10, no. 1, Jun. 2020, Art. no. 9297, doi: 10.1038/s41598-020-66333-x.
[26] S. Graham, M. Shaban, T. Qaiser, N. A. Koohbanani, S. A. Khurram, and N. Rajpoot, “Classification of lung cancer histology images using patch-level summary statistics,” Proc. SPIE, vol. 10581, pp. 327–334, Mar. 2018, doi: 10.1117/12.2293855.
[27] J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. M. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, and J. M. Stuart, “The cancer genome atlas pan-cancer analysis project,” Nature Genet., vol. 45, no. 10, pp. 1113–1120, 2013, doi: 10.1038/ng.2764.
MARIAN MAGDY AMIN received the B.Sc. degree in computer science from The American University in Cairo, in 2014. She is currently pursuing the master’s degree (by research) with Fayoum University, Egypt.
AHMED S. ISMAIL received the master’s degree in computer science and information systems from the Department of Information Systems, Faculty of Computers and Information, Fayoum, Egypt, in 2012, and the Ph.D. degree from the Department of Information Systems, Faculty of Computer Science and Information Systems, Fayoum University, in January 2021. He is an Assistant Professor. He has authored/co-authored several scientific research papers in various technical fields, such as semantic web, data science, big data, the Internet of Things, and blockchain.
MASOUD E. SHAHEEN received the B.Sc. degree in science from the Department of Mathematics and Computer Science, Minia University, in 1996, the M.S. degree in computer science from the Faculty of Science, Fayoum University, Egypt, in 2005, and the Ph.D. degree in computer science from The University of Southern Mississippi, Hattiesburg, MS, USA, in 2013. He was the Vice-Dean of postgraduate studies and research with the Faculty of Computers and AI, Fayoum University, from May 2021 to December 2022, where he is an Associate Professor with the Computer Science Department. He is the Project Portal Manager with Fayoum University. He is also the Vice-Dean for Community Service and Environmental Development Affairs.