A Microarray Data Pre-processing Method for Cancer Classification
INTERNATIONAL JOURNAL
ON INFORMATICS VISUALIZATION
journal homepage : www.joiv.org/index.php/joiv
Abstract—The development of microarray technology has led to significant improvements and research in various fields. With the help
of machine learning techniques and statistical methods, it is now possible to organize, analyze, and interpret large amounts of biological
data to uncover significant patterns of interest. Exploiting microarray data remains a great challenge for many researchers. Raw
gene expression data are usually vulnerable to missing values, noise, incompleteness, and inconsistency. Hence, pre-processing
the data before applying them to cancer classification is important. In order to extract the biological significance of microarray gene
expression data, data pre-processing is a necessary step to obtain valuable information for further analysis and address important
hypotheses. This study presents a detailed description of a data pre-processing method for cancer classification. The proposed method
consists of three phases: data cleaning, transformation, and filtering. A combination of the GenePattern software tool and RStudio was
utilized to implement the proposed data pre-processing method. The proposed method was applied to six gene expression datasets: lung
cancer dataset, stomach cancer dataset, liver cancer dataset, kidney cancer dataset, thyroid cancer dataset, and breast cancer dataset
to demonstrate the feasibility of the proposed method for cancer classification. A comparison has been made to illustrate the differences
between the dataset before and after data pre-processing.
Manuscript received 15 Jan. 2022; revised 29 Apr. 2022; accepted 12 Oct. 2022. Date of publication 31 Dec. 2022.
International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.
Hence, data pre-processing is a mandatory procedure to undergo before the dataset can be applied to other mainstream research algorithms [6].
The structure of the paper is arranged as follows. Section 2 provides details about the use of the gene expression dataset and its information, followed by the method to pre-process the dataset. Section 3 presents the outcome of pre-processed data, and a comparison will be made to showcase the difference before and after pre-processing of the dataset. Section 4 provides a concluding summary before ending this research paper.

II. MATERIALS AND METHODS
Data pre-processing involves preparing and transforming the dataset into a clean and useful format. It aims to remove irrelevant and missing data, normalize data, reduce the size of data, and extract features for data [7]. This section will explain the materials and methodology applied in this study. Gene expression dataset and the available pre-processing software tool will be introduced for further data modeling in cancer classification. The proposed data pre-processing method will be described thoroughly in this section.

A. Microarray Data
The microarray gene expression dataset is the dataset obtained from microarray technology. These data are deposited in many different databases, from which they can be extracted depending on the research question. Some of the common public microarray databases are the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database, The Cancer Genome Atlas (TCGA) database, the ArrayExpress database, and the Stanford Microarray Database (SMD). Table I shows the descriptions of the mentioned public databases.

TABLE I
MICROARRAY DATABASES
Microarray Databases                 Descriptions
Gene Expression Omnibus - NCBI       Gene Expression Omnibus - NCBI is an international public repository that stores and distributes high-throughput gene expression and other functional genomics datasets [9].
The Cancer Genome Atlas (TCGA)       The Cancer Genome Atlas is a public free-access database that catalogs a collection of different cancers' expression data [10].
ArrayExpress                         ArrayExpress is an open-source microarray database storing and providing access to high-throughput functional genomics data [11].
Stanford Microarray Database (SMD)   Stanford Microarray Database stores raw and normalized data from microarray experiments and is made available to researchers for applications [12].

Microarray data are stored in the format of large matrices of gene expression levels. The rows represent the genes, and the columns represent the experimental conditions or samples [13]. Two types of profiles are exhibited in the microarray matrix structure: the gene profile and the array profile. A gene profile is the expression values of a single gene in a variety of samples or conditions [13]. In comparison, an array profile is the expression values of many genes in one sample or condition.
In this study, six datasets were obtained from the NCBI GEO database: the lung cancer dataset [27], stomach cancer dataset [28], liver cancer dataset [29], kidney cancer dataset [30], thyroid cancer dataset [31], and breast cancer dataset [32]. Table II presents the details of the selected cancer datasets.

TABLE II
GENE EXPRESSION DATASETS
Cancer    GEO ID     Platform ID   Number of Cancerous Samples   Number of Normal Samples
Lung      GSE10072   GPL96         58                            49
Stomach   GSE13911   GPL570        38                            31
Liver     GSE17856   GPL6480       43                            44
Kidney    GSE15641   GPL96         69                            23
Thyroid   GSE33630   GPL570        60                            45
Breast    GSE3494    GPL96         60                            176

The range of sample identification (ID) for each cancer dataset is shown in Table III.

TABLE III
SAMPLES ID OF GENE EXPRESSION DATASETS
Cancer    GEO ID     Platform ID   Range of Samples ID     Total Number of Samples
Lung      GSE10072   GPL96         GSM254625-GSM254621     107
Stomach   GSE13911   GPL570        GSM350411-GSM350479     69
Liver     GSE17856   GPL6480       GSM446165-GSM446251     87
Kidney    GSE15641   GPL96         GSM391107-GSM391198     92
Thyroid   GSE33630   GPL570        GSM831749-GSM831853     105
Breast    GSE3494    GPL96         GSM79114-GSM79615       236

B. GenePattern
GenePattern is an open-source software package that provides access to various computational methods used to analyze genomic data [14]. It aims to provide four important functionalities, which are accessibility, reproducibility, extensibility, and multiple interfaces [15]. From the perspective of accessibility, GenePattern provides access to over two hundred genomic analysis tools for researchers to develop, capture, and reproduce genomic analysis methodologies. These genomic analysis tools (referred to as “modules”) in the GenePattern module repository allow for the analysis and visualization of microarray, Single Nucleotide Polymorphism (SNP), proteomic, and sequence data [14].
GenePattern ensures the reproducibility of analysis methods and results by capturing the source of the data and analytic methods [14]. It provides automated history and provenance tracking (along with methods applied and parameter settings) for users to share and reproduce a computational analysis [15]. In addition, GenePattern facilitates simple creation and integration that allows users to
import their methods and code for sharing. Multiple interfaces are made available to a broad range of users, such as web browsers, applications, and programmatic interfaces, so that users can run analyses without any programming through a point-and-click user interface. Fig. 1 shows the web interface of GenePattern.

... steps on a GCT input file to produce filtered, pre-processed gene expression data. There are four main module parameter settings: thresholding/ceiling, variation filtering, normalization, and log2 transform. The threshold filtering parameter removes a gene whose expression profile contains insufficient values greater than a specified threshold. Floor and ceiling values can be set manually by users. The variation filtering parameter removes a gene if the variation of its expression values across the samples does not meet a minimum threshold. Row normalization or log2 transform parameters remove systematic variation of gene expression values between microarray experiments. Fig. 3 presents a screenshot of the PreprocessDataset module interface available on GenePattern.
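The parameter settings described above can be illustrated with a minimal Python sketch. It is not the GenePattern implementation: the function name `preprocess_matrix`, the order of the steps, the use of a max/min ratio and a max-min difference as the variation measure, and all default values are assumptions chosen for illustration only.

```python
import math


def preprocess_matrix(rows, threshold=20.0, min_samples=1,
                      floor=20.0, ceiling=20000.0,
                      min_fold=3.0, min_delta=100.0, log2=True):
    """Filter and transform a genes-by-samples matrix (hypothetical helper).

    rows: list of lists, one inner list of expression values per gene.
    Returns (processed_rows, kept_row_indices).
    """
    if floor <= 0:
        raise ValueError("floor must be positive for ratios and log2")
    processed, kept = [], []
    for index, row in enumerate(rows):
        # Threshold filter: drop genes with too few values above threshold.
        if sum(1 for v in row if v > threshold) < min_samples:
            continue
        # Floor and ceiling: clamp every value into [floor, ceiling].
        clipped = [min(max(float(v), floor), ceiling) for v in row]
        # Variation filter (assumed measure): require a minimum max/min
        # ratio and a minimum max-min difference across the samples.
        low, high = min(clipped), max(clipped)
        if high / low < min_fold or high - low < min_delta:
            continue
        # Optional log2 transform of the surviving values.
        if log2:
            clipped = [math.log2(v) for v in clipped]
        processed.append(clipped)
        kept.append(index)
    return processed, kept


if __name__ == "__main__":
    demo = [[10, 50000, 100], [100, 110, 120], [20, 200, 2000], [5, 8, 12]]
    values, kept_rows = preprocess_matrix(demo)
    print(kept_rows)
```

In this toy input, the second gene is nearly constant across samples and the fourth never exceeds the threshold, so only the first and third rows survive the filters.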
[Figure: Raw gene expression dataset; ZIP of CEL files]
[Fig. 5 flowchart: Gene expression dataset -> Data cleaning (remove unwanted attributes, missing values, and imputation) -> Data transformation (normalisation) -> Data filtering (average) -> Cleaned gene expression dataset]
Fig. 5 Data Pre-processing of Gene Expression Dataset
This raw gene expression data file contains abundant information extracted from the cell [18]. In order to generate a GCT file for data pre-processing, a ZIP package of the CEL files downloaded from the database was created for use by the GenePattern modules in the next step. Then, the created ZIP package of CEL files was uploaded to the AffySTExpressionFileCreator module for processing. The module's normalization and background correction parameters were set to 'no' to extract the raw dataset. Other parameters were left at their default behavior to obtain a matrix containing one intensity value per probe set per sample in the GCT file format [16]. The analysis module outputs a GCT file of the gene expression dataset that a computer can process for further data pre-processing.

D. Data Pre-processing
Microarray experiments produce huge amounts of data, and systematic pre-processing methods are required to extract meaningful expression relations [13]. The microarray data collected from a single experiment could amount to tens of thousands of data points for thousands of genes [13]. These data represent the key information for responding to crucial biological questions and hypotheses. In order to enhance the reliability of data, it is necessary to apply pre-processing techniques to extract accurate data.
After completing the pre-analysis of gene expression data, the actual data pre-processing starts. In this study, data pre-processing involves three phases: Phase 1: data cleaning, Phase 2: data transformation, and Phase 3: data filtering. Fig. 5 demonstrates the phases in data pre-processing. Data cleaning is the first step in microarray data pre-processing, which aims to correct or remove inaccurate, damaged, improperly formatted, duplicate, or incomplete data from a dataset. These dirty data will affect the mining procedure and lead to unreliable and poor output [7]. First, unwanted and empty values of attributes were removed. The unwanted attributes include patient biological information, dataset information, and dataset descriptions not applicable to cancer classification.
In comparison, empty values of attributes refer to the missing values that appeared across the rows in the gene expression dataset. Missing values occurred due to different factors, such as the corruption of the image, insufficient resolution, dust or scratches on the slide, and the robotic methods used to create the arrays [13]. Then, rows with incomplete attributes or noisy data values were imputed with mean values to resolve inconsistencies in data. Noisy data are random errors generated by faulty data collection or data entry errors. The mean imputation method was implemented to fill the missing data elements in the gene expression dataset without reducing the sample size [20]. It creates a complete gene expression data matrix for further data analysis using classification algorithms. Finally, the attributes were rearranged into the required format before proceeding to the next phase. Fig. 6 shows the details of phase 1 in microarray data pre-processing.

[Fig. 6 diagram: Data Cleaning. Unwanted attributes: patient biological info, dataset information, dataset description. Missing value: empty values of attributes, incomplete values, mean imputation. Attribute arrangement: restructure according to format.]
Fig. 6 Phase 1 of Data Pre-processing

In the data transformation phase, the PreprocessDataset module in GenePattern was applied to normalize gene expression data. This step aims to tune the data into a proper format suitable for analysis and other downstream processes. Fig. 7 depicts the details of phase 2 in microarray data pre-processing.
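The mean imputation step of Phase 1 (data cleaning) can be sketched in Python as follows. The text does not state whether the mean is taken per gene or per sample, so this sketch assumes a per-gene (row-wise) mean; the function name `impute_row_means` and the use of `None` or NaN to mark missing entries are likewise assumptions for illustration, not the authors' RStudio code.

```python
import math


def _is_missing(value):
    """Treat None and NaN as missing expression values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_row_means(rows):
    """Fill missing entries of each gene row with that row's observed mean.

    rows: list of lists, one inner list of expression values per gene.
    Returns a new matrix; the input is left unchanged.
    """
    imputed = []
    for row in rows:
        observed = [float(v) for v in row if not _is_missing(v)]
        if not observed:
            # A fully empty row has no mean; the caller must decide to drop it.
            raise ValueError("row has no observed values to average")
        mean = sum(observed) / len(observed)
        imputed.append([mean if _is_missing(v) else float(v) for v in row])
    return imputed


if __name__ == "__main__":
    matrix = [[1.0, None, 3.0], [2.0, 4.0, float("nan")]]
    print(impute_row_means(matrix))
```

Because every missing cell is filled rather than dropped, the number of samples stays the same, which matches the stated goal of imputing without reducing the sample size.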
[Fig. 7 diagram: Data Transformation; PreprocessDataset module; Threshold/flooring]

After completing the three phases in data pre-processing, the cleaned dataset is now prepared to be used in both the evaluation method and the classifiers. Data pre-processing is essential to build models with this cleaned dataset effectively. This process eliminates inconsistencies or duplicates in data and increases the efficiency and reliability of data for mining procedures.
... displays consistent gene expression levels across samples in the breast cancer dataset.

Fig. 9 Visualization of CEL File for Breast Cancer Dataset

Fig. 10 Visualization of Excel File After Pre-processing

IV. CONCLUSION
The emergence of microarray technology provides solutions to crucial biological problems at the molecular level. It serves various purposes in research and clinical studies to help extract intrinsic patterns or knowledge, which may be useful for uncovering the causes of critical diseases [24, 25, 26]. However, handling the huge amount of microarray data remains an unresolved problem for researchers. Real-world microarray data are often incomplete and noisy and contain missing values, which hinders data mining. Hence, data pre-processing is a mandatory process to facilitate the use of this powerful technology. Integrating three data pre-processing steps provides a solution to minimize the obstacles faced by analysts.
This study proposed a feasible data pre-processing method covering three phases: (1) Data cleaning, (2) Data transformation, and (3) Data filtering. Data cleaning is aimed at removing noisy, missing, and unnecessary data. Data transformation transforms data into an appropriate format and reduces data volume for efficient and effective data mining. The proposed method was applied to six cancer datasets and recorded a decrease in the number of genes after pre-processing. The differences in the number of genes between the original dataset and the cleaned dataset demonstrated the feasibility of the proposed data pre-processing method to generate high-quality data.

ACKNOWLEDGMENT
Universiti Tun Hussein Onn Malaysia funds this paper. The authors appreciate the Malaysia Ministry of Higher Education (MoHE). This research was funded under REGG FASA 1/2021 (VOT NO. H888). This work was also supported by Universiti Teknologi Malaysia under UTM Fundamental Research Grant (UTMFR): Q.J130000.3851.21H94.

REFERENCES
[1] Owzar, K., Barry, W. T., Jung, S. H., Sohn, I., & George, S. L. (2008). Statistical challenges in pre-processing in microarray experiments in cancer. Clinical Cancer Research, 14(19), 5959-5966.
[2] Bharti, S., Krishnan, N., Veyssi, A., Momeni, M., & Raj, S. (2022). sMAP: An interactive microarray data analysis tool for early-stage researchers. bioRxiv.
[3] Herrero, J., Díaz-Uriarte, R., & Dopazo, J. (2003). Gene expression data pre-processing. Bioinformatics, 19(5), 655-656.
[4] García de la Nava, J., van Hijum, S., & Trelles, O. (2003). PreP: gene expression data pre-processing. Bioinformatics, 19(17), 2328-2329.
[5] Deepak Jain. (2021, June 29). Data Preprocessing in Data Mining. Retrieved November 1, 2022, from https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/
[6] Revathy N, Amalraj D. Accurate Cancer Classification Using Expressions of Very Few Genes. International Journal of Computer Applications. 2011;14(4):19-22.
[7] Alasadi, S. A., & Bhaya, W. S. (2017). Review of data pre-processing techniques in data mining. Journal of Engineering and Applied Sciences, 12(16), 4102-4107.
[8] Wikipedia contributors. (2018, June 4). Microarray databases. In Wikipedia, The Free Encyclopedia. Retrieved 01:06, November 1, 2022, from https://en.wikipedia.org/w/index.php?title=Microarray_databases&oldid=844388880.
[9] Clough, E., & Barrett, T. (2016). The gene expression omnibus database. In Statistical genomics (pp. 93-110). Humana Press, New York, NY.
[10] Tomczak, K., Czerwinska, P., & Wiznerowicz, M. (2015). The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Współczesna Onkologia, 19(1A), A68-A77.
[11] Parkinson, H., Kapushesky, M., Shojatalab, M., Abeygunawardena, N., Coulson, R., Farne, A., ... & Brazma, A. (2007). ArrayExpress - a public database of microarray experiments and gene expression profiles. Nucleic Acids Research, 35(suppl_1), D747-D750.
[12] Sarkans, U., Parkinson, H., Lara, G. G., Oezcimen, A., Sharma, A., Abeygunawardena, N., ... & Brazma, A. (2005). The ArrayExpress gene expression database: a software engineering and implementation perspective. Bioinformatics, 21(8), 1495-1501.
[13] Rafii, F., & Rossi, B. D. (2015). Data pre-processing and reducing for microarray data exploration and analysis. International Journal of Computer Applications, 132(16), 20-26.
[14] Kuehn, H., Liberzon, A., Reich, M., & Mesirov, J. P. (2008). Using GenePattern for gene expression analysis. Current Protocols in Bioinformatics, 22(1), 7-12.
[15] Wikipedia contributors. (2021, December 23). GenePattern. In Wikipedia, The Free Encyclopedia. Retrieved 03:17, November 1, 2022, from https://en.wikipedia.org/w/index.php?title=GenePattern&oldid=1061704802.
[16] David Eby, Broad Institute. (n.d.). AffySTExpressionFileCreator (v1) BETA. Retrieved November 1, 2022, from https://www.genepattern.org/modules/docs/AffySTExpressionFileCreator/1.
[17] Joshua Gould, Broad Institute. (n.d.). PreprocessDataset (v5). Retrieved November 1, 2022, from https://genepattern.org/modules/docs/PreprocessDataset/5?print=yes.
[18] Seah, C. S., Kasim, S., Fudzee, M. F., Mohamad, M. S., Saedudin, R. R., Hassan, R., ... & Atan, R. (2018). An effective pre-processing phase for gene expression classification. Indonesian Journal of Electrical Engineering and Computer Science, 11(3), 1223.
[19] Sadhvi Anunaya. (2022, June 20). Data Preprocessing in Data Mining - A Hands On Guide. Retrieved November 2, 2022, from https://www.analyticsvidhya.com/blog/2021/08/data-preprocessing-in-data-mining-a-hands-on-guide/.
[20] Peterson, P. L., Baker, E., & McGaw, B. (2010). International encyclopedia of education. Elsevier Ltd.
[21] Normalization supplement: commentary on the impact of different normalization methodologies on variance distributions at a global and pathway level. Retrieved November 2, 2022, from https://doi.org/10.1371/journal.pgen.1002207.s003.
[22] RStudio Team (2020). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA. URL http://www.rstudio.com/.
[23] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47. doi: 10.1093/nar/gkv007.
[24] Donatin, E., & Drancourt, M. (2012). DNA microarrays for the diagnosis of infectious diseases. Médecine et maladies infectieuses, 42(10), 453-459.
[25] Tzouvelekis, A., Patlakas, G., & Bouros, D. (2004). Application of microarray technology in pulmonary diseases. Respiratory Research, 5(1), 1-18.
[26] Yoo, S. M., Choi, J. H., Lee, S. Y., & Yoo, N. C. (2009). Applications of DNA microarray in disease diagnostics. Journal of Microbiology and Biotechnology, 19(7), 635-646.
[27] Landi MT, Dracheva T, Rotunno M, Figueroa JD et al. Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS One 2008 Feb 20;3(2):e1651. PMID: 18297132.
[28] D'Errico M, de Rinaldis E, Blasi MF, Viti V et al. Genome-wide expression profile of sporadic gastric cancers with microsatellite instability. Eur J Cancer 2009 Feb;45(3):461-9. PMID: 19081245.
[29] Tsuchiya M, Parker JS, Kono H, Matsuda M et al. Gene expression in nontumoral liver tissue and recurrence-free survival in hepatitis C virus-positive hepatocellular carcinoma. Mol Cancer 2010 Apr 9;9:74. PMID: 20380719.
[30] Jones J, Otu H, Spentzos D, Kolia S et al. Gene signatures of progression and metastasis in renal cell cancer. Clin Cancer Res 2005 Aug 15;11(16):5730-9. PMID: 16115910.
[31] Tomás G, Tarabichi M, Gacquer D, Hébrant A et al. A general method to derive robust organ-specific gene expression-based differentiation indices: application to thyroid cancer diagnostic. Oncogene 2012 Oct 11;31(41):4490-8. PMID: 22266856.
[32] Miller LD, Smeds J, George J, Vega VB et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A 2005 Sep 20;102(38):13550-5. PMID: 16141321.