JOIV : Int. J. Inform. Visualization, 6(4) - December 2022 784-790

INTERNATIONAL JOURNAL ON INFORMATICS VISUALIZATION
journal homepage : www.joiv.org/index.php/joiv

A Microarray Data Pre-processing Method for Cancer Classification


Tay Xin Hui a, Shahreen Kasim a,*, Mohd Farhan Md Fudzee a, Zubaile Abdullah a, Rohayanti Hassan b, Aldo Erianda c
a Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja 86400, Johor, Malaysia
b Faculty of Computing, Universiti Teknologi Malaysia, 83100, Johor, Malaysia
c Department of Information Technology, Politeknik Negeri Padang, Sumatera Barat, Indonesia
Corresponding author: *[email protected]

Abstract—The development of microarray technology has led to significant improvements and research in various fields. With the help
of machine learning techniques and statistical methods, it is now possible to organize, analyze, and interpret large amounts of biological
data to uncover significant patterns of interest. Exploiting microarray data remains a great challenge for many researchers. Raw
gene expression data are usually vulnerable to missing values, noise, incompleteness, and inconsistency. Hence, processing the
data before applying them to cancer classification is important. To extract the biological significance of microarray gene
expression data, data pre-processing is a necessary step to obtain valuable information for further analysis and to address important
hypotheses. This study presents a detailed description of a data pre-processing method for cancer classification. The proposed method
consists of three phases: data cleaning, transformation, and filtering. A combination of the GenePattern software tool and RStudio was
used to implement the proposed data pre-processing method. The proposed method was applied to six gene expression datasets (lung,
stomach, liver, kidney, thyroid, and breast cancer) to demonstrate its feasibility for cancer classification. A comparison illustrates
the differences between the datasets before and after data pre-processing.

Keywords—Data pre-processing; microarray data; gene expression data; GenePattern.

Manuscript received 15 Jan. 2022; revised 29 Apr. 2022; accepted 12 Oct. 2022. Date of publication 31 Dec. 2022.
International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

I. INTRODUCTION

DNA microarray technologies allow researchers to measure the expression patterns of thousands of genes under various experimental conditions. This high-throughput technology opens up the possibility of organizing, analyzing, and interpreting biological data to solve biological problems at the molecular level. In general, statistical analysis can be categorized into three types of studies: (a) association studies, which discover relationships between genes of interest or biological pathways; (b) prognostic or prediction studies, which classify patients with respect to clinical endpoints based on molecular markers; and (c) class discovery studies, which discover clusters based on molecular data [1]. The ability to derive biological inferences from microarray data allows researchers to identify key disease pathways and find potential therapeutic targets [2].

Microarray data, or gene expression data, are composed of huge tables with thousands of rows corresponding to the genes or clones present in the DNA array, and several columns, one for each experimental condition measured [3]. Such massive genomic data require practical data pre-processing techniques for their analysis. Effective computational methodologies depend heavily on the quality of the input data. Numerous sources of systematic and random variation are introduced along the various phases of assessing gene expression levels [4]. These variations in expression levels might lead to false positives under certain changing experimental conditions. Thus, applying data pre-processing techniques is important to enhance the quality of results.

Data pre-processing is a data mining technique used to transform raw data into an efficient and useful format. A basic data pre-processing method involves three steps: (a) data cleaning, to remove missing and noisy data; (b) data transformation, to transform data into an appropriate form; and (c) data reduction, to increase storage efficiency and reduce data storage and analysis costs [5]. Unprocessed raw data are susceptible to missing values, noise, outliers, and inconsistency, all of which affect the quality of data mining results.

Hence, data pre-processing is a mandatory procedure before the dataset can be applied to other mainstream research algorithms [6].

The structure of the paper is arranged as follows. Section 2 provides details about the gene expression datasets used, followed by the method to pre-process them. Section 3 presents the outcome of the pre-processed data and compares the datasets before and after pre-processing. Section 4 provides a concluding summary.

II. MATERIALS AND METHODS

Data pre-processing involves preparing and transforming the dataset into a clean and useful format. It aims to remove irrelevant and missing data, normalize data, reduce the size of data, and extract features [7]. This section explains the materials and methodology applied in this study. The gene expression datasets and the available pre-processing software tool are introduced for further data modeling in cancer classification, and the proposed data pre-processing method is described thoroughly.

A. Microarray Data

The microarray gene expression dataset is the dataset obtained from microarray technology. These data are deposited in many different databases, from which they can be extracted depending on the research question. Some of the common public microarray databases are the National Centre for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database, The Cancer Genome Atlas (TCGA) database, the ArrayExpress database, and the Stanford Microarray Database (SMD). Table I shows the descriptions of the mentioned public databases.

TABLE I
MICROARRAY DATABASES
Microarray Databases                  Descriptions
Gene Expression Omnibus - NCBI        Gene Expression Omnibus - NCBI is an international public repository that stores and distributes high-throughput gene expression and other functional genomics datasets [9].
The Cancer Genome Atlas (TCGA)        The Cancer Genome Atlas is a public free-access database that catalogs a collection of different cancers' expression data [10].
ArrayExpress                          ArrayExpress is an open-source microarray database storing and providing access to high-throughput functional genomics data [11].
Stanford Microarray Database (SMD)    Stanford Microarray Database stores raw and normalized data from microarray experiments and is made available to researchers for applications [12].

Microarray data are stored as large matrices of gene expression levels. The rows represent the genes, and the columns represent the experimental conditions or samples under which the genes were measured [13]. Two types of profiles are exhibited in the microarray matrix structure: the gene profile and the array profile. A gene profile is the set of expression values of a single gene across a variety of samples or conditions [13]. In comparison, an array profile is the set of expression values of many genes in one sample or condition.

In this study, six datasets were obtained from the NCBI GEO database: the lung cancer dataset [27], stomach cancer dataset [28], liver cancer dataset [29], kidney cancer dataset [30], thyroid cancer dataset [31], and breast cancer dataset [32]. Table II presents the details of the selected cancer datasets.

TABLE II
GENE EXPRESSION DATASETS
Cancer    GEO ID     Platform ID  Number of Cancerous Samples  Number of Normal Samples
Lung      GSE10072   GPL96        58                           49
Stomach   GSE13911   GPL570       38                           31
Liver     GSE17856   GPL6480      43                           44
Kidney    GSE15641   GPL96        69                           23
Thyroid   GSE33630   GPL570       60                           45
Breast    GSE3494    GPL96        60                           176

The range of sample identification (ID) for each cancer dataset is shown in Table III.

TABLE III
SAMPLES ID OF GENE EXPRESSION DATASETS
Cancer    GEO ID     Platform ID  Range of Samples ID      Total Number of Samples
Lung      GSE10072   GPL96        GSM254625-GSM254621      107
Stomach   GSE13911   GPL570       GSM350411-GSM350479      69
Liver     GSE17856   GPL6480      GSM446165-GSM446251      87
Kidney    GSE15641   GPL96        GSM391107-GSM391198      92
Thyroid   GSE33630   GPL570       GSM831749-GSM831853      105
Breast    GSE3494    GPL96        GSM79114-GSM79615        236

B. GenePattern

GenePattern is an open-source software package that provides access to various computational methods used to analyze genomic data [14]. It aims to provide four important functionalities: accessibility, reproducibility, extensibility, and multiple interfaces [15]. From the perspective of accessibility, GenePattern provides access to over two hundred genomic analysis tools for researchers to develop, capture, and reproduce genomic analysis methodologies. These genomic analysis tools (referred to as “modules”) in the GenePattern module repository allow for the analysis and visualization of microarray, Single Nucleotide Polymorphism (SNP), proteomic, and sequence data [14].

GenePattern ensures the reproducibility of analysis methods and results by capturing the source of the data and analytic methods [14]. It provides automated history and provenance tracking (along with the methods applied and parameter settings) for users to share and reproduce a computational analysis [15]. In addition, GenePattern facilitates simple creation and integration that allows users to import their methods and code for sharing. Multiple interfaces, such as web browsers, applications, and programmatic interfaces, are made available to a broad range of users, including a point-and-click user interface for analysis without any programming. Fig. 1 shows the web interface of GenePattern.

Fig. 1 The Web Interface of GenePattern

This study utilizes two GenePattern modules, the AffySTExpressionFileCreator module and the PreprocessDataset module. The AffySTExpressionFileCreator module creates a Gene Cluster Text (GCT) file from a set of CEL files (Affymetrix Probe Results File) from Affymetrix ST arrays [16]. This module transforms a gene expression data file (.CEL) into a computer-readable tab-delimited text file (.GCT) so that matrix-compatible gene expression datasets can be analyzed. Six parameter settings are available in this module: input file, normalize, background correct, clm file, annotate probes, and output file base [16]. The input file parameter accepts one or more Affymetrix ST CEL files for analysis. The normalize parameter allows users to normalize data using quantile normalization. Background correction aims to remove geographical biases in fluorescent intensity. The clm file is a tab-delimited text file containing one scan, sample, and class per line. The annotate probes parameter annotates rows with the gene symbol and description. The output file base parameter sets the base name of the output file. Fig. 2 shows a screenshot of the AffySTExpressionFileCreator module interface available on GenePattern.

Fig. 2 The Interface of AffySTExpressionFileCreator Module

On the other hand, the PreprocessDataset module provides a variety of pre-processing operations which aim to remove platform noise and genes that have little variation so that subsequent analysis can identify interesting variations, such as the differential expression between tumor and normal tissue [17]. This module performs several pre-processing steps on a GCT input file to produce filtered, pre-processed gene expression data. There are four main module parameter settings: thresholding/ceiling, variation filtering, normalization, and log2 transform. The threshold filtering parameter removes a gene whose expression profile contains insufficient values greater than a specified threshold. Floor and ceiling values can be set manually by users. The variation filtering parameter removes a gene if the variation of its expression values across the samples does not meet a minimum threshold. The row normalization and log2 transform parameters remove systematic variation of gene expression values between microarray experiments. Fig. 3 presents a screenshot of the PreprocessDataset module interface available on GenePattern.

Fig. 3 The Interface of PreprocessDataset Module

C. Pre-analysis

Pre-analysis of gene expression data aims to generate a computer-readable tab-delimited text file (.GCT) for data pre-processing. Fig. 4 illustrates the pre-analysis of the gene expression dataset. The pre-analysis conducts the following steps in order.
1. Download microarray gene expression dataset from microarray database
2. Create a ZIP package of CEL files for the usage of GenePattern modules
3. Run module in GenePattern
4. Output GCT files of gene expression datasets for further data pre-processing

[Diagram: Raw gene expression dataset -> ZIP of CEL files -> Run module in GenePattern -> GCT files for data pre-processing]
Fig. 4 Pre-analysis of Gene Expression Dataset
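Because the GCT output of the pre-analysis is plain tab-delimited text, its structure can be checked with a few lines of code before pre-processing. The sketch below is an illustration in Python, not part of the workflow described in this paper. It assumes the GCT v1.2 layout (a `#1.2` version line, a line with the row and column counts, a header row beginning with `Name` and `Description`, then one row per probe). The helper name `parse_gct` and the sample values are hypothetical.

```python
import io


def parse_gct(handle):
    """Parse a GCT v1.2 text stream into (sample_names, {probe_id: [values]})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unsupported GCT version line: {version!r}")
    # Second line holds the number of data rows and the number of samples.
    n_rows, n_cols = (int(x) for x in handle.readline().split())
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]  # the first two columns are Name and Description
    if len(samples) != n_cols:
        raise ValueError("sample count does not match the dimension line")
    matrix = {}
    for line in handle:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Empty cells are kept as None so that later cleaning can impute them.
        matrix[fields[0]] = [float(v) if v else None for v in fields[2:]]
    if len(matrix) != n_rows:
        raise ValueError("row count does not match the dimension line")
    return samples, matrix


# Hypothetical two-probe, three-sample file used only for illustration.
example = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "probe_a\tna\t10.5\t20.0\t30.25\n"
    "probe_b\tna\t5.0\t\t7.5\n"
)
samples, matrix = parse_gct(io.StringIO(example))
print(samples)
print(matrix)
```

Keeping empty cells as `None` instead of dropping the row preserves the matrix shape, which matters when missing values are to be imputed rather than discarded.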

[Diagram: Gene expression dataset -> Data cleaning (remove unwanted attributes, missing values, and imputation) -> Data transformation (normalisation) -> Data filtering (average) -> Cleaned gene expression dataset]
Fig. 5 Data Pre-processing of Gene Expression Dataset

This raw gene expression data file contains abundant information extracted from the cell [18]. In order to generate a GCT file for data pre-processing, a ZIP package of the CEL files downloaded from the database is created for use by the GenePattern modules in the next step. Then, the ZIP package of CEL files was uploaded to the AffySTExpressionFileCreator module for processing. The module's normalize and background correct parameters were set to 'no' to extract the raw dataset. Other parameters were left at their default behavior to obtain a matrix containing one intensity value per probe set per sample in the GCT file format [16]. The analysis module outputs a GCT file of the gene expression dataset that a computer can process for further data pre-processing.

D. Data Pre-processing

Microarray experiments produce huge amounts of data, and systematic pre-processing methods are required to extract meaningful expression relations [13]. The microarray data collected from a single experiment can amount to tens of thousands of data points for thousands of genes [13]. These data represent the key information for responding to crucial biological questions and hypotheses. In order to enhance the reliability of the data, it is necessary to apply pre-processing techniques to extract accurate data.

After completing the pre-analysis of gene expression data, the actual data pre-processing starts. In this study, data pre-processing involves three phases: Phase 1: data cleaning, Phase 2: data transformation, and Phase 3: data filtering. Fig. 5 demonstrates the phases in data pre-processing. Data cleaning is the first step in microarray data pre-processing, which aims to correct or remove inaccurate, damaged, improperly formatted, duplicate, or incomplete data from a dataset. Such dirty data affect the mining procedure and lead to unreliable and poor output [7]. First, unwanted attributes and empty attribute values were removed. The unwanted attributes include patient biological information, dataset information, and dataset descriptions not applicable to cancer classification.

In comparison, empty values of attributes refer to the missing values that appear across the rows in the gene expression dataset. Missing values occur due to different factors, such as corruption of the image, insufficient resolution, dust or scratches on the slide, and the robotic methods used to create the arrays [13]. Then, rows with incomplete attributes or noisy data values were imputed with mean values to resolve inconsistencies in the data. Noisy data are random errors generated by faulty data collection or data entry errors. The mean imputation method was implemented to fill the missing data elements in the gene expression dataset without reducing the sample size [20]. It creates a complete gene expression data matrix for further data analysis using classification algorithms. In addition, the data were rearranged before proceeding to the next phase. Fig. 6 shows the details of phase 1 in microarray data pre-processing.

[Diagram: Data Cleaning
  Unwanted attributes: Patient biological info; Dataset information; Dataset description
  Missing value: Empty values of attributes; Incomplete values; Mean imputation
  Attribute arrangement: Restructure according to format]
Fig. 6 Phase 1 of Data Pre-processing

In the data transformation phase, the PreprocessDataset module in GenePattern was applied to normalize the gene expression data. This step aims to tune the data into a proper format suitable for analysis and other downstream processes. Fig. 7 depicts the details of phase 2 in microarray data pre-processing.
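The mean imputation step of Phase 1 can be sketched as follows. This is a minimal Python illustration and not the implementation used in this study. It assumes missing entries are encoded as `None` and that each gene (row) is filled with the mean of its own observed values, a choice the text does not specify. The function name `impute_row_means` and the sample data are hypothetical.

```python
def impute_row_means(matrix):
    """Return a copy of {gene: [values]} with None entries replaced by the row mean."""
    completed = {}
    for gene, values in matrix.items():
        observed = [v for v in values if v is not None]
        if not observed:
            # A row with no observed values has no mean to impute; drop it.
            continue
        row_mean = sum(observed) / len(observed)
        completed[gene] = [row_mean if v is None else v for v in values]
    return completed


# Hypothetical expression values used only for illustration.
raw = {
    "gene_a": [2.0, None, 4.0],
    "gene_b": [1.0, 1.0, 1.0],
    "gene_c": [None, None, None],
}
clean = impute_row_means(raw)
print(clean)
```

Every retained row keeps one value per sample, so the number of samples is unchanged, in line with the stated goal of filling missing elements without reducing the sample size.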

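The operations applied in the transformation phase (thresholding, log2 transformation, and row normalization) can be sketched for a single gene as below. This is an illustrative Python sketch, not the GenePattern module's code. The floor and ceiling values are assumptions chosen for the example, and row normalization is implemented as min-max scaling to the 0.0 to 1.0 range, which is one reading of the range scaling this paper attributes to row normalization and may differ from the module's actual behavior. The function name `transform_row` is hypothetical.

```python
import math


def transform_row(values, floor=20.0, ceiling=20000.0):
    """Clip to [floor, ceiling], take log2, then min-max scale the row to [0, 1]."""
    # Thresholding: values below the floor or above the ceiling are clipped.
    clipped = [min(max(v, floor), ceiling) for v in values]
    logged = [math.log2(v) for v in clipped]
    low, high = min(logged), max(logged)
    if high == low:
        # A constant row carries no variation; map it to zeros.
        return [0.0 for _ in logged]
    return [(v - low) / (high - low) for v in logged]


# Hypothetical intensities for one gene across four samples.
row = [5.0, 80.0, 320.0, 50000.0]
scaled = transform_row(row, floor=20.0, ceiling=1280.0)
print(scaled)
```

Clipping before the logarithm keeps very low intensities from producing large negative log values, which is the usual motivation for applying a floor first.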
[Diagram: Data Transformation -> PreprocessDataset module -> Threshold/flooring -> Row normalization -> Log2 transformation]
Fig. 7 Phase 2 of Data Pre-processing

The cleaned gene expression data were input into the PreprocessDataset module for pre-processing. All parameter settings were left at their defaults, except that row normalization and log2 transform were enabled to normalize each gene's expression values across all samples. This module applies a series of data transformations intended to aid in comparing gene expression data gathered across a series of hybridizations [21]. These include applying intensity thresholds or flooring to eliminate poorly detected probes and improve signal-to-noise sensitivity. Log transformation normalizes the distribution of probes across the experiment's intensity range. Row normalization scales the data into a specific range, from -1.0 to 1.0 or from 0.0 to 1.0. Thresholding, scaling, and log transforming the data reduce variance between samples and are useful for data mining techniques like cancer classification [7].

The last phase in data pre-processing is data filtering. This step aims to reduce the huge dataset volume while maintaining the original dataset's integrity. Fig. 8 presents the details of phase 3 in microarray data pre-processing.

[Diagram: Data Filtering -> R limma package -> avereps (Average Over Irregular Replicate Probes)]
Fig. 8 Phase 3 of Data Pre-processing

Data filtering was conducted in RStudio using the R programming language [22]. The limma package is an R package for data analysis, linear models, and differential expression of microarray data [23]. The limma package was downloaded and imported in RStudio for data pre-processing. The “avereps” (Average Over Irregular Replicate Probes) function in the limma package was utilized for data reduction. It works by condensing the microarray data object so that values for within-array replicate probes are replaced with their average [23]. This method preserves highly relevant attributes and discards redundant features to reduce the size of the gene expression dataset.

After completing the three phases in data pre-processing, the cleaned dataset is ready to be used in both the evaluation method and the classifiers. Data pre-processing is essential to build models with this cleaned dataset effectively. This process eliminates inconsistencies or duplicates in the data and increases the efficiency and reliability of the data for mining procedures.

III. RESULT AND DISCUSSION

This study used six datasets to perform the proposed data pre-processing method. Table IV shows the number of genes after data pre-processing.

TABLE IV
GENE EXPRESSION DATASETS AFTER PRE-PROCESSING
Cancer    GEO ID     Number of Genes (Raw Dataset)  Number of Genes (Cleaned Dataset)  Number of Removed Genes
Lung      GSE10072   22283                          12986                              9297
Stomach   GSE13911   54675                          12419                              42256
Liver     GSE17856   25075                          13802                              11273
Kidney    GSE15641   22283                          11593                              10690
Thyroid   GSE33630   54675                          12986                              41689
Breast    GSE3494    22283                          12986                              9297
Total Number of Removed Genes                                                          124502

Based on the results in Table IV, the proposed data pre-processing method removed a total of 124502 genes across the six datasets. For the lung and breast cancer datasets, each raw dataset contains 22283 genes, and each cleaned dataset retains 12986 genes after pre-processing, so 9297 genes were removed from each before further data modeling. The stomach cancer dataset originally consisted of 54675 genes, and the number was reduced to 12419 genes, with a total of 42256 genes removed. The liver cancer dataset contains 25075 genes before pre-processing, and the number of genes decreased to 13802, with a total of 11273 genes removed. In addition, the proposed data pre-processing method eliminated a total of 10690 genes from the kidney cancer dataset, reducing the number of genes from 22283 in the raw dataset to 11593 in the cleaned dataset. For the thyroid cancer dataset, 41689 genes were removed from the raw gene expression dataset, reducing the number of genes from 54675 to 12986.

In order to depict the differences between the original raw gene expression dataset and the pre-processed dataset, the breast cancer dataset GSE3494 was used as an example to compare and visualize the attribute variation. Fig. 9 illustrates the raw CEL file for the breast cancer dataset, and Fig. 10 shows the cleaned Excel file for the breast cancer dataset after pre-processing. By visualizing the two formats of the dataset, the differences in content presentation can be observed directly to demonstrate the feasibility of the proposed data pre-processing method.

Based on Fig. 9, the raw breast cancer dataset contains rows of unwanted information irrelevant to data analysis and modeling. This unwanted information includes dataset information, the number of attributes contained, and the dataset header and footer. On the other hand, Fig. 10 shows the cleaned dataset with rows represented by gene identification and columns represented by samples. The data display consistent gene expression levels across samples in the breast cancer dataset.

Fig. 9 Visualization of CEL File for Breast Cancer Dataset

Fig. 10 Visualization of Excel File After Pre-processing

IV. CONCLUSION

The emergence of microarray technology provides solutions to crucial biological problems at the molecular level. It serves various purposes in research and clinical studies to help extract intrinsic patterns or knowledge, which may be useful for uncovering the causes of critical diseases [24, 25, 26]. However, the huge amount of microarray data remains one of the hardest problems for researchers. Real-world microarray data are often incomplete and noisy and contain missing values, which hinders data mining. Hence, data pre-processing is a mandatory process to facilitate the use of this powerful technology. Integrating three data pre-processing steps provides a solution to minimize the obstacles faced by analysts.

This study proposed a feasible data pre-processing method covering three phases: (1) data cleaning, (2) data transformation, and (3) data filtering. Data cleaning removes noisy, missing, and unnecessary data. Data transformation converts data into an appropriate format, and data filtering reduces data volume for efficient and effective data mining. The proposed method was applied to six cancer datasets and recorded a decrease in the number of genes after pre-processing. The differences in the number of genes between the original datasets and the cleaned datasets support the feasibility of the proposed data pre-processing method for generating high-quality data.

ACKNOWLEDGMENT

This paper is funded by Universiti Tun Hussein Onn Malaysia. The authors appreciate the Malaysia Ministry of Higher Education (MoHE). This research was funded under REGG FASA 1/2021 (VOT NO. H888). This work was also supported/funded by Universiti Teknologi Malaysia under UTM Fundamental Research Grant (UTMFR): Q.J130000.3851.21H94.

REFERENCES
[1] Owzar, K., Barry, W. T., Jung, S. H., Sohn, I., & George, S. L. (2008). Statistical challenges in pre-processing in microarray experiments in cancer. Clinical Cancer Research, 14(19), 5959-5966.
[2] Bharti, S., Krishnan, N., Veyssi, A., Momeni, M., & Raj, S. (2022). sMAP: An interactive microarray data analysis tool for early-stage researchers. bioRxiv.
[3] Herrero, J., Díaz-Uriarte, R., & Dopazo, J. (2003). Gene expression data pre-processing. Bioinformatics, 19(5), 655-656.
[4] García de la Nava, J., van Hijum, S., & Trelles, O. (2003). Pre P: gene expression data pre-processing. Bioinformatics, 19(17), 2328-2329.
[5] Deepak Jain. (2021, June 29). Data Preprocessing in Data Mining. Retrieved November 1, 2022, from https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/
[6] Revathy N, Amalraj D. Accurate Cancer Classification Using Expressions of Very Few Genes. International Journal of Computer Applications. 2011;14(4):19-22.
[7] Alasadi, S. A., & Bhaya, W. S. (2017). Review of data pre-processing techniques in data mining. Journal of Engineering and Applied Sciences, 12(16), 4102-4107.
[8] Wikipedia contributors. (2018, June 4). Microarray databases. In Wikipedia, The Free Encyclopedia. Retrieved 01:06, November 1, 2022, from https://en.wikipedia.org/w/index.php?title=Microarray_databases&oldid=844388880.
[9] Clough, E., & Barrett, T. (2016). The gene expression omnibus database. In Statistical genomics (pp. 93-110). Humana Press, New York, NY.
[10] Tomczak, K., Czerwinska, P., & Wiznerowicz, M. (2015). The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge, Współczesna Onkologia, vol. 19, no. 1A, pp. A68-A77.
[11] Parkinson, H., Kapushesky, M., Shojatalab, M., Abeygunawardena, N., Coulson, R., Farne, A., ... & Brazma, A. (2007). ArrayExpress - a public database of microarray experiments and gene expression profiles. Nucleic acids research, 35(suppl_1), D747-D750.
[12] Sarkans, U., Parkinson, H., Lara, G. G., Oezcimen, A., Sharma, A., Abeygunawardena, N., ... & Brazma, A. (2005). The ArrayExpress

gene expression database: a software engineering and implementation perspective. Bioinformatics, 21(8), 1495-1501.
[13] Rafii, F., & Rossi, B. D. (2015). Data pre-processing and reducing for microarray data exploration and analysis. International Journal of Computer Applications, 132(16), 20-26.
[14] Kuehn, H., Liberzon, A., Reich, M., & Mesirov, J. P. (2008). Using GenePattern for gene expression analysis. Current protocols in bioinformatics, 22(1), 7-12.
[15] Wikipedia contributors. (2021, December 23). GenePattern. In Wikipedia, The Free Encyclopedia. Retrieved 03:17, November 1, 2022, from https://en.wikipedia.org/w/index.php?title=GenePattern&oldid=1061704802.
[16] David Eby, Broad Institute. (n.d.). AffySTExpressionFileCreator (v1) BETA. Retrieved November 1, 2022, from https://www.genepattern.org/modules/docs/AffySTExpressionFileCreator/1.
[17] Joshua Gould, Broad Institute. (n.d.). PreprocessDataset (v5). Retrieved November 1, 2022, from https://genepattern.org/modules/docs/PreprocessDataset/5?print=yes.
[18] Seah, C. S., Kasim, S., Fudzee, M. F., Mohamad, M. S., Saedudin, R. R., Hassan, R., ... & Atan, R. (2018). An effective pre-processing phase for gene expression classification. Indonesian Journal of Electrical Engineering and Computer Science, 11(3), 1223.
[19] Sadhvi Anunaya. (2022, June 20). Data Preprocessing in Data Mining -A Hands On Guide. Retrieved November 2, 2022 from https://www.analyticsvidhya.com/blog/2021/08/data-preprocessing-in-data-mining-a-hands-on-guide/.
[20] Peterson, P. L., Baker, E., & McGaw, B. (2010). International encyclopedia of education. Elsevier Ltd.
[21] Normalization supplement: commentary on the impact of different normalization methodologies on variance distributions at a global and pathway level. Retrieved November 2, 2022 from https://doi.org/10.1371/journal.pgen.1002207.s003.
[22] RStudio Team (2020). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.
[23] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47. doi: 10.1093/nar/gkv007.
[24] Donatin, E., & Drancourt, M. (2012). DNA microarrays for the diagnosis of infectious diseases. Médecine et maladies infectieuses, 42(10), 453-459.
[25] Tzouvelekis, A., Patlakas, G., & Bouros, D. (2004). Application of microarray technology in pulmonary diseases. Respiratory research, 5(1), 1-18.
[26] Yoo, S. M., Choi, J. H., Lee, S. Y., & Yoo, N. C. (2009). Applications of DNA microarray in disease diagnostics. Journal of microbiology and biotechnology, 19(7), 635-646.
[27] Landi MT, Dracheva T, Rotunno M, Figueroa JD et al. Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS One 2008 Feb 20;3(2):e1651. PMID: 18297132.
[28] D'Errico M, de Rinaldis E, Blasi MF, Viti V et al. Genome-wide expression profile of sporadic gastric cancers with microsatellite instability. Eur J Cancer 2009 Feb;45(3):461-9. PMID: 19081245.
[29] Tsuchiya M, Parker JS, Kono H, Matsuda M et al. Gene expression in nontumoral liver tissue and recurrence-free survival in hepatitis C virus-positive hepatocellular carcinoma. Mol Cancer 2010 Apr 9;9:74. PMID: 20380719.
[30] Jones J, Otu H, Spentzos D, Kolia S et al. Gene signatures of progression and metastasis in renal cell cancer. Clin Cancer Res 2005 Aug 15;11(16):5730-9. PMID: 16115910.
[31] Tomás G, Tarabichi M, Gacquer D, Hébrant A et al. A general method to derive robust organ-specific gene expression-based differentiation indices: application to thyroid cancer diagnostic. Oncogene 2012 Oct 11;31(41):4490-8. PMID: 22266856.
[32] Miller LD, Smeds J, George J, Vega VB et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A 2005 Sep 20;102(38):13550-5. PMID: 16141321
