JOIV : Int. J. Inform. Visualization, 6(4) - December 2022 784-790

INTERNATIONAL JOURNAL ON INFORMATICS VISUALIZATION
journal homepage : [Link]/[Link]/joiv

A Microarray Data Pre-processing Method for Cancer Classification

Tay Xin Hui a, Shahreen Kasim a,*, Mohd Farhan Md Fudzee a, Zubaile Abdullah a, Rohayanti Hassan b, Aldo Erianda c
a Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja 86400, Johor, Malaysia
b Faculty of Computing, Universiti Teknologi Malaysia, 83100, Johor, Malaysia
c Department of Information Technology, Politeknik Negeri Padang, Sumatera Barat, Indonesia
Corresponding author: *shahreen@[Link]

Abstract—The development of microarray technology has led to significant improvements and research in various fields. With the help of machine learning techniques and statistical methods, it is now possible to organize, analyze, and interpret large amounts of biological data to uncover significant patterns of interest. However, exploiting microarray data remains a great challenge for many researchers. Raw gene expression data are usually vulnerable to missing values, noisy data, incomplete data, and inconsistent data. Hence, processing the data before applying them to cancer classification is important. In order to extract the biological significance of microarray gene expression data, data pre-processing is a necessary step to obtain valuable information for further analysis and address important hypotheses. This study presents a detailed description of a data pre-processing method for cancer classification. The proposed method consists of three phases: data cleaning, transformation, and filtering. A combination of the GenePattern software tool and RStudio was utilized to implement the proposed data pre-processing method. The proposed method was applied to six gene expression datasets (lung, stomach, liver, kidney, thyroid, and breast cancer datasets) to demonstrate its feasibility for cancer classification. A comparison was made to illustrate the differences between the datasets before and after data pre-processing.

Keywords—Data pre-processing; microarray data; gene expression data; GenePattern.

Manuscript received 15 Jan. 2022; revised 29 Apr. 2022; accepted 12 Oct. 2022. Date of publication 31 Dec. 2022.
International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

I. INTRODUCTION

DNA microarray technologies allow researchers to measure thousands of genes' expression patterns in various experimental conditions. This high-throughput technology opens up the possibility of organizing, analyzing, and interpreting biological data to solve biological problems at the molecular level. In general, statistical analysis can be categorized into three studies: (a) association studies, to discover the relationships between interesting genes or biological pathways; (b) prognostic or prediction studies, to classify patients concerning clinical endpoints based on molecular markers; and (c) class discovery studies, to discover clusters based on molecular data [1]. The ability to derive biological inferences from microarray data allows researchers to identify key disease pathways and find potential therapeutic targets [2].

Microarray data or gene expression data is composed of huge tables with thousands of rows corresponding to the genes or clones present in the DNA array, and several columns, one for each experimental condition measured [3]. This massive genomic data requires practical data pre-processing techniques for its analysis. Effective computational-based methodologies highly depend on the quality of input data. There are numerous sources of systematic and random changes introduced along the various phases in assessing gene expression levels [4]. These variations in expression levels might lead to false positives under certain changing experimental conditions. Thus, applying data pre-processing techniques is important to enhance the quality of results.

Data pre-processing is a data mining technique used to transform raw data into an efficient and useful format. A basic data pre-processing method involves three steps: (a) data cleaning, to remove missing and noisy data; (b) data transformation, to transform data into an appropriate form; and (c) data reduction, to increase the storage efficiency and reduce data storage and analysis costs [5]. The unprocessed raw data are susceptible to missing values, noise, outliers, and inconsistency, affecting the quality of data mining results.
Hence, data pre-processing is a mandatory procedure to undergo before the dataset can be applied to other mainstream research algorithms [6].

The structure of the paper is arranged as follows. Section II provides details about the gene expression datasets used and their information, followed by the method to pre-process the datasets. Section III presents the outcome of the pre-processed data, with a comparison to showcase the difference before and after pre-processing of the dataset. Section IV provides a concluding summary.

II. MATERIALS AND METHODS

Data pre-processing involves preparing and transforming the dataset into a clean and useful format. It aims to remove irrelevant and missing data, normalize data, reduce the size of data, and extract features for data [7]. This section explains the materials and methodology applied in this study. The gene expression datasets and the available pre-processing software tool are introduced for further data modeling in cancer classification. The proposed data pre-processing method is then described thoroughly.

A. Microarray Data

The microarray gene expression dataset is the dataset obtained from microarray technology. These data are deposited in many different databases, from which they can be retrieved according to researchers' needs. Some of the common public microarray databases are the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database, The Cancer Genome Atlas (TCGA) database, the ArrayExpress database, the Stanford Microarray Database (SMD), and so on. Table I shows the descriptions of the mentioned public databases.

TABLE I
MICROARRAY DATABASES
Microarray Database                  Description
Gene Expression Omnibus - NCBI       Gene Expression Omnibus – NCBI is an international public repository that stores and distributes high-throughput gene expression and other functional genomics datasets [9].
The Cancer Genome Atlas (TCGA)       The Cancer Genome Atlas is a public free-access database that catalogs a collection of different cancers' expression data [10].
ArrayExpress                         ArrayExpress is an open-source microarray database storing and providing access to high-throughput functional genomics data [11].
Stanford Microarray Database (SMD)   Stanford Microarray Database stores raw and normalized data from microarray experiments and is made available to researchers for applications [12].

Microarray data are stored in the format of large matrices of gene expression levels. The rows represent genes, and the columns represent the experimental conditions or samples under which the genes were measured [13]. Two types of profiles are exhibited in the microarray matrix structure: the gene profile and the array profile. A gene profile is the expression values of a single gene in a variety of samples or conditions [13]. In comparison, an array profile is the expression values of many genes in one sample or condition.

In this study, six datasets were obtained from the NCBI GEO database: the lung cancer dataset [27], stomach cancer dataset [28], liver cancer dataset [29], kidney cancer dataset [30], thyroid cancer dataset [31], and breast cancer dataset [32]. Table II presents the details of the selected cancer datasets.

TABLE II
GENE EXPRESSION DATASETS
Cancer    GEO ID     Platform ID   Number of Cancerous Samples   Number of Normal Samples
Lung      GSE10072   GPL96         58                            49
Stomach   GSE13911   GPL570        38                            31
Liver     GSE17856   GPL6480       43                            44
Kidney    GSE15641   GPL96         69                            23
Thyroid   GSE33630   GPL570        60                            45
Breast    GSE3494    GPL96         60                            176

The range of sample identification (ID) for each cancer dataset is shown in Table III.

TABLE III
SAMPLES ID OF GENE EXPRESSION DATASETS
Cancer    GEO ID     Platform ID   Range of Samples ID     Total Number of Samples
Lung      GSE10072   GPL96         GSM254625-GSM254621     107
Stomach   GSE13911   GPL570        GSM350411-GSM350479     69
Liver     GSE17856   GPL6480       GSM446165-GSM446251     87
Kidney    GSE15641   GPL96         GSM391107-GSM391198     92
Thyroid   GSE33630   GPL570        GSM831749-GSM831853     105
Breast    GSE3494    GPL96         GSM79114-GSM79615       236

B. GenePattern

GenePattern is an open-source software package that provides access to various computational methods used to analyze genomic data [14]. It aims to provide four important functionalities, which are accessibility, reproducibility, extensibility, and multiple interfaces [15]. From the perspective of accessibility, GenePattern provides access to over two hundred genomic analysis tools for researchers to develop, capture, and reproduce genomic analysis methodologies. These genomic analysis tools (referred to as “modules”) in the GenePattern module repository allow for the analysis and visualization of microarray, Single Nucleotide Polymorphism (SNP), proteomic, and sequence data [14].

GenePattern ensures the reproducibility of analysis methods and results by capturing the source of the data and analytic methods [14]. It provides automated history and provenance tracking (along with methods applied and parameter settings) for users to share and reproduce a computational analysis [15]. In addition, GenePattern facilitates simple creation and integration that allows users to
import their methods and code for sharing. Multiple interfaces are made available to a broad range of users, such as web browsers, applications, and programmatic interfaces, so that users can run analyses without any programming through a point-and-click user interface. Fig. 1 shows the web interface of GenePattern.

Fig. 1 The Web Interface of GenePattern

This study utilizes two GenePattern modules, the AffySTExpressionFileCreator module and the PreprocessDataset module. The AffySTExpressionFileCreator module creates a Gene Cluster Text (GCT) file from a set of CEL files (Affymetrix Probe Results File) from Affymetrix ST arrays [16]. This module allows the transformation of a gene expression data file (.CEL) to a computer-readable tab-delimited text file (.GCT) to analyze matrix-compatible gene expression datasets. There are six parameter settings available in this module: input file, normalize, background correct, clm file, annotate probes, and output file base [16]. The input file parameter accepts one or more Affymetrix ST CEL files for analysis. The normalize parameter allows users to normalize data using quantile normalization. Background correction aims to remove geographical biases in fluorescent intensity. The clm file is a tab-delimited text file containing one scan, sample, and class per line. The annotate probes parameter annotates rows with the gene symbol and description. The output file base parameter sets the base name of the output file. Fig. 2 shows the screenshot interface of the AffySTExpressionFileCreator module available on GenePattern.

Fig. 2 The Interface of AffySTExpressionFileCreator Module

On the other hand, the PreprocessDataset module provides a variety of pre-processing operations which aim to remove platform noise and genes that have little variation so the subsequent analysis can identify interesting variations, such as the differential expression between tumor and normal tissue [17]. This module performs several pre-processing steps on a GCT input file to produce filtered, pre-processed gene expression data. There are four main module parameter settings: thresholding/ceiling, variation filtering, normalization, and log2 transform. The threshold filtering parameter removes a gene whose expression profile contains insufficient values greater than a specified threshold. Floor and ceiling values can be set manually by users. The variation filtering parameter removes a gene if the variation of its expression values across the samples does not meet a minimum threshold. The row normalization and log2 transform parameters remove systematic variation of gene expression values between microarray experiments. Fig. 3 presents the screenshot interface of the PreprocessDataset module available on GenePattern.

Fig. 3 The Interface of PreprocessDataset Module

C. Pre-analysis

Pre-analysis of gene expression data aims to generate a computer-readable tab-delimited text file (.GCT) for data pre-processing. Fig. 4 illustrates the pre-analysis of the gene expression dataset. The pre-analysis conducts the following steps in order.
1. Download microarray gene expression dataset from microarray database
2. Create a ZIP package of CEL files for the usage of GenePattern modules
3. Run module in GenePattern
4. Output GCT files of gene expression datasets for further data pre-processing

[Flow diagram: Raw gene expression dataset -> ZIP of CEL files -> Run module in GenePattern -> GCT files for data pre-processing]
Fig. 4 Pre-analysis of Gene Expression Dataset
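Steps 2 and 4 of the pre-analysis can be sketched in Python. This is only an illustration and not part of the study's workflow, which used the GenePattern web interface. It assumes the CEL files have already been downloaded to local disk, and the helper names `zip_cel_files` and `parse_gct` are hypothetical. The parser assumes the GCT 1.2 layout: a `#1.2` version line, a line holding the row and column counts, a header line beginning with `Name` and `Description`, and then one tab-delimited row per probe set. The probe IDs, sample names, and values in the example string are made up.

```python
import zipfile
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def zip_cel_files(cel_paths: Iterable[str], zip_path: str) -> List[str]:
    """Bundle CEL files into one ZIP archive (step 2); return the archived names."""
    names: List[str] = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in cel_paths:
            p = Path(path)
            if p.suffix.lower() != ".cel":
                continue  # ignore anything that is not a CEL file
            archive.write(p, arcname=p.name)  # store the file without its directory
            names.append(p.name)
    return names


def parse_gct(text: str) -> Tuple[List[str], Dict[str, List[float]]]:
    """Parse GCT 1.2 text (step 4 output) into sample names and a probe -> values map."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3 or not lines[0].startswith("#1.2"):
        raise ValueError("expected a GCT 1.2 file")
    n_rows, n_cols = (int(x) for x in lines[1].split("\t")[:2])
    samples = lines[2].split("\t")[2:]  # skip the Name and Description columns
    matrix: Dict[str, List[float]] = {}
    for line in lines[3:]:
        fields = line.split("\t")
        matrix[fields[0]] = [float(v) for v in fields[2:]]
    # Duplicate probe names would collapse in the dict and be caught here.
    if len(matrix) != n_rows or len(samples) != n_cols:
        raise ValueError("matrix size does not match the declared dimensions")
    return samples, matrix


# Made-up two-probe, three-sample example.
example = (
    "#1.2\n2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "probe_a\tna\t10.5\t11.0\t9.8\n"
    "probe_b\tna\t7.1\t6.9\t7.4\n"
)
samples, matrix = parse_gct(example)
print(samples, matrix["probe_a"])
```

Keeping the parsed matrix keyed by probe ID preserves the "genes as rows, samples as columns" layout described in Section II-A, so later cleaning steps can operate row by row.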
[Flow diagram: Gene expression dataset -> Data cleaning (remove unwanted attributes, missing values, and imputation) -> Data transformation (normalisation) -> Data filtering (average) -> Cleaned gene expression dataset]
Fig. 5 Data Pre-processing of Gene Expression Dataset

This raw gene expression data file contains abundant information extracted from the cell [18]. In order to generate a GCT file for data pre-processing, a ZIP package of the CEL files downloaded from the database is created for the usage of GenePattern modules in the next step. Then, the created ZIP package of CEL files was uploaded to the AffySTExpressionFileCreator module for processing. The module's normalize and background correct parameters were set to 'no' to extract the raw dataset. Other parameters were set to default behavior to obtain a matrix containing one intensity value per probe set per sample in the GCT file format [16]. The analysis module outputs a GCT file of the gene expression dataset that a computer can process for further data pre-processing.

D. Data Pre-processing

Microarray experiments produce huge amounts of data, and systematic pre-processing methods are required to extract meaningful expression relations [13]. The microarray data collected from a single experiment could amount to tens of thousands of data points for thousands of genes [13]. This data represents the key information for responding to crucial biological questions and hypotheses. In order to enhance the reliability of data, it is necessary to apply pre-processing techniques to extract accurate data.

After completing the pre-analysis of gene expression data, the actual data pre-processing starts. In this study, data pre-processing involves three phases: Phase 1: data cleaning, Phase 2: data transformation, and Phase 3: data filtering. Fig. 5 demonstrates the phases in data pre-processing. Data cleaning is the first step in microarray data pre-processing, which aims to correct or remove inaccurate, damaged, improperly formatted, duplicate, or incomplete data from a dataset. These dirty data will affect the mining procedure and lead to unreliable and poor output [7]. First, unwanted and empty values of attributes were removed. The unwanted attributes include patient biological information, dataset information, and dataset descriptions not applicable to cancer classification.

In comparison, empty values of attributes refer to the missing values that appeared across the rows in the gene expression dataset. Missing values occurred due to different factors, such as the corruption of the image, insufficient resolution, dust or scratches on the slide, and the robotic methods used to create the arrays [13]. Then, rows with incomplete attributes or noisy data values were imputed with mean values to resolve inconsistencies in data. Noise data is a random error generated by faulty data collection or data entry errors. The mean imputation method was implemented to fill the missing data elements in the gene expression dataset without reducing the sample size [20]. It creates a complete gene expression data matrix for further data analysis using classification algorithms. Finally, the data were rearranged into the required format before proceeding to the next phase. Fig. 6 shows the details of phase 1 in microarray data pre-processing.

[Diagram, box labels in order: Data Cleaning; Unwanted attributes; Patient biological info; Dataset information; Dataset description; Missing value; Empty values of attributes; Incomplete values; Mean imputation; Attribute arrangement; Restructure according to format]
Fig. 6 Phase 1 of Data Pre-processing

In the data transformation phase, the PreprocessDataset module in GenePattern was applied to normalize gene expression data. This step aims to tune the data into a proper format suitable for analysis and other downstream processes. Fig. 7 depicts the details of phase 2 in microarray data pre-processing.
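As an illustration of the mean imputation used in Phase 1 (data cleaning), the following Python sketch fills the missing entries of each gene row with the mean of that row's observed values. The paper does not state over which values the mean was taken, so the row-wise (per-gene) mean is an assumption here, and the function names `impute_row_mean` and `impute_matrix` are hypothetical. Missing entries are represented as `None` or NaN.

```python
import math
from typing import Dict, List, Optional, Sequence


def is_missing(value: Optional[float]) -> bool:
    """Treat None and NaN as missing expression values."""
    return value is None or math.isnan(value)


def impute_row_mean(row: Sequence[Optional[float]]) -> List[float]:
    """Replace missing entries in one gene row with the mean of the observed entries.

    Assumption: the mean is taken per row (gene); the source does not specify this.
    """
    observed = [float(v) for v in row if not is_missing(v)]
    if not observed:
        raise ValueError("row has no observed values to average")
    row_mean = sum(observed) / len(observed)
    return [row_mean if is_missing(v) else float(v) for v in row]


def impute_matrix(matrix: Dict[str, Sequence[Optional[float]]]) -> Dict[str, List[float]]:
    """Apply row-mean imputation to every gene; the number of samples is unchanged."""
    return {gene: impute_row_mean(values) for gene, values in matrix.items()}


# Made-up example: the second sample of gene_a is missing.
print(impute_matrix({"gene_a": [1.0, None, 3.0], "gene_b": [4.0, 5.0, 6.0]}))
```

Because every row keeps its original length, no sample is dropped, which matches the stated goal of filling missing elements without reducing the sample size. Rows with no observed values raise an error rather than being silently filled.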
[Diagram, box labels in order: Data Transformation; PreprocessDataset module; Threshold/flooring; Row normalization; Log2 transformation]
Fig. 7 Phase 2 of Data Pre-processing

The cleaned gene expression data was input into the PreprocessDataset module for pre-processing. All parameters were left at their default settings, except that row normalization and log2 transform were enabled to normalize each gene's expression values across all samples. This module applies a series of data transformations intended to aid in comparing gene expression data gathered across a series of hybridizations [21]. These include applying intensity thresholds or flooring to eliminate poorly detected probes and improve signal-to-noise sensitivity. Log transformation normalizes the distribution of probes across the experiment's intensity range. Row normalization scales the data into a specific range between -1.0 to 1.0 or 0.0 to 1.0. Thresholding, scaling, and log transforming the data reduce variance between samples and are useful for data mining techniques like cancer classification [7].

The last phase in data pre-processing is data filtering. This step aims to reduce the huge dataset volume while maintaining the original dataset's integrity. Fig. 8 presents the details of phase 3 in microarray data pre-processing.

[Diagram, box labels in order: Data Filtering; R limma package; avereps (Average Over Irregular Replicate Probes)]
Fig. 8 Phase 3 of Data Pre-processing

Data filtering was conducted in RStudio using the R programming language [22]. The limma package is an R package for data analysis, linear models, and differential expression of microarray data [23]. The limma package was downloaded and imported in RStudio for data pre-processing. The “avereps” (Average Over Irregular Replicate Probes) function in the limma package was utilized for data reduction. It works by condensing the microarray data object so that values for within-array replicate probes are replaced with their average [23]. This method preserves highly relevant attributes and discards redundant features to reduce the size of the gene expression dataset.

After completing the three phases in data pre-processing, the cleaned dataset is now prepared to be used in both the evaluation method and the classifiers. Data pre-processing is essential to build models with this cleaned dataset effectively. This process eliminates inconsistencies or duplicates in data and increases the efficiency and reliability of data for mining procedures.

III. RESULT AND DISCUSSION

This study used six datasets to perform the proposed data pre-processing method. Table IV shows the number of genes after data pre-processing.

TABLE IV
GENE EXPRESSION DATASETS AFTER PRE-PROCESSING
Cancer    GEO ID     Number of Genes (Raw Dataset)   Number of Genes (Cleaned Dataset)   Number of Removed Genes
Lung      GSE10072   22283                           12986                               9297
Stomach   GSE13911   54675                           12419                               42256
Liver     GSE17856   25075                           13802                               11273
Kidney    GSE15641   22283                           11593                               10690
Thyroid   GSE33630   54675                           12986                               41689
Breast    GSE3494    22283                           12986                               9297
Total Number of Removed Genes                                                            124502

Based on the results in Table IV, the proposed data pre-processing method removed a total of 124502 genes across six datasets. For the lung and breast cancer datasets, the raw dataset contains 22283 genes, and the cleaned dataset retains 12986 genes after pre-processing; 9297 genes were removed from each before further data modeling. The stomach cancer dataset originally consisted of 54675 genes, and the number was reduced to 12419 genes with a total of 42256 genes removed. The liver cancer dataset contains 25075 genes before pre-processing, and the number of genes decreased to 13802 with a total of 11273 genes removed. In addition, the proposed data pre-processing method eliminated a total of 10690 genes for the kidney cancer dataset, reducing the number of genes from 22283 in the raw dataset to 11593 in the cleaned dataset. For the thyroid cancer dataset, 41689 genes were removed from the raw gene expression dataset, reducing the number of genes from 54675 to 12986.

In order to depict the differences between the original raw gene expression dataset and the pre-processed dataset, the breast cancer dataset GSE3494 was used as an example to compare and visualize the attribute variation. Fig. 9 illustrates the raw CEL file for the breast cancer dataset, and Fig. 10 shows the cleaned Excel file for the breast cancer dataset after pre-processing. By visualizing the two formats of datasets, the differences in the content presentation can be observed directly to demonstrate the feasibility of the proposed data pre-processing method.

Based on Fig. 9, the raw breast cancer dataset contains rows of unwanted information irrelevant to data analysis and modeling. This unwanted information includes dataset information, the number of attributes contained, the dataset header and footer, and so on. On the other hand, Fig. 10 shows the cleaned dataset with rows represented by gene identification and columns represented by samples. The data
displays consistent gene expression levels across samples in the breast cancer dataset.

Fig. 9 Visualization of CEL File for Breast Cancer Dataset

Fig. 10 Visualization of Excel File After Pre-processing

IV. CONCLUSION

The emergence of microarray technology provides solutions to crucial biological problems at the molecular level. It serves various purposes in research and clinical studies to help extract intrinsic patterns or knowledge, which may be useful for uncovering the causes of critical diseases [24, 25, 26]. However, the huge amount of microarray data remains an unresolved challenge for researchers. Real-world microarray data are often incomplete and noisy and contain missing values. Hence, data pre-processing is a mandatory process to facilitate the use of this powerful technology. Integrating three data pre-processing steps provides a solution to minimize the obstacles faced by analysts.

This study proposed a feasible data pre-processing method covering three phases: (1) Data cleaning, (2) Data transformation, and (3) Data filtering. Data cleaning is aimed at removing noisy, missing, and unnecessary data. Data transformation converts data into an appropriate format, and data filtering reduces data volume for efficient and effective data mining. The proposed method was applied to six cancer datasets and recorded a decrease in the number of genes after pre-processing. The differences in the number of genes between the original dataset and the cleaned dataset demonstrated the feasibility of the proposed data pre-processing method to generate high-quality data.

ACKNOWLEDGMENT
Universiti Tun Hussein Onn Malaysia funds this paper. The authors appreciate the Malaysia Ministry of Higher Education (MoHE). This research was funded under REGG FASA 1/2021 (VOT NO. H888). This work was also supported/funded by Universiti Teknologi Malaysia under UTM Fundamental Research Grant (UTMFR): Q.J130000.3851.21H94.

REFERENCES
[1] Owzar, K., Barry, W. T., Jung, S. H., Sohn, I., & George, S. L. (2008). Statistical challenges in pre-processing in microarray experiments in cancer. Clinical Cancer Research, 14(19), 5959-5966.
[2] Bharti, S., Krishnan, N., Veyssi, A., Momeni, M., & Raj, S. (2022). sMAP: An interactive microarray data analysis tool for early-stage researchers. bioRxiv.
[3] Herrero, J., Díaz-Uriarte, R., & Dopazo, J. (2003). Gene expression data pre-processing. Bioinformatics, 19(5), 655-656.
[4] García de la Nava, J., van Hijum, S., & Trelles, O. (2003). Pre P: gene expression data pre-processing. Bioinformatics, 19(17), 2328-2329.
[5] Deepak Jain. (2021, June 29). Data Preprocessing in Data Mining. Retrieved November 1, 2022, from [Link]
[6] Revathy N, Amalraj D. Accurate Cancer Classification Using Expressions of Very Few Genes. International Journal of Computer Applications. 2011;14(4):19-22.
[7] Alasadi, S. A., & Bhaya, W. S. (2017). Review of data pre-processing techniques in data mining. Journal of Engineering and Applied Sciences, 12(16), 4102-4107.
[8] Wikipedia contributors. (2018, June 4). Microarray databases. In Wikipedia, The Free Encyclopedia. Retrieved 01:06, November 1, 2022, from [Link] did=844388880.
[9] Clough, E., & Barrett, T. (2016). The gene expression omnibus database. In Statistical genomics (pp. 93-110). Humana Press, New York, NY.
[10] Tomczak, K., Czerwinska, P., & Wiznerowicz, M. (2015). The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge, Współczesna Onkologia, vol. 19, no. 1A, pp. A68-A77.
[11] Parkinson, H., Kapushesky, M., Shojatalab, M., Abeygunawardena, N., Coulson, R., Farne, A., ... & Brazma, A. (2007). ArrayExpress—a public database of microarray experiments and gene expression profiles. Nucleic acids research, 35(suppl_1), D747-D750.
[12] Sarkans, U., Parkinson, H., Lara, G. G., Oezcimen, A., Sharma, A., Abeygunawardena, N., ... & Brazma, A. (2005). The ArrayExpress gene expression database: a software engineering and implementation perspective. Bioinformatics, 21(8), 1495-1501.
[13] Rafii, F., & Rossi, B. D. (2015). Data pre-processing and reducing for microarray data exploration and analysis. International Journal of Computer Applications, 132(16), 20-26.
[14] Kuehn, H., Liberzon, A., Reich, M., & Mesirov, J. P. (2008). Using GenePattern for gene expression analysis. Current protocols in bioinformatics, 22(1), 7-12.
[15] Wikipedia contributors. (2021, December 23). GenePattern. In Wikipedia, The Free Encyclopedia. Retrieved 03:17, November 1, 2022, from [Link] 704802.
[16] David Eby, Broad Institute. (n.d.). AffySTExpressionFileCreator (v1) BETA. Retrieved November 1, 2022, from [Link] eator/1.
[17] Joshua Gould, Broad Institute. (n.d.). PreprocessDataset (v5). Retrieved November 1, 2022, from [Link]
[18] Seah, C. S., Kasim, S., Fudzee, M. F., Mohamad, M. S., Saedudin, R. R., Hassan, R., ... & Atan, R. (2018). An effective pre-processing phase for gene expression classification. Indonesian Journal of Electrical Engineering and Computer Science, 11(3), 1223.
[19] Sadhvi Anunaya. (2022, June 20). Data Preprocessing in Data Mining -A Hands On Guide. Retrieved November 2, 2022 from [Link] in-data-mining-a-hands-on-guide/.
[20] Peterson, P. L., Baker, E., & McGaw, B. (2010). International encyclopedia of education. Elsevier Ltd.
[21] Normalization supplement: commentary on the impact of different normalization methodologies on variance distributions at a global and pathway level. Retrieved November 2, 2022 from [Link]
[22] RStudio Team (2020). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA URL [Link]
[23] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47. doi: 10.1093/nar/gkv007.
[24] Donatin, E., & Drancourt, M. (2012). DNA microarrays for the diagnosis of infectious diseases. Médecine et maladies infectieuses, 42(10), 453-459.
[25] Tzouvelekis, A., Patlakas, G., & Bouros, D. (2004). Application of microarray technology in pulmonary diseases. Respiratory research, 5(1), 1-18.
[26] Yoo, S. M., Choi, J. H., Lee, S. Y., & Yoo, N. C. (2009). Applications of DNA microarray in disease diagnostics. Journal of microbiology and biotechnology, 19(7), 635-646.
[27] Landi MT, Dracheva T, Rotunno M, Figueroa JD et al. Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS One 2008 Feb 20;3(2):e1651. PMID: 18297132.
[28] D'Errico M, de Rinaldis E, Blasi MF, Viti V et al. Genome-wide expression profile of sporadic gastric cancers with microsatellite instability. Eur J Cancer 2009 Feb;45(3):461-9. PMID: 19081245.
[29] Tsuchiya M, Parker JS, Kono H, Matsuda M et al. Gene expression in nontumoral liver tissue and recurrence-free survival in hepatitis C virus-positive hepatocellular carcinoma. Mol Cancer 2010 Apr 9;9:74. PMID: 20380719.
[30] Jones J, Otu H, Spentzos D, Kolia S et al. Gene signatures of progression and metastasis in renal cell cancer. Clin Cancer Res 2005 Aug 15;11(16):5730-9. PMID: 16115910.
[31] Tomás G, Tarabichi M, Gacquer D, Hébrant A et al. A general method to derive robust organ-specific gene expression-based differentiation indices: application to thyroid cancer diagnostic. Oncogene 2012 Oct 11;31(41):4490-8. PMID: 22266856.
[32] Miller LD, Smeds J, George J, Vega VB et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A 2005 Sep 20;102(38):13550-5. PMID: 16141321

