Microarray Time Series
Prepared by:
Supervised by:
Survey
2011
List of Figures
Figure 1.1
Figure 1.2
Figure 2.1
Figure 3.1
Figure 3.2
Chapter 1: Introduction
1.1 What is DNA microarray technology?
1.2 What is DNA microarray technology used for?
1.3 How does DNA microarray technology work?
1.4 Microarray types and usage
Introduction
In [1], microarray technology has supplied a large volume of data, which turns many problems in biology into problems of computing. As a result, techniques for extracting useful information from the data have been developed. In particular, microarray technology has been applied to the prediction and diagnosis of cancer, and is expected to help us predict and diagnose cancer accurately. To classify cancer precisely, we have to select the genes related to cancer, because the genes extracted from a microarray contain much noise. In this paper, we explore seven feature selection methods and four classifiers, and propose ensemble classifiers, using three benchmark datasets to systematically evaluate the performance of the feature selection methods and machine learning classifiers. The three benchmark datasets are the leukemia, colon cancer, and lymphoma cancer datasets. The methods used to combine the classifiers are majority voting, weighted voting, and a Bayesian approach, which improve the classification performance. Experimental results show that an ensemble of several base classifiers produces the best recognition rate on the benchmark datasets.
In [5], microarray technologies facilitate the generation of vast amounts of bio-signal or genomic signal data. The major challenge in processing these signals is extracting the global characteristics of the data, owing to their huge dimensionality and the complex relationships among the various genes.
In [1], although all of the cells in the human body contain identical genetic material, the same
genes are not active in every cell. Studying which genes are active and which are inactive in
different cell types helps scientists to understand both how these cells function normally and how
they are affected when various genes do not perform properly. In the past, scientists have only
been able to conduct these genetic analyses on a few genes at once. With the development of DNA
microarray technology, however, scientists can now examine how active thousands of genes are at
any given time.
In [2], DNA microarray technology has attracted tremendous interest in both the scientific community and industry. Generally, microarray expression experiments allow the recording of the expression levels of thousands of genes simultaneously.
In [3], microarray data have recently attracted much attention in the bioinformatics research area. Microarray data are a recording of the expression levels of thousands of genes measured in various experimental settings. Microarray data typically contain several thousand genes (features) but only tens of experiments (samples), which is referred to as the small-sample-size problem.
In [6], DNA microarray technology allows simultaneous monitoring and measuring of thousands
of gene expression activation levels in a single experiment. This technology is currently used in
medical diagnosis and gene analysis. Many microarray research projects focus on clustering
analysis and classification accuracy. In clustering analysis, the purpose of clustering is to analyze
the gene groups that show a correlated pattern of the gene expression data and provide insight into
gene interactions and function. Research on classification accuracy is aimed at building an efficient model that predicts the class membership of data, produces correct labels on the training data, and correctly predicts the label of any unknown data.
In [8], the analysis of microarray data requires two steps: feature selection and classification. From
a variety of feature selection methods and classifiers, it is difficult to find optimal ensembles
composed of any feature-classifier pairs. This paper proposes a novel method based on the
evolutionary algorithm (EA) to form sophisticated ensembles of features and classifiers that can be
used to obtain high classification performance. In spite of the exponential number of possible
ensembles of individual feature-classifier pairs, an EA can produce the best ensemble in a
reasonable amount of time. The chromosome is encoded with real values to decide the weight for
each feature-classifier pair in an ensemble. Experimental results with two well-known microarray
datasets in terms of time and classification rate indicate that the proposed method produces
ensembles that are superior to individual classifiers, as well as other ensembles optimized by
random and greedy strategies.
In [1], microarray technology will help researchers to learn more about many different diseases,
including heart disease, mental illness and infectious diseases, to name only a few. One intense
area of microarray research at the National Institutes of Health (NIH) is the study of cancer. In the
past, scientists have classified different types of cancers based on the organs in which the tumors
develop. With the help of microarray technology, however, they will be able to further classify
these types of cancers based on the patterns of gene activity in the tumor cells. Researchers will
then be able to design treatment strategies targeted directly to each specific type of cancer.
Additionally, by examining the differences in gene activity between untreated and treated tumor cells - for example those that are irradiated or oxygen-starved - scientists will understand exactly how different therapies affect tumors and be able to develop more effective treatments.
Figure 1.2
In [1], DNA microarrays are created by robotic machines that arrange minuscule amounts of hundreds or
thousands of gene sequences on a single microscope slide. Researchers have a database of over 3.5
million genetic sequences that they can use for this purpose. When a gene is activated, cellular
machinery begins to copy certain segments of that gene. The resulting product is known as messenger
RNA (mRNA), which is the body's template for creating proteins. The mRNA produced by the cell is
complementary, and therefore will bind to the original portion of the DNA strand from which it was
copied.
In [1], to determine which genes are turned on and which are turned off in a given cell, a researcher must
first collect the messenger RNA molecules present in that cell. The researcher then labels each mRNA
molecule by using a reverse transcriptase enzyme (RT) that generates a complementary cDNA to the
mRNA. During that process fluorescent nucleotides are attached to the cDNA. The tumor and the
normal samples are labeled with different fluorescent dyes. Next, the researcher places the labeled
cDNAs onto a DNA microarray slide. The labeled cDNAs that represent mRNAs in the cell will then hybridize – or bind – to their synthetic complementary DNAs attached to the microarray slide, leaving their fluorescent tags. A researcher must then use a special scanner to measure the fluorescent intensity of each spot on the microarray slide.
In [1], if a particular gene is very active, it produces many molecules of messenger RNA, thus, more
labeled cDNAs, which hybridize to the DNA on the microarray slide and generate a very bright
fluorescent area. Genes that are somewhat less active produce fewer mRNAs, thus, less labeled cDNAs,
which results in dimmer fluorescent spots. If there is no fluorescence, none of the messenger molecules
have hybridized to the DNA, indicating that the gene is inactive. Researchers frequently use this
technique to examine the activity of various genes at different times. When tumor samples (red dye) and normal samples (green dye) are co-hybridized, they compete for the synthetic complementary DNAs on the microarray slide. As a result, if a spot is red, the corresponding gene is more expressed in the tumor than in the normal tissue (up-regulated in cancer). If a spot is green, the gene is more expressed in the normal tissue (down-regulated in cancer). If a spot is yellow, the gene is equally expressed in normal and tumor tissue.
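The red/green comparison described above can be made quantitative with a log2 ratio of the two channel intensities. The sketch below is illustrative only: the intensity values and the ±1 (two-fold change) threshold are hypothetical, not taken from [1].

```python
import numpy as np

# Hypothetical red (tumor) and green (normal) channel intensities per spot.
red   = np.array([1200.0, 150.0, 800.0, 40.0])
green = np.array([ 300.0, 900.0, 780.0, 35.0])

# log2 ratio: positive -> red spot (up-regulated in tumor),
# negative -> green spot (down-regulated), near zero -> yellow spot.
log_ratio = np.log2(red / green)

def call(lr, threshold=1.0):   # threshold of 1 = two-fold change (assumption)
    if lr >= threshold:
        return "up-regulated"
    if lr <= -threshold:
        return "down-regulated"
    return "unchanged"

calls = [call(lr) for lr in log_ratio]
```

The first spot (red four times brighter than green) is called up-regulated, the second down-regulated, and the last two, where the channels are nearly equal, unchanged.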
In [1], many types of arrays exist, and the broadest distinction is whether they are spatially arranged on a surface or on coded beads:
The alternative bead array is a collection of microscopic polystyrene beads, each with a
specific probe and a ratio of two or more dyes, which do not interfere with the fluorescent
dyes used on the target sequence.
In [1], DNA microarrays can be used to detect DNA (as in comparative genomic hybridization), or
detect RNA (most commonly as cDNA after reverse transcription) that may or may not be translated
into proteins. The process of measuring gene expression via cDNA is called expression analysis or
expression profiling.
In [2], ICA is a useful extension of PCA that was developed in the context of blind separation of independent sources from their linear mixtures. Such blind separation techniques have been used, for example, in various applications of auditory signal separation, medical signal processing, and so on. Roughly speaking, rather than requiring that the coefficients of a linear expansion of the data vectors be uncorrelated, as in PCA, ICA requires these coefficients to be mutually independent (or as independent as possible). This implies that higher-order statistics are needed to determine the ICA expansion.
Figure 2.1
The gene expression data synthesis model. To find a set of independent basis snapshots (eigenassays), the snapshots in X are considered to be a linear combination of statistically independent basis snapshots (the rows of S), where W is the unmixing matrix and A = W⁻¹ is the unknown mixing matrix. The independent eigenassays are estimated as the output U = WX of the learned ICA model.
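The model in the caption (observations X are a mixture AS of independent sources, recovered as U = WX) can be sketched with a minimal FastICA implementation on synthetic signals. This is not the algorithm of [2]: the cubic (kurtosis-based) nonlinearity, the two toy sources, and the mixing matrix are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two statistically independent "source" signals (rows of S): a square wave
# and uniform noise, both strongly non-Gaussian.
t = np.linspace(0, 1, 2000)
S = np.vstack([np.sign(np.sin(7 * 2 * np.pi * t)),
               rng.uniform(-1, 1, t.size)])

A = np.array([[1.0, 0.6],     # "unknown" mixing matrix
              [0.4, 1.0]])
X = A @ S                     # observed mixtures (the snapshots)

# Center and whiten the observations.
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
Z = (E @ np.diag(d ** -0.5) @ E.T) @ Xc

def decorrelate(W):
    # W <- (W W^T)^(-1/2) W, keeping the rows orthonormal
    u, _, vt = np.linalg.svd(W)
    return u @ vt

# Symmetric FastICA iteration with the cubic (kurtosis-based) nonlinearity.
W = decorrelate(rng.standard_normal((2, 2)))
for _ in range(200):
    Y = W @ Z
    W = decorrelate((Y ** 3) @ Z.T / Z.shape[1] - 3 * W)

U = W @ Z   # estimated independent components (up to sign and permutation)
```

Each row of U should match one of the original sources up to sign, scale, and ordering, which is the inherent ambiguity of blind separation.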
In [2], this method finds the individual gene expression profiles that help to discriminate between two classes by calculating, for each gene expression profile gj, a score based on the mean μ1j (respectively μ2j) and the standard deviation σ1j (respectively σ2j) of each class of samples. In this study, we ranked the genes by their scores and retained sets of the top 500, 1000 and 2000 genes of the two data sets for ICA.
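A score of this kind is commonly computed as the difference of the class means divided by the sum of the class standard deviations. The exact score used in [2] is not reproduced here, so the sketch below uses that signal-to-noise-style formula on hypothetical expression data in which the first five genes are made artificially discriminative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical expression matrix: 100 genes x 20 samples (10 per class).
n_genes, n_per_class = 100, 10
X = rng.standard_normal((n_genes, 2 * n_per_class))
X[:5, :n_per_class] += 4.0          # make the first 5 genes discriminative
labels = np.array([0] * n_per_class + [1] * n_per_class)

def sn_score(X, labels):
    """Signal-to-noise-style score (mu1 - mu2) / (sigma1 + sigma2) per gene."""
    X1, X2 = X[:, labels == 0], X[:, labels == 1]
    mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)
    s1, s2 = X1.std(axis=1, ddof=1), X2.std(axis=1, ddof=1)
    return (mu1 - mu2) / (s1 + s2)

scores = sn_score(X, labels)
top = np.argsort(-np.abs(scores))[:5]   # retain the top-ranked genes
```

Ranking by |score| and keeping the top genes mirrors the "retain the top 500, 1000 and 2000" step described above, just at toy scale.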
2.4 Classifiers
In [2], after processing the gene expression data using t-statistics and ICA, the final step is to classify the data set. Many methods have been proposed for such classification tasks, such as radial basis function neural networks (RBFNN), radial basis probabilistic neural networks, logistic discrimination (LD), and quadratic discriminant analysis (QDA). Because the dimension of DNA microarray gene expression data remains high even after processing by ICA, and in general only a few samples are available, we use support vector machines (SVM), which have proved to be very useful, to classify the gene expression data.
SVM is a relatively new type of machine learning model, originally introduced by Vapnik and co-workers and successively extended by a number of other researchers. This model, which performs remarkably robustly on sparse and noisy data, is becoming a system of choice in a number of applications from text categorization to protein function prediction.
When used for classification, an SVM separates a given set of binary-labeled training data with a hyperplane that is maximally distant from them (the maximal margin hyperplane). For cases in which no linear separation is possible, it can work in combination with the technique of "kernels", which automatically realizes a nonlinear mapping to a feature space. Generally, the hyperplane found by the SVM in feature space corresponds to a nonlinear decision boundary in the original space.
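A minimal sketch of the maximal-margin idea, assuming a linear kernel and synthetic separable data. This is batch sub-gradient descent on the regularized hinge loss, an illustrative stand-in rather than the solver used in [2]; the data, regularization, and learning rate are all made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two linearly separable Gaussian clouds, labels in {-1, +1}.
X = np.vstack([rng.normal([2, 2], 0.5, (50, 2)),
               rng.normal([-2, -2], 0.5, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=500):
    """Batch sub-gradient descent on lam/2 * ||w||^2 + mean(hinge loss)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                     # margin violations
        gw = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(y)
        gb = -y[viol].sum() / len(y)
        w -= lr * gw
        b -= lr * gb
    return w, b

w, b = train_linear_svm(X, y)
accuracy = (np.sign(X @ w + b) == y).mean()
```

The hinge loss penalizes points inside the margin, and the ||w||² term keeps the margin wide, which is the trade-off the text describes.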
In [1], data mining for DNA microarrays consists of selecting discriminative genes related to classification from gene expression data and training a classifier that classifies new data. After acquiring the gene expression data from the DNA microarray, the prediction system has two stages: feature selection and pattern classification.
In [1], feature selection can be thought of as gene selection: obtaining the list of genes that might be informative for prediction by statistical and information-theoretic methods. Since it is highly unlikely that all the genes carry information related to cancer, and using all the genes results in too large a dimensionality, it is necessary to explore efficient ways of obtaining the best features. We have extracted some informative genes using seven methods, and the cancer predictor classifies the category using only these genes.
In [1], given the gene list, a classifier decides at the prediction stage which category the gene pattern belongs to. We have adopted four of the most widely used classification methods and an ensemble classifier.
3.1 Gene selection
In [1], among the thousands of genes whose expression levels are measured, not all are needed for classification. Microarray data consist of a large number of genes but only a small number of samples. We need to select genes highly related to particular classes for classification; these are called informative genes.
In [1], using statistical correlation analysis, we can see the linear relationship and the direction of the relation between two variables. The correlation coefficient r varies from -1 to +1: data distributed near a line biased in the (+) direction have positive coefficients, and data near a line biased in the (-) direction have negative coefficients.
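On hypothetical expression data, correlation-based gene selection might look like the following sketch. The data, the effect sizes, and the choice to rank by |r| are all illustrative assumptions, not details from [1].

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 50 genes x 30 samples with binary class labels.
X = rng.standard_normal((50, 30))
y = np.array([0] * 15 + [1] * 15)
X[0] += 3.0 * y          # gene 0 rises with class 1 -> r near +1
X[1] -= 3.0 * y          # gene 1 falls with class 1 -> r near -1

def pearson_r(gene, labels):
    """Pearson correlation between one gene's expression and the class label."""
    g = gene - gene.mean()
    l = labels - labels.mean()
    return (g @ l) / np.sqrt((g @ g) * (l @ l))

r = np.array([pearson_r(X[i], y) for i in range(X.shape[0])])
ranked = np.argsort(-np.abs(r))   # informative genes have |r| near 1
```

Genes whose expression rises with the class label get r near +1 and those that fall get r near -1, matching the (+)/(-) direction described above; genes unrelated to the class stay near zero.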
In [1], many algorithms designed for solving classification problems in machine learning have been applied to recent research on the prediction and classification of cancer with gene expression data. The general process of classification in machine learning is to train classifiers to accurately recognize patterns from given training samples and to classify test samples with the trained classifier. Representative classification algorithms such as the multi-layer perceptron, k-nearest neighbour, the support vector machine, and structure-adaptive self-organizing maps are applied to the classification. [3] The optimization criterion for feature selection can be independent of the classifier, which works on the selected features, or it can be combined with a classifier in both the selection process and the classification stage.
In [4], we compare the performance of different discrimination methods for tumour classification problems based on different feature selection methods. We consider four feature extraction methods and five classification methods, from which 20 classification models can be derived. Each classification model is a combination of one feature extraction method and one classification method. The feature extraction methods are t-statistics, non-parametric Wilcoxon statistics, ad hoc signal-to-noise statistics, and principal component analysis (PCA), and the classification methods are Fisher linear discriminant analysis (FLDA), the support vector machine (SVM), the k-nearest-neighbour classifier (kNN), diagonal linear discriminant analysis (DLDA), and diagonal quadratic discriminant analysis (DQDA). The selection of the parameters used in these feature extraction and classification methods is supervised. These discrimination methods are then applied to three well-known publicly available microarray datasets: acute leukaemia data, prostate cancer data, and lung cancer data.
Classification methods
In [1], the feed-forward multilayer perceptron (MLP) is an error back-propagation neural network that is applied in many fields owing to its powerful and stable learning algorithm. The neural network learns the training examples by adjusting the synaptic weights of its neurons according to the error occurring at the output layer. The power of the back-propagation algorithm lies in two main aspects: it is local, for updating the synaptic weights and biases, and efficient, for computing all the partial derivatives of the cost function with respect to these free parameters.
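The weight-update mechanism described above can be sketched with a minimal back-propagation MLP, shown here on the XOR toy problem rather than microarray data. The layer size, learning rate, and epoch count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# XOR: a classic task that needs a hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 8 sigmoid units.
W1 = rng.standard_normal((2, 8)); b1 = np.zeros(8)
W2 = rng.standard_normal((8, 1)); b2 = np.zeros(1)

lr = 1.0
for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: propagate the output-layer error to every weight
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

preds = (sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5).astype(float)
```

The backward pass is exactly the "local" update the text mentions: each layer's weights change using only its own input and the error signal propagated back to it.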
K-Nearest-Neighbor (KNN)
In [1], k-nearest neighbor (KNN) is one of the most common memory-based induction methods. Given an input vector, KNN extracts the k closest vectors in the reference set based on similarity measures and decides the label of the input vector using the labels of these k neighbors.
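A minimal KNN sketch, assuming Euclidean distance as the similarity measure and majority voting over the k neighbors; both are common choices, not necessarily those of [1], and the reference data are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(x, ref_X, ref_y, k=3):
    """Classify x by majority vote among its k nearest reference vectors,
    using Euclidean distance as the similarity measure."""
    dists = np.linalg.norm(ref_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(ref_y[nearest].tolist()).most_common(1)[0][0]

# Hypothetical 2-D reference set with two classes.
ref_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
ref_y = np.array([0, 0, 0, 1, 1, 1])

pred = knn_predict(np.array([4.8, 5.1]), ref_X, ref_y)
```

Because KNN stores the reference set and defers all computation to query time, it is "memory-based" in exactly the sense used above.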
In [1], a support vector machine (SVM) estimates a function classifying the data into two classes. An SVM builds a hyperplane as the decision surface in such a way as to maximize the margin of separation between positive and negative examples. The SVM achieves this through the structural risk minimization principle: the error rate of a learning machine on test data is bounded by the sum of the training-error rate and a term that depends on the Vapnik-Chervonenkis (VC) dimension. In [4], the SVM finds the best hyperplane separating the two classes in the training set. The best hyperplane is generally found by maximizing the sum of the distances from the hyperplane to the closest correctly classified positive and negative observations, while penalizing the number of misclassifications.
In [6], a support vector machine (SVM) classification algorithm is used to evaluate the selected features and to establish their influence on classification accuracy. The results indicate that, in terms of the number of genes that need to be selected and classification accuracy, the proposed method is superior to other methods in the literature.
Ensemble classifier
In [1], classification can be defined as the process of approximating the I/O mapping from the given observations to the optimal solution. Generally, classification tasks consist of two parts: feature selection and classification. Feature selection is a transformation process applied to observations to obtain the best pathway to the optimal solution. Therefore, considering multiple features yields various candidate solutions, from which we can estimate a solution closer to the optimum than any single local optimum.
In [1], when we have multiple features available, it is important to know which features should be used. Theoretically, the more features concerned, the more effectively the classifier may solve the problem. But features that overlap in feature space may cause redundancy of irrelevant information and result in counter-effects such as overfitting. Therefore, it is more important to explore and utilize independent features to train classifiers than to increase the number of features we use. Correlation between feature sets can be deduced from the distribution of the features.
Wavelet Approach
In [5], a wavelet transform is a lossless linear transformation of a signal or data into coefficients on a basis of wavelet functions. The coefficients yielded by the wavelet transform contain information about the characteristics of the data at different scales. Fine scales capture the local details of a signal, and coarse scales capture its global features. The discrete wavelet transform (DWT) of a signal x is performed by passing it through low-pass filters (scaling functions) and high-pass filters (wavelet functions).
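As an illustration, one level of the simplest DWT, the Haar transform: the low-pass branch keeps pairwise averages (coarse scale) and the high-pass branch keeps pairwise differences (fine detail). The input signal is hypothetical, and the Haar wavelet is chosen here only for brevity, not because [5] uses it.

```python
import numpy as np

def haar_dwt_step(x):
    """One level of the Haar DWT: the low-pass (scaling) branch yields
    pairwise averages (coarse scale), the high-pass (wavelet) branch
    yields pairwise differences (fine-scale detail)."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-pass output
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-pass output
    return approx, detail

signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
approx, detail = haar_dwt_step(signal)

# The transform is lossless: the original signal is recovered exactly.
recon = np.empty_like(signal)
recon[0::2] = (approx + detail) / np.sqrt(2)
recon[1::2] = (approx - detail) / np.sqrt(2)
```

Exact reconstruction from the approximation and detail coefficients demonstrates the losslessness claimed above; applying the step recursively to `approx` gives the coarser scales.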
Figure 3.1
In [7], accurate classification of microarray data plays a vital role in cancer prediction and diagnosis.
Previous studies have demonstrated the usefulness of naïve Bayes classifier in solving various
classification problems. In microarray data analysis, however, the conditional independence assumption
embedded in the classifier itself and the characteristics of microarray data, e.g. the extremely high
dimensionality, may severely affect the classification performance of naïve Bayes classifier. This paper
presents a sequential feature extraction approach for naïve Bayes classification of microarray data. The
proposed approach consists of feature selection by stepwise regression and feature transformation by
class-conditional independent component analysis. Experimental results on five microarray datasets
demonstrate the effectiveness of the proposed approach in improving the performance of naïve Bayes
classifier in microarray data analysis.
Figure 3.2
Weighted Voting: A poor classifier can affect the result of the ensemble in majority voting, because majority voting gives the same weight to all classifiers. Weighted voting reduces the effect of a poor classifier by giving each classifier a different weight based on its performance.
Bayesian Combination: While the majority voting method combines classifiers by their results alone, Bayesian combination lets the error probability of each classifier affect the final result. The method combines the classifiers with different weights by using prior knowledge about each classifier.
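The three combination rules can be sketched on hypothetical classifier outputs. The predictions and accuracy estimates below are made up for illustration, and the Bayesian rule assumes each classifier is correct with a fixed known probability (a simplification of the "previous knowledge" mentioned above).

```python
import numpy as np

# Hypothetical 0/1 predictions of three classifiers on five samples.
preds = np.array([[1, 0, 1, 1, 0],
                  [1, 1, 0, 1, 0],
                  [0, 1, 1, 1, 1]])

# Hypothetical validation accuracies of the three classifiers.
acc = np.array([0.9, 0.8, 0.55])

# Majority voting: every classifier gets the same weight.
majority = (preds.mean(axis=0) > 0.5).astype(int)

# Weighted voting: weight each vote by the classifier's accuracy, so a
# poor classifier contributes less to the ensemble decision.
w = acc / acc.sum()
weighted = (w @ preds > 0.5).astype(int)

# Bayesian combination: treat each classifier as correct with probability
# acc[i] and combine the votes as likelihoods (flat prior, binary classes).
def bayes_combine(votes, acc):
    post = np.ones(2)               # unnormalized posterior for classes 0, 1
    for v, a in zip(votes, acc):
        for c in (0, 1):
            post[c] *= a if v == c else 1 - a
    return int(np.argmax(post))

bayes = np.array([bayes_combine(preds[:, j], acc) for j in range(preds.shape[1])])
```

On the second sample, the two weaker classifiers outvote the strongest one, so majority voting returns 1, while the Bayesian rule, which trusts the 90%-accurate classifier more, returns 0.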
[2] Yan Chen, Xiu-Xia Li, Yi-Xue Li, Yun-Ping Zhu, and Chun-Hou Zheng, "Tumor classification based on independent component analysis", International Journal of Pattern Recognition and Artificial Intelligence, pp. 297–310, November 2006.
[4] Jing Zhang, Tianzi Jiang, Bing Liu, Xingpeng Jiang, and Huizhi Zhao, "Systematic benchmarking of microarray data feature extraction and classification", International Journal of Computer Mathematics, pp. 803–811, May 2008.
[6] Cheng-San Yang, Li-Yeh Chuang, Chao-Hsuan Ke, and Cheng-Hong Yang, "A hybrid feature selection method for microarray classification", IAENG International Journal of Computer Science, pp. 285–290, 2008.
[7] Liwei Fan and Peng Zhou, "A sequential feature extraction approach for naïve Bayes classification of microarray data", pp. 9919–9923, August 2009.