Arraycluster: An Analytic Tool For Clustering, Data Visualization and Module Finder On Gene Expression Profiles
Team members: 李祥豪
謝紹陽
江建霖
Outline
Introduction
Mixed Factors Model
Analytic Tools
Summary
Demo
Introduction
This task can be addressed by grouping the expression patterns of a large number of genes.
Typical microarray data have a fairly small sample size, fewer than 100, whereas the number of genes involved is more than several thousand.
Introduction
One major difficulty in this problem is that the number of samples to be clustered is much smaller than the dimension of the data.
Most clustering techniques, e.g. k-means, Gaussian mixture clustering, and hierarchical clustering, are limited by over-learning (overfitting) in this setting.
Introduction
In statistics, overfitting is fitting a statistical model that has too many parameters.
When the degrees of freedom in parameter selection exceed the information content of the data, the final (fitted) model parameters become arbitrary, which reduces or destroys the ability of the model to generalize beyond the fitting data.
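To make the mismatch concrete, the following minimal sketch counts the free parameters of an unconstrained (full-covariance) Gaussian mixture and compares them with the number of measured values. The function name and the sizes (80 samples, 5000 genes, 3 clusters) are hypothetical and chosen only to fall in the range described in the slides.

```python
def gmm_free_parameters(n_components: int, dim: int) -> int:
    """Free parameters of a Gaussian mixture with full covariance matrices."""
    weights = n_components - 1                           # mixing proportions sum to 1
    means = n_components * dim                           # one mean vector per component
    covariances = n_components * dim * (dim + 1) // 2    # symmetric dim x dim matrices
    return weights + means + covariances


# Hypothetical sizes: fewer than 100 samples, several thousand genes.
n_samples, n_genes, n_clusters = 80, 5000, 3
n_params = gmm_free_parameters(n_clusters, n_genes)
n_observed = n_samples * n_genes   # number of measured expression values

print("free parameters:", n_params)
print("observed values:", n_observed)
print("more parameters than data:", n_params > n_observed)
```

Under these assumed sizes the covariance matrices alone need far more parameters than there are measurements, which is the situation the overfitting definition describes.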
Introduction
In machine learning, a learning algorithm is usually trained using some set of training examples.
Especially when learning is performed for too long or training examples are rare, the learner may adjust to very specific random features of the training data that have no causal relation to the target function.
Introduction
In both statistics and machine learning, avoiding overfitting requires additional techniques (e.g. cross-validation, early stopping, Bayesian priors on parameters, or model comparison) that can indicate when further training is no longer resulting in better generalization.
Mixed Factors Model
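The mixed factors model presents a parsimonious parameterization of the Gaussian mixture model.
Our primary intention is to describe the group structure of the data parsimoniously, based on the factor variables. To this end, we devise mixed factors f_j that follow a G-component Gaussian mixture:
$$p(f_j) = \sum_{g=1}^{G} \alpha_g \, \phi(f_j; \mu_g, \Sigma_g)$$
Here \alpha_g are the mixing proportions and \phi(\cdot; \mu_g, \Sigma_g) is the Gaussian density with mean \mu_g and covariance \Sigma_g.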
Mixed Factors Model
With the mixed factors model, we can potentially avoid over-fitting of the Gaussian mixture by choosing an appropriate factor dimension, regardless of the high dimensionality of the data.
Once the model has been fitted to a given dataset, clustering can be addressed by the Bayes rule.
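Clustering by the Bayes rule means assigning each sample to the component with the highest posterior probability given its factor vector. The sketch below illustrates this under a simplifying assumption that is not in the slides: each component has a diagonal covariance, so the density factorizes over coordinates. All function names and parameter values are hypothetical.

```python
import math


def diag_gaussian_pdf(f, mean, var):
    """Gaussian density with diagonal covariance (assumed simplification)."""
    log_p = 0.0
    for x, m, v in zip(f, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_p)


def posterior(f, weights, means, variances):
    """Posterior P(g | f), proportional to weight_g * density_g(f)."""
    joint = [w * diag_gaussian_pdf(f, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def assign(f, weights, means, variances):
    """Index of the component with the largest posterior probability."""
    post = posterior(f, weights, means, variances)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical fitted parameters for G = 2 components in q = 2 dimensions.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

print(assign([0.2, -0.1], weights, means, variances))
print(assign([3.8, 4.1], weights, means, variances))
```

For points very far from every component the densities can underflow to zero; a production implementation would work with log-densities instead.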
Mixed Factors Model
To avoid it, we impose orthogonality on the q columns of the factor loading matrix A.
This constraint leads to a canonical representation of the mixed factors model:
$$A^{T} x_j = f_j + A^{T} \varepsilon_j$$
From this equation, it follows that the q canonical variates A^{T} x_j \in R^{q} are distributed according to
$$p(A^{T} x_j) = \sum_{g=1}^{G} \alpha_g \, \phi(A^{T} x_j; \mu_g, \Sigma_g + \lambda I)$$
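The step from the orthogonality constraint to the distribution of the canonical variates can be written out as follows. This is a sketch under assumptions that the slides do not state explicitly: the observation model is x_j = A f_j + \varepsilon_j, the noise \varepsilon_j is Gaussian with covariance \lambda I_d and independent of f_j, and the factors follow a G-component Gaussian mixture with mixing proportions \alpha_g, means \mu_g and covariances \Sigma_g.

```latex
\begin{align*}
x_j &= A f_j + \varepsilon_j, \qquad
  \varepsilon_j \sim N(0, \lambda I_d), \qquad A^{T} A = I_q \\
A^{T} x_j &= A^{T} A f_j + A^{T} \varepsilon_j = f_j + A^{T} \varepsilon_j \\
\operatorname{Cov}(A^{T} \varepsilon_j) &= \lambda A^{T} A = \lambda I_q \\
p(A^{T} x_j) &= \sum_{g=1}^{G} \alpha_g \,
  \phi(A^{T} x_j; \mu_g, \Sigma_g + \lambda I_q)
\end{align*}
```

The last line uses the fact that the sum of two independent Gaussian vectors is Gaussian with the covariances added, applied within each mixture component.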
Mixed Factors Model
The canonical variates can be considered as the q modules of genes that are relevant to the existing molecular subtypes.
This process yields a feature selection that constructs good discriminators for the existing groups as linear combinations of the d genes.
Analytic Tools
File format of data file
Analytic Tools
Model selection based on the BIC curve
Analytic Tools
In this plot, the horizontal and vertical axes correspond to the factor dimension and the BIC score, respectively.
Each line represents the curve of BIC scores against varying factor dimensions (q) for a fixed number of clusters (G).
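Reading the BIC curves amounts to a grid search over (G, q). The sketch below uses the common definition BIC = -2 log L + k log n, where lower is better; whether ArrayCluster plots this sign convention or its negative is an assumption here. The log-likelihoods and parameter counts in `candidates` are made-up illustrative numbers, not results from any dataset.

```python
import math


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """BIC = -2 log L + k log n (lower is better under this convention)."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)


def select_model(candidates, n_samples):
    """candidates maps (G, q) -> (maximized log-likelihood, parameter count)."""
    scores = {gq: bic(ll, k, n_samples) for gq, (ll, k) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fitted models: (G, q) -> (log-likelihood, number of parameters).
candidates = {
    (2, 2): (-1250.0, 40),
    (2, 3): (-1190.0, 60),
    (3, 2): (-1180.0, 55),
    (3, 3): (-1175.0, 80),
}

best, scores = select_model(candidates, n_samples=80)
for (G, q), score in sorted(scores.items()):
    print(f"G={G} q={q} BIC={score:.2f}")
print("selected (G, q):", best)
```

Each fixed G corresponds to one line in the plot, and each q to one point on that line.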
Analytic Tools
File format of mixed_factors
Analytic Tools
Box plot of the computed factor scores
Analytic Tools
Clusters are separated by blank lines. All samples in one cluster are ordered according to their degree of membership, measured by the Mahalanobis distance between each sample point and the corresponding group centroid. The calculated distances are shown next to the sample identifiers.
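The ordering described above can be sketched as follows. The Mahalanobis distance is sqrt((x - c)^T S^{-1} (x - c)) for centroid c and covariance S; the code solves the linear system instead of inverting S. Ascending order (closest sample first) is an assumption, and the sample identifiers, centroid and covariance are hypothetical.

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis(point, centroid, cov):
    """Mahalanobis distance between a point and a group centroid."""
    diff = [p - c for p, c in zip(point, centroid)]
    z = solve(cov, diff)                      # z = cov^{-1} diff
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))


def order_by_distance(samples, centroid, cov):
    """samples maps a sample identifier to its factor-score vector."""
    scored = [(mahalanobis(v, centroid, cov), name) for name, v in samples.items()]
    return sorted(scored)                     # closest to the centroid first


# Hypothetical cluster with three samples in a 2-dimensional factor space.
centroid = [0.0, 0.0]
cov = [[2.0, 0.0], [0.0, 0.5]]
samples = {"s1": [2.0, 0.0], "s2": [0.0, 0.5], "s3": [1.0, 1.0]}

for dist, name in order_by_distance(samples, centroid, cov):
    print(f"{name}  {dist:.4f}")
```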
Analytic Tools
File format of relevant_set
Analytic Tools
Relevant module profiling
After selecting rows (genes) of interest, the enlarged expression image is displayed in the right window.
Analytic Tools
ArrayCluster provides users with a usable environment for performing the following tasks:
Parameter estimation of the mixed factors model: ArrayCluster computes the maximum likelihood estimators using the EM algorithm.
Determination of the number of clusters and the factor dimension (the number of group-related modules): These are selected based on the Bayesian information criterion (BIC).
Clustering based on the Bayes rule
Analytic Tools
Dimension reduction of data: This task is addressed in the same way as in classical factor analysis. However, the mixed factors analysis explicitly reflects the existing group structure of the original data, while classical factor analysis ignores it during the dimension reduction.
Identification of the group-related genes: In ArrayCluster, the relevant genes in each module are selected as the top L (user-specified) genes with the highest positive (negative) correlation with each element of the factor vector.
Analytic Tools
Identification of the modules: By separating the genes that are positively and negatively correlated with the factor vector in a module, we identify 2q modules in total.
Missing data imputation
Data preprocessing: The methods
include normalization and gene filtering
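The selection of group-related genes and the split into positive and negative modules can be sketched for a single factor as follows. Pearson correlation is assumed as the correlation measure, and the function names, gene names and values are hypothetical. Repeating the split for each of the q factors gives the 2q modules.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def split_module(expression, factor_scores, top_l):
    """Return the top-L positively and top-L negatively correlated genes.

    expression maps a gene name to its values across samples;
    factor_scores holds one factor's scores for the same samples.
    """
    corr = {gene: pearson(values, factor_scores)
            for gene, values in expression.items()}
    ranked = sorted(corr, key=corr.get, reverse=True)
    positive = [g for g in ranked if corr[g] > 0][:top_l]
    negative = [g for g in reversed(ranked) if corr[g] < 0][:top_l]
    return positive, negative


# Hypothetical expression values for five genes across four samples.
factor_scores = [1.0, 2.0, 3.0, 4.0]
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneC": [1.0, 3.0, 2.0, 4.0],
    "geneD": [2.0, 2.0, 2.0, 2.0],
    "geneE": [3.0, 4.0, 1.0, 2.0],
}

positive, negative = split_module(expression, factor_scores, top_l=2)
print("positive module:", positive)
print("negative module:", negative)
```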
Summary
ArrayCluster visualizes the computed factor scores using the box plot matrix, enhancing the graphical understanding of the group structure.
A causal link from the calibrated clusters to biological knowledge can be elucidated through inspection of the group-related modules.
Summary
ArrayCluster displays the expression patterns of these modules.
The genes in these modules and their visualization give us scope to question where the calibrated clusters come from.
Thanks for your attention
Next->DEMO