Arraycluster: An Analytic Tool For Clustering, Data Visualization and Module Finder On Gene Expression Profiles

The document describes ArrayCluster, an analytic tool for clustering gene expression profiles, performing data visualization, and identifying modules of genes. It uses a mixed factors model to cluster gene expression data in a way that avoids overfitting issues common in other clustering algorithms. The tool estimates model parameters, determines the optimal number of clusters and factor dimensions, performs clustering, identifies relevant gene modules for each cluster, and allows visualization of factor scores and gene expression patterns to help understand cluster structure and biological significance.

Uploaded by

Joker Jr

ArrayCluster:

an analytic tool for clustering, data visualization

and module finder on gene expression profiles

Team members: 李祥豪, 謝紹陽, 江建霖
Outline
• Introduction
• Mixed Factors Model
• Analytic Tools
• Summary
• Demo
Introduction
• This task can be addressed by grouping the expression patterns of a large number of genes.
• Typical microarray data have a fairly small sample size, fewer than 100, whereas the number of genes involved is more than several thousand.
Introduction
• One major difficulty in this problem is that the number of samples to be clustered is much smaller than the dimension of the data.
• Most clustering techniques, e.g. k-means, Gaussian mixture clustering, and hierarchical clustering, are limited by over-learning (overfitting) in this setting.
Introduction
• In statistics, overfitting is fitting a statistical model that has too many parameters.
• When the degrees of freedom in parameter selection exceed the information in the data, the final (fitted) model parameters become arbitrary, which reduces or destroys the ability of the model to generalize beyond the fitting data.
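As a small illustration of this arbitrariness, the following Python sketch (not part of the original slides; the sizes and random data are made up) fits a linear model with more parameters than observations. Two very different parameter vectors reproduce the training data exactly, so the data alone cannot decide between them.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_params = 5, 20            # fewer observations than parameters
X = rng.normal(size=(n_samples, n_params))
y = rng.normal(size=n_samples)         # pure noise: there is nothing real to learn

# Minimum-norm least-squares solution: it fits the training data exactly.
w_a, *_ = np.linalg.lstsq(X, y, rcond=None)

# Adding any vector from the null space of X gives another exact fit.
_, _, vt = np.linalg.svd(X)
null_direction = vt[-1]                # X @ null_direction is numerically zero
w_b = w_a + 10.0 * null_direction

fit_a = np.allclose(X @ w_a, y)
fit_b = np.allclose(X @ w_b, y)
param_gap = np.linalg.norm(w_a - w_b)  # distance between the two "solutions"

print(fit_a, fit_b, param_gap)
```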
Introduction
• In machine learning, a learning algorithm is usually trained on a set of training examples. Especially when learning is performed for too long or training examples are rare, the learner may adjust to very specific random features of the training data that have no causal relation to the target function.
Introduction
• In both statistics and machine learning, avoiding overfitting requires additional techniques (e.g. cross-validation, early stopping, Bayesian priors on parameters, or model comparison) that can indicate when further training no longer results in better generalization.
Mixed Factors Model
• The mixed factors model provides a parsimonious parameterization of the Gaussian mixture model.
• Our primary intention is to describe the group structure of the data parsimoniously, based on the factor variables. To this end, we devise mixed factors that follow a G-component Gaussian mixture:
$$p(f_j) = \sum_{g=1}^{G} \alpha_g \, \phi(f_j; \mu_g, \Sigma_g)$$
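A minimal numerical sketch of this G-component mixture density, assuming NumPy. The function names and the parameter values (two components, two factors) are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(f, mean, cov):
    """Density of a multivariate normal N(mean, cov) at the point f."""
    q = mean.shape[0]
    diff = f - mean
    maha = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** q * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def mixture_density(f, alphas, means, covs):
    """p(f) = sum over g of alpha_g * phi(f; mu_g, Sigma_g)."""
    return sum(a * gaussian_pdf(f, m, c) for a, m, c in zip(alphas, means, covs))

# Hypothetical parameters: G = 2 components, q = 2 factors.
alphas = np.array([0.4, 0.6])                      # mixing proportions, sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]

density_at_origin = mixture_density(np.zeros(2), alphas, means, covs)
print(density_at_origin)
```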
Mixed Factors Model
• With the mixed factors model, we can potentially avoid the overfitting of the Gaussian mixture by choosing an appropriate factor dimension, regardless of the high dimensionality of the data.
• Once the model has been fitted to a given dataset, clustering can be performed by the Bayes rule.
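To see why a small factor dimension keeps the model parsimonious, one can count free parameters. The sketch below is a rough accounting under stated assumptions, not a formula taken from the slides: it compares a full-covariance Gaussian mixture in d dimensions with a mixed factors model that has a d x q loading matrix with orthonormal columns and a single noise variance. The example sizes are hypothetical.

```python
def full_mixture_params(d, G):
    """Free parameters of a G-component full-covariance Gaussian mixture in d dims."""
    weights = G - 1
    means = G * d
    covariances = G * d * (d + 1) // 2
    return weights + means + covariances

def mixed_factors_params(d, G, q):
    """Free parameters of a mixed factors model (assumed accounting, see lead-in)."""
    weights = G - 1
    means = G * q                          # component means live in factor space
    covariances = G * q * (q + 1) // 2     # q x q covariance per component
    loadings = d * q - q * (q + 1) // 2    # d x q matrix with orthonormal columns
    noise = 1                              # one isotropic noise variance
    return weights + means + covariances + loadings + noise

# Hypothetical sizes: 5000 genes, 3 clusters, 4 factors.
d, G, q = 5000, 3, 4
n_full = full_mixture_params(d, G)
n_mixed = mixed_factors_params(d, G, q)
print(n_full, n_mixed)
```

With fewer than 100 samples, the first count is far beyond what the data can support, while the second grows only linearly in the number of genes.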
Mixed Factors Model
• To avoid non-identifiability of the factor loadings, we impose orthogonality on the q columns of the factor loading matrix A.
• This constraint leads to a canonical representation of the mixed factors model:
$$A^T x_j = f_j + A^T \varepsilon_j$$
• From this equation, it follows that the q canonical variates $A^T x_j \in \mathbb{R}^q$ are distributed according to
$$p(A^T x_j) = \sum_{g=1}^{G} \alpha_g \, \phi(A^T x_j; \mu_g, \Sigma_g + \lambda I)$$
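Clustering by the Bayes rule can then be carried out on the canonical variates. The following is a minimal sketch, assuming NumPy and already-fitted parameters; the function names and the toy values (3 genes, 2 factors, 2 clusters) are hypothetical, and this is not ArrayCluster's actual code.

```python
import numpy as np

def log_gaussian(z, mean, cov):
    """Log density of N(mean, cov) evaluated at z."""
    diff = z - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(z) * np.log(2.0 * np.pi) + logdet + maha)

def cluster_posteriors(x, A, alphas, means, covs, lam):
    """Posterior probability of each cluster given one sample x (Bayes rule)."""
    z = A.T @ x                                # q canonical variates
    q = z.shape[0]
    log_w = np.array([
        np.log(a) + log_gaussian(z, m, c + lam * np.eye(q))
        for a, m, c in zip(alphas, means, covs)
    ])
    log_w -= log_w.max()                       # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

# Hypothetical toy setting: d = 3 genes, q = 2 factors, G = 2 clusters.
A = np.eye(3)[:, :2]                           # loading matrix with orthonormal columns
alphas = [0.5, 0.5]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
x = np.array([2.9, 3.1, 0.5])

probs = cluster_posteriors(x, A, alphas, means, covs, lam=0.1)
label = int(np.argmax(probs))                  # assign to the most probable cluster
print(probs, label)
```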
Mixed Factors Model
• The canonical variates can be considered as the q modules of genes that are relevant to the existing molecular subtypes.
• This process yields a feature selection that constructs good discriminators for the existing groups as linear combinations of d genes.
Analytic Tools
• File format of the data file
Analytic Tools
• Model selection based on the BIC curve
Analytic Tools
• In this plot, the horizontal and vertical axes correspond to the factor dimension and the BIC score, respectively. Each line shows the BIC scores against varying factor dimensions (q) for a fixed number of clusters (G).
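The selection step behind such a plot can be sketched as follows. This is an illustration under assumptions, not ArrayCluster's implementation: it uses the common convention BIC = -2 log L + k log n (lower is better), and the log-likelihoods and parameter counts in `fits` are invented values.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; lower is better under this convention."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

# Hypothetical fitted results: (G, q) -> (maximised log-likelihood, free parameters).
fits = {
    (2, 1): (-520.0, 30),
    (2, 2): (-480.0, 45),
    (3, 2): (-470.0, 52),
    (3, 3): (-468.0, 70),
}
n_samples = 60

scores = {gq: bic(ll, k, n_samples) for gq, (ll, k) in fits.items()}
best_G, best_q = min(scores, key=scores.get)   # pair with the smallest BIC
print(best_G, best_q)
```

In this made-up table the richer models fit slightly better but pay a larger penalty, so the pair (G, q) = (2, 2) is selected.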
Analytic Tools
• File format of mixed_factors
Analytic Tools
• Box plot of the computed factor scores
Analytic Tools
• Clusters are separated by blank lines. All samples in one cluster are ordered according to their degree of belonging, measured by the Mahalanobis distance between each sample point and the corresponding group centroid. The calculated distances are shown next to the sample identifiers.
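The ordering described here can be sketched in a few lines. The sample names, factor scores, centroid, and covariance below are hypothetical, and computing the distance in factor space is an assumption, since the slides do not say which space is used.

```python
import numpy as np

def mahalanobis(point, centroid, cov):
    """Mahalanobis distance between a sample point and a group centroid."""
    diff = point - centroid
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Hypothetical factor scores of three samples assigned to the same cluster.
cluster_scores = {
    "sample_1": np.array([0.1, 0.0]),
    "sample_2": np.array([2.0, -1.0]),
    "sample_3": np.array([0.5, 0.5]),
}
centroid = np.zeros(2)
cov = np.eye(2)

distances = {name: mahalanobis(v, centroid, cov) for name, v in cluster_scores.items()}
ordered = sorted(distances, key=distances.get)   # closest to the centroid first

for name in ordered:
    print(name, round(distances[name], 3))
```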
Analytic Tools
• File format of relevant_set
Analytic Tools
• Relevant module profiling
• After selecting the rows (genes) of interest, an enlarged expression image is displayed in the right window.
Analytic Tools
• ArrayCluster provides a user-friendly environment for the following tasks:
  ◦ Parameter estimation of the mixed factors model: ArrayCluster computes the maximum likelihood estimators using the EM algorithm.
  ◦ Determination of the number of clusters and the factor dimension (the number of group-related modules): these are selected based on the Bayesian information criterion (BIC).
  ◦ Clustering based on the Bayes rule
Analytic Tools
  ◦ Dimension reduction of data: this task is addressed in the same way as in classical factor analysis. However, the mixed factors analysis explicitly reflects the existing group structure of the original data, whereas classical factor analysis ignores it during dimension reduction.
  ◦ Identification of the group-related genes: in ArrayCluster, the relevant genes in each module are selected as the top L genes (L is user-specified) with the highest positive (or negative) correlation with each element of the factor vector.
Analytic Tools
  ◦ Identification of the modules: by separating the genes that are positively and negatively correlated with each element of the factor vector, we identify 2q modules in total.
  ◦ Missing data imputation
  ◦ Data preprocessing: the methods include normalization and gene filtering.
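The selection of group-related genes and the resulting 2q modules can be sketched as follows. This is an illustrative sketch assuming NumPy and Pearson correlation; the function names and the toy expression values are hypothetical, and genes with constant expression are assumed to have been filtered out beforehand (they would cause a division by zero here).

```python
import numpy as np

def gene_factor_correlations(expr, factors):
    """Pearson correlation of each gene (row of expr) with each factor (row of factors)."""
    e = expr - expr.mean(axis=1, keepdims=True)
    f = factors - factors.mean(axis=1, keepdims=True)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    return e @ f.T                                  # shape: genes x q

def find_modules(expr, factors, top_l):
    """Return 2q modules: top-L positively and negatively correlated genes per factor."""
    corr = gene_factor_correlations(expr, factors)
    modules = {}
    for k in range(factors.shape[0]):
        order = np.argsort(corr[:, k])              # ascending correlation
        modules[(k, "negative")] = order[:top_l].tolist()
        modules[(k, "positive")] = order[::-1][:top_l].tolist()
    return modules

# Hypothetical toy data: 4 genes x 4 samples, q = 1 factor.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 4.0],
    [2.0, 1.0, 4.0, 3.0],
])
factors = np.array([[1.0, 2.0, 3.0, 4.0]])

modules = find_modules(expr, factors, top_l=1)
print(modules)
```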
Summary
• ArrayCluster visualizes the computed factor scores using a box plot matrix.
• This enhances the graphical understanding of the group structure.
• A causal link from the calibrated clusters to biological knowledge can be elucidated through inspection of the group-related modules.
Summary
• ArrayCluster displays the expression patterns of these modules.
• The genes in these modules and their visualization give us scope to question where the calibrated clusters come from.
Thanks for your attention

Next->DEMO
