Microarray Data Analysis-Springer
Microarray
Data Analysis
Methods and Applications
Second Edition
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
Edited by
Preface
The development of novel technological platforms in molecular biology has greatly stimulated
research and, in particular, has driven the growth of bioinformatics to support the storage,
management, and analysis of large amounts of data about different aspects of the omic world.
Here we focus on two main techniques for studying the activity of the transcriptome, i.e.,
the set of molecules that play a role in the complex mechanism of protein synthesis. Such a
study focuses on the role of mRNA, i.e., coding fragments of messenger RNA, and miRNA, i.e.,
small fragments of noncoding RNA. These studies have been conducted through two main
technological platforms: microarray and miRNA-microarray. More recently, next-generation
sequencing techniques have gained a prominent role. Despite this, classical microarray studies
are still alive, as shown by the considerable number of published papers on the generation
and analysis of microarray data.
The flow of information in this field starts from technological platforms that produce
different kinds of data. Examples of such platforms are microarrays for studying the
expression of messenger RNA (mRNA) and microRNA (miRNA); genomic microarrays for studying
copy number variations (CNV) or single-nucleotide polymorphisms (SNP); novel microarrays for
studying noncoding RNAs (e.g., miRNA); and genomic arrays for pharmacogenomics.
Classical studies focused on identifying the role of a single class of molecules in a
specific disease. Therefore they analyzed a single class of data. More recently, the
biological assumption that different molecules (e.g., miRNA, mRNA, or transcription factors)
are strongly correlated has led to the rise of a novel discipline, often referred to as
computational systems biology or network systems biology. In this discipline, computer
science, bioinformatics, and mathematical modeling play a synergistic role in the
interpretation of large data sets coming from different data sources. Consequently,
considerable attention has been paid to the development of integrated methods of analysis,
often based on distributed or high-performance architectures (e.g., Cloud) or on
semantic-based approaches, for extracting biologically relevant knowledge from data. In
parallel, a growing number of biological and medical papers have demonstrated the real
application of these methodologies.
This book is intended to cover the main aspects of this broad area, from the description of
methodologies for data analysis to their real application. The intended audience is students
and researchers who need to learn the main research topics, as well as practitioners who
want an overview of applications. The structure of the chapters also makes the book suitable
for use in bioinformatics courses.
The book is composed of 16 chapters. It starts by presenting the main concepts related to
data analysis. Wu and Gantier present the main methodologies for preprocessing of microarray
data in Chapter 1. Cristiano and Veltri present a survey of miRNA data analysis in
Chapter 2, while Calabrese and Cannataro discuss the rise of Cloud-based approaches in
Chapter 3. Chapter 4 by Lopez Kleine et al. presents the application of data mining
techniques for data analysis, and in Chapter 5 Deveci et al. focus on the use of biclustering
to query different datasets. In Chapter 6 Chang and Lin discuss a web-based tool to
analyze the evolution of miRNA clusters. Roy et al. present in Chapter 7 the application
of biclustering to mine patterns of co-regulated genes. Chapters 8 and 9 present the use of
ontologies; in particular, Ovaska discusses the use of the csbl.go tool, while Agapito and
Milano survey the main existing tools for semantic similarity analysis of microarray data.
Wang et al. in Chapter 10 introduce the integration of microarray and proteomic data.
Chapter 11 by Koumakis et al. discusses the relevance of Gene Regulatory Network Inference,
while Chapter 12 by Roy and Guzzi focuses on the assessment of Gene Regulatory Network
methods. The remaining chapters present some relevant applications in different medical
fields. Chapter 13 by Gan et al. deals with the analysis of mouse data for metabolomics
studies. Chapter 14 by Di Martino et al. surveys the functional analysis of microRNA data
in multiple myeloma, which is currently an active research area. Chapter 15 by Bhawe and Aghi
presents the application of microarray data analysis in glioblastomas. Finally, Chapter 16
discusses the analysis of microRNA data in cardiogenesis.
Abstract
microRNA (miRNA) microarray normalization is a critical step for the identification of truly differentially
expressed miRNAs. This is particularly important when dealing with cancer samples that have a global
miRNA decrease. In this chapter, we provide a simple step-by-step procedure that can be used to normalize
Affymetrix miRNA microarrays, relying on robust normal-exponential background correction with cyclic
loess normalization.
1 Introduction
2 Di Wu and Michael P. Gantier
2 Materials
2.2 miRNA Affymetrix Microarray (Version 1.0 or Later)
The command lines provided below are specifically designed for our published dataset from
Dicer-deficient cells, to be used as an example of the overall normalization procedure. The
nine .CEL files (from GSM1118272_MG1.CEL to GSM1118280_MG9.CEL) can be downloaded from Gene
Expression Omnibus (GEO), accession number GSE45886. Briefly, miRNA levels were detected by
Affymetrix miRNA v1.0 microarray at days 2, 3, and 4 after genetic deletion of Dicer1. Each
condition (t2, t3, and t4) was replicated in biological triplicate (A, B, and C) (14). Our
normalization procedure relies on different weights being applied to different types of
probes present on the arrays. As such, the correct definition of the
Affymetrix miRNA Microarray Normalization 3
3 Methods
{
# Read the targets file
targets <- read.delim("targets-mirna.txt", stringsAsFactors = FALSE, sep = " ")
}
# Design matrix with one column per level of 'time' (no intercept)
des <- model.matrix(~0 + as.factor(time), data = targets)
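As an illustration of what the design matrix above encodes, the following minimal sketch builds a hypothetical targets table in base R. The column names 'FileName' and 'replicate' and the assignment of files to time points are assumptions made for the example only; the actual 'targets-mirna.txt' file may be laid out differently.

```r
# Hypothetical targets table: the file-to-time-point mapping below is ASSUMED
# for illustration and is not taken from the GEO record.
targets <- data.frame(
  FileName  = sprintf("GSM%d_MG%d.CEL", 1118272:1118280, 1:9),
  time      = rep(c("t2", "t3", "t4"), each = 3),
  replicate = rep(c("A", "B", "C"), times = 3),
  stringsAsFactors = FALSE
)

# Same call as in the text: no intercept, one indicator column per time point
des <- model.matrix(~0 + as.factor(time), data = targets)
colnames(des) <- levels(as.factor(targets$time))  # shorter column names
print(des)
```

Each row of 'des' corresponds to one array and contains a single 1 in the column of its time point, so each column of the fitted model estimates the mean log-expression at that time point.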
3.1 Robust Normexp Background Correction
For background correction, our procedure relies on normexp background correction using the
'nec' function in 'R'. In addition, we use the 'robust' argument in 'nec', which requests
robust estimation of the background mean and standard deviation, as we found it increased the
sensitivity of the detection of differentially expressed miRNAs (14). Nonetheless, robust
estimation can be disabled using 'robust = FALSE' in the command below.
Normexp background correction relies on the negative control probes in the Affymetrix array,
annotated as 'BkGR' in the manufacturer's annotation file. The following lines define which
probes are used as control probes, from the Affymetrix annotations.
bkgr.idx.pm<-grep("BkGr",rownames(pm.raw))
status<-rep("regular",nrow(pm.raw))
status[bkgr.idx.pm]<-"negative"
table(status)
This will print the number of negative and regular probes in the
arrays (negative: 8221 and regular: 38006 when using GSE45886).
nec.pm.raw.r <- nec(pm(affy2), status = status, negctrl = "negative",
                    regular = "regular", offset = 16, robust = TRUE)
summary(nec.pm.raw.r)
This will print, for each microarray, a summary of the intensities:
Min., 1st Qu., Median, Mean, 3rd Qu., and Max. values.
3.2 Definition of Non-miRNA Small RNA Probes Used in Cyclic Loess Normalization
The first step is to obtain the probe annotations from the appropriate annotation file from
Affymetrix. The file should be placed in the working directory, i.e., '/Documents' in our
case (see Note 4).
ann <- read.csv("miRNA-1_0.annotations.20081203.csv", skip = 11)
data.frame(table(ann$Sequence.Type))
# For each sequence type, collect the corresponding probe set IDs
ann.miRNA <- which(ann.m$Sequence.Type == "miRNA")
mirna <- as.character(ann.m$Probe.Set.ID[ann.miRNA])
ann.affyctlseq <- which(ann.m$Sequence.Type == "Affymetrix Control Sequence")
affyctlseq <- as.character(ann.m$Probe.Set.ID[ann.affyctlseq])
ann.spikein <- which(ann.m$Sequence.Type == "Oligonucleotide spike-in controls")
spikein <- as.character(ann.m$Probe.Set.ID[ann.spikein])
ann.rrna <- which(ann.m$Sequence.Type == "5.8 s rRNA")
rrna <- as.character(ann.m$Probe.Set.ID[ann.rrna])
ann.cdbox <- which(ann.m$Sequence.Type == "CDBox")
cdbox <- as.character(ann.m$Probe.Set.ID[ann.cdbox])
ann.hacabox <- which(ann.m$Sequence.Type == "HAcaBox")
hacabox <- as.character(ann.m$Probe.Set.ID[ann.hacabox])
ann.scarna <- which(ann.m$Sequence.Type == "scaRna")
scarna <- as.character(ann.m$Probe.Set.ID[ann.scarna])
ann.snorna <- which(ann.m$Sequence.Type == "snoRNA")
snorna <- as.character(ann.m$Probe.Set.ID[ann.snorna])
# Indices of the probes whose name matches a miRNA probe set
idx.pm.mirna <- which(!is.na(match(probe.name, mirna)))
length(idx.pm.mirna)
3.3 Cyclic Loess Normalization
The next step is cyclic loess normalization, which gives a heavier weight to the non-miRNA
small RNA probes defined in the previous step than to miRNA probes when normalizing the
differences between arrays. By using a much higher weight for non-miRNA small RNA probes
(100 vs. 0.01 for miRNAs), we found that we greatly increased the accuracy of the
normalization (14).
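affy2.temp <- affy2
pm(affy2.temp) <- nec.pm.raw.r
# Default weight of 1, then down-weight miRNA probes and up-weight other small RNA probes
w <- rep(1, nrow(pm(affy2.temp)))
w[status.spot == "miRNA"] <- 0.001
w[status.spot == "other.small.RNA"] <- 100
norm3 <- normalizeCyclicLoess(log2(pm(affy2.temp)), weights = w,
                              iterations = 5)  # see Note 5
# Back-transform from the log2 scale
pm(affy2.temp) <- 2^(norm3)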
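To illustrate why the probe weights matter, the following minimal base R sketch performs a single weighted loess step between two simulated arrays. It is a simplified illustration only, not the limma implementation (which iterates over arrays and offers several methods); the data, probe classes, and effect sizes are all invented for the example.

```r
# Simulated log2 intensities for 200 probes on two arrays (all values invented)
set.seed(1)
n <- 200
is.mirna <- rep(c(TRUE, FALSE), times = c(150, 50))  # hypothetical probe classes
a1 <- rnorm(n, mean = 8, sd = 2)
a2 <- a1 + rnorm(n, sd = 0.2)
a2 <- a2 + 0.5                       # technical shift affecting the whole array
a2[is.mirna] <- a2[is.mirna] - 1     # genuine global miRNA decrease on array 2

# Heavy weight on non-miRNA small RNA probes, negligible weight on miRNA probes
w <- ifelse(is.mirna, 0.001, 100)

# One pairwise loess step on the M (difference) versus A (average) scale
M <- a1 - a2
A <- (a1 + a2) / 2
fit <- loess(M ~ A, weights = w, span = 0.7, degree = 1)
f <- fitted(fit)                     # estimated technical trend, in input order

# Remove half of the trend from each array
a1.norm <- a1 - f / 2
a2.norm <- a2 + f / 2

# Mean between-array difference per probe class after normalization
tapply(a1.norm - a2.norm, ifelse(is.mirna, "miRNA", "other.small.RNA"), mean)
```

Because the fitted trend is driven almost entirely by the non-miRNA probes, the technical shift is removed while the simulated global miRNA decrease is preserved, instead of being normalized away as it would be with equal weights.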
4 Notes
This will print the contrasts (i.e., -1 for level t2 and 1 for level t4).
fit.w <- lmFit(exprs2, design = designMatrix, weights = w.des)
fit.w <- contrasts.fit(fit.w, contrast.matrix)
fit.w <- eBayes(fit.w)
summary(decideTests(fit.w[mmu.idx,], p.value = 0.1))
This will print the following results for p < 0.1 (where -1
defines the number of probes downregulated at t4 versus t2;
0 defines the number of unchanged probes; +1 defines the
number of upregulated probes). Note that these differ
slightly from what is obtained with the analysis of the nine
microarrays, due to statistical variation with fewer arrays.
t4 - t2
-1 68
0 538
1 3
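The -1/0/1 coding of this summary can be illustrated with a minimal base R sketch. The log fold changes and p-values below are invented, and the sketch is a simplification: 'decideTests' works on the moderated statistics of the fitted model and, by default, adjusts p-values with the Benjamini-Hochberg method, which is mimicked here with 'p.adjust'.

```r
# Invented results for five probes (t4 versus t2)
logFC <- c(-2.1, -1.5, 0.2, 1.8, -0.1)
p.raw <- c(0.001, 0.004, 0.6, 0.01, 0.9)

# Benjamini-Hochberg adjustment, then sign of the change for significant probes
p.adj <- p.adjust(p.raw, method = "BH")
calls <- ifelse(p.adj < 0.1, sign(logFC), 0)

# Count down-regulated (-1), unchanged (0), and up-regulated (1) probes
table(factor(calls, levels = c(-1, 0, 1)))
```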
Finally, the miRNAs that are significantly different between the two
time points can be retrieved with the following commands:
top1 <- topTable(fit.w[mmu.idx,], coef = 1, number = Inf, p.value = 0.1)
write.table(top1, file = "topTab1.csv", row.names = TRUE, sep = ",")
8. Please note that the values stated might change slightly with
the different releases of the statistical packages used.
Acknowledgments
The authors thank Frances Cribbin for her help with the redaction
of this review. The authors are supported by funding from the
Australian NHMRC (1022144 and 1062683 to MPG and
1036541 to DW) and the Victorian Government’s Operational
Infrastructure Support Program.
References
1. Melo SA, Esteller M (2011) Dysregulation of microRNAs in cancer: playing with fire. FEBS Lett 585(13):2087–2099
2. Ota A, Tagawa H, Karnan S, Tsuzuki S, Karpas A, Kira S, Yoshida Y, Seto M (2004) Identification and characterization of a novel gene, C13orf25, as a target for 13q31-q32 amplification in malignant lymphoma. Cancer Res 64(9):3087–3095
3. Calin GA, Dumitru CD, Shimizu M, Bichi R, Zupo S, Noch E, Aldler H, Rattan S, Keating M, Rai K, Rassenti L, Kipps T, Negrini M, Bullrich F, Croce CM (2002) Frequent deletions and down-regulation of micro-RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia. Proc Natl Acad Sci U S A 99(24):15524–15529
4. Melo SA, Moutinho C, Ropero S, Calin GA, Rossi S, Spizzo R, Fernandez AF, Davalos V, Villanueva A, Montoya G, Yamamoto H, Schwartz S, Esteller M (2010) A genetic defect in exportin-5 traps precursor microRNAs in the nucleus of cancer cells. Cancer Cell 18(4):303–315
5. Melo SA, Ropero S, Moutinho C, Aaltonen LA, Yamamoto H, Calin GA, Rossi S, Fernandez AF, Carneiro F, Oliveira C, Ferreira B, Liu C-G, Villanueva A, Capella G, Schwartz S, Shiekhattar R, Esteller M (2009) A TARBP2 mutation in human cancer impairs microRNA processing and DICER1 function. Nat Genet 41(3):365–370
6. Merritt WM, Lin YG, Han LY, Kamat AA, Spannuth WA, Schmandt R, Urbauer D, Pennacchio LA, Cheng J-F, Nick AM, Deavers MT, Mourad-Zeidan A, Wang H, Mueller P, Lenburg ME, Gray JW, Mok S, Birrer MJ, Lopez-Berestein G, Coleman RL, Bar-Eli M, Sood AK (2008) Dicer, Drosha, and outcomes in patients with ovarian cancer. N Engl J Med 359(25):2641–2650
7. Lu J, Getz G, Miska EA, Alvarez-Saavedra E, Lamb J, Peck D, Sweet-Cordero A, Ebert BL, Mak RH, Ferrando AA, Downing JR, Jacks T, Horvitz HR, Golub TR (2005) MicroRNA expression profiles classify human cancers. Nature 435(7043):834–838
8. Gaur A, Jewell DA, Liang Y, Ridzon D, Moore JH, Chen C, Ambros VR, Israel MA (2007) Characterization of microRNA expression levels and their biological correlates in human cancer cell lines. Cancer Res 67(6):2456–2468
9. Volinia S, Calin GA, Liu C-G, Ambs S, Cimmino A, Petrocca F, Visone R, Iorio M, Roldo C, Ferracin M, Prueitt RL, Yanaihara N, Lanza G, Scarpa A, Vecchione A, Negrini M, Harris CC, Croce CM (2006) A microRNA expression signature of human solid tumors defines cancer gene targets. Proc Natl Acad Sci U S A 103(7):2257–2261
10. Yanaihara N, Caplen N, Bowman E, Seike M, Kumamoto K, Yi M, Stephens RM, Okamoto A, Yokota J, Tanaka T, Calin GA, Liu C-G, Croce CM, Harris CC (2006) Unique microRNA molecular profiles in lung cancer diagnosis and prognosis. Cancer Cell 9(3):189–198
11. Kumar MS, Pester RE, Chen CY, Lane K, Chin C, Lu J, Kirsch DG, Golub TR, Jacks T (2009) Dicer1 functions as a haploinsufficient tumor suppressor. Genes Dev 23(23):2700–2704
12. Karube Y, Tanaka H, Osada H, Tomida S, Tatematsu Y, Yanagisawa K, Yatabe Y, Takamizawa J, Miyoshi S, Mitsudomi T, Takahashi T (2005) Reduced expression of Dicer associated with poor prognosis in lung cancer patients. Cancer Sci 96(2):111–115
13. Grelier G, Voirin N, Ay A-S, Cox DG, Chabaud S, Treilleux I, Leon-Goddard S, Rimokh R, Mikaelian I, Venoux C, Puisieux A, Lasset C, Moyret-Lalle C (2009) Prognostic value of Dicer expression in human breast cancers and association with the mesenchymal phenotype. Br J Cancer 101(4):673–683
14. Wu D, Hu Y, Tong S, Williams BR, Smyth GK, Gantier MP (2013) The use of miRNA microarrays for the analysis of cancer samples with global miRNA decrease. RNA 19(7):876–888
Methods in Molecular Biology (2016) 1375: 11–23
DOI 10.1007/7651_2015_238
© Springer Science+Business Media New York 2015
Published online: 12 June 2015
Abstract
Genomic data analysis consists of techniques to analyze and extract information from genes. In particular,
genome sequencing technologies make it possible to characterize genomic profiles and identify biomarkers and
mutations that can be relevant for diagnosis and for the design of clinical therapies. Studies often concern
the identification of genes related to inherited disorders, but recently mutations and phenotypes have been
considered in disease studies and drug design, as well as in the identification of biomarkers for early detection.
Gene mutations are studied by comparing fold changes in redundant numeric and string representations of the
analyzed genes, obtained starting from macromolecules. This often involves studying thousands of repetitions
of gene representations and signatures produced by the available biological instruments, which start from
biological samples and generate arrays of data representing nucleotide sequences of known genes in an often
not well-known order.
High-performance platforms and optimized algorithms are required to manipulate the gigabytes of raw data
generated by the biological instruments mentioned above, such as next-generation sequencing (NGS) and
microarray platforms. Data analysis also requires the use of several tools and databases that store gene
targets as well as gene ontologies and gene–disease associations.
In this chapter we present an overview of the available software platforms for genomic data analysis, as well
as the available databases with their query engines.
1 Introduction
12 Francesca Cristiano and Pierangelo Veltri
them use the methods for sequencing the samples on the basis of
the length or type of sequence (paired end, single ended, etc.).
Nowadays next-generation sequencing produces a large amount of data
and information that is difficult to manage and therefore requires
efficient, high-performance tools in order to conduct an analysis in
a very short time (9). The output of the sequencing is in FastQ
format; each file can reach an average size of almost 1 GB, and more
than one FastQ file is usually produced. Many ad hoc pipelines are
developed by software engineers to analyze the produced data, but
installing, configuring, and managing the software requires computer
skills that users (often doctors and biologists) usually do not have.
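To make the FastQ format concrete, the following minimal base R sketch parses two invented records, each made of four lines (identifier, sequence, separator, and per-base quality string). It assumes Phred+33 quality encoding and single-line sequences; the helper names 'read_fastq_records' and 'mean_phred' are hypothetical and not part of any package. For real files of the size mentioned above, a dedicated tool would be preferable.

```r
# Two invented FastQ records; qualities are ASSUMED to be Phred+33 encoded
fastq_lines <- c(
  "@read1", "TGAGGTAGTAGGTTGTATAGTT", "+", strrep("I", 22),
  "@read2", "ACGT",                   "+", "!!II"
)

# Hypothetical helper: split the lines into records of four lines each
read_fastq_records <- function(lines) {
  stopifnot(length(lines) %% 4 == 0)
  first <- seq(1, length(lines), by = 4)
  data.frame(
    id       = sub("^@", "", lines[first]),
    sequence = lines[first + 1],
    quality  = lines[first + 3],
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: mean Phred score of one quality string (ASCII code - 33)
mean_phred <- function(q) mean(utf8ToInt(q) - 33)

reads <- read_fastq_records(fastq_lines)
reads$length <- nchar(reads$sequence)
reads$mean_quality <- vapply(reads$quality, mean_phred, numeric(1), USE.NAMES = FALSE)
print(reads[, c("id", "length", "mean_quality")])
```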
5.1 Galaxy
The large amount of data produced by next-generation sequencing
must be stored and managed efficiently.
Galaxy (13, 17, 18) is an open, Web-based workbench that enables
users to perform statistical and bioinformatic analyses on NGS data.
The Galaxy platform can also be downloaded and installed locally, and
many tools can be integrated as plugins, which can be installed manually.
Galaxy is used mostly by researchers who do not have computer science
skills. It provides a simple Web interface and plugins that can be used
to carry out an analysis; in particular, the available analysis modules
can be used in sequence. miRNA-seq data, for example, can be analyzed
following a simple workflow (19). It is necessary to import the
sequenced files in Galaxy and view the reads
5.2 Strand NGS (Formerly Avadis NGS)
Strand NGS (14) is a commercial software package that can be used to
perform NGS analysis on DNA, RNA, or small RNA. The suite supports two
types of experiments: alignment experiments and statistical and
bioinformatic analysis experiments. A small RNA alignment consists of
importing the dataset (FastQ files) of the sequencing experiment,
defining the appropriate reference genome (e.g., mouse or human), and
selecting the library type and the platform used during the sequencing.
Before performing the alignment, the program requires a preprocessing
phase (pre-alignment) that increases the number of sequences that can
be aligned to the chosen genome. Strand NGS also provides a report on
the quality of the produced sequences. If the reads contain an adapter,
trimming parameters must be set to remove the adapter and poor-quality
sequences. There is
Methods and Techniques for miRNA Data Analysis 17
7.1 miRNA–mRNA Associations
Interest in studying miRNAs has recently been shown both with respect to
their role in chronic diseases (e.g., refs. (34) and (35)) and as new
targets for different therapies and drugs. miRNA functions are related to
the (subsets of) genes that they regulate. Many tools available online,
given a set of miRNAs, search for gene targets and the proteins involved,
and predict the mRNA targets of the miRNAs. Several miRNA–gene target
association databases exist. For example, mirDIP (36) is a database for
miRNA and mRNA that integrates the results of a large number of prediction
tools, such as DIANA microT (37), MicroCosm Targets (formerly miRBase
Targets) (32), microRNA.org (38), PicTar (39), and TargetScan (40). mirDIP
takes as input a list of miRNAs and returns miRNA–mRNA interactions on the
basis of the accuracy level. Both the source database and the prediction
accuracy (i.e., high, medium, or low) can be selected; alternatively, the
prediction database can be selected automatically by tuning the accuracy
or prediction parameters. The results can be stored in a file and contain
the miRNAs with their associated mRNAs and the database used for the
prediction, with the respective accuracy measure. Another example of a
miRNA target database is miRDB (41), which contains several genes from
different organisms. miRDB uses machine learning algorithms to predict
the gene targets of miRNAs; a query-by-example interface can be used to
compose a query starting from single or multiple miRNAs and linking them
to gene targets. Finally, miRWalk (42) is a tool for selecting predicted
genes and validated (from literature) genes from the rat, human, and
mouse genomes.
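The idea of integrating the results of several prediction tools can be sketched in base R as follows. This is only an illustration of the general principle, not the scoring used by mirDIP or any other database mentioned above; the miRNA names, gene names, tool names, and the threshold of two supporting tools are all hypothetical.

```r
# Hypothetical predicted miRNA-gene pairs reported by three invented tools
predictions <- data.frame(
  mirna  = c("miR-1", "miR-1", "miR-1", "miR-2", "miR-2"),
  gene   = c("GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C"),
  source = c("toolX", "toolY", "toolX", "toolY", "toolZ"),
  stringsAsFactors = FALSE
)

# Count how many distinct tools support each miRNA-gene pair
support <- aggregate(source ~ mirna + gene, data = predictions,
                     FUN = function(s) length(unique(s)))
names(support)[names(support) == "source"] <- "n_sources"

# Keep the pairs predicted by at least two tools (arbitrary threshold)
consensus <- support[support$n_sources >= 2, ]
print(consensus)
```

Requiring agreement between tools trades sensitivity for specificity; the appropriate threshold depends on the tools combined and on the downstream analysis.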
8 Ontologies
10 Conclusions
References
1. Zhang X, Zeng Y (2011) Performing custom microRNA microarray experiments. J Vis Exp 56:e3250. doi:10.3791/3250
2. Schena M, Shalon D et al (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270(5235)
3. Yin JQ, Zhao RC et al (2008) Profiling microRNA expression with microarrays. Trends Biotechnol 26(2):70–76. doi:10.1016/j.tibtech.2007.11.007
4. Brazma A, Hingamp P et al (2001) Minimum information about a microarray experiment (MIAME): toward standards for microarray data. Nat Genet 29(4):365–371
5. Bartel DP (2009) MicroRNAs: target recognition and regulatory functions. Cell 136(2):215–233. doi:10.1016/j.cell.2009.01.002
6. http://www.454.com
7. http://technology.illumina.com/technology/next-generation-sequencing/solexatechnology.html
8. http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing.html
9. Pop M, Salzberg SL (2008) Bioinformatics challenges of new sequencing technology. Trends Genet 24(3):142–149. doi:10.1016/j.tig.2007.12.006
10. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
11. http://journal.embnet.org/index.php/embnetjournal/article/view/200/479
12. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
13. Goecks J, Nekrutenko A et al (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8):R86
14. Strand Life Sciences Pvt. Ltd. Strand NGS-formerly Avadis NGS, 2012, Version 1.3.0. San Francisco, CA: Strand Genomics, Inc.
15. http://www.genomics.agilent.com/en/Microarray-Data-Analysis-Software/GeneSpring-GX/?cid=AG-PT-130&tabId=AG-PR-1061
16. Friedländer MR, Chen W et al (2008) Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol 26(4):407–415. doi:10.1038/nbt1394
17. Blankenberg D, Von Kuster G et al (2010) Current protocols in molecular biology. Chapter 19:Unit 19.10.1-21
18. Giardine B, Riemer C et al (2005) Galaxy: a platform for interactive large-scale genome analysis. Genome Res 15(10):1451–1455
19. http://training.bioinformatics.ucdavis.edu/docs/2012/09/BSC/ThuPM-miRNA.html
20. http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage
21. Friedländer MR, Mackowiak SD et al (2012) miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res 40(1):37–52. doi:10.1093/nar/gkr688
22. Trapnell C, Pachter L et al (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105–1111. doi:10.1093/bioinformatics/btp120
23. Kim D, Pertea G et al (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14:R36. doi:10.1186/gb-2013-14-4-r36
24. http://cole-trapnell-lab.github.io/cufflinks/
25. Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11:R106. doi:10.1186/gb-2010-11-10-r106
26. Gene ontology (2014) http://www.geneontology.org/
27. Biclustering of gene expression data. Jesùs S. Aguilar-Ruiz
28. BLAST. http://blast.ncbi.nlm.nih.gov/Blast.cgi
29. ENTREZ. http://www.ncbi.nlm.nih.gov/gquery/
30. PubMed. http://www.ncbi.nlm.nih.gov/pubmed/
31. EMBL. http://www.embl.org
32. Kozomara A, Griffiths-Jones S (2013) miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res 42:D68–D73. doi:10.1093/nar/gkt1181
33. Kozomara A, Griffiths-Jones S (2011) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res 39(Database issue):D152–D157. doi:10.1093/nar/gkq1027
34. Ellison GM, Vicinanza C et al (2013) Adult c-kit(pos) cardiac stem cells are necessary and sufficient for functional cardiac regeneration and repair. Cell 154(4):827–842
35. Leidinger P, Backes C et al (2013) A blood based 12-mirna signature of Alzheimer disease patients. Genome Biol 14:R78. doi:10.1186/gb-2013-14-7-r78
36. Shirdel EA, Xie W et al (2011) Navigating the micronome: using multiple microRNA prediction databases to identify signalling pathway-associated microRNAs. PLoS One 6(2):e17429. doi:10.1371/journal.pone.0017429
37. Paraskevopoulou MD et al (2013) Diana-microt web server v5.0: service integration into mirna functional analysis workflows. Nucleic Acids Res 41(Web Server issue):W169–W173. doi:10.1093/nar/gkt393
38. Betel D, Wilson M et al (2008) The microRNA.org resource: targets and expression. Nucleic Acids Res 36(Database issue):D149–D153
39. Pictar. http://pictar.mdc-berlin.de
40. TargetScan microRNA target prediction. http://www.targetscan.org/
41. Wang X (2008) miRDB: a microRNA target prediction and functional annotation database with a wiki interface. RNA 14(6):1012–1017
42. Dweep H, Sticht C et al (2011) miRWalk-database: prediction of possible miRNA binding sites by “walking” the genes of 3 genomes. J Biomed Inform 44:839–847
43. Kibbe WA, Arze C et al (2014) Disease ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res 43:D1071–D1078, pii: gku1011
44. Medical subject headings. http://www.nlm.nih.gov/mesh/
45. ICD. http://www.who.int/classifications/icd
46. Bauer-Mehren A, Bundschus M et al (2011) Gene-disease network analysis reveals functional modules in Mendelian, complex and environmental diseases. PLoS One 6(6):e20284
47. http://www.disgenet.org/web/DisGeNET/v2.1/dbinfo
48. Shannon P, Markiel A et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504
49. Reactome Fi Cytoscape Plugin. http://www.reactome.org
50. Guanming W, Feng X et al (2010) A human functional protein interaction network and its application to cancer data analysis. Genome Biol 11(53)
51. Gade S, Porzelius C et al (2011) Graph based fusion of mirna and mrna expression data improves clinical outcome prediction in prostate cancer. BMC Bioinformatics 12:488
52. Tian Z, Greene AS et al (2008) MicroRNA target pairs in the rat kidney identified by microRNA microarray, proteomic, and bioinformatic analysis. Genome Res 18:404–411
53. Pietro Hiram Guzzi, Pierangelo Veltri et al (2012) Unraveling multiple miRNA-mRNA associations through a graph-based approach. In: ACM BCB
54. Bo W, Mezlini Aziz M et al (2014) Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 11:333–337. doi:10.1038/nmeth.2810
Methods in Molecular Biology (2016) 1375: 25–39
DOI 10.1007/7651_2015_236
© Springer Science+Business Media New York 2015
Published online: 12 April 2015
Abstract
High-throughput platforms such as microarray, mass spectrometry, and next-generation sequencing are
producing an increasing volume of omics data that needs large data storage and computing power. Cloud
computing offers massive scalable computing and storage, data sharing, and on-demand, anytime and anywhere
access to resources and applications; thus, it may represent the key technology for facing those issues. In
fact, in recent years it has been adopted for the deployment of different bioinformatics solutions and
services both in academia and in industry. Despite this, cloud computing presents several issues
regarding the security and privacy of data, which are particularly important when analyzing patient data, such
as in personalized medicine. This chapter reviews the main academic and industrial cloud-based bioinformatics
solutions, with a special focus on microarray data analysis solutions, and underlines the main issues and
problems related to the use of such platforms for the storage and analysis of patient data.
1 Introduction
26 Barbara Calabrese and Mario Cannataro
Managing omics data requires both storage space and
services for data preprocessing, analysis, and sharing. The resulting
scenario comprises a set of bioinformatics tools, often implemented
as Web services, for the management and analysis of data stored in
geographically distributed biological databases.
Cloud computing is a computing model that has spread very
rapidly in recent years for the supply of IT resources (hardware and
software) of different kinds, through services accessible via the
network. The resources that a cloud system provides to users
include CPU, memory, networks, operating systems, middleware,
and applications. Cloud resources are dynamically scalable,
virtualized, and accessible on the Internet (1). This model provides
new advantages related to massive and scalable computing
resources available on demand, virtualization technology, and
payment for use as needed (2).
Thus, cloud computing may play an important role in many
phases of the bioinformatics analysis pipeline, from data management
and processing to data integration and analysis, including
data exploration and visualization.
Despite the many benefits associated with cloud computing,
there are also several management, technology, security, and legal
issues to be addressed. In fact, cloud computing currently presents
some issues and open problems, such as privacy and security,
geographical localization of data, and legal responsibilities in the case
of data leaks, which are particularly important when managing sensitive
data such as the patient data stored and processed in genomics and
pharmacogenomics studies, and more generally when clinical data
are transferred to the cloud.
The aim of this chapter is to describe and discuss the most
significant applications of cloud computing in bioinformatics,
with a special focus on microarray data analysis. The chapter focuses
on the specific requirements and issues of such applications on cloud
computing. The chapter is organized as follows: in Section 2 the
definition of cloud computing is discussed, and service and delivery
models are presented in order to define the cloud-related background.
Subsequently, in Section 3 the chapter focuses on the application of cloud
computing in bioinformatics and microarray data analysis. Section 4
summarizes the main problems to be faced when moving bioinformatics
applications to the cloud and underlines open problems
related to the full adoption of cloud computing in the bioinformatics
data analysis pipeline.
2 Materials
2.1 Service Models
Cloud services can be classified into three main models:
• Infrastructure as a Service (IaaS): the provider offers a
computing infrastructure that includes servers (typically
virtualized) with specific computational capability and/or storage.
The user controls the storage resources, the operating systems,
and the deployed applications, but has limited control over the
network settings. Examples are Amazon's Elastic Compute Cloud
(EC2), which allows the user to create and manage virtual
machines, and Amazon Simple Storage Service (S3), which allows
storing and accessing data through a Web-service interface.
• Platform as a Service (PaaS): the provider allows user-developed
applications to be developed, installed, and executed on its
infrastructure. Applications must be created using the programming
languages, libraries, services, and tools supported by the
provider, which together constitute the development platform
provided as a service. An example is Google App Engine, which
allows developing applications in Java and Python and provides,
for both languages, an SDK (Software Development Kit) and a
plugin for the Eclipse development environment.
• Software as a Service (SaaS): customers use the applications
provided on the cloud provider's infrastructure. The applications
are accessible through a specific interface. Customers do not
28 Barbara Calabrese and Mario Cannataro
2.2 Delivery Models
Cloud services can be made available to users in different ways.
The delivery models are briefly described below:
• Public Cloud: vendors offer public cloud services by providing
users/customers with the hardware and software resources of their
data centers. Examples of public clouds are Amazon, Google Apps,
and Microsoft Azure.
• Private Cloud: a private cloud is configured by a user or an
organization for its exclusive use. Services are supplied by
computers in the organization's own domain. Several commercial
and free tools are available to install a private cloud (e.g.,
OpenStack, Eucalyptus, OpenNebula, Terracotta, and VMware Cloud).
• Community Cloud: an infrastructure hosting cloud services that
are shared by a community, i.e., a set of individuals, companies,
and organizations with a common purpose and the same needs. The
cloud can be managed by the community itself or by a third party
(typically a cloud service provider).
• Hybrid Cloud: the cloud infrastructure is made up of two or
more clouds using different delivery models, which remain
separate entities but are connected by proprietary or standard
technology that enables the portability of data and applications.
3 Methods
3.2 Bioinformatics Tools Deployed as SaaS
In recent years, there have been several efforts to develop
cloud-based tools for different bioinformatics tasks (11), e.g.,
mapping applications, sequence alignment, and gene expression
analysis (12). Some examples of SaaS bioinformatics tools are
reported below.
In ref. (13), the authors propose an efficient Cloud-based
Epistasis cOmputing (eCEO) model for large-scale epistatic
interaction analysis in genome-wide association studies (GWAS).
Given a large number of combinations of SNPs (single-nucleotide
polymorphisms), the eCEO model distributes them so as to balance
the load across the processing nodes. Moreover, the eCEO model can
efficiently process each combination of SNPs to determine the
significance of its association with the phenotype. The authors
implemented and evaluated the eCEO model on their own cluster
of more than 40 nodes. The experimental results demonstrate that
the eCEO model is computationally efficient, flexible, scalable, and
practical. In addition, the authors deployed the eCEO model on
the Amazon Elastic Compute Cloud.
Bioinformatics and Microarray Data Analysis on the Cloud 31
3.3 Bioinformatics Platforms Deployed as PaaS
Currently, the most used platform (PaaS) for bioinformatics
applications is Galaxy Cloud, a cloud-based version of Galaxy for
large-scale data analysis. It allows anyone to run a private
Galaxy installation on the cloud that exactly replicates the
functionality of the main site, but without the need to share
computing resources with other users. With Galaxy Cloud, unlike
software service solutions, users can customize their deployment
and retain complete control over their instances and associated
data; the analysis can also be moved to other cloud providers or
local resources, avoiding concerns about dependence on a single vendor.
Currently, a public Galaxy Cloud deployment, called CloudMan, is
Project's name/ref | Service model | Task | URL
-------------------|---------------|------|----
eCEO | SaaS | Sequencing (genome resequencing) | www.comp.nus.edu.sg
STORMSEQ | SaaS | Sequencing (genome resequencing) | http://www.stormseq.org
Crossbow | SaaS | Sequencing (genome resequencing) | http://bowtie-bio.sourceforge.net/crossbow/index.shtml
CloudBurst | SaaS | Sequencing: genome resequencing, short-read aligner | http://sourceforge.net/projects/cloudburst-bio/
CloudAligner | SaaS | Sequencing: genome resequencing, short-read aligner | http://sourceforge.net/projects/cloudaligner/
VAT | SaaS | Sequencing: genome resequencing, variant annotation | vat.gersteinlab.org
FX | SaaS | Sequencing: RNA-seq | fx.gmi.ac.kr
Myrna | SaaS | Sequencing: RNA-seq | http://bowtie-bio.sourceforge.net/myrna/index.shtml
PeakRanger | SaaS | Sequencing: ChIP-seq | http://ranger.sourceforge.net
ProteoCloud | SaaS | Mass spectrometry: MS-based proteomics | https://code.google.com/p/proteocloud/
YunBE | SaaS | Transcriptomics: gene set analysis | http://lrcv-crp-sante.s3-website-us-east-1.amazonaws.com
BioVLAB-MMIA | SaaS | Analysis of microRNA and mRNA expression data | https://sites.google.com/site/biovlab/
Cloud4SNP | SaaS | Microarray: SNP analysis | Not available
CloudMan | PaaS | A public Galaxy cloud deployment for bioinformatics | wiki.galaxyproject.org
Eoulsan | PaaS | A framework for high-throughput sequencing data analysis | http://transcriptome.ens.fr/eoulsan/
Bionimbus | IaaS | A cloud-based infrastructure for managing, analyzing and sharing genomics datasets | bionimbus.openscience
CloVR | IaaS | A virtual machine for automated and portable microbial sequence analysis | http://clovr.org
CloudBioLinux | IaaS | Genome analysis resources for cloud computing platforms | cloudbiolinux.org
4 Notes
5 Conclusions
References
1. Mell P, Grance T. The NIST definition of cloud computing. Recommendations of the National Institute of Standards and Technology, Special Publication, 800–145 http://csrc.nist.gov/publications/PubsSPs.html
2. Armbrust M, Fox A, Griffith R et al (2010) A view of cloud computing. Commun ACM 53(4):50–58
3. Vaquero LM, Rodero-Merino L, Caceres J et al (2009) A break in the clouds: towards a cloud definition. Comput Comm Rev 39:50–55
4. Calabrese B, Cannataro M, Cloud Computing in Healthcare and Biomedicine, Scalable Computing: Practice and Experience 16(1):1–18. doi:10.12694/scpe.v16i1.1057
5. Cannataro M, Guzzi PH, Veltri P (2010) Protein-to-protein interactions: technologies, databases, and algorithms. ACM Comput Surv 43(1):1–36
6. Phillips C (2009) SNP databases. In: Komar AA (ed) Single nucleotide polymorphisms, vol 578. Humana, Totowa, NJ, pp 43–71, ch. 3
7. Schadt EE, Linderman MD, Sorenson J et al (2011) Cloud and heterogeneous computing solutions exist today for the emerging big data problems in biology. Nat Rev Genet 12(3):224
8. Grossmann RL, White KP (2011) A vision for a biomedical cloud. J Intern Med 271(2):122–130
9. Dudley JT, Pouliot Y, Chen JR et al (2010) Translational bioinformatics in the cloud: an affordable alternative. Genome Med 2:51
10. Fusaro VA, Patil P, Gafni E et al (2011) Biomedical cloud computing with Amazon web services. PLoS Comput Biol 7(8):e1002147. doi:10.1371/journal.pcbi.1002147
11. Dai L, Gao X, Guo Y et al (2012) Bioinformatics clouds for big data manipulation. Biol Direct 7:43. doi:10.1186/1745-6150-7-43
12. Zhang L, Gu S, Wang B et al (2012) Gene set analysis in the cloud. Bioinformatics 28(2):294–295
13. Wang Z, Wang Y, Tan KL et al (2011) eCEO: an efficient Cloud Epistasis cOmputing model
Abstract
Gene expression data (microarray and RNA-sequencing data), as well as other kinds of genomic data, can be
extracted from publicly available genomic data. Here, we explain how to apply multivariate clustering and
classification methods to gene expression data. These methods have become very popular and are
implemented in freely available software; they can be used to predict the participation of gene products in a
specific functional category of interest. Given the availability of data and of these methods, every
biological study could benefit from applying them to gain knowledge about the organism studied and the
functional category of interest. Special emphasis is placed on nonlinear kernel classification methods.
1 Introduction
2 Materials
Table 1
Typical microarray data table
Fig. 1 Prediction of new genes belonging to the virulence category through classification
3 Methods
3.1 Functional Data (Known Gene Categories) and Training Sets
The methods proposed here are suited to classifying genes into
two categories (e.g., virulence factors versus non-virulence factors,
or immunity-related genes versus all other genes). These categories
need to be constructed before the methods are applied. They can be
constructed from the literature or extracted from genomic
databases, and should be represented as a vector indicating, for
each gene, the category to which it belongs.
For supervised classification methods such as support vector machine
(SVM) classification and linear discriminant analysis (LDA), we use a
training set. The genes belonging to the training set are chosen at
random from the two known categories and should represent
approximately one third of all genes. The ratio between the two
classes in the overall data set should be maintained in the training set.
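The sampling rule above can be sketched in base R as follows. This is only a minimal illustration under stated assumptions: the vector `category`, its toy class sizes, and the name `train_idx` are hypothetical and do not come from the chapter's data.

```r
# Hypothetical category vector: 1 = gene in the functional category, 0 = not.
set.seed(1)
category <- c(rep(1, 30), rep(0, 270))   # 300 toy genes, 10 % in the category

# Within each class, draw about one third of the genes at random, so that
# the class ratio of the full data set is preserved in the training set.
by_class <- split(seq_along(category), category)
train_idx <- unlist(lapply(by_class, function(idx) {
  idx[sample.int(length(idx), round(length(idx) / 3))]
}))
train_idx <- sort(unname(train_idx))

table(category)              # class sizes in the full data set
table(category[train_idx])   # class sizes in the training set
```

Sampling within each class separately, rather than from all genes at once, is what keeps the ratio between the two classes fixed.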
Microarray Classification for Functional Prediction 43
3.2 Preprocessing Microarray Data
Several preprocessing methods for gene expression data exist. Any
of them can be used to normalize gene expression data and
to make experiments comparable. We recommend using the
method proposed by Huber et al. (8).
The code below uses example data contained in Huber's vsn
package. This R package is part of the Bioconductor packages
((9), www.bioconductor.org), developed especially for gene
expression data.
# installation of this package from Bioconductor (BiocManager is the current Bioconductor installer)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("vsn")
library(Biobase) # provides exprs()
library(vsn) # this library implements the method of Huber et al. (2003)
citation("vsn") # here is how you should cite the library if you use it
data("lymphoma") # an available microarray data set in R that we use to illustrate all methods
class(lymphoma) # returns the type of object; this is a special object for microarray data
dim(exprs(lymphoma)) # returns the dimension of the table containing the gene expression data
boxplot(exprs(lymphoma)) # constructs a boxplot of gene expression values for each of the 16 samples
par(mfrow = c(1, 2)) # prepares the graphics window for two plots
hist(exprs(lymphoma)[, 1], main = "green") # plots a histogram of the first sample
hist(exprs(lymphoma)[, 2], main = "red") # plots a histogram of the second sample
lym2 <- justvsn(lymphoma) # applies the normalization method and creates a new object
meanSdPlot(lym2, ranks = TRUE) # shows the result of normalization, plotting standard deviation against (ranked) mean;
# higher mean values should not imply a higher standard deviation (sd).
# A horizontal red line is expected.
boxplot(exprs(lym2)) # boxplots after normalization
par(mfrow = c(1, 2))
hist(exprs(lym2)[, 1]) # histograms after normalization
hist(exprs(lym2)[, 2])
3.3 Clustering Methods
Clustering methods group observations with the aim of reducing
the variability inside each cluster and identifying homogeneous
subgroups. The observations inside each group therefore present
similar characteristics, essentially numerical. This methodology
has been widely used in many applications and is classified
as a multivariate technique in statistics.
A key element of this analysis is the similarity metric used to
quantify how similar individuals (here genes) are, based on
the data at hand. The most common is the Euclidean metric.
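As a minimal sketch of the Euclidean metric, the base R function `dist()` computes all pairwise distances between the rows of a matrix. The toy matrix `expr` and its gene names are hypothetical and serve only to illustrate the computation.

```r
# Toy expression matrix: 4 genes (rows) measured in 3 samples (columns).
expr <- rbind(geneA = c(1, 2, 3),
              geneB = c(1, 2, 4),
              geneC = c(4, 6, 3),
              geneD = c(0, 0, 0))

# Pairwise Euclidean distances between genes (rows).
d <- dist(expr, method = "euclidean")
d_mat <- as.matrix(d)   # full symmetric matrix, easier to index by name

# The same distance for one pair, computed directly from the definition.
by_hand <- sqrt(sum((expr["geneA", ] - expr["geneC", ])^2))
```

The object `d` can be passed directly to hierarchical clustering functions such as `hclust()`.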
44 Liliana López Kleine et al.
3.3.2 K-Means Algorithm
The K-means algorithm is the most common partition method
used in cluster analysis. It is a dynamic iterative method
that minimizes the within-class sum of squares for a given
number of clusters (15). The algorithm starts with a vector of initial
centroids, and each observation is placed in the cluster to whose
centroid it is closest. The algorithm is called dynamic because the
centroids are updated at each iteration. Here, an iteration is
one complete stage in which all groups are formed. The
process is repeated until the cluster centers stabilize.
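The procedure can be tried with the `kmeans()` function of base R, whose default algorithm is that of Hartigan and Wong (15). The simulated matrix below is a hypothetical stand-in for real expression data, with two clearly separated groups so that the result is easy to inspect.

```r
set.seed(42)
# Two well-separated toy groups of 20 "genes" each, measured in 5 "samples".
expr <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), nrow = 20),
              matrix(rnorm(100, mean = 5, sd = 0.3), nrow = 20))

# K-means with 2 clusters; nstart = 10 repeats the algorithm from ten
# random sets of initial centroids and keeps the best solution.
km <- kmeans(expr, centers = 2, nstart = 10)

km$cluster        # cluster membership of each row
km$centers        # final centroids
km$tot.withinss   # total within-cluster sum of squares (the minimized criterion)
```

Using several random starts is a common precaution, because the final partition depends on the initial centroids.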
3.3.3 Kohonen Self Organizing Maps
This method was originally proposed by T. Kohonen (17, 18).
The so-called Kohonen self-organizing maps, or simply SOM, are
highly appreciated for their ability to map and visualize
high-dimensional data in two dimensions. The algorithm is classified
as an unsupervised learning technique based on neural network
theory (18).
This neural network uses competitive unsupervised learning:
no additional information is available for the classification of
the data, and all neurons compete in order to carry out a specific
task. Consequently, for a given input pattern, only one of the output
neurons (or a group of neighbors) is activated; the neurons
compete until a winning neuron is assigned.
Initially, a self-organizing network uses all information in the input
data (commonalities, regularities, correlations, or categories) and
incorporates it into its internal connection structure. The
neurons must therefore self-organize based on the provided data.
In this method, an input information vector is connected to an
intermediate layer, where each neuron or node is compared to the
input through weights computed from predefined functions.
Finally, the outputs of the nodes are compared, and the winning
node (activated neuron) is the one that produces the smallest
output. Code is presented below.
library(kohonen)
Koh_lymph <- som(scale(STDlymph_express), grid = somgrid(4, 2, "rectangular"))
Koh_member <- Koh_lymph$unit.classif # map unit assigned to each gene
table(Koh_member) # number of genes per map unit
par(mfrow = c(2, 2)) # one panel for each of the four plots
plot(Koh_lymph, type = "codes")
plot(Koh_lymph, type = "changes")
plot(Koh_lymph, type = "counts")
plot(Koh_lymph, type = "dist.neighbours")
####### Prediction
percentage_training <- floor(.8 * nrow(STDlymph_express)) # size of the training set (80 % of the rows)
Lymph_train <- sample(nrow(STDlymph_express), percentage_training)
Lymph_training <- as.matrix(STDlymph_express[Lymph_train, ])
Lymph_test <- as.matrix(STDlymph_express[-Lymph_train, ])
som.STDlymphoma <- som(Lymph_training, grid = somgrid(4, 2, "hexagonal"))
# unit.classif already has one entry per training row, so it is used without re-indexing
som.Lymph.prediction <- predict(som.STDlymphoma, newdata = Lymph_test,
  trainX = Lymph_training,
  trainY = factor(som.STDlymphoma$unit.classif))
table(som.STDlymphoma$unit.classif)
table(som.Lymph.prediction$unit.classif)
3.4 Supervised Classification Methods
These methods use part of the data to train a classifier,
which is a function that places new individuals in one of the
previously known groups. For these methods, a training set is
needed, as presented in Section 3.1.
3.4.1 Linear Discriminant Analysis
Linear discriminant analysis (LDA) is a supervised method in
which a linear classifier is fitted to classify known objects into
previously known groups, in this case two groups. The methodology
looks for linear combinations of variables that best explain
and separate the data, and explicitly attempts to model the
difference between the given data classes (19). The correct
application of this method rests on two assumptions: normal
distribution and equality of variances between groups. Nevertheless,
when the response variable contains two categories, the method
coincides with Fisher's rule, a nonparametric approach.
Therefore, normality can be omitted, but the constraint of equal
variances should be evaluated.
In order to illustrate the method, we use the Breast cancer NKI
dataset from the Bioconductor package breastCancerNKI
(http://www.bioconductor.org/packages/release/data/experiment/html/breastCancerNKI.html).
An auxiliary file is incorporated with the gene categories (R,
NR), obtained from the work developed by van't Veer et al. (20).
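if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("breastCancerNKI")
library(Biobase) # provides exprs(), fData() and pData()
library(breastCancerNKI)
data(nki)
show(nki)
fData(nki) # feature (gene) annotation
pData(nki) # sample annotation
Rep <- read.table("nkiR.txt", header = TRUE) # auxiliary file with the gene categories
reporter <- Rep[, 4] # category (R, NR) of each gene
mxpr <- as.data.frame(exprs(nki))
dim(mxpr)
cl_mxpr <- data.frame(mxpr, reporter = reporter)
dim(cl_mxpr)
names(cl_mxpr)
cl_mxpr$reporter
library(MASS)
train <- sample(nrow(cl_mxpr), round(nrow(cl_mxpr) * 9 / 10)) # 90 % of the genes for training
table(cl_mxpr$reporter[train])
ytest <- lda(reporter ~ ., cl_mxpr, prior = c(1, 1) / 2, subset = train) # fitted LDA classifier
ypred <- predict(ytest, cl_mxpr[-train, ]) # predictions for the held-out genes
ctable <- table(cl_mxpr[-train, ]$reporter, ypred$class) # confusion table
gclass <- sum(diag(ctable)) / sum(ctable) # proportion of correctly classified genes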
3.4.2 Linear Support Vector Machines
Support Vector Machines (SVM) are kernel methods. The particularity
of kernel methods is that algorithms are performed after the
data are transformed and thereby projected into a different
high-dimensional space called the feature space. The expression data
are thus projected into a space $H$ through the mapping
$\Phi : X \to H$, where $X$ is the original space and $H$ is called
the feature space, and all data points have their analogue in that
space. For kernel methods, this mapping does not need to be known
and is achieved through the kernel function
$K : X \times X \to \mathbb{R}$, $(x, y) \mapsto K(x, y)$ (21).
This is called the "kernel trick" and can be stated as follows:
let $X$ be the original space and consider a bivariate function
$K$ defined on $X \times X$. Let $H$ be the associated feature space.
Then a transformation $\Phi : X \to H$ exists, so that
$K(x, y) = \Phi(x)' \Phi(y)$.
An important result for the application of SVM is that a function
$f$ defined on $X$ can be expressed as follows:
$f(\cdot) = \sum_{i} \alpha_i K(x_i, \cdot)$; the space of such
functions is called the Reproducing Kernel Hilbert Space (RKHS).
Several types of kernel functions exist, the linear kernel being
the simplest. The linear kernel is directly the inner product between
two data vectors $x$ and $y$ on the same $p$ individuals:
$K(x, y) = x'y = \sum_{i=1}^{p} x_i y_i$. Other common kernels are
the polynomial kernel
$K(x, y) = (\mathrm{scale}\,\langle x, y \rangle + \mathrm{cte})^{g}$
and the Gaussian kernel $K(x, y) = \exp(-\sigma \lVert x - y \rVert^2)$.
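The three kernels mentioned above can be written as short base R functions. This is a minimal sketch for illustration only: the function names, the default parameter values, and the helper `kernel_matrix()` are hypothetical, and the arguments `scale`, `cte`, `g`, and `sigma` simply mirror the symbols used in the formulas.

```r
# x and y are numeric vectors of equal length.
linear_kernel <- function(x, y) sum(x * y)

# Polynomial kernel: (scale * <x, y> + cte)^g
poly_kernel <- function(x, y, scale = 1, cte = 1, g = 2) {
  (scale * sum(x * y) + cte)^g
}

# Gaussian kernel: exp(-sigma * ||x - y||^2)
gauss_kernel <- function(x, y, sigma = 1) exp(-sigma * sum((x - y)^2))

# Kernel (Gram) matrix for all pairs of rows of a data matrix X.
kernel_matrix <- function(X, kernel, ...) {
  n <- nrow(X)
  K <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) K[i, j] <- kernel(X[i, ], X[j, ], ...)
  }
  K
}

X <- rbind(c(1, 0), c(0, 1), c(1, 1))
K_lin <- kernel_matrix(X, linear_kernel)               # equals X %*% t(X)
K_rbf <- kernel_matrix(X, gauss_kernel, sigma = 0.5)   # ones on the diagonal
```

An SVM only needs the kernel matrix, not the mapping $\Phi$ itself, which is why replacing `linear_kernel` by another kernel is enough to obtain a nonlinear classifier.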
A linear classifier will allow separating individuals (here genes)
as shown in Fig. 2. If the data are not linearly separable, a nonlinear
classifier needs to be constructed (Fig. 3).
Consider the sample $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where
$x_i \in \mathbb{R}^p$ and $y_i \in \{-1, +1\}$ (the response variable).
This sample is called the learning sample in machine learning theory.
The idea behind these kinds of models is that they make it
possible to separate the two classes $-1$ and $+1$ by a hyperplane.
Fig. 2 Classifier: the support vectors are the objects that are placed on the dotted
lines representing the margin
Fig. 3 Schematic representation of the nonlinear SVM classifier and its margin
$$\min_{\alpha}\ \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i' x_j - \sum_{i=1}^{n} \alpha_i$$
subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$, $i = 1, 2, \ldots, n$ (2)
3.4.3 Nonlinear Support Vector Machine
Here we consider again the function to optimize (Eq. 1), but with
the points mapped into $H$ by the function $\Phi$:
$$\min_{W, b}\ \frac{1}{2} \lVert W \rVert^2 \quad \text{subject to } y_i \left( W' \Phi(x_i) + b \right) \ge 1 \text{ for all } i = 1, 2, \ldots, n \qquad (3)$$
A new parameter $C$ is needed to solve the optimization problem.
It controls the individuals that the model cannot classify in the
correct class:
$$\min_{W, b}\ \lVert W \rVert^2 + C \sum_{i=1}^{n} \xi_i$$
subject to $y_i \left( W' \Phi(x_i) + b \right) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i = 1, 2, \ldots, n$ (4)
subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$ for $i = 1, 2, \ldots, n$ (5)
table(ytest, ypred) # confusion table of true versus predicted classes
# Compute accuracy
sum(ypred == ytest) / length(ytest)
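The confusion table and accuracy computed above can be tried on a small example without fitting any model. The label vectors below are hypothetical and were written by hand; they only illustrate how the two quantities relate.

```r
# Hypothetical true and predicted class labels for ten test genes.
ytest <- factor(c(1, 1, 1, -1, -1, -1, -1, 1, -1, -1), levels = c(-1, 1))
ypred <- factor(c(1, 1, -1, -1, -1, -1, 1, 1, -1, -1), levels = c(-1, 1))

# Confusion table: rows are the true classes, columns the predicted ones.
conf_mat <- table(truth = ytest, predicted = ypred)

# Accuracy: proportion of test genes whose predicted class is correct.
accuracy <- sum(ypred == ytest) / length(ytest)

# Equivalent computation: correct predictions lie on the diagonal.
accuracy_diag <- sum(diag(conf_mat)) / sum(conf_mat)
```

When the two classes are very unbalanced, the off-diagonal cells of the table are often more informative than the accuracy alone.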
4 Notes
References
1. Dudoit S, Yang YH, Callow MJ, Speed TP (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sinica 12(1):111–140
2. Moguerza JM, Muñoz A (2006) Support vector machines with applications. Statist Sci 21(3):299–426
3. R Core Team (2014) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, http://www.R-project.org/
4. López-Kleine L, Molano N, Ospina L (2013) Using multivariate methods to infer knowledge from genomic data. Int J Bioinform Res Appl 9(3):285–300. doi:10.1504/IJBRA.2013.053607
5. López-Kleine L, Torres-Avilés F, Tejedor FH, Gordillo LA (2012) Virulence factor prediction in Streptococcus pyogenes using classification and clustering based on microarray data. Appl Microbiol Biotechnol 93:2091–2098. doi:10.1007/s00253-012-3917-3
6. López-Kleine L, Romeo J, Torres-Avilés F (2013) Gene functional prediction using clustering methods for the analysis of tomato microarray data. In: Mohamad MS et al (eds) 7th International conference on PACBB, AISC, vol 222, pp 1–6
7. Romeo JS, Torres-Avilés F, López-Kleine L (2013) Detection of influent virulence and resistance genes in microarray data through quasi likelihood modeling. Mol Genet Genomics 288(1–2):49–61. doi:10.1007/s00438-012-0730-8
8. Huber W, von Heydebreck A, Sueltmann H, Poustka A, Vingron M (2003) Parameter estimation for the calibration and variance stabilization of microarray data. Stat Appl Genet Mol 2(1):Article 3
9. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Hornik K, Gentry J, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80
10. Rencher AC, Christensen WF (2012) Methods of multivariate analysis, 3rd edn. Wiley, Hoboken, NJ
11. Izenman AJ (2008) Modern multivariate statistical techniques: regression, classification, and manifold learning. Springer, New York
12. Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York
13. Mojena R (1977) Hierarchical grouping methods and stopping rules: an evaluation. Comput J 20(4):359–363. doi:10.1093/comjnl/20.4.359
14. Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179
15. Hartigan JA, Wong MA (1979) A k-means clustering algorithm. Appl Statist 28:100–108
16. Leiva-Valdebenito S, Torres-Avilés F (2010) A review of the most common partition algorithms in cluster analysis: a comparative study. Rev Colomb Estad 33(2):321–339
17. Kohonen T (1982) Self-organizing formation of topologically correct feature maps. Biol Cybern 43:59–69
18. Kohonen T (2001) Self-organizing maps, 3rd edn. Springer, New York
19. Friedman JH (1989) Regularized discriminant analysis. JASA 84:165–175
20. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415:530–536
21. Schölkopf B, Smola A (2002) Learning with Kernels: support vector machines, regularization, optimization, and beyond. The MIT Press, Cambridge
22. Clarke B, Fokoué E, Zhang H (2009) Principles and theory for data mining and machine learning. Springer, New York
Methods in Molecular Biology (2016) 1375: 55–74
DOI 10.1007/7651_2015_246
© Springer Science+Business Media New York 2015
Published online: 02 December 2015
Abstract
Rapid development and increasing popularity of gene expression microarrays have resulted in a number of
studies on the discovery of co-regulated genes. One important way of discovering such co-regulations is the
query-based search since gene co-expressions may indicate a shared role in a biological process. Although
there exist promising query-driven search methods adapting clustering, they fail to capture many genes that
function in the same biological pathway because microarray datasets are fraught with spurious samples or
samples of diverse origin, or the pathways might be regulated under only a subset of samples. On the other
hand, a class of clustering algorithms known as biclustering algorithms which simultaneously cluster both
the items and their features are useful while analyzing gene expression data, or any data in which items are
related in only a subset of their samples. This means that genes need not be related in all samples to be
clustered together. Because many genes only interact under specific circumstances, biclustering may recover
the relationships that traditional clustering algorithms can easily miss. In this chapter, we briefly summarize
the literature using biclustering for querying co-regulated genes. Then we present a novel biclustering
approach and evaluate its performance by a thorough experimental analysis.
1 Introduction
Fig. 1 Sample biclusters with various models: (a) constant-row, (b) shift, (c) scale, and (d) shift-scale. In
pattern expressions, $a_{ij}$ represents the expression level of gene $i$ in sample $j$, $\pi_j$ a base value, $\alpha_i$ scaling, and $\beta_i$
shifting patterns. The parameters are selected as $\alpha_i = [1, 2, 3, 1]^T$, $\beta_i = [2, 3, 4, 1]^T$, $\pi_j = [1, 2, 1, 4]$.
Shift-scale is the most general model, as it has shift and scale models as special cases and can represent both
positive and negative correlation
Querying Co-regulated Genes on Diverse Gene Expression Datasets Via Biclustering 59
2.1 PCC-Based Biclustering
PCC is a measure that evaluates positive and negative linear
relationships between vectors. It is commonly used in clustering gene
expression data [2, 4] due to its power in capturing both shifting
and scaling patterns. For PCC-based biclustering of a gene
expression dataset, the correlation of two genes is calculated on some
specified columns, since those genes may or may not be correlated
on every experiment. Therefore, our PCC-based similarity measure
between rows $r$ and $s$ on selected columns $Y$ is calculated as:
$$\mathrm{pcc}(r, s, Y) = \frac{\sum_{i \in Y} (r_i - \bar{r})(s_i - \bar{s})}{\sqrt{\sum_{i \in Y} (r_i - \bar{r})^2 \sum_{i \in Y} (s_i - \bar{s})^2}}, \qquad (1)$$
where the equation runs on the selected columns, and the absolute value
of the expression gives a result in the [0, 1] interval.
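The similarity measure of Eq. 1 can be sketched in base R as below. This is an illustrative sketch, not the authors' implementation: it assumes that the means are taken over the selected columns only, and the toy vectors `r` and `s` are hypothetical.

```r
# Pearson correlation between rows r and s restricted to the columns in Y.
# r, s: numeric vectors (two rows of the data matrix); Y: column indices.
pcc <- function(r, s, Y) {
  r_c <- r[Y] - mean(r[Y])   # center on the selected columns only
  s_c <- s[Y] - mean(s[Y])
  sum(r_c * s_c) / sqrt(sum(r_c^2) * sum(s_c^2))
}

# Toy rows: s is a shifted and scaled copy of r on columns 1 to 4 only.
r <- c(1, 2, 1, 4, 9, 0)
s <- c(2 * r[1:4] + 3, 5, 1)

pcc(r, s, 1:4)        # correlation on the selected columns
pcc(r, s, 1:6)        # correlation on all columns
cor(r[1:4], s[1:4])   # cross-check with the base R function
```

The example shows why the column subset matters: two rows that follow a shift-scale pattern on a few columns can look only weakly correlated when all columns are used. Applying `abs()` to the result gives the value in the [0, 1] interval mentioned in the text.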
PCC-based biclustering was recently proposed in [9, 57]. In [57],
the authors present the bi-correlation clustering algorithm (BCCA),
which tries to find biclusters using Pearson correlation. They also
discuss the complexity of computing pairwise PCCs, and the ineffi-
ciency of the method. Bozdağ et al. [9] discuss potential complexity
issues of an exhaustive search using PCC, and propose that, instead of
computing all pairwise PCC values, a center-like vector (tendency
vector) is sufficient and more efficient at finding correlated rows.
3.1 The CPB Algorithm
Let $R$ and $C$ denote the set of rows and columns of a data matrix $A$,
respectively. Each element $a_{rc} \in A$ represents the relation between
row $r$ and column $c$. A bicluster $B = (X, Y)$ is a subset of rows
$X = \{x_1, \ldots, x_n\}$ and a subset of columns $Y = \{y_1, \ldots, y_m\}$, where
$n \le N$ and $m \le M$.
Definition 1 (Correlated Pattern Biclusters Algorithm). Given a data
matrix $A$, reference row $r_r$, PCC threshold $\rho$, and minimum number of
columns $\gamma$, CPB finds a bicluster $B = (X, Y)$ such that $r_r \in X$, $m \ge \gamma$,
and $\forall x_i, x_j \in X$: $\mathrm{pcc}(x_i, x_j, Y) \ge \rho$.
3.1.1 Generating Initial Biclusters
Selecting the rows and columns of the initial bicluster is important,
since the algorithm converges to a more stable bicluster by adding and
removing rows and columns. In [9], initial biclusters were chosen
randomly, and the algorithm ran efficiently when discovering a small
number of biclusters embedded in synthetic datasets. However, we
observe that when there are multiple biclusters
this approach does not provide a consistent mechanism to return
multiple biclusters with good coverage of the whole dataset.
In CPB, we generate initial biclusters with a grid-based
approach. We first shuffle the row and column numbers of the
dataset, and then partition the dataset into a coarse-grain grid of
10 × 2 initial biclusters. The query gene $r_r$ is inserted into each
bicluster, if necessary. At the end, all genes and conditions in the
dataset are assigned to at least one initial bicluster. Repeating the
process gives us enough initial biclusters to find co-regulated genes
and corresponding conditions. In addition, different runs obtain
more than 75 % of the top-ranked co-regulated genes with the
grid-based initialization, even though the generation of the initial
biclusters is randomized.
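The grid-based initialization can be sketched in base R as follows. This is a hedged illustration rather than the CPB code: the function name `initial_biclusters`, its arguments, and the choice of 10 row blocks by 2 column blocks (rather than the reverse) are assumptions made for the example.

```r
# Shuffle row and column numbers, cut them into a coarse grid, and make
# sure the query row belongs to every initial bicluster.
initial_biclusters <- function(n_rows, n_cols, query_row,
                               row_blocks = 10, col_blocks = 2) {
  rows <- sample.int(n_rows)   # shuffled row numbers
  cols <- sample.int(n_cols)   # shuffled column numbers
  row_parts <- split(rows, ceiling(seq_along(rows) * row_blocks / n_rows))
  col_parts <- split(cols, ceiling(seq_along(cols) * col_blocks / n_cols))
  grid <- list()
  for (rp in row_parts) {
    for (cp in col_parts) {
      # union() adds the query row only if it is not already in the block
      grid[[length(grid) + 1]] <- list(rows = union(query_row, rp), cols = cp)
    }
  }
  grid
}

set.seed(7)
init <- initial_biclusters(n_rows = 1000, n_cols = 200, query_row = 5)
length(init)   # number of initial biclusters produced by one pass
```

Calling the function repeatedly with different seeds yields further sets of initial biclusters, each covering all rows and columns of the dataset.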
3.1.3 Updating the Rows of a Bicluster
For a row $r$ to be included in $X$, we require $\mathrm{pcc}(r, x_i, Y) > \rho$ for all
$x_i \in X$. To avoid testing this condition against all $x_i \in X$, we
utilize the tendency vector $T$, and only test whether $\mathrm{pcc}(r, T, Y)$ is
greater than another threshold $\rho'$ instead. $\rho'$ is selected such that
$\mathrm{pcc}(r, T, Y) > \rho'$ ensures $\mathrm{pcc}(r, x_i, Y) > \rho$ for all $x_i \in X$.
However, PCC lacks the transitivity property [58] and has a complex
formula that strongly depends on the values and the length of the
vectors. Although it is analytically difficult to compute a lower
bound for $\rho'$, it was empirically shown that there exists a lower
bound proportional to $\rho$ [9].
In Algorithm 1, we start with a relaxed threshold and slowly
tighten it at Line 18. While tightening $\rho'$, we relax the constraint on
the minimum number of columns. This allows sweeping the search
space between two extreme combinations of these parameters.
The algorithm uses five tightening steps and initial values of
$\rho'_c = \frac{2}{3}\rho$ and $\gamma'_c = |Y|$ (Line 3).
3.1.4 Updating the Columns of a Bicluster
Using PCC to measure the coherence between the columns is too
restrictive. For example, although the rows in Fig. 1d are perfectly
correlated, the Pearson correlation between columns is less than 1.
Therefore, we use the root mean square error to assess the coherence
of the columns. It is computed as:
$$\mathrm{ERROR}(y_k) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \tilde{a}_{x_i y_k} - t_k \right)^2}, \qquad (3)$$
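The column error of Eq. 3 can be sketched in base R as below. The sketch assumes that the normalized bicluster values and the tendency vector are already available; how they are obtained is not shown here, and the names `column_error`, `A_norm`, and `t_vec` as well as the toy values are hypothetical.

```r
# A_norm: normalized bicluster values, one row per bicluster row x_1..x_n,
#         one column per candidate column y_k (normalization assumed done).
# t_vec:  tendency vector with one entry t_k per column.
column_error <- function(A_norm, t_vec) {
  # sweep() subtracts t_k from every entry of column k
  sqrt(colMeans(sweep(A_norm, 2, t_vec)^2))
}

A_norm <- rbind(c(1.0, 2.0, 0.0),
                c(1.0, 2.2, 4.0),
                c(1.0, 1.8, 2.0))
t_vec <- c(1, 2, 2)
err <- column_error(A_norm, t_vec)   # one root mean square error per column
```

Columns whose values stay close to the tendency vector receive a small error, so the error can be compared against a threshold to decide which columns remain in the bicluster.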
3.2 Filtering Biclusters Found by Random Chance
Any dataset contains small biclusters with a high Pearson correlation
value by random chance. Although we specify a lower bound ρ′ for PCC
and a minimum number of columns γ, CPB recovers such small biclusters
in addition to larger ones, especially when γ is small. To eliminate
randomly found biclusters in a non-parametric fashion, we developed
the following method. Let $\mathcal{B} = \{B_1, B_2, \ldots, B_z\}$ be
the set of biclusters found by different runs of CPB on a data
matrix A. We first generate A′ by shuffling the elements of A. Then,
we find the bicluster Bmax with the highest number of rows in A′, and
use its dimension n′ as a threshold to filter biclusters in
$\mathcal{B}$. Algorithm 2 summarizes the filtering process.
Note that the parameter n′ is unique for each dataset, but this
method empirically finds a lower bound for n′. The more biclusters
generated from the shuffled dataset, the better the estimate of n′.
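The shuffle-based filter described above can be sketched as follows. Algorithm 2 is not reproduced here, so this is a minimal sketch under assumptions: a bicluster is represented as a (rows, columns) pair of sets, the biclustering routine is passed in as a callable (`find_biclusters`, a hypothetical stand-in for a CPB run), and a found bicluster is kept only if it has strictly more rows than n′.

```python
import numpy as np

def shuffle_elements(A, rng):
    """Return A' : a copy of A with all elements randomly permuted."""
    flat = A.ravel().copy()
    rng.shuffle(flat)
    return flat.reshape(A.shape)

def estimate_row_threshold(A, find_biclusters, n_shuffles, rng):
    """Largest number of rows of any bicluster found in shuffled copies of A."""
    n_prime = 0
    for _ in range(n_shuffles):
        for rows, _cols in find_biclusters(shuffle_elements(A, rng)):
            n_prime = max(n_prime, len(rows))
    return n_prime

def filter_biclusters(biclusters, n_prime):
    """Keep biclusters with more rows than the chance-level threshold (assumed strict)."""
    return [(rows, cols) for rows, cols in biclusters if len(rows) > n_prime]

def toy_finder(M):
    """Stand-in for a biclustering run; returns fixed (rows, cols) pairs."""
    return [({0, 1}, {0, 1, 2}), ({2, 3, 4}, {1, 2})]

rng = np.random.default_rng(0)
A = np.arange(12.0).reshape(3, 4)
n_prime = estimate_row_threshold(A, toy_finder, n_shuffles=5, rng=rng)
found = [(set(range(10)), {0, 1}), ({1, 2}, {0, 1, 2})]
kept = filter_biclusters(found, n_prime)
```

Increasing `n_shuffles` corresponds to generating more biclusters from shuffled data, which the text notes gives a better estimate of n′.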
3.3 Combining Correlation Information
CPB often produces different resulting biclusters due to the random
selection of initial biclusters. Information from these biclusters,
each including the reference row rr, is merged to score each
row’s relationship with rr.
In [9], the bicluster uniqueness (BU) measure was proposed to
calculate the correlation score of the genes. Although BU is able
to capture the information redundancy caused by overlapping
biclusters, we present a similar but more efficient scoring function
to be used instead.
Let $\mathcal{B} = \{B_1, B_2, \ldots, B_z\}$ be the set of biclusters
found by different runs of CPB on a data matrix A with reference row
rr. Suppose $I_R(r)$ and $I_C(c)$ denote the maximal subsets of
$\mathcal{B}$ that contain the given row and column, respectively.
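The two index sets defined above can be written down directly. The scoring function that uses them is not reproduced in this part of the chapter, so the sketch below only builds the subsets of biclusters containing a given row or column. Biclusters are represented as (rows, columns) pairs of sets, which is an assumption made for illustration.

```python
def biclusters_with_row(biclusters, r):
    """I_R(r): all biclusters whose row set contains r."""
    return [b for b in biclusters if r in b[0]]

def biclusters_with_column(biclusters, c):
    """I_C(c): all biclusters whose column set contains c."""
    return [b for b in biclusters if c in b[1]]

# Toy collection of biclusters, each given as (rows, columns)
B = [({0, 1, 2}, {0, 1}), ({0, 3}, {1, 2}), ({4, 5}, {2, 3})]
rows_hit = biclusters_with_row(B, 0)
cols_hit = biclusters_with_column(B, 2)
```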
64 Mehmet Deveci et al.
4 Experimental Results
4.1 Experiments on Synthetic Datasets
We first define recovery and relevance metrics to evaluate the results
of biclustering algorithms. For each experiment, a synthetic dataset
is generated with 1000 rows and 200 samples. Then two 60 × 60
biclusters with the given model are embedded into the dataset. The
average score of 100 replications of the same experiment is reported.
4.1.1 Evaluation Metrics
Similar to recall and precision metrics, recovery and relevance
scores are proposed to evaluate the biclustering results. These
measures can be defined to compare a single found bicluster against
an expected one, as well as a set of found biclusters against a set of
expected ones.
Let e and f be expected and found biclusters, respectively. The
recovery score of a found bicluster against an expected one is
calculated by dividing the intersection area by the area of the
expected bicluster:
$$\mathrm{rec}(e, f) = \frac{|e \cap f|}{|e|}, \qquad (6)$$
where the recovery score reaches 1 if and only if $e \subseteq f$.
Similarly, the relevance score is calculated by dividing the
intersection area by the area of the found bicluster:
$$\mathrm{rel}(e, f) = \frac{|e \cap f|}{|f|}, \qquad (7)$$
where the relevance score reaches 1 if and only if $f \subseteq e$.
Examples of how these scores are computed are given in Fig. 3.
Using these rec and rel measures, we define recovery and relevance
scores to compare two sets. Let E and F be sets of expected and
found biclusters, respectively.
Fig. 3 Example expected/found biclusters with their recovery and relevance scores
$$\mathrm{REL}(E, F) = \frac{1}{|F|}\sum_{f \in F}\max_{e \in E}\mathrm{rel}(e, f) \qquad (9)$$
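The pairwise scores of Eqs. (6) and (7) and the set-level relevance of Eq. (9) can be sketched in a few lines. A bicluster is represented here as a (rows, columns) pair of sets, and its area as the number of rows times the number of columns; both are illustrative choices. The set-level recovery function `REC` is an assumption: it is written as the symmetric counterpart of Eq. (9), since its defining equation does not appear in this part of the text.

```python
def area(b):
    """Number of cells in a bicluster given as (rows, cols)."""
    return len(b[0]) * len(b[1])

def overlap_area(e, f):
    """Number of cells shared by two biclusters."""
    return len(e[0] & f[0]) * len(e[1] & f[1])

def rec(e, f):
    """Eq. (6): fraction of the expected bicluster that was found."""
    return overlap_area(e, f) / area(e)

def rel(e, f):
    """Eq. (7): fraction of the found bicluster that was expected."""
    return overlap_area(e, f) / area(f)

def REL(E, F):
    """Eq. (9): average best relevance over all found biclusters."""
    return sum(max(rel(e, f) for e in E) for f in F) / len(F)

def REC(E, F):
    """Assumed counterpart of Eq. (9): average best recovery over expected biclusters."""
    return sum(max(rec(e, f) for f in F) for e in E) / len(E)

e = (frozenset(range(10)), frozenset(range(10)))      # expected, 10 x 10
f = (frozenset(range(5, 15)), frozenset(range(10)))   # found, half overlapping
g = (frozenset(range(5)), frozenset(range(5)))        # found, fully inside e
```

For these toy biclusters, f covers half of e, and g lies entirely inside e but covers only a quarter of it, so g has perfect relevance and low recovery.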
4.1.2 Effects of the Bicluster Model
Biclustering methods often focus on detecting specific types of
biclusters, as mentioned in the Background section. In this experiment,
we compare the success rate of CPB with other algorithms on
detecting biclusters generated with various models. Constant-row,
shift, scale, and shift-scale models were chosen for this experiment.
Examples of these models are given in Fig. 1.
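The four bicluster models can be generated as in the sketch below. Fig. 1 is not reproduced in this part of the chapter, so the definitions used here are the commonly used ones and should be read as assumptions: a constant-row bicluster repeats one value per row, a shift bicluster adds a per-row offset to a shared base row, a scale bicluster multiplies the base row by a per-row factor, and a shift-scale bicluster does both. The value ranges and the random background are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 60, 60

base = rng.normal(size=n_cols)                     # shared base row
shift = rng.uniform(-5.0, 5.0, size=(n_rows, 1))   # per-row additive offset
scale = rng.uniform(0.5, 3.0, size=(n_rows, 1))    # per-row positive factor

models = {
    "constant-row": np.repeat(shift, n_cols, axis=1),
    "shift": base + shift,
    "scale": base * scale,
    "shift-scale": base * scale + shift,
}

def embed(data, block, rows, cols):
    """Return a copy of data with block written at the given rows and columns."""
    out = data.copy()
    out[np.ix_(rows, cols)] = block
    return out

# 1000 x 200 background with one 60 x 60 shift-scale bicluster in the corner
background = rng.normal(size=(1000, 200))
dataset = embed(background, models["shift-scale"], np.arange(60), np.arange(60))
```

With positive scale factors, every pair of rows in the shift, scale, and shift-scale blocks has a Pearson correlation of 1, which is why a PCC-based method can in principle recover all three. Rows of the constant-row block have zero variance, so their PCC is undefined and needs separate handling.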
The resulting recovery and relevance scores (see Fig. 4a–d)
show that CPB is the only algorithm that can fully recover biclusters
generated with all four models with a high relevance score. BBC
was able to find shifted and constant-row biclusters with a slightly
lower relevance score. BCCA was expected to display similar results
to CPB since they both use Pearson correlation; however, it was
only able to fully recover shift biclusters. OPSM could not identify
shift, scale, or shift-scale biclusters although they are all valid
order-preserving submatrices. Our experiments show that when a base
row is scaled with a value between −1 and 1, the expression rankings
of the columns of a bicluster row lie in a narrow range along
the row; therefore, OPSM fails to discover it. Despite this
limitation, OPSM was able to identify one of the shifted biclusters,
since it reports a single bicluster for each size of column. The δ-biclusters
algorithm performed poorly on all the datasets, among which it can
partially recover only constant-row biclusters. The other models are
not captured by the metric, which was previously discussed in [53].
For the noise and overlap experiments, CPB and other meth-
ods were compared on shift biclusters since the shift model is
successfully recovered by most of the algorithms (see Fig. 4b).
4.1.3 Effects of the Noise
Microarray results are perturbed by many sources of noise. In
order to measure the sensitivity of CPB to noise, an error value ε
was added to each element of the synthetic datasets. The experi-
ments were run with various noise levels: each error value was
drawn from a normal distribution with zero mean and variance
equal to the chosen noise level.
Figure 4e shows the recovery and relevance scores of the algorithms
on datasets with varying noise levels. OPSM is dramatically
affected by noise, since noise may violate the order-preserving structure.
Querying Co-regulated Genes on Diverse Gene Expression Datasets Via Biclustering 67
Fig. 4 Experiments on synthetic datasets: (a–d) with different bicluster models, (e) under noise, and (f) with
overlapping biclusters
Since BCCA checks the pairwise correlation score for each row, this
method is more likely to be affected by increasing noise.
Although BBC seems to be insensitive to noise in the recovery plot,
its relevance score drops slightly with noise addition (see Fig. 4e).
CPB is the second most noise-resistant algorithm, even though it
uses a linear metric. Moreover, the noise resistance of CPB
can be improved by choosing a more suitable PCC threshold. We fixed
ρ = 0.9 in order to be consistent with the rest of the experiments.
We also experimented with relative noise, in which the noise added
to each element is proportional to its expression value, i.e.,
element x becomes x + xε. We observed results similar to the
previous experiment.
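The two noise schemes described above can be sketched as follows. The noise level is interpreted as the variance of a zero-mean normal distribution, as stated in the text, so the standard deviation passed to the random generator is its square root. Function names and the example values are hypothetical.

```python
import numpy as np

def add_absolute_noise(data, noise_level, rng):
    """x -> x + eps, with eps ~ N(0, noise_level), noise_level being the variance."""
    eps = rng.normal(0.0, np.sqrt(noise_level), size=data.shape)
    return data + eps

def add_relative_noise(data, noise_level, rng):
    """x -> x + x * eps, so the perturbation scales with the expression value."""
    eps = rng.normal(0.0, np.sqrt(noise_level), size=data.shape)
    return data + data * eps

rng = np.random.default_rng(0)
clean = np.zeros(200_000)
noisy = add_absolute_noise(clean, noise_level=0.25, rng=rng)
empirical_variance = noisy.var()          # should be close to the noise level

# Relative noise leaves zero entries untouched
relative = add_relative_noise(np.array([0.0, 10.0]), noise_level=0.25, rng=rng)
```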
4.1.4 Effects of the Overlap
A gene may take roles in several functions in a cell, each of which
may be occurring simultaneously in a given sample; therefore, there
might be overlaps between biclusters. In this experiment we test
how CPB and other algorithms perform with increasing overlaps of
biclusters. The datasets are generated with two overlapping
biclusters. The overlapping regions of these biclusters are increased
by 10 rows and 10 columns at each step. The expression values in
these regions are not assumed to be additive; instead, shift values
for rows and the base vector are chosen so that both biclusters
have the same expression value in the overlapping regions.
Figure 4f shows the results of the overlap test. We observe that
CPB and BCCA are both insensitive to increasing overlap, although
BCCA fails to recover a very small portion of the biclusters. BBC is
affected more than BCCA in terms of recovery; also, its relevance
score drops with increasing overlap. Although OPSM recovers only
one of the biclusters, it increases its recovery score by including
more of the overlapping region with increasing overlap.
4.2 Identifying Genes Co-regulated with BRCA1, BRCA2, p53
In this experiment, we employ CPB to identify the genes most
correlated with BRCA1, BRCA2, and p53, which are highly penetrant
cancer-specific tumor suppressors. CPB was run on 40 different
datasets obtained from the GPL96 series (GDS{1064, 1284, 1615,
2113, 2362, 2649, 2954, 3116, 3312, 3716, 1067, 1329, 1815,
2190, 2373, 2736, 3057, 3128, 3471, 534, 1209, 1375, 1956,
2255, 2519, 2767, 3096, 3233, 3514, 596, 1220, 1479, 1975,
2297, 2643, 2771, 3097, 3257, 3517, 987}), all of which have the
same set of probes. The results of each dataset are then combined
with the Gene Correlation Score function.
Table 1 gives the top-ranked genes for the probes of BRCA1,
BRCA2, and p53. We observe that more than 50% of the
genes found by our framework have already been investigated in
cancer research, suggesting that CPB is indeed finding genes
involved in cancer.
Table 1
Associated top-ranked genes for 6 probes of BRCA1, BRCA2, and p53
4.3 Comparison with Other Query-Based Frameworks
There are a limited number of studies on query-based discovery of
co-regulated genes in the literature. Gene recommender [5] analyzed
the Rb protein complex to find new co-regulated genes in worms
(specifically C. elegans) using a technique similar to biclustering.
SPELL [6], a PCC-based clustering framework, was tested on S.
cerevisiae datasets, where several genes were categorized and
annotated. A probabilistic biclustering framework, QDB [7], was tested
on some synthetic and yeast microarray datasets. Adler et al. [8]
proposed a query engine (MEM) to search for correlated genes
across many datasets. Zhao et al. proposed ProBic [10], a
probabilistic biclustering algorithm, and tested it on E. coli to
detect high quality biclusters in the presence of noise.
Among those query-driven search methods, we could compare
our framework on cancer-related genes only with MEM [8],
because the other studies are either specialized in non-human
organisms or their resources are not accessible. Using the MEM
framework, we retrieved the genes correlated with the selected
probes of the BRCA1, BRCA2, and p53 genes using Pearson correlation.
The top-ranked genes were then investigated to find whether each
gene is claimed to be cancer-related in a research study in the
medical literature.
In Table 2 we compare our results with MEM’s based on how
successful each method is at finding known and unexplored genes.
Since all samples (columns) of a microarray dataset were included in
similarity calculations before biclustering, the top-ranked genes
discovered by MEM are expected to have been investigated before. While
the ratios of known cancer-related genes are similar, we argue that
our framework finds more unexplored genes that are likely to be
missed by earlier clustering-based methods.
Although PCC is the default similarity measure, we also ran
MEM with absolute Pearson correlation, which is expected to
capture negative correlations as in our pcc function (see Eq. (1)).
However, the results of the two runs of the MEM framework on the six
probes overlap by 87%. Absolute PCC could only
Table 2
Ratio of known and unexplored genes found within top-25 results of MEM (with Pearson and absolute
Pearson correlation) and our framework
5 Conclusion
6 Acknowledgments
References
1. Ben-Dor A, Chor B, Karp R, Yakhini Z (2002) Discovering local structure in gene expression data: The order-preserving submatrix problem. In: Proceedings of the International Conference on Computational Biology, pp 49–57
2. Jiang D, Pei J, Zhang A (2003) DHC: a density-based hierarchical clustering method for time series gene expression data. In: Proceedings IEEE Symposium on BioInformatics and Bioengineering, pp 393–400
3. Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 1(1):24–45
4. Pujana MA, Han J-DJ, Starita LM, Stevens KN, Tewari M, Ahn JS, Rennert G, Moreno V, Kirchhoff T, Gold B, Assmann V, ElShamy WM, Rual J-F, Levine D, Rozek LS, Gelman RS, Gunsalus KC, Greenberg RA, Sobhian B, Bertin N, Venkatesan K, Ayivi-Guedehoussou N, Sole X, Hernandez P, Lazaro C, Nathanson KL, Weber BL, Cusick ME, Hill DE, Offit K, Livingston DM, Gruber SB, Parvin JD, Vidal M (2007) Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat Genet 39(11):1338–1349
5. Owen AB, Stuart J, Mach K, Villeneuve AM, Kim S (2003) A gene recommender algorithm to identify coexpressed genes in C. elegans. Genome Res 13(8):1828–1837
6. Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG (2007) Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics 23:2692–2699
7. Dhollander T, Sheng Q, Lemmens K, De Moor B, Marchal K, Moreau Y (2007) Query-driven module discovery in microarray data. Bioinformatics 23:2573–2580
8. Adler P, Kolde R, Kull M, Tkachenko A, Peterson H, Reimand J, Vilo J (2009) Mining for coexpression across hundreds of datasets using novel rank aggregation and visualization methods. Genome Biol 10:R139
9. Bozdağ D, Parvin JD, Çatalyürek ÜV (2009) A biclustering method to discover co-regulated genes using diverse gene expression datasets. In: Proceedings of 1st International Conference on Bioinformatics and Computational Biology, pp 151–163
10. Zhao H, Cloots L, Van den Bulcke T, Wu Y, De Smet R, Storms V, Meysman P, Engelen K, Marchal K (2011) Query-based biclustering of gene expression data using probabilistic relational models. BMC Bioinf 12(Suppl 1):S37
11. Cheng Y, Church GM (2000) Biclustering of expression data. In: Proceedings of International Conference on Intelligent Systems for Molecular Biology, pp 93–103
12. Segal E, Taskar B, Gasch A, Friedman N, Koller D (2001) Rich probabilistic models for gene expression. Bioinformatics 17(suppl_1):S243–S252
13. Wang H, Wang W, Yang J, Yu PS (2002) Clustering by pattern similarity in large data sets. In: Proceedings of ACM SIGMOD
14. Lazzeroni L, Owen A (2000) Plaid models for gene expression data. Tech. Rep., Stanford University
15. Mitra S, Banka H (2006) Multi-objective evolutionary biclustering of gene expression data. Pattern Recognit 39(12):2464–2477
16. Mejía-Roa E, Carmona-Saez P, Nogales R, Vicente C, Vázquez M, Yang XY, García C, Tirado F, Pascual-Montano A (2008) bioNMF: a web-based tool for nonnegative matrix factorization in biology. Nucleic Acids Res 36(suppl 2):W523–W528
17. Gu J, Liu JS (2008) Bayesian biclustering of gene expression data. BMC Genomics 9(Suppl 1):S4
18. Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Van Sanden S, Lin D, Talloen W et al (2010) Fabia: factor analysis for bicluster acquisition. Bioinformatics 26(12):1520–1527
19. Painsky A, Rosset S (2012) Exclusive row biclustering for gene expression using a combinatorial auction approach. In: Proceedings of the 2012 IEEE 12th International Conference on Data Mining, pp 1056–1061. IEEE Computer Society
20. Joung J-G, Kim S-J, Shin S-Y, Zhang B-T (2012) A probabilistic coevolutionary biclustering algorithm for discovering coherent patterns in gene expression dataset. BMC Bioinf 13(Suppl 17):S12
21. Flores JL, Inza I, Larrañaga P, Calvo B (2013) A new measure for gene expression biclustering based on non-parametric correlation. Comput Methods Prog Biomed 112(3):367–397
22. Sun P, Speicher NK, Röttger R, Guo J, Baumbach J (2014) Bi-force: large-scale bicluster editing and its application to gene expression data biclustering. Nucleic Acids Res. doi:10.1093/nar/gku201
23. Chakraborty A (2005) Biclustering of gene expression data by simulated annealing. In: Proceedings of Eighth International Conference on High-Performance Computing in Asia-Pacific Region, 2005, pp 627–632
24. Liew AW-C, Law N-F, Yan H (2011) Recent patents on biclustering algorithms for gene expression data analysis. Recent Pat DNA Gene Seq 5(2):117–125
25. Hussain SF (2011) Bi-clustering gene expression data using co-similarity. In: Proceedings of the 7th International Conference on Advanced Data Mining and Applications - Volume Part I, ADMA’11, pp 190–200. Springer, Berlin/Heidelberg
26. An J, Liew AW-C, Nelson CC (2012) Seed-based biclustering of gene expression data. PLoS ONE 7:e42431
27. Kiraly A, Abonyi J, Laiho A, Gyenesei A (2012) Biclustering of high-throughput gene expression data with bicluster miner. In: IEEE 12th International Conference on Data Mining Workshops (ICDMW), 2012, pp 131–138
28. Liu J, Wang J, Wang W (2004) Biclustering in gene expression data by tendency. In: Proceedings of IEEE Computational Systems Bioinformatics Conference, pp 182–193. IEEE Computer Society
29. Liu J, Wang J, Wang W (2004) Gene ontology friendly biclustering of expression profiles. In: Proceedings of IEEE Computational Systems Bioinformatics Conference, pp 436–447. IEEE Computer Society
30. Madeira S, Oliveira A (2005) A linear time biclustering algorithm for time series gene expression data. In: Casadio R, Myers G (eds) Algorithms in bioinformatics. Lecture Notes in Computer Science, vol 3692, pp 39–52. Springer, Berlin/Heidelberg
31. Pontes B, Giraldéz R, Aguilar-Ruiz JS (2013) Configurable pattern-based evolutionary biclustering of gene expression data. Algorithms Mol Biol 8:4
32. Yang W-H, Dai D-Q, Yan H (2011) Finding correlated biclusters from gene expression data. IEEE Trans Knowl Data Eng 23:568–584
33. Yoon S, Nardini C, Benini L, De Micheli G (2005) Discovering coherent biclusters from gene expression data using zero-suppressed binary decision diagrams. IEEE/ACM Trans Comput Biol Bioinf 2:339–354
34. Angiulli F, Cesario E, Pizzuti C (2008) Random walk biclustering for microarray data. Inf Sci 178(6):1479–1497
35. Bryan K (2005) Biclustering of expression data using simulated annealing. In: Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems, CBMS’05, (Washington, DC, USA), pp 383–388. IEEE Computer Society
36. Bryan K, Cunningham P, Bolshakova N (2006) Application of simulated annealing to the biclustering of gene expression data. Trans Inf Tech Biomed 10:519–525
37. Bleuler S, Prelic A, Zitzler E (2004) An EA framework for biclustering of gene expression data. In: Congress on Evolutionary Computation, 2004 (CEC2004), vol 1, pp 166–173
38. Divina F, Aguilar-Ruiz J (2006) Biclustering of expression data with evolutionary computation. IEEE Trans Knowl Data Eng 18:590–602
39. Nepomuceno JA, Troncoso A, Aguilar-Ruiz JS (2010) Correlation-based scatter search for discovering biclusters from gene expression data. In: Proceedings of the 8th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, EvoBIO’10, pp 122–133. Springer, Berlin/Heidelberg
40. Nepomuceno JA, Troncoso A, Aguilar-Ruiz JS (2011) A comparative analysis of biclustering algorithms for gene expression data. BioData Mining 4:3
41. Erten C, Sözdinler M (2009) Biclustering expression data based on expanding localized substructures. In: Rajasekaran S (ed) Bioinformatics and computational biology. Lecture Notes in Computer Science, vol 5462, pp 224–235. Springer, Berlin/Heidelberg
42. Tanay A, Sharan R, Shamir R (2002) Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(Supplement 1):136–144
43. Bergmann S, Ihmels J, Barkai N (2003) Iterative signature algorithm for the analysis of large-scale gene expression data. Phys Rev E Stat Nonlinear Soft Matter Phys 67:031902
44. Kluger Y, Basri R, Chang JT, Gerstein M (2003) Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 13(4):703–716
45. Prelić A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E (2006) A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22:1122–1129
46. Li G, Ma Q, Tang H, Paterson AH, Xu Y (2009) QUBIC: a qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Res 37(15):e101
47. Huttenhower C, Mutungu KT, Indik N, Yang W, Schroeder M, Forman JJ, Troyanskaya OG, Coller HA (2009) Detailing regulatory networks through large scale data integration. Bioinformatics 25:3267–3274
48. Voggenreiter O, Bleuler S, Gruissem W (2012) Exact biclustering algorithm for the analysis of large gene expression data sets. BMC Bioinf 13(Suppl 18):A10
49. Bryan K, Cunningham P (2006) Bottom-up biclustering of expression data. In: IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology, 2006 (CIBCB ’06), pp 1–8
50. Murali T, Kasif S (2003) Extracting conserved gene expression motifs from gene expression data. Pac Symp Biocomput 8:77–88
51. Liu J, Wang W (2003) Op-cluster: clustering by tendency in high dimensional space. In: Proceedings of IEEE International Conference on Data Mining, p 187
52. Freitas AV, Ayadi W, Elloumi M, Oliveira J, Oliveira J, Hao J-K (2013) Survey on biclustering of gene expression data, pp 591–608. Wiley, New York
53. Bozdağ D, Kumar A, Çatalyürek ÜV (2010) Comparative Analysis of Biclustering Algorithms. In: ACM International Conference on Bioinformatics and Computational Biology
54. Chia BKH, Karuturi RKM (2010) Differential co-expression framework to quantify goodness of biclusters and compare biclustering algorithms. Algorithms Mol Biol 5(1):8
55. Eren K, Deveci M, Küçüktunç O, Çatalyürek ÜV (2012) A comparative analysis of biclustering algorithms for gene expression data. Brief Bioinform
56. Oghabian A, Kilpinen S, Hautaniemi S, Czeizler E (2014) Biclustering methods: Biological relevance and application in gene expression analysis. PLoS ONE 9(3):e90801
57. Bhattacharya A, De RK (2009) Bi-correlation clustering algorithm for determining a set of co-regulated genes. Bioinformatics 25(21):2795–2801
58. Casella G, Wells MT (1993) Is Pitman closeness a reasonable criterion: comment. J Am Stat Assoc 88(421):70–71
59. Mian O, Wang S, Zhu S, Gnanapragasam M, Graham L, Bear H, Ginder G (2011) Methyl-binding domain protein 2-dependent proliferation and survival of breast cancer cells. Mol Cancer Res 9(8):1152–1162
60. Kioulafa M, Kaklamanis L, Stathopoulos E, Mavroudis D, Georgoulias V, Lianidou ES (2009) Kallikrein 10 (KLK10) methylation as a novel prognostic biomarker in early breast cancer. Ann Oncol 20:1020–1025
61. Dorszewska J, Florczak J, Rozycka A, Jaroszewska-Kolecka J, Trzeciak WH, Kozubski W (2005) Polymorphisms of the CHRNA4 gene encoding the alpha4 subunit of nicotinic acetylcholine receptor as related to the oxidative DNA damage and the level of apoptotic proteins in lymphocytes of the patients with Alzheimer’s disease. DNA Cell Biol 24:786–794
62. Zhang L, Farrell JJ, Zhou H, Elashoff D, Akin D, Park N-H, Chia D, Wong DT (2010) Salivary transcriptomic biomarkers for detection of resectable pancreatic cancer. Gastroenterology 138(3):949–957, e1–7
63. Lindahl M, Poteryaev D, Yu L, Arumae U, Timmusk T, Bongarzone I, Aiello A, Pierotti MA, Airaksinen MS, Saarma M (2001) Human glial cell line-derived neurotrophic factor receptor alpha 4 is the receptor for persephin and is predominantly expressed in normal and malignant thyroid medullary cells. J Biol Chem 276:9344–9351
Methods in Molecular Biology (2016) 1375: 75–89
DOI 10.1007/7651_2015_237
© Springer Science+Business Media New York 2015
Published online: 11 April 2015
Abstract
Recent studies suggest that a substantial fraction of microRNA (miRNA) genes are likely to form
clusters in terms of evolutionary conservation and biological implications, posing a significant challenge for
the research community and shifting the bottleneck of scientific discovery from miRNA singletons to
miRNA clusters. In addition, advances in molecular sequencing techniques such as next-generation
sequencing (NGS) have enabled researchers to comprehensively characterize low-abundance miRNAs
on a genome-wide scale in multiple species. Taken together, a large-scale, cross-species survey of grouped
miRNAs based on genomic location would be valuable for investigating their biological functions and
regulation from an evolutionary perspective. In the present chapter, we describe the application of effective
and efficient bioinformatics tools to the identification of clustered miRNAs and illustrate how to use the
recently developed Web-based database, MetaMirClust (http://fgfr.ibms.sinica.edu.tw/MetaMirClust), to
discover evolutionarily conserved patterns of miRNA clusters across metazoans.
1 Introduction
2 Materials
3 Methods
3.2 Identification of miRNA Clusters (MirClust)
Recent studies have revealed that the clustering propensity of
miRNA genes is higher than previously estimated and that they usually
occur on polycistronic transcripts (17, 32–36). To investigate
clustered miRNA genes derived from the same polycistronic transcript,
researchers usually group adjacent miRNA genes located on the
same strand into miRNA clusters. Two or more consecutive
miRNA genes on the same strand of an individual chromosome are
considered to form a cluster according to the distance between them. In
miRBase, 10 Kb is used to report clustered miRNAs when users
browse an individual miRNA gene. Taking hsa-mir-25
(chr7:99,691,183-99,691,266:-) as an example, miRBase will display
hsa-mir-93 (chr7:99,691,391-99,691,470:-) and hsa-mir-106b
(chr7:99,691,616-99,691,697:-) as adjacent miRNA genes within
10 Kb, as shown in Fig. 1. Using a different adjacent
distance might therefore result in a different set of miRNA clusters.
Moreover, the clustered miRNAs reported in miRBase lack
information on evolutionary conservation across species. Four
different maximum inter-miRNA distances (MIDs), 1 Kb, 3 Kb, 10 Kb, and
50 Kb, have commonly been used to identify clustered miRNA genes
(MirClust). To illustrate the procedure for identifying miRNA
clusters (MirClust), we prepared two BED files composed of human
(hg19) precursor/mature miRNA genes (reported in miRBase v.16
or ZooMir) (http://fgfr.ibms.sinica.edu.tw/MetaMirClust/data/pre.mir.bed;
http://fgfr.ibms.sinica.edu.tw/MetaMirClust/data/mat.mir.bed)
as a sample data set for readers to identify miRNA
clusters (MirClust) in human. In addition, the BED file of
individual mature miRNA genes was prepared for the retrieval of miRNA
Fig. 1 The hsa-mir-25-106b cluster reported in miRBase. By the default MID of 10 Kb used in miRBase, this
snapshot figure shows two adjacent miRNA genes, hsa-mir-106b and hsa-mir-93, when querying hsa-mir-25
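The clustering rule described in this section (consecutive miRNA genes on the same chromosome and strand, separated by at most a chosen MID) can be sketched as below. This is an illustration rather than the MirClust implementation: the distance is assumed to be measured from the end of one gene to the start of the next, and genes are passed in memory as tuples instead of being read from the BED files. The three hsa-mir coordinates are those quoted in the text; the fourth record is a made-up, distant gene added only to show how a singleton is handled.

```python
from collections import defaultdict

def cluster_mirnas(genes, mid):
    """Group consecutive same-strand miRNA genes whose gap is at most `mid` nt.

    genes: iterable of (name, chrom, start, end, strand) tuples.
    Returns a list of clusters, each a list of gene tuples; singletons are dropped.
    """
    by_strand = defaultdict(list)
    for gene in genes:
        by_strand[(gene[1], gene[4])].append(gene)

    clusters = []
    for group in by_strand.values():
        group.sort(key=lambda g: g[2])            # order by start coordinate
        current = [group[0]]
        for gene in group[1:]:
            gap = gene[2] - current[-1][3]        # assumed: next start minus previous end
            if gap <= mid:
                current.append(gene)
            else:
                if len(current) > 1:
                    clusters.append(current)
                current = [gene]
        if len(current) > 1:
            clusters.append(current)
    return clusters

genes = [
    ("hsa-mir-25", "chr7", 99691183, 99691266, "-"),
    ("hsa-mir-93", "chr7", 99691391, 99691470, "-"),
    ("hsa-mir-106b", "chr7", 99691616, 99691697, "-"),
    ("hypothetical-mir", "chr7", 100691000, 100691080, "-"),  # made-up record
]
clusters_10kb = cluster_mirnas(genes, mid=10_000)
clusters_100nt = cluster_mirnas(genes, mid=100)
```

With an MID of 10 Kb the three genes from the text form one cluster and the made-up gene is left out as a singleton. With a much smaller threshold of 100 nt no cluster is reported, which illustrates how the choice of MID changes the resulting set of clusters.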
80 Wen-Ching Chan and Wen-chang Lin
Table 1
Distributions of numbers of identified miRNA clusters in nine representative species
Fig. 2 Two miRNA clusters identified on chromosome 13 in human. Based on miRNA genes identified in
miRBase and ZooMir and an MID of 10 Kb, two miRNA clusters can be revealed on chromosome 13.
One is mir-15a/16 (13q14.2), with a length of 229 nt on the plus strand, and the other is mir-17-92 (13q31.3),
with a length of 787 nt on the minus strand
Fig. 4 MetaMirClust data of mir-17-92 shown in UCSC Genome Browser. When the
human mir-17-92 (13q31.3) cluster is viewed in a public genome browser such as the UCSC Genome Browser,
additional evidence such as transcriptional regions, histone modifications, and conservation level
can help users gain more insight into miRNA clusters of interest
miRNA Cluster Discovery in Metazoan Genomes 83
3.3 Discovery of Metazoan miRNA Clusters (MetaMirClust) by FP-Growth Algorithm
Most previous works focused only on the evolutionary
and functional implications of a limited number of specific miRNA
clusters in a few species. Before MetaMirClust, no systematic and
efficient approach had been applied to analyze the conservation
pattern of miRNA clusters on a global scale. To interrogate the
conservation level of clusters of miRNA genes in large numbers
of metazoan genomes, we adopted a data mining approach to
discover the conserved co-occurrence modules of miRNA genes
among miRNA clusters identified under the same MID. After filtering
out the singleton miRNA clusters identified in MirClust as described in
the previous procedure, we conducted the analysis with the
FP-growth algorithm implemented by Borgelt
(http://www.borgelt.net/fpgrowth.html) to detect the conserved
co-occurrence sets of miRNA genes in miRNA clusters defined within the
same MID. These frequent co-occurrence sets represent highly
conserved combinations of miRNA genes in miRNA clusters across
metazoan species, and are defined as metazoan miRNA clusters.
Based on the nine representative species listed in Table 1, we
prepared an aggregate file
(http://fgfr.ibms.sinica.edu.tw/MetaMirClust/data/nine.mir.clust.csv)
consisting of all miRNA clusters identified with the previous
MirClust procedure. The following command can be used to discover
co-occurring miRNA genes across the selected species.
1. Discover co-occurred miRNA genes across species
fpgrowth -s-7 -q0 nine.mir.clust.csv nine.meta.mir.clust.csv
According to the output result (i.e., nine.meta.mir.clust.csv),
there are 84 evolutionarily conserved miRNA clusters (MetaMir-
Clust) identified in at least seven out of nine representative species.
Among those evolutionarily conserved miRNA clusters, mir-17-92
(13q31.3) is the largest group containing five miRNA classes with
six precursor miRNA genes. Figure 5 shows the conservation
pattern of mir-17-92 (13q31.3) in MetaMirClust. The length of the
mir-17-92 (13q31.3) cluster varies from 717 (Loxodonta africana)
to 1,028 (Gasterosteus aculeatus) nucleotides (nt) in 20 metazoan
genomes, which is consistent with the previously reported estimate
of about 1 kb for the length of the mir-17 cluster.
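To make the notion of a conserved co-occurrence set concrete, the sketch below counts, for every combination of miRNA classes that appear together in a cluster, the number of distinct species in which that combination occurs, and keeps the combinations found in at least seven species. It is a brute-force enumeration for illustration only, not the FP-growth algorithm and not a reimplementation of the `fpgrowth` program. Counting support per species (rather than per cluster record) is an assumption, and the species labels and item names are made up.

```python
from collections import defaultdict
from itertools import combinations

def conserved_sets(clusters, min_species, min_size=2):
    """Return {item set: number of species} for sets seen in >= min_species species.

    clusters: iterable of (species, members) pairs, one per miRNA cluster.
    """
    species_by_itemset = defaultdict(set)
    for species, members in clusters:
        items = sorted(set(members))
        for size in range(min_size, len(items) + 1):
            for combo in combinations(items, size):
                species_by_itemset[combo].add(species)
    return {
        itemset: len(species)
        for itemset, species in species_by_itemset.items()
        if len(species) >= min_species
    }

# Toy input: nine hypothetical species, items named A-D instead of real miRNA classes
toy_clusters = [(f"sp{i}", ["A", "B", "C"]) for i in range(1, 8)]
toy_clusters += [("sp8", ["A", "B"]), ("sp9", ["A", "D"])]

result = conserved_sets(toy_clusters, min_species=7)
```

The enumeration grows exponentially with cluster size, which is acceptable for small clusters but is exactly the cost that a frequent-pattern miner such as FP-growth is designed to avoid on larger inputs.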
In MetaMirClust, to investigate the recruitment process
between evolutionarily conserved miRNA clusters, we also
reconstructed the hierarchical structure using the sets of co-occurring
miRNA genes. Users can directly select an
evolutionarily conserved miRNA cluster of interest from the
MetaMirClust list
(http://fgfr.ibms.sinica.edu.tw/MetaMirClust/MetaMirClustStat.php)
or select a miRNA class from the
search page in MetaMirClust
(http://fgfr.ibms.sinica.edu.tw/MetaMirClust/MetaMirClustSearch.php)
to obtain the hierarchical
information involving the selected miRNA cluster and the
Fig. 5 The conservation pattern of mir-17-92 across metazoans. In our data set, the mir-17-92 (13q31.3)
cluster is conserved across 20 species, with 24 instances among
these species
4 Notes
4.1 Data Preparation from Diverse Sources
In miRNA research, miRBase is the most important repository: a
searchable database in which computationally predicted and
experimentally identified miRNA genes are collected. Recently, owing to
advances in molecular sequencing techniques such as next-generation
sequencing (NGS), miRBase has accumulated an ever-growing number of
miRNA genes identified from screening experiments (37). Currently,
the miRBase database provides two major formats of archive files:
raw-text and SQL-like files. The former includes dat and fa files in
EMBL and fasta formats, respectively. They make it easy for users to
check the RNA sequences of precursor and mature miRNA genes.
On the other hand, the SQL-like files dumped directly from miRBase
contain more information, which is normalized and stored in
individual tables. For advanced users, the latter files make it
more efficient to retrieve related data
Table 2
Hierarchical structure of different recruitment of mir-25-106
Fig. 6 The evolutionarily conserved patterns of mir-25-106. This figure shows the conservation pattern of the
mir-25-106 (7q22.1) across 15 species according to the proportion of genomic distance
by joining tables using the SQL language. For our predicted
miRNA genes across metazoans, the dumped data from ZooMir
(http://insr.ibms.sinica.edu.tw/ZooMir/ZooMir.Candidates_3.tar.bz2)
can be easily incorporated into the latest version of
miRBase.
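To make the idea of retrieving related data by joining tables concrete, the following minimal sketch builds a tiny in-memory SQLite database and joins precursor and mature miRNA records. The table and column names are hypothetical and far simpler than the actual miRBase SQL dump, and the miRNA names serve only as illustrative records.

```python
import sqlite3

# Hypothetical, simplified schema (NOT the real miRBase table layout).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE precursor (precursor_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE mature (mature_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE precursor_mature (precursor_id INTEGER, mature_id INTEGER);
""")

# Illustrative records only.
conn.executemany("INSERT INTO precursor VALUES (?, ?)",
                 [(1, "hsa-mir-17"), (2, "hsa-mir-18a")])
conn.executemany("INSERT INTO mature VALUES (?, ?)",
                 [(1, "hsa-miR-17-5p"), (2, "hsa-miR-17-3p"),
                  (3, "hsa-miR-18a-5p")])
conn.executemany("INSERT INTO precursor_mature VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 3)])

# One join returns every precursor together with its mature products.
rows = conn.execute("""
    SELECT p.name, m.name
    FROM precursor AS p
    JOIN precursor_mature AS pm ON pm.precursor_id = p.precursor_id
    JOIN mature AS m ON m.mature_id = pm.mature_id
    ORDER BY p.name, m.name
""").fetchall()

for precursor_name, mature_name in rows:
    print(precursor_name, mature_name)
```

With the raw-text files, the same question would require parsing and cross-referencing two separate sequence files, which is why the normalized tables are the more convenient route for advanced users.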
4.2 Understanding the Basics of Data Mining and Machine Learning
In recent years, with the large-scale and genome-wide data generated
by ever-developing molecular biology techniques, the huge amount of
data has become a major challenge for biologists to manipulate and
analyze using conventional approaches. Increasing evidence suggests
that data mining and machine learning approaches can help researchers
handle such massive amounts of biological data efficiently and
effectively. For instance, in MetaMirClust we introduced a data mining
approach to efficiently discover highly conserved sets of miRNA
genes within miRNA clusters. By treating miRNA genes as items, the
FP-growth algorithm can be used to mine the frequent item sets
without candidate generation, which dramatically improves
performance in terms of memory space and running time. The algorithm
first compresses the input data into a tree-based structure, the
FP-tree, from which all frequent item sets can be retrieved by
traversing the tree. By iteratively traversing the sub FP-trees built
from conditional frequent item sets, the algorithm reduces the search
cost: it looks for short patterns recursively and thereby avoids the
costly candidate generation required by other approaches. The
frequent item sets identified by the FP-growth algorithm correspond
to the miRNA genes that frequently co-occur in clusters. Based on
those conserved sets of miRNA genes, we can further reconstruct the
hierarchical structure of conservation patterns across metazoans,
helping community users gain more insight into the recruitment
of miRNA genes into clusters from an evolutionary perspective.
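The following minimal sketch illustrates the FP-growth idea described above: each cluster is a transaction, each miRNA gene is an item, the transactions are compressed into an FP-tree, and frequent item sets are grown from conditional pattern bases. The function names, the toy clusters, and the support threshold are illustrative assumptions and are not taken from MetaMirClust.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, its parent, a count, and its children."""

    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}


def build_tree(weighted_transactions, min_support):
    """Build an FP-tree; return (header table, support of frequent items)."""
    support = defaultdict(int)
    for items, count in weighted_transactions:
        for item in set(items):
            support[item] += count
    frequent = {i: s for i, s in support.items() if s >= min_support}

    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes carrying that item
    for items, count in weighted_transactions:
        # Keep frequent items, most frequent first (ties broken by name).
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return header, frequent


def fp_growth(weighted_transactions, min_support, suffix=frozenset()):
    """Return {frozenset(items): support} without candidate generation."""
    header, frequent = build_tree(weighted_transactions, min_support)
    results = {}
    for item, nodes in header.items():
        itemset = suffix | {item}
        results[itemset] = frequent[item]
        # Conditional pattern base: the prefix path above each node of `item`.
        conditional = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        results.update(fp_growth(conditional, min_support, itemset))
    return results


# Toy clusters (hypothetical memberships, for illustration only).
clusters = [
    ["mir-17", "mir-18a", "mir-19a"],
    ["mir-17", "mir-18a"],
    ["mir-17", "mir-19a"],
    ["mir-25", "mir-93"],
]
frequent_sets = fp_growth([(c, 1) for c in clusters], min_support=2)
for itemset, count in sorted(frequent_sets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this toy input, the pairs (mir-17, mir-18a) and (mir-17, mir-19a) each occur in two clusters and are reported as frequent, whereas mir-25 and mir-93 fall below the threshold and never enter the tree.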
4.3 Investigation of Conservation Between miRNA Clusters and Flanking Protein-Coding Genes
To test whether miRNA clusters are co-conserved with their flanking
protein-coding genes, we conducted a downstream analysis in
which the linkage of known protein-coding genes in the vicinity of
evolutionarily conserved miRNA clusters between human and
mouse was interrogated. We focused only on the nearest adjacent
known genes located in the upstream/downstream regions of
conserved miRNA clusters on the same strand in those two
species. The genomic information of the protein-coding genes in
human (hg19) and mouse (mm9) was downloaded from the UCSC
Genome Browser (https://genome.ucsc.edu/). In addition, the
liftOver program (http://genome.ucsc.edu/cgi-bin/hgLiftOver)
downloaded from the UCSC Genome Browser was used to find the
best mapping of genomic locations between human and mouse if a
miRNA cluster occurs in multiple locations. The homologous
annotations between known protein-coding genes were identified
References
26. Hayashita Y et al (2005) A polycistronic microRNA cluster, miR-17-92, is overexpressed in human lung cancers and enhances cell proliferation. Cancer Res 65(21):9628–9632
27. Yoshino H et al (2013) Tumor-suppressive microRNA-143/145 cluster targets hexokinase-2 in renal cell carcinoma. Cancer Sci 104(12):1567–1574
28. Esquela-Kerscher A, Slack FJ (2006) Oncomirs: microRNAs with a role in cancer. Nat Rev Cancer 6(4):259–269
29. Zhang Y, Zhang R, Su B (2009) Diversity and evolution of MicroRNA gene clusters. Sci China C Life Sci 52(3):261–266
30. Han JW et al (2004) Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowl Discov 8(1):53–87
31. Chen L, Liu W (2013) Frequent patterns mining in multiple biological sequences. Comput Biol Med 43(10):1444–1452
32. Megraw M et al (2007) miRGen: a database for the study of animal microRNA genomic organization and function. Nucleic Acids Res 35(Database issue):D149–D155
33. Lai EC et al (2003) Computational identification of Drosophila microRNA genes. Genome Biol 4(7):R42
34. Lagos-Quintana M et al (2003) New microRNAs from mouse and human. RNA 9(2):175–179
35. Berezikov E et al (2005) Phylogenetic shadowing and computational identification of human microRNA genes. Cell 120(1):21–24
36. Alexiou P et al (2010) miRGen 2.0: a database of microRNA genomic information and regulation. Nucleic Acids Res 38(Database issue):D137–D141
37. Kozomara A, Griffiths-Jones S (2014) miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res 42(Database issue):D68–73
Methods in Molecular Biology (2016) 1375: 91–103
DOI 10.1007/7651_2015_280
© Springer Science+Business Media New York 2015
Published online: 10 September 2015
Abstract
Mining microarray data to unearth interesting expression profile patterns for discovery of in silico biological
knowledge is an emerging area of research in computational biology. A group of functionally related genes
may have similar expression patterns under a set of conditions or at some time points. Biclustering is an
important data mining tool that has been successfully used to analyze gene expression data for biologically
significant cluster discovery. The purpose of this chapter is to introduce interesting patterns that may be
observed in expression data and discuss the role of biclustering techniques in detecting interesting
functional gene groups with similar expression patterns.
1 Introduction
1.1 Patterns in Gene Expression Data
With the help of microarray experiments, one can simultaneously
monitor the expression levels of genes at a genome scale. Data
generated from microarray experiments, which measure the relative
expression levels of genes in a sample and in a controlled population,
can be represented in the form of a matrix or vector [6], often called
a gene expression matrix. Formally, it can be defined as follows.
Definition 1 (Gene Expression Data). Let $G = \{G_1, G_2, \ldots, G_m\}$ be a
set of $m$ genes and $R = \{T_1, T_2, \ldots, T_n\}$ be the set of $n$ conditions or
time points at which the genes' expression levels are recorded in a
microarray dataset. The gene expression dataset $X$ can be represented
as an $m \times n$ matrix, $X_{m \times n}$, where each entry $x_{i,j}$ in the matrix
corresponds to the logarithm of the relative abundance of the mRNA
of gene $G_i$ under condition $T_j$.
To gain better understanding of genes and their behavior inside
the cell, various patterns can be derived by analyzing the change in
expression levels of the genes. The notion of patterns in microarray
data is introduced in [7] as below.
Definition 2 (Expression Pattern). Given a gene $G_i$, its expression
values under a single condition or a series of varying conditions lie
within a certain range. $G_i$ is a vector of real numbers within the range
$[a, b]$, denoted as $G_i@[a, b]$, and is called an item. The values in $G_i$ are
limited inclusively between $a$ and $b$.
A set containing one single item is called a pattern. A set of several
items, which come from different genes, is also called a pattern. So, a
pattern looks like:
$$\{G_{i_1}@[a_{i_1}, b_{i_1}], \ldots, G_{i_k}@[a_{i_k}, b_{i_k}]\}$$
Table 1
Sample gene expression data from Homo sapiens
ORF C1 C2 C3 C4
GALNT5 3.474 3.837 4.644 5.059
APOE 2 1.943 1.786 1.737
IDH3B 1.449 1.299 0.993 0.832
[Fig. 1 plot: Homo sapiens; expression level profiles of GALNT5, APOE, and IDH3B]
1.1.1 Shifting and Scaling Patterns
In shifting patterns [7], the gene profiles show similar trends but
may not be close to each other in terms of distance (see Fig. 2).
In terms of expression values, the gene profiles are separated
by more or less constant vertical distances. Formally,
shifting patterns can be defined as follows.
Definition 3 (Shifting Pattern). Given two gene expression profiles
$G_i = \{E_{i1}, E_{i2}, \ldots, E_{ik}\}$ and $G_j = \{E_{j1}, E_{j2}, \ldots, E_{jk}\}$ with $k$ expression
values, a profile is called a shifting pattern with respect to another
1
www.ncbi.nlm.nih.gov.
94 Swarup Roy et al.
[Fig. 2 plots: two panels, expression level versus conditions]
$$E_{ip} = E_{jp} \times \zeta_{ij} \quad \text{or} \quad E_{jp} = E_{ip} \times \zeta_{ij}. \tag{2}$$
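The two pattern types can be checked with a few lines of code. The sketch below treats a shifting pattern as a constant additive offset between two profiles (following the description of more or less constant vertical distances) and a scaling pattern as a constant multiplicative factor, as in Eq. 2. The function names, the tolerance `tol`, and the example profiles are assumptions made for illustration.

```python
def is_shifting(profile_i, profile_j, tol=1e-6):
    """True if profile_i equals profile_j plus a (nearly) constant offset."""
    offsets = [a - b for a, b in zip(profile_i, profile_j)]
    return max(offsets) - min(offsets) <= tol


def is_scaling(profile_i, profile_j, tol=1e-6):
    """True if profile_i equals profile_j times a (nearly) constant factor."""
    if any(b == 0 for b in profile_j):
        return False  # the ratio is undefined for zero expression values
    factors = [a / b for a, b in zip(profile_i, profile_j)]
    return max(factors) - min(factors) <= tol


# Hypothetical profiles over four conditions.
base = [10.0, 20.0, 15.0, 30.0]
shifted = [value + 5.0 for value in base]  # constant vertical distance
scaled = [value * 2.0 for value in base]   # constant factor

print(is_shifting(shifted, base), is_scaling(shifted, base))
print(is_shifting(scaled, base), is_scaling(scaled, base))
```

On real data the offsets or factors are only approximately constant, so `tol` would have to be chosen relative to the noise level of the measurements.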
1.1.2 Coherent Patterns
A group of genes showing similar pattern tendency across different
conditions is called coherent. Such a group shows predominantly
one kind of co-expression in the expression profiles of all member
genes. Co-expressed genes are likely to be involved in the same
cellular processes. In practice, co-expressed genes may belong to
the same or similar functional categories indicating co-regulated
Gene Expression Patterns 95
1.1.3 Co-regulated Patterns
Often, coherent patterns are divided into two categories, namely,
positively regulated patterns and negatively regulated or inverted
patterns. Sometimes, a group of genes that are positively or
negatively regulated are also called co-regulated genes. In Fig. 1,
the human genes GALNT5 and IDH3B show similar, or positively
regulated, patterns. On the other hand, IDH3B and GALNT5 show
inverted or negative patterns with respect to APOE. Biologically, all
three genes are very significant. As suggested by Gene Ontology, the
three genes are involved in regulation of plasma lipoprotein particle
levels and triglyceride-rich lipoprotein particle remodeling.
Pronounced inverted or negative patterns can be observed in Fig. 3,
taken from NCBI Rat dataset GDS3702. Gene Ontology suggests that
both genes are responsible for regulation of interferon-beta
production. A group of genes may share a combination of both positive
and negative co-regulation under a few conditions or at some time
points.
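One simple way to tell positively regulated from inverted profiles is the sign of their Pearson correlation. The sketch below is only an illustration of that idea: the chapter does not prescribe this measure, and the threshold of 0.8, the function names, and the three profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = sqrt(sum((a - mean_x) ** 2 for a in x) *
                sum((b - mean_y) ** 2 for b in y))
    return cov / norm if norm else 0.0


def regulation(x, y, threshold=0.8):
    """Label a pair of profiles by the sign and strength of correlation."""
    r = pearson(x, y)
    if r >= threshold:
        return "positive"
    if r <= -threshold:
        return "inverted"
    return "none"


# Hypothetical expression profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.5, 3.0, 4.5, 5.0]  # rises with gene_a
gene_c = [4.0, 3.1, 2.2, 0.9]  # falls while gene_a rises
gene_d = [1.0, 3.0, 1.0, 3.0]  # no clear relation to gene_a

print(regulation(gene_a, gene_b))
print(regulation(gene_a, gene_c))
print(regulation(gene_a, gene_d))
```

Because correlation is computed over all conditions at once, it misses genes that are co-regulated only under a subset of conditions.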
Thus, gene expression data analysis involves pattern finding.
Data mining is the study of techniques that extract patterns from
large amounts of data. As a result, data mining provides the pri-
mary tools for gene expression data analysis. Biclustering is an
important data mining tool for analyzing biologically significant
gene groups. Below we present a brief discussion of biclustering
techniques.
[Fig. 3 plot: GDS3702 (Rat); expression level versus conditions for Mrps26 and Pfn2]
1.2 Biclustering of Co-regulated Genes
Clustering is a popular data analysis tool in genomic studies,
particularly in the context of gene-expression microarrays [10–12].
Each microarray provides expression measurements for thousands of
genes. Clustering is a useful exploratory technique for analyzing
gene expression data because it groups similar genes together and
allows biologists to identify groups of potentially meaningful
genes that have related functions or are co-regulated. This in
turn helps find the relationships among them in the form of gene
regulatory networks [5]. It has frequently been observed that
subsets of genes are co-regulated and co-expressed under a subset of
environmental conditions or time points [13]. Biclustering
algorithms tackle the problem of finding a set of sub-matrices where
each sub-matrix or bicluster meets a certain homogeneity criterion.
Given a gene expression dataset $D_{N \times M}$, where $G = \{G_1, G_2,
\ldots, G_N\}$ is a set of $N$ genes and $R = \{T_1, T_2, \ldots, T_M\}$ is the set of $M$
conditions or time points, biclusters can be defined as follows.
Definition 5 (Biclusters). Biclusters are a set of sub-matrices of the
matrix $D = (N, M)$ with dimensions $I_1 \times J_1, \ldots, I_k \times J_k$ such that
$I_i \le N$, $J_i \le M$, $\forall i \in \{1, \ldots, k\}$, where each sub-matrix (bicluster) meets a
given homogeneity criterion.
Madeira and Oliveira [14] identify four different categories of
biclusters based on homogeneity criterion, namely:
1. Constant biclusters,
2. Biclusters with constant values on either columns or rows,
3. Biclusters with coherent values, and
4. Biclusters with coherent evolutions.
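The first two categories can be checked directly on a sub-matrix. The sketch below selects a bicluster by row and column indices and tests whether it is constant, constant on rows, or constant on columns. The function names, the tolerance, and the toy matrix are illustrative assumptions; coherent values and coherent evolutions need more elaborate criteria and are not covered.

```python
def submatrix(data, rows, cols):
    """Extract the bicluster given by gene (row) and condition (column) indices."""
    return [[data[r][c] for c in cols] for r in rows]


def is_constant(values, tol=1e-9):
    """True if all values agree within the tolerance."""
    return max(values) - min(values) <= tol


def classify(block):
    """Name the simplest value-based homogeneity type met by the block."""
    flat = [value for row in block for value in row]
    if is_constant(flat):
        return "constant"
    if all(is_constant(row) for row in block):
        return "constant rows"
    if all(is_constant(col) for col in zip(*block)):
        return "constant columns"
    return "no value-based pattern"


# Toy expression matrix: 4 genes (rows) by 4 conditions (columns).
data = [
    [3.0, 9.0, 3.0, 1.0],
    [3.0, 4.0, 3.0, 8.0],
    [5.0, 2.0, 5.0, 6.0],
    [7.0, 2.0, 0.0, 6.0],
]

print(classify(submatrix(data, [0, 1], [0, 2])))
print(classify(submatrix(data, [0, 2], [0, 2])))
print(classify(submatrix(data, [2, 3], [1, 3])))
print(classify(submatrix(data, [0, 1], [1, 3])))
```

Note that the rows and columns of a bicluster need not be adjacent in the matrix, which is what distinguishes biclustering from simply cutting out a rectangular block.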
A comprehensive survey of different biclustering techniques for
gene expression data clustering can be found in [15, 16]. In gene
expression analysis, patterns play a more important role than
expression values [17]. As a result, the value based homogeneity
criterion mentioned above may not be suitable for grouping bio-
logically significant genes.
2 Materials
Table 2
Freely available Biclustering software packages
2.1 Data Sources
A plethora of real expression data produced by different
biotechnology labs are freely available online. In this chapter, we use
some datasets from Table 3 for experimentation and demonstration.
2.2 Evaluating Quality of Biclusters
From the point of view of biological data analysis, a cluster is
biologically significant if it can produce functionally enriched
groups of genes. A majority of the literature on biclustering
evaluates and reports results based on functional enrichment of the
clusters against Gene Ontology (GO). To determine the statistical
significance of the association of a particular GO term with a group
of genes in a cluster, various online tools from the GO Project² are
available. In Table 4, we report some freely available tools.
These tools use the hypergeometric distribution to calculate
the p-value or q-value, which evaluates whether the clusters have
significant enrichment in one or more function groups. The p-value
is computed as follows:
2
http://www.geneontology.org.
Table 3
Short description of data sources
Organism  Dataset                No. of genes  No. of samples  Source
Yeast     YeastDB                2884          17              http://arep.med.harvard.edu/biclustering/yeast.matrix
Yeast     Sporulation            474           7               http://cmgm.stanford.edu/pbrown/sporulation
Yeast     Yeast_KY               237           17              http://faculty.washington.edu/kayee/cluster/
Yeast     YeastCho (cell cycle)  384           17              http://faculty.washington.edu/kayee/cluster
Rat       Rat_CNS                112           9               http://faculty.washington.edu/kayee/cluster
Human     GDS3712                325           12              NCBI
Human     Fibroblast Serum       517           13              http://www.sciencemag.org/feature/data/984559.hsl/
Mouse     GDS958                 308           12              NCBI
Rice      Thaliana               138           8               http://homes.esat.kuleuven.be/~sistawww/bioi/thijs/Work/Clustering.html
Table 4
GO-based cluster evaluation tools
$$p = 1 - \sum_{i=0}^{k} \frac{\binom{f}{i}\binom{g-f}{n-i}}{\binom{g}{n}}. \tag{3}$$
termf, given the total number of genes in the whole genome g and
the number of genes in the whole genome that are annotated with
that GO term f. It is important to note that the p-value measures
whether a cluster is enriched with genes from a particular category
to a greater extent than would be expected by chance. If the
majority of genes in a cluster appear in one category, the p-value of
the category is small. That is, the closer the p-value is to zero, the
stronger the evidence that the particular GO term is associated with
the group of genes. The Q-value is the minimal False Discovery
Rate (FDR) at which this gene appears significant. Q-values are
estimated using the Benjamini-Hochberg procedure [25].
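The computation in Eq. 3 and the subsequent multiple-testing correction can be sketched as follows. Here $g$ and $f$ follow the text above, while $n$ is assumed to be the cluster size and $k$ the number of cluster genes annotated with the term, since the sentence defining them is incomplete here. The function names and the example numbers are hypothetical, and the sum runs up to $k$ exactly as the equation is printed.

```python
from math import comb


def enrichment_p_value(k, n, f, g):
    """Eq. 3: p = 1 - sum_{i=0}^{k} C(f, i) * C(g - f, n - i) / C(g, n).

    g: genes in the genome; f: genome genes annotated with the GO term;
    n: genes in the cluster (assumed); k: annotated cluster genes (assumed).
    As printed, the sum includes i = k, so p is the probability of seeing
    more than k annotated genes; the common "at least k" variant stops
    the sum at k - 1.
    """
    cumulative = sum(comb(f, i) * comb(g - f, n - i)
                     for i in range(min(k, n) + 1))
    return 1 - cumulative / comb(g, n)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


# Hypothetical numbers: a genome of 10 genes, 4 annotated with the term,
# and a cluster of 3 genes of which 1 carries the annotation.
p = enrichment_p_value(k=1, n=3, f=4, g=10)
print(round(p, 4))

# Hypothetical p-values for four GO terms tested on one cluster.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(value, 4) for value in q])
```

Python integers have arbitrary precision, so the binomial coefficients stay exact even for genome-sized values of $g$; only the final division is carried out in floating point.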
3 Methods
[Figure: bar charts of the average percentage of enriched biclusters for the bicluster algorithms CoBi, BiMax, CC, and OPSM; recovered panel titles: Yeast DB, YeastKY]
4 Notes
References
1. Kurella M, Hsiao L, Yoshida T, Randall J, Chow G, Sarang S, Jensen R, Gullans S (2001) DNA microarray analysis of complex biologic processes. J Am Soc Nephrol 12:1072–1078
2. Kraljevic S, Stambrook PJ, Pavelic K (2004) Accelerating drug discovery. EMBO Rep 5:837–842
3. Yu H, Luscombe N, Qian J, Gerstein M (2003) Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet 19:422–427
4. Gasch A, Eisen M et al (2002) Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol 3:1–22
5. Tavazoie S, Hughes J, Campbell M, Cho R, Church G et al (1999) Systematic determination of genetic network architecture. Nat Genet 22:281–285
6. Grant R (2004) Computational genomics: theory and application. Horizon Bioscience, Cambridge
7. Li J, Wong L (2001) Emerging patterns and gene expression data. Genome Inform Ser 12:3–13
8. Alberts B, Johnson A et al (2002) Studying gene expression and function. In: Molecular biology of the cell, 4th edn
9. Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9:3273–3297
10. Ben-Dor A, Shamir R, Yakhini Z (1999) Clustering gene expression patterns. J Comput Biol 6:281–297
11. Chipman H, Hastie TJ, Tibshirani R (2003) Clustering microarray data. In: Statistical analysis of gene expression microarray data, vol 1. Chapman & Hall/CRC, Boca Raton, pp 159–200
12. Ahmed HA, Mahanta P, Bhattacharyya D, Kalita JK (2011) Gerc: tree based clustering for gene expression data. In: 2011 IEEE 11th international conference on bioinformatics and bioengineering (BIBE), IEEE, pp 299–302
13. Mitra S, Banka H (2006) Multi-objective evolutionary biclustering of gene expression data. Pattern Recogn 39:2464–2477
14. Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 1:24–45
15. Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering and correlation clustering. ACM Trans Knowl Discov Data (TKDD) 3:1
16. Mahanta P, Ahmed H, Bhattacharyya D, Kalita JK (2011) Triclustering in gene expression data analysis: a selected survey. In: 2011 2nd national conference on emerging trends and applications in computer science (NCETACS), IEEE, pp 1–6
17. Roy S, Bhattacharyya DK, Kalita JK (2014) Reconstruction of gene co-expression network from microarray data using local expression patterns. BMC Bioinf 15:S10
18. Shamir R, Maron-Katz A, Tanay A, Linhart C, Steinfeld I, Sharan R, Shiloh Y, Elkon R (2005) Expander – an integrative program suite for microarray data analysis. BMC Bioinf 6:232
19. Barkow S, Bleuler S, Prelić A, Zimmermann P, Zitzler E (2006) Bicat: a biclustering analysis toolbox. Bioinformatics 22:1282–1283
20. Gonçalves JP, Madeira SC, Oliveira AL (2009) Biggests: integrated environment for biclustering analysis of time series gene expression data. BMC Res Notes 2:124
21. Cheng KO, Law NF, Siu WC, Lau T (2007) Bivisu: software tool for bicluster detection and visualization. Bioinformatics 23:2342–2344
22. Zhou F, Ma Q, Li G, Xu Y (2012) Qserver: a biclustering server for prediction and assessment of co-expressed gene clusters. PloS one 7:e32660
23. Leung E, Bushel PR (2006) Page: phase-shifted analysis of gene expression. Bioinformatics 22:367–368
24. Roy S, Bhattacharyya DK, Kalita JK (2013) Cobi: pattern based co-regulated biclustering of gene expression data. Pattern Recogn Lett 34:1669–1678
25. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodological) 57:289–300
26. Hartigan JA (1972) Direct clustering of a data matrix. J Am Stat Assoc 67:123–129
27. Cheng Y, Church G (2000) Biclustering of expression data. In: Proceedings of 8th international conference on intelligent systems for molecular biology, ICISMB'00, vol 8, pp 93–103
28. Yang J, Wang H, Wang W, Yu P (2003) Enhanced biclustering on expression data. In: Proceedings of the 3rd IEEE symposium on
Abstract
Cellular phenotypes result from the combined effect of multiple genes, and high-throughput techniques
such as DNA microarrays and deep sequencing allow monitoring this genomic complexity. The large scale
of the resulting data, however, creates challenges for interpreting results, as primary analysis often yields
hundreds of genes. Gene Ontology (GO), a controlled vocabulary for gene products, enables semantic
analysis of such gene sets. GO can be used to define semantic similarity between genes, which enables
semantic clustering to reduce the complexity of a result set. Here, we describe how to compute semantic
similarities and perform GO-based gene clustering using csbl.go, an R package for GO semantic similarity.
We demonstrate the approach with expression profiles from breast cancer.
1 Introduction
1.1 Semantic Similarity Using Gene Ontology
The formal structure of GO allows defining semantic similarity (SS)
between genes using the idea that similar genes share similar GO
annotations. More than two dozen such measures have been
defined, and it is not always clear which one is the best for a given
purpose (5). However, usually the choice of a default measure is
sufficient. Here, we first illustrate the use of one simple and effective
SS measure, Resnik similarity (6, 7). Then, we briefly survey the
main features of other measures.
As an example, we compute the similarity of two hypothetical
genes, A and B, using the Resnik measure. Gene A is annotated
with the GO terms plasma membrane (CC) and signal transduction
(BP), and gene B with mitochondrial membrane (CC) and regula-
tion of signaling (BP) (Fig. 1). Resnik similarity is a “pairwise”
similarity measure, which means that it defines similarity between
individual GO terms, but not, as such, for genes. Gene similarity is
obtained in a second step from term similarities. Hence, we first
compute the pairwise similarities between GO terms of A and B.
The ontologies BP, CC, and MF are handled individually, so the
similarity of terms from different ontologies is zero.
To compute the term similarity between plasma membrane and
mitochondrial membrane, Resnik similarity first evaluates the speci-
ficity of all GO terms, as intuitively specific terms are more informa-
tive and contribute more to similarity than generic terms. In Resnik
similarity, specificity is based on empirical usage of the terms in gene
annotations: a less frequently used term is more specific than a
commonly used one. This is formalized as term information content
(IC) as follows. We compute the number of genes annotated with a
Using Semantic Similarities and csbl.go for Analyzing Microarray Data 107
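[Fig. 1 diagram. Cellular component: membrane (20), with child terms plasma membrane (5; gene A) and organelle membrane (11); mitochondrial membrane (3; gene B) is a child of organelle membrane. Biological process: signaling (12), with child terms signal transduction (6; gene A) and regulation of signaling (7; gene B).]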
Fig. 1 Illustration of a GO network structure that is used to compute semantic similarity between genes A and
B using the Resnik measure. Small subsets of the ontologies cellular component and biological process
are shown here. Links between GO terms denote “is a” or “part of” relationships so that child terms are more
specific than parent terms. The annotations for genes A and B are shown next to the relevant GO terms. The
figures next to each GO term are usage frequencies in a hypothetical annotation corpus, used for computing
information content. For example, 20 genes are annotated with membrane or one of its child terms. Usage
frequencies are decreasing when following paths from the root node to leaf nodes because annotation with a
child term implies annotation with ancestor terms
Table 1
Pairwise GO term Resnik similarities for genes A (columns) and B (rows)
                             Plasma membrane (A)  Signal transduction (A)
Mitochondrial membrane (B)   2.32                 0
Regulation of signaling (B)  0                    2.74
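The values in Table 1 can be reproduced with a short sketch. Information content is taken as the negative base-2 logarithm of a term's usage frequency relative to the corpus size, and the Resnik similarity of two terms is the information content of their most informative common ancestor. The usage counts come from Fig. 1; the corpus sizes (100 genes for cellular component, 80 for biological process) are assumptions chosen so that the sketch matches Table 1, and the single-parent tree is a simplification of the real GO graph.

```python
from math import log2

# Usage counts as shown in Fig. 1.
USAGE = {
    "membrane": 20, "plasma membrane": 5, "organelle membrane": 11,
    "mitochondrial membrane": 3,
    "signaling": 12, "signal transduction": 6, "regulation of signaling": 7,
}
# Simplified single-parent hierarchy (real GO terms can have several parents).
PARENT = {
    "membrane": None, "plasma membrane": "membrane",
    "organelle membrane": "membrane",
    "mitochondrial membrane": "organelle membrane",
    "signaling": None, "signal transduction": "signaling",
    "regulation of signaling": "signaling",
}
ONTOLOGY = {term: ("BP" if "signal" in term else "CC") for term in USAGE}
# ASSUMED corpus sizes, chosen to reproduce the values in Table 1.
CORPUS_SIZE = {"CC": 100, "BP": 80}


def information_content(term):
    """IC = -log2(usage frequency / corpus size)."""
    return -log2(USAGE[term] / CORPUS_SIZE[ONTOLOGY[term]])


def ancestors(term):
    """The term itself plus all terms on the path to the ontology root."""
    found = set()
    while term is not None:
        found.add(term)
        term = PARENT[term]
    return found


def resnik(term_a, term_b):
    """IC of the most informative common ancestor; 0 across ontologies."""
    if ONTOLOGY[term_a] != ONTOLOGY[term_b]:
        return 0.0
    common = ancestors(term_a) & ancestors(term_b)
    return max((information_content(t) for t in common), default=0.0)


def gene_similarity(terms_a, terms_b):
    """Resnik/max: the best pairwise term similarity between two genes."""
    return max(resnik(a, b) for a in terms_a for b in terms_b)


gene_a = ["plasma membrane", "signal transduction"]
gene_b = ["mitochondrial membrane", "regulation of signaling"]
print(round(resnik("plasma membrane", "mitochondrial membrane"), 2))
print(round(resnik("signal transduction", "regulation of signaling"), 2))
print(round(gene_similarity(gene_a, gene_b), 2))
```

Taking the maximum is only one way to combine term similarities into a gene similarity; it corresponds to the Resnik/max combination used in the similarity matrix script in Subheading 3.5.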
1.2 Semantic Similarity Measure Dimensions
Resnik similarity is a good first choice for an SS measure, but for
specific applications other measures may need to be considered.
Understanding the differences between various measures is
facilitated by categorization of measures along feature dimensions
(5, 8, 9). Most differences between SS measures can be understood
along three such dimensions. First, a major differentiator between
SS measures is whether they directly define an SS measure for
complete GO term sets (groupwise), or if they define an SS
measure for individual GO terms (pairwise), which must then be
transformed into gene similarity. Resnik similarity belongs to the
pairwise family.
Second, measures differ in how they evaluate GO term speci-
ficity. Resnik similarity uses the empirical information content
obtained from an annotation corpus. An alternative is to use the
GO graph structure so that a term with high depth (long distance
to ontology root node) is considered to be more specific.
Some measures do not explicitly consider term specificity at all.
IC-based methods are considered the most accurate option (5).
Third, some measures, such as Resnik, consider only one
common ancestor to evaluate the similarity of two terms; for
IC-based methods, this is usually the MICA. Other measures
consider all common ancestors, which may provide additional
value over MICA (5).
2 Materials
2.1 Breast Cancer Example Data Set
The clustering demonstration is done using example data from
invasive breast cancer from The Cancer Genome Atlas (TCGA)
(11). The publicly available data consist of Agilent G4502A expres-
sion microarrays, including both tumors (TCGA codes 01 and 06)
and adjacent healthy tissue (TCGA code 11) for control. Prepro-
cessed level 3 data, containing cy5/cy3 log-ratios, were downloaded
2.2 Custom Data Sets
The csbl.go package requires (1), minimally, a gene set, and (2),
optionally, an expression or other quantitative matrix for the gene
set. The genes are represented as a set of database identifiers. Any
genome database can be used, as long as identifiers in the gene set,
expression matrix, and GO annotation source match.
2.3 Computational Environment
csbl.go requires a Windows or a Linux machine (32 or 64 bit), with
an installation of the R programming environment
(http://www.r-project.org/). R version 2.12+ or 3.0+ is recommended.
3 Methods
3.2 Obtaining GO Annotations for Differentially Expressed Genes
Using csbl.go requires that the genes under analysis are annotated
with GO terms. This information is obtained from genome or
proteome databases, or directly from the Gene Ontology database.
In our breast cancer example, we use Biomart from R to annotate
the 168 DEGs using the Uniprot database.
First, install the biomaRt Bioconductor package using:
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
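# Select the bundled GO probability table for the species of the gene set
set.prob.table(organism=TAXONOMY.HUMAN, type="similarity")
# Read the expression matrix; rows are genes, columns are samples
expr <- read.table(EXPRESSION, header=TRUE, row.names=1,
                   sep="\t", stringsAsFactors=FALSE)
expr <- expr[, 1:20] # For demonstration
# GO-based clustering with Resnik similarity; also draws the heat map
result <- go.heatmap(expr, ANNOTATION, metric="Resnik",
                     go.cut=CUT, margins=c(15, 5))
# Inspect the sixth cluster
print(result$members[[6]])
print(result$desc[[6]])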
In csbl.go, we first specify the species from which the gene set is
obtained with set.prob.table. Then, we read the expression
matrix into memory and take a subset of the first 20 samples to
obtain a more readable clustering visualization for demonstration
purposes. GO-based clustering is performed with go.heatmap,
which takes the in-memory expression matrix and the GO annota-
tion file as mandatory arguments. The expression matrix must have
row (gene) and column (sample) names. Clustering yields (1) a heat
map visualization shown in Fig. 2, and (2) a data structure describ-
ing the clusters obtained. Details on go.heatmap can be obtained
from its help page, shown with ?go.heatmap in R.
3.4 Interpreting GO Clustering Results
From the right margin of Fig. 2, we see that the GO-based clustering
resulted in six gene clusters, G1 to G6. Each cluster contains
genes that are semantically similar (i.e., share similar GO annota-
tions). From the heat map, we can observe expression values within
the gene clusters. On the X axis, the three leftmost samples are
112 Kristian Ovaska
[Fig. 2 heat map: color scale from low to high; rows are the GO groups G1 to G6, with the dendrogram cut marked; columns are the 20 TCGA samples]
Fig. 2 Results of GO-based clustering for TCGA breast cancer samples. On the X axis, a subset of 20 samples
is shown; they are clustered using expression values, which are visualized in the heat map. The Y axis
represents genes, which are clustered using Resnik similarity. The selected cut point in the dendrogram on the
left results in six gene clusters, denoted G1 to G6
3.5 Computing Gene Similarity Matrix
In addition to GO-based clustering, csbl.go contains lower level
functionality for computing similarities. The following R script
produces a similarity matrix sim between all the genes, using
Resnik/max similarity.
library(csbl.go)
ANNOTATION <- "annotated.txt"
set.prob.table(organism=TAXONOMY.HUMAN, type="similarity")
ent <- entities.from.text(ANNOTATION)
sim <- entity.sim.many.allont(ent, "Resnik", "max")
4 Notes
4.1 Creating Custom GO Probability Tables
The csbl.go package is bundled with GO term probability tables for
Homo sapiens, Saccharomyces cerevisiae, Caenorhabditis elegans,
Drosophila melanogaster, Mus musculus, Rattus norvegicus, Arabi-
dopsis thaliana, and Xenopus tropicalis. The GO annotation corpora
for these are obtained from the database provided by the Gene
Ontology Consortium (GOC). Building a custom GO probability
table may be necessary if: (a) you work with a species other than the
above, or (b) you use a different database for obtaining GO anno-
tations for your gene set than GOC. Whereas condition (a) is
obvious, (b) is more subtle. GO analysis gives best results when
the background probabilities (ICs) are computed from the same
database that is used for annotation. If different databases are used,
results may be inaccurate because the expected (a priori) GO term
usage is different from actual usage. The degree of inaccuracy
4.2 Loading GO Tables for Bundled Species
For loading a bundled GO probability table into memory, csbl.go
requires an NCBI taxonomy identifier as a parameter to
set.prob.table. The following constants represent the identifiers of
bundled species: TAXONOMY.HUMAN, TAXONOMY.YEAST,
TAXONOMY.C_ELEGANS, TAXONOMY.DROSOPHILA, TAXONOMY.MOUSE,
TAXONOMY.RAT, TAXONOMY.ARABIDOPSIS and TAXONOMY.XENOPUS.
4.3 Optimizing the Dendrogram Cut Parameter for Clustering
Hierarchical clustering uses a “cut” parameter with range from 0 to
1 to determine how to obtain gene clusters from the clustering
dendrogram. This parameter is named go.cut in the go.heatmap
function and corresponds to the h parameter in cut.dendrogram.
A lower value favors smaller clusters. Often, it is necessary to try
different values of go.cut to obtain a suitable set of clusters.
The cut threshold can be seen in the left margin of the heat map,
aiding manual optimization.
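The effect of the cut parameter can be illustrated with a small, self-contained sketch of agglomerative clustering that stops merging once the merge height exceeds the cut. Average linkage, the function name, and the toy distance matrix are assumptions made for illustration; this is not the csbl.go implementation.

```python
def cut_clusters(names, dist, cut):
    """Average-linkage agglomerative clustering, stopped at height `cut`.

    Merging only while the linkage height stays at or below `cut` yields
    the clusters obtained by cutting the dendrogram at that height.
    """
    clusters = [[i] for i in range(len(names))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        height, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if height > cut:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(names[i] for i in c) for c in clusters)


# Hypothetical gene distances scaled to the range 0 to 1.
genes = ["g1", "g2", "g3", "g4"]
distance = [
    [0.0, 0.1, 0.7, 0.8],
    [0.1, 0.0, 0.6, 0.9],
    [0.7, 0.6, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]

for cut in (0.05, 0.3, 1.0):
    print(cut, cut_clusters(genes, distance, cut))
```

The lowest cut leaves every gene in its own cluster, the intermediate cut yields two pairs, and the highest cut merges everything, which mirrors the statement that a lower value favors smaller clusters.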
4.6 Running Without Expression Data
GO-based clustering can also be executed when expression or other
quantitative data are not available, using only the annotated gene set.
This is done by supplying NULL as the first argument to
go.heatmap, i.e., go.heatmap(NULL, ANNOTATION.FILE).
4.7 Integration with Anduril Workflow Framework
The core functionality of csbl.go has been integrated with the
Anduril workflow framework (24) available from http://anduril.org.
The relevant Anduril components are GOClustering
(GO-based clustering) and GOProbabilityTable (creating custom
GO tables). GO annotations can be fetched using, for example,
BiomartAnnotator (various databases) or KorvasieniAnnotator
(Ensembl).
Acknowledgements
References
1. Hanahan D, Weinberg RA (2011) Hallmarks of cancer: the next generation. Cell 144:646–674
2. Vogelstein B, Papadopoulos N, Velculescu VE et al (2013) Cancer genome landscapes. Science 339:1546–1558
3. Ashburner M, Ball C, Blake J et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25:25–29
4. Rebhan M, Chalifa-Caspi V, Prilusky J et al (1998) GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 14:656–664
5. Guzzi PH, Mina M, Guerra C et al (2012) Semantic similarity analysis of protein data: assessment with biological features and issues. Brief Bioinform 13:569–585
6. Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. Proceedings of the 14th international joint conference on artificial intelligence, vol 1, pp 448–453
7. Lord P, Stevens R, Brass A et al (2003) Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics 19:1275–1283
8. Mazandu GK, Mulder NJ (2013) Information content-based gene ontology semantic similarity approaches: toward a unified framework theory. BioMed Res Int 2013:292063
116 Kristian Ovaska
9. Harispe S, Sánchez D, Ranwez S et al (2014) A framework for unifying ontology-based semantic similarity measures: a study in the biomedical domain. J Biomed Inform 48:38–53
10. Ovaska K, Laakso M, Hautaniemi S (2008) Fast gene ontology based clustering for microarray experiments. BioData Mining 1:11
11. The Cancer Genome Atlas Network (2012) Comprehensive molecular portraits of human breast tumours. Nature 490:61–70
12. Gentleman RC, Carey VJ, Bates DM et al (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5:R80
13. Lin D (1998) An information-theoretic definition of similarity. Proceedings of the 15th international conference on machine learning, pp 296–304
14. Jiang J, Conrath D (1997) Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of international conference on research in computational linguistics, pp 19–33
15. Schlicker A, Domingues F, Rahnenführer J et al (2006) A new measure for functional similarity of gene products based on gene ontology. BMC Bioinformatics 7:302
16. Huang D, Sherman B, Tan Q et al (2007) The DAVID gene functional classification tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol 8:R183
17. Bodenreider O, Aubry M, Burgun A (2005) Non-lexical approaches to identifying associative relations in the gene ontology. Pac Symp Biocomput 2005:91–102
18. Pesquita C, Faria D, Bastos H et al (2008) Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics 9:S4
19. Brun C, Chevenet F, Martin D et al (2004) Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol 5:6
20. Couto FM, Silva MJ, Coutinho PM (2007) Measuring semantic similarity between gene ontology terms. Data Knowl Eng 61:137–152
21. Yu G, Li F, Qin Y et al (2010) GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26:976–978
22. Frohlich H, Speer N, Poustka A et al (2007) GOSim – an R-package for computation of information theoretic GO similarities between terms and gene products. BMC Bioinformatics 8:166
23. Harispe S, Ranwez S, Janaqi S et al (2014) The semantic measures library and toolkit: fast computation of semantic similarity and relatedness using biomedical ontologies. Bioinformatics 30:740–742
24. Ovaska K, Laakso M, Haapa-Paananen S et al (2010) Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme. Genome Med 2:65
Methods in Molecular Biology (2016) 1375: 117–121
DOI 10.1007/7651_2015_249
© Springer Science+Business Media New York 2015
Published online: 14 May 2015
Abstract
The importance of semantic-based methods and algorithms for the analysis and management of biological
data is growing for two main reasons. From a biological side, the knowledge contained in ontologies is
increasingly accurate and complete; from a computational side, recent algorithms make valuable use of
such knowledge. Here we focus on the semantic-based management and analysis of protein interaction
networks, that is, on all approaches to the analysis of protein–protein interaction data that use knowledge
encoded in biological ontologies.
Semantic approaches for studying high-throughput data have been largely used in the past to mine
genomic and expression data. Recently, the emergence of network approaches for investigating molecular
machineries has stimulated, in parallel, the introduction of semantic-based techniques for the analysis and
management of network data. Applying these computational approaches to the study of microarray
data can broaden their application scenario and, at the same time, help in understanding
disease development and progression.
1 Introduction
118 Agapito Giuseppe and Marianna Milano
References
1. Barabasi AL, Oltvai ZN (2004) Network 12. Pesquita C et al (2009) Semantic similarity in
biology: understanding the cell’s functional biomedical ontologies. PLoS Comput Biol.
organization. Nat Rev Genet 5(2):101–113 doi:10.1371/journal.pcbi.1000443
2. Cannataro M, Guzzi PH, Veltri P (2010) Pro- 13. Dai X et al (2014) A comprehensive semantic
tein-to-protein interactions: technologies, similarity measurement for predicting the func-
databases, and algorithms. ACM Comput tion of gene products. J Bionanosci 8
Surv. doi:10.1145/1824795.1824796 (4):287–292
3. Ciriello G et al (2012) AlignNemo: a local 14. Agapito G, Guzzi PH, Cannataro M (2013)
network alignment method to integrate Visualization of protein interaction networks:
homology and topology. PLoS One. doi:10. problems and solutions. BMC Bioinformatics.
1371/journal.pone.0038107 doi:10.1186/1471-2105-14-S1-S1
4. West DB (2000) Introduction to graph theory, 15. Guzzi PH, Cannataro M (2012) Cyto-sevis:
2nd edn. Prentice Hall, New York semantic similarity-based visualisation of pro-
5. Blake JA, Bult CJ (2006) Beyond the data tein interaction networks. EMB-Net J doi:
deluge: data integration and bio-ontologies. J http://dx.doi.org/10.14806/ej.18.A.397
Biomed Informat 39(3):314–320 16. Cannataro M et al (2007) Using ontologies for
6. Harris MA et al (2004) The gene ontology preprocessing and mining spectra data on the
(go) database and informatics resource. grid. Future Generat Comput Syst 23
Nucleic Acids Res 32:258–261 (1):55–60
7. Barrell D et al (2009) The GOA database in 17. Smith B et al (2007) The OBO foundry: coor-
2009-an integrated gene ontology annotation dinated evolution of ontologies to support bio-
resource. Nucleic acids Research. doi:10. medical data integration. Nat Biotechnol 25
1093/nar/gkn803 (11):1251–1255
8. Huang DW, Sherman BT, Lempicki RA (2009) 18. Popescu M, Keller JM, Mitchell JA
Bioinformatics enrichment tools: paths toward (2006) Fuzzy measures on the gene ontology
the comprehensive functional analysis of large for gene product similarity. IEEE/ACM
gene lists. Nucleic Acids Res 37(1):1–13 Trans Comput Biol Bioinformatics 3
9. Alexeyenko A et al (2012) Network enrichment (3):263–274
analysis: extension of gene-set enrichment anal- 19. Yu G, Li F et al (2010) GOSemSim: an R
ysis to gene networks. BMC Bioinformatics. package for measuring semantic similarity
doi:10.1186/1471-2105-13-226 among GO terms and gene products. Bioinfor-
10. Guzzi PH et al (2012) Semantic similarity analysis matics 26(7):976–978
of protein data: assessment with biological fea- 20. Guzzi PH, Mina M (2012) Towards the assess-
tures and issues. Brief Bioinform 13(5):569–585 ment of semantic similarity analysis of protein
11. Smoot ME et al (2011) Cytoscape 2.8: new data: main approaches and issues. ACM SIG-
features for data integration and network visu- Bioinformatics Rec 2(3):17–18
alization. Bioinformatics 27(3):431–432
Methods in Molecular Biology (2016) 1375: 123–136
DOI 10.1007/7651_2015_242
© Springer Science+Business Media New York 2015
Published online: 12 March 2015
Abstract
Integrated analysis of large-scale transcriptomic and proteomic data can provide important insights into the
metabolic mechanisms underlying complex biological systems. In this chapter, we present methods to
address two issues related to integrated transcriptomic and proteomic analysis. First, proteomic
datasets are often incomplete, and integrated analysis of partial proteomic data may
introduce significant bias. To address this issue, we describe a zero-inflated Poisson (ZIP)-based model to
uncover the complicated relationships between protein abundances and mRNA expression levels, and then
apply it to predict protein abundance for the proteins not experimentally detected. The ZIP model takes
into consideration the undetected proteins by assuming that there is a probability mass at zero representing
expressed proteins that were undetected owing to technical limitations. The model validity is demonstrated
using biological information of operons, regulons, and pathways. Second, weak correlation between
transcriptomic and proteomic datasets is often due to biological factors affecting translational processes.
To quantify the effects of these factors, we describe a multiple regression-based statistical framework
to quantitatively examine the effects of various translational efficiency-related sequence features on
mRNA–protein correlation. Using the datasets from sulfate-reducing bacteria Desulfovibrio vulgaris, the
analysis shows that translation-related sequence features can contribute up to 15.2–26.2 % of the total
variation of the correlation between transcriptomic and proteomic datasets, and also reveals the relative
importance of various features in translation process.
1 Introduction
124 Jiangxin Wang et al.
2 Materials
2.2 Genome Information
The cellular functional categories of all genes in the target genome
are downloaded from the Comprehensive Microbial Resource of
TIGR (http://cmr.tigr.org) (35) and NCBI (http://www.ncbi.nlm.nih.gov/).
On the basis of the original annotation, the genes/proteins are
classified into 19 cellular functional categories. These
categories are included in the model as possible predictors of
protein abundance. Gene annotation attributes such as sequence
length, protein length, molecular weight, GC content, and
triplet codon counts of all genes in the target genome are
downloaded from the TIGR or NCBI resource. Continuous numerical
values are gathered for the molecular weight of each gene. The GC
content reflects the proportion of nucleotides G or C in the target
genome. The triplet codon information includes counts for all 64
triplet codon combinations in the genetic code.
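As an illustration of the sequence attributes listed above, the following minimal Python sketch computes the GC proportion of a sequence and the counts of all 64 codons in a coding sequence. The function names gc_content and codon_counts are hypothetical helpers introduced here for illustration; they are not part of the TIGR or NCBI resources, which supply these values precomputed.

```python
from itertools import product

# All 64 codons over the DNA alphabet.
ALL_CODONS = ["".join(c) for c in product("TCAG", repeat=3)]


def gc_content(seq):
    """Return the proportion of G or C nucleotides in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_counts(cds):
    """Count each of the 64 codons in an in-frame coding sequence.

    Trailing bases that do not fill a codon, and codons containing
    ambiguous characters such as N, are ignored (an assumption of
    this sketch).
    """
    cds = cds.upper()
    counts = dict.fromkeys(ALL_CODONS, 0)
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in counts:
            counts[codon] += 1
    return counts
```

Every codon appears as a key even when its count is zero, so each gene yields a vector of the same length, which is convenient when the counts are later used as regression predictors.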
Integrated Analysis of Transcriptomic and Proteomic Datasets Reveals. . . 127
The complete genome of the target species, D. vulgaris, together with
its ORF calls and annotation, is downloaded from NCBI GenBank
and the TIGR resource. Genes transcribed in the same direction
and separated by intergenic regions <15 bp are defined as one operon.
Gene lists of all metabolic pathways defined for the target genomes of
interest are downloaded from the KEGG database
(http://www.genome.jp/kegg/kegg2.html).
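The operon rule stated above (same transcription direction, intergenic region shorter than 15 bp) can be sketched as a single pass over position-sorted genes. This is only an illustrative sketch: the Gene record, the 1-based inclusive coordinates, and the function name group_operons are assumptions made here, and a single linear replicon is assumed.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def group_operons(genes, max_gap=15):
    """Group adjacent same-strand genes whose intergenic gap is < max_gap bp.

    Genes that satisfy the rule with no neighbour are returned as
    single-gene groups.
    """
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            prev = operons[-1][-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand == prev.strand and gap < max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons
```

For example, two "+" strand genes ending at position 100 and starting at position 105 have a 4 bp gap and fall into one group, whereas a neighbour on the opposite strand always starts a new group.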
3 Methods
3.1 Zero-Inflated Poisson Regression Model
The Poisson regression model, one of the so-called generalized
linear models (38), was used to model the relationship between
proteomic abundance and mRNA expression.
3.1.1 Model Construction and Validation
1. In the Poisson regression model, for protein abundances (Y),
we assume that the mean (λ) of the Poisson distribution depends
on log-scaled mRNA abundance (X), and therefore
λ = exp(α + βX), which ensures that the expected value is
nonnegative. This Poisson regression model provides a valid framework
to integrate two types of expression data; however, it provides
no explanation for the fact that ~83 % of genes have zero
proteomic abundance.
2. We then ascribe the high percentage of proteins with zero
abundance to technical limitations in the proteomic analyses,
such as detection sensitivity. Therefore, a nonstandard mixture
model, the ZIP regression model (39), is proposed to analyze
the data. In this model, we assume that 100p % of the genes
with a proteomic abundance level of 0 may be unexpressed genes
or expressed genes that were undetected owing to technical
limitations. Thus, the proteomic abundance, y, is distributed as
follows: y = 0 (a probability mass at zero) with probability p,
and y follows a Poisson regression distribution with probability
(1 − p). Therefore the observed protein abundance (y)
follows a mixture model:
$$f(y) = \left[p + (1-p)\exp(-\lambda)\right]^{\delta}\left[(1-p)\exp(-\lambda)\frac{\lambda^{y}}{y!}\right]^{1-\delta} \qquad (1)$$
where δ = 1 if y = 0 and δ = 0 otherwise, and
$$\operatorname{logit}(p) = \log\left[p/(1-p)\right] = \alpha_0 + \beta_0 x;$$
$$0 \cdot p + \exp(\alpha + \beta x)\,(1 - p),$$
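To make the mixture concrete, the following minimal Python sketch evaluates the zero-inflated Poisson probability for a single protein, using the log link for λ and the logit link for p given above. The function names and the parameter values in any call are illustrative assumptions; the sketch does not fit the model and is not the authors' implementation.

```python
import math


def poisson_mean(x, alpha, beta):
    """Log link: lambda = exp(alpha + beta * x), x = log-scaled mRNA abundance."""
    return math.exp(alpha + beta * x)


def zero_prob(x, alpha0, beta0):
    """Inverse of logit(p) = alpha0 + beta0 * x."""
    return 1.0 / (1.0 + math.exp(-(alpha0 + beta0 * x)))


def zip_pmf(y, lam, p):
    """Probability of observing count y under a zero-inflated Poisson.

    With probability p the count is a structural zero; otherwise it is
    drawn from a Poisson distribution with mean lam.
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        return p + (1.0 - p) * poisson
    return (1.0 - p) * poisson


def zip_mean(lam, p):
    """Expected count of the mixture: 0 * p + lam * (1 - p)."""
    return (1.0 - p) * lam
```

Setting p = 0 recovers the ordinary Poisson model, which is a quick sanity check that the zero inflation only adds mass at y = 0.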
3.2.2 Identification of Start Codon, Stop Codon, and Their Contexts
The identity of each start codon and stop codon is treated as a
categorical variable during multiple regression analysis. The start
codon context is defined as the upstream 30 bases and downstream
9 codons of the start codon. Therefore, each sequence of start
codon context is 60 bases long, including the start codon. To
evaluate the potential of each start codon context to form a stable
mRNA secondary structure, the minimum free energy of this
region is computed with the Vienna package RNAfold (44, 45).
The stop codon and the base immediately downstream of the stop
codon are regarded as the stop codon context. Each combination is
treated as a categorical variable in multiple regression analysis
described below.
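The two context definitions above amount to fixed-width windows around the codon positions: 30 + 3 + 27 = 60 bases for the start codon context and 3 + 1 = 4 bases for the stop codon context. A minimal Python sketch follows, assuming a forward-strand gene and 0-based indices of the first base of each codon. The function names are hypothetical, and genes on the reverse strand would first need to be reverse-complemented.

```python
def start_codon_context(genome, start, upstream=30, downstream_codons=9):
    """Return the 60-base start codon context, or None near a sequence end.

    `start` is the 0-based index of the first base of the start codon.
    """
    left = start - upstream
    right = start + 3 + 3 * downstream_codons
    if left < 0 or right > len(genome):
        return None
    return genome[left:right]


def stop_codon_context(genome, stop):
    """Return the stop codon plus the base immediately downstream (4 bases).

    `stop` is the 0-based index of the first base of the stop codon.
    """
    right = stop + 4
    if stop < 0 or right > len(genome):
        return None
    return genome[stop:right]
```

The 60-base string returned by the first function is the kind of input that would then be passed to RNAfold for the minimum free energy calculation; that call is not shown here.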
3.2.3 Analyses of the Overall Codon Usage and Amino Acid Usage
The major trends in codon usage and amino acid usage are revealed
with a correspondence analysis. The relative synonymous codon
usage (RSCU) is used in the correspondence analysis to remove
the effects of amino acid usage. For amino acid usage, the raw
codon counts are added up for each amino acid and used as input
in the correspondence analysis. The CodonW software
(http://codonw.sourceforge.net) is used for the correspondence analysis,
generating four major axes accounting for most of the variations in
codon usage or amino acid usage of D. vulgaris genes or proteins,
respectively (46).
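The two inputs to the correspondence analysis can be sketched as follows. RSCU is taken here in its usual sense: the observed count of a codon divided by the count expected if all synonymous codons of that amino acid were used equally. The sketch assumes the standard codon assignments and uses hypothetical function names; CodonW computes these quantities itself, so this only illustrates what is being fed into the analysis.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)
}


def rscu(codon_counts):
    """Relative synonymous codon usage for every codon.

    RSCU = count * (number of synonymous codons) / (total count of the
    synonymous family). Families with no observations get 0.0.
    """
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        total = sum(codon_counts.get(c, 0) for c in codons)
        for c in codons:
            count = codon_counts.get(c, 0)
            values[c] = count * len(codons) / total if total else 0.0
    return values


def amino_acid_usage(codon_counts):
    """Sum raw codon counts per amino acid (stop codons excluded)."""
    usage = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            usage[aa] = usage.get(aa, 0) + codon_counts.get(codon, 0)
    return usage
```

An RSCU of 1.0 means a codon is used exactly as often as expected under equal synonymous usage, which is why RSCU values are insensitive to how frequent the amino acid itself is.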
4 Conclusion
5 Notes
References
1. Medini D, Serruto D, Parkhill J, Relman DA, Donati C, Moxon R, Falkow S, Rappuoli R (2008) Microbiology in the post-genomic era. Nat Rev Microbiol 6:419–430
2. Kyrpides NC (2009) Fifteen years of microbial genomics: meeting the challenges and fulfilling the dream. Nat Biotechnol 27:627–632
3. Uchiyama I, Mihara M, Nishide H, Chiba H (2013) MBGD update 2013: the microbial genome database for exploring the diversity of microbial world. Nucleic Acids Res 41(Database issue):D631–D635
4. Schoolnik GK (2001) The accelerating convergence of genomics and microbiology. Genome Biol 2:REPORTS4009
5. Ward N, Fraser CM (2005) How genomics has affected the concept of microbiology. Curr Opin Microbiol 8:564–571
6. Sharan R, Ideker T (2006) Modeling cellular machinery through biological network comparison. Nat Biotechnol 24:427–433
7. Cardenas E, Tiedje JM (2008) New tools for discovering and characterizing microbial diversity. Curr Opin Biotechnol 19:544–549
8. Rocha EP (2008) The organization of the bacterial genome. Annu Rev Genet 42:211–223
9. Fiehn O (2001) Combining genomics, metabolome analysis, and biochemical modelling to understand metabolic networks. Comp Funct Genomics 2:155–168
10. Singh OV, Nagaraj NS (2006) Transcriptomics, proteomics and interactomics: unique approaches to track the insights of bioremediation. Brief Funct Genomic Proteomic 4:355–362
11. Lin J, Qian J (2007) Systems biology approach to integrative comparative genomics. Expert Rev Proteomics 4:107–119
12. Kandpal R, Saviola B, Felton J (2009) The era of omics unlimited. Biotechniques 46:351–355
13. Ishii N, Tomita M (2009) Multi-omics data-driven systems biology of E. coli. In: Lee SY (ed) Systems biology and biotechnology of Escherichia coli. Springer, Dordrecht, The Netherlands, pp 41–57
14. Tang YJ, Martin HG, Myers S, Rodriguez S, Baidoo EE, Keasling JD (2009) Advances in analysis of microbial metabolic fluxes via 13C isotopic labeling. Mass Spectrom Rev 28:362–375
15. Park SJ, Lee SY, Cho J, Kim TY, Lee JW, Park JH, Han MJ (2005) Global physiological understanding and metabolic engineering of microorganisms based on omics studies. Appl Microbiol Biotechnol 68:567–579
16. Gygi SP, Rochon Y, Franza BR, Aebersold R (1999) Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 19:1720–1730
17. Hegde PS, White IR, Debouck C (2003) Interplay of transcriptomics and proteomics. Curr Opin Biotechnol 14:647–651
18. Mootha VK, Lepage P, Miller K, Bunkenborg J, Reich M, Hjerrild M, Delmonte T, Villeneuve A, Sladek R et al (2003) Identification of a gene causing human cytochrome c oxidase deficiency by integrative genomics. Proc Natl Acad Sci U S A 100:605–610
19. Mootha VK, Bunkenborg J, Olsen JV, Hjerrild M, Wisniewski JR, Stahl E, Bolouri MS, Ray HN, Sihag S et al (2003) Integrated analysis of protein composition, tissue diversity, and gene regulation in mouse mitochondria. Cell 115:629–640
20. Alter O, Golub GH (2004) Integrative analysis of genome-scale data by using pseudoinverse projection predicts novel correlation between DNA replication and RNA transcription. Proc Natl Acad Sci U S A 101:16577–16582
21. Greenbaum D, Jansen R, Gerstein M (2002) Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts. Bioinformatics 18:585–596
22. Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner R, Goodlett DR, Aebersold R, Hood L (2001) Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292:929–934
23. Washburn MP, Koller A, Oshiro G, Ulaszek G, Plouffe D, Deciu C, Winzeler E, Yates JR III (2003) Protein pathway and complex clustering of correlated mRNA and protein expression analyses in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A 100:3107–3112
24. Greenbaum D, Colangelo C, Williams K, Gerstein M (2003) Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol 4:117.1–117.8
25. Beyer A, Hollunder J, Nasheuer HP, Wilhelm T (2004) Posttranscriptional expression regulation in the yeast Saccharomyces cerevisiae on a genomic scale. Mol Cell Proteomics 3:1083–1092
26. Nie L, Wu G, Zhang W (2006) Correlation of mRNA expression and protein abundance affected by multiple sequence features related to translational efficiency in Desulfovibrio vulgaris: a quantitative analysis. Genetics 174:2229–2243
27. Wilkins MR, Pasquali C, Appel RD, Ou K, Golaz O, Sanchez J, Yan JX, Gooley AA, Hughes G et al (1996) From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology (NY) 14:61–65
28. Scherl A, Francois P, Charbonnier Y, Deshusses JM, Koessler T, Huyghe A, Bento M, Stahl-Zeng J, Fischer A et al (2006) Exploring glycopeptide-resistance in Staphylococcus aureus: a combined proteomics and transcriptomics approach for the identification of resistance-related markers. BMC Genomics 7:296
29. Zhang W, Gritsenko M, Moore RJ, Culley DE, Nie L, Petritis K, Strittmatter EF, Camp DG, Smith RD, Brockman FJ (2006) A proteomic view of Desulfovibrio vulgaris metabolism as determined by liquid chromatography coupled with tandem mass spectrometry. Proteomics 6:4286–4299
30. Tuikkala J, Elo L, Nevalainen OS, Aittokallio T (2006) Improving missing value estimation in microarray data with gene ontology. Bioinformatics 22:566–572
31. Nie L, Wu G, Brockman FJ, Zhang W (2006) Integrated analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: zero-inflated Poisson regression models to predict abundance of undetected proteins. Bioinformatics 22:1641–1647
32. Collins RF, Roberts M, Phoenix DA (1995) Codon bias in Escherichia coli may modulate translation initiation. Biochem Soc Trans 23:76
33. Akashi H, Gojobori T (2002) Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis. Proc Natl Acad Sci U S A 99:3695–3700
34. Tate WP, Poole ES, Dalphin ME, Major LL, Crawford DJ et al (1996) The translational stop signal: codon with a context, or extended factor recognition element? Biochimie 78:945–952
35. Heidelberg JF, Seshadri R, Haveman SA, Hemme CL et al (2004) The genome sequence of the anaerobic, sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough. Nat Biotechnol 22:554–559
36. Zhang W, Culley DE, Scholten JC, Hogan M, Vitiritti L, Brockman FJ (2006) Global transcriptomic analysis of Desulfovibrio vulgaris on different electron donors. Antonie Van Leeuwenhoek 89:221–237
37. Nie L, Wu G, Zhang W (2006) Correlation between mRNA and protein abundance in Desulfovibrio vulgaris: a multiple regression to identify sources of variations. Biochem Biophys Res Commun 339:603–610
38. McCullagh P, Nelder JA (1989) Generalized linear models. Chapman and Hall, Boca Raton, FL
39. Lambert D (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34:1–14
40. Johnson RA (2005) Miller and Freund’s probability and statistics for engineers. Pearson Prentice Hall
41. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning-data mining, inference, prediction. Springer, New York, NY, USA
42. Osada Y, Saito R, Tomita M (1999) Analysis of base-pairing potentials between 16S rRNA and 5′ UTR for translation initiation in various prokaryotes. Bioinformatics 15:578–581
43. Suzek BE, Ermolaeva MD, Schreiber M, Salzberg SL (2001) A probabilistic method for identifying start codons in bacterial genomes. Bioinformatics 17:1123–1130
44. Hofacker IL (2003) Vienna RNA secondary structure server. Nucleic Acids Res 31:3429–3431
45. Hofacker IL, Stadler PF (2006) Memory efficient folding algorithms for circular RNA secondary structures. Bioinformatics 22:1172–1176
46. Wu G, Nie L, Zhang W (2006) Relation between mRNA expression and sequence information in Desulfovibrio vulgaris: combinatorial contributions of upstream regulatory motifs and coding sequence features to variations in mRNA abundance. Biochem Biophys Res Commun 344:114–121
47. Devore J, Farnum N (2005) Applied statistics for engineers and scientists. Thompson Learning, Belmont, CA
48. Ott RY, Longnecker M (2001) An introduction to statistical methods and data analysis. Thompson Learning, Pacific Grove, CA
49. Montgomery DC (2001) Introduction to statistical quality control (Wiley series in statistics and probability). Wiley, New York
50. Nie L, Wu G, Culley DE, Scholten JC, Zhang W (2007) Integrative analysis of transcriptomic and proteomic data: challenges, solutions and applications. Crit Rev Biotechnol 27:63–75
51. Lange R, Hengge-Aronis R (1994) The cellular concentration of the σS subunit of RNA polymerase in Escherichia coli is controlled at the levels of transcription, translation, and protein stability. Genes Dev 8:1600–1612
52. Rocha EP, Danchin A, Viari A (1999) Translation in Bacillus subtilis: roles and trends of initiation and termination, insights from a genome analysis. Nucleic Acids Res 27:3567–3576
53. Romby P, Springer M (2003) Bacterial translational control at atomic resolution. Trends Genet 19:155–161
54. Lithwick G, Margalit H (2003) Hierarchy of sequence-dependent features associated with prokaryotic translation. Genome Res 13:2665–2673
Methods in Molecular Biology (2016) 1375: 137–153
DOI 10.1007/7651_2015_252
© Springer Science+Business Media New York 2015
Published online: 02 July 2015
Abstract
With the completion of the Human Genome Project and the emergence of high-throughput technologies,
a vast amount of molecular and biological data are being produced. Two of the most important
and significant data sources come from microarray gene-expression experiments and respective databanks
(e.g., Gene Expression Omnibus, GEO (http://www.ncbi.nlm.nih.gov/geo)), and from molecular pathways
and Gene Regulatory Networks (GRNs) stored and curated in public (e.g., Kyoto Encyclopedia of
Genes and Genomes, KEGG (http://www.genome.jp/kegg/pathway.html), Reactome
(http://www.reactome.org/ReactomeGWT/entrypoint.html)) as well as in commercial repositories (e.g., Ingenuity
IPA (http://www.ingenuity.com/products/ipa)). The association of these two sources aims to give new
insight in disease understanding and reveal new molecular targets in the treatment of specific phenotypes.
Three major research lines and respective efforts that try to utilize and combine data from both of these
sources could be identified, namely: (1) de novo reconstruction of GRNs, (2) identification of Gene-
signatures, and (3) identification of differentially expressed GRN functional paths (i.e., sub-GRN paths that
distinguish between different phenotypes). In this chapter, we give an overview of the existing methods that
support the different types of gene-expression and GRN integration with a focus on methodologies that
aim to identify phenotype-discriminant GRNs or subnetworks, and we also present our methodology.
Keywords: Microarray, Gene expression, Gene regulatory networks, Pathways, Functional pathways,
Bioinformatics, Systems biology
1 Introduction
selection using topology goes one step further and tries to identify the
discriminant pathways or subpathways. Within this approach iden-
tification and selection of the most discriminant paths ignore the
present gene relations/regulations. The last and most informative
category is the subpathway selection using regulatory mechanisms.
This approach takes advantage of the GRN topology as well as the
type of GRN gene relations (e.g., activation or inhibition).
Initial efforts used GRN information as groups (plain list) of
associated genes in order to identify the most discriminant and
phenotype-differentiating genes. Molecular pathways effectively
reduced the resulting sets of genes, extracted from a gene set analysis
approach, and in some cases improved prediction performance. But
GRNs encompass much more knowledge than just a plain list of
genes. Recently, more and more methods take advantage of the
GRN topology and the underlying gene interaction patterns.
Pathway selection methodologies show similarities with gene
signatures in terms of the level of information used over the years.
Although GRNs hold important information about the structure
and correlation among genes that should not be neglected, most of
the currently available methods in pathway selection do not fully
exploit it. In the literature, one can find three categories of
methodologies that focus on the identification and selection
of discriminant pathways and subpathways, based on the different
levels of knowledge extraction from target GRNs. Initially the focus
was on the identification of differentially expressed pathways (as a
whole) using microarray data. Then the efforts concentrated on the
knowledge of the GRN topology using decomposition mechanisms
to reveal discriminant subpathways based on the graph theory
concepts and network visualization toolkits. Recently, more
advanced methodologies have been developed, which take into
consideration not only the topology of the GRNs but also the regulation
type (activation/inhibition) of the interaction link that connects
two or more genes.
One can easily identify three main categories of methodologies
according to the level of the utilised GRN information. The cate-
gories are pathway selection using GRNs as list of genes, subpathway
selection using the topology of GRNs, and subpathway selection
methodologies using the underlying GRN gene regulatory interactions.
The last category, being in its infancy, has the fewest
methodologies so far, but it takes the most out of GRNs and gene-
expression data compared to the other two, and is a promising
alternative for the identification of the regulatory mechanisms that
underlie and putatively govern various phenotypes.
The subpathway selection approach that uses the underlying GRN gene
regulatory interactions solves the major problem of the
set enrichment strategies, namely the conflicting constraints
between GRNs and gene-expression data. A typical example of the
conflicting constraints is reflected in the situation when two
Integrating Microarray Data and GRNs 141
2 Method
1 http://www.kegg.jp/kegg/xml/
Fig. 8 Functional-path decomposition: Left: A target part of an artificial GRN; Right: The ten decomposed
functional sub-paths
Both the GRNs and the gene expression data have to use the same
ids. GRNs use gene ids while gene expression platforms use probes.
A probe is a specific segment of single-strand DNA that is comple-
mentary to a desired gene. For example, if the gene of interest
contains the sequence AATGGCACA, then the probe will contain
the complementary sequence TTACCGTGT. When added to the
appropriate solution, the probe will match and then bind to the
gene of interest.
Due to the large number of databases and associated IDs, the
conversion of gene identifiers is one of the initial and central steps in
many workflows related to genomic data analysis. In the literature
and the web, we can find several freely available ID conversion
tools. Although each tool has distinct features and strengths, as
reviewed by Khatri et al. (12), they all adopt a common core
strategy to systematically map a large number of interesting genes
in a list to the associated biological annotation.
Mapping from one thesaurus to another raises the many-to-one
issue: in our case, many probes from the gene expression
dataset are assigned to the same KEGG gene ID. We check the
multiple probes for the gene and apply a logical OR to assess
the gene’s value. In practice, this selects the value
of the probe with the highest intensity out of all the probes that
map to the same gene.
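The many-to-one rule above can be sketched in a few lines of Python. For binary (ON/OFF) values, taking the maximum over the probes of a gene is the same as a logical OR; for intensities it keeps the highest-intensity probe. The function name collapse_probes and the probe and gene identifiers in any call are hypothetical, and probes without a mapping are simply dropped, which is an assumption of this sketch.

```python
def collapse_probes(probe_values, probe_to_gene):
    """Collapse probe-level values to gene-level values.

    probe_values:  dict mapping probe ID -> value for one sample
    probe_to_gene: dict mapping probe ID -> gene ID (many-to-one)
    Returns a dict mapping gene ID -> maximum value over its probes.
    """
    gene_values = {}
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe has no gene annotation
        if gene not in gene_values or value > gene_values[gene]:
            gene_values[gene] = value
    return gene_values
```

Applied to each sample in turn, this yields one value per gene ID, so the expression data and the GRN share the same identifiers.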
Then we need to identify the subpaths that exhibit high-matching
scores for one phenotypic class and low-matching
148 L. Koumakis et al.
2.3 Data Analysis
As an example, assume the gene-expression binary profiles of six
artificial samples for genes A, B, D, and C, with “1” denoting
“ON” and “0” denoting “OFF”. Three of the samples are assigned to
phenotype-1 (S1, S2, and S3) and the other three to phenotype-2
(S4, S5, and S6); refer to Fig. 9.
Furthermore, assume the artificial GRN shown in the left part
of Fig. 9, and its subpath A → B → D -| C (in bold), where →
denotes activation and -| denotes inhibition. We follow a
logic-gates process that aims to match the path-module instance of
the subpath with the respective samples’ binary instances. The
process results in an ordered pattern that indicates the samples
with which the target subpath is consistent (“1”s) or not (“0”s),
i.e., the samples in which the respective path-module
A = “ON” → B = “ON” → D = “ON” -| C = “OFF” is active.
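The matching step can be illustrated with a short Python sketch. The path-module is written as the required ON/OFF state of each gene, and a sample matches when all of its binary values agree with those states. The function names are hypothetical, and any sample profiles passed in are invented for illustration; the actual profiles of Fig. 9 are not reproduced here.

```python
def path_module_active(profile, module):
    """Return True if a sample's binary profile agrees with every gene
    state required by the path-module."""
    return all(profile.get(gene) == state for gene, state in module.items())


def match_pattern(samples, module):
    """Return the ordered 0/1 pattern over samples (in insertion order):
    1 where the path-module is active, 0 where it is not."""
    return [
        1 if path_module_active(profile, module) else 0
        for profile in samples.values()
    ]


# Path-module for A -> B -> D -| C: A, B, and D are ON and C is OFF.
MODULE = {"A": 1, "B": 1, "D": 1, "C": 0}
```

A subpath whose pattern is mostly 1s in one phenotype and mostly 0s in the other is a candidate discriminant subpath, which is what the subsequent scoring step looks for.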
Fig. 9 Matching gene-expression sample profiles with GRN functional path-modules: a logic-gates approach
2.4 Experiments
Most breast cancer (BRCA) cases are estrogen responsive, implying
the activation of a series of growth-promoting pathways, for
example, the estrogen receptor (ER) related ErbB signaling GRN.
In an effort to reveal the underlying regulatory mechanisms that
govern BRCA patients’ treatment responses, we applied our
methodology to a public gene-expression study from the GEO, the
GSE7390 dataset (footnote 2), targeting the ER phenotypic status of the
respective patients, i.e., ER+ (ER positive) vs. ER− (ER negative).
We targeted 14 pathways, all of which are part of the
“Pathways in Cancer” integrated pathway of KEGG (hsa05200),
namely: ECM-receptor interaction (hsa04512), Cytokine-cytokine
receptor interaction (hsa04060), Adherens junction (hsa04520),
Wnt signaling (hsa04310), Focal adhesion (hsa04510), Jak-STAT
signaling (hsa04630), ErbB signaling (hsa04012), MAPK signaling
(hsa04010), mTOR signaling (hsa04150), VEGF signaling
(hsa04370), Apoptosis (hsa04210), p53 signaling (hsa04115),
Cell cycle (hsa04110), and TGF-β signaling (hsa04350).
The results for ErbB signaling (hsa04012) are visualized in
Fig. 10 with the help of the Cytoscape graph library (footnote 3).
The graph preserves the KEGG layout topology. It is enriched with
the expressed regulatory mechanisms (relations) between genes that
differentiate the two phenotypes, and the color coding is as follows:
• Red indicates relations active in class 1, which in our example
is ERpos.
• Blue indicates relations active in class 2 (ERneg).
• Magenta indicates relations that overlap in the two classes.
• Orange indicates subpaths that are always active.
The figure highlights only the “interesting” subpaths, which in
our case are the most discriminant subpaths for the two
phenotypes.
Inspecting the reduced network, it is clear that for the ERpos
phenotype there is a pathway that starts from NRG (1 and 2) and
ends by inhibiting CDKN1B, and that for the ERneg phenotype there
is a pathway that starts from TGFA, AREG, or HBEGF and ends by
inhibiting EIF4EBP1.
2 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse7390
3 http://www.cytoscape.org/
Acknowledgment
References
1. Brown PO, Botstein D (1999) Exploring the 11. Ott MA, Gert V (2006) Correcting ligands,
new world of the genome with DNA microar- metabolites, and pathways. BMC Bioinformat-
rays. Nat Genet 21:33–37 ics 7(1):517
2. Huang Y, Zhao Z, Xu H, Shyr Y, Zhang B 12. Khatri P, Draghici S (2005) Ontological analy-
(2012) Advances in systems biology: computa- sis of gene expression data: current tools, lim-
tional algorithms and applications. BMC Syst itations, and open problems. Bioinformatics
Biol 6(3) 21:3587–3595
3. Hung J-H, Yang T-H, Zhenjun H, Weng Z, 13. Kauffman SA (1993) The origins of order: self-
DeLisi C (2012) Gene set enrichment analysis: organization and selection in evolution.
performance evaluation and usage guidelines. Oxford University Press, New York
Brief Bioinform 13(3):281–291 14. Hall M, Frank E, Holmes G, Pfahringer B,
4. Heckera M, Lambecka S, Toepferb S, van Som- Reutemann P, Ian H (2009) The WEKA data
erenc E, Guthke R (2009) Gene regulatory mining software: an update. SIGKDD Explora-
network inference: data integration in dynamic tions 11(1)
models—a review. Biosystems 96(1):86–103 15. Sutherland RL (2011) Endocrine resistance in
5. Ein-Dor L, Kela I, Getz G, Givol D, Domany E breast cancer: new roles for ErbB3 and ErbB4.
(2005) Outcome signature genes in breast can- Breast Cancer Res 13(3):106
cer: is there a unique set? Bioinformatics 21 16. Hutcheson IR et al (2007) Heregulin beta1
(2):171–178 drives gefitinib-resistant growth and invasion
6. Iwamoto T, Pusztai L (2010) Predicting prog- in tamoxifen-resistant MCF-7 breast cancer
nosis of breast cancer with gene signatures: are cells. Breast Cancer Res 9(4):50
we lost in a sea of data? Genome Med 2(11):81 17. Geistlinger L, Csaba G, K€ uffner R, Mulde N,
7. Shannon CEA (1948) Mathematical theory of Zimmer R (2011) From sets to graphs towards
communication. Bell Sys Tech J 27 a realistic enrichment analysis of transcriptomic
(3):379–423 systems. Bioinformatics 27(13):366–373
8. Potamias G, Koumakis L, Moustakis V (2004) 18. Tarca AL, Draghici S, Khatri P, Hassan SS,
Gene selection via discretized gene-expression Mittal P, Kim JS, Kim CJ, Kusanovic JP,
profiles and greedy feature-elimination. Meth Romero R (2009) A novel signaling pathway
Appl Artif Intelligence 3025:256–266 impact analysis. Bioinformatics 25(1):75–82
9. Li L, Weinberg CR, Darden TA, Pedersen LG 19. Judeh T, Johnson C, Kumar A, Zhu D (2013)
(2001) Gene selection for sample classification TEAK: Topology Enrichment Analysis frame-
based on gene expression data: study of sensi- worK for detecting activated biological sub-
tivity to choice of parameters of the GA/KNN pathways. Nucleic Acids Res 41(1):1425–1437
method. Bioinformatics 17(12):1131–1142 20. Nam S, Chang HR, Kim KT et al (2014)
10. Kanehisa M, Araki M, Goto S, Hattori M, Hir- PATHOME: an algorithm for accurately
akawa M, Itoh M, Yamanishi Y (2008) KEGG detecting differentially expressed subpathways.
for linking genomes to life and the environ- Oncogene 33(41):4941–4951
ment. Nucleic Acids Res 36:480–484
Methods in Molecular Biology (2016) 1375: 155–167
DOI 10.1007/7651_2015_284
© Springer Science+Business Media New York 2015
Published online: 28 October 2015
Abstract
Currently in bioinformatics and systems biology there is a growing interest in the analysis of associations
among biological molecules at a network level. A main line of research in this area is the inference of
biological networks from experimental data. Biological network inference aims to reconstruct the network of
interactions (or associations) among biological molecules (e.g., genes or proteins) starting from
experimental observations. The current scenario is characterized by a growing number of inference
algorithms, while little attention has been paid to the determination of fair assessments and comparisons.
Current assessments are usually based on the comparison of the algorithms using reference networks or
gold standard datasets. Here we survey some selected inference algorithms and compare current
assessments. We also present a systematic listing of freely available inference and assessment tools for easy
reference. Finally, we outline some possible future directions of research, such as the use of prior
knowledge in the assessment process.
Keywords: Biological network inference, Assessment, Gene regulatory network, Gold standard, Gene
Ontology, Graph theory
1 Introduction
[Figure: network diagram shown in two panels, each with the yeast gene nodes YCR084C, YFL026W, YCL067C, YIL015W, YHR084W, YNL145W, YDR461W, YMR043W, YPR113W, YKR097W, YJL157C, and YAL040C]
and its target’s expression level. Similarly, Jung and Cho [4]
propose an evolutionary computation-based approach for constructing
gene (interaction) networks from gene expression
time-series data. Their approach first assumes an artificial gene
network and compares it with the network reconstructed from the
gene expression time-series data generated by that artificial
network. Next, it applies the proposed approach to real
gene expression time-series data to construct a gene network.
Approaches based on mutual information [26, 27] or
correlation coefficients [28, 29] have been proposed
for extracting gene-gene interaction networks. It has been
observed that a pair of genes with high mutual information is
nonrandomly associated, which suggests a biologically
significant relationship. Butte et al. [26] compute comprehensive
pairwise mutual information for all genes in an expression dataset.
By picking a mutual information threshold and using only
associations at or above the threshold, they construct relevance networks.
A number of additional mutual information-based approaches have
also been proposed. Some of the well-known algorithms in this
category are CLR [30], ARACNE [31], and MRNET [32].
GENIE3 [33] is a Random Forest-based method that models the
inference as a regression problem. Recently, GeCON [34], a
pattern-based co-expression network inference method, has been
proposed; it is capable of detecting undirected co-regulated
networks with regulation information (+ or −). A summary of the
inference methods discussed above is reported in Table 1.
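The relevance-network idea (pairwise mutual information followed by a threshold) can be sketched as below. This is only an illustration under stated assumptions: the histogram-based estimator with three equal-width bins, the threshold of 0.3 bits, the function names, and the synthetic data are all choices made here, not the procedure of Butte et al. [26] or of the packages in Table 1.

```python
import numpy as np

def mutual_information(x, y, bins=3):
    """Estimate mutual information (in bits) from a 2D histogram of two profiles."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nonzero = pxy > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])))

def relevance_network(expr, genes, threshold):
    """Return gene pairs whose pairwise mutual information is at or above the threshold."""
    edges = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            mi = mutual_information(expr[i], expr[j])
            if mi >= threshold:
                edges.append((genes[i], genes[j], mi))
    return edges

# Synthetic expression profiles (rows = genes, columns = samples)
rng = np.random.default_rng(0)
g1 = rng.normal(size=200)
g2 = g1 + 0.1 * rng.normal(size=200)   # strongly dependent on g1
g3 = rng.normal(size=200)              # generated independently of g1 and g2
expr = np.vstack([g1, g2, g3])

edges = relevance_network(expr, ["g1", "g2", "g3"], threshold=0.3)
print([(a, b) for a, b, _ in edges])
```

The choice of estimator and threshold strongly affects the resulting network, which is one reason the dedicated algorithms cited above (CLR, ARACNE, MRNET) add further filtering steps.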
4.1 In Silico Assessment Methods
As introduced before, the assessment of GRN algorithms has been
largely discussed within the Dialogue on Reverse Engineering
Assessment and Methods (DREAM) project [38]. The project is
Biological Network Inference from Microarray Data, Current Solutions, and Assessments 161
Table 1
Synopsis of GRN inference methods

| Algorithm | Approach | Inferred network type | Package | Platform | Assessment based on (type of data) |
|---|---|---|---|---|---|
| GGM | Bayesian network | Undirected | GeneNet (a) | R | Synthetic (statically simulated) and real (breast cancer) |
| CBN | Bayesian network | Directed | CatNet (b) | R | Real (breast, lung, gastric, and renal cancer) |
| GENIE3 | Random Forest | Directed | GENIE3 (c) | R, MatLab | Synthetic (DREAM4) and real (E. coli) |
| CLR | Mutual information | Undirected | MINET (d) | R | Real (E. coli) |
| ARACNE | Mutual information | Undirected | MINET (e) | R | Synthetic (random network) and real (human B cells) |
| MRNET | Mutual information | Undirected | MINET (f) | R | Synthetic (sRogers and SynTReN) |
| GeCON | Expression pattern similarity | Undirected | GeCON (g) | Java | Synthetic (DREAM4), real (yeast, human, rat, mouse, rice) |
| JUMP3 | Decision tree and Boolean network | Directed | Jump3 (h) | Matlab | DREAM4, IRMA |

a http://strimmerlab.org/software/genenet/
b http://cran.r-project.org/web/packages/catnet
c http://www.montefiore.ulg.ac.be/huynh-thu/software.html
d http://minet.meyerp.com
e http://minet.meyerp.com
f http://minet.meyerp.com
g https://sites.google.com/site/swarupnehu/publications/resources
h http://homepages.inf.ed.ac.uk/vhuynht/misc/jump3.zip
organized on an annual basis, and researchers from all over the world may
participate in this contest. The structure of the competition is
quite simple. Each year, the organizers provide many test datasets
to the community of participants. Researchers then test their
own algorithms on these datasets and determine a candidate
network for each dataset. Once finished, researchers send their
candidate networks to the organizers for the analysis of the results.
Network analysis is based on the comparison of each candidate
network with an a priori determined network (one for each dataset)
that represents the reference or gold standard network. The
distance from the gold standard network is calculated by evaluating
the confusion matrix, i.e., a matrix containing the numbers of true
positives/negatives and false positives/negatives. All
submissions are reviewed manually by the organizers and are
then scored by deriving receiver operating characteristic (ROC) or
precision-recall curves.
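The comparison against a gold standard can be sketched as follows for undirected, unweighted networks. This is a minimal illustration and not the DREAM scoring code: the toy networks and the function name `confusion_counts` are assumptions. ROC and precision-recall curves are obtained by repeating such a count while sweeping a cutoff over ranked edge scores; here a single cutoff is assumed to have been applied already.

```python
from itertools import combinations

def confusion_counts(genes, predicted_edges, gold_edges):
    """Count TP, FP, FN, TN over all unordered gene pairs (undirected networks)."""
    predicted = {frozenset(edge) for edge in predicted_edges}
    gold = {frozenset(edge) for edge in gold_edges}
    tp = fp = fn = tn = 0
    for pair in combinations(genes, 2):
        in_pred = frozenset(pair) in predicted
        in_gold = frozenset(pair) in gold
        if in_pred and in_gold:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical gold standard and candidate networks over four genes
genes = ["G1", "G2", "G3", "G4"]
gold = [("G1", "G2"), ("G2", "G3"), ("G3", "G4")]
candidate = [("G2", "G1"), ("G2", "G3"), ("G1", "G4")]

tp, fp, fn, tn = confusion_counts(genes, candidate, gold)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn, precision, recall)
```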
162 Swarup Roy and Pietro Hiram Guzzi
4.2 Assessment Against Biological Truth
The limited availability of true gold standard networks (for all living
organisms) for validating a candidate inference method poses
an additional challenge for systems biologists who need to select a
suitable inference method for their experiments.
Known interactions reported in the literature are used to validate an
inference method from the point of view of biological significance.
Experimentally validated regulators collected from publicly available
databases such as RegulonDB [40] are used to assess the performance
of an inference method [33]. Pathways provide a way of linking the
functionality of groups of genes to specific biological processes.
Well-established methodologies such as Gene Set Enrichment
Analysis (GSEA) [41] help in differentiating pathways as functional
units from experimental populations. Manually curated pathways
based on expert knowledge and existing literature, obtained from
the Kyoto Encyclopedia of Genes and Genomes (KEGG,
http://www.genome.jp/kegg/pathway.html), are another alternative
used for validation [21].
Table 2
Few free GRN inference and visualization tools
6 Conclusion
References
1. Cannataro M, Guzzi PH, Veltri P (2010) Protein-to-protein interactions: technologies, databases, and algorithms. ACM Comput Surv (CSUR) 43(1):1
2. Cannataro M, Guzzi PH, Sarica A (2013) Data mining and life sciences applications on the grid. WIREs Data Mining Knowl Discov 3(3):216–238
3. Levine M, Davidson EH (2005) Gene regulatory networks for development. Proc Natl Acad Sci U S A 102(14):4936–4942
4. Jung SH, Cho K-H (2004) Identification of gene interaction networks based on evolutionary computation, AIS. Springer, New York, pp 428–439
5. Agapito G, Guzzi PH, Cannataro M (2013) Visualization of protein interaction networks: problems and solutions. BMC Bioinformatics 14(Suppl 1):S1
6. Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, Stolovitzky G (2010) Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci 107(14):6286–6291
7. Karr JR, Williams AH, Zucker JD, Raue A, Steiert B, Timmer J, Kreutz C, Wilkinson S, Allgood BA, Bot BM et al (2015) Summary of the DREAM8 parameter estimation challenge: toward parameter identification for whole-cell models. PLoS Comput Biol 11(5):e1004096
8. Godsil C, Royle GF (2013) Algebraic graph theory, vol 207. Springer Science & Business Media, New York
9. Cannataro M, Guzzi PH, Veltri P (2010) Impreco: distributed prediction of protein complexes. Futur Gener Comput Syst 26(3):434–440
10. Fuente ADI (2010) What are gene regulatory networks? Handbook of research on computational methodologies in gene regulatory networks. IGI Global, Hershey, PA, pp 1–27
11. Roy S, Das D, Choudhury D, Gohain GG, Sharma R, Bhattacharyya DK (2013) Causality inference techniques for in-silico gene regulatory network, Mining intelligence and knowledge exploration. Springer, New York, pp 432–443
12. Olsen C, Meyer PE, Bontempi G (2009) Inferring causal relationships using information theoretic measures. In Proceedings of the 5th Benelux Bioinformatics Conference (BBC09)
13. Mina M, Guzzi PH (2014) Improving the robustness of local network alignment: design and extensive assessment of a Markov clustering-based approach. IEEE/ACM Trans Comput Biol Bioinformatics 11(3):561–572
14. Mitra S, Das R, Hayashi Y (2011) Genetic networks and soft computing. IEEE/ACM Trans Comput Biol Bioinformatics 8(1):94–107
15. Nagrecha S, Lingras PJ, Chawla NV (2013) Comparison of gene co-expression networks and Bayesian networks, Intelligent Information and Database Systems. Springer, New York, pp 507–516
16. Karopka T, Scheel T, Bansemer S, Glass Ä (2004) Automatic construction of gene relation networks using text mining and gene expression data. Med Inform Internet Med 29(2):169–183
17. Özgür A, Vu T, Erkan G, Radev DR (2008) Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics 24(13):i277–i285
18. Friedman N, Linial M, Nachman I, Pe’er D (2000) Using Bayesian networks to analyze expression data. J Comput Biol 7(3–4):601–620
19. Davidich MI, Bornholdt S (2008) Boolean network model predicts cell cycle sequence of fission yeast. PLoS One 3(2):e1672
20. Schäfer J, Strimmer K (2005) An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21(6):754–764
21. Balov N (2013) A categorical network approach for discovering differentially expressed regulations in cancer. BMC Med Genet 6(Suppl 3):S1
22. Kwon AT, Hoos HH, Ng R (2003) Inference of transcriptional regulation relationships from gene expression data. Bioinformatics 19(8):905–912
23. Sanguinetti G et al (2015) Combining tree-based and dynamical systems for the inference of gene regulatory networks. Bioinformatics 31(10):1614–1622
24. Segal E, Taskar B, Gasch A, Friedman N, Koller D (2001) Rich probabilistic models for gene expression. Bioinformatics 17(Suppl 1):S243–S252
25. Mitra S, Das R, Banka H, Mukhopadhyay S (2009) Gene interaction–an evolutionary biclustering approach. Information Fusion 10(3):242–249
26. Butte AJ, Kohane IS (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements, vol 5, Pacific symposium on biocomputing. World Scientific, Singapore, pp 418–429
27. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci 95(25):14863–14868
28. Tong AHY, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M et al (2004) Global mapping of the yeast genetic interaction network. Science 303(5659):808–813
29. Kuo WP, Mendez E, Chen C, Whipple ME, Farell G, Agoff N, Park PJ (2003) Functional relationships between gene pairs in oral squamous cell carcinoma, AMIA annual symposium proceedings. American Medical Informatics Association, Bethesda, MD, p 371
30. Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 5(1):e8
31. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, Califano A (2006) Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7(Suppl 1):S7
32. Meyer PE, Kontos K, Lafitte F, Bontempi G (2007) Information-theoretic inference of large transcriptional regulatory networks. EURASIP J Bioinforma Syst Biol 2007:79879
33. Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P (2010) Inferring regulatory networks from expression data using tree-based methods. PLoS One 5(9):e12776
34. Roy S, Bhattacharyya DK, Kalita JK (2014) Reconstruction of gene co-expression network from microarray data using local expression patterns. BMC Bioinformatics 15(Suppl 7):S10
35. Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A (2007) Critical assessment of methods of protein structure prediction-round vii. Proteins 69(S8):3–9
36. Mendes P, Sha W, Ye K (2003) Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics 19(Suppl 2):ii122–ii129
37. Marbach D, Schaffter T, Mattiussi C, Floreano D (2009) Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J Comput Biol 16(2):229–239
38. Stolovitzky G, Monroe D, Califano A (2007) Dialogue on reverse-engineering assessment and methods. Ann N Y Acad Sci 1115(1):1–22
39. Siegenthaler C, Gunawan R (2014) Assessment of network inference methods: how to cope with an underdetermined problem. PLoS One 9(3):e90481
40. Gama-Castro S, Jiménez-Jacinto V, Peralta-Gil M, Santos-Zavaleta A, Peñaloza-Spinola MI, Contreras-Moreira B, Segura-Salazar J, Muñiz-Rascado L, Martínez-Flores I, Salgado H et al (2008) RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and textpresso navigation. Nucleic Acids Res 36(suppl 1):D120–D124
41. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES et al (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545–15550
42. Kharumnuid G, Roy S (2015) Tools for in-silico reconstruction and visualization of gene regulatory networks (GRN). In 2nd IEEE international conference on advance computing and communication engineering (ICACCE’ 2015)
43. Schaffter T, Marbach D, Floreano D (2011) Genenetweaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics 27(16):2263–2270
44. Smoot ME, Ono K, Ruscheinski J, Wang P-L, Ideker T (2011) Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27(3):431–432
45. Baker C, Carpendale MT, Prusinkiewicz P, Surette MG (2002) Genevis: visualization tools for genetic regulatory network dynamics. In Proceedings of the conference on Visualization’02. IEEE Computer Society, 2002, pp 243–250
46. Jupiter D, Chen H, VanBuren V (2009) Starnet 2: a web-based tool for accelerating discovery of gene regulatory networks using microarray co-expression data. BMC Bioinformatics 10(1):332
47. Tripathi S, Dehmer M, Emmert-Streib F (2014) Netbiov: an r package for visualizing large network data in biology and medicine. Bioinformatics 30(19):2834–2836
48. Bozdag S, Li A, Wuchty S, Fine HA (2010) Fastmedusa: a parallelized tool to infer gene regulatory networks. Bioinformatics 26(14):1792–1793
49. Smith VA, Yu J, Smulders TV, Hartemink AJ, Jarvis ED (2006) Computational inference of neural information flow networks. PLoS Comput Biol 2(11):e161, pp. 1436–1449
50. Wang M, Verdier J, Benedito VA, Tang Y, Murray JD, Ge Y, Becker JD, Carvalho H, Rogers C, Udvardi M et al (2013) Legumegrn: a gene regulatory network prediction server for functional and comparative studies. PLoS One 8(7):e67434
51. Faisal FE, Meng L, Crawford J, Milenković T (2015) The post-genomic era of biological network alignment. EURASIP J Bioinforma Syst Biol 2015:3
52. Ciriello G, Mina M, Guzzi PH, Cannataro M, Guerra C (2012) AlignNemo: a local network alignment method to integrate homology and topology. PLoS One 7(6):e38107. doi:10.1371/journal.pone.0038107
53. Guzzi PH, Milano M, Roy S (2015) Towards the assessment of GRN algorithms based on (disease) ontology. In: Proceedings of the ACM conf on bioinformatics, computational biology and health informatics (BCB’15)
Methods in Molecular Biology (2016) 1375: 169–179
DOI 10.1007/7651_2015_248
© Springer Science+Business Media New York 2015
Published online: 24 May 2015
Abstract
Due to the highly sensitive nature of metabolic states, the quality of metabolomics data depends on the
suitability of the experimental procedure. Metabolism could be affected by factors such as the method of
euthanasia of the animals and the sample collection procedures. The effects of these factors on metabolites are
tissue-specific. Thus, it is important to select proper methods to sacrifice the animal and appropriate
procedures for collecting samples specific to the tissue of interest. Here, we present our protocol to collect
specific mouse skeletal muscles with different fiber types for metabolomics studies. We also provide a protocol
to measure lactate levels in tissue samples as a way to estimate the metabolic state in collected samples.
1 Introduction
170 Zhuohui Gan et al.
[Fig. 1 plots: bar charts of lactate in tissue homogenate (mM) for plantaris and soleus muscles. Panel a: Cervical Dislocation, Decapitation, Pentobarbital. Panel b: Flash Freezing, Tongs Freezing, In-Situ Dissection.]
Fig. 1 (a) The comparison of euthanasia methods on lactate level in mouse skeletal muscles. Two types of
mouse skeletal muscles, plantaris and soleus, were dissected from the flash-frozen legs which were cut from
mice sacrificed by the assigned methods. The sample size for each data point is 3–4. The values are
mean ± s.d. *: different from the lactate level in plantaris muscles from mice sacrificed by cervical
dislocation, p < 0.05. #: different from the lactate level in soleus muscles from mice sacrificed by cervical
dislocation, p < 0.05. (b) The comparison of collection methods on lactate level in mouse skeletal muscles.
Two types of mouse skeletal muscles, plantaris and soleus, were dissected from legs processed with the
assigned methods. The mice were sacrificed by cervical dislocation. The sample size for each data point is
3–4. The values are mean ± s.d. *: different from the lactate level in plantaris muscles from flash-frozen legs,
p < 0.05. #: different from the lactate level in soleus muscles from flash-frozen legs, p < 0.05
2 Materials
5. Iris scissors.
6. Hemostat (Halsted mosquito forceps).
7. Aluminum foil.
8. Indelible marker pen.
9. Disposable latex or nitrile gloves, gown or clean lab coat, and
eye protection.
10. Disposal container.
11. Cryotubes.
2.2 Materials for Amplex Red-Based Lactate Assay
1. 1× phosphate buffered saline (PBS), pH 7.4 (no Ca2+, no Mg2+).
2. L-lactate standard (Sigma, part #71718).
Mix 10 μL of 100 mM lactate sample with 990 μL deionized
water to get a 1 mM lactate solution. Prepare lactate standard
samples using deionized water with lactate concentrations of 0,
20, 50, 100, 200, and 300 μM, respectively. Store at −20 °C.
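The standard dilutions follow from C1·V1 = C2·V2. The sketch below computes how much of the 1 mM lactate solution and how much deionized water would be needed for each standard. The final volume of 1000 μL per standard is an assumption made here for illustration, as the protocol does not state it; the function name is likewise hypothetical.

```python
def dilution_volumes(stock_uM, target_uM, final_volume_uL):
    """Return (stock volume, water volume) in microliters using C1*V1 = C2*V2."""
    stock_volume = target_uM * final_volume_uL / stock_uM
    return stock_volume, final_volume_uL - stock_volume

STOCK_UM = 1000.0          # the 1 mM lactate solution, expressed in micromolar
FINAL_VOLUME_UL = 1000.0   # assumed final volume per standard (not given in the protocol)

standards = {
    target: dilution_volumes(STOCK_UM, target, FINAL_VOLUME_UL)
    for target in [0, 20, 50, 100, 200, 300]
}
for target, (stock_uL, water_uL) in standards.items():
    print(f"{target} uM: {stock_uL:.0f} uL of 1 mM lactate + {water_uL:.0f} uL water")

# The same relation reproduces the first dilution in the text:
# 10 uL of 100 mM lactate brought to 1000 uL gives 1 mM.
print(dilution_volumes(100000.0, 1000.0, 1000.0))
```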
3 Methods
13. Make a small incision along the skin of the lower leg around the
“ankle,” and then tear off the entire skin covering the leg.
14. Remove the membrane surrounding the leg muscles
using forceps or cut the membrane using dissection scissors
(see Note 3).
15. Gently separate the rear tendons from the bone, clamp the rear
tendons very distally underneath the heel using a hemostat, and
cut with Iris scissors.
16. Pull and separate the posterior muscles from the bone with the
hemostat. If it is still hard to move the posterior muscles,
wait for a few more minutes to thaw the leg a little more
(see Note 4).
17. Dissect the soleus from the posterior muscles by pulling the soleus
from its proximal tendon to the distal tendon, and cut out the
soleus using dissection scissors. Put the dissected soleus into a
cryotube and freeze in liquid nitrogen or homogenize in assay
buffer immediately (see Note 5). The location of the soleus
muscle in a frozen-thawed leg is indicated in Fig. 2a.
Fig. 2 (a) The location of the soleus muscle in a flash-frozen and then thawed mouse
leg. The soleus muscle is identifiable and dissectible in the frozen-thawed mouse leg,
and its integrity is well preserved. (b) The location of the plantaris muscle in a
flash-frozen and then thawed mouse leg after the soleus muscle is removed. The
plantaris muscle is identifiable and dissectible with good integrity
18. Separate the two distal tendons, find the plantaris
muscle, and cut its tendon. Pull the plantaris muscle by holding
the distal end of its tendon, and cut the proximal end (see Note
6). Put the dissected plantaris into a cryotube and freeze
in liquid nitrogen or homogenize in assay buffer immediately
(see Note 7). The location of the plantaris muscle in a
frozen-thawed leg is shown in Fig. 2b.
19. Store the dissected muscles at −80 °C if they will not be used
immediately. Otherwise, proceed to homogenization.
3.1 Supplement: Amplex Red-Based Lactate Assay
This assay was developed to detect lactate levels in dissected muscle
samples. It is not part of the muscle collection protocol but an
option for metabolic measurement. This protocol is sufficient to
run 20 lactate measurement reactions (see Note 8).
1. Label tubes for Amplex Red solution, HRP solution, LOX
solution and at least two tubes for each sample.
2. Take Amplex Red stock, HRP stock, lactate oxidase stock and
lactate standards out of the freezer to thaw.
3. Mix 100 μL Amplex Red stock and 900 μL 1× PBS to
get 1 mM Amplex Red solution; store on ice (this is AR-PBS)
(see Note 9).
4. Mix 20 μL HRP stock and 80 μL 1× PBS to get 100 U/mL
HRP solution; store on ice (this is HRP-PBS).
5. Mix 10 μL lactate oxidase stock and 90 μL 1× PBS to get
5 U/mL lactate oxidase solution (this is LOX-PBS).
6. Homogenize the muscle sample at a ratio of 1 mg per
30 μL 1× PBS in an ice bath.
7. Centrifuge the tube containing the homogenate at 13,000 × g for
10 min at 4 °C to remove insoluble material.
8. Remove the supernatant and transfer it into clean tubes.
9. Dilute the supernatant 1:10 with 1× PBS (see Note 10); store
on ice.
10. Add 45 μL AR-PBS to each cell. The total number of cells is the
sum of the cells for sample measurements and the cells for the
lactate standard curve, which requires six cells.
11. Add 5 μL HRP-PBS to each cell.
12. Measure the colorimetric absorbance at 571 nm as the back-
ground signal (abs1) at room temperature.
13. Add 45 μL diluted supernatant or lactate standards (0, 20,
50, 100, 200, and 300 μM) to each cell, mix by slightly shaking
(see Note 11).
14. Measure the colorimetric absorbance at 571 nm as the
measurement of H2O2 in supernatant samples (abs2).
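One way to turn the two absorbance readings into a lactate concentration is to fit a linear standard curve to the six lactate standards and invert it for each sample. The sketch below is an illustration under several assumptions that are not stated in the protocol: the signal is taken as abs2 − abs1, the response is assumed linear over 0–300 μM, the 1:10 dilution is treated as a tenfold dilution factor, and all absorbance values are invented.

```python
import numpy as np

# Concentrations of the six lactate standards (micromolar)
standards_uM = np.array([0, 20, 50, 100, 200, 300], dtype=float)
# Hypothetical background-corrected absorbances (abs2 - abs1) for the standards
standards_signal = np.array([0.00, 0.04, 0.10, 0.20, 0.40, 0.60])

# Linear standard curve: signal = slope * concentration + intercept
slope, intercept = np.polyfit(standards_uM, standards_signal, 1)

def lactate_uM(abs1, abs2, dilution_factor=10):
    """Estimate lactate (micromolar) in the undiluted supernatant from the two readings."""
    measured = (abs2 - abs1 - intercept) / slope   # concentration in the diluted sample
    return measured * dilution_factor              # correct for the assumed tenfold dilution

# Hypothetical sample readings
estimate = lactate_uM(abs1=0.05, abs2=0.25)
print(round(float(estimate), 1))
```

Samples whose corrected signal falls outside the range spanned by the standards should be diluted differently and measured again rather than extrapolated.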
A Protocol to Collect Specific Mouse Skeletal Muscles for Metabolomics Studies 177
4 Notes
Acknowledgements
References
1. Mizunoya W, Wakamatsu J, Tatsumi R et al 4. Evans CA, Kerkut GA (1981) Effect of nem-
(2008) Protocol for high-resolution separation butal anesthesia, electric shock, and shock
of rodent myosin heavy chain isoforms in a avoidance conditioning on acetylcholinesterase
mini-gel electrophoresis system. Anal Biochem activity and protein content in various regions
377(1):111–113 of the rat brain. Neurosci Behav Physiol 11
2. Fiehn O (2002) Metabolomics – the link (6):614–620
between genotypes and phenotypes. Plant 5. Marquez-Julio A, French IW (1967) The effect
Mol Biol 48(1–2):155–171 of ether, pentobarbital, and decapitation on
3. Noack S, Wiechert W (2014) Quantitative various metabolites of rat skeletal muscle. Can
metabolomics: a phantom? Trends Biotechnol J Biochem 45(9):1323–1327
32(5):238–244
A Protocol to Collect Specific Mouse Skeletal Muscles for Metabolomics Studies 179
6. Pence HH, Pence S, Kurtul N et al (2003) The mitochondrial isolated from the hearts of
alterations in adenosine nucleotides and lactic anesthetized rats with high-dose pentobarbital
acid levels in striated muscles following death sodium. Jpn J Physiol 47(1):87–92
with cervical dislocation or electric shock. Soud 15. Du F, Zhang Y, Iltis I et al (2009) In vivo
Lek 48(1):8–11 proton MRS to quantify anesthetic effects of
7. Rezin GT, Goncalves CL, Daufenbach JF et al pentobarbital on cerebral metabolism and
(2009) Acute administration of ketamine brain activity in rat. Magn Reson Med 62
reverses the inhibition of mitochondrial respi- (6):1385–1393
ratory chain induced by chronic mild stress. 16. Yamamoto Y, Hasegawa H, Ikeda K et al
Brain Res Bull 79(6):418–421 (1988) Cervical dislocation of mice
8. Chang Y, Chen TL, Sheu JR et al (2005) Sup- induces rapid accumulation of platelet
pressive effects of ketamine on macrophage func- serotonin in the lung. Agents Actions 25
tions. Toxicol Appl Pharmacol 204(1):27–35 (1–2):48–56
9. de Oliveira L, Fraga DB, De Luca RD et al 17. Fischer JC, Ruitenbeek W, Stadhouders AM
(2011) Behavioral changes and mitochondrial et al (1985) Investigation of mitochondrial
dysfunction in a rat model of schizophrenia metabolism in small human skeletal muscle
induced by ketamine. Metab Brain Dis 26 biopsy specimens. Improvement of prepara-
(1):69–77 tion procedure. Clin Chim Acta 145
10. Pravdic D, Hirata N, Barber L et al (2012) (1):89–99
Complex I and ATP synthase mediate mem- 18. Boros-Hatfaludy S, Fekete G, Apor P (1986)
brane depolarization and matrix acidification Metabolic enzyme activity patterns in muscle
by isoflurane in mitochondria. Eur J Pharmacol biopsy samples in different athletes. Eur J Appl
690(1–3):149–157 Physiol Occup Physiol 55(3):334–338
11. Zhang Y, Xu Z, Wang H et al (2012) Anes- 19. Bergstrom J (1975) Percutaneous needle
thetics isoflurane and desflurane differently biopsy of skeletal muscle in physiological and
affect mitochondrial function, learning, and clinical research. Scand J Clin Lab Invest 35
memory. Ann Neurol 71(5):687–698 (7):609–616
12. Kohro S, Hogan QH, Nakae Y et al (2001) Anes- 20. Antal C, Teletin M, Wendling O et al (2007)
thetic effects on mitochondrial ATP-sensitive K Tissue collection for systematic phenotyping in
channel. Anesthesiology 95(6):1435–1440 the mouse. Curr Protoc Mol Biol Chapter 29:
13. Braun S, Gaza N, Werdehausen R et al (2010) Unit 29A 24
Ketamine induces apoptosis via the mitochon- 21. Winder WW, Fuller EO, Conlee RK (1983)
drial pathway in human lymphocytes and neu- Adrenal hormones and liver cAMP in exercis-
ronal cells. Br J Anaesth 105(3):347–354 ing rats – different modes of anesthesia. J Appl
14. Takaki M, Nakahara H, Kawatani Y et al (1997) Physiol Respir Environ Exerc Physiol 55
No suppression of respiratory function of (5):1634–1636
Methods in Molecular Biology (2016) 1375: 181–194
DOI 10.1007/7651_2015_250
© Springer Science+Business Media New York 2015
Published online: 14 May 2015
Abstract
MicroRNAs (miRNAs) are short noncoding RNAs that regulate gene expression and play a relevant
role in physiopathological mechanisms such as development, proliferation, death, and differentiation of
normal and cancer cells. Recently, abnormal expression of miRNAs has been reported in most solid or
hematopoietic malignancies, including multiple myeloma (MM), where miRNAs have been found to be deeply
dysregulated and to act as oncogenes or tumor suppressors. Presently, the most widely recognized approach for
defining miRNA portraits is microarray profiling analysis. We here describe a workflow for
identifying dysregulated miRNAs in plasma cells from MM patients using Affymetrix
technology. We also describe how putative miRNA targets can be sought by whole gene
expression profiling of MM cell lines transfected with miRNA mimics or inhibitors, followed by a luciferase
reporter assay that analyzes the specific targeting of the 3′ untranslated region (UTR) sequence of an mRNA by
selected miRNAs. These technological approaches are suitable strategies for the identification of relevant
druggable targets in MM.
1 Introduction
182 Maria Teresa Di Martino et al.
Fig. 1 The figure shows the workflow of procedures for miRNA functional analysis. First the detection of
dysregulated miRNAs by differential miRNA profiling. Then the replacement of downregulated miRNAs or
specific inhibition of upregulated miRNAs, and the validation of miRNA-specific targeting by (1) the luciferase
reporter assay, (2) the whole gene expression analysis by microarray, (3) the study of specific targets at mRNA
level by real-time PCR and (4) at protein level by Western Blot analysis
2 Materials
2.2 Synthetic miRNA Overexpression or Inhibition
2.2.1 Transient Transfection
1. miRNA mimics (miRVana™ catalog no. 4464070, 5 nmol lyophilized pellet) are small, chemically modified double-stranded
RNAs that mimic endogenous miRNAs and enable miRNA
functional analysis by upregulation of miRNA activity.
2. miRNA inhibitors (miRVana™ catalog no. 4464066, 5 nmol
lyophilized pellet) are small, chemically modified single-stranded
RNA molecules designed to specifically bind to and
inhibit endogenous miRNA molecules and enable miRNA
functional analysis by downregulation of miRNA activity.
3. Negative control (Life Technologies).
4. Neon® Transfection System 100 μL kit (Invitrogen™), catalog
no. MPK10025. The Neon® Transfection System 100 μL Kit
includes 1 mL resuspension buffer R, 1 mL resuspension buffer
T, 75 mL E electrolytic buffer, 25 reaction delivery tips, five
electroporation tubes.
5. Exponentially growing MM cell lines.
6. Six-well plates.
7. RPMI-1640 medium.
8. Fetal bovine serum.
Equipment
1. Neon® Transfection System (Catalog Number MPK5000).
2. Neon® Pipette (Catalog Number MPP100).
Functional Analysis of microRNA in Multiple Myeloma 185
2.2.2 miRNA Quantitative Analysis
1. Total RNA (tRNA) isolation: TRIzol® Reagent (Life Technologies), chloroform,
isopropanol, 75 % ethanol, and nuclease-free water.
2. TaqMan® MicroRNA Reverse Transcription Kit (Applied
Biosystems).
3. TaqMan® MicroRNA Assays (Applied Biosystems).
4. NanoDrop 1000 Spectrophotometer.
5. Optical 96-well reaction plates with barcode (Applied
Biosystems).
6. Viia7 Dx real-time PCR system (Applied Biosystems).
Additionally support.
Library Files
GeneChip® Human Transcriptome Array 2.0 Analysis
(zip, 303 MB).
GeneChip® Human Transcriptome Array 2.0 AGCC Library File
Installer (zip, 109 KB).
2.3.2 Luciferase Reporter Assay for miRNA-Target Validation
1. Plasmid constructs (the 3′ UTR of the gene of interest is cloned in the
pEZX-MT01 vector, Genecopoeia).
(a) 400 mL LB liquid medium: 10 g/L tryptone, 5 g/L yeast
extract, 10 g/L NaCl.
(b) Kanamycin for bacterial selection.
(c) LB-Kanamycin bacterial plates: LB liquid medium plus 8 g/L
agar. Autoclave, cool to about 50 °C in a water bath,
and add Kanamycin to a final concentration of
25 μg/mL; mix well and distribute 15–20 mL of medium
per 10 cm plate.
(d) Maxi prep kit (PureLink® HiPure Plasmid Maxiprep Kit,
Invitrogen, Life Technologies) for isolation of high purity
plasmid from bacteria.
2. MM cell transfection and 3′ UTR luciferase reporter assay.
(a) Exponentially growing MM cells.
(b) miRVANA miRNA mimics (Applied Biosystems).
(c) Six-well plates.
(d) RPMI-1640 medium.
(e) Fetal bovine serum (FBS).
(f) Dual-Glo Luciferase Assay kit (Promega).
Equipment
1. Neon electroporation system (Life Technologies).
2. Plate reader for luminescence detection.
3 Methods
3.1 Screening by miRNA Profiling
Plasma cells from human peripheral blood mononuclear cells
(PBMCs) bone marrow are isolated at >90 % purity, as determined
by flow cytometry, by CD138 magnetic bead sorting
according to the manufacturer’s instructions (www.miltenyibiotec.com);
subsequently, total RNA (tRNA) including small
RNA fractions is extracted by a modified Qiagen protocol
(www.qiagen.com). Briefly, 5 × 10⁴ cells are lysed in 250 μL of
TRIzol® solution, then the aqueous phase is loaded on the Qiagen
column. After two washings by high-speed centrifugation at r.t. the
3.2 Synthetic miRNA Overexpression or Inhibition
The following procedure refers to the transfection of RPMI-8226
MM cells.
3.2.1 Transient Transfection
The day before the transfection, MM cells are seeded at
5.0 × 10⁵ cells/mL in RPMI-1640 containing 10 % heat-inactivated
FBS and 1 % penicillin-streptomycin.
MM cells are transfected with the Neon transfection system (Life
Technologies): briefly, 1.0 × 10⁶ cells are used for each transfection
point, washed in PBS, and resuspended in 100 μL of buffer R.
Add 2 μL of 100 μM miRNA mimics, inhibitors, or negative
control (NC). The mix is resuspended and electroporated using
the Neon pipette and 100 μL tips under the following electroporation
conditions: 1,050 V, 30 ms, 1 pulse. Transfected cells are
seeded into a six-well plate containing 2 mL of pre-warmed growth
medium without antibiotics. The plate is incubated in a 37 °C/
5 % CO2 incubator, and cells are collected 24 and 48 h after transfection
and analyzed for miRNA and target expression.
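The oligonucleotide amounts implied by this transfection step can be checked with a short calculation. The following is a minimal sketch in Python, not part of the original protocol; the function name is hypothetical, and the only inputs are the volumes and concentration stated above (2 μL of 100 μM stock, 100 μL of cells in buffer R, 2 mL of medium per well).

```python
def transfection_amounts(stock_uM=100.0, stock_uL=2.0, cells_uL=100.0, medium_uL=2000.0):
    """Return (pmol of oligo, µM in the electroporation mix, nM after seeding)."""
    pmol = stock_uM * stock_uL                    # µM x µL = pmol
    mix_uL = stock_uL + cells_uL                  # volume in the Neon tip
    in_tip_uM = pmol / mix_uL                     # pmol / µL = µM
    in_well_nM = 1000.0 * pmol / (mix_uL + medium_uL)
    return pmol, in_tip_uM, in_well_nM

pmol, in_tip_uM, in_well_nM = transfection_amounts()
print(f"{pmol:.0f} pmol oligo; {in_tip_uM:.2f} µM in the tip; {in_well_nM:.0f} nM in the well")
```

Under these assumptions each transfection point receives 200 pmol of mimic, inhibitor, or NC, which corresponds to roughly 2 μM during electroporation and roughly 95 nM once the cells are seeded in 2 mL of medium.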
3.2.2 miRNA Quantitative Analysis
tRNA Isolation
0.5 mL of TRIzol® Reagent is added to 1 × 10⁶ harvested cells; cell
samples are lysed by pipetting up and down several times.
The homogenized sample is incubated for 5 min at room temperature,
then 0.2 mL of chloroform is added. After shaking vigorously
by hand for 15 s, the tube is incubated for 2 min at room temperature.
The sample is then centrifuged at 12,000 × g for 15 min at
4 °C. The aqueous phase is transferred into a fresh Eppendorf
tube, taking care not to draw any of the interphase or organic layer into the
pipette. After adding 0.5 mL of 100 % isopropanol to the
aqueous phase, mix the sample well and incubate for 10 min at
room temperature. After centrifugation at 12,000 × g for 10 min
at 4 °C, the supernatant is discarded and the RNA
pellet is washed with 1 mL of 75 % ethanol. After centrifugation at
7,500 × g for 5 min at 4 °C, the supernatant is discarded and the
RNA pellet is air-dried for 5–10 min. The RNA pellet is then resuspended
in nuclease-free water by pipetting up and down several
times. The concentration is measured with a NanoDrop
spectrophotometer.
cDNA Generation and qRT-PCR Performance and Analysis
To prepare the RT master mix using the TaqMan® MicroRNA
Reverse Transcription Kit (Applied Biosystems) components, the
kit components are allowed to thaw on ice. Before use, the RT
primer tubes are vortexed. The RT master mix is prepared in a
polypropylene tube on ice, gently mixed and then centrifuged.
Note: RT master mix for each sample consists of:
– 100 mM dNTPs (with dTTP): 0.15 μL.
– MultiScribe™ Reverse Transcriptase, 50 U/μL: 1.0 μL.
– 10× Reverse Transcription Buffer: 1.50 μL.
– RNase Inhibitor, 20 U/μL: 0.19 μL.
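When several samples are reverse transcribed together, the per-sample volumes listed above are usually multiplied out with a small pipetting excess. The following is a minimal sketch in Python; the 10 % overage and the function name are assumptions introduced for illustration, and only the four components that appear in the list above are included (the list breaks off at a page boundary in the source, so any remaining components are not represented).

```python
# Per-sample volumes (µL) exactly as listed in the note above.
PER_SAMPLE_UL = {
    "100 mM dNTPs (with dTTP)": 0.15,
    "MultiScribe Reverse Transcriptase, 50 U/µL": 1.00,
    "10x Reverse Transcription Buffer": 1.50,
    "RNase Inhibitor, 20 U/µL": 0.19,
}

def scale_master_mix(n_samples, overage=0.10):
    """Scale each listed component to n_samples plus a pipetting overage (assumed 10 %)."""
    factor = n_samples * (1.0 + overage)
    return {name: round(vol * factor, 2) for name, vol in PER_SAMPLE_UL.items()}

mix = scale_master_mix(8)
for name, vol in mix.items():
    print(f"{name}: {vol} µL")
```

For eight samples this gives, for example, 13.2 μL of 10× buffer and 8.8 μL of reverse transcriptase; the overage can be set to zero if exact volumes are preferred.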
3.3 Analysis of Target Modulation
To identify mRNA targets that can help explain the role
of miRNAs in MM, Affymetrix gene expression profiling is performed.
Target mRNAs are then evaluated as possible miRNA
targets by luciferase reporter assay, using the mutated and wild-type
3′ UTR sequences of the miRNA-target gene cloned
in the pEZX-MT01 vector (Genecopoeia).
3.3.1 Gene Profiling
Total RNA, including small RNA fractions, is extracted from
transfected MM cells using a modified protocol from Qiagen as
described above. Total RNA samples consist of a combination of ribosomal
RNA (rRNA), messenger RNA (mRNA), transfer RNA,
and other small RNA species, with the rRNA fraction constituting
the vast majority (non-rRNA depleted). Gene expression
profiling (GEP) is then carried out according to the Affymetrix
recommended protocol. Briefly, after quantification of total RNA by
NanoDrop spectrophotometry, 100 ng is processed using the GeneChip
WT PLUS Reagent Kit according to the manufacturer’s instructions
(www.affymetrix.com), which uses a reverse transcription
priming method that primes the entire length of each RNA transcript,
including both poly-A and non-poly-A mRNA, to provide
complete transcriptome coverage. Starting from 100 ng of total RNA without
rRNA depletion, 10 μg of cRNA is obtained to be carried into the
second cycle. The cRNA concentration is measured by UV spectrophotometry
(NanoDrop) following bead purification. Total cRNA
yield is calculated by multiplying the measured concentration by the
Data Processing and Analysis
All resulting CEL files for each array type are processed as a single
group via the Affymetrix Power Tools (APT) package, using sketch
normalization and the RMA algorithm to summarize probeset
signal. QC metrics from the APT report file show the mean
perfect match (PM) and mean background intensity for all of the
samples included in the analysis.
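The normalization idea behind this step can be illustrated outside APT. The following is a minimal sketch in Python of full quantile normalization on a toy matrix. It is not the APT implementation: sketch normalization, as we understand it, estimates the target distribution from a subsample of probes, and RMA additionally performs background correction and median-polish summarization, neither of which is shown here. The function name and the toy intensities are hypothetical, and ties are not given any special treatment.

```python
def quantile_normalize(arrays):
    """Give every array (a list of probe intensities) the same distribution.

    The k-th smallest value of each array is replaced by the mean of the
    k-th smallest values across all arrays.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(a[k] for a in sorted_arrays) / len(arrays) for k in range(n_probes)]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized

# Three hypothetical arrays, four probes each (no tied values).
arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(arrays)
for row in normalized:
    print([round(v, 3) for v in row])
```

After normalization every array contains the same set of values, so between-array differences in overall intensity no longer dominate downstream comparisons; the values would then be log2-transformed before analysis.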
Expression values for genes are extracted from CEL files using
Affymetrix® Transcriptome Analysis Console (TAC) software. Different
GeneChip® Arrays have been developed by the company in the last
decade. For the older GeneChip Whole Transcriptome (WT)
Arrays, log2-transformed expression values are extracted from CEL
files and normalized using Transcript Cluster Annotations and the
robust multi-array average (RMA) procedure in Expression Console
(EC) software (Affymetrix Inc.). The analyses are then performed
on log2-transformed data generated from EC. After
hierarchical clustering of the samples, which groups
genes, specimens, or both by similar expression patterns,
3.3.2 Luciferase Reporter Assay for miRNA-Target Validation
The following procedure is based on the use of the 3′ UTR sequence of
the selected mRNA cloned in the pEZX-MT01 vector, specifically
designed by and purchased from Genecopoeia. The 3′ UTR of
interest (or a deletion mutant lacking the predicted miRNA target
sequence(s)) is cloned in this vector, which contains both
Firefly and Renilla luciferase reporters.
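Because the vector carries two reporters, readings from such an assay are commonly expressed as a ratio and compared with the negative control. The following is a minimal sketch in Python, assuming that Firefly luciferase reports on the cloned 3′ UTR and Renilla serves as the internal control (this should be checked against the vector map); the function names and luminescence values are hypothetical and are not data from this chapter.

```python
from statistics import mean

def normalized_ratios(firefly, renilla):
    """Firefly signal divided by the Renilla signal from the same well."""
    return [f / r for f, r in zip(firefly, renilla)]

def fold_vs_control(sample_ratios, control_ratios):
    """Mean normalized signal of the sample relative to the negative control."""
    return mean(sample_ratios) / mean(control_ratios)

# Hypothetical triplicate readings (arbitrary luminescence units).
nc_ratios = normalized_ratios([1000.0, 1100.0, 900.0], [500.0, 550.0, 450.0])
mimic_ratios = normalized_ratios([600.0, 660.0, 540.0], [500.0, 550.0, 450.0])
fold = fold_vs_control(mimic_ratios, nc_ratios)
print(f"Relative luciferase activity (mimic vs NC): {fold:.2f}")
```

A fold value below 1 for the wild-type 3′ UTR, with no comparable drop for the deletion mutant, would be consistent with direct targeting by the miRNA.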
3′ UTR Luciferase Reporter Assay
24 h after the transfection, MM cells are collected and lysed with
200 μL of the passive lysis buffer supplied with the Dual Luciferase
Reporter Assay System (Promega). After incubation at r.t. for
10 min, the cells are centrifuged at 2,320 rcf for 5 min and the
supernatants collected. The Luciferase Assay Reagent II and the
Stop & Glo buffer (Dual-Glo Luciferase Assay Kit) are thawed at
r.t. and the content of one bottle of Luciferase Assay Reagent II is
References
1. Anderson KC (2014) Multiple myeloma. 6. Misso G et al (2013) Emerging pathways as
Hematol Oncol Clin North Am 28:xi–xii. individualized therapeutic target of multiple
doi:10.1016/j.hoc.2014.08.001 myeloma. Expert Opin Biol Ther 13(Suppl
2. Tagliaferri P et al (2012) Promises and chal- 1):S95–S109. doi:10.1517/14712598.2013.
lenges of microRNA-based treatment of multi- 807338
ple myeloma. Current Cancer Drug Targets 7. Misso G et al (2014) Mir-34: a new weapon
12:838–846 against cancer? Mol Ther Nucleic Acids 3:
3. Tassone P, Tagliaferri P (2012) Editorial: new e194. doi:10.1038/mtna.2014.47
approaches in the treatment of multiple mye- 8. Lionetti M et al (2013) Biological and clinical
loma: from target-based agents to the new era relevance of miRNA expression signatures in
of microRNAs (dedicated to the memory of primary plasma cell leukemia. Clin Cancer Res
Prof. Salvatore Venuta). Curr Cancer Drug 19:3130–3142. doi:10.1158/1078-0432.
Targets 12:741–742 CCR-12-2043
4. Rossi M et al (2013) From target therapy to 9. Lionetti M, Agnelli L, Lombardi L, Tassone P,
miRNA therapeutics of human multiple mye- Neri A (2012) MicroRNAs in the pathobiology
loma: theoretical and technological issues in of multiple myeloma. Curr Cancer Drug Tar-
the evolving scenario. Curr Drug Targets gets 12:823–837
14:1144–1149 10. Amodio N et al (2013) miR-29b induces
5. Rossi M et al (2014) MicroRNA and multiple SOCS-1 expression by promoter demethyla-
myeloma: from laboratory findings to transla- tion and negatively regulates migration of mul-
tional therapeutic approaches. Curr Pharm tiple myeloma and endothelial cells. Cell Cycle
Biotechnol 15:459–467 12:3650–3662. doi:10.4161/cc.26585
194 Maria Teresa Di Martino et al.
11. Amodio N, Di Martino MT, Neri A, Tagliaferri 18. Rossi M et al (2013) miR-29b negatively reg-
P, Tassone P (2013) Non-coding RNA: a novel ulates human osteoclastic cell differentiation
opportunity for the personalized treatment of and function: implications for the treatment
multiple myeloma. Expert Opin Biol Ther 13 of multiple myeloma-related bone disease. J
(Suppl 1):S125–S137. doi:10.1517/ Cell Physiol 228:1506–1515. doi:10.1002/
14712598.2013.796356 jcp.24306
12. Amodio N et al (2012) DNA-demethylating 19. Scognamiglio I et al (2014) Transferrin-
and anti-tumor activity of synthetic miR-29b conjugated SNALPs encapsulating 20 -O-
mimics in multiple myeloma. Oncotarget methylated miR-34a for the treatment of mul-
3:1246–1258 tiple myeloma. Biomed Res Int 2014:217365.
13. Di Martino MT et al (2014) In vivo activity of doi:10.1155/2014/217365
miR-34a mimics delivered by stable nucleic 20. Monroig PD, Chen L, Zhang S, Calin GA
acid lipid particles (SNALPs) against multiple (2014) Small molecule compounds targeting
myeloma. PloS One 9:e90005. doi:10.1371/ miRNAs for cancer therapy. Adv Drug Deliv
journal.pone.0090005 Rev. doi:10.1016/j.addr.2014.09.002
14. Di Martino MT et al (2014) In vitro and in vivo 21. Amodio N et al (2012) miR-29b sensitizes mul-
activity of a novel locked nucleic acid (LNA)- tiple myeloma cells to bortezomib-induced apo-
inhibitor-miR-221 against multiple myeloma ptosis through the activation of a feedback loop
cells. PloS One 9:e89659. doi:10.1371/jour with the transcription factor Sp1. Cell Death Dis
nal.pone.0089659 3:e436. doi:10.1038/cddis.2012.175
15. Di Martino MT et al (2012) Synthetic miR-34a 22. Di Martino MT et al (2013) In vitro and in vivo
mimics as a novel therapeutic agent for multi- anti-tumor activity of miR-221/222 inhibitors
ple myeloma: in vitro and in vivo evidence. Clin in multiple myeloma. Oncotarget 4:242–255
Cancer Res 18:6260–6270. doi:10.1158/ 23. Lionetti M et al (2009) Identification of micro-
1078-0432.CCR-12-1708 RNA expression patterns and definition of a
16. Leone E et al (2013) Targeting miR-21 inhi- microRNA/mRNA regulatory network in dis-
bits in vitro and in vivo multiple myeloma cell tinct molecular groups of multiple myeloma.
growth. Clin Cancer Res 19:2096–2106. Blood 114:e20–e26. doi:10.1182/blood-
doi:10.1158/1078-0432.CCR-12-3325 2009-08-237495
17. Leotta M et al (2014) A p53-dependent tumor 24. Livak KJ, Schmittgen TD (2001) Analysis of
suppressor network is induced by selective relative gene expression data using real-time
miR-125a-5p inhibition in multiple myeloma quantitative PCR and the 2(-Delta Delta C
cells. J Cell Physiol 229:2106–2116. doi:10. (T)) method. Methods 25:402–408. doi:10.
1002/jcp.24669 1006/meth.2001.1262
Methods in Molecular Biology (2016) 1375: 195–206
DOI 10.1007/7651_2015_245
© Springer Science+Business Media New York 2015
Published online: 26 June 2015
Abstract
Microarray analysis in glioblastomas is done using either cell lines or patient samples as starting material. A
survey of the current literature points to transcript-based microarrays and immunohistochemistry (IHC)-
based tissue microarrays as the preferred methods in cancers of neurological origin.
Microarray analysis may be carried out for various purposes including the following:
i. To correlate gene expression signatures of glioblastoma cell lines or tumors with response to chemo-
therapy (DeLay et al., Clin Cancer Res 18(10):2930–2942, 2012)
ii. To correlate gene expression patterns with biological features like proliferation or invasiveness of the
glioblastoma cells (Jiang et al., PLoS One 8(6):e66008, 2013)
iii. To discover new tumor classificatory systems based on gene expression signature, and to correlate
therapeutic response and prognosis with these signatures (Huse et al., Annu Rev Med 64(1):59–70,
2013; Verhaak et al., Cancer Cell 17(1):98–110, 2010)
While investigators can sometimes use archived tumor gene expression data available from repositories
such as the NCBI Gene Expression Omnibus to answer their questions, new arrays must often be run to
adequately answer specific questions. Here, we provide a detailed description of microarray methodologies,
how to select the appropriate methodology for a given question, and analytical strategies that can be used.
Experimental methodology for protein microarrays is outside the scope of this chapter, but basic sample
preparation techniques for transcript-based microarrays are included here.
1 Introduction
196 Kaumudi M. Bhawe and Manish K. Aghi
2 Materials
f. CGHcall.
g. CGHnormaliter (correction for intensity dependence).
h. Bead array R package (svn release 1.7.0) (8).
i. Lumi R package (release 1.1.0) for variance stabilizing and
spline normalizing (8).
2. Recount program (to correct for potential sequencing errors
during transcriptome tag sequencing) (7).
3. TagDust (7).
4. Bowtie short read aligner (to remove tags coming from mito-
chondrial RNA or rRNA) (7).
5. limma (comparing between microarrays) (7).
6. Ingenuity Pathways Knowledge Base and Analysis Software
(www.ingenuity.com) (8).
7. BLAT (Kent 2002) (14).
8. AltAnalyze (15, 16) for quantile normalization to look at differential
gene expression (14).
9. Partek Genomics Suite (http://www.partek.com/) for analysis of
the microarray data (14).
10. Significance Analysis of Microarrays (SAM) 3.0 (Stanford Uni-
versity) for statistical analyses (17).
11. Imagene 6.0 data extraction software (BioDiscovery Inc.) (17).
12. AROMA (18).
Fig. 2 A representative heatmap of gene expression obtained by microarray analysis. Shown is an unpublished
heatmap showing differentially expressed genes in a glioblastoma cell engineered to express shRNA targeting
autophagy gene ATG7
4 Notes
References
1. DeLay M, Jahangiri A, Carbonell WS, Hu YL, Tsao S, Tom MW, Paquette J, Tokuyasu TA, Aghi MK (2012) Microarray analysis verifies two distinct phenotypes of glioblastomas resistant to antiangiogenic therapy. Clin Cancer Res 18(10):2930–2942. doi:10.1158/1078-0432.ccr-11-2390
2. Jiang T, Tie X, Han S, Meng L, Wang Y, Wu A (2013) NFAT1 Is highly expressed in, and regulates the invasion of, glioblastoma multiforme cells. PLoS One 8(6):e66008. doi:10.1371/journal.pone.0066008
3. Huse JT, Holland E, DeAngelis LM (2013) Glioblastoma: molecular analysis and clinical implications. Annu Rev Med 64(1):59–70. doi:10.1146/annurev-med-100711-143028
4. Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O’Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN, Cancer Genome Atlas Research Network (2010) Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17(1):98–110. doi:10.1016/j.ccr.2009.12.020
5. Bao ZS, Zhang CB, Wang HJ, Yan W, Liu YW, Li MY, Zhang W (2013) Whole-genome mRNA expression profiling identifies functional and prognostic signatures in patients with mesenchymal glioblastoma multiforme. CNS Neurosci Ther 19(9):714–720. doi:10.1111/cns.12118
6. Tivnan A, McDonald KL (2013) Current progress for the use of miRNAs in glioblastoma treatment. Mol Neurobiol. doi:10.1007/s12035-013-8464-0
7. Engstrom PG, Tommei D, Stricker SH, Ender C, Pollard SM, Bertone P (2012) Digital transcriptome profiling of normal and glioblastoma-derived neural stem cells identifies genes associated with patient survival. Genome Med 4(10):76. doi:10.1186/gm377
8. Ernst A, Hofmann S, Ahmadi R, Becker N, Korshunov A, Engel F, Hartmann C, Felsberg J, Sabel M, Peterziel H, Durchdewald M, Hess J, Barbus S, Campos B, Starzinski-Powitz A, Unterberg A, Reifenberger G, Lichter P, Herold-Mende C, Radlwimmer B (2009) Genomic and expression profiling of glioblastoma stem cell-like spheroid cultures identifies novel tumor-relevant genes associated with survival. Clin Cancer Res 15(21):6541–6550. doi:10.1158/1078-0432.ccr-09-0695
9. Sooman L, Ekman S, Andersson C, Kultima HG, Isaksson A, Johansson F, Bergqvist M, Blomquist E, Lennartsson J, Gullbo J (2013) Synergistic interactions between camptothecin and EGFR or RAC1 inhibitors and between imatinib and Notch signaling or RAC1 inhibitors in glioblastoma cell lines. Cancer Chemother Pharmacol 72(2):329–340. doi:10.1007/s00280-013-2197-7
10. Zeeberg BR, Kohn KW, Kahn A, Larionov V, Weinstein JN, Reinhold W, Pommier Y (2012) Concordance of gene expression and functional correlation patterns across the NCI-60 cell lines and the cancer genome atlas glioblastoma samples. PLoS One 7(7):e40062. doi:10.1371/journal.pone.0040062.g001. 10.1371/journal.pone.0040062.t001. 10.1371/journal.pone.0040062.t002
11. Quann K, Gonzales DM, Mercier I, Wang C, Sotgia F, Pestell RG, Lisanti MP, Jasmin J-F (2013) Caveolin-1 is a negative regulator of tumor growth in glioblastoma and modulates chemosensitivity to temozolomide. Cell Cycle 12(10):1510–1520. doi:10.4161/cc.24497
12. Tarca ALRRDS (2006) Analysis of microarray experiments of gene expression profiling. Am J Obstet Gynaecol 192(2):15
13. Hartmann M, Roeraade J, Stoll D, Templin MF, Joos TO (2009) Protein microarrays for diagnostic assays. Anal Bioanal Chem 393(5):1407–1416. doi:10.1007/s00216-008-2379-z
14. Solomon O, Oren S, Safran M, Deshet-Unger N, Akiva P, Jacob-Hirsch J, Cesarkas K, Kabesa R, Amariglio N, Unger R, Rechavi G, Eyal E (2013) Global regulation of alternative splicing by adenosine deaminase acting on RNA (ADAR). RNA 19(5):591–604. doi:10.1261/rna.038042.112
15. Emig D, Salomonis N, Baumbach J, Lengauer T, Conklin BR, Albrecht M (2010) AltAnalyze and DomainGraph: Analyzing and visualizing exon expression data. Nucleic Acids Res 38:W755–W762
16. Salomonis N, Schlieve CR, Pereira L, Wahlquist C, Colas A, Zambon AC, Vranizan K, Spindler MJ, Pico AR, Cline MS, et al. (2010) Alternative splicing regulates mouse embryonic stem cell pluripotency and differentiation. Proc Natl Acad Sci 107:10514–10519
17. Lin Y, Zhang G, Zhang J, Gao G, Li M, Chen Y, Wang J, Li G, Song S-W, Qiu X, Wang Y, Jiang T (2013) A panel of four cytokines predicts the prognosis of patients with malignant gliomas. J Neuro-Oncol 114(2):199–208. doi:10.1007/s11060-013-1171-x
18. Godoy PR, Mello SS, Magalhaes DA, Donaires FS, Nicolucci P, Donadi EA, Passos GA, Sakamoto-Hojo ET (2013) Ionizing radiation-induced gene expression changes in TP53 proficient and deficient glioblastoma cell lines. Mutat Res 756(1–2):46–55. doi:10.1016/j.mrgentox.2013.06.010
19. Matson RS, Wadia PP, Miklos DB, Song Y, Wang D, Yamada M, Martinsky T (2009) Microarray methods and protocols. CRC Press, Boca Raton, FL
20. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545–15550. doi:10.1073/pnas.0506580102
21. Doerks T, Copley RR, Schultz J, Ponting CP, Bork P (2002) Systematic identification of novel protein domain families associated with nuclear functions. Genome Res 12(1):47–56, 10.1101/
Methods in Molecular Biology (2016) 1375: 207–221
DOI 10.1007/7651_2015_247
© Springer Science+Business Media New York 2015
Published online: 14 May 2015
Abstract
microRNAs are a subclass of noncoding RNAs which have been demonstrated to play pivotal roles in
multiple cellular mechanisms. microRNAs are small RNA molecules of 22–24 nt in length capable of
modulating protein translation and/or RNA stability by base-pairing with complementary sequences of
the mRNAs, normally at the 3′ untranslated region. To date, over 2,000 microRNAs have been
identified in humans, and orthologous microRNAs have also been identified in animals and plants
across a wide range of species. High-throughput analyses by microarrays have become a gold standard for
analyzing changes in microRNA expression in normal and pathological cellular or tissue conditions. In
this chapter, we provide insights into the usage of this emerging technology in the context of cardiac
development and disease.
1 Background
208 Diego Franco et al.
2.1 Isolation of RNA for microRNA Microarrays
Purification and preparation of total RNA that includes small RNAs
(<200 nt) from a biological sample is the first critical step for a
successful expression profiling analysis of microRNAs. Therefore,
the method used for RNA sample preparation is critical to the
success of the experiment. An important limitation is that naked
RNA is extremely susceptible to degradation by endogenous ribonucleases
(RNases) that are present in all living cells. Thus, the key
to successful isolation of high-quality RNA is to ensure that neither
endogenous nor exogenous RNases are introduced during the
extraction procedure. We normally use a TRIzol-based isolation
protocol to isolate total RNA, without any special requirements for
small RNA enrichment, as detailed below.
2.1.2 RNA Precipitation
Measure the volume of the aqueous phase and add isopropyl alcohol
at a 1:1 proportion. Incubate samples at room temperature
(20–30 °C) for 10 min and centrifuge at 6,300 × g for 15 min at
4 °C. Remove the supernatant completely. The RNA precipitate,
often invisible before centrifugation, forms a pellet on the side and
at the bottom of the tube.
2.1.3 RNA Wash
Wash the RNA pellet once by adding 0.2 ml of filter-sterilized 75 %
ethanol and centrifuge at 5,500 × g for 5 min at 4 °C. Remove all
leftover ethanol. It is important to avoid completely drying the RNA
pellet, as this will greatly decrease its solubility. Redissolve the RNA
pellet in 35–50 μl of RNase-free Milli-Q water by passing the solution
a few times through a pipette tip. Measure the samples in a NanoDrop
2000c and keep them in the freezer (−80 °C) until further use.
Component                                                  Individual reaction
Total RNA                                                  10–50 μg
10× Incubation Buffer                                      5 μl
DNase I recombinant, RNase-free                            2.5–10 units
Optionally: RNaseOUT Recombinant Ribonuclease Inhibitor    10 units
Milli-Q water, RNase-free                                  Up to 50 μl
Incubate at 25 to 37 °C for 15–20 min.
2. Stop the reaction by adding 2 μl of 0.2 M EDTA (pH 8.0) to a
final concentration of 8 mM and heating to 75 °C for 10 min.
The concentration of EDTA has to be taken into account for all
subsequent applications.
3. Add 50 μl of Phenol RNA (pH 4.7) and 50 μl of chloroform
and vortex samples vigorously for 15 s. Centrifuge the samples
at 16,000 × g for 10 min at 4 °C.
4. Following centrifugation, the mixture separates into a lower
phenol phase and an upper aqueous phase. RNA remains exclusively
in the aqueous phase. Transfer the upper aqueous phase
carefully into a fresh tube.
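The EDTA concentration quoted in step 2 follows from a simple dilution, which matters because the carried-over EDTA has to be accounted for downstream. The following is a minimal sketch in Python; the function name is hypothetical, and the inputs are the volumes stated in the protocol (2 μl of 0.2 M EDTA added to a 50 μl reaction).

```python
def final_concentration_mM(stock_mM, stock_ul, reaction_ul):
    """Concentration after adding stock_ul of stock to reaction_ul of reaction."""
    return stock_mM * stock_ul / (stock_ul + reaction_ul)

edta_mM = final_concentration_mM(stock_mM=200.0, stock_ul=2.0, reaction_ul=50.0)
print(f"Final EDTA: {edta_mM:.1f} mM")
```

The result is about 7.7 mM, which the protocol rounds to 8 mM.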
2.4 Protocol Tips
The following precautions should be taken to prevent RNase contamination
and degradation of the RNA sample and reagents:
• Always use gloves.
• Use nuclease-free, low nucleic acid binding plasticware and
filter barrier pipette tips.
• Keep tubes capped whenever possible.
• Make sure your equipment and solutions are RNase-free.
Analysis of microRNA Microarrays in Cardiogenesis 211
5 microRNA Validation
5.2 qPCR Protocol
Although Exiqon has optimized qPCR experiments using the
miRCURY LNA™ ExiLENT SYBR® Green master mix, in our experience
LNA™ PCR primer sets can be used with other
SYBR® Green master mixes, such as GoTaq® qPCR Master Mix
(Promega), with similar results. Whichever master mix is used,
proceed as follows:
1. Place cDNA (from the previous step), nuclease-free water, and
PCR Master mix on ice and thaw for 15–20 min. Protect the
PCR Master mix vials from light. Immediately before use, mix
the PCR Master mix by pipetting up and down. The rest of the
reagents are mixed by vortexing and spun down.
2. When multiple real-time PCR reactions are performed with the
same microRNA primer set, it is recommended to prepare a
primer master mix working solution of the PCR primers and
the PCR Master mix as follows:
[Figure: signaling pathway diagram centered on Ca2+ (map label 04020, 4/15/14)]
References
1. Kelly RG, Buckingham ME (2002) The 13. Espinoza-Lewis RA, Wang DZ (2012)
anterior heart-forming field: voyage to the MicroRNAs in heart development. Curr Top
arterial pole of the heart. Trends Genet Dev Biol 100:279–317
18:210–216 14. Bonet F, Hernandez-Torres F, Esteban FJ,
2. Moorman AF, Christoffels VM, Anderson RH, Aranega A, Franco D (2013) Comparative ana-
van den Hoff MJ (2007) The heart-forming lyses of microRNA microarrays during cardio-
fields: one or multiple? Philos Trans R Soc genesis: functional perspectives. Microarrays
Lond B Biol Sci 362:1257–1265 2:81–96. doi:10.3390/microarrays2020081
3. Kelly RG (2012) The second heart field. Curr 15. Bonet F, Hernandez-Torres F, Franco D
Top Dev Biol 100:33–65 (2014) Towards the therapeutic usage of
4. López-Sánchez C, Garcı́a-Martı́nez V (2011) microRNAs in cardiac disease and regenera-
Molecular determinants of cardiac specifica- tion. Exp Clin Cardiol 20:720–756
tion. Cardiovasc Res 91:185–195 16. Callari M, Dugo M, Musella V, Marchesi E,
5. de Castro Mdel P, Acosta L, Domı́nguez JN, Chiorino G, Grand MM, Pierotti MA, Dai-
Aránega A, Franco D (2003) Molecular diver- done MG, Canevari S, De Cecco L (2012)
sity of the developing and adult myocardium: Comparison of microarray platforms for mea-
implications for tissue targeting. Curr Drug suring differential microRNA expression in
Targets Cardiovasc Haematol Disord paired normal/cancer colon tissues. PLoS
3:227–239 One 7(9):e45105
6. Campione M, Ros MA, Icardo JM, Piedra E, 17. Meyer SU, Pfaffl MW, Ulbrich SE (2010) Nor-
Christoffels VM, Schweickert A, Blum M, malization strategies for microRNA profiling
Franco D, Moorman AF (2001) Pitx2 expres- experiments: a “normal” way to a hidden layer
sion defines a left cardiac lineage of cells: evi- of complexity? Biotechnol Lett 32
dence for atrial and ventricular molecular (12):1777–1788
isomerism in the iv/iv mice. Dev Biol 231 18. Der SD, Zhou A, Williams BR, Silverman RH
(1):252–264 (1998) Identification of genes differentially
7. Franco D, Campione M, Kelly R, Zammit PS, regulated by interferon alpha, beta, or gamma
Buckingham M, Lamers WH, Moorman AF using oligonucleotide arrays. Proc Natl Acad
(2000) Multiple transcriptional domains, with Sci U S A 95:15623–15628
distinct left and right components, in the atrial 19. Eisen M, Brown P (1999) DNA arrays for
chambers of the developing heart. Circ Res 87 analysis of gene expression. Meth Enzymol
(11):984–991 303:179–205
8. Franco D, Lamers WH, Moorman AF (1998) 20. Winzeler EA, Schena M, Davis RW (1999)
Patterns of expression in the developing myo- Fluorescence-based expression monitoring
cardium: towards a morphologically integrated using microarrays. Meth Enzymol 306:3–18
transcriptional model. Cardiovasc Res 21. Livak K, Schmittgen T (2001) Analysis of rela-
38:25–53 tive gene expression data using real-time quan-
9. Chinchilla A, Franco D (2006) Regulatory titative PCR and the 2ΔΔCT method.
mechanisms of cardiac development and repair. Methods 25:402–408
Cardiovasc Hematol Disord Drug Targets 22. Cheng Y, Ji R, Yue J, Yang J, Liu X, Chen H,
6:101–112 Dean DB, Zhang C (2007) MicroRNAs are
10. Franco D, Chinchilla A, Aránega AE (2012) aberrantly expressed in hypertrophic heart: do
Transgenic insights linking pitx2 and atrial they play a role in cardiac hypertrophy? Am J
arrhythmias. Front Physiol 3:206 Pathol 170(6):1831–1840
11. Zhao Y, Ransom JF, Li A, Vedantham V, von 23. Wang Y, Weng T, Gou D, Chen Z, Chintagari
Drehle M, Muth AN, Tsuchihashi T, McManus NR, Liu L (2007) Identification of rat lung-
MT, Schwartz RJ, Srivastava D (2007) Dysre- specific microRNAs by microRNA microarray:
gulation of cardiogenesis, cardiac conduction, valuable discoveries for the facilitation of lung
and cell cycle in mice lacking miRNA-1-2. Cell research. BMC Genomics 8:29
129:303–317 24. Wang J, Xu R, Lin F, Zhang S, Zhang G, Hu S,
12. Fish JE, Santoro MM, Morton SU, Yu S, Yeh Zheng Z (2009) MicroRNA: novel regulators
RF, Wythe JD, Ivey KN, Bruneau BG, Stainier involved in the remodeling and reverse remo-
DY, Srivastava D (2008) miR-126 regulates deling of the heart. Cardiology 113(2):81–88
angiogenic signaling and vascular integrity. 25. Matkovich SJ, Van Booven DJ, Youker KA,
Dev Cell 15:272–284 Torre-Amione G, Diwan A, Eschenbacher
Analysis of microRNA Microarrays in Cardiogenesis 221
WH, Dorn LE, Watson MA, Margulies KB, 31. Letonqueze O, Lee J, Vasudevan S (2012)
Dorn GW 2nd (2009) Reciprocal regulation MicroRNA-mediated posttranscriptional
of myocardial microRNAs and messenger mechanisms of gene expression in proliferating
RNA in human cardiomyopathy and reversal and quiescent cancer cells. RNA Biol 9
of the microRNA signature by biomechanical (6):871–880
support. Circulation 119(9):1263–1271 32. Lee S, Vasudevan S (2013) Post-transcriptional
26. Naga Prasad SV, Duan ZH, Gupta MK, Sur- stimulation of gene expression by microRNAs.
ampudi VS, Volinia S, Calin GA, Liu CG, Kot- Adv Exp Med Biol 768:97–126
wal A, Moravec CS, Starling RC, Perez DM, 33. Papadopoulos GL, Reczko M, Simossis VA,
Sen S, Wu Q, Plow EF, Croce CM, Karnik S Sethupathy P, Hatzigeorgiou AG (2009) The
(2009) Unique microRNA profile in end-stage database of experimentally supported targets: a
heart failure indicates alterations in specific car- functional update of TarBase. Nucleic Acids
diovascular signaling networks. J Biol Chem Res 37:D155–D158
284(40):27487–27499 34. Fiedler J, Jazbutyte V, Kirchmaier BC,
27. Chinchilla A, Lozano E, Daimi H, Esteban FJ, Gupta SK, Lorenzen J, Hartmann D,
Crist C, Aranega AE, Franco D (2011) Micro- Galuppo P, Kneitz S, Pena JT, Sohn-Lee C
RNA profiling during mouse ventricular matu- et al (2011) MicroRNA-24 regulates vascular-
ration: a role for miR-27 modulating Mef2c ity after myocardial infarction. Circulation
expression. Cardiovasc Res 89(1):98–108 124:720–730
28. Condorelli G, Latronico MV (2014) Cavar- 35. Mayorga ME, Penn MS (2012) miR-145 is
retta E. microRNAs in cardiovascular diseases: differentially regulated by TGF-β1 and ischae-
current knowledge and the road ahead. J Am mia and targets disabled-2 expression and wnt/
Coll Cardiol 63(21):2177–2187 β-catenin activity. J Cell Mol Med
29. Vasudevan S (2012) Posttranscriptional upre- 16:1106–1113
gulation by microRNAs. Wiley Interdiscip Rev 36. Li DF, Tian J, Guo X, Huang LM, Xu Y, Wang
RNA 3(3):311–330 CC, Wang JF, Ren AJ, Yuan WJ, Lin L (2013)
30. Steitz JA, Vasudevan S (2009) miRNPs: versa- Induction of microRNA-24 by HIF-1 protects
tile regulators of gene expression in vertebrate against ischemic injury in rat cardiomyocytes.
cells. Biochem Soc Trans 37(Pt 5):931–935 Physiol Res 61:555–565
Methods in Molecular Biology (2016) 1375: 223
DOI 10.1007/7651_2015_256
© Springer Science+Business Media New York 2015
Published online: 27 November 2015
There is an error in the given name and family name of the author Liliana López Kleine.
The correct name is Liliana López-Kleine (given name: Liliana; family name:
López-Kleine).