Computational Methods With Applications in Bioinformatics Analysis
COMPUTATIONAL METHODS
WITH APPLICATIONS IN
BIOINFORMATICS ANALYSIS
Editors
Jeffrey J. P. Tsai
Ka-Lok Ng
Asia University, Taiwan
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy
is not required from the publisher.
Printed in Singapore
Preface
May 23, 2017 15:11 Computational Methods with Applications. . . 9in x 6in b2826-fm page vi
Acknowledgment
We are grateful to all the authors for their efforts and involvement in
producing this book. This book would not have been possible without the
financial support of the Ministry of Science and Technology of Taiwan
(MOST) under contract number MOST 105-2632-E-468-002, and the
support of Asia University. Finally, we would like to thank the editorial and
production staff at World Scientific Publishing, in particular Steven Patt,
Herbert Moses and Rajesh Babu, for making this book possible.
Ka-Lok Ng received his PhD degree in physics from Vanderbilt
University, USA, in 1990. He has been a professor at the Department of
Biomedical Informatics, Asia University, since August 2008. Since December
2009, he has served on the editorial boards of several scientific journals.
Dr. Ng has published articles in highly ranked journals in the areas of protein
interactions, robustness of protein interaction networks and microRNA.
List of Contributors
Charles C. N. Wang received his M.S. and PhD degrees in bioinformatics
from Asia University, Taichung, Taiwan in 2008 and 2015, respectively.
He is currently a postdoctoral fellow with the Department of
Bioinformatics, Asia University, Taichung, Taiwan. He is currently active
of IEEE since 2007. He has funded research and published articles in the
areas of Multivariate Analysis, Fuzzy Measure and Integral, Educational
Measurement and Statistics, generalized Meta-analysis, Information Man-
agement, Machine Learning, and data mining.
Pei-Chun Chang received his PhD degree in Chemistry from the National
Taiwan University, Taiwan in 1998. He is currently an associate
professor with the Department of Bioinformatics and Medical Engineering
at Asia University, where he conducts research in the areas
of biomedical informatics and computational chemistry. His research interests
include the discovery of cancer biomarkers by genomics analysis, the
screening of anticancer compounds from Chinese herb components, the
development of disease models with systems biology, and medical data mining.
Pei-Lin Chen received her Master’s Degree in Computer Science and In-
formation Engineering in 2014 from the National University of Tainan,
Phillip C.-Y. Sheu received his PhD degree in electrical engineering and
computer science from the University of California, Berkeley in 1986. He is
a professor of EECS, CS, and BME at the University of California, Irvine,
and a guest professor with the Department of Biotechnology and Bioin-
formatics, Asia University, Taiwan. Dr. Sheu is a Fellow of IEEE. He is
currently active in research related to semantic computing, robotic com-
puting, biomedical computing and multimedia computing.
received his B.S. degree in Mechanical Engineering from National Sun
Yat-sen University in Kaohsiung City, Taiwan. He completed M.S. and PhD
degrees at the Institute of Biomedical Engineering at National Cheng Kung
University in Tainan City, Taiwan. He was a visiting scholar at the
University of Washington, Seattle, USA, from August 2005 to March 2006.
His current research interests include biosensing technologies, aptasensors,
simulation of protein-nucleic acid interaction and biomechanics.
Contents
Preface
Acknowledgment
About the Authors
List of Contributors
Index
Computational Methods with Applications in Bioinformatics Analysis 9in x 6in b2826-ch01
Chapter 1
1.1 Introduction
genetic algorithm [17] was used to determine the number of clusters and
identify significant multiclass membership (SiMM) genes. In the second
stage, a multiobjective genetic algorithm [1] was used to cluster the
genes after SiMM genes were filtered. Finally, the SiMM genes were
assigned to one of the clusters defined in the second stage based on the
nearest neighbor criterion.
1.2.2.2 Hierarchical
1.2.2.3 SOM
1.2.2.4 CRC
1.3 Methods
This section details the methods and introduces the proposed data
processing algorithm, which includes the preprocessing of the time series
gene expression data, spectrum processing, singular value
decomposition, and autoregressive modeling.
Fig. 1.1. Flowchart for the proposed data processing algorithm for clustering of time
series gene-expression data based on spectrum processing and autoregressive modeling.
First, suppose that each gene g has N time points. The time series gene
expression can be expressed as
$$\mathbf{x}_g = [\,x_g(0),\; x_g(1),\; \ldots,\; x_g(N-1)\,]^{T} \qquad (1.1)$$
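where $x(t), x(t-1), \ldots, x(t-N+1)$ is a time series of length N, $a_1, \ldots, a_p$
is the autoregressive coefficient, and $e(t)$ is white noise. In addition,
$\theta = [a_1, a_2, \ldots, a_p]$ is defined as the parameter vector, and $\varepsilon(t, \theta)$ is the
estimated error:
$$\varepsilon(t, \theta) = x(t) - \hat{x}(t \mid \theta) \qquad (1.4)$$
$$\hat{x}(t \mid \theta) = \varphi^{T}(t)\,\theta \qquad (1.5)$$
$$\varepsilon(t, \theta) = x(t) - \varphi^{T}(t)\,\theta \qquad (1.7)$$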
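The prediction error in Eqs. (1.4) to (1.7) can be illustrated with a short sketch. It assumes the common autoregressive convention in which the regression vector holds the p most recent past values, [x(t-1), ..., x(t-p)], and the one-step prediction is its inner product with the parameter vector [a_1, ..., a_p]. The function name and the toy series are hypothetical and are not taken from the chapter.

```python
def ar_prediction_errors(x, theta):
    """One-step prediction errors x(t) - x_hat(t | theta) of an AR(p) model.

    Assumption (illustrative): the regression vector at time t is
    [x(t-1), ..., x(t-p)] and x_hat(t | theta) is its inner product with theta.
    """
    p = len(theta)
    errors = []
    for t in range(p, len(x)):
        regressors = [x[t - k] for k in range(1, p + 1)]    # x(t-1) ... x(t-p)
        x_hat = sum(a * v for a, v in zip(theta, regressors))
        errors.append(x[t] - x_hat)
    return errors


# Toy series that doubles at every step, so an AR(1) model with a_1 = 2 fits exactly.
series = [1.0, 2.0, 4.0, 8.0, 16.0]
exact_fit = ar_prediction_errors(series, [2.0])
poor_fit = ar_prediction_errors(series, [1.0])
```

In practice the coefficients would be chosen to make these errors small, for example by least squares; that estimation step is not shown here.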
The silhouette index evaluates cluster validity using the distance between
points in various clusters and the distance between all points in all
clusters. This study used Pearson's correlation to determine the
distance between points. The silhouette index is defined as the average
silhouette width of all points (genes) in all clusters. In Fig. 1.2, which
shows a schematic diagram of silhouette width, A, B, and C are three
$$s(i) = \frac{b - a}{\max(a, b)} \qquad (1.9)$$
The value of s(i) can range from –1 to 1; values close to 1 imply that i
has been assigned to an appropriate cluster.
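As a rough illustration of Eq. (1.9), the sketch below computes silhouette widths with a correlation-based distance. It assumes the standard definitions from [20]: a is the mean distance from gene i to the other genes in its own cluster, and b is the smallest mean distance from gene i to the genes of any other cluster. The distance 1 - r (with r being Pearson's correlation), the function names, and the toy profiles are assumptions made for illustration; the sketch also requires at least two clusters and non-constant profiles.

```python
import math


def pearson_distance(x, y):
    """Distance 1 - r, where r is Pearson's correlation (assumed convention)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


def silhouette_widths(profiles, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for every profile."""
    widths = []
    for i, (profile_i, label_i) in enumerate(zip(profiles, labels)):
        dists = {}  # cluster label -> distances from profile i to its members
        for j, (profile_j, label_j) in enumerate(zip(profiles, labels)):
            if j != i:
                dists.setdefault(label_j, []).append(
                    pearson_distance(profile_i, profile_j))
        if label_i not in dists:       # singleton cluster: use s(i) = 0
            widths.append(0.0)
            continue
        a = sum(dists[label_i]) / len(dists[label_i])
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != label_i)
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def silhouette_index(profiles, labels):
    """Average silhouette width over all profiles."""
    widths = silhouette_widths(profiles, labels)
    return sum(widths) / len(widths)


# Two rising and two falling toy profiles.
toy = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4, 1.5]]
good = silhouette_index(toy, [0, 0, 1, 1])   # rising vs. falling
bad = silhouette_index(toy, [0, 1, 0, 1])    # mixed clusters
```

Grouping the rising profiles together gives an index close to 1, whereas mixing rising and falling profiles in the same cluster gives a negative index.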
This study used the same experimental data as [2], [9], and [21]. A total
of six sets of time series gene expression data were tested, including one
set of synthetic data (named AD400_10_10) and five sets of real-life
experimental biological data. The five sets of biological data included
one set of data each from human fibroblast serum, rat central nervous
system (CNS), yeast sporulation, and two sets of data from
Saccharomyces cerevisiae; all data sets are available for download
online. Following previous studies in the treatment of data sets with large
numbers of genes (i.e., yeast sporulation, S. cerevisiae [7], and
S. cerevisiae [22]), only genes with time series gene expression values
that varied significantly were used in this study; thus, the regulatory
relationships between two genes were readily apparent [27]. Genes whose
time series expression values did not vary significantly, or whose
expression did not differ significantly, were filtered out. The variance
and root mean square were calculated for each time series gene expression
record. A gene was retained only if its variance ≥ AVG1 and its root mean
square ≥ AVG2, where AVG1 is the average variance of all genes and AVG2
is the average root mean square of all genes. Only genes meeting both
criteria were included in the data sets for further analysis. The six sets
of time series gene expression data are described as follows.
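The filtering rule above (variance ≥ AVG1 and root mean square ≥ AVG2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the population variance is meant, and the function name and toy gene names are hypothetical.

```python
import math
from statistics import mean, pvariance


def filter_genes(expression):
    """Keep genes whose variance and RMS both reach the across-gene averages.

    expression: dict mapping gene name -> list of expression values over time.
    Assumption: population variance (pvariance); the text does not say which
    variance estimator was used.
    """
    variances = {g: pvariance(x) for g, x in expression.items()}
    rms = {g: math.sqrt(mean([v * v for v in x])) for g, x in expression.items()}
    avg1 = mean(list(variances.values()))   # AVG1: average variance of all genes
    avg2 = mean(list(rms.values()))         # AVG2: average RMS of all genes
    return [g for g in expression if variances[g] >= avg1 and rms[g] >= avg2]


# Hypothetical toy data: only the strongly varying gene should survive.
toy_expression = {
    "flat": [1.0, 1.0, 1.0, 1.0],
    "wavy": [-3.0, 3.0, -3.0, 3.0],
    "mild": [0.9, 1.1, 0.9, 1.1],
}
kept = filter_genes(toy_expression)
```

Because both thresholds are averages over all genes, the rule is relative to the data set: the same gene may pass in one data set and fail in another.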
This time series gene expression data set contains synthetic data,
comprising 400 genes measured at 10 time points. The data set contains
10 clusters, and each cluster contains 40 genes; in other words, the data
contain 10 clearly distinct time series gene expression patterns [28].
This real-life experimental biological data set contains the raw data of
6118 genes measured at 7 time points. Prior to cluster analysis, the
previously described method was used to filter out the genes whose
expression values did not vary significantly. This filtering resulted
in a subset of 844 time series gene expression records that were used in
this experiment.
This real-life biological data set contains 112 genes measured at 9 time
points.
b http://www.genome.ad.jp/kegg/pathway/sce/sce04111.html
cell cycle phases G1, S, G2, and M; therefore, the number of clusters for
these two data sets was four.
Table 1.1. Median silhouette index values obtained after 11 executions of the clustering
algorithms. Values marked with a (*) are higher than the value obtained by the proposed
data processing method. The bottom values in parentheses indicate the p values calculated
by comparing the specified method and the proposed method (combining both spectrum
processing and autoregressive modeling).
Table 1.2. Number of times that the SIC1 and CLB2 genes were assigned to the same
cluster using the S. cerevisiae experimental data of Spellman et al. The number of
clusters was set to four. Three algorithms using four data processing methods were
executed 11 times each.
                Data processing
Algorithm       Original data   Data with spectrum   Data with AR    The proposed
                                processing [30]      modeling [9]    method
K-means         0               11                   9               11
Hierarchical    0               11                   11              11
SOM             0               11                   0               11
In this section, we analyze the clustering of the SIC1 and CLB2 genes in
the S. cerevisiae data set of Spellman et al. based on the S. cerevisiae cell
cycle genetic regulatory pathway (KEGG) to explore the clustering
results of these two genes when the proposed data processing method
and the other three methods were applied. SIC1 and CLB2 genes are
expressed in the G2 stage of the cell cycle [18]. The four data processing
methods were applied to the three clustering algorithms, with the number
of clusters set to four. Each combination of method and algorithm was
executed 11 times. The number of times that SIC1 and CLB2 were
assigned to the same cluster is recorded in Table 1.2. With the proposed
data processing method, SIC1 and CLB2 were assigned to the same cluster
in all 11 executions of each of the three clustering algorithms, signifying
that the proposed data processing method could effectively express
correlations based on biological significance.
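The counts in Table 1.2 amount to a simple co-assignment tally over repeated clustering runs. The sketch below shows one way to compute such a tally; the function name and the cluster labels in the example are hypothetical and do not reproduce the reported results.

```python
def count_co_clustered(runs, gene_a, gene_b):
    """Number of runs in which two genes received the same cluster label.

    runs: list of dicts, one per execution, mapping gene name -> cluster label.
    """
    return sum(1 for labels in runs if labels[gene_a] == labels[gene_b])


# Hypothetical labels from three executions of a clustering algorithm.
example_runs = [
    {"SIC1": 2, "CLB2": 2},
    {"SIC1": 0, "CLB2": 0},
    {"SIC1": 1, "CLB2": 3},
]
together = count_co_clustered(example_runs, "SIC1", "CLB2")
```

Only equality of labels within a run matters, so the tally is unaffected by the arbitrary numbering of clusters from one execution to the next.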
Acknowledgments
This work was supported in part by the Ministry of Science and
Technology, Taiwan [Grant Numbers: NSC 100-2221-E-024-020 and
MOST 103-2221-E-024-014].
References
3. Bracewell, R. (1999). The Fourier Transform and Its Applications, 3rd Ed.
(McGraw Hill, USA).
4. Bretscher, O. (1997). Linear Algebra with Applications, (Prentice Hall, USA).
5. Chen, J. J. W. (2000). Introduction and application of DNA microarrays: A
formidable weapon for genetic analysis in the 21st century, NTU BioMed Bulletin,
2, pp. 18–25 (in Chinese).
6. Chiu, T. Y., Hsu, T. C., Yen, C. C. and Wang, J. S. (2015). Interpolation based
consensus clustering for gene expression time series, BMC Bioinformatics, 16, pp.
117–133.
7. Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka,
L., Wolfsberg, T. G., Gabrielian, A. E., Landsman, D., Lockhart, D. J. and Davis, R.
W. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle,
Molecular Cell, 2, pp. 65–73.
8. Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O. and
Herskowitz, I. (1998). The transcriptional program of sporulation in budding yeast,
Science, 282, pp. 699–705.
9. Darvish, A., Hakimzadeh, R. and Najarian, K. (2004). Discovering dynamic
regulatory pathway by applying an auto regressive model to time series DNA
microarray data, Proc. 26th Annual International Conference of the IEEE
Engineering in Medicine and Biology Society, IEMBS, pp. 2941–2944.
10. Darvish, A., Najarian, K., Jeong, D. H. and Ribarsky, W. (2005). System
identification and nonlinear factor analysis for discovery and visualization of
dynamic gene regulatory pathways, Proc. IEEE Symposium on Computational
Intelligence in Bioinformatics and Computational Biology, CIBCB, pp. 1–6.
11. DeRisi, J., Penland, L., Brown, P. O., Bittner, M. L., Meltzer, P. S., Ray, M., Chen,
Y., Su, Y. A. and Trent, J. M. (1996). Use of a cDNA microarray to analyze gene
expression patterns in human cancer, Nature Genetics, 14, pp. 457–460.
12. Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster analysis
and display of genome-wide expression patterns, Proc. National Academy of
Sciences of the United States of America, 95, pp. 14863–14868.
13. Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods, 2nd Ed.
(Wiley, USA).
14. Jain, A. K. (2010). Data clustering: 50 years beyond K-means, Pattern Recognition
Letters, 31, pp. 651–666.
15. Jaskowiak, P. A., Campello, R. J. and Costa, I. G. (2014). On the selection of
appropriate distances for gene expression data clustering, BMC Bioinformatics, 15,
suppl. 2, pp. S2–S18.
16. Lomb, N. R. (1976). Least-squares frequency analysis of unequally spaced data,
Astrophysics and Space Science, 39, pp. 447–462.
17. Maulik, U. and Bandyopadhyay, S. (2002). Performance evaluation of some
clustering algorithms and validity indices, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 24, pp. 1650–1654.
18. Noguchi, E. and Gadaleta, M. C. (2014). Cell Cycle Control: Mechanisms and
Protocols, (Humana Press, USA).
19. Qin, Z. S. (2006). Clustering microarray gene expression data using weighted
Chinese restaurant process, Bioinformatics, 22, pp. 1988–1997.
20. Rousseeuw, P. (1987). Silhouettes: a graphical aid to the interpretation and
validation of cluster analysis, Journal of Computational and Applied Mathematics,
20, pp. 53–65.
21. Scargle, J. D. (1982). Studies in astronomical time series analysis. II — Statistical
aspects of spectral analysis of unevenly spaced data, Astrophysical Journal, Part 1,
263, pp. 835–853.
22. Spellman, P. T., Sherlock, G. and Zhang, M. Q. (1998). Comprehensive
identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by
microarray hybridization, Molecular Biology of the Cell, 9, pp. 3273–3297.
23. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E.,
Lander, E. S. and Golub, T. R. (1999). Interpreting patterns of gene expression with
self-organizing maps: methods and application to hematopoietic differentiation,
Proc. of the National Academy of Sciences of the United States of America, 96, pp.
2907–2912.
24. Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J. and Church, G. M. (1999).
Systematic determination of genetic network architecture, Nature Genetics, 22, pp.
281–285.
25. Walpole, R. E., Myers, R. H., Myers, S. L. and Ye, K. E. (2011). Probability &
Statistics for Engineers & Scientists, (Pearson, USA).
26. Wen, X., Fuhrman, S., Michaels, G. S., Carr, D. B., Smith, S., Barker, J. L. and
Somogyi, R. (1998). Large-scale time series gene expression mapping of central
nervous system development, Proc. National Academy of Sciences of the United
States of America, 95, pp. 334–339.
27. Xu, H. L., Liu, Y. H. and Wand, S. T. (2008). Autoregressive-model based dynamic
fuzzy clustering for time-course gene expression data, Biotechnology, 7, pp. 59–65.
28. Yeung, K. Y., Haynor, D. R. and Ruzzo, W. L. (2001). Validating clustering for
gene expression data, Bioinformatics, 17, pp. 309–318.
29. Yeung, K. Y. and Ruzzo, W. L. (2001). An empirical study of principal component
analysis for clustering gene expression data, Bioinformatics, 17, pp. 763–774.
30. Zhao, W., Serpedin, E. and Dougherty, E. R. (2009). Spectral preprocessing for
clustering time-series gene expressions, EURASIP Journal on Bioinformatics and
Systems Biology, Article ID 713248, 10 pages.
31. Zhou, X., Wang, X., Dougherty, E. R., Russ, D. and Suh, E. (2004). Gene clustering
based on cluster-wide mutual information, Journal of Computational Biology, 11,
pp. 147–161.
Computational Methods with Applications in Bioinformatics Analysis 9in x 6in b2826-ch02
Chapter 2
2.1 Introduction
a An earlier version of this study in Chinese was presented at The 2013 National
Computer Symposium (Domestic Poster Track), Taiwan, Dec. 13-14, 2013.
*Corresponding author.
2.2 Preliminaries
clustering methods. First, time series gene expression data are introduced
as the primary targets of analysis in this study. Second, the two major
types of clustering methods, supervised and unsupervised clustering, are
described. Finally, gene functions and GO terms, which were used as the
bases for gene clustering, are discussed.
respective groups. HTHs were used as the control group because their
time series gene expression patterns were less similar to one
another. These genes could be divided into different groups through
unsupervised clustering. This study incorporated a total of 227 time
series gene expression profiles covering all six types of functions.
The time series of two of the gene functions were illustrated in line
charts to show their patterns.
2.3 Methods
Deviation, conversion, and noise may cause similar time series gene
expression data to become different. Where possible, these problems are
averted during the microarray gene expression tests and the conversion of
image data into values. In addition, missing values are the primary focus
of data preprocessing when gene expression data are downloaded from the
Internet. Solutions to missing values adopted by numerous studies
include replacing them directly with zeroes; identifying the n most
similar time series (compared at the time points where no values are
missing) and replacing each missing value with the mean of these n time
series at the corresponding time point; and interpolation [19, 22].
Because existing missing-value estimation methods introduce errors of
unknown size, this study did not process the missing values: the gene
expression data downloaded from the yeast gene database were applied
directly in the experiment to prevent potential errors in the subsequent
result comparisons. In some studies, genes without significant changes in
their expression values were removed [4]. Although such a method may
improve the experimental result, this type of gene was initially included
in the experiment of the present study.
[Flow chart; recoverable box labels: time series gene expression data with known GO terms; time series gene expression data with tested genes; classifiers (Training Model 1, Training Model 2); gene classification results.]
Fig. 2.1. Flow chart of GO-based SVM time series gene expression data cluster analysis.
This section describes the data, goals, and methods of the three
experiments conducted in this study. The ideas behind each of the three
experimental designs are clarified as follows. The first experiment
involved comparing the effectiveness of supervised clustering with that
of unsupervised clustering. The second and third experiments differed in
the characteristics of their training samples; the results of the three
experiments are explained in Section 2.5. The cross-validation (CV)
process is introduced first, followed by the experimental designs.
sample, and the other nine subgroups are designated as training samples,
and so on. Thus, a 10-fold CV is executed. Finally, the accuracies of the
10 results are averaged to obtain the effective value of the experiment.
Generally, the number of subgroups is determined according to the
number of samples obtained for an experiment.
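The k-fold procedure described above can be sketched as follows: each subgroup serves once as the test sample while the remaining subgroups form the training sample, and the resulting accuracies are averaged. The function names, the shuffling seed, and the `evaluate` callback are assumptions introduced for illustration; any routine that trains on one list and returns an accuracy on another could be plugged in.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split sample indices into k subgroups of nearly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, k, evaluate, seed=0):
    """Average accuracy over k folds.

    evaluate(train, test) is a caller-supplied (hypothetical) function that
    trains on `train` and returns the accuracy measured on `test`.
    """
    accuracies = []
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        test = [samples[i] for i in held_out]
        train = [s for i, s in enumerate(samples) if i not in held]
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / k


# Sanity check with a dummy evaluator: training and test samples never overlap.
folds = k_fold_indices(17, 3)
no_overlap = cross_validate(
    list(range(17)), 3,
    lambda train, test: 1.0 if not set(train) & set(test) else 0.0)
```

Shuffling before splitting is a design choice made here so that the subgroups do not simply follow the order in which the samples were listed.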
This subsection introduces the gene expression and GO term data used in
this experiment, as well as the reason and method for acquiring the data,
thereby clarifying the data content of this experiment.
This experiment incorporated two types of data files. The first type was
gene association files (GAFs), which contained GO IDs, GO classes, and
gene names, and could be downloaded from the GO website. The other was
a gene expression data file. In this experiment, the gene expression
data of Saccharomyces cerevisiae were employed [23]. The dataset
comprised 6,149 time series gene expression profiles, each featuring
17 time points. This study used gene expression data downloaded from
the Saccharomyces Genome Database [28] because they have been
incorporated as experimental data in numerous studies [47, 52]. In
addition, the proportion of missing values in these data was only
0.08%, indicating that the missing values did not affect the result
significantly.
As for the GO term data, typically, more than one gene is annotated
with a specific GO ID. However, the number of corresponding genes for
each GO ID varies. An insufficient number of training samples would
affect the representativeness of the SVM models; therefore, at the
This subsection briefly explains the core content of this experiment: the
goal, method, and parameter settings of the experiment.
The k-means and hierarchical methods, two conventional
unsupervised clustering methods, are based on the time series expression
distance among genes. Ideally, genes clustered into one group share
similar or identical functions. However, genes that are close to one
another often do not share the same genetic functions. Therefore, in this
experiment, the samples were classified according to their GO IDs
through the SVMs and the two conventional unsupervised clustering
methods, and the cluster results were compared to verify whether the
classification accuracy of the SVM algorithm was superior to that of the
k-means and hierarchical methods.
To verify the effectiveness of the SVMs, a total of 10 GO ID pairs
were selected. Genes that were annotated with the GO IDs of both the
positive and negative classes were excluded, and the genes corresponding
to each GO ID were labeled as the positive or negative class. Considering
the amount of collected data and the balance between the numbers of
positive and negative samples in each GO ID pair, 105 genes were
randomly selected from each of the two classes, for a five-fold CV. Each
data group was divided into five subgroups; four were designated as
training samples, and the last was used as the test sample for verifying
the cluster results.
MATLAB was employed for all three types of algorithms.
k-means clustering was performed using the default MATLAB
parameters, and hierarchical clustering was conducted using the
hierarchical agglomerative algorithm. The distances between genes
were Euclidean, and the distances between clusters were calculated
using average linkage. Because GO ID pairs were used to verify the
effectiveness of the algorithms used in this experiment, the number of
clusters for the k-means and hierarchical methods was set to two.
SVM algorithms require more parameters to be set than the
k-means and hierarchical methods. The Gaussian radial basis function was
selected for kernel_function; its primary parameters are
rbf_sigma and the penalty parameter. After numerous tests, three
rbf_sigma values (0.5, 1, and 100) and two penalty parameter
values (0.8 and 128) were used.
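To give a feel for the effect of rbf_sigma, the sketch below evaluates a Gaussian radial basis function kernel at the three values listed above. It assumes the common form K(u, v) = exp(-||u - v||^2 / (2 sigma^2)); the function name is hypothetical, and pairing every rbf_sigma with every penalty value in `PARAM_GRID` is an assumption, since the text does not state which combinations were tested.

```python
import math


def rbf_kernel(u, v, sigma):
    """Gaussian RBF kernel, assuming K(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Settings named in the text; the full cross product is an assumption.
RBF_SIGMAS = (0.5, 1, 100)
PENALTIES = (0.8, 128)
PARAM_GRID = [(sigma, c) for sigma in RBF_SIGMAS for c in PENALTIES]

# Similarity of two toy expression profiles under each rbf_sigma.
profile_a, profile_b = [0.0, 0.0], [1.0, 1.0]
similarities = {s: rbf_kernel(profile_a, profile_b, s) for s in RBF_SIGMAS}
```

A small rbf_sigma makes the kernel value drop quickly with distance, so only very similar profiles influence one another, whereas a large rbf_sigma treats most profiles as nearly identical.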
This experiment required two types of data: gene expression data and the
genes selected from the MYGD.
training and test samples from the 226 genes that belonged to the
aforementioned six functional classes and running a CV, thus requiring
relatively few training samples. For example, the division of the 17 genes
in the TCA class into a three-fold CV is set as CV1(5), CV2(6), and
CV3(6), where the numbers in parentheses indicate the number of genes
in each of the three TCA subgroups.
One of the subgroups was designated as the test sample, and the other
two were used as training samples. For example, the five genes in CV1
were used as the test sample, and those of CV2 and CV3 were employed
as training samples; a total of 12 genes were classified in the positive
class. An equivalent number of negative genes among the HIST, HTH,
PROTEAS, RESP, and RIBO classes were selected, with the total
number equal to the number of positive genes. Thus, the training sample
in this experiment consisted of 12 positive genes and 12 negative genes.
The selection of the test sample differed from that of the training
samples in that it involved selecting an equal number of genes for the
negative class to that of the genes for the positive class from each of the
five functional classes. In the aforementioned example, five genes
belonged to the positive class; thus, five genes were selected from each
of the five functional classes to form the negative class. A total of 30 test
samples were used in the example. Notably, the genes in the training
samples must not overlap with those of the test sample. When the
number of genes in a specific class was insufficient, an adequate number
of genes from the other functional classes were selected. Specifically, the
number of genes in RIBO exceeded five times the total number of genes
from all the other functional classes that were not selected for the
training samples. Therefore, the RIBO test sample genes were applied
together with the genes from all the other functional classes that were not
selected as training samples, as the test sample in this study.
The other selection method used all 226 genes in the six
functional classes, without CV. For example, all 17 TCA
genes were used as positive genes, and 17 random genes that did not
overlap with these genes were selected from the gene expression data of
Saccharomyces cerevisiae as negative genes. Thus, a total of 34 genes
were used for the training samples.
This experiment required two types of data: the GAFs as applied in the
first experiment, and the genes of the six functional classes from the
MYGD in the second experiment.
Numerous GO term data were employed in this experiment. From the
GAFs, the GO IDs whose numbers of corresponding genes exceeded the
experiment threshold of 30 were selected. Among the
three types of GO classes, biological process (BP) was selected for this
experiment. Subsequently, the expression time series corresponding to
each gene were selected from the gene expression data files. A total of
185 GO IDs were incorporated, and the number of genes for each ID varied;
some GO IDs had only one corresponding gene each, whereas others had
more than 400.
Similar to the first experiment, this experiment employed
Saccharomyces cerevisiae gene expression data. A total of 6,239 gene
expression data were used; each time series gene expression data
exhibited 17 time points. From this data, the 185 GO IDs of the 226
genes were selected. The genes that did not contain these 185 GO IDs
were then eliminated from the 6,239 gene expression data.
This subsection introduces the goal, method, and parameter settings for
this final experiment of the study. The result is presented in Section 2.5.
The MIPS functional classes are broader classification classes than
GO IDs. Analyzing the gene data of the six classes from the MYGD
revealed that the genes clustered in the same functional classes did not
necessarily share the same GO IDs. Therefore, using the GO IDs as the
classification indices, this experiment verified whether genes with the
same GO IDs could be clustered in the same functional classes, thereby
effectively improving the SVM time series gene expression data
classification accuracy.
This experiment required two types of files. One was the 227-gene
file downloaded from the MYGD. Similar to the second experiment,
YDL184C was removed and the remaining 226 genes were used. The
other type was the GAFs that can be downloaded from the GO website.
A total of 185 GO IDs corresponding to the 226 genes were identified
from the files, and all the genes corresponding to these 185 GO IDs were
selected from the GAFs.
The 185 GO IDs were used to create 185 SVM classifiers. The genes
corresponding to each of the GO IDs were grouped as the positive genes.
An equal number of genes not corresponding to the GO ID
(thus avoiding overlap with the positive genes) were then randomly
selected from the processed files as the negative genes. Thus, a total of
185 GO ID pairs were established. For example, in the GAFs, the 38
genes corresponding to GO:0000002 were used as the positive genes in
the training samples. From the processed GAFs, 38 random genes not
corresponding to GO:0000002 were then selected as the negative genes.
Each GO ID enabled the creation of an SVM classifier.
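The construction of one balanced training set per GO ID can be sketched as follows. This is an illustrative outline only: the function name, the seed, and the toy annotation table are hypothetical, and the real annotations would come from the GAFs.

```python
import random


def build_training_set(go_id, annotations, seed=0):
    """Return (positives, negatives) for one GO ID.

    annotations: dict mapping gene name -> set of GO IDs annotated to it.
    Positives are all genes annotated with go_id; negatives are an equally
    sized random sample of genes that are not annotated with go_id.
    """
    positives = sorted(g for g, ids in annotations.items() if go_id in ids)
    candidates = sorted(g for g, ids in annotations.items() if go_id not in ids)
    if len(candidates) < len(positives):
        raise ValueError("not enough unannotated genes to balance the classes")
    negatives = random.Random(seed).sample(candidates, len(positives))
    return positives, negatives


# Hypothetical toy annotation table (gene names are placeholders).
toy_annotations = {
    "g1": {"GO:0000002"},
    "g2": {"GO:0000002", "GO:0006412"},
    "g3": {"GO:0006412"},
    "g4": {"GO:0006096"},
    "g5": set(),
}
pos, neg = build_training_set("GO:0000002", toy_annotations)
```

Drawing the negatives only from genes that lack the GO ID guarantees that no gene appears in both classes, which mirrors the non-overlap requirement stated above.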
The 226 genes downloaded from the MYGD were used as the test
sample, and their time series gene expression data were applied in the
SVM classifier. If the classification result of one of the genes was
This section analyzes and compares the results of the three experiments.
The first subsection introduces the approach for assessing the
classification results. The second subsection presents a comparison
between the clustering results of the GO ID-based SVM algorithm and
the two unsupervised clustering methods detailed in Section 2.4.2 (the
first experiment). The third subsection explains the comparison between
the time series gene expression data clustering results of the MIPS-based
SVM algorithm and the GO ID-based approach as described in Sections
2.4.3 and 2.4.4 (the second and third experiments). The final subsection
presents an analysis of the effect of the DAG levels of GO IDs on the
classification accuracy of the third experiment.
four conditions: true positive (TP), true negative (TN), false positive
(FP), and false negative (FN). The accuracy is defined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2.1)$$
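Equation (2.1) translates directly into a few lines of code. The sketch below applies it to the HIST row of Table 2.1 (TP = 8, TN = 50, FP = 5, FN = 3); the function name is hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples, as in Eq. (2.1)."""
    return (tp + tn) / (tp + tn + fp + fn)


# HIST row of Table 2.1: 58 of 66 test samples were classified correctly.
hist_accuracy_percent = round(100 * accuracy(8, 50, 5, 3), 1)
```

For the HIST row this gives 58/66, or 87.9% after rounding, which matches the accuracy listed in the table.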
In the first experiment, all the GO ID pairs underwent the five-fold CV.
The average clustering accuracies of the proposed, k-means, and
hierarchical methods in the five-fold CV of the 10 GO ID pairs are
visually presented in Fig. 2.2, which reveals that the classification
accuracy of the GO ID-based SVM algorithm method significantly
surpassed that of the two conventional unsupervised clustering methods.
Fig. 2.2. Comparison of the average clustering accuracies of the proposed, k-means, and
hierarchical methods in the five-fold CV of the 10 GO ID pairs. (Some of the data are
also presented in an unpublished project report for the Ministry of Science and
Technology, Taiwan, which sponsored this study. Project Number: NSC 101-2221-E-024-024).
Table 2.1. Classification accuracy of the six MIPS-based functional classes from the first
sampling method in the three-fold CV (%).
Functional class    TP    TN    FP    FN    Accuracy (%)
HIST                 8    50     5     3    87.9
HTH                 11    66    14     5    80.2
PROTEAS             27   144    31     8    81.4
RESP                17   130    20    13    81.7
RIBO               115    74     4     5    95.5
TCA                 10    61    24     7    69.6
Note: The number of training samples for each functional class is listed as follows.
HIST: 16 for CV1, 14 for CV2, and 14 for CV3.
HTH: 22 for CV1, 22 for CV2, and 20 for CV3.
PROTEAS: 48 for CV1, 46 for CV2, and 46 for CV3.
RESP: 40 for each of the three CVs.
RIBO: 160 for each of the three CVs.
TCA: 24 for CV1, 22 for CV2, and 22 for CV3.
The results revealed that the clustering result of the GO-based SVM
algorithm was superior to those of the MIPS-based algorithms. When the
number of training samples exceeded 40, the average classification
accuracy was improved to as high as 93%. Following the continuing
expansion of and updates to the GO database, the number of genes
corresponding to the GO IDs will continue to increase, thereby
promoting the practicality of the proposed method.
Table 2.2. Classification accuracy of the six MIPS-based functional classes from the
second sampling method in the three-fold CV (%).
2.6.1 Conclusions
Acknowledgements
This work was supported in part by the Ministry of Science and
Technology, Taiwan [Grant numbers: NSC 101-2221-E-024-024, MOST
102-2221-E-024-019, MOST 103-2221-E-024-014 and MOST
104-2221-E-024-018].
References
24. DeRisi, J., Penland, L., Brown, P. O., Bittner, M. L., Meltzer, P. S., Ray, M., Chen,
Y., Su, Y. A. and Trent, J. M. (1996). Use of a cDNA microarray to analyze gene
expression patterns in human cancer, Nature Genetics, 14, pp. 457–460.
25. Dor, K. C., Chambwe, N., Srdanovic, M. and Campagne, F. (2010). BDVal:
Reproducible large-scale predictive model development and validation in high-
throughput datasets, Bioinformatics, 26, pp. 2472–2473.
26. Drummond, C. and Holte, R. C. (2003). C4.5, class imbalance, and cost sensitivity:
Why under-sampling beats over-sampling, Proc. International Conference on
Machine Learning: Workshop Learn from Imbalanced Data Sets II, pp. 1–8.
27. Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis
(Wiley, New York, NY, USA).
28. Engel, S. R., Dietrich, F. S., Fisk, D. G., Binkley, G., Balakrishnan, R., Costanzo,
M. C., Dwight, S. S., Hitz, B. C., Karra, K., Nash, R. S., Weng, S., Wong, E. D.,
Lloyd, P., Skrzypek, M. S., Miyasato, S. R., Simison, M. and Cherry, J. M. (2014).
The reference genome sequence of Saccharomyces cerevisiae: Then and Now, G3
(Bethesda), 4, pp. 389–398.
29. Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm,
Proc.13th International Conference Machine Learning, pp. 148–156.
30. Freund, Y. and Schapire, R. (1999). A short introduction to boosting, Journal of
Japanese Society for Artificial Intelligence, 14, pp. 771–780.
31. Idicula-Thomas, S., Kulkarni, A. J., Kulkarni, B, D., Jayaraman, V. K. and Balaji, P.
V. (2006). A support vector machine-based method for predicting the propensity of
a protein to be soluble or to form inclusion body on overexpression in Escherichia
coli., Bioinformatics, 22, pp. 278–284.
32. Jain, A. K. and Dubes, R. C. (1988). Algorithms for Clustering Data (Prentice-Hall,
Englewood Cliffs, NJ).
33. Jain, A. K. (2010). Data clustering: 50 years beyond K-means, Pattern Recognition
Letters, 31, pp. 651–666.
34. Kohonen, T. (1990). The self-organizing map, Proceedings of the IEEE, 78 pp.
1464–1480.
35. Kotsiantis, S. B. (2007). Supervised machine learning: A review of classification
techniques, Informatica, 31, pp. 249–268.
36. Kubat, M. and Matwin, S. (1997). Addressing the curse of imbalanced training sets:
One-sided selection, Proc. 14th International Conference on Machine Learning, pp.
179–186.
37. Kumar, R., Kulkarni, A. J., Jayaraman, V. K. and Kulkarni B. D. (2004).
Symbolization assisted SVM classifier for noisy data, Pattern Recognition Letters,
25, pp. 495–504.
38. Lin, Z. Y., Hao, Z. F., Yang, X. W. and Liu, X. L. (2009). Several SVM Ensemble
Methods Integrated with Under-Sampling for Imbalanced Data Learning (Springer,
Berlin, Germany) pp. 536–544.
Computational Methods with Applications in Bioinformatics Analysis 9in x 6in b2826-ch02
53. The Gene Ontology Consortium (2010). The Gene Ontology in 2010: Extensions
and refinements, Nucleic Acids Research, 38, D331–D335.
54. Tseng, V. S. and Yu, H. H. (2011). Microarray data classification by multi-
information based gene scoring integrated with gene ontology, International Journal
of Data Mining and Bioinformatics, 5, pp. 402–416.
55. Vapnik, V. N. (1995). The nature of Statistical Learning Theory (Springer-Verlag,
New York, NY, USA).
56. Vapnik, V. N. and Cortes, C. (1995). Support vector networks, Machine Learning,
20, pp. 273–297.
57. Vapnik, V. N. (1998). Statistical Learning Theory (Wiley, New York, NY, USA).
58. Wang, L. (2005). Support Vector Machines: Theory and Applications (Springer,
New York, NY, USA).
59. Warita, K., Mitsuhashi, T., Tabuchi, Y., Ohta, K., Suzuki, S., Hoshi, N., Miki, T.
and Takeuchi, Y. (2012). Microarray and gene ontology analyses reveal
downregulation of DNA repair and apoptotic pathways in diethylstilbestrol-exposed
testicular Leydig cells, The Journal of Toxicological Sciences, 37, pp. 287–295.
60. Weiss, G. M. (2004). Mining with rarity: A unifying framework, Proc. ACM
SIGKDD Explorations Newsletter, pp. 7–19.
61. Wu, D., Bennett, K., Cristianini, N. and Shawe-Taylor, J. (1999). Large margin
decision trees for induction and transduction, Proc.16th International Conference on
Machine Learning, pp. 474–483.
62. Yang, Z. R. (2004). Biological applications of support vector machines, Brief
Bioinformatics, 5 pp. 328–338.
63. Yeung, K. Y. and Ruzzo, W. L. (2001). Principal component analysis for clustering
gene expression data, Bioinformatics, 17, pp. 763–774.
Chapter 3
Recently, the use of an ensemble data matrix as a transformed space for
classification has been put forward. Specific to the problem of predicting
student dropout, the matrix generated as part of summarizing members in a
cluster ensemble has been investigated with a number of conventional
classification methods. Despite the reported success in comparison to the
case of original data and other attribute reduction techniques like PCA and
KPCA, the study is limited to only one ensemble matrix that is created by the
link-based ensemble approach or LCE. To provide a comparative review with
respect to the aforementioned problem, this paper includes the experiments
and findings obtained from the use of different graph-based ensemble
algorithms as data transformation methods for microarray data classification.
The empirical study can be particularly useful for those working in
bioinformatics, and is generally applicable to any classification problem.
Besides, the review poses another interesting challenge for researchers in
the field of cluster ensembles: to couple their models with this hybrid
clustering-classification learning.
‡ Corresponding author.
3.1 Introduction
This framework simply consists of two stages, one for the acquisition of
cluster ensemble, and the other for creating cluster information matrices
with graph-based cluster ensemble methods.
As for the current investigation, the following two types of ensembles are
examined. According to the original work of LCE (Ref. 25), a partitioning
algorithm like k-means is used to generate base clusterings, each of which is
initialized with a random set of cluster centers or prototypes.
Note that, to obtain a meaningful data partition, k becomes 50 if
$\sqrt{N} > 50$.
• Random-k: Each base clustering $\pi_g$ is generated using the gene
expression dataset $X \in \mathbb{R}^{N \times d}$ with N samples and d genes.
The number of clusters is randomly selected from $\{2, \ldots, \sqrt{N}\}$.
Note that both ‘Fixed-k’ and ‘Random-k’ generation strategies
have become common alternatives for the generation process (Ref. 23).
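The choice of k under the two generation strategies can be sketched as follows. This is only an illustration under stated assumptions: the function `choose_k` is hypothetical, the Fixed-k value is assumed to be the square root of N capped at 50 (the text only states the cap), and rounding the square root down is also an assumption.

```python
import math
import random


def choose_k(n_samples: int, strategy: str, rng: random.Random) -> int:
    """Pick the number of clusters for one base clustering."""
    # ASSUMPTION: sqrt(N) is rounded down; the text does not state the rounding rule.
    upper = max(2, math.isqrt(n_samples))
    if strategy == "fixed":
        # ASSUMPTION: Fixed-k uses sqrt(N), capped at 50 as noted in the text.
        return min(upper, 50)
    if strategy == "random":
        # Random-k: k drawn uniformly from {2, ..., sqrt(N)}.
        return rng.randint(2, upper)
    raise ValueError(f"unknown strategy: {strategy!r}")


rng = random.Random(0)
# Ten base clusterings for a hypothetical dataset of 72 samples.
ks = [choose_k(72, "random", rng) for _ in range(10)]
print(ks)
```

Each value in `ks` would then be passed to a partitioning algorithm such as k-means, started from a random set of cluster centers, to produce one base clustering.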
$$\Theta_{BA}(x_i, cl) = \begin{cases} 1 & \text{if } x_i \in cl \\ 0 & \text{otherwise} \end{cases}, \tag{1}$$
$$sim(C_x, C_y) = \frac{WCT_{xy}}{WCT_{max}} \times DC, \tag{3}$$
here the WCT algorithm (Ref. 24) is exploited to calculate $WCT_{xy}$ and
$WCT_{max}$, which are the WCT measure between two target clusters
($C_x$ and $C_y$) and the maximum WCT measure of the entire
ensemble, respectively. Note also that $DC \in [0, 1]$ is the decay factor
that reflects the confidence level of the underlying link analysis.
• Weighted Triple-Quality (WTQ) Matrix: similar to the previous
matrix, $\Theta_{WTQ}(x_i, cl) \in [0, 1]$ can be estimated by the following.
$$\Theta_{WTQ}(x_i, cl) = \begin{cases} 1 & \text{if } cl = C^{g}_{*}(x_i) \\ sim(cl, C^{g}_{*}(x_i)) & \text{otherwise} \end{cases}, \tag{4}$$
given that
$$sim(C_x, C_y) = \frac{WTQ_{xy}}{WTQ_{max}} \times DC, \tag{5}$$
here the WTQ algorithm (Ref. 24) is exploited to calculate $WTQ_{xy}$ and
$WTQ_{max}$, which are the WTQ measure between two target clusters
($C_x$ and $C_y$) and the maximum WTQ measure of the entire
ensemble, respectively.
• Weighted Distance (WD) Matrix: this is developed as the by-product
of a new soft-subspace clustering model (Ref. 11). An entry
$\Theta_{WD}(x_i, cl)$ is estimated from the distance between sample $x_i \in X$
and the center of the cluster $cl \in \Pi$. For each base clustering $\pi_g \in \Pi$,
$\Theta_{WD}(x_i, cl), \forall cl \in \pi_g$ can be defined as follows.
$$\Theta_{WD}(x_i, cl) = \frac{D_i - d(x_i, cl) + 1}{k_g D_i + k_g - \sum_{\forall cl' \in \pi_g} d(x_i, cl')}, \tag{6}$$
where $d(x_i, cl)$ is the distance between sample $x_i$ and cl, that is, the
center (or centroid) of the cluster cl. In addition, $D_i$ can be specified
by the next equation.
According to Ref. 11, the distance $d(x_i, cl)$ can be defined as follows.
$$d(x_i, cl) = \sum_{s=1}^{d} w_s^{cl} (x_{is} - cl_s)^2 \tag{8}$$
provided that $w_s^{cl} \in [0, 1]$ is the weight of the sth gene that is
specific to the cluster $cl \in \pi_g$, $x_{is}$ denotes the value of the sth gene of
data sample $x_i$, and $cl_s$ denotes the sth gene value of the cluster
center cl. For any $cl \in \pi_g$,
$$\sum_{s=1}^{d} w_s^{cl} = 1. \tag{9}$$
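The matrix definitions above can be illustrated with a short sketch. It is not the authors' implementation: all function names are hypothetical, the raw link measures are assumed to be precomputed by the WCT or WTQ algorithm and passed in as a dictionary, $k_g$ is taken to be the number of clusters in $\pi_g$, and $D_i$ defaults to the largest distance of the sample to any center because the equation that defines it is not reproduced here.

```python
def binary_association(ensemble):
    """Eq. (1): one 0/1 column per cluster of every base clustering."""
    columns = [(g, label) for g, labels in enumerate(ensemble)
               for label in sorted(set(labels))]
    matrix = [[1 if ensemble[g][i] == label else 0 for g, label in columns]
              for i in range(len(ensemble[0]))]
    return columns, matrix


def link_based_entry(cl, own_cluster, raw, raw_max, dc):
    """Eqs. (4)-(5): 1 for the sample's own cluster, scaled similarity otherwise."""
    if cl == own_cluster:
        return 1.0
    # raw maps an unordered cluster pair to its precomputed link measure.
    return raw.get(frozenset((cl, own_cluster)), 0.0) / raw_max * dc


def weighted_distance(x, center, weights):
    """Eq. (8): gene-weighted squared distance to a cluster center."""
    return sum(w * (xs - cs) ** 2 for w, xs, cs in zip(weights, x, center))


def wd_entries(x, centers, weights_per_cluster, d_i=None):
    """Eq. (6) for one sample and one base clustering."""
    dists = [weighted_distance(x, c, w)
             for c, w in zip(centers, weights_per_cluster)]
    if d_i is None:
        d_i = max(dists)  # ASSUMPTION: stand-in for the definition of D_i
    k_g = len(centers)
    denominator = k_g * d_i + k_g - sum(dists)
    return [(d_i - d + 1) / denominator for d in dists]


# Two base clusterings of three samples (hypothetical labels).
columns, theta_ba = binary_association([[0, 0, 1], [0, 1, 1]])
print(columns)   # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(theta_ba)  # [[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1]]

# WD entries of one sample against two centers with equal gene weights.
print(wd_entries([0.0, 0.0], [[0.0, 0.0], [2.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]]))
# [0.75, 0.25]
```

Under Eq. (6) the entries of a sample sum to one within each base clustering, which the last call illustrates.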
analysis.
Table 1. Description of gene expression datasets: tissue type, microarray chip type,
number of samples (N ), number of original genes (d∗ ), number of selected genes (d)
after pre-processing, and number of classes (K).
more accurate outcome than their RK counterparts and the other matrices.
Despite this, $\Theta_{WCT}$(RK) and $\Theta_{WTQ}$(RK) have shown exceptional results
in a few cases such as the CNS dataset. In addition, the two variations of
$\Theta_{WD}$ are effective for the analysis of Leukemia3 and BCT, while $\Theta_{BA}$
usually has higher error rates than the rest. Similar trends can also be observed
with the applications of C4.5 and KNN classification models to these matrices,
with the corresponding statistics being illustrated in Tables 3 and 4,
respectively. Note that the classification performance with $\Theta_{BA}$ matrices
has improved with these two classifiers, while that of $\Theta_{WD}$ becomes
lower.
To further elaborate the empirical findings, Figure 1 shows for the case
of Naive Bayes the comparison of error rates as the averages across the
investigated datasets. For the four methods to generate Θ (WCT, WTQ, BA and
WD) and two generation schemes (FK and RK), the best results are equally
obtained by WCT(FK) and WTQ(FK), and the two worst still occur with
BA matrices. The FK strategy is generally better than the other with
WCT, WTQ and WD, while it is the other way round for BA. For the
summarization of C4.5 and KNN, Figures 2 and 3 similarly suggest that
WCT(FK) and WTQ(FK) are the most accurate, being slightly better
than the BA(FK) alternative. Figure 4 provides the results with KNN
classifiers, where the number of neighbors K varies from 1 to 3. Unlike
the previous observation with Naive Bayes, WD matrices are less effective
with the models of C4.5 and KNN. Based on these, the original BA matrix
can usually represent the knowledge embedded in an ensemble rather well,
with a possible improvement by a robust matrix refinement approach, e.g.,
link-based methods. However, the distance-oriented refining technique like
Table 2. Classification errors with Naive Bayes, where FK and RK denote Fixed-k
and Random-k generation strategies. The two lowest error rates on each investigated
dataset are highlighted in boldface.
Table 3. Classification errors with C4.5, where FK and RK denote Fixed-k and
Random-k generation strategies. The two lowest error rates on each investigated dataset
are highlighted in boldface.
Table 4. Classification errors with KNN, where FK and RK denote Fixed-k and
Random-k generation strategies. The two lowest error rates on each investigated dataset
are highlighted in boldface.
3.4 Conclusion
Fig. 1. Comparison of classification errors (y-axis) as averages across all datasets with
Naive Bayes, categorized by matrix type and generation strategy.
Fig. 2. Comparison of classification errors (y-axis) as averages across all datasets with
C4.5, categorized by matrix type and generation strategy.
Fig. 3. Comparison of classification errors (y-axis) as averages across all datasets with
KNN, categorized by matrix type and generation strategy.
Fig. 4. Comparison of classification errors (y-axis) as averages across all datasets with
KNN and different K values from 1 to 3; categorized by three specific matrices of
WCT(FK), WTQ(FK) and BA(FK), respectively.
References
Chapter 4
a A part of this chapter is revised from Charles C. N. Wang, Phillip C.-Y. Sheu, and Jeffrey J. P. Tsai,
Towards Semantic Biomedical Problem Solving, Int. J. Semantic Computing, Vol. 09, p. 415 (2015).
4.1 Introduction
On another front, Semantic Computing has been drawing more and more
attention in academia and industry. It brings together various analytics
techniques to connect the (often vaguely formulated) intentions of humans
with computational content that includes, but is not limited to, structured
and semi-structured data, multimedia data, text, etc. (Sheu et al., 2011).
Hristovski, Dinevski, Kastrin, and Rindflesch (2015b) propose
a semantic methodology and describe a tool called SemBT for biomedical
question answering. The system is able to provide answers to a wide array
of questions, from clinical medicine through pharmacogenomics to
microarray results interpretation. Zhang et al. (2014) present a
Sequence analysis
Gene expression analysis
Protein structure prediction
Biological network and Computational systems biology
In the following sections we shall illustrate the use of SNL for describing
some typical biological applications.
BLAST Problems
Perhaps one of the most common tasks in biological research today is that
of identifying genes and proteins related or similar to a particular
sequence. The task is often performed with BLAST. A representative
problem is presented below, where variable parameters are preceded with
the dollar sign ‘$’:
Nucleotide BLAST
Protein BLAST
- Search $protein-database using $protein
BLASTX
- Search $protein-database using $translated-nucleotide
TBLASTN
- Search $translated-nucleotide-database using $protein
TBLASTX
- Search $translated-nucleotide-database using $translated-nucleotide
Nucleotide BLAST
- Align $nucleotide and $nucleotide
The problem may be solved using one or more of the following tools:
NCBI BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi
EBI BLAST: http://www.ebi.ac.uk/Tools/msa/
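To make the role of the ‘$’ parameters concrete, the sketch below fills a problem pattern with actual values. The helper `instantiate` and the example values `db_A` and `seq_1` are hypothetical and used only for illustration; how an instantiated sentence is dispatched to a BLAST service is not covered here.

```python
import re

# A parameter is '$' followed by letters, optionally joined by hyphens.
PARAM = re.compile(r"\$([A-Za-z]+(?:-[A-Za-z]+)*)")


def instantiate(pattern: str, bindings: dict) -> str:
    """Replace every $parameter in an SNL pattern with its bound value."""
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"no value bound to ${name}")
        return bindings[name]
    return PARAM.sub(substitute, pattern)


sentence = instantiate(
    "Search $protein-database using $protein",
    {"protein-database": "db_A", "protein": "seq_1"},  # hypothetical values
)
print(sentence)  # Search db_A using seq_1
```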
drug targets, as well as to study the gene and potential toxicological effects
of compounds in a model (Fryer et al., 2002).
Normalization
- Normalize $microarray-experiments
In our study, we assume the analyst interacts with the MI system in two
steps:
We shall use the symbol MIV below to designate the union of MV,
BV, DV, TV, and GV.
Given a dataset, the analyst can use the following query pattern to form a
“sub-cube”:
GIVEN dataset(s)
Find patients conditions
Table 2. Problems that have not been studied yet or do not make sense
Problems that have not yet been studied (10)
Correlate TV and GV
Correlate MV and TV and GV
Correlate DV and BV and TV
Correlate DV and BV and GV
Correlate DV and TV and GV
Correlate MV and DV and BV and TV
Correlate MV and DV and BV and GV
Correlate MV and DV and TV and GV
Correlate MV and BV and TV and GV
Correlate DV and BV and TV and GV
Problems that do not make sense (6)
Correlate MV and MV
Correlate BV and TV
Correlate BV and GV
Correlate MV and BV and TV
Correlate MV and BV and GV
Correlate BV and TV and GV
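Sentences of the kind listed in Table 2 can be generated mechanically from the five variable spaces. The following is a minimal sketch, assuming that a candidate problem is any combination of two or more distinct spaces; the function name is hypothetical, and self-pairings such as "Correlate MV and MV" are not enumerated.

```python
from itertools import combinations

SPACES = ["MV", "DV", "BV", "TV", "GV"]


def correlate_sentences(spaces=SPACES):
    """List 'Correlate ...' sentences for every set of two or more spaces."""
    sentences = []
    for size in range(2, len(spaces) + 1):
        for combo in combinations(spaces, size):
            sentences.append("Correlate " + " and ".join(combo))
    return sentences


candidates = correlate_sentences()
print(len(candidates))  # 10 + 10 + 5 + 1 = 26
print(candidates[:3])
```

Each generated sentence can then be checked against the literature to decide whether it has been studied, has not yet been studied, or does not make sense.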
We present some case studies in the following.
Case Study – Correlate $variable in DV
Methods
Based on the clinical records of the enrollees of NHI, 18,321 youngsters
aged 18 or younger with a diagnosis of ADHD in 2001 were recruited to join
the case group. All the clinical diagnoses made from 2000 to 2002 for
those who were admitted to the medical center in 2008 and whose ICD-9
code was 314 (Attention Deficit Hyperactivity Disorder) were
extracted from the NHI database.
Methods
Two million individuals between 1998 and 2011 randomly sampled
from the NHIRD were extracted.
GIVEN NHIRD
Find patients whose disease includes [625.4] and whose medical
claims were between 1998 and 2011
Correlate medication in TV
The study uses association rule mining (ARM) and social network
analysis to explore the combinations of CHM treatments for PMS. It
finds that Jia-Wei-Xiao-Yao-San (JWXYS) had the highest prevalence
(37.5% of all prescriptions) and was also the core of the prescription network
for PMS. For combinations of two CHMs, JWXYS with Cyperus rotundus
L. was prescribed most frequently (7.7% of all prescriptions), followed
by JWXYS.
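The pairwise prevalence figures reported by such a study correspond to the support of two-item sets in association rule mining. A minimal sketch follows; the function name and the four prescriptions are invented for illustration and are not data from the study.

```python
from collections import Counter
from itertools import combinations


def pair_support(prescriptions):
    """Fraction of prescriptions that contain each pair of items."""
    counts = Counter()
    for items in prescriptions:
        # Count each unordered pair once per prescription.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    n = len(prescriptions)
    return {pair: count / n for pair, count in counts.items()}


# Hypothetical prescriptions, for illustration only.
prescriptions = [
    ["JWXYS", "Cyperus rotundus"],
    ["JWXYS", "Cyperus rotundus", "herb_X"],
    ["JWXYS"],
    ["herb_X"],
]
support = pair_support(prescriptions)
print(support[("Cyperus rotundus", "JWXYS")])  # 0.5
```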
Type       SNL                                                      PC
1 Space    Estimate impact of $variable(s) in $space                25
2 Spaces   Estimate impact of $variable(s) in $space and            50
           $variable(s) in $space on $variable(s) in $space
3 Spaces   Estimate impact of $variable(s) in $space,               50
           $variable(s) in $space and $variable(s) in
           $space on $variable(s) in $space
4 Spaces   Estimate impact of $variable(s) in $space,               25
           $variable(s) in $space, $variable(s) in $space
           and $variable(s) in $space on $variable(s) in
           $space
5 Spaces   Estimate impact of $variable(s) in $space,                5
           $variable(s) in $space, $variable(s) in $space,
           $variable(s) in $space and $variable(s) in
           $space on $variable(s) in $space
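The PC column appears to be consistent with a simple count: choose the source spaces from the five variable spaces, then choose one of the five as the target space. This reading is an assumption made here for illustration and is not stated in the table; the sketch below merely reproduces the listed values under it.

```python
from math import comb

N_SPACES = 5  # MV, BV, DV, TV and GV


def problem_count(n_source_spaces: int, n_spaces: int = N_SPACES) -> int:
    """Ways to pick the source spaces times the choices of target space."""
    return comb(n_spaces, n_source_spaces) * n_spaces


print([problem_count(k) for k in range(1, 6)])  # [25, 50, 50, 25, 5]
```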
Methods
The authors identified a study population from the NHI Research
Database (NHIRD) in Taiwan between 1999 and 2003 that
includes 16821 patients with bipolar disorder and 6728 age- and sex-
matched control participants without bipolar disorder. The incidence of
stroke (ICD-9 codes 430 to 438) and the patient survival rate after stroke
were calculated for both groups using data from the NHIRD between 2004
and 2010.
GIVEN NHIRD
P11 = Find patients whose disease includes [430:438] and whose
medical claims were between 2004 and 2010
GIVEN P11
P12 = Find patients whose disease includes [434.91] and whose
medical claims were between 2004 and 2010
GIVEN NHIRD
P21 = Find patients whose disease does not include [430:438] and
whose medical claims were between 2004 and 2010
GIVEN P21
P22 = Find patients whose disease includes [434.91] and whose
medical claims were between 2004 and 2010
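The nesting of queries, in which the result of one query becomes the dataset of the next, can be sketched as chained filters. The record layout (`id`, `year`, `codes`), the helper names and the four records are hypothetical; each record stands for one patient in this simplified illustration, and only numeric ICD-9 codes are handled.

```python
def in_icd9_range(code: str, low: int, high: int) -> bool:
    """True if a numeric ICD-9 code such as '434.91' falls in [low:high]."""
    return low <= int(float(code)) <= high


def find_patients(records, matches, first_year, last_year):
    """Keep records in the year window with at least one matching code."""
    return [r for r in records
            if first_year <= r["year"] <= last_year
            and any(matches(code) for code in r["codes"])]


# Hypothetical claims records, for illustration only.
records = [
    {"id": 1, "year": 2005, "codes": ["434.91", "296.4"]},
    {"id": 2, "year": 2006, "codes": ["431"]},
    {"id": 3, "year": 2003, "codes": ["434.91"]},
    {"id": 4, "year": 2008, "codes": ["296.4"]},
]

# P11: disease includes [430:438], claims between 2004 and 2010.
p11 = find_patients(records, lambda c: in_icd9_range(c, 430, 438), 2004, 2010)
# P12: computed GIVEN P11, disease includes [434.91].
p12 = find_patients(p11, lambda c: c == "434.91", 2004, 2010)
print([r["id"] for r in p11], [r["id"] for r in p12])  # [1, 2] [1]
```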
GIVEN NHIRD
Find patients whose disease includes [714] and whose medical claims
cover 10 years
The SNL sentence that describes the problem of estimating the impact
of one disease (rheumatoid arthritis) on another (depressive
disorders) for the patients found in the OLAP query is:
The SNL sentence that describes the problem of estimating the impact
of environment (air pollution) on a disease (dementia) is:
The paper evaluates the risk of dementia among four levels of air
pollutants. It uses relative risk (RR) and confidence intervals (CI) to
investigate the incidence and relative risk of dementia among patients,
and to discover the associations between the levels of nitrogen dioxide
(NO2) and carbon monoxide (CO) exposure and dementia.
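For reference, a relative risk and its confidence interval can be computed from a two-by-two table as sketched below. The interval uses the usual log-scale normal approximation, which is an assumption here since the summary does not say which method the paper applies; the counts in the example are invented.

```python
import math


def relative_risk(exposed_cases, exposed_total,
                  unexposed_cases, unexposed_total, z=1.96):
    """Relative risk with a log-scale confidence interval (z=1.96 for 95%)."""
    rr = (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)
    # Standard error of ln(RR).
    se_log = math.sqrt(1 / exposed_cases - 1 / exposed_total
                       + 1 / unexposed_cases - 1 / unexposed_total)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Hypothetical counts: 30 of 100 exposed and 15 of 100 unexposed develop the disease.
rr, lower, upper = relative_risk(30, 100, 15, 100)
print(f"RR = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 is conventionally read as an association between exposure level and disease.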
OLAP Query:
4.5 Conclusions
In this chapter, we survey different bioinformatics tools, data mining and
OLAP tools, and statistical analysis tools for predication analysis and
decision support for biomedical applications. We describe a Structured
Natural Language (SNL) that can express many problems in BMI with a
finite number of sentence patterns. We show how OLAP, data mining and
statistical analysis tools may be linked to solve problems in computational
medicine with a uniform interface, and how a language-based approach
may help in discovering new problems.
One problem that is not addressed by this chapter is how to solve a specific
problem automatically based on the existing tools. For each problem
sentence, if applicable, we give a case study that shows how an instance of the
corresponding problem has been solved. For different instances of a
problem we may need to apply variations of the tools employed in the case
study. In addition, we need to connect and convert the methods used in a
case study into an algorithm that is suitable for automation and
parameterization. What is described in the chapter is merely the beginning
of a long-term effort to provide an integrated platform for biomedical
problem solving.
Acknowledgment
The work of CCNW, PCYS and JJPT is supported in part under grant
numbers NSC 102-2632-E-468-001-MY3 and MOST 105-2632-E-468-002
from the Ministry of Science and Technology, Taiwan and Asia
University. The views, opinions and/or findings contained in this report
are those of the authors and should not be construed as an official National
Science Council position, policy or decision unless so designated by other
documentation.
References
Hristovski, D., Dinevski, D., Kastrin, A., and Rindflesch, T.C. (2015a)
Biomedical question answering using semantic relations. BMC
Bioinformatics, 16, 1.
Hristovski, D., Dinevski, D., Kastrin, A., and Rindflesch, T.C. (2015b)
Biomedical question answering using semantic relations. BMC
Bioinformatics, 16, 6.
Huang, M.-J. et al. (2007) Integrating data mining with case-based
reasoning for chronic diseases prognosis and diagnosis. Expert
Systems with Applications, 32, 856–867.
Ikeda, S. et al. (2011) A model for object relational OLAP.
International Journal on Artificial Intelligence Tools, 19, 551–595.
Kim, S.Y. et al. (2006) Comparison of various statistical methods for
identifying differential gene expression in replicated microarray data.
Stat Methods Med Res, 15, 3–20.
Kitano, H. (2002) Computational systems biology. Nature, 420, 206–210.
Lander, E.S. et al. (2001) Initial sequencing and analysis of the human
genome. Nature, 409, 860–921.
Lee, Y.-C. et al. (2011) A Database of Gene-Environment Interactions
Pertaining to Blood Lipid Traits, Cardiovascular Disease and Type 2
Diabetes. Journal of data mining in genomics & proteomics, 2.
Lowe, H. J. and Barnett, G. O. (1994) Understanding and using the
medical subject headings (MeSH) vocabulary to perform literature
searches. JAMA, 14, 1103–1108.
Macgregor, P.F. and Squire, J.A. (2002) Application of microarrays
to the analysis of gene expression in cancer. Clinical Chemistry, 48,
1170–1177.
Maojo, V. and Kulikowski, C.A. (2003) Bioinformatics and Medical
Informatics: Collaborations on the Road to Genomic Medicine? Journal of
the American Medical Informatics Association, 10, 515–522.
Mutch, D.M. et al. (2001) Microarray data analysis: a practical approach
for selecting differentially expressed genes. Genome Biol., 2,
preprint0009.1.
Natarajan, J. (2013) Text Mining Perspectives in Microarray Data Mining,
ISRN Computational Biology, 2013, 5.
National Library of Medicine. https://www.nlm.nih.gov.
Zoppoli, P. et al. (2010) TimeDelay-ARACNE: Reverse engineering
of gene networks from time-course data by an information theoretic
approach. BMC Bioinformatics, 11, 154.
Rud, O.P. ed. (2012) Business Intelligence Success Factors John Wiley &
Sons, Inc., Hoboken, NJ, USA.
Schäfer, J. et al. (2001) Reverse engineering genetic networks using the
GeneNet package. J Am Stat Assoc.
Sheu, P.C.-Y. and T. Kitazawa. (2007) From Semantic objects to Semantic
Software Engineering. International Journal of Semantic Computing,
01, 18.
Sheu, P.C.-Y. (2007) Semantic Computing. International Journal of
Semantic Computing, 01, 9.
Svrakic, N.M. et al. (2003) Statistical approach to DNA chip analysis.
Recent Prog Horm Res, 58, 75–93.
Tai, Y.-M. and Chiu, H.-W. (2009) Comorbidity study of ADHD:
Applying association rule mining (ARM) to National Health
Insurance Database of Taiwan. International Journal of Medical
Informatics, 78, e75–e83.
Tarczy-Hornoch, P. and Minie, M. (2005) Bioinformatics Challenges and
Opportunities. In, Medical Informatics, Integrated Series in
Information Systems. Springer US, Boston, pp. 63–94.
Tusher, V.G. et al. (2001) Significance analysis of microarrays applied to
the ionizing radiation response. PNAS, 98, 5116–5121.
Vilela, M. et al. (2008) Parameter optimization in S-system models. BMC
Systems Biology, 2, 35.
Visser, D. and Heijnen, J.J. (2003) Dynamic simulation and metabolic re-
design of a branched pathway using linlog kinetics. Metabolic
Engineering, 5, 164–176.
Voit, E. O. (2013) Biochemical Systems Theory: A Review, ISRN
Biomathematics, 2013, 53.
Wang, S.-L. et al. (2014) Risk of Developing Depressive Disorders
following Rheumatoid Arthritis: A Nationwide Population-Based
Study. PLOS ONE, 9, e107791.
Wu, C.-S. et al. (2014) Concordance between Patient Self-Reports and
Claims Data on Clinical Diagnoses, Medication Use, and Health
System Utilization in Taiwan. PLOS ONE, 9, e112257.
Wu, H.-C. et al. (2013) The Incidence and Relative Risk of Stroke among
Patients with Bipolar Disorder: A Seven-Year Follow-Up Study.
PLOS ONE, 8, e73037.
Zhang, R. et al. (2014) Using semantic predications to uncover drug–drug
interactions in clinical data. Journal of Biomedical Informatics, 49,
134–147.
Chapter 5
5.1 Introduction
CHARMM are the most widely used force fields for the implementation of
molecular simulations. Weiner et al. [13] developed the AMBER force
field, which was originally intended for calculations on proteins and NAs.
Nowadays, several types of AMBER force fields with improved parameters
(ff94, ff96, ff98, ff99, ff99SB, etc.) are designed for the simulations of
proteins, peptides and NAs. Some modified AMBER force fields, parm94,
parm99 and parmbsc0, show impressive performance in modeling a large
number of DNA structures [12, 14, 15].
Interactions between proteins and nucleic acids play important roles
in many biological activities, including the degradation of nucleic
acids, protein synthesis, DNA replication, RNA transcription, and RNA
splicing [16]. In the past, many three-dimensional structures of NAs were
unavailable, which was one of the limitations for simulating protein-NA
interactions. Currently, some web servers provide functions for
predicting and generating 3D structural models of NAs [17–19]. These
advances make it more feasible to investigate protein-NA interactions via
computational approaches. Computational simulations can indeed help
interpret protein–nucleic acid interactions and complement
experimental results. Therefore, simulation studies of protein-NA
complexes have attracted great attention from many scientists due to their
capability of characterizing the binding domain of a protein for an NA and
visualizing the interaction forces between the protein and the NA.
This chapter focuses on computer simulation studies of protein-aptamer
interactions. Aptamers are short single-stranded DNA or RNA molecules
that can form a special stem-loop secondary structure and specifically
recognize target molecules. The target molecules can be
viruses, cells, proteins, ions, drugs, toxins, peptides and bacteria. First, we
present the interacting forces between proteins and nucleic acids. Second,
we briefly describe the two most widely used force fields for simulations,
AMBER and CHARMM. Finally, available modeling tools for proteins and
aptamers, and the experimental procedures and computational approaches
for selecting aptamers, are introduced.
van der Waals contact distance. If two atoms come too close to each other,
the repulsive force becomes the dominant force, because the outer electron
cloud of one atom overlaps that of the other. Van der Waals interactions
contribute a bond strength of 2 to 4 kJ/mol per atom pair.
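The balance between attraction and short-range repulsion can be illustrated with the 12-6 Lennard-Jones potential, a common model of van der Waals interactions. Its use here is an assumption for illustration, since the passage does not name a functional form, and the parameter values are hypothetical (the well depth is simply chosen within the 2 to 4 kJ/mol range quoted above).

```python
def lennard_jones(r: float, epsilon: float, sigma: float) -> float:
    """12-6 Lennard-Jones energy: repulsive r^-12 term minus attractive r^-6 term."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)


epsilon = 3.0   # well depth in kJ/mol (hypothetical value)
sigma = 0.34    # distance in nm at which the energy is zero (hypothetical value)
r_min = 2 ** (1 / 6) * sigma  # separation of lowest energy

print(lennard_jones(r_min, epsilon, sigma))        # close to -3.0, the well depth
print(lennard_jones(0.9 * sigma, epsilon, sigma))  # positive: repulsion dominates
print(lennard_jones(2.0 * sigma, epsilon, sigma))  # slightly negative: weak attraction
```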
Stacking between adjacent bases is also a key factor that is responsible for
the stability of the DNA double helix [28]. Stacking interactions take place
between complementary base pairs of double-stranded DNA and depend
on the dipole moments and the aromaticity of the bases. The base stacking
force is short ranged and can be characterized by an attraction potential
and a strong repulsion potential [29]. The stacking of G-C base pairs is
stronger than the stacking of A-T base pairs. For dsDNA,
base stacking forces are central in maintaining the structure. Unlike
their function in dsDNA, the base stacking forces can help ssDNA bind
with proteins because bases are usually bound by stacking with aromatic
protein side chains [30]. Base stacking forces depend on several
noncovalent forces, and hydrophobic and electrostatic interactions are the
The term “AMBER force field” generally refers to a family of force fields
for molecular dynamics of biomolecules, originally developed by
Peter Kollman’s group at the University of California, San Francisco. The
force field was developed for the simulation of macromolecules, and
appropriate force field parameters (e.g., bonds, angles, dihedrals, and
atom types in the system) must be set before a simulation. Many standard
parameter sets exist and are provided with the simulation programs. For
simulations of proteins and nucleic acids, the ff14SB AMBER force field
is suitable. The ff14SB force field is an improved version of the ff99SB
AMBER force field, which was originally developed by Hornak et al. in
2006 [33]. The ff14SB force field performs well for systems that combine
protein, nucleic acid and water models. According to the AMBER reference
manual, OL15 and OL3 are more specific force fields for simulations of
DNA and RNA molecules, respectively. For example, Krepl et al. performed
MD simulations of protein/RNA complexes using the ff99bsc0χOL3 force
field for RNA and the ff14SB, ff12SB and ff99SB force fields for
proteins [34].
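As an illustration of how these choices might be combined in practice, the following sketch assembles a LEaP input script that loads one force field per molecule class. The leaprc file names are assumed to follow a recent AmberTools release and should be checked against the installed version; the TIP3P water model, the PDB file name and the output prefix are hypothetical choices not taken from the text.

```python
# Force-field files per molecule class, following the text above.
# File names are an assumption about the installed AmberTools release.
LEAPRC = {
    "protein": "leaprc.protein.ff14SB",
    "dna": "leaprc.DNA.OL15",
    "rna": "leaprc.RNA.OL3",
    "water": "leaprc.water.tip3p",  # illustrative water model
}


def build_leap_script(components, pdb_path="complex.pdb", prefix="complex"):
    """Return the text of a LEaP input that sources one leaprc per component."""
    lines = [f"source {LEAPRC[name]}" for name in components]
    lines += [
        f"mol = loadPdb {pdb_path}",
        f"saveAmberParm mol {prefix}.prmtop {prefix}.inpcrd",
        "quit",
    ]
    return "\n".join(lines)


print(build_leap_script(["protein", "rna", "water"]))
```

Keeping the mapping in one dictionary makes it easy to swap, for example, the RNA entry when comparing force fields as Krepl et al. did for the protein part.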
The CHARMM force field is frequently used in the study of biomolecules
and was originally developed by Karplus and co-workers in 1983 [35].
Several versions of the CHARMM force field are currently available for
simulating different biological systems, including proteins, peptides,
nucleic acids, small molecule ligands, lipids, prosthetic groups and
carbohydrates. The CHARMM program is continuously maintained by a large
group of developers led by Martin Karplus, and a free CHARMM program is
also available for download from the website
(https://www.charmm.org/charmm/). CHARMM19 adopts a united atom force
field in which only hydrogen atoms belonging to polar groups are
explicitly included [36]. Continuous improvement and development of the
CHARMM additive force field have made simulations of different
biomolecules more accurate. There are now several versions of the
additive force field. CHARMM22, CHARMM27, and CHARMM36 are all all-atom
force fields. CHARMM22 [37] is suitable for simulations of proteins, and
CHARMM27 [38] is designed for nucleic acids (DNA and RNA). CHARMM36 [39],
CHARMM37 [40, 41] and CGenFF [42] were developed with optimized
simulation parameters for lipids, carbohydrates and drug-like molecules
(small molecules), respectively. CHARMM [37] and AMBER [14] are the two
most commonly used force fields for simulating nucleic acid-protein
complexes [43]. Both force fields can be applied to DNA and RNA models.
Besides the CHARMM program, other MD programs such as AMBER, GROMACS and
NAMD also support the CHARMM additive force field in MD simulations.
HyperChem, a molecular modeling software package, incorporates the
CHARMM force field under the name Bio+ force field.
Bini et al. [54] used a computational approach to select thrombin-
binding aptamers (TBA). Aptamers are usually selected by an
experimental procedure called systematic evolution of ligands by
exponential enrichment (SELEX). Before SELEX begins, random nucleic
acids are generated to form the nucleic acid library. A library
generally contains 10^14 to 10^15 different sequences. There are four
main steps in a SELEX cycle: (1) incubation with the target protein; (2)
separation of unbound nucleic acids and retention of protein-nucleic
acid complexes; (3) elution of nucleic acids (aptamers); (4) amplification
by PCR. The cycle is repeated 8-16 times, and the selected aptamer that
binds to the target protein with high specificity is then sequenced.
In the study of Bini’s group [54], they choose a
Figure 1. Docking results for thrombin (shown with ribbons) and the Best
TBA sequence reported by Bini et al. [54], obtained with the ZDOCK
algorithm. The image shows the best docking result for the two molecules.
Dots of different colors represent different docking poses. The closer
the color of a dot is to red (darkest in this gray image), the better the
docking pose.
Name of aptamer | Surface coverage of biomolecules (ng/cm^2) (AVG±SD) | ka (10^3 M^-1 s^-1) | kd (10^-3 s^-1) | KA (10^6 M^-1) | ZRANK score
Seq1 | 11.17±1.47 | 10.02 | 1.39 | 7.23 | –93.855
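The kinetic columns of the table are related by the standard definition of the equilibrium association constant, KA = ka / kd. A minimal check for the Seq1 row is sketched below; the variable names are illustrative, and the small difference from the tabulated 7.23 is presumably due to rounding of the reported rate constants.

```python
def association_constant(ka, kd):
    """Equilibrium association constant KA (M^-1) from the rate constants."""
    return ka / kd


ka_seq1 = 10.02e3  # association rate, M^-1 s^-1 (Seq1 row)
kd_seq1 = 1.39e-3  # dissociation rate, s^-1 (Seq1 row)

KA_seq1 = association_constant(ka_seq1, kd_seq1)
print(f"KA(Seq1) = {KA_seq1:.3e} M^-1")
```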
References
1. Alder, B. J., Wainwright, T. E. (1957). Phase transition for a hard sphere system,
J. Chem. Phys., 27, pp. 1208–1209.
2. Alder, B. J., Wainwright, T. E. (1959). Studies in molecular dynamics. I. general method,
J. Chem. Phys., 31, pp. 459–466.
3. Rahman, A. (1964). Correlations in the motion of atoms in liquid argon, Phys. Rev., 136,
pp. A405–A411.
4. McCammon J. A. (1976). Molecular dynamics study of the bovine pancreatic trypsin
inhibitor, In Models for Protein Dynamics, CECAM, pp. 137 (in France).
5. McCammon, J. A., Gelin, B. R., Karplus, M. (1977). Dynamics of folded proteins,
Nature, 267, pp. 585–590.
6. Arnold, K., Bordoli, L., Kopp, J., Schwede, T. (2006). The SWISS-MODEL workspace:
a web-based environment for protein structure homology modelling, Bioinformatics,
22, pp. 195–201.
7. Guex, N., Peitsch, M. C., Schwede, T. (2009). Automated comparative protein structure
modeling with SWISS-MODEL and Swiss-PdbViewer: a historical perspective,
Electrophoresis, 30 Suppl 1, pp. S162–S173.
8. Biasini, M., Bienert, S., Waterhouse, A., Arnold, K., Studer, G., Schmidt, T., Kiefer, F.,
Gallo Cassarino, T., Bertoni, M., Bordoli, L., Schwede, T. (2014). SWISS-MODEL:
modelling protein tertiary and quaternary structure using evolutionary information,
Nucleic Acids Res., 42, pp. W252–W258.
9. Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N., Sternberg, M. J. E. (2015). The
Phyre2 web portal for protein modeling, prediction and analysis, Nat. Protoc., 10, pp.
845–858.
10. Levitt, M. (1983). Computer simulation of DNA double-helix dynamics, Cold Spring
Harb. Symp. Quant. Biol., 47 Pt 1, pp. 251–262.
11. Tidor, B., Irikura, K. K., Brooks, B. R., Karplus, M. (1983). Dynamics of DNA
oligomers, J. Biomol. Struct. Dyn., 1, pp. 231–252.
12. Pérez, A., Marchán, I., Svozil, D., Sponer, J., Cheatham, T. E., Laughton, C. A., Orozco,
M. (2007). Refinement of the AMBER force field for nucleic acids: improving the
description of α/γ conformers, Biophys. J., 92, pp. 3817–3829.
13. Weiner, P. K., Kollman, P. A. (1981). AMBER: Assisted model building with energy
refinement. A general program for modeling molecules and their interactions, J.
Comput. Chem., 2, pp. 287–303.
14. Cornell, W. D., Cieplak, P., Bayly, C. I., Gould, I. R., Merz, K. M., Ferguson, D. M.,
Spellmeyer, D. C., Fox, T., Caldwell, J. W., Kollman, P. A. (1996). A second
generation force field for the simulation of proteins, nucleic acids, and organic
molecules J. Am. Chem. Soc. 1995, 117, 5179−5197, J. Am. Chem. Soc., 118, pp.
2309–2309.
15. Cheatham, T. E. 3rd, Cieplak, P., Kollman, P. A. (1999). A modified version of the
Cornell et al. force field with improved sugar pucker phases and helical repeat, J.
Biomol. Struct. Dyn., 16, pp. 845–862.
16. Tuszynska, I., Magnus, M., Jonak, K., Dawson, W., Bujnicki, J. M. (2015). NPDock:
a web server for protein–nucleic acid docking. Nucleic Acids Res., 43(Web Server
issue), pp. W425–W430.
17. Popenda, M., Błażewicz, M., Szachniuk, M., Adamiak, R. W. (2008). RNA FRABASE
version 1.0: an engine with a database to search for the three-dimensional fragments
within RNA structures, Nucleic Acids Res., 36, pp. D386–D391.
18. van Dijk, M., Bonvin, A. M. J. J. (2009). 3D-DART: a DNA structure modelling server,
Nucleic Acids Res., 37(Web Server issue), pp. W235–W239.
19. Popenda, M., Szachniuk, M., Antczak, M., Purzycka, K. J., Lukasiak, P., Bartol, N.,
Blazewicz, J., Adamiak, R. W. (2012). Automated 3D structure composition for large
RNAs. Nucleic Acids Res., 40, e112. doi: 10.1093/nar/gks339.
20. Bosshard, H. R., Marti, D. N., Jelesarov, I. (2004). Protein stabilization by salt bridges:
concepts, experimental approaches and clarification of some misunderstandings. J. Mol.
Recognit., 17, pp. 1–16.
21. Xu, D., Lin, S. L., Nussinov, R. (1997). Protein binding versus protein folding: the role
of hydrophilic bridges in protein associations, J. Mol. Biol., 265, pp. 68–84.
22. Mandel-Gutfreund, Y., Schueler, O., Margalit, H. (1995). Comprehensive analysis of
hydrogen bonds in regulatory protein DNA-complexes: in search of common principles,
J. Mol. Biol., 253, pp. 370–382.
23. Jones, S., Shanahan, H. P., Berman, H. M., Thornton, J. M. (2003). Using electrostatic
potentials to predict DNA-binding sites on DNA-binding proteins. Nucleic Acids Res.,
31, pp. 7189–7198.
24. Chen, Y. C., Lim, C. (2008). Predicting RNA-binding sites from the protein structure
based on electrostatics, evolution and geometry, Nucleic Acids Res., 36, e29. doi:
10.1093/nar/gkn008.
25. Dougherty, R. C. (1998). Temperature and pressure dependence of hydrogen bond
strength: A perturbation molecular orbital approach, J. Chem. Phys., 109, pp. 7372–
7378.
26. Greenwood, N. N. and Earnshaw, A. (1997). Chemistry of the Elements, 2nd Ed.
(Elsevier Ltd, United Kingdom).
27. Sinden, R. R. (1994). DNA Structure and Function. Sinden, R. R., Chapter 1
“Introduction to the Structure, Properties, and Reactions of DNA,” (Academic Press,
San Diego) pp. 1–57.
28. Yakovchuk, P., Protozanova, E., Frank-Kamenetskii, M. D. (2006). Base-stacking and
base-pairing contributions into thermal stability of the DNA double helix, Nucleic
Acids Res., 34, pp. 564–574.
29. Haijun, Z., Zhong-can, O.-Y. (1999). Bending and base-stacking interactions in double-
stranded semiflexible polymer, Phys. Rev. Lett., 82, pp. 4560–4563.
30. Theobald, D. L., Schultz, S. C. (2003). Nucleotide shuffling and ssDNA recognition in
Oxytricha nova telomere end-binding protein complexes, EMBO J., 22, pp. 4314–4324.
31. Strick, T. R., Allemand, J. F., Bensimon, D., Bensimon, A., Croquette, V. (1996). The
elasticity of a single supercoiled DNA molecule, Science, 271, pp. 1835–1837.
32. Saenger, W. (1984). Principles of Nucleic Acid Structure, 1st Ed. (Springer, USA).
33. Hornak, V., Abel, R., Okur, A., Strockbine, B., Roitberg, A., Simmerling, C. (2006).
Comparison of multiple Amber force fields and development of improved protein
backbone parameters, Proteins, 65, pp. 712–725.
34. Krepl, M., Cléry, A., Blatter, M., Allain, F. H. T., Sponer, J. (2016). Synergy between
NMR measurements and MD simulations of protein/RNA complexes: application to
the RRMs, the most common RNA recognition motifs, Nucleic Acids Res., 44, pp.
6452-6470. doi: 10.1093/nar/gkw438.
35. Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S., Karplus,
M. (1983). CHARMM: a program for macromolecular energy, minimization, and
dynamics calculations, J. Comput. Chem., 4, pp. 187–217.
36. Bottaro, S., Lindorff-Larsen, K., Best, R. B. (2013). Variational optimization of an all-
atom implicit solvent force field to match explicit solvent simulation data, J. Chem.
Theory Comput., 9, pp. 5641–5652.
37. MacKerell, A. D., Bashford, D., Bellott, M., Dunbrack, R. L., Evanseck, J. D., Field,
M. J., Fischer, S., Gao, J., Guo, H., Ha, S., Joseph-McCarthy, D., Kuchnir, L., Kuczera,
K., Lau, F. T., Mattos, C., Michnick, S., Ngo, T., Nguyen, D. T., Prodhom, B., Reiher,
W. E., Roux, B., Schlenkrich, M., Smith, J. C., Stote, R., Straub, J., Watanabe, M.,
Wiorkiewicz-Kuczera, J., Yin, D., Karplus, M. (1998). All-atom empirical potential for
molecular modeling and dynamics studies of proteins, J. Phys. Chem. B, 102, pp. 3586–
3616.
38. MacKerell, A. D. J., Banavali, N., Foloppe, N. (2000). Development and current status
of the CHARMM force field for nucleic acids, Biopolymers, 56, pp. 257–265.
39. Klauda, J. B., Venable, R. M., Freites, J. A., O’Connor, J. W., Tobias, D. J.,
Mondragon-Ramirez, C., Vorobyov, I., MacKerell, A. D. J., Pastor, R. W. (2010).
Update of the CHARMM all-atom additive force field for lipids: validation on six lipid
types, J. Phys. Chem. B, 114, pp. 7830–7843.
40. Raman, E. P., Guvench, O., MacKerell, A. D. J. (2010). CHARMM additive all-atom
force field for glycosidic linkages in carbohydrates involving furanoses, J. Phys. Chem.
B, 114, pp. 12981–12994.
41. Hatcher, E., Guvench, O., Mackerell, A. D. (2009). CHARMM additive all-atom force
field for aldopentofuranoses, methyl-aldopentofuranosides, and fructofuranose, J. Phys.
Chem. B, 113, pp. 12466–12476.
42. Vanommeslaeghe, K., Hatcher, E., Acharya, C., Kundu, S., Zhong, S., Shim, J., Darian,
E., Guvench, O., Lopes, P., Vorobyov, I., MacKerell, A. D. (2010). CHARMM general
force field (CGenFF): a force field for drug-like molecules compatible with the
CHARMM all-atom additive biological force fields, J. Comput. Chem., 31, pp.
671–690.
43. MacKerell, A. D., Nilsson, L. (2008). Molecular dynamics simulations of nucleic acid-
protein complexes, Curr. Opin. Struct. Biol., 18, pp. 194–199.
44. Schwede, T., Kopp, J., Guex, N., Peitsch, M. C. (2003). SWISS-MODEL: an automated
protein homology-modeling server, Nucleic Acids Res., 31, pp. 3381–3385.
45. Pieper, U., Webb, B. M., Dong, G. Q., Schneidman-Duhovny, D., Fan, H., Kim, S. J.,
Khuri, N., Spill, Y. G., Weinkam, P., Hammel, M., Tainer, J. A., Nilges, M., Sali, A.
(2014). ModBase, a database of annotated comparative protein structure models and
associated resources, Nucleic Acids Res., 42, pp. D336–D346.
46. Bhattacharya, A., Wunderlich, Z., Monleon, D., Tejero, R., Montelione, G. T. (2008).
Assessing model accuracy using the homology modeling automatically software.
Proteins, 70, pp. 105–118.
47. Roy, A., Kucukural, A., Zhang, Y. (2010). I-TASSER: a unified platform for automated
protein structure and function prediction, Nat. Protoc., 5, pp. 725–738.
48. Webb, B., Sali, A. (2016). Comparative protein structure modeling using MODELLER,
Curr. Protoc. Bioinform., 54, pp. 5.6.1-5.6.37. doi: 10.1002/cpbi.3.
49. Lu, X.-J., Olson, W. K. (2003). 3DNA: a software package for the analysis, rebuilding
and visualization of three-dimensional nucleic acid structures, Nucleic Acids Res., 31,
pp. 5108–5121.
50. Afzal, M., Shahid, A. A., Shehzadi, A., Nadeem, S., Husnain, T. (2012). RDNAnalyzer:
A tool for DNA secondary structure prediction and sequence analysis, Bioinformation,
8, pp. 687–690.
51. Gruber, A. R., Lorenz, R., Bernhart, S. H., Neuböck, R., Hofacker, I. L. (2008). The
Vienna RNA website. Nucleic Acids Res., 36, pp. W70–W74.
52. Reuter, J. S., Mathews, D. H. (2010). RNAstructure: software for RNA secondary
structure prediction and analysis, BMC Bioinformatics, 11, 129, doi: 10.1186/1471-
2105-11-129.
53. Popenda, M., Szachniuk, M., Blazewicz, M., Wasik, S.; Burke, E. K., Blazewicz, J.,
Adamiak, R. W. (2010). RNA FRABASE 2.0: an advanced web-accessible database
with the capacity to search the three-dimensional fragments within RNA structures,
BMC Bioinformatics, 11, 231. doi: 10.1186/1471-2105-11-231.
54. Bini, A., Mascini, M., Mascini, M., Turner, A. P. F. (2011). Selection of thrombin-
binding aptamers by using computational approach for aptasensor application. Biosens.
Bioelectron., 26, pp. 4411–4416.
55. Halgren, T. A. (1996). Merck molecular force field. I. Basis, form, scope,
parameterization, and performance of MMFF94, J. Comput. Chem., 17, pp. 490–519.
56. Chen, R., Li, L., Weng, Z. (2003). ZDOCK: an initial-stage protein-docking algorithm,
Proteins, 52, pp. 80–87.
57. Kumar, J. V., Chen, W. Y., Tsai, J. J. P., Hu, W. P. (2013). Molecular simulation
methods for selecting thrombin-binding aptamers, Lect. Notes Electr. Eng., 253, pp.
977–983.
58. Hu, W. P., Kumar, J. V., Huang, C. J., Chen, W. Y. (2015). Computational selection of
RNA aptamer against angiopoietin-2 and experimental evaluation, Biomed Res. Int.,
2015, 658712. doi: 10.1155/2015/658712.
59. Sarraf-Yazdi, S., Mi, J., Moeller, B. J., Niu, X., White, R. R., Kontos, C. D., Sullenger,
B. A., Dewhirst, M. W., Clary, B. M. (2008). Inhibition of in vivo tumor angiogenesis
and growth via systemic delivery of an angiopoietin 2-specific RNA aptamer, J. Surg.
Res., 146, pp. 16–23.
60. White, R. R., Shan, S., Rusconi, C. P., Shetty, G., Dewhirst, M. W., Kontos, C. D.,
Sullenger, B. A. (2003). Inhibition of rat corneal angiogenesis by a nuclease-resistant
RNA aptamer specific for angiopoietin-2, Proc. Natl. Acad. Sci. U. S. A., 100, pp.
5028–5033.
61. Setny, P., Bahadur, R. P., Zacharias, M. (2012). Protein-DNA docking with a coarse-
grained force field, BMC Bioinformatics 2012, 13, 228. doi: 10.1186/1471-2105-13-
228.
62. Kumar, J. V., Tsai, J. J. P.; Hu, W. P., Chen, W. Y. (2015). Comparative molecular
simulation method for ang2 / aptamers with in vitro studies, Int. J. Pharma Med. Biol.
Sci., 4, pp. 61–64.
63. Etheve, L., Martin, J., Lavery, R. (2015). Dynamics and recognition within a protein-
DNA complex: a molecular dynamics study of the SKN-1/DNA interaction, Nucleic
Acids Res., 44, pp. 1440–1448.
64. Colasanti, A. V., Lu, X.-J., Olson, W. K. (2013). Analyzing and building nucleic acid
structures with 3DNA, J. Vis. Exp., 2013, e4401. doi: 10.3791/4401.
Computational Methods with Applications in Bioinformatics Analysis 9in x 6in b2826-ch06
Chapter 6
Bioinformatics analysis of
microRNA and protein-protein
interaction in plant host-pathogen
interaction system
Nilubon Kurubanjerdjit
School of Information Technology, Mae Fah Luang University,
Chiang Rai 57100, Thailand
Ka-Lok Ng
Department of Bioinformatics and Medical Engineering Asia University,
Taichung 41354, Taiwan
Department of Medical Research China Medical University Hospital
China Medical University, Taichung 40402, Taiwan
6.1 Introduction
We used three resources in this study: i) a dataset of 563 confirmed
A. thaliana miRNA-target pairs obtained from the Arabidopsis Small RNA
Project Database (ASRP) [Gustafson A.M. 2005], which comprises the
interactions of 118 miRNAs and 205 mRNAs; ii) a set of 243 A. thaliana
miRNAs with their sequences, obtained from miRBase [Griffiths-Jones S.
2008]; and iii) a gene set (33,539 genes) with their mRNA FASTA
sequences, collected from The Arabidopsis Information Resource (TAIR
version 10) [Rhee SY 2003].
In addition, the genomic 3’UTR information of A. thaliana was extracted
from TAIR, and dinucleotide statistics for the 3’UTR were obtained from
Genomatix Software GmbH, Munich (see http://www.genomatix.de/). These
two pieces of information are required for the RNAHybrid calculation.
The gene annotation information was taken from the GO website.
parameter setting. The positive training set (406 pairs) consists of
experimentally confirmed pairs. The negative set, 9,938 pairs in total,
comprises the pairs that satisfied the default parameter settings of the
three algorithms, with the positive set subtracted.
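The construction of the two training sets can be sketched with set operations. This is a minimal illustration, not the authors' pipeline: it reads "satisfied the three algorithms' default parameter settings" as the intersection of the three prediction lists, which is an assumption, and all miRNA and gene identifiers are hypothetical.

```python
def build_training_sets(predicted_by_tool, confirmed_pairs):
    """Return (positive, negative) sets of (miRNA, target) pairs.

    predicted_by_tool: dict mapping a tool name to the set of pairs that
    pass its default settings. The negative set is taken as the pairs
    predicted by every tool (an assumed reading) minus the confirmed pairs.
    """
    positive = set(confirmed_pairs)
    candidates = set.intersection(*predicted_by_tool.values())
    negative = candidates - positive
    return positive, negative


# Toy data with hypothetical identifiers.
predictions = {
    "RNAHybrid": {("miR_a", "gene_1"), ("miR_a", "gene_2"), ("miR_b", "gene_3")},
    "PITA": {("miR_a", "gene_1"), ("miR_a", "gene_2"), ("miR_b", "gene_3"),
             ("miR_c", "gene_4")},
    "MiRanda": {("miR_a", "gene_1"), ("miR_a", "gene_2"), ("miR_b", "gene_3")},
}
confirmed = {("miR_a", "gene_1")}

positive, negative = build_training_sets(predictions, confirmed)
print(sorted(negative))
```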
The test set was generated by combining the three prediction scores,
computed with default parameter settings, for the set of 243 A. thaliana
miRNAs and a genome-wide set of UTRs.
When RNAHybrid returns multiple MFE values for the same miRNA-target
gene interaction, the target score is given by Equation 1,
For PITA, default parameter settings were used. The ΔΔG value is used
when there is a single binding site, while the prediction score
determined by Eq. 1 is used when there are multiple binding sites, with
mfe replaced by ΔΔG, which denotes the free energy of binding between
the miRNA and the target gene.
For MiRanda, the maximum score is used. Three parameters are required:
the threshold score, MFE, and scaling factor, which are set to 80, -14
kcal/mol and 2.0, respectively.
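The MiRanda rule can be sketched as a filter followed by a maximum. The function and variable names below are hypothetical, the hit values are toy data, and the scaling factor is not modelled because it is applied inside MiRanda when the alignment score is computed.

```python
SCORE_THRESHOLD = 80.0  # minimum alignment score, as stated above
MFE_THRESHOLD = -14.0   # kcal/mol; hits must be at or below this energy


def miranda_pair_score(hits):
    """Return the maximum score among hits that pass both thresholds.

    hits: iterable of (score, energy) tuples for one miRNA-target pair.
    Returns None when no hit passes.
    """
    passing = [score for score, energy in hits
               if score >= SCORE_THRESHOLD and energy <= MFE_THRESHOLD]
    return max(passing) if passing else None


# Toy hits for one pair: the 120.0 hit fails the energy threshold.
example_hits = [(95.0, -16.2), (120.0, -10.0), (88.0, -20.1)]
print(miranda_pair_score(example_hits))
```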
6.2.1.7 Results
6.2.2 Study II: Prediction of PPI between A. thaliana and Xcc based
on protein DDI and interolog approaches
Host-pathogen PPI were identified using two pipelines, the DDI and the
interolog approaches, as shown in the system flowchart in Fig. 3.
Subsequently, the sets of predicted PPI from both methods were subjected
to enrichment analysis via DAVID [Huang D.W. 2009]. In addition,
information about PRG, TF and Xcc effector proteins was integrated into
our system.
[Fig. 3 flowchart with two branches, Domain-Domain Interaction (DDI) and
Interolog. Recoverable labels: Xcc domain (Uniprot); Search for DDI;
BlastP, e-value <= 10E-4; confirmed PPIs (DIP); filter: e-value <= 10E-70,
identity >= 50%, ALC >= 80%; predicted PPIs; enrichment analysis (DAVID).]
Fig. 3 System flowchart of PPI prediction between A. thaliana and Xcc.
Prediction is based on (i) the domain-based and (ii) the interolog
approaches [Kurubanjerdjit N. 2013b]
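The interolog branch of the flowchart can be sketched as a threshold filter on BLAST hits followed by a transfer of confirmed interactions. This is a minimal sketch under stated assumptions: the record layout, function names and identifiers are hypothetical, "ALC" is not defined in this excerpt and is treated simply as a percentage, and the e-value cutoff is kept literally as printed ("10E-70").

```python
from dataclasses import dataclass

E_VALUE_MAX = 10e-70  # "10E-70" as printed in the flowchart
IDENTITY_MIN = 50.0   # percent
ALC_MIN = 80.0        # percent; meaning of ALC assumed, not defined here


@dataclass(frozen=True)
class BlastHit:
    query: str     # A. thaliana or Xcc protein
    subject: str   # protein from the confirmed-PPI set (DIP)
    evalue: float
    identity: float
    alc: float


def passes_filter(hit):
    """Apply the three thresholds from the flowchart filter box."""
    return (hit.evalue <= E_VALUE_MAX
            and hit.identity >= IDENTITY_MIN
            and hit.alc >= ALC_MIN)


def predict_interologs(host_hits, pathogen_hits, confirmed_ppis):
    """Predict (host, pathogen) pairs whose filtered homologs interact."""
    known = {frozenset(pair) for pair in confirmed_ppis}
    host_ok = [h for h in host_hits if passes_filter(h)]
    pathogen_ok = [p for p in pathogen_hits if passes_filter(p)]
    return sorted({(h.query, p.query)
                   for h in host_ok for p in pathogen_ok
                   if frozenset((h.subject, p.subject)) in known})


# Toy data with hypothetical identifiers.
host_hits = [BlastHit("AT_host_1", "DIP_A", 1e-90, 62.0, 91.0),
             BlastHit("AT_host_2", "DIP_A", 1e-20, 70.0, 95.0)]
pathogen_hits = [BlastHit("XCC_1", "DIP_B", 1e-80, 55.0, 85.0)]
confirmed = [("DIP_A", "DIP_B")]

print(predict_interologs(host_hits, pathogen_hits, confirmed))
```

Storing confirmed interactions as unordered pairs means the order in which the database lists the two partners does not matter.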
PFam domain pairs. The known DDI were gathered from the iPFam and 3DID
databases. Three sets of input were merged: i) a set of 1,555 A. thaliana
PFam domains, ii) a set of 304 Xcc PFam domains, and iii) a collection of
7,039 known DDI recorded in iPFam and 3DID.
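The merge of the three inputs can be sketched as follows: a host protein and a pathogen protein are predicted to interact when at least one of their domain pairs appears in the known DDI collection. This is an illustrative sketch rather than the authors' implementation; the function name, data layout and all identifiers are hypothetical.

```python
from itertools import product


def predict_ppi_by_ddi(host_domains, pathogen_domains, known_ddi):
    """Return (host, pathogen, supporting domain pairs) predictions.

    host_domains / pathogen_domains: dict protein -> set of domain IDs.
    known_ddi: iterable of (domain, domain) pairs, order-insensitive.
    """
    ddi = {frozenset(pair) for pair in known_ddi}
    predictions = []
    for (host, h_doms), (pathogen, p_doms) in product(
            host_domains.items(), pathogen_domains.items()):
        support = sorted((a, b) for a in h_doms for b in p_doms
                         if frozenset((a, b)) in ddi)
        if support:
            predictions.append((host, pathogen, support))
    return predictions


# Toy data with hypothetical identifiers.
host = {"AT_host_1": {"PF_A", "PF_B"}, "AT_host_2": {"PF_C"}}
pathogen = {"XCC_1": {"PF_D"}, "XCC_2": {"PF_E"}}
known = [("PF_D", "PF_A"), ("PF_C", "PF_C")]

for prediction in predict_ppi_by_ddi(host, pathogen, known):
    print(prediction)
```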
Xcc effector proteins and type III secretion system effector proteins
were identified in this study. To identify effector proteins, a set of
Xcc proteins was submitted to EffectiveT3 (http://www.effectors.org/)
[Jehl M.A. 2011] with the default parameter settings: i) the organism type
was set to gram-negative; ii) the classification module was set to type III
effector prediction of the plant set; iii) the cutoff was set to 0.999;
and iv) the domain score was set to 4.0.
Furthermore, to identify type III secretion system effector proteins, a
set of Xcc proteins was submitted to the ModLab system, which identifies
type III secretion system (T3SS) signals in amino acid sequences. The
default parameter settings were used in the prediction: i) the
prediction method was set to “neural network”; ii) the sequence
truncation at the N and C terminals was set to 1 and 30, respectively;
and iii) the neural network threshold was set to 0.4.
ModLab (Molecular Design Laboratory: http://gecco.org.chemie.uni-
frankfurt.de/index.html) [Lower M. 2009] is a prediction system for
identifying type III secretion system (T3SS) signals in amino acid
sequences.
6.2.2.7 Results
E.A. 2000] indicates that a pathogen mutates its genes to infect the host,
whereas the plant defends against the attacks by expanding its gene
families.
Xcc effector predicted by EffectiveT3 and the type III secretion system
effector prediction (ModLab) system
Using the EffectiveT3 tool, two Xcc proteins (P58892 and Q8PC32) were
predicted as bacterial secreted proteins. Jiang [Jiang B.L. 2008]
reported that a mutation in the dsbB gene (Q8PC32) can result in
ineffective type II and type III secretion systems. In addition, another
two Xcc proteins (Q8PBK7 and P22260) were predicted as type III effector
proteins. Hsiao [Hsiao Y.M. 2005] demonstrated that P22260 (Clp)
up-regulates the transcription of the engA gene, which encodes a
virulence factor in Xcc, by binding directly to the upstream tandem Clp
sites.
The ModLab software identified four type III effector proteins (Q8P7S1,
P22260, Q8PAK9 and Q8P815). Interestingly, P22260 (CRP-like protein) was
identified as a type III effector protein by both predictors and is also
recorded by UniProt as a pathogenesis effector protein, as it undergoes
specific processes that generate the ability of an organism to cause
disease. Furthermore, Q8P815 is involved in the plant-pathogen
interaction pathway recorded by KEGG. This finding is also consistent
with the report of Buell [Buell C.R. 2002] indicating that the genes
involved in the resistance response can be classified into three
classes: (1) R genes, which are involved in the recognition of the
pathogen; (2) signal transduction genes; and (3) defense response genes,
which are involved in the suppression of pathogen development.
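The overlap between the two predictors can be checked with a simple set intersection. The accession lists below are taken from the results above; the variable names are illustrative.

```python
# Type III effector predictions reported above (UniProt accessions).
effectivet3_t3_effectors = {"Q8PBK7", "P22260"}
modlab_t3_effectors = {"Q8P7S1", "P22260", "Q8PAK9", "Q8P815"}

# Proteins supported by both predictors, and by at least one of them.
consensus = effectivet3_t3_effectors & modlab_t3_effectors
either = effectivet3_t3_effectors | modlab_t3_effectors

print("both predictors:", sorted(consensus))
print("at least one predictor:", sorted(either))
```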
[Fig. 4 flowchart. Recoverable labels: blastP with e-value <= 0.005 and
identity >= 25%; blastP with e-value <= 0.0001 and identity >= 25%.]
Fig. 4 System flowchart of PDI prediction between Xcc effector and A.
thaliana miRNA. Prediction was based on (i) the interolog and (ii) the
alignment of homolog TFBS profile matrix approaches [Kurubanjerdjit N.
2014].
6.2.3.9 Results
References
Adamcsek B., Pella. G., Farkas IJ., Derenyi I., Vicsek T., (2006). “CFinder: locating
cliques and overlapping modules in biological networks.” BMC Bioinformatics
22: 1021-1023.
Arifuzzaman M., Maeda M., Itoh A., Nishikata K., Takita C., Saito R., Ara T.,
Nakahigashi K., Huang H. C., Hirai A., Tsuzuki K., Nakamura S., Altaf-Ul-Amin
M., Oshima T., Baba T., Yamamoto N., Kawamura T., Ioka-Nakamichi T.,
Kitagawa M., Tomita M., Kanaya S., Wada C., Mori H., (2006). “Large-scale
identification of protein-protein interaction of escherichia coli k-12.” Genome
Research 16(5): 686-691.
Babitha M.P., Bhat. S. G., Prakash H.S., Shetty H.S., (2002). “Differential induction of
superoxide dismutase in downy mildew resistant and susceptible genotypes of
pearl millet.” Plant Pathol 15(4): 480-486.
Banjerdkit P., Vattanaviboon P., Mongkolsuk S., (2005). “Exposure to cadmium elevates
expression of genes in the OxyR and OhrR regulons and induces cross-resistance
to peroxide killing treatment in Xanthomonas campestris.” Appl Environ Microbiol
71(4): 1843-1849.
Bartel, D. P. (2004). “MicroRNAs: genomics, biogenesis, mechanism, and function.”
Cell 116(2): 281-297.
Buell C.R. (2002). “Interactions between xanthomonas species and Arabidopsis
thaliana.” Arabidopsis Book 1: e0031
Casadevall A., Pirofski, L. A. (2000). “Host-pathogen interactions: basic concepts of
microbial commensalism, colonization, infection, and disease.” Infection and
Immunity 68(12): 6511-6518.
Claverie J.M., C. N. (2006). Bioinformatics for Dummies, Wiley Publishing.
Dai X, Zhuang. Z., Zhao PX (2010). “Computational analysis of miRNA targets in
plants: current status and challenges.” Briefings in Bioinformatics 12(2): 115-121.
de-Jong H., Pietersma. H., Cordes M., Kuipers O.P., Kok J., (2012). “PePPER: a
webserver for prediction of prokaryote promoter elements and regulons.” BMC
Genomics 13(1): 299.
Fones H., Davis. C. A., Rico A., Fang F., Smith J.A., Preston G.M., (2010). “Metal
hyperaccumulation armors plants against disease.” PLoS Pathog 6(9): 1.
Franza T., Mahe. B., Expert D., (2005). “Erwinia chrysanthemi requires a second iron
transport route dependent of the siderophore achromobactin for
extracellular growth and plant infection.” Mol Microbiol 55: 261-275.
Griffiths-Jones S., Saini. H. K., Dongen S.V., Enright A.J., (2008). “miRBase: tools for
microRNA genomics.” Nucleic Acids Research 36: D154-D158.
Gustafson A.M., Allen. E., Givan S., Smith D., Carrington J.C., Kasschau K.D., (2005).
“ASRP: the Arabidopsis Small RNA Project Database.” Nucleic Acids Research
33: D637-D640.
He F., Zhang. Y., Chen H., Zhang Z., Peng Y.L., (2008). “The prediction of protein-
protein interaction networks in rice blast fungus.” BMC Genomics 9: 519.
Flor HH. (1971). “Current status of the gene-for-gene concept.” Annu Rev Phytopathol
9: 275-296.
Hsiao Y.M., Liao. H. Y., Lee M.C., Yang T.C., Tseng Y.H., (2005). “Clp upregulates
transcription of engA gene encoding a virulence factor in xanthomonas campestris
by direct binding to the upstream tandem Clp sites.” Febs Lett 579: 3525-3533.
Huang D.W., Sherman. B. T., Lempicki R.A., (2009). “Systematic and integrative
analysis of large gene lists using DAVID bioinformatics resources.” Nat Protoc 4:
44–57.
Ito T., Chiba. T., Ozawa R., Yoshida M., Hattori M., Sakaki Y., (2001). “A
comprehensive two-hybrid analysis to explore the yeast protein interactome.” P
Natl Acad Sci USA 98(8): 4569-4574.
Ito T., T. K., Muta S., Ozawa R., Chiba T., Nishizawa M., Yamamoto K., Kuhara S.,
Sakaki Y., (2000). “Toward a protein-protein interaction map of the budding
yeast: a comprehensive system to examine two-hybrid interactions in all possible
combinations between the yeast proteins.” P Natl Acad Sci USA 97(3): 1143-
1147.
Jehl M.A., A. R., Rattei T., (2011). “Effective-a database of predicted secreted bacterial
proteins.” Nucleic Acids Research 39: D591-595.
Jiang B.L., L. J., Chen L.F., Ge Y.Y., Hang X.H., He Y.Q., Tang D.J., Lu G.T., Tang
J.L., (2008). “DsbB is required for the pathogenesis process of Xanthomonas
campestris pv. campestris.” Mol Plant Microbe In 21(8): 1036-1045.
Jonsson P., C. T., Zicha D., Bates P., (2006). “Cluster analysis of networks generated
through homology: automatic identification of important protein communities
involved in cancer metastasis.” BMC Bioinformatics 7: 2.
Kim S.K., N. J. W., Rhree J.K., Lee W.J., Zhang B.T., (2006). “miTarget: microRNA
target gene prediction using a support vector machine.” Bioinformatics 7(1): 441.
Kurubanjerdjit N., Huang C.H., Lee Y.L., Tsai Jeffrey J.P, Ng K.L. (2013a). “Prediction
of microRNA-regulated protein interaction pathway in Arabidopsis using machine
learning algorithms,” Computers in Biology and Medicine, 43(11), 1645-1652.
Kurubanjerdjit N., Tsai Jeffrey J.P, Sheu C.Y., Ng K.L. (2013b). “The prediction of
protein-protein interaction of A. thaliana and X. campestris pv. campestris based
on protein domain and interolog approaches”, Plant OMICS, 6(6), 388-398.
Kurubanjerdjit N., Tsai Jeffrey J.P, Huang C.H., Ng K.L. (2014). Disturbance of A.
thaliana microRNA-regulated pathways by Xcc bacterial effector proteins, Amino
Acids 46(4), pp. 953-961.
Li Z.G., H. F., Zhang Z., Peng Y.L., (2011). “Prediction of protein-protein interactions
between ralstonia solanacearum and arabidopsis thaliana.” Amino Acids 42(6):
2363-2371.
Lin N., W. B., Jansen R., Gerstein M., Zhao H., (2004). “Information assessment on
predicting protein-protein interactions.” BMC Bioinformatics 5: 154.
Lower M., S. G. (2009). “Prediction of type III secretion signals in genomes of gram-
negative bacteria.” PLoS One 4(6): 1.
Mandoli D.F., O. R. (2000). “The importance of emerging model systems in plant
biology.” J Plant Growth Regul 19(3): 249-252
Matys V, K.-M. O., Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D,
Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel
AE, Wingender E., (2006). “TRANSFAC and its module TRANSCompel:
transcriptional gene regulation in eukaryotes.” Nucleic Acids Research 34: D108-
110.
Meyer D., L. E., Roby D., Arlat M., Kroj T., (2005). “Optimization of pathogenicity
assays to study the arabidopsis thaliana–xanthomonas campestris pv. campestris
pathosystem.” Mol Plant Pathol 6(3): 327-333.
Morgan T.D., B. P., Kramer K.J., Basibuyuk H.H., Quicke D.L.J., (2002). “Metals in
mandibles of stored product insects: do zinc and manganese enhance the ability of
larvae to infest seeds?" J Stored Prod Res 39: 65-75.
Norambuena T., M. F. (2010). “The Protein-DNA Interface database.” BMC
Bioinformatics 11: 262.
Palla G., D. I., Farkas I., Vicsek T., (2005). “Uncovering the overlapping community
structure of complex networks in nature and society.” Nature 435: 814-818.
Pinzon A., R.-R. L. M., Gonzalez A., Bernal A., Restrepo S., (2010). “Targeted
metabolic reconstruction: a novel approach for the characterization of plant-
pathogen interactions.” Brief Bioinform 12(2): 151-162.
Rhee SY, B. W., Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala
E, Lander G, Montoya M, Miller N, Mueller LA, Mundodi S, Reiser L, Tacklind
J, Weems DC, Wu Y, Xu I, Yoo D, Yoon J, Zhang P., (2003). “The arabidopsis
information resource (TAIR): a model organism database providing a centralized,
curated gateway to arabidopsis biology, research materials and community.”
Nucleic Acids Research 31(1): 224-228.
Rolke Y., L. S., Quidde T., Williamson B., Schouten A., Weltring K.M., Siewers V.,
Tenberge K.B., Tudzynski B., Tudzynski P., (2004). “Functional analysis of
H2O2-generating systems in Botrytis cinerea: the major Cu-Zn-superoxide
dismutase (BCSOD 1) contributes to virulence on French bean, whereas a glucose
oxidase (BCGOD1) is dispensable.” Mol Plant Pathol 5: 17-27.
Sanseverino W., R. G., De-Simone M., Faino L., Melito M., Stupka E., Frusciante L.,
Ercolano M.R., (2010). “PRGdb: a bioinformatics platform for plant resistance
gene analysis.” Nucleic Acids Research 38: D814–D821.
Stahl E.A., B. J. G. (2000). “Plant-pathogen arms races at the molecular level.” Curr
Opin Plant Biol 3(4): 299-304.
Tang D.J., L. X. J., He Y.Q., Feng J.X., Chen B., Tang J.L., “The zinc uptake regulator
Zur is essential for the full virulence of Xanthomonas campestris pv campestris.”
Mol Plant Microbe Interact 18: 652-658.
Tsao T.H., C. C. H., Huang Chi-Yang F., Lee S.A., (2011). Systems and computational
biology-molecular and cellular experimental systems. In: Prof.Ning-Sun Yang
(ed) The prediction and Analysis of Inter- and Intra-Species Protein-Protein
Interaction. China, InTech.
Tsuji J., S. S. C. (1988). “Xanthomonas campestris pv. campestris induced chlorosis in
Arabidopsis thaliana.” Arabidopsis Information Service 26: 1-8.
Tsuji J., S. S. C. (1992). “First report of the natural infection of arabidopsis thaliana by
xanthomonas campestris pv. campestris.” Plant Dis 76: 539.
Tsuji J., S. S. C., Hammerschmidt R., (1991). “Identification of a gene in arabidopsis
thaliana that controls resistance to xanthomonas campestris pv. campestris.”
Physiol Mol Plant P 38: 57-65.
Tucker S.L., T. T. R., Tasker K., Jacob C., Giles G., Egan M., Talbot N.J., (2004). “A
fungal metallothionein is required for pathogenicity of Magnaporthe grisea.”
Plant Cell 16: 1575-1588.
Uetz P., G. L., Cagney G., Mansfield T.A., Judson R.S., Knight J.R., Lockshon D.,
Narayan V., Srinivasan M., Pochart P., Qureshi-Emili A., Li Y., Godwin B.,
Conover D., Kalbfleisch T., Vijayadamodar G., Yang M., Johnston M., Fields S.,
Rothberg J.M., (2000). “A comprehensive analysis of protein-protein interactions
in Saccharomyces cerevisiae.” Nature 403(6770): 623-627
Yu H., L. N. M., Lu H.X., Zhu X., Xia Y., Han J.D., Bertin N., Chung S., Vidal M.,
Gerstein M., (2004). “Annotation transfer between genomes: protein-protein
interologs and protein-dna regulogs.” Genome Research 14(6): 1107-1118.
Zhang Z., Y. J., Li D., Zhang Z., Liu F., Zhou X., Wang T., Ling Y., Su Z., (2010).
“PMRD: plant microRNA database.” Nucleic Acids
Research 38: D806-D813.
Chapter 7
Kung-Hao Liang
Medical Research Department
Taipei Veterans General Hospital
Abstract
In the human genome, 10% of the nucleotide sequence consists of Alus,
which are retrotransposon-produced genomic repeats. Despite occasional
evidence of Alu-induced genetic diseases, it has remained a mystery
whether these Alu elements play substantial physiological roles
commensurate with their proportion in the human genome. Recently,
cytosolic mature RNAs carrying sense and antisense Alus, corresponding
to more than 1300 protein-coding genes and various non-coding genes,
were shown to form a network of mutual regulation. Messenger RNA
transcripts of genes pertinent to immunological Th17 maturation,
including CCL5, CCR6, IL23R, IL2RA, IL1R1, CD28 and REL, consistently
carry Alus in the sense direction. In contrast, other immunological
genes, such as CXCL16, IFNAR2, CD302, CDH1, IL28RA (a.k.a. IFNLR1) and
JAK3, all carry Alus only in the antisense direction. The Alu sequences
facilitate RNA-RNA interactions, resulting in an RNA regulatory network
that enables computational modelling of cellular state transitions. The
transition from naïve T cells to mature Th17 cells is largely controlled
by the relative transcriptional rates of genes carrying sense and
antisense Alus, whose levels show an inherent inverse relationship at
equilibrium.
7.1 Background
Until now, the known molecular pathway of Th17 activation has been rather
complex. It comprises gene expression, cytokine stimulation, signal
transduction and protein complex formation, which together involve a
large number of genes interconnected through intertwined positive and
negative feedback loops. Under such conditions, the dynamic behaviour
becomes less intuitive. Regrettably, the molecular pathways still seem
incomplete, which prevents computational modelling of the dynamics from
cellular state A to state B.
The human genome encodes the blueprint of the immune system. One tenth
of the human genome is composed of a single class of non-coding
elements, the short interspersed Alu repeats, which are found only in
primate genomes [3, 4]. Recently, an unconventional view of Alu's
regulatory role was proposed. Human messenger RNA transcripts
carrying Alu elements in two opposite directions were demonstrated to
7.2 Methods
Fig. 1. A schematic diagram of the Alu-mediated regulation network. RNA transcripts
carrying antisense Alus (species A) and sense Alus (species B) hybridise and form a
duplex structure (species C). Two biological properties were modelled: the
expression rates of the two species (Ka and Kb). The biochemical properties in this model
include Kc (the hybridisation rate of A and B), Ga, Gb and Gc (degradation rates), Rca and Rcb
(the proportions of C that become guide strands for forming the RISC complex) and RF
(the number of transcripts silenced per guide strand).
Rcb), and the ratio factor (RF), which is the number of mRNA transcripts
silenced by one guide strand. The derivative of A is the net result of
the increment (controlled by Ka) and the decrement (A degradation,
binding of A and B, and silencing of A) (Eq. 1). The derivative of B
is calculated in a similar way (Eq. 2). The derivative of C is the
amount of A and B binding, minus the amount of C degraded and the
amount of C transformed into guide strands that inhibit A and B
(Eq. 3).
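$$\frac{dA}{dt} = K_a - G_a A - K_c A B - RF \cdot R_{ca} C \qquad (1)$$

$$\frac{dB}{dt} = K_b - G_b B - K_c A B - RF \cdot R_{cb} C \qquad (2)$$

$$\frac{dC}{dt} = K_c A B - G_c C - R_{ca} C - R_{cb} C \qquad (3)$$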
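The three rate equations (Eqs. 1 to 3) can be integrated numerically to see how the equilibrium levels respond to the transcription rates. The following is a minimal sketch rather than the original implementation: it uses a simple forward-Euler scheme, takes the parameter values quoted in the caption of Fig. 3, and assumes RF = 1 because no value of RF is stated in the text. The function names, the step size and the integration time are likewise illustrative choices.

```python
def alu_derivatives(a, b, c, p):
    """Right-hand sides of Eqs. 1-3 for species A, B and C."""
    binding = p["Kc"] * a * b                                      # A-B hybridisation
    da = p["Ka"] - p["Ga"] * a - binding - p["RF"] * p["Rca"] * c  # Eq. 1
    db = p["Kb"] - p["Gb"] * b - binding - p["RF"] * p["Rcb"] * c  # Eq. 2
    dc = binding - p["Gc"] * c - p["Rca"] * c - p["Rcb"] * c       # Eq. 3
    return da, db, dc


def simulate_to_equilibrium(kb, rf=1.0, t_end=20.0, dt=0.001):
    """Forward-Euler integration from A = B = C = 0 until t_end.

    Parameter values follow the caption of Fig. 3; rf = 1.0 is an
    assumption, since RF is not given a numerical value in the text.
    """
    p = {"Ka": 3.5, "Kb": kb, "Ga": 3.5, "Gb": 3.5, "Gc": 0.5,
         "Kc": 1.0, "Rca": 0.25, "Rcb": 0.25, "RF": rf}
    a = b = c = 0.0
    for _ in range(int(t_end / dt)):
        da, db, dc = alu_derivatives(a, b, c, p)
        a, b, c = a + dt * da, b + dt * db, c + dt * dc
    return a, b, c


if __name__ == "__main__":
    for kb in (1.75, 3.5, 7.0):
        a, b, c = simulate_to_equilibrium(kb)
        print(f"Kb={kb:5.2f}  A={a:.4f}  B={b:.4f}  C={c:.4f}")
```

Under these assumptions, raising Kb while holding Ka fixed raises the equilibrium level of B and lowers that of A, which is the qualitative inverse relationship described for the model.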
7.3 Results
7.3.1 Sense-Alu carrying RNAs were co-activated during Th17
maturation
Fig. 2. Differences of RNA levels between Th17 cells and naïve CD4+ T cells.
Immunological genes tagged by sense-Alu were up-regulated while those tagged by
antisense Alu were down-regulated during the Th17 state transition.
Table 1. Levels of sense and antisense Alu carrying mRNAs in response to Th17
maturation.
Gene Symbol   Alu directions   Refseq accession number                  RNA level difference
CXCL16        Antisense        NM_022059.2                              –607
IFNAR2        Antisense        NM_207585.1                              –581
CD302         Antisense        NM_014880.4                              –170
CDH1          Antisense        NM_004360.3                              –151
IL28RA        Antisense        NM_170743.2; NM_173064.1; NM_173065.1    –137
JAK3          Antisense        NM_000215.3                              –13
REL           Sense            NM_002908.2                              39
CD28          Sense            NM_006139.2                              104
IL1R1         Sense            NM_000877.2                              140
IL2RA         Sense            NM_000417.2                              510
IL23R         Sense            NM_144701.2                              710
CCR6          Sense            NM_031409.3; NM_004367.5                 1845
CCL5          Sense            NM_002985.2                              2241
Fig. 3. An inherent inverse relationship of A and B levels at equilibrium. The levels are
determined when the biochemical parameters are constant (Ga=Gb=3.5; Kc=1; Gc=0.5;
Rca=0.25; Rcb=0.25). The transcription rate of A is also kept constant (Ka=3.5).
Different Kb are given. The activation of species B (controlled by the parameter of Kb) is
accompanied by a decrease of species A.
We then examined what happens when exogenous RNA is given while Kb/Ka
is unchanged. As the Th17-related genes CCL5, CCR6, IL23R, IL2RA,
IL1R1, CD28 and REL all carry sense Alus, exogenous antisense Alu RNA
may concurrently suppress these genes. The time-course levels of A, B
and C upon one treatment with exogenous antisense Alu (at time point
#5) are shown in Fig. 4. Levels of A, B and C are stable before the
treatment owing to a fixed Ka/Kb ratio of 1. The exogenous antisense
Alu peaks at time point #6. The increase in double-stranded RNA (C)
and the decrease in the level of B are due to the binding of the
exogenous nucleotides to sense-Alu-carrying genes (B).
Antisense-Alu-carrying genes (A) are also slightly suppressed, owing
to the elevated double-stranded RNA loaded into the RISC machinery.
The suppression of B is more prominent than that of A. The treatment
effect is transient: levels of A, B and C revert after the exogenous
Alus are exhausted.
Th17 maturation can be seen as a state transition process in which
naïve T cells and mature Th17 cells are the two cellular steady
states.
[Figure: stacked time-course panels (time 0 to 30) of A, B, C and exogenous antisense Alu; annotation: Treatment window]
[Figure: stacked time-course panels (time 0 to 30) including B, C and exogenous antisense Alu; annotation: Repetitive treatment]
7.4 Discussion
References
Chapter 8
Mitsuo Iwadate
Department of Biological Sciences, Chuo University, Tokyo 112-8551, Japan
Hideaki Umeyama
The School of Pharmacy, Kitasato University, Tokyo 108-8641, Japan
Yoshiki Murakami
Department of Hepatology, Osaka City University Graduate School of
Medicine, Osaka, Japan
8.1 Introduction
In bioinformatics analysis, it is very common to have more features than
samples. A typical task is to analyze a gene expression profile composed of
tens of thousands of genes and fewer than a hundred samples. In this case, it
is unrealistic to assume that all genes contribute to the phenomenon that you
would like to investigate, which can be a disease, a reaction to some drug,
or anything else. You therefore usually want to identify a limited number of
genes that truly contribute to the phenomenon of interest. But how? A natural
approach is to select genes that successfully discriminate the samples of
interest from those regarded as controls, or simply to select genes that are
more highly or more weakly expressed in the samples of interest than in the
controls.
However, this strategy might not yield an appropriate set of genes,
because the class labels may not always reflect the true source of the
difference. For example, the target samples may include more females than
the controls, or the disease samples may include more elderly patients than
the healthy controls. In that case, successful discrimination or
over/under-expression may be due not to the factor of interest but to an
unintended imbalance, between the samples of interest and the controls, in
some factor outside your interest. One may think that this is not a problem
at all, since one can select equal numbers of males and females in the two
classes, or match the mean age between the two classes. However, this does
not solve the problem completely either. The ratio of smokers or frequent
drinkers may not be the same between the two classes. In practice, it is
impossible to prepare samples in which all features other than those of
interest follow the same distribution in the two classes.
Unsupervised methodologies may resolve this difficulty since, in contrast
to supervised methodologies, they can classify samples (or find clusters of
samples) without access to class labels. Although it is reasonable to worry
that unsupervised classification is less capable because of its possible
vulnerability to noise, unsupervised methods are in return less likely to be
affected by mislabeling. In the above examples of unintentionally
unbalanced distributions of unfocused features, the primary clustering
(classification) will coincide not with the labeling of interest but with
the unfocused classification, e.g., age or sex. We then have the opportunity
to notice that the samples are highly unbalanced in some unfocused features.
Alternatively, if no clustering (classification) coincides with the labeling
of interest, we have the opportunity to terminate the study and go back to
the beginning of the project to prepare new samples with balanced features.
However, this strategy runs counter to the recent trend of seeking
criteria that classify samples according to a given labeling. These
criteria are usually implemented in some model. Such so-called model-based
approaches are usually powerful at excluding features unrelated to the
discrimination between target and control samples and at identifying the
critical features necessary to distinguish target samples from control
samples. However, as mentioned at the beginning of this chapter, in
bioinformatics analysis we very often do not have enough samples to train
models that classify two classes well. In that case, we cannot expect
model-based approaches to yield a limited number of features that classify
the two classes. Another difficulty of model-based approaches arises when
there are more than two classes and no prior knowledge about the
relationships among them. For example, suppose that we have to treat
time-series data. In this case, each time point corresponds to a class,
among which no information about
Before discussing our methodology, we briefly review studies that make use
of PCA for gene selection. The most popular strategy is to identify a limited
number of features that preserve the primary structure of PCA [Wang and Gehan
(2005); Krzanowski (1987)]. In this regard, PCA is not a tool but rather
a purpose. This is opposed to our strategy, which often identifies features
based upon minor PCs that contribute less (see the following application
examples). Thus, it differs in principle from our methodology in spite of
apparent similarities.
Alternatively, some studies make use of PCA to identify genes.
Jonnalagadda and Srinivasan [Jonnalagadda and Srinivasan (2008)] tried to
identify differentially expressed genes based upon their contributions to
PCs, which is very similar to our methodology. The potential difference
between ours and theirs is that they evaluated the difference in
contributions between two classes. In our methodology, as can be seen
below, we never compute the difference between two classes directly, but
identify genes as outliers along specified PCs. We also use the difference
between two classes to identify the PCs used for outlier identification, and
P-values are evaluated assuming χ² distributions. In spite of the apparent
similarity between ours and theirs, there are differences in principle.
First of all, since they performed pairwise comparisons, extension towards
categorical multiclass problems is unclear, although
In this section, we explain the basic procedure to apply the principal com-
ponent analysis (PCA) based unsupervised feature extraction (FE) to gene
expression/promoter methylation profiles.
of eigenvectors and eigenvalues is the smaller of N and M;
in what follows M < N, so the number of eigenvalues/eigenvectors
is M.
Although one may wonder whether these two PCAs, sample embedding and
gene embedding, are equivalent, they differ from each other because of
the distinct mean subtraction, $\sum_i x_{ij} = 0$ or $\sum_j x_{ij} = 0$.
Since PCA is the diagonalization of a product of X with its transpose, the
effect of the distinct mean subtraction is nonlinear. Therefore, in general
there is no way to infer the results of gene embedding from those of sample
embedding, and vice versa.
multiple cancers using a single criterion. Figure 8.1 shows how
miRNA-mRNA interactions are identified using PCA based unsupervised FE.
First, mRNA and miRNA expression profiles were pre-screened separately using
PCA based unsupervised FE. Then, mRNAs/miRNAs expressed significantly
and distinctly between tumors and normal tissues were further screened.
Finally, among those pre-screened, miRNA-mRNA interactions were
identified using TargetScan [Agarwal et al. (2015)], which is one of the most
reliable sequence-based databases for miRNA-mRNA interaction inference.
Fig. 8.1 Schematic of miRNA-mRNA interactions using PCA based unsupervised FE.
Tables 8.2 and 8.3 show summaries of the pre-screened miRNAs and
mRNAs and of the identified miRNA-mRNA interactions,
respectively (for more details, including the list of pairs identified, see
the original study [Taguchi (2016a)]). As can be seen, although we used a
single adjusted P-value threshold throughout, the numbers of pre-screened
miRNAs and mRNAs do not vary drastically even when the number of
samples varies. Although the samples are not matched and the measurements
were performed on diverse platforms (microarrays), we successfully
identified miRNA-mRNA pairs for all cancers investigated.
This demonstration illustrates that P-value estimation based on features
(mRNAs and miRNAs) rather than on samples is very useful and well suited to
biological research.
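The feature-wise P-value estimation mentioned above can be illustrated with a short sketch. It is not the authors' code. It assumes that the PC scores of the features along one selected PC are standardized, that the squared standardized score is compared with a χ² distribution with one degree of freedom, and that the P-values are adjusted by the Benjamini-Hochberg procedure; the standardization, the degrees of freedom, the threshold of 0.01 and all function names are assumptions made for illustration.

```python
import math


def chi2_sf_1dof(x):
    """Upper-tail probability P(chi-squared_1 > x), via the complementary error function."""
    return math.erfc(math.sqrt(x / 2.0))


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_outliers(scores, threshold=0.01):
    """Indices of features whose adjusted P-value along one PC is below threshold.

    `scores` holds one PC score per feature (e.g. per gene); the
    standardization by mean and standard deviation is an assumption.
    """
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    pvals = [chi2_sf_1dof(((s - mean) / sd) ** 2) for s in scores]
    adjusted = bh_adjust(pvals)
    return [i for i, q in enumerate(adjusted) if q < threshold]
```

Because the P-values are attributed to features rather than to samples, the same threshold can be reused across data sets with different numbers of samples, which is the property exploited in the comparison across cancers.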
Table 8.2 Summary of identification of miRNAs and mRNAs screened by PCA based
unsupervised FE.
Table 8.3 Identification of mRNAs targeted by miRNAs, using TargetScan. "Pairs" shows
the number of miRNA-mRNA pairs included in TargetScan among the miRNAs and mRNAs
selected in Table 8.2. The numbers of miRNAs and mRNAs are those within pairs.
Cancers   Pairs       miRNA     mRNA
HCC       20 (9)      13 (13)   18 (16)
NSCLC     311 (184)   27 (27)   113 (72)
ESCC      4 (2)       3 (3)     4 (4)
PC        32 (18)     8 (8)     19 (6)
CRC/CC    8 (3)       7 (7)     7 (6)
BC        37 (17)     11 (11)   30 (25)
The numbers of mRNAs and/or miRNAs can be smaller than the number of pairs, since
multiple pairs share the same mRNAs and miRNAs. The numbers in parentheses: for mRNAs
and miRNAs, those reported in previous studies to be related to cancers are counted.
For pairs, those associated with negative correlation between miRNAs and mRNAs in
starbase [Li et al. (2014)] are counted.
Rest days \ Caged days   1     2     3     5     10
1                        T     T,C   T     T,C   T,C
10                       -     -     -     T,C   -
42                       -     -     -     -     T,C
Rows: rest days; columns: caged days. T: treated (stressed), C: control. Each condition
is associated with four replicates.
[Figure 8.2 plot area: axes labelled "PC1 miRNA"; points labelled by condition (C10-1d, C10-42d, C5-10d, C5-1d, T10-1d, T10-42d, T5-10d, T5-1d, C2-1d, T1-1d, T2-1d, T3-1d)]
Fig. 8.2 Left: Scatter plot of PC1 loadings, v1, between gene expression (vertical axis)
and gene expression (horizontal axis). The P-value attributed to the correlation
coefficient is 0.01. Right: The same loadings averaged within each experimental
condition. The P-value attributed to the correlation coefficient is 0.01. T/CX-Yd stands
for treated (T) or control (C) samples after Y days' rest following X days spent in the
cage with/without violent mice.
Table 8.5 Samples used in this study. Numbers are those of biological replicates.
[Figure 8.3 plot area: cluster dendrogram; vertical axis: Height; leaves labelled "PC n" and "PCM n"; distance: as.dist(−abs(cor(Z))); linkage: hclust (*, "average")]
Fig. 8.3 Hierarchical clustering of 24 PC loadings obtained using mRNAs (PC) and pro-
moter methylation (PCM), respectively. Distances are negative signed absolute Pearson
correlations.
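The distance used for this clustering, the negative absolute Pearson correlation, can be written out as a short sketch. It only illustrates the distance itself; the average-linkage clustering step is not reproduced here. The function names and the toy loadings in the usage note are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def neg_abs_corr_distances(loadings):
    """Map each unordered pair of named loading vectors to -|r|.

    Strongly correlated or strongly anti-correlated loadings both get
    distances close to -1, so they are merged first by the clustering.
    """
    names = list(loadings)
    return {
        (p, q): -abs(pearson(loadings[p], loadings[q]))
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }
```

For example, with the toy input `{"PC 1": [1, 2, 3, 4], "PCM 1": [4, 3, 2, 1], "PC 2": [1, -1, 1, -1]}`, the pair ("PC 1", "PCM 1") is perfectly anti-correlated and therefore receives the smallest distance, illustrating why the sign of a loading does not matter for pairing mRNA and promoter methylation PCs.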
lated for the third and fourth PCs between mRNA expression and promoter
methylation. Thus, we decided to identify outliers using the third
and the fourth PC scores. The top 300 outliers were identified using the
third and fourth PC scores of mRNA expression and of promoter methylation,
respectively. Genes selected in common between mRNA expression and
promoter methylation via either the third or the fourth PC scores are listed
(Table 8.6). Although we cannot detail the biological meanings of PC3 and PC4
here for lack of space, the outline is as follows. Both PC3
and PC4 represent the distinction between pre- and post-reprogramming
non-small cell lung cancer cell lines. In addition, PC3 and PC4 also
show the coincidence between reprogrammed non-small cell lung cancer
cell lines and pluripotent/iPS cell lines. Furthermore, PC3 and PC4
confirm that pluripotent/iPS cell lines are distinct from IMR90, which is
not a reprogrammed cell line. One may also wonder why two PCs are
obtained. The two PCs differ in how the non-small cell lung cancer cell
lines coincide with their reprogrammed counterparts: PC3 represents aberrant
but coincident mRNA expression/promoter methylation between non-small cell
lung cancer and its reprogrammed counterpart, while PC4 represents aberrant
but opposite mRNA expression/promoter methylation between them. In any case,
PCA based unsupervised FE evidently has the power to identify biologically
meaningful features even in categorical multiclass problems in an
unsupervised manner. To our knowledge, no other method can do this. For more
details, see the original paper [Taguchi et al. (2016)].
Table 8.6 List of genes identified by PCA based unsupervised FE in non-small cell
lung cancer cell line reprogramming.
Among the genes selected (Table 8.6), an extensive literature search led
us to identify SFRP1 as the most promising candidate target gene for
epigenetic therapy in non-small cell lung cancer, for the following
two reasons. First, SFRP1 was more highly expressed in non-small cell lung
cancer cell lines not resistant to histone deacetylase inhibitor (HDACi)
than in HDACi-resistant non-small cell lung cancer cell lines [Miyanaga
et al. (2008)]. Second, histone acetylation of SFRP1 was enhanced by
HDACi [Tang et al. (2010)]. These two findings strongly suggest that SFRP1
is a promising candidate for epigenetic therapy. Although we have done
more research on this topic, including GO term enrichment analysis and
Cell lines             HTB56             A549
Metastasis             With   Without    With   Without
mRNA expression        3      3          3      3
Promoter methylation   2      2          2      2
vs PC3 and 5.1 × 10^−4 for PC5 vs PC4, respectively. Thus, these overlaps
cannot be accidental, and these genes are worth considering.
Although we have performed extensive research, including pathway/GO
term enrichment analysis as well as in silico drug discovery
[Umeyama et al. (2014)], we cannot discuss it here for lack
of space. Finally, we identified two promising metastasis-causing genes,
TINAGL1 and B3GALNT1. Although there were no experimental studies
that supported our findings, after the publication of our study [Umeyama
et al. (2014)], Takahashi et al. [Takahashi et al. (2016)] reported that
TINAGL1 plays potential roles in impaired female fertility in mice, which is
supposed to be related to metastasis. Thus, our findings may be plausible.
In conclusion, even if the distinction is very small and the number of
samples is small, PCA based unsupervised FE can identify critical genes in
an unsupervised manner.
Table 8.9 Samples of primordial germ cells between E13 and E16 rat F3 generation
vinclozolin lineage. Treated means vinclozolin treatments. Promoter methylation was
given as ratio between control and treated samples.
[Figure 8.4 plot area: cluster dendrogram; vertical axis: Height; leaves labelled PCn_mRNA, PCn_miRNA and PCn_comp]
Fig. 8.4 Hierarchical clustering of 16 PC loadings for mRNA, miRNA and compounds.
Distance is negative signed absolute Pearson correlation coefficients.
[Figure 8.5 plot area: pairwise scatter-plot matrix of PC3_comp, PC1_mRNA, PC2_mRNA, PC1_miRNA and PC2_miRNA. Pearson correlation coefficients (P-values) shown in the upper right triangle:]
             PC1_mRNA           PC2_mRNA           PC1_miRNA          PC2_miRNA
PC3_comp     0.620 (2.08e−04)   0.699 (1.50e−05)   0.384 (3.07e−02)   0.728 (4.96e−06)
PC1_mRNA                        0.746 (2.68e−06)   0.402 (2.31e−02)   0.640 (1.13e−04)
PC2_mRNA                                           0.253 (1.61e−01)   0.722 (6.35e−06)
PC1_miRNA                                                             0.288 (1.09e−01)
Fig. 8.5 Lower left triangle: Scatter plots of PC loadings used for outlier
identification. ◦: CCC, : adjacent normal tissue for CCC, +: HCC, ×: adjacent normal
tissues for CCC. Upper right triangle: Pearson correlation coefficients as well as
attributed P-values.
Table 8.11 Discrimination of HCC, CCC and normal tissues using either 14 compounds
or 17 miRNAs identified by PCA based unsupervised FE.
                   Result (miRNA)          Result (Compounds)
Predict            Normal   HCC   CCC      Normal   HCC   CCC
Normal             13       1     1        14       0     2
HCC                2        4     1        0        5     0
CCC                1        1     8        2        1     8
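The overall accuracy implied by Table 8.11 follows directly from the diagonal of each confusion matrix. The short sketch below carries out this arithmetic; it assumes that the rows of the table are the predicted classes and the columns the actual classes, both in the order Normal, HCC, CCC, and the variable and function names are illustrative.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Counts copied from Table 8.11 (rows: predicted; columns: result).
mirna_confusion = [[13, 1, 1], [2, 4, 1], [1, 1, 8]]
compound_confusion = [[14, 0, 2], [0, 5, 0], [2, 1, 8]]

mirna_accuracy = accuracy(mirna_confusion)        # 25 of 32 samples
compound_accuracy = accuracy(compound_confusion)  # 27 of 32 samples
```

Both matrices sum to 32 samples with the same column totals (16 normal, 6 HCC, 10 CCC), which is a useful consistency check on the reading of the table.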
[Figure 8.6 plot area: pairwise scatter-plot matrix of PC loadings (panels PC1, PC2, PC3); winding numbers legible in the lower left panels: 0.174 (PC1 vs PC2), 2.977 (PC1 vs PC3), −2.893 (PC2 vs PC3)]
Fig. 8.6 Upper right triangle: Scatter plots of the first four PC loadings obtained by
gene embedding. Adjacent time points are connected by solid lines. Lower left triangle:
winding numbers computed from the corresponding scatter plots.
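A winding number of the kind reported in this figure can be computed from a pair of PC loadings by accumulating the signed angle swept between consecutive time points. The sketch below is one plausible definition, not necessarily the one used for the figure: it measures the angle around the origin of the scatter plot, and the function name is illustrative. Because the trajectory need not close, the result is generally not an integer.

```python
import math


def winding_number(xs, ys):
    """Signed number of turns around the origin made by the path (xs[k], ys[k]).

    Each step contributes the signed angle between consecutive points,
    taken in (-pi, pi]; counter-clockwise turns are positive.
    """
    total = 0.0
    for k in range(len(xs) - 1):
        cross = xs[k] * ys[k + 1] - ys[k] * xs[k + 1]
        dot = xs[k] * xs[k + 1] + ys[k] * ys[k + 1]
        total += math.atan2(cross, dot)
    return total / (2.0 * math.pi)
```

A pair of loadings that oscillate a quarter period out of phase traces a circle and yields a winding number close to the number of cycles covered by the time series, whereas a pair without such a phase relationship yields a value near zero.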
during cell division cycle. Figure 8.7(a) shows a so-called biplot, in which PC
loadings attributed to samples and PC scores attributed to genes are
overlaid. It is obvious that three clusters are identified. Due to GO
[Figure 8.7 plot area: (a) biplot, horizontal axis PC2, vertical axis PC3; (b) PC2/PC3 loadings versus T]
Fig. 8.7 (a) Biplot of PC2 and PC3 scores attributed to genes. The solid line represents
PC loadings attributed to samples (time points), and the characters correspond to genes.
Triangles, crosses and black open circles correspond to the three clusters identified by
applying K-means, within this embedding space, to the genes identified by PCA based
unsupervised FE (gray open circles correspond to genes not identified by PCA based
unsupervised FE). Broken lines emphasize that a circular variable can separate the three
clusters. (b) Time dependence of PC2(◦)/PC3() loadings.
based unsupervised FE selects fewer than 200 genes out of thousands of
genes for each profile, this coincidence is highly significant. The
enrichment analysis performed also supports the plausibility of the 37
selected genes. This suggests that PCA based unsupervised FE is able to
integrate as many as seven gene expression profiles in an unsupervised
manner. For more details, see the original paper [Taguchi (2016b)].
Table 8.12 Fibroblast cell lines gene expression profiles used for this study.
Transfection       Nothing   Mock   Mutation 1   Mutation 2   Mutation 3
Patients           2         2      2            2            2
Healthy controls   2         2      2            2            2
are highly coincident with each other. For example, more than half of the genes
(393 genes) are common. KEGG pathway enrichment analyses were also
almost identical between the genes identified in the 4 samples without
transfection and those in the 16 samples with transfection. These results
suggest that PCA based unsupervised FE can integrate multiple gene expression
profiles as well. As mentioned above, although the disease is a motor neuron
disease, the slight differences observed in fibroblast gene expression can be
detected correctly by our methodology. Thus, PCA based unsupervised FE is
also able to detect very slight differences in gene expression.
8.8 Conclusions
References
Agarwal, V., Bell, G. W., Nam, J. W., and Bartel, D. P. (2015). Predicting
effective microRNA target sites in mammalian mRNAs, Elife 4.
Artmann, S., Jung, K., Bleckmann, A., and Beissbarth, T. (2012). Detection of
simultaneous group effects in microRNA expression and related target gene
sets, PLoS ONE 7, 6, p. e38365.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A
practical and powerful approach to multiple testing, Journal of the Royal
Statistical Society. Series B (Methodological) 57, 1, pp. 289–300,
http://www.jstor.org/stable/2346101.
Bleckmann, A., Leha, A., Artmann, S., Menck, K., Salinas-Riester, G., Binder,
C., Pukrop, T., Beissbarth, T., and Klemm, F. (2015). Integrated miRNA
and mRNA profiling of tumor-educated macrophages identifies prognostic
subgroups in estrogen receptor-positive breast cancer, Mol Oncol 9, 1, pp.
155–166.
Cho, J. H., Lee, I., Hammamieh, R., Wang, K., Baxter, D., Scherler, K.,
Etheridge, A., Kulchenko, A., Gautam, A., Muhie, S., Chakraborty,
N., Galas, D. J., Jett, M., and Hood, L. (2014). Molecular evidence of
stress-induced acute heart injury in a mouse model simulating posttrau-
matic stress disorder, Proc. Natl. Acad. Sci. U.S.A. 111, 8, pp. 3188–3193.
Counago, F., Rodriguez, A., Calvo, P., Luna, J., Monroy, J. L., Taboada, B.,
Diaz, V., and Rodriguez de Dios, N. (2016). Targeted therapy combined
with radiotherapy in non-small-cell lung cancer: a review of the Oncologic
Group for the Study of Lung Cancer (Spanish Radiation Oncology Society),
Clin Transl Oncol.
Ding, M., Li, J., Yu, Y., Liu, H., Yan, Z., Wang, J., and Qian, Q. (2015). Inte-
grated analysis of miRNA, gene, and pathway regulatory networks in hep-
atic cancer stem cells, J Transl Med 13, p. 259.
Fogel, B. L., Cho, E., Wahnich, A., Gao, F., Becherel, O. J., Wang, X., Fike,
F., Chen, L., Criscuolo, C., De Michele, G., Filla, A., Collins, A., Hahn,
A. F., Gatti, R. A., Konopka, G., Perlman, S., Lavin, M. F., Geschwind,
D. H., and Coppola, G. (2014). Mutation of senataxin alters disease-specific
transcriptional networks in patients with ataxia with oculomotor apraxia
type 2, Hum. Mol. Genet. 23, 18, pp. 4758–4769.
Forde, P. M., Brahmer, J. R., and Kelly, R. J. (2014). New strategies in lung
cancer: epigenetic therapy for non-small cell lung cancer, Clin. Cancer Res.
20, 9, pp. 2244–2248.
Fu, J., Tang, W., Du, P., Wang, G., Chen, W., Li, J., Zhu, Y., Gao, J., and Cui,
L. (2012). Identifying microRNA-mRNA regulatory network in colorectal
cancer by a combination of expression profile and bioinformatics analysis,
BMC Syst Biol 6, p. 68.
Gargano, L. M., Caramanica, K., Sisco, S., Brackbill, R. M., and Stellman, S. D.
(2015). Exposure to the World Trade Center Disaster and 9/11-related
post-traumatic stress disorder and household disaster preparedness, Disas-
ter Med Public Health Prep 9, 6, pp. 625–633.
GEO (2016). Gene expression omnibus, http://www.ncbi.nlm.nih.gov/geo/.
Hascher, A., Haase, A. K., Hebestreit, K., Rohde, C., Klein, H. U., Rius, M.,
Jungen, D., Witten, A., Stoll, M., Schulze, I., Ogawa, S., Wiewrodt, R.,
Tickenbrock, L., Berdel, W. E., Dugas, M., Thoennissen, N. H., and Muller-
Tidow, C. (2014). DNA methyltransferase inhibition reverses epigenetically
embedded phenotypes in lung cancer preferentially affecting polycomb tar-
get genes, Clin. Cancer Res. 20, 4, pp. 814–826.
Jansson, M. D. and Lund, A. H. (2012). MicroRNA and cancer, Molecular
Oncology 6, 6, pp. 590–610, doi:http://dx.doi.org/10.1016/j.molonc.2012.09.006,
http://www.sciencedirect.com/science/article/pii/S1574789112000981,
cancer epigenetics.
Jonnalagadda, S. and Srinivasan, R. (2008). Principal components analysis
based methodology to identify differentially expressed genes in time-course
microarray data, BMC Bioinformatics 9, p. 267.
Skinner, M. K., Guerrero-Bosagna, C., Haque, M., Nilsson, E., Bhandari, R.,
and McCarrey, J. R. (2013). Environmentally induced transgenerational
epigenetic reprogramming of primordial germ cells and the subsequent germ
line, PLoS ONE 8, 7, p. e66318.
Taguchi, Y. H. (2015). Identification of aberrant gene expression associated with
aberrant promoter methylation in primordial germ cells between E13 and
E16 rat F3 generation vinclozolin lineage, BMC Bioinformatics 16 Suppl
18, p. S16.
Taguchi, Y. H. (2016a). Identification of more feasible MicroRNA-mRNA inter-
actions within multiple cancers using principal component analysis based
unsupervised feature extraction, Int J Mol Sci 17, 5, p. 696.
Taguchi, Y. H. (2016b). Principal component analysis based unsupervised feature
extraction applied to budding yeast temporally periodic gene expression,
BioData Min 9, p. 22.
Taguchi, Y. H., Iwadate, M., and Umeyama, H. (2015a). Heuristic principal com-
ponent analysis-based unsupervised feature extraction and its application
to gene expression analysis of amyotrophic lateral sclerosis data sets, in
Computational Intelligence in Bioinformatics and Computational Biology
(CIBCB), 2015 IEEE Conference on, pp. 1–10, doi:10.1109/CIBCB.2015.
7300274.
Taguchi, Y. H., Iwadate, M., and Umeyama, H. (2015b). Principal component
analysis-based unsupervised feature extraction applied to in silico drug
discovery for posttraumatic stress disorder-mediated heart disease, BMC
Bioinformatics 16, p. 139.
Taguchi, Y. H., Iwadate, M., and Umeyama, H. (2016). SFRP1 is a possible
candidate for epigenetic therapy in non-small cell lung cancer, BMC Medical
Genomics 9, Suppl 1, p. 28, doi:10.1186/s12920-016-0196-3.
Taguchi, Y. H., Iwadate, M., Umeyama, H., Murakami, Y., and Okamoto, A.
(2015c). Heuristic principal component analysis-based unsupervised feature
extraction and its application to bioinformatics, in B. Wang, R. Li, and
W. Perrizo (eds.), Big Data Analytics in Bioinformatics and Healthcare
(IGI Global, Pennsylvania, USA), pp. 138–162.
Takahashi, A., Rahim, A., Takeuchi, M., Fukui, E., Yoshizawa, M., Mukai,
K., Suematsu, M., Hasuwa, H., Okabe, M., and Matsumoto, H. (2016).
Impaired female fertility in tubulointerstitial antigen-like 1-deficient mice,
J. Reprod. Dev. 62, 1, pp. 43–49.
Tang, Y. A., Wen, W. L., Chang, J. W., Wei, T. T., Tan, Y. H., Salunke, S.,
Chen, C. T., Chen, C. S., and Wang, Y. C. (2010). A novel histone deacety-
lase inhibitor exhibits antitumor activity via apoptosis induction, F-actin
disruption and gene acetylation in lung cancer, PLoS ONE 5, 9, p. e12417.
Tu, B. P., Kudlicki, A., Rowicka, M., and McKnight, S. L. (2005). Logic of the
yeast metabolic cycle: temporal compartmentalization of cellular processes,
Science 310, 5751, pp. 1152–1158.
Umeyama, H., Iwadate, M., and Taguchi, Y.-h. (2014). TINAGL1 and
B3GALNT1 are potential therapy target genes to suppress metastasis in
non-small cell lung cancer, BMC Genomics 15, Suppl 9, p. S2,
doi:10.1186/1471-2164-15-S9-S2,
http://www.biomedcentral.com/1471-2164/15/S9/S2.
Wang, A. and Gehan, E. A. (2005). Gene selection for microarray data analysis
using principal component analysis, Stat Med 24, 13, pp. 2069–2087.
Wu, B., Li, C., Zhang, P., Yao, Q., Wu, J., Han, J., Liao, L., Xu, Y., Lin, R.,
Xiao, D., Xu, L., Li, E., and Li, X. (2013). Dissection of miRNA-miRNA
interaction in esophageal squamous cell carcinoma, PLoS ONE 8, 9, p.
e73191.
Yang, Y., Li, D., Yang, Y., and Jiang, G. (2015). An integrated analysis of the
effects of microRNA and mRNA on esophageal squamous cell carcinoma,
Mol Med Rep 12, 1, pp. 945–952.
Zhang, W., Edwards, A., Fan, W., Flemington, E. K., and Zhang, K. (2012).
miRNA-mRNA correlation-network modules in human prostate cancer and
the differences between primary and metastatic tumor subtypes, PLoS ONE
7, 6, p. e40130.
Chapter 9
9.2 Methods
9.2.1 Epitope datasets and physicochemical properties
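1) $g_\lambda(\emptyset) = 0$, $g_\lambda(X) = 1$ (boundary conditions)
2) $\forall A, B \in 2^X$, $A \cap B = \emptyset$, $A \cup B \neq X$:
   $g_\lambda(A \cup B) = g_\lambda(A) + g_\lambda(B) + \lambda \, g_\lambda(A) \, g_\lambda(B)$   (3)
3) $\prod_{i=1}^{n} [1 + \lambda s(x_i)] = \lambda + 1 > 0$, $s(x_i) = g_\lambda(\{x_i\})$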
2) $L \in [0, \infty)$, $A \subset X$, $A \neq X$
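$\int_C f_i \, dg = \sum_{j=1}^{n} [ f_i(x_{(j)}) - f_i(x_{(j-1)}) ] \, g(A^{i}_{(j)})$, $i = 1, 2, \ldots, N$   (5)

where $f_i(x_{(0)}) = 0$, and $f_i(x_{(j)})$ indicates that the indices have been
permuted so that

$0 \le f_i(x_{(1)}) \le f_i(x_{(2)}) \le \ldots \le f_i(x_{(n)})$   (6)

$A^{i}_{(j)} = \{ x_{(j)}, x_{(j+1)}, \ldots, x_{(n)} \}$   (7)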
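As an illustration of how the Lambda measure and the Choquet integral fit together, the following is a minimal sketch, not the chapter's implementation. It assumes the fuzzy densities s(x_i) are given and each lies strictly between 0 and 1, solves the product condition for lambda by bisection, uses the closed form of the Lambda measure that follows from repeated application of the lambda rule, and then evaluates the sorted-difference sum of the Choquet integral. The function names solve_lambda, lambda_measure and choquet_integral, and the numeric tolerances, are hypothetical choices made for this example.

```python
def solve_lambda(densities, tol=1e-6):
    """Find lambda > -1 with prod(1 + lambda * s_i) = 1 + lambda (bisection)."""
    # Assumption: at least two densities, each strictly inside (0, 1).
    if len(densities) < 2 or not all(0.0 < s < 1.0 for s in densities):
        raise ValueError("need at least two densities, each strictly in (0, 1)")
    total = sum(densities)
    if abs(total - 1.0) < tol:
        return 0.0  # densities are already additive

    def h(lam):
        prod = 1.0
        for s in densities:
            prod *= 1.0 + lam * s
        return prod - (1.0 + lam)

    if total > 1.0:            # non-trivial root lies in (-1, 0)
        lo, hi = -1.0 + 1e-9, -tol
    else:                      # non-trivial root lies in (0, infinity)
        lo, hi = tol, 1.0
        while h(hi) <= 0.0:
            hi *= 2.0
    for _ in range(200):       # plain bisection on the sign of h
        mid = 0.5 * (lo + hi)
        if (h(mid) > 0.0) == (h(lo) > 0.0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lambda_measure(subset_densities, lam):
    """Measure of a subset, given the densities of its elements."""
    if lam == 0.0:
        return sum(subset_densities)
    prod = 1.0
    for s in subset_densities:
        prod *= 1.0 + lam * s
    return (prod - 1.0) / lam


def choquet_integral(values, densities):
    """Choquet integral of non-negative values w.r.t. the Lambda measure."""
    lam = solve_lambda(densities)
    # Permute indices so that the values are in ascending order.
    order = sorted(range(len(values)), key=lambda j: values[j])
    total, previous = 0.0, 0.0
    for pos, j in enumerate(order):
        # Densities of the elements x_(j), x_(j+1), ..., x_(n).
        tail = [densities[t] for t in order[pos:]]
        total += (values[j] - previous) * lambda_measure(tail, lam)
        previous = values[j]
    return total
```

Because the measure of the whole set is 1, the integral of a constant function returns that constant, and when the densities sum to 1 (lambda = 0) the integral reduces to an ordinary weighted mean. Both properties are convenient sanity checks.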
where n represents the position in the window, and $p_{ni}$ represents the
proportion of the i-th amino acid at position n.
For each position in the feature window of the aligned peptide sequences,
normalize each physicochemical property for all peptide sequences at the
same position. Assume the size of the peptide set is k. Let the i-th peptide
sequence for physicochemical property m at position l be a variable
$X^{i}_{l,m}$, where $1 \le l \le k$ and $1 \le m \le 3$. If
$\max_l X^{i}_{l,m} - \min_l X^{i}_{l,m} \neq 0$, then

$Z^{i}_{l,m} = \frac{X^{i}_{l,m} - \min_l X^{i}_{l,m}}{\max_l X^{i}_{l,m} - \min_l X^{i}_{l,m}}$.

Otherwise, set $Z^{i}_{l,m} = 0$.
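The normalization step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the data are assumed to be stored as a nested list X[i][l][m] (peptide i, window position l, property m), and the minimum and maximum are taken across all peptides at the same position and property, as the prose describes. The names minmax_normalize and normalize_features are hypothetical.

```python
def minmax_normalize(column):
    """Scale one list of values to [0, 1]; return zeros if all are equal."""
    lo, hi = min(column), max(column)
    if hi - lo == 0:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def normalize_features(X):
    """Normalize X[i][l][m] across peptides i for every (position, property)."""
    k, n_pos, n_prop = len(X), len(X[0]), len(X[0][0])
    Z = [[[0.0] * n_prop for _ in range(n_pos)] for _ in range(k)]
    for l in range(n_pos):
        for m in range(n_prop):
            # Collect the same position and property from every peptide.
            col = minmax_normalize([X[i][l][m] for i in range(k)])
            for i in range(k):
                Z[i][l][m] = col[i]
    return Z
```

For example, three peptides with a single window position and property values 1.0, 3.0 and 2.0 would be mapped to 0.0, 1.0 and 0.5, while a property that is identical across all peptides would be set to 0.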
Table 9.2. The top three AAindex for each position in the feature window.
The kernel parameter γ determines how the samples are transformed into a
high-dimensional space. The cost parameter C > 0 of the SVM adjusts the
total error penalty. The parameters C and γ must be tuned to get the best
prediction performance [16].
9.3 Results
$OA = \frac{\sum_{i=1}^{h} TP_i}{N}$

$AA = \frac{\sum_{i=1}^{h} ACC_i}{h}$   (12)

where $TP_i$, $TN_i$, $FP_i$ and $FN_i$ are the numbers of true positives, true
negatives, false positives and false negatives for class i, respectively,
N is the total number of peptide sequences and h is the number of
immunogenicity classes.
Table 9.3 shows the performance of our algorithm in terms of
ACC for the four immunogenicity classes, together with the prediction
accuracies OA and AA. For the Lambda measure, the ACC values of
the four classes None, Little, Moderate and High are 94.74, 89.47, 74.00
and 96.77%, respectively. The overall accuracy and average accuracy are
92.05 and 88.79%, respectively. For the L-measure, the ACC
values of the four classes None, Little, Moderate and High are 94.73,
81.57, 72.00 and 95.167%, respectively. The overall accuracy and
average accuracy are 90.18 and 85.86%, respectively.
As a result, our prediction methods based on the Lambda measure
and the L-measure (L=0.6) perform better than POPI [9] for every
immunogenicity class.
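The two summary measures are simple to compute once the per-class quantities are known. The sketch below is illustrative only: the function names are hypothetical, and the per-class ACC values are the rounded figures quoted in the text, so the recomputed averages can differ from the reported ones in the second decimal place.

```python
def overall_accuracy(tp_per_class, n_total):
    """OA: correctly classified peptides over all peptides."""
    return sum(tp_per_class) / n_total


def average_accuracy(acc_per_class):
    """AA: unweighted mean of the per-class accuracies."""
    return sum(acc_per_class) / len(acc_per_class)


# Per-class ACC (%) for None, Little, Moderate, High, as quoted in the text.
lambda_acc = [94.74, 89.47, 74.00, 96.77]
l_measure_acc = [94.73, 81.57, 72.00, 95.167]

aa_lambda = average_accuracy(lambda_acc)        # about 88.75
aa_l_measure = average_accuracy(l_measure_acc)  # about 85.87
```

Because AA weights every class equally while OA weights every peptide equally, a class with few peptides and low accuracy, such as Moderate here, pulls AA down more than OA.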
9.4 Discussion
References
1. Keşmir, C., Nussbaum, A. K., Schild, H., Detours, V., and Brunak, S. (2002)
Prediction of proteasome cleavage motifs by neural networks, Protein Eng., 15,
pp. 287–296.
2. Bhasin M. and Raghava, G. P. (2005) Pcleavage: an SVM based method for
prediction of constitutive proteasome and immunoproteasome cleavage sites in
antigenic sequences, Nucleic Acids Res., 33, pp. W202–W207.
3. Nielsen, M., Lundegaard C., Worning, P., Hvid, C. S., Lamberth, K., Buus, S.,
Brunak, S., and Lund, O. (2004) Improved prediction of MHC class I and class II
epitopes using a novel Gibbs sampling approach, Bioinformatics, 20, pp. 1388–
1397.
4. Larsen, M. V., Lundegaard, C., Lamberth, K., Buus, S., Brunak, S., Lund, O., and
Nielsen, M. (2005) An integrative approach to CTL epitope prediction: a
combined algorithm integrating MHC class I binding, TAP transport efficiency,
and proteasomal cleavage predictions, Eur. J. Immunol., 35, pp. 2295–2303.
5. Lin, H. H., Zhang, G. L., Tongchusak, S., Reinherz, E. L., and Brusic, V. (2008)
Evaluation of MHC-II peptide binding prediction servers: applications for
vaccine research, BMC Bioinformatics, 9, p. S22.
Chapter 10
10.1 Introduction
Flow and mass cytometry are widely used in clinical and basic research
to characterize cell phenotypes and functions. Both measure the
expression of surface and intracellular molecules (termed “markers”) in
individual cells. Flow cytometry is a laser-based technique in
which cells are stained with fluorescence-conjugated antibodies and
carried past a laser beam one cell at a time in a thin stream of fluid. As
a cell passes through the beam, it scatters the light, and its
fluorochromes emit light when excited by a laser of the
corresponding excitation wavelength. The intensities of the scattered and
fluorescent light are detected and analysed. Mass cytometry (a.k.a.
CyTOF) is a next-generation cytometry technique that tags antibodies
with heavy metal isotopes instead of fluorophores. Because of isotopic
tagging, mass cytometry produces little crosstalk between channels
compared to flow cytometry.
For flow cytometry data, the very first algorithm proposed was flowClust,2 a
Bioconductor package for automated gating of flow cytometry data.
flowClust implements a robust model-based clustering
approach based on multivariate t mixture models with the Box-Cox
transformation. By using multivariate t mixture models instead of the
more commonly used finite Gaussian mixture models, flowClust is able
to identify outliers as well as clusters that are far from elliptical in shape.
One key challenge of mixture-model-based clustering is determining the
optimal number of clusters. The maximum-BIC model fitting criterion
generally overestimates the number of clusters, whilst model fitting
criteria based on entropy, such as the ICL, tend to fit the underlying
distribution poorly. The Bioconductor package flowMerge therefore
combines these two approaches to achieve a good model fit
and an accurate estimate of the number of clusters. flowMerge first chooses
the best BIC solution, then merges clusters in that solution, and
chooses the best merged solution based on the entropy criterion.

On the other hand, model-independent (non-parametric) clustering methods,
such as spectral clustering, have the advantage of not requiring the a priori
assumption that cell populations follow predefined distributional
models. However, spectral clustering is computationally intensive and
slow for large datasets. To improve efficiency,
SamSPECTRAL3 modifies spectral clustering with a non-uniform,
information-preserving down-sampling step.

Another model-independent approach, FLOw Clustering without K (FLOCK),4
uses grid-based partitioning and density distribution analysis to identify
cell populations. It partitions the n-dimensional space into “hyperregions”
by dividing each dimension into equally sized bins. Any hyperregion in which
the cell count exceeds a predefined threshold is labeled a “dense”
hyperregion. Adjacent “dense” hyperregions are then merged, and each cell
is assigned to the nearest centroid of the merged “dense” hyperregions.

K-means is widely used for clustering, but it requires the number of
clusters K to be specified in advance. flowMeans5, which is based on K-means,
first applies kernel-density-based mode detection and uses the number of
modes as K when running K-means clustering. The number of modes usually
overestimates the number of clusters. flowMeans iteratively merges the
closest pair of clusters based on a symmetric Mahalanobis semi-metric distance.
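The grid-based idea behind FLOCK, as described above, can be sketched as follows. This is a simplified illustration and not the FLOCK implementation: the function name grid_density_cluster, the default values of bins and threshold, and the rule that two hyperregions are adjacent when their bin indices differ by at most one in every dimension are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import product
import math


def grid_density_cluster(points, bins=4, threshold=2):
    """Partition space into bins, merge adjacent dense bins, assign cells."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def bin_of(p):
        idx = []
        for d in range(dims):
            span = highs[d] - lows[d]
            frac = (p[d] - lows[d]) / span if span else 0.0
            idx.append(min(int(frac * bins), bins - 1))  # clamp the upper edge
        return tuple(idx)

    # Count the cells that fall into every hyperregion.
    regions = defaultdict(list)
    for i, p in enumerate(points):
        regions[bin_of(p)].append(i)
    dense = {r for r, members in regions.items() if len(members) > threshold}

    # Merge adjacent dense hyperregions with a flood fill.
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]
    seen, groups = set(), []
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            r = stack.pop()
            group.append(r)
            for o in offsets:
                nb = tuple(a + b for a, b in zip(r, o))
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        groups.append(group)

    # One centroid per merged group of dense hyperregions.
    centroids = []
    for group in groups:
        members = [points[i] for r in group for i in regions[r]]
        centroids.append(tuple(sum(m[d] for m in members) / len(members)
                               for d in range(dims)))
    if not centroids:
        return [-1] * len(points), []

    # Assign every cell, dense or not, to the nearest centroid.
    labels = [min(range(len(centroids)),
                  key=lambda c: math.dist(p, centroids[c])) for p in points]
    return labels, centroids
```

The number of neighbour offsets grows as 3^n with the number of dimensions, so this direct enumeration is only practical for a handful of markers; it is meant to show the logic rather than to scale to high-dimensional cytometry data.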
10.3 Method
10.4 Results
cells far apart (Figure 1). We overlaid hand-gated cell populations on the
t-SNE map (Figure 2). Each cell population occupied a distinct region of
the t-SNE plot, indicating that t-SNE is able to segregate known
populations. However, some populations, such as pro-B cells and
eosinophils, were scattered across more than one region. This could be due
to a limitation of t-SNE or to the fact that these cells are heterogeneous
and comprise sub-populations.
10.5 Discussion
dimensions, its clusters are usually consistent with the t-SNE map. This
is not the case for algorithms such as phenograph and flowSOM, which
cluster cells in the original dimensions; it is not unusual for
phenograph or flowSOM clusters to align poorly with the t-SNE map. For
some datasets, t-SNE and t-SNE-based clustering perform better, while
for others, t-SNE-independent clustering performs better.
The performance of dimension-reduction-based methods such as
ACCENSE, DensVM and clusterX depends on the quality of the
dimension reduction. Both dimension reduction and clustering
algorithms need further improvement, and new algorithms need to be
developed that perform optimal dimension reduction and clustering at the
same time.
References
1 van der Maaten, L. & Hinton, G. Visualizing High-Dimensional Data Using t-SNE.
Journal of Machine Learning Research 9, 2579-2605 (2008).
2 Lo, K., Hahne, F., Brinkman, R. R. & Gottardo, R. flowClust: a Bioconductor
package for automated gating of flow cytometry data. BMC bioinformatics 10, 145,
doi:10.1186/1471-2105-10-145 (2009).
3 Zare, H., Shooshtari, P., Gupta, A. & Brinkman, R. R. Data reduction for spectral
clustering to analyze high throughput flow cytometry data. BMC bioinformatics 11,
403, doi:10.1186/1471-2105-11-403 (2010).
4 Qian, Y. et al. Elucidation of seventeen human peripheral blood B-cell subsets and
quantification of the tetanus response using a density-based method for the
automated identification of cell populations in multidimensional flow cytometry
data. Cytometry. Part B, Clinical cytometry 78 Suppl 1, S69-82,
doi:10.1002/cyto.b.20554 (2010).
5 Aghaeepour, N., Nikolic, R., Hoos, H. H. & Brinkman, R. R. Rapid cell population
identification in flow cytometry data. Cytometry. Part A : the journal of the
International Society for Analytical Cytology 79, 6-13, doi:10.1002/cyto.a.21007
(2011).
6 Ge, Y. & Sealfon, S. C. flowPeaks: a fast unsupervised clustering for flow
cytometry data via K-means and density peak finding. Bioinformatics 28, 2052-
2058, doi:10.1093/bioinformatics/bts300 (2012).
7 Naim, I. et al. SWIFT-scalable clustering for automated identification of rare cell
populations in large, high-dimensional flow cytometry datasets, part 1: algorithm
design. Cytometry. Part A : the journal of the International Society for Analytical
Cytology 85, 408-421, doi:10.1002/cyto.a.22446 (2014).
8 Sorensen, T., Baumgart, S., Durek, P., Grutzkau, A. & Haupl, T. ImmunoClust —
An automated analysis pipeline for the identification of immunophenotypic
signatures in high-dimensional cytometric datasets. Cytometry. Part A : the journal
of the International Society for Analytical Cytology 87, 603-615,
doi:10.1002/cyto.a.22626 (2015).
Index
V
van der Waals forces, 102
vinclozolin, 170

W
WCT(FK), 68
Weighted Connected-Triple (WCT) Matrix, 57–60, 63
Weighted Distance (WD) Matrix, 58, 63
Weighted Triple-Quality (WTQ) Matrix, 58–60, 63
white noise, 9
widing numbers, 174
Wilcoxon rank-sum test, 10
Wilcoxon test, 54
WTQ(FK), 68

X
χ² distribution, 157
X-ray crystallography, 106
Xanthomonas campestris pv campestris (Xcc), 120
Xcc effector binding sites, 134
Xcc effector protein, 126–127
Xcc homolog proteins, 132
Xcc pathogen, 125

Y
yeast metabolic cycle, 174
yeast sporulation, 1

Z
ZDOCK algorithm, 108
ZRANK scoring function, 109