Advances in Bioinformatics
4th International Workshop on Practical
Applications of Computational Biology
and Bioinformatics 2010 (IWPACBB 2010)
Editors
Miguel P. Rocha
Dep. Informática / CCTC
Universidade do Minho
Campus de Gualtar
4710-057 Braga
Portugal

Hagit Shatkay
Computational Biology and Machine Learning Lab
School of Computing
Queen’s University
Kingston, Ontario K7L 3N6
Canada
E-mail: [email protected]
Florentino Fernández Riverola
Escuela Superior de Ingeniería Informática
Edificio Politécnico, Despacho 408
Campus Universitario As Lagoas s/n
32004 Ourense
Spain
E-mail: [email protected]

Juan Manuel Corchado
Departamento de Informática y Automática
Facultad de Ciencias
Universidad de Salamanca
Plaza de la Merced S/N
37008 Salamanca
Spain
E-mail: [email protected]
DOI 10.1007/978-3-642-13214-8
Advances in Intelligent and Soft Computing ISSN 1867-5662
Library of Congress Control Number: 2010928117
© 2010 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations
are liable for prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.
General Co-chairs
Miguel Rocha University of Minho (Portugal)
Florentino Riverola University of Vigo (Spain)
Juan M. Corchado University of Salamanca (Spain)
Hagit Shatkay Queen’s University, Ontario (Canada)
Program Committee
Juan M. Corchado University of Salamanca (Spain)
(Co-chairman)
Alicia Troncoso Universidad Pablo de Olavide (Spain)
Alípio Jorge LIAAD/INESC, Porto LA (Portugal)
Anália Lourenço University of Minho (Portugal)
Arlindo Oliveira INESC-ID, Lisboa (Portugal)
Arlo Randall University of California Irvine (USA)
B. Cristina Pelayo University of Oviedo (Spain)
Christopher Henry Argonne National Labs (USA)
Daniel Gayo University of Oviedo (Spain)
David Posada Univ. Vigo (Spain)
Emilio S. Corchado University of Burgos (Spain)
Eugénio C. Ferreira IBB/CEB, University of Minho (Portugal)
Fernando Diaz-Gómez University of Valladolid (Spain)
Gonzalo Gómez-López UBio/CNIO, Spanish National Cancer Research Centre (Spain)
Isabel C. Rocha IBB/CEB, University of Minho (Portugal)
Jesús M. Hernández University of Salamanca (Spain)
Jorge Vieira IBMC, Porto (Portugal)
José Adserias University of Salamanca (Spain)
José L. López University of Salamanca (Spain)
José Luís Oliveira Univ. Aveiro (Portugal)
Juan M. Cueva University of Oviedo (Spain)
Júlio R. Banga IIM/CSIC, Vigo (Spain)
Organizing Committee
Microarrays
Highlighting Differential Gene Expression between Two
Condition Microarrays through Heterogeneous Genomic
Data: Application to Leishmania infantum Stages Comparison
Liliana López Kleine, Víctor Andrés Vera Ruiz
Biomedical Applications
Structure Based Design of Potential Inhibitors of Steroid Sulfatase
Elisangela V. Costa, M. Emília Sousa, J. Rocha,
Carlos A. Montanari, M. Madalena Pinto
Bioinformatics Applications
e-BiMotif: Combining Sequence Alignment and Biclustering
to Unravel Structured Motifs
Joana P. Gonçalves, Sara C. Madeira
Abstract. Classical methods for the detection of gene expression differences between
two microarray conditions often fail to detect interesting and important differences,
because these differences are weak in comparison with the overall variability.
Therefore, methodologies that highlight weak differences are needed. Here, we
propose a method that allows the fusion of other genomic data with microarray
data and show, through an example on L. infantum microarrays comparing
promastigote and amastigote stages, that differences between the two microarray
conditions are highlighted. The method is flexible and can be applied to any
organism for which microarray and other genomic data are available.
1 Introduction
Protozoa of the genus Leishmania are parasites that are transmitted by blood-feeding
insect vectors to mammalian hosts and cause a number of important human diseases,
collectively referred to as leishmaniasis. During their life cycle, these
parasites alternate between two major morphologically distinct developmental
stages. In the digestive tract of the sandfly vector, they exist as extracellular,
elongated, flagellated and motile promastigotes that are exposed to pH 7 and
fluctuating temperatures averaging 25 °C. Upon entry into a mammalian host, they reside
in mononuclear phagocytes or macrophages (37 °C), wherein they replicate as
circular, aflagellated and non-motile amastigotes. In order to survive these two
extreme environments, Leishmania sp. (L. sp.) has developed regulatory mechanisms
that result in important morphological and biochemical adaptations [1, 2, 3].
Universidad Nacional de Colombia (Sede Bogotá), Cra 30, calle 45, Statistics Department
e-mail: [email protected], [email protected]
M.P. Rocha et al. (Eds.): IWPACBB 2010, AISC 74, pp. 1–8, 2010.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
2 L.L. Kleine and V.A.V. Ruiz
2 Methodology
2.1.1 Microarrays
The data used are microarray data from Rochette et al. [3]. From these data, we
extracted only the expression data comparing promastigotes and amastigotes of L.
infantum (8317 genes for 14 replicates). The microarray data were downloaded from
NCBI’s GEO DataSets [14] (accession number GSE10407). We worked with the
normalized and log2-transformed gene expression intensities obtained by Rochette
and colleagues [3].
microarray data and calculated the sum of gene expression intensities in each
condition. This makes it possible to determine which of the two genes is responsible
for the similarity change and, finally, to identify potential targets that explain the
differences arising from the adaptations of L. infantum during its life cycle. R code is
available upon request: [email protected].
The most interesting targets (those that show the largest difference or appear
repeatedly on the list) are candidates for wet-lab experiments.
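The ranking step described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the centered linear kernel, the function names and the threshold argument are assumptions made for illustration, and the fusion with other genomic data is left out. The sketch ranks gene pairs by the change in their similarity between the two conditions and reports the per-gene expression sums used to judge which gene drives the change.

```python
def centered(row):
    """Subtract the row mean from every value of one gene's expression profile."""
    mean = sum(row) / len(row)
    return [x - mean for x in row]


def linear_kernel(expr):
    """Gene-by-gene similarity matrix (assumed here: centered inner products)."""
    rows = [centered(r) for r in expr]
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in rows] for ri in rows]


def rank_similarity_changes(expr_ama, expr_pro, threshold):
    """List gene pairs whose similarity changes by more than `threshold`.

    expr_ama, expr_pro: genes x replicates matrices (lists of lists), one per
    condition, with genes in the same order. `threshold` plays the role of a
    cut-off such as T1; its value is a free parameter in this sketch.
    """
    k_ama = linear_kernel(expr_ama)
    k_pro = linear_kernel(expr_pro)
    n_genes = len(expr_ama)
    changes = []
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            change = abs(k_ama[i][j] - k_pro[i][j])
            if change > threshold:
                changes.append({
                    "pair": (i, j),
                    "change": change,
                    # Sums of expression per gene and condition (AMA / PRO).
                    "ama_sums": (sum(expr_ama[i]), sum(expr_ama[j])),
                    "pro_sums": (sum(expr_pro[i]), sum(expr_pro[j])),
                })
    # Largest similarity change first; ties keep their original order.
    changes.sort(key=lambda c: -c["change"])
    return changes
```

Genes that appear in several of the top-ranked pairs would be the ones "present repeatedly on the list".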
Table 2 List of 14 gene pair similarity changes between amastigote (KA1sum) and
promastigote (KP1sum) microarray data highlighted through the fusion with data on the
presence of genes on chromosomes (K3). AMA and PRO: sum of gene expression of each gene
in the amastigote microarray (AMA) or promastigote microarray (PRO). Pos: position of
similarity changes in the 2092×2092 kernel (1924 similarity changes above T1). P: gene
annotated as putative. Hpc: hypothetical conserved protein
Gene pair with similarity change in KA1sum vs. KP1sum
Gene1   Gene2   Pos   AMA Gene1   PRO Gene1   AMA Gene2   PRO Gene2
The present work opens the possibility of implementing a kernel method that
will allow differences to be determined more precisely once the data are fused.
The detection of differences can be improved in several ways. Here, only a
preliminary and very simple comparison of similarities is proposed. The kernel
method could be based on multidimensional scaling, mapping both kernels onto a
common space in which distances between similarities can be measured.
Although differences between kernels are ordered, having a probability associated
with each difference would be useful. This could be achieved by a bootstrapping
procedure or a matrix permutation test.
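One way to attach a probability to a similarity change is a label-permutation test, sketched below in Python. This is only one possible reading of the suggestion above, not a method from the paper: the replicate columns of both conditions are pooled and randomly reassigned, and the observed change for a gene pair is compared with the changes obtained under reshuffling. The similarity measure (a centered inner product) and all names are assumptions.

```python
import random


def pair_similarity(expr, i, j):
    """Centered inner product between the profiles of genes i and j."""
    def centered(row):
        mean = sum(row) / len(row)
        return [x - mean for x in row]
    return sum(a * b for a, b in zip(centered(expr[i]), centered(expr[j])))


def permutation_p_value(expr_a, expr_b, i, j, n_perm=999, seed=0):
    """Estimate how often a random split of replicates gives a change at
    least as large as the observed one for gene pair (i, j)."""
    rng = random.Random(seed)
    n_a = len(expr_a[0])  # number of replicates in the first condition
    observed = abs(pair_similarity(expr_a, i, j) - pair_similarity(expr_b, i, j))
    # Pool the replicate columns of both conditions, gene by gene.
    pooled = [list(row_a) + list(row_b) for row_a, row_b in zip(expr_a, expr_b)]
    columns = list(range(len(pooled[0])))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(columns)
        perm_a = [[row[c] for c in columns[:n_a]] for row in pooled]
        perm_b = [[row[c] for c in columns[n_a:]] for row in pooled]
        stat = abs(pair_similarity(perm_a, i, j) - pair_similarity(perm_b, i, j))
        if stat >= observed:
            hits += 1
    # The +1 terms count the observed split as one of the permutations.
    return (hits + 1) / (n_perm + 1)
```

With 2092 genes the number of pairs is large, so in practice the resulting p-values would need a multiple-testing correction.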
References
[1] McConville, M.J., Turco, S.J., Ferguson, M.A.J., Saks, D.L.: Developmental modifi-
cation of lipophosphoglycan during the differentiation of Leishmania major promas-
tigotes to an infectious stage. EMBO J. 11, 3593–3600 (1992)
[2] Zilberstein, D., Shapira, M.: The role of pH and temperature in the development of
Leishmania parasites. Annu. Rev. Microbiol. 48, 449–470 (1994)
[3] Rochette, A., Raymond, F., Ubeda, J.M., Smith, M., Messier, N., Boisvert, S., Ri-
gault, P., Corbeil, J., Ouellette, M., Papadopoulou, B.: Genome-wide gene expression
profiling analysis of Leishmania major and Leishmania infantum developmental
stages reveals substantial differences between the two species. BMC Genomics 9,
255–280 (2008)
[4] Cohen-Freue, G., Holzer, T.R., Forney, J.D., McMaster, W.R.: Global gene expres-
sion in Leishmania. Int. J. Parasitol. 37, 1077–1086 (2007)
[5] Leifso, K., Cohen-Freue, G., Dogra, N., Murray, A., McMaster, W.R.: Genomic and
proteomic expression analysis of Leishmania promastigote and amastigote life stages:
the Leishmania genome is constitutively expressed. Mol. Biochem. Parasitol. 152,
35–46 (2007)
[6] Saxena, A., Lahav, T., Holland, N., Aggarwal, G., Anupama, A., Huang, Y., Volpin,
H., Myler, P.J., Zilberstein, D.: Analysis of the Leishmania donovani transcriptome
reveals an ordered progression of transient and permanent changes in gene expression
during differentiation. Mol. Biochem. Parasitol 52, 53–65 (2007)
[7] Ivens, A.C., Lewis, S.M., Bagherzadeh, A.: A physical map of Leishmania major
friedlin genome. Genome Res. 8, 135–145 (1998)
[8] Holzer, T.R., McMaster, W.R., Forney, J.D.: Expression profiling by whole-genome
interspecies microarray hybridization reveals differential gene expression in procyclic
promastigotes, lesion-derived amastigotes, and axenic amastigotes in Leishmania
mexicana. Mol. Biochem. Parasitol 146, 198–218 (2006)
[9] McNicoll, F., Drummelsmith, J., Müller, M., Madore, E., Boilard, N., Ouellette, M.,
Papadopoulou, B.: A combined proteomic and transcriptomic approach to the study
of stage differentiation in Leishmania infantum. Proteomics 6, 3567–3581 (2006)
[10] Rosenzweig, D., Smith, D., Opperdoes, F., Stern, S., Olafson, R.W., Zilberstein, D.:
Retooling Leishmania metabolism: from sand fly gut to human macrophage. FASEB
J. (2007), doi:10.1096/fj.07-9254com
[11] Lynn, M.A., McMaster, W.R.: Leishmania: conserved evolution-diverse diseases.
Trends Parasitol 24, 103–105 (2008)
[12] Storey, J.D., Tibshirani, R.: Statistical significance for genome-wide experiments.
Proc. Natl. Acad. Sci. 100, 9440–9445 (2003)
[13] Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette,
M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., Mesirov, J.P.: Gene
set enrichment analysis: A knowledge-based approach for interpreting genome-wide
expression profiles. PNAS 102, 15545–15550 (2005)
[14] Edgar, R., Domrachev, M., Lash, A.E.: Gene Expression Omnibus: NCBI gene ex-
pression and hybridization array data repository. Nucleic Acid Res. 30, 207–210
(2002)
[15] DeLuca, T.F., Wu, I.H., Pu, J., Monaghan, T., Peshkin, L., Singh, S., Wall, D.P.:
Roundup: a multi-genome repository of orthologs and evolutionary distances. Bioin-
formatics 22, 2044–2046 (2006)
[16] R Development Core Team: R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria (2005), ISBN 3-900051-07-0,
http://www.R-project.org
[17] Vert, J., Tsuda, K., Schölkopf, B.: A primer on kernels. In: Schölkopf, B., Tsuda, K.,
Vert, J. (eds.) Kernel methods in computational biology. The MIT Press, Cambridge
(2004)
[18] Yamanishi, Y., Vert, J.P., Nakaya, A., Kaneisha, M.: Extraction of correlated clusters
from multiple genomic data by generalized kernel canonical correlation analysis. Bio-
informatics 19, 323–330 (2003)
[19] López Kleine, L., Monnet, V., Pechoux, C., Trubuil, A.: Role of bacterial peptidase F
inferred by statistical analysis and further experimental validation. HFSP J. 2, 29–41
(2008)
[20] Kondor, R.I., Lafferty, J.: Diffusion kernels on graphs and other discrete structures.
In: Sammut, C., Hoffmann, A.G. (eds.) Machine learning: proceedings of the 19th in-
ternational conference. Morgan Kaufmann, San Francisco (2002)
An Experimental Evaluation of a Novel
Stochastic Method for Iterative Class
Discovery on Real Microarray Datasets
Abstract. Within a gene expression matrix, there are usually several particular
macroscopic phenotypes of samples related to some diseases or drug effects, such
as diseased samples, normal samples or drug treated samples. The goal of sample-
based clustering is to find the phenotype structures of these samples. A novel
method for automatically discovering clusters of samples which are coherent from
a genetic point of view is evaluated on publicly available datasets. Each possible
cluster is characterized by a fuzzy pattern which maintains a fuzzy discretization
of relevant gene expression values. Possible clusters are randomly constructed and
iteratively refined by following a probabilistic search and an optimization schema.
1 Introduction
Following the advent of high-throughput microarray technology it is now possible
to simultaneously monitor the expression levels of thousands of genes during
important biological processes and across collections of related samples. In this
Florentino Fdez-Riverola
ESEI: Escuela Superior de Ingeniería Informática, University of Vigo,
Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
e-mail: [email protected],
{dgpena, mrjato, pavon, riverola}@uvigo.es
Fernando Díaz
EUI: Escuela Universitaria de Informática, University of Valladolid, Plaza Santa Eulalia,
9-11, 40005, Segovia, Spain
e-mail: [email protected]
M.P. Rocha et al. (Eds.): IWPACBB 2010, AISC 74, pp. 9–16, 2010.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
10 H. Gómez et al.
context, sample-based clustering is one of the most common methods for discovering
disease subtypes as well as unknown taxonomies. By revealing hidden structures
in microarray data, cluster analysis can potentially lead to more tailored
therapies for patients as well as better diagnostic procedures.
From a practical point of view, existing sample-based clustering methods can be
(i) directly applied to cluster samples using all the genes as features (i.e., classical
techniques such as K-means, SOM, HC, etc.) or (ii) executed after a set of informative
genes has been identified. The problem with the first approach is the low signal-to-noise
ratio (smaller than 1:10), which is known to seriously reduce the accuracy of clustering
results because of noise and outliers in the samples [1]. To overcome
such difficulties, particular methods can be applied to identify informative genes
and reduce gene dimensionality prior to clustering samples in order to detect their
phenotypes. In this context, both supervised and unsupervised informative gene
selection techniques have been developed.
While supervised informative gene selection techniques often yield high clustering
accuracy rates, unsupervised informative gene selection methods are more
complex because they assume that no a priori phenotype information is assigned to
any sample [2]. In such a situation, two general strategies have been adopted to
address the lack of prior knowledge: (i) unsupervised gene selection, which aims to
reduce the number of genes before clustering samples by using appropriate statistical
models, and (ii) interrelated clustering, which exploits the relationship between
the genes and samples to perform gene selection and sample
clustering simultaneously in an iterative paradigm. Following the second strategy
for unsupervised informative gene selection (interrelated clustering), Ben-Dor et
al. [3] present an approach based on statistically scoring candidate partitions
according to the overabundance of genes that separate the different classes. Xing and
Karp [1] use a feature filtering procedure for ranking features according to their
intrinsic discriminability and irredundancy to other relevant features. Their
clustering algorithm is based on the concept of a normalized cut for grouping samples
into a new reference partition. Von Heydebreck et al. [4] and Tang et al. [5] propose
algorithms for selecting sample partitions and corresponding gene sets by defining
an indicator of partition quality and a search procedure to maximize this parameter.
Varma and Simon [6] describe an algorithm for automatically detecting clusters
of samples that are discernible only in a subset of genes.
In this contribution we focus on the evaluation of a novel simulated annealing-based
algorithm for iterative class discovery. The rest of the paper is structured
as follows: Section 2 sketches the proposed method and introduces the relevant
aspects of the technique. Section 3 presents the experimental setup and
the results obtained from publicly available microarray data. Section 4
discusses the results obtained by the proposed technique. Finally,
Section 5 summarizes the main conclusions of this work.
Our clustering technique is not based on the distance between the microarrays
belonging to each given cluster, but rather on the notion of genetic coherence of
its own clusters. The genetic coherence of a given partition is calculated by taking
into consideration the genes which share the same expression value through all the
samples belonging to the cluster (which we term a fuzzy pattern), but discarding
those genes present due to pure chance (herein referred to as noisy genes of a fuzzy
pattern). The proposed clustering technique combines (i) the simplicity and
good performance of a heuristic search method able to find good partitions in the
space of all possible partitions of the set of samples with (ii) the robustness of
fuzzy logic, able to cope with several levels of uncertainty and imprecision by
using partial truth values. A global view of the proposed method is sketched in
Figure 1. This figure shows how, starting from the fuzzy discretization of the
microarrays in the raw dataset, the method performs a stochastic search, looking for a “good
3 Experimental Results
In this section we evaluate the proposed algorithm on two public microarray datasets,
herein referred to as the HC-Salamanca dataset [7] and the Armstrong dataset [8].
                        Predicted class
True class    APL       Inv       Mono      Other
APL           76.19%    2.71%     2.18%     18.92%
Inv           7.79%     26.49%    33.66%    32.06%
Mono          3.11%     17.81%    51.73%    27.35%
Other         8.62%     5.56%     8.70%     77.12%
Assuming as “ground truth” the clustering given by the authors in [7], the performance
of the clustering process can be tested by comparing the results given in both
tables. Commonly used indices such as the Rand index and the Jaccard coefficient
measure the degree of similarity between two partitions.
For the clustering given by our experiment, the Rand index was 0.90 and the
Jaccard coefficient was 0.77.
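Both indices can be computed by counting pairs of samples, as in the following Python sketch. It uses the standard pair-counting definitions and is not taken from the paper's implementation; the function names are illustrative. A pair of samples counts as an agreement when both partitions put the two samples together, or both keep them apart.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Count sample pairs by whether each partition groups them together."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two partitions agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    return (both + neither) / (both + only_a + only_b + neither)


def jaccard_coefficient(labels_a, labels_b):
    """Like the Rand index, but ignoring pairs separated in both partitions."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    denominator = both + only_a + only_b
    # Convention chosen here: two all-singleton partitions are identical.
    return both / denominator if denominator else 1.0
```

The Jaccard coefficient is never larger than the Rand index, because it discards the pairs that both partitions keep apart; this is consistent with the reported values of 0.77 and 0.90.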
                        Predicted class
True class    ALL       AML       MLL
ALL           65.88%    5.16%     28.95%
AML           4.42%     86.40%    9.18%
MLL           34.74%    12.85%    52.41%
4 Discussion
The aim of the experiments reported in the previous section is to test the validity
of the proposed clustering method. In unsupervised classification it is
very difficult to test the ability of a method to perform the clustering, since there is
no supervision of the process. For this reason, the classification into different groups
proposed by the authors in [7, 8] is taken as the reference partition of samples
in our work. This assumption may be questionable in some cases, since the
reference groups are not well established. For example, in the HC-Salamanca
dataset the AML with inversion group is established by observation of the karyotype
of cancer cells, but there is no other evidence (biological, genetic) suggesting
that this group corresponds to a distinct disease. Even so, assuming these
prior partitions as reference groups is the only way to evaluate, on the basis of
existing knowledge, the similarity (or dissimilarity) of the results computed by the
proposed method. As it turns out, there is no perfect match between the results of our
proposed method and the reference partitions, but they are compatible with the current
knowledge of each dataset. For example, for the HC-Salamanca dataset the
best characterized groups are the APL and Other-AML groups, the worst is the
AML with inversion group, and there is some confusion of the monocytic AML
with the AML with inversion and Other-AML groups. These results are compatible
with the state of the art discussed in [7], where the APL group is the best
characterized disease (it can be considered as a distinct class), the monocytic
AML is a promising disease, the AML with inversion in chromosome 16 is the
weakest class, and the Other-AML group acts as the dumping ground for the rest
of the samples, which are not similar enough to the other possible classes. For the
Armstrong dataset, the AML group is clearly separated from the MLL and ALL
groups. This is not surprising, since myeloid leukemia (AML) and lymphoblastic
leukemias (MLL and ALL) represent distinct diseases. Some confusion is present
between the ALL and MLL groups, but this result is compatible with the assumption
(which the authors test in [8]) that the MLL group is a subtype of the ALL disease.
5 Conclusion
The simulated annealing-based algorithm presented in this work is a new
algorithm for iterative class discovery that uses fuzzy logic for informative gene
selection. An intrinsic advantage of the proposed method is that, given the
percentage of times in which a given microarray has been grouped with samples
of other potential classes, the degree of membership of that microarray to each
potential group can be deduced. This allows a fuzzy clustering of the available
microarrays, which is more suitable for the current state of the art in gene
expression analysis, since it is very unlikely that one can state (without uncertainty)
that any available microarray belongs to only one potential cluster. In this case, the
proposed method can help to assess the degree of affinity of each microarray with
potential groups and to guide the analyst in the discovery of new diseases.
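How membership degrees might be derived from repeated runs can be sketched in Python as follows. The exact bookkeeping is not specified in the text, so this is an assumed reading: for each sample, count how often it shares a cluster with samples of each reference class across several runs of the stochastic algorithm, and normalize the counts. The names `runs` and `reference` are hypothetical.

```python
from collections import defaultdict


def membership_degrees(runs, reference):
    """Estimate fuzzy memberships from repeated clustering runs.

    runs: list of label lists, one per run of the stochastic algorithm.
    reference: reference class name of each sample (same order as the labels).
    Returns one dict per sample mapping class name -> degree in [0, 1].
    """
    n_samples = len(reference)
    degrees = []
    for s in range(n_samples):
        counts = defaultdict(int)
        total = 0
        for labels in runs:
            for t in range(n_samples):
                # Count every other sample that shares a cluster with s.
                if t != s and labels[t] == labels[s]:
                    counts[reference[t]] += 1
                    total += 1
        degrees.append({c: counts[c] / total for c in counts} if total else {})
    return degrees
```

A sample that is always grouped with a single class receives degree 1.0 for that class, while a sample that drifts between groups receives intermediate degrees.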
In addition, the proposed method is also an unsupervised technique for gene
selection when it is used in conjunction with the concept of discriminant fuzzy
pattern (DFP) introduced in [9]. Since the selected genes depend on the resulting
clustering (they are the genes in the computed DFP obtained from all groups) and
the clustering is obtained by maximizing the cost function (which is based on the
notion of genetic coherence and assessed by the number of genes in the fuzzy
pattern of each cluster), the selected genes jointly depend on all the genes in the
microarray, and the proposed method can also be considered a multivariate
method for gene selection.
Finally, the proposed technique, in conjunction with our previously developed
GENECBR platform [10], represents a more sophisticated tool which integrates
three main tasks in expression analysis: clustering, gene selection and
classification. In this context, all the proposed methods are non-parametric (they do not
depend on assumptions about the underlying distribution of available data), unbiased
with regard to the basic computational facility used to construct them (the notion
of fuzzy pattern) and able to manage imprecise (and hence, uncertain)
information, which is implicit in available datasets in terms of degree of membership
to linguistic labels (expression levels, potential categories, etc.).
Acknowledgements
This work is supported in part by the project Development of computational tools for the
classification and clustering of gene expression data in order to discover meaningful
biological information in cancer diagnosis (ref. VA100A08) from JCyL (Spain).
References
1. Xing, E.P., Karp, R.M.: CLIFF: clustering of high-dimensional microarray data via it-
erative feature filtering using normalized cuts. Bioinformatics 17, S306–S315 (2001)
2. Jiang, D., Tang, C., Zhang, A.: Cluster analysis for gene expression data: a survey.
IEEE T. Knowl. Data En. 16, 1370–1386 (2004)
3. Ben-Dor, A., Friedman, N., Yakhini, Z.: Class discovery in gene expression data. In:
Proceedings of the Fifth Annual International Conference on Computational Biology.
ACM, Montreal (2001)
4. von Heydebreck, A., Huber, W., Poustka, A., Vingron, M.: Identifying splits with clear
separation: a new class discovery method for gene expression data. Bioinformatics 17,
S107–S114 (2001)
5. Tang, C., Zhang, A., Ramanathan, M.: ESPD: a pattern detection model underlying
gene expression profiles. Bioinformatics 20, 829–838 (2004)
6. Varma, S., Simon, R.: Iterative class discovery and feature selection using Minimal
Spanning Trees. BMC Bioinformatics 5, 126 (2004)
7. Gutiérrez, N.C., López-Pérez, R., Hernández, J.M., Isidro, I., González, B., Delgado,
M., Fermiñán, E., García, J.L., Vázquez, L., González, M., San Miguel, J.F.: Gene ex-
pression profile reveals deregulation of genes with relevant functions in the different
subclasses of acute myeloid leukemia. Leukemia 19, 402–409 (2005)
8. Armstrong, S.A., Staunton, J.E., Silverman, L.B., Pieters, R., den Boer, M.L., Minden,
M.D., Sallan, S.E., Lander, E.S., Golub, T.R., Korsmeyer, S.J.: MLL translocations
specify a distinct gene expression profile that distinguishes a unique leukemia. Nat.
Genet. 30, 41–47 (2002)
9. Díaz, F., Fdez-Riverola, F., Corchado, J.M.: geneCBR: a case-based reasoning tool for
cancer diagnosis using microarray data sets. Comput. Intell. 22, 254–268 (2006)
10. Glez-Peña, D., Díaz, F., Hernández, J.M., Corchado, J.M., Fdez-Riverola, F.: ge-
neCBR: a translational tool for multiple-microarray analysis and integrative informa-
tion retrieval for aiding diagnosis in cancer research. BMC Bioinformatics 10, 187
(2009)
Automatic Workflow during the Reuse Phase of
a CBP System Applied to Microarray Analysis
1 Introduction
The continuous growth of techniques for obtaining cancerous samples, specifically
those using microarray technologies, provides a great amount of data. Microarray
technology has become an essential tool in genomic research, making it possible to
investigate genes globally in all aspects of human disease [4]. Expression arrays [5]
contain information about certain genes in a patient’s samples. These data have a high
dimensionality and require new powerful tools.
This paper presents an innovative solution to model reorganization systems in
biomedical environments. It is based on a multi-agent architecture that can
integrate Web services, and incorporates a novel planning mechanism that makes it
possible to determine workflows based on existing plans and previous results. The
M.P. Rocha et al. (Eds.): IWPACBB 2010, AISC 74, pp. 17–24, 2010.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
18 J.F. De Paz, A.B. Gil, and E. Corchado
• The Controller agent manages the agents available in the different layers
of the multiagent system. It allows the registration of agents in the layers,
as well as their use in the organization.
• Analysis Services: The analysis services are services used by analysis
agents for carrying out different tasks. The analysis services include services
for pre-processing, filtering, clustering and extraction of knowledge.
connection with the subsequent service, i.e., S21, for which service a1 is executed.
Lastly, column S1x executes action S1f.
Fig. 1 Plans and plan actions carried out through a concatenation of services
2.1.1 Retrieve
During the retrieval stage, the plans with the greatest and least efficiency are
selected from among those that have been applied. Microarrays are composed of
probes that represent variables that mark the level of significance of specific
genes. The retrieval of those cases is performed in one of two ways according to
the case study. To retrieve cases, it is important to consider whether there has
been a previous analysis of a case study with similar characteristics. If so, the
corresponding plans for the same case study are selected.
If there are no plans that correspond to the same case study, or if the number of
plans is insufficient, the plans corresponding to the most similar case study are
retrieved. The most similar case study is selected according to the
cosine distance applied to the following set of variables: number of probes,
number of cases, and Pearson's coefficient of variation [12] for e0.
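The selection of the most similar case study can be sketched in Python as follows. This is a minimal illustration and not the system's code: each case study is described by a tuple (number of probes, number of cases, coefficient of variation), and the study names and values used with it are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine of the angle between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def most_similar_case_study(query, stored):
    """Return the name of the stored case study closest to `query`.

    query: descriptor tuple of the new case study.
    stored: dict mapping case-study name -> descriptor tuple.
    """
    return min(stored, key=lambda name: cosine_distance(query, stored[name]))
```

Because the number of probes is typically orders of magnitude larger than the other two variables, it dominates the cosine unless the variables are rescaled first; the text does not say whether such a rescaling is applied.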
The number of efficient and inefficient cases selected is predetermined so that
at the end of this stage the following set of elements is obtained:
$$P = P_e \cup P_i, \qquad P_e = \{p_1^e, \ldots, p_n^e\}, \qquad P_i = \{p_1^i, \ldots, p_n^i\} \qquad (3)$$
Pe represents the set of efficient plans and Pi represents the set of inefficient plans.
Once the plans have been retrieved, a new efficient plan is generated in the next
phase.
2.1.2 Reuse
This phase takes the plans P obtained in the retrieval phase and generates a new,
more efficient plan. The new plan is built according to the efficiency of the actions
as estimated by the overall efficiency $v_i$ of the plan. The efficiency of
each action is estimated according to the model defined by decision trees for
selecting significant nodes [3], that is, according to expression (4). This expression
is referred to as the winning rate (gain ratio) and depends on both node S and the
selected attribute B.
$$G(S, B) = I(S) - \sum_{i=1}^{t} \frac{|S_i|}{|S|}\, I(S_i) \qquad (4)$$
where $S$ represents a node that, in this case, will always be the root node of the
tree, $B$ is the condition for the existing action, $S_i$ represents child node $i$ of
node $S$, and $|S_i|$ is the number of cases associated with the child node $S_i$. The
function $I(S)$ represents the entropy of node $S$ and is defined as follows:
$$I(S) = -\sum_{j=1}^{n} f_j^S \cdot \log(f_j^S) \qquad (5)$$
where $f_j^S$ represents the relative frequency of class $C_j$ in $S$,
$f_j^S = n_j^S / N^S$, $n_j^S$ is the number of elements from class $C_j$ in $S$ and
$N^S$ is the total number of elements. In this case, $C_j \in \{\text{efficient}, \text{inefficient}\}$.
The gain ratio G determines the level of importance of each action by distinguishing
between efficient and inefficient plans. A high value of the gain ratio
indicates that the action should be included in the new plan if it is an action
carried out in efficient plans; otherwise, it should be eliminated.
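Expressions (4) and (5) can be illustrated with a short Python sketch. It assumes that the condition B splits the retrieved plans into those that contain a given action and those that do not, and it uses the natural logarithm since the text does not state the base; the function names are illustrative.

```python
import math
from collections import Counter


def entropy(classes):
    """I(S): entropy of the class labels found in a node (expression 5)."""
    total = len(classes)
    return -sum((n / total) * math.log(n / total)
                for n in Counter(classes).values())


def gain(classes, action_present):
    """G(S, B): entropy reduction obtained by splitting on B (expression 4).

    classes: class of each retrieved plan ("efficient" or "inefficient").
    action_present: for each plan, whether it contains the action under test.
    """
    children = {}
    for cls, present in zip(classes, action_present):
        children.setdefault(present, []).append(cls)
    weighted = sum(len(child) / len(classes) * entropy(child)
                   for child in children.values())
    return entropy(classes) - weighted
```

An action that appears in every efficient plan and in no inefficient plan obtains the maximum gain (log 2 for two balanced classes), whereas an action spread evenly over both classes obtains a gain of 0.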
A new table listing gain data is formed according to the values of the gain ratio
and the efficiency associated with each plan. A new flow of execution for each
action is created from the gains table. The gains table uses the following formula to
establish a value for the significance of each of the actions carried out in each
plan:
$$T(S_{ij}, k) = G'(S, S_{ij}) \cdot v_k \qquad (6)$$
where G' contains the values of G normalized between 0 and 1 and then inverted
(the maximum value corresponds to 0 and the minimum to 1), and v contains the
average efficiency of the plans that include connection ij. Each connection ij has
an influence on the final efficiency of the plan, represented as $t_{ijk}$.
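The gains table of expression (6) can be sketched as follows. This is a simplified reading that keeps one value per connection ij and drops the plan index k; the function names, the handling of the case where all gains are equal, and the numeric values are assumptions for illustration only.

```python
def inverted_normalize(gains):
    """Min-max scale the gains to [0, 1] and invert them:
    the maximum gain maps to 0 and the minimum gain maps to 1."""
    lowest, highest = min(gains.values()), max(gains.values())
    if highest == lowest:
        # Degenerate case, not covered by the source: all gains are equal.
        return {edge: 0.0 for edge in gains}
    return {edge: (highest - g) / (highest - lowest)
            for edge, g in gains.items()}


def gains_table(gains, mean_efficiency):
    """T for each connection: inverted normalized gain times the average
    efficiency of the plans that use that connection."""
    g_prime = inverted_normalize(gains)
    return {edge: g_prime[edge] * mean_efficiency[edge] for edge in gains}


# Hypothetical gains and average efficiencies for three connections.
gains = {("S0", "S1"): 0.2, ("S0", "S2"): 0.6, ("S1", "S3"): 0.4}
mean_efficiency = {("S0", "S1"): 0.5, ("S0", "S2"): 0.1, ("S1", "S3"): 0.3}
table = gains_table(gains, mean_efficiency)
```

Because of the inversion, the connection with the highest gain receives the lowest weight, which fits the later search for a minimal route.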
Once the graph for the plans has been constructed, the minimal route that goes
from the start node to the end node is calculated. In order to calculate the
shortest/longest route, the Dijkstra algorithm is applied, since implementations
of order n log n exist. To apply this algorithm, the graph must not contain edges
with negative values; to remove them, the absolute value of the most negative
edge is added to each of the edges.
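The route calculation described above can be sketched with a heap-based Dijkstra search. The edge list and its weights below are hypothetical (they are not the values of Figure 2), and the function names are chosen for this example.

```python
import heapq


def shift_nonnegative(edges):
    """Add the absolute value of the most negative weight to every edge,
    so that no edge weight is negative."""
    offset = max(0.0, -min(w for _, _, w in edges))
    return [(u, v, w + offset) for u, v, w in edges]


def dijkstra(edges, start, goal):
    """Return (cost, path) of a minimum-cost route from start to goal."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, []).append((v, w))
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in settled:
                heapq.heappush(queue, (cost + w, nxt, path + [nxt]))
    return float("inf"), []  # goal not reachable


# Hypothetical directed graph from the start node S0 to the end node Sf.
edges = [
    ("S0", "S1", 0.5), ("S0", "S2", 0.1), ("S2", "S1", 0.1),
    ("S1", "S3", 0.2), ("S2", "S4", 0.9), ("S3", "Sf", 0.1),
    ("S4", "Sf", 0.3),
]
cost, path = dijkstra(shift_nonnegative(edges), "S0", "Sf")
```

Note that adding the same constant to every edge increases the cost of a route in proportion to its number of edges, so the shift favours routes with fewer connections; the sketch follows the procedure as the text states it.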
Plan Variability (z) Uniform (α) Correlation (α) Cutoff Efficiency Class
p1 1 2 3 0.14641098 1
p2 1 2 3 4 1 0
p3 1 2 0.24248635 1
p4 1 2 0.14935538 1
p5 3 1 2 0.15907924 1
p6 1 0.96457118 0
p7 1 1 0
p8 1 2 1 0
[Fig. 2: directed graph of the plans, with nodes S0, S1, S2, S3, S4 and Sf and weighted edges]
Figure 2 displays the directed graph that was obtained. The final path that is
followed is shown in bold. The new estimated plan comprises the sequence
of actions S02, S21, S13, S3f.
The path followed in the plan that was obtained does not coincide with any path
previously applied, although the services that it contains present an efficiency
similar to that of plan p1, as shown in Table 1. The efficiency obtained in this
execution is 0.14458.
The system presented in this study provides a novel mechanism for global
coordination in highly changing environments. The mechanism is capable of
automatic reorganization and is especially useful for decision making in systems
that use agreement technologies. The system was applied to a case study in a
biomedical environment and can be easily extended to other environments with
similar characteristics.
24 J.F. De Paz, A.B. Gil, and E. Corchado
Acknowledgments. This development has been partially supported by the projects JCyL
SA071A08, of the Junta of Castilla and León (JCyL): [BU006A08], the project of the
Spanish Ministry of Education and Innovation [CIT-020000-2008-2] and
[CIT-020000-2009-12], and Grupo Antolin Ingenieria, S.A., within the framework of
project MAGNO2008 - 1028.- CENIT also funded by the same Government Ministry.
References
[1] Kolodner, J.: Case-Based Reasoning. Morgan Kaufmann, San Francisco (1993)
[2] Glez-Bedia, M., Corchado, J.: A planning strategy based on variational calculus for deliberative agents. Computing and Information Systems Journal 10(1), 2–14 (2002)
[3] Kohavi, R., Quinlan, J.R.: Decision Tree Discovery. In: Handbook of Data Mining and Knowledge Discovery, pp. 267–276. Oxford University Press, Oxford (2002)
[4] Quackenbush, J.: Computational analysis of microarray data. Nature Reviews Genetics 2(6), 418–427 (2001)
[5] Affymetrix, http://www.affymetrix.com/support/technical/datasheets/hgu133arrays_datasheet.pdf
[6] Corchado, J.M., Bajo, J., De Paz, Y., Tapia, D.I.: Intelligent Environment for Monitoring Alzheimer Patients, Agent Technology for Health Care. Decision Support Systems 44(2), 382–396 (2008)
[7] Ardissono, L., Petrone, G., Segnan, M.: A conversational approach to the interaction with Web Services. Computational Intelligence, vol. 20, pp. 693–709. Blackwell Publishing, Malden (2004)
[8] Oliva, E., Natali, A., Ricci, A., Viroli, M.: An Adaptation Logic Framework for Java-based Component Systems. Journal of Universal Computer Science 14(13), 2158–2181 (2008)
[9] Bratman, M.: Intention, Plans and Practical Reason. Harvard U.P., Cambridge (1987)
[10] Corchado, J.M., De Paz, J.F., Rodríguez, S., Bajo, J.: Model of experts for decision support in the diagnosis of leukemia patients. Artificial Intelligence in Medicine 46(3), 179–200 (2009)
[11] Horner, M.J., Ries, L.A.G., Krapcho, M., Neyman, N., Aminou, R., Howlader, N., Altekruse, S.F., Feuer, E.J., Huang, L., Mariotto, A., Miller, B.A., Lewis, D.R., Eisner, M.P., Stinchcomb, D.G., Edwards, B.K. (eds.): SEER Cancer Statistics Review, 1975-2006, National Cancer Institute (2009), http://seer.cancer.gov/csr/1975_2006/
[12] Kuo, C.D., Chen, G.Y., Wang, Y.Y., Hung, M.J., Yang, J.L.: Characterization and quantification of the return map of RR intervals by Pearson coefficient in patients with acute myocardial infarction. Autonomic Neuroscience 105(2), 145–152 (2003)
A Comparative Study of Microarray Data
Classification Methods Based on Ensemble
Biological Relevant Gene Sets
Abstract. In this work we study the utilization of several ensemble alternatives for
the task of classifying microarray data by using prior knowledge known to be
biologically relevant to the target disease. The purpose of the work is to obtain an
accurate ensemble classification model able to outperform baseline classifiers by
introducing diversity in the form of different gene sets. The proposed model takes
advantage of WhichGenes, a powerful gene set building tool that allows the
automatic extraction of lists of genes from multiple sparse data sources. Preliminary
results using different datasets and several gene sets show that the proposal is able
to outperform basic classifiers by using existing prior knowledge.
Miguel Reboiro-Jato · Daniel Glez-Peña · Juan Francisco Gálvez · Rosalía Laza Fidalgo ·
Florentino Fdez-Riverola
ESEI: Escuela Superior de Ingeniería Informática, University of Vigo,
Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
e-mail: {mrjato, dgpena, galvez, rlaza, riverola}@uvigo.es
Fernando Díaz
EUI: Escuela Universitaria de Informática, University of Valladolid, Plaza Santa Eulalia,
9-11, 40005, Segovia, Spain
e-mail: [email protected]
M.P. Rocha et al. (Eds.): IWPACBB 2010, AISC 74, pp. 25–32, 2010.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010
promising approach in cancer diagnosis since the early detection and treatment
can substantially improve the survival rates. For this task, several computational
methods (statistical and machine learning) have been proposed in the literature
including linear discriminant analysis (LDA), Naïve-Bayes classifier (NBC),
learning vector quantization (LVQ), radial basis function (RBF) networks, deci-
sion trees, probabilistic neural networks (PNNs) and support vector machines
(SVMs) among others [2]. In the same line, but following the assumption that a
classifier ensemble system is more robust than an excellent single classifier [3],
some researchers have also successfully applied different classifier ensemble sys-
tems to deal with the classification of microarray datasets [4].
In addition to predictive performance, there is also hope that microarray studies
uncover molecular disease mechanisms. However, in many cases the molecular
signatures discovered by the algorithms are unfocused from a biological point of
view [5]. In fact, they often look more like random gene lists than biologically
plausible and understandable signatures. Another shortcoming of standard
classification algorithms is that they treat gene-expression levels as anonymous
attributes. However, a lot is known about the function and the role of many genes in
certain biological processes.
Although numerical analysis of microarray data is considerably consolidated,
the true integration of numerical analysis and biological knowledge is still a long
way off [6]. The inclusion of additional knowledge sources in the classification
process can prevent the discovery of the obvious, complement a data-inferred
hypothesis with references to already proposed relations, help analysis to avoid
overconfident predictions and allow us to systematically relate the analysis findings
to present knowledge [7]. In this work we incorporate relevant gene sets obtained
from WhichGenes [8] in order to make predictions easy to interpret in concert with
the incorporated knowledge. The study aims to borrow information from existing
biological knowledge to improve both the predictive accuracy and the
interpretability of the resulting classifiers.
The rest of the paper is structured as follows: Section 2 presents a brief review
about the use of ensemble methods for classifying microarray data. Section 3 de-
scribes the selected datasets and base classifiers for the current study, together
with the choice of gene sets and the different approaches used for ensemble
creation. Finally Section 4 discusses the reported results and concludes the paper.
2 Related Work
Although much research has been performed on applying machine learning
techniques to microarray data classification in recent years, it has been shown
that conventional machine learning techniques have intrinsic drawbacks in
achieving accurate and robust classifications. In order to obtain more robust
microarray data classification techniques, several authors have investigated the
benefits of ensemble approaches applied to genomic research.
Díaz-Uriarte and Alvarez de Andrés [9] investigated the use of random forest
for multi-class classification of microarray data and proposed a new method of
gene selection in classification problems based on random forest. Using simulated
and real microarray datasets, the authors showed that random forest can obtain
comparable performance to other methods, including DLDA, KNN, and SVM.
Peng [10] presented a novel ensemble approach based on seeking an optimal
and robust combination of multiple classifiers. The proposed algorithm begins
with the generation of a pool of candidate base classifiers based on gene
sub-sampling, and then selects a subset of appropriate base classifiers
to construct the classification committee based on classifier clustering.
Experimental results demonstrated that the proposed approach outperforms both
baseline classifiers and those generated by bagging and boosting.
Liu and Huang [11] applied Rotation Forest to microarray data classification
using principal component analysis, non-parametric discriminant analysis and
random projections to perform feature transformation in the original rotation
forest. In all the experiments, the authors reported that the proposed approach
outperformed bagging and boosting alternatives.
More recently, Liu and Xu [12] proposed a genetic programming approach to
analyze multiclass microarray datasets where each individual consists of a set of
small-scale ensembles containing several trees. In order to guarantee high diver-
sity in the individuals a greedy algorithm is applied. Their proposal was tested
using five datasets showing that the proposed method effectively implements the
feature selection and classification tasks.
As a particular case in the use of ensemble systems, ensemble feature selection,
proposed in [13], represents an efficient method that can also achieve high
classification accuracy by combining base classifiers built with different feature
subsets. In this context, the works of [14] and [15] study the use of different
genetic algorithm alternatives for performing feature selection, with the aim of
making the classifiers of the ensemble disagree on difficult cases. Reported results
in both cases showed improvements when compared against other alternatives.
In relation to previous work, the aim of this study is to validate the superiority
of different classifier ensemble approaches when using prior knowledge in the
form of biologically relevant gene sets. The objective is to improve the predictive
performance of baseline classifiers.
3 Comparative Study
In order to carry out the comparative study, we apply several ensemble alternatives
to classify three DNA microarray datasets involving various tumour tissue samples.
To validate the study, we analyze the performance of different baseline
classifiers and test our hypothesis using two different sources of information.
Table 1 Distribution of microarray data samples belonging to the public datasets analyzed
In this study, base classifiers are trained with all the samples in each data set, so
no work is performed at the data level. Work at the feature level is carried out by
incorporating gene set data into the ensemble models. Each pathway or group of
genes is used as a feature selection, so microarray data is filtered to keep only the
expression levels of those genes belonging to some group before training the base
classifiers.
In order to construct the final ensemble, our approach consists of two sequential
steps: (i) classifier selection, in which each simple classifier is initially trained
with each gene set following a stratified 10-fold cross-validation process for
estimating its performance and (ii) classifier training, where the selected
simple_classifier/gene_set pairs are trained with the whole data set. All the different
strategies proposed in this study for the selection of promising classifiers are based
on the value of the kappa statistic obtained for each simple_classifier/gene_set
pair in the first step. The proposed heuristics are the following:
• All classifiers [AC]: every simple_classifier/gene_set pair is used for
constructing the final ensemble.
• All gene sets [AG]: for each gene set, the simple_classifier/gene_set pair
with best kappa value is selected for constructing the final ensemble.
• Best classifiers without type [BCw/oT_%]: a global threshold is
calculated as a percentage of the best kappa value obtained by the winner
simple_classifier/gene_set pair. Those pairs with a kappa value equal to or
higher than the computed threshold are selected.
• Best classifier by type [BCbyT_%]: as in the previous heuristic a given
threshold is calculated, but in this case there is a threshold for each
simple classifier type.
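The four heuristics above can be sketched as a single selection function over the kappa values obtained in the cross-validation step. The function name, the representation of a pair as a (classifier, gene_set) tuple, the default percentage and the kappa values are assumptions made for this example; the sketch also assumes that the best kappa value is positive.

```python
def select_pairs(kappa, heuristic, percent=0.9):
    """Select simple_classifier/gene_set pairs for the final ensemble.

    kappa     -- maps (classifier, gene_set) to the cross-validated kappa value
    heuristic -- one of "AC", "AG", "BCw/oT", "BCbyT"
    percent   -- fraction of the best kappa used as threshold (BC heuristics)
    """
    if heuristic == "AC":
        # All classifiers: every pair is used.
        return set(kappa)
    if heuristic == "AG":
        # All gene sets: keep the best pair for each gene set.
        best = {}
        for (clf, gene_set), k in kappa.items():
            if gene_set not in best or k > kappa[best[gene_set]]:
                best[gene_set] = (clf, gene_set)
        return set(best.values())
    if heuristic == "BCw/oT":
        # One global threshold derived from the overall best kappa.
        threshold = percent * max(kappa.values())
        return {pair for pair, k in kappa.items() if k >= threshold}
    if heuristic == "BCbyT":
        # One threshold per classifier type.
        best_by_type = {}
        for (clf, _), k in kappa.items():
            best_by_type[clf] = max(k, best_by_type.get(clf, k))
        return {pair for pair, k in kappa.items()
                if k >= percent * best_by_type[pair[0]]}
    raise ValueError(f"unknown heuristic: {heuristic}")


# Hypothetical kappa values for two classifier types and two gene sets.
kappa = {("SVM", "g1"): 0.80, ("SVM", "g2"): 0.60,
         ("NB", "g1"): 0.50, ("NB", "g2"): 0.70}
```

With these values and a 90% threshold, BCw/oT keeps only the SVM/g1 pair, whereas BCbyT also keeps NB/g2, so every classifier type remains represented in the ensemble.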
The final output of the ensemble is also calculated from the kappa statistic.
The combination approach used for the proposed ensembles is a weighted
majority vote where the weight of each vote is the corresponding
classifier's kappa value.
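A kappa-weighted majority vote of this kind can be sketched in a few lines. The function name, the class labels and the kappa values below are hypothetical and serve only to illustrate the combination rule.

```python
from collections import defaultdict


def weighted_vote(predictions, kappa):
    """Return the label whose supporting classifiers have the largest
    total kappa value.

    predictions -- maps each selected pair to its predicted label
    kappa       -- maps each selected pair to its kappa value (vote weight)
    """
    totals = defaultdict(float)
    for pair, label in predictions.items():
        totals[label] += kappa[pair]
    return max(totals, key=totals.get)


# Hypothetical ensemble of three pairs voting on one sample.
kappa = {("SVM", "g1"): 0.8, ("NB", "g2"): 0.3, ("KNN", "g1"): 0.4}
predictions = {("SVM", "g1"): "tumour",
               ("NB", "g2"): "normal",
               ("KNN", "g1"): "normal"}
label = weighted_vote(predictions, kappa)
```

Here two of the three classifiers vote "normal", but their combined weight (0.7) is lower than that of the single more reliable classifier (0.8), so the ensemble outputs "tumour".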
Table 3 presents the same experimentation but using the OMIM gene sets.
Once again, the BCbyT heuristic achieved good performance. Compared against
single classifiers, the performance of the ensembles is even better than in the
previous experimentation (using KEGG gene sets). The BCw/oT heuristic also
performs better with the OMIM gene sets, being slightly superior to the BCbyT
heuristic. Ensembles using this strategy not only performed better than single
classifiers, but also achieved the best kappa value in two of the three analyzed
data sets.
To sum up, we can conclude that the BCbyT heuristic performed as the best base
classifier selection strategy, followed closely by the BCw/oT heuristic. This fact
backs up the following ideas: (i) depending on the data set, there is no single
classifier able to achieve good performance in concert with the supplied knowledge
and (ii) the presence of each classifier type in the final ensemble may improve the
classification performance.
Regardless of the data set, both the BCw/oT and BCbyT heuristics behave
uniformly, performing better than single baseline classifiers. This circumstance
confirms that ensembles generally perform better than single classifiers, in
this case by taking advantage of prior structured knowledge.
References
1. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)
2. Ressom, H.W., Varghese, R.S., Zhang, Z., Xuan, J., Clarke, R.: Classification algorithms for phenotype prediction in genomics and proteomics. Frontiers in Bioscience 13, 691–708 (2008)
3. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley Interscience, Hoboken (2004)
4. Liu, K.H., Li, B., Wu, Q.Q., Zhang, J., Du, J.X., Liu, G.Y.: Microarray data classification based on ensemble independent component selection. Computers in Biology and Medicine 39(11), 953–960 (2009)
5. Lottaz, C., Spang, R.: Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics 21(9), 1971–1978 (2005)
6. Cordero, F., Botta, M., Calogero, R.A.: Microarray data analysis and mining approaches. Briefings in Functional Genomics and Proteomics 6(4), 265–281 (2007)
7. Bellazzi, R., Zupan, B.: Methodological Review: Towards knowledge-based gene expression data mining. Journal of Biomedical Informatics 40(6), 787–802 (2007)
8. Glez-Peña, D., Gómez-López, G., Pisano, D.G., Fdez-Riverola, F.: WhichGenes: a web-based tool for gathering, building, storing and exporting gene sets with application in gene set enrichment analysis. Nucleic Acids Research 37(Web Server issue), W329–W334 (2009)
9. Díaz-Uriarte, R., Alvarez de Andrés, S.: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7, 3 (2006)
10. Peng, Y.: A novel ensemble machine learning for robust microarray data classification. Computers in Biology and Medicine 36(6), 553–573 (2006)
11. Liu, K.H., Huang, D.S.: Cancer classification using Rotation Forest. Computers in Biology and Medicine 38(5), 601–610 (2008)
12. Liu, K.H., Xu, C.G.: A genetic programming-based approach to the classification of multiclass microarray datasets. Bioinformatics 25(3), 331–337 (2009)
13. Opitz, D.: Feature selection for ensembles. In: Proceedings of 16th National Conference on Artificial Intelligence, Orlando, Florida (1999)
14. Kuncheva, L.I., Jain, L.C.: Designing classifier fusion systems by genetic algorithms. IEEE Transactions on Evolutionary Computation 4(4), 327–336 (2000)
15. Oliveira, L.S., Morita, M., Sabourin, R.: Feature selection for ensembles using the multi-objective optimization approach. Studies in Computational Intelligence 16, 49–74 (2006)
16. Gutiérrez, N.C., López-Pérez, R., Hernández, J.M., Isidro, I., González, B., Delgado, M., Fermiñán, E., García, J.L., Vázquez, L., González, M., San Miguel, J.F.: Gene expression profile reveals deregulation of genes with relevant functions in the different subclasses of acute myeloid leukemia. Leukemia 19(3), 402–409 (2005)
17. Bullinger, L., Döhner, K., Bair, E., Fröhling, S., Schlenk, R.F., Tibshirani, R., Döhner, H., Pollack, J.R.: Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. The New England Journal of Medicine 350(16), 1506–1516 (2004)
18. Valk, P.J., Verhaak, R.G., Beijen, M.A., Erpelinck, C.A., Barjesteh van Waalwijk van Doorn-Khosrovani, S., Boer, J., Beverloo, H., Moorhouse, M., van der Spek, P., Löwenberg, B., Delwel, R.: Prognostically useful gene-expression profiles in Acute Myeloid Leukemia. The New England Journal of Medicine 350(16), 1617–1628 (2004)
19. Tai, F., Pan, W.: Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms. Bioinformatics 23(14), 1775–1782 (2007)
20. Wei, Z., Li, H.: Nonparametric pathway-based regression models for analysis of genomic data. Biostatistics 8(2), 265–284 (2007)
Predicting the Start of Protein
α-Helices Using Machine Learning
Algorithms
1 Introduction
Proteins are complex structures synthesised by living organisms. They are
a fundamental type of molecule and can perform a large number of
functions in cell biology. Proteins can assume catalytic roles and accelerate or
inhibit chemical reactions in our body. They can assume roles of transportation
of smaller molecules, storage, movement, mechanical support, immunity
and control of cell growth and differentiation [25]. All of these functions rely
on the 3D structure of the protein. The process of going from a linear sequence
of amino acids, which together compose a protein, to the protein's 3D
shape is named protein folding. Anfinsen's work [29] has proven that the primary
structure determines the way a protein folds. Protein folding is so important
that whenever it does not occur correctly it may produce diseases such as
Alzheimer's, Bovine Spongiform Encephalopathy (BSE), usually known as
mad cow disease, Creutzfeldt-Jakob disease (CJD), Amyotrophic Lateral
Sclerosis (ALS), Huntington's syndrome, Parkinson's disease, and other diseases
related to cancer.
A major challenge in Molecular Biology is to unveil the process of protein
folding. Several projects have been set up with that purpose. Although protein
function is ultimately determined by their 3D structure, a set of other
intermediate structures has been identified that can help in the formation of the
Rui Camacho · Rita Ferreira · Natacha Rosa · Vânia Guimarães
LIAAD & Faculdade de Engenharia da Universidade do Porto, Portugal
Nuno A. Fonseca · Vítor Santos Costa
CRACS-INESC Porto LA, Portugal
Vítor Santos Costa
DCC-Faculdade de Ciências da Universidade do Porto, Portugal
Miguel de Sousa · Alexandre Magalhães
REQUIMTE/Faculdade de Ciências da Universidade do Porto, Portugal
M.P. Rocha et al. (Eds.): IWPACBB 2010, AISC 74, pp. 33–41, 2010.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010