The document provides information about the 'Advances in Bioinformatics' ebook, which is a compilation of papers presented at the 4th International Workshop on Practical Applications of Computational Biology and Bioinformatics in 2010. It emphasizes the growing importance of computational techniques in handling biological data and the need for interdisciplinary collaboration in the field. The ebook includes contributions from various authors and is available for download at ebooknice.com.

(Ebook) Advances in Bioinformatics: 4th International Workshop on Practical
Applications of Computational Biology and Bioinformatics 2010 (IWPACBB 2010)
by Liliana López Kleine, Víctor Andrés Vera Ruiz (auth.), Miguel P. Rocha,
Florentino Fernández Riverola, Hagit Shatkay, Juan Manuel Corchado (eds.)
ISBN 9783642132131, 3642132138

Advances in Intelligent and
Soft Computing 74
Editor-in-Chief: J. Kacprzyk
Advances in Intelligent and Soft Computing
Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: [email protected]
Further volumes of this series can be found on our homepage: springer.com
Vol. 58. J. Mehnen, A. Tiwari, M. Köppen, A. Saad (Eds.)
Applications of Soft Computing, 2009
ISBN 978-3-540-89618-0

Vol. 59. K.A. Cyran, S. Kozielski, J.F. Peters, U. Stańczyk, A. Wakulicz-Deja (Eds.)
Man-Machine Interactions, 2009
ISBN 978-3-642-00562-6

Vol. 60. Z.S. Hippe, J.L. Kulikowski (Eds.)
Human-Computer Systems Interaction, 2009
ISBN 978-3-642-03201-1

Vol. 61. W. Yu, E.N. Sanchez (Eds.)
Advances in Computational Intelligence, 2009
ISBN 978-3-642-03155-7

Vol. 62. B. Cao, T.-F. Li, C.-Y. Zhang (Eds.)
Fuzzy Information and Engineering Volume 2, 2009
ISBN 978-3-642-03663-7

Vol. 63. Á. Herrero, P. Gastaldo, R. Zunino, E. Corchado (Eds.)
Computational Intelligence in Security for Information Systems, 2009
ISBN 978-3-642-04090-0

Vol. 64. E. Tkacz, A. Kapczynski (Eds.)
Internet – Technical Development and Applications, 2009
ISBN 978-3-642-05018-3

Vol. 65. E. Kącki, M. Rudnicki, J. Stempczyńska (Eds.)
Computers in Medical Activity, 2009
ISBN 978-3-642-04461-8

Vol. 66. G.Q. Huang, K.L. Mak, P.G. Maropoulos (Eds.)
Proceedings of the 6th CIRP-Sponsored International Conference on Digital
Enterprise Technology, 2009
ISBN 978-3-642-10429-9

Vol. 67. V. Snášel, P.S. Szczepaniak, A. Abraham, J. Kacprzyk (Eds.)
Advances in Intelligent Web Mastering - 2, 2010
ISBN 978-3-642-10686-6

Vol. 68. V.-N. Huynh, Y. Nakamori, J. Lawry, M. Inuiguchi (Eds.)
Integrated Uncertainty Management and Applications, 2010
ISBN 978-3-642-11959-0

Vol. 69. E. Piętka and J. Kawa (Eds.)
Information Technologies in Biomedicine, 2010
ISBN 978-3-642-13104-2

Vol. 70. XXX

Vol. 71. XXX

Vol. 72. J.C. Augusto, J.M. Corchado, P. Novais, C. Analide (Eds.)
Ambient Intelligence and Future Trends, 2010
ISBN 978-3-642-13267-4

Vol. 73. J.M. Corchado, P. Novais, C. Analide, J. Sedano (Eds.)
Soft Computing Models in Industrial and Environmental Applications,
5th International Workshop (SOCO 2010), 2010
ISBN 978-3-642-13160-8

Vol. 74. M.P. Rocha, F.F. Riverola, H. Shatkay, J.M. Corchado (Eds.)
Advances in Bioinformatics
ISBN 978-3-642-13213-1

Miguel P. Rocha,
Florentino Fernández Riverola, Hagit Shatkay,
and Juan Manuel Corchado (Eds.)

Advances in Bioinformatics
4th International Workshop on Practical
Applications of Computational Biology
and Bioinformatics 2010 (IWPACBB 2010)

Editors

Miguel P. Rocha
Dep. Informática / CCTC
Universidade do Minho
Campus de Gualtar
4710-057 Braga
Portugal

Hagit Shatkay
Computational Biology and Machine Learning Lab
School of Computing
Queen’s University Kingston
Ontario K7L 3N6
Canada
E-mail: [email protected]

Florentino Fernández Riverola
Escuela Superior de Ingeniería Informática
Edificio Politécnico, Despacho 408
Campus Universitario As Lagoas s/n
32004 Ourense
Spain
E-mail: [email protected]

Juan Manuel Corchado
Departamento de Informática y Automática
Facultad de Ciencias
Universidad de Salamanca
Plaza de la Merced S/N
37008 Salamanca
Spain
E-mail: [email protected]

ISBN 978-3-642-13213-1 e-ISBN 978-3-642-13214-8

DOI 10.1007/978-3-642-13214-8
Advances in Intelligent and Soft Computing ISSN 1867-5662
Library of Congress Control Number: 2010928117

© 2010 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations
are liable for prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper


springer.com
Preface

The fields of Bioinformatics and Computational Biology have been growing
steadily over the last few years, boosted by an increasing need for computational
techniques that can efficiently handle the huge amounts of data produced by the
new experimental techniques in Biology. This calls for new algorithms and
approaches from fields such as Data Integration, Statistics, Data Mining, Machine
Learning, Optimization, Computer Science and Artificial Intelligence.

Also, new global approaches, such as Systems Biology, have been emerging,
replacing the reductionist view that dominated biological research in the last
decades. Indeed, Biology is more and more a science of information, needing tools
from the information technology field. The interaction of researchers from
different scientific fields is, more than ever, of foremost importance, and we
hope this event will contribute to this effort.

The IWPACBB’10 technical program included a total of 30 papers (26 long papers
and 4 short papers) spanning many different sub-fields in Bioinformatics and
Computational Biology. The technical program of the conference will therefore
be diverse and challenging, and will promote interaction among computer
scientists, mathematicians, biologists and other researchers.

We would like to thank all the contributing authors, as well as the members of
the Program Committee and the Organizing Committee, for their hard and highly
valuable work, which has contributed to the success of the IWPACBB’10 event.
IWPACBB’10 wouldn’t exist without your contribution.

Miguel Rocha, Florentino Fernández Riverola
IWPACBB’10 Organizing Co-chairs

Juan Manuel Corchado, Hagit Shatkay
IWPACBB’10 Programme Co-chairs

Organization

General Co-chairs
Miguel Rocha University of Minho (Portugal)
Florentino Riverola University of Vigo (Spain)
Juan M. Corchado University of Salamanca (Spain)
Hagit Shatkay Queen’s University, Ontario (Canada)

Program Committee
Juan M. Corchado (Co-chairman) University of Salamanca (Spain)
Alicia Troncoso Universidad Pablo de Olavide (Spain)
Alípio Jorge LIAAD/INESC, Porto LA (Portugal)
Anália Lourenço University of Minho (Portugal)
Arlindo Oliveira INESC-ID, Lisboa (Portugal)
Arlo Randall University of California Irvine (USA)
B. Cristina Pelayo University of Oviedo (Spain)
Christopher Henry Argonne National Labs (USA)
Daniel Gayo University of Oviedo (Spain)
David Posada Univ. Vigo (Spain)
Emilio S. Corchado University of Burgos (Spain)
Eugénio C. Ferreira IBB/CEB, University of Minho (Portugal)
Fernando Diaz-Gómez University of Valladolid (Spain)
Gonzalo Gómez-López UBio/CNIO, Spanish National Cancer Research Centre (Spain)
Isabel C. Rocha IBB/CEB, University of Minho (Portugal)
Jesús M. Hernández University of Salamanca (Spain)
Jorge Vieira IBMC, Porto (Portugal)
José Adserias University of Salamanca (Spain)
José L. López University of Salamanca (Spain)
José Luís Oliveira Univ. Aveiro (Portugal)
Juan M. Cueva University of Oviedo (Spain)
Júlio R. Banga IIM/CSIC, Vigo (Spain)
Kaustubh Raosaheb Patil Max-Planck Institute for Informatics (Germany)
Kiran R. Patil Biocentrum, DTU (Denmark)
Lourdes Borrajo University of Vigo (Spain)
Luis M. Rocha Indiana University (USA)
Manuel J. Maña López University of Huelva (Spain)
Margarida Casal University of Minho (Portugal)
Maria J. Ramos FCUP, University of Porto (Portugal)
Martin Krallinger CNB, Madrid (Spain)
Nicholas Luscombe EBI (UK)
Nuno Fonseca CRACS/INESC, Porto (Portugal)
Oscar Sanjuan University of Oviedo (Spain)
Paulo Azevedo University of Minho (Portugal)
Paulino Gómez-Puertas University Autónoma de Madrid (Spain)
Pierre Baldi University of California Irvine (USA)
Rui Camacho LIACC/FEUP, University of Porto (Portugal)
Rui Brito University of Coimbra (Portugal)
Rui C. Mendes CCTC, University of Minho (Portugal)
Sara Madeira IST/INESC, Lisboa (Portugal)
Sérgio Deusdado IP Bragança (Portugal)
Vítor Costa University of Porto (Portugal)

Organizing Committee

Miguel Rocha (Co-chairman) CCTC, Univ. Minho (Portugal)
Florentino Fernández Riverola (Co-chairman) University of Vigo (Spain)
Juan F. De Paz University of Salamanca (Spain)
Daniel Glez-Peña University of Vigo (Spain)
José P. Pinto University of Minho (Portugal)
Rafael Carreira University of Minho (Portugal)
Simão Soares University of Minho (Portugal)
Paulo Vilaça University of Minho (Portugal)
Hugo Costa University of Minho (Portugal)
Paulo Maia University of Minho (Portugal)
Pedro Evangelista University of Minho (Portugal)
Óscar Dias University of Minho (Portugal)
Contents

Microarrays

Highlighting Differential Gene Expression between Two Condition Microarrays through Heterogeneous Genomic Data: Application to Leishmania infantum Stages Comparison
Liliana López Kleine, Víctor Andrés Vera Ruiz

An Experimental Evaluation of a Novel Stochastic Method for Iterative Class Discovery on Real Microarray Datasets
Héctor Gómez, Daniel Glez-Peña, Miguel Reboiro-Jato, Reyes Pavón, Fernando Díaz, Florentino Fdez-Riverola

Automatic Workflow during the Reuse Phase of a CBP System Applied to Microarray Analysis
Juan F. De Paz, Ana B. Gil, Emilio Corchado

A Comparative Study of Microarray Data Classification Methods Based on Ensemble Biological Relevant Gene Sets
Miguel Reboiro-Jato, Daniel Glez-Peña, Juan Francisco Gálvez, Rosalía Laza Fidalgo, Fernando Díaz, Florentino Fdez-Riverola

Data Mining and Data Integration

Predicting the Start of Protein α-Helices Using Machine Learning Algorithms
Rui Camacho, Rita Ferreira, Natacha Rosa, Vânia Guimarães, Nuno A. Fonseca, Vítor Santos Costa, Miguel de Sousa, Alexandre Magalhães

A Data Mining Approach for the Detection of High-Risk Breast Cancer Groups
Orlando Anunciação, Bruno C. Gomes, Susana Vinga, Jorge Gaspar, Arlindo L. Oliveira, José Rueff

GRASP for Instance Selection in Medical Data Sets
Alfonso Fernández, Abraham Duarte, Rosa Hernández, Ángel Sánchez

Expanding Gene-Based PubMed Queries
Sérgio Matos, Joel P. Arrais, José Luis Oliveira

Improving Cross Mapping in Biomedical Databases
Joel Arrais, João E. Pereira, Pedro Lopes, Sérgio Matos, José Luis Oliveira

An Efficient Multi-class Support Vector Machine Classifier for Protein Fold Recognition
Wieslaw Chmielnicki, Katarzyna Stąpor, Irena Roterman-Konieczna

Feature Selection Using Multi-Objective Evolutionary Algorithms: Application to Cardiac SPECT Diagnosis
António Gaspar-Cunha

Phylogenetics and Sequence Analysis

Two Results on Distances for Phylogenetic Networks
Gabriel Cardona, Mercè Llabrés, Francesc Rosselló

Cramér Coefficient in Genome Evolution
Vera Afreixo, Adelaide Freitas

An Application for Studying Tandem Repeats in Orthologous Genes
José Paulo Lousado, José Luis Oliveira, Gabriela Moura, Manuel A.S. Santos

Accurate Selection of Models of Protein Evolution
Mateus Patricio, Federico Abascal, Rafael Zardoya, David Posada

Scalable Phylogenetics through Input Preprocessing
Roberto Blanco, Elvira Mayordomo, Esther Montes, Rafael Mayo, Angelines Alberto

The Median of the Distance between Two Leaves in a Phylogenetic Tree
Arnau Mir, Francesc Rosselló

In Silico AFLP: An Application to Assess What Is Needed to Resolve a Phylogeny
María Jesús García-Pereira, Armando Caballero, Humberto Quesada

Employing Compact Intra-Genomic Language Models to Predict Genomic Sequences and Characterize Their Entropy
Sérgio Deusdado, Paulo Carvalho

Biomedical Applications

Structure Based Design of Potential Inhibitors of Steroid Sulfatase
Elisangela V. Costa, M. Emília Sousa, J. Rocha, Carlos A. Montanari, M. Madalena Pinto

Agent-Based Model of the Endocrine Pancreas and Interaction with Innate Immune System
Ignacio V. Martínez Espinosa, Enrique J. Gómez Aguilera, María E. Hernando Pérez, Ricardo Villares, José Mario Mellado García

State-of-the-Art Genetic Programming for Predicting Human Oral Bioavailability of Drugs
Sara Silva, Leonardo Vanneschi

Pharmacophore-Based Screening as a Clue for the Discovery of New P-Glycoprotein Inhibitors
Andreia Palmeira, Freddy Rodrigues, Emília Sousa, Madalena Pinto, M. Helena Vasconcelos, Miguel X. Fernandes

Bioinformatics Applications

e-BiMotif: Combining Sequence Alignment and Biclustering to Unravel Structured Motifs
Joana P. Gonçalves, Sara C. Madeira

Applying a Metabolic Footprinting Approach to Characterize the Impact of the Recombinant Protein Production in Escherichia Coli
Sónia Carneiro, Silas G. Villas-Bôas, Isabel Rocha, Eugénio C. Ferreira

Rbbt: A Framework for Fast Bioinformatics Development with Ruby
Miguel Vázquez, Rubén Nogales, Pedro Carmona, Alberto Pascual, Juan Pavón

Analysis of the Effect of Reversibility Constraints on the Predictions of Genome-Scale Metabolic Models
José P. Faria, Miguel Rocha, Rick L. Stevens, Christopher S. Henry

Enhancing Elementary Flux Modes Analysis Using Filtering Techniques in an Integrated Environment
Paulo Maia, Marcellinus Pont, Jean-François Tomb, Isabel Rocha, Miguel Rocha

Genome Visualization in Space
Leandro S. Marcolino, Bráulio R.G.M. Couto, Marcos A. dos Santos

A Hybrid Scheme to Solve the Protein Structure Prediction Problem
José C. Calvo, Julio Ortega, Mancia Anguita

Author Index


Highlighting Differential Gene Expression
between Two Condition Microarrays through
Heterogeneous Genomic Data: Application to
Leishmania infantum Stages Comparison

Liliana López Kleine and Víctor Andrés Vera Ruiz


Abstract. Classical methods for the detection of gene expression differences
between two microarray conditions often fail to detect interesting and important
differences, because these are weak in comparison with the overall variability.
Therefore, methodologies that highlight weak differences are needed. Here, we
propose a method that allows the fusion of other genomic data with microarray
data and show, through an example on L. infantum microarrays comparing
promastigote and amastigote stages, that differences between the two microarray
conditions are highlighted. The method is flexible and can be applied to any
organism for which microarray and other genomic data are available.

1 Introduction

Protozoa of the genus Leishmania are parasites that are transmitted by
blood-feeding insect vectors to mammalian hosts, and cause a number of important
human diseases, collectively referred to as leishmaniasis. During their life
cycle, these parasites alternate between two major morphologically distinct
developmental stages. In the digestive tract of the sandfly vector, they exist as
extracellular elongated, flagellated, and motile promastigotes that are exposed to
pH 7 and fluctuating temperatures averaging 25 °C. Upon entry into a mammalian
host, they reside in mononuclear phagocytes or macrophages (37 °C), wherein they
replicate as circular, aflagellated and non-motile amastigotes. In order to
survive in these two extreme environments, Leishmania sp. (L. sp.) has developed
regulatory mechanisms that result in important morphological and biochemical
adaptations [1, 2, 3].

Liliana López Kleine · Víctor Andrés Vera Ruiz
Universidad Nacional de Colombia (Sede Bogotá), Cra 30, calle 45, Statistics Department
e-mail: [email protected], [email protected]

M.P. Rocha et al. (Eds.): IWPACBB 2010, AISC 74, pp. 1–8, 2010.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

Microarray studies allow the expression level of thousands of genes to be
measured at the same time in a single hybridization experiment, and allow the
comparison of two conditions (here the developmental stages of L. sp.). Several
microarray analyses have been done to study global gene expression in
developmental stages of L. sp. [3, 4, 5]. Results show that the L. sp. genome can
be considered to be constitutively expressed, as more than 90% of the genes are
expressed in the same amount in both stages, and that only a limited number
(7-9.3%) of genes show stage-specific expression [3, 6]. Furthermore, no metabolic
pathway or cell function characteristic of any stage has been detected
[3, 4, 6, 7, 8, 9]. This is an astonishing result because morphological and
physiological differences between the two stages are huge, indicating that
different specialized stage proteins are needed and therefore gene expression
should change. In order to explain the weak differences in gene expression
between both stages, it has been proposed that regulations and adaptations take
place after translation and not at the gene expression level [10].
The detection of gene expression differences has been of great interest [11], as
an understanding of the adaptation and resistance factors of Leishmania sp. can
provide interesting therapeutic targets. Analytical methods used until now, even
improved ones such as Gene Set Enrichment Analysis [13], have difficulty
determining differences in gene expression between the two L. sp. life cycle
stages because, against the background of global gene expression variation, weak
differences cannot be detected with classical methods. Microarray data analysis
can still be improved if a way to highlight weak differences in gene expression
between L. sp. stages is found.
The present method achieves this by using information additional to the
microarrays. It allows different genomic and post-genomic data (positions of
genes on the chromosome, metabolic pathways, phylogenetic profiles, etc.) to be
incorporated to detect differences between two experimental microarray conditions.
The method can be applied to detect gene expression differences between any two
conditions and for all completely sequenced organisms if genomic data are
available. It will be especially useful for the comparison of conditions in which
differences in gene expression appear small when classical methods are used.
To apply the proposed strategy, four steps need to be taken: i) database
construction of the genomic data for the same genes that are present on the
microarray, ii) kernel construction and parameter estimation, iii) determination
of differences in gene expression between two microarray conditions at the kernel
level, and iv) interpretation of differences with regard to the original
microarray data.
In the present work, we apply the proposed methodology to determine differences
between the L. infantum amastigote and promastigote microarray data obtained by
Rochette et al. [3]. The methodology is proposed for all genes of a
microarray data set. Nevertheless, taking into account the interest in determining
adaptation and defense mechanisms in the pathogen, we concentrated on genes
that could explain the adaptation and resistance of L. sp. to the environmental
changes between a promastigote and an amastigote. Therefore, we searched
primarily for changes in expression of 180 known and putative transport proteins
and stress factors.

2 Methodology

2.1 Database Construction of Microarray and Genomic Data

2.1.1 Microarrays
The data used are microarray data from Rochette et al. [3]. From these data, we
extracted only the expression data comparing promastigotes and amastigotes of L.
infantum (8317 genes for 14 replicates). Microarray data were downloaded from
NCBI’s GEO Datasets [14] (accession number GSE10407). We worked with
normalized and log2-transformed gene expression intensities from the microarray
data obtained by Rochette and colleagues [3].

2.1.2 Phylogenetic Profiles


Phylogenetic profiles were constructed using the tool Roundup proposed by DeLuca
et al. [15] (http://rodeo.med.harvard.edu/tools/roundup), which allows the
extraction of the presence or absence of all genes of an organism in other
organisms chosen by the user. The result can be retrieved as a binary (0/1)
phylogenetic profile matrix recording the absence or presence of each L. sp. gene
in the genomes of the other organisms. We generated a phylogenetic profile matrix
for 30 organisms(1) and 2599 genes of L. infantum sharing common genes.
(1) A._thaliana, Bacillus_subtilis, C._elegans, Coxiella_burnetii (Cb),
Cb_CbuG_Q212, Cb_Dugway_7E9-12, Cb_RSA_331, D._melanogaster, Enterobacter_638,
E._faecalis_V583, Escherichia_coli_536, H._sapiens, Lactobacillus_plantarum,
Lactococcus._lactis, Listeria_innocua, L._monocytogenes, Mycobacterium_bovis,
M._leprae, Nostoc_sp, P._falciparum, Pseudomonas_putida_KT2440, S._cerevisiae,
Salmonella_enterica_Paratypi_ATCC_9150, Staphylococcus_aureus_COL,
Staphylococcus_epidermidis_ATCC_12228, Streptococcus_mutans,
S._pyogenes_MGAS10394, S._pyogenes_MGAS10750, T._brucei, V._cholerae

2.1.3 Presence of Genes on Chromosomes


For the same 8317 genes, we determined on which of the 36 chromosomes of
L. infantum each gene is located. This information was obtained directly from the
gene name as registered in NCBI (http://www.ncbi.nlm.nih.gov/) and used to
construct a binary (0/1) presence table for each gene on each chromosome
(8317 × 36).
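
As an illustration of how such a presence table could be derived from gene names, here is a minimal Python sketch. The identifier pattern `LinJ.<chromosome>.<locus>` and the three example names are assumptions made for this example only; the names actually registered in NCBI and the authors' R code may differ.

```python
import re

# Hypothetical identifier format "LinJ.<chromosome>.<locus>" (an assumption;
# the names used by the authors may be formatted differently).
GENE_ID = re.compile(r"^LinJ\.(\d{2})\.\d+$")
N_CHROMOSOMES = 36  # number of L. infantum chromosomes stated in the text


def chromosome_of(gene_id):
    """Return the 1-based chromosome number encoded in a gene identifier."""
    match = GENE_ID.match(gene_id)
    if match is None:
        raise ValueError(f"unrecognised gene identifier: {gene_id!r}")
    chromosome = int(match.group(1))
    if not 1 <= chromosome <= N_CHROMOSOMES:
        raise ValueError(f"chromosome out of range in {gene_id!r}")
    return chromosome


def presence_table(gene_ids):
    """Build one 0/1 row per gene with a single 1 in its chromosome column."""
    table = []
    for gene_id in gene_ids:
        row = [0] * N_CHROMOSOMES
        row[chromosome_of(gene_id) - 1] = 1
        table.append(row)
    return table


# Made-up example identifiers.
genes = ["LinJ.01.0010", "LinJ.36.2480", "LinJ.12.0350"]
table = presence_table(genes)
```

Each row contains exactly one 1, so two genes have identical rows precisely when they lie on the same chromosome.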

2.1.4 Genes of Interest


We constructed a list of 180 genes coding for known or putative transporters
and stress factors annotated with these functions in GeneDB
(http://www.genedb.org//genedb/).
Once all data types were obtained, the genes present in all databases were
determined automatically using functions written in R [16]. Without taking into
account the 180 genes of interest, 2092 genes were common to the microarrays, the
presence on the chromosomes and the phylogenetic profiles. Using only the 180
genes of interest, we obtained a list of 161 genes common to all 3 datasets.
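
The intersection step can be sketched in a few lines of Python (the authors used R; this is only an illustration). The gene identifiers below are invented, and `common_genes` is a hypothetical helper name.

```python
def common_genes(*gene_lists):
    """Return the sorted identifiers that occur in every one of the lists."""
    if not gene_lists:
        return []
    shared = set(gene_lists[0])
    for genes in gene_lists[1:]:
        shared &= set(genes)
    return sorted(shared)


# Made-up identifiers standing in for the three data sources.
microarray_genes = ["g1", "g2", "g3", "g4", "g5"]
profile_genes = ["g2", "g3", "g5", "g9"]
chromosome_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
genes_of_interest = ["g3", "g5", "g7"]

# Genes common to all three datasets, then restricted to the genes of interest.
shared = common_genes(microarray_genes, profile_genes, chromosome_genes)
shared_of_interest = common_genes(shared, genes_of_interest)
```

Sorting the result gives every dataset the same gene order, which matters later because the kernel matrices are compared entry by entry.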

2.2 Kernel Construction


We built a kernel similarity matrix for each data type. These representations allow
the subsequent fusion of heterogeneous data. Data are not represented individually,
but through the pairwise comparison of objects (here 161 genes). The comparison
is expressed as the similarity between objects through the data. A comparison
function of type k : X × X → R is used and the data are represented by an
n × n comparison matrix K with entries k_{i,j} = k(x_i, x_j) [17]. Kernels are
positive semidefinite matrices and can be used in several kernel algorithms [17].
There are different ways to construct a kernel. The simplest kernels are linear
and Gaussian. They have been used for genomic and post-genomic data in previous
works, which aimed to infer biological knowledge by the analysis of heterogeneous
data [18].

We built Gaussian kernels for all data types:

$$k(x_i, x_j) = \exp\left(-\frac{d(x_i, x_j)}{2\sigma^2}\right),$$

where σ is a parameter and d is the Euclidean distance. The constructed kernels
were: KA1 for the gene expression data on amastigotes, KP1 for the gene expression
data on promastigotes, K2 for the phylogenetic profiles and K3 for the presence on
the chromosomes. The parameters associated with each kernel were sigma1A, sigma1P,
sigma2, sigma3.
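
A minimal Python sketch of such a kernel matrix follows. It assumes the form k = exp(−d/(2σ²)), with the negative sign in the exponent taken as given so that similarity decays with distance; the textbook Gaussian (RBF) kernel squares the distance, which the hypothetical `squared` flag switches on. The function names and the toy data are illustrative, not the authors' code.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def gaussian_kernel(rows, sigma, squared=False):
    """Return the n x n matrix of pairwise similarities between the rows.

    With squared=False the exponent is -d / (2 sigma^2); with squared=True it
    is -d^2 / (2 sigma^2), the conventional Gaussian form.
    """
    n = len(rows)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            d = euclidean(rows[i], rows[j])
            if squared:
                d = d * d
            value = math.exp(-d / (2.0 * sigma ** 2))
            kernel[i][j] = kernel[j][i] = value  # the matrix is symmetric
    return kernel


# Toy data: three "genes" described by two measurements each.
expression = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
K = gaussian_kernel(expression, sigma=1.0)
```

Identical rows get similarity 1, and the value shrinks towards 0 as rows move apart; σ controls how quickly that happens.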
Then, we constructed two combined kernels, KA1sum and KP1sum, which join the gene
expression data with the other types of data by addition:

$$K_{(A,P)1sum} = w_1 K_{(A,P)1} + w_2 K_2 + w_3 K_3,$$

where the w are weights that are also treated as parameters.
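
The weighted addition can be sketched as follows in Python. The 2 × 2 matrices are made-up stand-ins for KA1, K2 and K3, and the weights reuse values from the tested grid purely for illustration; `weighted_kernel_sum` is a hypothetical helper name.

```python
def weighted_kernel_sum(kernels, weights):
    """Return sum_k weights[k] * kernels[k] for equally sized square matrices."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for kernel, weight in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += weight * kernel[i][j]
    return combined


# Made-up 2 x 2 kernels standing in for KA1, K2 and K3.
k_a1 = [[1.0, 0.5], [0.5, 1.0]]
k_2 = [[1.0, 0.2], [0.2, 1.0]]
k_3 = [[1.0, 0.0], [0.0, 1.0]]

k_a1_sum = weighted_kernel_sum([k_a1, k_2, k_3], [0.9, 0.0, 0.1])
```

A weight of 0 removes a data type from the combined kernel entirely, which is how the search can discard an uninformative source.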
Taking into account the objective of the present work (detection of gene
expression differences between amastigote and promastigote microarrays), the
parameters sigma and w were found using a search algorithm that evaluated all
combinations of parameter values and selected the one that maximized the
difference between KA1sum and KP1sum. The criterion was the minimum covariance
between both kernels. The values tested for σ were 0.01, 0.1, 0.5, 1, 10, 15 and
50, and the values tested for w were 0, 0.1, 0.5 and 0.9.
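
An exhaustive search of this kind could look like the Python sketch below. Several points are assumptions rather than the authors' procedure: the covariance is taken over all matrix entries, the kernel has the form exp(−d/(2σ²)), the same three weights are applied to both combined kernels, and the toy data and reduced grids exist only to keep the example fast.

```python
import itertools
import math

SIGMAS = (0.01, 0.1, 0.5, 1, 10, 15, 50)  # values tested for sigma in the text
WEIGHTS = (0, 0.1, 0.5, 0.9)              # values tested for w in the text


def kernel(rows, sigma):
    """Pairwise exp(-d / (2 sigma^2)), d being the Euclidean distance."""
    return [[math.exp(-math.dist(u, v) / (2.0 * sigma ** 2)) for v in rows]
            for u in rows]


def combine(kernels, weights):
    """Entry-wise weighted sum of equally sized square matrices."""
    n = len(kernels[0])
    return [[sum(w * k[i][j] for k, w in zip(kernels, weights))
             for j in range(n)] for i in range(n)]


def covariance(a, b):
    """Covariance of the entries of two matrices (an assumed reading)."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def grid_search(expr_a, expr_p, profiles, chromosomes,
                sigmas=SIGMAS, weights=WEIGHTS):
    """Return (covariance, sigmas, weights) of the lowest-covariance setting."""
    # Each kernel depends on one sigma only, so build it once per value.
    ka = {s: kernel(expr_a, s) for s in sigmas}
    kp = {s: kernel(expr_p, s) for s in sigmas}
    k2 = {s: kernel(profiles, s) for s in sigmas}
    k3 = {s: kernel(chromosomes, s) for s in sigmas}
    best = None
    for s1a, s1p, s2, s3 in itertools.product(sigmas, repeat=4):
        for w in itertools.product(weights, repeat=3):
            sum_a = combine((ka[s1a], k2[s2], k3[s3]), w)
            sum_p = combine((kp[s1p], k2[s2], k3[s3]), w)
            score = covariance(sum_a, sum_p)
            if best is None or score < best[0]:
                best = (score, (s1a, s1p, s2, s3), w)
    return best


# Toy data for three genes; reduced grids keep the demonstration quick.
expr_a = [[1.0, 1.2], [1.1, 1.0], [3.0, 3.2]]
expr_p = [[2.0, 2.1], [0.5, 0.4], [2.1, 2.0]]
profiles = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
chromosomes = [[1, 0], [1, 0], [0, 1]]
best_score, best_sigmas, best_weights = grid_search(
    expr_a, expr_p, profiles, chromosomes,
    sigmas=(0.1, 1, 10), weights=(0, 0.5, 0.9))
```

Setting all weights to 0 makes both combined kernels zero and the covariance trivially 0, so a practical search would probably need to exclude or constrain that case; the text does not say how this was handled.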

2.3 Detection of Differences between Amastigote and Promastigote Gene Expression

Differences in similarity and gene expression


Kernels KA1sum and KP1sum were compared to detect differences in gene expression
between both L. infantum stages by computing a Dif matrix as follows: Dif =
KA1sum - KP1sum. The resulting matrix values were ordered, and those exceeding
a given threshold were retained. The same was done for the kernels
without integration of other data types, that is, the ones constructed from the
microarray data only: KA1 and KP1. The two thresholds used were 10% (T1) and 20% (T2) of
the maximum distance value found in Dif.
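
A minimal sketch of this comparison step is given below. The function name and the toy matrices are hypothetical. The thresholds are read here as "retain the entries of Dif whose absolute value lies within 10% (or 20%) of the maximum"; this reading is an assumption, chosen because T2 retains more pairs than T1 in the Results.

```python
def similarity_changes(ka, kp, fraction):
    """Return gene pairs (i, j, dif) whose |KA - KP| lies within `fraction` of the maximum.

    Assumption: a threshold of 10% keeps entries with |dif| >= 0.9 * max|dif|.
    """
    n = len(ka)
    dif = [[ka[i][j] - kp[i][j] for j in range(n)] for i in range(n)]
    # The matrices are symmetric, so only the upper triangle is inspected.
    pairs = [(i, j, dif[i][j]) for i in range(n) for j in range(i + 1, n)]
    max_abs = max(abs(d) for _, _, d in pairs)
    cutoff = (1.0 - fraction) * max_abs
    kept = [p for p in pairs if abs(p[2]) >= cutoff]
    return sorted(kept, key=lambda p: abs(p[2]), reverse=True)

# Hypothetical 3 x 3 kernels for the two conditions.
ka = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.5], [0.2, 0.5, 1.0]]
kp = [[1.0, 0.1, 0.9], [0.1, 1.0, 0.4], [0.9, 0.4, 1.0]]
t1 = similarity_changes(ka, kp, 0.10)
t2 = similarity_changes(ka, kp, 0.20)
```

With these toy values only the pair (0, 1) passes T1, while T2 additionally retains the pair (0, 2).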
Subsequently, a list of the pairs of genes implicated in each change of similarity was
generated. To interpret this change in similarity we returned to the original
Highlighting Differential Gene Expression between Two Condition Microarrays 5

microarray data and calculated the sum of gene expression intensities in each condition.
This makes it possible to determine which of the two genes is responsible for the
similarity change and, finally, to identify potential targets that explain the
differences that occur for L. infantum adaptations during its life cycle. R code is
available upon request: [email protected].
The most interesting targets (those that show the highest difference or are present
repeatedly on the list) are candidates for wet-lab experiments.
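
The interpretation step can be illustrated with a short sketch. The decision rule used here, namely that the gene whose summed expression changes most between the two conditions is the one responsible, is an assumption made for illustration and is not stated explicitly in the text; the function name is hypothetical. The example values are taken from the second gene pair of Table 2.

```python
def responsible_gene(pair):
    """Return the gene of a pair with the larger absolute AMA-to-PRO change.

    Each element of `pair` is (gene id, summed expression in amastigotes,
    summed expression in promastigotes). Assumed rule, see the lead-in.
    """
    (g1, ama1, pro1), (g2, ama2, pro2) = pair
    change1 = abs(ama1 - pro1)
    change2 = abs(ama2 - pro2)
    return g1 if change1 >= change2 else g2

# Second gene pair of Table 2 (AMA and PRO sums of each gene).
pair = (("LinJ08.1020", -3.98, -3.18), ("LinJ35.2730", 1.98, 4.22))
target = responsible_gene(pair)
```

Under this rule LinJ35.2730 is flagged, since its summed expression changes by 2.24 while that of LinJ08.1020 changes by 0.80.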

3 Results and Discussion


The parameters that were determined by the search algorithm to maximize differences
between amastigote and promastigote gene expression are shown in Table 1.

Table 1 Parameters obtained optimizing differences between amastigote and promastigote gene expression

Kernel   KA1   KP1   K2   K3
Sigma    1     1     15   50
Weight   0.9   0.9   0    0.1

K2 did not seem to be useful for highlighting the differences between gene expression
in both stages. Nevertheless, phylogenetic profiles have been shown to be informative
in other studies, e.g., for the determination of protein functions [18, 19]. It
is possible that either phylogenetic profiles are definitely not useful to highlight
the differences between the two conditions analyzed here, or that the organisms
that were chosen to construct the profiles are phylogenetically too distant from L.
infantum and therefore poorly informative. The fact that only a few genes are present
in most of them (data not shown) corroborates the second explanation and
opens the possibility that phylogenetic profiles could be useful if appropriate
organisms were chosen.
The result of the parameter search leaves only the K3 kernel, with a low weight
(0.1), to be added to the microarray data kernels. This suggests that the
fusion with data other than microarrays is not useful. Nevertheless, the results
obtained when similarity changes were detected indicate that including K3 is indeed
useful. The differences detected between KA1 and KP1 (based only on microarray
data) implicate a change in only 10 similarities for threshold 1 (T1) and 44
changes for T2 using the 161×161 kernel. The fusion with K3 allows the detection
of more differences: 14 for T1 and 61 for T2. The 14 gene pairs which show
similarity changes above T1 in the fusion kernels are shown in Table 2. As the
behavior of the genes of interest should be regarded in the context of all the genes,
the position of these similarity changes when all genes (2092) are analyzed is
important. Using these 2092 genes, 1924 similarity changes are detected above T1. The
position (Pos) of the genes of interest in this list of 1924 changes is also indicated
in Table 2 for comparison.
4 Conclusions and Future Work


It is important to point out that the use of genomic data to highlight differences be-
tween two microarray conditions is possible and easy to implement via the use of
kernels. The flexibility of kernels allows the fusion of very different types of data,
as only the comparison of objects needs to be computed. Depending on the bio-
logical question behind the study, other types of data such as information on Clus-
ters of Orthologous Groups (COGs) or physical interaction between proteins
obtained from two-hybrid data could be included. Information on metabolic pathway
graphs could be very informative, and a kernel for graphs has already been
proposed [20].
Differences between two microarray conditions are highlighted by the fusion
with other types of data. Nevertheless, the usefulness of genomic data depends on
their quality and information content. In our example, the phylogenetic profiles
appeared to be useless in highlighting information. Use of more informative
organisms to construct the phylogenetic profiles needs to be investigated.

Table 2 List of 14 gene pair similarity changes between amastigote (KA1sum) and promastigote
(KP1sum) microarray data highlighted through the fusion with data on the presence of
genes on chromosomes (K3). AMA and PRO: sum of gene expression of each gene in the
amastigote microarray (AMA) or promastigote microarray (PRO). Pos: position of similarity
changes in the 2092×2092 kernel (1924 similarity changes above T1). P: gene annotated as
putative. Hpc: hypothetical conserved protein. Clcp: calpain-like cysteine peptidase

Gene pair with similarity change in KA1sum vs. KP1sum

Gene1                                                   Gene2                               Pos    AMA Gene1  PRO Gene1  AMA Gene2  PRO Gene2
LinJ07.0700 vacuolar-type Ca2+ATPase, P                 LinJ34.2780 Hpc                     16     -1.68      -1.84      -3.42      -3.57
LinJ08.1020 stress-induced sti1                         LinJ35.2730 Hpc                     35     -3.98      -3.18      1.98       4.22
LinJ09.0960 ef-hand prot. 5, P                          LinJ23.0290 multidrug resist. P     86     -1.67      -2.36      -2.47      -2.57
LinJ14.0270                                             LinJ23.0430 ABC trans.-like         287    -3.65      -2.55      -0.29      -0.98
LinJ14.0270 Hpc                                         LinJ24.1180 Hpc                     459    -3.65      -2.55      20.31      26.54
LinJ14.0270 Hpc                                         LinJ25.1000 Hpc                     569    -3.65      -2.55      -4.82      -4.71
LinJ14.1040 Hpc                                         LinJ33.0340 ATP-bin. P              752    -1.91      -1.01      4.37       6.27
LinJ17.0420 Hpc                                         LinJ35.2730 Hpc                     769    -4.28      -4.41      1.98       4.22
LinJ20.1230 calpain-like cysteine pept., P (clcp)       LinJ28.2230 Hpc                     865    -3.97      -3.87      -0.06      -0.6
LinJ20.1250 Clcp                                        LinJ35.2730 Hpc                     923    -4.12      -3.98      1.98       4.22
LinJ21.1900 calcineurin B subunit, P                    LinJ35.2730 Hpc                     965    -4.01      -3.87      1.98       4.22
LinJ22.0050 Hpc                                         LinJ35.2730 Hpc                     1002   -3.43      -3.35      1.98       4.22
LinJ31.1790 Hpc                                         LinJ35.2730 Hpc                     1036   -4.39      -4.12      1.98       4.22
LinJ35.2730 Hpc                                         LinJ33.2130 Clcp                    1782   1.98       4.22       -3.95      -3.83
The present work opens the possibility of implementing a kernel method that
will allow differences to be determined more precisely once the data are fused.
The detection of differences can be improved in several ways. Here, only a
preliminary and very simple comparison of similarities is proposed. The kernel
method could be based on multidimensional scaling via the mapping of both
kernels onto a common space in which distances between similarities could be
measured.
Although differences between kernels are ordered, having a probability associated
with each difference would be useful. This could be achieved by a bootstrapping
procedure or a matrix permutation test.

References
[1] McConville, M.J., Turco, S.J., Ferguson, M.A.J., Sacks, D.L.: Developmental
modification of lipophosphoglycan during the differentiation of Leishmania major
promastigotes to an infectious stage. EMBO J. 11, 3593–3600 (1992)
[2] Zilberstein, D., Shapira, M.: The role of pH and temperature in the development of
Leishmania parasites. Annu. Rev. Microbiol. 48, 449–470 (1994)
[3] Rochette, A., Raymond, F., Ubeda, J.M., Smith, M., Messier, N., Boisvert, S., Ri-
gault, P., Corbeil, J., Ouellette, M., Papadopoulou, B.: Genome-wide gene expression
profiling analysis of Leishmania major and Leishmania infantum developmental
stages reveals substantial differences between the two species. BMC Genomics 9,
255–280 (2008)
[4] Cohen-Freue, G., Holzer, T.R., Forney, J.D., McMaster, W.R.: Global gene expres-
sion in Leishmania. Int. J. Parasitol. 37, 1077–1086 (2007)
[5] Leifso, K., Cohen-Freue, G., Dogra, N., Murray, A., McMaster, W.R.: Genomic and
proteomic expression analysis of Leishmania promastigote and amastigote life stages:
the Leishmania genome is constitutively expressed. Mol. Biochem. Parasitol. 152,
35–46 (2007)
[6] Saxena, A., Lahav, T., Holland, N., Aggarwal, G., Anupama, A., Huang, Y., Volpin,
H., Myler, P.J., Zilberstein, D.: Analysis of the Leishmania donovani transcriptome
reveals an ordered progression of transient and permanent changes in gene expression
during differentiation. Mol. Biochem. Parasitol 52, 53–65 (2007)
[7] Ivens, A.C., Lewis, S.M., Bagherzadeh, A.: A physical map of Leishmania major
friedlin genome. Genome Res. 8, 135–145 (1998)
[8] Holzer, T.R., McMaster, W.R., Forney, J.D.: Expression profiling by whole-genome
interspecies microarray hybridization reveals differential gene expression in procyclic
promastigotes, lesion-derived amastigotes, and axenic amastigotes in Leishmania
mexicana. Mol. Biochem. Parasitol 146, 198–218 (2006)
[9] McNicoll, F., Drummelsmith, J., Müller, M., Madore, E., Boilard, N., Ouellette, M.,
Papadopoulou, B.: A combined proteomic and transcriptomic approach to the study
of stage differentiation in Leishmania infantum. Proteomics 6, 3567–3581 (2006)
[10] Rosenzweig, D., Smith, D., Opperdoes, F., Stern, S., Olafson, R.W., Zilberstein, D.:
Retooling Leishmania metabolism: from sand fly gut to human macrophage. FASEB
J. (2007), doi:10.1096/fj.07-9254com
[11] Lynn, M.A., McMaster, W.R.: Leishmania: conserved evolution-diverse diseases.
Trends Parasitol 24, 103–105 (2008)
[12] Storey, J.D., Tibshirani, R.: Statistical significance for genome-wide experiments.
Proc. Natl. Acad. Sci. 100, 9440–9445 (2003)
[13] Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette,
M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., Mesirov, J.P.: Gene
set enrichment analysis: A knowledge-based approach for interpreting genome-wide
expression profiles. PNAS 102, 15545–15550 (2005)
[14] Edgar, R., Domrachev, M., Lash, A.E.: Gene Expression Omnibus: NCBI gene ex-
pression and hybridization array data repository. Nucleic Acid Res. 30, 207–210
(2002)
[15] DeLuca, T.F., Wu, I.H., Pu, J., Monaghan, T., Peshkin, L., Singh, S., Wall, D.P.:
Roundup: a multi-genome repository of orthologs and evolutionary distances. Bioin-
formatics 22, 2044–2046 (2006)
[16] R Development Core Team: R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria (2005), ISBN 3-900051-07-0,
http://www.R-project.org
[17] Vert, J., Tsuda, K., Schölkopf, B.: A primer on kernels. In: Schölkopf, B., Tsuda, K.,
Vert, J. (eds.) Kernel methods in computational biology. The MIT Press, Cambridge
(2004)
[18] Yamanishi, Y., Vert, J.P., Nakaya, A., Kanehisa, M.: Extraction of correlated clusters
from multiple genomic data by generalized kernel canonical correlation analysis.
Bioinformatics 19, 323–330 (2003)
[19] López Kleine, L., Monnet, V., Pechoux, C., Trubuil, A.: Role of bacterial peptidase F
inferred by statistical analysis and further experimental validation. HFSP J. 2, 29–41
(2008)
[20] Kondor, R.I., Lafferty, J.: Diffusion kernels on graphs and other discrete structures.
In: Sammut, C., Hoffmann, A.G. (eds.) Machine learning: proceedings of the 19th in-
ternational conference. Morgan Kaufmann, San Francisco (2002)
An Experimental Evaluation of a Novel
Stochastic Method for Iterative Class
Discovery on Real Microarray Datasets

Héctor Gómez, Daniel Glez-Peña, Miguel Reboiro-Jato, Reyes Pavón,
Fernando Díaz, and Florentino Fdez-Riverola

Abstract. Within a gene expression matrix, there are usually several particular
macroscopic phenotypes of samples related to some diseases or drug effects, such
as diseased samples, normal samples or drug treated samples. The goal of sample-
based clustering is to find the phenotype structures of these samples. A novel
method for automatically discovering clusters of samples which are coherent from
a genetic point of view is evaluated on publicly available datasets. Each possible
cluster is characterized by a fuzzy pattern which maintains a fuzzy discretization
of relevant gene expression values. Possible clusters are randomly constructed and
iteratively refined by following a probabilistic search and an optimization schema.

Keywords: microarray data, fuzzy discretization, gene selection, fuzzy pattern,
class discovery, simulated annealing.

1 Introduction
Following the advent of high-throughput microarray technology it is now possible
to simultaneously monitor the expression levels of thousands of genes during
important biological processes and across collections of related samples. In this
context, sample-based clustering is one of the most common methods for discovering
disease subtypes as well as unknown taxonomies. By revealing hidden structures
in microarray data, cluster analysis can potentially lead to more tailored
therapies for patients as well as better diagnostic procedures.

Héctor Gómez · Daniel Glez-Peña · Miguel Reboiro-Jato · Reyes Pavón · Florentino Fdez-Riverola
ESEI: Escuela Superior de Ingeniería Informática, University of Vigo,
Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
e-mail: [email protected],
{dgpena, mrjato, pavon, riverola}@uvigo.es
Fernando Díaz
EUI: Escuela Universitaria de Informática, University of Valladolid, Plaza Santa Eulalia,
9-11, 40005, Segovia, Spain
e-mail: [email protected]

M.P. Rocha et al. (Eds.): IWPACBB 2010, AISC 74, pp. 9–16, 2010.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

From a practical point of view, existing sample-based clustering methods can be
(i) directly applied to cluster samples using all the genes as features (i.e., classical
techniques such as K-means, SOM, HC, etc.) or (ii) executed after a set of informative
genes has been identified. The problem with the first approach is the signal-to-noise
ratio (smaller than 1:10), which is known to seriously reduce the accuracy of clustering
results due to the existence of noise and outliers in the samples [1]. To overcome
such difficulties, particular methods can be applied to identify informative genes
and reduce gene dimensionality prior to clustering samples in order to detect their
phenotypes. In this context, both supervised and unsupervised informative gene
selection techniques have been developed.
While supervised informative gene selection techniques often yield high clustering
accuracy rates, unsupervised informative gene selection methods are more
complex because they assume that no a priori phenotype information is assigned to
any sample [2]. In such a situation, two general strategies have been adopted to
address the lack of prior knowledge: (i) unsupervised gene selection, which aims to
reduce the number of genes before clustering samples by using appropriate statistical
models, and (ii) interrelated clustering, which exploits the relationship between
the genes and samples to perform gene selection and sample
clustering simultaneously in an iterative paradigm. Following the second strategy
for unsupervised informative gene selection (interrelated clustering), Ben-Dor et
al. [3] present an approach based on statistically scoring candidate partitions
according to the overabundance of genes that separate the different classes. Xing and
Karp [1] use a feature filtering procedure for ranking features according to their
intrinsic discriminability and irredundancy to other relevant features. Their
clustering algorithm is based on the concept of a normalized cut for grouping samples
into a new reference partition. Von Heydebreck et al. [4] and Tang et al. [5] propose
algorithms for selecting sample partitions and corresponding gene sets by defining
an indicator of partition quality and a search procedure to maximize this parameter.
Varma and Simon [6] describe an algorithm for automatically detecting clusters
of samples that are discernible only in a subset of genes.
In this contribution we focus on the evaluation of a novel simulated annealing-based
algorithm for iterative class discovery. The rest of the paper is structured
as follows: Section 2 sketches the proposed method and introduces the relevant
aspects of the technique. Section 3 presents the experimental setup and
the results obtained on two publicly available microarray datasets. Section 4
discusses the results obtained by the proposed technique. Finally,
Section 5 summarizes the main conclusions extracted from this work.

2 Overview of the Iterative Class Discovery Algorithm


In this article we propose a simulated annealing-based algorithm for iterative class
discovery that uses a novel fuzzy logic method for informative gene selection. The
interrelated clustering process carried out is based on an iterative approach where
possible clusters are randomly constructed and evaluated by following a probabilistic
search and an optimization schema.

Fig. 1 Overview of the iterative class discovery method

Our clustering technique is not based on the distance between the microarrays
belonging to each given cluster, but rather on the notion of genetic coherence of
its own clusters. The genetic coherence of a given partition is calculated by taking
into consideration the genes which share the same expression value through all the
samples belonging to the cluster (which we term a fuzzy pattern), but discarding
those genes present due to pure chance (herein referred to as noisy genes of a fuzzy
pattern). The proposed clustering technique combines both (i) the simplicity and
good performance of a heuristic search method able to find good partitions in the
space of all possible partitions of the set of samples with (ii) the robustness of
fuzzy logic, able to cope with several levels of uncertainty and imprecision by using
partial truth values. A global view of the proposed method is sketched in
Figure 1. This figure shows how, starting from the fuzzy discretization of the
microarrays in the raw dataset, the method performs a stochastic search, looking for a
“good partition” of microarrays in order to maximize the genetic coherence of each
cluster within the tentative partition.
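
The stochastic search over partitions can be illustrated with a generic simulated annealing sketch. It is not the algorithm evaluated in this paper: the `coherence` function below is a simplified stand-in (cluster size times the number of genes whose discretized label is shared by every sample of the cluster) that ignores the filtering of noisy genes, and the function names, cooling schedule and toy labels are all hypothetical.

```python
import math
import random

def coherence(labels, assignment, k):
    """Stand-in cost: for each cluster, size x number of genes with one shared label."""
    n_genes = len(labels[0])
    total = 0
    for c in range(k):
        members = [s for s, a in enumerate(assignment) if a == c]
        if not members:
            continue
        shared = sum(1 for g in range(n_genes)
                     if len({labels[s][g] for s in members}) == 1)
        total += len(members) * shared
    return total

def anneal(labels, k, steps=2000, t0=2.0, cooling=0.995, seed=0):
    """Maximize the stand-in coherence by moving one sample at a time."""
    rng = random.Random(seed)
    n = len(labels)
    current = [rng.randrange(k) for _ in range(n)]
    cur_cost = coherence(labels, current, k)
    best, best_cost = current[:], cur_cost
    t = t0
    for _ in range(steps):
        cand = current[:]
        cand[rng.randrange(n)] = rng.randrange(k)  # move one sample to a random cluster
        cand_cost = coherence(labels, cand, k)
        delta = cand_cost - cur_cost
        # Always accept improvements; accept worse partitions with a
        # probability that shrinks as the temperature decreases.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, cur_cost = cand, cand_cost
            if cur_cost > best_cost:
                best, best_cost = current[:], cur_cost
        t *= cooling
    return best, best_cost

# Hypothetical discretized expression labels: 6 samples x 4 genes.
labels = [
    ["H", "L", "L", "M"], ["H", "L", "L", "H"], ["H", "L", "L", "L"],
    ["L", "H", "H", "M"], ["L", "H", "H", "H"], ["L", "H", "H", "L"],
]
best, best_cost = anneal(labels, k=2)
```

For these toy labels the most coherent two-cluster partition separates the first three samples from the last three, which gives a stand-in cost of 18; the search is stochastic, so a given run is not guaranteed to reach it.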

3 Experimental Results
In this Section we evaluate the proposed algorithm on two public microarray data-
sets, herein referred to as HC-Salamanca dataset [7] and Armstrong dataset [8].

3.1 The HC-Salamanca Dataset


This dataset consists of bone marrow samples from 43 adult patients with de novo
diagnosed acute myeloid leukemia (AML) – 10 acute promyelocytic leukemias
(APL) with t(15;17), 4 AML with inv(16), 7 monocytic leukemias and 22 non-
monocytic leukemias, according to the WHO classification. All samples contained
more than 80% blast cells and they were analyzed using high-density oligonucleo-
tide microarrays (specifically, the Affymetrix GeneChip Human Genome U133A
Array) [7]. In [7], hierarchical clustering analysis segregated APL, AML with
inv(16), monocytic leukemias and the remaining AML into separate groups, so we
consider this partition as the reference classification for validating our proposed
technique in the following experimentation.
As each execution of the simulated annealing algorithm gives a different result
(due to the stochastic nature of the search), we computed, for each available
microarray, the percentage of times that it was grouped together with
other microarrays belonging to each of the reference groups (APL, AML with inversion,
Monocytic and Other AML) in ten executions of the whole algorithm. The percentage
of times (on average) in which microarrays of each reference cluster were
grouped together with microarrays belonging to the different classes is shown in
each row of Table 1. This table can be interpreted as a confusion matrix: the APL
and Other-AML groups are the best identified pathologies (with average percentages
of 76.19% and 77.12% over all their samples and runs of the algorithm), followed by the
monocytic leukemias (with an average percentage of 51.73%). The group of AML
with inversion is confused in a mean percentage of
33.66% and 32.06% with samples from the monocytic and Other-AML groups,
respectively. If we consider that the highest percentage for each microarray
determines the cluster to which it belongs, the final clustering obtained by our
simulated annealing-based algorithm is shown in Table 2.
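
The aggregation over repeated runs can be sketched as follows. The exact computation is not spelled out in the text, so this sketch assumes that, for each run, the fraction of a sample's cluster mates belonging to each reference class is computed and then averaged over the runs; the function names and the toy runs are hypothetical.

```python
from collections import Counter

def cogrouping_percentages(runs, reference):
    """For each sample, average percentage of cluster mates from each reference class."""
    classes = sorted(set(reference))
    n = len(reference)
    result = []
    for i in range(n):
        totals = Counter()
        used = 0
        for assignment in runs:
            mates = [j for j in range(n) if j != i and assignment[j] == assignment[i]]
            if not mates:
                continue  # sample alone in its cluster: no co-grouping information
            used += 1
            counts = Counter(reference[j] for j in mates)
            for c in classes:
                totals[c] += counts[c] / len(mates)
        result.append({c: (100.0 * totals[c] / used if used else 0.0) for c in classes})
    return result

def assign(percentages):
    """Assign each sample to the class with the highest percentage."""
    return [max(p, key=p.get) for p in percentages]

# Hypothetical reference classes and two runs of a clustering algorithm.
reference = ["APL", "APL", "APL", "Mono", "Mono", "Mono"]
runs = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]]
percentages = cogrouping_percentages(runs, reference)
final = assign(percentages)
```

Averaging the per-sample dictionaries over the samples of each reference class would give rows analogous to those of Table 1, and the per-sample percentages can also be read as degrees of membership.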

Table 1 Confusion matrix for the HC-Salamanca dataset

                      Predicted class
True class    APL       Inv       Mono      Other
APL           76.19%    2.71%     2.18%     18.92%
Inv           7.79%     26.49%    33.66%    32.06%
Mono          3.11%     17.81%    51.73%    27.35%
Other         8.62%     5.56%     8.70%     77.12%
Table 2 Final clustering for the HC-Salamanca dataset

Cluster   Samples in cluster
APL       APL-05204, APL-10222, APL-12366, APL-13058, APL-13223, APL-14217,
          APL-14398, APL-16089, APL-16739, APL-17074, Other-00139
Mono      Inv-00355, Inv-10891, Mono-06667, Mono-09949, Mono-12361, Mono-13701,
          Mono-13774, Mono-13850, Mono-14043
Other     Inv-00185, Inv-07644, Other-00170, Other-06209, Other-07297, Other-09376,
          Other-09875, Other-10232, Other-10557, Other-11567, Other-12570, Other-13296,
          Other-13451, Other-14399, Other-14698, Other-14735, Other-15443, Other-15833,
          Other-16221, Other-16942, Other-16973, Other-17099, Other-17273

Assuming as “ground truth” the clustering given by authors in [7], the perform-
ance of the clustering process can be tested by comparing the results given in both
tables. Some commonly used indices such as the Rand index and the Jaccard coef-
ficient have been defined to measure the degree of similarity between two parti-
tions. For the clustering given by our experiment, the Rand index was 0.90 and the
Jaccard coefficient was 0.77.
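
Both indices are computed from pairs of samples, as the following sketch shows. The function names and the toy partitions are illustrative; the definitions are the standard pair-counting ones, where a is the number of pairs placed together in both partitions, b and c the pairs placed together in only one of them, and d the pairs separated in both.

```python
from itertools import combinations

def pair_counts(truth, predicted):
    """Count sample pairs by agreement between two partitions."""
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t = truth[i] == truth[j]
        same_p = predicted[i] == predicted[j]
        if same_t and same_p:
            a += 1      # together in both partitions
        elif same_t:
            b += 1      # together only in the reference partition
        elif same_p:
            c += 1      # together only in the computed partition
        else:
            d += 1      # separated in both partitions
    return a, b, c, d

def rand_index(truth, predicted):
    a, b, c, d = pair_counts(truth, predicted)
    return (a + d) / (a + b + c + d)

def jaccard(truth, predicted):
    a, b, c, _ = pair_counts(truth, predicted)
    return a / (a + b + c)

# Hypothetical reference and computed partitions of five samples.
truth = ["A", "A", "A", "B", "B"]
predicted = [0, 0, 1, 1, 1]
ri = rand_index(truth, predicted)
jc = jaccard(truth, predicted)
```

Unlike the Rand index, the Jaccard coefficient ignores the pairs separated in both partitions, which is why it is usually the lower of the two values.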

3.2 The Armstrong Dataset


In [8] the authors proposed that lymphoblastic leukemias with MLL translocations
(mixed-lineage leukemia) constitute a distinct disease, denoted as MLL, and showed
that the differences in gene expression are robust enough to classify leukemias
correctly as MLL, acute lymphoblastic leukemia (ALL) or acute myeloid leukemia
(AML). The public dataset of this work, herein referred to as the Armstrong
dataset, has also been used to test our proposal. The complete group of samples
consists of 24 patients with B-precursor ALL (ALL), 20 patients with MLL-rearranged
B-precursor ALL (MLL) and 28 patients with acute myeloid leukemia
(AML). All the samples were analyzed using the Affymetrix GeneChip U95a,
which contains 12600 known genes.

Table 3 Confusion matrix for the Armstrong dataset

                      Predicted class
True class    ALL       AML       MLL
ALL           65.88%    5.16%     28.95%
AML           4.42%     86.40%    9.18%
MLL           34.74%    12.85%    52.41%

The percentage of times (on average) in which microarrays of each reference
cluster have been grouped together with microarrays of different classes (across
the ten executions of the algorithm) is shown in Table 3. These percentages can be
considered as an estimation of the overlapping area of the membership functions
of any two potential groups in the sector associated to a true class.
Table 4 Final clustering for the Armstrong dataset

Cluster   Samples in cluster
ALL       ALL-01, ALL-02, ALL-04, ALL-05, ALL-06, ALL-07, ALL-08, ALL-09, ALL-10,
          ALL-11, ALL-12, ALL-13, ALL-14, ALL-15, ALL-16, ALL-17, ALL-18, ALL-19,
          ALL-20, ALL-58, ALL-59, ALL-60, MLL-25, MLL-32, MLL-34, MLL-62
AML       ALL-03, AML-38, AML-39, AML-40, AML-41, AML-42, AML-43, AML-44,
          AML-46, AML-47, AML-48, AML-49, AML-50, AML-51, AML-52, AML-53,
          AML-54, AML-55, AML-56, AML-57, AML-65, AML-66, AML-67, AML-68,
          AML-69, AML-70, AML-71, AML-72
MLL       ALL-61, AML-45, MLL-21, MLL-22, MLL-23, MLL-24, MLL-26, MLL-27,
          MLL-28, MLL-29, MLL-30, MLL-31, MLL-33, MLL-35, MLL-36, MLL-37, MLL-63,
          MLL-64

As in the HC-Salamanca dataset, if the highest percentage for each sample
determines the cluster of the microarray, the final clustering obtained by our
simulated annealing-based algorithm is shown in Table 4. As in the previous
experiment, assuming the clustering given by authors in [8] is the “ground truth”, the
Rand index and the Jaccard coefficient for experiments carried out are 0.89 and
0.72, respectively.

4 Discussion
The aim of the experiments reported in the previous section is to test the validity
of the proposed clustering method. Dealing with unsupervised classification, it is
very difficult to test the ability of a method to perform the clustering since there is
no supervision of the process. In this sense, the classification into different groups
proposed by the authors in [7, 8] is assumed to be the reference partition of sam-
ples in our work. This assumption may be questionable in some cases, since the
reference groups are not well established. For example, in the HC-Salamanca
dataset the AML with inversion group is established by observation of the karyo-
type of cancer cells, but there is no other evidence (biological, genetic) suggesting
that this group corresponds to a distinct disease. Even so, the assumption of these
prior partitions as reference groups is the only way to evaluate the similarity (or
dissimilarity) of the results computed by the proposed method based on existing
knowledge. As it turns out, there is no perfect match between the results of our
proposed method and the reference partitions, but they are compatible with the
current knowledge of each dataset. For example, for the HC-Salamanca dataset the
best characterized groups are the APL and Other-AML groups, the worst is the
AML with inversion group, and there is some confusion of the monocytic AML
with the AML with inversion and Other-AML groups. These results are compatible
with the state of the art discussed in [7], where the APL group is the best
characterized disease (it can be considered as a distinct class), the monocytic
AML is a promising disease, the AML with inversion in chromosome 16 is the
weakest class, and the Other-AML group acts as the dumping ground for the rest
of the samples which are not similar enough to the other possible classes. For the
Armstrong dataset, the AML group is clearly separated from the MLL and ALL
groups. This is not surprising since the myeloid leukemia (AML) and the lymphoblastic
leukemias (MLL and ALL) represent distinct diseases. Some confusion is present
between the ALL and MLL groups, but this result is compatible with the assumption
(which the authors test in [8]) that the MLL group is a subtype of the ALL disease.

5 Conclusion
The simulated annealing-based algorithm presented in this work is a new
algorithm for iterative class discovery that uses fuzzy logic for informative gene
selection. An intrinsic advantage of the proposed method is that, given the
percentage of times in which a given microarray has been grouped with samples
of other potential classes, the degree of membership of that microarray to each
potential group can be deduced. This allows a fuzzy clustering of the available
microarrays, which is more suitable for the current state of the art in gene
expression analysis, since it is very unlikely that one can state (without uncertainty)
that any available microarray belongs to only one potential cluster. In this case, the
proposed method can help to assess the degree of affinity of each microarray with
potential groups and to guide the analyst in the discovery of new diseases.
In addition, the proposed method is also an unsupervised technique for gene se-
lection when it is used in conjunction with the concept of discriminant fuzzy pat-
tern (DFP) introduced in [9]. Since the selected genes depend on the resulting
clustering (they are the genes in the computed DFP obtained from all groups) and
the clustering is obtained by maximizing the cost function (which is based on the
notion of genetic coherence and assessed by the number of genes in the fuzzy pat-
tern of each cluster), then the selected genes jointly depend on all the genes in the
microarray, and the proposed method can be also considered a multivariate
method for gene selection.
Finally, the proposed technique, in conjunction with our previously developed
GENECBR platform [10], represents a more sophisticated tool which integrates
three main tasks in expression analysis: clustering, gene selection and classification.
In this context, all the proposed methods are non-parametric (they do not
depend on assumptions about the underlying distribution of available data), unbiased
with regard to the basic computational facility used to construct them (the notion
of fuzzy pattern) and able to manage imprecise (and hence, uncertain)
information, which is implicit in available datasets in terms of degree of membership
to linguistic labels (expression levels, potential categories, etc.).

Acknowledgements
This work is supported in part by the project Development of computational tools for the
classification and clustering of gene expression data in order to discover meaningful
biological information in cancer diagnosis (ref. VA100A08) from JCyL (Spain).
References
1. Xing, E.P., Karp, R.M.: CLIFF: clustering of high-dimensional microarray data via it-
erative feature filtering using normalized cuts. Bioinformatics 17, S306–S315 (2001)
2. Jiang, D., Tang, C., Zhang, A.: Cluster analysis for gene expression data: a survey.
IEEE T. Knowl. Data En. 16, 1370–1386 (2004)
3. Ben-Dor, A., Friedman, N., Yakhini, Z.: Class discovery in gene expression data. In:
Proceedings of the Fifth Annual International Conference on Computational Biology.
ACM, Montreal (2001)
4. von Heydebreck, A., Huber, W., Poustka, A., Vingron, M.: Identifying splits with clear
separation: a new class discovery method for gene expression data. Bioinformatics 17,
S107–S114 (2001)
5. Tang, C., Zhang, A., Ramanathan, M.: ESPD: a pattern detection model underlying
gene expression profiles. Bioinformatics 20, 829–838 (2004)
6. Varma, S., Simon, R.: Iterative class discovery and feature selection using Minimal
Spanning Trees. BMC Bioinformatics 5, 126 (2004)
7. Gutiérrez, N.C., López-Pérez, R., Hernández, J.M., Isidro, I., González, B., Delgado,
M., Fermiñán, E., García, J.L., Vázquez, L., González, M., San Miguel, J.F.: Gene ex-
pression profile reveals deregulation of genes with relevant functions in the different
subclasses of acute myeloid leukemia. Leukemia 19, 402–409 (2005)
8. Armstrong, S.A., Staunton, J.E., Silverman, L.B., Pieters, R., den Boer, M.L., Minden,
M.D., Sallan, S.E., Lander, E.S., Golub, T.R., Korsmeyer, S.J.: MLL translocations
specify a distinct gene expression profile that distinguishes a unique leukemia. Nat.
Genet. 30, 41–47 (2002)
9. Díaz, F., Fdez-Riverola, F., Corchado, J.M.: geneCBR: a case-based reasoning tool for
cancer diagnosis using microarray data sets. Comput. Intell. 22, 254–268 (2006)
10. Glez-Peña, D., Díaz, F., Hernández, J.M., Corchado, J.M., Fdez-Riverola, F.: ge-
neCBR: a translational tool for multiple-microarray analysis and integrative informa-
tion retrieval for aiding diagnosis in cancer research. BMC Bioinformatics 10, 187
(2009)
Automatic Workflow during the Reuse Phase of
a CBP System Applied to Microarray Analysis

Juan F. De Paz, Ana B. Gil, and Emilio Corchado

Abstract. The application of information technology in the field of biomedicine
has become increasingly important over the last several years. The number of
possible workflows in microarray analysis can be huge, so it would be
very useful to create an automatic process for establishing the workflows.
This paper presents an intelligent dynamic architecture based on intelligent
organizations for knowledge data discovery in biomedical databases. The multi-agent
architecture incorporates agents that can perform automated planning and
find optimal plans. The agents incorporate the CBP-BDI model for developing the
automatic planning, which makes it possible to predict the efficiency of the workflow
beforehand. These agents propose a new reorganizational agent model in which
complex processes are modelled as external services.

Keywords: Multiagent Systems, microarray, Case-based planning.

1 Introduction
The continuous growth of techniques for obtaining cancerous samples, specifically
those using microarray technologies, provides a great amount of data. Microarray
technology has become an essential tool in genomic research, making it possible to
investigate global gene expression in all aspects of human disease [4]. Expression
arrays [5] contain information about certain genes in a patient's samples. These data
have a high dimensionality and require new, powerful tools.
This paper presents an innovative solution to model reorganization systems in
biomedical environments. It is based on a multi-agent architecture that can integrate
Web services, and incorporates a novel planning mechanism that makes it possible to
determine workflows based on existing plans and previous results. The core of the system
presented in this paper is a CBP-BDI (Case-Based Planning, Belief-Desire-Intention)
agent [3] specifically designed to act as a Web services coordinator, making it possible
to reduce the computational load for the agents in the organization and expedite the
classification process. CBP-BDI agents [2] make it possible to formalize systems by
using a new planning mechanism that incorporates graph theory as a reasoning engine to
generate plans. The system was specifically applied to case studies consisting of the
classification of cancers from microarrays. The multi-agent system developed
incorporates novel strategies for data analysis and microarray data classification.
Section 2 describes the self-adaptive multi-agent system for expression analysis and
its coordinator CBP-BDI agent. Finally, Section 3 presents a case study on cancer
classification together with the results and conclusions obtained.

Juan F. De Paz · Ana B. Gil · Emilio Corchado
Departamento Informática y Automática
Universidad de Salamanca
Plaza de la Merced s/n, 37008, Salamanca, Spain
e-mail: {fcofds, abg, escorchado}@usal.es

M.P. Rocha et al. (Eds.): IWPACBB 2010, AISC 74, pp. 17–24, 2010.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

2 Self-Adaptive Multiagent System for Expression Analysis


Nowadays, it is essential to have software solutions that enforce the autonomy,
robustness, flexibility and adaptability of the system to be developed. Dynamic agent
organizations that adjust themselves to obtain advantages from their environment seem
more than suitable for coping with the development of this type of system. The
integration of multi-agent systems with SOA (Service Oriented Architecture) and Web
Services approaches has recently been explored [7]. Some developments are centered on
communication between these models, while others are centered on the integration of
distributed services, especially Web Services, into the structure of the agents. Oliva
et al. [8] have developed a Java-based framework to create SOA and Web Services
compliant applications, which are modelled as agents.
The approach presented in this paper is an organizational model for biomedical
environments based on a multi-agent dynamic architecture that incorporates
agents capable of generating plans for analyzing large amounts of data. The core
of the system is a novel mechanism for the implementation of the stages of
CBP-BDI mechanisms through Web services that provides a dynamic self-adaptive
behaviour to reorganize the environment. The types of agents are distributed in layers
within the system according to their functionalities. The agent layers
constitute the core and define a virtual organization for massive data analysis:
• Organization: The agents are responsible for conducting the analysis
of information following the CBP-BDI [2] reasoning model. The agents
from the organizational layer should be initially configured for the different
types of analysis that will be performed, given that these analyses
vary according to the available information and the search results.
• Analysis: The agents in the analysis layer are responsible for selecting the
configuration and the flow of services best suited for the problem at
hand. They communicate with Web services to generate results. The
agents of this layer follow the CBP-BDI [2] reasoning model. The workflow
and configuration of the services to be used are selected with
graphs, using information that corresponds to the previously executed
plans.
• Controller: The controller agent manages the agents available in the different
layers of the multiagent system. It allows the registration of agents in the layers,
as well as their use in the organization.
• Analysis Services: The analysis services are services used by analysis
agents for carrying out different tasks. They include services for
pre-processing, filtering, clustering and extraction of knowledge.

2.1 Coordinator CBP-BDI Agent


The agents in the analysis layer have the capacity to learn from the analysis carried
out in previous procedures. They adopt the CBP reasoning model, a specialization of
case-based reasoning (CBR) [1]. CBR systems solve new problems by
adapting solutions that have been used to solve similar problems in the past, and
learning from each new experience. A CBR system manages cases (past experiences) to
solve new problems. The way cases are managed is known as the CBR cycle, and
consists of four sequential phases: retrieve, reuse, revise and retain. CBP is the
idea of planning as remembering [2]. In CBP, the solution proposed to solve a
given problem is a plan, so this solution is generated by taking into account the
plans applied to solve similar problems in the past [6]. The CBP-BDI agents stem
from the BDI model [9] and establish a correspondence between the elements
from the BDI model and the CBP systems. Fusing the CBP agents together with
the BDI model and generating CBP-BDI agents makes it possible to formalize the
available information, the definition of the goals and actions that are available for
resolving the problem, and the procedure for resolving new problems by adopting
the CBP reasoning cycle. Agent plan is the name we give to a sequence of actions
that, from a current state e0, defines the path of states through which the agent
passes in order to reach another world state.

$$p_n(e_0) = e_n = a_n(e_{n-1}) = \cdots = (a_n \circ \cdots \circ a_1)(e_0), \qquad p_n \equiv a_n \circ \cdots \circ a_1 \qquad (1)$$
Based on this representation, the CBP-BDI coordinator agents combine the initial
state of a case, the final state of a case with the goals of the agent, and the
intentions with the actions that can be carried out in order to create plans that make it
possible to reach the final state. The actions that need to be carried out are
services, making a plan an ordered sequence of services. It is necessary to facilitate
the inclusion of new services and the discovery of new plans based on existing
plans. Services correspond to the actions that can be carried out and that determine
the changes in the initial problem data. The plan actions correspond to services,
and the order in which the actions are applied corresponds to the order for executing
services. As such, an organizational plan is defined by the services that
comprise it and by the order of applying each of those services.
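
The definition of a plan as a composition of actions can be illustrated with a minimal
Python sketch. The state representation (a list of probe values) and the two toy
services are assumptions made only for this example; they are not the services used in
the system described here.

```python
from functools import reduce
from typing import Callable, List, Sequence

State = List[float]                # assumption: a state is a list of probe values
Action = Callable[[State], State]  # an action (service) maps one state to the next


def compose_plan(actions: Sequence[Action]) -> Action:
    """Return p_n = a_n o ... o a_1, i.e. actions[0] is applied first."""
    def plan(e0: State) -> State:
        return reduce(lambda state, action: action(state), actions, e0)
    return plan


# Two toy services, invented for illustration only.
def drop_negative(state: State) -> State:
    return [x for x in state if x >= 0]


def scale_to_max(state: State) -> State:
    top = max(state)
    return [x / top for x in state]


# a_1 = drop_negative, a_2 = scale_to_max
plan = compose_plan([drop_negative, scale_to_max])
final_state = plan([4.0, -1.0, 2.0])
```

Representing a plan as an ordered list of callables keeps the two defining elements of an
organizational plan explicit: which services it contains and the order in which they run.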
The information corresponding to each plan is represented in bidimensional arrays
as shown in the chart in figure 1. The chart lists the plans in rows while the
columns represent the links between the services that comprise a plan, so that Sij
represents the execution of service j occurring after service call i. The second row
shows the plan comprised of services a2, a1, with an initial connection S02 that
executes service a2 at the initial stage. The columns for service S2x provide the
connection with the subsequent service, i.e., S21, for which service a1 is executed.
Lastly, the columns for service S1x provide the final connection S1f, which ends the
plan.

Columns (Actions/Services): S01 S02 ... S12 S13 ... S1f ... S21 S23 ... S2f ... Si1 Sij ... Sif

Plan    Connections marked (v)    Efficiency
a1      S01, S1f                  v1
a2a1    S02, S21, S1f             v2
a1a2    S01, S12, S2f             v3

Fig. 1 Plans and plan actions carried out through a concatenation of services
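
As a rough illustration of this encoding, the sketch below turns an ordered list of
services into its connections Sij and then into one 0/1 row per plan. The labels "0"
and "f" for the start and end nodes follow the notation of the text (S02, S1f); the
function names and the dictionary layout are assumptions of the example.

```python
from typing import Dict, List, Tuple

Connection = Tuple[str, str]  # (previous service, next service)


def plan_connections(services: List[str]) -> List[Connection]:
    """List the connections S_ij of a plan, from start node '0' to end node 'f'."""
    path = ["0"] + services + ["f"]
    return list(zip(path, path[1:]))


def plan_row(services: List[str], columns: List[Connection]) -> List[int]:
    """Encode a plan as a 0/1 row over the given connection columns."""
    used = set(plan_connections(services))
    return [1 if column in used else 0 for column in columns]


# The three plans mentioned for figure 1: a1, a2a1 and a1a2.
plans: Dict[str, List[str]] = {"a1": ["1"], "a2a1": ["2", "1"], "a1a2": ["1", "2"]}

# One column per connection that appears in at least one plan.
columns = sorted({c for services in plans.values() for c in plan_connections(services)})
table = {name: plan_row(services, columns) for name, services in plans.items()}
```

For the plan a2a1 this yields the connections S02, S21 and S1f, which matches the
description of the second row given in the text.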

Based on the information corresponding to previous experiences (plans already
executed) a new plan is generated. To do so, the cases with the greatest and least
efficiency with regard to the current problem are retrieved, and the CBP reasoning
cycle is initiated according to the BDI specifications. This way, each plan is
represented by the following expression:

$$p = \{(a_f \circ a_k \circ \cdots \circ a_i \circ a_0)(e_0) = (S_{kf} \circ \cdots \circ S_{0i})(e_0)\} \qquad (2)$$

where e0 represents the initial state that corresponds to the initial value of each
probe. As each of the selected services is executed, different states ei are
reached, which contain the new set of probes produced by the application of
services.

2.1.1 Retrieve
During the retrieval stage, the plans with the greatest and least efficiency are
selected from among those that have been applied. Microarrays are composed of
probes that represent variables that mark the level of significance of specific
genes. The retrieval of those cases is performed in one of two ways according to
the case study. To retrieve cases, it is important to consider whether there has
been a previous analysis of a case study with similar characteristics. If so, the
corresponding plans for the same case study are selected.
If there are no plans that correspond to the same case study, or if the number of
plans is insufficient, the plans corresponding to the most similar case study are
retrieved. The selection of the most similar case study is performed according to the
cosine distance applied to the following set of variables: number of probes,
number of cases, and Pearson coefficient of variation [12] for e0.
The number of efficient and inefficient cases selected is predetermined so that
at the end of this stage the following set of elements is obtained:

$$P = \{P_e\{p_1^e, \ldots, p_n^e\} \cup P_i\{p_1^i, \ldots, p_n^i\}\} \qquad (3)$$

Pe represents the set of efficient plans and Pi represents the set of inefficient plans.
Once the plans have been retrieved, a new efficient plan is generated in the next
phase.
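
The selection of the most similar case study can be sketched as follows. This is only an
illustration: the descriptor values and the case study names are hypothetical, the
descriptors are used without rescaling (the text does not say whether they are
normalized), and the coefficient of variation is computed with the population standard
deviation, which is also an assumption.

```python
import math
from typing import Dict, Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Standard deviation divided by the mean (population form, assumed)."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance) / mean


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def most_similar_case_study(query: Sequence[float],
                            stored: Dict[str, Sequence[float]]) -> str:
    """Name of the stored case study whose descriptor is closest to the query."""
    return min(stored, key=lambda name: cosine_distance(query, stored[name]))


# Descriptor = (number of probes, number of cases, coefficient of variation of e0).
# All values below are made up for the example.
stored_studies = {
    "study_a": (22283.0, 72.0, 0.9),
    "study_b": (54675.0, 234.0, 1.1),
}
closest = most_similar_case_study((54675.0, 212.0, 1.0), stored_studies)
```

Because the number of probes is several orders of magnitude larger than the other two
descriptors, it dominates the cosine distance unless the variables are rescaled first.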

2.1.2 Reuse
This phase takes the plans P obtained in the retrieval phase and generates a new,
more efficient plan. The new plan is built according to the efficiency of the actions
as estimated by the overall efficiency vi of the plan. Estimating the efficiency of
each action is done according to the model defined by the decision trees for selecting
significant nodes [3]. This way, estimating the efficiency of each action is
carried out according to the expression (4). This expression is referred to as the
gain ratio and depends on both node S and the selected attribute B.
$$G(S, B) = I(S) - \sum_{i=1}^{t} \frac{|S_i|}{|S|} I(S_i) \qquad (4)$$

where S represents a node that, in this case, will always be the root node of the
tree, B is the condition for the existing action, Si represents child node i from
node S, and |Si| is the number of cases associated with the child node Si. The function
I(S) represents the entropy of node S and is defined as follows:

$$I(S) = -\sum_{j=1}^{n} f_j^S \cdot \log(f_j^S) \qquad (5)$$

where $f_j^S = n_j^S / N^S$ represents the relative frequency of class Cj in S,
$n_j^S$ is the number of elements from class Cj in S and $N^S$ is the total number of
elements. In this case, Cj = {efficient, inefficient}.
The gain ratio G determines the level of importance for each action by
distinguishing between an efficient and an inefficient plan. High values for the gain
ratio indicate that the action should be included in a plan if it involves an action to
be carried out in an efficient plan, otherwise it should be eliminated.
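
Expressions (4) and (5) can be sketched in Python as shown below. The sketch assumes a
base-2 logarithm (the text does not state the base) and a binary condition B meaning
"the plan contains the connection"; the four example plans are invented.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """I(S) = -sum_j f_j * log2(f_j) over the class frequencies in S."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain(labels: Sequence[str], has_connection: Sequence[bool]) -> float:
    """G(S, B): entropy of the root minus the weighted entropy of its children."""
    total = len(labels)
    children: Dict[bool, List[str]] = {}
    for label, flag in zip(labels, has_connection):
        children.setdefault(flag, []).append(label)
    weighted = sum(len(child) / total * entropy(child) for child in children.values())
    return entropy(labels) - weighted


# Four invented plans: two efficient, two inefficient.
classes = ["efficient", "efficient", "inefficient", "inefficient"]

# A connection present exactly in the efficient plans separates the classes fully.
separating = gain(classes, [True, True, False, False])
# A connection spread evenly over both classes carries no information.
uninformative = gain(classes, [True, False, True, False])
```

A connection that separates efficient from inefficient plans perfectly reaches the
maximum gain for two balanced classes, while one that appears equally in both classes
obtains a gain of zero.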
A new table listing gain data is formed according to the values of the gain ratio
and the efficiency associated with each plan. A new flow of execution for each
action is created from the gains table. The gains table uses the following formula to
establish a value for the significance of each of the actions carried out in each
plan:

$$T(S_{ij}, k) = G'(S, S_{ij}) \cdot v_k \qquad (6)$$

where G' contains the values of G normalized between 0 and 1 with the
values being inverted (the maximum value corresponds to 0 and the minimum to
1) and v contains the average value of efficiency for the plans with a connection
ij. Each connection ij has an influence on the final efficiency of the plan that
is represented as t_ijk.
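
A possible reading of expression (6) is sketched below: the gains are min-max
normalized and inverted, and each connection is then weighted by the average efficiency
of the plans that contain it. The function names, the data layout and all numeric
values are assumptions made for the example.

```python
from typing import Dict, List


def inverted_normalized_gains(gains: Dict[str, float]) -> Dict[str, float]:
    """G': min-max normalization with inversion (maximum -> 0, minimum -> 1)."""
    low, high = min(gains.values()), max(gains.values())
    if high == low:  # all gains equal: nothing to distinguish
        return {name: 0.0 for name in gains}
    return {name: (high - g) / (high - low) for name, g in gains.items()}


def connection_weights(gains: Dict[str, float],
                       plans: Dict[str, List[str]],
                       efficiency: Dict[str, float]) -> Dict[str, float]:
    """G'(S, S_ij) times the mean efficiency of the plans using connection ij."""
    weights: Dict[str, float] = {}
    for connection, g_prime in inverted_normalized_gains(gains).items():
        values = [efficiency[p] for p, used in plans.items() if connection in used]
        if values:  # skip connections that no retrieved plan uses
            weights[connection] = g_prime * sum(values) / len(values)
    return weights


# Invented example: two plans sharing the final connection S1f.
example_gains = {"S01": 0.0, "S02": 1.0, "S1f": 0.5}
example_plans = {"p1": ["S01", "S1f"], "p2": ["S02", "S1f"]}
example_efficiency = {"p1": 0.2, "p2": 0.8}
weights = connection_weights(example_gains, example_plans, example_efficiency)
```

With the inversion, the connection with the highest gain receives the lowest weight, so
it becomes the cheapest edge when a minimal route is searched in the resulting graph.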
Once the graph for the plans has been constructed, the minimal route that goes
from the start node to the end node is calculated. In order to calculate the
shortest/longest route, the Dijkstra algorithm is applied, since there are
implementations of order n*log n. The Dijkstra algorithm requires non-negative edge
weights, so the absolute value of the most negative edge weight is added to each
of the edges in order to remove negative values from the graph.
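
The route calculation can be sketched with a heap-based Dijkstra search, as below. The
graph and its weights are hypothetical and only reuse the node names of the case study;
the shift function follows the procedure described in the text. Note that adding a
constant to every edge penalizes routes with more edges, so after a shift the minimal
route may differ from the minimal route on the original weights.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {successor: edge weight}


def shift_to_non_negative(graph: Graph) -> Graph:
    """Add |most negative weight| to every edge so that no weight is negative."""
    lowest = min(w for edges in graph.values() for w in edges.values())
    offset = -lowest if lowest < 0 else 0.0
    return {u: {v: w + offset for v, w in edges.items()} for u, edges in graph.items()}


def dijkstra(graph: Graph, start: str, goal: str) -> Tuple[float, List[str]]:
    """Return (cost, path) of the minimal route from start to goal."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for successor, weight in graph.get(node, {}).items():
            if successor not in visited:
                heapq.heappush(queue, (cost + weight, successor, path + [successor]))
    raise ValueError("goal is not reachable from start")


# Hypothetical weights, chosen only to illustrate the search.
example_graph: Graph = {
    "S0": {"S1": 0.9, "S2": 0.1},
    "S1": {"S3": 0.1, "Sf": 0.8},
    "S2": {"S1": 0.1, "S4": 0.6},
    "S3": {"Sf": 0.1},
    "S4": {"Sf": 0.9},
}
cost, route = dijkstra(shift_to_non_negative(example_graph), "S0", "Sf")
```

Storing the partial path in each queue entry keeps the sketch short; a predecessor map
would use less memory on large graphs.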

2.1.3 Revise and Retain


The revise phase is carried out automatically according to the final efficiency
obtained. The different analyses are associated with different measures of efficiency
that measure the final quality of the results obtained, making it unnecessary to perform
a manual revision. During the retain phase, the plan is stored in the memory
of plans. If a plan with the same flow of execution of services for the same case
study already exists, only the information from the plan with the highest quality
will be stored. This makes it possible to limit the size of the case memory and
select only those plans with certain parameters such as level of significance,
correlation coefficient, percentile, etc.
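
The retain rule can be sketched as a dictionary keyed by case study and flow of
services. This is a minimal sketch under two assumptions that are not stated in the
text: quality is a single number, and a larger number means a better plan.

```python
from typing import Dict, Sequence, Tuple

PlanKey = Tuple[str, Tuple[str, ...]]  # (case study, ordered flow of services)


def retain(memory: Dict[PlanKey, float], case_study: str,
           flow: Sequence[str], quality: float) -> None:
    """Store the plan unless an equal flow with higher quality is already kept."""
    key = (case_study, tuple(flow))
    if key not in memory or quality > memory[key]:
        memory[key] = quality


plan_memory: Dict[PlanKey, float] = {}
retain(plan_memory, "gastric", ["S2", "S1", "S3"], 0.6)
retain(plan_memory, "gastric", ["S2", "S1", "S3"], 0.4)  # worse duplicate, ignored
retain(plan_memory, "gastric", ["S1", "S3"], 0.5)        # different flow, stored
```

Keeping one entry per case study and flow is what bounds the size of the plan memory.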

3 Results and Conclusions

This paper has presented a self-adaptive organization based on a multiagent
architecture and its application to a real problem. In 2006, there were approximately
64,122 men and women alive in the United States who had a history of cancer of
the stomach: 36,725 men and 27,397 women. The age-adjusted incidence rate was
7.9 per 100,000 men and women per year while the fatality rate was 4.0 per
100,000 [11]. The data for gastric cancer were obtained with a HG U133 plus 2.0
chip and corresponded to 234 patients affected by this cancer in 3 different parts
of the organism (primary gastric cancer, skin and others) [10]. The data were
obtained from a public repository at http://www.ebi.ac.uk.
The experiment consisted of evaluating the services distribution system in the
filtering agent for the case study that classified patients affected by different types
of cancer. According to the identification of the problem described in table 1, the
filtering agent selected the plans with the greatest efficiency, considering the
different execution workflows for the services that are in the plans.
The filtering agent in the analysis layer selects the configuration parameters from
a specific set of pre-determined values when it has been told to explore the
parameters. Otherwise, for a specific plan, it selects the values that have provided
better results based on the previously established measure of efficiency. The
different configurations used are listed in table 1. An in-depth analysis of these
techniques can be found in our previous work [10]. The last columns of the table list
the final efficiency obtained based on a measure and the type of plan (efficient or
inefficient). A value of 1 in the Class column indicates that the plan is efficient
while a 0 indicates that the plan is inefficient. The remaining values indicate the
order of execution of the services.

Table 1 Efficiency of the plans

Plan Variability (z) Uniform (α) Correlation (α) Cutoff Efficiency Class
p1 1 2 3 0.14641098 1
p2 1 2 3 4 1 0
p3 1 2 0.24248635 1
p4 1 2 0.14935538 1
p5 3 1 2 0.15907924 1
p6 1 0.96457118 0
p7 1 1 0
p8 1 2 1 0

The diagnostic agent at the Organization level is in charge of selecting which
filtering agent to use. The filtering agent, in turn, automatically selects the flow of
services that best adapts to the case study, in accordance with the efficiency of each
of the plans listed in table 1 and the equation (6). Using the equation defined in (6)
and the information regarding efficiency provided by table 1, the relevance of the
execution of each of the services is calculated.

[Figure 2: directed graph over the nodes S0, S1, S2, S3, S4 and Sf with weighted edges]

Fig. 2 Decision tree for classifying patients

Figure 2 displays the directed graph that was obtained. The final path that is
followed is shown in bold. The new estimated plan is comprised of the sequence
of actions S02, S21, S13, S3f.
It is clear that the path followed in the plan that was obtained does not coincide
with any path previously applied, although the services that it contains present an
efficiency similar to that given by plan p1 as shown in table 1. The efficiency
obtained in this execution is 0.14458.
The system presented in this study provides a novel mechanism for global
coordination in highly changing environments. The mechanism is capable of automatic
reorganization and is especially useful for decision making in systems that
use agreement technologies. The system was applied to a case study in a biomedical
environment and can be easily extended to other environments with similar
characteristics.

Acknowledgments. This development has been partially supported by the projects JCyL
SA071A08, of the Junta of Castilla and León (JCyL): [BU006A08], the project of the
Spanish Ministry of Education and Innovation [CIT-020000-2008-2] and
[CIT-020000-2009-12], and Grupo Antolin Ingenieria, S.A., within the framework of project
MAGNO2008 - 1028.- CENIT also funded by the same Government Ministry.

References
[1] Kolodner, J.: Case-Based Reasoning. Morgan Kaufmann, San Francisco (1993)
[2] Glez-Bedia, M., Corchado, J.: A planning strategy based on variational calculus for
deliberative agents. Computing and Information Systems Journal 10(1), 2–14 (2002)
[3] Kohavi, R., Quinlan, J.R.: Decision Tree Discovery. In: Handbook of Data Mining
and Knowledge Discovery, pp. 267–276. Oxford University Press, Oxford (2002)
[4] Quackenbush, J.: Computational analysis of microarray data. Nature Reviews
Genetics 2(6), 418–427 (2001)
[5] Affymetrix,
http://www.affymetrix.com/support/technical/datasheets/hgu133arrays_datasheet.pdf
[6] Corchado, J.M., Bajo, J., De Paz, Y., Tapia, D.I.: Intelligent Environment for
Monitoring Alzheimer Patients, Agent Technology for Health Care. Decision Support
Systems 44(2), 382–396 (2008)
[7] Ardissono, L., Petrone, G., Segnan, M.: A conversational approach to the interaction
with Web Services. Computational Intelligence, vol. 20, pp. 693–709. Blackwell
Publishing, Malden (2004)
[8] Oliva, E., Natali, A., Ricci, A., Viroli, M.: An Adaptation Logic Framework for
Java-based Component Systems. Journal of Universal Computer Science 14(13),
2158–2181 (2008)
[9] Bratman, M.: Intention, Plans and Practical Reason. Harvard U.P., Cambridge (1987)
[10] Corchado, J.M., De Paz, J.F., Rodríguez, S., Bajo, J.: Model of experts for decision
support in the diagnosis of leukemia patients. Artificial Intelligence in
Medicine 46(3), 179–200 (2009)
[11] Horner, M.J., Ries, L.A.G., Krapcho, M., Neyman, N., Aminou, R., Howlader, N.,
Altekruse, S.F., Feuer, E.J., Huang, L., Mariotto, A., Miller, B.A., Lewis, D.R.,
Eisner, M.P., Stinchcomb, D.G., Edwards, B.K. (eds.): SEER Cancer Statistics Review,
1975-2006, National Cancer Institute (2009),
http://seer.cancer.gov/csr/1975_2006/
[12] Kuo, C.D., Chen, G.Y., Wang, Y.Y., Hung, M.J., Yang, J.L.: Characterization and
quantification of the return map of RR intervals by Pearson coefficient in patients
with acute myocardial infarction. Autonomic Neuroscience 105(2), 145–152 (2003)
A Comparative Study of Microarray Data
Classification Methods Based on Ensemble
Biological Relevant Gene Sets

Miguel Reboiro-Jato, Daniel Glez-Peña, Juan Francisco Gálvez,
Rosalía Laza Fidalgo, Fernando Díaz, and Florentino Fdez-Riverola

Abstract. In this work we study the utilization of several ensemble alternatives for
the task of classifying microarray data by using prior knowledge known to be
biologically relevant to the target disease. The purpose of the work is to obtain an
accurate ensemble classification model able to outperform baseline classifiers by
introducing diversity in the form of different gene sets. The proposed model takes
advantage of WhichGenes, a powerful gene set building tool that allows the automatic
extraction of lists of genes from multiple sparse data sources. Preliminary
results using different datasets and several gene sets show that the proposal is able
to outperform basic classifiers by using existing prior knowledge.

Keywords: microarray data classification, ensemble classifiers, gene sets, prior
knowledge.

1 Introduction and Motivation


Microarray technology has become a fundamental tool in genomic
research, making it possible to investigate global gene expression in all aspects of
human disease. In particular, cancer genetics, based on the analysis of cancer
genotypes, provides a valuable alternative to cancer diagnosis in both theory and
practice [1]. In this context, the automatic classification of cancer patients has been a
promising approach in cancer diagnosis since early detection and treatment
can substantially improve survival rates. For this task, several computational
methods (statistical and machine learning) have been proposed in the literature
including linear discriminant analysis (LDA), Naïve-Bayes classifier (NBC),
learning vector quantization (LVQ), radial basis function (RBF) networks, decision
trees, probabilistic neural networks (PNNs) and support vector machines
(SVMs) among others [2]. In the same line, but following the assumption that a
classifier ensemble system is more robust than an excellent single classifier [3],
some researchers have also successfully applied different classifier ensemble systems
to deal with the classification of microarray datasets [4].
In addition to predictive performance, there is also hope that microarray studies
uncover molecular disease mechanisms. However, in many cases the molecular
signatures discovered by the algorithms are unfocused from a biological point of
view [5]. In fact, they often look more like random gene lists than biologically
plausible and understandable signatures. Another shortcoming of standard
classification algorithms is that they treat gene-expression levels as anonymous
attributes. However, a lot is known about the function and the role of many genes in
certain biological processes.
Although numerical analysis of microarray data is considerably consolidated,
the true integration of numerical analysis and biological knowledge is still a long
way off [6]. The inclusion of additional knowledge sources in the classification
process can prevent the discovery of the obvious, complement a data-inferred
hypothesis with references to already proposed relations, help analysis to avoid
overconfident predictions and allow us to systematically relate the analysis findings to
present knowledge [7]. In this work we incorporate relevant gene
sets obtained from WhichGenes [8] in order to make predictions easy to interpret
in concert with incorporated knowledge. The study carried out aims to borrow
information from existing biological knowledge to improve both predictive
accuracy and interpretability of the resulting classifiers.
The rest of the paper is structured as follows: Section 2 presents a brief review
of the use of ensemble methods for classifying microarray data. Section 3
describes the selected datasets and base classifiers for the current study, together
with the choice of gene sets and the different approaches used for ensemble
creation. Finally, Section 4 discusses the reported results and concludes the paper.

Miguel Reboiro-Jato · Daniel Glez-Peña · Juan Francisco Gálvez · Rosalía Laza Fidalgo ·
Florentino Fdez-Riverola
ESEI: Escuela Superior de Ingeniería Informática, University of Vigo,
Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
e-mail: {mrjato, dgpena, galvez, rlaza, riverola}@uvigo.es
Fernando Díaz
EUI: Escuela Universitaria de Informática, University of Valladolid, Plaza Santa Eulalia,
9-11, 40005, Segovia, Spain
e-mail: [email protected]

M.P. Rocha et al. (Eds.): IWPACBB 2010, AISC 74, pp. 25–32, 2010.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010

2 Related Work
Although much research has been performed on applying machine learning techniques
to microarray data classification in recent years, it has been shown
that conventional machine learning techniques have intrinsic drawbacks in achieving
accurate and robust classifications. In order to obtain more robust microarray
data classification techniques, several authors have investigated the benefits of
ensemble approaches applied to genomic research.
Díaz-Uriarte and Alvarez de Andrés [9] investigated the use of random forest
for multi-class classification of microarray data and proposed a new method of
gene selection in classification problems based on random forest. Using simulated
and real microarray datasets the authors showed that random forest can obtain
comparable performance to other methods, including DLDA, KNN, and SVM.
Peng [10] presented a novel ensemble approach based on seeking an optimal
and robust combination of multiple classifiers. The proposed algorithm begins
with the generation of a pool of candidate base classifiers based on gene
sub-sampling and then performs the selection of a sub-set of appropriate base
classifiers to construct the classification committee based on classifier clustering.
Experimental results demonstrated that the proposed approach outperforms both
baseline classifiers and those generated by bagging and boosting.
Liu and Huang [11] applied Rotation Forest to microarray data classification
using principal component analysis, non-parametric discriminant analysis and
random projections to perform feature transformation in the original rotation
forest. In all the experiments, the authors reported that the proposed approach
outperformed bagging and boosting alternatives.
More recently, Liu and Xu [12] proposed a genetic programming approach to
analyze multiclass microarray datasets where each individual consists of a set of
small-scale ensembles containing several trees. In order to guarantee high diversity
in the individuals a greedy algorithm is applied. Their proposal was tested
using five datasets, showing that the proposed method effectively implements the
feature selection and classification tasks.
As a particular case in the use of ensemble systems, ensemble feature selection
represents an efficient method proposed in [13] which can also achieve high
classification accuracy by combining base classifiers built with different feature
subsets. In this context, the works of [14] and [15] study the use of different genetic
algorithm alternatives for performing feature selection with the aim of making the
classifiers of the ensemble disagree on difficult cases. Reported results in both
cases showed improvements when compared against other alternatives.
In line with this previous work, the aim of this study is to validate the superiority
of different classifier ensemble approaches when using prior knowledge in the
form of biologically relevant gene sets. The objective is to improve the predictive
performance of baseline classifiers.

3 Comparative Study
In order to carry out the comparative study, we apply several ensemble alternatives
to classify three DNA microarray datasets involving various tumour tissue samples.
To validate the study, we analyze the performance of different baseline
classifiers and test our hypothesis using two different sources of information.

3.1 Datasets and Base Classifiers


We carry out the experimentation using three public leukemia datasets taken from
the previous studies of Gutiérrez et al. [16], Bullinger et al. [17] and Valk et al.
[18]. We have selected samples from each dataset belonging to 4 different groups
of acute myeloid leukemias including (i) promyelocytic (APL), (ii) inversion 16,
(iii) monocytic and (iv) other AMLs. The distribution of samples is shown in
Table 1.

Table 1 Distribution of microarray data samples belonging to the public datasets analyzed

                  APL   Inv(16)   Monocytic   Other
Gutiérrez et al    10         4           7      22
Bullinger et al    19        14          64     177
Valk et al          7        10           7      51

In order to compare the performance obtained by the different ensemble approaches,
we have selected four well-known classification algorithms: (i) the Naïve
Bayes (NB) learner is perhaps the most widely used method. Although its independence
assumption is over-simplistic, studies have found NB to be very effective
in a wide range of problems; (ii) IB3 represents a variant of the well-known
nearest neighbour algorithms implementing a simple version of a lazy learner
classifier; (iii) Support Vector Machines (SVMs) constitute a well-known family of
algorithms used for classification and regression purposes. Their major advantage
is that their learning capacity does not degrade even if many characteristics exist,
being especially applicable to microarray data; (iv) Random Forest (RF) is a basic
ensemble classifier that consists of many decision trees. The method combines
the bagging idea and random selection of features in order to construct a collection of
decision trees with controlled variation.

3.2 Biological Knowledge Gene Sets


For the prior selection of gene sets that represent the explicit information available,
the following sources of information have been used: (i) 33 metabolic sub-pathways
related to existing cancers in SABiosciences (http://www.sabiosciences.com), previously
analyzed in studies by [19] and [20], plus 4 groups extracted from the OMIM (Online
Mendelian Inheritance in Man) database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim)
that correspond to various types of leukemia (myeloid, monocytoid, promyelocytic and
general leukemia) and (ii) those pathways from the KEGG (Kyoto Encyclopedia of Genes
and Genomes) database grouped in both the 'environmental information processing' and
'genetic information processing' categories.

3.3 Ensemble Alternatives


According to Kuncheva [3], several ensembles can be built by introducing variations
at four different levels: (i) data level, (ii) feature level, (iii) classifier level
and (iv) combination level.
First of all, by using different data subsets at data level or different feature
subsets at feature level, the space of the problem can be divided into several areas
where base classifiers can be trained. This divide-and-conquer strategy can simplify
the problem, leading to improved performance of the base classifiers.
Secondly, at classifier level, different types of classifiers can be used in order to
take advantage of the strong points of each classifier type. Although many ensemble
paradigms employ the same classification model, there is no evidence that one
strategy is better than the other [3]. Finally, combination level groups the different
ways of combining the classifier decisions.
In this study, base classifiers are trained with all the samples in each data set, so
no work is performed at the data level. The feature level is addressed by incorporating
gene set data into the ensemble models. Each pathway or group of genes is used
as a feature selection, so the microarray data are filtered to keep only the expression
levels of the genes belonging to that group before the base classifiers are trained.
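
The filtering step described above can be sketched as follows. This is only a minimal illustration, assuming the expression data are held as a list of per-sample rows with a parallel list of gene names; the function name `filter_by_gene_set`, the toy values and the gene set are hypothetical and do not come from the study.

```python
def filter_by_gene_set(expression, gene_names, gene_set):
    """Keep only the columns of `expression` whose gene belongs to `gene_set`.

    expression: list of rows, one per sample, one value per gene.
    gene_names: list of gene identifiers, aligned with the columns.
    gene_set:   set of gene identifiers (e.g. one pathway).
    """
    keep = [i for i, gene in enumerate(gene_names) if gene in gene_set]
    kept_names = [gene_names[i] for i in keep]
    filtered = [[row[i] for i in keep] for row in expression]
    return filtered, kept_names


# Hypothetical toy data: 3 samples x 4 genes.
gene_names = ["TP53", "MYC", "FLT3", "GAPDH"]
expression = [
    [2.1, 0.3, 5.0, 9.9],
    [1.8, 0.4, 4.2, 9.7],
    [0.2, 3.1, 0.5, 9.8],
]
example_set = {"FLT3", "MYC", "RUNX1"}  # made-up gene set for illustration

filtered, kept = filter_by_gene_set(expression, gene_names, example_set)
print(kept)      # genes present both in the data and in the gene set
print(filtered)  # one reduced row per sample
```

Genes that appear in the gene set but not in the data (here `RUNX1`) are simply ignored, so each base classifier sees one reduced feature matrix per gene set.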
In order to construct the final ensemble, our approach consists of two sequential
steps: (i) classifier selection, in which each simple classifier is initially trained
with each gene set following a stratified 10-fold cross-validation process to estimate
its performance and (ii) classifier training, where the selected
simple_classifier/gene_set pairs are trained with the whole data set. All the
strategies proposed in this study for the selection of promising classifiers are based
on the value of the kappa statistic obtained for each simple_classifier/gene_set
pair in the first step. The proposed heuristics are the following:
     • All classifiers [AC]: every simple_classifier/gene_set pair is used for
        constructing the final ensemble.
     • All gene sets [AG]: for each gene set, the simple_classifier/gene_set pair
        with the best kappa value is selected for constructing the final ensemble.
     • Best classifiers without type [BCw/oT_%]: a global threshold is calculated
        as a percentage of the best kappa value obtained by the winning
        simple_classifier/gene_set pair. Those pairs with a kappa value equal to or
        higher than the computed threshold are selected.
     • Best classifier by type [BCbyT_%]: as in the previous heuristic, a
        threshold is calculated, but in this case there is a separate threshold for each
        simple classifier type.
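
The four heuristics can be sketched as below. This is a hedged reading of the list above, not the authors' implementation: it assumes the cross-validated kappa values are available in a dictionary keyed by (classifier, gene set), and it assumes that the per-type threshold in BCbyT is the same percentage applied to the best kappa within each classifier type. The function name `select_pairs` and the toy kappa values are hypothetical.

```python
def select_pairs(kappas, heuristic, pct=None):
    """Return the set of (classifier, gene_set) pairs chosen by a heuristic.

    kappas:    dict mapping (classifier, gene_set) -> cross-validated kappa.
    heuristic: "AC", "AG", "BCw/oT" or "BCbyT".
    pct:       fraction in (0, 1], used only by the two threshold heuristics.
    """
    if heuristic == "AC":
        # Every pair takes part in the ensemble.
        return set(kappas)
    if heuristic == "AG":
        # For each gene set, keep the pair with the highest kappa.
        best = {}
        for (clf, gs), k in kappas.items():
            if gs not in best or k > kappas[best[gs]]:
                best[gs] = (clf, gs)
        return set(best.values())
    if heuristic == "BCw/oT":
        # One global threshold: a percentage of the overall best kappa.
        threshold = pct * max(kappas.values())
        return {pair for pair, k in kappas.items() if k >= threshold}
    if heuristic == "BCbyT":
        # One threshold per classifier type (assumed: percentage of that
        # type's own best kappa).
        best_by_type = {}
        for (clf, _), k in kappas.items():
            best_by_type[clf] = max(k, best_by_type.get(clf, float("-inf")))
        return {
            (clf, gs)
            for (clf, gs), k in kappas.items()
            if k >= pct * best_by_type[clf]
        }
    raise ValueError("unknown heuristic: " + heuristic)


# Hypothetical kappa values for three classifier types and two gene sets.
kappas = {
    ("IB3", "gs1"): 0.80, ("IB3", "gs2"): 0.65,
    ("NBS", "gs1"): 0.50, ("NBS", "gs2"): 0.48,
    ("SVM", "gs1"): 0.10, ("SVM", "gs2"): 0.00,
}

print(sorted(select_pairs(kappas, "AG")))
print(sorted(select_pairs(kappas, "BCw/oT", pct=0.75)))
print(sorted(select_pairs(kappas, "BCbyT", pct=0.90)))
```

With these toy values the global threshold keeps only the two IB3 pairs, whereas the per-type threshold keeps at least one pair of every classifier type, which is the practical difference between the two strategies.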
The way in which the final output of the ensemble is calculated is also based on
the kappa statistic. The combination approach used for the proposed ensembles
is a weighted majority vote in which the weight of each vote is the corresponding
classifier’s kappa value.
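
A kappa-weighted majority vote of this kind can be sketched as follows. The function name `weighted_majority_vote`, the class labels and the weights are hypothetical, and the tie-breaking rule (first label encountered wins) is an assumption, since the text does not specify one.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Combine one prediction per ensemble member into a single label.

    predictions: list of class labels, one per base classifier.
    weights:     list of kappa values, aligned with `predictions`.
    """
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # Ties are broken in favour of the label that was seen first (assumption).
    return max(scores, key=scores.get)


# Two weak members vote "A", one strong member votes "B".
predictions = ["A", "A", "B"]
weights = [0.2, 0.3, 0.7]
print(weighted_majority_vote(predictions, weights))  # prints "B"
```

In this toy case an unweighted majority vote would return "A"; weighting by kappa lets the single, more reliable member override the two weaker ones.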

4 Experimental Results and Discussion


To evaluate the heuristics defined in the previous section, a comparative study
was carried out using two different sources of information (OMIM and KEGG) to
classify 392 samples belonging to four classes and coming from three real data
sets. In addition, the four simple base classifiers used for the ensemble generation
(IB3, NBS, RF, SVM) were also tested individually, using as feature selection both
the genes included in the OMIM gene sets and the genes that are part of the
KEGG gene sets. Classification tests were performed using stratified 10-fold
cross-validation. Tables 2 and 3 summarize the results of the experimentation,
showing only the better-performing classifiers.
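
As a rough illustration of what "stratified" means in this protocol, the sketch below assigns sample indices to k folds so that every class is spread as evenly as possible. It is a simplified, deterministic variant without shuffling, and the function name `stratified_folds` and the toy labels are hypothetical; the tool actually used by the authors is not stated in this section.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Split sample indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        # Deal the samples of each class round-robin over the folds,
        # continuing where the previous class stopped.
        for j, idx in enumerate(by_class[label]):
            folds[(offset + j) % k].append(idx)
        offset += len(by_class[label])
    return folds


# Toy labels: 20 samples of class "A" and 10 of class "B".
labels = ["A"] * 20 + ["B"] * 10
folds = stratified_folds(labels, k=10)
for fold in folds[:2]:
    print([labels[i] for i in fold])
```

Each fold is used once as the test set while the remaining nine are used for training, and the per-fold results are aggregated into the reported accuracy and kappa.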
   Table 2 presents the accuracy and kappa values achieved by each classifier using
KEGG gene sets as prior knowledge. As can be observed, the BCbyT heuristic
generally exhibits good performance regardless of the data set. The
BCw/oT heuristic also showed good performance, although in the Gutiérrez data
set two single classifiers (IB3 and NBS) performed better than the ensembles using
this strategy.
Table 2 Classification results using KEGG gene sets

                Gutiérrez            Bullinger            Valk
Classifier      Accuracy   Kappa     Accuracy   Kappa     Accuracy   Kappa
AC              76.74%     0.588     76.00%     0.292     76.28%     0.503
AG              79.07%     0.634     76.00%     0.299     75.18%     0.533
BCbyT_90%       81.40%     0.724     82.67%     0.373     77.01%     0.540
BCbyT_95%       83.72%     0.724     82.67%     0.329     78.83%     0.574
BCw/oT_60%      79.07%     0.635     80.00%     0.476     77.37%     0.556
BCw/oT_75%      79.07%     0.635     81.33%     0.367     76.64%     0.555
BCw/oT_85%      76.74%     0.612     80.00%     0.403     75.55%     0.546
IB3             83.72%     0.756     69.33%     0.369     67.52%     0.410
NBS             81.40%     0.679     73.33%     0.269     74.09%     0.530
RF              72.09%     0.533     68.00%     0.123     69.71%     0.337
SVM             51.16%     0.000     68.00%     0.000     64.60%     0.000
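
The SVM rows of Table 2 pair a non-trivial accuracy with a kappa of zero. The sketch below computes Cohen's kappa from its standard definition, kappa = (p_o - p_e) / (1 - p_e), and shows on made-up labels that a classifier which always predicts the most frequent class obtains exactly this pattern. Whether that is what happened in the reported experiments is not stated in the text, so the example is only illustrative; the function name `cohen_kappa` and the data are hypothetical.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels, corrected for chance."""
    n = len(y_true)
    # Observed agreement (this is the plain accuracy).
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance from the two label distributions.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is a convention of this sketch
    return (p_o - p_e) / (1.0 - p_e)


# Made-up data: 6 samples of class "A" and 4 of class "B".
y_true = ["A"] * 6 + ["B"] * 4
always_majority = ["A"] * 10

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
print(accuracy)                              # 0.6
print(cohen_kappa(y_true, always_majority))  # 0.0
print(cohen_kappa(y_true, y_true))           # 1.0
```

This is why the selection heuristics rank simple_classifier/gene_set pairs by kappa rather than by accuracy: accuracy alone rewards a classifier for the class imbalance of the data set.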

Table 3 Classification results using OMIM gene sets

                Gutiérrez            Bullinger            Valk
Classifier      Accuracy   Kappa     Accuracy   Kappa     Accuracy   Kappa
AC              76.74%     0.588     76.00%     0.343     74.82%     0.439
AG              76.74%     0.588     76.00%     0.343     76.28%     0.506
BCbyT_90%       81.40%     0.680     82.67%     0.569     75.91%     0.528
BCbyT_95%       86.05%     0.774     82.67%     0.569     75.91%     0.513
BCw/oT_60%      81.40%     0.680     80.00%     0.483     75.91%     0.521
BCw/oT_75%      88.37%     0.809     81.33%     0.526     75.91%     0.530
BCw/oT_85%      79.07%     0.672     80.00%     0.482     76.28%     0.555
IB3             76.74%     0.643     73.33%     0.451     67.52%     0.391
NBS             79.07%     0.634     76.00%     0.370     72.99%     0.510
RF              79.07%     0.658     74.67%     0.372     74.09%     0.420
SVM             51.16%     0.000     68.00%     0.000     77.74%     0.539

   Table 3 presents the same experimentation using the OMIM gene sets.
Once again, the BCbyT heuristic achieved good performance. Compared with the
single classifiers, the ensembles perform even better than in the
previous experimentation (using KEGG gene sets). The BCw/oT heuristic also performs
better with the OMIM gene sets, being slightly superior to the BCbyT heuristic.
Ensembles using this strategy not only performed better than single classifiers, but
also achieved the best kappa value in two of the three analyzed data sets.
   To sum up, we can conclude that the BCbyT heuristic performed as the best base
classifier selection strategy, followed closely by the BCw/oT heuristic. This backs
up the following ideas: (i) no single classifier is able to achieve good
performance on every data set in concert with the supplied knowledge and
(ii) the presence of each classifier type in the final ensemble may improve the
classification performance.
   Regardless of the data set, both the BCw/oT and BCbyT heuristics behave
uniformly, performing better than the single baseline classifiers. This circumstance
supports the idea that ensembles generally perform better than single classifiers, in
this case by taking advantage of prior structured knowledge.

Acknowledgements. This work is supported in part by the project MEDICAL-BENCH:
Platform for the development and integration of knowledge-based data mining techniques
and their application to the clinical domain (TIN2009-14057-C03-02) from Ministerio de
Ciencia e Innovación (Spain). D. Glez-Peña acknowledges Xunta de Galicia (Spain) for the
program Ángeles Álvariño.

References
 1. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P.,
    Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.:
    Molecular classification of cancer: class discovery and class prediction by gene
    expression monitoring. Science 286, 531–537 (1999)
 2. Ressom, H.W., Varghese, R.S., Zhang, Z., Xuan, J., Clarke, R.: Classification
    algorithms for phenotype prediction in genomics and proteomics. Frontiers in
    Bioscience 13, 691–708 (2008)
 3. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms.
    Wiley-Interscience, Hoboken (2004)
 4. Liu, K.H., Li, B., Wu, Q.Q., Zhang, J., Du, J.X., Liu, G.Y.: Microarray data
    classification based on ensemble independent component selection. Computers in
    Biology and Medicine 39(11), 953–960 (2009)
 5. Lottaz, C., Spang, R.: Molecular decomposition of complex clinical phenotypes using
    biologically structured analysis of microarray data. Bioinformatics 21(9), 1971–1978
    (2005)
 6. Cordero, F., Botta, M., Calogero, R.A.: Microarray data analysis and mining
    approaches. Briefings in Functional Genomics and Proteomics 6(4), 265–281 (2007)
 7. Bellazzi, R., Zupan, B.: Methodological Review: Towards knowledge-based gene
    expression data mining. Journal of Biomedical Informatics 40(6), 787–802 (2007)
 8. Glez-Peña, D., Gómez-López, G., Pisano, D.G., Fdez-Riverola, F.: WhichGenes: a
    web-based tool for gathering, building, storing and exporting gene sets with
    application in gene set enrichment analysis. Nucleic Acids Research 37(Web Server
    issue), W329–W334 (2009)
 9. Díaz-Uriarte, R., Alvarez de Andrés, S.: Gene selection and classification of
    microarray data using random forest. BMC Bioinformatics 7, 3 (2006)
10. Peng, Y.: A novel ensemble machine learning for robust microarray data classification.
    Computers in Biology and Medicine 36(6), 553–573 (2006)
11. Liu, K.H., Huang, D.S.: Cancer classification using Rotation Forest. Computers in
    Biology and Medicine 38(5), 601–610 (2008)
12. Liu, K.H., Xu, C.G.: A genetic programming-based approach to the classification of
    multiclass microarray datasets. Bioinformatics 25(3), 331–337 (2009)
13. Opitz, D.: Feature selection for ensembles. In: Proceedings of 16th National
    Conference on Artificial Intelligence, Orlando, Florida (1999)
14. Kuncheva, L.I., Jain, L.C.: Designing classifier fusion systems by genetic algorithms.
    IEEE Transactions on Evolutionary Computation 4(4), 327–336 (2000)
15. Oliveira, L.S., Morita, M., Sabourin, R.: Feature selection for ensembles using the
    multi-objective optimization approach. Studies in Computational Intelligence 16,
    49–74 (2006)
16. Gutiérrez, N.C., López-Pérez, R., Hernández, J.M., Isidro, I., González, B., Delgado,
    M., Fermiñán, E., García, J.L., Vázquez, L., González, M., San Miguel, J.F.: Gene
    expression profile reveals deregulation of genes with relevant functions in the
    different subclasses of acute myeloid leukemia. Leukemia 19(3), 402–409 (2005)
17. Bullinger, L., Döhner, K., Bair, E., Fröhling, S., Schlenk, R.F., Tibshirani, R., Döhner,
    H., Pollack, J.R.: Use of gene-expression profiling to identify prognostic subclasses in
    adult acute myeloid leukemia. The New England Journal of Medicine 350(16),
    1506–1516 (2004)
18. Valk, P.J., Verhaak, R.G., Beijen, M.A., Erpelinck, C.A., Barjesteh van Waalwijk van
    Doorn-Khosrovani, S., Boer, J., Beverloo, H., Moorhouse, M., van der Spek, P.,
    Löwenberg, B., Delwel, R.: Prognostically useful gene-expression profiles in Acute
    Myeloid Leukemia. The New England Journal of Medicine 350(16), 1617–1628
    (2004)
19. Tai, F., Pan, W.: Incorporating prior knowledge of predictors into penalized classifiers
    with multiple penalty terms. Bioinformatics 23(14), 1775–1782 (2007)
20. Wei, Z., Li, H.: Nonparametric pathway-based regression models for analysis of
    genomic data. Biostatistics 8(2), 265–284 (2007)

Predicting the Start of Protein
α-Helices Using Machine Learning
Algorithms

Rui Camacho, Rita Ferreira, Natacha Rosa, Vânia Guimarães,
Nuno A. Fonseca, Vítor Santos Costa, Miguel de Sousa,
and Alexandre Magalhães

1 Introduction
Proteins are complex structures synthesised by living organisms. They are
a fundamental type of molecule and can perform a large number of
functions in cell biology. Proteins can assume catalytic roles and accelerate or
inhibit chemical reactions in our body. They can also take on roles of transportation
of smaller molecules, storage, movement, mechanical support, immunity
and control of cell growth and differentiation [25]. All of these functions rely
on the 3D structure of the protein. The process of going from a linear sequence
of amino acids, which together compose a protein, to the protein’s 3D
shape is named protein folding. Anfinsen’s work [29] showed that the primary
structure determines the way a protein folds. Protein folding is so important
that whenever it does not occur correctly it may produce diseases such as
Alzheimer’s disease, Bovine Spongiform Encephalopathy (BSE), usually known as
mad cow disease, Creutzfeldt-Jakob disease (CJD), Amyotrophic Lateral
Sclerosis (ALS), Huntington’s disease, Parkinson’s disease, and other diseases
related to cancer.
   A major challenge in Molecular Biology is to unveil the process of protein
folding. Several projects have been set up with that purpose. Although protein
function is ultimately determined by the 3D structure, a set of other intermediate
structures has been identified that can help in the formation of the
Rui Camacho · Rita Ferreira · Natacha Rosa · Vânia Guimarães
LIAAD & Faculdade de Engenharia da Universidade do Porto, Portugal
Nuno A. Fonseca · Vítor Santos Costa
CRACS-INESC Porto LA, Portugal
Vítor Santos Costa
DCC-Faculdade de Ciências da Universidade do Porto, Portugal
Miguel de Sousa · Alexandre Magalhães
REQUIMTE/Faculdade de Ciências da Universidade do Porto, Portugal

M.P. Rocha et al. (Eds.): IWPACBB 2010, AISC 74, pp. 33–41, 2010.
springerlink.com © Springer-Verlag Berlin Heidelberg 2010