Java-ML: A Machine Learning Library
Thomas Abeel, Yves Van de Peer, and Yvan Saeys
Abstract
Java-ML is a collection of machine learning and data mining algorithms, which aims to be a readily
usable and easily extensible API for both software developers and research scientists. The inter-
faces for each type of algorithm are kept simple and algorithms strictly follow their respective
interface. Comparing different classifiers or clustering algorithms is therefore straightforward, and
implementing new algorithms is also easy. The implementations of the algorithms are clearly writ-
ten, properly documented and can thus be used as a reference. The library is written in Java and is
available from http://java-ml.sourceforge.net/ under the GNU GPL license.
Keywords: open source, machine learning, data mining, java library, clustering, feature selection,
classification
1. Introduction
Machine learning techniques are increasingly popular in research fields like bio- and chemo-
informatics, text and web mining, as well as many other areas of research and industry. In this
paper we present Java-ML: a cross-platform, open source machine learning library written in Java.
Several well-known data mining libraries already exist, including for example, Weka (Witten
and Frank, 2005) and Yale/RapidMiner (Mierswa et al., 2006). These programs provide a user-
friendly interface and are geared towards interactive use. In contrast to these programs,
Java-ML is oriented towards developers who want to use machine learning in their own programs.
To this end, Java-ML interfaces are restricted to the essentials, and are very easy to understand. As
a result, Java-ML facilitates a broad exploration of different models, is straightforward to integrate
into your own source code, and can be easily extended.
Regarding the content of the library, Java-ML also has a different focus than the other libraries.
Java-ML contains an extensive set of similarity-based techniques and offers state-of-the-art feature
selection techniques. The large number of similarity functions allows for a broad set of clustering
and instance-based learning techniques, while the feature selection techniques are well suited to
deal with high-dimensional domains, such as the ones often encountered in bioinformatics and
biomedical applications.
    Clustering                            Classification
    ------------------------------------  ---------------------------
    K-means-like (7)                      SVM (2)
    Self-organizing maps                  Instance-based learning (4)
    Density-based clustering (3)          Tree-based methods (2)
    Markov chain clustering               Random Forests
    Cobweb                                Bagging
    Cluster evaluation measures (15)

Table 1: Overview of the main algorithms included in Java-ML. The number of algorithms for each
category is shown in parentheses.
Table 1 gives an overview of the algorithms currently included in Java-ML. The library includes a
number of well-known clustering algorithms, together with a large number of distance, similarity,
and correlation measures. Feature selection methods include traditional algorithms like symmetrical
uncertainty, gain ratio, RELIEF, and stepwise addition/removal, as well as a number of more recent
methods (SVM-RFE and random forest attribute evaluation). The recently introduced concept of
ensemble feature selection (Saeys et al., 2008) is also incorporated in the library. We have also
implemented a fast and simple random tree algorithm to cope with high-dimensional, sparse, and
ambiguous data. Finally, we provide bridges to the classification and clustering algorithms of Weka
and libsvm (Fan et al., 2005).
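As an illustration of the feature selection interface, the sketch below scores every attribute of a
data set with gain ratio. This is a minimal sketch: the GainRatio class and its build, score, and
noAttributes methods are assumed from the Java-ML API documentation and may differ between versions.

    // Given a Dataset 'data' (loaded as in the example below), score
    // each attribute with gain ratio and print the scores.
    GainRatio gr = new GainRatio();
    gr.build(data);
    for (int i = 0; i < gr.noAttributes(); i++)
        System.out.println("attribute " + i + ": " + gr.score(i));

Clustering a data set is similarly compact. The listing below sketches the example discussed next;
it assumes the FileHandler.loadDataset signature and the KMeans defaults from the Java-ML API, with
imports from net.sf.javaml.core, net.sf.javaml.clustering and net.sf.javaml.tools.data omitted.

    Dataset data = FileHandler.loadDataset(new File("iris.data"), 4, ",");
    Clusterer km = new KMeans();
    Dataset[] clusters = km.cluster(data);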
The first line uses the FileHandler utility to load data from the iris.data file. In this file, the
class label is in the fourth position and the fields are separated by commas. The second line
constructs a new instance of the KMeans clustering algorithm with default parameter values, in this
case k = 4. The third line uses the KMeans instance to cluster the data we loaded in the first line.
The resulting clusters are returned as an array of data sets.
The following example illustrates how to perform a cross-validation experiment for a specific
data set and classifier.
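In code, such an experiment looks roughly as follows. This is a sketch: the KNearestNeighbors and
CrossValidation classes and the crossValidation method are assumed from the Java-ML API documentation.

    // Load the data, build a 5-nearest-neighbors classifier,
    // and cross-validate it on the loaded data set.
    Dataset data = FileHandler.loadDataset(new File("iris.data"), 4, ",");
    Classifier knn = new KNearestNeighbors(5);
    CrossValidation cv = new CrossValidation(knn);
    Map<Object, PerformanceMeasure> p = cv.crossValidation(data);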
First we load the iris data set, and construct a K-nearest neighbors classifier, which uses 5 neigh-
bors to classify instances. In the next line, we initialize the cross-validation with our classifier. The
last line runs the cross-validation on the loaded data. By default, a 10-fold cross validation will
be performed. The result is returned in a map, which maps each class label to its corresponding
PerformanceMeasure (Map<Object,PerformanceMeasure>). For classification problems, a per-
formance measure is a wrapper around four values: (i) true positives, (ii) true negatives, (iii) false
positives and (iv) false negatives. This class also provides a number of derivative measures such as
accuracy, error rate, precision, recall, and others, as the short example below illustrates. More
advanced code samples are available from the
documentation pages on the Java-ML website.
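The loop below prints the measures for each class label. It is a sketch: the public tp and fp fields
and the getAccuracy() accessor are assumed from the PerformanceMeasure API and may differ.

    // Print per-class performance from the map returned by crossValidation().
    for (Map.Entry<Object, PerformanceMeasure> e : p.entrySet()) {
        PerformanceMeasure pm = e.getValue();
        System.out.println(e.getKey() + ": TP=" + pm.tp + ", FP=" + pm.fp
                + ", accuracy=" + pm.getAccuracy());
    }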
2.3 Documentation
There are a number of resources for documentation about Java-ML. The source code itself is
thoroughly documented, kept up to date, and accessible from the website through the API
documentation. The website additionally provides a number of tutorials with illustrated code samples for
the most common tasks in Java-ML, covering the following topics: installing the library, introduc-
ing basic concepts, creating and loading data, creating algorithms and applying them to your data,
and more advanced topics for people who would like to contribute to the library. Finally, all code
samples as well as the PDF versions of the tutorials are also included in the Java-ML distribution
itself.
3. Case Studies
The library described in this manuscript has been used in several studies. Here we highlight two
recently published applications.
Initially, the project focused on clustering algorithms and measures to evaluate the quality of
a clustering. Our goal was to separate DNA sequences that are likely to contain a promoter (the
controlling element of a gene) from other sequences, a well-known task in bioinformatics. The
best results were obtained using a clustering algorithm based on self-organizing maps (Abeel et al.,
2008).
More recently, the focus has shifted toward feature selection. More specifically, we are investigating
whether ensemble feature selection (combining different feature selectors) can improve the stability
of feature selection on high-dimensional data sets with few samples. The improvements in
stability were shown not to affect the prediction accuracy. This is ongoing research, but the first
results are promising (Saeys et al., 2008).
Acknowledgments
We thank A. De Rijcke for his early contributions to Java-ML, as well as the anonymous reviewers
for their valuable comments. TA is funded by IWT-Vlaanderen. YS would like to thank the Research
Foundation-Flanders (FWO-Vlaanderen) for funding his research.
References
Thomas Abeel, Yvan Saeys, Pierre Rouzé, and Yves Van de Peer. ProSOM: Core promoter predic-
tion based on unsupervised clustering of DNA physical profiles. Bioinformatics, 24(13):i24–i31,
July 2008.
Rong-En Fan, Pai-Hsuen Chen, and Chih-Jen Lin. Working set selection using second order
information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918, 2005.
Ingo Mierswa, Michael Wurst, Ralf Klinkenberg, Martin Scholz, and Timm Euler. Yale: Rapid pro-
totyping for complex data mining tasks. In Proceedings of the 12th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD-06), 2006.
Yvan Saeys, Thomas Abeel, and Yves Van de Peer. Robust feature selection using ensemble feature
selection techniques. In Proceedings of ECML PKDD 2008, 2008.
Ian H. Witten and Eibe Frank. Data Mining: Practical machine learning tools and techniques.
Morgan Kaufmann, San Francisco, 2nd edition, 2005.