Orange: From Experimental Machine Learning To Interactive Data Mining
Orange is a library of C++ core objects and routines that includes a large variety of standard and not-so-standard machine learning and data mining algorithms, plus routines for data input and manipulation. Orange is also a scriptable environment for fast prototyping of new algorithms and testing schemes. It is a collection of Python-based modules that sit on top of the core library and implement functionality for which execution time is not crucial and which is easier done in Python than in C++, such as pretty-printing of decision trees, attribute subset selection, bagging and boosting, and the like. Orange is also a set of graphical widgets that use methods from the core library and Orange modules and provide a friendly user interface. Widgets support signal-based communication and can be assembled into an application by a visual programming tool called Orange Canvas.
Orange core objects and Python modules support various data mining tasks that span from data preprocessing to modeling and evaluation. Among others are techniques for:

- Data input, providing support for various popular data formats,
- Data manipulation and preprocessing, like sampling, filtering, scaling, discretization, construction of new attributes, and the like,
- Methods for development of classification models, including classification trees, the naïve Bayesian classifier, instance-based approaches, logistic regression, and support vector machines,
- Regression methods, including linear regression, regression trees, and instance-based approaches,
- Various wrappers, like those for calibration of probability predictions of classification models,
- Ensemble approaches, like boosting and bagging,
- Various state-of-the-art constructive induction methods, including function decomposition,
- Association rules and data clustering methods,
- Evaluation methods, with different hold-out schemes and a range of scoring methods for prediction models, including classification accuracy, AUC, Brier score, and the like; various hypothesis testing approaches are also supported,
- Methods to export predictive models to PMML.
The guiding principle in Orange is not to cover just about every method and aspect of machine learning and data mining (although through years of development quite a few have been built up), but to cover those it does implement deeply and thoroughly, building them from reusable components that expert users can change or replace with newly prototyped ones. For instance, Orange's top-down induction of decision trees is a method built of various components, any of which can be prototyped in Python and used in place of the original one. Orange widgets are not just graphical objects that provide a graphical interface for a particular method in Orange; they also include a versatile signaling mechanism for communication and exchange of objects like data sets, learners, classification models, and objects that store the results of evaluation. All these concepts are important, and together they distinguish Orange from other data mining frameworks.
Orange Scripting
You can access Orange objects, write your own components, and design your test schemes and machine learning applications through scripting. Orange interfaces to Python, a modern easy-to-use scripting language with clear yet powerful syntax and an extensive set of additional libraries. Just like any scripting language, Python can be used to test ideas interactively, on the fly, or to develop more elaborate scripts and programs.

To give you a taste of how easy it is to use Python and Orange, here is a set of examples. We start with a simple script that reads a data set and prints the number of attributes used and instances defined. We will use a classification data set called “voting” from the UCI Machine Learning Repository, which records sixteen key votes of each member of the U.S. House of Representatives and labels each instance (congressman) with a party membership:
import orange
data = orange.ExampleTable('voting.tab')
print 'Instances:', len(data)
print 'Attributes:', len(data.domain.attributes)
Notice that the script first loads the Orange library, reads the data file, and prints out what we were interested in. If we store this script in script.py and run it with the shell command “python script.py” – making sure that the data file is in the same directory – we get:
Instances: 435
Attributes: 16
Let us continue with our script (that is, use the same data), build a naïve
Bayesian classifier and print the classification of the first five instances:
model = orange.BayesLearner(data)
for i in range(5):
    print model(data[i])
This is simple! To induce the classification model, we have just called Orange's object BayesLearner and gave it the data set; it returned another object (a naïve Bayesian classifier) that, when given an instance, returns the label of the most probable class. Here is the output of this part of the script:
republican
republican
republican
democrat
democrat
To find out what the right classifications were, we can print the original labels
of our five instances:
for i in range(5):
    print model(data[i]), 'originally', data[i].getclass()
What we find out is that the naïve Bayesian classifier has misclassified the third instance.
All classifiers implemented in Orange are probabilistic, i.e., they estimate the class probabilities. So does the naïve Bayesian classifier, and we may be interested in by how much we missed in the third case:
p = model(data[2], orange.GetProbabilities)
print data.domain.classVar.values[0], ':', p[0]
Notice that Python's indices start with 0 and that the classification model returns a probability vector when called with the argument orange.GetProbabilities. Well, our model was unjustly overconfident here, estimating a very high probability for a republican:
republican : 0.995421469212
Now, we could go on like this, but we won't. For more illustrative examples, check the three somewhat more complex scripts in the sidebars. There are many more examples available in Orange's distribution and on Orange's web pages, described in the accompanying tutorials and documentation.
Here is a simple script that uses 10-fold cross-validation to test a naïve Bayesian classifier and a k-nearest neighbors algorithm on the voting data set:
import orange, orngTest, orngStat

data = orange.ExampleTable('voting.tab')
learners = [orange.BayesLearner(), orange.kNNLearner(k=10)]
results = orngTest.crossValidation(learners, data, folds=10)
cdt = orngStat.computeCDT(results)  # class distributions for AUC scoring
for i in range(len(learners)):
    print "%5.3f %5.3f %5.3f %5.3f" % (orngStat.CA(results)[i],
        orngStat.IS(results)[i], orngStat.BrierScore(results)[i],
        orngStat.AROCFromCDT(cdt[i])[7])
Scores reported in this script are classification accuracy, information score, Brier score, and area under ROC; the script prints these four scores for each of the two learners.
The following script tests how the parameter that defines the minimum number of examples in the internal nodes of a classification tree influences the size of the tree and the accuracy on the test set.
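The sidebar script itself is not reproduced here. A minimal sketch of such an experiment follows; the orange.MakeRandomIndices2 sampler and the orange.TreeStopCriteria_common stop component with its minExamples attribute are assumptions about the old Orange API, and for simplicity only tree size and accuracy are measured:

import orange

data = orange.ExampleTable('voting.tab')
# 70:30 split into train and test set (MakeRandomIndices2 is assumed here)
indices = orange.MakeRandomIndices2(data, p0=0.7)
train, test = data.select(indices, 0), data.select(indices, 1)

for minEx in [0, 1, 5, 10, 100]:
    learner = orange.TreeLearner()
    # assumed stop-criterion component exposing a minExamples attribute
    learner.stop = orange.TreeStopCriteria_common()
    learner.stop.minExamples = minEx
    tree = learner(train)
    # measure classification accuracy on the test set by hand
    correct = sum(tree(ex) == ex.getclass() for ex in test)
    print minEx, tree.treesize(), float(correct) / len(test)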
For testing, the script splits the voting data set into a train (70%) and a test set (30% of all instances). To report on the sizes of the resulting classification trees, the evaluation method has to store all the classifiers induced. The output of the script is:
Ex Size CA IS
0 615 0.802 0.561
1 465 0.840 0.638
5 151 0.931 0.800
10 85 0.939 0.826
100 25 0.954 0.796
So as not to be led astray by overly abstract descriptions, here is a simple example. We'll take Orange's algorithm for induction of decision trees, which is itself assembled from components like those for attribute ranking, condition-based data splitting, and a component that implements the evaluation for a stopping criterion. The induction procedure for classification trees uses some heuristic to pick the best attribute on which to split the data set, so what if instead we simply choose the attribute at random? Here is a script that designs the new learner by replacing the split component of a standard classification tree learner with a newly constructed one that randomly selects the attribute. To see if that makes a difference, we build a standard classification tree and a tree with a random choice of attributes in the nodes, and measure their sizes (number of tree nodes):
import orange

data = orange.ExampleTable('voting.tab')
treeLearner = orange.TreeLearner()
rndLearner = orange.TreeLearner()
rndLearner.split = randomChoice  # the two-line function described below
tree = treeLearner(data)
rndtree = rndLearner(data)
print tree.treesize(), 'vs.', rndtree.treesize()
A function randomChoice does the whole trick: in its first line it randomly selects an attribute from the list, and in its second it returns what a split component for decision trees needs to return. The rest of the script is trivial, and if you run it, you will find out that the random tree is substantially bigger (as was expected).
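The sidebar defining randomChoice is likewise not reproduced. As a purely hypothetical reconstruction: assuming the split callback receives the example table as its first argument and must return the tuple (branch selector, branch descriptions, subset sizes, split quality, spent attribute) that Orange's split constructors produce, such a function might read:

import random

def randomChoice(instances, *rest):
    # assumption: the example table arrives first; remaining callback
    # arguments (weights, contingencies, candidates, ...) are ignored
    attr = random.choice(list(instances.domain.attributes))
    pos = instances.domain.index(attr)
    # assumed 5-tuple: branch selector classifying examples by attr's value,
    # branch descriptions, subset sizes, split quality, spent attribute
    return (orange.ClassifierFromVarFD(position=pos,
                domain=instances.domain, classVar=attr),
            attr.values, None, 1.0, pos)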
Here is a script that shows why we really like Python. We want to count the number of times each attribute appears in a node of the classification tree. For this we need a dictionary that stores the frequencies of the attributes (initialized to 0). We also need a function that recursively traverses the tree and, for each internal node, adds 1 to the corresponding attribute's count in the dictionary. Once you get used to it, programming with dictionaries and lists in Python is really fun.
import orange

data = orange.ExampleTable("voting")
classifier = orange.TreeLearner(data)
# dictionary of attribute frequencies, initialized to 0
freq = dict([(attr, 0) for attr in data.domain.attributes])

# tree traversal: count the attribute tested in each internal node
def count(node):
    if node.branches:
        freq[node.branchSelector.classVar] += 1
        for branch in node.branches:
            if branch:  # make sure not a null leaf
                count(branch)

count(classifier.tree)
for attr in data.domain.attributes[:3]:
    print '%i x %s' % (freq[attr], attr.name)
This script reports on the frequencies of the first three attributes in the data do-
main:
14 x handicapped-infants
16 x water-project-cost-sharing
4 x adoption-of-the-budget-resolution
The following script builds a list of association rules from the imports-85 data set (attribute-based descriptions of cars imported to the US in 1985). We discretize the continuously-valued attributes and use only the first ten attributes in the analysis.
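The sidebar script is not reproduced here; a minimal sketch in the spirit of Orange's association-rules tutorial (orngAssoc, orange.Preprocessor_discretize, orange.EquiNDiscretization, and orange.AssociationRulesInducer are assumed from the old Orange API) could read:

import orange, orngAssoc

data = orange.ExampleTable('imports-85')
# discretize continuous attributes and keep only the first ten
data = orange.Preprocessor_discretize(data,
    method=orange.EquiNDiscretization(numberOfIntervals=3))
data = data.select(range(10))

rules = orange.AssociationRulesInducer(data, support=0.4)
print '%i rules found' % len(rules)
orngAssoc.printRules(rules[:5], ['support', 'confidence'])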
The script reports on the number of rules and prints out the first five rules together with information on their support and confidence.
We can now count how many of the 87 rules include the attribute fuel-type in their condition:

att = "fuel-type"
# '~' marks a don't-care value, i.e. the attribute is absent from the condition
subset = filter(lambda x: x.left[att] != "~", rules)
print "%i rules with %s in conditional part" % (len(subset), att)
Programming with other data models and objects in Orange is as easy as working with classification trees and association rules. The guiding principle in designing Orange was to make most of the data structures used in the C++ routines available to scripts in Python.
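For instance, the domain, its class variable, and individual examples used throughout the scripts above are all such exposed structures. A small sketch of inspecting them (a variable's name attribute is assumed; everything else appears in the scripts above):

import orange

data = orange.ExampleTable('voting.tab')
classVar = data.domain.classVar
print classVar.name, '->', list(classVar.values)  # class name and its values
print data[0]             # an example prints as a list of attribute values
print data[0].getclass()  # and knows its class label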
Orange Widgets
Widgets communicate by tokens that are passed from the sender to the receiver widget. For example, a file widget outputs a data object, which can be received by a classification tree learner widget, which builds a classification model that can then be sent to a widget that graphically shows the tree. Or, an evaluation widget may receive a data set from the file widget and objects that learn classification models (say, from logistic regression and naïve Bayesian learner widgets). It can then cross-validate the learners, presenting the results in a table while at the same time passing the object that stores the results to a widget for interactive visualization of ROC graphs.
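To give a flavor of this token passing, here is a hypothetical skeleton of a widget; the OWWidget base class and the inputs/outputs/send conventions are assumptions about how Orange widgets were written at the time, not code taken from this paper:

from OWWidget import *  # old Orange widget base library (assumed)
import orange

class OWSimpleLearner(OWWidget):
    """Receives a data token and sends out a naive Bayesian classifier."""
    def __init__(self, parent=None):
        # exact constructor arguments of OWWidget are an assumption
        OWWidget.__init__(self, parent, title='Simple Learner')
        # declare the tokens this widget can receive and emit
        self.inputs = [('Examples', orange.ExampleTable, self.dataReceived)]
        self.outputs = [('Classifier', orange.Classifier)]

    def dataReceived(self, data):
        if data is not None:
            self.send('Classifier', orange.BayesLearner(data))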
[Figure: Orange widgets for classification tree visualization (top), classification tree learner (middle), and sieve diagram (bottom).]
Applications that include widgets are therefore data-flow schemes, where widgets process the information and provide the user interface. One can script such applications by hand, or use Orange Canvas, our visual programming environment, to interactively design the scheme. Like any visual programming environment, Orange Canvas is simple and fun to use.

Orange widgets and Orange Canvas are all written in pure Python, using the Qt graphical user interface library. This allows Orange to run on various platforms, including MS Windows and Linux.
[Figure: Orange Canvas, with a schema that compares two different learners (a classification tree learner and a naïve Bayesian classifier) on a selected data set. Evaluation results are also studied through calibration and ROC plots.]
Also, your applications may use additional Orange modules made available by other researchers, so other citations may be in place as well.

Orange, AI Lab, Faculty of Computer and Information Science, University of Ljubljana, Trzaska 25, SI-1000 Ljubljana, Slovenia.
Acknowledgements