
Applied Soft Computing 11 (2011) 733–743

Contents lists available at ScienceDirect

Applied Soft Computing

journal homepage: www.elsevier.com/locate/asoc

Efficient Distributed Genetic Algorithm for Rule extraction

Miguel Rodríguez, Diego M. Escalante, Antonio Peregrín
Dept. of Information Technologies, University of Huelva, 21819 Huelva, Spain

Article info

Article history:
Received 15 March 2009
Received in revised form 18 December 2009
Accepted 29 December 2009
Available online 13 January 2010

Keywords:
Classification rules
Rule induction
Distributed computing
Coarse-grained implementation
Parallel genetic algorithms

Abstract

This paper presents an Efficient Distributed Genetic Algorithm for classification Rule extraction in data mining (EDGAR), which promotes a new method of data distribution in computer networks. This is done by spatial partitioning of the population into several semi-isolated nodes, each evolving in parallel and possibly exploring different regions of the search space. The presented algorithm shows some advantages when compared with other distributed algorithms proposed in the specific literature. In this way, some results are presented showing significant learning rate speedup without compromising the accuracy.
© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Nowadays the size of datasets is growing quickly due to the widespread use of automatic processes in commercial and public domains and the lower cost of massive storage. Mining large datasets to obtain classification models with good prediction accuracy can be a very difficult task because the size of the dataset can make data mining algorithms ineffective and inefficient.
There are three main approaches to tackling the scaling problem:
Use as much a priori knowledge as possible to search in subspaces small enough to be explored.
Perform data reduction.
Algorithm scalability.
The third approach, algorithm scalability, promotes the use of computation capacity in order to handle the full dataset. The use of computer grids to achieve a greater amount of computational resources has become more popular over the past few years because they are much more cost-effective than single computers of comparable speed. The main challenge when using distributed computing is the need for new algorithms that take the architecture into account. Genetic algorithms are especially well suited for this task because of their implicit parallelism. As a typical population algorithm, there is a direct way of distributing the algorithm by the use of several smaller populations that interchange individuals occasionally. This kind of distributed GA achieves a significant speedup when used on a network of computers and prevents an early convergence by keeping diversity in several populations.

* Corresponding author. Tel.: +34 959217372.
E-mail addresses: [email protected] (M. Rodríguez), [email protected] (D.M. Escalante), [email protected] (A. Peregrín).
1568-4946/$ – see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.asoc.2009.12.035
There have been several efforts to make use of models based on distributed genetic algorithms (GA) in data mining, emphasising aspects like scalability and efficiency. REGAL [10] and NOW G-Net [1] are well known references of this approach. Both of them increase the computational resources via the use of data distribution on a network of loosely coupled workstations.
In this paper we present an Efficient Distributed Genetic Algorithm for classification Rule extraction (EDGAR) with dynamic data partitioning that shows advantages in scalability for exploring high complexity search spaces with comparable classification accuracy.
The outline of the contribution is as follows: in Section 2 we review the distributed genetic models in rule induction. Section 3 is devoted to analysing the proposed algorithm and the strategies followed to keep the algorithm scalable. The experimental study developed is shown in Section 4 and finally, we reach some conclusions in Section 5. Appendix A contains a detailed table of the results obtained in our study.

2. Machine learning and parallel genetic algorithms

This section reviews the main streams found in the literature about parallel genetic algorithms in rule induction. The first subsection describes the approaches commonly used in the area to achieve machine learning. The second subsection focuses on the main strategies for parallelising genetic algorithms in data mining related tasks.
2.1. Genetic algorithms in data mining

Genetic algorithms are search algorithms based on natural genetics that provide robust search capabilities in complex spaces, and thereby offer a valid approach to problems requiring an effective search process [15].
GAs have achieved a reputation for robustness in rule induction, even with the problems commonly associated with real world mining (noise, outliers, incomplete data, etc.). GAs can be applied to machine learning tasks [3]. For example, the search space can be seen as the set of all possible hypothesis rule bases that cover the data, and the goodness can be formulated as a coverage function over a number of learning examples.
A key point in GA implementation is the selected representation; the proposals in the specialist literature follow two approaches in order to encode rules within a population of individuals:
The Chromosome = Set of rules approach, also called the Pittsburgh approach, in which each individual represents a rule set [23]. Each chromosome evolves a complete rule set, and the chromosomes compete among themselves throughout the evolutionary process. GABIL [6] and GA-MINER [9] are proposals that follow this approach.
The Chromosome = Rule approach, in which each individual codifies a single rule, and the whole rule set is provided by combining several individuals in a population (rule cooperation) or via different evolutionary runs (rule competition).
In turn, within the Chromosome = Rule approach, there are three generic proposals:
The Michigan approach, in which each individual encodes a single rule. These kinds of systems, usually called learning classifier systems [14], are rule-based, message-passing systems that employ reinforcement learning and a GA to learn rules that guide their performance in a given environment. The GA is used to detect new rules that replace the bad ones via competition between the chromosomes in the evolutionary process.
The IRL (Iterative Rule Learning) approach uses several GA runs to obtain the rule set. Chromosomes compete in every GA run, and the best rule per run is chosen. The global solution is formed by the best rules obtained when the algorithm is run multiple times. SIA [26] is a proposal that follows this approach.
The GCCL (genetic cooperative–competitive learning) approach encodes the rule set as the complete population or a subset of it. The chromosomes compete and cooperate simultaneously. This strategy requires the conservation of species in the same population to avoid a final solution consisting of clones of the best individual. COGIN [13], REGAL [10] and NOW G-Net [1] are examples using this representation.

2.2. Parallel genetic algorithms in data mining

A definition of scalable in computing could be: able to support the required quality of service as the system load increases. Applied to data mining, and more precisely to supervised classification, the system load is given by the complexity and the size of the dataset (in attributes or training examples), and the quality of service is relative to the processing time needed to produce a similar classifier, mainly in terms of accuracy and interpretability.
As the complexity of the dataset increases, GAs exhibit high computational cost and degradation of the quality of the solutions. Efforts towards solving these shortcomings have been made in several directions, and parallel GAs are one of the most significant. We stress four approaches that represent the main parallelisation strategies:
Global parallelisation [8]. Only the evaluation of the individuals' fitness values is parallelised, by assigning a fraction of the population to each processor to be evaluated. This is an equivalent algorithm that will produce the same results as the sequential one.
Coarse-grained [8] and fine-grained parallelisation [20]. In the former, the entire population is partitioned into subpopulations (demes). A GA is run on each subpopulation, and exchange of information between demes (migration) takes place occasionally [8], in analogy with the natural evolution of spatially distributed populations such as the island model (Fig. 1). The fine-grained approach has just one individual per processor and rules to perform crossover in the closest neighbourhood defined by a topology.
Supervised data distribution [11]. A master process uses a group of processors (slaves) by sending them partial tasks and a smaller data partition. Each node runs a complete GA or a part of it. The master process uses the partial results to reassign data and tasks to the processors [10,1] until some condition is met.
Not supervised data distribution [18]. The full dataset is shared out among several processors and moved to the next processor in the topology after a pre-specified number of generations without removing the existing population. The individuals will try to cover the newly arrived training data.
The proposal presented in this work follows a GCCL approach and, as a parallelisation strategy, uses a coarse-grained implementation and a master process to build up the final classifier on the basis of partial results.

Fig. 1. Island model.

3. Genetic Learning Proposal: EDGAR algorithm

This section describes the characteristics of an Efficient Distributed Genetic Algorithm for classification Rule extraction, from now on designated EDGAR. The proposed algorithm distributes the population (rules in a GCCL approach) and the training data in a coarse-grained model to achieve scalability.
We start by explaining the distributed model in Section 3.1. Sections 3.2–3.5 describe the components of the genetic algorithm: representation, genetic operators, genetic search and data reduction. Finally, Section 3.6 is devoted to the strategy used to determine the best set of rules that will make up the classifier from the redundant population of rules generated by the GCCL algorithm.

3.1. Distributed model

This subsection explains the main properties of the distributed framework. First, in Section 3.1.1 we describe the use of the coarse-grained implementation with data partition.


3.1.1. Parallelisation strategy

The coarse-grained parallelisation model [8] is a simple way to improve the search capability of a GA. Nevertheless, it has some disadvantages when dealing with large datasets because each node deals with a complete copy of the dataset, making it less efficient.
To achieve a better speedup with this model, we propose to assign a different partition of the learning data to each node. The GA in each node will try to cover the local data, proposing a concept description, but the main characteristics of the coarse-grained model remain. On one hand, the migration of the best individuals between subpopulations reinforces those individuals (rules) that perform properly in more than one node and, on the other hand, we propose a ring topology that prevents an early convergence (see Fig. 3).

3.1.2. Data Learning Flow technique

The data partition in a coarse-grained implementation is able to produce classifiers of similar quality to a single GA [20] if the data does not have small disjuncts [12]. In those cases, the few data examples representing a concept description (one rule) may be split into several nodes in the initial partition, preventing the local GA from inducing the rule.
Fig. 2 [27] shows the relation between error rates and training set size. This result suggests that the presence of small disjuncts in a small training dataset decreases the accuracy, and that training set size should be taken into account.

Fig. 2. Effect of training size on disjunct error rate.

In our proposal the training data assigned to a local GA will be a small percentage of the original dataset. For instance, a dataset using a 50% partition for training and testing and 10 nodes as the configuration for data distribution will handle in each node just 5% of the training data compared with the full dataset.
We propose a novel technique called Data Learning Flow (DLF) to join the examples of small disjuncts together. It is based on the idea that a training example not properly covered in one node may be better covered in another data partition. Training examples covered by low fitness rules are copied to the next node in the neighbourhood. The maximum migration rate of learning examples was set to 1% of the local dataset to keep the size of the local dataset small enough.

3.1.3. Elite pool

The standard coarse-grained model will converge with the same population in all nodes, but our algorithm may produce a different classifier in each node due to the data partition. Moreover, a rule with a high accuracy in a local data partition may have a poor coverage when applied against the entire dataset. For this reason, the local node cannot decide whether it has a good set of rules. We propose an elite pool that holds the full dataset, to:
Validate whether a rule is good enough to be kept.
Stop the algorithm when a stall criterion is reached.
Extract the final solution.
The process is as follows: the nodes send their best rules to the elite pool. Then, the pool selects the rules with better coverage (see Section 3.7) and fitness. The process ends when the classification ratio in the pool does not improve after a number of iterations of this process (Central Stall Parameter, CSP).
The number of rules accumulated can become a bottleneck. In order to keep it as small as possible, we apply the following policies:
Only the rules in the last proposed classifier are kept in the pool.
Only newly discovered rules are included; the nodes keep track of the sent rules and the pool checks against the received rules, preventing them from being evaluated again. This check runs in a reasonable time frame by using the native Java hash table implementation based on the chromosome string representation to detect the duplicated rules.

Fig. 3. Distributed model.

3.2. Genetic algorithm

EDGAR uses a GA in each node in a ring communications topology with the neighbourhood as in the aforementioned island model, and some training data for examples poorly covered by the local classifier.
Each node will work on a partition of the full dataset generated by random samples. The initial population is created by constructing rules that cover some of the examples in the local dataset (seeding). The Universal Suffrage (US) operator selects a set of individuals (g) for crossover and mutation in each generation. Each offspring will replace a randomly selected individual in the current population.
After a number of generations (Local Number of Generations, LNG), some operations are performed (see Fig. 4):
Use a greedy algorithm (see Section 3.7) to extract the set of rules that better classifies the local data and copy them to the next node in the ring and to the pool.
Randomly replace selected individuals in the current population with the individuals received from the previous node in the ring.
Copy the learning examples not covered, or covered only by low fitness rules, to the next node in the ring.
Perform training data set reduction if the best individual does not change (see Section 3.6).
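The per-node cycle above can be sketched as follows. This is a minimal, assumption-laden skeleton, not the authors' implementation: rules are opaque tokens, the local evolution step is a stand-in for US selection plus crossover/mutation, and the names (Node, cycle, inbox_rules) are invented for illustration.

```python
# Hypothetical sketch of one EDGAR node's cycle: every LNG generations the
# node ships its best rules to the ring neighbour and the elite pool,
# absorbs incoming rules, and forwards a few examples (the DLF technique).
import random

LNG = 20          # Local Number of Generations between communications
DLF_RATE = 0.01   # at most 1% of the local dataset migrates per cycle

class Node:
    def __init__(self, data, pop_size=40):
        self.data = list(data)
        self.population = [self.seed_rule() for _ in range(pop_size)]
        self.inbox_rules, self.inbox_data = [], []

    def seed_rule(self):
        # Stand-in for seeding a rule from a local example.
        return random.random()

    def evolve(self, generations=LNG):
        # Stand-in for US selection + crossover/mutation: keep the best,
        # refresh the worst individual each generation.
        for _ in range(generations):
            self.population.sort(reverse=True)
            self.population[-1] = self.seed_rule()

    def cycle(self, next_node, pool):
        self.evolve()
        best = self.population[:5]            # greedy local classifier
        next_node.inbox_rules.extend(best)    # ring migration
        pool.extend(best)                     # proposal to the elite pool
        for rule in self.inbox_rules:         # absorb neighbour's rules
            self.population[random.randrange(len(self.population))] = rule
        self.inbox_rules.clear()
        # DLF placeholder: the first examples stand in for poorly covered ones.
        n_flow = max(1, int(DLF_RATE * len(self.data)))
        next_node.inbox_data.extend(self.data[:n_flow])

random.seed(0)
nodes = [Node(range(100)) for _ in range(4)]   # ring of 4 nodes
pool = []
for i, node in enumerate(nodes):
    node.cycle(nodes[(i + 1) % len(nodes)], pool)
print(len(pool))   # 4 nodes x 5 proposed rules each
```

The ring wiring (node i sends to node (i + 1) mod n) is the only synchronisation-free topology assumption the sketch makes; everything else is a placeholder for the operators described in Sections 3.3–3.6.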


Fig. 4. Genetic algorithm.

The node will stop searching when there is no data left in the local dataset or when the pool node orders it to stop.

3.3. Representation

EDGAR uses a fixed length bit string representation to code a disjunctive rule. The use of fixed length chromosomes allows simpler genetic operators.
Each different possible value of an attribute in a rule is represented as a single bit. A bit set to 1 signifies the presence of the value in the rule and a bit set to 0 means the absence of this value in the rule. One advantage of this representation is that mutation on a bit of the genotype only leads to a small change in the phenotype.
A rule is composed of characteristics C = c1, c2, ..., cj, where each one can take only one value in each instance of the data mined, but the rule may have more than one value for a characteristic.
For example, Fig. 5 shows a rule with three antecedents c1(v1, v2, v3), c2(v4, v5), c3(v6, v7, v8) and the consequent class(v9, v10). As seen in this figure, when all the bits corresponding to the same attribute are set to 0, the attribute does not affect the rule.

Fig. 5. Rule representation example.

3.4. Fitness function

The fitness function is based on the following measurements:
Complexity: considered as the number of conditions in a rule. In a bit string representation, the more zeros that are present in the formula, the fewer conditions in the rule.
Accuracy: inversely related to the number of misclassifications, that is, examples covered with a different assigned class.
The fitness function is:

f(r) = (1 + zeros(r)/length(r))^(1 - FP)

where FP is the number of covered examples that are false positives (different consequent from the rule), zeros(r) is the number of zeroes in the bit string representation of the rule r, and length(r) is the chromosome length in bits. This function gives very low values to rules covering even a few false positives. As the rule improves, the length takes on more relevance.

3.5. Genetic operators

Species formation is of great importance in a GCCL algorithm. For this purpose, EDGAR uses the US selection operator first used in [10]. This mechanism creates coverage niches that do not compete with each other (co-evolution) through a voting process: each generation a set of learning examples is selected. The process searches for the set of rules in the population that better covers each example (fitness) and performs a weighted roulette based on the fitness and the number of positive cases. In the event that no rule covers an example, a new rule is created by generalising the learning example.
Crossover and mutation operators are based on the standard bit string representation and are applied to the selected parents based on a given probability. The crossover operator used is the two-point crossover, where the offspring are checked to cover at least one example before being inserted in the population. The mutation operator changes one random bit in the chromosome and behaves in a different manner depending on the fitness of the selected individual. In the early stage of the process, some of the rules cover only a few examples. In this case, the mutation is driven to generalise the rule, increasing the possibility of covering new examples. If the offspring has a higher fitness than its parents, it will be inserted in the population; otherwise the parent will be kept instead.

3.6. Data training set reduction under evolution and covering

US depends on a proper population–dataset ratio for a good coverage. When the training examples representing a concept are fewer than those of other concepts, the rules for those concepts may disappear under the attraction of the rules with more voters. For instance, in a local dataset of 1000 training instances and 10 individuals in the population, US will use 10 randomly chosen examples each time to select the 10 best rules that represent them. The probability of a particular instance being selected will be less than 1%. Once selected, there will be at least one rule representing it, but this rule will disappear within the next 100 generations with a probability of 99%.
EDGAR deletes the examples already learned to focus on the less represented examples. The process is as follows: when the algorithm detects that the proposed rule set does not change a number of consecutive times (Local Stall Parameter, LSP), the best rule and its covered data are removed from the node. Therefore, the rest of the examples will receive more computational effort, making it possible to induce rules on them.
This strategy makes the algorithm less dependent on the ratio between learning examples and population because it guarantees that all the examples will be selected either in the standard phase with the initial dataset, or later, once all the examples covered by the already learned rules have been removed from the local dataset.
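The bit string representation of Section 3.3 and the fitness function of Section 3.4 can be sketched together as follows. This is an illustrative sketch, not the authors' code: the attribute layout, helper names and toy dataset are assumptions.

```python
# Hypothetical sketch: a rule is a flat bit list, one bit per attribute
# value; fitness is f(r) = (1 + zeros(r)/length(r)) ** (1 - FP).

def matches(rule_bits, layout, example):
    """A rule covers an example if, for every attribute with at least one
    bit set, the bit of the example's value is set (all-zero = wildcard)."""
    pos = 0
    for attr, n_values in layout:
        bits = rule_bits[pos:pos + n_values]
        if any(bits) and not bits[example[attr]]:
            return False
        pos += n_values
    return True

def fitness(rule_bits, layout, consequent, examples):
    fp = sum(1 for ex in examples
             if matches(rule_bits, layout, ex) and ex["class"] != consequent)
    zeros = rule_bits.count(0)
    return (1 + zeros / len(rule_bits)) ** (1 - fp)

# Toy layout echoing Fig. 5: c1 has 3 values, c2 has 2, c3 has 3.
layout = [("c1", 3), ("c2", 2), ("c3", 3)]
rule = [1, 0, 1,  0, 0,  1, 0, 0]   # c1 in {v1, v3}, c2 = any, c3 = v6
examples = [
    {"c1": 0, "c2": 1, "c3": 0, "class": 0},  # covered, same class
    {"c1": 2, "c2": 0, "c3": 0, "class": 1},  # covered, false positive
]
print(fitness(rule, layout, 0, examples))  # (1 + 5/8) ** (1 - 1) = 1.0
```

With one false positive the exponent becomes zero and the fitness collapses to 1 regardless of rule length; with none, the same rule scores 1 + 5/8 = 1.625, so shorter (more-zero) rules are preferred exactly as the text describes.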


Table 1
Compared percentage accuracy.

Dataset       C4.5    EDGAR   REGAL
Monk-1        100.0   100.0   100.0
Monk-2        67.0    96.0    95.0
Monk-3        100.0   99.6    100.0
Tic-tac-toe   92.9    98.9    98.7
Credit        86.0    85.0    84.0
Breast        94.1    94.0    94.1
Vote          96.4    97.1    96.2

3.7. Generation of a classifier from a redundant rule set

The population in any node is a redundant set of rules that does not specify how the classification is performed. Our aim is to generate the shortest and fastest classifier that covers the training examples. The proposed classifier is an ordered set of rules in which the first applicable rule returns the assigned class. Nevertheless, as the complete set of rules covering the data may be large, the position of a rule in the classifier is relative to its correctly classified examples (True Positive cases, TP). This organisation has some advantages: on one hand, the first rules express a rough idea of the main concept descriptions and, on the other hand, it gives a better response time for classification in the event of a high number of rules.
The order criterion ensures that length is taken into account when more than one rule competes with a similar coverage and classification ratio, by using accuracy and complexity in an expression derived from the rule fitness:

order(r) = (1 + zeros(r)/length(r))^(1 - FP) · TP

Once a rule is selected, all its positive cases are removed and the remaining candidate rules are ordered again. The process finalises when all examples are covered or there are no more rules in the rule set.

3.8. Architecture scalability

Scalability in parallel implementations is described in the literature [17] as the relation between the number of processors and the execution time. The following paragraphs describe the policy for keeping execution time scalable considering the network speed and synchronisation in distributed processes.
The components of execution time in distributed systems are [16]: the time of each processor (Tpi), the number of communications (c), the average time per communication (Tc), the idle time waiting for synchronisation (Ti) and the probability of idle status for a node (p):

T = Σ(i=1..n) Tpi + c (Tc + Ti p)

If the term c (Tc + Ti p) is lower than the computational time expended by each local genetic algorithm, the algorithm will present a speedup independent of network speed and synchronisation issues. To minimise these terms, the following directives were followed:
Avoid synchronous calls to other processes and process synchronisation. The communication of proposed rules and the flow of learning data is performed through buffers (see Fig. 6), avoiding producer–consumer synchronisation. The idle time waiting for synchronisation (Ti) and the probability of idle status (p) for a node and the pool then become close to zero.

Fig. 6. Communication architecture.

Communication time (c Tc) depends on the size and frequency of communication of the best individuals and training data (DLF technique) to the neighbourhood and to the pool. A system parameter (Local Number of Generations, LNG) allows adjusting the communication frequency (number of generations between communications) to the convergence of the local model and the network speed. If this parameter is too low, the local model will overlearn the assigned dataset. If the parameter is too high, the newly arrived individuals will slow down or even prevent the convergence with the local data [8], and the time expended in communication handling will increase. When this time is less than the time used in sending learning data and rules through the network (Tc), the algorithm execution time will be independent of the network speed.
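The buffered, synchronisation-free communication just described can be sketched with non-blocking queues. This is a hypothetical illustration of the idea behind Fig. 6, not the paper's (Java) implementation; all names are invented.

```python
# Sketch: a sender never blocks on its neighbour; it drops rules and
# examples into the receiver's buffers and returns immediately. The
# receiver drains whatever has arrived at its next communication point.
import queue

class NodeBuffers:
    def __init__(self):
        self.rules = queue.Queue()     # unbounded: put_nowait never blocks
        self.examples = queue.Queue()

def send(neighbour, rules, examples):
    # Producer side: enqueue and return (no producer-consumer handshake).
    for r in rules:
        neighbour.rules.put_nowait(r)
    for e in examples:
        neighbour.examples.put_nowait(e)

def drain_rules(buffers):
    # Consumer side: take only what is already there, without waiting.
    items = []
    while True:
        try:
            items.append(buffers.rules.get_nowait())
        except queue.Empty:
            return items

neighbour = NodeBuffers()
send(neighbour, rules=["rule-a", "rule-b"], examples=[("x", 1)])
print(drain_rules(neighbour))   # ['rule-a', 'rule-b']
```

Because both sides use the non-blocking queue operations, the idle terms Ti and p in the execution-time expression stay close to zero, which is the design goal stated above.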
4. Experimental study

In this section, we describe the experimental study carried out on a variety of datasets, ranging from standard benchmarks used to test the accuracy against standard rule induction algorithms, to specific comparisons on more complex problems. The experimental study was carried out on a cluster of 8 workstations, each with 2 Intel Xeon 3 GHz CPUs. In order to run more than 16 nodes at a time, each node was implemented as a thread in a Java VM and communication time was simulated, adding a time equivalent to 100 Mb/s Ethernet in a configuration of 8 processors. As an example, the communication unit in Ethernet is usually 512 bytes per packet; if a rule is 30 bytes in length, the packet will hold 25 records, which means a delay of 0.0512/25 = 0.00015 s per rule.
Section 4.1 compares EDGAR with standard benchmarks. Section 4.2 analyses the effect of data distribution on accuracy and speedup in a case study. Section 4.3 develops a statistical analysis over a set of commonly used datasets.

4.1. Comparison with standard benchmarks

This section is devoted to testing, in a first run, whether EDGAR is able to obtain similar accuracy to distributed and non-distributed learners on a variety of datasets chosen from the University of California at Irvine repository [25]. The selected problems are well known in the literature, so we will simply describe the main characteristics of each. Monk-1, Monk-2 and Monk-3 are artificial classification problems whose aim is to test specific abilities of learning systems. Tic-tac-toe consists in classifying the states of the homonymous strategy game as winning or losing. Credit, Breast and Vote are prediction problems related to the reliability of applicants for credit cards, the prognosis of breast cancer, and the prediction of the vote given by congressmen on the basis of their previous political choices. All dataset testing was performed with 10-fold cross-validation.
Table 1 reports the results on this first group of problems. The systems used for the comparison are C4.5 and REGAL. C4.5 [19] is a classical propositional learner whose results are used as a baseline; the performance of C4.5 is reported in [2]. REGAL was executed on the same hardware as EDGAR with the parameters shown in Table 2: 16 nodes and a global population of 400 individuals. For a more detailed execution over different node configurations on different datasets, see Section 4.3.

Table 2
Execution parameters.

Parameter                               REGAL      EDGAR
Stopping criteria                       500 gen.   CSP = 5, LNG = 20
Mutation percentage                     0.01%      1%
Crossover percentage                    60%        90%
Selection percentage g                  10%        10%
Communication ratio                     10%        10% max
Training dataset reduction              -          LSP = 5
Training dataset communication ratio    -          1% max

Table 3
Results of Mushroom REGAL.

Nodes   Comm.       Time   Rules   %Test   %Tra.
4       318,092     0.79   16      99.93   99.98
8       558,501     0.68   14      99.95   99.99
16      681,782     0.57   15      99.95   99.99
32      1,085,929   0.54   15      99.94   99.99
64      1,267,774   0.43   16      99.95   99.99

Monk-1 and Monk-3 are easily handled by most of the algorithms, with accuracies of nearly 100%. Monk-2 and Tic-tac-toe are handled with better accuracy by the two distributed algorithms than by C4.5. Credit is a little better in EDGAR than in REGAL, but in this case neither behaves better than C4.5. Finally, EDGAR gets better results than the other learners on Vote. Note that the algorithm parameters were not modified in any run, and the results show averages over 100 runs with regular behaviour, EDGAR being able to find good solutions if the population in the nodes is large enough to cover the assigned data.

Table 4
Results of Mushroom EDGAR.

Nodes   Comm.    Time   Rules   %Test   %Tra.
4       2129     1.20   14      99.92   99.94
8       3907     0.55   13      99.94   99.96
16      8232     0.35   15      99.96   99.99
32      17,60    0.23   16      99.95   99.90
64      20,11    0.18   16      99.93   99.95

Table 5
Results of Nursery REGAL.

Nodes   Comm.        Time   Rules   %Test   %Tra.
4       784,870      1.78   290     98.6    98.9
8       2,319,559    1.97   251     99.0    99.3
16      6,304,130    2.32   250     98.9    99.2
32      18,595,490   2.81   268     98.5    98.7
64      50,664,658   3.08   316     97.9    98.0

4.2. Data distribution, speedup and accuracy

This section analyses the effect of data distribution on the accuracy and speedup. For this study, we ran REGAL [10] and EDGAR using the parameters shown in Table 2. As mentioned previously, REGAL is a distributed genetic algorithm that also uses US.
For these experimental studies, two well known problems were chosen from UCI [25]: Nursery and Mushroom. Mushroom is simple (just two classes) and large enough for testing accuracy. Nursery is also a medium size dataset with six characteristics and five non-balanced classes, three of them representing more than 97% of the dataset.
We evaluated both algorithms using Dietterich's 5x2 cross-validation [4] with 50% training and testing and 5 different seeds. The comparison measures the accuracy of the classifier and the speedup achieved relative to the number of processors. The former is given by the number of rules and the classification ratio. The speedup is measured through the total execution time with the same parameters in each execution.
The comparison was carried out with 1600 individuals as the sum of the whole local node populations. This population proved to be large enough for both algorithms to obtain the maximum accuracy and the lowest number of rules for these datasets.
The experiments were executed with configurations from 4 to 64 nodes to study the impact of the distribution on the referenced variables. All parameters in the REGAL execution are from the original paper [10] except for the stopping criteria, calculated experimentally using the same conditions as with the population. The differences in parameter configuration between the two algorithms are due to the different learning approach: EDGAR needs a higher mutation ratio, as most of the mutated offspring are not used because they are worse than their parents.
Tables 3–6 show averages over 150 executions (5 for each partition, 6 different seeds and 5 node configurations). The first column is the number of nodes. The second is the number of communications from a node (either one rule or one learning example). The third column shows the execution time in minutes. The fourth column is the number of rules, and the fifth and sixth columns show the classification results for the test and training datasets.
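The 5x2 evaluation protocol used above can be sketched as follows: five random 50/50 splits, each used twice with the roles of the two halves swapped, giving ten train/test pairs per configuration. The learner below is a deliberately trivial placeholder (majority-class prediction); only the splitting scheme reflects the protocol.

```python
# Sketch of Dietterich's 5x2 cross-validation (an illustration with
# invented names, not the paper's harness).
import random

def five_by_two(dataset, train_and_score, seeds=(0, 1, 2, 3, 4)):
    scores = []
    for seed in seeds:
        rng = random.Random(seed)
        shuffled = dataset[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        a, b = shuffled[:half], shuffled[half:]
        scores.append(train_and_score(train=a, test=b))
        scores.append(train_and_score(train=b, test=a))  # swap the halves
    return sum(scores) / len(scores)

# Placeholder "learner": accuracy of always predicting the majority class
# of the training half.
def majority_accuracy(train, test):
    classes = [c for _, c in train]
    majority = max(set(classes), key=classes.count)
    return sum(1 for _, c in test if c == majority) / len(test)

data = [(i, i % 3 == 0) for i in range(90)]   # 1/3 positive, 2/3 negative
print(round(five_by_two(data, majority_accuracy), 2))
```

Each seed contributes two complementary scores, so the averaged result is less sensitive to a single lucky split, which is why the paper reports means over several seeds and partitions.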

Fig. 7. Compared time of Nursery execution.

Analysing Tables 2–4, we can point out some conclusions for each measured property. Regarding the execution time (see Fig. 7), we observe a considerable speedup and a better behaviour than the compared algorithm when the number of processors increases.
Nevertheless, the configuration with four nodes gives better time results in REGAL than in EDGAR. This effect is a consequence of the recalculation of the central pool every few iterations. On the other hand, the experimentation shows that there is an equilibrium point, depending on each dataset, regarding the number of possible partitions. Having a large number of partitions may increase the processing time: the local learning examples are not properly covered, because small disjuncts are split into more than one partition or because a local dataset does not have enough learning examples. DLF will reorganise the learning data in order to have a local dataset large enough to generalise better rules.
Classification accuracy is similar in both algorithms and does not follow any tendency in terms of the number of processors.
Table 6
Results of Nursery EDGAR.

Nodes   Comm.    Time   Rules   %Test   %Tra.
4        6818    2.89     173    99.4    99.7
8        3356    1.59     209    98.5    98.8
16       4836    1.20     206    98.9    99.2
32       8309    1.11     231    98.3    98.8
64     77,859    1.21     199    98.5    98.6

M. Rodríguez et al. / Applied Soft Computing 11 (2011) 733–743


Table 7
EDGAR, covered positive and negative cases for the Mushroom execution.

Rule order   Positive cases   Negative cases
1                      1985                0
2                      1943                0
3                      1786                0
4                      1532                0
5                      1098                0
6                        11                0
7                         9                0
8                         5                0
9                         5                0
10                        2                0
11                       25               22
12                     1855             1833
13                     1909             1513

Table 8
Dataset characteristics.

Dataset        Instances   Features   Classes
Car                 1727          6         4
Cleveland            297         13         5
Credit              1000         20         2
Ecoli                336          7         2
Glass                214          9         7
Haberman             305          3         2
House Votes          432         16         2
Hypothyroid         1920         29         4
Iris                 150          4         3
Krvskp              3198         37         2
Monks                432          6         2
Mushrooms           8124         22         2
New-Thyroid          215          5         3
Nursery           12,960          6         2
Pima                 768          8         2
Segment             2308         19         7
Soybean              307         35        19
Splice              3190         60         3
Tic-tac-toe          958          9         2
Vehicle              846         18         4
Waveform            5000         41         3
Wine                 178         13         3
Wisconsin            683          9         2
Vote                 435         16         2
Thyroid             7200         21         3
Zoo                  100         16         7

is able to find a good solution if the ratio between population and data is enough to generalise a rule. As the ratio of individuals to data remains the same in all experiments (1600 individuals as the sum of all local populations) across the different node configurations, the accuracy does not decrease with the number of nodes.
Regarding complexity, the greedy algorithm is able to generate between 60% and 80% fewer rules than REGAL on Nursery. Mushroom does not show this behaviour, having a similar number of rules. Table 7 shows the covered cases generated by EDGAR for Mushroom. The rules in the classifier are ordered by number of positive cases × fitness. This order guarantees that the rules with fewer false positives will be applied first. This classifier achieves an accuracy of 99% with only the first five rules, which represents the average behaviour of the classifiers generated by EDGAR. We stress the meaning of the rules with fewer than 11 examples: all of them represent small disjuncts in the problem. For instance, in a dataset of 4062 instances, these rules represent less than 0.2% of the learning examples, and the probability of having these 11 examples in one partition in a 64-node distribution is lower than 0.01%. Thanks to DLF, the algorithm brings together the examples belonging to small disjuncts in some of the nodes; otherwise it would generate one rule per example, increasing the number of rules to two or three times the size of the current classifier.
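The ordered-classifier idea above can be sketched as follows. This is a minimal illustration, not EDGAR's actual rule representation: the `Rule` class, `covers()` and the sample rules (loosely inspired by the Mushroom figures of Table 7) are assumptions for the example.

```python
# Sketch of a decision list ordered by (positive cases * fitness):
# rules with fewer false positives are tried first, and classification
# uses the first applicable rule. Illustrative representation only.
from dataclasses import dataclass

@dataclass
class Rule:
    conditions: dict   # attribute -> required nominal value
    label: str         # predicted class
    positives: int     # positive cases covered on the training set
    fitness: float     # fitness assigned by the GA

    def covers(self, example: dict) -> bool:
        return all(example.get(a) == v for a, v in self.conditions.items())

def build_decision_list(rules):
    # Sort by the ordering criterion described in the text.
    return sorted(rules, key=lambda r: r.positives * r.fitness, reverse=True)

def classify(decision_list, example, default="unknown"):
    for rule in decision_list:
        if rule.covers(example):
            return rule.label
    return default

rules = [
    Rule({"odor": "none"}, "edible", positives=1985, fitness=0.99),
    Rule({"odor": "foul"}, "poisonous", positives=11, fitness=0.95),
]
dl = build_decision_list(rules)
print(classify(dl, {"odor": "foul"}))  # prints "poisonous"
```

Note how the small-disjunct rule (11 positives) is kept at the end of the list but still fires when no stronger rule covers the example.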
4.3. Statistical analysis
The aim of this section is to statistically test the behaviour of the proposed algorithm on a representative number of datasets (see Table 8) chosen from the UCI repository [25]. The splice dataset provided by Towell et al. [24] was chosen since it was originally used by REGAL [10] as an example case. The number of datasets is determined by the chosen non-parametric test, the Wilcoxon signed-ranks test [5], which has maximum confidence when the number of paired data (results per dataset) is at least 25. The continuous attributes in the datasets were discretised using 10 equal-frequency values because both algorithms work only with nominal attributes. The accuracy of a classifier produced from a discretised dataset may be lower than for an originally discrete dataset, but we think the comparison between two nominal algorithms is still valid because both of them are affected in a similar way, and it allows using a number of standard datasets instead of synthetic or less common nominal ones. The datasets were chosen taking size criteria into account, ranging from hundreds to some thousands of instances and from a few to some tens of features. The comparison was carried out with 1000 individuals as the sum of the whole local node population, stopping when no improvement was achieved; the remaining parameters were the same as in the previous subsection. This configuration will not achieve the best results for a specific dataset, but it allows evaluating the algorithm's flexibility under a variety of input conditions.
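The equal-frequency discretisation mentioned above can be sketched in a few lines. This is a minimal illustration under the assumption that cut points are simply order statistics of the observed values; a real preprocessing step would fit the bins on the training data only.

```python
# Equal-frequency discretisation into 10 bins, as applied to the
# continuous attributes before running the two nominal-attribute learners.

def equal_frequency_bins(values, n_bins=10):
    """Return cut points so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def discretise(value, cuts):
    """Map a continuous value to the index of its bin (0 .. len(cuts))."""
    return sum(value >= c for c in cuts)

values = [float(v) for v in range(100)]
cuts = equal_frequency_bins(values, n_bins=10)
print(discretise(35.0, cuts))  # prints 3 (fourth bin)
```

Each attribute value is thus replaced by a nominal bin index, which is what both rule learners consume.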

To compare the obtained results, non-parametric tests were used, following the recommendations made in [7,21,22]. They are safer than parametric tests since they do not assume normal distributions or homogeneity of variance. As such, these non-parametric tests can be applied to classification accuracies, error ratios or any other measure for the evaluation of classifiers, even including model sizes and computation times. Empirical results suggest that they are also stronger than parametric tests. Demšar [7] recommends a set of simple, safe and robust non-parametric tests for statistical comparisons of classifiers, one of which is the Wilcoxon signed-ranks test. Given that evaluating only the mean classification accuracy over all the datasets would hide important information, and that each dataset represents a different classification problem with a different degree of difficulty, we have also included a table that shows the average and standard deviation.
The Wilcoxon signed-ranks test is the non-parametric analogue of the paired t-test; therefore, it is a pairwise test that aims to detect significant differences in the behaviour of two algorithms. In our study, we show the level of significance (p) at which the null hypothesis (H0) can be rejected. A p-value of 0.05 means that H0 can be rejected with a confidence level of 95%.
The Wilcoxon signed-ranks test works as follows. Let di be the difference between the performance scores of the two classifiers on the ith of Nds datasets. The differences are ranked according to their absolute values; average ranks are assigned in case of ties. Let R+ be the sum of ranks for the datasets in which the second algorithm outperformed the first, and R- the sum of ranks for the opposite. Ranks of di = 0 are split evenly among the sums; if there is an odd number of them, one is ignored:
\[ R^{+} = \sum_{d_i > 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i) \]

\[ R^{-} = \sum_{d_i < 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i) \]


Table 9
Average results (mean (sd)).

Dataset       EDGAR Rules    EDGAR Test     EDGAR Time     REGAL Rules    REGAL Test     REGAL Time
Car           61 (6.06)      97% (0.006)    0.10 (0.07)    66 (8.13)      98% (0.006)    63.4 (71.6)
Cleveland     73 (14.30)     91% (0.070)    0.07 (0.07)    62 (8.33)      96% (0.016)    100.5 (91.6)
Credit        69 (10.62)     90% (0.015)    0.05 (0.03)    75 (6.31)      94% (0.028)    69.4 (63.3)
Ecoli         61 (9.89)      90% (0.036)    0.04 (0.03)    47 (7.84)      96% (0.022)    95.5 (86.7)
Glass         61 (14.24)     92% (0.105)    0.07 (0.03)    45 (9.57)      96% (0.025)    106.2 (98.1)
Haberman      29 (4.84)      82% (0.038)    0.04 (0.03)    52 (4.83)      92% (0.018)    125.4 (115.5)
House Votes   26 (5.79)      98% (0.026)    0.02 (0.01)    26 (5.79)      97% (0.025)    0.7 (9.5)
Hypothyroid   200 (28.08)    95% (0.038)    1.00 (0.94)    178 (20.42)    95% (0.029)    35.6 (27.1)
Iris          14 (5.55)      96% (0.038)    0.01 (0.01)    11 (2.55)      99% (0.017)    36.2 (9.6)
Krvskp        62 (8.78)      98% (0.014)    0.10 (0.06)    54 (10.40)     86% (0.092)    22.5 (11.1)
Monk          46 (7.25)      99% (0.043)    0.03 (0.03)    58 (6.51)      77% (0.055)    61.7 (41.3)
Mushroom      13 (3.23)      100% (0.012)   0.14 (0.08)    16 (4.19)      100% (0.013)   4.9 (5.0)
New-Thyroid   14 (2.97)      99% (0.027)    0.81 (0.42)    14 (2.97)      99% (0.027)    51.9 (49.7)
Nursery       270 (28.41)    99% (0.027)    1.63 (1.50)    309 (22.41)    98% (0.013)    91.0 (77.2)
Pima          128 (9.34)     94% (0.017)    0.19 (0.22)    127 (9.49)     94% (0.017)    65.5 (68.5)
Segment       132 (13.51)    99% (0.012)    0.47 (0.35)    137 (22.00)    99% (0.013)    30.5 (29.5)
Waveform      1026 (77.79)   95% (0.044)    15.15 (8.81)   1070 (77.79)   93% (0.044)    15.1 (8.8)
Wine          57 (13.51)     97% (0.031)    0.03 (0.01)    60 (14.52)     97% (0.031)    1.2 (0.3)
Wisconsin     25 (4.10)      98% (0.056)    0.02 (0.01)    25 (3.63)      100% (0.026)   14.7 (5.4)
Zoo           9 (5.035)      69% (0.314)    0.02 (0.01)    6 (4.097)      65% (0.311)    9.1 (14.5)

Let T be the smaller of the sums, T = min(R+, R-). If T is less than or equal to the critical value of the Wilcoxon distribution for Nds degrees of freedom [28], the null hypothesis of equality of means is rejected.
The Wilcoxon signed-ranks test is more sensitive than the t-test. It assumes commensurability of differences, but only qualitatively: greater differences still count more, which is probably to be desired, but absolute magnitudes are ignored. From the statistical point of view, the test is safer since it does not assume a normal distribution. Moreover, outliers (exceptionally good or bad performances on a few datasets) have less effect on Wilcoxon's test than on the t-test. Wilcoxon's test assumes continuous differences di, which therefore should not be rounded to, say, one or two decimals, since this would decrease the power of the test due to a high number of ties.
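The rank sums described above can be sketched as follows. This is a minimal illustration of the procedure (average ranks on ties, zero differences split evenly between the sums); the odd-zero-count adjustment mentioned in the text is omitted, and the sample differences are invented for the example.

```python
# Sketch of the Wilcoxon signed-ranks computation: rank |d_i| with
# average ranks on ties, split the ranks of d_i = 0 evenly between
# R+ and R-, and take T = min(R+, R-).

def wilcoxon_sums(diffs):
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ranked):
        # group ties on |d_i| and assign the average rank to each member
        j = i
        while j < len(ranked) and abs(diffs[ranked[j]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[ranked[k]] = avg
        i = j
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    zero = sum(r for d, r in zip(diffs, ranks) if d == 0)
    return r_plus + zero / 2.0, r_minus + zero / 2.0

diffs = [0.02, -0.01, 0.05, 0.00, 0.03]   # per-dataset accuracy differences
r_plus, r_minus = wilcoxon_sums(diffs)
t = min(r_plus, r_minus)                  # compare t against the critical value
```

T is then compared against the tabulated critical value for Nds paired results to decide whether the null hypothesis is rejected.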
Table 9 shows the average results and standard deviations of the compared executions over the selected datasets. The first column has the dataset name; the next columns show the number of rules, test accuracy and processing time for EDGAR, followed by the same three measures for REGAL. Table 10 summarises the Wilcoxon test results as follows: the first column has the number of nodes, the second expresses the condition measured in the test, the third, fourth and fifth columns give the number of wins (R+), losses (R-) and ties for the condition, and the last column is the p-value.

Table 9 (Continued)

Dataset       EDGAR Rules    EDGAR Test     EDGAR Time     REGAL Rules    REGAL Test     REGAL Time
Soybean       72 (6.97)      98% (0.026)    0.41 (0.36)    81 (11.94)     97% (0.031)    21.8 (17.6)
Splice        77 (10.91)     100% (0.027)   9.71 (4.53)    72 (12.41)     95% (0.026)    29.1 (4.5)
Tic-tac-toe   87 (14.45)     99% (0.028)    0.06 (0.04)    50 (13.12)     91% (0.055)    61.4 (51.8)
Vehicle       181 (14.11)    95% (0.019)    0.64 (1.42)    161 (18.32)    95% (0.028)    39.5 (46.2)
Vote          24 (7.13)      97% (0.021)    0.02 (0.01)    8 (9.36)       96% (0.442)    11.8 (11.7)

Table 10
Wilcoxon signed-ranks test.

Nodes   REGAL/EDGAR      R+    R-    Ties   p-Value
4       Rules < Rules    16     6      3     0.13
4       Acc. < Acc.      17     8      0     0.11
4       Time < Time       0    24      1     0.00
8       Rules < Rules    12     9      4     0.49
8       Acc. < Acc.      17     8      0     0.31
8       Time < Time       1    24      0     0.00
16      Rules < Rules    12    10      3     0.40
16      Acc. < Acc.      18     7      0     0.09
16      Time < Time       0    25      0     0.00
32      Rules < Rules    10    12      3     0.21
32      Acc. < Acc.      17     8      0     0.32
32      Time < Time       0    25      0     0.00
All     Rules < Rules    12    10      3     0.73
All     Acc. < Acc.      20     4      1     0.01
All     Time < Time       0    24      1     0.00

Observing Tables 9–11 (see Appendix A), we can make the following analysis:

- The number of nodes in EDGAR does not follow any trend regarding accuracy. In some of the datasets the results are even better with 32 nodes than with 16 or fewer.
- Processing time decreases in EDGAR with the number of nodes, but it does not achieve a linear speedup.
- Table 10 shows that on average (nodes = All), EDGAR wins in accuracy in 20 out of 25 datasets with a confidence of 99%. The other configurations show a variety of results that do not allow rejection of the null hypothesis. Processing time is better in EDGAR in 100% of the cases, for every configuration.
- For the number of rules the null hypothesis cannot be rejected, because the confidence is only 27% for the average case and less than 90% in the rest of the node configurations.

5. Conclusions
This work presents a distributed genetic algorithm for classification rule extraction based on the island model and enhanced for scalability with training data partitioning. To be able to generate an accurate classifier with data partitioning, two techniques were proposed: an elitist pool for rule selection and a novel technique of data distribution (DLF) that uses heuristics based on the local data to dynamically redistribute the training data in the node neighbourhood.
In this study, EDGAR shows a considerable speedup and, moreover, this improvement does not compromise the accuracy or complexity of the classifier.
The complementarity of the proposed techniques results in a low dependency on parameter settings. The proportion of individuals per learning example is compensated by the reduction of the training data set, which removes the already learned rules and redirects the computational effort to the more difficult cases. The seeding operator reintroduces rules, preventing loss of diversity. The elite pool also ensures that already discovered rules will be kept in the final classifier even if they are removed from the nodes.
Finally, we would like to point out the absence of a master process to guide the search. This architecture suggests better scalability by avoiding idle time due to the synchronisation issues or network bottlenecks typically associated with a synchronous master-slave relation.

Acknowledgements
This work was supported by the Spanish Ministry of Education and Science under Grant No. TIN2008-06681-C06-06, and the Andalusian Government under Grant Nos. P05-TIC-00531 and P07-TIC-03179.

Appendix A. Detailed results

This section shows the results obtained for the different node configurations, as an extension of Section 4.3. In Table 11, the first column has the dataset name; the following columns show the number of rules, test accuracy and processing time for EDGAR, and then the number of rules, test accuracy and processing time for REGAL. Each dataset was executed with four node configurations (4, 8, 16 and 32 nodes) plus the overall average; for each configuration, the compared algorithms were executed using Dietterich's 5x2 cross-validation [4] with 5 different seeds (Table 11).
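The 5x2 cross-validation protocol above can be sketched as follows: five repetitions of a different random 2-fold split, each fold serving once as training set and once as test set, giving ten train/test pairs per dataset. The function and the toy indices are illustrative, not the paper's actual experimental harness.

```python
# Sketch of Dietterich's 5x2 cross-validation: 5 random 2-fold splits,
# yielding 10 train/test index pairs (toy indices, not the real data).
import random

def five_by_two_cv(n_examples, seeds=(0, 1, 2, 3, 4)):
    pairs = []
    for seed in seeds:
        idx = list(range(n_examples))
        random.Random(seed).shuffle(idx)
        half = n_examples // 2
        fold_a, fold_b = idx[:half], idx[half:]
        # each fold serves once as training set and once as test set
        pairs.append((fold_a, fold_b))
        pairs.append((fold_b, fold_a))
    return pairs

pairs = five_by_two_cv(100)
print(len(pairs))  # prints 10
```

Averaging the ten resulting accuracies per configuration yields the means and standard deviations reported in Table 11.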

Table 11
Detailed results (mean (sd)).

Dataset       Nodes   EDGAR Rules    EDGAR Test     EDGAR Time     REGAL Rules    REGAL Test     REGAL Time
Car           4       57 (4.34)      96% (0.005)    0.09 (0.01)    58 (3.52)      96% (0.007)    15.8 (7.4)
Car           8       61 (6.85)      95% (0.005)    0.05 (0.00)    61 (5.99)      95% (0.006)    28.6 (20.4)
Car           16      61 (6.57)      94% (0.003)    0.08 (0.01)    67 (4.17)      94% (0.005)    75.8 (45.5)
Car           32      56 (4.72)      94% (0.004)    0.22 (0.08)    73 (7.03)      94% (0.006)    148.6 (87.6)
Car           All     61 (6.06)      97% (0.006)    0.10 (0.07)    66 (8.13)      98% (0.006)    63.4 (71.6)
Cleveland     4       69 (7.90)      93% (0.054)    0.09 (0.10)    54 (4.04)      96% (0.053)    16.6 (2.6)
Cleveland     8       65 (22.06)     84% (0.116)    0.06 (0.09)    55 (4.66)      92% (0.007)    42.6 (7.5)
Cleveland     16      70 (7.24)      87% (0.029)    0.05 (0.02)    62 (4.24)      91% (0.010)    89.6 (17.6)
Cleveland     32      78 (7.48)      89% (0.026)    0.05 (0.01)    68 (6.41)      90% (0.013)    234.9 (35.0)
Cleveland     All     73 (14.30)     91% (0.070)    0.07 (0.07)    62 (8.33)      96% (0.016)    100.5 (91.6)
Credit        4       62 (6.39)      91% (0.027)    0.08 (0.03)    73 (5.56)      94% (0.059)    15.2 (2.7)
Credit        8       63 (7.36)      88% (0.010)    0.04 (0.01)    71 (5.88)      89% (0.022)    32.5 (11.5)
Credit        16      68 (7.51)      87% (0.014)    0.03 (0.01)    73 (7.53)      90% (0.025)    62.3 (27.8)
Credit        32      79 (9.62)      88% (0.013)    0.04 (0.03)    73 (6.75)      89% (0.028)    147.4 (56.3)
Credit        All     69 (10.62)     90% (0.015)    0.05 (0.03)    75 (6.31)      94% (0.028)    69.4 (63.3)
Ecoli         4       52 (4.88)      92% (0.053)    0.04 (0.03)    41 (5.23)      95% (0.056)    21.2 (5.4)
Ecoli         8       57 (6.19)      84% (0.035)    0.04 (0.04)    41 (4.01)      91% (0.016)    48.2 (16.2)
Ecoli         16      55 (7.15)      89% (0.016)    0.03 (0.01)    43 (4.94)      91% (0.019)    89.3 (53.9)
Ecoli         32      70 (6.87)      84% (0.029)    0.06 (0.03)    53 (6.52)      94% (0.019)    158.0 (97.8)
Ecoli         All     61 (9.89)      90% (0.036)    0.04 (0.03)    47 (7.84)      96% (0.022)    95.5 (86.7)
Glass         4       55 (10.22)     96% (0.115)    0.11 (0.09)    35 (3.68)      97% (0.042)    18.1 (3.4)
Glass         8       53 (6.21)      85% (0.019)    0.04 (0.02)    40 (3.64)      93% (0.024)    42.0 (5.9)
Glass         16      51 (23.40)     77% (0.216)    0.04 (0.02)    45 (4.14)      93% (0.020)    100.8 (16.2)
Glass         32      65 (9.63)      85% (0.013)    0.05 (0.03)    56 (7.96)      93% (0.024)    250.9 (53.0)
Glass         All     61 (14.24)     92% (0.105)    0.07 (0.03)    45 (9.57)      96% (0.025)    106.2 (98.1)
Haberman      4       31 (5.73)      83% (0.110)    0.04 (0.01)    53 (4.50)      91% (0.036)    15.6 (8.7)
Haberman      8       25 (5.33)      74% (0.030)    0.03 (0.01)    50 (4.91)      88% (0.014)    37.3 (21.6)
Haberman      16      24 (5.14)      73% (0.023)    0.03 (0.04)    51 (4.22)      89% (0.016)    122.9 (52.4)
Haberman      32      21 (3.61)      54% (0.009)    0.05 (0.09)    49 (5.14)      91% (0.015)    228.6 (106.9)
Haberman      All     29 (4.84)      82% (0.038)    0.04 (0.03)    52 (4.83)      92% (0.018)    125.4 (115.5)
House Votes   4       24 (4.21)      98% (0.115)    0.04 (0.01)    24 (4.21)      96% (0.112)    12.5 (0.6)
House Votes   8       21 (3.11)      89% (0.006)    0.02 (0.00)    21 (3.11)      86% (0.006)    8.0 (0.3)
House Votes   16      23 (3.92)      89% (0.006)    0.01 (0.00)    23 (3.92)      87% (0.006)    9.6 (0.4)
House Votes   32      30 (4.83)      89% (0.005)    0.01 (0.00)    30 (4.83)      87% (0.005)    9.3 (0.7)
House Votes   All     26 (5.79)      98% (0.026)    0.02 (0.01)    26 (5.79)      97% (0.025)    0.7 (9.5)
Hypothyroid   4       170 (28.72)    95% (0.146)    0.52 (0.08)    158 (26.63)    95% (0.147)    9.6 (4.0)
Hypothyroid   8       173 (11.98)    85% (0.002)    0.40 (0.07)    147 (17.50)    85% (0.005)    13.6 (2.8)
Hypothyroid   16      197 (13.67)    85% (0.003)    0.99 (0.06)    150 (14.11)    71% (0.001)    21.9 (6.4)
Hypothyroid   32      180 (6.40)     92% (0.001)    2.57 (0.36)    180 (10.71)    90% (0.004)    56.5 (20.5)
Hypothyroid   All     200 (28.08)    95% (0.038)    1.00 (0.94)    178 (20.42)    95% (0.029)    35.6 (27.1)
Iris          4       13 (4.06)      97% (0.117)    0.02 (0.00)    10 (2.32)      98% (0.035)    11.8 (5.2)
Iris          8       14 (2.77)      88% (0.024)    0.01 (0.00)    10 (2.34)      95% (0.013)    30.2 (11.4)
Iris          16      12 (4.87)      90% (0.030)    0.01 (0.01)    12 (2.74)      96% (0.008)    66.1 (29.0)
Iris          32      15 (3.47)      99% (0.030)    0.01 (0.01)    14 (3.23)      97% (0.064)    28.8 (12.5)
Iris          All     14 (5.55)      96% (0.038)    0.01 (0.01)    11 (2.55)      99% (0.017)    36.2 (9.6)
Krvskp        4       56 (8.14)      98% (0.035)    0.15 (0.07)    56 (8.99)      90% (0.004)    13.3 (4.8)
Krvskp        8       60 (8.42)      93% (0.005)    0.07 (0.05)    58 (6.68)      90% (0.003)    33.9 (11.3)
Krvskp        16      63 (6.59)      93% (0.006)    0.08 (0.03)    47 (1.41)      66% (0.004)    23.0 (1.6)
Krvskp        32      61 (7.99)      88% (0.005)    0.07 (0.03)    59 (1.41)      65% (0.004)    29.0 (2.7)
Krvskp        All     62 (8.78)      98% (0.014)    0.10 (0.06)    54 (10.40)     86% (0.092)    22.5 (11.1)
Monk          4       52 (6.40)      99% (0.000)    0.05 (0.02)    51 (0.00)      82% (0.068)    5.6 (0.0)
Monk          8       43 (3.90)      88% (0.007)    0.02 (0.01)    48 (2.73)      70% (0.024)    39.6 (27.1)
Monk          16      32 (6.42)      88% (0.005)    0.03 (0.05)    51 (6.59)      61% (0.018)    43.1 (19.4)
Monk          32      36 (3.12)      88% (0.009)    0.02 (0.01)    56 (5.93)      64% (0.021)    88.2 (48.7)
Monk          All     46 (7.25)      99% (0.043)    0.03 (0.03)    58 (6.51)      77% (0.055)    61.7 (41.3)
Mushroom      4       14 (3.39)      100% (0.035)   0.24 (0.07)    13 (2.80)      100% (0.054)   3.2 (2.4)
Mushroom      8       14 (3.36)      98% (0.000)    0.07 (0.01)    15 (4.54)      95% (0.000)    4.1 (2.2)
Mushroom      16      11 (1.78)      95% (0.000)    0.07 (0.02)    16 (2.97)      95% (0.000)    4.1 (1.0)
Mushroom      32      11 (3.05)      95% (0.000)    0.08 (0.03)    16 (5.25)      95% (0.000)    7.4 (8.8)
Mushroom      All     13 (3.23)      100% (0.012)   0.14 (0.08)    16 (4.19)      100% (0.013)   4.9 (5.0)
New-Thyroid   4       12 (2.50)      98% (0.128)    0.72 (0.12)    12 (2.50)      99% (0.117)    12.3 (6.7)
New-Thyroid   8       12 (2.13)      90% (0.008)    0.66 (0.56)    12 (2.13)      90% (0.008)    25.0 (13.8)
New-Thyroid   16      14 (2.56)      90% (0.009)    0.80 (0.62)    14 (2.56)      90% (0.008)    49.5 (23.4)
New-Thyroid   32      15 (2.52)      90% (0.007)    0.58 (0.73)    15 (2.52)      90% (0.007)    103.3 (55.9)
New-Thyroid   All     14 (2.97)      99% (0.027)    0.81 (0.42)    14 (2.97)      99% (0.027)    51.9 (49.7)
Nursery       4       259 (26.59)    97% (0.116)    0.80 (0.54)    290 (25.11)    98% (0.053)    13.7 (8.0)
Nursery       8       261 (14.27)    88% (0.006)    0.75 (1.23)    292 (23.64)    94% (0.003)    40.8 (20.9)
Nursery       16      262 (24.03)    92% (0.002)    0.98 (0.09)    297 (11.86)    94% (0.003)    105.3 (30.9)
Nursery       32      224 (34.43)    91% (0.001)    3.56 (0.34)    311 (15.18)    94% (0.004)    187.5 (49.3)
Nursery       All     270 (28.41)    99% (0.027)    1.63 (1.50)    309 (22.41)    98% (0.013)    91.0 (77.2)
Pima          4       125 (11.27)    94% (0.051)    0.23 (0.31)    120 (10.49)    90% (0.052)    12.5 (6.2)
Pima          8       126 (9.27)     90% (0.011)    0.15 (0.27)    125 (8.74)     83% (0.022)    27.1 (12.6)
Pima          16      132 (8.06)     90% (0.013)    0.14 (0.21)    122 (8.43)     82% (0.014)    51.7 (23.0)
Pima          32      132 (11.53)    90% (0.013)    0.18 (0.26)    124 (8.22)     89% (0.028)    159.0 (57.8)
Pima          All     128 (9.34)     94% (0.017)    0.19 (0.22)    127 (9.49)     94% (0.017)    65.5 (68.5)
Segment       4       132 (13.86)    98% (0.034)    0.26 (0.20)    124 (12.46)    99% (0.054)    8.0 (4.2)
Segment       8       130 (13.55)    93% (0.006)    0.17 (0.02)    117 (11.91)    94% (0.004)    15.6 (6.9)
Segment       16      131 (11.89)    95% (0.003)    0.37 (0.02)    128 (13.99)    94% (0.005)    27.9 (12.0)
Segment       32      119 (11.54)    96% (0.003)    0.95 (0.16)    160 (12.84)    94% (0.005)    65.2 (33.6)
Segment       All     132 (13.51)    99% (0.012)    0.47 (0.35)    137 (22.00)    99% (0.013)    30.5 (29.5)
Soybean       4       66 (7.32)      98% (0.116)    0.17 (0.02)    72 (8.73)      98% (0.118)    5.2 (1.3)
Soybean       8       65 (7.62)      89% (0.003)    0.12 (0.01)    66 (3.89)      89% (0.015)    9.6 (1.0)
Soybean       16      67 (4.08)      89% (0.003)    0.33 (0.04)    78 (7.10)      89% (0.017)    19.4 (2.7)
Soybean       32      71 (6.93)      90% (0.003)    0.90 (0.14)    87 (9.03)      88% (0.022)    45.3 (3.8)
Soybean       All     72 (6.97)      98% (0.026)    0.41 (0.36)    81 (11.94)     97% (0.031)    21.8 (17.6)
Splice        4       84 (10.26)     100% (0.095)   11.90 (5.81)   83 (9.37)      100% (0.088)   24.4 (5.5)
Splice        8       73 (7.97)      91% (0.001)    6.52 (2.74)    86 (8.24)      83% (0.001)    36.5 (2.5)
Splice        16      66 (6.91)      91% (0.001)    10.22 (4.28)   68 (6.42)      86% (0.001)    39.7 (4.0)
Splice        32      53 (8.59)      86% (0.000)    5.99 (1.19)    65 (8.77)      84% (0.000)    25.6 (1.2)
Splice        All     77 (10.91)     100% (0.027)   9.71 (4.53)    72 (12.41)     95% (0.026)    29.1 (4.5)
Tic-tac-toe   4       70 (10.38)     99% (0.119)    0.12 (0.03)    63 (11.28)     97% (0.113)    5.2 (2.8)
Tic-tac-toe   8       74 (6.83)      90% (0.006)    0.03 (0.00)    39 (13.25)     85% (0.014)    42.2 (17.8)
Tic-tac-toe   16      83 (4.68)      90% (0.007)    0.04 (0.01)    44 (11.31)     82% (0.011)    69.4 (56.9)
Tic-tac-toe   32      95 (7.97)      91% (0.002)    0.03 (0.01)    40 (7.09)      77% (0.021)    106.8 (24.1)
Tic-tac-toe   All     87 (14.45)     99% (0.028)    0.06 (0.04)    50 (13.12)     91% (0.055)    61.4 (51.8)
Vehicle       4       190 (13.06)    92% (0.049)    1.23 (2.40)    143 (17.44)    95% (0.115)    6.9 (1.5)
Vehicle       8       171 (10.66)    89% (0.010)    0.46 (0.92)    139 (8.89)     86% (0.013)    13.3 (3.2)
Vehicle       16      162 (7.32)     90% (0.012)    0.23 (0.19)    150 (10.96)    86% (0.014)    23.1 (5.2)
Vehicle       32      147 (5.00)     63% (0.018)    0.13 (0.02)    168 (10.98)    86% (0.011)    101.0 (36.1)
Vehicle       All     181 (14.11)    95% (0.019)    0.64 (1.42)    161 (18.32)    95% (0.028)    39.5 (46.2)
Vote          4       18 (3.54)      97% (0.115)    0.03 (0.00)    13 (1.00)      66% (0.000)    12.8 (14.9)
Vote          8       17 (4.53)      88% (0.006)    0.01 (0.00)    13 (2.83)      66% (0.007)    16.6 (0.4)
Vote          16      19 (3.48)      89% (0.006)    0.01 (0.00)    13 (1.00)      66% (0.003)    12.6 (13.9)
Vote          32      29 (5.56)      93% (0.008)    0.02 (0.01)    14 (1.41)      67% (0.004)    17.1 (0.1)
Vote          All     24 (7.13)      97% (0.021)    0.02 (0.01)    8 (9.36)       53% (0.442)    11.8 (11.7)
Waveform      4       1064 (96.97)   93% (0.075)    9.74 (4.58)    1032 (96.97)   92% (0.074)    9.7 (4.6)
Waveform      8       530 (54.27)    47% (0.075)    19.93 (4.77)   1040 (97.84)   93% (0.073)    9.8 (4.6)
Waveform      16      880 (45.21)    88% (0.005)    18.41 (6.03)   843 (121.09)   46% (0.059)    19.9 (4.7)
Waveform      32      894 (35.87)    85% (0.004)    17.52 (5.98)   1069 (45.21)   86% (0.005)    18.4 (6.0)
Waveform      All     1026 (77.79)   95% (0.044)    15.15 (8.81)   1070 (77.79)   93% (0.044)    15.1 (8.8)
Wine          4       51 (9.97)      97% (0.055)    0.04 (0.02)    60 (11.30)     94% (0.054)    1.1 (0.5)
Wine          8       52 (5.99)      88% (0.013)    0.02 (0.00)    53 (5.99)      86% (0.013)    0.7 (0.1)
Wine          16      65 (15.12)     89% (0.016)    0.02 (0.01)    63 (14.14)     88% (0.016)    1.1 (0.3)
Wine          32      68 (14.55)     89% (0.017)    0.03 (0.01)    64 (17.20)     88% (0.019)    1.2 (0.3)
Wine          All     57 (13.51)     97% (0.031)    0.03 (0.01)    60 (14.52)     97% (0.031)    1.2 (0.3)
Wisconsin     4       24 (4.74)      98% (0.115)    0.03 (0.00)    23 (4.40)      99% (0.117)    4.9 (2.8)
Wisconsin     8       22 (3.93)      87% (0.007)    0.01 (0.00)    22 (2.56)      90% (0.004)    7.0 (4.4)
Wisconsin     16      15 (3.93)      48% (0.027)    0.05 (0.03)    24 (4.37)      91% (0.004)    16.0 (10.1)
Wisconsin     32      18 (4.29)      87% (0.047)    0.01 (0.00)    25 (2.65)      91% (0.004)    25.8 (8.2)
Wisconsin     All     25 (4.10)      98% (0.056)    0.02 (0.01)    25 (3.63)      100% (0.026)   14.7 (5.4)
Zoo           4       10 (2.21)      99% (0.125)    0.04 (0.01)    9 (1.50)       98% (0.120)    10.2 (3.3)
Zoo           8       9 (2.52)       90% (0.017)    0.02 (0.01)    8 (1.49)       89% (0.015)    15.6 (3.8)
Zoo           16      8 (4.80)       90% (0.016)    0.01 (0.00)    9 (1.70)       82% (0.034)    28.8 (3.7)
Zoo           32      12 (3.63)      66% (0.314)    0.01 (0.00)    13 (4.59)      90% (0.024)    32.9 (4.6)
Zoo           All     9 (5.04)       69% (0.314)    0.02 (0.01)    6 (4.10)       65% (0.311)    9.1 (14.5)

References

[1] C. Anglano, M. Botta, NOW G-Net: learning classification programs on networks of workstations, IEEE Transactions on Evolutionary Computation (October) (2002) 463–480.
[2] C. Schaffer, When does overfitting decrease prediction accuracy in induced decision trees and rule sets?, in: Proceedings of the European Working Session on Learning (EWSL-91), 1991, pp. 192–205.
[3] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
[4] T.G. Dietterich, Approximate statistical tests for comparing supervised classification learning algorithms, Neural Computation 10 (7) (1998) 1895–1924.
[5] D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, CRC Press, Boca Raton, FL, 2003.
[6] K.A. De Jong, W.M. Spears, D.F. Gordon, Using genetic algorithms for concept learning, Machine Learning (1993) 161–188.
[7] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
[8] E. Cantú-Paz, Efficient and Accurate Parallel Genetic Algorithms, Kluwer Academic Publishers, 2000.
[9] I.W. Flockhart, N.J. Radcliffe, GA-MINER: parallel data mining with hierarchical genetic algorithms, final report, EPCC-AIKMS-GA-MINER Report 1.0, University of Edinburgh, UK, 1995.
[10] A. Giordana, F. Neri, Search-intensive concept induction, Evolutionary Computation (1995) 375–416.
[11] A.A. Freitas, S.H. Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998.
[12] G.M. Weiss, Learning with rare cases and small disjuncts, in: Proceedings of the 12th International Conference on Machine Learning (ML-95), 1995, pp. 558–565.
[13] D.P. Greene, S.F. Smith, Competition-based induction of decision models from examples, Machine Learning (1993) 229–257.
[14] J.H. Holland, J.S. Reitman, Cognitive systems based on adaptive algorithms, in: D.A. Waterman, F. Hayes-Roth (Eds.), Pattern-Directed Inference Systems, Academic Press, New York, 1978.
[15] J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975.
[16] S.T. Leutenegger, X. Sun, Limitations of cycle stealing for parallel processing on a network of homogeneous workstations, Journal of Parallel and Distributed Computing 43 (1997) 169–178.
[17] M.J. Quinn, Parallel Computing: Theory and Practice, McGraw-Hill, 1994.
[18] Y. Nojima, I. Kuwajima, H. Ishibuchi, Data set subdivision for parallel distributed implementation of genetic fuzzy rule selection, in: IEEE International Conference on Fuzzy Systems (FUZZ-IEEE07), London, 2007, pp. 2006–2011.
[19] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[20] F. Provost, D. Hennessy, Distributed machine learning: scaling up with coarse-grained parallelism, in: Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (ISMB-94), Stanford, 1994.
[21] S. García, F. Herrera, An extension on "Statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons, Journal of Machine Learning Research 9 (2008) 2677–2694.
[22] S. García, A. Fernández, J. Luengo, F. Herrera, A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability, Soft Computing - A Fusion of Foundations, Methodologies and Applications 13 (10) (2009) 959–977.
[23] S. Smith, A learning system based on genetic algorithms, PhD Thesis, University of Pittsburgh, Pittsburgh, 1980.
[24] G.G. Towell, J.W. Shavlik, Knowledge-based artificial neural networks, Artificial Intelligence 70 (1994) 119–165.
[25] C.J. Merz, P.M. Murphy, UCI Repository of Machine Learning Databases, University of California, Irvine, Department of Information and Computer Science, 1996, http://kdd.ics.uci.edu.
[26] G. Venturini, SIA: a supervised inductive algorithm with genetic search for learning attribute-based concepts, in: Proceedings of the European Conference on Machine Learning, Vienna, 1993, pp. 280–296.
[27] G.M. Weiss, H. Hirsh, A quantitative study of small disjuncts, in: Proceedings of the Seventeenth National Conference on Artificial Intelligence, AAAI Press, Menlo Park, CA, 2000, pp. 665–670.
[28] J.H. Zar, Biostatistical Analysis, Prentice Hall, 1999.
