Algorithm Quasi-Optimal (AQ) Learning
mode is particularly useful for mining very large and noisy datasets.

The core of the AQ algorithm is the so-called star generation, which can proceed in two different ways depending on the mode of operation (TF or PD). In TF mode, star generation proceeds by selecting a random positive example (called a seed) and then generalizing it in various ways to create a set of consistent generalizations (rules that cover the positive example and do not cover any of the negative examples). In PD mode, rules are generated similarly, but the program seeks strong patterns (which may be partially inconsistent) rather than fully consistent rules. This star generation process is repeated until all the positive events are covered. Additionally, when run in PD mode, the generated rules go through an optimization process that generalizes and/or specializes the learned descriptions to simplify the patterns.
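To make the covering loop concrete, the following C++ sketch implements a toy version of the process just described, for continuous attributes only: rules are reduced to axis-aligned boxes, and the star is reduced to a single candidate per seed. The data, the box representation, and the helper names are assumptions made for illustration; this is not the AQ4SD code.

```cpp
// Toy version of the covering loop: pick an uncovered positive seed, build a
// (degenerate, single-rule) star by extending the seed against the negatives,
// keep the rule, and drop the positives it covers. Rules are axis-aligned
// boxes over continuous attributes; all names and data are illustrative.
#include <cmath>
#include <cstdio>
#include <vector>

using Event = std::vector<double>;

struct Box {                                   // one interval per attribute
    std::vector<double> lo, hi;
    bool covers(const Event& e) const {
        for (size_t a = 0; a < e.size(); ++a)
            if (e[a] < lo[a] || e[a] > hi[a]) return false;
        return true;
    }
};

// Extension-against: start from an unbounded box and, for every negative the
// box still covers, shrink it halfway toward the seed (epsilon = 0.5) along
// the attribute where the seed and the negative differ the most.
Box extendAgainst(const Event& seed, const std::vector<Event>& negatives) {
    const size_t n = seed.size();
    Box b{std::vector<double>(n, -1e9), std::vector<double>(n, 1e9)};
    for (const Event& x : negatives) {
        if (!b.covers(x)) continue;
        size_t a = 0;
        double bestDiff = -1.0;
        for (size_t i = 0; i < n; ++i)
            if (std::fabs(x[i] - seed[i]) > bestDiff) {
                bestDiff = std::fabs(x[i] - seed[i]);
                a = i;
            }
        double mid = 0.5 * (seed[a] + x[a]);
        if (x[a] > seed[a]) b.hi[a] = mid; else b.lo[a] = mid;
    }
    return b;
}

int main() {
    std::vector<Event> positives = {{1.0, 1.0}, {2.0, 1.5}, {1.5, 2.0}};
    std::vector<Event> negatives = {{5.0, 5.0}, {6.0, 1.0}, {1.0, 6.0}};
    std::vector<Box> cover;
    while (!positives.empty()) {                       // TF-style covering loop
        Box rule = extendAgainst(positives.front(), negatives);
        cover.push_back(rule);
        std::vector<Event> stillUncovered;
        for (const Event& p : positives)
            if (!rule.covers(p)) stillUncovered.push_back(p);
        positives.swap(stillUncovered);
    }
    std::printf("learned a cover with %zu rule(s)\n", cover.size());
    return 0;
}
```

In PD mode the same loop would also accept partially inconsistent rules and score them with the Q measure discussed later, instead of insisting on consistency with all negatives.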
AQ4SD Features and Implementation
AQ4SD is, as noted above, a total rewrite of the AQ algorithm, specifically optimized to solve the problem of source detection of atmospheric releases (see Section Source Detection of Atmospheric Releases). It shares many parts with the earlier version of AQ20 (Ref 11) but includes new features and optimization algorithms for the source detection problem. The development of AQ20 was led by the first author in close collaboration with many faculty and student members of the Computer Science Department and Machine Learning Laboratory at George Mason University.
AQ4SD is written in C++, making extensive use of the Standard Template Library (STL)13 and generic design patterns.14 The entire code comprises about 250,000 lines. The goal of the AQ4SD algorithm was to be suitable as the main engine of evolution in a non-Darwinian evolutionary process (see Section Evolutionary Computation Guided by AQ) to find the sources of atmospheric releases, using sensor concentration measurements and forward atmospheric transport and dispersion numerical models.
AQ4SD was thus optimized to be used iteratively, because evolutionary computation is based on iterative processes. It is tailored primarily toward real-valued (continuous) attributes, and it uses a novel method that does not discretize real-valued attributes into ordinal attributes during preprocessing. It is also optimized to work with noisy data, as sensor concentration measurements often contain errors and missing values. Finally, as sensors are usually very limited in number but record very long time series, AQ4SD is optimized to run with a very large number of events with a small number of attributes. Experiments, for example, were performed with up to 1,000,000 training events, each comprised of 20 real-valued attributes.
Although a formal analysis of the complexity of the AQ algorithm is beyond the scope of this article, experimental runs showed that AQ complexity is polynomial. In particular, it is a low-order polynomial in the number of events and a higher-order polynomial in the number of negative events. The lower complexity increase associated with an increase in positive events is due to the fact that during learning, only uncovered positive events are used to evaluate rules (whereas all negatives, or a sample of the negatives, are used to evaluate the rules).
The following sections describe the algorithms and data types used and implemented in AQ4SD. Because AQ4SD is an implementation of a general methodology, when AQ4SD is specified in the text, it refers to specific features or implementation details of AQ4SD itself, whereas when AQ is specified, it refers to concepts and theories that apply to the general AQ methodology.

AQ Events
The AQ input data consists of a sequence of events. An event is a vector of values, where each value corresponds to a measurement associated with a particular attribute. An event can be seen as a row in a database, with each value an observation of a particular attribute and each column corresponding to a different attribute. AQ events are a form of labeled data, meaning that they are or can be classified into one of two or more classes. Therefore, each event contains a special attribute, class, which identifies the class it belongs to. A sequence of events belonging to the same class is called an eventset.
Additionally, two different types of events can be used by AQ: training and testing. Training events are used by AQ to learn rules. Testing events are used to compute the statistical correctness of the learned rules on events not used during learning.
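A minimal sketch of how events, eventsets, and the training/testing split described above might be organized in C++ follows; the type and field names are illustrative assumptions rather than the AQ4SD input format.

```cpp
// Sketch of the basic AQ data objects: an event is a vector of attribute
// values plus the special class attribute; an eventset groups the events of
// one class; training and testing events are kept separate.
#include <string>
#include <vector>

struct Event {
    std::vector<double> values;   // one value per attribute (a database row)
    std::string         label;    // the special "class" attribute
};

struct EventSet {                 // all events belonging to the same class
    std::string        label;
    std::vector<Event> events;
};

struct Dataset {
    std::vector<EventSet> training;  // used to learn rules
    std::vector<EventSet> testing;   // used to estimate rule correctness
};

int main() {
    Event    e{{10.3, 281.2, 7.9}, "cluster1"};   // one labeled measurement row
    EventSet s{"cluster1", {e}};
    Dataset  d{{s}, {}};
    return d.training.size() == 1 ? 0 : 1;
}
```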
AQ Rules
AQ uses a highly descriptive representation language to represent the learned knowledge. In particular, it uses rules to describe patterns in the data. A prototypical AQ rule is defined in logical Eq. (1):

Consequent ← Premise Exception    (1)
The consequent, premise, and exception are made of conditions of the form

[Attribute Relation Value(s)]    (2)

Depending on the attribute type, different relations may exist. For example, for unordered categorical attributes, the relations < or > cannot be used as they are undefined. A complete set of the relations allowed with each attribute type is given in Section Attribute Types. Typically, the consequent consists of a single condition, whereas the premise consists of a conjunction of several conditions. Equation (3) shows a sample rule relating a particular cluster to a set of input parameters. The annotations p and n indicate the number of positive and negative events covered by this rule.

[Cluster = 1] ← [WindDir = N..E] [WindSpeed > 10 m/s] [Temp > 22°C] : p = 11, n = 3    (3)

This type of rule is usually called attributional, to distinguish it from more traditional rules that use a simpler representation language. The main difference from traditional rules is that the referee (attribute), relation, and reference may include internal disjunctions of attribute values, ranges of values, internal conjunctions of attributes, and other constructs. Such a rich representation language means that very complex concepts can be represented using a compact description. However, attributional rules have the disadvantage of being more prone to overfitting with noisy data.
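The condition and rule structure of Eqs. (1)–(3) can be sketched as follows. The enumeration of relations, the degree encoding of wind direction (N..E taken as 0°..90°), and the attribute ordering are assumptions made only for the example; matching a rule is simply the conjunction of its conditions.

```cpp
// Sketch of an attributional condition [Attribute Relation Value(s)] and of a
// rule as a conjunction of conditions, loosely following Eq. (3). Attribute
// indices, the degree encoding of wind direction, and the thresholds are
// assumptions made only for this example.
#include <cstdio>
#include <vector>

enum class Rel { Eq, Neq, Lt, Gt, Le, Ge, InRange };

struct Condition {
    int    attr;     // index of the referee attribute
    Rel    rel;      // relation
    double lo, hi;   // reference: single value (lo == hi) or range lo..hi
    bool satisfied(const std::vector<double>& e) const {
        const double v = e[attr];
        switch (rel) {
            case Rel::Eq:      return v == lo;
            case Rel::Neq:     return v != lo;
            case Rel::Lt:      return v <  lo;
            case Rel::Gt:      return v >  lo;
            case Rel::Le:      return v <= lo;
            case Rel::Ge:      return v >= lo;
            case Rel::InRange: return v >= lo && v <= hi;
        }
        return false;
    }
};

struct Rule {
    std::vector<Condition> premise;   // conjunction of conditions
    int cluster;                      // consequent, e.g., [Cluster = 1]
    int p, n;                         // positives / negatives covered
    bool covers(const std::vector<double>& e) const {
        for (const Condition& c : premise)
            if (!c.satisfied(e)) return false;
        return true;
    }
};

int main() {
    // Roughly Eq. (3): [Cluster = 1] <- [WindDir = N..E][WindSpeed > 10][Temp > 22]
    Rule r{{{0, Rel::InRange, 0.0, 90.0},
            {1, Rel::Gt, 10.0, 10.0},
            {2, Rel::Gt, 22.0, 22.0}},
           1, 11, 3};
    std::vector<double> event = {45.0, 12.5, 24.1};   // WindDir, WindSpeed, Temp
    std::printf("rule %s the event\n", r.covers(event) ? "covers" : "does not cover");
    return 0;
}
```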
Multiple rules are learned for each cluster, and together they are called a ruleset. A ruleset for a specific consequent is also called a cover. A ruleset is a disjunction of rules, meaning that if even only one rule is satisfied, then the consequent is true. Multiple rules can be satisfied at one time because the learned rules could be intersecting each other. Equation (4) shows a sample ruleset learned for cluster 1.
Each rule has a different statistical value. Assuming 13 positive events associated with cluster 1, the first rule in Eq. (4) covers not only most positive events in the cluster (11 of the 13 events) but also three negative events. This means that AQ was run in PD mode, allowing inconsistencies in order to gain simpler rules. The second rule covers less than 50% of the events and the third covers only 1, but both without covering any elements in other clusters. Therefore, there is a tradeoff between completeness, namely the number of events covered out of all the clouds in the cluster, and consistency, namely the coverage of events from other clusters.

Attribute Types
AQ4SD allows for four different types of attributes: nominal, linear, integer, and continuous. Each attribute type is associated with specific relations that can be used in rule conditions.
Nominal: Unordered categorical attribute for which a distance metric cannot be defined. Nominal attributes do not naturally or necessarily fall into any particular order or rank, like colors, blood types, or city names. The domain of nominal attributes is thus that of unordered sets. The following relations are allowed in rule conditions: equal (=) and not equal (≠).

Linear: Ordinal categorical attribute that is rankable, but not capable of being arithmetically operated upon. Examples of linear attributes are small, medium, large, or good, better, best. Such attributes can be sorted and ranked but cannot be multiplied or subtracted from one another. The following relations are allowed in rule conditions: equal (=), not equal (≠), lesser (<), greater (>), lesser or equal (≤), and greater or equal (≥).

Integer: Ordinal integer-valued attribute without a prefixed discretization and without decimal values. Integer attributes allow only whole numbers, such as 20 or −77. The following relations are allowed in rule conditions: equal (=), not equal (≠), lesser (<), greater (>), lesser or equal (≤), and greater or equal (≥).

Continuous: Ordinal real-valued attribute without a prefixed discretization but which contains a decimal point and a fractional portion. The following relations are allowed in rule conditions: equal (=), not equal (≠), lesser (<), greater (>), lesser or equal (≤), and greater or equal (≥). Previous versions of AQ dealt with continuous variables by discretizing them into a number of discrete units and then treating them as linear attributes. AQ4SD does not require such discretization, as it automatically determines ranges of continuous values for each variable occurrence in a rule during the star generation process.
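The association between attribute types and admissible relations can be expressed in a few lines; this is a sketch under the assumption that the relation set is exactly the one listed above.

```cpp
// Sketch of the four AQ4SD attribute types and a check of which relations a
// condition may use for each type (nominal attributes admit only = and !=).
#include <cstdio>

enum class AttrType { Nominal, Linear, Integer, Continuous };
enum class Rel { Eq, Neq, Lt, Gt, Le, Ge };

bool relationAllowed(AttrType t, Rel r) {
    if (t == AttrType::Nominal)               // unordered: no <, >, <=, >=
        return r == Rel::Eq || r == Rel::Neq;
    return true;                              // ordered types allow all six
}

int main() {
    std::printf("nominal  < allowed: %d\n", relationAllowed(AttrType::Nominal, Rel::Lt));
    std::printf("linear   < allowed: %d\n", relationAllowed(AttrType::Linear,  Rel::Lt));
}
```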
AQ Algorithm
The AQ learning process can be divided into four different parts: data preparation, rule learning, postprocessing, and optional testing. The following sections address each part individually.
The input data is made of a definition of the attributes (variables), AQ control parameters for each of the four parts mentioned above, and the raw events. The output of AQ consists of the learned rules, which can be displayed in textual or graphical form. Different versions of AQ used different ways to define the format of the input and output. Because the different methods do not affect learning, they are not discussed in this article. AQ4SD uses the input/output format described in Ref 12.

Data Preparation
The AQ learning process starts with data being read from a file (when used as a stand-alone classifier) or from memory (when embedded in a larger system). The data is processed by the data preparation mechanism, which checks the data format for correctness, corrects or removes ambiguities, selects the relevant attributes (a.k.a. feature selection), and applies rules for incremental learning.
Some versions of AQ can also, automatically or through user input, generate new attributes to change the data representation. This feature, called constructive induction, was first implemented in a specialized version of AQ17 (Refs 15,16) and is not implemented in AQ4SD.
Resolving Ambiguities
An ambiguity is an event that belongs to two or more classes. For the purpose of learning, each event must be unique and belong to a single class. AQ has four different strategies to resolve ambiguities:

Positives: The ambiguous event is kept in the positive class (the class rules are being learned from) and eliminated from all the other classes.

Negatives: The ambiguous event is eliminated from the positive class.

Eliminate: The ambiguous event is eliminated and not used for learning.

Majority: The ambiguous event is associated with the class in which it appears most often.

Attribute Selection
In general, AQ learns rules to discriminate between classes using only the smallest number of attributes.a Therefore, AQ performs an automatic attribute selection during the learning phase, selecting the most relevant attributes and disregarding those that appear irrelevant. Unfortunately, especially for large noisy problems, irrelevant attributes can lead to the generation of incorrect rules. To avoid this problem, AQ can be set to create statistics for each of the attribute values, namely a measure of how many positive and negative examples, respectively, each attribute value covers. AQ can then try to keep only those attributes that seem to have more discriminatory information between classes.
This is only a rough approximation, as individual attributes might have little discriminatory information when considered singly but can help the generation of excellent rules in combination with others. Creating such statistics is a quick linear operation that requires a one-time analysis of the entire data or of a statistical sample of the data. Because such statistics are also used by the learning and optimization algorithms, there is no significant computational overhead introduced by this attribute selection method.
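The attribute-value statistics described above amount to a single linear pass over the data. The sketch below counts, for each (attribute, value) pair, how many positive and how many negative events carry that value; the discrete encoding of the values is an assumption made for brevity.

```cpp
// One-pass sketch of the attribute-value statistics used for attribute
// selection: for each (attribute, value) pair, count how many positive and
// negative events carry that value. Values are assumed already discrete here.
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

using Event = std::vector<int>;                       // discrete values per attribute

struct Counts { long pos = 0, neg = 0; };

std::map<std::pair<int, int>, Counts>
valueStatistics(const std::vector<Event>& positives,
                const std::vector<Event>& negatives) {
    std::map<std::pair<int, int>, Counts> stats;      // key: (attribute, value)
    for (const Event& e : positives)
        for (size_t a = 0; a < e.size(); ++a) stats[std::make_pair((int)a, e[a])].pos++;
    for (const Event& e : negatives)
        for (size_t a = 0; a < e.size(); ++a) stats[std::make_pair((int)a, e[a])].neg++;
    return stats;
}

int main() {
    std::vector<Event> pos = {{0, 2}, {0, 3}}, neg = {{1, 2}};
    auto stats = valueStatistics(pos, neg);
    for (const auto& kv : stats)
        std::printf("attr %d value %d: p=%ld n=%ld\n",
                    kv.first.first, kv.first.second, kv.second.pos, kv.second.neg);
}
```

Because this is a single pass over the events (or a sample of them), it is consistent with the claim that the selection step adds no significant overhead.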
Rules for Incremental Learning
One of the main advantages of AQ (see Section Advantages and Disadvantages of the AQ Methodology) is the ability to refine previously learned rules as new input events become available. The input data can specify rules that describe either a previously learned concept or constraints between attributes. For example, they can specify that a particular combination of attributes cannot appear together in a rule, or that the boundaries of the search space are reduced under particular attribute values.
As described in detail in Section Rule Learning, AQ starts generating rules by comparing positive and negative events and keeps specializing previously learned rules with new conditions when they cover negative events. In incremental learning mode, the set of rules being specialized does not start with an empty set but with those specified in the input data. No other aspects of the learning are affected, except in the case of extreme ambiguities, when the supplied rules do not include any positive examples. In such situations, AQ cannot use the input rules, as it is not able to evaluate their positive and negative coverage.
Rule Learning
This is the core of the AQ methodology, where rules are generated from examples and counterexamples. AQ generates rules by an iterative process aimed at identifying generalizations of the positive examples with respect to the negative examples. Recall that positive examples are those labeled for the target class, and negatives are those belonging to all the other classes. At each iteration, a seed is selected from the set P of positive events still to be covered, a star is generated for it, and the best rule r is selected. The selected rule is usually much more general than the seed but might cover many if not all the events in P. Rule r is then added to the list of rules R to be added to the final answer.

Star Generation
The central concept of the algorithm is a star, defined as a set of alternative general descriptions (rules) of a particular event (a 'seed') that satisfy given constraints, for example, do not cover negative examples, do not contradict prior knowledge, etc.
A star is built by repeatedly applying the extension-against operator, which generalizes the seed against each negative event, one attribute at a time. For a nominal attribute, the resulting condition excludes the value of the negative event; its binary representation is [{0,1,1}], which is exactly the negation of the negative event.
The extension-against operation for linear attributes is slightly different, and it involves flipping the bits only up to the value of the negative event, and not any values beyond. Assuming a linear variable ||y|| = {XS, S, M, L, XL, XXL}, the result of the extension-against operation between a seed with y = S and a negative event with y = L is the rule [positive] ← [y = XS..M]. Its binary representation is [{1,1,1,0,0,0}].
For integer and continuous attributes, the extension-against operator finds a value between the seed and the negative. The degree of generalization can be controlled and, by default, is set to choose the middle point between the two. The ε parameter, defined between 0 and 1, controls the degree of generalization during the extension-against operation, with 0 being most restrictive to the seed, and 1 generalizing up to the negative. Assuming a continuous variable ||z|| = {0..100}, the result of the extension-against operation between a seed with z = 10 and a negative event with z = 30, with ε = 0.5, is the rule [positive] ← [z ≤ 20]. The result of the same extension-against operation with ε = 1 is [positive] ← [z < 30] (note that 30 is not included).
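The placement of the cut point for continuous attributes follows directly from this description; the function name below is an assumption, and the printed cases reproduce the z = 10 versus z = 30 example from the text.

```cpp
// Sketch of the extension-against boundary for a continuous attribute: the
// cut point is placed between the seed and the negative value according to
// epsilon (0 = stay at the seed, 1 = extend up to, but excluding, the negative).
#include <cstdio>

double extensionBoundary(double seed, double negative, double epsilon) {
    return seed + epsilon * (negative - seed);   // epsilon in [0, 1]
}

int main() {
    // Example from the text: seed z = 10, negative z = 30.
    std::printf("eps=0.5 -> [z <= %.1f]\n", extensionBoundary(10.0, 30.0, 0.5)); // 20.0
    std::printf("eps=1.0 -> [z <  %.1f]\n", extensionBoundary(10.0, 30.0, 1.0)); // 30.0 excluded
}
```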
The rules from the extension-against operation are then logically multiplied out with all the rules r to form a star (Algorithm 2, line 4), and the best rule (or rules) according to a given multicriterion functional LEF (Section Lexicographical Evaluation Functions) are selected (line 5). The parameter maxstar is central to the star generation process and defines how many rules are kept for each star.
If AQ is run in TF mode, the result from the intersection of the previously learned rules and the new rule is kept. In PD mode, the function Q [Eq. (5)] is used to compute the tradeoff between the completeness and the consistency of the rules:

Q = (p/P)^w × [ ((P + N)/N) × ( p/(p + n) − P/(P + N) ) ]^(1−w)    (5)

where p and n are the numbers of positive and negative events covered by the rule, and P and N are the total numbers of positive and negative events in the data. The parameter w is defined between 0 and 1 and controls the tradeoff between completeness and consistency.9
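Equation (5) can be computed with a small helper. The clamping of a negative consistency gain to zero and the example value N = 50 are assumptions; p, n, and P follow the cluster-1 example given earlier.

```cpp
// Sketch of the Q(w) rule-quality measure of Eq. (5): a weighted product of
// completeness (p/P) and consistency gain relative to the class distribution.
#include <cmath>
#include <cstdio>

double qValue(double p, double n, double P, double N, double w) {
    double completeness = p / P;
    double consistencyGain = ((P + N) / N) * (p / (p + n) - P / (P + N));
    if (consistencyGain < 0.0) consistencyGain = 0.0;   // rule no better than chance
    return std::pow(completeness, w) * std::pow(consistencyGain, 1.0 - w);
}

int main() {
    // First rule of the cluster-1 example: p = 11, n = 3, with P = 13 positives.
    // The total number of negatives N is not given in the text; 50 is assumed.
    std::printf("Q(w=0.5) = %.3f\n", qValue(11, 3, 13, 50, 0.5));
}
```

Increasing w favors rules that cover more of the positive events; decreasing it favors rules that admit fewer negatives, which is exactly the completeness/consistency tradeoff described above.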
Lexicographical Evaluation Functions
LEF is an evaluation function composed of elementary criteria and their tolerances, and it is used to determine which rules best reflect the needs of the problem at hand. In other words, LEF is used to determine which rules, among those generated, are best suited to be included in the answer. AQ has been described as performing a beam search in the rule space.8 LEF is the parameter that controls the width of the beam.
LEF works as follows:
1. Sort the rules in the star according to LEF, from the best to the worst.
2. Select the first rule and compute the number of examples it covers. Select the next rule and compute the number of new examples it covers.
3. If the number of new examples covered exceeds a new-example threshold, then the rule is selected; otherwise it is discarded. Continue the process until all rules are inspected.
The result of this procedure is a set of rules selected from a star. The list of positive events to cover is updated by deleting all those events that are covered by these rules.
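The three-step selection just described can be sketched as follows. The two elementary criteria used for sorting (more positives first, then fewer negatives) and the omission of criterion tolerances are assumptions; the `minNew` parameter plays the role of the new-example threshold of step 3.

```cpp
// Sketch of the LEF-based selection of rules from a star: sort by a
// lexicographic evaluation, then keep a rule only if it covers at least
// `minNew` not-yet-covered positive events.
#include <algorithm>
#include <cstdio>
#include <set>
#include <vector>

struct StarRule {
    std::set<int> coveredPos;   // ids of positive events the rule covers
    int coveredNeg;             // number of negative events the rule covers
};

std::vector<StarRule> selectFromStar(std::vector<StarRule> star, int minNew) {
    std::sort(star.begin(), star.end(), [](const StarRule& a, const StarRule& b) {
        if (a.coveredPos.size() != b.coveredPos.size())
            return a.coveredPos.size() > b.coveredPos.size();   // criterion 1
        return a.coveredNeg < b.coveredNeg;                     // criterion 2
    });
    std::vector<StarRule> selected;
    std::set<int> alreadyCovered;
    for (const StarRule& r : star) {
        int newExamples = 0;
        for (int id : r.coveredPos)
            if (!alreadyCovered.count(id)) ++newExamples;
        if (newExamples >= minNew) {                 // rule adds enough new coverage
            selected.push_back(r);
            alreadyCovered.insert(r.coveredPos.begin(), r.coveredPos.end());
        }
    }
    return selected;
}

int main() {
    std::vector<StarRule> star = {{{1, 2, 3, 4}, 0}, {{3, 4, 5}, 1}, {{1, 2}, 0}};
    auto kept = selectFromStar(star, 2);
    std::printf("kept %zu of %zu rules\n", kept.size(), star.size());
}
```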
Postprocessing
Postprocessing operations consist in: (1) improvement of the learned rules through generalization and specialization operators, (2) optional generation of alternative covers, and (3) formatting of the output for textual and graphical visualization.

Optimization of Rules
When AQ is run in PD mode, rules can be further optimized during postprocessing. Rules can be generalized by dropping values in the reference of the conditions or by enlarging the ranges for continuous and integer attributes. Rules can be further generalized by dropping conditions altogether. Finally, entire rules can be dropped. The opposite operation of specialization is performed only at the condition level, by adding values in discrete attributes and shrinking domains for integer and continuous attributes.
The optimization operation follows heuristics, and at each step computes what is called in AQ the Q value for the new rule [Eq. (5)]. If the Q value increases, then the modified rule is added to the final answer; otherwise it is disregarded.

Alternative Covers
Some of the rules learned during the star generation process, especially with large maxstar values, might not be required in the final output. The final step of the learning process consists in selecting, from the pool of learned rules, only the minimum set required to
cover the positive examples. Thus, some of the rules might not be included in the final answer and can be used to generate alternative solutions. Depending on the presence of multiple strong patterns in the data, alternative covers might be very useful to discriminate between classes.

Association Graphs
Association graphs are used to visualize attributional rules that characterize multivariate dependencies between target classes and input attributes. A program called concept association graph (CAG) was developed by the first author to automatically display such graphs. Figure 1 is a graphical illustration of the rules discovered from an atmospheric release problem.17 Representing relationships with nodes and links is not new nor unique to AQ and has been used in many applications in statistics and mathematics. Each target class is associated only with unique patterns of input parameters. The thickness of the links indicates the weight of a particular parameter-value combination in the definition of the cluster.

FIGURE 1 | Concept association graph relating target groups (e.g., Group 2 and Group 6) to conditions on input attributes such as DiffSST, Wind0Z500, BlackSST, Pressure, Humidity, Date, SLHF, and air temperature.

ADVANTAGES AND DISADVANTAGES OF THE AQ METHODOLOGY
The AQ methodology has intrinsic advantages and disadvantages with respect to other machine learning classifiers, such as neural networks, decision trees, or decision rules. Some of the original disadvantages have been solved or improved with additional components or optimization processes, often at the expense of a much slower or more complex program. Other issues remain unresolved and open to investigation. The following discussion summarizes those that are believed to be the main issues to consider when choosing the AQ methodology over other methods, in particular C4.5, which is the closest widely used machine learning symbolic classifier.

Rich Representation Language
One of the main advantages of AQ consists in the ability to generate compact descriptions which are easy to read and understand. Unlike neural networks, which are black boxes and use a representation language that cannot be easily visualized, AQ rules can be inspected and validated by human experts. Although decision tree classifiers, such as C4.5, can convert the learned trees into rules, the resulting descriptions are expressed in a much simpler representation language than in AQ. For example, C4.5 rules only allow for atomic relationships between attributes and possible values and do not allow for internal disjunctions or multiple ranges. Figure 2 shows the respective covers generated by AQ (left) and C4.5 (right). In this example, internal disjunction allows for a simpler and more compact representation due to intersecting patterns.
The cover generated by AQ [Eq. (6)] is composed of two rules with a single condition, each covering 20 positives and no negatives.

[Positives = 1] ← [X ≥ 5] : p = 20, n = 0
               ← [Y ≥ 5] : p = 20, n = 0    (6)
FIGURE 2 | Different covers generated by AQ (left) and C4.5 (right) using the same dataset.
In contrast, the tree [Eq. (7)] and the corresponding rules [Eq. (8)] generated by C4.5 cannot represent the intersecting concept because of the simpler representation language.

Root: split on X
  X ≥ 5 → Positives
  X < 5: split on Y
    Y ≥ 5 → Positives
    Y < 5 → Negatives    (7)

[Positives = 1] ← [X ≥ 5] : p = 20, n = 0
               ← [X < 5][Y ≥ 5] : p = 10, n = 0    (8)

The C4.5 cover is composed of two rules, one with a single condition and one with two conditions. The first, identical to the rule learned by AQ, covers 20 positives and no negatives, whereas the second covers only 10 positives and no negatives. Although both covers are complete and consistent, the cover of C4.5 is more complex and cannot represent the intersecting concept.

Speed
AQ is considerably slower than C4.5 because of the underlying differences between the 'separate and conquer' learning strategy of AQ and the 'divide and conquer' strategy of C4.5. In C4.5, at each iteration, the algorithm recursively divides the search space. This means that at each iteration the algorithm analyzes an ever smaller number of events. In contrast, AQ compares each positive with all of the negatives. Effectively, AQ can be optimized to consider only a portion of the positive examples, but it still has to consider all the negatives. This is in part due to the ability of representing intersecting concepts, meaning that rules are not bound to prior partitions.

Quality of Decisions
As previously seen, C4.5 performs consecutive splits on a decreasing number of positive and negative examples. This means that at each iteration, decisions are made on a smaller amount of information. In contrast, AQ considers all the search space at each iteration, meaning that all decisions are made with the maximum amount of information available.

Control Parameters
AQ has a very large number of control variables. Such controls allow for a very fine tuning of the algorithm, which can lead to very high quality descriptions. On the other hand, it is often difficult to determine a priori which set of parameters will generate better rules. Although heuristics on how to set the parameters exist, they are often suboptimal, and user fine tuning is required for optimal descriptions.

Matching of Rules and Events
AQ allows for different methods to test events on a set of learned rules. In C4.5, and most other classifiers which do not allow for intersecting concepts, testing an event usually involves checking whether it is included in a rule or not. This is due to the fact that the entire search space is partitioned into one of the target classes. In AQ, each event can be included in more than one rule, or it could be in an area of the search space which has not been assigned to any class. Assigning an unclassified event to one of the target classes involves computing
different degrees of match between the event and the covers of each of the classes and selecting the class with the highest degree of match. Figure 3 shows the example of an unclassified event that lies in an area of the search space not generalized to any of the classes. The degree of match between the event and the cover of each class is computed, and the event is assigned to the class with the highest score, in this case (d1). Several distance functions can be used to match rules and events; at the top level they differ in whether they are strict or flexible.
In strict matching, AQ counts how many times a particular event is covered by the rules of each of the classes. An event can be covered multiple times by the rules of a particular class because the rules might be intersecting due to internal disjunctions. It can also be covered multiple times by rules of different classes if AQ was run in PD mode and inconsistent covers were generated.
In flexible matching, AQ computes the ratio of how many of the attributes are covered over the total number of attributes. Assuming an event with three attributes, if a rule for class A matches three of them, and a rule for class B matches two of them, the event is classified as type A because of a higher flexible degree of match. If the degree of match falls below a certain threshold, AQ classifies the event as unknown. In case more than one class has the same degree of match, the classification is uncertain, and multiple classes are output.
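A sketch of flexible matching as just described: the degree of match is the fraction of a rule's conditions satisfied by the event, and the event is assigned to the class with the highest degree or left unknown below a threshold. The interval form of the conditions and the omission of tie handling are simplifying assumptions.

```cpp
// Sketch of flexible matching: degree of match = fraction of a rule's
// conditions satisfied by the event; the event goes to the class whose cover
// yields the highest degree, or stays "unknown" below a threshold.
#include <cstdio>
#include <string>
#include <vector>

struct Condition { int attr; double lo, hi; };
using Rule  = std::vector<Condition>;     // conjunction of conditions
using Cover = std::vector<Rule>;          // disjunction of rules for one class

double degreeOfMatch(const Rule& r, const std::vector<double>& e) {
    int matched = 0;
    for (const Condition& c : r)
        if (e[c.attr] >= c.lo && e[c.attr] <= c.hi) ++matched;
    return r.empty() ? 0.0 : static_cast<double>(matched) / r.size();
}

std::string classify(const std::vector<std::pair<std::string, Cover>>& covers,
                     const std::vector<double>& e, double threshold) {
    std::string best = "unknown";
    double bestDeg = threshold;
    for (const auto& classCover : covers)
        for (const Rule& r : classCover.second) {
            double d = degreeOfMatch(r, e);
            if (d > bestDeg) { bestDeg = d; best = classCover.first; }
        }
    return best;
}

int main() {
    Cover a = {{{0, 0.0, 5.0}, {1, 0.0, 5.0}, {2, 0.0, 5.0}}};   // class A: one rule
    Cover b = {{{0, 0.0, 5.0}, {1, 0.0, 5.0}, {2, 9.0, 20.0}}};  // class B: one rule
    std::vector<double> event = {1.0, 2.0, 3.0};   // matches all of A, 2/3 of B
    std::printf("classified as %s\n",
                classify({{"A", a}, {"B", b}}, event, 0.5).c_str());
}
```

Strict matching would instead count, per class, how many rules of the cover the event fully satisfies.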
Multiple Target Classes
Decision tree classifiers are advantaged when learning from data with several target classes. They can learn all the target classes in a single tree, whereas AQ must learn a separate cover for each class, using the events of all the other classes as negatives.

Incremental Learning
Decision rule learners have the intrinsic advantage of being able to refine previous rules as new training data become available. This is because of the sequential nature of the 'separate and conquer' strategy of the algorithms. Refinement of rules involves adding or dropping conditions in previously learned rules or splitting a rule into a number of partially subsumed rules. The main advantage is that modification of a rule in the cover does not affect the coverage of the other rules in the cover (although the overall completeness and consistency of the entire cover might be affected). In contrast, although possible, it is more complicated to update a tree, as it often involves several updates that propagate from the leaves of the tree all the way to the root. Additionally, the resulting tree might be suboptimal and very unbalanced, prompting a complete re-evaluation of each node.

Input Background Knowledge
Because of the ability of AQ to update previously learned rules, it is possible to add background knowledge in the form of input rules. This feature is particularly important when there is existing knowledge of the data, or constraints on the attributes, which can lead to simpler rules and a faster execution.

EVOLUTIONARY COMPUTATION GUIDED BY AQ
The term evolutionary computation was coined in 1991 as an effort to combine the different approaches
to simulating evolution to solve computational problems.18–23 Evolutionary computation algorithms are stochastic methods that evolve in parallel a set of potential solutions through a trial-and-error process. Potential solutions are encoded as vectors of values and evaluated according to an objective function (often called a fitness function). The evolutionary process consists of selecting one or more candidate solutions whose vector values are modified to maximize (or minimize) the objective function. If the newly created solutions better optimize the objective function, they are inserted into the next generation; otherwise they are disregarded. While the methodologies and algorithms that are subsumed by this name are numerous, most of them share one fundamental characteristic: they use nondeterministic operators such as mutation and recombination as the main engine of the evolutionary process.
These operators are semi-blind, and the evolution is not guided by knowledge learned in past generations; rather, it is a form of search process executed in parallel. In fact, most evolutionary computation algorithms are inspired by the principles of Darwinian evolution, defined by '…one general law, leading to the advancement of all organic beings, namely, multiply, vary, let the strongest live and the weakest die'.24 The Darwinian evolution model is simple and fast to simulate, and it is domain independent. Because of these features, evolutionary algorithms have been applied to a wide range of optimization problems.25
There have been several attempts to extend the traditional Darwinian operators with statistical and machine learning approaches that use history information from the evolution to guide the search process. The main challenges are to avoid local maxima and increase the rate of convergence. The majority of such methods use some form of memory and/or learning to direct the evolution toward particular directions thought more promising.26–31
Because evolutionary computation algorithms evolve a number of individuals in parallel, it is possible to learn from the 'experience' of entire populations. There is no similar mechanism in biological evolution, because in nature there is no mechanism to evolve entire species. Estimation of distribution algorithms (EDA) are a form of evolutionary algorithm in which an entire population may be approximated with a probability distribution.32 New candidate solutions are not chosen at random but using statistical information from the sampling distribution. The aim is to avoid premature convergence and to provide a more compact representation.
Discriminating between the best and worst performing individuals could provide additional information on how to guide the evolutionary process. The learnable evolution model (LEM) methodology was proposed, in which a machine learning rule induction algorithm was used to learn attributional rules that discriminate between the best and worst performing candidate solutions.33–35 New individuals were then generated according to inductive hypotheses discovered by the machine learning program. The individuals are thus genetically engineered, in the sense that the values of the variables are not randomly or semi-randomly assigned but set according to the rules discovered by the machine learning program.
The basic algorithm of LEM works like Darwinian-type evolutionary methods, that is, it repetitively executes the following main steps:
1. Create a population of individuals (randomly or by selecting them from a set of candidates using some selection method).
2. Apply operators of mutation and/or recombination to selected individuals to create new individuals.
3. Use a fitness function to evaluate the new individuals.
4. Select the individuals which survive into the next generation.
The main difference from Darwinian-type evolutionary algorithms is in the way LEM generates new individuals. In contrast to the Darwinian operators of mutation and/or recombination, AQ conducts a reasoning process in the creation of new individuals. Specifically, at each step (or selected steps) of evolution, a machine learning method generates hypotheses characterizing differences between high-performing and low-performing individuals. These hypotheses are then instantiated in various ways to generate new individuals. The search conducted by LEM for a global solution can be viewed as a progressive partitioning of the search space.
Each time the machine learning program is applied, it generates hypotheses indicating the areas in the search space that are likely to contain high-performing individuals. New individuals are selected from these areas and then classified as belonging to a high-performance or a low-performance group, depending on their fitness value. These groups are then differentiated by the machine learning program, yielding a new hypothesis as to the likely location of the global solution.
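The LEM loop described in this section can be sketched end to end. Here the 'hypothesis' learned from the high-performing group is a simple per-variable interval, a crude stand-in for the attributional rules AQ would induce, and the fitness function and population parameters are arbitrary assumptions.

```cpp
// Toy sketch of the LEM loop: evaluate a population, split it into high- and
// low-performing groups, learn a description of where the good individuals
// lie, and instantiate new individuals inside that description. The per-
// variable bounding box below is a crude stand-in for AQ rule induction.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

using Individual = std::vector<double>;

double fitness(const Individual& x) {            // toy objective: minimize sum of squares
    double s = 0; for (double v : x) s += v * v; return -s;
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> init(-10.0, 10.0);
    std::vector<Individual> pop(100, Individual(2));
    for (auto& x : pop) for (double& v : x) v = init(rng);

    for (int gen = 0; gen < 20; ++gen) {
        // Rank by fitness; the top 30% form the high-performing group.
        std::sort(pop.begin(), pop.end(), [](const Individual& a, const Individual& b) {
            return fitness(a) > fitness(b); });
        size_t nHigh = pop.size() * 3 / 10;
        // "Learn" a description of the high group: one interval per variable.
        Individual lo = pop[0], hi = pop[0];
        for (size_t i = 0; i < nHigh; ++i)
            for (size_t d = 0; d < lo.size(); ++d) {
                lo[d] = std::min(lo[d], pop[i][d]);
                hi[d] = std::max(hi[d], pop[i][d]);
            }
        // Instantiate new individuals inside the learned description,
        // replacing the lower-performing part of the population.
        for (size_t i = nHigh; i < pop.size(); ++i)
            for (size_t d = 0; d < lo.size(); ++d) {
                std::uniform_real_distribution<double> in(lo[d], hi[d]);
                pop[i][d] = in(rng);
            }
    }
    std::printf("best fitness after 20 generations: %.4f\n", fitness(pop[0]));
}
```

In AQ4SD the interval "learner" is replaced by the full rule induction described earlier, and new individuals are instantiated from the learned attributional rules rather than from a bounding box.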
NMSQE = (Co − Cs)² / Co²    (9)

Cs = P1 P2 (P3 + P4)    (10)

where Co is the observed and Cs the simulated ground concentration.
A stable atmosphere causes much narrower plumes, which result in higher ground concentrations.

Results
Experiments were performed for each of the 68 prairie grass releases. The algorithm started by generating a population of random candidate solutions. Each candidate solution is a potential source and is encoded as a vector of eight variables: x, y, z, θ, U, Q, S, and ψ. For each potential source, the resulting concentration field is computed by Eq. (10). The fitness score of each source is defined as the error between the observed ground concentration and the simulated concentration at the same locations, computed according to Eq. (9).
The algorithm proceeds by dividing the candidate solutions into those with high and low fitness scores, and learning patterns (rules) which characterize the attribute-value combinations that discriminate between the two groups. New candidate solutions are generated according to the learned patterns. The process continues for 500 iterations. The algorithm was run using a population of 100 candidate solutions. At each step, the top and lowest 30% of the solutions were used as members of the high- and low-performing groups. Each experiment was repeated 10 times to study the sensitivity of the
algorithm to the initial guess of solutions. The results are also shown in terms of atmospheric type.
A total of 680 AQ4SD source detections were performed, namely 10 for each of the 68 experiments. There is a considerably higher number of experiments of type D, as this was the predominant atmospheric condition at the time of the releases. In order to compensate for the different distributions of experiments, the results are normalized using this information.
Figure 6 shows a summary of the different errors, defined by Eq. (9), achieved by AQ4SD as a function of the atmospheric type. A threshold of 1.0 was assigned as the minimum fitness value to recognize a source, because such a value indicates that AQ4SD identified the source within 50 m of the correct solution. Considerably better results were achieved for atmospheric type D, and worse results for atmospheric types A and F. This pattern reflects the accuracy of the dispersion model (10) in reproducing the concentration field under different stability conditions. The Gaussian model is expected to perform better in neutral conditions (D), whereas convective turbulence (A) and stable stratification (F) involve more complex dispersion mechanisms which cannot be accounted for, resulting in a lack of accuracy. Figure 6 is consistent with the notion that the algorithm performs better when the fit of the dispersion model is better.

FIGURE 6 | Errors of AQ4SD divided by atmosphere type.

Figure 7 shows a summary for all the 68 prairie grass experiments. Each atmosphere type is color coded. With the exception of six experiments (3, 4, 7, 25, 52), each of type A or type F, AQ4SD always achieves a minimum fitness of 1.0, which was the target acceptance threshold for this experiment. The overall average fitness error is 0.6.
Figure 8 shows a summary of the results in terms of x, y, z, θ (called WA, wind angle), and Q. For all experiments, the original source was located at x, y, z = 0, 0, 0. The WA and Q errors are defined, respectively, as the identified angle minus the real angle, and the identified Q minus the real Q. Once again, the atmosphere type is color coded using the same colors as in Figure 7. The ideal solution would be all the points located at 0, for all variables. In the figure, although this is primarily true for most variables, the strong dependence between y and θ is evident. The units for the x, y, and z directions are meters. Therefore, the errors associated with changes in the z values are actually very small, as z only varies 10 m at most. There are larger errors for the alongwind dimension x compared with the crosswind dimension y. This is to be attributed to the concentration field, which has a larger gradient in the crosswind direction compared with the alongwind direction. Note the correlation between the error in θ and the y dimension. Such behavior exemplifies the algorithm's skill at compensating for errors in y through changes in θ. The variable that seems to be harder to optimize is Q. Such results are primarily due to the correlation between Q and U [P1 in Eq. (10)].
FIGURE 8 | Distributions of the errors in X, Y, Z, and Q for all experiments, color coded by atmosphere type (A–F).
DISCUSSION
This article introduces the main concepts of the AQ methodology and discusses its advantages and disadvantages. It describes a new implementation of the AQ methodology, AQ4SD, applied to the problem of source detection of atmospheric releases. In that context, AQ4SD is used as the main engine of evolution for an evolutionary computation process aimed at finding the source of an atmospheric release, using only observed ground measurements and a numerical atmospheric dispersion model. Experiments were performed to identify the source of each of the 68 releases of the prairie grass field experiment.
The numerical experiments show that in all but five cases the methodology was able to achieve a fitness score considered acceptable for the correct identification of the source. The performance of the algorithm has been very satisfactory, considering the error intrinsic in the measured data and the approximation of the dispersion model. AQ4SD also proved to be quite efficient in terms of the number of model simulations required for each optimization case. This is one of the main advantages of the proposed methodology compared with traditional evolutionary algorithms, because a fitness evaluation for a complete source detection procedure may require computationally expensive numerical simulations. In particular, for larger-scale dispersion problems, more sophisticated and computationally expensive meteorological and dispersion models need to be run concurrently to evaluate the fitness of each candidate solution.45
The proposed methodology has a wide domain of applicability, not restricted only to the source detection problem. It can be used for a variety of optimization problems and is particularly advantageous for those problems where the fitness function evaluation involves a computationally expensive operation.

NOTES
a. Some versions of AQ can also be run to generate rules with the largest number of attributes (called characteristic mode), but such a mode merely consists in generating discriminant rules and adding conditions that include all events in the class but carry no discriminatory information.
b. Some versions of AQ sort all or a part of the negative events according to a distance metric. Although such a mechanism has been shown to generate simpler rules in specific cases, because of the additional complexity of defining such distance metrics (which is not always possible, as in the case of nominal attributes), paired with the additional computational resources required, the advantage of such sorting is not clear. AQ4SD can be run with and without sorting, and experiments have shown no or negligible improvements.
ACKNOWLEDGEMENTS
This material is partly based upon work supported by the National Science Foundation under Grant no: AGS
0849191.
REFERENCES
1. Michalski R. On the quasi-minimal solution of the general covering problem. Proceedings of the Fifth International Symposium on Information Processing (FCIP 69), Bled, Yugoslavia, vol A3; October 3–11, 1969, 125–128.
2. Michalski R. A theory and methodology of inductive learning. Mach Learn 1983, 1:83–134.
3. Michalski R. AQVAL/1 computer implementation of a variable-valued logic system VL1 and examples of its application to pattern recognition. First International Joint Conference on Pattern Recognition, Washington, D.C., 1973, 3–17.
4. Chilausky R, Jacobsen B, Michalski R. An application of variable-valued logic to inductive learning of plant disease diagnostic rules. Proceedings of the Sixth International Symposium on Multiple-valued Logic. Logan, UT: IEEE Computer Society Press, Los Alamitos; 1976, 233–240.
5. Steinberg D, Colla P. CART: Tree-structured Non-parametric Data Analysis. San Diego, CA: Salford Systems; 1995.
6. Quinlan J. C4.5: Programs for Machine Learning. San Mateo: Morgan Kaufmann; 1993.
7. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. Wadsworth International Group; 1984.
8. Michalski R, Mozetic I, Hong J, Lavrac N. The multi-purpose incremental learning system AQ15 and its testing application to three medical domains. Proceedings of the 1986 AAAI Conference, Philadelphia, PA, vol. 104; August 11–15, 1986, 1041–1045.
9. Kaufman K, Michalski R. The AQ18 Machine Learning and Data Mining System: An Implementation and User's Guide. MLI Report. Fairfax, VA: Machine Learning and Inference Laboratory, George Mason University; 1999.
10. Mitchell T. Machine Learning. New York: McGraw-Hill; 1997.
11. Cervone G, Panait L, Michalski R. The development of the AQ20 learning system and initial experiments. Proceedings of the Fifth International Symposium on Intelligent Information Systems, June 18–22, 2001, Zakopane, Poland: Physica Verlag; 2001, 13.
12. Keesee APK. How Sequential-Cover Data Mining Programs Learn. College of Science. Fairfax, VA: George Mason University; 2006.
13. Austern M. Generic Programming and the STL: Using and Extending the C++ Standard Template Library. 1998.
14. Gamma E, Helm R, Johnson R, Vlissides J. Design Patterns: Elements of Reusable Object-Oriented Software. Westford, MA: Addison-Wesley; 1995.
15. Bloedorn E, Wnek J, Michalski R. Multistrategy constructive induction: AQ17-MCI. Rep Mach Learn Infer Lab 1993, 1051:93–4.
16. Wnek J, Michalski R. Hypothesis-driven constructive induction in AQ17-HCI: a method and experiments. Mach Learn 1994, 14:139–168.
17. Cervone G, Franzese P, Ezber Y, Boybeyi Z. Risk assessment of atmospheric emissions using machine learning. Nat Hazards Earth Syst Sci 2008, 8:991–1000.
18. Holland J. Adaptation in Natural and Artificial Systems. Cambridge, MA: The MIT Press; 1975.
19. Goldberg DE. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison Wesley; 1989.
20. Bäck T. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, and Genetic Algorithms. Oxford, NY: Oxford University Press; 1996.
21. Michalewicz Z. Genetic Algorithms + Data Structures = Evolution Programs. 3rd ed. Berlin: Springer-Verlag; 1996.
22. Fogel L. Intelligence Through Simulated Evolution: Forty Years of Evolutionary Programming. Wiley Series on Intelligent Systems. New York: John Wiley & Sons, Inc.; 1999.
23. De Jong K. Evolutionary computation: a unified approach. Proceedings of the 2008 GECCO Conference on Genetic and Evolutionary Computation. New York: ACM; 2008, 2245–2258.
24. Darwin C. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. London: Oxford University Press; 1859.
25. Ashlock D. Evolutionary Computation for Modeling and Optimization. Berlin Heidelberg: Springer-Verlag; 2006.
26. Grefenstette J. Incorporating problem specific knowledge into genetic algorithms. Genetic Alg Simul Annealing 1987, 4:42–60.
27. Grefenstette J. Lamarckian learning in multi-agent environments. Proceedings of the Fourth International Conference on Genetic Algorithms, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1991.
28. Sebag M, Schoenauer M. Controlling Crossover through Inductive Learning. Lecture Notes in Computer Science. London: Springer-Verlag; 1994, 209–209.
29. Sebag M, Schoenauer M, Ravise C. Inductive learning of mutation step-size in evolutionary parameter optimization. Lecture Notes in Computer Science. London: Springer-Verlag; 1997, 247–261.
30. Reynolds R. Cultural Algorithms: Theory and Applications. McGraw-Hill's Advanced Topics in Computer Science Series. Maidenhead, England: McGraw-Hill Ltd.; 1999, 367–378.
31. Hamda H, Jouve F, Lutton E, Schoenauer M, Sebag M. Compact unstructured representations for evolutionary design. Appl Intell 2002, 16:139–155.
32. Lozano J. Towards a New Evolutionary Computation: Advances in the Estimation of Distribution Algorithms. Springer; 2006.
33. Michalski R. Learnable evolution: combining symbolic and evolutionary learning. Proceedings of the Fourth International Workshop on Multistrategy Learning (MSL'98). 1999, 14–20.
34. Cervone G, Michalski R, Kaufman K, Panait L. Combining machine learning with evolutionary computation: recent results on LEM. Proceedings of the Fifth International Workshop on Multistrategy Learning (MSL-2000), Guimaraes, Portugal; 2000, 41–58.
35. Cervone G, Kaufman K, Michalski R. Experimental validations of the learnable evolution model. Proceedings of the 2000 Congress on Evolutionary Computation, La Jolla, CA, vol. 2; July 16–19, 2000.
36. Pudykiewicz J. Application of adjoint tracer transport equations for evaluating source parameters. Atmos Environ 1998, 32:3039–3050.
37. Hourdin F, Issartel JP. Sub-surface nuclear tests monitoring through the CTBT xenon network. Geophys Res Lett 2000, 27:2245–2248.
38. Enting I. Inverse Problems in Atmospheric Constituent Transport. Cambridge, NY: Cambridge University Press; 2002, 392.
39. Gelman A, Carlin J, Stern H, Rubin D. Bayesian Data Analysis. Chapman & Hall/CRC; 2003, 668 pp.
40. Chow F, Kosović B, Chan T. Source inversion for contaminant plume dispersion in urban environments using building-resolving simulations. Proceedings of the 86th American Meteorological Society Annual Meeting, Atlanta, GA, January 2006, 12–22.
41. Senocak I, Hengartner N, Short M, Daniel W. Stochastic event reconstruction of atmospheric contaminant dispersion using Bayesian inference. Atmos Environ 2008, 42:7718–7727.
42. Haupt SE. A demonstration of coupled receptor/dispersion modeling with a genetic algorithm. Atmos Environ 2005, 39:7181–7189.
43. Haupt SE, Young GS, Allen CT. A genetic algorithm method to assimilate sensor data for a toxic contaminant release. J Comput 2007, 2:85–93.
44. Allen CT, Young GS, Haupt SE. Improving pollutant source characterization by better estimating wind direction with a genetic algorithm. Atmos Environ 2007, 41:2283–2289.
45. Delle Monache L, Lundquist J, Kosović B, Johannesson G, Dyer K, et al. Bayesian inference and Markov chain Monte Carlo sampling to reconstruct a contaminant source on a continental scale. J Appl Meteor Climatol 2008, 47:2600–2613.
46. Pasquill F. The estimation of the dispersion of windborne material. Meteorol Magazine 1961, 90:33–49.
47. Pasquill F, Smith F. Atmospheric Diffusion. Chichester, UK: Ellis Horwood; 1983.
48. Arya PS. Air Pollution Meteorology and Dispersion. Oxford, NY: Oxford University Press; 1999.
49. Barad M, Haugen D. Project Prairie Grass, A Field Program in Diffusion. United States Air Force, Air Research and Development Command, Air Force Cambridge Research Center; Cambridge, MA, 1958.