Suprb: A Supervised Rule-Based Learning System For Continuous Problems
Michael Heider∗, David Pätzel∗, Jörg Hähner
Organic Computing Group, University of Augsburg, Augsburg, Germany
michael.heider@informatik.uni-augsburg.de, david.paetzel@informatik.uni-augsburg.de, joerg.haehner@informatik.uni-augsburg.de
ABSTRACT

We propose the SupRB learning system, a new accuracy-based Pittsburgh-style learning classifier system (LCS) for supervised learning on multi-dimensional continuous decision problems. SupRB learns an approximation of a quality function from examples (consisting of situations, choices and associated qualities) and is then able to make an optimal choice as well as predict the quality of a choice in a given situation. One area of application for SupRB is the parametrization of industrial machinery. In this field, acceptance of the recommendations of machine learning systems relies heavily on operators' trust. While an essential and much-researched ingredient for that trust is prediction quality, it seems that this alone is not enough. At least as important is a human-understandable explanation of the reasoning behind a recommendation. While many state-of-the-art methods such as artificial neural networks fall short of this, LCSs such as SupRB provide human-readable rules that can be understood very easily. The prevalent LCSs are not directly applicable to this problem as they lack support for continuous choices. This paper lays the foundations for SupRB and shows its general applicability on a simplified model of an additive manufacturing problem.

KEYWORDS

Learning Classifier Systems, Evolutionary Machine Learning, Manufacturing

∗ These authors contributed equally to the paper.

1 INTRODUCTION

Parametrization of industrial machinery is often determined by human operators. These specialists usually obtained most of their expertise through years of experimental exploration based on prior knowledge about the system or process at play. Transferring that knowledge to other operators with as little loss as possible (e. g. to new colleagues whenever experienced operators retire, or to end users of the machinery after commissioning is finished) is a challenge: humans' ability to attribute parameter choices exactly to the situations that led to them, and then to communicate this knowledge in a comprehensible manner, tends to be rather limited, which forces new operators to repeat the exploration and learn for themselves. Machine learning (ML) can help with this, for example, by supporting new operators or users with recommendations, or simply by recording existing experiences and extracting knowledge to make it available at a later point.

In our use case, this knowledge consists of mappings from parametrizations for the machine and variables beyond their influence to an expected process quality resulting from them; abstractly speaking, a collection of if-then rules with outcomes subject to noise. While many ML methods represent knowledge in a less or differently structured way, this is not the case for learning classifier systems (LCSs), whose models are collections of human-readable if-then rules constructed using ML techniques and model structure optimizers [12, 31]. This learning scheme is thus naturally suited to incorporating an operator's knowledge, as externally specified rules can be included directly. Also, due to their inner structure, LCSs can more easily provide explanations for their predictions. Due to this transparency towards human users, compared to black-box systems, operators can be expected to place increased trust in the contained knowledge and thus in the correctness of the resulting recommendations, which is essential for these systems' actual applicability.

This paper proposes the SupRB learning system, a new accuracy-based Pittsburgh-style LCS for supervised learning on continuous multi-dimensional decision problems such as the parametrization of industrial machinery. Pittsburgh-style LCSs [26] have a model structure optimizer (in classic Pittsburgh-style systems, a genetic algorithm (GA)) operate on a population of rule collections of variable length, each of which represents a potential solution to the learning problem at hand.

This work focuses on solving the problem of parametrization optimization of industrial machinery, which is defined in Section 2. An LCS architecture that solves this problem, SupRB, is introduced in Section 3, along with its first implementation, SupRB-1, in Section 4. SupRB-1 is evaluated on different function approximation problems in Section 5. Section 6 gives an account of related research.

2 PARAMETRIZATION OPTIMIZATION

Parametrization optimization is the process of finding the best parameter choice, or parametrization, for a given system S with regard to some quality measure q. One such parametrization can be viewed as a vector a ∈ A ⊆ R^{D_A}, where A is the parametrization space, D_A is the number of parameters to be optimized and each component of a corresponds to one adjustable system parameter of S. Which parametrization is optimal regarding q depends on a number of environmental factors (e. g. ambient temperature or humidity) in addition to characteristics of the process, machine, material and the part to be produced. For a given system S, we call one instance of those additional factors a situation; situations can again be assumed to be represented by a vector x ∈ X ⊆ R^{D_X}, where X is called the situation space and D_X is the fixed dimensionality of situations for S. Having defined parametrizations and situations, we can now specify the quality measure's form as

    q : {X, A}^T → R    (1)

where every {x, a}^T = (x_1, ..., x_{D_X}, a_1, ..., a_{D_A})^T ∈ {X, A}^T is a stacked vector consisting of a situation (x_1, ..., x_{D_X}) ∈ X and a parametrization (a_1, ..., a_{D_A}) ∈ A. For readability, we write q(x, a) instead of q({x, a}^T). The target of q is a single scalar which is possibly derived appropriately from a vector of multiple quality features. We assume that q(x, a) is at least continuous in a, which we think is realistic in most real-world scenarios:

    lim_{a → a_0} q(x, a) = q(x, a_0)    (2)

With the definition of q, we can now define the optimization problem that describes the search for an optimal parametrization for a given situation x:

    maximize_a q(x, a)    (3)

Note that, realistically, neither q nor its derivative can be assumed to be known (albeit either of those would simplify the problem greatly). Instead, we assume that the only information about q is a fixed set of examples.

Thus, the learning problem we consider is: given a fixed set of N examples of situations and parametrizations {{x, a}^T} as well as their respective qualities {q(x, a)}, learn to predict for a given unknown situation x ∈ X a parametrization â_max(x) ∈ A for which

    â_max(x) ≈ a_max(x) = argmax_a q(x, a)    (4)

where a_max(x) is the actual optimal parametrization in situation x.

A natural way of measuring improvements on this learning problem is the following: a model can be said to be an improvement over another on a set of situations X_eval ⊂ X if the actual quality of the predicted optimal parametrizations on those situations is closer to the actual quality of the actual optimal parametrizations. This can be quantified, for example, by using the mean error for an error measure L on the model's predictions:

    (1 / |X_eval|) Σ_{x ∈ X_eval} L(q(x, a_max(x)), q(x, â_max(x)))    (5)

3 AN LCS ARCHITECTURE FOR CONTINUOUS PROBLEMS

This section presents a high-level view of SupRB, the overall LCS architecture we propose in order to solve the parametrization optimization problems introduced in the previous section.

3.1 Model structure

Just like other LCSs, SupRB forms a global model from a population C of local models, called classifiers; in the case of SupRB, the global model is meant to approximate the quality measure q defined in Section 2. Each classifier is responsible for a subspace of the input space X; which subspace it is for a certain classifier is specified by that classifier's condition, which is also sometimes called its localization. The set of classifier conditions forms the overall model structure of SupRB; this structure fulfills a similar role as the graph structure of a neural network in that it needs to be chosen carefully in order for the system to perform well. At that, performing well is not just about approximating q as closely as possible; there are usually additional goals such as being explainable (cf. Section 1) which require the model structure's complexity to be as low as possible.

3.2 Local models: Classifiers

A classifier c consists of three main components:

• Some representation of a matching function m_c : X → {T, F}. We say that c matches situation x iff m_c(x) = T. Correspondingly, we say that c does not match x iff m_c(x) = F.
• Some local model approximating q on all x ∈ X which the classifier matches.
• An estimation of the classifier's goodness-of-fit on the situations it matches (solely used in classifier mixing).

Be aware that the classifiers' matching functions' domain is X and not {X, A}^T. This increases explainability greatly, as an ideal partitioning of X (total, without overlaps) entails that there is exactly one rule regarding the parametrization for each possible situation. Conversely, if we partitioned {X, A}^T optimally, there would possibly still be multiple rules for a given situation, as the system might have partitioned in the dimensions of A as well. Since we assume continuity of q(x, a) regarding a (see (2)), partitioning in A would only be necessary if the local models could not capture q's behaviour in A, for example because it is highly multi-modal. In that case, partitioning in A might be sensible, as would be using more sophisticated local models.

3.3 Epoch-wise training

Training an LCS can generally be divided into two subproblems: first, the classifiers' local models need to be trained so that the predictions they make on the subspace they are responsible for are as accurate as possible. Secondly, the overall model structure of the LCS has to be optimized: the classifiers' localizations have to be aligned in such a way that every local model can capture the characteristics of the subspace it is assigned to as well as possible.

Michigan-style LCSs such as XCS(F) [6, 37] or ExSTraCS [32] try to solve these problems incrementally by, for each seen example, performing a single update on some of the classifiers' local models and then improving these classifiers' localizations. This approach is especially sensible when learning has to be incremental (e. g. in reinforcement learning settings). However, due to the learning problem being non-incremental (all training data is available from the very start), SupRB can be trained non-incrementally. This means that training can be done in two separate phases that are repeated alternatingly until overall convergence [8], each phase being responsible for solving one of the subproblems:

(1) (Re-)train each local classifier model on the data that it (now) matches.
(2) Optimize the model structure (i. e. the set of classifier conditions), for example using a heuristic such as a GA.

At that, each phase is executed until it converges or some termination criterion, such as a fixed number of updates, is met. It is important to note that during fitting of each phase's parameters, the parameters of the respective other phase are considered fixed; otherwise, convergence cannot be guaranteed. If a GA is used for the model structure optimization, then this GA works on a population of classifier populations; these kinds of systems are commonly called Pittsburgh-style LCSs. However, since we expect many optimization methods to be applicable to this (see Section 7), an implementation of SupRB does not necessarily contain a GA. Nevertheless, the general SupRB architecture should probably be placed into or close to the Pittsburgh-style category. The implementation of SupRB we present in Section 4 is definitely a Pittsburgh-style LCS since it uses a GA to optimize the model structure.

Dividing the learning process into two distinct phases is advantageous. First of all, optimization of the process's hyperparameters can be done more straightforwardly because the hyperparameters are divided into two disjoint sets, one for each of the two phases. Besides that, the learning process is analysed more easily because the overall optimization problem of fitting the model to the data decomposes nicely into the two subproblems solved by the phases [8], 'nicely' meaning that solving the subproblems independently of each other solves the overall problem. For example, if learning does not work and, upon inspection, the classifier weight updates converge correctly and fast enough, it is immediately clear that the model structure optimization is the culprit and corresponding measures can be taken.

[…] its function approximation:

    â_max = argmax_{a ∈ A_local} q̂(x_0, a)    (9)

For other functions, where an exact analytical solution is unknown or impractical, there are other options that range from root-finding algorithms [4] to heuristics such as hill climbing with random restarts [24], genetic algorithms [11] or chemical reaction optimization [15]. Although these non-analytical methods require a comparably larger amount of computation time, they are feasible in the setting SupRB targets: industrial processes that are being optimized are usually preplanned anyway, which takes a lot longer than any of the heuristics needs to find SupRB's parametrization choice.

4 SUPRB-1: A FIRST IMPLEMENTATION OF SUPRB

While the previous section introduced SupRB's general architecture and learning process as well as its desired prediction capabilities, we now want to give a detailed account of SupRB-1, a first implementation of that system¹.
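The alternating two-phase scheme of Section 3.3 can be sketched in code. The following is an illustrative toy sketch, not SupRB-1 itself: conditions are one-dimensional intervals on X, local models are constants, and the model structure optimizer is a trivial mutate-and-revert loop standing in for the GA. All names are ours.

```python
import random

class Classifier:
    """A classifier: an interval condition on X plus a constant local model."""
    def __init__(self, low, high):
        self.low, self.high = low, high   # condition: matches x iff low <= x < high
        self.prediction = 0.0             # local model (here: just a constant)
        self.error = float("inf")         # goodness-of-fit estimate (mean squared error)

    def matches(self, x):
        return self.low <= x < self.high

    def fit(self, data):
        """Phase 1: (re)train the local model on the examples it currently matches."""
        matched = [(x, q) for x, q in data if self.matches(x)]
        if matched:
            self.prediction = sum(q for _, q in matched) / len(matched)
            self.error = sum((q - self.prediction) ** 2 for _, q in matched) / len(matched)

def optimize_structure(population, data, rng):
    """Phase 2 (placeholder for a GA): jitter interval bounds, revert if not strictly better."""
    for c in population:
        old = (c.low, c.high, c.prediction, c.error)
        c.low += rng.uniform(-0.05, 0.05)
        c.high += rng.uniform(-0.05, 0.05)
        c.fit(data)
        if c.error >= old[3]:             # keep only strict improvements
            c.low, c.high, c.prediction, c.error = old

def train(data, n_classifiers=4, epochs=20, seed=0):
    """Alternate the two phases for a fixed budget (stand-in for 'until convergence')."""
    rng = random.Random(seed)
    xs = [x for x, _ in data]
    lo, hi = min(xs), max(xs) + 1e-9
    width = (hi - lo) / n_classifiers
    population = [Classifier(lo + i * width, lo + (i + 1) * width)
                  for i in range(n_classifiers)]
    for _ in range(epochs):
        for c in population:
            c.fit(data)                           # phase 1: local models
        optimize_structure(population, data, rng)  # phase 2: model structure
    return population

def predict(population, x):
    """Mix the matching classifiers, weighting by inverse error."""
    matched = [c for c in population if c.matches(x)]
    if not matched:
        return 0.0
    weights = [1.0 / (c.error + 1e-9) for c in matched]
    return sum(w * c.prediction for w, c in zip(weights, matched)) / sum(weights)
```

Note how the revert step in `optimize_structure` keeps the parameters of one phase fixed while the other is updated, mirroring the convergence argument above.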
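For quality models without an analytical maximizer (cf. (9)), the text mentions hill climbing with random restarts as one option for computing â_max. A minimal sketch (all names ours), maximizing an arbitrary black-box q̂ over a box-bounded parametrization space:

```python
import random

def hill_climb_argmax(q_hat, bounds, restarts=10, steps=200, step_size=0.1, seed=0):
    """Approximate argmax_a q_hat(a) over a box via hill climbing with random restarts."""
    rng = random.Random(seed)
    best_a, best_q = None, float("-inf")
    for _ in range(restarts):
        a = [rng.uniform(lo, hi) for lo, hi in bounds]  # random restart point
        cur = q_hat(a)
        for _ in range(steps):
            # propose a Gaussian perturbation, clipped to the bounds
            cand = [min(max(ai + rng.gauss(0.0, step_size), lo), hi)
                    for ai, (lo, hi) in zip(a, bounds)]
            cand_q = q_hat(cand)
            if cand_q > cur:          # greedy uphill move
                a, cur = cand, cand_q
        if cur > best_q:
            best_a, best_q = a, cur
    return best_a, best_q
```

As argued above, the extra function evaluations are affordable in the targeted setting, since process planning dominates the runtime of any such heuristic.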
[…] using expert knowledge. The process itself consists of material (usually thermoplastic polymers) being melted and then extruded to gradually construct a part whose quality depends on a number of factors such as the temperature to which the material is heated. For a given material (one dimension of X), the resulting part quality varies with increasing temperature: up until the melting point, any resulting part's quality can be expected to be zero, as no part can be produced at all. With a further increase in temperature, quality tends to increase as well, at a rate depending on material properties, up until some unknown point where the material becomes too soft to remain in shape, which degrades part quality. At even higher temperatures, the material might simply fail to successfully construct the part at all, at which point quality can effectively be treated as zero again. This relationship of material, temperature and resulting quality can be simplified to a Gaussian function.

The FDM-based AM process we consider contains five continuous (obviously a simplification by itself) situation dimensions: material, printer, room temperature, humidity and the kind of part to produce. These situations interact with six continuous parameters: extrusion temperature, print bed temperature, cooling fan speed, extruder movement speed, material retraction speed and retraction distance (the first four parameters are rather self-explanatory; the latter two come into effect whenever the extruder cannot construct the part using continuous movement and has to move without extruding material). Assuming that every combination of situation dimensions and parameters can be modeled by a Gaussian function as motivated above leads to the following overall model for the quality function:

    q(y) = q((x_1, ..., x_5, a_1, ..., a_6)^T) = Σ_{j, k ∈ {1, ..., 11}, k ≠ j} exp(−((y_j, y_k)^T − s)^T P_{j,k} ((y_j, y_k)^T − s))    (20)

Here, the P_{j,k} are randomly generated positive semi-definite matrices in R^{2×2} with eigenvalues in [0, 30] (which ensures sensible scaling) and s is a randomly generated vector in [−1, 1]^2 representing the shift of the Gaussian function. We did not include noise in our model; however, an evaluation on more realistic noisy environments is already planned.

We generated 30 such functions from consecutive random seeds and used these to create 30 sets of training data for SupRB-1. These sets each contained 2000 examples for training (1000 training and 1000 validation examples) and 1000 examples we held out for evaluation. On each of those data sets, SupRB-1 was run once for 500 generations using all standard parameters but initial individual sizes of 50 and k = 10^{-6}, and then evaluated; the results are shown in Figures 1 and 2. Note that having 30 different functions to test SupRB-1 on leads to a better estimate of its general performance, at the cost of a higher variance of results than when repeatedly testing on a single function.

We compare SupRB-1's results with those achieved by a two-layer fully connected artificial neural network (ANN) trained on identical data and functions. We performed simple automated architecture optimization on the ANN in terms of error during validation, determining the optimal architecture for the given problems at 512 and 8 hidden cells, respectively, while using ReLU activation functions twice; model complexity was not factored into the architecture optimization strategy. We show the results of this architecture on the holdout datasets as a baseline.

It can be seen in Figure 1a that SupRB-1's quality predictions' RMSE on holdout data improves rapidly over the first 50 generations and then seems to converge at around 1.02 in generation 100, which is on par with the ANN baseline. At around generation 200, however, the error starts to increase again and later fluctuates around a value of 1.1. The same behaviour can be observed for the parametrization choices' RMSE on holdout data (Figure 1b), although the baseline is missed on that metric. The same can be seen, however, when looking at the quality predictions' RMSE on the training data (Figure 1c); this means that the problem can be detected and averted during training, especially since the number of examples that are not matched by any classifier increases in a similar manner (Figure 2b).

When looking at the number of classifiers in SupRB-1's elitist (Figure 2a), a steady decrease up to a convergence at only 2-4 classifiers can be observed. By construction, it is highly unlikely that the AM-Gauss problem can be solved satisfyingly by this few local models. We tried to alleviate that problem by adding a random classifier during mutation with a probability of 0.5 (this is also used for the shown runs), but to no avail. It can be seen that, between generations 100 and 200, the number of classifiers lies between 13 and 35, which seems to be the sweet spot with the used hyperparameters. The fact that model complexity still decreases after finding that sweet spot leads us to believe that there is an issue with SupRB-1's fitness measure. And indeed: it accepts individuals with slightly worse error in favour of a lower complexity (see (13)), which, when applied repeatedly, can result in classifier population deterioration […]
Figure 1: Root mean squared error (with standard deviation (SD)) of different metrics of SupRB-1's elitist's performance over 500 generations, averaged over 30 random AM-Gauss problems. (a) Quality predictions on holdout data with ANN baseline. (b) Parametrization choices on holdout data with ANN baseline. (c) Quality predictions on training data {X, A}^T. [Plot data omitted.]
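A generator for AM-Gauss-style quality functions in the spirit of (20) can be sketched as follows. This is our reading of the notation and a sketch only: we draw a single shift vector s per function (a per-pair shift would be a straightforward variation) and obtain positive semi-definite matrices with eigenvalues in [0, 30] by rotating a random diagonal matrix; all function names are ours.

```python
import math
import random

def random_psd_2x2(rng, eig_max=30.0):
    """Random symmetric PSD 2x2 matrix with eigenvalues in [0, eig_max]:
    a random diagonal matrix rotated by a random angle."""
    l1, l2 = rng.uniform(0.0, eig_max), rng.uniform(0.0, eig_max)
    t = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(t), math.sin(t)
    return [[l1 * c * c + l2 * s * s, (l1 - l2) * c * s],
            [(l1 - l2) * c * s, l1 * s * s + l2 * c * c]]

def make_quality_function(seed, dim=11):
    """Build q(y) as a sum over ordered pairs (j, k), j != k, of the terms
    exp(-((y_j, y_k)^T - s)^T P_{j,k} ((y_j, y_k)^T - s)), mirroring (20)."""
    rng = random.Random(seed)
    shift = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]  # the shift vector s
    mats = {(j, k): random_psd_2x2(rng)
            for j in range(dim) for k in range(dim) if j != k}

    def q(y):
        total = 0.0
        for (j, k), P in mats.items():
            v0, v1 = y[j] - shift[0], y[k] - shift[1]
            # quadratic form (v^T P v) of the symmetric matrix P
            quad = P[0][0] * v0 * v0 + 2.0 * P[0][1] * v0 * v1 + P[1][1] * v1 * v1
            total += math.exp(-quad)  # PSD P makes quad >= 0, so each term is in (0, 1]
        return total

    return q
```

With dim = 11 there are 110 ordered pairs, so q is bounded by 110; the 30 test functions of the experiments would correspond to calling `make_quality_function` with 30 consecutive seeds.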