
Active Learning of Markov Decision Processes using the Baum-Welch Algorithm (Extended)

Giovanni Bacci, Dept. of Computer Science, Aalborg, Denmark
Anna Ingólfsdóttir, Dept. of Computer Science, Reykjavík, Iceland
Kim G. Larsen, Dept. of Computer Science, Aalborg, Denmark
Raphaël Reynouard, Dept. of Computer Science, Reykjavík, Iceland

arXiv:2110.03014v1 [cs.LG] 6 Oct 2021

Acknowledgments: R. Reynouard and A. Ingólfsdóttir have been supported by the project Learning and Applying Probabilistic Systems (nr. 206574-051) of the Icelandic Research Fund. K. G. Larsen has been supported by the ERC Advanced Grant LASSO (nr. 669844), and the Innovation Fund Denmark center DiCyPS.

Abstract—Cyber-physical systems (CPSs) are naturally modelled as reactive systems with nondeterministic and probabilistic dynamics. Model-based verification techniques have proved effective in the deployment of safety-critical CPSs. Central to a successful application of such techniques is the construction of an accurate formal model of the system. Manual construction can be a resource-demanding and error-prone process, thus motivating the design of automata learning algorithms to synthesise a system model from observed system behaviours.
This paper revisits and adapts the classic Baum-Welch algorithm for learning Markov decision processes and Markov chains. For the case of MDPs, which typically demand more observations, we present a model-based active learning sampling strategy that chooses examples which are most informative w.r.t. the current model hypothesis. We empirically compare our approach with state-of-the-art tools and demonstrate that the proposed active learning procedure can significantly reduce the number of observations required to obtain accurate models.

Index Terms—Baum-Welch algorithm, Markov decision processes, active learning

I. INTRODUCTION

Model-based verification techniques have proved effective in the deployment of safety-critical cyber-physical systems. Due to their interactions with a physical environment, CPSs are naturally modelled as reactive systems with nondeterministic and probabilistic dynamics. A popular formalism for such systems is that of discrete-time Markov decision processes (MDPs). Quantitative verification techniques like probabilistic model checking can provide strategies that are provably optimal with respect to the probability of satisfying requirements expressed as LTL or PCTL formulae. Model checking tools such as PRISM [1], STORM [2], and UPPAAL-STRATEGO [3] offer efficient methods for finite MDPs. These techniques assume that the model is an accurate formalisation of the true system. Thus, central to model-based verification is the construction of accurate models.

Manual construction requires one to determine a large number of model parameters, which can be a resource-demanding and error-prone process. This has motivated the design of automata learning algorithms able to synthesise Markov chains [4], [5] and deterministic Markov decision processes [6]–[9] from observed system behaviours. These algorithms, in the large sample limit, identify the original (canonical) model. However, for practical applications the available data is often limited, as the generation of a large number of observations can be a resource-demanding task. Additionally, there might be requirements on the size of the learned model, e.g., when the model has to be stored in an embedded system.

The Baum-Welch algorithm [10] is an expectation maximisation technique [11] for learning the model parameters of a hidden Markov model. It has recently been applied in model-based statistical verification of CPSs [12], model checking of interval Markov chains [13], and metric-based approximate minimisation of Markov chains [14].

This paper proposes a variant of the Baum-Welch algorithm that learns model parameters for Markov chains and Markov decision processes from observed system behaviours. Like the original algorithm, it starts from a given model hypothesis and iteratively updates its transition probabilities until the likelihood of the data stops improving by more than a suitably small ε. The algorithm can be combined with other learning techniques like ALERGIA [4] and IOALERGIA [6]–[8] for the choice of the initial hypothesis. Notably, by fixing a suitably small initial hypothesis, the algorithm can also be used to construct succinct, yet accurate, approximations of complex systems. This characteristic is particularly useful when one needs to control the size of the learned model, e.g., to store it in an embedded system.

Empirical comparisons with state-of-the-art tools show that the Baum-Welch algorithm for MDPs can achieve a better ratio of accuracy to model size. However, when the initial hypothesis model is bigger than the system under learning, it is not uncommon for the Baum-Welch algorithm to overfit the observation set.

Learning MDPs typically requires more observations, as the number of model parameters grows with the number of nondeterministic actions. To address this issue, we employ active learning. Rather than collecting data samples at random, we steer the sampling of new observations towards uncovering unobserved behaviours, thus improving the accuracy of the current model hypothesis. In this line, we propose to learn an initial hypothesis from a relatively small set of system observations sampled at random. Then, for each hidden state we compute the expected number of times each action has been chosen from that state. This information is used to devise an observation-based scheduler aimed at restoring balance in the count of actions performed from each hidden state. This helps the collected data set represent a wider spectrum of the nondeterministic behaviours of the system under learning. Experiments show that our active learning procedure can significantly reduce the number of observations required to obtain accurate models, achieving a faster convergence rate than that observed when employing uniform schedulers.

Other Related Work: An influential active automata learning technique is Angluin's L*-algorithm [15] for learning regular languages, which inspired a number of extensions better suited for modelling reactive systems [16]–[18]. In this line of research, Tappler et al. [9] proposed an L*-based technique for learning (deterministic) MDPs. The method iteratively refines the current hypothesis until the teacher cannot provide a counterexample sequence. For each refinement step a predefined amount of new observations is collected. In contrast to our proposal, new sequences are sampled targeting a subset of states that are marked as rare.

Other related work includes model-based learning techniques for partially observable MDPs (e.g., [?]). These techniques aim at learning how to act in an unknown partially observable domain by taking actions based on an approximate model of the domain. Typically, they learn only a portion of the real model that is sufficient to optimise the strategy, leaving unnecessary parts of the system unexplored. In contrast, we aim at learning the whole model and being able to analyse it.
II. PRELIMINARIES AND NOTATION

We denote by R, Q, and N respectively the sets of real, rational, and natural numbers. We denote by Σ^n, Σ*, and Σ^ω respectively the sets of words of length n ∈ N, of finite length, and of infinite length, built over the finite alphabet Σ.

We denote by D(Ω) the set of discrete probability distributions on Ω. For x ∈ Ω, the Dirac distribution concentrated at x is the distribution 1_x ∈ D(Ω) defined, for arbitrary y ∈ Ω, as 1_x(y) = 1 if x = y, and 0 otherwise.

A. Markov decision processes and schedulers

Definition 2.1: A discrete-time Markov decision process is a tuple M = ⟨S, L, A, ι, {τ_a}_{a∈A}⟩, where (i) S is a finite nonempty set of states, (ii) L is a finite nonempty set of labels, (iii) A is a finite nonempty set of actions, (iv) ι ∈ D(L × S) is an initial distribution, and (v) τ_a : S → D(L × S) is a probabilistic transition function.

Intuitively, M initially emits a label and probabilistically moves to some state according to ι. Then, if M is in state s and receives an input action a ∈ A, it emits a label ℓ ∈ L and moves to state s′ with probability τ_a(s)(ℓ, s′). In this sense, M can be thought of as a state machine that reacts to a stream of input actions a_1, a_2, ⋯ ∈ A^ω by emitting traces of labels of the form ℓ_1, ℓ_2, ⋯ ∈ L^ω.

Remark 2.1: We do not assume to know a priori which actions are available from a given state s of the model. Rather, we assume the model reacts with an error label, denoted ℓ_err ∈ L, and moves back to s with probability 1 whenever an action a ∈ A which is not available is chosen from the current state s. Formally, a ∉ Available(s) implies τ_a(s)(ℓ_err, s) = 1.
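To make the reactive state-machine view above concrete, the following minimal Python sketch (an illustration, not the authors' implementation) encodes a labelled MDP as in Definition 2.1, with ι stored as an L × S matrix and each τ_a as an S × L × S array; the class name and data layout are assumptions made for this example.

```python
import numpy as np

class MDP:
    """Labelled MDP of Definition 2.1: tau[a][s] is a distribution over (label, next state)."""

    def __init__(self, n_states, n_labels, actions, iota, tau, rng=None):
        self.S, self.L, self.A = n_states, n_labels, actions
        self.iota = iota                  # array of shape (L, S), entries sum to 1
        self.tau = tau                    # dict: action -> array of shape (S, L, S); tau[a][s] sums to 1
        self.rng = rng or np.random.default_rng()
        self.state = None

    def reset(self):
        """Emit the initial label and move to the initial state according to iota."""
        flat = self.rng.choice(self.L * self.S, p=self.iota.ravel())
        label, self.state = divmod(flat, self.S)
        return label

    def step(self, a):
        """React to input action a: emit a label and move to s' with probability tau_a(s)(l, s')."""
        dist = self.tau[a][self.state]    # shape (L, S)
        flat = self.rng.choice(self.L * self.S, p=dist.ravel())
        label, self.state = divmod(flat, self.S)
        return label
```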
A path is an infinite sequence in Paths = (L × S × A)^ω representing an execution of M. We denote by Paths_fin = (L × S × A)*(L × S) the set of finite paths. Analogously, we define the sets of infinite (resp. finite) observations as Obs = (L × A)^ω (resp. Obs_fin = (L × A)* L). The length of a finite path w (resp. observation o), written |w| (resp. |o|), equals the number of occurrences of labels in the sequence.

For i ∈ N_{>0}, we define X_i : Paths → S, Y_i : Paths → L, A_i : Paths → A, and O_i : Paths → Obs_fin respectively as X_i(π) = s_i, Y_i(π) = ℓ_i, A_i(π) = a_i, and O_i(π) = (ℓ_1, a_1) ⋯ (ℓ_{i−1}, a_{i−1}) ℓ_i, where π = (ℓ_1, s_1, a_1)(ℓ_2, s_2, a_2) ⋯.

Following the classical cylinder set construction [19, Ch. 10], we define the measurable space of paths (Paths, Σ), where Σ = σ({cyl(w) | w ∈ Paths_fin}) is the smallest σ-algebra that contains all the cylinder sets cyl(w) = w (A × S × L)^ω.

To define a probability measure for MDPs, we use schedulers (a.k.a. policies or strategies) to resolve the nondeterministic choices of actions that are taken at each step.

A scheduler is a function σ : Paths_fin → D(A). Intuitively, a scheduler determines a distribution of actions to take, based on the history of the current path. This notion of scheduler encompasses well-studied classes of schedulers such as memoryless, deterministic, and randomised ones (cf. [19]). In this paper we distinguish between two types of schedulers, namely model-based and observation-based schedulers. A model-based scheduler chooses actions having complete knowledge of the history. In contrast, an observation-based scheduler performs the choice based only on observable features of the history.

Definition 2.2: A scheduler σ is observation-based if for all w, w′ ∈ Paths_fin such that |w| = |w′|, O(w) = O(w′) implies σ(w) = σ(w′).

An MDP M and a scheduler σ induce a probability space (Paths, Σ, Pr_σ^M), where Pr_σ^M denotes the (unique) probability measure such that for arbitrary w = (ℓ_1, s_1, a_1) ⋯ (ℓ_{n−1}, s_{n−1}, a_{n−1})(ℓ_n, s_n) ∈ Paths_fin,

    Pr_σ(cyl(w)) = ι(ℓ_1, s_1) · ∏_{i=1}^{n−1} σ(w_i)(a_i) · τ_{a_i}(s_i)(ℓ_{i+1}, s_{i+1}),

where w_i = (ℓ_1, s_1, a_1) ⋯ (ℓ_{i−1}, s_{i−1}, a_{i−1})(ℓ_i, s_i) is the i-th prefix of w.
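As a worked instance of the formula above, the following hedged sketch evaluates Pr_σ(cyl(w)) for a memoryless scheduler (a special case of the history-dependent schedulers defined above), reusing the MDP sketch from earlier in this section; the representation of w and of the scheduler is an assumption of this example.

```python
def cylinder_probability(mdp, sigma, steps, last):
    """Pr_sigma(cyl(w)) for w = steps + [last], where steps = [(l1,s1,a1), ..., (l_{n-1},s_{n-1},a_{n-1})]
    and last = (l_n, s_n).  `sigma` maps a state to a dict of action probabilities
    (memoryless scheduler), and `mdp` follows the MDP sketch above."""
    l1, s1 = (steps[0][0], steps[0][1]) if steps else last
    p = mdp.iota[l1, s1]                       # iota(l1, s1)
    seq = steps + [last]
    for i, (l_i, s_i, a_i) in enumerate(steps):
        l_next, s_next = seq[i + 1][0], seq[i + 1][1]
        # sigma(w_i)(a_i) * tau_{a_i}(s_i)(l_{i+1}, s_{i+1})
        p *= sigma[s_i][a_i] * mdp.tau[a_i][s_i][l_next, s_next]
    return p
```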
III. LEARNING MDPS USING THE BAUM-WELCH ALGORITHM

In this section we present a variant of the Baum-Welch algorithm [10] for learning an MDP M from a finite set of observation sequences O ⊆ Obs_fin.

Like the Baum-Welch algorithm, our method is a maximum likelihood approach: the transition probabilities of M are estimated so as to maximise the likelihood

    L(M, o) = Pr^M[Y_{1:T} = ℓ_1 .. ℓ_T | A_{1:T−1} = a_1 .. a_{T−1}]

of an observed sequence o = (ℓ_1, a_1) ⋯ (ℓ_{T−1}, a_{T−1}) ℓ_T. The maximum likelihood problem is solved using the expectation maximisation approach [11]. In this line, our algorithm starts with an initial model hypothesis H_0 which is iteratively updated in such a way that the likelihood is nondecreasing at each step, that is L(H_n) ≤ L(H_{n+1}), until the likelihood difference between the current and the previous hypothesis goes below a fixed threshold ε (cf. Figure 1).

    MDP-BW(O, H_0)
    1   i = 0
    2   repeat
    3       (α, β) = FORWARD-BACKWARD(H_i, O)
    4       H_{i+1} = UPDATE(H_i, O, α, β)
    5       i = i + 1
    6   until L(H_i, O) − L(H_{i−1}, O) ≤ ε
    7   return H_i

    Fig. 1. Baum-Welch algorithm for MDPs.
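A possible rendering of the loop of Fig. 1 in Python is sketched below. Here the forward-backward pass is folded into an `update` helper (sketched later in this section), and `log_likelihood` is an assumed helper returning ln L(H, O); working in log-space and capping the number of iterations are choices made for this sketch, not prescriptions of the paper.

```python
import math

def mdp_bw(observations, iota, tau, epsilon=1e-4, max_iters=500):
    """Baum-Welch loop of Fig. 1: re-estimate (iota, tau) until the log-likelihood
    of the training data improves by at most epsilon."""
    prev_ll = -math.inf
    for _ in range(max_iters):
        iota, tau = update(iota, tau, observations)        # lines 3-4 of Fig. 1 (E-step + M-step)
        ll = log_likelihood(iota, tau, observations)       # assumed helper: ln L(H_i, O)
        if ll - prev_ll <= epsilon:                        # stopping criterion of line 6
            break
        prev_ll = ll
    return iota, tau
```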
Next, we describe the update procedure. To ease the exposition, we fix the set of states S, labels L, and actions A, and we implicitly refer to the current hypothesis as the pair H = ⟨ι, {τ_a}_{a∈A}⟩. We define the forward and backward functions α_o, β_o : S × {1 .. T} → [0, 1] for an observation sequence o as

    α_o(s, t) = Pr^H[Y_{1:t} = ℓ_1 .. ℓ_t, X_t = s | A_{1:t−1} = a_1 .. a_{t−1}],
    β_o(s, t) = Pr^H[Y_{t+1:T} = ℓ_{t+1} .. ℓ_T | X_t = s, A_{t:T−1} = a_t .. a_{T−1}].

These can be calculated by dynamic programming according to the following recurrences:

    α_o(s, t) = ι(ℓ_1, s)                                              if t = 1
    α_o(s, t) = Σ_{s′∈S} α_o(s′, t−1) · τ_{a_{t−1}}(s′)(ℓ_t, s)        if 1 < t ≤ T        (1)

    β_o(s, t) = 1                                                       if t = T
    β_o(s, t) = Σ_{s′∈S} β_o(s′, t+1) · τ_{a_t}(s)(ℓ_{t+1}, s′)         if 1 ≤ t < T        (2)
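The recurrences (1) and (2) translate directly into a dynamic-programming routine. The sketch below assumes the representation used in the earlier MDP sketch (ι as an L × S array, τ_a as an S × L × S array) and processes a single observation sequence; it is an illustrative reconstruction rather than the authors' code.

```python
import numpy as np

def forward_backward(iota, tau, obs):
    """Forward/backward tables for one observation o = (l1, a1, ..., l_{T-1}, a_{T-1}, l_T),
    following Eqs. (1) and (2).  `obs` is a pair (labels, actions) of index sequences."""
    labels, actions = obs                       # len(labels) == T, len(actions) == T - 1
    T, S = len(labels), iota.shape[1]
    alpha = np.zeros((S, T))
    beta = np.zeros((S, T))
    alpha[:, 0] = iota[labels[0], :]            # base case of Eq. (1)
    for t in range(1, T):
        # alpha_o(s, t) = sum_{s'} alpha_o(s', t-1) * tau_{a_{t-1}}(s')(l_t, s)
        alpha[:, t] = alpha[:, t - 1] @ tau[actions[t - 1]][:, labels[t], :]
    beta[:, T - 1] = 1.0                        # base case of Eq. (2)
    for t in range(T - 2, -1, -1):
        # beta_o(s, t) = sum_{s'} beta_o(s', t+1) * tau_{a_t}(s)(l_{t+1}, s')
        beta[:, t] = tau[actions[t]][:, labels[t + 1], :] @ beta[:, t + 1]
    return alpha, beta
```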
Next, we define γ_o : S × {1, .., T} → [0, 1] and the action-indexed family of functions ξ_o^a : S × {1, .., T−1} × L × S → [0, 1] for a ∈ A as

    γ_o(s, t) = Pr^H[X_t = s | O_T = o],                                          (3)
    ξ_o^a(s, t)(ℓ, s′) = Pr^H[X_t = s, Y_{t+1} = ℓ, X_{t+1} = s′ | O_T = o].

The above are related to α_o and β_o as follows:

    γ_o(s, t) = ( α_o(s, t) · β_o(s, t) ) / ( Σ_{s′∈S} α_o(s′, t) · β_o(s′, t) ),

    ξ_o^a(s, t)(ℓ, s′) = 1_{a_t}(a) · 1_{ℓ_{t+1}}(ℓ) · ( α_o(s, t) · τ_a(s)(ℓ, s′) · β_o(s′, t+1) ) / ( Σ_{u∈S} α_o(u, t) · β_o(u, t) ).

Given the current hypothesis H = ⟨S, ι, {τ_a}_{a∈A}⟩ of the model and a multiset O of i.i.d. observation sequences o_1, ..., o_R ∈ Obs_fin, where the r-th observation sequence is o_r = ℓ_1^r, a_1^r, ..., ℓ_{T_r−1}^r, a_{T_r−1}^r, ℓ_{T_r}^r, the procedure UPDATE(H, O, α, β) updates ι and {τ_a}_{a∈A} as follows:

    ι(ℓ, s) = ( Σ_{r=1}^{R} 1_{ℓ_1^r}(ℓ) · γ_{o_r}(s, 1) ) / R,

    τ_a(s)(ℓ, s′) = ( Σ_{r=1}^{R} Σ_{t=1}^{T_r−1} ξ_{o_r}^a(s, t)(ℓ, s′) ) / ( Σ_{r=1}^{R} Σ_{t=1}^{T_r−1} 1_a(a_t^r) · γ_{o_r}(s, t) ).

Remark 3.1: Depending on the specific scheduler employed to sample the observations, one may encounter the situation where Σ_{r=1}^{R} Σ_{t=1}^{T_r} γ_{o_r}(s, t) = 0, indicating that the state s does not play a role in the observed dynamics. In this case the update procedure leaves the distributions {τ_a(s)}_{a∈A} unchanged.

The procedure described above is easily adapted to Markov chains, which are MDPs with a single action. Hereafter we use MC-BW to explicitly refer to this adaptation.
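Combining γ_o and ξ_o^a with the two re-estimation formulas gives the UPDATE step. The sketch below accumulates the sufficient statistics over all sequences and then normalises; rows of τ_a belonging to states with no expected visits are left unchanged, in the spirit of Remark 3.1. Function and variable names, and the absence of smoothing or log-space arithmetic, are simplifications of this example.

```python
import numpy as np

def update(iota, tau, observations):
    """One UPDATE pass: compute gamma/xi via Eq. (3) and re-estimate iota and tau_a."""
    L, S = iota.shape
    actions = list(tau.keys())
    new_iota = np.zeros_like(iota)
    num = {a: np.zeros((S, L, S)) for a in actions}   # numerators of tau_a(s)(l, s')
    den = {a: np.zeros(S) for a in actions}           # denominators: sum_t 1_a(a_t) * gamma(s, t)
    R = len(observations)
    for labels, acts in observations:
        alpha, beta = forward_backward(iota, tau, (labels, acts))
        gamma = alpha * beta
        gamma /= gamma.sum(axis=0, keepdims=True)      # gamma_o(s, t), Eq. (3)
        new_iota[labels[0], :] += gamma[:, 0] / R       # re-estimation of iota
        for t, (a, l_next) in enumerate(zip(acts, labels[1:])):
            norm = (alpha[:, t] * beta[:, t]).sum()
            # xi^a_o(s, t)(l_{t+1}, s') = alpha(s,t) * tau_a(s)(l_{t+1}, s') * beta(s', t+1) / norm
            xi = (alpha[:, t][:, None] * tau[a][:, l_next, :] * beta[:, t + 1][None, :]) / norm
            num[a][:, l_next, :] += xi
            den[a] += gamma[:, t]
    new_tau = {}
    for a in actions:
        new_tau[a] = np.array(tau[a])                   # unvisited states keep their old rows (Remark 3.1)
        visited = den[a] > 0
        new_tau[a][visited] = num[a][visited] / den[a][visited, None, None]
    return new_iota, new_tau
```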
A. Experimental Results

In this section we compare the quality of the models learned using MC-BW and MDP-BW against the current state-of-the-art passive-learning tools for Markov chains and Markov decision processes, namely ALERGIA [4] and IOALERGIA [8]. Before we proceed, we briefly recall how ALERGIA and IOALERGIA work. Both algorithms start from a maximal tree-shaped probabilistic automaton representing the training set O, which is iteratively reduced by recursive merging operations among compatible states. Compatibility among states is determined by the Hoeffding test, parametric on a given confidence value α ∈ (0, 1).

Remarkably, these approaches are very efficient and enjoy convergence properties. However, IOALERGIA converges to the original (canonical) model M only if it is deterministic, i.e., for all s, s′, s″ ∈ S, ℓ ∈ L, and a ∈ A, if τ_a(s)(ℓ, s′) > 0 and τ_a(s)(ℓ, s″) > 0, then s′ = s″. Hence each observation sequence is assumed to be emitted by a unique path. As a consequence, if the MDP under learning is not deterministic, IOALERGIA can only learn a deterministic approximation of the model, which often has a larger state space.

Due to the nature of the model construction, ALERGIA and IOALERGIA do not require (nor explicitly allow) the user to choose the size of the learned model (i.e., the number of states) upfront. However, it can be tuned by choosing the input confidence value α.

MC-BW vs. ALERGIA: For the experimental comparison between MC-BW and ALERGIA, we fixed a training set O and a test set T consisting respectively of 10^4 and 10^5 observation sequences of length 5 generated by the chain in Figure 2. The test set is 10 times bigger than the training set because we are interested in measuring to what extent the learning procedures are able to generalise w.r.t. a relatively small training set. First we ran MC-BW starting from a random initial hypothesis with n = 7 .. 15 states, then we ran ALERGIA with an input value of α chosen to match the size of the learned model to n.

Table Ia summarises the results of our experiments in terms of the quality of the learned models. The values reported in the table correspond to the log-likelihood of O (resp. T) divided by |O| (resp. |T|), and the Kullback-Leibler divergence relative to T. We can see that MC-BW achieves better quality with fewer states compared with ALERGIA. Interestingly, an increased model size does not necessarily correspond to a quality improvement. This phenomenon may have two plausible explanations: (i) having too many states leads the learning procedure to overfit the training set; or (ii) only a portion of the model gets updated by the procedure, while the remaining portion of the model is left almost identical to the starting hypothesis.

TABLE I
COMPARATIVE ANALYSIS OF THE BAUM-WELCH ALGORITHM VS ALERGIA

(a) Comparison of ALERGIA and MC-BW on the REBER grammar from [20].

| n (# states) | α         | ALERGIA ln L on O | ALERGIA ln L on T | ALERGIA KL div. | MC-BW ln L on O | MC-BW ln L on T | MC-BW KL div. |
|--------------|-----------|-------------------|-------------------|-----------------|-----------------|-----------------|---------------|
| 7            | 2.09e-201 | −3.968            | −4.163            | 1.256           | −2.597          | −2.66           | 0.086         |
| 8            | 7.28e-160 | −3.836            | −4.239            | 1.025           | −2.595          | −2.651          | 0.086         |
| 9            | 2.93e-100 | −3.257            | −3.432            | 0.607           | −2.597          | −2.659          | 0.086         |
| 10           | 7.14e-104 | −2.993            | −3.133            | 0.376           | −2.587          | −2.654          | 0.095         |
| 11           | 5.66e-75  | −3.076            | −3.231            | 0.29            | −2.693          | −2.808          | 0.001         |
| 12           | 2.87e-44  | −2.701            | −2.804            | 0.002           | −2.699          | −2.807          | 0.001         |
| 13           | 0.01      | −2.701            | −2.803            | 0.002           | −2.54           | −2.72           | 0.155         |
| 14           | 0.5       | −2.693            | −2.8              | 0.001           | −2.586          | −2.657          | 0.095         |
| 15           | 0.9       | −2.694            | −2.808            | 0.001           | −2.533          | −2.723          | 0.161         |

(b) Comparison of IOALERGIA and MDP-BW on an adaptation of the Grid World model from [9].

|            | ln L on O | ln L on T | KL div. |
|------------|-----------|-----------|---------|
| True model | −4.171    | −4.262    | 0       |
| MDP-BW     | −4.899    | −4.989    | 0.333   |
| IOALERGIA  | −13.83    | −         | −       |
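For reference, the two quality measures reported above can be computed along the following lines. Here `sequence_likelihood` is an assumed helper returning L(H, o) (e.g., the sum of the final forward column), and the KL estimator shown is a Monte Carlo approximation chosen for this sketch; the paper does not spell out its exact estimator.

```python
import numpy as np

def mean_log_likelihood(iota, tau, test_set):
    """ln L on a data set divided by the number of sequences in it."""
    return np.mean([np.log(sequence_likelihood(iota, tau, o)) for o in test_set])

def empirical_kl(true_model, learned_model, test_set):
    """Monte Carlo estimate of KL(true || learned) over sequences drawn from the true model:
    (1/|T|) * sum_o [ln P_true(o) - ln P_learned(o)]."""
    return np.mean([np.log(sequence_likelihood(*true_model, o))
                    - np.log(sequence_likelihood(*learned_model, o)) for o in test_set])
```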

Fig. 2. The REBER grammar from [20]. (State diagram omitted; the chain has seven states s_1, ..., s_7 and emits the symbols B, T, S, X, P, V, E.)

MDP-BW vs. IOALERGIA: Using the same methodology, we compared MDP-BW against IOALERGIA [8]. Here the model we are learning is a smaller variant of the grid world introduced in [9] (cf. Figure 3). A robot moves in this grid, starting from the middle cell. The actions are the four directions (north, east, south, and west) and the observed labels represent different terrains. Depending on the target terrain, the robot may slip and change direction, e.g., move south-west instead of south. By construction the model is a deterministic MDP; thus, in the large sample limit, IOALERGIA can learn it.

For the comparison, we used a training set O and a test set T consisting respectively of 10^3 and 10^2 sequences of length 10. With α = 0.05, IOALERGIA produced a model with 10 states. We then ran MDP-BW starting from a randomly generated initial hypothesis with 9 states. Table Ib summarises the results of the comparison. On the training set, the model learned by IOALERGIA scores a lower log-likelihood value than the model learned by MDP-BW. Notably, the test set contained a number of observations that could not be generated by the model produced with IOALERGIA. In contrast, the MDP learned with MDP-BW was able to generalise better from the training set, achieving a log-likelihood value on T comparable to the one measured on the original grid-world model. These results suggest that for small training sets MDP-BW attains more accurate models than IOALERGIA, which requires large training sets to achieve good results.

However, the price of the accuracy of MDP-BW is paid in terms of efficiency: in all experiments IOALERGIA ran orders of magnitude faster than MDP-BW. This is not surprising, because IOALERGIA has a run-time complexity that grows linearly in the size of the data set.

Fig. 3. The Small Grid World Model. (Figure omitted.)
IV. ACTIVE LEARNING OF MARKOV DECISION PROCESSES

The MDP-BW algorithm is a passive learning method: it assumes no interaction with the system, which has to be learned from a fixed set of observations. In situations where one can actively query the system to collect training data, one can think of employing querying strategies to produce new examples that are most informative w.r.t. the system's nondeterministic behaviour. In this way, one can learn qualitatively better models compared to the passive learning approach while collecting a considerably smaller amount of observations.

Let H = ⟨S, A, ι, {τ_a}_{a∈A}⟩ and O = {o_1, ..., o_R} be respectively the current hypothesis and the current training set. The active learning procedure iteratively updates H and O by performing the following steps:

1) devise an observation-based scheduler from O and H;
2) sample new observation sequences using the above-mentioned scheduler, adding them to O; and
3) update H based on the new data using MDP-BW.

These steps are repeated until a given sampling budget has been exceeded or no further scrutiny of the system is deemed necessary. Hereafter, we detail how each step is implemented.
Fig. 4. Comparison between the passive learning and active learning procedures based on the MDP-BW algorithm. (a) Street crossing model: log-likelihood curves relative to a test set of 200 sequences of fixed length 12. (b) Small grid world model: log-likelihood curves relative to a test set of 200 sequences of length T ∼ Geo(0.8). (Plots omitted; both panels show log-likelihood against the number of sequences for the passive and the active learning procedures.)
We start by computing the matrix M = (m_{sa})_{s∈S, a∈A}, where m_{sa} is the expected number of times the action a has been chosen from s, computed as

    m_{sa} = Σ_{r=1}^{R} Σ_{t=1}^{|o_r|−1} 1_a(a_t^r) · γ_{o_r}(s, t).     (4)

Then, we define the memoryless scheduler σ_M : S → D(A) as

    σ_M(s)(a) = 1 − ( m_{sa} / Σ_{a′∈A} m_{sa′} ).     (5)

Intuitively, given that the system is in state s ∈ S, the above scheduler chooses an action a ∈ A with a probability that is opposite to that observed in O.
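Eqs. (4) and (5) can be realised as follows, reusing the forward-backward routine sketched earlier. Note that the weights 1 − m_{sa}/Σ_{a′} m_{sa′} are explicitly renormalised here so that σ_M(s) is a probability distribution for any number of actions; this detail is an assumption of the sketch, added on top of the formula as stated.

```python
import numpy as np

def action_counts(iota, tau, observations):
    """m_{s,a}: expected number of times action a was chosen from hidden state s (Eq. (4))."""
    S = iota.shape[1]
    m = {a: np.zeros(S) for a in tau}
    for labels, acts in observations:
        alpha, beta = forward_backward(iota, tau, (labels, acts))
        gamma = alpha * beta
        gamma /= gamma.sum(axis=0, keepdims=True)
        for t, a in enumerate(acts):
            m[a] += gamma[:, t]
    return m

def memoryless_scheduler(m):
    """sigma_M of Eq. (5): prefer, from each state, the actions observed least often."""
    actions = sorted(m.keys())
    counts = np.stack([m[a] for a in actions], axis=1)          # shape (S, |A|)
    totals = counts.sum(axis=1, keepdims=True)
    weights = np.where(totals > 0, 1.0 - counts / np.maximum(totals, 1e-12), 1.0)
    return actions, weights / weights.sum(axis=1, keepdims=True)  # rows renormalised to sum to 1
```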
Since the current state of the system is hidden, when sampling we use a belief state instead. This corresponds to employing the observation-based scheduler σ*_M : Obs_fin → D(A) defined as follows. For an observation o = (ℓ_1, a_1) ⋯ (ℓ_{t−1}, a_{t−1}) ℓ_t ∈ Obs_fin and an action a ∈ A,

    σ*_M(o)(a) = Σ_{s∈S} Pr^H[X_t = s | O_t = o] · σ_M(s)(a)
               = Σ_{s∈S} γ_o(s, t) · σ_M(s)(a).     (6)

Intuitively, the above scheduler works as follows. Having observed o, we believe the system is in state s ∈ S with probability Pr^H[X_t = s | O_t = o]; consequently, σ*_M chooses the action a ∈ A with probability σ_M(s)(a).

The algorithm in Fig. 5 describes how we actively sample an observation sequence of length T ∈ N emitted by a partially observable MDP M using the scheduler σ*_M of Eq. (6).

    ACTIVESAMPLING(M, H = ⟨S, ι, {τ_a}_{a∈A}⟩, O, T ∈ N)
    1   initialise M = (m_{sa})_{s∈S, a∈A} as in Eq. (4)
    2   ℓ_1 = INIT(M)                                    // initialise the system
    3   for each s ∈ S
    4       α(s, 1) = ι(ℓ_1, s)
    5   for t = 1 to T − 1
    6       sample a_t ∈ A according to Σ_{s∈S} ( α(s, t) / Σ_{s′∈S} α(s′, t) ) · σ_M(s)
    7       ℓ_{t+1} = OBSERVE-LABEL(M, a_t)
    8       for each s ∈ S
    9           m_{s, a_t} = m_{s, a_t} + α(s, t) / Σ_{s′∈S} α(s′, t)
    10          α(s, t+1) = Σ_{s′∈S} τ_{a_t}(s′)(ℓ_{t+1}, s) · α(s′, t)
    11  // return the entire observation sequence
    12  return (ℓ_1, a_1) ⋯ (ℓ_{T−1}, a_{T−1}) ℓ_T

    Fig. 5. Active sampling strategy.

ACTIVESAMPLING keeps track of and updates, at each step, the matrix M and the current forward distribution α(·, t) ∈ D(S). These are respectively used to compute the current belief state γ(·, t) ∈ D(S) (cf. Eq. (3)) and the memoryless scheduler σ_M (cf. Eq. (5)), which are used in line 6. After observing an initial label ℓ_1 from the system M, the initial forward distribution α(·, 1) is computed (lines 3–4). Then, for each time step t from 1 to T − 1, an action a_t ∈ A is sampled according to σ*_M and used to observe the next label ℓ_{t+1} emitted by M (line 7). The forward distribution α(·, t+1) and the matrix M are then updated (lines 8–10) before moving to the next time step. The update of the forward probabilities follows Eq. (1), while the update of the column vector M_{a_t} follows Eq. (4).
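A Python rendering of Fig. 5 is sketched below. The black-box system is assumed to expose reset() and step(a) returning the emitted labels (names chosen for this sketch, mirroring the INIT and OBSERVE-LABEL primitives and matching the MDP sketch of Section II); m is the matrix of Eq. (4) and is updated in place so that subsequent calls keep steering towards under-sampled actions. The sketch assumes the hypothesis assigns positive probability to every observed label.

```python
import numpy as np

def active_sampling(system, iota, tau, m, T, rng=None):
    """Sample one observation sequence of length T following Fig. 5."""
    rng = rng or np.random.default_rng()
    l = system.reset()                                  # observe the initial label (line 2)
    labels, acts = [l], []
    alpha = iota[l, :].copy()                           # alpha(., 1), lines 3-4
    for _ in range(T - 1):
        actions, sched = memoryless_scheduler(m)        # sigma_M recomputed from the running counts
        belief = alpha / alpha.sum()                    # normalised belief over hidden states
        probs = belief @ sched                          # sigma*_M(o), Eq. (6) -- line 6
        a = actions[rng.choice(len(actions), p=probs)]
        l = system.step(a)                              # observe the next label (line 7)
        m[a] += belief                                  # update column M_a as in Eq. (4) -- line 9
        alpha = belief @ tau[a][:, l, :]                # forward update, Eq. (1) -- line 10
        labels.append(l)
        acts.append(a)
    return labels, acts                                 # the observation (l1, a1) ... (l_{T-1}, a_{T-1}) l_T
```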
A. Experimental Results

In this section we present an empirical analysis of the active sampling strategy. We use two case study models: the small grid world model from the previous section (see Fig. 3), and the street crossing model (depicted in Fig. 6). The latter model represents an agent trying to avoid a stranger bumping into her. Here she can choose between two actions: stay on the current side of the sidewalk or move to the other side. The agent and the stranger make their moves independently at the same time; in particular, when the two are not in front of each other, the stranger proceeds forward. After performing the action, the agent observes whether the stranger is on the left or the right side of the street. If the two end up on the same side they bump into each other; otherwise they avoid each other. The stranger changes side with probability p ∈ (0, 1).

Fig. 6. The street crossing model. (State diagram omitted; states s_1, s_2, s_3 with actions "move" and "stay", transition probabilities p and 1 − p, and observed labels including "left", "right", "bump", and "avoid".)

We compare the active procedure against the passive one and show how the learning accuracy of the former compares to the latter as the size of the training set grows. The experiments have been performed as follows. Starting from the same initial hypothesis (learned with MDP-BW from a small data set), we incrementally grew the data set using, respectively, the active sampling strategy and a sampling strategy based on a memoryless, uniformly distributed selection of actions. For the street crossing model, the initial hypothesis was learned from a data set of 50 sequences of length 12; we then performed 200 active learning iterations. Fig. 4a shows the mean log-likelihood, paired with standard error bars, measured over a number of re-runs of the experiment relative to a test set of 200 sequences, each of length 12.

For the small grid world model, the initial hypothesis was learned from 250 observation sequences of length T distributed according to a geometric distribution with success probability p = 0.8, that is T ∼ Geo(0.8); we then performed 750 active learning iterations by sampling new observations of length T ∼ Geo(0.8). Analogously to the first case study, the results of this experiment are summarised in Fig. 4b. The graph shows that the passive learning approach has a more pronounced tendency to overfit the data set than the active learning approach.

Overall, the graphs in Fig. 4 show that the active learning approach provides better approximations than the passive approach. Another interpretation is that the proposed active learning is able to obtain the same level of accuracy as the passive learning approach with a smaller data set. Notably, the graphs also show that the standard error for the active learning method is smaller than the one measured for the passive learning approach. This indicates that our active learning approach is more stable than the passive approach.
Active MDP-BW vs. L*_MDP: We conclude the experimental section by comparing our active learning method against the L*_MDP algorithm [9] for learning deterministic MDPs. We recall that L*_MDP actively refines its current hypothesis as long as the teacher can provide new counterexamples. The teacher in the L*_MDP algorithm is implemented by checking both the conformance and the structure of the hypothesis w.r.t. the data set.

For the comparison we replicated the experiment performed in [9] for comparing IOALERGIA with L*_MDP when learning the grid world model depicted in Fig. 7.

Fig. 7. The Grid World Model from [9]. (Figure omitted.)

Our model was learned using the active learning approach starting from a (deterministic) initial model with 19 states, learned from a small data set of 200 sequences. The length T of each sampled sequence is distributed according to a geometric distribution shifted by 10 with success probability p = 0.9, that is, T ∼ 10 + Geo(0.9)¹. At each active learning iteration we sampled two new sequences, and we stopped after collecting 1200 observation traces. Table II shows the results of the experiment. As done in [9], we compared the models with respect to the bisimilarity distance² with discount factor λ = 0.9: the model learned with our active learning approach scores slightly better than IOALERGIA but worse than L*_MDP. Nevertheless, the results of the three model-checking queries performed on our model are close to the true ones: the absolute error from the true values is bounded by 0.184. Overall, L*_MDP scores better than our active learning approach. This is due to a number of reasons: (i) the learned model is smaller than the canonical true model; (ii) it was learned from a significantly smaller data set; and finally, (iii) the active learning approach is not sensitive to structural counterexamples, as the L*_MDP algorithm is. Indeed, when the algorithm encounters a new observation which has probability zero of being generated by the current hypothesis, the next hypothesis will not be able to generate it either. This aspect needs particular attention when learning deterministic models or, in general, when some observation traces can be emitted only by a single path in the hypothesis model.

TABLE II
RESULTS FOR LEARNING THE GRID WORLD MODEL

|                              | true  | L*_MDP  | IOALERGIA | A-MDP-BW |
|------------------------------|-------|---------|-----------|----------|
| overall # of labels          | -     | 3101959 | 3103607   | 23781    |
| # of observation traces      | -     | 391530  | 387746    | 1200     |
| # of states                  | 35    | 35      | 21        | 19       |
| bisimilarity distance δ_0.9  | 0     | 0.144   | 0.524     | 0.364    |
| Pmax(F^{<12} goal)           | 0.962 | 0.965   | 0.230     | 0.978    |
| Pmax(¬G U^{≤14} goal)        | 0.65  | 0.646   | 0.158     | 0.466    |
| Pmax(¬S U^{≤16} goal)        | 0.691 | 0.676   | 0.180     | 0.806    |

¹ Specifically, P(T = 10 + k) = (1 − p)^{k−1} · p for k ∈ N_{>0}.
² To compute the distance, we used the MDPDist library [21] adapted to labelled MDPs.
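The shifted geometric lengths of footnote 1 can be drawn directly with NumPy's geometric generator, which already uses the convention P(k) = (1 − p)^{k−1} p for k ≥ 1; the snippet below is a small illustration of how the sequence lengths used in these experiments may be sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_length(p, shift=0, rng=rng):
    """Sample T ~ shift + Geo(p), i.e. P(T = shift + k) = (1 - p)**(k - 1) * p for k >= 1."""
    return shift + int(rng.geometric(p))

lengths = [sample_length(0.9, shift=10) for _ in range(5)]   # e.g. the 10 + Geo(0.9) lengths used above
```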
V. CONCLUSIONS AND FUTURE WORK

In this paper we revisited the classic Baum-Welch algorithm for learning the model parameters of nondeterministic MDPs and Markov chains from a set of observations. Compared with state-of-the-art (passive) learning algorithms like ALERGIA and IOALERGIA, the MDP-BW procedure has a higher run-time complexity. However, experiments show that MDP-BW is able to learn models that reflect more accurately the behaviours of the observed system. This aspect is more pronounced when learning MDPs from a relatively small set of observations.

Learning model parameters for MDPs typically requires large data sets, especially when the system under learning exhibits a high degree of nondeterminism. To cope with this issue, we proposed a model-based active learning sampling strategy which has three main advantages: (a) it is simple to implement and can be seamlessly integrated into small low-power embedded systems; (b) it does not introduce additional overhead with respect to the model update procedure; and (c) it collects a diverse and well-spread variety of observations that better represent the nondeterministic behaviours of the system under learning. Experimental results show that the active procedure outperforms the corresponding passive learning variant in terms of accuracy relative to the size of the data set. This makes our active learning procedure an effective solution when only a limited amount of interaction with the system under learning is possible.

A weakness of our active learning procedure is that it is not sensitive to structural counterexamples. As future work we intend to address this issue. Another interesting research direction consists in generalising the active learning procedure to learning the model parameters of stochastic two-player games, allowing one to learn systems that operate in an unknown (adversarial) environment by actively interacting with both players.

REFERENCES

[1] M. Z. Kwiatkowska, G. Norman, and D. Parker, "PRISM 4.0: Verification of probabilistic real-time systems," in Computer Aided Verification - 23rd International Conference, CAV 2011, ser. Lecture Notes in Computer Science, vol. 6806. Springer, 2011, pp. 585–591. https://doi.org/10.1007/978-3-642-22110-1_47
[2] C. Dehnert, S. Junges, J. Katoen, and M. Volk, "A Storm is coming: A modern probabilistic model checker," in Computer Aided Verification - 29th International Conference, CAV 2017, Part II, ser. Lecture Notes in Computer Science, vol. 10427. Springer, 2017, pp. 592–600. https://doi.org/10.1007/978-3-319-63390-9_31
[3] A. David, P. G. Jensen, K. G. Larsen, M. Mikucionis, and J. H. Taankvist, "Uppaal Stratego," in Tools and Algorithms for the Construction and Analysis of Systems - 21st International Conference, TACAS 2015, ser. Lecture Notes in Computer Science, vol. 9035. Springer, 2015, pp. 206–211. https://doi.org/10.1007/978-3-662-46681-0_16
[4] R. C. Carrasco and J. Oncina, "Learning stochastic regular grammars by means of a state merging method," in Grammatical Inference and Applications, Second International Colloquium, ICGI-94, ser. Lecture Notes in Computer Science, vol. 862. Springer, 1994, pp. 139–152.
[5] R. C. Carrasco and J. Oncina, "Learning deterministic regular grammars from stochastic samples in polynomial time," RAIRO - Theoretical Informatics and Applications, vol. 33, no. 1, pp. 1–20, 1999.
[6] H. Mao, Y. Chen, M. Jaeger, T. D. Nielsen, K. G. Larsen, and B. Nielsen, "Learning probabilistic automata for model checking," in Eighth International Conference on Quantitative Evaluation of Systems, QEST 2011. IEEE Computer Society, 2011, pp. 111–120.
[7] Y. Chen and T. D. Nielsen, "Active learning of Markov decision processes for system verification," in 11th International Conference on Machine Learning and Applications, ICMLA 2012, vol. 2. IEEE, 2012, pp. 289–294. https://doi.org/10.1109/ICMLA.2012.158
[8] H. Mao, Y. Chen, M. Jaeger, T. D. Nielsen, K. G. Larsen, and B. Nielsen, "Learning deterministic probabilistic automata from a model checking perspective," Machine Learning, vol. 105, no. 2, pp. 255–299, 2016.
[9] M. Tappler, B. K. Aichernig, G. Bacci, M. Eichlseder, and K. G. Larsen, "L*-based learning of Markov decision processes," in Formal Methods - The Next 30 Years - Third World Congress, FM 2019, ser. Lecture Notes in Computer Science, vol. 11800. Springer, 2019, pp. 651–669.
[10] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[11] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1–38, 1977.
[12] K. Kalajdzic, C. Jégourel, A. Lukina, E. Bartocci, A. Legay, S. A. Smolka, and R. Grosu, "Feedback control for statistical model checking of cyber-physical systems," in Leveraging Applications of Formal Methods, Verification and Validation: Foundational Techniques - 7th International Symposium, ISoLA 2016, Part I, ser. Lecture Notes in Computer Science, vol. 9952, 2016, pp. 46–61. https://doi.org/10.1007/978-3-319-47166-2_4
[13] M. Benedikt, R. Lenhardt, and J. Worrell, "LTL model checking of interval Markov chains," in Tools and Algorithms for the Construction and Analysis of Systems - 19th International Conference, TACAS 2013, ser. Lecture Notes in Computer Science, vol. 7795. Springer, 2013, pp. 32–46. https://doi.org/10.1007/978-3-642-36742-7_3
[14] G. Bacci, G. Bacci, K. G. Larsen, and R. Mardare, "On the metric-based approximate minimization of Markov chains," Journal of Logical and Algebraic Methods in Programming, vol. 100, pp. 36–56, 2018. https://doi.org/10.1016/j.jlamp.2018.05.006
[15] D. Angluin, "Learning regular sets from queries and counterexamples," Information and Computation, vol. 75, no. 2, pp. 87–106, 1987.
[16] B. Steffen, F. Howar, and M. Merten, "Introduction to active automata learning from a practical perspective," in Formal Methods for Eternal Networked Software Systems, SFM 2011, ser. Lecture Notes in Computer Science, vol. 6659. Springer, 2011, pp. 256–296.
[17] M. Isberner, F. Howar, and B. Steffen, "The TTT algorithm: A redundancy-free approach to active automata learning," in Runtime Verification - 5th International Conference, RV 2014, ser. Lecture Notes in Computer Science, vol. 8734. Springer, 2014, pp. 307–322.
[18] S. Cassel, F. Howar, B. Jonsson, and B. Steffen, "Active learning for extended finite state machines," Formal Aspects of Computing, vol. 28, no. 2, pp. 233–263, 2016.
[19] C. Baier and J. Katoen, Principles of Model Checking. MIT Press, 2008.
[20] A. S. Reber, "Implicit learning of artificial grammars," Journal of Verbal Learning and Verbal Behavior, vol. 6, pp. 855–863, Dec. 1967.
[21] G. Bacci, G. Bacci, K. G. Larsen, and R. Mardare, "The BisimDist library: Efficient computation of bisimilarity distances for Markovian models," in QEST 2013, ser. Lecture Notes in Computer Science, vol. 8054. Springer, 2013, pp. 278–281.
