Active Learning of Markov Decision Processes Using Baum-Welch Algorithm (Extended)
Abstract—Cyber-physical systems (CPSs) are naturally modelled as reactive systems with nondeterministic and probabilistic
observed system behaviours. These algorithms, in the large
arXiv:2110.03014v1 [cs.LG] 6 Oct 2021
[Figure 2: state-transition diagram with states 𝑠1 to 𝑠7 and output symbols B, T, P, S, X, V, E.]
Fig. 2. The REBER grammar from [20]

in the table correspond to the log-likelihood of O (resp. T) divided by |O| (resp. |T|) and the Kullback-Leibler divergence relative to T. We can see that MC-BW achieves better quality with fewer states than ALERGIA. Interestingly, we observe that an increased model size does not necessarily correspond to a quality improvement. This phenomenon has two plausible explanations: (i) having too many states leads the learning procedure to overfit the training set; or (ii) only a portion of the model gets updated by the procedure, while the remaining portion is left almost identical to the starting hypothesis.

MDP-BW vs. IOALERGIA: Using the same methodology, we compared MDP-BW against IOALERGIA [8]. Here the model we are learning is a smaller variant of the grid world introduced in [9] (cf. Figure 3). A robot moves in this grid, starting from the middle cell. The actions are the four directions (north, east, south, and west) and the observed labels represent different terrains. Depending on the target terrain, the robot may slip and change direction, e.g. move south-west instead of south. By construction, the model is a deterministic MDP; thus, in the large-sample limit, IOALERGIA can learn it.

For the comparison, we used a training set O and a test set T consisting respectively of 10^3 and 10^2 sequences of length 10. With 𝛼 = 0.05, IOALERGIA produced a model with 10 states. We then ran MDP-BW starting from a randomly generated initial hypothesis with 9 states. Table Ib summarises the results of the comparison. On the training set, the model learned by IOALERGIA scores a lower log-likelihood value than the model learned by MDP-BW. Notably, the test set had a number of observations that could not be generated by the model produced with IOALERGIA. In contrast, the MDP learned with MDP-BW was able to generalise better from the training set, achieving a log-likelihood value on T comparable to the one measured on the original grid-world model. These results suggest that, for small training sets, MDP-BW attains more accurate models than IOALERGIA, which requires large training sets to achieve good results.

However, the price of the accuracy of MDP-BW is paid in terms of efficiency: in all experiments IOALERGIA ran orders of magnitude faster than MDP-BW. This is not surprising, because IOALERGIA has a run-time complexity that grows linearly in the size of the data set.

Fig. 3. The Small Grid World Model.

IV. ACTIVE LEARNING OF MARKOV DECISION PROCESSES

The MDP-BW algorithm is a passive learning method: it assumes no interaction with the system, which has to be learned from a fixed set of observations. In situations where one can actively query the system to collect training data, one can employ querying strategies to produce new examples that are most informative w.r.t. the system's nondeterministic behaviour. In this way, one can learn qualitatively better models compared to the passive learning approach while collecting a considerably smaller number of observations.

Let $H = \langle S, A, \iota, \{\tau_a\}_{a \in A} \rangle$ and $O = \{o_1, \dots, o_R\}$ be respectively the current hypothesis and the current training set. The active learning procedure iteratively updates H and O by performing the following steps:
1) devise an observation-based scheduler from O and H;
[Figure 4: two plots, each with Log-Likelihood on the vertical axis.]
(a) Street crossing model: log-likelihood graphs relative to a test set of 200 sequences of fixed length 12.
(b) Small grid world model: log-likelihood graphs relative to a test set of 200 sequences of length 𝑇 ∼ Geo(0.8).
Fig. 4. Comparison between the passive learning and active learning procedures based on the MDP-BW algorithm.
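
The following is a minimal sketch of how a test-set score of this kind can be computed. It assumes, as stated for the tables of the preceding experiments, that the log-likelihood is divided by the number of sequences; whether Fig. 4 uses the same normalisation is an assumption. The function name `average_log_likelihood` and its input (one model probability per test sequence) are hypothetical, and computing those probabilities from an MDP is left out.

```python
import math


def average_log_likelihood(sequence_probs):
    """Return (1/|T|) * sum of ln P(o) over the test sequences o in T.

    sequence_probs is a hypothetical input: the probability that a learned
    model assigns to each test sequence. Obtaining these values from an MDP
    is outside the scope of this sketch.
    """
    if not sequence_probs:
        raise ValueError("the test set must contain at least one sequence")
    total = 0.0
    for p in sequence_probs:
        if p <= 0.0:
            # A sequence the model cannot generate has log-probability -inf,
            # so the whole score collapses to -inf.
            return float("-inf")
        total += math.log(p)
    return total / len(sequence_probs)
```

For instance, `average_log_likelihood([0.5, 0.25])` averages ln 0.5 and ln 0.25, while a single zero-probability sequence yields negative infinity. This is one possible reading of the earlier remark that some test observations could not be generated by the IOALERGIA model.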