Distributed State Space Minimization

Stefan Blom, Simona Orzan
Abstract. We present a new algorithm, and its distributed implementation, for reducing labeled transition systems modulo strong bisimulation. The basis of this algorithm is the Kanellakis-Smolka 'naive method', which has a high theoretical complexity but is successful in practice and well suited to parallelization. This basic approach is combined with optimizations inspired by the Kanellakis-Smolka algorithm for the case of bounded fanout, which has the best known time complexity. The distributed implementation improves on previous attempts by a better overlap between communication and computation, which results in an efficient usage of both memory and processing power. We also discuss the time complexity of this algorithm and show experimental results with sequential and distributed prototype tools.*

* Partially supported by PROGRESS, the embedded systems research program of the Dutch organisation for Scientific Research NWO, the Dutch Ministry of Economic Affairs and the Technology Foundation STW, grant number CES.5009.

1 Introduction

There is currently a lot of interest in building distributed model checking tools, both symbolic [11,2] and enumerative [24,16,15,1]. The symbolic tools manipulate a compressed representation of the state space. The enumerative tools explicitly compute all the states and transitions of the state space and can be subdivided into on-the-fly and full-generation tools. An on-the-fly tool computes the transitions leading from a state on demand, while it is checking a property. A full-generation tool first computes the whole state space and only then starts checking the property. The main advantage of on-the-fly tools is that if the property can be proved or disproved by exploring only a small part of the state space, the unnecessary generation of the rest of the state space is avoided. However, if proving a property requires visiting the whole state space, then full-generation tools have an important advantage: after being generated, a state space can be reduced modulo an equivalence that preserves the properties to be checked. The reduction can considerably decrease the size of the state space that needs to be verified. This is particularly important in cases where the original state space is too big to be verified on a single machine, as in two recent case studies with the µCRL toolset: a cache coherence protocol [20] and a model of JavaSpaces [21]. In both cases, verification on a single machine was possible for the reduced state space only. Moreover, state space reduction could only be performed using the distributed tool.

This paper proposes a new distributed solution for the problem of state space reduction modulo strong bisimulation equivalence. This is a widely used system equivalence which preserves all properties expressible as µ-calculus formulas.

We choose clusters of workstations as target architecture because it is the most common environment able to offer the memory and processing power required by model checking industrial applications. So, we are interested in message-passing algorithms that can handle very large problem instances on a comparatively small number of processors and that work well for the specific type of labeled transition systems representing state spaces. State spaces have bounded fanout and usually a small depth. The states are distributed evenly among the network nodes (workers), and the transitions are managed by the worker that owns their initial states.

History. The strong bisimulation reduction is found in the literature under the name multiple relational coarsest partition problem (MRCPP): given a set S and a number of relations on S, ρ_1 ··· ρ_r, find a partition of S into subsets S_1 ··· S_s such that for any two subsets S_i, S_j and any relation ρ_k, either all or none of the elements of S_i are in the relation ρ_k with an element of S_j. Moreover, the coarsest partition with this property is required, that is, the one with the least number of subsets. MRCPP is an immediate generalization of the relational coarsest partition problem (RCPP), which treats the case where the set of relations is a singleton.
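In symbols, the stability requirement on a partition {S_1, ..., S_s} reads as follows (a standard restatement of the condition above, with ρ_k^{-1}(B) abbreviating the set of states related by ρ_k to some element of B):

    % Stability: every block is contained in, or disjoint from,
    % the rho_k-preimage of every block.
    \forall i, j \in \{1,\dots,s\},\ \forall k \in \{1,\dots,r\}:\quad
      S_i \subseteq \rho_k^{-1}(S_j)
      \;\lor\; S_i \cap \rho_k^{-1}(S_j) = \emptyset,
    \quad\text{where}\quad
      \rho_k^{-1}(S_j) = \{\, x \in S \mid \exists y \in S_j :\ x \,\rho_k\, y \,\}.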
Let N and M denote the sizes of the set S and of the relation, respectively. The most well-known solutions for RCPP are the O(MN) one proposed by Kanellakis and Smolka [14] and the later O(M log N) one by Paige and Tarjan [19]. [14] also contains an O(N log N) solution for the restricted case of bounded fanout. Long before these, Hopcroft described an O(N log N) algorithm for the deterministic case [12], when the relation is a function.

Towards a good distributed solution. The main challenge in building good distributed tools is dividing the computation in such a way that communication is triggered rather infrequently, while at the same time avoiding large idle times. Ideally, workers have equal computation loads and rarely need access to remote data. In our case, workers should be able to compute as much as possible from the states and transitions that they own.

All the algorithms above are based on partition refinement and they vary in how the refinement step is defined. In the naive algorithm of [14], the refinement consists in putting into different blocks any two states that can be distinguished with respect to the current partition. To distinguish states, it suffices to compare their sets of outgoing transitions, i.e., the sets of pairs (label, destination block). This ensures an independent treatment of the states, which is very suitable for parallelization. In the other, theoretically better, algorithms, it is essential that the states in the same block can be easily retrieved. To achieve this, extra administration is necessary (like sorting), which can be very expensive in a distributed setting.

A distributed implementation of the Kanellakis-Smolka naive algorithm has been presented in [5]. However, the question remains how to distribute one of the theoretically better algorithms, or at least use some of their tricks. Therefore, in this paper we propose a new algorithm that keeps the simplicity and symmetry of the naive algorithm, while employing some optimizations similar to those used in the bounded-fanout Kanellakis-Smolka [14] or Paige-Tarjan [19] algorithms. We will refer to the new algorithm as "optimized".

Naive versus optimized. In our implementations, a unique ID (an integer) is assigned to each block and partitions are represented as arrays of IDs. The signature of a state x with respect to a partition is a set of pairs of labels and IDs, such that a pair (a, id) is in the set if and only if there is a transition with the label a from the state x to a state belonging to the block with the ID id. Two states are distinguishable with respect to a partition if they have different signatures with respect to that partition.

The naive algorithm [5] computes the signatures of all states in every iteration and assigns an arbitrary fresh ID to each distinct signature. It terminates when the number of signatures becomes stable.
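To make the signature mechanism concrete, here is a minimal sequential sketch of the naive refinement loop in Python. It is our own illustration under simple assumptions (states are 0..N-1, the LTS is a list of (source, label, target) triples); it is not the actual tool code.

    def naive_reduce(num_states, transitions):
        """Naive signature refinement: recompute every signature each round.
        Returns the final block ID of every state."""
        ids = [0] * num_states                    # initial partition: one block
        while True:
            # sig(x) = set of (label, block of target) pairs, w.r.t. ids.
            sigs = [set() for _ in range(num_states)]
            for x, a, y in transitions:
                sigs[x].add((a, ids[y]))
            # Hashtable ST: one dense fresh ID per distinct signature.
            table, new_ids = {}, [0] * num_states
            for x in range(num_states):
                new_ids[x] = table.setdefault(frozenset(sigs[x]), len(table))
            if len(table) == len(set(ids)):       # block count stable: done
                return new_ids
            ids = new_ids

For example, on the chain 0 -a-> 1 -a-> 2 this loop needs three rounds and ends with three singleton blocks, since no two of the states are bisimilar.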
The optimized algorithm does not recompute the signatures in each iteration. Instead, it modifies the old signatures. While this recomputation goes on, the states with modified signatures are marked. Next, we assign new IDs to the signatures of marked states as follows: if only some of the states in a block are marked, then the signatures of the marked states all get new IDs and the unmarked states keep their old ID; if all states in a block are marked, then the old ID is reused for the signature which occurs most often and new IDs are assigned to the others. This is similar to the strategy used in the Hopcroft and the Paige-Tarjan algorithms, which always split with respect to the smallest block. The algorithm terminates when there are no more changes.
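The ID-reuse rule just described, as a small illustrative Python sketch (again our own code; representing a block as an explicit list of its states is an assumption that the distributed implementation deliberately avoids):

    from collections import Counter

    def assign_ids(block_states, marked, sig, old_id, fresh):
        """New IDs for one block; marked holds the states whose signatures
        changed, sig maps a state to its (hashable) signature, and fresh
        is an iterator producing unused IDs."""
        new_id, table = {}, {}
        if len(marked) < len(block_states):
            # Partially marked block: unmarked states keep the old ID, ...
            for x in block_states:
                if x not in marked:
                    new_id[x] = old_id
        else:
            # Fully marked block: the most frequent signature inherits it.
            winner, _ = Counter(sig[x] for x in marked).most_common(1)[0]
            table[winner] = old_id
        for x in marked:                  # ... every other distinct signature
            if sig[x] not in table:       # gets a fresh ID.
                table[sig[x]] = next(fresh)
            new_id[x] = table[sig[x]]
        return new_id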
A close variation of the optimized algorithm achieves a time complexity of O(N log N). Namely, this is the case if the old ID is always given to the most frequently occurring signature, not only when all states of the old block are unstable. However, to implement this, all the states belonging to a given block should be easily retrievable, and this is not trivial to arrange in a distributed setting.

Related work. Parallel versions of Kanellakis-Smolka and Paige-Tarjan have been proposed [25,22], with time complexities O(N^{1+ε}) using N CREW PRAM processors (for any fixed ε < 1), and O(N log N) with O(M/N) CREW PRAM processors, respectively. These algorithms are designed for shared memory machines and they are difficult to translate efficiently to a distributed memory setting. It would however be interesting to see how they work on a virtual shared memory. We expect that the latency of the shared memory simulation would seriously affect their performance.

There exist on-the-fly algorithms for bisimilarity checking, both sequential [18] and distributed [13]. They are based on solving boolean equation systems and can be used to compare two state spaces with respect to an equivalence notion while generating them. Our problem is rather to find the equivalence classes of a given state space, which is quite different and cannot be immediately solved by these algorithms. In fact, we are not aware of any algorithm that attempts to solve on-the-fly bisimilarity reduction.

Overview. The next section introduces some definitions and formalizes the problem of reduction modulo strong bisimulation equivalence. In Section 3 the optimization of the naive algorithm by means of a marking procedure is discussed and it is justified that the (sequential) algorithm thus obtained is still correct. The theoretical complexity of the new algorithm is also discussed. Then, in Section 4, the distributed implementation of this new optimized algorithm is briefly explained. Some performance data is presented in Section 5 and some concluding remarks in Section 6.

2 Bisimilarity checking, bisimulation minimization and the Relational Coarsest Partition Problem

Let Act be a fixed set of labels, representing actions. A labeled transition system (LTS) is a triple (S, T, s_0) consisting of a set of states S, a set of transitions T ⊆ S × Act × S and an initial state s_0 ∈ S. When T is understood, we will use the notation p −a→ q for (p, a, q) ∈ T.
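For reference, we recall the standard definition of the equivalence in question (a well-known fact, not spelled out in the text above): a relation R ⊆ S × S is a strong bisimulation if, whenever p R q, each of the two states can match the other's transitions:

    % p R q implies both transfer conditions:
    p \xrightarrow{a} p' \implies \exists q'.\; q \xrightarrow{a} q' \,\land\, p' \mathrel{R} q'
    \qquad\text{and}\qquad
    q \xrightarrow{a} q' \implies \exists p'.\; p \xrightarrow{a} p' \,\land\, p' \mathrel{R} q'.

Two states are strongly bisimilar iff some strong bisimulation relates them; the blocks of the coarsest partition solving the MRCPP instance above (one relation ρ_a per label a) are exactly the classes of strong bisimilarity.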
[...] the current iteration to new IDs (the block identifiers of the next iteration).

ID^f is the final partition, the blocks of which represent the states of the minimized LTS. The termination and correctness of OSBR follow from a few simple properties, listed below.

Lemma 1. Let U^n, sig^n, ID^n, E^n denote the set of unstable states, the signatures mapping, the ID mapping, and the number of equivalence classes at the beginning of the n-th iteration of the optimized algorithm (i.e. before the n-th execution of step 3); the count starts at 0. The following properties hold, for any n ≥ 0:

1. (∀x ∈ S) ID^n(x) < E^n.
   (The blocks of the current partition are numbered {0 ··· E−1}.)
2. (∀0 ≤ i < E^{n−1}) ∃x ∈ S s.t. ID^n(x) = ID^{n−1}(x) = i.
   (∀0 ≤ i < E^n) ∃x ∈ S s.t. ID^n(x) = i.
   (Every identifier in the set {0 ··· E−1} is really used. Old blocks pass their identifiers to subblocks of their own.)
3. (∀x ∈ S, ID^n(x) = ID^{n−1}(x)) iff U^n = ∅.
   (A partition is final iff the unstable set becomes empty.)
4. (∀x, y ∈ S) ID^n(x) = ID^n(y) iff sig^n(x) = sig^n(y), and
   sig^n(x) ≠ sig^n(y) =⇒ sig^{n+1}(x) ≠ sig^{n+1}(y).
   (The block identifiers are given correctly. The newer partitions are refinements of the older ones.)
5. E^{n−1} ≤ E^n. E^n = E^{n−1} iff (∀x ∈ S, ID^n(x) = ID^{n−1}(x)).
   (The number of blocks increases with the refinements, until it stops because the final partition is reached.)

Proof. Properties 1, 3 and 4 will be proved independently, by induction on n. Property 2 relies on 4, and property 5 relies on 1 and 2.

1. (∀x ∈ S) ID^0(x) = 0 < 1 = E^0. In every iteration, the only place where fresh values are introduced for ID is step 13 (in step 10 an old value is used). But E is also immediately increased (step 14), therefore the invariant stays true.

2. For n = 0 the property is obviously true, since ID^0(x) = 0 for all states x. Suppose it is true for E^{n−1}, ID^{n−1} and let us look at how E^n and ID^n are computed in the (n−1)-th iteration. First, the set Reusable is constructed (step 5), containing the identifiers of the blocks whose states are all marked unstable. Let i be any identifier 0 ≤ i < E^n. We distinguish three cases:
- i ∈ Reusable. Then all states x with ID^{n−1}(x) = i (according to the induction hypothesis, there is at least one) must be in U^{n−1}. Let y be the first of these states that is handled in step 7 of the algorithm. sig^n(y) cannot already be in ST, since this would mean that there exists a state z from another block (ID^{n−1}(z) ≠ ID^{n−1}(y)) with the same signature (sig^{n−1}(z) = sig^{n−1}(y)), which contradicts point 4 of this lemma. Therefore, y will only be affected by steps 15 and 16, which do not modify the value of ID. Thus, ID^n(y) = ID^{n−1}(y) = i.
- i ∉ Reusable ∧ i < E^{n−1}. Then there must be a state x for which ID^{n−1}(x) = i and x ∉ U^{n−1}. It follows, since the steps sequence 7−20 does not regard x, that ID^n(x) = ID^{n−1}(x) = i.
- E^{n−1} ≤ i < E^n. This means that i is "created" in steps 12−14 as the identifier of a new block. In step 13 the ID^n of the first state of this block is explicitly defined as being i.

3. Let us consider an iteration n that satisfies ∀x ∈ S, ID^n(x) = ID^{n−1}(x). This means that in iteration n−1 the condition in step 17, which compares exactly ID^n(x) and ID^{n−1}(x), was never satisfied, thus νU remains ∅, that is, U^n = ∅. The converse is also true: if U^n = ∅ then νU ended up empty in the previous iteration. This could only happen if the condition on line 17 was never met, that is, the value of ID was not changed for any state. Formally, ∀x ∈ S, ID^n(x) = ID^{n−1}(x).

4. We prove this by induction on n ≥ 0. The case n = 0 follows from the fact that (∀x) sig^0(x) = 0 and (∀x) ID^0(x) = 0. To prove the first half of the invariant for an arbitrary n, we consider three cases:
- x, y ∈ U^{n−1}. In this case, both sig^n(x) and sig^n(y) are inserted in the hashtable ST, which ensures the same ID value for (and only for) equal signatures.
- x, y ∉ U^{n−1}. Then the signatures and IDs do not change, i.e. sig^n(x) = sig^{n−1}(x), ID^n(x) = ID^{n−1}(x) and sig^n(y) = sig^{n−1}(y), ID^n(y) = ID^{n−1}(y). From the induction hypothesis, it follows that sig^n(x) = sig^n(y) iff ID^n(x) = ID^n(y).
- x ∈ U^{n−1} and y ∉ U^{n−1}. Then there must be a state z that has caused the instability of x, i.e. there is a transition x −a→ z with ID^{n−1}(z) ≠ ID^{n−2}(z). Then ID^{n−1}(z) = i ≥ E^{n−2}, therefore the pair (a, i) cannot be in sig^{n−1}(y). And since sig^n(x) is recomputed and sig^n(y) is not, it follows that sig^n(x) ≠ sig^n(y). It remains to prove that ID^n(x) ≠ ID^n(y) as well. Let us first notice that

    if ID^n(x) ≠ ID^{n−1}(x) then ID^n(x) ≥ E^{n−1}.   (1)

If sig^{n−1}(x) = sig^{n−1}(y), then by the induction hypothesis ID^{n−1}(x) = ID^{n−1}(y) = i; then i ∉ Reusable (since y ∉ U^{n−1}), thus ID^n(x) ≠ i, thus by (1) ID^n(x) ≥ E^{n−1}, while ID^n(y) = ID^{n−1}(y) < E^{n−1}. If, on the contrary, sig^{n−1}(x) ≠ sig^{n−1}(y), then ID^{n−1}(x) ≠ ID^{n−1}(y) (induction hypothesis). ID^n(x) is computed in the fragment 7−20 and the outcome can be ID^n(x) = ID^{n−1}(x) ≠ ID^{n−1}(y) = ID^n(y), or ID^n(x) ≠ ID^{n−1}(x). In the latter case, ID^n(x) ≥ E^{n−1} by (1), while ID^n(y) = ID^{n−1}(y) < E^{n−1}.

And now we prove the second half of the property. Let x and y be two states for which sig^n(x) ≠ sig^n(y). Then (w.l.o.g.) there is some pair (a, ID^{n−1}(z)) ∈ sig^n(x) and ∉ sig^n(y). If sig^n(y) does not contain any pair (a, j) then clearly sig^{n+1}(x) ≠ sig^{n+1}(y). Otherwise, let y −a→ t be any of the a-transitions from y. Then (a, ID^{n−1}(t)) ∈ sig^n(y) and ID^{n−1}(t) ≠ ID^{n−1}(z), which means (induction hypothesis) that sig^{n−1}(t) ≠ sig^{n−1}(z) and, further, sig^n(t) ≠ sig^n(z). Above we have proved that this is equivalent to ID^n(z) ≠ ID^n(t). Thus, sig^{n+1}(x) contains the pair (a, ID^n(z)) and sig^{n+1}(y) does not.

5. From points 1 and 2 of this lemma it follows that (∀n) E^n is exactly the number of different values of ID^n. Therefore, if ∀x ∈ S, ID^n(x) = ID^{n−1}(x) then obviously E^n = E^{n−1}.

[...]
[Figure 4. Example 1, showing that the bound O(N^2) for the optimized algorithm is tight.]
Manager i:
   1  νU_i := ∅
   2  for all x ∈ U_i
        compute sig(x)
        send SS(ID(x)) ⟨newsig : ID(x), sig(x), x⟩
   3  loop
        receive a message
        case ⟨newid : x, i⟩
          for all y : (y, a, x) ∈ In_i
            send SM(y) ⟨update : y, a, ID(x), i⟩
          ID(x) := i
        case ⟨update : x, a, oid, i⟩
          Out_i := Out_i − (x, a, oid) + (x, a, i)
          νU_i := νU_i ∪ {x}

Signature server i:
   1  ST_i := ∅
   2  loop
        receive ⟨newsig : oid, s, x⟩
        if (oid, s, Lx) ∈ ST_i
          then Lx := Lx + [x]
          else ST_i := ST_i ∪ {(oid, s, [x])}
      Reusable_i := {oid ∈ IDS_i | c(oid) = Σ_{(oid,s,Lx) ∈ ST_i} |Lx|}
      decide on νIDS_i
   3  for all (oid, s, Lx) ∈ ST_i
        if oid ∉ Reusable_i
          then take id from νIDS_i
          else id := oid
        for all x ∈ Lx
          send SM(x) ⟨newid : x, id⟩
   4  re-balance c, νIDS
Here Lx is the list of all unstable states that have s as their signature. Lx is necessary because, unlike in the sequential implementation, in the distributed one it is not possible to generate the new ID at the moment of signature insertion.

In the distributed algorithm, as in the sequential one, a series of iterations is executed. In between iterations, workers synchronize in order to decide whether the final partition has been reached. The computation inside an iteration is asynchronous and directed only by messages, as sketched in Figure 6 (the manager and server pseudocode above). Five phases are distinguishable within an iteration (a sequential Python sketch of this message flow follows the list):
- managers compute the signatures of the unstable states and send them (newsig) to the appropriate servers;
- servers receive the signatures (newsig) and insert them in their local ST;
- servers compute new IDs for the unstable states and send them (newid) back to the managers;
- managers receive the new IDs for their unstable states (newid) and send messages to the parent states of their own states that changed ID (update);
- managers receive and process the update messages (update).
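The following self-contained Python sketch simulates the message flow of one iteration sequentially. It is our own illustration: the routing function server_of, the function names, and the flat data structures are assumptions, not the actual tool's API, and real workers exchange MPI messages instead of filling local lists.

    from collections import defaultdict

    NUM_SERVERS = 4                          # an arbitrary choice for the sketch
    def server_of(oid):
        return oid % NUM_SERVERS             # SS: block ID -> server

    def odbr_iteration(ids, unstable, out_edges, in_edges, block_size, fresh):
        """Simulate one iteration; returns the unstable set of the next round.
        ids: state -> block ID (updated in place); unstable: marked states;
        out_edges: state -> [(label, succ)]; in_edges: state -> [pred];
        block_size: block ID -> #states (the counter c); fresh: ID iterator."""
        # Managers: <newsig : old ID, signature, state> goes to SS(old ID).
        inbox = defaultdict(lambda: defaultdict(list))
        for x in unstable:
            sig = frozenset((a, ids[y]) for a, y in out_edges[x])
            inbox[server_of(ids[x])][(ids[x], sig)].append(x)
        # Servers: an old ID is reusable iff its whole block arrived here;
        # it is then reused for that block's most frequent signature.
        newid = []
        for table in inbox.values():
            arrived = defaultdict(int)
            for (oid, _sig), xs in table.items():
                arrived[oid] += len(xs)
            reused = set()
            for (oid, _sig), xs in sorted(table.items(), key=lambda kv: -len(kv[1])):
                if arrived[oid] == block_size[oid] and oid not in reused:
                    i = oid                  # reuse the old ID once per block
                    reused.add(oid)
                else:
                    i = next(fresh)          # otherwise take a fresh ID
                newid.extend((x, i) for x in xs)
        # Managers: apply <newid>; predecessors of every state whose ID
        # changed receive <update> messages and become unstable.
        next_unstable = set()
        for x, i in newid:
            if ids[x] != i:
                ids[x] = i
                next_unstable.update(in_edges[x])
        return next_unstable

Note that the server is chosen from the old block ID alone, so all states of a block meet at one server; this locality is exactly what property 1 of Lemma 2 below relies on, and it makes the Reusable test a purely local computation.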
Due to the division of tasks between managers and servers, the first and the second phase happen in parallel (step 2 in Figure 6). Also the last three (step 3) are overlapped. The overlapping limits the amount of CPU idle time by allowing computation and communication to proceed in parallel. For instance, the servers can already proceed with inserting signatures in the table while managers prepare and send more signature messages. In the actual runs of the program, a worker (manager + server) uses one processor.

The main advantage of overlapping the phases is the memory gain: since the consumers and producers of messages are active at the same time, the messages don't have to be stored. Thus, less memory is used.

4.3 Correctness argument

Lemma 2. The following properties hold for this distributed algorithm:
1. in every iteration, the signatures of the states in the same block are sent to the same server;
2. every time a block splits, one of the new blocks gets the old ID;
3. in every iteration, finitely many newsig and newid messages are generated;
4. in every iteration, a received newid message generates finitely many update messages;
5. (∀n > 0) if ID_d^n is the state partition at the beginning of iteration n of ODBR, and ID_s^n is the state partition at the beginning of iteration n of OSBR, then (∀x, y ∈ S) ID_d^n(x) = ID_d^n(y) ⟺ ID_s^n(x) = ID_s^n(y).

Proof. 1. Indeed, if ID(x) = ID(y) = i, both sig(x) and sig(y) are sent (step 2 in the manager segment) to the signature server responsible for i, SS(i).

2. Consider a block with the identifier i. If there are states x ∈ S_j − U_j with ID(x) = i, then it is clear: all these states are not touched in this iteration, i.e. they keep their old ID. If, on the contrary, all the states x with ID(x) = i are in some unstable set (∀x ∈ S with ID(x) = i, ∃j such that x ∈ U_j), then
all signatures will be computed and sent to the same server (step 2 in the manager segment). At the signature server side, all these signatures get inserted in ST_i and counted, and i is added to the Reusable_i set. Further, in step 3, when the first triple (i, s, Lx) is encountered, all the states in Lx get i as their new ID.

3. The number of newsig and newid messages is limited by the total size of the sets U_i, i.e. by |S|.

4. For each ⟨newid : x, i⟩ message, |In_i| messages (that is, at most |S|) with the tag update are sent.

5. By induction on n. □

Theorem 3 (termination and correctness of ODBR). For any LTS (S, T, s_0), ODBR terminates and the ID_d^f function computed is the same as the ID^f computed by OSBR.

Proof. Properties (1) and (2) from Lemma 2 ensure that the invariants from Lemma 1 are also true in the distributed implementation ODBR. Properties (3) and (4) ensure that the computation within an iteration terminates. The global termination is justified by the one-to-one mapping between iterations in the sequential algorithm OSBR and the iterations in the distributed implementation ODBR (property 5). From (5) and the correctness of OSBR (Theorem 1) it follows that the partition computed is indeed the correct one. □
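As a toy check of the theorem, the following snippet (assuming the naive_reduce and odbr_iteration sketches from the previous sections are in scope) reduces a three-state a-chain with both methods and compares the induced partitions:

    from itertools import count

    out_edges = {0: [('a', 1)], 1: [('a', 2)], 2: []}
    in_edges  = {0: [], 1: [0], 2: [1]}
    trans = [(x, a, y) for x, es in out_edges.items() for a, y in es]

    ids, unstable = {0: 0, 1: 0, 2: 0}, {0, 1, 2}
    size, fresh = {0: 3}, count(1)
    while unstable:
        unstable = odbr_iteration(ids, unstable, out_edges, in_edges, size, fresh)
        size = {i: sum(1 for v in ids.values() if v == i) for i in ids.values()}
        fresh = count(max(size) + 1)

    naive = naive_reduce(3, trans)
    # Equal partitions: same block in one run iff same block in the other.
    assert all((ids[x] == ids[y]) == (naive[x] == naive[y])
               for x in range(3) for y in range(3))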
5 Experiments

We implemented both the sequential and the distributed versions of the optimized algorithm and compared their performance with the naive ones. The experiments were done on an 8-node dual-CPU PC cluster and an SGI Origin 2000.¹ The test set consists of state spaces generated by case studies carried out with the µCRL toolset [4]. Problem sizes before and after reduction can be found in Table 1. 1394-LL and 1394-LE are models of the firewire link layer [17] and of the firewire leader election protocol with 17 nodes [23]. CCP-2p3t is a cache coherence protocol model with two processes and 3 threads [20], and CCP is an older (and smaller) variant of it. lift5 and lift6 are models of a distributed lift system with 5 and 6 legs [10]. token ring is the model of a Token Ring leader election for 4 stations².

¹ The cluster nodes are dual AMD Athlon MP 1600+ machines with 2G memory each, running Linux and connected by gigabit ethernet. The Origin 2000 is a ccNUMA machine with 32 CPUs and 64G of memory running IRIX, of which we used 16 MIPS R10k processors. On the cluster, we used LAM/MPI 6.5. On the SGI, we used the native MPI implementation.
² The original LOTOS model [8] was translated to µCRL by Judi Romijn and extended from 3 to 4 stations.

5.1 Sequential tools compared

In [5], the usefulness of the signature approach was proved by analysis and performance comparisons of the naive algorithm with existing tools. In order to justify the good performance of the marking procedure, we first present a comparison between the naive and the optimized sequential implementations (Table 2). The tests were run on one of the cluster machines.

Table 2. A comparison of single-threaded tools. The times include the I/O operations.

  problem   | bcg_min        | naive          | optimized
            | time(s) mem(M) | time(s) mem(M) | time(s) mem(M)
  CCP       |  15.0     18   |  21.3     20   |   4.5     18
  1394-LL   |  18.5     19   |   6.2     14   |   3.3     21
  lift5     |   113    184   |    64    123   |    43    214
  CCP-2p3t  |     -      -   |  4363    968   |   779   1187

It is clear from this table that the marking procedure (used by the optimized implementation) can give significant gains in time – see the numbers for both cache coherence protocols. The sequential optimized implementation needs more memory than the naive one, since it keeps both the forward and the inverse transition systems. On the other hand, the naive one consumes more memory for the hashtable: all signatures have to be inserted, while only some have to be considered by the optimized implementation. Therefore, we expect that the optimized implementation will be less memory-expensive than the naive one when it comes to large examples. The distributed implementation confirms this idea.

5.2 Distributed tools compared

Table 3 shows a comparison of the naive and optimized distributed implementations on the cluster, for a number of large LTSs. The numbers listed for the memory usage represent the maximum total memory touched on all 8 workstations during a run.

Table 3. A comparison of distributed implementations. The times are without I/O.

  problem    | d. naive – 16 CPUs | d. opt – 16 CPUs
             | time(s)   mem(M)   | time(s)   mem(M)
  lift5      |    33       460    |    20       480
  CCP-2p3t   |   550      4430    |   104      1658
  token ring |   120     10802    |   231      4508
  lift6      |   702      5958    |   346      3834
  1394-LE    |   555     15388    |   428      8737

The runs indicate that the optimized implementation outperforms the naive one most of the time. The optimized implementation is designed to perform better when the partition refinement series needs a large number of iterations to stabilize, yet very few blocks split in every iteration. This is exactly the case for the CCP state space. On the other hand, for state spaces like the Token Ring protocol, where almost all blocks split in every iteration and the whole process ends in just a few rounds, the naive implementation works faster, since it does not waste time on administration issues. In all larger examples, though, the memory gain is obvious – and for the bisimulation reduction problem, memory is a more critical resource than time.

To test how the optimized distributed algorithm scales, we ran on the cluster a series of experiments using 1-8 machines
(2-16 processors). Figure 7 shows the runtimes (in seconds) needed to reduce lift6 and CCP-2p3t. Since lift6 is a real industrial case study with serious memory requirements, it couldn't be run single-threaded on a cluster node or distributed on fewer than 3 nodes. We see that for both distributed implementations and both case studies presented, the memory usage scales well, i.e. the total memory needed on the cluster is almost constant, regardless of the number of machines used. Hence, more machines available will mean fewer resources occupied on each machine.

[Figure 7. Runtimes and memory usage for CCP-2p3t and for lift6: runtime in seconds and total memory used (MB), plotted against the number of CPUs (1-16); the runtime panels show the curves time(cluster) and time(SGI).]

On runtimes, however, the naive implementation scales in a more predictable manner, while the optimized times don't seem to scale up as nicely. This is partly due to the nondeterminism present in the optimized implementation – signatures can arrive at servers in any order, the order influences the assignment of new IDs to states, the new IDs determine how many unstable states there are in the next iteration, and thus how much time that iteration will cost, etc. It is also due to the possibly unbalanced distribution of signatures to servers, which introduces unpredictable idle times. Last, there is some latency due to the MPI implementation. We compared (Figure 8) the reduction of lift5 on the cluster with the reduction on a shared memory machine that uses its native MPI implementation. It appears that the optimized algorithm does scale better on this other MPI.

5.3 The VLTS test suite

After analysing the behaviour of the two algorithms on some special case studies, we turn to "anonymous" state spaces from the VLTS benchmark [6]. Figure 9 shows the times and total memory usage of the optimized algorithm relative to those of the naive algorithm. Unlike the other measurements presented, the times considered now are total, that is, the I/O operations are included. The 25 state spaces in this selection are small to medium size (between 0.06 and 12 million states, and between 0.3 million and 60 million transitions) and get reduced modulo strong bisimulation in less than 100 iterations.
[Figure 9. Time and total memory usage of the optimized algorithm relative to the naive algorithm, plotted against the number of iterations (0-100); the relative values range from 0 to 2.75.]
The stars mark the very small state spaces, i.e. those that get reduced in less than 5 seconds by both algorithms. We present the state spaces ordered by the number of iterations in which the reduction procedure stabilizes. This ordering is relevant only for the time performance, not for the memory usage.

As apparent from the figure, the relative time performance of the optimized implementation is indeed influenced by the number of iterations and the size of the state space. This is roughly because, compared to the naive one, it spends (much) more time on the initial setup, and this time pays off only if the reduction process has some length. Note that for very short reductions it can be almost 3 times slower than the naive implementation, but for lengthy ones it is usually much faster (up to 6 times faster).

Regarding the memory usage, we may notice that the optimized implementation is indeed almost always an improvement. Exceptions are the small state spaces, where the fixed-size buffers used by the optimized implementation are significantly larger than needed. This could be fixed by using dynamic buffers.

6 Conclusions

[...] For practical state spaces this condition holds and marking visibly improves the performance.

The gain in time comes from the more elaborate treatment of partition refining. The gain in memory usage is due to two elements. First, the same improved refinement procedure makes sure that the hashtable accommodates fewer signatures, thus consuming less memory. The second and more important reason is that computation and communication are no longer separate phases, but are interleaved, which saves the memory needed for storing intermediate results.

The concept of signature refinement also works for other equivalences, like branching bisimulation [9], weak bisimulation and τ*a equivalence as defined in [7].

Another promising approach to the generation, reduction and model checking of large state spaces is using the disk as extra storage. This technique would allow increasing the size of the models that can be verified without resorting to better equipment. There has not been much research in this direction yet.

References