Multi-perspective enriched instance graphs for next activity prediction through graph neural network

Andrea Chiorrini, Claudia Diamantini, Laura Genga, Domenico Potena

Journal of Intelligent Information Systems, ISSN 0925-9902, 61:1 (2023), pp. 5–25. DOI: 10.1007/s10844-023-00777-1
Abstract
Today’s organizations store lots of data tracking the execution of their business processes. These data often contain valuable information that can be used to predict the evolution of running process executions. The present paper investigates the combined use of Instance Graphs and Deep Graph Convolutional Neural Networks to predict which activity will be performed next given a partial process execution. In addition to exploiting graph structures to encode control-flow information, we investigate how to couple them with additional data perspectives. Experiments show the feasibility of the proposed approach, whose outcomes are consistently placed in the top ranking when compared to those obtained by well-known state-of-the-art approaches.
1 Introduction
Predictive process monitoring (PPM) is an emerging field of process mining whose aim is to predict how a running execution of a process will unfold up until its completion (Maggi et al, 2014). PPM approaches can be used to predict different information on the process, such as the remaining completion time or the probability of violating a set of constraints. In this work, we focus on the next-activity prediction task: given the current state of execution of a process, the goal is to predict which activity will be executed next. Being able to “look ahead” during a process execution can support managers in determining, for example, the best allocation of resources, or whether to intervene to prevent undesired process outcomes (Appice et al, 2019). A recent trend emerging from the literature is the use of deep learning architectures, which have outperformed traditional machine-learning and model-based approaches in several studies. Most previous work investigated architectures originally developed within the natural language processing field (e.g., LSTM), thus exploiting the sequential nature of traces in the event log. Some approaches also explored architectures commonly used for image classification (e.g., CNN), proposing to use different trace properties to build a multi-dimensional representation of a log resembling the data structure of images. To the best of our knowledge, however, little attention has been paid to exploiting the structural properties of a process execution when generating a prediction. Process executions are characterized by (complex) control-flow constructs, like concurrency, choices, and loops. However, these structures are flattened in the event log, since traces only record the sequence of executed activities, possibly with additional data properties. Consequently, a single control-flow construct can correspond in the event log to several different sequences of events. For instance, a parallel construct involving two or more activities can correspond to a number of sequences equal to all the possible order permutations of the activities. This can make it challenging for a sequence-based classifier to learn possible relations between high-level control-flow constructs and the classification target, thus affecting its performance (Evermann et al, 2017a; Metzger and Neubauer, 2018).
Another relevant trend emerging from the literature is combining control-flow information with additional data stored in the event log, thus adopting a so-called multi-perspective view on process executions, which has been shown to boost classification performance significantly in previous studies (Pasquadibisceglie et al, 2021; Camargo et al, 2019). Indeed, in real-life processes what to do “next” may depend not only on which activities have been executed before but also on, for instance, which resources have been involved, the kind of customer, and so on.
Springer Nature 2021 LATEX template
2 Related Work
Predictive process monitoring made its appearance as a process mining task in the first decade of the 2000s (Castellanos et al, 2006; Van Der Aalst et al, 2010), receiving increasing attention in the latest years (Di Francescomarino et al). Deep learning architectures typically require vectorial input data, like log traces are. Hence the majority of approaches are model-agnostic.
For what concerns next-activity prediction, Long Short-Term Memory (LSTM) is one of the first and most adopted architectures (Evermann et al, 2017b; Tax et al, 2017; Camargo et al, 2019). LSTM trained with a Generative Adversarial Nets learning scheme has also been proposed (Taymouri et al, 2020), tackling the lack of sufficient training data that often impacts performance. An alternative approach is that of Pasquadibisceglie et al (2020), which proposes to transform traces into image-like data, thus unleashing the full potential of Convolutional Neural Networks (CNN). Although more traditional deep learning architectures like Multi-Layer Perceptrons have been largely overlooked, the experiments in Venugopal et al (2021) demonstrate that they can achieve good performance on some datasets. Other approaches, like reinforcement learning or transformers, have also been explored (Chiorrini et al, 2020; Philipp et al, 2020). These proposals differ in the learning architecture, the input data encoding, and the attributes used. However, a common feature is the inherently sequential structure of their inputs and the consequent inability to fully capture the structure of process executions. Few previous studies have proposed to encode structural information from the process model for Recurrent Neural Network models. For example, Di Francescomarino et al (2017) propose an approach which first detects loops in log traces and then uses this information to improve the results of an LSTM-based next-activity classifier. Their approach also allows incorporating domain knowledge related to execution constraints. A different strategy to take the process structure into account within the next-activity classification task consists in using graphs, which provide a convenient means to represent processes (van Dongen and van der Aalst, 2004; van der Aalst et al, 2003). A proposal to directly process graphs to predict the next activity was made by Venugopal et al (2021). This approach has some similarities with the present proposal. First of all, it adopts a process discovery approach (inductive mining with Directly-Follows Graphs) for building a model of the process. Second, it adopts a Graph Convolutional Neural Network (GCNN) to learn the prediction. With respect to Venugopal et al (2021), the present paper adopts a different, instance-specific graph model in the form of Instance Graphs, also managing non-fitting traces. Furthermore, in Venugopal et al (2021) the network architecture is composed of a single graph convolutional layer followed by two fully connected layers, while in the present paper a variation of the Deep Graph Convolutional Neural Network (DGCNN) of Zhang et al (2018) is exploited. As another difference, if many events in a trace correspond to the same activity, only the features of the most recent event are retained in Venugopal et al (2021), whereas the Instance Graphs adopted in this work can present the same activity more than once.
3 Preliminaries
In this section, we introduce some core definitions used throughout the paper.
Fig. 1: Petri net mined with the Inductive Miner from the Helpdesk event log. Transition labels are displayed below each transition, while each square shows the corresponding acronym.
Definition 1 (Labeled Petri Net) A labeled Petri net is a tuple (P, T, F, A, ℓ) where
P is a set of places, T is a set of transitions, F ⊆ (P × T ) ∪ (T × P ) is the flow
relation connecting places and transitions, A is a set of labels for transitions, and
ℓ : T ↛ A is a partial function that associates a label with a subset of the transitions
in T . Transitions not associated with any label are called invisible transitions.
Figure 1 shows the Petri net obtained from a real-life process concerning the
ticketing management process of the help desk of an Italian software company¹.
Transitions represent process activities, namely well-defined tasks that have
to be performed within the process, and places are used to represent states.
Invisible transitions do not correspond to process activities and are used for
routing purposes. We indicate the set of invisible transitions as TH ⊆ T .
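For illustration, a labeled Petri net like the one in Figure 1 can be mined with the pm4py library. The following is a minimal sketch, assuming a local XES export of the Helpdesk log is available (the file name is hypothetical):

    import pm4py

    # Load the event log (hypothetical local path to an XES export of the Helpdesk log).
    log = pm4py.read_xes("helpdesk.xes")

    # Discover a labeled Petri net with the Inductive Miner; in pm4py, invisible
    # (routing) transitions are those whose label is None.
    net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(log)
    invisible = {t for t in net.transitions if t.label is None}
    print(f"{len(net.transitions)} transitions, of which {len(invisible)} invisible")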
Specific executions of a process, so-called process instances, are typically
recorded in logs. More precisely, the execution of an activity generates an
event, which is a complex entity characterized by a set of properties.
Definition 2 (Event, Trace, Log) Let AL be the set of all activity names, C be the set of all case (aka, process instance) identifiers, H be the set of all timestamps, U a set of variable values, and V a set of variable names. An event e = (a, D, c, i, t) ∈ AL × (V ↛ U) × C × N × H is a tuple consisting of an executed activity a ∈ AL, a function D which assigns a value to some process variables (possibly all of them), a case identifier c ∈ C, a number i ∈ N, and a timestamp t ∈ H. A case corresponds to a single process execution; the number i identifies the position of the event within the sequence of events that occurred within a case. The set of events is denoted by E. An event trace σL ∈ E∗ is a sequence of events with the same case id. An event log L is a multi-set of event traces.
Table 1 shows an excerpt of the event log for the Helpdesk process mentioned above. Throughout this paper, we will use the notation act(e), case(e), pos(e), time(e), and var_name(e) to refer to, respectively, the activity, the case id, the position in the sequence, the timestamp, and the attribute named var_name of an event e. For instance, let e2 be the second event in Table 1; act(e2) is “Assign seriousness”, while time(e2) is “03/04/2012 16:55”. Here we also introduce the projection operator πAtt(x), which is used to build the projection of a tuple x on a subset of its attributes Att. For instance, given an event ei we can define the projection πAL,C,N(ei) = (act(ei), case(ei), pos(ei)). With a slight abuse of notation, we extend this operator to traces as follows: πAL,C,N(σi) = ⟨πAL,C,N(e1), . . . , πAL,C,N(en)⟩.

1 https://data.mendeley.com/datasets/39bp3vv62t/1
Definition 3 (Prefix trace) A prefix of length k of a trace σ = ⟨e1, e2, . . . , en⟩ ∈ E∗ is a trace pk(σ) = ⟨e1, e2, . . . , ek⟩ ∈ E∗ where k ≤ n.
For example, let us indicate with σ1 the trace involving the events with case id Case 2 in Table 1. The prefix of length 3 of σ1 is p3 = ⟨(Start, {}, Case 2, 1, 03/04/2012 16:55), (Assign seriousness, {}, Case 2, 2, 03/04/2012 16:55), (Take in charge ticket, {}, Case 2, 3, 03/04/2012 16:55)⟩. Note that, in this example, the function D corresponds to an empty set, since we do not have any additional data attributes in the log.
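As a minimal illustration of these definitions, the following Python sketch models events as tuples and implements the projection and prefix operators (all names are ours, chosen for readability):

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Dict, List, Tuple

    @dataclass(frozen=True)
    class Event:
        act: str                  # executed activity a
        data: Dict[str, object]   # variable assignment D (possibly empty)
        case: str                 # case identifier c
        pos: int                  # position i within the case
        time: datetime            # timestamp t

    def project(trace: List[Event]) -> List[Tuple[str, str, int]]:
        """Projection pi_{AL,C,N}: keep only activity, case id and position."""
        return [(e.act, e.case, e.pos) for e in trace]

    def prefix(trace: List[Event], k: int) -> List[Event]:
        """Prefix trace p_k(sigma) = <e1, ..., ek>, with k <= n."""
        assert k <= len(trace)
        return trace[:k]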
A well-known issue of log traces is that events are logged in a trace according to the timestamp of the corresponding activities, thus hiding possible concurrency among activities. To address this issue, log traces can be converted into so-called Instance Graphs (Diamantini et al, 2016). These are directed, acyclic graphs which represent the real execution flow of process activities.
Definition 4 (Instance Graph) Let σ = ⟨e1, . . . , en⟩ ∈ L be a trace and let σ′ = πAL,N(σ) be its projection on the activity and position sets. Let the causal relation CR ⊆ A × A be a relation defining the expected order of execution of each pair of activities in the process recorded in L. Hereafter a1 →CR a2 denotes that (a1, a2) ∈ CR. An Instance Graph (or IG) γσ of σ is a directed acyclic graph (E, W) where:
• E = {e ∈ σ′} is the set of nodes, corresponding to the events occurring in σ′;
• W = {(eh, ek) ∈ E × E | h < k ∧ act(eh) →CR act(ek) ∧ (∀eq ∈ E (h < q < k ⇒ act(eh) ̸→CR act(eq)) ∨ ∀ew ∈ E (h < w < k ⇒ act(ew) ̸→CR act(ek)))} is the set of edges.
A causal relation can be determined, for instance, by using a priori domain knowledge, or it can be extracted from an event log recording process executions. We will discuss our strategy to determine CR in Section 4.1.
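A direct, unoptimized Python transcription of the edge condition in Definition 4 could look as follows, representing a projected trace as a position-ordered list of (activity, position) pairs (again, a sketch with names of our choosing):

    from typing import List, Set, Tuple

    Node = Tuple[str, int]  # (activity, position)

    def instance_graph(trace: List[Node], cr: Set[Tuple[str, str]]):
        """Build the Instance Graph (E, W) of a projected trace, per Definition 4."""
        E = list(trace)
        W = []
        for h in range(len(E)):
            for k in range(h + 1, len(E)):
                a_h, a_k = E[h][0], E[k][0]
                if (a_h, a_k) not in cr:
                    continue  # act(eh) must be a causal predecessor of act(ek)
                # Either no intermediate event is a causal successor of eh ...
                no_succ = all((a_h, E[q][0]) not in cr for q in range(h + 1, k))
                # ... or no intermediate event is a causal predecessor of ek.
                no_pred = all((E[w][0], a_k) not in cr for w in range(h + 1, k))
                if no_succ or no_pred:
                    W.append((E[h], E[k]))
        return E, W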
Similarly to what we have done for traces and trace prefixes, starting from the definition of IGs we can introduce the notion of graph prefix.
Fig. 2: (a) the IG of trace σ1; (b) its prefix-IG of length 3.
Definition 5 (Prefix Instance Graph (prefix-IG)) Let (E, W) be the instance graph of some trace σ. Let Ẽk be the set of events in the prefix trace pk(σ) of size k. We define the prefix instance graph of size k of σ as the graph pk((E, W)) = (Ẽk, W ∩ (Ẽk × Ẽk)). Informally, a graph prefix pk(gj) is a subgraph of gj involving only k nodes of gj, i.e., the nodes included in the corresponding trace prefix.
Example 1 Consider σ1 and the set CR derived from the Petri net in Figure 1. Figure 2a shows the IG corresponding to the trace, while Figure 2b shows its prefix of length 3. For the sake of simplicity, we only use activity acronyms to label the graph nodes rather than showing the index and the complete names. We will adopt the same simplification when drawing IGs throughout the rest of the paper.
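Continuing the sketch above, the prefix-IG is simply the subgraph induced by the nodes of the trace prefix:

    def prefix_ig(E: List[Node], W, k: int):
        """Prefix instance graph p_k((E, W)): subgraph induced by the first k events."""
        Ek = [e for e in E if e[1] <= k]       # events within the prefix trace
        Wk = [(u, v) for (u, v) in W if u in Ek and v in Ek]
        return Ek, Wk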
4 Methodology
This work introduces a novel approach to tackle the next-activity prediction challenge. Formally, this problem corresponds to learning a classifier able to label a prefix trace with the activity to be executed next.
Figure 3 shows the proposed approach. Given an event log and its process model expressed as a Petri net, the approach i) represents each trace with its corresponding Instance Graph (IG), ii) enriches the built IG with additional perspectives regarding the sequential execution and, when available, additional event attributes, and iii) processes such IGs through graph neural networks, designed to work with graph data structures, to train a classifier to perform the next-activity prediction task. The approach used to build the IGs is robust against the possible presence of outliers or anomalous behaviors. In other words, even in the presence of anomalous behaviors the approach returns instance graphs without structural anomalies that provide a high-quality model for the corresponding process behaviors. The set of instance graphs is then used to train the graph neural network. For the classifier, among the various architectures proposed in the literature, we chose to adopt the Deep Graph Convolutional Neural Network (DGCNN) (Zhang et al, 2018). In the following, we will refer to our methodology as Multi-BIG-DGCNN. The following subsections delve into each step of the approach.
2 For the sake of simplicity, we directly show the projected trace obtained from another trace of the Helpdesk log. Furthermore, for the sake of readability, we only use activity acronyms.
        E                           W                                        Label
p2(g)   {(1, S), (2, SI)}           {((1, S), (2, SI))}                      AS
p3(g)   {(1, S), (2, SI), (3, AS)}  {((1, S), (2, SI)), ((2, SI), (3, AS))}  TC

Table 2: prefix-IGs of length 2 and 3 extracted from the IG in Figure 4b.
Given the set of IG-prefixes S built at the previous step, the goal is to build the dataset S′ = {(p′i(gj), al)}, where p′i(gj) = (Ep, Wp, Val_M) is a Multi-Perspective Enriched prefix-IG of length i of a graph gj = (Ej, Wj), i ∈ [1, ∥Ej∥ − 1].
We consider two sets of features: direct features, corresponding to data attributes stored in the event log, and indirect features, derived from the information available in the trace. Note that the set of direct features corresponds to the set D introduced in Definition 2; therefore, D ⊂ M. For the indirect features, we are especially interested in time-related features, which are used to encode information about the sequential execution order of the traces from which the IGs have been extracted.
This kind of information has been previously used in the literature (Tax et al, 2017; Pasquadibisceglie et al, 2020). However, thanks to the use of instance graphs in place of log traces, in our framework the temporal intervals are computed for each activity with respect to its causal predecessor rather than with respect to the preceding activity in the sequence. We argue that such a computation provides a more accurate representation of what actually happened within the process execution, thus providing more robust information to be used for the prediction in place of the sequence-based features. These features are defined as follows. Let CR be the causal relation defined among the activities of the event log L⁴. Let us consider the prefix pi(gj) ∈ S and let us indicate with ni the node corresponding to the event at the i-th position in the trace corresponding to gj. With a slight abuse of notation, in the following we use act(ni), time(ni) to indicate the activity and the timestamp of the event ei. This is justified by the fact that for each node of each prefix-IG there exists a unique mapping to the position of the event of the corresponding trace, from which the corresponding information can be accessed.
The first temporal feature we define is ∆t_ni, which represents the time between the current event and its predecessor in the graph. For all nodes ni, let pred_ni = {nj | (act(nj), act(ni)) ∈ CR} denote the set of all nodes that are causal predecessors of ni. We define

    ∆t_ni = 0                                                     if pred_ni = ∅
    ∆t_ni = min_{nj ∈ pred_ni} (time(ni) − time(nj)) / ∆maxe      otherwise

4 Note that in our framework we use the extended causal relation obtained from the BIG repairing procedure.
         E                           W                                        Val_M                                                 Label
p′2(g)   {(1, S), (2, SI)}           {((1, S), (2, SI))}                      {(1, {∆t_ni = 0, td_ni = 0, tw_ni = 0.27}),           AS
                                                                              (2, {∆t_ni = 0, td_ni = 0, tw_ni = 0.27})}
p′3(g)   {(1, S), (2, SI), (3, AS)}  {((1, S), (2, SI)), ((2, SI), (3, AS))}  {(1, {∆t_ni = 0, td_ni = 0, tw_ni = 0.27}),           TC
                                                                              (2, {∆t_ni = 0, td_ni = 0, tw_ni = 0.27}),
                                                                              (3, {∆t_ni = 0, td_ni = 0, tw_ni = 0.27})}

Table 3: Multi-Perspective Enriched prefix-IGs of length 2 and 3, with the temporal features.
The second and third temporal features, td_ni and tw_ni, encode when an event occurred within the process execution and within the week, respectively; the latter captures the intuition that activities of a business process are likely to be carried out within office hours. Formally:

    td_ni = (time(ni) − t0) / ∆maxt
    tw_ni = (time(ni) − tw0) / ∆tw

where tw0 is the timestamp of the last passed Sunday midnight, and t0 is the start timestamp of the process. ∆tw is the amount of time in a week, while ∆maxt is the maximum trace duration. Note that ∆tw, ∆maxe and ∆maxt are normalization factors computed on the entire event log to make the features vary in the range [0, 1], which improves the performance of the network.
Once the direct features have been selected from the event log and the indirect ones have been computed, we compute the mapping function Val_M for each node of each prefix, thus generating the dataset S′.
As an example, Table 3 shows the prefixes discussed above enriched with
the temporal features. Note that since the first three events of σ1 all have the
same timestamps, the temporal features are all the same for these prefixes.
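A possible Python sketch of this feature computation, consistent with the definitions above (the td and tw formulas follow our reconstruction, and all function and variable names are ours):

    from datetime import datetime, timedelta
    from typing import Dict, List, Set, Tuple

    def temporal_features(nodes: List[Tuple[str, int]],
                          times: Dict[int, datetime],
                          cr: Set[Tuple[str, str]],
                          t0: datetime,
                          dt_max_e: float,
                          dt_max_t: float) -> Dict[int, Dict[str, float]]:
        """Compute Delta-t, td and tw for each node of a prefix-IG.

        times maps an event position to its timestamp; dt_max_e and dt_max_t
        are the log-wide normalization factors, expressed in seconds.
        """
        week = timedelta(weeks=1).total_seconds()
        feats = {}
        for (a_i, i) in nodes:
            # Causal predecessors of n_i occurring earlier in the trace.
            preds = [times[j] for (a_j, j) in nodes if j < i and (a_j, a_i) in cr]
            dt = 0.0 if not preds else min(
                (times[i] - t).total_seconds() for t in preds) / dt_max_e
            td = (times[i] - t0).total_seconds() / dt_max_t
            # tw0: midnight of the last passed Sunday before the event.
            midnight = times[i].replace(hour=0, minute=0, second=0, microsecond=0)
            tw0 = midnight - timedelta(days=(times[i].weekday() + 1) % 7)
            tw = (times[i] - tw0).total_seconds() / week
            feats[i] = {"dt": dt, "td": td, "tw": tw}
        return feats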
The final processing step consists in transforming the feature set into the format requested by the classifier. In particular, the Deep Graph Convolutional Neural Network we select for our architecture takes as input a vector FV = [FVe, FVW, Label] where:
• FVe = {fv1, . . . , fvn}, where fvi corresponds to a feature vector describing one node of the graph, i.e., fvi ∈ AL × Val_M. Note that we exploit one-hot encoding to encode both the name of the activity and the possible categorical features in M.
• FVW = {(i, j) | 1 ≤ i, j ≤ |FVe|} is the set of tuples corresponding to the edges of the graph;
• Label corresponds to the classification label associated with the graph.
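For illustration, one way to realize this encoding with PyTorch Geometric (a minimal sketch; the feature layout and names are ours, not those of the paper):

    import torch
    from torch_geometric.data import Data

    def encode_prefix_ig(nodes, edges, feats, activities, label_idx):
        """Encode an enriched prefix-IG as FV = [FV_e, FV_W, Label].

        nodes: list of (activity, position); edges: pairs of such nodes;
        feats: position -> {"dt", "td", "tw"}; activities: activity vocabulary.
        """
        row = {pos: r for r, (_, pos) in enumerate(nodes)}   # node -> row index
        x = torch.zeros(len(nodes), len(activities) + 3)
        for r, (act, pos) in enumerate(nodes):
            x[r, activities.index(act)] = 1.0                # one-hot activity name
            f = feats[pos]
            x[r, -3:] = torch.tensor([f["dt"], f["td"], f["tw"]])
        edge_index = torch.tensor(
            [[row[u[1]] for u, v in edges], [row[v[1]] for u, v in edges]],
            dtype=torch.long)
        return Data(x=x, edge_index=edge_index, y=torch.tensor([label_idx]))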
The DGCNN first processes the input graph through a stack of graph convolutional layers, whose outputs are passed to a SortPooling layer where the size of the input is unified. At last, a 1-D convolutional layer and a dense layer take the obtained representation to perform predictions.
The graph convolutional layer adopted by DGCNN is represented by the following formula:

    Z = f(D̃⁻¹ Ã X W)     (1)

where Ã = A + I is the adjacency matrix A of the graph with added self-loops I, D̃ is its diagonal degree matrix with D̃ii = Σj Ãij, X ∈ R^{n×c} is the graph node information matrix (in our case, the one-hot encoding of the activity labels associated with the nodes), W ∈ R^{c×c′} is the matrix of trainable weight parameters, f is a nonlinear activation function, and Z ∈ R^{n×c′} is the output activation matrix. In the formulas, n is the number of nodes of the input graph (in our case, the graph prefix), c is the number of features associated with a node, and c′ is the number of features in the next-layer tensor representation of the node.
In a graph, the convolutional operation aggregates node information in local neighborhoods so as to extract local structural information. To extract multi-scale structural features, multiple graph convolutional layers (Eq. 1) are stacked as follows:

    Z^{k+1} = f(D̃⁻¹ Ã Z^k W^k)     (2)

where Z^0 = X, Z^k ∈ R^{n×ck} is the output of the k-th convolutional layer, ck is the number of features of layer k, and W^k ∈ R^{ck×ck+1} maps ck features to ck+1 features.
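A minimal PyTorch sketch of this propagation rule, assuming a dense adjacency matrix and the tanh nonlinearity used by Zhang et al (2018):

    import torch

    def graph_conv(A: torch.Tensor, Z: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
        """One graph convolution step: Z_next = f(D~^-1 A~ Z W), per Eq. (2).

        A: (n, n) adjacency matrix of the prefix-IG; Z: (n, c_k) node features
        (Z^0 = X, the one-hot node encoding); W: (c_k, c_{k+1}) trainable weights.
        """
        A_tilde = A + torch.eye(A.size(0))             # add self-loops
        D_inv = torch.diag(1.0 / A_tilde.sum(dim=1))   # inverse diagonal degree matrix
        return torch.tanh(D_inv @ A_tilde @ Z @ W)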
The graph convolutional outputs Z^k, k = 1, . . . , h, are then concatenated into a tensor Z^{1:h} := [Z^1, . . . , Z^h] ∈ R^{n×(c1+···+ch)}, which is then passed to the SortPooling layer. This layer first sorts the input Z^{1:h} row-wise according to Z^h, and then returns as output the representations of the top m nodes, where m is a user-defined parameter. This way, it is possible to train the next layers on the resulting fixed-size graph representation.
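A simplified sketch of SortPooling, ordering nodes by the last channel of Z^h only (a proxy for the channel-wise lexicographic sort of the original DGCNN):

    def sort_pooling(Z_cat: torch.Tensor, Z_h: torch.Tensor, m: int) -> torch.Tensor:
        """Sort the rows of Z^{1:h} according to Z^h and keep the top m nodes."""
        order = torch.argsort(Z_h[:, -1], descending=True)
        Z_sorted = Z_cat[order]
        n, c = Z_sorted.shape
        if n >= m:
            return Z_sorted[:m]
        # Pad with zero rows when the graph has fewer than m nodes.
        return torch.cat([Z_sorted, Z_sorted.new_zeros(m - n, c)], dim=0)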
In the original proposal, the DGCNN includes a 1-D convolutional layer, followed by several MaxPooling layers, one further 1-D convolutional layer followed by a dense layer, and a softmax layer. In the present paper we simplify the architecture, leaving only one 1-D convolutional layer with dropout (Srivastava et al, 2014) followed by a dense layer and a softmax layer. This is because the process mining domain tends to present smaller graphs in comparison with those of typical application domains of graph neural networks (Wu et al, 2021). For further information we refer the interested reader to Zhang et al (2018).
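One possible realization of this simplified head (the hyperparameters, such as the number of filters, are illustrative, not those used in the paper):

    import torch
    import torch.nn as nn

    class SimplifiedHead(nn.Module):
        """1-D convolution with dropout, then a dense layer with softmax,
        applied to the (m x total_channels) output of SortPooling."""

        def __init__(self, m: int, total_channels: int, num_activities: int,
                     filters: int = 32, dropout: float = 0.5):
            super().__init__()
            # kernel_size = stride = total_channels: one convolution step per node.
            self.conv = nn.Conv1d(1, filters, kernel_size=total_channels,
                                  stride=total_channels)
            self.drop = nn.Dropout(dropout)
            self.fc = nn.Linear(filters * m, num_activities)

        def forward(self, z: torch.Tensor) -> torch.Tensor:
            # z: (batch, m, total_channels), flattened into a single channel row.
            batch = z.size(0)
            h = torch.relu(self.conv(z.reshape(batch, 1, -1)))
            h = self.drop(h).reshape(batch, -1)
            return torch.softmax(self.fc(h), dim=-1)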
5 Experiments
This section describes the experiments we carried out on multiple real-world datasets to assess the performance of our approach w.r.t. state-of-the-art competitors. We first provide a description of the experimental set-up, the selected datasets and the competitors. Then, we discuss the obtained results.
5.1.1 Dataset
For our experiments, we selected some of the benchmark datasets commonly used in the literature, whose characteristics are reported in Table 4.
The Helpdesk dataset (Verenich, 2016) contains traces from a ticketing management process of the help desk of an Italian software company.
The BPI12 dataset (van Dongen, 2012) tracks personal loan applications within a global financing organization. The event log is a merge of three parallel sub-processes. We considered both the full BPI12 and the BPI12W sub-process, related to the work items belonging to the application. We retained only the completed events in the two logs, as done in previous work.
The BPI20 dataset (van Dongen, 2020) is taken from the reimbursement process at TU/e. The data is split into travel permits and several request types, from which we selected four datasets. The Requests for Payment (RfP) sub-log contains cost declarations referring to expenses that should not be related to trips. Travel Permit (TP) includes all events related to travel permit declarations
5 Here we refer to the state-of-the-art notion of fitness proposed by Adriansyah et al (2011).
the number of samples with a prefix shorter than 8 is the vast majority. We also notice that this explanation also holds for the small number of stacked graph convolutional layers. All the experiments have been performed using either PyTorch Geometric (Fey and Lenssen, 2019) with torch version 1.10.0 or TensorFlow 2.5 (Abadi et al, 2015), on an NVIDIA GeForce GTX 1080 GPU, an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz, and 32 GB of RAM.
5.2 Results
Table 5 reports the results achieved by each approach over the tested datasets. The best values for each dataset are highlighted in bold. To assess the impact of the enrichment phase on the classification performance, we tested two versions of our approach, i.e., the one exploiting only the control-flow information (BIG-DGCNN) and the one exploiting the enriched IGs (Multi-BIG-DGCNN).
The first interesting insight is that considering multiple perspectives is overall beneficial for classification performance. In fact, Multi-BIG-DGCNN is consistently better than BIG-DGCNN over all tested datasets. The strongest differences can be observed in BPI12, which shows improvements in accuracy and F1 score of 3.19% and 2.81%, respectively, and in the ID dataset, where the accuracy and the F1 score improved by 14.72% and 15.65%, respectively. These results suggest that the set of features used for the enrichment has strong predictive capabilities for these two datasets. On the other hand, focusing on the pure workflow perspective, we can state that BIG-DGCNN is a better approach than GCNN.
Moving to the comparison with the competitors, Multi-BIG-DGCNN achieves the best results in terms of F1 score on five datasets out of seven. On Helpdesk, BPI12W and RfP, Multi-BIG-DGCNN also achieves the best accuracy. CNN turns out to be the best on TP and achieves the best accuracy values on ID and Prepaid, whereas LSTM is the best on BPI12. Overall, considering the F1 score, Multi-BIG-DGCNN shows a better consistency over all datasets. To demonstrate this, we report in Table 6 the overall comparison expressed in terms of AR, SRR and R for both the accuracy and F1 score figures of merit. We observe that, for what concerns AR, Multi-BIG-DGCNN is the best approach, followed by CNN and then BIG-DGCNN and LSTM. It also turns out to be the best approach according to the SRR metric, though the values show that it is basically comparable with CNN. Considering that CNN encodes a richer set of aggregated temporal features than Multi-BIG-DGCNN, these results are encouraging and demonstrate the viability of instance graphs processed by DGCNN, since this kind of information may also be added when deemed useful for prediction purposes.
It is also worth noting that the BPI12W dataset, where both our
approaches obtained the biggest improvement with respect to the second best
approach, is also the dataset with the highest percentage of activities in a short
loop, which is known to be a difficult situation for next-activity prediction. A
reasonable explanation for this result is that the graph convolution mechanism
is naturally robust to such repetitions since it can aggregate the information
of nearby nodes, which is exactly the scenario we have when a specific activity
is repeated several times.
In addition to analyzing the overall behavior of the approach, we are also interested in understanding how it varies across the different prefix sizes. Figure 5 shows the trend of the F1 score with respect to the different prefix lengths across all the datasets. We compare the performance of Multi-BIG-DGCNN (blue line) against that of LSTM (orange line). We chose to compare these two approaches because the LSTM approach proposed by Tax et al. is the one with the set of features most similar to ours. The main differences are that we consider for each event the time w.r.t. the start of the process, rather than within the day (i.e., w.r.t. midnight), and that we consider causal relations in computing temporal intervals between an event and its successor(s), rather than considering subsequent events in the trace (see Section 4.2.2). Therefore, we can reasonably assume that differences in performance are likely to be due either to the different architectures employed, i.e., sequential vs graph-based, or to the explicit use of information on the process structure in the feature set. In addition to the F1 score performance, a red dotted line in the figures shows how the sample size varies with the increase of the prefix length. To provide some additional insights on the size of the sample set for the different prefix lengths, a vertical dotted black line separates results obtained on prefix lengths with at least ten samples (on the left of the line) from those obtained on fewer samples (on the right of the line).
In the following, we focus the discussion on the prefixes on the left of the
black line, i.e., prefixes involving at least ten samples. This is justified by the
fact that for prefix lengths involving a very scarce number of samples, even a
difference of a few samples classified correctly or incorrectly can deeply impact
the results. Note that most of the F1 score plots in Figure 5 show a very
unstable result in the neighborhood of the black line for both classifiers, which
seems to confirm that a limit of 10 is reasonable for these datasets.
The figure shows that Multi-BIG-DGCNN usually performs close to or better than LSTM on the shorter prefixes; however, its performance gets worse for longer prefixes. Since the shorter prefixes correspond to the higher number of samples, outperforming the competitor on the shorter prefixes allows Multi-BIG-DGCNN to obtain a higher accuracy than LSTM on the corresponding dataset. An exception is represented by the BPI12 dataset, where LSTM obtains comparable or better results along all the prefix lengths, which indeed results in a higher overall average accuracy, as shown in Table 5.
Fig. 5: Trend of the F1 score of Multi-BIG-DGCNN and LSTM over the prefix lengths for each dataset, together with the number of samples available per prefix length (dotted line); panel (g) refers to the Prepaid dataset.
6 Conclusions
The paper has presented BIG-DGCNN, a model-aware neural approach to address the task of next-activity prediction. The model represents process instances in the form of Instance Graphs, thus maintaining information about parallel activities that is lost in the traces recorded in event logs. Graphs are then natively processed by Deep Graph Convolutional Neural Networks to synthesize a classification model able to predict the next activity given a prefix of any length. The adoption of BIG allows building sound Instance Graphs even for non-fitting traces and makes the approach suitable also for unstructured processes. Furthermore, an extension is proposed which enriches the Instance Graph with additional data perspectives. The comparison with the state-of-the-art literature highlights that BIG-DGCNN shows promising performance, especially considering that competitor approaches all take into account some data perspective, whereas BIG-DGCNN only encodes control-flow information. When endowed with temporal information and, where available, resource information, it compares favourably to the other approaches. However, since the tested competitors use different encodings, the results do not clarify which part of the performance is due to the neural architecture and which to the encoding method adopted. In the future, we intend to design an experimental plan to separate and further investigate the impact of each element.
Further analysis of the performance trend with respect to the prefix length highlights an interesting difference with respect to the LSTM architecture, namely the decay of performance on longer prefixes. This suggests investigating further improvements of the approach, namely training different networks for the different prefixes, implementing a GAN learning scheme to deal with the limited number of training samples for longer prefixes, or adopting resampling or other imbalanced-learning techniques.
7.5 Funding
No funding was received for conducting this study.
References
Abadi M, Agarwal A, Sutskever I, et al (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. URL https://www.tensorflow.org/, software available from tensorflow.org
Ceci M, Lanotte PF, Fumarola F, et al (2014) Completion time and next activity prediction of processes using sequential pattern mining. In: International Conference on Discovery Science, Springer, pp 49–61
van Dongen BF, van der Aalst WMP (2004) Multi-phase process mining: Building instance graphs. In: Atzeni P, Chu W, Lu H, et al (eds) Conceptual Modeling – ER 2004. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 362–376
Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015)
Leemans SJJ, Fahland D, van der Aalst WMP (2014) Discovering block-structured process models from incomplete event logs. In: Ciardo G, Kindler E (eds) Application and Theory of Petri Nets and Concurrency. Springer International Publishing, Cham, pp 91–110
Van Der Aalst W, Pesic M, Song M (2010) Beyond process mining: From the past to present and future. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6051 LNCS:38–52