2019-Exploratory Visual Sequence Mining Based On Pattern-Growth

This article has been accepted for publication in a future issue of this journal, but has not been
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVCG.2019.2848247, IEEE
Transactions on Visualization and Computer Graphics
Exploratory Visual Sequence Mining Based on Pattern-Growth
Katerina Vrotsou, Member, IEEE, and Aida Nordman, Member, IEEE
Abstract—Sequential pattern mining finds applications in numerous diverging fields. Due to the problem’s combinatorial nature, two
main challenges arise. First, existing algorithms output large numbers of patterns many of which are uninteresting from a user’s
perspective. Second, as datasets grow, mining large numbers of patterns gets computationally expensive. There is, thus, a need for
mining approaches that make it possible to focus the pattern search towards directions of interest. This work tackles this problem by
combining interactive visualization with sequential pattern mining in order to create a “transparent box” execution model. We propose a
novel approach to interactive visual sequence mining that allows the user to guide the execution of a pattern-growth algorithm at
suitable points through a powerful visual interface. Our approach (1) introduces the possibility of using local constraints during the
mining process, (2) allows stepwise visualization of patterns being mined, and (3) enables the user to steer the mining algorithm
towards directions of interest. The use of local constraints significantly improves users’ capability to progressively refine the search
space without the need to restart computations. We exemplify our approach using two event sequence datasets; one composed of web
page visits and another composed of individuals’ activity sequences.
Index Terms—Sequential pattern mining, interactive mining, visual data mining, mining with constraints.
• Katerina Vrotsou and Aida Nordman are with the Department of Science and Technology, Linköping
University, Sweden. E-mail: [email protected], [email protected]
Manuscript received April 19, 2005; revised August 26, 2015.
1 INTRODUCTION
SEQUENTIAL pattern mining addresses the problem of detecting sequences of events as patterns in
data [1]. Identification and analysis of sequential patterns are of increasing importance in a range
of top priority application domains such as electronic health record analysis, process control,
cybersecurity and safety, autonomous systems and software, and aid in the understanding and
debugging of machine learning systems. There are, however, two main challenges that need to be
addressed before sequential pattern mining can be fully utilized. The first challenge is based on
the vast number of possible patterns. State-of-the-art algorithms may extract too many patterns,
many of which may be of lesser significance or even irrelevant for the current analysis. This aspect
makes it difficult for the user to grasp, and consequently use, the multitude of obtained patterns.
Although tailored visualization techniques have been proposed helping the user to explore the large
number of patterns produced by the mining algorithm, the effectiveness of the existing techniques
needs to be significantly improved, both at the algorithm and visualization level. The second
challenge is the computational complexity involved in pattern identification, as mining large
numbers of patterns is computationally very expensive. One approach to tackling these problems is
to introduce constraints, and promising results have been shown in many applications.
These two challenges are the motivation behind several interactive systems [2]–[5] which allow the
user to define constraints to increase the effectiveness and efficiency of the mining process.
However, the actual mining algorithms in these systems then operate as a black box, and the user
only gets to interact with the resulting patterns and not with the pattern generation. This paper
builds on the idea of opening this black box and involving the expert in the mining process by
embedding interactivity deeper in it, catering in this way for the possibility of the user to guide
the execution of the algorithms at suitable points. To our knowledge, the possibility of changing
and refining constraints while a particular sequence pattern is being built has not yet been
considered, and it is an approach that addresses both challenges described above. To this end, we
aim to investigate the possibility of breaking down existing algorithms into incremental steps
making it possible to checkpoint the mining process, display the current status and allow a user to
intervene by imposing constraints that steer the algorithm in the direction of interesting patterns.
We propose a novel exploratory event sequence mining approach based on the pattern-growth
methodology [6]. The main contributions of the approach are the following.
• User-steered pattern mining. The proposed approach enables the entirely interactive mining of
patterns by giving control to the user to steer the mining algorithm to directions of interest to
the specific task. This is achieved by allowing the user to: (1) choose which sequence patterns to
grow during the mining process, and (2) dynamically apply local constraints.
• Support for local constraints. The presented approach introduces the notion of ‘local constraints’
in the mining process by allowing a user to apply a number of different types of constraints on
subsets of the search space.
• Stepwise visualization of patterns. Patterns are stepwise visualized in two views: a pattern tree
view visualizing the frequent subsequences being built and an event sequence view displaying
selected patterns in the context of the event sequences they appear in.
1077-2626 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
• Embedding of domain knowledge. The interactive approach proposed and the incorporation of local
constraints in the algorithm computation allow an expert to express domain knowledge which can be
taken into account during the pattern search.
The above contributions are implemented within an interactive visualization system which we call
ELOQUENCE (for ExpLOratory seQUENCE mining).
2 RELATED WORK
Several approaches have been proposed for visually analysing event sequences in order to identify
and explore interesting patterns. We divide existing approaches into three categories. (1) Visual
inspection approaches focus on creating appropriate representations of an event sequence dataset,
oftentimes using filtering, aggregation, and summarization, in order to enable visual identification
of sequential trends within it. (2) Query-based approaches are focused on exploring a dataset of
event sequences containing a user-specified pattern of interest in order to understand the
characteristics and variations of this pattern across the data. (3) Visual sequence mining
approaches focus on identifying a set of sub-sequences as patterns from the data. Our work falls
under the third category; therefore, we focus the main part of our related work around this third
approach.
2.1 Visual inspection approaches
Lifelines [7] is an early example of using visualization, filtering, highlighting and interaction
tools in order to provide overviews of temporal event sequences and enable visual identification of
similar patterns among a limited number of sequences. Lifelines was extended with functionality for
aligning event sequences around pre-decided discrete events of interest and creating temporal
summaries of the aligned results in order to visually reveal similar patterns in the data [8], [9].
Our work instead is concerned with explicitly identifying sequences that match certain constraints
as patterns. EventFlow [10] allows the identification of patterns by providing simplified overviews
of the event sequences, using a number of filtering and transformation based simplifications, in
order to reveal prominent trends within them. The work can handle large numbers of sequences but, in
contrast to our work, no pattern mining algorithm is used to automatically find frequent
sub-sequences occurring in the dataset. A number of ideas from this work could, however, be very
interesting as a preprocessing step in order to simplify the data before applying data mining.
EventThread [11] summarizes event sequences into clusters of similar sequences (threads) using a
tensor-based approach and visualizes the evolution of patterns by grouping similar threads over
time. Here, too, the focus is on how to appropriately align and group sequences in order to allow
the identification of trends in the data. This contrasts with our approach where frequent sequences
are automatically identified through a pattern mining algorithm.
2.2 Query-based approaches
PatternFinder [12] enables the identification of temporal patterns across multiple event sequences
by allowing users to construct complex queries and search the data for matches. This contrasts with
our approach where interactive sequence mining is used to find frequent sub-sequences allowing the
user to progressively refine constraints while patterns are being built. OutFlow [13] provides an
aggregated Sankey-like view of an event sequence dataset and enables exploration of the most common
pathways in the data. The representation is built around a user selected state and focuses on the
effective summarization and aggregate representation of the sequences including this state. Our
approach instead offers a flexible navigation of the search space by using sequence mining to
identify interesting sequence patterns subject to certain constraints. DecisionFlow [14] proposes a
system that supports an exploratory visual environment allowing the user to interact with a graph
corresponding to an aggregated representation of the sub-sequences matching a user query.
DecisionFlow allows the user to express time-gap constraints between milestone events which are
similar to gap constraints supported by our system, ELOQUENCE. Additionally, neither DecisionFlow
nor ELOQUENCE supports events with overlapping intervals, as is usual in systems dealing with
interval events [15]. More recently, Cappers and van Wijk [16] presented a system for exploring
multivariate event sequence datasets. Their approach is based on the interactive creation and
flexible application of regular expression rules and the use of selections, sorting and aggregation
for identifying interesting patterns. This approach also implies a direct search for a pattern with
specific attributes and focusses on how this pattern appears in the data.
Overall, the above examples require that the user has good knowledge of the data and of the
sub-sequences of interest in order to formulate the right queries, which makes them less well suited
for free exploration and identification of unexpected patterns, which is the focus of our work.
2.3 Visual sequence mining approaches
As the size of data increases and focus shifts to the identification of meaningful and interesting
sets of patterns, new approaches that can further and flexibly reduce the search space need to be
used. Therefore, research aiming at integrating data mining with visualization [17] has been gaining
increasing interest, which is also the focus of our work. In doing so, the value of going away from
“black box” model analysis approaches towards more transparent approaches has been highlighted [18]
and the notion of progressive visual analytics was introduced promoting the importance of the
interaction of the analyst with the mining algorithm [19].
Frequence [3] and Peekquence [4] are two systems based on the SPAM (Sequential PAttern Mining)
algorithm [20] integrated with a visual interface for exploring the resulting patterns. Frequence
allows a user to specify several constraints, such as the level of detail of the events in the
patterns and a time window for events being considered part of the same sequence. Peekquence
attempts to improve understanding of the patterns by using a set of summarizing overview
representations of the patterns, allowing the user to sort the mined patterns by various attributes,
and including a time line view of the event sequences for inspecting the patterns in context. Apart
from the choice of algorithm,
the fundamental difference between these two systems and our proposed approach is that Frequence and
Peekquence focus mainly on interactively setting global constraints and visualizing and filtering
the resulting patterns. They work as “black box” systems in the sense that their visual interface
helps the user to explore the final patterns produced by the underlying algorithm. In contrast, we
propose a “transparent box” system that allows the user to visualize the partial patterns that are
being built by the underlying algorithm. As a consequence, in our approach the user can then impose
local constraints even after the mining algorithm has started and before it stops searching for
patterns.
Chronodes [5] mines frequent sequences that users can interactively combine in order to explore
patterns before and after them. A user chooses a single or a series of frequent sequences as a focal
sequence and can then align event sequences that occur around them or in between them. Similar to
our approach, Chronodes uses the PrefixSpan algorithm for mining frequent sequences. The latter,
however, only allows a user to interactively set global constraints before the execution of the
algorithm, while our approach allows setting local constraints progressively. The Chronodes system,
similar to the previous examples, applies the mining algorithm as a “black box” as opposed to our
transparent interactive approach. Furthermore, the focus of Chronodes is on the exploration of how
the identified frequent patterns appear in the data, while in our system the focus is on the
identification of the frequent patterns themselves. Finally, Chronodes is designed for handling a
large number of sequences composed, however, of a very small event alphabet.
Liu et al. [21] propose an approach for interactively analysing clickstream data. The authors use
the VSMP algorithm for identification of maximal sequential patterns and prune identified patterns
based on both their support and their similarity to each other. The resulting patterns are then
visualized and explored in an interface allowing sophisticated filtering and additional hierarchical
pattern mining. Similar to the previous examples, the focus of this work is on the analysis of the
results of the mining algorithm.
In general, existing work that integrates sequence pattern mining algorithms with visualization
techniques operates on the results of the mining algorithms. An early attempt to apply sequence
mining interactively was proposed by Vrotsou et al. [2], [22] who introduce an interactive visual
mining interface based on an Apriori algorithm [23]. Their proposed system allows a user to mine
sequential patterns in a stepwise manner where constraints can be set at each step. The distribution
of the resulting patterns is then explored in the context of the event sequences that they appeared
in. This approach allows for constraints to be set at each iteration of the algorithm so that
different constraints apply for patterns of different lengths. However, it is not possible to set
different constraints to explore different parts of the search space, i.e. to have different
constraints for patterns of the same length.
Following a similar notion of interactive sequence mining, Stolper et al. [19] introduce the concept
of progressive visual analytics and present a number of design goals that systems should follow to
support this type of analytics. This work is, in our perspective, the closest to the research
described in this paper. Their proposed method is based on presenting partial results as the
algorithm computes and allowing the analyst to make decisions directly, instead of having to wait
until the algorithm completes before they can inspect the final results. The authors present a
system, called Progressive Insights, based on an adapted version of the SPAM algorithm [20] for
mining patterns. Patterns are associated with scores, like support. In order to help users to
prioritize which patterns should be expanded next, an interesting scatterplot view is used to
visualize the score differences between computed patterns. In their proposed system an analyst can
make assessments on the partial results produced and choose to stop the algorithm and restart it
with adjusted parameters. There is a fundamental difference between our approach and the work
presented in [19]. We propose the use of local constraints as a powerful way to allow users to focus
the mining algorithm on the search space of their interest and avoid unnecessary computations.
Progressive Insights achieves similar goals by using a priority-based steering approach of the
mining algorithm. The design rationale that guided us is close to the design goals presented in
[19], as described in section 3. It is undoubtedly the case that both approaches are complementary.
The examples in section 6 illustrate concrete scenarios of the advantages of our proposed approach.
We address the aforementioned shortcomings by proposing a “transparent box” approach to sequence
mining which allows the user to interactively steer the mining algorithm towards focused results of
interest for their analysis. Our initial work in this direction was presented as a poster in the
Visual Analytics Science and Technology conference [24].
3 ELOQUENCE OVERVIEW
The exploratory sequence mining approach proposed in this paper follows the design goals for
progressive visual analytics proposed by Stolper et al. [19]. These state that analytics components
should be designed to:
G1 Provide increasingly meaningful partial results as the algorithm executes.
G2 Allow users to focus the algorithm to subspaces of interest.
G3 Allow users to ignore irrelevant subspaces.
The visualizations used should be designed to:
G4 Minimize distractions by not changing views excessively.
G5 Provide cues to indicate where new results have been found by analytics.
G6 Support an on-demand refresh when analysts are ready to explore the latest results.
G7 Provide an interface for users to specify where analytics should focus, as well as the portions
of the problem space that should be ignored.
In the following, we give a brief overview of our proposed mining approach and system, ELOQUENCE,
including a presentation of the different types of constraints supported. In doing so, we will refer
to the design goals satisfied by the design choices of our approach and the functionality of our
system.
The system supports the mining of patterns in datasets composed of sequences of ordered events and
of temporal
event sequences, where each event has a timestamp associated with it or a start time and a duration.
Note that in our approach point events within the same sequence cannot co-occur and interval events
cannot overlap.
Fig. 1. The ELOQUENCE system. (a) The constraints panel, (b) the pattern tree view, (c) the event
sequence view.
Our approach is based on the pattern-growth methodology [6] for mining frequent sequential patterns.
More concretely, it is an interactive version of the pattern-growth algorithm PrefixSpan [25] which
we steer visually through an interactive visualization system.
The pattern-growth methodology encompasses a family of algorithms based on the following key
features. First, pattern-growth uses a divide-and-conquer strategy. Based on the sequential patterns
mined so far, the search space is divided into disjoint subsets that are in turn recursively
searched for patterns. Second, increasingly smaller datasets, known as projected datasets, are
considered in each recursive step and mined locally for frequent patterns. Third, the pattern-growth
framework can easily push deeper into the algorithms a broader class of constraints [26],
denominated prefix-monotone, compared to other well-known methods such as Apriori based ones [23],
[27].
The pattern-growth methodology has two important advantages that motivate its choice in the context
of our work. Patterns are built incrementally, i.e. patterns of length (l+1) are grown from patterns
of length l > 0, starting from patterns of length one (frequent single events). This provides
natural points where the algorithm can be paused and permits the stepwise visualization of the
patterns as they are grown. Furthermore, the divide-and-conquer approach used to find patterns in
the projected datasets makes it possible to interactively add local constraints on the patterns
being built.
Based on these key features, we propose an approach that decomposes the pattern-growth based
PrefixSpan algorithm into its incremental steps, pauses the mining process at suitable points and
allows users to intervene and gear pattern discovery in the direction of their interest. This
proposed interactive stepwise computation of the algorithm satisfies design goal G1. User
intervention occurs in two ways. First, a user can impose local constraints at various levels of the
mined patterns so that different subsets of the search space adhere to constraints of different
types and severity. Second, a user can choose which patterns they want to grow during the mining
process or even decide to stop building patterns with a certain prefix, since it may be well-known
which sequences of events usually follow (thus, not leading to the discovery of novel knowledge). In
this way, the user can use her expert knowledge to prune the search space. These two ways of
interactively intervening in the algorithm computation support design goals G2 and G3 respectively.
Our interactive visualization system, ELOQUENCE, is designed around these characteristics of the
proposed mining approach and provides a powerful interface to the algorithm. Two main views compose
the system. A pattern view, in which patterns are visualized stepwise as they are grown by the
mining algorithm (Fig. 1(b)), allows a user to gain a gradual understanding of the dataset (G4, G6,
G7). A sequence view displays the event sequences that match a selected pattern and reveals how the
pattern is distributed in the data, providing in this way additional context to the mining process
(G5) (Fig. 1(c)).
The patterns in the ELOQUENCE pattern view are displayed as a tree, where nodes are associated with
events and each path (or branch) from the root of the tree to a leaf node corresponds to a sequence
pattern α ≡ e1 → ··· → en (Fig. 2). The user can click on a specific leaf node to further expand the
corresponding branch of the tree, and, consequently, further grow the pattern associated with the
branch (G2, G4, G6, G7). Growing a pattern α of length l > 0 corresponds to finding α-supersequence
patterns of length l + n, where n ≥ 1. A sequence α ≡ e1 → ··· → en is a subsequence of another
sequence β ≡ e′1 → ··· → e′m (0 ≤ n ≤ m), denoted by α ⊑ β, if e1 = e′_{i1}, ..., en = e′_{in}, for
some integers 1 ≤ i1 < ··· < in ≤ m. Sequence β is then a supersequence of α.
Fig. 2. Tree representation examples based on the Reingold-Tilford layout [28] used in ELOQUENCE.
(a) Horizontal layout. (b) Radial layout. (c) Vertical layout.
It is possible to adopt a breadth-first exploratory search strategy for sequence patterns, starting
with the most often occurring events, i.e. patterns of length 1, or use a depth-first search
strategy which fully expands a tree branch before building another tree branch. In practice, the
user may build the pattern tree using a mixed strategy, where breadth-first is applied to a sub-tree
until a certain level and depth-first search is applied to another sub-tree.
The user can also prune a sub-tree from the existing pattern-tree which implies that the tree
visualization is updated so that the pruned sub-tree will not be shown. In this way, the user can
avoid having a visually cluttered display by selecting which parts of the search space of patterns
are visualized at each moment (G3, G7). For example, the user may be interested in only visualizing
the patterns starting with events e1 or e2.
Upon inspection of the sequences in the dataset with a pattern α, the user may decide to modify
existing constraints, or add new ones, before expanding the tree with the α-supersequence patterns
(G1, G2). For instance, assume the user exploring a dataset of web page visits selects the pattern
α ≡ frontpage → news on the tree of figure 2(c). She can then adjust the support and further filter
the sequences carrying pattern α so that mining of the α-supersequence patterns of length l ≥ 3
focuses on those sequences for visitors who exited the website from a business site. To this end,
each node is associated with a set of constraints which may be different from the set of constraints
applied to the parent node. By default, the constraints associated with a node propagate to its
children. An important point to bear in mind is that different sets of constraints C1, ..., Cn
(n ≥ 1) can be linked to consecutive (and non-overlapping) subsequences of a pattern α (ik ≥ 1, with
1 ≤ k ≤ n), as depicted below in equation (1) (G2). This is also a fundamental difference between
our proposed approach and other existing systems (e.g. [3], [5]) for visually analysing event
sequence data.
\[ \alpha \equiv \underbrace{e_{1,1} \to \cdots \to e_{1,i_1}}_{C_1} \to \cdots \to \underbrace{e_{n,1} \to \cdots \to e_{n,i_n}}_{C_n} \qquad (1) \]
Independently of how deep a pattern α ≡ e1 → ··· → en (with n ≥ 1) has been expanded, the user can
always select again pattern α, modify the constraints applied at the pattern’s node labelled with
event en, and find the α-supersequence patterns that satisfy the new set of constraints (G2, G6,
G7). The sub-tree with root at the pattern’s node labelled with event en is re-computed and
displayed.
Our approach also supports time-specific constraints for temporal event sequences. For example, it
is possible to set that the maximum time elapsed between events in a pattern α is t1 ≥ 0. Since
patterns can be built in a stepwise manner, the user may modify such constraints so that, after
event e3, the maximum time elapsed between the pattern’s events is increased to t2 (equation (2)).
\[ \alpha \equiv \underbrace{e_1 \to e_2 \to e_3}_{\text{max duration between events} = t_1} \underbrace{{} \to e_4 \to e_5 \to e_6}_{\text{max duration between events} = t_2} \qquad (2) \]
Domain knowledge can be represented as simple ontologies, hierarchies of event types, to express for
instance that “visits to webpages (world) news, on-air and local (news) are news related events”
(Fig. 3). Thus, within ELOQUENCE a user can express “is-a” relationships between event types. It is
also possible to choose at which level of detail sequence patterns should be mined and the user may
increase or decrease the level of detail while a pattern is being built (G2). For instance, she may
start the mining process at a finer level of detail (e.g. use events like webpage visits to (world)
news and local (news)) and, before starting to explore α-supersequence patterns of a selected
pattern α, decrease the level of detail (e.g. use the more general event news related visits
instead of (world) news and local (news)) as a way to get sensible patterns without decreasing the
support (Fig. 3(c)).
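The interplay of stepwise growth, node-local constraints that propagate to children by default, and a maximum time gap that is relaxed part-way through a pattern (as in equation (2)) can be illustrated with a small Python sketch. It is not the ELOQUENCE implementation: the names `Constraints`, `Node`, `make_root` and `grow`, the absolute support count, and the toy web-visit data are all assumptions made for illustration, and each sequence is assumed to be a list of (event, timestamp) pairs sorted by time.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Constraints:
    """Hypothetical constraint set attached to one node of the pattern tree."""
    min_sup: int = 2                # minimum number of supporting sequences
    max_gap: float = float("inf")   # max time between consecutive pattern events


@dataclass
class Node:
    pattern: tuple                  # events on the path from the root to this node
    constraints: Constraints
    occurrences: set                # (sequence id, index of the last matched event)
    children: list = field(default_factory=list)

    @property
    def support(self):
        return len({sid for sid, _ in self.occurrences})


def make_root(dataset, constraints):
    # Index -1 means "nothing matched yet", so any event may start a pattern.
    return Node((), constraints, {(sid, -1) for sid in range(len(dataset))})


def grow(node, dataset, constraints=None):
    """Grow the node's pattern by one event (one natural pause point).

    Passing `constraints` overrides the set stored at this node only.
    """
    if constraints is not None:
        node.constraints = constraints
    c = node.constraints
    extensions = defaultdict(set)
    for sid, pos in node.occurrences:
        seq = dataset[sid]
        for j in range(pos + 1, len(seq)):
            event, time = seq[j]
            if pos >= 0 and time - seq[pos][1] > c.max_gap:
                break               # later events are even further away in time
            extensions[event].add((sid, j))
    # The sub-tree is recomputed; new children inherit this node's constraints.
    node.children = [
        Node(node.pattern + (event,), c, occ)
        for event, occ in sorted(extensions.items())
        if len({sid for sid, _ in occ}) >= c.min_sup
    ]
    return node.children


# Toy data (assumed): three visits, timestamps in arbitrary units.
dataset = [
    [("frontpage", 0), ("news", 5), ("sports", 8)],
    [("frontpage", 0), ("news", 3), ("sports", 30)],
    [("frontpage", 0), ("news", 40), ("sports", 45)],
]
root = make_root(dataset, Constraints(min_sup=2, max_gap=10))
frontpage = grow(root, dataset)[0]
news = grow(frontpage, dataset)[0]
strict = grow(news, dataset)                        # inherited max_gap = 10
relaxed = grow(news, dataset, Constraints(2, 30))   # local override at this node
```

In this sketch every occurrence of a pattern is tracked rather than only the first one, since under a gap constraint the first match of a prefix is not necessarily the one that can be extended. With the inherited gap of 10 the pattern frontpage → news cannot be extended, while relaxing the gap at that node alone yields frontpage → news → sports without recomputing any other part of the tree.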
Fig. 3. Ontology example. (a) Original events. (b) Events expressed using an ontology. (c) Patterns
mined at various ontology levels.
Finally, within ELOQUENCE our approach supports two sequence mining modes: forward mining and
backward mining. Given a selected pattern α, forward mining is used to grow α to its
α-supersequence patterns, while backward mining finds all patterns β having α as a suffix (G2, G7).
For instance, while exploring an early warning system dataset of an electricity plant, backward
mining can be used to find pattern sequences of events that lead to a turbine failure event. If
instead one wants to discover the most often occurring sequences of events after a turbine failure,
then forward mining should be used. The choice of backward versus forward mining and the choice of
constraints help to steer the focus of the search, embedded in the mining algorithm, in the
direction of those phenomena the user is interested in. The visual interface of ELOQUENCE
facilitates this exploratory mining process.
4 PATTERN-GROWTH BASED MINING APPROACH
This section describes in deeper detail how patterns are mined in ELOQUENCE. First, we review the
pattern growth approach our system is based on. We then present the types of constraints that are
currently supported in ELOQUENCE and finally discuss how we have adapted the PrefixSpan algorithm to
suit the needs of our approach.
Pattern growth [25], [26] is the sequence mining methodology underlying our system, as reflected by
the mining algorithm described in 4.2. Pattern-growth adopts a projection-based divide-and-conquer
strategy to frequent sequential pattern mining.
Let D be a dataset of event sequences and E_D be the set of events occurring in D. The length of an
event sequence α ≡ e1 → ··· → el (with {e1, ..., el} ⊆ E_D) is the number of events l > 0 occurring
in the sequence. The empty sequence, designated as ε, has length zero.
The support of a sequence of events α, denoted as sup_D(α), is defined as the number or percentage
of sequences β ∈ D such that α ⊑ β, for a given sequence dataset D. Moreover, α is a pattern in D,
if sup_D(α) ≥ min_sup, where min_sup is the minimum support threshold. Given a sequence
α ≡ ε → e1 → ··· → en (n > 0), a sequence β ≡ ε → e1 → ··· → ek is a prefix of α, if 0 ≤ k < n.
Then, the sequence ek+1 → ··· → en is the α-suffix with respect to β. Sequences with prefix β are
called β-supersequences.
Pattern-growth algorithms partition the search space of sequential patterns, as follows. Consider
that α is a sequential pattern of length l > 0 and that {α → e1, ..., α → ek}, with k > 0, is the
set of patterns of length l + 1 with prefix α. Then, the search for patterns with prefix α is
decomposed in k sub-problems, each corresponding to the search of patterns with prefix α → ei, for
1 ≤ i ≤ k. The search process starts with the frequent events in D, i.e. the patterns of length one.
The patterns with prefix α are mined by considering only the relevant part of the dataset D, named
α-projected dataset and designated as D|α. If α is a sequential pattern then the α-projected dataset
is the collection of suffixes of sequences in D which have the prefix α.
The pattern-growth algorithm PrefixSpan [25] is shown in algorithm 1 below. The call to
PrefixSpan(ε, D) can then mine all sequential patterns in a sequence dataset D.
Algorithm 1 PrefixSpan
function PREFIXSPAN(α, D|α)
    P ← {α → e | e is a frequent event in D|α}
    return P ∪ ⋃_{β∈P} PrefixSpan(β, D|β)
end function
As Algorithm 1 above shows, each mining sub-problem PrefixSpan(β, D|β) is solved by mining locally a
smaller part of the original dataset, i.e. only the projected dataset D|β needs to be considered.
The length of the sequences in a projected dataset D|β decreases when compared to the length of the
sequences in D|α, though the number of sequences in D|β and D|α may remain the same. Most often, the
number of sequences in D|β also quickly decreases.
Pseudo-projection [25] is a technique to reduce the main memory requirements of PrefixSpan. Each
suffix β ∈ D|α can be represented by a pair of integers (i1, i2), where i1 > 0 is the id of the
sequence S ∈ D such that β is the S-suffix with respect to α and i2 > 0 is the starting position of
the projected suffix β in the sequence S.
The pattern growth methodology underlying the PrefixSpan algorithm has several key advantages which
facilitate the integration of an exploratory visual oriented interaction with the user during the
mining process. First, no candidate sequences need to be generated, reducing in this way the search
space. Second, patterns are grown incrementally by
length, using a divide-and-conquer strategy, providing in this way points in the execution of the algorithm where it can be paused. At these points, a visual interface can then display a meaningful layout of the patterns that the algorithm has uncovered so far (Fig. 1(b)). The user can also select one of these patterns and explore properties of the population of sequences supporting the selected pattern. For instance in Fig. 1, the pattern frontpage → news is selected and the bottom window allows the user to inspect all sequences supporting the pattern. Third, growing a pattern α of length l > 0 to a pattern of length l + 1 is done by mining locally a separate (projected) dataset D|α [25]. This is a key feature for creating a flexible user interface in the proposed system, since it opens up the possibility of setting separate constraints on the patterns being created and allows an incremental process of constraint refinement.

4.1 Constraints
Constraints are particularly relevant to the human-centred exploratory mining approach described in this paper. They facilitate user exploration and control, and consequently, increase focus of the mined patterns. ELOQUENCE supports four categories of constraints: support constraints, event constraints, gap constraints, and data filters. The applicability of these depends on the type of events that the sequence datasets are composed of.
Assume that for each sequence dataset D to be mined, there is an associated ontology of events. An ontology expresses knowledge about events in the form of a hierarchy, where similar types of events can be grouped by the user into more general categories of events. Thus, an ontology is a pair O = (E_O, H), where E_O is the set of ontology events and H is a set of statements of the form “e_1 is-a e_2”, with {e_1, e_2} ⊆ E_O. Note that each event occurring in the dataset D belongs to the set of ontology events (i.e. E_D ⊆ E_O), though the user can add to E_O more general categories of events which do not occur in D. Thus, the pair (E_D, ∅) is the default ontology associated with the dataset D.
The function sat(D, c, α → e) captures whether the last event (e) added to a sequence α satisfies a given constraint c, where c is either a support, event or gap constraint. This function is used to describe Algorithm 2 underlying the proposed system. Assume k is an integer larger than zero.

Support constraints
A constraint min_sup = k specifies the minimum support that a sequence of events has to satisfy to be considered a pattern. Moreover, sat(D, min_sup = k, α → e) evaluates to true, if and only if e is a frequent event in a given sequence dataset D (i.e. sup_D(e) ≥ k). Support constraints apply to all types of event sequences.

Event constraints
Event constraints apply separately to each event in the pattern. Different types of event constraints are supported.
• Ontology level. Assume that O = (E_O, H) is the ontology associated with a sequence dataset D. The user can then select at which generalization level the events should appear in the sequential patterns built from D. An ontology level constraint is a subset L_O ⊆ E_O and sat(D, L_O, α → e) evaluates to true, if and only if e ∈ L_O. Ontology level constraints apply to all types of events since they are associated with the event's type description and not its temporality.
• Event duration. For temporal sequences, where the occurrence of each event is associated with a start time and a duration, event duration constraints allow the user to express min (max) duration of events in a pattern. For instance, the user may wish to exclude patterns composed of short duration events (i.e. events whose duration is below a given threshold). Then, sat(D, min_dur = k, α → e) evaluates to true, if and only if event e's duration is at least k units of time.
• Event filter. An event filter constraint specifies a subset of events E′ ⊆ E_D that can be present in the patterns. Then, sat(D, E′, α → e) evaluates to true, if and only if e ∈ E′. Event filter constraints are also associated with the event type description and hence applicable to all types of events.

Gap constraints
These constraints can be applied to the gaps between events in a pattern. Gap constraints allow the user to set the min (max) number of events that can occur between two adjacent events of the pattern. If temporal information is associated with the occurrence of events, then gap constraints can be used to express min (max) time elapsed between two adjacent events of the pattern. For instance for temporal events, sat(D, max_gap = k, α → e) evaluates to true, if at most k units of time elapsed between the end of the last event of sequence α and the start of event e. If α ≡ ε then the function evaluates to true.

Data filters
A data filter constraint F is represented as a regular expression and can be applied by the user to select a subset of the sequence dataset to be mined. Thus, F(D) contains all event sequences in D which are accepted by the deterministic finite automaton corresponding to F. These constraints also apply to all types of event sequences supported.

Fig. 4. Examples of ELOQUENCE views: (a) Pattern tree view with node information pop-up. (b) Non-temporal event sequence view with a pattern highlighted. (c) Non-temporal event sequence view with pattern highlighted and aligned by the first pattern event. (d) Temporal event sequence view with pattern highlighted. (e) Temporal event sequence view with pattern highlighted and aligned by the second pattern event.

4.2 Algorithm for stepwise expansion of patterns
Recall that ELOQUENCE builds a tree of patterns (Fig. 1(b)), where each path from the root of the tree to a node corresponds to a sequential pattern. In contrast to the depth-first search approach adopted by the PrefixSpan algorithm, our system allows the user to select other search approaches (e.g. breadth-first search). More concretely, it is possible to select a specific node of the tree to further grow the sequential pattern α associated with the branch of the tree ending at the selected node. One point worth noting here is that a set of constraints is associated with each node and the user can modify the set of constraints associated with it, before finding longer patterns with prefix α. By default, the set of constraints associated with a parent node propagates to its children. Each node N stores the following information.
• A set of user-specified constraints. These can include constraints of any of the four types described above. The set of support, event and gap constraints stored in N is designated as N.c.
• An event. The event stored in a node N is designated as N.e.
• A pseudo-projected dataset D|α, where α ≡ e_1 → ··· → N.e is the sequence of events corresponding to the path from the tree's root to node N. The pseudo-projected dataset stored at node N is designated as N.db (i.e. N.db = D|α).
• A (possibly empty) list of child nodes, denoted as N.children.
Note that each node stores a pseudo-projected dataset (instead of a projected dataset). Recall that each suffix β ∈ D|α is then represented by a pair of integers, though β can be a long sequence of events.
Algorithm 2 is used when the user selects a node N to further expand the pattern α associated with it.

Algorithm 2 Expand one level of sub-tree with root N
procedure Expand(N)    ▷ N is a user selected node
    N.children ← ∅
    User can modify constraints in N.c
    User can define a new data filter F
    if (new data filter) then
        N.db ← F(N.db)
    end if
    P ← {e | e is a frequent event in N.db}
    φ ← {e ∈ P | ∀c ∈ N.c : sat(N.db, c, N.e → e)}
    for all e_i ∈ φ do
        Create new node N_i
        N_i.e ← e_i
        N_i.c ← N.c
        N_i.db ← N.db|e_i
        N_i.children ← ∅
        Add N_i to N.children
    end for
end procedure
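The expansion step of Algorithm 2 can be illustrated with a minimal Python sketch. It is not the ELOQUENCE implementation: the names `Node`, `sat` and `expand` are hypothetical, only a support constraint and an event filter are modelled, `min_sup` is taken as an absolute count rather than a percentage, and `sat` is simplified to receive the candidate event together with its support. Each node keeps a pseudo-projected dataset as (sequence id, start position) pairs, as described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    event: Optional[str]   # N.e (None for the root, i.e. the empty pattern)
    db: list               # N.db: pseudo-projection as (sequence id, start position) pairs
    constraints: dict      # N.c, e.g. {"min_sup": 3, "event_filter": {"news"}}
    children: list = field(default_factory=list)


def sat(constraints, event, support):
    """Simplified constraint check: minimum support and optional event filter."""
    if support < constraints.get("min_sup", 1):
        return False
    allowed = constraints.get("event_filter")
    return allowed is None or event in allowed


def expand(node, dataset):
    """Grow the pattern ending at `node` by one event (one tree level)."""
    node.children = []
    # For every event, collect the suffix that follows its first occurrence
    # in each projected sequence; the list length is the event's support.
    first_hit = {}
    for seq_id, start in node.db:
        seen = set()
        for pos in range(start, len(dataset[seq_id])):
            e = dataset[seq_id][pos]
            if e not in seen:
                seen.add(e)
                first_hit.setdefault(e, []).append((seq_id, pos + 1))
    for e, projected in sorted(first_hit.items()):
        if sat(node.constraints, e, len(projected)):
            # Children inherit a copy of the parent's constraints by default.
            node.children.append(Node(e, projected, dict(node.constraints)))
    return node.children


dataset = [
    ["frontpage", "news", "tech"],
    ["frontpage", "news", "frontpage"],
    ["tech", "frontpage", "news"],
    ["news", "sports"],
]
root = Node(None, [(i, 0) for i in range(len(dataset))], {"min_sup": 3})
frontpage = expand(root, dataset)[0]   # children are sorted by event name
news = expand(frontpage, dataset)[0]   # pattern frontpage → news, support 3
print(expand(news, dataset))           # [] under the inherited min_sup = 3
news.constraints["min_sup"] = 1        # local change, rest of the tree untouched
print([c.event for c in expand(news, dataset)])  # ['frontpage', 'tech']
```

Lowering `min_sup` on the single node `news` lets the pattern frontpage → news grow further without recomputing any other part of the tree, which is the effect of local constraints described in this section. Calling `expand` recursively on every child, depth first, would correspond to the exhaustive search of Algorithm 1.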
5 REPRESENTATION AND USER INTERFACE
ELOQUENCE is composed of two main linked visual representations: the pattern tree view (Fig. 1(b)) for interactively steering the mining process and the event sequence view (Fig. 1(c)) for inspecting the event sequences and viewing the distribution of the mined patterns. In addition, a number of panels are available for setting constraints, filtering the data and providing surrounding information (Fig. 1(a)).
The central view of ELOQUENCE, namely the pattern tree view, is a tree representation that is used for interacting with the mining algorithm and displaying the patterns as they are mined (Fig. 2). By clicking on the nodes of the tree representation, the user decides which patterns to grow and
thus dynamically controls the execution of the algorithm, as described previously. Apart from growing patterns one level at a time, the user can also choose to grow a larger number of levels on each click of a node, for example to mine k > 1 levels at each click. Initial constraints and data filters are set by the user. At any time and any pattern growth level the user has the possibility to set new local constraints and filters in the constraints panel.
The pattern tree can be drawn using different layouts: a Reingold-Tilford layout [28] expanding in a vertical or horizontal direction, or a radial layout (Fig. 2). Sequence patterns start from the root and are grown rightwards, downwards or outwards, respectively. The nodes represent the different events and the edges link the events composing the sequence patterns.
Nodes are assigned a colour indicating their event type. By default, random colours are assigned to the nodes. A user can adjust these colours through a colour legend and save the information for future use when analysing the same dataset. Colours can also be assigned by the user with respect to the ontology classification associated with the events, if available. All event types belonging to the same category can then be set to have the same colour, or shades of the same colour, according to preference.
The thickness of the edges, by default, is proportional to the support value of the patterns and a random colour is assigned to them (Fig. 2(a)). The user can choose to instead map the changes in the constraints used in the mining process onto the colour of the pattern tree edges. In this case, a different colour is assigned to an edge as a visual cue to indicate that diverging constraints apply (G5). This can be seen in Fig. 2(c), where the constraints applied to the sub-tree with root at event health are different from the rest of the tree, and also in Fig. 3(c) where a change in the ontology level has been applied to sub-trees with roots at frontpage and news and in addition different support has been applied to local. In order to gain further understanding of the changes made, the user can inspect which constraints have been altered in the constraints panel. By clicking on a node the user can restore the set of configurations that have been applied to that node's sub-tree. The constraints that differ from the previous level will be updated and highlighted in the panel.
During the mining and exploration process, in order to get a less cluttered view, a user can choose to collapse the subtree of a node or select a node as the root of the tree and therefore only show the subtree of the node. This can be seen in Fig. 7(b), where the node travel by car has been set as the root.
The user can explore the distribution of the mined patterns across the data by selecting patterns within the tree and inspecting them in the accompanying linked event sequence view (Fig. 1(c)). In this view, the event sequences supporting the selected pattern are displayed as horizontal bars consisting of the different events and ordered along the y-axis. If the events have temporal information associated with them, then the bars can be positioned according to their start time and drawn with a length proportional to the duration of the events (Fig. 4(d)); otherwise events are drawn using fixed size squares (as in Fig. 4(b)). In both the pattern tree and the event sequence representation of ELOQUENCE, colour reflects the event type category.
Events composing the currently mined pattern, or a user selected pattern, are highlighted in the accompanying event sequence view by being displayed opaque, while non-pattern events are displayed transparent (Fig. 4(b-e)). The user can adjust the level of transparency of non-pattern events interactively. By clicking on a pattern event in the event sequence view, the sequences are aligned by that event, i.e. clicking on the first event of the pattern will align all matched patterns by the first event (Fig. 4(c)). This way, a user can compare the characteristics of the pattern across the sequences and inspect the events surrounding the pattern.

Fig. 5. ELOQUENCE constraint panel editors: (a) Regular expression editor. (b) Ontology editor. (c) Event filter editor.

The constraints panel includes all constraints available to the user for controlling the search space of the algorithm, as described in section 4.1. For most constraints, the user is expected to set a numeric threshold in the interface corresponding to time, percent, or number depending on the type of constraint. Event filters are set by allowing the user to choose specific event types to exclude from the pattern mining process (Fig. 5(c)). The user can also choose to discard equal subsequent events in the mined patterns, i.e. the website visit pattern frontpage → frontpage → news would become frontpage → news. Data filters are controlled in a drag and drop editor for interactively creating regular expressions in order to filter the event sequence data (Fig. 5(a)). A user can, for example, specify that she is only interested in mining web visit sequences
that start with a visit to a news page. Finally, the constraints panel includes an ontology editor for setting the ontology level constraints. Within this editor a user can express “is-a” relationships of event types. The editor is composed of a tree representation of the event type hierarchy in which a user can introduce new event types at any level of the hierarchy and assign existing event types to them by dragging the nodes of the tree. Selecting and deselecting nodes in the ontology adjusts the level of detail of the event sequences, i.e. sets ontology level constraints (Fig. 5(b)).

6 USE CASES
We will exemplify ELOQUENCE using two datasets. The first one is a web dataset composed of sequences of web site visits and the second is a time use dataset composed of individuals' daily performed activities.

6.1 Web data exploration
The web data comes from Internet Information Server (IIS) logs for msnbc.com, and news-related portions of msn.com, for the entire day of September 28, 1999. Each sequence in the dataset corresponds to page views of a user during that twenty-four hour period (https://archive.ics.uci.edu/ml/datasets/MSNBC.com+Anonymous+Web+Data). The data was retrieved from the UCI Machine Learning Repository [29]. The dataset includes 989925 sequences of web site visits. There are 18 types of web sites, i.e. event types, included in the dataset and the average number of visits per user (average sequence length) is 5.7.
We start the exploration by setting the support constraint in the root node to 5% (min_sup = 5) and growing the pattern tree five levels. The results can be seen in Fig. 1. We notice that frontpage is one of the most visited webpages and users tend to return to it often, which is not surprising. The same behaviour is also observed with news. Also, news is the most frequent type of webpage visited after frontpage (support 7.23%) (excluding frontpage itself). tech is among the four most visited webpages and few users (less than 5%) visit another page after tech.
We get interested to further explore the behaviour of “tech visitors”. However, the node tech in the pattern tree of Fig. 1(b) cannot be further expanded, since no further patterns can be found with the 5% support threshold set initially. This fact is indicated in the tree by drawing an extra circle around the node. Thus to continue exploring patterns for “tech visitors”, one needs to modify the local support constraint stored in the tech node (by lowering it). This illustrates how the support can be modified on the fly without having to re-start the mining algorithm or even re-compute any part of the pattern tree. To minimize distractions, it is possible to only visualize the sub-tree with root node tech (Fig. 6(a)). If later on the user wishes to view the whole pattern tree again, then this can be achieved by clicking on the tech node, without any extra time costs in re-computing the patterns.

Fig. 6. Patterns of the subset of users that visit webpage tech: (a) Pattern tree. (b) Sequences aligned by first occurrence of tech.

Domain knowledge tells us that site visitors often tend to return to the frontpage. Thus, we decided to use an event filter constraint in the tech node to exclude the event frontpage from the patterns to be computed. The motivation for adding this local constraint was to avoid wasting time with computation of patterns that did not really convey any new knowledge, while at the same time fewer patterns are shown in the tree (which makes it easier to understand). This illustrates how constraints can be used to incorporate domain knowledge in the system to steer computations to the subspaces of interest (as required by design goals G2, G3, and G7).
Next, we wanted to investigate which pages most of the tech visitors visited before tech. Do they enter msnbc.com by tech directly? Inspecting the sequences in the event sequence view (Fig. 6(b)) with the first occurrence of tech highlighted indicates that this is in fact the case; many users enter the site by tech.
We also got curious about the patterns that characterize sport visitors who entered the site by the front page and then visit a news page. To this end, the node news with parent frontpage (Fig. 1(b)) is selected and several local constraints are then added to it. First, the support is lowered, since the support chosen to build the initial tree (5%) does not allow us to further find patterns with the prefix frontpage → news. Then, a data filter constraint is added to the same node to restrict the search of patterns to sequences that contain sport events. Finally, an event filter is added to prevent building patterns containing sport events, since all sequences being mined now contain sport events. One of the patterns then retrieved says that about 11% of the sport visitors return to the front page and then leave the site. In many of the systems reported in the literature [3], [5], [19], [21], the user would need to re-start the mining algorithm with new input parameters to tackle a similar problem. ELOQUENCE addresses this limitation with the use of local constraints.

6.2 Time use data exploration
The data comes from the 2010/2011 Swedish time use survey performed by the Swedish statistics bureau, Statistics Sweden (http://www.scb.se). The dataset is composed of 6477 sequences of individuals' daily performed activities, during one weekday and one weekend day. There are 74 types of activity, i.e. event types, included in the dataset and the average number of activities per individual (average
sequence length) is 32.8. The exploration scenario was developed together with Human Geography researchers from the Department of Technology and Social Change at our university who are performing research in everyday life. A main focus of theirs is to study patterns of activity, also referred to as daily life projects or practices, and how these are distributed across a population in terms of, for example, sex, age, day of the week etc.
We start the exploration by searching for patterns appearing in a large part of the population, so we set the initial support (min_sup) to 15%. We further set the maximum duration between pattern events (max gap duration) to 60 minutes, since activities occurring far from each other are not regarded as belonging to the same pattern. Finally, we specify that equal subsequent events within patterns should be ignored, so that patterns composed of the same activity are ignored (e.g. watch TV → watch TV → watch TV).
The pattern tree is expanded 3 levels from the root with these general constraints and the identified patterns are observed (Fig. 4(a)). Sleep is the most prominent activity in the dataset (99.92% support) followed by meal (99%), hygiene (97.61%), and then watch TV (83.33%). These activities and the immediate patterns that form by growing them further are quite obvious patterns one would expect from daily life activities (e.g. sleep → hygiene → meal).
An aspect of high interest in daily life studies is travel related practices, which reveal energy use patterns. One of the more prominent patterns in the pattern tree is travel by car → work (17.52%). This is also the only pattern starting with the activity travel by car that matches the set constraints. In order to explore other car dependent patterns in the population under study we choose the activity travel by car as a starting activity and modify the constraints from that node onwards. By decreasing the minimum support to 1% and setting the maximum allowed duration between pattern events to 5 minutes, the activities that people use their cars for can be identified. This illustrates the ability of the system to use local constraints, applied to a particular part of the search space, to guide the search (as stated by design goals G2, G3, and G7). This way, however, we only explore patterns starting with the activity travel by car. In order to get a better overview we also start a parallel exploration to inspect patterns that end with travel by car, i.e. mine backwards with travel by car as the final activity and using the same constraints. Figure 7 shows patterns that match the updated constraints both in the forward and backward mining case.

Fig. 7. Car dependent activity patterns: (a) Patterns ending with travel by car. (b) Patterns starting with travel by car. (c) Patterns ending with physical exercise → travel by car. (d) Patterns starting with travel by car → physical exercise.

Most activities that directly follow or precede travel by car are expected, for example work, shopping (grocery and other) and help/raise children (i.e. dropping off or picking children up from other activities). Studying how these patterns appear across the data in the accompanying event sequence view reveals distribution patterns. For example, the number of men that take the car to/from work is larger than the number of women, while more women drop off/pick up children and do shopping using the car. Some car dependent patterns however are diverging, e.g. using the car just before/after cleaning and physical exercise.
We choose, therefore, to explore even closer what happens before and after physical exercise. We reduce the
minimum support constraint even more, since this activity
doesn’t appear as often in the data, and grow the patterns
further (see Fig. 7(c,d)). In doing so, the one thing that
sticks out is the fact that some individuals seem to pause
their physical exercise to have coffee or eat a meal and then
continue with their exercise (see for example Fig. 7(c)). The
support for these patterns is very low (9 individuals in total);
this, however, triggered the interest of our colleagues in seeing
what the overall patterns of individuals that exercise look like.
Using data filters (regular expressions) we choose to
include in the pattern mining only individuals that at
some point during their day have engaged in physical
exercise ( * → physical exercise → *). This results
in a dataset of 1052 individuals. We apply the same set
of general constraints as in the initial exploration, i.e. 15%
support and maximum 60 minutes between pattern events.
The resulting pattern tree can be seen in Fig. 8. The most
noticeable difference between this sub-population and the
original population (Fig. 4(a)) is that the most prominent
activity of the new group is physical exercise, whose
support is even larger than that of sleep, the most prominent
activity in the overall population. Furthermore, this
group seems to use the car for going to exercise, while using
it for work is not as common a pattern. Other than this the
patterns of the two groups look very similar.
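As an illustration of how a data filter such as * → physical exercise → * selects a sub-population, the following is a minimal Python sketch. It is not the ELOQUENCE editor or its matching engine: the function names `compile_data_filter` and `apply_data_filter` are hypothetical, the filter is assumed to be given as a list of tokens in which "*" stands for any (possibly empty) run of events, and Python's backtracking `re` module is used in place of the deterministic finite automaton mentioned in section 4.1. The toy sequences are invented for the example.

```python
import re


def compile_data_filter(tokens, event_ids):
    """Translate filter tokens into a regex over encoded sequences.

    Every event is encoded as "<id>;", so "*" becomes "(?:\\d+;)*".
    """
    parts = []
    for tok in tokens:
        if tok == "*":
            parts.append(r"(?:\d+;)*")
        else:
            parts.append(f"{event_ids[tok]};")
    return re.compile("".join(parts))


def apply_data_filter(dataset, tokens):
    """Return F(D): the sequences of `dataset` accepted by the filter."""
    events = sorted({e for seq in dataset for e in seq}
                    | {t for t in tokens if t != "*"})
    event_ids = {e: i for i, e in enumerate(events)}
    regex = compile_data_filter(tokens, event_ids)

    def encode(seq):
        return "".join(f"{event_ids[e]};" for e in seq)

    # fullmatch anchors the filter to the whole sequence.
    return [seq for seq in dataset if regex.fullmatch(encode(seq))]


dataset = [
    ["sleep", "meal", "travel by car", "physical exercise", "meal"],
    ["sleep", "hygiene", "work", "watch TV"],
    ["physical exercise", "coffee", "physical exercise"],
]
exercisers = apply_data_filter(dataset, ["*", "physical exercise", "*"])
print(len(exercisers))  # 2
```

The same helper expresses the web data filter mentioned in section 5, sequences that start with a visit to a news page, as the token list ["news", "*"].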
Using our pattern exploration approach with this dataset makes it possible to start an exploration with strict constraints and relax them as the exploration proceeds and new hypotheses arise. Apart from common frequent patterns, it has allowed unusual patterns of activity to be identified and explored in a flexible and immediate way. If a traditional algorithm were to be used for the same task, then the constraints for the entire search would have had to be very loose from the beginning, which would have resulted in a much larger pattern search space and more time consuming computation. Furthermore, the number of identified patterns that the user would have to go through would have also been much larger, making pattern exploration more cumbersome.

Fig. 8. Patterns of sub-population that engage in physical exercise.

7 DISCUSSION
The motivation behind this work has been our vision of an approach to sequential pattern mining that deeply embeds interaction within the mining algorithm to facilitate exploratory mining. To this end, we have in this paper presented a novel interactive sequence mining approach based on the pattern-growth methodology, supporting local constraints, and implemented within ELOQUENCE.
A key strength of our proposed approach is the introduction of local constraints in the mining process. An advantage of the use of local constraints is that it allows expert knowledge to be incorporated on the fly, while searching for patterns. For instance, the user may know quite well that events in a set E usually occur after a sequence of events α. Thus, if α is revealed to be a pattern then the user can set event filter constraints to eliminate the events in E, when searching for α-supersequence patterns. In this way, the system does an amount of computation more proportional to the amount of new knowledge discovered and avoids displaying irrelevant patterns.
Another example that illustrates the relevance of local constraints is the fact that support constraints can be used to modify on the fly the support of patterns to be discovered. In our experiments with ELOQUENCE, we noticed that a common strategy is to set the initial support to a high value and expand the pattern tree a few levels. The choice of a high support leads to more common patterns and a smaller pattern tree. Upon inspection of the uncovered patterns in the pattern tree view, an interest in some of the patterns usually arises and there is the wish to further expand the tree with the corresponding supersequence patterns. Since the support tends to decrease as patterns get longer, it is common that some of the selected patterns cannot be further expanded due to the high support value set initially, preventing the user from continuing the exploration in the direction of interest. In some systems (e.g. [3], [5], [19], [21]), the user needs to re-start the mining algorithm with a lower value for the support. ELOQUENCE overcomes this limitation by simply allowing the user to modify the support constraint associated with a node (to lower the support) and continue finding supersequence patterns, without the need to redo computations.
Our proposed system embeds a “lazy execution” approach, in the sense that it is possible to select in the pattern view a pattern α of length l > 0 and then request the system to build the α-supersequences of length l + n, where n > 0. In other words, the user can select a leaf-node and expand the sub-tree with root at that node by e.g. two levels at a time (n = 2). The advantages of this approach are twofold. First,
the user can request the system to display a few patterns at a time (as required by design goals G4 and G6). Second, it is a way to control the cluttering of the display resulting from expanding the tree many levels at once. Based on observation, we conclude that it is preferable to select a node in the tree, expand it by a few levels, and then inspect the new patterns uncovered. In this way, the user can get a preliminary insight about the patterns, often using the event sequence view, and then decide which patterns to expand next. Note that users may even prune a sub-tree from the pattern-tree, if the patterns in that tree are deemed as not interesting enough.
A common issue for all sequence mining based approaches is that of scalability. This issue arises from the fact that datasets may consist of large numbers of sequences which may also include many different types of events. Moreover, mining these datasets may result in a large number of patterns that a user needs to assess and extract knowledge from. To overcome these problems, strict constraints are often applied to the algorithms in order to reduce the pattern search space. In doing so there is a trade-off made between the rigidity of the constraints and the quality of the patterns. Rigid constraints limit the search considerably but result in only frequent patterns being identified. Frequency of a pattern, however, is not necessarily equivalent to interestingness and relevance to the analyst's task. Our approach deals with the issue of scalability by providing functionality that allows an analyst to flexibly navigate and prune the pattern search space so that the algorithm is not applied on the entire dataset at once. This is done in 3 ways: (1) the computation of the algorithm occurs in a stepwise manner and is steered by the analyst (along the lines of design goals G1, G2, G3), (2) the analyst can apply local constraints on the fly without having to restart the computation (G2), and (3) a selection of carefully chosen types of constraints is available that enables flexible manipulation of the data (G2, G3). Some examples of how this functionality can be applied are described below.
• The user has the option of building an ontology of events, creating more general categories of events. Ontology level constraints can then be added to the root of the tree expressing that (a smaller number of) more general categories of events should be used to build patterns. In this way, the number of patterns can be reduced. The user can expand the tree in the pattern view, starting at the root node, a few levels. This can

analyst time. This strategy is also in accordance with the design goals G2, G4, and G7.
• The user may decide which events can appear in the patterns. If some types of events are less relevant then event filters can be used to exclude these events from the patterns. Most likely, a smaller number of patterns is generated. As described above, this can be used as a first step in the data exploration. In a second step, the user can relax the event filters in some nodes of selected patterns, then request the system to compute and display the corresponding pattern sub-trees, possibly including more event types.
A limitation of the presented work is that the system has not been evaluated with high dimensional data. Sequences and events can be associated with surrounding information and it should be possible to incorporate this information in the mining process. Another limitation is that aggregate constraints over patterns, such as average duration of the events in a pattern, have not yet been considered. These types of constraints are very relevant when mining high dimensional datasets. Therefore, further work needs to be done to address these limitations, within our vision of interactive visual sequence mining.

8 CONCLUSIONS AND FUTURE WORK
The main contribution of the proposed work is an interactive sequence mining approach that allows a user to progressively refine constraints while pattern sequences are being built, enhancing in this way user exploration and control over the search for interesting patterns. This contrasts with existing interactive sequential pattern mining systems [3]–[5] that mostly offer the possibility of setting constraints at the start of the mining process and then use different visualization techniques to explore the resulting patterns. Consequently, the latter tend to treat the mining process as a black box while our approach and prototype system, ELOQUENCE, attempts to open the box, reveal the process and allow a user to intervene and steer it.
Additional key strengths of ELOQUENCE are the following. First, it combines two visual views, pattern tree and event sequence view, providing in this way additional context to the mining process by revealing how a selected pattern appears in the data. Second, different types of constraints are supported such as ontology level or gap constraints, and data filters. The practical usefulness of these features is demonstrated by two example use cases
be used as a first step in the data exploration aiding presented in section 6.
the user to get a first insight of which patterns may be Several interesting problems merit further research. First,
more interesting. The user can then continue the pattern we would like to investigate how our proposed interactive
exploration process by relaxing the ontology level con- “transparent box” approach can be incorporated in other
straints in some deeper nodes (e.g. leaves) of selected sequence mining algorithms. It would also be interesting to
patterns. Note that the size of the projected database closely examine how the pattern-growth approach can be
associated with a node decreases with the depth of the extended to mine soft sequential patterns [30], and which
node, making it often feasible to relax the ontology type of constraints and visualization techniques could be
level constraints for deeper nodes. The user can then used to guide the search for such patterns. In the current
request for several levels of the sub-tree with root at a status of E LOQUENCE, pattern support is computed based
selected deeper node, like a leaf node, to be computed on the first match of the pattern in a sequence. A future
and displayed, including more specific types of events. step would be to extend this to also take into account the
In this way, it is possible to progressively get insights of number of times a pattern appears within a sequence, as
the patterns without wasting computational time nor in [5]. Furthermore, more research is required to find ways
to visualize patterns that minimize visual clutter. This problem is particularly relevant when big
datasets are analysed. Possibilities include representing sequential patterns in more expressive
languages or allowing the user to define non-trivial pattern ranking criteria. Finally, a formal
usability evaluation of the system described here has not yet been performed, though we plan it as a
next step in our work.

ACKNOWLEDGMENTS

This research has been partly supported by CENIIT, Center for Industrial Information Technology at
Linköping University, and by the RESKILL project, funded by public research and innovation funds
from the Swedish Transport Administration, with the Swedish Maritime Administration and the Swedish
Air Navigation Service Provider LFV. We would like to thank Joakim Deborg and Sergey Ignatenko for
their contribution to the implementation.

REFERENCES

[1] C. H. Mooney and J. F. Roddick, “Sequential Pattern Mining - Approaches and Algorithms,” ACM Computing Surveys, vol. 45, no. 2, 2013.
[2] K. Vrotsou, K. Ellegård, and M. Cooper, “Exploring Time Diaries Using Semi-Automated Activity Pattern Extraction,” electronic International Journal of Time Use Research, vol. 6, no. 1, pp. 1–25, 2009.
[3] A. Perer and F. Wang, “Frequence: Interactive Mining and Visualization of Temporal Frequent Event Sequences,” in Int’l Conf on Intelligent User Interfaces. Haifa, Israel: ACM, 2014, pp. 153–162.
[4] B. C. Kwon and A. Perer, “Peekquence: Visual Analytics for Event Sequence Data,” in KDD 2016 Workshop on Interactive Data Exploration and Analytics (IDEA’16), San Francisco, CA, USA, 2016, pp. 72–75.
[5] P. J. Polack, S.-T. Chen, M. Kahng, K. de Barbaro, M. Sharmin, R. Basole, and D. H. Chau, “Chronodes: Interactive Multi-focus Exploration of Event Sequences,” CoRR, vol. abs/1609.0, 2016.
[6] J. Han, J. Pei, and X. Yan, “Sequential pattern mining by pattern-growth: principle and extensions,” Studies in Fuzziness and Soft Computing, vol. 180, pp. 183–220, 2005.
[7] C. Plaisant, B. Milash, A. Rose, S. Widoff, and B. Shneiderman, “LifeLines: visualizing personal histories,” in CHI ’96: Proc. of the SIGCHI conference on Human factors in computing systems. New York, NY, USA: ACM, 1996, pp. 221–227.
[8] T. D. Wang, C. Plaisant, A. J. Quinn, R. Stanchak, and S. Murphy, “Aligning Temporal Data by Sentinel Events: Discovering Patterns in Electronic Health Records,” CHI 2008 Proceedings · Health and Wellness, pp. 457–466, 2008.
[9] T. D. Wang, C. Plaisant, B. Shneiderman, N. Spring, D. Roseman, G. Marchand, V. Mukherjee, and M. Smith, “Temporal Summaries: Supporting Temporal Categorical Searching, Aggregation and Comparison,” IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 6, pp. 1049–1056, 2009.
[10] M. Monroe, R. Lan, H. Lee, C. Plaisant, and B. Shneiderman, “Temporal event sequence simplification,” IEEE Transactions on Visualization and Computer Graphics, vol. 19, no. 12, pp. 2227–36, 2013.
[11] S. Guo, K. Xu, R. Zhao, D. Gotz, H. Zha, and N. Cao, “Eventthread: Visual summarization and stage analysis of event sequence data,” IEEE Transactions on Visualization and Computer Graphics, vol. 24, no. 1, pp. 56–65, Jan 2018.
[12] J. A. Fails, A. Karlson, L. Shahamat, and B. Shneiderman, “A Visual Interface for Multivariate Temporal Data: Finding Patterns of Events across Multiple Histories,” in IEEE Symposium on Visual Analytics Science and Technology, 2006, pp. 167–174.
[13] K. Wongsuphasawat and D. Gotz, “Exploring Flow, Factors, and Outcomes of Temporal Event Sequences with the Outflow Visualization,” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2659–2668, 2012.
[14] D. Gotz and H. Stavropoulos, “Decisionflow: Visual analytics for high-dimensional temporal event sequence data,” IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 1783–1792, Dec 2014.
[15] M. Monroe, R. Lan, J. Morales del Olmo, B. Shneiderman, C. Plaisant, and J. Millstein, “The Challenges of Specifying Intervals and Absences in Temporal Queries: A Graphical Language Approach,” in CHI 2013, 2013, pp. 2349–2358.
[16] B. C. M. Cappers and J. J. van Wijk, “Exploring multivariate event sequences using rules, aggregations, and selections,” IEEE Transactions on Visualization and Computer Graphics, vol. 24, no. 1, pp. 532–541, Jan 2018.
[17] D. Sacha, H. Senaratne, B. C. Kwon, G. Ellis, and D. A. Keim, “The Role of Uncertainty, Awareness, and Trust in Visual Analytics,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 240–249, 2016.
[18] T. Muhlbacher, H. Piringer, S. Gratzl, M. Sedlmair, and M. Streit, “Opening the black box: Strategies for increased user involvement in existing algorithm implementations,” IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 1643–52, 2014.
[19] C. D. Stolper, A. Perer, and D. Gotz, “Progressive visual analytics: User-driven visual exploration of in-progress analytics,” IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 1653–1662, 2014.
[20] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, “Sequential Pattern mining using a bitmap representation,” in Proceedings of the eighth ACM SIGKDD int’l conf on Knowledge discovery and data mining - KDD ’02. Edmonton, Alberta, Canada: ACM Press, 2002, p. 429.
[21] Z. Liu, Y. Wang, M. Dontcheva, M. Hoffman, S. Walker, and A. Wilson, “Patterns and Sequences: Interactive Exploration of Clickstreams to Understand Common Visitor Paths,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 321–330, 2017.
[22] K. Vrotsou, K. Ellegård, and M. Cooper, “Everyday Life Discoveries: Mining and Visualizing Activity Patterns in Social Science Diary Data,” in 11th Int’l Conf Information Visualization, Zürich, Switzerland, 2007, pp. 130–138.
[23] R. Agrawal and R. Srikant, “Mining sequential patterns,” in 11th Int’l Conf on Data Engineering, Mar 1995, pp. 3–14.
[24] K. Vrotsou and A. Nordman, “Interactive Visual Sequence Mining based on Pattern-Growth,” in IEEE Conference on Visual Analytics in Science and Technology (VAST), 2014, pp. 285–286.
[25] J. Pei, J. Han, B. Mortazavi-asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, “Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach,” IEEE Trans on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1424–1440, 2004.
[26] J. Pei, J. Han, and W. Wang, “Constraint-based sequential pattern mining: the pattern-growth methods,” Journal of Intelligent Information Systems, vol. 28, no. 2, pp. 133–160, 2007.
[27] R. Srikant and R. Agrawal, “Mining sequential patterns: Generalizations and performance improvements,” in Advances in Database Technology - EDBT ’96, 1996, pp. 1–17.
[28] E. Reingold and J. Tilford, “Tidier Drawings of Trees,” IEEE Transactions on Software Engineering, vol. 7, no. 2, pp. 223–228, 1981.
[29] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[30] D. Gotz, “Soft Patterns: Moving Beyond Explicit Sequential Patterns During Visual Analysis of Longitudinal Event Datasets,” in IEEE VIS 2016 Workshop on Temporal & Sequential Event Analysis, 2016, pp. 1–2.

Katerina Vrotsou is an Assistant Professor in Information Visualization at Linköping University,
Sweden, where she works with interactive visual analysis of multidimensional event-based data. She
received her PhD in Visualization and Interaction in 2010 from Linköping University. Her research
interests include information visualization, visual analytics, data mining, and interactive
knowledge discovery.

Aida Nordman is a Lecturer in Computer Science at Linköping University, Sweden, where she works with
knowledge discovery in multidimensional event-based data. She received her PhD in Computer Science
in 2010 from Linköping University. Her research interests include data mining, knowledge
representation and reasoning, description logics, temporal logic, and artificial intelligence.
