Pattern Mining: Current Challenges and Opportunities

Chapter in Lecture Notes in Computer Science · July 2022


DOI: 10.1007/978-3-031-11217-1_3



Pattern Mining: Current Challenges
and Opportunities

Philippe Fournier-Viger1(B), Wensheng Gan2, Youxi Wu3, Mourad Nouioua4,
Wei Song5, Tin Truong6, and Hai Duong6

1 Shenzhen University, Shenzhen, China
[email protected]
2 Jinan University, Guangzhou, China
3 Hebei University of Technology, Tianjin, China
4 University of Bordj Bou Arreridj, Bordj Bou Arreridj, Algeria
5 North China University of Technology, Beijing, China
[email protected]
6 Dalat University, Dalat, Vietnam
{tintc,haidv}@dlu.edu.vn

Abstract. Pattern mining is a key subfield of data mining that aims at
developing algorithms to discover interesting patterns in databases. The
discovered patterns can be used to help understand the data and also to
perform other tasks such as classification and prediction. After more than
two decades of research in this field, great advances have been achieved in
terms of theory, algorithms, and applications. However, there still remain
many important challenges to be solved and also many unexplored areas. Based
on this observation, this paper provides an overview of six key challenges
that are promising topics for research and describes some interesting
opportunities. Those challenges were identified by researchers from the
field, and are: (1) mining patterns in complex graph data, (2) targeted
pattern mining, (3) repetitive sequential pattern mining, (4) incremental,
stream, and interactive pattern mining, (5) heuristic pattern mining, and
(6) mining interesting patterns.

Keywords: Data mining · Pattern mining · Challenges · Opportunities

1 Introduction
Nowadays, large amounts of data of various types are stored in databases of various
organizations. Hence, it has become important for many organizations to develop
automatic or semi-automatic tools to analyze data. Pattern mining is a subfield
of data mining that aims at identifying interesting and useful patterns in data.
The aim is to find patterns that are easily interpretable by users, and thus can
help in understanding the data. Patterns can be used to support decision-making
but also to perform other tasks such as classification, clustering and prediction.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
U. K. Rage et al. (Eds.): DASFAA 2022 Workshops, LNCS 13248, pp. 34–49, 2022.
https://doi.org/10.1007/978-3-031-11217-1_3

Pattern mining research started more than two decades ago. While initial studies
focused on discovering frequent patterns in data such as shopping data, the
field has rapidly evolved to consider other data types and pattern types. Also,
major improvements have been made to algorithms and data structures to improve
efficiency and scalability, and to provide more features.
This paper provides an overview of key challenges and opportunities in pattern
mining that deserve more attention. To write this paper, seven researchers
from the field of pattern mining were invited to write about a key challenge of
their choice. Six challenges have been identified:

1. C1: Mining patterns in complex graph data (by P. Fournier-Viger)
2. C2: Targeted pattern mining (by W. Gan)
3. C3: Repetitive sequential pattern mining (by Y. Wu)
4. C4: Incremental, stream, and interactive pattern mining (by M. Nouioua)
5. C5: Heuristic pattern mining (by W. Song)
6. C6: Mining interesting patterns (by T. Truong and H. Duong)

The rest of this paper is organized as follows. Sects. 2 to 7 describe the
six challenges. Then, Sect. 8 draws a conclusion.

2 C1: Mining Patterns in Complex Graph Data


The first studies on frequent pattern mining have focused on analyzing trans-
action databases, which are tables of records described using binary attributes.
Although this data representation has many applications, it remains very sim-
ple and thus it is unsuitable for many applications where complex data must
be analyzed. Hence, a current trend in pattern mining is to develop algorithms
to analyze complex data such as temporal data, spatial data, time series and
graphs. Graph data has attracted the attention of many researchers in recent
years [13,18] as it can encode various types of information such as social links
in friendship-based social networks, chemical molecules, roads between cities,
flights between airports, and co-authorship of papers in academia.
Initial studies on graph pattern mining have focused on frequent subgraph
mining, which aims at finding connected subgraphs that are common to many
graphs of a graph database, or that appear frequently in a single graph [18]. In
the original problem, the input graphs have a rather simple form. They are static,
have vertices and edges which can each have at most one label, the graphs must
be connected, self-loops are forbidden (an edge from a vertex to itself), and there
can be at most a single edge between two vertices. This simple representation
restricts the applications of frequent subgraph mining. To broaden the applica-
tions of pattern mining in graphs, the following challenges must be solved.
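
The restrictions of the original problem can be stated as a simple check. The
sketch below is only illustrative: it assumes a hypothetical representation in
which a graph is given as a dictionary of vertex labels and a list of undirected
labeled edges, and the function name is_simple_input_graph is not taken from any
of the cited algorithms.

```python
from collections import defaultdict


def is_simple_input_graph(vertex_labels, edges):
    """Check the restrictions of the original frequent subgraph mining setting.

    vertex_labels: dict mapping each vertex to its single label (assumed format).
    edges: list of (u, v, label) tuples describing undirected edges.
    """
    if not vertex_labels:
        return False  # an empty graph is treated as invalid in this sketch
    seen_pairs = set()
    adjacency = defaultdict(set)
    for u, v, _label in edges:
        if u not in vertex_labels or v not in vertex_labels:
            return False  # edge endpoint without a labeled vertex
        if u == v:
            return False  # self-loops are forbidden
        pair = frozenset((u, v))
        if pair in seen_pairs:
            return False  # at most one edge between two vertices
        seen_pairs.add(pair)
        adjacency[u].add(v)
        adjacency[v].add(u)
    # The graph must be connected: traverse from an arbitrary vertex.
    start = next(iter(vertex_labels))
    visited, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in visited:
                visited.add(neighbor)
                stack.append(neighbor)
    return len(visited) == len(vertex_labels)


labels = {1: "A", 2: "B", 3: "A"}
print(is_simple_input_graph(labels, [(1, 2, "x"), (2, 3, "y")]))  # connected path
print(is_simple_input_graph(labels, [(1, 2, "x")]))               # vertex 3 is isolated
```

Richer graph types such as multi-relational graphs would fail this check by
design, which is one way to see why the original setting limits applications.
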
Handling More Complex Types of Graphs. A key challenge is to consider
richer types of graphs such as those shown in Fig. 1 [13]. Those are directed
graphs (where edges may have directions), weighted graphs (where numbers are
assigned to edges to indicate the strength of relationships), attributed graphs
(where each node can be described using many categorical or nominal attributes),
multi-labeled graphs (where edges and vertices may have multiple labels), and
multi-relational graphs (where multiple edges of different types may connect
nodes) [13]. For some domains such as social network analysis, handling
rich graph data is crucial. For example, a user profile (vertex) on a social
network may be described using many attributes and persons have various types of
relationships (edges).

Fig. 1. Different types of graphs: (a) a connected graph, (b) a disconnected
graph, (c) a directed graph, (d) a weighted graph, (e) a labeled graph,
(f) an attributed graph, (g) a multi-labeled graph, (h) a multi-relational graph

Handling Dynamic Graphs. Another key challenge is to mine patterns in
graphs that change over time. There are three main types of changes:
topological changes (edges may be added or removed), label evolution (node labels
may change over time), and a mix of both [13]. Many algorithms for
analyzing dynamic graphs adopt a snapshot model where graphs are observed
at different timestamps. Other models of time should also be studied such as
events having a duration (each event has a start and end time and may overlap
with other events). A related research direction is to design algorithms to update
patterns incrementally when new data arrives, or to process evolving graphs in
real-time as a stream. It is also possible to search for different types of temporal
patterns in a dynamic graph such as sequences of changes, periodic patterns
(patterns that repeat over time), trending patterns (patterns that have an
increasing trend over time), and attribute rules [12].
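
The snapshot model and the two basic types of changes can be illustrated as
follows. This is a minimal sketch, assuming that each snapshot is stored as a
set of edges plus a dictionary of node labels; the function names and the
snapshot format are hypothetical and only serve the example.

```python
def topological_changes(previous_edges, current_edges):
    """Return (added, removed) edge sets between two consecutive snapshots."""
    return current_edges - previous_edges, previous_edges - current_edges


def label_changes(previous_labels, current_labels):
    """Return {vertex: (old_label, new_label)} for vertices whose label changed."""
    common = previous_labels.keys() & current_labels.keys()
    return {v: (previous_labels[v], current_labels[v])
            for v in common if previous_labels[v] != current_labels[v]}


# Two snapshots of a small dynamic graph (edges stored as sorted vertex pairs).
snapshots = [
    {"edges": {(1, 2), (2, 3)}, "labels": {1: "A", 2: "B", 3: "A"}},
    {"edges": {(1, 2), (1, 3)}, "labels": {1: "A", 2: "C", 3: "A"}},
]
for before, after in zip(snapshots, snapshots[1:]):
    added, removed = topological_changes(before["edges"], after["edges"])
    changed = label_changes(before["labels"], after["labels"])
    print("added:", added, "removed:", removed, "relabeled:", changed)
```

Sequences of such change sets over many timestamps are one possible input for
mining sequences of changes, periodic patterns, or trending patterns.
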
Discovering Specialized Types of Graphs. Another important challenge in
graph pattern mining is to design algorithms that are specialized for mining
specific patterns rather than more general patterns. For example, algorithms have
been designed to mine sub-trees [29] or paths instead of more complex graphs.
The benefit of solving a more specialized graph pattern mining problem is
that more efficient algorithms can be developed thanks to specialized optimizations.
Discovering Novel Pattern Types. Many recent studies have focused on
finding new pattern types or on using new criteria to select patterns. For instance,
a trend is to design algorithms to find statistically significant patterns based on
statistical tests, or to use correlation measures to filter out spurious patterns.
Handling Multi-modal Data. Another interesting research direction in graph
pattern mining is to combine graph data with other data types to perform a joint
analysis of this data. Multi-modal data refers to data of different modes such as
graph data, combined with video data and audio data.
Developing Solutions to Applied Graph Pattern Mining Problems.
Another important research topic in graph mining is to design specialized
solutions to address the needs of some given applications. For instance, it was shown
to be advantageous to develop custom algorithms to analyze alarm data from a
computer network rather than using generic graph mining algorithms [12]. By
designing a tailored solution, better performance may be obtained and more
interesting patterns may be found (e.g. by using custom measures to select patterns).

3 C2: Targeted Pattern Mining


The current pattern mining literature provides various methods to find all
interesting patterns using several parameters. In other words, most pattern
mining algorithms aim at discovering the complete set of patterns (i.e.,
itemsets, rules, sequences, graphs, etc.) that satisfy predefined thresholds.
However, in general, a huge number of the discovered patterns may not be
interesting to a given user, since interestingness usually depends on the user's
specific interests. To filter out redundant information and obtain concise
results, targeted pattern mining (TPM, also called targeted pattern search)
provides a different solution to the classic pattern mining problem. To be
specific, instead of discovering a large number of patterns that may not be the
target ones, users in TPM can input a single or several targets at a time and
then discover/query the desired patterns containing the input targets [26,44].
Therefore, an interactive TPM method can return concise results for the
user-defined targets.
In summary, the goal of TPM is to discover a particular subset or group of
patterns that contains one or several target patterns. TPM is often more
interesting and reliable for finite samples, resulting in a potentially finite
subset. Unlike in traditional pattern mining, computing this subset from the
potential search space is computationally much more difficult. How to identify
the specific subset, rather than all patterns that satisfy the given parameters,
is quite challenging.
Several definitions of targeted patterns have been provided before, including
targeted frequent itemset mining [20], targeted sequence mining [5], targeted
high-utility itemset mining [26], and targeted high-utility sequence mining [44].
Up to now, several frequency-based or utility-driven TPM models have been
developed, as briefly reviewed below.
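
The basic idea of a targeted query can be sketched with a brute-force example.
The code below is a minimal illustration, assuming that transactions are sets of
items and that minsup is an absolute count; the function name
targeted_frequent_itemsets is hypothetical, and the exhaustive enumeration is
not the Itemset-Tree or any other structure from the cited work.

```python
from itertools import combinations


def targeted_frequent_itemsets(transactions, target, minsup):
    """Return {itemset: support} for frequent itemsets that contain `target`.

    Brute-force illustration only; `minsup` is an absolute support count.
    """
    target = frozenset(target)
    # Only transactions containing the target can support a superset of it.
    projected = [frozenset(t) for t in transactions if target <= frozenset(t)]
    other_items = sorted(set().union(*projected) - target)
    result = {}
    for size in range(len(other_items) + 1):
        for extra in combinations(other_items, size):
            candidate = target | frozenset(extra)
            support = sum(1 for t in projected if candidate <= t)
            if support >= minsup:
                result[candidate] = support
    return result


transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
for itemset, support in targeted_frequent_itemsets(transactions, {"c"}, 2).items():
    print(sorted(itemset), support)
```

Restricting the search to transactions that contain the target shrinks the data
to scan, but the number of candidates still grows exponentially with the number
of remaining items, which is why dedicated index structures and pruning
strategies are needed in practice.
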
Targeted FIM and ARM Algorithms. Kubat et al. [20] studied specialized
queries of frequent itemsets in a transaction database. All the rules (i.e.,
targeted queries) can be extracted from the designed Itemset-Tree. Here, the
user-specified itemset is the antecedent of the rules. An improved Itemset-Tree
[14] was designed for quickly querying frequent itemsets during the operation
process. Both algorithms use minimum support and minimum confidence thresholds,
and the improved Itemset-Tree can be updated incrementally with new
transactions. For multitude-targeted mining, the guided FP-growth [30] was
designed to determine the frequency of each given itemset based on a target
Itemset-Tree. After that, a constraint-based ARM query model [1] was also
introduced for exploratory analysis of diverse clinical databases.
Targeted SPM Algorithms. The sequential ordering of items is commonly
seen in real-life applications. To handle sequence data, which is more complex
than transaction data, Chueh et al. [8] reversed the original sequences to
discover targeted sequential patterns with time intervals. Based on the
definition of targeted SPM, Chand et al. [5] proposed a novel SPM algorithm that
discovers patterns while checking whether they satisfy recency and monetary
constraints and are target-oriented. However, the target pattern in this
approach is defined at the end of each sequence. A goal-oriented algorithm [7]
can extract the transaction activities that occur before a customer is lost. By
utilizing TPM, this algorithm can handle the problem of determining whether a
customer is moving toward a specific goal such as leaving.
Utility-Driven TPM Algorithms. Previous TPM algorithms mainly adopt
the measures of frequency and confidence, but they do not involve the
concept of utility [15], which is helpful for discovering more informative
patterns and knowledge. Recently, Miao et al. [26] were the first to introduce a
targeted high-utility itemset querying model (abbreviated as TargetUM).
TargetUM introduced several key definitions and formulated the problem of mining
the desired set of high-utility itemsets containing given target items. A
utility-based trie tree was designed to index and query target itemsets
on-the-fly. For sequence data, Zhang et al. [44] introduced the targeted
high-utility sequence querying problem and proposed the TUSQ algorithm. A
targeted utility-chain and two novel upper bounds on utility (namely suffix
remain utility and terminated descendants utility) are proposed in the TUSQ
model.
Several open problems and interesting future directions for targeted pattern
mining/search are highlighted below (the list is not exhaustive). It is
important to note that these open problems are also widespread in
other pattern mining tasks.
– What type of data to be mined. There are many types of data in the real
world, such as transaction data, sequences, streaming data, spatiotemporal
data, complex events, time series, text and web data, multimedia data, graphs,
social networks, and uncertain data. How to design effective TPM algorithms to
deal with these types of data is an urgent and challenging problem.
– What kind of pattern or knowledge to be mined. For example, data mining
can be divided into two categories, descriptive and predictive, which yield
different kinds of knowledge. As reviewed before, itemsets, sequences, rules,
graphs, and events are different kinds of patterns that are extracted from
various types of data. However, few TPM algorithms can discover these kinds
of patterns.
– More effective data structures. According to current studies, indexing
and searching in TPM are more challenging than in traditional pattern
mining. In particular, when dealing with big data, more effective
data structures are needed to store rich information from the data.
– More powerful strategies. The search space of TPM can grow explosively.
Thus, reducing the search space using powerful pruning
strategies (w.r.t. upper bounds) plays a key role in improving the performance
of TPM algorithms.
– Different applications. In general, there are many applications of data
mining methods, including discrimination, association analysis, classification,
clustering, trend/deviation analysis, outlier detection, etc. Each
application may require a specific TPM solution.
– Visualization. It would be useful to display the data and mining results
automatically during the search process. In the future, there are many
opportunities to increase the interpretability of the results, the ease of use
of the model, and the interactivity of the mining process.

In summary, targeted pattern mining/search is difficult and quite different
from previous mining methods. In the future, there are many opportunities and
much interesting work to be done in this research field.

4 C3: Repetitive Sequential Pattern Mining

Sequential pattern mining (SPM) has been used in keyphrase extraction [42] and
feature selection [41]. The goal of SPM is to discover interesting subsequences
(also called patterns). The most common problem is to mine frequent patterns
whose supports are no less than a user-defined parameter called minsup. The
definitions are as follows.

Definition 1. (sequence and sequence database) Suppose we have a set of
items $\sigma$. A sequence $S$ is an ordered list of itemsets
$S = \{s_1, s_2, \cdots, s_n\}$, where $s_i$ is an itemset, which is a subset of
$\sigma$. A sequence database is composed of $k$ sequences, i.e.
$SDB = \{S_1, S_2, \cdots, S_k\}$.

Example 1. For a sales dataset, suppose there are five products: a, b, c, d, and e,
i.e. $\sigma = \{a, b, c, d, e\}$. Suppose customer 1 first purchased items a, b,
and c, then bought a, b, and e, then purchased c, then bought a, b, d, and e,
then purchased a and c, and finally bought a, c, and e. The shopping sequence of
customer 1 is $S_1 = \{s_1, s_2, s_3, s_4, s_5, s_6\} = \{(a, b, c), (a, b, e),
(c), (a, b, d, e), (a, c), (a, c, e)\}$. Similarly, we assume that for customer
2, $S_2 = \{s_1, s_2\} = \{(a, b, d), (c)\}$. Thus, the sequence database is
$SDB = \{S_1, S_2\}$.

This kind of sequence format is quite general since the sequence is an ordered
list of itemsets, which means that each itemset contains one or more items. Thus,
such a sequence is called a sequence with itemsets. But for many applications, the
data is represented as an ordered list of items called a sequence with items,
which means that each itemset contains only one item, e.g. DNA sequences,
protein sequences, virus sequences, and time series. For example, “attaaagg” is
a segment of the genome of the SARS-CoV-2 virus.

Definition 2. (pattern and occurrence) A pattern $P = \{p_1, p_2, \cdots, p_m\}$
is also a sequence. A pattern $P$ is a subsequence of a sequence
$S = \{s_1, s_2, \cdots, s_n\}$ if and only if $p_1 \subseteq s_{i_1}$,
$p_2 \subseteq s_{i_2}$, $\cdots$, $p_m \subseteq s_{i_m}$, and
$1 \le i_1 < i_2 < \cdots < i_m \le n$. Then
$I = \langle i_1, i_2, \cdots, i_m \rangle$ is an occurrence of pattern $P$ in
sequence $S$.

Example 2. Pattern $P = \{(a, b), (c)\}$ occurs in sequences $S_1$ and $S_2$.
For example, $\langle 1,3 \rangle$ is an occurrence of pattern $P$ in sequence
$S_1$, since $p_1 = (a, b) \subseteq s_1 = (a, b, c)$ and
$p_2 = (c) \subseteq s_3 = (c)$.

Definition 3. (support and frequent pattern) The support is the number of
occurrences of a pattern $P$ in a sequence database $SDB$, represented as
$sup(P, SDB)$. If the support is no less than the predefined threshold minsup,
then the pattern is called a frequent pattern.
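
Definitions 2 and 3 can be checked on the database of Example 1 with a few lines
of code. The following is a minimal sketch, assuming that a sequence is
represented as a Python list of sets and that positions are 1-based; the
function name find_occurrences is illustrative only.

```python
def find_occurrences(pattern, sequence):
    """Return all occurrences <i1, ..., im> (1-based) of pattern in sequence."""
    results = []

    def extend(prefix, start):
        k = len(prefix)
        if k == len(pattern):
            results.append(tuple(prefix))
            return
        for i in range(start, len(sequence) + 1):
            # The next pattern itemset must be a subset of the itemset s_i.
            if pattern[k] <= sequence[i - 1]:
                extend(prefix + [i], i + 1)

    extend([], 1)
    return results


S1 = [set("abc"), set("abe"), set("c"), set("abde"), set("ac"), set("ace")]
S2 = [set("abd"), set("c")]
P = [set("ab"), set("c")]

occurrences_per_sequence = [find_occurrences(P, S) for S in (S1, S2)]
sequences_containing_p = sum(1 for occ in occurrences_per_sequence if occ)
total_occurrences = sum(len(occ) for occ in occurrences_per_sequence)
print(occurrences_per_sequence[0])
print(sequences_containing_p, total_occurrences)
```

Under these assumptions, the sketch finds eight occurrences of P in S1
(including the one of Example 2) and one in S2, so P is contained in 2 sequences
but has 9 occurrences overall.
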

Classical SPM only considers whether a pattern occurs in a sequence or not, and
ignores the pattern's repetitions within a sequence. For example, $P$ occurs in
both $S_1$ and $S_2$. Thus, under classical SPM, the support of $P$ in $SDB$ is
2. However, pattern $P$ occurs many times in sequence $S_1$. If we neglect the
repetitions, many important interesting patterns will be lost. However,
researchers have mainly focused on mining repetitive patterns in a sequence
database with items, rather than in a sequence database with itemsets. Various
methods have been investigated to mine various kinds of patterns such as
patterns without gap [6], patterns with self-adaptive gap [41], and patterns
with gap constraint [21,27,28,40]. An illustrative example is given.

Example 3. Suppose we have a sequence $S = s_1 s_2 s_3 s_4 s_5 s_6 s_7 s_8 = aabababa$.

(1) Pattern without gap: A pattern without gap is also called a consecutive
subsequence [6], i.e. for an occurrence
$I = \langle i_1, i_2, \cdots, i_m \rangle$, it is required that
$i_2 = i_1 + 1, i_3 = i_2 + 1, \cdots, i_m = i_{m-1} + 1$. For example, there
are two occurrences of pattern $P = p_1 p_2 p_3 = aba$ in sequence $S$:
$\langle 2,3,4 \rangle$ and $\langle 6,7,8 \rangle$.
The advantage of this method is that it is easy to calculate the support.
However, the restriction is too strict, which will lead to the loss of a lot of
important information.
(2) Pattern with self-adaptive gap [41]: This means that there is no constraint
on the gaps within an occurrence. For example, $\langle 1,7,8 \rangle$ is an
occurrence of $P = aba$ in $S$.
The advantage of this method is that users do not need any prior knowledge
and it is easy to find the characteristics of the sequence database. However,
there are too many occurrences, which will lead to difficulties in analyzing
the results.
(3) Pattern with gap constraint: In this case, users should predefine a gap
constraint $gap = [M, N]$, and each occurrence needs to satisfy
$M \le i_k - i_{k-1} - 1 \le N$ ($1 < k \le m$), where $M$ and $N$ are the
minimum and maximum numbers of wildcards. This method can prune some meaningless
occurrences. For example, if $gap = [0,2]$, $\langle 1,3,4 \rangle$ is an
occurrence of pattern $P = aba$ in sequence $S$, since $p_1 = s_1 = a$,
$p_2 = s_3 = b$, $p_3 = s_4 = a$, and both position pairs (1,3) and (3,4)
satisfy the gap constraint [0,2], while $\langle 1,5,6 \rangle$ is not an
occurrence, since $5 - 1 - 1 = 3 > 2$. This approach is not only more
challenging, but also has several variants. As far as we know there are four
types: no condition [27], the one-off condition [21], the nonoverlapping
condition [40], and the disjoint condition [28].
• No condition means that each item can be reused [27]. Therefore, all ten
occurrences of $P = aba$ with gap [0,2] in sequence $S$ are acceptable under no
condition: $\langle 1,3,4 \rangle$, $\langle 1,3,6 \rangle$,
$\langle 2,3,4 \rangle$, $\langle 2,3,6 \rangle$, $\langle 2,5,6 \rangle$,
$\langle 2,5,8 \rangle$, $\langle 4,5,6 \rangle$, $\langle 4,5,8 \rangle$,
$\langle 4,7,8 \rangle$, and $\langle 6,7,8 \rangle$.
• The one-off condition means that each item can be used at most once [21].
Therefore, $\langle 1,3,4 \rangle$ and $\langle 2,5,6 \rangle$ are two
occurrences of $P$ in $S$ which satisfy the one-off condition, while
$\langle 1,3,4 \rangle$ and $\langle 4,5,6 \rangle$ do not.
• The nonoverlapping condition means that each item cannot be reused by
the same $p_j$, but can be reused by different $p_j$ [40].
$\langle 1,3,4 \rangle$ and $\langle 4,5,6 \rangle$ satisfy the nonoverlapping
condition, since in $\langle 1,3,4 \rangle$, $p_3$ matches $s_4$ and in
$\langle 4,5,6 \rangle$, $p_1$ matches $s_4$. However, $\langle 1,3,6 \rangle$
and $\langle 2,3,4 \rangle$ do not satisfy the nonoverlapping condition, since
in both occurrences, $p_2$ matches $s_3$. Hence, there are three occurrences of
$P$ in $S$ under the nonoverlapping condition: $\langle 1,3,4 \rangle$,
$\langle 4,5,6 \rangle$, and $\langle 6,7,8 \rangle$.
• The disjoint condition means that the maximum position of an occurrence
should be less than the minimum position of the next occurrence [28]. For
example, there are two occurrences of $P$ in $S$ under the disjoint condition:
$\langle 1,3,4 \rangle$ and $\langle 6,7,8 \rangle$.
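
The occurrences listed in Example 3 can be reproduced by a direct enumeration.
The following is a minimal sketch for sequences with items, assuming 1-based
positions and a gap constraint [M, N]; the function names are hypothetical, the
two pairwise checks only compare two given occurrences, and the exhaustive
enumeration is meant as an illustration rather than as one of the cited
algorithms.

```python
def gap_occurrences(pattern, sequence, min_gap, max_gap):
    """Enumerate all occurrences of pattern in sequence under gap [min_gap, max_gap]."""
    results = []

    def extend(prefix):
        k = len(prefix)
        if k == len(pattern):
            results.append(tuple(prefix))
            return
        if k == 0:
            first, last = 1, len(sequence)
        else:
            # M <= i_k - i_{k-1} - 1 <= N bounds the next position.
            first = prefix[-1] + min_gap + 1
            last = min(prefix[-1] + max_gap + 1, len(sequence))
        for pos in range(first, last + 1):
            if sequence[pos - 1] == pattern[k]:
                extend(prefix + [pos])

    extend([])
    return results


def satisfies_one_off(occ1, occ2):
    """True if the two occurrences share no position at all."""
    return not set(occ1) & set(occ2)


def satisfies_nonoverlapping(occ1, occ2):
    """True if no position is matched by the same pattern element in both."""
    return all(a != b for a, b in zip(occ1, occ2))


all_occurrences = gap_occurrences("aba", "aabababa", 0, 2)
print(len(all_occurrences), all_occurrences)
print(satisfies_one_off((1, 3, 4), (4, 5, 6)))
print(satisfies_nonoverlapping((1, 3, 4), (4, 5, 6)))
```

With gap = [0,2], the enumeration returns the ten occurrences of the no
condition case. Counting the support under the other conditions requires
selecting a largest compatible subset of these occurrences, which is where the
conditions differ in difficulty.
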
Although the four conditions are very similar, their characteristics are
different. The advantages of no condition are that the support can be calculated
in polynomial time and that it is a complete mining approach. However, this
mining approach does not satisfy the Apriori property and has to apply an
Apriori-like strategy to generate candidate patterns. The one-off condition
satisfies the Apriori property, but its support cannot be calculated exactly in
practice, since this is an NP-hard problem. Therefore, the mining approach is an
approximate mining approach. Both the nonoverlapping condition and the disjoint
condition satisfy the Apriori property, and their supports can be calculated in
polynomial time [39]. The support under the disjoint condition is easier to
calculate than under the nonoverlapping condition, but the disjoint condition
may lose some feasible occurrences.
If we apply the four conditions in a sequence database with itemsets, it is a
more challenging task, since the support calculation and candidate generation
are significantly different from those for a sequence database with items. For the
support calculation, in a sequence with items, we require that
$p_j = s_{i_j}$, while in a sequence with itemsets, we require that
$p_j \subseteq s_{i_j}$. For candidate generation, in a sequence database with
items, we only apply S-Concatenation to generate candidates, while in a sequence
database with itemsets, we adopt S-Concatenation and I-Concatenation to generate
candidates.
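
The two candidate generation operations can be sketched as follows. This
example rests on an assumption: S-Concatenation and I-Concatenation are taken
here to correspond to the usual sequence-extension and itemset-extension
operations of sequential pattern mining, since they are not defined in the text.
The function names and the tuple-of-frozensets representation are hypothetical.

```python
def s_concatenation(pattern, item):
    """Append the item as a new itemset at the end of the pattern."""
    return pattern + (frozenset([item]),)


def i_concatenation(pattern, item):
    """Add the item to the last itemset of the pattern.

    To avoid generating the same candidate twice, the item must be greater
    than every item already in the last itemset (an assumed convention).
    """
    last = pattern[-1]
    if item <= max(last):
        return None
    return pattern[:-1] + (last | {item},)


def generate_candidates(pattern, items, with_itemsets=True):
    """Extend a pattern by one item; I-Concatenation is used only for itemsets."""
    candidates = [s_concatenation(pattern, item) for item in items]
    if with_itemsets:
        for item in items:
            extended = i_concatenation(pattern, item)
            if extended is not None:
                candidates.append(extended)
    return candidates


pattern = (frozenset("ab"),)
for candidate in generate_candidates(pattern, "abc"):
    print([sorted(itemset) for itemset in candidate])
```

For a sequence database with items, calling generate_candidates with
with_itemsets=False keeps only the S-Concatenation candidates, which reflects
the smaller candidate space described above.
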
Hence, the following tasks should be further investigated in sequence
databases with itemsets:
(1) What are the computational complexities of calculating the supports under
different conditions?
(2) Given a database with itemsets, how to design effective mining algorithms
for these conditions?
(3) If the dataset is dynamic or a stream database, how to design effective
mining algorithms?
(4) A variety of SPM methods were proposed to meet different requirements, such
as closed SPM, maximal SPM, top-k SPM, compressing SPM, co-occurrence SPM, rare
SPM, negative SPM, tri-partition SPM, and high utility SPM. However, most of
them neglect the repetitions and consider sequence databases with itemsets. If
the repetitions cannot be neglected, how to design effective mining algorithms?
(5) For a specific problem, there are many possible approaches, but which one is
the best? For example, for a sequence classification problem, there are many
methods to extract features, such as frequent patterns and contrast patterns
under the four conditions.

5 C4: Incremental, Stream and Interactive Pattern Mining

A key limitation of traditional pattern mining algorithms such as Apriori and
FP-Growth is that they are batch algorithms. This means that if the input database
is updated, the user needs to run the algorithm again to get new results, even
if the database has only slightly changed. Consequently, classical algorithms
are inefficient for various real applications where databases are dynamic. To
address this challenge, various approaches have been adopted, which can be
roughly classified into three categories: (1) incremental pattern mining
algorithms, (2) stream pattern mining algorithms, and (3) interactive pattern
mining algorithms.
Incremental pattern mining algorithms are designed to update the set of
discovered patterns once the database is updated by inserting or deleting some
transactions. To avoid repetitively scanning the database, a strategy is to use
a buffer that contains the set of almost frequent itemsets in memory [19,23].
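
The buffering idea can be sketched as follows. This is a simplified
illustration, not the method of [19,23]: it assumes a fixed set of tracked
itemsets and two hypothetical relative thresholds, minsup_ratio for frequent
itemsets and a lower buffer_ratio for almost frequent itemsets, and it updates
their counts on each insertion or deletion without rescanning the database.

```python
class TrackedItemsets:
    """Maintain support counts of a fixed set of tracked itemsets."""

    def __init__(self, itemsets, minsup_ratio, buffer_ratio):
        self.counts = {frozenset(i): 0 for i in itemsets}
        self.minsup_ratio = minsup_ratio    # threshold for frequent itemsets
        self.buffer_ratio = buffer_ratio    # lower threshold for the buffer
        self.size = 0                       # current number of transactions

    def _update(self, transaction, delta):
        transaction = frozenset(transaction)
        self.size += delta
        for itemset in self.counts:
            if itemset <= transaction:
                self.counts[itemset] += delta

    def insert(self, transaction):
        self._update(transaction, +1)

    def delete(self, transaction):
        self._update(transaction, -1)

    def frequent(self):
        bound = self.minsup_ratio * self.size
        return {i for i, c in self.counts.items() if c >= bound}

    def almost_frequent(self):
        low = self.buffer_ratio * self.size
        high = self.minsup_ratio * self.size
        return {i for i, c in self.counts.items() if low <= c < high}


tracker = TrackedItemsets(["a", "b", "ab"], minsup_ratio=0.5, buffer_ratio=0.25)
for transaction in ["ab", "a", "a", "c"]:
    tracker.insert(transaction)
print(tracker.frequent(), tracker.almost_frequent())
tracker.insert("ab")
tracker.insert("ab")
print(tracker.frequent(), tracker.almost_frequent())
```

In this sketch, itemsets that start in the buffer can become frequent after a
few insertions without any rescan. An itemset that was never tracked would still
require scanning the database, which is the case that the buffer is meant to
make rare.
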
Stream pattern mining algorithms are designed to deal with databases
that change in real-time and where new data may arrive at a very high speed.
These algorithms aim to process transactions quickly to return an approximate
set of patterns rather than the complete set. Two representative algorithms for
stream pattern mining are estDec and estDec+ [32]. estDec employs a
lexicographic tree structure called a prefix tree to identify and maintain
significant itemsets from an online data stream. Significant itemsets are
itemsets that may be frequent itemsets in the near future. It has been observed
that the size of the prefix tree, which is located in the main memory, becomes
very large as the number of significant itemsets increases. Thus, if the size of
the prefix tree becomes larger than the available memory space, estDec fails to
identify new significant itemsets. As a result, the accuracy of estDec results
is degraded [32]. estDec+ and other algorithms have been designed to solve this
problem.
Interactive pattern mining tries to handle dynamic databases differently
by injecting user preferences, user feedback, or targeted user queries into the
mining process [3,4,14,20,22]. In contrast with incremental and stream pattern
mining, where algorithms aim to maintain and update a large set of patterns
that may be uninteresting to users, interactive pattern mining algorithms focus
only on some specific sets of patterns that are needed by the user.
Several approaches have been designed, which can generally be classified into
three categories: (1) targeted querying based approaches, (2) user feedback
based approaches, and (3) visualization based approaches.
Targeted Querying Based Approaches. These approaches let the user search for
patterns containing specific items by sending targeted queries to the system.
Then, the system tries to give quick answers to the user queries [14,20,22].
See Sect. 3 for more details.
User Feedback Based Approaches. User feedback based approaches are more
interactive than targeted querying based approaches. The key idea
is to progressively take into account the feedback sent by users during the
mining process.
Bhuiyan et al. [4] proposed an interactive pattern mining system that is based
on the sampling of frequent patterns from hidden datasets. Hidden datasets
exist in various real applications where the data owner and the data analyst are
not necessarily the same entity. Thus, the data analyst may not have full
access to the data, and the data owner has to maintain the confidentiality of
the data by providing analysts only with samples that are useful to them,
without making it possible to reconstruct the entire
dataset from the given samples [4]. The proposed interactive system aims to
continuously update effective sampling distributions using binary feedback from
the users. It works as follows. Using a Markov Chain Monte
Carlo (MCMC) sampling method, the system returns a small set of frequent
patterns (samples) to each analyst (user). Then, each analyst sends feedback
about the samples received. The feedback is binary: for each pattern, the user
indicates whether the pattern is interesting or uninteresting. The system
defines a scoring function based on the users' feedback and updates each
sampling distribution according to the interests of the corresponding user.
Following these steps, the system progressively adapts to user preferences
while the data remains confidential.
Experiments on itemset and graph mining datasets demonstrate the usefulness
of the proposed system. Based on the same approach, an improved version of
this system was proposed [3]. It adopts a better scoring
function for graph data by using graph topology, together with new feedback
mechanisms, namely periodic feedback and conditional periodic feedback.
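The feedback loop described above can be illustrated with a minimal sketch. It is not the MCMC sampler of [4]: it assumes that a small list of candidate patterns is already available and replaces MCMC with plain weighted sampling. The class name `FeedbackSampler`, the `boost` factor, and the multiplicative update rule are hypothetical choices made only for illustration.

```python
import random

class FeedbackSampler:
    """Toy per-user sampler whose weights are updated by binary feedback.

    Assumption: patterns are frozensets of items and are enumerated up front.
    The system in [4] samples with MCMC instead of enumerating patterns.
    """

    def __init__(self, patterns, boost=2.0, seed=0):
        self.patterns = list(patterns)
        self.weights = {p: 1.0 for p in self.patterns}  # uniform start
        self.boost = boost                              # hypothetical update factor
        self.rng = random.Random(seed)

    def sample(self, k):
        """Draw k patterns (with replacement) proportionally to their weights."""
        w = [self.weights[p] for p in self.patterns]
        return self.rng.choices(self.patterns, weights=w, k=k)

    def feedback(self, pattern, interesting):
        """Scale the weight of every pattern sharing an item with `pattern`."""
        factor = self.boost if interesting else 1.0 / self.boost
        for p in self.patterns:
            if p & pattern:
                self.weights[p] *= factor

sampler = FeedbackSampler([frozenset("ab"), frozenset("bc"), frozenset("d")])
shown = sampler.sample(2)                 # patterns shown to the analyst
sampler.feedback(frozenset("ab"), True)   # the analyst marks {a, b} as interesting
```

After the positive feedback on {a, b}, patterns that share an item with it become more likely to be sampled for this user, which mirrors the idea of one sampling distribution per analyst.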
Another common problem in pattern mining that motivates researchers to
design interactive pattern discovery tools is pattern explosion [11].
More precisely, traditional pattern mining algorithms discover a large number
of patterns, many of which are redundant or similar. As a result, the analyst or
the data expert must invest substantial effort to find the desired patterns,
which is not an easy task. To overcome this limitation, an interactive pattern
discovery framework was proposed [11] for two mining tasks: frequent itemset
mining and subgroup discovery. The proposed framework consists of three steps:
(1) mining patterns, (2) interacting with the user, and (3) learning user-specific
pattern interestingness. The user is only asked to rank small sets of patterns,
and a ranking function is inferred from the user's feedback using preference
learning techniques. The experimental results showed that
the system was able to learn accurate pattern rankings for both mining tasks.
Visualization Based Approaches. Visualization is another important aspect of a
good interactive pattern mining system. More precisely,
data visualization techniques play an important role in making the discovered
knowledge understandable and interpretable by humans [17]. When the
output of the implemented algorithms is presented to the user only in textual
form, several limitations arise, such as the difficulty of identifying similar
patterns and of understanding the relations between patterns. There
are various visualization techniques for different forms of patterns. For instance,
a lattice representation based on the Hasse diagram has been used
to visualize the output of frequent itemset mining [2]. All possible itemsets can
be represented in the diagram, and the frequent itemsets are highlighted in bold.
Other visualization techniques have been used to efficiently present itemsets to
the user, such as pixel based visualization and tree based visualization [17]. As for
itemset mining, various visualization tools were proposed for other pattern
mining problems such as mining association rules, sequential patterns,
and episodes. The reader can refer to [17] for a detailed survey that
presents the visualization techniques designed for each mining task.
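As a small illustration of the lattice representation mentioned above, the following sketch builds the nodes and covering edges of the Hasse diagram of all itemsets over a tiny item set and marks the frequent ones. It only produces a textual listing, not a drawing, and the function names and toy data are assumptions. Since the lattice has 2^n nodes, such a full enumeration is only practical for very small item sets.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def hasse_diagram(items, transactions, minsup):
    """Return (nodes, edges, frequent) for the itemset lattice over `items`."""
    items = sorted(items)
    nodes = [frozenset(c) for r in range(len(items) + 1)
             for c in combinations(items, r)]
    # Covering relation: X is linked to X ∪ {i} for each item i not in X.
    edges = [(x, x | {i}) for x in nodes for i in items if i not in x]
    frequent = {x for x in nodes if support(x, transactions) >= minsup}
    return nodes, edges, frequent

transactions = [frozenset("ab"), frozenset("ac"), frozenset("abc")]
nodes, edges, frequent = hasse_diagram("abc", transactions, minsup=2)
for x in nodes:
    mark = "*" if x in frequent else " "  # "*" stands in for bold highlighting
    print(mark, "{" + ",".join(sorted(x)) + "}")
```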

6 C5: Heuristic Pattern Mining


Since the emergence of the subfield, the excessive number of results caused by
combinatorial explosion has been the most fundamental problem encountered in
pattern mining. Although many efficient pattern mining algorithms have been
proposed, the excessive results still lead to a high computational cost, and the
high computational cost required to mine all exact patterns is not proportional to
the actionability of the mining results. Occasionally, so many patterns are
discovered that decision makers do not know how to use
them. Furthermore, for application fields such as recommender systems, it is not
necessary to use all exact patterns.
To solve this problem, heuristic pattern mining (HPM) algorithms have been
developed to identify an approximate subset of all patterns within a reasonable
time. Inspired by biological [9] and physical [35] phenomena, heuristic methods
are effective for solving combinatorial problems such as pattern mining. Compared
with exact pattern mining algorithms, HPM algorithms are efficient and
do not need any domain knowledge in advance. Several heuristic algorithms are
used in the subfield of pattern mining, such as the genetic algorithm (GA) [9],
particle swarm optimization (PSO) [24], artificial bee colony (ABC) [33],
cross-entropy (CE) [35], and the bat algorithm (BA) [34]. In addition to achieving high
efficiency, these HPM algorithms can also discover a sufficient number of exact
patterns. From the perspective of the key components of the entire mining process,
the following challenges are summarized.
Identifying the Appropriate Objective. For heuristic methods, a fitness
function is the objective function used to measure the performance of each
individual to determine which will survive and reproduce into the next generation. In
existing HPM algorithms, typical measures such as support [9] and utility
[34] are used directly as fitness functions. Standard fitness functions are easy to
implement; however, for more specific patterns (e.g., closed patterns or top-k patterns),
more flexible fitness functions must be applied, which are more difficult to implement.
Furthermore, in complex scenarios, for example, choosing sets of products that
are both economical and fast to deliver, using multi-objective fitness functions
[45] is also a challenging issue.
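To make the notion of a fitness function concrete, the sketch below uses support directly as the fitness of an individual encoded as a bit vector over the item list, and adds a generic Pareto dominance test of the kind used when several objectives are maximized at once. The encoding, the function names, and the toy data are assumptions for illustration and are not taken from [9], [34], or [45].

```python
def decode(bits, items):
    """Map a bit vector to the itemset it encodes (bit i set means items[i] is in)."""
    return frozenset(item for bit, item in zip(bits, items) if bit)

def support_fitness(bits, items, transactions):
    """Fitness of an individual: the support of the encoded itemset.

    The empty itemset is given fitness 0 so that it is never selected.
    """
    itemset = decode(bits, items)
    if not itemset:
        return 0
    return sum(1 for t in transactions if itemset <= t)

def dominates(f, g):
    """Pareto dominance for maximization: f is no worse everywhere, better somewhere."""
    return all(a >= b for a, b in zip(f, g)) and any(a > b for a, b in zip(f, g))

items = ["a", "b", "c"]
transactions = [frozenset("ab"), frozenset("ac"), frozenset("abc")]
print(support_fitness((1, 1, 0), items, transactions))  # support of {a, b}
```

With a single objective, individuals are compared by their fitness values. With several objectives (for example support and utility), `dominates` compares tuples of objective values instead of a single number.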
Speeding up the Mining Process. Determining how to narrow the search space is
a general key issue in algorithm design, and HPM algorithms are no exception.
The downward closure property is the most widely used principle in HPM
algorithms. Specifically, support is used for frequent itemsets [9], and
transaction-weighted utilization is used for high utility itemsets [34]. Other strategies include
using the lengths of the discovered patterns to generate promising patterns [33]
and developing a tree structure to avoid invalid combinations [24]. Considering
the characteristics of heuristic methods and the resulting patterns, it would be
interesting to develop new data structures or pruning strategies to improve
mining efficiency.
Diversifying the Discovered Results. Heuristic methods are likely to fall
into local optima, and this is also true for HPM algorithms. When an HPM algorithm
falls into a local optimum, it is difficult to produce new patterns in subsequent
iterations. Inspired by the mutation operator of the GA, randomly generating
some individuals in the next generation is the most important approach to avoid
falling into local optima [9]. Although determining how to increase the diversity
of results is difficult, a first step is to measure this diversity, for example
using the bit edit distance [35].
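A diversity measure of this kind can be sketched as follows. The sketch assumes that the bit edit distance between two individuals is the number of bit positions in which their encodings differ (a Hamming distance), and that the diversity of a population is the average pairwise distance. Both are illustrative assumptions, and the exact definition used in [35] may differ.

```python
from itertools import combinations

def bit_edit_distance(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))

def population_diversity(population):
    """Average pairwise bit edit distance (0.0 for fewer than two individuals)."""
    pairs = list(combinations(population, 2))
    if not pairs:
        return 0.0
    return sum(bit_edit_distance(a, b) for a, b in pairs) / len(pairs)

population = [(1, 0, 0), (1, 1, 0), (0, 1, 1)]
print(population_diversity(population))  # pairwise distances are 1, 3 and 2
```

A value close to zero indicates that the population has collapsed onto nearly identical itemsets, which is one possible signal for injecting randomly generated individuals.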
Designing a General Framework. Pattern mining differs from problems
in which there are relatively few best values: all patterns whose support, utility,
or other measure is no lower than the minimum threshold are mining targets.
In addition to the GA, PSO, ABC, CE, and BA, other heuristic methods such
as the ant colony system and dolphin echolocation optimization have been used
for pattern mining recently. However, attempting to use each heuristic method
individually to mine patterns is not only expensive but also infeasible. Therefore,
integrating all the objectives, processes, and results into a general HPM
framework is a promising approach [10,34].
At the present time, HPM mainly focuses on itemsets. Thus, discovering
complex patterns, for example, sequential patterns or graph patterns [31], using
heuristic methods is challenging.

7 C6: Mining Interesting Patterns


The problem of Interesting Pattern Mining (IPM) plays an important role in
Data Mining. The goal is to discover patterns of interest to users in databases
(DBs), where interest is measured using functions. One of the first interestingness
functions used in IPM is the support function, used to mine frequent patterns (FPM)
in binary DBs. A pattern is said to be frequent in a binary DB if the number
of its appearances in transactions of the database (its support) is no less
than a user-predefined minimum support threshold. To cope efficiently with the
combinatorial explosion in FPM, a useful property
of the support, called Downward Closure or Anti-Monotonicity (AM), is
applied. This property states that if a pattern is infrequent, all its
super-patterns are also infrequent, so the whole branch rooted at the infrequent
pattern (on the prefix search tree) can be pruned immediately.
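The pruning enabled by the AM property can be sketched with a small level-wise (Apriori-style) miner. This is a didactic sketch rather than an efficient algorithm: a candidate of size k is discarded without counting its support as soon as one of its (k-1)-subsets is infrequent. The function name and the toy database are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Level-wise frequent itemset mining with downward-closure pruning."""
    transactions = [frozenset(t) for t in transactions]

    def support(x):
        return sum(1 for t in transactions if x <= t)

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    result = {x: support(x) for x in level}
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (AM property): every (k-1)-subset must itself be frequent.
        candidates = [c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))]
        level = [c for c in candidates if support(c) >= minsup]
        result.update({c: support(c) for c in level})
        k += 1
    return result

db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
for itemset, sup in frequent_itemsets(db, minsup=2).items():
    print(sorted(itemset), sup)
```

In this toy database, {b, c} has support 1, so the candidate {a, b, c} is pruned without scanning the transactions.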
However, the support measure is not suitable for all applications. Thus, other
interestingness functions have been designed to find important patterns that
may be rare but useful or interesting for many real-life applications. Some of the
most popular kinds are utility functions of patterns in quantitative DBs (QDBs).
Utility functions can be used for example to find the most profitable purchase
patterns in customer transactions. Note that the support can be seen as a special
utility function. A simple QDB is called a quantitative transaction DB (QTDB),
where each (input) transaction is a quantitative itemset (a set of quantitative
items). A more general QDB is a quantitative sequence DB (QSDB), in which each
input quantitative sequence is a sequence of quantitative itemsets.
Moreover, a key challenge in the problem of high utility pattern mining
(HUPM) is that such utility functions usually do not satisfy the AM property. To
overcome this challenge, we need to devise upper bounds or weak upper bounds
(on the utilities) that satisfy the AM property or weaker properties (such as
Anti-Monotone-Like, AML). In this context, given a utility function u of patterns, a function
ub is said to be an upper bound (UB) on u if ub(x) ≥ u(x) for any pattern x, and
a function wub is said to be a weak upper bound (WUB) on u if wub(x) ≥ u(y)
for any extension pattern y of x. Usually, the tighter a (W)UB
is, the stronger its pruning ability. Devising good and
tight (W)UBs often requires considerable effort and time.
For example, in the first problem of high utility itemset mining (HUIM) on a
QTDB D, the utility u of an itemset A is defined as the summation of its utilities
u(A,T) in all transactions T of D containing A, where the utility u(A,T) of A
in T is the summation of the utilities of the items of A appearing in T. Similarly, for the
second problem of high average utility itemset mining (HAUIM) in a QTDB D,
the average utility au of an itemset A is defined as the utility u(A) divided by
its length length(A). HUIM was first proposed in 2004 [43], and it took more
than 8 years to obtain good, tighter UBs based on the remaining utility [25].
HAUIM was first proposed in 2009, and it took more than 10 years to obtain WUBs
based on a vertical representation of the QTDB [37]. It is worth noting that for
the average utility au, besides UBs, there are many WUBs, which are much tighter than the
UBs. The number of WUBs found so far is about five times that of
UBs, and devising such good WUBs requires much effort and time.
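The definitions of u and au can be made concrete with a short sketch. Each transaction is represented as a dictionary that maps an item to its utility in that transaction, which is an assumed encoding of a QTDB. The sketch also computes the transaction-weighted utilization (TWU), taken here in its standard form as the sum of the total utilities of the transactions containing the itemset, which is an upper bound on u when all item utilities are non-negative.

```python
def contains(t, itemset):
    """True if transaction t (dict item -> utility) contains every item of itemset."""
    return all(i in t for i in itemset)

def utility_in(itemset, t):
    """u(A, T): sum of the utilities of A's items in T, or 0 if T does not contain A."""
    return sum(t[i] for i in itemset) if contains(t, itemset) else 0

def utility(itemset, db):
    """u(A): sum of u(A, T) over all transactions T of the database."""
    return sum(utility_in(itemset, t) for t in db)

def average_utility(itemset, db):
    """au(A) = u(A) / length(A)."""
    return utility(itemset, db) / len(itemset)

def twu(itemset, db):
    """TWU(A): sum of the total utilities of the transactions containing A."""
    return sum(sum(t.values()) for t in db if contains(t, itemset))

db = [{"a": 5, "b": 2, "c": 1}, {"a": 10, "c": 6}, {"b": 4, "c": 3}]
print(utility({"a"}, db), utility({"a", "c"}, db))  # 15 22
print(average_utility({"a", "c"}, db))              # 11.0
print(twu({"a"}, db), twu({"a", "c"}, db))          # 24 24
```

Here u({a}) = 15 is smaller than u({a, c}) = 22, so u is not anti-monotone, whereas the TWU never increases when an itemset is extended and stays above u.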
For the more general problems of high utility sequence mining (HUSM) on
a QSDB D, because each sequence α may appear multiple times in an input
quantitative sequence (IQS) Ψ of D, there are many ways to define the utility
of α in Ψ. There are two popular kinds of such utilities, denoted as u_max(α, Ψ)
and u_min(α, Ψ), that are respectively defined as the maximum and minimum
values among the utilities of the occurrences of α in Ψ. Then, u_max(α) and u_min(α) are
respectively the summation of u_max(α, Ψ) and u_min(α, Ψ) over all IQSs Ψ
containing α. Similarly, there are two other kinds of utilities named au_max(α)
and au_min(α) that are respectively defined as u_max(α) and u_min(α) divided
by the length of α. For the first utility u_max, it took about 10 years to find
good UBs [16]; for the third utility au_max, it took about 8 years to find good
WUBs [36]. Furthermore, devising such UBs (for example on u_max) without a strict
mathematical proof may lead to inexactness in the corresponding algorithms [16].
For the newer second and fourth utilities, u_min and au_min, the time for devising good
UBs and WUBs has decreased significantly, to a single paper (e.g. [38] for
au_min). Thus, from the theoretical results presented in the paper, a natural and
useful question arises: how can one propose a generic framework for
the IPM problem for any new interestingness function, together with a general
and simple method to design (W)UBs on such functions in weeks instead
of years? In more detail, given a QSDB D, a new interestingness function itr
that may not satisfy AM, and a user-specified minimum interestingness threshold
mi, the corresponding IPM problem is to mine the set {α | itr(α) ≥ mi} of all
highly interesting patterns. The first question is how to quickly devise (W)UBs
on itr so that they are as tight as possible and have anti-monotone-like
properties. The goal of these requirements is to allow the search
space to be reduced significantly. The second question is how to reduce checking the
anti-monotone-like properties of itr on the whole D to a simpler check on each
input quantitative sequence. Moreover, these theoretical results must be proven
rigorously in mathematical language. Then, the main challenge of significantly reducing
the time needed to devise good (W)UBs on itr will be solved.
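The maximum and minimum utilities of a sequence defined earlier in this section can be illustrated with a brute-force sketch. It assumes that an input quantitative sequence is encoded as a list of dictionaries (item to utility), that a pattern is a list of itemsets, and that an occurrence maps the itemsets of the pattern to strictly increasing positions of the sequence. Enumerating all occurrences is exponential in general, so this only serves to make the definitions concrete.

```python
def occurrence_utilities(pattern, qseq):
    """Utilities of all occurrences of `pattern` in the quantitative sequence `qseq`."""
    def rec(k, start):
        if k == len(pattern):
            yield 0
            return
        for j in range(start, len(qseq)):
            if all(i in qseq[j] for i in pattern[k]):
                here = sum(qseq[j][i] for i in pattern[k])
                for rest in rec(k + 1, j + 1):
                    yield here + rest
    return list(rec(0, 0))

def u_max(pattern, db):
    """Sum over sequences containing the pattern of its maximum occurrence utility."""
    per_seq = (occurrence_utilities(pattern, s) for s in db)
    return sum(max(us) for us in per_seq if us)

def u_min(pattern, db):
    """Sum over sequences containing the pattern of its minimum occurrence utility."""
    per_seq = (occurrence_utilities(pattern, s) for s in db)
    return sum(min(us) for us in per_seq if us)

db = [
    [{"a": 2}, {"b": 3}, {"a": 5, "b": 1}, {"b": 4}],
    [{"b": 1}, {"a": 1}],   # does not contain the pattern <{a}, {b}>
    [{"a": 1}, {"b": 1}],
]
pattern = [{"a"}, {"b"}]
print(occurrence_utilities(pattern, db[0]))       # [5, 3, 6, 9]
print(u_max(pattern, db), u_min(pattern, db))     # 11 5
```

In the first sequence the pattern has four occurrences with utilities 5, 3, 6, and 9, so the two definitions already disagree (9 versus 3) on a single input sequence.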

8 Conclusion
The field of pattern mining has been rapidly changing. This paper has provided
an overview of six key challenges, each identified by a researcher from the field.

References
1. Abeysinghe, R., Cui, L.: Query-constraint-based mining of association rules for
exploratory analysis of clinical datasets in the national sleep research resource.
BMC Med. Inform. Decis. Making 18(2), 58 (2018)
2. Alsallakh, B., Micallef, L., Aigner, W., Hauser, H., Miksch, S., Rodgers, P.: The
state-of-the-art of set visualization. In: Computer Graphics Forum, vol. 35, pp.
234–260. Wiley Online Library (2016)
3. Bhuiyan, M., Hasan, M.A.: Interactive knowledge discovery from hidden data
through sampling of frequent patterns. Statist. Anal. Data Mining ASA Data Sci.
J. 9(4), 205–229 (2016)
4. Bhuiyan, M., Mukhopadhyay, S., Hasan, M.A.: Interactive pattern mining on
hidden data: a sampling-based solution. In: Proceedings of the 21st ACM International
Conference on Information and Knowledge Management, pp. 95–104 (2012)
5. Chand, C., Thakkar, A., Ganatra, A.: Target oriented sequential pattern mining
using recency and monetary constraints. Int. J. Comput. App. 45(10), 12–18 (2012)
6. Chen, M.S., Park, J.S., Yu, P.S.: Efficient data mining for path traversal patterns.
IEEE Trans. Knowl. Data Eng. 10(2), 209–221 (1998)
7. Chiang, D.A., Wang, Y.F., Lee, S.L., Lin, C.J.: Goal-oriented sequential pattern
for network banking churn analysis. Expert Syst. App. 25(3), 293–302 (2003)
8. Chueh, H.E., et al.: Mining target-oriented sequential patterns with time-intervals.
Int. J. Comput. Sci. Inf. Technol. 2(4), 113–123 (2010)
9. Djenouri, Y., Comuzzi, M.: Combining apriori heuristic and bio-inspired algorithms
for solving the frequent itemsets mining problem. Inf. Sci. 420, 1–15 (2017)
10. Djenouri, Y., Djenouri, D., Belhadi, A., Fournier-Viger, P., Lin, J.C.-W.: A new
framework for metaheuristic-based frequent itemset mining. Appl. Intell. 48(12),
4775–4791 (2018). https://doi.org/10.1007/s10489-018-1245-8
11. Dzyuba, V., Leeuwen, M.v., Nijssen, S., De Raedt, L.: Interactive learning of
pattern rankings. Int. J. Artif. Intell. Tools 23(06), 1460026 (2014)
12. Fournier-Viger, P., Cheng, C., Cheng, Z., Lin, J.C., Selmaoui-Folcher, N.: Mining
significant trend sequences in dynamic attributed graphs. Knowl. Based Syst. 182,
104797 (2019)
13. Fournier-Viger, P., et al.: A survey of pattern mining in dynamic graphs. Wiley
Interdiscip. Rev. Data Min. Knowl. Discov. 10(6), e1372 (2020)
14. Fournier-Viger, P., Mwamikazi, E., Gueniche, T., Faghihi, U.: MEIT: memory
efficient itemset tree for targeted association rule mining. In: Motoda, H., Wu, Z.,
Cao, L., Zaiane, O., Yao, M., Wang, W. (eds.) ADMA 2013. LNCS (LNAI), vol.
8347, pp. 95–106. Springer, Heidelberg (2013).
https://doi.org/10.1007/978-3-642-53917-6_9
15. Gan, W., et al.: A survey of utility-oriented pattern mining. IEEE Trans. Knowl.
Data Eng. 33(4), 1306–1327 (2021)
16. Gan, W., et al.: ProUM: projection-based utility mining on sequence data. Inf. Sci.
513, 222–240 (2020)
17. Jentner, W., Keim, D.A.: Visualization and visual analytic techniques for patterns.
In: High-Utility Pattern Mining, pp. 303–337 (2019)
18. Jiang, C., Coenen, F., Zito, M.: A survey of frequent subgraph mining algorithms.
Knowl. Eng. Rev. 28, 75–105 (2013)
19. Koh, J.-L., Shieh, S.-F.: An efficient approach for maintaining association rules
based on adjusting FP-tree structures. In: Lee, Y.J., Li, J., Whang, K.-Y., Lee, D.
(eds.) DASFAA 2004. LNCS, vol. 2973, pp. 417–424. Springer, Heidelberg (2004).
https://doi.org/10.1007/978-3-540-24571-1_38
20. Kubat, M., Hafez, A., Raghavan, V.V., Lekkala, J.R., Chen, W.K.: Itemset trees
for targeted association querying. IEEE Trans. Knowl. Data Eng. 15(6), 1522–1534
(2003)
21. Lam, H.T., Morchen, F., Fradkin, D., Calders, T.: Mining compressing sequential
patterns. Statist. Anal. Data Mining ASA Data Sci. J. 7(1), 34–52 (2014)
22. Li, X., Li, J., Fournier-Viger, P., Nawaz, M.S., Yao, J., Lin, J.C.W.: Mining pro-
ductive itemsets in dynamic databases. IEEE Access 8, 140122–140144 (2020)
23. Lin, C.W., Hong, T.P., Lu, W.H.: The pre-FUFP algorithm for incremental mining.
Expert Syst. App. 36(5), 9498–9505 (2009)
24. Lin, J.C.W., Yang, L., Fournier-Viger, P., Hong, T.P., Voznak, M.: A binary PSO
approach to mine high-utility itemsets. Soft Comput. 21(17), 5103–5121 (2017)
25. Liu, M., Qu, J.: Mining high utility itemsets without candidate generation. In:
Proceedings of the 21st ACM International Conference on Information and Knowledge
Management, pp. 55–64 (2012)
26. Miao, J., Wan, S., Gan, W., Sun, J., Chen, J.: TargetUM: targeted high-utility
itemset querying. arXiv preprint arXiv:2111.00309 (2021)
27. Min, F., Zhang, Z.H., Zhai, W.J., Shen, R.P.: Frequent pattern discovery with
tri-partition alphabets. Inf. Sci. 507, 715–732 (2020)
28. Ouarem, O., Nouioua, F., Fournier-Viger, P.: Mining episode rules from event
sequences under non-overlapping frequency. In: Fujita, H., Selamat, A., Lin, J.C.-W.,
Ali, M. (eds.) IEA/AIE 2021. LNCS (LNAI), vol. 12798, pp. 73–85. Springer,
Cham (2021). https://doi.org/10.1007/978-3-030-79457-6_7
29. Qu, W., Yan, D., Guo, G., Wang, X., Zou, L., Zhou, Y.: Parallel mining of frequent
subtree patterns. In: Qin, L., et al. (eds.) SFDI/LSGDA -2020. CCIS, vol. 1281,
pp. 18–32. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61133-0_2
30. Shabtay, L., Yaari, R., Dattner, I.: A guided FP-growth algorithm for multitude-
targeted mining of big data. arXiv preprint arXiv:1803.06632 (2018)
31. Shelokar, P., Quirin, A., Cordón, O.: Three-objective subgraph mining using
multiobjective evolutionary programming. Comput. Syst. Sci 80(1), 16–26 (2014)
32. Shin, S.J., Lee, D.S., Lee, W.S.: CP-tree: an adaptive synopsis structure for
compressing frequent itemsets over online data streams. Inf. Sci. 278, 559–576 (2014)
33. Song, W., Huang, C.: Discovering high utility itemsets based on the artificial bee
colony algorithm. In: Phung, D., Tseng, V.S., Webb, G.I., Ho, B., Ganji, M.,
Rashidi, L. (eds.) PAKDD 2018. LNCS (LNAI), vol. 10939, pp. 3–14. Springer,
Cham (2018). https://doi.org/10.1007/978-3-319-93040-4_1
34. Song, W., Huang, C.: Mining high utility itemsets using bio-inspired algorithms: a
diverse optimal value framework. IEEE Access 6, 19568–19582 (2018)
35. Song, W., Zheng, C., Huang, C., Liu, L.: Heuristically mining the top-k high-utility
itemsets with cross-entropy optimization. Appl. Intell. 1–16 (2021).
https://doi.org/10.1007/s10489-021-02576-z
36. Truong, T., Duong, H., Le, B., Fournier-Viger, P.: EHAUSM: an efficient algorithm
for high average utility sequence mining. Inf. Sci. 515, 302–323 (2020)
37. Truong, T., Duong, H., Le, B., Fournier-Viger, P., Yun, U.: Efficient high average-
utility itemset mining using novel vertical weak upper-bounds. Knowl. Based Syst.
183, 104847 (2019)
38. Truong, T., Duong, H., Le, B., Fournier-Viger, P., Yun, U.: Frequent high
minimum average utility sequence mining with constraints in dynamic databases using
efficient pruning strategies. Appl. Intell. 52, 1–23 (2021)
39. Wu, Y., Shen, C., Jiang, H., Wu, X.: Strict pattern matching under non-overlapping
condition. Sci. China Inf. Sci. 50(1), 012101 (2017)
40. Wu, Y., Tong, Y., Zhu, X., Wu, X.: NOSEP: nonoverlapping sequence pattern
mining with gap constraints. IEEE Trans. Cybern. 48(10), 2809–2822 (2018)
41. Wu, Y., Wang, Y., Li, Y., Zhu, X., Wu, X.: Self-adaptive nonoverlapping contrast
sequential pattern mining. IEEE Trans. Cybern. (2021)
42. Xie, F., Wu, X., Zhu, X.: Efficient sequential pattern mining with wildcards for
keyphrase extraction. Knowl. Based Syst. 115, 27–39 (2017)
43. Yao, H., Hamilton, H.J., Butz, C.J.: A foundational approach to mining itemset
utilities from databases. In: Proceedings of the 2004 SIAM International
Conference on Data Mining, pp. 482–486. SIAM (2004)
44. Zhang, C., Du, Z., Dai, Q., Gan, W., Weng, J., Yu, P.S.: TUSQ: targeted high-
utility sequence querying. arXiv preprint arXiv:2103.16615 (2021)
45. Zhang, L., Fu, G., Cheng, F., Qiu, J., Su, Y.: A multi-objective evolutionary
approach for mining frequent and high utility itemsets. Appl. Soft Comput. 62,
974–986 (2018)
