A Survey of Utility-Oriented Pattern Mining
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021
Abstract—The main purpose of data mining and analytics is to find novel, potentially useful patterns that can be utilized in real-world
applications to derive beneficial knowledge. For identifying and evaluating the usefulness of different kinds of patterns, many
techniques and constraints have been proposed, such as support, confidence, sequence order, and utility parameters (e.g., weight,
price, profit, quantity, satisfaction, etc.). In recent years, there has been an increasing demand for utility-oriented pattern mining (UPM,
also called utility mining). UPM is a vital task with numerous high-impact applications, including cross-marketing, e-commerce, finance,
medical, and biomedical applications. This survey aims to provide a general, comprehensive, and structured overview of the
state-of-the-art methods of UPM. First, we provide an in-depth introduction to UPM, including concepts, examples, and comparisons with
related concepts. A taxonomy of the most common and state-of-the-art approaches for mining different kinds of high-utility patterns is
presented in detail, including Apriori-based, tree-based, projection-based, vertical-/horizontal-data-format-based, and other hybrid
approaches. A comprehensive review of advanced topics of existing high-utility pattern mining techniques is offered, with a discussion
of their pros and cons. Finally, we present several well-known open-source software packages for UPM. We conclude our survey with a
discussion on open and practical challenges in this field.
Index Terms—Data science, economics, utility theory, utility mining, high-utility pattern, application
1 INTRODUCTION
Data mining [1], [2] focuses on extraction of information from a large set of data and transforms it into an easily interpretable structure for further use. It is an interdisciplinary field focused on scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured. Mining interesting patterns from different types of data is quite important in many real-life applications [1], [3], [4], [5], [6]. In recent decades, the task of interesting pattern mining [e.g., frequent pattern mining (FPM) [7], [8], association rule mining (ARM) [9], [10], frequent episode mining (FEM) [11], [12], [13], [14], and sequential pattern mining (SPM) [5], [15], [16], [17]] has been extensively studied. These are important and fundamental data mining techniques [1] that satisfy the requirements of real-world applications in numerous domains. Most of them aim at extracting the desired patterns using frequency or co-occurrence [7], [8], [9], [10], as well as other properties and interestingness measures [18], [19], [20], [21]. Despite the wide use of pattern mining techniques, most of these algorithms do not allow for the discovery of utility-oriented patterns, i.e., those that contribute the most to a predefined utility threshold, an objective function, or a performance metric. In general, some implicit factors, such as the utility, interestingness, or risk of objects/patterns, are commonly seen in real-world situations. The knowledge that is actually important to the user may not be found by traditional data mining algorithms. Therefore, a novel utility mining framework, called utility-oriented pattern mining (UPM) or high-utility pattern mining (HUPM¹) [22], [23], [24], which considers the relative importance of items (utility-oriented [25]), has become an emerging research topic in recent years. In UPM, the utility (i.e., importance, interest, satisfaction, or risk) of each item can be predefined based on a user's background knowledge or preferences.

According to Wikipedia,² in economics, utility is a measure of preferences over some set of goods (including services, i.e., something that satisfies human wants). In a perspective, it represents satisfaction experienced by the consumer of a good. Hence, utility is a subjective measure. This definition indicates that a subjective value is associated with a specific value in a domain to express user preference. In practice, the value of utility is assigned by the user

W. Gan is with the Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China, and also with the University of Illinois at Chicago, Chicago, IL 60607 USA. E-mail: [email protected].
J. C. W. Lin is with the Western Norway University of Applied Sciences, Bergen 5063, Norway. E-mail: [email protected].
P. Fournier-Viger is with the Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China. E-mail: [email protected].
H. C. Chao is with the National Dong Hwa University, Hualien 974, Taiwan. E-mail: [email protected].
V.S. Tseng is with the Department of Computer Science, National Chiao Tung University, Hsinchu City 30010, Taiwan. E-mail: [email protected].
P.S. Yu is with the University of Illinois at Chicago, Chicago, IL 60607 USA. E-mail: [email protected].

Manuscript received 21 May 2018; revised 5 Aug. 2019; accepted 15 Sept. 2019. Date of publication 20 Sept. 2019; date of current version 5 Mar. 2021.
(Corresponding author: Jerry Chun-Wei Lin.)
Recommended for acceptance by L. Chen.
Digital Object Identifier no. 10.1109/TKDE.2019.2942594

1. The terms UPM and HUPM can be used interchangeably, but we will use UPM in the rest of this manuscript.
2. https://en.wikipedia.org/wiki/Utility
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
GAN ET AL.: A SURVEY OF UTILITY-ORIENTED PATTERN MINING 1307
A sequential rule R is said to be a high-utility sequential rule (HUSR) [49] iff $u(R) \geq \mathit{minutil}$ and R is a valid rule, in which $u(R)$ is the total utility of R in QSD. Otherwise, it is said to be a low-utility sequential rule. The problem of mining high-utility sequential rules from a sequence database is the discovery of all high-utility sequential rules.

Definition 6 (High-Utility Episode, HUE [43], [44]). An episode $\alpha$ is a non-empty totally ordered set of simultaneous events (SE) of the form $\langle (SE_1), (SE_2), \ldots, (SE_k) \rangle$, where $SE_i$ appears before $SE_j$ for all $1 \leq i < j \leq k$. For example, $\langle (AB), (C) \rangle$ is an episode containing a simultaneous event (AB) and a series event (C). The total utility of an episode $\alpha$ in a single simple or complex event containing a set of sub-events is $u(\alpha)$ [43], [44], and its calculation is more complicated than that of the utility of a sequence [42]. An episode is said to be a high-utility episode (abbreviated as HUE) in complex event sequences if its total utility in these sequences is no less than the minimum utility threshold, i.e., $u(\alpha) \geq \mathit{minutil}$. Otherwise, this episode is a low-utility episode.

Definition 7 (Utility-Oriented Pattern Mining, UPM). A general definition of UPM is given below: UPM is a new mining framework that utilizes the utility theory and various mining techniques (e.g., data structure, pruning strategy, upper bound) to discover the interesting patterns (e.g., HUI, HUAR, HUSP, HUSR, HUE), and these derived patterns can lead to utility maximization and high benefit in business or other tasks.

Based on the above concepts of utility pattern, the UPM framework can be further classified into the following categories: 1) high-utility itemset mining (HUIM), 2) high-utility association rule mining (HUARM), 3) high-utility sequential pattern mining (HUSPM), 4) high-utility sequential rule mining (HUSRM), and 5) high-utility episode mining (HUEM).

2.2 Comparisons with Related Concepts
With the boom in data mining and analysis, all kinds of data have emerged, and a number of concepts (e.g., FPM, SPM, FEM, UPM, etc.) to model various types of data have been proposed. These concepts have similar meanings, as well as subtle differences. Here we compare the UPM framework with its most related concepts.

UPM versus FPM. Frequent pattern mining (FPM) [7], [8], [9], [10] is a common and fundamental topic in data mining. FPM is a key phase of association-rule mining (ARM), but it has been generalized to many kinds of patterns, such as frequent sequential patterns [16], frequent episodes [11], and frequent subgraphs [51]. The goal of FPM is to discover all the desired patterns having support no lower than a given minimum support threshold. If the support of a pattern is no lower than this threshold, it is called a frequent pattern; otherwise, it is called an infrequent pattern. Unlike UPM, studies of FPM seldom consider databases that record the quantities of items, and none of them considers the utility feature. Under the "economic view" of consumer rational choices, utility theory can be used to maximize the estimated profit. UPM considers both statistical significance and profit significance, whereas FPM aims at discovering the interesting patterns that frequently co-occur in databases. In other words, any frequent pattern is treated as a significant one in FPM. However, in practice, these frequent patterns do not show the business value and impact. In contrast, the goal of UPM is to identify the useful patterns that appear together and also bring high profits to the merchants [52]. In UPM, managers can investigate the historical databases and extract the set of patterns having high combined utilities. Such problems cannot be tackled by the support/frequency-based FPM framework.

UPM versus WFPM. In FPM and related areas, the relative importance of each object/item is not considered. To address this problem, weighted frequent-pattern mining (WFPM) was proposed [53], [54], [55], [56], [57], [58], [59]. In the framework of WFPM, the weights of items, such as unit profits of items in transaction databases, are considered. Therefore, even if some patterns are infrequent, they might still be discovered if they have high weighted support [53], [54], [55]. However, the quantities of objects/items are not considered in WFPM. Thus, the requirements of users who are interested in discovering the desired patterns with high risks or profits cannot be satisfied. The reason is that the profits are composed of unit profits (i.e., weights) and purchased quantities. In view of this, utility-oriented pattern mining has emerged as an important topic. It refers to discovering the patterns with high profits. As mentioned previously, the meaning of a pattern's utility is the interestingness, importance, or profitability of the pattern to users. The utility theory is applied to data mining by considering both the unit utility (i.e., profit, risk, and weight) and purchased quantities. This has led to the concept of UPM [52], which selects interesting patterns based on minimum utility rather than minimum support.

UPM versus SPM. Unlike frequent itemset mining (FIM), sequential pattern mining (SPM) [5], [15], [16], [17], which discovers frequent subsequences as patterns in a sequence database that contains the embedded timestamp information of an event, is more complex and challenging. In 1995, Agrawal and Srikant first extended the FPM model to handle sequences [15]. Consider the sequence $\langle \{a, e\}, \{b\}, \{c, d\}, \{g\}, \{e\} \rangle$, which represents five events made by a customer at a retail store. Each single letter represents an item (e.g., $\{a\}$, $\{c\}$, $\{g\}$), and items between curly braces represent an itemset (e.g., $\{a, e\}$ and $\{c, d\}$). Simply speaking, a sequence is a list of temporally ordered itemsets (also called events). Because SPM must account for the temporal order of events, a constraint that is absent in FPM, SPM has a potentially huge set of candidate sequences [16]. Through 25 years of study and development, many techniques and approaches have been proposed for mining sequential patterns in a wide range of real-world applications [5]. In general, SPM mainly focuses on the co-occurrence of derived patterns; it does not consider the unit profit and purchase quantities of each product/item.

So far, we have reviewed a wide range of pattern-mining frameworks that aim to discover various types of patterns, such as itemsets [9], [53], sequences [15], [16], and graphs [51]. These frameworks, however, only select high-frequency/support patterns. Patterns below the minimum threshold are considered useless and discarded. Frequency is the main interestingness measurement, and all objects/items and transactions are treated equally in such a framework. Clearly, this assumption contradicts the truth in
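The contrast between support-based selection (FPM) and utility-based selection (UPM) discussed in this subsection can be made concrete with a small worked example. The sketch below is illustrative only: the toy database, the unit profits, and the helper names `support` and `utility` are assumptions introduced here, not data or code from the surveyed works.

```python
# Hypothetical quantitative transaction database: item -> purchased quantity.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1},
    {"c": 3, "d": 1},
]
# Hypothetical unit profits (the "unit utility" of each item).
unit_profit = {"a": 1, "b": 2, "c": 10, "d": 5}


def support(itemset, db):
    """Count the transactions that contain every item of the itemset."""
    return sum(1 for t in db if all(i in t for i in itemset))


def utility(itemset, db, profit):
    """Sum quantity * unit profit of the itemset's items over all
    transactions that contain the whole itemset."""
    total = 0
    for t in db:
        if all(i in t for i in itemset):
            total += sum(t[i] * profit[i] for i in itemset)
    return total


for itemset in ({"a"}, {"c", "d"}):
    print(sorted(itemset),
          support(itemset, transactions),
          utility(itemset, transactions, unit_profit))
# {a}:    support 3, utility 4   (frequent, but low utility)
# {c, d}: support 1, utility 35  (infrequent, but high utility)
```

With a minimum support of 2 and a minimum utility of 30 (both chosen arbitrarily for this example), a support-based miner would keep {a} and discard {c, d}, while a utility-based miner would do the opposite.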
1310 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021
TABLE 4
Apriori-Based Algorithms for High-Utility Pattern Mining
frequent itemset must also be frequent, and any superset of an infrequent itemset cannot be frequent. For example, assuming $\{a, b, c\}$ is frequent, all of its sub-itemsets, such as $\{a, c\}$ and $\{b, c\}$, are also frequent. If $\{d, e\}$ is infrequent, its supersets, such as $\{a, d, e\}$ and $\{d, e, f\}$, are not frequent. Some Apriori-based approaches for HUIM have been further developed. The core step of these Apriori-based UPM algorithms is the generation of candidate k-itemsets $C_k$ from high-utility (k-1)-itemsets (denoted as $\mathit{HUI}_{k-1}$), and it consists of two operations: join and prune. In the join step, the conditional join of two $\mathit{HUI}_{k-1}$ patterns is used to generate the candidate set $C_k$. The prune step then reduces the size of $C_k$ by using the utility upper bound (which is similar to the Apriori property [9]).

OOApriori & Top-k Closed Utility Mining [22], [23]. In 2002, Shen and Yang proposed an objective-oriented association (OOA) mining approach [22]. They integrated the utility constraint into OOApriori (a variant of Apriori [9]) to prune candidates for deriving the OOA rules. The interestingness of OOA rules is measured in terms of probabilities and utilities in supporting the user's objective. The utility constraint for OOA rules is neither monotone nor anti-monotone. In 2003, Chan et al. first defined the concept of utility mining and proposed an objective-directed mining algorithm to mine the top-k closed-utility patterns [23]. This was the first time the term "utility mining" was presented and used to identify both frequent and high-utility itemsets based on business objectives. In this utility-based mining framework, a pruning strategy based on a weak but anti-monotonic condition was developed to reduce the search space.

Mining with Expected Utility (MEU) [52]. In 2005, Yao et al. proposed a utility mining model, called mining with expected utility (MEU) [52], which considers both the purchase quantities (called internal utility) and unit profits (called external utility) of items to mine HUIs. Note that the term "mining high-utility itemsets" first appeared in [23], but their concept and definitions were quite different from the definitions of high-utility itemset mining today. It is widely believed that utility-based itemset mining, sequence mining, and web mining originated in [52]. Researchers in the field of UPM consider the MEU model as the first theoretical model and strict definition of high-utility itemset mining. MEU uses a heuristic to determine candidates and usually overestimates. However, it cannot maintain the downward closure property of Apriori [9], and the derived results are incomplete.

UMining and UMining_H [29]. Yao et al. then proposed UMining and heuristic UMining_H [29] for finding HUIs based on several mathematical properties of the utility measure. The utility constraint is characterized by a property giving the upper bound of the utility value of an itemset. In UMining, the property of utility upper bound is used as a pruning strategy. UMining_H utilizes another pruning strategy based on a heuristic method [29]. However, some HUIs may be erroneously pruned by this heuristic method. Furthermore, neither of them has the downward closure property of Apriori [9], and they overestimate too many patterns. Therefore, they suffer from excessive candidate generation and poor scalability.

Two-Phase [77]. Note that the downward closure property (w.r.t. the Apriori property [9]) of the support measure does not hold for the utility. To address the challenge that the utility measure is neither monotone nor anti-monotone, Liu et al. proposed the well-known Two-Phase algorithm [77]. Two-Phase introduces a novel concept named the transaction-weighted downward closure (TWDC) property (for any itemset X, if X is not an HTWUI, any superset of X is not an HUI) and uses it to discover HUIs in two phases. Phase 1: it finds each itemset X such that $\mathit{TWU}(X) \geq \mathit{minutil}$, using the TWU upper bound to prune the search space. Initially, it scans the database once to get all 1-itemsets $\mathit{HTWUI}_1$; it then generates (k+1)-level candidate itemsets (with length k+1) from the length-k candidates $\mathit{HTWUI}_k$ (where $k > 0$). In each iteration, it needs to examine the TWU values of the candidates by scanning the database once. It terminates when no candidate can be generated. Phase 2: it scans the database again to calculate the exact utility of each candidate in the set of $\mathit{HTWUI}_k$ and then outputs the desired HUIs.
TABLE 5
Tree-Based Pattern-Growth Algorithms for High-Utility Pattern Mining
independently. The performance of CTU-PRO is better than Two-Phase [77] and CTU-Mine [83]. CTU-PROL introduces two new concepts, compressed transaction utility-prol and CUP-tree, which are used for parallel projection of the transaction database. Note that the anti-monotone property of TWU is used to prune the search space of sub-divisions in CTU-PROL. However, unlike Two-Phase, it avoids a rescan of the database to calculate the actual utilities of HTWUIs. The results show that CTU-PROL outperforms Two-Phase [77] and CTU-Mine [83].

GPA and PB [85]. Since the tree-based pattern-growth approaches recursively perform tree traversal and generate a series of sub-tree structures, Lan et al. proposed two alternative efficient projection-based utility mining approaches, named Gradual Pruning Approach (GPA) [85] and PB (Projection-Based mining approach) [85]. Compared with the level-wise techniques, the property of a projection-based technique is more suitable for improving the utility upper bound. The general idea is to use the overestimated HTWUIs [77] to recursively project item/sequence databases into some smaller projected databases and grow item/subsequence fragments in each projected sub-database. In addition, PB applies a novel pruning strategy and an indexing mechanism to speed up the runtime and reduce the memory requirement of the mining process. The indexing mechanism imitates traditional projection algorithms (e.g., PrefixSpan [16]) by projecting sub-databases. Using projection, GPA and PB can significantly reduce database size when deriving larger itemsets and outperform Two-Phase [77].

PTA [85]. Different from PB [85] and GPA [85], pruning and filtering strategies are proposed to tighten the upper bounds of utility values in the projection-based upper-bound tightening approach (abbreviated as PTA). The framework of PTA includes the following steps: 1) find HTWUIs and high-utility 1-itemsets; 2) perform the pruning strategy and the indexing strategy; 3) project the transactions required by the prefix itemsets to be processed; and 4) find k-HTWUIs and high-utility k-itemsets. An effective index mechanism is applied to reduce the time cost of searching relevant transactions that need to be projected in sub-databases. Thus, PTA only needs one database scan. Experimental results show that PTA outperforms the other existing algorithms (i.e., Two-Phase [77], GPA [85], PB [85], CTU-PRO [33], IHUP_PL [31], IHUP_TWU [31], and IHUP_TF [31]) in terms of pruning unpromising itemsets, memory usage, and runtime.

Discussions. In summary, the above UPM approaches, which utilize the database projection mechanism, have the following advantages: 1) they mine the complete set of high-utility patterns but reduce the effort of candidate generation; 2) prefix-projection reduces the size of the projected sub-database and leads to efficient processing; and 3) bi-level projection and pseudo-projection may improve mining efficiency, as summarized in Table 6.

TABLE 6
Projection-Based Pattern-Growth Approaches for UPM

3.5 New Data-Format-Based Approach
To achieve more efficiency than the tree-based UPM approaches, some algorithms that mine high-utility itemsets using a vertical or horizontal data structure with a single phase were proposed recently, such as HUI-Miner [86], FHM [87], d2HUP [88], HUP-Miner [89], and EFIM [90]. Both d2HUP and EFIM use a horizontal database, while the others use a vertical data structure. All these algorithms not only avoid the disadvantages of Apriori-based approaches but also avoid the disadvantages of the tree-based HUIM approaches. Details are shown in Table 7 and described below.

TABLE 7
Utility-List-Based Algorithms for High-Utility Pattern Mining

HUI-Miner [86] and FHM [87]. High-Utility Itemset Miner (HUI-Miner) [86] is the first one-phase algorithm to discover HUIs. It proposes a vertical data structure named utility-list [86] and the concept of remaining utility [86], which have been widely extended in many newer UPM algorithms. As a compact data structure, the utility-list can store utility information for the potential patterns that may have high utility value. The utility-list of an itemset X in a database D is a set of tuples corresponding to the transactions in which X appears. Each tuple is defined as $\langle tid, iu, ru \rangle$ for every transaction $T_q$ containing X, in which the tid element is the transaction identifier of $T_q$, the iu element is the utility value of X in $T_q$, and the ru element is the remaining utility value of X in $T_q$. More details about the remaining utility, the utility-list structure, and its construction can be found in [86]. The construction process of the utility-list is quite efficient and consumes little memory. By keeping necessary information from the transaction database in memory, HUI-Miner can directly mine HUIs by spanning the search space w.r.t. a set-enumeration tree [93]. As an enhanced version of HUI-Miner [86], Fast High-Utility Miner (FHM) [87] utilizes a novel pruning strategy named Estimated Utility Co-occurrence Pruning (EUCP) to reduce the costly join operations of utility-lists. EUCP is based on the Estimated
Utility Co-Occurrence Structure (EUCS) [87]. Using the utility-list [86], HUI-Miner and FHM need only two database scans to construct a series of utility-lists of $\mathit{HTWUI}_1$. Then, utility-lists of (k+1)-itemsets can be obtained by performing the join operations of utility-lists of k-itemsets. They can directly discover HUIs by keeping utility-lists in memory and utilizing the upper bound of the remaining utility. HUI-Miner and FHM outperform all previous algorithms on most datasets in terms of running time (almost two orders of magnitude faster) and memory cost. FHM is faster than HUI-Miner [86], especially for dense databases, but is not efficient for databases that are sparse. However, the drawback is that both of them need to perform costly join operations among a series of utility-lists, which can be time-consuming. Note that some quantitative results are already reported on the same benchmark datasets in [86], [87].

d2HUP [88]. d2HUP is also able to directly discover HUIs without candidate generation. It utilizes another novel data structure, named Chain of Accurate Utility Lists (CAUL) [88], to store the necessary information. In contrast to HUI-Miner, it enumerates an itemset as a prefix extension of its prefix itemset. In fact, the search space of d2HUP is a variant of the set-enumeration tree [93]. It can efficiently calculate the utility of each enumerated itemset and the upper bound on utilities of the prefix-extended itemsets. d2HUP also utilizes a similar concept of remaining utility to tighten the utility upper bound, which is much tighter than TWU. This upper bound is tightened by iteratively filtering out irrelevant items when constructing CAUL. Moreover, it requires less memory than the different kinds of tree structures used in the above-mentioned algorithms. d2HUP was shown to be more efficient than Two-Phase [77], UP-Growth [80], and HUI-Miner [86], but its performance was not compared with some recent algorithms, such as FHM [87] and HUP-Miner [89].

HUP-Miner. HUP-Miner [89] is an improved algorithm based on HUI-Miner [86]. Two new pruning strategies, PU-Prune (based on dataset partition) and LA-Prune (based on the concept of lookahead pruning), are introduced in HUP-Miner to limit the search space for mining HUIs [89]. It needs to set the number of dataset partitions K, which determines how many partitions are processed internally. However, the optimal value of K is hard to find empirically for a given dataset. Based on the concept of remaining utility [86], LA-Prune provides a tighter utility upper bound of any k-itemset. Thus, a huge number of unpromising k-itemsets ($k \geq 2$) that have low utility can be pruned. It has been shown that HUP-Miner is significantly faster than HUI-Miner. However, the PU-Prune strategy based on dataset partition does not always have an effect on runtime and memory consumption. In addition, a shortcoming is that the number of partitions must be set explicitly by users, since it is an additional parameter.

Index High-Utility Itemsets Mine (IHUI-Mine) [94]. As mentioned before, candidate generation-and-test approaches suffer from the drawbacks of having an immense candidate pool and requiring several database scans. Meanwhile, methods based on pattern growth tend to consume large amounts of memory to store conditional trees. IHUI-Mine uses the subsume index [95], a data structure for efficient frequent itemset mining, to enumerate the desired HUIs and prune the search space. The experimental results show that IHUI-Mine outperforms some popular algorithms, including Two-Phase [77], FUM [34], and HUC-Prune [79], but it has not been compared with the state-of-the-art algorithms.

IMHUP [91]. In the framework of list-based high-utility pattern mining, a large number of comparison and join operations on entries within lists cause enormous execution time costs. Based on the indexed utility-list (IU-list) [91], two techniques were developed in Indexed list-based Mining of High-Utility Patterns (IMHUP) to reduce utility upper-bounds that satisfy the anti-monotonic property. IMHUP-RUI and IMHUP-CHI [91] generate high-utility patterns without any construction of additional local-lists when the current lists only contain information of the same revised transactions. They further utilize the upper-bound utilities in IU-lists to decrease the search space.

EFIM [90]. The projection-based EFficient high-utility Itemset Mining (EFIM) algorithm introduces several new ideas, including two new upper bounds named revised sub-tree utility and local utility, and an array-based utility computing technique. To reduce the cost of database scans, EFIM further proposes database projection and transaction merging techniques named High-utility Database Projection (HDP) and High-utility Transaction Merging (HTM). As larger itemsets are explored, both projection and merging reduce the size of the database. The main ideas of HDP and HTM are described in [90]. The time and space complexity of EFIM is roughly linear in the number of distinct items in the search space. The competitive results show that EFIM is in general 2 to 3 orders of magnitude faster than the state-of-the-art algorithms (UP-Growth+ [48], HUI-Miner [86], FHM [87], d2HUP [88], and HUP-Miner [89]) on dense datasets and performs quite well on sparse datasets.

mHUIMiner. mHUIMiner [92] is a hybrid algorithm that combines some ideas from HUI-Miner [86] and IHUP-tree [31]. It adopts the utility-list and remaining utility. It utilizes a tree structure to guide the itemset expansion process, and thus the itemsets that are nonexistent in the database can be avoided. Unlike current techniques, it does not have a complex pruning strategy that requires expensive computational overhead. It was shown to provide the best runtime on sparse datasets, while having performance comparable to other state-of-the-art algorithms (e.g., HUI-Miner [86], FHM [87], and EFIM [90]) on dense datasets.

Discussions. All the algorithms discussed in this subsection utilize the new data structure to store necessary information about each itemset. By spanning the search space w.r.t. a set-enumeration tree [93], they can easily calculate the total utility of an itemset by performing join operations of the built utility-lists. Moreover, an upper bound on the overall utilities of itemsets, called the remaining utility, is calculated using utility-lists. It can be used to determine whether a pattern and all its extensions are not high-utility itemsets (to reduce the search space). The upper bound with remaining utility is equivalent to the upper bound proposed in d2HUP [88]. Although the pattern-growth approach in d2HUP can avoid considering itemsets not appearing in the database, the hyper-structure it uses still consumes a considerable amount of memory [88]. Some competitive results of these
UPM methods have been compared and summarized in recent studies [90], [92].

As shown in Table 7, the HUIM algorithms are based on a combination of vertical or horizontal data formats and typical approaches. These hybrid algorithms combine different techniques to mine high-utility patterns in such a way that the strengths of each technique are utilized to maximize their efficiency. The properties of these one-phase algorithms are as follows:

1) Complete result: The completeness is guaranteed by the traversal of the search space w.r.t. the set-enumeration tree.
2) Stable result: The result is stable as all exact utility information is stored in a vertical or horizontal data structure. Depth-first searching is also used to quickly calculate the utilities.
3) Efficiency: The algorithm is efficient relative to algorithms that traverse the complete search space. Moreover, the sort order of items in the set-enumeration tree affects the mining efficiency, but not the final mining results of patterns.
4) Parameter sensitivity: These algorithms, except for HUP-Miner, only have minutil as the parameter, and are sensitive to it.

4 ADVANCED TOPIC OF UPM

4.1 Mining High Average Utility Itemsets
A main challenge in HUIM is that the exponential search space is extremely large when the number of distinct items or the size of the database is large. The other challenge is that existing HUIM methods overlook the fact that longer itemsets tend to have higher utility values. A large itemset may have an unreasonable estimated profit as opposed to its actual value. Therefore, the concept named high average-utility itemset mining (HAUIM) was proposed [72]. HAUIM discovers utility patterns by considering both their utilities and lengths, thus providing a different utility measure than traditional HUIM. HAUIM divides the utility of an itemset by its length (the number of items that the itemset contains). Up to now, some interesting works have

HUIs with record insertion, such as IHUP [31], FUP-HUI-INS [102], PRE-HUI-INS [103], HUI-list-INS [104], and EIHI [105]. Among these, the early algorithms, e.g., FUP-HUI-INS and PRE-HUI-INS, utilize utility-oriented dynamic maintenance strategies that extend the original FUP [100] and pre-large [101] concepts. Since the FUP-HUI-INS and PRE-HUI-INS algorithms follow the Two-Phase model, an additional database rescan is still necessary to find the actual HUIs. Furthermore, computations are required to find the HTWUIs based on the pattern-growth approach. Both HUI-list-INS and EIHI utilize the utility-list [86] and utility property to significantly reduce runtime and memory usage. More complete reviews can be found in [47].

Case 2: HUIM with Record Deletion. In practical situations, record deletion is also an important issue in databases. Cheung et al. designed the FUP2 concept [106] to discover frequently updated itemsets for record deletion. Hong et al. developed the pre-large concept [101] for handling record deletion to avoid multiple database scans each time. Two support thresholds are separately set in pre-large [101], and thus the original database is not required to be rescanned until the number of accumulated deleted transactions reaches the designed safety bound. Since the FUP2 concept [106] cannot be directly applied to HUIM, Lin et al. separately designed the FUP-HUI-DEL [107] and PRE-HUI-DEL [108] algorithms for handling record deletion to maintain and update the new HUIs based on the Two-Phase model. Recently, an efficient dynamic algorithm named HUI-list-DEL [109] was developed to discover HUIs by maintaining the built utility-list [86] structure for record deletion in dynamic databases. The new HUIs can be directly produced without candidate generation or numerous database scans.

Case 3: HUIM with Record Modification. As one of the three common operations (record insertion, deletion, and modification) in databases, record modification is also commonly seen in real-life situations. For example, some typos or errors may occur when the collected data from periodic transactions is input into a computer using a keyboard. Thus, some information may become invalid or new information may arise. Lin et al. first proposed the FUP-HUP-tree-MOD algorithm [110] to address this issue. It is based on the FUP con-
cept [100] and shows better performance compared to Two-
been extensively studied, such as Apriori-based algorithms
Phase and some tree-based algorithms in batch mode. In
[72], projection-based PAI [96], utility-list based HAUI-
addition, a faster PRE-HUI-MOD algorithm [111] extends the
Miner [97], [98], and other hybrid algorithms with different
pre-large concept [101] to set the effective upper bound for
upper-bound models [98], [99].
discovering HTWUIs and HUIs from the dynamic databases.
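To make the average-utility measure of Section 4.1 concrete, the following minimal Python sketch computes the utility of an itemset and divides it by the itemset's length. It assumes the usual HUIM setting in which each transaction stores purchase quantities and each item has a unit profit; the function names and the toy data are illustrative and are not taken from any of the cited algorithms.

```python
from typing import Dict, FrozenSet, List

Transaction = Dict[str, int]  # item -> purchased quantity (assumed representation)


def utility(itemset: FrozenSet[str], db: List[Transaction],
            unit_profit: Dict[str, float]) -> float:
    """Sum quantity * unit profit over every transaction that contains the itemset."""
    total = 0.0
    for transaction in db:
        if itemset.issubset(transaction.keys()):
            total += sum(transaction[i] * unit_profit[i] for i in itemset)
    return total


def average_utility(itemset: FrozenSet[str], db: List[Transaction],
                    unit_profit: Dict[str, float]) -> float:
    """Average utility: the utility of the itemset divided by its number of items."""
    return utility(itemset, db, unit_profit) / len(itemset)


# Hypothetical toy database and unit profits.
db = [{"a": 1, "b": 2}, {"a": 2, "c": 1}, {"a": 1, "b": 1, "c": 3}]
unit_profit = {"a": 5, "b": 2, "c": 1}

for items in (frozenset("a"), frozenset("ab"), frozenset("abc")):
    print(sorted(items), utility(items, db, unit_profit),
          average_utility(items, db, unit_profit))
```

In this toy data, {a, b} has utility 16 but average utility 8, whereas the single item {a} has utility 20 and average utility 20, which shows how the length normalization changes the ranking of patterns.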
amount of HUIs is difficult to comprehend and be analyzed by users. Thus, it is often impractical to generate and return the entire set of HUIs.
Maximal High-Utility Pattern. To return representative HUIs to users, some concise representations of HUIs were proposed. Chan et al. introduced the concept of a utility frequent closed pattern [23], the definition of which is different from high-utility itemset [48], [52]. Shie et al. then proposed a new representation called maximal high-utility itemset in which a HUI is not a subset of any other HUI [69]. Although maximal HUI reduces the number of extracted HUIs, it is not lossless because the utilities of the subsets of a maximal HUI cannot be known without rescanning the database. Moreover, recovering all HUIs from the set of maximal HUIs is very inefficient since many subsets of a maximal HUI may have low utility.
Closed High-Utility Pattern. To provide not only compact but also complete information about high-utility itemsets to users, Tseng et al. first addressed the problem of redundancy in high-utility itemset mining [116]. A lossless and compact representation named closed high-utility itemset [116] was introduced. To mine this representation, they proposed three algorithms named AprioriHC (Apriori-based approach for mining High-utility Closed itemsets), AprioriHC-D (AprioriHC algorithm with Discarding unpromising and isolated items), and Closed High-Utility Itemset Discovery (CHUID) [116]. Fournier-Viger et al. then proposed a fast and memory efficient algorithm named EFIM-Closed [117] to discover closed HUIs by extending the EFIM model [90]. It proposes three strategies to mine CHUIs efficiently: closure jumping, forward closure checking, and backward closure checking. EFIM-Closed relies on two new upper bounds, named local utility and sub-tree utility, to prune the search space, and it can calculate these upper bounds efficiently. Inspired by utility-list [86], some more efficient one-phase algorithms have been proposed to address this interesting issue, such as CHUI-Miner [118] and CHUM [119].

4.4 Mining High-Utility Quantitative Itemsets/Rules
Although extensive studies have been proposed for high-utility itemset mining, a critical limitation of these studies is that they ignore the quantity attribute of items in discovered HUIs. However, such information can be very useful and valuable in many applications. In view of this, the concept of High-Utility Quantitative Itemset mining (abbreviated as HUQI) [38], [120] has emerged. In the framework of HUQI mining, an item may have different quantities in the database and each item carrying a different quantity is regarded as a quantitative item. HUQI [38] and the more efficient vertical utility-list-based VHUQI [120] were thus developed. An example of such a rule is (bread, 3, 4) ⇒ (milk, 2, 3), which means that most customers who purchased three or four breads also purchased two or three milks. We can use this information to package products with quantities that have high utility and estimate the number of items that need to be reserved according to the number of other items.

4.5 High-Utility Sequential Pattern Mining
By integrating the utility factor and sequence data, the problem of high-utility sequential pattern mining (HUSPM) was introduced. For handling the utility of web log sequences, two tree structures, called utility-based WAS tree (UWAS-tree) [63] and incremental UWAS-tree (IUWAS-tree) [63], were developed to mine web access sequences. However, a sequence element with multiple items, such as [(a, 3)(c, 4)], cannot be supported in these two models. The considered scenarios are rather simple, which limits their applicability for handling complex sequences. To this end, some algorithms were proposed to address the HUSPM problem.
UL and US [41]. Since both UWAS-tree and IUWAS-tree algorithms cannot deal with sequences containing multiple items in each sequence element (transaction), Ahmed et al. designed two algorithms (level-wise Utility-Level (UL) [41] and pattern-growth Utility-Span (US) [41]) to mine HUSPs. UL and US extend traditional sequential pattern mining (SPM). The utility of a sequential pattern is calculated in two ways. The utilities of sequences having only distinct occurrences are added together, while the highest occurrences are selected from sequences with multiple occurrences and used to calculate the utilities. However, the problem definition in UL and US [41] is rather specific. No generic framework for transferring from SPM to high-utility sequence analysis has been proposed.
USpan [42]. Yin et al. then formalized the problem of HUSPM, and proposed a generic framework and the USpan algorithm to mine high-utility sequences [42]. A lexicographic quantitative sequence tree (LQS-tree) is constructed as the search space. Two concatenation mechanisms, I-Concatenation and S-Concatenation, are used to generate newly concatenated utility-based sequences. Based on the LQS-tree structure, USpan [42] adopts the sequence-weighted utilization (SWU) measure and the Sequence Weighted Downward Closure (SWDC) property to prune unpromising sequences and to improve the mining performance. However, a shortcoming of USpan is that the data representation w.r.t. the utility matrix is quite complex and memory-costly.
PHUS [121]. Lan et al. then proposed the projection-based high-utility sequential pattern mining (PHUS) algorithm for mining HUSPs with the maximum utility measure and a sequence-utility upper-bound (SUUB) model [121]. The algorithm extends PrefixSpan [16] and uses a projection-based pruning strategy to obtain tight upper bounds on sequence utilities. Thus, it can avoid considering too many candidates, and improves the performance of mining HUSPs using the SUUB model.
HuspExt [122] and HUS-Span [125]. Alkan et al. [122] designed the high-utility sequential pattern extraction (HuspExt) algorithm with an upper-bound called Cumulate Rest of Match (CRoM). It uses a pruning before candidate generation (PBCG) strategy to prune unpromising sequences for mining HUSPs. However, HuspExt cannot discover the complete HUSPs due to the incorrect upper bound. In view of the previous upper bounds on sequence utilities not being tight enough, HUS-Span [125] utilizes two tight utility upper bounds, called prefix extension utility (PEU) and reduced sequence utility (RSU), as well as two companion pruning strategies, to identify high-utility sequential patterns.
ProUM [123]. Gan et al. [123] proposed an efficient projection-based utility mining approach named ProUM to discover high-utility sequences by utilizing the upper bound named sequence extension utility (SEU) and the utility-array structure [123]. Different from the upper bound
1320 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021
TABLE 8
Algorithms for High-Utility Sequential Pattern Mining
used in USpan, SEU can guarantee the correctness and completeness of discovered results on sequence data. Besides, ProUM performs up to two orders of magnitude better than USpan and HUS-Span in terms of execution time on most sequence datasets.
HUSP-ULL [124]. The state-of-the-art HUSP-ULL [124] algorithm utilizes a new data structure, namely the utility-linked (UL)-list, and two pruning strategies (called the look ahead strategy and the irrelevant item pruning strategy) to quickly discover HUSPs. The extensive experiments in [124] show that HUSP-ULL is the fastest when compared to the current HUSPM algorithms. Some competitive results of these recent state-of-the-art HUSPM methods have been compared and summarized in the studies of ProUM [123] and HUSP-ULL [124].
Main characteristics of these HUSPM algorithms are summarized in Table 8. In addition, Shie et al. explored a new problem of mining high-utility mobile sequential patterns (HUMSPs) by integrating mobile data mining with utility mining [35], [36]. This is the first work that combines mobility patterns with the utility factor to find high-utility mobile sequential patterns.

4.6 High-Utility Episode Mining
When the sequential data becomes an event sequence, the task of frequent episode mining (FEM) [11] is introduced. FEM reveals a significant amount of useful information hidden in the event sequence with a wide range of applications [11], [12], [13], [14]. However, the discovered frequent episode is still too simple and primitive. In some cases, FEM may lose some rich information, such as utility, importance, risk, etc. Wu et al. [43] presented the first attempt to solve the problem of high-utility episode mining (HUEM) in a complex event sequence. However, the proposed UP-Span algorithm suffers from low efficiency in both runtime and memory consumption. Furthermore, the proposed upper-bound named Episode Weighted Utility (EWU) is a loose and basic utility bound for episodes. Guo et al. then proposed the TSpan algorithm, which improves on UP-Span in several respects and is much more efficient [126], saving considerable search space and runtime. Then, Lin et al. separately introduced some models to process complex event sequences and stock investment using high-utility episode mining and a genetic algorithm [44], [127]. In addition, the top-k issue of HUEM has been studied recently [128].

4.7 UPM in Big Data
The big data era requires more efficient UPM frameworks that can handle large volumes of data. Several models are presented to address UPM in big data [129], [130], [131]. Details are described below.
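The partition-and-merge style of processing used by the MapReduce-based approaches in this subsection can be pictured with a small, sequential Python sketch. It is not PHUI-Growth or BigHUSP; it only assumes that the transaction-weighted utility (TWU) of each item, the measure behind HTWUIs, is computed independently per data partition (the map step) and then summed (the reduce step). The partitioning, function names, and data are hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import Dict, List

Transaction = Dict[str, int]  # item -> purchased quantity (assumed representation)


def map_partition(partition: List[Transaction],
                  unit_profit: Dict[str, int]) -> Counter:
    """Map step: partial TWU of each item within one partition."""
    partial = Counter()
    for transaction in partition:
        # Transaction utility: total profit of the whole transaction.
        tu = sum(q * unit_profit[i] for i, q in transaction.items())
        for item in transaction:
            partial[item] += tu
    return partial


def merge(left: Counter, right: Counter) -> Counter:
    """Reduce step: add the partial TWU values of two partitions."""
    merged = Counter(left)
    merged.update(right)
    return merged


def partitioned_twu(partitions: List[List[Transaction]],
                    unit_profit: Dict[str, int]) -> Counter:
    # A plain loop stands in for map tasks that would run on separate workers.
    partials = [map_partition(p, unit_profit) for p in partitions]
    return reduce(merge, partials, Counter())


# Hypothetical toy data split into two partitions.
unit_profit = {"a": 5, "b": 2, "c": 1}
partitions = [
    [{"a": 1, "b": 2}, {"a": 2, "c": 1}],
    [{"a": 1, "b": 1, "c": 3}],
]
twu = partitioned_twu(partitions, unit_profit)
print(dict(twu))
```

Because the partial counts are simply added, the result does not depend on how the transactions are split across partitions, which is what makes this kind of counting step easy to distribute.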
UPM in Big Itemset Data. PHUI-Growth (Parallel mining High-Utility Itemsets by pattern-Growth) [129] was first proposed for mining HUIs in parallel on the Hadoop platform. It adopts the MapReduce [132] architecture to partition the whole mining task. As a distributed parallel algorithm, PHUI-Miner with a sampling strategy is introduced by Chen et al. [130]. It extracts the approximate HUIs from big data. Recently, parallel mining of top-k HUIs on the Spark in-memory computing architecture has also been proposed. It inherits several advantages of Spark [133].
UPM in Big Sequence Data. The BigHUSP model is the first work to discover high-utility sequential patterns in a distributed and parallel manner [131]. BigHUSP uses multiple steps of MapReduce [132] to process big data in parallel. In contrast to the traditional HUSPM approaches, it can deal with large-scale sequential data. MAHUSP [134] is a memory-adaptive approximation algorithm to efficiently discover high-utility sequential patterns over data streams. It employs a memory-adaptive mechanism using a bounded portion of memory, and guarantees that all HUSPs are discovered under certain circumstances. Experimental study shows that MAHUSP can not only discover HUSPs over data streams efficiently, but also adapt to memory allocation without sacrificing much of the quality of discovered HUSPs.

4.8 UPM in Stream Data
A data stream is an infinite sequence of data elements continuously arriving at a rapid rate [66], [67]. Mining useful patterns from data streams has become one of the interesting problems of data mining [67], [135], [136]. However, few works on mining data streams consider the utility factor embedded in data streams. Tseng et al. first proposed the THUI-Mine (Temporal High-Utility Itemsets) model to mine temporal HUIs from data streams [137]. THUI-Mine can effectively identify the temporal HUIs by generating fewer temporal 2-itemsets of HTWUIs. Thus, the execution time can be reduced significantly in mining all HUIs from data streams. In this way, the discovery process under all time windows of data streams can be achieved with limited memory space and fewer candidates. Then, researchers for HUIM proposed several stream mining models, such as Mining High-Utility Itemsets based on BITvector (MHUI-BIT) [30], Mining High-Utility Itemsets based on TIDlist (MHUI-TID) [30], and Generation of maximal high-Utility Itemsets from Data strEams (GUIDE) [32], [69]. GUIDE is a framework that mines the compact maximal HUIs from data streams with different models (i.e., the landmark, sliding, and time fading window models) [32], [69]. In [70], the high-utility stream tree (HUS-tree) and HUPMS algorithm (high-utility pattern mining over stream data) are proposed for incremental and interactive UPM over data streams with a sliding window.

4.9 UPM with Various Interesting Constraints
Up to now, most of the algorithms for UPM have been developed to improve the efficiency of the mining process, while effectiveness of the algorithms for UPM is also very important, because it is related to its usefulness for various data, constraints, and applications. Researchers in the field of utility-oriented pattern mining have proposed many algorithms and models to extend effectiveness. Many constraint-based UPM algorithms have been extensively developed for various problems, targeting a wide range of applications. For example, mining high-utility patterns with products’ on-shelf time period [85], [138], mining the up-to-date HUIs that reflect recent trends [139], [140], mining discriminative high-utility patterns [75], [141], mining top-k high-utility patterns without setting the minimum utility threshold [68], [142], UPM with multiple minimum utility thresholds [143], utility-based association rule mining [39], [40], UPM with consideration of various discount strategies [61], UPM by considering negative utility values [144], [145], UPM from uncertain data [73], [146], and extracting non-redundant correlated HUIs [62], [147]. Obviously, UPM with various interesting constraints is an active research topic.

4.10 Privacy Preserving for UPM
Since more useful information is in the expected utility-based patterns than in that of the frequent itemsets or sequences, privacy preserving for high-utility pattern mining (PPUM) is more realistic and critical than privacy-preserving data mining (PPDM) [148], [149], [150], [151]. Some preliminary studies have been done on this issue. Yeh et al. first designed two models, named Hiding High-Utility Itemset First (HHUIF) and Maximum Sensitive Itemsets Conflict First (MSICF), to hide sensitive HUIs in PPUM [152]. The main task of PPUM is to hide the sensitive high-utility itemsets (SHUIs). Lin et al. first developed a genetic-algorithm-based method to hide the user-specified SHUIs by inserting the dummy transactions into the original databases [153]. Yun et al. then developed a tree-based algorithm called the Fast Perturbation algorithm Using a Tree structure and Tables (FPUTT) for hiding SHUIs [154]. Then, other faster and more efficient algorithms were developed for PPUM, such as [155], [156]. A recent overview of PPUM has been reported by Gan et al. [157].

5 OPEN-SOURCE SOFTWARE AND DATASETS
5.1 Open-Source Software
Although the problem of UPM has been studied for more than 15 years, and the advanced topic of utility pattern mining also has been extended to many research fields, few implementations or source code of these algorithms have been released. This raises some barriers to other researchers in that they need to re-implement algorithms to use them or compare their performance with that of novel proposed algorithms. To make matters worse, this may introduce unfairness in running experimental comparisons, since the performance of pattern mining algorithms may commonly depend on the compiler and machine architecture used. We now list some open-source software specialized for UPM.
UP-Miner. Tseng et al. proposed a first-of-its-kind utility mining toolbox named Utility Pattern Miner (UP-Miner) [158]. UP-Miner provides various models for utility-oriented pattern mining. The main merits of UP-Miner have three aspects. First, to the best of our knowledge, it is the first-of-its-kind cross-platform utility mining system. Second, it provides complete Java implementations of 13 algorithms for discovering different types of utility-oriented patterns, such as high-utility itemset (HUI), high-utility
sequential rule (HUSR), high-utility sequential pattern (HUSP), and high-utility episode (HUE), as well as the concise representations of utility patterns. In addition, it offers four functionalities for processing utility-based databases. Third, the toolbox and relevant materials, including source codes, demo paper, benchmark datasets, and data generators, have been made public on the project website³ for the benefit of the research community.
SPMF. As a well-known open-source data mining library, SPMF [159] offers implementations of many algorithms and has been cited in more than 700 research papers since 2010. SPMF is written in Java, and provides implementations of 170 data mining algorithms, specializing in pattern mining. SPMF has the largest collection of implementations of various pattern mining algorithms (e.g., FPM, ARM, SPM) and provides a user-friendly graphical interface.⁴ In particular, it also provides the relevant materials, including source codes, documentation, user instruction, benchmark datasets, data generators, and academic papers. SPMF offers up to 30 algorithms for utility-oriented pattern mining, such as Two-Phase, UP-Growth, UP-Growth+, HUI-Miner, EFIM, USpan, HUSP-ULL, and many other state-of-the-art algorithms. More specifically, SPMF is distributed under the GPL v3 license and is suitable for both academic and industrial purposes.

5.2 Datasets for UPM
Several datasets are commonly used in the studies of UPM. All of them have been released at websites, such as SPMF [159] and UP-Miner [158].
Real Datasets. foodmart: it is provided by Microsoft and contains 21,556 customer transactions and 1,559 distinct items from an anonymous chain store. It contains the quantity and a unit profit of each item. The yoochoose-buys commercial dataset was constructed in the RecSys Challenge 2015.⁵ It contains a collection of 1,150,753 sessions from a retailer, where each session encapsulates the click events. The total number of item IDs and category IDs is 54,287 and 347, respectively, with an interval of 6 months. UK-online⁶: it contains 541,909 transactions, which occurred between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The original data contains the real timestamp and many noise values. It has attributes such as InvoiceNo, StockCode, Quantity, InvoiceDate, UnitPrice, CustomerID, etc.
Semi-Authentic Datasets. They are the real datasets⁷ (e.g., chess, retail, kosarak, mushroom, accidents, BMSPOS2) with synthetic utility values. The internal utility values are generated using a uniform distribution in [1, 10]. The external utility values are generated using a Gaussian (normal) distribution. Detailed descriptions and characteristics of these real datasets can be found in SPMF [159], UP-Miner [158], or existing UPM literature.
Synthetic Datasets. There are some synthetic itemset-based or sequence-based datasets generated by IBM Quest Dataset Generator [160], which have been commonly used in UPM. For example, the itemset-based synthetic T10I4D100K, T40I10D100K; and the sequence synthetic C8S6T4I3D|X|K are described in [124].

3. http://bigdatalab.cs.nctu.edu.tw/software.php
4. http://www.philippe-fournier-viger.com/spmf/index.php
5. https://recsys.acm.org/recsys15/challenge/
6. http://archive.ics.uci.edu/ml/datasets/Online+Retail/
7. http://fimi.ua.ac.be/data/

6 OPEN CHALLENGES AND OPPORTUNITIES
Here, we discuss important open problems that have the potential to become future research areas in utility-oriented pattern mining. Owing to the rapid growth of the volume of data stored in databases, we have entered the era of Big Data. While analyzing utility-oriented patterns, we have identified numerous technical challenges and opportunities for UPM. We next highlight some important research opportunities, which are common to many, and sometimes all, UPM algorithms.
Application-Driven Algorithms. Up to now, most of the algorithms for UPM have been developed to improve the efficiency of the mining process. The effectiveness of the algorithms for UPM is also very important, because it is related to the usefulness on various data, constraints, and applications. In general, as described in Section 2.3, the application-driven algorithms with many particular features of utility patterns reflect real-life problems of different applications in various fields. How to propose a specialized UPM model for different applications (e.g., business, web intelligence, risk prediction, smart city, financial analysis, Internet of Things, biomedicine, smart transportation) and experimentally show its effectiveness is necessary and challenging. Moreover, the incorporation of domain knowledge [161] has a higher influence on performance for some data mining methods. Utility mining guided by domain knowledge thus provides many opportunities.
Developing More Efficient Algorithms. Traditionally, most pattern mining algorithms, especially UPM algorithms, are computationally expensive in terms of execution time and memory cost. This may be a serious problem for dense databases or databases containing numerous items/sequences or long transactions, depending on the minimum utility threshold chosen by the user. Although current UPM algorithms (e.g., HUI-Miner [86], EFIM [117], and mHUIMiner [92]) are much more efficient than previous Apriori-based and tree-based algorithms, there is still room for improvement. 1) It is important to reduce the search space, and this requires designing novel pruning strategies that rely on upper-bounds on the utility measure that are tighter than current measures. 2) Moreover, we can design novel data structures to more quickly calculate the utility and upper-bounds, and integrate constraints in the mining process to reduce the search space. 3) Fast approximate algorithms [130] that guarantee a maximum error can also be developed.
Unified Framework for UPM. Many variations of utility mining have been proposed to deal with various types of data and to solve different problems. The current paradigm used to solve a utility-oriented pattern mining problem is to first define the utility-based patterns of interest and their properties, and then develop an algorithm that can exploit the properties of the utility (e.g., upper bound) to efficiently mine them. Hence, this laborious process can be avoided if the following problem is solved: “Is there a paradigm such that existing and new definitions of
utility-based pattern (HUI [77], HUSP [42], [123], HUE [43]) can be solved by a unifying algorithm?” Owing to these challenges, the utility-oriented pattern mining problem, in its most general form, is not easy to solve. In fact, most of the existing utility mining techniques (e.g., HUIM [77], HUSPM [42], [123], HUEM [43], etc.) solve a specific formulation of a specific problem. Therefore, how to formalize utility mining tasks in a generic framework is crucial and challenging. Focusing on general principles and modeling of UPM rather than specific implementations is more important and challenging.
Deal with Complex Data. The amount of complex data has exploded during the past two decades, while most of the data mining and analysis approaches are not utility oriented. Many current techniques of UPM are not suited to dealing with various types of complex data, such as “structured data”⁸ (i.e., pattern mining), “unstructured data”⁹ (including documents, health records, audio, video, images, etc.), and “semi-structured data”¹⁰ (i.e., XML, JSON), and most of these are heterogeneous data. More specifically, the dynamic data [47], the uncertain data [8], [73], the high-dimensional datasets of moderate size, or the very large datasets of moderate complexity in real-life applications are commonly seen in different domains and applications. Bridging this gap requires the solution of fundamentally new research problems, which can be grouped into the following challenges: 1) how to define the utility function integrating with various rich features on complex data; 2) how to achieve utility maximization for the goal and mining task; and 3) how to develop new frameworks and algorithms to deal with new types of data. A need therefore arises for a better framework that extends the existing data mining methodologies, techniques, and tools, guided by utility and knowledge.
Large-Scale Data. Efficiently mining large-scale databases may result in a high computational cost and memory consumption. Under the batch model, traditional UPM algorithms must be repeatedly applied to obtain updated results when new data are inserted [47]. However, in the Big Data era, incrementally or dynamically processing data [47] and taking into account the results of prior analysis is crucial. There are some challenging research opportunities of UPM for handling large-scale data (as described in Section 4.7): how to design the parallelized UPM algorithms and how to develop the UPM algorithms based on the existing technologies of Big Data (i.e., MapReduce [132] and Spark [133]). Some other promising areas of research are the design of distributed, parallel, multi-core, or graphical-processing-unit (GPU)-based algorithms [45], [162] for UPM. There are some open challenges and opportunities to improve the scalability of utility mining tasks from resource-constrained devices to collaborative and hybrid execution models.
Scalable Real-Time Pattern Mining. There exist many interactive approaches for interactive data mining, but few have been extended to address the challenge of utility, and it is not trivial to adapt them. One of the most important future challenges is to develop scalable high-utility pattern online mining approaches for streaming data from electronic commerce. Specifically, research should focus on algorithms that are sub-linear to the input or, at the very least, linear. Other computational challenges, such as the demands of the results being returned in real- or near-real-time, are open issues in the data mining community. For example, real-time mining with optimization requires a new formalism and solving techniques. As mentioned before, the increasing quantity and complexity of data demands scalable solutions. Using the existing computational infrastructures for real-time utility-oriented mining of massive datasets may be a feasible way.

8. https://en.wikipedia.org/wiki/Structure_mining
9. https://en.wikipedia.org/wiki/Unstructured_data
10. https://en.wikipedia.org/wiki/Semi-structured_data

7 CONCLUSIONS
The term utility is commonly used to mean “the quality of being useful,” and utilities are widely used in data-mining and decision-making processes to extract different useful kinds of knowledge. Utilities are subjective and can be acquired from domain experts/users. Utility mining in data is a vital task, with numerous high-impact applications, including cross-marketing, e-commerce, finance, medical, and biomedical applications. Up to now, many techniques and approaches have been extensively proposed for the task of UPM. In this survey, we have provided a comprehensive review of utility-oriented pattern mining, both in terms of current status and future directions. This survey describes various problems associated with mining utility-based patterns and methods for addressing these problems, including 1) high-utility itemset mining (HUIM), 2) high-utility association rule mining (HUARM), 3) high-utility sequential pattern mining (HUSPM), 4) high-utility sequential rule mining (HUSRM), and 5) high-utility episode mining (HUEM). Overall, we have not only reviewed the most common, as well as the state-of-the-art, approaches for UPM but have also provided a comprehensive review of advanced UPM topics. Finally, we have identified several important issues and research opportunities for UPM.

ACKNOWLEDGMENTS
This research was partially supported by the Shenzhen Technical Project under No. KQJSCX 20170726103424709 and No. JCYJ 20170307151733005, and by the China Scholarship Council Program. The authors would like to thank the editors and anonymous reviewers for their detailed comments and constructive suggestions which have improved the quality of this paper.

REFERENCES
[1] M. S. Chen, J. Han, and P. S. Yu, “Data mining: An overview from a database perspective,” IEEE Trans. Knowl. Data Eng., vol. 8, no. 6, pp. 866–883, Dec. 1996.
[2] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and Techniques. Amsterdam, The Netherlands: Elsevier, 2011.
[3] Y. S. Koh and S. D. Ravana, “Unsupervised rare pattern mining: A survey,” ACM Trans. Knowl. Discovery Data, vol. 10, no. 4, 2016, Art. no. 45.
[4] P. Fournier-Viger, J. C. W. Lin, B. Vo, T. T. Chi, J. Zhang, and H. B. Le, “A survey of itemset mining,” Wiley Interdisciplinary Rev.: Data Mining Knowl. Discovery, vol. 7, no. 4, 2017, Art. no. e1207.
[5] P. Fournier-Viger, J. C. W. Lin, R. U. Kiran, Y. S. Koh, and R. Thomas, “A survey of sequential pattern mining,” Data Sci. Pattern Recognit., vol. 1, no. 1, pp. 54–77, 2017.
[6] C. W. Tsai, C. F. Lai, M. C. Chiang, L. T. Yang, et al., “Data mining for internet of things: A survey,” IEEE Commun. Surveys Tuts., vol. 16, no. 1, pp. 77–97, Jan.–Mar. 2014.
1324 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021
[7] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining frequent patterns without candidate generation: A frequent-pattern tree approach,” Data Mining Knowl. Discovery, vol. 8, no. 1, pp. 53–87, 2004.
[8] C. C. Aggarwal, Y. Li, J. Wang, and J. Wang, “Frequent pattern mining with uncertain data,” in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2009, pp. 29–38.
[9] R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules between sets of items in large databases,” ACM SIGMOD Rec., vol. 22, no. 2, pp. 207–216, 1993.
[10] R. Agrawal, R. Srikant, et al., “Fast algorithms for mining association rules,” in Proc. 20th Int. Conf. Very Large Data Bases, 1994, pp. 487–499.
[11] H. Mannila, H. Toivonen, and A. I. Verkamo, “Discovery of frequent episodes in event sequences,” Data Mining Knowl. Discovery, vol. 1, no. 3, pp. 259–289, 1997.
[12] K. Y. Huang and C. H. Chang, “Efficient mining of frequent episodes from complex sequences,” Inf. Syst., vol. 33, no. 1, pp. 96–114, 2008.
[13] A. Achar, S. Laxman, and P. Sastry, “A unified view of the Apriori-based algorithms for frequent episode discovery,” Knowl. Inf. Syst., vol. 31, no. 2, pp. 223–250, 2012.
[14] A. Achar, A. Ibrahim, and P. Sastry, “Pattern-growth based frequent serial episode discovery,” Data Knowl. Eng., vol. 87, pp. 91–108, 2013.
[15] R. Agrawal and R. Srikant, “Mining sequential patterns,” in Proc. 11th Int. Conf. Data Eng., 1995, pp. 3–14.
[16] J. Han, J. Pei, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu, “PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth,” in Proc. 17th Int. Conf. Data Eng., 2001, pp. 215–224.
[17] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and P. S. Yu, “A survey of parallel sequential pattern mining,” ACM Trans. Knowl. Discovery Data, vol. 13, no. 3, 2019, Art. no. 25.
[18] L. Geng and H. J. Hamilton, “Interestingness measures for data mining: A survey,” ACM Comput. Surveys, vol. 38, no. 3, 2006, Art. no. 9.
[19] J. Pei, J. Han, and L. V. Lakshmanan, “Mining frequent itemsets with convertible constraints,” in Proc. 17th Int. Conf. Data Eng., 2001, pp. 433–442.
[20] P. N. Tan, V. Kumar, and J. Srivastava, “Selecting the right objective measure for association analysis,” Inf. Syst., vol. 29, no. 4, pp. 293–313, 2004.
[21] K. McGarry, “A survey of interestingness measures for knowledge discovery,” Knowl. Eng. Rev., vol. 20, no. 1, pp. 39–61, 2005.
[22] Y. D. Shen, Z. Zhang, and Q. Yang, “Objective-oriented utility-based association mining,” in Proc. IEEE Int. Conf. Data Mining, 2002, pp. 426–433.
[23] R. Chan, Q. Yang, and Y. D. Shen, “Mining high utility itemsets,” in Proc. 3rd IEEE Int. Conf. Data Mining, 2003, pp. 19–26.
[24] H. Yao, H. J. Hamilton, and L. Geng, “A unified framework for utility-based measures for mining itemsets,” in Proc. ACM SIGKDD 2nd Workshop Utility-Based Data Mining, 2006, pp. 28–37.
[25] A. Marshall, Principles of Economics, 8th ed. London, U.K.: Macmillan, 1926.
[26] R. J. Hilderman and H. J. Hamilton, “Measuring the interestingness of discovered knowledge: A principled approach,” Intell. Data Anal., vol. 7, no. 4, pp. 347–382, 2003.
[27] A. Silberschatz and A. Tuzhilin, “On subjective measures of interestingness in knowledge discovery,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 1995, pp. 275–281.
[28] T. De Bie, “Maximum entropy models and subjective interestingness: An application to tiles in binary databases,” Data Mining Knowl. Discovery, vol. 23, no. 3, pp. 407–446, 2011.
[29] H. Yao and H. J. Hamilton, “Mining itemset utilities from transaction databases,” Data Knowl. Eng., vol. 59, no. 3, pp. 603–626, 2006.
[30] H. F. Li, H. Y. Huang, Y. C. Chen, Y. J. Liu, and S. Y. Lee, “Fast and memory efficient mining of high utility itemsets in data streams,” in Proc. 8th IEEE Int. Conf. Data Mining, 2008, pp. 881–886.
[31] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and Y. K. Lee, “Efficient tree structures for high utility pattern mining in incremental databases,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 12, pp. 1708–1721, Dec. 2009.
[32] B. E. Shie, V. S. Tseng, and P. S. Yu, “Online mining of temporal maximal utility itemsets from data streams,” in Proc. ACM Symp. Appl. Comput., 2010, pp. 1622–1626.
[33] A. Erwin, R. P. Gopalan, and N. Achuthan, “Efficient mining of high utility itemsets from large datasets,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2008, pp. 554–561.
[34] Y. C. Li, J. S. Yeh, and C. C. Chang, “Isolated items discarding strategy for discovering high utility itemsets,” Data Knowl. Eng., vol. 64, no. 1, pp. 198–217, 2008.
[35] B. E. Shie, H. F. Hsiao, V. S. Tseng, and P. S. Yu, “Mining high utility mobile sequential patterns in mobile commerce environments,” in Proc. Int. Conf. Database Syst. Adv. Appl., 2011, pp. 224–238.
[36] B. E. Shie, H. F. Hsiao, and V. S. Tseng, “Efficient algorithms for discovering high utility user behavior patterns in mobile commerce environments,” Knowl. Inf. Syst., vol. 37, no. 2, pp. 363–387, 2013.
[37] M. Zihayat, H. Davoudi, and A. An, “Mining significant high utility gene regulation sequential patterns,” BMC Syst. Biol., vol. 11, no. 6, 2017, Art. no. 109.
[38] S. J. Yen and Y. S. Lee, “Mining high utility quantitative association rules,” in Proc. Int. Conf. Data Warehousing Knowl. Discovery, 2007, pp. 283–292.
[39] D. Lee, S. H. Park, and S. Moon, “Utility-based association rule mining: A marketing solution for cross-selling,” Expert Syst. Appl., vol. 40, no. 7, pp. 2715–2725, 2013.
[40] J. Sahoo, A. K. Das, and A. Goswami, “An efficient approach for mining association rules from high utility itemsets,” Expert Syst. Appl., vol. 42, no. 13, pp. 5754–5778, 2015.
[41] C. F. Ahmed, S. K. Tanbeer, and B. S. Jeong, “A novel approach for mining high-utility sequential patterns in sequence databases,” ETRI J., vol. 32, no. 5, pp. 676–686, 2010.
[42] J. Yin, Z. Zheng, and L. Cao, “USpan: An efficient algorithm for mining high utility sequential patterns,” in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012, pp. 660–668.
[43] C. W. Wu, Y. F. Lin, P. S. Yu, and V. S. Tseng, “Mining high utility episodes in complex event sequences,” in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2013, pp. 536–544.
[44] Y. F. Lin, C. W. Wu, C. F. Huang, and V. S. Tseng, “Discovering utility-based episode rules in complex event sequences,” Expert Syst. Appl., vol. 42, no. 12, pp. 5303–5314, 2015.
[45] W. Gan, J. C. W. Lin, H. C. Chao, and J. Zhan, “Data mining in distributed environment: A survey,” Wiley Interdisciplinary Rev.: Data Mining Knowl. Discovery, vol. 7, no. 6, 2017, Art. no. e1216.
[46] B. Nath, D. Bhattacharyya, and A. Ghosh, “Incremental association rule mining: A survey,” Wiley Interdisciplinary Rev.: Data Mining Knowl. Discovery, vol. 3, no. 3, pp. 157–169, 2013.
[47] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, T. P. Hong, and H. Fujita, “A survey of incremental high-utility itemset mining,” Wiley Interdisciplinary Rev.: Data Mining Knowl. Discovery, vol. 8, no. 2, 2018, Art. no. e1242.
[48] V. S. Tseng, B. E. Shie, C. W. Wu, and P. S. Yu, “Efficient algorithms for mining high utility itemsets from transactional databases,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 8, pp. 1772–1786, Aug. 2013.
[49] S. Zida, P. Fournier-Viger, C. W. Wu, J. C. W. Lin, and V. S. Tseng, “Efficient mining of high-utility sequential rules,” in Proc. Int. Workshop Mach. Learn. Data Mining Pattern Recognit., 2015, pp. 157–171.
[50] P. Fournier-Viger, U. Faghihi, R. Nkambou, and E. M. Nguifo, “CMRules: Mining sequential rules common to several sequences,” Knowl.-Based Syst., vol. 25, no. 1, pp. 63–76, 2012.
[51] C. Jiang, F. Coenen, and M. Zito, “A survey of frequent subgraph mining algorithms,” Knowl. Eng. Rev., vol. 28, no. 1, pp. 75–105, 2013.
[52] H. Yao, H. J. Hamilton, and C. J. Butz, “A foundational approach to mining itemset utilities from databases,” in Proc. SIAM Int. Conf. Data Mining, 2004, pp. 482–486.
[53] C. H. Cai, A. W. C. Fu, C. Cheng, and W. Kwong, “Mining association rules with weighted items,” in Proc. Int. Database Eng. Appl. Symp., 1998, pp. 68–77.
[54] W. Wang, J. Yang, and P. S. Yu, “Efficient mining of weighted association rules (WAR),” in Proc. 6th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2000, pp. 270–274.
[55] F. Tao, F. Murtagh, and M. Farid, “Weighted association rule mining using weighted support and significance framework,” in Proc. 9th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2003, pp. 661–666.
[56] K. Sun and F. Bai, “Mining weighted association rules without preassigned weights,” IEEE Trans. Knowl. Data Eng., vol. 20, no. 4, pp. 489–495, Apr. 2008.
[57] J. C. W. Lin, W. Gan, P. Fournier-Viger, and T. P. Hong, “RWFIM: Recent weighted-frequent itemsets mining,” Eng. Appl. Artif. Intell., vol. 45, pp. 18–32, 2015.
[58] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and V. S. Tseng, “Weighted frequent itemset mining over uncertain databases,” Appl. Intell., vol. 44, no. 1, pp. 232–250, 2016.
GAN ET AL.: A SURVEY OF UTILITY-ORIENTED PATTERN MINING 1325
[59] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, J. M. T. Wu, and J. Zhan, “Extracting recent weighted-based patterns from uncertain temporal databases,” Eng. Appl. Artif. Intell., vol. 61, pp. 161–172, 2017.
[60] Y. C. Li, J. S. Yeh, and C. C. Chang, “Direct candidates generation: A novel algorithm for discovering complete share-frequent itemsets,” in Proc. Int. Conf. Fuzzy Syst. Knowl. Discovery, 2005, pp. 551–560.
[61] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and V. S. Tseng, “Fast algorithms for mining high-utility itemsets with various discount strategies,” Adv. Eng. Inform., vol. 30, no. 2, pp. 109–126, 2016.
[62] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and H. Fujita, “Extracting non-redundant correlated purchase behaviors by utility measure,” Knowl.-Based Syst., vol. 143, pp. 30–41, 2018.
[63] C. F. Ahmed, S. K. Tanbeer, and B. S. Jeong, “A framework for mining high utility web access sequences,” IETE Tech. Rev., vol. 28, no. 1, pp. 3–16, 2011.
[64] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and Y. K. Lee, “Efficient mining of utility-based web path traversal patterns,” in Proc. 11th Int. Conf. Adv. Commun. Technol., 2009, pp. 2215–2218.
[65] L. Atzori, A. Iera, and G. Morabito, “The internet of things: A survey,” Comput. Netw., vol. 54, no. 15, pp. 2787–2805, 2010.
[66] L. Golab and M. T. Özsu, “Issues in data stream management,” ACM SIGMOD Rec., vol. 32, no. 2, pp. 5–14, 2003.
[67] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz, “Moment: Maintaining closed frequent itemsets over a stream sliding window,” in Proc. 4th IEEE Int. Conf. Data Mining, 2004, pp. 59–66.
[68] M. Zihayat and A. An, “Mining top-k high utility patterns over data streams,” Inf. Sci., vol. 285, pp. 138–161, 2014.
[69] B. E. Shie, P. S. Yu, and V. S. Tseng, “Efficient algorithms for mining maximal high utility itemsets from data streams with different models,” Expert Syst. Appl., vol. 39, no. 17, pp. 12 947–12 960, 2012.
[70] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and H. J. Choi, “Interactive mining of high utility patterns over data streams,” Expert Syst. Appl., vol. 39, no. 15, pp. 11 979–11 991, 2012.
[71] Y. C. Liu, C. P. Cheng, and V. S. Tseng, “Mining differential top-k co-expression patterns from time course comparative gene expression datasets,” BMC Bioinf., vol. 14, no. 1, 2013, Art. no. 230.
[72] T. P. Hong, C. H. Lee, and S. L. Wang, “Effective utility mining with the measure of average utility,” Expert Syst. Appl., vol. 38, no. 7, pp. 8259–8265, 2011.
[73] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and V. S. Tseng, “Efficient algorithms for mining high-utility itemsets in uncertain databases,” Knowl.-Based Syst., vol. 96, pp. 171–187, 2016.
[74] C. K. Chui, B. Kao, and E. Hung, “Mining frequent itemsets from uncertain data,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2007, pp. 47–58.
[75] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and H. C. Chao, “FDHUP: Fast algorithm for mining discriminative high utility patterns,” Knowl. Inf. Syst., vol. 51, no. 3, pp. 873–909, 2017.
[76] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and P. S. Yu, “HUOPM: High-utility occupancy pattern mining,” IEEE Trans. Cybern., early access, Feb. 20, 2019, doi: 10.1109/TCYB.2019.2896267.
[77] Y. Liu, W. K. Liao, and A. Choudhary, “A two-phase algorithm for fast discovery of high utility itemsets,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2005, pp. 689–695.
[78] J. Hu and A. Mojsilovic, “High-utility pattern mining: A method for discovery of high-utility item sets,” Pattern Recognit., vol. 40, no. 11, pp. 3317–3324, 2007.
[79] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and Y. K. Lee, “An efficient candidate pruning technique for high utility pattern mining,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2009, pp. 749–756.
[80] V. S. Tseng, C. W. Wu, B. E. Shie, and P. S. Yu, “UP-Growth: An efficient algorithm for high utility itemset mining,” in Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, pp. 253–262.
[81] W. Song, Y. Liu, and J. Li, “Mining high utility itemsets by dynamically pruning the tree structure,” Appl. Intell., vol. 40, no. 1, pp. 29–43, 2014.
[82] H. Ryang, U. Yun, and K. H. Ryu, “Fast algorithm for high utility pattern mining with the sum of item quantities,” Intell. Data Anal., vol. 20, no. 2, pp. 395–415, 2016.
[83] A. Erwin, R. P. Gopalan, and N. Achuthan, “CTU-Mine: An efficient high utility itemset mining algorithm using the pattern growth approach,” in Proc. 7th IEEE Int. Conf. Comput. Inf. Technol., 2007, pp. 71–76.
[84] C. W. Lin, T. P. Hong, and W. H. Lu, “An effective tree structure for mining high utility itemsets,” Expert Syst. Appl., vol. 38, no. 6, pp. 7419–7424, 2011.
[85] G. C. Lan, “A study on efficient algorithms for on-shelf utility mining,” PhD Thesis, National Cheng Kung Univ., pp. 1–154, 2012.
[86] M. Liu and J. Qu, “Mining high utility itemsets without candidate generation,” in Proc. 21st ACM Int. Conf. Inf. Knowl. Manage., 2012, pp. 55–64.
[87] P. Fournier-Viger, C. W. Wu, S. Zida, and V. S. Tseng, “FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning,” in Proc. Int. Symp. Methodologies Intell. Syst., 2014, pp. 83–92.
[88] J. Liu, K. Wang, and B. C. Fung, “Direct discovery of high utility itemsets without candidate generation,” in Proc. IEEE 12th Int. Conf. Data Mining, 2012, pp. 984–989.
[89] S. Krishnamoorthy, “Pruning strategies for mining high utility itemsets,” Expert Syst. Appl., vol. 42, no. 5, pp. 2371–2381, 2015.
[90] S. Zida, P. Fournier-Viger, J. C. W. Lin, C. W. Wu, and V. S. Tseng, “EFIM: A highly efficient algorithm for high-utility itemset mining,” in Proc. Mexican Int. Conf. Artif. Intell., 2015, pp. 530–546.
[91] H. Ryang and U. Yun, “Indexed list-based high utility pattern mining with utility upper-bound reduction and pattern combination techniques,” Knowl. Inf. Syst., vol. 51, no. 2, pp. 627–659, 2017.
[92] A. Y. Peng, Y. S. Koh, and P. Riddle, “mHUIMiner: A fast high utility itemset mining algorithm for sparse datasets,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2017, pp. 196–207.
[93] R. Rymon, “Search through systematic set enumeration,” in Proc. 3rd Int. Conf. Principles Knowl. Representation Reasoning, 1992, pp. 539–550.
[94] W. Song, Z. Zhang, and J. Li, “A high utility itemset mining algorithm based on subsume index,” Knowl. Inf. Syst., vol. 49, no. 1, pp. 315–340, 2016.
[95] W. Song, B. Yang, and Z. Xu, “Index-BitTableFI: An improved algorithm for mining frequent itemsets,” Knowl.-Based Syst., vol. 21, no. 6, pp. 507–513, 2008.
[96] G. C. Lan, T. P. Hong, and V. S. Tseng, “Efficiently mining high average-utility itemsets with an improved upper-bound strategy,” Int. J. Inf. Technol. Decision Making, vol. 11, no. 05, pp. 1009–1030, 2012.
[97] J. C. W. Lin, T. Li, P. Fournier-Viger, T. P. Hong, J. Zhan, and M. Voznak, “An efficient algorithm to mine high average-utility itemsets,” Adv. Eng. Inform., vol. 30, no. 2, pp. 233–243, 2016.
[98] J. C. W. Lin, S. Ren, P. Fournier-Viger, and T. P. Hong, “EHAUPM: Efficient high average-utility pattern mining with tighter upper bounds,” IEEE Access, vol. 5, pp. 12 927–12 940, 2017.
[99] U. Yun, D. Kim, E. Yoon, and H. Fujita, “Damped window based high average utility pattern mining over data streams,” Knowl.-Based Syst., vol. 44, pp. 188–205, 2017.
[100] D. W. Cheung, J. Han, V. T. Ng, and C. Wong, “Maintenance of discovered association rules in large databases: An incremental updating technique,” in Proc. 12th Int. Conf. Data Eng., 1996, pp. 106–114.
[101] T. P. Hong, C. Y. Wang, and Y. H. Tao, “A new incremental data mining algorithm using pre-large itemsets,” Intell. Data Anal., vol. 5, no. 2, pp. 111–129, 2001.
[102] C. W. Lin, G. C. Lan, and T. P. Hong, “An incremental mining algorithm for high utility itemsets,” Expert Syst. Appl., vol. 39, no. 8, pp. 7173–7180, 2012.
[103] C. W. Lin, T. P. Hong, G. C. Lan, J. W. Wong, and W. Y. Lin, “Incrementally mining high utility patterns based on pre-large concept,” Appl. Intell., vol. 40, no. 2, pp. 343–357, 2014.
[104] J. C. W. Lin, W. Gan, T. P. Hong, and B. Zhang, “An incremental high-utility mining algorithm with transaction insertion,” Sci. World J., vol. 2015, 2015, Art. no. 161564.
[105] P. Fournier-Viger, J. C. W. Lin, T. Gueniche, and P. Barhate, “Efficient incremental high utility itemset mining,” in Proc. ASE BigData Social Inform., 2015, Art. no. 53.
[106] D. W. Cheung, S. D. Lee, and B. Kao, “A general incremental technique for maintaining discovered association rules,” in Proc. 5th Int. Conf. Database Syst. Adv. Appl., 1997, pp. 185–194.
[107] C. W. Lin, G. C. Lan, and T. P. Hong, “Mining high utility itemsets for transaction deletion in a dynamic database,” Intell. Data Anal., vol. 19, no. 1, pp. 43–55, 2015.
[108] C. W. Lin, T. P. Hong, G. C. Lan, J. W. Wong, and W. Y. Lin, “Efficient updating of discovered high-utility itemsets for transaction deletion in dynamic databases,” Adv. Eng. Inform., vol. 29, no. 1, pp. 16–27, 2015.
[109] J. C. W. Lin, W. Gan, and T. P. Hong, “A fast maintenance algorithm of the discovered high-utility itemsets with transaction deletion,” Intell. Data Anal., vol. 20, no. 4, pp. 891–913, 2016.
[110] C. W. Lin, B. Zhang, W. Gan, B. W. Chen, S. Rho, and T. P. Hong, “Updating high-utility pattern trees with transaction modification,” Multimedia Tools Appl., vol. 75, no. 9, pp. 4887–4912, 2016.
[111] J. C. W. Lin, W. Gan, and T. P. Hong, “A fast updated algorithm to maintain the discovered high-utility itemsets for transaction modification,” Adv. Eng. Inform., vol. 29, no. 3, pp. 562–574, 2015.
[112] J. F. Boulicaut, A. Bykowski, and C. Rigotti, “Free-sets: A condensed representation of boolean data for the approximation of frequency queries,” Data Mining Knowl. Discovery, vol. 7, no. 1, pp. 5–22, 2003.
[113] T. Calders and B. Goethals, “Mining all non-derivable frequent itemsets,” in Proc. Eur. Conf. Principles Data Mining Knowl. Discovery, 2002, pp. 74–86.
[114] K. Gouda and M. J. Zaki, “Efficiently mining maximal frequent itemsets,” in Proc. IEEE Int. Conf. Data Mining, 2001, pp. 163–170.
[115] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Efficient mining of association rules using closed itemset lattices,” Inf. Syst., vol. 24, no. 1, pp. 25–46, 1999.
[116] V. S. Tseng, C. W. Wu, P. Fournier-Viger, and P. S. Yu, “Efficient algorithms for mining the concise and lossless representation of high utility itemsets,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 3, pp. 726–739, Mar. 2015.
[117] P. Fournier-Viger, S. Zida, J. C. W. Lin, C. W. Wu, and V. S. Tseng, “EFIM-Closed: Fast and memory efficient discovery of closed high-utility itemsets,” in Proc. Int. Conf. Mach. Learn. Data Mining Pattern Recognit., 2016, pp. 199–213.
[118] C. W. Wu, P. Fournier-Viger, J. Y. Gu, and V. S. Tseng, “Mining closed+ high utility itemsets without candidate generation,” in Proc. Conf. Technol. Appl. Artif. Intell., 2015, pp. 187–194.
[119] J. Sahoo, A. K. Das, and A. Goswami, “An efficient fast algorithm for discovering closed+ high utility itemsets,” Appl. Intell., vol. 45, no. 1, pp. 44–74, 2016.
[120] C. H. Li, C. W. Wu, and V. S. Tseng, “Efficient vertical mining of high utility quantitative itemsets,” in Proc. IEEE Int. Conf. Granular Comput., 2014, pp. 155–160.
[121] G. C. Lan, T. P. Hong, V. S. Tseng, and S. L. Wang, “Applying the maximum utility measure in high utility sequential pattern mining,” Expert Syst. Appl., vol. 41, no. 11, pp. 5071–5081, 2014.
[122] O. K. Alkan and P. Karagoz, “CRoM and HuspExt: Improving efficiency of high utility sequential pattern extraction,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 10, pp. 2645–2657, Oct. 2015.
[123] W. Gan, J. C. W. Lin, J. Zhang, H. C. Chao, H. Fujita, and P. S. Yu, “ProUM: Projection-based utility mining on sequence data,” arXiv:1904.07764, 2019.
[124] W. Gan, J. C. W. Lin, J. Zhang, P. Fournier-Viger, H. C. Chao, and P. S. Yu, “Fast utility mining on sequence data,” arXiv:1904.12248, 2019.
[125] J. Z. Wang, J. L. Huang, and Y. C. Chen, “On efficiently mining high utility sequential patterns,” Knowl. Inf. Syst., vol. 49, no. 2, pp. 597–627, 2016.
[126] G. Guo, L. Zhang, Q. Liu, E. Chen, F. Zhu, and C. Guan, “High utility episode mining made practical and fast,” in Proc. Int. Conf. Adv. Data Mining Appl., 2014, pp. 71–84.
[127] Y. F. Lin, C. F. Huang, and V. S. Tseng, “A novel methodology for stock investment using high utility episode mining and genetic algorithm,” Appl. Soft Comput., vol. 59, pp. 303–315, 2017.
[128] S. Rathore, S. Dawar, V. Goyal, and D. Patel, “Top-k high utility episode mining from a complex event sequence,” in Proc. 21st Int. Conf. Manage. Data Comput. Soc. India, 2016, pp. 56–63.
[129] Y. C. Lin, C. W. Wu, and V. S. Tseng, “Mining high utility itemsets in big data,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2015, pp. 649–661.
[130] Y. Chen and A. An, “Approximate parallel high utility itemset mining,” Big Data Res., vol. 6, pp. 26–42, 2016.
[131] M. Zihayat, Z. Z. Hut, A. An, and Y. Hut, “Distributed and parallel high utility sequential pattern mining,” in Proc. IEEE Int. Conf. Big Data, 2016, pp. 853–862.
[132] J. Dean and S. Ghemawat, “MapReduce: A flexible data processing tool,” Commun. ACM, vol. 53, no. 1, pp. 72–77, 2010.
[133] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in Proc. 9th USENIX Conf. Netw. Syst. Des. Implementation, 2012, pp. 2–2.
[134] M. Zihayat, Y. Chen, and A. An, “Memory-adaptive high utility sequential pattern mining over data streams,” Mach. Learn., vol. 106, no. 6, pp. 799–836, 2017.
[135] G. S. Manku and R. Motwani, “Approximate frequency counts over data streams,” in Proc. 28th Int. Conf. Very Large Databases, 2002, pp. 346–357.
[136] H. F. Li, M. K. Shan, and S. Y. Lee, “DSM-FI: An efficient algorithm for mining frequent itemsets in data streams,” Knowl. Inf. Syst., vol. 17, no. 1, pp. 79–97, 2008.
[137] C. J. Chu, V. S. Tseng, and T. Liang, “An efficient algorithm for mining temporal high utility itemsets from data streams,” J. Syst. Softw., vol. 81, no. 7, pp. 1105–1117, 2008.
[138] G. C. Lan, T. P. Hong, and V. S. Tseng, “Discovery of high utility itemsets from on-shelf time periods of products,” Expert Syst. Appl., vol. 38, no. 5, pp. 5851–5857, 2011.
[139] J. C. W. Lin, W. Gan, T. P. Hong, and V. S. Tseng, “Efficient algorithms for mining up-to-date high-utility patterns,” Adv. Eng. Inform., vol. 29, no. 3, pp. 648–661, 2015.
[140] W. Gan, J. C. W. Lin, P. Fournier-Viger, and H. C. Chao, “Mining recent high-utility patterns from temporal databases with time-sensitive constraint,” in Proc. Int. Conf. Big Data Analytics Knowl. Discovery, 2016, pp. 3–18.
[141] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and H. J. Choi, “A framework for mining interesting high utility patterns with a strong frequency affinity,” Inf. Sci., vol. 181, no. 21, pp. 4878–4894, 2011.
[142] C. W. Wu, B. E. Shie, V. S. Tseng, and P. S. Yu, “Mining top-k high utility itemsets,” in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012, pp. 78–86.
[143] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and J. Zhan, “Efficient mining of high-utility itemsets using multiple minimum utility thresholds,” Knowl.-Based Syst., vol. 113, pp. 100–115, 2016.
[144] J. C. W. Lin, P. Fournier-Viger, and W. Gan, “FHN: An efficient algorithm for mining high-utility itemsets with negative unit profits,” Knowl.-Based Syst., vol. 111, pp. 283–298, 2016.
[145] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and V. S. Tseng, “Mining high-utility itemsets with both positive and negative unit profits from uncertain databases,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2017, pp. 434–446.
[146] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and V. S. Tseng, “Efficiently mining uncertain high-utility itemsets,” Soft Comput., vol. 21, no. 11, pp. 2801–2820, 2017.
[147] W. Gan, J. C. W. Lin, H. C. Chao, T. P. Hong, and S. Y. Philip, “CoUPM: Correlated utility-based pattern mining,” in Proc. IEEE Int. Conf. Big Data, 2018, pp. 2607–2616.
[148] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” ACM SIGMOD Rec., vol. 29, no. 2, pp. 439–450, 2000.
[149] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” in Proc. Annu. Int. Cryptology Conf., 2000, pp. 36–54.
[150] C. C. Aggarwal and P. S. Yu, “A general survey of privacy-preserving data mining models and algorithms,” in Privacy-Preserving Data Mining. Berlin, Germany: Springer, 2008, pp. 11–52.
[151] T. Zhu, G. Li, W. Zhou, and P. S. Yu, “Differentially private data publishing and analysis: A survey,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 8, pp. 1619–1638, Aug. 2017.
[152] J. S. Yeh and P. C. Hsu, “HHUIF and MSICF: Novel algorithms for privacy preserving utility mining,” Expert Syst. Appl., vol. 37, no. 7, pp. 4779–4786, 2010.
[153] C. W. Lin, T. P. Hong, J. W. Wong, G. C. Lan, and W. Y. Lin, “A GA-based approach to hide sensitive high utility itemsets,” Sci. World J., vol. 2014, 2014, Art. no. 804629.
[154] U. Yun and J. Kim, “A fast perturbation algorithm using tree structure for privacy preserving utility mining,” Expert Syst. Appl., vol. 42, no. 3, pp. 1149–1165, 2015.
[155] J. C. W. Lin, T. Y. Wu, P. Fournier-Viger, G. Lin, J. Zhan, and M. Voznak, “Fast algorithms for hiding sensitive high-utility itemsets in privacy-preserving utility mining,” Eng. Appl. Artif. Intell., vol. 55, pp. 269–284, 2016.
[156] J. C. W. Lin, T. P. Hong, P. Fournier-Viger, Q. Liu, J. W. Wong, and J. Zhan, “Efficient hiding of confidential high-utility itemsets with minimal side effects,” J. Exp. Theoretical Artif. Intell., vol. 29, no. 6, pp. 1225–1245, 2017.
[157] W. Gan, J. C. W. Lin, H. C. Chao, S. L. Wang, and P. S. Yu, “Privacy preserving utility mining: A survey,” in Proc. IEEE Int. Conf. Big Data, 2018, pp. 2617–2626.
[158] V. S. Tseng, C. W. Wu, J. H. Lin, and P. Fournier-Viger, “UP-Miner: A utility pattern mining toolbox,” in Proc. IEEE Int. Conf. Data Mining Workshop, 2015, pp. 1656–1659.
[159] P. Fournier-Viger, J. C. W. Lin, A. Gomariz, T. Gueniche, A. Soltani, Z. Deng, and H. T. Lam, “The SPMF open-source data mining library version 2,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases, 2016, pp. 36–40.
[160] R. Agrawal and R. Srikant, “Quest synthetic data generator,” 1994. [Online]. Available: http://www.Almaden.ibm.com/cs/quest/syndata.html
[161] L. Cao, “Domain-driven data mining: Challenges and prospects,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 6, pp. 755–769, Jun. 2010.
[162] S. Hong, T. Oguntebi, and K. Olukotun, “Efficient parallel graph exploration on multi-core CPU and GPU,” in Proc. Int. Conf. Parallel Archit. Compilation Techn., 2011, pp. 78–88.

Wensheng Gan received the BS degree in computer science from South China Normal University, Guangdong, China, in 2013. He is working toward the PhD degree in computer science and technology at the Harbin Institute of Technology (Shenzhen), Guangdong, China. He was a joint PhD student with the University of Illinois at Chicago (UIC), from 2017 to 2019. His research interests include data mining, utility computing, and big data analytics. He has published more than 50 research papers in peer-reviewed journals (e.g., the IEEE Transactions on Knowledge and Data Engineering, the ACM Transactions on Knowledge Discovery from Data, the IEEE Transactions on Cybernetics, ACM Transactions on Data Science, Knowledge-Based Systems) and conferences, which have received more than 600 citations.

Jerry Chun-Wei Lin (SM’19) received the PhD degree in computer science and information engineering from the National Cheng Kung University, Tainan, Taiwan, in 2010. He is an associate professor with the Western Norway University of Applied Sciences, Bergen, Norway. His research interests include data mining, big data analytics, machine learning, soft computing, and privacy preservation and security. He has published more than 300 research papers in peer-reviewed international conferences (e.g., IEEE ICDE, IEEE ICDM, PKDD, and PAKDD) and journals (e.g., the IEEE Transactions on Knowledge and Data Engineering, the IEEE Transactions on Cybernetics, the ACM Transactions on Knowledge Discovery from Data, and the ACM Transactions on Data Science). He is the co-leader of the popular SPMF open-source data mining library, the project leader of the PPSF open-source privacy and security library, the editor-in-chief (EiC) of the Data Science and Pattern Recognition (DSPR) journal, and an associate editor of the Journal of Internet Technology and IEEE Access. He is a senior member of the IEEE and ACM.

Han-Chieh Chao (SM’04) received the MS and PhD degrees in electrical engineering from Purdue University, in 1989 and 1993, respectively. He has been the president of the National Dong Hwa University since February 2016. His research interests include high-speed networks, wireless networks, IPv6-based networks, and artificial intelligence. He has published nearly 500 peer-reviewed research papers. He is the editor-in-chief (EiC) of the IET Networks and the Journal of Internet Technology. He has served as a guest editor of the ACM Mobile Networks and Applications, the IEEE Journal on Selected Areas in Communications, the IEEE Communications Magazine, the IEEE Systems Journal, Computer Communications, the IEEE Proceedings Communications, Wireless Personal Communications, and Wireless Communications & Mobile Computing. He is a senior member of the IEEE and a fellow of the IET.

Vincent S. Tseng (SM’16) received the PhD degree in computer science from National Chiao Tung University, Taiwan, in 1997. He is currently a distinguished professor with the Department of Computer Science, National Chiao Tung University, Taiwan. His research interests cover data mining, big data, biomedical informatics, mobile, and Web technologies. He has published more than 400 research papers in peer-reviewed journals and conferences and holds 15 patents. He has been on the editorial board of a number of journals, including the IEEE Transactions on Knowledge and Data Engineering, the ACM Transactions on Knowledge Discovery from Data, and the IEEE Journal of Biomedical and Health Informatics. He is a senior member of the IEEE.

Philip S. Yu (F’93) received the BS degree in electrical engineering from National Taiwan University, the MS and PhD degrees in electrical engineering from Stanford University, and the MBA degree from New York University. He is a distinguished professor of computer science with the University of Illinois at Chicago (UIC) and holds the Wexler Chair in Information Technology, UIC. Before joining UIC, he was with IBM, where he was manager of the Software Tools and Techniques Department, Thomas J. Watson Research Center. His research interests include databases, data mining, artificial intelligence, and privacy. He has published more than 1,300 papers in peer-reviewed journals (e.g., the IEEE Transactions on Knowledge and Data Engineering, the IEEE Transactions on Parallel and Distributed Systems, the ACM Transactions on Knowledge Discovery from Data, the VLDB Journal) and conferences (e.g., SIGMOD, KDD, ICDE, WWW, AAAI, SIGIR, and ICML). He holds or has applied for more than 300 U.S. patents. He was the editor-in-chief of the ACM Transactions on Knowledge Discovery from Data. He received the ACM SIGKDD 2016 Innovation Award, and the IEEE Computer Society 2013 Technical Achievement Award. He is a fellow of the ACM and IEEE.

Philippe Fournier-Viger received the PhD degree in computer science from the University of Quebec, Montreal, in 2010. He is a full professor and Youth 1,000 scholar with the Harbin Institute of Technology (Shenzhen), Shenzhen, China. His research interests include pattern mining, sequence analysis and prediction, and social network mining. He has published more than 250 research papers in refereed international conferences and journals. He is the founder of the popular SPMF open-source data mining library, which has been cited in more than 800 research papers. He is editor-in-chief (EiC) of the Data Science and Pattern Recognition (DSPR) journal.