1306 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 33, NO. 4, APRIL 2021

A Survey of Utility-Oriented Pattern Mining


Wensheng Gan, Jerry Chun-Wei Lin, Senior Member, IEEE, Philippe Fournier-Viger,
Han-Chieh Chao, Senior Member, IEEE, Vincent S. Tseng, Senior Member, IEEE,
and Philip S. Yu, Fellow, IEEE

Abstract—The main purpose of data mining and analytics is to find novel, potentially useful patterns that can be utilized in real-world
applications to derive beneficial knowledge. For identifying and evaluating the usefulness of different kinds of patterns, many
techniques and constraints have been proposed, such as support, confidence, sequence order, and utility parameters (e.g., weight,
price, profit, quantity, and satisfaction). In recent years, there has been an increasing demand for utility-oriented pattern mining (UPM,
also called utility mining). UPM is a vital task, with numerous high-impact applications, including cross-marketing, e-commerce, finance,
medical, and biomedical applications. This survey aims to provide a general, comprehensive, and structured overview of the state-of-
the-art methods of UPM. First, we introduce an in-depth understanding of UPM, including concepts, examples, and comparisons with
related concepts. A taxonomy of the most common and state-of-the-art approaches for mining different kinds of high-utility patterns is
presented in detail, including Apriori-based, tree-based, projection-based, vertical-/horizontal-data-format-based, and other hybrid
approaches. A comprehensive review of advanced topics of existing high-utility pattern mining techniques is offered, with a discussion
of their pros and cons. Finally, we present several well-known open-source software packages for UPM. We conclude our survey with a
discussion on open and practical challenges in this field.

Index Terms—Data science, economics, utility theory, utility mining, high-utility pattern, application

1 INTRODUCTION
DATA mining [1], [2] focuses on extraction of information from a large set of data and transforms it into an easily interpretable structure for further use. It is an interdisciplinary field focused on scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured. Mining interesting patterns from different types of data is quite important in many real-life applications [1], [3], [4], [5], [6]. In recent decades, the task of interesting pattern mining [e.g., frequent pattern mining (FPM) [7], [8], association rule mining (ARM) [9], [10], frequent episode mining (FEM) [11], [12], [13], [14], and sequential pattern mining (SPM) [5], [15], [16], [17]] has been extensively studied. These are important and fundamental data mining techniques [1] that satisfy the requirements of real-world applications in numerous domains. Most of them aim at extracting the desired patterns using frequency or co-occurrence [7], [8], [9], [10], as well as other properties and interestingness measures [18], [19], [20], [21]. Despite the wide use of pattern mining techniques, most of these algorithms do not allow for the discovery of utility-oriented patterns, i.e., those that contribute the most to a predefined utility threshold, an objective function, or a performance metric. In general, some implicit factors, such as the utility, interestingness, or risk of objects/patterns, are commonly seen in real-world situations. The knowledge that is actually important to the user may not be found by traditional data mining algorithms. Therefore, a novel utility mining framework, called utility-oriented pattern mining (UPM) or high-utility pattern mining (HUPM¹) [22], [23], [24], which considers the relative importance of items (utility-oriented [25]), has become an emerging research topic in recent years. In UPM, the utility (i.e., importance, interest, satisfaction, or risk) of each item can be predefined based on a user’s background knowledge or preferences.

According to Wikipedia,² in economics, utility is a measure of preferences over some set of goods (including services, i.e., something that satisfies human wants). In a perspective, it represents satisfaction experienced by the consumer of a good. Hence, utility is a subjective measure. This definition indicates that a subjective value is associated with a specific value in a domain to express user preference. In practice, the value of utility is assigned by the user

• W. Gan is with the Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China, and also with the University of Illinois at Chicago, Chicago, IL 60607 USA. E-mail: [email protected].
• J. C. W. Lin is with the Western Norway University of Applied Sciences, Bergen 5063, Norway. E-mail: [email protected].
• P. Fournier-Viger is with the Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China. E-mail: [email protected].
• H. C. Chao is with the National Dong Hwa University, Hualien 974, Taiwan. E-mail: [email protected].
• V.S. Tseng is with the Department of Computer Science, National Chiao Tung University, Hsinchu City 30010, Taiwan. E-mail: [email protected].
• P.S. Yu is with the University of Illinois at Chicago, Chicago, IL 60607 USA. E-mail: [email protected].

Manuscript received 21 May 2018; revised 5 Aug. 2019; accepted 15 Sept. 2019. Date of publication 20 Sept. 2019; date of current version 5 Mar. 2021.
(Corresponding author: Jerry Chun-Wei Lin.)
Recommended for acceptance by L. Chen.
Digital Object Identifier no. 10.1109/TKDE.2019.2942594

1. The terms UPM and HUPM can be used interchangeably, but we will use UPM in the rest of this manuscript.
2. https://en.wikipedia.org/wiki/Utility

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
GAN ET AL.: A SURVEY OF UTILITY-ORIENTED PATTERN MINING 1307

according to his interpretation of domain-specific knowledge measured by a specific value, such as cost, profit, or aesthetic value. According to the studies of Li et al. [18], interestingness measures can be classified as objective measures, subjective measures, and semantic measures [18], [20], [21]. Objective measures [21], [26], such as support or confidence for pattern mining, are based only on the data itself, whereas subjective measures [27], [28], such as unexpectedness or novelty, take into account the user’s domain knowledge. Semantic measures [24], such as utility, consider the data itself as well as the user’s expectation. Hence, utility is a quantitative representation of user preference, and the usefulness of an itemset is quantified in terms of its utility value. Utility can be defined as “A measure of how ‘useful’ (i.e., profitable) an itemset is” [24], [29]. Formally, a pattern is said to be useful to a user if it satisfies a specific utility constraint. In practice, the utility value of a pattern can be measured in terms of cost, profit, aesthetic value, or other measures of user preference.

To address these issues, utility-oriented pattern mining (hereinafter called UPM) has become a useful task and an important topic in data mining. In UPM, each object/item has a unit utility (e.g., unit profit) and can appear more than once in each transaction or event (e.g., purchase quantity). The utility of a pattern represents its importance or satisfaction, which can be measured in terms of risk, profit, cost, quantity, or other information depending on user preference. In general, the utility of a pattern is based on local transaction utility (also called internal utility) and external utility [24], [29]. The internal utility of an object/item is defined according to the information stored in a transaction/event, such as the quantity of the object/item that occurred or was sold. The external utility can be a measure for describing user preferences. Therefore, the utility of a pattern depends on the utility function specified by the user, which can be the Sum, Average, or Multiplication of quantity and profit of this pattern in databases. More specifically, the utility-based method for pattern mining can find various types of patterns that could not be identified using previous theories and techniques. According to previous studies, UPM has a wide range of applications, including website click-stream analysis [30], [31], [32], cross-marketing in retail stores [33], [34], mobile commerce environment [35], [36], gene regulation [37], and biomedical applications [38]. Through 15 years of study and development, many techniques and approaches have been extensively proposed for UPM in various applications. As shown in Fig. 1, there has been a rapid surge of interest in UPM in recent years in terms of the number of academic papers published in several sub-fields, including high-utility itemsets [29], high-utility rules [39], [40], high-utility sequential patterns [41], [42], and high-utility episodes [43], [44].

Fig. 1. Number of published papers that use “High Utility Pattern Mining” in sub-areas of data science and analytics. These publication statistics are obtained from Google Scholar. Note that the search phrase is defined as the sub-field named with the exact phrase “utility pattern,” and at least one of utility or itemset/rule/episode/sequential pattern appearing, e.g., “utility itemset,” “utility pattern,” and “utility sequential pattern”.

In spite of the fact that there are a considerable number of existing published studies and surveys about data mining, especially for pattern mining, none of them discuss UPM. Yet, after more than 15 years of theoretical development, a significant number of new technologies and applications have appeared in the UPM field. Unfortunately, there is no comprehensive survey of utility-oriented pattern mining methods and no study that systematically compares the state-of-the-art algorithms. We believe that now is a good time to summarize the new technologies and address the gap between theory and application. Here, we attempt to find a clearer way to present the concepts and practical aspects of UPM for the data mining research community. In this paper, we provide a systematic and comprehensive survey of the significant advances in UPM. The methods discussed in this article are not only important for high-utility pattern (i.e., itemset [24], [29], rule [39], [40], sequence, episode, etc.) mining but can also serve as inspiration for other data mining tasks [1], [2], including episode mining [11], [12], [13], [14], distributed data mining [45], and incremental/dynamic data mining [46], [47]. The major contributions are listed as follows:

1) This paper first presents the background, motivation, and a comprehensive survey of UPM (Section 1). This survey investigates more than 150 UPM papers published in the last 15 years and summarizes them in a systematic fashion.
2) This survey first introduces an in-depth understanding of UPM, including concepts, examples, comparisons with related studies (e.g., FPM, SPM), applications, and evaluation measures (Section 2). This survey presents a bird’s-eye view, and then deeply and comprehensively summarizes the developments of UPM, comparing the state-of-the-art works to earlier works (Section 3).
3) A taxonomy of the most common and the state-of-the-art approaches for UPM is presented, including Apriori-based, tree-based, projection-based, vertical/horizontal-data-format-based, and other hybrid approaches (Section 3). We further analyze the pros and cons of each presented approach.
4) A comprehensive review of advanced topics of utility mining techniques (e.g., dynamic UPM, concise representation of utility patterns, HUSPM, HUEM, UPM in big data, and privacy issues) is presented (Section 4), with a discussion of their pros and cons. Not only the representative algorithms but also the advances and latest progress are reviewed.
5) We further review some well-known open-source software and datasets (Section 5) of UPM and hope that these resources may reduce barriers for future research. Finally, we identify several important issues and research opportunities for UPM (Section 6).
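The notions of internal utility (purchase quantity) and external utility (unit profit) described in this section can be made concrete with a small sketch. It assumes the Sum utility function: the utility of an itemset in a transaction is the sum of quantity times unit profit over its items, accumulated over every transaction that contains the whole itemset. The function names, the toy database, and the profit table are hypothetical and chosen only for illustration; they are not taken from any specific UPM algorithm.

```python
from typing import Dict, Iterable, List

# A transaction maps each purchased item to its quantity (internal utility).
Transaction = Dict[str, int]


def utility_in_transaction(itemset: Iterable[str],
                           transaction: Transaction,
                           unit_profit: Dict[str, int]) -> int:
    """Utility of an itemset in one transaction (0 if it is not fully contained)."""
    items = list(itemset)
    if not all(item in transaction for item in items):
        return 0
    # quantity (internal utility) times unit profit (external utility)
    return sum(transaction[item] * unit_profit[item] for item in items)


def total_utility(itemset: Iterable[str],
                  database: List[Transaction],
                  unit_profit: Dict[str, int]) -> int:
    """Total utility of an itemset, summed over all transactions of the database."""
    items = list(itemset)
    return sum(utility_in_transaction(items, t, unit_profit) for t in database)


def is_high_utility(itemset: Iterable[str],
                    database: List[Transaction],
                    unit_profit: Dict[str, int],
                    minutil: int) -> bool:
    """True if the total utility reaches the user-specified threshold."""
    return total_utility(itemset, database, unit_profit) >= minutil


# Hypothetical toy data: three transactions and a unit-profit table.
unit_profit = {"a": 5, "b": 2, "c": 1}
database = [
    {"a": 1, "b": 2},          # T1
    {"a": 2, "c": 3},          # T2
    {"a": 1, "b": 1, "c": 4},  # T3
]

print(total_utility({"a", "b"}, database, unit_profit))        # 9 (T1) + 7 (T3) = 16
print(is_high_utility({"c"}, database, unit_profit, minutil=15))
```

In this toy database the itemset {a, b} occurs in only two of the three transactions yet has a higher total utility (16) than the item c, which also occurs twice (7). This is the kind of difference that a purely frequency-based measure cannot express.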

The remainder of this article is organized as follows. In Section 2, we introduce the necessary background information, the basic concepts and examples, and the applications in this field. In Section 3, we give a high-level overview of emerging UPM problems and survey several popular methods, as well as recent developments. In Section 4, we discuss advanced topics and techniques of UPM. In addition, several well-known open-source software and datasets are summarized in Section 5. We describe some open challenges and opportunities in Section 6. Several future directions are described in Section 7.

2 BASIC CONCEPT: UTILITY-ORIENTED PATTERN MINING

2.1 Preliminary and Types of Utility Patterns
We first present the basic notations, as summarized in Table 1. Then, we introduce related preliminaries of UPM and then define the problem of UPM. Based on pattern diversity, utility-oriented pattern mining can be classified using the following basic criteria and extended patterns.

TABLE 1
Summary of Symbols and Their Explanations

Symbol          Definition
$I$             A set of $m$ distinct items, $I = \{i_1, i_2, \ldots, i_m\}$.
$D$             A quantitative database, $D = \{T_1, T_2, \ldots, T_n\}$.
QSD             A quantitative sequential database $= \{s_1, s_2, \ldots, s_n\}$.
$k$-itemset     An itemset containing $k$ items.
$X$             A $k$-itemset having $k$ distinct items $\{i_1, i_2, \ldots, i_k\}$.
$sup(X)$        The support of a pattern $X$ in $D$ or QSD.
$q(i_j, T_q)$   The purchase quantity of an item $i_j$ in transaction $T_q$.
$pr(i_j)$       The predefined unit profit of an item $i_j$.
$u(i_j, T_q)$   The utility of an item $i_j$ in transaction $T_q$.
$u(X, T_q)$     The utility of an itemset $X$ in transaction $T_q$.
$tu(T_q)$       The sum of the utilities of items in transaction $T_q$.
minsup          A predefined minimum support threshold.
minconf         A predefined minimum confidence threshold.
minutil         A predefined minimum high-utility threshold.
TWU             The transaction-weighted utility of a pattern.
TWDC            Transaction-weighted downward closure property.
HTWUI           A high transaction-weighted utilization itemset.
HUI             A high-utility itemset.
FPM             Frequent pattern mining.
SPM             Sequential pattern mining.
UPM             Utility-oriented pattern mining.
HUAR            High-utility association rule.
HUSP            High-utility sequential pattern.
HUSR            High-utility sequential rule.
HUE             High-utility episode.
HUIM            High-utility itemset mining.
HUARM           High-utility association rule mining.
HUSPM           High-utility sequential pattern mining.
HUEM            High-utility episode mining.

Definition 1 (Frequent Pattern and Association Rule [9]). An association rule is an implication of the form $X \rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$. $X$ (or $Y$) is a set of items, called an itemset. Given a database, a pattern (e.g., a set of items, sequences, structures, etc.) is said to be a frequent pattern if it occurs frequently in this database. The support of an itemset, denoted as $sup(X)$, is the number of transactions containing $X$. An association rule $R$: $X \rightarrow Y$, in which $X$, $Y$ are disjoint and $Y$ is non-empty, means that if a transaction includes $X$, then it also has $Y$. Rules that consist of frequent itemsets and whose confidence is no less than the minimum confidence are sometimes called strong rules. This notion was first proposed by Agrawal et al. [9] in the context of frequent itemset and association rule mining. For example, $\{Cheese, Milk\} \rightarrow Bread$ [sup = 5%, conf = 80%]; this association rule means that 80 percent of customers who buy Cheese and Milk also buy Bread, and 5 percent of customers buy all these products together.

Definition 2 (High-Utility Itemset, HUI [29], [48]). The utility of an item $i_j$ appearing in a transaction $T_q$ is denoted as $u(i_j, T_q)$ and defined as $u(i_j, T_q) = q(i_j, T_q) \times pr(i_j)$. The utility of an itemset $X$ in $T_q$ is defined as $u(X, T_q) = \sum_{i_j \in X \wedge X \subseteq T_q} u(i_j, T_q)$. The total utility of $X$ in a database $D$ is $u(X) = \sum_{X \subseteq T_q \wedge T_q \in D} u(X, T_q)$. An itemset is said to be a high-utility itemset (HUI) if its total utility in a database is no less than the user-specified minimum utility threshold (such that $u(X) \ge minutil$); otherwise, it is called a low-utility itemset.

Definition 3 (High-Utility Association Rule, HUAR [39], [40]). Since the usefulness of an association rule [9] can be defined as a utility function based on the business objective, the utility and confidence can be used to extend the concepts of high-utility itemset and association rule. An association rule $R$: $X \rightarrow Y$ is considered to have high utility if it meets the minutil constraint. Thus, a high-utility association rule (HUAR) consists of high-utility itemsets, and its confidence is no less than the minimum confidence. Generally speaking, discovery of HUIs started as the first phase in the discovery of HUARs, but it has been generalized by formulating a new pattern-mining framework.

Definition 4 (High-Utility Sequential Pattern, HUSP [41], [42]). The utility of an item $i_j$ in a q-itemset $v$ is denoted as $u(i_j, v)$ and defined as $u(i_j, v) = q(i_j, v) \times pr(i_j)$, where $q(i_j, v)$ is the quantity of $i_j$ in $v$, and $pr(i_j)$ is the profit of $i_j$. The utility of a q-itemset $v$ is denoted as $u(v)$ and defined as $u(v) = \sum_{i_j \in v} u(i_j, v)$. The utility of a q-sequence $s = \langle v_1, v_2, \ldots, v_d \rangle$ is denoted as $u(s)$ and defined as $u(s) = \sum_{v \in s} u(v)$. A sequence $s$ in a quantitative sequential database QSD is said to be a high-utility sequential pattern (HUSP) if its utility is no less than the minimum utility threshold, i.e., $HUSP = \{s \mid u(s) \ge minutil\}$. Considering the ordered sequences, high-utility sequential pattern mining (HUSPM) [41], [42] can discover more informative sequential patterns. This process is more complicated than the traditional UPM or SPM since the order and the utilities of itemsets should be considered together.

Definition 5 (High-Utility Sequential Rule, HUSR [49]). A sequential rule $R$: $X \rightarrow Y$ [50] is a relationship between two unordered itemsets $X, Y \subseteq I$ such that $X \cap Y = \emptyset$ and $X, Y \ne \emptyset$. The interpretation of a rule $R$: $X \rightarrow Y$ is that if items of $X$ occur in a sequence, then items of $Y$ will occur afterward in the same sequence. Let minsup, minconf $\in [0, 1]$ and minutil be thresholds set by the user and QSD be a sequence database.
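The sums in Definition 4 can be illustrated with a short sketch. It assumes a q-sequence is stored as a list of q-itemsets, each mapping an item to its quantity, and it only evaluates the utility of complete q-sequences in a toy QSD. Matching a candidate pattern against longer sequences, which real HUSPM algorithms must do, is deliberately left out. All names and data below are hypothetical.

```python
from typing import Dict, List

QItemset = Dict[str, int]   # item -> quantity within one q-itemset
QSequence = List[QItemset]  # ordered list of q-itemsets


def q_itemset_utility(v: QItemset, unit_profit: Dict[str, int]) -> int:
    """u(v): sum of quantity times unit profit over the items of v."""
    return sum(quantity * unit_profit[item] for item, quantity in v.items())


def q_sequence_utility(s: QSequence, unit_profit: Dict[str, int]) -> int:
    """u(s): sum of the utilities of the q-itemsets of s."""
    return sum(q_itemset_utility(v, unit_profit) for v in s)


def high_utility_sequences(qsd: List[QSequence],
                           unit_profit: Dict[str, int],
                           minutil: int) -> List[QSequence]:
    """Keep the q-sequences whose utility is no less than minutil."""
    return [s for s in qsd if q_sequence_utility(s, unit_profit) >= minutil]


# Hypothetical toy data.
unit_profit = {"a": 5, "b": 2, "c": 1}
s1 = [{"a": 1, "b": 2}, {"c": 3}]        # (5 + 4) + 3 = 12
s2 = [{"b": 1}, {"c": 2}, {"b": 1}]      # 2 + 2 + 2 = 6
qsd = [s1, s2]

print(q_sequence_utility(s1, unit_profit))
print(high_utility_sequences(qsd, unit_profit, minutil=10) == [s1])
```

Representing each q-itemset as a dictionary keeps the quantity next to the item, while the outer list preserves the order of the itemsets, which is the information a sequential pattern needs beyond what an ordinary transaction stores.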

A sequential rule $R$ is said to be a high-utility sequential rule (HUSR) [49] iff $u(R) \ge minutil$ and $R$ is a valid rule, in which $u(R)$ is the total utility of $R$ in QSD. Otherwise, it is said to be a low-utility sequential rule. The problem of mining high-utility sequential rules from a sequence database is the discovery of all high-utility sequential rules.

Definition 6 (High-Utility Episode, HUE [43], [44]). An episode $\alpha$ is a non-empty totally ordered set of simultaneous events (SE) of the form $\langle (SE_1), (SE_2), \ldots, (SE_k) \rangle$, where $SE_i$ appears before $SE_j$ for all $1 \le i < j \le k$. For example, $\langle (AB), (C) \rangle$ is an episode containing a simultaneous event (AB) and a series event (C). The total utility of an episode $\alpha$ in a single simple or complex event containing a set of sub-events is $u(\alpha)$ [43], [44], and its calculation is more complicated than that of the utility of a sequence [42]. An episode is said to be a high-utility episode (abbreviated as HUE) in complex event sequences if its total utility in these sequences is no less than the minimum utility threshold such that $u(\alpha) \ge minutil$. Otherwise, this episode is a low-utility episode.

Definition 7 (Utility-Oriented Pattern Mining, UPM). A general definition of UPM is given below: UPM is a new mining framework that utilizes the utility theory and various mining techniques (e.g., data structure, pruning strategy, upper bound) to discover the interesting patterns (e.g., HUI, HUAR, HUSP, HUSR, HUE), and these derived patterns can lead to utility maximization and high benefit in business or other tasks.

Based on the above concepts of utility pattern, the UPM framework can be further classified into the following categories, including 1) high-utility itemset mining (HUIM), 2) high-utility association rule mining (HUARM), 3) high-utility sequential pattern mining (HUSPM), 4) high-utility sequential rule mining (HUSRM), and 5) high-utility episode mining (HUEM).

2.2 Comparisons with Related Concepts
With the boom in data mining and analysis, all kinds of data have emerged, and a number of concepts (e.g., FPM, SPM, FEM, UPM, etc.) to model various types of data have been proposed. These concepts have similar meanings, as well as subtle differences. Here we compare the UPM framework with its most related concepts.

• UPM versus FPM. Frequent pattern mining (FPM) [7], [8], [9], [10] is a common and fundamental topic in data mining. FPM is a key phase of association-rule mining (ARM), but it has been generalized to many kinds of patterns, such as frequent sequential patterns [16], frequent episodes [11], and frequent subgraphs [51]. The goal of FPM is to discover all the desired patterns having support no lower than a given minimum support threshold. If the support of a pattern is no lower than this threshold, it is called a frequent pattern; otherwise, it is called an infrequent pattern. Unlike UPM, studies of FPM seldom consider the database having quantities of items, and none of them considers the utility feature. Under the “economic view” of consumer rational choices, utility theory can be used to maximize the estimated profit. UPM considers both statistical significance and profit significance, whereas FPM aims at discovering the interesting patterns that frequently co-occur in databases. In other words, any frequent pattern is treated as a significant one in FPM. However, in practice, these frequent patterns do not show the business value and impact. In contrast, the goal of UPM is to identify the useful patterns that appear together and also bring high profits to the merchants [52]. In UPM, managers can investigate the historical databases and extract the set of patterns having high combined utilities. Such problems cannot be tackled by the support/frequency-based FPM framework.

• UPM versus WFPM. In the related areas, the relative importance of each object/item is not considered in the concept of FPM. To address this problem, weighted frequent-pattern mining (WFPM) was proposed [53], [54], [55], [56], [57], [58], [59]. In the framework of WFPM, the weights of items, such as unit profits of items in transaction databases, are considered. Therefore, even if some patterns are infrequent, they might still be discovered if they have high weighted support [53], [54], [55]. However, the quantities of objects/items are not considered in WFPM. Thus, the requirements of users who are interested in discovering the desired patterns with high risks or profits cannot be satisfied. The reason is that the profits are composed of unit profits (i.e., weights) and purchased quantities. In view of this, utility-oriented pattern mining has emerged as an important topic. It refers to discovering the patterns with high profits. As mentioned previously, the meaning of a pattern’s utility is the interestingness, importance, or profitability of the pattern to users. The utility theory is applied to data mining by considering both the unit utility (i.e., profit, risk, and weight) and purchased quantities. This has led to the concept of UPM [52], which selects interesting patterns based on minimum utility rather than minimum support.

• UPM versus SPM. Sequential pattern mining (SPM) [5], [15], [16], [17] discovers frequent subsequences as patterns in a sequence database that contains the embedded timestamp information of events, and it is more complex and challenging than FPM. In 1995, Agrawal and Srikant first extended the FPM model to handle sequences [15]. Consider the sequence $\langle \{a, e\}, \{b\}, \{c, d\}, \{g\}, \{e\} \rangle$, which represents five events made by a customer at a retail store. Each single letter represents an item (i.e., $\{a\}$, $\{c\}$, $\{g\}$, etc.), and items between curly braces represent an itemset (i.e., $\{a, e\}$ and $\{c, d\}$). Simply speaking, a sequence is a list of temporally ordered itemsets (also called events). Because SPM must account for the temporal order of itemsets, a constraint that is absent from FPM, it has a potentially huge set of candidate sequences [16]. In a related area, through 25 years’ study and development, many techniques and approaches have been proposed for mining sequential patterns in a wide range of real-world applications [5]. In general, SPM mainly focuses on the co-occurrence of derived patterns; it does not consider the unit profit and purchase quantities of each product/item.

So far, we have reviewed a wide range of pattern-mining frameworks that aim to discover various types of patterns, such as itemsets [9], [53], sequences [15], [16], and graphs [51]. These frameworks, however, only select high-frequency/support patterns. Patterns below the minimum threshold are considered useless and discarded. Frequency is the main interestingness measurement, and all objects/items and transactions are treated equally in such a framework. Clearly, this assumption contradicts the truth in

many real-world applications because the importance of different items/itemsets/sequences might be significantly different. In these circumstances, the frequency-/support-based framework is inadequate for pattern mining and selection. Based on the above concerns, researchers proposed the concept of UPM.

2.3 Why Utility-Oriented Pattern Mining and Analysis
With the rapid advancement of research on UPM, numerous applications in different domains have been proposed in recent years. We next describe several important applications, as summarized in Table 2.

TABLE 2
Various Applications of UPM

Domain                   Applications
Business intelligence    Market basket analysis, recommendation, cross marketing, sales intelligence, and risk prediction.
Web mining               Users’ click-stream, users’ access behaviors, and traversal pattern mining.
Mobile computing         Mobile e-commerce, travel route recommendation, spatial crowdsourcing, and spatial data analytics.
Stream processing        Web-click stream analysis, IoT data analytics, and stream mining in smart transportation.
Biomedicine              Gene expression and gene-disease association.

• Market Basket Analysis. In market basket analysis, each transaction recorded with a customer contains several products/items, annotated with their purchase time, purchase quantities and the selling price. An important technique is based on the theory that if customers buy a certain set of items, customers are more (or less) likely to buy another set of items. For the problem of mining-characterized association rules from market basket data, the goal is to not only discover the buying patterns of customers but also the highly profitable patterns and customers. In some existing frameworks [33], [34], [60], [61], [62], the utility (i.e., importance, interest, or risk) of each product can be predefined based on users’ background knowledge or preferences. As a result, UPM is able to offer richly detailed information about users’ purchasing behaviors.

• Web Mining. There is much rich information in web data. For example, users’ click-stream and purchase behaviors are recorded in web logs. In such data, a user’s click operation (there are one or many clicks in one session) and browsing time on a web page can be expressed as the internal utility of the web page. Obviously, each web page has different importance depending on users’ different preferences (i.e., external utility). Thus, UPM technology can be used to discover utility-oriented patterns from web logs, such as high-utility access patterns [63] and high-utility traversal patterns [64]. The derived results are quite useful for electronic commerce for such things as improving website services, providing some navigation suggestions for web browsing, and improving the design of web pages.

• Mobile Computing. With the explosive growth of the Internet of Things (IoT) [65] technologies in the Big Data era, such as smart-phones, wireless networks, and GPS devices, information about users’ mobile behavior (e.g., locations and payment records) can be acquired and integrated in data analytics. In this scenario, utility-oriented mining technologies can be used to discover valuable user behaviors. Shie et al. first proposed a new framework to mine high-utility mobile sequences [35], [36] in mobile environments. It can extract associations between customers’ purchase behaviors and location trajectories. The discovered high-utility patterns can be utilized for many applications essential to mobile e-commerce, such as location-based advertisement or recommendations, navigational services, spatial crowdsourcing, and utility-based recommendation systems.

• Stream Processing. The majority of data is born as continuous streams [66], [67]: sensor events, user activity on a website, financial trades, and others; all these data are created as a series of events over time. In general, some stream data contain rich and important features that are similar to the general static data. In contrast to the support-based pattern mining technologies, the utility-oriented pattern mining technologies can be applied to extract useful patterns and knowledge from stream data, e.g., website click-streams [68]. Some preliminary studies have been carried out on this issue, such as [30], [32], [69], [70].

• Biomedicine. In gene expression data, each row represents a set of genes and their expression levels (i.e., internal utility) under an experimental condition. In addition, each gene has a degree of importance for biological processes (i.e., external utility). In bioinformatics, UPM technology can discover useful relationships between genes. For example, Liu et al. [71] applied a UPM method to successfully discover interesting gene regulation patterns from a time-course of comparative gene expression data. By analyzing the discovered results, medical researchers can find new drugs for the treatment of diseases. Recently, Zihayat et al. proposed a utility model by considering both the gene-disease association scores and their degrees of expression levels in a biological investigation [37].

• Other Applications. Since the “utility” of a pattern measures the importance of the pattern to the user (i.e., risk, weight, cost, and profit), UPM has broad real-life applications; several examples are described below. In risk prediction, the risk that events may occur is indicated by occurrence probabilities and risk. For example, the event $\langle (A, 1, 80); (D, 5, 15); (E, 3, 125); 90\% \rangle$ indicates that this event consists of three sub-events $\{A, D, E\}$ with occurrence frequencies {1, 5, 3} and risks {80, 15, 125}, respectively, and that it has a 90 percent probability of occurring. In e-commerce business, this manifests as identifying customers who visit web pages a number of times by taking pages visited as a utility parameter. In financial analysis, e.g., online banking fraud detection, the transfer of a large amount of money to an unauthorized overseas account may appear once or many times in several million transactions, yet it has a substantial business impact.

2.4 Evaluation Measures of Utility
Here, we briefly describe several key measures that have been used in the literature to determine utility-oriented relationships in UPM. In Section 2.1, the theoretical foundations

of several UPM frameworks were analyzed. Based on the utility theory [25], many evaluation measures of utility have been proposed. Some commonly used evaluation measures of the utility of a pattern in the UPM field are summarized in Table 3.

TABLE 3
Evaluation Measures of Utility in UPM

| Measure | Description |
|---|---|
| Utility | The commonly used utility measure in many UPM models and algorithms. Its definition is given in Definitions 2 to 7, and details can be found in [42], [52]. |
| Average utility | The average utility is $au(X, T_q) = \frac{\sum_{i_j \in X \wedge X \subseteq T_q} q(i_j, T_q) \times pr(i_j)}{\lvert X \rvert}$ [72], where $k = \lvert X \rvert$ is the number of items in $X$. It considers the length of the pattern as a major factor. |
| Expected/potential utility | Measures both probability and utility of a pattern in uncertain databases [73]; the expected support [74] is measured as $expSup(X) = \sum_{q=1}^{\lvert D \rvert} \left( \prod_{X_i \in X} p(X_i, T_q) \right)$. |
| Affinitive utility | The affinitive utility [75] of a pattern $X$ in $T_q$ is defined as $au(X, T_q) = eu(X) \times af(X, T_q)$, where $eu(X) = \sum_{i_j \in X} pr(i_j)$ and $af(X, T_q)$ denotes the affinitive frequency of $X$ in $T_q$, that is, $af(X, T_q) = \min\{q(i_1, T_q), q(i_2, T_q), \ldots, q(i_j, T_q)\}$, $i_j \in X$ [75]. |
| Utility occupancy | It depends on the contribution of a unit item. The utility occupancy [76] of a pattern $X$ in $T_q$ and in $D$ is defined as $uo(X, T_q) = u(X, T_q) / tu(T_q)$ and $uo(X) = \frac{\sum_{X \subseteq T_q \wedge T_q \in D} uo(X, T_q)}{\lvert \Gamma_X \rvert}$, respectively, where $\Gamma_X$ is the set of transactions in $D$ that contain $X$. |

The most commonly adopted evaluation measure for UPM is the general utility concept [42], [52]. It is based on external utility (e.g., profit, unit price, risk) and internal utility (e.g., quantity). As described in Definition 2, the overall utility of a pattern in the processed database is the cumulative utility of this pattern over all transactions in which it appears. The average utility [72] divides this utility by the length of the pattern, which avoids the effect of the overall utility increasing with pattern length. Expected/potential utility captures both the uncertainty and the utility of a pattern in uncertain data [73]; thus, this measure is suitable for UPM on uncertain data. Affinitive utility [75] was proposed to address the special task of correlated UPM and is not used for the general task of UPM. The utility occupancy [76] is more suitable than the utility concept and average utility for discovering high-utility patterns that have a high utility contribution.

3 BASIC APPROACHES FOR HIGH-UTILITY PATTERN MINING
3.1 Overview of Proposed Categorization
The development of utility-oriented algorithms has always been an important issue in the data mining area. During the past decades, a significant number of utility-oriented algorithms have been proposed to mine utility-based patterns from various types of data (e.g., transaction data [9], sequential data [15], episode data [11], and stream data [66], [67]). Considering that it is infeasible to go through all existing algorithms within a limited space, in this review we select some representative HUIM algorithms. According to the different mining principles and data structures, Fig. 2 presents a rough overview of techniques to address the UPM problem. Specifically, to facilitate our discussion, we classify these efforts into the following categories: 1) Apriori-based approaches; 2) tree-based approaches; 3) projection-based approaches; and 4) vertical-/horizontal-data-based approaches.

3.2 Apriori-Based Approaches
In 1994, Agrawal and Srikant proposed the well-known downward closure property, also known as the Apriori property [9], which states that all non-empty subsets of a

Fig. 2. Taxonomy of UPM algorithms.



TABLE 4
Apriori-Based Algorithms for High-Utility Pattern Mining

| Name | Description | Pros. | Cons. | Year |
|---|---|---|---|---|
| MEU [52] | The first theoretical model and strict definitions of high-utility itemset mining. | MEU uses a heuristic to determine candidates and usually overestimates. | It cannot maintain the downward closure property of Apriori [9], and the derived results are incomplete. | 2004 |
| UMining [29] & UMining_H [29] | The general HUIM model with several mathematical properties of the utility measure. | UMining uses the utility upper bound and UMining_H utilizes a heuristic pruning strategy. | It generates a large amount of candidate patterns and suffers from excessive candidate generations and poor scalability. | 2006 |
| Two-Phase [77] | The TWDC property was proposed to discover HUIs in two phases. | It can greatly prune a large amount of candidate patterns. | It has the problem of level-wise candidate generation-and-test [9], and requires multiple database scans. | 2005 |
| IIDS [34], FUM [34], DCG+ [34] | By applying IIDS to ShFSM and DCG, two methods, FUM and DCG+, were implemented. | For any existing level-wise utility mining method, it can reduce the number of candidates and improve performance. | It has the same performance issues as Apriori [9]. | 2008 |

frequent itemset must also be frequent, and any superset of an infrequent itemset cannot be frequent. For example, assuming {a, b, c} is frequent, all of its sub-itemsets, such as {a, c} and {b, c}, are also frequent. If {d, e} is infrequent, its supersets, such as {a, d, e} and {d, e, f}, are not frequent. Some Apriori-based approaches for HUIM have been further developed. The core step of these Apriori-based UPM algorithms is the generation of candidate k-itemsets $C_k$ from high-utility (k-1)-itemsets (denoted as HUI$_{k-1}$), which consists of two operations: join and prune. In the join step, the conditional join of two HUI$_{k-1}$ patterns is used to generate the candidate set $C_k$. The prune step then reduces the size of $C_k$ by using the utility upper bound (which is similar to the Apriori property [9]).

• OOApriori & Top-k Closed Utility Mining [22], [23]. In 2002, Shen and Yang proposed an objective-oriented association (OOA) mining approach [22]. They integrated the utility constraint into OOApriori (a variant of Apriori [9]) to prune candidates for deriving the OOA rules. The interestingness of OOA rules is measured in terms of probabilities and utilities in supporting the user’s objective. The utility constraint for OOA rules is neither monotone nor anti-monotone. In 2003, Chan et al. first defined the concept of utility mining and proposed an objective-directed mining algorithm to mine the top-k closed-utility patterns [23]. This was the first time the term “utility mining” was presented and used to identify both frequent and high-utility itemsets based on business objectives. In this utility-based mining framework, a pruning strategy based on a weak but anti-monotonic condition was developed to reduce the search space.

• Mining with Expected Utility (MEU) [52]. In 2004, Yao et al. proposed a utility mining model, called mining with expected utility (MEU) [52], which considers both the purchase quantities (called internal utility) and unit profits (called external utility) of items to mine HUIs. Note that the term “mining high-utility itemsets” first appeared in [23], but its concept and definitions were quite different from the definitions of high-utility itemset mining used today. It is widely believed that utility-based itemset mining, sequence mining, and web mining originated in [52]. Researchers in the field of UPM consider the MEU model to be the first theoretical model and strict definition of high-utility itemset mining. MEU uses a heuristic to determine candidates and usually overestimates. However, it cannot maintain the downward closure property of Apriori [9], and the derived results are incomplete.

• UMining and UMining_H [29]. Yao et al. then proposed UMining and the heuristic UMining_H [29] for finding HUIs based on several mathematical properties of the utility measure. The utility constraint is characterized by a property giving the upper bound of the utility value of an itemset. In UMining, this utility upper bound property is used as a pruning strategy. UMining_H utilizes another pruning strategy based on a heuristic method [29]. However, some HUIs may be erroneously pruned by this heuristic method. Furthermore, neither of them has the downward closure property of Apriori [9], and both overestimate the utilities of too many patterns. Therefore, they suffer from excessive candidate generation and poor scalability.

• Two-Phase [77]. Note that the downward closure property (i.e., the Apriori property [9]) of the support measure does not hold for the utility. To address the challenge that the utility measure is neither monotone nor anti-monotone, Liu et al. proposed the well-known Two-Phase algorithm [77]. Two-Phase introduces a novel concept named the transaction-weighted downward closure (TWDC) property (for any itemset X, if X is not an HTWUI, any superset of X is not an HUI) and uses it to discover HUIs in two phases. Phase 1: it finds each itemset X such that $TWU(X) \geq minutil$, using the TWU upper bound to prune the search space. Initially, it scans the database once to get all 1-itemsets in HTWUI$_1$; it then generates (k+1)-level candidate itemsets (with length k+1) from the length-k candidates HTWUI$_k$ (where k > 0). In each iteration, it examines the TWU values of the candidates by scanning the database once. It terminates when no candidate can be generated. Phase 2: it scans the database again to calculate the exact utility of each candidate in the set of HTWUI$_k$ and then outputs the desired HUIs.
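The two phases described above can be illustrated with a minimal Python sketch. It is not the implementation of [77]: the toy database `DB`, the profit table `PROFIT`, and all function names are hypothetical and chosen only for illustration. Phase 1 performs the level-wise join and prune steps using TWU as the upper bound, and Phase 2 computes exact utilities for the surviving candidates.

```python
from itertools import combinations

# Hypothetical toy data: each transaction maps item -> purchase quantity (internal utility).
DB = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 3},
    {"b": 1, "c": 1, "d": 4},
    {"a": 1, "d": 1},
]
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3}  # unit profit (external utility)


def transaction_utility(t):
    """TU: total utility of all items in one transaction."""
    return sum(q * PROFIT[i] for i, q in t.items())


def utility(itemset, db):
    """Exact utility of an itemset: summed over the transactions that contain it."""
    return sum(sum(t[i] * PROFIT[i] for i in itemset)
               for t in db if itemset.issubset(t))


def twu(itemset, db):
    """Transaction-weighted utility: sum of TU over the transactions containing the itemset."""
    return sum(transaction_utility(t) for t in db if itemset.issubset(t))


def two_phase(db, minutil):
    # Phase 1: level-wise search for HTWUIs, pruning with the TWU upper bound.
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items if twu(frozenset([i]), db) >= minutil]
    htwuis = []
    while level:
        htwuis.extend(level)
        current, k = set(level), len(level[0])
        candidates = set()
        for x, y in combinations(level, 2):
            c = x | y  # join step
            # prune step: every k-subset of a candidate must itself be an HTWUI
            if len(c) == k + 1 and all(frozenset(s) in current
                                       for s in combinations(c, k)):
                candidates.add(c)
        level = [c for c in candidates if twu(c, db) >= minutil]
    # Phase 2: compute exact utilities and keep only the true HUIs.
    result = {}
    for x in htwuis:
        u = utility(x, db)
        if u >= minutil:
            result[tuple(sorted(x))] = u
    return result


if __name__ == "__main__":
    print(two_phase(DB, 15))
```

On this toy data with minutil = 15, {b, c, d} is an HUI (utility 15) although {b}, {c}, and {b, c} are not, which illustrates why utility itself cannot be used for downward-closure pruning, whereas TWU can.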
• IIDS, FUM, and DCG+ [34]. The itemset share mining problem [60] can be converted to the utility mining problem by replacing the frequency value of each item in every transaction by its total profit (i.e., multiplying the frequency value by its unit profit). The isolated items discarding strategy (IIDS) [34] reduces the number of candidates in every database scan. By discarding isolated items to reduce the number of candidates and to shrink the database scanned in each pass, IIDS can improve the level-wise, multi-pass candidate-generation process. Applying IIDS to ShFSM [60] and DCG [60], Fast Utility Mining (FUM) [34] and Direct Candidates Generation (DCG)+ [34] were further developed. The results showed that FUM and DCG+ [34] are better than MEU, UMining, UMining_H, and Two-Phase. However, both still suffer from the problem of generating and testing candidates in a level-wise way and require multiple database scans.

Discussions. In summary, all of the early UPM approaches improved on the Apriori work [9]. As shown in Table 4, an important drawback is that all of them need to generate a huge amount of candidates since they rely on a loose upper bound on the utilities of candidates. As a result, these approaches may suffer from long execution times (computationally expensive) and consume a huge amount of memory. Moreover, all these algorithms suffer from the same limitations as Apriori-based ARM algorithms, namely generating candidates that do not appear in the database and performing multiple database scans to mine the desired information. The computational complexity of these Apriori-based UPM techniques depends on the level-wise manner that generates a huge number of candidates. These techniques may have quadratic complexity if the processed data contain long transactions or a low minutil threshold value is used.

3.3 Tree-Based Pattern-Growth Approaches
Many early HUIM approaches perform a level-wise exploration of the search space to find HUIs. To avoid the drawback of an Apriori-based level-wise search, several tree-based HUIM algorithms were then proposed to efficiently mine HUIs based on the TWDC property and the pattern-growth mining approach. Generally speaking, the tree-based UPM algorithms are inspired by the traditional tree-based FPM algorithms, e.g., FP-Growth [7]. For the mathematical formalism of the pattern-growth method with the FP-tree, readers can refer to [7].

• HYP Tree [78] and HUC-Prune [79]. In 2007, Hu et al. proposed an approximation method that identifies the contribution of the predefined utility, objective function, and performance metric, and can take advantage of item attributes [78]. It identifies high-utility combinations and then finds HUIs through a high-yield partition (HYP) tree [78]. In contrast to the traditional FPM and ARM techniques, its goal is to find segments of data and combinations of items/rules that satisfy certain conditions and maximize a predefined objective function. Different from the former UPM approaches, it conducts “rule-discovery” with respect to individual attributes and the overall criterion for the discovered results. It aims at mining groups of patterns that, when combined, contribute the most to an objective function [78]. Since the Apriori-like HUIM algorithms suffer from the candidate generation-and-test problem, Ahmed et al. proposed a novel tree-based algorithm named High-Utility Candidate Prune (HUC-Prune). The proposed HUC-tree is a prefix tree storing the candidate items in descending order of TWU values. Each node in the HUC-tree consists of the item’s name and its TWU value. Similar to IHUP [31], HUC-Prune replaces the level-wise candidate generation process by a pattern-growth mining approach. It needs at most three database scans for mining the HUIs, and has better performance than the Apriori-based algorithms.

• IHUP (Incremental High-Utility Pattern Mining) [31]. The tree-based algorithm named IHUP with three tree structures (IHUP$_{PL}$-tree, IHUP$_{TF}$-tree, and IHUP$_{TWU}$-tree) was proposed for incremental and interactive high-utility pattern mining [31]. Each node in the IHUP-tree represents an itemset and consists of the itemset’s name, a TWU value, and a support count. The IHUP algorithm has three steps: 1) construction of the IHUP-tree, 2) generation of HTWUIs [77], and 3) identification of high-utility itemsets. Step 1 is similar to the construction of the FP-tree [7], and a complete illustrated example is given in [31], [47]. In each transaction, the items in HTWUI$_1$ are sorted in TWU-descending order and then inserted one after another into the prefix-based IHUP-tree. Fig. 3 shows the constructed IHUP-tree for the example database given in [48]. In step 2, all HTWUI$_k$ are generated from the constructed IHUP-tree using the FP-Growth [7] approach. In step 3, all HUIs and their utilities can be identified from the set of HTWUIs by scanning the database once. Thus, IHUP can avoid generating candidates in a level-wise way. Although IHUP significantly outperforms IIDS [34] and Two-Phase [77], it still produces too many HTWUIs in its candidate-generation phase (phase 1). Note that both IHUP and Two-Phase use the TWU concept to overestimate the utilities of itemsets. Thus, they produce the same huge number of HTWUIs in phase 1. Such a large number of HTWUIs substantially degrades the mining performance in terms of execution time and memory consumption. Moreover, the performance of the identification phase (phase 2) is affected by the number of HTWUIs: the more HTWUIs are generated in phase 1, the longer the execution time required for identifying HUIs in phase 2.

Fig. 3. Example IHUP$_{TWU}$-tree structure (used by IHUP [48]).

• UP-Growth [80] and UP-Growth+ [48]. Tseng et al. designed a more compressed utility-pattern tree (UP-tree) and proposed the well-known utility pattern-growth algorithm (UP-Growth) [80] to efficiently mine HUIs. UP-Growth is inspired by the frequency-based FP-Growth method. It integrates four novel strategies, named DLU (Discarding Local Unpromising items), DLN (Decreasing Local Node utilities), DGU (Discarding Global Unpromising items during the construction of a global UP-tree), and DGN (Decreasing Global Node utilities during the construction of a global UP-tree). After two scans of the original database,
the UP-tree can be constructed. In the first scan, the utility of each transaction and the TWU of each single item are calculated. Following the DGU strategy, the unpromising items (those that are not HTWUIs) are removed from each transaction, and their utilities are eliminated as well. Then, the remaining promising items in each transaction are sorted in descending order of TWU. In the second scan, transactions are inserted into the UP-tree by using the DGU and DGN strategies. After building the complete UP-tree, as shown in Fig. 4, the potential HUIs (PHUIs) are generated from the global UP-tree with the DLU and DLN strategies. In summary, the framework of UP-Growth consists of three steps: 1) scan the database twice to construct a global UP-tree with the DGU and DGN strategies; 2) recursively generate PHUIs from the global UP-tree and local UP-trees by UP-Growth with the DLU and DLN strategies; and 3) identify the final HUIs from the set of PHUIs. As an improvement of UP-Growth, UP-Growth+ [48] was then developed by utilizing the minimal utilities of each node in each path in the UP-tree. Compared with UP-Growth, the enhanced UP-Growth+ can decrease the overestimated utilities of PHUIs and greatly reduce the number of candidates.

Fig. 4. Example UP-tree structure (used by UP-Growth algorithm [48]).

• CHUI-Mine [81]. Later, Song et al. proposed the Concurrent High-Utility Itemset Mine (CHUI-Mine) algorithm for mining HUIs by dynamically pruning the CHUI-tree structure. The CHUI-tree is introduced to capture the important utility information of the candidate itemsets. By recording changes in the support counts of candidates during the tree construction process, it performs dynamic CHUI-tree pruning. CHUI-Mine makes use of a concurrent strategy, enabling the simultaneous construction of a CHUI-tree and the process of discovering HUIs. Thus, it can reduce the problem of huge memory usage for tree construction and traversal in tree-based HUIM algorithms. Concurrent processes generally interact through the following two mechanisms: shared variables and message passing [81]. Extensive experimental results show that CHUI-Mine is both more efficient and more scalable than Two-Phase [77], FUM [34], and HUC-Prune [79]. However, it was not compared with all of the faster tree-based algorithms, including IHUP [31], UP-Growth [80], and UP-Growth+ [48].

• Sum of Item Quantities-Tree (SIQ-Tree) [82]. Tree construction with a single database scan is significant since a database scan is a time-consuming task. In utility mining, an additional database scan is necessary to identify actual high-utility patterns from candidates. A novel tree structure, namely the Sum of Item Quantities tree (SIQ-tree) [82], was developed to capture database information through a single pass. Moreover, a restructuring method is proposed with strategies for reducing overestimated utilities. It can construct the SIQ-tree with only a single scan and effectively decrease the number of candidate patterns thanks to the reduced overestimated utilities, which improves mining performance. This approach is faster than IHUP [31] on most datasets, but it was not compared with the faster UP-Growth [80] and UP-Growth+ [48].

Discussions. Characteristics and differences of these tree structures are presented in Table 5. In addition, there are various other utility-based mining methods based on tree structures [83], [84]. These tree-based algorithms comprise three steps: 1) construction of trees, 2) generation of candidate HUIs from the trees using the designed pattern-growth approach, and 3) identification of HUIs from the set of candidates. Although these trees are compact, they may not be minimal and may still occupy a large memory space. The mining performance is closely related to the number of conditional trees constructed during the entire mining process and the construction/traversal cost of each conditional tree. When using these algorithms on a large database with a low utility threshold, the storage and traversal costs of numerous conditional trees are high. Thus, one of the performance bottlenecks of these algorithms is the generation of a huge number of conditional trees, which has high time and space costs.

In summary, compared with the Apriori-like UPM algorithms, the advantages of pattern-growth tree-based techniques are as follows: 1) they need only two or three passes over the dataset; 2) they “compress” datasets into the tree structure; 3) they avoid level-wise candidate generation; and 4) they are much faster than Apriori-like approaches. However, they still have some disadvantages: 1) the constructed tree may not fit in memory; 2) the designed tree is expensive to build; 3) it is time-consuming to recursively process all conditional prefix trees to generate candidates; and 4) the constructed tree is sensitive to the minutil parameter.

3.4 Projection-Based Pattern-Growth Approaches
In the past, some projection-based techniques have been commonly used in data mining, e.g., FP-Growth [7] for FPM and PrefixSpan [16] for SPM. The general idea of projection-based pattern mining is to use target items to recursively project the processed database into some smaller projected sub-databases, and then grow the itemset or subsequence fragments in each projected sub-database. To overcome the disadvantages of the tree-based HUIM approaches, some projection-based techniques have been developed for UPM. The basic mathematical formalism of projection and projected sub-databases is skipped here; details can be found in [16], [85].

• CTU-PRO and CTU-PROL [33]. In 2007, Erwin et al. first proposed a projection-based CTU-PRO algorithm for HUIM. It mines HUIs by bottom-up traversal of a compressed utility pattern tree (CUP-tree) [33], which is a variant of the CTU-tree [83]. The mining of a subdivision from the CUP-tree consists of three steps: 1) construction of the ProItem-Table, 2) construction of the ProCUP-tree, and 3) mining by ProCUP-tree traversal. CTU-PRO creates a GlobalCUP-tree from the transaction database after identifying the individual HTWUIs [77] with the concept of TWU. For each HTWUI, a smaller projection tree called the LocalCUP-tree is extracted from the GlobalCUP-tree for returning all high-utility itemsets with that item as prefix. CTU-PRO constructs parallel sub-divisions on disk that can be mined

TABLE 5
Tree-Based Pattern-Growth Algorithms for High-Utility Pattern Mining

| Name | Description | Pros. | Cons. | Year |
|---|---|---|---|---|
| HYP tree [78] | An approximation method identifies the utility contribution. | It can find segments of data through combinations of few items/rules. | The built HYP tree is huge, and the memory is costly. | 2007 |
| CTU-Mine [83] | A pattern-growth approach based on a compact data representation named CTU-tree for utility mining. | The pattern growth, which avoids candidate generation-and-test, is suitable for dense data. | The CTU-tree is complex and stores too much information, which may consume much memory. | 2007 |
| HUC-Prune [79] | High-utility candidate prune without level-wise candidate generation-and-test. | It replaces the level-wise candidate generation process by a pattern-growth mining approach. | The upper bound is high, and the constructed tree is huge. | 2009 |
| IHUP [31] | A tree-based approach for incremental and interactive high-utility pattern mining. | The IHUP-tree is more compact than previous trees, and IHUP is significantly faster than IIDS and Two-Phase. | It still uses the TWU and produces too many HTWUIs in phase 1. | 2009 |
| UP-Growth [80] | The utility pattern growth algorithm with more compressed utility pattern tree (UP-tree). | UP-tree is more compact than IHUP-tree, and the strategies are powerful to reduce the number of candidates. | It is time-consuming to recursively process all conditional prefix trees for candidate generation. | 2010 |
| UP-Growth+ [48] | An improved version of UP-Growth with two pruning strategies. | The enhanced UP-Growth+ can decrease the overestimated utilities of PHUIs and greatly reduce the number of candidates. | It is time-consuming to recursively process all conditional prefix trees for candidate generation. | 2013 |
| CHUI-Mine [81] | Concurrent High-Utility Itemset Mine (CHUI-Mine) by dynamically pruning the CHUI-tree. | Two mechanisms, shared variables and message passing, can make the CHUI-tree more compact. | Recursively processing all conditional prefix trees is time-consuming. | 2014 |
| SIQ-tree [82] | Sum of Item Quantities. | Constructs the SIQ-tree with only a single scan and decreases the number of candidate patterns. | Recursively processing all conditional prefix trees is time-consuming. | 2016 |
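The preprocessing that the text attributes to UP-Growth before the UP-tree is built (computing the utility of each transaction and the TWU of each item, discarding global unpromising items, and sorting the remaining items in descending order of TWU) can be sketched as follows. This is only an illustration under stated assumptions: the toy data, the function name `reorganize`, and the tie-breaking by item name are hypothetical, and neither the tree insertion nor the DGN, DLU, and DLN strategies are shown.

```python
# Hypothetical toy data: item -> quantity per transaction, plus unit profits.
DB = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 3},
    {"b": 1, "c": 1, "d": 4},
    {"a": 1, "d": 1},
]
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3}


def reorganize(db, profit, minutil):
    """Return (twu, reorganized), where reorganized holds one
    (sorted promising items, reduced transaction utility) pair per transaction."""
    # First scan: transaction utilities and the TWU of every single item.
    tus = [sum(q * profit[i] for i, q in t.items()) for t in db]
    twu = {}
    for t, tu in zip(db, tus):
        for i in t:
            twu[i] = twu.get(i, 0) + tu

    # Items whose TWU is below minutil are globally unpromising (DGU).
    promising = {i for i, v in twu.items() if v >= minutil}

    # Second pass: drop unpromising items, remove their utilities from the
    # transaction utility, and sort what remains by descending TWU.
    reorganized = []
    for t in db:
        kept = sorted((i for i in t if i in promising),
                      key=lambda i: (-twu[i], i))  # name tie-break is our choice
        reduced_tu = sum(t[i] * profit[i] for i in kept)
        reorganized.append((kept, reduced_tu))
    return twu, reorganized


if __name__ == "__main__":
    twu, reorganized = reorganize(DB, PROFIT, minutil=25)
    for items, rtu in reorganized:
        print(items, rtu)
```

With minutil = 25, item d (TWU 23) is discarded, so the third transaction shrinks from utility 15 to 3; every later overestimate that involves this transaction shrinks accordingly, which is the effect the DGU strategy aims for.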

independently. The performance of CTU-PRO is better than that of Two-Phase [77] and CTU-Mine [83]. CTU-PROL introduces two new concepts, compressed transaction utility-prol and CUP-tree, which are used for parallel projection of the transaction database. Note that the anti-monotone property of TWU is used to prune the search space of sub-divisions in CTU-PROL. However, unlike Two-Phase, it avoids a rescan of the database to calculate the actual utilities of HTWUIs. The results show that CTU-PROL outperforms Two-Phase [77] and CTU-Mine [83].

• GPA and PB [85]. Since the tree-based pattern-growth approaches recursively perform tree traversal and generate a series of sub-tree structures, Lan et al. proposed two alternative efficient projection-based utility mining approaches, named Gradual Pruning Approach (GPA) [85] and PB (Projection-Based mining approach) [85]. Compared with the level-wise techniques, a projection-based technique is more suitable for improving the utility upper bound. The general idea is to use the overestimated HTWUIs [77] to recursively project item/sequence databases into some smaller projected databases and grow item/subsequence fragments in each projected sub-database. In addition, PB applies a novel pruning strategy and an indexing mechanism to speed up the runtime and reduce the memory requirement of the mining process. The indexing mechanism imitates traditional projection algorithms (e.g., PrefixSpan [16]) by projecting sub-databases. Using projection, GPA and PB can significantly reduce database size when deriving larger itemsets and outperform Two-Phase [77].

• PTA [85]. Different from PB [85] and GPA [85], pruning and filtering strategies are proposed to tighten the upper bounds of utility values in the projection-based upper-bound tightening approach (abbreviated as PTA). The framework of PTA includes the following: 1) it finds HTWUIs and high-utility 1-itemsets; 2) it performs the pruning strategy and the indexing strategy; 3) it projects the transactions required by the prefix itemsets to be processed; and 4) it finds k-HTWUIs and high-utility k-itemsets. An effective index mechanism is applied to reduce the time cost of searching relevant transactions that need to be projected in sub-databases. Thus, PTA only needs one database scan. The experimental results show that PTA outperforms the other existing algorithms (i.e., Two-Phase [77], GPA [85], PB [85], CTU-PRO [33], IHUP$_{PL}$ [31], IHUP$_{TWU}$ [31], and IHUP$_{TF}$ [31]) in terms of pruning unpromising itemsets, memory usage, and runtime.

Discussions. In summary, the above UPM approaches, which utilize the database projection mechanism, have the following advantages: 1) they mine the complete set of high-utility patterns but reduce the effort of candidate generation; 2) prefix-projection reduces the size of the projected sub-database and leads to efficient processing; and 3) bi-level projection and pseudo-projection may improve mining efficiency, as summarized in Table 6.

3.5 New Data-Format-Based Approach
To achieve more efficiency than the tree-based UPM approaches, some algorithms that mine high-utility itemsets using a vertical or horizontal data structure with a single

TABLE 6
Projection-Based Pattern-Growth Approaches for UPM

| Name | Description | Pros. | Cons. | Year |
|---|---|---|---|---|
| CTU-PRO [33] & CTU-PROL [33] | Two projection-based algorithms with Compressed Utility Pattern-tree (CUP-tree). | They construct parallel sub-divisions on disk that can be mined independently and have good performance. | TWU is adopted as the upper bound, which generates many redundant candidates. | 2008 |
| GPA and PB [85] | Two projection-based mining approaches, GPA (Gradual Pruning Approach) and PB (Projection-Based mining approach). | Using projection, they can speed up the runtime and reduce database size when deriving larger itemsets. | TWU is adopted as the upper bound, which generates many redundant candidates. | 2012 |
| PTA [85] | A projection-based upper-bound tightening approach. | Two effective strategies, named pruning and filtering, are proposed to tighten the upper bounds of utility values. | The projection of sub-databases is sometimes time-consuming. | 2012 |

TABLE 7
Utility-List-Based Algorithms for High-Utility Pattern Mining

| Name | Description | Pros. | Cons. | Year |
|---|---|---|---|---|
| HUI-Miner [86] | The first one-phase model to mine high-utility itemsets. | It first introduced the concept of the remaining utility and the vertical data structure w.r.t. utility-list. | The join operations between utility lists of (k+1)-itemsets and k-itemsets is time-consuming. | 2012 |
| d2HUP [88] | Another algorithm that can directly discover HUIs without maintaining candidates. | It efficiently obtains the utility of each enumerated itemset and the upper bound on utilities using the CAUL. | The tree structure and CAUL consume more memory. | 2012 |
| FHM [87] | An improved version of HUI-Miner with a pruning strategy named EUCP. | It not only has the advantages of HUI-Miner but also reduces the join operations between utility lists. | It consumes slightly more memory than HUI-Miner and has poor performance on dense datasets. | 2014 |
| HUP-Miner [89] | An improved version of HUI-Miner with two new pruning strategies (PU-Prune and LA-Prune). | The two new pruning strategies can reduce the join operations between utility lists. | It needs to explicitly set the number of dataset partitions, while these partitions cannot always improve the efficiency. | 2015 |
| EFIM [90] | Uses projection and transaction-merging techniques for reducing the cost of database scans. | It consumes less memory, and its complexity is roughly linear with the number of items in the search space. | Sometimes the recursive projection is time-consuming and uses a lot of memory. | 2015 |
| IMHUP [91] | A novel utility-list-based algorithm for HUPs mining without any candidate generation. | It uses the indexed utility list to reduce the join operations between utility lists. | The upper bound on utilities is not tight enough. | 2017 |
| mHUIMiner [92] | An efficient one-phase algorithm that combines some ideas from HUI-Miner and IHUP. | It utilizes utility list and remaining utility and performs well on sparse datasets. | It still suffers from some problems similar to those of HUI-Miner and IHUP. | 2017 |
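The utility-list structure on which most algorithms in Table 7 rely can be illustrated with a small Python sketch. It is a simplified reading of the idea in [86], not the original implementation: the toy data, the function names, the TWU-ascending processing order `ORDER`, and the definition of remaining utility as the utility of the items that follow in that order are assumptions made for illustration. Only the construction of 1-itemset lists and the join into a 2-itemset list are shown.

```python
# Hypothetical toy data: item -> quantity per transaction, plus unit profits.
DB = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 3},
    {"b": 1, "c": 1, "d": 4},
    {"a": 1, "d": 1},
]
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3}
ORDER = ["d", "b", "a", "c"]  # assumed total order (ascending TWU on this data)


def build_utility_lists(db, profit, order):
    """Build one list of (tid, iu, ru) tuples per single item."""
    rank = {item: r for r, item in enumerate(order)}
    lists = {item: [] for item in order}
    for tid, t in enumerate(db):
        items = sorted((i for i in t if i in rank), key=rank.get)
        remaining = sum(t[i] * profit[i] for i in items)
        for i in items:
            iu = t[i] * profit[i]
            remaining -= iu  # utility of the items after i in this transaction
            lists[i].append((tid, iu, remaining))
    return lists


def join(ul_x, ul_y):
    """Utility-list of {x, y}, where x precedes y in the processing order."""
    by_tid = {tid: (iu, ru) for tid, iu, ru in ul_y}
    return [(tid, iu + by_tid[tid][0], by_tid[tid][1])
            for tid, iu, _ in ul_x if tid in by_tid]


def utility_and_bound(ul):
    """Exact utility (sum of iu) and the iu + ru upper bound for extensions."""
    return sum(iu for _, iu, _ in ul), sum(iu + ru for _, iu, ru in ul)


if __name__ == "__main__":
    lists = build_utility_lists(DB, PROFIT, ORDER)
    print(join(lists["a"], lists["c"]))
    print(utility_and_bound(join(lists["d"], lists["b"])))
```

Because the exact utility of an itemset is the sum of the iu fields, no further database scan is needed once the lists are in memory; the sum of iu + ru serves as the upper bound used to decide whether extensions of the itemset are worth exploring.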

phase were proposed recently, such as HUI-Miner [86], FHM [87], d2HUP [88], HUP-Miner [89], and EFIM [90]. Both d2HUP and EFIM use a horizontal database, while the others use the vertical data structure. All these algorithms can avoid not only the disadvantages of Apriori-based approaches but also the disadvantages of the tree-based HUIM approaches. Details are shown in Table 7 and described below.

• HUI-Miner [86] and FHM [87]. High-Utility Itemset Miner (HUI-Miner) [86] is the first one-phase algorithm to discover HUIs. It proposes a vertical data structure named utility-list [86] and the concept of remaining utility [86], which have been widely extended in many newer UPM algorithms. As a compact data structure, the utility-list can store utility information for the potential patterns that may have high utility value. The utility-list of an itemset X in a database D is a set of tuples corresponding to the transactions in which X appears. Each tuple is defined as <tid, iu, ru> for every transaction $T_q$ containing X, in which the tid element is the transaction identifier of $T_q$, the iu element is the utility value of X in $T_q$, and the ru element is the remaining utility value of X in $T_q$. More details about the remaining utility, the utility-list structure, and its construction can be found in [86]. The construction process of the utility-list is quite efficient and consumes little memory. By keeping necessary information from the transaction database in memory, HUI-Miner can directly mine HUIs by spanning the search space w.r.t. a set-enumeration tree [93]. As an enhanced version of HUI-Miner [86], Fast High-Utility Miner (FHM) [87] utilizes a novel pruning strategy named Estimated Utility Co-occurrence Pruning (EUCP) to reduce the costly join operations of utility-lists. EUCP is based on the Estimated

Utility Co-Occurrence Structure (EUCS) [87]. Using the utility-list [86], HUI-Miner and FHM need only two database scans to construct the utility-lists of HTWUI^1 (the 1-itemsets with high TWU). Then, the utility-lists of (k+1)-itemsets can be obtained by joining the utility-lists of k-itemsets. Both algorithms can directly discover HUIs by keeping the utility-lists in memory and by utilizing the upper bound derived from the remaining utility. HUI-Miner and FHM outperform all previous algorithms on most datasets in terms of running time (almost two orders of magnitude faster) and memory cost. FHM is faster than HUI-Miner [86], especially for dense databases, but it is not efficient for sparse databases. However, the drawback is that both of them need to perform join operations among a series of utility-lists, which can be time-consuming. Note that some quantitative results are already reported on the same benchmark datasets in [86], [87].

• d2HUP [88]. d2HUP is also able to directly discover HUIs without candidate generation. It utilizes another novel data structure, named Chain of Accurate Utility Lists (CAUL) [88], to store the necessary information. In contrast to HUI-Miner, it enumerates an itemset as a prefix extension of its prefix itemset. In fact, the search space of d2HUP is a variant of the set-enumeration tree [93]. It can efficiently calculate the utility of each enumerated itemset and the upper bound on the utilities of the prefix-extended itemsets. d2HUP also utilizes a concept similar to the remaining utility to tighten the utility upper bound, which is much tighter than TWU. This upper bound is tightened by iteratively filtering out irrelevant items when constructing CAUL. Moreover, it requires less memory than the different kinds of tree structures used in the above-mentioned algorithms. d2HUP was shown to be more efficient than Two-Phase [77], UP-Growth [80], and HUI-Miner [86], but its performance was not compared with some recent algorithms, such as FHM [87] and HUP-Miner [89].

• HUP-Miner. HUP-Miner [89] is an improved algorithm based on HUI-Miner [86]. Two new pruning strategies, PU-Prune (based on dataset partition) and LA-Prune (based on the concept of lookahead pruning), are introduced in HUP-Miner to limit the search space for mining HUIs [89]. It needs to set the number of dataset partitions K, which determines how many partitions are processed internally. However, the optimal value of K is hard to find empirically for a given dataset. Based on the concept of remaining utility [86], LA-Prune provides a tighter utility upper bound for any k-itemset. Thus, a huge number of unpromising k-itemsets (k ≥ 2) that have low utility can be pruned. It has been shown that HUP-Miner is significantly faster than HUI-Miner. However, the PU-Prune strategy based on dataset partition does not always have an effect on runtime and memory consumption. In addition, a shortcoming is that the number of partitions must be set explicitly by users, since it is an additional parameter.

• Index High-Utility Itemsets Mine (IHUI-Mine) [94]. As mentioned before, candidate generation-and-test approaches suffer from the drawbacks of having an immense candidate pool and requiring several database scans. Meanwhile, methods based on pattern growth tend to consume large amounts of memory to store conditional trees. IHUI-Mine uses the subsume index [95], a data structure for efficient frequent itemset mining, to enumerate the desired HUIs and prune the search space. The experimental results show that IHUI-Mine outperforms some popular algorithms, including Two-Phase [77], FUM [34], and HUC-Prune [79], but it has not been compared with the state-of-the-art algorithms.

• IMHUP [91]. In the framework of list-based high-utility pattern mining, the many comparison and join operations on list entries cause enormous execution time costs. Based on the indexed utility-list (IU-list) [91], two techniques were developed in Indexed list-based Mining of High-Utility Patterns (IMHUP) to reduce utility upper-bounds that satisfy the anti-monotonic property. IMHUP-RUI and IMHUP-CHI [91] generate high-utility patterns without constructing additional local lists when the current lists only contain information of the same revised transactions. They further utilize the upper-bound utilities in IU-lists to decrease the search space.

• EFIM [90]. The projection-based EFficient high-utility Itemset Mining (EFIM) algorithm introduces several new ideas, including two new upper bounds named revised sub-tree utility and local utility, and an array-based utility computing technique. To reduce the cost of database scans, EFIM further proposes database projection and transaction merging techniques named High-utility Database Projection (HDP) and High-utility Transaction Merging (HTM). As larger itemsets are explored, both projection and merging reduce the size of the database. The main ideas of HDP and HTM are described in [90]. The time and space complexity of EFIM is roughly linear with the number of distinct items in the search space. The reported results show that EFIM is in general 2 to 3 orders of magnitude faster than the state-of-the-art algorithms (UP-Growth+ [48], HUI-Miner [86], FHM [87], d2HUP [88], and HUP-Miner [89]) on dense datasets and performs quite well on sparse datasets.

• mHUIMiner. mHUIMiner [92] is a hybrid algorithm that combines some ideas from HUI-Miner [86] and IHUP-tree [31]. It adopts the utility-list and the remaining utility. It utilizes a tree structure to guide the itemset expansion process, and thus itemsets that do not exist in the database can be avoided. Unlike current techniques, it does not have a complex pruning strategy that requires expensive computational overhead. It was shown to perform well on sparse datasets, where it provides the best runtime, while having performance comparable to other state-of-the-art algorithms (e.g., HUI-Miner [86], FHM [87], and EFIM [90]) on dense datasets.

Discussions. All the algorithms discussed in this subsection utilize new data structures to store the necessary information about each itemset. By spanning the search space w.r.t. a set-enumeration tree [93], they can easily calculate the total utility of an itemset by performing join operations on the built utility-lists. Moreover, an upper bound on the overall utilities of itemsets, called the remaining utility, is calculated using utility-lists. It can be used to determine whether a pattern and its extensions cannot be high-utility itemsets (to reduce the search space). The upper bound with remaining utility is equivalent to the upper bound proposed in d2HUP [88]. Although the pattern-growth approach in d2HUP can avoid considering itemsets not appearing in the database, the hyper-structure it uses still consumes a considerable amount of memory [88]. Some competitive results of these

UPM methods have been compared and summarized in recent studies [90], [92].

As shown in Table 7, the HUIM algorithms are based on a combination of vertical or horizontal data formats and typical approaches. These hybrid algorithms combine different techniques to mine high-utility patterns in such a way that the strengths of each technique are utilized to maximize their efficiency. The properties of these one-phase algorithms are as follows:

1) Complete result: Completeness is guaranteed by the traversal of the search space w.r.t. the set-enumeration tree.
2) Stable result: The result is stable, as all exact utility information is stored in a vertical or horizontal data structure. Depth-first search is also used to quickly calculate the utilities.
3) Efficiency: The algorithm is efficient relative to algorithms that traverse the complete search space. Moreover, the sort order of items in the set-enumeration tree affects the mining efficiency, but not the final mining results.
4) Parameter sensitivity: These algorithms, except for HUP-Miner, only have minutil as a parameter, and are sensitive to it.

4 ADVANCED TOPIC OF UPM

4.1 Mining High Average Utility Itemsets
A main challenge in HUIM is that the search space is exponential and becomes extremely large when the number of distinct items or the size of the database is large. The other challenge is that existing HUIM methods overlook the fact that longer itemsets result in higher utility values. A large itemset may have an unreasonable estimated profit as opposed to its actual value. Therefore, the concept named high average-utility itemset mining (HAUIM) was proposed [72]. HAUIM discovers utility patterns by considering both their utilities and lengths, thus providing a different utility measure than traditional HUIM. HAUIM divides the utility of an itemset by its length (the number of items that the itemset contains). Up to now, several approaches have been studied, such as Apriori-based algorithms [72], projection-based PAI [96], utility-list-based HAUI-Miner [97], [98], and other hybrid algorithms with different upper-bound models [98], [99].

4.2 HUIM in Dynamic Environments
In a wide range of applications, the processed data are commonly dynamic rather than static. Dynamic data are more complicated and difficult to handle than static data. Most algorithms process a static database to mine HUIs. In real-world applications, records/transactions are dynamically changed (i.e., inserted, deleted, and modified) in the original database. Some preliminary studies have been done on this issue for UPM.

• Case 1: HUIM with Record Insertion. Data mining is an iterative process, and incremental data mining [100], [101] provides the ability to continuously analyze and mine the data by using previous data structures and mining results. Up to now, some incremental models have been developed for mining HUIs with record insertion, such as IHUP [31], FUP-HUI-INS [102], PRE-HUI-INS [103], HUI-list-INS [104], and EIHI [105]. Among these, the early algorithms, e.g., FUP-HUI-INS and PRE-HUI-INS, utilize utility-oriented dynamic maintenance strategies that extend the original FUP [100] and pre-large [101] concepts. Since the FUP-HUI-INS and PRE-HUI-INS algorithms follow the Two-Phase model, an additional database rescan is still necessary to find the actual HUIs. Furthermore, computations are required to find the HTWUIs based on the pattern-growth approach. Both HUI-list-INS and EIHI utilize the utility-list [86] and utility properties to significantly reduce runtime and memory usage. More complete reviews can be found in [47].

• Case 2: HUIM with Record Deletion. In practical situations, record deletion is also an important issue in databases. Cheung et al. designed the FUP2 concept [106] to discover frequently updated itemsets for record deletion. Hong et al. developed the pre-large concept [101] for handling record deletion to avoid multiple database scans each time. Two support thresholds are separately set in pre-large [101], and thus the original database is not required to be scanned until the number of accumulated deleted transactions reaches the designed safety bound. Since the FUP2 concept [106] cannot be directly applied to HUIM, Lin et al. separately designed the FUP-HUI-DEL [107] and PRE-HUI-DEL [108] algorithms for handling record deletion to maintain and update the new HUIs based on the Two-Phase model. Recently, an efficient dynamic algorithm named HUI-list-DEL [109] was developed to discover HUIs by maintaining the built utility-list [86] structure for record deletion in dynamic databases. The new HUIs can be directly produced without candidate generation or numerous database scans.

• Case 3: HUIM with Record Modification. As one of the three common operations (record insertion, deletion, and modification) in databases, record modification is also commonly seen in real-life situations. For example, some typos or errors may occur when the data collected from periodic transactions are entered into a computer using a keyboard. Thus, some information may become invalid or new information may arise. Lin et al. first proposed the FUP-HUP-tree-MOD algorithm [110] to address this issue. It is based on the FUP concept [100] and shows better performance compared to Two-Phase and some tree-based algorithms in batch mode. In addition, a faster PRE-HUI-MOD algorithm [111] extends the pre-large concept [101] to set the effective upper bound for discovering HTWUIs and HUIs from dynamic databases.

4.3 Concise Representations of Utility Patterns
In the field of FPM, many techniques have been devised to derive compact representations of frequent patterns that eliminate redundancy but retain rich information, such as free sets [112], non-derivable sets [113], maximal itemsets [114], and closed itemsets [115]. These representations significantly reduce the number of extracted frequent patterns, but some lead to a loss of information (e.g., maximal itemsets [114]). Although the above UPM methods perform well in some cases, their performance may degrade when the minimum utility threshold is low. A large number of HUIs and candidates lead to long execution times and huge memory consumption. When computing resources are limited, this is a serious problem for the mining task. Moreover, a large

number of HUIs is difficult for users to comprehend and analyze. Thus, it is often impractical to generate and return the entire set of HUIs.

• Maximal High-Utility Pattern. To return representative HUIs to users, some concise representations of HUIs were proposed. Chan et al. introduced the concept of a utility frequent closed pattern [23], the definition of which is different from that of a high-utility itemset [48], [52]. Shie et al. then proposed a new representation called the maximal high-utility itemset, that is, a HUI that is not a subset of any other HUI [69]. Although the maximal HUI representation reduces the number of extracted HUIs, it is not lossless because the utilities of the subsets of a maximal HUI cannot be known without rescanning the database. Moreover, recovering all HUIs from the set of maximal HUIs is very inefficient since many subsets of a maximal HUI may have low utility.

• Closed High-Utility Pattern. To provide not only compact but also complete information about high-utility itemsets to users, Tseng et al. first addressed the problem of redundancy in high-utility itemset mining [116]. A lossless and compact representation named closed high-utility itemset [116] was introduced. To mine this representation, they proposed three algorithms named AprioriHC (Apriori-based approach for mining High-utility Closed itemsets), AprioriHC-D (AprioriHC algorithm with Discarding unpromising and isolated items), and Closed High-Utility Itemset Discovery (CHUID) [116]. Fournier-Viger et al. then proposed a fast and memory-efficient algorithm named EFIM-Closed [117] to discover closed HUIs by extending the EFIM model [90]. It proposes three strategies to mine CHUIs efficiently: closure jumping, forward closure checking, and backward closure checking. EFIM-Closed relies on two new upper bounds, named local utility and sub-tree utility, to prune the search space, and it can calculate these upper bounds efficiently. Inspired by the utility-list [86], some more efficient one-phase algorithms have been proposed to address this interesting issue, such as CHUI-Miner [118] and CHUM [119].

4.4 Mining High-Utility Quantitative Itemsets/Rules
Although extensive studies have been proposed for high-utility itemset mining, a critical limitation of these studies is that they ignore the quantity attribute of items in the discovered HUIs. However, such information can be very useful and valuable in many applications. In view of this, the concept of High-Utility Quantitative Itemset mining (abbreviated as HUQI) [38], [120] has emerged. In the framework of HUQI mining, an item may have different quantities in the database, and each item carrying a different quantity is regarded as a quantitative item. HUQI [38] and the more efficient vertical utility-list-based VHUQI [120] were thus developed. An example of such a rule is (bread, 3, 4) ⇒ (milk, 2, 3), which means that most customers who purchased three or four breads also purchased two or three milks. We can use this information to package products with quantities that have high utility and estimate the number of items that need to be reserved according to the number of other items.

4.5 High-Utility Sequential Pattern Mining
By integrating the utility factor and sequence data, the problem of high-utility sequential pattern mining (HUSPM) was introduced. For handling the utility of web log sequences, two tree structures, called the utility-based WAS tree (UWAS-tree) [63] and the incremental UWAS-tree (IUWAS-tree) [63], were developed to mine web access sequences. However, a sequence element with multiple items, such as [(a, 3)(c, 4)], cannot be supported in these two models. The considered scenarios are rather simple, which limits their applicability for handling complex sequences. To this end, some algorithms were proposed to address the HUSPM problem.

• UL and US [41]. Since both the UWAS-tree and IUWAS-tree algorithms cannot deal with sequences containing multiple items in each sequence element (transaction), Ahmed et al. designed two algorithms (level-wise Utility-Level (UL) [41] and pattern-growth Utility-Span (US) [41]) to mine HUSPs. UL and US extend traditional sequential pattern mining (SPM). The utility of a sequential pattern is calculated in two ways. The utilities of sequences having only distinct occurrences are added together, while the highest occurrences are selected from sequences with multiple occurrences and used to calculate the utilities. However, the problem definition in UL and US [41] is rather specific. No generic framework for transferring from SPM to high-utility sequence analysis has been proposed.

• USpan [42]. Yin et al. then formalized the problem of HUSPM, and proposed a generic framework and the USpan algorithm to mine high-utility sequences [42]. A lexicographic quantitative sequence tree (LQS-tree) is constructed as the search space. Two concatenation mechanisms, I-Concatenation and S-Concatenation, are used to generate newly concatenated utility-based sequences. Based on the LQS-tree structure, USpan [42] adopts the sequence-weighted utilization (SWU) measure and the Sequence Weighted Downward Closure (SWDC) property to prune unpromising sequences and to improve the mining performance. However, a shortcoming of USpan is that the data representation w.r.t. the utility matrix is quite complex and memory-costly.

• PHUS [121]. Lan et al. then proposed the projection-based high-utility sequential pattern mining (PHUS) algorithm for mining HUSPs with the maximum utility measure and a sequence-utility upper-bound (SUUB) model [121]. The algorithm extends PrefixSpan [16] and uses a projection-based pruning strategy to obtain tight upper bounds on sequence utilities. Thus, it can avoid considering too many candidates, and improves the performance of mining HUSPs using the SUUB model.

• HuspExt [122] and HUS-Span [125]. Alkan et al. [122] designed the high-utility sequential pattern extraction (HuspExt) algorithm with an upper bound called Cumulate Rest of Match (CRoM). It uses a pruning before candidate generation (PBCG) strategy to prune unpromising sequences for mining HUSPs. However, HuspExt cannot discover the complete set of HUSPs due to its incorrect upper bound. Since the previous upper bounds on sequence utilities are not tight enough, HUS-Span [125] utilizes two tight utility upper bounds, called prefix extension utility (PEU) and reduced sequence utility (RSU), as well as two companion pruning strategies, to identify high-utility sequential patterns.

• ProUM [123]. Gan et al. [123] proposed an efficient projection-based utility mining approach named ProUM to discover high-utility sequences by utilizing the upper bound named sequence extension utility (SEU) and the utility-array structure [123]. Different from the upper bound

TABLE 8
Algorithms for High-Utility Sequential Pattern Mining

Name | Description | Pros. | Cons. | Year
UWAS-tree & IUWAS-tree [63] | The specific algorithms designed for mining utility web log sequences. | The first paper that integrates the utility measure into sequential pattern mining. | The considered scenarios are rather simple, which limits their applicability for complex sequences. | 2010
UL & US [41] | A level-wise Utility-Level (UL) algorithm and a pattern-growth Utility-Span (US) algorithm. | They first extend SPM to HUSPM, and introduce the calculation of utility in a sequence. | The problem definition is rather specific. No generic framework is proposed. | 2010
USpan [42] | First formalizes the problem of HUSPM, and proposes a generic framework. | The width and depth pruning methods substantially reduce the search space in the LQS-tree. | Data representation w.r.t. the utility matrix in USpan is quite complex and memory-costly. | 2012
PHUS [121] | A projection-based algorithm for mining HUSPs with the sequence-utility upper-bound (SUUB). | The projection mechanism and the upper-bound SUUB model can avoid considering too many candidates. | The upper bound on sequence utility is not tight enough, and it consumes longer runtime. | 2014
HuspExt [122] | Based on the upper-bound CRoM, it utilizes a Pruning Before Candidate Generation (PBCG) strategy. | The PBCG strategy prunes some unpromising sequences for mining HUSPs. | The CRoM upper bound on sequence utility produces an incomplete set of HUSPs. | 2015
HUS-Span [125] | Two upper-bounds (PEU and RSU) and utility-chain are proposed in this method. | PEU is tighter than previous upper-bounds, and utility-chain is easily implemented. | It may consume longer runtime than USpan on some datasets. | 2016
ProUM [123] | Using a new upper-bound named SEU and the array-based structure named utility-array. | SEU is a correct upper-bound and the projection-based pruning strategies are powerful to prune unpromising sequences. | The SEU upper bound is still an over-estimated bound; the search space can be further reduced. | 2019
HUSP-ULL [124] | Based on the PEU upper-bound, it utilizes the utility-linked (UL)-list structure and two pruning strategies to prune candidates. | It outperforms the state-of-the-art algorithms for mining HUSPs, in terms of execution time and memory consumption. | The proposed UL-list structure is a complicated structure for implementation. | 2019
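
As a small illustration of two notions that recur in Table 8, the Python sketch below computes the utility of a sequential pattern by taking the highest-utility occurrence in each sequence (the convention described for UL and US [41]) and the sequence-weighted utilization (SWU) that USpan [42] uses as an upper bound. This is a minimal sketch under stated assumptions rather than any of the cited algorithms: the two toy q-sequences store utilities directly instead of quantities and unit profits, and the function names (max_occurrence_utility, pattern_utility, swu) are hypothetical.

```python
# A q-sequence is a list of elements; each element maps item -> utility (made-up values).
SEQ_DB = [
    [{"a": 3, "c": 4}, {"b": 2}, {"a": 6, "b": 1}],
    [{"a": 1}, {"c": 5}, {"b": 7}],
]

def max_occurrence_utility(qseq, pattern):
    """Highest utility among all occurrences of `pattern` in `qseq`, or None if absent.

    `pattern` is a list of item sets; consecutive pattern elements must be
    matched in strictly later elements of the q-sequence.
    """
    # best[j] = best utility of matching the first j pattern elements so far.
    best = [0] + [None] * len(pattern)
    for element in qseq:
        # Descending j ensures one q-sequence element matches at most one pattern element.
        for j in range(len(pattern), 0, -1):
            need = pattern[j - 1]
            if best[j - 1] is not None and need.issubset(element):
                cand = best[j - 1] + sum(element[i] for i in need)
                if best[j] is None or cand > best[j]:
                    best[j] = cand
    return best[-1]

def pattern_utility(db, pattern):
    """Utility of the pattern in the database: sum of per-sequence maxima."""
    values = (max_occurrence_utility(s, pattern) for s in db)
    return sum(v for v in values if v is not None)

def swu(db, pattern):
    """Sequence-weighted utilization: total utility of every sequence containing the pattern."""
    total = 0
    for s in db:
        if max_occurrence_utility(s, pattern) is not None:
            total += sum(sum(element.values()) for element in s)
    return total

ab = [{"a"}, {"b"}]          # a, then b in a later element
ac = [{"a"}, {"c"}]          # a, then c in a later element
a_and_c = [{"a", "c"}]       # a and c in the same element
```

Because SWU sums whole-sequence utilities, it can never be smaller than the pattern utility, so a pattern whose SWU is below the threshold can be discarded together with its extensions; the gap between the two values in the toy data also shows why tighter bounds such as PEU and SEU were later proposed.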

used in USpan, SEU can guarantee the correctness and completeness of the discovered results on sequence data. Besides, ProUM is up to two orders of magnitude faster than USpan and HUS-Span in terms of execution time on most sequence datasets.

• HUSP-ULL [124]. The state-of-the-art HUSP-ULL [124] algorithm utilizes a new data structure, namely the utility-linked (UL)-list, and two pruning strategies (called the look ahead strategy and the irrelevant item pruning strategy) to quickly discover HUSPs. The extensive experiments in [124] show that HUSP-ULL is the fastest when compared to the current HUSPM algorithms. Some competitive results of these recent state-of-the-art HUSPM methods have been compared and summarized in the studies of ProUM [123] and HUSP-ULL [124].

The main characteristics of these HUSPM algorithms are summarized in Table 8. In addition, Shie et al. explored a new problem of mining high-utility mobile sequential patterns (HUMSPs) by integrating mobile data mining with utility mining [35], [36]. This is the first work that combines mobility patterns with the utility factor to find high-utility mobile sequential patterns.

4.6 High-Utility Episode Mining
When the sequential data becomes an event sequence, the task of frequent episode mining (FEM) [11] is introduced. FEM reveals a significant amount of useful information hidden in the event sequence, with a wide range of applications [11], [12], [13], [14]. However, the discovered frequent episodes are still too simple and primitive. In some cases, FEM may lose some rich information, such as utility, importance, risk, etc. Wu et al. [43] presented the first attempt to solve the problem of high-utility episode mining (HUEM) in a complex event sequence. However, the proposed UP-Span algorithm suffers from low efficiency in both runtime and memory consumption. Furthermore, the proposed upper bound named Episode Weighted Utility (EWU) is a loose and basic utility bound for episodes. Guo et al. then proposed the TSpan algorithm, which improves on UP-Span in several ways and is much more efficient [126], saving considerable search space and runtime. Then, Lin et al. separately introduced some models to process complex event sequences and stock investment using high-utility episode mining and a genetic algorithm [44], [127]. In addition, the top-k issue of HUEM has been studied recently [128].

4.7 UPM in Big Data
In the big data era, more efficient UPM frameworks are required to handle large-scale data. Several models have been presented to address UPM in big data [129], [130], [131]. Details are described below.
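
One reason UPM can be spread over partitions of a large database is that TWU, the pruning measure used throughout this survey, is a plain sum over transactions, so partial results computed on separate partitions can be merged by addition. The Python sketch below illustrates this map-and-merge idea on two toy partitions. It is only a sketch under stated assumptions and does not reproduce any of the cited algorithms [129], [130], [131]: the partitions, the minutil value, and the function names (map_partition, reduce_counts) are hypothetical, and a real deployment would run the map step on a cluster framework rather than in one process.

```python
from collections import Counter
from functools import reduce

def map_partition(transactions):
    """Map step: local TWU of every item within one partition."""
    local = Counter()
    for items in transactions:           # items: {item: utility in this transaction}
        tu = sum(items.values())         # transaction utility
        for item in items:
            local[item] += tu
    return local

def reduce_counts(left, right):
    """Reduce step: merge two partial TWU tables by addition."""
    return left + right

# Two toy partitions of one database (made-up utilities).
partitions = [
    [{"a": 5, "b": 2, "c": 1}, {"a": 10, "c": 6}],
    [{"b": 4, "c": 3, "d": 8}],
]

global_twu = reduce(reduce_counts, map(map_partition, partitions), Counter())

minutil = 20                             # hypothetical threshold
# Items whose global TWU is below minutil cannot appear in any HUI.
promising = sorted(item for item, value in global_twu.items() if value >= minutil)
```

Only the small per-item tables travel between workers in this scheme; the transactions themselves stay in their partitions.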

• UPM in Big Itemset Data. PHUI-Growth (Parallel mining High-Utility Itemsets by pattern-Growth) [129] was first proposed for mining HUIs in parallel on the Hadoop platform. It adopts the MapReduce [132] architecture to partition the whole mining task. As a distributed parallel algorithm, PHUI-Miner with a sampling strategy was introduced by Chen et al. [130]. It extracts approximate HUIs from big data. Recently, parallel mining of top-k HUIs on the Spark in-memory computing architecture has also been proposed. It inherits several advantages of Spark [133].

• UPM in Big Sequence Data. The BigHUSP model is the first work to discover high-utility sequential patterns in a distributed and parallel manner [131]. BigHUSP uses multiple MapReduce [132] steps to process big data in parallel. In contrast to the traditional HUSPM approaches, it can deal with large-scale sequential data. MAHUSP [134] is a memory-adaptive approximation algorithm to efficiently discover high-utility sequential patterns over data streams. It employs a memory-adaptive mechanism using a bounded portion of memory, and guarantees that all HUSPs are discovered under certain circumstances. An experimental study shows that MAHUSP can not only discover HUSPs over data streams efficiently, but also adapt to memory allocation without sacrificing much of the quality of the discovered HUSPs.

4.8 UPM in Stream Data
A data stream is an infinite sequence of data elements continuously arriving at a rapid rate [66], [67]. Mining useful patterns from data streams has become one of the interesting problems of data mining [67], [135], [136]. However, few works on mining data streams consider the utility factor embedded in data streams. Tseng et al. first proposed the THUI-Mine (Temporal High-Utility Itemsets) model to mine temporal HUIs from data streams [137]. THUI-Mine can effectively identify the temporal HUIs by generating fewer temporal 2-itemsets of HTWUIs. Thus, the execution time can be reduced significantly in mining all HUIs from data streams. In this way, the discovery process under all time windows of data streams can be achieved with limited memory space and fewer candidates. Then, HUIM researchers proposed several stream mining models, such as Mining High-Utility Itemsets based on BITvector (MHUI-BIT) [30], Mining High-Utility Itemsets based on TIDlist (MHUI-TID) [30], and Generation of maximal high-Utility Itemsets from Data strEams (GUIDE) [32], [69]. GUIDE is a framework that mines the compact maximal HUIs from data streams with different models (i.e., the landmark, sliding, and time fading window models) [32], [69]. In [70], the high-utility stream tree (HUS-tree) and the HUPMS algorithm (high-utility pattern mining over stream data) are proposed for incremental and interactive UPM over data streams with a sliding window.

4.9 UPM with Various Interesting Constraints
Up to now, most of the algorithms for UPM have been developed to improve the efficiency of the mining process, while the effectiveness of the algorithms for UPM is also very important, because it is related to their usefulness for various data, constraints, and applications. Researchers in the field of utility-oriented pattern mining have proposed many algorithms and models to extend effectiveness. Many constraint-based UPM algorithms have been extensively developed for various problems, targeting a wide range of applications. Examples include mining high-utility patterns with products' on-shelf time period [85], [138], mining the up-to-date HUIs that reflect recent trends [139], [140], mining discriminative high-utility patterns [75], [141], mining top-k high-utility patterns without setting the minimum utility threshold [68], [142], UPM with multiple minimum utility thresholds [143], utility-based association rule mining [39], [40], UPM with consideration of various discount strategies [61], UPM by considering negative utility values [144], [145], UPM from uncertain data [73], [146], and extracting non-redundant correlated HUIs [62], [147]. UPM with various interesting constraints is thus an active research topic.

4.10 Privacy Preserving for UPM
Since utility-based patterns carry more useful information than frequent itemsets or sequences, privacy preserving for high-utility pattern mining (PPUM) is more realistic and critical than privacy-preserving data mining (PPDM) [148], [149], [150], [151]. Some preliminary studies have been done on this issue. Yeh et al. first designed two models, named Hiding High-Utility Itemset First (HHUIF) and Maximum Sensitive Itemsets Conflict First (MSICF), to hide sensitive HUIs in PPUM [152]. The main task of PPUM is to hide the sensitive high-utility itemsets (SHUIs). Lin et al. first developed a genetic-algorithm-based method to hide the user-specified SHUIs by inserting dummy transactions into the original databases [153]. Yun et al. then developed a tree-based algorithm called the Fast Perturbation algorithm Using a Tree structure and Tables (FPUTT) for hiding SHUIs [154]. Then, other faster and more efficient algorithms were developed for PPUM, such as [155], [156]. A recent overview of PPUM has been reported by Gan et al. [157].

5 OPEN-SOURCE SOFTWARE AND DATASETS
5.1 Open-Source Software
Although the problem of UPM has been studied for more than 15 years, and the advanced topics of utility pattern mining have also been extended to many research fields, few implementations or source code of these algorithms have been released. This raises barriers for other researchers, who need to re-implement algorithms to use them or to compare their performance with that of newly proposed algorithms. To make matters worse, this may introduce unfairness in experimental comparisons, since the performance of pattern mining algorithms commonly depends on the compiler and machine architecture used. We now list some open-source software specialized for UPM.

• UP-Miner. Tseng et al. proposed a first-of-its-kind utility mining toolbox named Utility Pattern Miner (UP-Miner) [158]. UP-Miner provides various models for utility-oriented pattern mining. UP-Miner has three main merits. First, to the best of our knowledge, it is the first-of-its-kind cross-platform utility mining system. Second, it provides complete Java implementations of 13 algorithms for discovering different types of utility-oriented patterns, such as high-utility itemset (HUI), high-utility
sequential rule (HUSR), high-utility sequential pattern (HUSP), and high-utility episode (HUE), as well as the concise representations of utility patterns. In addition, it offers four functionalities for processing utility-based databases. Third, the toolbox and relevant materials, including source code, demo paper, benchmark datasets, and data generators, have been made public on a website³ for the benefit of the research community.

• SPMF. As a well-known open-source data mining library, SPMF [159] offers implementations of many algorithms and has been cited in more than 700 research papers since 2010. SPMF is written in Java and provides implementations of 170 data mining algorithms, specializing in pattern mining. SPMF has the largest collection of implementations of pattern mining algorithms (e.g., FPM, ARM, SPM) and provides a user-friendly graphical interface.⁴ It also provides relevant materials, including source code, documentation, user instructions, benchmark datasets, data generators, and academic papers. SPMF offers up to 30 algorithms for utility-oriented pattern mining, such as Two-Phase, UP-Growth, UP-Growth+, HUI-Miner, EFIM, USpan, HUSP-ULL, and many other state-of-the-art algorithms. SPMF is distributed under the GPL v3 license and is suitable for both academic and industrial purposes.

5.2 Datasets for UPM
Several datasets are commonly used in studies of UPM. All of them have been released on websites such as SPMF [159] and UP-Miner [158].

Real Datasets. foodmart: provided by Microsoft, it contains 21,556 customer transactions and 1,559 distinct items from an anonymous chain store, with the quantity and unit profit of each item. yoochoose-buys: this commercial dataset was constructed for the RecSys Challenge 2015.⁵ It contains a collection of 1,150,753 sessions from a retailer, where each session encapsulates click events. The total numbers of item IDs and category IDs are 54,287 and 347, respectively, over an interval of 6 months. UK-online⁶: it contains 541,909 transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered non-store online retailer. The original data contain real timestamps and many noisy values. Its attributes include InvoiceNo, StockCode, Quantity, InvoiceDate, UnitPrice, CustomerID, etc.

Semi-Authentic Datasets. These are real datasets⁷ (e.g., chess, retail, kosarak, mushroom, accidents, BMSPOS2) with synthetic utility values. The internal utility values are generated using a uniform distribution in [1, 10]. The external utility values are generated using a Gaussian (normal) distribution. Detailed descriptions and characteristics of these real datasets can be found in SPMF [159], UP-Miner [158], or the existing UPM literature.

Synthetic Datasets. Some synthetic itemset-based or sequence-based datasets generated by the IBM Quest Dataset Generator [160] have been commonly used in UPM. For example, the itemset-based synthetic datasets T10I4D100K and T40I10D100K, and the sequence-based synthetic dataset C8S6T4I3D|X|K, are described in [124].

3. http://bigdatalab.cs.nctu.edu.tw/software.php
4. http://www.philippe-fournier-viger.com/spmf/index.php
5. https://recsys.acm.org/recsys15/challenge/
6. http://archive.ics.uci.edu/ml/datasets/Online+Retail/
7. http://fimi.ua.ac.be/data/

6 OPEN CHALLENGES AND OPPORTUNITIES
Here, we discuss important open problems that have the potential to become future research areas in utility-oriented pattern mining. Owing to the rapid growth of the volume of data stored in databases, we have entered the era of Big Data. While analyzing utility-oriented patterns, we have identified numerous technical challenges and opportunities for UPM. We next highlight some important research opportunities, which are common to many, and sometimes all, UPM algorithms.

• Application-Driven Algorithms. Up to now, most of the algorithms for UPM have been developed to improve the efficiency of the mining process. The effectiveness of the algorithms for UPM is also very important, because it determines their usefulness for various data, constraints, and applications. In general, as described in Section 2.3, application-driven algorithms with many particular features of utility patterns reflect real-life problems of different applications in various fields. Proposing a specialized UPM model for different applications (e.g., business, web intelligence, risk prediction, smart city, financial analysis, Internet of Things, biomedicine, smart transportation) and experimentally showing its effectiveness is necessary and challenging. Moreover, the incorporation of domain knowledge [161] has a strong influence on performance for some data mining methods. Utility mining guided by domain knowledge thus provides many opportunities.

• Developing More Efficient Algorithms. Traditionally, most pattern mining algorithms, especially UPM algorithms, are computationally expensive in terms of execution time and memory cost. This may be a serious problem for dense databases or databases containing numerous items/sequences or long transactions, depending on the minimum utility threshold chosen by the user. Although current UPM algorithms (e.g., HUI-Miner [86], EFIM [117], and mHUIMiner [92]) are much more efficient than previous Apriori-based and tree-based algorithms, there is still room for improvement. 1) It is important to reduce the search space, which requires designing novel pruning strategies that rely on upper bounds on the utility measure that are tighter than current ones. 2) Moreover, novel data structures can be designed to calculate the utility and upper bounds more quickly, and constraints can be integrated into the mining process to reduce the search space. 3) Fast approximate algorithms [130] that guarantee a maximum error can also be developed.

• Unified Framework for UPM. Many variations of utility mining have been proposed to deal with various types of data and to solve different problems. The current paradigm for solving a utility-oriented pattern mining problem is to first define the utility-based patterns of interest and their properties, and then develop an algorithm that can exploit the properties of the utility (e.g., upper bound) to mine them efficiently. This laborious process could be avoided if the following problem were solved: “Is there a paradigm such that existing and new definitions of
utility-based pattern (HUI [77], HUSP [42], [123], HUE [43]) can be solved by a unifying algorithm?” Owing to these challenges, the utility-oriented pattern mining problem, in its most general form, is not easy to solve. In fact, most of the existing utility mining techniques (e.g., HUIM [77], HUSPM [42], [123], HUEM [43], etc.) solve a specific formulation of a specific problem. Therefore, formalizing utility mining tasks in a generic framework is crucial and challenging. Focusing on general principles and modeling of UPM rather than on specific implementations is more important and challenging.

• Deal with Complex Data. The amount of complex data has exploded during the past two decades, while most data mining and analysis approaches are not utility oriented. Many current techniques of UPM are not suited to dealing with various types of complex data, such as “structured data”⁸ (i.e., pattern mining), “unstructured data”⁹ (including documents, health records, audio, video, images, etc.), and “semi-structured data”¹⁰ (e.g., XML, JSON), most of which are heterogeneous. More specifically, dynamic data [47], uncertain data [8], [73], high-dimensional datasets of moderate size, and very large datasets of moderate complexity are commonly seen in real-life domains and applications. Bridging this gap requires the solution of fundamentally new research problems, which can be grouped into the following challenges: 1) how to define a utility function that integrates the various rich features of complex data; 2) how to achieve utility maximization for the goal and mining task; and 3) how to develop new frameworks and algorithms to deal with new types of data. A need therefore arises for a better framework that extends the existing data mining methodologies, techniques, and tools, guided by utility and knowledge.

• Large-Scale Data. Mining large-scale databases may incur a high computational cost and memory consumption. Under the batch model, traditional UPM algorithms must be repeatedly applied to obtain updated results when new data are inserted [47]. However, in the Big Data era, incrementally or dynamically processing data [47] and taking into account the results of prior analysis is crucial. There are some challenging research opportunities of UPM for handling large-scale data (as described in Section 4.7): how to design parallelized UPM algorithms and how to develop UPM algorithms based on existing Big Data technologies (e.g., MapReduce [132] and Spark [133]). Other promising areas of research are the design of distributed, parallel, multi-core, or graphics-processing-unit (GPU)-based algorithms [45], [162] for UPM. There are also open challenges and opportunities to improve the scalability of utility mining tasks, from resource-constrained devices to collaborative and hybrid execution models.

• Scalable Real-Time Pattern Mining. There exist many approaches for interactive data mining, but few have been extended to address the challenge of utility, and adapting them is not trivial. One of the most important future challenges is to develop scalable online high-utility pattern mining approaches for streaming data from electronic commerce. Specifically, research should focus on algorithms that are sub-linear in the input size or, at the very least, linear. Other computational challenges, such as the demand for results to be returned in real time or near-real time, are open issues in the data mining community. For example, real-time mining with optimization requires a new formalism and solving techniques. As mentioned before, the increasing quantity and complexity of data demand scalable solutions. Using existing computational infrastructures for real-time utility-oriented mining of massive datasets may be a feasible approach.

8. https://en.wikipedia.org/wiki/Structure_mining
9. https://en.wikipedia.org/wiki/Unstructured_data
10. https://en.wikipedia.org/wiki/Semi-structured_data

7 CONCLUSIONS
The term utility is commonly used to mean “the quality of being useful,” and utilities are widely used in data-mining and decision-making processes to extract different useful kinds of knowledge. Utilities are subjective and can be acquired from domain experts/users. Utility mining in data is a vital task, with numerous high-impact applications, including cross-marketing, e-commerce, finance, medical, and biomedical applications. Up to now, many techniques and approaches have been proposed for the task of UPM. In this survey, we have provided a comprehensive review of utility-oriented pattern mining, both in terms of current status and future directions. This survey describes various problems associated with mining utility-based patterns and methods for addressing these problems, including 1) high-utility itemset mining (HUIM), 2) high-utility association rule mining (HUARM), 3) high-utility sequential pattern mining (HUSPM), 4) high-utility sequential rule mining (HUSRM), and 5) high-utility episode mining (HUEM). Overall, we have not only reviewed the most common, as well as the state-of-the-art, approaches for UPM but have also provided a comprehensive review of advanced UPM topics. Finally, we have identified several important issues and research opportunities for UPM.

ACKNOWLEDGMENTS
This research was partially supported by the Shenzhen Technical Project under No. KQJSCX 20170726103424709 and No. JCYJ 20170307151733005, and by the China Scholarship Council Program. The authors would like to thank the editors and anonymous reviewers for their detailed comments and constructive suggestions, which have improved the quality of this paper.

REFERENCES
[1] M. S. Chen, J. Han, and P. S. Yu, “Data mining: An overview from a database perspective,” IEEE Trans. Knowl. Data Eng., vol. 8, no. 6, pp. 866–883, Dec. 1996.
[2] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and Techniques. Amsterdam, The Netherlands: Elsevier, 2011.
[3] Y. S. Koh and S. D. Ravana, “Unsupervised rare pattern mining: A survey,” ACM Trans. Knowl. Discovery Data, vol. 10, no. 4, 2016, Art. no. 45.
[4] P. Fournier-Viger, J. C. W. Lin, B. Vo, T. T. Chi, J. Zhang, and H. B. Le, “A survey of itemset mining,” Wiley Interdisciplinary Rev.: Data Mining Knowl. Discovery, vol. 7, no. 4, 2017, Art. no. e1207.
[5] P. Fournier-Viger, J. C. W. Lin, R. U. Kiran, Y. S. Koh, and R. Thomas, “A survey of sequential pattern mining,” Data Sci. Pattern Recognit., vol. 1, no. 1, pp. 54–77, 2017.
[6] C. W. Tsai, C. F. Lai, M. C. Chiang, L. T. Yang, et al., “Data mining for internet of things: A survey,” IEEE Commun. Surveys Tuts., vol. 16, no. 1, pp. 77–97, Jan.–Mar. 2014.
[7] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining frequent patterns without candidate generation: A frequent-pattern tree approach,” Data Mining Knowl. Discovery, vol. 8, no. 1, pp. 53–87, 2004.
[8] C. C. Aggarwal, Y. Li, J. Wang, and J. Wang, “Frequent pattern mining with uncertain data,” in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2009, pp. 29–38.
[9] R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules between sets of items in large databases,” ACM SIGMOD Rec., vol. 22, no. 2, pp. 207–216, 1993.
[10] R. Agrawal, R. Srikant, et al., “Fast algorithms for mining association rules,” in Proc. 20th Int. Conf. Very Large Data Bases, 1994, pp. 487–499.
[11] H. Mannila, H. Toivonen, and A. I. Verkamo, “Discovery of frequent episodes in event sequences,” Data Mining Knowl. Discovery, vol. 1, no. 3, pp. 259–289, 1997.
[12] K. Y. Huang and C. H. Chang, “Efficient mining of frequent episodes from complex sequences,” Inf. Syst., vol. 33, no. 1, pp. 96–114, 2008.
[13] A. Achar, S. Laxman, and P. Sastry, “A unified view of the Apriori-based algorithms for frequent episode discovery,” Knowl. Inf. Syst., vol. 31, no. 2, pp. 223–250, 2012.
[14] A. Achar, A. Ibrahim, and P. Sastry, “Pattern-growth based frequent serial episode discovery,” Data Knowl. Eng., vol. 87, pp. 91–108, 2013.
[15] R. Agrawal and R. Srikant, “Mining sequential patterns,” in Proc. 11th Int. Conf. Data Eng., 1995, pp. 3–14.
[16] J. Han, J. Pei, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu, “PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth,” in Proc. 17th Int. Conf. Data Eng., 2001, pp. 215–224.
[17] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and P. S. Yu, “A survey of parallel sequential pattern mining,” ACM Trans. Knowl. Discovery Data, vol. 13, no. 3, 2019, Art. no. 25.
[18] L. Geng and H. J. Hamilton, “Interestingness measures for data mining: A survey,” ACM Comput. Surveys, vol. 38, no. 3, 2006, Art. no. 9.
[19] J. Pei, J. Han, and L. V. Lakshmanan, “Mining frequent itemsets with convertible constraints,” in Proc. 17th Int. Conf. Data Eng., 2001, pp. 433–442.
[20] P. N. Tan, V. Kumar, and J. Srivastava, “Selecting the right objective measure for association analysis,” Inf. Syst., vol. 29, no. 4, pp. 293–313, 2004.
[21] K. McGarry, “A survey of interestingness measures for knowledge discovery,” Knowl. Eng. Rev., vol. 20, no. 1, pp. 39–61, 2005.
[22] Y. D. Shen, Z. Zhang, and Q. Yang, “Objective-oriented utility-based association mining,” in Proc. IEEE Int. Conf. Data Mining, 2002, pp. 426–433.
[23] R. Chan, Q. Yang, and Y. D. Shen, “Mining high utility itemsets,” in Proc. 3rd IEEE Int. Conf. Data Mining, 2003, pp. 19–26.
[24] H. Yao, H. J. Hamilton, and L. Geng, “A unified framework for utility-based measures for mining itemsets,” in Proc. ACM SIGKDD 2nd Workshop Utility-Based Data Mining, 2006, pp. 28–37.
[25] A. Marshall, Principles of Economics, 8th ed. London, U.K.: Macmillan, 1926.
[26] R. J. Hilderman and H. J. Hamilton, “Measuring the interestingness of discovered knowledge: A principled approach,” Intell. Data Anal., vol. 7, no. 4, pp. 347–382, 2003.
[27] A. Silberschatz and A. Tuzhilin, “On subjective measures of interestingness in knowledge discovery,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 1995, pp. 275–281.
[28] T. De Bie, “Maximum entropy models and subjective interestingness: An application to tiles in binary databases,” Data Mining Knowl. Discovery, vol. 23, no. 3, pp. 407–446, 2011.
[29] H. Yao and H. J. Hamilton, “Mining itemset utilities from transaction databases,” Data Knowl. Eng., vol. 59, no. 3, pp. 603–626, 2006.
[30] H. F. Li, H. Y. Huang, Y. C. Chen, Y. J. Liu, and S. Y. Lee, “Fast and memory efficient mining of high utility itemsets in data streams,” in Proc. 8th IEEE Int. Conf. Data Mining, 2008, pp. 881–886.
[31] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and Y. K. Lee, “Efficient tree structures for high utility pattern mining in incremental databases,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 12, pp. 1708–1721, Dec. 2009.
[32] B. E. Shie, V. S. Tseng, and P. S. Yu, “Online mining of temporal maximal utility itemsets from data streams,” in Proc. ACM Symp. Appl. Comput., 2010, pp. 1622–1626.
[33] A. Erwin, R. P. Gopalan, and N. Achuthan, “Efficient mining of high utility itemsets from large datasets,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2008, pp. 554–561.
[34] Y. C. Li, J. S. Yeh, and C. C. Chang, “Isolated items discarding strategy for discovering high utility itemsets,” Data Knowl. Eng., vol. 64, no. 1, pp. 198–217, 2008.
[35] B. E. Shie, H. F. Hsiao, V. S. Tseng, and P. S. Yu, “Mining high utility mobile sequential patterns in mobile commerce environments,” in Proc. Int. Conf. Database Syst. Adv. Appl., 2011, pp. 224–238.
[36] B. E. Shie, H. F. Hsiao, and V. S. Tseng, “Efficient algorithms for discovering high utility user behavior patterns in mobile commerce environments,” Knowl. Inf. Syst., vol. 37, no. 2, pp. 363–387, 2013.
[37] M. Zihayat, H. Davoudi, and A. An, “Mining significant high utility gene regulation sequential patterns,” BMC Syst. Biol., vol. 11, no. 6, 2017, Art. no. 109.
[38] S. J. Yen and Y. S. Lee, “Mining high utility quantitative association rules,” in Proc. Int. Conf. Data Warehousing Knowl. Discovery, 2007, pp. 283–292.
[39] D. Lee, S. H. Park, and S. Moon, “Utility-based association rule mining: A marketing solution for cross-selling,” Expert Syst. Appl., vol. 40, no. 7, pp. 2715–2725, 2013.
[40] J. Sahoo, A. K. Das, and A. Goswami, “An efficient approach for mining association rules from high utility itemsets,” Expert Syst. Appl., vol. 42, no. 13, pp. 5754–5778, 2015.
[41] C. F. Ahmed, S. K. Tanbeer, and B. S. Jeong, “A novel approach for mining high-utility sequential patterns in sequence databases,” ETRI J., vol. 32, no. 5, pp. 676–686, 2010.
[42] J. Yin, Z. Zheng, and L. Cao, “USpan: An efficient algorithm for mining high utility sequential patterns,” in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012, pp. 660–668.
[43] C. W. Wu, Y. F. Lin, P. S. Yu, and V. S. Tseng, “Mining high utility episodes in complex event sequences,” in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2013, pp. 536–544.
[44] Y. F. Lin, C. W. Wu, C. F. Huang, and V. S. Tseng, “Discovering utility-based episode rules in complex event sequences,” Expert Syst. Appl., vol. 42, no. 12, pp. 5303–5314, 2015.
[45] W. Gan, J. C. W. Lin, H. C. Chao, and J. Zhan, “Data mining in distributed environment: A survey,” Wiley Interdisciplinary Rev.: Data Mining Knowl. Discovery, vol. 7, no. 6, 2017, Art. no. e1216.
[46] B. Nath, D. Bhattacharyya, and A. Ghosh, “Incremental association rule mining: A survey,” Wiley Interdisciplinary Rev.: Data Mining Knowl. Discovery, vol. 3, no. 3, pp. 157–169, 2013.
[47] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, T. P. Hong, and H. Fujita, “A survey of incremental high-utility itemset mining,” Wiley Interdisciplinary Rev.: Data Mining Knowl. Discovery, vol. 8, no. 2, 2018, Art. no. e1242.
[48] V. S. Tseng, B. E. Shie, C. W. Wu, and P. S. Yu, “Efficient algorithms for mining high utility itemsets from transactional databases,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 8, pp. 1772–1786, Aug. 2013.
[49] S. Zida, P. Fournier-Viger, C. W. Wu, J. C. W. Lin, and V. S. Tseng, “Efficient mining of high-utility sequential rules,” in Proc. Int. Workshop Mach. Learn. Data Mining Pattern Recognit., 2015, pp. 157–171.
[50] P. Fournier-Viger, U. Faghihi, R. Nkambou, and E. M. Nguifo, “CMRules: Mining sequential rules common to several sequences,” Knowl.-Based Syst., vol. 25, no. 1, pp. 63–76, 2012.
[51] C. Jiang, F. Coenen, and M. Zito, “A survey of frequent subgraph mining algorithms,” Knowl. Eng. Rev., vol. 28, no. 1, pp. 75–105, 2013.
[52] H. Yao, H. J. Hamilton, and C. J. Butz, “A foundational approach to mining itemset utilities from databases,” in Proc. SIAM Int. Conf. Data Mining, 2004, pp. 482–486.
[53] C. H. Cai, A. W. C. Fu, C. Cheng, and W. Kwong, “Mining association rules with weighted items,” in Proc. Int. Database Eng. Appl. Symp., 1998, pp. 68–77.
[54] W. Wang, J. Yang, and P. S. Yu, “Efficient mining of weighted association rules (WAR),” in Proc. 6th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2000, pp. 270–274.
[55] F. Tao, F. Murtagh, and M. Farid, “Weighted association rule mining using weighted support and significance framework,” in Proc. 9th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2003, pp. 661–666.
[56] K. Sun and F. Bai, “Mining weighted association rules without preassigned weights,” IEEE Trans. Knowl. Data Eng., vol. 20, no. 4, pp. 489–495, Apr. 2008.
[57] J. C. W. Lin, W. Gan, P. Fournier-Viger, and T. P. Hong, “RWFIM: Recent weighted-frequent itemsets mining,” Eng. Appl. Artif. Intell., vol. 45, pp. 18–32, 2015.
[58] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and V. S. Tseng, “Weighted frequent itemset mining over uncertain databases,” Appl. Intell., vol. 44, no. 1, pp. 232–250, 2016.
[59] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, J. M. T. Wu, and J. Zhan, “Extracting recent weighted-based patterns from uncertain temporal databases,” Eng. Appl. Artif. Intell., vol. 61, pp. 161–172, 2017.
[60] Y. C. Li, J. S. Yeh, and C. C. Chang, “Direct candidates generation: A novel algorithm for discovering complete share-frequent itemsets,” in Proc. Int. Conf. Fuzzy Syst. Knowl. Discovery, 2005, pp. 551–560.
[61] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and V. S. Tseng, “Fast algorithms for mining high-utility itemsets with various discount strategies,” Adv. Eng. Inform., vol. 30, no. 2, pp. 109–126, 2016.
[62] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and H. Fujita, “Extracting non-redundant correlated purchase behaviors by utility measure,” Knowl.-Based Syst., vol. 143, pp. 30–41, 2018.
[63] C. F. Ahmed, S. K. Tanbeer, and B. S. Jeong, “A framework for mining high utility web access sequences,” IETE Tech. Rev., vol. 28, no. 1, pp. 3–16, 2011.
[64] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and Y. K. Lee, “Efficient mining of utility-based web path traversal patterns,” in Proc. 11th Int. Conf. Adv. Commun. Technol., 2009, pp. 2215–2218.
[65] L. Atzori, A. Iera, and G. Morabito, “The internet of things: A survey,” Comput. Netw., vol. 54, no. 15, pp. 2787–2805, 2010.
[66] L. Golab and M. T. Özsu, “Issues in data stream management,” ACM SIGMOD Rec., vol. 32, no. 2, pp. 5–14, 2003.
[67] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz, “Moment: Maintaining closed frequent itemsets over a stream sliding window,” in Proc. 4th IEEE Int. Conf. Data Mining, 2004, pp. 59–66.
[68] M. Zihayat and A. An, “Mining top-k high utility patterns over data streams,” Inf. Sci., vol. 285, pp. 138–161, 2014.
[69] B. E. Shie, P. S. Yu, and V. S. Tseng, “Efficient algorithms for mining maximal high utility itemsets from data streams with different models,” Expert Syst. Appl., vol. 39, no. 17, pp. 12 947–12 960, 2012.
[70] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and H. J. Choi, “Interactive mining of high utility patterns over data streams,” Expert Syst. Appl., vol. 39, no. 15, pp. 11 979–11 991, 2012.
[71] Y. C. Liu, C. P. Cheng, and V. S. Tseng, “Mining differential top-k co-expression patterns from time course comparative gene expression datasets,” BMC Bioinf., vol. 14, no. 1, 2013, Art. no. 230.
[72] T. P. Hong, C. H. Lee, and S. L. Wang, “Effective utility mining with the measure of average utility,” Expert Syst. Appl., vol. 38, no. 7, pp. 8259–8265, 2011.
[73] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and V. S. Tseng, “Efficient algorithms for mining high-utility itemsets in uncertain databases,” Knowl.-Based Syst., vol. 96, pp. 171–187, 2016.
[74] C. K. Chui, B. Kao, and E. Hung, “Mining frequent itemsets from uncertain data,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2007, pp. 47–58.
[75] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and H. C. Chao, “FDHUP: Fast algorithm for mining discriminative high utility patterns,” Knowl. Inf. Syst., vol. 51, no. 3, pp. 873–909, 2017.
[76] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and P. S. Yu, “HUOPM: High-utility occupancy pattern mining,” IEEE Trans. Cybern., early access, Feb. 20, 2019, doi: 10.1109/TCYB.2019.2896267.
[77] Y. Liu, W. K. Liao, and A. Choudhary, “A two-phase algorithm for fast discovery of high utility itemsets,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2005, pp. 689–695.
[78] J. Hu and A. Mojsilovic, “High-utility pattern mining: A method for discovery of high-utility item sets,” Pattern Recognit., vol. 40, no. 11, pp. 3317–3324, 2007.
[79] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and Y. K. Lee, “An efficient candidate pruning technique for high utility pattern mining,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2009, pp. 749–756.
[80] V. S. Tseng, C. W. Wu, B. E. Shie, and P. S. Yu, “UP-Growth: An efficient algorithm for high utility itemset mining,” in Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, pp. 253–262.
[81] W. Song, Y. Liu, and J. Li, “Mining high utility itemsets by dynamically pruning the tree structure,” Appl. Intell., vol. 40, no. 1, pp. 29–43, 2014.
[82] H. Ryang, U. Yun, and K. H. Ryu, “Fast algorithm for high utility pattern mining with the sum of item quantities,” Intell. Data Anal., vol. 20, no. 2, pp. 395–415, 2016.
[83] A. Erwin, R. P. Gopalan, and N. Achuthan, “CTU-Mine: An efficient high utility itemset mining algorithm using the pattern growth approach,” in Proc. 7th IEEE Int. Conf. Comput. Inf. Technol., 2007, pp. 71–76.
[84] C. W. Lin, T. P. Hong, and W. H. Lu, “An effective tree structure for mining high utility itemsets,” Expert Syst. Appl., vol. 38, no. 6, pp. 7419–7424, 2011.
[85] G. C. Lan, “A study on efficient algorithms for on-shelf utility mining,” PhD Thesis, National Cheng Kung Univ., pp. 1–154, 2012.
[86] M. Liu and J. Qu, “Mining high utility itemsets without candidate generation,” in Proc. 21st ACM Int. Conf. Inf. Knowl. Manage., 2012, pp. 55–64.
[87] P. Fournier-Viger, C. W. Wu, S. Zida, and V. S. Tseng, “FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning,” in Proc. Int. Symp. Methodologies Intell. Syst., 2014, pp. 83–92.
[88] J. Liu, K. Wang, and B. C. Fung, “Direct discovery of high utility itemsets without candidate generation,” in Proc. IEEE 12th Int. Conf. Data Mining, 2012, pp. 984–989.
[89] S. Krishnamoorthy, “Pruning strategies for mining high utility itemsets,” Expert Syst. Appl., vol. 42, no. 5, pp. 2371–2381, 2015.
[90] S. Zida, P. Fournier-Viger, J. C. W. Lin, C. W. Wu, and V. S. Tseng, “EFIM: A highly efficient algorithm for high-utility itemset mining,” in Proc. Mexican Int. Conf. Artif. Intell., 2015, pp. 530–546.
[91] H. Ryang and U. Yun, “Indexed list-based high utility pattern mining with utility upper-bound reduction and pattern combination techniques,” Knowl. Inf. Syst., vol. 51, no. 2, pp. 627–659, 2017.
[92] A. Y. Peng, Y. S. Koh, and P. Riddle, “mHUIMiner: A fast high utility itemset mining algorithm for sparse datasets,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2017, pp. 196–207.
[93] R. Rymon, “Search through systematic set enumeration,” in Proc. 3rd Int. Conf. Principles Knowl. Representation Reasoning, 1992, pp. 539–550.
[94] W. Song, Z. Zhang, and J. Li, “A high utility itemset mining algorithm based on subsume index,” Knowl. Inf. Syst., vol. 49, no. 1, pp. 315–340, 2016.
[95] W. Song, B. Yang, and Z. Xu, “Index-BitTableFI: An improved algorithm for mining frequent itemsets,” Knowl.-Based Syst., vol. 21, no. 6, pp. 507–513, 2008.
[96] G. C. Lan, T. P. Hong, and V. S. Tseng, “Efficiently mining high average-utility itemsets with an improved upper-bound strategy,” Int. J. Inf. Technol. Decision Making, vol. 11, no. 05, pp. 1009–1030, 2012.
[97] J. C. W. Lin, T. Li, P. Fournier-Viger, T. P. Hong, J. Zhan, and M. Voznak, “An efficient algorithm to mine high average-utility itemsets,” Adv. Eng. Inform., vol. 30, no. 2, pp. 233–243, 2016.
[98] J. C. W. Lin, S. Ren, P. Fournier-Viger, and T. P. Hong, “EHAUPM: Efficient high average-utility pattern mining with tighter upper bounds,” IEEE Access, vol. 5, pp. 12 927–12 940, 2017.
[99] U. Yun, D. Kim, E. Yoon, and H. Fujita, “Damped window based high average utility pattern mining over data streams,” Knowl.-Based Syst., vol. 44, pp. 188–205, 2017.
[100] D. W. Cheung, J. Han, V. T. Ng, and C. Wong, “Maintenance of discovered association rules in large databases: An incremental updating technique,” in Proc. 12th Int. Conf. Data Eng., 1996, pp. 106–114.
[101] T. P. Hong, C. Y. Wang, and Y. H. Tao, “A new incremental data mining algorithm using pre-large itemsets,” Intell. Data Anal., vol. 5, no. 2, pp. 111–129, 2001.
[102] C. W. Lin, G. C. Lan, and T. P. Hong, “An incremental mining algorithm for high utility itemsets,” Expert Syst. Appl., vol. 39, no. 8, pp. 7173–7180, 2012.
[103] C. W. Lin, T. P. Hong, G. C. Lan, J. W. Wong, and W. Y. Lin, “Incrementally mining high utility patterns based on pre-large concept,” Appl. Intell., vol. 40, no. 2, pp. 343–357, 2014.
[104] J. C. W. Lin, W. Gan, T. P. Hong, and B. Zhang, “An incremental high-utility mining algorithm with transaction insertion,” Sci. World J., vol. 2015, 2015, Art. no. 161564.
[105] P. Fournier-Viger, J. C. W. Lin, T. Gueniche, and P. Barhate, “Efficient incremental high utility itemset mining,” in Proc. ASE BigData Social Inform., 2015, Art. no. 53.
[106] D. W. Cheung, S. D. Lee, and B. Kao, “A general incremental technique for maintaining discovered association rules,” in Proc. 5th Int. Conf. Database Syst. Adv. Appl., 1997, pp. 185–194.

[107] C. W. Lin, G. C. Lan, and T. P. Hong, “Mining high utility item- [132] J. Dean and S. Ghemawat, “MapReduce: A flexible data process-
sets for transaction deletion in a dynamic database,” Intell. Data ing tool,” Commun. ACM, vol. 53, no. 1, pp. 72–77, 2010.
Anal., vol. 19, no. 1, pp. 43–55, 2015. [133] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley,
[108] C. W. Lin, T. P. Hong, G. C. Lan, J. W. Wong, and W. Y. Lin, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets:
“Efficient updating of discovered high-utility itemsets for trans- A fault-tolerant abstraction for in-memory cluster computing,” in
action deletion in dynamic databases,” Adv. Eng. Inform., Proc. 9th USENIX Conf. Netw. Syst. Des. Implementation, 2012, pp. 2–2.
vol. 29, no. 1, pp. 16–27, 2015. [134] M. Zihayat, Y. Chen, and A. An, “Memory-adaptive high utility
[109] J. C. W. Lin, W. Gan, and T. P. Hong, “A fast maintenance algo- sequential pattern mining over data streams,” Mach. Learn.,
rithm of the discovered high-utility itemsets with transaction vol. 106, no. 6, pp. 799–836, 2017.
deletion,” Intell. Data Anal., vol. 20, no. 4, pp. 891–913, 2016. [135] G. S. Manku and R. Motwani, “Approximate frequency counts
[110] C. W. Lin, B. Zhang, W. Gan, B. W. Chen, S. Rho, and T. P. Hong, over data streams,” in Proc. 28th Int. Conf. Very Large Databases,
“Updating high-utility pattern trees with transaction mod- 2002, pp. 346–357.
ification,” Multimedia Tools Appl., vol. 75, no. 9, pp. 4887–4912, 2016. [136] H. F. Li, M. K. Shan, and S. Y. Lee, “DSM-FI: An efficient algo-
[111] J. C. W. Lin, W. Gan, and T. P. Hong, “A fast updated algorithm rithm for mining frequent itemsets in data streams,” Knowl. Inf.
to maintain the discovered high-utility itemsets for transaction Syst., vol. 17, no. 1, pp. 79–97, 2008.
modification,” Adv. Eng. Inform., vol. 29, no. 3, pp. 562–574, 2015. [137] C. J. Chu, V. S. Tseng, and T. Liang, “An efficient algorithm for
[112] J. F. Boulicaut, A. Bykowski, and C. Rigotti, “Free-sets: A conden- mining temporal high utility itemsets from data streams,” J. Syst.
sed representation of boolean data for the approximation of fre- Softw., vol. 81, no. 7, pp. 1105–1117, 2008.
quency queries,” Data Mining Knowl. Discovery, vol. 7, no. 1, pp. 5–22, [138] G. C. Lan, T. P. Hong, and V. S. Tseng, “Discovery of high utility
2003. itemsets from on-shelf time periods of products,” Expert Syst.
[113] T. Calders and B. Goethals, “Mining all non-derivable frequent Appl., vol. 38, no. 5, pp. 5851–5857, 2011.
itemsets,” in Proc. Eur. Conf. Principles Data Mining Knowl. Discov- [139] J. C. W. Lin, W. Gan, T. P. Hong, and V. S. Tseng, “Efficient algo-
ery, 2002, pp. 74–86. rithms for mining up-to-date high-utility patterns,” Adv. Eng.
[114] K. Gouda and M. J. Zaki, “Efficiently mining maximal frequent Inform., vol. 29, no. 3, pp. 648–661, 2015.
itemsets,” in Proc. IEEE Int. Conf. Data Mining, 2001, pp. 163–170. [140] W. Gan, J. C. W. Lin, P. Fournier-Viger, and H. C. Chao, “Mining
[115] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Efficient min- recent high-utility patterns from temporal databases with time-
ing of association rules using closed itemset lattices,” Inf. Syst., sensitive constraint,” in Proc. Int. Conf. Big Data Analytics Knowl.
vol. 24, no. 1, pp. 25–46, 1999. Discovery, 2016, pp. 3–18.
[116] V. S. Tseng, C. W. Wu, P. Fournier-Viger, and P. S. Yu, “Efficient [141] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and H. J. Choi, “A frame-
algorithms for mining the concise and lossless representation of work for mining interesting high utility patterns with a strong
high utility itemsets,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 3, frequency affinity,” Inf. Sci., vol. 181, no. 21, pp. 4878–4894, 2011.
pp. 726–739, Mar. 2015. [142] C. W. Wu, B. E. Shie, V. S. Tseng, and P. S. Yu, “Mining top-k
[117] P. Fournier-Viger, S. Zida, J. C. W. Lin, C. W. Wu, and high utility itemsets,” in Proc. 18th ACM SIGKDD Int. Conf.
V. S. Tseng, “EFIM-Closed: Fast and memory efficient discovery Knowl. Discovery Data Mining, 2012, pp. 78–86.
of closed high-utility itemsets,” in Proc. Int. Conf. Mach. Learn. [143] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and J. Zhan,
Data Mining Pattern Recognit., 2016, pp. 199–213. “Efficient mining of high-utility itemsets using multiple minimum
[118] C. W. Wu, P. Fournier-Viger, J. Y. Gu, and V. S. Tseng, “Mining utility thresholds,” Knowl.-Based Syst., vol. 113, pp. 100–115, 2016.
closed+ high utility itemsets without candidate generation,” in [144] J. C. W. Lin, P. Fournier-Viger, and W. Gan, “FHN: An efficient
Proc. Conf. Technol. Appl. Artif. Intell., 2015, pp. 187–194. algorithm for mining high-utility itemsets with negative unit
[119] J. Sahoo, A. K. Das, and A. Goswami, “An efficient fast algorithm profits,” Knowl.-Based Syst., vol. 111, pp. 283–298, 2016.
for discovering closed+ high utility itemsets,” Appl. Intell., [145] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and
vol. 45, no. 1, pp. 44–74, 2016. V. S. Tseng, “Mining high-utility itemsets with both positive and
[120] C. H. Li, C. W. Wu, and V. S. Tseng, “Efficient vertical mining of negative unit profits from uncertain databases,” in Proc. Pacific-
high utility quantitative itemsets,” in Proc. IEEE Int. Conf. Granu- Asia Conf. Knowl. Discovery Data Mining, 2017, pp. 434–446.
lar Comput., 2014, pp. 155–160. [146] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and
[121] G. C. Lan, T. P. Hong, V. S. Tseng, and S. L. Wang, “Applying V. S. Tseng, “Efficiently mining uncertain high-utility itemsets,”
the maximum utility measure in high utility sequential pattern Soft Comput., vol. 21, no. 11, pp. 2801–2820, 2017.
mining,” Expert Syst. Appl., vol. 41, no. 11, pp. 5071–5081, 2014. [147] W. Gan, J. C. W. Lin, H. C. Chao, T. P. Hong, and S. Y. Philip,
[122] O. K. Alkan and P. Karagoz, “CRoM and HuspExt: Improving “CoUPM: Correlated utility-based pattern mining,” in Proc. IEEE
efficiency of high utility sequential pattern extraction,” IEEE Int. Conf. Big Data, 2018, pp. 2607–2616.
Trans. Knowl. Data Eng., vol. 27, no. 10, pp. 2645–2657, Oct. 2015. [148] R. Agrawal and R. Srikant, “Privacy-preserving data mining,”
[123] W. Gan, J. C. W. Lin, J. Zhang, H. C. Chao, H. Fujita, and P. S. Yu, ACM SIGMOD Rec., vol. 29, no. 2, pp. 439–450, 2000.
“ProUM: Projection-based utility mining on sequence data,” [149] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” in
arXiv:1904.07764, 2019. Proc. Annu. Int. Cryptology Conf., 2000, pp. 36–54.
[124] W. Gan, J. C. W. Lin, J. Zhang, P. Fournier-Viger, H. C. Chao, and [150] C. C. Aggarwal and P. S. Yu, “A general survey of privacy-
P. S. Yu, “Fast utility mining on sequence data,” arXiv:1904.12248, preserving data mining models and algorithms,” in Privacy-
2019. Preserving Data Mining. Berlin, Germany: Springer, 2008, pp. 11–52.
[125] J. Z. Wang, J. L. Huang, and Y. C. Chen, “On efficiently mining [151] T. Zhu, G. Li, W. Zhou, and P. S. Yu, “Differentially private data
high utility sequential patterns,” Knowl. Inf. Syst., vol. 49, no. 2, publishing and analysis: A survey,” IEEE Trans. Knowl. Data
pp. 597–627, 2016. Eng., vol. 29, no. 8, pp. 1619–1638, Aug. 2017.
[126] G. Guo, L. Zhang, Q. Liu, E. Chen, F. Zhu, and C. Guan, “High [152] J. S. Yeh and P. C. Hsu, “HHUIF and MSICF: Novel algorithms
utility episode mining made practical and fast,” in Proc. Int. Conf. for privacy preserving utility mining,” Expert Syst. Appl., vol. 37,
Adv. Data Mining Appl., 2014, pp. 71–84. no. 7, pp. 4779–4786, 2010.
[127] Y. F. Lin, C. F. Huang, and V. S. Tseng, “A novel methodology [153] C. W. Lin, T. P. Hong, J. W. Wong, G. C. Lan, and W. Y. Lin, “A
for stock investment using high utility episode mining and GA-based approach to hide sensitive high utility itemsets,” Sci.
genetic algorithm,” Appl. Soft Comput., vol. 59, pp. 303–315, 2017. World J., vol. 2014, 2014, Art. no. 804629.
[128] S. Rathore, S. Dawar, V. Goyal, and D. Patel, “Top-k high utility [154] U. Yun and J. Kim, “A fast perturbation algorithm using tree
episode mining from a complex event sequence,” in Proc. 21st structure for privacy preserving utility mining,” Expert Syst.
Int. Conf. Manage. Data Comput. Soc. India, 2016, pp. 56–63. Appl., vol. 42, no. 3, pp. 1149–1165, 2015.
[129] Y. C. Lin, C. W. Wu, and V. S. Tseng, “Mining high utility item- [155] J. C. W. Lin, T. Y. Wu, P. Fournier-Viger, G. Lin, J. Zhan, and
sets in big data,” in Proc. Pacific-Asia Conf. Knowl. Discovery Data M. Voznak, “Fast algorithms for hiding sensitive high-utility
Mining, 2015, pp. 649–661. itemsets in privacy-preserving utility mining,” Eng. Appl. Artif.
[130] Y. Chen and A. An, “Approximate parallel high utility itemset Intell., vol. 55, pp. 269–284, 2016.
mining,” Big Data Res., vol. 6, pp. 26–42, 2016. [156] J. C. W. Lin, T. P. Hong, P. Fournier-Viger, Q. Liu, J. W. Wong,
[131] M. Zihayat, Z. Z. Hut, A. An, and Y. Hut, “Distributed and paral- and J. Zhan, “Efficient hiding of confidential high-utility itemsets
lel high utility sequential pattern mining,” in Proc. IEEE Int. Conf. with minimal side effects,” J. Exp. Theoretical Artif. Intell., vol. 29,
Big Data, 2016, pp. 853–862. no. 6, pp. 1225–1245, 2017.

Wensheng Gan received the BS degree in computer science from South China Normal University, Guangdong, China, in 2013. He is working toward the PhD degree in computer science and technology at the Harbin Institute of Technology (Shenzhen), Guangdong, China. He was a joint PhD student with the University of Illinois at Chicago (UIC), from 2017 to 2019. His research interests include data mining, utility computing, and big data analytics. He has published more than 50 research papers in peer-reviewed journals (e.g., the IEEE Transactions on Knowledge and Data Engineering, the ACM Transactions on Knowledge Discovery from Data, the IEEE Transactions on Cybernetics, the ACM Transactions on Data Science, and Knowledge-Based Systems) and conferences, which have received more than 600 citations.

Jerry Chun-Wei Lin (SM’19) received the PhD degree in computer science and information engineering from the National Cheng Kung University, Tainan, Taiwan, in 2010. He is an associate professor with the Western Norway University of Applied Sciences, Bergen, Norway. His research interests include data mining, big data analytics, machine learning, soft computing, and privacy preservation and security. He has published more than 300 research papers in peer-reviewed international conferences (e.g., IEEE ICDE, IEEE ICDM, PKDD, and PAKDD) and journals (e.g., the IEEE Transactions on Knowledge and Data Engineering, the IEEE Transactions on Cybernetics, the ACM Transactions on Knowledge Discovery from Data, and the ACM Transactions on Data Science). He is the co-leader of the popular SPMF open-source data mining library, the project leader of the PPSF open-source privacy and security library, the editor-in-chief (EiC) of the Data Science and Pattern Recognition (DSPR) journal, and an associate editor of the Journal of Internet Technology and IEEE Access. He is a senior member of the IEEE and ACM.

Philippe Fournier-Viger received the PhD degree in computer science from the University of Quebec, Montreal, in 2010. He is a full professor and Youth 1,000 scholar with the Harbin Institute of Technology (Shenzhen), Shenzhen, China. His research interests include pattern mining, sequence analysis and prediction, and social network mining. He has published more than 250 research papers in refereed international conferences and journals. He is the founder of the popular SPMF open-source data mining library, which has been cited in more than 800 research papers. He is the editor-in-chief (EiC) of the Data Science and Pattern Recognition (DSPR) journal.

Han-Chieh Chao (SM’04) received the MS and PhD degrees in electrical engineering from Purdue University, in 1989 and 1993, respectively. He has been the president of the National Dong Hwa University since February 2016. His research interests include high-speed networks, wireless networks, IPv6-based networks, and artificial intelligence. He has published nearly 500 peer-reviewed research papers. He is the editor-in-chief (EiC) of the IET Networks and the Journal of Internet Technology. He has served as a guest editor of the ACM Mobile Networks and Applications, the IEEE Journal on Selected Areas in Communications, the IEEE Communications Magazine, the IEEE Systems Journal, Computer Communications, the IEEE Proceedings Communications, Wireless Personal Communications, and Wireless Communications & Mobile Computing. He is a senior member of the IEEE and a fellow of the IET.

Vincent S. Tseng (SM’16) received the PhD degree in computer science from National Chiao Tung University, Taiwan, in 1997. He is currently a distinguished professor with the Department of Computer Science, National Chiao Tung University, Taiwan. His research interests cover data mining, big data, biomedical informatics, and mobile and Web technologies. He has published more than 400 research papers in peer-reviewed journals and conferences and holds 15 patents. He has been on the editorial board of a number of journals, including the IEEE Transactions on Knowledge and Data Engineering, the ACM Transactions on Knowledge Discovery from Data, and the IEEE Journal of Biomedical and Health Informatics. He is a senior member of the IEEE.

Philip S. Yu (F’93) received the BS degree in electrical engineering from National Taiwan University, the MS and PhD degrees in electrical engineering from Stanford University, and the MBA degree from New York University. He is a distinguished professor of computer science with the University of Illinois at Chicago (UIC) and holds the Wexler Chair in Information Technology, UIC. Before joining UIC, he was with IBM, where he was manager of the Software Tools and Techniques Department, Thomas J. Watson Research Center. His research interests include databases, data mining, artificial intelligence, and privacy. He has published more than 1,300 papers in peer-reviewed journals (e.g., the IEEE Transactions on Knowledge and Data Engineering, the IEEE Transactions on Parallel and Distributed Systems, the ACM Transactions on Knowledge Discovery from Data, and the VLDB Journal) and conferences (e.g., SIGMOD, KDD, ICDE, WWW, AAAI, SIGIR, and ICML). He holds or has applied for more than 300 U.S. patents. He was the editor-in-chief of the ACM Transactions on Knowledge Discovery from Data. He received the ACM SIGKDD 2016 Innovation Award and the IEEE Computer Society 2013 Technical Achievement Award. He is a fellow of the ACM and IEEE.
