Scalable Sequential Pattern Mining Based On PrefixSpan For High Dimensional Data
Abstract: The rapid development and use of information technology has resulted in a data explosion, which makes analyzing data to find insights more difficult. A problem that often occurs when applying pattern recognition in real-world domains is caused not only by the large size of the data but also by its high dimensionality. Data analysis demands that large and complex data be processed quickly and optimally to support decision making. This study offers scalable sequential pattern extraction to gain more insight from data, using PrefixSpan implemented on the Spark platform as a distributed system. The goal is to handle growing data volumes (scalability) in complex, high dimensional data effectively and in a relatively short time.

Keywords: PrefixSpan, Spark, sequential pattern mining, scalability, high dimensional data

I. INTRODUCTION
The development of information technology, the internet, and devices has resulted in an enormous amount of data being generated every day. This data explosion is a situation in which data become so numerous and large that it is difficult for people to extract information and knowledge from them [1]. This phenomenon gave rise to a new concept of information management called Big Data [2, 3]. The problem posed by big data is how to process and analyze the data in order to obtain information and knowledge that improve the value of a very large data set. To address this problem, a process called Knowledge Discovery from Databases (KDD) was developed [1].

One KDD application on enterprise data that has been widely adopted by e-commerce organizations is the analysis of customer behavior. Market Basket Analysis (MBA) is a knowledge discovery method commonly used to find patterns in customer behavior or shopping habits, which then help decision makers choose the right marketing strategy and simplify the sales process [4].

Sequential pattern mining is one of the techniques that can be used in MBA. It can determine how goods are related and describe the order in which goods are purchased. One sequential pattern mining approach is PrefixSpan, which applies divide and conquer [5]. With this approach, the database is recursively projected into a set of smaller databases based on the frequent patterns found so far, and each projection is mined to obtain patterns.

However, some functions need a parsing process in order to find sequential patterns in large, high dimensional data within a relatively short time. The recursive process in PrefixSpan, when applied to big data, affects memory usage and can undermine reliability, causing the program to stop before the process is completed [5]. Big data therefore requires different processing and programming techniques than a relational database.

MapReduce is one approach to processing big data in parallel on distributed servers. In MapReduce, data are repeatedly broken down into parts and distributed across the machines connected in a cluster for processing [6, 7]. In previous research, MapReduce was used to support scalability in PrefixSpan, with two alternative ways of building the projected database. However, the Hadoop MapReduce approach is still not optimal and has limitations when applied to recursive algorithms: Hadoop MapReduce requires the input and output of every step to be stored on disk, which is costly [13].

Spark is a further development of the Hadoop MapReduce model that processes large amounts of data more easily and quickly. It introduces new processing methods and a data structure, the RDD (Resilient Distributed Dataset), as its main unit, to address the cases where MapReduce is less optimal, such as recursive algorithms, including the construction of projected databases in the PrefixSpan algorithm [8]. In this paper, an implementation of the PrefixSpan algorithm on a distributed system is proposed.

II. PRELIMINARY DEFINITION
A. High Dimensional Data
Many real-world applications deal with transactional data, characterized by a huge number of transactions (tuples) with a small number of dimensions (attributes). However, some other applications involve rather high dimensional data [15]. Examples of such applications include bioinformatics, survey-based statistical analysis, and text processing. High dimensional data pose great challenges to most existing data mining algorithms. Although there are numerous algorithms dealing with transactional data sets, there are few algorithms oriented to very high dimensional data sets. Taking
sequential pattern mining for MBA as an example, most of the existing algorithms are column (i.e., item) enumeration-based algorithms, which take the combinations of columns (items) as the search space. Because the number of column combinations is exponential, this method is not suitable for very high dimensional data.

Let T be a discretized data table (or data set), composed of a set of rows, S = {r1, r2, …, rn}, where ri (i = 1, …, n) is called a row ID, or rid for short. Each row corresponds to a sample consisting of k discrete values or intervals. For simplicity, we call each such value or interval an item. We call a set of rids a rowset, and a rowset having k rids a k-rowset. Likewise, we call a set of items an itemset. A k-rowset is called large if k is no less than a user-specified threshold called the minimum size threshold.

Let TT be the transposed table of T, in which each row corresponds to an item ij and consists of the set of rids of the rows that contain ij in T. For clarity, we call each row of TT a tuple.

TABLE I. AN EXAMPLE TABLE T

ri   A    B    C    D
1    a1   b1   c1   d1
2    a1   b1   c2   d2
3    a1   b1   c1   d2
4    a2   b1   c2   d2
5    a2   b2   c2   d3

Table I shows an example table T with 4 attributes (columns): A, B, C and D. The corresponding transposed table TT is shown in Table II. For simplicity, we use the number i (i = 1, 2, …, n) instead of ri to represent each rid.

TABLE II. TRANSPOSED TABLE TT OF T

Itemset   Rowset
a1        1, 2, 3
a2        4, 5
b1        1, 2, 3, 4
c1        1, 3
c2        2, 4, 5
d2        2, 3, 4

Sequential pattern mining begins by determining a minimum support. The minimum support, or min_sup, is a threshold specified by the user to determine whether a pattern is of interest. Sequential pattern mining is defined as follows: given a min_sup and a set of sequences, where each sequence is a list of events and each event is a set of items, sequential pattern mining finds all frequent subsequences, that is, the subsequences whose frequency of occurrence in the sequence database is no less than min_sup.

C. PrefixSpan (Prefix-Projected Sequential Pattern Mining)
PrefixSpan uses a divide and conquer approach that reduces the size of the sequence database by keeping only the sequences that relate to a given sequential pattern, which form a projected database. Further sequential patterns are obtained by finding the items that are locally frequent in a projected database. There are two important terms in the construction of a projected database: prefix and suffix. The PrefixSpan algorithm [5] is as follows:

Input: Sequence database S and the minimum support threshold min_support.
Output: The complete set of sequential patterns.
Method: Call PrefixSpan(<>, 0, S).
Subroutine: PrefixSpan(α, L, S|α)
Parameters:
1. α is a sequential pattern.
2. L is the length of α.
3. S|α is the α-projected database if α ≠ <>; otherwise it is the sequence database S.
Method:
1. Scan S|α once and find every frequent item b such that:
   a) b can be appended to the last element of α to form a sequential pattern; or
   b) <b> can be appended to α to form a sequential pattern.
2. For each frequent item b, append it to α to form a sequential pattern α', and output α'.
3. For each α', construct the α'-projected database S|α' and call PrefixSpan(α', L + 1, S|α').
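The recursive PrefixSpan procedure of Section II.C can be illustrated with a minimal single-machine sketch in Python. It is not the Spark implementation proposed in this paper, and it makes simplifying assumptions: every event contains exactly one item, so only the second kind of extension (appending <b> after α) occurs, and min_support is given as an absolute count of sequences. The names prefixspan, mine, and db are illustrative and do not come from [5].

```python
from collections import defaultdict


def prefixspan(db, min_support):
    """Return {pattern: support} for a list of sequences of single items.

    Assumptions (not from the paper): each event is one item, and
    min_support is an absolute number of supporting sequences.
    """
    results = {}

    def mine(prefix, projected):
        # `projected` represents S|prefix as (sequence index, suffix start) pairs,
        # so suffixes are never copied.
        # Step 1: scan the projected database once. Each item is counted at most
        # once per sequence, at its first occurrence in the suffix.
        first_pos = defaultdict(list)
        for seq_idx, start in projected:
            seen = set()
            seq = db[seq_idx]
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:
                    seen.add(item)
                    first_pos[item].append((seq_idx, pos))

        for item, occurrences in sorted(first_pos.items()):
            if len(occurrences) < min_support:
                continue  # item is not locally frequent
            # Step 2: append the frequent item to the prefix and output the pattern.
            pattern = prefix + (item,)
            results[pattern] = len(occurrences)
            # Step 3: build the projected database of the new pattern and recurse.
            mine(pattern, [(i, pos + 1) for i, pos in occurrences])

    # Initial call: empty prefix, every sequence projected from its start.
    mine((), [(i, 0) for i in range(len(db))])
    return results
```

For example, with db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]] and min_support = 2, the sketch returns the three single items with support 3 and the patterns ("a", "b"), ("a", "c"), and ("b", "c") with support 2. Representing each projected database by index and offset pairs rather than copied suffixes is a design choice of this sketch, made to keep memory use low during recursion.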
The third experiment increased the executor memory capacity to find its impact on execution time with multiple data sets. The minimum support is 25%. The result (Figure 9) shows that increasing executor memory capacity does not always improve execution time, because the number of partitions built by Spark by default is not always effective. A small number of partitions will not utilize all of the cores available in the cluster, while too many partitions cause excessive overhead in managing many small tasks. The amount of executor memory only affects how much of an RDD can fit in memory before being transferred to disk.

Future work may focus on optimization and on applying this approach to other cases with high dimensional data, such as user access behavior on websites or DNA sequence analysis.

REFERENCES

[1] Han, J. and Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd Ed. The Morgan Kaufmann Series in Data Management Systems. March 2006.
[2] Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171-209.
[3] Boyd, D., & Crawford, K. (2012). Critical questions for big data:
Provocations for a cultural, technological, and scholarly phenomenon.
Information, Communication & Society, 15(5), 662-679.
[4] Rahianty., H, Nurullah., Letik, Juningsi D F J., W, Tri Wahyu.,
Wicaksana, I Wayan Simri. Personifikasi Web E-Commerce
menggunakan Algoritma Data Mining. (2009). Proceeding PESAT vol.3.
[5] Pei, Jian., Han, Jiawei., Mortazavi-Asl, Behzad., Wang, Jianyong., Pinto, Helen., Chen, Qiming., et al. (2004). Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1424-1440.
[6] Chen, Y., Alspaugh, S., & Katz, R. (2012). Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. Proceedings of the VLDB Endowment, 5(12), 1802-1813.
[7] Madden, S. (2012). From databases to big data. IEEE Internet Computing, 16(3), 4-6.
[8] Dieng, Jie., Qu, Zhiguo., Zhu, Yongxu., Muntean, G.M., Wang, Xiaojun. (2014). Towards Efficient and Scalable Data Mining Using Spark. The Rince Institute, Dublin City University, Ireland.
[9] Zaharia, Matei., Chowdhury, Mosharaf., Franklin, Michael J., Shenker, Scott., Stoica, Ion. (2010). Spark: Cluster Computing with Working Sets. University of California, Berkeley.
[10] Karau, Holden., Konwinski, Andy., Wendell, Patrick., Zaharia, Matei. (2015). Learning Spark: Lightning-Fast Big Data Analysis, ch. 1. O'Reilly Media.
[11] Zaharia, Matei., Chowdhury, Mosharaf., Das, Tathagata., Dave, Ankur., Ma, Justin., McCauley, Murphy., Franklin, Michael., Shenker, Scott., Stoica, Ion. (2011). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Electrical Engineering and Computer Sciences, University of California, Berkeley.
[12] Dieng, Jie., Qu, Zhiguo., Zhu, Yongxu., Muntean, G.M., Wang, Xiaojun. (2014). Towards Efficient and Scalable Data Mining Using Spark. The Rince Institute, Dublin City University, Ireland.
[13] Sabrina, Puspita Nurul., Saptawati, G.A. Putri. (2015). Multiple MapReduce and Derivate Projected Database: New Approach for Supporting PrefixSpan Scalability. 2015 International Conference on Data and Software Engineering (ICoDSE).
[14] Lucchese, Claudio., Orlando, Salvatore., Perego, Raffaele., Silvestri, Fabrizio. (2004). WebDocs: a real-life huge transactional dataset. FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004.
[15] Liu, Hongyan., Han, Jiawei., Xin, Dong., Shao, Zheng. (2006). Top-Down Mining of Frequent Patterns from Very High Dimensional Data. Philadelphia: Society for Industrial and Applied Mathematics.