
Scalable Sequential Pattern Mining Based on PrefixSpan for High-Dimensional Data

Muhammad Nur Akbar
School of Electronic Engineering and Informatics
Institute of Technology Bandung
Bandung, Indonesia
muhnurakbar_mail[at]gmail.com

G.A. Putri Saptawati
School of Electronic Engineering and Informatics
Institute of Technology Bandung
Bandung, Indonesia
putri[at]informatika.org

Abstract—The rapid development and use of information technology has resulted in the phenomenon of data explosion, which makes it more difficult to analyze data and find insights in it. A problem that often occurs when pattern recognition is applied in real-world domains is caused not only by the large size of the data but also by its high dimensionality. Data analysis demands that large and complex data be processed quickly and optimally to support decision making. This study offers scalable sequential pattern extraction to gain more insight from the data, using PrefixSpan implemented on the Spark platform as a distributed system. The goal is to overcome the problem of the increasing amount of data (scalability) in complex and high-dimensional data effectively and with relatively quick performance.

Keywords—PrefixSpan, Spark, sequential pattern mining, scalability, high-dimensional data

I. INTRODUCTION

The development of information technology, the internet, and devices has resulted in an enormous amount of data generated every day. This data explosion phenomenon is a situation in which data becomes so numerous and large that it is difficult for people to extract information and knowledge from it [1]. This phenomenon triggered a new concept of information management called Big Data [2, 3]. The problem that comes with big data is how to process and analyze the data in order to obtain information and knowledge that improve the value of a very large data set. To overcome this problem, a process called Knowledge Discovery from Databases (KDD) was developed [1].

One KDD implementation on enterprise data that has been widely applied by e-commerce organizations is the analysis of customer behavior. Market Basket Analysis (MBA) is a knowledge discovery method that has been commonly used to find patterns of customer behavior or customer shopping habits, which then help decision makers choose the right marketing strategy and simplify the sales process [4].

Sequential pattern mining is one of the techniques that can be used in MBA: it determines patterns of relatedness between goods and illustrates the sequence of events in which the goods are purchased. One sequential pattern approach is PrefixSpan, which applies divide and conquer [5]. With this approach, the database is recursively projected into a set of smaller databases based on the frequent patterns found at each step, and the projections are then mined to obtain patterns.

However, to find sequential patterns in large and high-dimensional data within a relatively short time, parsing is needed in several functions. The recursive process of PrefixSpan, when implemented on big data, affects memory usage and can compromise the reliability of the process, causing the program to stop before it completes [5]. Therefore, big data requires different processing and programming techniques than a relational database.

MapReduce is one form of big data processing on parallel and distributed servers. MapReduce is a concept in which data is repeatedly broken down into parts and distributed across the machines connected in a cluster for processing [6, 7]. In previous research, the MapReduce concept was used to support scalability in PrefixSpan with two alternative ways of building the projected database. However, the Hadoop MapReduce approach is still not optimal and has limitations when applied to a recursive algorithm: Hadoop MapReduce requires every input and output of a process to be stored on disk, and this incurs a high cost [13].

Spark is a development of Hadoop MapReduce that processes large amounts of data more effectively, easily, and quickly. It introduces new processing methods and the RDD (Resilient Distributed Datasets) data structure as its main unit, addressing the parts of MapReduce that are less optimal for recursive algorithms, including the construction of the projected database in the PrefixSpan algorithm [8]. In this paper, an implementation of the PrefixSpan algorithm on a distributed system is proposed.

II. PRELIMINARY DEFINITION

A. High Dimensional Data

Many real-world applications deal with transactional data, characterized by a huge number of transactions (tuples) with a small number of dimensions (attributes). However, there are some other applications that involve rather high-dimensional data [15]. Examples of such applications include bioinformatics, survey-based statistical analysis, text processing, and so on. High-dimensional data pose great challenges to most existing data mining algorithms. Although there are numerous algorithms dealing with transactional data sets, there are few algorithms oriented to very high-dimensional data sets. Taking
sequential pattern mining for MBA as an example, most of the existing algorithms are column (i.e., item) enumeration-based algorithms, which take the combinations of columns (items) as their search space. Due to the exponential number of column combinations, this method is not suitable for very high-dimensional data.

Let T be a discretized data table (or data set), composed of a set of rows, S = {r1, r2, …, rn}, where ri (i = 1, …, n) is called a row ID, or rid in short. Each row corresponds to a sample consisting of k discrete values or intervals. For simplicity, we call each of these values or intervals an item. We call a set of rids a rowset, and a rowset having k rids a k-rowset. Likewise, we call a set of items an itemset. A k-rowset is called large if k is no less than a user-specified threshold called the minimum size threshold.

Let TT be the transposed table of T, in which each row corresponds to an item ij and consists of the set of rids that contain ij in T. For clarity, we call each row of TT a tuple.

TABLE I. AN EXAMPLE TABLE T

rid | A  | B  | C  | D
1   | a1 | b1 | c1 | d1
2   | a1 | b1 | c2 | d2
3   | a1 | b1 | c1 | d2
4   | a2 | b1 | c2 | d2
5   | a2 | b2 | c2 | d3

Table I shows an example table T with 4 attributes (columns): A, B, C and D. The corresponding transposed table TT is shown in Table II. For simplicity, we use the number i (i = 1, 2, …, n) instead of ri to represent each rid.

TABLE II. TRANSPOSED TABLE TT OF T

Itemset | rowset
a1      | 1, 2, 3
a2      | 4, 5
b1      | 1, 2, 3, 4
c1      | 1, 3
c2      | 2, 4, 5
d2      | 1, 3, 4

Originally, we want to find all of the frequent closed itemsets that satisfy the minimum support threshold minsup in table T. After transposing T into the transposed table TT, the minimum support threshold constraint on itemsets becomes a minimum size threshold on rowsets. Therefore, the mining task becomes finding all of the large closed rowsets that satisfy the minimum size threshold minsup in table TT.

B. Sequential Pattern Mining

Sequential pattern mining is a data mining process that generates knowledge about series of events whose frequency of occurrence exceeds a threshold value [5].

The sequential pattern mining process begins by determining a minimum support. The minimum support, or min_sup, is a threshold specified by the user to determine whether a pattern is interesting. Sequential pattern mining itself is defined as follows: given min_sup and a set of sequences, where each sequence is a list of events and each event is a set of items, sequential pattern mining finds all frequent subsequences, that is, subsequences whose frequency of occurrence in the sequence database is no less than min_sup.

C. PrefixSpan (Prefix-Projected Sequential Pattern Mining)

The PrefixSpan method uses a divide-and-conquer approach that reduces the size of the sequence database by keeping only the sequences that share a given sequential pattern; the result is called a projected database. Further sequential patterns are obtained by finding the items that are locally frequent in a projected database. Two terms are important in the construction of a projected database: prefix and suffix. The PrefixSpan algorithm [5] is as follows:

Input: a sequence database S and a minimum support threshold min_support.
Output: the complete set of sequential patterns.
Method: call PrefixSpan(<>, 0, S).
Subroutine: PrefixSpan(α, L, S|α)
Parameters:
1. α is a sequential pattern;
2. L is the length of α;
3. S|α is the α-projected database if α ≠ <>, otherwise the sequence database S itself.
Method:
1. Scan S|α once and find every frequent item b such that:
   - b can be assembled into the last element of α to form a sequential pattern; or
   - <b> can be appended to α to form a sequential pattern.
2. For each frequent item b, append b to α to form a sequential pattern α'.
3. For each α', construct the α'-projected database S|α' and call PrefixSpan(α', L + 1, S|α').

From the algorithm above, the efficiency of the PrefixSpan method follows from three observations:

1. PrefixSpan does not need to generate or test candidate sequences.
2. The projected databases keep shrinking.
3. The primary cost of PrefixSpan is the construction of the projected databases.

D. The Apache Spark Framework

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. Spark was basically developed to overcome the
shortcomings of Hadoop MapReduce, which is less optimal when implemented for recursive algorithms, including the construction of the projected database in the PrefixSpan algorithm. The features of Spark are the following:

1. It supports Map and Reduce functions with many advantages.
2. It optimizes arbitrary operator graphs.
3. It provides concise and consistent APIs in the Scala, Java, Python, and R programming languages.
4. It offers interactive shells in Scala and Python.
5. It can be integrated with the Hadoop ecosystem and its data sources (HDFS, Amazon S3, Cassandra, Hive, HBase, etc.).
6. It can be run on different cluster managers such as Hadoop YARN or Apache Mesos, or in standalone mode.

Spark contains several components [11] that are well integrated. At its core, Spark is a computation engine responsible for scheduling, distributing, and monitoring applications that consist of many computing tasks on many worker machines.

Fig. 1. Spark Ecosystem [10]

1. Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark's main programming abstraction.
2. Spark Streaming can be used for processing real-time streaming data. It is based on a micro-batch style of computing and processing, and it uses DStreams, which are basically series of RDDs, to process the real-time data.
3. Spark SQL provides the capability to expose Spark datasets over the JDBC API and allows running SQL-like queries on Spark data using traditional BI and visualization tools.
4. MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
5. GraphX is the new (alpha) Spark API for graphs and graph-parallel computation.

E. RDD (Resilient Distributed Datasets)

The Resilient Distributed Dataset, commonly abbreviated RDD, is the core concept of the Spark framework. An RDD can be described as a database table that can accommodate a variety of data types and is distributed over different partitions in memory on a large cluster, while maintaining fault tolerance in the data flow model just as Hadoop MapReduce does [11]. An RDD supports two types of operations, as shown in Figure 2:

Fig. 2. Spark Flow

1. Transformations
A transformation is an operation that does not return a value but forms a new RDD. It does not perform any evaluation: it only takes an RDD as input and returns an RDD in a different form.
2. Actions
An action is an operation that evaluates and returns a value. When an action is run on an RDD, all pending data processing requests are computed and a value is generated as the output.

III. ANALYSIS OF PREFIXSPAN IMPLEMENTATION ON THE SPARK FRAMEWORK

There are various techniques that can be used for pattern discovery in market basket analysis, and they produce different patterns depending on the chosen technique: frequent pattern mining, association rule mining, and sequential pattern mining. Sequential pattern mining is the data mining technique for finding patterns of sequential events. Compared to the other two techniques, sequential pattern mining is considered to provide more insight from the data, as it not only shows the relatedness of goods but also illustrates the sequence of events in which the goods were purchased.

PrefixSpan is an algorithm for extracting sequential patterns that applies the divide-and-conquer approach. With this approach, the database is recursively projected into a set of smaller databases based on the frequent patterns found at each step, and the projections are then mined to obtain the patterns. PrefixSpan projects only by prefix, so the size of the projected database keeps shrinking, and the redundancy checks on the possible options of each potential candidate are reduced. Several studies also show that in terms of execution time, scalability, reliability, and memory utilization, PrefixSpan is better than other algorithms [5].
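To make the prefix-projection idea concrete, the following is a minimal, illustrative sketch in plain Python, not the paper's Spark implementation; for brevity it treats each event as a single item rather than an itemset, and all names are our own:

```python
from collections import Counter

def prefixspan(db, min_sup):
    """Minimal PrefixSpan over sequences of single items.

    db      : list of sequences, each a list of items
    min_sup : absolute minimum support (number of sequences)
    Returns a dict mapping each frequent pattern (tuple) to its support.
    """
    patterns = {}

    def project(seqs, item):
        # Keep, for each sequence containing `item`, only the suffix
        # after its first occurrence: the item-projected database.
        projected = []
        for seq in seqs:
            if item in seq:
                projected.append(seq[seq.index(item) + 1:])
        return projected

    def mine(prefix, projected_db):
        # Count items that are locally frequent in the projected
        # database, counting each item at most once per sequence.
        counts = Counter()
        for seq in projected_db:
            counts.update(set(seq))
        for item, sup in counts.items():
            if sup >= min_sup:
                pattern = prefix + (item,)
                patterns[pattern] = sup
                # Recurse on the smaller, item-projected database.
                mine(pattern, project(projected_db, item))

    mine((), db)
    return patterns
```

For example, with the sequence database [['a','b','c'], ['a','c'], ['a','b','c'], ['b','c']] and min_sup = 2, this sketch finds seven patterns, among them ('a','b','c') with support 2, without ever generating candidate sequences, which matches the efficiency observations listed above.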
A. PrefixSpan on Spark

PrefixSpan on Spark is also partitioned into two parts that complete all the tasks, just like PrefixSpan on Hadoop MapReduce [12]. The concept of frequency counting is used to count the items in each transaction and compare the results with the minimum support. flatMap and map are used to count the items, and each item is stored in a tuple for further processing. The projected database can be implemented with the map and filter functions: the map function returns each transaction after purging its prefix (in other words, it produces the suffix), while the filter function removes the transactions that do not contain the appropriate item.

Fig. 3. PrefixSpan on Spark

B. Projected Database Construction in Spark PrefixSpan

Previous research on improving the scalability of PrefixSpan with Hadoop MapReduce [13] implemented the construction of the projected database using two approaches:

1. The first approach builds the projected database by parsing the original database to get each prefix and then taking the suffix of that prefix. The resulting projected database is only stored in memory, without needing to be stored persistently, because the construction process only refers to the original database. However, if the original database is large, the parsing process takes a long time, even though the resulting projected database is small, because the parsing must always be done on the original database.

2. The second approach is proposed to reduce the size of the data that must be parsed when looking for the prefix and suffix. The idea is based on the observation that a sequence belongs to the projected database of a prefix only if it is a member of the sequences that contain that prefix. However, this method requires the projected database of each prefix to be stored in a file, and those files are then referred to in order to construct the next projected database. Each projected database is stored in a named file that identifies the prefix of that projected database.

The problem found in the second approach is that every projected database result must be saved to a file on the hard disk. If the data is very large and has many prefixes, writing and reading the projected database to and from the hard disk on each iteration incurs a high cost.

Fig. 4. Iterative computations to project database in Hadoop MapReduce

The problem in the second approach can be solved by utilizing RDDs in Spark. Instead of being stored as files on the hard disk, the projected database produced in each iteration is stored as a new RDD in memory, which uses memory efficiently and thus eliminates the read and write costs incurred on Hadoop MapReduce.

Fig. 5. Iterative computations to project database in Spark

RDDs also have a storage mechanism that utilizes both memory and the hard disk: when memory is full, the RDDs that were least recently accessed are transferred to disk, to avoid running out of memory when the approach is applied to big, high-dimensional data.

Fig. 6. Projected database construction in RDD Spark

IV. EXPERIMENTAL RESULT

The experiments were run on a Spark cluster consisting of a master machine, which also has a role as a worker node, with a single worker or multiple workers. The machine specifications were:

1. Intel Core i5-4300U CPU @ 1.9 GHz x 4 cores, 12 GB RAM, 1 TB HDD (single worker, for experiments 1 and 3)
2. Intel Xeon CPU E5-4610 @ 2.4 GHz x 18 cores, 16 GB RAM, 300 GB HDD (multiple workers, for experiment 2)

The experiments used "WebDocs: a huge real-life transactional dataset" [14], which was built from a web collection. The dataset has a size of about 1.48 GB, contains exactly 1,692,082 transactions with 5,267,656 distinct items, and the maximal length of a transaction is 71,472.

The scalability experiment was performed in standalone mode with a single worker. The experiment was designed to find the relationship between the number of rows, the number of dimensions, and the execution time, with a minimum support of 25%. Figure 7 shows that PrefixSpan on Spark is scalable up to 1 million rows of data with approximately three million distinct items. When PrefixSpan was run on 1 million rows, it took a longer time because the RDDs no longer fit in memory and Spark had to move some RDDs to disk, but it was still better than PrefixSpan on MapReduce running on the same data size [13]. The execution time increased significantly when the number of distinct items increased, for row counts between 100,000 and 400,000.
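Returning to the projected database construction of Section A, the suffix pipeline can be mimicked without a cluster by using Python's built-in filter and map in place of the RDD operations of the same name; in an actual Spark job this would be roughly rdd.filter(lambda t: item in t).map(lambda t: t[t.index(item) + 1:]) on an RDD of transactions. The function name and data shapes below are our own illustration:

```python
def project_database(transactions, prefix_item):
    """Build the prefix-projected database for one item, mirroring the
    RDD filter/map pipeline: filter drops transactions that do not
    contain the item, map strips everything up to and including its
    first occurrence, leaving the suffix."""
    containing = filter(lambda t: prefix_item in t, transactions)
    suffixes = map(lambda t: t[t.index(prefix_item) + 1:], containing)
    # Discard empty suffixes; they cannot contribute further patterns.
    return [s for s in suffixes if s]
```

For example, project_database([['a','b','c'], ['b','d'], ['a','c']], 'a') returns [['b','c'], ['c']]: the second transaction is filtered out, and the other two are reduced to their suffixes. In Spark, the resulting collection would stay in memory as a new RDD rather than being written to disk, which is exactly the saving over the Hadoop MapReduce approaches described above.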

Fig. 7. Scalability Testing

The second experiment was done by adding workers to find the impact of the number of workers on execution time, with a minimum support of 25%. The test results (Figure 8) show that increasing the number of workers improves time performance, especially when processing big data. For big data, execution becomes 4.35 times faster, but for small data the improvement is only about 15% on average.

Fig. 8. Multi Worker Testing

The third experiment was done by adding executor memory capacity to find the impact of executor memory on execution time, over multiple data sizes. The minimum support was 25%. The results (Figure 9) show that increasing the executor memory capacity does not always improve time performance, because the number of partitions built by Spark by default is not always effective. A small number of partitions will not utilize all of the cores available in the cluster, but too many partitions will cause excessive overhead in managing many small tasks. The amount of executor memory only affects how many RDDs can fit in memory before being transferred to disk.

Fig. 9. Executor Memory Testing

V. CONCLUSION

In this paper, an implementation of the PrefixSpan algorithm on distributed systems was proposed. With this approach, PrefixSpan is extended to handle massive, high-dimensional data. By using Spark, memory usage becomes more optimal, which leads to better performance than common distributed Hadoop, but it requires memory resources relative to the data size.

Increasing the number of workers has a strong influence on time performance, especially for high-dimensional data. However, the experimental results showed that the performance does not increase linearly with the number of workers, especially when there are three or more workers. This is caused by the overhead of distributed systems: the more workers are included in the system, the more time and resources are spent on communication between these workers.

The experiments also showed that the performance does not scale linearly with the amount of executor memory; it only affects how many RDDs can fit in memory, so Spark requires a lot of configuration to run optimally. It needs configuration of the number of partitions relative to the data size, the number of cores, etc.

Future work may focus on optimization and on implementing this approach for other cases with high-dimensional data, such as user access behavior on websites or DNA sequence analysis.

REFERENCES

[1] Han, J. and Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd Ed. The Morgan Kaufmann Series in Data Management Systems. March 2006.
[2] Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171-209.
[3] Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5), 662-679.
[4] Rahianty, H., Nurullah, Letik, Juningsi D. F. J., W, Tri Wahyu, Wicaksana, I Wayan Simri. (2009). Personifikasi Web E-Commerce menggunakan Algoritma Data Mining. Proceeding PESAT, vol. 3.
[5] Pei, Jian, Han, Jiawei, Mortazavi-Asl, Behzad, Wang, Jianyong, Pinto, Helen, Chen, Qiming, et al. (2004). Mining Sequential Patterns by Pattern Growth: The PrefixSpan Approach. IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11.
[6] Chen, Y., Alspaugh, S., & Katz, R. (2012). Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. Proceedings of the VLDB Endowment, 5(12), 1802-1813.
[7] Madden, S. (2012). From databases to big data. IEEE Internet Computing, 16(3), 4-6.
[8] Dieng, Jie, Qu, Zhiguo, Zhu, Yongxu, Muntean, G.M., Wang, Xiaojun. (2014). Towards Efficient and Scalable Data Mining Using Spark. The Rince Institute, Dublin City University, Ireland.
[9] Zaharia, Matei, Chowdhury, Mosharaf, Franklin, Michael J., Shenker, Scott, Stoica, Ion. (2010). Spark: Cluster Computing with Working Sets. University of California, Berkeley.
[10] Karau, Holden, Konwinski, Andy, Wendell, Patrick, Zaharia, Matei. (2015). Learning Spark: Lightning-Fast Big Data Analysis, ch. 1. O'Reilly Media.
[11] Zaharia, Matei, Chowdhury, Mosharaf, Das, Tathagata, Dave, Ankur, Ma, Justin, McCauley, Murphy, Franklin, Michael, Shenker, Scott, Stoica, Ion. (2011). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Electrical Engineering and Computer Sciences, University of California, Berkeley.
[12] Dieng, Jie, Qu, Zhiguo, Zhu, Yongxu, Muntean, G.M., Wang, Xiaojun. (2014). Towards Efficient and Scalable Data Mining Using Spark. The Rince Institute, Dublin City University, Ireland.
[13] Sabrina, Puspita Nurul, Saptawati, G.A. Putri. (2015). Multiple MapReduce and Derivate Projected Database: A New Approach for Supporting PrefixSpan Scalability. 2015 International Conference on Data and Software Engineering (ICoDSE).
[14] Lucchese, Claudio, Orlando, Salvatore, Perego, Raffaele, Silvestri, Fabrizio. (2004). WebDocs: a real-life huge transactional dataset. FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004.
[15] Liu, Hongyan, Han, Jiawei, Xin, Dong, Shao, Zheng. (2006). Top-Down Mining of Frequent Patterns from Very High Dimensional Data. Philadelphia: Society for Industrial and Applied Mathematics.