
Scalable Sequential Pattern Mining Based on PrefixSpan for High-Dimensional Data

Muhammad Nur Akbar
School of Electronic Engineering and Informatics
Institute of Technology Bandung
Bandung, Indonesia
muhnurakbar_mail[at]gmail.com

G.A. Putri Saptawati
School of Electronic Engineering and Informatics
Institute of Technology Bandung
Bandung, Indonesia
putri[at]informatika.org

Abstract—The rapid development and use of information technology has resulted in the phenomenon of data explosion, which makes it more difficult to analyze data and find insights in it. A problem that often occurs when pattern recognition is applied in real-world domains is caused not only by the large size of the data but also by its high dimensionality. Data analysis demands that large and complex data be processed quickly and optimally to support decision making. This study offers scalable sequential pattern extraction to gain more insight from the data, using PrefixSpan implemented on the Spark platform as a distributed system. The goal is to overcome the problem of the increasing amount of data (scalability) in complex and high-dimensional data effectively and with relatively quick performance.

Keywords—PrefixSpan, Spark, sequential pattern mining, scalability, high-dimensional data

I. INTRODUCTION

The development of information technology, the internet, and devices has resulted in an enormous amount of data generated every day. This data explosion phenomenon is a situation in which data becomes so numerous and large that it is difficult for people to extract information and knowledge from it [1]. This phenomenon triggered a new concept of information management called Big Data [2, 3]. The problem that comes with big data is how to process and analyze the data in order to obtain information and knowledge that improve the value of a very large data set. To overcome this problem, a process called Knowledge Discovery from Databases (KDD) was developed [1].

One KDD implementation on enterprise data that has been widely applied by e-commerce organizations is the analysis of customer behavior. Market Basket Analysis (MBA) is a knowledge discovery method that has been commonly used to find patterns of customer behavior or customer shopping habits, which then help decision makers choose the right marketing strategy and simplify the sales process [4].

Sequential pattern mining is one of the techniques that can be used in MBA: it determines patterns of relatedness between goods and illustrates the sequence of events in which the goods are purchased. One sequential pattern approach is PrefixSpan, which applies divide and conquer [5]. With this approach, the database is recursively projected into a set of smaller databases based on the frequent patterns found at each step, and the projections are then mined to obtain patterns.

However, to find sequential patterns in large and high-dimensional data within a relatively short time, parsing is needed in several functions. The recursive process of PrefixSpan, when implemented on big data, affects memory usage and can compromise the reliability of the process, causing the program to stop before it completes [5]. Therefore, big data requires different processing and programming techniques than a relational database.

MapReduce is one form of big data processing on parallel and distributed servers. MapReduce is a concept in which data is repeatedly broken down into parts and distributed across the machines connected in a cluster for processing [6, 7]. In previous research, the MapReduce concept was used to support scalability in PrefixSpan with two alternative ways of building the projected database. However, the Hadoop MapReduce approach is still not optimal and has limitations when applied to a recursive algorithm: Hadoop MapReduce requires every input and output of a process to be stored on disk, and this incurs a high cost [13].

Spark is a development of Hadoop MapReduce that processes large amounts of data more effectively, easily, and quickly. It introduces new processing methods and the RDD (Resilient Distributed Datasets) data structure as its main unit, addressing the parts of MapReduce that are less optimal for recursive algorithms, including the construction of the projected database in the PrefixSpan algorithm [8]. In this paper, an implementation of the PrefixSpan algorithm on a distributed system is proposed.

II. PRELIMINARY DEFINITION

A. High Dimensional Data

Many real-world applications deal with transactional data, characterized by a huge number of transactions (tuples) with a small number of dimensions (attributes). However, there are some other applications that involve rather high-dimensional data [15]. Examples of such applications include bioinformatics, survey-based statistical analysis, text processing, and so on. High-dimensional data pose great challenges to most existing data mining algorithms. Although there are numerous algorithms dealing with transactional data sets, there are few algorithms oriented to very high-dimensional data sets. Taking
sequential pattern mining for MBA as an example, most of the existing algorithms are column (i.e., item) enumeration-based algorithms, which take the combinations of columns (items) as their search space. Due to the exponential number of column combinations, this method is not suitable for very high-dimensional data.

Let T be a discretized data table (or data set), composed of a set of rows, S = {r1, r2, …, rn}, where ri (i = 1, …, n) is called a row ID, or rid in short. Each row corresponds to a sample consisting of k discrete values or intervals. For simplicity, we call each of these values or intervals an item. We call a set of rids a rowset, and a rowset having k rids a k-rowset. Likewise, we call a set of items an itemset. A k-rowset is called large if k is no less than a user-specified threshold called the minimum size threshold.

Let TT be the transposed table of T, in which each row corresponds to an item ij and consists of the set of rids that contain ij in T. For clarity, we call each row of TT a tuple.

TABLE I. AN EXAMPLE TABLE T

rid | A  | B  | C  | D
1   | a1 | b1 | c1 | d1
2   | a1 | b1 | c2 | d2
3   | a1 | b1 | c1 | d2
4   | a2 | b1 | c2 | d2
5   | a2 | b2 | c2 | d3

Table I shows an example table T with 4 attributes (columns): A, B, C and D. The corresponding transposed table TT is shown in Table II. For simplicity, we use the number i (i = 1, 2, …, n) instead of ri to represent each rid.

TABLE II. TRANSPOSED TABLE TT OF T

Itemset | rowset
a1      | 1, 2, 3
a2      | 4, 5
b1      | 1, 2, 3, 4
c1      | 1, 3
c2      | 2, 4, 5
d2      | 1, 3, 4

Originally, we want to find all of the frequent closed itemsets that satisfy the minimum support threshold minsup in table T. After transposing T into the transposed table TT, the minimum support threshold constraint on itemsets becomes a minimum size threshold on rowsets. Therefore, the mining task becomes finding all of the large closed rowsets that satisfy the minimum size threshold minsup in table TT.

B. Sequential Pattern Mining

Sequential pattern mining is a data mining process that generates knowledge about series of events whose frequency of occurrence exceeds a threshold value [5].

The sequential pattern mining process begins by determining a minimum support. The minimum support, or min_sup, is a threshold specified by the user to determine whether a pattern is interesting. Sequential pattern mining itself is defined as follows: given min_sup and a set of sequences, where each sequence is a list of events and each event is a set of items, sequential pattern mining finds all frequent subsequences, that is, subsequences whose frequency of occurrence in the sequence database is no less than min_sup.

C. PrefixSpan (Prefix-Projected Sequential Pattern Mining)

The PrefixSpan method uses a divide-and-conquer approach that reduces the size of the sequence database by keeping only the sequences that share a given sequential pattern; the result is called a projected database. Further sequential patterns are obtained by finding the items that are locally frequent in a projected database. Two terms are important in the construction of a projected database: prefix and suffix. The PrefixSpan algorithm [5] is as follows:

Input: a sequence database S and a minimum support threshold min_support.
Output: the complete set of sequential patterns.
Method: call PrefixSpan(<>, 0, S).
Subroutine: PrefixSpan(α, L, S|α)
Parameters:
1. α is a sequential pattern;
2. L is the length of α;
3. S|α is the α-projected database if α ≠ <>, otherwise the sequence database S itself.
Method:
1. Scan S|α once and find every frequent item b such that:
   - b can be assembled into the last element of α to form a sequential pattern; or
   - <b> can be appended to α to form a sequential pattern.
2. For each frequent item b, append b to α to form a sequential pattern α'.
3. For each α', construct the α'-projected database S|α' and call PrefixSpan(α', L + 1, S|α').

From the algorithm above, the efficiency of the PrefixSpan method follows from three observations:

1. PrefixSpan does not need to generate or test candidate sequences.
2. The projected databases keep shrinking.
3. The primary cost of PrefixSpan is the construction of the projected databases.

D. The Apache Spark Framework

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. Spark was basically developed to overcome the
shortcomings of Hadoop MapReduce, which is less optimal when implemented for recursive algorithms, including the construction of the projected database in the PrefixSpan algorithm. The features of Spark are the following:

1. It supports Map and Reduce functions with many advantages.
2. It optimizes arbitrary operator graphs.
3. It provides concise and consistent APIs in the Scala, Java, Python, and R programming languages.
4. It offers interactive shells in Scala and Python.
5. It can be integrated with the Hadoop ecosystem and its data sources (HDFS, Amazon S3, Cassandra, Hive, HBase, etc.).
6. It can be run on different cluster managers such as Hadoop YARN or Apache Mesos, or in standalone mode.

Spark contains several components [11] that are well integrated. At its core, Spark is a computation engine responsible for scheduling, distributing, and monitoring applications that consist of many computing tasks on many worker machines.

Fig. 1. Spark Ecosystem [10]

1. Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark's main programming abstraction.
2. Spark Streaming can be used for processing real-time streaming data. It is based on a micro-batch style of computing and processing, and it uses DStreams, which are basically series of RDDs, to process the real-time data.
3. Spark SQL provides the capability to expose Spark datasets over the JDBC API and allows running SQL-like queries on Spark data using traditional BI and visualization tools.
4. MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
5. GraphX is the new (alpha) Spark API for graphs and graph-parallel computation.

E. RDD (Resilient Distributed Datasets)

The Resilient Distributed Dataset, commonly abbreviated RDD, is the core concept of the Spark framework. An RDD can be described as a database table that can accommodate a variety of data types and is distributed over different partitions in memory on a large cluster, while maintaining fault tolerance in the data flow model just as Hadoop MapReduce does [11]. An RDD supports two types of operations, as shown in Figure 2:

Fig. 2. Spark Flow

1. Transformations
A transformation is an operation that does not return a value but forms a new RDD. It does not perform any evaluation: it only takes an RDD as input and returns an RDD in a different form.
2. Actions
An action is an operation that evaluates and returns a value. When an action is run on an RDD, all pending data processing requests are computed and a value is generated as the output.

III. ANALYSIS OF PREFIXSPAN IMPLEMENTATION ON THE SPARK FRAMEWORK

There are various techniques that can be used for pattern discovery in market basket analysis, and they produce different patterns depending on the chosen technique: frequent pattern mining, association rule mining, and sequential pattern mining. Sequential pattern mining is the data mining technique for finding patterns of sequential events. Compared to the other two techniques, sequential pattern mining is considered to provide more insight from the data, as it not only shows the relatedness of goods but also illustrates the sequence of events in which the goods were purchased.

PrefixSpan is an algorithm for extracting sequential patterns that applies the divide-and-conquer approach. With this approach, the database is recursively projected into a set of smaller databases based on the frequent patterns found at each step, and the projections are then mined to obtain the patterns. PrefixSpan projects only by prefix, so the size of the projected database keeps shrinking, and the redundancy checks on the possible options of each potential candidate are reduced. Several studies also show that in terms of execution time, scalability, reliability, and memory utilization, PrefixSpan is better than other algorithms [5].
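To make the prefix-projection idea concrete, the following is a minimal, illustrative sketch in plain Python, not the paper's Spark implementation; for brevity it treats each event as a single item rather than an itemset, and all names are our own:

```python
from collections import Counter

def prefixspan(db, min_sup):
    """Minimal PrefixSpan over sequences of single items.

    db      : list of sequences, each a list of items
    min_sup : absolute minimum support (number of sequences)
    Returns a dict mapping each frequent pattern (tuple) to its support.
    """
    patterns = {}

    def project(seqs, item):
        # Keep, for each sequence containing `item`, only the suffix
        # after its first occurrence: the item-projected database.
        projected = []
        for seq in seqs:
            if item in seq:
                projected.append(seq[seq.index(item) + 1:])
        return projected

    def mine(prefix, projected_db):
        # Count items that are locally frequent in the projected
        # database, counting each item at most once per sequence.
        counts = Counter()
        for seq in projected_db:
            counts.update(set(seq))
        for item, sup in counts.items():
            if sup >= min_sup:
                pattern = prefix + (item,)
                patterns[pattern] = sup
                # Recurse on the smaller, item-projected database.
                mine(pattern, project(projected_db, item))

    mine((), db)
    return patterns
```

For example, with the sequence database [['a','b','c'], ['a','c'], ['a','b','c'], ['b','c']] and min_sup = 2, this sketch finds seven patterns, among them ('a','b','c') with support 2, without ever generating candidate sequences, which matches the efficiency observations listed above.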
A. PrefixSpan on Spark

PrefixSpan on Spark is also partitioned into two parts that complete all the tasks, just like PrefixSpan on Hadoop MapReduce [12]. The concept of frequency counting is used to count the items in each transaction and compare the results with the minimum support. flatMap and map are used to count the items, and each item is stored in a tuple for further processing. The projected database can be implemented with the map and filter functions: the map function returns each transaction after purging its prefix (in other words, it produces the suffix), while the filter function removes the transactions that do not contain the appropriate item.

Fig. 3. PrefixSpan on Spark

B. Projected Database Construction in Spark PrefixSpan

Previous research on improving the scalability of PrefixSpan with Hadoop MapReduce [13] implemented the construction of the projected database using two approaches:

1. The first approach builds the projected database by parsing the original database to get each prefix and then taking the suffix of that prefix. The resulting projected database is only stored in memory, without needing to be stored persistently, because the construction process only refers to the original database. However, if the original database is large, the parsing process takes a long time, even though the resulting projected database is small, because the parsing must always be done on the original database.

2. The second approach is proposed to reduce the size of the data that must be parsed when looking for the prefix and suffix. The idea is based on the observation that a sequence belongs to the projected database of a prefix only if it is a member of the sequences that contain that prefix. However, this method requires the projected database of each prefix to be stored in a file, and those files are then referred to in order to construct the next projected database. Each projected database is stored in a named file that identifies the prefix of that projected database.

The problem found in the second approach is that every projected database result must be saved to a file on the hard disk. If the data is very large and has many prefixes, writing and reading the projected database to and from the hard disk on each iteration incurs a high cost.

Fig. 4. Iterative computations to project database in Hadoop MapReduce

The problem in the second approach can be solved by utilizing RDDs in Spark. Instead of being stored as files on the hard disk, the projected database produced in each iteration is stored as a new RDD in memory, which uses memory efficiently and thus eliminates the read and write costs incurred on Hadoop MapReduce.

Fig. 5. Iterative computations to project database in Spark

RDDs also have a storage mechanism that utilizes both memory and the hard disk: when memory is full, the RDDs that were least recently accessed are transferred to disk, to avoid running out of memory when the approach is applied to big, high-dimensional data.

Fig. 6. Projected database construction in RDD Spark

IV. EXPERIMENTAL RESULT

The experiments were run on a Spark cluster consisting of a master machine, which also has a role as a worker node, with a single worker or multiple workers. The machine specifications were:

1. Intel Core i5-4300U CPU @ 1.9 GHz x 4 cores, 12 GB RAM, 1 TB HDD (single worker, for experiments 1 and 3)
2. Intel Xeon CPU E5-4610 @ 2.4 GHz x 18 cores, 16 GB RAM, 300 GB HDD (multiple workers, for experiment 2)

The experiments used "WebDocs: a huge real-life transactional dataset" [14], which was built from a web collection. The dataset has a size of about 1.48 GB, contains exactly 1,692,082 transactions with 5,267,656 distinct items, and the maximal length of a transaction is 71,472.

The scalability experiment was performed in standalone mode with a single worker. The experiment was designed to find the relationship between the number of rows, the number of dimensions, and the execution time, with a minimum support of 25%. Figure 7 shows that PrefixSpan on Spark is scalable up to 1 million rows of data with approximately three million distinct items. When PrefixSpan was run on 1 million rows, it took a longer time because the RDDs no longer fit in memory and Spark had to move some RDDs to disk, but it was still better than PrefixSpan on MapReduce running on the same data size [13]. The execution time increased significantly when the number of distinct items increased, for row counts between 100,000 and 400,000.
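Returning to the projected database construction of Section A, the suffix pipeline can be mimicked without a cluster by using Python's built-in filter and map in place of the RDD operations of the same name; in an actual Spark job this would be roughly rdd.filter(lambda t: item in t).map(lambda t: t[t.index(item) + 1:]) on an RDD of transactions. The function name and data shapes below are our own illustration:

```python
def project_database(transactions, prefix_item):
    """Build the prefix-projected database for one item, mirroring the
    RDD filter/map pipeline: filter drops transactions that do not
    contain the item, map strips everything up to and including its
    first occurrence, leaving the suffix."""
    containing = filter(lambda t: prefix_item in t, transactions)
    suffixes = map(lambda t: t[t.index(prefix_item) + 1:], containing)
    # Discard empty suffixes; they cannot contribute further patterns.
    return [s for s in suffixes if s]
```

For example, project_database([['a','b','c'], ['b','d'], ['a','c']], 'a') returns [['b','c'], ['c']]: the second transaction is filtered out, and the other two are reduced to their suffixes. In Spark, the resulting collection would stay in memory as a new RDD rather than being written to disk, which is exactly the saving over the Hadoop MapReduce approaches described above.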

Fig. 7. Scalability Testing

The second experiment was done by adding workers to find the impact of the number of workers on execution time, with a minimum support of 25%. The test results (Figure 8) show that increasing the number of workers improves time performance, especially when processing big data. For big data, execution becomes 4.35 times faster, but for small data the improvement is only about 15% on average.

Fig. 8. Multi Worker Testing

The third experiment was done by adding executor memory capacity to find the impact of executor memory on execution time, over multiple data sizes. The minimum support was 25%. The results (Figure 9) show that increasing the executor memory capacity does not always improve time performance, because the number of partitions built by Spark by default is not always effective. A small number of partitions will not utilize all of the cores available in the cluster, but too many partitions will cause excessive overhead in managing many small tasks. The amount of executor memory only affects how many RDDs can fit in memory before being transferred to disk.

Fig. 9. Executor Memory Testing

V. CONCLUSION

In this paper, an implementation of the PrefixSpan algorithm on distributed systems was proposed. With this approach, PrefixSpan is extended to handle massive, high-dimensional data. By using Spark, memory usage becomes more optimal, which leads to better performance than common distributed Hadoop, but it requires memory resources relative to the data size.

Increasing the number of workers has a strong influence on time performance, especially for high-dimensional data. However, the experimental results showed that the performance does not increase linearly with the number of workers, especially when there are three or more workers. This is caused by the overhead of distributed systems: the more workers are included in the system, the more time and resources are spent on communication between these workers.

The experiments also showed that the performance does not scale linearly with the amount of executor memory; it only affects how many RDDs can fit in memory, so Spark requires a lot of configuration to run optimally. It needs configuration of the number of partitions relative to the data size, the number of cores, etc.

Future work may focus on optimization and on implementing this approach for other cases with high-dimensional data, such as user access behavior on websites or DNA sequence analysis.

REFERENCES

[1] Han, J. and Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd Ed. The Morgan Kaufmann Series in Data Management Systems. March 2006.
[2] Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171-209.
[3] Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5), 662-679.
[4] Rahianty, H., Nurullah, Letik, Juningsi D. F. J., W, Tri Wahyu, Wicaksana, I Wayan Simri. (2009). Personifikasi Web E-Commerce menggunakan Algoritma Data Mining. Proceeding PESAT, vol. 3.
[5] Pei, Jian, Han, Jiawei, Mortazavi-Asl, Behzad, Wang, Jianyong, Pinto, Helen, Chen, Qiming, et al. (2004). Mining Sequential Patterns by Pattern Growth: The PrefixSpan Approach. IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11.
[6] Chen, Y., Alspaugh, S., & Katz, R. (2012). Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. Proceedings of the VLDB Endowment, 5(12), 1802-1813.
[7] Madden, S. (2012). From databases to big data. IEEE Internet Computing, 16(3), 4-6.
[8] Dieng, Jie, Qu, Zhiguo, Zhu, Yongxu, Muntean, G.M., Wang, Xiaojun. (2014). Towards Efficient and Scalable Data Mining Using Spark. The Rince Institute, Dublin City University, Ireland.
[9] Zaharia, Matei, Chowdhury, Mosharaf, Franklin, Michael J., Shenker, Scott, Stoica, Ion. (2010). Spark: Cluster Computing with Working Sets. University of California, Berkeley.
[10] Karau, Holden, Konwinski, Andy, Wendell, Patrick, Zaharia, Matei. (2015). Learning Spark: Lightning-Fast Big Data Analysis, ch. 1. O'Reilly Media.
[11] Zaharia, Matei, Chowdhury, Mosharaf, Das, Tathagata, Dave, Ankur, Ma, Justin, McCauley, Murphy, Franklin, Michael, Shenker, Scott, Stoica, Ion. (2011). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Electrical Engineering and Computer Sciences, University of California, Berkeley.
[12] Dieng, Jie, Qu, Zhiguo, Zhu, Yongxu, Muntean, G.M., Wang, Xiaojun. (2014). Towards Efficient and Scalable Data Mining Using Spark. The Rince Institute, Dublin City University, Ireland.
[13] Sabrina, Puspita Nurul, Saptawati, G.A. Putri. (2015). Multiple MapReduce and Derivate Projected Database: A New Approach for Supporting PrefixSpan Scalability. 2015 International Conference on Data and Software Engineering (ICoDSE).
[14] Lucchese, Claudio, Orlando, Salvatore, Perego, Raffaele, Silvestri, Fabrizio. (2004). WebDocs: a real-life huge transactional dataset. FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004.
[15] Liu, Hongyan, Han, Jiawei, Xin, Dong, Shao, Zheng. (2006). Top-Down Mining of Frequent Patterns from Very High Dimensional Data. Philadelphia: Society for Industrial and Applied Mathematics.