Analysis of Large Web Sequences Using AprioriAll_Set Algorithm

Dr. Sunita Mahajan (1), Prajakta Pawar (2) and Alpa Reshamwala (3)

(1) Principal, Institute of Computer Science, MET, Mumbai University, Bandra, Mumbai, India
(2) M.Tech Student, Computer Engineering Department, MPSTME, SVKM's NMIMS University, Mumbai, India
(3) Assistant Professor, Computer Engineering Department, MPSTME, SVKM's NMIMS University, Mumbai, India


International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: [email protected], [email protected]


Volume 3, Issue 2, March-April 2014, ISSN 2278-6856




Abstract: With the proliferation of the Internet, discovery and
analysis of useful information from the World Wide Web
has become a practical necessity. Web usage mining has become
a fertile field of research for improving the design of web sites,
analyzing system performance as well as network
communications, understanding user reaction and motivation,
and building adaptive Web sites. This paper implements an
algorithm based on traditional set theory, AprioriAll_Set, on
the Web sequential datasets KDD CUP 2000, Kosarak and
MSNBC. Experimental results show that the
AprioriAll_Set algorithm gives the best performance in terms of
execution time compared to AprioriAll and GSP. The AprioriAll_Set
algorithm avoids multiple scans of the sequence database to compute
support by storing the sequence database with its item positions.
Because traditional set operations are applied in the algorithm, the
database keeps shrinking as the length of the sequences
increases.

Keywords: Generalized Sequential Pattern Mining, Web
Usage Mining, Sequential Pattern Mining, AprioriAll, Set
Theory.

1. INTRODUCTION
The actual data mining task is the automatic or
semi-automatic analysis of large quantities of data to extract
previously unknown interesting patterns such as groups of
data records (cluster analysis), unusual records (anomaly
detection) and dependencies (association rule mining).
This usually involves using database techniques such
as spatial indices. The focus of Mining Sequential
Patterns from Large Data Sets is on sequential pattern
mining.
In many applications, such as bioinformatics, web access
traces, system utilization logs, etc., the data is naturally in
the form of sequences. This information has been of
great interest for analyzing the sequential data to find its
inherent characteristics. Sequential Pattern mining is a
topic of data mining concerned with finding statistically
relevant patterns between data examples where the values
are delivered in a sequence. It is usually presumed that
the values are discrete, and thus time series mining is
closely related, but usually considered a different activity.
Sequential pattern mining is a special case of structured
data mining. Sequence data is omnipresent. Customer
shopping sequences, medical treatment data, and data
related to natural disasters, science and engineering
processes data, stocks and markets data, telephone calling
patterns, weblog click streams, program execution
sequences, DNA sequences and gene expression and
structures data are some examples of sequence data.
Traditional association rule mining finds intra-transaction
patterns, whereas sequential pattern mining finds
inter-transaction patterns, detecting the presence of a set
of items in a time-ordered sequence of transactions. In
basic association rule mining, the items occurring in one
transaction have no order, but in sequential pattern
mining, an order exists between the items (events) and an
item may re-occur in the same sequence.
An important application of sequential mining techniques
is web usage mining, for mining web log accesses, where
the sequences of web page accesses made by different web
users over a period of time, through a server, are
recorded.
Web mining is the application of data mining techniques
to discover patterns from the Web [1]. In Web Mining,
data can be collected at the server side, the client side, or proxy
servers, or obtained from an organization's database,
which contains business data or consolidated Web data.
The information gathered through Web mining is
evaluated by using traditional data mining parameters
such as clustering and classification, association, and
examination of sequential patterns [2]. According to
analysis targets, web mining can be divided into three
different types, which are Web usage mining, Web
content mining and Web structure mining.
Web usage mining has many real-life applications,
such as improving the design of web sites, analyzing system
performance as well as network communications,
understanding user reaction and motivation, and building
adaptive Web sites; it is now a very important and useful
subject. Web usage mining is concerned with finding user
navigational patterns on the World Wide Web by
extracting knowledge from web logs, where ordered
sequences of events in the sequence database are
composed of single items and not sets of items, with the
assumption that a web user can physically access only one
web page at any given point in time. Research in pattern
mining, data mining, machine learning and statistics
has mainly focused on the analysis of web
pattern discovery. Pattern mining can take the following forms:
Statistical analysis, used to obtain useful statistical
information such as the most frequently accessed pages;
Association rule mining [2], used to find references to a
set of pages that are accessed together with a support
value exceeding some specified threshold; Sequential
pattern mining [3], used to discover frequent sequential
patterns which are lists of Web pages ordered by viewing
time for predicting visit patterns; Clustering, used to
group together users with similar characteristics;
Classification, used to group together users into
predefined classes based on their characteristics.
Currently, most web usage mining solutions consider web
access by a user as one page at a time, giving rise to a
special sequence database with only one item in each
sequence's ordered event list. Thus, given a set of events
E = {a, b, c, d, e, f}, which may represent product web
pages accessed by users in an e-commerce application, a
web access sequence database for four users may have
four records: [T1, <abdac>]; [T2, <eaebcac>]; [T3,
<babfaec>]; [T4, <abfac>]. Web log pattern mining
on this web sequence database can find a frequent
sequence, abac, indicating that over 90% of users who
visit product a's web page also go on to visit product
b's web page and then revisit product a's page, before
visiting product c's page. Store managers may then place
promotional prices on product a's web page, which is
visited a number of times in sequence, to increase the sale
of other products. The web log could be on the server
side, the client side, or a proxy server, each with its own
benefits and drawbacks in finding the users' relevant
patterns and navigational sessions.
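The support of the sequence abac in the four-record example can be checked with a short sketch. It assumes that a user sequence supports a pattern when the pattern's pages occur in the same order, with other page visits allowed in between; the function names is_subsequence and support are illustrative and are not taken from the paper.

```python
def is_subsequence(pattern, sequence):
    """Return True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so matches must appear in order.
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Fraction of sequences in `database` that contain `pattern` as a subsequence."""
    hits = sum(is_subsequence(pattern, seq) for seq in database.values())
    return hits / len(database)


# The four web access records from the example.
web_access_db = {"T1": "abdac", "T2": "eaebcac", "T3": "babfaec", "T4": "abfac"}

print(support("abac", web_access_db))  # 1.0
```

All four users support abac in this small database, so the pattern passes any minimum support threshold up to 100%.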
In this paper, we focus on sequential pattern mining for
finding interesting patterns based on Web click stream
sequences.

2. REVIEW OF LITERATURE
Mining frequent web access patterns from very large
databases (e.g. using click-stream analysis) has been
studied intensively and there are a variety of approaches.
Most of the previous studies have adopted a sequential
patterns mining technique which aims to find sub-
sequences that appear frequently in a sequence database
on a web log access sequence. In web server logs, a visit
by a client is recorded over a period of time and the
discovery of sequential patterns allows web-based
organizations to predict user visit patterns, which helps in
targeting advertising aimed at groups of users based on
these patterns.
Sequential pattern mining was proposed in [3], using the
main idea of association rule mining presented in the Apriori
algorithm of [2]. Three algorithms (AprioriAll,
AprioriSome, and DynamicSome) to handle the sequential
mining problem were proposed in [3]. Following this, the
GSP (Generalized Sequential Patterns) [4] algorithm,
which is up to 20 times faster than the AprioriAll algorithm of [3],
was proposed. The PSP (Prefix Tree for Sequential
Patterns) [5] approach is very similar to the GSP
algorithm [4]. The main idea of graph traversal mining,
proposed in [6][7], is to use a simple unweighted
graph to reflect the relationship between the pages of Web
sites. The Web Utilization Miner (WUM) [8] tool aims to
discover sequential patterns that are considered
interesting from a statistical point of view. WAP-mine,
described in [1], is a method that allows the
extraction of frequent patterns from user sessions. The
authors of [9] were interested in discovering contiguous
sequence patterns in a Web log file. The FS-Miner
algorithm [10] is based on the FS-Tree, a
compressed tree used to represent sequences.
ApproxMAP [11] combines clustering and sequential
patterns for multiple alignment sequential
pattern mining. The Pre-Order Linked WAP-Tree Mining
(PLWAP) algorithm was presented in [12] for
efficient mining of sequential patterns from the Web
log. Automatic log mining via a genetic algorithm to mine
sequential accesses from Web log files was proposed
in [13]. An intelligent recommender system known as
SWARS (Sequential Web Access based Recommender
System) that uses sequential access pattern mining was
proposed in [14].
Traditional sequential pattern mining approaches such
as Apriori-based algorithms [3, 4] encounter the problem
that multiple scans of the database are required in order
to determine which candidates are actually frequent. Most
of the solutions provided so far for reducing the
computational cost resulting from the apriori property use
a bitmap vertical representation of the access sequence
database [15][16][17][18] and employ bitwise operations
to calculate support at each iteration. The transformed
vertical databases, in turn, introduce overheads that
lower the performance of such algorithms, though not
necessarily below that of pattern-growth algorithms.
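To illustrate why apriori-based methods need repeated passes over the data, the following is a minimal sketch of level-wise generate-and-test mining for single-item web access sequences. It is a simplified illustration and not the AprioriAll or GSP implementation of [3] or [4]; the names contains and mine_level_wise, and the GSP-style join of two frequent sequences, are assumptions made for the example.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order, with gaps allowed."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def mine_level_wise(database, min_count):
    """Return ({frequent pattern: support count}, number of database scans)."""
    items = sorted({item for seq in database for item in seq})
    level = [(item,) for item in items]  # candidate 1-sequences
    frequent = {}
    scans = 0
    while level:
        # Test step: one full pass over the database per candidate length.
        scans += 1
        counts = {c: sum(contains(seq, c) for seq in database) for c in level}
        survivors = [c for c in level if counts[c] >= min_count]
        frequent.update((c, counts[c]) for c in survivors)
        # Generate step: join p and q when p without its first item equals
        # q without its last item, giving a candidate one item longer.
        level = [p + (q[-1],) for p in survivors for q in survivors
                 if p[1:] == q[:-1]]
    return frequent, scans


database = ["abdac", "eaebcac", "babfaec", "abfac"]
frequent, scans = mine_level_wise(database, min_count=4)
print(frequent[("a", "b", "a", "c")], scans)  # 4 4
```

Each additional pattern length costs another complete scan, which is the overhead that position-based and vertical representations try to avoid.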
Chiu et al. [19] propose the DISC-all algorithm along with
the Direct Sequence Comparison (DISC) technique, to
avoid support counting by pruning non-frequent
sequences according to other sequences of the same
length. There is still no variation of DISC-all for web
log mining. Pei et al. introduced a compressed data
structure called Web Access Pattern tree (or WAP-tree),
which facilitates the development of algorithms for
mining access patterns from pieces of web logs [1]. Since
then, many modifications were proposed in order to
further improve efficiency, by eliminating the need to
perform any re-construction of intermediate WAP-trees
during mining; for example the Position Coded Pre-order
Linked Web Access Pattern mining algorithm [20][21],
Conditional Sequence mining algorithm [22] and the
modified Web Access Pattern (mWAP) algorithm [23].
Sequential pattern mining algorithms can be classified
into apriori-based, pattern-growth, early-pruning, and
hybrids of these three techniques. Breadth-first search,
generate-and-test, and multiple scans of the database are
all key features of apriori-based methods that pose
challenging problems and hinder the performance of the
algorithms. Also, the apriori-based algorithms are too
slow and have a large search space, while pattern-growth
algorithms have been tested extensively on mining the
web log and found to be fast; early-pruning algorithms
have had success with protein sequences stored in dense
databases. The approach of Shang Gao et al. [24] relaxes the
constraint described in AprioriAll/AprioriSome and improves
performance through a user-oriented and self-adaptive approach
rather than a probabilistic knowledge representation. In this
paper, a traditional set-based, apriori-based algorithm
proposed by A. Reshamwala and S. Mahajan in [25] is
implemented, as the algorithm has acceptable performance
measures such as low CPU execution time and low
memory utilization when mining with low minimum
support values. This algorithm also handles candidate
sequence pruning by utilizing a data structure that allows
it to prune candidate sequences early in the mining
process. The AprioriAll_Set algorithm avoids multiple scans
of the sequence database to compute support by storing the
sequence database with its item positions in a HashMap
data structure in Java. Because traditional set operations
are applied in the algorithm, the database keeps shrinking
as the length of the sequences increases.
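The idea of storing item positions so that support can be computed without rescanning the sequence database can be sketched as follows. This is only an illustration under stated assumptions, written in Python rather than Java, and it is not the AprioriAll_Set implementation of [25]: the index layout (item, then sequence id, then positions) and the names build_position_index, extend and support_count are hypothetical. It shows how a set intersection over sequence ids makes the working set shrink as the pattern grows.

```python
from collections import defaultdict


def build_position_index(database):
    """Single pass over the database: item -> {sequence id: ascending positions}."""
    index = defaultdict(dict)
    for sid, seq in database.items():
        for pos, item in enumerate(seq):
            index[item].setdefault(sid, []).append(pos)
    return index


def extend(pattern_ends, item, index):
    """Append `item` to a pattern described by {sid: earliest end position}.

    Only sequences that hold the pattern AND hold `item` at a later position
    survive, so the returned dictionary never grows.
    """
    item_positions = index.get(item, {})
    extended = {}
    for sid in pattern_ends.keys() & item_positions.keys():  # set intersection
        later = [p for p in item_positions[sid] if p > pattern_ends[sid]]
        if later:
            extended[sid] = later[0]  # earliest match keeps later options open
    return extended


def support_count(pattern, index):
    """Number of sequences containing the non-empty `pattern`, using only the index."""
    ends = {sid: pos[0] for sid, pos in index.get(pattern[0], {}).items()}
    for item in pattern[1:]:
        ends = extend(ends, item, index)
    return len(ends)


database = {"T1": "abdac", "T2": "eaebcac", "T3": "babfaec", "T4": "abfac"}
index = build_position_index(database)
print(support_count("abac", index))  # 4
print(support_count("abd", index))   # 1
```

After the index is built, no further pass over the original sequences is needed; each extension touches only the sequences that still support the shorter pattern.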

3. EXPERIMENTAL RESULTS
A simulation study is done to compare the performances
of the algorithms: AprioriAll[3], GSP[4] and the
AprioriAll_Set[25], to discover sequential patterns from
large Web sequences.
These algorithms are executed on the Web sequential
datasets KDD CUP 2000, Kosarak and MSNBC. These
datasets were downloaded from SPMF (Sequential Pattern
Mining Framework), which is implemented by Philippe
Fournier-Viger [26] and available from
http://www.philippe-fournier-viger.com/spmf/. The
SPMF tool is also used to analyze and compare dataset
statistical parameters.
Figure 1 shows the comparison of the different web
datasets. The KDD CUP 2000 dataset contains 59,601
sequences of click-stream data from an e-commerce site. It
contains 497 distinct items. The average length of
sequences is 2.42 items with a standard deviation of 3.22.

Figure 1. Comparison of Web Dataset Statistical
Parameters

In this dataset, there are some long sequences. For
example, 318 sequences contain more than 20 items.
Kosarak is a very large dataset containing 990 000
sequences of click-stream data from a Hungarian news
portal. The dataset is converted to SPMF format.
However, this dataset is very large; Table I reports 69999
sequences for the Kosarak data used here. MSNBC is a dataset
of click-stream data. The original dataset contains
989,818 sequences obtained from the UCI repository.
Here the shortest sequences have been removed to keep
only 31,790 sequences. The number of distinct items in
this dataset is 17 (an item is a webpage category). The
average number of itemsets per sequence is 13.33. The
average number of distinct items per sequence is 5.33. As
shown in Table I, Kosarak and MSNBC are dense
datasets, that is, there are usually fewer unique items.

Table I: Web Dataset Statistical Parameters.
Sr. no  Statistical Parameters                          KDD CUP 2000  Kosarak      MSNBC
1       Number of sequences                             59601         69999        31790
2       Number of distinct items                        497           21144        17
3       Average number of itemsets per sequence         2.510662573   7.976871098  13.33048128
4       Average number of distinct items per sequence   2.510662573   7.976813954  5.333815665

The experiments were performed on a system running Java
SE 1.6.0_26 with NetBeans 7.0 on Windows 7
Professional, with an Intel Core i5-2400 processor at 3.10 GHz and
4 GB RAM. Execution-time performance on the KDD CUP 2000,
Kosarak and MSNBC datasets can be seen in Figures 2, 3 and 4
respectively, where the minimum support
ranges from 1% to 10%. AprioriAll_Set takes the
least execution time (in seconds) compared to
AprioriAll and GSP due to its dataset shrinking property.
AprioriAll_Set avoids scanning the dataset multiple times
for support counting, leading to faster execution. AprioriAll_Set
is followed by GSP for the Kosarak and KDD CUP 2000
datasets, as shown in Figures 3 and 2 respectively. But for
the sparse dataset, a feature of web usage mining,
AprioriAll_Set is followed by AprioriAll.
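The minimum support percentages used in these experiments correspond to absolute sequence counts that depend on the dataset size. The sketch below converts a whole-number percentage threshold into a count for the sequence totals listed in Table I. Rounding up to the next whole sequence is an assumption, since the paper does not state how fractional thresholds are handled, and min_support_count is a hypothetical helper name.

```python
# Number of sequences per dataset, as listed in Table I.
SEQUENCE_COUNTS = {"KDD CUP 2000": 59601, "Kosarak": 69999, "MSNBC": 31790}


def min_support_count(num_sequences, percent):
    """Smallest sequence count meeting `percent` support (assumed to round up).

    `percent` is a whole-number percentage; integer arithmetic avoids
    floating-point rounding surprises.
    """
    return (num_sequences * percent + 99) // 100


for name, total in SEQUENCE_COUNTS.items():
    print(name, min_support_count(total, 1), min_support_count(total, 10))
# KDD CUP 2000 597 5961
# Kosarak 700 7000
# MSNBC 318 3179
```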

Figure 3: Performance of Kosarak Dataset

Figure 4: Performance of MSNBC Dataset

Figure 5: Patterns discovery of KDD CUP 2000

From Figure 4, the performance shows that the GSP
algorithm takes the maximum time for the MSNBC sparse
dataset. From Figure 3, GSP is followed by AprioriAll as
the number of sequences increases in the sparse dataset
Kosarak. The GSP algorithm discovers the largest number of
frequent sequential patterns. From Figures 5, 6
and 7, it can be seen that AprioriAll_Set and AprioriAll
generate the same number of frequent sequential patterns due
to the property of candidate sequence pruning.

Figure 6: Patterns discovery of Kosarak

Figure 7: Patterns discovery of MSNBC

4. CONCLUSION AND FUTURE WORK
The complexity of tasks such as Web site design, Web
server design, and of simply navigating through a Web
site has been increasing continuously. An important input
to these design tasks is the analysis of how a Web site is
being used. Usage analysis includes straightforward
statistics, such as page access frequency, as well as more
sophisticated forms of analysis, such as finding the
common traversal paths through a Web site.
Web Usage Mining is the application of pattern
mining techniques to usage logs of large Web data
repositories in order to produce results that can be used in
the design tasks.
Hence, these experimental results show that the
AprioriAll_Set algorithm gives the best performance
when execution time is considered. AprioriAll_Set takes
the least execution time due to its dataset
shrinking property. It also avoids scanning the dataset
multiple times for support counting, leading to faster
execution. AprioriAll_Set is followed by GSP for the Kosarak
and KDD CUP 2000 datasets. But for a sparse dataset
like MSNBC, AprioriAll_Set is followed by AprioriAll,
and the GSP algorithm takes the maximum time. GSP is
followed by AprioriAll as the number of sequences
increases in a sparse dataset like Kosarak.
AprioriAll_Set and AprioriAll generate the same number of
frequent sequential patterns due to the property of
candidate sequence pruning.
In these experiments, sequence patterns were found
by ignoring the time interval and considering only the
temporal order of the patterns. In future work, the
approach can be extended to more set-based mathematical
models for further data analysis in order to discover
hidden sequential patterns.

References
[1] J. Pei, J. Han, B. Mortazavi-Asl, and H. Zhu. 2000. Mining
access patterns efficiently from web logs. In Proceedings of
the Pacific-Asia Conference on Knowledge Discovery and
Data Mining (PAKDD'00). Kyoto, Japan, pp. 396-399, 400-
402, 2000.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining
association rules in large databases. In Proceedings of the
20th International Conference on Very Large Databases.
Santiago, Chile, pp. 487-499, 1994.
[3] R. Agrawal and R. Srikant, Mining sequential patterns,
In 11th Int'l Conf. of Data Engineering (ICDE'95), pp. 3-
14, Taipei, Taiwan, Mar. 1995.
[4] R. Srikant and R. Agrawal, Mining Sequential
Patterns: Generalizations and performance improvements,
Proceedings of the Fifth International Conference on
Extending Database Technology, (Avignon, France, 1996),
Springer-Verlag, vol. 1057, 3-17.
[5] Masseglia, F., Cathala, F., and Poncelet, P., PSP: Prefix
tree for sequential patterns. In Proc. of the 2nd
European Symposium on Principles of Data Mining and
Knowledge Discovery (PKDD'98). 176-184, France, LNAI,
1998.
[6] Nanopoulos, A. and Manolopoulos, Y., Mining patterns
from graph traversals. Data and Knowledge Engineering,
2001.
[7] Nanopoulos, A. and Manolopoulos, Y. 2000. Finding
generalized path patterns for Web log data mining. Data
and Knowledge Engineering, 37(3):243-266.
[8] Spiliopoulou, M., The Laborious Way from Data Mining to
Web Mining, Journal of Computer Systems & Engg, Special
Issue on Semantics of the Web, 14:(113-126), 1999.
[9] Y. Xiao et al. Efficient Mining of Traversal Patterns. Data
and Knowledge Engineering, 39(2):191-214, 2001.
[10] M. El-Sayed, C. Ruiz, and E. A. Rundensteiner. FS-Miner:
Efficient and Incremental Mining of Frequent Sequence
Patterns in Web Logs. In Proc. of the Sixth Annual ACM
International Workshop on Web Information and Data
Management (WIDM'04), 128-135. ACM Press, 2004.
[11] H.C. Kum. Approximate Mining of Consensus Sequential
Patterns. PhD thesis, University of North Carolina, 2004.
[12] C.I. Ezeife, Yi Lu, Mining Web Log Sequential Patterns
with Position Coded Pre-order Linked WAP-Tree, Springer
Science, Data Mining & Knowledge Discovery, 10, 5-38,
2005.
[13] Emine Tug, Merve Sakiroglu, Ahmet Arslan, Automatic
Discovery of the Sequential Accesses from Web log data
files via a genetic algorithm, Elsevier, Knowledge Based
Systems, 19, 180-186, 2006.
[14] Baoyao Zhou, Siu Cheung Hui, Kuiyu Chang, An
Intelligent Recommender System using Sequential Web
Access Patterns, In Proc. of the IEEE International Conf.
on Cybernetics and Intelligent Systems, 393-398,
Singapore, 2004.
[15] Zaki, M. J., Efficient enumeration of frequent sequences.
In Proceedings of the 7th International Conference on
Information and Knowledge Management. 68-75. 1998.
[16] Ayres, J., Flannick, J., Gehrke, J., and Yiu, T.,
Sequential pattern mining using a bitmap representation. In
Proceedings of the 8th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining.
429-435. 2002.
[17] Yang, Z. and Kitsuregawa, M., LAPIN-SPAM: An
improved algorithm for mining sequential pattern. In
Proceedings of the 21st International Conference on Data
Engineering Workshops. 1222. 2005.
[18] Song, S., Hu, H., and Jin, S., HVSM: A new
sequential pattern mining algorithm using bitmap
representation. In Advanced Data Mining and
Applications. Lecture Notes in Computer Science, vol.
3584, Springer, Berlin, 455-463. 2005.
[19] Chiu, D.-Y., Wu, Y.-H., and Chen, A. L. P., An
efficient algorithm for mining frequent sequences by a new
strategy without support counting. In Proceedings of the
20th International Conference on Data Engineering. 375-386.
2004.
[20] I. Ezeife and Y. Lu, Mining web log sequential
patterns with position coded pre-order linked WAP-tree,
International Journal of Data Mining and
Knowledge Discovery, 2005, 10, 5-38.
[21] W. Wang and P. T. Cao-Thai, Novel position-coded
methods for mining web access patterns, IEEE
International Conference on Intelligence and Security
Informatics, 2008, 194-196.
[22] X. Tan, M. Yao and J. Zhang, Mining maximal frequent
access sequences based on improved WAP-tree,
Proceedings of the Sixth International Conference on
Intelligent Systems Design and Applications, IEEE
Computer Society Press, 2006, vol. 1, 616-620.
[23] J. D. Parmar and S. Garg, Modified web access pattern
(mWAP) approach for sequential pattern mining,
INFOCOMP Journal of Computer Science, June, 2007,
6(2): 46-54.
[24] Shang Gao, Reda Alhaji, Jon Rokne, Jiwen Guan, Set
Based Approach in Mining Sequential Patterns, 24th
International Symposium on Computer and Information
Sciences, 2009. ISCIS 2009, pp. 218-223.
[25] Alpa Reshamwala, Dr. Sunita Mahajan, Traditional Set
based Approach in Mining Sequential Patterns,
Proceedings of National Conference on New Horizons in
IT - NCNHIT 2013, ISBN 978-93-82338, pp. 173-177.
[26] SPMF: Sequential Pattern Mining Framework.

AUTHOR

Ms. Alpa Reshamwala is currently working
as an Assistant Professor in the Department
of Computer Engineering at MPSTME,
NMIMS University. She received her B.E.
degree in Computer Engineering from Fr. CRCE, Bandra,
Mumbai University in 2000 and her M.E. degree in Computer
Engineering from TSEC, Mumbai University in 2008.
Her areas of interest include Artificial Intelligence, Data
Mining, Soft Computing, Fuzzy Logic, Neural Networks
and Genetic Algorithms. She has more than 25 papers in
national and international conferences and journals to her credit.

Dr. Sunita M. Mahajan is currently
working as the Principal, Mumbai
Educational Trust's Institute of Computer
Science. She received her Doctorate from
S.N.D.T. Women's University in 1997. She worked
as a senior scientist at Bhabha Atomic Research Centre for
31 years and entered the educational field after her
retirement. She has done extensive work in parallel
processing. She has more than 45 papers in national and
international conferences and journals to her credit. She
has guided many PhD students in distributed computing,
data mining, natural language processing, etc. Her current
fields of interest are parallel processing, distributed
computing, cloud computing and data mining. She has also
written a textbook on Distributed Computing (New
Delhi, Oxford University Press, 2010).

Prajakta Pawar is currently pursuing an
M.Tech in the Department of Computer
Engineering at MPSTME, NMIMS
University. She received her B.E. degree
from SSJCOE, Dombivli, Mumbai University in 2011.
Her areas of interest include Artificial Intelligence and Data
Mining. She has published 3 papers. She has attended
one international conference and received an award for
Excellent Paper.
