
2009 15th International Conference on Parallel and Distributed Systems

Sequential Pattern Mining in Data Streams Using the Weighted Sliding Window
Model

Chuan XU, Yong CHEN, Rongfang BIE*


College of Information Science and Technology, Beijing Normal University
Beijing 100875, P. R. China
[email protected], [email protected]
Corresponding Author: [email protected]

Abstract—Mining data streams for knowledge discovery is important to many applications, including Web click stream mining, network intrusion detection, and on-line transaction analysis. In this paper, by analyzing data characteristics, we propose an efficient algorithm, SWSS (Sequential pattern mining with the Weighted Sliding window model in SPAM), to mine frequent sequential patterns based on the weighted sliding window model. This algorithm gives users more room to specify which sequences they are more interested in. Extensive experiments show that the proposed algorithm is feasible and efficient for mining all sequential patterns as users specified.

Keywords-Data Mining; Stream Mining; Sliding Window; Sequential Pattern Mining

I. INTRODUCTION

The problem of mining data streams arose with the introduction of new application areas, including Web click stream mining, network intrusion detection, and on-line transaction analysis [1]. Recently, data mining communities have focused on a new data model in which data arrives in the form of continuous streams. Many applications generate large amounts of streaming data in real time, such as online transaction flows in retail chains, Web click-streams in Web applications, performance measurements in network monitoring, and ATM transaction records in banks. There are several challenges for data stream mining. First, each item in a stream can be examined only once. Second, it is impractical to provide sufficient memory to store the continuously generated data. Third, the mining results should be obtained as fast as possible and be available in time to answer users' requests. Finally, the errors of the results should be bounded as tightly as possible.

In data stream mining [7, 10, 11, 12, 13, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26], the time models mainly include the landmark model [4], the tilted-time window model [5], and the sliding window model [6]. In this paper, we focus on the sliding window model. It deals with the data generated from the current moment back to a specified time point, and the data continuously changes as new elements arrive and old ones are deleted. The size of the window can be specified by users as a given number of transactions or a fixed time period. There are many research results on mining frequent itemsets in a data stream environment [2, 3, 6, 9]. Chi et al. [6] introduced a novel algorithm to mine closed frequent itemsets over data stream sliding windows; the closed enumeration tree (CET), an efficient in-memory data structure, is used to store all closed frequent itemsets in the current sliding window. Ho, Li, Kuo, and Lee [14] proposed IncSPAM to maintain sequential patterns over a stream sliding window; the Customer Bit-Vector Array with Sliding Window (CBASW) is introduced to store item information, and this representation clearly reduces the memory requirement and execution time. In [15], the author proposed a new flexible framework for stream mining, called the weighted sliding window model. Its main difference from the traditional sliding window model is that it allows users to specify the number of windows and the weight of each window, so the mining results are closer to users' requirements. The WSW algorithm and the improved WSW algorithm are used to mine frequent itemsets in continuous sliding windows.

In our work, we propose the SWSS algorithm to mine frequent sequential patterns in the weighted sliding window model. To the best of our knowledge, this is the first piece of work on mining frequent sequential patterns over weighted sliding windows. We make full use of the fast support counting of SPAM and of its lexicographical tree for storing the frequent sequential patterns. Several effective summary data structures are developed to store the essential information during mining: All-List, B-List, W-List, and W-Tree. All-List stores all distinct items that appear in the sliding windows specified by users. B-List dynamically keeps the bitmap representation of the sequence of each tree node in the lexicographical tree. All the sliding windows are stored in the W-List, which is refreshed as time goes by. Finally, W-Tree stores the frequent sequences and is the basis for the depth-first traversal (DFS). A sketch of these structures is given at the end of this introduction.

The rest of the paper is organized as follows. Section II presents the problem statement. The SWSS algorithm is presented in Section III. Section IV gives an example to illustrate the SWSS process. Section V reports the results of our experiments. Finally, we give the conclusions of the paper in Section VI.
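The paper treats these four structures abstractly. As a concrete and purely illustrative reading, they might be declared as follows in Python; all field names and types here are our assumptions, not the authors' design.

from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class WTreeNode:
    # W-Tree: each node stores one frequent sequence; its children are the
    # S-step/I-step extensions found by the depth-first traversal.
    sequence: List[frozenset]
    children: List["WTreeNode"] = field(default_factory=list)

@dataclass
class SWSSState:
    all_list: Set[int]              # All-List: distinct items seen in the windows
    b_list: Dict[Tuple, List[int]]  # B-List: sequence -> one bitmap per window
    w_list: List[object]            # W-List: the sliding windows, refreshed over time
    w_tree: WTreeNode               # W-Tree: root of the frequent-sequence tree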

II. PROBLEM DEFINITION

Assume that the current time point is T1 and n is the number of sliding windows. Let W = {w1, w2, …, wn} be the set of window weights, where each wi ∈ W, 1 ≤ i ≤ n, and Σ_{i=1..n} wi = 1. Let I = {i1, i2, …, in} be the set of items appearing in all sliding windows. A sequence is an ordered list of itemsets, denoted {s1, s2, …, sm}, where each si ⊆ I, 1 ≤ i ≤ m; the size m is the number of itemsets in the sequence. The support counts of an item x, x ∈ I, in {w1, w2, …, wn} are denoted {sup1, sup2, …, supn}. The weighted support count of an itemset x is defined as the sum, over the sliding windows, of the product of each window's weight and the support count of x in that window.

A frequent sequential pattern in the weighted sliding windows is a sequence s whose weighted support count (abbreviated sup_weight(s)) is not smaller than the minimum weighted support count of the windows (abbreviated minSup_weight).
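For clarity, the two quantities can be restated compactly; here sup_i(s) denotes the support count of s in window i, and c_i (our notation, not the paper's) denotes the number of customers in window i:

    sup_weight(s) = Σ_{i=1..n} wi × sup_i(s)
    minSup_weight = Σ_{i=1..n} wi × (minSup × c_i)

A sequence s is frequent in the weighted sliding window model exactly when sup_weight(s) ≥ minSup_weight.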
Since the weights of the sliding windows can be specified by users according to their interests, a sequence that is frequent in one window is not necessarily frequent in the other windows. Similarly, a sequence that is infrequent in one window may well be frequent in other windows. Thus we have to consider the weights of the windows and obtain the final results by calculating the support of a sequence based on the weights of the different windows.

We give an example to demonstrate how the weighted sliding window model works. Assume that the current time is T1 and there are 4 sliding windows, containing 1000, 2000, 2000, and 3000 customers, respectively. For short, we use w1 for sliding window 1, w2 for sliding window 2, w3 for sliding window 3, and w4 for sliding window 4. The window weights are ω1 = 0.4, ω2 = 0.3, ω3 = 0.2, and ω4 = 0.1; the weights of all windows must sum to 1. Assume the support counts of a sequence s in w1, w2, w3, and w4 are 200, 300, 300, and 400, and the minimum support is 0.2. Thus, the minimum support counts for w1, w2, w3, and w4 are 200, 400, 400, and 600, respectively. Then sup_weight(s) is given by

200×0.4 + 300×0.3 + 300×0.2 + 400×0.1 = 270 (1)

Similarly, minSup_weight is given by

200×0.4 + 400×0.3 + 400×0.2 + 600×0.1 = 340 (2)

Obviously, sup_weight(s) is smaller than minSup_weight. Thus, we can conclude that the sequence s is not a frequent sequence in the weighted sliding window model.
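As a sanity check, the two weighted sums above can be computed directly. The following small Python snippet (ours; the variable names are assumptions) reproduces Equations (1) and (2):

weights = [0.4, 0.3, 0.2, 0.1]       # window weights ω1..ω4
supports = [200, 300, 300, 400]      # support counts of s in w1..w4
min_counts = [200, 400, 400, 600]    # minSup (0.2) × customers per window

sup_weight = sum(w * c for w, c in zip(weights, supports))        # ≈ 270.0, Eq. (1)
min_sup_weight = sum(w * c for w, c in zip(weights, min_counts))  # ≈ 340.0, Eq. (2)
print(sup_weight >= min_sup_weight)  # False: s is not frequent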
III. SWSS ALGORITHM

In this section, we briefly present how SPAM works and how we use SWSS to mine frequent sequential patterns over the weighted sliding window model. An example illustrating the SWSS process is then given in Section IV.

A. The Overview of SPAM (Sequential Pattern Mining)

SPAM [9] is a depth-first search strategy for mining sequential patterns. Its main implementation characteristic is a vertical bitmap data layout that speeds up support counting. Every item is represented by a vertical bitmap over the transactions. Each bitmap is partitioned into n sections, where n is the number of customers, and each section contains m bits, where m is the number of transactions generated by that customer. If item i appears in transaction j, the bit corresponding to transaction j in the bitmap of item i is set to one; otherwise the bit is set to zero. This kind of bitmap naturally extends to itemsets and sequences, and the support of a sequence is counted by taking the bitwise AND of the corresponding bitmaps.

Lemma 1: If a sequence s is infrequent, then any super-sequence of s is infrequent [27].

Lemma 2: If a sequence s is frequent, then any subsequence of s is frequent [27].

From the Apriori property we can easily obtain the two conclusions above, which are also applicable to frequent sequential pattern mining over the weighted sliding windows; both lemmas hold in every window for the same reason. The candidate sequences are generated during mining by a depth-first traversal of a lexicographic tree; two similar algorithms, MaxMiner [27] and MAFIA [28], have used this approach for the frequent itemset mining problem. In the lexicographic tree, SPAM generates sequences by traversing the tree, and each node can be generated by two kinds of extension: the sequence-extension step (abbreviated S-step) and the itemset-extension step (abbreviated I-step). Suppose B(s) and B(i) are the bitmaps of a sequence s and an item i, respectively. The S-step appends the itemset {i} to the sequence s, and the I-step appends the item i to the last itemset of s to generate a new sequence. In addition, to reduce the memory required during mining, SPAM uses two Apriori-based pruning strategies, called S-step pruning and I-step pruning, respectively. Each sequence can be extended through the S-step or the I-step; if the extended sequence is infrequent, the extending item is deleted during the process. Obviously, each node of the lexicographic tree represents one frequent sequential pattern.
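The bitmap operations described above can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: each bitmap is modeled as a list with one integer mask per customer, and the S-step uses SPAM's transformed bitmap, in which every bit after the first set bit is considered.

def s_step(seq_bitmap, item_bitmap):
    # Sequence extension: the item must occur strictly after the first
    # transaction containing the sequence, so each customer mask is first
    # transformed to keep only the bits above its lowest set bit.
    result = []
    for seq_mask, item_mask in zip(seq_bitmap, item_bitmap):
        if seq_mask == 0:
            result.append(0)
            continue
        lowest = seq_mask & -seq_mask      # isolate the lowest set bit
        after = ~((lowest << 1) - 1)       # every bit strictly above it
        result.append(after & item_mask)
    return result

def i_step(seq_bitmap, item_bitmap):
    # Itemset extension: a plain bitwise AND of the two bitmaps.
    return [s & i for s, i in zip(seq_bitmap, item_bitmap)]

def support(bitmap):
    # A customer supports the pattern if any transaction bit survives.
    return sum(1 for mask in bitmap if mask != 0)

# Example: a customer whose two transactions are {a} then {b} gives
# B({a}) = [0b01] and B({b}) = [0b10]; s_step([0b01], [0b10]) yields
# [0b10], so the sequence <{a}, {b}> is supported by this customer.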
Algorithm SWSS
Input:
(1) the number of sliding windows: n
(2) the size of a window: t
(3) the minimum support: minSup
(4) the weight of each window: wj (1 ≤ j ≤ n)
Output:
All sequential patterns
Method:
(1) Initialization:
  a: for each window from the data stream do
  b:   get B-List (initialize the bitmaps of all items of each window, respectively)
  c:   get minSup_weight (compute minSup_weight of the sliding windows)
(2) Mining:
  a: for each window from the data stream do
  b:   get All-List (find all distinct items in the windows and store them in All-List; these items serve as the initial values for the S-step and the I-step)
  c:   initialize W-Tree (the root of the tree is NULL)
  d:   build W-Tree by depth-first traversal (S-step, then I-step):
         if the weighted support of a new sequence is not smaller than minSup_weight:
           add the sequence to W-Tree as a node
           add the sequence to All-List
           transform the sequence into a bitmap in B-List
  e:   if a new frequent sequence exists:
         go to step d
       else:
         the W-Tree building process is complete
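Putting the pieces together, the mining phase (step (2) above) can be sketched as a recursive depth-first traversal. Again, this is a hedged sketch rather than the authors' code: the sequence encoding is flattened, min_sup_weight is assumed positive, and the helpers s_step, i_step, and support are the ones from the bitmap sketch in Section III-A.

def mine_swss(all_list, item_bitmaps, weights, min_sup_weight):
    # all_list: sorted list of distinct item ids.
    # item_bitmaps[i]: one bitmap (list of per-customer masks) per window.
    w_tree = []                                   # frequent sequences found

    def weighted_support(window_bitmaps):
        # Per-window support count times the window weight, summed.
        return sum(w * support(b) for w, b in zip(weights, window_bitmaps))

    def dfs(seq, window_bitmaps):
        for item in all_list:
            # S-step may use any item; I-step only items greater than the
            # last one appended, as in SPAM, to avoid duplicate itemsets.
            for step, ok in ((s_step, True), (i_step, item > seq[-1])):
                if not ok:
                    continue
                ext = [step(wb, ib)
                       for wb, ib in zip(window_bitmaps, item_bitmaps[item])]
                if weighted_support(ext) >= min_sup_weight:
                    child = seq + [item]          # flattened sequence encoding
                    w_tree.append(child)
                    dfs(child, ext)               # depth-first extension

    for item in all_list:                         # frequent 1-sequences seed the tree
        if weighted_support(item_bitmaps[item]) >= min_sup_weight:
            w_tree.append([item])
            dfs([item], item_bitmaps[item])
    return w_tree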
IV. EXAMPLE FOR SWSS
Assume the number of sliding windows is 4 and the minimum support is 0.2. At time point T, Figure 4.1 shows the initial information.

Figure 4.1: The sequence data at time T

From Figure 4.1, minSup_weight is 0.42 (0.2 × (3 × 0.1 + 2 × 0.2 + 2 × 0.3 + 2 × 0.4)). Through the initialization phase of our algorithm on the data in Figure 4.1, All-List and W-List can be generated. The bitmaps of the items are shown in Figure 4.2 (for simplicity, some bitmaps are not listed).

Figure 4.2: Bitmaps of All-List for the 4 sliding windows

All-List includes the items (9, 11, 12, 13, 14, 15, 19, 20, 21). Item 9 appears in both W1 and W3. Item 12 appears in all 4 windows. Item 21 appears only in W4. The bitmap of item 9 in W1 (abbreviated bW1(9)) has only one bit in the first section, and its value is 0, since item 9 does not appear in the sequence of customer 1 (CID = 1). Similarly, bW1(12) also has only one bit in the third section, for the same reason. sup_weight(9) = 1 × 0.1 + 1 × 0.2 = 0.3, which is smaller than minSup_weight; thus, item 9 is removed. sup_weight(12) = 2 × (0.1 + 0.2 + 0.3 + 0.4) = 2.0, which is greater than minSup_weight, so W-Tree creates a node storing the one-item sequence ({12}). Through the same calculation, items (9, 14, 15, 19, 20, 21) are removed, so the first level of W-Tree has only 3 child nodes, as shown in Figure 4.4. According to the property of the lexicographical tree, the sequence ({11}, {12}) can be generated through the S-step of the sequence ({11}). Since ({11}) does not appear in W4, there is no need to consider W4 during its S-step; we only have to calculate the support counts in the first three windows (see Figure 4.3). Other frequent sequences can be obtained in the same way. Figure 4.4 shows the building of W-Tree.
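The numbers in this example can be verified with a few lines of Python (our check, not part of the paper; the per-window customer counts are read off the minSup_weight formula above):

weights = [0.1, 0.2, 0.3, 0.4]                  # window weights in this example
customers = [3, 2, 2, 2]                        # customers per window (Figure 4.1)
min_sup_weight = 0.2 * sum(w * c for w, c in zip(weights, customers))
print(min_sup_weight)                           # ≈ 0.42
print(sum(2 * w for w in weights))              # sup_weight(12) = 2.0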

Figure 4.3: The final results of B-List
Figure 4.4: The final results of W-Tree

V. PERFORMANCE EVALUATION
In this section, we report the experiments on the proposed SWSS algorithm. The experiments were executed on a personal computer with a 3.00 GHz Intel Pentium 4 processor and 1.00 GB of memory, running Microsoft Windows XP. The synthetic data used in these experiments was generated by the IBM synthetic data generator [8]. The parameters of the synthetic data are listed in Table 5.1.

Table 5.1: Parameters of the synthetic data

Symbol  Meaning
D       Number of customers (in thousands)
C       Average number of transactions per customer
T       Average number of items per transaction
N       Number of different items (in thousands)
S       Average length of maximal sequences

First, we verify the correctness of SWSS by setting two parameters: the window number equals 1, and the minimum support is set to the same value as SPAM's. The frequent sequences generated by SWSS are then the same as those generated by SPAM.

Meanwhile, we verify the feasibility of SWSS. In the first group of experiments (Fig 5.1), we test the running time for different window numbers and different minimum supports over a small-scale dataset. The total number of transactions is 1K, and the number of windows is 2, 4, 6, 8, and 10. The execution time reported is the total execution time at 5 different time points. Fig 5.2 shows the execution time under different minimum support conditions.

Fig 5.1: Different windows over small-scale dataset
Fig 5.2: Different min-support over small-scale dataset

In the second group of experiments (Fig 5.3), we test the running time for different window numbers and different minimum supports over a large-scale dataset. The total number of transactions is 10K, and the number of windows is 1, 2, 3, and 4. The execution time reported is the total execution time at 5 different time points. Fig 5.4 shows the execution time under different minimum support conditions.

Fig 5.3: Different windows over large-scale dataset
Fig 5.4: Different min-support over large-scale dataset

VI. CONCLUSION

In this paper, we propose an efficient single-pass algorithm, SWSS, to mine all the frequent sequential patterns over weighted sliding windows. SWSS allows users to specify what they are more interested in, and it uses a depth-first traversal method and a bitmap representation similar to SPAM's. Extensive experiments show that SWSS can generate all frequent sequential patterns as users specified. In the future, we will consider how to further reduce the memory consumption of SWSS.

VII. ACKNOWLEDGMENT

This work is supported by the National Science Foundation of China under Grant No. 90820010.

REFERENCES

[1] Y. Zhu & D. Shasha, StatStream: Statistical monitoring of thousands of data streams in real time, In Proceedings of the VLDB Conference, 2002, pp. 358-369.
[2] P. Domingos & G. Hulten, Mining high-speed data streams, In Proceedings of ACM SIGKDD, 2000, pp. 71-80.
[3] C. Jin, W. Qian, C. Sha, J. X. Yu, & A. Zhou, Dynamically maintaining frequent items over a data stream, In Proceedings of the Conference on Information and Knowledge Management (CIKM), 2003, pp. 287-294.
[4] G. Manku & R. Motwani, Approximate frequency counts over data streams, In Proceedings of the VLDB Conference, 2002, pp. 346-357.
[5] C. Giannella, J. Han, J. Pei, X. Yan, & P. S. Yu, Mining frequent patterns in data streams at multiple time granularities, AAAI/MIT, Next Generation Data Mining, 2002, pp. 191-210.
[6] Y. Chi, H. Wang, P. S. Yu, & R. R. Muntz, Catch the moment: Maintaining closed frequent itemsets over a data stream sliding window, Knowledge and Information Systems, 10(3), 2006, pp. 265-294.
[7] R. Agrawal & R. Srikant, Fast algorithms for mining association rules in large databases, In Proc. Int. Conf. Very Large Data Bases, 1994, pp. 487-499.
[8] R. Agrawal & R. Srikant, Mining sequential patterns, In Proc. Int. Conf. Data Engineering, 1995, pp. 3-10.
[9] J. Ayres, J. Flannick, J. Gehrke, & T. Yiu, Sequential PAttern Mining using a bitmap representation, In Proc. Int. Conf. Knowledge Discovery and Data Mining, 2002, pp. 429-435.
[10] J. Pei, J. Han, Q. Chen, U. Dayal, & H. Pinto, FreeSpan: Frequent pattern-projected sequential pattern mining, In Proc. Int. Conf. Knowledge Discovery and Data Mining, 2000, pp. 355-359.
[11] J. Pei, J. Han, B. Mortazavi-Asl, & H. Pinto, PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth, In Proc. Int. Conf. Data Engineering, 2001, pp. 215-224.
[12] R. Srikant & R. Agrawal, Mining sequential patterns: Generalizations and performance improvements, In Proc. 5th Int. Conf. Extending Database Technology, 1996, pp. 3-17.
[13] M. J. Zaki, SPADE: An efficient algorithm for mining frequent sequences, Machine Learning Journal, 42(1/2), 2001, pp. 31-60.
[14] C. C. Ho, H. F. Li, F. F. Kuo, & S. Y. Lee, Incremental mining of sequential patterns over a stream sliding window, In Proceedings of the IEEE International Workshop on Mining Evolving and Streaming Data, 2006.
[15] P. S. M. Tsai, Mining frequent itemsets in data streams using the weighted sliding window model, Expert Systems with Applications, 2009, pp. 11617-11625.
[16] X. Yan, J. Han, & R. Afshar, CloSpan: Mining closed sequential patterns in large datasets, In Proc. of SIAM Int'l Conf. on Data Mining (SDM), 2003, pp. 166-177.
[17] M. Yen & S. Lee, Incremental update on sequential patterns in large databases, In Proceedings of the 10th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 1998, pp. 24-31.
[18] Y. Zhu & D. Shasha, StatStream: Statistical monitoring of thousands of data streams in real time, In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), 2002, pp. 358-369.
[19] G. Chen, X. Wu, & X. Zhu, Mining sequential patterns across data streams, Technical Report, 2004.
[20] H. Cheng, X. Yan, & J. Han, IncSpan: Incremental mining of sequential patterns in large database, In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2004, pp. 527-532.
[21] G. Chen, X. Wu, & X. Zhu, Sequential pattern mining in multiple data streams, In Proceedings of the ICDM, 2005, pp. 585-588.
[22] C. Ezeife & M. Monwar, SSM: A frequent sequential data stream patterns miner, CIDM, 2007, pp. 120-126.
[23] C. C. Ho, H. F. Li, F. F. Kuo, & S. Y. Lee, Incremental mining of sequential patterns over a stream sliding window, ICDM Workshops, 2006, pp. 677-681.
[24] J. H. Chang & W. S. Lee, Efficient mining method for retrieving sequential patterns over online data streams, Journal of Information Science, 31(5), 2005, pp. 420-432.
[25] C. Raïssi, P. Poncelet, & M. Teisseire, Need for speed: Mining sequential patterns in data streams, BDA, 2005, pp. 865-874.
[26] A. Marascu & F. Masseglia, Mining sequential patterns from temporal streaming data, In Proceedings of the 1st ECML/PKDD Workshop on Mining Complex Data (IEEE MCD), 2005.
[27] G. Dong & J. Pei, Sequence Data Mining, Series: Advances in Database Systems, Vol. 33, 2007, p. 119.
