Sequential Pattern Mining in Data Streams Using the Weighted Sliding Window Model

Abstract—Mining data streams for knowledge discovery is important to many applications, including Web click stream mining, network intrusion detection, and on-line transaction analysis. In this paper, by analyzing data characteristics, we propose an efficient algorithm, SWSS (Sequential pattern mining with the Weighted Sliding window model in SPAM), to mine frequent sequential patterns based on the weighted sliding window model. This algorithm gives users more freedom to specify which sequences they are more interested in. Extensive experiments show that the proposed algorithm is feasible and efficient for mining all the sequential patterns that users specify.

Keywords—Data Mining; Stream Mining; Sliding Window; Sequential Pattern Mining

I. INTRODUCTION

The problem of mining data streams arose with the introduction of new application areas, including Web click stream mining, network intrusion detection, and on-line transaction analysis [1]. Recently, data mining communities have focused on a new data model, where data arrives in the form of continuous streams. Many applications generate great amounts of stream data in real time, such as online transaction flows in retail chains, web click-streams in Web applications, performance measurements in network monitoring, and ATM transaction records in banks. There are several challenges in data stream mining. First, each item in a stream can be examined only once. Second, it is impractical to provide sufficient memory to store the data generated continuously. Third, the mining results should be obtained as fast as possible and be available in time for users' requests. Finally, the errors of the results should be bounded as tightly as possible.

In data stream mining [7, 10, 11, 12, 13, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26], the time models mainly include the landmark model [4], the tilted-time window model [5], and the sliding window model [6]. In this paper, we focus on the sliding window model. It deals with the data generated from the current moment back to a specified time point, and the data continuously changes as new elements arrive and old ones are deleted. The size of the window can be specified by users as a given number of transactions or as a fixed time period. There are many research results on mining frequent itemsets in a data stream environment [2, 3, 6, 9]. Chi et al. [6] introduced a novel algorithm to mine closed frequent itemsets over data stream sliding windows. The closed enumeration tree (CET), an efficient in-memory data structure, is used to store all closed frequent itemsets in the current sliding window. Ho, Li, Kuo, and Lee [14] proposed IncSPAM to maintain sequential patterns over a stream sliding window. A Customer Bit-Vector Array with Sliding Window (CBASW) is introduced to store the item information, and this representation significantly reduces the memory requirement and execution time. In [15], the author proposed a new flexible framework for stream mining, called the weighted sliding window model. The main difference between it and the traditional sliding window model is that it allows users to specify the number of windows and the weight for each window; that is, the mining results are closer to users' requirements. The WSW algorithm and the improved WSW algorithm are used to mine the frequent itemsets in continuous sliding windows.

In our work, we propose the SWSS algorithm to mine frequent sequential patterns in the weighted sliding window model. To the best of our knowledge, this is the first piece of work on mining frequent sequential patterns over weighted sliding windows. In this paper, we make full use of the fast counting of SPAM and its lexicographical tree for storing the frequent sequential patterns. Several effective summary data structures are developed to store the essential information during mining: All-List, B-List, W-List, and W-Tree. All-List stores all the distinct items that appear in the sliding windows specified by users. B-List dynamically holds the bitmap representation of the sequence of a tree node in the lexicographical tree. All the sliding windows are stored in the W-List, which is refreshed as time goes by. Finally, W-Tree stores the frequent sequences and is the basis for the depth-first traversal (DFS).

The rest of the paper is organized as follows. Section II presents the problem statement. The SWSS algorithm is presented in Section III. Section IV gives an example to illustrate the SWSS process. Section V reports the results of our experiments. Finally, Section VI concludes the paper.
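In the weighted sliding window model, a pattern's weighted support is the sum of its per-window support counts scaled by the user-specified window weights. The following is a minimal Python sketch of this computation (the function name and the sample counts are illustrative, not from the paper):

```python
# Weighted support in the weighted sliding window model [15]:
# each window Wi carries a user-specified weight, and a pattern's
# weighted support sums count_i * weight_i over all windows.

def weighted_support(counts, weights):
    """counts[i]: support count of the pattern in window W(i+1);
    weights[i]: user-specified weight of W(i+1)."""
    if len(counts) != len(weights):
        raise ValueError("need one support count per window")
    return sum(c * w for c, w in zip(counts, weights))

# Four windows, with newer windows weighted more heavily.
weights = [0.1, 0.2, 0.3, 0.4]
print(round(weighted_support([1, 1, 0, 0], weights), 2))  # 0.3
print(round(weighted_support([2, 2, 2, 2], weights), 2))  # 2.0
```

The two calls reproduce the sup_weight values of 0.3 and 2.0 worked out in the example of Section IV.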
a: For each window from the data stream do
b:   Get All-List (find all distinct items that appear in each window and store
     them in All-List; the items are used as the initial values for the
     S-step and I-step)
c:   Initialize W-Tree (the root of the tree is NULL)
d:   Build W-Tree by depth-first traversal (S-step, then I-step)
       If a new sequence's weighted support exceeds minSup_weight
         Add the sequence into W-Tree as a node
         Add the sequence into All-List
         Transform the sequence into a bitmap and store it in B-List
e:   If a new frequent sequence exists
       Go to step d
     Else
       The building of W-Tree is complete
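Step d relies on the S-step and I-step candidate extensions borrowed from SPAM [9]. The following is a minimal sketch of those two bitmap operations over one customer section, using Python integers as bitmaps (the function names and bit layout are illustrative; the paper's B-List keeps one bitmap per window for each sequence):

```python
# S-step / I-step bitmap extensions in the style of SPAM [9],
# with bit i of an int representing transaction i of one customer.

def i_step(seq_bits: int, item_bits: int) -> int:
    """I-step: the item joins the sequence's last itemset, so the
    resulting bitmap is a plain bitwise AND."""
    return seq_bits & item_bits

def s_step(seq_bits: int, item_bits: int, n_bits: int) -> int:
    """S-step: the item starts a new, strictly later itemset. Set all
    bits after the first set bit of the sequence bitmap, then AND."""
    if seq_bits == 0:
        return 0
    first = seq_bits & -seq_bits                    # earliest transaction containing the sequence
    after = ~(2 * first - 1) & ((1 << n_bits) - 1)  # transactions strictly after it
    return item_bits & after

# A customer with 4 transactions: sequence <{a}> occurs in transactions
# 0 and 2, item b in transactions 1 and 3, so <{a},{b}> occurs
# (the S-step result is nonzero).
print(bin(s_step(0b0101, 0b1010, 4)))  # 0b1010
print(bin(i_step(0b0101, 0b0110)))     # 0b100
```

A sequence is counted as present in a window for a customer exactly when its resulting bitmap is nonzero, which is what makes the support counting fast.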
IV. EXAMPLE FOR SWSS
Assume the number of sliding windows is 4 and the minimum support is 0.2. At time point T, Figure 4.1 shows the initial information.

Figure 4.1 The sequence data at time T

From Figure 4.1, the minSup_weight is 0.42 (0.2 × (3 × 0.1 + 2 × 0.2 + 2 × 0.3 + 2 × 0.4)). Through the initialization step of our algorithm on the data in Figure 4.1, All-List and W-List can be generated. The bitmaps of the items are presented in Figure 4.2 (for simplicity, some bitmaps are not listed).

All-List includes items (9, 11, 12, 13, 14, 15, 19, 20, 21). Item 9 appears in both W1 and W3. Item 12 appears in all 4 windows. Item 21 appears only in W4. The bitmap of item 9 in W1 (abbreviated bW1(9)) has only one bit in the first section, and its value is 0, since item 9 does not appear in the sequence of customer 1 (CID = 1). Similarly, bW1(12) also has only one bit in the third section, for the same reason. sup_weight(9) = 1 × 0.1 + 1 × 0.2 = 0.3, which is smaller than minSup_weight; thus, item 9 is removed. sup_weight(12) = 2 × (0.1 + 0.2 + 0.3 + 0.4) = 2.0, which is greater than minSup_weight, so W-Tree creates a node storing the one-item sequence ({12}). Through this calculation, items (9, 14, 15, 19, 20, 21) should be removed, so the first level of W-Tree has only 3 child nodes, as shown in Figure 4.4. According to the property of the lexicographical tree, the sequence ({11}, {12}) can be generated through the S-step of the sequence ({11}). Since ({11}) does not appear in W4, there is no need to consider W4 in the S-step of ({11}); we only need to calculate the support count in the first three windows (see Figure 4.3). Similarly, other frequent sequences can be obtained by the same method. Figure 4.4 shows the building of W-Tree.
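The arithmetic in this example can be checked directly. A short Python sketch, using the window weights (0.1, 0.2, 0.3, 0.4) and per-window sizes (3, 2, 2, 2 customer sequences) implied by the minSup_weight formula above:

```python
# Reproducing the arithmetic of the example in Section IV.

weights = [0.1, 0.2, 0.3, 0.4]   # weights of W1..W4
sizes = [3, 2, 2, 2]             # customer sequences per window
min_sup = 0.2

# minSup_weight = minSup * sum_i(|Wi| * weight_i) = 0.42
min_sup_weight = min_sup * sum(s * w for s, w in zip(sizes, weights))
print(round(min_sup_weight, 2))          # 0.42

# Item 9: support count 1 in two windows -> weighted support 0.3
sup_weight_9 = 1 * 0.1 + 1 * 0.2
print(round(sup_weight_9, 2) < 0.42)     # True: item 9 is pruned

# Item 12: support count 2 in every window -> kept as the W-Tree node ({12})
sup_weight_12 = 2 * sum(weights)
print(round(sup_weight_12, 2) > 0.42)    # True
```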
Figure 4.3 The final results of B-List
Figure 4.4 The final results of W-Tree

V. PERFORMANCE EVALUATION

In this section, we report the experiments of the proposed SWSS algorithm. These experiments are executed on a personal computer with a 3.00 GHz Intel Pentium 4 processor and 1.00 GB of memory, running Microsoft Windows XP. The synthetic data used in these experiments is generated by the IBM synthetic data generator [8]. The parameters of the synthetic data are listed in Table 5.1.

Table 5.1 Parameters of the synthetic data
Symbol   Meaning
D        Number of customers (in thousands)
C        Average number of transactions per customer
T        Average number of items per transaction
N        Number of different items (in thousands)
S        Average length of maximal sequences

Fig 5.1 Different windows over small scale dataset
Fig 5.2 Different min-support over small scale dataset

In the second group of experiments (Fig. 5.3), we test the running time for different numbers of windows and different minimum supports over a large-scale dataset. The total number of transactions is 10K, and the number of windows is 1, 2, 3, or 4. The execution time reported is the total execution time at 5 different time points. Fig. 5.4 shows the execution time under different minimum support conditions.
Fig 5.4 Different min-support over large scale dataset

VI. CONCLUSION

In this paper, we propose an efficient single-pass algorithm, SWSS, to mine all the frequent sequential patterns over weighted sliding windows. SWSS allows users to specify what they are more interested in, and it uses the same depth-first traversal method and bitmap representation as SPAM. Extensive experiments show that SWSS can generate all the frequent sequential patterns that users define. In the future, we will consider how to further reduce the memory consumption of SWSS.

VII. ACKNOWLEDGMENT

This work is supported by the National Science Foundation of China under Grant No. 90820010.

REFERENCES

[1] Y. Zhu & D. Shasha, StatStream: Statistical monitoring of thousands of data streams in real time, In Proceedings of the VLDB Conference, 2002, pp.358-369.
[2] P. Domingos & G. Hulten, Mining high-speed data streams, In Proceedings of ACM SIGKDD, 2000, pp.71-80.
[3] C. Jin, W. Qian, C. Sha, J. X. Yu, & A. Zhou, Dynamically Maintaining Frequent Items over a Data Stream, In Proceedings of the Conference on Information and Knowledge Management (CIKM), 2003, pp.287-294.
[4] G. Manku & R. Motwani, Approximate frequency counts over data streams, In Proceedings of the VLDB Conference, 2002, pp.346-357.
[5] C. Giannella, J. Han, J. Pei, X. Yan, & P. S. Yu, Mining frequent patterns in data streams at multiple time granularities, AAAI/MIT, Next Generation Data Mining, 2002, pp.191-210.
[6] Y. Chi, H. Wang, P. S. Yu, & R. R. Muntz, Catch the moment: Maintaining closed frequent itemsets over a data stream sliding window, Knowledge and Information Systems, 10(3), 2006, pp.265-294.
[7] R. Agrawal & R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases, In Proc. Int. Conf. Very Large Data Bases (VLDB), 1994, pp.487-499.
[8] R. Agrawal & R. Srikant, Mining Sequential Patterns, In Proc. Int. Conf. Data Engineering (ICDE), 1995, pp.3-10.
[9] J. Ayres, J. Flannick, J. Gehrke, & T. Yiu, Sequential PAttern Mining Using a Bitmap Representation, In Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD), 2002, pp.429-435.
[10] J. Pei, J. Han, Q. Chen, U. Dayal, & H. Pinto, FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining, In Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD), 2000, pp.355-359.
[11] J. Pei, J. Han, B. Mortazavi-Asl, & H. Pinto, PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth, In Proc. Int. Conf. on Data Engineering (ICDE), 2001, pp.215-224.
[12] R. Srikant & R. Agrawal, Mining Sequential Patterns: Generalizations and Performance Improvements, In Proc. 5th Int. Conf. Extending Database Technology (EDBT), 1996, pp.3-17.
[13] M. J. Zaki, SPADE: An Efficient Algorithm for Mining Frequent Sequences, Machine Learning Journal, 42(1/2), 2001, pp.31-60.
[14] C. C. Ho, H. F. Li, F. F. Kuo, & S. Y. Lee, Incremental mining of sequential patterns over a stream sliding window, In Proceedings of the IEEE International Workshop on Mining Evolving and Streaming Data, 2006.
[15] P. S. M. Tsai, Mining frequent itemsets in data streams using the weighted sliding window model, Expert Systems with Applications, 2009, pp.11617-11625.
[16] X. Yan, J. Han, & R. Afshar, CloSpan: Mining Closed Sequential Patterns in Large Datasets, In Proc. of SIAM Int'l Conf. on Data Mining (SDM), 2003, pp.166-177.
[17] M. Yen & S. Lee, Incremental Update on Sequential Patterns in Large Databases, In Proceedings of the 10th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 1998, pp.24-31.
[18] Y. Zhu & D. Shasha, StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time, In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), 2002, pp.358-369.
[19] G. Chen, X. Wu, & X. Zhu, Mining Sequential Patterns across Data Streams, Technical Report, 2004.
[20] H. Cheng, X. Yan, & J. Han, IncSpan: Incremental Mining of Sequential Patterns in Large Database, In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2004, pp.527-532.
[21] G. Chen, X. Wu, & X. Zhu, Sequential pattern mining in multiple data streams, In Proceedings of the ICDM, 2005, pp.585-588.
[22] C. Ezeife & M. Monwar, SSM: a frequent sequential data stream patterns miner, CIDM, 2007, pp.120-126.
[23] C. C. Ho, H. F. Li, F. F. Kuo, & S. Y. Lee, Incremental mining of sequential patterns over a stream sliding window, ICDM Workshops, 2006, pp.677-681.
[24] J. H. Chang & W. S. Lee, Efficient mining method for retrieving sequential patterns over online data streams, Journal of Information Science, 31(5), 2005, pp.420-432.
[25] C. Raïssi, P. Poncelet, & M. Teisseire, Need for speed: mining sequential patterns in data streams, BDA, 2005, pp.865-874.
[26] A. Marascu & F. Masseglia, Mining Sequential Patterns from Temporal Streaming Data, In Proceedings of the 1st ECML/PKDD Workshop on Mining Complex Data (IEEE MCD), 2005.
[27] G. Dong & J. Pei, Sequence Data Mining, Series: Advances in Database Systems, Vol. 33, 2007, p.119.