
The International Arab Journal of Information Technology, Vol. 16, No. 6, November 2019

Mining Recent Maximal Frequent Itemsets Over Data Streams with Sliding Window

Saihua Cai1, Shangbo Hao1, Ruizhi Sun1, and Gang Wu2
1 College of Information and Electrical Engineering, China Agricultural University, China
2 Secretary of Computer Science Department, Tarim University, China

Abstract: The huge volume of data streams makes it impossible to mine all recent frequent itemsets efficiently. Because the maximal frequent itemsets can perfectly imply all the frequent itemsets while their number is much smaller, mining maximal frequent itemsets is much more efficient in both time cost and memory usage. This paper proposes an improved method called Recent Maximal Frequent Itemsets Mining (RMFIsM) to mine recent maximal frequent itemsets over data streams with a sliding window. The RMFIsM method uses two matrixes to store the information of data streams: the first matrix stores the information of each transaction and the second one stores the frequent 1-itemsets. The frequent p-itemsets are mined with an "extension" process applied to frequent 2-itemsets, and the maximal frequent itemsets are obtained by deleting the sub-itemsets of long frequent itemsets. Finally, the performance of the RMFIsM method is evaluated with a series of experiments; the results show that the proposed RMFIsM method can mine recent maximal frequent itemsets efficiently.

Keywords: Data streams, recent maximal frequent itemsets, sliding window, matrix structure.

Received November 16, 2016; accepted April 25, 2018

1. Introduction

The scale of collected data shows explosive growth in various domains with the rapid development of Internet of Things (IoT) technology, information technology and network technology [1]. To analyze the collected data better, the association rules [3] among the data should be mined effectively, and the generation of frequent itemsets is the most critical technique and procedure in mining association rules. Frequent itemsets [7, 12] are the itemsets whose support is not smaller than a predefined minimum support (denoted as min_sup). In general, the number of frequent itemsets generated by most frequent itemsets mining methods is very large, because all frequent itemsets of a given dataset need to be mined, which can easily exhaust the available memory. Because the number of maximal frequent itemsets is relatively small and they can perfectly imply all frequent itemsets, the time cost of mining maximal frequent itemsets is much lower and the memory usage is also much smaller. Therefore, the problem of mining frequent itemsets can be transformed into the operation of mining maximal frequent itemsets.

The massive use of sensors makes the collected data exist in the form of data streams. One common definition of data streams is that they are massive unbounded sequences with a large number of data elements arriving in the form of continuous streams [11]. Data streams are generated in many application areas: sensor data streams are generated from sensor networks [4], online transaction data streams are generated from shops, network data streams are generated from websites, etc. Because people need to grasp the relevance of the information generated by online applications in real time, the mining process should be performed immediately.

The differences between data streams mining and static data mining are listed as follows [8]. First, each element of a data stream is allowed to be checked at most once. Second, the memory usage of the analysis of data streams should be finitely restricted even though new data elements are generated continuously. Third, the newly generated data should be processed as fast as possible. Fourth, the up-to-date analysis results of data streams should be instantly available when the user requests them. To satisfy these requirements, a data streams mining method should favor rapidity, and its memory usage should be as small as possible.

However, the huge volume of data streams makes it impossible to store all the data in main memory. Moreover, previous methods designed for mining static datasets are not feasible for mining data streams; in this case, new structures and mining methods are needed to support one-time and continuous mining processes. In this paper, we use a matrix structure to store the data information of data streams, and then propose an improved method called Recent Maximal Frequent Itemsets Mining (RMFIsM) to mine recent maximal frequent itemsets over data streams.

The rest of this paper is organized as follows. The related work is introduced in section 2. The definitions and problems statement are given in section 3, and the structure and main idea of our proposed maximal
frequent itemsets mining method are described in section 4. The experimental analysis is presented in section 5, and the conclusion is given in section 6.

2. Related Work

At present, several methods have been proposed to mine frequent itemsets over data streams. According to the processing models of data streams, the models of frequent itemsets mining can be divided into:

1. Landmark window model.
2. Damped window model.
3. Sliding window model.

For the landmark window model, researchers always focus on the data in the entire data stream and obtain the global frequent itemsets through the analysis of historical data. Li et al. [8] referred to the Apriori algorithm to present a method called Data Stream Mining for Maximal Frequent Itemsets (DSM-MFI); it used a prefix tree structure to store the data information of data streams, and the maximal frequent itemsets mining process was then realized on the constructed prefix tree. The INSTANT method was presented by Mao et al. [10]; it defined some sub-operators of itemsets and maintained itemsets with different levels of support in memory. The advantage of the INSTANT method was that the maximal frequent itemsets could be displayed directly to the user through a series of sub-operations when a new transaction arrived.

For the damped window model, each transaction has a corresponding value that decreases gradually over time; therefore, preserving and reducing the related information of historical data need to be considered in the control of this value. Chang and Lee [2] developed a method called estDec in 2003; this method examined each transaction in turn without generating any candidate, the occurrence count of the itemsets appearing in each transaction was maintained with a prefix-tree structure, and the effect of old transactions on the current mining result was diminished by defining a parameter called the debilitating factor. Lin et al. [9] presented the Mining Recently Frequent Itemsets with Variable Support over Data Streams (MRVSDS) algorithm, which stores the frequent itemsets of the current window in a PFI-tree structure; itemsets are deleted from the PFI-tree when the degree of the transaction is less than min_sup. In addition, the authors also designed the Decaying Synopsis Vector (DSYV) structure to store the processed transactions, and the frequent itemsets are found by re-mining the transactions from DSYV when the current itemset's support is less than the historical min_sup.

For the sliding window model, the focus is always on the recent transactions; therefore, the mining results are the local frequent itemsets over a certain period of time. Yang et al. [13] designed an efficient algorithm named DSM-Miner to mine maximal frequent itemsets over data streams; it used an appropriate method to reduce the effects of old transactions, and the Sliding Window Maximum frequent pattern Tree (called SWM-Tree) was proposed to maintain the latest patterns' information. In the process of mining maximal frequent patterns, DSM-Miner used appropriate pruning operations, a calculation pattern of bit item groups and "depth-first" search strategies; the experimental results showed that DSM-Miner performed better in time and memory usage. A new algorithm based on the prefix-tree data structure was proposed by Deypir et al. [5] to find and update the frequent itemsets of the windows; a batch of transactions was used as the unit of insertion and deletion within the window to improve the performance. Moreover, an effective traversal strategy for the prefix-tree and a suitable representation for each batch of transactions were used in the algorithm; the required information was stored in each node of the prefix-tree and old batches of transactions were deleted directly.

However, some disadvantages also exist in the proposed methods. The drawbacks of the INSTANT algorithm [10] were that the number of arrays designed for maintaining all maximal frequent itemsets was very large and the cost of memory usage was also very expensive; moreover, no efficient superset or subset checking was applied to the newly identified maximal frequent itemsets of each array, so the number of comparisons increased very fast and the memory usage grew rapidly when the average length of the transactions became longer.

3. Definitions And Problems Statement

In this section, we first provide formal definitions of the important terms used in this paper and then give the problems statement.

3.1. Definitions

Let I = {i1, i2, i3, …, im} be a finite set of m distinct items. The data stream DS = [T1, T2, T3, …, Tn), where each transaction Tj ∈ DS is a subset of I with a unique identifier TID. If the relation of itemsets α and β is α ⊆ β, α is called a sub-itemset of β and β is called a super-itemset of α. If the length of an itemset is k, it is called a k-itemset. Table 1 shows an example of a data stream used as the running example to explain the definitions clearly. In this example, assume that min_sup is 0.33 and the size of the sliding window is 6.
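For concreteness, the definitions that follow can be replayed on the running example with a small brute-force sketch. This is our illustration, not part of the paper's method; the helper names (`support`, `frequent_itemsets`, `maximal_frequent`) are assumptions.

```python
from itertools import combinations

# The first sliding window of the running example (Table 1, T1-T6).
window = [
    {"i1", "i2", "i3"},             # T1
    {"i1", "i2", "i4"},             # T2
    {"i2", "i3", "i5"},             # T3
    {"i1", "i2", "i3", "i5"},       # T4
    {"i1", "i3", "i5"},             # T5
    {"i1", "i2", "i3", "i5", "i6"}, # T6
]
MIN_SUP = 0.33  # as assumed in the running example

def support(itemset, window):
    """support({x}) = count(x, DS) / |SW|: the fraction of window transactions containing the itemset."""
    return sum(itemset <= t for t in window) / len(window)

def frequent_itemsets(window, min_sup):
    """Brute-force enumeration of every itemset whose support is at least min_sup."""
    items = sorted(set().union(*window))
    return [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(set(c), window) >= min_sup]

def maximal_frequent(window, min_sup):
    """Keep only the frequent itemsets that have no frequent proper super-itemset."""
    fis = frequent_itemsets(window, min_sup)
    return [x for x in fis if not any(x < y for y in fis)]

print(support({"i1"}, window))            # 5/6, matching the example for {i1}
print(maximal_frequent(window, MIN_SUP))  # only {i1, i2, i3, i5} survives
```

On this window the sketch confirms the examples given below: support({i1}) = 5/6, support({i1, i2}) = 4/6, and the single maximal frequent itemset is {i1, i2, i3, i5}.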
Table 1. An example of data streams.

   TID  Transaction          TID  Transaction
   T1   {i1, i2, i3}         T6   {i1, i2, i3, i5, i6}
   T2   {i1, i2, i4}         T7   {i1, i2, i5}
   T3   {i2, i3, i5}         T8   {i1, i2, i3, i4}
   T4   {i1, i2, i3, i5}     T9   {i2, i3, i5}
   T5   {i1, i3, i5}         …    ……

• Support: The frequency of itemset xi in DS is defined as its support, that is, support({xi}) = count(xi, DS) / |SW|, where count(xi, DS) is the number of transactions in DS that contain itemset xi and |SW| is the size of the sliding window.

For example, itemset {i1} exists in T1, T2, T4, T5 and T6 in the current sliding window, therefore support({i1}) = 5/6. Itemset {i1, i2} exists in T1, T2, T4 and T6 in the current sliding window, therefore support({i1, i2}) = 4/6.

• Frequent Itemsets (FIs): The frequent itemsets are the itemsets whose support is not less than the predefined minimal support threshold min_sup.

For example, itemset {i1, i3} exists in T1, T4, T5 and T6, and support({i1, i3}) = 4/6 > 0.33; therefore, {i1, i3} is a frequent itemset.

• Infrequent Itemsets (IFIs): The infrequent itemsets are the itemsets whose support is less than the predefined minimal support threshold min_sup.

For example, itemset {i2, i4} exists only in T2, and support({i2, i4}) = 1/6 < 0.33; therefore, {i2, i4} is an infrequent itemset.

• Maximal Frequent Itemsets (MFIs): The maximal frequent itemsets are the itemsets that satisfy the following two conditions:

1. They are frequent itemsets.
2. No super-itemset of them is frequent.

For example, itemset {i4} is not an MFI because support({i4}) = 1/6 < 0.33. Itemset {i1} is not an MFI even though support({i1}) = 5/6 > 0.33, because its super-itemset {i1, i2} is frequent. Itemset {i1, i2, i3, i5} is an MFI because support({i1, i2, i3, i5}) = 2/6 > 0.33 and no super-itemset of it is frequent.

• Dictionary order: If itemset A appears earlier than itemset B in the dictionary, the dictionary order of itemsets A and B is recorded as A » B. Similarly, subsequent itemsets can be recorded as A » ABD » ACD » BD in dictionary order.

3.2. Problems Statement

To mine useful information over data streams, the final mining results should be sent to users immediately; that is, any useful data should be processed efficiently, because real-time response is very important to users. In addition, the huge nature of data streams makes it impossible to store all the data information in main memory, or even in secondary storage, because doing so would easily consume all system resources and hinder the underlying mining tasks.

Specifically, the DSM-MFI method [8] used a summary frequent itemset forest to store every sub-projection of the transactions, and the two main problems of the DSM-MFI method can be summarized as:

1. Much memory was wasted storing the sub-projections, because a part of the sub-projections were not frequent.
2. Much time was wasted deleting the sub-projections from the summary frequent itemset forest to achieve lower memory occupancy.

The prefix tree generated by the estDec method [2] became very large as the number of frequent itemsets increased; more seriously, the estDec method would stop working once the prefix tree occupied the whole memory. The drawback of the TMFI method [6] was that infrequent 1-itemsets were also stored in the matrix structure; therefore, some meaningless "extension" operations on infrequent itemsets were also conducted to obtain longer itemsets.

In general, the time cost, the memory usage and the accuracy rate of the mining process are the most important problems to deal with.

4. Mining Recent Maximal Frequent Itemsets

In this section, we refer to the TMFI method [6] to propose an improved method called RMFIsM to mine the recent MFIs over data streams. The RMFIsM method uses two matrixes (recorded as matrix A and frequent matrix B) to store the information of each item, and the infrequent itemsets are deleted from the matrix immediately, based on the downward closure property, to reduce the time cost and memory usage.

4.1. The Structure of RMFIsM Method

Matrix A is constructed to store the information of each item of the data streams, and frequent matrix B is built to record the information of the frequent 1-itemsets. The rows of matrix A stand for the transactions Ti and the columns of matrix A stand for the items {i1, i2, i3, …, im}; the size of matrix A is (n+1)×m, where row (n+1) records the support of each item. Specifically, the transactions are scanned in order while the current sliding window is not full, and Ad,k is marked as 1 if item ik appears in transaction Td; otherwise, Ad,k is marked as 0.

In order to effectively mine the recent information of data streams, old transactions need to be replaced by new ones directly.
The position of the new transaction Td is calculated by Equation (1), where n is the size of the sliding window; the information of transaction Td is recorded in row n if the result of pos is 0.

    pos = d mod n    (1)

Frequent matrix B is built to store the data information of the frequent 1-itemsets in dictionary order; the original elements of matrix B are 0 and the real size of matrix B is (k-1)×(k-1), where k is the number of frequent 1-itemsets. The construction process of matrix B is as follows: for frequent 1-itemsets ip and iq with the order ip » iq, a "logic and" operation is performed on every element of columns p and q in matrix A; Bp,q is marked as 1 if the resulting support of itemset {ip, iq} is not less than min_sup, otherwise Bp,q is marked as 0.

Matrix A and frequent matrix B are the basis for mining maximal frequent itemsets; the pseudo-code for constructing matrix A and frequent matrix B is shown in Algorithm 1.

Algorithm 1: Construct matrix A and frequent matrix B

Input: data streams, n (the maximal |SW|), m (the maximal number of different items), min_sup
Output: matrix A, frequent matrix B

for (d = 1 to n)
{
    for (k = 1 to m)
    {
        if (ik in Td)
            Ad,k = 1
        else
            Ad,k = 0
    }
}
return matrix A
for (k = 1 to m)
{
    if (support(ik) ≥ min_sup)
        add ik to matrix B
    else
        delete ik
}
for (k = 1 to |B|)
{
    for (s = k+1 to |B|)
    {
        if (support({ik, is}) ≥ min_sup)
            Bk,s = 1
        else
            Bk,s = 0
    }
}
return matrix B

4.2. Downward Closure Property

The downward closure property is an important part of the RMFIsM method; it is the foundation of the pruning strategy that reduces meaningless "extension" operations to save time cost and memory usage.

• Theorem 1. If Xk is a frequent k-itemset, then any nonempty sub-itemset Xk-1 of Xk is also frequent.
• Proof. Since Xk-1 ⊆ Xk, the transactions that contain itemset Xk must contain itemset Xk-1, that is, TID(Xk) ⊆ TID(Xk-1). It follows that support(Xk-1) ≥ support(Xk) ≥ min_sup. Hence, any nonempty sub-itemset Xk-1 of Xk is also frequent if Xk is a frequent itemset.
• Theorem 2. If Xk is an infrequent k-itemset, then any super-itemset Xk+1 of Xk is also infrequent.
• Proof. Since Xk ⊆ Xk+1, the transactions that contain itemset Xk+1 must contain itemset Xk, that is, TID(Xk+1) ⊆ TID(Xk). It follows that support(Xk+1) ≤ support(Xk) < min_sup. Hence, any super-itemset Xk+1 of Xk is also infrequent if Xk is an infrequent itemset.

It can easily be seen from the downward closure property that the "extension" of infrequent itemsets is meaningless; thus, the downward closure property should be considered in every step of maximal frequent itemsets mining. More specifically, the infrequent itemsets existing in matrix A should not be added into frequent matrix B as basic elements of the "extension" process of the RMFIsM method. That is, if a 1-itemset ip is infrequent, its super-itemsets cannot be frequent; therefore, ip should not appear in matrix B, which reduces the time cost and memory usage both in constructing matrix B and in calculating the support values of these meaningless extended itemsets.

4.3. The Main Idea of RMFIsM Method

The main idea of the RMFIsM method consists of the following three parts:

1. Extend the short frequent itemsets into long itemsets.
2. Calculate the support value of the extended long itemsets and save the frequent long itemsets into the maximal frequent itemsets library MFIs_L.
3. Check the frequent sub-itemsets of the extended frequent long itemsets and move them out of MFIs_L.

Note that each itemset needs to be checked before the "extension" process so that infrequent itemsets are discarded, which further improves the mining efficiency.

Once matrix A is constructed and each element is written into it, the support value of each item is calculated and written in row (n+1), and the frequent 1-itemsets are stored into MFIs_L. After matrix B is constructed and the corresponding items are written into it, the frequent 2-itemsets whose entries are marked 1 are stored into MFIs_L; then all 1-itemsets are checked and each sub-itemset of a frequent 2-itemset is moved out of MFIs_L.
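A minimal sketch of this construction (Algorithm 1 plus the replacement rule of Equation (1)) might look as follows. The 0/1 list representation and the helper names (`row_for`, `build_matrix_a`, `build_matrix_b`) are our assumptions, not the paper's code.

```python
# Sketch of Algorithm 1 on the window {T7, T8, T3, T4, T5, T6} of the running example.
ITEMS = ["i1", "i2", "i3", "i4", "i5", "i6"]
WINDOW = [
    {"i1", "i2", "i5"},             # T7 (overwrote T1)
    {"i1", "i2", "i3", "i4"},       # T8 (overwrote T2)
    {"i2", "i3", "i5"},             # T3
    {"i1", "i2", "i3", "i5"},       # T4
    {"i1", "i3", "i5"},             # T5
    {"i1", "i2", "i3", "i5", "i6"}, # T6
]

def row_for(d, n):
    """Equation (1): transaction Td overwrites row d mod n; row n when the result is 0."""
    pos = d % n
    return n if pos == 0 else pos

def build_matrix_a(window, items):
    """Matrix A: one 0/1 presence row per transaction, plus a support row (the paper's row n+1)."""
    n = len(window)
    a = [[1 if item in t else 0 for item in items] for t in window]
    support_row = [sum(r[k] for r in a) / n for k in range(len(items))]
    return a, support_row

def build_matrix_b(a, items, support_row, min_sup):
    """Frequent matrix B over the frequent 1-itemsets only (downward closure pruning):
    B[(ip, iq)] = 1 iff the column-wise 'logic and' of columns p and q reaches min_sup."""
    n = len(a)
    freq = [k for k, s in enumerate(support_row) if s >= min_sup]
    b = {}
    for x, p in enumerate(freq):
        for q in freq[x + 1:]:
            count = sum(r[p] & r[q] for r in a)
            b[(items[p], items[q])] = 1 if count / n >= min_sup else 0
    return b

a, sup_row = build_matrix_a(WINDOW, ITEMS)
b = build_matrix_b(a, ITEMS, sup_row, 0.33)
# T7 (d = 7, n = 6) lands in row 1, as in Figure 1-b; i4 and i6 never enter matrix B.
```

With this window, every pair drawn from {i1, i2, i3, i5} receives entry 1 in B, matching Figure 2, while the infrequent items i4 and i6 are pruned before any pair support is computed.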

If itemset {ik1, ik2, …, ikp} is a frequent p-itemset, its "extension" into a (p+1)-itemset can be summarized as follows. The frequent p-itemset {ik1, ik2, …, ikp} can be extended into a (p+1)-itemset if and only if every B(ky, k(p+1)) = 1, where y ∈ [1, p]. Next, a "logic and" operation is performed on the corresponding (p+1) columns to calculate the support value; itemset {ik1, ik2, …, ikp, ik(p+1)} is retained and stored into MFIs_L if its support value is not less than min_sup, otherwise it is discarded directly. If the current extended (p+1)-itemset is frequent, all p-itemsets are checked and each sub-itemset is moved out of MFIs_L. The above "extension" operation is repeated until no itemset can be further extended. The specific pseudo-code is shown in Algorithm 2.

Algorithm 2: RMFIsM

Input: frequent matrix B
Output: MFIs

call Algorithm 1
delete each frequent sub-itemset of the frequent 2-itemsets
for (k = 1 to |B|)
{
    foreach (B(ky, k(p+1)) = 1)    // y ∈ [1, p]
    {
        extend the p-itemset {ik1, …, ikp} into the (p+1)-itemset {ik1, …, ikp, ik(p+1)}
        calculate support({ik1, …, ikp, ik(p+1)})
        if (support({ik1, …, ikp, ik(p+1)}) ≥ min_sup)
        {
            add {ik1, …, ikp, ik(p+1)} into MFIs_L
            move every sub-itemset of {ik1, …, ikp, ik(p+1)} out of MFIs_L
        }
        else
            delete {ik1, …, ikp, ik(p+1)}
    }
}
return MFIs

4.4. An Example of RMFIsM Method

To describe our proposed RMFIsM method better, we take the example shown in Table 1 to illustrate the specific mining process of maximal frequent itemsets; min_sup is set to 0.33 and the size of the sliding window is set to 6.

Matrix A is constructed and the data information of each transaction is marked into it as the transactions pass through the sliding window; the original information of each item (T1-T6) is shown in Figure 1-a. When the sliding window is full, matrix B is built and the maximal frequent itemsets mining process is implemented. When new transactions flow into the sliding window, the oldest transaction is covered by the latest one directly, based on Equation (1), to get better time efficiency. Figure 1 shows the change process of matrix A, in which T1 is covered by T7 and T2 is covered by T8; the new matrix A is shown in Figure 1-b.

            i1    i2    i3    i4    i5    i6
   T1        1     1     1     0     0     0
   T2        1     1     0     1     0     0
   T3        0     1     1     0     1     0
   T4        1     1     1     0     1     0
   T5        1     0     1     0     1     0
   T6        1     1     1     0     1     1
   support  0.83  0.83  0.83  0.17  0.67  0.17

a) Original matrix structure.

            i1    i2    i3    i4    i5    i6
   T7        1     1     0     0     1     0
   T8        1     1     1     1     0     0
   T3        0     1     1     0     1     0
   T4        1     1     1     0     1     0
   T5        1     0     1     0     1     0
   T6        1     1     1     0     1     1
   support  0.83  0.83  0.83  0.17  0.83  0.17

b) New matrix structure.

Figure 1. The change process of matrix A.

Next, we take the transactions {T7, T8, T3, T4, T5, T6} as the example to explain the MFIs mining process more clearly; the whole process is divided into the following steps.

• Retain the 1-itemsets whose support values are not less than min_sup and save them into MFIs_L; these frequent 1-itemsets are the basic elements of matrix B. Here, the frequent 1-itemsets in MFIs_L are {i1}, {i2}, {i3}, {i5}.

• Take the frequent 1-itemsets saved in MFIs_L to construct matrix B: the rows of matrix B are the front (n-1) elements and the columns of matrix B are the last (n-1) elements, where n is the number of frequent 1-itemsets. Thus, the rows of matrix B are {i1, i2, i3} and the columns are {i2, i3, i5}. Next, the support of each 2-itemset ({i1, i2}, {i1, i3}, {i1, i5}, {i2, i3}, {i2, i5}, {i3, i5}) is calculated and marked into matrix B, and the frequent 2-itemsets are saved into MFIs_L. The specific information of matrix B is shown in Figure 2.

         i2   i3   i5
   i1     1    1    1
   i2     0    1    1
   i3     0    0    1

Figure 2. The structure of matrix B.

• After constructing matrix B, the infrequent itemsets need to be deleted first, and each sub-itemset of a frequent 2-itemset needs to be moved out of MFIs_L. In this example, the frequent 1-itemsets {i1} and {i2} are sub-itemsets of the frequent 2-itemset {i1, i2}, so {i1} and {i2} are moved out of MFIs_L. This operation is continued until no frequent 1-itemset is a sub-itemset of a frequent 2-itemset. Here, the itemsets in MFIs_L are {i1, i2}, {i1, i3}, {i1, i5}, {i2, i3}, {i2, i5} and {i3, i5}.
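The level-wise "extension" loop can be sketched as follows. This is a simplified, matrix-free reading of Algorithm 2 under our own assumptions (the function name `mine_mfis` and set-based support counting are ours); it grows frequent k-itemsets into (k+1)-itemsets in dictionary order and prunes the sub-itemsets from MFIs_L.

```python
def mine_mfis(window, min_sup):
    """Grow frequent k-itemsets into (k+1)-itemsets in dictionary order,
    keep the frequent ones in MFIs_L, and move every sub-itemset of a
    newly added longer itemset out of MFIs_L."""
    n = len(window)
    items = sorted(set().union(*window))
    support = lambda s: sum(s <= t for t in window) / n
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    mfis_l = list(level)
    while level:
        extended = set()
        for s in level:
            for i in items:
                if i > max(s):  # extend only in dictionary order
                    candidate = s | {i}
                    if support(candidate) >= min_sup:
                        extended.add(candidate)
        # pruning step: sub-itemsets of the frequent longer itemsets leave MFIs_L
        mfis_l = [m for m in mfis_l if not any(m < c for c in extended)]
        mfis_l.extend(extended)
        level = list(extended)
    return sorted(mfis_l, key=sorted)

window = [{"i1", "i2", "i5"}, {"i1", "i2", "i3", "i4"}, {"i2", "i3", "i5"},
          {"i1", "i2", "i3", "i5"}, {"i1", "i3", "i5"}, {"i1", "i2", "i3", "i5", "i6"}]
print(mine_mfis(window, 0.33))  # the single recent MFI {i1, i2, i3, i5}
```

Run on the current window {T7, T8, T3, T4, T5, T6}, this sketch reproduces the result derived in the remaining steps of the walk-through: MFIs_L ends up containing only {i1, i2, i3, i5}.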
• Then, the frequent 2-itemsets need to be extended into 3-itemsets. The frequent 2-itemset {i1, i2} is first selected as the conditional potential itemset; because B(i1, i3) = 1 and B(i2, i3) = 1, {i1, i2} can be extended into {i1, i2, i3}, which is saved into MFIs_L because support({i1, i2, i3}) = 0.5 > 0.33. The same process is repeated to gain the frequent 3-itemsets {i1, i2, i5}, {i1, i3, i5} and {i2, i3, i5}, which are saved into MFIs_L. After gaining all frequent 3-itemsets, each frequent 2-itemset is checked and all sub-itemsets are moved out of MFIs_L.

• Next, the frequent 3-itemsets need to be extended into 4-itemsets. The frequent 3-itemset {i1, i2, i3} is first selected as the conditional potential itemset; because B(i1, i5) = 1, B(i2, i5) = 1 and B(i3, i5) = 1, {i1, i2, i3} can be extended into {i1, i2, i3, i5}, which is saved into MFIs_L because support({i1, i2, i3, i5}) = 0.333 > 0.33. After gaining the frequent 4-itemsets, each frequent 3-itemset is checked and each sub-itemset is moved out of MFIs_L.

After the above steps, the content of MFIs_L is {i1, i2, i3, i5}.

5. Experimental Analysis

To verify the efficiency of our proposed RMFIsM method, the estDec method [2], the TMFI method [6] and the DSM-MFI method [8] are compared in our experiments. All experiments are conducted on a machine running Windows 7 with an Intel dual-core i3-2020 2.93 GHz processor; the development environment is Microsoft Visual Studio 2010. The performance of the RMFIsM method is analyzed on the synthetic sparse dataset T10.I4.D1000K and the synthetic dense dataset T30.I20.D1000K generated by the IBM data generator, where |T| means the average size of the transactions, |I| means the potential size of frequent itemsets, |D| means the total number of transactions, and K means one thousand.

Experiments are conducted to investigate the efficiency of the RMFIsM method in both time cost and memory usage with different values of min_sup, different sizes of sliding window and different numbers of transactions; experiments are also conducted to test the accuracy rate of the RMFIsM method. Each group of experiments is repeated 50 times, and the average time and memory usage are calculated.

5.1. Time Cost for RMFIsM Method

The time cost for mining recent MFIs on the sparse dataset T10.I4.D1000K with different values of min_sup is shown in Figure 3-a. The time cost on T10.I4.D1000K with different sizes of sliding window is shown in Figure 3-b. The time cost on T10.I4.D1000K with different numbers of transactions is shown in Figure 3-c. The time cost on the dense dataset T30.I20.D1000K is shown in Figures 4-a, 4-b and 4-c, respectively.

It can be seen from Figures 3-a and 4-a that the time cost of the RMFIsM, DSM-MFI, estDec and TMFI methods shows a decreasing trend as the value of min_sup increases. The time cost of our proposed RMFIsM method is the lowest of the four compared methods. The reason is that, in the process of mining MFIs, the RMFIsM method only implements the "logic and" operation on the data information stored in the matrixes, which reduces the iteration, sorting and pruning operations; moreover, the infrequent itemsets are discarded directly in the RMFIsM method to avoid meaningless "extension" operations. Compared with the DSM-MFI method, the time saved by the RMFIsM algorithm is large at first and becomes smaller gradually as min_sup increases, because the total number of frequent itemsets decreases significantly with a large value of min_sup. Compared with the dataset T10.I4.D1000K, the time cost of the MFIs mining process on T30.I20.D1000K is much higher, because the itemsets in the dense dataset T30.I20.D1000K are more likely to be frequent due to their larger support values.

It can be seen from Figures 3-b and 4-b that, with the increasing size of the sliding window, the time cost of the four compared methods shows an increasing trend, because the number of frequent itemsets rises rapidly as |SW| becomes larger. The time cost of our proposed RMFIsM method is the lowest of the four methods, and the time cost on T30.I20.D1000K is much larger than that on T10.I4.D1000K.

We can obviously see from Figures 3-c and 4-c that the time cost of the four compared methods increases with the number of transactions, because the frequent itemsets increase gradually as the number of transactions rises. The time cost of the RMFIsM method is less than that of the DSM-MFI, estDec and TMFI methods, and the time cost on T30.I20.D1000K is much more than that on T10.I4.D1000K.
Figure 3. Time cost on sparse dataset T10.I4.D1000K: a) different values of min_sup (|SW| = 1000, 1000K transactions); b) different sizes of sliding window (min_sup = 0.1, 1000K transactions); c) different numbers of transactions (min_sup = 0.1, |SW| = 1000).

Figure 4. Time cost on dense dataset T30.I20.D1000K: a) different values of min_sup (|SW| = 1000, 1000K transactions); b) different sizes of sliding window (min_sup = 0.1, 1000K transactions); c) different numbers of transactions (min_sup = 0.1, |SW| = 1000).

Figure 5. Memory usage on sparse dataset T10.I4.D1000K: a) different values of min_sup (|SW| = 1000, 1000K transactions); b) different sizes of sliding window (min_sup = 0.1, 1000K transactions); c) different numbers of transactions (min_sup = 0.1, |SW| = 1000).

Figure 6. Memory usage on dense dataset T30.I20.D1000K: a) different values of min_sup (|SW| = 1000, 1000K transactions); b) different sizes of sliding window (min_sup = 0.1, 1000K transactions); c) different numbers of transactions (min_sup = 0.1, |SW| = 1000).

5.2. Memory Usage of the RMFIsM Method

Memory usage is an important measure of the efficiency of our proposed RMFIsM method. The peak memory usage is also tested with different values of min_sup, different sizes of the sliding window, and different numbers of transactions; the parameters are the same as those in subsection 5.1, and the experimental results are shown in Figures 5 and 6.

We can see from Figures 5-a and 6-a that, as the value of min_sup increases, the peak memory usage of the four compared algorithms shows a decreasing trend. This is because the number of frequent 1-itemsets decreases gradually with a larger min_sup, so the number of intermediate itemsets generated during MFIs mining is also greatly reduced. The peak memory usage of our proposed RMFIsM method is the lowest of the four methods, because infrequent itemsets are discarded at the very beginning of the RMFIsM method, so no meaningless "extension" operations are performed that would occupy additional memory. Compared with the sparse dataset T10.I4.D1000K, the memory usage of the MFIs mining process on the dense dataset T30.I20.D1000K is much higher.

Figures 5-b and 6-b show that the peak memory usage of the four compared methods grows gradually as the size of the sliding window increases, because the number of frequent itemsets becomes much larger as the sliding window is extended. The peak memory usage on the sparse dataset T10.I4.D1000K is much smaller than that on the dense dataset T30.I20.D1000K.

We can see from Figures 5-c and 6-c that the peak memory usage of the RMFIsM, DSM-MFI, estDec, and TMFI methods increases smoothly with the number of transactions, and the occupied peak memory is linearly related to the number of transactions. Among the four compared methods, the peak memory usage of the RMFIsM method is somewhat lower than that of the estDec, TMFI, and DSM-MFI methods. The peak memory usage on the dense dataset T30.I20.D1000K is again much larger than that on the sparse dataset T10.I4.D1000K.
Table 2. Accuracy rate of RMFIsM method.

min_sup  T10    T30      |SW|   T10    T30      Transactions  T10    T30
0.05     87.2%  89.6%    200    91.3%  92.2%    300K          92.1%  93.4%
0.1      92.3%  93.4%    400    91.6%  92.6%    400K          92.4%  93.2%
0.15     95.2%  96.1%    600    91.9%  92.8%    500K          92.3%  93.5%
0.2      96.4%  96.9%    800    92.0%  93.1%    600K          92.2%  93.5%
0.25     96.8%  97.3%    1000   92.3%  93.5%    700K          92.4%  93.3%
0.3      97.1%  97.5%    1200   92.5%  93.6%    800K          92.1%  93.4%
0.35     97.2%  97.6%    1400   92.7%  93.7%    900K          92.2%  93.2%
0.4      97.3%  97.8%    1600   92.8%  93.9%    1000K         92.3%  93.5%
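As a minimal, illustrative sketch of the window-level computation whose accuracy is reported above — a matrix of frequent 1-itemsets, level-wise "extension" of frequent p-itemsets into (p+1)-itemsets, and removal of sub-itemsets so that only maximal itemsets remain — the following Python follows the paper's description, but the row-set representation and every name below are our own assumptions, not the authors' implementation:

```python
# Illustrative sketch only: the paper describes the idea (two matrices,
# "extension" of frequent p-itemsets, maximal filtering via an MFIs
# list), not this exact code.

def mine_window_mfis(window, min_sup):
    """Return the maximal frequent itemsets of the transactions in the
    current sliding window. `window` is a list of item sets; `min_sup`
    is a relative support threshold in (0, 1]."""
    min_count = min_sup * len(window)

    # Frequent-1-itemset matrix: item -> indices of the window rows
    # (transactions) that contain it; infrequent items are discarded up
    # front so they never take part in any "extension" step.
    rows = {}
    for i, t in enumerate(window):
        for item in t:
            rows.setdefault(item, set()).add(i)
    rows = {it: r for it, r in rows.items() if len(r) >= min_count}

    # Level-wise "extension": a frequent (p+1)-itemset is a frequent
    # p-itemset plus one frequent item; its support is the size of the
    # intersection of their row sets.
    level = {frozenset([it]): r for it, r in rows.items()}
    frequent = dict(level)
    while level:
        nxt = {}
        for iset, r in level.items():
            for it, r1 in rows.items():
                if it not in iset:
                    rr = r & r1
                    if len(rr) >= min_count:
                        nxt[iset | {it}] = rr
        frequent.update(nxt)
        level = nxt

    # MFIs list: keep only itemsets with no frequent proper superset,
    # i.e. every sub-itemset of a longer frequent itemset is dropped.
    return [s for s in frequent if not any(s < t for t in frequent)]
```

For example, on a four-transaction window [{a,b,c}, {a,b}, {a,c}, {b,c}] with min_sup = 0.5, the frequent itemsets are all singletons and pairs, and the maximal ones are {a,b}, {a,c} and {b,c}.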

5.3. Accuracy Rate of the RMFIsM Method

The accuracy rate of our proposed RMFIsM method is also tested with different values of min_sup, different sizes of the sliding window, and different numbers of transactions; the experimental parameters are the same as those in subsection 5.1, and the results are shown in Table 2.

We can see from Table 2 that, as the value of min_sup rises, the accuracy rate of the mining results improves slowly on both T10.I4.D1000K and T30.I20.D1000K. The reason is that the number of frequent itemsets decreases with a larger min_sup, so the influence of infrequent itemsets gradually disappears. Furthermore, the accuracy rate of the RMFIsM method rises with the size of the sliding window and remains relatively stable overall. Moreover, as the number of transactions increases, the accuracy rate of the RMFIsM method stays between 92.1% and 92.4% on the sparse dataset T10.I4.D1000K and between 93.2% and 93.5% on the dense dataset T30.I20.D1000K, so the number of transactions has only a small effect on the accuracy of MFIs mining. These results indicate that our proposed RMFIsM method is suitable for mining maximal frequent itemsets over online data streams under larger values of min_sup.

6. Conclusions

It is often difficult to quickly mine the recent frequent itemsets over huge-scale data streams. In this paper, we propose an improved approach called RMFIsM that mines the maximal frequent itemsets instead of all frequent itemsets. We first construct two matrices to store the data of each transaction and the information of the frequent 1-itemsets. The frequent (p+1)-itemsets are mined by the "extension" process of the frequent p-itemsets; the current maximal frequent itemsets are stored in MFIs_L, and every sub-itemset of a longer frequent itemset is removed from MFIs_L. Experimental comparison with the DSM-MFI, estDec and TMFI methods shows that our proposed RMFIsM method is more effective in both time cost and memory usage, and that the accuracy rate of MFIs mining is also very high.
Saihua Cai is a Ph.D. student in the College of Information and Electrical Engineering, China Agricultural University, China. He received the MS degree from Jiangsu University, China, in 2016. His major research interests include uncertain data management, data mining, outlier detecting and software testing.

Shangbo Hao is a Master student in the College of Information and Electrical Engineering, China Agricultural University, China. His research interests include pattern mining and outlier detecting.

Ruizhi Sun is a full professor in the College of Information and Electrical Engineering, China Agricultural University, China. He received his Ph.D. degree in Computer Science and Technology from Tsinghua University, Beijing, China, in 2003. His major research interests include agricultural data acquisition and processing technology, computer networks and applications, workflow management and cloud computing.

Gang Wu is an associate professor in the Computer Science Department, Tarim University, China. His research interests mainly involve agriculture information processing technology, data mining and agricultural remote sensing application.
