A Novel Weighted FP-Stream Algorithm For IoT Data Streams
Abstract—The Internet of Things (IoT) is a technology that is widely used in daily life and makes it easier for devices to connect with each other. As a result of the high connectivity between devices, enormous volumes of data are being collected. Such data, called big streaming data, can be turned into useful information by data mining techniques. One of the most widely used processing methods is Frequent Itemset (Pattern) Mining (FIM), which detects recurring and common patterns over data streams. In this paper, a new algorithm based on the frequently used FP-Stream algorithm is presented. The proposed algorithm enhances the conventional FP-Stream algorithm to make it more adaptive to concept drift while retaining its applicability to data streams. Conventional FP-Stream algorithms store all detected patterns. By adding weights based on pattern freshness during the pruning process, the proposed algorithm prioritizes newer patterns, so it learns new patterns and forgets older ones swiftly. Performance evaluations are carried out using data acquired from an IoT testbed established in the KAVEM Lab of Gazi University. Evaluation results indicate that the proposed algorithm performs significantly better than the conventional FP-Stream algorithm.

Keywords—streaming data mining; frequent patterns; logarithmic tilted-time window; internet of things; tail pruning; weighting

I. INTRODUCTION

The Internet of Things (IoT) is a technology that helps all objects communicate with each other. IoT aims to improve the quality of life. It obtains data from related objects and provides users with meaningful information [1]. IoT is a term used for a network composed of numerous highly connected objects that penetrates daily life in a pervasive manner. Services and applications built upon IoT increase the quality of life for humans. It is estimated that by 2020 there will be 50 billion connected devices in IoT networks [1]. Connecting all these devices together makes it easier to process and obtain information. Using these capabilities, numerous smart applications, such as smart homes and smart cities, have been introduced into our lives. Since IoT systems are always running, large volumes of data are generated continuously. Processing these data is crucial to improving quality of life further. For this purpose, numerous concepts have been introduced to the literature, such as data stream mining, big data and stream data analysis [1].

IoT networks create a virtual representation of the real world by using numerous sensors. Detection of recurring events can be used for predicting events or errors. By exploring meaningful relations between events, more complex events can be formulated. For this purpose, association rules are used. Association rule mining structures have been created to allow sensitive systems to work at high performance. They reveal how data items are related to each other, and they necessarily operate on a database. The minimum support and minimum confidence values determine how strong a relationship is. Association rules can produce a single output to be used directly or an output that serves as input to other mining operations. One of the oldest and most frequently used algorithms for association rule mining is the Apriori algorithm.

The Apriori algorithm requires multiple passes over the data to detect the relations within it. In data streams, however, where the data volume is large and the data arrive continuously, it is not feasible to process a data point multiple times. Therefore, streaming data call for algorithms that require a minimal number of passes over the data. For this purpose, the FP-Growth algorithm and the data structure it uses, namely the FP-Tree, were proposed [2]. By performing reduction operations with the FP-Growth algorithm, higher performance and more frequent patterns are obtained [3].

A search of the literature revealed that, for datasets that contain large numbers of identical transactions, pruning old patterns takes a considerably long time and new patterns are detected poorly. To overcome this problem, a weight parameter is introduced to the conventional FP-Stream algorithm. In addition to the weight parameter, the proposed Weighted FP-Stream algorithm decreases the storage used for patterns; it therefore reduces memory and time complexity and prioritizes recent transactions. In short, the proposed algorithm is faster and better able to keep up with concept drift. This research is carried out specifically for data stream mining applications over IoT. The data used to test the Weighted FP-Stream algorithm are collected from the KAVEM Lab at Gazi University. The data are preprocessed and used to compare the proposed algorithm to the conventional FP-Stream
Authorized licensed use limited to: UNIVERSITY TEKNOLOGI MALAYSIA. Downloaded on July 29,2023 at 01:29:11 UTC from IEEE Xplore. Restrictions apply.
way, it makes it easier to work with outlier values. The average weight parameter is 0.5, which is the average of the minimum and maximum weight values. The weight parameter is updated dynamically as each new batch arrives and is scanned.

TABLE I. PRELIMINARIES OF FP-STREAM AND WEIGHTED FP-STREAM

Term      Explanation
I         Itemset, combination of single items i_a
σ         Minimum support
ε         Maximum support error
T         Time period
F         Frequency
w         Weight
w_i       Window
w_i w̄     Average window weight
w̄         Average weight

In the FP-Stream structure, frequent and sub-frequent patterns are captured using the FP-Tree structure, and the logarithmic tilted-time windows store these patterns. Table II gives the frequency conditions of the patterns. Sub-frequent patterns are stored because they may become frequent in the future.

TABLE II. PATTERN CATEGORIZATION ON FP-STREAM

Pattern         Categories
Frequent        support > σ
Sub-Frequent    support < σ and support ≥ ε
Infrequent      support < ε

TABLE III. LOGARITHMIC TILTED-TIME WINDOW

(Each row shows the units held after one more batch has arrived; the intermediate buffer of each unit is given in brackets.)

f(1,1) []
f(2,2) []     f(1,1) []
f(3,3) []     f(2,2) [f(1,1)]
f(4,4) []     f(3,3) []            f(2,1) []
f(5,5) []     f(4,4) [f(3,3)]      f(2,1) []
f(6,6) []     f(5,5) []            f(4,3) [f(2,1)]
f(7,7) []     f(6,6) [f(5,5)]      f(4,3) [f(2,1)]
f(8,8) []     f(7,7) []            f(6,5) []            f(4,1) []
f(9,9) []     f(8,8) [f(7,7)]      f(6,5) []            f(4,1) []
f(10,10) []   f(9,9) []            f(8,7) [f(6,5)]      f(4,1) []
f(11,11) []   f(10,10) [f(9,9)]    f(8,7) [f(6,5)]      f(4,1) []
f(12,12) []   f(11,11) []          f(10,9) []           f(8,5) [f(4,1)]
f(13,13) []   f(12,12) [f(11,11)]  f(10,9) []           f(8,5) [f(4,1)]
f(14,14) []   f(13,13) []          f(12,11) [f(10,9)]   f(8,5) [f(4,1)]
f(15,15) []   f(14,14) [f(13,13)]  f(12,11) [f(10,9)]   f(8,5) [f(4,1)]
f(16,16) []   f(15,15) []          f(14,13) []          f(12,9) []        f(8,1) []
f(17,17) []   f(16,16) [f(15,15)]  f(14,13) []          f(12,9) []        f(8,1) []
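The update sequence in Table III can be reproduced with a short simulation. The sketch below is illustrative only and is not the authors' Java implementation: the names `Level`, `LogTiltedWindow` and `max_units` are hypothetical, each unit records only which batches it covers (a real structure would also hold the summed frequencies), and the logarithm in the unit bound is assumed to be base 2, which matches the 17-unit annual example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (newest batch, oldest batch), printed as f(newest,oldest)


@dataclass
class Level:
    main: Optional[Span] = None    # unit currently held at this granularity
    buffer: Optional[Span] = None  # intermediate buffer, shown in brackets in Table III


class LogTiltedWindow:
    """Tracks which batches each unit covers (hypothetical helper class)."""

    def __init__(self) -> None:
        self.levels: List[Level] = []

    def add_batch(self, t: int) -> None:
        carry: Span = (t, t)
        for depth in range(len(self.levels) + 1):
            if depth == len(self.levels):
                self.levels.append(Level())
            level = self.levels[depth]
            if level.main is None:        # free unit: the carried span settles here
                level.main = carry
                return
            old, level.main = level.main, carry
            if depth == 0:                # the newest unit has no buffer in Table III
                carry = old
            elif level.buffer is None:    # park the displaced unit in the buffer
                level.buffer = old
                return
            else:                         # merge displaced unit with the buffer, push on
                carry = (old[0], level.buffer[1])
                level.buffer = None

    def render(self) -> str:
        cells = []
        for level in self.levels:
            buf = "" if level.buffer is None else "f(%d,%d)" % level.buffer
            cells.append("f(%d,%d) [%s]" % (level.main + (buf,)))
        return "; ".join(cells)


def max_units(n_batches: int) -> int:
    """Unit bound for n_batches batches, assuming a base-2 logarithm."""
    return math.ceil(math.log2(n_batches) + 1)
```

Feeding batches 1 to 8 into `add_batch` and calling `render()` yields the eighth row of Table III, and `max_units(365 * 24 * 4)` evaluates to 17.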
The FP-Stream algorithm includes the FP-Tree and logarithmic tilted-time window structures. With the logarithmic tilted-time window, frequent patterns are stored in compressed form to save memory: the number of units held in the structure grows only logarithmically. For example, retaining one year of data with four units per hour requires 366 x 24 x 4 = 35,136 units in a natural tilted-time window. The logarithmic tilted-time window covers the same period with only log2(365 x 24 x 4) + 1 ≈ 17 units. For each division operation, fixed-size batches are used. Tail pruning is done with the T information and the ε parameter, and mining operations are done on the FP-Tree with the FP-Growth algorithm [13].

1) FP-Stream algorithm: The FP-Stream algorithm aims to find frequent patterns in data streams. It includes the FP-Tree and FP-Growth algorithms. Each node of an FP-Stream tree contains the tilted-time window and the support information for its pattern. The minimum support and maximum support error values decide whether the items in the related itemset are frequent or not. The processed data are collected in batches, and the batches are kept in memory in the structure called the tilted-time window [13].

The use of a tilted-time window reduces the number of units held in memory. This window uses buffering. Keeping the batches together simplifies operation and increases performance. Batches are held logarithmically, as shown in Table III.

The first added batch is held alone in the first unit. When a new batch arrives, the older batch is shifted to the next unit; if the intermediate buffer of that unit is empty, the displaced content is placed in the buffer. If there is no free space, the displaced content and the buffered content are merged (compressed) and transferred to the next unit. This continues until all the batches have settled. The number of units needed to hold N batches is given by Equation 1 [13].

$\lceil \log_2 N + 1 \rceil$ (1)

The incoming batch is transferred to the FP-Tree structure and patterns are determined in accordance with the FP-Growth algorithm. The f_list keeps the usage frequency of the items and the item order derived from it. Once all data in the incoming batch have been added, if an incoming itemset is already in the FP-Tree structure, the frequency from the corresponding batch is added to the logarithmic tilted-time table of that itemset and tail pruning is performed. If the table is empty, the mining of that itemset by the FP-Growth algorithm is complete. Thus, the FP-Stream structure is created. A depth-first search is then performed on the created structure, and if an itemset was not mined in the incoming batch, zero is added to its table. Tail pruning continues. When the processes are completed, frequent patterns can be read from the created structure [13].

2) FP-Tree algorithm: This algorithm finds frequent patterns using the FP-Tree structure, which was established to reduce the number of scans required by the Apriori algorithm. The Apriori algorithm cannot achieve accurate results with a small
amount of data. Cartesian products are used in the Apriori algorithm, which increases the cost of computing and storing the patterns obtained. The FP-Tree algorithm responds to the need for a structure that can be updated dynamically to find frequent patterns. It also makes it easy to sort the patterns by frequency [14, 15].

To use this structure, an empty FP-Tree is first created and the minimum support value is determined. Then, all data in the database are scanned once and the support value of each item is found. Items whose support is not smaller than the minimum support value are frequent and are listed in the f_list in descending order. Every transaction is then added to the tree structure with its items ordered according to the frequencies in the f_list. For each added item, the count of the corresponding node increases by one [15].

3) FP-Growth algorithm: The FP-Tree structure is mined with the FP-Growth algorithm. Operations begin with the least frequent item in the f_list. If the item lies on a single path, the frequent pattern is formed from all items up to the root node, and the lowest support value on the path becomes the conditional pattern-base value of that pattern. If the item appears on more than one branch, there are as many conditional pattern bases as branches, and the same operations are performed for each branch. For each item, the values of all items in the conditional pattern base are examined. Patterns whose support is larger than the minimum support value form the conditional FP-tree structure. The FP-Growth algorithm uses a divide-and-conquer method, so it mines data with higher performance than the Apriori algorithm [15].

B. Weighted FP-Stream Algorithm

The purpose of this algorithm is to increase the importance of the frequent patterns in the current batch. In the FP-Stream algorithm, it is observed that once frequent patterns have been obtained from large amounts of data, the algorithm must run millions of times before fresh patterns become more dominant than the old patterns. Applying a weight parameter in the tail pruning process prevents this situation and keeps the data fresh, so that more current data are accepted as frequent patterns. The algorithm is presented in Figure 1.

INPUT
  1. An FP-Stream structure
  2. σ, ε and batch size thresholds
  3. Incoming batch to store transactions
OUTPUT
  1. Updated FP-Stream structure
  2. Frequent patterns
METHOD
  1. Initialize an empty FP-tree.
  2. Sort the items by frequency in the f_list and insert all transactions into the FP-tree structure (only f_list items are inserted) by the FP-Growth algorithm.
  3. Use tail pruning on the FP-tree structure.
     a. If I is in the structure:
        i. Add the frequency of I to the tilted-time window.
        ii. Conduct tail pruning using the weight factor.
        iii. If the table is empty, stop mining supersets of I; otherwise continue to mine supersets of I.
     b. If I is not in the structure:
        i. If the frequency of I in the batch is not less than the maximum support error times the batch size, insert I into the structure. Otherwise stop mining and use tail pruning.
        ii. Scan the structure by depth-first search. If a node has no children, it is a leaf node.

Fig 1. Pseudocode for Weighted FP-Stream algorithm

IV. DATA PREPROCESSING

The collected sensor data include a timestamp and the corresponding sensor values. In order to find frequent patterns with the FP-Stream algorithm, these data must be categorized. The sensors have different measurement frequencies. Therefore, the data are grouped into 10-minute time units by averaging the readings of the same sensor type. Grouping takes into account the standard deviation of each sensor's data. Readings from the indoor air quality, temperature, humidity, light density, Sharp and PIR sensors are kept in streaming batches. Standard deviations were computed from the values of each sensor feature, and the data of each feature were grouped and categorized. The prepared data were given as input to the implemented FP-Stream algorithm and frequent patterns were found.

As mentioned earlier in this paper, real sensor data collected from the KAVEM Lab of Gazi University are used. The unprocessed data are shown in Figure 2. Each record consists of a time, a sensor and a sensor value. Temperature, indoor air quality, light density, humidity, PIR and Sharp sensor readings are held together in the data. During data processing, records containing 2 sensor values are separated. Noise and outlier data were identified and the necessary corrections were made.

Fig 2. Unprocessed sensor data

There were some outliers in the light density sensor data. The values had to be greater than 0, but 6 values were less than zero. This was caused by sensor failure, and these values were dropped.

The relevant data were grouped at 10-minute intervals and a feature was created for each sensor. If several readings were collected from a sensor within 10 minutes, their average was taken. Then, considering the standard deviations of the relevant sensor data, the data were categorized and the itemsets to be used in the study were obtained. The value ranges used to categorize each sensor's data are given in Table IV.

TABLE IV. CATEGORY RANGES OF SENSOR DATA

Sensor                   Value Ranges
Indoor Air Quality       0-6
Light Density            0-25
Temperature (Celsius)    0-6
Humidity                 0-8
Sharp                    0-1
PIR                      0-1
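The preprocessing steps of Section IV (dropping negative light readings, averaging per sensor in 10-minute windows, then categorizing with the help of each sensor's standard deviation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' notebook code. The paper does not give the exact binning rule, so the sketch assumes a category is the number of whole standard deviations above the sensor's minimum. The sensor name `"light"`, the tuple layout of a reading and both function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

WINDOW_SECONDS = 600  # 10-minute grouping interval described in the text


def bucket_averages(readings):
    """readings: iterable of (unix_seconds, sensor_name, value) tuples (assumed layout).

    Returns {bucket_index: {sensor_name: mean value within that 10-minute bucket}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for ts, sensor, value in readings:
        if sensor == "light" and value < 0:
            continue  # negative light density is treated as a sensor failure
        grouped[ts // WINDOW_SECONDS][sensor].append(value)
    return {b: {s: mean(v) for s, v in sensors.items()}
            for b, sensors in grouped.items()}


def categorize(averages):
    """Turn averaged values into items such as 'temp=2' (assumed binning rule:
    whole standard deviations above the sensor's minimum averaged value)."""
    per_sensor = defaultdict(list)
    for sensors in averages.values():
        for s, v in sensors.items():
            per_sensor[s].append(v)
    stats = {s: (min(v), pstdev(v)) for s, v in per_sensor.items()}
    transactions = {}
    for b, sensors in sorted(averages.items()):
        items = set()
        for s, v in sensors.items():
            lowest, sd = stats[s]
            category = 0 if sd == 0 else int((v - lowest) // sd)
            items.add(f"{s}={category}")
        transactions[b] = items
    return transactions
```

Each resulting set of items plays the role of one transaction; a fixed number of such transactions would then form a batch for the stream miner. Because not every sensor reports in every window, some transactions contain fewer items than others.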
Figure 4 plots the data with indexes between 1000 and 2000. As can be seen in Figure 4, because not every transaction contains all sensor types, there are gaps in the plotted lines. Categorizing the data at intervals larger than 10 minutes would largely prevent these gaps. The values that all the data take for each transaction are shown in the Jupyter Notebook in which the data were preprocessed.

$\exists\, l,\ \forall i,\ l \le i \le n:\ f_I(t_i) < \sigma w_i \quad \text{AND} \quad \forall l',\ l \le l' \le n:\ \sum_{i=l}^{l'} f_I(t_i) < \varepsilon \sum_{i=l}^{l'} w_i$ (2)

Equation 2 combines two conditions with the AND operator, and pruning takes place only when both are fulfilled. The condition to the left of the operator is true if every frequency value of an itemset is less than the product of the respective window size and σ. The condition to the right of the operator is true if the sum of the frequency values of an itemset is less than the product of the total window size and ε. When both conditions are true, tail pruning is performed.

The tail pruning condition of the developed Weighted FP-Stream algorithm is given in Equation 3.

$\exists\, l,\ \forall i,\ l \le i \le n:\ f_I(t_i)\, w_i < \sigma\, w_i \bar{w} \quad \text{AND} \quad \forall l',\ l \le l' \le n:\ \sum_{i=l}^{l'} f_I(t_i)\, w_i < \varepsilon \sum_{i=l}^{l'} w_i \bar{w}$ (3)
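The two pruning conditions can be checked mechanically. The sketch below tests Equation 2 directly and, when weights are supplied, one possible reading of Equation 3. That reading is an assumption: each window frequency is multiplied by a freshness weight between 0 and 1, and the thresholds are scaled by the average weight. The function name `tail_start`, its parameters and the sample numbers are hypothetical and are not taken from the authors' implementation. Windows are indexed from newest (index 0) to oldest.

```python
from typing import Optional, Sequence


def tail_start(freqs: Sequence[float], widths: Sequence[float],
               sigma: float, epsilon: float,
               weights: Optional[Sequence[float]] = None,
               avg_weight: float = 1.0) -> Optional[int]:
    """Return the smallest index l such that windows l..n may be dropped,
    or None if no tail satisfies both conditions.

    With weights=None and avg_weight=1.0 this is the unweighted test of
    Equation 2; otherwise it is the assumed weighted reading of Equation 3.
    """
    n = len(freqs)
    if weights is None:
        weights = [1.0] * n
    for l in range(n):
        # Left condition: every window in the tail is below the sigma threshold.
        if any(weights[i] * freqs[i] >= sigma * widths[i] * avg_weight
               for i in range(l, n)):
            continue
        # Right condition: every running sum starting at l is below the epsilon threshold.
        f_sum = 0.0
        w_sum = 0.0
        ok = True
        for i in range(l, n):
            f_sum += weights[i] * freqs[i]
            w_sum += widths[i] * avg_weight
            if f_sum >= epsilon * w_sum:
                ok = False
                break
        if ok:
            return l
    return None


# Illustrative numbers only: four windows covering 1000, 1000, 2000 and 4000 transactions.
freqs = [150, 120, 30, 20]
widths = [1000, 1000, 2000, 4000]
plain = tail_start(freqs, widths, sigma=0.1, epsilon=0.01)
weighted = tail_start(freqs, widths, sigma=0.1, epsilon=0.01,
                      weights=[1.0, 0.75, 0.25, 0.125], avg_weight=0.5)
```

With these sample values the unweighted test can drop only the oldest window (`plain` is 3), while the weighted test already drops the two oldest windows (`weighted` is 2), which illustrates how down-weighting old windows lets stale patterns be forgotten sooner.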
older data. Thus, the availability of current frequent patterns has been made more dominant in the algorithm.

VI. PERFORMANCE EVALUATION

The proposed algorithm is coded in the Java programming language. The data preprocessing parts are coded in the Python programming language, and all the processed data are then divided into batches. The performance evaluations are carried out on these batches. The test environment has a 2.60 GHz CPU, 16 GB RAM and the Windows 10 OS.

Using the same data and parameters, the numbers of frequent patterns obtained with the FP-Stream algorithm and the Weighted FP-Stream algorithm are given in Table VI. The minimum support threshold is fixed at 0.1 and the maximum support error threshold at 0.01. The batch size is fixed at 1000 and 39 batches are used. As expected, all common patterns derived from the Weighted FP-Stream algorithm are also included in the results of the original FP-Stream algorithm.

TABLE VI. FREQUENT PATTERN NUMBERS OBTAINED BY FP-STREAM AND WEIGHTED FP-STREAM ALGORITHM

Algorithm                      FP-Stream Algorithm    Weighted FP-Stream Algorithm
Number of Frequent Patterns    710                    58

The FP-Stream algorithm finds 710 frequent patterns and the Weighted FP-Stream algorithm finds 58. 53 patterns are common to both algorithms.

Fig. 5. Freshness of the patterns

When the Weighted FP-Stream and FP-Stream algorithms are examined as in Figure 5, we can see that only the fresh patterns are retained by the Weighted FP-Stream algorithm and the old patterns are deleted from memory. While the number of patterns found by FP-Stream increases per unit time, there is no such relation for Weighted FP-Stream because of the weight parameter. This is an important property when the data size is big and the user is interested only in the fresh patterns.

VII. CONCLUSION

In this article, the proposed Weighted FP-Stream algorithm aims to eliminate outdated data and give priority to fresh data. For this purpose, a weight parameter is introduced to the algorithm. The weight parameter is used in the tail pruning process to keep the tree structure fresh. Weight values between 0 and 1 are used to indicate the recentness of transactions. To show the effectiveness of Weighted FP-Stream, the algorithm is compared to the conventional FP-Stream using the same parameters. The FP-Stream algorithm yields 710 frequent patterns and our Weighted FP-Stream algorithm yields 58. With the Weighted FP-Stream algorithm, fewer and more current frequent patterns are obtained, as expected. 53 of the patterns are common to both algorithms' results. The proposed algorithm can keep more recent frequent patterns with higher performance and less memory usage, which is important when the data size is big and resources are limited.

ACKNOWLEDGMENT

This work was supported by the Scientific and Technological Research Council of Turkey under Grant 118E212.

REFERENCES

[1] Kök, İ., Şimşek, M. U., & Özdemir, S. (2017, December). A deep learning model for air quality prediction in smart cities. In 2017 IEEE International Conference on Big Data (Big Data) (pp. 1983-1990). IEEE.
[2] Ezeife, C. I., & Su, Y. (2002, May). Mining incremental association rules with generalized FP-tree. In Conference of the Canadian Society for Computational Studies of Intelligence (pp. 147-160). Springer, Berlin, Heidelberg.
[3] Zhang, W., Liao, H., & Zhao, N. (2008, December). Research on the FP growth algorithm about association rule mining. In 2008 International Seminar on Business and Information Management (Vol. 1, pp. 315-318). IEEE.
[4] Tao, F., Murtagh, F., & Farid, M. (2003, August). Weighted association rule mining using weighted support and significance framework. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 661-666).
[5] Yun, U., Lee, G., & Ryu, K. H. (2014). Mining maximal frequent patterns by considering weight conditions over data streams. Knowledge-Based Systems, 55, 49-65.
[6] Ahmed, C. F., Tanbeer, S. K., Jeong, B. S., Lee, Y. K., & Choi, H. J. (2012). Single-pass incremental and interactive mining for weighted frequent patterns. Expert Systems with Applications, 39(9), 7976-7994.
[7] Ahmed, C. F., Tanbeer, S. K., & Jeong, B. S. (2009, June). Efficient mining of weighted frequent patterns over data streams. In 2009 11th IEEE International Conference on High Performance Computing and Communications (pp. 400-406). IEEE.
[8] Wang, J., & Zeng, Y. (2011). DSWFP: Efficient mining of weighted frequent pattern over data streams. In 2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD) (pp. 942-946). IEEE. doi: 10.1109/FSKD.2011.6019763.
[9] Yun, U., & Ryu, K. H. (2011). Approximate weighted frequent pattern mining with/without noisy environments. Knowledge-Based Systems, 24(1), 73-82.
[10] Gouider, M. S., & Zarrouk, M. (2012). Frequent Patterns mining in time-sensitive Data Stream. International Journal of Computer Science Issues (IJCSI), 9(4), 117.
[11] Kim, Y. H., Kim, W. Y., & Kim, U. M. (2010). Mining frequent itemsets with normalized weight in continuous data streams. Journal of Information Processing Systems, 6(1), 79-90.
[12] Ezeife, C. I., & Monwar, M. (2007). SSM: A frequent sequential data stream patterns miner. In 2007 IEEE Symposium on Computational Intelligence and Data Mining (pp. 120-126). IEEE. doi: 10.1109/CIDM.2007.368862.
[13] Giannella, C., Han, J., Pei, J., Yan, X., & Yu, P. S. (2003). Mining frequent patterns in data streams at multiple time granularities. Next generation data mining, 212, 191-212.
[14] Internet: T-61.6020: Popular Algorithms in Data Mining and Machine Learning, http://www.cis.hut.fi/Opinnot/T-61.6020/2008/fptree.pdf (23.05.2020)
[15] Qiu, Y., Lan, Y. J., & Xie, Q. S. (2004, August). An improved algorithm of mining from FP-tree. In Proceedings of 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 04EX826) (Vol. 3, pp. 1665-1670). IEEE.