Improvement Algorithms of Perceptually Important Point Identification
Abstract—In the field of time series data mining, the Perceptually Important Point (PIP) identification process was proposed for financial time series pattern matching and was later found suitable for time series dimensionality reduction and representation. Its strength lies in preserving the overall shape of a time series by identifying its salient points. With the rise of Big Data, time series data contributes a major proportion of the total, especially the data generated by sensors in the Internet of Things (IoT) environment. Given the nature of PIP identification and its successful applications in past years, it is worthwhile to further explore the opportunity of applying PIP to time series "Big Data". However, the performance of PIP identification is always considered a limitation when dealing with "Big" time series data. In this paper, two improvement algorithms, namely the Caching and Splitting algorithms, are proposed. A significant improvement in speed is obtained by these improvement algorithms.

Keywords- Perceptually Important Point identification; PIP; time series data mining; performance analysis

I. INTRODUCTION

In recent years, the rapid development of cloud computing, mobile technologies and the Internet of Things (IoT) has led to an explosive growth of data. There is no doubt that the Big Data era is already here [1]. Among the diversified data types (Variety), temporal data, in particular time series data, is certainly one of the major contributors to this "Big" data [2]. A time series is a collection of observations made chronologically. The nature of time series data includes large data size (Volume), high dimensionality (Variability) and continuous updating (Velocity). Big data analytics can be considered an extension of data mining and machine learning research. Reviews of data mining for IoT can be found in [3], machine learning for big data in [4] and time series data mining in [5].

A time series is constructed from a sequence of data points, and the amplitude of each data point has a different extent of influence on the shape of the time series. That is, each data point has its own importance to the time series, which can also be regarded as the significance of the data point. One data point may contribute to the overall shape of the time series while another may have only a little influence on it, or may even be discarded. For example, frequently appearing technical time series patterns in the financial domain are typically characterized by a few salient points. A head and shoulders (H&S) pattern, for instance, consists of a head point, two shoulder points and a pair of neck points. These points are perceptually important in the human visual identification process and are therefore more important than the other data points in the time series. PIP identification was first introduced by Chung et al. [6] and is suited for matching technical (analysis) patterns in financial applications. In the past years, different applications of PIP have been proposed. For example, [7] adopts PIPs to represent the movement shape of a time series and demonstrates that this representation is better than Symbolic Aggregate approXimation (SAX) [8], which smooths the movement of the time series; [9] adopts PIPs for data compression and [10] adopts PIPs to model PM2.5 concentration data.

Although PIP identification is found suitable for many time series data mining tasks, one of the major considerations when adopting the PIP identification process is its complexity. [7] shows that PIP takes a longer run time than SAX [9] because the PIP identification process requires more scans of the time series than SAX. Todorov et al. specify that one of the bottlenecks in their study is the PIP identification process when analyzing a large amount of data [11]. Son and Anh also comment that their proposed variation of PIP [12], i.e. IPIP, suffers from a computational complexity higher than that of Piecewise Aggregate Approximation (PAA). Phetking and Selamat suggest that reducing the computational expense of the PIP identification process would allow users to assess the predictive performance of their proposed approach [13].

The worst-case complexity of the PIP identification process is O(n²) [11][14][15], as it needs to identify the point that contributes most to the overall shape of the time series in each iteration. Therefore, the original algorithm needs to evaluate every point in the time series in every iteration. In this paper, two algorithms are proposed to reduce the calculation of the PIP identification process in order to increase its speed. The conventional PIP identification algorithm is reviewed in the next section. The proposed improvement algorithms are introduced in Section III. Their performance is evaluated in Section IV. Section V offers our conclusion and proposed future work.
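As a baseline for the two improvement algorithms, the conventional process just described, which rescans every non-PIP point in every iteration, can be sketched as follows. This is an illustrative sketch only: the function and helper names are ours, and we assume the perpendicular distance to the line joining the two adjacent PIPs as the importance measure (the measure the paper later calls MAX VALUE).

```cpp
#include <cmath>
#include <iterator>
#include <set>
#include <vector>

// Perpendicular distance from point (i, y[i]) to the straight line
// joining the two adjacent PIPs (a, y[a]) and (b, y[b]).
double perp_dist(const std::vector<double>& y, int a, int b, int i) {
    double dx = b - a, dy = y[b] - y[a];
    return std::fabs(dy * (i - a) - dx * (y[i] - y[a])) /
           std::sqrt(dx * dx + dy * dy);
}

// Naive PIP identification: one full O(n) rescan per extracted point,
// hence O(n^2) overall. Returns the indices of the first k PIPs (k >= 2).
std::vector<int> pip_naive(const std::vector<double>& y, int k) {
    std::set<int> pips = {0, static_cast<int>(y.size()) - 1};
    while (static_cast<int>(pips.size()) < k) {
        double max_d = -1.0;
        int best = -1;
        // For every non-PIP point, measure its distance to the line
        // through its two adjacent PIPs; keep the overall maximum.
        for (auto it = pips.begin(); std::next(it) != pips.end(); ++it) {
            int a = *it, b = *std::next(it);
            for (int i = a + 1; i < b; ++i) {
                double d = perp_dist(y, a, b, i);
                if (d > max_d) { max_d = d; best = i; }
            }
        }
        pips.insert(best);
    }
    return {pips.begin(), pips.end()};
}
```

On a series such as {0, 1, 0, 5, 0, 1, 0}, the first extracted PIP after the two endpoints is the peak at index 3, illustrating how the most shape-defining point is taken first.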
performed in order to reduce the number of points analyzed in each step, in a way that does not harm the quality of the analysis but makes the delay insignificant compared to the transmission delay.

Function PIP_Identification (P)
Input: sequence P[1..m]
Output: PIP L[1..m]
Begin
  Set L[1] = P[1], L[2] = P[m]
  Int CachePreDonePoint, CacheNextDonePoint
  Float CacheResult[1..m]
  Set all CacheResult[1..m] = -1
  Repeat until L[1..m] all filled
  Begin
    MaxDist = -1
    For each i in P and not in L
    Begin
      If CacheResult[i] = -1 then
      Begin
        Calculate the distance Dist of i to the adjacent points in L
        CacheResult[i] = Dist
      End
      If CacheResult[i] > MaxDist then
      Begin
        MaxDist = CacheResult[i]
        j = i
        CachePreDonePoint = the smaller of the adjacent points of i in L
        CacheNextDonePoint = the larger of the adjacent points of i in L
      End
    End
    Append P[j] TO L
    Set CacheResult[i] = -1 for all i laid between
      CachePreDonePoint and CacheNextDonePoint
  End
  Return L
End

Figure 3. Pseudo code of the Caching algorithm of the PIP identification process.

Here, we also propose to cache the previous calculations in the PIP identification process (Fig. 3). We call this the Caching algorithm. In each iteration, the PIP identified in the previous iteration only changes the shape of one segment instead of the whole time series, so the distances calculated in the previous iteration for the many data points that are not located in this segment remain unchanged. By caching these results, the whole process is expected to speed up. The algorithm works in the following steps:
• Set up an array (CacheResult) to store the calculated distance of each point; CachePreDonePoint and CacheNextDonePoint record the adjacent PIPs of the newly identified PIP.
• Take the CachePreDonePoint and CacheNextDonePoint between which the point with the maximum CacheResult (MaxDist) lies; all points laid between CachePreDonePoint and CacheNextDonePoint have their CacheResult reset to -1.
• Only the points with CacheResult = -1 need to be re-calculated.

B. Splitting Algorithm

We further propose another algorithm to speed up the performance, which does not cache the distances but caches the maximum distance and maximum point between two adjacent points in L. We call it the Splitting algorithm.

Function PIP_Identification (P)
Input: sequence P[1..m]
Output: PIP L[1..m]
Begin
  Set L[1] = P[1], L[2] = P[m]
  DataStruct TrackList
  Begin
    Int Begin
    Int End
    Int MaxPoint
    Float MaxValue
    TrackList *PrePointer
    TrackList *NextPointer
  End
  TrackList *ListHead   // the head of the sorted list
  TrackList *MaxNode
  ListHead->Begin = 1
  ListHead->End = m
  Calculate the MaxValue and MaxPoint in ListHead with
    ListHead->Begin and ListHead->End as the adjacent points in PIP
  Repeat until L[1..m] all filled
  Begin
    MaxNode = ListHead
    ListHead = ListHead->NextPointer
    Append MaxNode->MaxPoint TO L
    Split MaxNode into 2 nodes HeadNode and TailNode
    Begin
      HeadNode->Begin = MaxNode->Begin
      HeadNode->End = MaxNode->MaxPoint
      TailNode->Begin = MaxNode->MaxPoint
      TailNode->End = MaxNode->End
    End
    Calculate the MaxValue and MaxPoint in HeadNode with
      HeadNode->Begin and HeadNode->End as the adjacent points in PIP
    Calculate the MaxValue and MaxPoint in TailNode with
      TailNode->Begin and TailNode->End as the adjacent points in PIP
    Insert HeadNode and TailNode into the sorted list beginning with ListHead
  End
  Return L
End

Figure 4. Pseudo code of the Splitting algorithm of the PIP identification process.

First, a sorted segment list is maintained. Each segment in the sorted list has a BEGIN POINT, an END POINT, a MAX POINT and a MAX VALUE (the maximum perpendicular distance). The BEGIN POINT and END POINT are already in the PIP list, but the points lying between them are not. The MAX POINT is the point lying between BEGIN and END with the MAX VALUE. The segments are sorted by MAX VALUE in descending order.

The algorithm works in the following steps when assigning a point to the PIP list:
• Remove the HEAD segment from the sorted list.
• Append the MAX POINT and MAX VALUE of the HEAD segment to the PIP list.
• Split the HEAD segment into two segments.
• Set one new segment's BEGIN POINT to the BEGIN POINT of the HEAD segment and its END POINT to the MAX POINT.
• Set the other segment's BEGIN POINT to the MAX POINT of the HEAD segment and its END POINT to the END POINT of the HEAD segment.
• Calculate the MAX VALUE and MAX POINT of the two new segments with respect to their BEGIN and END POINTS.
• Insert the two split segments into the sorted list at the right positions so that the list remains in descending order.
• Repeat the above steps until all the points are filled in the PIP list.
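The steps above can be sketched with a max-ordered container standing in for the hand-maintained sorted list; here `std::multimap` keyed by MAX VALUE with `std::greater` keeps segments in descending order, so the HEAD segment is simply `begin()`. This is our illustrative reading of Fig. 4 under that assumption, not the authors' implementation, and all names are ours.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <map>
#include <utility>
#include <vector>

struct Segment { int begin, end, max_point; };

// Assumed perpendicular-distance measure, as in the PIP definition.
static double pd(const std::vector<double>& y, int a, int b, int i) {
    double dx = b - a, dy = y[b] - y[a];
    return std::fabs(dy * (i - a) - dx * (y[i] - y[a])) /
           std::sqrt(dx * dx + dy * dy);
}

// Scan a segment once for its MAX VALUE and MAX POINT.
static std::pair<double, int> scan(const std::vector<double>& y, int a, int b) {
    double best = -1.0; int arg = -1;
    for (int i = a + 1; i < b; ++i) {
        double d = pd(y, a, b, i);
        if (d > best) { best = d; arg = i; }
    }
    return {best, arg};
}

// Splitting algorithm: each extracted PIP splits one segment into two,
// and a point's distance is recomputed only when its segment changes.
std::vector<int> pip_splitting(const std::vector<double>& y, int k) {
    int m = static_cast<int>(y.size());
    std::vector<int> L = {0, m - 1};
    // Sorted "segment list": key = MAX VALUE, in descending order.
    std::multimap<double, Segment, std::greater<double>> list;
    auto [v, p] = scan(y, 0, m - 1);
    if (p >= 0) list.insert({v, {0, m - 1, p}});
    while (static_cast<int>(L.size()) < k && !list.empty()) {
        auto head = list.begin();            // segment with the largest MAX VALUE
        Segment s = head->second;
        list.erase(head);                    // remove HEAD from the sorted list
        L.push_back(s.max_point);            // append its MAX POINT to the PIP list
        // Split into [begin, max_point] and [max_point, end]; rescan each.
        for (auto [a, b] : {std::pair{s.begin, s.max_point},
                            std::pair{s.max_point, s.end}}) {
            auto [val, arg] = scan(y, a, b);
            if (arg >= 0) list.insert({val, {a, b, arg}});
        }
    }
    std::sort(L.begin(), L.end());
    return L;
}
```

Each point's distance is thus computed only when its segment is created, and extracting one PIP costs a single list removal plus two segment scans and two ordered insertions.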
Figure 6. Processing time of the three PIP identification algorithms in the engineering domain (temperature detected by a valve).

This algorithm does not repeat any distance calculation for a point unless its adjacent points in the PIP list change. It does not need to consider every MAX POINT in each step, but it does need to insert the two new segments into the sorted list. The PIP identification process is expected to speed up by removing most of the repetitive calculation. Fig. 4 shows the detailed procedure.
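For comparison, the Caching algorithm of Fig. 3 admits a similar sketch: cached distances survive across iterations, and only the points between the adjacent PIPs of the newly added point are invalidated. Again, the names and the perpendicular-distance helper are our own illustration rather than the authors' code.

```cpp
#include <cmath>
#include <iterator>
#include <set>
#include <vector>

// Assumed perpendicular-distance helper, as in the PIP definition.
static double pdist(const std::vector<double>& y, int a, int b, int i) {
    double dx = b - a, dy = y[b] - y[a];
    return std::fabs(dy * (i - a) - dx * (y[i] - y[a])) /
           std::sqrt(dx * dx + dy * dy);
}

// Caching algorithm: distances are kept between iterations; only points
// inside the segment whose shape just changed are re-evaluated.
std::vector<int> pip_caching(const std::vector<double>& y, int k) {
    int m = static_cast<int>(y.size());
    std::set<int> pips = {0, m - 1};
    std::vector<double> cache(m, -1.0);       // CacheResult, -1 = stale
    while (static_cast<int>(pips.size()) < k) {
        double max_dist = -1.0;
        int j = -1, pre = 0, next = m - 1;    // Cache{Pre,Next}DonePoint
        for (auto it = pips.begin(); std::next(it) != pips.end(); ++it) {
            int a = *it, b = *std::next(it);
            for (int i = a + 1; i < b; ++i) {
                if (cache[i] < 0.0)           // stale: recompute once
                    cache[i] = pdist(y, a, b, i);
                if (cache[i] > max_dist) {
                    max_dist = cache[i];
                    j = i; pre = a; next = b;
                }
            }
        }
        pips.insert(j);
        // Only the segment [pre, next] changed shape: invalidate it.
        for (int i = pre + 1; i < next; ++i) cache[i] = -1.0;
    }
    return {pips.begin(), pips.end()};
}
```

The scan over all non-PIP points remains, which is consistent with the measured behavior reported below: Caching saves the distance arithmetic but not the full pass, while Splitting avoids most of the pass as well.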
IV. EXPERIMENTAL RESULTS
In this section, we evaluate the two proposed improvement algorithms along different dimensions, including their behavior on different datasets and the effect of increasing the length of the time series. The experiments were implemented in the Visual C++ programming language and performed on a standalone PC with the following configuration: Microsoft Windows 10 Professional edition, Intel Core i5-4440 CPU @ 3.10 GHz, 8 GB RAM.

Figure 7. Processing time of the three PIP identification algorithms in the medical domain (EEG time series).
A. Experiments on Different Application Domains

To evaluate the performance of the two improvement algorithms in different application domains, 720 experiments were conducted. The datasets for the simulation tests cover three application domains: financial (stock data), engineering (temperature detected by a valve) and medical (electroencephalography, EEG). Each domain has 10 time series and each time series has 4,000 data points. The average processing time over the 10 time series is presented.

Figs. 5-7 show the time performance on the three application domains. The results are very close to each other, demonstrating that the PIP identification process works well for time series from different application domains.

The Naïve algorithm performs the worst and serves as the baseline for comparing the two improvement algorithms. The Caching algorithm consistently ran in around 25% of the Naïve algorithm's time, and the Splitting algorithm in 3~4% of it.
B. Experiment on Big Time Series Data

To find out whether the improved performance can be sustained as the time series data grows, an EEG time series with 200,000 data points was adopted in this test. Fig. 8 shows the processing time of the three algorithms, and Fig. 9 shows the performance of the two improvement algorithms (Caching and Splitting) in terms of the improvement ratio compared to the Naïve algorithm. The Caching algorithm keeps to 25% of the Naïve algorithm even as the length of the time series increases. On the other hand, the improvement of the Splitting algorithm drops gradually as the number of data points increases. This is because the Splitting algorithm needs to maintain the sorted list, and every step needs to insert two new segments into it; if the sorted list is very long, the processing time may increase. However, the Splitting algorithm is still the fastest of the three.

Figure 5. Processing time of the three PIP identification algorithms in the financial domain (stock time series).

Figure 8. Processing time of the three PIP identification algorithms in a long time series (EEG time series).

Figure 9. Performance of the two improved algorithms (Caching and Splitting) in terms of the improvement ratio compared to the Naïve PIP identification algorithm (EEG time series).

V. CONCLUSION

Two improvement algorithms, the Caching and Splitting algorithms, are proposed in this paper to improve the speed of the PIP identification process. Sound results were obtained in the preliminary experiments. To cater for the continuous updating of time series data, a Specialized Binary (SB) Tree structure was proposed in [18] to minimize the re-calculation issue when dealing with streaming time series data. As building the binary tree structure during the PIP identification process can be considered a parallel computing problem, we propose to re-design and implement it for a distributed environment to further improve the performance when dealing with Big time series data. In addition, parallel processing can also be adopted to maintain the sorted list of the Splitting algorithm, addressing the problem mentioned in the last section. These will be our future research directions on the performance issue of the PIP identification process.

REFERENCES

[1] Jin, X., Wah, B.W., Cheng, X. and Wang, Y. Significance and Challenges of Big Data Research. Big Data Research, 2(2), 2015, 59-64.
[2] Yu, S., Gu, L. and Dai, W. Fast Event Detection on Big Time Series. IEEE/CIC International Conference on Communications in China, Symposium on Social Networks and Big Data, 2014.
[3] Chen, F., Deng, P., Wan, J., Zhang, D., Vasilakos, A.V. and Rong, X. Data Mining for the Internet of Things: Literature Review and Challenges. International Journal of Distributed Sensor Networks, 11(8), 2015.
[4] Al-Jarrah, O.Y., Yoo, P.D., Muhaidat, S., Karagiannidis, G.K. and Taha, K. Efficient Machine Learning for Big Data: A Review. Big Data Research, 2(3), 2015, 87-93.
[5] Fu, T.C. A Review on Time Series Data Mining. Engineering Applications of Artificial Intelligence, 24(1), 2011.
[6] Chung, F.L., Fu, T.C., Luk, R. and Ng, V. Flexible Time Series Pattern Matching Based on Perceptually Important Points. International Joint Conference on Artificial Intelligence Workshop on Learning from Temporal and Spatial Data, 2001, 1-7.
[7] Park, S.H., Chun, S.J., Lee, J.H. and Song, J.W. Representation and Clustering of Time Series by Means of Segmentation based on PIPs Detection. In Proceedings of the 2nd International Conference on Computer and Automation Engineering, 2010, 17-21.
[8] Lin, J., Keogh, E., Lonardi, S. and Chiu, B. A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. In the 8th ACM SIGMOD International Conference on Management of Data Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003, 2-11.
[9] Feller, S., Todorov, Y., Pauli, D. and Beck, F. Optimized Strategies for Archiving Multi-dimensional Process Data: Building a Fault-diagnosis Database. In Proceedings of the 8th International Conference on Informatics in Control, Automation and Robotics, 1, 2011, 388-393.
[10] Song, C., Pei, T. and Yao, L. Analysis of the Characteristics and Evolution Modes of PM2.5 Pollution Episodes in Beijing, China During 2013. International Journal of Environmental Research and Public Health, 12, 2015, 1099-1111.
[11] Todorov, Y., Feller, S. and Chevalier, R. Making the Investigation of Huge Data Archives Possible in an Industrial Context - An Intuitive Way of Finding Non-typical Patterns in a Time Series Haystack. In Proceedings of the 12th International Conference on Informatics in Control, Automation and Robotics, 2015, 569-581.
[12] Son, N.T. and Anh, D.T. An Improvement of PIP for Time Series Dimensionality Reduction and Its Index Structure. In Proceedings of the 2nd International Conference on Knowledge and Systems Engineering, 2010, 47-54.
[13] Phetking, C., Sap, M.N.M. and Selamat, A. A Multiresolution Important Point Retrieval Method for Financial Time Series Representation. In Proceedings of the International Conference on Computer and Communication Engineering, 2008, 510-515.
[14] Jugel, U., Jerzak, Z. and Markl, V. M4: A Visualization-Oriented Time Series Data Aggregation. In Proceedings of the Very Large Data Bases Endowment, 7(10), 2014, 797-808.
[15] Jugel, U., Jerzak, Z., Hackenbroich, G. and Markl, V. VDDA: Automatic Visualization-driven Data Aggregation in Relational Databases. The International Journal on Very Large Data Bases, 25(1), 2016, 53-77.
[16] Fu, T.C., Chung, F.L., Luk, R. and Ng, C.M. Representing Financial Time Series Based on Data Point Importance. Engineering Applications of Artificial Intelligence, 21(2), 2008, 277-300.
[17] Papageorgiou, A., Cheng, B. and Kovacs, E. Reconstructability-aware Filtering and Forwarding of Time Series Data in Internet-of-Things Architectures. In Proceedings of the IEEE International Congress on Big Data, 2015, 576-583.
[18] Fu, T.C., Chung, F.L., Luk, R. and Ng, C.M. A Specialized Binary Tree for Financial Time Series Representation. The 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Workshop on Temporal Data Mining, 2004, 96-103.