
Journal of Electrical Systems and Information Technology

GMA: Gap Imputing Algorithm for Time Series Missing Values


--Manuscript Draft--

Manuscript Number: JESI-D-23-00029R1

Full Title: GMA: Gap Imputing Algorithm for Time Series Missing Values

Article Type: Research

Funding Information: Tanta University: Not applicable (Abdel Hamed Khattab, Ph.D.)
Faculty of Engineering, Tanta University (EG): Not applicable (Abdel Hamed Khattab, Ph.D.)

Abstract: Data collected from the environment in computer engineering may include missing
values due to various factors, such as lost readings from sensors caused by
communication errors or power outages. Missing data can result in inaccurate analysis
or even false alarms. It is therefore essential to identify missing values and correct
them as accurately as possible to ensure the integrity of the analysis and the
effectiveness of any decision-making based on the data. This paper presents a new
approach, the Gap Imputing Algorithm (GMA), for imputing missing values in time
series data. The Gap Imputing Algorithm (GMA) identifies sequences of missing values
and determines the periodic time of the time series. Then, it searches for the most
similar subsequence from historical data. Unlike previous work, GMA supports any
type of time series and is resilient to consecutive missing values with different gap
distances. The experimental findings, which were based on both real-world and
benchmark datasets, demonstrate that the GMA framework proposed in this study
outperforms other methods in terms of accuracy. Specifically, our proposed method
achieves an accuracy score that is 5 to 20% higher than that of other methods.
Furthermore, the GMA framework is well-suited to handling missing gaps with larger
distances, and it produces more accurate imputations, particularly for datasets with
strong periodic patterns.

Corresponding Author: Abdel Hamed Khattab, ph.D


Tanta University Faculty of Engineering
EGYPT

Corresponding Author Secondary Information:

Corresponding Author's Institution: Tanta University Faculty of Engineering

Corresponding Author's Secondary Institution:

First Author: Abdel Hamed Khattab, ph.D

First Author Secondary Information:

Order of Authors: Abdel Hamed Khattab, ph.D

Nada Mohamed Elshennawy

Mahmoud Mohamed Fahmy

Order of Authors Secondary Information:

Response to Reviewers: All points are covered in the attached response file. The points of Reviewer 1 are written in red; the points of Reviewer 2 are written in green.

Additional Information:

Question Response

Title Page

GMA: Gap Imputing Algorithm for Time Series Missing Values
Abd Alhamid Khattab *, Nada Mohamed Elshennawy, and Mahmoud Fahmy
Computer and Automatic Control Dep., Faculty of Engineering, Tanta University, Tanta, Egypt
E-mail(s): [email protected]; [email protected]; [email protected]

*Corresponding author(s). E-mail(s): [email protected].


Declarations
Availability of data and material: The developed package and the datasets analyzed during the current study are available in the GitHub repository https://github.com/Eng-Khattab/missval.
Competing interests: The authors declare that they have no competing interests.
Funding: This research did not receive any grant from funding agencies in the public, commercial, or not-for-profit sectors.
Authors' contributions: All authors read and approved the final manuscript.
▪ Abdelhamid Khattab conducted the experiments and wrote the Python code.
▪ Nada M. Elshennawy conceived the study and wrote the algorithm.
▪ Mahmoud M. Fahmy participated in the study's design and coordination and helped draft the manuscript.
Acknowledgements: We thank the Department of Computer and Control Engineering at Tanta University for their expertise and valuable advice.
Cover Letter
Our manuscript should be considered for publication in the journal because it presents a new
and innovative approach, the Gap Imputing Algorithm (GMA), for imputing missing values
in time series data. The GMA method has several unique advantages over existing
approaches, including its ability to handle any type of time series data and its resilience to
consecutively missing values with different gap distances.
Our experimental findings, which are based on both real-world and benchmark datasets,
demonstrate that the GMA framework significantly outperforms other methods in terms of
accuracy. Our proposed method achieves an accuracy score higher than that of other
methods, indicating its superior performance in imputing missing values in time series data.
Additionally, the GMA framework is particularly well-suited to handling missing gaps with
larger distances, making it a valuable tool for a wide range of applications. Furthermore, our
manuscript adheres to high ethical standards and guidelines set forth by the journal. We have
thoroughly reviewed and revised the manuscript to ensure that it meets the journal's
standards for quality and relevance. Overall, we believe that our manuscript presents an
important contribution to the field of time series analysis and should be considered for
publication in the journal.
All authors have agreed to the submission, and the manuscript is not currently under consideration by any other journal.
Blinded Manuscript

GMA: Gap Imputing Algorithm for Time Series Missing Values
Abstract
Data collected from the environment in computer engineering may include missing values due to various factors, such as lost readings from sensors caused by communication errors or power outages. Missing data can result in inaccurate analysis or even false alarms. It is therefore essential to identify missing values and correct them as accurately as possible to ensure the integrity of the analysis and the effectiveness of any decision-making based on the data. This paper presents a new approach, the Gap Imputing Algorithm (GMA), for imputing missing values in time series data. The Gap Imputing Algorithm (GMA) identifies sequences of missing values and determines the periodic time of the time series. Then, it searches for the most similar subsequence in historical data. Unlike previous work, GMA supports any type of time series and is resilient to consecutive missing values with different gap distances. The experimental findings, which are based on both real-world and benchmark datasets, demonstrate that the GMA framework proposed in this study outperforms other methods in terms of accuracy. Specifically, our proposed method achieves an accuracy score that is 5 to 20% higher than that of other methods. Furthermore, the GMA framework is well-suited to handling missing gaps with larger distances, and it produces more accurate imputations, particularly for datasets with strong periodic patterns.

Keywords: Time series, Incomplete subsequence, Missing data imputation
I. INTRODUCTION

Sensing technologies are now widely employed in computer engineering for persistent and collaborative monitoring of the physical environment. These sensors generate large geo-tagged time series data, which can be used to improve human understanding of various ambient conditions, including water level, stream flow observation, and meteorological conditions (YU ZHENG, 2014). Missing values can occur for a variety of reasons, including sensor failures, transmission errors, power outages, and other technical issues. When dealing with time series data, missing readings can be especially problematic because they can disrupt real-time monitoring and compromise the accuracy of further data analysis, such as prediction and inference (Xiuwen Yi, 2015). Leaving missing values untreated can cause incorrect or ill-defined results (José Cambronero, 2017). A number of techniques can be used to address missing values in time series data. One approach is imputation, which estimates missing values based on other data points in the time series. Another approach is interpolation, which estimates missing values by interpolating between nearby data points. Both techniques can be effective, but it is important to choose the right approach based on the specific characteristics of the time series data and the goals of the analysis (Liao, Bak-Jensen, Pillai, Yang, & Wang, 2021). It is also important to understand the pattern of missing values, as this can affect the analysis of the data. Two common types of missing data patterns are (Little, 1992):
1) Missing at Random (MAR): the probability of a value being missing depends on other variables in the dataset, but not on the missing value itself. That is, the missing value is related to the observed values in the dataset.
2) Missing Completely at Random (MCAR): the probability of a value being missing is unrelated to any other variables in the dataset, including the observed values. That is, the missing values are completely random and not related to the values in the dataset.
There is also a third type of missing data pattern, called Missing Not at Random (MNAR), where the probability of a value being missing depends on the missing value itself; this pattern is less commonly encountered in practice. Understanding the pattern of missing values is important because it can affect the analysis of the data, and different techniques are used to handle different types of missing data patterns.
There are many methods for handling missing data. Missing values can be handled through either single imputation or multiple imputation. Single imputation methods replace each missing data point with a single value, most commonly the mean, median, or mode; multiple imputation methods create several candidate values for each missing data point (Enders, 2010). Each of these methods has its advantages and disadvantages, and the choice of technique depends on the specific characteristics of the data and the research question being investigated.
Effective approaches for predicting missing values from accessible data are needed. Algorithms for recovering missing data blocks can use various techniques, such as matrix completion principles or pattern matching. Matrix completion-based algorithms treat a set of series as a matrix and apply methods that aim to complete the missing entries. Pattern-matching algorithms, on the other hand, use the observed sensor values to replace the missing data blocks. With these methods, algorithms can reconstruct missing data blocks from the available information, allowing for more complete and accurate data analysis (Mourad Khayati, 2020).
A commonly used approach for replacing missing data gaps is to use the values from the most similar subsequence. This technique falls under the category of pattern-matching algorithms. The Dynamic Time Warping (DTW) algorithm is a highly effective pattern-matching technique that is extensively employed across numerous problem domains. Nonetheless, a drawback of DTW is that it tends to be computationally expensive and time-consuming, which can affect the algorithm's overall efficiency and performance. Some researchers have addressed this cost by using shape-feature extraction algorithms that extract sequence features in sliding windows; DDTW is computed only if the correlation between the shape-features of the window and the subsequences before the missing gap is sufficiently high (Irfan Pratama, 2016). Results from this method have shown better outcomes for time series with strong seasonality and high correlation. It is worth noting that while DTW can identify the most similar patterns with similar dynamics, it can warp the shape by expanding or compressing it, which can cause the position of the missing gaps to misalign with the position of the original pattern. To overcome this limitation, we employ the inverse Fourier transform to predict the length of the seasonal period beforehand. This allows us to understand the dataset's characteristics and identify the length of the missing gap, particularly whether it represents one or more seasonal periods, so that we can apply a suitable algorithm to deal with it.
The objective of this study is to create novel methods for accurately and efficiently imputing missing or anomalous data in computer systems. Our focus is on developing techniques to impute missing values in seasonal time series. We observe that seasonal patterns tend to exhibit similarities over time. Therefore, we propose to leverage this phenomenon by replicating the pattern from the most comparable subsequence in the historical data. Through this approach, we have developed an effective method for imputing missing data, which relies on simple operations for pattern searching and matching. The paper makes the following technical contributions:
 We present and formalize GMA, the Gap Imputing Algorithm, to impute missing values in time series, covering stationary, non-linear, and seasonal time series.
 We use the inverse Fourier transform to determine the periodic length of each time series. This helps us gain a better understanding of the dataset's characteristics and identify suitable algorithms for handling the missing gaps, which we then apply.
 To identify similar historical situations, we rely on techniques such as the Euclidean distance, Spearman's rank-order correlation, and Kendall's tau, which enable us to measure the similarity between patterns accurately. By leveraging these techniques, we can improve the accuracy and efficiency of our data analysis and make more informed decisions based on the insights gained.
 We empirically show on real-world and synthetic datasets that GMA outperforms state-of-the-art solutions and is capable of effectively imputing values in time series with extended blocks of consecutively missing values.
The remainder of the paper is organized as follows: Section 2 describes the related work. Section 3 gives the model overview. Section 4 details the proposed method. Section 5 presents the datasets, comparative methods, experiment setting, and evaluation. Results are displayed in Section 6. Finally, the paper is concluded in Section 7.
II. RELATED WORK

Missing data imputation, as summarized in Table 1, involves two primary implementation techniques: univariate and multivariate. The univariate approach uses a single variable to estimate the missing values, whereas multivariate techniques analyze the relationships between multiple variables to estimate missing data (Thakolpat Khampuengson, 2022). Multivariate methods can be further classified into three categories: matrix completion principles, pattern matching (Mourad Khayati, 2020), and machine learning imputation. To provide a more comprehensive overview of the related work in missing data imputation, we have classified the existing algorithms into four categories, as follows:
TABLE 1: RELATED WORK SUMMARY

Univariate methods
(PHAN, 2020). Datasets: four time series: CO2 concentrations, Phu Lien air temperature, NNGC1 F1 V1 003 (NNGC), and Ba Tri temperature. Description: the data are transformed into a multivariate time series, machine learning is used for forward and backward forecasting to estimate the missing values, and the gaps are imputed with the average of both forecast sets.
(Paternoster, 1998). Datasets: scientific research data on factors causing crime for males and females. Description: the "most frequent value" imputation method replaces missing data with the mode of the variable. It is typically used for categorical variables or numerical variables that have a limited range of values.
(Kulanuwat L, 2021). Datasets: water level data from telemetry stations across Thailand. Description: the study compared three methods for imputing missing data: mean imputation, regression imputation, and multiple imputation. Multiple imputation was the most effective method and produced less biased estimates than the other two methods.

Pattern matching
(Yi, 2016). Datasets: air quality and meteorological data in Beijing, China. Description: the proposed ST-MVL method fills missing readings in geo-sensory time series data by considering temporal and spatial correlations. It uses empirical statistical models and data-driven algorithms to handle different types of missing data cases.
(Wellenzohn, 2017). Datasets: two datasets: SBR meteorological time series in South Tyrol and a flights dataset. Description: a clustering algorithm groups similar time series, and the resulting groups are used to impute missing values. The approach is specifically designed for continuous streams of time series data.
(Zhang, 2021). Datasets: data collected from an in-situ monitoring station in the Mulgrave-Russell catchment, Australia. Description: a Seq2Seq model imputes missing values in time series data using a dual-head architecture with an encoder and two decoders, one for each direction of the time series. Seq2Seq models are recurrent neural networks (RNNs) that can be applied to sequence prediction and generation.

Matrix completion principles
(Shu, 2014). Datasets: face images under varying illuminations, 168x192 resolution, 55 frames. Description: a technique for robust low-rank matrix recovery that handles data corruption and uses orthonormal subspace learning to estimate a low-rank matrix from incomplete or corrupted data. It outperformed existing methods in experiments and can be applied to image processing, signal processing, and recommendation systems.
(Mazumder, 2010). Datasets: Netflix data: 17,770 movies rated by 480,189 customers. Description: the method uses spectral regularization to promote low-rank solutions and impose structural constraints on the estimated matrix, which is effective for ill-posed problems and improves matrix completion. It can also incorporate side information, such as similarity between items or users, to further enhance the estimation.
(Khayati, 2015). Datasets: hydrological time series with tuples of timestamp and observation value. Description: SVD and CD are two widely used methods for imputing missing values in time series data. SVD decomposes the dataset into a subset of singular values, while CD calculates the distance between the missing value and its neighboring points based on correlation. CD was found to be more accurate and computationally efficient for time series datasets with low correlation.
(Jianlong Xu, 2021). Datasets: national water quality reference index data monitored by the Haimen Bay station. Description: an approach that combines low-rank matrix completion and sparse representation. Low-rank matrix completion first imputes missing values under a low-rank structure assumption; sparse representation then refines the imputed data by assuming it can be represented as a linear combination of a few basis elements.
(K, 2019). Datasets: rainfall data from 4 stations in Malaysia. Description: the method combines PCA and Bayesian modeling to estimate missing values in a dataset.

Machine learning imputation
(Dwivedi D, 2022). Datasets: measured water levels in 7 monitoring wells in the USA. Description: the RF algorithm is used to impute missing values in a dataset with continuous variables.
(Bokde N, 2018). Datasets: three field-based time series: traffic speed data, water flow rate data, and the Nottem dataset. Description: a hybrid approach in which regression imputation predicts missing input variables and data augmentation creates synthetic data points for missing output variables. The approach was applied to a dataset with missing values in both input and output variables.
(Figueroa-García, 2022). Datasets: seven datasets from the UCI and KEEL repositories. Description: a genetic algorithm imputes missing values in datasets with multiple missing observations and different data types. The algorithm minimizes a multi-objective fitness function based on the Minkowski distance of statistical measures between available and completed data.
(Jonathan P. Dekermanjian, 2022). Datasets: two untargeted metabolomics datasets from the COPDGene cohort. Description: a two-step approach in which a random forest classifier identifies the missing mechanism and mechanism-specific algorithms perform the imputation. The approach reduced bias and produced values closer to the original data.
(Trubitsyna, 2022). Datasets: Letter and SPAM datasets (https://archive.ics.uci.edu/). Description: a method that estimates missing values in datasets using a generative adversarial network (GAN) model.
A. Univariate Methods

Hong Phan (PHAN, 2020) proposed a method called MLBUI for filling consecutive missing values in univariate time series using machine learning. The data before and after the gap are transformed into a multivariate time series, followed by forward and backward forecasting with machine learning methods to estimate the missing values. The gap is then imputed by taking the average of both forecast sets. Paternoster (Paternoster, 1998) explained that the "most frequent value" imputation technique consists of substituting missing data with the value that appears most frequently for the given variable. This strategy is typically applied to categorical variables or numerical variables with a finite set of possible values. Kulanuwat et al. (Kulanuwat L, 2021) studied missing data imputation in water-level telemetry data using three methods: mean imputation, regression imputation, and multiple imputation. Mean imputation replaces missing values with the mean of the available data, while regression imputation uses a regression model to predict missing values. Multiple imputation generates multiple plausible imputed datasets using a statistical model and combines them into a single estimate. The study found that multiple imputation was the most effective method, producing estimates closer to the true values and with less bias than the other methods.
34
(Yi, 2016) Proposed a method called spatio-temporal multi-view-based learning (ST-MVL)
35
to fill missing readings in geo-sensory time series data. The method takes into account the
36
temporal correlation between readings at different timestamps in the same series and the
37
spatial correlation between different time series. The method combines empirical statistic
38
models (Inverse Distance Weighting and Simple Exponential Smoothing) with data-driven
39
algorithms (User-based and Item-based Collaborative Filtering) to handle different types of
40
missing data cases. The method is evaluated using Beijing air quality and meteorological
41
data. K. Wellenzohn (Wellenzohn, 2017) proposed a method for continuously imputing
42
missing values in streams of time series data. The proposed method, called CIViC
43
(Continuous Imputation of Values in time series with Clustering), uses a clustering
44
algorithm to group similar time series and then uses the grouped time series to impute
45
missing values. The authors evaluated their method on several real-world datasets and
46
compared it to other imputation methods. Zhang and Thorburn (Zhang, 2021) proposed a
47
dual-head sequence-to-sequence (Seq2Seq) model for imputing missing values in time
48 series data. Seq2Seq models are a type of recurrent neural network (RNN) that can be used
49 for sequence prediction and generation tasks. In their study, Zhang and Thorburn used a
50 dual-head architecture, which includes an encoder and two decoders, to predict the missing
51 values in a time series dataset. The two decoders correspond to the two directions of the
52 time series data (forward and backward).
53
C. Matrix Completion Principles

Shu, Porikli, and Ahuja (Shu, 2014) proposed a method for robust low-rank matrix recovery that can handle data corruption. Low-rank matrix recovery is a fundamental problem in computer vision and machine learning; it involves estimating a low-rank matrix from corrupted or incomplete data. The proposed method is based on orthonormal subspace learning, a technique for finding the principal subspace of a given set of data. Mazumder, Hastie, and Tibshirani (Mazumder, 2010) proposed a method for matrix completion, which involves recovering missing entries in a matrix. The authors noted that matrix completion has important applications in various fields, including recommender systems and collaborative filtering. The proposed method is based on spectral regularization and involves solving a convex optimization problem. Khayati, Böhlen, and Cudré-Mauroux (Khayati, 2015) compared two methods, Singular Value Decomposition (SVD) and Correlation Distance (CD), for recovering missing values in time series datasets. The authors noted that missing data is a common problem in time series datasets and can have a significant impact on subsequent analysis. SVD and CD can both be used for imputing missing values in time series data, and they differ in how they select a subset of the data to use for imputation. The authors evaluated the two methods on several datasets and found that CD performed better than SVD in terms of accuracy and computational efficiency, especially when the time series data had low correlation. Xu et al. (Jianlong Xu, 2021) proposed a method for imputing missing data in high-dimensional datasets by combining low-rank matrix completion and sparse representation. The authors argued that the high dimensionality of the data and the sparsity of the missing values require an approach that can effectively capture the underlying structure of the data. The proposed method first uses low-rank matrix completion to impute the missing values in the data matrix, leveraging the assumption that the data has a low-rank structure. The method then employs sparse representation to refine the imputed data, using the assumption that the data can be represented as a linear combination of a few basis elements. Ai and Kuok (K, 2019) suggested employing a statistical method known as Bayesian Principal Component Analysis (BPCA) to impute missing values in rainfall data. BPCA merges Principal Component Analysis (PCA) with Bayesian modeling to estimate missing data points in a dataset.
D. Machine Learning Imputing

Dwivedi (Dwivedi D, 2022) used Random Forest (RF) to impute continuous missing values in a dataset. The RF algorithm was used to impute missing values in a dataset containing continuous variables, and its performance was compared with other imputation methods, such as k-nearest neighbors (KNN) and mean imputation. Bokde et al. (Bokde N, 2018) proposed a method for imputing missing values in a dataset using a hybrid approach that combines regression imputation and data augmentation. The study dealt with a dataset that had missing values in both the input and output variables. Regression imputation was used to impute the missing values in the input variables by predicting them from the available data. For the missing values in the output variables, data augmentation was used, which involves creating synthetic data points to fill in the missing values. A genetic algorithm approach to estimate missing data in multivariate databases is proposed in (Figueroa-García, 2022). Genetic algorithms are effective at handling multiple missing observations and different types of data, unlike traditional methods that only deal with univariate continuous data. The proposed algorithm minimizes a new multi-objective fitness function based on the Minkowski distance of means, variances, covariances, and skewness between available and completed data. The approach is compared to the EM algorithm and auxiliary regressions using a continuous/discrete dataset, and benchmarked against seven datasets. Dekermanjian et al. (Jonathan P. Dekermanjian, 2022) designed an imputation algorithm to handle missing values in metabolomics datasets, which are often caused by various mechanisms such as instrument detection limits, data collection and processing conditions, and random factors. The algorithm takes a mechanism-aware approach and consists of two steps. In the first step, a random forest classifier is used to classify the missing mechanism for each missing value in the dataset. In the second step, missing values are imputed using mechanism-specific imputation algorithms, namely MAR/MCAR or MNAR. Simulations were conducted using complete data and different missing patterns to test the performance of the proposed algorithm. Results showed that the two-step approach reduced bias and provided imputations that were closer to the original data compared to using a single imputation algorithm for all missing values. Overall, this mechanism-aware imputation algorithm offers a promising solution for handling missing values in metabolomics datasets and improving downstream analyses. Shahbazian and Trubitsyna (Trubitsyna, 2022) developed a method for estimating missing values in datasets through the use of a generative adversarial network (GAN) based model named DEGAIN. The performance of DEGAIN is evaluated on two publicly available datasets, namely Letter Recognition and SPAM, and compared against existing methods.
In this paper, we propose a novel approach for computing the missing values in incomplete subsequences, called the Gap Imputing Algorithm (GMA). We divide the time series into two subsequences: one that contains the complete data and another that contains the missing gaps. To fill in the missing data, we use a pattern-matching approach by analyzing the similarity between the complete and incomplete subsequences. Specifically, we imitate the pattern of the complete subsequence to recreate the missing data.

Table 2: List of abbreviations and symbols

Symbol | Definition
GMA    | The Gap Imputing Algorithm
MAR    | Missing at Random
MCAR   | Missing Completely at Random
MNAR   | Missing Not at Random
S      | The total time series
Sf     | The subsequence with no missing values
Sm     | The subsequence with missing values
P      | The number of readings in the periodic sequence of the longest component within the time series
W      | The missing gaps
R      | The right pattern surrounding a missing gap
L      | The left pattern surrounding a missing gap
N      | The number of readings in a missing gap
wl     | The subsequence of Sf most similar to L
wr     | The subsequence of Sf most similar to R
b      | The start index of wr
e      | The end index of wl
sr     | Correlation value between R and wr
sl     | Correlation value between L and wl
RMSE   | Root mean squared error
MAE    | Mean absolute error
FSM    | Full Subsequence Matching algorithm
III. OVERVIEW

We have developed a method to recover missing values in an incomplete time series S, where readings are taken at equal intervals. Our approach involves identifying gaps in S that have null values and dividing S into two parts: Sf, which has no missing values, and Sm, which contains the missing gaps W. We then apply the Fourier transform (Oppenheim, 2010) to Sf in order to obtain the number of readings in the periodic sequence P for each time series. P represents the number of readings required for the time series to complete one cycle. Each missing gap w is then analyzed to determine its surrounding right pattern R, which comprises P readings, and its left pattern L, which also comprises P readings. W contains N missing values. To identify the two patterns in Sf that are most similar to R and L, we use the Kendall tau correlation measure (Abdi, 2007). Subsequently, we employ the algorithm outlined in the following section to complete the missing values in W. Table 2 explains the symbols and annotations.
IV. PROPOSED METHODS

We have developed a novel method for imputing missing values in a time series using the Fourier transform and a new filling algorithm. Our approach uses the Fourier transform to determine the wavelength of each time series and, from it, the sequence period of each series. This enables us to use an optimal imputation method to fill in the gaps in the time series. Our proposed GMA method consists of four main steps.
A. Identifying the Missing Gaps

We identify the missing gaps in the time series, denoted by W = {w1, w2, w3, ...}. For each gap w, we determine the preceding and succeeding data points, as well as the number of missing points in the gap between them. From this analysis, we generate an array G that records, for each gap, the number of missing data points together with the preceding and succeeding data points.
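As a minimal illustration of this step, the sketch below scans a pandas Series for runs of NaN values and records, for each gap, its start index, length, and the indices of the preceding and succeeding observed points. The function name find_gaps and the example series are ours, not part of the published missval package.

import numpy as np
import pandas as pd

def find_gaps(s: pd.Series):
    """Return a list of (start, length, prev_idx, next_idx) for each run of NaNs in s."""
    gaps = []
    isna = s.isna().to_numpy()
    i = 0
    while i < len(s):
        if isna[i]:
            start = i
            while i < len(s) and isna[i]:
                i += 1
            prev_idx = start - 1 if start > 0 else None   # last observed point before the gap
            next_idx = i if i < len(s) else None          # first observed point after the gap
            gaps.append((start, i - start, prev_idx, next_idx))
        else:
            i += 1
    return gaps

# Example: a short series with two gaps of lengths 2 and 1
s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, np.nan, 7.0])
print(find_gaps(s))   # [(2, 2, 1, 4), (5, 1, 4, 6)]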
B. Time Series Analysis

After identifying the gaps in the time series S that have null values and dividing S into two parts, it is advisable to focus on the subset of the time series that contains complete data, which we denote as Sf, because incomplete or missing data can introduce biases and inaccuracies into the analysis. A fundamental characteristic that needs to be identified is whether the data is stationary or periodically (seasonally) repeated, and the number of readings P in the periodic cycle. The discrete Fourier transform (DFT) is a mathematical technique that analyzes the time series in the frequency domain. Performing the DFT on the time series decomposes it into its constituent frequencies and yields information about the spectral content of the time series (Oppenheim, 2010). We pass the DFT output to the inverse DFT Python function (community, 2023); this routine finds the peak frequency of the signal using the argmax function and then determines the period of the signal by taking the reciprocal of that frequency. Figure 1 shows the UCR dataset and its inverse DFT.
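A minimal sketch of this period-estimation idea, assuming a complete, equally spaced signal sf held in a NumPy array: it uses scipy.fft.rfft to locate the dominant frequency and returns its reciprocal (in samples) as the periodic length P. This illustrates the approach rather than reproducing the exact routine in the missval package.

import numpy as np
from scipy.fft import rfft, rfftfreq

def estimate_period(sf: np.ndarray) -> int:
    """Estimate the number of readings P in one periodic cycle of the complete subsequence sf."""
    x = sf - np.mean(sf)                  # remove the DC component so it does not dominate the spectrum
    spectrum = np.abs(rfft(x))
    freqs = rfftfreq(len(x), d=1.0)       # frequency in cycles per reading (sampling interval = 1)
    peak = np.argmax(spectrum[1:]) + 1    # skip the zero-frequency bin
    return int(round(1.0 / freqs[peak]))  # period expressed as a number of readings

# Example: a noisy sine wave with a true period of 50 readings
t = np.arange(2500)
sf = np.sin(2 * np.pi * t / 50) + 0.1 * np.random.randn(t.size)
print(estimate_period(sf))                # prints a value close to 50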
Figure 1: The UCR dataset (Keogh, 2021) and its inverse DFT.
C. Extracting the Subsequences Similar to the Surrounding Right and Left Patterns R, L

Once we have generated the array G that records the missing gaps, we use it to extract subsequences from the left and right sides of each gap w. Specifically, we extract a subsequence of length P points from the left side of the gap (denoted by L) and another subsequence of the same length from the right side of the gap (denoted by R). We then use these subsequences to search for the most similar subsequences to L and R, denoted by wl and wr, respectively. This similarity search can be performed with various techniques, such as dynamic time warping (Rakthanmanon, 2012), Pearson correlation (mapreduce., 2016), or Euclidean distance (Park, 2009). Experiments have shown that the most suitable technique for our algorithm is Kendall's tau (Gibbons, 2011), as it compares the direction of the points, up or down, in value.
In general, if the data is normally distributed and the relationship between the variables is expected to be linear, Pearson correlation may be the most appropriate technique. If the data is not normally distributed or the relationship between the variables is not expected to be linear, Euclidean distance or Kendall's tau may be more appropriate. However, the specific technique should be selected based on the characteristics of the data and the research question being investigated.
The Kendall rank correlation coefficient is a statistical measure used to determine the degree of similarity between two sets of variables. It is a non-parametric measure that quantifies the strength of the relationship between two sets based on the ranks of their values. The coefficient ranges from -1 to 1, with -1 indicating a perfect negative correlation, 0 indicating no correlation, and 1 indicating a perfect positive correlation. Figure 2 shows the Kendall results. The similarity pattern to the left, wl, is represented by a green rectangle; its corresponding similarity measure, sl, is 0.735. The similarity pattern to the right, wr, is represented by a red rectangle; its corresponding similarity measure, sr, is 0.540.
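The following sketch illustrates this kind of similarity search with scipy.stats.kendalltau: it slides a window of length P over the complete subsequence Sf and keeps the offset whose window correlates most strongly with the query pattern (L or R). The helper name most_similar and the exclude parameter are illustrative choices, not the exact search routine used by the authors.

import numpy as np
from scipy.stats import kendalltau

def most_similar(sf: np.ndarray, query: np.ndarray, exclude=None):
    """Slide a window of len(query) over sf; return (best_start, best_tau) by Kendall's tau."""
    p = len(query)
    best_start, best_tau = None, -np.inf
    for start in range(len(sf) - p + 1):
        if exclude is not None and exclude[0] <= start < exclude[1]:
            continue                       # optionally skip the trivial self-match region
        tau, _ = kendalltau(query, sf[start:start + p])
        if not np.isnan(tau) and tau > best_tau:
            best_start, best_tau = start, tau
    return best_start, best_tau

# Tiny example: sf has a repeating pattern; excluding the first occurrence,
# the query matches the second repetition exactly (tau = 1.0).
sf = np.array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4], dtype=float)
print(most_similar(sf, np.array([1.0, 2.0, 3.0, 4.0]), exclude=(0, 4)))   # (4, 1.0)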
Figure 2: The steam flow dataset (STUMPY, 2023) and the Kendall results: wl is the most similar subsequence to the L pattern (green rectangle) and wr is the most similar subsequence to the R pattern (red rectangle).
Algorithm 1: The GMA Algorithm

Input:  Sf  the complete subsequence
        b   the start index of wr (the subsequence most similar to R)
        sr  the correlation value between R and wr
        e   the end index of wl (the subsequence most similar to L)
        sl  the correlation value between L and wl
        N   the number of missing points

FUNCTION GMA(Sf, b, sr, e, sl, N)
    // Calculate the tau ratios
    T1 = sr / (sr + sl)
    T2 = sl / (sr + sl)
    // Method 1: Filling the gap by alternation
    imp1 = []
    FOR i = 0 TO (FLOOR(T1 * N) - 1) DO
        imp1[i] = Sf[e + i]
    END FOR
    FOR i = FLOOR(T1 * N) TO N - 1 DO
        imp1[i] = Sf[b - N + i]
    END FOR
    // Method 2: Filling the gap by weighted combination
    imp2 = []
    FOR i = 0 TO N - 1 DO
        imp2[i] = (T1 * Sf[e + i]) + (T2 * Sf[b - N + i])
    END FOR
    // Return the results of both methods
    RETURN (imp1, imp2)
END FUNCTION
D. Imputing the Gaps

We developed two different techniques to impute missing values: alternation imputation and combination imputation, as shown in Algorithm 1. The algorithm takes the missing gap w, the similarity pattern to the right wr with its corresponding tau measure sr, the similarity pattern to the left wl with its corresponding tau measure sl, and the number of missing points N. The algorithm uses two methods to fill in the missing gap. Method 1 employs the two similar patterns, wl and wr: the first sr*N/(sr + sl) points of the gap are copied from the readings that follow wl, and the remaining points from the readings that precede wr. The time complexity of Method 1 is approximately O(N). Method 2 combines the two similar patterns, weighted by their correlation values sr and sl, to fill the gap; its time complexity is also approximately O(N). The computational complexity of the GMA function is therefore linear in the length of the missing gap N.
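The sketch below is one possible Python reading of Algorithm 1, assuming that e is the end index of the match to the left pattern (wl), that b is the start index of the match to the right pattern (wr), and that the second part of Method 1 copies the readings immediately preceding b. These index conventions, and the function name gma_fill, are our interpretation of the pseudocode rather than the reference implementation in the missval package.

import numpy as np

def gma_fill(sf, b, sr, e, sl, n):
    """Impute an n-point gap from the complete subsequence sf (see Algorithm 1).

    sf : complete subsequence (1-D array)
    b  : start index of wr, the subsequence of sf most similar to the right pattern R
    sr : Kendall tau between R and wr
    e  : end index of wl, the subsequence of sf most similar to the left pattern L
    sl : Kendall tau between L and wl
    n  : number of missing points in the gap (indices e + n and b - n assumed in range)
    """
    t1 = sr / (sr + sl)
    t2 = sl / (sr + sl)
    n1 = int(np.floor(t1 * n))

    # Method 1 (alternation): the first n1 points continue the left match wl,
    # the remaining n - n1 points copy the readings just before the right match wr.
    imp1 = np.concatenate([sf[e:e + n1], sf[b - (n - n1):b]])

    # Method 2 (weighted combination): blend both candidate fillings point by point.
    imp2 = t1 * sf[e:e + n] + t2 * sf[b - n:b]
    return imp1, imp2

Under this reading, with sl = 0.735 and sr = 0.540 as in Figure 2, about 42% of the gap points would come from the continuation of wl and the rest from the readings immediately preceding wr.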
V. EXPERIMENTAL SETUP

A. Datasets
1) The UCR_BIDMC1_2500 benchmark is a time series dataset that is part of the UCR Time Series Anomaly Archive (Keogh, 2021). It contains 2,500 instances, each consisting of 128 observations. The dataset was collected from intensive care unit (ICU) patients, where each instance represents the continuous physiological signals of a patient over a 6-hour period. The anomalies in this dataset correspond to changes in the patients' physiological conditions that require medical attention, such as cardiac arrest or shock. This benchmark is specifically designed to address common flaws present in other anomaly detection benchmarks, including trivial and unrealistic anomaly intensity, misleading ground truth, and run-to-failure bias. Using the inverse Fourier transform technique, we extracted the periodic cycle of the UCR_BIDMC1_2500 dataset, which was determined to be 50 readings in length. Figure 1 shows the UCR_BIDMC1_2500 dataset and its inverse DFT.

2) The Steamgen dataset is a commonly used benchmark in the field of process control and system identification. The dataset consists of 6,000 samples, each containing 19 features that describe the operating conditions and performance of the steam generator (STUMPY, 2023). These features include variables such as steam flow rate, water level, and temperature, as well as indicators of system faults and disturbances. As shown in Figure 2, we have chosen to focus on the steam flow feature. Using the inverse Fourier transform technique, we extracted the periodic cycle of the steam flow dataset, which was determined to be 497 readings in length.
B. Missing Data Generation

We simulated missing data in order to evaluate the efficacy of different imputation techniques. To generate datasets with missing data, we systematically removed consecutive values from the dataset, assuming that the deletions occurred at random positions.
In the case of the UCR dataset, we created gaps of sizes 6, 21, 26, 40, and 101 points, as the periodic cycle of this dataset is 50 points. For the steam flow feature, we created gaps of sizes 7, 21, 51, and 244 points, as the periodic cycle for steam flow is 507 points. These missing data scenarios were simulated in order to test the performance of the different imputation approaches. Experiments have shown that if the size of a missing gap is greater than the periodic cycle value, the imputation error is significantly higher for any imputation method.
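As an illustration of this gap-generation step, the snippet below removes a run of consecutive values starting at a randomly chosen position and keeps the removed ground truth for later evaluation; the helper name make_gap is ours.

import numpy as np

def make_gap(series: np.ndarray, gap_size: int, rng=None):
    """Return (corrupted, start, truth): a copy of the series with gap_size consecutive
    values set to NaN at a random position, plus the removed ground-truth values."""
    rng = np.random.default_rng(rng)
    start = int(rng.integers(1, len(series) - gap_size - 1))  # keep observed points on both sides
    corrupted = series.astype(float).copy()
    truth = corrupted[start:start + gap_size].copy()
    corrupted[start:start + gap_size] = np.nan
    return corrupted, start, truth

# Example: simulate a 51-point gap in the steam flow series (array name is illustrative)
# corrupted, start, truth = make_gap(steam_flow_values, 51, rng=0)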
C. Comparative Imputation Methods

We selected several widely used imputation techniques for evaluating the effectiveness of our proposed methods: linear interpolation (pandas, 2023), polynomial interpolation (Qingkai Kong, 2020), and Full Subsequence Matching (FSM) (Thakolpat Khampuengson, 2022). Linear interpolation is a common method for filling in missing values in a dataset by estimating the value based on the linear relationship between adjacent data points. Polynomial interpolation is a more complex variant of this technique, where a polynomial function is used to approximate the missing values based on the surrounding data points. FSM is a pattern-matching approach to imputation, where the missing value is estimated by identifying similar subsequences of data within the dataset and using them to make a prediction; this technique is useful for datasets with repeated patterns or cyclical trends. By comparing the performance of our methods against these established techniques, we aim to demonstrate the efficacy of our approach and provide valuable insights into the most effective methods for imputing missing data.
D. Experimental Setting

Our experiments were conducted on a server equipped with a Core i7 Intel processor running at 2.60 GHz, 8 GB of RAM, and a 250 GB SATA hard drive. We implemented our proposed framework in the open-source Python package missval, which offers a range of missing value imputation methods, as well as visualization and performance evaluation tools. The package is publicly available on GitHub at https://github.com/Eng-Khattab/missval. In addition, we have made the two datasets used in this study available for public access. To perform interpolation, we used the "interpolate" method of the pandas DataFrame (pandas, 2023) Python library, which offers a convenient way of filling in missing values using interpolation techniques. Specifically, we employed the linear method for linear interpolation and the polynomial method with a second-order polynomial for polynomial interpolation. To perform the Full Subsequence Matching (FSM) method, we used the matrix profile Python library STUMPY (STUMPY, 2023).
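For the two interpolation baselines, the call pattern is the standard pandas one shown below (a linear fill and a second-order polynomial fill on a toy series); the STUMPY-based FSM comparison is not reproduced here.

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, np.nan, np.nan, 6.0, 7.0])

linear_fill = s.interpolate(method="linear")
poly_fill = s.interpolate(method="polynomial", order=2)   # requires SciPy

print(linear_fill.tolist())          # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(poly_fill.round(2).tolist())   # polynomial estimates for the three missing points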
E. Evaluation Metrics

The performance of an imputation method is commonly evaluated by measuring its accuracy with three widely used metrics: the root mean square error (RMSE), the mean absolute error (MAE), and Kendall's tau between the actual pattern and the imputed pattern. These metrics are defined as follows:
1) The root mean square error (RMSE) (Bennett, et al., 2013) is a measure of the differences between the actual and imputed values, calculated as the square root of the average of the squared differences:

RMSE = \sqrt{\frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} (\hat{y}_i - y_i)^2}

where \hat{y}_i is the predicted value and y_i is the actual value.
2) The mean absolute error (MAE) (Bennett, et al., 2013) is another measure of the differences between the actual and imputed values, calculated as the average of the absolute differences:

MAE = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} |\hat{y}_i - y_i|

3) Kendall's tau (Abdi, 2007) is a measure of the correlation between the actual and imputed patterns, which takes into account the order or rank of the values rather than their actual magnitudes. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no correlation:

Kendall's tau = (number of concordant pairs - number of discordant pairs) / (number of pairs)

where a pair is concordant if the relative order of the values in the actual pattern is the same as in the imputed pattern, and discordant if the order is different. The number of pairs is n(n - 1)/2 for a dataset with n samples.
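The three metrics can be computed directly with NumPy and SciPy, as in the short sketch below; the function name evaluate and the example arrays are illustrative.

import numpy as np
from scipy.stats import kendalltau

def evaluate(actual: np.ndarray, imputed: np.ndarray):
    """Return (RMSE, MAE, Kendall's tau) between the actual and imputed gap values."""
    err = imputed - actual
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    tau, _ = kendalltau(actual, imputed)
    return rmse, mae, tau

# Example
actual = np.array([10.0, 12.0, 15.0, 14.0])
imputed = np.array([11.0, 12.5, 14.0, 13.0])
print(evaluate(actual, imputed))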
VI. RESULTS AND DISCUSSION

We employed four distinct algorithms to fill the missing data points in the two datasets. For the steam flow dataset, there are gaps of sizes 7, 21, 51, and 244 points, as the periodic cycle for steam flow is 507 points. Our results, presented in Tables 3 and 4 and Figures 3a and 3b, demonstrate the superiority of our algorithm for larger gaps that exceed a quarter of the identified time series period length, as outlined in Section 3. Conversely, the linear algorithm performs exceptionally well for smaller gaps, particularly at the lowest gap sizes. These findings emphasize the importance of accurately determining the time period of a dataset before imputing the missing gaps.
The results are presented in two tables and two figures for each dataset. The first table and figure show the root mean squared error (RMSE) and mean absolute error (MAE), where lower values indicate better performance. The Kendall correlation results are presented in the second table and its accompanying figure. A higher correlation value indicates a stronger relationship between the imputed data and the original data, meaning that the imputed data closely aligns with the original data in terms of its characteristics. Our findings indicate that achieving better results with the Full Subsequence Matching (FSM) algorithm may require a significant number of repetitions, because the algorithm relies on random selection of the length of the right and left patterns, which can lead to variability in the outcomes.
Our algorithm employs two distinct methods to fill in missing data gaps. The first method alternates between the two matched patterns based on their tau ratios and is used when the gap size exceeds the periodic cycle P. The second method combines the two patterns based on their tau ratios and is used to fill missing gaps shorter than the periodic cycle.
Table 3: The performance indexes of the four methods on the steam flow dataset

Gap points   |   G7          |   G26         |   G51         |   G244
Error metric | RMSE  | MAE   | RMSE  | MAE   | RMSE  | MAE   | RMSE  | MAE
Linear       | 0.58  | 0.51  | 0.68  | 0.55  | 2.6   | 2.3   | 7.2   | 5.9
Poly         | 0.65  | 0.57  | 2.6   | 2.3   | 2.4   | 2.19  | 18.9  | 16
FSM          | 7.3   | 6.4   | 0.55  | 0.55  | 3.8   | 3.6   | 7.8   | 8.9
GMA          | 3.7   | 3.6   | 0.53  | 0.34  | 1.7   | 0.9   | 5     | 5
Table 4: The Kendall tau results between the imputed gaps and the actual data for the steam flow dataset

Method   |  G7    |  G26   |  G51   |  G244
Linear   |  0.86  |  0.58  | -0.4   | -0.5
Poly     |  0.33  | -0.25  | -0.19  | -0.02
FSM      | -0.7   |  0.51  |  0.26  | -0.34
GMA      | -0.4   |  0.6   |  0.48  |  0.05
Fig. 3a: The performance indexes (RMSE and MAE) of the four methods on the steam flow dataset.
Fig. 3b: The Kendall correlation between the filled and actual data.
Fig. 3: The performance indexes of the four methods on the steam flow dataset.
Table 5: The performance indexes of the four methods on the UCR dataset

Gap points   |   G6          |   G26         |   G33         |   G40         |   G101
Error metric | RMSE  | MAE   | RMSE  | MAE   | RMSE  | MAE   | RMSE  | MAE   | RMSE  | MAE
Linear       | 229   | 209   | 2222  | 2034  | 2476  | 2007  | 5877  | 5070  | 10591 | 9527
Poly         | 57    | 50    | 2857  | 2372  | 2811  | 2139  | 6433  | 4738  | 6175  | 4921
FSM          | 474   | 416   | 1271  | 1144  | 1316  | 1074  | 2408  | 934   | 5318  | 4374
GMA          | 512   | 555   | 1015  | 830   | 6530  | 6213  | 1398  | 1319  | 3859  | 3144
Table 6: The Kendall tau results between the imputed gaps and the actual data for the UCR dataset

Method   |  G6    |  G26   |  G33   |  G40   |  G101
Linear   |  0.99  |  0.99  |  0.76  |  0.49  |  0.035
Poly     |  0.99  |  0.92  |  0.77  |  0.59  |  0.56
FSM      |  0.39  |  0.92  |  0.77  | -0.73  |  0.43
GMA      |  0.99  |  0.99  |  0.95  |  0.13  |  0.8
28
29 12000 Linear Poly FSM GMA
RMSE and MAE Scales

30 10000
31
8000
32
33 6000
34 4000
35 2000
36
0
37
RMSE MAE RMSE MAE RMSE MAE RMSE MAE RMSE MAE
38
39 G6 G21 G26 G40 G101
40 Number of missig points in each gap
41 fig4 .a. the performance indexes of 4 methods on URC dataset
42
43 1.5
44 Linear Poly FSM GMA
45
46 1
47
48
Kendall Scale

0.5
49
50
51 0
52 G6 G21 G26 G40 G101
53
-0.5
54
55
Number of missig points in each gap
56 -1
57
58
Fig 4.b The Kendall correlation between the filled and actual data
59
Fig4. The performance indexes of 4 methods on URC dataset
60
61
15
62
63
64
65
9
10
11
12
The results, presented in Tables 5 and 6 and Figures 4a and 4b, show the effectiveness of our model when applied to the UCR dataset, with the exception of gap 26. It is worth noting that gap 26 coincides with an anomaly in the original data. The values presented in Tables 5 and 6 may appear large because of the wide domain of the dataset, which ranges from -10,000 to 30,000, as shown in Fig. 1a. As a result, the errors are also expressed in large values.
VII. CONCLUSION

This paper presents the Gap Imputing Algorithm (GMA), a novel method for imputing missing values in time series data. GMA is specifically designed to address the challenging problem of consecutively missing values with varying gap distances in time series analysis. Initially, GMA identifies sequences of missing values and determines the periodicity of the time series. It then searches for the most similar subsequences in the historical data to fill in the missing gap. GMA employs two methods to impute the missing data gaps, depending on the gap size. If the gap size exceeds the periodic cycle P, GMA uses the first method, which alternates between the two patterns most similar to the missing gap's terminals, weighted by their correlation values. If the missing gap is shorter than the periodic cycle, the second method is used, which combines the two similar patterns according to their correlation values to fill in the missing data. Experimental results demonstrate that GMA outperforms existing methods in terms of accuracy, particularly for datasets with long periodic patterns and larger missing gaps. Using the periodic cycle to determine the pattern length leads to a more precise and accurate result. In contrast, other algorithms require multiple runs because they rely on random selection of the length of the right and left patterns, which can result in variability in the outcomes. Overall, this research contributes to the development of more effective and efficient missing value imputation techniques in time series data analysis. The practical implications of these findings are significant, as accurate imputation of missing data is crucial for a wide range of applications.
REFERENCES

Abdi, H. (2007). The Kendall rank correlation coefficient. Encyclopedia of Measurement and Statistics.
Bennett, N., Croke, B., Guariso, G., Guillaume, J. H., Jakeman, A., Marsili-Libelli, S., . . . Norton, J. (2013). Characterising performance of environmental models. Environmental Modelling & Software.
Bokde N, B. M. (2018). A novel imputation methodology for time series based on pattern sequence forecasting. Pattern Recognition Letters, 116, 88-96.
community, T. S. (2023, April 10). scipy.fft.rfft. The SciPy community. Retrieved 2023, from https://docs.scipy.org/doc/scipy/reference/generated/scipy.fft.rfft.html
Dwivedi D, M. U. (2022). Imputation of contiguous gaps and extremes of subhourly groundwater time series using random forests. J Mach Learn Model Comput, 3.
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.
Figueroa-García, J. C. (2022). A genetic algorithm for multivariate missing data imputation. Information Sciences.
Gibbons, J. D. (2011). Nonparametric statistical inference. CRC Press, 14.
Irfan Pratama, A. E. (2016). A review of missing values handling methods on time-series data. International Conference on Information Technology Systems and Innovation (ICITSI).
Jianlong Xu, K. W. (2021). FM-GRU: A time series prediction method for water quality based on seq2seq framework. MDPI, 13.
Jonathan P. Dekermanjian, E. S. (2022). Mechanism-aware imputation: a two-step approach in handling missing values in metabolomics. BMC Bioinformatics.
José Cambronero, J. K. (2017). Query optimization for dynamic imputation. The VLDB Endowment, 10.
K, L. W. (2019). A study on Bayesian principal component analysis for addressing missing rainfall data. Water Resources Management, 33, 2615-2628.
Keogh, R. W. (2021). Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. IEEE Transactions on Knowledge and Data Engineering.
Khayati, M. B.-M. (2015). Using lowly correlated time series to recover missing values in time series: A comparison between SVD and CD. In Advances in Spatial and Temporal Databases, 14th International Symposium, SSTD.
Kulanuwat L, C. C.-A. (2021). Anomaly detection using a sliding window technique and data imputation. Water, 13.
Liao, W., Bak-Jensen, B., Pillai, J. R., Yang, D., & Wang, Y. (2021). Data-driven missing data imputation for wind farms using context encoder. Journal of Modern Power Systems and Clean Energy.
Little, R. J. (1992). Regression with missing X's: a review. Journal of the American Statistical Association.
mapreduce., A. e. (2016). Gu, J., and Zhang. Journal of Parallel and Distributed Computing, 95, 54-62.
Mazumder, R. H. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11.
Mourad Khayati, A. L. (2020). Mind the gap: An experimental evaluation of imputation of missing values techniques in time series. VLDB Endowment, 13.
Oppenheim, A. V. (2010). Discrete-time signal processing (3rd ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
pandas. (2023, April 10). pandas.DataFrame.interpolate. Retrieved from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html
Park, H. a. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36, 3336-3341.
Paternoster, R. B. (1998). Using the correct statistical test for the equality of regression coefficients. Criminology, 36, 859-866.
PHAN, T.-T.-H. (2020). Machine learning for univariate time series imputation. Preprint MAPR.
Qingkai Kong, T. S. (2020). Python programming and numerical methods: A guide for engineers and scientists. Elsevier.
Rakthanmanon, T. K. (2012). Searching and mining trillions of time series subsequences under dynamic time warping. The 18th ACM SIGKDD.
Schulz, K. F. (2002). Allocation concealment in randomised trials: defending against deciphering. The Lancet, 359, 614-618.
Shu, X. P.-r. (2014). Robust orthonormal subspace learning: Efficient recovery of corrupted low-rank matrices. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA.
STUMPY. (2023, April 10). Steamgen example. Retrieved 2023, from https://stumpy.readthedocs.io/en/latest/Tutorial_The_Matrix_Profile.html
Thakolpat Khampuengson, a. W. (2022). Novel methods for imputing missing values in water level. Water Resources Management.
Trubitsyna, R. S. (2022). DEGAIN: Generative-adversarial-network-based missing data imputation. Information, 13.
Wellenzohn, K. B. (2017). Continuous imputation of missing values in streams of pattern-determining time series. The 20th International Conference on Extending Database Technology (EDBT).
Xiuwen Yi, Y. Z. (2015). ST-MVL: Filling missing values in geo-sensory time series data. Conference on Artificial Intelligence.
Yi, X. Z. (2016). ST-MVL: Filling missing values in geo-sensory time series data. The 25th International Joint Conference on Artificial Intelligence.
YU ZHENG, L. C. (2014). Urban computing: Concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology, 38.
Zhang, Y. a. (2021). Dual-head sequence-to-sequence model for imputing missing data in multivariate time series. IEEE Journal of Biomedical and Health Informatics, 25, 1692-1702.
GMA: Gap Imputing Algorithm for Time Series Missing Values
Abstract
Data collected from the environment in computer engineering may include missing values
due to various factors, such as lost readings from sensors caused by communication errors
or power outages. Missing data can result in inaccurate analysis or even false alarms. It is
therefore essential to identify missing values and correct them as accurately as possible to
ensure the integrity of the analysis and the effectiveness of any decision-making based on
the data. This paper presents a new approach, the Gap Imputing Algorithm (GMA), for
imputing missing values in time series data. The Gap Imputing Algorithm (GMA) identifies
sequences of missing values and determines the periodic time of the time series. Then, it
searches for the most similar subsequence from historical data. Unlike previous work,
GMA supports any type of time series and is resilient to consecutively missing values with
different gaps distances. The experimental findings, which were based on both real-world
and benchmark datasets, demonstrate that the GMA framework proposed in this study
outperforms other methods in terms of accuracy. Specifically, our proposed method
achieves an accuracy score that is 5 to 20% higher than that of other methods.
Furthermore, the GMA framework is well-suited to handling missing gaps with larger
distances, and it produces more accurate imputations, particularly for datasets with strong
periodic patterns.
Keywords Time series, Incomplete Subsequence, Missing data imputation
I. INTRODUCTION
Sensing technologies are now widely employed in computer engineering for
persistent and collaborative monitoring of the physical environment. These sensors
generate large geo-tagged time series data, which can be used to improve human
understanding of various ambient conditions including water level, stream flow
observation, and meteorological conditions (YU ZHENG, 2014). Missing values can
occur for a variety of reasons, including sensor failures, transmission errors, power
outages, and other technical issues. When dealing with time series data, missing
readings can be especially problematic because they can disrupt real-time monitoring
and compromise the accuracy of further data analysis, such as prediction and inference
(Xiuwen Yi, 2015). Leaving missing values untreated can cause incorrect or ill-defined
results (José Cambronero, 2017). In order to address missing values in time series data,
there are a number of techniques that can be used. One approach is to use imputation
methods, which involve estimating missing values based on other data points in the
time series. Another approach is to use interpolation methods, which involve estimating
missing values by interpolating between nearby data points. Both of these techniques
can be effective, but it's important to choose the right approach based on the specific
characteristics of the time series data and the goals of the analysis (Liao, Bak-Jensen,
Pillai, Yang, & Wang, 2021). It is important to understand the pattern of missing values,
as this can have an impact on the analysis of the data. Two common types of missing
data patterns are (Little, 1992):
1) Missing at Random (MAR): In this type of missing data, the probability of a value being missing depends on other variables in the dataset, but not on the missing value itself. That is, the missingness is related to the observed values in the dataset.
2) Missing Completely at Random (MCAR): In this type of missing data, the probability of a value being missing is unrelated to any other variables in the dataset, including the observed values. That is, the missingness is completely random and not related to the values in the dataset.

There is also a third type of missing data pattern, called Missing Not at Random
(MNAR), where the probability of a value being missing is dependent on the missing
value itself, but this is less commonly encountered in practice. Understanding the
pattern of missing values is important because it can impact the analysis of the data,
and different techniques are used to handle different types of missing data patterns.
There are many methods that can be used to handle missing data. Missing values can
be handled in a dataset through either single imputation or multiple imputation methods.
Single imputation methods involve replacing missing data points with a single value,
and the most common techniques include mean, average, or median imputation. On the
other hand, multiple imputation methods create multiple values for the missing data
points (Enders, 2010). Each of these methods has its advantages and disadvantages, and
the choice of technique depends on the specific characteristics of the data and the
research question being investigated.
Effective approaches for predicting missing values from accessible data are needed.
Algorithms for recovering missing data blocks can utilize various techniques such as
matrix completion principles or pattern matching. Matrix completion-based algorithms
treat a set of series as a matrix and apply methods that aim to complete the missing
entries. On the other hand, pattern-matching algorithms utilize the observed values
from the sensors to replace the missing data blocks. By using these methods, algorithms
can reconstruct missing data blocks from the available information, allowing for more
complete and accurate data analysis (Mourad Khayati, 2020).
A commonly used approach for replacing missing data gaps involves utilizing the
values from the most similar subsequence. This technique falls under the category of
pattern-matching algorithms. The Dynamic Time Warping (DTW) algorithm is a highly
effective pattern-matching technique that is extensively employed across numerous
problem domains. Nonetheless, a drawback of employing DTW is its tendency to be
computationally expensive and time-consuming, which can impact the algorithm's
overall efficiency and performance. Some researchers have proposed a solution to the
computational expense of Dynamic Time Warping (DTW) by suggesting the use of
shape-feature extraction algorithms that extract sequence features in sliding windows.
This approach involves calculating DDTW only if the correlation between the shape-
features of the window and the subsequences before the missing gap is sufficiently high
(Irfan Pratama, 2016). Results from this method have demonstrated better outcomes
when dealing with time series that have strong seasonality and high correlation. It's
worth noting that while DTW can identify the most similar patterns with similar
dynamics, it can warp the shape by expanding or compressing, which can lead to the
position of the missing gaps not aligning with the original pattern's position, to
overcome this limitation we decided to employ Inverse Fourier Transform to predict
the length of the seasonal period beforehand. This allows us to understand the dataset's

2
characteristics and identify the length of the missing gap, particularly if it represents
one or more seasonal periods, so we can apply a suitable algorithm to deal with it.
The objective of this study is to create novel methods for accurately and efficiently
imputing missing or anomalous data in computer systems. Our focus is on developing
techniques to impute missing values in seasonal time series. We observe that seasonal
patterns tend to exhibit similarities over time. Therefore, we propose to leverage this
phenomenon by replicating the pattern from the most comparable subsequence in the
historical data. Through this approach, we have developed an effective method for
imputing missing data, which involves utilizing simple operations for pattern searching
and matching. The paper makes the following technical contributions:
• We present and formalize GMA, the Gap Imputing Algorithm, to impute missing values in time series, covering stationary, non-linear, and seasonal time series.
• We use the inverse Fourier transform to determine the periodic length of each time series. This helps us gain a better understanding of the dataset's characteristics and identify suitable algorithms for handling the missing gaps.
• To identify similar historical situations, we rely on techniques such as the Euclidean distance, Spearman's rank-order correlation, and Kendall's tau, which enable us to measure the similarity between patterns accurately. By leveraging these techniques, we can improve the accuracy and efficiency of our data analysis and make more informed decisions based on the insights gained.
• We empirically show on real-world and synthetic datasets that GMA outperforms state-of-the-art solutions and is capable of effectively imputing values in time series with extended blocks of consecutively missing values.
The remainder of the paper is organized as follows: Section 2 describes the related work. Section 3 presents the model overview. Section 4 details the proposed method. Section 5 describes the datasets, comparative methods, experimental setting, and evaluation metrics. Results are presented in Section 6. Finally, the paper is concluded in Section 7.
II. RELATED WORK
Missing data imputation, as summarized in Table 1, involves two primary implementation techniques: univariate and multivariate. The univariate approach makes use of a single variable to estimate the missing values, whereas multivariate techniques analyze the relationships between multiple variables to estimate missing data (Thakolpat Khampuengson, 2022). Multivariate methods can be further classified into three categories: matrix completion principles, pattern matching (Mourad Khayati, 2020), and machine learning imputing. To provide a more comprehensive overview of the related work in missing data imputation, we have classified the existing algorithms into four categories, which are as follows:
TABLE 1: RELATED WORK SUMMARY

Univariate methods:
- (PHAN, 2020). Datasets: 4 time series (CO2 concentrations, Phu Lien air temperature, NNGC1 F1 V1 003 (NNGC), and Ba Tri temperature). The approach transforms the data into a multivariate time series, uses machine learning for forward and backward forecasting to estimate missing values, and imputes gaps with the average of both forecast sets.
- (Paternoster, 1998). Datasets: scientific research data on factors causing crime for males and females. The "most frequent value" imputation method replaces missing data with the mode of the variable. It is typically used for categorical variables or numerical variables that have a limited range of values.
- (Kulanuwat L, 2021). Datasets: water level data from telemetry stations across Thailand. This study compared three methods for imputing missing data: mean imputation, regression imputation, and multiple imputation. The results showed that multiple imputation was the most effective method and produced less biased estimates compared to the other two methods.

Pattern matching:
- (Yi, 2016). Datasets: air quality and meteorological data in Beijing, China. The proposed ST-MVL method fills missing readings in geo-sensory time series data by considering temporal and spatial correlations. It uses empirical statistic models and data-driven algorithms to handle different types of missing data cases.
- (Wellenzohn, 2017). Datasets: 2 datasets, the SBR meteorological time series in South Tyrol and a Flights dataset. A clustering algorithm was employed to group similar time series, and the resulting groups were used to impute missing values. This approach is specifically designed for continuous streams of time series data.
- (Zhang, 2021). Datasets: data collected from an in-situ monitoring station in the Mulgrave-Russell catchment, Australia. The proposed method uses a Seq2Seq model to impute missing values in time series data. This model utilizes a dual-head architecture that includes an encoder and two decoders, each corresponding to one direction of the time series data. Seq2Seq models are a type of recurrent neural network (RNN) that can be applied for sequence prediction and generation.

Matrix completion principles:
- (Shu, 2014). Datasets: face images under varying illuminations, 168x192 resolution, 55 frames. The proposed technique for robust low-rank matrix recovery is capable of handling data corruption and utilizes orthonormal subspace learning to estimate a low-rank matrix from incomplete or corrupted data. This method has shown promising results in experiments, outperformed existing methods, and can be applied in applications such as image processing, signal processing, and recommendation systems.
- (Mazumder, 2010). Datasets: Netflix data, 17,770 movies rated by 480,189 customers. The proposed method utilizes spectral regularization to promote low-rank solutions and impose structural constraints on the estimated matrix. This approach has proven to be effective in dealing with ill-posed problems and improving the performance of matrix completion. The method also allows for incorporating additional side information, such as similarity between items or users, to further enhance the estimation.
- (Khayati, 2015). Datasets: hydrological time series with tuples of timestamp and observation value. SVD and CD are two widely used methods for imputing missing values in time series data. SVD decomposes the dataset into a subset of singular values, while CD calculates the distance between the missing value and its neighboring points based on correlation. CD has been found to be more accurate and computationally efficient for time series datasets with low correlation.
- (Jianlong Xu, 2021). Datasets: national water quality reference index data monitored by Haimen Bay station. The proposed approach combines low-rank matrix completion and sparse representation. It first uses low-rank matrix completion to impute missing values based on a low-rank structure assumption, and then employs sparse representation to refine the imputed data by assuming it can be represented as a linear combination of a few basis elements.
- (K, 2019). Datasets: rainfall data from 4 stations in Malaysia. The method combines PCA and Bayesian modeling to estimate missing values in a dataset.

Machine learning imputing:
- (Dwivedi D, 2022). Datasets: measured water levels in 7 monitoring wells in the USA. The RF algorithm is used for imputing missing values in a dataset with continuous variables.
- (Bokde N, 2018). Datasets: three field-based time series, including traffic speed data, water flow rate data, and the Nottem dataset. A hybrid approach was used to impute missing data, where regression imputation predicted missing input variables and data augmentation created synthetic data points for missing output variables. The approach was applied to a dataset with missing values in both input and output variables.
- (Figueroa-García, 2022). Datasets: seven datasets from the UCI and KEEL repositories. A genetic algorithm is proposed to impute missing values in datasets with multiple missing observations and different data types. The algorithm minimizes a multi-objective fitness function based on the Minkowski distance of statistical measures between available and completed data.
- (Jonathan P. Dekermanjian, 2022). Datasets: two untargeted metabolomics datasets from the COPDGene cohort. A two-step approach was used for imputing missing values, involving a random forest classifier to classify the missing mechanism and mechanism-specific algorithms for imputation. The approach improved imputations by reducing bias and producing values closer to the original data.
- (Trubitsyna, 2022). Datasets: Letter and SPAM datasets (https://archive.ics.uci.edu/). A method that estimates missing values in datasets using a generative adversarial network (GAN) model.
A. Univariate Methods
Hong PHAN (PHAN, 2020) proposed a method called "MLBUI" for filling consecutive missing values in univariate time series using machine learning methods. The data before and after the gap are transformed into multivariate time series, followed by forward and backward forecasting using ML methods to estimate the missing values. The imputation of the gap is then done by taking the average values of both forecast sets. Paternoster
(Paternoster, 1998) explained that the "most frequent value" imputation technique consists
of substituting missing data with the value that appears most frequently for the given
variable. This imputation strategy is typically applied to categorical variables or numerical
variables with a finite set of possible values. Kulanuwat et al. (Kulanuwat L, 2021) studied
missing data imputation in electronic health records (EHR) using three methods: mean
imputation, regression imputation, and multiple imputation. Mean imputation involves
replacing missing values with the mean of available data, while regression imputation uses
a regression model to predict missing values. Multiple imputation generates multiple
plausible imputed datasets using a statistical model and combines them for a single estimate.
The study found that multiple imputation was the most effective method for imputing
missing data in EHR, producing estimates closer to true values and with less bias compared
to the other methods.
B. Pattern Matching
(Yi, 2016) Proposed a method called spatio-temporal multi-view-based learning (ST-MVL)
to fill missing readings in geo-sensory time series data. The method takes into account the
temporal correlation between readings at different timestamps in the same series and the
spatial correlation between different time series. The method combines empirical statistic
models (Inverse Distance Weighting and Simple Exponential Smoothing) with data-driven
algorithms (User-based and Item-based Collaborative Filtering) to handle different types of
missing data cases. The method is evaluated using Beijing air quality and meteorological
data. K. Wellenzohn (Wellenzohn, 2017) proposed a method for continuously imputing
missing values in streams of time series data. The proposed method, called CIViC
(Continuous Imputation of Values in time series with Clustering), uses a clustering
algorithm to group similar time series and then uses the grouped time series to impute
missing values. The authors evaluated their method on several real-world datasets and
compared it to other imputation methods. Zhang and Thorburn (Zhang, 2021) proposed a
dual-head sequence-to-sequence (Seq2Seq) model for imputing missing values in time
series data. Seq2Seq models are a type of recurrent neural network (RNN) that can be used
for sequence prediction and generation tasks. In their study, Zhang and Thorburn used a
dual-head architecture, which includes an encoder and two decoders, to predict the missing
values in a time series dataset. The two decoders correspond to the two directions of the
time series data (forward and backward).
C. Matrix Completion Principles
Shu, Porikli, and Ahuja (Shu, 2014) proposed a method for robust low-rank matrix
recovery that can handle data corruption. Low-rank matrix recovery is a fundamental
problem in computer vision and machine learning, and it involves estimating a low-rank
matrix from corrupted or incomplete data. The proposed method is based on orthonormal
subspace learning, which is a technique for finding the principal subspace of a given set of
data. Mazumder, Hastie, and Tibshirani (Mazumder, 2010) proposed a method for matrix
completion, which involves recovering missing entries in a matrix. The authors noted that
matrix completion has important applications in various fields, including recommender
systems and collaborative filtering. The proposed method is based on spectral regularization
and involves solving a convex optimization problem. Khayati, Böhlen, and Cudré-Mauroux
(Khayati, 2015) compared two methods, Singular Value Decomposition (SVD) and
Correlation Distance (CD), for recovering missing values in time series datasets. The
authors noted that missing data is a common problem in time series datasets and can have a
significant impact on subsequent analysis. SVD and CD are two methods that can be used
for imputing missing values in time series data, and they differ in how they select a subset
of the data to use for imputation. The authors evaluated the two methods on several datasets
and found that CD performed better than SVD in terms of accuracy and computational
efficiency, especially when the time series data had low correlation. Yang et al (Jianlong
Xu, 2021) proposed a method for imputing missing data in high-dimensional datasets by
combining low-rank matrix completion and sparse representation. The authors argued that
the high dimensionality of the data and the sparsity of the missing values require an
approach that can effectively capture the underlying structure of the data. The proposed
method first utilized low-rank matrix completion to impute the missing values in the data
matrix, leveraging the assumption that the data has a low-rank structure. The method then
employed sparse representation to refine the imputed data, utilizing the assumption that the
data can be represented as a linear combination of a few basis elements. Ai and Kuok (K,
2019) suggested employing a statistical method known as Bayesian Principal Component
Analysis (BPCA) to perform imputation of missing values in rainfall data. BPCA is a
technique that merges Principal Component Analysis (PCA) with Bayesian modeling to
estimate missing data points in a dataset.
D. Machine Learning Imputing
Dwivedi (Dwivedi D, 2022) used Random Forest (RF) to impute continuous missing
values in a dataset. The RF algorithm was used to impute missing values in a dataset
containing continuous variables. The performance of RF imputation was compared with
other imputation methods, such as k-nearest neighbors (KNN) and mean imputation.
Bokde et al. (Bokde N, 2018) proposed a method for imputing missing values in a dataset
using a hybrid approach that combines regression imputation and data augmentation. The
study dealt with a dataset that had missing values in both the input and output variables.
They used regression imputation to impute the missing values in the input variables by
predicting them based on the available data. For the missing values in the output variables,
they used data augmentation, which involves creating synthetic data points to fill in the
missing values. A genetic algorithm approach to estimate missing data in multivariate
databases is proposed (Figueroa-García, 2022) . Genetic algorithms are effective at handling
multiple missing observations and different types of data, unlike traditional methods that
only deal with univariate continuous data. The proposed algorithm minimizes a new multi-
objective fitness function based on Minkowski distance of means, variances, covariances,
and skewness between available and completed data. The approach is compared to EM
algorithm and auxiliary regressions using a continuous/discrete dataset, and benchmarked
against seven datasets. Dekermanjian et al. (Jonathan P. Dekermanjian, 2022) designed an imputation
algorithm to handle missing values in metabolomics datasets, which are often caused by
various mechanisms such as instrument detection limits, data collection and processing
conditions, and random factors. The algorithm takes a mechanism-aware approach and
consists of two steps. In the first step, a random forest classifier is used to classify the
missing mechanism for each missing value in the dataset. In the second step, missing values
are imputed using mechanism-specific imputation algorithms, namely MAR/MCAR or
MNAR. Simulations were conducted using complete data and different missing patterns to
test the performance of the proposed algorithm. Results showed that the two-step approach
reduced bias and provided imputations that were closer to the original data compared to
using a single imputation algorithm for all missing values. Overall, this mechanism-aware
imputation algorithm offers a promising solution for handling missing values in
metabolomics datasets and improving downstream analyses. Trubitsyna R. and Irina S
(Trubitsyna, 2022) developed a method for estimating missing values in datasets through
the use of a generative adversarial network (GAN) based model named DEGAIN. The
performance of DEGAIN is evaluated on two publicly available datasets, namely Letter
Recognition and SPAM, and compared against existing methods.
In this paper, we propose a novel approach for computing the missing values in
incomplete subsequences, called Gap Imputing Algorithm (GMA). We divide the time
series into two subsequences: one that contains the complete data and another that contains
the missing gaps. To fill in the missing data, we use a pattern-matching approach by
analyzing the similarity between the complete and incomplete subsequences. Specifically,
we imitate the pattern of the complete subsequence to recreate the missing data.
Table 2: List of abbreviations

Symbol: Definition
GMA: The Gap Imputing Algorithm
MAR: Missing at Random (type of missing data)
MCAR: Missing Completely at Random
MNAR: Missing Not at Random
S: The total time series
Sf: The subsequence with no missing values
Sm: The subsequence with missing values
P: The number of readings in the periodic sequence for the longest component within the time series
W: The missing gaps
R: The right pattern for the missing gap
L: The left pattern for the missing gap
N: The number of readings in the missing gap
wl: The most similar pattern to L in Sf
wr: The most similar pattern to R in Sf
b: The start index for wl
e: The end index for wr
sr: Correlation value between R and wr
sl: Correlation value between L and wl
RMSE: Root mean squared error
MAE: Mean absolute error
FSM: Full Subsequence Matching algorithm
III. OVERVIEW
We have developed a method to recover missing values in an incomplete time series S, where readings are taken at equal intervals. Our approach involves identifying the gaps in S that have null values and dividing S into two parts: Sf, which has no missing values, and Sm, which contains the missing gaps W. We then apply the Fourier transform (Oppenheim, 2010) to Sf in order to obtain the number of readings in the periodic sequence P for each time series. P represents the number of readings required for the time series to complete one cycle. Each missing gap w is then analyzed to determine its surrounding right pattern R, which comprises P readings, and its left pattern L, which also comprises P readings. W contains N missing values. To identify the two patterns in Sf that are most similar to R and L, we utilize the Kendall tau correlation measure (Abdi, 2007). Subsequently, we employ the algorithm outlined in the following section to complete the missing values in W. Table 2 explains the symbols and annotations.
IV. PROPOSED METHODS
We have developed a novel method for imputing missing values in a time series using
Fourier transform and a new filling algorithm. Our approach involves using Fourier
transform to determine the wavelength of each time series, followed by identifying the
sequence period for each series. This enables us to use an optimal imputation method to fill
in the gaps in the time series. Our proposed GMA method consists of four main steps:
A. Identifying the Missing Gaps
We identify the missing gaps in a time series, denoted by W = {w1, w2, w3, ...}. For each gap
w, we determine the preceding and succeeding data points, as well as the number of missing
points in the gap between them. By performing this analysis, we generate an array G that
records the number of missing data points for each gap, as well as the preceding and
succeeding data points.
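As an illustration, the gap detection step can be sketched in Python as follows (a minimal sketch; the helper name find_gaps and the use of NaN markers for missing readings are our assumptions, not part of the paper's released package):

import numpy as np

def find_gaps(series):
    # Locate runs of consecutive NaNs and record, for each gap, its start and end
    # indices, its length, and the indices of the preceding and succeeding observed
    # points (None if the gap touches a boundary of the series).
    isnan = np.isnan(np.asarray(series, dtype=float))
    gaps, i, n = [], 0, len(isnan)
    while i < n:
        if isnan[i]:
            start = i
            while i < n and isnan[i]:
                i += 1
            gaps.append({"start": start, "end": i - 1, "length": i - start,
                         "prev_index": start - 1 if start > 0 else None,
                         "next_index": i if i < n else None})
        else:
            i += 1
    return gaps

# Example: one gap of 3 consecutive missing points between indices 2 and 4
print(find_gaps([1.0, 2.0, np.nan, np.nan, np.nan, 6.0]))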
B. Time Series Analysis
After identifying the gaps in the time series S that have null values, we divide S into two parts. It is typically advisable to focus on the subset of the time series that contains complete data, which we denote as Sf, because incomplete or missing data can introduce biases and inaccuracies into the analysis. A fundamental characteristic that needs to be identified is whether the data is stationary or periodically (seasonally) repeated, and the number of readings P in the periodic cycle. The discrete Fourier transform (DFT) is a mathematical technique that analyzes the time series in the frequency domain: performing the DFT decomposes the series into its constituent frequencies and provides information about its spectral content (Oppenheim, 2010). We compute the DFT with the SciPy rfft function (community, 2023), locate the peak frequency of the signal using the argmax function, and then determine the period of the signal by taking the reciprocal of that frequency. Figure 1 shows the UCR dataset and its inverse DFT.
Figure 1. The UCR dataset (Keogh, 2021) and its inverse DFT
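For concreteness, the period estimation can be sketched as follows, assuming Sf is a NaN-free array sampled at equal intervals (the function name estimate_period is ours and not taken from the paper's package):

import numpy as np
from scipy.fft import rfft, rfftfreq

def estimate_period(sf, sample_spacing=1.0):
    # Estimate the dominant period (in number of readings) of a complete series
    # by locating the peak of its real FFT magnitude spectrum.
    sf = np.asarray(sf, dtype=float)
    spectrum = np.abs(rfft(sf - sf.mean()))   # subtract the mean so the DC bin does not dominate
    freqs = rfftfreq(len(sf), d=sample_spacing)
    peak = np.argmax(spectrum[1:]) + 1        # skip the zero-frequency bin
    return 1.0 / freqs[peak]                  # period is the reciprocal of the peak frequency

# Example: a sine wave with a 50-sample period is recovered as P = 50
t = np.arange(1000)
print(round(estimate_period(np.sin(2 * np.pi * t / 50))))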

C. Extracting the subsequences similar to the surrounding right and left patterns R, L
Once we have generated the array G in the first step, which records the missing gaps, we can use it to extract subsequences from the left and right sides of each gap w. Specifically, we extract a subsequence of length P points from the left side of the gap (denoted by L) and another subsequence of the same length from the right side of the gap (denoted by R). We then use these subsequences to search for the most similar subsequences to L and R, denoted by wl and wr, respectively. This similarity search can be performed using various techniques, such as dynamic time warping (Rakthanmanon, 2012), Pearson correlation (mapreduce., 2016), or Euclidean distance (Park, 2009). Experiments have shown that the most suitable technique for our algorithm is Kendall's tau (Gibbons, 2011), as it compares the direction, up or down, of the point values.
In general, if the data is normally distributed and the relationship between the variables is
expected to be linear, Pearson correlation may be the most appropriate technique to use. If
the data is not normally distributed or the relationship between the variables is not expected
to be linear, Euclidean distance or Kendall tau may be more appropriate. However, the
specific technique used should be selected based on the characteristics of the data and the
research question being investigated.
The Kendall rank correlation coefficient is a statistical measure that is used to determine
the degree of similarity between two sets of variables. It is a non-parametric measure that is
used to quantify the strength of the relationship between two sets based on the ranks of their
values. The coefficient ranges from -1 to 1, with -1 indicating a perfect negative correlation,
0 indicating no correlation, and 1 indicating a perfect positive correlation. Figure 2 shows
the Kendall results. The similarity pattern to the left, wl, is represented by a green rectangle.
Its corresponding similarity measure, sl, is 0.735. The similarity pattern to the right, wr, is
represented by a red rectangle. Its corresponding similarity measure, sr, is 0.540.
Figure 2. The steam flow dataset (STUMPY, 2023) and the Kendall results: wl is the most similar pattern to L (green rectangle) and wr is the most similar pattern to R (red rectangle)
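As an illustration of this similarity search, a minimal sketch using SciPy's kendalltau over a sliding window is shown below (the function name most_similar_window is an assumption; the released package may organize this differently). In practice the search would also skip windows that overlap the missing gap itself.

import numpy as np
from scipy.stats import kendalltau

def most_similar_window(sf, query):
    # Slide a window of len(query) over the complete series sf and return the
    # start index and Kendall tau score of the most similar subsequence.
    sf = np.asarray(sf, dtype=float)
    m = len(query)
    best_start, best_tau = None, -np.inf
    for start in range(len(sf) - m + 1):
        tau, _ = kendalltau(query, sf[start:start + m])
        if not np.isnan(tau) and tau > best_tau:
            best_start, best_tau = start, tau
    return best_start, best_tau

# Example usage: start_l, sl = most_similar_window(Sf, L); wl = Sf[start_l : start_l + len(L)]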

Algorithm 1: GMA Algorithm

Input: Sf  the complete sequence
       b   the start index of wr (the subsequence most similar to R)
       sr  correlation value between R and wr
       e   the end index of wl (the subsequence most similar to L)
       N   number of missing points
       sl  correlation value between L and wl

FUNCTION GMA(w, b, sr, e, sl, N)
    // Calculate the tau ratios
    T1 = sr / (sr + sl)
    T2 = sl / (sr + sl)
    // Method 1: filling the gap with mutation (alternating the two matched patterns)
    imp1 = []
    FOR i = 0 TO (FLOOR(T1 * N) - 1) DO
        imp1[i] = Sf[e + i]
    END FOR
    FOR i = FLOOR(T1 * N) TO N DO
        imp1[i] = Sf[b - (T1 * N) + i]
    END FOR
    // Method 2: filling the gap with combination (weighted blend of the two matched patterns)
    imp2 = []
    FOR i = 0 TO N DO
        imp2[i] = (T1 / N * Sf[e + i]) + (T2 / N * Sf[b - N + i])
    END FOR
    // Return the results of both methods
    RETURN (imp1, imp2)
END FUNCTION
D. Imputing the gaps
We developed two different techniques to impute missing values: mutation imputation and combination (ratio) imputation, as shown in Algorithm 1. The algorithm takes the missing gap w, the similarity pattern to the right wr and its corresponding similarity tau measure sr, the similarity pattern to the left wl and its corresponding similarity measure sl, and the number of missing points N. The algorithm uses two methods to fill in the missing gap. Method 1 employs the two similar patterns, wr and wl, to fill the missing gap. Specifically, the gap is filled by wr from 1 to sr*N/(sr+sl) and by wl from sr*N/(sr+sl) to N. The time complexity of Method 1 is approximately O(N). Method 2 combines the two similar patterns wr and wl, weighted by their correlation values sr and sl, to fill the gap. The time complexity of Method 2 is also approximately O(N). The computational complexity of the GMA function is therefore linear with respect to the length of the missing gap N.
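For illustration, the two filling strategies can be sketched in Python as follows. This is our own reading of Algorithm 1, not the code released with the paper: left_cont denotes the values of Sf that immediately follow the match of the left pattern (wl), right_lead the values that immediately precede the match of the right pattern (wr), and the blend weights T1 and T2 are applied directly to the two candidate fillings.

import numpy as np

def fill_gap(left_cont, right_lead, sr, sl, n):
    # Impute a gap of n points from the two matched patterns.
    # left_cont  : values of Sf immediately after the left match wl (at least n values)
    # right_lead : values of Sf immediately before the right match wr (at least n values)
    # sr, sl     : similarity (Kendall tau) scores of the right and left matches (sr + sl > 0 assumed)
    left_cont = np.asarray(left_cont, dtype=float)
    right_lead = np.asarray(right_lead, dtype=float)
    t1 = sr / (sr + sl)
    t2 = sl / (sr + sl)
    split = int(np.clip(np.floor(t1 * n), 0, n))

    # Method 1 (mutation): left continuation up to the split point, right lead-in afterwards
    imp1 = np.empty(n)
    imp1[:split] = left_cont[:split]
    if split < n:
        imp1[split:] = right_lead[len(right_lead) - (n - split):]

    # Method 2 (combination): element-wise blend weighted by the two tau ratios
    imp2 = t1 * left_cont[:n] + t2 * right_lead[len(right_lead) - n:]

    return imp1, imp2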
V. EXPERIMENTAL SET-UP
A. Dataset
1) The UCR_BIDMC1_2500 benchmark is a time series dataset that is part of the UCR Time
Series Anomaly Archive (Keogh, 2021). It contains 2,500 instances, each consisting of
128 observations. The dataset was collected from intensive care unit (ICU) patients, where
each instance represents the continuous physiological signals of a patient over a 6-hour
period. The anomalies in this dataset correspond to changes in the patients' physiological
conditions that require medical attention, such as cardiac arrest or shock. This benchmark
dataset is specifically designed to address common flaws present in other anomaly
detection benchmarks, including trivial and unrealistic anomaly intensity, misleading
ground truth, and run-to-failure bias. Using the inverse Fourier transform technique,
we were able to extract the periodic cycle of the UCR_BIDMC1_2500 dataset, which was
determined to be 50 readings in length. Figure 1 shows the UCR_BIDMC1_2500 dataset
and its inverse DFT.
2) The Steamgen dataset is a commonly used benchmark dataset in the field of process control and system identification. The dataset consists of 6,000 samples, each containing 19 features that describe the operating conditions and performance of the steam generator (STUMPY, 2023). These features include variables such as steam flow rate, water level, and temperature, as well as indicators of system faults and disturbances. As shown in Figure 2, we have chosen to focus on the steam flow feature. Using the inverse Fourier transform technique, we were able to extract the periodic cycle of the steam flow dataset, which was determined to be 497 readings in length.
B. Missing Data Generation
We simulated missing data in order to enable us to evaluate the efficacy of different imputation
techniques. To generate datasets with missing data, we systematically removed consecutive
values from the dataset, assuming that the deletions occurred randomly.
In the case of the UCR dataset, we created gaps of sizes 6, 21, 26, 40, and 101 points, as the periodic cycle of this dataset is 50 points. For the steam flow feature, we created gaps of sizes 7, 21, 51, and 244 points, as the periodic cycle for steam flow is 507 points. These various
missing data scenarios were simulated in order to test the performance of different imputation
approaches. Experiments have shown that if the size of the missing gap is greater than the
periodic cycle value, then the error in imputation is significantly higher using any imputation
method.
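As an illustration, consecutive missing gaps of a chosen size can be injected as follows (a minimal sketch; in our experiments the gap positions were selected at random and the removed values were kept as ground truth for evaluation):

import numpy as np

def inject_gap(series, start, size):
    # Return a copy of the series with `size` consecutive values replaced by NaN
    # starting at index `start`, together with the removed ground-truth values.
    corrupted = np.asarray(series, dtype=float).copy()
    truth = corrupted[start:start + size].copy()
    corrupted[start:start + size] = np.nan
    return corrupted, truth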
C. Comparative Imputation Methods
We have selected several widely-used imputation techniques for evaluating the effectiveness
of our proposed methods. These techniques include linear interpolation (pandas, 2023) and
polynomial interpolation (Qingkai Kong, 2020), and Full Subsequence Matching (FSM)
(Thakolpat Khampuengson, 2022). Linear interpolation is a common method for filling in missing values in a dataset by estimating the value based on the linear relationship between adjacent data points. Polynomial interpolation is a more
complex variant of this technique, where a polynomial function is used to approximate the
missing values based on the surrounding data points. FSM is a pattern-matching approach to
imputation, where the missing value is estimated by identifying similar subsequences of data
within the dataset and using them to make a prediction. This technique is useful for datasets
with repeated patterns or cyclical trends. By comparing the performance of our proposed
methods against these established techniques, we aim to demonstrate the efficacy of our
approach and provide valuable insights into the most effective methods for imputing missing
data.
D. Experimental Setting
Our experiments were conducted on a server equipped with Core i7 Intel processors running
at 2.60 GHz, 8 GB RAM, and a 250 GB SATA hard drive. We implemented our proposed
framework using the open source Python package missval, which offers a range of missing
value imputation methods, as well as visualization and performance evaluation tools. The
package is publicly available on GitHub at https://github.com/Eng-Khattab/missval. In
addition, we have made the two datasets used in this study available for public access. To
perform interpolation, we utilized the "interpolate" class from the pandas DataFrame (pandas,
2023) Python library, which offers a convenient method for filling in missing values using
interpolation techniques. Specifically, we employed a linear approach for linear interpolation
and a polynomial method with a second order polynomial for polynomial interpolation. To
perform the Full Subsequence Matching (FSM) method, we used the matrix profile Python library
called STUMPY (STUMPY, 2023).
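For reference, the two interpolation baselines can be invoked as shown below (a usage sketch; the column name "flow" and the toy values are illustrative only):

import numpy as np
import pandas as pd

df = pd.DataFrame({"flow": [1.0, 2.0, np.nan, np.nan, 5.0, 6.0]})
linear_filled = df["flow"].interpolate(method="linear")
poly_filled = df["flow"].interpolate(method="polynomial", order=2)   # second-order polynomial
print(linear_filled.tolist())
print(poly_filled.tolist())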
E. Evaluation Metrics
The performance of an imputation method is commonly evaluated by measuring its accuracy
using three widely-used metrics: root mean square error (RMSE), mean absolute error (MAE),
and Kendall's tau between the actual pattern and the imputed pattern. These metrics are
defined as follows:
1) The root mean square error (RMSE) (Bennett, et al., 2013) is a measure of the
differences between the actual and imputed values, calculated as the square root of the
average of the squared differences:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n_{\mathrm{samples}}}\sum_{i=0}^{n_{\mathrm{samples}}-1}\left(\hat{y}_i - y_i\right)^2}$$
where \(\hat{y}_i\) is the imputed (predicted) value and \(y_i\) is the actual value.

2) The mean absolute error (MAE) (Bennett, et al., 2013) is another measure of the
differences between the actual and imputed values, calculated as the average of the
absolute differences:
$$\mathrm{MAE} = \frac{1}{n_{\mathrm{samples}}}\sum_{i=0}^{n_{\mathrm{samples}}-1}\left|\hat{y}_i - y_i\right|$$

3) Kendall's tau (Abdi, 2007) is a measure of the correlation between the actual and imputed
patterns, which takes into account the order or rank of the values rather than their actual
magnitudes. It ranges between -1 (perfect negative correlation) to 1 (perfect positive
correlation), with 0 indicating no correlation:
Kendall's tau = (number of concordant pairs - number of discordant pairs) / (number of pairs)
Where a pair is concordant if the relative order of the values in the actual pattern is the same
as in the imputed pattern, and discordant if the order is different. The number of pairs is equal
to n(n-1)/2 for a dataset with n samples.
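As a usage sketch, the three metrics can be computed as follows, assuming actual and imputed are equal-length arrays of the gap values (illustrative code, not the paper's evaluation script):

import numpy as np
from scipy.stats import kendalltau

def evaluate(actual, imputed):
    # Return (RMSE, MAE, Kendall tau) between the actual and imputed gap values.
    actual = np.asarray(actual, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    rmse = np.sqrt(np.mean((imputed - actual) ** 2))
    mae = np.mean(np.abs(imputed - actual))
    tau, _ = kendalltau(actual, imputed)
    return rmse, mae, tau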

VI. RESULTS AND DISCUSSION


We employed four distinct algorithms for filling the missing data points in two datasets. For
the steam flow dataset, there are gaps of sizes 7, 21, 51, and 244 points, as the periodic cycle
for steam flow is 507 points. Our results, presented in Tables 3 and 4 and Figures 3 a and b,
demonstrate the superiority of our algorithm for larger gaps that exceed a quarter of the
identified time series period length, as outlined in Section 3. Conversely, the linear algorithm
performs exceptionally well for smaller gaps, particularly at lower gap levels. These findings
emphasize the importance of accurately determining the time period of the dataset before
imputing the missing gaps.
The results are presented in two tables and two figures for each dataset. The first table and figure show the root mean squared error (RMSE) and mean absolute error (MAE), where lower values indicate better performance. The Kendall correlation analysis results are presented in the second table and the accompanying figure. A higher correlation value indicates a stronger relationship between the imputed data and the original data, meaning that the imputed data closely aligns with the original data in terms of its characteristics. Our findings indicate that achieving better results with the Full Subsequence Matching (FSM) algorithm may require a significant number of repetitions, because the algorithm relies on random selection of the length of the right and left patterns, which can lead to variability in the outcomes.
Our algorithm employs two distinct methods to fill in missing data gaps. The first method alternates between the two matched patterns based on their tau ratios and is utilized when the gap size exceeds the periodic cycle p. Conversely, the second method combines the two patterns, weighted by their tau ratios, to fill missing gaps smaller than the periodic cycle.
Table 3. The performance indexes of the 4 methods on the steam flow dataset

Algorithm   G7 RMSE  G7 MAE   G26 RMSE  G26 MAE   G51 RMSE  G51 MAE   G244 RMSE  G244 MAE
Linear      0.58     0.51     0.68      0.55      2.6       2.3       7.2        5.9
Poly        0.65     0.57     2.6       2.3       2.4       2.19      18.9       16
FSM         7.3      6.4      0.55      0.55      3.8       3.6       7.8        8.9
GMA         3.7      3.6      0.53      0.34      1.7       0.9       5          5

Table 4. The Kendall tau results between the imputed gaps and the actual data for the steam flow dataset

Algorithm   G7     G26     G51     G244
Linear      0.86   0.58    -0.4    -0.5
Poly        0.33   -0.25   -0.19   -0.02
FSM         -0.7   0.51    0.26    -0.34
GMA         -0.4   0.6     0.48    0.05

Fig. 3a. The performance indexes (RMSE and MAE) of the 4 methods on the steam flow dataset, per gap size.
Fig. 3b. The Kendall correlation between the filled and actual data.
Fig. 3. The performance indexes of the 4 methods on the steam flow dataset.

Table 5. The performance indexes of the 4 methods on the UCR dataset

Algorithm   G6 RMSE  G6 MAE   G26 RMSE  G26 MAE   G33 RMSE  G33 MAE   G40 RMSE  G40 MAE   G101 RMSE  G101 MAE
Linear      229      209      2222      2034      2476      2007      5877      5070      10591      9527
Poly        57       50       2857      2372      2811      2139      6433      4738      6175       4921
FSM         474      416      1271      1144      1316      1074      2408      934       5318       4374
GMA         512      555      1015      830       6530      6213      1398      1319      3859       3144

Table 6. The Kendall tau results between the imputed gaps and the actual data for the UCR dataset

Algorithm   G6     G26    G33    G40     G101
Linear      0.99   0.99   0.76   0.49    0.035
Poly        0.99   0.92   0.77   0.59    0.56
FSM         0.39   0.92   0.77   -0.73   0.43
GMA         0.99   0.99   0.95   0.13    0.8

Fig. 4a. The performance indexes (RMSE and MAE) of the 4 methods on the UCR dataset, per gap size.
Fig. 4b. The Kendall correlation between the filled and actual data.
Fig. 4. The performance indexes of the 4 methods on the UCR dataset.
The results, included in Tables 5 and 6 and Figures 4a and 4b, show the effectiveness of our model when applied to the UCR dataset, with the exception of gap 26. It is worth noting that gap 26 coincides with an anomaly in the original data. The values presented in Tables 5 and 6 may appear large because of the wide range of the dataset, which spans from -10,000 to 30,000, as shown in Figure 1a. As a result, the errors are also expressed in large values.

VII. CONCLUSION
This paper presents the Gap Imputing Algorithm (GMA), a novel method for imputing
missing values in time series data. GMA is specifically designed to address the challenging
problem of consecutively missing values with varying gap distances in time series analysis.
Initially, GMA identifies sequences of missing values and determines the periodicity of the
time series. It then searches for the most similar subsequences in the historical data to fill in
the missing gap. GMA employs two methods to impute the missing data gaps, depending on
the gap size. If the gap size exceeds the periodic cycle p, GMA utilizes the first method,
which involves alternating between the two most similar patterns to the missing gap
terminals based on their correlation scale. On the other hand, if the missing gap size is less
than the periodic cycle, the second method is used. This involves combining the two similar
patterns based on their correlation scale with the most similar patterns to fill in the missing
data. Experimental results demonstrate that GMA outperforms existing methods in terms of
accuracy, particularly for datasets with long periodic patterns and larger missing gaps. Using
the periodic cycle to determine the pattern length leads to a more precise and accurate result.
In contrast, other algorithms require multiple runs because they rely on random selection of
the length of the right and left patterns, which can result in variability in the outcomes.
Overall, this research contributes to the development of more effective and efficient missing
value imputation techniques in time series data analysis. The practical implications of these
findings are significant, as accurate imputation of missing data is crucial for a wide range of
applications.

REFERENCES
Abdi, H. (2007). The Kendall Rank Correlation Coefficient. Encyclopedia of Measurement and
Statistics.
Bennett, N., Croke, B., Guariso, G., Guillaume, J. H., Jakeman, A., Marsili-Libelli, S., . . .
Norton, J. (2013). Characterising performance of environmental models. Environmental
Modelling & Software.
Bokde N, B. M. (2018). A novel imputation methodology for time series based on pattern
sequence forecasting. Pattern Recogn Lett , 116, 88–96.
community, T. S. (2023, 4 10). scipy.fft.rfft. (The SciPy community) Retrieved 2023, from
https://docs.scipy.org/doc/scipy/reference/generated/scipy.fft.rfft.html
Dwivedi D, M. U. (2022). Imputation of contiguous gaps and extremes of subhourly
groundwater time series using random forests. J Mach Learn Model Comput , 3.
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.
Figueroa-García, J. C.–P. (2022). A genetic algorithm for multivariate missing data imputation.
Information Sciences.
Gibbons, J. D. (2011). Nonparametric statistical inference. CRC Press, 14.
Irfan Pratama, A. E. (2016). A review of missing values handling methods on time-series data.
International Conference on Information Technology Systems and Innovation (ICITSI).

Jianlong Xu, K. W. (2021). FM-GRU: A Time Series Prediction Method for Water Quality
Based on seq2seq Framework. MDPI, 13.
Jonathan P. Dekermanjian, E. S. (2022). Mechanism-aware imputation: a two-step approach in
handling missing values in metabolomics. BMC Bioinformatics.
José Cambronero, J. K. (2017). Query Optimization for Dynamic Imputation. the VLDB
Endowment, 10.
K, L. W. (2019). A study on bayesian principal component analysis for addressing missing
rainfall water. Water Resour Manage, 33, 2615–2628.
Keogh, R. W. (2021). Current Time Series Anomaly Detection Benchmarks are Flawed and
are Creating the Illusion of Progress. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA
ENGINEERING .
Khayati, M. B.-M. ( 2015). Using lowly correlated time series to recover missing values in
time series: A comparison between SVD and CD. In Advances in Spatial and Temporal
Databases . 14th International Symposium, SSTD .
Kulanuwat L, C. C.-A. (2021). Anomaly detection using a sliding window technique and data
imputation. Water , 13.
Liao, W., Bak-Jensen, B., Pillai, J. R., Yang, D., & Wang, Y. (2021). Data-driven Missing Data
Imputation for Wind Farms Using Context Encoder. Journal of Modern Power Systems and
Clean Energy.
Little, R. J. (1992). Regression with missing X's: a review. Journal of the American Statistical
Association.
mapreduce., A. e. (2016). Gu, J., and Zhang. Journal of Parallel and Distributed Computing,
95, 54-62.
Mazumder, R. H. (2010). Spectral regularization algorithms for learning large incomplete
matrices. Journal of Machine Learning Research, 11.
Mourad Khayati, A. L. (2020). Mind the Gap: An Experimental Evaluation of Imputation of
Missing Values Techniques in Time Series. VLDB Endowment, 13.
Oppenheim, A. V. (2010). Discrete-time signal processing (3rd ed.). Upper Saddle River: NJ:
Pearson Prentice Hall.
pandas. (2023, 4 10). pandas.DataFrame.interpolate. (pandas) Retrieved from pandas:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html
Park, H. a. (2009). A simple and fast algorithm for K-medoids clustering. Expert systems with
applications, 36, 3336-3341.
Paternoster, R. B. (1998). Using the correct statistical test for the equality of regression
coefficients. Criminology, 859-866, 36.
PHAN, T.-T.-H. (2020). Machine Learning for Univariate Time Series impution. Preprint
MAPR .
Qingkai Kong, T. S. (2020). Python Programming and Numerical Methods - A Guide for
Engineers and Scientists. Elsevier.
Rakthanmanon, T. K. (2012). Searching and mining trillions of time series subsequences under
dynamic time warping. the 18th ACM .
Schulz, K. F. (2002). Allocation concealment in randomised trials: defending against
deciphering. . The Lancet, 359, 614-618.
Shu, X. P.-r. (2014). Robust orthonormal subspace learning: Efficient recovery of corrupted
low-rank matrices. . IEEE Conference on Computer Vision and Pattern Recognition, CVPR .
Columbus, OH, USA.
STUMPY. (2023, 4 10). Steamgen Example. (STUMPY) Retrieved 2023, from STUMPY:
https://stumpy.readthedocs.io/en/latest/Tutorial_The_Matrix_Profile.html

Thakolpat Khampuengson, a. W. (2022). Novel Methods for Imputing Missing Values in
Water Level. Water Resources Management.
Trubitsyna, R. S. (2022). DEGAIN: Generative-Adversarial-Network-Based Missing Data
Imputation. Information, 13.
Wellenzohn, K. B. (2017). Continuous imputation of missing values in streams of pattern-
determining time series. the 20th International Conference on Extending Database
Technology, EDBT.
Xiuwen Yi, Y. Z. (2015). ST-MVL: Filling Missing Values in Geo-Sensory Time Series Data.
Conference on Artificial Intelligence.
Yi, X. Z. (2016). ST-MVL: Filling Missing Values in Geo-sensory Time Series Data. the 25th
International Joint Conference on Artificial Intelligence.
YU ZHENG, L. C. (2014). Urban Computing: Concepts, Methodologies, and Applications.
ACM Transactions on Intelligent Systems and Technology, 38.
Zhang, Y. a. (2021). Dual-head sequence-to-sequence model for imputing missing data in
multivariate time series. IEEE Journal of Biomedical and Health Informatics, 25 , 1692-1702.

