GMA: Gap Imputing Algorithm for Time Series Missing Values
Abstract: Data collected from the environment in computer engineering may include missing
values due to various factors, such as lost readings from sensors caused by
communication errors or power outages. Missing data can result in inaccurate analysis
or even false alarms. It is therefore essential to identify missing values and correct
them as accurately as possible to ensure the integrity of the analysis and the
effectiveness of any decision-making based on the data. This paper presents a new
approach, the Gap Imputing Algorithm (GMA), for imputing missing values in time
series data. The Gap Imputing Algorithm (GMA) identifies sequences of missing values
and determines the periodic time of the time series. Then, it searches for the most
similar subsequence from historical data. Unlike previous work, GMA supports any
type of time series and is resilient to consecutive missing values with different gap
distances. The experimental findings, which were based on both real-world and
benchmark datasets, demonstrate that the GMA framework proposed in this study
outperforms other methods in terms of accuracy. Specifically, our proposed method
achieves an accuracy score that is 5 to 20% higher than that of other methods.
Furthermore, the GMA framework is well-suited to handling missing gaps with larger
distances, and it produces more accurate imputations, particularly for datasets with
strong periodic patterns.
Table 1: Related work summary

Category | Reference | Dataset | Method summary
Univariate methods | (PHAN, 2020) | ...003 (NNGC) and Ba Tri temperature | MLBUI: the data before and after the gap are forecast forward and backward with machine-learning models and the two forecasts are averaged.
Univariate methods | (Paternoster, 1998) | Scientific research data on factors causing crime for males and females | The "most frequent value" imputation method replaces missing data with the mode of the variable. It is typically used for categorical variables or numerical variables that have a limited range of values.
Univariate methods | (Kulanuwat L, 2021) | Water level data from telemetry stations across Thailand | The study compared three imputation methods: mean imputation, regression imputation, and multiple imputation. Multiple imputation was the most effective and produced less biased estimates than the other two methods.
Pattern matching | (Yi, 2016) | Air quality and meteorological data in Beijing, China | The proposed ST-MVL method fills missing readings in geo-sensory time series data by considering temporal and spatial correlations. It uses empirical statistical models and data-driven algorithms to handle different types of missing data cases.
Pattern matching | (Wellenzohn, 2017) | 2 datasets: SBR meteorological time series in South Tyrol and Flights dataset | A clustering algorithm groups similar time series, and the resulting groups are used to impute missing values. The approach is specifically designed for continuous streams of time series data.
Pattern matching | (Zhang, 2021) | Data collected from an in-situ monitoring station in the Mulgrave-Russell catchment, Australia | A Seq2Seq model with a dual-head architecture (an encoder and two decoders, one per direction of the time series) imputes missing values. Seq2Seq models are recurrent neural networks (RNNs) applicable to sequence prediction and generation.
Matrix completion principles | (Shu, 2014) | Face images under varying illuminations: 168x192 resolution, 55 frames | A robust low-rank matrix recovery technique that handles data corruption and uses orthonormal subspace learning to estimate a low-rank matrix from incomplete or corrupted data. It outperformed existing methods in experiments and can be applied in various applications such as image ...
Machine learning imputing | (Dwivedi D, 2022) | Measured water levels in 7 monitoring wells in the USA | A random forest (RF) algorithm imputes missing values in a dataset with continuous variables.
Machine learning imputing | (Bokde N, 2018) | Three field-based time series: traffic speed data, water flow rate data, and the Nottem dataset | A hybrid approach: regression imputation predicts missing input variables and data augmentation creates synthetic data points for missing output variables, applied to a dataset with missing values in both inputs and outputs.
Machine learning imputing | (Figueroa-García, 2022) | Seven datasets from the UCI and KEEL repositories | A genetic algorithm imputes missing values in datasets with multiple missing observations and different data types by minimizing a multi-objective fitness function based on the Minkowski distance of statistical measures between available and completed data.
Machine learning imputing | (Jonathan P. Dekermanjian, 2022) | Two untargeted metabolomics datasets from the COPDGene cohort | A two-step approach: a random forest classifier identifies the missing-data mechanism, then mechanism-specific algorithms impute the values, reducing bias and producing values closer to the original data.
Machine learning imputing | (Trubitsyna, 2022) | Letter and SPAM datasets (https://archive.ics.uci.edu/) | A method that estimates missing values using a generative adversarial network (GAN) model.
A. Univariate Methods
Hong Phan (PHAN, 2020) proposed a method called "MLBUI" for filling consecutive missing values in univariate time series using machine learning. The data before and after the gap are transformed into multivariate time series, followed by forward and backward forecasting with ML methods to estimate the missing values; the gap is then imputed by averaging the two forecast sets. Paternoster (Paternoster, 1998) explained that the "most frequent value" imputation technique consists of substituting missing data with the value that appears most frequently for the given variable. This strategy is typically applied to categorical variables or numerical variables with a finite set of possible values. Kulanuwat et al. (Kulanuwat L, 2021) studied missing data imputation in water level data from telemetry stations using three methods: mean imputation, regression imputation, and multiple imputation. Mean imputation replaces missing values with the mean of the available data, while regression imputation uses a regression model to predict missing values. Multiple imputation generates multiple plausible imputed datasets using a statistical model and combines them into a single estimate. The study found that multiple imputation was the most effective method, producing estimates closer to the true values and with less bias than the other methods.
B. Pattern Matching
Yi et al. (Yi, 2016) proposed a method called spatio-temporal multi-view-based learning (ST-MVL) to fill missing readings in geo-sensory time series data. The method takes into account the temporal correlation between readings at different timestamps in the same series and the spatial correlation between different time series. It combines empirical statistical models (Inverse Distance Weighting and Simple Exponential Smoothing) with data-driven algorithms (user-based and item-based collaborative filtering) to handle different types of missing data cases, and is evaluated on Beijing air quality and meteorological data. K. Wellenzohn (Wellenzohn, 2017) proposed a method for continuously imputing missing values in streams of time series data. The method, called CIViC (Continuous Imputation of Values in time series with Clustering), uses a clustering algorithm to group similar time series and then uses the grouped time series to impute missing values. The authors evaluated their method on several real-world datasets and compared it to other imputation methods. Zhang and Thorburn (Zhang, 2021) proposed a dual-head sequence-to-sequence (Seq2Seq) model for imputing missing values in time series data. Seq2Seq models are a type of recurrent neural network (RNN) that can be used for sequence prediction and generation tasks. The dual-head architecture includes an encoder and two decoders, where the two decoders correspond to the forward and backward directions of the time series.
C. Matrix Completion Principles
Shu, Porikli, and Ahuja (Shu, 2014) proposed a method for robust low-rank matrix recovery that can handle data corruption. Low-rank matrix recovery is a fundamental problem in computer vision and machine learning, and it involves estimating a low-rank matrix from corrupted or incomplete data. The proposed method is based on orthonormal subspace learning, a technique for finding the principal subspace of a given set of data. Mazumder, Hastie, and Tibshirani (Mazumder, 2010) proposed a method for matrix completion, which involves recovering missing entries in a matrix. The authors noted that matrix completion has important applications in various fields, including recommender systems and collaborative filtering. The proposed method is based on spectral regularization and involves solving a convex optimization problem. Khayati, Böhlen, and Cudré-Mauroux (Khayati, 2015) compared two methods, Singular Value Decomposition (SVD) and Correlation Distance (CD), for recovering missing values in time series datasets. The authors noted that missing data is a common problem in time series and can have a significant impact on subsequent analysis. SVD and CD differ in how they select the subset of the data used for imputation. The authors evaluated the two methods on several datasets and found that CD performed better than SVD in terms of accuracy and computational efficiency, especially when the time series had low correlation. Xu et al. (Jianlong Xu, 2021) proposed a method for imputing missing data in high-dimensional datasets by combining low-rank matrix completion and sparse representation. The authors argued that the high dimensionality of the data and the sparsity of the missing values require an approach that can effectively capture the underlying structure of the data. The method first uses low-rank matrix completion to impute the missing values in the data matrix, leveraging the assumption that the data has a low-rank structure, and then employs sparse representation to refine the imputed data, assuming that the data can be represented as a linear combination of a few basis elements. Ai and Kuok (K, 2019) suggested employing Bayesian Principal Component Analysis (BPCA) to impute missing values in rainfall data. BPCA merges Principal Component Analysis (PCA) with Bayesian modeling to estimate missing data points in a dataset.
D. Machine Learning Imputing
Dwivedi (Dwivedi D, 2022) used Random Forest (RF) to impute continuous missing values in a dataset containing continuous variables. The performance of RF imputation was compared with other imputation methods, such as k-nearest neighbors (KNN) and mean imputation. Bokde et al. (Bokde N, 2018) proposed a method for imputing missing values using a hybrid approach that combines regression imputation and data augmentation. The study dealt with a dataset that had missing values in both the input and output variables. Regression imputation was used to impute the missing input values by predicting them from the available data, while data augmentation, which creates synthetic data points, was used to fill in the missing output values. A genetic algorithm approach to estimate missing data in multivariate databases is proposed in (Figueroa-García, 2022). Genetic algorithms are effective at handling multiple missing observations and different types of data, unlike traditional methods that only deal with univariate continuous data. The proposed algorithm minimizes a new multi-objective fitness function based on the Minkowski distance of means, variances, covariances, and skewness between available and completed data. The approach is compared to the EM algorithm and auxiliary regressions using a continuous/discrete dataset, and benchmarked against seven datasets. Dekermanjian et al. (Jonathan P. Dekermanjian, 2022) designed an imputation algorithm to handle missing values in metabolomics datasets, which are often caused by mechanisms such as instrument detection limits, data collection and processing conditions, and random factors. The algorithm takes a mechanism-aware approach and consists of two steps. In the first step, a random forest classifier classifies the missing mechanism for each missing value in the dataset. In the second step, missing values are imputed using mechanism-specific imputation algorithms for MAR/MCAR or MNAR. Simulations using complete data and different missing patterns showed that the two-step approach reduced bias and provided imputations closer to the original data compared to using a single imputation algorithm for all missing values. Overall, this mechanism-aware imputation algorithm offers a promising solution for handling missing values in metabolomics datasets and improving downstream analyses. Trubitsyna et al. (Trubitsyna, 2022) developed a method for estimating missing values in datasets using a generative adversarial network (GAN) based model named DEGAIN. The performance of DEGAIN is evaluated on two publicly available datasets, Letter Recognition and SPAM, and compared against existing methods.
In this paper, we propose a novel approach for computing the missing values in incomplete subsequences, called the Gap Imputing Algorithm (GMA). We divide the time series into two subsequences: one that contains the complete data and another that contains the missing gaps. To fill in the missing data, we use a pattern-matching approach that analyzes the similarity between the complete and incomplete subsequences. Specifically, we imitate the pattern of the complete subsequence to recreate the missing data.
Table 2: List of abbreviations

Symbol | Definition
GMA | The Gap Imputing Algorithm
MAR | Missing at Random
MCAR | Missing Completely at Random
MNAR | Missing Not at Random
S | The total time series
Sf | The subsequence with no missing values
Sm | The subsequence with missing values
P | The number of readings in the periodic sequence for the longest component within the time series
W | The missing gaps
R | The right pattern for the missing gap
L | The left pattern for the missing gap
N | The number of readings in the missing gap
wl | The pattern in Sf most similar to L
wr | The pattern in Sf most similar to R
b | The start index of wr
e | The end index of wl
sr | Correlation value between R and wr
sl | Correlation value between L and wl
RMSE | Root mean squared error
MAE | Mean absolute error
FSM | Full Subsequence Matching algorithm
I. OVERVIEW

We have developed a method to recover missing values in an incomplete time series S, where readings are taken at equal intervals. Our approach involves identifying gaps in S that have null values and dividing S into two parts: Sf, which has no missing values, and Sm, which contains the missing gaps W. We then apply the Fourier transform (Oppenheim, 2010) to Sf in order to obtain the number of readings in the periodic sequence P for each time series. P represents the number of readings required for the time series to complete one cycle. Each missing gap w is then analyzed to determine its surrounding right pattern R, which comprises P readings, and its left pattern L, which also comprises P readings. W contains N missing values. To identify the two patterns in Sf that are most similar to R and L, we use the Kendall tau correlation measure (Abdi, 2007). We then employ the algorithm outlined in the following section to complete the missing values in W. Table 2 explains the symbols and notation.
II. PROPOSED METHODS

We have developed a novel method for imputing missing values in a time series using the Fourier transform and a new filling algorithm. Our approach uses the Fourier transform to determine the wavelength of each time series and, from it, the sequence period of each series. This enables us to use an optimal imputation method to fill in the gaps in the time series. Our proposed GMA method consists of four main steps:
A. Identifying the Missing Gaps

We identify the missing gaps in a time series, denoted by W = {w1, w2, w3, ...}. For each gap w, we determine the preceding and succeeding data points, as well as the number of missing points in the gap between them. From this analysis we generate an array G that records, for each gap, the number of missing data points together with the preceding and succeeding data points.
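As an illustration, the gap scan can be implemented as a single pass over the series. The sketch below is a minimal example under our own assumptions (missing readings stored as NaN, one record per gap); the helper name find_gaps and the returned fields are our own choices, not the authors' released code.

```python
import numpy as np
import pandas as pd

def find_gaps(series: pd.Series):
    """Return one record per run of consecutive NaNs: its start/end positions,
    its length N, and the last value before / first value after the gap."""
    isna = series.isna().to_numpy()
    gaps = []
    i = 0
    while i < len(isna):
        if isna[i]:
            start = i
            while i < len(isna) and isna[i]:
                i += 1
            end = i - 1
            gaps.append({
                "start": start,
                "end": end,
                "n_missing": end - start + 1,
                "prev_value": series.iloc[start - 1] if start > 0 else np.nan,
                "next_value": series.iloc[end + 1] if end + 1 < len(series) else np.nan,
            })
        else:
            i += 1
    return gaps

# Example: two gaps of lengths 2 and 1
s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, np.nan, 7.0])
print(find_gaps(s))
```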
B. Time Series Analysis

After identifying the gaps in the time series S that have null values, we divide S into two parts. It is advisable to focus on the subset of the time series that contains complete data, which we denote Sf, because incomplete or missing data can introduce biases and inaccuracies into the analysis. A fundamental characteristic to identify is whether the data is stationary or periodically (seasonally) repeated, and the number of readings P in the periodic cycle. The discrete Fourier transform (DFT) analyzes the time series in the frequency domain: it decomposes the series into its constituent frequencies and provides information about its spectral content (Oppenheim, 2010). We use the DFT output as input to the inverse DFT Python function (community, 2023); this step finds the peak frequency of the signal using the argmax function and then determines the period of the signal by taking the reciprocal of that frequency. Figure 1 shows the UCR dataset and its inverse DFT.
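The period detection step can be sketched as follows. This is a minimal illustration of the idea (dominant frequency via the real FFT, period as its reciprocal), assuming a uniformly sampled series with the gaps already excluded from the analysis; it is not the exact code used by the authors.

```python
import numpy as np

def estimate_period(values: np.ndarray) -> int:
    """Estimate the dominant period P (in number of readings) of a 1-D series."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()                      # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))     # magnitude spectrum of the real signal
    freqs = np.fft.rfftfreq(len(x), d=1)  # frequencies in cycles per reading
    peak = np.argmax(spectrum[1:]) + 1    # skip the zero-frequency bin
    return int(round(1.0 / freqs[peak]))  # period = reciprocal of the peak frequency

# Example: a noisy sine wave with a true period of 50 readings
t = np.arange(1000)
signal = np.sin(2 * np.pi * t / 50) + 0.1 * np.random.randn(len(t))
print(estimate_period(signal))  # approximately 50
```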
Figure 1: The UCR dataset (Keogh, 2021) and its inverse DFT.
C. Extracting the Subsequences Similar to the Surrounding Right and Left Patterns R and L

Once the array G that records the missing gaps has been generated in the first step, we use it to extract subsequences from the left and right sides of each gap w. Specifically, we extract a subsequence of length P points from the left side of the gap (denoted L) and another subsequence of the same length from the right side of the gap (denoted R). We then search Sf for the subsequences most similar to L and R, denoted wl and wr, respectively. This similarity search can be performed with various techniques, such as dynamic time warping (Rakthanmanon, 2012), Pearson correlation (mapreduce., 2016), or Euclidean distance (Park, 2009). Our experiments have shown that the most suitable technique for our algorithm is Kendall tau (Gibbons, 2011), as it compares the direction of the points (up or down) rather than their magnitudes.

In general, if the data is normally distributed and the relationship between the variables is expected to be linear, Pearson correlation may be the most appropriate technique. If the data is not normally distributed or the relationship is not expected to be linear, Euclidean distance or Kendall tau may be more appropriate. The specific technique should be selected based on the characteristics of the data and the research question being investigated.

The Kendall rank correlation coefficient is a non-parametric statistical measure of the similarity between two sets of variables; it quantifies the strength of their relationship based on the ranks of their values. The coefficient ranges from -1 to 1, with -1 indicating a perfect negative correlation, 0 indicating no correlation, and 1 indicating a perfect positive correlation. Figure 2 shows the Kendall results: the pattern most similar to the left pattern, wl, is marked by a green rectangle and has similarity sl = 0.735, while the pattern most similar to the right pattern, wr, is marked by a red rectangle and has similarity sr = 0.540.
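A sliding-window similarity search over Sf using Kendall's tau can be sketched as follows. This is an illustrative implementation under our own assumptions (a brute-force scan with scipy.stats.kendalltau and a window of length P), not the paper's exact code.

```python
import numpy as np
from scipy.stats import kendalltau

def most_similar_window(sf: np.ndarray, pattern: np.ndarray):
    """Scan Sf with a window the length of `pattern` and return the start index
    and Kendall tau of the best-matching window."""
    p = len(pattern)
    best_idx, best_tau = -1, -np.inf
    for start in range(len(sf) - p + 1):
        window = sf[start:start + p]
        tau, _ = kendalltau(pattern, window)
        if not np.isnan(tau) and tau > best_tau:
            best_idx, best_tau = start, tau
    return best_idx, best_tau

# Example: find the window of Sf that best matches the left pattern L
sf = np.sin(2 * np.pi * np.arange(500) / 50)
L = sf[120:170] + 0.05 * np.random.randn(50)   # a noisy copy of one cycle
idx, tau = most_similar_window(sf, L)
print(idx, round(tau, 3))
```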
Figure 2: The steam flow dataset (STUMPY, 2023) and the Kendall results: wl is the pattern most similar to L (green rectangle) and wr is the pattern most similar to R (red rectangle).
Algorithm 1: GMA Algorithm

Input: Sf  the complete sequence
       b   the start index of wr (the match for R)
       sr  correlation value between R and wr
       e   the end index of wl (the match for L)
       sl  correlation value between L and wl
       N   the number of missing points

FUNCTION GMA(Sf, b, sr, e, sl, N)
    // Calculate the tau ratios
    T1 = sr / (sr + sl)
    T2 = sl / (sr + sl)

    // Method 1: fill the gap by alternating between the two matched patterns
    imp1 = []
    FOR i = 0 TO (FLOOR(T1 * N) - 1) DO
        imp1[i] = Sf[e + i]
    END FOR
    FOR i = FLOOR(T1 * N) TO N - 1 DO
        imp1[i] = Sf[b - FLOOR(T1 * N) + i]
    END FOR

    // Method 2: fill the gap by combining the two matched patterns
    imp2 = []
    FOR i = 0 TO N - 1 DO
        imp2[i] = (T1 * Sf[e + i]) + (T2 * Sf[b - N + i])
    END FOR

    // Return the results of both methods
    RETURN (imp1, imp2)
END FUNCTION
D. Imputing the Gaps

We developed two techniques to impute the missing values, alternating imputation and ratio imputation, as shown in Algorithm 1. The algorithm takes the complete subsequence Sf, the pattern wr most similar to the right pattern R together with its Kendall tau score sr, the pattern wl most similar to the left pattern L together with its score sl (through their indices b and e in Sf), and the number of missing points N. Method 1 employs the two similar patterns, wr and wl, to fill the missing gap: the gap is split in proportion to sr and sl, and each portion is copied from the neighborhood of the corresponding matched pattern, as detailed in Algorithm 1. The time complexity of Method 1 is approximately O(N). Method 2 combines the two similar patterns wr and wl, weighted by their correlation values sr and sl, to fill the gap; its time complexity is also approximately O(N). The computational complexity of the GMA function is therefore linear in the length of the missing gap N.
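A compact Python rendering of Algorithm 1 might look like the sketch below. It follows our reading of the listing (e is the end index of the match for the left pattern, b the start index of the match for the right pattern, and Method 2 is a convex combination weighted by T1 and T2); variable names and the exact index conventions are our own interpretation rather than the authors' released implementation.

```python
import numpy as np

def gma_fill(sf: np.ndarray, b: int, sr: float, e: int, sl: float, n: int):
    """Return the two candidate imputations for a gap of n points.

    sf     : the complete subsequence Sf
    e      : end index of wl, the best match for the left pattern L
    b      : start index of wr, the best match for the right pattern R
    sr, sl : Kendall tau of wr with R and of wl with L
    """
    t1 = sr / (sr + sl)              # weight associated with the right-pattern match
    t2 = sl / (sr + sl)              # weight associated with the left-pattern match
    split = int(np.floor(t1 * n))

    # Method 1 (alternating): continue wl forward, then copy from the start of wr
    imp1 = np.empty(n)
    imp1[:split] = sf[e:e + split]
    imp1[split:] = sf[b:b + (n - split)]

    # Method 2 (ratio): weighted combination of the two neighbourhoods
    after_wl = sf[e:e + n]           # the n readings following the match for L
    before_wr = sf[b - n:b]          # the n readings preceding the match for R
    imp2 = t1 * after_wl + t2 * before_wr

    return imp1, imp2
```

In practice, the choice between the two outputs follows the rule described in the results section: the alternating result is used when the gap exceeds the periodic cycle, and the combined result otherwise.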
III. EXPERIMENTAL SETUP

A. Dataset

1) The UCR_BIDMC1_2500 benchmark is a time series dataset that is part of the UCR Time Series Anomaly Archive (Keogh, 2021). It contains 2,500 instances, each consisting of 128 observations. The dataset was collected from intensive care unit (ICU) patients, where each instance represents the continuous physiological signals of a patient over a 6-hour period. The anomalies in this dataset correspond to changes in the patients' physiological conditions that require medical attention, such as cardiac arrest or shock. This benchmark is specifically designed to address common flaws in other anomaly detection benchmarks, including trivial and unrealistic anomaly intensity, misleading ground truth, and run-to-failure bias. Using the inverse Fourier transform technique, we extracted the periodic cycle of the UCR_BIDMC1_2500 dataset, which was determined to be 50 readings long. Figure 1 shows the UCR_BIDMC1_2500 dataset and its inverse DFT.

2) The Steamgen dataset is a commonly used benchmark in the field of process control and system identification. The dataset consists of 6,000 samples, each containing 19 features that describe the operating conditions and performance of the steam generator (STUMPY, 2023). These features include variables such as steam flow rate, water level, and temperature, as well as indicators of system faults and disturbances. As shown in Figure 2, we have chosen to focus on the steam flow feature. Using the inverse Fourier transform technique, we extracted the periodic cycle of the steam flow series, which was determined to be 497 readings long.

B. Missing Data Generation

We simulated missing data in order to evaluate the efficacy of different imputation techniques. To generate datasets with missing data, we systematically removed consecutive values from the dataset, assuming that the deletions occurred at random positions. For the UCR dataset, we created gaps of sizes 6, 21, 26, 40, and 101 points, as the periodic cycle of this dataset is 50 points. For the steam flow feature, we created gaps of sizes 7, 21, 51, and 244 points, as the periodic cycle for steam flow is 507 points. These missing data scenarios were simulated in order to test the performance of the different imputation approaches. Experiments have shown that if the size of the missing gap is greater than the periodic cycle value, the imputation error is significantly higher for any imputation method.
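The gap simulation can be reproduced in a few lines of code. The sketch below is an illustrative version under our own assumptions (one gap per experiment, carved out of a copy of the series at a random start position); it is not the authors' exact generator.

```python
import numpy as np
import pandas as pd

def make_gap(series: pd.Series, gap_size: int, rng=None):
    """Return a copy of `series` with `gap_size` consecutive values set to NaN,
    together with the randomly chosen start position of the gap."""
    rng = rng or np.random.default_rng()
    start = int(rng.integers(1, len(series) - gap_size - 1))
    corrupted = series.copy()
    corrupted.iloc[start:start + gap_size] = np.nan
    return corrupted, start

# Example: the steam-flow gap sizes used in the experiments
s = pd.Series(np.sin(2 * np.pi * np.arange(3000) / 497))
for gap in (7, 21, 51, 244):
    corrupted, start = make_gap(s, gap)
    print(gap, start, corrupted.isna().sum())
```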
C. Comparative Imputation Methods

We selected several widely used imputation techniques for evaluating the effectiveness of our proposed methods: linear interpolation (pandas, 2023), polynomial interpolation (Qingkai Kong, 2020), and Full Subsequence Matching (FSM) (Thakolpat Khampuengson, 2022). Linear interpolation is a common method for filling in missing values by estimating each value from the linear relationship between adjacent data points. Polynomial interpolation is a more complex variant in which a polynomial function is used to approximate the missing values based on the surrounding data points. FSM is a pattern-matching approach to imputation, where the missing value is estimated by identifying similar subsequences within the dataset and using them to make a prediction; this technique is useful for datasets with repeated patterns or cyclical trends. By comparing the performance of our methods against these established techniques, we aim to demonstrate the efficacy of our approach and provide insight into the most effective methods for imputing missing data.

D. Experimental Setting

Our experiments were conducted on a server equipped with an Intel Core i7 processor running at 2.60 GHz, 8 GB of RAM, and a 250 GB SATA hard drive. We implemented our proposed framework using the open-source Python package missval, which offers a range of missing value imputation methods, as well as visualization and performance evaluation tools. The package is publicly available on GitHub at https://github.com/Eng-Khattab/missval. In addition, we have made the two datasets used in this study publicly available. To perform interpolation, we used the "interpolate" method of the pandas DataFrame (pandas, 2023), which offers a convenient way to fill in missing values using interpolation techniques. Specifically, we employed the linear method for linear interpolation and a second-order polynomial for polynomial interpolation. To perform the Full Subsequence Matching (FSM) method, we used the matrix profile Python library STUMPY (STUMPY, 2023).
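The two interpolation baselines can be invoked directly through pandas, as sketched below; the series and gap here are placeholders, and the calls mirror the settings described above (linear, and polynomial of order 2, the latter requiring SciPy).

```python
import numpy as np
import pandas as pd

s = pd.Series(np.sin(2 * np.pi * np.arange(300) / 50))
s.iloc[100:121] = np.nan                       # a 21-point gap

linear_filled = s.interpolate(method="linear")
poly_filled = s.interpolate(method="polynomial", order=2)

print(linear_filled.iloc[100:121].head())
print(poly_filled.iloc[100:121].head())
```

The FSM baseline, by contrast, builds on the matrix profile computed by STUMPY rather than on interpolation.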
E. Evaluation Metrics

The performance of an imputation method is commonly evaluated by measuring its accuracy with three widely used metrics: the root mean square error (RMSE), the mean absolute error (MAE), and Kendall's tau between the actual pattern and the imputed pattern. These metrics are defined as follows:

1) The root mean square error (RMSE) (Bennett, et al., 2013) measures the differences between the actual and imputed values, calculated as the square root of the average of the squared differences:

RMSE = \sqrt{\frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} (\hat{y}_i - y_i)^2}

2) The mean absolute error (MAE) is the average of the absolute differences between the actual and imputed values:

MAE = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} \lvert \hat{y}_i - y_i \rvert

3) Kendall's tau (Abdi, 2007) measures the correlation between the actual and imputed patterns, taking into account the order or rank of the values rather than their magnitudes. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no correlation:

Kendall's tau = (number of concordant pairs - number of discordant pairs) / (number of pairs)

where a pair is concordant if the relative order of the values in the actual pattern is the same as in the imputed pattern, and discordant if the order is different. The number of pairs is n(n - 1)/2 for a dataset with n samples.
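For reference, the three metrics can be computed as in the following sketch, assuming the true and imputed gap values are available as NumPy arrays; this is a generic illustration rather than code from the paper.

```python
import numpy as np
from scipy.stats import kendalltau

def evaluate_imputation(actual: np.ndarray, imputed: np.ndarray):
    """Return RMSE, MAE, and Kendall's tau between actual and imputed gap values."""
    diff = imputed - actual
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    mae = float(np.mean(np.abs(diff)))
    tau, _ = kendalltau(actual, imputed)
    return rmse, mae, tau

actual = np.array([1.0, 2.0, 3.0, 4.0])
imputed = np.array([1.1, 1.9, 3.2, 3.8])
print(evaluate_imputation(actual, imputed))
```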
IV. RESULTS AND DISCUSSION

We employed four distinct algorithms to fill the missing data points in the two datasets. For the steam flow dataset, there are gaps of sizes 7, 21, 51, and 244 points, as the periodic cycle for steam flow is 507 points. Our results, presented in Tables 3 and 4 and Figures 3a and 3b, demonstrate the superiority of our algorithm for larger gaps that exceed a quarter of the identified time series period length, as outlined in Section III. Conversely, the linear algorithm performs exceptionally well for smaller gaps. These findings emphasize the importance of accurately determining the time period of a dataset before imputing the missing gaps.

The results are presented in two tables and two figures for each dataset. The first table and figure show the root mean squared error (RMSE) and mean absolute error (MAE), where lower values indicate better performance. The Kendall correlation results are presented in the second table and its accompanying figure; a higher correlation value indicates a stronger relationship between the imputed data and the original data, meaning the imputed data closely aligns with the original data in terms of its characteristics. Our findings indicate that achieving good results with the Full Subsequence Matching (FSM) algorithm may require a significant number of repetitions, because the algorithm relies on random selection of the length of the right and left patterns, which leads to variability in the outcomes.

Our algorithm employs two distinct methods to fill in the missing data gaps. The first method alternates between the two matched patterns according to their tau values and is used when the gap size exceeds the periodic cycle P. The second method combines the two patterns according to their tau values and is used to fill gaps smaller than the periodic cycle.
Table 3: Performance indexes of the four methods on the steam flow dataset

Gap points | G7 | G26 | G51 | G244
Error metric | RMSE MAE | RMSE MAE | RMSE MAE | RMSE MAE
Linear | 0.58 0.51 | 0.68 0.55 | 2.6 2.3 | 7.2 5.9
FSM | 7.3 6.4 | 0.55 0.55 | 3.8 3.6 | 7.8 8.9
GMA | 3.7 3.6 | 0.53 0.34 | 1.7 0.9 | 5 5

Table 4: Kendall tau between the imputed gaps and the actual data for the steam flow dataset

Gap points | G7 | G26 | G51 | G244
Linear | 0.86 | 0.58 | -0.4 | -0.5
Poly | 0.33 | -0.25 | -0.19 | -0.02
FSM | -0.7 | 0.51 | 0.26 | -0.34
GMA | -0.4 | 0.6 | 0.48 | 0.05
Figure 3: Performance indexes of the four methods on the steam flow dataset. (a) RMSE and MAE for each gap size; (b) Kendall correlation between the filled and actual data.
Table 5: Performance indexes of the four methods on the UCR dataset

Gap points | G6 | G26 | G33 | G40 | G101
Error metric | RMSE MAE | RMSE MAE | RMSE MAE | RMSE MAE | RMSE MAE
Linear | 229 209 | 2222 2034 | 2476 2007 | 5877 5070 | 10591 9527
Poly | 57 50 | 2857 2372 | 2811 2139 | 6433 4738 | 6175 4921
FSM | 474 416 | 1271 1144 | 1316 1074 | 2408 934 | 5318 4374
GMA | 512 555 | 1015 830 | 6530 6213 | 1398 1319 | 3859 3144

Table 6: Kendall tau between the imputed gaps and the actual data for the UCR dataset

Gap points | G6 | G26 | G33 | G40 | G101
Linear | 0.99 | 0.99 | 0.76 | 0.49 | 0.035
Poly | 0.99 | 0.92 | 0.77 | 0.59 | 0.56
FSM | 0.39 | 0.92 | 0.77 | -0.73 | 0.43
GMA | 0.99 | 0.99 | 0.95 | 0.13 | 0.8
Figure 4: Performance indexes of the four methods on the UCR dataset. (a) RMSE and MAE for each gap size; (b) Kendall correlation between the filled and actual data.
The results in Tables 5 and 6 and Figures 4a and 4b show the effectiveness of our model on the UCR dataset, with the exception of gap 26. It is worth noting that gap 26 coincides with an anomaly in the original data. The values in Tables 5 and 6 may appear large because of the wide range of the dataset, which spans roughly -10,000 to 30,000, as shown in Figure 1; as a result, the errors are also expressed in large values.
V. CONCLUSION

This paper presents the Gap Imputing Algorithm (GMA), a novel method for imputing missing values in time series data. GMA is specifically designed to address the challenging problem of consecutive missing values with varying gap distances in time series analysis. GMA first identifies sequences of missing values and determines the periodicity of the time series. It then searches the historical data for the subsequences most similar to the patterns surrounding each missing gap. GMA employs two methods to impute the missing gaps, depending on the gap size. If the gap size exceeds the periodic cycle P, GMA uses the first method, which alternates between the two patterns most similar to the gap's terminals, weighted by their correlation scores. If the missing gap is smaller than the periodic cycle, the second method is used, which combines the two similar patterns according to their correlation scores. Experimental results demonstrate that GMA outperforms existing methods in terms of accuracy, particularly for datasets with long periodic patterns and larger missing gaps. Using the periodic cycle to determine the pattern length leads to a more precise and accurate result. In contrast, other algorithms require multiple runs because they rely on random selection of the length of the right and left patterns, which can result in variability in the outcomes. Overall, this research contributes to the development of more effective and efficient missing value imputation techniques for time series data analysis. The practical implications of these findings are significant, as accurate imputation of missing data is crucial for a wide range of applications.
REFERENCES

Abdi, H. (2007). The Kendall rank correlation coefficient. Encyclopedia of Measurement and Statistics.
Bennett, N., Croke, B., Guariso, G., Guillaume, J. H., Jakeman, A., Marsili-Libelli, S., ... Norton, J. (2013). Characterising performance of environmental models. Environmental Modelling & Software.
Bokde, N., B. M. (2018). A novel imputation methodology for time series based on pattern sequence forecasting. Pattern Recognition Letters, 116, 88–96.
community, T. S. (2023, April 10). scipy.fft.rfft. The SciPy community. Retrieved 2023 from https://docs.scipy.org/doc/scipy/reference/generated/scipy.fft.rfft.html
Dwivedi, D., M. U. (2022). Imputation of contiguous gaps and extremes of subhourly groundwater time series using random forests. J Mach Learn Model Comput, 3.
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.
Figueroa-García, J. C. (2022). A genetic algorithm for multivariate missing data imputation. Information Sciences.
Gibbons, J. D. (2011). Nonparametric statistical inference. CRC Press, 14.
Irfan Pratama, A. E. (2016). A review of missing values handling methods on time-series data. International Conference on Information Technology Systems and Innovation (ICITSI).
Jianlong Xu, K. W. (2021). FM-GRU: A time series prediction method for water quality based on seq2seq framework. MDPI, 13.
Jonathan P. Dekermanjian, E. S. (2022). Mechanism-aware imputation: a two-step approach in handling missing values in metabolomics. BMC Bioinformatics.
José Cambronero, J. K. (2017). Query optimization for dynamic imputation. Proceedings of the VLDB Endowment, 10.
K, L. W. (2019). A study on Bayesian principal component analysis for addressing missing rainfall data. Water Resources Management, 33, 2615–2628.
Keogh, R. W. (2021). Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. IEEE Transactions on Knowledge and Data Engineering.
Khayati, M., Böhlen, M., & Cudré-Mauroux, P. (2015). Using lowly correlated time series to recover missing values in time series: A comparison between SVD and CD. Advances in Spatial and Temporal Databases, 14th International Symposium, SSTD.
Kulanuwat, L., C. C.-A. (2021). Anomaly detection using a sliding window technique and data imputation. Water, 13.
Liao, W., Bak-Jensen, B., Pillai, J. R., Yang, D., & Wang, Y. (2021). Data-driven missing data imputation for wind farms using context encoder. Journal of Modern Power Systems and Clean Energy.
Little, R. J. (1992). Regression with missing X's: a review. Journal of the American Statistical Association.
mapreduce., A. e. (2016). Gu, J., and Zhang. Journal of Parallel and Distributed Computing, 95, 54–62.
Mazumder, R., Hastie, T., & Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11.
Mourad Khayati, A. L. (2020). Mind the gap: An experimental evaluation of imputation of missing values techniques in time series. Proceedings of the VLDB Endowment, 13.
Oppenheim, A. V. (2010). Discrete-time signal processing (3rd ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
pandas. (2023, April 10). pandas.DataFrame.interpolate. Retrieved from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html
Park, H. a. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36, 3336–3341.
Paternoster, R. B. (1998). Using the correct statistical test for the equality of regression coefficients. Criminology, 36, 859–866.
PHAN, T.-T.-H. (2020). Machine learning for univariate time series imputation. Preprint MAPR.
Qingkai Kong, T. S. (2020). Python Programming and Numerical Methods: A Guide for Engineers and Scientists. Elsevier.
Rakthanmanon, T. K. (2012). Searching and mining trillions of time series subsequences under dynamic time warping. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Schulz, K. F. (2002). Allocation concealment in randomised trials: defending against deciphering. The Lancet, 359, 614–618.
Shu, X. P.-r. (2014). Robust orthonormal subspace learning: Efficient recovery of corrupted low-rank matrices. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA.
STUMPY. (2023, April 10). Steamgen example. Retrieved 2023 from https://stumpy.readthedocs.io/en/latest/Tutorial_The_Matrix_Profile.html
Thakolpat Khampuengson, a. W. (2022). Novel methods for imputing missing values in water level data. Water Resources Management.
Trubitsyna, R. S. (2022). DEGAIN: Generative-adversarial-network-based missing data imputation. Information, 13.
Wellenzohn, K. B. (2017). Continuous imputation of missing values in streams of pattern-determining time series. The 20th International Conference on Extending Database Technology (EDBT).
Xiuwen Yi, Y. Z. (2015). ST-MVL: Filling missing values in geo-sensory time series data. Conference on Artificial Intelligence.
Yi, X. Z. (2016). ST-MVL: Filling missing values in geo-sensory time series data. The 25th International Joint Conference on Artificial Intelligence.
Yu Zheng, L. C. (2014). Urban computing: Concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology, 38.
Zhang, Y. a. (2021). Dual-head sequence-to-sequence model for imputing missing data in multivariate time series. IEEE Journal of Biomedical and Health Informatics, 25, 1692–1702.
Author's Response to Reviewers' Comments
1) Missing at Random (MAR): In this type of missing data, the probability of a value being missing is related to the observed values in the dataset but not to the missing value itself. That is, the missingness is related to the observed values in the dataset.
2) Missing Completely at Random (MCAR): In this type of missing data, the probability of a value being missing is unrelated to any other variables in the dataset, including the observed values. That is, the missingness is completely random and not related to the values in the dataset.
There is also a third type of missing data pattern, called Missing Not at Random (MNAR), where the probability of a value being missing depends on the missing value itself; this is less commonly encountered in practice. Understanding the pattern of missing values is important because it can affect the analysis of the data, and different techniques are used to handle different types of missing data patterns.
There are many methods that can be used to handle missing data. Missing values can be handled through either single imputation or multiple imputation methods. Single imputation methods replace missing data points with a single value, most commonly the mean, average, or median. Multiple imputation methods, on the other hand, create multiple values for the missing data points (Enders, 2010). Each of these methods has its advantages and disadvantages, and the choice of technique depends on the specific characteristics of the data and the research question being investigated.
Effective approaches for predicting missing values from the available data are needed. Algorithms for recovering missing data blocks can use techniques such as matrix completion principles or pattern matching. Matrix completion-based algorithms treat a set of series as a matrix and apply methods that aim to complete the missing entries. Pattern-matching algorithms, on the other hand, use the observed values from the sensors to replace the missing data blocks. With these methods, algorithms can reconstruct missing data blocks from the available information, allowing for more complete and accurate data analysis (Mourad Khayati, 2020).
A commonly used approach for replacing missing data gaps is to use the values from the most similar subsequence. This technique falls under the category of pattern-matching algorithms. The Dynamic Time Warping (DTW) algorithm is a highly effective pattern-matching technique that is employed across numerous problem domains. Nonetheless, a drawback of DTW is that it tends to be computationally expensive and time-consuming, which can hurt the algorithm's overall efficiency and performance. Some researchers have addressed this computational expense with shape-feature extraction algorithms that extract sequence features in sliding windows; DDTW is then calculated only if the correlation between the shape-features of the window and the subsequences before the missing gap is sufficiently high (Irfan Pratama, 2016). This method has demonstrated better outcomes for time series with strong seasonality and high correlation. It is worth noting that while DTW can identify the most similar patterns with similar dynamics, it can warp the shape by expanding or compressing it, which can cause the position of the missing gaps to be misaligned with the original pattern's position. To overcome this limitation, we employ the inverse Fourier transform to predict the length of the seasonal period beforehand. This allows us to understand the dataset's characteristics and identify the length of the missing gap, particularly whether it represents one or more seasonal periods, so that we can apply a suitable algorithm to deal with it.
The objective of this study is to create novel methods for accurately and efficiently imputing missing or anomalous data in computer systems. Our focus is on developing techniques to impute missing values in seasonal time series. We observe that seasonal patterns tend to exhibit similarities over time. Therefore, we propose to leverage this phenomenon by replicating the pattern from the most comparable subsequence in the historical data. Through this approach, we have developed an effective method for imputing missing data that relies on simple operations for pattern searching and matching. The paper makes the following technical contributions:
We present and formalize GMA, the Gap Imputing Algorithm, to impute missing values in time series, covering stationary, non-linear, and seasonal time series.
We use the inverse Fourier transform to determine the periodic length of each time series. This helps us gain a better understanding of the dataset's characteristics and identify suitable algorithms for handling the missing gaps, which we then apply to address these gaps.
TABLE 1: RELATED WORK SUMMARY
003 (NNGC), and Ba Tri adheres to academic standards of syntax and grammar.
temperature.
(Paternoster, 1998) Scientific research data on The "most frequent value" imputation method replaces missing data with the
factors causing crime for malesmode of the variable. It is typically used for categorical variables or
and females numerical variables that have a limited range of values.
Water level data from telemetryThis study compared three methods for imputing missing data: mean
stations across Thailand imputation, regression imputation, and multiple imputation. The results
(Kulanuwat L, 2021)
showed that multiple imputation was the most effective method and
produced less biased estimates compared to the other two methods.
Air quality and meteorological The proposed ST-MVL method fills missing readings in geo-sensory time
data in Beijing, China series data by considering temporal and spatial correlations. It uses empirical
(Yi, 2016)
statistic models and data-driven algorithms to handle different types of
missing data cases.
Pattern Matching
2 datasets: SBR meteorological Clustering algorithm was employed to group similar time series, and the
(Wellenzohn, 2017) time series in South Tyrol and resulting groups were used to impute missing values. This approach is
Flights dataset. specifically designed for continuous streams of time series data.
Data collected from in-situ The proposed method involves using a Seq2Seq model to impute missing
monitoring station in values in time series data. This model utilizes a dual-head architecture that
(Zhang, 2021) Mulgrave-Russell catchment, includes an encoder and two decoders, each corresponding to one direction
Australia. of the time series data. Seq2Seq models are a type of recurrent neural
network (RNN) that can be applied for sequence prediction and generation.
Face images under varying The proposed technique for robust low-rank matrix recovery is capable of
illuminations: 168x192 handling data corruption and utilizes orthonormal subspace learning to
resolution, 55 frames estimate a low-rank matrix from incomplete or corrupted data. This method
(Shu, 2014)
has shown promising results in experiments and outperformed existing
methods, and can be applied in various applications such as image
Matrix Completion Principles
Measured water levels in 7 RF algorithm used for imputing missing values in a dataset with continuous
(Dwivedi D, 2022)
monitoring wells in the USA variables.
Three field-based time series A hybrid approach was used to impute missing data, where regression
were used, including traffic imputation predicted missing input variables, and data augmentation created
(Bokde N, 2018)
speed data, water flow rate synthetic data points for missing output variables. The approach was applied
data, and the Nottem dataset. to a dataset with missing values in both input and output variables.
Seven datasets were used from A genetic algorithm is proposed to impute missing values in datasets with
the UCI and KEEL repositories multiple missing observations and different data types. The algorithm
(Figueroa-García, 2022)
minimizes a multi-objective fitness function based on Minkowski distance of
statistical measures between available and completed data.
Two untargeted metabolomics A two-step approach was used for imputing missing values, involving a
(Jonathan P. datasets from the COPDGene random forest classifier to classify the missing mechanism and mechanism-
Dekermanjian, 2022) cohort were used. specific algorithms for imputation. The approach improved imputations by
reducing bias and producing values closer to the original data.
Letter and SPAM datasets A method that estimates missing values in datasets using a generative
(Trubitsyna, 2022)
https://archive.ics.uci.edu/ adversarial network (GAN) model.
4
A. Univariate Methods
Hong PHAN (PHAN, 2020) proposed a method called "MLBUI" for filling consecutive
missing values in univariate time series using machine learning methods. the data before
and after the gap transformed into multivariate time series, followed by forward and
backward forecasting using ML methods to estimate the missing values. The imputation of
the gap is then done by taking the average values of both forecast sets. Paternoster
(Paternoster, 1998) explained that the "most frequent value" imputation technique consists
of substituting missing data with the value that appears most frequently for the given
variable. This imputation strategy is typically applied to categorical variables or numerical
variables with a finite set of possible values. Kulanuwat et Al. (Kulanuwat L, 2021) Studied
missing data imputation in electronic health records (EHR) using three methods: mean
imputation, regression imputation, and multiple imputation. Mean imputation involves
replacing missing values with the mean of available data, while regression imputation uses
a regression model to predict missing values. Multiple imputation generates multiple
plausible imputed datasets using a statistical model and combines them for a single estimate.
The study found that multiple imputation was the most effective method for imputing
missing data in EHR, producing estimates closer to true values and with less bias compared
to the other methods.
B. Pattern Matching
(Yi, 2016) Proposed a method called spatio-temporal multi-view-based learning (ST-MVL)
to fill missing readings in geo-sensory time series data. The method takes into account the
temporal correlation between readings at different timestamps in the same series and the
spatial correlation between different time series. The method combines empirical statistic
models (Inverse Distance Weighting and Simple Exponential Smoothing) with data-driven
algorithms (User-based and Item-based Collaborative Filtering) to handle different types of
missing data cases. The method is evaluated using Beijing air quality and meteorological
data. K. Wellenzohn (Wellenzohn, 2017) proposed a method for continuously imputing
missing values in streams of time series data. The proposed method, called CIViC
(Continuous Imputation of Values in time series with Clustering), uses a clustering
algorithm to group similar time series and then uses the grouped time series to impute
missing values. The authors evaluated their method on several real-world datasets and
compared it to other imputation methods. Zhang and Thorburn (Zhang, 2021) proposed a
dual-head sequence-to-sequence (Seq2Seq) model for imputing missing values in time
series data. Seq2Seq models are a type of recurrent neural network (RNN) that can be used
for sequence prediction and generation tasks. In their study, Zhang and Thorburn used a
dual-head architecture, which includes an encoder and two decoders, to predict the missing
values in a time series dataset. The two decoders correspond to the two directions of the
time series data (forward and backward).
C. Matrix Completion Principles
Shu, Porikli, and Ahuja (Shu, 2014) proposed a method for robust low-rank matrix
recovery that can handle data corruption. Low-rank matrix recovery is a fundamental
problem in computer vision and machine learning, and it involves estimating a low-rank
matrix from corrupted or incomplete data. The proposed method is based on orthonormal
subspace learning, which is a technique for finding the principal subspace of a given set of
5
data. Mazumder, Hastie, and Tibshirani (Mazumder, 2010) proposed a method for matrix
completion, which involves recovering missing entries in a matrix. The authors noted that
matrix completion has important applications in various fields, including recommender
systems and collaborative filtering. The proposed method is based on spectral regularization
and involves solving a convex optimization problem. Khayati, Böhlen, and Cudré-Mauroux
(Khayati, 2015) compared two methods, Singular Value Decomposition (SVD) and
Correlation Distance (CD), for recovering missing values in time series datasets. The
authors noted that missing data is a common problem in time series datasets and can have a
significant impact on subsequent analysis. SVD and CD are two methods that can be used
for imputing missing values in time series data, and they differ in how they select a subset
of the data to use for imputation. The authors evaluated the two methods on several datasets
and found that CD performed better than SVD in terms of accuracy and computational
efficiency, especially when the time series data had low correlation. Yang et al (Jianlong
Xu, 2021) proposed a method for imputing missing data in high-dimensional datasets by
combining low-rank matrix completion and sparse representation. The authors argued that
the high dimensionality of the data and the sparsely of the missing values require an
approach that can effectively capture the underlying structure of the data. The proposed
method first utilized low-rank matrix completion to impute the missing values in the data
matrix, leveraging the assumption that the data has a low-rank structure. The method then
employed sparse representation to refine the imputed data, utilizing the assumption that the
data can be represented as a linear combination of a few basis elements. Ai and Kuok (K,
2019) suggested employing a statistical method known as Bayesian Principal Component
Analysis (BPCA) to perform imputation of missing values in rainfall data. BPCA is a
technique that merges Principal Component Analysis (PCA) with Bayesian modeling to
estimate missing data points in a dataset.
D. Machine Learning Imputing
Dwivedi (Dwivedi D, 2022) used Random Forest (RF) to impute continuous missing values in a dataset containing continuous variables. The performance of RF imputation was compared with other imputation methods, such as k-nearest neighbors (KNN) and mean imputation. Bokde et al. (Bokde N, 2018) proposed a method for imputing missing values in a dataset using a hybrid approach that combines regression imputation and data augmentation. The study dealt with a dataset that had missing values in both the input and output variables.
They used regression imputation to impute the missing values in the input variables by
predicting them based on the available data. For the missing values in the output variables,
they used data augmentation, which involves creating synthetic data points to fill in the
missing values. A genetic algorithm approach to estimating missing data in multivariate databases is proposed in (Figueroa-García, 2022). Genetic algorithms are effective at handling
multiple missing observations and different types of data, unlike traditional methods that
only deal with univariate continuous data. The proposed algorithm minimizes a new multi-
objective fitness function based on Minkowski distance of means, variances, covariances,
and skewness between available and completed data. The approach is compared to the EM algorithm and auxiliary regressions using a continuous/discrete dataset, and benchmarked against seven datasets. Dekermanjian et al. (Jonathan P. Dekermanjian, 2022) designed an imputation algorithm to handle missing values in metabolomics datasets, which are often caused by
various mechanisms such as instrument detection limits, data collection and processing
conditions, and random factors. The algorithm takes a mechanism-aware approach and
consists of two steps. In the first step, a random forest classifier is used to classify the
missing mechanism for each missing value in the dataset. In the second step, missing values
are imputed using mechanism-specific imputation algorithms, namely MAR/MCAR or
MNAR. Simulations were conducted using complete data and different missing patterns to
test the performance of the proposed algorithm. Results showed that the two-step approach
reduced bias and provided imputations that were closer to the original data compared to
using a single imputation algorithm for all missing values. Overall, this mechanism-aware
imputation algorithm offers a promising solution for handling missing values in
metabolomics datasets and improving downstream analyses. Shahbazian and Trubitsyna (Trubitsyna, 2022) developed a method for estimating missing values in datasets through the use of a generative adversarial network (GAN) based model named DEGAIN. The
performance of DEGAIN is evaluated on two publicly available datasets, namely Letter
Recognition and SPAM, and compared against existing methods.
In this paper, we propose a novel approach for imputing the missing values in incomplete subsequences, called the Gap Imputing Algorithm (GMA). We divide the time
series into two subsequences: one that contains the complete data and another that contains
the missing gaps. To fill in the missing data, we use a pattern-matching approach by
analyzing the similarity between the complete and incomplete subsequences. Specifically,
we imitate the pattern of the complete subsequence to recreate the missing data.
Table 2: List of abbreviations
Symbol  Definition
GMA     The Gap Imputing Algorithm
MAR     Missing At Random (type of missing data)
MCAR    Missing Completely At Random
MNAR    Missing Not At Random
S       The total time series
Sf      The subsequence with no missing values
Sm      The subsequence with missing values
P       The number of readings in the periodic sequence for the longest component within the time series
W       The missing gaps
R       The right pattern for the missing gap
L       The left pattern for the missing gap
N       The number of readings in the missing gap
wl      The most similar pattern to L in Sf
wr      The most similar pattern to R in Sf
b       The start index for wl
e       The end index for wr
sr      Correlation value between R and wr
sl      Correlation value between L and wl
RMSE    Root mean square error
MAE     Mean absolute error
FSM     Full Subsequence Matching algorithm
I. OVERVIEW
The time series S is first divided into a complete subsequence Sf and a subsequence Sm containing the missing gaps. We then apply the discrete Fourier transform (Oppenheim, 2010) on Sf in order to obtain the number of readings in the periodic sequence P for each time series. P represents the number of readings required for the time series to complete one cycle. Each missing gap w is then analyzed to determine its surrounding right pattern R, which comprises P readings, and its left pattern L, which also comprises P readings. W contains N missing values. To identify the two patterns in Sf that are most similar to R and L, we utilize the Kendall's tau correlation measure (Abdi, 2007). Subsequently, we employ the algorithm outlined in the following section to complete the missing values in W. Table 2 lists the symbols and annotations.
II. PROPOSED METHODS
We have developed a novel method for imputing missing values in a time series using
Fourier transform and a new filling algorithm. Our approach involves using Fourier
transform to determine the wavelength of each time series, followed by identifying the
sequence period for each series. This enables us to use an optimal imputation method to fill
in the gaps in the time series. Our proposed GMA method consists of four main steps:
A. Identifying the Missing Gaps
We identify the missing gaps in a time series, denoted by W = {w1, w2, w3, ...}. For each gap
w, we determine the preceding and succeeding data points, as well as the number of missing
points in the gap between them. By performing this analysis, we generate an array G that
records the number of missing data points for each gap, as well as the preceding and
succeeding data points.
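To make this bookkeeping concrete, a minimal Python sketch of the step is given below. It assumes the series is held in a pandas Series with NaN marking missing readings; the helper name find_gaps and its output fields are ours, not part of any released GMA code.

import numpy as np
import pandas as pd

def find_gaps(series: pd.Series):
    """Return one record per run of consecutive NaNs: start, end, and length."""
    isna = series.isna().to_numpy()
    gaps = []
    i = 0
    while i < len(isna):
        if isna[i]:
            start = i
            while i < len(isna) and isna[i]:
                i += 1
            gaps.append({"start": start, "end": i - 1, "n_missing": i - start})
        else:
            i += 1
    return gaps

# Toy series with two gaps (lengths 2 and 3)
s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0, np.nan, np.nan, np.nan, 9.0])
print(find_gaps(s))
# [{'start': 2, 'end': 3, 'n_missing': 2}, {'start': 6, 'end': 8, 'n_missing': 3}]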
B. Time Series Analysis
After identifying the gaps in the time series S that have null values, we divide S into two parts. It is typically advisable to focus on the subset of the time series that contains complete data, which we denote as Sf, because incomplete or missing data can introduce biases and inaccuracies into the analysis. A fundamental characteristic that needs to be identified is whether the data is stationary or (seasonally) periodically repeated, and the number of readings P in the periodic cycle. The discrete Fourier transform (DFT) is a mathematical technique that analyzes the time series in the frequency domain: performing the DFT decomposes the series into its constituent frequencies and provides information about its spectral content (Oppenheim, 2010). We use the DFT output with the SciPy FFT function (community, 2023); the peak frequency of the signal is found using the argmax function, and the period of the signal is then obtained by taking the reciprocal of that frequency. Figure 1 shows the UCR dataset and its inverse DFT.
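A minimal sketch of this period estimation is given below; it assumes Sf is available as a NumPy array sampled at a fixed interval and only illustrates the peak-frequency idea behind the SciPy routine cited above, not the authors' exact implementation.

import numpy as np
from scipy.fft import rfft, rfftfreq

def estimate_period(sf: np.ndarray) -> int:
    """Estimate the dominant period P (in samples) of a complete subsequence."""
    x = sf - sf.mean()                  # remove the DC component
    spectrum = np.abs(rfft(x))
    freqs = rfftfreq(len(x), d=1.0)     # cycles per sample
    peak = np.argmax(spectrum[1:]) + 1  # skip the zero-frequency bin
    return int(round(1.0 / freqs[peak]))

# Toy example: a noisy sine wave with a period of 50 samples
t = np.arange(1000)
sf = np.sin(2 * np.pi * t / 50) + 0.1 * np.random.randn(1000)
print(estimate_period(sf))  # approximately 50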
Figure 1. The UCR dataset (Keogh, 2021) and its inverse DFT
C. Extracting the subsequences similar to the surrounding right and left patterns R and L
Once we have generated the array G in the first step, which records the missing gaps, we can use this array to extract subsequences from the left and right sides of each gap w. Specifically, we extract a subsequence of length P points from the left side of the gap (denoted by L) and another subsequence of the same length from the right side of the gap (denoted by R). We then use these subsequences to search Sf for the most similar subsequences to L and R, denoted by wl and wr, respectively. This similarity search can be performed using various techniques, such as dynamic time warping (Rakthanmanon, 2012), Pearson correlation (mapreduce., 2016), or Euclidean distance (Park, 2009). Our experiments have shown that the most suitable technique for our algorithm is Kendall's tau (Gibbons, 2011), as it compares the direction of the changes (up or down) in value.
In general, if the data is normally distributed and the relationship between the variables is
expected to be linear, Pearson correlation may be the most appropriate technique to use. If
the data is not normally distributed or the relationship between the variables is not expected
to be linear, Euclidean distance or Kendall tau may be more appropriate. However, the
specific technique used should be selected based on the characteristics of the data and the
research question being investigated.
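As an illustration, the sliding-window similarity search over Sf might be sketched as follows using scipy.stats.kendalltau; the helper name best_match is ours, and a fuller implementation would also skip windows that overlap missing gaps.

import numpy as np
from scipy.stats import kendalltau

def best_match(sf: np.ndarray, pattern: np.ndarray):
    """Slide a window of len(pattern) over sf and return the start index and
    Kendall's tau of the window most similar to the pattern."""
    p = len(pattern)
    best_idx, best_tau = -1, -np.inf
    for start in range(len(sf) - p + 1):
        tau, _ = kendalltau(sf[start:start + p], pattern)
        if not np.isnan(tau) and tau > best_tau:
            best_idx, best_tau = start, tau
    return best_idx, best_tau

# e.g.  idx_l, sl = best_match(Sf, L)   and   idx_r, sr = best_match(Sf, R)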
The Kendall rank correlation coefficient is a statistical measure that is used to determine
the degree of similarity between two sets of variables. It is a non-parametric measure that is
used to quantify the strength of the relationship between two sets based on the ranks of their
values. The coefficient ranges from -1 to 1, with -1 indicating a perfect negative correlation,
0 indicating no correlation, and 1 indicating a perfect positive correlation. Figure 2 shows
the Kendall results. The similarity pattern to the left, wl, is represented by a green rectangle.
Its corresponding similarity measure, sl, is 0.735. The similarity pattern to the right, wr, is
represented by a red rectangle. Its corresponding similarity measure, sr, is 0.540.
Figure 2. The steam flow dataset (STUMPY, 2023) and the Kendall results: wl is the most similar pattern to L (green rectangle) and wr is the most similar pattern to R (red rectangle)
Algorithm 1: GMA Algorithm
Input: Sf  the complete sequence
       b   the start index of wr (the pattern most similar to R)
       sr  correlation value between R and wr
       e   the end index of wl (the pattern most similar to L)
       N   the number of missing points
       sl  correlation value between L and wl
FUNCTION GMA(w, b, sr, e, sl, N)
    // Calculate the tau ratios
    T1 = sr / (sr + sl)
    T2 = sl / (sr + sl)
    // Method 1: Filling the gap with mutation
    imp1 = []
    FOR i = 0 TO (FLOOR(T1 * N) - 1) DO
        imp1[i] = Sf[e + i]
    END FOR
    FOR i = FLOOR(T1 * N) TO N - 1 DO
        imp1[i] = Sf[b - FLOOR(T1 * N) + i]
    END FOR
    // Method 2: Filling the gap with combination
    imp2 = []
    FOR i = 0 TO N - 1 DO
        imp2[i] = (T1 / N * Sf[e + i]) + (T2 / N * Sf[b - N + i])
    END FOR
    // Return the results of both methods
    RETURN (imp1, imp2)
END FUNCTION
D. Imputing the gaps
We developed two different techniques to impute missing values: mutation imputation and ratio imputation, as shown in Algorithm 1. The algorithm takes the missing gap w, the similarity pattern to the right wr and its corresponding similarity (tau) measure sr, the similarity pattern to the left wl and its corresponding similarity measure sl, and the number of missing points N. The algorithm uses two methods to fill in the missing gap. Method 1 employs the two similar patterns, wr and wl, to fill the missing gap: the gap is filled from wr for positions 1 to sr*N/(sr+sl) and from wl for positions sr*N/(sr+sl) to N. The time complexity of Method 1 is approximately O(N). Method 2 combines the two similar patterns wr and wl, weighted by their correlation values sr and sl, to fill the gap; its time complexity is also approximately O(N). The computational complexity of the GMA function is therefore linear in the length of the missing gap N.
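To illustrate Algorithm 1, a Python sketch of the two filling strategies is given below. It follows the index conventions of the algorithm's input list (e taken as the index in Sf just after the match to L, b as the index of the match to R), applies the weights as T1 and T2 directly rather than T1/N and T2/N in the combination step, and assumes the matches leave at least N readings of room on either side. It is a sketch of the idea, not the authors' released implementation.

import numpy as np

def gma_fill(sf, b, sr, e, sl, n):
    """Return the two candidate fillings (mutation, ratio) for a gap of n points."""
    sf = np.asarray(sf, dtype=float)
    t1 = sr / (sr + sl)                  # share attributed to the right match
    t2 = sl / (sr + sl)                  # share attributed to the left match
    k = int(np.floor(t1 * n))

    # Method 1 ("mutation"): copy k points following the match to L,
    # then n - k points starting at the match to R
    imp1 = np.empty(n)
    imp1[:k] = sf[e:e + k]
    imp1[k:] = sf[b:b + (n - k)]

    # Method 2 ("ratio"): weighted combination of the stretch after the match
    # to L and the stretch just before the match to R
    imp2 = t1 * sf[e:e + n] + t2 * sf[b - n:b]
    return imp1, imp2

# Example with a toy complete sequence and a gap of n = 4 points
sf = np.arange(100, dtype=float)
imp1, imp2 = gma_fill(sf, b=60, sr=0.6, e=20, sl=0.4, n=4)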
III. EXPERIMENTAL SET-UP
A. Dataset
1) The UCR_BIDMC1_2500 benchmark is a time series dataset that is part of the UCR Time
Series Anomaly Archive (Keogh, 2021). It contains 2,500 instances, each consisting of
128 observations. The dataset was collected from intensive care unit (ICU) patients, where
each instance represents the continuous physiological signals of a patient over a 6-hour
period. The anomalies in this dataset correspond to changes in the patients' physiological
conditions that require medical attention, such as cardiac arrest or shock. This benchmark
dataset is specifically designed to address common flaws present in other anomaly detection benchmarks, including trivial and unrealistic anomaly density, misleading ground truth, and run-to-failure bias. Using the inverse Fourier transform technique, we were able to extract the periodic cycle of the UCR_BIDMC1_2500 dataset, which was determined to be 50 readings in length. Figure 1 shows the UCR_BIDMC1_2500 dataset and its inverse DFT.
2) The Steamgen dataset is a commonly used benchmark dataset in the field of process control and system identification. The dataset consists of 6,000 samples, each containing 19 features that describe the operating conditions and performance of the steam generator (STUMPY, 2023). These features include variables such as steam flow rate, water level, and temperature, as well as indicators of system faults and disturbances. As shown in Figure 2, we have chosen to focus on the steam flow feature. Using the inverse Fourier transform technique, we were able to extract the periodic cycle of the steam flow dataset, which was determined to be 497 readings in length.
B. Missing Data Generation
We simulated missing data in order to enable us to evaluate the efficacy of different imputation
techniques. To generate datasets with missing data, we systematically removed consecutive
values from the dataset, assuming that the deletions occurred randomly.
In the case of the UCR dataset, we created gaps of sizes 6, 21, 26, 40, and 101 points, as the periodic cycle of this dataset is 50 points. For the steam flow feature, we created gaps of sizes 7, 21, 51, and 244 points, as the periodic cycle for steam flow is 507 points. These various
missing data scenarios were simulated in order to test the performance of different imputation
approaches. Experiments have shown that if the size of the missing gap is greater than the
periodic cycle value, then the error in imputation is significantly higher using any imputation
method.
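For illustration, gaps of this kind can be injected as in the following sketch; the position and length shown are placeholders, and the original values are kept aside for scoring.

import numpy as np
import pandas as pd

def inject_gap(series: pd.Series, start: int, length: int) -> pd.Series:
    """Return a copy of the series with `length` consecutive values set to NaN
    starting at `start`; the untouched original can be used as ground truth."""
    corrupted = series.copy()
    corrupted.iloc[start:start + length] = np.nan
    return corrupted

# e.g. a gap of 21 consecutive readings starting at position 1000
# corrupted = inject_gap(steam_flow, start=1000, length=21)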
C. Comparative Imputation Methods
We have selected several widely-used imputation techniques for evaluating the effectiveness
of our proposed methods. These techniques include linear interpolation (pandas, 2023), polynomial interpolation (Qingkai Kong, 2020), and Full Subsequence Matching (FSM) (Thakolpat Khampuengson, 2022). Linear interpolation is a common method for filling in missing values in a dataset by estimating the value based on the linear relationship between adjacent data points. Polynomial interpolation is a more
complex variant of this technique, where a polynomial function is used to approximate the
missing values based on the surrounding data points. FSM is a pattern-matching approach to
imputation, where the missing value is estimated by identifying similar subsequences of data
within the dataset and using them to make a prediction. This technique is useful for datasets
with repeated patterns or cyclical trends. By comparing the performance of our proposed
methods against these established techniques, we aim to demonstrate the efficacy of our
approach and provide valuable insights into the most effective methods for imputing missing
data.
D. Experimental Setting
Our experiments were conducted on a server equipped with Core i7 Intel processors running
at 2.60 GHz, 8 GB RAM, and a 250 GB SATA hard drive. We implemented our proposed
framework using the open source Python package missval, which offers a range of missing
value imputation methods, as well as visualization and performance evaluation tools. The
package is publicly available on Github at https://github.com/Eng-Khattab/missval. In
addition, we have made the two datasets used in this study available for public access. To
perform interpolation, we utilized the "interpolate" class from the pandas DataFrame (pandas,
2023) Python library, which offers a convenient method for filling in missing values using
interpolation techniques. Specifically, we employed a linear approach for linear interpolation
and a polynomial method with a second order polynomial for polynomial interpolation. To
perform the Full Subsequence Matching (FSM) method, we used the matrix profile Python library called STUMPY (STUMPY, 2023).
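For reference, the baselines can be reproduced roughly as follows; the toy series, gap position, and window length m below are placeholders, and the STUMPY call only computes the matrix profile that subsequence-matching methods such as FSM build upon.

import numpy as np
import pandas as pd
import stumpy

# Toy series with a gap of 5 consecutive NaNs
s = pd.Series(np.sin(np.linspace(0, 20 * np.pi, 2000)))
corrupted = s.copy()
corrupted.iloc[1000:1005] = np.nan

# Linear and second-order polynomial interpolation baselines (pandas)
linear_filled = corrupted.interpolate(method="linear")
poly_filled = corrupted.interpolate(method="polynomial", order=2)

# Matrix profile of the complete portion (STUMPY); in the paper the window
# length corresponds to the periodic cycle P of the series
m = 100
mp = stumpy.stump(s.iloc[:1000].to_numpy(dtype=np.float64), m)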
E. Evaluation Metrics
The performance of an imputation method is commonly evaluated by measuring its accuracy
using three widely-used metrics: root mean square error (RMSE), mean absolute error (MAE), and Kendall's tau between the actual pattern and the imputed pattern. These metrics are
defined as follows:
1) The root mean square error (RMSE) (Bennett, et al., 2013) is a measure of the
differences between the actual and imputed values, calculated as the square root of the
average of the squared differences:
RMSE = \sqrt{\frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} (\hat{y}_i - y_i)^2}
where \hat{y}_i is the predicted value and y_i is the actual value.
2) The mean absolute error (MAE) (Bennett, et al., 2013) is another measure of the
differences between the actual and imputed values, calculated as the average of the
absolute differences:
MAE = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} \left| \hat{y}_i - y_i \right|
3) Kendall's tau (Abdi, 2007) is a measure of the correlation between the actual and imputed
patterns, which takes into account the order or rank of the values rather than their actual
magnitudes. It ranges from -1 (perfect negative correlation) to 1 (perfect positive
Kendall's tau = (number of concordant pairs - number of discordant pairs) / (number of pairs)
Where a pair is concordant if the relative order of the values in the actual pattern is the same
as in the imputed pattern, and discordant if the order is different. The number of pairs is equal
to 𝑛(𝑛 − 1)/2 for a dataset with n samples.
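These three scores can be computed over an imputed gap as in the following sketch; y_true and y_imputed stand for the actual and imputed values of the gap.

import numpy as np
from scipy.stats import kendalltau

def evaluate_imputation(y_true: np.ndarray, y_imputed: np.ndarray):
    """Return (RMSE, MAE, Kendall's tau) between actual and imputed values."""
    err = y_imputed - y_true
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    tau, _ = kendalltau(y_true, y_imputed)
    return rmse, mae, tau

# Example
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_imp = np.array([1.1, 1.9, 3.2, 3.8])
print(evaluate_imputation(y_true, y_imp))  # (~0.158, 0.15, 1.0)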
IV. RESULTS
Table 3. The performance indexes of the 4 methods on the steam flow dataset (RMSE and MAE per gap)
Algorithm   G7 (RMSE, MAE)   G26 (RMSE, MAE)   G51 (RMSE, MAE)   G244 (RMSE, MAE)
Linear      0.58, 0.51       0.68, 0.55        2.6, 2.3          7.2, 5.9

Table 4. The Kendall tau results between the imputed gaps and the actual data for the steam flow dataset

Fig 3.a. The performance indexes (RMSE and MAE) of the 4 methods on the steam flow dataset for gaps G7, G26, G51, and G244
Fig 3.b. The Kendall correlation between the filled and actual data
Fig 3. The performance indexes of 4 methods on the steam flow dataset
Table 5. The performance indexes of the 4 methods on the UCR dataset (RMSE and MAE per gap)
Algorithm   G6 (RMSE, MAE)   G21 (RMSE, MAE)   G26 (RMSE, MAE)   G40 (RMSE, MAE)   G101 (RMSE, MAE)
Linear      229, 209         2222, 2034        2476, 2007        5877, 5070        10591, 9527
Poly        57, 50           2857, 2372        2811, 2139        6433, 4738        6175, 4921
FSM         474, 416         1271, 1144        1316, 1074        2408, 934         5318, 4374
GMA         512, 555         1015, 830         6530, 6213        1398, 1319        3859, 3144
Table 6. The Kendall tau results between the imputed gaps and the actual data for the UCR dataset

Fig 4.a. The performance indexes (RMSE and MAE) of the 4 methods (Linear, Poly, FSM, GMA) on the UCR dataset for gaps G6, G21, G26, G40, and G101
Fig 4.b. The Kendall correlation between the filled and actual data
Fig 4. The performance indexes of 4 methods on the UCR dataset
The results included in Tables 5 and 6 and Figures 4.a and 4.b show the effectiveness of our model when applied to the UCR dataset, with the exception of gap 26. It is worth noting that gap 26 coincides with an anomaly in the original data. The values presented in Tables 5 and 6 may appear large due to the wide domain of the dataset, which ranges from -10,000 to 30,000, as shown in Figure 1.a. As a result, the errors are also expressed in large values.
V. CONCLUSION
This paper presents the Gap Imputing Algorithm (GMA), a novel method for imputing
missing values in time series data. GMA is specifically designed to address the challenging
problem of consecutively missing values with varying gap distances in time series analysis.
Initially, GMA identifies sequences of missing values and determines the periodicity of the
time series. It then searches for the most similar subsequences in the historical data to fill in
the missing gap. GMA employs two methods to impute the missing data gaps, depending on the gap size. If the gap size exceeds the periodic cycle P, GMA utilizes the first method, which alternates between the two patterns most similar to the missing gap's terminals, in proportion to their correlation scores. On the other hand, if the missing gap size is less than the periodic cycle, the second method is used, which combines the two most similar patterns weighted by their correlation scores to fill in the missing data. Experimental results demonstrate that GMA outperforms existing methods in terms of
accuracy, particularly for datasets with long periodic patterns and larger missing gaps. Using
the periodic cycle to determine the pattern length leads to a more precise and accurate result.
In contrast, other algorithms require multiple runs because they rely on random selection of
the length of the right and left patterns, which can result in variability in the outcomes.
Overall, this research contributes to the development of more effective and efficient missing
value imputation techniques in time series data analysis. The practical implications of these
findings are significant, as accurate imputation of missing data is crucial for a wide range of
applications.
REFERENCES
Abdi, H. (2007). The Kendall Rank Correlation Coefficient. Encyclopedia of Measurement and
Statistics.
Bennett, N., Croke, B., Guariso, G., Guillaume, J. H., Jakeman, A., Marsili-Libelli, S., . . .
Norton, J. (2013). Characterising performance of environmental models. Environmental
Modelling & Software.
Bokde N, B. M. (2018). A novel imputation methodology for time series based on pattern
sequence forecasting. Pattern Recogn Lett , 116, 88–96.
community, T. S. (2023, 4 10). scipy.fft.rfft. (The SciPy community) Retrieved 2023, from
https://docs.scipy.org/doc/scipy/reference/generated/scipy.fft.rfft.html
Dwivedi D, M. U. (2022). Imputation of contiguous gaps and extremes of subhourly
groundwater time series using random forests. J Mach Learn Model Comput , 3.
Enders, C. K. (2010). Applied missing data analysis. New York: Guilford Press.
Figueroa-García, J. C.–P. (2022). A genetic algorithm for multivariate missing data imputation.
Information Sciences.
Gibbons, J. D. (2011). Nonparametric statistical inference. CRC Press, 14.
Irfan Pratama, A. E. (2016). A review of missing values handling methods on time-series data.
International Conference on Information Technology Systems and Innovation (ICITSI).
Jianlong Xu, K. W. (2021). FM-GRU: A Time Series Prediction Method for Water Quality
Based on seq2seq Framework. MDPI, 13.
Jonathan P. Dekermanjian, E. S. (2022). Mechanism-aware imputation: a two-step approach in
handling missing values in metabolomics. BMC Bioinformatics.
José Cambronero, J. K. (2017). Query Optimization for Dynamic Imputation. the VLDB
Endowment, 10.
K, L. W. (2019). A study on bayesian principal component analysis for addressing missing
rainfall water. Water Resour Manage, 33, 2615–2628.
Keogh, R. W. (2021). Current Time Series Anomaly Detection Benchmarks are Flawed and
are Creating the Illusion of Progress. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA
ENGINEERING .
Khayati, M. B.-M. ( 2015). Using lowly correlated time series to recover missing values in
time series: A comparison between SVD and CD. In Advances in Spatial and Temporal
Databases . 14th International Symposium, SSTD .
Kulanuwat L, C. C.-A. (2021). Anomaly detection using a sliding window technique and data
imputation. Water , 13.
Liao, W., Bak-Jensen, B., Pillai, J. R., Yang, D., & Wang, Y. (2021). Data-driven Missing Data
Imputation for Wind Farms Using Context Encoder. Journal of Modern Power Systems and
Clean Energy.
Little, R. J. (1992). Regression with missing X's: a review. Journal of the American Statistical
Association.
mapreduce., A. e. (2016). Gu, J., and Zhang. Journal of Parallel and Distributed Computing,
95, 54-62.
Mazumder, R. H. (2010). Spectral regularization algorithms for learning large incomplete
matrices. Journal of Machine Learning Research, 11.
Mourad Khayati, A. L. (2020). Mind the Gap: An Experimental Evaluation of Imputation of
Missing Values Techniques in Time Series. VLDB Endowment, 13.
Oppenheim, A. V. (2010). Discrete-time signal processing (3rd ed.). Upper Saddle River: NJ:
Pearson Prentice Hall.
pandas. (2023, 4 10). pandas.DataFrame.interpolate. (pandas) Retrieved from pandas:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html
Park, H. a. (2009). A simple and fast algorithm for K-medoids clustering. Expert systems with
applications, 36, 3336-3341.
Paternoster, R. B. (1998). Using the correct statistical test for the equality of regression
coefficients. Criminology, 859-866, 36.
PHAN, T.-T.-H. (2020). Machine learning for univariate time series imputation. Preprint MAPR.
Qingkai Kong, T. S. (2020). Python Programming and Numerical Methods - A Guide for
Engineers and Scientists. Elsevier.
Rakthanmanon, T. K. (2012). Searching and mining trillions of time series subsequences under
dynamic time warping. the 18th ACM .
Schulz, K. F. (2002). Allocation concealment in randomised trials: defending against
deciphering. . The Lancet, 359, 614-618.
Shu, X. P.-r. (2014). Robust orthonormal subspace learning: Efficient recovery of corrupted
low-rank matrices. . IEEE Conference on Computer Vision and Pattern Recognition, CVPR .
Columbus, OH, USA.
STUMPY. (2023, 4 10). Steamgen Example. (STUMPY) Retrieved 2023, from STUMPY:
https://stumpy.readthedocs.io/en/latest/Tutorial_The_Matrix_Profile.html
Thakolpat Khampuengson, a. W. (2022). Novel Methods for Imputing Missing Values in
Water Level. Water Resources Management.
Trubitsyna, R. S. (2022). DEGAIN: Generative-Adversarial-Network-Based Missing Data
Imputation. Information, 13.
Wellenzohn, K. B. (2017). Continuous imputation of missing values in streams of pattern-
determining time series. the 20th International Conference on Extending Database
Technology, EDBT.
Xiuwen Yi, Y. Z. (2015). ST-MVL: Filling Missing Values in Geo-Sensory Time Series Data.
Conference on Artificial Intelligence.
Yi, X. Z. (2016). ST-MVL: Filling Missing Values in Geo-sensory Time Series Data. the 25th
International Joint Conference on Artificial Intelligence.
YU ZHENG, L. C. (2014). Urban Computing: Concepts, Methodologies, and Applications.
ACM Transactions on Intelligent Systems and Technology, 38.
Zhang, Y. a. (2021). Dual-head sequence-to-sequence model for imputing missing data in
multivariate time series. IEEE Journal of Biomedical and Health Informatics, 25 , 1692-1702.