Enhancing Missing Values Imputation Through Transformer-Based Predictive Modeling

How to cite this article: Ayub H, Jamil H. Enhancing Missing Values Imputation through Transformer-Based Predictive Modeling. IgMin Res. Jan 23, 2024; 2(1): 025-031. IgMin ID: igmin140; DOI: 10.61927/igmin140; Available at: www.igminresearch.com/articles/pdf/igmin140.pdf
Abstract
This paper tackles the vital issue of missing value imputation in data preprocessing, where traditional techniques like zero, mean, and KNN imputation fall short
in capturing intricate data relationships. This often results in suboptimal outcomes, and discarding records with missing values leads to significant information loss.
Our innovative approach leverages advanced transformer models renowned for handling sequential data. The proposed predictive framework trains a transformer
model to predict missing values, yielding a marked improvement in imputation accuracy. Comparative analysis against traditional methods—zero, mean, and KNN
imputation—consistently favors our transformer model. Importantly, LSTM validation further underscores the superior performance of our approach. In hourly data,
our model achieves a remarkable R2 score of 0.96, surpassing KNN imputation by 0.195. For daily data, the R2 score of 0.806 outperforms KNN imputation by 0.015
and exhibits a notable superiority of 0.25 over mean imputation. Additionally, in monthly data, the proposed model’s R2 score of 0.796 excels, showcasing a significant
improvement of 0.1 over mean imputation. These compelling results highlight the proposed model’s ability to capture underlying patterns, offering valuable insights
for enhancing missing values imputation in data analyses.
ISSN 2995-8067 DOI: 10.61927/igmin140
revealing that models like LightGBM and XGBoost, coupled with careful feature engineering, excel in imputation performance, emphasizing the importance of balanced model complexity. In scRNA-seq, vital for studying single-cell transcription, addressing high-dimensionality and dropout values is crucial. This study [11] evaluates advanced imputation methods, providing insights for selecting appropriate approaches in diverse data contexts and aiding downstream functional analysis. Both the Self-Organizing Map (SOM) [12,13] and the MLP [14] represent additional ML techniques applied for the imputation of missing values.

Furthermore, studies employing the regression approach [15] implemented a novel method involving weighted quantile regression to estimate missing values within health data. In another article [16], the author introduced a comprehensive case regression approach for handling missing values, employing functional principal components. Iterative regression is used for effective imputation in multivariate data [17]. Another method, hot-deck imputation, matches missing values with complete values on key variables [18]. Research has been conducted on expectation maximization for handling missing data, using a dataset analyzing the effects of feeding behaviors among drug-treated and untreated animals [19]. Recognizing the insufficiency of merely deleting or discarding missing data [20], researchers often turn to multiple imputation, which leverages the observed data distribution to estimate numerous values, reflecting the uncertainty surrounding the true value. This approach has predominantly been utilized to address the constraints associated with single imputation [21].

Moreover, another study [22] evaluates imputation methods for incomplete water network data, focusing on small to medium-sized utilities. Among the tested methods, IMPSEQ outperforms others in imputing missing values in cast iron water mains data from the City of Calgary, offering insights for cost-effective water mains renewal planning. The one-hot encoding method proposed by [23] excels in addressing missing data for credit risk classification, demonstrating superior accuracy and computational efficiency, especially in high missing-rate scenarios, when integrated with the CART model. Another work [24] proposes a novel imputation method for symbolic regression using Genetic Programming (GP) and weighted K-Nearest Neighbors (KNN); it outperforms state-of-the-art methods in accuracy, symbolic regression, and imputation time on real-world datasets. Conventional techniques for multiple imputation exhibit suboptimal performance when confronted with high-dimensional data, prompting researchers to enhance these algorithms [25,26]. Likewise, indications exist that exercising caution is advisable when applying continuous-based approaches to impute categorical data, as it may introduce bias into the results [27].

Motivated by the need for a comprehensive evaluation, we conduct extensive experiments to compare the performance of our transformer-based imputation against established methods. This comparison extends beyond conventional imputation techniques, encompassing zero [28], mean [29], and KNN imputation [30]. In the context of missing value imputation, it is noteworthy that addressing missing values is a common concern among researchers and data scientists. Recent research [31] thoroughly compares seven data imputation methods for numeric datasets, revealing kNN imputation's consistent outperformance. This contribution adds valuable insights to the ongoing discourse on selecting optimal methods for handling missing data in data mining tasks. Furthermore, we introduce an additional validation layer by subjecting the imputed data to scrutiny through Long Short-Term Memory (LSTM) networks [32]. This not only assesses the accuracy of imputation but also gauges the temporal coherence of the imputed values.

By undertaking this exploration, we aim to contribute valuable insights into the realm of missing values imputation, offering a nuanced understanding of the capabilities of transformer-based models. The observed improvements in imputation accuracy, particularly validated through LSTM analysis, underscore the potential of our proposed approach to address the persistent challenges associated with missing data. Through this work, we aspire to provide a robust foundation for future advancements in data preprocessing and analysis methodologies. These are the key contributions of the article:

• Introduced a novel missing values imputation approach using transformer models, deviating from traditional methods.

• Leveraging self-attention mechanisms, the transformer-based model provides a data-driven and adaptive solution for capturing intricate data relationships.

• Through a comprehensive comparative analysis, the transformer model consistently outperforms traditional imputation techniques like zero, mean, and KNN.

• The inclusion of LSTM validation adds a layer of scrutiny, evaluating not only imputation accuracy but also the temporal coherence of imputed values.

• The proposed model showcases robust performance across diverse datasets, demonstrating its efficacy in preserving data relationships and capturing variability.

Methodology

Handling missing values in datasets is a crucial challenge, particularly when predicting these values based on available data. Figure 1 outlines a comprehensive process for predicting missing values using a transformer model. In the initial step, we showcase an example dataset with missing values, highlighting the intricacies of the task. Moving to step two, we prepare the data for missing values imputation by segregating complete data for model training and reserving a test set for predicting missing data. Before arranging the data, each data sequence is assigned a unique identifier, ensuring traceability. Complete data features (f0, f3, f6, and f9) are repositioned on the right side in the third step. Subsequently, in step four, all complete rows are relocated to the top of the dataset.

Figure 1: A detailed process of preparing data for the Transformer for missing values prediction.

Step five reveals the division of the dataset into X-data and Y-data, forming the basis for training the model. In step six, we select the complete X-Data and the target feature f1 from Y-Data, which contains missing data. Utilizing the train-test split on X-Data and Y-Data (f1), we generate X-Train, Y-Train, X-Test, and Y-Test. In step seven, the train data is prepared for our proposed prediction model, providing a complete set for training the Transformer model.

Advancing further, at step eight, the transformer undergoes comprehensive training using the entirety of the available data. In the subsequent step nine, the trained model takes on the task of predicting missing values within the X-Data. The imputed f1 feature is integrated back into the X-Data, initiating a cascading effect as subsequent missing values are predicted. This iterative refinement persists until the entirety of missing values is filled.

In the culminating step, the dataset is reorganized to preserve its inherent structure by adhering to the initially assigned IDs. This methodical approach not only ensures the seamless integration of imputed values but also maintains the overall integrity and coherence of the dataset. In essence, our methodology provides a structured and systematic solution, navigating the intricacies of missing value imputation using a transformer model.

Proposed model validation

After the missing data imputation process was finished using our proposed transformer-based prediction model, a thorough validation was carried out. During the validation stage, we aimed to assess how well our model performed in comparison to other widely used imputation techniques, such as zero, mean, mode, and KNN imputation. We used these various imputation methods to produce five sets of imputed data. We validated each imputation model using Long Short-Term Memory (LSTM) networks to evaluate its effectiveness thoroughly. The LSTM network was fed the imputed data from all five models.

Table 1: A detailed comparative analysis of the imputation techniques.

Data          Measure   Zero        Mean        Mode        KNN         Proposed FEP
                        Imputation  Imputation  Imputation  Imputation  Model Imputation
Hourly Data   R2 score  0.233       0.647       0.437       0.765       0.96
              MAE       0.058       0.113       0.075       0.037       0.036
              MSE       0.006       0.02        0.008       0.003       0.003
              RMSE      0.077       0.141       0.089       0.055       0.055
              MAPE      1.2         0.92        1.01        0.83        0.423
Daily Data    R2 score  0.391       0.556       0.471       0.791       0.806
              MAE       0.066       0.051       0.059       0.048       0.028
              MSE       0.009       0.008       0.0073      0.004       0.003
              RMSE      0.077       0.095       0.055       0.045       0.045
              MAPE      0.93        0.85        0.89        0.47        0.32
Monthly Data  R2 score  0.251       0.696       0.698       0.419       0.796
              MAE       0.023       0.051       0.029       0.038       0.025
              MSE       0.001       0.003       0.002       0.004       0.001
              RMSE      0.032       0.055       0.045       0.063       0.032
              MAPE      1.13        0.89        0.891       1.01        0.523
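The iterative, feature-by-feature prediction loop described in the Methodology section (train on complete rows, predict the missing target feature, merge it back, repeat) can be sketched in code. The following is a minimal, illustrative Python sketch, not the authors' implementation: the transformer is abstracted behind a fit-and-predict interface, a plain least-squares regressor stands in for it so the sketch stays self-contained, and all function and variable names are our own.

```python
import numpy as np

def lstsq_model(X_train, y_train):
    """Stand-in for the transformer predictor: fits an affine
    least-squares model and returns a predict(X) closure."""
    # Append a bias column so the fit is affine, not purely linear.
    A = np.hstack([X_train, np.ones((len(X_train), 1))])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return lambda X: np.hstack([X, np.ones((len(X), 1))]) @ w

def impute_iteratively(data, make_model=lstsq_model):
    """Fill NaNs one feature at a time: train on rows where the target
    feature is present, predict it where it is missing, merge the
    predictions back, and repeat until no NaNs remain."""
    data = data.copy()
    ids = np.arange(len(data))           # step 2: unique row identifiers
    complete_cols = [j for j in range(data.shape[1])
                     if not np.isnan(data[:, j]).any()]
    for target in range(data.shape[1]):  # iterate over incomplete features
        mask = np.isnan(data[:, target])
        if not mask.any():
            continue
        X = data[:, complete_cols]       # steps 3-6: X-Data / Y-Data split
        predict = make_model(X[~mask], data[~mask, target])  # steps 7-8
        data[mask, target] = predict(X[mask])                # step 9
        complete_cols.append(target)     # imputed feature becomes usable
    return data[np.argsort(ids)]         # step 10: restore original order

# Tiny demonstration: f1 depends linearly on f0, with two gaps.
rng = np.random.default_rng(0)
f0 = rng.normal(size=20)
f1 = 2.0 * f0 + 1.0
f1[[3, 7]] = np.nan
filled = impute_iteratively(np.column_stack([f0, f1]))
assert not np.isnan(filled).any()
```

In the actual pipeline, `make_model` would wrap training and inference of the transformer, and the identifier bookkeeping would undo the row and column rearrangements illustrated in Figure 1.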
Table 1 examines the imputation performance across the hourly, daily, and monthly datasets. Overall, these results consistently highlight the proposed model's effectiveness in preserving data relationships and capturing variability across diverse datasets, positioning it as a robust choice for imputing missing values when accurate modeling of underlying data patterns is crucial. A visual analysis of the R2 score for the selected imputation methods is illustrated in Figure 3.

Beyond R2 scores, an in-depth analysis of other error metrics further solidifies the superiority of the proposed imputation model, as shown in Figure 4. In hourly consumption data, the model's Mean Absolute Error (MAE) of 0.036 is notably lower than that of other methods, reflecting its ability to accurately predict missing values with minimal deviation. This trend continues in daily and monthly consumption data, where the proposed model consistently achieves the lowest MAE values, indicating superior imputation accuracy. Similarly, examining Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) across all datasets, the proposed model consistently outperforms alternative methods. The observed reductions in MAE, MSE, and RMSE collectively underscore the robustness of the proposed model in minimizing imputation errors. These comprehensive findings suggest that, beyond R2 scores, the proposed imputation model consistently excels across various error metrics, affirming its efficacy in accurately filling missing data and offering a comprehensive solution for handling diverse datasets with absent values.

Critical discussion

In this study, we have demonstrated the superior efficacy of transformer models over traditional methods like zero, mean, and KNN imputation, particularly in handling accuracy and context in missing data. However, the performance of these models varies with different data types and sizes, highlighting potential limitations in scalability and applicability to diverse datasets. Comparative analysis suggests that while transformers excel in interpreting sequential data, they may not be the most suitable choice for simpler or smaller datasets. The practical applications of
our model are promising, yet they are accompanied by challenges in computational demands and ethical considerations, especially in sensitive sectors like healthcare and finance. The generalizability of our model across various types of missing data and its application across different fields remains an area ripe for further research and validation.

Future studies should focus on integrating advanced machine learning techniques to enhance the robustness and applicability of our model. Additionally, while the use of LSTM networks for validation is beneficial, alternative methods might provide a more comprehensive evaluation. It is important to acknowledge that the quality of imputation has a significant impact on the predictive accuracy of models, particularly in fields where data integrity is crucial. Our findings highlight the importance of continuous development in imputation methods, keeping pace with evolving data complexities and advancements in AI. This research contributes to the broader understanding of missing data imputation, setting a foundational stage for future innovations in predictive modeling.

Conclusion

This paper introduces a novel transformer-based prediction model to handle the critical problem of missing value imputation in datasets. By methodically explaining the process, we demonstrated a comprehensive strategy that outperformed conventional imputation strategies, such as zero imputation, mean imputation, and KNN imputation. The proposed model demonstrated exceptional predictive power by capturing complex patterns in sequential data. Our model significantly outperformed alternative imputation techniques after extensive validation using LSTM networks, highlighting its effectiveness and resilience. The present study contributes significantly to advancing missing values imputation approaches by providing a detailed comparative analysis of transformer-based and conventional methods. In light of the difficulties associated with missing data, the proposed approach closes a large gap in the literature and offers a viable path toward more trustworthy data analysis.

References

1. Du J, Hu M, Zhang W. Missing data problem in the monitoring system: A review. IEEE Sensors Journal. 2020; 20(23):13984-13998.

2. Alruhaymi AZ, Kim CJ. Study on the Missing Data Mechanisms and Imputation Methods. Open Journal of Statistics. 2021; 11(4):477-492.

3. Liu J, Pasumarthi S, Duffy B, Gong E, Datta K, Zaharchuk G. One Model to Synthesize Them All: Multi-Contrast Multi-Scale Transformer for Missing Data Imputation. IEEE Trans Med Imaging. 2023 Sep;42(9):2577-2591. doi: 10.1109/TMI.2023.3261707. Epub 2023 Aug 31. PMID: 37030684; PMCID: PMC10543020.

4. Edelman BL, Goel S, Kakade S, Zhang C. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning. PMLR. 2022; 5793-5831.

5. Choi SR, Lee M. Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review. Biology (Basel). 2023 Jul 22;12(7):1033. doi: 10.3390/biology12071033. PMID: 37508462; PMCID: PMC10376273.

6. Schafer JL. Analysis of incomplete multivariate data. CRC Press. 1997.

7. Menard S. Applied logistic regression analysis. Sage. 2002; 106.

8. Little RJ, Rubin DB. Statistical analysis with missing data. John Wiley & Sons. 2019; 793.

9. Hadeed SJ, O'Rourke MK, Burgess JL, Harris RB, Canales RA. Imputation methods for addressing missing data in short-term monitoring of air pollutants. Sci Total Environ. 2020 Aug 15;730:139140. doi: 10.1016/j.scitotenv.2020.139140. Epub 2020 May 3. PMID: 32402974; PMCID: PMC7745257.

10. Luo Y. Evaluating the state of the art in missing data imputation for clinical data. Brief Bioinform. 2022 Jan 17;23(1):bbab489. doi: 10.1093/bib/bbab489. PMID: 34882223; PMCID: PMC8769894.

11. Wang M, Gan J, Han C, Guo Y, Chen K, Shi YZ, Zhang BG. Imputation methods for scRNA sequencing data. Applied Sciences. 2022; 12(20):10684.

12. Samad T, Harp SA. Self-organization with partial data. Network: Computation in Neural Systems. 1992; 3(2):205-212.

13. Fessant F, Midenet S. Self-organising map for data imputation and correction in surveys. Neural Computing & Applications. 2002; 10:300-310.

14. Westin LK. Missing data and the preprocessing perceptron. Univ. 2004.

15. Sherwood B, Wang L, Zhou XH. Weighted quantile regression for analyzing health care cost data with missing covariates. Stat Med. 2013 Dec 10;32(28):4967-79. doi: 10.1002/sim.5883. Epub 2013 Jul 9. PMID: 23836597.

16. Crambes C, Henchiri Y. Regression imputation in the functional linear model with missing values in the response. Journal of Statistical Planning and Inference. 2019; 201:103-119.

17. Siswantining T, Soemartojo SM, Sarwinda D. Application of sequential regression multivariate imputation method on multivariate normal missing data. In 2019 3rd International Conference on Informatics and Computational Sciences (ICICoS). IEEE. 2019; 1-6.

18. Andridge RR, Little RJ. A Review of Hot Deck Imputation for Survey Non-response. Int Stat Rev. 2010 Apr;78(1):40-64. doi: 10.1111/j.1751-5823.2010.00103.x. PMID: 21743766; PMCID: PMC3130338.

19. Rubin LH, Witkiewitz K, Andre JS, Reilly S. Methods for Handling Missing Data in the Behavioral Neurosciences: Don't Throw the Baby Rat out with the Bath Water. J Undergrad Neurosci Educ. 2007 Spring;5(2):A71-7. Epub 2007 Jun 15. PMID: 23493038; PMCID: PMC3592650.

20. Rubin DB. Inference and missing data. Biometrika. 1976; 63(3):581-592.

21. Uusitalo L, Lehikoinen A, Helle I, Myrberg K. An overview of methods to evaluate uncertainty of deterministic models in decision support. Environmental Modelling & Software. 2015; 63:24-31.

22. Kabir G, Tesfamariam S, Hemsing J, Sadiq R. Handling incomplete and missing data in water network database using imputation methods. Sustainable and Resilient Infrastructure. 2020; 5(6):365-377.

23. Yu L, Zhou R, Chen R, Lai KK. Missing data preprocessing in credit classification: One-hot encoding or imputation? Emerging Markets Finance and Trade. 2022; 58(2):472-482.

24. Al-Helali B, Chen Q, Xue B, Zhang M. A new imputation method based on genetic programming and weighted KNN for symbolic regression with incomplete data. Soft Computing. 2021; 25:5993-6012.

25. Zhao Y, Long Q. Multiple imputation in the presence of high-dimensional data. Stat Methods Med Res. 2016 Oct;25(5):2021-2035. doi: 10.1177/0962280213511027. Epub 2013 Nov 25. PMID: 24275026.

26. Huque MH, Carlin JB, Simpson JA, Lee KJ. A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Med Res Methodol. 2018 Dec 12;18(1):168. doi: 10.1186/s12874-018-0615-6. PMID: 30541455; PMCID: PMC6292063.

27. Horton NJ, Lipsitz SR, Parzen M. A potential for bias when rounding in multiple imputation. The American Statistician. 2003; 57(4):229-232.

28. Yi J, Lee J, Kim KJ, Hwang SJ, Yang E. Why not to use zero imputation? Correcting sparsity bias in training neural networks. arXiv preprint arXiv:1906.00150. 2019.

29. Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B, Tabona O. A survey on missing data in machine learning. J Big Data. 2021;8(1):140. doi: 10.1186/s40537-021-00516-9. Epub 2021 Oct 27. PMID: 34722113; PMCID: PMC8549433.

30. Mohammed MB, Zulkafli HS, Adam MB, Ali N, Baba IA. Comparison of five imputation methods in handling missing data in a continuous frequency table. In AIP Conference Proceedings. AIP Publishing. 2021; 2355:1.

31. Jadhav A, Pramod D, Ramanathan K. Comparison of performance of data imputation methods for numeric dataset. Applied Artificial Intelligence. 2019; 33(10):913-933.

32. Staudemeyer RC, Morris ER. Understanding LSTM - a tutorial into long short-term memory recurrent neural networks. arXiv preprint arXiv:1909.09586. 2019.