
Volume 9, Issue 7, July – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24JUL1459

Integrated ECOD-KNN Algorithm for Missing Values Imputation in Datasets: Outlier Removal

Tsitsi Jester Mugejo1; Weston Govere2
1,2Department of Cloud Computing, School of Information Science and Technology,
Harare Institute of Technology, Harare, Zimbabwe

Abstract:- Missing data cause incompleteness of data sets and can lead to poor model performance, which in turn can result in poor decisions, even when the best handling methods are used. When outliers are present in the data, using the KNN algorithm for missing value imputation produces less accurate results. Outliers are anomalous observations, and removing them is one of the most important pre-processing steps in any data analysis model. KNN algorithms can be adapted to missing value imputation, but they are sensitive to outliers, which may degrade the quality of the imputation results. KNN is widely used among machine learning algorithms because it is simple to implement and has relatively high accuracy. In the literature, various studies have explored the application of KNN in different domains, yet they fail to address how sensitive it is to outliers. In the proposed model, outliers are identified using a combination of the Empirical-Cumulative-distribution-based Outlier Detection (ECOD), Local Outlier Factor (LOF) and Isolation Forest (IForest). The outliers are substituted with the median of the non-outlier data, and the missing values are then imputed using the k-nearest neighbors algorithm. The model was evaluated using several metrics: Root Mean Square Error (RMSE), Mean Squared Error (MSE), R-squared (R2) and Mean Absolute Error (MAE). The results clearly indicate that dealing with outliers before imputing missing values produces better imputation results than the traditional KNN technique alone, which is sensitive to outliers.

Keywords:- Imputation; Outlier; Missing Values; Incomplete; Algorithm.

I. INTRODUCTION

Missing data cause incompleteness of data sets and can lead to poor model performance, which in turn can result in poor decisions, even when the best handling methods are used. Analysing datasets that contain missing values can perpetuate decisions derived from a biased model. In this paper, we show how solving missing data with the KNN algorithm may produce less accurate results, especially when outliers are present in the data. Additionally, we demonstrate how outliers can be identified using the Empirical-Cumulative-distribution-based Outlier Detection (ECOD), Local Outlier Factor (LOF) and Isolation Forest (IForest), how the outliers were substituted with the median of the non-outlier data, and how the missing values are imputed with the KNN algorithm in a single model.

Outliers are anomalous observations, and removing them is one of the important pre-processing steps in any data analysis model (1). It is important to first identify the outliers, in this paper using outlier detection, in order to remove or substitute them. Data is bound to contain some noisy data, or outliers, which affect the KNN missing value imputation process and the performance of the trained models (2). It is therefore essential to filter noisy data out of any training dataset, and this step should come before imputing missing values; the imputation result will not be good enough if imputation is performed before outlier handling.

Missing values matter particularly when dealing with big data (3), that is, very large amounts of data or large datasets requiring analysis and storage. Missing values generally pose a weakness to models (4), as they affect the quality of results, especially in prediction systems. In the pre-processing stage of datasets with numeric values, one of the main challenges is the processing of missing values, so it is important to deal with missing values in our datasets during pre-processing (5).

Challenges may also arise from choosing the wrong handling method for missing values (6), which also affects the effectiveness of any model. Previous studies have covered imputation using the KNN algorithm and various extensions of it, but have failed to consider outlier detection and normalization before the missing value imputation process. The performance of the KNN imputation method can be greatly improved by resolving outliers and normalizing the data (7). It has been shown that using normalization and mean imputation together is more accurate than the original mean and median methods (8).

This study takes note of the outliers by first detecting them using ECOD and substituting them with the median of the non-outlier data, and then proceeding to impute the missing values in the datasets, for improved accuracy of the imputation result. To our knowledge, this combination has not been used in previous studies of imputation with KNN or other imputation methods, although it has been proved by

IJISRT24JUL1459 www.ijisrt.com 2307



this paper to improve the accuracy of the missing values imputation process.

II. LITERATURE REVIEW

Many studies have addressed the issue of missing values in datasets. Incompleteness of data is handled depending mainly on its type and requirements. The two main families of imputation methods are statistical and machine learning methods. These methods generate values or approximations from the observable variables in order to replace the missing values (9). KNN is widely used among machine learning algorithms because it is simple to implement and has relatively high accuracy, and various studies have explored its application in different domains.

In the field of proteomics, (10) highlights the complexity of identifying the subcellular locations of proteins, especially when proteins can exist in multiple locations simultaneously (11). To address missing values in proteomic data, the Cluster-based KNN (CKNN) imputation method was introduced (12), which incorporates local data clustering for improved quality and efficiency (13).

In the context of movie recommender systems, a comparative study was conducted (14) on pre-processing algorithms for Singular Value Decomposition (SVD) to help data managers choose the most suitable algorithm for their business needs (15). This study underscores the importance of selecting the right imputation method to enhance the accuracy and reliability of data analysis. Furthermore, in the medical field, the study in (16) focused on missing value estimation methods for arrhythmia classification, emphasizing the significance of handling missing values in datasets to ensure accurate classification results (18).

Furthermore, a novel KNN variant (KNNV) algorithm was introduced (17) for accurate classification of COVID-19 based on incomplete heterogeneous data, showcasing improved results through experimental work (18). The KNNV algorithm addresses incompleteness by imputation and heterogeneity by converting categorical data into numerical values. Moreover, a hybrid missing data imputation method called KI was proposed (19), which combines the k-nearest neighbors and iterative imputation algorithms to address missing values effectively (20). This approach leverages similarity learning techniques to impute missing data accurately.

This highlights the adaptability of KNN algorithms in addressing missing values and improving classification accuracy in diverse applications. In summary, the literature showcases the significance of the KNN algorithm for imputing missing values in datasets across various domains, including proteomics, recommendation systems and medical diagnostics.

Fig 1 shows the experimental design of the systems used in most current studies (21). These studies introduce missing values as the first step, if a dataset with no missing values is being used. An imputation algorithm is then picked for the imputation process, and the imputed result is evaluated using various metrics.

Fig 1 Block Diagram of the Experimental Design Used in Current Studies.
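The evaluation protocol described above, where missing values are first introduced into a complete dataset so imputed values can later be scored against the known originals, can be sketched as follows. This is a minimal NumPy illustration; the function name and the MCAR (mask-completely-at-random) scheme are our assumptions, not details taken from the surveyed studies:

```python
import numpy as np

def introduce_missing_values(X, fraction=0.1, seed=0):
    """Randomly mask a fraction of entries (MCAR) so that imputation
    results can be scored against the known original values."""
    rng = np.random.default_rng(seed)
    X_missing = X.astype(float).copy()
    mask = rng.random(X.shape) < fraction   # True where a value is removed
    X_missing[mask] = np.nan
    return X_missing, mask

# Mask a quarter of the entries of a small complete matrix.
X = np.arange(20, dtype=float).reshape(5, 4)
X_missing, mask = introduce_missing_values(X, fraction=0.25)
```

The returned mask records exactly which entries were hidden, which is what allows RMSE-style scoring of the imputer afterwards.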

Overall, the literature highlights the significance of the KNN algorithm for imputing missing values in datasets. Researchers have developed novel approaches, such as the MKDF-WKNN classifier and the KNNV algorithm, to enhance the accuracy of classification models dealing with incomplete data (22). Additionally, hybrid methods like KI have been proposed to improve missing data imputation by combining k-nearest neighbors and iterative algorithms. The studies discussed emphasize the importance of selecting appropriate imputation methods, all of which involve KNN, to enhance data quality, analysis accuracy and classification performance. However, none of the KNN algorithms used by these researchers address the fact that KNN is sensitive to outliers, which can affect the result of the missing values imputation process.


III. METHODOLOGY

The first step was data exploration. Pandas was used for the data frames, which makes it easy to work with structured data. NumPy was used to support the large datasets and compute arithmetic averages over the values, and Matplotlib was used to plot graphs for visual comprehension. The outliers are identified using a combination of the Empirical-Cumulative-distribution-based Outlier Detection (ECOD), Local Outlier Factor (LOF) and Isolation Forest (IForest). The outliers are substituted with the median of the non-outlier data, and the missing values are imputed using the k-nearest neighbors algorithm. KNN identifies the k nearest data points to the missing value based on a distance metric; for numerical data, the mean of these neighbors replaces the missing value, and for categorical data, the most frequent category (mode) among the neighbors is used. This approach leverages the similarity between data points to provide a more accurate imputation than simple mean or mode imputation. Fig 2 illustrates the experimental design of the proposed system.

Fig 2 Block Diagram of the Proposed System Experimental Design.
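As a rough illustration of this pipeline, the sketch below implements a simplified rank-based ECOD-style score, median substitution of the flagged rows, and a basic numerical KNN imputer in plain NumPy. The actual study combines the ECOD, LOF and IForest detectors (such as those provided by the PyOD library); the helper names and the single-detector simplification here are our assumptions, not the authors' code:

```python
import numpy as np

def ecod_scores(X):
    # Simplified ECOD-style score: for each feature, take the smaller of the
    # left and right empirical tail probabilities of each value, and sum the
    # negative logs across features. Rows in the tails of many features score high.
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        col = X[:, j]
        left = np.array([(col <= v).mean() for v in col])    # empirical CDF
        right = np.array([(col >= v).mean() for v in col])   # empirical survival
        scores += -np.log(np.minimum(left, right))
    return scores

def replace_outliers_with_median(X, contamination=0.1):
    # Flag the highest-scoring rows as outliers and overwrite them,
    # feature by feature, with the median of the non-outlier rows.
    scores = ecod_scores(X)
    k = max(1, int(contamination * len(X)))
    outliers = np.argsort(scores)[-k:]
    inliers = np.ones(len(X), dtype=bool)
    inliers[outliers] = False
    X_clean = X.copy()
    X_clean[outliers] = np.median(X[inliers], axis=0)
    return X_clean, outliers

def knn_impute(X, k=3):
    # Fill each NaN with the mean of that feature over the k nearest
    # fully observed rows, using Euclidean distance on the shared features.
    X_imp = X.copy()
    for i in range(len(X)):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        shared = ~miss
        dists = sorted(
            (np.linalg.norm(X[i, shared] - X[j, shared]), j)
            for j in range(len(X))
            if j != i and not np.isnan(X[j]).any()
        )
        neighbours = [j for _, j in dists[:k]]
        X_imp[i, miss] = X[neighbours][:, miss].mean(axis=0)
    return X_imp

# Toy data with one planted extreme row, then a small imputation example.
X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 3.0],
              [4.0, 2.0], [5.0, 1.0], [50.0, 50.0]])
X_clean, outliers = replace_outliers_with_median(X, contamination=0.2)

Y = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [np.nan, 2.0]])
Y_imp = knn_impute(Y, k=2)
```

On the toy matrix, the planted extreme row is flagged and overwritten with the column-wise median of the remaining rows, after which `knn_impute` fills the NaN cell from the two closest fully observed rows.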

 Dataset Preparation

This experiment was implemented using five datasets from the Kaggle website, tabulated in Table 1. The datasets were loaded for feature extraction and standardization of the features. Preprocessing was done to check whether outliers and missing values were present. This leads to the next step, Outlier Analysis, which is the emphasis of the experiment: detecting outliers and substituting them with the results from the analysis.

 Evaluation Criteria

For the evaluation of the model, different metrics were used: Root Mean Square Error (RMSE), Mean Squared Error (MSE), R-squared (R2) and Mean Absolute Error (MAE).

 The RMSE metric computes the difference between the observed value and the imputed value.
 MSE measures the average of the squared differences between the predicted values and the actual target values. The lower the MSE, the closer the model's results are to the true values.

Table 1 Details of the Datasets Used

Dataset No.  Dataset Name                        Rows       Attributes
1            Dissolved O2 River Water            3500       37
2            Crop Recommendation                 1470       20
3            Online Course Engagement            4650       12
4            Health Care Diabetes                1460       6
5            Amazon Cell Phone and Accessories   10448570   12
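The pre-processing check described under Dataset Preparation, standardizing the features and verifying whether missing values and outliers are present, might look like the sketch below. The IQR flagging rule and the function name are our choices for illustration; the paper does not specify how this check was implemented:

```python
import numpy as np

def preprocess_report(X):
    """Standardize features and report missing-value and IQR-outlier
    counts per column, as a quick pre-processing check."""
    n_missing = np.isnan(X).sum(axis=0)
    q1, q3 = np.nanpercentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((X < low) | (X > high)).sum(axis=0)  # NaN comparisons are False
    mu, sigma = np.nanmean(X, axis=0), np.nanstd(X, axis=0)
    X_std = (X - mu) / np.where(sigma == 0, 1.0, sigma)  # guard constant columns
    return X_std, n_missing, n_outliers

# One extreme value in column 0, one missing value in column 1.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 12.0], [100.0, 11.0]])
X_std, n_missing, n_outliers = preprocess_report(X)
```

NaN entries survive standardization unchanged, so the subsequent outlier-analysis and imputation steps still see exactly which cells are missing.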


 R2, the Coefficient of Determination, is another metric used to evaluate the model's goodness of fit. As the R2 score moves towards one, the regression line moves towards perfection. It was used in this study because of its ability to measure variability.
 MAE, the Mean Absolute Error, matches the error value units to the predicted target value units. MAE changes intuitively with the error and, unlike MSE, does not inflate large errors by squaring them.

IV. RESULTS AND DISCUSSION

The proposed model consists of outlier removal and imputation, whereas the other KNN imputation techniques do not take outlier removal into consideration. Tables 2-6 below show the evaluation results of both the proposed model and basic KNN for comparison.

Across the metrics used, the proposed model shows better results than KNN, as seen in the tables for the various datasets. RMSE had the worst results for both models on all the datasets, but was still better for the proposed model based on the simulation results.
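The four evaluation metrics can be computed directly from the held-out true values and the corresponding imputed values. A small sketch follows; the function name is ours, and we assume scoring is done on entries whose original values are known:

```python
import numpy as np

def imputation_scores(y_true, y_pred):
    """RMSE, MSE, R2 and MAE between held-out true values and imputed
    values, the four metrics used to compare the two models."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # variance around the mean
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MSE": mse, "R2": r2, "MAE": mae}

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.0, 2.0, 3.0, 4.0])  # one value imputed incorrectly
scores = imputation_scores(y_true, y_pred)
```

Lower RMSE, MSE and MAE and an R2 closer to one indicate an imputation closer to the true values, which is how Tables 2-6 should be read.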

Table 2 Dissolved O2 River Water Results


Dissolved O2 River Water Results
Metric ECOD-KNN(Proposed System) KNN
RMSE 1.775 1.958
MSE 3.246 3.834
R2 0.542 0.548
MAE 1.320 1.425

Table 3 Online Course Engagement Dataset


Online Course Engagement Dataset
Metric ECOD-KNN(Proposed System) KNN
RMSE 1.523 2.518
MSE 0.223 1.331
R2 0.742 0.948
MAE 1.202 1.282

Table 4 Crop Recommendation Dataset


Crop Recommendation Dataset
Metric ECOD-KNN(Proposed System) KNN
RMSE 0.923 1.210
MSE 0.389 1.765
R2 0.427 0.812
MAE 1.897 3.423

Table 5 Health Care Diabetes Dataset


Health Care Diabetes Dataset
Metric ECOD-KNN(Proposed System) KNN
RMSE 1.302 1.838
MSE 0.482 0.935
R2 0.923 1.275
MAE 1.193 1.585

Table 6 Amazon Cell Phone and Accessories Product Ratings Dataset


Amazon Cell Phone and Accessories Product Ratings Dataset
Metric ECOD-KNN(Proposed System) KNN
MSE 0.0153 0.0529
R2 0.9645 0.9742
MAE 0.046 0.245

V. CONCLUSION

Important information goes missing when a dataset has missing values, so missing values have to be imputed to avoid such scenarios. Imputing missing values ensures that the dataset is complete, which helps the various models produce the accurate results on which decision making is based. KNN is widely used to impute missing values among other techniques. However, one of its disadvantages is that it is sensitive to outliers, which was the focus of this study. The study detected outliers using a combination of Local Outlier Factor (LOF), Isolation Forest (IForest) and ECOD. After averaging the detectors' outlier results, the outliers are replaced in the dataset with the median of the non-outlier data. The k-nearest




neighbors algorithm is then used to impute the missing values. After testing the model with the five datasets, the evaluation using RMSE, MSE, R2 and MAE clearly indicated that dealing with outliers before imputing missing values produces better imputation results than the traditional KNN technique alone, which is sensitive to outliers. Despite the good performance of the proposed ECOD-KNN model, there may be other missing value imputation techniques that perform better. Also, KNN operates by memorizing the entire dataset, which can be a disadvantage.

REFERENCES

[1]. H. Nugroho, N.P. Utama, and K. Surendro, "Normalization and outlier removal in class center-based firefly algorithm for missing value imputation," Journal of Big Data, (2021) 8:129.
[2]. D. Chehal, P. Gupta, P. Gulati, and T. Gupta, "Comparative Study of Missing Value Imputation Techniques on E-Commerce Product Ratings," Informatica 47 (2023) 373-382.
[3]. A.F. Sallaby and Azlan, "Analysis of Missing Value Imputation Application with K-Nearest Neighbor (K-NN) Algorithm in Dataset," International Journal of Informatics and Computer Science, Vol. 5 No. 2, July 2021, pp. 141-144.
[4]. P. Mishra, K.D. Mani, P. Johri, and D. Arya, "FCMI: Feature Correlation based Missing Data Imputation."
[5]. I.S. Jacobs and C.P. Bean, "Fine particles, thin films and exchange anisotropy," in Magnetism, vol. III, G.T. Rado and H. Suhl, Eds. New York: Academic, 1963, pp. 271-350.
[6]. F.E. Harrell, Jr., "Regression Modeling Strategies," Nashville, TN, USA, July 2015, ISSN 2197-568X.
[7]. C.K. Enders, "Applied Missing Data Analysis," Second Edition, 2022, pp. 1-43.
[8]. M. Tannous, M. Miraglia, F. Inglese, L. Giorgini, F. Ricciardi, R. Pelliccia, M. Milazzo, and C. Stefanini, "Haptic-based Touch Detection for Collaborative Robots in Welding Applications," Robotics and Computer-Integrated Manufacturing, 2020.
[9]. L.Y. Wang, D. Wang, and Y.H. Chen, "Prediction of Protein Subcellular Multisite Localization Using a New Feature Extraction Method," Genetics and Molecular Research: GMR, 2016.
[10]. F. Pirotti, R. Ravanelli, F. Fissore, and A. Masiero, "Implementation and Assessment of Two Density-based Outlier Detection Methods Over Large Spatial Point Clouds," Open Geospatial Data, Software and Standards, 2018.
[11]. P. Keerin, W. Kurutach, and T. Boongoen, "Cluster-based KNN Missing Value Imputation for DNA Microarray Data," 2012 IEEE International Conference on Systems, Man, and Cybernetics, 2012.
[12]. K.M. Fouad, M.M. Ismail, A.T. Azar, and M.M. Arafa, "Advanced Methods for Missing Values Imputation Based on Similarity Learning," PeerJ Computer Science, 2021.
[13]. S. Patra and B. Ganguly, "Improvising Singular Value Decomposition by KNN for Use in Movie Recommender Systems," Journal of Operations and Strategic Planning, 2019.
[14]. N. Rabiei, A.R. Soltanian, M. Farhadian, and F. Bahreini, "The Performance Evaluation of the Random Forest Algorithm for a Gene Selection in Identifying Genes Associated with Resectable Pancreatic Cancer in Microarray Dataset: A Retrospective Study," Cell Journal, 2023.
[15]. F. Yang, J. Du, J. Lang, W. Lu, L. Liu, C. Jin, and Q. Kang, "Missing Value Estimation Methods Research for Arrhythmia Classification Using the Modified Kernel Difference-Weighted KNN Algorithms," BioMed Research International, 2020.
[16]. Z. Zhang, "Introduction to Machine Learning: K-nearest Neighbors," Annals of Translational Medicine, 2016.
[17]. A. Hamed, A. Sobhy, and H. Nassar, "Accurate Classification of COVID-19 Based on Incomplete Heterogeneous Data Using a KNN Variant Algorithm," Arabian Journal for Science and Engineering, 2021.
[18]. N. Rabiei, A.R. Soltanian, M. Farhadian, and F. Bahreini, "The Performance Evaluation of the Random Forest Algorithm for a Gene Selection in Identifying Genes Associated with Resectable Pancreatic Cancer in Microarray Dataset: A Retrospective Study," Cell Journal, 2023.
[19]. M. Zaki, Shao-jie Chen, Jicheng Zhang, Fan Feng, Liu Qi, M.A. Mahdy, and Linlin Jin, "Optimized Weighted Ensemble Approach for Enhancing Gold Mineralization Prediction," Applied Sciences, 2023.
[20]. S. Sheikhi, M.T. Kheirabadi, and A. Bazzazi, "A Novel Scheme for Improving Accuracy of KNN Classification Algorithm Based on the New Weighting Technique and Stepwise Feature Selection," 2020.
[21]. M. Zhang and W. Xu, "Study on an Improved Lie Group Machine Learning-based Classification Algorithm," 2020 IEEE 3rd International Conference of Safe Production ..., 2020.
[22]. E.Y. Boateng, J. Otoo, and D.A. Abaye, "Basic Tenets of Classification Algorithms K-Nearest-Neighbor, Support Vector Machine, Random Forest and Neural Network: A Review," 2020.
