A Method For Missing Values Imputation of Machine Learning Datasets
Corresponding Author:
Youssef Hanyf
National School of Commerce and Management of Dakhla, Ibn Zohr University
Dakhla, Morocco
Email: [email protected]; [email protected]
1. INTRODUCTION
In the last decade, machine-learning classification methods have become increasingly required and
used in various outstanding technologies such as health care [1], social media, and recommendation systems.
Consequently, many machine-learning-related problems have attracted the attention of a large community of
researchers. Handling missing data is one of the most severe problems in machine-learning classification
because it significantly affects classification accuracy [2], [3]. Despite the increasing development of data
collection and acquisition technologies, various reasons can lead to losing values in datasets, such as device
breakdowns, power cuts, and unanswered form questions [4]. Therefore, datasets often require a preprocessing
phase to impute the missing values before training and testing classification models.
The intuitive way to deal with missing data is to delete the features or instances containing
missing values [5]–[7]. However, this method risks losing important information in datasets and can
significantly impact classification accuracy. Many other methods have been used and proposed in the literature
to impute missing data and thereby increase classification accuracy [8], [9]. These methods can be classified into two
principal categories: statistical methods, such as mean/mode and least squares (LS), and machine-
learning-based methods, such as k-nearest neighbors (KNN), neural networks (NN), and decision trees (DT) [10].
Hoque et al. [5] compared the imputation accuracy of many machine-learning-based methods. They found
that the AdaBoost classifier and the linear support vector machine (SVM) are better than logistic regression (LR) and
random forest (RF). However, this study was carried out on only one dataset, so these results need to be
validated on other datasets.
Machine-learning-based imputation methods are better than statistical methods in terms of
classification accuracy, but they are computationally expensive due to the cost of training a model for each
feature that contains missing values [11], [12]. Consequently, statistical imputation methods remain widely
used in practice, especially for massive and high-dimensional datasets, where imputation becomes very
expensive. Two recent trends have appeared for improving classical imputation methods: hybrid methods and
multiple-imputation methods. Hybrid methods aim to optimize the trade-off between imputation cost and
classification accuracy by combining statistical and machine-learning approaches [11]–[13]. Multiple-
imputation methods aim to increase the imputation accuracy by imputing missing values with several estimated
values and keeping the one that achieves the best accuracy [14].
The class center missing values imputation (CCMVI) method is among the hybrid methods that have shown
both an inexpensive imputation cost and good accuracy. It computes the center of each class and
then imputes the missing values of each incomplete instance based on its class center. The main drawback of the
CCMVI method is that it is applicable only to instances of known classes, which is not the case in real-
world applications of machine-learning models. Consequently, the CCMVI method cannot be employed after
model deployment in real-world applications. Likewise, the classes of test-dataset instances should be
assumed unknown to simulate the usage of the model in real-world applications and to ensure the credibility of the
performance evaluation results. Thus, the CCMVI method can successfully impute missing values in training
datasets, but it is not appropriate for test datasets, whose instance classes should be assumed unknown so that
the same imputation method used after model deployment can be employed.
In this work, we propose an imputation method (CCMVI+) that extends the CCMVI method to handle
missing values in test datasets. On one side, we combine the CCMVI method, which imputes training datasets,
with statistical or machine-learning-based imputation methods, such as KNN and mean, to impute test datasets. On
the other side, we propose two new techniques for imputing missing values based on the classes' centers
determined during the training-dataset imputation. One technique, called the nearest class center method (N_CC),
imputes the missing values of an instance based on its nearest neighbor among the classes' centers, while the
other technique, called in this paper the class centers mean (CC-Mean), is based on the mean of the classes' centers.
Thus, there are three possible versions of the proposed method, obtained by combining the CCMVI with these
techniques, namely CCMVI+a_literature_method, CCMVI+N_CC, and CCMVI+CC-Mean. We evaluated the
accuracy and the computation time of the proposed method on six datasets of different sizes, dimensionalities,
and numbers of classes. We compared the accuracies of state-of-the-art classification models trained and tested
on datasets imputed by the proposed method, and we also evaluated the computation time consumed by the
proposed method on test datasets.
The rest of this paper is organized as follows. Section 2 reviews the related literature and describes the
CCMVI method, and section 3 presents the proposed method. The research method is presented in section 4,
the results are presented and analyzed in section 5, and section 6 concludes the paper.
2. RELATED WORK
2.1. Methods of missing values imputation
Many works in the literature [11]–[13], [15] categorize imputation methods into two categories:
statistical methods and machine-learning-based methods. The mean method is among the simplest
statistical methods; it replaces each missing value with the average value of the corresponding feature. Other,
more advanced statistical methods have been proposed in the literature, such as the expectation-maximization
(EM) method [16], the LR method [17], and the least squares (LS) method [18]. Machine-learning-based
methods create a model for each feature that contains missing values: the feature with missing values is
regarded as the target of a classification/regression model trained on the remaining data, and each missing
value is predicted by the corresponding model. Among the many machine-learning-
based imputation methods, SVM, DT [19], KNN [20], and RF are the most popular in the literature [21].
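To make the two families concrete, the following minimal Python sketch (ours, not code from the works cited above) contrasts a statistical imputer with a machine-learning-based one using scikit-learn's SimpleImputer and KNNImputer on a toy NumPy array in which np.nan marks missing values:

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy dataset: rows are instances, columns are features; np.nan marks missing values.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [7.0, 8.0, 12.0]])

# Statistical imputation: each missing value becomes its feature's mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Machine-learning-based imputation: each missing value is estimated from
# the feature values of the k nearest neighbors of the incomplete instance.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_mean)
print(X_knn)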
Osman et al. [15] categorize imputation methods into traditional and modern methods. The
traditional category includes missing-data deletion techniques and some single-imputation methods
such as mean, hot-deck, cold-deck, and regression imputation. The modern category
contains multiple-imputation-based methods, which analyze multiple imputation choices and adopt the best
one, the EM method, and machine-learning methods such as KNN [20], [22], NN [23], and DT [19].
3. PROPOSED METHOD
The proposed method, CCMVI+, differentiates between training datasets and test datasets. It imputes
incomplete instances of training datasets following the same approach as the CCMVI method. However, for
imputing test datasets, we propose three different techniques that do not require identifying the
instance's class. These techniques can also handle incomplete instances in real-world machine-learning
applications. Thus, the CCMVI+ method consists of two parts. The first part deals with training datasets, where
the classes of instances are known, and the second part handles instances with unknown or assumed-unknown
classes, mainly test datasets and instances from real-world applications that use machine-learning models. All
imputation methods from the literature are applicable in the second part. Furthermore, we propose two algorithms
specifically designed for this purpose, which are described in the subsections below. Figure 1 presents a high-
level flow diagram illustrating the CCMVI+ process for imputing data for training, testing, and using machine-
learning classification models.
The imputation of a training dataset starts by computing the center (average) of each class and a per-class threshold based on the distances between the class center and the complete instances of the class. Formally, the threshold of a class i is calculated by using the following formula:

$threshold(i) = \frac{1}{n} \sum_{k=1}^{n} Euclidean\_distance(I_{complete}(k), avg(i))$ (1)
where n is the number of complete instances in class i and I_complete(k) is the k-th complete instance of class
i. Next, the algorithm distinguishes between two cases to impute the missing values in the incomplete instances of
each class. The first case is when the instance contains just one missing value. In this case, the algorithm
replaces the missing j-th attribute of an incomplete instance I_incomplete of class i with the j-th attribute of the already
calculated average (center) Avg(i) of class i; I_incomplete(j) = Avg(i, j). Then, if the Euclidean distance between the
instance I_incomplete and the class average Avg(i) exceeds the class threshold, the algorithm
decreases/increases the imputed value by the corresponding attribute of the class standard deviation as follows:
$I_{incomplete}(j) = I_{incomplete}(j) + std(i,j)$ or $I_{incomplete}(j) = I_{incomplete}(j) - std(i,j)$ (2)
The second case is when the instance contains more than one missing value. In this case, the algorithm
replaces each missing attribute of the incomplete instance I_incomplete with the corresponding attribute of the
class center. If the Euclidean distance between the instance and the class center exceeds the class threshold,
the algorithm increases or decreases the imputed values by using the corresponding attributes of the class
standard deviation and recalculates the distance to the class center. Finally, the algorithm adopts the instance
that minimizes the distance to the center.
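For illustration, the single-missing-value case can be sketched in Python as follows (function and variable names are illustrative, and the candidate selection reflects our reading of the description above):

import numpy as np

def class_threshold(complete, center):
    """Equation (1): mean Euclidean distance between the class's
    complete instances (rows of `complete`) and its center avg(i)."""
    return np.mean(np.linalg.norm(complete - center, axis=1))

def impute_one_missing(instance, j, center, std, threshold):
    """Case 1 sketch: fill attribute j with the class-center value; if the
    result lies farther from the center than the class threshold, try the
    +/- std(i, j) shifts of equation (2) and keep the candidate closest
    to the center."""
    filled = instance.copy()
    filled[j] = center[j]                      # I_incomplete(j) = Avg(i, j)
    if np.linalg.norm(filled - center) > threshold:
        for value in (center[j] + std[j], center[j] - std[j]):
            trial = instance.copy()
            trial[j] = value
            if np.linalg.norm(trial - center) < np.linalg.norm(filled - center):
                filled = trial
    return filled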
Computing the distances between an incomplete instance and the centers requires temporary handling of the
missing values. There are two possible ways to handle the missing values before distance computation:
deleting the attributes with missing values or replacing the missing values with a constant. The proposed
technique substitutes missing values in incomplete instances, and the corresponding values in centers, with
zeros when computing distances (lines 4, 10, and 11 of algorithm 1), which is equivalent to deleting the
attributes with missing values.
Algorithm 1 provides the pseudocode of the proposed technique to impute an incomplete instance
I_incomplete based on a set C of classes' centers. The algorithm identifies the nearest neighbor of the incomplete
instance among the classes' centers. To this end, it replaces the missing values with zeros and records their indexes
(lines 2-7). Then, it calculates the Euclidean distances between the instance and the centers to identify the nearest
neighbor (lines 9-14). Finally, the algorithm replaces the instance's missing values with the values of the
corresponding attributes of the identified nearest center (lines 16, 17, and 18). Figure 2 illustrates an example
of using the nearest-center technique to impute an incomplete instance in a six-dimensional, four-class dataset.
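For illustration, the steps of algorithm 1 can be sketched in Python as follows (np.nan encodes missing values; names are illustrative):

import numpy as np

def n_cc_impute(instance, centers):
    """Sketch of the N_CC technique (algorithm 1): `instance` is a 1-D array
    with np.nan marking missing values; `centers` is a (k, m) array of the
    k class centers."""
    missing = np.isnan(instance)                      # indexes of missing values
    probe = np.where(missing, 0.0, instance)          # zero out missing attributes
    masked_centers = np.where(missing, 0.0, centers)  # zero the same attributes in centers
    distances = np.linalg.norm(masked_centers - probe, axis=1)
    nearest = centers[np.argmin(distances)]           # nearest class center
    imputed = instance.copy()
    imputed[missing] = nearest[missing]               # copy the nearest center's attributes
    return imputed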
The algorithm of this method (see the pseudocode in algorithm 2) calculates the average of the classes'
centers. Then it replaces the missing values of the incomplete instance with the corresponding attributes of this
mean of centers. Figure 3 illustrates an example of using the CC_Mean technique to impute missing values.
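A corresponding sketch of the CC_Mean technique, under the same NaN-encoding conventions:

import numpy as np

def cc_mean_impute(instance, centers):
    """Sketch of the CC_Mean technique (algorithm 2): replace each missing
    value with the corresponding attribute of the mean of the k class centers."""
    grand_center = centers.mean(axis=0)   # mean of the class centers
    missing = np.isnan(instance)
    imputed = instance.copy()
    imputed[missing] = grand_center[missing]
    return imputed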
4. RESEARCH METHOD
4.1. Datasets
We used six numerical datasets with different numbers of instances, features, and
classes to evaluate the performance of the proposed imputation techniques. The number of classes ranges between
2 and 11, the number of features between 11 and 64, and the number of instances between 208 and
19020 (see Table 1). The distribution of the Euclidean distance from the centroid of each dataset is presented in
Figure 4. All datasets were downloaded from the University of California Irvine Machine Learning Repository,
except the Texture dataset, which was downloaded from the OpenML repository [24].
Despite the superiority of the CCMVI+KNN accuracy, the average difference between the
CCMVI+KNN accuracy and that of the proposed CCMVI+N_CC is insignificant (0.03). The average accuracy
of the proposed CCMVI+N_CC exceeds that of CCMVI+KNN on the Optdigits dataset (see Table 3) and
exceeds all other methods at high missing rates on the Optdigits and Segment datasets. We can conclude that the
CCMVI+N_CC method gives its best accuracies on datasets with high missing rates and high numbers of
classes (Optdigits, Texture, and Segment).
Figure 5 presents the average accuracy of each method per missing rate over all datasets. For small
missing rates (10%), the CCMVI+Mean and the proposed CCMVI+CC_Mean techniques give almost the same
accuracy. Although the CCMVI+KNN method outperforms all others, its gap with CCMVI+N_CC
decreases significantly at higher missing rates (40%). Figures 6 to 11 present the
classification accuracies on the Sonar, Magic, Wall-robot-navigation, Segment, Optdigits, and Texture datasets,
respectively. The results confirm that the CCMVI+KNN method is more accurate than the other methods in the
majority of cases.
Figure 5. Average accuracies per missing rate
Figure 6. Classification accuracies on Sonar dataset
Figure 9. Classification accuracies on Segment dataset
Figure 10. Classification accuracies on Optdigits dataset
Figure: Imputation time (in seconds) of the constant, KNN, Mean, N_CC, and CC-Mean imputation methods on the Magic, Optdigits, Texture, Wall-robot-navigation, Segment, and Sonar datasets
The constant, the Mean, and the CC_Mean methods are significantly faster than the KNN method.
The complexity of computing the average of instances in the Mean method is O(n*m), where n is the number
of instances and m is the number of features. In contrast, the CC_Mean complexity of computing the average
of centers is O(k*m), where k is the number of classes. Although computing the average with CC_Mean
is significantly less expensive than with the classical Mean, one can observe that there is no
significant impact on the total imputation time because the average is computed only once to
impute all instances.
In these experiments, the proposed N_CC imputation method is 20 times faster than the KNN
imputation method. The naive KNN algorithm requires O(n) distance computations between the incomplete
instance and the training-dataset instances. Data structures such as the kd-tree or ball tree can reduce
the complexity of the KNN imputation to approximately O(log n) distance computations [20], [22], [30]–[35],
whereas the naive N_CC requires only O(k) distance computations, where k is the number of classes. For
example, for a dataset of 19020 instances and two classes, naive KNN computes 19020 distances per
incomplete instance, whereas N_CC computes only two.
6. CONCLUSION
In this work, we proposed an extension of the CCMVI imputation method, called CCMVI+, to handle
missing values in test datasets. The CCMVI+ method uses the classical CCMVI to impute training datasets
and provides three possible techniques for imputing test datasets. The first proposed technique combines the
CCMVI with imputation methods from the literature for test datasets. The second technique identifies the nearest class
center for imputing test datasets, whereas the third technique computes the mean of the classes' centers.
In the experiments, we compared the classification accuracies of machine-learning methods on
datasets imputed by methods that use the proposed techniques, namely the CCMVI+Constant, CCMVI+Mean,
CCMVI+KNN, CCMVI+CC_Mean, and CCMVI+N_CC imputation methods. The results show that the
combination of CCMVI and KNN outperforms the other methods and that the proposed CCMVI+N_CC
is the second-best choice in terms of classification accuracy. The results also show that the difference between
the proposed second technique (N_CC) and the CCMVI+KNN combination becomes less significant at high
missing rates. Moreover, the difference between the proposed technique based on the mean of classes' centers
(CC_Mean) and the CCMVI+Mean is highly insignificant. We also compared the computation times of the
proposed methods. The results show that KNN is the most computationally expensive imputation method
compared with N_CC and CC_Mean, and that the computation times of CC_Mean and the classical Mean are
approximately the same. Thus, the accuracy of the CCMVI+N_CC method is close to that of CCMVI+KNN,
and it is significantly less expensive because it treats only the classes' centers instead of the whole dataset.
The proposed CCMVI+N_CC and CCMVI+CC_Mean methods significantly save prediction time and memory
space without a high impact on accuracy, especially for high-dimensional, large, and high-missing-rate datasets.
ACKNOWLEDGEMENTS
We acknowledge the invaluable contribution to this research by the AI Data SEED team belonging to
the Research Laboratory in Management and Decision Support. Our gratitude extends to Ibn Zohr University
and ENCG of Dakhla for providing essential resources and a conducive research environment.
REFERENCES
[1] D. P. Javale and S. S. Desai, “Machine learning ensemble approach for healthcare data analytics,” Indonesian Journal of Electrical
Engineering and Computer Science, vol. 28, no. 2, pp. 926–933, Nov. 2022, doi: 10.11591/ijeecs.v28.i2.pp926-933.
[2] N. H. A. Rahman and M. H. Lee, “Artificial neural network forecasting performance with missing value imputations,” IAES
International Journal of Artificial Intelligence, vol. 9, no. 1, pp. 33–39, Mar. 2020, doi: 10.11591/ijai.v9.i1.pp33-39.
[3] H. A. Saleh, R. A. Sattar, E. M. H. Saeed, and D. S. Abdul-Zahra, “Hybrid features selection method using random forest and
meerkat clan algorithm,” Telkomnika (Telecommunication Computing Electronics and Control), vol. 20, no. 5, pp. 1046–1054, Oct.
2022, doi: 10.12928/TELKOMNIKA.v20i5.23515.
[4] A. Mirzaei, S. R. Carter, A. E. Patanwala, and C. R. Schneider, “Missing data in surveys: key concepts, approaches, and
applications,” Research in Social and Administrative Pharmacy, vol. 18, no. 2, pp. 2308–2316, Feb. 2022,
doi: 10.1016/j.sapharm.2021.03.009.
[5] J. M. Z. Hoque, J. Hossen, S. Sayeed, K. Chy Mohammed Tawsif, J. Ganesan, and J. Emerson Raja, “Automatic missing value
imputation for cleaning phase of diabetic's readmission prediction model,” International Journal of Electrical and Computer
Engineering, vol. 12, no. 2, pp. 2001–2013, Apr. 2022, doi: 10.11591/ijece.v12i2.pp2001-2013.
[6] A. Desiani, S. Yahdin, A. Kartikasari, and Irmeilyana, “Handling the imbalanced data with missing value elimination smote in the
classification of the relevance education background with graduates employment,” IAES International Journal of Artificial
Intelligence, vol. 10, no. 2, pp. 346–354, Jun. 2021, doi: 10.11591/ijai.v10.i2.pp346-354.
[7] F. Ahmad and S. A. M. Rizvi, “Identification of user’s credibility on twitter social networks,” Indonesian Journal of Electrical
Engineering and Computer Science, vol. 24, no. 1, pp. 554–563, Oct. 2021, doi: 10.11591/ijeecs.v24.i1.pp554-563.
[8] T. Emmanuel, T. Maupong, D. Mpoeleng, T. Semong, B. Mphago, and O. Tabona, “A survey on missing data in machine learning,”
Journal of Big Data, vol. 8, no. 1, p. 140, Oct. 2021, doi: 10.1186/s40537-021-00516-9.
[9] P. C. Chiu, A. Selamat, O. Krejcar, K. K. Kuok, S. D. A. Bujang, and H. Fujita, “Missing value imputation designs and methods of
nature-inspired metaheuristic techniques: a systematic review,” IEEE Access, vol. 10, pp. 61544–61566, 2022,
doi: 10.1109/ACCESS.2022.3172319.
[10] P. J. García-Laencina, J. L. Sancho-Gómez, and A. R. Figueiras-Vidal, “Pattern classification with missing data: a review,” Neural
Computing and Applications, vol. 19, no. 2, pp. 263–282, Mar. 2010, doi: 10.1007/s00521-009-0295-6.
[11] C. F. Tsai, M. L. Li, and W. C. Lin, “A class center based approach for missing value imputation,” Knowledge-Based Systems,
vol. 151, pp. 124–135, Jul. 2018, doi: 10.1016/j.knosys.2018.03.026.
[12] I. B. Aydilek and A. Arslan, “A hybrid method for imputation of missing values using optimized fuzzy c-means with support vector
regression and a genetic algorithm,” Information Sciences, vol. 233, pp. 25–35, Jun. 2013, doi: 10.1016/j.ins.2013.01.021.
[13] H. Nugroho, N. P. Utama, and K. Surendro, “Class center-based firefly algorithm for handling missing data,” Journal of Big Data,
vol. 8, no. 1, p. 37, Dec. 2021, doi: 10.1186/s40537-021-00424-y.
[14] G. S. Hassan, N. J. Ali, A. K. Abdulsahib, F. J. Mohammed, and H. M. Gheni, “A missing data imputation method based on salp
swarm algorithm for diabetes disease,” Bulletin of Electrical Engineering and Informatics, vol. 12, no. 3, pp. 1700–1710, Jun. 2023,
doi: 10.11591/eei.v12i3.4528.
[15] M. S. Osman, A. M. Abu-Mahfouz, and P. R. Page, “A survey on data imputation techniques: water distribution system as a use
case,” IEEE Access, vol. 6, pp. 63279–63291, 2018, doi: 10.1109/ACCESS.2018.2877269.
[16] L. Malan, C. M. Smuts, J. Baumgartner, and C. Ricci, “Missing data imputation via the expectation-maximization algorithm can
improve principal component analysis aimed at deriving biomarker profiles and dietary patterns,” Nutrition Research, vol. 75,
pp. 67–76, Mar. 2020, doi: 10.1016/j.nutres.2020.01.001.
[17] N. Karmitsa, S. Taheri, A. Bagirov, and P. Makinen, “Missing value imputation via clusterwise linear regression,” IEEE
Transactions on Knowledge and Data Engineering, vol. 34, no. 4, pp. 1889–1901, 2020, doi: 10.1109/TKDE.2020.3001694.
[18] Y. Zhang and Y. Liu, “Data imputation using least squares support vector machines in urban arterial streets,” IEEE Signal
Processing Letters, vol. 16, no. 5, pp. 414–417, May 2009, doi: 10.1109/LSP.2009.2016451.
[19] R. C. Barros, M. P. Basgalupp, A. C. P. L. F. De Carvalho, and A. A. Freitas, “A survey of evolutionary algorithms for decision-
tree induction,” IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews, vol. 42, no. 3, pp. 291–
312, May 2012, doi: 10.1109/TSMCC.2011.2157494.
[20] S. Zhang, D. Cheng, Z. Deng, M. Zong, and X. Deng, “A novel KNN algorithm with data-driven k parameter computation,” Pattern
Recognition Letters, vol. 109, pp. 44–54, Jul. 2018, doi: 10.1016/j.patrec.2017.09.036.
[21] W. C. Lin and C. F. Tsai, “Missing value imputation: a review and analysis of the literature (2006–2017),” Artificial Intelligence
Review, vol. 53, no. 2, pp. 1487–1509, Feb. 2020, doi: 10.1007/s10462-019-09709-4.
[22] Y. Hanyf and H. Silkan, “A fast and scalable similarity search in high-dimensional image datasets,” International Journal of
Computer Applications in Technology, vol. 59, no. 1, p. 95, 2019, doi: 10.1504/IJCAT.2019.10018181.
[23] S. J. Choudhury and N. R. Pal, “Imputation of missing data with neural networks for classification,” Knowledge-Based Systems,
vol. 182, p. 104838, Oct. 2019, doi: 10.1016/j.knosys.2019.07.009.
[24] R. G. Mantovani, “Texture,” Laboratory of Image Processing and Pattern Recognition (INPG-LTIRF), 2016.
[25] R. P. Gorman and T. J. Sejnowski, “Analysis of hidden units in a layered network trained to classify sonar targets,” Neural Networks,
vol. 1, no. 1, pp. 75–89, Jan. 1988, doi: 10.1016/0893-6080(88)90023-8.
[26] R. K. Bock et al., “Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray
telescope,” Nuclear Instruments and Methods in Physics Research, Section A: Accelerators, Spectrometers, Detectors and
Associated Equipment, vol. 516, no. 2–3, pp. 511–528, Jan. 2004, doi: 10.1016/j.nima.2003.08.157.
[27] A. L. Freire, G. A. Barreto, M. Veloso, and A. T. Varela, “Short-term memory mechanisms in neural network learning of robot
navigation tasks: a case study,” in 2009 6th Latin American Robotics Symposium, LARS 2009, Oct. 2009, pp. 1–6,
doi: 10.1109/LARS.2009.5418323.
[28] C. L. Blake and C. J. Merz, “Image segmentation data set,” UCI Machine Learning Repository, 1990.
[29] L. Xu, A. Krzyżak, and C. Y. Suen, “Methods of combining multiple classifiers and their applications to handwriting recognition,”
IEEE Transactions on Systems, Man and Cybernetics, vol. 22, no. 3, pp. 418–435, 1992, doi: 10.1109/21.155943.
[30] Y. Hanyf and H. Silkan, “A queries-based structure for similarity searching in static and dynamic metric spaces,” Journal of King
Saud University - Computer and Information Sciences, vol. 32, no. 2, pp. 188–196, Feb. 2020, doi: 10.1016/j.jksuci.2018.05.004.
[31] Y. Hanyf, H. Silkan, and H. Labani, “An improvable structure for similarity searching in metric spaces: application on image
databases,” in Proceedings - Computer Graphics, Imaging and Visualization: New Techniques and Trends, CGiV 2016, Mar. 2016,
pp. 67–72, doi: 10.1109/CGiV.2016.22.
[32] Y. Hanyf, H. Silkan, and H. Labani, “Criteria and technique to choose a good ρ parameter for the D-index,” in 2015 Intelligent
Systems and Computer Vision, ISCV 2015, Mar. 2015, pp. 1–6, doi: 10.1109/ISACV.2015.7106169.
[33] Y. Hanyf and H. Silkan, “Fast similarity search in high dimensional image data sets,” in ACM International Conference Proceeding
Series, Mar. 2017, vol. Part F1294, pp. 1–5, doi: 10.1145/3090354.3090426.
[34] Z. Kouahla et al., “A survey on big IoT data indexing: potential solutions, recent advancements, and open issues,” Future Internet,
vol. 14, no. 1, p. 19, Dec. 2022, doi: 10.3390/fi14010019.
[35] M. Zhang, L. Yang, Y. Dong, J. Wang, and Q. Zhang, “Picture semantic similarity search based on bipartite network of picture-tag
type,” PLoS ONE, vol. 16, no. 11, p. e0259028, Nov. 2021, doi: 10.1371/journal.pone.0259028.
BIOGRAPHIES OF AUTHORS
Youssef Hanyf holds a PhD degree in computer science from Chouaib Doukkali
University, Morocco, obtained in 2017. He also received his B.Sc. (mathematics and informatics) and
M.Sc. (software quality) from the same university, in 2009 and 2011, respectively.
Currently, he is a professor of computer science at the National School of Commerce and
Management of Dakhla, Ibn Zohr University, Dakhla, Morocco. His research includes high-
dimensional data processing, data structures, image processing, information retrieval,
similarity search, machine learning, and recommendation systems. He has published over 13
papers in international journals and conferences. He can be contacted at email:
[email protected] or [email protected].
Hassan Silkan received his PhD in computer science from Sidi Mohamed Ben
Abdellah University, FSDM, Morocco. Currently, he is a professor at Chouaib Doukkali
University, Department of Computer Science, Faculty of Sciences, El Jadida, Morocco. He
has published more than 32 papers in international journals and conferences in the fields of shape
representation and description, similarity search, content-based image retrieval, database
indexing, multimedia databases, and others. He can be contacted at email:
[email protected] or [email protected].