Clustering Algorithms based Noise Identification from Air Pollution Monitoring Data
Abstract—The development of data science has brought about many discussions of noise detection, and so far there is no universal best method. In this paper, we propose a clustering-algorithm-based solution to identify and remove noise from air pollution data collected with mobile portable sensors. The test dataset is the air pollution data collected by the portable sensors throughout three seasons on a campus in Macao. We have applied and compared six clustering algorithms to identify the most appropriate one for this goal: Simple K-means, Hierarchical Clustering, Cascading K-means, X-means, Expectation Maximization, and Self-Organizing Map. Their performance is evaluated by accuracy and by the best number of clusters determined by the Silhouette Coefficient. Additionally, the classification algorithm J48 tree extracts the key attributes and identifies the noise cluster for future unlabeled data that may contain noise. The experimental results indicate that Expectation Maximization and Cascading Simple K-means perform the best. Moreover, temperature and carbon dioxide are vital attributes for identifying the noise cluster.

Keywords—data clustering, portable sensor, air pollution data, noise identification, noise removal

I. INTRODUCTION

Recently, portable sensors have been widely used to collect environmental data, especially for monitoring air pollution [1] [2]. Unlike a traditional meteorological station, a portable sensor has several advantages: (1) it can collect air data in relatively small and particular areas, such as on campus or in a specific room; (2) it can record air data down to the second, reflecting the continuous changes of each attribute in the area [3]. However, the air pollution data collected by a portable sensor may be interspersed with data recorded indoors, outdoors, or while transitioning between the two states. Data with an unclear indoor/outdoor boundary are considered noise, which may interfere with further investigation and produce imprecise results. Therefore, it is crucial to distinguish the different data groups and identify noise data, such as inappropriate data and outliers, before data processing.

Noise is inevitable in data collection. Generally, one possible noise source is hardware bias, such as sensor error [4] [5]. Additionally, human misconduct during data collection may result in incorrect data. In this paper, noise data can come from moving users who carry the portable sensor devices from indoor to outdoor positions and vice versa. There can be other noise sources as well, such as the sensor's heating time.

Noise data can be detected by four kinds of techniques, summarized in [6] [7]: statistics-based, distance-based, density-based, and clustering-based. Statistics-based techniques detect noise using statistical tests that assume the dataset follows some distribution. Distance-based outlier techniques distinguish noise by testing the distance to the neighbors [8], usually represented by the k nearest examples. Density-based outlier detection identifies outliers according to the density of their surrounding domain. Clustering-based algorithms aim to identify misclassified instances and treat them as noise.

Compared to other noise-detection approaches, clustering algorithms have proven simple and reliable, fast to compute, memory efficient, and theoretically easy to understand. A clustering-based algorithm does not need to make assumptions about the data distribution, unlike a statistics-based algorithm. Moreover, clustering algorithms are considered extremely sensitive to noise, especially in bioinformatics [9]. In other similar fields, such as ECG signal analysis in medicine, clustering algorithms also lead the way [10].

In this paper, different from prior air quality analyses [11], we do not focus on classifying different air quality levels but on detecting the noise group in the mixed, unlabeled data transferred from different users with mobile portable sensors. The proposed solution can be divided into two steps: 1) clustering algorithms identify the best number of clusters and group the mixed data into different clusters; 2) the classification algorithm J48 tree examines the key attributes for detecting the noise cluster. The noise cluster can therefore be removed from future data.

To find out which clustering algorithm is more capable of distinguishing the indoor, outdoor, and noise clusters, six clustering algorithms, Simple K-means, Hierarchical, Cascading K-means, X-means, Expectation Maximization, and Self-Organizing Map, are applied and compared on a test dataset collected by the portable sensors throughout three seasons on a campus in Macao. To better examine the ability of each clustering algorithm, two scenarios are designed for the experiments: pure indoor and pure outdoor air data (without noise); and pure indoor, pure outdoor, and in-between air data (with noise). Accuracy and the Silhouette Coefficient are the two essential criteria for evaluating the performance of the clustering algorithms in each experiment. We expect a clustering algorithm to achieve high accuracy while also producing a best number of clusters that matches the number of physically distinct groups.
This work is funded by Macao Polytechnic University under
grant number RP/ESCA-01/2021.
These results can then be applied to identify future noise data in the unlabeled, mixed data collected on the user side at the same locations. If the number of clusters determined by the best clustering algorithm is equal to or greater than three, the assumption is made that noise exists. The critical attributes obtained from the J48 tree are then used to distinguish the noise cluster and remove it.
II. RELATED WORK

A. Clustering Algorithms

Clustering algorithms are unsupervised algorithms that aim to keep similar objects in one group and separate less similar objects into different groups. Clustering algorithms can be sorted into several technical categories according to their emphases and features, such as partitioning, hierarchical, density-based, distance-based [12], and neural-network-based.

In this work, the clustering algorithms used can be classified into two categories: algorithms that need a predefined number of clusters, including Simple K-means, Hierarchical, Expectation Maximization, and Self-Organizing Map; and algorithms that can automatically find the number of clusters, including Cascading Simple K-means and X-means.

1) Simple K-means
As a partitioning-based clustering algorithm, K-means can be regarded as a ramification of distance-based clustering. K-means requires a specific number of clusters, stored in the value k, to initialize the computation. K-means randomly selects k objects from the dataset to represent the cluster centers. Given these centers, the distances of all remaining data objects to the centers are calculated, and each data object is assigned to the cluster at the smallest distance [13][14]. K-means benefits from its simplicity, fast computation, and flexibility. However, it has several flaws: the number of clusters must be entered manually, the initialization affects the result significantly, and it is sensitive to the scale of the dataset.
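As a minimal sketch of this procedure (using scikit-learn's KMeans as a stand-in for WEKA's SimpleKMeans used in this paper; the random data matrix is a placeholder for roughly 400 normalized sensor records):

# Sketch: K-means on placeholder sensor data (scikit-learn stand-in).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(400, 7)                                # 400 records x 7 attributes in [0, 1]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)  # k must be supplied in advance
labels = kmeans.fit_predict(X)                            # each record -> nearest centroid
print(kmeans.cluster_centers_)                            # centroids after convergence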
2) Hierarchical
Hierarchical clustering [15] nests clusters to build a tree: at each level, a cluster's sub-clusters are united and assigned to it. Each node in the tree reflects a cluster in a hierarchy called a dendrogram. Hierarchical clustering is easy to implement, and its results are easy to read thanks to the well-structured dendrogram. However, it is less flexible than K-means and may need a longer processing time.
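A minimal sketch of agglomerative clustering with SciPy (not the WEKA implementation used here; the data and the two-cluster cut are illustrative):

# Sketch: build a dendrogram and cut it into two flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(400, 7)                       # placeholder normalized records
Z = linkage(X, method="ward")                    # build the cluster tree bottom-up
labels = fcluster(Z, t=2, criterion="maxclust")  # flatten the tree into 2 clusters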
3) Cascading K-means
Cascading K-means clustering [16] is based on K-means and abides by the partitioning rule for clustering. It draws on the criterion proposed by Caliński and Harabasz [17], adding statistics and hypothesis testing to the original K-means.
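A sketch of the selection idea behind cascading K-means, choosing k by the Caliński–Harabasz variance-ratio criterion [17] (the candidate range of k is illustrative, not part of the original algorithm's specification):

# Sketch: cascade K-means over candidate k and keep the best CH score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.random.rand(400, 7)
scores = {}
for k in range(2, 6):                               # candidate values of k
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)  # variance-ratio criterion
best_k = max(scores, key=scores.get)                # largest criterion value wins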
4) X-means
The X-means algorithm was developed by Dan Pelleg and Andrew Moore [18], and it enables estimating the number of clusters in a short time. X-means operates after every run of K-means and determines whether any subset covered by a current centroid should be split into different clusters; the Bayesian Information Criterion (BIC) is used in making this decision. After that, X-means re-calculates the centroid and assigns the center of mass of the points to it. X-means and Cascading K-means do not require the number of clusters (k) in advance.
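X-means itself is not part of common Python libraries; the sketch below illustrates only the BIC-based model selection it relies on, using GaussianMixture's BIC as a simplified stand-in rather than X-means' per-centroid splitting:

# Sketch: BIC-driven choice of the number of clusters (stand-in for X-means).
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(400, 7)
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(2, 6)}
best_k = min(bic, key=bic.get)  # lower BIC = better fit/complexity trade-off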
5) Expectation Maximization (EM)
The EM algorithm consists of two main steps: the expectation step and the maximization step. The expectation step uses the current parameter estimates to compute expectations over the unknown underlying variables, while the maximization step uses these expectations to produce new parameter estimates. The two steps are not independent but interact with each other until they reach convergence. The strength of EM clustering is that the implementation is relatively straightforward, while the drawback is that, in some cases, its linear convergence can become slow [19].
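The alternation of the two steps can be made concrete with a hand-written EM loop for a one-dimensional mixture of two Gaussians (a toy sketch; the synthetic data and component count are illustrative only):

# Sketch: hand-written EM for a 1-D two-component Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])
mu, sigma, w = np.array([0.5, 3.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(100):
    # E-step: responsibility of each component for each point
    pdf = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = w * pdf
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate means, variances, and mixing weights
    n_k = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
    w = n_k / len(x)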
6) Self-Organizing Map (SOM)
A Self-Organizing Map is an artificial-neural-network-based clustering algorithm. It generates a low-dimensional projection of a high-dimensional data distribution in which the similarity relationships between data items are maintained [20]. The so-called map is trained on input samples under unsupervised learning. SOM differs from other neural-network-based clustering algorithms in that it applies competitive learning instead of error-correction learning.
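The competitive-learning update at the heart of SOM can be sketched directly (a heavily simplified toy version; the grid size, learning-rate schedule, and neighbourhood kernel are illustrative choices, not the paper's configuration):

# Sketch: core SOM update - find the best matching unit, pull neighbours toward it.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((400, 7))            # placeholder records
grid = rng.random((5, 5, 7))           # 5x5 map, one weight vector per node

for t, x in enumerate(data):
    lr = 0.5 * np.exp(-t / 400)                       # decaying learning rate
    d = np.linalg.norm(grid - x, axis=2)              # distance of x to every node
    bi, bj = np.unravel_index(np.argmin(d), d.shape)  # best matching unit (BMU)
    for i in range(5):                                # pull BMU and neighbours toward x
        for j in range(5):
            h = np.exp(-((i - bi) ** 2 + (j - bj) ** 2) / 2.0)
            grid[i, j] += lr * h * (x - grid[i, j])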
B. Platform for Data Clustering

In this paper, we compare the performance of the six clustering algorithms introduced in Section II.A on WEKA [21]. WEKA is a data mining platform that supports data preparation and multiple machine learning methods.

C. Silhouette Coefficient for Determining the Best Number of Clusters

On WEKA, Cascading K-means and X-means can automatically determine the number of clusters, while K-means, Hierarchical, EM, and SOM require a predefined number of clusters. However, in some circumstances, the number of clusters in the clustering results does not match the expected outcome. Thus, a method for evaluating the performance of a given k is needed. The Silhouette Coefficient [22] is a metric that measures the quality of a clustering result, with a range from -1 to 1. The formulas of the Silhouette Coefficient are shown in (1) and (2). The closer the coefficient is to 1, the more accurate the clustering, and the number of clusters (k) corresponding to the largest coefficient is the best.

s(i) = (b(i) - a(i)) / max{a(i), b(i)}    (1)

SC = (1/N) * sum_{i=1..N} s(i)    (2)

Assume a dataset is processed by a specific clustering algorithm and clustered into k clusters. For every vector x_i in a cluster, taken as an instance, a(i) is the average distance from x_i to the other points in the same cluster as x_i, and b(i) is the average distance from x_i to the points that do not belong to the same cluster.
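A minimal sketch of choosing the best k by the largest mean Silhouette Coefficient, mirroring (1) and (2) (scikit-learn's silhouette_score returns the mean of s(i) over all samples; data and k range are illustrative):

# Sketch: pick the k with the largest mean silhouette value.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(400, 7)
sil = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil[k] = silhouette_score(X, labels)  # mean s(i) over all samples, as in (2)
best_k = max(sil, key=sil.get)            # coefficient closest to 1 wins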
D. J48 Tree to Obtain the Key Attributes

The J48 tree is a decision-tree classifier. It learns a supervised classification function that gives the value of the dependent attribute from the values of the input attributes [23]. Data are classified based on the entropy calculated while forming the tree. The most significant advantage of the J48 tree is that its tree structure is easy for a human to interpret. However, the running time of the algorithm corresponds to the depth of the tree, which cannot be greater than the number of attributes, so the processing time is extended if the dataset is huge. In this paper, the J48 tree is used to obtain the key attributes for detecting the noise class, based on the conditions in the tree's nodes and branches.
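J48 (C4.5) is WEKA-specific; a sketch of the same idea with scikit-learn's entropy-based CART tree as a stand-in, surfacing the key attributes via feature importances (the data and labels are placeholders):

# Sketch: rank the attributes that best split indoor/outdoor/noise classes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

features = ["PM1.0", "PM2.5", "PM10", "HCHO", "CO2", "Temperature", "Humidity"]
X = np.random.rand(400, 7)              # placeholder normalized readings
y = np.random.randint(0, 3, 400)        # placeholder indoor/outdoor/noise labels
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
ranked = sorted(zip(features, tree.feature_importances_),
                key=lambda p: p[1], reverse=True)
print(ranked[:2])                       # the key splitting attributes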
III. EXPERIMENT DESIGN

Portable sensors collect air pollution data for comparing clustering performance, and two scenarios are designed.

A. Data Collection

Portable sensor devices that integrate several kinds of sensors are used to detect contaminants and indexes in the air, such as PM1.0, PM2.5, PM10, Formaldehyde (HCHO), Carbon Dioxide, Temperature, and Humidity. Data are collected from the fall until the following spring within the campus, at locations such as different academic buildings, the student dormitory, the playground, and the parking lots. There are two scenarios: pure indoor and pure outdoor (without noise); and pure indoor, pure outdoor, and mixed (with noise; as shown in Fig. 1, noise data are collected when the door is open). In each scenario, ten sets of data are collected for comparison. Air pollution data are recorded every 10 seconds after the sensor device is activated. Additionally, the data collected in the first five minutes may introduce bias, since the sensor heats up after being activated. Therefore, we removed the air data collected in the first five minutes and used only the data collected after the sensor stabilized. For each experiment, about 400 data records are kept.
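A sketch of discarding the five-minute warm-up period with pandas (the file name and the timestamp column are hypothetical, assuming records arrive every 10 seconds):

# Sketch: drop the sensor warm-up rows before analysis.
import pandas as pd

df = pd.read_csv("session.csv", parse_dates=["timestamp"])       # hypothetical schema
start = df["timestamp"].min()
stable = df[df["timestamp"] >= start + pd.Timedelta(minutes=5)]  # keep stabilized data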
Fig. 1. Two Scenarios (without or with noise): indoor, outdoor, and the noise transition between them.
B. Data Normalization

Data normalization scales and shifts the distributions of the raw attributes into the same fixed range of 0 to 1. Convergence can be accelerated, and model performance can be improved [24]. In this paper, min-max normalization is used, as shown in (3):

x' = (x - x_min) / (x_max - x_min)    (3)
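A direct sketch of equation (3) applied per attribute (the raw matrix is a placeholder; scikit-learn's MinMaxScaler computes the same transform):

# Sketch: min-max normalization per attribute, as in equation (3).
import numpy as np

X = np.random.rand(400, 7) * 100             # raw readings on mixed scales
x_min, x_max = X.min(axis=0), X.max(axis=0)  # per-attribute minimum and maximum
X_norm = (X - x_min) / (x_max - x_min)       # every attribute now lies in [0, 1]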
C. Performance Evaluation

Two criteria evaluate the performance of the clustering algorithms: accuracy, and the best number of clusters (best k) determined for each experiment by the Silhouette Coefficient. Based on the confusion matrix, the accuracy formula is shown in (4):

Accuracy = Num. of correctly clustered instances / (Num. of correctly clustered instances + Num. of incorrectly clustered instances)    (4)
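One subtlety worth making explicit: cluster IDs are arbitrary, so before applying (4) each cluster must be matched to a class. A sketch using the common majority-class mapping (an assumption about the evaluation procedure, which the paper does not spell out):

# Sketch: clustering accuracy against known classes via majority mapping.
import numpy as np

def clustering_accuracy(true_labels, cluster_ids):
    correct = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        correct += np.bincount(members).max()  # majority-class count in cluster c
    return correct / len(true_labels)          # correct / (correct + incorrect)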
IV. RESULTS AND DISCUSSION

In this section, the clustering performance of the six algorithms is compared for the two scenarios. Additionally, the J48 tree is applied to obtain the critical attributes for identifying the noise data.

A. First Scenario: Pure Indoor and Pure Outdoor, without Noise

This scenario covers experiments 1 to 10.

1) Clustering Evaluation
The accuracies of the clustering algorithms that need a pre-defined number of clusters (k = 2), namely SimpleKMeans, HierarchicalClusterer, EM, and Self-Organizing Map, are shown in Table I. All achieve 100% correctness except SOM, which incorrectly clusters one indoor sample into the outdoor cluster. That sample, with slightly higher temperature and humidity, differs slightly from the other indoor samples because it was collected just as the indoor air conditioner was about to start. SOM is a sensitive clustering algorithm and, when projecting data from high to low dimension, may treat samples that are less close to other similar samples as a different cluster.

TABLE I. ACCURACY OF CLUSTERING ALGORITHMS THAT NEED A PRE-DEFINED K FOR THE FIRST SCENARIO

Clustering Algorithms      | Avg. Accuracy
SimpleKMeans               | 100.00%
Hierarchical               | 100.00%
EM                         | 100.00%
SOM                        | 99.97%

For the clustering algorithms that do not need a pre-defined number of clusters, the accuracy of Cascading K-means (100.00%) is higher than that of X-means (74.31%), as Table II shows. In addition, the average k obtained across the ten experiments for Cascading Simple K-means equals 2, which perfectly matches the actual situation, while the k for X-means mostly equals 3 or 4, averaging 3.3, which exceeds the actual situation. This is because X-means is very sensitive to the distances between the data. Taking experiment 1 as an example (Fig. 2), X-means segments the indoor data into two clusters (cluster 0 and cluster 1). Fig. 2 (b) illustrates that CO2, temperature, and humidity are very similar, since they overlap each other; however, Fig. 2 (a) shows that the PM pollutants (PM10, PM2.5, and PM1.0) are less similar, since the data project into two significant groups. Therefore, X-means splits the indoor data into two different clusters.

TABLE II. ACCURACY OF CLUSTERING ALGORITHMS THAT DO NOT NEED A PRE-DEFINED K FOR THE FIRST SCENARIO

Clustering Algorithms      | Accuracy | Avg. k
Cascading Simple K-means   | 100.00%  | 2
X-means                    | 74.31%   | 3.3

Fig. 2. Clusters 0 and 1 of the indoor class under X-means: (a) PM pollutants; (b) CO2, temperature, and humidity.
To evaluate the best k, the Silhouette Coefficient is calculated for each experiment with four algorithms (SimpleKMeans, Hierarchical, EM, CSKMeans), and the best k is taken as the one with the largest Silhouette Coefficient value. The Silhouette Coefficients of the four algorithms are the same, and the best k always equals 2. All algorithms distinguish the indoor and outdoor clusters very well because the difference between the two clusters is significant, and the best k matches the actual classes.

According to the accuracy and Silhouette Coefficient results, we conclude that in the first scenario, without any noise, SimpleKMeans, Hierarchical, EM, and Cascading Simple K-means are the best clustering algorithms.

2) J48 Tree
The J48 tree is applied to each experiment to obtain the critical attributes for segmenting the dataset. As Table III shows, across the 10 experiments the most critical attribute for distinguishing the indoor and outdoor data is PM1.0, followed by temperature.

TABLE III. COUNT OF KEY ATTRIBUTES FOR THE FIRST SCENARIO

Attribute | PM1.0 | Temperature | CO2 | PM2.5
Count     | 5     | 3           | 1   | 1

Fig. 3. J48 Tree for Exp. 4.
Typically, outdoor air contains more PM pollutants than indoor air, as Fig. 3 shows for experiment 4. The average value of PM1.0 can distinguish the indoor and outdoor classes for each cluster, the indoor cluster having the lower average PM1.0. As for temperature, in autumn the indoor temperature is typically lower than the outdoor temperature because the air conditioner is always on indoors. When the weather gets cooler, as in winter and spring, the indoor and outdoor temperatures do not show significant differences. Therefore, temperature is the second most important attribute after PM1.0. This finding can be applied to future data collected from different users. If the data are sent from the same location (e.g., based on GPS information), take the fourth experiment as an example: after clustering, if the number of clusters is two, an initial assumption is made that there is no noise in the data. The mean value of PM1.0 is then calculated for both clusters, and the cluster with the lower mean is considered indoor, and vice versa for outdoor.
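A sketch codifying this labelling rule (a direct transcription of the heuristic above; the function and argument names are illustrative):

# Sketch: with two clusters at one location, lower mean PM1.0 -> indoor.
import numpy as np

def label_indoor_outdoor(pm1, cluster_ids):
    means = {c: pm1[cluster_ids == c].mean() for c in np.unique(cluster_ids)}
    indoor = min(means, key=means.get)  # cluster with the lower PM1.0 mean
    return {c: ("indoor" if c == indoor else "outdoor") for c in means}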
B. Second Scenario: Pure Indoor, Pure Outdoor, and Noise

This scenario covers experiments 11 to 20.

1) Clustering Evaluation
The accuracies of the clustering algorithms that need a pre-defined number of clusters (k = 3), namely SimpleKMeans, HierarchicalClusterer, EM, and SOM, are shown in Table IV. EM has the highest accuracy, 95.36%, while SimpleKMeans has the lowest, 87.03%.

TABLE IV. ACCURACY OF ALGORITHMS THAT NEED A PRE-DEFINED K FOR THE SECOND SCENARIO

Clustering Algorithms | Avg. Accuracy
SimpleKMeans          | 87.03%
Hierarchical          | 90.76%
EM                    | 95.36%
SOM                   | 90.29%

EM exceeds SimpleKMeans because of the covariance matrices rooted in its algorithmic structure. The covariance matrices allow EM to detect subpopulations with different characteristics in the dataset, and thus to segment the data into clusters of different shapes. Taking experiment 15 as an example, the clustering result of SimpleKMeans is shown in Fig. 4 (a), the clustering result of EM in Fig. 4 (b), and the actual classes of the data in Fig. 4 (c). The results are visualized by Principal Component Analysis (PCA).

As Fig. 4 (c) shows, the boundary between the outdoor and noise data is unclear: these two types of data overlap, while the indoor data are easy to distinguish. SimpleKMeans fails to distinguish the outdoor and noise data, grouping them into the same cluster 2 (noise), and incorrectly separates the indoor data into two clusters. EM does better: although some actual noise is clustered into the outdoor cluster, EM correctly distinguishes most of the noise and outdoor data and all of the indoor data.

As Table V shows, for the clustering algorithms that do not need a pre-defined number of clusters, the average accuracy of Cascading Simple K-means is 81.63%, lower than that of X-means (88.82%) across the 10 experiments. However, as in the first scenario, X-means generates more clusters (avg. k = 3.5) than Cascading Simple K-means (avg. k = 3.2), which is not entirely consistent with the actual situation (k = 3).

In conclusion, Cascading Simple K-means achieves better performance than X-means.

TABLE V. ACCURACY OF CLUSTERING ALGORITHMS THAT DO NOT NEED A PRE-DEFINED K FOR THE SECOND SCENARIO

Clustering Algorithms      | Accuracy | Avg. k
Cascading Simple K-means   | 81.63%   | 3.2
X-means                    | 88.82%   | 3.5
To evaluate the best k, the Silhouette Coefficient for each experiment is calculated for four algorithms (SimpleKMeans, Hierarchical, EM, CSKMeans), and the best k is the one with the largest Silhouette Coefficient value, as shown in Table VI. There are three situations. (1) If the best k = 2, the noise data are very similar to either the indoor or the outdoor data and are not clustered into an individual group. (2) If the best k = 3, it matches the experiment scenario. (3) If the best k = 4 or more, there are some variants within the indoor, outdoor, and noise classes; that may explain the unusual circumstance in experiment 13. To sum up, considering the two criteria, among the algorithms that need a predefined k, EM performs the best.

2) J48 Tree
As Table VII shows, the most critical attribute for distinguishing the indoor, outdoor, and noise data is temperature, followed by CO2, humidity, and PM2.5. The indoor air conditioner usually operates on campus, resulting in a distinct temperature difference between indoors and outdoors. Usually, in summer and autumn, the indoor temperature is the lowest, the temperature is slightly higher when the door is opened for ventilation (defined as noise), and the outdoor temperature is the highest. In winter, the situation is the opposite. In such cases, carbon dioxide and humidity can be used to achieve the same goal; for example, the highest level of carbon dioxide is found in the indoor clusters.

TABLE VII. COUNT OF KEY ATTRIBUTES FOR THE SECOND SCENARIO

Attribute   | Count (out of 10)
Temperature | 9
CO2         | 4
Humidity    | 3
PM2.5       | 3
PM1.0       | 1

The corresponding J48 tree splits on normalized temperature:

Temperature <= 0.255319: Indoor (207.0)
Temperature > 0.255319
|   Temperature <= 0.446809: Noise (189.0)
|   Temperature > 0.446809: Outdoor (207.0)
V. CONCLUSION

The Silhouette Coefficient can be used to evaluate the clustering performance. The results show that, for the data without any noise, all algorithms perform well except X-means, and the best k matches the actual situation; for the data with noise, EM and Cascading Simple K-means perform the best. Additionally, affected by the noise, the best k does not always match the actual situation. Finally, the J48 tree is used to examine which attributes in the dataset are critical for identifying the noise cluster in data coming from other users at the same location and a similar date and time, so that the noise data can be removed. The experiments indicate that temperature and CO2 are the critical attributes for distinguishing the different clusters: indoor, outdoor, and noise.

For further study, more data can be collected throughout the year to test the accuracy of our solution, and the performance of the clustering and classification algorithms can be further investigated.
REFERENCES

[1] X. Yang, E. Xie, and L. Cuthbert, "Cluster segregation for indoor and outdoor environmental monitoring system," in Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications, 14th IEEE International Conference on Smart City and 2nd IEEE International Conference on Data Science and Systems (HPCC/SmartCity/DSS 2016), pp. 1264–1269, Jan. 2017, doi: 10.1109/HPCC-SMARTCITY-DSS.2016.0179.
[2] X. Yang, L. Zhu, S. Lam, L. Cuthbert, and Y. Wang, "Comparison of clustering methods for identification of outdoor measurements in pollution monitoring," IOP Conference Series: Earth and Environmental Science, vol. 257, no. 1, p. 012014, Apr. 2019, doi: 10.1088/1755-1315/257/1/012014.
[3] R. T. Tse and Y. Xiao, "A portable Wireless Sensor Network system for real-time environmental monitoring," in WoWMoM 2016 - 17th International Symposium on a World of Wireless, Mobile and Multimedia Networks, Jul. 2016, doi: 10.1109/WOWMOM.2016.7523588.
[4] X. Wu and X. Zhu, "Mining with noise knowledge: Error-aware data mining," IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 38, no. 4, pp. 917–932, Jul. 2008, doi: 10.1109/TSMCA.2008.923034.
[5] A. Klein, "Incorporating quality aspects in sensor data streams," in International Conference on Information and Knowledge Management, Proceedings, pp. 77–84, 2007, doi: 10.1145/1316874.1316888.
[6] S. A. N. Alexandropoulos, S. B. Kotsiantis, and M. N. Vrahatis, "Data preprocessing in predictive data mining," The Knowledge Engineering Review, vol. 34, 2019, doi: 10.1017/S026988891800036X.
[7] Z. Nematzadeh, R. Ibrahim, and A. Selamat, "A hybrid model for class noise detection using k-means and classification filtering algorithms," SN Applied Sciences, vol. 2, no. 7, pp. 1–10, Jul. 2020, doi: 10.1007/S42452-020-3129-X/TABLES/8.
[8] S. D. Bay and M. Schwabacher, "Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule," 2003.
[9] R. Sloutsky, N. Jimenez, S. J. Swamidass, and K. M. Naegle, "Accounting for noise when clustering biological data," Briefings in Bioinformatics, vol. 14, no. 4, pp. 423–436, Jul. 2013, doi: 10.1093/BIB/BBS057.
[10] J. Rodrigues, D. Belo, and H. Gamboa, "Noise detection on ECG based on agglomerative clustering of morphological features," Computers in Biology and Medicine, vol. 87, pp. 322–334, Aug. 2017, doi: 10.1016/J.COMPBIOMED.2017.06.009.
[11] P. Govender and V. Sivakumar, "Application of k-means and hierarchical clustering techniques for analysis of air pollution: A review (1980-2019)," 2019, doi: 10.1016/j.apr.2019.09.009.
[12] S. Saraswathi and M. Immaculate Sheela, "A Comparative Study of Various Clustering Algorithms in Data Mining," International Journal of Computer Science and Mobile Computing, vol. 3, no. 11, pp. 422–428, 2014. [Online]. Available: www.ijcsmc.com
[13] P. Narayan Baser and J. R. Saini, "A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets," International Journal of Computer Science and Communication Networks, vol. 3, no. 4, pp. 271–275, 2013. [Online]. Available: https://www.researchgate.net/publication/281965200
[14] J. Yadav and M. Sharma, "A Review of K-mean Algorithm," International Journal of Engineering Trends and Technology, vol. 4, no. 7, 2013. [Online]. Available: http://www.ijettjournal.org
[15] Nisha and P. J. Kaur, "Cluster quality based performance evaluation of hierarchical clustering method," in 2015 1st International Conference on Next Generation Computing Technologies (NGCT), 2015, pp. 649–653, doi: 10.1109/NGCT.2015.7375201.
[16] A. G. Karegowda, M. A. Jayaram, and A. S. Manjunath, "Cascading k-means clustering and k-nearest neighbor classifier for categorization of diabetic patients," International Journal of Engineering and Advanced Technology, vol. 1, no. 3, pp. 147–151, 2012.
[17] T. Caliński and J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics, vol. 3, no. 1, pp. 1–27, 1974, doi: 10.1080/03610927408827101.
[18] D. Pelleg and A. Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters," in 17th International Conf. on Machine Learning, 2000, pp. 727–734. [Online]. Available: https://www.researchgate.net/publication/2532744
[19] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal Processing Magazine, vol. 13, no. 6, pp. 47–60, 1996, doi: 10.1109/79.543975.
[20] D. Miljković, "Brief review of self-organizing maps," in 2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2017, pp. 1061–1066, doi: 10.23919/MIPRO.2017.7973581.
[21] "Weka 3 - Data Mining with Open Source Machine Learning Software in Java." https://www.cs.waikato.ac.nz/ml/weka/ (accessed May 10, 2022).
[22] R. Lletí, M. C. Ortiz, L. A. Sarabia, and M. S. Sánchez, "Selecting variables for k-means cluster analysis by using a genetic algorithm that optimises the silhouettes," Analytica Chimica Acta, vol. 515, no. 1, pp. 87–100, Jul. 2004, doi: 10.1016/J.ACA.2003.12.020.
[23] N. Bhargava, G. Sharma, R. Bhargava, and M. Mathuria, "Decision tree analysis on j48 algorithm for data mining," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 6, 2013.
[24] Y. A. LeCun, L. Bottou, G. B. Orr, and K. R. Müller, "Efficient BackProp," Lecture Notes in Computer Science, vol. 7700, pp. 9–48, 2012, doi: 10.1007/978-3-642-35289-8_3.