Clustering Algorithms based Noise Identification from Air Pollution Monitoring Data
Abstract—The development of data science has brought about many discussions of noise detection, and so far there is no universal best method. In this paper, we propose a clustering-algorithm-based solution to identify and remove noise from air pollution data collected with mobile portable sensors. The test dataset is the air pollution data collected by the portable sensors throughout three seasons on a campus in Macao. We have applied and compared six clustering algorithms to identify the most appropriate one for this goal: Simple K-means, Hierarchical Clustering, Cascading K-means, X-means, Expectation Maximization, and Self-Organizing Map. Their performance is evaluated by accuracy and by the best number of clusters determined by the Silhouette Coefficient. Additionally, the classification algorithm J48 tree extracts the key attributes and identifies the noise cluster for future unlabeled data that may contain noise. The experimental results indicate that Expectation Maximization and Cascading Simple K-means perform the best. Moreover, temperature and carbon dioxide are vital attributes for identifying the noise cluster.

Keywords—data clustering, portable sensor, air pollution data, noise identification, noise removal

I. INTRODUCTION

Recently, portable sensors have been widely used to collect environmental data, especially for monitoring air pollution [1] [2]. Unlike a traditional meteorological station, a portable sensor has several advantages: (1) it can collect air data in relatively small and particular areas, such as on campus or in a specific room; (2) it can record air data down to the second, reflecting the continuous changes of each attribute in the area [3]. However, the air pollution data collected by a portable sensor may be interspersed with data recorded indoors, outdoors, or while transitioning between the two states. Data with an unclear indoor/outdoor boundary are considered noise, which may interfere with further investigation and produce imprecise results. Therefore, it is crucial to distinguish the different data groups and identify noise data, such as inappropriate data and outliers, before data processing.

Noise is inevitable in data collection. Generally, one possible noise source is hardware bias, such as sensor error [4] [5]. Additionally, human misconduct during data collection may result in incorrect data. In this paper, noise data can come from moving users who carry the portable sensor devices from indoor to outdoor positions and vice versa. There can be other noise sources as well, such as the sensor's heating time.

Noise data can be detected by four kinds of techniques, summarized in [6] [7]: statistics-based, distance-based, density-based, and clustering-based. Statistics-based techniques detect noise using statistical tests that assume the dataset follows some distribution. Distance-based outlier techniques distinguish noise by testing the distance to the neighbors [8], usually represented by the k nearest examples. Density-based outlier detection identifies outliers according to the density of their surrounding domain. Clustering-based algorithms aim to identify misclassified instances and treat them as noise.

Compared to other noise-detection approaches, clustering algorithms have proven simple and reliable, fast to compute, memory efficient, and theoretically easy to understand. A clustering-based algorithm does not need to make assumptions about the data distribution, unlike a statistics-based algorithm. Moreover, clustering algorithms are considered extremely sensitive to noise, especially in bioinformatics [9]. In other similar fields, such as ECG signal analysis in medicine, clustering algorithms also lead the way [10].

In this paper, different from prior air quality analyses [11], we do not focus on classifying different air quality levels but on detecting the noise group in the mixed, unlabeled data transferred from different users with mobile portable sensors. The proposed solution can be divided into two steps: 1) clustering algorithms identify the best number of clusters and group the mixed data into different clusters; 2) the classification algorithm J48 tree examines the key attributes for detecting the noise cluster. The noise cluster can therefore be removed from future data.

To find out which clustering algorithm is more capable of distinguishing the indoor, outdoor, and noise clusters, six clustering algorithms, Simple K-means, Hierarchical, Cascading K-means, X-means, Expectation Maximization, and Self-Organizing Map, are applied and compared on a test dataset collected by the portable sensors throughout three seasons on a campus in Macao. To better examine the ability of each clustering algorithm, two scenarios are designed for the experiments: pure indoor and pure outdoor air data (without noise); and pure indoor, pure outdoor, and in-between air data (with noise). Accuracy and the Silhouette Coefficient are the two essential criteria for evaluating the performance of the clustering algorithms in each experiment. We expect a clustering algorithm to achieve high accuracy while also producing a best number of clusters that matches the number of physically distinct groups.
This work is funded by Macao Polytechnic University under
grant number RP/ESCA-01/2021.
These results can then be applied to identify future noise data in the unlabeled, mixed data collected on the user side at the same locations. If the number of clusters determined by the best clustering algorithm is equal to or greater than three, the assumption is made that noise exists. The critical attributes obtained from the J48 tree are then used to distinguish the noise cluster and remove it.
II. RELATED WORK

A. Clustering Algorithms

Clustering algorithms are unsupervised algorithms that aim to keep similar objects in one group and separate less similar objects into different groups. Clustering algorithms can be sorted into several technical categories according to their emphases and features, such as partitioning, hierarchical, density-based, distance-based [12], and neural-network-based.

In this work, the clustering algorithms used can be classified into two categories: algorithms that need a predefined number of clusters, including Simple K-means, Hierarchical, Expectation Maximization, and Self-Organizing Map; and algorithms that can automatically find the number of clusters, including Cascading Simple K-means and X-means.

1) Simple K-means
As a partitioning-based clustering algorithm, K-means can be regarded as a ramification of distance-based clustering. K-means requires a specific number of clusters, stored in the value k, to initialize the computation. K-means randomly selects k objects from the dataset to represent the cluster centers. Given these centers, the distances of all remaining data objects to the centers are calculated, and each data object is assigned to the cluster at the smallest distance [13][14]. K-means benefits from its simplicity, fast computation, and flexibility. However, it has several flaws: the number of clusters must be entered manually, the initialization affects the result significantly, and it is sensitive to the scale of the dataset.
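As a minimal sketch of this procedure (using scikit-learn's KMeans as a stand-in for WEKA's SimpleKMeans used in this paper; the random data matrix is a placeholder for roughly 400 normalized sensor records):

# Sketch: K-means on placeholder sensor data (scikit-learn stand-in).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(400, 7)                                # 400 records x 7 attributes in [0, 1]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)  # k must be supplied in advance
labels = kmeans.fit_predict(X)                            # each record -> nearest centroid
print(kmeans.cluster_centers_)                            # centroids after convergence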
2) Hierarchical
Hierarchical clustering [15] nests clusters to build a tree: at each level, a cluster's sub-clusters are united and assigned to it. Each node in the tree reflects a cluster in a hierarchy called a dendrogram. Hierarchical clustering is easy to implement, and its results are easy to read thanks to the well-structured dendrogram. However, it is less flexible than K-means and may need a longer processing time.
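A minimal sketch of agglomerative clustering with SciPy (not the WEKA implementation used here; the data and the two-cluster cut are illustrative):

# Sketch: build a dendrogram and cut it into two flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(400, 7)                       # placeholder normalized records
Z = linkage(X, method="ward")                    # build the cluster tree bottom-up
labels = fcluster(Z, t=2, criterion="maxclust")  # flatten the tree into 2 clusters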
3) Cascading K-means
Cascading K-means clustering [16] is based on K-means and abides by the partitioning rule for clustering. It draws on the criterion proposed by Caliński and Harabasz [17], adding statistics and hypothesis testing to the original K-means.
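A sketch of the selection idea behind cascading K-means, choosing k by the Caliński–Harabasz variance-ratio criterion [17] (the candidate range of k is illustrative, not part of the original algorithm's specification):

# Sketch: cascade K-means over candidate k and keep the best CH score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.random.rand(400, 7)
scores = {}
for k in range(2, 6):                               # candidate values of k
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)  # variance-ratio criterion
best_k = max(scores, key=scores.get)                # largest criterion value wins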
4) X-means
The X-means algorithm was developed by Dan Pelleg and Andrew Moore [18], and it enables estimating the number of clusters in a short time. X-means operates after every run of K-means and determines whether any subset covered by a current centroid should be split into different clusters; the Bayesian Information Criterion (BIC) is used in making this decision. After that, X-means re-calculates the centroid and assigns the center of mass of the points to it. X-means and Cascading K-means do not require the number of clusters (k) in advance.
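X-means itself is not part of common Python libraries; the sketch below illustrates only the BIC-based model selection it relies on, using GaussianMixture's BIC as a simplified stand-in rather than X-means' per-centroid splitting:

# Sketch: BIC-driven choice of the number of clusters (stand-in for X-means).
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(400, 7)
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(2, 6)}
best_k = min(bic, key=bic.get)  # lower BIC = better fit/complexity trade-off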
5) Expectation Maximization (EM)
The EM algorithm consists of two main steps: the expectation step and the maximization step. The expectation step uses the current parameter estimates to compute expectations over the unknown underlying variables, while the maximization step uses these expectations to produce new parameter estimates. The two steps are not independent but interact with each other until they reach convergence. The strength of EM clustering is that the implementation is relatively straightforward, while the drawback is that, in some cases, its linear convergence can become slow [19].
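The alternation of the two steps can be made concrete with a hand-written EM loop for a one-dimensional mixture of two Gaussians (a toy sketch; the synthetic data and component count are illustrative only):

# Sketch: hand-written EM for a 1-D two-component Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])
mu, sigma, w = np.array([0.5, 3.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(100):
    # E-step: responsibility of each component for each point
    pdf = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = w * pdf
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate means, variances, and mixing weights
    n_k = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
    w = n_k / len(x)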
6) Self-Organizing Map (SOM)
A Self-Organizing Map is an artificial-neural-network-based clustering algorithm. It generates a low-dimensional projection of a high-dimensional data distribution in which the similarity relationships between data items are maintained [20]. The so-called map is trained on input samples under unsupervised learning. SOM differs from other neural-network-based clustering algorithms in that it applies competitive learning instead of error-correction learning.
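The competitive-learning update at the heart of SOM can be sketched directly (a heavily simplified toy version; the grid size, learning-rate schedule, and neighbourhood kernel are illustrative choices, not the paper's configuration):

# Sketch: core SOM update - find the best matching unit, pull neighbours toward it.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((400, 7))            # placeholder records
grid = rng.random((5, 5, 7))           # 5x5 map, one weight vector per node

for t, x in enumerate(data):
    lr = 0.5 * np.exp(-t / 400)                       # decaying learning rate
    d = np.linalg.norm(grid - x, axis=2)              # distance of x to every node
    bi, bj = np.unravel_index(np.argmin(d), d.shape)  # best matching unit (BMU)
    for i in range(5):                                # pull BMU and neighbours toward x
        for j in range(5):
            h = np.exp(-((i - bi) ** 2 + (j - bj) ** 2) / 2.0)
            grid[i, j] += lr * h * (x - grid[i, j])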
B. Platform for Data Clustering

In this paper, we compare the performance of the six clustering algorithms introduced in Section II.A on WEKA [21]. WEKA is a data mining platform that supports data preparation and multiple machine learning methods.

C. Silhouette Coefficient for Determining the Best Number of Clusters

On WEKA, Cascading K-means and X-means can automatically determine the number of clusters, while K-means, Hierarchical, EM, and SOM require a predefined number of clusters. However, in some circumstances, the number of clusters in the clustering results does not match the expected outcome. Thus, a method for evaluating the performance of a given k is needed. The Silhouette Coefficient [22] is a metric that measures the quality of a clustering result, with a range from -1 to 1. The formulas of the Silhouette Coefficient are shown in (1) and (2). The closer the coefficient is to 1, the more accurate the clustering, and the number of clusters (k) corresponding to the largest coefficient is the best.

s(i) = (b(i) - a(i)) / max{a(i), b(i)}    (1)

SC = (1/N) * sum_{i=1..N} s(i)    (2)

Assume a dataset is processed by a specific clustering algorithm and clustered into k clusters. For every vector x_i in a cluster, taken as an instance, a(i) is the average distance from x_i to the other points in the same cluster as x_i, and b(i) is the average distance from x_i to the points that do not belong to the same cluster.
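A minimal sketch of choosing the best k by the largest mean Silhouette Coefficient, mirroring (1) and (2) (scikit-learn's silhouette_score returns the mean of s(i) over all samples; data and k range are illustrative):

# Sketch: pick the k with the largest mean silhouette value.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(400, 7)
sil = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil[k] = silhouette_score(X, labels)  # mean s(i) over all samples, as in (2)
best_k = max(sil, key=sil.get)            # coefficient closest to 1 wins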
D. J48 Tree to Obtain the Key Attributes

The J48 tree is a decision-tree classifier. It learns a supervised classification function that gives the value of the dependent attribute from the values of the input attributes [23]. Data are classified based on the entropy calculated while forming the tree. The most significant advantage of the J48 tree is that its tree structure is easy for a human to interpret. However, the running time of the algorithm corresponds to the depth of the tree, which cannot be greater than the number of attributes, so the processing time is extended if the dataset is huge. In this paper, the J48 tree is used to obtain the key attributes for detecting the noise class, based on the conditions in the tree's nodes and branches.
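J48 (C4.5) is WEKA-specific; a sketch of the same idea with scikit-learn's entropy-based CART tree as a stand-in, surfacing the key attributes via feature importances (the data and labels are placeholders):

# Sketch: rank the attributes that best split indoor/outdoor/noise classes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

features = ["PM1.0", "PM2.5", "PM10", "HCHO", "CO2", "Temperature", "Humidity"]
X = np.random.rand(400, 7)              # placeholder normalized readings
y = np.random.randint(0, 3, 400)        # placeholder indoor/outdoor/noise labels
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
ranked = sorted(zip(features, tree.feature_importances_),
                key=lambda p: p[1], reverse=True)
print(ranked[:2])                       # the key splitting attributes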
III. EXPERIMENT DESIGN

Portable sensors collect air pollution data for comparing clustering performance, and two scenarios are designed.

A. Data Collection

Portable sensor devices that integrate several kinds of sensors are used to detect contaminants and indexes in the air, such as PM1.0, PM2.5, PM10, Formaldehyde (HCHO), Carbon Dioxide, Temperature, and Humidity. Data are collected from the fall until the following spring within the campus, at locations such as different academic buildings, the student dormitory, the playground, and the parking lots. There are two scenarios: pure indoor and pure outdoor (without noise); and pure indoor, pure outdoor, and mixed (with noise; as shown in Fig. 1, noise data are collected when the door is open). In each scenario, ten sets of data are collected for comparison. Air pollution data are recorded every 10 seconds after the sensor device is activated. Additionally, the data collected in the first five minutes may introduce bias, since the sensor heats up after being activated. Therefore, we removed the air data collected in the first five minutes and used only the data collected after the sensor stabilized. For each experiment, about 400 data records are kept.
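A sketch of discarding the five-minute warm-up period with pandas (the file name and the timestamp column are hypothetical, assuming records arrive every 10 seconds):

# Sketch: drop the sensor warm-up rows before analysis.
import pandas as pd

df = pd.read_csv("session.csv", parse_dates=["timestamp"])       # hypothetical schema
start = df["timestamp"].min()
stable = df[df["timestamp"] >= start + pd.Timedelta(minutes=5)]  # keep stabilized data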
Fig. 1. Two Scenarios (without or with noise): indoor, outdoor, and the noise transition between them.
B. Data Normalization

Data normalization scales and shifts the distributions of the raw attributes into the same fixed range of 0 to 1. Convergence can be accelerated, and model performance can be improved [24]. In this paper, min-max normalization is used, as shown in (3):

x' = (x - x_min) / (x_max - x_min)    (3)
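A direct sketch of equation (3) applied per attribute (the raw matrix is a placeholder; scikit-learn's MinMaxScaler computes the same transform):

# Sketch: min-max normalization per attribute, as in equation (3).
import numpy as np

X = np.random.rand(400, 7) * 100             # raw readings on mixed scales
x_min, x_max = X.min(axis=0), X.max(axis=0)  # per-attribute minimum and maximum
X_norm = (X - x_min) / (x_max - x_min)       # every attribute now lies in [0, 1]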
C. Performance Evaluation

Two criteria evaluate the performance of the clustering algorithms: accuracy, and the best number of clusters (best k) determined for each experiment by the Silhouette Coefficient. Based on the confusion matrix, the accuracy formula is shown in (4):

Accuracy = Num. of correctly clustered instances / (Num. of correctly clustered instances + Num. of incorrectly clustered instances)    (4)
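One subtlety worth making explicit: cluster IDs are arbitrary, so before applying (4) each cluster must be matched to a class. A sketch using the common majority-class mapping (an assumption about the evaluation procedure, which the paper does not spell out):

# Sketch: clustering accuracy against known classes via majority mapping.
import numpy as np

def clustering_accuracy(true_labels, cluster_ids):
    correct = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        correct += np.bincount(members).max()  # majority-class count in cluster c
    return correct / len(true_labels)          # correct / (correct + incorrect)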
IV. RESULTS AND DISCUSSION

In this section, the clustering performance of the six algorithms is compared for the two scenarios. Additionally, the J48 tree is applied to obtain the critical attributes for identifying the noise data.

A. First Scenario: Pure Indoor and Pure Outdoor, without Noise

This scenario covers experiments 1 to 10.

1) Clustering Evaluation
The accuracies of the clustering algorithms that need a pre-defined number of clusters (k = 2), namely SimpleKMeans, HierarchicalClusterer, EM, and Self-Organizing Map, are shown in Table I. All achieve 100% correctness except SOM, which incorrectly clusters one indoor sample into the outdoor cluster. That sample, with slightly higher temperature and humidity, differs slightly from the other indoor samples because it was collected just as the indoor air conditioner was about to start. SOM is a sensitive clustering algorithm and, when projecting data from high to low dimension, may treat samples that are less close to other similar samples as a different cluster.

TABLE I. ACCURACY OF CLUSTERING ALGORITHMS THAT NEED A PRE-DEFINED K FOR THE FIRST SCENARIO

Clustering Algorithms      | Avg. Accuracy
SimpleKMeans               | 100.00%
Hierarchical               | 100.00%
EM                         | 100.00%
SOM                        | 99.97%

For the clustering algorithms that do not need a pre-defined number of clusters, the accuracy of Cascading K-means (100.00%) is higher than that of X-means (74.31%), as Table II shows. In addition, the average k obtained across the ten experiments for Cascading Simple K-means equals 2, which perfectly matches the actual situation, while the k for X-means mostly equals 3 or 4, averaging 3.3, which exceeds the actual situation. This is because X-means is very sensitive to the distances between the data. Taking experiment 1 as an example (Fig. 2), X-means segments the indoor data into two clusters (cluster 0 and cluster 1). Fig. 2 (b) illustrates that CO2, temperature, and humidity are very similar, since they overlap each other; however, Fig. 2 (a) shows that the PM pollutants (PM10, PM2.5, and PM1.0) are less similar, since the data project into two significant groups. Therefore, X-means splits the indoor data into two different clusters.

TABLE II. ACCURACY OF CLUSTERING ALGORITHMS THAT DO NOT NEED A PRE-DEFINED K FOR THE FIRST SCENARIO

Clustering Algorithms      | Accuracy | Avg. k
Cascading Simple K-means   | 100.00%  | 2
X-means                    | 74.31%   | 3.3

Fig. 2. Clusters 0 and 1 of the indoor class under X-means: (a) PM pollutants; (b) CO2, temperature, and humidity.
To evaluate the best k, the Silhouette Coefficient is calculated for each experiment with four algorithms (SimpleKMeans, Hierarchical, EM, CSKMeans), and the best k is taken as the one with the largest Silhouette Coefficient value. The Silhouette Coefficients of the four algorithms are the same, and the best k always equals 2. All algorithms distinguish the indoor and outdoor clusters very well because the difference between the two clusters is significant, and the best k matches the actual classes.

According to the accuracy and Silhouette Coefficient results, we conclude that in the first scenario, without any noise, SimpleKMeans, Hierarchical, EM, and Cascading Simple K-means are the best clustering algorithms.

2) J48 Tree
The J48 tree is applied to each experiment to obtain the critical attributes for segmenting the dataset. As Table III shows, across the 10 experiments the most critical attribute for distinguishing the indoor and outdoor data is PM1.0, followed by temperature.

TABLE III. COUNT OF KEY ATTRIBUTES FOR THE FIRST SCENARIO

Attribute | PM1.0 | Temperature | CO2 | PM2.5
Count     | 5     | 3           | 1   | 1

Fig. 3. J48 Tree for Exp. 4.
Typically, outdoor air contains more PM pollutants than indoor air, as Fig. 3 shows for experiment 4. The average value of PM1.0 can distinguish the indoor and outdoor classes for each cluster, the indoor cluster having the lower average PM1.0. As for temperature, in autumn the indoor temperature is typically lower than the outdoor temperature because the air conditioner is always on indoors. When the weather gets cooler, as in winter and spring, the indoor and outdoor temperatures do not show significant differences. Therefore, temperature is the second most important attribute after PM1.0. This finding can be applied to future data collected from different users. If the data are sent from the same location (e.g., based on GPS information), take the fourth experiment as an example: after clustering, if the number of clusters is two, an initial assumption is made that there is no noise in the data. The mean value of PM1.0 is then calculated for both clusters, and the cluster with the lower mean is considered indoor, and vice versa for outdoor.
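A sketch codifying this labelling rule (a direct transcription of the heuristic above; the function and argument names are illustrative):

# Sketch: with two clusters at one location, lower mean PM1.0 -> indoor.
import numpy as np

def label_indoor_outdoor(pm1, cluster_ids):
    means = {c: pm1[cluster_ids == c].mean() for c in np.unique(cluster_ids)}
    indoor = min(means, key=means.get)  # cluster with the lower PM1.0 mean
    return {c: ("indoor" if c == indoor else "outdoor") for c in means}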
B. Second Scenario: Pure Indoor, Pure Outdoor, and Noise

This scenario covers experiments 11 to 20.

1) Clustering Evaluation
The accuracies of the clustering algorithms that need a pre-defined number of clusters (k = 3), namely SimpleKMeans, HierarchicalClusterer, EM, and SOM, are shown in Table IV. EM has the highest accuracy, 95.36%, while SimpleKMeans has the lowest, 87.03%.

TABLE IV. ACCURACY OF ALGORITHMS THAT NEED A PRE-DEFINED K FOR THE SECOND SCENARIO

Clustering Algorithms | Avg. Accuracy
SimpleKMeans          | 87.03%
Hierarchical          | 90.76%
EM                    | 95.36%
SOM                   | 90.29%

EM exceeds SimpleKMeans because of the covariance matrices rooted in its algorithmic structure. The covariance matrices allow EM to detect subpopulations with different characteristics in the dataset, and thus to segment the data into clusters of different shapes. Taking experiment 15 as an example, the clustering result of SimpleKMeans is shown in Fig. 4 (a), the clustering result of EM in Fig. 4 (b), and the actual classes of the data in Fig. 4 (c). The results are visualized by Principal Component Analysis (PCA).

As Fig. 4 (c) shows, the boundary between the outdoor and noise data is unclear: these two types of data overlap, while the indoor data are easy to distinguish. SimpleKMeans fails to distinguish the outdoor and noise data, grouping them into the same cluster 2 (noise), and incorrectly separates the indoor data into two clusters. EM does better: although some actual noise is clustered into the outdoor cluster, EM correctly distinguishes most of the noise and outdoor data and all of the indoor data.

As Table V shows, for the clustering algorithms that do not need a pre-defined number of clusters, the average accuracy of Cascading Simple K-means is 81.63%, lower than that of X-means (88.82%) across the 10 experiments. However, as in the first scenario, X-means generates more clusters (avg. k = 3.5) than Cascading Simple K-means (avg. k = 3.2), which is not entirely consistent with the actual situation (k = 3).

In conclusion, Cascading Simple K-means achieves better performance than X-means.

TABLE V. ACCURACY OF CLUSTERING ALGORITHMS THAT DO NOT NEED A PRE-DEFINED K FOR THE SECOND SCENARIO

Clustering Algorithms      | Accuracy | Avg. k
Cascading Simple K-means   | 81.63%   | 3.2
X-means                    | 88.82%   | 3.5
To evaluate the best k, the Silhouette Coefficient for each experiment is calculated for four algorithms (SimpleKMeans, Hierarchical, EM, CSKMeans), and the best k is the one with the largest Silhouette Coefficient value, as shown in Table VI. There are three situations. (1) If the best k = 2, the noise data are very similar to either the indoor or the outdoor data and are not clustered into an individual group. (2) If the best k = 3, it matches the experiment scenario. (3) If the best k = 4 or more, there are some variants within the indoor, outdoor, and noise classes; that may explain the unusual circumstance in experiment 13. To sum up, considering the two criteria, among the algorithms that need a predefined k, EM performs the best.

2) J48 Tree
As Table VII shows, the most critical attribute for distinguishing the indoor, outdoor, and noise data is temperature, followed by CO2, humidity, and PM2.5. The indoor air conditioner usually operates on campus, resulting in a distinct temperature difference between indoors and outdoors. Usually, in summer and autumn, the indoor temperature is the lowest, the temperature is slightly higher when the door is opened for ventilation (defined as noise), and the outdoor temperature is the highest. In winter, the situation is the opposite. In such cases, carbon dioxide and humidity can be used to achieve the same goal; for example, the highest level of carbon dioxide is found in the indoor clusters.

TABLE VII. COUNT OF KEY ATTRIBUTES FOR THE SECOND SCENARIO

Attribute   | Count (out of 10)
Temperature | 9
CO2         | 4
Humidity    | 3
PM2.5       | 3
PM1.0       | 1

The corresponding J48 tree splits on normalized temperature:

Temperature <= 0.255319: Indoor (207.0)
Temperature > 0.255319
|   Temperature <= 0.446809: Noise (189.0)
|   Temperature > 0.446809: Outdoor (207.0)
V. CONCLUSION

The Silhouette Coefficient can be used to evaluate the clustering performance. The results show that, for the data without any noise, all algorithms perform well except X-means, and the best k matches the actual situation; for the data with noise, EM and Cascading Simple K-means perform the best. Additionally, affected by the noise, the best k does not always match the actual situation. Finally, the J48 tree is used to examine which attributes in the dataset are critical for identifying the noise cluster in data coming from other users at the same location and a similar date and time, so that the noise data can be removed. The experiments indicate that temperature and CO2 are the critical attributes for distinguishing the different clusters: indoor, outdoor, and noise.

For further study, more data can be collected throughout the year to test the accuracy of our solution, and the performance of the clustering and classification algorithms can be further investigated.
REFERENCES

[1] X. Yang, E. Xie, and L. Cuthbert, "Cluster segregation for indoor and outdoor environmental monitoring system," in Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications, 14th IEEE International Conference on Smart City and 2nd IEEE International Conference on Data Science and Systems (HPCC/SmartCity/DSS 2016), pp. 1264–1269, Jan. 2017, doi: 10.1109/HPCC-SMARTCITY-DSS.2016.0179.
[2] X. Yang, L. Zhu, S. Lam, L. Cuthbert, and Y. Wang, "Comparison of clustering methods for identification of outdoor measurements in pollution monitoring," IOP Conference Series: Earth and Environmental Science, vol. 257, no. 1, p. 012014, Apr. 2019, doi: 10.1088/1755-1315/257/1/012014.
[3] R. T. Tse and Y. Xiao, "A portable Wireless Sensor Network system for real-time environmental monitoring," in WoWMoM 2016 - 17th International Symposium on a World of Wireless, Mobile and Multimedia Networks, Jul. 2016, doi: 10.1109/WOWMOM.2016.7523588.
[4] X. Wu and X. Zhu, "Mining with noise knowledge: Error-aware data mining," IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, vol. 38, no. 4, pp. 917–932, Jul. 2008, doi: 10.1109/TSMCA.2008.923034.
[5] A. Klein, "Incorporating quality aspects in sensor data streams," in International Conference on Information and Knowledge Management, Proceedings, pp. 77–84, 2007, doi: 10.1145/1316874.1316888.
[6] S. A. N. Alexandropoulos, S. B. Kotsiantis, and M. N. Vrahatis, "Data preprocessing in predictive data mining," The Knowledge Engineering Review, vol. 34, 2019, doi: 10.1017/S026988891800036X.
[7] Z. Nematzadeh, R. Ibrahim, and A. Selamat, "A hybrid model for class noise detection using k-means and classification filtering algorithms," SN Applied Sciences, vol. 2, no. 7, pp. 1–10, Jul. 2020, doi: 10.1007/S42452-020-3129-X/TABLES/8.
[8] S. D. Bay and M. Schwabacher, "Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule," 2003.
[9] R. Sloutsky, N. Jimenez, S. J. Swamidass, and K. M. Naegle, "Accounting for noise when clustering biological data," Briefings in Bioinformatics, vol. 14, no. 4, pp. 423–436, Jul. 2013, doi: 10.1093/BIB/BBS057.
[10] J. Rodrigues, D. Belo, and H. Gamboa, "Noise detection on ECG based on agglomerative clustering of morphological features," Computers in Biology and Medicine, vol. 87, pp. 322–334, Aug. 2017, doi: 10.1016/J.COMPBIOMED.2017.06.009.
[11] P. Govender and V. Sivakumar, "Application of k-means and hierarchical clustering techniques for analysis of air pollution: A review (1980-2019)," 2019, doi: 10.1016/j.apr.2019.09.009.
[12] S. Saraswathi and M. Immaculate Sheela, "A Comparative Study of Various Clustering Algorithms in Data Mining," International Journal of Computer Science and Mobile Computing, vol. 3, no. 11, pp. 422–428, 2014. [Online]. Available: www.ijcsmc.com
[13] P. Narayan Baser and J. R. Saini, "A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets," International Journal of Computer Science and Communication Networks, vol. 3, no. 4, pp. 271–275, 2013. [Online]. Available: https://www.researchgate.net/publication/281965200
[14] J. Yadav and M. Sharma, "A Review of K-mean Algorithm," International Journal of Engineering Trends and Technology, vol. 4, no. 7, 2013. [Online]. Available: http://www.ijettjournal.org
[15] Nisha and P. J. Kaur, "Cluster quality based performance evaluation of hierarchical clustering method," in 2015 1st International Conference on Next Generation Computing Technologies (NGCT), 2015, pp. 649–653, doi: 10.1109/NGCT.2015.7375201.
[16] A. G. Karegowda, M. A. Jayaram, and A. S. Manjunath, "Cascading k-means clustering and k-nearest neighbor classifier for categorization of diabetic patients," International Journal of Engineering and Advanced Technology, vol. 1, no. 3, pp. 147–151, 2012.
[17] T. Caliński and J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics, vol. 3, no. 1, pp. 1–27, 1974, doi: 10.1080/03610927408827101.
[18] D. Pelleg and A. Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters," in 17th International Conf. on Machine Learning, 2000, pp. 727–734. [Online]. Available: https://www.researchgate.net/publication/2532744
[19] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal Processing Magazine, vol. 13, no. 6, pp. 47–60, 1996, doi: 10.1109/79.543975.
[20] D. Miljković, "Brief review of self-organizing maps," in 2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2017, pp. 1061–1066, doi: 10.23919/MIPRO.2017.7973581.
[21] "Weka 3 - Data Mining with Open Source Machine Learning Software in Java." https://www.cs.waikato.ac.nz/ml/weka/ (accessed May 10, 2022).
[22] R. Lletí, M. C. Ortiz, L. A. Sarabia, and M. S. Sánchez, "Selecting variables for k-means cluster analysis by using a genetic algorithm that optimises the silhouettes," Analytica Chimica Acta, vol. 515, no. 1, pp. 87–100, Jul. 2004, doi: 10.1016/J.ACA.2003.12.020.
[23] N. Bhargava, G. Sharma, R. Bhargava, and M. Mathuria, "Decision tree analysis on j48 algorithm for data mining," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 6, 2013.
[24] Y. A. LeCun, L. Bottou, G. B. Orr, and K. R. Müller, "Efficient BackProp," Lecture Notes in Computer Science, vol. 7700, pp. 9–48, 2012, doi: 10.1007/978-3-642-35289-8_3.