(C) 2022 Matlab-Based Graphical User Interface For IoT Sensor Measurements Subject To Outlier
M P Flower Queen
Department of Electrical Engineering
NICHE
India
[email protected]
Abstract—This paper addresses a graphical user interface for outlier detection in sensor measurements. Anomaly detection is one of the essential steps toward sensor measurement analysis and decision making. Specifically, noisy measurements may contain abnormal readings that affect the model and lead to misleading decisions; hence, data must be explored, cleaned, and corrected before being considered for statistical analysis. In this paper, we propose an interactive, easy-to-navigate, and user-friendly interface to extract and explore a sensor's abnormal measurement records. Extraction, visualization, and detection are the three main tasks of the interface, which provides the user with a variety of pre-defined detection techniques as the first step toward data analysis. The interface is operated via Matlab, and the effectiveness of the tool is demonstrated by means of real-time measurement extraction in which different outlier detection methods are considered.
Keywords—outlier detection, graphical user interface, sensor measurement, anomalies, data mining
I. INTRODUCTION
Knowledge Discovery in Databases (KDD) is the most common way of identifying valid, valuable, and reasonable data from enormous datasets. Data Mining (DM) is the study of a dataset to extract useful insights and to predict outcomes from the vast amount of available information using a wide range of techniques. It is applied in a wide variety of areas in science, business, and design. New methodologies are required to extract useful data from the information, and they are expected to fit successfully into all the steps of the data mining system. New techniques that deal with complex and heterogeneous data are constantly being developed. DM is the numerical core of the KDD process. It involves various algorithms that are used in data exploration, mathematical model development, and implicit and explicit pattern discovery. All these steps are essential in extracting useful information [1-2]. Data mining has gained widespread popularity over the last few years, and its range of applications has expanded considerably [2-5]. DM is associated with extracting information from previously available data and anticipating the future trends of a process by investigating and analyzing data. It exhibits the best computing results and is considered one of the most sought-after end products of information technology. Finding and extracting data from information is a task that many analysts and experts are trying to achieve.
Outlier Detection (OD) is an essential step in data mining applications, in which outliers are primarily categorized as univariate or multivariate, based on the dimension of the feature space. Detection can be sub-classified as parametric, that is, statistical, and nonparametric, that is, not related to any model. More details about the types of outliers are discussed in Section II. Methods for detecting anomalies have been studied for a very long time in various applications, such as intrusion detection, wireless sensor networks, sensor malfunction, clinical diagnosis, earth science, satellite imagery, climate forecasting, and numerous others. In Internet of Things (IoT)-based embedded systems, where sensors transmit measurements remotely, there may be varied sources of outliers: noise may affect the data, or the data may be affected by factors such as a malfunction of the sensor or measuring device. The data have to be preprocessed before they can be analyzed. Otherwise, the analysis might result in bad estimation, very high computation cost, waste of useful resources, and poor data insight for the decision-makers [6]. Moreover, bad estimation of data along with an incorrect model can lead to a situation where crucial information is hidden and lost. An IoT platform can be linked with MATLAB to analyze and visualize data as well as to develop hardware-in-the-loop control. The data used in an IoT platform are usually stored in cloud storage. An IoT platform communicates with the sensors and transmits the data via the cloud for analysis and storage. This allows for simple exploration and interpretation of the data. If the data are not preprocessed correctly, the patterns in the data can be greatly misinterpreted, which results in misleading decision making and incorrect analytics. Outlier detection is therefore a very important step in data processing and an integral part of using IoT platforms along with MATLAB. The paper is organized as follows. In Section II, the fundamentals of outliers are discussed, including the definition, types, and causes of outliers. Section III gives an insight into some of the prominent outlier detection techniques. A Matlab-based GUI is presented in Section IV. The conclusion and discussion are drawn in Section V.
Fig. 1. The architecture of a KDD
Authorized licensed use limited to: Universita degli Studi di Genova. Downloaded on March 19,2022 at 09:26:26 UTC from IEEE Xplore. Restrictions apply.
II. OUTLIERS
A. Outliers - Definition
In measurements, an outlier is a data point that differs markedly from the other data points. An outlier, or anomaly, might be caused by changes in the measurement, by experimental error, or by the data collection process. An anomaly can cause serious problems in statistical analyses. Detecting anomalies is therefore a primary step toward a smooth and coherent analysis of the data. Anomalies may indicate aberrant data, which would eventually lead to misspecification of the model. When larger samples are available, a few data points may lie further from the mean than is considered reasonable. Outliers may also be created when unusual behavior is associated with the data generation process.
An outlier, as explained by Hawkins [8], is an observation that deviates from the remaining observations so much that it raises the suspicion that it has a different source of origin. Such deviant data might be caused by various factors during an experimental procedure, by alterations that occur while the measurement value is recorded, or by exclusiveness.
This can be because of accidental errors or mistakes during measurement, or because of faulty instruments. An anomaly can consequently indicate defective data, incorrect procedures, or regions where a specific hypothesis may not be valid. The aim of outlier detection is to detect abnormal patterns, that is, data that deviate from the rest of the measurements, which are called outliers or anomalies. When the dimension of the data is very high because of the increasing number of features or attributes, the amount of data also grows, so there is an increased need for outlier detection. Various approaches involved in anomaly detection are discussed in [7].
Various applications of anomaly detection methods in the real world are discussed in [9]. The advantages and disadvantages of various outlier detection techniques have been compared in [10].
Outliers can have numerous varied causes. A physical device for taking the measurements might have experienced a transient breakdown. There may have been a mistake during data transmission or recording. Outliers may emerge for multiple reasons, including errors in measuring instruments, errors in human handling, faulty systems, fraudulent behavior, or natural deviations in the population. During data entry, an error can be caused by a typographic mistake, which can be rectified manually. If the outlier is natural, it can be verified to remove any discrepancies and then added to the classification. The system must make use of a classification algorithm that is resistant to outliers.
With the growth of IoT, massive data have become available, which may contain a wide variety of useful information. Data mining plays a key role in smart systems that provide accurate services. In order to develop an efficient data mining module for IoT-based data, three main characteristics have to be considered [11]: a clear specification of the objective of the problem, an explanation of the characteristics and representation of the data, and finally the determination of a mining algorithm for the data processing. As the data from IoT keep expanding, the first task performed in data mining is to find relations among the data and to specify association rules among them. The Apriori algorithm has been proposed to improve frequent-item mining. However, in time series analysis it is difficult to find frequent items, so an improved mining algorithm based on an advanced Apriori algorithm has been proposed [12]. IoT involves various embedded systems, such as household devices, portable devices, and unmanned vehicles, that are embedded with sensors. One of the main purposes of developing IoT devices is to improve the efficiency and accuracy of the devices and to reduce energy usage and cost. Big data have always been produced by intricate systems, and the complexity of the data determines the diversity of applications. Hence, IoT data mining is one of the prime factors in the IoT era. Singh et al. [13] discussed four different data mining models for IoT: the multilayer data mining model, the distributed data mining model, the grid-based data mining model, and the multi-technology integration data mining model. Classification, clustering, and association-based mining techniques are discussed in [14] with comparative analysis results.
B. Classification of outliers
Understanding the type of outlier is crucial before starting the data analysis. Outliers can be univariate, where the outlier is encountered in the distribution of a single feature, or multivariate, where two or more variables jointly have an unusual (outlying) record. Another classification generally considered in statistics distinguishes global outliers, conditional outliers, and collective outliers.
C. Causes of outliers
Although outliers may be quite common in large sets of data, their underlying causes need to be understood, since this may help either to eliminate or to rectify the outlier. Outlier sources may be encountered at any stage of data processing: the production, collection, processing, or analysis of the data. Although many factors may contribute to outliers in general, manual errors in data entry or in reporting an observation can lead to faulty data or outliers. Errors might be introduced during the experimental phase of the procedure, which may include the planning of the execution procedure, or during the data extraction phase. A prominent change in the trend of the data due to external factors such as the environment may contribute to an outlier. Outliers may also be caused by faulty or malfunctioning equipment used to collect and measure the reported data. Outliers may be present due to the intentional addition of erroneous data used to test the dataset for detection purposes. They may also occur due to manipulation or mutation of the dataset that produces errors or outliers during data processing. Outliers are also present if the data have been extracted from improper sources. Although outliers are commonly treated as a reason for omission, some novelties may signal an alarming situation rather than an error, as in the case of malicious activity, where data that deviate from the rest
of the observations may be a cause for concern, indicating hacking activity or an intrusion into a network.
III. OUTLIER DETECTION TECHNIQUES
In order to identify the presence of an outlier in a dataset, outlier detection techniques need to be applied, based on an initial understanding of outlier analysis.
A. Standard Deviation
Values that do not lie within three standard deviations of the mean of the data are considered outliers.
B. Box Plot
A box plot is a graphical representation of data based on their quantiles. The boundaries of the data are defined by the upper and lower whiskers. Any value that lies outside the range of the whiskers is considered to be an outlier. The upper quartile lies within the range of the upper whisker, and the lower quartile lies within the range of the lower whisker.
C. Box Plot Anatomy
The anatomy of a box plot is built on the interquartile range (IQR), which measures the statistical dispersion and variability of the data by splitting them into quartiles. The quartiles are three points that divide the data into four intervals.
D. Z-Score for Parametric Data
For a Gaussian distribution, the z-score of a point is its distance from the mean expressed in standard deviations. When the data points do not fit the Gaussian distribution, different transform techniques, including scaling, might be applied before calculating the z-score of a point:
$$z = \frac{x - u}{\sigma}$$
where x is the measurement, u is the mean, and σ is the standard deviation of the data. Usually, a threshold of 2.5, 3, or 3.5 is used to define the normal range. Observations whose z-score exceeds the threshold are eliminated.
E. DBSCAN for Nonparametric Data
For nonparametric data, the neighbors of each point are found using a density-based clustering algorithm. The DBSCAN clustering model has three main constituents: core points, border points, and noise points. Points that are neither core points nor border points, that is, the noise points, are marked as outliers. In this method, single- or multi-dimensional data are classified and clustered into groups, and a density-based anomaly detection technique is applied.
F. Isolation Forest for Nonparametric Data
In this method, points that lie at a greater distance from the other observations are termed outliers. The forest is built from trees that are grown on the training data by a splitting algorithm. Each node is split into child nodes, the path lengths required to isolate the points are compared, and the data with the shortest path length are termed outliers. A decision-tree-like structure is thus used to represent the isolation forest detection method, in which outliers are isolated and regions are constructed based on scoring. It is a very effective technique for detecting anomalies or outliers.
G. Robust Random Cut Forest
It is an unsupervised algorithm that is used to detect outliers.
H. Minimum Covariance Determinant
This is a highly robust technique for detecting outliers, based on robust estimates of multivariate location and scatter. An ellipse is constructed within which all the normal data are fitted, and any data point outside the ellipse is considered to be an outlier.
I. Local Outlier Factor
A feature that lies at a distance from the other features is considered to be an outlier. This technique is used only when the dataset has few dimensions. It works on the nearest-neighbor principle, and a score is assigned to each data point based on the size of its local neighborhood.
J. Distance-based outlier detection
Any point P in the dataset that is far from the other data points is considered an outlier; this is the underlying principle of distance-based outlier detection.
K. Local metrics based detection
Distance-based outliers, which rely on the count of neighborhood points of a data point P, are useful only if the threshold is accurate. This forms the underlying principle in calculating the MGDEF.
L. Kernel Estimator
To estimate the probability density function, a statistical estimator may be applied to a random sample, with a kernel associated with each point in the dataset.
IV. MATLAB BASED GUI
A Matlab Graphical User Interface (GUI) is a very useful tool for designing and creating an interactive interface for data processing. The difficulties associated with using the algorithms directly are removed, and information is displayed at the click of a button. Various graphical objects are associated with the user interface: text boxes for editing, pop-up menus, checkboxes, push buttons, sliders, and many more. In order to obtain a versatile design, the graphical objects need to be arranged meticulously. This ensures a user-friendly interface. The mathematical model representing the system can be easily built by changing the parameters associated with the GUI.
This paper proposes an interactive interface to process sensor measurements. The proposed system is shown in Fig. 2. The interface can be considered the first phase of data analysis for sensor data, prior to decision making and statistical analysis reporting. The data flow to the interface has two different paths: indirect data extraction from a server database, or offline analysis. The indirect method requires initialization through a third-party interconnected system to interface the server data with Matlab. It is supported by bridging access and requires a special setup according to the data server configuration. In the offline method, sensor measurements can be uploaded directly into Matlab; in this case, a matrix of measurements is built and uploaded using a GUI function. As shown in Fig. 2, the two approaches are displayed in the interface in panel 1 (sensor network) and panel 2 (load data), respectively. The startup interface shown in Fig. 2 lets the user extract or upload a dataset and start the data analysis according to the listed features.
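As an illustration of one of the pre-defined techniques such an interface can offer, the z-score rule of Section III-D can be sketched in a few lines. The sketch is written in Python rather than Matlab and is not the interface's implementation; the function name `zscore_outliers`, the default threshold of 3, and the sample readings are assumptions made for this example.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for every reading whose |z| exceeds threshold."""
    u = mean(values)        # mean of the data, written u in the text
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []           # constant signal: no reading deviates from the mean
    flagged = []
    for i, x in enumerate(values):
        z = (x - u) / sigma
        if abs(z) > threshold:
            flagged.append((i, x, z))
    return flagged


# Hypothetical temperature readings with one injected spike at index 12.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7,
            20.0, 20.4, 19.9, 20.1, 35.0, 20.2, 19.8, 20.0]

flagged = zscore_outliers(readings)
print(flagged)
```

With a population standard deviation, a single reading in a sample of n points can never reach a z-score above sqrt(n - 1), so a threshold of 3 cannot flag anything in a record of ten or fewer points. This is one reason to check the data size before choosing this rule.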
Fig. 2. The startup user interface (Initial run) – No data is processed
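The box-plot rule of Sections III-B and III-C can be sketched in the same spirit. The following Python example (an illustration, not the Matlab code behind the interface) computes the quartiles, the interquartile range, and the whisker limits, and reports the values outside them. The whisker length of 1.5 times the IQR is a common convention and is an assumption here, since the text does not state it; the function names and the readings are also hypothetical.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) whisker limits Q1 - k*IQR and Q3 + k*IQR."""
    # Three cut points split the data into four intervals (the quartiles).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def boxplot_outliers(values, k=1.5):
    """Return the values that fall outside the whisker limits."""
    lower, upper = iqr_fences(values, k)
    return [x for x in values if x < lower or x > upper]


# Hypothetical temperature readings with one injected spike.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7,
            20.0, 20.4, 19.9, 20.1, 35.0, 20.2, 19.8, 20.0]

print(iqr_fences(readings))
print(boxplot_outliers(readings))
```

Because the quartiles are barely affected by a single extreme reading, the limits stay close to the bulk of the data even when the spike is large.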
Fig. 5 shows a running example of the proposed interface, where data are initialized to be extracted from a pre-defined station dataset and visualized using the tools listed in Table II. The first visualizations the user performs are the original measurement plot and the smoothed data, as explained in Table II. To give the user a general initial understanding of the data, a plot of the smoothed data is generated (bottom plot in the GUI). The data are smoothed using the "robust LOWESS" method, in which outliers are treated as part of the noise and eliminated. The smoothed data are listed in the table provided within the interface as "VALID1". The user must be careful when reading such data, since outlier detection is not optimized yet, and detection methods should be chosen so that they do not mislead the data analysis. The second way of understanding the outliers, besides the available outlier detection methods, is the plot of outlier thresholds shown in Fig. 6. In the figure, a measurement is by default defined as an outlier if it is more than three scaled median absolute deviations away from the median. Other outlier methods can be chosen as well, which updates the outlier locations and the thresholds plot.
TABLE II. RUNNING INTERFACE FUNCTIONS
Tool | Functionality | Remarks
Plot – Sensor Data [Panel: Initialize and Visualize] | Plots the extracted measurement vector (measurement vs. time) – Top plot | If the extracted data are not pre-initialized, the dataset can also be uploaded using "Upload"
Extracted in Station – Table [Synced table] | The initialized and plotted data are shown in this table, where the first row presents the measurement over time and the last row (Valid-smoother) presents the smoothed data – Bottom plot | The Valid-Smoother row in the table is considered an initial outlier preprocessing step to understand the presented data visually, where the data are smoothed and outliers are removed
Outlier Thresholds [Panel: Initialize and Visualize] | Finds and plots the outlier locations, thresholds, and center value | The plot is shown in a separate figure, and the data are saved. Each value identified as an outlier is returned to the GUI and shown in a vector attached to the [outlier_value] panel
Visualize [Panel: Initialize and Visualize] | Lets the user choose the outlier detection method from a drop-down shortlist of available methods, then creates a new figure with the outlier plot and returns true for each detected value | The option to define and set up a new script is also provided, so if a required outlier detection method is not listed, the user can still use a script according to some pre-setup instructions. Available methods in the GUI include quartiles, Grubbs, generalized extreme studentized deviate test, boxplot, etc.
Save Data / Export [Panel: Initialize and Visualize] | All analyzed data are saved to the Matlab workspace and hence available to be exported | Using the Export functionality, the saved variable can be exported to a spreadsheet with all records and statistical analysis
V. CONCLUSION AND DISCUSSION
We have presented an interactive interface for data extraction and outlier detection for sensor measurements subject to outliers. The proposed interface gives the user an easy, graphical, and friendly environment with a variety of functions to explore and process sensor measurements. The interface is designed for real-time sensor data extraction with a pre-defined setup for the model and parameters. Its use is generalized by letting the user import and export data directly from an offline dataset and add a new script to the main file. Data preprocessing starts with visualizing the original dataset together with the smoothed data, from which the user can explore and export the main statistical measures. A brief overview of the outlier thresholds is then provided by means of a graphical presentation and numerical results, so that the user can start navigating the desired outlier detection methods from the available list. All generated variables, figures, and tables are saved automatically and can be exported for further statistical analysis. Working with sensor measurements and data preprocessing is a crucial task before proceeding with data analysis and decision making. Still, at this phase many points and challenges need to be considered. For this particular interface, the user must understand at least the outlier groups and types, the data distribution, and the data size before choosing the outlier detection method. One challenge is the correct choice of outlier detection method, since choosing an incorrect model may generate misleading peaks or remove important information carried by an outlier value. The other challenge is the size of the dataset: because of computational time and pre-modeling limitations, a big dataset may cost time and produce incorrect peaks among the values. This point will be investigated in future work, together with extending the interface to include data mining methods prior to the outlier detection phase, machine learning methods, and more robust techniques.
ACKNOWLEDGMENT
This research is supported by Higher Colleges of Technology under applied research interdisciplinary grant number 11328.
REFERENCES
[1] O. Maimon and L. Rokach, "Data Mining and Knowledge Discovery Handbook," O. Maimon and L. Rokach, Eds., 3rd edition, Springer New York Heidelberg London, 2010, doi: 10.1007/978-0-387-09823-4.
[2] U. Fayyad et al., "From Data Mining to Knowledge Discovery in Databases," AI Mag., vol. 17, no. 3, pp. 37–37, Mar. 1996, doi: 10.1609/AIMAG.V17I3.1230.
[3] I. H. Witten et al., "Data mining: practical machine learning tools and techniques," p. 629, 2011.
[4] O. Marban et al., "A Data Mining & Knowledge Discovery Process Model," Data Min. Knowl. Discov. Real Life Appl., Jan. 2009, doi: 10.5772/6438.
[5] S. Sayad, "Real Time Data Mining," Self-Help Publishers, Canada, 2011. https://www.bookdepository.com/Real-Time-Data-Mining-Saed-Sayad/9780986606045 (accessed Dec. 14, 2021).
[6] M. Awawdeh et al., "Application of Outlier Detection using Re-Weighted Least Squares and R-squared for IoT Extracted Data," 2019 Adv. Sci. Eng. Technol. Int. Conf. ASET 2019, May 2019, doi: 10.1109/ICASET.2019.8714261.
[7] S. Agrawal and J. Agrawal, "Survey on anomaly detection using data mining techniques," Procedia Comput. Sci., vol. 60, no. 1, pp. 708–713, 2015, doi: 10.1016/j.procs.2015.08.220.
[8] D. M. Hawkins, "Identification of Outliers," Identif. Outliers, 1980, doi: 10.1007/978-94-015-3994-4.
[9] L. Akoglu et al., "Graph based anomaly detection and description: A survey," Data Min. Knowl. Discov., vol. 29, no. 3, pp. 626–688, Apr. 2015, doi: 10.1007/S10618-014-0365-Y.
[10] N. R. Prasad et al., "Anomaly detection," Comput. Mater. Contin., vol. 14, no. 1, pp. 1–22, 2009, doi: 10.1145/1541880.1541882.
[11] C. W. Tsai et al., "Data mining for internet of things: A survey," IEEE Commun. Surv. Tutorials, vol. 16, no. 1, pp. 77–97, 2014, doi: 10.1109/SURV.2013.103013.00206.
[12] Z. Wang et al., "Data mining in IoT era: A method based on improved frequent items mining algorithm," Proc. - 2019 5th Int. Conf. Big Data Inf. Anal. BigDIA 2019, pp. 120–125, 2019, doi: 10.1109/BigDIA.2019.8802727.
[13] A. Singh and S. Sharma, "Analysis on data mining models for Internet of Things," Proc. Int. Conf. IoT Soc. Mobile, Anal. Cloud, I-SMAC 2017, pp. 94–100, 2017, doi: 10.1109/I-SMAC.2017.8058313.
[14] I. Batra et al., "Performance Analysis of Data Mining Techniques in IoT," Proc. - 4th Int. Conf. Comput. Sci. ICCS 2018, pp. 194–199, 2019, doi: 10.1109/ICCS.2018.00039.