Lecture 5
SENSOR DATA
ACQUISITION,
CLEANING & QUERY
PROCESSING
Dr. Firoz Anwar
MIDTERM EXAM FORMAT
Question Task
Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
IN-NETWORK DATA
ACQUISITION
Consider a smart agriculture system with soil moisture sensors
deployed across a large field:
• Each sensor collects data about soil moisture levels.
• Instead of sending all raw data to a central server, the sensors
process the data locally.
• They might calculate the average moisture level for their region and
only transmit this summarized data to the central server.
• This reduces the amount of data transmitted, saves energy, and
allows for real-time irrigation decisions.
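The local summarisation described above can be sketched as follows. This is a minimal illustration, not a specific system's protocol: the function names `summarize_region` and `combine`, the summary fields, and the moisture readings are all hypothetical.

```python
from statistics import mean


def summarize_region(readings):
    """Runs on the sensor node: reduce raw readings to a small summary."""
    return {"count": len(readings), "mean": mean(readings)}


def combine(summaries):
    """Runs on the central server: merge regional summaries into a field-wide mean."""
    total = sum(s["count"] for s in summaries)
    return sum(s["count"] * s["mean"] for s in summaries) / total


# Hypothetical volumetric soil-moisture readings from two regions.
region_a = summarize_region([0.31, 0.29, 0.35, 0.33])
region_b = summarize_region([0.18, 0.22])

# Only two small summaries are transmitted instead of six raw values.
field_mean = combine([region_a, region_b])
```

Sending the count alongside the mean lets the server weight each region correctly when it merges the summaries.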
MULTI-DIMENSIONAL
GAUSSIAN DISTRIBUTIONS
Also known as multivariate normal distributions
An extension of the normal (Gaussian) distribution to multiple
variables
In simple terms, it describes the probability of different values
occurring for multiple interrelated continuous variables.
Statistical models used to represent and analyse data that has
multiple correlated variables.
In sensor data processing, this approach is often used for data
modelling, anomaly detection, and pattern recognition.
Key Characteristics
MULTI-DIMENSIONAL
GAUSSIAN DISTRIBUTIONS
Single-variable Gaussian (Normal) Distribution
Multiple Variables (Multivariate Gaussian)
o Probability of different combinations of these values occurring
together.
Key Components:
1. Mean vector (μ): A list of the average values of each variable.
2. Covariance matrix (Σ): A table that shows how much each pair of
variables is related (positively, negatively, or not at all).
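For reference, the two components above determine the standard multivariate Gaussian density. The symbols x (the vector of variable values) and k (the number of variables) are notation assumed here; they do not appear on the slide.

```latex
p(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}}
\exp\!\left( -\tfrac{1}{2}\,(\mathbf{x}-\mu)^{\top}\,\Sigma^{-1}\,(\mathbf{x}-\mu) \right)
```

With k = 1 this reduces to the familiar single-variable normal density with mean μ and variance Σ.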
MULTI-DIMENSIONAL
GAUSSIAN DISTRIBUTIONS
Consider a weather monitoring system with sensors measuring
temperature, humidity, and pressure:
• The data from these sensors can be modelled using a multi-dimensional
Gaussian distribution.
• The distribution captures the typical relationships between temperature,
humidity, and pressure.
• If a sensor reports a combination of values that is highly unlikely (e.g., very
high temperature with very high humidity), it can be flagged as an
anomaly.
• This helps in detecting faulty sensors or unusual weather conditions.
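One way to sketch the anomaly check above is to estimate μ and Σ from historical readings and flag a new reading whose squared Mahalanobis distance is large. Everything numeric here is an assumption for illustration: the synthetic training data, the covariance values, and the threshold (roughly the 0.999 quantile of a chi-square distribution with 3 degrees of freedom).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical history: temperature (deg C), humidity (%), pressure (hPa).
# Temperature and humidity are negatively correlated in this synthetic data.
true_mean = np.array([25.0, 60.0, 1013.0])
true_cov = np.array([[9.0, -12.0, 0.0],
                     [-12.0, 64.0, 0.0],
                     [0.0, 0.0, 16.0]])
history = rng.multivariate_normal(true_mean, true_cov, size=2000)

# Fit the model: mean vector and covariance matrix.
mu = history.mean(axis=0)
sigma_inv = np.linalg.inv(np.cov(history, rowvar=False))

THRESHOLD = 16.27  # assumed cut-off, approx. chi-square 0.999 quantile, 3 d.o.f.


def mahalanobis_sq(reading):
    """Squared Mahalanobis distance of a reading from the fitted model."""
    d = np.asarray(reading, dtype=float) - mu
    return float(d @ sigma_inv @ d)


def is_anomaly(reading):
    return mahalanobis_sq(reading) > THRESHOLD


typical = [26.0, 57.0, 1012.0]   # slightly warmer, slightly drier
unusual = [34.0, 84.0, 1013.0]   # very hot AND very humid
```

Each value in `unusual` is only three standard deviations from its own mean, but the combination contradicts the learned negative correlation, so the joint distance is large.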
PUSH-BASED DATA
ACQUISITION
The sensors autonomously decide when to communicate sensor values to the
base station.
If the sensor values deviate from their expected behaviour, the sensors
communicate only the deviating values to the base station.
PUSH-BASED DATA
ACQUISITION
Techniques:
PRESTO
Ken
A Generic Push-Based Approach
PUSH-BASED DATA
ACQUISITION
Drivers:
A pull strategy cannot observe unusual or interesting patterns that
occur between two consecutive pull requests.
Increasing the pull frequency to detect such patterns more reliably
increases the overall energy consumption of the system.
PRESTO
Known as the Predictive Storage (PRESTO) system.
Two main components:
PRESTO proxies
Have higher computational capability and storage resources than
PRESTO sensors.
Task is to gather data from the PRESTO sensors and to answer
queries posed by the user.
PRESTO sensors
Assumed to be battery-powered and remotely located.
Task is to sense the data and transmit it to PRESTO proxies, while
archiving some of it locally on flash memory.
PRESTO
To answer queries posed by the user, PRESTO proxies always maintain
a time-series prediction model. Specifically, PRESTO maintains a
seasonal ARIMA (SARIMA) model of the following form for each sensor:
PRESTO
The proxies estimate the parameters of the model given in Eq.
(2.3), and then transmit these parameters to individual
PRESTO sensors.
The PRESTO sensors use these models to predict the sensor
value v̂ij, and only transmit the raw sensor value vij to the
proxies when the absolute difference between the predicted
sensor value and the raw sensor value is greater than a user-
defined threshold δ.
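The transmit rule above can be sketched in a few lines. Note the assumption: instead of a SARIMA model, this sketch predicts each value as the last transmitted value, a deliberately simple stand-in that both sensor and proxy could compute. The function name `push_filter` and the readings are hypothetical.

```python
def push_filter(readings, delta):
    """Return the (index, value) pairs a sensor would transmit.

    A value is sent only when |v - v_hat| > delta, where v_hat is the
    prediction. Here v_hat is simply the last transmitted value (an
    assumed stand-in for the SARIMA prediction used by PRESTO).
    """
    sent = []
    last_sent = None
    for i, v in enumerate(readings):
        v_hat = last_sent
        if v_hat is None or abs(v - v_hat) > delta:
            sent.append((i, v))
            last_sent = v
    return sent


# Hypothetical temperature readings with one excursion.
readings = [20.0, 20.1, 20.2, 23.0, 23.1, 20.0]
sent = push_filter(readings, delta=0.5)
```

In this run only three of the six readings are transmitted; the proxy can substitute the prediction for the rest, with an error of at most δ.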
KEN
For reducing the communication cost, the Ken framework
employs a similar strategy as PRESTO.
Key difference between Ken and PRESTO:
PRESTO uses a SARIMA model; this model only takes into
account temporal correlations. On the other hand, Ken uses
a dynamic probabilistic model that takes into account spatial
and temporal correlations in the data.
Since a large quantity of sensor data is correlated spatially,
and not only temporally, Ken derives advantage from such
spatiotemporal correlation.
KEN
Two types of entities:
Sink (Similar to the PRESTO proxy)
The sink is the base station to which the sensor values vij
are communicated by the source
MODEL-BASED SENSOR
DATA CLEANING
Sensor data is uncertain and erroneous.
Operational Difficulty:
Sensors often operate with discharged batteries, network
failures, and imprecision. Other factors, such as low-cost
sensors, freezing or heating of the casing or measurement
device, accumulation of dirt, mechanical failure or vandalism
(from humans or animals) heavily affect the quality of the
sensor data.
Problem:
Erroneous sensor data causes significant problems for data
utilization, since applications that use it may yield
unsound results.
MODEL-BASED SENSOR
DATA CLEANING
A system for cleaning sensor data generally consists of four major
components:
User interface,
Stream processing engine,
Anomaly detector, and
Data storage.
MODEL-BASED SENSOR
DATA CLEANING
The user interface plays two roles:
First, it takes all necessary inputs from users to perform data
cleaning, e.g., name of sensor data and parameter settings
for models.
Second, the results of data cleaning, such as ‘dirty’ sensor
values captured by the anomaly detector, are presented
using graphs and tables, so that users can confirm whether
each candidate of such dirty values is an actual error.
The confirmed results are then stored to (or removed from)
the underlying data storage or materialized views.
MODEL-BASED SENSOR
DATA CLEANING
Anomaly Detector:
Online or Offline
Stream Processing Engine
Data Storage
MODELS FOR SENSOR DATA
CLEANING
Regression Models
Probabilistic Models
Outlier Detection Models
REGRESSION MODELS
Polynomial Regression:
Polynomial regression finds the best-fitting curve that minimizes the total
difference between the curve and each raw sensor value vij at time ti. Given a
degree d, polynomial regression is formally defined as:
Polynomial regression with a high degree approximates a given time series with a more
sophisticated curve, resulting in a theoretically more accurate description of the
raw sensor values. In practice, however, low-degree polynomials, such as
constant (d = 0) and linear (d = 1), also perform satisfactorily.
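As a rough illustration of regression-based cleaning, the sketch below fits a linear polynomial (d = 1) with NumPy and flags values that lie far from the fitted curve. The data, the injected error, and the residual threshold of 3.0 are all assumptions chosen for the example.

```python
import numpy as np

# Hypothetical time series: a linear trend with one injected dirty value.
t = np.arange(10, dtype=float)
v = 2.0 * t + 1.0
v[6] += 8.0  # simulated sensor glitch at time index 6

# Least-squares polynomial fit of degree d = 1.
coeffs = np.polyfit(t, v, deg=1)
v_hat = np.polyval(coeffs, t)

# Values far from the fitted curve are candidate dirty values.
RESIDUAL_THRESHOLD = 3.0  # assumed, application-specific
residuals = np.abs(v - v_hat)
suspects = np.flatnonzero(residuals > RESIDUAL_THRESHOLD)
```

The glitch also pulls the fitted line slightly towards itself, which is why thresholds in practice are usually set with some margin or the fit is repeated after removing suspects.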
REGRESSION MODELS
Chebyshev Regression:
Chebyshev regression is a popular model class for fitting sensor values, since
it can quickly compute near-optimal approximations for a given time series.
Suppose that the time values ti vary within a range [min(ti), max(ti)].
We then obtain normalized time values t′i within the range [−1, 1] by using the
following transformation function f(ti) and its inverse transformation function
f−1(t′i).
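A sketch of the normalisation and fit, under stated assumptions: `normalize` and `denormalize` implement the standard affine map between [min(ti), max(ti)] and [−1, 1], which is assumed here to be the transformation the slide refers to. The signal, the degree of 5, and the helper names are hypothetical; `chebfit` and `chebval` are NumPy's Chebyshev routines.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def normalize(t, t_min, t_max):
    """Map t from [t_min, t_max] onto [-1, 1]."""
    return (2.0 * t - (t_max + t_min)) / (t_max - t_min)


def denormalize(t_norm, t_min, t_max):
    """Inverse map from [-1, 1] back to [t_min, t_max]."""
    return (t_norm * (t_max - t_min) + (t_max + t_min)) / 2.0


# Hypothetical sensor time series sampled at times 100..160.
t = np.linspace(100.0, 160.0, 61)
v = 20.0 + 5.0 * np.sin(t / 10.0)

t_norm = normalize(t, t.min(), t.max())

# Least-squares fit of a degree-5 Chebyshev series on the normalised times.
coeffs = C.chebfit(t_norm, v, deg=5)
v_hat = C.chebval(t_norm, coeffs)
max_err = float(np.max(np.abs(v - v_hat)))
```

Only the six coefficients plus the time range need to be stored to reproduce the approximation, which is what makes this model class attractive for compact representations.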
PROBABILISTIC MODELS
In sensor data cleaning, inferring sensor values is perhaps the most
important task, since systems can then detect and clean dirty sensor
values by comparing raw sensor values with the corresponding inferred
sensor values.
OUTLIER DETECTION
MODELS
Outlier detection methods (Zhang et al. offer an overview):
MODEL-BASED QUERY
PROCESSING
In-Network Query Processing
Model-Based Views
IN-NETWORK QUERY
PROCESSING
First, an overlay network is built (for example, a semantic routing tree, SRT).
Then, the overlay network is used for increasing the efficiency of
aggregating sensor values and processing queries.
For instance, while processing a threshold query, parent nodes send the query to
the child nodes only when the query threshold condition overlaps with the range
of sensor values contained in the child nodes, which is stored in the parent
node’s local memory.
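The pruning idea in the threshold-query example can be sketched with a small tree. This is an illustrative sketch only: the `Node` class, the query "value greater than threshold", and the readings are hypothetical, and the child range is recomputed on each call here, whereas a real overlay would cache it in the parent node's memory.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    value: float
    children: list = field(default_factory=list)

    def value_range(self):
        """(min, max) of sensor values in this node's subtree."""
        lo = hi = self.value
        for child in self.children:
            c_lo, c_hi = child.value_range()
            lo, hi = min(lo, c_lo), max(hi, c_hi)
        return lo, hi


def threshold_query(node, threshold, visited):
    """Return names of nodes with value > threshold; log every node contacted."""
    visited.append(node.name)
    hits = [node.name] if node.value > threshold else []
    for child in node.children:
        _, child_hi = child.value_range()  # cached at the parent in practice
        if child_hi > threshold:           # forward only if ranges overlap
            hits += threshold_query(child, threshold, visited)
    return hits


# Hypothetical routing tree of temperature sensors.
tree = Node("root", 20.0, [
    Node("A", 22.0, [Node("A1", 21.0), Node("A2", 35.0)]),
    Node("B", 18.0, [Node("B1", 17.0), Node("B2", 19.0)]),
])

visited = []
hits = threshold_query(tree, threshold=30.0, visited=visited)
```

The whole subtree under B is never contacted, because its stored maximum (19.0) cannot satisfy the query.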
MODEL-BASED VIEWS
MauveDB
Approach:
Use standard database views as an abstraction layer for processing
queries.
Views are maintained in a form of a regression model; thus they are
called model-based views.
Advantage:
The model-based view can be incrementally updated as fresh sensor
values are obtained from the sensors.
Incremental updates are computationally efficient.
MODEL-BASED VIEWS
Uses a model called “RegModel”
A regression model in which the temperature is the dependent variable and the
sensor position (xj , yj ) is an independent variable.
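A regression of temperature on sensor position can be sketched with ordinary least squares. The linear form temp ≈ a + b·x + c·y is an assumption made for this example; the slide does not state which basis functions the view uses. Positions, readings, and the `predict` helper are hypothetical.

```python
import numpy as np

# Hypothetical sensor positions (x_j, y_j) and temperature readings.
positions = np.array([[0.0, 0.0],
                      [1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 1.0],
                      [2.0, 1.0]])
temps = 20.0 + 1.5 * positions[:, 0] - 0.5 * positions[:, 1]

# Design matrix with an intercept column: temp ~ a + b*x + c*y (assumed form).
X = np.column_stack([np.ones(len(positions)), positions])
coef, *_ = np.linalg.lstsq(X, temps, rcond=None)


def predict(x, y):
    """Query the fitted model at a position with no physical sensor."""
    return float(coef @ np.array([1.0, x, y]))


estimate = predict(1.5, 0.5)
```

Once fitted, the model answers queries for any position in the field, including locations where no sensor is deployed.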
SENSOR DATA
COMPRESSION
The goal of the sensor data compression system is to approximate a sensor
data stream by a set of functions.
A standard approach to sensor data compression is to segment the data
stream into data segments, and then approximate each data segment, so
that a specific error norm is satisfied.
Because functions are employed for approximating data segments, only the
approximated data segments are stored in the database, instead of the raw
sensor values of the data stream.
METHODS FOR DATA
SEGMENTATION
Piecewise linear approximation has been the most widely used.
Piecewise linear approximation models the data stream with a separate
linear function per data segment.
Piecewise constant approximation (PCA) approximates a data segment with a
constant value, which can be the first value of the segment (referred to
as the cache filter), the mean value or the median value.
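The cache-filter variant of piecewise constant approximation can be sketched as below. The segment representation `(start_index, length, constant)`, the error bound `epsilon`, and the sample stream are assumptions for illustration; the sketch also assumes a non-empty stream.

```python
def cache_filter(values, epsilon):
    """Segment a non-empty stream into piecewise constant segments.

    Each segment is represented by its first value; a new segment starts
    when a value differs from that constant by more than epsilon.
    Returns a list of (start_index, length, constant) tuples.
    """
    segments = []
    start, const = 0, values[0]
    for i in range(1, len(values)):
        if abs(values[i] - const) > epsilon:
            segments.append((start, i - start, const))
            start, const = i, values[i]
    segments.append((start, len(values) - start, const))
    return segments


# Hypothetical sensor stream.
stream = [10.0, 10.2, 9.9, 12.0, 12.1, 12.3, 10.0]
segments = cache_filter(stream, epsilon=0.5)
```

Seven raw values are stored as three segments, and every reconstructed value is within `epsilon` of the original.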
MODEL-BASED SENSOR
DATA CLEANING
Sensor data is uncertain and erroneous.
Operational Difficulty:
Sensors often operate with discharged batteries, network failures, and
imprecision. Other factors, such as low-cost sensors, freezing or heating of the
casing or measurement device, accumulation of dirt, mechanical failure or
vandalism (from humans or animals) heavily affect the quality of the sensor data.
Problem:
Erroneous sensor data causes significant problems for data utilization, since
applications that use it may yield unsound results.
Less reliable sensors may produce inaccurate prediction results.
ACTIVITY
1. Lab 5 anomaly detection Notebook + NYC Taxi
2. https://towardsdatascience.com/anomaly-detection-for-dummies-15f148e559c1