
Lecture 5

The document outlines the midterm exam format and contents for a course on model-based sensor data acquisition, cleaning, and query processing. It covers in-network data acquisition, multi-dimensional Gaussian distributions, push-based data acquisition methods, and various models for sensor data cleaning and compression. Additionally, it discusses the operational challenges of sensor data and the importance of effective data processing techniques to ensure data quality and reliability.

Uploaded by

anshikanahata33

MODEL-BASED

SENSOR DATA
ACQUISITION,
CLEANING & QUERY
PROCESSING
Dr. Firoz Anwar
MIDTERM EXAM FORMAT
Question | Task
Section A: Short Questions | Elaborate your answers. Merely copying out the lecture materials will not earn good marks.
Section B: Python | Python code similar to the lab activities.
Total | 20
MIDTERM EXAM CONTENTS
 Open Book Exam (Subject materials such as
Lecture slides, labs, articles are allowed)
 No Internet Access
 Week 1 – Week 5 Lecture + Lab Activity
DATA ACQUISITION (cont. from Week 4)
IN-NETWORK DATA
ACQUISITION
 Collecting, processing, and aggregating data within the network of
sensors itself, rather than sending all raw data to a central server for
processing.
 Leverages the computational capabilities of the sensors or
intermediate nodes to reduce data transmission, save energy, and
improve efficiency.
 Key Characteristics

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
IN-NETWORK DATA
ACQUISITION
Consider a smart agriculture system with soil moisture sensors
deployed across a large field:
• Each sensor collects data about soil moisture levels.
• Instead of sending all raw data to a central server, the sensors
process the data locally.
• They might calculate the average moisture level for their region and
only transmit this summarized data to the central server.
• This reduces the amount of data transmitted, saves energy, and
allows for real-time irrigation decisions.
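The aggregation step in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the region names, the readings, and the helper `summarize_region` are hypothetical, and a real deployment would run the aggregation on the sensor nodes themselves.

```python
from statistics import mean

def summarize_region(readings):
    """Aggregate raw soil-moisture readings locally into one small summary."""
    return {"count": len(readings), "mean": mean(readings)}

# Hypothetical raw readings (percent moisture) collected in two regions.
raw_by_region = {
    "north": [31.0, 29.5, 30.5],
    "south": [18.0, 17.0, 19.0, 18.0],
}

# Each region transmits one summary instead of every raw value.
summaries = {region: summarize_region(vals) for region, vals in raw_by_region.items()}

raw_values_sent = sum(len(vals) for vals in raw_by_region.values())
summaries_sent = len(summaries)
print(summaries)
print(f"{raw_values_sent} raw values replaced by {summaries_sent} summaries")
```

Sending the count alongside the mean lets the central server combine regional averages into a correct field-wide average.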
MULTI-DIMENSIONAL
GAUSSIAN DISTRIBUTIONS
 Also known as multivariate normal distributions
 An extension of the normal (Gaussian) distribution to multiple
variables
 In simple terms, it describes the probability of different values
occurring for multiple interrelated continuous variables.
 Statistical models used to represent and analyse data that has
multiple correlated variables.
 In sensor data processing, this approach is often used for data
modelling, anomaly detection, and pattern recognition.
 Key Characteristics

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MULTI-DIMENSIONAL
GAUSSIAN DISTRIBUTIONS
 Single-variable Gaussian (Normal) Distribution
 Multiple Variables (Multivariate Gaussian)
o Probability of different combinations of these values occurring
together.
 Key Components:
1. Mean vector (μ): A list of the average values of each variable.
2. Covariance matrix (Σ): A table that shows how much each pair of
variables is related (positively, negatively, or not at all).
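For reference, the standard probability density of a k-dimensional Gaussian can be written in terms of these two components. The symbols x (a vector of k variable values) and k (the number of variables) are introduced here for illustration.

```latex
p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^{k}\,|\Sigma|}}
\exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\top}\,\Sigma^{-1}\,(\mathbf{x}-\boldsymbol{\mu})\right)
```

With k = 1 this reduces to the familiar single-variable normal density, where Σ is simply the variance.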
MULTI-DIMENSIONAL
GAUSSIAN DISTRIBUTIONS
Consider a weather monitoring system with sensors measuring
temperature, humidity, and pressure:
• The data from these sensors can be modelled using a multi-dimensional
Gaussian distribution.
• The distribution captures the typical relationships between temperature,
humidity, and pressure.
• If a sensor reports a combination of values that is highly unlikely (e.g., very
high temperature with very high humidity), it can be flagged as an
anomaly.
• This helps in detecting faulty sensors or unusual weather conditions.
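One way to flag such unlikely combinations is the Mahalanobis distance, which measures how far a reading lies from the mean after accounting for the covariance. The sketch below is an assumption-laden illustration, not the lecture's method: it uses only two variables (temperature and humidity) so the 2x2 matrix inverse can be written by hand, and the history values and the cut-off of 3.0 are hypothetical.

```python
import math

def fit_gaussian_2d(points):
    """Estimate the mean vector and covariance matrix from (x, y) samples."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis_2d(point, mean, cov):
    """Distance of a point from the mean, scaled by the covariance."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    # Quadratic form with the inverse of the 2x2 covariance, written out by hand.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

# Hypothetical history: humidity tends to fall as temperature rises.
history = [(20, 80), (22, 74), (24, 71), (26, 64), (28, 61), (30, 54), (32, 51)]
mean, cov = fit_gaussian_2d(history)

THRESHOLD = 3.0  # assumed cut-off; tune per application

def is_anomaly(reading):
    return mahalanobis_2d(reading, mean, cov) > THRESHOLD

print(is_anomaly((30, 54)))  # follows the usual temperature/humidity pattern
print(is_anomaly((32, 80)))  # high temperature together with high humidity
```

The second reading is not extreme in either variable on its own; it is flagged because the combination contradicts the learned correlation.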
PUSH-BASED DATA
ACQUISITION
 The sensors autonomously
decide when to communicate
sensor values to the base
station.
 If the sensor values deviate
from their expected
behaviour, then the sensors
communicate only the
deviated values to the base
station.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
PUSH-BASED DATA
ACQUISITION
 Techniques:

 PRESTO
 Ken
 A Generic Push-Based Approach

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
PUSH-BASED DATA
ACQUISITION
 Drivers:
 A pull strategy cannot observe unusual or interesting patterns
that occur between two consecutive pull requests.
 Increasing the pull frequency to detect such patterns better
increases the overall energy consumption of the system.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
PRESTO
 Known as the Predictive Storage (PRESTO) system.
 Two main components:
 PRESTO proxies
 Have higher computational capability and storage resources than
PRESTO sensors.
 Task is to gather data from the PRESTO sensors and to answer
queries posed by the user.

 PRESTO sensors
 Assumed to be battery-powered and remotely located.
 Task is to sense the data and transmit it to PRESTO proxies, while
archiving some of it locally on flash memory.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
PRESTO
 For answering such a query,
 PRESTO proxies always maintain a time-series prediction
model. Specifically, PRESTO maintains a seasonal ARIMA
(SARIMA) model of the following form for each sensor:

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
PRESTO
 The proxies estimate the parameters of the model given in Eq.
(2.3), and then transmit these parameters to individual
PRESTO sensors.
 The PRESTO sensors use these models to predict the sensor
value v̂ij, and transmit the raw sensor value vij to the
proxies only when the absolute difference between the predicted
sensor value and the raw sensor value is greater than a
user-defined threshold δ.
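The transmit-on-deviation rule can be sketched as follows. Note the loud assumption: a trivial "last value known to the proxy" predictor stands in for the SARIMA model, so this shows only the thresholding logic, not PRESTO's actual prediction. The function name `push_filter` and the readings are hypothetical.

```python
def push_filter(readings, delta):
    """Return (index, value) pairs the sensor would transmit to the proxy.

    Assumption: a trivial predictor (the last value the proxy knows) stands in
    for the SARIMA model; sensor and proxy evaluate the same predictor.
    """
    transmitted = []
    known = None  # last value shared by sensor and proxy
    for i, value in enumerate(readings):
        predicted = known
        if predicted is None or abs(predicted - value) > delta:
            transmitted.append((i, value))
            known = value  # the proxy now holds the raw value
        # Otherwise nothing is sent and the proxy keeps using its prediction.
    return transmitted

readings = [20.0, 20.2, 20.1, 23.0, 23.1, 22.9, 20.0]
sent = push_filter(readings, delta=0.5)
print(sent)
print(f"transmitted {len(sent)} of {len(readings)} readings")
```

A larger δ saves more transmissions but lets the proxy's view drift further from the true values.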

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
KEN
 To reduce the communication cost, the Ken framework
employs a strategy similar to that of PRESTO.
 Key difference between Ken and PRESTO:
 PRESTO uses a SARIMA model; this model only takes into
account temporal correlations. On the other hand, Ken uses
a dynamic probabilistic model that takes into account spatial
and temporal correlations in the data.
 Since a large quantity of sensor data is correlated spatially,
and not only temporally, Ken derives advantage from such
spatiotemporal correlation.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
KEN
 Two types of entities:
 Sink (Similar to the PRESTO proxy)
 The sink is the base station to which the sensor values vij
are communicated by the source

 Source (Similar to the PRESTO sensor).


 Unlike a PRESTO sensor, which represents only a single
sensor, a source could include more than one sensor or
a sensor network.
 The source only communicates with the sink when the raw
sensor values deviate beyond a certain bound, as compared to
the predictions from the dynamic probabilistic model.
Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
GENERIC PUSH-BASED
APPROACH
 A generalised version of other push-based approaches.
 Proposed by Silberstein et al. As in other push-based
approaches, the base station and the sensor network agree on
an expected behavior, and the sensor network reports values
only when there is a substantial deviation from the agreed
behavior.
 Unlike other approaches, however, the definition of expected
behavior proposed by Silberstein et al. is more generic, and
is not limited to a threshold δ.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED SENSOR
DATA CLEANING
 Sensor data is Uncertain and Erroneous.
 Operational Difficulty:
 Sensors often operate with discharged batteries, network
failures, and imprecision. Other factors, such as low-cost
sensors, freezing or heating of the casing or measurement
device, accumulation of dirt, mechanical failure or vandalism
(from humans or animals) heavily affect the quality of the
sensor data.
 Problem:
 May cause a significant problem with respect to data
utilization, since applications using erroneous data may yield
unsound results.
Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED SENSOR
DATA CLEANING
 A system for
cleaning sensor data
generally consists of
four major
components:
 User interface,
 Stream processing
engine,
 Anomaly detector,
and
 Data storage.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED SENSOR
DATA CLEANING
 User Interface plays two roles :
 First, it takes all necessary inputs from users to perform data
cleaning, e.g., name of sensor data and parameter settings
for models.
 Second, the results of data cleaning, such as ‘dirty’ sensor
values captured by the anomaly detector, are presented
using graphs and tables, so that users can confirm whether
each candidate of such dirty values is an actual error.
 The confirmed results are then stored to (or removed from)
the underlying data storage or materialized views.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED SENSOR
DATA CLEANING
 Anomaly Detector:
 Online or Offline
 Stream Processing Engine
 Data Storage

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODELS FOR SENSOR DATA
CLEANING
 Regression Models
 Probabilistic Models
 Outlier Detection Models

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
REGRESSION MODELS
 Polynomial Regression:
 Polynomial regression finds the best-fitting curve that minimizes the total
difference between the curve and each raw sensor value vij at time ti. Given a
degree d, polynomial regression is formally defined as:

 High-degree polynomial regression approximates a given time series with more
sophisticated curves, yielding a theoretically more accurate description of the
raw sensor values. In practice, however, low-degree polynomials, such as
constant (d = 0) and linear (d = 1), also perform satisfactorily.
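The two low-degree cases can be fitted in closed form. The sketch below assumes the "total difference" is measured as the sum of squared errors (ordinary least squares); under that assumption the best constant is the mean, and the best line c + a·t has the familiar slope and intercept formulas. The sample times and values are hypothetical.

```python
def fit_constant(values):
    """Degree 0: the least-squares constant is the mean of the values."""
    return sum(values) / len(values)

def fit_linear(times, values):
    """Degree 1: closed-form least-squares intercept c and slope a."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    den = sum((t - t_mean) ** 2 for t in times)
    a = num / den
    c = v_mean - a * t_mean
    return c, a

# Hypothetical sensor values that rise by roughly 2 units per time step.
times = [0, 1, 2, 3, 4]
values = [10.0, 12.1, 13.9, 16.0, 18.0]

c0 = fit_constant(values)
c, a = fit_linear(times, values)
print(f"d = 0: v = {c0:.2f}")
print(f"d = 1: v = {c:.2f} + {a:.2f} * t")
```

For a trending series like this one, the linear fit tracks the data closely while the constant fit does not, which is the trade-off to weigh when choosing d.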

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
REGRESSION MODELS
 Chebyshev Regression:
 Chebyshev regression is a popular model class for fitting sensor values, since
it can quickly compute near-optimal approximations for a given time series.
 Suppose that the time values ti vary within a range [min(ti), max(ti)].
 We then obtain normalized time values t′i within the range [−1, 1] by using the
following transformation function f(ti) and its inverse transformation function
f−1(t′i)
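A linear map from [min(ti), max(ti)] onto [−1, 1] that preserves order is fully determined by the two endpoints, so the normalization can be sketched as below. The helper name `make_normalizer` and the sample times are hypothetical, and the notation may differ from the expression in the source.

```python
def make_normalizer(t_min, t_max):
    """Return f mapping [t_min, t_max] onto [-1, 1], together with its inverse."""
    half_range = (t_max - t_min) / 2
    midpoint = (t_max + t_min) / 2

    def f(t):
        return (t - midpoint) / half_range

    def f_inv(t_norm):
        return t_norm * half_range + midpoint

    return f, f_inv

# Hypothetical timestamps.
times = [100, 125, 150, 175, 200]
f, f_inv = make_normalizer(min(times), max(times))

normalized = [f(t) for t in times]
restored = [f_inv(x) for x in normalized]
print(normalized)
print(restored)
```

The normalization matters because Chebyshev polynomials are defined on [−1, 1]; the inverse maps fitted values back to the original time axis.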

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
REGRESSION MODELS

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
PROBABILISTIC MODELS
 In sensor data cleaning, inferring sensor values is perhaps the most
important task, since systems can then detect and clean dirty sensor
values by comparing raw sensor values with the corresponding inferred
sensor values.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
OUTLIER DETECTION
MODELS
 Outlier detection methods (Zhang et al. offer an overview):

 Deligiannakis et al. consider correlation, extended Jaccard coefficients, and regression-based approximation for model-based data cleaning.
 Shen et al. propose to use a histogram-based method to capture outliers.
 Subramaniam et al. introduce distance- and density-based metrics that can identify outliers.
 ORDEN system detects polygonal outliers using the triangulated wireframe surface model.
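To give a feel for the histogram-based idea, here is a minimal sketch: values that fall into sparsely populated bins are treated as outlier candidates. This is a simplified illustration and not the algorithm of Shen et al.; the function name, the equal-width binning, and the parameters `n_bins` and `min_count` are all assumptions.

```python
def histogram_outliers(values, n_bins, min_count):
    """Flag values that fall into equal-width bins holding fewer than min_count values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return []  # all values identical: nothing stands out
    width = (hi - lo) / n_bins

    def bin_of(v):
        # The maximum value is placed in the last bin.
        return min(int((v - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for v in values:
        counts[bin_of(v)] += 1
    return [v for v in values if counts[bin_of(v)] < min_count]

# Hypothetical temperature readings with one suspicious spike.
readings = [20.1, 20.4, 19.8, 20.0, 20.3, 19.9, 20.2, 35.0, 20.1, 19.7]
outliers = histogram_outliers(readings, n_bins=5, min_count=2)
print(outliers)
```

The result depends heavily on the bin count: too few bins hide outliers inside crowded bins, and too many make ordinary values look isolated.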

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED QUERY
PROCESSING
 In-Network Query Processing
 Model-Based Views

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED QUERY
PROCESSING

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
IN-NETWORK QUERY
PROCESSING
 First builds an overlay network (such as the SRT).
 Then, the overlay network is used for increasing the efficiency of
aggregating sensor values and processing queries.
 For instance, while processing a threshold query, parent nodes send the query to
the child nodes only when the query threshold condition overlaps with the range
of sensor values contained in the child nodes, which is stored in the parent
node’s local memory.
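The pruning behaviour for a threshold query can be sketched with a small tree in which every node records the range of values in its subtree. The class `Node`, the function `query_above`, and the topology below are hypothetical; a real overlay would maintain the ranges over the network rather than in one process.

```python
class Node:
    def __init__(self, name, value=None, children=()):
        self.name = name
        self.value = value  # reading, set for leaf sensors only
        self.children = list(children)
        # Range of values in this subtree; a parent keeps this for each child.
        vals = [c.lo for c in self.children] + [c.hi for c in self.children]
        if value is not None:
            vals.append(value)
        self.lo, self.hi = min(vals), max(vals)

def query_above(node, threshold, visited):
    """Return names of sensors with value > threshold, skipping subtrees that cannot match."""
    visited.append(node.name)
    hits = []
    if node.value is not None and node.value > threshold:
        hits.append(node.name)
    for child in node.children:
        if child.hi > threshold:  # forward the query only if the ranges overlap
            hits.extend(query_above(child, threshold, visited))
    return hits

cold = Node("cold_branch", children=[Node("s1", 15.0), Node("s2", 17.0)])
warm = Node("warm_branch", children=[Node("s3", 24.0), Node("s4", 31.0)])
root = Node("root", children=[cold, warm])

visited = []
hits = query_above(root, 30.0, visited)
print(hits)     # sensors satisfying the query
print(visited)  # nodes that actually received the query
```

Here the whole cold branch and sensor s3 never receive the query, which is where the communication saving comes from.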

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED VIEWS
 MauveDB
 Approach:
 Use standard database views as an abstraction layer for processing
queries.
 Views are maintained in the form of a regression model; thus they are
called model-based views.

 Advantage:
 The model-based view can be incrementally updated as fresh sensor
values are obtained from the sensors.
 Incremental updates are computationally efficient.
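Why incremental updates are cheap can be illustrated with a simple regression that is maintained from running sums, so that each new value costs constant time and no stored raw values are revisited. This is a sketch of the general idea only, not MauveDB's implementation: the class name is hypothetical, and a single independent variable t is assumed for brevity.

```python
class IncrementalLinearModel:
    """Least-squares line v = c + a*t maintained from running sums."""

    def __init__(self):
        self.n = 0
        self.sum_t = self.sum_v = self.sum_tt = self.sum_tv = 0.0

    def update(self, t, v):
        # Constant-time update: only five running totals change.
        self.n += 1
        self.sum_t += t
        self.sum_v += v
        self.sum_tt += t * t
        self.sum_tv += t * v

    def coefficients(self):
        # Requires at least two distinct t values, otherwise denom is zero.
        denom = self.n * self.sum_tt - self.sum_t ** 2
        a = (self.n * self.sum_tv - self.sum_t * self.sum_v) / denom
        c = (self.sum_v - a * self.sum_t) / self.n
        return c, a

    def predict(self, t):
        c, a = self.coefficients()
        return c + a * t

view = IncrementalLinearModel()
for t, v in [(0, 5.0), (1, 7.0), (2, 9.0)]:
    view.update(t, v)

before = view.predict(3)   # answer a query from the model
view.update(3, 11.0)       # a fresh sensor value arrives
after = view.coefficients()
print(before, after)
```

Queries are answered from the fitted coefficients, so the view stays usable between sensor updates.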

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED VIEWS
 Uses a model called “RegModel”
 A regression model in which the temperature is the dependent variable and the
sensor position (xj , yj ) is an independent variable.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
SENSOR DATA
COMPRESSION
 The goal of the sensor data compression system is to approximate a sensor
data stream by a set of functions.
 A standard approach to sensor data compression is to segment the data
stream into data segments, and then approximate each data segment, so
that a specific error norm is satisfied.
 Because functions are employed to approximate the data segments, only the
approximated data segments are stored in the database, instead of the raw
sensor values of the data stream.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
SENSOR DATA
COMPRESSION

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
METHODS FOR DATA
SEGMENTATION
 Piecewise linear approximation has
been the most widely used.
 Piecewise linear approximation
models the data stream with a
separate linear function per data
segment.
 Piecewise constant approximation
(PCA) approximates a data
segment with a constant value,
which can be the first value of the
segment (referred to as the cache
filter), the mean value or the
median value.
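The cache-filter variant of piecewise constant approximation can be sketched as follows. The sketch assumes a maximum-error bound `epsilon`: a new segment starts whenever a value differs from the current segment's first value by more than that bound. The function names and the sample stream are hypothetical.

```python
def cache_filter(values, epsilon):
    """Segment a stream; each segment is represented by its first value.

    A new segment starts when a value differs from the current segment's
    first value by more than epsilon (assumed maximum-error bound).
    Returns a list of (start_index, constant) pairs.
    """
    segments = []
    for i, v in enumerate(values):
        if not segments or abs(v - segments[-1][1]) > epsilon:
            segments.append((i, v))
    return segments

def reconstruct(segments, length):
    """Rebuild the approximated stream from the stored segments."""
    out = []
    for k, (start, const) in enumerate(segments):
        end = segments[k + 1][0] if k + 1 < len(segments) else length
        out.extend([const] * (end - start))
    return out

stream = [20.0, 20.1, 19.9, 20.2, 22.0, 22.1, 21.9, 25.0]
segments = cache_filter(stream, epsilon=0.5)
approx = reconstruct(segments, len(stream))
max_error = max(abs(a - b) for a, b in zip(stream, approx))
print(segments)
print(f"stored {len(segments)} segments for {len(stream)} values, max error {max_error:.2f}")
```

Only the segment start positions and constants need to be stored, and the reconstruction error never exceeds the chosen bound.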

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED SENSOR
DATA CLEANING
 Sensor data is Uncertain and Erroneous.
 Operational Difficulty:
 Sensors often operate with discharged batteries, network failures, and
imprecision. Other factors, such as low-cost sensors, freezing or heating of the
casing or measurement device, accumulation of dirt, mechanical failure or
vandalism (from humans or animals) heavily affect the quality of the sensor data.
 Problem:
 May cause a significant problem with respect to data utilization, since
applications using erroneous data may yield unsound results.
 Less reliable sensors may produce inaccurate prediction results.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
ACTIVITY
1. Lab 5 anomaly detection Notebook + NYC Taxi
2. https://towardsdatascience.com/anomaly-detection-for-dummies-15f148e559c1#:~:text=Histogram%2Dbased%20Outlier%20Detection%20(HBOS,and%20combined%20at%20the%20end
