Lecture 5

The document outlines the midterm exam format and contents for a course on model-based sensor data acquisition, cleaning, and query processing. It covers in-network data acquisition, multi-dimensional Gaussian distributions, push-based data acquisition methods, and various models for sensor data cleaning and compression. Additionally, it discusses the operational challenges of sensor data and the importance of effective data processing techniques to ensure data quality and reliability.

MODEL-BASED SENSOR DATA ACQUISITION, CLEANING & QUERY PROCESSING
Dr. Firoz Anwar
MIDTERM EXAM FORMAT
Section | Question | Task
Section A | Short Questions | Elaborate your answers. Only writing out from lecture materials will not result in good marks.
Section B | Python | Python code similar to lab activities.
Total | 20
MIDTERM EXAM CONTENTS
• Open Book Exam (subject materials such as lecture slides, labs, and articles are allowed)
• No Internet Access
• Week 1 – Week 5 Lecture + Lab Activity
DATA ACQUISITION (cont. from Week 4)
IN-NETWORK DATA ACQUISITION
• Collecting, processing, and aggregating data within the network of sensors itself, rather than sending all raw data to a central server for processing.
• Leverages the computational capabilities of the sensors or intermediate nodes to reduce data transmission, save energy, and improve efficiency.
• Key Characteristics

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
IN-NETWORK DATA
ACQUISITION
Consider a smart agriculture system with soil moisture sensors
deployed across a large field:
• Each sensor collects data about soil moisture levels.
• Instead of sending all raw data to a central server, the sensors
process the data locally.
• They might calculate the average moisture level for their region and
only transmit this summarized data to the central server.
• This reduces the amount of data transmitted, saves energy, and
allows for real-time irrigation decisions.
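The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the region names, the readings, and the 20.0 irrigation threshold are hypothetical, and the summary record (count, mean, min, max) is one possible choice of aggregate.

```python
from statistics import mean

def summarize_region(readings):
    """Aggregate raw soil-moisture readings locally into one summary record."""
    return {
        "count": len(readings),
        "mean": mean(readings),
        "min": min(readings),
        "max": max(readings),
    }

# Hypothetical raw readings (percent soil moisture) collected in each region.
raw = {
    "north": [31.0, 29.5, 30.5, 33.0],
    "south": [18.0, 17.5, 19.0, 17.5],
}

# Each region transmits one summary instead of all of its raw readings.
summaries = {region: summarize_region(values) for region, values in raw.items()}

raw_count = sum(len(values) for values in raw.values())  # readings sensed
sent_count = len(summaries)                              # records transmitted

# Hypothetical irrigation rule applied by the central server to the summaries.
IRRIGATION_THRESHOLD = 20.0
needs_irrigation = [r for r, s in summaries.items() if s["mean"] < IRRIGATION_THRESHOLD]

print(f"sensed {raw_count} readings, transmitted {sent_count} summaries")
print("regions needing irrigation:", needs_irrigation)
```

Only two records leave the field instead of eight raw readings, which is where the energy saving comes from.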
MULTI-DIMENSIONAL GAUSSIAN DISTRIBUTIONS
• Also known as multivariate normal distributions.
• An extension of the normal (Gaussian) distribution to multiple variables.
• In simple terms, it describes the probability of different values occurring for multiple interrelated continuous variables.
• Statistical models used to represent and analyse data that has multiple correlated variables.
• In sensor data processing, this approach is often used for data modelling, anomaly detection, and pattern recognition.
• Key Characteristics

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MULTI-DIMENSIONAL GAUSSIAN DISTRIBUTIONS
• Single-variable Gaussian (Normal) Distribution
• Multiple Variables (Multivariate Gaussian)
  o Probability of different combinations of these values occurring together.
• Key Components:
1. Mean vector (μ): A list of the average values of each variable.
2. Covariance matrix (Σ): A table that shows how much each pair of variables is related (positively, negatively, or not at all).
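For reference, the two cases can be written as the standard textbook densities. The symbols σ (standard deviation of the single variable) and n (number of variables) do not appear on the slides and are introduced here only for the formulas; μ and Σ are the mean vector and covariance matrix listed above.

```latex
% Single-variable Gaussian with mean \mu and variance \sigma^2
p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}
       \exp\!\left( -\frac{(x-\mu)^{2}}{2\sigma^{2}} \right)

% Multivariate Gaussian over n variables with mean vector \mu and covariance matrix \Sigma
p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^{n}\,\lvert\Sigma\rvert}}
       \exp\!\left( -\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\top}
       \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)
```

When n = 1 and Σ = σ², the second formula reduces to the first.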
MULTI-DIMENSIONAL
GAUSSIAN DISTRIBUTIONS
Consider a weather monitoring system with sensors measuring
temperature, humidity, and pressure:
• The data from these sensors can be modelled using a multi-dimensional
Gaussian distribution.
• The distribution captures the typical relationships between temperature,
humidity, and pressure.
• If a sensor reports a combination of values that is highly unlikely (e.g., very
high temperature with very high humidity), it can be flagged as an
anomaly.
• This helps in detecting faulty sensors or unusual weather conditions.
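A minimal sketch of this idea, assuming NumPy is available: fit μ and Σ from historical readings, then flag a reading whose squared Mahalanobis distance from μ is large. The historical data is synthetic, the negative temperature/humidity correlation is an assumption made for the example, and the cut-off 16.27 (roughly the 99.9% chi-square quantile for three variables) is one possible choice, not a value from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "historical" readings: temperature (C), humidity (%), pressure (hPa).
# The mean and covariance below are hypothetical.
true_mean = np.array([20.0, 60.0, 1013.0])
true_cov = np.array([
    [9.0, -12.0, 0.0],    # temperature and humidity negatively correlated
    [-12.0, 64.0, 0.0],
    [0.0, 0.0, 16.0],
])
history = rng.multivariate_normal(true_mean, true_cov, size=2000)

# Fit the model: mean vector and covariance matrix.
mu = history.mean(axis=0)
sigma = np.cov(history, rowvar=False)
sigma_inv = np.linalg.inv(sigma)

def mahalanobis_sq(reading):
    """Squared Mahalanobis distance of a reading from the fitted mean."""
    d = np.asarray(reading, dtype=float) - mu
    return float(d @ sigma_inv @ d)

THRESHOLD = 16.27  # assumed cut-off: about the 99.9% chi-square quantile, 3 dof

def is_anomaly(reading):
    return mahalanobis_sq(reading) > THRESHOLD

print(is_anomaly([20.0, 60.0, 1013.0]))  # typical combination
print(is_anomaly([32.0, 95.0, 1013.0]))  # very hot and very humid at once
```

The second reading is flagged because the combination is unlikely under the fitted covariance, not merely because each value is high on its own.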
PUSH-BASED DATA ACQUISITION
• The sensors autonomously decide when to communicate sensor values to the base station.
• If the sensor values deviate from their expected behaviour, then the sensors communicate only the deviated values to the base station.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
PUSH-BASED DATA ACQUISITION
• Techniques:
  o PRESTO
  o Ken
  o A Generic Push-Based Approach

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
PUSH-BASED DATA ACQUISITION
• Drivers:
  o A pull strategy cannot observe unusual or interesting patterns that occur between two consecutive pull requests.
  o Increasing the pull frequency to detect such patterns better increases the overall energy consumption of the system.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
PRESTO
• Known as the Predictive Storage (PRESTO) system.
• Two main components:
  o PRESTO proxies
    - Have higher computational capability and storage resources than PRESTO sensors.
    - Their task is to gather data from the PRESTO sensors and to answer queries posed by the user.
  o PRESTO sensors
    - Assumed to be battery-powered and remotely located.
    - Their task is to sense the data and transmit it to PRESTO proxies, while archiving some of it locally on flash memory.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
PRESTO
• To answer user queries, PRESTO proxies always maintain a time-series prediction model. Specifically, PRESTO maintains a seasonal ARIMA (SARIMA) model for each sensor (Eq. (2.3) in the source).

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
PRESTO
• The proxies estimate the parameters of the model given in Eq. (2.3), and then transmit these parameters to individual PRESTO sensors.
• The PRESTO sensors use these models to predict the sensor value v̂_ij, and only transmit the raw sensor value v_ij to the proxies when the absolute difference between the predicted sensor value and the raw sensor value is greater than a user-defined threshold δ.
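The transmit-on-deviation rule can be sketched as follows. This is not PRESTO's implementation: the SARIMA model is replaced by a deliberately simple stand-in predictor (`last_value`, which predicts the most recent value known to both sides), and the readings and δ = 0.5 are hypothetical. The point is only the rule "send v_ij when |v̂_ij − v_ij| > δ".

```python
def push_filter(values, predict, delta):
    """Return the (index, value) pairs a sensor would push to the proxy.

    `known` tracks what the proxy can reconstruct: a pushed raw value when
    one was sent, otherwise the prediction both sides computed.
    """
    sent = []
    known = []
    for j, v in enumerate(values):
        v_hat = predict(known)
        if v_hat is None or abs(v - v_hat) > delta:
            sent.append((j, v))   # deviation too large: transmit the raw value
            known.append(v)
        else:
            known.append(v_hat)   # prediction is good enough: stay silent
    return sent

def last_value(known):
    """Stand-in predictor (an assumption, not SARIMA): repeat the last known value."""
    return known[-1] if known else None

readings = [20.0, 20.2, 20.1, 20.4, 23.0, 23.1, 22.9, 20.0]
sent = push_filter(readings, last_value, delta=0.5)
print(sent)
print(f"transmitted {len(sent)} of {len(readings)} readings")
```

Swapping `last_value` for a better predictor reduces the number of pushes further, which is why PRESTO invests in a seasonal model.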

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
KEN
• To reduce the communication cost, the Ken framework employs a strategy similar to PRESTO's.
• Key difference between Ken and PRESTO:
  o PRESTO uses a SARIMA model; this model only takes into account temporal correlations. Ken, on the other hand, uses a dynamic probabilistic model that takes into account both spatial and temporal correlations in the data.
  o Since a large quantity of sensor data is correlated spatially, and not only temporally, Ken derives an advantage from such spatiotemporal correlation.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
KEN
• Two types of entities:
  o Sink (similar to the PRESTO proxy)
    - The sink is the base station to which the sensor values v_ij are communicated by the source.
  o Source (similar to the PRESTO sensor)
    - Unlike a PRESTO sensor, which represents only a single sensor, a source can include more than one sensor or an entire sensor network.
• The source communicates with the sink only when the raw sensor values deviate beyond a certain bound from the predictions of the dynamic probabilistic model.
Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
GENERIC PUSH-BASED APPROACH
• A generalised version of other push-based approaches, proposed by Silberstein et al.
• As in other push-based approaches, the base station and the sensor network agree on an expected behavior, and the sensor network reports values only when there is a substantial deviation from the agreed behavior.
• Unlike other approaches, however, the definition of expected behavior proposed by Silberstein et al. is more generic and is not limited to a threshold δ.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED SENSOR DATA CLEANING
• Sensor data is uncertain and erroneous.
• Operational Difficulty:
  o Sensors often operate with discharged batteries, network failures, and imprecision. Other factors, such as low-cost sensors, freezing or heating of the casing or measurement device, accumulation of dirt, mechanical failure, or vandalism (from humans or animals), heavily affect the quality of the sensor data.
• Problem:
  o May cause a significant problem with respect to data utilization, since applications using erroneous data may yield unsound results.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED SENSOR DATA CLEANING
• A system for cleaning sensor data generally consists of four major components:
  o User interface,
  o Stream processing engine,
  o Anomaly detector, and
  o Data storage.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED SENSOR DATA CLEANING
• The user interface plays two roles:
  o First, it takes all necessary inputs from users to perform data cleaning, e.g., the name of the sensor data and parameter settings for models.
  o Second, the results of data cleaning, such as ‘dirty’ sensor values captured by the anomaly detector, are presented using graphs and tables, so that users can confirm whether each candidate dirty value is an actual error.
• The confirmed results are then stored to (or removed from) the underlying data storage or materialized views.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED SENSOR DATA CLEANING
• Anomaly Detector:
  o Online or Offline
• Stream Processing Engine
• Data Storage

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODELS FOR SENSOR DATA CLEANING
• Regression Models
• Probabilistic Models
• Outlier Detection Models

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
REGRESSION MODELS
• Polynomial Regression:
  o Polynomial regression finds the best-fitting curve that minimizes the total difference between the curve and each raw sensor value v_ij at time t_i. Given a degree d, the fitted curve is a polynomial of degree d in t_i, i.e. $\hat{v}_{ij} = c_0 + c_1 t_i + \dots + c_d t_i^d$, where the coefficients $c_0, \dots, c_d$ are chosen by the fit.
  o Polynomial regression with a high degree approximates a given time series with more sophisticated curves, resulting in a theoretically more accurate description of the raw sensor values. In practice, however, low-degree polynomials, such as constant (d = 0) and linear (d = 1), also perform satisfactorily.
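As a rough illustration of regression-based cleaning, the sketch below fits a low-degree polynomial with NumPy's `polyfit` and flags readings whose residual exceeds a tolerance. The synthetic series, the injected error at index 6, and the tolerance of 2.0 are all assumptions made for the example.

```python
import numpy as np

# Hypothetical series: a linear trend with one corrupted reading.
t = np.arange(10, dtype=float)
v = 2.0 + 0.5 * t
v[6] += 5.0  # injected 'dirty' value

def fit_and_flag(t, v, degree, tol):
    """Fit a degree-`degree` polynomial and return (fitted values, dirty indices)."""
    coeffs = np.polyfit(t, v, degree)      # least-squares polynomial fit
    fitted = np.polyval(coeffs, t)         # inferred sensor values
    residuals = np.abs(v - fitted)
    return fitted, np.flatnonzero(residuals > tol)

fitted, dirty = fit_and_flag(t, v, degree=1, tol=2.0)
print("candidate dirty indices:", dirty.tolist())
```

Because ordinary least squares is itself pulled toward the corrupted value, the tolerance has to leave some slack; a robust fit would be a natural refinement but is outside what the slide covers.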

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
REGRESSION MODELS
• Chebyshev Regression:
  o Chebyshev regression is a popular model class for fitting sensor values, since it can quickly compute near-optimal approximations for a given time series.
  o Suppose that the time values t_i vary within a range [min(t_i), max(t_i)].
  o We then obtain normalized time values t′_i within the range [−1, 1] by using a transformation function f(t_i) and its inverse transformation function f^{−1}(t′_i).
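A small sketch of the normalization and fit, assuming NumPy. The functions `to_unit` and `from_unit` implement the usual linear map between [min(t_i), max(t_i)] and [−1, 1]; the slide's own formula is not reproduced in this text, so treat this form as an assumption. The fit uses `numpy.polynomial.chebyshev.chebfit`, a least-squares fit in the Chebyshev basis, on a hypothetical smooth signal.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def to_unit(t, t_min, t_max):
    """Map t from [t_min, t_max] to [-1, 1] (assumed form of f)."""
    return (2.0 * t - (t_max + t_min)) / (t_max - t_min)

def from_unit(t_prime, t_min, t_max):
    """Inverse map from [-1, 1] back to [t_min, t_max]."""
    return (t_prime * (t_max - t_min) + (t_max + t_min)) / 2.0

# Hypothetical time stamps and a smooth sensor signal.
t = np.linspace(100.0, 160.0, 61)
v = 15.0 + 3.0 * np.sin((t - 100.0) / 10.0)

t_prime = to_unit(t, t.min(), t.max())
coeffs = cheb.chebfit(t_prime, v, deg=8)   # 9 coefficients describe 61 readings
approx = cheb.chebval(t_prime, coeffs)

max_error = float(np.max(np.abs(v - approx)))
print(f"{len(coeffs)} coefficients, max abs error {max_error:.2e}")
```

Normalizing first matters because Chebyshev polynomials are defined on [−1, 1]; fitting on raw time stamps would be poorly conditioned.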

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
PROBABILISTIC MODELS
• In sensor data cleaning, inferring sensor values is perhaps the most important task, since systems can then detect and clean dirty sensor values by comparing raw sensor values with the corresponding inferred sensor values.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
OUTLIER DETECTION MODELS
• Outlier detection methods (Zhang et al. offer an overview):
  o Deligiannakis et al. consider correlation, extended Jaccard coefficients, and regression-based approximation for model-based data cleaning.
  o Shen et al. propose to use a histogram-based method to capture outliers.
  o Subramaniam et al. introduce distance- and density-based metrics that can identify outliers.
  o The ORDEN system detects polygonal outliers using the triangulated wireframe surface model.
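To give a feel for the histogram-based idea, here is a deliberately simplified sketch: bin the readings and flag values that fall into sparsely populated bins. It is not the algorithm of Shen et al. (nor HBOS); the function name, the bin count, and the 5% rarity cut-off are assumptions for illustration.

```python
import numpy as np

def histogram_outliers(values, bins=10, min_fraction=0.05):
    """Return indices of values that fall into bins holding < min_fraction of the data."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    # Bin index of every value; the inner edges give indices 0 .. bins-1.
    idx = np.clip(np.digitize(values, edges[1:-1]), 0, bins - 1)
    rare = counts[idx] < min_fraction * len(values)
    return np.flatnonzero(rare)

# Hypothetical temperature readings around 21 C with one faulty spike.
normal = [21.0 + 0.1 * ((i % 5) - 2) for i in range(40)]
readings = normal[:25] + [35.0] + normal[25:]

outliers = histogram_outliers(readings)
print("outlier indices:", outliers.tolist())
```

The result depends strongly on the bin width, which is the main tuning decision in any histogram-based detector.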

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED QUERY PROCESSING
• In-Network Query Processing
• Model-Based Views

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
IN-NETWORK QUERY PROCESSING
• First builds an overlay network, such as a semantic routing tree (SRT).
• The overlay network is then used to aggregate sensor values and process queries more efficiently.
  o For instance, while processing a threshold query, a parent node forwards the query to a child node only when the query's threshold condition overlaps with the range of sensor values contained in that child's subtree. The parent keeps this range in its local memory.
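The pruning rule can be sketched with a small tree. The `Node` class, the `build_ranges` helper, and the sample values are hypothetical; reading `child.hi` stands in for the range a parent would cache in its local memory for each child.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    value: float
    children: list = field(default_factory=list)
    lo: float = 0.0   # minimum sensor value in this node's subtree
    hi: float = 0.0   # maximum sensor value in this node's subtree

def build_ranges(node):
    """Compute the value range of every subtree (done once, bottom-up)."""
    node.lo = node.hi = node.value
    for child in node.children:
        build_ranges(child)
        node.lo = min(node.lo, child.lo)
        node.hi = max(node.hi, child.hi)

def threshold_query(node, threshold, visited):
    """Return names of nodes with value > threshold, skipping subtrees that cannot match."""
    visited.append(node.name)
    hits = [node.name] if node.value > threshold else []
    for child in node.children:
        if child.hi > threshold:          # query condition overlaps the child's range
            hits += threshold_query(child, threshold, visited)
    return hits

root = Node("root", 22.0, [
    Node("a", 21.0, [Node("a1", 20.0), Node("a2", 23.0)]),
    Node("b", 30.0, [Node("b1", 35.0), Node("b2", 28.0)]),
])
build_ranges(root)

visited = []
hits = threshold_query(root, 25.0, visited)
print("matches:", hits)
print("nodes contacted:", visited)
```

The whole subtree under "a" is never contacted, because its cached maximum (23.0) cannot satisfy the condition "value > 25".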

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED VIEWS
• MauveDB
• Approach:
  o Uses standard database views as an abstraction layer for processing queries.
  o Views are maintained in the form of a regression model; thus they are called model-based views.
• Advantage:
  o The model-based view can be incrementally updated as fresh sensor values are obtained from the sensors.
  o Incremental updates are computationally efficient.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED VIEWS
• Uses a model called “RegModel”
  o A regression model in which the temperature is the dependent variable and the sensor position (x_j, y_j) is the independent variable.
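A minimal sketch of such a view, assuming NumPy. The slides do not give the functional form of RegModel, so a plane temp ≈ a + b·x + c·y is assumed here; the sensor positions, temperatures, and the `view_temperature` helper are hypothetical.

```python
import numpy as np

# Hypothetical sensor positions (x_j, y_j) and their temperature readings.
positions = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]], dtype=float)
temps = np.array([20.0, 23.0, 19.0, 22.0, 21.0])

# Design matrix [1, x, y]; least squares gives the coefficients (a, b, c).
A = np.column_stack([np.ones(len(positions)), positions])
coef, *_ = np.linalg.lstsq(A, temps, rcond=None)

def view_temperature(x, y):
    """Query the model-based view at any position, including unsensed ones."""
    return float(coef @ np.array([1.0, x, y]))

print("coefficients (a, b, c):", np.round(coef, 3))
print("estimated temperature at (2, 8):", round(view_temperature(2.0, 8.0), 2))
```

Queries are answered from the fitted model rather than from raw readings, so the view can return a value at positions where no sensor exists.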

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
SENSOR DATA COMPRESSION
• The goal of a sensor data compression system is to approximate a sensor data stream by a set of functions.
• A standard approach to sensor data compression is to segment the data stream into data segments, and then approximate each data segment so that a specific error norm is satisfied.
• Because functions are employed to approximate the data segments, only the approximated data segments are stored in the database, instead of the raw sensor values of the data stream.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
METHODS FOR DATA SEGMENTATION
• Piecewise linear approximation has been the most widely used.
• Piecewise linear approximation models the data stream with a separate linear function per data segment.
• Piecewise constant approximation (PCA) approximates a data segment with a constant value, which can be the first value of the segment (referred to as the cache filter), the mean value, or the median value.
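The cache-filter variant of piecewise constant approximation can be sketched as below. The error bound `epsilon` (maximum absolute deviation allowed within a segment), the function names, and the sample stream are assumptions for illustration.

```python
def cache_filter(values, epsilon):
    """Segment a stream so that each value is within epsilon of its segment's first value.

    Returns a list of (start_index, constant) pairs, one per segment.
    """
    segments = []
    for i, v in enumerate(values):
        if not segments or abs(v - segments[-1][1]) > epsilon:
            segments.append((i, v))   # open a new segment at this value
    return segments

def reconstruct(segments, n):
    """Rebuild an approximate stream of length n from the stored segments."""
    out = []
    for k, (start, const) in enumerate(segments):
        end = segments[k + 1][0] if k + 1 < len(segments) else n
        out.extend([const] * (end - start))
    return out

stream = [20.0, 20.1, 20.3, 20.2, 21.5, 21.6, 21.4, 19.0]
segments = cache_filter(stream, epsilon=0.5)
approx = reconstruct(segments, len(stream))
max_error = max(abs(a - b) for a, b in zip(stream, approx))

print("stored segments:", segments)
print(f"stored {len(segments)} values instead of {len(stream)}, max error {max_error:.2f}")
```

Using the mean or median as the constant instead of the first value usually lowers the error per segment, but the segment cannot be emitted until it is complete.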

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
MODEL-BASED SENSOR DATA CLEANING
• Sensor data is uncertain and erroneous.
• Operational Difficulty:
  o Sensors often operate with discharged batteries, network failures, and imprecision. Other factors, such as low-cost sensors, freezing or heating of the casing or measurement device, accumulation of dirt, mechanical failure, or vandalism (from humans or animals), heavily affect the quality of the sensor data.
• Problem:
  o May cause a significant problem with respect to data utilization, since applications using erroneous data may yield unsound results.
  o Less reliable sensors may produce inaccurate prediction results.

Source: Charu C Aggarwal (2013). Managing and Mining Sensor Data. Springer
ACTIVITY
1. Lab 5 anomaly detection Notebook + NYC Taxi
2. https://towardsdatascience.com/anomaly-detection-for-dummies-15f148e559c1#:~:text=Histogram%2Dbased%20Outlier%20Detection%20(HBOS,and%20combined%20at%20the%20end
