
Approximate Data Aggregation

This document proposes approximate data aggregation algorithms for sensor-equipped Internet of Things (IoT) networks that answer maximum value and distinct set queries in an energy-efficient manner. It (1) introduces uniform sampling and Bernoulli sampling based algorithms that return the exact maximum and distinct set with a given probability, (2) provides mathematical proofs of that accuracy guarantee, and (3) presents simulation results showing that the algorithms consume less energy than a simple distributed algorithm.


TSINGHUA SCIENCE AND TECHNOLOGY

ISSN 1007-0214 05/14 pp44–55
DOI: 10.26599/TST.2019.9010023
Volume 25, Number 1, February 2020
IEEE Transaction on Internet of Things,Year:2020

Approximate Data Aggregation in Sensor Equipped IoT Networks

Ji Li, Madhuri Siddula, Xiuzhen Cheng, Wei Cheng, Zhi Tian, and Yingshu Li

Abstract: As Internet-of-Things (IoT) networks provide efficient ways to transfer data, they are used widely in data
sensing applications. These applications can further include wireless sensor networks. One of the critical problems
in sensor-equipped IoT networks is to design energy efficient data aggregation algorithms that address the issues
of maximum value and distinct set query. In this paper, we propose an algorithm based on uniform sampling
and Bernoulli sampling to address these issues. We have provided logical proofs to show that the proposed
algorithms return accurate results with a given probability. Simulation results show that these algorithms have high
performance compared with a simple distributed algorithm in terms of energy consumption.

Key words: data aggregation; sampling; Internet-of-Things (IoT) networks

• Ji Li is with Kennesaw State University, Marietta, GA 30060, USA. E-mail: [email protected].
• Madhuri Siddula and Yingshu Li are with Georgia State University, Atlanta, GA 30303, USA. E-mail: msiddula1@student.gsu.edu; [email protected].
• Xiuzhen Cheng is with the George Washington University, Washington, DC 20052, USA. E-mail: [email protected].
• Wei Cheng is with the Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA. E-mail: [email protected].
• Zhi Tian is with the Department of Electrical & Computer Engineering, George Mason University, Fairfax, VA 22030, USA. E-mail: [email protected].
• To whom correspondence should be addressed.
Manuscript received: 2019-05-22; accepted: 2019-05-27
© The author(s) 2020. The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

1 Introduction

As the urban population snowballs, the smart city has become inevitable to solve many day-to-day problems. These problems include power supply, disaster prediction, and traffic maintenance[1–4]. Some of the smart city applications that are already in use are parking services, intelligent light systems, and water conservation. For better utilization of natural resources, we should incorporate these applications even in rural areas. The fundamental working principle of a smart city application is that various sensors deployed all over the city are used for collecting data. This data helps us understand information at a city level, and hence the data should be well-spread.

Similar to the smart city, we are also focusing on smart home applications[5–10]. These applications are based on the fact that in today's world all home electronics are connected to the internet. A network of such connected devices is called the Internet-of-Things (IoT). Recent devices like Alexa and Google Home are built on such a network. These devices interact with all the other devices connected to the same network. Since not all devices are the same, we need a way to collect sensory data from different sensors. The primary objective of any IoT network is to reduce cost and provide faster access to data. One of the distinct challenges in these applications is the deployment of a considerable number of sensing devices.

It is clear that sensors are the building blocks of any IoT network. However, a network with sensors has some drawbacks, such as dynamic traffic, the addition of new services, adaptation to channel conditions, and ever-changing user requirements. Having self-configurable sensors helps address some of these issues. Additionally, many algorithms have been proposed to solve the issues of routing, topology control, and time synchronization[11–24]. Using sensors in IoT networks reduces the communication cost but increases the processing cost. We deploy sensors in our network because they collect data over a long period of time and could be

placed over a vast network. Hence, the data collected from these sensors is huge and requires high processing power to aggregate and analyze. If the data aggregation problem is addressed at the sensor level, we do not have to deal with such extensive data. However, adding data aggregation functionality to a sensor might consume a lot of the sensor's energy. This raises an energy consumption issue, since aggregation costs much energy and the sensors are not equipped with large power supplies. According to Ref. [25], the cost of transmitting one bit of data is equivalent to the energy cost of executing 1000 instructions. Therefore, reducing data transmission is one of the major ways to decrease the energy cost in IoT, and it is critical to design energy-efficient data aggregation methods for sensor-equipped IoT networks.

In this paper, we study two kinds of aggregation queries: the maximum query and the distinct set query. The maximum query calculates the maximum of all the sensory data. The distinct set query calculates the unique values in the sensory data. Both queries are critical for an IoT network and can be widely used in practice. For example, in the field of environmental monitoring, the maximum value query can be used to acquire the most serious level of pollution, while the distinct-set query returns all the pollution levels in the monitored area. Therefore, an energy-efficient data aggregation model should accommodate both queries.

In practice, exact query results are not always necessary. Approximate query results may also be acceptable for some applications[26, 27]. Therefore, in this paper, we propose two algorithms to process approximate maximum queries and distinct-set queries. These algorithms are based on uniform sampling and Bernoulli sampling, respectively. The proposed algorithms return the exact query results with probability not less than $1-\delta$, where $\delta$ is a real number whose value can be arbitrarily small. In summary, the main contributions of our paper are as follows:
(1) Mathematical estimators for the two aggregation operations are provided.
(2) The mathematical methods to determine the required sample size and sample probability for calculating the approximate maximum value and the approximate distinct-set are designed.
(3) Distributed algorithms for the approximate maximum value and the approximate distinct-set are provided. The energy costs of these algorithms are analyzed.
(4) Extensive simulation results are presented which show that the proposed algorithms perform significantly better than a simple distributed algorithm in terms of energy consumption.

The rest of the paper is organized as follows. Section 2 defines the problem. Section 3 provides the mathematical proofs for the $\delta$-approximate aggregation algorithms. Section 4 explains the proposed $\delta$-approximate aggregation algorithms. Section 5 shows the simulation results, and the related works are discussed in Section 6. Section 7 concludes the paper.

2 Problem Definition

Suppose we have an IoT network with $n$ sensor nodes, and $s_{ti}$ is the sensory data of node $i$ at time $t$. $S_t = \{s_{t1}, s_{t2}, \ldots, s_{tn}\}$ is the set of all the sensory data at time $t$, and $\mathrm{Dis}(S_t) = \{s^d_{t1}, s^d_{t2}, \ldots, s^d_{t|\mathrm{Dis}(S_t)|}\}$ contains the distinct values in $S_t$. For example, if $S_t = \{s_{t1}, s_{t2}, s_{t3}, s_{t4}, s_{t5}\}$ and $s_{t1} = 1$, $s_{t2} = 1$, $s_{t3} = 2$, $s_{t4} = 3$, $s_{t5} = 3$, then $\mathrm{Dis}(S_t) = \{1, 2, 3\}$.

In this paper, we address maximum and distinct set queries by performing max and distinct set operations, respectively. The definitions of these operations are as follows:
(1) The exact maximum value, denoted by $\mathrm{Max}(S_t)$, satisfies $\mathrm{Max}(S_t) = \max\{s_{ti} \in S_t \mid 1 \le i \le n\}$.
(2) The exact distinct-set of $S_t$, denoted by $\mathrm{Dis}(S_t)$, satisfies $\forall s \in S_t,\ \exists s^d \in \mathrm{Dis}(S_t),\ s = s^d$, and $\forall s^d_x, s^d_y \in \mathrm{Dis}(S_t),\ x \ne y \Rightarrow s^d_x \ne s^d_y$.

The following steps can be used to solve the max and distinct set aggregation problems exactly.
(1) Arrange all the nodes in the network in the form of an aggregation tree, where the sink node broadcasts the aggregation operation.
(2) All the nodes submit their sensory data to the sink node along the aggregation tree.
(3) The intermediate nodes in the aggregation tree calculate the partial results during the data transmission.

Although this method yields accurate aggregation results, it also leads to huge communication and computation costs. Hence, we propose a $\delta$-approximation to the results of the above aggregation operations. Let $I_t$ and $\hat{I}_t$ be the accurate and approximate aggregation results at time $t$, respectively. The definition of the $\delta$-estimator is as follows:

Definition 1 ($\delta$-estimator) For any $\delta$ $(0 \le \delta \le 1)$, $\hat{I}_t$ is called the $\delta$-estimator of $I_t$ if $\Pr(\hat{I}_t \ne I_t) \le \delta$.
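The three-step exact aggregation of Section 2 (build a tree, submit data toward the sink, merge partial results at intermediate nodes) can be sketched as below. This is only an illustrative sketch, not the authors' implementation: the dictionary-based tree, the function name `aggregate`, and the assumption that the sink (node 0 here) also holds a reading are all hypothetical. The readings reuse the example values 1, 1, 2, 3, 3 from Section 2.

```python
from typing import Dict, List, Set, Union

Partial = Union[int, Set[int]]


def aggregate(tree: Dict[int, List[int]], data: Dict[int, int],
              node: int, agg: str) -> Partial:
    """Return the partial aggregation result for the subtree rooted at `node`."""
    if agg not in ("Max", "Dis"):
        raise ValueError("agg must be 'Max' or 'Dis'")
    # Each child forwards the partial result of its own subtree.
    child_results = [aggregate(tree, data, child, agg)
                     for child in tree.get(node, [])]
    if agg == "Max":
        # Merge the node's own reading with the children's partial maxima.
        return max([data[node], *child_results])
    # Distinct set: union of the node's own reading and the children's sets.
    distinct = {data[node]}
    for result in child_results:
        distinct |= result
    return distinct


# Hypothetical 5-node aggregation tree rooted at the sink (node 0).
tree = {0: [1, 2], 1: [3, 4]}
readings = {0: 1, 1: 1, 2: 2, 3: 3, 4: 3}

exact_max = aggregate(tree, readings, 0, "Max")
exact_dis = aggregate(tree, readings, 0, "Dis")
print(exact_max, sorted(exact_dis))  # 3 [1, 2, 3]
```

Every node contributes its reading here, which is exactly the communication cost the sampling-based algorithms aim to avoid.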

According to Definition 1, the problem of computing the $\delta$-approximate maximum value and the $\delta$-approximate distinct-set is defined as follows.
Input: (1) A sensor-equipped IoT network; (2) The sensory data set $S_t$; and (3) Aggregation operator $\mathrm{Agg} \in \{\mathrm{Max}, \mathrm{Dis}\}$ and $\delta$ $(0 \le \delta \le 1)$.
Output: $\delta$-approximate aggregation result of Agg.

3 Preliminaries

In this paper, we use two sampling techniques to sample the raw data in the network: uniform sampling and Bernoulli sampling. The preliminaries of computing the $\delta$-approximate maximum value and the $\delta$-approximate distinct-set are presented in the following subsections.

3.1 Uniform sampling-based approximate aggregation

Let $u_1, u_2, \ldots, u_m$ denote $m$ simple random samples drawn with replacement from the sensory data set $S_t$, so that $U(m) = \{u_1, u_2, \ldots, u_m\}$ is a uniform sample of $S_t$ with sample size $m$. We have the following conclusions.
(1) $u_i$ and $u_j$ are independent of each other for all $1 \le i \ne j \le m$.
(2) $\Pr(u_i = s_{tj}) = \frac{1}{n}$ for any $1 \le i \le m$, $1 \le j \le n$.

Based on the above two conclusions, we have the following lemma.

Lemma 1 For any given value $x \in \mathrm{Dis}(S_t)$, we have
$$\Pr(x \notin U(m)) = \left(1 - \frac{n_x}{n}\right)^m,$$
where $n_x$ is the number of appearances of value $x$ in $S_t$.

Proof $\Pr(x \notin U(m)) = \Pr(u_1 \ne x \wedge u_2 \ne x \wedge \cdots \wedge u_m \ne x)$. Since all the samples $u_1, u_2, \ldots, u_m$ are independent of each other, we have
$$\Pr(x \notin U(m)) = \prod_{i=1}^{m} \Pr(u_i \ne x) = (\Pr(u_1 \ne x))^m.$$
Moreover, we have
$$\Pr(u_1 \ne x) = 1 - \Pr(u_1 = x) = 1 - \frac{n_x}{n}.$$
Then this lemma is proved.

To obtain the $\delta$-approximate maximum value, the mathematical estimator needs to be defined first. Let $\widehat{\mathrm{Max}}(S_t)_u$ denote the uniform sampling-based estimator of $\mathrm{Max}(S_t)$. Then $\widehat{\mathrm{Max}}(S_t)_u$ is defined as
$$\widehat{\mathrm{Max}}(S_t)_u = \mathrm{Max}(U(m)).$$
Based on Lemma 1, we have the following theorem.

Theorem 1 $\widehat{\mathrm{Max}}(S_t)_u$ is a $\delta$-estimator of $\mathrm{Max}(S_t)$ if
$$m \ge \frac{\ln \delta}{\ln\left(1 - \frac{n_{\min}}{n}\right)},$$
where $n_{\min}$ is the number of appearances of the least frequently appearing data value.

Proof Based on the condition, and since $\ln\left(1 - \frac{n_{\min}}{n}\right) < 0$, we have
$$m \ln\left(1 - \frac{n_{\min}}{n}\right) \le \ln \delta,$$
$$\left(1 - \frac{n_{\min}}{n}\right)^m \le \delta.$$
According to Lemma 1, we have
$$\Pr(\mathrm{Max}(S_t) \notin U(m)) = \left(1 - \frac{n_{\mathrm{Max}(S_t)}}{n}\right)^m,$$
where $n_{\mathrm{Max}(S_t)}$ is the number of appearances of the maximum value in $S_t$. Since $n_{\mathrm{Max}(S_t)} \ge n_{\min}$, we have
$$\Pr(\mathrm{Max}(S_t) \notin U(m)) \le \left(1 - \frac{n_{\min}}{n}\right)^m \le \delta.$$
Then this theorem is proved.

Let $\widehat{\mathrm{Dis}}(S_t)_u$ denote the uniform sampling-based estimator of the exact result $\mathrm{Dis}(S_t)$. Then $\widehat{\mathrm{Dis}}(S_t)_u$ is defined as
$$\widehat{\mathrm{Dis}}(S_t)_u = \mathrm{Dis}(U(m)).$$
Based on Lemma 1, we also have the following theorem.

Theorem 2 $\widehat{\mathrm{Dis}}(S_t)_u$ is a $\delta$-estimator of $\mathrm{Dis}(S_t)$ if
$$m \ge \frac{\ln\left(1 - (1-\delta)^{n_{\min}/n}\right)}{\ln\left(1 - \frac{n_{\min}}{n}\right)}.$$

Proof First, we have
$$\left(1 - \frac{n_{\min}}{n}\right)^m \le 1 - (1-\delta)^{n_{\min}/n},$$
$$\left(1 - \left(1 - \frac{n_{\min}}{n}\right)^m\right)^{n/n_{\min}} \ge 1 - \delta.$$
Since every distinct value appears at least $n_{\min}$ times, $|\mathrm{Dis}(S_t)| \le n/n_{\min}$, and therefore
$$1 - \prod_{i=1}^{|\mathrm{Dis}(S_t)|} \left(1 - \left(1 - \frac{n_{\min}}{n}\right)^m\right) \le \delta.$$
Let $n_{s^d_{ti}}$ denote the number of appearances of $s^d_{ti}$. Then we have
$$1 - \prod_{i=1}^{|\mathrm{Dis}(S_t)|} \left(1 - \left(1 - \frac{n_{s^d_{ti}}}{n}\right)^m\right) \le \delta,$$
since $n_{\min} \le n_{s^d_{ti}}$. Moreover, according to Lemma 1, we have
$$1 - \prod_{i=1}^{|\mathrm{Dis}(S_t)|} \left(1 - \Pr(s^d_{ti} \notin U(m))\right) \le \delta,$$
$$1 - \prod_{i=1}^{|\mathrm{Dis}(S_t)|} \Pr(s^d_{ti} \in U(m)) \le \delta,$$
$$1 - \Pr(\widehat{\mathrm{Dis}}(S_t)_u = \mathrm{Dis}(S_t)) \le \delta,$$
$$\Pr(\widehat{\mathrm{Dis}}(S_t)_u \ne \mathrm{Dis}(S_t)) \le \delta.$$
Then this theorem is proved.

3.2 Bernoulli sampling-based approximate aggregation

Let $B(q) = \{b_1, b_2, \ldots, b_{|B(q)|}\}$ denote a Bernoulli sample of data set $S_t$ with sample probability $q$. Then we have the following lemma.

Lemma 2 For any given value $x \in \mathrm{Dis}(S_t)$, we have
$$\Pr(x \notin B(q)) = (1-q)^{n_x},$$
where $n_x$ is the number of appearances of value $x$ in $S_t$.

Proof Without loss of generality, we assume $s_{t1} = s_{t2} = \cdots = s_{tn_x} = x$. Then we have $\Pr(x \notin B(q)) = \Pr(s_{t1} \notin B(q) \wedge s_{t2} \notin B(q) \wedge \cdots \wedge s_{tn_x} \notin B(q))$. Therefore, we have
$$\Pr(x \notin B(q)) = (\Pr(s_{t1} \notin B(q)))^{n_x}.$$
According to the definition of Bernoulli sampling, we have
$$\Pr(s_{t1} \notin B(q)) = 1 - \Pr(s_{t1} \in B(q)) = 1 - q.$$
Then this lemma is proved.

Let $\widehat{\mathrm{Max}}(S_t)_b$ denote the Bernoulli sampling-based estimator of the exact value $\mathrm{Max}(S_t)$. $\widehat{\mathrm{Max}}(S_t)_b$ is defined as
$$\widehat{\mathrm{Max}}(S_t)_b = \mathrm{Max}(B(q)).$$
Based on Lemma 2, we have the following theorem.

Theorem 3 $\widehat{\mathrm{Max}}(S_t)_b$ is a $\delta$-estimator of $\mathrm{Max}(S_t)$ if
$$q \ge 1 - \delta^{1/n_{\min}}.$$

Proof Based on the condition, we have
$$(1-q)^{n_{\min}} \le \delta.$$
According to Lemma 2, we have
$$\Pr(\mathrm{Max}(S_t) \notin B(q)) = (1-q)^{n_{\mathrm{Max}(S_t)}},$$
where $n_{\mathrm{Max}(S_t)}$ is the number of appearances of the maximum value in $S_t$. Since $n_{\mathrm{Max}(S_t)} \ge n_{\min}$, we have
$$\Pr(\mathrm{Max}(S_t) \notin B(q)) \le (1-q)^{n_{\min}} \le \delta.$$
Then this theorem is proved.

Let $\widehat{\mathrm{Dis}}(S_t)_b$ denote the Bernoulli sampling-based estimator of the exact result $\mathrm{Dis}(S_t)$. Then $\widehat{\mathrm{Dis}}(S_t)_b$ is defined as
$$\widehat{\mathrm{Dis}}(S_t)_b = \mathrm{Dis}(B(q)).$$
Based on Lemma 2, we have the following theorem.

Theorem 4 $\widehat{\mathrm{Dis}}(S_t)_b$ is a $\delta$-estimator of $\mathrm{Dis}(S_t)$ if
$$q \ge 1 - \left(1 - (1-\delta)^{n_{\min}/n}\right)^{1/n_{\min}}.$$

Proof According to the condition, we have
$$(1-q)^{n_{\min}} \le 1 - (1-\delta)^{n_{\min}/n},$$
$$\left(1 - (1-q)^{n_{\min}}\right)^{n/n_{\min}} \ge 1 - \delta.$$
Since $|\mathrm{Dis}(S_t)| \le n/n_{\min}$, it follows that
$$1 - \prod_{i=1}^{|\mathrm{Dis}(S_t)|} \left(1 - (1-q)^{n_{\min}}\right) \le \delta.$$
Let $n_{s^d_{ti}}$ denote the number of appearances of $s^d_{ti}$. Since $n_{\min} \le n_{s^d_{ti}}$, we have
$$1 - \prod_{i=1}^{|\mathrm{Dis}(S_t)|} \left(1 - (1-q)^{n_{s^d_{ti}}}\right) \le \delta.$$
Moreover, according to Lemma 2, we have
$$1 - \prod_{i=1}^{|\mathrm{Dis}(S_t)|} \left(1 - \Pr(s^d_{ti} \notin B(q))\right) \le \delta,$$
$$1 - \prod_{i=1}^{|\mathrm{Dis}(S_t)|} \Pr(s^d_{ti} \in B(q)) \le \delta,$$
$$1 - \Pr(\widehat{\mathrm{Dis}}(S_t)_b = \mathrm{Dis}(S_t)) \le \delta,$$
$$\Pr(\widehat{\mathrm{Dis}}(S_t)_b \ne \mathrm{Dis}(S_t)) \le \delta.$$
Then this theorem is proved.

4 $\delta$-Approximate Aggregation Algorithm

The theorems in Section 3 give the required sample size and sample probability. However, we still need to address the following problems:
(1) Broadcasting the sampling information by the sink node to the whole network.
(2) Sampling the sensory data.
(3) Transmission and aggregation of the partial aggregation results.

4.1 Uniform sampling-based aggregation algorithm

A naive method to draw a sample of size $m$ can be described as follows:
(1) The sink node generates $m$ random numbers from $\{1, 2, 3, \ldots, n\}$ and broadcasts them into the network.
(2) A sensor node identifies itself by the random numbers sent by the sink node and, if selected, submits its sensory data, so that the sink receives the sampled sensory data.

This procedure incurs a huge energy cost because the broadcast information is transmitted throughout the network. Hence, we need a mechanism to reduce the energy cost of broadcasting. To this end, we divide the network into $k$ disjoint clusters $\{C_1, C_2, \ldots, C_k\}$. By using the method proposed in Ref. [28], we organize the cluster heads in the network as a minimum hop-count spanning tree that has the sink node as the root. We then perform the uniform sampling algorithm proposed in Ref. [29]. We describe the algorithm as follows:
(1) The sink node generates random numbers $Y_i$ with the probability $\Pr(Y_i = l) = \frac{|C_l|}{n}$ $(1 \le i \le m)$.
(2) Let $m_l$ be the sample size of $C_l$. Then $m_l$ is calculated by $m_l = |\{Y_i \mid Y_i = l\}|$.
(3) The sink node sends the sample sizes $\{m_l \mid 1 \le l \le k\}$ to the cluster heads. Each cluster head samples the sensory data in its own cluster using the above naive sampling algorithm.

If the sensory data received by the $l$-th cluster head is $U(m_l)$, it then calculates the partial aggregation result $R(U(m_l))$ based on the aggregation operation Agg by using the following method:
$$R(U(m_l)) = \begin{cases} \mathrm{Max}(U(m_l)), & \text{if } \mathrm{Agg} = \mathrm{Max}; \\ \mathrm{Dis}(U(m_l)), & \text{otherwise}. \end{cases}$$
Then $R(U(m_l))$ is transmitted along the spanning tree. To further reduce the transmission cost, the intermediate nodes also aggregate the received partial results while transmitting the sensory data. The whole process is explained in Algorithm 1.

Algorithm 1 Uniform sampling-based aggregation algorithm
Input: $\delta$, aggregation operator $\mathrm{Agg} \in \{\mathrm{Max}, \mathrm{Dis}\}$
Output: $\delta$-approximate aggregation results
    if Agg = Max then
        $m = \left\lceil \frac{\ln \delta}{\ln(1 - n_{\min}/n)} \right\rceil$
    else
        $m = \left\lceil \frac{\ln(1 - (1-\delta)^{n_{\min}/n})}{\ln(1 - n_{\min}/n)} \right\rceil$
    end if
    generate $Y_i$ following $\Pr(Y_i = l) = \frac{|C_l|}{n}$, $m_l = |\{Y_i \mid Y_i = l\}|$ $(1 \le i \le m,\ 1 \le l \le k)$; the sink sends $m_l$ to each cluster head by multi-hop communication
    for each cluster head of the clusters $C_l$ $(1 \le l \le k)$ do
        generate random numbers $k_1, k_2, \ldots, k_{m_l}$, then broadcast them inside the cluster
    end for
    for each cluster member of $C_l$ $(1 \le l \le k)$ do
        send sensory value to cluster head if its id belongs to $\{k_1, k_2, \ldots, k_{m_l}\}$
    end for
    for each cluster head of the clusters $C_l$ $(1 \le l \le k)$ do
        receive sample data $U(m_l)$ and calculate partial result $R(U(m_l))$
    end for
    for each node $j$ in the spanning tree do
        if $j$ is a leaf node then
            send $R_j$ to its parent node
        else
            receive partial results $R_{j1}, R_{j2}, \ldots, R_{jc}$ from its children
            if Agg = Max then
                $R_j = \max(R_{j1}, R_{j2}, \ldots, R_{jc})$
            else
                $R_j = \bigcup_{i=1}^{c} R_{ji}$
            end if
            if $j$ is the sink node then
                return $R_j$
            else
                send $R_j$ to its parent node
            end if
        end if
    end for

According to the content in Section 3.1, we have
$$m = \begin{cases} \left\lceil \dfrac{\ln \delta}{\ln\left(1 - \frac{n_{\min}}{n}\right)} \right\rceil, & \text{if } \mathrm{Agg} = \mathrm{Max}; \\[2ex] \left\lceil \dfrac{\ln\left(1 - (1-\delta)^{n_{\min}/n}\right)}{\ln\left(1 - \frac{n_{\min}}{n}\right)} \right\rceil, & \text{if } \mathrm{Agg} = \mathrm{Dis}. \end{cases}$$
Therefore, we have
$$m = \begin{cases} O\left(\ln \dfrac{1}{\delta}\right), & \text{if } \mathrm{Agg} = \mathrm{Max}; \\[2ex] O\left(\ln \dfrac{1}{1 - (1-\delta)^{n_{\min}/n}}\right), & \text{if } \mathrm{Agg} = \mathrm{Dis}. \end{cases}$$
In practice, $|R_j|$ can be regarded as a constant. According to Ref. [29], the communication cost and the energy cost of the above algorithm are $O\left(\ln \frac{1}{\delta}\right)$ if Agg = Max, while the cost is $O\left(\ln \frac{1}{1 - (1-\delta)^{n_{\min}/n}}\right)$ if Agg = Dis. In practice, the value of $n_{\min}$ can be acquired from background knowledge of the specific application. For example, in the field of environmental monitoring, the user can obtain the value of $n_{\min}$ from historical data.

4.2 Bernoulli sampling-based aggregation algorithm

Unlike the uniform sampling-based aggregation algorithm, the sampling information of the Bernoulli sampling-based aggregation algorithm consists only of the sampling probability $q$. Additionally, the Bernoulli-based method lets each node in the network do the sampling independently. Therefore, the following steps are used in the Bernoulli sampling-based aggregation algorithm to perform sampling, and the network need not be divided into clusters.
(1) The sink node broadcasts the sampling probability $q$ in the network.
(2) Each node generates a random number rand in the range [0, 1] and submits its sensory data to its parent node if rand < $q$.

When the intermediate nodes in the spanning tree receive the submitted sensory data, they calculate the partial aggregation results using a method similar to the one introduced in Section 4.1. These nodes then transmit the
partial results along the spanning tree. Similarly, while the partial aggregation results are transmitted to the sink node along the spanning tree, the intermediate nodes in the spanning tree aggregate the received partial results. The process mentioned above is explained in detail in Algorithm 2.

Algorithm 2 Bernoulli sampling-based aggregation algorithm
Input: $\delta$, aggregation operator $\mathrm{Agg} \in \{\mathrm{Max}, \mathrm{Dis}\}$
Output: $\delta$-approximate aggregation results
    if Agg = Max then
        $q = 1 - \delta^{1/n_{\min}}$
    else
        $q = 1 - \left(1 - (1-\delta)^{n_{\min}/n}\right)^{1/n_{\min}}$
    end if
    sink node broadcasts $q$ in the network
    for each leaf node $j$ in the spanning tree do
        if rand < $q$ then
            send its own sensory data to its parent node
        end if
    end for
    for each non-leaf node $j$ in the spanning tree do
        receive partial results $R_{j1}, R_{j2}, \ldots, R_{jc}$ from its children
        if Agg = Max then
            $R_j = \max(R_{j1}, R_{j2}, \ldots, R_{jc})$
        else
            $R_j = \bigcup_{i=1}^{c} R_{ji}$
        end if
        if rand < $q$ then
            if Agg = Max then
                $R_j = \max(R_j, j.\mathrm{data})$
            else
                $R_j = R_j \cup \{j.\mathrm{data}\}$
            end if
        end if
        if $j$ is the sink node then
            return $R_j$
        else
            send $R_j$ to its parent node
        end if
    end for

According to the analysis in Section 3.2, for the sample probability $q$, we have
$$q = \begin{cases} 1 - \delta^{1/n_{\min}}, & \text{if } \mathrm{Agg} = \mathrm{Max}; \\ 1 - \left(1 - (1-\delta)^{n_{\min}/n}\right)^{1/n_{\min}}, & \text{if } \mathrm{Agg} = \mathrm{Dis}. \end{cases}$$
Similarly, the communication cost and the energy cost of the Bernoulli sampling-based $\delta$-approximate aggregation algorithm are $O\left(n - n\,\delta^{1/n_{\min}}\right)$ if Agg = Max, while the cost is $O\left(n - n\left(1 - (1-\delta)^{n_{\min}/n}\right)^{1/n_{\min}}\right)$ if Agg = Dis.

5 Simulation Results

In order to evaluate the proposed algorithms, we simulated an IoT network with 1000 nodes. All nodes are randomly distributed in a 300 m × 300 m rectangular region, and the sink node is in the center of the region. The following steps are used to define the clusters.
(1) Divide the region into 10 × 10 grids.
(2) Group the nodes in the same grid into one cluster.
(3) The cluster head is randomly chosen.

For each node, the energy cost to send and receive one byte is defined as 0.0144 mJ and 0.0057 mJ, respectively[30]. The communication range of each sensor node is set to be $30\sqrt{2}$ m in our simulation[31]. With these simulation settings, we ensure that each sensor node is at a one-hop distance from its corresponding cluster head.

5.1 Uniform sampling-based aggregation algorithm

The first group of simulations studies the relationship between $\delta$ and the sample size. The results are shown in Fig. 1. The results for both the maximum value aggregation and the distinct-set aggregation are listed. Additionally, two groups of results with different $\frac{n}{n_{\min}}$ are listed for comparison. These results indicate that the sample size increases as $\delta$ declines. Moreover, the sample sizes are much smaller than the network size. For example, if we have $\delta = 0.01$ and $\frac{n}{n_{\min}} = 15$, the sample size is about 67, which indicates that we only need to sample 6.7% of the sensory data to guarantee that the estimated maximum value equals the actual maximum value with probability at least 99%. Hence, the proposed algorithm based on uniform sampling saves a tremendous amount of energy, as only a small amount of sensory data is sampled. Additionally, under the same conditions, the sample size for the distinct-set aggregation is greater than the sample size for the maximum value aggregation, because the distinct-set aggregation must ensure that all distinct values are sampled.

The second group of simulations studies the relationship between $\delta$ and the energy cost. The results are listed in Fig. 2. These results indicate that the energy cost increases as $\delta$ decreases. It can be observed that the energy cost for the distinct-set aggregation is higher than that of the maximum value aggregation, as the distinct-set aggregation requires a larger sample size.
Fig. 1 Relationship between $\delta$ and the sample size: (a) maximum value aggregation; (b) distinct-set aggregation. (x-axis: $\delta$ from 0.01 to 0.10; y-axis: sample size; curves for $n/n_{\min} = 30$ and $n/n_{\min} = 15$.)

Fig. 2 Relationship between $\delta$ and the energy cost for the uniform sampling-based aggregation algorithm: (a) maximum value aggregation; (b) distinct-set aggregation. (x-axis: $\delta$ from 0.01 to 0.10; y-axis: energy cost (mJ/Byte); curves for $n/n_{\min} = 30$ and $n/n_{\min} = 15$.)

The third group of simulations compares the energy cost of the uniform sampling-based aggregation algorithm with that of the simple distributed algorithm. The steps of the simple distributed algorithm are as follows.
(1) Collect all the raw sensory data.
(2) Aggregate the partial results during the transmission.

The simple distributed algorithm always returns accurate aggregation results. For the uniform sampling-based aggregation algorithm, we set $\delta = 0.1$ and $\frac{n}{n_{\min}} = 15$, and the network size changes from 500 to 1500. The results are listed in Fig. 3. We can see that for both algorithms, the energy cost increases with the network size. Moreover, the energy cost of the uniform sampling-based aggregation algorithm is much lower than that of the simple distributed algorithm, since only a small number of nodes need to transmit their sensory data. These results indicate that the uniform sampling-based aggregation algorithm performs much better in energy cost, although it may return wrong aggregation results with probability at most $\delta$. Finally, as the network size increases, the energy cost of the simple distributed algorithm grows rapidly, while the energy cost of the uniform sampling-based aggregation algorithm remains almost the same. That is because the sample size required by the uniform sampling algorithm depends on the values of $\delta$ and $\frac{n_{\min}}{n}$ rather than on the network size $n$ itself. This phenomenon also indicates that the uniform sampling algorithm is appropriate for large scale networks, which is verified by the results shown in Fig. 4.

5.2 Bernoulli sampling-based aggregation algorithm

The first group of simulations is about the relationship between $\delta$ and the sample probability. The results are presented in Fig. 5. The results show that the
sample probability increases as $\delta$ declines. Moreover, the sample probabilities are much smaller than 1. For example, when $\delta = 0.01$, the sample probability is about 0.066 for deriving the $\delta$-approximate maximum value. Therefore, our Bernoulli sampling-based algorithm also saves a great deal of energy. Similarly, the required sample probability for the distinct-set aggregation is greater than that of the maximum value aggregation under the same conditions.

Fig. 3 Energy cost comparison between the uniform sampling-based aggregation algorithm and the simple distributed algorithm: (a) maximum value aggregation; (b) distinct-set aggregation. (x-axis: network size from 500 to 1500; y-axis: energy cost (mJ/Byte).)

Fig. 4 Energy cost comparison between the uniform sampling-based aggregation algorithm and the Bernoulli sampling-based aggregation algorithm: (a) maximum value aggregation; (b) distinct-set aggregation. (x-axis: network size; y-axis: energy cost (mJ/Byte).)

The second group of simulations is about the relationship between $\delta$ and the energy cost. The results are shown in Fig. 6. Similarly, we can see that the energy cost increases as $\delta$ declines, and the energy cost for the distinct-set aggregation is greater than that of the maximum value aggregation.

The third group of simulations compares the energy cost of the Bernoulli sampling-based aggregation algorithm with that of the simple distributed algorithm. For the Bernoulli sampling-based aggregation algorithm, we set $\delta = 0.1$ and $n_{\min} = 67$. The network size varies from 500 to 1500. The results are listed in Fig. 7. Similarly, we can see that for the same network size, the energy cost of the Bernoulli sampling-based aggregation algorithm is much lower than that of the simple distributed algorithm, which indicates that the Bernoulli sampling-based aggregation algorithm performs well in terms of energy consumption. Moreover, we can also see that the Bernoulli sampling-based aggregation algorithm performs even better on large scale networks.

The fourth group of simulations compares the energy cost of the Bernoulli sampling-based aggregation algorithm with that of the uniform sampling-based aggregation algorithm. We set $\delta = 0.1$ and $\frac{n}{n_{\min}} = 15$.
In order to ensure network connectivity when the network size is small, we set each node's communication range to 60 m for this group of simulations. The results are shown in Fig. 4. We can see that for both the uniform sampling-based aggregation algorithm and the Bernoulli sampling-based aggregation algorithm, the energy cost increases with the network size. Moreover, the Bernoulli sampling-based aggregation algorithm has lower energy cost when the network size is small, while the uniform sampling-based aggregation algorithm has lower energy cost when the network size is large. From the above results, we can see that the Bernoulli sampling-based aggregation algorithm has the following advantages.
(1) The Bernoulli sampling-based aggregation algorithm can be used in unclustered networks.
(2) The Bernoulli sampling-based aggregation algorithm has lower energy cost in small scale networks.
On the other hand, the uniform sampling algorithm is appropriate for large scale clustered networks.

Fig. 5 Relationship between $\delta$ and the sample probability: (a) maximum value aggregation; (b) distinct-set aggregation. (x-axis: $\delta$ from 0.01 to 0.10; y-axis: sample probability; curves for $n_{\min} = 34$ and $n_{\min} = 67$.)

Fig. 6 Relationship between $\delta$ and the energy cost for the Bernoulli sampling-based aggregation algorithm: (a) maximum value aggregation; (b) distinct-set aggregation. (x-axis: $\delta$ from 0.01 to 0.10; y-axis: energy cost (mJ/Byte); curves for $n_{\min} = 34$ and $n_{\min} = 67$.)

6 Related Work

The sampling technique has been widely used in tasks such as quantile calculation, data collection, and top-k queries. For example, Ref. [32] presents an approximate algorithm to calculate quantiles in wireless sensor networks. By using the sampling technique, Ref. [33] develops ASAP, an adaptive sampling-based method for energy-efficient data collection in sensor networks. Reference [34] uses samples of past sensory data to define the problem of optimizing approximate top-k queries. However, none of these techniques can be used for our problem, because these operations differ considerably from the maximum query and the distinct-set query.

The distinct-count query has been widely studied in many works, such as Refs. [35, 36]. Reference
[35] introduces an algorithm to calculate approximate distinct-count based on approximate frequency query results. Reference [37] is about range count queries in big IoT data. Reference [36] is about an algorithm to compute the approximate distinct-count. However, this algorithm is centralized and not appropriate for IoT networks. Moreover, all these works are for the distinct-count query, which is about the size of the distinct set rather than the content of the distinct set. Therefore, the above works still cannot be used in our problem.
Fig. 7 Energy cost comparison between the Bernoulli sampling-based aggregation algorithm and simple distributed algorithm. Panels: (a) Maximum value aggregation; (b) Distinct-set aggregation. Horizontal axis: network size (500 to 1500); vertical axis: energy cost (mJ/Byte); curves for the simple distributed algorithm and the Bernoulli sampling algorithm.
7 Conclusion
In this paper, the approximate algorithms for the maximum value aggregation and distinct-set aggregation operations in sensor equipped IoT networks are proposed. These algorithms are based on the uniform sampling and Bernoulli sampling, respectively. We have proposed mathematical estimators for the two algorithms. Moreover, we have derived the values for the required sample size and the required sample probability for any given δ. Finally, an algorithm based on uniform sampling and an algorithm based on Bernoulli sampling are provided. Simulation results are shown for various δ values and the network sizes. These simulation results indicate that the proposed algorithms have high performance in terms of the energy cost.
Acknowledgment
This work was partly supported by the National Science Foundation (NSF) (Nos. 1741277, 1741287, 1741279, 1851197, and 1741338).
References
[1] Z. P. Cai, X. Zheng, and J. G. Yu, A differential-private framework for urban traffic flows estimation via taxi companies, IEEE Trans. Ind. Inf., doi: 10.1109/TII.2019.2911697.
[2] Z. P. Cai and X. Zheng, A private and efficient mechanism for data uploading in smart cyber-physical systems, IEEE Trans. Netw. Sci. Eng., doi: 10.1109/TNSE.2018.2830307.
[3] X. Zheng, Z. P. Cai, and Y. S. Li, Data linkage in smart internet of things systems: A consideration from a privacy perspective, IEEE Commun. Mag., vol. 56, no. 9, pp. 55–61, 2018.
[4] Y. Liang, Z. P. Cai, J. G. Yu, Q. L. Han, and Y. S. Li, Deep learning based inference of private information using embedded sensors in smart devices, IEEE Netw. Mag., vol. 32, no. 4, pp. 8–14, 2018.
[5] Y. Huo, C. Q. Hu, X. W. Qi, and T. Jing, LoDPD: A location difference-based proximity detection protocol for fog computing, IEEE Internet Things J., vol. 4, no. 5, pp. 1117–1124, 2017.
[6] Y. Huo, C. T. Yong, and Y. F. Lu, Re-ADP: Real-time data aggregation with adaptive ω-event differential privacy for fog computing, Wirel. Commun. Mobile Comput., vol. 2018, pp. 1–13, 2018.
[7] Y. K. Wen, Y. Huo, L. R. Ma, T. Jing, and Q. H. Gao, A scheme for trustworthy friendly jammer selection in cooperative cognitive radio networks, IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3500–3512, 2019.
[8] Y. Huo, M. Xu, X. Fan, and T. Jing, A novel secure relay selection strategy for energy-harvesting-enabled internet of things, EURASIP J. Wirel. Comm., vol. 2018, p. 264, 2018.
[9] Y. Q. Jia, Y. Chen, X. S. Dong, P. Saxena, J. Mao, and Z. K. Liang, Man-in-the-browser-cache: Persisting https attacks via browser cache poisoning, Comput. Secur., vol. 55, pp. 62–80, 2015.
[10] J. Mao, S. S. Zhu, J. D. Bian, Q. X. Lin, and J. W. Liu, Anomalous power-usage behavior detection from smart home wireless communications, J. Commun. Inf. Netw., vol. 4, no. 1, pp. 13–23, 2019.
[11] C. Schurgers and M. B. Srivastava, Energy efficient routing in wireless sensor networks, in 2001 MILCOM Proc. Communications for Network-Centric Operations: Creating the Information Force, McLean, VA, USA, 2001, pp. 357–361.
[12] S. Y. Cheng, Z. P. Cai, J. Z. Li, and H. Gao, Extracting kernel dataset from big sensory data in wireless sensor networks, IEEE Trans. Knowl. Data Eng., vol. 29, no. 4, pp. 813–827, 2017.
[13] S. Y. Cheng, Z. P. Cai, J. Z. Li, and X. L. Fang, Drawing dominant dataset from big sensory data in wireless sensor networks, in Proc. 2015 IEEE Conf. Computer Communications, Kowloon, China, 2015, pp. 531–539.
[14] S. Y. Cheng, Z. P. Cai, and J. Z. Li, Curve query processing in wireless sensor networks, IEEE Trans. Veh. Technol., vol. 64, no. 11, pp. 5198–5209, 2015.
[15] S. Y. Cheng, J. Z. Li, and Z. P. Cai, O(ε)-approximation to physical world by sensor networks, in Proc. 32nd Ann. IEEE Int. Conf. Computer Communications, Turin, Italy, 2013, pp. 3084–3092.
[16] Z. B. He, Z. P. Cai, S. Y. Cheng, and X. M. Wang, Approximate aggregation for tracking quantiles and range countings in wireless sensor networks, Theor. Comput. Sci., vol. 607, pp. 381–390, 2015.
[17] X. Zheng and Z. P. Cai, Real-time big data delivery in wireless networks: A case study on video delivery, IEEE Trans. Ind. Inf., vol. 13, no. 4, pp. 2048–2057, 2017.
[18] X. Zheng, Z. P. Cai, J. Z. Li, and H. Gao, A study on application-aware scheduling in wireless networks, IEEE Trans. Mobile Comput., vol. 16, no. 7, pp. 1787–1801, 2017.
[19] J. G. Yu, Q. B. Zhang, D. X. Yu, C. C. Chen, and G. H. Wang, Domatic partition in homogeneous wireless sensor networks, J. Netw. Comput. Appl., vol. 37, pp. 186–193, 2014.
[20] J. G. Yu, X. L. Ning, Y. C. Sun, S. L. Wang, and Y. W. Wang, Constructing a self-stabilizing CDS with bounded diameter in wireless networks under SINR, in Proc. IEEE INFOCOM 2017-IEEE Conf. Computer Communications, Atlanta, GA, USA, 2017, pp. 1–9.
[21] J. G. Yu, B. G. Huang, X. Z. Cheng, and M. Atiquzzaman, Shortest link scheduling algorithms in wireless networks under the SINR model, IEEE Trans. Veh. Technol., vol. 66, no. 3, pp. 2643–2657, 2017.
[22] S. L. Wang, X. Wang, X. Z. Cheng, J. H. Huang, R. F. Bie, and F. Zhao, Fundamental analysis on data dissemination in mobile opportunistic networks with levy mobility, IEEE Trans. Veh. Technol., vol. 66, no. 5, pp. 4173–4187, 2017.
[23] Y. Wang, Topology control for wireless sensor networks, in Wireless Sensor Networks and Applications, Y. S. Li, M. T. Thai, and W. L. Wu, eds. Springer, 2008, pp. 113–147.
[24] J. Elson and D. Estrin, Time synchronization for wireless sensor networks, in Proc. 15th Int. Parallel and Distributed Processing Symp., San Francisco, CA, USA, 2001.
[25] J. B. Li and J. Z. Li, Data sampling control, compression and query in sensor networks, Int. J. Sens. Netw., vol. 2, nos. 1&2, pp. 53–61, 2007.
[26] J. Considine, F. Li, G. Kollios, and J. Byers, Approximate aggregation techniques for sensor databases, in Proc. 20th Int. Conf. Data Engineering, Boston, MA, USA, 2004, pp. 449–460.
[27] G. Hartl and B. C. Li, infer: A Bayesian inference approach towards energy efficient data collection in dense sensor networks, in Proc. 25th IEEE Int. Conf. Distributed Computing Systems, Columbus, OH, USA, 2005, pp. 371–380.
[28] R. Lachowski, M. E. Pellenz, M. C. Penna, E. Jamhour, and R. D. Souza, An efficient distributed algorithm for constructing spanning trees in wireless sensor networks, Sensors, vol. 15, no. 1, pp. 1518–1536, 2015.
[29] S. Y. Cheng and J. Z. Li, Sampling based (epsilon, delta)-approximate aggregation algorithm in sensor networks, in Proc. 29th IEEE Int. Conf. Distributed Computing Systems, Montreal, Canada, 2009, pp. 273–280.
[30] Crossbow, MPR-Mote Processor Radio Board User's Manual. San Jose, CA, USA: Crossbow Technology Inc, 2003.
[31] G. Anastasi, A. Falchi, A. Passarella, M. Conti, and E. Gregori, Performance measurements of motes sensor networks, in Proc. 7th ACM Int. Symp. Modeling, Analysis and Simulation of Wireless and Mobile Systems, Venice, Italy, 2004, pp. 174–181.
[32] Z. F. Huang, L. Wang, K. Yi, and Y. H. Liu, Sampling based algorithms for quantile computation in sensor networks, in Proc. 2011 ACM SIGMOD Int. Conf. Management of Data, Athens, Greece, 2011, pp. 745–756.
[33] B. Gedik, L. Liu, and P. S. Yu, ASAP: An adaptive sampling approach to data collection in sensor networks, IEEE Trans. Parallel Distrib. Syst., vol. 18, no. 12, pp. 1766–1783, 2007.
[34] A. S. Silberstein, R. Braynard, C. Ellis, K. Munagala, and J. Yang, A sampling-based approach to optimizing top-k queries in sensor networks, in Proc. 22nd Int. Conf. Data Engineering, Atlanta, GA, USA, 2006, p. 68.
[35] J. Li, S. Y. Cheng, Z. P. Cai, J. G. Yu, C. K. Wang, and Y. S. Li, Approximate holistic aggregation in wireless sensor networks, ACM Trans. Sens. Netw., vol. 13, no. 2, p. 11, 2017.
[36] K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla, On synopses for distinct-value estimation under multiset operations, in Proc. 2007 ACM SIGMOD Int. Conf. Management of Data, Beijing, China, 2007, pp. 199–210.
[37] Z. Cai and Z. He, Trading private range counting over big IoT data, in Proc. 39th IEEE Int. Conf. Distributed Computing Systems, Dallas, TX, USA, 2019.
Ji Li received the BS degree from Heilongjiang University, China in 2012, and the PhD degree from Georgia State University in 2018. He is currently an assistant professor in the College of Computing and Software Engineering at Kennesaw State University. His research focuses on mobile crowdsensing and big data management in IoT networks.

Madhuri Siddula received the BS degree from Osmania University, India in 2010 and the MS degree from Indraprastha Institute of Information Technology, India in 2012. She is currently a PhD student at Georgia State University, USA. Her research interests include social networks, privacy and security, and big data mining.

Xiuzhen Cheng received the MS and PhD degrees from University of Minnesota Twin Cities, Minneapolis, MN, USA, in 2000 and 2002, respectively. She is a professor with the Department of Computer Science, the George Washington University, Washington, DC, USA. She was a program director for the National Science Foundation from April to October in 2006 (full time) and from April 2008 to May 2010 (part time). She has published more than 170 peer-reviewed papers. Her current research interests include privacy-aware computing, wireless and mobile security, dynamic spectrum access, mobile handset networking systems (mobile health and safety), cognitive radio networks, and algorithm design and analysis. She has served on the Editorial Board of several technical journals (e.g., IEEE Transactions on Parallel and Distributed Systems and IEEE Wireless Communications) and the Technical Program Committee of various professional conferences/workshops (e.g., IEEE Conference on Computer Communications, IEEE International Conference on Distributed Computing Systems, IEEE International Conference on Communications, and IEEE/ACM International Symposium on Quality of Service). She has also chaired several international conferences (e.g., IEEE Conference on Communications and Network Security, and International Conference on Wireless Algorithms, Systems, and Applications).

Wei Cheng received the BS and MS degrees from the National University of Defense Technology, Changsha, China, in 2002 and 2004, respectively, and the PhD degree from the George Washington University, Washington, DC, USA, in 2010. He is currently an assistant professor with Virginia Commonwealth University, Richmond, VA, USA. He was a post-doctoral scholar with University of California, Davis, CA, USA. His current research interests include wireless networks, cyber-physical networking systems, and algorithm design and analysis. In particular, he is interested in localization, security, fog computing, and smart cities. He is a member of the ACM.

Zhi Tian has been a professor in the Electrical and Computer Engineering Department of George Mason University, Fairfax, VA, USA, since 2015. Prior to that, she was on the faculty of Michigan Technological University from 2000 to 2014. Her research interests lie in statistical signal processing, wireless communications, and wireless sensor networks. She is an IEEE fellow. She is an elected member of the IEEE Signal Processing for Communications and Networking Technical Committee and a member of the Big Data Special Interest Group of the IEEE Signal Processing Society. She served as an associate editor for IEEE Transactions on Wireless Communications and IEEE Transactions on Signal Processing. She was a distinguished lecturer of the IEEE Vehicular Technology Society from 2013 to 2017 and of the IEEE Communications Society from 2015 to 2016.

Yingshu Li received the BS degree from the Department of Computer Science and Engineering, Beijing Institute of Technology, Beijing, China in 2001, and the MS and PhD degrees from the Department of Computer Science and Engineering, University of Minnesota Twin Cities, Minneapolis, MN, USA in 2003 and 2005, respectively. She is currently an associate professor with the Department of Computer Science, Georgia State University, Atlanta, GA, USA. Her research interests include wireless networking, sensor networks, sensory data management, social networks, and optimization. She received the National Science Foundation CAREER Award in 2006.
