Approximate Data Aggregation
Ji Li, Madhuri Siddula, Xiuzhen Cheng, Wei Cheng, Zhi Tian, and Yingshu Li
Abstract: Internet-of-Things (IoT) networks provide efficient ways to transfer data and are therefore widely used in data sensing applications, which can further include wireless sensor networks. One of the critical problems in sensor-equipped IoT networks is to design energy-efficient data aggregation algorithms that answer maximum value and distinct-set queries. In this paper, we propose two algorithms, one based on uniform sampling and one based on Bernoulli sampling, to address these queries. We provide mathematical proofs showing that the proposed algorithms return accurate results with a given probability. Simulation results show that these algorithms consume much less energy than a simple distributed algorithm.
Ji Li is with Kennesaw State University, Marietta, GA 30060, USA. E-mail: [email protected].
Madhuri Siddula and Yingshu Li are with Georgia State University, Atlanta, GA 30303, USA. E-mail: msiddula1@student.gsu.edu; [email protected].
Xiuzhen Cheng is with the George Washington University, Washington, DC 20052, USA. E-mail: [email protected].
Wei Cheng is with the Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA. E-mail: [email protected].
Zhi Tian is with the Department of Electrical & Computer Engineering, George Mason University, Fairfax, VA 22030, USA. E-mail: [email protected].
To whom correspondence should be addressed.
Manuscript received: 2019-05-22; accepted: 2019-05-27
© The author(s) 2020. The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
IEEE Transaction on Internet of Things, Year: 2020
Ji Li et al.: Approximate Data Aggregation in Sensor Equipped IoT Networks

1 Introduction
As the urban population snowballs, the smart city has become essential for solving many day-to-day problems. These problems include power supply, disaster prediction, and traffic maintenance[1–4]. Some of the smart city applications that are already being used are parking services, intelligent light systems, and water conservation. For the better utilization of natural resources, we should incorporate these applications even in rural areas. The fundamental working principle of a smart city application is that there are various sensors deployed all over the city that are used for collecting data. This data helps us understand information at a city level and hence the data should be well-spread.
Similar to the smart city, we are also focusing on smart home applications[5–10]. These applications are based on the fact that in today's world all the home electronics are connected to the internet. A network with such connected devices is called Internet-of-Things (IoT). Recent devices like Alexa and Google Home are built on such a network. These devices interact with all the other devices connected to the same network. Since not all devices are the same, we need a way to collect sensory data from different sensors. The primary objective of any IoT network is to reduce cost and provide faster access to data. One of the distinct challenges in these applications is the deployment of a considerable number of sensing devices.
It is clear that sensors are the building blocks of any IoT network. However, a network with sensors has some drawbacks, such as dynamic traffic, the addition of new services, adaptation to channel conditions, and ever-changing user requirements. Having self-configurable sensors helps address some of these issues. Additionally, many algorithms have been proposed to solve the issues of routing, topology control, and time synchronization[11–24]. Using sensors in IoT networks reduces the communication cost but increases the processing cost. We deploy sensors in our network because they collect data over a long period of time and could be
placed over a vast network. Hence, the data collected from these sensors is huge and requires high processing power to aggregate and analyze. If the data aggregation problem is addressed at the sensor level, we do not have to deal with such extensive data. However, adding data aggregation functionality to the sensors might consume a lot of their energy. This raises an energy consumption issue, as aggregation costs energy and the sensors are not equipped with large power supplies. According to Ref. [25], the cost of transmitting one bit of data is equivalent to the energy cost of executing 1000 instructions. Therefore, reducing data transmission is one of the major ways to decrease the energy cost in IoT. Hence, it is critical to design energy efficient data aggregation methods for sensor equipped IoT networks.
In this paper, we study two kinds of aggregation queries: the maximum query and the distinct-set query. The maximum query calculates the maximum of all the sensory data. The distinct-set query calculates the unique values in the sensory data. Both queries are critical for an IoT network and can be widely used in practice. For example, in the field of environmental monitoring, the maximum value query can be used to acquire the most serious level of pollution, while the user may get all the pollution levels in the monitored area using the distinct-set query. Therefore, an energy efficient data aggregation model should accommodate both queries.
In practice, exact query results are not always necessary. Approximate query results may also be acceptable for some applications[26, 27]. Therefore, in this paper, we propose two algorithms to process approximate maximum queries and distinct-set queries. One is based on uniform sampling and the other on Bernoulli sampling. The proposed algorithms return the exact query results with probability not less than $1 - \delta$, where $\delta$ is a real number whose value can be arbitrarily small. In summary, the main contributions of our paper are as follows:
(1) Mathematical estimators for the two aggregation operations are provided.
(2) The mathematical methods to determine the required sample size and sample probability for calculating the approximate maximum value and the approximate distinct-set are designed.
(3) Distributed algorithms for the approximate maximum value and the approximate distinct-set are provided. The energy costs of these algorithms are analyzed.
(4) Extensive simulation results are presented, which show that the proposed algorithms perform significantly better than a simple distributed algorithm in terms of energy consumption.
The rest of the paper is organized as follows. Section 2 defines the problem. Section 3 provides the mathematical proof for the $\delta$-approximate aggregation algorithms. Section 4 explains the proposed $\delta$-approximate aggregation algorithms. Section 5 shows the simulation results and the related works are discussed in Section 6. Section 7 concludes the paper.

2 Problem Definition
Suppose we have an IoT network with $n$ sensor nodes and $s_{ti}$ is the sensory data of node $i$ at time $t$. $S_t = \{s_{t1}, s_{t2}, \ldots, s_{tn}\}$ is the set of all the sensory data at time $t$ and $\mathrm{Dis}(S_t) = \{s^d_{t1}, s^d_{t2}, \ldots, s^d_{t|\mathrm{Dis}(S_t)|}\}$ contains the distinct values in $S_t$. For example, if $S_t = \{s_{t1}, s_{t2}, s_{t3}, s_{t4}, s_{t5}\}$ and $s_{t1} = 1$, $s_{t2} = 1$, $s_{t3} = 2$, $s_{t4} = 3$, $s_{t5} = 3$, then $\mathrm{Dis}(S_t) = \{1, 2, 3\}$.
In this paper, we address maximum and distinct-set queries by performing max and distinct-set operations, respectively. The definitions of these operations are as follows:
(1) The exact maximum value, denoted by $\mathrm{Max}(S_t)$, satisfies $\mathrm{Max}(S_t) = \max\{s_{ti} \in S_t \mid 1 \le i \le n\}$.
(2) The exact distinct-set of $S_t$, denoted by $\mathrm{Dis}(S_t)$, satisfies $\forall s \in S_t, \exists s^d \in \mathrm{Dis}(S_t), s = s^d$, and $\forall s^d_x, s^d_y \in \mathrm{Dis}(S_t), x \ne y \Rightarrow s^d_x \ne s^d_y$.
Obviously, the following steps can be used to solve the max and distinct-set aggregation problems.
(1) Arrange all the nodes in the network in the form of an aggregation tree where the sink node broadcasts the aggregation operation.
(2) All the nodes submit their sensory data to the sink node along the aggregation tree.
(3) The intermediate nodes in the aggregation tree calculate the partial results during the data transmission.
Although this method yields accurate aggregation results, it also leads to huge communication and computation costs. Hence, we propose a $\delta$-approximation of the results that can be achieved by the above aggregation operations. Let $I_t$ and $\hat{I}_t$ be the accurate and approximate aggregation results at time $t$, respectively. The definition of the $\delta$-estimator is as follows:
Definition 1 ($\delta$-estimator) For any $\delta$ $(0 \le \delta \le 1)$, $\hat{I}_t$ is called the $\delta$-estimator of $I_t$ if $\Pr(\hat{I}_t \ne I_t) \le \delta$.
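The exact tree-based procedure in steps (1) to (3) of this section can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the aggregation tree is represented by a hypothetical `children` dictionary, `data` maps node ids to readings, and node 0 is assumed to play the role of the sink.

```python
from typing import Dict, List, Set, Union


def aggregate(node: int, children: Dict[int, List[int]],
              data: Dict[int, int], agg: str) -> Union[int, Set[int]]:
    """Return the exact partial result for the subtree rooted at `node`."""
    if agg == "Max":
        result = data[node]
        for child in children.get(node, []):
            # Merge the child's partial maximum into the running maximum.
            result = max(result, aggregate(child, children, data, agg))
        return result
    # Any other operator is treated as the distinct-set operation.
    distinct = {data[node]}
    for child in children.get(node, []):
        distinct |= aggregate(child, children, data, agg)
    return distinct


# Readings from the example in this section: 1, 1, 2, 3, 3.
children = {0: [1, 2], 1: [3, 4]}  # assumed tree shape, node 0 is the sink
data = {0: 1, 1: 1, 2: 2, 3: 3, 4: 3}
exact_max = aggregate(0, children, data, "Max")
exact_dis = aggregate(0, children, data, "Dis")
```

Every node contributes its reading here, which is why the result is exact and why the communication cost grows with the network size.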
Tsinghua Science and Technology, February 2020, 25(1): 44–55
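(1) $u_i$ and $u_j$ are independent of each other for all $1 \le i \ne j \le m$.
(2) $\Pr(u_i = s_{tj}) = \frac{1}{n}$ for any $1 \le i \le m$, $1 \le j \le n$.
Based on the above two conclusions, we have the following lemma.
Lemma 1 For any given value $x \in \mathrm{Dis}(S_t)$, we have
$$\Pr(x \notin U(m)) = \left(1 - \frac{n_x}{n}\right)^m,$$
where $n_x$ is the number of appearances of value $x$ in $S_t$.
Proof $\Pr(x \notin U(m)) = \Pr(u_1 \ne x \wedge u_2 \ne x \wedge \cdots \wedge u_m \ne x)$. Since all the samples $u_1, u_2, \ldots, u_m$ are independent of each other, we have
$$\Pr(x \notin U(m)) = \prod_{i=1}^{m} \Pr(u_i \ne x) = \left(\Pr(u_1 \ne x)\right)^m.$$
Moreover, we have
$$\Pr(u_1 \ne x) = 1 - \Pr(u_1 = x) = 1 - \frac{n_x}{n}.$$
Then this lemma is proved.
To obtain the $\delta$-approximate maximum value, the mathematical estimator needs to be defined first. Let $\widehat{\mathrm{Max}(S_t)}_u$ denote the uniform sampling-based estimator of $\mathrm{Max}(S_t)$. Then $\widehat{\mathrm{Max}(S_t)}_u$ is defined as
$$\widehat{\mathrm{Max}(S_t)}_u = \mathrm{Max}(U(m)).$$
Based on Lemma 1, we have the following theorem.
Theorem 1 $\widehat{\mathrm{Max}(S_t)}_u$ is a $\delta$-estimator of $\mathrm{Max}(S_t)$ if
$$m \ge \frac{\ln \delta}{\ln\left(1 - \frac{n_{\min}}{n}\right)}.$$
Theorem 2 $\widehat{\mathrm{Dis}(S_t)}_u$ is a $\delta$-estimator of $\mathrm{Dis}(S_t)$ if
$$m \ge \frac{\ln\left(1 - (1 - \delta)^{n_{\min}/n}\right)}{\ln\left(1 - \frac{n_{\min}}{n}\right)}.$$
Proof First, we have
$$\left(1 - \frac{n_{\min}}{n}\right)^m \le 1 - (1 - \delta)^{n_{\min}/n},$$
$$\left(1 - \left(1 - \frac{n_{\min}}{n}\right)^m\right)^{n/n_{\min}} \ge 1 - \delta,$$
$$1 - \prod_{i=1}^{|\mathrm{Dis}(S_t)|} \left(1 - \left(1 - \frac{n_{\min}}{n}\right)^m\right) \le \delta.$$
Let $n_{s^d_{ti}}$ denote the number of appearances of $s^d_{ti}$, then we have
$$1 - \prod_{i=1}^{|\mathrm{Dis}(S_t)|} \left(1 - \left(1 - \frac{n_{s^d_{ti}}}{n}\right)^m\right) \le \delta,$$
since $n_{\min} \le n_{s^d_{ti}}$. Moreover, according to Lemma 1, we have
$$1 - \prod_{i=1}^{|\mathrm{Dis}(S_t)|} \left(1 - \Pr(s^d_{ti} \notin U(m))\right) \le \delta,$$
$$1 - \prod_{i=1}^{|\mathrm{Dis}(S_t)|} \Pr(s^d_{ti} \in U(m)) \le \delta,$$
$$1 - \Pr\left(\widehat{\mathrm{Dis}(S_t)}_u = \mathrm{Dis}(S_t)\right) \le \delta,$$
$$\Pr\left(\widehat{\mathrm{Dis}(S_t)}_u \ne \mathrm{Dis}(S_t)\right) \le \delta.$$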
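The bounds in this section can be evaluated numerically. The sketch below assumes that the ratio $n_{\min}/n$ is known in advance; the function names are illustrative and do not come from the paper. It computes the miss probability of Lemma 1 and the smallest integer sample size satisfying each bound.

```python
import math


def miss_probability(n_x: int, n: int, m: int) -> float:
    """Lemma 1: probability that a value with n_x occurrences among n
    readings is absent from m independent uniform samples."""
    return (1.0 - n_x / n) ** m


def sample_size_max(delta: float, ratio: float) -> int:
    """Smallest integer m with (1 - ratio)^m <= delta, ratio = n_min / n."""
    return math.ceil(math.log(delta) / math.log(1.0 - ratio))


def sample_size_dis(delta: float, ratio: float) -> int:
    """Smallest integer m satisfying the distinct-set bound of Theorem 2."""
    target = 1.0 - (1.0 - delta) ** ratio
    return math.ceil(math.log(target) / math.log(1.0 - ratio))


# delta = 0.01 and n / n_min = 15, the setting used later in Section 5.1.
m_max = sample_size_max(0.01, 1 / 15)
m_dis = sample_size_dis(0.01, 1 / 15)
```

For $\delta = 0.01$ and $n/n_{\min} = 15$, `m_max` evaluates to 67, and the distinct-set bound is larger because every distinct value has to appear in the sample.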
(1) The sink node generates random numbers $Y_i$ with the probability $\Pr(Y_i = l) = \frac{|C_l|}{n}$ $(1 \le i \le m)$.
(2) Let $m_l$ be the sample size of $C_l$. Then $m_l$ is calculated by $m_l = |\{Y_i \mid Y_i = l\}|$.
(3) The sink node sends the sample size $\{m_l \mid 1 \le l \le k\}$ to each cluster head. Each cluster head samples the sensory data in its own cluster using the above naive sampling algorithm.
If the sensory data received by the $l$-th cluster head is $U(m_l)$, it then calculates the partial aggregation result $R(U(m_l))$ based on the aggregation operation $Agg$ by using the following method:
$$R(U(m_l)) = \begin{cases} \mathrm{Max}(U(m_l)), & \text{if } Agg = Max; \\ \mathrm{Dis}(U(m_l)), & \text{otherwise.} \end{cases}$$
Then $R(U(m_l))$ is transmitted along the spanning tree. To further reduce the transmission cost, the intermediate nodes also aggregate the received partial results while transmitting the sensory data. The whole process is explained in Algorithm 1.

Algorithm 1 Uniform sampling-based aggregation algorithm
Input: $\delta$, aggregation operator $Agg \in \{Max, DistinctSet\}$
Output: $\delta$-approximate aggregation results
if $Agg = Max$ then
    $m = \left\lceil \ln \delta / \ln(1 - n_{\min}/n) \right\rceil$
else
    $m = \left\lceil \ln(1 - (1 - \delta)^{n_{\min}/n}) / \ln(1 - n_{\min}/n) \right\rceil$
end if
generate $Y_i$ following $\Pr(Y_i = l) = |C_l|/n$,
$m_l = |\{Y_i \mid Y_i = l\}|$ $(1 \le i \le m, 1 \le l \le k)$, the sink sends $m_l$ to each cluster head by multi-hop communication
for each cluster head of the clusters $C_l$ $(1 \le l \le k)$ do
    generate random numbers $k_1, k_2, \ldots, k_{m_l}$, then broadcast them inside the cluster
end for
for each cluster member of $C_l$ $(1 \le l \le k)$ do
    send sensory value to cluster head if its id belongs to $\{k_1, k_2, \ldots, k_{m_l}\}$
end for
for each cluster head of the clusters $C_l$ $(1 \le l \le k)$ do
    receive sample data $U(m_l)$ and calculate partial result $R(U(m_l))$
end for
for each node $j$ in the spanning tree do
    if $j$ is the leaf node then
        Send $R_j$ to its parent node
    else
        Receive partial results $R_{j1}, R_{j2}, \ldots, R_{jc}$ from its children
        if $Agg = Max$ then
            $R_j = \max(R_{j1}, R_{j2}, \ldots, R_{jc})$
        else
            $R_j = \bigcup_{i=1}^{c} R_{ji}$
        end if
        if $j$ is the sink node then
            return $R_j$
        else
            Send $R_j$ to its parent node
        end if
    end if
end for

According to the content in Section 3.1, we have
$$m = \begin{cases} \left\lceil \dfrac{\ln \delta}{\ln\left(1 - \frac{n_{\min}}{n}\right)} \right\rceil, & \text{if } Agg = Max; \\ \left\lceil \dfrac{\ln\left(1 - (1 - \delta)^{n_{\min}/n}\right)}{\ln\left(1 - \frac{n_{\min}}{n}\right)} \right\rceil, & \text{if } Agg = Dis. \end{cases}$$
Therefore, we have
$$m = \begin{cases} O\left(\ln \dfrac{1}{\delta}\right), & \text{if } Agg = Max; \\ O\left(\ln \dfrac{1}{1 - (1 - \delta)^{n_{\min}/n}}\right), & \text{if } Agg = Dis. \end{cases}$$
In practice, $|R_j|$ can be regarded as a constant. According to Ref. [29], the communication cost and the energy cost of the above algorithm are $O\left(\ln \frac{1}{\delta}\right)$ if $Agg = Max$, while the cost is $O\left(\ln \frac{1}{1 - (1 - \delta)^{n_{\min}/n}}\right)$ if $Agg = Dis$. In practice, the value of $n_{\min}$ can be acquired from the background knowledge of the specific applications. For example, in the field of environmental monitoring, the user can get the value of $n_{\min}$ according to the historical data.

4.2 Bernoulli sampling-based aggregation algorithm
Unlike the uniform sampling-based aggregation algorithm, the sampling information of the Bernoulli sampling-based aggregation algorithm consists only of the sampling probability $q$. Additionally, the Bernoulli-based method provides a mechanism for each node in the network to do the sampling independently. Therefore, the following steps are used in the Bernoulli sampling-based aggregation algorithm to perform sampling, and the network need not be divided into clusters.
(1) The sink node broadcasts the sampling probability $q$ in the network.
(2) Each node generates a random number rand in the range [0, 1] and submits its sensory data to the parent node if rand < $q$.
When the intermediate nodes in the spanning tree receive the submitted sensory data, they calculate the partial aggregation results using a method similar to the one introduced in Section 4.1. These nodes then transmit the
partial results along the spanning tree. Similarly, during the process of transmitting partial aggregation results to the sink node along the spanning tree, the intermediate nodes in the spanning tree aggregate the received partial results. The process mentioned above is explained in detail in Algorithm 2.
According to the analysis in Section 3.2, for the sample probability $q$, we have
$$q = \begin{cases} 1 - \delta^{1/n_{\min}}, & \text{if } Agg = Max; \\ 1 - \left(1 - (1 - \delta)^{n_{\min}/n}\right)^{1/n_{\min}}, & \text{if } Agg = Dis. \end{cases}$$
Similarly, the communication cost and the energy cost of the Bernoulli sampling-based $\delta$-approximate aggregation algorithm are $O\left(n - n\,\delta^{1/n_{\min}}\right)$ if $Agg = Max$, while the cost is $O\left(n - n\left(1 - (1 - \delta)^{n_{\min}/n}\right)^{1/n_{\min}}\right)$ if $Agg = Dis$.

5 Simulation Results
In order to evaluate the proposed algorithms, we simulated an IoT network with 1000 nodes. All nodes are randomly distributed in a 300 m × 300 m rectangular region and the sink node is in the center of the region. The following steps are used to define the clusters.
(1) Divide the region into 10 × 10 grids.
(2) Group the nodes in the same grid into one cluster.
(3) The cluster head is randomly chosen.
For each node, the energy cost to send and receive one byte is defined as 0.0144 mJ and 0.0057 mJ, respectively[30]. The communication range of each sensor node is set to be $30\sqrt{2}$ m in our simulation[31]. By these simulation settings, we ensure that each sensor node is at a one-hop distance from its corresponding cluster head.
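The clustering setup described above can be reproduced with a short script. This is a sketch under stated assumptions and not the authors' simulator: the uniform placement via Python's `random` module, the seed, and the helper name `build_clusters` are all illustrative. It also checks the one-hop claim, which holds because two points in the same 30 m × 30 m cell are at most $30\sqrt{2}$ m apart.

```python
import math
import random

REGION = 300.0                  # side length of the square region in metres
GRID = 10                       # the region is split into GRID x GRID cells
CELL = REGION / GRID            # each cell is 30 m x 30 m
RANGE = 30.0 * math.sqrt(2.0)   # communication range stated in the text


def build_clusters(n_nodes: int, seed: int = 0):
    """Place nodes uniformly at random and group them by grid cell."""
    rng = random.Random(seed)
    clusters = {}
    for node_id in range(n_nodes):
        x = rng.uniform(0.0, REGION)
        y = rng.uniform(0.0, REGION)
        # min(...) keeps a coordinate equal to REGION inside the last cell.
        cell = (min(int(x // CELL), GRID - 1), min(int(y // CELL), GRID - 1))
        clusters.setdefault(cell, []).append((node_id, x, y))
    # One randomly chosen member of each non-empty cell is the cluster head.
    heads = {cell: rng.choice(members) for cell, members in clusters.items()}
    return clusters, heads


clusters, heads = build_clusters(1000)
# Largest distance between any member and the head of its own cluster.
max_dist = max(
    math.hypot(x - heads[cell][1], y - heads[cell][2])
    for cell, members in clusters.items()
    for (_, x, y) in members
)
```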
5.1 Uniform sampling-based aggregation algorithm
The first group of simulations studies the relationship between $\delta$ and the sample size. The results are shown in Fig. 1. The results for both the maximum value aggregation and the distinct-set aggregation are listed. Additionally, two groups of results with different $n/n_{\min}$ are also listed for comparison. These results indicate that the sample size increases with a decline of $\delta$. Moreover, the sample sizes are much smaller than the network size. For example, if we have $\delta = 0.01$ and $n/n_{\min} = 15$, the sample size is about 67, which indicates that we only need to sample 6.7% of the sensory data to guarantee that the estimated maximum value equals the actual maximum value with probability at least 99%. Hence, the proposed algorithm based on uniform sampling saves a tremendous amount of energy, as the amount of sensory data sampled is small. Additionally, under the same conditions, the sample size for the distinct-set aggregation is greater than the sample size for the maximum value aggregation, because the distinct-set aggregation has to ensure that all distinct values are sampled.
The second group of simulations studies the relationship between $\delta$ and the energy cost. The results are listed in Fig. 2. These results indicate that the energy cost increases with the decrease of $\delta$. It can be observed that the energy cost for the distinct-set aggregation is higher than that of the maximum value aggregation, as the distinct-set aggregation requires a larger sample size.

Algorithm 2 Bernoulli sampling-based aggregation algorithm
Input: $\delta$, aggregation operator $Agg \in \{Max, Dis\}$
Output: $\delta$-approximate aggregation results
if $Agg = Max$ then
    $q = 1 - \delta^{1/n_{\min}}$
else
    $q = 1 - \left(1 - (1 - \delta)^{n_{\min}/n}\right)^{1/n_{\min}}$
end if
Sink node broadcasts $q$ in the network
for each leaf node $j$ in the spanning tree do
    if rand < $q$ then
        Send its own sensory data to its parent node
    end if
end for
for each non-leaf node $j$ in the spanning tree do
    Receive partial results $R_{j1}, R_{j2}, \ldots, R_{jc}$ from its children
    if $Agg = Max$ then
        $R_j = \max(R_{j1}, R_{j2}, \ldots, R_{jc})$
    else
        $R_j = \bigcup_{i=1}^{c} R_{ji}$
    end if
    if rand < $q$ then
        if $Agg = Max$ then
            $R_j = \max(R_j, j.data)$
        else
            $R_j = R_j \cup \{j.data\}$
        end if
    end if
    if $j$ is the sink node then
        return $R_j$
    else
        Send $R_j$ to its parent node
    end if
end for
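Algorithm 2 can be simulated on a single machine as follows. This is a minimal sketch, not the authors' implementation: the spanning tree is a hypothetical `children` dictionary, `data` maps node ids to readings, and a partial result is represented as a Python set (holding at most one value for Max), which is an assumption made here because the pseudocode does not say how an empty partial result is encoded.

```python
import random
from typing import Dict, List, Set


def sampling_probability(delta: float, n_min: int, n: int, agg: str) -> float:
    """Sampling probability q from Section 4.2 for the given operator."""
    if agg == "Max":
        return 1.0 - delta ** (1.0 / n_min)
    return 1.0 - (1.0 - (1.0 - delta) ** (n_min / n)) ** (1.0 / n_min)


def bernoulli_aggregate(node: int, children: Dict[int, List[int]],
                        data: Dict[int, int], q: float, agg: str,
                        rng: random.Random) -> Set[int]:
    """Return the partial result that `node` forwards to its parent."""
    partial: Set[int] = set()
    for child in children.get(node, []):
        partial |= bernoulli_aggregate(child, children, data, q, agg, rng)
    if rng.random() < q:            # this node decides to contribute
        partial.add(data[node])
    if agg == "Max" and partial:
        partial = {max(partial)}    # forward only the largest value seen
    return partial


children = {0: [1, 2], 1: [3, 4]}   # assumed tree, node 0 is the sink
data = {0: 1, 1: 1, 2: 2, 3: 3, 4: 3}
rng = random.Random(1)
q_max = sampling_probability(0.01, 67, 1000, "Max")
all_sampled = bernoulli_aggregate(0, children, data, 1.0, "Max", rng)
none_sampled = bernoulli_aggregate(0, children, data, 0.0, "Dis", rng)
```

With $\delta = 0.01$ and $n_{\min} = 67$, `q_max` is roughly 0.066. Setting `q` to 1.0 recovers the exact result, and setting it to 0.0 returns an empty set, which shows the two extremes between which the algorithm trades accuracy for transmissions.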
Fig. 1 Relationship between $\delta$ and the sample size: (a) maximum value aggregation; (b) distinct-set aggregation. Axes: $\delta$ versus sample size, with curves for $n/n_{\min} = 30$ and $n/n_{\min} = 15$.
Fig. 2 Relationship between $\delta$ and the energy cost for the uniform sampling-based aggregation algorithm: (a) maximum value aggregation; (b) distinct-set aggregation. Axes: $\delta$ versus energy cost (mJ/Byte), with curves for $n/n_{\min} = 30$ and $n/n_{\min} = 15$.

The third group of simulations compares the energy cost of the uniform sampling-based aggregation algorithm with that of the simple distributed algorithm. The steps of the simple distributed algorithm are as follows.
(1) Collect all the raw sensory data.
(2) Aggregate the partial results during the transmission.
We can see that the simple distributed algorithm can always return accurate aggregation results. For the uniform sampling-based aggregation algorithm, we set $\delta = 0.1$ and $n/n_{\min} = 15$, and the network size changes from 500 to 1500. The results are listed in Fig. 3. We can see that for both algorithms, the energy cost increases with the increase of the network size. Moreover, the energy cost of the uniform sampling-based aggregation algorithm is much lower than that of the simple distributed algorithm since only a small number of nodes need to transmit their sensory data. These results indicate that the uniform sampling-based aggregation algorithm performs much better in energy cost although it may return wrong aggregation results. Finally, with the increase in the network size, the energy cost of the simple distributed algorithm grows rapidly, while the energy cost for the uniform sampling-based aggregation algorithm almost remains the same. That is because the uniform sampling algorithm's required sample size depends on the value of $\delta$ and $n_{\min}/n$ rather than the network size $n$ itself. This phenomenon also indicates that the uniform sampling algorithm is appropriate for large scale networks, which is verified by the results shown in Fig. 4.

5.2 Bernoulli sampling-based aggregation algorithm
The first group of simulations is about the relationship between $\delta$ and the sample probability. The results are presented in Fig. 5. The results show that the
sample probability increases with the decline of $\delta$. Moreover, the sample probabilities are much smaller than 1. For example, when $\delta = 0.01$, the sample probability is about 0.066 for deriving the $\delta$-approximate maximum value. Therefore, our Bernoulli sampling-based algorithm also saves a great deal of energy. Similarly, the required sample size for the distinct-set aggregation is greater than that of the maximum value aggregation under the same conditions.
The second group of simulations is about the relationship between $\delta$ and the energy cost. The results are shown in Fig. 6. Similarly, we can see that the energy cost increases with the decline of $\delta$ and the energy cost for the distinct-set aggregation is greater than that of the maximum value aggregation.
The third group of simulations compares the energy cost of the Bernoulli sampling-based aggregation algorithm with that of the simple distributed algorithm. For the Bernoulli sampling-based aggregation algorithm, we set $\delta = 0.1$ and $n_{\min} = 67$. The network size varies from 500 to 1500. The results are listed in Fig. 7. Similarly, we can see that for the same network size, the energy cost of the Bernoulli sampling-based aggregation algorithm is much lower than that of the simple distributed algorithm, which indicates that the Bernoulli sampling-based aggregation algorithm performs well in terms of energy consumption. Moreover, we can also see that the Bernoulli sampling-based aggregation algorithm has even better performance on large scale networks.
The fourth group of simulations compares the energy cost of the Bernoulli sampling-based aggregation algorithm with that of the uniform sampling-based aggregation algorithm. We set $\delta = 0.1$ and $n/n_{\min} = 15$.

Fig. 3 Energy cost comparison between the uniform sampling-based aggregation algorithm and the simple distributed algorithm: (a) maximum value aggregation; (b) distinct-set aggregation. Axes: network size versus energy cost (mJ/Byte).
Fig. 4 Energy cost comparison between the uniform sampling-based aggregation algorithm and Bernoulli sampling-based aggregation algorithm: (a) maximum value aggregation; (b) distinct-set aggregation. Axes: network size versus energy cost (mJ/Byte).
Fig. 5 Relationship between $\delta$ and the sample probability: (a) maximum value aggregation; (b) distinct-set aggregation. Axes: $\delta$ versus sample probability, with curves for $n_{\min} = 34$ and $n_{\min} = 67$.
Fig. 6 Relationship between $\delta$ and the energy cost for the Bernoulli sampling-based aggregation algorithm: (a) maximum value aggregation; (b) distinct-set aggregation. Axes: $\delta$ versus energy cost (mJ/Byte), with curves for $n_{\min} = 34$ and $n_{\min} = 67$.

In order to ensure the network connectivity when the network size is small, we set each node's communication range to 60 m for this group of simulations. The results are shown in Fig. 4. We can see that for both the uniform sampling-based aggregation algorithm and the Bernoulli sampling-based aggregation algorithm, the energy cost increases with the increase of network size. Moreover, the Bernoulli sampling-based aggregation algorithm has lower energy cost when the network size is small, while the uniform sampling-based aggregation algorithm has lower energy cost when the network size is large. From the above results, we can see the Bernoulli sampling-based aggregation algorithm has the following advantages.
(1) The Bernoulli sampling-based aggregation algorithm can be used in unclustered networks.
(2) The Bernoulli sampling-based aggregation algorithm has lower energy cost in small scale networks.
On the other hand, the uniform sampling algorithm is appropriate for large scale clustered networks.

6 Related Work
The sampling technique has been widely used in tasks such as quantile calculation, data collection, and top-k queries. For example, Ref. [32] presents an approximate algorithm to calculate the quantiles in wireless sensor networks. By using the sampling technique, Ref. [33] develops ASAP, which is an adaptive sampling-based method for energy-efficient data collection in sensor networks. Reference [34] uses samples of past sensory data to define the problem of optimizing approximate top-k queries. However, none of these techniques can be used in our problem because these operations differ greatly from the maximum query and distinct-set query.
The distinct-count query has been widely studied in many works, such as Refs. [35, 36]. Reference
[Figure: energy cost (mJ/Byte) of the simple distributed algorithm and the Bernoulli sampling algorithm.]
respectively. We have proposed mathematical estimators for the two algorithms. Moreover, we have derived the values for the required sample size and the required sample probability for any given $\delta$.