DDRM A Continual Frequency Estimation Mechanism With Local Differential Privacy

The paper presents the Dynamic Difference Report Mechanism (DDRM), a novel approach for continual frequency estimation that enhances local differential privacy (LDP) by addressing privacy leakage issues associated with existing methods. DDRM utilizes difference trees to track data changes over time while optimizing privacy budget allocation to improve estimation accuracy. The authors demonstrate through theoretical analysis and experiments that DDRM achieves high accuracy in real-time frequency estimation while maintaining strong privacy guarantees.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 7, JULY 2023

DDRM: A Continual Frequency Estimation Mechanism With Local Differential Privacy

Qiao Xue, Qingqing Ye, Member, IEEE, Haibo Hu, Senior Member, IEEE, Youwen Zhu, and Jian Wang

Abstract—Many applications rely on continual data collection to provide real-time information services, e.g., real-time road traffic
forecasts. However, the collection of original data brings risks to user privacy. Recently, local differential privacy (LDP) has emerged as
a private data collection framework for mass population. However, for continual data collection, existing LDP schemes, e.g., those
employing the memoization technique, are known to have privacy leakage on data change points over time. In this paper, we propose a
new scheme with stronger privacy guarantee for continual frequency estimation under LDP, namely, Dynamic Difference Report
Mechanism (DDRM). In DDRM, we introduce difference trees to capture the data changes over time, which effectively addresses possible
privacy leakage on data change points. As for the utility enhancement, DDRM exploits the common case of no data change in time
series and thereby suppresses the consumption of privacy budget in such cases. Meanwhile, an optimal privacy budget allocation
scheme is proposed to encourage users to report more data for better estimation accuracy. By both theoretical analysis and
experimental evaluations, we show DDRM achieves highly accurate frequency estimation in real time.

Index Terms—Continual frequency estimation, local differential privacy, time series data

• Qiao Xue is with the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, and also with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China. E-mail: [email protected].
• Qingqing Ye and Haibo Hu are with the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong. E-mail: {qqing.ye, haibo.hu}@polyu.edu.hk.
• Youwen Zhu and Jian Wang are with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China. E-mail: {zhuyw, wangjian}@nuaa.edu.cn.

Manuscript received 12 Oct. 2021; revised 25 Mar. 2022; accepted 22 May 2022. Date of publication 26 May 2022; date of current version 5 June 2023.
This work was supported in part by the National Key R&D Program of China under Grant 2021YFB3100400, in part by the Guangxi Key Laboratory of Trusted Software under Grant kx202034, in part by the National Natural Science Foundation of China under Grants 62172216, 62072390, 62102334, and 61941121, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20211180, and in part by the Research Grants Council, Hong Kong SAR, China under Grants 15222118, 15218919, 15203120, 15226221, 15225921, and C2004-21GF.
(Corresponding author: Qingqing Ye.)
Recommended for acceptance by X. Xiao.
Digital Object Identifier no. 10.1109/TKDE.2022.3177721
1041-4347 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

1 INTRODUCTION

With the fast development of the Internet and mobile devices, it is commonplace to continually collect data from individuals for online services, such as real-time road traffic forecasting [1]. However, the collected data may include sensitive and private information of individuals, such as locations, activities (e.g., up/downlink rate), and vital signs (e.g., heartbeat). Collecting them not only imposes privacy risks on users but may also cause reputation damage or legal actions against the data collector. To resolve this dilemma, local differential privacy (LDP) [2] has been proposed. It is a variant of the differential privacy (DP) model [3] in the local setting where individuals contribute their perturbed data to an untrusted data collector. For one-round data collection, there exist several LDP schemes [4], [5], [6] that can provide strong privacy guarantee for individuals while retaining reasonably good utility. However, when they are applied in continual data collection, the data utility degrades exponentially as the privacy budget must be allocated among all timestamps due to the property of sequential composition [7] in the DP/LDP model, causing overwhelming noise that overshadows the original data.

To address this problem, Erlingsson et al. [8] propose the memoization technique as well as the RAPPOR framework to estimate frequencies of discrete values. Specifically, each user pre-computes and stores a sanitized version of all possible input values by an $\epsilon$-LDP algorithm. Then in each round of data collection, each user always submits a pre-computed response based on her current value, without invoking the LDP algorithm again and thus without spending any additional privacy budget. However, as pointed out by [9], memoization may cause two types of privacy risks. The following is a counterexample to illustrate them. Suppose that a user has a time series of $a, b, b, b, a, a, b, \ldots$, and after perturbation by memoization, the noisy time series becomes $00, 01, 01, 01, 00, 00, 01, \ldots$. First, if an adversary has some background knowledge that can correlate a true value with its noisy version, e.g., matching 00 with $a$, whenever the user sends 00, the adversary can infer her true value $a$ with 100% confidence. Second, even without any background knowledge, based on the changes in noisy values, the adversary can still locate those timestamps when the true values change, e.g., at $t = 2$ because $00 \to 01$.

Therefore, Ding et al. [9] propose dBitFlipPM to improve this memoization technique, where each discrete value is pre-sanitized and "hashed" to a random $d$-bit vector. As such, some different original values are mapped to the same memorized response, which resolves the first issue,
but not the second one. Joseph et al. [10] recently designed a solution based on data changes to track statistics (e.g., frequency) over time, which addresses both issues by applying a fresh perturbation in each report. The scheme includes two procedures (i.e., voting and statistics estimation) that need to access true user data, so the privacy budget has to be split, and only half of it can be used for estimation, which harms the data utility. Subsequently, Erlingsson et al. [11] propose to sanitize and report data changes during continual frequency estimation. With the assumption that the time series changes at most $C$ times, each user samples one of her $C$ data changes to report. However, their assumption may not be practical enough in real-world applications, because, to fully satisfy the assumption, $C$ must be set to the largest possible number of changes, which significantly harms the data utility due to the client-side sampling process.

In this paper, we propose a time series data collection scheme, namely Dynamic Difference Report Mechanism (DDRM), for continual frequency estimation with strong privacy guarantee (addressing both issues of memoization) while retaining high accuracy. Similar to [10], [11], we mainly focus on the difference between two values in time series data and employ binary trees to dynamically record the differences over time. The employed multiple trees can capture data changes over one or several timestamps, from which users select one difference value to perturb and report on. DDRM can exploit the common case of no data change in time series and suppress the consumption of privacy budget in such cases. Moreover, to improve the estimation accuracy, we design an optimal allocation of privacy budget to maximize the utility of user reports. Through extensive theoretical analysis and experimental evaluations on both synthetic and real datasets, the effectiveness of DDRM is verified. To summarize, our main contributions are as follows.

• We formulate the problem of continual data collection under local differential privacy, and develop a new scheme DDRM for real-time frequency estimation.
• We provide complete algorithms for client-side data modeling and perturbation protocol, and collector-side aggregation and calibration procedures.
• We present an optimal solution to allocate privacy budget for continual frequency estimation, which achieves significant utility enhancement.

The rest of this paper is organized as follows. Section 2 introduces the preliminaries and problem definition. Section 3 overviews the workflow of DDRM. Section 4 presents implementation details of DDRM. Section 5 gives detailed utility and privacy analysis of DDRM. Section 6 shows experimental evaluations. Section 7 reviews the related literature. Section 8 concludes this paper.

2 PRELIMINARIES AND PROBLEM DEFINITION

2.1 Local Differential Privacy
Local differential privacy (LDP) [2] is proposed for the local setting where data contributors (users) may upload their sensitive information to an untrusted data collector. In this setting, each user locally sanitizes her private data by a randomized algorithm satisfying $\epsilon$-local differential privacy, and contributes the noisy version of original data to an untrusted data collector.

Definition 1. ($\epsilon$-local differential privacy). A randomized algorithm $\mathcal{A}: D \to V$ that takes one value in the domain $D$ satisfies $\epsilon$-local differential privacy iff for any two values $d_i, d_j \in D$ and any output $v \in V$,

$$\Pr\{\mathcal{A}(d_i) = v\} \le e^{\epsilon} \cdot \Pr\{\mathcal{A}(d_j) = v\}. \tag{1}$$

The privacy budget $\epsilon$ is a public and non-negative parameter, which bounds the ratio between the probabilities of $\mathcal{A}$ outputting the same result on any two different input values. Intuitively, a smaller (resp. larger) $\epsilon$ indicates a stronger (resp. weaker) privacy guarantee and more (resp. less) perturbation noise.

As with centralized differential privacy, LDP has the same property of sequential composition [7] as below.

Theorem 1. (Sequential Composition). If $S$ randomized algorithms $\mathcal{A}_1, \ldots, \mathcal{A}_S$ satisfy $\epsilon_s$-local differential privacy respectively, $s \in \{1, \ldots, S\}$, the sequence of outputs, i.e., $\mathcal{A}_1(d_i), \ldots, \mathcal{A}_S(d_i)$ for $d_i \in D$, provides $\left(\sum_{s=1}^{S} \epsilon_s\right)$-local differential privacy.

According to Theorem 1, to guarantee $\epsilon$-LDP for a sequence of randomized algorithms, we can divide the privacy budget $\epsilon$ into multiple portions, each of which can be consumed by an algorithm to sanitize the private data of users.

2.2 Continual Data Collection Under LDP
As a promising framework for private data releasing, local differential privacy (LDP) has been employed to collect user data over time for longitudinal privacy guarantee. The existing LDP solutions for continual data collection can be classified into two categories, namely memoization based and data changes based approaches. We briefly introduce four state-of-the-art works in the following.

Memoization Based Approaches. Erlingsson et al. [8] propose a memoization approach to protect the privacy of users whose multiple responses are collected over time. Specifically, each user utilizes randomized response to generate a noisy version of her true value. Then this noisy response will be memorized and reported next time when the same true value occurs. Memoization protects a true value from being exposed, but the longitudinal privacy guarantee works only if the underlying true value does not change or changes in an uncorrelated fashion [8].

The work in [9] points out that, in memoization, if an adversary (e.g., data collector) can correlate a true value with its noisy response at some timestamp, the true value will be exposed from that timestamp onwards. Therefore, dBitFlipPM [9] is proposed to improve memoization by mapping several true values to the same noisy response. In dBitFlipPM, each user first encodes each original value into a one-hot vector, from which $d$ bits are randomly selected. The selected $d$ bits are perturbed by randomized response and then memorized. The "hash" operation guarantees that multiple original values can be perturbed to the same noisy $d$-bit vector. Hence, even if an adversary correlates an original value with its noisy version at some timestamp, she still
cannot infer the true value with high confidence when observing the same noisy vector in future.

Data Changes Based Approaches. Joseph et al. [10] propose a scheme based on data changes to maintain the up-to-date statistics over time. The main idea of their scheme is to update the statistics when the data change significantly. In each round, each user compares her current data with that in the previous round. If the difference exceeds a given threshold, she will vote "yes," indicating her data are changed. When most users vote "yes," the collector will collect the current user data and update the global statistics.

Erlingsson et al. [11] propose to continually estimate frequency based on data changes, with an assumption that the underlying data can change at most $C$ times. In their scheme, the time horizon $T$ is mapped to a binary tree with $T$ leaf nodes. Specifically, each leaf node corresponds to a timestamp, while each non-leaf node corresponds to the timestamp in the rightmost leaf node of its subtree. During the data collection, each user first randomly chooses a change index $c \in [C]$ and one tree level. Then the user reports data (i.e., the perturbed $c$th change or a random value) at the corresponding timestamps of the chosen tree level. At each timestamp $t$, the collector derives a tree level set $H \subseteq [\log_2 T + 1]$ such that $\sum_{h \in H} 2^{h-1} = t$ and aggregates the latest values reported by the users from the levels in $H$. The aggregated data include the value at the first timestamp and the changes at timestamps $\{2, \ldots, t\}$, so the frequency at $t$ can be estimated by summing them up after calibration and compensation (i.e., scaled up by $C \log_2 T$).

2.3 Problem Description
This paper focuses on the problem of continual frequency estimation over discrete data with local differential privacy. We assume $n$ data contributors (users) and an untrusted data collector. Each user $u_i$ ($i \in [n]$) has a private time series $d_i = [d_i^1, \ldots, d_i^t, \ldots]$, where each value is binary, i.e., $d_i^t \in \{0, 1\}$. At each timestamp $t$, the data collector wants to estimate the frequency of '1' in the population, i.e., $f^t = \frac{\sum_i d_i^t}{n}$, without violating local differential privacy. Table 1 summarizes the main notations. Note that the extension of binary $d_i^t$ to a multi-valued case is straightforward, which will be elaborated in Section 4.6.

TABLE 1
Notations

Symbol | Description
$n$ | the number of users
$u_i$ | the $i$th user in the population, $i \in [n]$
$d_i$ | user $u_i$'s time series data, $d_i = \{d_i^1, d_i^2, \ldots\}$
$d_i^t$, $\tilde{v}_i^t$ | user $u_i$'s true/perturbed value at timestamp $t$
$f^t$, $\hat{f}^t$ | the true/estimated frequency of '1' at timestamp $t$
$c_i^t$ | the difference of any two consecutive values, $c_i^t = d_i^t - d_i^{t-1}$
$m_t$ | the number of difference trees at timestamp $t$
$R_i^t$ | user $u_i$'s list to store key nodes in difference trees at timestamp $t$
$h_i^t$ | the node index in $R_i^t$ selected by $u_i$ at timestamp $t$
$k$ | privacy budget allocation parameter
$[n]$ | a set of integers, $[n] = \{1, 2, \ldots, n\}$

With the LDP guarantee on the private data of users, our goal is to estimate a highly accurate frequency. Specifically, we aim to develop a mechanism that can minimize the distance (denoted by $Dis$) between the estimated and the true frequencies at any timestamp $t$:

$$Dis(t) = \sqrt{\sum_{m \in M} \left| \hat{f}_m^t - f_m^t \right|^2}, \tag{2}$$

where $M$ is the domain of values, e.g., $M = \{0, 1\}$ for the binary case, and each $\hat{f}_m^t$ (resp. $f_m^t$) denotes the estimated (resp. true) frequency of $m \in M$ at timestamp $t$.

3 DDRM OVERVIEW
In this section, we first elaborate on the design rationale of DDRM for continual data collection, and then overview its workflow. Implementation details and privacy analysis are shown in Sections 4 and 5, respectively.

3.1 Design Rationale
To provide differential privacy guarantee of time series data, a naive idea is to divide the given privacy budget over time and consume a small portion on each report. However, when the collecting time horizon becomes long or even unlimited, this idea is no longer feasible, as the privacy budget for each report will be too small to contribute to any utility. To address this problem, in DDRM, we assume time series exhibit continuity, i.e., they do not fluctuate significantly over a short period of time. Thus, instead of storing and perturbing each value itself, we record the difference of any two consecutive values, based on which binary difference trees are constructed to store the time series data. Fig. 1 shows an example. For each value $d_i^t \in \{0, 1\}$ at timestamp $t$, a leaf node stores the difference between the current and previous value, denoted by $c_i^t = d_i^t - d_i^{t-1}$, while a non-leaf node sums up the values from its two child nodes. In that sense, nodes at different levels of the binary tree reflect different views of the value changes. For example, the leaf node of $d_i^8$ denotes the difference over one timestamp, i.e., $d_i^8 - d_i^7$, while the root node denotes a larger view of 8 timestamps, which is $d_i^8 - d_i^0$.¹

Fig. 1. A difference tree with time series data $d_i = [1, 1, 1, 0, 0, 0, 1, 1]$.

1. The value stored in the root node is $(d_i^8 - d_i^7) + (d_i^7 - d_i^6) + \cdots + (d_i^1 - d_i^0) = d_i^8 - d_i^0$. Here we set $d_i^0 = 0$ as the initial state.

Based on this tree, each user can choose to submit one of the nodes associated with the current value (i.e., orange nodes in Fig. 1), showing its difference from one of the previous timestamps. By capturing value changes from different views, DDRM can retain more dynamics of the time series and thus achieve better accuracy.
We will elaborate on the node selection strategy in Section 4.2.

As for perturbation, since the value of any tree node denotes a change, it can only be in $\{-1, 0, 1\}$ for binary-valued time series. Our idea is based on the assumption that values in time series do not change often, so most values in the tree nodes are 0. If 0 is perturbed to other values with the same probability, it will not consume any privacy budget. With that said, our perturbation protocol reports a value in the tree node as follows: $v \in \{-1, 1\}$ is perturbed to $\tilde{v} = 1$ (resp. $-1$) with probability $\frac{1}{2} + \frac{v}{2} \cdot \frac{e^{\epsilon} - 1}{e^{\epsilon} + 1}$ (resp. $\frac{1}{2} - \frac{v}{2} \cdot \frac{e^{\epsilon} - 1}{e^{\epsilon} + 1}$), while $v = 0$ is perturbed to 1 or $-1$ with the same probability of 0.5, consuming no privacy budget. Therefore, the privacy budget can be allocated to the less frequent values 1 and $-1$, enhancing the overall estimation accuracy.

3.2 Workflow of DDRM
Fig. 2 shows the workflow of DDRM. At timestamp $t$, each user $u_i$ updates the difference tree to record the value changes of her current time series data $d_i$ (step 1).² Then $u_i$ randomly selects a node $h_i^t$ with value $v_i^t$ (step 2), sanitizes it by the perturbation protocol with privacy budget $\epsilon$ (step 3), and sends the perturbed value $\tilde{v}_i^t$ to the data collector as $h_i^t$'s value (step 4). The detailed implementation of DDRM for the above four steps will be presented in Sections 4.1 to 4.4, respectively.

Fig. 2. Workflow of DDRM at timestamp $t$ for $u_i$.

2. $R_i^t$ is persisted in the client storage as a list.

4 DDRM: IMPLEMENTATION
In this section, we present the implementation details of DDRM. We first discuss how to build difference trees, followed by the node selection strategy and the perturbation protocol for node values. Then we elaborate on the aggregation procedure of frequency estimation at the collector side. Subsequently, we show how to extend DDRM to multi-valued data collection, and summarize the technical merits of DDRM in the end.

4.1 Difference Trees
Difference trees are complete binary trees that record users' value changes over time. At each timestamp $t$, user $u_i$ first calculates $c_i^t$ as the difference between $d_i^t$ and $d_i^{t-1}$, i.e., $c_i^t = d_i^t - d_i^{t-1}$, where $d_i^0 = 0$ is the initial value. Then she appends this new node to the tree and updates it locally. Fig. 3 shows an example where the time series data are $d_i = [1, 1, 1, 0, 0, 0, 1, 1]$ from $t = 1$ to 8. So the values of differences are $c_i = [1, 0, 0, -1, 0, 0, 1, 0]$. At each timestamp $t$, a new leaf node which represents the difference $c_i^t$ (i.e., orange leaf nodes in Fig. 3) is appended at the rightmost position of the difference tree. Then some non-leaf nodes, each of which stores the sum of its two child nodes (i.e., orange non-leaf nodes in Fig. 3), are also added to make all trees complete. Based on this structure, nodes in different levels of difference trees can provide different views on data changes over time.

Fig. 3. Procedure of updating difference trees.

To reduce the local storage space at each user, only key nodes instead of entire difference trees need to be stored. As will be shown in the sequel, they are the rightmost nodes in each level (i.e., the values in red in Fig. 3). Usually, the key values at timestamp $t$ can be calculated with the difference $c_i^t$ and the key nodes at the previous timestamp $t - 1$. For instance, in Fig. 3, at $t = 8$, the value of the root (i.e., 1) is the sum of the difference $c_i^8 = 0$ and all key values at $t = 7$, i.e., 1, 0, 0. As such, we use a list $R_i^t$ to store the key nodes for user $u_i$ and update it over time. Fig. 4 illustrates the storage layout of key nodes in Fig. 3. The index of $R_i^t$ starts from 1 and it also indicates the level of the node in difference trees. For example, when $t = 8$, the first value (i.e., $R_i^8[1]$) is the leaf node with level 1, while the last value (i.e., $R_i^8[4]$) is the root with level 4.

The rationale of key nodes is as follows. $t$ can be uniquely expressed as the sum of some terms of $2^x$, i.e.,

$$t = 2^{a_1} + 2^{a_2} + \cdots + 2^{a_{m_t}}, \quad a_1 > a_2 > \cdots > a_{m_t} \ge 0, \tag{3}$$

where $m_t$ denotes the number of difference trees at timestamp $t$. For each $x \in \{a_1, \ldots, a_{m_t}\}$, $2^x$ nodes can be grouped to form a difference tree. For example, take $t = 6$, where $6 = 2^2 + 2^1$. So in Fig. 3 there are two difference trees when $t = 6$, where the first 4 nodes form the first tree, and the remaining 2 form the second one. Furthermore, according to Eq. (3) and Fig. 3, $a_1 + 1$ is the height of the first difference tree and the number of key nodes stored in $R_i^t$, while $a_{m_t} + 1$ is the height of the last difference tree and the number of key nodes to be updated in the list.
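The decomposition in Eq. (3) and the notion of key nodes can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and `key_nodes` recomputes the rightmost node of every level by brute force from the full difference sequence (assuming the rightmost level-$j$ node covers the last aligned block of $2^{j-1}$ leaves), rather than using the incremental update that DDRM performs on the client.

```python
def decompose(t):
    """Exponents a_1 > a_2 > ... > a_{m_t} >= 0 with t = sum of 2**a, as in Eq. (3)."""
    if t < 1:
        raise ValueError("timestamps start at 1")
    return [b for b in range(t.bit_length() - 1, -1, -1) if (t >> b) & 1]


def tree_ranges(t):
    """1-based inclusive leaf ranges (first, last) covered by each difference tree."""
    ranges, start = [], 1
    for a in decompose(t):
        ranges.append((start, start + 2 ** a - 1))
        start += 2 ** a
    return ranges


def key_nodes(c):
    """Brute-force key nodes for differences c^1..c^t (0-based Python list).

    Entry j-1 of the result is the rightmost node at level j, i.e. the sum
    over the last aligned block of 2**(j-1) leaves that ends at or before t.
    """
    t = len(c)
    a1 = decompose(t)[0]
    keys = []
    for level in range(1, a1 + 2):          # levels 1 .. a_1 + 1
        width = 2 ** (level - 1)
        end = (t // width) * width          # right edge of the last full block
        keys.append(sum(c[end - width:end]))
    return keys


c = [1, 0, 0, -1, 0, 0, 1, 0]              # differences from Fig. 3
print(decompose(6), tree_ranges(6))         # [2, 1] [(1, 4), (5, 6)]
print(key_nodes(c[:7]))                     # [1, 0, 0]
print(key_nodes(c))                         # [0, 1, 1, 1]
```

For $t = 7$ the sketch reproduces the key values 1, 0, 0 quoted above, and for $t = 8$ the last entry is the root value 1.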
Specifically, for each $j \in \{1, 2, \ldots, a_1 + 1\}$, the $j$th value in $R_i^t$, denoted by $R_i^t[j]$, will be updated as follows:

$$R_i^t[j] = \begin{cases} \sum_{l=1}^{j-1} R_i^{t-1}[l] + c_i^t, & \text{if } j \le a_{m_t} + 1 \\ R_i^{t-1}[j], & \text{otherwise.} \end{cases}$$

Algorithm 1 shows the pseudo-code of building and updating difference trees. $t$ is uniquely expressed in Line 1, and then according to the value of $a_{m_t}$, $R_i^t$ is updated with $R_i^{t-1}$ and the difference $c_i^t$ (Lines 3-6).

Algorithm 1. Build and Update Difference Trees: Tree()
Input: The list $R_i^{t-1}$ of $u_i$, the difference $c_i^t$, timestamp $t$
Output: $R_i^t$.
1: Express $t$ as: $t = 2^{a_1} + 2^{a_2} + \cdots + 2^{a_{m_t}}$
2: for each $j \in \{1, \ldots, a_1 + 1\}$ do
3:   if $j \le a_{m_t} + 1$ then
4:     $R_i^t[j] = \sum_{l=1}^{j-1} R_i^{t-1}[l] + c_i^t$
5:   else
6:     $R_i^t[j] = R_i^{t-1}[j]$
7: return $R_i^t$

Fig. 4. The list $R_i^t$ stores key information of difference trees over time.

4.2 Node Selection Strategy
Intuitively, at each timestamp $t$, each user $u_i$ can directly report the difference $c_i^t$ in the last leaf node. Based on the reported values, the collector iteratively estimates frequency as $f^t = f^{t-1} + \frac{\sum_i c_i^t}{n}$ where $f^0 = 0$. However, this solution can lead to severe accumulated error over time. To alleviate this issue, we propose a node selection strategy to allow a part of users to report non-leaf nodes whose values are associated with the current value (i.e., orange non-leaf nodes in Fig. 3). For example, when $t = 8$, there are three non-leaf nodes that can be chosen by each user, i.e., $R_i^8[2]$, $R_i^8[3]$ and $R_i^8[4]$. By selecting one of the non-leaf nodes, the frequency at $t = 8$ can be estimated as

$$f^8 = \begin{cases} \frac{\sum_{u_i \in U_4} (d_i^8 - d_i^0)}{|U_4|}, & \text{if } R_i^8[4] \text{ is selected} \\ f^4 + \frac{\sum_{u_i \in U_3} (d_i^8 - d_i^4)}{|U_3|}, & \text{if } R_i^8[3] \text{ is selected} \\ f^6 + \frac{\sum_{u_i \in U_2} (d_i^8 - d_i^6)}{|U_2|}, & \text{if } R_i^8[2] \text{ is selected} \end{cases} \tag{4}$$

where $U_l$ denotes the group of users who report the node in the $l$th level and $l \in \{2, 3, 4\}$.

As shown in Eq. (4), the calculation of the frequency $f^8$ is based on the frequency $f^4$ or $f^6$. LDP brings some estimation error to each estimated frequency, i.e., $f^4$ and $f^6$ are noisy results. Thus, the error still accumulates over time. Nevertheless, we observe that the accumulated error can be mitigated to the greatest extent if users choose to report the highest-level node. This is because the highest-level node always provides a larger view on data changes than other non-leaf nodes. For instance, in Eq. (4), if the highest-level node, i.e., $R_i^8[4]$, is selected, the frequency will be estimated only by the differences (i.e., $d_i^8 - d_i^0$) and there will be no accumulated error. Thus, the node selection strategy should select either the last leaf node (i.e., difference $c_i^t$) or the non-leaf node associated with the current value with the highest level (denoted by $r_t$).

We observe that, for each timestamp $t$, the highest-level node associated with the current value is the root of the rightmost tree, e.g., the orange node in level 2 at timestamp 6. Particularly, when $t$ is odd, the last leaf node is also the highest-level node associated with the current value. Algorithm 2 shows the details of the node selection strategy. The highest level $r_t$ is obtained in Lines 1-2. Then $h_i^t$ is randomly selected from $\{1, r_t\}$ and its corresponding value in $R_i^t$ is returned (Lines 3-5).

Algorithm 2. Node Selection Strategy: Select()
Input: The current timestamp $t$, the list $R_i^t$ of user $u_i$.
Output: Selected value $v_i^t$ with node index $h_i^t$.
1: Derive $a_{m_t}$ such that $t = 2^{a_1} + 2^{a_2} + \cdots + 2^{a_{m_t}}$, where $a_1 > a_2 > \cdots > a_{m_t} \ge 0$
2: $r_t = a_{m_t} + 1$  // $r_t = 1$ when $t$ is odd
3: Randomly select a value $h_i^t$ from $\{1, r_t\}$
4: $v_i^t = R_i^t[h_i^t]$  // $R_i^t[1]$ is equal to $c_i^t$
5: return ($v_i^t$, $h_i^t$)

Algorithm 3. Perturbation Protocol: Perturb()
Input: A private value $v \in \{-1, 0, 1\}$, privacy budget $\epsilon$
Output: The sanitized value $\tilde{v}$.
1: if $v = 0$ then
     $\tilde{v} = 1$ w.p. $0.5$; $\tilde{v} = -1$ w.p. $0.5$
2: else
     $\tilde{v} = 1$ w.p. $\frac{1}{2} + \frac{v}{2} \cdot \frac{e^{\epsilon} - 1}{e^{\epsilon} + 1}$; $\tilde{v} = -1$ w.p. $\frac{1}{2} - \frac{v}{2} \cdot \frac{e^{\epsilon} - 1}{e^{\epsilon} + 1}$
3: return $\tilde{v}$

4.3 Perturbation Protocol
Our perturbation protocol is inspired by [12], which is originally designed for mean estimation of numerical values in $[-1, 1]$. In DDRM, the selected value $v$ is the node from difference trees, which can take three values, i.e., $\{-1, 0, 1\}$. For a value $v \in \{-1, 1\}$, according to [12], it is perturbed to 1 (resp. $-1$) with probability $\frac{1}{2} + \frac{v}{2} \cdot \frac{e^{\epsilon} - 1}{e^{\epsilon} + 1}$ (resp. $\frac{1}{2} - \frac{v}{2} \cdot \frac{e^{\epsilon} - 1}{e^{\epsilon} + 1}$). When $v = 0$, our protocol perturbs it to either 1 or $-1$ with the same probability 0.5. In that sense, it does not consume any privacy budget $\epsilon$. Thus more privacy budget can be allocated to the value 1 or $-1$.
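The client-side steps of Algorithms 1 to 3 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0-based lists (so `R[j - 1]` plays the role of $R_i^t[j]$), and the explicit `rng` argument (a `random.Random` instance, used here only to make runs reproducible) are assumptions made for the example.

```python
import math
import random


def decompose(t):
    """Exponents a_1 > ... > a_{m_t} >= 0 with t = sum of 2**a, as in Eq. (3)."""
    return [b for b in range(t.bit_length() - 1, -1, -1) if (t >> b) & 1]


def tree_update(R_prev, c, t):
    """Algorithm 1: derive the key-node list R^t from R^(t-1) and the difference c^t."""
    exps = decompose(t)
    a1, am = exps[0], exps[-1]
    R = []
    for j in range(1, a1 + 2):                    # j = 1 .. a_1 + 1
        if j <= am + 1:
            R.append(sum(R_prev[:j - 1]) + c)     # rightmost tree grows or merges
        else:
            R.append(R_prev[j - 1])               # higher levels are unchanged
    return R


def select(R, t, rng):
    """Algorithm 2: pick level 1 (last leaf) or level r_t (root of the rightmost tree)."""
    r_t = decompose(t)[-1] + 1
    h = rng.choice([1, r_t])
    return R[h - 1], h


def prob_plus_one(v, eps):
    """Probability that the perturbation outputs +1 for v in {-1, 0, 1}."""
    return 0.5 + 0.5 * v * (math.exp(eps) - 1) / (math.exp(eps) + 1)


def perturb(v, eps, rng):
    """Algorithm 3: report +1 or -1; v = 0 yields a fair coin flip."""
    return 1 if rng.random() < prob_plus_one(v, eps) else -1


rng = random.Random(0)
series = [1, 1, 1, 0, 0, 0, 1, 1]                 # example series from Fig. 3
R, prev = [], 0
for t, value in enumerate(series, start=1):
    R = tree_update(R, value - prev, t)
    prev = value
# R now holds the key nodes at t = 8: [0, 1, 1, 1]
v, h = select(R, 8, rng)
report = (perturb(v, 1.0, rng), h)                # what the user would send
```

With this parameterization the ratio between the probabilities of reporting $+1$ for $v = 1$ and for $v = -1$ equals $e^{\epsilon}$, which is the worst case over input pairs, and $v = 0$ is reported as $\pm 1$ with probability 0.5 regardless of $\epsilon$.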
At last, each user reports her perturbed value $\tilde{v}_i^t$ with its node index $h_i^t$ to the data collector. In short, our proposed perturbation protocol suppresses the consumption of privacy budget on the selected "0" value, and therefore significantly improves the accuracy of the estimation results.

Algorithm 3 shows the pseudo-code of our perturbation protocol. Specifically, when $v = 0$, the protocol generates 1 or $-1$ with the same probability 0.5, which does not consume any privacy budget (Line 1). Otherwise, a sanitized value $\tilde{v}$ is obtained based on the private value $v$ and the given privacy budget $\epsilon$ (Line 2).

4.4 Aggregation
After receiving the noisy values from users, the collector first calibrates each noisy value by

$$\hat{v}_i^t = \frac{e^{\epsilon} + 1}{e^{\epsilon} - 1} \cdot \tilde{v}_i^t.$$

According to the node selection strategy, at even timestamps half of the users report the value in the leaf node while the others report that in the root node. As such, the data collector can get two estimated frequencies, $\hat{f}^t(1)$ and $\hat{f}^t(2)$, of '1' when $t$ is even as follows.

$$\hat{f}^t(1) = \hat{f}^{t-1} + \frac{\sum_{i \in \{i \mid h_i^t = 1\}} \hat{v}_i^t}{|\{u_i \mid h_i^t = 1\}|}, \qquad \hat{f}^t(2) = \hat{f}^{t'} + \frac{\sum_{i \in \{i \mid h_i^t = r_t\}} \hat{v}_i^t}{|\{u_i \mid h_i^t = r_t\}|} \quad (t' = t - 2^{r_t - 1}). \tag{5}$$

For example, at $t = 8$, $\hat{f}^8(1) = \hat{f}^7 + \frac{\sum_{i \in \{i \mid h_i^8 = 1\}} \hat{v}_i^8}{|\{u_i \mid h_i^8 = 1\}|}$ and $\hat{f}^8(2) = \frac{\sum_{i \in \{i \mid h_i^8 = 4\}} \hat{v}_i^8}{|\{u_i \mid h_i^8 = 4\}|}$.

The collector then takes a weighted average of both estimations and obtains the final frequency estimation $\hat{f}^t$. Formally,

$$\hat{f}^t = w \cdot \hat{f}^t(1) + (1 - w) \cdot \hat{f}^t(2), \tag{6}$$

where $w$ indicates the weight of the first estimation result in Eq. (5). To properly set $w$ so that it optimizes the estimation accuracy, the following Theorem 2 shows that it is equivalent to minimizing the variance of $\hat{f}^t$.

Theorem 2. The variance of the estimated frequency $\hat{f}^t$ by Eq. (6) is minimized by setting $w = \frac{w_1}{w_1 + w_2}$, where $w_1 = \frac{1}{Var[\hat{f}^t(1)]}$ and $w_2 = \frac{1}{Var[\hat{f}^t(2)]}$.

Proof. According to the proposed scheme, the collector can gain two different frequency results $\hat{f}^t(1)$, $\hat{f}^t(2)$ at even timestamps, and merge them by $\hat{f}^t = w \cdot \hat{f}^t(1) + (1 - w) \cdot \hat{f}^t(2)$. The variance of $\hat{f}^t$ is

$$Var[\hat{f}^t] = w^2 \cdot Var[\hat{f}^t(1)] + (1 - w)^2 \cdot Var[\hat{f}^t(2)]. \tag{7}$$

It is obvious that Eq. (7) is a convex function on $w$. Thus the variance $Var[\hat{f}^t]$ can be minimized if the derivative of

To summarize, according to the perturbation protocol and Theorem 2, the collector can estimate the frequency $\hat{f}^t$ at each timestamp $t$ by the following equation (where $\hat{f}^0 = 0$)

$$\hat{f}^t = \begin{cases} \hat{f}^{t-1} + \frac{\sum_{i \in \{i \mid h_i^t = 1\}} \hat{v}_i^t}{|\{u_i \mid h_i^t = 1\}|}, & \text{if } t \text{ is odd,} \\ \frac{w_1}{w_1 + w_2} \cdot \hat{f}^t(1) + \frac{w_2}{w_1 + w_2} \cdot \hat{f}^t(2), & \text{otherwise,} \end{cases} \tag{8}$$

where the calculation of $w_1$ and $w_2$ is based on the variances of $\hat{f}^t(1)$ and $\hat{f}^t(2)$, which can be further derived by Theorem 3.

Theorem 3. For each user $u_i$, the calibrated value $\hat{v}_i^t$ at each timestamp $t$ is unbiased and the variance of $\hat{v}_i^t$ is bounded by $\left(\frac{e^{\epsilon} + 1}{e^{\epsilon} - 1}\right)^2$. In particular, for an even $t$, the variances of $\hat{f}^t(1)$ and $\hat{f}^t(2)$ are at most $Var[\hat{f}^{t-1}] + \frac{((e^{\epsilon} + 1)/(e^{\epsilon} - 1))^2}{|\{u_i \mid h_i^t = 1\}|}$ and $Var[\hat{f}^{t'}] + \frac{((e^{\epsilon} + 1)/(e^{\epsilon} - 1))^2}{|\{u_i \mid h_i^t = r_t\}|}$, respectively, where $t' = t - 2^{r_t - 1}$.

Proof. From Algorithm 3, when $v_i^t = 0$, the expectation of $\hat{v}_i^t$ is

$$E[\hat{v}_i^t \mid v_i^t = 0] = \frac{e^{\epsilon} + 1}{e^{\epsilon} - 1} \cdot \left(\frac{1}{2} - \frac{1}{2}\right) = 0 = v_i^t.$$

When $v_i^t \ne 0$, the expectation of $\hat{v}_i^t$ is:

$$E[\hat{v}_i^t \mid v_i^t \ne 0] = 2 \cdot \frac{v_i^t}{2} \cdot \frac{e^{\epsilon} - 1}{e^{\epsilon} + 1} \cdot \frac{e^{\epsilon} + 1}{e^{\epsilon} - 1} = v_i^t.$$

From the above, we learn that $\hat{v}_i^t$ is unbiased, i.e., $E[\hat{v}_i^t] = v_i^t$. The variance of $\hat{v}_i^t$ is

$$Var[\hat{v}_i^t \mid v_i^t = 0] = E[(\hat{v}_i^t)^2 \mid v_i^t = 0] - (E[\hat{v}_i^t \mid v_i^t = 0])^2 = \left(\frac{e^{\epsilon} + 1}{e^{\epsilon} - 1}\right)^2,$$

$$Var[\hat{v}_i^t \mid v_i^t \ne 0] = E[(\hat{v}_i^t)^2 \mid v_i^t \ne 0] - (E[\hat{v}_i^t \mid v_i^t \ne 0])^2 = \left(\frac{e^{\epsilon} + 1}{e^{\epsilon} - 1}\right)^2 - 1.$$

As such, $Var[\hat{v}_i^t] \le \left(\frac{e^{\epsilon} + 1}{e^{\epsilon} - 1}\right)^2$. Let $h \in \{1, r_t\}$, and the variance of $\frac{\sum_{i \in \{i \mid h_i^t = h\}} \hat{v}_i^t}{|\{u_i \mid h_i^t = h\}|}$ in each estimation is

$$Var\left[\frac{\sum_{i \in \{i \mid h_i^t = h\}} \hat{v}_i^t}{|\{u_i \mid h_i^t = h\}|}\right] = \frac{\sum_{i \in \{i \mid h_i^t = h\}} Var[\hat{v}_i^t]}{|\{u_i \mid h_i^t = h\}|^2} \le \frac{((e^{\epsilon} + 1)/(e^{\epsilon} - 1))^2}{|\{u_i \mid h_i^t = h\}|}. \tag{9}$$

According to Eq. (8), the variance of each estimation $Var[\hat{f}^t]$ can be derived as follows:

$$Var[\hat{f}^t] = \begin{cases} Var\left[\hat{f}^{t-1} + \frac{\sum_{i \in \{i \mid h_i^t = 1\}} \hat{v}_i^t}{|\{u_i \mid h_i^t = 1\}|}\right], & \text{if } t \text{ is odd,} \\ \frac{Var[\hat{f}^t(1)] \cdot Var[\hat{f}^t(2)]}{Var[\hat{f}^t(1)] + Var[\hat{f}^t(2)]}, & \text{otherwise.} \end{cases} \tag{10}$$

Since $Var[\hat{f}^0] = 0$, the value of $Var[\hat{f}^t]$ can be iteratively calculated with Eq. (9) and Eq. (10). Particularly, the variances $Var[\hat{f}^t(1)]$ and $Var[\hat{f}^t(2)]$ at each even timestamp $t$ are calculated as

$Var[\hat{f}^t(1)] \le Var[\hat{f}^{t-1}] + ((e^{\epsilon} + 1)/(e^{\epsilon} - 1))^2$
jfui jhti ¼ 1gj
Eq. (7) is equal to 0, that is
0 ððe þ1Þ=ðe 1ÞÞ2
Var½f^t ð2Þ  Var½f^t  þ (11)
2w  Var½f^t ð1Þ  2ð1  wÞ  Var½f^t ð2Þ ¼ 0: jfui jhti ¼ rt gj
By solving the above equation, we derived w ¼ w wþw
1 0
where Var½f^t1  and Var½f^t  are derived in previous
1 2
1 1
with w1 ¼ Var½f^t ð1Þ ; w2 ¼ Var½f^t ð2Þ . u
t rounds. u
t
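The perturbation, calibration and weighting steps of Section 4.4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `perturb`, `calibrate` and `combine` are hypothetical, and the probability $e^{\epsilon}/(e^{\epsilon}+1)$ of keeping a non-zero value is inferred from the expectation used in the proof of Theorem 3 rather than copied from Algorithm 3.

```python
import math
import random


def perturb(v, eps, rng):
    """Perturb v in {-1, 0, 1} (assumed form of the perturbation protocol)."""
    if v == 0:
        # A zero becomes -1 or 1 with probability 0.5 each; no budget is spent.
        return rng.choice((-1, 1))
    # Assumption: a non-zero value is kept with probability e^eps / (e^eps + 1)
    # and flipped otherwise, which matches E[noisy] = v * (e^eps - 1) / (e^eps + 1).
    keep = math.exp(eps) / (math.exp(eps) + 1.0)
    return v if rng.random() < keep else -v


def calibrate(noisy, eps):
    """Scale a noisy report by (e^eps + 1) / (e^eps - 1) to remove the bias."""
    return noisy * (math.exp(eps) + 1.0) / (math.exp(eps) - 1.0)


def combine(est1, var1, est2, var2):
    """Inverse-variance weighting of two estimates, as in Theorem 2."""
    w1, w2 = 1.0 / var1, 1.0 / var2
    merged = (w1 * est1 + w2 * est2) / (w1 + w2)
    # Variance of the merged estimate, matching the even-t case of Eq. (10).
    merged_var = var1 * var2 / (var1 + var2)
    return merged, merged_var


if __name__ == "__main__":
    rng = random.Random(0)
    eps = 1.0
    for v in (-1, 0, 1):
        samples = [calibrate(perturb(v, eps, rng), eps) for _ in range(50_000)]
        # The empirical mean should be close to v (unbiasedness in Theorem 3).
        print(v, sum(samples) / len(samples))
    print(combine(0.30, 0.01, 0.34, 0.03))
```

The estimate with the smaller variance receives the larger weight, so the merged variance is never larger than either input variance.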


4.5 Overall Algorithm of DDRM
Recall that the perturbation on 0 does not consume any privacy budget. According to the property of sequential composition, to satisfy $\epsilon$-LDP, the privacy budget should be allocated to all non-zero values (i.e., $1$ or $-1$). Equivalently, we need to set an appropriate threshold $k$, so that each user can upload noisy values of $1$ or $-1$ with budget $\epsilon/k$ at most $k$ times during data collection. In case some user $u_i$ exhausts the privacy budget at a timestamp $t$, she will report a totally random $1$ or $-1$ at the following timestamps, to avoid consuming privacy budget. The selection of an appropriate threshold $k$ will be detailed in Section 5.1.

Algorithm 4 summarizes the overall algorithm of DDRM. Before data collection, the privacy budget is divided into $\epsilon' = \epsilon/k$ (Line 1). In the beginning, each user $u_i$ initializes $d_i^0 = 0$, $R_i^0[1] = 0$ and the counter $a_i = 0$ for tracking the number of perturbations on non-zero values (Lines 5-6). Then $u_i$ calculates the difference $c_i^t$ and updates the difference trees (i.e., list $R_i^t$) by Algorithm 1 (Lines 7-8). According to the node selection strategy, $u_i$ obtains the value of $v_i^t$ with the node index $h_i^t$ by Algorithm 2 (Line 9). Based on the values of $a_i$ and $v_i^t$, $u_i$ either updates her counter $a_i$ (Line 11) or sets $v_i^t = 0$ to avoid consuming privacy budget (Line 13). Lastly, $u_i$ obtains the noisy value $\tilde{v}_i^t$ by the perturbation protocol in Algorithm 3 with $\epsilon'$ (Line 14), and reports $\tilde{v}_i^t$ with the node index $h_i^t$ to the data collector (Line 15). The collector calibrates the noisy values from users with the calibration factor $\frac{e^{\epsilon'}+1}{e^{\epsilon'}-1}$ (Line 17), and then derives the estimated frequency by Eq. (8) (Line 19). For an even $t$, two weights $w_1$ and $w_2$ are calculated before the estimation (Line 18).

Algorithm 4. Overall Algorithm of DDRM
Input: Time series data of all users $\{\mathbf{d}_1, \ldots, \mathbf{d}_n\}$
       Privacy budget $\epsilon$ and the allocation parameter $k$
       The length of time series $T$
Output: Estimated frequencies $\hat{f}^1, \ldots, \hat{f}^T$.
1: $\epsilon' = \epsilon/k$
2: for $t = 1$ to $T$ do
3:   // Users side
4:   for each user $u_i$, $i \in [n]$ do
5:     if $t = 1$ then
6:       Initialize $d_i^0 = 0$, $R_i^0[1] = 0$ and $a_i = 0$
7:     Calculate the difference by $c_i^t = d_i^t - d_i^{t-1}$
8:     Update difference trees: $R_i^t = \mathrm{Tree}(R_i^{t-1}, c_i^t, t)$
9:     Select a node: $(v_i^t, h_i^t) = \mathrm{Select}(R_i^t, t)$
10:    if $a_i < k$ && $v_i^t \neq 0$ then
11:      $a_i = a_i + 1$
12:    else
13:      $v_i^t = 0$
14:    Perturbation: $\tilde{v}_i^t = \mathrm{Perturb}(v_i^t, \epsilon')$
15:    Report $\tilde{v}_i^t$ and $h_i^t$ to the collector
16:  // Collector side
17:  Calibrate each noisy value by $\hat{v}_i^t = \tilde{v}_i^t \cdot \frac{e^{\epsilon'}+1}{e^{\epsilon'}-1}$
18:  Calculate weights $w_1$, $w_2$ by Theorem 2
19:  Estimate the frequency $\hat{f}^t$ by Eq. (8)
20: return $\hat{f}^1, \ldots, \hat{f}^T$

4.6 Extension to Multi-Valued Cases
The proposed DDRM on binary data can be extended to multi-valued cases. Suppose there are $M$ ($M > 2$) different values in the universe, i.e., $\{1, 2, \ldots, M\}$. We can encode each value $v$ to an $M$-bit binary vector $V$, where only the $v$th bit in $V$ is 1 and all other bits are 0. Intuitively, for each bit of the encoded vector, each user can locally maintain the difference trees, and then apply DDRM to estimate its frequency. However, this naive extension causes two problems. First, the local storage cost of each user and the communication cost between users and the data collector would be proportional to $M$. Second, the utility gain by our perturbation protocol, which is mainly contributed by the 0 value of difference, can no longer be achieved with ease. To address both issues, we propose to divide users into groups where users in one group only report on a subset of bits. The effectiveness of the sampling approach on multi-valued cases will be verified through experimental results in Section 6.3.

4.7 Summary
In this subsection, we summarize the technical merits of DDRM by highlighting the challenges and our contributions. DDRM is a very practical and effective LDP scheme for time series data. First, DDRM executes a fresh perturbation at each timestamp, which breaks the deterministic mapping between true values and their noisy versions, thus addressing the two privacy issues of memoization as pointed out in [9]. Second, DDRM employs multiple binary trees (i.e., difference trees) to capture data changes over one or several timestamps. This addresses the issue of error accumulation over time. With difference trees of data changes over several timestamps, our node selection strategy allows users to report changes at any timestamp with the smallest accumulated noise. Third, to maximize time series data utility with a limited privacy budget, DDRM adopts a perturbation protocol that does not consume any privacy budget when the value is unchanged, and thus develops an optimal privacy budget allocation strategy (see Sec. 5.1) to further encourage users to report more data for estimation accuracy enhancement.

5 UTILITY AND PRIVACY ANALYSIS
In this section, we provide theoretical analysis of DDRM, in terms of utility and privacy guarantees.

5.1 Privacy Budget Allocation: How to Set Threshold k
Recall that in Algorithm 4, $k$ is a parameter for dividing the privacy budget. In this subsection, we will discuss how to derive an optimal $k$ to enhance the estimation accuracy.

At each timestamp, there are two kinds of error involved in an estimated frequency. One is due to the data perturbation, which leads to noise error denoted by $err_n^t$, and the other is caused by the submitted data from the users who exhaust the given privacy budget, which leads to manipulation error denoted by $err_m^t$. Fig. 5 shows a relation between the noise error and the manipulation error by varying $k$ ($k \le T$). Intuitively, for a small $k$ (e.g., $k = 1$), the large manipulation error $err_m^t$ dominates the overall utility, as most users exhaust their privacy budgets for some of the earlier values and then can only submit totally random reports (i.e., $1$ or $-1$) for the estimation. However, for a large $k$ (e.g., $k = T$), although the manipulation error is alleviated, the large noise error dominates the overall utility again, as each reported value comes with overwhelming noise because of

the small privacy budget $\epsilon/k$. This motivates us to find an appropriate $k$ by minimizing the maximum of these two kinds of error, such that the overall estimation accuracy is guaranteed. Then the objective function can be defined as

$$k = \arg\min_{k \in [T]} \max\{err_n^t, err_m^t\}. \qquad (12)$$

Fig. 5. Noise error $err_n^t$ and manipulation error $err_m^t$, varying $k$.

In what follows, we show how to evaluate these two kinds of error. Specifically, the noise error $err_n^t$ is measured by the absolute error between the estimated frequency $\hat{f}^t$ and the true one $f^t$, which is bounded by the standard deviation of $\hat{f}^t$. The following Theorem 4 shows an analytical result. The manipulation error $err_m^t$, which is measured by the deviation resulting from randomly selecting $1$ or $-1$ for report, will be shown in Theorem 5. Finally, Theorem 7 provides a solution to the objective function Eq. (12).

Theorem 4. At each timestamp $t$, the noise error $err_n^t$ measured by the absolute error of the estimated frequency $\hat{f}^t$ is

$$err_n^t < \frac{2G}{\sqrt{n}},$$

where $G = \frac{e^{\epsilon/k}+1}{e^{\epsilon/k}-1}$.

Proof. Please refer to Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2022.3177721 □

Theorem 5. Let $U'_t$ denote the group of users who have exhausted privacy budget at timestamp $t$. The manipulation error $err_m^t$ caused by the reported data from $U'_t$ is

$$err_m^t \le \frac{N'_t}{n} \cdot p_c,$$

where $N'_t = |U'_t|$ is the number of users in $U'_t$, and $p_c$ is the data change rate, i.e., $p_c = \Pr\{c_i^t \neq 0\}$.

Proof. Please refer to Appendix B, available in the online supplemental material. □

To gain the relationship between $err_m^t$ and $k$, let's focus on $N'_t$, the number of users who have exhausted privacy budget at timestamp $t$. One method of calculating $N'_t$ is to enumerate all possible combinations of the timestamps when users consume the privacy budget. However, the computation complexity is $O(t^{t/2})$, which is too heavy when $t$ is large. To address the problem, given $T$ timestamps and $m = E[\sum_{t=1}^{T} I(v_i^t \neq 0)]$, we use an average $\frac{m}{T}$ to approximate the probability that users may consume the privacy budget at each timestamp. (Note that $v_i^t$ is the true value from Algorithm 2.) Then, the number $N'_t$ of users who have exhausted privacy budget at timestamp $t$ is

$$N'_t = 0, \quad t \le k$$

$$N'_t \approx n \sum_{t'=k}^{t-1} \binom{t'-1}{k-1} \left(\frac{m}{T}\right)^{k} \left(1 - \frac{m}{T}\right)^{t'-k}, \quad t > k. \qquad (13)$$

From Eq. (13), we know that $N'_t$ increases with $t \in [T]$, i.e., $N'_1 \le N'_2 \le \cdots \le N'_T$, thus given $T$ timestamps, the deviation $err_m^t$ achieves the maximal value at the $T$th timestamp, that is, $err_m^1 \le err_m^2 \le \cdots \le err_m^T$. The following Theorem 6 further provides a solution to derive the value of $m$ in Eq. (13).

Theorem 6. For each $t' \in \{2, 4, \ldots, 2^{\lfloor \log_2 T \rfloor - 2}\}$, given the frequency ($f^1$) at the first timestamp and the data change rate $p_c$, i.e., $p_c = \Pr\{c_i^t \neq 0\}$, the expectation ($m$) of the number of $v_i^t \neq 0$ across $T$ timestamps is

$$m = f^1 + \left\lceil \frac{T}{2} - 1 \right\rceil \cdot p_c + \left\lfloor \frac{T}{2} \right\rfloor \cdot \frac{p_c}{2} + \sum_{t \in \{t \mid t - 2^{\lfloor \log_2 t \rfloor} = 0\}} \frac{1}{2}\left(1 - P_s^{a1} - P_s^{a2}\right) + \sum_{t'} \sum_{t \in \{t \mid t - 2^{\lfloor \log_2 t \rfloor} = t'\}} \frac{1}{2}\left(1 - P_s^{b}\right)$$

where

$$P_s^{a1} = (1 - f^1) \sum_{\tau=0}^{\frac{t}{2}-1} \binom{t-1}{2\tau} p_c^{2\tau} (1 - p_c)^{t-1-2\tau}$$

$$P_s^{a2} = f^1 \sum_{\tau=1}^{t/2} \binom{t-1}{2\tau-1} p_c^{2\tau-1} (1 - p_c)^{t-2\tau}$$

$$P_s^{b} = \sum_{\tau=0}^{t'/2} \binom{t'}{2\tau} p_c^{2\tau} (1 - p_c)^{t'-2\tau}$$

Proof. See Appendix C, available in the online supplemental material. □

In Theorem 6, the parameter $p_c$ is regarded as prior knowledge learned from historical data, while $f^1$ can be set to 0.5 by default or to an empirical value if the collector has some background knowledge. With $err_n^t = \frac{2G}{\sqrt{n}}$ and $err_m^t = \frac{N'_T p_c}{n}$ from Theorems 4 and 5, the following Theorem 7 solves Eq. (12) to derive an optimal threshold $k$, i.e., an integer near the crossing point (i.e., the optimum in Fig. 5) of the two error curves.

Theorem 7. Let $G(k) = \frac{2G}{\sqrt{n}}$ and $F(k) = \frac{N'_T p_c}{n}$. An optimal threshold is an integer $k \in [T]$ satisfying one of the following constraints

$$G(k) \le F(k), \quad G(k+1) > F(k+1)$$
$$\text{or} \quad G(k) \ge F(k), \quad G(k-1) < F(k-1)$$

5.2 Privacy Analysis of DDRM
Given privacy budget $\epsilon$, DDRM allows each user $u_i$ to report the noisy values of $1$ or $-1$ at most $k$ times. After exhausting all privacy budget, $u_i$ will not contribute her true

information any more and always uploads the perturbed value of 0 for privacy guarantee. The following Theorem 8 establishes the differential privacy guarantee of DDRM.

Theorem 8. Given privacy budget $\epsilon$, DDRM satisfies $\epsilon$-LDP for continual frequency estimation.

Proof. Let $\tilde{\mathbf{d}} = \{(\tilde{v}^1, h^1), \ldots, (\tilde{v}^T, h^T)\}$ be a set of perturbed reports by DDRM across $T$ timestamps from one user. In the scheme, we use $v_i^t$ to denote the value that $u_i$ selects in step 2, and it is also the input value of the perturbation algorithm in step 3. Recall that our protocol only allows users to do perturbation on non-zero values for $k$ times. Thus $v_i^t = 0$ will be set regardless of the true value of $v_i^t$ if there already exist $k$ non-zero values among $\{v_i^1, \ldots, v_i^{t-1}\}$. In other words, there are at most $k$ non-zero values in the set $\{v_i^1, \ldots, v_i^T\}$. For any two users $u_i$, $u_j$, without loss of generality, suppose that the $k$ values $v_i^1, \ldots, v_i^k$ at the first $k$ timestamps are not 0 for $u_i$, while the $k$ values $v_j^{k+1}, \ldots, v_j^{2k}$ at the following $k$ timestamps are not 0 for $u_j$, that is

$$u_i: \{v_i^1, \ldots, v_i^k, \ldots, v_i^T\} = \{\underbrace{\pm 1, \ldots, \pm 1}_{k}, 0, 0, \ldots, 0\}$$

$$u_j: \{v_j^1, \ldots, v_j^{k+1}, \ldots, v_j^{2k}, \ldots, v_j^T\} = \{\underbrace{0, \ldots, 0}_{k}, \underbrace{\pm 1, \ldots, \pm 1}_{k}, 0, 0, \ldots, 0\}$$

where $\pm 1$ means $1$ or $-1$.

Let $P_{h^t}$ denote the probability that each user selects the $h^t$ node at timestamp $t$, i.e., $P_{h^t} = \Pr\{h_i^t = h^t\}$, $h^t \in \{1, r_t\}$. According to the perturbation protocol (DDRM), which is denoted by $A$ here, we have

$$\Pr\{A(\mathbf{d}_i) = \tilde{\mathbf{d}}\} = \Pr\{\tilde{v}^1 \mid v_i^1\} P_{h^1} \cdots \Pr\{\tilde{v}^k \mid v_i^k\} P_{h^k} \cdot \Pr\{\tilde{v}^{k+1} \mid 0\} P_{h^{k+1}} \cdots \Pr\{\tilde{v}^T \mid 0\} P_{h^T}$$

$$\Pr\{A(\mathbf{d}_j) = \tilde{\mathbf{d}}\} = \Pr\{\tilde{v}^1 \mid 0\} P_{h^1} \cdots \Pr\{\tilde{v}^k \mid 0\} P_{h^k} \cdot \Pr\{\tilde{v}^{k+1} \mid v_j^{k+1}\} P_{h^{k+1}} \cdots \Pr\{\tilde{v}^{2k} \mid v_j^{2k}\} P_{h^{2k}} \cdot \Pr\{\tilde{v}^{2k+1} \mid 0\} P_{h^{2k+1}} \cdots \Pr\{\tilde{v}^T \mid 0\} P_{h^T}$$

Since $P_{h^t}$ is the same for each user at any timestamp and 0 is perturbed to $1$ or $-1$ with the same probability (i.e., $\Pr\{1 \mid 0\} = \Pr\{-1 \mid 0\}$), the ratio is

$$\frac{\Pr\{A(\mathbf{d}_i) = \tilde{\mathbf{d}}\}}{\Pr\{A(\mathbf{d}_j) = \tilde{\mathbf{d}}\}} = \frac{\Pr\{\tilde{v}^1 \mid v_i^1\} \cdots \Pr\{\tilde{v}^k \mid v_i^k\}}{\Pr\{\tilde{v}^{k+1} \mid v_j^{k+1}\} \cdots \Pr\{\tilde{v}^{2k} \mid v_j^{2k}\}}$$

When $\tilde{v}^a = v_i^a$, $a \in \{1, \ldots, k\}$ and $\tilde{v}^b \neq v_j^b$, $b \in \{k+1, \ldots, 2k\}$, the above ratio can reach the maximum, that is

$$\frac{\Pr\{A(\mathbf{d}_i) = \tilde{\mathbf{d}}\}}{\Pr\{A(\mathbf{d}_j) = \tilde{\mathbf{d}}\}} \le \frac{\left(\frac{1}{2} + \frac{1}{2} \cdot \frac{e^{\epsilon/k}-1}{e^{\epsilon/k}+1}\right) \cdots \left(\frac{1}{2} + \frac{1}{2} \cdot \frac{e^{\epsilon/k}-1}{e^{\epsilon/k}+1}\right)}{\left(\frac{1}{2} - \frac{1}{2} \cdot \frac{e^{\epsilon/k}-1}{e^{\epsilon/k}+1}\right) \cdots \left(\frac{1}{2} - \frac{1}{2} \cdot \frac{e^{\epsilon/k}-1}{e^{\epsilon/k}+1}\right)} = \frac{\frac{e^{\epsilon/k}}{e^{\epsilon/k}+1} \cdots \frac{e^{\epsilon/k}}{e^{\epsilon/k}+1}}{\frac{1}{e^{\epsilon/k}+1} \cdots \frac{1}{e^{\epsilon/k}+1}} = e^{\epsilon}$$

where each product has $k$ factors. Thus, DDRM satisfies $\epsilon$-LDP. □

Besides the above LDP guarantee, DDRM also addresses the privacy risks in memoization. Specifically, DDRM breaks the deterministic mapping between true values and their noisy versions, by applying a fresh perturbation on changing values at each timestamp. As such, the noisy value of a true value is no longer fixed and, vice versa, different true values can be mapped to the same noisy value. As in the example of $a, b, b, b, a, a, b, \cdots$ in Section 1, the perturbed time series by DDRM could be any sequence of $1$ and $-1$ values, from which an adversary cannot infer the true values or true data change timestamps, even if she has the background knowledge of a true value with its perturbed value at one timestamp.

6 EXPERIMENTS
In this section, we show the experimental results of DDRM against state-of-the-art methods to verify its effectiveness.

6.1 Experimental Setup
Datasets. We use both real and synthetic datasets for binary data and multi-valued cases.

• Stocks.3 This is a real dataset about historical daily price (Open, High, Low, Close) of 7136 US stocks. In the experiments, we focus on the ‘Close’ price of all stocks from 2014 to 2017, and use one bit to indicate the stock price fluctuation. If the price rises (resp. drops) more than 2% compared to the last trading day, the bit is set to 1 (resp. 0); otherwise, it stays unchanged. The dataset is split into 169,879 individual records, each containing 32 values (i.e., $T = 32$).
• SynB. This is a synthetic dataset containing 100,000 users, each with a binary time series with $T = 32$. For each user, the value at the first timestamp follows a Bernoulli distribution $Ber(0.2)$, and for each subsequent value, a change may occur with the probability of 0.3, i.e., the change rate $p_c = \Pr\{c_i^t \neq 0\} = 0.3$.
• Trajectory.4 This is a real dataset describing the trajectories of 442 taxis in Porto from 2013 to 2014. In experiments, we focus on the trajectories within a specified area where the longitude ranges from $-8.65$ to $-8.55$ and the latitude ranges from 41.1 to 41.2. We then divide the area into 12 ($3 \times 4$) cells and each location is mapped to a corresponding cell. The dataset is split into 1,044,693 individual trajectories, each containing 32 values.
• SynM. This is a synthetic dataset containing 1,000,000 users, each with a categorical-valued time series (8 different categories) with $T = 32$. For each user, the value at the first timestamp follows an exponential distribution $Exp(1/3)$. For each subsequent value, a change occurs with the probability $p_c = 0.4$.

3. https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
4. https://www.kaggle.com/crailtap/taxi-trajectory

In the experiments, we set time horizon $T$ to 16 or 32 [11]. Note that for each of the above datasets with $T = 32$, we extract the first half of each record to generate four other datasets with $T = 16$. The statistics of datasets are summarized in Table 2.

Experiment Design. We compare the performance of DDRM with several existing methods for continual frequency

estimation with LDP, which include RAPPOR [8], dBitFlipPM [9], M.J.-18 [10], U.E.-19 [11] and ToPL [13]. When implementing RAPPOR, we divide the budget into $T$ parts and spend one portion on each report to provide the longitudinal privacy guarantee [8]. In dBitFlipPM, we set $d = 1$ or 2 to estimate frequency on binary datasets, and set $d = 2$, 3 or 4 on multi-valued datasets. For the non-real-time scheme in [10], denoted by M.J.-18, we set each epoch length5 to $10^4$ in experiments. As for the scheme in [11], we implement their basic method, denoted by U.E.-19. The shuffle model, on the other hand, is a different privacy model and beyond the scope of this work. ToPL [13] is a scheme with event-level privacy. For a fair comparison, when implementing ToPL, we divide the budget by the time horizon to achieve user-level privacy, i.e., the privacy guarantee DDRM promises.

TABLE 2
Statistics of Datasets

Dataset      Data type     Time horizon (T)   # Records (n)
Stocks       Binary        16                 169,879
Stocks       Binary        32                 169,879
SynB         Binary        16                 100,000
SynB         Binary        32                 100,000
Trajectory   Multi-value   16                 1,044,693
Trajectory   Multi-value   32                 1,044,693
SynM         Multi-value   16                 1,000,000
SynM         Multi-value   32                 1,000,000

We design four sets of experiments. The first set focuses on binary data in the datasets Stocks and SynB. The second set focuses on the multi-valued case in the datasets Trajectory and SynM. The third set evaluates the impact of parameter $k$ on DDRM, which verifies the effectiveness of the $k$ selection in Theorem 7. Finally, the fourth set studies the impact of data change rate on DDRM.

All algorithms are implemented in MATLAB, and the experiments are conducted on a desktop computer with an Intel Core i7-10700 2.9 GHz CPU and 72 GB RAM.

Performance Metrics. At each timestamp $t \in [T]$, we first calculate the distance $Dis(t)$ between the real and estimated frequencies by Eq. (2). Then we use the following three metrics to evaluate the performance of DDRM and its competitive methods, namely, $l_1$ loss, $l_2$ loss and infinity norm $l_\infty$

$$l_1 = Dis(1) + \cdots + Dis(T)$$
$$l_2 = \sqrt{Dis(1)^2 + \cdots + Dis(T)^2}$$
$$l_\infty = \max(Dis(1), \ldots, Dis(T))$$

where $l_\infty$ focuses on the worst case, and $l_1$, $l_2$ show the overall performance of estimation accuracy across $T$ timestamps. Due to the space limitation, we mainly present the results of $l_1$ and $l_2$ loss, and put the results of $l_\infty$ loss in Appendix D, available in the online supplemental material.

6.2 Experiments on Binary Data
We compare DDRM with the following five competitive methods, RAPPOR [8], dBitFlipPM [9], M.J.-18 [10], U.E.-19 [11] and ToPL [13], on the real dataset Stocks and the synthetic one SynB by varying privacy budget from 0.2 to 6. The results in terms of $l_2$ and $l_1$ loss are plotted in Figs. 6 and 7, respectively. We observe that dBitFlipPM ($d = 1, 2$) has the lowest loss whereas DDRM comes second. The implementation of RAPPOR and ToPL is in accord with the naive idea in Section 3.1, so they have similar performance, which is mainly affected by the division of the privacy budget, especially when $\epsilon$ is small. For M.J.-18, only half of the given privacy budget can be used for the frequency estimation procedure, which leads to more perturbation noise and thus a deterioration of the estimation accuracy. In the scheme of U.E.-19, they assume that time series data change at most $C$ times across $T$ timestamps, and each user randomly selects one value $c$ from $\{1, \ldots, C\}$ and perturbs the $c$th change with all privacy budget to report. We first set $C$ to the maximum number of data changes among all users to fully satisfy the assumption, which, however, introduces large sampling error. For comparison purposes, we then set $C$ to a small reasonable value, i.e., the median of the number of changes (indicated by U.E.-Med), but such a setting violates the assumption and causes a biased estimation, which also negatively affects the estimation accuracy.

Even though dBitFlipPM performs the best, it has limitations because it is based on the memoization technique, which may expose the data change points. For binary values, there always exist users whose real values (i.e., 0 or 1) are mapped to different noisy responses, so that an adversary can observe each change in their time series data from the noisy outputs and recover every data change point. To better illustrate this issue, in the experiments, we calculate the percentage of users whose data change points are all revealed by using dBitFlipPM ($d = 1, 2$), i.e., they generate different noisy values with different inputs, respectively. Table 3 shows the results. We observe that at least 50% of users expose their data change points in dBitFlipPM, and the percentage increases significantly as the privacy budget rises. Our DDRM mechanism, on the other hand, is free from such attacks.

6.3 Experiments on Multi-Valued Data
In this subsection, we show the performance of DDRM on multi-valued data in Section 4.6 by the experiments on the Trajectory and SynM datasets. We compare DDRM with three competitive methods, i.e., RAPPOR [8], dBitFlipPM [9] and M.J.-18 [10].6 For dBitFlipPM, we consider $d = 2, 3, 4$ in the datasets. For DDRM, each value is first encoded into an $M$-bit binary vector. $M$ is 12 or 8 for Trajectory and SynM respectively. Then all users are randomly divided into $M$ groups. Users in the $m$th ($m \in \{1, \ldots, M\}$) group only report the $m$th bit of the encoded vector with DDRM during the continual data collection. Since each user only focuses on one bit, the data change rate is expected to be slow, so we empirically set it to 0.02 and the value of $k$ is achieved by Theorem 7.

5. [10] suggests to set each epoch length $l$ to $1/\alpha^2$, where $\alpha$ is the expected absolute error between the true and estimated frequencies. We set $l$ by $\alpha = 0.01$, i.e., $l = 10^4$.
6. U.E.-19 [11] is designed for binary values and ToPL [13] is designed for the sum of numerical values, so they are not compared here.
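As a rough illustration of the multi-valued setup in Section 6.3 and of the loss metrics in Section 6.1, the sketch below encodes a categorical series, assigns users to bit groups, and computes the three losses from per-timestamp distances. The helper names are hypothetical. Assigning each user to a group uniformly at random is an assumption (the text only says that users are randomly divided into $M$ groups), and `dis` stands for the values $Dis(t)$ of Eq. (2), which are taken as given here.

```python
import math
import random


def one_hot(value, m):
    """Encode a value in {1, ..., m} as an m-bit vector with a single 1."""
    return [1 if bit == value else 0 for bit in range(1, m + 1)]


def assign_groups(num_users, m, rng):
    """Assumed grouping: each user draws a group in {1, ..., m} uniformly."""
    return [rng.randrange(1, m + 1) for _ in range(num_users)]


def bit_series(series, group, m):
    """Binary series that a user in `group` would feed to the binary mechanism."""
    return [one_hot(value, m)[group - 1] for value in series]


def losses(dis):
    """Return (l1, l2, l_inf) for a list of per-timestamp distances Dis(t)."""
    l1 = sum(dis)
    l2 = math.sqrt(sum(d * d for d in dis))
    l_inf = max(dis)
    return l1, l2, l_inf


if __name__ == "__main__":
    rng = random.Random(0)
    print(assign_groups(5, 8, rng))          # one group index per user
    print(bit_series([3, 3, 5, 3], 3, 8))    # the user tracks only bit 3
    print(losses([0.1, 0.2, 0.2]))
```

Because a user in group $m$ only sees whether the current value equals $m$, her binary series changes less often than the categorical series itself, which is the reason the text gives for using a small change rate when choosing $k$.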

Fig. 6. $l_2$ loss in different schemes under varying $\epsilon$ on Stocks and SynB.

Fig. 7. $l_1$ loss in different schemes under varying $\epsilon$ on Stocks and SynB.

TABLE 3
Percentage of Users Who Expose All Their Data Change Points by dBitFlipPM on Stocks and SynB

Privacy budget ε   d = 1              d = 2
                   Stocks   SynB      Stocks   SynB
0.2                50%      50%       74%      74%
0.5                50%      50%       74%      75%
1                  50%      50%       75%      75%
2                  52%      53%       77%      77%
3                  55%      56%       80%      81%
4                  60%      62%       84%      84%
5                  65%      65%       87%      87%
6                  69%      70%       90%      91%

Figs. 8 and 9 show the experimental results, where DDRM always performs the best for a small privacy budget (i.e., $\epsilon < 3$) and achieves almost equivalent accuracy as dBitFlipPM when the privacy budget becomes large. The low effectiveness of RAPPOR and dBitFlipPM for small $\epsilon$ is mainly due to the division of privacy budget on multi-valued data. As for M.J.-18, as we explained in the binary case, only half of the given privacy budget can be used for frequency estimation. Moreover, for the multi-valued case, M.J.-18 has to be implemented together with Succinct Histogram [5], a mechanism inferior to RAPPOR in terms of data utility [14]. Consequently, these two points make M.J.-18 perform the worst.

About dBitFlipPM, we also observe that its accuracy improves with increasing $d$, which is consistent with [9]. However, as $d$ increases, the disclosure of data change points gets more severe. This is because, for a small $d$, e.g., $d = 2$, there always exist some different inputs which map to the same noisy response, which makes it hard to track some data changes. When $d$ gets larger, e.g., $d = 4$, users are more likely to generate different noisy values with different inputs, causing the severe disclosure of data change points. Similarly, we also show the percentage of users who expose their data change points by dBitFlipPM with $d = 2, 3, 4$ in Table 4, where this percentage increases significantly with the increasing $d$.

7. Detailed numerical results can be found in Appendix D, available in the online supplemental material.

6.4 The Optimal Value of k
In the following, we conduct experiments to verify the optimal $k$ derived in Section 5.1. Due to space limitation, we mainly present the results on two datasets Stocks and SynB by varying privacy budget from 0.2 to 8. For the dataset SynB, we set $f^1 = 0.2$, $p_c = 0.3$ and theoretically calculate an optimal $k$ using $\epsilon$, $n$ and $T$. For the real dataset Stocks, since we have no background knowledge on the frequency $f^1$, we just set it to 0.5. As for the data change rate $p_c$, we take its average on the dataset Stocks, i.e., $p_c = 0.15$. Experiments are conducted by varying parameter $k$ and privacy budget $\epsilon$. For a given $\epsilon$, we obtain the empirical $k$ by enumerating different $k$ and finding the one with the minimum $l_1$ loss. Due to the randomness of the algorithm, the optimal $k$ from experiments (i.e., empirical $k$) may have some fluctuations. So we extract 10 values of the optimal $k$ from 10 runs of the experiments and plot them in Fig. 10. The optimal values from experiments mostly fall close to the theoretically optimal $k$,7 which verifies the correctness of our optimal $k$ setting. On the other hand, we also observe that, given a specific privacy budget in the same dataset, a larger $T$ tends

to select a larger $k$, which indicates that the manipulation error $err_m^t$ analyzed in Theorem 5 has a more significant impact on the estimation accuracy when the time horizon is larger. In such cases, a larger $k$ is needed to mitigate it. To sum up, the theoretically optimal $k$ can be a good reference to set $k$ in practice.

Fig. 8. $l_2$ loss in different schemes under varying $\epsilon$ on Trajectory and SynM.

Fig. 9. $l_1$ loss in different schemes under varying $\epsilon$ on Trajectory and SynM.

Fig. 10. Empirical $k$ versus theoretical $k$.

TABLE 4
Percentage of Users Who Expose All Their Data Change Points by dBitFlipPM on Trajectory and SynM

ε      d = 2                 d = 3                 d = 4
       Trajectory   SynM     Trajectory   SynM     Trajectory   SynM
0.2    0%           0%       0%           0.24%    0.31%        1.20%
0.5    0%           0%       0%           0.24%    0.31%        1.20%
1      0%           0%       0%           0.23%    0.30%        1.20%
2      0%           0%       0%           0.23%    0.30%        1.19%
3      0%           0%       0%           0.23%    0.30%        1.16%
4      0%           0%       0%           0.21%    0.28%        1.14%
5      0%           0%       0%           0.20%    0.27%        1.10%
6      0%           0%       0%           0.19%    0.26%        1.06%

6.5 Impact of Data Change Rate
Finally, we explore the performance of DDRM on different datasets by varying data change rates $p_c$. To do so, we generate three datasets with different change rates, i.e., $p_c = 0.8$, $p_c = 0.5$ and $p_c = 0.2$, respectively, each containing 100,000 time series with $T = 32$. To study the impact of the short-time significant fluctuation, we also generate a 4th dataset, denoted $p_c^{*} = 0.2$, with $p_c = 0.8$ in the first $T/4$ timestamps and $p_c = 0$ in the rest, retaining the effective $p_c$ as 0.2. Fig. 11 plots the estimation loss under various privacy budgets, where we observe that frequent changes (i.e., a large $p_c$) can have a negative impact on the estimation accuracy of DDRM. This is because frequently changing time series increase the non-zero values (i.e., changes) for users to report, so either the privacy budget has to be split into more parts, or more data changes have to be discarded. By comparing $p_c = 0.2$ and $p_c^{*} = 0.2$, we observe that a short-time significant value fluctuation has very little impact on the effectiveness of DDRM.

7 RELATED WORK
Differential privacy [3] is a rigorous privacy model which can provide semantic and information-theoretic security on private data. Because of its strong privacy guarantee and high efficiency, it has attracted much attention from various research areas including data management [15], data mining [16], [17] and machine learning [18], [19].

Due to its decentralized nature, local differential privacy (LDP) [2], [20] is proposed to provide the privacy guarantee

for individuals in the local setting. Currently, LDP is becoming increasingly popular in not only fundamental operations, such as frequency estimation [6], [14], [21], [22], mean value calculation [12], [23], [24] and high-dimensional distribution estimation [25], [26], [27], [28], [29], but also applications in different domains, such as itemset mining [30], graph data analysis [31], [32], [33], key-value data collection [34], [35], [36] and private learning [37], [38].

Fig. 11. Impact of the change rate $p_c$.

As for continual data collection, Dwork et al. [39] first study the problem under differential privacy, and propose event-level and user-level private algorithms in the case of continual observation. Fan et al. [40] propose a differentially private method to release real-time aggregated data. Kellaris et al. [41] focus on the privacy-preserving statistics publishing over infinite streams with differential privacy. Cao et al. [42] consider the privacy loss under a differentially private mechanism in the context of temporally correlated data release. In the local setting, Erlingsson et al. [8] propose a method (RAPPOR) of memoization for continual data collection with local differential privacy, and randomize the memoized responses to avoid tracking clients. Then Erlingsson et al. present a new scheme in [11] to repeatedly collect time series data that are correlated or change in non-independent patterns, and further study it in a shuffle model. Ding et al. [9] design an alternative approach to RAPPOR to provide privacy guarantees for the changing data. Joseph et al. [10] design an approach to track a changing statistic by assuming that user data are sampled from several evolving distributions. Wang et al. [13] release a stream of real values with unbounded length under the centralized and local setting. Besides, for time series data, temporal perturbation to realize differential privacy is also considered in the most recent work [43].

8 CONCLUSION

This work proposes a locally differentially private scheme DDRM for continual frequency estimation on time series data. DDRM consists of complete algorithms for client-side data modeling and perturbation protocol, and collector-side aggregation and calibration procedures. Furthermore, we present an optimal solution for privacy budget allocation by setting a threshold k. Through theoretical analysis, we verify the privacy and accuracy guarantees of DDRM. Finally, extensive experiments on both synthetic and real datasets also show its effectiveness.

As for the future work, we plan to extend this work to multivariate time series data, where each timestamp comes with more than one time-dependent value, such as daily behavioral data.

REFERENCES

[1] R. Barnes, S. Buthpitiya, J. Cook, A. Fabrikant, A. Tomkins, and F. Xu, “BusTr: Predicting bus travel times from real-time traffic,” in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2020, pp. 3243–3251.
[2] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, “What can we learn privately?,” SIAM J. Comput., vol. 40, no. 3, pp. 793–826, 2011.
[3] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Proc. Theory Cryptogr. Conf., 2006, pp. 265–284.
[4] P. Kairouz, S. Oh, and P. Viswanath, “Extremal mechanisms for local differential privacy,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2879–2887.
[5] R. Bassily and A. Smith, “Local, private, efficient protocols for succinct histograms,” in Proc. 47th Annu. ACM Symp. Theory Comput., 2015, pp. 127–135.
[6] T. Wang, J. Blocki, N. Li, and S. Jha, “Locally differentially private protocols for frequency estimation,” in Proc. 26th USENIX Secur. Symp., 2017, pp. 729–745.
[7] F. D. McSherry, “Privacy integrated queries: An extensible platform for privacy-preserving data analysis,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2009, pp. 19–30.
[8] Ú. Erlingsson, V. Pihur, and A. Korolova, “RAPPOR: Randomized aggregatable privacy-preserving ordinal response,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2014, pp. 1054–1067.
[9] B. Ding, J. Kulkarni, and S. Yekhanin, “Collecting telemetry data privately,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 3574–3583.
[10] M. Joseph, A. Roth, J. Ullman, and B. Waggoner, “Local differential privacy for evolving data,” in Proc. 32nd Int. Conf. Neural Inf. Process. Syst., 2018, pp. 2381–2390.
[11] Ú. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, K. Talwar, and A. Thakurta, “Amplification by shuffling: From local to central differential privacy via anonymity,” in Proc. 30th Annu. ACM-SIAM Symp. Discrete Algorithms, 2019, pp. 2468–2479.
[12] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Minimax optimal procedures for locally private estimation,” J. Amer. Statist. Assoc., vol. 113, no. 521, pp. 182–201, 2018.
[13] T. Wang et al., “Continuous release of data streams under both centralized and local differential privacy,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2021, pp. 1237–1253.
[14] Z. Qin, Y. Yang, T. Yu, I. Khalil, X. Xiao, and K. Ren, “Heavy hitter estimation over set-valued data with local differential privacy,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2016, pp. 192–203.
[15] Y. Jafer, S. Matwin, and M. Sokolova, “Using feature selection to improve the utility of differentially private data publishing,” Procedia Comput. Sci., vol. 37, pp. 511–516, 2014.
[16] S. Su, S. Xu, X. Cheng, Z. Li, and F. Yang, “Differentially private frequent itemset mining via transaction splitting,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 7, pp. 1875–1891, Jul. 2015.
[17] P. Liu, M. Wang, J. Cui, and H. Li, “Top-k competitive location selection over moving objects,” Data Sci. Eng., vol. 6, no. 4, pp. 392–401, 2021.
[18] M. Abadi et al., “Deep learning with differential privacy,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2016, pp. 308–318.
[19] S. Tian, S. Mo, L. Wang, and Z. Peng, “Deep reinforcement learning-based approach to tackle topic-aware influence maximization,” Data Sci. Eng., vol. 5, no. 1, pp. 1–11, 2020.
[20] Q. Ye and H. Hu, “Local differential privacy: Tools, challenges, and opportunities,” in Proc. Int. Conf. Web Inf. Syst. Eng., 2020, pp. 13–23.
[21] P. Kairouz, K. Bonawitz, and D. Ramage, “Discrete distribution estimation under local privacy,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 2436–2444.
[22] R. Du, Q. Ye, Y. Fu, and H. Hu, “Collecting high-dimensional and correlation-constrained data with local differential privacy,” in Proc. Int. Conf. Sens., Commun., Netw., 2021, pp. 1–9.
[23] N. Wang et al., “Collecting and analyzing multidimensional data with local differential privacy,” in Proc. IEEE 35th Int. Conf. Data Eng., 2019, pp. 638–649.
[24] J. Duan, Q. Ye, and H. Hu, “Utility analysis and enhancement of LDP mechanisms in high-dimensional space,” in Proc. Int. Conf. Data Eng., 2022, arXiv:2201.07469.


[25] G. Fanti, V. Pihur, and Ú. Erlingsson, “Building a RAPPOR with the unknown: Privacy-preserving learning of associations and data dictionaries,” Proc. Privacy Enhancing Technol., vol. 2016, no. 3, pp. 41–61, 2016.
[26] Z. Zhang, T. Wang, N. Li, S. He, and J. Chen, “CALM: Consistent adaptive local marginal for marginal release under local differential privacy,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2018, pp. 212–229.
[27] G. Cormode, T. Kulkarni, and D. Srivastava, “Marginal release under local differential privacy,” in Proc. Int. Conf. Manage. Data, 2018, pp. 131–146.
[28] Z. Li, T. Wang, M. Lopuhaä-Zwakenberg, N. Li, and B. Škorić, “Estimating numerical distributions under local differential privacy,” in Proc. Int. Conf. Manage. Data, 2020, pp. 621–635.
[29] Q. Xue, Y. Zhu, and J. Wang, “Joint distribution estimation and naïve Bayes classification under local differential privacy,” IEEE Trans. Emerg. Topics Comput., vol. 9, no. 4, pp. 2053–2063, Sep.–Dec. 2019.
[30] T. Wang, N. Li, and S. Jha, “Locally differentially private frequent itemset mining,” in Proc. Symp. Secur. Privacy, 2018, pp. 127–143.
[31] Q. Ye, H. Hu, M. H. Au, X. Meng, and X. Xiao, “Towards locally differentially private generic graph metric estimation,” in Proc. Int. Conf. Data Eng., 2020, pp. 1922–1925.
[32] H. Sun et al., “Analyzing subgraph statistics from extended local views with decentralized differential privacy,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2019, pp. 703–717.
[33] Q. Ye, H. Hu, M. H. Au, X. Meng, and X. Xiao, “LF-GDPR: A framework for estimating graph metrics with local differential privacy,” IEEE Trans. Knowl. Data Eng., early access, Dec. 24, 2020, doi: 10.1109/TKDE.2020.3047124.
[34] Q. Ye, H. Hu, X. Meng, and H. Zheng, “PrivKV: Key-value data collection with local differential privacy,” in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 317–331.
[35] X. Gu, M. Li, Y. Cheng, L. Xiong, and Y. Cao, “PCKV: Locally differentially private correlated key-value data collection with optimized utility,” in Proc. 29th USENIX Secur. Symp., 2020, pp. 967–984.
[36] Q. Ye et al., “PrivKVM*: Revisiting key-value statistics estimation with local differential privacy,” IEEE Trans. Dependable Secure Comput., early access, Aug. 27, 2021, doi: 10.1109/TDSC.2021.3107512.
[37] A. Smith, A. Thakurta, and J. Upadhyay, “Is interaction necessary for distributed private learning?,” in Proc. IEEE Symp. Secur. Privacy, 2017, pp. 58–77.
[38] H. Zheng, Q. Ye, H. Hu, C. Fang, and J. Shi, “Protecting decision boundary of machine learning model with differentially private perturbation,” IEEE Trans. Dependable Secure Comput., vol. 19, no. 3, pp. 2007–2022, May/Jun. 2022.
[39] C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum, “Differential privacy under continual observation,” in Proc. 42nd ACM Symp. Theory Comput., 2010, pp. 715–724.
[40] L. Fan, L. Xiong, and V. Sunderam, “FAST: Differentially private real-time aggregate monitor with filtering and adaptive sampling,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2013, pp. 1065–1068.
[41] G. Kellaris, S. Papadopoulos, X. Xiao, and D. Papadias, “Differentially private event sequences over infinite streams,” Proc. VLDB Endowment, vol. 7, no. 12, pp. 1155–1166, 2014.
[42] Y. Cao, M. Yoshikawa, Y. Xiao, and L. Xiong, “Quantifying differential privacy under temporal correlations,” in Proc. IEEE 33rd Int. Conf. Data Eng., 2017, pp. 821–832.
[43] Q. Ye, H. Hu, N. Li, X. Meng, H. Zheng, and H. Yan, “Beyond value perturbation: Local differential privacy in the temporal setting,” in Proc. IEEE Conf. Comput. Commun., 2021, pp. 1–10.

Qiao Xue received the BE and PhD degrees from the Nanjing University of Aeronautics and Astronautics, China, in 2015 and 2020, respectively. She is currently a postdoctoral fellow with the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University. Her research interests include information security and data privacy.

Qingqing Ye (Member, IEEE) received the PhD degree in computer science from Renmin University of China, in 2020. She is a research assistant professor with the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University. She has received several prestigious awards, including China National Scholarship, Outstanding Doctoral Dissertation Award, and IEEE Security & Privacy Student Travel Award. Her research interests include data privacy and security, and adversarial machine learning.

Haibo Hu (Senior Member, IEEE) is an associate professor with the Department of Electronic and Information Engineering, Hong Kong Polytechnic University. His research interests include cybersecurity, data privacy, Internet of Things, and adversarial machine learning. He has published more than 100 research papers in refereed journals, international conferences, and book chapters. As principal investigator, he has received more than 20 million HK dollars of external research grants from Hong Kong and mainland China. He is the recipient of a number of titles and awards, including IEEE MDM 2019 Best Paper Award, WAIM distinguished young lecturer, ICDE 2020 outstanding reviewer, VLDB 2018 distinguished reviewer, ACM-HK Best PhD Paper, Microsoft Imagine Cup, and GS1 Internet of Things Award.

Youwen Zhu received the BE and PhD degrees in computer science from the University of Science and Technology of China, Hefei, China, in 2007 and 2012, respectively. He is currently a professor with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China. From 2012 to 2014, he was a JSPS postdoctoral fellow at Kyushu University, Japan. He has published more than 40 papers in refereed international conferences and journals, and has served as program committee member in several international conferences. His research interests include identity authentication, information security and data privacy.

Jian Wang received the PhD degree from Nanjing University, Nanjing, China, in 1998. He is currently a professor with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China. His research interests include cryptographic protocol and malicious tracking.
