Differentially Private Linear Regression with Linked Data
Abstract.
There has been increasing demand for establishing privacy-preserving methodologies for modern statistics and machine learning. Differential privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees. Recent work focuses primarily on developing differentially private versions of individual statistical and machine learning tasks, with nontrivial upstream pre-processing typically not incorporated. An important example is when record linkage is done prior to downstream modeling. Record linkage refers to the statistical task of linking two or more datasets of the same group of entities without a unique identifier. This probabilistic procedure brings additional uncertainty to the subsequent task. In this paper, we present two differentially private algorithms for linear regression with linked data. In particular, we propose a noisy gradient method and a sufficient statistics perturbation approach for the estimation of regression coefficients. We investigate the privacy-accuracy tradeoff by providing finite-sample error bounds for the estimators, which allows us to understand the relative contributions of linkage error, estimation error, and the cost of privacy. The variances of the estimators are also discussed. We demonstrate the performance of the proposed algorithms through simulations and an application to synthetic data.
Shurong Lin (1,*), Elliot Paquette (2), Eric D. Kolaczyk (2)
(1) Department of Mathematics and Statistics, Boston University
(2) Department of Mathematics and Statistics, McGill University
* [email protected]. The research was supported in part by the U.S. Census Bureau Cooperative Agreement CB20ADR0160001 and Canadian NSERC RGPIN-2023-03566. The authors would like to thank Adam Smith (Boston University) for all the helpful discussions and comments.
Keywords: differential privacy, record linkage, data integration, privacy-preserving record linkage, gradient descent
Media Summary
Differential privacy is a mathematical framework for ensuring the privacy of individuals in datasets. It mitigates the risk of disclosing sensitive information about individuals in the dataset during data analysis. Under this framework, we are interested in finding the relationship between two variables (via statistical regression) after they have been linked, with some uncertainty, from two data sources. The pre-processing procedure of linking datasets is called record linkage, and its uncertainties should be taken into account in the downstream analysis. In this article, we propose two algorithms that satisfy differential privacy for regression estimation problems with linked data. Theoretical results regarding privacy guarantees and statistical accuracy are provided. We demonstrate the performance of the proposed algorithms through simulations and an application.
1. Introduction
Data for the same group of entities are often scattered across different resources, lacking unique identifiers for perfect linkage. To conduct statistical modeling or inference on the integrated information, it is necessary to probabilistically link multiple datasets by comparing the common quasi-identifiers (e.g., names, gender, address) as a pre-processing step. Such a procedure is called record linkage (RL), also known as entity resolution or data matching (Christen2012data), and it is an essential component of data integration in big data analytics (Dong2015). Thanks to its wide application in many disciplines such as public health and official statistics, record linkage has been studied for decades. Earlier pioneering works include Newcombe1959; fellegi1969; Jaro1989AdvancesIR. Record linkage is also frequently used in current practice. The U.S. Census Bureau has a long tradition of using record linkage methodology for multiple endeavors. A current prominent example is the Decennial Census (census2022rl). In this context, record linkage involves using administrative records and other data sources to improve data quality, with efforts underway to construct a comprehensive “reference database” including individuals from multiple administrative records. A recent review paper, Olivier2022RL, provided a comprehensive summary of record linkage. Broadly speaking, there are two perspectives regarding record linkage (Chambers2019SmallAE): (1) the primary perspective concerns how to link the records; (2) the secondary perspective focuses on how to propagate the uncertainty to the downstream statistical learning tasks after the linkage has been determined. In this paper, we adopt the second of these two perspectives.
Closely related to record linkage is data privacy. In the area of privacy-preserving record linkage (PPRL), two or more private datasets owned by different organizations are linked without revealing the data to one another (Hall2010pprl; Christen2020). The outcome of PPRL is the information regarding which pairs or sets are matched. PPRL, in turn, is associated with secure multiparty computation (SMPC) in that SMPC techniques are commonly used to solve PPRL problems (Kuzu2013; He2017pprl; Rao2019). PPRL to date only engages in the private linkage process from a primary perspective, without concerning how the linkage uncertainties would impact the downstream analysis. On the contrary, the secondary perspective is to modify the statistical tools to account for the linkage uncertainty. Our goal is to incorporate privacy into the secondary perspective of record linkage, which is different from yet complementary to PPRL or SMPC.
Privacy concerns have, if anything, been significantly exacerbated by the emergence of individual-level big data. Releasing information about a sensitive dataset exposes it to a variety of privacy attacks (Dwork2017Exposed). Therefore, there has been a growing demand for establishing robust privacy-preserving methodologies for modern statistics and machine learning. A mathematical framework proposed by Dwork2006, differential privacy (DP), is now considered the gold standard for rigorous privacy protection and has made its way to broad application in industry (apple2017dp; ms2020dp; google2021dp) and the public sector (census2021dp). The literature on differential privacy has been flourishing in recent years and the interface of differential privacy and statistics has started to draw increasing attention from the statistics community.
Recent work on differential privacy focuses primarily on individual statistical and machine learning tasks, with nontrivial upstream pre-processing, such as record linkage, typically not incorporated. In this paper, we consider the linear regression problem, i.e.,
$$ y = X\beta + e, \tag{1} $$
but where $X$ and $y$ are observed in two separate datasets. As a result, rather than having $X$ and $y$ in hand, we are instead provided with a pair $X$ and $z$. Here $z$ is a permutation of $y$ resulting from record linkage performed by an external entity, who also supplies a minimum amount of information about the linkage accuracy. In the regression procedure, we take into account the linkage uncertainty as well as offer differential privacy guarantees. As shown in Figure 1, which depicts the pipeline of the problem we consider, we assume that an external analyst conducts record linkage a priori. From there, we aim to devise a private estimator for the regression coefficients of ultimate interest with the help of differential privacy.
Specifically, we propose two algorithms for linear regression after record linkage to meet differential privacy: (1) post-RL noisy gradient descent (NGD), and (2) post-RL sufficient statistics perturbation (SSP). Our work builds on the seminal work by lahiri2005 where an estimator is proposed for linear regression with linked data in a non-privacy-aware setting. We construct a private estimator, , by deploying differential privacy tools to achieve privacy protections. To the best of our knowledge, our work is the first one in the literature to consider a statistical model after record linkage in a privacy-aware setting.
The two proposed algorithms also extend the noisy gradient method (Bassily2014PrivateER) and the “Analyze Gauss” algorithm (DworkTT014), which are applied to linear regression, to additionally handle the presence of linkage errors. Prior works (Sheffet17; wang2018; Bernstein2019; Cai2019TheCO; Alabi2022) on differentially private linear regression do not consider possible record linkage pre-processing. If the data are linked beforehand, directly applying their algorithms to the imperfectly linked data is not ideal. It is well known that overlooking the linkage errors leads to substantial bias even with a high linkage accuracy (Neter1965TheEO; Scheuren1993). Figure 2 showcases a toy example of record linkage, where mismatches, if treated as true, change the sign of the slope estimate. Our illustrative application later in the paper confirms this, where around 90% of the records are correctly linked, and the estimators ignoring linkage errors end up with large biases.
The true dataset is , yielding a slope estimate , while the linked set is given by , yielding .
Accompanying the estimators resulting from our algorithms, we provide mean-squared error bounds under typical regularity assumptions and record linkage schemes. When no linkage errors are present (i.e., a special case in our scenario), our result in Theorem 4.4 improves upon the noisy gradient method proposed in Cai2019TheCO by using zero-concentrated differential privacy (zCDP, Bun2016ConcentratedDP) to enable tighter bounds on privacy cost (see Lemma 2.3). Additionally, we present (approximate) theoretical variances for the estimators resulting from both proposed algorithms. There appear to be very few other works that have addressed the issue of uncertainty. Two that we are aware of are Alabi2022, who provided confidence bounds for the univariate case, and Sheffet17, who provided confidence intervals dependent on differential privacy noise. Our work focuses on the multivariate case and appears to be the first to directly work on exact variances rather than relying on bounds.
The remainder of this paper is organized as follows. Section 2 provides preliminaries on linear regression with linked data and differential privacy. We propose our two algorithms in Section 3 and present the relevant theoretical results in Section 4. In Section 5, we conduct a series of simulation studies and an application to synthetic data. Section 6 concludes and discusses future work. Complete proofs of all theorems can be found in the supplementary materials.
2. Preliminaries
In this section, we review the background results on linear regression after record linkage upon which we build our work, together with fundamental concepts from differential privacy. Related work on linear regression with linked data and on record linkage with privacy awareness is also discussed.
2.1. Linear Regression with Record Linkage
Let $X$ and $y$ be recorded in two datasets that refer to the same group of $n$ entities, with unknown one-to-one correspondence. The quasi-identifiers in the two datasets are used to perform the linkage procedure. Let $(X, z)$ be the linked data where $z$ is a permutation of $y$. Consider the following model for $z$:
$$ z_i = \begin{cases} y_i & \text{with probability } q_{ii}, \\ y_j & \text{with probability } q_{ij}, \quad j \neq i, \end{cases} \tag{2} $$
then $\sum_{j=1}^{n} q_{ij} = 1$ for all $i$ and $\sum_{i=1}^{n} q_{ij} = 1$ for all $j$. Thus, $q_{ii}$ is the probability of the $i$th record being linked correctly. Let $Q = (q_{ij})$, which we call the matching probability matrix (MPM), a doubly stochastic matrix. The matrix $Q$ can be estimated, for example, through bootstrapping (Chipperfield2015; Chipperfield2020). In some cases, estimating $Q$ can require inference on only a single parameter (e.g., in the exchangeable linkage error (ELE) model described in Section 2.1.1).
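For the fixed-design homoskedastic linear model (1), when inference is done after record linkage based on $(X, z)$, lahiri2005 proposed an unbiased estimator
$$ \hat{\beta}_{RL} = (W^\top W)^{-1} W^\top z, \tag{3} $$
where $W = QX$. Let $q_i^\top$ be the $i$-th row vector of $Q$; then the $i$-th row of $W$ is $w_i^\top = q_i^\top X$. Note that $E(z_i) = w_i^\top \beta$, where the expectation is taken over both linkage uncertainties and $e$. Transforming $X$ into $W$ offers bias correction for regression estimation after record linkage.
In addition, the variance of $\hat{\beta}_{RL}$ is given by
$$ \Sigma_{RL} = (W^\top W)^{-1} W^\top \Sigma_z W (W^\top W)^{-1}, \tag{4} $$
where $\Sigma_z = \mathrm{Var}(z)$. lahiri2005 provide the following characterization of the first two moments of $z$.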
Lemma 2.1 (Theorem A.1, lahiri2005).
Note that involves the true coefficients and where is a function of as elaborated in Lemma 2.1. Compared to the covariance of , has an additional component due to the uncertainty of record linkage.
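The bias correction obtained by regressing on the transformed design can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes the estimator is the least-squares fit of the linked response on the product of the MPM and the design matrix, and it assumes an ELE-type MPM with a known linkage accuracy. The names `ele_mpm`, `mislink`, `rl_estimator`, `naive_ols`, and `demo` are hypothetical, and the cyclic-shift mislinking is only a convenient stand-in for a real linkage process.

```python
import numpy as np


def ele_mpm(n, gamma):
    """Assumed ELE-type MPM: gamma on the diagonal, (1 - gamma)/(n - 1) elsewhere."""
    Q = np.full((n, n), (1.0 - gamma) / (n - 1))
    np.fill_diagonal(Q, gamma)
    return Q


def mislink(y, gamma, rng):
    """Return a permutation z of y; each record is mislinked with probability 1 - gamma."""
    z = y.copy()
    wrong = np.flatnonzero(rng.random(y.size) > gamma)
    if wrong.size >= 2:
        wrong = rng.permutation(wrong)
        # Cyclic shift among the mislinked records, so none keeps its own response.
        z[wrong] = y[np.roll(wrong, 1)]
    return z


def rl_estimator(X, z, Q):
    """Least-squares fit of z on the transformed design W = Q X."""
    W = Q @ X
    return np.linalg.solve(W.T @ W, W.T @ z)


def naive_ols(X, z):
    """OLS that treats every link as if it were correct."""
    return np.linalg.solve(X.T @ X, X.T @ z)


def demo(n=2000, gamma=0.8, seed=0):
    """Simulate one linked dataset and return (true beta, naive fit, RL fit)."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta = np.array([1.0, 2.0])
    y = X @ beta + rng.normal(scale=0.5, size=n)
    z = mislink(y, gamma, rng)
    return beta, naive_ols(X, z), rl_estimator(X, z, ele_mpm(n, gamma))
```

Calling `demo()` returns the true coefficients together with both fits. In this setup the naive slope is attenuated toward zero by roughly the linkage accuracy, while the transformed fit stays close to the true slope.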
2.1.1. Structural Schemes of MPM
The matching probability matrix (MPM) is generally assumed to have a simple structure. Two schemes used commonly in the literature are as follows.
Blocking Scheme
It is assumed that the MPM is a block diagonal matrix, which means the true matches only happen within blocks. Blocking significantly reduces the number of pairs for comparison and allows scalable record linkage. This scheme is used in almost all real-world applications, and different methods for blocking have been developed (Christen2012data; Steorts2014a; Christophides2020an).
Exchangeable Linkage Errors (ELE) Model
The ELE model (Chambers2009RegressionAO) assumes homogeneous linkage accuracy and errors:
$$ q_{ii} = \gamma, \qquad q_{ij} = \frac{1-\gamma}{n-1} \quad \text{for } j \neq i. \tag{5} $$
The ELE model has been adopted in recent works, such as Chambers2019SmallAE; Chambers2022, for various estimation problems. Even though (5) may oversimplify the reality, it is a representative model for a secondary analyst who has minimum information about the linkage quality. When blocking is used, the homogeneous linkage accuracy assumption is imposed within individual blocks. In other words, it still allows heterogeneous linkage accuracy between blocks.
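To make the combination of blocking and the ELE model concrete, here is a small sketch that assembles a block-diagonal MPM with one linkage accuracy per block. The helper names `ele_block` and `blocked_ele_mpm` are hypothetical, and the off-diagonal value within a block of size m is assumed to be one minus the accuracy, divided by m - 1, which is what homogeneity plus double stochasticity imply.

```python
import numpy as np


def ele_block(m, gamma):
    """ELE block of size m: accuracy gamma on the diagonal, equal error mass off it."""
    if m == 1:
        return np.ones((1, 1))  # a singleton block can only be linked to itself
    B = np.full((m, m), (1.0 - gamma) / (m - 1))
    np.fill_diagonal(B, gamma)
    return B


def blocked_ele_mpm(block_sizes, gammas):
    """Block-diagonal MPM with a (possibly different) accuracy in each block."""
    n = sum(block_sizes)
    Q = np.zeros((n, n))
    start = 0
    for m, g in zip(block_sizes, gammas):
        Q[start:start + m, start:start + m] = ele_block(m, g)
        start += m
    return Q


# Two blocks of sizes 3 and 2 with accuracies 0.9 and 0.7.
Q = blocked_ele_mpm([3, 2], [0.9, 0.7])
```

Rows and columns of `Q` each sum to one, and entries linking records from different blocks are exactly zero, which is what allows heterogeneous accuracy between blocks.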
2.2. Differential Privacy
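Let $\mathcal{X}$ be some data space, and $D, D' \in \mathcal{X}^n$ be two neighboring datasets of size $n$ which only differ in one record. Such a relation is denoted by $D \sim D'$.
Definition 1 ($(\varepsilon, \delta)$-DP, DworkR14).
For $\varepsilon > 0$ and $\delta \in [0, 1)$, a randomized algorithm $\mathcal{M}$: $\mathcal{X}^n \to \mathcal{R}$ is $(\varepsilon, \delta)$-differentially private if, for all $D \sim D'$ and any measurable $S \subseteq \mathcal{R}$,
$$ P(\mathcal{M}(D) \in S) \le e^{\varepsilon} P(\mathcal{M}(D') \in S) + \delta. \tag{6} $$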
The expression (6) controls the distance between the output distributions on two neighboring datasets through the privacy budget and . Intuitively, differential privacy ensures that is not distinguishable from based on the outputs. Thus, should be small enough for the privacy level to be meaningful. Typically, and .
Differential privacy enjoys the following properties that facilitate the construction of differentially private algorithms.
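Proposition 2.1 (Basic composition, DworkR14).
If $\mathcal{M}_1$ is $(\varepsilon_1, \delta_1)$-DP and $\mathcal{M}_2$ is $(\varepsilon_2, \delta_2)$-DP, then $(\mathcal{M}_1, \mathcal{M}_2)$ is $(\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2)$-DP.
Proposition 2.2 (Post-processing, DworkR14).
If $\mathcal{M}$ is $(\varepsilon, \delta)$-DP, then for any deterministic mapping $g$ that takes $\mathcal{M}(D)$ as an input, $g(\mathcal{M}(D))$ is $(\varepsilon, \delta)$-DP.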
Generally, a differentially private algorithm is constructed by adding random noise from a certain structured distribution, such as the Laplace or Gaussian distributions. A notion central to the amount of noise we add is the sensitivity of the estimation function we desire to release privately.
Definition 2 ($\ell_2$-sensitivity).
Let $f: \mathcal{X}^n \to \mathbb{R}^d$ be an algorithm. The $\ell_2$-sensitivity of $f$ is defined as
$$ \Delta_f = \sup_{D \sim D'} \| f(D) - f(D') \|_2. \tag{7} $$
The sensitivity of a function characterizes how much the output would change if one record in the dataset changes. To achieve $(\varepsilon, \delta)$-DP, the amount of noise we need depends on both the budget and the sensitivity. The Gaussian Mechanism, a canonical example that will be employed herein, does just that.
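Lemma 2.2 (Gaussian mechanism, DworkR14).
Let $\varepsilon \in (0, 1)$ and $\delta \in (0, 1)$. For an algorithm $f$ on the dataset $D$, the Gaussian Mechanism defined as
$$ \mathcal{M}(D) = f(D) + \xi, \tag{8} $$
where $\xi \sim N(0, \sigma^2 I_d)$ with $\sigma = \Delta_f \sqrt{2 \ln(1.25/\delta)} / \varepsilon$, is $(\varepsilon, \delta)$-DP.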
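As an illustration of the Gaussian mechanism, the sketch below releases a noisy version of a vector-valued statistic. It assumes the classical calibration in which the noise standard deviation equals the sensitivity times the square root of 2 ln(1.25/delta), divided by epsilon, with epsilon restricted to (0, 1). The function name `gaussian_mechanism` and its arguments are hypothetical.

```python
import numpy as np


def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng):
    """Add isotropic Gaussian noise calibrated to the l2-sensitivity of `value`.

    Uses the classical calibration, which is stated for epsilon in (0, 1).
    Returns the noisy value and the noise standard deviation that was used.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("classical calibration requires 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    value = np.asarray(value, dtype=float)
    return value + rng.normal(0.0, sigma, size=value.shape), sigma


# Example: private mean of n values clipped to [0, 1]. Replacing one record
# moves the mean by at most 1/n, so the l2-sensitivity is 1/n.
rng = np.random.default_rng(0)
data = np.clip(rng.normal(0.5, 0.2, size=1000), 0.0, 1.0)
noisy_mean, sigma = gaussian_mechanism([data.mean()], 1.0 / data.size, 0.5, 1e-5, rng)
```

The clipping step is what makes the sensitivity finite, and it plays the same role as the projection of the response in the algorithms of Section 3.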
Combining the basic composition rule and the Gaussian mechanism, for a sequence of functions , let
where is the -sensitivity of . Then, satisfies -DP. However, as increases, this construction tends to add more noise than necessary due to the loose composition. Instead, we could utilize zero-concentrated differential privacy (zCDP, Bun2016ConcentratedDP), another variant of DP, to achieve tighter composition for -DP. The following Lemma essentially captures the results from Bun2016ConcentratedDP, formulated for our purposes.
Lemma 2.3 (Better composition for -DP via zCDP).
Let . For a sequence of functions , let
(9) |
with . Then, the randomized algorithm satisfies -DP. If , it suffices to have
(10) |
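The gain from composing through zCDP rather than through basic composition can be checked numerically. The sketch below relies only on standard facts from Bun2016ConcentratedDP: Gaussian noise with standard deviation sigma on a function with sensitivity Delta gives rho-zCDP with rho equal to Delta squared over 2 sigma squared, rho adds up under composition, and rho-zCDP implies (rho + 2 sqrt(rho ln(1/delta)), delta)-DP. The constants may therefore differ from those in (9) and (10), and all function names are hypothetical.

```python
import math


def sigma_basic(sens, eps, delta, T):
    """Per-step noise sd when each of T steps is (eps/T, delta/T)-DP (basic composition)."""
    eps_t, delta_t = eps / T, delta / T
    return sens * math.sqrt(2.0 * math.log(1.25 / delta_t)) / eps_t


def eps_from_rho(rho, delta):
    """(eps, delta)-DP guarantee implied by rho-zCDP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def rho_from_budget(eps, delta):
    """Largest rho whose implied epsilon equals eps (solves a quadratic in sqrt(rho))."""
    log_term = math.log(1.0 / delta)
    return (math.sqrt(log_term + eps) - math.sqrt(log_term)) ** 2


def sigma_zcdp(sens, eps, delta, T):
    """Per-step noise sd when the total zCDP budget rho is split evenly over T steps."""
    rho_t = rho_from_budget(eps, delta) / T
    return sens / math.sqrt(2.0 * rho_t)


# Compare the two calibrations for T = 100 steps, eps = 1, delta = 1e-5.
basic = sigma_basic(1.0, 1.0, 1e-5, 100)
zcdp = sigma_zcdp(1.0, 1.0, 1e-5, 100)
```

In this configuration the zCDP calibration needs roughly an order of magnitude less noise per step, because its noise scale grows like the square root of the number of steps rather than linearly.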
2.3. Related Work
Linear regression with linked data is a fundamental statistical task that has been explored in various articles. Scheuren1993 initially considered the linkage model (2) for linear regression and proposed an estimator that is not generally unbiased. Later, lahiri2005 introduced an exactly unbiased OLS-like estimator given in (3) with an expression for the variance, which outperformed the approach by Scheuren1993. Besides, Chambers2009RegressionAO; Zhang2021linkage offered a few other estimators. According to their simulation studies, some of the estimators provided performance that was at most similar, but not noticeably better, compared to the one proposed by lahiri2005. Yet, Zhang2021linkage relaxed the condition by not assuming that the probability of correct linkage, in the model (2), can be obtained or estimated. For more extensive reviews of this literature, Wang2022reg gave an account of the recent development of various methods on regression analysis with linked datasets. Chambers2022 reviewed current research on robust regression of linked data.
On the other hand, there is ongoing research on privacy-preserving record linkage (PPRL) in the field of computer science. PPRL aims to privately link multiple sensitive datasets held by different organizations when they are unwilling or not permitted to share their data with external parties due to privacy and confidentiality concerns. To achieve privacy protection, techniques such as SMPC and DP are combined with machine learning and deep learning methods for conducting PPRL (Christen2020; Divanis2021; Ranbaduge2022PPRL). PPRL primarily concerns data leakage during the linkage process and produces a linked dataset that can be used for further analysis, yet most applications treat the linked data as if there were no linkage errors. Neither the uncertainty propagation nor private release of the downstream analysis is considered within the scope of PPRL.
Note that there are several articles on privacy-preserving analysis on vertically partitioned databases. In these databases, the attributes are distributed among multiple parties, but common unique identifiers exist to facilitate data linkage across the different parties. Unlike probabilistic record linkage, vertically partitioned databases do not involve linkage errors. Du2004; Sanil2004; Hall2011SecureML; Gasc2017 discussed the implementations of privacy-preserving linear regression protocols that prevent data disclosure across organizations, whereas Dwork2004PPVRD considered data mining from the perspective of the private release of statistical querying in a spirit similar to our work.
3. Differentially Private Algorithms
The unbiased and simply structured estimator provided in (3) with a known closed-form variance makes it a suitable prototype to construct our private estimators. We introduce two differentially private algorithms in the following, based on (1) noisy gradient descent, and (2) sufficient statistics perturbation. As the names suggest, we mitigate privacy risk by perturbing either the gradient or sufficient statistics during the computation of the linear model. Hereafter, if not specified otherwise, $\|\cdot\|$ denotes the 2-norm.
3.1. Post-RL Noisy Gradient Descent
Gradient descent methods are ubiquitous in scientific computing for numerous optimization problems. Within the framework of differential privacy, Bassily2014PrivateER provided a noisy variant of the classic gradient descent algorithm. It was later adapted by Cai2019TheCO to solve the classic linear regression problem with faster convergence. Leveraging the work by Bassily2014PrivateER; Cai2019TheCO, we tailor the noisy gradient method for the post-RL linear regression model for based on (1) and (2).
Let be the loss function, where recall . The minimizer of is the non-private RL estimator proposed by lahiri2005. Let denote the projection of onto the ball . The post-RL noisy gradient descent (NGD) algorithm is defined as follows.
(11) |
Algorithm 1 is a modified version of the projected gradient descent that incorporates (1) post-RL transformation of the design matrix, (2) addition of noise at each gradient step, and (3) use of projection on the response variable. The regular parameters, including , and for the projected gradient method, are specified in Theorem 4.4 for the discussion of the accuracy of . The injection of noise follows Lemma 2.3. The scale of the Gaussian noise at step depends on the privacy budget (), and the noise scale factor associated with the sensitivity in the update function (11). The purpose of the projection on is to bound the sensitivity of the gradient. With a proper choice of that scales up with (specified in Section 4), the projection does not affect the accuracy of the final estimator with high probability.
The major challenge lies in calculating the sensitivity. In non-RL least-squares regression, two neighboring datasets and differ in a single row, making it straightforward to derive the sensitivity of the gradient of . Here, in the context of post-RL analysis, we consider two neighboring datasets containing both linking variables and regression variables, denoted as and , which differ in the record of one individual. A change in one row of the quasi-identifiers and may affect more than one row of the matching probability matrix . As a result, the entries of the transformed design matrix subject to change are not limited to one row as in the non-RL case. Consequently, determining the sensitivity of the gradient of becomes non-trivial. This challenge distinguishes our work from Cai2019TheCO. However, we will demonstrate in Section 4 that, under a condition on the structure of , the sensitivity can be tracked.
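The structure of the post-RL noisy gradient method described in this subsection can be sketched as follows. This is an illustration under stated assumptions, not Algorithm 1 itself: the loss is taken to be the average squared error of the truncated response on the transformed design, and the per-step noise standard deviation `noise_sd` is supplied by the caller rather than calibrated from the sensitivity and the privacy budget, so the sketch certifies no privacy level on its own. The names `project_ball` and `post_rl_ngd` are hypothetical.

```python
import numpy as np


def project_ball(v, radius):
    """Euclidean projection of v onto the ball of the given radius."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)


def post_rl_ngd(X, z, Q, step, n_iter, radius, trunc, noise_sd, rng):
    """Projected gradient descent on the transformed design with Gaussian noise per step.

    noise_sd is an input here; a private run would derive it from the gradient
    sensitivity and the privacy budget, which this sketch does not do.
    """
    n, d = X.shape
    W = Q @ X                             # post-RL transformation of the design
    z_trunc = np.clip(z, -trunc, trunc)   # truncation bounds the response
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = W.T @ (W @ beta - z_trunc) / n
        noise = rng.normal(0.0, noise_sd, size=d)
        beta = project_ball(beta - step * grad + noise, radius)
    return beta
```

With `noise_sd=0`, an identity MPM, and a truncation level that never binds, the iterates reduce to ordinary projected gradient descent and converge to the OLS fit, which gives a convenient sanity check before any noise is switched on.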
3.2. Post-RL Sufficient Statistics Perturbation
Noise can be injected into the process besides the gradient computation. Since the estimator interacts with the data through its (joint) sufficient statistics, an efficient way is to perturb the sufficient statistics to protect the data. Such a technique, sufficient statistics perturbation (SSP), has been used in previous works such as Slavkovic2009; Foulds2016; wang2018. For the non-private OLS estimator , to perturb the joint sufficient statistics , it suffices to add noise to where is the augmented matrix. DworkTT014 offered an algorithm, “Analyze Gauss”, to privately release . It was later utilized by Sheffet17 for private linear regression, primarily perturbing the sufficient statistics.
In our work, we adapt the “Analyze Gauss” algorithm to linear regression after record linkage, as shown in Algorithm 2. The noise scale factor is the sensitivity of which is specified in Section 4. The gram matrix exhibits properties that facilitate the computation of its sensitivity. Algorithm 2 illustrates how incorporating the joint sufficient statistics in a comprehensive form facilitates the deployment of differential privacy.
Remark 3.1.
In step 4, by post-processing, checking for singularity of consumes no extra privacy budget. In fact, the probability of being singular decreases exponentially as the sample size increases.
An alternative approach to implementing the SSP method is to add random noise separately to each sufficient statistic. In this approach, the total privacy budget should be divided between and for the estimation of linear regression, as proposed by wang2018. However, treating the joint statistics as a whole is more economical in terms of budgeting in general. Lin2023 showed through comparison that splitting the total budget among the components results in introducing larger noise on average. Although adding noise individually to the components of interest allows for the private release of each quantity, it is not part of the goal of the estimation.
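The joint perturbation of the sufficient statistics can be sketched in the same spirit. The code below forms the Gram matrix of the transformed design augmented with the truncated response, adds a symmetric Gaussian noise matrix in the style of “Analyze Gauss”, and solves the noisy normal equations. It is a sketch under assumptions rather than Algorithm 2: `noise_sd` is passed in instead of being derived from the sensitivity of the Gram matrix, and the names `symmetric_gaussian_noise` and `post_rl_ssp` are hypothetical.

```python
import numpy as np


def symmetric_gaussian_noise(k, sd, rng):
    """Symmetric k x k matrix whose upper-triangular entries are i.i.d. N(0, sd^2)."""
    E = rng.normal(0.0, sd, size=(k, k))
    return np.triu(E) + np.triu(E, 1).T


def post_rl_ssp(X, z, Q, trunc, noise_sd, rng):
    """Perturb the Gram matrix of [W | truncated z] once, then solve for beta.

    Returns None when the noisy W'W block is singular; that check only
    post-processes the noisy matrix.
    """
    d = X.shape[1]
    W = Q @ X
    A = np.column_stack([W, np.clip(z, -trunc, trunc)])
    noisy_gram = A.T @ A + symmetric_gaussian_noise(d + 1, noise_sd, rng)
    wtw = noisy_gram[:d, :d]   # noisy version of W'W
    wtz = noisy_gram[:d, d]    # noisy version of W'z
    if np.linalg.matrix_rank(wtw) < d:
        return None
    return np.linalg.solve(wtw, wtz)
```

Because a single noise matrix covers both blocks of the Gram matrix, no budget has to be split between the two statistics, which is the economy argued for in the paragraph above.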
4. Theoretical Results
In this section, we provide the theoretical results of the two algorithms introduced in Section 3. The results are threefold: (1) differential privacy guarantees, (2) finite-sample error bounds, and (3) variances of the private estimators. We present each of these along with a discussion of the corresponding conditions as they relate to the main variables in our record linkage model. All proofs for these results can be found in the supplementary materials.
4.1. Privacy Guarantees
The algorithms are designed to achieve certain privacy guarantees, given the corresponding sensitivity, for the post-RL case:
Theorem 4.1 (Privacy Guarantees).
Assume the following boundedness conditions hold:
(A1) There is a constant such that .
(A2) Let and be the matching probability matrices (MPMs) resulting from the neighboring datasets and and let denote such a relation. We assume that for some constant , where is the entry-wise 1-norm.
Essentially, we assume that the data domain is bounded, which is critical for deriving a finite sensitivity of the target function on the data. (A1) is a standard assumption for a bounded design . For the linking variables that are generally categorical, there are no analogous definitions of “norm” for numerical vectors. Instead, (A2) is imposed on the MPM since it summarizes all the information of the linking variables in the linkage model we consider. Specifically, we assume that two MPMs produced by two neighboring datasets do not differ much in terms of the entry-wise 1 norm. This assumption characterizes a bounded linkage model.
The rationale of (A2) is supported by typical schemes imposed on the structures of MPM in practice, as reviewed in Section 2.1.1. For example, with the blocking scheme, the size of each block is manageably small (O(1)). When one record is altered, the fluctuation of the MPM is limited to at most two blocks. Additionally, with the ELE model (5), as long as the changes to a single record only affect a finite number of records, the linkage accuracy changes at most . Therefore, we have . In general, a robust record linkage approach should not produce two considerably different MPMs from two neighboring datasets. Therefore, it is realistic to assume a bounded linkage model.
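The boundedness in (A2) can be checked by hand for an ELE block. Assuming the ELE form with accuracy on the diagonal and equal error mass elsewhere, changing the accuracy of a block of size m by some amount changes m diagonal entries by that amount and m(m - 1) off-diagonal entries by that amount divided by m - 1, so the entry-wise 1-norm of the difference is 2m times the change. A change of order 1/m therefore yields a bounded difference regardless of m. The helper names below are hypothetical.

```python
import numpy as np


def ele_block(m, gamma):
    """ELE block of size m (m >= 2): gamma on the diagonal, (1 - gamma)/(m - 1) elsewhere."""
    B = np.full((m, m), (1.0 - gamma) / (m - 1))
    np.fill_diagonal(B, gamma)
    return B


def entrywise_l1(A, B):
    """Entry-wise 1-norm of A - B."""
    return np.abs(A - B).sum()


# Accuracy drops by 1/m after one record changes; the norm stays at 2 for every m.
diffs = {}
for m in (10, 50, 250):
    gamma, gamma_new = 0.9, 0.9 - 1.0 / m
    diffs[m] = entrywise_l1(ele_block(m, gamma), ele_block(m, gamma_new))
```

Each entry of `diffs` equals 2 up to floating-point error, illustrating why a constant bound on the MPM difference is plausible under this model.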
The proof of Theorem 4.1 revolves around calculating the sensitivity of the target function in each algorithm. Besides the upper bounds and discussed above, the sensitivity also depends on the truncation level on the response. Truncation is commonly used in DP algorithm designs when there are no a priori bounds on the relevant quantities (e.g., Abadi2016). In Section 4.2, we provide a specific choice of and present an accuracy statement with high probability.
4.2. Finite-Sample Error Bounds
We study the accuracy of the proposed estimators by deriving the finite-sample error bounds. In the following, we introduce two more assumptions in addition to (A1) and (A2):
(A3) The true parameter satisfies for some constant .
(A4) The minimum and maximum eigenvalues of satisfy
(14) |
for some constant .
Assumption (A4) implies the smoothness and strong convexity of the loss function , which allows for a fast convergence rate for the gradient descent method in Algorithm 1. On the other hand, for Algorithm 2, note that the term is a component of sufficient statistics. Assumption (A4) offers a bound on the norm of , which helps derive the error bound of . Let Assumption (A4’) be (A4) with replaced by and the constant replaced by . The larger of and can be chosen as the constant to satisfy both (A4) and (A4’). Therefore, for convenience, we consider (A4) and (A4’) to be the same assumption. We first obtain the accuracy of the non-private estimators, for comparison purposes.
Lemma 4.2.
Let be the OLS estimator. Then, under (A4), it follows that .
Lemma 4.3.
Let be the non-private record linkage estimator, and be the covariance matrix of . Then,
(15) |
where .
As a special case, when the linkage is perfect (i.e., is an identity matrix), the expected error of in (15) takes the reduced form which is exactly the lower bound obtained by . Then, by Lemma 4.2, we know that is of order at least under (A4). From a secondary perspective regarding record linkage, it is beyond our scope to study how behaves in general.
For the two proposed algorithms, we present upper bounds of the excess squared error of the private estimators, namely, .
Theorem 4.4 (Post-RL NGD).
Given the linked data and the matching probability matrix for the regression problem in (1), set the parameters of Algorithm 1 as follows:
• step size , number of iterations , feasibility , initialization ;
• truncation level ;
• noise scale factor ;
Under Assumptions (A1)-(A4), given , with probability at least where are constants (see the proof), it follows that
(16) |
Theorem 4.5 (Post-RL SSP).
In both algorithms, the response is projected with a level where is the homoskedastic variance of the random error in linear model (1). Let , then is a high-probability event. The error bound is analyzed under , thus we obtain a statement with high probability.
In the NGD method, the bound consists of two parts on the RHS in (16). The first error term results from the convergence rate of gradient descent after iterations. The second error term is due to the addition of Gaussian noise for privacy and thus involves . It is worth noting that the choice in theory is, to some extent, conservative to ensure the first error term is , which is the same order as . However, more iterations give rise to larger random noise being added to gradient updates due to a smaller privacy budget per iteration. In practice, a smaller number of iterations may be favored for the tradeoff (see the experiment in Section 5.2), especially when is not sufficiently large.
For the SSP algorithm, the convergence rate in (17) depends on similar variables as in the NGD algorithm. The major difference is that it is controlled by instead of due to the sensitivity of the gram matrix defined in Section 3.2. However, the SSP method has a faster convergence rate when is sufficiently large. As a result, the SSP estimator is more susceptible to a large variance of the random error in the response variable whereas the NGD method is more robust. As we shall see in Section 5, the performance of the two algorithms is different under various scenarios.
Putting together Lemma 4.3 and Theorems 4.4 and 4.5, we obtain a high probability error bound for each algorithm as follows.
Corollary 4.1.
Under the regularity conditions (A1)-(A4),
(i) (Post-RL NGD)
(18) |
with probability at least .
(ii) (Post-RL SSP)
(19) |
with probability at least .
4.3. Variances
As discussed in the Introduction, although a few works (Alabi2022; Sheffet17) have addressed the uncertainty of DP estimators through confidence bounds and intervals, the exact variance of DP estimators is rarely determined. Recent work, such as Lin2023, has explored the variance of private estimators for population proportions, which have fairly simple structures. The main barrier to the inspection of variance is that if the noise is injected into intermediate steps of the estimation process rather than the output, then it is difficult to track the variability that the noise introduces to the output estimator due to the intricate nature of the algorithm.
The NGD and SSP algorithms are two examples where noise is added in the middle of the estimation process. The operations like function composition and taking the inverse complicate the inspection of the variance of the output estimator . To address this issue, we investigate the variance of for the two algorithms by studying the variances of two proxy estimators. The theoretical variances of the proxy estimators can be used to approximate those of .
Theorem 4.6 (Variance for Post-RL NGD).
In Algorithm 1, if we consider the estimator without projections
(20) |
then the variance of the th iterate is given by
(21) |
where is the identity matrix of size , , , and is the variance of .
Remark 4.1.
The estimator in Algorithm 1 is a projected variant of (20). The use of projection with level on in (11) impedes the exact analysis of variance for . Instead, we provide the variance in (21) for the non-projected estimator as a conservative variance for . The level of projection, the scale of noise, and the number of iterations together determine how conservative it is. From Remark 4.1, we know that as increases, the first term in the RHS of (21) is getting close to . The second term, , then summarizes the cumulative variability resulting from adding random noise at each iteration. Note that this term does not converge by simply increasing , due to the fact that a smaller budget leads to larger noise at each iteration.
Theorem 4.7 (Variance for Post-RL SSP).
For Algorithm 2, let , then as . The variance of is given by
(22) |
where and the entries of , and are given by
• for ; for .
• for ; for .
• for ; for where .
Remark 4.2.
As shown in the proof of Theorem 4.7 (see the supplemental material), the proxy estimator is a first-order approximation for , obtained from the Taylor series of the term which appears in the decomposition of .
The variance of also consists of two parts: the variance of the non-private estimator and the additional variation due to the noise injected for privacy purposes. Given Assumption (A4), we have that appears in and . As increases, the dominant component of the second term would be .
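To see what a first-order approximation of this kind discards, the sketch below assumes that the expanded term is the inverse of a perturbed matrix, (A + E)^{-1}, and compares it with its first-order Taylor approximation A^{-1} - A^{-1} E A^{-1}. The matrices A and E are hypothetical, and the function names are ours.

```python
import numpy as np

def first_order_inverse(A_inv, E):
    """First-order Taylor approximation of (A + E)^{-1} around A."""
    return A_inv - A_inv @ E @ A_inv

A = np.array([[2.0, 0.5], [0.5, 1.0]])      # hypothetical well-conditioned matrix
E = np.array([[0.3, -0.1], [-0.1, 0.2]])    # hypothetical symmetric perturbation
A_inv = np.linalg.inv(A)

def approx_error(scale):
    """Spectral-norm error of the first-order approximation for perturbation scale * E."""
    exact = np.linalg.inv(A + scale * E)
    return np.linalg.norm(exact - first_order_inverse(A_inv, scale * E), 2)

for scale in (0.1, 0.01):
    print(scale, approx_error(scale))
```

Shrinking the perturbation by a factor of 10 shrinks the error by roughly a factor of 100, since the neglected remainder is of second order in the perturbation. When the perturbation is not small relative to A, the neglected terms matter more.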
5. Numerical Results
To evaluate the finite-sample performance of the proposed algorithms, we conduct a series of simulation studies and an application to a synthetic dataset that contains real data.
5.1. Simulation Studies
In this section, we conduct simulation studies to assess the performance of the two proposed algorithms for simple linear regression with linked data. The non-private OLS estimator and RL estimator by lahiri2005 are included as benchmarks. The private, non-RL counterpart methods are also performed in the absence of linkage errors for comparison.
For each simulation, a fixed design matrix and an matching probability matrix are produced and a total of 1000 repetitions are run over the randomness of both the intrinsic error of the regression model and the noise injected for privacy. Figure 3 displays the relative error and both empirical and theoretical variances for the three settings.
Three sets of simulations are conducted to explore the performance with varying sample size , varying , the homoskedastic variance of the random error in linear model (1), and varying linkage accuracy. The parameters are set as follows:
• ELE linkage model: blockwise linkage accuracy characterizing , block size .
  – Settings 1 and 2: , in Assumption (A2).
  – Setting 3: the linkage accuracy which varies from 0.6 to 1, while scales from 1 to 0.
• regression model: , true regression coefficient .
  – Setting 1: varies from 3,000 to 10,000, is fixed at 1.
  – Setting 2: is fixed at , varies from 0.5 to 1.8.
  – Setting 3: is fixed at , is fixed at 1.
• privacy budget: , .
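The linkage setup can be sketched as follows, under assumptions of ours. We take the exchangeable linkage error (ELE) matrix to be block diagonal with the linkage accuracy gamma on the diagonal and (1 - gamma) / (block_size - 1) elsewhere within a block. The values of n, block_size, gamma and beta below are hypothetical and are not the settings of the paper. To keep the check deterministic, the linked response is replaced by its noise-free mean Q X beta.

```python
import numpy as np

def ele_matrix(n, block_size, gamma):
    """Blockwise exchangeable-linkage-error matrix (assumed form).

    Within each block: diagonal = gamma, off-diagonal = (1 - gamma) / (block_size - 1).
    """
    assert n % block_size == 0 and block_size > 1
    off = (1.0 - gamma) / (block_size - 1)
    block = np.full((block_size, block_size), off)
    np.fill_diagonal(block, gamma)
    return np.kron(np.eye(n // block_size), block)

rng = np.random.default_rng(0)
n, block_size, gamma = 1000, 25, 0.8      # hypothetical values
beta = np.array([1.0])                    # hypothetical true coefficient
X = rng.uniform(-1, 1, size=(n, 1))
Q = ele_matrix(n, block_size, gamma)

# Under the linkage model, E[z | X] = Q X beta; use this mean for a clean check.
z_mean = Q @ X @ beta
W = Q @ X
beta_naive = np.linalg.lstsq(X, z_mean, rcond=None)[0]   # ignores linkage error
beta_rl = np.linalg.lstsq(W, z_mean, rcond=None)[0]      # regresses on W = Q X
print(beta_naive, beta_rl)
```

Regressing on W = Q X recovers the coefficient, whereas the naive fit on X is attenuated toward zero by roughly the linkage accuracy. This is the bias that the RL benchmark estimator is designed to remove.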
In Setting 1, where is fixed at 1, Figure 3(a) shows that the errors of all methods decrease as the sample size grows. Because of the linkage errors, the post-RL methods, including and our two algorithms (denoted as “RL-OLS”, “RL-NGD”, and “RL-SSP” in the figures) run on the linked data , always yield larger errors than their counterparts run on , for which no linkage has to be done beforehand. In this case, with , post-RL SSP outperforms post-RL NGD in terms of both accuracy and variance. However, as increases, the post-RL NGD algorithm starts to perform better, as depicted in Figure 3(c) with varying . The error grows linearly for post-RL NGD and quadratically for post-RL SSP, which aligns with the theoretical error bounds presented in Section 4.2. Similar trends are observed when comparing the non-RL NGD and SSP algorithms. In Figure 3(e), where the linkage error tends to zero, the post-RL versions of the three estimators approach the corresponding non-RL versions. The NGD and SSP methods have strictly larger errors than OLS due to the cost of privacy.
Figures 3(b), 3(d) and 3(f) compare the empirical variances (EMP) with the theoretical variances (THR) of the proxy estimators given in Section 4.3. The theoretical variance of post-RL NGD closely aligns with the empirical variance at the chosen level of projection . Recall that the theoretical variance would be exact when no projection is applied. Thus, with a lower level of projection on the gradient update, we anticipate it to be conservative. On the other hand, the theoretical variance of post-RL SSP approximates the empirical variance well for moderately large and small . However, in scenarios with small and/or large , our theoretical variance may underestimate the true variance because the approximation relies on a first-order Taylor expansion. In such cases, one can include higher-order terms for a better approximation. In Setting 3, where and are fixed, the variance decreases as the linkage error vanishes, because less DP noise is needed.
5.2. Application to Synthetic Data
Due to privacy concerns, pairs of datasets containing personal information, which serve as quasi-identifiers, are typically not made public. We instead synthesize a dataset from a pair of generated quasi-identifier datasets and real regression data, as in Chambers2019SmallAE. For quasi-identifiers, we take advantage of the datasets generated by the Freely Extensible Biomedical Record Linkage (Febrl), which are available in the Python module RecordLinkage (rlpython). The pair of datasets for linkage we use corresponds to 5000 individuals. The domain indicator (state) is used for blocking. The record linkage is performed based on the Jaro-Winkler score (Jaro1989AdvancesIR) or exact comparison on 6 quasi-identifiers (names, date of birth, address, etc.). The maximum score is 6 for pairs that have exact alignment. A threshold of 4 is chosen to link the records. For those left unlinked, we assign random links to ensure one-to-one linkage. A unique identifier is available in the dataset for verification purposes. The resulting linkage accuracies for the 9 blocks are , making the overall accuracy 92.5%. We adopt the ELE model for and estimate it using .
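The linking rule described above (sum the per-field agreement scores, link pairs at or above a threshold, then assign random links to the remainder) can be sketched in plain Python. This is a simplified illustration rather than the procedure actually used: exact agreement stands in for the Jaro-Winkler score, blocking is omitted, the greedy assignment is our choice, and the function name and toy records are hypothetical.

```python
import random

def link_records(A, B, fields, threshold, seed=0):
    """Greedy one-to-one linkage; the score is the number of fields that agree exactly."""
    scored = []
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            s = sum(a[f] == b[f] for f in fields)
            if s >= threshold:
                scored.append((s, i, j))
    scored.sort(key=lambda t: -t[0])          # best-scoring pairs first
    links, used_b = {}, set()
    for s, i, j in scored:
        if i not in links and j not in used_b:
            links[i] = j
            used_b.add(j)
    # Randomly pair whatever is left so that the linkage stays one-to-one.
    rng = random.Random(seed)
    left_a = [i for i in range(len(A)) if i not in links]
    left_b = [j for j in range(len(B)) if j not in used_b]
    rng.shuffle(left_b)
    links.update(zip(left_a, left_b))
    return links

# Toy records (hypothetical): B is a shuffled, partly corrupted copy of A.
A = [{"name": "ann", "dob": "1990", "zip": "1"},
     {"name": "bob", "dob": "1985", "zip": "2"},
     {"name": "cy", "dob": "1970", "zip": "3"}]
B = [{"name": "bob", "dob": "1985", "zip": "2"},
     {"name": "cyy", "dob": "1971", "zip": "9"},
     {"name": "ann", "dob": "1990", "zip": "7"}]
links = link_records(A, B, fields=("name", "dob", "zip"), threshold=2)
print(links)
```

In this toy example the third record of A has no candidate above the threshold, so it receives the only remaining record of B by random assignment. Links of this kind are one source of linkage error.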
On the other hand, an anonymous dataset for regression comes from the Survey on Household Income and Wealth (SHIW) from the italydataset. The net disposable income and consumption are the explanatory variable and the response , respectively. Since the SHIW dataset is larger, consisting of 8151 data points, we drop the outliers, randomly draw 5000 records, and combine them with the Febrl dataset. Figure 4 depicts the setup of the synthesis process. Using the unique identifier from the Febrl dataset, the regression variables are appended to the quasi-identifiers , resulting in two separate datasets: and . Then, record linkage is performed by comparing and to output the linked data and the matrix .
To apply the proposed DP algorithms to the synthesized dataset, we set the (hyper)parameters as follows. The privacy budget is given by . The variance of the random error, , is estimated by the MSE. The upper bounds in Assumptions (A1)-(A3) are set as: , , . In the NGD method, the projection level is set to 1.2.
To illustrate the importance of propagating linkage uncertainty when conducting downstream regression, we also apply the non-RL version of NGD and SSP algorithms. We obtain the non-RL regression results by running post-RL NGD and post-RL SSP methods with set to 0 and without converting into . This is equivalent to applying the non-RL methods discussed in Cai2019TheCO; Sheffet17 to the linked set as if it were perfectly linked.
Figure 5. The red dashed line indicates the OLS estimate. The proposed post-RL algorithms are compared with the non-RL “NGD” and “SSP” methods applied to (i.e., without accounting for the linkage errors present). The third and fourth columns represent the two NGD methods running for iterations.
Figure 5 displays the boxplots of the estimates of each algorithm. For each algorithm, a total of 1000 repetitions are run to reflect the randomness of the noise injected for privacy purposes. The variables and are standardized before conducting simple linear regression. The OLS estimator on (dashed line) is plotted for comparison. As can be seen, the DP estimators obtained by running (non-RL) NGD and SSP on directly are heavily biased as a consequence of ignoring linkage errors, even when the overall linkage accuracy is as high as 92.5%. Conversely, post-RL NGD and post-RL SSP yield estimates centered around the OLS estimator but with higher variances, attributed to the cost of bias correction. Post-RL NGD is more flexible due to hyperparameter tuning. Additionally, we run the NGD methods for fewer iterations with , which is one-third of the value recommended by theory. We have found that this approach yields smaller variance while still producing accurate results in finite samples. Therefore, the theoretical number of iterations may be conservative in some circumstances, and moderately reducing may lead to better results.
6. Discussion
In this paper, we propose two differentially private algorithms for linear regression on a linked dataset that contains linkage errors, by leveraging the existing work on (1) linear regression after record linkage, and (2) differentially private linear regression. Figure 6 displays the connections among the related areas at a high level, including PPRL and SMPC mentioned in Sections 1 and 2.3. Our work is the first to simultaneously consider the propagation of linkage uncertainty and the privatization of the output. It also complements the area of PPRL, where the main concern is data leakage among different parties. However, we do not discuss how to link the records in the first place, and thus the security issues of the linkage process are beyond our scope. Instead, we treat record linkage from a secondary perspective: we begin with linked data prepared by an external entity, and we have limited information about the linkage quality.
Specifically, we propose two post-RL algorithms based on the noisy gradient descent and sufficient statistics perturbation methods from the DP literature. We provide privacy guarantees and finite-sample error bounds for these algorithms and discuss the variances of the private estimators. Our simulation studies and the application demonstrate the following: (1) the proposed estimators converge as the sample size increases; (2) post-RL linear regression incurs a higher cost than the non-RL counterpart in terms of the privacy-accuracy tradeoff; (3) the NGD method is flexible with hyperparameter tuning and can be applied to more general optimization problems; (4) SSP is specific to the least-squares problem, offering greater budget efficiency and more accurate results provided that the random error of the regression model is not too large.
There are different directions to extend our work. Note that there may be different scenarios of linking between the two datasets of the same set of entities. Assuming one-to-one linkage, as in our paper, is a canonical scenario. Although we do not explore it, we expect that our methods can be extended to other scenarios (e.g., one-to-many linkage) where still makes sense. Extra assumptions may be required when determining the relevant sensitivities for privacy purposes.
One can also consider record linkage from a primary perspective. In addition to the traditional Fellegi–Sunter model, Bayesian approaches and machine learning-based methods have gained popularity. The record linkage may take forms other than the matching probability matrix adopted here. Furthermore, when privacy concerns arise during the linkage process involving different parties, PPRL and SMPC protocols become essential. Tackling all the challenges depicted in Figure 6 simultaneously with a single efficient tool is of great practical use and significance. This interdisciplinary challenge requires expertise in both statistics and computer science.
Another important direction is exploring related statistical problems in the post-RL context, with or without privacy constraints. For example, confidence intervals and hypothesis testing are fundamental statistical inference tools. Other potential problems that interest statisticians include high-dimensional linear regression and ridge regression.
Appendix A Proofs
A.1. Lemmas
The lemmas here support the proofs in Section A.2.
Lemma A.1.
If the minimum and maximum eigenvalues of satisfy for some constant , then the loss function is -smooth and -strongly convex.
Proof.
Since , then for any ,
By definition, is -strongly convex.
For smoothness, we have
The second equality is due to the fact that for symmetric matrix . By definition, is -smooth.
A neater proof uses the alternative definitions for a twice-differentiable function (see Eqs. (4) and (10) in lecture_smoothness2):
(1) is -strongly convex if ;
(2) is -smooth if .
∎
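The eigenvalue condition in Lemma A.1 can be checked numerically. The sketch below assumes a least-squares loss with a 1/(2n) scaling, so that its Hessian is W^T W / n. The scaling, the matrix W and the response z are assumptions of ours for illustration and may differ from the loss used in Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
W = rng.normal(size=(n, d))            # hypothetical design matrix
z = rng.normal(size=n)                 # hypothetical response

def loss(b):
    return np.sum((z - W @ b) ** 2) / (2 * n)

def grad(b):
    return W.T @ (W @ b - z) / n

H = W.T @ W / n                        # Hessian of the loss, constant in b
eigs = np.linalg.eigvalsh(H)           # eigenvalues in ascending order
alpha, smooth = eigs[0], eigs[-1]      # strong-convexity and smoothness constants

b1, b2 = rng.normal(size=d), rng.normal(size=d)
gap = loss(b2) - loss(b1) - grad(b1) @ (b2 - b1)
dist2 = np.sum((b2 - b1) ** 2)
print(alpha / 2 * dist2 <= gap <= smooth / 2 * dist2)
```

For a quadratic loss the gap equals half the quadratic form of the Hessian in b2 - b1, so it is sandwiched between the smallest and largest eigenvalue bounds, which is the content of the two definitions.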
Lemma A.2 (Bubeck2015, Proof of Theorem 3.10.).
Let be -strongly convex and -smooth on , and be the minimizer of on . Then projected gradient descent with step size satisfies for ,
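As an illustration of the kind of contraction behind this lemma, the sketch below runs projected gradient descent with step size 1/smooth on a hypothetical quadratic and records the one-step ratio of squared distances to the minimizer. The ball radius is chosen by us so that the minimizer lies inside the constraint set, and the comparison value 1 - alpha/smooth is the standard per-step factor for strongly convex and smooth objectives, which we assume matches the bound stated here.

```python
import numpy as np

def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

A = np.array([[3.0, 0.5], [0.5, 1.0]])       # hypothetical Hessian of a quadratic
b = np.array([1.0, -0.5])
x_star = np.linalg.solve(A, b)               # unconstrained minimizer
radius = 2.0 * np.linalg.norm(x_star) + 1.0  # ball large enough to contain x_star
eigs = np.linalg.eigvalsh(A)
alpha, smooth = eigs[0], eigs[-1]

x = project_ball(np.array([5.0, 5.0]), radius)
ratios = []
for _ in range(30):
    x_new = project_ball(x - (A @ x - b) / smooth, radius)   # step size 1 / smooth
    ratios.append(np.sum((x_new - x_star) ** 2) / np.sum((x - x_star) ** 2))
    x = x_new
print(max(ratios), 1 - alpha / smooth)
```

Every recorded ratio stays below 1 - alpha/smooth, so the squared distance to the minimizer decays geometrically in the number of iterations.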
Lemma A.3.
for any scalar .
Proof.
Since
it follows that
∎
Lemma A.4 (Cai2019TheCO, Lemma A.2).
For , , ,
Lemma A.5 (Sheffet17, Proof of Proposition D.2).
For any invertible matrix and any matrix such that is invertible,
Lemma A.6.
Let X be a symmetric random matrix with i.i.d. upper-triangle entries. Each entry has mean 0 and variance . Let be a -dimensional random vector, which has mean and covariance matrix . Let denote the covariance matrix of . Then, the diagonal entries of are given by
(23) |
the off-diagonal entries are
(24) |
Proof.
Let where and . Then,
(25) |
Therefore,
(26) |
For the first term,
For ,
where
and is a matrix with the entry being and 0 otherwise. Substituting and back into (26), we find that has the diagonal entries
and the off-diagonal entry
∎
Remark A.1.
In Lemma A.6, is given by with the diagonal entries replaced by its trace.
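The description in Remark A.1 can be verified by exact enumeration in a small case. The sketch below assumes that the random matrix X is independent of the vector y and uses symbols of our own: sigma2 for the entry variance, mu and Sigma for the mean and covariance of y. Under these assumptions the covariance of X y is sigma2 times E[y y^T] = Sigma + mu mu^T with its diagonal replaced by its trace.

```python
import itertools
import numpy as np

def cov_sym_times_vec(sigma2, mu, Sigma):
    """Covariance of X y: sigma2 * E[y y^T] with the diagonal replaced by its trace."""
    second_moment = Sigma + np.outer(mu, mu)      # E[y y^T]
    out = sigma2 * second_moment
    np.fill_diagonal(out, sigma2 * np.trace(second_moment))
    return out

def exact_cov_by_enumeration(sigma, ys):
    """Exact covariance when upper-triangle entries are +/- sigma and y is uniform on ys."""
    d = len(ys[0])
    iu = np.triu_indices(d)
    total = np.zeros((d, d))
    count = 0
    for signs in itertools.product((-1.0, 1.0), repeat=len(iu[0])):
        X = np.zeros((d, d))
        X[iu] = sigma * np.array(signs)
        X = X + X.T - np.diag(np.diag(X))         # symmetrize
        for y in ys:
            v = X @ y
            total += np.outer(v, v)               # E[X y] = 0, so this is the covariance
            count += 1
    return total / count

ys = [np.array([1.0, 2.0]), np.array([3.0, -1.0])]   # hypothetical support of y
mu = np.mean(ys, axis=0)
Sigma = np.mean([np.outer(y - mu, y - mu) for y in ys], axis=0)
print(cov_sym_times_vec(0.25, mu, Sigma))
print(exact_cov_by_enumeration(0.5, ys))
```

The two printed matrices agree. The enumeration uses sign-valued entries only for convenience, since the formula depends on the entry distribution only through its mean and variance.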
Lemma A.7.
Let be a random matrix with . Let be a -dimensional random vector that is independent of . Then, .
Proof.
∎
Lemma A.8.
Let be a random matrix with . Let be a -dimensional random vector. Let be another -dimensional random vector that is independent of both and and . Then, .
Proof.
Since is independent of both and , and the entries of appear linearly in every entry of , the zero expectation of gives .
∎
A.2. Proofs
This section provides the proofs for theorems and lemmas presented in the paper.
To derive this tighter composition for -DP, we utilize the notion of zero-concentrated differential privacy (zCDP, Bun2016ConcentratedDP), defined as follows.
Definition 3 (-zCDP).
A randomized mechanism is -zero-concentrated-differentially private (-zCDP) if, for all differing on a single entry and all ,
where is the -Rényi divergence (van2014renyi) between the distribution of and the distribution of .
Proof of Lemma 10.
Bun2016ConcentratedDP have shown that, like the classic -DP notion, -zCDP enjoys properties including basic composition and post-processing. The corresponding Gaussian mechanism (Proposition 1.6, Bun2016ConcentratedDP) states that an algorithm is -zCDP after adding Gaussian noise to it. In addition, they have shown that -zCDP implies -DP (Proposition 1.3, Bun2016ConcentratedDP).
Therefore, to achieve -DP, it suffices for the algorithm to be -zCDP with . Using the Gaussian mechanism and basic composition rule of -zCDP, it suffices to add noise
to for all . Note that if , then . Therefore, it suffices to add noise
to for all .
∎
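The calibration argument can be sketched with the standard zCDP facts from Bun2016ConcentratedDP: a Gaussian mechanism with L2-sensitivity Delta and noise variance sigma^2 is (Delta^2 / (2 sigma^2))-zCDP, the zCDP parameters add under composition, and rho-zCDP implies (rho + 2 sqrt(rho log(1/delta)), delta)-DP. The function names below are ours, and the code solves the conversion exactly rather than reproducing whichever simplification the proof adopts, so the resulting noise scale need not coincide with the one stated in the lemma.

```python
import math

def zcdp_budget(eps, delta):
    """Largest rho with rho + 2 * sqrt(rho * log(1 / delta)) <= eps."""
    log_term = math.log(1.0 / delta)
    return (math.sqrt(log_term + eps) - math.sqrt(log_term)) ** 2

def gaussian_sigma(sensitivity, eps, delta, T):
    """Per-step noise standard deviation so that T Gaussian releases give (eps, delta)-DP."""
    rho_step = zcdp_budget(eps, delta) / T            # composition: rho adds across steps
    return sensitivity / math.sqrt(2.0 * rho_step)    # rho_step = Delta^2 / (2 sigma^2)

# Hypothetical budget and sensitivity.
for T in (10, 100):
    print(T, gaussian_sigma(sensitivity=2.0, eps=1.0, delta=1e-5, T=T))
```

The noise standard deviation grows like the square root of T, which is the saving over calibrating each step separately under basic (eps, delta)-composition.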
Proof of Theorem 4.1(i).
By the composition property of differential privacy, to establish that Algorithm 1 (Post-RL Noisy Gradient Descent) satisfies -differential privacy it suffices to show that the computation of is -differentially private. According to the Gaussian mechanism (Theorem 2.1), showing the latter boils down to proving that the sensitivity is controlled at each gradient step. Let . We will show that the -sensitivity of , denoted by , is bounded by .
Without loss of generality, we assume the neighboring data sets and differ in the -th record. Recall is a permutation of satisfying . Let be a copy of but with the entry changed to . Recall in the record linkage linear model elaborated in lahiri2005, , which are convex combinations of rows of . We can write and . From assumption (A1) that with probability 1, we know almost surely. Then, we have
(27) | ||||
Since is a doubly stochastic matrix, for any . By the arbitrariness of index , it follows that
(28) |
The sensitivity of is
(29) | ||||
We use and to denote the two terms on the right of (29), respectively. To bound the first term , since
then by (28), can be controlled:
(30) |
For the second term , since
(31) | ||||
Then,
(32) | ||||
It follows that
∎
Proof of Theorem 4.1(ii).
We now show that Algorithm 2 (Post-RL sufficient statistics perturbation) is -differentially private. Let be the augmented matrix considering linkage errors. Then contains sufficient statistics for the . Thus, it suffices to show that the sensitivity of is controlled by .
Let , where and come from any neighboring data set, as in the proof of Theorem 4.1 (i). We have
By the properties of the norm of block matrices,
and
We have
and
Note that we can swap the rows of and , such that and only differ in one record and it does not change the estimation using and after swapping. Then,
Since
and
we have
Putting the upper bounds together, we derive
∎
Proof of Lemma 4.2.
To establish the behavior of the expected squared-error loss of , first note that since
we have
Therefore, where are the eigenvalues of . ∎
Proof of Lemma 4.3.
Note that is an unbiased estimator for , and hence
(33) | ||||
∎
Proof of Theorem 4.4.
To establish an upper bound of the excess error of the private estimator, i.e., , for Algorithm 1, we work under the event . By the concentration bound of the Gaussian distribution, with the choice of , where and are constants.
Recall that the loss function . The assumption about the eigenvalues of implies that is -smooth and -strongly convex. See the proof in Lemma A.1. Under , the iterate . Let be the unperturbed iterate, then . Since (A3) says that , we can assume without loss of generality where . Then, is the same as by setting . By Lemma A.2, with , it then follows that
Let be some constant, then by Lemma A.3, we have the following for the noisy iterate :
The above recursive formula yields
Setting , the first term , given that and . When is sufficiently large, can be set to .
To control the second term, we apply Lemma A.4 with and which is the variance of for . Provided , let for some sufficiently large constant so where
That is, the noise term is then controlled by a corresponding big-O statement with probability at least , where , hence
∎
Proof of Theorem 4.5.
To bound , we need to bound the norms of , , , and . First, by the assumption (A4):
for some constant , we have
Then,
(36) |
Recall from Algorithm 2 that is a Gaussian symmetric matrix with upper triangle given by i.i.d.  entries. The Gaussian concentration bounds give that, w.p. , we have
(37) |
The result (37) is from random matrix theory (Vershynin, Corollary 4.4.8). Since the vector norm is bounded by the norm of the matrix that contains the vector as a column, we also have
(38) |
For , consider its Taylor series:
(39) |
W. p. ,
is going to zero as . Therefore, the Taylor Series (39) converges. Then,
(40) | |||||
w.p. . Plugging all the bounds into (35), we have derived
∎
Proof of Corollary 4.1.
The proof is completed by using the inequality
∎
Proof of Theorem 4.6.
Consider the non-projected estimator
(41) |
Let and . From the recursive form , we derive for :
(42) |
Then,
(43) |
∎
Proof of Theorem 4.7.
By rewriting as in (35) and ignoring the remainder of the first-order Taylor expansion in (39), we have
(44) |
Then,
(45) | ||||
Since , we have
For the third term, let . By Lemma A.6,
and
For the fourth term, let and let .
and
On the other hand, the covariances in (45) are all zero by Lemmas A.7 and A.8, owing to the independence and zero expectations of and . Putting these together, we have
(46) |
Note that and have a common factor . By rescaling and , we have the expression for as stated in the theorem.
∎