A Privacy-preserving Distributed Filtering Framework for NLP Artifacts
Abstract
Background: Medical data sharing is a big challenge in biomedicine, which often hinders collaborative research.
Due to privacy concerns, clinical notes cannot be directly shared. Considerable effort has been dedicated to de-identifying
clinical notes, but it is still very challenging to accurately locate and scrub all sensitive elements from notes
automatically. An alternative approach is to remove sentences that might contain sensitive terms related to personal information.
Methods: A previous study introduced a frequency-based filtering approach that removes sentences containing low-frequency
bigrams to improve privacy protection without significantly decreasing utility. Our work extends this
method to clinical notes from distributed sources, taking security and privacy into account. We developed a
novel secure protocol based on private set intersection and secure thresholding to identify uncommon and low-frequency
terms, which can be used to guide sentence filtering.
Results: As the computational cost of our proposed framework mostly depends on the cardinality of the intersection of
the sets and the number of data owners, we evaluated the framework in terms of these two factors. Experimental results
demonstrate that our proposed method is scalable in various experimental settings. In addition, we evaluated our framework
in terms of data utility. This evaluation shows that the proposed method is able to retain enough information for data analysis.
Conclusion: This work demonstrates the feasibility of using homomorphic encryption to develop a secure and efficient
multi-party protocol.
Keywords: Biomedical data security and privacy, Clinical notes de-identification, Homomorphic encryption
© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Sadat et al. BMC Medical Informatics and Decision Making (2019) 19:183 Page 2 of 10
frequency “stop words” intact. However, the system blocks too much and has a high false positive rate, making the outputs hard to read [2]. Finley et al. proposed a similar method, which was applied to de-identify distributed semantic models [6]. The Scrub system [7] used a template-based approach to match components of high privacy risk, which are then removed, generalized, or replaced with made-up ones. This method can get rid of explicit personally identifiable information, but it does not handle combinations of fields, and the results might still be matched or linked to the identities of individuals [8].
Other researchers also treated text de-identification as a classic Named Entity Recognition (NER) problem and tried to solve it with machine learning models [9]. Szarvas et al. used a decision tree that takes various features (length, frequency, etc.) into consideration to detect PHIs [10]. Several research groups [2, 11] developed methods based on Support Vector Machines (SVM) to classify sensitive attributes based on part-of-speech (POS) inputs. Another popular framework utilizes conditional random fields (CRF), an extension of logistic regression that considers correlations in the sentence, to predict PHIs [12, 13]. The latest methods in this direction [14], which use deep learning, reported improved performance in detecting PHIs, but the models require careful tuning of parameters for each dataset, which makes them hard to port for collaborative research.
A recent method was proposed by Li et al. [15] to filter out rare sentences (frequency < 3) and sentences containing bigrams under a certain frequency threshold (frequency < 256). This method demonstrated good performance in obtaining sentences with almost no PHIs (evaluated by a manual review of sampled outputs) while preserving a Type Unique Identity (TUI) distribution similar to that of the original data, providing an alternative and generalizable way to obtain useful data with mitigated privacy risks. However, the method is only designed to anonymize data from a single source. In reality, collaborative research often involves more than one party, which poses new challenges for conducting filtering in a global manner. In this paper, we propose a distributed and privacy-preserving method as an extension of the single-source model [15]. Our criterion for bigram filtering is stricter than that of the previous work [15] because it takes distributional differences among local sites into consideration. We only keep sentences containing bigrams that are observed at all collaboration sites and have sufficient global frequency. Our proposed method can be easily generalized to cover other NLP artifacts, including unigrams, trigrams, and n-grams. To develop such a global bigram-based filtering method, appropriate protection needs to be enforced on private set intersection, secure count aggregation, and thresholding to ensure data confidentiality during the process.

Existing works and their limitations
A critical step for our distributed bigram filtering model is to find which bigrams are common to all collaborative sites in a privacy-preserving manner. Although there are several studies on 2-party private set intersection [16, 17], only a few works have addressed the multi-party private set intersection (MPSI) problem. Earlier approaches for MPSI have some limitations. In [18], the dataset size of each party must be equal. Another approach suffers from approximation errors [19]. A recent work has shown the feasibility of handling n > 2 parties [20]. In this work [20], each data owner constructs a Bloom filter from their data (using only the words or bigrams, not the counts associated with them). Data owners send the encrypted Bloom filter (using the exponential ElGamal encryption scheme) to a service provider. All encrypted Bloom filters are securely added by the service provider without decryption, which results in an encrypted Integrated Bloom Filter (IBF). Then, the service provider constructs a randomized n-subtraction of the IBF (encrypted), where n is the number of parties. The service provider broadcasts this encrypted randomized n-subtraction of the IBF to all the data owners. Finally, all data owners jointly decrypt it and compute the set intersection: if an element x is in the set intersection, each corresponding array location in the encrypted randomized n-subtraction of the IBF, to which x is mapped by the k hash functions, is an encryption of 0; otherwise, it is an encryption of a random integer. Their approach [20] demonstrated good performance for set sizes ranging from 64 to 16,384. However, this approach may not scale well to millions of records, which are common in real-world applications. With a much larger set, to reduce the probability of false positives, the size of the Bloom filter should be large enough compared to the number of items to be inserted into it. In their approach, runtime is dominated by the encryption and decryption of the Bloom filter. Constructing, encrypting, and transferring such large Bloom filters (that can deal with millions of records with a minimal probability of false positives) will introduce huge computation and communication overhead.
Our problem specification is different from that of the private set intersection works mentioned here, which do not involve any secure thresholding operations. We describe these works only to give an overview of state-of-the-art solutions to related problems. To the best of our knowledge, there is no secure protocol for sensitive information filtering that combines private set intersection and secure thresholding.
The major contributions of this article are summarized as follows:

1. We propose a novel framework based on private set intersection and secure thresholding to identify
Fig. 1 Block diagram of the system architecture. Only encrypted summary statistics are delegated to the central server to conduct the bigram
filtering; the server returns to the individual data owners the encrypted bigrams that are both common and frequent enough in a global manner.
This block diagram was drawn by the authors
restriction to one single algebraic operation is very inconvenient for general purpose applications. Lately, researchers are adopting lattice cryptosystems, which leverage ring homomorphism (addition and multiplication) [26, 27]. The cryptosystem in [28] is a Somewhat Homomorphic Encryption (SWHE) scheme that can compute a bounded number of homomorphic functions. Other recent RLWE-based SWHE cryptosystems include BGV [29], FV [30], and YASHE [31]. While these systems are intrinsically similar, there are differences and trade-offs. Interested readers can refer to [32] for more details.
In this work, we used the FV cryptosystem (other RLWE-based systems will work in a similar manner), which consists of the following functionalities:

KeyGen(params): Given the system parameters params as input, KeyGen generates a public-private key pair and an evaluation key (pk, sk, evk).
Enc(pk, m): An encryption algorithm encrypts a plaintext message m using the public key pk.
Dec(sk, c): Let c be the encryption of a plaintext m. A decryption algorithm outputs m, given the private key sk and the ciphertext c as input.
Add(c1, c2): Let c1, c2 be the ciphertexts for messages m1, m2 respectively. Given c1, c2 as input, a homomorphic addition operation Add computes the encrypted sum of m1, m2.
Mult(c1, c2): Let c1, c2 be the ciphertexts for messages m1, m2 respectively. Given c1, c2 as input, a homomorphic multiplication operation Mult computes the encrypted product of m1, m2.
ReLin(cmult, evk): The objective of the relinearization operation ReLin is to reduce the size of a given ciphertext cmult back to (at least) 2. Relinearization is performed when the size of the ciphertext increases substantially by multiplication operations. The relinearization operation requires the evaluation key evk.

There is a recent application of homomorphic encryption, which can securely perform genome search on a semi-honest cloud server [33].

Ciphertext packing
The considerable computational overhead of homomorphic encryption results from the large ciphertexts. As homomorphic operations have to operate on these large ciphertexts, they can be quite slow. The primary solution to this issue is to work with packed ciphertexts, which are ciphertexts that encrypt a vector of plaintext values [34, 35]. Homomorphic operations can be performed on these vectors component-wise in a Single Instruction,
Multiple Data (SIMD) manner. Depending on the memory allowance, this mechanism can significantly boost the performance due to parallelization.
Consider the plaintext elements in a polynomial quotient ring, $m \in R_t = \mathbb{Z}_t[X]/(X^n + 1)$, and ciphertext elements in $R_q = \mathbb{Z}_q[X]/(X^n + 1)$. Here, q and t are positive integers (q > t, q > 1, see [30]), $\mathbb{Z}_q$ represents the set of integers in $(-q/2, q/2]$, and $X^n + 1$ is an irreducible polynomial of degree n. Using ciphertext packing, we can encrypt n plaintext values in a single ciphertext for a single instruction execution.
Since a packed ciphertext is essentially the same as a standard ciphertext, the basic homomorphic operations still work, for instance, homomorphic addition by adding ciphertexts. Ciphertext packing thus facilitates SIMD-type homomorphic computation, which is capable of computing the same function over many inputs at once. The usage of ciphertext packing in our proposed framework is elaborated in Detailed System Protocol.
We apply ciphertext packing to minimize both computational and communication overhead. The data owners group their counts of bigrams into vectors of length n, encrypt them, and send (cardinality of the intersection of sets)/n ciphertexts to the central server (see Detailed System Protocol). The packing mechanism then allows the central server to perform computation on n items simultaneously, which results in an n-fold improvement in both computation and communication. In our case, n equals 4096, which leads to a significant time cost reduction over the naive homomorphic encryption method.

Hash functions
Hash functions are one of the fundamental cryptographic primitives. A hash function computes a digest of a given message, which is a fixed-length bit string. For a given message, the message digest (also known as hash value or hash) can be considered a unique representation of that message. In this work, we have used SHA-256, which is a member of the Secure Hash Algorithm (SHA) family. The length of the message digest for SHA-256 is 256 bits [36]. The security of hashing is discussed in detail in Security Analysis, Security of Hashing.

Detailed system protocol
At the system initialization phase, data owners receive public and private keys from the CSP. The central server receives only the public key. Then, each data owner sends the hashes of bigrams to the central server. After receiving the hashes from each data owner, the central server computes the intersection of the hashes. Then, the central server sends the elements of this intersection to the data owners. Figure 2 shows the flow diagram of our protocol.
Upon receiving the intersection of the hashes from the central server, data owners encrypt the local frequency of the intersected bigrams by using the ciphertext packing technique. To do so, they follow the order received from the central server. Figure 3 illustrates this technique for a data owner and indicates the difference from the naive homomorphic encryption approach. After encrypting the counts, data owners send the packed ciphertexts to the central server, where the encrypted global frequency will be computed.
After receiving the ciphertexts, the central server performs the homomorphic addition operation on these packed ciphertexts. At the end of this addition process, the resulting output looks like Table 2. Here, E represents the encryption function.
In Table 2, E(C11) denotes the encrypted count of bigram B1 contributed by data owner 1. E(C12) denotes the encrypted count of B1 contributed by data owner 2, E(C13) denotes the encrypted count of B1 contributed by data owner 3, and so on.
Fig. 2 Flow diagram for the proposed system protocol. The order of the execution runs in a top down manner in key distribution and computation phases
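The hashing and intersection steps of the protocol can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the function names (`hash_bigrams`, `intersect_hashes`), the example bigrams, the way the salt is concatenated with the bigram, and the use of sorting to fix a common order are all assumptions made for illustration. Only the use of SHA-256, a CSP-provided salt, and a server-side intersection of hashes come from the text.

```python
import hashlib

def hash_bigrams(bigrams, salt):
    """Data-owner side: map each salted SHA-256 hex digest to its bigram.

    Prepending the salt to the bigram is an assumption; the paper only
    states that a CSP-generated salt is an additional hash input.
    """
    return {
        hashlib.sha256(salt + bigram.encode("utf-8")).hexdigest(): bigram
        for bigram in bigrams
    }

def intersect_hashes(hash_sets):
    """Central-server side: keep only hashes reported by every data owner.

    Sorting gives all owners the same order for the later packing step
    (the sort itself is an illustrative choice).
    """
    common = set.intersection(*(set(h) for h in hash_sets))
    return sorted(common)

# Hypothetical salt and local bigram lists for three data owners.
salt = b"salt-from-csp"
owners = [
    ["chest pain", "blood pressure", "john smith"],
    ["blood pressure", "chest pain", "mary jones"],
    ["chest pain", "heart rate", "blood pressure"],
]

local_tables = [hash_bigrams(b, salt) for b in owners]
common_hashes = intersect_hashes([t.keys() for t in local_tables])

# Each owner can map the common hashes back to its own bigrams locally.
common_bigrams = sorted(local_tables[0][h] for h in common_hashes)
print(common_bigrams)
```

In this toy run, only the two bigrams held by all three owners survive the intersection; bigrams seen at a single site (such as the made-up names) never leave that site in the clear.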
Fig. 3 Usage of ciphertext packing in our proposed method. Here, n is the degree of the polynomial, which indicates the number of slots for parallel computing
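The packing step can be sketched on plaintext values. The snippet below only mimics the slot layout: no encryption is performed, `pack_counts` and `add_packed` are hypothetical helper names, zero-padding of the last vector is an assumption, and a toy slot count of 4 stands in for the n = 4096 used in the paper. Slot-wise addition of the packed vectors corresponds to what the homomorphic Add operation would compute on the encrypted vectors.

```python
import math

def pack_counts(counts, n):
    """Group a list of counts into vectors of n slots (zero-padded)."""
    num_vectors = math.ceil(len(counts) / n)
    padded = counts + [0] * (num_vectors * n - len(counts))
    return [padded[i * n:(i + 1) * n] for i in range(num_vectors)]

def add_packed(packed_a, packed_b):
    """Slot-wise addition of two packed lists (plaintext stand-in for Add)."""
    return [
        [x + y for x, y in zip(vec_a, vec_b)]
        for vec_a, vec_b in zip(packed_a, packed_b)
    ]

n = 4  # toy slot count; the paper uses n = 4096
owner1 = [5, 0, 2, 7, 1, 3]  # hypothetical local counts, in the server's order
owner2 = [1, 4, 2, 0, 6, 3]

packed1 = pack_counts(owner1, n)
packed2 = pack_counts(owner2, n)
global_counts = add_packed(packed1, packed2)
print(len(packed1), global_counts)
```

With six counts and four slots, each owner sends two vectors instead of six separate values, and one addition per vector updates all of its slots at once.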
Table 2 Secure count aggregation at central server
Bigram  Encrypted Global Frequency
B1      E(C11) + E(C12) + E(C13) + ⋯
B2      E(C21) + E(C22) + E(C23) + ⋯
B3      E(C31) + E(C32) + E(C33) + ⋯
⋮       ⋮

Now, we need to meet the thresholding requirement for the sum of homomorphically encrypted counts. For each of the records, we check the following inequality.

E(C11) + E(C12) + E(C13) + ⋯ > threshold

Solving this problem involves both addition and comparison. It is known that in arithmetic circuits, addition is cheap but comparison is not trivial. To avoid the comparison operation in the arithmetic circuit, we formulate the problem in the following way,

E(C11) + E(C12) + E(C13) + ⋯ − threshold

After performing the above-mentioned homomorphic operation, the central server sends to the data owners r*(E(C11) + E(C12) + E(C13) + ⋯ − threshold), where r is a random number drawn by the central server. After decrypting it, if a data owner gets a random negative number (or zero), she will understand that the sum of counts of the corresponding record is less than (or equal to) the threshold. Similarly, if a data owner gets a random positive number, she will understand that the sum of counts of the corresponding record is greater than the threshold. Multiplying every coefficient of the resulting ciphertext by the same random number may expose some additional information about other data owners’ counts. So, we multiply the resulting ciphertext with a random polynomial, all of whose coefficients are randomly generated.
Although polynomial addition and subtraction are coefficient-wise by nature, polynomial multiplication in $R_t$ (and $R_q$) is a convolution product of the coefficients. An effective technique to transform a convolution product into a coefficient-wise product in a polynomial ring is the Number-Theoretic Transform (NTT), a specialization of the Fourier transform for finite rings. One important property of the NTT is that it works in the same ring as lattice cryptosystems do. Therefore, the NTT can be used to improve the efficiency of the polynomial operations [37]. To ensure that products in the ciphertext space translate into coefficient-wise products in the plaintext space, we perform an inverse-NTT operation on the plaintext before encryption and an NTT operation after decryption.
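The masked thresholding idea can be checked on plaintext integers. The sketch below is not the encrypted protocol: it works on unencrypted per-slot sums, `mask_differences` and `exceeds_threshold` are hypothetical helper names, and drawing each mask as a strictly positive integer (so that the sign of the difference is preserved) is an assumption that the text does not spell out. Using a fresh mask per slot mirrors the random polynomial with independently generated coefficients.

```python
import secrets

def mask_differences(global_counts, threshold, max_mask=2**16):
    """Server side: compute r_i * (sum_i - threshold) with a fresh r_i per slot.

    Each r_i is drawn from [1, max_mask], so it is strictly positive
    (an assumption) and the sign of the difference is unchanged.
    """
    return [
        (secrets.randbelow(max_mask) + 1) * (count - threshold)
        for count in global_counts
    ]

def exceeds_threshold(masked_values):
    """Data-owner side: only the sign of each masked value is interpreted."""
    return [value > 0 for value in masked_values]

global_counts = [12, 3, 8, 8, 0, 40]  # hypothetical aggregated counts
threshold = 8

masked = mask_differences(global_counts, threshold)
keep = exceeds_threshold(masked)
print(keep)
```

Only the slots whose aggregated count is strictly greater than the threshold are flagged; a slot whose count equals the threshold yields zero, and the magnitudes of the other values are hidden by the masks.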
Table 3 Experimental results for different cardinality of intersection of sets. In the five different settings, cardinality is increased by
1% of the entire dataset. The number of data owners is a constant [3]. The numbers are in seconds
Cardinality of Intersection  Intersecting Hashes (s)  Encryption (s)  Homomorphic Operation (s)  Decryption (s)  Network Comm. (s)  Total Time (s)
1,515,520 (~10%)             4.63                     8.11            55.43                      6.73            0.48               75.38
1,667,072 (~11%)             4.69                     8.92            61.19                      7.06            0.52               82.38
1,818,624 (~12%)             4.98                     9.70            66.63                      7.88            0.54               89.73
1,970,176 (~13%)             5.07                     10.97           72.21                      8.49            0.59               97.33
2,121,728 (~14%)             5.20                     11.32           77.65                      9.34            0.60               104.11
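As a quick consistency check on Table 3, the short script below relates each cardinality to the number of packed ciphertexts, assuming one ciphertext per n = 4096 counts as described in the ciphertext packing section. The variable names are illustrative, and the per-ciphertext time is simply the reported homomorphic operation time divided by that count; it is a derived figure, not one reported by the authors.

```python
n_slots = 4096  # slots per ciphertext (n in the paper)

# (cardinality of intersection, homomorphic operation time in seconds), from Table 3
rows = [
    (1_515_520, 55.43),
    (1_667_072, 61.19),
    (1_818_624, 66.63),
    (1_970_176, 72.21),
    (2_121_728, 77.65),
]

for cardinality, homomorphic_seconds in rows:
    ciphertexts, remainder = divmod(cardinality, n_slots)
    per_ciphertext = homomorphic_seconds / ciphertexts
    print(f"{cardinality:>9,} counts -> {ciphertexts} ciphertexts "
          f"(remainder {remainder}), {per_ciphertext:.3f} s per ciphertext")
```

Every cardinality in the table is an exact multiple of 4096 (370 to 518 ciphertexts), and the derived time per ciphertext stays close to 0.15 s, which is consistent with the roughly linear growth of the homomorphic operation time.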
Table 4 Experimental results for different numbers of data owners. The cardinality of the intersection of sets is fixed at 1,515,520.
The numbers are in seconds
Number of Data Owners  Intersecting Hashes (s)  Encryption (s)  Homomorphic Operation (s)  Decryption (s)  Network Comm. (s)  Total Time (s)
2                      1.69                     8.17            54.72                      6.29            0.32               71.19
3                      2.72                     8.19            55.49                      6.33            0.39               73.12
4                      3.53                     8.28            55.51                      6.60            0.46               74.38
5                      4.63                     8.22            56.36                      6.67            0.53               76.41
6                      5.36                     8.24            58.01                      7.11            0.60               79.32
is small but it gets larger at an increasing threshold. However, this is not a critical issue because we can maintain the original distribution by oversampling the filtered corpus using sentences that contain one or more TUIs. This is a standard combinatorial optimization problem but we do not explore it in this paper.

Security analysis
In this section, we analyze the security of our proposed framework.

Security of encryption
To evaluate the security of a lattice cryptosystem, a widely used measure is the root-Hermite factor. Lindner and Peikert showed a mathematical relationship between the root-Hermite factor and the security level λ (in bits) [41].

is given by, where $c \approx \sqrt{\ln(1/\epsilon)/\pi}$ and $s = \sigma\sqrt{2\pi}$. n, q, and s represent the degree of the polynomial ring, the ciphertext modulus, and the scale parameter of the error distribution respectively. σ denotes the standard deviation of the error distribution, and ϵ is the attacker advantage. For our experiments, we choose $n = 2^{12}$, $q = 2^{120}$, $\sigma = 3$, $\epsilon = 2^{-32}$. According to the root-Hermite factor measure, our proposed method guarantees 142-bit security.

Security of hashing
One of the primary security requirements of a hash function is one-wayness: given a hash output h, it must be computationally infeasible to find an input m such that h = H(m). In other words, given a message digest, an adversary cannot find the matching message m from $H^{-1}(h) = m$. There exist some cryptanalytic attacks against one-way hashing that try to break the security properties of the hash function. A brute-force attack (also known as exhaustive search) is one type of cryptanalytic attack. Let (m, h) denote the pair of input message and output hash value, and let M = {m1, m2, …, mk} be the message space of all possible messages mi. Such an attack checks for every element of M whether H(mi) = h. If the equality holds, a possible input message is found. This type of attack is impractical for a large message space. A similar one is called a dictionary attack, which tries all the input messages in a pre-arranged listing, generally derived from a list of words such as in a dictionary (hence the term dictionary attack), which has a smaller space to search. There is a variant of the dictionary attack, known as the rainbow table attack [42], which uses a precomputed table (rainbow
Table 5 Comparison of TUI Proportion Distribution
TUI   Original Clinical Note  Threshold = 1  Threshold = 2  Threshold = 4  Threshold = 8  Threshold = 16
T007  0.2627                  0.2012         0.1601         0.1421         0.0922         0.0428
T023  5.8168                  4.4492         3.5281         2.9490         2.5213         2.1758
T033  7.7646                  5.3959         4.8470         3.6402         3.1259         2.5570
T047  7.6978                  5.4338         4.8742         3.7598         3.3876         2.8825
T060  2.5509                  1.8672         1.6446         1.4018         1.1242         0.9680
T074  1.5871                  1.2046         1.0991         0.9302         0.8257         0.6724
T093  0.9824                  0.7123         0.6594         0.5846         0.5197         0.4925
T109  4.1908                  2.8163         2.7084         2.8069         2.6024         1.6447
T121  1.2840                  0.8898         0.8983         0.7719         0.5971         0.6253
T170  0.7523                  0.5182         0.4450         0.3165         0.2764         0.1284
T184  3.5566                  2.4968         2.2498         1.8443         1.4265         0.6895
T201  1.8249                  1.1075         0.9960         0.9173         0.8441         0.8437
table [42] that contains elements up to a certain length consisting of a limited set of characters) for reversing hash functions. This attack requires less computation time but more storage compared to a brute-force attack. To address the above-mentioned attacks, we used a salt to randomize the hashing. In cryptography, salt refers to random data that are used as an additional input to a hash function. The salt was generated by the CSP and provided to data owners before each hashing process, making these attacks computationally infeasible.
Another desirable property of a hash function is collision resistance. A hash function is said to be collision resistant if it is computationally infeasible to find two different inputs m1 ≠ m2 with H(m1) = H(m2). It seems that if the hash function has an output length of b bits, we have to check about $2^b$ messages. However, it turns out that an attacker needs only about $2^{b/2}$ messages. This is a quite surprising result, which is due to the birthday attack. This attack is based on the birthday paradox, which is a powerful tool that is often used in cryptanalysis. Collision search for a hash function H() is exactly the same problem as finding birthday collisions among party attendees: how many people are required at a birthday party such that there is a significant chance that at least two attendees have the same date of birth? The question is how many messages (m1, m2, …, mk) an attacker needs to hash until he has a chance of finding H(mi) = H(mj) for some mi and mj that he chooses. The most significant consequence of the birthday attack is that the number of messages needed to hash to find a collision is approximately equal to the square root of the number of possible output values, i.e., about $\sqrt{2^b} = 2^{b/2}$. Hence, for a security level of u bits, the hash function needs to have an output length of 2u bits. In order to prevent collision attacks based on the birthday paradox, the output length of a hash function must be at least 128 bits [36]. As mentioned previously, we are using SHA-256 in this work, which has an output length of 256 bits.
In 2004, collision-finding attacks against MD5 and SHA-0 were demonstrated by Xiaoyun Wang [43]. One year later, it was claimed that the attack could be extended to SHA-1 and a collision search would take $2^{63}$ steps, which is considerably less than the $2^{80}$ achieved by the birthday attack (the output width in this case is 160 bits). In this work, we are using SHA-2 (precisely, SHA-256), against which no attacks are known to date.

Conclusion
In this article, we proposed a novel protocol to achieve the joint mission of private set intersection and secure thresholding for a distributed data de-identification task. We extended a previous filtering-based method to cover data from distributed sources and demonstrated the feasibility of using homomorphic encryption to develop an efficient multi-party protocol for distributed data de-identification. Experimental results show that our proposed method can simultaneously guarantee data privacy and preserve data utility for analysis.
To the best of our knowledge, this is one of the pioneering privacy-preserving initiatives to de-identify clinical notes in a distributed environment. We have open sourced our code on GitHub under a GNU general public license, along with a software manual for compiling and running it.

Availability and requirements
Project name: A Privacy-preserving Distributed Filtering Framework for NLP Artifacts.
Project home page: https://github.com/Nazmus-Sadat/th_mpsi
Operating system: Linux.
Programming language: C++.
License: GNU general public license.

Abbreviations
CSP: Crypto Service Provider; EHR: Electronic Health Record; GCE: Google Compute Engine; HIPAA: Health Insurance Portability and Accountability Act; IBF: Integrated Bloom Filter; MIMIC-III: Medical Information Mart for Intensive Care; MPSI: Multi-party private set intersection; NER: Named Entity Recognition; NTT: Number-Theoretic Transform; PHI: Protected Health Information; SHA: Secure Hash Algorithm; SIMD: Single Instruction, Multiple Data; SWHE: Somewhat Homomorphic Encryption; TCP: Transmission Control Protocol; TUI: Type Unique Identity; UMLS: Unified Medical Language System

Acknowledgements
Not applicable.

Authors’ contributions
All authors approved the final manuscript. MNS, MMA, and XJ designed the method. MNS implemented the protocol and devised experiments. MNS and XJ wrote the majority of the manuscript. NM, SP, HL, and XJ provided detailed edits and critical suggestions.

Funding
This work was funded in part by NIBIB U01 EB023685, NSERC Discovery Grants (RGPIN-2015-04147), NIH U01TR002062, and University Research Grants Program (URGP) from the University of Manitoba.
Xiaoqian Jiang was supported in part by the CPRIT RR180012, UT Stars award, the National Institute of Health (NIH) under award number U01TR002062, R01GM114612, R01GM118574, R01GM124111.

Availability of data and materials
The clinical notes used in the experiment are available from MIMIC-III (Medical Information Mart for Intensive Care), an openly available dataset [38].

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Author details
1 Department of Computer Science, University of Manitoba, Winnipeg, MB R3T 2N2, Canada. 2 Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA. 3 Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA.
4 Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN, USA. 5 School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA.

Received: 2 December 2018 Accepted: 4 July 2019

References
1. Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support? J Biomed Inform. 2009;42:760–72.
2. Neamatullah I, Douglass MM, Lehman L-WH, Reisner A, Villarroel M, Long WJ, et al. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak. 2008;8:32.
3. Douglass M, Clifford GD, Reisner A, Moody GB, Mark RG. Computer-assisted de-identification of free text in the MIMIC II database. Comput Cardiol. 2004;2004:341–4.
4. Beckwith BA, Mahaadevan R, Balis UJ, Kuo F. Development and evaluation of an open source software tool for deidentification of pathology reports. BMC Med Inform Decis Mak. 2006;6:12.
5. Berman JJ. Concept-match medical data scrubbing. How pathology text can be used in research. Arch Pathol Lab Med. 2003;127:680–6.
6. Finley GP, Pakhomov SVS, Melton GB. Automated De-Identification of Distributional Semantic Models: AMIA Annual Symposium; 2016.
7. Sweeney L. Replacing personally-identifying information in medical records, the scrub system. Proc AMIA Annu Fall Symp. 1996:333–7.
8. Sweeney L. Guaranteeing anonymity when sharing medical data, the Datafly system. Proc AMIA Annu Fall Symp. 1997:51–5.
9. Meystre SM, Friedlin FJ, South BR, Shen S, Samore MH. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med Res Methodol. 2010;10:70.
10. Szarvas G, Farkas R, Busa-Fekete R. State-of-the-art anonymization of medical records using an iterative machine learning framework. J Am Med Inform Assoc. 2007;14:574–580.
11. Guo Y, Gaizauskas R. Identifying personal health information using support vector machines. i2b2 workshop on .... 2006; Available: ftp://ftp.dcs.shef.ac.uk/home/robertg/papers/amia06-deident.pdf
12. Gardner J, Xiong L. HIDE: An Integrated System for Health Information DE-identification: EDBT. IEEE; 2008. p. 254–9.
13. Wellner B, Huyck M, Mardis S, Aberdeen J, Morgan A, Peshkin L, et al. Rapidly retargetable approaches to de-identification in medical records. J Am Med Inform Assoc. 2007;14:564–73.
14. Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. J Am Med Inform Assoc. 2017;24:596–606.
15. Li D, Rastegar-Mojarad M, Elayavilli RK, Wang Y, Mehrabi S, Yu Y, et al. A frequency-filtering strategy of obtaining PHI-free sentences from clinical data repository. Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics. ACM; 2015. pp. 315–324.
16. Wang XA, Xhafa F, Luo X, Zhang S, Ding Y. A privacy-preserving fuzzy interest matching protocol for friends finding in social networks. Soft Computing. 2018. pp. 2517–2526. doi: https://doi.org/10.1007/s00500-017-2506-x
17. Chen H, Laine K, Rindal P. Fast Private Set Intersection from Homomorphic Encryption. Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security - CCS ‘17; 2017. https://doi.org/10.1145/3133956.3134061.
18. Kissner L, Song D. Privacy-preserving set operations. Crypto, vol. 2005: Springer; 2005. Available: http://link.springer.com/content/pdf/10.1007/11535218.pdf#page=251
19. Egert R, Fischlin M, Gens D, Jacob S, Senker M, Tillmanns J. Privately Computing Set-Union and Set-Intersection Cardinality via Bloom Filters. Information Security and Privacy. Springer, Cham; 2015. pp. 413–430.
20. Miyaji A, Nakasho K, Nishida S. Privacy-Preserving Integration of Medical Data. J Med Syst. Springer US. 2017;41:37.
21. Nikolaenko V, Weinsberg U, Ioannidis S, Joye M, Boneh D, Taft N. Privacy-preserving ridge regression on hundreds of millions of records. Security and Privacy (SP), 2013 IEEE Symposium on. IEEE; 2013. p. 334–48.
22. Sadat MN, Aziz MMA, Mohammed N, Chen F, Jiang X, Wang S. SAFETY: secure gwAs in federated environment through a hYbrid solution. IEEE/ACM Trans Comput Biol Bioinform. 2018. https://doi.org/10.1109/TCBB.2018.2829760.
23. Rivest RL, Adleman L, Dertouzos ML. On data banks and privacy homomorphisms. Foundations of secure computation. 1978;4:169–80.
24. Paillier P. Public-key cryptosystems based on composite degree residuosity classes. Advances in cryptology - EUROCRYPT’99. Springer; 1999. pp. 223–238.
25. ElGamal T. A public key cryptosystem and a signature scheme based on discrete logarithms. IEEE Trans Inf Theory IEEE. 1985;31:469–72.
26. Melchor CA, Barrier J, Fousse L. XPIR: Private information retrieval for everyone. on Privacy Enhancing; 2016; Available: https://hal.archives-ouvertes.fr/hal-01396142/. hal.archives-ouvertes.fr
27. Dowlin N, Gilad-Bachrach R, Laine K, Lauter K, Naehrig M, Wernsing J. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy: International Conference on Machine Learning ICML; 2016. p. 201–10.
28. Naehrig M, Lauter K, Vaikuntanathan V. Can homomorphic encryption be practical? Proceedings of the 3rd ACM workshop on Cloud computing security workshop: ACM; 2011. p. 113–24.
29. Brakerski Z, Gentry C, Vaikuntanathan V. (Leveled) fully homomorphic encryption without bootstrapping. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference on - ITCS ‘12. New York: ACM Press; 2012. pp. 309–325.
30. Fan J, Vercauteren F. Somewhat Practical Fully Homomorphic Encryption. IACR Cryptology ePrint Archive. 2012;2012:144.
31. Bos JW, Lauter KE, Loftus J, Naehrig M. Improved Security for a Ring-Based Fully Homomorphic Encryption Scheme: IMA Int Conf. Springer; 2013. p. 45–64.
32. Acar A, Aksu H, Selcuk Uluagac A, Conti M. A Survey on Homomorphic Encryption Schemes: Theory and Implementation. arXiv. 2017; Available: http://arxiv.org/abs/1704.03578. Accessed 21 Jan 2018
33. Zhou TP, Li NB, Yang XY, Lv LQ, Ding YT, Wang XA. Secure Testing for Genetic Diseases on Encrypted Genomes with Homomorphic Encryption Scheme. Secur Commun Netw. 2018. pp. 1–12. doi: https://doi.org/10.1155/2018/4635715
34. Smart NP, Vercauteren F. Fully homomorphic SIMD operations. Des Codes Cryptogr Springer US. 2014;71:57–81.
35. Brakerski Z, Gentry C, Halevi S. Packed Ciphertexts in LWE-Based Homomorphic Encryption. Public-Key Cryptography – PKC 2013. Berlin: Springer; 2013. p. 1–13.
36. Paar C, Pelzl J. Understanding Cryptography: A Textbook for Students and Practitioners: Springer Science & Business Media; 2009.
37. Chen DD, Mentens N, Vercauteren F, Roy SS, Cheung RCC, Pao D, et al. High-speed polynomial multiplication architecture for ring-LWE and SHE cryptosystems. IEEE Trans Circuits Syst I Regul Pap. 2015;62:157–66.
38. Johnson AEW, Pollard TJ, Shen L, Lehman L-WH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035.
39. Aguilar-Melchor C, Barrier J, Guelton S, Guinet A, Killijian M-O, Lepoint T. NFLlib: NTT-Based Fast Lattice Library. Topics in Cryptology - CT-RSA 2016. Cham: Springer; 2016. p. 341–56.
40. Volk M, Ripplinger B, Vintar S, Buitelaar P, Raileanu D, Sacaleanu B. Semantic annotation for concept-based cross-language medical information retrieval. Int J Med Inform. 2002;67:97–112.
41. Lindner R, Peikert C. Better key sizes (and attacks) for LWE-based encryption. CT-RSA: Springer; 2011. Available: http://link.springer.com/content/pdf/10.1007/978-3-642-19074-2.pdf#page=330
42. Oechslin P. Making a Faster Cryptanalytic Time-Memory Trade-Off. Advances in Cryptology - CRYPTO 2003. Berlin: Springer; 2003. p. 617–30.
43. Wang X, Feng D, Lai X, Yu H. Collisions for hash functions MD4, MD5, HAVAL-128 and RIPEMD. IACR Cryptology ePrint Archive. 2004;2004:199.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.