Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?

Eric Lehman, Sarthak Jain, Karl Pichotta, Yoav Goldberg, Byron Wallace


Abstract
Large Transformers pretrained over clinical notes from Electronic Health Records (EHR) have afforded substantial gains in performance on predictive clinical tasks. The cost of training such models (and the necessity of data access to do so) coupled with their utility motivates parameter sharing, i.e., the release of pretrained models such as ClinicalBERT. While most efforts have used deidentified EHR, many researchers have access to large sets of sensitive, non-deidentified EHR with which they might train a BERT model (or similar). Would it be safe to release the weights of such a model if they did? In this work, we design a battery of approaches intended to recover Personal Health Information (PHI) from a trained BERT. Specifically, we attempt to recover patient names and conditions with which they are associated. We find that simple probing methods are not able to meaningfully extract sensitive information from BERT trained over the MIMIC-III corpus of EHR. However, more sophisticated “attacks” may succeed in doing so: To facilitate such research, we make our experimental setup and baseline probing models available at https://github.com/elehman16/exposing_patient_data_release.
Anthology ID:
2021.naacl-main.73
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
946–959
URL:
https://aclanthology.org/2021.naacl-main.73
DOI:
10.18653/v1/2021.naacl-main.73
Cite (ACL):
Eric Lehman, Sarthak Jain, Karl Pichotta, Yoav Goldberg, and Byron Wallace. 2021. Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 946–959, Online. Association for Computational Linguistics.
Cite (Informal):
Does BERT Pretrained on Clinical Notes Reveal Sensitive Data? (Lehman et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-main.73.pdf
Video:
https://aclanthology.org/2021.naacl-main.73.mp4
Code
elehman16/exposing_patient_data_release
Data
MIMIC-III