Private prediction for large-scale synthetic text generation

Amin, Kareem; Bie, Alex; Kong, Weiwei; Kurakin, Alexey; Ponomareva, Natalia; Syed, Umar; Terzis, Andreas; Vassilvitskii, Sergei

arXiv:2407.12108 (cs)

[Submitted on 16 Jul 2024 (v1), last revised 9 Oct 2024 (this version, v2)]

Abstract: We present an approach for generating differentially private synthetic text using large language models (LLMs), via private prediction. In the private prediction framework, we only require the output synthetic data to satisfy differential privacy guarantees. This is in contrast to approaches that train a generative model on potentially sensitive user-supplied source data and seek to ensure the model itself is safe to release.
We prompt a pretrained LLM with source data, but ensure that next-token predictions are made with differential privacy guarantees. Previous work in this paradigm reported generating a small number of examples (fewer than 10) at reasonable privacy levels, an amount of data that is useful only for downstream in-context learning or prompting. In contrast, we make changes that allow us to generate thousands of high-quality synthetic data points, greatly expanding the set of potential applications. These gains come from an improved privacy analysis and a better private selection mechanism, which makes use of the equivalence between the softmax layer for sampling tokens in LLMs and the exponential mechanism. Furthermore, we introduce a novel use of public predictions via the sparse vector technique, in which we do not pay privacy costs for tokens that are predictable without sensitive data; we find this to be particularly effective for structured data.
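
The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation or their privacy analysis; it only shows the textbook forms of the ideas. It assumes per-example logits are clipped to [-c, c] and summed, that neighboring datasets differ by replacing one example (so the summed score has sensitivity 2c), and that the sparse vector step is the standard AboveThreshold algorithm. All function names, the clipping scheme, and the noise scales are assumptions introduced for illustration.

```python
import math
import random


def clip(x, c):
    """Clamp x to the interval [-c, c]."""
    return max(-c, min(c, x))


def aggregate_clipped_logits(per_example_logits, c):
    """Sum per-example logits after clipping each coordinate to [-c, c]."""
    vocab = len(per_example_logits[0])
    return [sum(clip(z[t], c) for z in per_example_logits) for t in range(vocab)]


def softmax(scores, temperature):
    """Numerically stable softmax of scores / temperature."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Exponential mechanism: P(t) proportional to exp(eps * u_t / (2 * sensitivity)).

    This is exactly a softmax with temperature 2 * sensitivity / epsilon.
    """
    return softmax(utilities, temperature=2.0 * sensitivity / epsilon)


def private_next_token(per_example_logits, c, epsilon, rng):
    """Sample one token index from the clipped, aggregated logits."""
    utilities = aggregate_clipped_logits(per_example_logits, c)
    sensitivity = 2.0 * c  # assumption: replace-one adjacency
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def laplace(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def first_above_threshold(queries, threshold, epsilon, sensitivity, rng):
    """Standard AboveThreshold (sparse vector technique).

    Returns the index of the first query whose noisy value reaches the
    noisy threshold, or None if none does. Queries that stay below the
    threshold incur no additional privacy cost.
    """
    noisy_threshold = threshold + laplace(2.0 * sensitivity / epsilon, rng)
    for i, q in enumerate(queries):
        if q + laplace(4.0 * sensitivity / epsilon, rng) >= noisy_threshold:
            return i
    return None
```

In a generation loop, one hypothetical use of `first_above_threshold` is to feed it a per-token measure of how far the public-only prediction is from the prediction made with sensitive data: tokens before the returned index could be emitted from the public model, and `private_next_token` would be called only at the index where the two disagree.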

Comments:	20 pages; updated figure + some new experiments from EMNLP 2024 findings camera-ready
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Cite as:	arXiv:2407.12108 [cs.LG]
	(or arXiv:2407.12108v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2407.12108

Submission history

From: Alex Bie
[v1] Tue, 16 Jul 2024 18:28:40 UTC (932 KB)
[v2] Wed, 9 Oct 2024 17:45:07 UTC (857 KB)
