Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning
The use of instructive prompts for language models has been extensively researched, including use during pretraining (Raffel et al., 2020), as a second stage of training (Sanh et al., 2022; Wei et al., 2021), and during inference to guide model output (Brown et al., 2020). Within the third category, in order to improve upon manual prompt engineering, researchers have implemented methods to learn discrete natural language prompts (Shin et al., 2020), to mine them (Jiang et al., 2020), or, neglecting natural language, to learn continuous prompts (Li and Liang, 2021; Lester et al., 2021).

Our work leverages continuous prompts as a way of passing an external signal to a model to trigger a desired model behavior (i.e., more or less memorized data in open language generation, which map to an extraction attack and a defense, respectively).

3 Method

Prompt-tuning requires the prepending of a prompt to the prefix embedding and access to the training loss (see Figure 1). Given these constraints, we explore a white-box attack, where the adversary has access to the target model parameters, and a black-box defense, where the adversary interacts with the target model via an API. We therefore do not test our defense against our own attack. Let [prefix || suffix] be a sequence in the training data of the target model.

Figure 1: A schematic of our setup. The upper section shows our training and testing setup while the lower section shows our evaluation metrics.

3.1 Attack

In the attack setting, we assume that the adversary has a set of [prefix || suffix] sequences S_train, sampled from the training set of the target model. Their goal is to extract the suffixes corresponding to a disjoint set of prefixes, denoted by S_test.[2] To do so, the adversary first initializes a prompt: a continuous set of l × e parameters, where e is the embedding size of the model and l is the length of the prompt, a hyperparameter decided by the adversary. The prompt is trained over S_train to facilitate the correct generation of suffixes. To do this, we first prepend the prompt to the embedding of the prefix and pass the joint embedding through the model for generation. We then minimize the loss objective (see below) with respect to the prompt while keeping the parameters of the model frozen.

[2] For simplicity, we assume all prefixes are k-length. This can easily be ensured by padding or truncating different-length prefixes if needed in a real-world setting.

We explore two loss objectives. The first is causal language modeling (hereafter referred to as CLM), where we minimize the cross-entropy loss over the entire sequence (Radford et al., 2019). In the second, the prompt is optimized by minimizing the cross-entropy loss of only the suffixes, given the prefixes. Here, the training is aligned with our inference task such that during training the model is penalized only on the suffix tokens; hence we refer to it as aligned CLM. During inference, the learned prompt is prepended to each embedding of the prefixes in S_test, and the joint embedding is passed to the model for generation (see Figure 1).
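For concreteness, the following is a minimal PyTorch sketch of this prompt-tuning loop with the aligned CLM objective: the target model stays frozen, only the l × e prompt parameters are updated, and the cross-entropy loss is computed on the suffix positions only. It is an illustrative simplification rather than our exact implementation; the function name, the random prompt initialization, and the assumption that batches arrive as (prefix, suffix) token-id tensors are hypothetical choices.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

def train_soft_prompt(batches, model_name="EleutherAI/gpt-neo-125M",
                      prompt_len=100, epochs=15, lr=5e-4):
    """batches: iterable of (prefix_ids, suffix_ids) LongTensors of shape (B, k) and (B, m)."""
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    for p in model.parameters():                  # freeze the target model
        p.requires_grad_(False)

    emb = model.get_input_embeddings()            # maps token ids to embeddings
    e = emb.embedding_dim
    # Soft prompt: a continuous set of l x e parameters (random init here).
    soft_prompt = torch.nn.Parameter(0.02 * torch.randn(prompt_len, e))
    opt = torch.optim.Adam([soft_prompt], lr=lr)

    for _ in range(epochs):
        for prefix_ids, suffix_ids in batches:
            seq_ids = torch.cat([prefix_ids, suffix_ids], dim=1)
            inputs = emb(seq_ids)                                   # (B, k+m, e)
            prompt = soft_prompt.unsqueeze(0).expand(seq_ids.size(0), -1, -1)
            joint = torch.cat([prompt, inputs], dim=1)              # (B, l+k+m, e)

            logits = model(inputs_embeds=joint).logits
            # Aligned CLM: penalize only the suffix tokens. The logit that
            # predicts token t sits at position t - 1 of the joint sequence.
            k, m = prefix_ids.size(1), suffix_ids.size(1)
            start = prompt_len + k - 1
            suffix_logits = logits[:, start:start + m, :]
            loss = F.cross_entropy(
                suffix_logits.reshape(-1, suffix_logits.size(-1)),
                suffix_ids.reshape(-1))

            opt.zero_grad()
            loss.backward()
            opt.step()
    return soft_prompt
```

The plain CLM variant differs only in the loss slice: the cross-entropy is taken over all prefix and suffix positions rather than the suffix alone.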
3.2 Defense

In the defense setting, the defender (API owner) trains the prompt and prepends it to the incoming prefixes before passing them to the model. Our algorithm is inspired by the machine-unlearning literature (Halimi et al., 2022) and by defenses against membership inference and backdoor attacks (Chen et al., 2022; Ozdayi et al., 2021). We introduce a hyperparameter named the learning threshold, denoted by θ.
During prompt training (see Section 3.1), when the loss is less than θ, we perform gradient ascent to penalize the prompt. If the loss is greater than θ, we perform gradient descent with respect to the prompt as usual. Training is stopped once the average epoch loss is equal to or above θ. This allows us to increase the training loss in a controlled manner and stabilize it around θ. Through this process, we can achieve various privacy-utility trade-offs efficiently, without re-training any part of the model. To explore θ, we set its initial value slightly above the model's training loss and increase it in steps of 0.25 until the desired performance is achieved.
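A minimal sketch of this thresholded update rule follows. Here `forward_loss` is an assumed helper that prepends the soft prompt to a batch and returns the (aligned CLM) loss, as in the attack sketch of Section 3.1; the `max_epochs` cap is illustrative, since in practice training stops once the average epoch loss reaches θ.

```python
import torch

def train_defense_prompt(soft_prompt, forward_loss, batches,
                         theta=1.0, lr=5e-4, max_epochs=10):
    opt = torch.optim.Adam([soft_prompt], lr=lr)
    for _ in range(max_epochs):
        epoch_losses = []
        for batch in batches:
            loss = forward_loss(soft_prompt, batch)
            epoch_losses.append(loss.item())
            # Below the threshold: gradient ascent to penalize the prompt.
            # At or above the threshold: ordinary gradient descent.
            objective = -loss if loss.item() < theta else loss
            opt.zero_grad()
            objective.backward()
            opt.step()
        if sum(epoch_losses) / len(epoch_losses) >= theta:
            break                  # training loss has stabilized around theta
    return soft_prompt
```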
4 Experiments

For our experiments, we use the 125M and 1.3B parameter variants of the GPT-Neo models (Black et al., 2021). These are public, decoder-only transformer models (Vaswani et al., 2017) trained using CLM on the Pile dataset (Gao et al., 2020). We extract S_train and S_test from the Language Model Extraction Benchmark dataset (Google-Research). This dataset contains 15k sequences sampled from the training split of the Pile, where each sequence is partitioned into a prefix and a suffix. In the default evaluation setting, both prefix and suffix consist of 50 tokens. We use a random 14k/1k train/test split.

Our evaluation metric of choice is the exact extraction rate, which is the fraction of correctly generated suffixes (i.e., all tokens of the generated suffix match the ground-truth suffix) over the test set. We additionally discuss the fractional extraction rate and present results in Appendix A. As a baseline, we use the attack analyzed in Carlini et al. (2022), which consists of feeding the prefixes to the model and generating suffixes with greedy decoding. To the best of our knowledge, this is the only prior extraction attack for this setting. Our training setup is discussed in Appendix B. All experiments are repeated over 5 runs, with a new random train/test split in each run.
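The baseline attack and the exact extraction rate metric can be sketched as follows with the HuggingFace generate API; this is an illustrative simplification (one sequence at a time, no attention masks or padding), not our evaluation code.

```python
import torch

@torch.no_grad()
def exact_extraction_rate(model, test_pairs, suffix_len=50):
    """test_pairs: iterable of (prefix_ids, suffix_ids) LongTensors of shape (1, k) and (1, m)."""
    hits, total = 0, 0
    for prefix_ids, suffix_ids in test_pairs:
        out = model.generate(
            prefix_ids,
            max_new_tokens=suffix_len,
            do_sample=False,          # greedy decoding, as in the baseline
            num_beams=1,
        )
        generated = out[0, prefix_ids.size(1):]
        # Count a hit only if every generated token matches the true suffix.
        if torch.equal(generated, suffix_ids[0]):
            hits += 1
        total += 1
    return hits / total
```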
4.1 Attack

We explore the performance of our attack across several dimensions: prompt length, suffix size, prefix size, and beam size. We use greedy decoding in all cases, except in the beam-size experiments.

Prompt Length. First, we explore prompt length in the context of the default setting (prefix and suffix consist of 50 tokens; Figures 2-A1 and 2-A2). We note that prompts tuned with both CLM and aligned CLM provide improvements over the baseline in all cases, with aligned CLM providing the best performance. Given this, we train prompts using the aligned CLM objective for all other experiments, including our defense.

With aligned CLM, we achieve the highest extraction rates of 25.8% and 54.3% for the 125M and 1.3B models, respectively (an improvement of 8.9 and 9.3 percentage points, respectively), with a 100-token prompt (blue line). We observe that extraction rates increase with prompt length and tend to saturate after prompt length 100. Over-fitting was ruled out as a potential cause of saturation, as there is no increase in test loss during training. This suggests that there is an upper limit on the number of prompt parameters that add value for extraction purposes given our objective. We note that more sophisticated training strategies (designing better loss functions, better prompt initialization, etc.) might yield better extraction rates.

Suffix Size. Next, we fix the prefix size to 50 and vary the suffix size. As shown in Figures 2-B1 and 2-B2, extraction rates decrease roughly exponentially with suffix size. We note that as suffix size increases, longer prompts (≥ 20) provide greater improvements over the baseline. For example, with a prompt length of 100 (blue line) using the 1.3B model, at suffix size 5 we observe an extraction rate increase of 5.3 percentage points, whereas at suffix size 50 the increase is 9.3 percentage points.

Prefix Size. Next, we fix the suffix size to 50 and vary the prefix size. As shown in Figures 2-C1 and 2-C2, extraction rates increase roughly logarithmically with prefix size (as in Carlini et al. 2022). Contrary to suffix size, we observe that the gap between the baseline and our attack decreases with increasing prefix size. This suggests that our attack stands to benefit a less informed adversary (small prefix sizes) the most when compared to the baseline.

Beam Decoding. Finally, we use the default setting with prefix and suffix sizes of 50 tokens and vary the beam size (beam size = 1 corresponds to greedy decoding). The results are shown in Figures 2-D1 and 2-D2. We observe that extraction rates increase across the board when increasing the beam size from 1 to 5. However, improvements tend to plateau or oscillate when the beam size is greater than 5. The 1.3B model benefits more from increasing beam size, achieving the highest extraction rate of 61.4% at a beam size of 20 (with a prompt length of 150). The highest extraction rate achieved for the 125M model was 28.3%, at a beam size of 15 (with a prompt length of 100).
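For the beam-size experiments only the decoding call changes; with the same assumed setup as in the sketch above, the greedy call is replaced by beam search (beam size 5 shown; sizes up to 20 are used in Figures 2-D1 and 2-D2):

```python
out = model.generate(prefix_ids, max_new_tokens=50,
                     do_sample=False, num_beams=5, early_stopping=True)
```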
Figure 2: The change in exact extraction rates against prompt length (2-A1, 2-A2), suffix size (2-B1, 2-B2), prefix
size (2-C1, 2-C2) and beam size (2-D1, 2-D2). Top panels show the GPT-Neo-125M results while the bottom
panels show GPT-Neo-1.3B results. The transparent polygons about each line represent 95% confidence intervals
across the points.
4.2 Defense

Finally, we evaluate the privacy-utility trade-off of our black-box defense. As mentioned in Section 3, our defense is designed for a black-box adversary and cannot be tested against our white-box attack. Therefore, we use the baseline attack (Section 4) to quantify privacy. We note that longer prompts did not add value in the defense setting, so we resort to using a prompt of length 1. We use perplexity (PPL) on generated suffixes to quantify the utility of the model, in addition to using the exact extraction rate as in Section 3.1. To measure PPL, we use a random subset of 1k sequences sampled from the test split of the Pile, ensuring that PPL is measured on data unseen by the model. We also compare our metrics with those of similar-sized models that were not trained on the Pile dataset (GPT2 models). Our premise here is that better performance in terms of privacy and utility, when compared to an out-of-domain model of similar size, would mean that our defense mechanism is of value to an API owner.

In Table 1, we display our results obtained using the default evaluation setting (prefix and suffix comprise 50 tokens). Our defense achieves lower extraction rates with competitive PPL values. For the 125M model, we achieve an exact extraction rate reduction of 99.4% relative to the baseline, with a PPL increase of 25.3%, at θ = 1.75. For the 1.3B model, the extraction rate is reduced by 97.7% relative to the baseline, with a PPL increase of 16.9%, at θ = 1. The ability to achieve lower extraction rates with lower PPL values, as measured against the GPT2 models of corresponding size, provides evidence that our defense is effective.

Model          θ      Exact Extraction Rate   Pile Test PPL
GPT-Neo 125M   0*     0.169 ± 0.007           15.71  ± 0.431
               1.25   0.031 ± 0.005           16.601 ± 0.197
               1.5    0.006 ± 0.001           17.499 ± 0.156
               1.75   0.001 ± 0.000           19.691 ± 0.598
GPT2 124M      -      0.004 ± 0.002           30.323 ± 1.019
GPT-Neo 1.3B   0*     0.450 ± 0.015           9.213  ± 0.232
               0.5    0.108 ± 0.020           9.758  ± 0.245
               0.75   0.022 ± 0.004           10.267 ± 0.094
               1      0.010 ± 0.002           10.775 ± 0.248
GPT2 1.5B      -      0.019 ± 0.002           17.155 ± 0.545

Table 1: Exact extraction rates and corresponding perplexities for our defense setting, with different values of θ. Values are reported as mean ± std. Extraction rates that are smaller than those of the corresponding GPT2 variant of similar size, achieved while perplexity values are also smaller, are good. (* no defense)
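The PPL values reported in Table 1 can be computed as sketched below; `encoded_batches` is an assumed iterable of token-id tensors without padding, and the defender's length-1 prompt would be prepended exactly as during training.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, encoded_batches):
    total_nll, total_tokens = 0.0, 0
    for input_ids in encoded_batches:                 # (B, T) LongTensor
        out = model(input_ids=input_ids, labels=input_ids)
        n = input_ids.numel() - input_ids.size(0)     # T - 1 shifted targets per row
        total_nll += out.loss.item() * n              # out.loss is a mean NLL
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```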
5 Conclusion

We present the first known effort to leverage prompt-tuning to control the extractability of memorized data from LLMs in an open language generation task. We develop a novel data extraction attack and defense, and illustrate their performance under various settings. Our attack consistently outperforms the baseline in terms of exact extraction rate. Our defense provides competitive privacy-utility trade-offs and would prove beneficial to API owners with models trained on sensitive content. These results are achieved efficiently, without any change to the original model weights. We detail avenues of future work in Appendix C.

Acknowledgements

The authors would like to thank Wael Hamza for helpful discussions on this topic and Stephen Rawls for help with securing the GPU instances that were required for experimentation.
6 Limitations
7 Ethical Considerations
References

Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Xiaodong Song. 2018. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium.

Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Xiaodong Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. Extracting training data from large language models. In USENIX Security Symposium.

Dingfan Chen, Ning Yu, and Mario Fritz. 2022. RelaxLoss: Defending membership inference attacks without losing utility. ArXiv, abs/2207.05801.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Christophe Dupuy, Radhika Arava, Rahul Gupta, and Anna Rumshisky. 2021. An efficient DP-SGD mechanism for large scale NLU models. In ICASSP 2022.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.

Jimit Majmudar, Christophe Dupuy, Charith S. Peris, Sami Smaili, Rahul Gupta, and Richard S. Zemel. 2022. Differentially private decoding in large language models. ArXiv, abs/2205.13621.

Mustafa Safa Ozdayi, Murat Kantarcioglu, and Yulia R. Gel. 2021. Defending against backdoors in federated learning with robust learning rate. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10):9268–9276.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Empirical Methods in Natural Language Processing (EMNLP).

Saleh Soltan, Shankar Ananthakrishnan, Jack FitzGerald, Rahul Gupta, Wael Hamza, Haidar Khan, Charith Peris, Stephen Rawls, Andy Rosenbaum, Anna Rumshisky, Chandana Satya Prakash, Mukund Sridhar, Fabian Triefenbach, Apurv Verma, Gokhan Tur, and Prem Natarajan. 2022. AlexaTM 20B: Few-shot learning using a large-scale multilingual seq2seq model. arXiv.

Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. ArXiv, abs/1706.03762.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Conference on Empirical Methods in Natural Language Processing.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2021. Finetuned language models are zero-shot learners.

Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramèr, and Nicholas Carlini. 2021. Counterfactual memorization in neural language models. ArXiv, abs/2112.12938.

A Fractional Extraction Rate Results

Fractional extraction rate is the fraction of generated tokens that are both correct and in the right position, over the dataset (see the lower section of Figure 1). Our reason for measuring this metric is to provide a more detailed assessment of the risks associated with extraction. Exact extraction rate is particularly important in cases where the attacker requires an exact match for the extraction to be of use; a good example is the case of extracting a credit card number. In such cases, even getting a few tokens incorrect will render the attack useless. However, when the attacker cares more about the meaning of the extracted sequences, fractional extraction rate can be a better metric to assess the risk. This is because a human might be able to infer the correct meaning of the sequence even when a few tokens are wrong.

The results related to this metric are shown in Figure 3. Comparing these results with the exact extraction rate results (Figure 2), we observe the same trends across all of our experiments. The same shared trends are observed in the case of our defense; the corresponding fractional extraction rate results are tabulated in Table 2.
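A minimal sketch of this metric, assuming the generated and ground-truth suffixes are available as 1-D token-id tensors (names are illustrative):

```python
import torch

def fractional_extraction_rate(generated_suffixes, true_suffixes):
    correct, total = 0, 0
    for gen, true in zip(generated_suffixes, true_suffixes):
        n = min(gen.size(0), true.size(0))
        # A token counts only if it is correct and in the correct position.
        correct += (gen[:n] == true[:n]).sum().item()
        total += true.size(0)
    return correct / total
```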
B Training Setup

Our soft prompts are initialized to random word embeddings, as described in Lester et al. (2021). We use a batch size of 128 and the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 5e-4. For the attack setting, the prompts are trained for 15 epochs. In the defense case, the prompts are trained until the training loss stabilizes around the specified θ value (as described in Section 3.2), which happens within 2-3 epochs in our experiments.

We use a PyTorch (Paszke et al., 2019) implementation where we leverage the HuggingFace Accelerate (HF) and DeepSpeed (Rasley et al., 2020) libraries to handle distributed training over 8 GPUs.
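The initialization from random word embeddings can be sketched as follows (a simplified illustration; `model` is a loaded GPT-Neo checkpoint as in Section 4, and the function name is hypothetical):

```python
import torch

def init_soft_prompt_from_vocab(model, prompt_len=100):
    emb = model.get_input_embeddings().weight           # (vocab_size, e)
    idx = torch.randint(0, emb.size(0), (prompt_len,))
    # Copy the embeddings of randomly sampled vocabulary tokens
    # (Lester et al., 2021) as the starting point of the soft prompt.
    return torch.nn.Parameter(emb[idx].detach().clone())
```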
Figure 3: The change in fractional extraction rates against prompt length (3-A1, 3-A2), suffix size (3-B1, 3-B2),
prefix size (3-C1, 3-C2) and beam size (3-D1, 3-D2). Top panels show the GPT-Neo-125M results while the
bottom panels show GPT-Neo-1.3B results. The transparent polygons about each line represent 95% confidence
intervals across the points.
Model          θ      Fractional Extraction Rate   Pile Test PPL
GPT-Neo 125M   0*     0.35  ± 0.006                15.71  ± 0.431
               1.25   0.192 ± 0.011                16.601 ± 0.197
               1.5    0.123 ± 0.005                17.499 ± 0.156
               1.75   0.087 ± 0.003                19.691 ± 0.598
GPT2 124M      -      0.099 ± 0.003                30.323 ± 1.019
GPT-Neo 1.3B   0*     0.634 ± 0.013                9.213  ± 0.232
               0.5    0.316 ± 0.022                9.758  ± 0.245
               0.75   0.171 ± 0.004                10.267 ± 0.094
               1      0.128 ± 0.006                10.775 ± 0.248
GPT2 1.5B      -      0.166 ± 0.003                17.155 ± 0.545

Table 2: Fractional extraction rates and corresponding perplexities for our defense setting, with different values of θ. Values are reported as mean ± std. Extraction rates that are smaller than those of the corresponding GPT2 variant of similar size, achieved while perplexity values are also smaller, are good. (* no defense)

C Future Work

We have several avenues that we would like to explore in future work. We envision that more sophisticated training strategies might yield better extraction rates in our attack setting (designing better loss objectives, better initialization of soft prompts, etc.), and we would like to explore this further.

We would like to explore different prompt-learning algorithms, such as other parameter-efficient training methods (Li and Liang, 2021; Hu et al., 2021) and hard-prompt learning methods (Wallace et al., 2019), in order to conduct a more robust analysis of extraction rates.

We would also like to test the transferability of trained prompts across different models and datasets.

Finally, we would like to combine our defense with other existing defenses, such as those applied at training time (e.g., versions of differentially private stochastic gradient descent; Abadi et al. 2016; Dupuy et al. 2021) or those applied at the decoding stage (e.g., differentially private decoding; Majmudar et al. 2022). The goal would be to achieve better privacy-utility trade-offs under a combination of such defenses.