0% found this document useful (0 votes)
30 views13 pages

What Do Code Models Memorize? An Empirical Study On Large Language Models of Code

Uploaded by

hsgarross
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
30 views13 pages

What Do Code Models Memorize? An Empirical Study On Large Language Models of Code

Uploaded by

hsgarross
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

What Do Code Models Memorize?

An Empirical Study on Large Language Models of Code


Zhou Yang Zhipeng Zhao Chenyu Wang
Singapore Management University Singapore Management University Singapore Management University
Singapore Singapore Singapore
[email protected] [email protected] [email protected]

Jieke Shi Dongsun Kim DongGyun Han


Singapore Management University Kyungpook National University Royal Holloway, University of London
arXiv:2308.09932v1 [cs.SE] 19 Aug 2023

Singapore South Korea United Kingdom


[email protected] [email protected] [email protected]

David Lo
Singapore Management University
Singapore
[email protected]

ABSTRACT metrics that infer whether an output contains memorization ac-


The availability of large-scale datasets, advanced architectures, and curately. We also make some suggestions regarding dealing with
powerful computational resources have led to effective code models memorization in code models.
that automate diverse software engineering activities. The datasets
usually consist of billions of lines of code from both open-source KEYWORDS
and private repositories. A code model memorizes and produces Open-Source Software, Memorization, Code Generation
source code verbatim, which potentially contains vulnerabilities,
ACM Reference Format:
sensitive information, or code with strict licenses, leading to poten-
Zhou Yang, Zhipeng Zhao, Chenyu Wang, Jieke Shi, Dongsun Kim, Dong-
tial security and privacy issues.
Gyun Han, and David Lo. 2023. What Do Code Models Memorize? An
This paper investigates an important problem: to what extent do Empirical Study on Large Language Models of Code. In Proceedings of
code models memorize their training data? We conduct an empirical ACM Conference (Conference’17). ACM, New York, NY, USA, 13 pages. https:
study to explore memorization in large pre-trained code models. //doi.org/10.1145/nnnnnnn.nnnnnnn
Our study highlights that simply extracting 20,000 outputs (each
having 512 tokens) from a code model can produce over 40,125 code
snippets that are memorized from the training data. To provide a
1 INTRODUCTION
better understanding, we build a taxonomy of memorized contents As more large open-source datasets become publicly available [30,
with 3 categories and 14 subcategories. The results show that the 38], code models [10, 21, 29, 49, 65], trained on billion lines of code,
prompts sent to the code models affect the distribution of mem- are now an important part of software engineering. The models
orized contents. We identify several key factors of memorization. automate a series of critical tasks such as defect prediction [59],
Specifically, given the same architecture, larger models suffer more code review [42], code generation [7] and software questions anal-
from memorization problem. A code model produces more memo- ysis [33, 34, 67] These models have gone beyond academic explo-
rization when it is allowed to generate longer outputs. We also find ration and have been widely deployed and used by a large number
a strong positive correlation between the number of an output’s of users. For example, GitHub CoPilot [2], powered by the Ope-
occurrences in the training data and that in the generated outputs, nAI Codex model [21], obtains over 400,000 subscribers in the first
which indicates that a potential way to reduce memorization is to month of its release [62] and has already been used by over 1.2 mil-
remove duplicates in the training data. We then identify effective lion developers. In addition, the recently released models [3, 29, 41]
demonstrate outstanding performance in a variety of software en-
gineering tasks [57, 66] and has powered many tools, e.g., IDE
plugins.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed The impressive performance of these models can be attributed to
for profit or commercial advantage and that copies bear this notice and the full citation the combination of advanced model architectures (e.g., state-of-the-
on the first page. Copyrights for components of this work owned by others than ACM art Transformer models with Despite the remarkable advancements,
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a they are confronted with a set of privacy and legal challenges. For
fee. Request permissions from [email protected]. example, CoPilot was found to produce real people’s names and
Conference’17, July 2017, Washington, DC, USA physical addresses in its outputs [25]. It is also reported that Sam-
© 2023 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00 sung employees accidentally leaked company secrets via Chat-
https://doi.org/10.1145/nnnnnnn.nnnnnnn GPT [6], which raises concerns as ChatGPT retains records of
Conference’17, July 2017, Washington, DC, USA Zhou Yang, Zhipeng Zhao, Chenyu Wang, Jieke Shi, Dongsun Kim, DongGyun Han, and David Lo

these conversations and could potentially use this data for training sliding windows. We compute these metrics to rank the outputs and
its system [5]. analyze the top-100 ranking outputs. We find that the first 3 metrics
The above concerns emphasize the importance of exploring the are very effective in identifying memorized contents: over 97% of
capacity of large code models to memorize their training data. A code the top-100 outputs contain memorization. However, metrics (1)
model memorizes a string from its training data if there exists a and (3) tend to rank outputs with license information higher, while
prompt to make the model generate this string. This exploration metric (2) tends to rank outputs with code logic higher. Additionally,
is critical, especially given the fact that the training data for these to demonstrate that memorization also exists in other code models,
models may come from a variety of sources. For example, large we analyze outputs from two popular code models that have already
public datasets, such as the repositories hosted on GitHub, are been deployed in practice: Incoder [29] and StarCoder [41]. We
made publicly available for training code models. These datasets manually analyze the top 100 outputs from the two models ranked
may contain licensed code. If code models memorize and generate using metric (2). We search code snippets in the top 100 outputs
licensed code, and model users are not aware of this and utilize the on GitHub to see whether GitHub repositories contain the same
code, it potentially results in a breach of the license agreements. code. If yes, it is an evidence that the model potentially memorize
The training data includes ‘software secrets’ [13] such as pass- the code from its training data collected from GitHub. We find that
words and API keys, which can be outputted to users and leveraged 81% and 75% of the top-100 outputs from Incoder and StarCoder
by attackers directly. For instance, as shown in our evaluation, a contains code snippets that can be found on GitHub, respectively.
code model can memorize the username and password to access a Besides, 90.12% and 65.33% of memorization are related to code
database in a public IP address. Furthermore, the training datasets logic.
may contain vulnerable or even malicious code. Such code could To summarize, this paper makes the following contributions:
potentially be memorized and displayed to users, causing a serious • We systematically investigate the phenomena of memoriza-
risk to the security and integrity of software systems that adopt tion in large code models, highlighting the potential risks of
outputs of the code models. Therefore, it is crucial to understand memorization in code models.
the memorization in code models. • We empirically show the feasibility to extract a large number of
To investigate the memorization of code models, we first explore memorized contents from a code model using simple methods.
two open-source models: CodeParrot and CodeParrot-small [1]. We categorize the memorized contents into 14 categories and
CodeParrot is a GPT-2 model with 1.5 billion parameters trained to analyze their distribution.
generate Python code. CodeParrot-small is a lightweight version • We analyze the factors affecting memorization in code models.
of CodeParrot with 110 million parameters. We choose to analyze Also, we evaluate metrics to infer memorization.
the two models because (1) their training data is available and is of a • We show that deployed code models memorize training data
size feasible to conduct memorization analysis; (2) the models share as well. We also provide some suggestions on how to deal with
the same architecture and dataset but come in two sizes, allowing potential risks brought by such memorization.
us to investigate the impact of model size on memorization;
Our investigation first extracts a large number of outputs using Paper structure. Section 2 provides the background and moti-
different methods. We extract 20,000 outputs by feeding a special vation. In Section 3, we detail our methodology for extracting,
token called the ‘start token’ as a prompt into CodeParrot; this identifying and inferring memorization. Section 4 describes the
method provides no information to guide the generation. After iden- experiment settings. We analyze the memorization by answering
tifying over 40,125 unique memorized code snippets using clone three research questions in Section 5. In Section 6, we provide some
detection, we conduct an open card sorting study on a statistically suggestions and discussion Next, Section 7 highlights relevant stud-
representative number (381) of memorized contents to build a tax- ies. We conclude our paper and present future work in Section 8.
onomy with 3 categories and 14 subcategories. We find that 277
out of 381 memorized code snippets contain documentation (e.g., 2 BACKGROUND AND MOTIVATION
license information and docstring), and 239 out of 381 memorized
code snippets contain code logic. This section describes the code generation process and some moti-
Additionally, we investigate factors that affect the memorization. vating examples to study such memorization.
The results show that CodeParrot memorizes more contents than
CodeParrot-small, indicating that larger models have stronger 2.1 Code Generation Process
memorization ability. Allowing a model to generate longer outputs Modern code models typically utilize the Transformer architec-
can reveal more memorization. When the maximal length of outputs ture [63]. Many well-known models (such as InCoder [29] and
increases, the number of memorized contents increases as well. codeparrot [1]) are built on the GPT-2 model [53], which is referred
We also find a strong positive correlation between the number to as the generative pre-trained transformer model. The primary ob-
of a memorized content’s occurrences in the training data and jective of such models is to generate code based on a given context
that in the outputs, which indicates that a potential way to reduce (i.e., the prompt sent into the model’s input layer). This is accom-
memorization is to remove duplicates in the training data. plished by training the model on a large corpus of datasets and
Inspired by a study on memorization in language models [19], letting the model learn a probability distribution that predicts the
we investigate four metrics to infer whether an output contains likelihood of the next token, given the context.
memorization: (1) perplexity [22], (2) ratio of perplexity of two mod- Let us assume that the context consists of a sequence of tokens,
els, (3) ratio of perplexity to zlib [4], and (4) average perplexity of denoted by 𝑝 = ⟨𝑥 1, 𝑥 2, · · · , 𝑥𝑛 ⟩. The model applies a softmax layer
What Do Code Models Memorize?
An Empirical Study on Large Language Models of Code Conference’17, July 2017, Washington, DC, USA

to compute the probability distribution of the next token, denoted the generated code without proper compliance with the Apache
by 𝑃 (𝑥𝑛+1 |𝑝). Formally, this distribution is defined: License 2.0 terms, it may violate open-source licenses.
𝑒 𝑧𝑛+1 1 def read ( self , iprot ):
𝑃 (𝑥𝑛+1 |𝑝) = Í (1) 2 iprot . readStructBegin ()
𝑘 𝑒 𝑧𝑖
𝑖=1 3 while True :
4 ( fname , ftype , fid ) = iprot . readFieldBegin ()
In the equation, 𝑧𝑖 is the logit of the 𝑖-th token computed by the 5 if ftype == TType . STOP :
6 break
model. Then, the model uses a decoding strategy to decide the next 7 if fid == 1:
token. A commonly adopted strategy is the top-𝑘 sampling [15]. 8 if ftype == TType . STRING :
9 .....
The model selects the 𝑘 most probable tokens from the probability
distribution, and then chooses the next token from this new set of Listing 1. An example of code snippets generated by
tokens. The parameter 𝑘 plays an important role in balancing the CodeParrot, which is protected by an Apache license.1
trade-off between diversity and coherence in the generated output.
Intuitively, a small 𝑘 encourages the model to generate more fixed Scenario 2: Security vulnerabilities. The training data may con-
and coherent outputs, while a large 𝑘 introduces randomness and tain vulnerable and even malicious code. After being trained on such
diversity. This paper assumes users can control the code generation data, code models might generate malicious code and exploitable
by modifying the 𝑘 parameter, which is feasible even with only bugs, or adopt poor security practices. If a model memorizes and
access to the API (e.g., OpenAI API allows 𝑘 to be changed). outputs insecure code to users, it puts users and systems at risk
by opening them up to security threats. Listing 2 demonstrates
2.2 Memorization Definition potential security risks related to code model memorization. The
Memorization in the context of generation tasks refers to a model’s memorized code is susceptible to SQL injection attacks because it di-
ability to store and recall specific information or patterns it has rectly uses the ‘user_id’ input in the SQL query without sanitizing
encountered during its training. We define a string 𝑠 as a piece of or parameterizing it. An attacker could potentially manipulate the
memorization as follows. Let 𝑓 : 𝑃 → 𝑆 be a code model, where 𝑃 ‘user_id’ input to execute malicious SQL queries if this generated
is the space of prompts and 𝑆 is the space of strings. Let 𝐷 ⊆ 𝑆 be code is adopted by users.
the training dataset. A string 𝑠 ∈ 𝑆 is considered memorized if:
1 def fetch_user_data ( user_id ):
2 connection = sqlite3 . connect (< masked value >)
∃𝑝 ∈ 𝑃; 𝑠 ∈ 𝑓 (𝑝) ∧ 𝑠 ∈ 𝐷 (2) 3 cursor = connection . cursor ()
4 query = f ' SELECT * FROM users WHERE id = { user_id } '
In other word, a string 𝑠 is memorized if there exists a prompt 𝑝 such 5 + < masked value >
6 cursor . execute ( query )
that the model generates 𝑓 (𝑝) when given 𝑝 as input, and 𝑠 appears
both in the training dataset 𝐷 and the generated output 𝑓 (𝑝). In Listing 2. An example of insecure code generated by
practice, the length of 𝑠 should be substantially long. Otherwise, CodeParrot. The identifiers are substituted and strings are
there might be many less useful memorized code such as variable masked due to ethical consideration.
names or boilerplate statements. We explain the threshold to decide
memorization from code models in Section 3.2. Scenario 3: Leakage of sensitive information. Recent studies
show that there exist a vast amount of software secrets in public
2.3 Memorization in Code Models repositories [13, 27], including API keys, passwords, etc. A code
The issue of memorization is especially concerning and should model that tends to memorize data can cause the leakage of sensitive
warrant careful attention from the research community. We explain information if it has been exposed to such information during
three scenarios where memorization may negatively impact the training. Listing 3 provides an example that a code model memorizes
code models. Each scenario comes with an example of memorization a hardcoded public IP address, username, and password from its
we encountered during our experiments. We obfuscate the examples training data. On the one side, the code model exposes sensitive
due to ethical considerations (e.g., protecting sensitive information information, leading to potential security breaches and privacy
and the identity of contributors of vulnerable code). violations. On the other side, even the end users do not use the
generated information, users may use the program sketch and
Scenario 1: Intellectual property issues. Some code models are
replace the password with their own, which is still an insecure
trained on public repositories on GitHub, without paying much
development practice.
attention to the licensing terms of the code. Consequently, mem-
orization in code models may cause intellectual property issues, 2.4 Ethical Consideration
such as violation of open-source licenses. When the model learns
to reproduce specific code snippets from the training data instead We emphasize our ethical responsibility by explicitly stating that
of generalizing the underlying concepts, this leads to the genera- our goal is to comprehend the memorization phenomenon rather
tion of code that closely resembles copyrighted or licensed code, than actively exploiting sensitive information within memorized
potentially infringing on the intellectual property rights of the orig- contents. Consequently, we avoid utilizing ‘targeted attacks’ [9] (i.e.,
inal authors. Listing 1 shows an output from CodeParrot, which aiming to extract specific types of memorization from the training
memorizes the original read function verbatim from a public repos- data) to identify or extract sensitive information deliberately. To
itory that is licensed under Apache License 2.0. If a developer uses 1 https://github.com/Workiva/frugal/
Conference’17, July 2017, Washington, DC, USA Zhou Yang, Zhipeng Zhao, Chenyu Wang, Jieke Shi, Dongsun Kim, DongGyun Han, and David Lo

1 netowrk_config = {
2 ' device_type ': < masked value >, When 𝜏 is small, the probability distribution is compressed, which
3 ' ip ': < masked value >, means that the model will be more likely to select the originally
4 ' username ': < masked value >,
5 ' password ': < masked value > most probable tokens. Otherwise, the model will give less prefer-
6 } ence to the originally most probable tokens, producing more diverse
7 network_con = ConnectHandler (** netowrk_config )
8 print network_con . find_prompt () outputs. However, maintaining a high temperature increases the
probability of generating incoherent code and, as a result, produces
Listing 3. An example of source code with sensitive infor-
less memorization. Carlini et al. [19] use the temperature-decaying
mation generated by CodeParrot, including sensitive infor-
strategy. Specifically, during the generation process, the temper-
mation like public IP address, username, etc. Identifiers are
ature gradually decreases from a high value to a low value. It en-
substituted and strings are masked for ethical consideration.
courages the model to generate diverse outputs for the first several
tokens and then gradually focus on generating coherent outputs.
further minimize the ethical concerns, we choose two open-source Prompt-Conditioned Generation (PCG). This strategy requires
code models that are trained on publicly available datasets rather designing a prompt for the code model. Generation from the start
than models that are commercial. Throughout the research process, token tends to produce many contents that usually appear at the
we ensure that we handle data and results in an ethical manner. beginning of code files, e.g., license information. Besides, the users
For example, in Listing 2 and 3, we obfuscate identifiers and mask usually send prompts to the code model for completion rather
the sensitive information so that the contributors of the vulnerable than generating code from the start token. Therefore, we randomly
code cannot be disclosed. choose a list of files from unseen testing data and parse the files
to extract the function definition statements. We feed the func-
3 METHODOLOGY tion definition statements as the prompt to the code model for
This section explains the methodology of our study, including how completion.
to sample outputs from code models and how to detect memoriza-
Two-Step Generation (TSG). In NPG, a code model generates
tion in code models.
outputs from the start token, which produces a large number of
license information and less amount of outputs relevant to the
3.1 Generating Outputs from Code Models
prompt. Thus, after we identify memorization using NPG, we send
We consider four different strategies to generate outputs from code the most frequently appearing memorization as the prompt to the
models: non-prompt generation, temperature-decaying generation, code model for completion to explore whether this can drive the
prompt-conditional generation, and two-step generation. model to generate more memorization.
Non-Prompt Generation (NPG). We leverage the autoregressive
nature of language models to generate outputs without an initial
prompt. The generation process is described as:
3.2 Memorization Detection
(1) Language models usually have a special token to represent the After generating outputs from the code models, we detect whether
beginning of a sentence (e.g., <s> in GPT-2). We initialize a the outputs contain memorization. Recalling the definition of mem-
token sequence with only this start-of-sentence token, denoted orization in Section 2.2, a string from model outputs is memorized if
by 𝑝 = ⟨𝑥 1 ⟩, where 𝑥 1 = <s>. this string is also in the training data. In a previous study analyzing
(2) For each step 𝑡 = 2, 3, · · · , we compute the probability distri- memorization in natural language models [19], the memorization
bution of the next token given the current token sequence as is detected by performing fuzzy match (e.g., 3-gram fuzzy match)
input, 𝑃 (𝑥𝑡 |𝑝) and decide the next token 𝑥𝑡 . We then append between the generated text and the training data. However, this
the new token to the token sequence, updating the context for strategy is not suitable to code models for the following reasons.
the next iteration: 𝑝 ← ⟨𝑝, 𝑥𝑡 ⟩. Unlike natural language, source code has a more structured syntax
(3) The above process repeats until a termination criterion is met. with specific rules and conventions, which usually have unique con-
In this paper, the termination criterion is specified by the maxi- structs, keywords, and idioms that often occur together in specific
mal length of output a model generates. Once the termination sequences. A fuzzy match approach might flag these commonly
criterion is met, the generated token sequence is then decoded occurring sequences as memorization despite the fact that they are
into human-readable code. simply inherent to the language.
We use the concept of code clone to determine whether an output
Temperature-Decaying Generation (TDG). The outputs gen- contains memorization. A Type-1 clone, also known as an exact clone
erated by NPG can often be repetitive and lack of diversity. To or an identical clone, refers to a specific type of code duplication in
increase the diversity, we can use a larger value for the temper- which two or more code fragments are exactly the same [55]. This
ature 𝜏. The temperature value controls the level of diversity in means that the code fragments have the same sequence of tokens,
outputs. Recalling the Equation 1, the temperature 𝜏 is used to scale including comments, whitespace, and formatting. Comparing to
the logits before applying the softmax function. Specially, the new other types of clone (e.g., Type-2 clone), Type-1 clone means that a
distribution is computed as follows: model produces output verbatim from the training data, which is
𝑧𝑛+1 a stronger evidence of memorization. However, very short Type-
𝑒 𝜏
1 clones may not be considered memorization and they may not
𝑃 (𝑥𝑛+1 |𝑝) = Í 𝑧𝑖 (3)
𝑘
𝑖=1 𝑒
𝜏 reveal useful information about the model. For example, the code
What Do Code Models Memorize?
An Empirical Study on Large Language Models of Code Conference’17, July 2017, Washington, DC, USA

snippet ‘a = 1’ may appear in many places in the training data. that the output content is more likely to be memorized but less
As a result, we only consider Type-1 clones that are longer than a likely to be repetitive [19].
threshold number of lines 𝐿.
Average PPL. Memorized contents may be surrounded by non-
Following a study [23] on analyzing the clones of code generated
memorized outputs. Computing the perplexity of an output mixed
by deep learning-based code recommenders, we employ Simian [32]
of memorized and non-memorized content may not reflect the
for identifying Type-1 clones between the generated code and the
model’s confidence. As stated in Section 3.2, clones spanning over
training data. Although there are other clone detection tools, some
6 lines are considered as memorization in this paper. So we apply
of these tools are not publicly accessible, and others are unable to
a sliding window of 6 lines to each output (moving by one line
process incomplete code snippets, which are often produced by
each time). We compute the perplexity of each window and then
language models. By default, Simian identifies clones that span at
compute the Average PPL of all windows. We rank model outputs
least six lines, which is chosen as the threshold 𝐿 in our study.
by the Average PPL in ascending order.

4 EXPERIMENT SETUP
4.1 Subject Code Models and Dataset
3.3 Memorization Prediction 4.1.1 Models for Memorization Analysis. This paper first analyzes
While memorization detection directly compares the output of a code the contents that are exactly memorized from the training data
model with source code in the training data, memorization prediction (i.e., Type-1 clones) and the factors that affect such memoriza-
infers whether an output contains memorized content without tion. To achieve this goal, we investigate two models: CodeParrot
accessing the training data. This task assumes that the training and CodeParrot-small. The former is based on the GPT-2 model
data is not accessible; this may simulate adversarial settings. For with 1.5 billion parameters, and the latter is a smaller version
example, a malicious user may generate a large amount of outputs of CodeParrot, which also uses the GPT-2 architecture but has
and then infer which parts are more likely to be memorization. If fewer (110 million) parameters. Both models are trained on a col-
such a malicious user can infer memorization correctly, information lection of Python files from scratch. The two models are available
regarding the training data (e.g., private code or software secrets) on the HuggingFace model hub. Table 1 shows the performances
can be leaked, putting a potential security risk. According to the of two investigated models and other code models of similar sizes
study on analyzing memorization in natural language models [19], on the OpenAI’s HumanEval benchmark [21]. Although there are
this paper adopts the following metrics to infer memorization. other available code models, this study selects CodeParrot and
Perplexity. Perplexity [22] measures how well a language model CodeParrot-small as the main research subjects for the following
predicts a sample. A lower perplexity value indicates that the model reasons.
is more confident in its predictions. Intuitively, a model is more First, the training data of some models can be either inaccessible
confident on the example that it has seen during training. There- or lack sufficient details, preventing a thorough analysis of their
fore, a lower perplexity value may indicate that a model retrieves memorization capabilities. For example, InCoder [29] states the
memorization. source of its training data but does not provide the exact dataset
used for training. Second, while some models have available data,
PPL-PPL ratio. Language models tend to have low perplexity trivial its extensive dataset, which includes multiple programming lan-
memorization, e.g., repeated contents such as license information. guages, complicates clone detection-based memorization analysis.
From the perspective of risk assessment, exposing code logic leads For example, SantaCoder [10] uses a training set of 3TB and GPT-
to higher risks than exposing trivial memorization. Imagining that Neo [16] is trained on a diverse text dataset of 800GB. Additionally,
there are two code models: a small model and a large model, both SantaCoder only comes in one size, making it unsuitable for an-
trained on the same dataset. As shown in Section 5.2, a large model alyzing the impact of model size. Third, commercial models like
better memorizes more non-trivial contents than the small model. ChatGPT exhibit superior results but cannot be analyzed due to the
It means that on trivial memorization, both two models have small unavailability of their training data. Attempting to extract training
perplexity, but on non-trivial memorization, the large model has data from these models can be potentially considered violating their
smaller perplexity than the small model. Therefore, we use the terms of use as well. Lastly, CodeParrot is a popular model. Query-
ratio of perplexity between the large model and the small model ing “CodeParrot” as the keyword on HuggingFace returns 419
to infer memorization. We name this metric as PPL-PPL ratio (PPL models. In contrast, searching HuggingFace using “codebert” and
𝑙𝑜𝑔 (𝑃 )
for perplexity). The metric is computed as 𝑙𝑜𝑔 (𝑃𝑙 ) , where 𝑃𝑠 is the “codet5”, which are both widely evaluated models in the literature,
𝑠
perplexity computed using the small model and 𝑃𝑙 is the perplexity returns only 126 and 124 models, respectively.2
of the large model. A small ratio indicates that an output is more The selected models are trained on the CodeParrot dataset,
likely to contain non-trivial memorization. which was created with the GitHub dataset available via Google’s
BigQuery. This dataset contains approximately 22 million Python
PPL-zlib ratio. zlib [4] is a data compression library. The zlib files and is 180 GB in size. Some preprocessing steps are conducted
entropy of a text is computed as the number of bits when the text is to clean the dataset. After removing whitespace in each file, the
compressed with zlib library. A repetitive input has a small zlib developers of CodeParrot find and remove exact duplicates by
𝑙𝑜𝑔 (𝑃 )
entropy. We compute the PPL-zlib ratio: 𝑧𝑙𝑖𝑏𝑣 (the ratio of the
model perplexity and the zlib entropy). A small ratio indicates 2 The result is obtained on 25 July 2023.
Conference’17, July 2017, Washington, DC, USA Zhou Yang, Zhipeng Zhao, Chenyu Wang, Jieke Shi, Dongsun Kim, DongGyun Han, and David Lo

Table 1. Model performance on the OpenAI’s HumanEval 4.2.1 RQ1. What do code models memorize?
benchmark [21]. Pass@n means the chance that a model Motivation. Code models are trained on a large amount of data
provides a correct answer within 𝑛 attempts. collected from various sources, including both public open-source
repositories and private code bases; both of which are prone to
Model Size Pass@n contain sensitive information. Understanding what code models
memorize helps avoid the risks associated with code models leaking
n=1 n=10 n=100
their training data. For example, if code models can memorize
CodeParrot 1.5B 3.80% 6.57% 12.78% and output information such as function implementation and even
CodeParrot-small 110M 3.58% 8.03% 14.96% private data, then the risk of code models leaking their training
PolyCoder 160M 2.13% 3.35% 4.88% data is high.
PolyCoder 400M 2.96% 5.29% 4.88%
Experiment Design. We make a code model generate a large
GPT-Neo 125M 0.75% 1.88% 2.97%
amount of source code according to different generation strate-
GPT-Neo 1.3B 4.79% 7.47% 16.30%
gies described in Section 3.1. For this experiment, we choose the
CodeParrot model as the experiment subject. For each strategy, we
generate 20,000 outputs; each output has a maximal length of 512
computing the hash of each file. The files whose first 5 lines contain tokens. Then we analyze the Type-1 clones spanning more than
the word ‘auto-generated’ are removed to remove the potentially 6 lines that appear in both the training data and the outputs, i.e.,
generated files. Then, we follow the preprocessing tasks detailed in memorized contents.
the original dataset repository. After data cleaning, the processed In order to gain a deeper understanding of the memorization,
dataset is split into the training set and the validation set, which we design an annotation study to classify the memorization into
contain approximately 5 million files that are 50GB. different categories. We identify 40,125 unique clones (ranging
from 6 to 53 lines) from 20,000 outputs produced by Non-Prompt
4.1.2 Analyzing Memorization in Deployed Models. We also con- Generation (NPG). We use a widely adopted sample size calculator4
duct a case study on more code models to analyze whether they with a confidence level of 95% and a confidence interval of 5 to
can potentially memorize their training data as well. We select two obtain a statistically representative sample size of 381. We conduct
additional models: Incoder [29] and StarCoder [41]. We consider an open-card sorting study [14], a well-established technique for
these models for the following reasons. (1) The two models are generating meaningful groupings of data. Two authors of this paper
of larger sizes (6B and 15.5B, respectively), popular, and demon- discuss with each other to categorize the cards and a senior author
strate strong performance [29, 41]; and (2) These models have been resolves the disagreement and merge low-granularity categories
deployed for practical usage. StarCoder has been deployed as an into higher-level ones. We then classify the memorized contents
extension that can be used in IDEs like IntelliJ IDE3 . Incoder is into three main categories: Documentation, Code Logic, and Others,
deployed and used within Meta [46]. Evaluating these models can each having 2 to 8 sub-categories. Each memorized content can
help us understand the risks of using code models in practice. contain multiple categories.

4.2 Research Questions 4.2.2 RQ2. What factors affect memorization in code models? Mo-
In this study, we answer three research questions to investigate the tivation. RQ1 shows that code models memorize different types
risk of code models memorizing training data: of contents and that the content of the prompts affects the model
• RQ1. What do code models memorize? outputs, motivating us to further investigate the other factors that
• RQ2. What factors affect memorization in code models? impact the memorization of code models. Knowing the relevant fac-
• RQ3. How to infer whether an output contains memorized infor- tors helps us better understand code models and highlight potential
mation? directions to mitigate memorization.

The first question aims to understand the extent and nature of in- Experiment Design. We consider the following factors that may
formation that code models tend to memorize, trying to categorize affect the memorization of code models.
the memorized contents into different types. The second question • Model size: Previous research [49] shows that if two models
analyze the factors that influence the memorization, which helps share the same architecture and the same dataset, the larger
us understand the risks associated with code models leaking their model has stronger capacity. In this experiment, we compare
training data. The third question focuses on infering whether an the memorization power of CodeParrot and its smaller version
output contains memorized information from its training data. By CodeParrot-small.
addressing these research questions, we seek to get a deeper un- • Top 𝑘 sampling: Code models generate the next token by choos-
derstanding of the risks associated with code models leaking their ing from the top-𝑘 most likely tokens. A small (large) 𝑘 value
training data, along with practical techniques to identify and miti- can generate more fixed (diverse) outputs. We try 4 settings: 5,
gate such risks. The following paragraphs describe the experiment 10, 20, and 40 to investigate the effect of 𝑘.
design for each research question.

3 https://plugins.jetbrains.com/plugin/22090-starcoder 4 https://www.surveysystem.com/sscalc.htm
What Do Code Models Memorize?
An Empirical Study on Large Language Models of Code Conference’17, July 2017, Washington, DC, USA

• Output length: The length of model outputs (i.e., number of strategy and categorize them. Table 2 shows the total number of
tokens generated) may also impact the memorization of code occurrences of each category obtained using different generation
models. We try 4 settings: 256, 512, 768, and 1024. strategies. We observe that the license and copyright, as well as the
• Number of generated outputs: If we can query the model for import statements, are the most frequent categories. More specifi-
many times and obtain many outputs, the model may expose cally, the license and copyright category appears 216, 150, 288, and
more memorized information. 222 times in the outputs generated by NPG, TSG, TDG, and PCG,
• Occurrences in the training data: The number of occurrences respectively. The import statements sub-category appears 129, 211,
of a code snippet in the training data can affect the memorization. 80, and 112 times, respectively. The reasons for their higher fre-
Intuitively, if the model sees a training example frequently, it quency than other categories are two-fold. On the one side, such
may overfit this example and produce the frequently occurring information usually appears at the beginning of a file, which is easy
patterns at higher probability. to been seen and memorized. On the other side, there exist many
duplicates of such information (e.g., developers declare the same
4.2.3 RQ3. How to infer whether an output contains memorized
license in many files).
information?
Each generation strategy shows different distributions of outputs.
Motivation: For the aforementioned research questions, we assume
For example, we find that TSG tends to generate less license and
to have complete access to the training data in order to analyze
copyright information (decrease from 216 to 150) and more code
memorization. This assumption is practical when conducting analy-
logic (increase 239 from 330) than NPG. The reason is that TSG
sis as model developers. However, in certain adversarial situations,
selects the most frequent memorization found by NPG (which is the
such as when malicious users aim to extract training data from
license information) as the prompt into code models. The model will
code models, the attacker typically does not have knowledge of the
complete the content after the prompt and thus skips the license
training data. Therefore, we investigate how to infer whether an
information to generate more code logic related contents, such as
output from code models contains memorization without querying
import statements and method definitions.
the training data.
Temperature has an impact on the diversity of sampled out-
Experiment Design: We use the methods described in Section 3.3
puts. For example, TDG drives the models to produce higher por-
to rank the outputs from code models by using four different metrics.
tion of license information. The potential reason is that using a
Following the setting in the paper that propose these metrics [19],
higher temperature diversifies model outputs. For those contents
we look into the 100 top-ranking outputs from the ranked lists and
that are harder to memorize (e.g., code logic), adding more diver-
compute the ratio of the outputs containing memorized contents.
sity makes the model fail to generate them. Consequently, most
of the outputs generated by TDG are license information while
4.3 Implementation Details
it generates the least code logic among all generation strategies.
We fetch the CodeParrot and CodeParrot-small models from Although the prompts used in PCG already include the license and
the HuggingFace model hub and run them an NVIDIA GeForce import statements, PCG produces a similar number of license and
A5000 GPU with 24 GB of memory. To implement the temperature- import statements as NPG. This is because PCG generates code
decaying generation, we use an initial high temperature of 20.0. logic to complete the function first, after which the model generates
Each time the model generates a new token, we decrease the temper- license and import statements. This observation contrasts with the
ature by 1.0 until it reaches 1.0 (i.e., 20 tokens are generated), after conclusion drawn in NLP models that higher temperature values
which we keep the temperature at 1.0. We download the datasets helps the model produce more non-trivial information [19], which
of the two models released by the authors of CodeParrot.5 As the is the code logic in our case.
clone detection tool we use consumes a large amount of time when Interestingly, we find that PCG produces more memorization
analyzing many files, we merge all the files in the training data and related to code logic than other generation strategies. For example,
split the merged file into 53 chunks and run clone detection in par- the occurrences of exception handling increase from 9 in NPG to
allel. On a machine having AMD EPYC 7643 CPU with 48 cores and 32 in PCG, and that of method calls increase from 25 to 60. PCG to
512GB memory, analyzing memorization for each 20,000 examples some extent simulates the behavior of users when interacting with
in parallel takes one hour. We encourage researchers with more code models: writing some code like function definition and letting
computational resources to replicate our study at a larger scale. the model complete. Our results suggest that code models may
produce much memorization when interacting with users, posing
5 EXPERIMENT RESULTS data leakage risk.
5.1 RQ1. What do code models memorize? Sensitive Information Detection. Our experiment also reveals
Memorization Detection and Categorization. Model outputs sensitive information memorized by the code models. One of the
by different generation strategies include a significant number examples is shown in Listing 4 in which we can identify a private
of code fragments memorized from the training data. Note that, key that might expose financial accounts. As discussed in Section 2,
on average, approximately 43% and 57% of 20,000 outputs from there might be other types of sensitive information, but we focus
CodeParrot-small and CodeParrot contain memorized informa- on counting IP addresses, email addresses, and hash keys as they
tion, respectively. Following the procedure described in Section 4.2.1, are explicitly identifiable compared with other types of information
we randomly sample 381 memorized outputs for each generation such as licensed code or vulnerabilities. We use detect-secrets,6

5 https://huggingface.co/datasets/codeparrot/codeparrot-clean 6 https://github.com/Yelp/detect-secrets
Conference’17, July 2017, Washington, DC, USA Zhou Yang, Zhipeng Zhao, Chenyu Wang, Jieke Shi, Dongsun Kim, DongGyun Han, and David Lo

Table 2. Occurrences of memorization extracted with Non- 5.2 RQ2. What factors affect memorization in
Prompt Generation (NPG), Temperature Decaying Genera- code models?
tion (TDG), Prompt-Condition Generation (PCG), and Two-
Step Generation (TSG). The number of annotated outputs for Model size and top 𝑘 sampling. The model size has a substantial
each sampling method is 381. Note that one output may map impact on the frequency of memorized outputs and 𝑘 parameters af-
to multiple categories of memorization. fect the results of memorization as well. Following the procedure de-
scribed in Section 5.2, we use the Non-Prompt Generation (NPG) to
generate 20,000 outputs from CodeParrot and CodeParrot-small,
Category NPS TSS TDS PCS respectively. Figure 1a illustrates the numbers of unique memo-
Documentation 277 247 333 315 rization in the outputs of the two models with different 𝑘 values.
License and Copyright 216 150 288 222 The orange and blue curves represent the results of CodeParrot
Docstring 30 55 10 50 and CodeParrot-small, respectively. We observe that curve of the
Usage Intruction 31 42 35 43 larger model CodeParrot is consistently above the other curve,
Code Logic 239 330 100 304 suggesting that it memorizes more contents. Besides, the larger
Import Statements 129 211 80 112 CodeParrot also memorizes longer contents. Given the same set-
Method Definition 39 41 2 41 ting (k=20, output length is 512), CodeParrot produces memoriza-
Method Calls 25 30 4 60 tion with a maximal of 60 lines, while CodeParrot-small only
Class Definition 17 19 4 18 produces memorization with a maximal of 45 lines.
Conditional Statements 14 10 4 33 We analyze how the 𝑘 value affects the memorization of code
Exception Handling 9 11 4 32 models. We try 4 settings of 𝑘 values: 5, 10, 20, and 40. The length
Testing 4 7 2 5 of each output is set as 256 tokens. As shown in Figure 1a, when 𝑘
Print Statement and Log 2 1 0 3 is low (e.g., 5), both models produce less memorization. As a small
Others 60 36 45 57 𝑘 means that a model chooses from a small and relatively fixed
Configuration 38 23 25 52 set of tokens, which may lead to the generation of memorization
Unable to Classify 22 13 20 5 with more duplicates (i.e., the number of unique memorization
is smaller). When 𝑘 increases from 5 to 10, the outputs become
more diverse and consequently produce more unique memorized
outputs, which can be observed from the figure that the number of
1 class SignMessagesTest ( BitcoinTestFramework ):
unique memorization increases, reaching a peak at 𝑘=10. However,
2 def set_test_params ( self ): when 𝑘 continues to increase, the outputs become more diverse
3 self . setup_clean_chain = True
4 self . num_nodes = 1
and differ more from the training data, which leads to a decrease in
5 def run_test ( self ): the number of unique memorization. As a result, models memorize
6 message = < masked value >
7 self . log . info (< masked value >) less alongside 𝑘 increases from 10 to 40.
8 priv_key = < masked value > Output length and the number of outputs. Longer outputs can
Listing 4. An example of source code with private keys. The expose more memorized contents according to our experiments.
identifiers are substituted and strings are masked because of Given a model, we keep other factors unchanged and vary the max-
ethical consideration. imal number of tokens of the outputs: 256, 512, 768, and 1024. The
results are shown in Table 3. When the length of outputs increases,
the number of unique memorization also increases. This trend is
consistent for both models on different numbers of outputs. The
impact of output length is larger on the larger model. Specifically,
given the same setting (5,000 examples and increase the output
a popular open-source tool to scan three types of sensitive infor- length from 256 to 1024), the number of unique memorization in-
mation in the 20,000 outputs generated by NPG. After removing creases by 7,365 (from 6,666 to 14,031 for CodeParrot-small, while
local IPs and emails containing “example,” we find 25 IP addresses, increases by 12,785 (from 9,785 to 22,570) for CodeParrot.
914 emails, and 25 keys that are exactly same with the information Table 3 also shows that when we extract more outputs from
from the training data as well (i.e., they are memorized).7 Our find- the models, the number of unique memorization increases as well.
ings warn that such sensitive information requires to be properly We let the CodeParrot model to generate 10 million outputs (512
handled (e.g., removed) before training code models. tokens each) and plot the trend of the number of unique memo-
rization with the number of outputs in Figure 1b. The blue curve
Answers to RQ1: Code models successfully memorize different shows the total number of unique memorization, while the orange
types of content, including both documentation and code logic. curve shows the number of newly identified unique memorization
The license and copyright, as well as import statements, are the by generation additional 200,000 outputs each time. We see a ‘di-
most frequently appearing memorization. With proper strategies, minishing return’ pattern: although more memorization can be
one can drive the model to produce more code logic-related mem- identified when more outputs are generated, the number of newly
orization. Code models memorize sensitive information such as 7 The detection results of other generation strategies are included in ‘./PPI’ folder in
private keys and emails. the replication package.
What Do Code Models Memorize?
An Empirical Study on Large Language Models of Code Conference’17, July 2017, Washington, DC, USA

00
Number of memorization

000 00000 00000


Total number of memorization

Number of memorization
00 8000 0000 2000 4000 6000

y = 0.29x + 41.58

400
Frequency in the outputs
Number of newly find memorization

8
2

00
300
2

6
small
regular
2

00
200
400
2

000

00
100
200
1601

0
5 10 15 20 25 30 35 40 0 2 4 6 8 0 20000 40000 60000 80000
k values Number of model outputs (million) Frequency in the training data
(a) How the 𝑘 value affects the memorization (b) How the total number of outputs affect (c) The correlation between the frequency in
of code models. identified memorization. the training data and outputs.

Figure 1. Three factors that affect memorization in code models.

Table 3. Number of unique memorized-code for a given set of analyze the 100 top-ranking outputs. Table 4 shows the ratio of
parameters: (1) number of tokens and (2) number of outputs. the top 100 outputs that contain memorization. We find that gener-
ally three out of four metrics can accurately rank the memorized
Model # outputs 256 512 768 1024 outputs at the top. Note that, on average, approximately 43% and
5,000 6,666 9,080 11,041 14,031
57% of 20,000 outputs from CodeParrot-small and CodeParrot
10,000 10,627 14,655 17,664 22,243 contain memorized information. When the output length is set as
CodeParrot-small 256 tokens, all the top 100 outputs ranked using perplexity, PPL-
15,000 14,015 19,444 23,863 29,133
20,000 16,966 23,574 29,204 35,363 PPL ratio, and PPL-zlib ratio, contain memorization. Ranking the
outputs using PPL-zlib ratio achieves the best performance, with
5,000 9,785 14,645 18,325 22,570
10,000 16,062 24,345 32,519 37,448
all the top 100 outputs containing memorization. The method of
CodeParrot PPL-PPL ratio is slightly less effective. When the output length is
15,000 21,560 32,666 42,853 50,127
20,000 26,420 40,125 51,059 61,787 512, its detection accuracy is 79%. But in other settings, its accuracy
is close to 100%. However, the Average PPL metric is less effective
in detecting memorization, especially on long outputs.
identified memorization decreases. The 10 million outputs contain We further analyze what types of memorization are inferred by
over 8 million unique memorization that is longer than 6 lines. these methods. We conduct another annotation study to count the
Occurrences in the training data. Code fragments more fre- occurrences of different categories of memorization in the 100 top
quently appearing in the training data are more likely to be gen- ranked outputs for CodeParrot-small model (output length is 256).
erated by the code models. We count the number of occurrences We find that all the memorized outputs ranked by the perplexity
of each memorized content in the training data and the model out- and PPL-zlib Ratio are license information. However, using PPL-
puts and visualize the distribution in Figure 1c. Spearman’s rank PPL ratio can rank more diverse memorization in the top; 12% of
correlation test obtains a correlation coefficient of 0.804 (𝑝 < 0.01), memorized contents contain code logic.
indicating a strong positive correlation between the two variables.
We conduct the Pearson correlation test to measure the linear re-
lationship. We obtain a correlation coefficient of 0.752 (𝑝 < 0.01), Answers to RQ3: Three out of four metrics can infer memoriza-
which indicates a strong positive linear correlation between the two tion accurately. The PPL-PPL ratio ranks more diverse memoriza-
variables. The finding implies that when the model is exposed to a tion at the top positions, while memorization highly ranked by
higher frequency of certain contents in its training data, it tends to the other metrics is mainly license information.
memorize it and produces that content more frequently in its out-
puts, highlighting to importance of conducting data de-duplication
before training code models. 6 DISCUSSION
6.1 Memorization in Deployed Code Models
Answers to RQ2: We identify five factors affecting memoriza-
tion: (1) the model size given the same architecture; (2) the hy- We analyze the memorization of two models that have already been
perparameter 𝑘 in top-𝑘 sampling; (3) the length of generated deployed in Meta and IntelliJ IDE: Incoder and StarCoder. Both
outputs; (4) the number of generated outputs; (5) the frequency Incoder and StarCoder are trained on large-scale datasets (159GB
of the code fragment in the training data. and 3TB) that cover multiple programming languages. However, the
training data of Incoder is not directly available and the magnitude
of the training data for StarCoder presents significant challenges
5.3 RQ3. How to infer whether an output when attempting to analyze Type-1 clones. Note that the authors of
contains memorized information? the models [29, 41] mention that the training data is curated from
Prediction performance. We use the four metrics described in GitHub. Thus, we use the search function provided by GitHub as a
Section 3.3 to rank the outputs generated by the code models and proxy and manually confirm the memorization of these models. If
Conference’17, July 2017, Washington, DC, USA Zhou Yang, Zhipeng Zhao, Chenyu Wang, Jieke Shi, Dongsun Kim, DongGyun Han, and David Lo

1 # Bitcoin Cash ( BCH ) qpz3 < masked value >5 nuk Table 4. The performance of different methods in inferring
2 # Ether ( ETH ) - 0 x84 < masked value > c9FB
3 # Litecoin ( LTC ) - Lfk5 < masked value > qPvu whether an output contains memorized information. The
4 # Bitcoin ( BTC ) - 34 L8 < masked value > BtTd
5
numbers represent the ratio of top 100 outputs ranked by
6 # contact :- nnheo@example . com each method; i.e., 1.0 indicates that the method successfully
computes rankings top 100 as memorized source code.
Listing 5. An example of output from StarCoder that contains
a substituted email address: ‘[email protected]’. The ad-
dresses of cryptocurrency are still memorized. We only show Models
the first and last 4 digits to protect privacy. Methods
CodeParrot CodeParrot-S
t=256 t=512 t=256 t=512
Perplexity 1.00 0.92 1.00 1.00
PPL-PPL Ratio 1.00 0.79 0.97 0.98
PPL-zlib Ratio 1.00 1.00 1.00 1.00
GitHub returns the exact code snippets (spanning over 6 lines), we Average PPL 0.74 0.24 0.86 0.36
consider an output contains memorization.
For each model, we extract 20,000 outputs using NPG. According
to the finding from RQ3 that the PPL-PPL ratio ranks more diverse
memorization at the top positions, we rank the outputs using the
PPL-PPL ratio and analyze the top 100 outputs. Then, we manually rights associated with a repository or provide an option to specify
search GitHub for each line of code in the top 100 outputs. We the rights in the account settings.
find that 81% and 75% of the selected outputs from Incoder and
StarCoder are confirmed to be memorized. Besides, 90.12% and 2. Data Collection and Processing: Data collectors should pay
65.33% of memorization from the two models is related to code attention to data license information and avoid using data with
logic. The results reveal that there is risk that the two models can strict or unclear licenses for training code models. Appropriate
memorize and expose their training data. This fact can be worrying preprocessing of the dataset is necessary, which may include but
especially when models that trained on private repositories (e.g., not limited to detecting and removing personal identifiable infor-
internal code in a company) are deployed for public usage. mation and other software secrets reasonably. Our study shows
Both StarCoder and InCoder adopt preprocessing methods to that duplicates in the training data are more likely to be memorized.
remove personal identifiable information and other software secrets Allamanis [11] suggests that duplicate code may cause adverse
in the training data. They use regular expressions to detect and effects on code models. Removing duplicates in the training data
substitute emails. The benefit of doing so is reflected in the model can help reduce memorization. Additionally, collectors can employ
outputs. For example, Listing 5 shows an output from StarCoder. defensive methods to prevent privacy attacks (e.g., data poisoning).
We search GitHub for this output; GitHub returns the exact code 3. Additional Information for Outputs: Code model developers
except a different email address. We believe this output is memo- should also offer users sufficient information when they use the
rized from the training data, and it substitutes the email address model. For instance, model developers should detect whether an
in the training data. However, we find that the addresses of cryp- output is likely to be memorized from the training data. If so, users
tocurrency (i.e., the masked values in Listing 5) are still memorized. should be informed of the output’s origin and its copyright infor-
This highlights the need of appropriately dealing with more types mation. This not only prevents users from inadvertently violating
of sensitive information in the training data before training code open-source regulations but also empowers them to make informed
models and releasing them to the public. decisions about the quality of the output. For example, users may
avoid using an output if it is from a poorly-maintained repository.
Furthermore, our study shows that larger models memorize more
6.2 Suggestions on Memorization in Code contents. Therefore, the service providers of large code models (e.g.,
Models ChatGPT) should take measures to limit the queries to the model
Based on our findings, we provide the following suggestions to deal to prevent potential privacy breaches.
with memorization in code models.
4. Opt-out Mechanisms: Dataset providers should allow users
1. Data Sources: Users from various platforms such as GitHub and to determine if their data has been included in a dataset. We also
Stack Overflow should have the right to explicitly indicate whether suggest building a tracking system to build a connection between
their data can be utilized for AI model training. Otherwise, the code a model and its training data. Such a tracking system helps iden-
models may memorize the data and expose the data to the other tify what models include a specific user’s data for training. As
users. Platforms should support such declarations, allowing users mentioned in Suggestion 1, users should have the right to prevent
to specify their preferences at different levels of granularity. For their code from being used in training and be able to provide their
example, a user may allow the main branch of a repository to be consent. For data that has already been used, users should have
used for training, but not the development branch as it may include the right to declare or withdraw their consent. Their data should
experimental and unfinished code. To support the declarations, not only be removed from the dataset but also from the models.
GitHub can allow users to include a separate file that outlines the Moreover, it is necessary to develop a certification mechanism to
What Do Code Models Memorize?
An Empirical Study on Large Language Models of Code Conference’17, July 2017, Washington, DC, USA

evaluate whether a model has removed the data from its knowledge Researchers have also studied the limitations of code models
properly. and threats. One of the limits is that a malicious user can generate
adversarial examples [36, 58, 68, 70] by adding semantic preserving
5. Multidisciplinary Collaborations: We believe that a multidis-
transformations [12, 51] to fool code models. Another threat to the
ciplinary approach is essential to effectively address the outcomes
training data is data poisoning. An attacker can simply inject a
of data memorization in code models. AI researchers can contribute
small portion of malicious code into the training data to poison the
developing models that minimize memorization while maintain-
datasets [40, 54, 64, 69]; consequently, the obtained code models
ing the model performance. Software engineering researchers can
have backdoors that can be triggered by specific inputs. Data poison
focus on open-source data management and understand user re-
attacks can also harm the performance of code models [48, 60].
quirements. Legal professionals provide guidance on copyright
Software engineering researchers have noticed the potential is-
and other regulatory requirements to ensure compliance and the
sues brought by memorization in code models. Al-Kaswan et al. [8]
protection of users’ rights.
discuss the security, privacy, and license implications of memo-
rization in code models. Our paper conducts the first systematic
6.3 Threats to Validity study of memorization in code models, by categorizing memorized
Threats to Internal Validity. Internal validity refers to the extent content, analyzing factors that affect memorization, etc. Rabin et
to which a study is free from errors or biases that could invalidate al. [52] also evaluate memorization in code models, however, their
the results. Our study leverages Type-1 clone detection to identify concept of memorization diverges from ours. In their paper, a code
memorization. However, the accuracy of clone detection tool may model learns from noisy dataset by memorizing the noise and fails
affect our results. To mitigate this threat, we choose a state-of- to generalize to test data, which is more similar to the notion of
the-art clone detection tool, Simian [32], which has been used to overfitting. Our paper investigates to what extent code models
analyze code clones between the model output and training data. memorize and output the training data verbatim.
Another threat is that using Type-1 clone detection may lead to
under estimation of memorization. For example, a code model may 7.2 Privacy Concerns in AI Models
not fully memorize a code snippet and produce a slightly modified Henderson et al. [35] discuss some ethical challenges in data-driven
version of the code snippet. In this case, this output is not treated dialogue systems, one of which is the potential privacy violation.
as memorization. Carlini et al. [18] evaluate unintended memorization in Google’s
Threats to External Validity. External validity refers to the extent Smart Compose that can complete emails. Feldman [26] uses the
to which the results of a study can be generalized to other settings. long-tail theory to explain the memorization behavior of deep learn-
This paper evaluates two code models based on the GPT-2 archi- ing models. Zhu et al. [73] analyzes memorization behavior for a
tecture, which is widely used in open-source code models [10, 29]. LSTM-based neural language model. One important privacy con-
However, the results obtained in this paper may not generalize to cern related to memorization is the feasibility of data extraction
other models. To alleviate this threat, we analyze the memoriza- attack, which aims to extract the training data from the model. An
tion in two popular code models that are deployed in practice and important privacy threat related to memorization is the data extrac-
demonstrate that they also suffer from memorization issues. We tion attack, which aims to extract the training data from the model.
also plan to conduct experiments on more models and datasets in Carlini et al. [19] extract around 600 examples of memorization
the future. from the GPT-2 model like URLs, phone numbers, etc. Al-Kaswan
et al. [9] propose a targeted attack on extracting data from the
GPT-Neo model. To the best of our knowledge, our paper presents
7 RELATED WORK the first empirical study of memorization in large pre-trained code
In this section, we present an overview of studies relevant to this models. Another important privacy concern is the membership
paper, including (1) pre-trained models of code and (2) privacy inference attack (MIA), which aims to infer whether a specific data
concerns in AI models. sample is used to train a model. Shokri et al. [56] propose a black-
box MIA method on machine learning-based classification models.
7.1 Pre-trained Code Models and Analysis Hisamoto et al [37] operate MIA on machine translation systems.
Chen et al. [20] produces a taxonomy of membership inference
Large language models like BERT [24, 43] and GPT [17, 53] have
attacks against various generative models. Mireshghallah et al. [45]
excelled in NLP tasks, inspiring pre-trained models for code. Code-
use MIA to quantify the privacy risk of masked language models
BERT [28] and a list of similar models (including GraphCode-
like BERT. Researchers also propose defensive method to mitigate
BERT [31], CuBERT [39], etc) are developed to produce code em-
the risks by MIA. For example, Tang et al. [61] propose to use
beddings to support downstream tasks like defect prediction. Many
model ensemble to mitigate the MIA risk. Nasr et al. [47] leverage
code models leverage the GPT architecture [17, 53] to conduct
adversarial regularization to enhance membership privacy.
generation tasks. CodeGPT, which trains the GPT-2 architecture
on CodeSearchNet [38], is proposed as a baseline model in the
CodeXGLUE benchmark [44]. Code models with larger sizes and 8 CONCLUSION AND FUTURE WORK
better performance are also proposed, like InCoder [29], Code- This paper conducts a comprehensive study to examine the memo-
Gen [49]. A list of studies [50, 71, 72] have empirically shown the rization in large pre-trained code models. We develop a taxonomy
outstanding performance of these models on various tasks. for memorized contents, consisting of 3 primary categories and
Conference’17, July 2017, Washington, DC, USA Zhou Yang, Zhipeng Zhao, Chenyu Wang, Jieke Shi, Dongsun Kim, DongGyun Han, and David Lo

14 subcategories. The study uncovers several key factors affecting H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33.
memorization: the model size, the length of outputs, etc. We also Curran Associates, Inc., 1877–1901.
[18] Nicholas Carlini, Chang Liu, ’Ulfar Erlingsson, Jernej Kos, and Dawn Song. 2019.
discover a strong positive correlation between the frequency of an The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural
output’s appearance in the training data and that in model outputs. Networks. In Proceedings of the 28th USENIX Conference on Security Symposium
(SEC’19). USENIX Association, USENIX Association, Santa Clara, CA, USA, 267–
This suggests that eliminating duplicates in the training data could 284.
potentially reduce memorization. Furthermore, we identify effec- [19] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-
tive metrics that accurately determine whether an output contains Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson,
Alina Oprea, and Colin Raffel. 2021. Extracting Training Data from Large Lan-
memorized contents and offer recommendations on how to address guage Models. In USENIX Security Symposium.
the issue of memorization in code models. Additionally, we evaluate [20] Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. 2020. GAN-Leaks: A Taxon-
memorization in two popular models that are already deployed. omy of Membership Inference Attacks against Generative Models. In Proceedings
of the 2020 ACM SIGSAC Conference on Computer and Communications Security
In future work, we plan to study larger code models and more (Virtual Event, USA) (CCS ’20). Association for Computing Machinery, New York,
programming languages. We also plan to explore different strategies NY, USA, 343–362. https://doi.org/10.1145/3372297.3417238
[21] Mark Chen, Jerry Tworek, and Heewoo Jun et al. 2021. Evaluating Large Language
to mitigate memorization in code models. Models Trained on Code. CoRR abs/2107.03374 (2021). arXiv:2107.03374 https:
//arxiv.org/abs/2107.03374
[22] Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing
The replication package is provided for replication at https: techniques for language modeling. Computer Speech and Language 13, 4 (1999),
//doi.org/10.6084/m9.figshare.22774697, which should 359–394. https://doi.org/10.1006/csla.1999.0128
not be used for malicious purposes like conducting data [23] Matteo Ciniselli, Luca Pascarella, and Gabriele Bavota. 2022. To What Extent
Do Deep Learning-Based Code Recommenders Generate Predictions by Cloning
extraction attacks. Code from the Training Set?. In Proceedings of the 19th International Conference
on Mining Software Repositories (Pittsburgh, Pennsylvania) (MSR ’22). Association
for Computing Machinery, New York, NY, USA, 167–178. https://doi.org/10.
REFERENCES 1145/3524842.3528440
[1] [n. d.]. codeparrot (CodeParrot). https://huggingface.co/codeparrot [24] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
[2] [n. d.]. GitHub copilot · your AI pair programmer. https://github.com/features/ Pre-training of Deep Bidirectional Transformers for Language Understanding. In
copilot Proceedings of the 2019 Conference of the North American Chapter of the Association
[3] [n. d.]. Introducing chatgpt. https://openai.com/blog/chatgpt for Computational Linguistics: Human Language Technologies, Volume 1 (Long and
[4] [n. d.]. A Massively Spiffy Yet Delicately Unobtrusive Compression Library. Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota,
https://zlib.net/. Accessed on March 27, 2023. 4171–4186. https://doi.org/10.18653/v1/N19-1423
[5] [n. d.]. What is chatgpt? https://help.openai.com/en/articles/6783457-what-is- [25] Delton Ding. 2021. GitHub Copilot provided me with a picture of someone’s
chatgpt ID card? PIC.TWITTER.COM/RRLE1UXF6U. https://twitter.com/DeltonDing/
[6] 2023. Samsung employees accidentally leaked company secrets via chatgpt: status/1423651446340259840
Here’s what happened. https://gizmodo.com/chatgpt-ai-samsung-employees- [26] Vitaly Feldman. 2020. Does Learning Require Memorization? A Short Tale about
leak-data-1850307376 a Long Tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on
[7] Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Uni- Theory of Computing (Chicago, IL, USA) (STOC 2020). Association for Computing
fied Pre-training for Program Understanding and Generation. In Proceedings of Machinery, Association for Computing Machinery, New York, NY, USA, 954–959.
the 2021 Conference of the North American Chapter of the Association for Computa- https://doi.org/10.1145/3357713.3384290
tional Linguistics: Human Language Technologies. Association for Computational [27] Runhan Feng, Ziyang Yan, Shiyan Peng, and Yuanyuan Zhang. 2022. Auto-
Linguistics, Online, 2655–2668. mated Detection of Password Leakage from Public GitHub Repositories. In 2022
[8] Ali Al-Kaswan and Maliheh Izadi. 2023. The (ab)use of Open Source Code to IEEE/ACM 44th International Conference on Software Engineering (ICSE). 175–186.
Train Large Language Models. arXiv:2302.13681 [cs.SE] https://doi.org/10.1145/3510003.3510150
[9] Ali Al-Kaswan, Maliheh Izadi, and Arie van Deursen. 2023. Targeted At- [28] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong,
tack on GPT-Neo for the SATML Language Model Data Extraction Challenge. Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT:
arXiv:2302.07735 [cs.CL] A Pre-Trained Model for Programming and Natural Languages. In Findings of the
[10] Loubna Ben Allal, Raymond Li, and Denis Kocetkov et al. 2023. SantaCoder: Association for Computational Linguistics: EMNLP 2020. Association for Computa-
don’t reach for the stars! arXiv:2301.03988 [cs.SE] tional Linguistics, 1536–1547.
[11] Miltiadis Allamanis. 2019. The Adverse Effects of Code Duplication in Machine [29] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi,
Learning Models of Code. In Proceedings of the 2019 ACM SIGPLAN International Ruiqi Zhong, Scott Yih, Luke Zettlemoyer, and Mike Lewis. 2023. InCoder: A
Symposium on New Ideas, New Paradigms, and Reflections on Programming and Generative Model for Code Infilling and Synthesis. In The Eleventh International
Software (Athens, Greece) (Onward! 2019). Association for Computing Machinery, Conference on Learning Representations. https://openreview.net/forum?id=hQwb-
New York, NY, USA, 143–153. https://doi.org/10.1145/3359591.3359735 lbM6EL
[12] Leonhard Applis, Annibale Panichella, and Arie van Deursen. 2021. Assessing [30] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles
Robustness of ML-Based Program Analysis Tools using Metamorphic Program Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and
Transformations. In 2021 36th IEEE/ACM International Conference on Automated Connor Leahy. 2020. The Pile: An 800GB Dataset of Diverse Text for Language
Software Engineering (ASE). 1377–1381. https://doi.org/10.1109/ASE51524.2021. Modeling. arXiv preprint arXiv:2101.00027 (2020).
9678706 [31] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou,
[13] Setu Kumar Basak, Lorenzo Neil, Bradley Reaves, and Laurie Williams. 2023. Nan Duan, Alexey Svyatkovskiy, Shengyu Fu andz Michele Tufano, Shao Kun
SecretBench: A Dataset of Software Secrets. In Proceedings of the 20th International Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang,
Conference on Mining Software Repositories (MSR ’23). 5 pages. and Ming Zhou. 2021. GraphCodeBERT: Pre-training Code Representations with
[14] Andrew Begel and Thomas Zimmermann. 2014. Analyze This! 145 Questions for Data Flow. In 9th International Conference on Learning Representations, ICLR 2021,
Data Scientists in Software Engineering. In Proceedings of the 36th International Virtual Event, Austria, May 3-7, 2021.
Conference on Software Engineering (Hyderabad, India) (ICSE 2014). Association [32] Simon Harris. n.d.. Simian. https://www.harukizaemon.com/simian.
for Computing Machinery, New York, NY, USA, 12–23. https://doi.org/10.1145/ [33] Junda He, Zhou Xin, Bowen Xu, Ting Zhang, Kisub Kim, Zhou Yang, Ferdian
2568225.2568233 Thung, Ivana Irsan, and David Lo. 2023. Representation Learning for Stack
[15] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled Overflow Posts: How Far are We? arXiv:2303.06853 [cs.SE]
sampling for sequence prediction with recurrent neural networks. Advances in [34] Junda He, Bowen Xu, Zhou Yang, DongGyun Han, Chengran Yang, and David Lo.
neural information processing systems 28 (2015). 2022. PTM4Tag: Sharpening Tag Recommendation of Stack Overflow Posts with
[16] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT- Pre-Trained Models. In Proceedings of the 30th IEEE/ACM International Conference
Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. https: on Program Comprehension (Virtual Event) (ICPC ’22). Association for Computing
//doi.org/10.5281/zenodo.5297715 If you use this software, please cite it using Machinery, New York, NY, USA, 1–11. https://doi.org/10.1145/3524610.3527897
these metadata.. [35] Peter Henderson, Koustuv Sinha, Nicolas Angelard-Gontier, Nan Rosemary Ke,
[17] Tom Brown, Benjamin Mann, and Nick et al. Ryder. 2020. Language Models Genevieve Fried, Ryan Lowe, and Joelle Pineau. 2018. Ethical Challenges in
are Few-Shot Learners. In Advances in Neural Information Processing Systems,
What Do Code Models Memorize?
An Empirical Study on Large Language Models of Code Conference’17, July 2017, Washington, DC, USA

Data-Driven Dialogue Systems. In Proceedings of the 2018 AAAI/ACM Conference [53] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
on AI, Ethics, and Society (New Orleans, LA, USA) (AIES ’18). Association for Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019).
Computing Machinery, Association for Computing Machinery, New York, NY, [54] G. Ramakrishnan and A. Albarghouthi. 2022. Backdoors in Neural Models of
USA, 123–129. https://doi.org/10.1145/3278721.3278777 Source Code. In 2022 26th International Conference on Pattern Recognition (ICPR).
[36] Jordan Henkel, Goutham Ramakrishnan, Zi Wang, Aws Albarghouthi, Somesh Jha, IEEE Computer Society, Los Alamitos, CA, USA, 2892–2899. https://doi.org/10.
and Thomas Reps. 2022. Semantic Robustness of Models of Source Code. In 2022 1109/ICPR56361.2022.9956690
IEEE International Conference on Software Analysis, Evolution and Reengineering [55] Chanchal Kumar Roy and James R Cordy. 2007. A survey on software clone
(SANER). 526–537. https://doi.org/10.1109/SANER53432.2022.00070 detection research. Queen’s School of Computing TR 541, 115 (2007), 64–68.
[37] Sorami Hisamoto, Matt Post, and Kevin Duh. 2020. Membership Inference Attacks [56] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Mem-
on Sequence-to-Sequence Models: Is My Data In Your Machine Translation bership Inference Attacks Against Machine Learning Models. In 2017 IEEE Sym-
System? Transactions of the Association for Computational Linguistics 8 (2020), posium on Security and Privacy (SP). 3–18. https://doi.org/10.1109/SP.2017.41
49–63. https://doi.org/10.1162/tacl_a_00299 [57] Dominik Sobania, Martin Briesch, Carol Hanna, and Justyna Petke. 2023.
[38] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc An Analysis of the Automatic Bug Fixing Performance of ChatGPT.
Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic arXiv:2301.08653 [cs.SE]
code search. arXiv preprint arXiv:1909.09436 (2019). [58] Shashank Srikant, Sijia Liu, Tamara Mitrovska, Shiyu Chang, Quanfu Fan,
[39] Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Gaoyuan Zhang, and Una-May O’Reilly. 2021. Generating Adversarial Com-
Learning and evaluating contextual embedding of source code. In International puter Programs using Optimized Obfuscations. ICLR 16 (2021), 209–226.
Conference on Machine Learning. PMLR, 5110–5121. [59] Benjamin Steenhoek, Md Mahbubur Rahman, Richard Jiles, and Wei Le. 2023.
[40] Jia Li, Zhuo Li, Huangzhao Zhang, Ge Li, Zhi Jin, Xing Hu, and Xin Xia. 2022. An Empirical Study of Deep Learning Models for Vulnerability Detection.
Poison Attack and Defense on Deep Source Code Processing Models. https: arXiv:2212.08109 [cs.SE]
//doi.org/10.48550/ARXIV.2210.17029 [60] Zhensu Sun, Xiaoning Du, Fu Song, Mingze Ni, and Li Li. 2022. CoProtector:
[41] Raymond Li, Loubna Ben Allal, and Yangtian Zi et al. 2023. StarCoder: may the Protect Open-Source Code against Unauthorized Training Usage with Data Poi-
source be with you! arXiv:2305.06161 [cs.CL] soning. In Proceedings of the ACM Web Conference 2022 (Virtual Event, Lyon,
[42] Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep France) (WWW ’22). Association for Computing Machinery, New York, NY, USA,
Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundare- 652–660. https://doi.org/10.1145/3485447.3512225
san. 2022. Automating Code Review Activities by Large-Scale Pre-Training. In [61] Xinyu Tang, Saeed Mahloujifar, Liwei Song, Virat Shejwalkar, Milad Nasr, Amir
Proceedings of the 30th ACM Joint European Software Engineering Conference and Houmansadr, and Prateek Mittal. 2022. Mitigating membership inference attacks
Symposium on the Foundations of Software Engineering (Singapore, Singapore) by { Self-Distillation } through a novel ensemble architecture. In 31st USENIX
(ESEC/FSE 2022). Association for Computing Machinery, New York, NY, USA, Security Symposium (USENIX Security 22). 1433–1450.
1035–1047. https://doi.org/10.1145/3540250.3549081 [62] Roberto Torres. 2022. GitHub copilot adds 400K subscribers in first month. https://
[43] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer www.ciodive.com/news/github-copilot-microsoft-software-developer/628587/
Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A [63] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all
arXiv:1907.11692 http://arxiv.org/abs/1907.11692 you need. Advances in neural information processing systems 30 (2017).
[44] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio [64] Yao Wan, Shijie Zhang, Hongyu Zhang, Yulei Sui, Guandong Xu, Dezhong Yao,
Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Hai Jin, and Lichao Sun. 2022. You See What I Want You to See: Poisoning
Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Vulnerabilities in Neural Code Search. In Proceedings of the 30th ACM Joint
Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. European Software Engineering Conference and Symposium on the Foundations
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding of Software Engineering (Singapore, Singapore) (ESEC/FSE 2022). Association for
and Generation. CoRR abs/2102.04664 (2021). arXiv:2102.04664 Computing Machinery, New York, NY, USA, 1233–1245. https://doi.org/10.1145/
[45] Fatemehsadat Mireshghallah, Kartik Goyal, Archit Uniyal, Taylor Berg- 3540250.3549153
Kirkpatrick, and Reza Shokri. 2022. Quantifying Privacy Risks of Masked Lan- [65] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5:
guage Models Using Membership Inference Attacks. In Proceedings of the 2022 Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Under-
Conference on Empirical Methods in Natural Language Processing. Association standing and Generation. In Proceedings of the 2021 Conference on Empirical
for Computational Linguistics, Abu Dhabi, United Arab Emirates, 8332–8347. Methods in Natural Language Processing, EMNLP 2021.
https://aclanthology.org/2022.emnlp-main.570 [66] Chunqiu Steven Xia and Lingming Zhang. 2023. Keep the Conversation Going:
[46] Vijayaraghavan Murali, Chandra Maddila, Imad Ahmad, Michael Bolin, Daniel Fixing 162 out of 337 bugs for 0.42 each using ChatGPT. arXiv:2304.00385 [cs.SE]
Cheng, Negar Ghorbani, Renuka Fernandez, and Nachiappan Nagappan. 2023. [67] Chengran Yang, Bowen Xu, Ferdian Thung, Yucen Shi, Ting Zhang, Zhou Yang,
CodeCompose: A Large-Scale Industrial Deployment of AI-assisted Code Author- Xin Zhou, Jieke Shi, Junda He, Donggyun Han, and David Lo. 2023. Answer
ing. arXiv preprint arXiv:2305.12050 (2023). Summarization for Technical Queries: Benchmark and New Approach. Association
[47] Milad Nasr, Reza Shokri, and Amir Houmansadr. 2018. Machine Learning with for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3551349.
Membership Privacy Using Adversarial Regularization. In Proceedings of the 2018 3560421
ACM SIGSAC Conference on Computer and Communications Security (Toronto, [68] Zhou Yang, Jieke Shi, Junda He, and David Lo. 2022. Natural Attack for Pre-
Canada) (CCS ’18). Association for Computing Machinery, New York, NY, USA, Trained Models of Code. In Proceedings of the 44th International Conference on
634–646. https://doi.org/10.1145/3243734.3243855 Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Com-
[48] Phuong T. Nguyen, Claudio Di Sipio, Juri Di Rocco, Massimiliano Di Penta, and puting Machinery, New York, NY, USA, 1482–1493. https://doi.org/10.1145/
Davide Di Ruscio. 2021. Adversarial Attacks to API Recommender Systems: 3510003.3510146
Time to Wake Up and Smell the Coffee?. In 2021 36th IEEE/ACM International [69] Zhou Yang, Bowen Xu, Jie M. Zhang, Hong Jin Kang, Jieke Shi, Junda He, and
Conference on Automated Software Engineering (ASE). 253–265. https://doi.org/ David Lo. 2023. Stealthy Backdoor Attack for Code Models. https://doi.org/10.
10.1109/ASE51524.2021.9678946 48550/ARXIV.2301.02496
[49] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, [70] Noam Yefet, Uri Alon, and Eran Yahav. 2020. Adversarial examples for models of
Silvio Savarese, and Caiming Xiong. 2023. CodeGen: An Open Large Language code. Proceedings of the ACM on Programming Languages 4, OOPSLA (2020).
Model for Code with Multi-Turn Program Synthesis. In The Eleventh Interna- [71] Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, and Ling-
tional Conference on Learning Representations. https://openreview.net/forum?id= ming Zhang. 2022. An Extensive Study on Pre-Trained Models for Program
iaYcJKpY2B_ Understanding and Generation. In Proceedings of the 31st ACM SIGSOFT Inter-
[50] Changan Niu, Chuanyi Li, Vincent Ng, Dongxiao Chen, Jidong Ge, and Bin national Symposium on Software Testing and Analysis (Virtual, South Korea)
Luo. 2023. An Empirical Comparison of Pre-Trained Models of Source Code. (ISSTA 2022). Association for Computing Machinery, New York, NY, USA, 39–51.
arXiv:2302.04026 [cs.SE] https://doi.org/10.1145/3533767.3534390
[51] Md Rafiqul Islam Rabin, Nghi DQ Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, and [72] Xin Zhou, DongGyun Han, and David Lo. 2021. Assessing Generalizability of
Mohammad Amin Alipour. 2021. On the generalizability of Neural Program Mod- CodeBERT. In 2021 IEEE International Conference on Software Maintenance and
els with respect to semantic-preserving program transformations. Information Evolution (ICSME). 425–436. https://doi.org/10.1109/ICSME52107.2021.00044
and Software Technology 135 (2021), 106552. [73] Derui Zhu, Jinfu Chen, Weiyi Shang, Xuebing Zhou, Jens Grossklags, and
[52] Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour, and Vincent J. Ahmed E. Hassan. 2021. DeepMemory: Model-based Memorization Analysis of
Hellendoorn. 2023. Memorization and generalization in neural code intelligence Deep Neural Language Models. In 2021 36th IEEE/ACM International Conference
models. Information and Software Technology 153 (2023), 107066. https://doi. on Automated Software Engineering (ASE). 1003–1015. https://doi.org/10.1109/
org/10.1016/j.infsof.2022.107066 ASE51524.2021.9678871

You might also like