
2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE)

Using an LLM to Help With Code Understanding


Daye Nam, Carnegie Mellon University, U.S.A. ([email protected])
Andrew Macvean, Google, Inc., U.S.A. ([email protected])
Vincent Hellendoorn, Carnegie Mellon University, U.S.A. ([email protected])
Bogdan Vasilescu, Carnegie Mellon University, U.S.A. ([email protected])
Brad Myers, Carnegie Mellon University, U.S.A. ([email protected])
ABSTRACT

Understanding code is challenging, especially when working in new and complex development environments. Code comments and documentation can help, but are typically scarce or hard to navigate. Large language models (LLMs) are revolutionizing the process of writing code. Can they do the same for helping understand it? In this study, we provide a first investigation of an LLM-based conversational UI built directly in the IDE that is geared towards code understanding. Our IDE plugin queries OpenAI's GPT-3.5-turbo model with four high-level requests without the user having to write explicit prompts: to explain a highlighted section of code, provide details of API calls used in the code, explain key domain-specific terms, and provide usage examples for an API. The plugin also allows for open-ended prompts, which are automatically contextualized to the LLM with the program being edited. We evaluate this system in a user study with 32 participants, which confirms that using our plugin can aid task completion more than web search. We additionally provide a thorough analysis of the ways developers use, and perceive the usefulness of, our system, among others finding that the usage and benefits differ between students and professionals. We conclude that in-IDE prompt-less interaction with LLMs is a promising future direction for tool builders.

ACM Reference Format:
Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an LLM to Help With Code Understanding. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE '24), April 14–20, 2024, Lisbon, Portugal. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3597503.3639187

This work is licensed under a Creative Commons Attribution International 4.0 License. ICSE '24, April 14–20, 2024, Lisbon, Portugal. © 2024 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-0217-4/24/04. https://doi.org/10.1145/3597503.3639187

1 INTRODUCTION

Building and maintaining software systems requires a deep understanding of a codebase. Consequently, developers spend a significant amount of time searching and foraging for the information they need and organizing and digesting the information they find [30, 31, 34, 44, 48, 57]. Understanding code, however, is a challenging task; developers need to assimilate a large amount of information about the semantics of the code, the intricacies of the APIs used, and the relevant domain-specific concepts. Such information is often scattered across multiple sources, making it challenging for developers, especially novices or those working with unfamiliar APIs, to locate what they need. Furthermore, much of the relevant information is inadequately documented or spread across different formats and mediums, where it often becomes outdated.

With the growing popularity of large language model (LLM) based code generation tools [26, 54, 67], the need for information support for code understanding is arguably growing even higher. These tools can generate code automatically, even for developers with limited coding skills or domain knowledge. This convenience comes at a cost, however – developers may receive code they don't understand [24, 79]. Indeed, early research on LLM code generation tools has found that developers have a harder time debugging code generated by the LLM and easily get frustrated [40, 71].

Fortunately, LLMs also provide an opportunity in this space, namely by offering on-demand generation-based information support for developers faced with unfamiliar code. Compared to general web search queries [74], LLM prompts can allow developers to provide more context, which can enable them to receive information that more precisely aligns with their specific needs, potentially reducing the time spent on sifting through the information obtained from the web to suit their particular requirements. Developers have indeed taken to web-hosted conversational LLM tools, such as ChatGPT, for programming support en masse, but this setup requires them to both context switch and copy the relevant context from their IDEs into the chat system for support.

To explore the potential for generation-based information support directly in the developer's programming environment, we developed a prototype in-IDE LLM information support tool, GILT (Generation-based Information-support with LLM Technology). GILT is capable of generating on-demand information while considering the user's local code context, which we incorporate into the prompts provided to the LLM behind the scenes. This way, we also introduce a novel interaction method with the LLM, prompt-less interaction. This option aims to alleviate the cognitive load associated with writing prompts, particularly for developers who possess limited domain or programming knowledge.


As there is still little knowledge about how to best use an LLM for information support (as opposed to just code generation), we evaluate the effectiveness of our prototype tool in an exploratory user study with 32 participants tasked with comprehending and extending unfamiliar code that involves new domain concepts and Python APIs for data visualization and 3D rendering – a challenging task. Our study quantitatively compares task completion rates and measures of code understanding between two conditions – using the LLM-backed assistant in-IDE versus directly searching the web in a browser – and qualitatively investigates how participants used the tools and their overall satisfaction with this new interaction mode. Concretely, we answer three research questions:

• RQ1: To what extent does GILT affect developers' understanding, task completion time, and task completion rates when faced with unfamiliar code?
• RQ2: How do developers interact with GILT, and to what extent does that differ between the participants?
• RQ3: How do developers perceive the usefulness of GILT?

Our results confirm that there are statistically significant gains in task completion rate when using GILT, compared to a web search, showing the utility of generation-based information support. However, we did not find similar gains in terms of time and understanding level, leaving room for further improvement. We also discovered that the degree of the benefit varies between students and professionals, and investigated potential reasons behind this.

2 RELATED WORK

2.1 Studies on Developers' Information Seeking

In every phase of modern software engineering, developers need to work with unfamiliar code and domains, and how well they learn such code influences their productivity significantly. Therefore, researchers have sought to understand how developers learn and comprehend unfamiliar code and domains [30, 38, 51, 62], especially how they search for and acquire information, as developers need a variety of kinds of knowledge [43, 66, 68, 77]. In particular, much research has examined the information seeking strategies of developers, mostly in general software maintenance [18, 30, 35] or web search settings [7, 59]. Researchers have also studied challenges developers face [14, 23, 32, 58, 72, 74], including difficulties in effective search-query writing and information foraging, and built tools to help developers overcome such challenges [3, 42, 52]. In this work, we explore a way of supporting developers' information needs with generation-based information support using LLMs.

Other efforts were made to understand developers' information seeking within software documentation, which is the main source of information for developers when they learn to use new APIs or libraries. Researchers cataloged problems developers face when using documentation [9, 60, 61] and identified types of knowledge developers report to be critical [46, 60, 61, 70], which were taken into account when we designed GILT for our study.

2.2 Studies on LLM-based Developer Tools

The potential and applicability of LLM-based AI programming tools have been actively studied by many researchers. Numerous empirical studies [17, 22, 37, 65] evaluated the quality of code or explanations generated by LLMs, to test the feasibility of applying LLMs to development tools [41, 69, 79] and to Computer Science education [1, 22, 65]. Several studies have also compared LLM-generated code and explanations with those authored by humans without LLM assistance [22, 37, 56], demonstrating that LLMs can offer reasonably good help for developers or students, with caution.

Fewer studies have specifically explored the usefulness of LLM-based programming tools [2, 29, 40, 49, 63, 64, 71, 75, 79] with actual users or their usage data, and many of these studies have focused on code generation tools like Copilot [26]. For instance, Ziegler et al. [79] analyzed telemetry data and survey responses to understand developers' perceived productivity with GitHub Copilot, revealing that over one-fifth of suggestions were accepted by actual developers. Several human studies were also conducted. Vaithilingam et al. [71] compared the user experience of GitHub Copilot to traditional autocomplete in a user study and found that participants more frequently failed to complete tasks with Copilot, although there was no significant effect on task completion time. Barke et al. [2] investigated further with a grounded theory analysis to understand how programmers interact with code-generating models, using GitHub Copilot as an example. They identified two primary modes of interaction, acceleration and exploration, where Copilot is used to speed up code authoring in small logical units or as a planning assistant to suggest structure or API calls.

Although these studies have increased our understanding of the usefulness and usability of AI programming assistants in general, and some of the insights apply to information support, they do not show the opportunities and challenges of LLM-based tools as information support tools, with a few following exceptions [45, 63]. MacNeil et al. [45] examined the advantages of integrating code explanations generated by LLMs into an interactive e-book focused on web software development, with a user study with sophomores. They found students tend to find LLM-generated explanations to be useful, which is promising, but the study was focused on providing one-directional support in an introductory e-book, which is different from user-oriented, need-based information support. The Programmer's Assistant [63] is the closest to our work. The authors integrated a conversational programming assistant into an IDE to explore other types of assistance beyond code completion. They collected quantitative and qualitative feedback from a human study with 42 participants from diverse backgrounds and found that the perceived utility of the conversational programming assistance was high. In our work, we focus on the utility of LLM-based tools to satisfy information needs for code understanding, and take a step forward to test the actual utility of an LLM-integrated programming tool by assessing performance measures such as completion rates, time, and participants' code understanding levels.

3 THE GILT PROTOTYPE TOOL

We iteratively designed GILT to explore different modes of interaction with an LLM for information support. GILT is a plugin for the VS Code IDE (Figure 1) that considers user context (the code selected by the user) when querying an LLM for several information support applications.


Figure 1: Overview of our prototype. (1) A trigger button; (2) code used as context when prompting LLM; (3) code summary
(no-prompt trigger); (4) buttons for further details; (5) an input box for user prompts; (6) options to embed information to code
(Embed) and a hide/view button; (7) options to clear the panel (Clear all) and an abort LLM button; (8) a refresh button.

3.1 Interacting with GILT

There are two ways to interact with the plugin. First, users can select parts of their code (Figure 1-2) and trigger the tool by clicking on "AI Explanation" on the bottom bar (Figure 1-1), or by using "alt/option + a" as a shortcut, to receive a summary description of the highlighted code (Overview). They can then explore further by clicking on buttons (Figure 1-4) for API (API), domain-specific concepts (Concept), and usage examples (Usage), which provide more detailed explanations with preset prompts. The API button offers detailed explanations about the API calls used in the code, the Concept button provides domain-specific concepts that might be needed to understand the highlighted code fully, and the Usage button offers a code example involving API calls used in the highlighted code.

Users can also ask a specific question directly to the LLM via the input box (Figure 1-5). If no code is selected, the entire source code is used as context (Prompt); alternatively, the relevant code highlighted by the user is used (Prompt-context). The model will then answer the question with that code as context. GILT also allows users to probe the LLM by supporting conversational interaction (Prompt-followup). When previous LLM-generated responses exist, if a user does not highlight any lines from the code, the LLM generates a response with the previous conversation as context. Users can also reset the context by triggering the tool with code highlighted, or with the Clear all button.

3.2 Our Design Process and Decisions

Focus on understanding. We intentionally did not integrate a code generation feature in the prototype as we wanted to focus on how developers understand code.

In-IDE extension. Besides anticipating a better user experience, we designed the prototype as an in-IDE extension to more easily provide the code context to the LLM – participants could select code to use as part of the context for a query.

Pre-generated prompts. We designed buttons that query the LLM with pre-generated prompts (prompt-less interaction) to ask about an API, conceptual explanations, or usage examples, as shown in Figure 1-4. We chose these based on API learning theory [14, 33, 46, 68], expecting this may particularly assist novice programmers or those unfamiliar with the APIs/domains or the LLM, as writing efficient search queries or prompts can be difficult for novices [13, 15, 32]. At the same time, we also expected that this could reduce the cognitive burden of users in general in formulating prompts. For Overview and the buttons API, Concept, and Usage, we came up with prompt templates after a few iterations. To more efficiently provide the context to the LLM, we used the library names and the list of API methods included in the selected code, such as "Please provide a [library name] code example, mainly showing the usage of the following API calls: [list of API methods]" for Usage.

Unrestricted textual queries. Users can also directly prompt the LLM (Figure 1-5), in which case GILT will automatically add any selected code as context for the query. Internally, the tool adds the selected code as part of the user prompt using pre-defined templates, and requests the LLM to respond based on the code context.
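For illustration, the following is a minimal sketch of this template-plus-context pattern. It is not the actual GILT implementation: the template wording (except the quoted Usage template), the function names, and the use of the legacy openai-python interface are assumptions.

# Illustrative sketch of prompt-less interaction: pre-defined templates are
# filled with the user's selection and sent to the LLM. Names are hypothetical.
import openai  # legacy (pre-1.0) openai-python interface

TEMPLATES = {
    "overview": "Summarize what the following {library} code does:\n{code}",
    "api": "Explain the {library} API calls used in this code ({api_calls}):\n{code}",
    "concept": "Explain the domain-specific concepts needed to fully understand this {library} code:\n{code}",
    # The Usage template below follows the wording quoted in the paper.
    "usage": "Please provide a {library} code example, mainly showing the usage "
             "of the following API calls: {api_calls}",
}

def build_prompt(kind, library, selected_code, api_calls):
    """Fill the chosen template with the library name, API list, and selected code."""
    return TEMPLATES[kind].format(
        library=library, code=selected_code, api_calls=", ".join(api_calls)
    )

def ask_llm(prompt):
    """Send one contextualized request to GPT-3.5-turbo."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]

# Example: the Usage button pressed while two Bokeh API calls are selected.
print(ask_llm(build_prompt("usage", "Bokeh", "p.annular_wedge(...)",
                           ["figure", "annular_wedge"])))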


Need-based explanation generation. The tool is pull-based, i.e., it generates an explanation only when a user requests it. Similar to many previous developer information support tools, we wanted to reduce information overload and distraction. We expect that if and when enough context can be extracted from the IDE, hybrid (pull + push) tools will be possible, but this would require more research.

Iterative design updates. We ran design pilot studies and updated our prototype accordingly. For example, we made the code summary the default action for the tool trigger with code selection, after seeing pilot participants struggle to find parts of code to work on due to their unfamiliarity with the libraries and domains. We updated the prompt-based interaction with the LLM to support a conversational interface, based on the pilot participants' feedback that they wanted to probe the model based on their previous queries to clarify their intent or ask for further details. Finally, we opted to use GPT-3.5-turbo instead of GPT-4 as planned, after discovering that the response time was too slow in the pilot studies.

Figure 2: A 3D-rendering example sub-task (open3d-3), showing the Current and Goal outputs. With these start and goal outputs, we asked the participants to "Make the bunny sit upright on the chair." See Figure 1 for the corresponding starter code and the tool output.

4 HUMAN STUDY DESIGN

Participants. We advertised our IRB-approved study widely within the university community (through Slack channels, posted flyers, and personal contacts) and to the public (through Twitter, email, and other channels). We asked each participant about their programming experience and screened out those who reported having "not at all" experience. We did not ask about their professional programming experience, as the target users of our information support tools are not limited to professional developers. To minimize the possibility of participants knowing solutions, we specifically sought out participants who had not used the libraries included in our study. We accepted participants into the study on a rolling basis to capture a range of programming experience and existing domain knowledge. We compensated each participant with a $25 Amazon Gift card. We recruited 33 participants and conducted 33 studies in total. However, we had to exclude data from one participant from the analysis because they did not follow the instructions for using the extension. In the end, we had 9 women and 23 men participants. Among them, 16 participants identified themselves primarily as students, 1 as a software engineer, 2 as data scientists, and 13 as researchers. In the analysis, we divided the participants into two groups (students vs. professionals) based on this. 24 participants had experience with ChatGPT, 15 with Copilot, and 5 with Bard, while 7 participants reported no prior use of any LLM-based developer tools. In terms of familiarity with such tools, 14 participants stated that they have either used AI developer tools for their work or always use them for work.

Tasks. The tasks were designed to simulate a scenario in which developers with specific requirements search the web or use existing LLMs to generate code and find similar code that does not precisely match their needs. For each task, we provided participants with a high-level goal of the code, start and goal outputs, and a starter code file loaded into the IDE. In this way, participants had to understand the starter code we provided and make changes to it so that the modified code met the goal requirements. Each task consisted of 4 sub-tasks, to help reduce participants' overhead in planning, as well as to measure the task completion ratio in the analysis. There were some subtle dependencies between the sub-tasks, so we advised participants to follow the order we gave, but they were free to skip. Completing each sub-task required a small change, ranging from a single parameter value update to the addition of a line. The difficulty levels of the sub-tasks varied, but we intentionally designed the first sub-task to be easy so that participants could onboard easily. Sub-tasks also came with start and goal outputs and descriptions (see Figure 2). We did not include tasks that required strong programming knowledge, because our goal was to assess how well participants could understand the code. For the same reason, we provided participants with starter code that was runnable and bug-free.

Our tasks cover both a common and a less common domain that a Python developer might encounter in the wild. We chose two domains: data visualization and 3D rendering. These two tasks also allowed participants to easily check their progress, as they produce visual outputs that are comparable with the given goal. For the data visualization task, we used the Bokeh [5] library and asked participants to edit code that visualizes income and expenses data in polar plots. Understanding this code required knowledge of concepts related to visualizing data formats, marks, and data mapping. In the 3D rendering task, we used the Open3D [53] library. This task required knowledge of geometry and computer graphics. Participants were asked to edit code that involved point cloud rendering, transformation, and plane segmentation.

When selecting libraries, we intentionally did not choose the most common ones in their respective domains, to reduce the risk of participants knowing them well. Choosing less common libraries also helped reduce the risk of an outsized advantage for our LLM-powered information generation tool. Responses for popular libraries can be significantly better than those for less commonly used ones, as the quality of LLM-generated answers depends on whether the LLM has seen enough relevant data during training. The Bokeh starter code consisted of 101 LOC with 11 (6 unique) Bokeh API calls, and the starter code for the Open3D task consisted of 43 LOC with 18 (18 unique) Open3D API calls. The tasks were designed based on tutorial examples in each API's documentation. In the starter code, we did not include any comments, to isolate the effects of the information provided by our prototype or collected from search engines. All necessary API calls were in the starter code, so participants did not need to find or add new ones.
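To give a sense of the scale of edits involved, the snippets below are hypothetical illustrations in the two libraries; they are not the actual study sub-tasks or starter code, and the file name and parameter values are assumptions.

# Hypothetical examples of the kinds of small edits the sub-tasks required;
# the real starter code and solutions are in the paper's replication package.
import numpy as np
import open3d as o3d
from bokeh.plotting import figure, show

# Data visualization (Bokeh): a single-parameter change to a glyph, e.g. its color.
p = figure(width=400, height=400)
p.annular_wedge(x=0, y=0, inner_radius=0.2, outer_radius=0.5,
                start_angle=0.0, end_angle=2.0, color="green")  # was "gray"
show(p)

# 3D rendering (Open3D): adding one line to rotate a mesh so it "sits upright".
mesh = o3d.io.read_triangle_mesh("bunny.ply")  # hypothetical input file
R = mesh.get_rotation_matrix_from_xyz((np.pi / 2, 0, 0))
mesh.rotate(R, center=mesh.get_center())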


In the task descriptions, we tried to avoid using domain-specific or library-specific keywords that could potentially provide participants with direct answers from either search engines or GILT. For instance, we opted to use "make the bunny...", instead of "transform the bunny...", which may have steered participants towards the function transform without much thought.

The full task descriptions, starter code, solution for the demo, and the actual tasks are available in our replication package.

Experimental Design. We chose a within-subjects design, with participants using both GILT (treatment) and a search engine (control) for code understanding, but they did so on different tasks. This allowed us to ask participants to rate both conditions and provide comparative feedback about both treatments.

The control-condition participants were not allowed to use our prototype, but they were free to use any search engine to find the information they needed. The treatment-condition participants were encouraged to primarily use our prototype for information support. However, if they could not find a specific piece of information using our prototype, we allowed them to use search engines to find it. This was to prevent participants from being entirely blocked by the LLM. We expected that this was a realistic use case of any LLM-based tool, but it rarely happened during the study. Only 2 participants ended up using search engines during the treatment condition, but they could not complete the tasks even with the search engines.

We counterbalanced the tasks and the order they were presented to participants to prevent carryover effects, resulting in four groups (2 conditions x 2 orders). We used random block assignments when assigning participants to each group. Participants were assigned to each group to balance the self-reported programming and domain experience (data visualization and 3D rendering). For every new participant, we randomly assigned them to a group to which no previous participant with the same experience level had been assigned. If all groups had previous participants with the same experience level, we randomly assigned the participant to any of them.

Study Protocol. We conducted the study via a video conferencing tool and in person, with each session taking about 90 minutes; in-person participants also used the video conferencing tool, for consistency. At the beginning of the study, we asked participants to complete a pre-study survey, collecting their demographic information, background knowledge, and experience with LLMs. We also estimated their general information processing and learning styles using a cognitive style survey [20], categorizing participants into two groups per dimension: comprehensive / selective information processing and process-oriented learning / tinkering. The participants were then asked to join our web-based VS Code IDE hosted on GitHub Codespaces [25], which provided a realistic IDE inside a web browser with edit, run, test, and debug capabilities, without requiring participants to install any software locally [12]. We then showed them a demo task description and explained what they would be working on during the real tasks. Before their first task in the treatment condition, we provided a tutorial for our plugin using the demo task, introducing each feature and giving three example prompts for the LLM. For the control condition, we did not provide any demo, as we expected every participant to be able to use search engines fluently. For each task, we gave participants 20 minutes to complete as many sub-tasks as they could. During the task, we did not use the think-aloud protocol because we wanted to collect timing data. Instead, we collected qualitative data in the post-survey with open-ended questions. We also collected extensive event and interaction logs during the task. After each task, we asked participants to complete a post-task survey to measure their understanding of the provided code and the API calls therein. At the very end, we asked them to complete a post-study survey where we asked them to evaluate the perceived usefulness and perceived ease of use of each code understanding approach and each feature in GILT. We based our questionnaire on the Technology Acceptance Model (TAM) [36], NASA Task Load Index (TLX) [21], and pre- and post-study questionnaires that were previously used in similar studies [63, 68]. See our replication package for the instruments.

We conducted 33 studies in total, with 33 participants. The initial 18 studies were conducted on a one-on-one basis, while some studies in the latter half (involving 15 participants) were carried out in group sessions, with two to five participants simultaneously and one author serving as the moderator. We took great care to ensure that participants did not interrupt each other or share their progress. As mentioned before, we excluded one participant's data and used 32 participants' data for the analysis. We discovered this issue after the study, as this participant was part of the largest group session (with five participants).

5 RQ1: EFFECTS OF GILT

In this section, we report on the effectiveness of using GILT in understanding unfamiliar code.

5.1 Data Collection

Code understanding. To evaluate the effectiveness of each condition, we used three measurements: (1) Task completion time: the time to complete each sub-task; (2) Task progress: we rated the correctness of the participants' solution to each sub-task and measured how many sub-tasks they correctly implemented; and (3) Understanding level: we cross-checked participants' general understanding of the starter code by giving them sets of quiz questions about the APIs included in the starter code. Each set contained three questions, requiring an in-depth understanding of the functionalities of each API call and the application domains. To measure the effect of using GILT and search engines, we excluded the sub-task data if participants guessed the solution without using the tool (i.e., zero interaction with the tool) or search engines before completing it (i.e., no search queries).

Prior knowledge. To control for prior knowledge, we used self-reported measures of participants' programming and domain experience. We expected more programming experience, especially in the specific domain, to lead to a faster understanding of code.

Experience in AI developer tools. Crafting effective prompts for LLM-based tools requires trial and error, even for NLP experts [13, 16, 76]. Therefore, we asked participants about their experience with LLM-based developer tools. We expected participants' familiarity with other AI tools to affect their usage of the LLM-based information support tool, especially the use of free-form queries, and to lead to more effective use of the extension than participants without such experience.


Table 1: Summaries of regressions estimating the effect of using the prototype. Each column summarizes the model for a different outcome variable. We report the coefficient estimates with the standard errors in parentheses.

                      Progress    Time (s)    Underst.    Progress    Progress
                      (1)         (2)         (3)         (Pros)      (Students)
Constant              0.41        312.65      -1.81**     -0.38       1.82**
                      (0.49)      (185.33)    (0.89)      (0.68)      (0.83)
Domain experience     0.13*       23.14       0.41***     0.16        0.04
                      (0.07)      (25.40)     (0.12)      (0.09)      (0.11)
Program. experience   -0.10       -23.67      0.20        0.01        -0.37*
                      (0.12)      (43.53)     (0.22)      (0.17)      (0.21)
AI tool familiarity   -0.01       7.70        -0.09       0.07        -0.10
                      (0.07)      (27.04)     (0.14)      (0.11)      (0.10)
Uses GILT             0.47***     -9.10       0.29        0.57**      0.29
                      (0.16)      (57.26)     (0.28)      (0.22)      (0.25)
R²                    0.173       0.022       0.202       0.341       0.137
Adj. R²               0.117       -0.046      0.148       0.243       0.010
Note: *p < 0.1; **p < 0.05; ***p < 0.01.

5.2 Methodology

To answer RQ1, we compared the effectiveness of using GILT versus traditional search engines for completing programming tasks by estimating regression models for three outcome variables. For task progress and code understanding, we used quasi-Poisson models because we are modeling count variables, and for the task completion time, we used a linear regression model.

To account for potential confounding factors, we included task experience, programming experience, and LLM knowledge as control variables in our models. Finally, we used a dummy variable (uses_GILT) to indicate the condition (using GILT vs. using search engines). We considered mixed-effects regression but used fixed effects only, since each participant and task appear only once in the two conditions (with and without GILT). For example, for the task completion time response, we estimate the model:

completion_time ~ domain_experience + programming_experience + AI_tool_familiarity + uses_GILT

The estimated coefficient for the uses_GILT variable indicates the effect of using GILT while holding fixed the effects of programming experience, domain experience, and LLM knowledge.
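For illustration, models of this form could be estimated in Python with statsmodels. This is a sketch rather than the authors' analysis code; the data file and column names are assumptions, and the quasi-Poisson model is obtained as a Poisson GLM whose standard errors are rescaled by the estimated dispersion.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per (participant, task); column names mirror the formula above.
df = pd.read_csv("study_outcomes.csv")  # hypothetical file name

controls = "domain_experience + programming_experience + AI_tool_familiarity + uses_GILT"

# Task completion time: ordinary linear regression.
time_model = smf.ols(f"completion_time ~ {controls}", data=df).fit()

# Task progress (a count outcome): Poisson GLM with Pearson-chi2 scaling,
# which yields quasi-Poisson standard errors.
progress_model = smf.glm(
    f"progress ~ {controls}", data=df, family=sm.families.Poisson()
).fit(scale="X2")

print(time_model.summary())
print(progress_model.summary())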
5.3 Results

Table 1 columns (1)-(3) display the regression results for three response variables. The task progress model (Table 1-(1)) shows a significant difference between the two conditions, with participants in the GILT condition completing statistically significantly more sub-tasks (0.47 more, p < 0.01) than those who used search engines, controlling for experience levels and AI tool familiarity. This indicates that GILT may assist users in making more progress in their tasks compared to search engines.

On the other hand, models (2) and (3) fail to show any significant difference in completion time and code understanding quiz scores between conditions. This suggests that users in the GILT condition do not complete their tasks at a sufficiently different speed or have a sufficiently different level of understanding than those in the control group, given the statistical power of our experiment.

In summary, the results suggest that GILT may help users make more progress in their tasks without changing, for better or worse, their speed and code understanding abilities.

5.4 Additional Analysis

After observing the significant effect of GILT on task progress, we dove deeper to examine whether all participants benefited equally from the tool. To do this, we divided the participants into two distinct groups based on their self-reported occupations (professionals and students) and estimated the effects of GILT usage in each group.¹ We opted for these groups as we did not have any prior theoretical framework to guide our grouping choices, and it provided a simple yet effective approach to group participants along multiple dimensions, including programming experience, skills, and attitude toward programming.

¹ We considered, but decided against, modeling interaction effects as they would have required more statistical power.

Although both groups were more successful when using the tool, there were notable differences in their performance gains. To better understand these variations, we estimated coefficients for each group (Table 1-Pros and -Students) and observed that the impact of GILT was significant only in the Pros group model. Specifically, professionals completed 0.57 more sub-tasks with GILT support compared to when they used search engines, whereas students did not experience significant gains. These findings suggest that the degree of benefit provided by GILT may vary depending on participants' backgrounds or skills.

Summary RQ1: There are statistically significant gains in task completion rate when using GILT, compared to a web search, but the degree of the benefit varies between students and professionals.

6 RQ2: GILT USAGE

In this section, we focus on how participants interacted with GILT, their perception of the importance of different features, and how different factors correlate with the feature usage.

6.1 Usage of Features

To analyze in more detail how participants actually used the tool, we instrumented GILT and recorded participants' event and interaction logs. The logs allowed us to count the number of times participants triggered each feature, and in what order. To supplement the usage data, participants were asked to rate the importance of each feature in a post-task survey. We used these ratings to triangulate our findings from the usage data.


Figure 3: The sequences of feature usage in GILT. Each row corresponds to an individual participant, and the color cells are arranged chronologically, from left to right. (Legend: Prompt, P-Context, P-Followup, Overview, B-API, B-Concept, B-Usage; participants are grouped by task (Bokeh, Open3D) and by Students vs. Pros.)

Figure 4: Participants' report on the importance of GILT features. (Ratings from "Extremely" to "Not at all" important for Prompt, Prompt-context, Overview, API Btn, Concept Btn, and Usage Btn.)

Figure 3 summarizes the sequences of GILT features used by participants in the treatment condition. On average, to complete their tasks in this condition, participants interacted with the LLM via GILT 15.34 times. The number of interactions per participant ranged from a minimum of 5 to a maximum of 23. The Overview feature was the most frequently used method to interact with the LLM, with an average of 4.76 activations per participant. Many participants also used Overview as their first feature, possibly because it requires minimal effort, with just a single click, in contrast to other features that necessitated the formulation of queries by participants, and perhaps also because some of the buttons (e.g., Concept) required first using the Overview feature. Participants also frequently used Prompt-context (4.12 times) and Prompt-followup (2.88 times). General prompting without code context was used less frequently (1.27 times). While participants generally used buttons less frequently, some used them more frequently than queries (e.g., P29), indicating personal preferences in prompt-based and prompt-less interactions. Specifically, the API button was used 1.24 times, the Concept button 0.45 times, and the Usage button 0.24 times on average.

The reported importance of the features by participants (see Figure 4) generally corresponds to the observed usage data. Most of the participants (97%) responded that the ability to directly ask questions to the LLM was extremely/very important, whereas their reported usefulness of the buttons varied. The reported importance of the overview feature (53% extremely/very important) was relatively low compared to the actual use, suggesting that participants may not have used the summary description provided by the overview but instead used it as context for further prompting or to activate buttons.

To further investigate participants' interaction with the tool, we created transition graphs (Figure 5) using sequences of feature use events for each sub-task, using both the sub-tasks successfully completed by participants and those that resulted in failure (due to incorrect answers or timeouts). Out of the potential total of 128 sub-tasks (32 participants × 4 sub-tasks), 98 sub-tasks were started before the time ran out. In understanding the transition graph, we focused on the last feature in each participant's sequence, with an assumption that when a participant completes a task, it is likely that the information from the last interaction satisfied their information needs. Among the sub-tasks they successfully completed, a substantial majority (75%) originated from prompt-based interactions. At the same time, 83% of the failed tasks were also preceded by prompt-based interactions, so prompt-based interactions were not particularly likely to result in successful information seeking.
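The transition graphs in Figure 5 are, in essence, counts of consecutive feature-use pairs per sub-task. The sketch below shows how such edge counts can be derived from recorded sequences; the log format, feature labels, and example sequences are assumptions, not the paper's instrumentation code.

# Sketch: turn per-sub-task feature sequences into transition (edge) counts.
from collections import Counter

# Hypothetical log format: each sequence is the ordered list of features a
# participant used in one sub-task, terminated by its outcome.
sequences = [
    ["Overview", "Prompt-Context", "Prompt-Followup", "Success"],
    ["Overview", "API", "Prompt", "Fail"],
]

transitions = Counter()
for seq in sequences:
    transitions.update(zip(seq, seq[1:]))  # consecutive pairs become graph edges

for (src, dst), count in transitions.most_common():
    print(f"{src} -> {dst}: {count}")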
6.2 Professionals vs. Students

To better understand the experiences of professionals and students (see Section 5.4), we compared the transition graphs for both groups (Figure 5 (b) and (c)). Notable distinctions emerged in terms of the features more likely influencing the success and failure of sub-tasks. Specifically, for professionals, a majority (86%) of successful sub-tasks originated from prompt, whereas for students, this percentage (62%) was statistically significantly lower (χ²(1, 66), p < .05). The success rate of prompt-based interaction was also higher among professionals (71%: 32 out of 45) compared to students (58%: 18 out of 31). Conversely, the success rate of the overview and buttons for professionals (56%: 5 out of 9) was lower than that of students (85%: 11 out of 13). These results may indicate that students, possibly with less experience in information seeking for programming, encounter challenges in formulating effective prompts compared to the professionals, and rely more on prompt-less interaction. However, we can also infer that prompt-less interaction is still not sufficient to compete with the benefits of prompt-based interaction with the current design, as they only accounted for less than 40% of the completed tasks.

To further investigate the differences in the two groups' prompt engineering, we analyzed the text of the prompts they wrote, by comparing the frequencies of bi-, tri-, and quadrigrams in the prompts. Table 2 presents the list of n-grams that showed divergent usage between the two groups. One notable observation is that the n-grams used by the professional group include more effective keywords, or they revise the prompts to incorporate such keywords.


For instance, in the bokeh-3 sub-task, none of the participants in the student group used the critical keyword "annular wedge," which is essential for generating the information needed to solve the task, although it was used multiple times in the provided starting code. Instead, students tended to use more general keywords or keywords that had a different concept in the library (e.g., "legend") and faced difficulties in effectively revising the prompts. In addition, more participants in the professional group demonstrated proficiency in refining their prompts by providing further specifications. For example, one participant revised the prompt from "How to change the position of the bunny to 180 degrees" to "How to transform the bunny_mesh to 180 degrees." We infer that the difference in the benefit received from GILT by the two groups can be at least partially attributed to their proficiency in prompt engineering.

Figure 5: Transition Graphs for User Interaction: (a) All, (b) Professionals, (c) Students. Each node displays the number of times users interacted with respective features, and each edge indicates the counted number of transitions between the connected features. For space and readability reasons, Prompt, Prompt-Context, and Prompt-Followup are merged into prompt, and API, Concept, and Usage are merged into buttons. Counts lower than 5 are omitted except for the edges connected to the 'Success' and 'Fail' nodes.

Table 2: Frequencies of n-grams used differently in prompts by professionals and students. For clarity, we only include n-grams used uniquely by one of the two groups, with a frequency difference of more than 2. If multiple n-grams share the same longer n-gram, we report only the superset.

Sub-task   n-gram                                            Pro.   Stu.
bokeh-2    ('align', 'text')                                   3      0
           ('flip', 'label')                                    0      3
bokeh-3    ('annular', 'wedge')                                 6      0
           ('grid', 'annular', 'wedge')                         3      0
           ('first', 'pie')                                     3      0
           ('pie', 'chart')                                     3      0
           ('add', 'legend')                                    0      4
           ('tell', 'line', 'need', 'change')                   0      3
o3d-3      ('sit', 'upright', 'chair')                          4      0
           ('make', 'bunny', 'sit', 'upright', 'chair')         3      0
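The kind of n-gram comparison behind Table 2 can be sketched as follows. This is illustrative only: the tokenization, the example prompts, and the function names are assumptions, with the uniqueness-and-difference filter mirroring the rule stated in the table caption.

# Sketch: compare bi-/tri-/quadrigram frequencies in the two groups' prompts.
from collections import Counter

def ngram_counts(prompts, n):
    counts = Counter()
    for prompt in prompts:
        tokens = prompt.lower().split()
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

pro_prompts = ["how to change the annular wedge color", "transform the bunny_mesh"]
student_prompts = ["how to add a legend", "how to flip the label"]

for n in (2, 3, 4):
    pro, stu = ngram_counts(pro_prompts, n), ngram_counts(student_prompts, n)
    # Keep n-grams used by only one group with a frequency difference above 2.
    divergent = {g: (pro[g], stu[g]) for g in set(pro) | set(stu)
                 if min(pro[g], stu[g]) == 0 and abs(pro[g] - stu[g]) > 2}
    print(n, divergent)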
6.3 Other Factors Associated with Feature Use

During the pilot studies, we observed that participants approached the tasks differently depending on their familiarity with other LLM-based tools, and styles of information processing and learning as observed in many previous studies on software documentation and debugging [4, 46, 47]. Thus, we tested whether the GILT feature use correlates with factors other than their experience.

Hypotheses. Out of two information processing styles [8, 11, 19], people who exhibit a "selective information processor" tendency focus on the first promising option, pursuing it deeply before seeking additional information. On the other hand, people who are "comprehensive information processors" tend to gather information broadly to form a complete understanding of the problem before attempting to solve it. Based on these processing styles, we hypothesized that selective processors would utilize GILT's Prompt-followup, as they would prefer to use a depth-first strategy.

In terms of learning styles [8, 55], "process-oriented learners" prefer tutorials and how-to videos, while "tinkerers" like to experiment with features to develop their own understanding of the software's inner workings. Consequently, we hypothesized that tinkerers would use GILT less often, as they would prefer to tinker with the source code rather than collect information from the tool.

We also expected that participants who were already familiar with LLM-based tools would use prompt-based interaction in general (Prompt), especially the chat feature, more frequently, as they would already be accustomed to using chat interfaces to interact with LLMs. Conversely, we posited that participants with less experience with such tools would use the buttons more, as prompt engineering might be less familiar to them and place greater cognitive demands on them.


Methodology. To test for associations between GILT features used and the factors above, we again used multiple regression analysis. We estimated three models, each focused on one particular feature. For each model, the dependent variable was the feature usage count, while participants' information processing style, learning style, and familiarity with AI developer tools were modeled as independent variables to explain the variation in usage counts.

Table 3: Summaries of regressions testing for associations between the user factors and the feature usage counts. Each column summarizes a regression modeling a different outcome variable. We report the coefficient estimates with their standard errors in parentheses.

                       Prompt     Followup   All
                       (1)        (2)        (3)
Constant               1.39***    -0.82      2.43***
                       (0.31)     (0.69)     (0.27)
AI tool familiarity    0.19**     0.38**     0.11
                       (0.07)     (0.15)     (0.06)
Information Comprh.    -0.04      0.44       -0.04
                       (1.15)     (0.30)     (0.13)
Learning Process       0.19       0.60**     -0.12
                       (1.14)     (0.29)     (1.13)
R²                     0.262      0.283      0.165
Adj. R²                0.184      0.206      0.075
Note: *p < 0.1; **p < 0.05; ***p < 0.01.

Results. Table 3 presents the results of the regression analysis conducted for three response variables. The first model (Prompt (1)), which uses the total count of prompt-based interactions (prompt + prompt-context + prompt-followup), reveals that developers who are more familiar with other AI developer tools are more likely to prompt the LLMs using natural language queries. This result confirms our hypothesis that the AI tool familiarity level influences developers' use of queries. The familiarity level also has a statistically significant impact on prompt-followup, as shown in the Followup model (2). However, we did not find any significant impact of participants' information processing style on their use of GILT. This means that selective processors and comprehensive processors probed the LLMs similarly, as far as we can tell. The model, however, shows a statistically significant correlation between participants' learning styles and prompt-followup feature usage. Specifically, process-oriented learners tend to probe LLMs more frequently than tinkerers. This result might indicate that process-oriented learners are more likely to learn thoroughly before proceeding to the next step, while tinkerers tend to tinker with the code after getting the minimum amount of direction from GILT. Finally, the All model (3), which uses the total count of all GILT interactions, indicates that there is no statistically significant difference between the information styles, learning styles, and familiarity levels in terms of overall feature usage counts.

Summary RQ2: Overall, participants used Overview and Prompt-context most frequently. However, the way participants interact with GILT varied based on their learning styles and familiarity with other AI tools.

7 RQ3: USER PERCEPTIONS

In this section, we investigate how participants perceived their experience of using GILT. Specifically, we examine their perceived usefulness, usability, and cognitive load in comparison to search-based information seeking. Additionally, we explore the pros and cons participants reported, and suggestions for improving the tool.

7.1 Comparison with Web Search

We employed two widely used standard measures, TLX and TAM, in our post-task survey and compared them using two-tailed paired t-tests. TAM (Technology Acceptance Model) [36] is a widely used survey that assesses users' acceptance and adoption of new technologies, and TLX (Task Load Index) [21] is a subjective measure of mental workload that considers several dimensions, including mental, physical, and temporal demand, effort, frustration, and performance. The summaries of TAM and TLX comparisons can be found in our replication package.

The average scores for the [perceived usefulness, perceived ease of use] in TAM scales were [27.3, 29.75] for the control condition, and [33.49, 34.2] for the treatment condition. The paired t-tests on the TAM scores indicated that there were significant differences in perceived usefulness and perceived usability scores between the two conditions (p < 0.001). Specifically, participants rated GILT higher on both dimensions than they did search engines.

For TLX items [mental demand, physical demand, temporal demand, performance, effort, frustration], the average scores were [3.8, -2.1, 4.0, 1.6, 3.4, -0.1] for the control condition and [3.3, -2.5, 2.6, 3.3, 3.3, 1.0] for the treatment condition. Paired t-tests on the TLX scores revealed statistically significant differences between the tool and search engines in temporal demand (p < 0.05) and performance (p < 0.05) but not in other items. These results indicate that the participants felt less rushed when using GILT than when using search engines, and they felt more successful in accomplishing the task with the tool than with search engines, but there were no significant differences in other dimensions.
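These comparisons amount to one two-tailed paired t-test per TAM/TLX dimension, since every participant rated both conditions. A minimal sketch follows; the file and column names are assumptions.

import pandas as pd
from scipy.stats import ttest_rel

ratings = pd.read_csv("post_task_ratings.csv")  # one row per participant

for measure in ["tam_usefulness", "tam_ease_of_use", "tlx_temporal_demand", "tlx_performance"]:
    control = ratings[f"{measure}_search"]   # web-search condition
    treatment = ratings[f"{measure}_gilt"]   # GILT condition
    t_stat, p_value = ttest_rel(treatment, control)
    print(f"{measure}: t = {t_stat:.2f}, p = {p_value:.3f}")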
7.2 User Feedback

In the post-task survey, we asked open-ended questions regarding (i) their general experience with using GILT, (ii) the tool's fit with the participants' workflow, (iii) comparison with other tools, and (iv) opportunities for improvement. Two authors conducted a thematic analysis [10] to analyze the answers. Initially, two authors separately performed open coding on the same set of 8 responses (25% of the entire data), and convened to discuss and merge the codes into a shared codebook. The first author coded the rest of the responses and discussed with the rest of the authors whenever new codes needed to be added. The codebook is available in our replication package, and we discuss some of them here.

The participants in this study reported several positive aspects of the tool, with the most notable being context incorporation. Participants valued the ability to prompt the LLM with their code as context, which allowed them to tailor the LLM's suggestions to their specific programming context, e.g., "the extension generated code that could easily be used in the context of the task I was performing, without much modification." (P5)


Participants also found it extremely useful to prompt the LLM with just code context, as it allowed them to bypass the need to write proficient queries, a well-known challenge in search-based information seeking [32, 74]. P15 mentioned "It's nice not to need to know anything about the context before being effective in your search strategy."

Many participants reported that using the tool helped them speed up their information seeking, by reducing the need to forage for information, e.g., "Stack Overflow or a Google search would require more time and effort in order to find the exact issue and hence would be time-consuming." (P27)

Some participants, however, reported having a hard time finding a good prompt that could give them the desired response. Combined with the need for good prompts and the limitations of the LLM, this led some participants to report that the responses provided by the tool were occasionally inaccurate, reducing their productivity. P28 summarized this issue well: "[prototype] was not able to give me the code that I was looking for, so it took up all my time (which I got very annoyed about). I think I just didn't word the question well."

Participants had mixed opinions on the different features of the tool, especially the buttons. Some preferred to use "different buttons for different types of information so I didn't have to read a lot of text to find what I was looking for" (P7), while others thought that was overkill and mentioned "a simpler view would be nice." (P8)

Compared to ChatGPT, 17 participants (out of 19 who answered) mentioned advantages of GILT, with the Prompt-context feature being one of the main ones. Participants expressed positive feelings about Copilot but acknowledged that the tool had a different role than Copilot and that they would be complementary to each other, e.g.: "Copilot is a tool that I can complete mechanical works quickly, but [GILT] offers insight into more challenging tasks." (P29)

Many participants reported that the tool would be even more useful when combined with search engines, API documentation, or Copilot,² as they provide different types of information than the tool. Having the ability to choose sources based on their needs would enhance their productivity by giving them control over the trade-offs, such as speed, correctness, and adaptability of the information.

² Notably, GitHub independently announced these enhancements to Copilot already, after we conducted our study: https://www.theverge.com/2023/7/20/23801498/github-copilot-x-chat-code-chatbot-public-beta

Summary RQ3: Participants appreciated GILT's ability to easily incorporate their code as context, but some participants reported that writing a good prompt is still a challenge.

8 THREATS TO VALIDITY

One potential concern with our study design is the task and library selection. We only used tasks that show visible outputs, which might have led participants to detect potential errors more easily, compared to other tasks, such as optimization or parallel programming. However, we believe that the tasks we chose are representative of common programming errors that would need to be identified in real-world programming situations. Indeed, when we asked the participants in the post-task survey, the data visualization and 3D rendering tasks were reported to very or extremely closely resemble real-world tasks by 82% and 73% of the participants, respectively.

Similarly, the selection of libraries might have biased the study results. However, in selecting libraries for our study, we avoided using popular libraries that could unintentionally give an advantage to LLMs. We believe that the libraries we chose are of medium size and quality, and therefore represent a fair test of the LLM tools. However, it is possible that different libraries or larger codebases could produce different results.

Despite our efforts to create a controlled experience, several factors differentiate our in-IDE extension from search engines, aside from the inclusion of LLMs. For example, although previous research investigating the incorporation of search into the IDE did not find a statistically significant difference between the control and treatment groups [6, 39], the in-IDE design itself may have been more helpful than access to LLMs, as it potentially reduced context-switching. Thus, further studies are needed to gain a better understanding of the extent to which each benefit of our prototype can be attributed to these differences.

Additionally, the laboratory setting may not fully capture the complexity of real-world programming tasks, which could impact the generalizability of our findings. Also, the time pressure participants could have felt, and the novelty effect in a lab setting, could have changed how users interact with LLMs. Our sample size, 32, was relatively small and skewed towards those in academia. This may also limit the generalizability of our findings to more professional programmers. Thus, future research with larger, more diverse samples is necessary to confirm and expand upon our results.

Our analysis also has the standard threats to statistical conclusion validity affecting regression models. Overall, we took several steps to increase the robustness of our estimated regression results. First, we removed outliers from the top 1% most extreme values. Second, we checked for multicollinearity using the Variance Inflation Factor (VIF) and confirmed that all variables we used had VIF lower than 2.5, following Johnston et al. [28].
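The multicollinearity check mentioned above can be reproduced with the variance inflation factor from statsmodels. The sketch below reuses the illustrative data file and column names assumed earlier; it is not the authors' analysis code.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("study_outcomes.csv")  # hypothetical file name
predictors = sm.add_constant(df[["domain_experience", "programming_experience",
                                 "AI_tool_familiarity", "uses_GILT"]])
vif = {col: variance_inflation_factor(predictors.values, i)
       for i, col in enumerate(predictors.columns)}
print(vif)  # values above ~2.5 would signal problematic multicollinearity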
Another potential threat to the validity of our findings is the rapid pace of technological development in the field of LLM tools. Despite our efforts to use the most up-to-date LLM available at the time of the submission, it is possible that new breakthroughs could render our findings obsolete before long.

9 DISCUSSION AND IMPLICATIONS

Comprehension outsourcing. Our analysis revealed an intriguing finding regarding participants' behavior during the study, where some of them deferred their need for code comprehension to the LLM, which was well described by one participant as comprehension outsourcing. These participants prompted the model at a higher level directly and did not read and fully comprehend the code before making changes. As one participant commented, "I was surprised by how little I had to know about (or even read) the starter code before I can jump in and make changes." This behavior might be attributed to developers' inclination to focus on task completion rather than comprehending the software, as reported in the literature [44]. Or, participants may have also weighed the costs and risks of comprehending code themselves, and chosen to defer their comprehension efforts to the language model.

1193
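To make these robustness checks concrete, the following is a minimal sketch of outlier trimming and a VIF screen in Python, assuming the study variables live in a pandas DataFrame; the column names are illustrative placeholders, not our actual study data.

```python
# A minimal sketch of the robustness checks described above. Column names are
# illustrative placeholders, not the actual study variables.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def trim_top_outliers(df: pd.DataFrame, column: str, quantile: float = 0.99) -> pd.DataFrame:
    """Drop rows whose value in `column` falls above the given quantile (top 1% by default)."""
    cutoff = df[column].quantile(quantile)
    return df[df[column] <= cutoff]


def vif_table(df: pd.DataFrame, predictors: list) -> pd.Series:
    """Compute the Variance Inflation Factor (VIF) for each predictor."""
    # Each predictor's VIF is computed against the other predictors plus an intercept.
    X = df[predictors].assign(intercept=1.0)
    return pd.Series(
        {name: variance_inflation_factor(X.values, i) for i, name in enumerate(predictors)}
    )


# Illustrative usage (placeholder column names):
# data = trim_top_outliers(data, "completion_time")
# print(vif_table(data, ["years_of_experience", "task_order", "tool_usage_count"]))
```

Under this kind of check, any predictor whose VIF exceeds the 2.5 threshold would be re-examined before fitting the regression.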
Another potential threat to the validity of our findings is the rapid pace of technological development in the field of LLM tools. Despite our efforts to use the most up-to-date LLM available at the time of the submission, it is possible that new breakthroughs could render our findings obsolete before long.

9 DISCUSSION AND IMPLICATIONS
Comprehension outsourcing. Our analysis revealed an intriguing finding regarding participants’ behavior during the study, where some of them deferred their need for code comprehension to the LLM, which was well described by one participant as comprehension outsourcing. These participants prompted the model at a higher level directly and did not read and fully comprehend the code before making changes. As one participant commented, “I was surprised by how little I had to know about (or even read) the starter code before I can jump in and make changes.” This behavior might be attributed to developers’ inclination to focus on task completion rather than comprehending the software, as reported in the literature [44]. Or, participants may have also weighed the costs and risks of comprehending code themselves, and chosen to defer their comprehension efforts to the language model. While this behavior was observed in the controlled setting of a lab study and may not fully reflect how developers approach code comprehension in their daily work, it does raise concerns about the potential impact of such a trend (or over-reliance on LLMs [71]) on code quality. This highlights the importance of preventing developers who tend to defer their comprehension efforts to the LLM from being steered in directions that neither they nor the LLM are adequately equipped to handle. Studies showing developers’ heavy reliance on Stack Overflow, despite its known limitations in accuracy and currency [73, 78], further emphasize the need for caution before widely adopting LLM-based tools in code development. Research on developers’ motivations and reasons for code comprehension when LLMs are available will be valuable in informing future tool designs.
Need for more research in UI. In our analysis, we observed a notable trend where the professionals benefited more from the tool compared to students. Our examination of the prompts indicated that this discrepancy may arise because students face challenges in constructing effective queries or revising them to obtain useful information, aligning with findings in the literature on code generation using LLMs [13]. Although we provided an option to use prompt-less interaction with LLMs to reduce the difficulty in prompt engineering, a lot of participants chose to use prompt-based interaction, possibly due to their familiarity with other AI tools, the potentially higher quality of information this mode produces, or other reasons that our study did not cover. However, we find our results still promising, as we observed that students used prompt-less interaction more than the professionals and succeeded more when using the buttons than using the prompts. We believe that further research is needed, exploring various interaction options to support a diverse developer population.
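As one concrete form the prompt-less option above could take, the sketch below maps a button click to a fixed prompt template filled with the code the developer has highlighted, so that no prompt needs to be written. The button names and template wording are illustrative placeholders rather than GILT’s actual prompts.

```python
# Illustrative sketch of prompt-less interaction: each button corresponds to a
# fixed prompt template filled with the highlighted code. Button names and
# template wording are placeholders, not GILT's actual prompts.
BUTTON_TEMPLATES = {
    "explain": "Explain, step by step, what the following code does:\n\n{code}",
    "api": "List the API calls used in the following code and describe what each does:\n\n{code}",
}


def build_button_prompt(button: str, highlighted_code: str) -> str:
    """Turn a button click plus the currently highlighted code into a complete prompt."""
    return BUTTON_TEMPLATES[button].format(code=highlighted_code)


# Example: build_button_prompt("explain", "plt.plot(xs, ys)") yields a ready-to-send
# prompt without the developer writing anything themselves.
```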
Utilize more context. One of the main advantages of GILT reported by the participants is its ability to prompt the LLM with the code being edited as context. We believe that additional types of context can be leveraged to improve the tool’s utility, including project context (e.g., project scale and domain), system context (e.g., programming languages and target deployment environments), and personal context (e.g., programming expertise in libraries and domains). By combining these contexts with proper back-end engineering, we believe that GILT, or other LLM-powered developer tools, will be able to provide relevant information to developers with even less prompt engineering effort from users.
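To make this concrete, here is a minimal sketch of what such context-enriched prompting could look like with the OpenAI Python client. This is not GILT’s actual back end: the context fields, their wording, and the helper name are illustrative assumptions.

```python
# A minimal sketch of context-enriched prompting (not GILT's actual back end).
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set


def ask_with_context(code: str, question: str, project: str, system: str, personal: str) -> str:
    """Prepend project, system, and personal context to the code before querying the model."""
    context = (
        f"Project context: {project}\n"
        f"System context: {system}\n"
        f"Developer context: {personal}\n\n"
        f"Code being edited:\n{code}\n"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You help developers understand the code they are editing."},
            {"role": "user", "content": context + "\n" + question},
        ],
    )
    return response.choices[0].message.content
```

In practice, a tool would gather these fields automatically, for example from project configuration files and the IDE’s language settings, rather than asking the developer to type them.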
Need further studies in real-world settings. One possible explanation for some of the models with null results from RQ1 and RQ2 is the artificial setting of the lab study, where participants were encouraged to focus on small, specific task requirements instead of exploring the broader information dimension. For example, participants prioritized completing more tasks rather than fully understanding the code, as reported by participant P18 in their survey response: “[GILT] ..., which could definitely help one to tackle the task better if there weren’t under the timed-settings.” Thus, although our first study shed some light on the potential challenges and promises, to fully understand the implications of deploying this tool into general developer pipelines, it is necessary to observe how programmers use it in real-world settings with larger-scale software systems, less specific goals, and over a longer time frame. Given that GitHub recently launched Copilot X [27], a tool that offers a comparable set of features to our prototype to enhance developer experience, such research is urgently needed. We believe that our findings are a timely contribution and a good first step for researchers and tool builders in designing and developing developer assistants that effectively use LLMs.

10 CONCLUSION
We presented the results of a user study that aimed to investigate the effectiveness of generation-based information support using LLMs to aid developers in code understanding. With our in-IDE prototype tool, GILT, we demonstrated that this approach significantly enhances developers’ ability to complete tasks compared to traditional search-based information seeking. At the same time, we also identified that the degree of benefits developers can get from the tool differs between students and professionals, and that the way developers interact with the tool varies based on their learning styles and familiarity with other AI tools.

Data Availability. Our supplementary material, including the replication package (study protocol, tasks, study data, and scripts to replicate the analyses) as well as the prototype tool, GILT, is available online at https://doi.org/10.5281/zenodo.10461385 [50].

11 ACKNOWLEDGEMENT
We would like to thank our participants for their time and input, and our reviewers for their valuable feedback.

REFERENCES
[1] Matin Amoozadeh, David Daniels, Daye Nam, Stella Chen, Michael Hilton, Sruti Srinivasa Ragavan, and Mohammad Amin Alipour. 2023. Trust in Generative AI among students: An Exploratory Study. arXiv:2310.04631 Retrieved from https://arxiv.org/abs/2310.04631.
[2] Shraddha Barke, Michael B. James, and Nadia Polikarpova. 2023. Grounded Copilot: How Programmers Interact with Code-Generating Models. Proc. ACM Program. Lang. 7, OOPSLA1 (2023), 85–111. https://doi.org/10.1145/3586030
[3] Celeste Barnaby, Koushik Sen, Tianyi Zhang, Elena Glassman, and Satish Chandra. 2020. Exempla Gratis (E.G.): Code Examples for Free. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020), Virtual Event, USA. ACM, New York, NY, 1353–1364. https://doi.org/10.1145/3368089.3417052
[4] Laura Beckwith, Cory Kissinger, Margaret M. Burnett, Susan Wiedenbeck, Joseph Lawrance, Alan F. Blackwell, and Curtis R. Cook. 2006. Tinkering and gender in end-user programmers’ debugging. In Proceedings of the 2006 Conference on Human Factors in Computing Systems, (CHI 2006), Montréal, Québec, Canada, April 22-27, 2006. ACM, New York, NY, 231–240. https://doi.org/10.1145/1124772.1124808
[5] Bokeh. 2023. Bokeh. https://bokeh.org/ Retrieved: 2024-01-11.
[6] Joel Brandt, Mira Dontcheva, Marcos Weskamp, and Scott R. Klemmer. 2010. Example-centric programming: integrating web search into the development environment. In Proceedings of the 28th International Conference on Human Factors in Computing Systems, (CHI 2010), Atlanta, Georgia, USA, April 10-15, 2010. ACM, New York, NY, 513–522. https://doi.org/10.1145/1753326.1753402
[7] Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, and Scott R. Klemmer. 2009. Two studies of opportunistic programming: interleaving web foraging, learning, and writing code. In Proceedings of the 27th International Conference on Human Factors in Computing Systems, (CHI 2009), Boston, MA, USA, April 4-9, 2009. ACM, New York, NY, 1589–1598. https://doi.org/10.1145/1518701.1518944
[8] Margaret M. Burnett, Simone Stumpf, Jamie Macbeth, Stephann Makri, Laura Beckwith, Irwin Kwan, Anicia Peters, and Will Jernigan. 2016. GenderMag: A Method for Evaluating Software’s Gender Inclusiveness. Interact. Comput. 28, 6 (2016), 760–787. https://doi.org/10.1093/IWC/IWV046
[9] Jie-Cherng Chen and Sun-Jen Huang. 2009. An empirical analysis of the impact of software development problem factors on software maintainability. J. Syst. Softw. 82, 6 (2009), 981–992. https://doi.org/10.1016/J.JSS.2008.12.036
[10] Victoria Clarke and Virginia Braun. 2013. Teaching thematic analysis: Overcoming challenges and developing strategies for effective learning. The psychologist 26, 2 (2013), 120–123.
[11] William K Darley and Robert E Smith. 1995. Gender differences in information processing strategies: An empirical test of the selectivity model in advertising response. Journal of advertising 24, 1 (1995), 41–56.
[12] Matthew C. Davis, Emad Aghayi, Thomas D. LaToza, Xiaoyin Wang, Brad A. Myers, and Joshua Sunshine. 2023. What’s (Not) Working in Programmer User Studies? ACM Trans. Softw. Eng. Methodol. 32, 5 (2023), 120:1–120:32. https://doi.org/10.1145/3587157
[13] Paul Denny, Viraj Kumar, and Nasser Giacaman. 2023. Conversing with Copilot: Exploring Prompt Engineering for Solving CS1 Problems Using Natural Language. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education, Volume 1, (SIGCSE 2023), Toronto, ON, Canada, March 15-18, 2023. ACM, New York, NY, USA, 1136–1142. https://doi.org/10.1145/3545945.3569823
[14] Ekwa Duala-Ekoko and Martin P. Robillard. 2010. The information gathering strategies of API learners. Technical Report. TR-2010.6, School of Computer Science, McGill University.
[15] Ekwa Duala-Ekoko and Martin P. Robillard. 2012. Asking and answering questions about unfamiliar APIs: An exploratory study. In 34th International Conference on Software Engineering, (ICSE 2012), June 2-9, 2012, Zurich, Switzerland. IEEE Computer Society, 266–276. https://doi.org/10.1109/ICSE.2012.6227187
[16] Jean-Baptiste Döderlein, Mathieu Acher, Djamel Eddine Khelladi, and Benoit Combemale. 2023. Piloting Copilot and Codex: Hot Temperature, Cold Prompts, or Black Magic? arXiv:2210.14699 Retrieved from https://arxiv.org/abs/2210.14699.
[17] James Finnie-Ansley, Paul Denny, Brett A. Becker, Andrew Luxton-Reilly, and James Prather. 2022. The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming. In Australasian Computing Education Conference (ACE 2022), Virtual Event, Australia, February 14 - 18, 2022. ACM, 10–19. https://doi.org/10.1145/3511861.3511863
[18] Luanne Freund. 2015. Contextualizing the information-seeking behavior of software engineers. J. Assoc. Inf. Sci. Technol. 66, 8 (2015), 1594–1605. https://doi.org/10.1002/ASI.23278
[19] Valentina Grigoreanu, Margaret M. Burnett, and George G. Robertson. 2010. A strategy-centric approach to the design of end-user debugging tools. In Proceedings of the 28th International Conference on Human Factors in Computing Systems, (CHI 2010), Atlanta, Georgia, USA, April 10-15, 2010. ACM, 713–722. https://doi.org/10.1145/1753326.1753431
[20] Md Montaser Hamid, Amreeta Chatterjee, Mariam Guizani, Andrew Anderson, Fatima Moussaoui, Sarah Yang, I Escobar, Anita Sarma, and Margaret Burnett. 2023. How to measure diversity actionably in technology. Equity, Diversity, and Inclusion in Software Engineering: Best Practices and Insights (2023).
[21] Sandra G Hart and Lowell E Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in psychology. Vol. 52. Elsevier, 139–183.
[22] Arto Hellas, Juho Leinonen, Sami Sarsa, Charles Koutcheme, Lilja Kujanpää, and Juha Sorva. 2023. Exploring the Responses of Large Language Models to Beginner Programmers’ Help Requests. In Proceedings of the 2023 ACM Conference on International Computing Education Research V.1 (ICER 2023). ACM. https://doi.org/10.1145/3568813.3600139
[23] Amber Horvath, Sachin Grover, Sihan Dong, Emily Zhou, Finn Voichick, Mary Beth Kery, Shwetha Shinju, Daye Nam, Mariann Nagy, and Brad A. Myers. 2019. The Long Tail: Understanding the Discoverability of API Functionality. In 2019 IEEE Symposium on Visual Languages and Human-Centric Computing, (VL/HCC 2019), Memphis, Tennessee, USA, October 14-18, 2019. IEEE Computer Society, 157–161. https://doi.org/10.1109/VLHCC.2019.8818681
[24] Saki Imai. 2022. Is GitHub Copilot a Substitute for Human Pair-programming? An Empirical Study. In 44th IEEE/ACM International Conference on Software Engineering: Companion Proceedings, (ICSE Companion 2022), Pittsburgh, PA, USA, May 22-24, 2022. ACM/IEEE, 319–321. https://doi.org/10.1145/3510454.3522684
[25] GitHub Inc. 2024. GitHub Codespaces. https://github.com/features/codespaces Retrieved: 2024-01-11.
[26] GitHub Inc. 2024. GitHub Copilot. https://github.com/features/copilot Retrieved: 2024-01-11.
[27] GitHub Inc. 2024. GitHub Copilot X: The AI-powered developer experience. https://github.com/features/preview/copilot-x Retrieved: 2024-01-11.
[28] Ron Johnston, Kelvyn Jones, and David Manley. 2018. Confounding and collinearity in regression analysis: a cautionary tale and an alternative procedure, illustrated by studies of British voting behaviour. Quality & quantity 52 (2018), 1957–1976.
[29] Majeed Kazemitabaar, Justin Chow, Carl Ka To Ma, Barbara J. Ericson, David Weintrop, and Tovi Grossman. 2023. Studying the effect of AI Code Generators on Supporting Novice Learners in Introductory Programming. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, (CHI 2023), Hamburg, Germany, April 23-28, 2023. ACM, 455:1–455:23. https://doi.org/10.1145/3544548.3580919
[30] Amy J. Ko, Robert DeLine, and Gina Venolia. 2007. Information Needs in Collocated Software Development Teams. In 29th International Conference on Software Engineering (ICSE 2007), Minneapolis, MN, USA, May 20-26, 2007. IEEE Computer Society, 344–353. https://doi.org/10.1109/ICSE.2007.45
[31] Amy J. Ko, Brad A. Myers, Michael J. Coblenz, and Htet Htet Aung. 2006. An Exploratory Study of How Developers Seek, Relate, and Collect Relevant Information during Software Maintenance Tasks. IEEE Transactions on Software Engineering 32, 12 (11 2006), 971–987. https://doi.org/10.1109/tse.2006.116
[32] Amy J. Ko and Yann Riche. 2011. The role of conceptual knowledge in API usability. In 2011 IEEE Symposium on Visual Languages and Human-Centric Computing, (VL/HCC 2011), Pittsburgh, PA, USA, September 18-22, 2011. IEEE, 173–176. https://doi.org/10.1109/VLHCC.2011.6070395
[33] Thomas D. LaToza, David Garlan, James D. Herbsleb, and Brad A. Myers. 2007. Program comprehension as fact finding. In Proceedings of the 6th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE 2007), Dubrovnik, Croatia, September 3-7, 2007. ACM, 361–370. https://doi.org/10.1145/1287624.1287675
[34] Thomas D. LaToza, Gina Venolia, and Robert DeLine. 2006. Maintaining mental models: a study of developer work habits. In Proceedings of the 28th international conference on Software engineering. 492–501.
[35] Joseph Lawrance, Christopher Bogart, Margaret M. Burnett, Rachel K. E. Bellamy, Kyle Rector, and Scott D. Fleming. 2013. How Programmers Debug, Revisited: An Information Foraging Theory Perspective. IEEE Trans. Software Eng. 39, 2 (2013), 197–215. https://doi.org/10.1109/TSE.2010.111
[36] Younghwa Lee, Kenneth A. Kozar, and Kai R. T. Larsen. 2003. The technology acceptance model: Past, present, and future. Communications of the Association for information systems 12, 1 (2003), 50.
[37] Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023. Comparing Code Explanations Created by Students and Large Language Models. arXiv (2023). arXiv:2304.03938
[38] Hongwei Li, Zhenchang Xing, Xin Peng, and Wenyun Zhao. 2013. What help do developers seek, when and how?. In 20th Working Conference on Reverse Engineering, (WCRE 2013), Koblenz, Germany, October 14-17, 2013. IEEE Computer Society, 142–151. https://doi.org/10.1109/WCRE.2013.6671289
[39] Hongwei Li, Xuejiao Zhao, Zhenchang Xing, Lingfeng Bao, Xin Peng, Dongjing Gao, and Wenyun Zhao. 2015. amAssist: In-IDE ambient search of online programming resources. In 22nd IEEE International Conference on Software Analysis, Evolution, and Reengineering, (SANER 2015), Montreal, QC, Canada, March 2-6, 2015. IEEE Computer Society, 390–398. https://doi.org/10.1109/SANER.2015.7081849
[40] Jenny T. Liang, Chenyang Yang, and Brad A. Myers. 2023. A Large-Scale Survey on the Usability of AI Programming Assistants: Successes and Challenges. arXiv:2303.17125 Retrieved from https://arxiv.org/abs/2303.17125.
[41] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv:2305.01210 Retrieved from https://arxiv.org/abs/2305.01210.
[42] Michael Xieyang Liu, Aniket Kittur, and Brad A. Myers. 2022. Crystalline: Lowering the Cost for Developers to Collect and Organize Information for Decision Making. In Conference on Human Factors in Computing Systems (CHI 2022), New Orleans, LA, USA, 29 April 2022 - 5 May 2022. ACM, 68:1–68:16. https://doi.org/10.1145/3491102.3501968
[43] Walid Maalej and Martin P. Robillard. 2013. Patterns of Knowledge in API Reference Documentation. IEEE Trans. Software Eng. 39, 9 (2013), 1264–1282. https://doi.org/10.1109/TSE.2013.12
[44] Walid Maalej, Rebecca Tiarks, Tobias Roehm, and Rainer Koschke. 2014. On the Comprehension of Program Comprehension. ACM Trans. Softw. Eng. Methodol. 23, 4 (2014), 31:1–31:37. https://doi.org/10.1145/2622669
[45] Stephen MacNeil, Andrew Tran, Arto Hellas, Joanne Kim, Sami Sarsa, Paul Denny, Seth Bernstein, and Juho Leinonen. 2022. Experiences from Using Code Explanations Generated by Large Language Models in a Web Software Development E-Book. arXiv (2022). arXiv:2211.02265
[46] Michael Meng, Stephanie Steinhardt, and Andreas Schubert. 2017. Application Programming Interface Documentation: What Do Software Developers Want? Journal of Technical Writing and Communication 48, 3 (07 2017), 295–330. https://doi.org/10.1177/0047281617721853
[47] Michael Meng, Stephanie Steinhardt, and Andreas Schubert. 2019. How Developers Use API Documentation: An Observation Study. Commun. Des. Q. Rev 7, 2 (aug 2019), 40–49. https://doi.org/10.1145/3358931.3358937
[48] Andre N. Meyer, Laura E. Barton, Gail C. Murphy, Thomas Zimmermann, and Thomas Fritz. 2017. The Work Life of Developers: Activities, Switches and Perceived Productivity. IEEE Transactions on Software Engineering 43, 12 (2017), 1178–1193. https://doi.org/10.1109/tse.2017.2656886
[49] Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2023. Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming. arXiv:2210.14306 Retrieved from https://arxiv.org/abs/2210.14306.
[50] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Artifacts for Using an LLM to Help With Code Understanding. https://doi.org/10.5281/zenodo.10461385
[51] Daye Nam, Andrew Macvean, Brad Myers, and Bogdan Vasilescu. 2023. Exploring Documentation Usage via Page-view Log Analysis. arXiv:2310.10817 Retrieved from https://arxiv.org/abs/2310.10817.
[52] Daye Nam, Brad A. Myers, Bogdan Vasilescu, and Vincent J. Hellendoorn. 2023. Improving API Knowledge Discovery with ML: A Case Study of Comparable API Methods. In 45th IEEE/ACM International Conference on Software Engineering, (ICSE 2023), Melbourne, Australia, May 14-20, 2023. IEEE, 1890–1906. https://doi.org/10.1109/ICSE48619.2023.00161
[53] Open3D. 2024. Open3D – A Modern Library for 3D Data Processing. http://www.open3d.org/ Retrieved: 2024-01-11.
[54] OpenAI. 2024. ChatGPT|OpenAI. https://chat.openai.com/ Retrieved: 2024-01-11.
[55] David N. Perkins, Chris Hancock, Renee Hobbs, Fay Martin, and Rebecca Simmons. 1986. Conditions of learning in novice programmers. Journal of Educational Computing Research 2, 1 (1986), 37–55.
[56] Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. 2022. Do Users Write More Insecure Code with AI Assistants? CoRR abs/2211.03622 (2022). https://doi.org/10.48550/arXiv.2211.03622 arXiv:2211.03622
[57] David Piorkowski, Austin Z. Henley, Tahmid Nabi, Scott D. Fleming, Christopher Scaffidi, and Margaret M. Burnett. 2016. Foraging and navigations, fundamentally: developers’ predictions of value and cost. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, (FSE 2016), Seattle, WA, USA, November 13-18, 2016. ACM, 97–108. https://doi.org/10.1145/2950290.2950302
[58] Md. Masudur Rahman, Jed Barson, Sydney Paul, Joshua Kayani, Federico Andres Lois, Sebastian Fernandez Quezada, Christopher Parnin, Kathryn T. Stolee, and Baishakhi Ray. 2018. Evaluating how developers use general-purpose web-search for code retrieval. In Proceedings of the 15th International Conference on Mining Software Repositories, (MSR 2018), Gothenburg, Sweden, May 28-29, 2018. ACM, 465–475. https://doi.org/10.1145/3196398.3196425
[59] Nikitha Rao, Chetan Bansal, Thomas Zimmermann, Ahmed Hassan Awadallah, and Nachiappan Nagappan. 2020. Analyzing Web Search Behavior for Software Engineering Tasks. In 2020 IEEE International Conference on Big Data (IEEE BigData 2020), Atlanta, GA, USA, December 10-13, 2020. IEEE, 768–777. https://doi.org/10.1109/BIGDATA50022.2020.9378083
[60] Martin P. Robillard. 2009. What Makes APIs Hard to Learn? Answers from Developers. IEEE Softw. 26, 6 (2009), 27–34. https://doi.org/10.1109/MS.2009.193
[61] Martin P. Robillard and Robert DeLine. 2011. A field study of API learning obstacles. Empir. Softw. Eng. 16, 6 (2011), 703–732. https://doi.org/10.1007/S10664-010-9150-8
[62] Tobias Roehm. 2015. Two user perspectives in program comprehension: end users and developer users. In Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension, (ICPC 2015), Florence/Firenze, Italy, May 16-24, 2015. IEEE Computer Society, 129–139. https://doi.org/10.1109/ICPC.2015.22
[63] Steven I. Ross, Fernando Martinez, Stephanie Houde, Michael J. Muller, and Justin D. Weisz. 2023. The Programmer’s Assistant: Conversational Interaction with a Large Language Model for Software Development. In Proceedings of the 28th International Conference on Intelligent User Interfaces, (IUI 2023), Sydney, NSW, Australia, March 27-31, 2023. ACM, 491–514. https://doi.org/10.1145/3581641.3584037
[64] Advait Sarkar, Carina Negreanu, Ben Zorn, Sruti Srinivasa Ragavan, Christian Pölitz, and Andrew D. Gordon. 2022. What is it like to program with artificial intelligence?. In Proceedings of the 33rd Annual Workshop of the Psychology of Programming Interest Group, (PPIG 2022), The Open University, Milton Keynes, UK & Online, September 5-9, 2022. Psychology of Programming Interest Group, 127–153. https://ppig.org/papers/2022-ppig-33rd-sarkar/
[65] Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. In ACM Conference on International Computing Education Research (ICER 2022), Lugano and Virtual Event, Switzerland, August 7 - 11, 2022, Volume 1. ACM, 27–43. https://doi.org/10.1145/3501385.3543957
[66] Jonathan Sillito, Gail C. Murphy, and Kris De Volder. 2008. Asking and Answering Questions during a Programming Change Task. IEEE Trans. Software Eng. 34, 4 (2008), 434–451. https://doi.org/10.1109/TSE.2008.26
[67] Tabnine. 2024. Tabnine: AI assistant for software developers. https://www.tabnine.com/ Retrieved: 2024-01-11.
[68] Kyle Thayer, Sarah E. Chasins, and Amy J. Ko. 2021. A Theory of Robust API Knowledge. ACM Transactions on Computing Education 21, 1 (2021), 1–32. https://doi.org/10.1145/3444945
[69] Haoye Tian, Weiqi Lu, Tsz On Li, Xunzhu Tang, Shing-Chi Cheung, Jacques Klein, and Tegawendé F. Bissyandé. 2023. Is ChatGPT the Ultimate Programming Assistant – How far is it? arXiv:2304.11938 Retrieved from https://arxiv.org/abs/2304.11938.
[70] Gias Uddin and Martin P. Robillard. 2015. How API Documentation Fails. IEEE Softw. 32, 4 (2015), 68–75. https://doi.org/10.1109/MS.2014.80
[71] Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman. 2022. Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models. In CHI Conference on Human Factors in Computing Systems (CHI 2022), New Orleans, LA, USA, 29 April 2022 - 5 May 2022, Extended Abstracts. ACM, 332:1–332:7. https://doi.org/10.1145/3491101.3519665
[72] Shaowei Wang, Tse-Hsun Chen, and Ahmed E. Hassan. 2020. How Do Users Revise Answers on Technical Q&A Websites? A Case Study on Stack Overflow. IEEE Trans. Software Eng. 46, 9 (2020), 1024–1038. https://doi.org/10.1109/TSE.2018.2874470
[73] Yuhao Wu, Shaowei Wang, Cor-Paul Bezemer, and Katsuro Inoue. 2019. How do developers utilize source code from stack overflow? Empir. Softw. Eng. 24, 2 (2019), 637–673. https://doi.org/10.1007/S10664-018-9634-5
[74] Xin Xia, Lingfeng Bao, David Lo, Pavneet Singh Kochhar, Ahmed E. Hassan, and Zhenchang Xing. 2017. What do developers search for on the web? Empir. Softw. Eng. 22, 6 (2017), 3149–3185. https://doi.org/10.1007/S10664-017-9514-4
[75] Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In 6th ACM SIGPLAN International Symposium on Machine Programming (MAPS@PLDI 2022), San Diego, CA, USA, 13 June 2022. ACM, 1–10. https://doi.org/10.1145/3520312.3534862
[76] J. D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI 2023), Hamburg, Germany, April 23-28, 2023. ACM, 437:1–437:21. https://doi.org/10.1145/3544548.3581388
[77] Tianyi Zhang, Björn Hartmann, Miryung Kim, and Elena L. Glassman. 2020. Enabling Data-Driven API Design with Community Usage Data: A Need-Finding Study. In CHI Conference on Human Factors in Computing Systems (CHI 2020), Honolulu, HI, USA, April 25-30, 2020. ACM, 1–13. https://doi.org/10.1145/3313831.3376382
[78] Tianyi Zhang, Ganesha Upadhyaya, Anastasia Reinhardt, Hridesh Rajan, and Miryung Kim. 2018. Are code examples on an online Q&A forum reliable?: a study of API misuse on stack overflow. In Proceedings of the 40th International Conference on Software Engineering (ICSE 2018), Gothenburg, Sweden, May 27 - June 03, 2018. ACM, 886–896. https://doi.org/10.1145/3180155.3180260
[79] Albert Ziegler, Eirini Kalliamvakou, X. Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2022. Productivity assessment of neural code completion. In 6th ACM SIGPLAN International Symposium on Machine Programming (MAPS@PLDI 2022), San Diego, CA, USA, 13 June 2022. ACM, 21–29. https://doi.org/10.1145/3520312.3534864