"This Sounds Unclear": Evaluating ChatGPT Capability in Translating End-User Prompts into Ready-to-Deploy Python Code
ABSTRACT
In this paper, we present a study aimed at evaluating how ChatGPT-4 understands end-users' natural language instructions to express automation rules for smart home applications and how it translates them into Python code ready to be deployed. Our study used 34 natural language instructions written by end users who were asked to automate scenarios presented as visual animations. The results show that ChatGPT-4 can produce coherent and effective code even if the instructions present ambiguities or unclear elements, understanding the natural language instructions and autonomously resolving 94% of them. However, the generated code still contains numerous ambiguities that could potentially affect safety and security. Nevertheless, when appropriately prompted, ChatGPT-4 can subsequently identify those ambiguities. This prompts a discussion about prospective interaction paradigms that may significantly improve the immediate usability of the generated code.

CCS CONCEPTS
• Human-centered computing → Human computer interaction (HCI); HCI design and evaluation methods; User studies;

KEYWORDS
End-user development (EUD), Large language models (LLMs), ChatGPT-4, Task-automation systems

ACM Reference Format:
Margherita Andrao, Diego Morra, Teresa Paccosi, Maristella Matera, Barbara Treccani, and Massimo Zancanaro. 2024. "This Sounds Unclear": Evaluating ChatGPT Capability in Translating End-User Prompts into Ready-to-Deploy Python Code. In International Conference on Advanced Visual Interfaces 2024 (AVI 2024), June 03–07, 2024, Arenzano, Genoa, Italy. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3656650.3656693

1 INTRODUCTION
The AI-assisted code-generation capabilities of Large Language Models (LLMs) are paving the way for new possibilities in the future of software development. While platforms such as Stack Overflow have previously offered entry-level dedicated support to those with at least some programming knowledge, the capability of models such as ChatGPT-4, CoPilot, and other specialized LLMs to generate code from natural language prompts is becoming increasingly pervasive among both beginner and expert programmers [11]. However, the existing literature still assumes that prompts are produced by experts in software development who can clearly and unambiguously articulate their requirements. End-User Development (EUD), especially in the field of home automation, is one of the domains where this advancement has the potential to establish new standards. While the availability of user-friendly interfaces for commercial microcontrollers and sensors for the domotic Internet of Things (IoT) increases yearly, specific programming skills are still required to orchestrate and customize their operations. Research in EUD has proposed several approaches that enable naive users to define those operations themselves, without the need to acquire technical skills; using trigger-action rules has proven an effective approach [7]. Enabling the creation of ready-to-deploy trigger-action rules from user-generated, unconstrained natural language (NL) may significantly democratize the creation of home automation systems. However, there is still a need to explore the capability of LLMs to interpret instructions from naive users and generate ready-to-use code. The ability to interpret incorrect or ambiguous prompts that do not adhere to the structured format typical of trigger-action rules is crucial, especially for potential applications that aim to bridge non-expert user needs with rule-based systems.

This paper contributes to the ongoing discussion by presenting a study that explores how ChatGPT-4 understands ambiguous requests provided by end users and how it identifies and corrects ambiguities in the generated code.
2 RELATED WORK
Platforms such as IFTTT¹ assist users in the definition of trigger-action rules for task-automation systems [5]. Research has explored the effectiveness of composition paradigms, including innovative visual paradigms and conversational-based approaches, broadening the scope of user interaction with smart-home technologies [2, 6, 10]. Recent works have turned attention to the influence of LLMs in this domain, emphasizing that interacting with ChatGPT can be challenging for end users who are not expert programmers, due to NL ambiguities that hinder code generation [10, 13].

The quality and reliability of code generated by LLMs are increasing [9, 14, 15]. Some works have highlighted ChatGPT's ability to outperform other models in generating code and solving coding problems [1, 12]. Less has been done to assess how the interpretation of user prompts affects the production of reliable and functional code ready for deployment [13]. Further investigation is required to explore the reliability, cleanliness, and security of code generated from inadequate or incomplete prompts. These considerations are particularly relevant in the EUD of IoT systems, where non-expert users are tasked with programming systems and security is therefore crucial [4].
3 THE STUDY
Our study aimed to explore the capability of ChatGPT-4² to generate correct code from trigger-action rules for home automation described by users in NL. Additionally, we aimed to examine how ChatGPT-4 can assist users in recognizing ambiguities and detecting potential errors, so as to create clearer, more accurate, and safer trigger-action rules.
Participants. Sixteen (16) participants, eight females and eight males, aged between 24 and 60 (M = 32.81; SD = 11.44), were involved in the study. Two had no prior experience with programming languages or home automation tools; six had minimal experience with smart home environments but no programming experience. The remaining eight (M = 4; F = 4) were expert programmers. Five of them had experience with smart home environments. All participants were native Italian speakers, and the instructions were produced in Italian.
Methods and procedure. Participants were exposed to 12 scenarios of smart home automation presented as silent videos (to avoid language bias), and they were asked to write the rules to implement each automation. Each video lasted around 15 seconds and represented a combination of one state (e.g., "it's daytime"), one event (e.g., "the temperature rises"), and one action (e.g., "open the windows"). The initial segment of each video portrayed the smart home's response when the state is false (e.g., it's night, the temperature rises, and no action occurs), followed by the representation when the state is true. Each session was individual, and each participant had to write the instructions in natural language to explain to an "intelligent system" how to implement that automation.
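To make the study's target output concrete, the following is a minimal sketch of the kind of ready-to-deploy trigger-action code at stake, for the example scenario above (state: "it's daytime"; event: "the temperature rises"; action: "open the windows"). It is not code produced in the study: the sensor and actuator functions and the temperature threshold are hypothetical placeholders for a real smart-home API.

import time

TEMP_THRESHOLD = 25.0  # assumed value: the NL rule gives no number

def is_daytime() -> bool:
    # State check; hypothetical stand-in for a light sensor or clock service.
    return 7 <= time.localtime().tm_hour < 19

def read_temperature() -> float:
    # Event source; hypothetical stand-in for a temperature sensor.
    return 22.0

def open_windows() -> None:
    # Action; hypothetical stand-in for a window actuator.
    print("Opening the windows")

last_temp = read_temperature()
while True:
    current = read_temperature()
    # Fire only while the state holds and the event (a rise past the
    # threshold) occurs.
    if is_daytime() and current > last_temp and current >= TEMP_THRESHOLD:
        open_windows()
    last_temp = current
    time.sleep(60)  # poll once per minute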
Data analysis. In total, 199 instructions were collected (some participants wrote multiple independent/alternative rules to define the same scenario). Two researchers independently coded the 199 instructions as (i) non-ambiguous if the instruction provided clear references to conditions and actions, or ambiguous if the instruction contained unclear structure, terminology, or details that cast doubt on the outcome due to subjective interpretation; and as (ii) complete if the instruction included all the elements (state, event, action) presented in the scenario, or incomplete if one or more elements were left implicit.

Only the 34 instructions evaluated as both ambiguous and complete by both researchers were considered for further analyses. For each rule, we provided ChatGPT-4 with the instructions and the list of sensors and smart devices involved in the scenario (the context), and "questioned" it, asking it to: (i) generate Python code ready to deploy (the prompt was: "Given the following rule and the following context, make a Python code ready to be deployed. Just output the code without any further comments."); and (ii) identify errors and ambiguities in the rule written by the user (the prompt was: "Given the following rule and the context, identify if there are possible ambiguities or errors in the way the instructions were written.").
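The paper does not state whether ChatGPT-4 was queried through the web interface or programmatically; purely for illustration, the sketch below shows how the two study prompts could be issued with the OpenAI Python client (openai >= 1.0). The rule and context strings are hypothetical examples, not items from the study corpus.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

rule = "When it's daytime and the temperature rises, open the windows."
context = "Sensors/devices: light sensor, temperature sensor, motorized windows."

CODE_PROMPT = (
    "Given the following rule and the following context, make a Python "
    "code ready to be deployed. Just output the code without any further "
    f"comments.\nRule: {rule}\nContext: {context}"
)
AMBIGUITY_PROMPT = (
    "Given the following rule and the context, identify if there are "
    "possible ambiguities or errors in the way the instructions were "
    f"written.\nRule: {rule}\nContext: {context}"
)

for prompt in (CODE_PROMPT, AMBIGUITY_PROMPT):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)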
Results. The 34 Python code snippets generated by ChatGPT-4 were analyzed by two independent researchers to evaluate their correctness, resulting in an accuracy rate of 94%. Only two of the generated code snippets were deemed incorrect: the first lacked a condition explicitly stated in the prompt, while the second overlooked an adverb establishing the time interval needed for the rule to be correctly executed. When prompted to identify ambiguities in the 34 natural language instructions, ChatGPT-4 detected 237 ambiguities in total, from a minimum of 3 to a maximum of 10 per instruction (M = 6.97, SD = 1.45). Two researchers independently coded the descriptions of the ambiguities and then examined if and how ChatGPT-4 had autonomously resolved them in the respective ready-to-deploy Python code snippets. Four main themes emerged from the descriptions of the ambiguities.

Theme 1: Ambiguities related to the outcome of the rule and its variables (83 instances). Eleven (11) ambiguities involved uncertainty about the end of the action ("The rule specifies 'after 7 PM' but does not indicate until when this rule applies."). Nine (9) involved temporal aspects, as the rule did not clearly state the sequence or simultaneity of the occurrences that trigger the action, and two (2) involved suggestions for additional sensors or devices that, if integrated, could ensure the intended outcome. Another 39 involved possible ambiguities in error handling ("The rule does not account for what should happen if the door fails to open or close."), specifically in handling possible or imagined conflicts, false-negative or false-positive triggers, or potential safety hazards. Finally, 22 described possible undesirable outcomes (e.g., automatically turning on a fireplace based on temperature and motion) associated with safety risks, security risks, damage, and high or unusual energy consumption. Looking at the Python code snippets, 13 of these ambiguities were fully resolved by ChatGPT-4, four (4) were partially resolved, and 66 were not addressed in the generated code.
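The "after 7 PM, but until when?" example shows why such unresolved ambiguities matter: deployable code must pin an end time the user never stated. The sketch below is hypothetical, not study output; the actuator function and the chosen end time are assumptions made purely to illustrate the silent decision.

from datetime import datetime, time as dtime

START = dtime(19, 0)   # "after 7 PM" is explicit in the user's rule
END = dtime(23, 59)    # assumed end time: the rule never states one

def lights_on() -> None:
    # Hypothetical stand-in for the rule's action.
    print("Lights on")

now = datetime.now().time()
# Without END the rule would apply indefinitely; pinning END here is an
# unstated decision that the end user never reviewed.
if START <= now <= END:
    lights_on()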
Theme 2: Ambiguities related to the language (55 instances). Twenty-one (21) ambiguities involved word usage in formulating the instructions ("The rule mentions activating an evacuation plan but does not provide details on what the plan entails."), with some words being overly specific and others too generic or unconventional. Twenty-four (24) were related to the use of generic expressions, employing words instead of specific values ("The rule does not specify what constitutes 'day.' Is it based on specific hours?").

¹ https://ifttt.com/
² The latest available version at the time of the study (November 2023).
"This Sounds Unclear": Evaluating ChatGPT Capability in Translating End-User Prompts AVI 2024, June 03–07, 2024, Arenzano, Genoa, Italy