
"This Sounds Unclear": Evaluating ChatGPT Capability in Translating End-User Prompts into Ready-to-Deploy Python Code

Margherita Andrao — Università di Trento and Fondazione Bruno Kessler, Trento, Italy
Diego Morra — Politecnico di Milano, Milan, Italy
Teresa Paccosi — Università di Trento and Fondazione Bruno Kessler, Trento, Italy
Maristella Matera — Politecnico di Milano, Milan, Italy
Barbara Treccani — Università di Trento, Trento, Italy
Massimo Zancanaro — Università di Trento and Fondazione Bruno Kessler, Trento, Italy

ABSTRACT
In this paper, we present a study aimed at evaluating how ChatGPT-4 understands end-users' natural language instructions to express automation rules for smart home applications and how it translates them into Python code ready to be deployed. Our study used 34 natural language instructions written by end users who were asked to automate scenarios presented as visual animations. The results show that ChatGPT-4 can produce coherent and effective code even if the instructions present ambiguities or unclear elements, understanding natural language instructions and autonomously resolving 94% of them. However, the generated code still contains numerous ambiguities that could potentially affect safety and security aspects. Nevertheless, when appropriately prompted, ChatGPT-4 can subsequently identify those ambiguities. This prompts a discussion about prospective interaction paradigms that may significantly improve the immediate usability of the generated code.

CCS CONCEPTS
• Human-centered computing → Human computer interaction (HCI); HCI design and evaluation methods; User studies;

KEYWORDS
End-user development (EUD), Large language models (LLMs), ChatGPT-4, Task-automation systems

ACM Reference Format:
Margherita Andrao, Diego Morra, Teresa Paccosi, Maristella Matera, Barbara Treccani, and Massimo Zancanaro. 2024. "This Sounds Unclear": Evaluating ChatGPT Capability in Translating End-User Prompts into Ready-to-Deploy Python Code. In International Conference on Advanced Visual Interfaces 2024 (AVI 2024), June 03–07, 2024, Arenzano, Genoa, Italy. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3656650.3656693

1 INTRODUCTION
AI-assisted code generation capabilities offered by Large Language Models (LLMs) are paving the way for new possibilities in the future of software development. While platforms such as Stack Overflow have previously offered entry-level dedicated support to those with at least some programming knowledge, the capability of models such as ChatGPT-4, CoPilot, and other specialized LLMs to generate code from natural language prompts is becoming increasingly pervasive among both beginner and expert programmers [11]. However, the existing literature still assumes that prompts are produced by experts in software development who can clearly and unambiguously articulate the requirements. Nevertheless, End-User Development (EUD), especially in the field of home automation, is one of the domains where this advancement has the potential to establish new standards. While the availability of user-friendly interfaces for commercial microcontrollers and sensors for the domotic Internet of Things (IoT) increases yearly, specific programming skills are still required to orchestrate and customize their operations. Research in EUD has proposed several approaches to facilitate naive users in defining those operations themselves, even without the need to acquire technical skills. Using trigger-action rules has proven an effective approach [7]. Enabling the creation of ready-to-deploy trigger-action rules from user-generated unconstrained natural language (NL) may significantly democratize the creation of home automation systems. However, there is still a need to explore the capability of LLMs to interpret instructions from naive users and generate ready-to-use code. The ability to interpret incorrect or ambiguous prompts that may not adhere to the structured format typical of trigger-action rules is crucial. This is especially important considering potential applications that aim to bridge non-expert user needs with rule-based systems.

This paper contributes to the ongoing discussion by presenting a study that explores how ChatGPT-4 understands ambiguous requests provided by end users and how it identifies and corrects ambiguities in the generated code.
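A trigger-action rule of the kind referred to above combines a state, an event, and an action. As a minimal illustration (a hypothetical sketch, not code from the study or from any cited system; the function and value names are our own), such a rule might look like:

```python
# Hypothetical sketch of a trigger-action rule: state + event -> action.
# Names and the return values are illustrative assumptions, not from the paper.

def trigger_action_rule(is_daytime: bool, temperature: float,
                        previous_temperature: float) -> str:
    """Evaluate one state/event/action rule and return the action to take."""
    state_holds = is_daytime                           # state: "it's daytime"
    event_fires = temperature > previous_temperature   # event: "the temperature rises"
    if state_holds and event_fires:
        return "open_windows"                          # action: "open the windows"
    return "no_action"
```

Real end-user rules are written in unconstrained natural language rather than in this structured form, which is precisely the translation gap the study examines.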

This work is licensed under a Creative Commons Attribution-NoDerivs International 4.0 License.
AVI 2024, June 03–07, 2024, Arenzano, Genoa, Italy
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-1764-2/24/06
https://doi.org/10.1145/3656650.3656693

2 RELATED WORKS
EUD investigates how naive users and non-professional developers can be enabled to create, modify, or extend software systems using a range of methods, techniques, and tools [3, 8]. Cloud-based

platforms, such as IFTTT (https://ifttt.com/), assist users in the definition of trigger-action rules for task-automation systems [5]. Research has explored the effectiveness of composition paradigms, including innovative visual paradigms and conversational-based approaches, broadening the scope of user interaction with smart-home technologies [2, 6, 10]. Recent works have turned attention to the influence of LLMs in this domain, emphasizing that interacting with ChatGPT can be challenging for end users who are not expert programmers due to NL ambiguities that hinder code generation [10, 13].

The quality and reliability of code generated by LLMs are increasing [9, 14, 15]. Some works highlighted ChatGPT's outperforming ability in generating and solving code problems in comparison to other models [1, 12]. Less has been done to assess how the interpretation of user prompts affects the output of reliable and functional code ready for deployment [13]. Further investigation is required to explore the reliability, cleanliness, and security of the code generated from inadequate or incomplete prompts. Especially in the EUD of IoT systems, where non-expert users are tasked with programming systems, security is crucial and makes these considerations particularly relevant [4].

3 THE STUDY
Our study aimed to explore the capability of ChatGPT-4 (the latest version available at the time of the study, in November 2023) to accurately generate correct code from trigger-action rules for home automation described by users using NL. Additionally, we aimed to examine how ChatGPT-4 assists users in recognizing ambiguities and detecting potential errors to create clearer, more accurate, and safer trigger-action rules.

Participants. Sixteen (16) participants, eight females and eight males, aged between 24 and 60 (M = 32.81; SD = 11.44), were involved in the study. Two had no prior experience with programming languages or home automation tools; six had minimal experience with smart home environments but no programming experience. The remaining eight (M = 4; F = 4) were expert programmers. Five of them had experience with smart home environments. All participants were native Italian speakers, and the instructions were produced in Italian.

Methods and procedure. Participants were exposed to 12 scenarios of smart home automation presented as silent videos (to avoid language bias), and they were asked to write the rules to implement that automation. Each video lasted around 15 seconds and represented a combination of one state (e.g., "it's daytime"), one event (e.g., "the temperature rises"), and one action (e.g., "open the windows"). The initial segment of each video portrayed the smart home's response when the state is false (e.g., it's night, the temperature rises, and no action occurs), followed by the representation when the state is true. Each session was individual, and each participant had to write the instructions in natural language to explain to an "intelligent system" how to implement that automation.

Data analysis. In total, 199 instructions were collected (some participants wrote multiple independent/alternative rules to define the same scenario). Two researchers independently coded the 199 instructions as: (i) non-ambiguous if the instruction provided clear references to conditions and actions, or ambiguous if the instruction contained unclear structure, terminology, or details that cast doubt on the outcome due to subjective interpretation; and as (ii) complete if the instruction included all the elements (state, event, action) presented in the scenario, or incomplete if one or more elements were left implicit.

Only the 34 instructions evaluated as ambiguous and complete by both researchers were considered for further analyses. For each rule, we provided ChatGPT-4 with the instructions and the list of sensors and smart devices involved in the scenario (the context), and "questioned" it asking to: (i) generate Python code ready to deploy (the prompt was: "Given the following rule and the following context, make a Python code ready to be deployed. Just output the code without any further comments."); and (ii) identify errors and ambiguities in the rule written by the user (the prompt was: "Given the following rule and the context, identify if there are possible ambiguities or errors in the way the instructions were written.").

Results. The 34 Python code snippets generated by ChatGPT-4 were analyzed by two independent researchers to evaluate their correctness, resulting in an accuracy rate of 94%. Only two of the generated snippets were deemed incorrect. The first one lacked a condition explicitly stated in the prompt, while the second one overlooked an adverb that establishes the need for a time interval for the rule to be correctly executed. When prompted to identify ambiguities in the 34 natural language instructions, ChatGPT-4 detected 237 ambiguities in total, from a minimum of 3 to a maximum of 10 per instruction (M = 6.97, SD = 1.45). Two researchers independently coded the descriptions of the ambiguities and then examined if and how ChatGPT-4 had autonomously solved them in the respective ready-to-deploy Python code snippet. Four main themes emerged from the ambiguities' descriptions.

Theme 1: Ambiguities related to the outcome of the rule and its variables (83 instances). Eleven (11) ambiguities involved uncertainties regarding the end of the action ("The rule specifies 'after 7 PM' but does not indicate until when this rule applies."). Nine (9) involved ambiguities related to temporal aspects, as the rule did not clearly state the sequence/simultaneity of occurrences needed to trigger the action, and two (2) involved suggestions for additional sensors or devices that, if integrated, could ensure the needed outcome. Another 39 involved possible ambiguities in error handling ("The rule does not account for what should happen if the door fails to open or close."), and specifically in handling possible or imaginary conflicts, false negative or false positive triggers, or possible safety hazards. Finally, 22 described possible undesirable outcomes (e.g., automatically turning on a fireplace based on temperature and motion) associated with potentially harmful consequences for safety, security risks, damages, and high/unusual energy consumption. Looking at the Python code snippets, 13 of these ambiguities were fully resolved by ChatGPT-4, four (4) were partially solved, and 66 were not addressed in the generated code.

Theme 2: Ambiguities related to the language (55 instances). Twenty-one (21) ambiguities involved word usage in formulating instructions ("The rule mentions activating an evacuation plan but does not provide details on what the plan entails."), with some words being overly specific, others being too generic or unconventional. Twenty-four (24) were related to the use of generic expressions, employing words instead of defining specific values ("The rule does not specify what constitutes 'day.' Is it based on specific hours?").
In only one (1) instance, the ambiguity was related to an ambiguous typo in a rule. In three (3) cases, the ambiguity was related to the use of conjunctions, particularly the expression "and/or". In three (3), the ambiguity was related to a rule composed of multiple sentences, while in another three (3) it was related to the language used in general ("If the smart home system's programming interface is in English or another language, the rule should be translated and formatted according to the system's requirements."). In this group, 31 ambiguities were fully resolved in the code, three (3) were only partially resolved, and 21 had not been addressed.

Theme 3: Ambiguities related to the necessity of content extension (55 instances). This included: 30 cases of suggestions on specifying alternative states/conditions to avoid ambiguities; nine (9) cases of determining when/how often to monitor the state; six (6) cases of providing details of the action ("The rule does not specify how much the curtains should close."); and 10 cases concerning the need for confirmation/alert feedback ("The rule does not include any provisions for alerting the occupants of the home that the window will be closed. This could be important for awareness and safety."). Of these ambiguities, 14 were fully resolved in the generated code, four (4) partially, while 37 were not addressed.

Theme 4: Ambiguities related to the system's functioning (44 instances). This related to: 27 suggestions/comments on sensor localization and integration within a broader system, and 17 on the accuracy/sensitivity of sensors that can lead to false positive or false negative triggers ("It mentions a movement sensor, but movement sensors can sometimes give false positives or false negatives."). Of these instances, 19 ambiguities were fully resolved, one (1) was partially resolved, and 24 were not addressed.

Overall, ChatGPT-4 was able to fully resolve 28% of the ambiguities above and to partially resolve 5% of them, while 67% were not addressed (see Fig. 1). The analysis reveals ChatGPT-4's proficiency in resolving ambiguities related to the second theme (language), successfully addressing 31 out of 55 cases. This is frequently attributable to its skill in interpreting incorrect verbal aspects or unconventional temporal and grammatical structures. However, ChatGPT-4 shows less expertise in dealing with ambiguities associated with the first theme (outcome), as it replicated the same ambiguities found in the prompt in approximately 66 out of 80 cases. Notably, ChatGPT-4 tends to overlook two specific categories: error handling and ambiguities that might lead to unintended effects. Additionally, there were instances in each theme where ChatGPT-4 only partially resolved ambiguities. For example, in a case involving feedback to house occupants, the code included debugging-level warning messages but not direct messages to the occupants. Another case involved the figurative interpretation of values, like defining "darkness" in a sentence. ChatGPT-4 identified the ambiguity but inadequately addressed it in the generated code: the code incorporated logic to read data from the light sensor as a trigger, but it did not give the user a structure to set a threshold for the sensor data, offering only boolean true/false options for action triggering.

Figure 1: Distribution of solved ambiguities by theme (above) and identification code (below).

4 DISCUSSION AND CONCLUSION
The results provide evidence of the potential of LLMs, such as ChatGPT-4, in understanding NL instructions and translating them into functional code. Despite the ambiguities in the formulation, as well as the simplicity of the prompts and the context provided, ChatGPT-4 was able to consistently generate syntactically accurate and complete Python code. Furthermore, ChatGPT-4 could also detect many ambiguities in the users' expressions. Nevertheless, if not directly prompted, it fails to properly recognize 67% of the ambiguities. Although these failures do not directly impact the generated code's correctness, they can lead to undesired effects or security issues. In our scenario, these aspects are even more critical, since our users would not be able to inspect the Python code and spot the issues. From our results, we can argue that an effective interaction for novice users should not rely on direct code generation; rather, it should involve ChatGPT-4 in initially identifying ambiguities, followed by iterative resolution processes that employ negotiation to ensure the accurate generation of code. We acknowledge some limitations in our study, starting from the limited number of participants, which can affect the generalizability of the emerged themes. Additionally, ChatGPT-4's ability to address ambiguities could change with future model releases, emphasizing the importance of continuing research in this area. Nevertheless, we believe that this first study on automatically generated code from end users' natural language requests may help future research, such as on the design of effective interaction paradigms that facilitate the negotiation process to mitigate the ambiguities. This will require an extended research-through-design approach that addresses the multiple aspects emerging from the study outlined in this paper.

ACKNOWLEDGMENTS
This research received partial support from the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU.
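The "darkness" case discussed in the analysis can be made concrete with a minimal sketch (hypothetical code, not the study's actual ChatGPT-4 output; the names and the default threshold are illustrative assumptions): the first function mirrors boolean-only trigger logic that replicates the ambiguity, while the second exposes the user-settable threshold that the generated code lacked.

```python
# Hypothetical contrast for the "darkness" ambiguity. Sensor names and the
# default lux threshold are illustrative, not taken from the study's code.

def is_dark_boolean(sensor_is_dark: bool) -> bool:
    """Ambiguity replicated: trusts an opaque true/false sensor reading."""
    return sensor_is_dark

def is_dark_threshold(light_level_lux: float, threshold_lux: float = 10.0) -> bool:
    """Ambiguity surfaced: the user can decide what 'darkness' means."""
    return light_level_lux < threshold_lux
```

The second form is one way the negotiation-based interaction argued for above could materialize: the identified ambiguity becomes an explicit parameter the end user is asked to set.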

REFERENCES
[1] Imtiaz Ahmed, Ayon Roy, Mashrafi Kajol, Uzma Hasan, Partha Protim Datta, and Md Rokonuzzaman Reza. 2023. ChatGPT vs. Bard: a comparative study. Authorea Preprints (2023).
[2] Margherita Andrao, Fabrizio Balducci, Bernardo Breve, Federica Cena, Giuseppe Desolda, Vincenzo Deufemia, Cristina Gena, Maristella Matera, Andrea Mattioli, Fabio Paternò, et al. 2023. Understanding Concepts, Methods and Tools for End-User Control of Automations in Ecosystems of Smart Objects and Services. In International Symposium on End User Development. Springer, 104–124.
[3] Carmelo Ardito, Maria F. Costabile, Giuseppe Desolda, Marco Manca, Maristella Matera, Fabio Paternò, and Carmen Santoro. 2019. Improving Tools that Allow End Users to Configure Smart Environments. In End-User Development, Alessio Malizia, Stefano Valtolina, Anders Morch, Alan Serrano, and Andrew Stratton (Eds.). Springer International Publishing, Cham, 244–248.
[4] Bernardo Breve, Giuseppe Desolda, Francesco Greco, and Vincenzo Deufemia. 2023. Democratizing Cybersecurity in Smart Environments: Investigating the Mental Models of Novices and Experts. In International Symposium on End User Development. Springer, 145–161.
[5] Miguel Coronado and Carlos A. Iglesias. 2016. Task Automation Services: Automation for the Masses. IEEE Internet Computing 20, 1 (2016), 52–58. https://doi.org/10.1109/MIC.2015.73
[6] Giuseppe Desolda, Carmelo Ardito, and Maristella Matera. 2017. Empowering end users to customize their smart environments: model, composition paradigms, and domain-specific tools. ACM Transactions on Computer-Human Interaction (TOCHI) 24, 2 (2017), 1–52.
[7] Giuseppe Ghiani, Marco Manca, Fabio Paternò, and Carmen Santoro. 2017. Personalization of context-dependent applications through trigger-action rules. ACM Transactions on Computer-Human Interaction (TOCHI) 24, 2 (2017), 1–33.
[8] Henry Lieberman, Fabio Paternò, Markus Klann, and Volker Wulf. 2006. End-User Development: An Emerging Paradigm. Springer Netherlands, Dordrecht, 1–8. https://doi.org/10.1007/1-4020-5386-X_1
[9] Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. 2023. Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues. arXiv:2307.12596 [cs.SE]
[10] Alberto Monge Roffarello and Luigi De Russis. 2023. Defining Trigger-Action Rules via Voice: A Novel Approach for End-User Development in the IoT. In International Symposium on End User Development. Springer, 65–83.
[11] Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. 2023. LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation. arXiv preprint arXiv:2308.02828 (2023).
[12] Stephen R. Piccolo, Paul Denny, Andrew Luxton-Reilly, Samuel H. Payne, and Perry G. Ridge. 2023. Evaluating a large language model's ability to solve programming exercises from an introductory bioinformatics course. PLOS Computational Biology 19, 9 (09 2023), 1–16. https://doi.org/10.1371/journal.pcbi.1011511
[13] Gian Luca Scoccia. 2023. Exploring Early Adopters' Perceptions of ChatGPT as a Code Generation Tool. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW). 88–93. https://doi.org/10.1109/ASEW60602.2023.00016
[14] Burak Yetiştiren, Işık Özsoy, Miray Ayerdem, and Eray Tüzün. 2023. Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. arXiv:2304.10778 [cs.SE]
[15] Li Zhong and Zilong Wang. 2023. Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation. arXiv:2308.10335 [cs.CL]
