Original Article

An experimental study of integrating fine-tuned large language models and prompts for enhancing mental health support chatbot system

Hong Qing Yu ORCID logo, Stephen McGuinness

School of Computing, University of Derby, Derby, UK

Contributions: (I) Conception and design: Both authors; (II) Administrative support: S McGuinness; (III) Provision of study materials or patients: HQ Yu; (IV) Collection and assembly of data: Both authors; (V) Data analysis and interpretation: Both authors; (VI) Manuscript writing: Both authors; (VII) Final approval of manuscript: Both authors.

Correspondence to: Hong Qing Yu, PhD. School of Computing, University of Derby, Markeaton St, Derby DE22 3AW, UK. Email: [email protected].

Background: Conversational mental healthcare support plays a crucial role in aiding individuals with mental health concerns. Large language models (LLMs) such as GPT and BERT show potential for enhancing chatbot-based therapy responses. However, there are recognised limitations in deploying these LLMs directly for therapeutic interactions, as they are trained on general-purpose context and knowledge data. The overarching aim of this study is to integrate the capabilities of such LLMs with methodologies built on specialised mental health datasets, with the goal of enhancing mental health conversations, limiting risk and increasing quality.

Methods: To achieve these aims, we review existing chatbot methodologies, from rule-based systems to advanced approaches based on cognitive behavioural therapy (CBT) principles. The study introduces a method that integrates a fine-tuned DialoGPT model with the real-time capabilities of the ChatGPT 3.5 application programming interface (API). This blended combination aims to leverage the contextual awareness of LLMs and the precision of mental health-focused training. The evaluation involves a case study in which our hybrid model is compared with traditional and standalone LLM-based chatbots. Performance is assessed using metrics such as perplexity and BLEU (Bilingual Evaluation Understudy) scores, along with subjective evaluations from end-users and mental health carers.

Results: Our combined model outperforms the others in conversational quality and relevance for mental healthcare, as evidenced by positive feedback from patients and mental healthcare professionals. However, important limitations highlight the need for further development of next-generation mental health support systems. Addressing these challenges is crucial for the practical application and effectiveness of such technologies.

Conclusions: With the rise of digital mental health tools, integrating models such as LLMs transforms conversational support. The study presents a promising approach combining state-of-the-art LLMs with domain-specific fine-tuned model principles. Results suggest our combined model offers affordable and better everyday support, validated by positive feedback from patients and professionals. Our research emphasises the potential of LLMs and points towards shaping responsible and effective policies for chatbot deployment in mental healthcare. These findings will contribute to future mental healthcare chatbot development and policy guidelines, emphasising the need for balanced and effective integration of advanced models and traditional therapeutic principles.

Keywords: Mental health chatbot; large language model (LLM); ChatGPT prompt engineering; artificial intelligence (AI)


Received: 17 October 2023; Accepted: 07 March 2024; Published online: 04 June 2024.

doi: 10.21037/jmai-23-136


Highlight box

Key findings

• Our study presents a pioneering approach to mental health support by integrating outputs from a fine-tuned large language model (LLM) with specialised prompts for ChatGPT 3.5, significantly enhancing chatbot conversation quality in an experimental study supporting mental healthcare services. The combined model outperforms traditional and standalone LLM-based systems in conversational quality and relevance, as reflected in positive feedback from surveys of both patients and mental health professionals. However, many important open issues remain unsolved regarding ethics, risk management and harm prevention.

What is known and what is new?

• LLMs have great potential to support mental healthcare because they can understand complex human language, including emotional nuances. Currently, many research projects have applied LLMs as backend systems for mental health support. However, there are significant limitations, as LLMs are not specifically trained to handle domain-specific conversations, particularly in healthcare settings.

• Our research innovates by using the fine-tuned LLM’s output as part of the prompts and quality guidelines fed into the ChatGPT 3.5 model to produce higher-quality, more relevant conversations. This provides a novel methodology that improves the chatbot’s understanding and response quality in mental health conversations.

What is the implication, and what should change now?

• The implications of our findings are twofold: they not only highlight the potential of LLMs in mental healthcare but also underscore the necessity for further development in chatbot technologies for next-generation mental health support systems. This research advocates for a balanced integration of advanced models with traditional therapeutic principles, urging the adoption of responsible and effective policies for chatbot deployment in mental healthcare.


Introduction

In today’s world, where digital technology has become a pervasive aspect of daily life, mental health has become a growing concern; examples include depression, anxiety, addictions and nutrition-related disorders (1). In 2019, it was estimated that one in eight individuals globally, approximately 970 million people, were living with a form of mental health disorder. The onset of the COVID-19 pandemic in 2020 boosted these figures by 26%, highlighting the need for effective mental health care interventions (2). Concurrently, there has been a rise in artificial intelligence (AI)-based mental healthcare solutions, such as chatbots (3).

Advancements in AI technology, specifically the development of large language models (LLMs) such as those behind the ChatGPT framework, present the potential to revolutionise mental health support services. However, a challenge arises because these models are trained on general-purpose knowledge and lack domain-specific expertise.

This paper explores the development and evaluation of chatbots for mental health support that harness LLM techniques. The chatbot aims to provide accessible, empathetic support and to offer advice on coping strategies, along with resources to alleviate feelings of loneliness and anxiety. The overarching challenge is to ensure that the chatbot’s content adheres to clinical standards and is free from harmful data.

The research proposes the following hypothesis: by enhancing an LLM with data from real therapeutic conversations, we can transform it into a more effective and reliable mental health support chatbot. The enhancement involves equipping the chatbot with a ‘contextual response filter’, which helps it understand and respond to its users’ emotional needs in a way informed by real-world therapy interactions. This integration of expert knowledge into the general-purpose LLM is the critical component of the research.

The research methodology includes refining the LLM using a dataset of 5,000 unstructured text conversations drawn from accessible sources. The process involves applying natural language processing (NLP) techniques and optimising the model’s size and parameters, culminating in a specialised dialogue knowledge model that integrates seamlessly with the ChatGPT 3.5 application programming interface (API).

The paper is structured as follows: section “Conversational-based therapy and related work” reviews contemporary advancements in chatbot development techniques and the landscape of various LLMs. Section “Research methodology” details the steps taken in developing and evaluating the mental health support chatbot. Section “Chatbot systematic and human evaluations” presents the findings from the evaluation, while section “Discussions for future health policy of using LLM-based chatbot systems” explores the potential application of this research. Lastly, section “Conclusions and future work” concludes with our reflections and perspectives on the study.

Conversational-based therapy and related work

Conversational therapy

Cognitive behavioural therapy (CBT) (4) is a form of psychotherapy that focuses on the relationship between thoughts, feelings, and behaviours. It is a goal-oriented and practical approach that aims to help individuals understand how their thoughts and beliefs influence their emotions and actions and how to develop more adaptive and healthier ways of thinking and behaving.

CBT is based on the premise that our thoughts, emotions, and behaviours are interconnected, and by changing our thoughts, we can bring about positive changes in our feelings and behaviours. The therapy typically involves identifying and challenging negative or unhelpful thoughts and beliefs and replacing them with more realistic and positive ones. It also emphasises the importance of taking action and engaging in behaviours that promote well-being and improve one’s quality of life.

In addition to CBT, several other therapeutic approaches share similarities or have been influenced by CBT principles. Here are a few examples:

  • Rational emotive behaviour therapy (REBT) (5): developed by Albert Ellis, REBT is similar to CBT and focuses on identifying and challenging irrational beliefs and replacing them with rational and constructive ones.
  • Dialectical behaviour therapy (DBT) (5): originally developed to treat borderline personality disorder, DBT combines elements of CBT with mindfulness techniques. It emphasises acceptance and validation while also encouraging change and skills development.
  • Acceptance and commitment therapy (ACT) (6): ACT focuses on helping individuals accept their thoughts and emotions while committing to actions that align with their values. It utilises mindfulness and acceptance techniques to promote psychological flexibility.
  • Mindfulness-based cognitive therapy (MBCT) (7): integrating CBT principles with mindfulness practices, MBCT helps individuals become more aware of their thoughts and emotions and develop a non-judgmental and accepting attitude towards them.
  • Schema therapy (8): this approach extends beyond the scope of traditional CBT by addressing long-standing patterns or “schema” that develop early in life. It helps individuals identify and modify deeply ingrained negative beliefs and behavioural patterns.

These are just a few examples, and many other therapeutic approaches draw on CBT principles or share similar goals. The choice of therapy depends on the individual’s specific needs and preferences and the therapist’s expertise.

Traditional chatbot approaches

The traditional chatbot approaches are rule- and CBT-based. Rule-based chatbots are conversational agents that follow predefined criteria for interacting with user queries. These chatbots understand and respond to a limited, particular range of user commands. They are often employed in situations requiring basic, repetitive tasks and provide prompt, accurate responses to straightforward queries. Eliza, published by Joseph Weizenbaum in the 1960s, is one of the earliest examples of a rule-based chatbot (9). It used simple pattern-matching techniques to simulate a conversation with a psychotherapist, specifically in the style of Rogerian psychotherapy. Although rudimentary by today’s standards, Eliza demonstrated the potential of chatbots in mental health support. However, such chatbots may deliver unexpected or incorrect answers when confronted with more complex questions. Despite this limitation, rule-based chatbots can be advantageous in scenarios where rapid and precise responses are necessary, such as booking flights, reserving movie tickets, or modifying appointment dates (10).

While rule-based chatbots are less common for mental health support due to their inherent limitations in handling complex emotions and conversations, several examples that combine rule-based approaches with CBT provide a useful level of scale and support and are worth mentioning:

  • MoodGym: MoodGym is a web-based program that uses a rule-based approach to deliver CBT for users experiencing depression and anxiety (11). It offers interactive modules and exercises based on predefined rules to help users learn about their thoughts, emotions, and behaviours. MoodGym is a well-known web-based intervention developed by researchers at the Australian National University.
  • Woebot: Woebot uses a combination of NLP, decision trees and machine learning algorithms to generate responses. It provides CBT techniques, mood tracking, and psycho-educational content through daily check-ins and interactive conversations. Studies have shown that Woebot can help reduce symptoms of depression and anxiety (12).
  • Wysa: Wysa combines NLP techniques with AI algorithms to create an empathetic and responsive chatbot. Wysa offers support for anxiety, stress and sleep problems through interactive conversations, CBT-based techniques, mindfulness exercises and mental health resources. A study found that Wysa significantly improved well-being and reduced anxiety (13).
  • Tess: Tess uses AI algorithms and NLP techniques to simulate empathetic conversations with users. Tess offers emotional support, coping strategies, and psychological interventions through conversational interactions. Research indicates that Tess can help reduce symptoms of depression and anxiety (14).

While these examples demonstrate the use of rule-based chatbots and programs for mental health support, such traditional approaches have limitations, including the struggle to understand complex emotions in context. It is important to note that more advanced AI-driven approaches are becoming increasingly popular, owing to their ability to respond coherently to an individual’s emotional needs and handle complex, unstructured conversations.

LLMs in chatbots

LLMs have emerged as a transformative technology in NLP. They utilise deep learning techniques to process and generate humanlike language on a large scale, leading to unprecedented advances in various NLP tasks. LLMs have played a vital role in developing chatbots, virtual assistants, machine translation and other NLP applications.

Two of the most prominent LLMs are GPT-3 and BERT, which have demonstrated remarkable performance in various NLP tasks. GPT-3 is among the largest and most powerful LLMs, containing over 175 billion parameters (15). It has shown impressive coherence and fluency in generating humanlike text. BERT, on the other hand, is a transformer-based LLM that has succeeded in tasks such as question-answering and text classification (16). It is trained using a masked language modelling task, which allows it to use the context of surrounding words and generate more accurate responses. LLMs have enabled significant advances in NLP, and their development continues to open up new avenues for research and innovation.

LLMs have significantly impacted the development of chatbots and conversational agents, improving performance in NLP tasks. These models enable chatbots to understand the context of a conversation better and generate more accurate and humanlike responses, making them an attractive choice for chatbot development.

The latest LLM-based chatbot, ChatGPT 3.5, is a state-of-the-art, openly accessible chatbot that can communicate with humans and provide general information across domains. While ChatGPT 3.5 can provide some information that may assist mental health patients, it is not explicitly designed to support mental healthcare. GPT’s primary function is to generate human-like text based on user prompts and questions; it is not intended for any specific purpose, including mental healthcare.

However, there are some significant gaps in applying LLM-based technology directly to serve mental healthcare purposes:

  • The generated conversational responses are unpredictable, as the model is effectively a black box: what the LLM has previously learned in this domain is unknown.
  • The style in which the conversations are presented may not follow healthcare practice standards.
  • The ability to provide the most appropriate conversational therapy to patients at different levels of need is minimal.

Therefore, to create domain-specific applications and enhance LLMs, research was conducted on existing generative models, specifically GPT-3 and DialoGPT, focusing on automated dialogue generation. The process involved applying transfer learning methods to train the models on therapy and counselling data from sources like Reddit and AlexanderStreet (17). The study then assessed the linguistic quality of these models, discovering that the dialogue generated by DialoGPT, enhanced with transfer learning on video data, achieved scores on par with a human response baseline.

Figure 1 shows an example of a conversation when ChatGPT 3.5 is applied directly, without prompting with generated context. The results are very similar to the responses of traditional CBT-based chatbots. We compare our final outcomes in the evaluation section later.

Figure 1 Examples of ChatGPT conversations.

Building on this research, the novelty of our approach is to use a relatively small dataset (5,000 conversations) of conversational therapeutic practice data to train a smaller DialoGPT model as a knowledge-base model (18). The knowledge-base model then creates contextual knowledge injections for the run-time invocation of the ChatGPT API, tuning its text prediction behaviour to follow the domain-specific knowledge. The significant advantage of this approach is that it is easier and cheaper to implement than fine-tuning the whole LLM through the traditional transfer learning process.


Methods

Research methodology

Our research approach includes five major stages: creating trainable therapy transcription data, data processing, model fine-tuning and optimisation of the processed data, integration with the ChatGPT 3.5 API, and prototype evaluation.

Creating trainable therapy transcription data

Various resources and websites were evaluated to find suitable datasets for model training. First, we assessed the Mental Health frequently asked questions (FAQ) for Chatbot, a publicly available dataset on the Kaggle platform (19). However, this dataset only allowed responses to a fixed set of general questions about mental health and did not provide real support or guidance for personal treatment, making it unsuitable.

The second option was a dataset used in an existing project that could detect positive cases of depression based on the user’s words. However, this training data was structured and labelled for classification rather than for conversation generation, so we did not consider this option either.

As we could not find a suitable dataset, we created one tailored to our research context. We used real-world therapy transcript documents from websites and converted the HTML conversation texts between patients and therapists into a feature format for processing (see the data example in Figure 2). We generated around 7,000 lines of therapy conversations, resulting in a final document of approximately 350 kB (18).

Figure 2 An example of the transcripts between chatbot and patient.
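To illustrate this step, the sketch below shows one way the HTML transcript pages could be converted into speaker/utterance pairs. It is a minimal sketch only: the paragraph tags, the THERAPIST/PATIENT line prefixes and the CSV output format are assumptions about the source pages, not the exact pipeline used in the study.

```python
import csv
from bs4 import BeautifulSoup

def extract_turns(html_text):
    """Yield (speaker, utterance) pairs from one transcript page."""
    soup = BeautifulSoup(html_text, "html.parser")
    for para in soup.find_all("p"):
        line = para.get_text(" ", strip=True)
        # Assume lines look like "THERAPIST: ..." or "PATIENT: ..."
        if ":" in line:
            speaker, utterance = line.split(":", 1)
            if speaker.strip().upper() in {"THERAPIST", "PATIENT"}:
                yield speaker.strip().title(), utterance.strip()

def save_dataset(pages, out_path="therapy_conversations.csv"):
    """Flatten a list of HTML transcript pages into a two-column CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["speaker", "text"])
        for html_text in pages:
            writer.writerows(extract_turns(html_text))
```

In practice the downloaded transcript pages would be read from disk and passed to save_dataset as a list of HTML strings.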

Data processing

The dataset from the previous steps contains communication data typical of spoken language and includes numerous references to names. During the cleaning process, all name references to individuals were replaced with general pronouns such as ‘him/her’ or ‘you’. This step allowed for better generalisation of the data and increased data uniformity. Additionally, we removed spoken-language idioms where possible. We also eliminated long replies that contained irrelevant information, such as family stories or personal details, to protect privacy and reduce computational complexity in subsequent analysis. These steps were taken to make it easier for the model to find patterns and relationships within the data.
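A minimal sketch of this cleaning step is shown below, assuming named-entity recognition is used to locate personal names; the spaCy model, the replacement pronoun and the length threshold for dropping long replies are illustrative assumptions rather than the exact procedure used in the study.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # small English model with named-entity recognition

MAX_WORDS = 80                       # assumed cut-off for overly long, story-like replies

def clean_utterance(text):
    """Replace detected person names with a generic pronoun."""
    doc = nlp(text)
    cleaned = text
    for ent in reversed(doc.ents):   # reversed so character offsets stay valid
        if ent.label_ == "PERSON":
            cleaned = cleaned[:ent.start_char] + "them" + cleaned[ent.end_char:]
    return cleaned

def keep_utterance(text):
    """Drop replies longer than the assumed threshold."""
    return len(text.split()) <= MAX_WORDS

raw_rows = ["Sarah told me she was proud of me.",            # placeholder utterances
            "My manager John keeps adding to my workload."]
cleaned_rows = [clean_utterance(t) for t in raw_rows if keep_utterance(t)]
```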

Model fine-tuning and optimisation

To begin with, we used the pre-trained DialoGPT model, developed by Microsoft and trained on over 100 GB of colloquial data from various sources. This model is known for its human-like engagement with users, unlike the formal or machine-like tone of standard GPT models (Microsoft Research, 2019). In previous research, DialoGPT has been shown to produce better dialogue models than traditional GPT models, as discussed earlier.

To specialise this technology, we used fine-tuning, adding personalised psychological data to the model. The model’s dialogue logic and resolution were already in place, and we extracted the communication lexicon from our dataset to provide the final layer of word generation. In other words, the model’s logic was established, and we used our dataset to determine the vocabulary used in communication.
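The sketch below outlines how such fine-tuning of DialoGPT can be set up with the Hugging Face Trainer. The epoch and batch-size values echo version 16 in Table 1 below; the model size, sequence length and the placeholder conversation pair are assumptions for illustration, not the study’s exact training script.

```python
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
tokenizer.pad_token = tokenizer.eos_token        # GPT-2 tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

# Placeholder; in practice these come from the cleaned transcript turns.
pairs = [("I have been feeling very anxious lately.",
          "That sounds difficult. Can you tell me more about when it started?")]

class TherapyDialogues(Dataset):
    """Each item is one patient turn followed by the therapist reply."""
    def __init__(self, pairs, max_len=256):
        self.examples = [
            tokenizer(p + tokenizer.eos_token + t + tokenizer.eos_token,
                      truncation=True, max_length=max_len)["input_ids"]
            for p, t in pairs
        ]
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, idx):
        return {"input_ids": self.examples[idx]}

args = TrainingArguments(output_dir="dialogpt-therapy",
                         num_train_epochs=11,               # version 16 in Table 1
                         per_device_train_batch_size=11,
                         save_strategy="epoch")
trainer = Trainer(model=model, args=args,
                  train_dataset=TherapyDialogues(pairs),
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
trainer.save_model("dialogpt-therapy")
```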

To find the most optimised model, we used the perplexity metric to compare varying hyperparameters (see Table 1) during the fine-tuning process. The table shows the different versions and hyperparameters used for fine-tuning DialoGPT.

Table 1

Fine-tuning DialoGPT with varying hyperparameters

Version Rows Epochs Batch size
1 1,600 12 12
2 5,000 8 8
3 5,000 8 10
4 5,000 10 8
5 6,000 8 8
6 6,000 8 10
7 6,000 10 8
8 7,000 8 8
9 7,000 8 10
10 7,000 10 8
11 7,000 9 8
12 7,000 9 9
13 7,000 9 10
14 7,000 10 10
15 7,000 10 11
16 7,000 11 11

Batch size is the number of samples processed per model update. Epoch is a full dataset pass.

We measured the perplexity score for each model, with scores ranging from a high of 1.2338 to a low of 1.1794 (see Figure 3). The perplexity for all models was below 1.3, indicating a high level of accuracy in conversation with humans. Four models (versions 4, 7, 10, and 16) performed notably better than the others, with scores of 1.1808, 1.1807, 1.1796, and 1.1794 (the best). The bottom panel of Figure 3 (values scaled for better visibility) shows that these four models used more training epochs and demonstrate the lowest perplexity.

Figure 3 Summarised results of hyperparameters: including perplexity and epoch size.

Therefore, version 16 was the best-performing model in the test.

Perplexity measures how effectively a probability model predicts a sample. In NLP, perplexity measures how well a language model predicts a text sequence. A lower perplexity score indicates that the model better predicts the next word in a sentence and therefore has higher accuracy. In this case, the perplexity metric was used to measure the performance of different hyperparameter settings during the fine-tuning process.
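As a concrete illustration, the sketch below computes perplexity for a causal language model as the exponential of the average token-level negative log-likelihood on held-out text; the model directory and the held-out sentences are placeholders, not the study’s evaluation set.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, texts):
    """Perplexity over held-out texts: exp of the mean token-level negative log-likelihood."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt")
            out = model(**enc, labels=enc["input_ids"])   # out.loss is mean NLL per token
            n_tokens = enc["input_ids"].numel()
            total_nll += out.loss.item() * n_tokens
            total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)

# Usage with the fine-tuned model directory assumed from the previous step.
tok = AutoTokenizer.from_pretrained("dialogpt-therapy")
mdl = AutoModelForCausalLM.from_pretrained("dialogpt-therapy")
held_out = ["I feel overwhelmed at work and cannot sleep."]  # placeholder sentence
print(perplexity(mdl, tok, held_out))
```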

Knowledge injection with ChatGPT 3.5 prompting engineering

The fine-tuned DialoGPT transformer has demonstrated an ability to generate primary responses during conversations. However, its capability in context comprehension, dynamic content generation, and advanced treatment support remains inferior to the more advanced ChatGPT 3.5 model. To address these limitations, we devised a novel approach that maintains robust control of content and ensures conversation integrity. We integrate the output from the fine-tuned DialoGPT transformer (output1) as a controlled context injection alongside the user’s input, creating a composite prompt for ChatGPT 3.5. Consequently, every prompt processed at run time amalgamates the conversation history, the user’s direct input, and output1, allowing the ChatGPT 3.5 API to generate the final output (designated as output2 in Figure 4). This methodology seeks to surpass the individual performance of both ChatGPT 3.5 (as shown in Figure 1) and the standalone fine-tuned DialoGPT (detailed in Figure 5).

Figure 4 The fine-tuning knowledge injection with ChatGPT prompting engineering process.
Figure 5 Examples without (A) and with (B) GPT 3.5 prompts combinations.
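The sketch below illustrates this run-time knowledge-injection step: output1 is generated by the fine-tuned DialoGPT model and folded, together with the conversation history and the user’s input, into the prompt sent to the ChatGPT 3.5 API, which returns output2. The exact prompt wording, the system message and the model directory are illustrative assumptions, and the OpenAI call follows the legacy (pre-1.0) Python SDK.

```python
import openai
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dialogpt-therapy")   # fine-tuned model directory
model = AutoModelForCausalLM.from_pretrained("dialogpt-therapy")

def dialogpt_reply(user_input):
    """output1: the guidance response from the fine-tuned knowledge-base model."""
    ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors="pt")
    reply_ids = model.generate(ids, max_length=200,
                               pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(reply_ids[0, ids.shape[-1]:], skip_special_tokens=True)

def chatbot_turn(history, user_input):
    """Assemble the composite prompt and obtain output2 from ChatGPT 3.5."""
    output1 = dialogpt_reply(user_input)
    prompt = (f"Conversation so far:\n{history}\n"
              f"User says: {user_input}\n"
              f"A therapy-trained model suggests replying along these lines: {output1}\n"
              f"Write the final supportive response, staying consistent with the suggestion.")
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system",
                   "content": "You are an empathetic mental health support assistant."},
                  {"role": "user", "content": prompt}])
    return resp["choices"][0]["message"]["content"]   # output2
```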

Results

Chatbot systematic and human evaluations

We simulated three scenarios based on the dataset using three distinct approaches, evaluated through perplexity and BLEU (Bilingual Evaluation Understudy) scores. These approaches were solely ChatGPT-based conversations (Approach 1), fine-tuned DialoGPT transformer conversations (Approach 2), and fine-tuned DialoGPT transformer conversations combined with the GPT3 prompts API (Approach 3). Human evaluations relied on two groups: users who need mental health support, and mental healthcare professionals and researchers.

Perplexity evaluations

Table 2 compares the perplexity scores for the three approaches. The fine-tuned DialoGPT transformer + GPT3 prompts API conversations approach has the lowest perplexity score, followed by the fine-tuned DialoGPT transformer conversations approach and the ChatGPT-based conversations approach.

Table 2

Comparison of the perplexity scores on the three approaches

Approach Average Highest Lowest
1 1.48 1.56 1.47
2 1.21 1.24 1.16
3 0.37 0.96 0.33

Approach 1: solely ChatGPT-based conversations. Approach 2: fine-tuned DialoGPT transformer conversations. Approach 3: fine-tuned DialoGPT transformer conversations combined with the GPT3 prompts API. API, application programming interface.

The maximum, minimum, and average perplexity scores for each approach also suggest that the fine-tuned DialoGPT transformer + GPT3 prompts API conversations approach consistently outperforms the other two approaches in terms of perplexity scores.

However, it is important to note that perplexity scores alone do not necessarily indicate the overall quality of a language model. Other metrics such as BLEU, human evaluations, and task-specific evaluations should also be considered when evaluating the performance of a language model.

Introduction to BLEU scores

Before we delve into the BLEU scores of our chatbot model, we need to provide a brief overview of the topic. BLEU scores are, in essence, a method of evaluating machine-generated text against referenced human text: the closer the machine-generated text is to the human reference, the better. The scores are calculated based on matching n-grams, sequences of ‘n’ words that match a pre-existing reference text (20). With this understanding, let us now examine the BLEU scores of our model.
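As a small illustration, the sketch below scores a generated reply against a reference reply using the Hugging Face evaluate implementation of BLEU (reference 20); the example sentences are placeholders, not data from the study.

```python
import evaluate

bleu = evaluate.load("bleu")
predictions = ["It sounds like this week has been very hard for you."]
references = [["It sounds as though this week has been really hard for you."]]
print(bleu.compute(predictions=predictions, references=references))
# Returns a dict with the overall 'bleu' score, n-gram 'precisions' and the brevity penalty.
```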

BLEU score evaluation

Table 3 compares the BLEU scores of the three approaches. The fine-tuned DialoGPT transformer + GPT3 prompts API conversations approach has the highest BLEU score, followed by the fine-tuned DialoGPT transformer conversations approach and the ChatGPT-based conversations approach.

Table 3

Comparison of BLEU scores on three approaches

Approach Average Highest Lowest
1 0.13 0.23 0.07
2 0.32 0.38 0.12
3 0.65 0.83 0.55

Approach 1: solely ChatGPT-based conversations. Approach 2: fine-tuned DialoGPT transformer conversations. Approach 3: fine-tuned DialoGPT transformer conversations combined with the GPT3 prompts API. BLEU, Bilingual Evaluation Understudy; API, application programming interface.

We can conclude that the fine-tuned DialoGPT transformer + GPT3 prompts API conversations approach appears to be the most effective approach based on both perplexity and BLEU scores.

Human evaluations

Standalone application for human evaluation

To support efficient human evaluation, we developed an animation-based standalone application by embedding the chatbot function into a Unity game application. The game simulates a relaxed conversational environment between a patient (the user can select different types of figures) and a psychologist. Evaluators start the game and chat with the psychologist (see Figure 6). The application can be downloaded from the GitHub repository specified in the appendix section. We conducted two types of human evaluation survey using the application: the first with people who need mental health support, and the second with professionals working in the mental healthcare domain.

Figure 6 Standalone application interface.

User evaluation

We asked for volunteers within our institution who believe they are experiencing mental health issues and who work in different roles, including students, lecturers, support staff and administrators. We collected 10 responses to the questionnaire while they tested the application.

To gather crucial user insights and evaluate the chatbot’s performance from a user’s perspective, we conducted surveys among a select group of mental health users and carers. The survey comprised ten questions, focusing on users’ mental health needs, the perceived usefulness of the chatbot, its conversation quality, and potential areas of improvement. Below, we detail the survey’s findings:

  • Frequency of mental health support needs: when asked about the frequency of their mental health challenges and corresponding support needs, respondents’ needs varied. Very few required daily intervention; most needed weekly, monthly or seasonal support, and no one sought only yearly support.
  • Perceived utility of support conversations: every participant (10 of 10) agreed that speaking with someone capable of providing mental support and therapy could be beneficial.
  • Willingness for continued chatbot usage: when questioned about their willingness to engage with the chatbot again after the initial interaction, an overwhelming majority (90%) expressed a positive intent to reuse the service.
  • Rating of conversation quality (human-likeness): participants rated the chatbot’s conversational quality in terms of its human-like language on a scale from 1 to 5. The chatbot received an average score of 4.3, indicating a high degree of satisfaction with the chatbot’s language quality.
  • Rating of conversation quality (supportiveness): when assessing the chatbot’s supportive nature in the conversation, participants gave an average score of 4.2 out of 5, reflecting their positive experience in terms of perceived support.
  • Length of conversations: as for the length of the conversations, most users had less than 16 lines of conversation with the chatbot. A few (three participants) had conversations slightly longer than 16 but fewer than 20 lines. Only one participant had a conversation with over 20 lines.
  • Overall rating of the chatbot application: when asked to provide an overall rating of the chatbot application, participants gave an average score of 4.6 out of 5. Notably, no participant rated the application lower than 4, indicating a high level of user satisfaction.
  • Positive feedback: participants were invited to share any positive feedback about their experience. We received 8 responses to this question; the most important points were that the LLM-based chatbot could always provide useful suggestions and that respondents felt very safe talking about their issues and sadness with someone who is always available and willing to talk.
  • Areas for improvement: we also encouraged users to suggest areas where the chatbot could be improved. The survey participants found the chatbot to be generally helpful, but suggested improvements such as exposing the training data to more diverse circumstances, enhancing the emotional support aspect, avoiding risky responses to sensitive inquiries, reducing repetition of examples, and focusing on more teaching sessions to make the interactions feel less robotic and more like conversing with a human friend.
  • User interface (UI) suggestions: participants were also asked to provide suggestions for improving the chatbot’s UI functionalities to enhance its usefulness and usability. This feedback will inform a further improved version. The suggestions include combined voice and image responses, the ability to track chat history, virtual reality (VR) or mixed reality innovation, and more realistic, human-like speech.

These findings, while indicating user satisfaction with the chatbot, also provide valuable insights for future improvements to ensure the chatbot’s continued efficacy and user-friendly experience. The survey’s qualitative data (questions 8, 9, and 10) will be used to derive further insights and refine the chatbot’s design and functioning.

Researchers and professional carers’ evaluation analysis

We extended our evaluation to individuals working in mental health support roles: researchers and adult carers. This group included five researchers and five adult carers who frequently interact with individuals experiencing mental health issues, allowing us to assess the chatbot from the perspective of professionals in the field. The carers’ survey mirrored the structure of the users’ survey, aiming to gather insights on the chatbot’s perceived value, effectiveness, interaction quality, and improvement areas. Here are the detailed findings:

  • Frequency of interaction with mental health patients: most participants reported that they interact with individuals with mental health challenges daily or monthly, a few worked with such individuals weekly, and two participants conduct mental healthcare research only and rarely interact with patients.
  • Perceived value of chatbots: the majority (over 70%) of participants agreed that chatbots can bring significant value to supporting individuals with mental health issues. Three participants were uncertain, and no one voted to completely disagree.
  • Confidence in the chatbot’s helpful output: when asked about their confidence in the chatbot’s ability to provide helpful responses, the answers were largely positive; 30% of the participants were extremely confident in the chatbot, 40% were confident, and only one participant voted “somewhat not confident”.
  • Rating of conversation quality (human-likeness): the participants rated the chatbot’s conversation quality in terms of its human-like language. They gave it an average score of 4 out of 5. The rating is lower than that given by users who need mental health support, because researchers and professionals are more likely to be cautious about the responses in terms of usefulness and safety.
  • Rating of conversation quality (supportiveness): the chatbot’s supportive nature was also highly rated, receiving an average score of 4.1 out of 5 from the participants.
  • Overall rating of the chatbot application: when asked to provide an overall rating of the chatbot application, the participants gave an average score of 4 out of 5.
  • Identifying risky responses: participants were asked if they encountered any response content that could potentially pose a risk or have a negative impact on the user and, if so, to provide examples for further analysis. Nine out of 10 responded no; one reported a case in which, when the user’s input concerned self-harm, the chatbot did not sufficiently direct the user to nearby human services or telephone helplines. This is valuable feedback: we should consider a location-tracking function that could provide information about local help services and support telephone lines if the user consents to this in the privacy settings. However, the chatbot aims to provide support by talking with the user to help avoid harmful activities. There were also two comments about encountering some repeated or unmeaningful responses.
  • Positive feedback: we invited participants to share their positive feedback about their experience with the chatbot. The survey participants expressed positive views about the developed mental support chatbot application, highlighting its potential impact on the mental health care industry and its ability to engage in natural conversations and provide sensitive and supportive responses. The chatbot’s empathetic approach, well-balanced advice, and 24/7 availability were seen as valuable assets. Users appreciated the chatbot as a tool for bridging gaps in mental health services, maintaining contact with patients, and providing constant, sensitive, and accessible support. The interface was also noted as being friendly. Overall, a supportive chatbot was seen as promising, particularly for individuals without immediate access to professional help or who prefer initial discussions with a non-human entity.
  • Areas for improvement: participants were also encouraged to suggest areas where the chatbot could be improved. The survey respondents provided suggestions for improving the chatbot application. They noted that the conversations still felt like interactions between humans and machines and recommended making the responses more relaxed and concise. Other areas for improvement included updating the chatbot’s database consistently, increasing phrase variety to avoid repetition, understanding different question phrasings, refining natural language processing skills, integrating structured therapy techniques, providing better-detailed answers, conducting research on the efficacy of such bots, clarifying data privacy and informed consent, addressing the bot’s limitations in understanding complex issues, and enhancing its understanding of emotional nuances and subtleties in language to offer more personalised advice. Overall, while acknowledging the chatbot’s current state as good, there was a consensus that there is room for growth and enhancement to better support users, especially those with complex mental health conditions.
  • UI suggestions: lastly, participants provided their suggestions on potential UI improvements that could enhance the usability and usefulness of the chatbot. They recommended incorporating a more human figure with speech functionalities, allowing users to select the preferred human figure. Adding a chat history option was suggested, along with the inclusion of a ‘Help’ or ‘FAQ’ section to provide assistance to users. Offering categories of questions that users can click on and implementing a search function for quick information retrieval were also proposed. An option to schedule ‘check-in’ messages and a ‘quick help’ button to connect with a human professional were suggested as additional safety net features. Providing a feature for users to clarify their statements and including a switch between ‘light’ and ‘dark’ modes to accommodate different environments and reduce eye strain were also mentioned as valuable additions to the UI. These improvements aim to enhance the usability and functionality of the chatbot application.

Discussion

Discussions for future health policy of using LLM-based chatbot systems

As we consider the future of health policies, it is essential to address the evolving role of LLMs within the domain, owing to their inherent potential but also their unique challenges. Below are key points intended to guide the development and implementation of LLM-based chatbots within healthcare settings.

  • Risk assessment and safety measures: healthcare policies need to scrutinize organizations proposing tools using LLMs in domain-specific areas, such as mental health or healthcare in general, owing to the models’ tendency to hallucinate and provide inaccurate information to users. A 2018 national health literacy survey reported that only 11% of the general population strongly believed in their ability to appraise the reliability of healthcare information. This low confidence level indicates a significant risk of individuals accepting potentially misleading or inaccurate responses as factual.

In addition, organizations or governments need to employ strategies to educate the public on the limitations of these technologies and their appropriate usage. Providing the proper resources and training aids individuals in evaluating the information received and making informed decisions.

Moreover, integrating a browsing tool in platforms like ChatGPT allows the LLM to access and utilize up-to-date information, enhancing its potential utility in healthcare contexts. In this role, the LLM can emulate the function of a healthcare informant by providing current advice on various health-related topics. Currently, however, there is little means of automatically updating its knowledge of healthcare and providing up-to-date advice, as the user has to prompt it. Cross-referencing its output with current guidance allows for better reliability.

By combining the suggested measures, individuals and developers may have enough fail-safes to prevent inappropriate outputs and empower the user. However, there also needs to be a focus on improving pre-existing risk-based algorithms. A study conducted in 2023 found that conversational agents specializing in mental health counseling failed to reference crisis resources and failed to halt conversational dialogue if the user’s input reverted to a lower risk level, indicating a lack of sensitivity and adaptability to the dynamic nature of mental health states (21).

  • Balancing support and harm prevention: the dynamic nature of emotions needs to be at the heart of the development of LLMs. Conversational agents must be able to engage with the individual and adapt to their needs, not only through stated guidelines but through the evolution of the chatbot’s character via continual tone analysis. This adaptive approach would involve the chatbot not only understanding and responding to the user’s emotions but also evolving its behaviour and developing a unique personality over time.

As such, a chatbot that perpetually interacts with its users needs to be able to learn from the nuances of each conversation, gradually shaping its responses and interaction style to better align with the user’s preferences and emotional state. This would give the individual a more personalised and intimate experience, reflecting an agent that remembers past interactions and adapts accordingly. This can be achieved, for example, by tracking a user’s emotional state using a transition network that predicts emotions based on past utterances and generates the most appropriate responses, or by applying the principle of valence and arousal, a method where each word is embedded with an affective meaning (21).

By addressing these contemporary issues, individuals may engage further with the chatbots and build the trust needed for a lasting relationship focused on the patient’s well-being.

Furthermore, health policies must propose a feedback mechanism through which patients can regularly express their satisfaction or concerns, allowing the AI to adjust and adapt and making each interaction more meaningful and effective. There will, however, be circumstances where a user’s input becomes increasingly concerning; in these situations, LangChain technology can be applied to create AI-supported high-risk handling methods that utilise prompt templates.

  • Data updates and response enhancements: the suggestions for improving the chatbot responses, including making them more relaxed and concise, increasing phrase variety, and refining natural language processing skills, highlight the need for regular database updates and response enhancements. Future policies could emphasize the importance of consistent updates to keep the chatbot’s knowledge current and ensure meaningful and varied interactions.
  • Privacy and informed consent: the survey respondents raised valid concerns about data privacy and informed consent. Future health policies could address these concerns by clearly explaining where user data is sent, providing an explanation of the chatbot’s information sources, and ensuring transparent mechanisms for obtaining informed consent. These policies would help establish trust and accountability in the use of chatbot technology for mental healthcare. Implementing robust privacy and security measures to safeguard user data and ensure compliance with relevant data protection regulations (22). Policies should address data storage, consent management, and secure communication protocols to protect user confidentiality.
  • Collaboration with professionals: the suggestion to integrate structured therapy techniques and conduct research on the efficacy of chatbots in mental health support indicates the potential for collaboration between the chatbot application and professionals in the field. Future policies could encourage partnerships between developers and professionals in psychology and counseling to enhance the chatbot’s effectiveness and provide a well-rounded approach to mental healthcare (23).
  • UI enhancements: the participants’ suggestions for improving the UI, such as incorporating a more human figure, providing a chat history option, supporting different languages, and implementing helpful features like categorized questions, search functions, and quick help buttons, present opportunities to enhance usability and user experience. Future health policies could promote user-centered design principles and standards to ensure user-friendly interfaces that facilitate seamless interactions and access to information.
  • Ethical guidelines: establishing clear ethical guidelines for the development and deployment of LLM-based chatbots to ensure user safety, privacy, and confidentiality. Policies should address issues such as data protection, informed consent, and the responsible use of AI technologies.
  • Bias mitigation: implementing measures to identify and mitigate biases within LLM-based chatbot systems. Policies should promote regular audits and monitoring to ensure fairness and prevent any discriminatory or harmful outcomes (24).
  • Training data diversity: encouraging the use of diverse and representative training data to enhance the inclusion and accuracy of LLM-based chatbots. Policies should promote the inclusion of diverse perspectives and address potential biases in the data collection process.
  • Continual evaluation: mandating regular evaluation and testing of LLM-based chatbot systems to assess their performance, reliability, and effectiveness. Policies should require transparency in reporting evaluation results and addressing any identified issues promptly.
  • Accountability and transparency: establishing mechanisms for accountability and transparency in the development and use of LLM-based chatbot systems. Policies should require clear disclosure of the system’s capabilities and limitations, as well as the organizations responsible for their development and operation.

Future health policy should consider these discussion points to ensure the safe and effective utilization of mental healthcare support chatbots. By addressing issues related to these points, policies can help shape the development and implementation of chatbot technology in a manner that aligns with ethical, reliable, and user-centric mental healthcare practices.


Conclusions

Conclusions and future work

Navigating the challenges and complexities of mental healthcare requires tools and techniques that are both sophisticated and sensitive. A considerable part of these tools includes chatbots capable of providing conversational support. While the proficiency of chatbots has increased over the years, ensuring their responses are reliable, professional, and ethically sound in the delicate domain of mental health remains a substantial challenge (25).

Our research undertook the challenge of developing a chatbot designed to augment the capabilities of mental healthcare providers effectively. Traditional chatbots, predominantly rule-based and rooted in specific therapeutic methods such as CBT, offer commendable benefits. However, these systems are often constrained by their rigidity, failing to adapt to the complex and evolving nature of mental health dialogues. Acknowledging the merits of the rule-based approach, we are considering its integration into the future iterations of our chatbot to enhance its flexibility and adaptability, essential for addressing the nuanced demands of mental health conversations (26).

Current LLM-based chatbots, despite their superior text generation capabilities, present another set of challenges. The unpredictability of their responses and potential deviation from standard mental healthcare practices raise legitimate concerns (27).

In response to these challenges, we devised a new methodology. We integrated a specifically fine-tuned DialoGPT model, which acts as a guidance system rooted in professional therapeutic practices, with the run time application of the ChatGPT API. The API lends the conversation a more dynamic and human-like quality, enriching the interaction with spontaneity and nuanced understanding, which are the hallmarks of human conversations (15,28).

Our evaluation results, as quantified by perplexity and BLEU scores, demonstrate a promising improvement in performance when compared to using either the DialoGPT model or the ChatGPT API on their own. However, the real strength of our approach is more profoundly reflected in the feedback from the real-world users—mental health patients and professionals who participated in our evaluation. Their affirmation signals the potential of our chatbot to align with the ethical sensitivity and supportive character demanded by mental healthcare practices.

We want to emphasize, though, that our chatbot, while a step forward, is not a replacement for human therapists. Instead, we envision it as an auxiliary resource that can provide support in scenarios where human resources are stretched thin, or as an additional tool to complement traditional therapeutic processes.

As we continue to refine this tool, we aim to deepen its comprehension of complex mental health issues and, more importantly, cultivate its ability to cater to the individual needs of users. We hope that, with continued research and development, we can make a meaningful contribution to the mental healthcare field.

In conclusion, our research elucidates the potential of utilising LLMs, specifically a fine-tuned DialoGPT in conjunction with ChatGPT 3.5, to deliver an efficient, professional, and reliable chatbot to support mental health care. We took a unique approach by training DialoGPT with a relatively small dataset of authentic therapeutic conversations and subsequently employed this model as a knowledge base for ChatGPT 3.5. The integration of the models resulted in a chatbot system that combines the domain-specific understanding of the fine-tuned DialoGPT with the broader, more flexible response generation capabilities of ChatGPT 3.5.

Several strands of future work will be pursued:

  • Personalisation and graphical user interface (GUI) design: this could encompass the development of suitable interfaces for different age groups, based on further evaluations and feedback, along with improved personalised experiences and visualisation of user progress. By integrating these components, the chatbot could offer a more tailored, immersive, and motivational mental health support system, thereby elevating user engagement and satisfaction.
  • Incorporating additional datasets: our current model utilizes a specific dataset of therapeutic conversations. To enhance the model’s versatility and robustness, future work can incorporate additional datasets from varied sources. This could include conversations covering different types of therapy, diverse mental health conditions, and multiple demographic groups.
  • Integration with clinical systems: future research could look into integrating the chatbot with existing clinical systems. This would allow the chatbot to provide more personalized and context-aware support. It could also facilitate better coordination with healthcare professionals, alerting them when the chatbot identifies potential serious concerns.
  • Multi-modal inputs: currently, the chatbot interacts through text-based conversations. Future work could explore multi-modal inputs like voice, facial expressions, or physiological signals. This could help the chatbot understand the user’s emotional state more accurately and respond more appropriately.

Acknowledgments

Thanks to the Student Union of the University of Derby and Professor Myra Conway, Theme Lead in Biomedical Sciences, for organising students and mental health professionals to help with the evaluations.

GitHub project repository: the backend testing and evaluation code, a standalone application and evaluation data examples used in the research can be downloaded from the GitHub repository semanticmachinelearning/MentalHealthLLMAnimationChatBot.

Funding: None.


Footnote

Peer Review File: Available at https://jmai.amegroups.com/article/view/10.21037/jmai-23-136/prf

Conflicts of Interest: Both authors have completed the ICMJE uniform disclosure form (available at https://jmai.amegroups.com/article/view/10.21037/jmai-23-136/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. IRB approval and informed consent are not applicable as we did not use any human data, and conversations between the chatbot and users are not saved and cannot be viewed again after the conversation is closed.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Dinarte-Diaz L. An Overlooked Priority: Mental Health. 2023.
  2. World Health Organization. Mental disorders. 2022.
  3. Wilson L, Marasoiu M. The Development and Use of Chatbots in Public Health: Scoping Review. JMIR Hum Factors 2022;9:e35882. [Crossref] [PubMed]
  4. Hofmann SG, Asnaani A, Vonk IJ, et al. The Efficacy of Cognitive Behavioral Therapy: A Review of Meta-analyses. Cognit Ther Res 2012;36:427-40. [Crossref] [PubMed]
  5. Linehan MM, Armstrong HE, Suarez A, et al. Cognitive-behavioral treatment of chronically parasuicidal borderline patients. Arch Gen Psychiatry 1991;48:1060-4. [Crossref] [PubMed]
  6. Hayes SC, Luoma JB, Bond FW, et al. Acceptance and commitment therapy: model, processes and outcomes. Behav Res Ther 2006;44:1-25. [Crossref] [PubMed]
  7. Segal ZV, Williams JMG, Teasdale JD. Mindfulness-based cognitive therapy for depression: A new approach to preventing relapse. Guilford Press; 2002.
  8. Young JE. Cognitive therapy for personality disorders: A schema-focused approach. 1990.
  9. Weizenbaum J. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM 1966;9:36-45. [Crossref]
  10. Shiji Group. A rule-based or ai chatbot? Here’s the difference. 2022. Available online: https://reviewproblog.shijigroup.com/should-i-choose-a-rule-based-or-an-ai-chatbot/
  11. Twomey C, O'Reilly G. Effectiveness of a freely available computerised cognitive behavioural therapy programme (MoodGYM) for depression: Meta-analysis. Aust N Z J Psychiatry 2017;51:260-9. [Crossref] [PubMed]
  12. Fitzpatrick KK, Darcy A, Vierhile M. Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial. JMIR Ment Health 2017;4:e19. [Crossref] [PubMed]
  13. Inkster B, Sarda S, Subramanian V. An Empathy-Driven, Conversational Artificial Intelligence Agent (Wysa) for Digital Mental Well-Being: Real-World Data Evaluation Mixed-Methods Study. JMIR Mhealth Uhealth 2018;6:e12106. [Crossref] [PubMed]
  14. Fulmer R, Joerin A, Gentile B, et al. Using Psychological Artificial Intelligence (Tess) to Relieve Symptoms of Depression and Anxiety: Randomized Controlled Trial. JMIR Ment Health 2018;5:e64. [Crossref] [PubMed]
  15. Brown TB, Mann B, Ryder N, et al. Language Models are Few-Shot Learners. 2020. arXiv:2005.14165.
  16. Devlin J, Chang MW, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018. arXiv:1810.04805.
  17. Das A, Selek S, Warner AR, et al. Conversational Bots for Psychotherapy: A Study of Generative Transformer Models Using Domain-specific Dialogues. In: Proceedings of the 21st Workshop on Biomedical Language Processing. Dublin: Association for Computational Linguistics; 2022: 285-97.
  18. Transcripts of all patients. 2023. Accessed on December 15, 2023. Available online: http://www.thetherapist.com/Transcripts.html
  19. Prasath N, Prabhavalkar N. Mental Health FAQ for Chatbot. 2021. Available online: https://www.kaggle.com/datasets/narendrageek/mental-health-faq-for-chatbot
  20. Hugging Face. Bleu score evaluation space. 2023. Accessed on December 15, 2023. Available online: https://huggingface.co/spaces/evaluate-metric/bleu
  21. Spallek S, Birrell L, Kershaw S, et al. Can we use ChatGPT for Mental Health and Substance Use Education? Examining Its Quality and Potential Harms. JMIR Med Educ 2023;9:e51243. [Crossref] [PubMed]
  22. May R, Denecke K. Security, privacy, and healthcare-related conversational agents: a scoping review. Inform Health Soc Care 2022;47:194-210. [Crossref] [PubMed]
  23. Grové C. Co-developing a Mental Health and Wellbeing Chatbot With and for Young People. Front Psychiatry 2021;11:606041. [Crossref] [PubMed]
  24. Parray AA, Inam ZM, Ramonfaur D, et al. ChatGPT and global public health: Applications, challenges, ethical considerations and mitigation strategies. Global Transitions 2023;5:50-4. [Crossref]
  25. Nordberg OE, Wake JD, Nordby ES, et al. Designing Chatbots for Guiding Online Peer Support Conversations for Adults with ADHD. In: Følstad A, Araujo T, Papadopoulos S, et al. editors. Chatbot Research and Design. CONVERSATIONS 2019. Cham: Springer; 2020.
  26. Gaffney H, Mansell W, Tai S. Conversational Agents in the Treatment of Mental Health Problems: Mixed-Method Systematic Review. JMIR Ment Health 2019;6:e14166. [Crossref] [PubMed]
  27. Brown TB, Mann B, Ryder N, et al. Better language models and their implications. OpenAI 2019;1.
  28. Zhang Y, Sun S, Galley M, et al. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. Transactions of the Association for Computational Linguistics 2020;8:423-39.
doi: 10.21037/jmai-23-136
Cite this article as: Yu HQ, McGuinness S. An experimental study of integrating fine-tuned large language models and prompts for enhancing mental health support chatbot system. J Med Artif Intell 2024;7:16.
