Using GPT-4 with prompt engineering for financial industry tasks
Contents
Introduction
Tokenisation
Costs
Data
Comparison of models
Discussion of results
Acknowledgements
References
Section 1: Overview of GPT
Innovations and advancements in cutting-edge technologies in the field
of Large Language Models (LLMs) are growing exponentially. With vast
quantities of available data and increases in computing power, they have
wide-ranging application potential in the financial industry.
LSEG is well placed to provide commentary in this space given our experience with LLMs in
combination with the wide variety of financial data used to improve these models and our exploration
of opportunities with Microsoft.
In this paper, we aim to demonstrate the role of prompt engineering in improving the performance of GPT models for sentiment and theme classification using financial text data. The results were obtained using GPT-4, which was released in March 2023 and, at the time of writing, is newly available to users. Our aim with these exploratory studies is to provide clear and concise communication for an improved understanding of GPT models in the financial industry.
Using GPT-4 for sentiment and theme classification, we found that GPT-4 outperforms the GPT-3 and GPT-3.5 models. GPT-4 also slightly outperformed other LLMs used as benchmarks, such as BART and FinBERT. With prompt engineering for sentiment classification, GPT-4's performance improved further, indicating that prompt engineering is a valuable avenue for performance optimisation.
Introduction
The release of the GPT-4 series marks an exciting time for those industries looking to apply Large Language Models (LLMs) (1; 2). The improved capability of LLMs has made them increasingly relevant for financial industry tasks.
LSEG leverages LLM capabilities in a variety of products: summarisation, entity recognition, topic detection and sentiment analysis are used in products including SentiMine (1,2), MarketPsych (3) and Transcripts Summarisation, and are being implemented into StarMine (4). SentiMine provides aspect-based sentiment analysis of earnings/conference call transcripts for a wide range of financially significant themes. Such understanding is crucial to the productivity of bankers and portfolio managers. Call transcripts often touch on a range of topics within the same sentence or paragraph. For instance, the manager of a company may have to report a drop in customer retention and want to move on quickly to positive news about hitting an ESG milestone – SentiMine will clearly set those apart and help analysts to find the relevant updates needed to make a better-informed decision.
LSEG also offers a wide range of financial data – unstructured and textual data being particularly relevant in this context. This data is a prime candidate to improve the performance of GPT models given the increased role of prompt engineering.
Prompt engineering is the careful construction of the input into a GPT model to improve performance on a specific task (5). The input into GPT can be quite detailed, including not just questions or chat history, but also significant amounts of relevant financial data; it is this aspect that is of particular interest to the financial industry given the existence of a variety of data-driven models and the significant industry knowledge thereof. With the release of the GPT-4 series in March 2023, the size of the input prompt has increased significantly, offering improvements in an area already ripe for exploration.
This paper explores using prompt engineering with GPT-4 for sentiment and theme classification. Section 1 provides a concise overview of the GPT family of models, with an emphasis on the practicalities relevant for the average user of GPT in the financial industry, such as costs. Section 2 provides results comparing GPT models with popular existing models such as BART (6).
Evolution of the GPT series
Understanding the current state of GPT is important when considering the practicalities of product development, such as performance, competitive advantage, limitations and costs. Generative Pre-trained Transformer (GPT) models are considered “general”, in the sense that they can perform a wide range of tasks well but can still under-perform against existing techniques on a specific task; this is important for finance use cases given the highly competitive nature of the industry for specific tasks. The increasing pace of releases of GPT models poses the question as to if/when such generalised models could outperform existing LLM techniques.
Figure 1 outlines several key aspects of the GPT family of models, such as release date, number of parameters, size of training corpus and size of input tokens.
– The GPT-4 series was released in March 2023, less than six months after the previous release. It is plausible that model updates might become even more frequent; a major consideration for product development cycles.
– ChatGPT is a product powered initially by the GPT-3 series and now analogous to gpt-3.5-turbo. Among the general public, the term “ChatGPT” is sometimes used incorrectly as a catchall phrase for all GPT models. We wish to emphasise that while this is acceptable in more general conversations, it isn't suitable when comparing performance in financial use cases, where distinctions between GPT models are important.
– The input from a user into GPT models, known as a prompt, has increased in size with each subsequent release. This is of value when considering the potential to improve performance through the information included in the prompt; within the financial industry, this may prove of significant value when considering including financial data.
– The GPT-3, GPT-3.5 and GPT-4 series make several models available for use in each release (7) – more than are shown in Figure 1 (see references for details).
– These are fine-tuned by OpenAI for specific tasks, such as chat or coding. When comparing results, the specific model version is important.
– We include code-davinci-002 both as an example of a model fine-tuned for code and as an example of an API that is now discontinued in favour of a different API – in this case, gpt-3.5-turbo.
Tokenisation
Fig. 2: An example of tokenisation of the headline text of a Reuters news article containing
98 tokens (9). Text taken from Reuters News website (10). Included in the final sentence are
some names, highlighting how these are treated by GPT tokenisers.
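As an illustration of how token counts can be checked in practice, the sketch below uses OpenAI's open-source tiktoken library, which implements the tokenisers behind the GPT models; the sample sentence is our own and merely illustrative, not the Reuters text of Figure 2.

```python
import tiktoken  # OpenAI's open-source tokeniser library

# Use the encoding associated with the target model.
encoding = tiktoken.encoding_for_model("gpt-4")

text = "Shares in European banks steadied after a volatile week."  # illustrative only
tokens = encoding.encode(text)

print(f"{len(tokens)} tokens")          # token count drives both cost and prompt limits
print(encoding.decode(tokens) == text)  # decoding round-trips to the original text
```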
Costs
Prompt engineering, fine-tuning and pre-training
Fine-tuning
In fine-tuning, model weights are adjusted to fit domain-specific information. This typically takes the form of a smaller corpus of examples that reflect the intended task (7). These could be hand-crafted examples, as performance improvements are possible with small sample sizes – e.g., a few dozen examples. At the time of writing, fine-tuning is not available for gpt-3.5-turbo or gpt-4; it remains to be seen whether fine-tuning will be an accessible feature of future GPT models.
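For illustration, the fine-tuning workflow OpenAI offered at the time for base models such as davinci consumed JSONL files of prompt/completion pairs. The sketch below prepares such a file; the two hand-crafted examples are hypothetical, not drawn from our data.

```python
import json

# Hypothetical hand-crafted examples reflecting the intended task;
# performance gains are possible with only a few dozen such pairs.
examples = [
    {"prompt": "Revenue grew 12% year on year.\nSentiment:", "completion": " positive"},
    {"prompt": "Customer retention fell sharply this quarter.\nSentiment:", "completion": " negative"},
]

# The legacy fine-tuning endpoint expected one JSON object per line (JSONL).
with open("finetune_examples.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```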
Pre-training
Pre-training, a technique common to many LLMs, is the training of a large neural network on a corpus of text data. The output of pre-training is a new base model, which learns a general representation of the language supplied. Note that when discussing pre-training, we consider only architectures similar to GPT models, not LLMs more generally.
Section 2: Results for theme and sentiment classification
We report the performance of gpt-4 for theme and sentiment classification, which is a typical use case in the financial industry. Although this work is exploratory only, there is a clear indication that gpt-4 rivals existing models in this task. We also explore using additional information in the prompts, demonstrating an ability to significantly reduce misclassifications using only a few examples.
Data
Theme and sentiment classification are common problems in the financial industry. LSEG has significant expertise in this field. LSEG's aspect-based sentiment analysis product – SentiMine – offers sentiment scores for financial documents; it is available as part of Workspace and leverages LLMs extensively (15). SentiMine requires optimising the accuracy of sentiment and theme classification, and to this end LSEG has performed extensive model tuning and selection experiments, analysing over 100 financially relevant themes in the process. During the development of SentiMine, we observed several challenges when assigning sentiment to financial statements from long-form documents – i.e., transcripts and equity research – for instance, correctly picking up mixed sentiment within a statement, risks expected vs risks realised, and sentiment as expressed numerically vs verbally.
The data set chosen for our exploration of GPT models is a set of challenging examples available for theme classification; this choice of difficult examples was intentional, as we wished to see how GPT would perform in an area considered suitable for improvement. We wish to emphasise that the examples we have used do not reflect the SentiMine production choices. Importantly, while the results we present are inspired by our knowledge and understanding of LLMs in building SentiMine, they are not suitable as a valid comparison to the SentiMine product. Instead, our results are designed to help build understanding and education on GPT performance and its generalisability to such tasks.
The data used consisted of 2,747 sentences annotated into nine themes and 882 sentences annotated for sentiment. The following tables show the nine themes and three values used for sentiment, along with an example instance of the annotated data. The company name has been redacted from this sentence for this paper.
Task: Theme classification
Categories: “Cloud computing”, “Cost-to-income ratio”, “Customer experience”, “Epidemics”, “Marketing and advertising costs”, “Mobile network operator (MNO)”, “Mobile virtual network operator (MVNO)”, “Non-interest income”, “Shares buyback”

Prompt input
The following figure (Figure 4) shows the text used for zero-shot results for theme and sentiment classification. This was considered the minimum amount of information required to achieve consistent results using GPT models; an illustrative sketch of such a prompt follows below.
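The exact wording of Figure 4 is not reproduced here; the sketch below shows only the general shape of a zero-shot theme-classification prompt built from the categories above. The instruction wording and the sample sentence are our own illustration.

```python
THEMES = [
    "Cloud computing", "Cost-to-income ratio", "Customer experience",
    "Epidemics", "Marketing and advertising costs",
    "Mobile network operator (MNO)", "Mobile virtual network operator (MVNO)",
    "Non-interest income", "Shares buyback",
]

def zero_shot_theme_prompt(sentence: str) -> str:
    """Build a zero-shot prompt asking for exactly one theme from the set."""
    theme_list = ", ".join(f'"{t}"' for t in THEMES)
    return (
        f"Classify the following sentence into exactly one of these themes: "
        f"{theme_list}. Respond with the theme only.\n\nSentence: {sentence}"
    )

# Illustrative sentence, not taken from the annotated data set.
print(zero_shot_theme_prompt("The group completed a further tranche of its share repurchase programme."))
```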
Comparison of models
In the table below we summarise the different models used in our experiments. For the GPT family
of models, GPT-3, GPT-3.5 and GPT-4 series are used. BART (6) and FinBERT (16) are used to provide
benchmarks; BART is used for both theme and sentiment classification, FinBERT is used for
sentiment only.
For the GPT results, the temperature parameter was set to 0 to strive for the most deterministic
results possible (1).
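A minimal sketch of such a call, using the legacy (pre-1.0) openai Python interface available at the time, with temperature set to 0 as described above; the prompt is assumed to be built by a helper such as the hypothetical zero_shot_theme_prompt sketched in the Data section.

```python
import openai  # legacy (pre-1.0) openai-python interface

openai.api_key = "YOUR_API_KEY"  # placeholder

def classify(prompt: str, model: str = "gpt-4") -> str:
    """Send a classification prompt to a chat model and return the raw text reply."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # minimise sampling randomness for more deterministic results
    )
    return response["choices"][0]["message"]["content"].strip()
```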
Model – Description
GPT-4 series: gpt-4 – Highest current capabilities and optimised for chat, with processing for even more tokens (a choice of 8,000 or 32,000 tokens).
GPT-3.5 series: gpt-3.5-turbo – Optimised for chat, with superior capability to GPT-3 series models. This model is capable of processing 4,000 tokens.
GPT-3 series: davinci – Base model (175B parameters); the most capable base model in the series for generating text.
GPT-3 series: text-ada-001 – Fine-tuned ada model (0.75B parameters) capable of small tasks. The fastest and lowest-cost model in the series, suited to functional testing. Performance was expected to be poor in comparison to other models.
BART – Bidirectional and Auto-Regressive Transformers (BART) is a sequence-to-sequence model, trained to reconstruct noisy text, enabling high-quality text production.
FinBERT – Specifically trained for financial text analysis, using a corpus of regulatory filings and financial news. Includes financial domain vocabulary.
Results for zero-shot
The following tables present the results for theme and sentiment classification for zero-shot prompts. Accuracy is used as the primary result to aid general communication, given it is relatively easy to comprehend. Also included are the numbers of correct and incorrect predictions, again to aid comprehension of results.
The f1-score is provided as it is the standard metric for classification tasks. The f1-score is calculated as the weighted average over all classes. This represents the aggregation of performance across all prediction categories, weighted by the number of samples in each category – giving due importance to imbalanced data. A sketch of this calculation follows the tables.
The ada model is included in the theme results to highlight its poor performance; this is as expected. The ada model routinely didn't predict themes from the set requested.
FinBERT is included only in sentiment, given that it would require fine-tuning with the theme data to provide a valid comparison.

Theme classification (zero-shot):
Model | accuracy | f1-score | Correct predictions | Incorrect predictions
BART | 0.786 | 0.784 | 2158 | 589
text-ada-001 | 0.059 | 0.082 | 161 | 2586
davinci | 0.847 | 0.861 | 2326 | 421
gpt-3.5-turbo | 0.881 | 0.887 | 2419 | 328
gpt-4 | 0.941 | 0.945 | 2584 | 163

Sentiment classification (zero-shot):
Model | accuracy | f1-score | Correct predictions | Incorrect predictions
FinBERT | 0.450 | 0.381 | 397 | 485
BART | 0.660 | 0.626 | 582 | 300
davinci | 0.413 | 0.430 | 364 | 518
gpt-3.5-turbo | 0.536 | 0.575 | 473 | 409
gpt-4 | 0.626 | 0.641 | 552 | 330
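The accuracy and weighted f1-score reported above can be computed from per-sentence predictions with scikit-learn; below is a minimal sketch with placeholder labels standing in for the annotated gold labels and the parsed GPT responses.

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder gold labels and model predictions, for illustration only.
y_true = ["positive", "negative", "neutral", "negative"]
y_pred = ["positive", "neutral", "neutral", "negative"]

accuracy = accuracy_score(y_true, y_pred)
# average="weighted" aggregates per-class f1 weighted by class support,
# which accounts for imbalanced data as described above.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"accuracy={accuracy:.3f}, f1-score={weighted_f1:.3f}")
```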
Results for few-shot
We wish to demonstrate the value of prompt engineering by comparing few-shot prompts with zero-shot prompts. The few-shot experiments were designed based on the zero-shot results shown above.
We constructed a few-shot prompt containing three examples that had been correctly predicted by GPT models using zero-shot prompts. The few-shot prompt is very similar to Figure 3 in Section 1; a sketch follows below. We then re-ran the above zero-shot experiments for all sentences except the three examples used in the few-shot prompt. We report these results for sentiment only, given that theme classification performance is already relatively high using zero-shot prompts. Note that the zero-shot results here are very similar to those above, bar removing the three examples used in the few-shot prompts.
Using gpt-4, there were 91 correct classifications using few-shot that zero-shot had classified incorrectly. There were 44 incorrect classifications using few-shot that zero-shot had classified correctly.
Using gpt-3.5-turbo, there were 177 correct classifications using few-shot that zero-shot had classified incorrectly. There were 69 incorrect classifications using few-shot that zero-shot had classified correctly.
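As a sketch of how the few-shot sentiment prompt differs from the zero-shot one, the builder below prepends three worked examples before the sentence to classify. The example sentences, labels and instruction wording are our own illustration; the study used three examples correctly predicted under zero-shot prompts, which are not reproduced here.

```python
# Illustrative example sentences with assumed labels (hypothetical).
FEW_SHOT_EXAMPLES = [
    ("Operating margin expanded for the third consecutive quarter.", "positive"),
    ("The company expects restructuring charges to weigh on earnings.", "negative"),
    ("The board held the dividend unchanged.", "neutral"),
]

def few_shot_sentiment_prompt(sentence: str) -> str:
    """Build a few-shot prompt: instruction, worked examples, then the target sentence."""
    parts = [
        'Classify the sentiment of the final sentence as "positive", '
        '"negative" or "neutral". Respond with the label only.\n'
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        parts.append(f"Sentence: {text}\nSentiment: {label}\n")
    parts.append(f"Sentence: {sentence}\nSentiment:")
    return "\n".join(parts)
```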
Discussion of results
The results appear promising for the use of GPT models and gpt-4 for the specific task of theme and sentiment classification.
In all examples, gpt-4 showed improvement over davinci (GPT-3) and gpt-3.5-turbo. This indicates that there are likely performance advantages in using the latest models in the GPT family. This performance increase also comes with an increase in usage costs.
All GPT models were good at predicting theme using a zero-shot prompt, outperforming the benchmark model (BART). This is potentially related to the data chosen; as we selected the most difficult sentences from a larger set, it is plausible these sentences were biased against the benchmark models for reasons that are not yet studied. Regardless, given the results, GPT models appear to offer value for this specific task and are worth further investigation.
BART outperformed GPT models on sentiment accuracy using a zero-shot prompt. However, gpt-4 was the top-performing model in both tasks according to its f1-score. Overall, the sentiment results for all models were lower than expected, which again may be due to the selection of challenging data. The performance of gpt-4 in a zero-shot setting is still very encouraging given how similar its results are to BART's.
Of particular interest is the improvement achieved using few-shot over zero-shot for sentiment, with both gpt-4 and gpt-3.5-turbo showing improved accuracy. Overall, the best results were obtained with gpt-4 using few-shot, which is consistent with all other results in indicating that gpt-4 is the most performant model of the GPT family.
Even though our few-shot experiment is a relatively small study, we consider these results to be very promising due to the many variations offered by prompt engineering and the amount of data we can potentially use in the prompt. We see our exploratory results as confirmation that prompt engineering is worthy of further exploration.
A few interesting examples...
Here are a few examples to help enhance understanding of GPT models:
– Occasionally, gpt-4 predicted a theme that was similar, but not identical, to a theme in the set requested. In one such example, for the theme labelled “Epidemics”, gpt-4 returned the response “COVID”. This may be due to our prompt not being sufficient. Regardless, given the semantic similarity of the results, it is interesting to consider what information could be contained in misclassified results.
– The zero-shot prompt was the simplest prompt with which we could obtain reliable results; more naïve prompts were liable to return erroneous additional characters, such as Roman numerals. Using naïve prompts places an additional overhead on post-processing results. It also raises questions as to whether results could be misleading if prompts are not well-formed.
– Contained in the data was one sentence used only for unit testing, which
consisted of the word “nan”; the response from gpt-4 detailed feedback
as to why this input isn’t suitable for theme classification.
– Input: “nan”
– Output: “There is no theme present in the given text as it only contains
“nan”, which stands for “not a number” and does not provide any
information related to the mentioned themes”
These few examples indicate that post-processing of output from GPT models is likely a more important consideration than for traditional machine learning systems. There appears to be information of value in both misclassified results and input data errors.
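Given these observations, some validation of GPT output against the expected label set is prudent. The sketch below shows one minimal approach (an exact-match pass followed by a loose containment check); it is our own illustration, not the post-processing used in our experiments.

```python
from typing import List, Optional

def normalise_label(response: str, labels: List[str]) -> Optional[str]:
    """Map a raw model response onto the expected label set, if possible."""
    cleaned = response.strip().strip('"').rstrip(".").strip()
    if not cleaned:
        return None
    # Exact (case-insensitive) match against the requested labels.
    for label in labels:
        if cleaned.lower() == label.lower():
            return label
    # Loose fallback: response and label contain one another. Note this would
    # not catch semantic near-misses such as "COVID" returned for "Epidemics".
    for label in labels:
        if cleaned.lower() in label.lower() or label.lower() in cleaned.lower():
            return label
    return None  # unmapped output: flag for manual review

# Example: trailing punctuation and casing differences are tolerated.
print(normalise_label("Shares buyback.", ["Epidemics", "Shares buyback"]))
```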
Conclusions and future work
The financial industry is only beginning to discover the value of GPT models.
Our exploratory analysis using GPT-4 for financial industry tasks shows
promise, with the latest GPT models demonstrating clear performance
improvements over existing models.
Although we have reported results for the specific tasks of sentiment and theme classification, there
are clearly opportunities for using GPT far beyond these tasks. These tasks were natural starting points
for experimentation given our experience in these areas using LLMs. It is likely GPT offers new ways
for users to interact with models and data given their generative capabilities, which we are keen
to explore.
In the foreseeable future, prompt engineering is likely to form a large part of any GPT project, given
how accessible and cost-effective it is compared to fine-tuning and pre-training. We seek to harness
the increase in available input tokens with the release of GPT-4 series, specifically by using financial
data in the prompt.
Finally, the pace of updates to GPT and the performance improvements of GPT-4 are compelling for product implementation; we look forward to discovering new opportunities in this fast-growing field.
Acknowledgements
Director, Data Science, LSEG Analytics
Aran Batth, Senior Data Scientist, LSEG Analytics
Stanislav Chistyakov, Junior Data Scientist, LSEG Analytics
Mihail Dungarov, CFA, Text Analytics Product Lead, LSEG Analytics

With thanks to:
Will Cruse and Dinesh Kalamegam in LSEG Analytics for infrastructure support for testing GPT
Evgeny Kovalyov and Anna Stief in LSEG Analytics for ongoing contributions
Jingwei Zhang for contributing to the evolution of GPT section
Rachel Sorek, Rani Shlivinski, Lior Gelernter Oryan and others in LSEG Applied NLP
References
1. GPT-4 Technical Report. OpenAI. 2023.
2. Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models. Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, Zihao Wu, Dajiang Zhu, Xiang Li, Ning Qiang, Dingang Shen, Tianming Liu, Bao Ge. 2023.
3. LSEG. [Online] https://www.lseg.com/content/dam/lseg/en_us/documents/white-papers/discovering-sentiment-in-finances-unstructured-data.pdf. Accessed: April 2023.
4. LSEG. [Online] https://www.lseg.com/en/labs/sentimine. Accessed: April 2023.
5. Language Models are Few-Shot Learners. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, et al. 2020.
6. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation and Comprehension. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer. 2019.
7. OpenAI. [Online] https://platform.openai.com/docs/models/overview. Accessed: April 2023.
8. Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, Samson Tan. 2021.
9. OpenAI. [Online] https://platform.openai.com/tokenizer. Accessed: April 2023.
10. Reuters. [Online] https://www.reuters.com/markets/global-markets-view-europe-2023-04-17/. Accessed: April 2023.
11. OpenAI. [Online] https://openai.com/pricing. Accessed: April 2023.
12. A Survey of Large Language Models. Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, et al. 2023.
13. RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning. Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, E. Xing, Zhiting Hu. 2022.
14. Prompting GPT-3 To Be Reliable. Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, Lijuan Wang. 2022.
15. LSEG. [Online] https://www.refinitiv.com/en/products/refinitiv-workspace. Accessed: April 2023.
16. FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. Araci, Dogu. 2019.
Discover more at lseg.com
LSG2837757/5-23