Large Language Models in Finance: A Survey

Yinheng Li∗, Columbia University, New York, NY, USA
Shaofei Wang∗, Columbia University, New York, NY, USA
Han Ding∗, Columbia University, New York, NY, USA
Hang Chen∗, New York University, New York, NY, USA

ABSTRACT

Recent advances in large language models (LLMs) have opened new possibilities for artificial intelligence applications in finance. In this paper, we provide a practical survey focused on two key aspects of utilizing LLMs for financial tasks: existing solutions and guidance for adoption.

First, we review current approaches employing LLMs in finance, including leveraging pretrained models via zero-shot or few-shot learning, fine-tuning on domain-specific data, and training custom LLMs from scratch. We summarize key models and evaluate their performance improvements on financial natural language processing tasks.

Second, we propose a decision framework to guide financial professionals in selecting the appropriate LLM solution based on their use case constraints around data, compute, and performance needs. The framework provides a pathway from lightweight experimentation to heavy investment in customized LLMs.

Lastly, we discuss limitations and challenges around leveraging LLMs in financial applications. Overall, this survey aims to synthesize the state-of-the-art and provide a roadmap for responsibly applying LLMs to advance financial AI.

KEYWORDS

Large Language Models, Generative AI, Natural Language Processing, Finance

ACM Reference Format:
Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. 2023. Large Language Models in Finance: A Survey. In 4th ACM International Conference on AI in Finance (ICAIF ’23), November 27–29, 2023, Brooklyn, NY, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3604237.3626869

∗ All authors contributed equally to this research. Order is random.

This work is licensed under a Creative Commons Attribution International 4.0 License.
ICAIF ’23, November 27–29, 2023, Brooklyn, NY, USA
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0240-2/23/11.
https://doi.org/10.1145/3604237.3626869

1 INTRODUCTION

Recent advances in artificial intelligence, especially in natural language processing, have led to the development of powerful large language models (LLMs) like ChatGPT[49]. These models have demonstrated impressive capabilities in understanding, generating, and reasoning about natural language. The finance industry could benefit from applying LLMs, as effective language understanding and generation can inform trading, risk modeling, customer service, and more.

In this survey, we aim to provide a practical overview focused on two key aspects of utilizing LLMs for financial applications:
• Existing solutions and models that employ LLMs for various finance tasks. We summarize key techniques like finetuning pretrained LLMs and training domain-specific LLMs from scratch.
• Guidance on the decision process for applying LLMs in finance. We discuss factors to consider regarding whether LLMs are suitable for a task, cost/benefit tradeoffs, risks, and limitations.

By reviewing current literature and developments, we hope to give an accessible synthesis of the state-of-the-art along with considerations for adopting LLMs in finance. This survey targets financial professionals and researchers exploring the intersection of AI and finance. It may also inform developers applying LLM solutions for the finance industry. The remainder of the paper is organized as follows. Section 2 covers background on language modeling and recent advances leading to LLMs. Section 3 surveys current AI applications in finance and the potential for LLMs to advance in these areas. Sections 4 and 5 provide LLM solutions and decision guidance for financial applications. Finally, Sections 6 and 7 discuss risks, limitations, and conclusions.

2 BASICS OF LANGUAGE MODELS

A language model is a statistical model that is trained on extensive text corpora to predict the probability distribution of word sequences [13]. Let’s consider a sequence of words denoted as $W = w_1, w_2, \ldots, w_n$, where $w_i$ represents the $i$-th word in the sequence. The goal of a language model is to calculate the probability $P(W)$, which can be expressed as:

$$
\begin{aligned}
P(W) &= P(w_1, w_2, \ldots, w_n) \\
     &= P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, w_2, \ldots, w_{n-1})
\end{aligned}
$$

The conditional probability $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ captures the likelihood of word $w_i$ given the preceding words. Over the past few decades, language model architectures have undergone significant
evolution. Initially, n-gram models represented word sequences as Markov processes [3], assuming that the probability of the next word depends solely on the preceding $(n - 1)$ words. For example, in a bigram model, the probability of a word is conditioned only on the previous word.

Later, Recurrent Neural Network (RNN)-based models like LSTM [41] and GRU [23] emerged as neural network solutions capable of capturing long-term dependencies in sequential data. However, in 2017, the introduction of the transformer architecture [11] revolutionized language modeling, surpassing the performance of RNNs in tasks such as machine translation. Transformers employ self-attention mechanisms to model the relationships between all words in a sequence in parallel, facilitating efficient training on large-scale datasets. Prominent transformer-based models include GPT (Generative Pre-trained Transformer) [33, 36], which is a decoder-only framework, BERT (Bidirectional Encoder Representations from Transformers) [7], which is an encoder-only framework, and T5 (Text-to-Text Transfer Transformer) [15], which leverages both encoder and decoder structures. These models have achieved state-of-the-art results on various natural language processing (NLP) tasks through transfer learning.

It is important to note that the evolution of language models has mainly been driven by advancements in computational power, the availability of large-scale datasets, and the development of novel neural network architectures. These models have significantly enhanced language understanding and generation capabilities, enabling their application across a wide range of industries and domains.

3 OVERVIEW OF AI APPLICATIONS IN FINANCE

3.1 Current AI Applications in Finance

Artificial Intelligence (AI) has witnessed extensive adoption across various domains of finance in recent years [40]. In this survey, we focus on key financial applications, including trading and portfolio management [67], financial risk modeling [46], financial text mining [25, 42], and financial advisory and customer services [54]. While this list is not exhaustive, these areas have shown significant interest and high potential with the advancement of AI.

Trading and portfolio management have been early adopters of machine learning and deep learning models within the finance industry. The primary objective of trading is to forecast prices and generate profits based on these predictions. Initially, statistical machine learning methods such as Support Vector Machines (SVM) [43], XGBoost [68], and other tree-based algorithms were utilized for profit and loss estimation. However, the emergence of deep neural networks introduced techniques like Recurrent Neural Networks (RNN), particularly Long Short-Term Memory (LSTM) networks [53], Convolutional Neural Networks (CNN), and transformers [28], which have proven effective in price forecasting. Additionally, reinforcement learning [59] has been applied to automatic trading and portfolio optimization.

Financial risk modeling encompasses various applications of machine learning and deep learning models. For instance, McKinsey & Company has developed a deep learning-based solution for financial fraud detection by leveraging user history data and real-time transaction data [52]. Similar approaches have been employed in credit scoring [45, 60] and bankruptcy or default prediction [5].

Financial text mining represents a popular area where deep learning models and natural language processing techniques are extensively utilized. According to [50], there are over 40 research publications on this topic. Financial text mining aims to extract valuable information from large-scale unstructured data in real time, enabling more informed decision-making in trading and risk modeling. For example, [37] employs financial market sentiment extracted from news articles to forecast the direction of the stock market index.

Applying AI in financial advisory and customer-related services is an emerging and rapidly growing field. AI-powered chatbots, as discussed in [48], already provide more than 37% of supporting functions in various e-commerce and e-service scenarios. In the financial industry, chatbots are being adopted as cost-effective alternatives to human customer service, as highlighted in the report "Chatbots in consumer finance" [2]. Additionally, banks like JPMorgan are leveraging AI services to provide investment advice, as mentioned in a report by CNBC [55].

The current implementation of deep learning models offers significant advantages by efficiently extracting valuable insights from vast amounts of data within short time frames. This capability is particularly valuable in the finance industry, where timely and accurate information plays a crucial role in decision-making processes. With the emergence of LLMs, even more tasks that were previously considered intractable become possible, further expanding the potential applications of AI in the finance industry.

3.2 Advancements of LLMs in Finance

LLMs offer numerous advantages over traditional models, particularly in the field of finance. Firstly, LLMs leverage their extensive pre-training data to effectively process common-sense knowledge, enabling them to understand natural language instructions. This is valuable in scenarios where supervised training is challenging due to limited labeled financial data or restricted access to certain documents. LLMs can perform tasks through zero-shot learning [44], as demonstrated by their satisfactory performance in sentiment classification tasks across levels of complexity [35]. For similar text mining tasks on financial documents, LLMs can achieve acceptable performance without task-specific training.

Compared to other supervised models, LLMs offer superior adaptation and flexibility. Instead of training separate models for specific tasks, LLMs can handle multiple tasks by simply modifying the prompt under different task instructions [34]. This adaptability does not require additional training, enabling a single LLM to perform sentiment analysis, summarization, and keyword extraction on financial documents.

LLMs excel at breaking down ambiguous or complex tasks into actionable plans. Applications like Auto-GPT [1], Semantic Kernel [47], and LangChain [4] have been developed to showcase this capability. In this paper, we refer to this as Tool Augmented Generation. For instance, Auto-GPT can optimize a portfolio with global equity ETFs and bond ETFs based on user-defined goals [51]. It
formulates detailed plans, including acquiring financial data, utilizing Python packages for Sharpe ratio optimization, and presenting the results to the user. Previously, achieving such end-to-end solutions with a single model was infeasible. This property makes LLMs an ideal fit for financial customer service or financial advisory, where they can understand natural language instructions and assist customers by leveraging available tools and information.

While the application of LLMs in finance is promising, it is crucial to acknowledge their limitations and associated risks, which will be further discussed in Section 6.

4 LLM SOLUTIONS FOR FINANCE

4.1 Utilizing Few-shot/Zero-shot Learning in Finance Applications

Accessing LLM solutions in finance can be done through two options: utilizing an API from LLM service providers or employing open-source LLMs. Companies like OpenAI¹, Google², and Microsoft³ offer LLM services through APIs. These services not only provide the base language model capabilities but also offer additional features tailored for specific use cases. For example, OpenAI’s APIs include functionalities for chat, SQL generation, code completion, and code interpretation. While there is no dedicated LLM service exclusively designed for finance applications, leveraging these general-purpose LLM services can be a viable option, especially for common tasks. For example, [38] demonstrates the use of OpenAI’s GPT-4 service for financial statement analysis.

In addition to LLM services provided by tech companies, open-source LLMs can also be applied to financial applications. Models such as LLaMA [58], BLOOM [14], Flan-T5 [19], and more are available for download from the Hugging Face model repository⁴. Unlike API-based services, these open-source models must be self-hosted. Similar to using LLM APIs, zero-shot or few-shot learning approaches can be employed with open-source models. Utilizing open-source models offers greater flexibility, as the model’s weights are accessible and the model’s output can be customized for downstream tasks. Additionally, it provides better privacy protection, as the model and data remain under the user’s control. However, working with open-source models also has its drawbacks. Reported evaluation metrics suggest a performance gap between open-source models and proprietary models. For certain downstream tasks, zero-shot or few-shot learning may not yield optimal performance. In such cases, fine-tuning the model, which requires labeled data, expertise, and computational resources, is necessary to achieve satisfactory results. This may explain why, at the time of writing this paper, no direct examples of open-source models applied to financial applications have been found. In Section 5, we provide a more detailed discussion of which option is more favorable under different circumstances.

¹ https://openai.com/product
² https://bard.google.com/
³ https://azure.microsoft.com/en-us/products/cognitive-services/openai-service
⁴ https://huggingface.co/models

4.2 Fine-tuning a Model

Fine-tuning LLMs in the finance domain can enhance domain-specific language understanding and contextual comprehension, resulting in improved performance in finance-related tasks and generating more accurate and tailored outputs.

4.2.1 Common Techniques for LLM Fine-tuning. Modern techniques for fine-tuning LLMs typically fall into two main categories: standard fine-tuning and instruction fine-tuning.

In standard fine-tuning, the model is trained on the raw datasets without modification. The key context, question, and desired answer are directly fed into the LLM, with the answer masked during training so that the model learns to generate it. Despite its simplicity, this approach is widely effective.

Instruction fine-tuning [24] involves creating task-specific datasets that provide examples and guidance to steer the model’s learning process. By formulating explicit instructions and demonstrations in the training data, the model can be optimized to excel at certain tasks or produce more contextually relevant and desired outputs. The instructions act as a form of supervision to shape the model’s behavior.

Both methods have their merits: standard fine-tuning is straightforward to implement, while instruction fine-tuning allows for more precise guidance of the model. The ideal approach depends on the amount of training data available and the complexity of the desired behaviors. However, both leverage the knowledge already embedded in LLMs and fine-tune them for enhanced performance on downstream tasks.

In addition to the above methods, techniques such as Low-Rank Adaptation (LoRA)[18] and quantization[10] can enable fine-tuning with significantly lower computational requirements.

LoRA keeps the original weight matrices frozen and trains only low-rank factors of the weight update instead of the full matrices. This approach drastically reduces the number of trainable parameters, enabling training on less powerful hardware and shortening the total training time.

Another impactful approach is to use reduced numerical precisions such as bfloat16 [16] or float16 instead of float32. By halving the bit-width, each parameter occupies 2 bytes instead of 4, reducing the memory needed for the weights by 50%. This can also accelerate computation by up to 2x, since operations on smaller data types are faster. Moreover, the reduced memory footprint enables larger batch sizes, further boosting throughput.

4.2.2 Fine-tuned finance LLM evaluation. The performance of fine-tuned finance LLMs can be evaluated in two categories: finance classification tasks and finance generative tasks. In finance classification, we consider tasks such as Sentiment Analysis and News Headline Classification. In finance generative tasks, our focus is on Question Answering, News Summarization, and Named Entity Recognition. Table 1 provides detailed information about all the fine-tuned finance LLMs. Among the various fine-tuned LLMs, we will focus on discussing three of them: (1) PIXIU (also known as FinMA)[29] fine-tunes LLaMA on 136K task-specific instruction samples. (2) FinGPT[63] presents an end-to-end framework for training and applying FinLLMs in the finance industry. FinGPT utilizes the lightweight Low-Rank Adaptation (LoRA) technique to fine-tune open-source LLMs (such as LLaMA and ChatGLM) using approximately 50k samples. However, FinGPT’s evaluation is limited to finance classification tasks. (3) Instruct-FinGPT[65], on the other hand, fine-tunes LLaMA on 10k instruction samples
derived from two financial sentiment analysis datasets and is likewise evaluated solely on finance classification tasks.

Table 1: Quick Overview of Finetuned Finance LLMs

| Model Name | Finetune data size (samples) | Training budget | Model architecture | Release time |
|---|---|---|---|---|
| FinMA-7B | Raw: 70k, Instruction: 136k | 8 A100 40GB GPUs | LLaMA-7B | Jun 2023 |
| FinMA-30B | Raw: 70k, Instruction: 136k | 128 A100 40GB GPUs | LLaMA-30B | Jun 2023 |
| Fin-GPT(V1/V2/V3) | 50K | < $300 per training | ChatGLM, LLaMA | July 2023 |
| Instruct-FinGPT | 10K Instruction | 8 A100 40GB GPUs, ∼1 hr | LLaMA-7B | Jun 2023 |
| Fin-LLaMA[61] | 16.9K Instruction | NA | LLaMA-33B | Jun 2023 |
| Cornucopia(Chinese)[64] | 12M Instruction | NA | LLaMA-7B | Jun 2023 |

Based on the reported model performance, we summarize our findings as below:
• Compared to the original base LLM (LLaMA) and other open-source LLMs (BLOOM, OPT[32], ChatGLM[8, 12]), all fine-tuned finance LLMs exhibit significantly better performance across all finance-domain tasks reported in the papers, especially classification tasks.
• The fine-tuned finance LLMs outperform BloombergGPT[30] in most finance tasks reported in the papers.
• When compared to powerful general LLMs like ChatGPT and GPT-4, the fine-tuned finance LLMs demonstrate superior performance in most finance classification tasks, which indicates their enhanced domain-specific language understanding and contextual comprehension abilities. However, in finance generative tasks, the fine-tuned LLMs show similar or worse performance, suggesting the need for more high-quality domain-specific datasets to improve their generative capabilities.

4.3 Pretrain from Scratch

The objective of training LLMs from scratch is to develop models that have even better adaptation to the finance domain. Table 2 presents the current finance LLMs that have been trained from scratch: BloombergGPT, XuanYuan 2.0 [66], and Fin-T5[17].

Table 2: Quick Overview of Finance LLMs Trained from Scratch

| Pretrained LLM | Corpus size (tokens) | Training budget (A100·hours) | Model architecture | Release time |
|---|---|---|---|---|
| BloombergGPT | 363B Finance tokens + 345B public tokens | 1,300,000 | 50B-BLOOM | May 2023 |
| XuanYuan 2.0 | 366B for pre-training + 13B for finetuning | Not released | 176B-BLOOM | May 2023 |
| Fin-T5 | 80B Finance tokens | Days/weeks | 770M-T5 | Feb 2023 |

As shown in Table 2, there is a trend of combining public datasets with finance-specific datasets during the pretraining phase. Notably, BloombergGPT serves as an example where the corpus comprises a roughly equal mix of general and finance-related text (345B public tokens and 363B finance tokens). It is worth mentioning that BloombergGPT's corpus includes a subset of 5 billion tokens that pertain exclusively to Bloomberg, representing only 0.7% of the total training corpus. This targeted corpus contributes to the performance improvements achieved in finance benchmarks.

On finance classification tasks, both BloombergGPT and Fin-T5 have demonstrated superior performance compared to their original models, BLOOM-176B and T5, respectively. These tasks include market sentiment classification, multi-categorical and multi-label classification, and more. BloombergGPT achieves an average score of 62.51, surpassing the open-source BLOOM-176B model, which attains a score of 54.35. Similarly, Fin-T5 achieves an average score of 81.78, outperforming the T5 model’s score of 79.56. Notably, BloombergGPT was also evaluated using an internal benchmark specifically designed by Bloomberg. On this benchmark, BloombergGPT achieved an average score of 62.47, far above BLOOM-176B's score of 33.39. This outcome highlights that even when the internal private training corpus constitutes less than 1% of the total training corpus, it can still lead to substantial improvements on evaluation tasks within the same domain and distribution.

On finance-related generative tasks such as Question Answering, Named Entity Recognition, and summarization, both models outperformed their respective general models. Specifically, BloombergGPT achieved a score of 64.83, surpassing BLOOM-176B’s score of 45.43. Similarly, Fin-T5 outperformed T5 with a score of 68.69, while T5 scored 66.06. These findings further highlight the models’ superior performance in generating finance-related content when compared to their general-purpose counterparts.

Although these models are not as powerful as closed-source models like GPT-3 or PaLM[9], they demonstrate similar or superior performance compared to similar-sized public models. In evaluations on various general generative tasks, such as BIG-bench Hard, knowledge assessments, reading comprehension, and linguistic tasks, BloombergGPT performed comparably to or better than similar-sized public models, albeit slightly worse than larger models like GPT-3 or PaLM. Overall, BloombergGPT
showcased commendable performance across a wide range of general generative tasks, positioning it favorably among models of comparable size. This indicates that the model’s enhanced capabilities in finance-related tasks do not come at the expense of its general abilities.

5 DECISION PROCESS IN APPLYING LLM TO FINANCIAL APPLICATIONS

5.1 Determining the Need for an LLM

Before exploring LLM solutions, it is essential to ascertain whether employing such a model is truly necessary for the given task. The advantages of LLMs over smaller models can be summarized as follows, as outlined in the work by Yang et al. [22]:

Leveraging Pretraining Knowledge: LLMs can utilize the knowledge acquired from pretraining data to provide solutions. If a task lacks sufficient training data or annotated data but requires common-sense knowledge, an LLM may be a suitable choice.

Reasoning and Emergent Abilities: LLMs excel at tasks that involve reasoning or emergent abilities [21]. This property makes LLMs well-suited for tasks where task instructions or expected answers are not clearly defined, or when dealing with out-of-distribution data. In the context of financial advisory, client requests in customer service often vary widely and involve complex conversations. LLMs can serve as virtual agents to provide assistance in such cases.

Orchestrating Model Collaboration: LLMs can act as orchestrators between different models and tools. For tasks that require collaboration among various models, LLMs can integrate and utilize these tools together [1, 4, 47]. This capability is particularly valuable when aiming for robust automation of a model solution pipeline.

While LLMs offer immense power, their use comes with a significant cost, whether utilizing a third-party API [49] or fine-tuning an open-source LLM. Therefore, it is prudent to consider conventional models before fully committing to LLMs. In cases where the task has a clear definition (e.g., regression, classification, ranking), there is an ample amount of annotated training data, or the task relies minimally on common-sense knowledge or emergent capabilities like reasoning, relying on LLMs may not be necessary or justified initially.

5.2 A general decision guidance for applying LLMs on finance tasks

Once the decision has been made to utilize LLMs for a finance task, a decision guidance framework can be followed to ensure efficient and effective implementation. The framework, illustrated in Figure 1, categorizes the usage of LLMs into four levels based on computational resources and data requirements. By progressing through the levels, costs associated with training and data collection increase. It is recommended to start at Level 1 and move to higher levels (2, 3, and 4) only if the model’s performance is not satisfactory. The following section provides detailed explanations of the decision and action blocks at each level. Table 3 presents an approximate cost range for different options, based on pricing from various third-party services like AWS and OpenAI.

5.2.1 Level 1: Zero-shot Applications. The first decision block determines whether to use an existing LLM service or an open-source model. If the input question or context involves confidential data, it is necessary to proceed with the 1A action block, which involves self-hosting an open-source LLM. As of July 2023, several options are available, including LLaMA[58], OpenLLaMA[39], Alpaca[57], and Vicuna[6]. LLaMA offers models with sizes ranging from 7B to 65B, but they are limited to research purposes. OpenLLaMA provides options for 3B, 7B, and 13B models, with support for commercial usage. Alpaca and Vicuna are fine-tuned based on LLaMA, offering 7B and 13B options. Deploying your own LLM requires a robust local machine with a suitable GPU, such as an NVIDIA V100 for a 7B model or an NVIDIA A100 or A6000 for a 13B model.

If data privacy is not a concern, selecting third-party LLMs such as GPT-3.5/GPT-4 from OpenAI or Bard from Google is recommended. This option allows for lightweight experiments and early performance evaluation without significant deployment costs. The only cost incurred would be the fees associated with each API call, typically based on input length and the token count of the model’s output.

5.2.2 Level 2: Few-shot Applications. If the model’s performance at Level 1 is not acceptable for the application, few-shot learning can be explored if there are several example questions and their corresponding answers available. Few-shot learning has shown advantages in various previous works [33, 36]. The core idea is to provide a set of example questions along with their corresponding answers as context in addition to the specific question being asked. The cost associated with few-shot learning is similar to that of the previous level, except for the longer input needed to include the examples in every request. Generally, achieving good performance may require using 1 to 10 examples. These examples can be the same across different questions or selected based on the specific question at hand. The challenge lies in determining the optimal number of examples and selecting relevant ones. This process involves experimentation and testing until the desired performance is reached.

5.2.3 Level 3: Tool-Augmented Generation and Finetuning. If the task at hand is extremely complicated and in-context learning does not yield reasonable performance, the next option is to leverage external tools or plugins with the LLM, assuming a collection of relevant tools/plugins is available. For example, a simple calculator could assist with arithmetic-related tasks, while a search engine could be indispensable for knowledge-intensive tasks such as querying the CEO of a specific company or identifying the company with the highest market capitalization.

Integrating tools with LLMs can be achieved by providing the tools' descriptions as part of the prompt. The cost associated with this approach is generally higher than that of few-shot learning due to the development of the tool(s) and the longer input sequence required as context. However, there may be instances where the concatenated tool description is too long, surpassing the input length limit of LLMs. In such cases, an additional step such as a simple tool retrieval or filter might be needed to narrow down the tools for selection. The deployment cost typically includes the cost of using the LLMs as well as the cost of using the tool(s).
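
The tool retrieval step mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not a method taken from the surveyed works: the `Tool` class, the `retrieve_tools` and `build_prompt` helpers, the example tools, and the character budget standing in for an LLM input length limit are all hypothetical. The word-overlap score is a deliberately simple stand-in for whatever retriever a real system would use (for example, embedding similarity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A hypothetical tool entry: a name plus a natural-language description."""
    name: str
    description: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def retrieve_tools(question: str, tools: list[Tool], max_chars: int) -> list[Tool]:
    """Keep the tools most relevant to the question within a length budget.

    Relevance is the number of words shared between the question and the
    tool's name and description. Tools with no shared words are dropped.
    """
    q_words = tokenize(question)
    scored = [(len(q_words & tokenize(f"{t.name} {t.description}")), t) for t in tools]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)

    selected, used = [], 0
    for _, tool in ranked:
        line = f"- {tool.name}: {tool.description}"
        if used + len(line) + 1 > max_chars:
            continue  # this description would overflow the budget; skip it
        selected.append(tool)
        used += len(line) + 1
    return selected


def build_prompt(question: str, tools: list[Tool]) -> str:
    """Concatenate the selected tool descriptions and the question."""
    tool_lines = "\n".join(f"- {t.name}: {t.description}" for t in tools)
    return (
        "You may call one of the following tools.\n"
        f"{tool_lines}\n\n"
        f"Question: {question}\n"
    )


# Illustrative tool catalogue (assumed, not from the paper).
TOOLS = [
    Tool("calculator", "Evaluate arithmetic expressions such as sums, ratios, and percentages."),
    Tool("web_search", "Search the web for facts such as the CEO of a company or its market capitalization."),
    Tool("price_history", "Fetch historical daily prices for a stock ticker."),
]

question = "Who is the CEO of the company with the highest market capitalization?"
selected = retrieve_tools(question, TOOLS, max_chars=500)
prompt = build_prompt(question, selected)
print(prompt)
```

Because only the retained descriptions are placed in the prompt, the input stays within the budget even when the full catalogue would not fit. A plain word-overlap score is sensitive to common words, so a practical filter would likely remove stop words or use a learned similarity measure.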


Figure 1: Decision process flow chart

If the above options fail to produce satisfactory performance, finetuning the LLMs can be attempted. This stage requires a reasonable amount of annotated data, computational resources (GPU, CPU, etc.), and expertise in tuning language models, as listed in Table 3.

5.2.4 Level 4: Train Your Own LLMs from Scratch. If the results are still unsatisfactory, the only option left is to train domain-specific LLMs from scratch, similar to what BloombergGPT did. However, this option comes with significant computational costs and data requirements. It typically requires millions of dollars in computational resources and training on a dataset with trillions of tokens. The intricacies of the training process are beyond the scope of this survey, but it is worth noting that it can take several months or even years of effort for a professional team to accomplish.
By following this decision guidance framework, financial professionals and researchers can navigate through the various levels and options, making informed choices that align with their specific needs and resource constraints.

5.3 Evaluation
The evaluation of LLMs in finance can be conducted through various approaches. One direct evaluation method is to assess the model’s performance on downstream tasks. Evaluation metrics can be categorized into two main groups: accuracy and performance, based on the taxonomy provided by [62]. The accuracy category can further be divided into metrics for regression (such as MAPE, RMSE, R²) and metrics for classification (Recall, Precision, F1 score). The performance category includes metrics or measurements that directly assess the model’s performance on the specific task, such as measuring total profit or Sharpe Ratio in a trading-related task. These evaluations can be conducted using historical data, backtest simulations, or online experiments. While performance metrics are often more important in finance, it is crucial to ensure that accuracy metrics align with performance to ensure meaningful decision-making and guard against overfitting.
In addition to task-specific evaluations, general metrics used for LLMs can also be applied. Particularly, when evaluating the overall


Options                               Development Computational   Development Data       Deployment Computational
                                      Cost ($)                    Cost (samples)         Cost ($/1k tokens generated)
OpenSource-ZeroShot                   -                           -                      0.006 - 0.037
3rd party-ZeroShot                    -                           -                      0.002 - 0.12
OpenSource-FewShot                    -                           -                      0.006 - 0.037
3rd party-FewShot                     -                           -                      0.002 - 0.12
OpenSource Tool Augmented Generation  Cost of developing tools    -                      0.006 - 0.037
3rd party Tool Augmented Generation   Cost of developing tools    -                      0.002 - 0.12
OpenSource-Finetune                   4 - 360,000                 10,000 - 12,000,000    0.0016 - 0.12
3rd party-Finetune                    30 - 30,000                 10,000 - 12,000,000    0.002 - 0.12
Train from Scratch                    5,000,000                   700,000,000            0.0016 - 0.12
Table 3: Costs of Different LLM Options. This table gives approximate ranges for the data requirements and dollar costs of each option.
The development data and dollar costs are estimated from the previous works listed in Table 2. The third-party
deployment costs are taken from https://openai.com/pricing. The open-source deployment costs are calculated from
https://openai.com/pricing and https://aws.amazon.com/ec2/pricing/on-demand/ as of the date this paper was written,
assuming an NVIDIA A100 GPU. The cost per 1k tokens is computed as ($ / second) × (seconds / 1k tokens), and it typically
takes 3 to 33 seconds to generate 1k tokens, depending on model size.
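
The cost arithmetic stated in the caption can be worked through as follows. This is a minimal sketch: the hourly GPU rate is a hypothetical placeholder chosen for round numbers, not a quoted AWS price, and the function name `cost_per_1k_tokens` is ours. Only the 3 to 33 second generation times come from the caption.

```python
def cost_per_1k_tokens(gpu_usd_per_hour: float, seconds_per_1k_tokens: float) -> float:
    """Deployment cost in USD per 1k generated tokens.

    Implements: ($ / second) * (seconds / 1k tokens).
    """
    usd_per_second = gpu_usd_per_hour / 3600.0
    return usd_per_second * seconds_per_1k_tokens


# Hypothetical hourly rate for one GPU (an assumption, not a quoted price).
ASSUMED_GPU_USD_PER_HOUR = 3.60

# Generation times of 3 s and 33 s per 1k tokens, as stated in the caption.
low = cost_per_1k_tokens(ASSUMED_GPU_USD_PER_HOUR, 3)
high = cost_per_1k_tokens(ASSUMED_GPU_USD_PER_HOUR, 33)

print(f"{low:.4f} - {high:.4f} USD per 1k tokens")
```

Because the cost is linear in both factors, the elevenfold spread in generation time translates directly into an elevenfold spread in cost per 1k tokens at a fixed hourly rate.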

quality of an existing LLM or a fine-tuned one, comprehensive evaluation systems like the one presented in [27] can be utilized. This evaluation system covers tasks for various scenarios and incorporates metrics from different aspects, including accuracy, fairness, robustness, bias, and more. It can serve as a guide for selecting a language model or evaluating one’s own model in the context of finance applications.

5.4 Limitations
While significant progress has been made in applying LLMs to revolutionize financial applications, it is important to acknowledge the limitations of these language models. Two major challenges are the production of disinformation and the manifestation of biases, such as racial, gender, and religious biases, in LLMs [56]. In the financial industry, accuracy of information is crucial for making sound financial decisions, and fairness is a fundamental requirement for all financial services. To ensure information accuracy and mitigate hallucination, additional measures like retrieval-augmented generation [26] can be implemented. To address biases, content censoring and output restriction techniques (such as only generating answers from a pre-defined list) can be employed to control the generated content and reduce bias.
LLMs also pose potential challenges in terms of regulation and governance. Although LLMs offer more interpretability than conventional deep learning models by providing reasoning steps or thinking processes for the generated answers when prompted correctly [20] [31], they remain black boxes, and the explainability of the content they generate is highly limited.
Addressing these limitations and ensuring the ethical and responsible use of LLMs in finance applications is essential. Continuous research, development of robust evaluation frameworks, and the implementation of appropriate safeguards are vital steps in harnessing the full potential of LLMs while mitigating potential risks.

6 CONCLUSION
In conclusion, this paper has conducted a timely and practical survey on the emerging application of LLMs for financial AI. We structured the survey around two critical pillars: solutions and adoption guidance.
Under solutions, we reviewed diverse approaches to harnessing LLMs for finance, including leveraging pretrained models, fine-tuning on domain data, and training custom LLMs. Experimental results demonstrate significant performance gains over general-purpose LLMs across natural language tasks like sentiment analysis, question answering, and summarization.
To provide adoption guidance, we proposed a structured framework for selecting the optimal LLM strategy based on constraints around data availability, compute resources, and performance needs. The framework aims to balance value and investment by guiding practitioners from low-cost experimentation to rigorous customization.
In summary, this survey synthesized the latest progress in applying LLMs to transform financial AI and provided a practical roadmap for adoption. We hope it serves as a useful reference for researchers and professionals exploring the intersection of LLMs and finance. As datasets and computation improve, finance-specific LLMs represent an exciting path to democratize cutting-edge NLP across the industry.

REFERENCES
[1] 2023. Auto-GPT: An Autonomous GPT-4 Experiment. https://github.com/Significant-Gravitas/Auto-GPT.


[2] 2023. Chatbots in consumer finance. https://www.consumerfinance.gov/data-research/research-reports/chatbots-in-consumer-finance/chatbots-in-consumer-finance/
[3] Talal Almutiri and Farrukh Nadeem. 2022. Markov models applications in natural language processing: a survey. Int. J. Inf. Technol. Comput. Sci 2 (2022), 1–16.
[4] Harrison Chase. 2022. LangChain. https://github.com/hwchase17/langchain
[5] Mu-Yen Chen. 2011. Bankruptcy prediction in firms with statistical and intelligent techniques and a comparison of evolutionary computation approaches. Computers & Mathematics with Applications 62, 12 (2011), 4514–4524. https://doi.org/10.1016/j.camwa.2011.10.030
[6] Wei-Lin Chiang et al. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
[8] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 320–335.
[9] Aakanksha Chowdhery et al. 2022. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311 [cs.CL]
[10] Amir Gholami et al. 2021. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv:2103.13630 [cs.CV]
[11] Ashish Vaswani et al. 2017. Attention Is All You Need. arXiv:1706.03762 [cs.CL]
[12] Aohan Zeng et al. 2023. GLM-130B: An Open Bilingual Pre-trained Model. In The Eleventh International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=-Aw0rrrPUF
[13] Bengio et al. 2000. A neural probabilistic language model. Advances in neural information processing systems 13 (2000).
[14] BigScience Workshop et al. 2023. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv:2211.05100 [cs.CL]
[15] Colin Raffel et al. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683 [cs.LG]
[16] Dhiraj Kalamkar et al. 2019. A Study of BFLOAT16 for Deep Learning Training. arXiv:1905.12322 [cs.LG]
[17] Dakuan Lu et al. 2023. BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark. arXiv:2302.09432 [cs.CL]
[18] Edward J. Hu et al. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL]
[19] Hyung Won Chung et al. 2022. Scaling Instruction-Finetuned Language Models. arXiv:2210.11416 [cs.LG]
[20] Jason Wei et al. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models. CoRR abs/2201.11903 (2022). arXiv:2201.11903 https://arxiv.org/abs/2201.11903
[21] Jason Wei et al. 2022. Emergent Abilities of Large Language Models. arXiv:2206.07682 [cs.CL]
[22] Jingfeng Yang et al. 2023. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. arXiv:2304.13712 [cs.CL]
[23] Kyunghyun Cho et al. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs.CL]
[24] Long Ouyang et al. 2022. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL]
[25] Pagliaro et al. 2022. Investor Behavior Modeling by Analyzing Financial Advisor Notes: A Machine Learning Perspective. In Proceedings of the Second ACM International Conference on AI in Finance (Virtual Event) (ICAIF ’21). Association for Computing Machinery, New York, NY, USA, Article 23, 8 pages. https://doi.org/10.1145/3490354.3494388
[26] Patrick Lewis et al. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 [cs.CL]
[27] Percy Liang et al. 2022. Holistic Evaluation of Language Models. arXiv:2211.09110 [cs.CL]
[28] Qingsong Wen et al. 2023. Transformers in Time Series: A Survey. arXiv:2202.07125 [cs.LG]
[29] Qianqian Xie et al. 2023. PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance. arXiv:2306.05443 [cs.CL]
[30] Shijie Wu et al. 2023. BloombergGPT: A Large Language Model for Finance. arXiv:2303.17564 [cs.LG]
[31] Shunyu Yao et al. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601 [cs.CL]
[32] Susan Zhang et al. 2022. OPT: Open Pre-trained Transformer Language Models. arXiv:2205.01068 [cs.CL]
[33] Tom B. Brown et al. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
[34] Tom B. Brown et al. 2020. Language Models are Few-Shot Learners. CoRR abs/2005.14165 (2020). arXiv:2005.14165 https://arxiv.org/abs/2005.14165
[35] Wenxuan Zhang et al. 2023. Sentiment Analysis in the Era of Large Language Models: A Reality Check. arXiv:2305.15005 [cs.CL]
[36] Yaqing Wang et al. 2020. Generalizing from a Few Examples: A Survey on Few-Shot Learning. arXiv:1904.05046 [cs.LG]
[37] Bledar Fazlija and Pedro Harder. 2022. Using Financial News Sentiment for Stock Price Direction Prediction. Mathematics 10, 13 (2022). https://doi.org/10.3390/math10132156
[38] Peter Foy. 2023. GPT-4 for Financial Statements: Building an AI Analyst. MLQ AI. https://www.mlq.ai/gpt-4-financial-statements-ai-analyst/
[39] Xinyang Geng and Hao Liu. 2023. OpenLLaMA: An Open Reproduction of LLaMA. https://github.com/openlm-research/open_llama
[40] John Goodell, Satish Kumar, Weng Marc Lim, and Debidutta Pattnaik. 2021. Artificial intelligence and machine learning in finance: Identifying foundations, themes, and research clusters from bibliometric analysis. Journal of Behavioral and Experimental Finance 32 (08 2021). https://doi.org/10.1016/j.jbef.2021.100577
[41] Alex Graves. 2014. Generating Sequences With Recurrent Neural Networks. arXiv:1308.0850 [cs.NE]
[42] Aaryan Gupta, Vinya Dengre, Hamza Kheruwala, and Manan Shah. 2020. Comprehensive review of text-mining applications in finance. Journal of Financial Innovation 6 (11 2020). https://doi.org/10.1186/s40854-020-00205-1
[43] Kyoung jae Kim. 2003. Financial time series forecasting using support vector machines. Neurocomputing 55, 1 (2003), 307–319. https://doi.org/10.1016/S0925-2312(03)00372-2 Support Vector Machines.
[44] Yinheng Li. 2023. A Practical Survey on Zero-shot Prompt Design for In-context Learning. International Conference Recent Advances in Natural Language Processing.
[45] Cuicui Luo, Desheng Wu, and Dexiang Wu. 2017. A deep learning approach for credit scoring using credit default swaps. Engineering Applications of Artificial Intelligence 65 (2017), 465–470. https://doi.org/10.1016/j.engappai.2016.12.002
[46] Akib Mashrur, Wei Luo, Nayyar A. Zaidi, and Antonio Robles-Kelly. 2020. Machine Learning for Financial Risk Management: A Survey. IEEE Access 8 (2020), 203203–203223. https://doi.org/10.1109/ACCESS.2020.3036322
[47] Microsoft. 2023. Semantic Kernel. https://github.com/microsoft/semantic-kernel.
[48] Chiara Valentina Misischia, Flora Poecze, and Christine Strauss. 2022. Chatbots in customer service: Their relevance and impact on service quality. Procedia Computer Science 201 (2022), 421–428. https://doi.org/10.1016/j.procs.2022.03.055 The 13th International Conference on Ambient Systems, Networks and Technologies (ANT) / The 5th International Conference on Emerging Data and Industry 4.0 (EDI40).
[49] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[50] Ahmet Murat Ozbayoglu, Mehmet Ugur Gudelek, and Omer Berat Sezer. 2020. Deep Learning for Financial Applications: A Survey. arXiv:2002.05786 [q-fin.ST]
[51] Igor Radovanovic. 2023. Auto-GPT for finance - an exploratory guide - algotrading101 blog. https://algotrading101.com/learn/auto-gpt-finance-guide/
[52] Abhimanyu Roy, Jingyi Sun, Robert Mahoney, Loreto Alonzi, Stephen Adams, and Peter Beling. 2018. Deep learning detecting fraud in credit card transactions. In 2018 Systems and Information Engineering Design Symposium (SIEDS). 129–134. https://doi.org/10.1109/SIEDS.2018.8374722
[53] Omer Berat Sezer, Murat Ozbayoglu, and Erdogan Dogdu. 2017. A Deep Neural-Network Based Stock Trading System Based on Evolutionary Optimized Technical Analysis Parameters. Procedia Computer Science 114 (2017), 473–480. https://doi.org/10.1016/j.procs.2017.09.031 Complex Adaptive Systems Conference with Theme: Engineering Cyber Physical Systems, CAS October 30 – November 1, 2017, Chicago, Illinois, USA.
[54] Ashish Shah et al. 2020. FinAID, A Financial Advisor Application using AI. 2282–2286 pages. https://doi.org/10.35940/ijrte.a2951.059120
[55] Hugh Son. 2023. JPMorgan is developing a CHATGPT-like A.I. service that gives investment advice. https://www.cnbc.com/2023/05/25/jpmorgan-develops-ai-investment-advisor.html
[56] Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli. 2021. Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. arXiv:2102.02503 [cs.CL]
[57] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
[58] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]
[59] Junhao Wang, Yinheng Li, and Yijie Cao. 2019. Dynamic Portfolio Management with Reinforcement Learning. arXiv:1911.11880 [q-fin.PM]
[60] David West. 2000. Neural network credit scoring models. Computers & Operations Research 27, 11 (2000), 1131–1152. https://doi.org/10.1016/S0305-0548(99)00149-5
[61] Pedram Babaei William Todt, Ramtin Babaei. 2023. Fin-LLAMA: Efficient Fine-tuning of Quantized LLMs for Finance. https://github.com/Bavest/fin-llama.
[62] Frank Xing, Erik Cambria, and Roy Welsch. 2018. Natural language based financial forecasting: a survey. Artificial Intelligence Review 50 (06 2018). https://doi.org/10.1007/s10462-017-9588-9


[63] Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. FinGPT: Open-Source Financial Large Language Models. arXiv:2306.06031 [q-fin.ST]
[64] YangMu Yu. 2023. Cornucopia-LLaMA-Fin-Chinese. https://github.com/jerry1993-tech/Cornucopia-LLaMA-Fin-Chinese.
[65] Boyu Zhang, Hongyang Yang, and Xiao-Yang Liu. 2023. Instruct-FinGPT: Financial Sentiment Analysis by Instruction Tuning of General-Purpose Large Language Models. arXiv:2306.12659 [cs.CL]
[66] Xuanyu Zhang, Qing Yang, and Dongliang Xu. 2023. XuanYuan 2.0: A Large Chinese Financial Chat Model with Hundreds of Billions Parameters. arXiv:2305.12002 [cs.CL]
[67] Zihao Zhang, Stefan Zohren, and Stephen Roberts. 2020. Deep Learning for Portfolio Optimization. The Journal of Financial Data Science 2, 4 (aug 2020), 8–20. https://doi.org/10.3905/jfds.2020.1.042
[68] Ekaterina Zolotareva. 2021. Aiding Long-Term Investment Decisions with XGBoost Machine Learning Model. arXiv:2104.09341 [q-fin.CP]
