Large Language Models in Finance
the probability of a word is only conditioned on the previous word.

Later, Recurrent Neural Network (RNN)-based models like LSTM [20] and GRU [10] emerged as neural network solutions capable of capturing long-term dependencies in sequential data. However, in 2017, the introduction of the transformer architecture [46] revolutionized language modeling, surpassing the performance of RNNs in tasks such as machine translation. Transformers employ self-attention mechanisms to model relationships between words in parallel, facilitating efficient training on large-scale datasets. Prominent transformer-based models include GPT (Generative Pre-trained Transformer) [5, 48], which is a decoder-only framework; BERT (Bidirectional Encoder Representations from Transformers) [13], which is an encoder-only framework; and T5 (Text-to-Text Transfer Transformer) [38], which leverages both encoder and decoder structures. These models have achieved state-of-the-art results on various natural language processing (NLP) tasks through transfer learning.

The evolution of language models has mainly been driven by advancements in computational power, the availability of large-scale datasets, and the development of novel neural network architectures. These models have significantly enhanced language understanding and generation capabilities, enabling their application across a wide range of industries and domains.

3 Overview of AI Applications in Finance

3.1 Current AI Applications in Finance

Artificial Intelligence (AI) has witnessed extensive adoption across various domains of finance in recent years [19]. In this survey, we focus on key financial applications, including trading and portfolio management [67], financial risk modeling [30], financial text mining [21, 36], and financial advisory and customer services [41]. While this list is not exhaustive, these areas have shown significant interest and high potential with the advancement of AI.

Trading and portfolio management have been early adopters of machine learning and deep learning models within the finance industry. The primary objective of trading is to forecast prices and generate profits based on these predictions. Initially, statistical machine learning methods such as Support Vector Machines (SVM) [23], XGBoost [68], and tree-based algorithms were utilized for profit and loss estimation. However, the emergence of deep neural networks introduced techniques like Recurrent Neural Networks (RNN), particularly Long Short-Term Memory (LSTM) networks [40], Convolutional Neural Networks (CNN), and transformers [51], which have proven effective in price forecasting. Additionally, reinforcement learning [47] has been applied to automatic trading and portfolio optimization.

Financial risk modeling encompasses various applications of machine learning and deep learning models. For instance, McKinsey & Company has developed a deep learning-based solution for financial fraud detection by leveraging user history data and real-time transaction data [39]. Similar approaches have been employed in credit scoring [29, 52] and bankruptcy or default prediction [8].

Financial text mining represents a popular area where deep learning models and natural language processing techniques are extensively utilized. According to [35], there are over 40 research publications on this topic. Financial text mining aims to extract valuable information from large-scale unstructured data in real time, enabling more informed decision-making in trading and risk modeling. For example, [15] employs financial market sentiment extracted from news articles to forecast the direction of the stock market index.

Applying AI in financial advisory and customer-related services is an emerging and rapidly growing field. AI-powered chatbots, as discussed in [32], already provide more than 37% of supporting functions in various e-commerce and e-service scenarios. In the financial industry, chatbots are being adopted as cost-effective alternatives to human customer service, as highlighted in the report "Chatbots in consumer finance" [2]. Additionally, banks like JPMorgan are leveraging AI services to provide investment advice, as mentioned in a report by CNBC [42].

The current implementation of deep learning models offers significant advantages by efficiently extracting valuable insights from vast amounts of data within short time frames. This capability is particularly valuable in the finance industry, where timely and accurate information plays a crucial role in decision-making processes. With the emergence of LLMs, even more tasks that were previously considered intractable become possible, further expanding the potential applications of AI in the finance industry.

3.2 Advancements of LLMs in Finance

LLMs offer numerous advantages over traditional models, particularly in the field of finance. First, LLMs leverage their extensive pre-training data to effectively process common-sense knowledge, enabling them to understand natural language instructions. This is valuable in scenarios where supervised training is challenging due to limited labeled financial data or restricted access to certain documents. LLMs can perform tasks through zero-shot learning [26], as demonstrated by their satisfactory performance in sentiment classification tasks of varying complexity [65]. For similar text mining tasks on financial documents, LLMs can achieve acceptable performance without task-specific training.

Compared to other supervised models, LLMs offer superior adaptation and flexibility. Instead of training separate models for specific tasks, LLMs can handle multiple tasks
ICAIF-23, New York City, NY,
Large Language Models in Finance: A Survey
by simply modifying the prompt under different task instructions [6]. This adaptability does not require additional training, enabling LLMs to simultaneously perform sentiment analysis, summarization, and keyword extraction on financial documents.

LLMs excel at breaking down ambiguous or complex tasks into actionable plans. Applications like Auto-GPT [1], Semantic Kernel [31], and LangChain [7] have been developed to showcase this capability. In this paper, we refer to this as Tool-Augmented Generation. For instance, Auto-GPT can optimize a portfolio with global equity ETFs and bond ETFs based on user-defined goals [37]. It formulates detailed plans, including acquiring financial data, utilizing Python packages for Sharpe ratio optimization, and presenting the results to the user. Previously, achieving such end-to-end solutions with a single model was infeasible. This property makes LLMs an ideal fit for financial customer service or financial advisory, where they can understand natural language instructions and assist customers by leveraging available tools and information.

While the application of LLMs in finance is promising, it is crucial to acknowledge their limitations and associated risks, which will be further discussed in Section 5.4.

4 LLM Solutions for Finance

4.1 Utilizing Few-shot/Zero-shot Learning in Finance Applications

LLM solutions in finance can be accessed through two options: utilizing an API from LLM service providers or employing open-source LLMs. Companies like OpenAI (https://openai.com/product), Google (https://bard.google.com/), and Microsoft (https://azure.microsoft.com/en-us/products/cognitive-services/openai-service) offer LLM services through APIs. These services not only provide the base language model capabilities but also offer additional features tailored for specific use cases. For example, OpenAI's APIs include functionalities for chat, SQL generation, code completion, and code interpretation. While there is no dedicated LLM service exclusively designed for finance applications, leveraging these general-purpose LLM services can be a viable option, especially for common tasks. For example, [16] demonstrates the use of OpenAI's GPT-4 service for financial statement analysis.

In addition to LLM services provided by tech companies, open-source LLMs can also be applied to financial applications. Models such as LLaMA [45], BLOOM [54], Flan-T5 [12], and more are available for download from the Hugging Face model repository (https://huggingface.co/models). Unlike API-based services, these open-source models must be hosted and run by the user. As with LLM APIs, zero-shot or few-shot learning approaches can be employed with open-source models. Utilizing open-source models offers greater flexibility, as the model's weights are accessible and the model's output can be customized for downstream tasks. Additionally, it provides better privacy protection, as the model and data remain under the user's control. However, working with open-source models also has its drawbacks. Reported evaluation metrics suggest a performance gap between open-source models and proprietary models. For certain downstream tasks, zero-shot or few-shot learning may not yield optimal performance. In such cases, fine-tuning the model, which requires labeled data, expertise, and computational resources, is necessary to achieve satisfactory results. This may explain why, at the time of writing this paper, no direct examples of open-source models applied to financial applications have been found. In Section 5, we provide a more detailed discussion of which option is more favorable under different circumstances.

4.2 Fine-tuning a Model

Fine-tuning LLMs in the finance domain can enhance domain-specific language understanding and contextual comprehension, resulting in improved performance in finance-related tasks and more accurate and tailored outputs.

4.2.1 Common Techniques for LLM Fine-tuning. Modern techniques for fine-tuning LLMs typically fall into two main categories: standard fine-tuning and instructional fine-tuning.

In standard fine-tuning, the model is trained on the raw datasets without modification. The key context, question, and desired answer are directly fed into the LLM, with the answer masked during training so that the model learns to generate it. Despite its simplicity, this approach is widely effective.

Instructional fine-tuning [34] involves creating task-specific datasets that provide examples and guidance to steer the model's learning process. By formulating explicit instructions and demonstrations in the training data, the model can be optimized to excel at certain tasks or produce more contextually relevant and desired outputs. The instructions act as a form of supervision to shape the model's behavior.

Both methods have their merits: standard fine-tuning is straightforward to implement, while instructional fine-tuning allows for more precise guidance of the model. The ideal approach depends on the amount of training data available and the complexity of the desired behaviors. However, both leverage the knowledge already embedded in LLMs and fine-tune them for enhanced performance on downstream tasks.

In addition to the above methods, techniques such as Low-Rank Adaptation (LoRA) [22] and quantization [18] can enable fine-tuning with significantly lower computational requirements.
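As a rough illustration of why these techniques lower the cost of fine-tuning, the sketch below works through the arithmetic in plain Python. It is a minimal sketch under stated assumptions, not the implementation used by any model discussed in this survey: the function names, the 4096 x 4096 layer size, the rank of 8, and the 7-billion-parameter model size are all hypothetical choices made for illustration. The LoRA part assumes the common formulation in which a frozen weight matrix W receives an additive update (alpha / r) * B A, and the memory part counts weights only, ignoring activations and optimizer state.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute h = W x + (alpha / r) * B (A x).

    W (d_out x d_in) is frozen. Only the low-rank factors A (r x d_in)
    and B (d_out x r) would be trained during fine-tuning.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_parameters(d_out, d_in, r):
    """Return (full fine-tuning count, LoRA count) for one weight matrix."""
    return d_out * d_in, r * (d_in + d_out)


def weight_memory_bytes(num_params, bits_per_param):
    """Estimate memory for the weights alone at a given bit width."""
    return num_params * bits_per_param // 8


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    full, lora = trainable_parameters(d_out=4096, d_in=4096, r=8)
    print(f"full: {full:,}  LoRA: {lora:,}  reduction: {full // lora}x")

    # Hypothetical 7-billion-parameter model at several bit widths.
    for bits in (32, 16, 8, 4):
        gigabytes = weight_memory_bytes(7_000_000_000, bits) / 1e9
        print(f"{bits:>2}-bit weights: {gigabytes:.1f} GB")
```

With B initialized to zeros, `lora_forward` returns exactly `W x`, so under this formulation training starts from the behavior of the pretrained model.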
Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen
Table 1: Fine-tuned finance LLMs.
Model Name                 Fine-tuning data size (samples)   Training budget            Model architecture   Release time
FinMA-7B                   Raw: 70k, Instruction: 136k       8 A100 40GB GPUs           LLaMA-7B             Jun 2023
FinMA-30B                  Raw: 70k, Instruction: 136k       128 A100 40GB GPUs         LLaMA-30B            Jun 2023
FinGPT (V1/V2/V3)          50k                               < $300 per training        ChatGLM, LLaMA       Jul 2023
Instruct-FinGPT            10k Instruction                   8 A100 40GB GPUs, ~1 hr    LLaMA-7B             Jun 2023
Fin-LLaMA [53]             16.9k Instruction                 NA                         LLaMA-33B            Jun 2023
Cornucopia (Chinese) [61]  12M Instruction                   NA                         LLaMA-7B             Jun 2023
LoRA keeps the original weight matrices frozen and trains only a pair of low-rank factors whose product is added to them as an update, instead of training the full matrices. This approach drastically reduces the number of trainable parameters, enabling training on less powerful hardware and shortening the total training time.

Another impactful approach is to use reduced numerical precisions such as bfloat16 [24] or float16 instead of float32. By halving the bit-width, each parameter occupies only 2 bytes instead of 4 bytes, reducing memory usage by 50%. This can also accelerate computation by up to 2x, since operations on smaller data types are faster. Moreover, the reduced memory footprint enables larger batch sizes, further boosting throughput.

4.2.2 Fine-tuned finance LLM evaluation. The performance of fine-tuned finance LLMs can be evaluated in two categories: finance classification tasks and finance generative tasks. In finance classification, we consider tasks such as Sentiment Analysis and News Headline Classification. In finance generative tasks, our focus is on Question Answering, News Summarization, and Named Entity Recognition. Table 1 provides detailed information about all the fine-tuned finance LLMs. Among the various fine-tuned LLMs, we focus on three: (1) PIXIU (also known as FinMA) [56], which fine-tunes LLaMA on 136K task-specific instruction samples. (2) FinGPT [58], which presents an end-to-end framework for training and applying FinLLMs in the finance industry. FinGPT utilizes the lightweight Low-Rank Adaptation (LoRA) technique to fine-tune open-source LLMs (such as LLaMA and ChatGLM) using approximately 50k samples. However, FinGPT's evaluation is limited to finance classification tasks. (3) Instruct-FinGPT [63], on the other hand, fine-tunes LLaMA on 10k instruction samples derived from two financial sentiment analysis datasets and also evaluates performance solely on finance classification tasks.

Based on the reported model performance, we summarize our findings as follows:

• Compared to the original base LLM (LLaMA) and other open-source LLMs (BLOOM, OPT [64], ChatGLM [14, 62]), all fine-tuned finance LLMs exhibit significantly better performance across all finance-domain tasks reported in the papers, especially classification tasks.
• The fine-tuned finance LLMs outperform BloombergGPT [55] in most finance tasks reported in the papers.
• When compared to powerful general LLMs like ChatGPT and GPT-4, the fine-tuned finance LLMs demonstrate superior performance in most finance classification tasks, which indicates their enhanced domain-specific language understanding and contextual comprehension abilities. However, in finance generative tasks, the fine-tuned LLMs show similar or worse performance, suggesting the need for more high-quality domain-specific datasets to improve their generative capabilities.

4.3 Pretrain from Scratch

The objective of training LLMs from scratch is to develop models that have even better adaptation to the finance domain. Table 2 presents the current finance LLMs that have been trained from scratch: BloombergGPT, Xuan Yuan 2.0 [66], and Fin-T5 [28].

As shown in Table 2, there is a trend of combining public datasets with finance-specific datasets during the pretraining
phase. Notably, BloombergGPT serves as an example where the corpus comprises an equal mix of general and finance-related text. It is worth mentioning that BloombergGPT draws on a subset of 5 billion tokens that pertain exclusively to Bloomberg, representing only 0.7% of the total training corpus. This targeted corpus contributes to the performance improvements achieved in finance benchmarks.

Both BloombergGPT and Fin-T5 have demonstrated superior performance on finance tasks compared to their general-purpose counterparts, BLOOM-176B and T5, respectively. These tasks include market sentiment classification, multi-categorical and multi-label classification, and more. BloombergGPT achieves an average score of 62.51, surpassing the open-source BLOOM-176B model, which attains a score of 54.35. Similarly, Fin-T5 achieves an average score of 81.78, outperforming the T5 model's score of 79.56. Notably, BloombergGPT was also evaluated using an internal benchmark specifically designed by Bloomberg. On this benchmark, BloombergGPT achieved an average score of 62.47, surpassing BLOOM-176B, which attained a score of 33.39. This outcome highlights that even when the internal private training corpus constitutes less than 1% of the total training corpus, it can still lead to substantial improvements on evaluation tasks within the same domain and distribution.

On finance-related generative tasks such as Question Answering, Named Entity Recognition, and summarization, both models outperformed their respective general models. Specifically, BloombergGPT achieved a score of 64.83, surpassing BLOOM-176B's score of 45.43. Similarly, Fin-T5 scored 68.69, while T5 scored 66.06. These findings further highlight the models' superior performance in generating finance-related content when compared to their general-purpose counterparts.

Although these models are not as powerful as closed-source models like GPT-3 or PaLM [11], they demonstrate similar or superior performance compared to similar-sized public models. In evaluations on various general generative tasks, such as BIG-bench Hard, knowledge assessments, reading comprehension, and linguistic tasks, BloombergGPT performed comparably to or better than similar-sized public models, albeit slightly worse than larger models like GPT-3 or PaLM. This indicates that the model's enhanced capabilities in finance-related tasks do not come at the expense of its general abilities.

5 Decision Process in Applying LLM to Financial Applications

5.1 Determining the Need for an LLM

Before exploring LLM solutions, it is essential to ascertain whether employing such a model is truly necessary for the given task. The advantages of LLMs over smaller models can be summarized as follows, as outlined in the work by Yang et al. [59]:

Leveraging Pretraining Knowledge: LLMs can utilize the knowledge acquired from pretraining data to provide solutions. If a task lacks sufficient training data or annotated data but requires common-sense knowledge, an LLM may be a suitable choice.

Reasoning and Emergent Abilities: LLMs excel at tasks that involve reasoning or emergent abilities [49]. This property makes LLMs well-suited for tasks where task instructions or expected answers are not clearly defined, or when dealing with out-of-distribution data. In the context of financial advisory, client requests in customer service often exhibit high variance and involve complex conversations. LLMs can serve as virtual agents to provide assistance in such cases.

Orchestrating Model Collaboration: LLMs can act as orchestrators between different models and tools. For tasks that require collaboration among various models, LLMs can integrate and utilize these tools together [1, 7, 31]. This capability is particularly valuable when aiming for robust automation of a model solution pipeline.

While LLMs offer immense power, their use comes with a significant cost, whether utilizing a third-party API [33] or fine-tuning an open-source LLM. Therefore, it is prudent to consider conventional models before fully committing to LLMs. In cases where the task has a clear definition (e.g., regression, classification, ranking), there is an ample amount of annotated training data, or the task relies minimally on common-sense knowledge or emergent capabilities like reasoning, relying on LLMs may not be necessary or justified initially.

5.2 A general decision guidance for applying LLMs on finance tasks

Once the decision has been made to utilize LLMs for a finance task, a decision guidance framework can be followed to ensure efficient and effective implementation. The framework, illustrated in Figure 1, categorizes the usage of LLMs into four levels based on computational resources and data requirements. Costs associated with training and data collection increase with each level. It is recommended to start at Level 1 and move to higher levels (2, 3, and 4) only if the model's performance is not satisfactory. The following section provides detailed explanations of the decision and
action blocks at each level. Table ?? presents an approximate cost range for different options, based on pricing from various third-party services like AWS and OpenAI.

5.2.1 Level 1: Zero-shot Applications. The first decision block determines whether to use an existing LLM service or an open-source model. If the input question or context involves confidential data, it is necessary to proceed with the 1A action block, which involves self-hosting an open-source LLM. As of July 2023, several options are available, including LLaMA [45], OpenLLaMA [17], Alpaca [44], and Vicuna [9]. LLaMA offers models with sizes ranging from 7B to 65B, but they are limited to research purposes. OpenLLaMA provides options for 3B, 7B, and 13B models, with support for commercial usage. Alpaca and Vicuna are fine-tuned from LLaMA, offering 7B and 13B options. Deploying your own LLM requires a robust local machine with a suitable GPU, such as an NVIDIA V100 for a 7B model or an NVIDIA A100 or A6000 for a 13B model.

If data privacy is not a concern, selecting third-party LLMs such as GPT-3.5/GPT-4 from OpenAI or Bard from Google is recommended. This option allows for lightweight experiments and early performance evaluation without significant deployment costs. The only cost incurred would be the fees associated with each API call, typically based on input length and the token count of the model's output.

5.2.2 Level 2: Few-shot Applications. If the model's performance at Level 1 is not acceptable for the application, few-shot learning can be explored if several example questions and their corresponding answers are available. Few-shot learning has shown advantages in various previous works [5, 48]. The core idea is to provide a set of example questions along with their corresponding answers as context in addition to the specific question being asked. The cost associated with few-shot learning is similar to that of the previous level, except for the requirement of providing examples each time. Generally, achieving good performance may require using 1 to 10 examples. These examples can be
the same across different questions or selected based on the specific question at hand. The challenge lies in determining the optimal number of examples and selecting relevant ones. This process involves experimentation and testing until the desired performance boundary is reached.

5.2.3 Level 3: Tool-Augmented Generation and Fine-tuning. If the task at hand is extremely complicated and in-context learning does not yield reasonable performance, the next option is to leverage external tools or plugins with the LLM, assuming a collection of relevant tools/plugins is available. For example, a simple calculator could assist with arithmetic-related tasks, while a search engine could be indispensable for knowledge-intensive tasks such as querying the CEO of a specific company or identifying the company with the highest market capitalization.

Integrating tools with LLMs can be achieved by providing the tools' descriptions. The cost associated with this approach is generally higher than that of few-shot learning due to the development of the tool(s) and the longer input sequence required as context. However, there may be instances where the concatenated tool description is too long, surpassing the input length limit of LLMs. In such cases, an additional step such as a simple tool retrieval or filter might be needed to narrow down the tools for selection. The deployment cost typically includes the cost of using the LLMs as well as the cost of using the tool(s).

If the above options fail to produce satisfactory performance, fine-tuning the LLMs can be attempted. This stage requires a reasonable amount of annotated data, computational resources (GPU, CPU, etc.), and expertise in tuning language models, as listed in Table ??.

5.2.4 Level 4: Train Your Own LLMs from Scratch. If the results are still unsatisfactory, the only option left is to train domain-specific LLMs from scratch, similar to what BloombergGPT did. However, this option comes with significant computational costs and data requirements. It typically requires millions of dollars in computational resources and training on a dataset with trillions of tokens. The intricacies of the training process are beyond the scope of this survey, but it is worth noting that it can take several months or even years of effort for a professional team to accomplish.

By following this decision guidance framework, financial professionals and researchers can navigate through the various levels and options, making informed choices that align with their specific needs and resource constraints.

5.3 Evaluation

The evaluation of LLMs in finance can be conducted through various approaches. One direct evaluation method is to assess the model's performance on downstream tasks. Evaluation metrics can be categorized into two main groups: accuracy and performance, based on the taxonomy provided by [57]. The accuracy category can further be divided into metrics for regression (such as MAPE, RMSE, R²) and metrics for classification (Recall, Precision, F1 score). The performance
category includes metrics or measurements that directly assess the model's performance on the specific task, such as measuring total profit or Sharpe Ratio in a trading-related task. These evaluations can be conducted using historical data, backtest simulations, or online experiments. While performance metrics are often more important in finance, it is crucial to ensure that accuracy metrics align with performance to ensure meaningful decision-making and guard against overfitting.

In addition to task-specific evaluations, general metrics used for LLMs can also be applied. Particularly, when evaluating the overall quality of an existing LLM or a fine-tuned one, comprehensive evaluation systems like the one presented in [27] can be utilized. This evaluation system covers tasks for various scenarios and incorporates metrics from different aspects, including accuracy, fairness, robustness, bias, and more. It can serve as a guide for selecting a language model or evaluating one's own model in the context of finance applications.

5.4 Limitations

While significant progress has been made in applying LLMs to revolutionize financial applications, it is important to acknowledge the limitations of these language models. Two major challenges are the production of disinformation and the manifestation of biases, such as racial, gender, and religious biases, in LLMs [43]. In the financial industry, accuracy of information is crucial for making sound financial decisions, and fairness is a fundamental requirement for all financial services. To ensure information accuracy and mitigate hallucination, additional measures like retrieval-augmented generation [25] can be implemented. To address biases, content censoring and output restriction techniques (such as only generating answers from a pre-defined list) can be employed to control the generated content and reduce bias.

LLMs pose potential challenges in terms of regulation and governance. Although LLMs offer more interpretability than conventional deep learning models by providing reasoning steps or thinking processes for the generated answers when prompted correctly [50] [60], an LLM remains a black box and the explainability of the content it generates is highly limited.

Addressing these limitations and ensuring the ethical and responsible use of LLMs in finance applications is essential. Continuous research, development of robust evaluation frameworks, and the implementation of appropriate safeguards are vital steps in harnessing the full potential of LLMs while mitigating potential risks.

6 Conclusion

In conclusion, this paper has conducted a timely and practical survey on the emerging application of LLMs for financial AI. We structured the survey around two critical pillars: solutions and adoption guidance.

Under solutions, we reviewed diverse approaches to harnessing LLMs for finance, including leveraging pretrained models, fine-tuning on domain data, and training custom LLMs. Reported experimental results show significant performance gains over general-purpose LLMs across natural language tasks like sentiment analysis, question answering, and summarization.

To provide adoption guidance, we proposed a structured framework for selecting the optimal LLM strategy based on constraints around data availability, compute resources, and performance needs. The framework aims to balance value and investment by guiding practitioners from low-cost experimentation to rigorous customization.

In summary, this survey synthesized the latest progress in applying LLMs to transform financial AI and provided a practical roadmap for adoption. We hope it serves as a useful reference for researchers and professionals exploring the intersection of LLMs and finance. As datasets and computation improve, finance-specific LLMs represent an exciting path to democratize cutting-edge NLP across the industry.

References

[1] 2023. Auto-GPT: An Autonomous GPT-4 Experiment. https://github.com/Significant-Gravitas/Auto-GPT
[2] 2023. Chatbots in consumer finance. https://www.consumerfinance.gov/data-research/research-reports/chatbots-in-consumer-finance/chatbots-in-consumer-finance/
[3] Talal Almutiri and Farrukh Nadeem. 2022. Markov models applications in natural language processing: a survey. Int. J. Inf. Technol. Comput. Sci 2 (2022), 1–16.
[4] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2000. A neural probabilistic language model. Advances in neural information processing systems 13 (2000).
[5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. CoRR abs/2005.14165 (2020). arXiv:2005.14165 https://arxiv.org/abs/2005.14165
[7] Harrison Chase. 2022. LangChain. https://github.com/hwchase17/langchain
[8] Mu-Yen Chen. 2011. Bankruptcy prediction in firms with statistical and intelligent techniques and a comparison of evolutionary computation approaches. Computers & Mathematics with Applications 62, 12 (2011),
ICAIF-23, New York City, NY,
Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz,
Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada
Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner,
Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan,
Amy Faranak, Ana Santos, Anthony Hevia, Antigona Unldreaj, Arash
Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini,
Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz
Ferrandis, Daniel McDuff, Danish Contractor, David Lansky, Davis
David, Douwe Kiela, Duong A. Nguyen, Edward Tan, Emi Baylor,
Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad,
Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko,
Isar Nejadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Livia
Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa
Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, Muhammed
Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott,
Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan
Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy,
Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach
Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan,
Anima Shukla, Antonio Miranda-Escalada, Ayush Singh,
Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi Zhou, Chirag Jain, Chuxin
Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian
Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel
Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane
Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David
Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu,
Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc
Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger, Matthias
Samwald, Michael Cullan, Michael Weinberg, Michiel De Wolf, Mina
Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam,
Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner,
Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg,
Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya,
Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi,
Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter,
Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma,
Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan
Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras,
Younes Belkada, and Thomas Wolf. 2023. BLOOM: A 176B-Parameter
Open-Access Multilingual Language Model. arXiv:2211.05100 [cs.CL]
[55] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze,
Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and
Gideon Mann. 2023. BloombergGPT: A Large Language Model for
Finance. arXiv:2303.17564 [cs.LG]
[56] Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng,
Alejandro Lopez-Lira, and Jimin Huang. 2023. PIXIU: A Large Language
Model, Instruction Data and Evaluation Benchmark for Finance.
arXiv:2306.05443 [cs.CL]
[57] Frank Xing, Erik Cambria, and Roy Welsch. 2018. Natural language
based financial forecasting: a survey. Artificial Intelligence Review 50
(06 2018). https://doi.org/10.1007/s10462-017-9588-9
[58] Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang.
2023. FinGPT: Open-Source Financial Large Language Models.
arXiv:2306.06031 [q-fin.ST]
[59] Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang
Feng, Haoming Jiang, Bing Yin, and Xia Hu. 2023. Harnessing
the Power of LLMs in Practice: A Survey on ChatGPT and Beyond.
arXiv:2304.13712 [cs.CL]
[60] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L.
Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of
Thoughts: Deliberate Problem Solving with Large Language Models.
arXiv:2305.10601 [cs.CL]
[61] YangMu Yu. 2023. Cornucopia-LLaMA-Fin-Chinese.
https://github.com/jerry1993-tech/Cornucopia-LLaMA-Fin-Chinese.
[62] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming
Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam,
Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu,
Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130B: An Open
Bilingual Pre-trained Model. In The Eleventh International Conference
on Learning Representations (ICLR). https://openreview.net/forum?id=-Aw0rrrPUF
[63] Boyu Zhang, Hongyang Yang, and Xiao-Yang Liu. 2023. Instruct-FinGPT:
Financial Sentiment Analysis by Instruction Tuning of
General-Purpose Large Language Models. arXiv:2306.12659 [cs.CL]
[64] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen,
Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria
Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel
Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke
Zettlemoyer. 2022. OPT: Open Pre-trained Transformer Language
Models. arXiv:2205.01068 [cs.CL]
[65] Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, and Lidong
Bing. 2023. Sentiment Analysis in the Era of Large Language Models:
A Reality Check. arXiv:2305.15005 [cs.CL]
[66] Xuanyu Zhang, Qing Yang, and Dongliang Xu. 2023. XuanYuan 2.0:
A Large Chinese Financial Chat Model with Hundreds of Billions
Parameters. arXiv:2305.12002 [cs.CL]
[67] Zihao Zhang, Stefan Zohren, and Stephen Roberts. 2020. Deep Learning
for Portfolio Optimization. The Journal of Financial Data Science 2,
4 (aug 2020), 8–20. https://doi.org/10.3905/jfds.2020.1.042
[68] Ekaterina Zolotareva. 2021. Aiding Long-Term Investment Decisions
with XGBoost Machine Learning Model. arXiv:2104.09341 [q-fin.CP]