Conversational Factor Information Retrieval Model (ConFIRM)

Stephen Choi, William Gazeley, Siu Ho Wong, Tingting Li


IRAI Labs, LORA Research
{stepchoi, william}@irai.co {adrian, tingting}@asklora.ai

Abstract

This paper introduces the Conversational Factor Information Retrieval Method (ConFIRM), a novel approach to fine-tuning large language models (LLMs) for domain-specific retrieval tasks. ConFIRM leverages the Five-Factor Model of personality to generate synthetic datasets that accurately reflect target population characteristics, addressing data scarcity in specialized domains. We demonstrate ConFIRM's effectiveness through a case study in the finance sector, fine-tuning a Llama-2-7b model using personality-aligned data from the PolyU-Asklora Fintech Adoption Index. The resulting model achieved 91% accuracy in classifying financial queries, with an average inference time of 0.61 seconds on an NVIDIA A100 GPU. ConFIRM shows promise for creating more accurate and personalized AI-driven information retrieval systems across various domains, potentially mitigating issues of hallucinations and outdated information in LLMs deployed in mission-critical environments.

1 Introduction

The rapid advancement of large language models (LLMs) has been remarkable, with significant breakthroughs in scaling laws (Kaplan et al., 2020) and subsequent performance gains (Zhao et al., 2023). However, deploying LLMs in mission-critical domains such as healthcare, law, and finance poses unique challenges that go beyond general AI optimization and alignment. One major concern is the tendency of LLMs to produce convincing yet inaccurate responses, known as "hallucinations," particularly in specialized areas where data is scarce or nuanced.

Additionally, the exponential growth of online data, coupled with the significant resources required for data annotation and model training, hinders the ability of LLMs to remain current. Recent advancements aim to address these shortcomings. Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) integrates information retrieval (IR) methods to mitigate the limitations of static knowledge. Chain-of-Thought prompting (Wei et al., 2022) has spurred the development of task-based workflows and agentic pipelines. These specialized agents are typically trained efficiently on a small subset of the model's parameters that significantly impact the task at hand, using parameter-efficient fine-tuning (PEFT) methodologies (Lialin et al., 2023) such as Low-Rank Adaptation (LoRA) (Hu et al., 2021) and Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022).

However, these training methods require substantial datasets. For instance, the RLHF method outlined by Ouyang et al. (2022) involved generating a 31,000-sample preference dataset for the reward model, underscoring the need for large-scale data in fine-tuning. These methods also perform optimally when the training dataset accurately represents the target population's characteristics. The Conversational Factor Information Retrieval Method (ConFIRM) is designed to assess these characteristics and efficiently synthesize questions at scale for effective fine-tuning. We outline this method in Section 2.

In Section 3, we apply ConFIRM to a real-world example using survey data from Hong Kong Polytechnic University's PolyU-Asklora Fintech Adoption Index¹ to address a financial scenario. While we do not include the full information retrieval instructions due to scope limitations, we focus on classification to highlight the practical business case.

¹ https://www.polyu.edu.hk/en/media/media-releases/2022/1102_polyus-school-of-accounting-and-finance-launches--its-first-fintech-adoption-index-survey
Figure 1: A schematic illustration of the question-answer generation process.

This involves the "retrieval" components of RAG, rather than the additional engineering required to handle units, currency, and other numerical issues. To aid reproduction, we post our model code at: https://github.com/WilliamGazeley/ConFIRM
2 Theoretical Framework

To create a synthetic dataset that accurately reflects personality traits within a population, we utilize the Five-Factor Model (FFM) of personality (McCrae and Costa, 1999). The FFM quantifies personality across five dimensions: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism (OCEAN). This model is supported by extensive research, including behavioral genetics studies confirming its structural integrity (Yamagata et al., 2006), longitudinal studies demonstrating its stability over time (Roberts and DelVecchio, 2000), and cross-cultural research affirming its universal applicability (John et al., 2008). Additionally, significant associations have been established between OCEAN traits and key financial outcomes (Exley et al., 2021), as well as their utility in financial planning (Campbell et al., 2023).

We utilize the 50-question set from the International Personality Item Pool (IPIP)². Instead of employing the varimax factor-loading scoring method (Goldberg, 1992), we track the highest-scoring domain (factor) for each participant.

² https://ipip.ori.org/new_ipip-50-item-scale.htm
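To make this scoring step concrete, the following is a minimal sketch of the max-factor computation, assuming each of the 50 IPIP items is stored with its OCEAN domain, keying direction, and a 1-5 Likert response; the item metadata and function names are illustrative, not taken from the released ConFIRM code.

```python
from collections import defaultdict

# Hypothetical item metadata: item_id -> (OCEAN domain, keying).
# "+" items add the raw 1-5 response; "-" (reverse-keyed) items add 6 - response.
IPIP_ITEMS = {
    "q1": ("Extraversion", "+"),
    "q2": ("Agreeableness", "-"),
    # ... remaining 48 items omitted for brevity ...
}

def max_factor(responses: dict[str, int]) -> str:
    """Return the highest-scoring OCEAN domain for one participant.

    `responses` maps item_id -> Likert rating (1-5). Instead of varimax
    factor loadings, we sum keyed item scores per domain and take the argmax,
    as described in Section 2.
    """
    totals: dict[str, int] = defaultdict(int)
    for item_id, rating in responses.items():
        domain, keying = IPIP_ITEMS[item_id]
        totals[domain] += rating if keying == "+" else 6 - rating
    return max(totals, key=totals.get)

def factor_distribution(all_responses: list[dict[str, int]]) -> dict[str, float]:
    """Share of participants whose max factor is each OCEAN domain (cf. Table 1)."""
    counts: dict[str, int] = defaultdict(int)
    for responses in all_responses:
        counts[max_factor(responses)] += 1
    n = len(all_responses)
    return {domain: count / n for domain, count in counts.items()}
```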
Subsequently, we generate question-answer pairs by leveraging large language models. To craft natural-sounding questions relevant to our task, we employ the iterative instruction generation approach of SELF-INSTRUCT (Wang et al., 2023), aiming to synthesize authentic user interactions. ChatGPT-3.5 (Brown et al., 2020) aids in generating conversational questions.

Through an iterative process of manual filtering and refinement, we select high-quality question-answer pairs. To further emulate a conversational tone, we apply the Text2Text generation method (Ramirez et al., 2023). A few-shot prompt, engineered based on OCEAN FFM max-factor scores, integrates population-aligned factor loadings into the questions. Pseudo-references, derived from meaning representations, are used with an LLM (GPT-3.5) for few-shot in-context learning to generate questions in an extraverted style.
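As an illustration of this re-styling step, the snippet below assembles a few-shot Text2Text prompt that rewrites a base question in the style of a given max factor; the example pseudo-references and the prompt wording are hypothetical and are not the exact prompts used in ConFIRM.

```python
# Hypothetical few-shot exemplars: (neutral question, extraverted rewrite).
STYLE_EXEMPLARS = {
    "Extraversion": [
        ("What is the dividend yield of AAPL?",
         "Hey, quick one - what's the dividend yield on AAPL looking like these days?"),
        ("Show the 52-week high for TSLA.",
         "I'm really curious - how high did TSLA get over the past 52 weeks?"),
    ],
    # ... exemplars for the other four OCEAN factors ...
}

def build_style_prompt(base_question: str, factor: str) -> str:
    """Build a few-shot prompt asking GPT-3.5 to restyle a question for one OCEAN factor."""
    lines = [f"Rewrite the final question in a conversational, {factor.lower()}-leaning style, "
             "keeping its meaning unchanged."]
    for neutral, styled in STYLE_EXEMPLARS[factor]:
        lines.append(f"Question: {neutral}\nRewritten: {styled}")
    lines.append(f"Question: {base_question}\nRewritten:")
    return "\n\n".join(lines)

print(build_style_prompt("What was the closing price of MSFT yesterday?", "Extraversion"))
```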
This approach efficiently generates the large, diverse training set required for robust fine-tuning of task-oriented agents. The final set can be filtered using ROUGE-L scores (Lin and Och, 2004) both vertically (across questions) and horizontally (within pairs) to enhance diversity and reduce redundancy. The question-answer generation process is outlined in Figure 1.
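A minimal sketch of such ROUGE-L filtering is shown below, using a plain longest-common-subsequence implementation; it covers only the "vertical" pass (de-duplicating across questions), and the 0.7 similarity threshold is an illustrative choice rather than a value reported in the paper.

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure between two strings (whitespace tokenization)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_len(cand, ref)
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)

def filter_questions(pairs: list[tuple[str, str]], threshold: float = 0.7) -> list[tuple[str, str]]:
    """Drop question-answer pairs whose question is too similar to one already kept."""
    kept: list[tuple[str, str]] = []
    for question, answer in pairs:
        if all(rouge_l_f1(question, prev_q) < threshold for prev_q, _ in kept):
            kept.append((question, answer))
    return kept
```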
3 Case Study

In this section, we present a practical implementation of the ConFIRM framework through a case study in the finance sector. The objective is to fine-tune a Retrieval-Augmented Generation (RAG) model to effectively retrieve information from an internal database based on questions from a target population. Our goal is to generate a large dataset of question-answer pairs to fine-tune an agent to accurately label the data category necessary to answer each question. For this case study, we focus primarily on stocks, but this approach can easily be extended to include additional or different data categories.

OCEAN Factor        % max factor
Openness            10%
Conscientiousness   36%
Extraversion        14%
Agreeableness       18%
Neuroticism         22%

Table 1: Sample OCEAN max factors
3.1 Retrieval Database

The internal retrieval database structure is modeled after Refinitiv Datastream, a leading and widely used provider of financial data. Refinitiv offers a comprehensive range of global financial data from various sources, including stock exchanges, central banks, and other financial institutions. The platform also provides information on the data category, data types, and descriptions of each data field. Using this foundation, we structure the following data categories for a hypothetical stock investment firm. (For detailed information on the actual fields, please refer to Appendix A.)

Stock data: This category is a table that contains data about individual stocks, including price data, identifiers, descriptions, and data available in company filings. For the scope of this paper, we limit the number of data fields to the top 40 ranked by popularity within the category marked as "Equities" by Refinitiv.

Market data: This category holds data about the overall stock market, including benchmark indices such as the S&P 500, and sector and style benchmarks. We limit the number of data fields to the top 15 ranked by popularity within the "Equity Indices" category.

Economic data: This category holds data and statistics about the US economy as a whole, such as GDP, the unemployment rate, and the consumer price index, as well as leading indicators. We limit the number of data fields to the top 11 ranked by popularity within the "Economics" category.

News data: This category is a data warehouse where all news scraped from Reuters (one of the largest news agencies) is stored.

External data: This category represents the external knowledge base covering all data that is out of scope of the four internal categories. In terms of label accuracy, this category represents the "true negatives". A typical example of this data source is social media or internet sites.
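To make the labeling target concrete, the following is a minimal sketch of how these five categories and a handful of fields could be represented for the classifier; the field names shown are illustrative stand-ins, not the actual Refinitiv Datastream fields listed in Appendix A.

```python
from dataclasses import dataclass

# Hypothetical field lists; the real schema uses the top Refinitiv Datastream
# fields per category (40 for stocks, 15 for market, 11 for economics).
RETRIEVAL_SCHEMA: dict[str, list[str]] = {
    "stock_data":    ["close_price", "ticker", "company_description", "eps"],
    "market_data":   ["sp500_level", "sector_index", "style_benchmark"],
    "economic_data": ["gdp_growth", "unemployment_rate", "cpi"],
    "news_data":     ["headline", "body", "timestamp"],
    # "External" has no internal fields; it is the catch-all "true negative" label.
    "external":      [],
}

@dataclass
class CategoryLabel:
    """Target output of the fine-tuned agent for one user question."""
    categories: set[str]   # e.g. {"stock_data", "market_data"}
    fields: set[str]       # e.g. {"close_price", "sp500_level"}

# Example: "How did AAPL do against the S&P 500 yesterday?" might be labeled as
example = CategoryLabel(categories={"stock_data", "market_data"},
                        fields={"close_price", "sp500_level"})
```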
3.2 Aligned QA Generation

For the case study, we leveraged the HK PolyU-Asklora Fintech Adoption Index survey to measure the OCEAN factors of potential users in Hong Kong. Although the survey included 1000 participants, only a subset of 50 was presented with the IPIP questions, resulting in the percentages shown in Table 1.

Using the methodology described in Section 2, we generated question-answer pairs, which were then expanded to match the personality factor ratios of the sample (Table 1). This process resulted in 3300 samples reflecting the target population. We allocated 3000 samples for training and 300 for validation (10% of the training set). To avoid data leakage, we generated an additional 1000 samples using the same process for testing purposes. Refer to Appendix B for examples of the generated samples.
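The expansion step can be read as sampling styled question-answer pairs until the factor mix of the dataset matches the surveyed max-factor shares; below is a minimal sketch under that reading, where the target ratios come from Table 1 and `generate_pair` is an assumed generator that returns one styled question-answer pair per call.

```python
import random

# Target max-factor shares from Table 1.
TARGET_RATIOS = {"Openness": 0.10, "Conscientiousness": 0.36, "Extraversion": 0.14,
                 "Agreeableness": 0.18, "Neuroticism": 0.22}

def expand_to_ratios(generate_pair, total: int = 3300, seed: int = 0):
    """Build `total` question-answer pairs whose factor mix matches TARGET_RATIOS.

    `generate_pair(factor)` is assumed to return one (question, answer) pair
    styled for the given OCEAN factor (e.g., via the few-shot prompt above).
    """
    rng = random.Random(seed)
    quotas = {f: round(total * ratio) for f, ratio in TARGET_RATIOS.items()}
    dataset = []
    for factor, quota in quotas.items():
        dataset.extend((factor, *generate_pair(factor)) for _ in range(quota))
    rng.shuffle(dataset)                   # mix factors before the train/val split
    return dataset[:3000], dataset[3000:]  # 3000 train, 300 validation
```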
3.3 Fine-Tuning Agent

With the training and test sets prepared, we focus on parameter-efficient fine-tuning of a manageable pre-trained LLM, Llama-2-7b (Touvron et al., 2023). This arrangement allows full control of the model location and network. The goal is to convert a general-purpose base model (Llama-2-7b) into a specialized classifier. We retain the weights of the pre-trained model and apply Low-Rank Adaptation (LoRA) (Hu et al., 2021) for fine-tuning. We utilize the validation set to optimize the hyperparameters (Table 2).

Hyperparameter    LoRA
learning rate     3e-4
batch size        8
epochs            50
max length        128
r                 4
dropout           1e-3
alpha             64

Table 2: LoRA hyperparameters
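As a concrete illustration, the snippet below wires the Table 2 values into the Hugging Face peft library; casting the labeling task as causal-LM fine-tuning and the choice of target modules are our assumptions, not details taken from the paper.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# LoRA hyperparameters from Table 2; target_modules is an assumed choice.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=4,
    lora_alpha=64,
    lora_dropout=1e-3,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base_model, lora_config)  # pre-trained weights stay frozen
model.print_trainable_parameters()               # only the low-rank adapters train

# Remaining Table 2 settings (learning rate 3e-4, batch size 8, 50 epochs,
# max length 128) are passed to the training loop or transformers.Trainer.
```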
[Figure 2: Data efficiency of the LoRA fine-tuned agent, plotting classification accuracy (50%-95%) against training-set size (0-3000 samples).]

3.4 Results

Accuracy is evaluated on two factors: accuracy of internal/external data category labeling and precision of data category and field labels. Instances where false negatives occurred in external labels are marked as failed classifications, underscoring potential regulatory concerns. For the purposes of the case study, a classification is deemed successful if the fine-tuned agent returned at least a superset of the correct data category or categories within the answer set.

The need for a large sample size is illustrated in Figure 2. The LoRA fine-tuned agent only surpassed 80% accuracy after 500 training samples and surpassed 91% accuracy with 3000 training samples. The average runtime for classification (inference) on each test question for the LoRA-tuned model on a single NVIDIA A100 GPU is 0.61 seconds.
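A minimal sketch of this success criterion, assuming gold labels and model outputs are represented as sets of category names (the helper names are ours):

```python
def is_successful(predicted: set[str], gold: set[str]) -> bool:
    """A classification succeeds if the returned categories are at least a superset
    of the gold categories. Routing an internal-data question to "external" therefore
    counts as a failure, the false-negative case flagged above."""
    return gold <= predicted

def accuracy(examples: list[tuple[set[str], set[str]]]) -> float:
    """Fraction of (predicted, gold) category-set pairs counted as successful."""
    return sum(is_successful(pred, gold) for pred, gold in examples) / len(examples)

# Example: predicting {"stock_data", "market_data"} when gold is {"stock_data"}
# succeeds; predicting {"external"} for that same question fails.
```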
4 Conclusion

In this paper, we introduced the Conversational Factor Information Retrieval Method (ConFIRM), a novel approach to synthetically generating training samples to fine-tune large language models for domain-specific retrieval tasks. By leveraging the Five-Factor Model of personality and advanced language model techniques, ConFIRM addresses the critical challenge of creating representative synthetic datasets for specialized domains.

Our case study in the finance sector demonstrated the efficiency of ConFIRM in generating training samples for fine-tuning a Llama-2-7b model for accurate information retrieval and classification. The results show that with sufficient training data generated using our method, we achieved over 91% accuracy in classifying financial queries, highlighting the potential of this approach for real-world applications. While our focus was on the finance sector, the ConFIRM framework has potential applications across various domains where personalized information retrieval is crucial, such as healthcare, legal services, and customer support.

Future work could examine the scalability of this approach to even larger language models, expand on the FFM integration, and explore how ConFIRM can be adapted to other human preference optimization techniques such as Direct Preference Optimization (DPO) (Rafailov et al., 2023).
References

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

W. Keith Campbell, Jim Exley, and Patrick Doyle. 2023. The Big Five Personality Traits (OCEAN) and Financial Planning: A Narrative Review and Recommendations for Advisors. Financial Services Review, 31(4), 228–245.

Jim Exley, Patrick Doyle, Michael Snell, and W. Keith Campbell. 2021. OCEAN: How does personality predict financial success. Journal of Financial Planning, 34(10), 68–86.

Lewis R. Goldberg. 1992. The development of markers for the Big-Five factor structure. Psychological Assessment, 4, 26–42.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685.

Oliver P. John, Laura P. Naumann, and Christopher J. Soto. 2008. Paradigm shift to the integrative Big Five trait taxonomy: History, measurement, and conceptual issues. In Handbook of Personality: Theory and Research (3rd ed., pp. 114–158). The Guilford Press.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. arXiv preprint arXiv:2303.15647.

Chin-Yew Lin and Franz Josef Och. 2004. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 605–612, Barcelona, Spain.

Robert McCrae and Paul Costa. 1999. A Five-Factor theory of personality. In Handbook of Personality: Theory and Research (pp. 139–153). Guilford Press, New York, NY.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with Human Feedback. Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022).

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv preprint arXiv:2305.18290.

Angela Ramirez, Mamon Alsalihy, Kartik Aggarwal, Cecilia Li, Liren Wu, and Marilyn Walker. 2023. Controlling Personality Style in Dialogue with Zero-Shot Prompt-Based Learning. In International Workshop on Spoken Dialogue Systems Technology (IWSDS) 2023.

Brent W. Roberts and Wendy F. DelVecchio. 2000. The rank-order consistency of personality traits from childhood to old age: A quantitative review of longitudinal studies. Psychological Bulletin, 126(1), 3–25.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv preprint arXiv:2212.10560.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022).

Shinji Yamagata, Atsunobu Suzuki, Juko Ando, Yutaka Ono, Nobuhiko Kijima, Kimio Yoshimura, Fritz Ostendorf, Alois Angleitner, Rainer Riemann, Frank M. Spinath, W. John Livesley, and Kerry L. Jang. 2006. Is the genetic structure of human personality universal? A cross-cultural twin study from North America, Europe, and Asia. Journal of Personality and Social Psychology, 90(6), 987–998.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large Language Models. arXiv preprint arXiv:2303.18223.
A Appendix – Data Categories and Labels
B Appendix – QA Generation Examples
Base question generation from data categories and labels
Questions transformed to incorporate OCEAN personality
