Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation

Nurshat Fateh Ali, Md. Mahdi Mohtasim, Shakil Mosharrof, T. Gopi Krishna
Department of Computer Science and Engineering
Military Institute of Science and Technology
Dhaka, Bangladesh

Abstract—This research presents and compares multiple approaches to automate the generation of literature reviews using several Natural Language Processing (NLP) techniques and retrieval-augmented generation (RAG) with a Large Language Model (LLM). The ever-increasing number of research articles poses a major challenge for manual literature review and has resulted in an increased demand for automation. The primary objective of this research work is to develop a system capable of automatically generating literature reviews from only PDF files as input. To meet this objective, the effectiveness of several NLP strategies is evaluated: a frequency-based method (spaCy), a transformer model (Simple T5), and retrieval-augmented generation with a Large Language Model (GPT-3.5-turbo). The SciTLDR dataset is chosen for the experiments, and the three techniques are used to implement three different systems for auto-generating literature reviews. ROUGE scores are used to evaluate all three systems. Based on the evaluation, the Large Language Model GPT-3.5-turbo achieved the highest ROUGE-1 score, 0.364. The transformer model comes in second place and spaCy is in last position. Finally, a graphical user interface is created for the best system, which is based on the large language model.

Index Terms—T5, spaCy, Large Language Model, GPT, ROUGE, Literature Review, Natural Language Processing, Retrieval-augmented generation.

I. INTRODUCTION

Literature reviews have gained considerable importance for scholars. They provide researchers with a comprehensive overview of previous findings in a specific field and assist scholars in identifying gaps in past understanding. They help guide future research and inform researchers of areas where they can provide significant input. However, conducting literature reviews can be incredibly cumbersome because there is so much to read. Due to the vast volume of research articles being released, reviewing all related studies and extracting relevant information can be a time-consuming, tedious, and error-prone task. Due to these difficulties, there has been an increasing interest in automating the process of literature reviews [1]. Automated systems can use natural language processing techniques and machine learning algorithms to analyze extensive amounts of text, extract relevant details, and create structured summaries [2].

The primary objective of this research is to develop a system that can automatically generate the literature review segment of a research paper by using only the PDF files of the related papers as input. Several Natural Language Processing techniques, namely a frequency-based approach, a transformer-based approach, and a Large Language Model-based approach, are implemented and compared to find the best procedure. The SciTLDR dataset [3] is selected for this research work. The first procedure uses the frequency-based approach, implemented with the spaCy library [4]. The second procedure uses the transformer-based Simple T5 model. The last procedure is based on a Large Language Model, GPT-3.5-TURBO-0125. The evaluation and comparison are performed using ROUGE scores [5]. Then the best approach is identified and a Graphical User Interface-based tool is created.

Automating aspects of the literature review process allows academicians to save time and concentrate on the most pertinent articles for their research. It can also reduce the chance of errors or prejudice in the review process. The highlights of this article are:
• All three considered NLP approaches (spaCy, T5, and the GPT-3.5-TURBO-0125 model) can produce satisfactory results in automating literature review generation.
• The LLM-based model outperforms T5 and spaCy in generating literature reviews.
II. LITERATURE REVIEW

A framework for automatically producing systematic literature reviews was proposed by da Silva Júnior and Dutra [6]. They focused on four technical steps: Searching, Screening, Mapping, and Synthesizing. In response to a specific inquiry, extensive searches are conducted to find as much relevant research as feasible, which involves looking through reference lists, scouring internet databases, and reviewing published materials. Screening reduces the search scope by limiting the collection to only the papers pertinent to a particular review, aiming to highlight important findings and facts that could influence policy. Mapping is used to comprehend research activity in a particular area, involve stakeholders, and define priorities concerning the review emphasis. Synthesizing integrates data from numerous sources and provides an overview of the outcomes. The formulation of research questions, the reporting phase, and peer review are further steps discussed for the composition of systematic literature reviews.

Peer-reviewed publications are growing exponentially with the rapid development of science. Therefore, Yuan et al. [7] explored the use of machine learning techniques, natural language generation, multi-document summarization, and multi-objective optimization for automating scientific reviewing. They discussed the generation of comprehensive reviews and noted the limitations of constructive feedback compared to human-written reviews. The models used in that research are not yet fully capable of automating literature reviews and still require human reviewers.

A comprehensive analysis of existing tools for systematic literature reviews was done by Karakan et al. [8]. They explored the potential for automation in various phases of the review process, highlighting the need for a holistic tool design to address researchers' challenges effectively. They used two methodologies: Rapid Review and Semi-Structured Interviews. Rapid Review emphasizes decision-making procedures for resolving issues, difficulties, and challenges that software engineers encounter in their daily work. Semi-structured interviews are used to explore researchers' experiences, challenges, and strategies, the strengths and weaknesses of Systematic Literature Review tools, and the requirements for effective support in software engineering.

Jaspers et al. [9] focused on the use of machine learning techniques for the automation of literature reviews and systematic reviews. They outlined the pros and cons of different machine learning techniques and discussed the process of automating the literature review at length. The paper lacks practical validation across diverse domains and detailed insights.

A concise overview of automated literature reviews was presented by Tauchert et al. [10]. They emphasized the potential for automation in various stages of the systematic review process. The paper discusses the importance of integrating computational techniques to streamline tasks such as searching, screening, extraction, and synthesis. It also acknowledges the need for further research to address challenges and enhance the effectiveness of automated approaches.

A brief overview of automatic literature review tools was given by Tsai et al. [11]. They discussed the existing research in the field, the challenges faced in conducting literature reviews manually, and the potential benefits of automating the process. The main focus of their contribution is the evaluation of the Mistral LLM's effectiveness in the field of academic research.

The gaps in the intersection of systematic literature reviews (SLRs) and LLMs are discussed by Susnjak et al. [12]. They also emphasized the need to address challenges in the synthesis phase of research and highlighted the potential of fine-tuning LLMs with datasets to enhance knowledge synthesis accuracy. The study aims to bridge this gap by proposing a Systematic Literature Review automation framework.

Most of the related works discussed above focus mainly on the potential and challenges of using NLP techniques and LLMs to automate the literature review process. None of them proposes a complete system pipeline where users can directly generate the literature review using only the PDF and DOI. In contrast, this article proposes and implements three unique end-to-end pipelines and procedures for a literature review automation system. This research endeavor has also resulted in the implementation of a UI tool where users can directly upload PDFs and get a literature review segment generated automatically without any additional effort. Moreover, this paper includes a comparative analysis of the frequency-based, transformer-based, and RAG-based approaches using ROUGE scores, which helps establish the effectiveness of these approaches for this task.

III. SYSTEM DESIGN

The research is carried out in four stages: (1) defining research objectives, (2) proposing multiple procedures for automated literature review generation, (3) evaluating the procedures to find the best approach, and (4) developing the final system.

A. Dataset Selection

The SciTLDR dataset from Hugging Face is selected for this research work [13]. It contains summaries of scientific documents: 5,400 TLDRs derived from over 3,200 papers, including both author-written and expert-derived TLDRs. The curated research articles' abstract, introduction, and conclusion (AIC) or the full text of the paper are given as "source", and the summaries of the corresponding articles are given as "target". Only these two attributes are utilized in all three proposed procedures. There is no training for the spaCy approach, but the dataset is utilized for testing purposes. The T5 model is trained using the SciTLDR dataset for the transformer-based approach and later evaluated on the test dataset. For the LLM-based approach, this dataset is used as the knowledge base for the model.
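As an illustration of how the "source" and "target" attributes of the dataset can be consumed, the following minimal sketch flattens SciTLDR-style records into text pairs. The record layout (a list of sentences under "source" and a list of reference summaries under "target"), the helper name to_pairs, and the optional prefix argument are assumptions made for illustration; they are not taken from the systems described in this paper. The prefix reflects the common T5 convention of prepending a task string such as "summarize: "; whether the authors used that exact string is not stated.

```python
from typing import Dict, List, Tuple

# Assumed record layout: {"source": [sentence, ...], "target": [summary, ...]}
Record = Dict[str, List[str]]


def to_pairs(records: List[Record], prefix: str = "") -> List[Tuple[str, str]]:
    """Flatten records into (source_text, target_text) pairs.

    One pair is emitted per reference summary. `prefix` is an optional
    task string placed in front of the source text (hypothetical helper,
    not part of any library).
    """
    pairs: List[Tuple[str, str]] = []
    for record in records:
        sentences = [s.strip() for s in record.get("source", []) if s.strip()]
        if not sentences:
            continue  # nothing to summarize for this record
        source_text = prefix + " ".join(sentences)
        for target in record.get("target", []):
            if target.strip():
                pairs.append((source_text, target.strip()))
    return pairs


if __name__ == "__main__":
    toy = [{"source": ["We study X.", "Results improve."],
            "target": ["X improves results."]}]
    for source_text, target_text in to_pairs(toy, prefix="summarize: "):
        print(source_text, "->", target_text)
```

Keeping one pair per reference summary is only one possible choice; a training setup could equally pick a single summary per paper.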
B. The Procedure Utilizing the Frequency-Based Approach using spaCy

The first procedure utilizes the frequency-based approach by using spaCy. The first task is to build the model pipeline. The model pipeline takes text as input and converts it into NLP tokens using the spaCy library. A preprocessing step then removes stop words and punctuation. Afterward, the frequency of each word is calculated, which is later used to calculate individual sentence weights. The sentence weight represents the importance of that sentence. Then the top 10 percent of sentences are selected as the final output. The model is later evaluated using ROUGE scores to get an overview of the performance. The overview of the spaCy model is given in Figure 1.

Figure 1: Building spaCy Model

The next step is to implement a system pipeline that uses the spaCy model to generate a literature review segment automatically. The system takes the DOI and PDF files of multiple papers as input. It uses the Requests library to collect the paper titles and first author names from the DOI. Then it uses the PyPDF2 and regular expression (re) libraries to collect only the conclusion of each PDF. Then it uses the previously implemented spaCy model to get a summary of each paper. Later it performs post-processing and merges all summaries to produce a coherent literature review segment. The system pipeline overview of the spaCy model is given in Figure 2.

Figure 2: Pipeline using spaCy

C. The Procedure Utilizing the Transformer-Based T5 Model

The second approach utilizes the transformer-based Simple T5 model. The first task is to train the model and prepare it for the final pipeline. The SciTLDR dataset is collected and prepared as the training data for the selected model. A task-specific prefix is added for summarizing individual papers. The model is then fine-tuned on the training data as per the requirements and used to predict results, which are the summaries of individual papers. Then the evaluation is performed using ROUGE scores and the model is saved for later use in the system pipeline. The training overview of the transformer model is given in Figure 3.

Figure 3: Training of Transformer Model

The next step is to implement a system pipeline that uses the transformer-based model to generate a literature review segment automatically. The system takes the DOI and PDF of multiple papers as input. It uses the Requests library to collect the paper titles and first author names from the DOIs. Then it uses the PyPDF2 and regular expression (re) libraries to collect each PDF's abstract, introduction, and conclusion, and merges these three sections to form the final model input. Later it uses the previously trained and saved T5 model to get a summary of each paper. In the next step, it performs post-processing and merges all summaries to produce a coherent literature review segment. The system pipeline overview of the transformer model is given in Figure 4.

Figure 4: Pipeline using Transformer Model
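The frequency-based procedure of Section III-B can be sketched in a few lines. The sketch below is an illustration, not the authors' implementation: it replaces spaCy tokenization with a regular expression and uses a tiny hand-written stop-word list so that it runs with the Python standard library alone. The function names, the naive sentence splitter, and the choice to sum word frequencies as the sentence weight are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); spaCy ships a much larger one.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
              "in", "is", "it", "of", "on", "that", "the", "this", "to", "with"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased word tokens with stop words and punctuation removed."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, ratio=0.10):
    """Return the highest-weighted sentences, kept in document order."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    # Word frequency over the whole text.
    freq = Counter(w for s in sentences for w in content_words(s))
    # Sentence weight = sum of the frequencies of its content words (assumption).
    weights = [sum(freq[w] for w in content_words(s)) for s in sentences]
    # Keep the top `ratio` share of sentences, but always at least one.
    keep = max(1, round(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(top))


if __name__ == "__main__":
    print(summarize("Cats chase mice. Dogs chase cats and mice. Birds sing."))
```

Summing frequencies favors long sentences; normalizing by sentence length is a common alternative and would be an equally valid reading of the description above.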
D. The Procedure Utilizing the Large Language Model: GPT-3.5-TURBO-0125

The third procedure utilizes the RAG-based approach with the Large Language Model GPT-3.5-TURBO-0125. The first task is to create a custom OpenAI Assistant. First, the SciTLDR dataset is collected, and then the GPT-3.5-TURBO-0125 model is selected for the OpenAI Assistant. Retrieval is turned on and the dataset is added to the knowledge base of the LLM. Prompt engineering is then performed to produce the required output, and the LLM results are evaluated using ROUGE scores. The overview of the creation of the OpenAI Assistant is given in Figure 5.

Figure 5: Creation of Custom OpenAI Assistant

The prompt used: “The user will give you a pdf file as input, similar to the “input” field of the given “data.json” file in your knowledge base. You have to produce a summarized “output” for the given pdf based on the file given to your knowledge. The output will be of max 80 words. Note: You must write in a way that can be considered a literature review of a new research paper. The user in the future might add more PDFs so try to make the literature review coherent and as per IEEE standards. Please mention the first author’s name and paper title. Don’t write like this “Literature Review of...”.”

Figure 6: Pipeline using LLM

The next step is to implement a system pipeline that uses the LLM to generate a literature review segment automatically. The system takes PDFs of multiple papers as input. It uses the PyPDF2 library to extract the entire text of each PDF. Then it creates a new thread with the extracted text as a message and submits the thread to the assistant with the extracted text as a query. Then the response from the assistant is retrieved and the outputs for all papers are merged into the final literature review segment. The system pipeline overview of the LLM is given in Figure 6.

E. The Final System Tool

The final system is implemented using the Large Language Model GPT-3.5-TURBO-0125 as the backend. An aesthetic and simple user interface is created where the user can easily upload multiple research articles as PDF files. The user presses the "Browse files" button and then selects the files to upload. The system then loads the research papers and, within a few seconds, produces the literature review segment automatically. It processes each paper individually and produces output. The loading screen and the count of processed files indicate the progress level and the number of processed papers. At the end of the literature review, the UI shows the text "Done" to indicate the completion of the task. The user interface of the system is given in Figure 7.

Figure 7: The Preview of the System UI

IV. SYSTEM EVALUATION

ROUGE scores are used for the evaluation in this research. The evaluation is done on the test data of the selected dataset. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating the quality of machine-generated summaries by comparing them to reference summaries. The ROUGE metrics used are:
• ROUGE-N (precision, recall, and F1 score for n-gram overlaps),
• ROUGE-L (measuring the longest common subsequence),
• ROUGE-Lsum (ROUGE-L for summary-level evaluation).

A. Evaluation of Frequency-Based spaCy

The spaCy-based model was evaluated on the test data using the ROUGE scores. The results are stated in Table I.
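The ROUGE-N and ROUGE-L metrics listed in this section can be sketched as follows. This is a simplified, standard-library illustration of the definitions (whitespace tokenization, no stemming, a single reference summary), not the evaluation code used in this paper, which does not state which ROUGE implementation was used. ROUGE-Lsum, which applies the longest-common-subsequence idea across the sentences of a multi-sentence summary, is omitted. All function names are hypothetical.

```python
from collections import Counter


def _tokens(text):
    """Lowercase whitespace tokenization (assumption; libraries often stem too)."""
    return text.lower().split()


def _prf(overlap, n_candidate, n_reference):
    """Precision, recall, and F1 from an overlap count and the two totals."""
    precision = overlap / n_candidate if n_candidate else 0.0
    recall = overlap / n_reference if n_reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def rouge_n(candidate, reference, n=1):
    """ROUGE-N: clipped n-gram overlap between candidate and reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(_tokens(candidate)), ngrams(_tokens(reference))
    overlap = sum((cand & ref).values())  # Counter intersection clips the counts
    return _prf(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate, reference):
    """ROUGE-L: based on the longest common subsequence of tokens."""
    c, r = _tokens(candidate), _tokens(reference)
    # Dynamic-programming table: table[i][j] = LCS length of c[:i] and r[:j].
    table = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, c_tok in enumerate(c, 1):
        for j, r_tok in enumerate(r, 1):
            if c_tok == r_tok:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return _prf(table[len(c)][len(r)], len(c), len(r))


if __name__ == "__main__":
    cand = "the cat sat on the mat"
    ref = "the cat is on the mat"
    print("ROUGE-1 (P, R, F1):", rouge_n(cand, ref, 1))
    print("ROUGE-2 (P, R, F1):", rouge_n(cand, ref, 2))
    print("ROUGE-L (P, R, F1):", rouge_l(cand, ref))
```

Scores from a sketch like this will generally differ slightly from those of established ROUGE packages, which apply their own tokenization and optional stemming.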
Table I: ROUGE Scores for spaCy

Metric        Score
ROUGE-1       0.257
ROUGE-2       0.055
ROUGE-L       0.144
ROUGE-Lsum    0.146

B. Evaluation of Transformer T5

The transformer-based model was evaluated on the test data using the ROUGE scores. The results are stated in Table II.

Table II: ROUGE Scores for T5

Metric        Score
ROUGE-1       0.268
ROUGE-2       0.115
ROUGE-L       0.204
ROUGE-Lsum    0.204

C. Evaluation of Large Language Model: GPT-3.5-TURBO-0125

The LLM-based model was evaluated on the test data using the ROUGE scores. The results are stated in Table III.

Table III: ROUGE Scores for LLM

Metric        Score
ROUGE-1       0.364
ROUGE-2       0.123
ROUGE-L       0.181
ROUGE-Lsum    0.182

D. Comparison of Multiple Approaches

The comparison between all three approaches is provided in Table IV.

Table IV: Comparison of ROUGE Scores

Metric        T5      spaCy   GPT-3.5-TURBO-0125
ROUGE-1       0.268   0.257   0.364
ROUGE-2       0.115   0.055   0.123
ROUGE-L       0.204   0.144   0.181
ROUGE-Lsum    0.204   0.146   0.182

From the ROUGE scores, the LLM-based model outperformed both T5 and spaCy on ROUGE-1 and ROUGE-2, while T5 obtained the highest ROUGE-L and ROUGE-Lsum scores. The Transformer-based model is in the second spot based on the ROUGE-1 score. The evaluation suggests that large language models are superior to older NLP approaches for this task.

E. User Evaluation on UI

The user evaluation was performed on the LLM-based UI system. The user chose 4 research articles as input. These were:
• AI-IoT based Healthcare Prognosis Interactive System
• Applying Internet of Things and Machine-Learning for Personalized Healthcare: Issues and Challenges
• Improving patients health-care using IoT
• IoT Based AI and its Implementations in Industries

The received output from the system: “The paper by Joshua Ernest Pedi Reddy et al. presents an AI-IoT based Healthcare Prognosis Interactive System (HPIS) to provide real-time medical diagnosis and support in the absence of healthcare workers. The system includes an AI-based chatbot and Application Interface to gather information, answer medical queries, assist users, and ensure medication adherence. With a focus on improving healthcare accessibility and quality, the system integrated smart pill dispensers for monitoring and enhancing medication adherence, offering great potential in revolutionizing healthcare services. Ahamed explores the application of IoT and Machine Learning in Personalized Healthcare (PH) to enhance disease management and intervention. IoT sensor devices and wearables collect patient data, which is analyzed using AI and ML techniques for disease prediction and patient self-management. Challenges include biased data collection, outdated training datasets, and privacy concerns. The integration of IoT and ML in PH faces issues such as data transmission reliability and biased training datasets, impacting diagnostic accuracy. Addressing these challenges is crucial for improving personalized healthcare systems. Khurana implements a Smart Healthcare System using IoT sensors to enhance patient care in hospitals. The system utilizes Ultrasonic and IR Proximity Sensors connected to an Arduino Uno for automated IV fluid level monitoring and patient alarm system. The literature review highlights the importance of automated Hospital Management Systems for efficient healthcare administration. The proposed system reduces manpower, costs, and human errors while improving patient care. Future research aims to enhance system scalability and functionality for widespread hospital use. Sherif El-Gendy explores the integration of IoT and AI in industries in the paper ‘IoT Based AI and its Implementations in Industries.’ The paper delves into Industry 4.0, IIoT, IAIoT, and IoRT, showcasing the impact on automation and robotics. It discusses IoT challenges, benefits of AI in data analysis, and presents case studies like oil field production optimization and smart robotics by companies like ABB and Boeing. The future of IoT/AI integration promises transformative advancements in various sectors.”

V. RESULT AND DISCUSSION

The study introduced three procedures for automated literature review generation. The research work also illustrates the performance comparison between various NLP approaches
such as the frequency-based method (spaCy), the transformer model (Simple T5), and retrieval-augmented generation (RAG) with an LLM (GPT-3.5-turbo). All three procedures are implemented and the ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum scores are calculated on the test dataset. For all three approaches, the ROUGE-1 and ROUGE-2 scores are found above the acceptable mark.

From the evaluation, it is seen that the GPT-3.5-turbo model produced results with higher ROUGE-1 and ROUGE-2 scores than spaCy and T5. The overall ROUGE-1 score for the LLM is 0.364, while the score for T5 is 0.268 and for spaCy is 0.257. This shows that the LLM-generated summaries have better unigram and bigram overlap with human summaries. The transformer T5 is also an advanced model and comes in second place. The last position is occupied by the frequency-based spaCy model.

From the scores, the LLM, the most advanced of the three models, outperformed the other NLP techniques on ROUGE-1 and ROUGE-2. However, the other approaches, the transformer model and the frequency-based approach, are also capable of producing satisfactory ROUGE scores and a coherent literature review segment.

VI. CONCLUSION AND FUTURE SCOPES

The research focused on implementing and comparing various NLP techniques for automated literature review. All three implemented systems are successful in generating a coherent literature review segment of a research paper. The results of the various Natural Language Processing techniques, namely the frequency-based approach, the transformer model, and the Large Language Model, are also successfully obtained and compared. Based on the comparisons, the LLM-based approach is found to be the best-performing one on ROUGE-N scores. Thus, a final system tool based on the LLM is also successfully developed, where the user can upload multiple PDF files to automatically generate a coherent literature review segment.

Future work can focus on enhancing the effectiveness and applicability of the developed system tool. More functionality can be added to the Graphical User Interface, such as model options and output size. More models such as BERT, Gemini, and LLaMA can be utilized to find better results.

REFERENCES
[1] Felizardo KR, Carver JC. Automating systematic literature review. Contemporary empirical methods in software engineering. 2020:327-55.
[2] Adhikari S. NLP based machine learning approaches for text summarization. In 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC) 2020 Mar 11 (pp. 535-538). IEEE.
[3] Cachola I, Lo K, Cohan A, Weld DS. TLDR: Extreme summarization of scientific documents. arXiv preprint arXiv:2004.15011. 2020 Apr 30.
[4] Jugran S, Kumar A, Tyagi BS, Anand V. Extractive automatic text summarization using SpaCy in Python & NLP. In 2021 International conference on advance computing and innovative technologies in engineering (ICACITE) 2021 Mar 4 (pp. 582-585). IEEE.
[5] Ali NF, Tanvin JU, Islam MR, Ahmed J, Akhtaruzzaman M. ROUGE Score Analysis and Performance Evaluation Between Google T5 and SpaCy for YouTube News Video Summarization. In 2023 26th International Conference on Computer and Information Technology (ICCIT) 2023 Dec 13 (pp. 1-6). IEEE.
[6] da Silva Júnior EM, Dutra ML. A roadmap toward the automatic composition of systematic literature reviews. Iberoamerican Journal of Science Measurement and Communication. 2021 Jul 27.
[7] Yuan W, Liu P, Neubig G. Can we automate scientific reviewing? Journal of Artificial Intelligence Research. 2022 Sep 29;75:171-212.
[8] Karakan B, Wagner S, Bogner J. Tool support for systematic literature reviews: Analyzing existing solutions and the potential for automation (Doctoral dissertation, University of Stuttgart).
[9] Jaspers S, De Troyer E, Aerts M. Machine learning techniques for the automation of literature reviews and systematic reviews in EFSA. EFSA Supporting Publications. 2018 Jun;15(6):1427E.
[10] Tauchert C, Bender M, Mesbah N, Buxmann P. Towards an integrative approach for automated literature reviews using machine learning.
[11] Tsai HC, Huang YF, Kuo CW. Comparative analysis of automatic literature review using mistral large language model and human reviewers.
[12] Susnjak T, Hwang P, Reyes NH, Barczak AL, McIntosh TR, Ranathunga S. Automating research synthesis with domain-specific large language model fine-tuning. arXiv preprint arXiv:2404.08680. 2024 Apr 8.
[13] AllenAI. SciTLDR Dataset. [Dataset]. Hugging Face. [Online]. Available: https://huggingface.co/datasets/allenai/scitldr. [Accessed: Sep. 8, 2024].