AI-Based Literature Reviews: A Topic Modeling Approach: Manoj Kumar Verma and Mayank Yuvaraj
Abstract
The purpose of this paper is to highlight the importance of topic modelling in conducting literature reviews using the open-source
LDAShiny package in the R environment, with green libraries literature as a case study. To conduct the analysis, a
title and abstract dataset was prepared from the Scopus database and imported into the LDAShiny package for further
analysis. It was found that the green libraries literature spanned 1989 to 2023, with a sharp increase in research topics
since 2003. The study also identified key themes and documents associated with green libraries research, revealing that
energy efficiency, waste reduction and recycling, and the use of sustainable materials have been extensively discussed in
the literature. However, further research is needed on the implementation of these practices in libraries, as well as the
impact of the COVID-19 pandemic on green libraries. The findings will be beneficial to researchers interested in using topic
modelling for literature reviews.
Keywords: Green Libraries, Latent Topics, LDA Shiny, Literature Review, Topic Modelling
com/), Inciteful (https://inciteful.xyz/), LitMaps (https://www.litmaps.com/), Open Knowledge Maps (https://openknowledgemaps.org/), JSTOR Text Analyzer (https://www.jstor.org/analyze/). Also, to deal with the volume, machine learning algorithms are employed to screen papers for relevant ones. ASReview (https://asreview.nl/) is one such free and open-source tool, which uses active learning (a type of machine learning) to train a model that uses limited examples to predict relevance from texts. ASReview performs automated title-abstract screening and ranks the papers by their predicted relevance. Other similar machine learning systems available for use in systematic reviews are Rayyan (https://www.rayyan.ai/), Colandr (https://www.colandrcommunity.com/), Covidence (https://www.covidence.org/), EPPI Reviewer (https://eppi.ioe.ac.uk/CMS/Default.aspx?alias=eppi.ioe.ac.uk/cms/er4&), FASTREAD (https://github.com/fastread/src), and SWIFT Review (https://www.sciome.com/swift-review/).
Although these two approaches are useful for identifying relevant papers in any domain, which can then be reviewed to identify research gaps and to understand key research themes, researchers face a high time cost when they must manually read a large number of papers from the relevant set. It has been found that a seasoned reviewer can screen about two abstracts per minute on average, but more complex abstracts can take much longer (Wallace et al., 2010). Although the artificial intelligence-based research assistant Elicit (https://elicit.org/) can be used to perform article summarization tasks, it cannot identify the latent topics in the literature, which are crucial for understanding research gaps and trends. A simple solution can be found in topic modelling, which can be automated, making it an ideal tool for conducting an exploratory literature review (Antons et al., 2023; Asmussen and Moller, 2019). According to Kavvadias, Drosatos and Kaldoudi (2020), identification of research topics in published literature has emerged as a powerful tool to help researchers navigate a wide range of publications and get quick overviews of evolving research fields, and it can be easily accomplished through topic modelling. The automated topic modelling approach complements traditional approaches to content analysis (Schmiedel, Muller and Brocke, 2019). Topic modelling is the process of condensing the text into topics composed of connected words based on statistical correlation. With topic modelling, researchers can better understand large volumes of data spanning extended periods. It has been applied extensively to understand the latent topics in newspapers (Ahmed and Khan, 2022), journals (Ozyurt and Ayaz, 2022), and research articles (Xie, Ning and Sun, 2022; Mostafa, 2022). The present paper seeks to demonstrate the importance of topic modelling in conducting literature reviews, using the open-source LDAShiny package and green libraries literature as a case study to discover the hidden topics in the domain.
2. Topic Modelling
Topic modelling is a text-mining technique often used in machine learning and natural language processing. It is an effective method of analyzing and summarizing large amounts of textual data without human intervention to reveal hidden semantic patterns (latent topics). In the case of a collection of documents, it facilitates identifying hidden topics. Words making sense together are presented with each topic. The topic-associated words can help organize and provide insight into large amounts of unstructured text. The basic idea behind topic modelling is that each document is regarded as a mixture of topics, and each word within the document has a certain probability of belonging to a specific topic.
In the results obtained from topic modelling, two matrices are represented: the word-topic matrix (probability of a certain word belonging to a topic) and the topic-document matrix (probability of a particular topic appearing in a specific document); however, the end-users usually select the top words (words that have the highest probability in a topic) and the most probable texts. In terms of topic modelling algorithms, Latent Dirichlet Allocation (LDA) is the simplest, most well-studied, and most widely accepted method. In addition, LDA can effectively uncover latent topics and co-occurrences between words (Mustak et al., 2021). LDA was used in this study for analyzing the collected papers on green libraries.
While there are other packages in the R environment through which topic modelling can be performed, LDAShiny is the only free statistical software package providing a GUI that allows analysts and researchers to perform LDA-focused scientific literature reviews interactively. LDAShiny is primarily intended for
from the Scopus database was uploaded to the software. A statistical summary of uploaded data is shown in Figure 1.
It can be seen from Figure 1 that the publications on green libraries ranged from 1989 to 2023. The mean and median years of publication are 2016 and 2018 respectively. Each of the five exported metadata fields has a length of 89. The next step in LDAShiny involves data cleaning, in which n-grams are included, stop-words are added, and stemming is applied. An N-gram is a sequence of N words. Thus, a 2-gram (or bigram) is a sequence of two words, such as “green libraries” or “library automation”, while a 3-gram (or trigram) is a sequence of three words, such as “automation through KOHA” or “sustainable green libraries”. According to Hoz-M, Fernandez-Gomez and Mendes (2021), it is more common to analyze words individually or use N-grams, and that approach was chosen in this study. The stop-words approach in text mining helps reduce computing complexity and improve performance by removing words such as “and”, “or”, and “was”. Although there are many possible stop-word lists, we limited ourselves to the words provided by the R stopwords package (https://cran.r-project.org/web/packages/stopwords/index.html), as it has been used in previous studies. Stemming is a text pre-processing technique that reduces a morphologically inflected word to its root form, which can help reduce the dimensionality of the data and improve efficiency. For example, with this feature the words “library” and “libraries” are reduced to a common stem such as “librar”. Figure 2 shows a snapshot of the document term matrix dimensions pre and post-processing.
Figure 2. Document term matrix dimensions (DTM dim) pre and post-processing.
4.3 Number of Topics
Previous studies have stressed that the effectiveness of LDA models depends on the number of topics chosen. According to Gan and Qi (2021), when the number of topics selected is too small, the meaning under each topic will be insufficient; when it is too large, the data will be over-clustered, resulting in redundant topics. To address this issue, LDAShiny offers several methods for finding an optimal number of topics: coherence, four metrics, perplexity and harmonic mean. Based on the configuration settings recommended by Hoz-M, Fernandez-Gomez and Mendes (2021), we calculated them; the results are shown in Figure 3. According to the metrics Griffiths 2004, CaoJuan 2009, Arun 2010, Perplexity and Harmonic mean, the suitable number of topics lies between 45 and 50, while Deveaud 2014 indicates 35 topics and Coherence 14. As a result of our analysis, we found that the best value for k, the number of topics, is between 10 and 12 for our dataset. We selected 10 topics for the present study, as 10 had the optimal coherence score.
Figure 3. Number of topics. (A) Coherence method. (B) Comparison of four methods. (C) Perplexity. (D) Harmonic Mean.
5. Results
5.1 Top Terms in Green Libraries Literature Ranked by Term Frequency
Figure 4 shows the top terms in the green libraries literature, ranked by term frequency. The top term in our dataset is “libraries”, with term frequency TF = 553, document frequency DF = 82 and inverse document frequency IDF = 0.08, followed by “green” (TF = 287, DF = 73, IDF = 0.19), “sustainable” (TF = 140, DF = 50, IDF = 0.57) and “environment” (TF = 104, DF = 44, IDF = 0.7).
5.2 Topic Trend
The yearly growth of topics in the green libraries literature is represented as a heatmap in Figure 5. Topics grew from 2003 onward, and the peak of growth for each topic falls in the period 2012-2023.
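The term statistics reported in Section 5.1 can be illustrated with a minimal sketch. It is written in plain Python rather than R and is not LDAShiny's own code. It assumes that IDF is the natural logarithm of N / DF and that N = 89, the length of the exported metadata, is the number of records; the paper states neither assumption. The stop-word set, the toy corpus and the helper names (`tokenize`, `idf`, `term_stats`) are hypothetical, and stemming is omitted.

```python
import math
from collections import Counter

# Illustrative subset only; the study used the list from the R stopwords package.
STOP_WORDS = {"and", "or", "was", "the", "of", "in", "a", "for"}


def tokenize(text, ngram_max=2):
    """Lower-case, drop stop words, then emit unigrams plus n-grams up to ngram_max."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    tokens = list(words)
    for n in range(2, ngram_max + 1):
        # N-grams are built after stop-word removal, so they can span a removed word.
        tokens += ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return tokens


def idf(n_docs, doc_freq):
    """Assumed definition: natural log of (number of documents / document frequency)."""
    return math.log(n_docs / doc_freq)


def term_stats(docs):
    """Return {term: (TF, DF, IDF)} over a list of document strings."""
    tf, df = Counter(), Counter()
    for doc in docs:
        tokens = tokenize(doc)
        tf.update(tokens)        # TF: total occurrences across the corpus
        df.update(set(tokens))   # DF: number of documents containing the term
    return {t: (tf[t], df[t], idf(len(docs), df[t])) for t in tf}


# Hypothetical toy corpus, not the Scopus dataset.
docs = [
    "green libraries and sustainable design",
    "energy efficiency in green libraries",
    "library automation was studied",
]
stats = term_stats(docs)

# (DF, IDF) pairs as reported in Section 5.1; N = 89 is assumed to be the record count.
REPORTED = {
    "libraries": (82, 0.08),
    "green": (73, 0.19),
    "sustainable": (50, 0.57),
    "environment": (44, 0.7),
}
N_RECORDS = 89
recomputed = {term: idf(N_RECORDS, df) for term, (df, _) in REPORTED.items()}
for term, value in recomputed.items():
    print(f"{term}: IDF = {value:.3f} (reported {REPORTED[term][1]})")
```

On the toy corpus, the bigram `green_libraries` has TF = 2 and DF = 2. Under the stated assumptions, the recomputed IDF values agree with the figures reported in Section 5.1 to within 0.01.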
5.3 Key Themes in Green Libraries Literature
topics related to technology or COVID-19 in the present study from our dataset.
Library structure
t_8 0.079 6.08 Model, smart, structure, earthquake, time
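The list of top words attached to each topic (for example "model, smart, structure, earthquake, time" for t_8) follows from the two matrices described in Section 2. Below is a minimal sketch in plain Python, not LDAShiny's implementation, of reading the top words from a word-topic matrix and the most probable documents from a topic-document matrix. The matrices `phi` and `theta`, the vocabulary and the function names are hypothetical toy values chosen for illustration, not results from the study.

```python
def top_terms(phi, vocab, k=5):
    """phi holds one row per topic; phi[t][j] is P(word j | topic t)."""
    result = []
    for row in phi:
        ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
        result.append([vocab[j] for j in ranked[:k]])
    return result


def top_documents(theta, topic, k=2):
    """theta holds one row per document; theta[d][t] is P(topic t | document d)."""
    ranked = sorted(range(len(theta)), key=lambda d: theta[d][topic], reverse=True)
    return ranked[:k]


# Hypothetical toy model: 2 topics, 6 words, 3 documents.
vocab = ["model", "smart", "structure", "energy", "waste", "recycling"]
phi = [
    [0.40, 0.25, 0.20, 0.05, 0.05, 0.05],  # topic 0
    [0.05, 0.05, 0.05, 0.35, 0.30, 0.20],  # topic 1
]
theta = [
    [0.9, 0.1],  # document 0
    [0.2, 0.8],  # document 1
    [0.6, 0.4],  # document 2
]

print(top_terms(phi, vocab, k=3))
print(top_documents(theta, topic=0, k=2))
```

Sorting each row of the word-topic matrix gives the words used to label a topic, and sorting a column of the topic-document matrix gives the documents a reviewer would read first for that topic.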
an overview of the documents which can be used as a reference for researchers interested in green libraries to explore the domain.
6. Conclusion
Using topic modelling, this research study presents an efficient method for conducting literature reviews and gaining an overview of latent topics found in title and abstract datasets. Literature reviews conducted manually suffer from researcher bias, lack replicability and validity, are extremely time-consuming, and are unreliable. An LDA-based method addresses these concerns. With an LDA-based tool such as LDAShiny, researchers can not only understand the key research topics within a collection of documents but also identify the key documents associated with each topic, which is an effective alternative to manually screening titles and abstracts to identify relevant papers. In the present case study of green libraries research, we found that research works were spread across 1989-2023. In summary, the analysis of the literature on green libraries using topic modelling reveals that energy efficiency, waste reduction and recycling, and the use of sustainable materials are important themes in the literature. However, there is a need for more research on the implementation and effectiveness of these practices in libraries, and on the impact of the COVID-19 pandemic on green libraries.
7. References
Adam, G.P., Wallace, B.C. and Trikalinos, T.A. (2022). Semi-automated tools for systematic searches. In: Meta-Research. Methods in Molecular Biology, edited by Evangelou, E., Veroniki, A.A. New York, NY: Humana; pp. 17-40. https://doi.org/10.1007/978-1-0716-1566-9_2 PMid:34550582
Ahmed, F. and Khan, A. (2022). Topic modeling as a tool to analyze child abuse from the corpus of English newspapers in Pakistan. Social Science Computer Review. OnlineFirst. https://doi.org/10.1177/08944393221132637
Antons, D., Breidbach, C.F., Joshi, A.M. and Salge, T.O. (2023). Computational literature reviews: Method, algorithms, and roadmap. Organizational Research Methods, 25, 107-138. https://doi.org/10.1177/1094428121991230
Asmussen, C.B. and Moller, C. (2019). Smart literature review: A practical topic modeling approach to exploratory literature review. Journal of Big Data, 6, 93. https://doi.org/10.1186/s40537-019-0255-7
Donthu, N., Kumar, S., Mukherjee, D., Pandey, N. and Lim, W.M. (2021). How to conduct a bibliometric analy-