
ISSN (Print) : 0972-2467

ISSN (Online) : 0976-2477


Journal of Information and Knowledge, Vol 60(2), April 2023, p.97-104 DOI: 10.17821/srels/2023/v60i2/170967

AI-Based Literature Reviews: A Topic Modeling Approach

Manoj Kumar Verma1 and Mayank Yuvaraj2*
1 Department of Library and Information Science, Mizoram University, Aizawl - 796004, Mizoram, India; [email protected]
2 Central Library, Central University of South Bihar, Gaya - 824236, Bihar, India; [email protected]

Abstract
The purpose of this paper is to highlight the importance of topic modelling in conducting literature
reviews using the open-source LDAShiny package in the R environment, with green libraries literature
as a case study. To conduct the analysis, a title and abstract dataset was prepared using the Scopus
database and imported into the LDAShiny package for further analysis. It was found that the green
libraries literature spanned 1989 to 2023, with a sharp increase in research topics since 2003. The
study also identified key themes and documents associated with green libraries research, revealing
that energy efficiency, waste reduction and recycling, and the use of sustainable materials have been
extensively discussed in the literature. However, further research is needed on the implementation of
these practices in libraries, as well as the impact of the COVID-19 pandemic on green libraries. The
findings will be beneficial to researchers interested in using topic modelling for literature reviews.

Keywords: Green Libraries, Latent Topics, LDA Shiny, Literature Review, Topic Modelling

*Author for correspondence

1. Introduction
A literature review builds upon and refers to existing knowledge and is integral to academic research
(Kunisch et al., 2023; Snyder, 2019). An exhaustive literature review typically involves two steps:
(1) identifying a subset of citations, and (2) manually screening the set of citations (Adam, Wallace
and Trikalinos, 2022). However, the manual exploratory literature review is becoming increasingly
difficult due to the rapid growth of literature (Asmussen and Moller, 2019). Whenever a field’s
literature grows faster than the time available for manual reviews, an adequate manual review of the
literature cannot be conducted (Marshall and Wallace, 2019). A variety of factors have made manual
literature reviews challenging in recent years. Firstly, the process of searching and gathering
relevant papers on any research domain is time-consuming. Furthermore, due to the laborious process of
screening the entire research literature on a topic, scholars develop rather narrow search terms
(Schoot et al., 2021). Secondly, literature reviews are usually performed manually, and the enormous
number of papers may overwhelm human processing capacity (Wagner, Lukyanenko and Pare, 2022), because
of which only a few papers are analyzed. Thirdly, the traditional review method is also compromised by
researcher bias in selecting articles for review. In addition, some scholars have criticized the
traditional method for its inability to be replicated and proven (Saha, 2021).

To overcome these problems, the conduct of literature reviews has been transformed by various tools
and approaches. The first approach involves mapping the literature by examining relationships through
a few papers to discover scholarly articles. Similar papers linked by “citations”, “authors”,
“funders”, “keywords”, and other metadata can be identified this way. Several cloud-based SaaS
platforms enable exploration of these connections, including Citation Gecko
(https://www.citationgecko.com/), Connected Papers (https://www.connectedpapers.com/), Inciteful
(https://inciteful.xyz/), LitMaps (https://www.litmaps.com/), Open Knowledge Maps
(https://openknowledgemaps.org/), and JSTOR Text Analyzer (https://www.jstor.org/analyze/). In the
second approach, machine learning algorithms are employed to screen papers for relevant ones in order
to deal with the volume. ASReview (https://asreview.nl/) is one such free and open-source tool, which
uses active learning (a type of machine learning) to train a model that uses limited examples to
predict relevance from texts. ASReview performs automated title-abstract screening and ranks the
papers by their predicted relevance. Other similar machine learning systems available for use in
systematic reviews are Rayyan (https://www.rayyan.ai/), Colandr (https://www.colandrcommunity.com/),
Covidence (https://www.covidence.org/), EPPI-Reviewer
(https://eppi.ioe.ac.uk/CMS/Default.aspx?alias=eppi.ioe.ac.uk/cms/er4&), FASTREAD
(https://github.com/fastread/src), and SWIFT Review (https://www.sciome.com/swift-review/).

Although these two approaches are useful for identifying relevant papers in any domain, which can then
be reviewed to identify research gaps and to understand key research themes, researchers still face a
high time cost when they need to read a large number of relevant papers manually. It has been found
that a seasoned reviewer can screen about two abstracts per minute on average, but more complex
abstracts can take much longer (Wallace et al., 2010). Although the artificial intelligence-based
research assistant Elicit (https://elicit.org/) can perform article summarization tasks, it cannot
identify the latent topics in the literature, which are crucial for understanding research gaps and
trends. A simple solution can be found in topic modelling, which can be automated, making it an ideal
tool to conduct an exploratory literature review (Antons et al., 2023; Asmussen and Moller, 2019).
According to Kavvadias, Drosatos and Kaldoudi (2020), the identification of research topics in
published literature has emerged as a powerful tool to help researchers navigate a wide range of
publications and get quick overviews of evolving research fields, and it can be easily accomplished
through topic modelling. The automated topic modelling approach complements traditional approaches to
content analysis (Schmiedel, Muller and Brocke, 2019). Topic modelling is the process of condensing
text into topics composed of connected words based on statistical correlation.

With topic modelling, researchers can better understand large volumes of data spanning extended
periods. It has been applied extensively to understand the latent topics in newspapers (Ahmed and
Khan, 2022), journals (Ozyurt and Ayaz, 2022), and research articles (Xie, Ning and Sun, 2022;
Mostafa, 2022). The present paper seeks to demonstrate the importance of topic modelling in conducting
literature reviews, using the open-source LDAShiny package and green libraries literature as a case
study to discover the hidden topics in the domain.

2. Topic Modelling
Topic modelling is a text-mining technique often used in machine learning and natural language
processing. It is an effective method of analyzing and summarizing large amounts of textual data
without human intervention to reveal hidden semantic patterns (latent topics). Applied to a collection
of documents, it facilitates identifying hidden topics. Each topic is presented as a set of words that
make sense together. The topic-associated words can help organize and provide insight into large
amounts of unstructured text. The basic idea behind topic modelling is that each document is regarded
as a mixture of topics, and each word within the document has a certain probability of belonging to a
specific topic.

The results obtained from topic modelling are represented by two matrices: the word-topic matrix
(probability of a certain word belonging to a topic) and the topic-document matrix (probability of a
particular topic appearing in a specific document). However, end-users usually select only the top
words (words that have the highest probability in a topic) and the most probable texts. In terms of
topic modelling algorithms, Latent Dirichlet Allocation (LDA) is the simplest, most well-studied, and
most widely accepted method. In addition, LDA can effectively uncover latent topics and co-occurrences
between words (Mustak et al., 2021). LDA was used in this study for analyzing the collected papers on
green libraries.

While there are other packages in the R environment through which topic modelling can be performed,
LDAShiny is the only free statistical software package providing a GUI that allows analysts and
researchers to perform LDA-focused scientific literature reviews interactively. LDAShiny is primarily
intended for
researchers who have little prior knowledge of the research field and would like to explore a large
number of documents (such as scientific articles) to identify trends (Hoz-M, Fernandez-Gomez and
Mendes, 2021).

Below is a step-by-step guide to installing and using LDAShiny:
• The latest version of the R language (https://cran.r-project.org/bin/windows/base/) and the RStudio
  platform (https://posit.co/) should be downloaded and installed as a first step.
• The next step is the installation of LDAShiny. Open the RStudio interface and type the following
  command in the console:
    install.packages("LDAShiny")
• To load and launch LDAShiny, enter the following commands in the console:
    library(LDAShiny)
    LDAShiny::runLDAShiny()

3. Objectives of the Study
The major objectives of this research study are:
• To use the Latent Dirichlet Allocation (LDA) algorithm, a widely used algorithm for topic modelling,
  to identify latent topics and show its usefulness in conducting a literature review.
• To identify major topics discussed in green libraries literature.
• To understand the growth of research topics.
• To identify key themes in green libraries literature.
• To find out key documents associated with research topics in the green libraries domain.

4. Methodology
In the following section, we describe the steps we took to collect and analyze data.

4.1 Data Collection
In this study, data were collected for review using Scopus, which has a wider range of academic sources
than its counterpart, the Web of Science (Paul et al., 2021), and whose documents are approved for
indexing through strict criteria such as “ethics and malpractice statement”, “minimum of two-year
publication history”, “ownership”, and “peer review” (Donthu et al., 2021). The search strategy is
based on a single keyword, “green librar*”, searched in the title, abstract, and keywords of the
articles, following Lim, Yap and Makkar (2021), who recommend using a single keyword for review
domains that are sufficiently broad and generic. We used an asterisk (*) with the keyword to capture
various endings of the term, such as “green library” and “green libraries”. The initial search using
the keyword returned 89 documents, all of which were used for analysis. We did not use any additional
filters such as document type, language, or year in the search query. On January 30, 2023, we recorded
the document title, year, source title, DOI and abstract of the 89 documents from the database in an
Excel file for conducting topic modelling.

4.2 Data Processing
The LDAShiny package was used for conducting LDA-based topic modelling. The Excel file (.csv) exported
from the Scopus database was uploaded to the software. A statistical summary of the uploaded data is
shown in Figure 1.

Figure 1. Statistical summary of uploaded data.

It can be seen from Figure 1 that the publications on green libraries ranged from 1989 to 2023. The
mean and median years of publication are 2016 and 2018 respectively. Each of the five exported
metadata fields contains 89 records. The next step in LDAShiny involves data cleaning, where n-grams
are included, stop-words are added, and stemming is applied. An N-gram consists of N words. A 2-gram
(or bigram) is therefore a sequence of two words, such as “green libraries” or “library automation”,
while a 3-gram (or trigram) is a sequence of three words, such as “automation through KOHA” or
“sustainable green libraries”. According to Hoz-M, Fernandez-Gomez and Mendes (2021), it is more
common to analyze words individually or use N-grams, and that was chosen in this study. The stop-words
approach in text mining helps reduce computing complexity and improve performance by removing words
like “and”, “or”, and “was”. Although there are many possible stop-word lists, we limited ourselves to
the words provided by the R stopwords package
(https://cran.r-project.org/web/packages/stopwords/index.html), as it has been used in previous
studies. Stemming is a text pre-processing technique that reduces a word to its root form by stripping
its morphological endings, which can help reduce the dimensionality of the data and improve
efficiency. For example, by using this feature the words “library” and “libraries” will be stemmed to
“librar”. Figure 2 shows a snapshot of the document term matrix dimensions pre- and post-processing.

Figure 2. Document term matrix dimensions (DTM dim) pre and post-processing.

4.3 Number of Topics
Previous studies have stressed that the effectiveness of LDA models depends on the number of topics
chosen when categorizing topics. According to Gan and Qi (2021), when the number of topics selected is
too small, the meaning under each topic will be insufficient; when the number of topics selected is
excessive, the data will be over-clustered, resulting in redundant topics. To address this issue,
LDAShiny offers various methods for finding an optimal number of topics, such as coherence, four
metrics, perplexity and harmonic mean. Based on the configuration settings recommended by Hoz-M,
Fernandez-Gomez and Mendes (2021), we calculated them; the results are shown in Figure 3. Among the
metrics, Griffiths 2004, CaoJuan 2009, Arun 2010, Perplexity and Harmonic mean suggest between 45 and
50 topics, while Deveaud 2014 suggests 35 topics and Coherence 14. As a result of our analysis, we
found that the best value for k, or the number of topics, is between 10 and 12 for our dataset. We
have selected 10 topics for the present study as 10 was the optimal coherence score.

5. Results

5.1 Top Terms in Green Libraries Literature Ranked by Term Frequency
Figure 4 shows the top terms in the green libraries literature, ranked by term frequency. The top term
that appears in our dataset is “libraries”, with term frequency (TF = 553),
document frequency (DF = 82) and inverse document frequency (IDF = 0.08), followed by the terms
“green” (TF = 287, DF = 73, IDF = 0.19), “sustainable” (TF = 140, DF = 50, IDF = 0.57) and
“environment” (TF = 104, DF = 44, IDF = 0.7).

Figure 3. Number of topics. (A) Coherence method. (B) Comparison of four methods. (C) Perplexity.
(D) Harmonic Mean.

Figure 4. Top terms in the green libraries dataset.

5.2 Topic Trend
The yearly growth of topics in green libraries literature is represented as a heatmap in Figure 5.
There was a growth of topics from 2003 onward. The peak of growth can be seen during the period
2012-2023 for each topic.
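The inverse document frequencies reported in Section 5.1 can be related to the document frequencies
with a short calculation. The sketch below is only an illustration in base R: it assumes that IDF is
the natural logarithm of the corpus size (89 documents) divided by the document frequency. The paper
does not state the formula, so this definition, and all variable names, are assumptions.

```r
# Corpus size (Section 4.1) and document frequencies (Section 5.1), as reported.
n_docs <- 89
doc_freq <- c(libraries = 82, green = 73, sustainable = 50, environment = 44)

# ASSUMPTION: IDF = ln(N / DF). The paper does not give the formula it used.
idf <- log(n_docs / doc_freq)

# Truncate (rather than round) to two decimals for comparison with the paper.
idf_truncated <- floor(idf * 100) / 100

# IDF values as printed in Section 5.1.
reported <- c(libraries = 0.08, green = 0.19, sustainable = 0.57, environment = 0.70)

# Side-by-side comparison of computed and reported values.
comparison <- data.frame(
  DF = doc_freq,
  IDF = round(idf, 4),
  truncated = idf_truncated,
  reported = reported
)
print(comparison)
```

Under this assumed definition, the computed values agree with the reported ones once truncated to two
decimals. A term that occurs in almost every document, such as “libraries”, therefore receives an IDF
close to zero, while rarer terms receive higher values.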


Figure 5. Topic trend in green libraries literature.

5.3 Key Themes in Green Libraries Literature
Table 1 shows the research topics that were obtained from the LDA model. The results are ranked by
prevalence score. There is a high prevalence of literature on topics such as green libraries, academic
libraries, library programs, and library projects. We could not find any topics related to technology
or COVID-19 in our dataset.

Table 1. Key themes in green libraries literature

Topic  Label_1             Coherence  Prevalence  Top_Terms
t_1    Green libraries     0.052      8.314       Library, green, development, public, environment, research, analysis
t_2    Academic libraries  0.047      8.079       Library, academic, sustainable, strategies, services
t_3    Library program     0.141      7.793       Program, library, social, develop, green, cooper, paper, sustainable, education
t_4    Library project     0.118      7.535       Project, library, activity, green, “green, paper, group
t_5    Library building    0.097      7.47        Build, library, design, green, construct, plan, exist, architecture, sustain, environment
t_6    Library research    0.104      7.077       Research, library, studies, publish, management, countries, green
t_7    Sustainability      0.044      6.944       Libraries, change, climate, sustainable, environment
t_8    Library structure   0.079      6.08        Model, smart, structure, earthquake, time

5.4 Key Documents Associated with Research Topics
Through LDAShiny we also identified key documents which were associated with each topic. Table 2
presents
an overview of the documents, which can be used as a reference for researchers interested in green
libraries to explore the domain.

Table 2. Key documents in green libraries research

Topic  Theta    Document
t_1    0.45473  Green Library and green librarianship- Towards a conceptualization
t_2    0.36759  A perspective on computational research support programs in the library: More than 20 years of data from Stanford University library
t_3    0.3413   Planning Approach with “Better Than Before” Concept: A Case Study of Library Building at SVNIT, Surat, Gujarat, India
t_4    0.59044  More than Just a green building: Developing green strategies at the Chinese University of Hong Kong Library
t_5    0.28296  Library sow the seed of a sustainable society: A comparative analysis of IFLA Green Library Award projects 2016
t_6    0.5795   Operation performance evaluation of green public buildings with AHP-fuzzy synthetic assessment method based on cloud model
t_7    0.29348  The Emergence of Green Library in Kenya: Insights from Academic Library
t_8    0.47281  Environmentally Sustainable Approaches in Academic Library: A Micro-study in Uttar Pradesh

6. Conclusion
Using topic modelling, this research study presents an efficient method for conducting literature
reviews and gaining an overview of latent topics found in title and abstract datasets. Literature
reviews conducted manually suffer from researcher bias, lack replicability and validity, are extremely
time-consuming, and are unreliable. An LDA-based method addresses these concerns. With an LDA-based
tool such as LDAShiny, researchers can not only understand the key research topics within a document
collection but also identify the key documents associated with each topic, which is an effective
alternative to manually screening titles and abstracts to identify relevant papers. In the present
case study of green libraries research, we found that research works were spread over the period
1989-2023. In summary, the analysis of the literature on green libraries using topic modelling reveals
that energy efficiency, waste reduction and recycling, and the use of sustainable materials are
important themes in the literature. However, there is a need for more research on the implementation
and effectiveness of these practices in libraries, and on the impact of the COVID-19 pandemic on green
libraries.

7. References
Adam, G.P., Wallace, B.C. and Trikalinos, T.A. (2022). Semi-automated tools for systematic searches.
In: Meta-Research. Methods in Molecular Biology, edited by E. Evangelou and A.A. Veroniki. New York,
NY: Humana; pp. 17-40. https://doi.org/10.1007/978-1-0716-1566-9_2 PMid:34550582

Ahmed, F. and Khan, A. (2022). Topic modeling as a tool to analyze child abuse from the corpus of
English newspapers in Pakistan. Social Science Computer Review. OnlineFirst.
https://doi.org/10.1177/08944393221132637

Antons, D., Breidbach, C.F., Joshi, A.M. and Salge, T.O. (2023). Computational literature reviews:
Method, algorithms, and roadmap. Organizational Research Methods, 25, 107-138.
https://doi.org/10.1177/1094428121991230

Asmussen, C.B. and Moller, C. (2019). Smart literature review: A practical topic modeling approach to
exploratory literature review. Journal of Big Data, 6, 93. https://doi.org/10.1186/s40537-019-0255-7

Donthu, N., Kumar, S., Mukherjee, D., Pandey, N. and Lim, W.M. (2021). How to conduct a bibliometric
analysis: An overview and guidelines. Journal of Business Research, 133, 285-296.
https://doi.org/10.1016/j.jbusres.2021.04.070

Gan, J. and Qi, Y. (2021). Selection of the optimal number of topics for LDA topic model- taking
patent policy as an example. Entropy, 23, 1-45. https://doi.org/10.3390/e23101301

Hoz-M, J. De La, Fernandez-Gomez, M.J. and Mendes, S. (2021). LDAShiny: An R package for exploratory
review of scientific literature based on Bayesian probabilistic model and machine learning tools.
Mathematics, 9. https://doi.org/10.3390/math9141671

Kavvadias, S., Drosatos, G. and Kaldoudi, E. (2020). Supporting topic modeling and trend analysis in
biomedical literature. Journal of Biomedical Informatics, 110, 103574.
https://doi.org/10.1016/j.jbi.2020.103574 PMid:32971274

Kunisch, S., Denyer, D., Bartunek, J.M., Menz, M. and Cardinal, L.B. (2023). Review research as
scientific inquiry. Organizational Research Methods, 26, 3-45.
https://doi.org/10.1177/10944281221127292

Lim, W.M., Yap, S.F. and Makkar, M. (2021). Home sharing in marketing and tourism at a tipping point:
What do we know, how do we know, and where should we be heading? Journal of Business Research, 122,
534-566. https://doi.org/10.1016/j.jbusres.2020.08.051 PMid:33012896 PMCid:PMC7523531

Marshall, I.J. and Wallace, B.C. (2019). Toward systematic review automation: A practical guide to
using machine learning tools in research synthesis. Systematic Reviews, 8, 163.
https://doi.org/10.1186/s13643-019-1074-9 PMid:31296265 PMCid:PMC6621996

Mostafa, M. (2022). A one-hundred-year structural topic modeling analysis of knowledge structure of
international management research. Quality and Quantity. OnlineFirst.
https://doi.org/10.1007/s11135-022-01548-w PMid:36249708 PMCid:PMC9549032

Mustak, M., Salminen, J., Ple, L. and Wirtz, J. (2021). Artificial intelligence in marketing: Topic
modeling, scientometric analysis and research agenda. Journal of Business Research, 124, 389-404.
https://doi.org/10.1016/j.jbusres.2020.10.044

Ozyurt, O. and Ayaz, A. (2022). Twenty-five years of education and information technologies: Insights
from a topic modeling based bibliometric analysis. Education and Information Technologies, 27,
11025-11054. https://doi.org/10.1007/s10639-022-11071-y PMid:35502161 PMCid:PMC9046010

Paul, J., Lim, W.M., O’Cass, A., Hao, A.W. and Bresciani, S. (2021). Scientific Procedures and
Rationales for Systematic Literature Reviews (SPAR-4-SLR). International Journal of Consumer Studies,
45, O1-O16. https://doi.org/10.1111/ijcs.12695

Saha, B. (2021). Application of topic modeling for literature review in management research. In:
Interdisciplinary research in technology and management, edited by S. Chakrabarti, R. Nath, P.K.
Banerji, S. Datta, S. Poddar and M. Gangopadhyaya. London: CRC Press; pp. 249-256.

Schmiedel, T., Muller, O. and Brocke, J.V. (2019). Topic modeling as a strategy of inquiry in
organizational research: A tutorial with an application example on organizational culture.
Organizational Research Methods, 22, 941-968. https://doi.org/10.1177/1094428118773858

Schoot, R.V., Bruin, J., Schram, R., Zahedi, P., Boer, J., Weijdema, F., Kramer, B., Huijts, M.,
Hoogerwerf, M., Ferdinands, G., Harkema, A., Willemsen, W., Ma, Y., Fang, Q., Hindriks, S., Tummers,
L. and Oberski, D.L. (2021). An open source machine learning framework for efficient and transparent
systematic reviews. Nature Machine Intelligence, 3, 125-133.
https://doi.org/10.1038/s42256-020-00287-7

Snyder, H. (2019). Literature review as a research methodology: An overview and guidelines. Journal of
Business Research, 104, 333-339. https://doi.org/10.1016/j.jbusres.2019.07.039

Wagner, G., Lukyanenko, R. and Pare, G. (2022). Artificial intelligence and the conduct of literature
reviews. Journal of Information Technology, 37, 209-226. https://doi.org/10.1177/02683962211048201

Wallace, B.C., Small, K., Brodley, C.E. and Trikalinos, T.A. (2010). Active learning for biomedical
citation screening. In: 16th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, edited by B. Rao, B. Krishnapuram, A. Tomkins and Q. Yang. Washington DC, USA; pp. 173-182.
https://doi.org/10.1145/1835804.1835829 PMid:20565949 PMCid:PMC2903585

Xie, Y., Ning, C. and Sun, L. (2022). The twenty-first century of structural engineering research: A
topic modeling approach. Structures, 35, 577-590. https://doi.org/10.1016/j.istruc.2021.11.018