Exploring ChatGPT's Ability to Classify the Structure of Literature Reviews in Engineering Research Articles
Abstract—ChatGPT is a newly emerging artificial intelligence (AI) tool that can generate and assess written text. In this study, we aim to examine the extent to which it can correctly identify the structure of literature review sections in engineering research articles. For this purpose, we conducted a manual content analysis by classifying paragraphs of literature review sections into their corresponding categories, which are based on Kwan's model, a labeling scheme for structuring literature reviews. We then asked ChatGPT to perform the same categorization and compared both outcomes. Numerical results do not imply a satisfactory performance of ChatGPT; therefore, writers cannot fully depend on it to edit their literature reviews. However, the AI chatbot displays an understanding of the given prompt and is able to respond beyond the classification task by giving supportive and useful explanations for the users. Such findings can be especially helpful for beginners, who usually struggle to write comprehensive literature review sections, since they highlight how users can benefit from this AI chatbot to revise their drafts at the level of content and organization. With further investigation and advancement, AI chatbots could also be used for teaching proper literature review writing and editing.

Index Terms—ChatGPT, classification, engineering research articles, literature review, strategies, structural moves.

Manuscript received 17 August 2023; revised 14 January 2024 and 2 May 2024; accepted 16 May 2024. Date of publication 4 June 2024; date of current version 1 July 2024. (Corresponding author: Maha Issa.)
Maha Issa was with the American University of Beirut, Beirut 1107 2020, Lebanon. She is now with Santa Clara University, Santa Clara, CA 95053 USA (e-mail: [email protected]).
Marwa Faraj is with the American University of Beirut, Beirut 1107 2020, Lebanon (e-mail: [email protected]).
Niveen AbiGhannam is with the University of Texas at Austin, Austin, TX 78712 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TLT.2024.3409514

I. INTRODUCTION

Artificial intelligence (AI) applications in English language teaching and learning have become a common form of support for students, especially those who are not native English speakers. Some of the existing AI tools operate at the level of receptive skills (reading and listening), such as Grammarly, Duolingo, and Google Translate, which help students with grammar and vocabulary in their writing assignments and save them time for higher-level tasks [1]. Another writing support tool that involves AI is the Mover software, which can associate written sentences in research articles with their corresponding structural moves, such as indicating a gap, announcing findings, etc. [2]. Previous research on Mover has shown that it enables users to quickly edit the structure of research abstracts and ensure an improved and logical flow of ideas [2]. Despite the effectiveness of these tools in supporting students' learning, one limitation is that they do not allow enough human–machine interaction, which makes them unable to clearly understand the students' needs and inquiries [1].

Recently, ChatGPT has emerged as a new AI chatbot released by OpenAI [3]. This tool can operate at the level of productive skills (speaking and writing) since it can generate written text. It also allows human–machine interaction due to its conversational mode and uses reinforcement learning to learn from experience, human feedback, and follow-up corrections. However, it presents some limitations since it may give incorrect or biased information and knows little about events that occurred after 2021 [3]. This emerging tool can execute a large variety of human tasks, such as explaining complex concepts and subjects, summarizing information, translating text, generating code, and even writing complete essays and assignments [4]. It can also help humans in daily time-consuming tasks, such as drafting emails, which helps users focus on more important responsibilities such as core work components. ChatGPT has also shown potential in teaching and supporting student writing development through real-time feedback and guidance on vocabulary, grammar, and syntax [5]. Given the broadness of these capabilities, students at various levels of their education are relying heavily on ChatGPT for many purposes, which makes examining its effectiveness a necessity. A rapid review of 50 articles found that ChatGPT behaves inconsistently across different topics [6]. For example, it exhibits an outstanding performance in economics, a satisfactory performance in high school English language exams, and an unsatisfactory performance in mathematics and psychology. Its performance also depends on the country in which it is tested. For instance, ChatGPT passed the United States Medical Licensing Examination (USMLE); however, it did not surpass the students' average score in the medical licensing examinations of some Asian countries [6].

In this work, we aim to measure the capability of ChatGPT to accurately classify paragraphs in the literature review sections of engineering research articles into their corresponding structural moves by comparing its output to a manual classification. The motivation behind choosing the literature review section is that writing this section can be a challenging task, especially for students and beginners [7].
Therefore, a tool that identifies its structure can help student users learn how to recognize the missing parts in their drafts and avoid writing a list of article summaries without any synthesis of information or connection to the remainder of their manuscripts. Consequently, this work aims to investigate the following research questions.
1) Research question 1: To what extent can ChatGPT correctly identify the structure of the literature review sections in engineering research articles?
2) Research question 2: To what extent can students and academic writers rely on ChatGPT to learn how to revise, edit, and structure the literature review sections in their drafts?

II. LITERATURE REVIEW

Several researchers have aimed to assess the effectiveness of ChatGPT in assisting students in their learning process across different subjects, such as public health, medicine, and mathematics [8], [9], [10]. To learn medical terms, Graefen and Fazal [8] allowed 100 students to use ChatGPT, while another 100 students used traditional learning techniques. After conducting an exam for both groups, they showed that learning medical terminologies through ChatGPT increased the students' passing rate. Moreover, 89% of the students who used ChatGPT believed that it provides creative ideas, and 92% of those students would recommend its use to other colleagues [8]. In [9], ChatGPT's performance was tested on 376 questions from the USMLE. Results showed that ChatGPT can sometimes provide answers that meet the passing threshold of the USMLE. It can also generate novel ideas and insights, which could help students in medical education [9]. To test the mathematical reasoning of ChatGPT, Frieder et al. [10] created GHOSTS, a dataset of questions covering a variety of topics, such as integrals, algebra, and probability. They found that ChatGPT fails to achieve a passing score in most topics, with inconsistent behavior depending on the question type and difficulty level. Although the improved version of ChatGPT, GPT-4, presents an acceptable performance, it was unable to pass graduate-level questions [10]. Thus, although ChatGPT can support students in learning about several topics, they should be aware that it can give incorrect information in some areas, and they should also be able to critically evaluate and question its suggested answers.

The ability of ChatGPT to assist students in English language learning and writing has also been a topic of discussion in some research papers [6], [11], [12], [13], [14], [15]. For example, ChatGPT can provide writing feedback on students' essays at the level of grammar, sentence structure, and organization of ideas to improve the overall writing style [11]. Wu et al. [12] measured the extent to which ChatGPT can correct grammatical errors by testing it on 100 sentences and comparing its performance with Grammarly. It was shown that although the considered evaluation metrics imply that ChatGPT's performance is worse than Grammarly's, it can still be considered a promising tool given its ability to go beyond small grammatical corrections and suggest alternative ways to structure sentences [12]. This was also highlighted by Cooper [13], who acknowledged using ChatGPT in their paper to rephrase some complex sentences that they initially wrote and to make edits at the word level. The author suggested that students who face difficulties in improving their writing style can also benefit from it. However, the author pointed out that some sentences became worse after involving the AI chatbot [13]. Moreover, Lo et al. [6] suggest that ChatGPT can even generate initial drafts for students' writing assignments. However, they emphasized that students should revise these drafts to correct any false information, show their own critical thinking, and add important references since many ChatGPT outputs are not supported by evidence or sometimes contain fake citations [6]. Furthermore, ChatGPT can aid in selecting relevant topics when conducting literature reviews by executing literature searches, sorting research articles, and citing their sources [14]. However, Haman and Školník [15] argue that ChatGPT provides fake papers, which renders it an unrecommended tool for literature review writing. Therefore, although ChatGPT can facilitate students' English learning and writing, they must use it responsibly while being aware of its limitations and its occasionally inaccurate and biased information and, most importantly, without neglecting the importance of developing and showing their own creativity and critical thinking [16].

Overall, ChatGPT shows promising innovation in education, which has begun to come under examination by various scholars, institutions, and educators. This integration of AI in education will require a major shift in education philosophies [17]. With continuous human surveillance, ChatGPT might create space for theoretically grounded learning experiences and principles, leveraging second language writing skills and vocabulary [17].

So far, grammar and sentence structure have been the focus of most research that discussed or evaluated the potential of ChatGPT in assisting in English language learning and writing. To the best of our knowledge, no studies have evaluated the effectiveness of ChatGPT in supporting writers at the higher level of content and organization. In this work, we aim to measure how correctly ChatGPT can identify the structure of the literature review sections in research articles. Our findings would give insights regarding the extent to which students and academic writers can rely on ChatGPT to learn how to revise and edit the content and organization of the literature review sections in their drafts in order to make them more coherent and informative.

III. METHODS

In order to assess the ability of ChatGPT to identify the structure of literature review sections of research articles, we sampled 200 engineering research papers from Scopus [18]. After assessing the collected papers, we included the ones with a separate literature review section, which resulted in 39 articles. Our methodology consisted of conducting a manual content analysis of the literature review sections of those sampled articles, then asking ChatGPT-3.5 to conduct the same content analysis, and finally assessing the findings. In this section, we explain how we gathered the research articles involved in this study, introduce the labeling scheme that was adopted, describe how the manual and ChatGPT classifications were conducted, and provide an overview of the metrics used in our assessment.
TABLE I
MOVES AND STRATEGIES OF KWAN’S MODEL FOR STRUCTURING LITERATURE REVIEWS [19]
The final stage of the manual coding was conducted in December 2023, where 19 articles (145 paragraphs) were categorized. Finally, the authors finalized the manual classification by discussing the contradictory categorizations. Collaborative coding with discussions guarantees a rigorous coding step, which ensures that we meticulously assessed each paragraph in our corpus and provides validity once we compare our findings with those of ChatGPT.
D. ChatGPT Classification

The automated classification of the paragraphs in the literature review sections was performed using ChatGPT. Similar to the manual coding, we also conducted the ChatGPT analysis in batches in May (75 paragraphs of the first two manual stages), July (73 paragraphs of the third manual stage), and December 2023 (145 paragraphs of the final manual stage). Then, we redid the analysis of all the 293 paragraphs using ChatGPT in December 2023 to ensure validity and consistency and avoid temporal bias, given the ongoing development of AI models, and these categorizations were then used for comparison with the manual classifications. We gave ChatGPT a separate prompt for each paragraph containing: 1) the labeling scheme containing all the 14 categories and their definitions; 2) the content of the paragraph; and 3) a question asking it to classify the given paragraph into one or more of the 14 given categories. The labeling schemes given to the graduate assistants and ChatGPT were similar, except that the one given to ChatGPT did not include examples since it was unable to accurately respond to long prompts. We also removed the names of the three moves from the ChatGPT prompt since, when they were included, it was just classifying the paragraph into the corresponding move without specifically mentioning the strategy. The ChatGPT prompt is displayed in full in the Appendix. It should be noted that we used a separate chat for the paragraphs belonging to the same article. After gathering the ChatGPT responses, we copied them to an Excel sheet where we also labeled a category with 1 if ChatGPT identified it in the paragraph, and 0 otherwise.

E. Evaluation Metrics

After gathering both manual and ChatGPT classifications, we computed several evaluation metrics to assess ChatGPT's performance and analyze the findings. We first used the Jaccard similarity index to measure the average agreement between the two classifications. Then, to better examine ChatGPT's performance in distinct categories, we computed for each category the confusion matrix, accuracy, precision, recall, and F1 score. These metrics give better insights into the categories that are being successfully identified by ChatGPT and the ones that it is failing to detect. Finally, an additional analysis was conducted to compare the performance of the May, July, and December 2023 versions of ChatGPT.

IV. RESULTS

To evaluate the effectiveness of ChatGPT in identifying the categories involved in literature review paragraphs, we present in this section several numerical results pertaining to different evaluation metrics. We also highlight the difference between the performance of the May, July, and December 2023 ChatGPT versions.

A. Jaccard Similarity Index

The first evaluation metric that we computed is based on the Jaccard similarity index. Let A be the set of manually assigned categories for a given paragraph, and let B be the set of categories assigned by ChatGPT for the same paragraph. The Jaccard similarity index between sets A and B, J(A, B), can be computed using the following formula:

J(A, B) = |A ∩ B| / |A ∪ B|    (1)

where |X| refers to the number of elements in the set X. In other words, the Jaccard similarity index for a given paragraph is the ratio of the number of categories in the intersection of the ChatGPT and manual classification sets over the number of categories in the union of both sets. For example, if the categories identified by the manual classification of a given paragraph are 1A, 1C, and 2B (A = {1A, 1C, 2B}), and if ChatGPT identified the categories 1A, 1C, 2E, and 3A in the same paragraph (B = {1A, 1C, 2E, 3A}), then J(A, B) = 2/5 = 0.4. This index is between 0 and 1 when there is a partial overlap between the two classification sets (as in the previous example), equal to 1 if both sets are exactly equal, and equal to 0 when there is no overlap. We calculated the Jaccard similarity index for every paragraph using Excel and then averaged the results over all the 293 paragraphs. The average Jaccard similarity index was 0.278, which implies a poor performance of ChatGPT. However, this low index may result from a high number of elements in the union of the manual and ChatGPT classification sets, which could be due to a high number of elements in either the manual set or the ChatGPT set. In other words, ChatGPT may identify all the categories detected in the manual classification but can still achieve a low Jaccard similarity index if it identifies many additional false categories. Hence, to better understand the reason behind this low index, we also computed using Excel, for every paragraph, the ratio of the number of categories in the intersection of the two classification sets over the number of categories in the manual set, namely

|A ∩ B| / |A|    (2)

and averaged this ratio over all the paragraphs, which gave 0.632. This number is higher than the average Jaccard similarity index, which suggests that ChatGPT was, to some extent, detecting some correct categories in addition to choosing many other categories that were not identified in the manual classification.
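To make the two agreement metrics concrete, the following Python sketch computes (1) and (2) for the worked example above and averages the Jaccard index over a corpus. The label sets reproduce the example from the text, while the corpus variable is a hypothetical placeholder rather than the study's data (the authors performed these computations in Excel).

```python
# Illustrative sketch of the agreement metrics in (1) and (2).
# The label sets below are the worked example from the text; the
# "corpus" list is a hypothetical placeholder, not the study's data.

def jaccard(manual: set, chatgpt: set) -> float:
    """Eq. (1): |A ∩ B| / |A ∪ B| (defined as 0 if both sets are empty)."""
    union = manual | chatgpt
    return len(manual & chatgpt) / len(union) if union else 0.0

def intersection_over_manual(manual: set, chatgpt: set) -> float:
    """Eq. (2): |A ∩ B| / |A|, the share of manual labels ChatGPT recovered."""
    return len(manual & chatgpt) / len(manual) if manual else 0.0

A = {"1A", "1C", "2B"}        # manual classification of one paragraph
B = {"1A", "1C", "2E", "3A"}  # ChatGPT classification of the same paragraph
print(jaccard(A, B))                   # 2/5 = 0.4, as in the text
print(intersection_over_manual(A, B))  # 2/3 ≈ 0.667

# Averaging over all paragraphs (one (manual, chatgpt) pair per paragraph):
corpus = [(A, B)]  # placeholder for the 293 paragraph pairs
avg_jaccard = sum(jaccard(m, c) for m, c in corpus) / len(corpus)
```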
For example, ChatGPT was in many cases identifying strategies that are not necessarily present in the given paragraph, as shown in the extract of the ChatGPT response for the following paragraph.

Paragraph (extracted from [20]): We first review the relevant literature on the design of pre-disaster relief networks in the presence of uncertainty. Then, we briefly review some of the studies related to the methodological aspects of our work.
ChatGPT classification: Strategies 1C, 2E, and conclusion. Interpretation: In this specific paragraph, the author is engaging in the strategy of surveying the research-related phenomena (1C) by reviewing relevant literature on the design of pre-disaster relief networks in the presence of uncertainty. Additionally, the phrase "briefly review some of the studies related to the methodological aspects of our work" suggests a form of abstracting or synthesizing knowledge claims (2E). The paragraph may serve as a transition between the literature review and the methodology section, indicating the direction of the current work.
Manual classification: Introduction (the paragraph only reveals the themes to be covered in the section).

In addition, ChatGPT was detecting the presence of Strategy 3D, which is defined as announcing the adoption of terms or the definition of concepts in the authors' own context, whenever a mere definition of acronyms appears. These facts can also be verified by the total numbers of categories identified by ChatGPT and the manual classification.

Fig. 1. Total number of identified categories by human labelers and ChatGPT.

As shown in Fig. 1, ChatGPT detected 1024 categories in all the paragraphs, in comparison to 513 categories identified by the manual classification. Therefore, on average, ChatGPT was assigning 3.495 categories to each paragraph, while human labelers were only identifying 1.751 categories per paragraph. These additional false categories identified by ChatGPT could be one reason behind its poor performance.

In addition to the high number of false categories identified by ChatGPT, another key observation may explain the low Jaccard similarity index. In cases where the paragraphs did not seem to fit under any of the literature review categories, ChatGPT was not choosing any category, as illustrated in the following example.

Paragraph (extracted from [21]): Industry 4.0 encompasses devices and technologies that bring with them many opportunities for new products and services [32]. The use of these technologies can usher in technical and organizational advantages and improvements. The same technologies can contribute in many ways to the performance of manufacturing processes, through decentralization [19], vertical [28] and horizontal integration [2,6,22], and remote process monitoring [28].
ChatGPT classification: In summary, the paragraph mainly provides information about Industry 4.0 and its technologies but doesn't explicitly fit into the outlined categories for literature review strategies.
Manual classification: Strategy 1A (the paragraph presents general information).

Although we were aware of this fact, we typically listed such paragraphs covering general facts under the category of surveying the nonresearch-related phenomena (Strategy 1A).

B. Confusion Matrices and Other Classification Metrics

The previously computed metrics illustrate the overall performance of ChatGPT, without evaluating it separately for each category. To better understand the behavior of ChatGPT across all the categories, more comprehensive evaluation metrics need to be computed. The confusion matrix is a widely known table used to evaluate classification algorithms. It can have many rows and columns, depending on the problem's nature. Table II shows the matrix that we used in the multilabel classification problem of our study to assess ChatGPT's performance at the level of each category. The rows consist of the labels assigned during the manual classification, while the columns represent the labels given by ChatGPT.

TABLE II
CONFUSION MATRIX IN OUR MULTILABEL CLASSIFICATION PROBLEM

                 ChatGPT: 0    ChatGPT: 1
    Manual: 0    TN            FP
    Manual: 1    FN            TP

More specifically, let us consider that the matrix in Table II is computed for Strategy 1A. In this case, the top left number of the matrix, true negative (TN), is the number of paragraphs in which both manual and ChatGPT classifications did not identify the presence of Strategy 1A. Similarly, true positive (TP) is the number of paragraphs where both classifications agree that Strategy 1A is present. False positive (FP) is the number of paragraphs where the manual classification did not detect the presence of Strategy 1A, but ChatGPT did. Finally, false negative (FN) is the number of paragraphs that were labeled as 1 in the manual classification and as 0 by ChatGPT. The sum of these four numbers should equal 293, the total number of paragraphs.

Other well-known evaluation metrics for classification problems are accuracy, precision, and recall. These metrics can be obtained from the confusion matrix of each category by using the following equations:

Accuracy = (TN + TP) / (TN + TP + FN + FP)    (3)

Precision = TP / (TP + FP)    (4)

Recall = TP / (TP + FN).    (5)

There is also the F1 score, which is a metric that combines the precision and recall as follows:

F1 = 2 × (Precision × Recall) / (Precision + Recall).    (6)

Table III summarizes the confusion matrices and the four metrics for all the 14 categories of our problem.
TABLE III
CLASSIFICATION METRICS FOR THE 14 CATEGORIES
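As an illustration of how Table II and (3)–(6) combine, the following Python sketch derives the per-category confusion counts and metrics from two lists of label sets, one set per paragraph. The sample inputs are hypothetical stand-ins for the manual and ChatGPT classifications, not the study's data.

```python
# Illustrative sketch of the per-category confusion matrix (Table II) and
# the metrics in (3)-(6). The two input lists hold one label set per
# paragraph; the sample values below are hypothetical, not the study's data.

def category_metrics(category, manual_labels, chatgpt_labels):
    tn = tp = fp = fn = 0
    for manual, chatgpt in zip(manual_labels, chatgpt_labels):
        in_manual = category in manual     # row of Table II
        in_chatgpt = category in chatgpt   # column of Table II
        if in_manual and in_chatgpt:
            tp += 1
        elif in_manual:
            fn += 1  # labeled 1 manually, 0 by ChatGPT
        elif in_chatgpt:
            fp += 1  # labeled 0 manually, 1 by ChatGPT
        else:
            tn += 1
    accuracy = (tn + tp) / (tn + tp + fn + fp)                    # Eq. (3)
    precision = tp / (tp + fp) if tp + fp else 0.0                # Eq. (4)
    recall = tp / (tp + fn) if tp + fn else 0.0                   # Eq. (5)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                         # Eq. (6)
    return {"TN": tn, "TP": tp, "FP": fp, "FN": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "F1": f1}

manual_labels = [{"1A"}, {"1C", "3A"}, {"1C"}]    # hypothetical
chatgpt_labels = [{"1A", "2E"}, {"1C"}, {"1C"}]   # hypothetical
print(category_metrics("1C", manual_labels, chatgpt_labels))
```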
Moreover, we notice a slight increase of 5.3% in the Jaccard similarity index between July and December, in addition to an improvement of 20.1% for the other metric. These observations are promising since they reflect the continuous improvement of ChatGPT versions over time.

V. DISCUSSION

A. Research Question 1: To What Extent Can ChatGPT Correctly Identify the Structure of the Literature Review Sections in Engineering Research Articles?

The poor performance of ChatGPT is clearly suggested by the average Jaccard similarity index (0.278), which reflects its overall performance. Although the ratio of the number of categories in the intersection of the two classification sets over the number of categories in the manual set presented a better score (0.632), its value still cannot be considered good, and it is only an intuitive metric. Given the multiclass, multilabel nature of our classification problem, these two metrics can only provide a general insight into ChatGPT's performance and do not explicitly exhibit its behavior across the 14 different classes.

The surprisingly high accuracy of ChatGPT achieved in most categories may seem to contradict the first two metrics. However, accuracy also cannot reflect the true performance of ChatGPT since most categories present imbalanced numbers of positive and negative instances. A weak classifier that labels every instance as negative will achieve an accuracy of 99% on a dataset that contains 99 negative instances and one positive instance. However, this high accuracy does not capture information on the classifier's ability to detect positive instances. By examining the F1 scores of the 14 categories, we can conclude that the best performance of ChatGPT is observed in Strategies 1A (surveying the nonresearch-related phenomena or knowledge claims), 1C (surveying the research-related phenomena), and 3A (research aims, focuses, questions, or hypotheses). This seems logical since these categories were also easily detected by human labelers given their straightforwardness. However, the F1 scores in these categories are barely considered moderate. On the other hand, by looking at the lowest F1 scores of ChatGPT, we can notice that they belong to strategies that were also hardly detected in the manual classification and appear less frequently, such as Strategies 1B (claiming centrality) and 2C (asserting confirmative claims about knowledge or research practices surveyed).

The way ChatGPT works depends on patterns from the data used for its training. From the context, structure, and information present in the text, the AI chatbot detects expressions, keywords, or distinctive patterns to make decisions. However, since some strategies are more complex than others, including complex sentence structure expressed in a way that diverges from a typical pattern, it becomes harder for ChatGPT to detect them. Therefore, based on this machine learning principle, the correct detection of straightforward strategies seems like a reasonable outcome [22].

Despite the unsatisfactory results of ChatGPT, the comparison between the outcomes of the three ChatGPT batches can give some promising insights. The scores achieved by the December 2023 version of ChatGPT were better than the ones achieved by the July 2023 version, both of which were clearly better than the ones achieved by the May 2023 version. This is in accordance with the findings of several research works that investigated the difference between the performance of different ChatGPT versions [10], [11]. This continuous improvement is encouraging since it highlights the future potential of ChatGPT in supporting humans, and especially students, in their educational and academic needs.

By going beyond the calculated metrics and further examining the manual classification process and the ChatGPT outputs, it can be noticed that ChatGPT is sometimes able to imitate human reasoning. For example, ChatGPT was sometimes identifying categories that were also identified by some human labelers but that were discarded when finally settling the conflicting cases. In addition, the categories that most confused ChatGPT also confused many human labelers. For instance, many paragraphs that are in fact categorized as Strategy 1A (surveying the nonresearch-related phenomena or knowledge claims) were mistakenly classified as Strategy 1C (surveying the research-related phenomena) by some human labelers and by ChatGPT. These facts reflect ChatGPT's capability to replicate human-like interaction behaviors and responses. However, we do not report the agreement between the ChatGPT classification and the combined classifications of all the labelers because this may lead to an inaccurate and imprecise assessment.

B. Research Question 2: To What Extent Can Students and Academic Writers Rely on ChatGPT to Learn How to Revise, Edit, and Structure the Literature Review Sections in Their Drafts?

As educators and researchers in technical communication, our aim was to explore the feasibility of utilizing ChatGPT for assessing student writing. Focusing on literature reviews as an illustrative example, we sought to evaluate ChatGPT's performance. In addition, our objective was to examine whether ChatGPT can comprehend rhetorical moves as instructors do. If effective, this could potentially pave the way for employing AI in academic writing assessments, and writers could use ChatGPT to learn how to properly structure their literature review sections. However, after a thorough examination of the ChatGPT responses, it can be suggested that students and academic writers cannot fully depend on ChatGPT to evaluate their literature reviews but can somewhat benefit from it when drafting their literature review sections. First, writers can look at the categories best identified by ChatGPT, such as Strategies 1A, 1C, and 3A, to discover whether important parts are missing in their drafts. For example, writers who do not find Strategy 3A in their drafts may be encouraged to briefly add a few sentences that signal the aim of their work to properly demonstrate how it is connected to previous research.

Second, the ChatGPT extracts provided in the previous section show that it was responding beyond the classification prompt by also providing the reason behind its classification and explaining the paragraph content. This seems in accordance with the findings in [12], where ChatGPT's ability to respond beyond the given prompt was also highlighted.
Such explanations can help writers recognize whether the idea that they wanted to communicate in their paragraph will be clearly understood by their readers. In addition, ChatGPT often declares if a category is implicitly present in the paragraph. In such cases, if writers indeed wanted to convey a message related to that category, they could express it more explicitly to make their point clearer. Moreover, as previously highlighted, ChatGPT was acknowledging the fact that some paragraphs did not fit under any category. Such responses would encourage writers to discard paragraphs that do not carry any useful information or to move them to the introduction if the knowledge being conveyed is still important to be communicated to their readers. Writers can also rephrase such paragraphs using a straightforward and direct sentence structure, which would make them more detectable by ChatGPT as well as by the readers. Finally, it is worth noting that ChatGPT was sometimes outputting the list of identified categories with their explanations, in addition to providing at the end a brief sentence that summarizes them (yet, in a few cases, some categories that were already identified by ChatGPT were not included in its summary). This feature can particularly help writers who lack time and are unable to read all the explanations provided by ChatGPT. Overall, ChatGPT can serve as a tool that helps writers edit their literature review sections, such as by adding valuable information to avoid writing a mere list of article summaries, removing any uninformative content, restructuring their ideas, and rephrasing any information that seems unclear.

However, whenever writers are using ChatGPT to edit their drafts, they should be aware of its limitations. ChatGPT was sometimes omitting the presence of some categories that could be obviously detected. As an example, it failed to detect the presence of Strategy 1C when the paragraph was clearly reviewing the focus and methods of previous research works. ChatGPT also detected a high number of false strategies per paragraph compared to the manual categorization. Therefore, ChatGPT is not a reliable tool for literature review evaluation in engineering articles; however, humans can still benefit from this AI chatbot while acknowledging its weaknesses and critically evaluating its output.

Moreover, users must maintain proper academic integrity and ethical writing practices when using ChatGPT; in particular, users should be aware of the lack of proper citations and attributions provided by ChatGPT. In other words, the AI chatbot may present other people's work, words, or ideas without proper acknowledgement, which constitutes plagiarism. To guarantee a safe and ethical use of ChatGPT, users can follow guidelines of proofreading and editing, verifying information, continuous adaptation and learning, and others [23]. In addition, ChatGPT has a high potential for bias. Since the training data may contain biases, ChatGPT can potentially inherit such biases [24]. The information provided by ChatGPT-3.5 does not extend beyond January 2022; therefore, the AI chatbot is not aware of new advances in certain academic fields. This highlights its limitations in interpreting new findings and terminologies [25].

Although ChatGPT can assist users in understanding complex terminologies or concepts, it is associated with a tendency to fabricate irrelevant and inaccurate information. Therefore, users should consider referring to professional and credible sources to confirm the supplementary information provided by ChatGPT, especially in specialized academic or professional contexts [26]. ChatGPT presents some great advantages in academia, as well as disadvantages. The future holds promising contributions of AI in science and academia, which will upskill resources under academic monitoring [27], [28].

It is crucial to keep in mind that AI is still lacking in many aspects and, therefore, requires further evaluation and improvement. First, continuous analysis of ChatGPT's performance in academia provides updated insights on the strengths and weaknesses of its capabilities. Also, extensive literature reviews must be conducted to identify gaps and challenges. User feedback also provides valuable insights and constructive criticism from the perspectives of researchers, educators, and students. Developed prototypes should be released only after extensive testing that integrates privacy, accuracy, and ethical considerations. Our study provides a small glimpse of how ChatGPT performs in an academic context, which is part of the initial stages of achieving improved decision making by AI in academia. Looking forward, with continuous enhancement and the involvement of both humans and AI, we can arrive at the most favorable outcomes in terms of learning, education, and academia [29]. Users can refer to AI to learn literature review structures and topic-tailored layouts. ChatGPT could provide support tailored to the needs of individual learners, known as adaptive learning. It could also uphold the constructivist theory of learning through personalized feedback and guidance to students, enhancing their writing journey [30], [31]. Human writers, by engaging and composing with AI, can resist fixed endings and "speak back" to AI. This generates the opportunity to enhance and deepen their texts as well as assert their own voices within AI systems [32].

VI. CONCLUSION

In this work, we tested the effectiveness of ChatGPT, the new AI chatbot, in identifying the structure of literature review sections in engineering research articles. In particular, we asked the chatbot to classify the literature review paragraphs into one or more of 14 categories, which are based on Kwan's model for structuring literature reviews, and then compared its outputs to the human classification. Although numerical results indicate an unsatisfactory performance of ChatGPT, it can respond beyond the classification task by giving helpful explanations. Such explanations can encourage writers to revise their drafts by adding or eliminating ideas, reorganizing content, and rephrasing information to produce more coherent and complete literature review sections. Results also show the potential of the AI chatbot in supporting human academic writing in the upcoming years. However, writers should be conscious of this chatbot's limitations and the inaccurate information that it can occasionally generate. Despite our valid results, this study has a few limitations, one of them being our small corpus size, which prevents our findings from being generalized. Thus, future research in this area may assess ChatGPT's accuracy using a larger corpus and updated versions of GPT and search for other evaluation metrics that may provide deeper insights into this chatbot's behavior.
APPENDIX
CHATGPT PROMPT

Those are the categories that are usually encountered in a literature review section of a research article. Note that the operational definitions contain typical information and there may be some exceptions.
1) Introduction: It reveals the aim, structure, and themes to be covered in the literature review and sometimes makes a case for reviewing those themes. Operational definition: This category, if present, should be at the beginning of the literature review section and no more than a few sentences.
2) Strategy 1A: Surveying the nonresearch-related phenomena or knowledge claims. Explanation: Generalizations: definitions/explanations of terminology, constructs and theories, beliefs and characterizations of nonresearch practices, or phenomena associated with the themes. Operational definition: This category, if present, typically appears at the beginning of the body of the literature review to give some background and general knowledge (it may include a citation or not).
3) Strategy 1B: Claiming centrality. Explanation: Thesis-internal claims (showing the need to review the themes) or thesis-external claims (asserting the centrality of themes).
4) Strategy 1C: Surveying the research-related phenomena. Explanation: Review of different aspects of previous studies, such as research focuses, findings, research processes, materials, or participants in the research (e.g., subjects, informants). Operational definition: This category typically comes after 1A if 1A is present. It should include the focus of previous research (examined, investigated, etc.) and/or their methods (used, etc.) and/or their results (80% accuracy, etc.). It should include a citation.
5) Strategy 2A: Counterclaiming. Explanation: Critique of the weaknesses or problems in existing research or nonresearch practices. Operational definition: This category typically should appear after Strategy 1C (or sometimes after Strategy 2E).
6) Strategy 2B: Gap-indicating. Explanation: Scarcity of various sorts, a lack of understanding of a particular phenomenon, or the need for research or nonresearch action. Operational definition: This category typically should appear after Strategies 1A or 1C (or sometimes after Strategy 2E).
7) Strategy 2C: Asserting confirmative claims about knowledge or research practices surveyed. Explanation: Affirmation of the correctness of the citation, or claims of its significance, value, strength, or contribution. Operational definition: This category typically should appear after Strategies 1A or 1C.
8) Strategy 2D: Asserting the relevancy of the surveyed claims to one's own research. Explanation: Affirmation of the applicability or relevancy of the surveyed items to the current research.
9) Strategy 2E: Abstracting or synthesizing knowledge claims to establish a theoretical position or a theoretical framework. Explanation: Introduction of a new perspective or a theoretical framework that is abstracted from the works cited. Operational definition: This category typically should appear after Strategies 1A or 1C.
10) Strategy 3A: Research aims, focuses, questions, or hypotheses. Explanation: Announcing the aim of the research. Operational definition: This category typically should appear at the end of the literature review body (This work, In this study, etc.) to show the current work's goals.
11) Strategy 3B: Theoretical positions/theoretical frameworks. Explanation: Announcing the theoretical position or the theoretical framework. Operational definition: This category typically should appear at the end of the literature review body (This work, In this study, etc.) to show the theoretical framework of the current work.
12) Strategy 3C: Research design/processes. Explanation: Announcing the research design or the research process. Operational definition: This category typically should appear at the end of the literature review body (This work, In this study, etc.) to briefly describe the method used in the current work.
13) Strategy 3D: Interpretations of terminology used in the thesis. Explanation: Announcing the adoption of terms or definitions of terms. Operational definition: This category can appear at the end or in the middle of the literature review body to define some concepts or terms in the authors' own context.
14) Conclusion: It summarizes the point of the literature review, reiterates its purposes, presents insights that emerge from it, announces where some of the points will be revisited, or sometimes introduces the current research aims and theoretical positions (Move 3).

This is a paragraph of the literature review section of a research article.
Paragraph from collected sample inserted here.
Can you please classify the paragraph into one or more of those categories.
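For readers who wish to reproduce this protocol programmatically rather than through the ChatGPT web interface (as was done in this study), a minimal sketch using the OpenAI Python SDK is shown below. The scheme text, paragraph list, and model name are placeholders and assumptions, and sending each paragraph in a fresh request only approximates the study's setup of one prompt per paragraph within one chat per article.

```python
# A minimal, hypothetical sketch of automating the per-paragraph prompt
# above with the OpenAI Python SDK. The study itself used the ChatGPT
# web interface (one chat per article); the scheme text, paragraphs, and
# model choice here are placeholders.
from openai import OpenAI

client = OpenAI()  # expects the OPENAI_API_KEY environment variable

LABELING_SCHEME = "..."  # the 14 categories and definitions listed above
paragraphs = ["..."]     # literature review paragraphs of one article

for paragraph in paragraphs:
    prompt = (
        f"{LABELING_SCHEME}\n\n"
        "This is a paragraph of the literature review section of a "
        "research article.\n\n"
        f"{paragraph}\n\n"
        "Can you please classify the paragraph into one or more of "
        "those categories."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed stand-in for ChatGPT-3.5
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)
```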
ACKNOWLEDGMENT

The authors would like to thank the graduate assistants Zahraa Swaidan and Lynn Bazzi for their help in manually classifying the first batch of paragraphs in the literature review sections.

REFERENCES

[1] W. Winaitham, "The scientific review of AI functions of enhancement English learning and teaching," in Proc. 13th Int. Conf. Inf. Commun. Technol. Convergence, 2022, pp. 148–152.
[2] L. Anthony and G. Lashkia, "Mover: A machine learning tool to assist in the reading and writing of technical papers," IEEE Trans. Prof. Commun., vol. 46, no. 3, pp. 185–193, Sep. 2003.
[3] ChatGPT, Jan. 14, 2024. [Online]. Available: https://chat.openai.com/
[4] A. Haleem, M. Javaid, and R. P. Singh, "An era of ChatGPT as a significant futuristic support tool: A study on features, abilities, and challenges," BenchCouncil Trans. Benchmarks, Standards Eval., vol. 2, no. 4, 2022, Art. no. 100089.
[5] F. R. Baskara, "Integrating ChatGPT into EFL writing instruction: Benefits and challenges," Int. J. Educ. Learn., vol. 5, no. 1, pp. 44–55, 2023.
[6] C. K. Lo, "What is the impact of ChatGPT on education? A rapid review of the literature," Educ. Sci., vol. 13, no. 4, 2023, Art. no. 410.
[7] M. K. Sastry and C. Mohammed, "The summary-comparison matrix: A tool for writing the literature review," in Proc. IEEE Int. Professional Commun. Conf., 2013, pp. 1–5.
[8] B. Graefen and N. Fazal, "Gpteacher: Examining the efficacy of ChatGPT as a tool for public health education," Eur. J. Educ. Stud., vol. 10, no. 8, pp. 254–263, 2023.
[9] T. H. Kung et al., "Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models," PLoS Digit. Health, vol. 2, no. 2, pp. 1–12, 2023.
[10] S. Frieder et al., "Mathematical capabilities of ChatGPT," in Proc. 37th Int. Conf. Neural Inf. Process. Syst., Red Hook, NY, USA: Curran Associates Inc., 2024, p. 46.
[11] A. Bahrini et al., "ChatGPT: Applications, opportunities, and threats," in Proc. Syst. Inf. Eng. Des. Symp., 2023, pp. 274–279.
[12] H. Wu, W. Wang, Y. Wan, W. Jiao, and M. Lyu, "ChatGPT or Grammarly? Evaluating ChatGPT on grammatical error correction benchmark," 2023, arXiv:2303.13648.
[13] G. Cooper, "Examining science education in ChatGPT: An exploratory study of generative artificial intelligence," J. Sci. Educ. Technol., vol. 32, pp. 1–9, 2023.
[14] J. Huang and M. Tan, "The role of ChatGPT in scientific communication: Writing better scientific review articles," Amer. J. Cancer Res., vol. 13, no. 4, pp. 1148–1154, 2023.
[15] M. Haman and M. Školník, "Using ChatGPT to conduct a literature review," Accountability Res., pp. 1–3, 2023.
[16] C. Hart, Doing a Literature Review: Releasing the Research Imagination (Sage Study Skills Series), 2nd ed. Thousand Oaks, CA, USA: Sage, 2018.
[17] I. Kostka and R. Toncelli, "Exploring applications of ChatGPT to English language teaching: Opportunities, challenges, and recommendations," TESL-EJ, vol. 27, no. 3, p. 19, Nov. 2023.
[18] Scopus, Jan. 14, 2024. [Online]. Available: https://www.scopus.com/
[19] B. S. Kwan, "The schematic structure of literature reviews in doctoral theses of applied linguistics," English Specific Purposes, vol. 25, no. 1, pp. 30–55, 2006.
[20] X. Hong, M. A. Lejeune, and N. Noyan, "Stochastic network design for disaster preparedness," IIE Trans., vol. 47, no. 4, pp. 329–357, 2015.
[21] D. C. Fettermann, C. G. Sá Cavalcante, T. D. de Almeida, and G. L. Tortorella, "How does Industry 4.0 contribute to operations management?," J. Ind. Prod. Eng., vol. 35, no. 4, pp. 255–268, 2018.
[22] N. Robert, "How artificial intelligence is changing nursing," Nurs. Manage., vol. 50, no. 9, pp. 30–39, 2019.
[23] A. Jarrah, Y. Wardat, and P. Fidalgo, "Using ChatGPT in academic writing is (not) a form of plagiarism: What does the literature say?," Online J. Commun. Media Technol., vol. 13, no. 4, Oct. 2023, Art. no. e202346.
[24] P. P. Ray, "ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope," Internet Things Cyber-Phys. Syst., vol. 3, pp. 121–154, 2023.
[25] S. Rice, S. R. Crouse, S. R. Winter, and C. Rice, "The advantages and limitations of using ChatGPT to enhance technological research," Technol. Soc., vol. 76, 2024, Art. no. 102426.
[26] A. Sangzin, "A use case of ChatGPT in a flipped medical terminology course," Korean J. Med. Educ., vol. 35, no. 3, pp. 303–307, 2023.
[27] I. Dergaa, K. Chamari, P. Zmijewski, and H. Ben Saad, "From human writing to artificial intelligence generated text: Examining the prospects and potential threats of ChatGPT in academic writing," Biol. Sport, vol. 40, no. 2, pp. 615–622, 2023, doi: 10.5114/biolsport.2023.125623.
[28] A. H. Kumar, "Analysis of ChatGPT tool to assess the potential of its utility for academic writing in biomedical domain," Biol., Eng., Med. Sci. Rep., vol. 9, no. 1, pp. 24–30, Feb. 2023.
[29] S. Grassini, "Shaping the future of education: Exploring the potential and consequences of AI and ChatGPT in educational settings," Educ. Sci., vol. 13, no. 7, 2023, Art. no. 692.
[30] T. Rasul et al., "The role of ChatGPT in higher education: Benefits, challenges, and future research directions," J. Appl. Learn. Teach., vol. 6, no. 1, pp. 41–56, May 2023.
[31] D. Baidoo-Anu and L. Owusu Ansah, "Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning," J. AI, vol. 7, no. 1, pp. 52–62, 2023.
[32] R. Li, "'Weaving tales of resilience': Cyborg composing with AI," English Teaching: Pract. Critique, vol. 23, no. 1, pp. 57–66, 2024.

Maha Issa received the master's degree in electrical and computer engineering with a focus on artificial intelligence and machine learning, and the master's degree in biomedical engineering from the American University of Beirut, Beirut, Lebanon, in 2022 and 2023, respectively. She is currently working toward the Ph.D. degree in computer science and engineering with Santa Clara University, Santa Clara, CA, USA.
She was a Research Assistant in technical and academic communication at the American University of Beirut.

Marwa Faraj received the bachelor's degree in biology from Lebanese University, Beirut, Lebanon, in 2020, and the master's degree in biomedical engineering from the American University of Beirut, Beirut, in 2023.
She is currently a biomedical engineering graduate and a Research Assistant with the American University of Beirut, Beirut. During her master's, she studied the role of nanoparticles in facilitating the delivery of drugs in cardiac hypoxia injury. She has also completed training on academic and technical communication.

Niveen AbiGhannam received the B.S. degree in biology and the M.S. degree in environmental policy from the American University of Beirut, Beirut, Lebanon, in 2007 and 2009, respectively, and the Ph.D. degree in science communication from the University of Texas at Austin, Austin, TX, USA, in 2015.
She is currently an Engineering Communication Researcher and Educator with the University of Texas at Austin. She studies the strategic communication of scientific and technical knowledge with different academic and nonacademic audiences.