Conference Paper · January 2025
DOI: 10.5281/zenodo.14174164


Exploring Citation Impact in Circular Economy Research: An
Analysis of Expected Citations Based on LLM-Generated
Lexical Similarities Between Papers
Niloufar Farrokhzad*, Mehmet Ali Abdulhayoglu** and Bart Thijs***

* [email protected]
https://orcid.org/0009-0001-0995-8915
KU Leuven, FEB, ECOOM, Belgium

** [email protected]
https://orcid.org/0000-0002-1288-2181
KU Leuven, FEB, ECOOM, Belgium

***[email protected]
https://orcid.org/0000-0003-0446-8332
KU Leuven, FEB, ECOOM, Belgium

Abstract
We present a novel approach for estimating citation impact in circular economy research using
Large Language Models to create lexical similarity relationships between papers. By applying
cosine similarity, we weight the estimated citations at the paper level based on the citations of
their most similar papers. This approach builds on the concept of related records and employs
a bottom-up clustering methodology for citation-based assessments, enhancing the granularity
and accuracy of bibliometric analysis. Our dataset consists of publications from 2001 to 2022
sourced from the Web of Science Core Collection and processed by ECOOM. Using this
comprehensive dataset, we identify thematic clusters and ultimately refine citation estimations
by accounting for publication year vicinity, cluster, and document type, ensuring the highest
level of accuracy. This combined normalization strategy yielded improved results, providing a
more nuanced understanding of citation impact in Circular Economy research.

1. Introduction
Assessing the impact of academic research necessitates a comprehensive evaluation of its
influence, which is often measured by citation rates. However, conventional methods, such as
journal-based or subject-based standards, may fail to capture the intricate contextual and
thematic nuances that contribute to a paper's relevance and its subsequent citation potential.
These traditional metrics often lack the necessary granularity, potentially overlooking subtle
distinctions within a field and diminishing the accuracy of citation-based assessments. The
problem has been acknowledged in the literature, and several remedies have been attempted, e.g., fine-grained top-down clustering approaches in which each paper is assigned to a single, fixed cluster. Yet these methods struggle with papers that may span multiple topics, limiting their flexibility for interdisciplinary research.
Our study introduces a new approach to estimating expected citation rates. Drawing inspiration from the environment-based citation normalization approach proposed by Glänzel and Thijs (2017), we implement this methodology using lexical-similarity relations generated by Large Language Models.
When establishing citation indicator standards, we must recognize that patterns and practices
vary within the same subject. Imposing a unified disciplinary standard based on cognitive
classification overlooks these nuances. Therefore, two approaches are possible: the bottom-up solution proposed by Schubert and Braun (1986), which uses related records as reference standards, and the top-down solution suggested by Waltman and Van Eck (2012), which clusters the document space "down" to the necessary micro-environments. Top-down high-resolution clustering enables fully automated processes to create unlabeled micro-environments, starting from the whole dataset and dividing it into smaller clusters. However, it has serious drawbacks, such as micro-clusters tied to static environments and large numbers of singletons and minute clusters. In contrast, bottom-up approaches cluster documents based on organic
relationships. This approach builds the clusters from the ground up, starting from individual
papers and looking for direct relationships. Related-record-based environments serve as a
promising alternative to cognitive subject categories in providing reference standards for
citation rates of individual documents.
Following the approach proposed by Schubert and Braun (1993) for citation-based assessments
using bibliographic coupling, we applied bottom-up clustering to organize the CE literature into
meaningful environments based on their inherent semantic characteristics, avoiding reliance on
preassigned subject categories or top-down expert-delineated topics. This clustering approach
supports our primary goal of analyzing expected citation rates, as it offers a more contextual
and broader perspective on the field's thematic diversity. By crafting these environments, we
can assess expected citations based on lexical similarity within specific thematic clusters.
Our research commences with a curated set of publications about the Circular Economy (CE),
compiled within the framework of a domain study commissioned by the Flemish government
to map related research activities. These publications represent a diverse, evolving, and highly
multidisciplinary corpus.
Within this study, we present and analyze a novel method whereby LLMs generate embeddings
for individual papers within the CE corpus, enabling the calculation of cosine similarities
between them. Leveraging these embeddings, we construct a network that identifies the top 100
most lexically similar papers for each document. This similarity-based network forms the
foundation for calculating expected citation rates.
By deriving expected citation rates from the cosine-weighted citation counts of these lexically
similar papers, our approach offers a nuanced perspective on estimating the potential impact of
research. By integrating context and meaning into the estimation process, we aim to provide a
more comprehensive and accurate assessment of research impact.

2. Data and Methods


2.1. Data
The data for this study is sourced from the Web of Science Core Collection by Clarivate
Analytics, covering bibliographic records from 2001 to 2022. The dataset was processed and
cleaned at ECOOM, ensuring a high level of data quality and accuracy. To maintain a complete
three-year citation window, the analysis is restricted to publications up to 2020.
The dataset includes citable documents such as articles, letters, notes, proceedings papers, and
reviews. This core dataset was constructed using a detailed search strategy encompassing
mandatory and optional keywords. This strategy was designed to cover a broad range of CE
topics while reducing the risk of missing key research. It was validated through rigorous testing
and domain expert feedback, ensuring both sensitivity and specificity.
To broaden the CE dataset, a citation-based expansion method was applied, including papers
that either reference or are referenced by the initial core set. This approach captures a broader
spectrum of the CE discourse, allowing the analysis to delve into the evolving understanding
of the CE through a more interconnected research perspective.
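As an illustration, this expansion amounts to a simple set operation over a citation edge list; the identifiers below are hypothetical, and the actual step was carried out on the ECOOM-processed Web of Science data.

```python
# Hypothetical sketch of the citation-based expansion: starting from the
# keyword-based core set, add every paper that cites or is cited by it.
core_paper_ids = {"p1", "p2", "p3"}
citation_pairs = [("p4", "p1"), ("p2", "p5"), ("p6", "p7")]  # (citing, cited)

expanded = set(core_paper_ids)
for citing, cited in citation_pairs:
    if cited in core_paper_ids:   # papers referencing the core set
        expanded.add(citing)
    if citing in core_paper_ids:  # papers referenced by the core set
        expanded.add(cited)

print(sorted(expanded))  # ['p1', 'p2', 'p3', 'p4', 'p5']; p6/p7 stay out
```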
2.2. Expected Citation Scores
In line with the methodology outlined by Glänzel, Thijs, Schubert, and Debackere (2009), the normalization of citation impact and the assessment of expected citations in scientific research rely on several fundamental methodologies and metrics. These include:
- Observed Citation Rate: This metric reflects the factual citation impact of a unit, such
as a country, region, or institution. The Mean Observed Citation Rate (MOCR) is
quantified as the ratio of citation count to publication count.
- Mean Expected Citation Rate (MECR): MECR represents the average citation rate of
all papers published in the same journal within the same year. While it can be calculated
using different citation windows, consistency in the chosen window is crucial for
meaningful comparisons. MECR is computed by aggregating impact measures (e.g.,
journal impact) and dividing by the number of papers.
- Field Expected Citation Rate (FECR): Similar to MECR, FECR focuses on subfields,
providing the average citation rate of all papers published within the same subject in the
same year.
- Normalized Mean Citation Rate (NMCR): NMCR measures the ratio of observed
citation impact to field-based expected citation impact, thereby assessing citation rates
against the standards established by specific subfields.
- Relative Citation Rate (RCR): RCR compares observed citation impact to journal-based
expected citation impact, offering insight into the relative influence of a paper within its
publication venue.
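For reference, these definitions can be restated compactly in our own notation (not the original authors'), where $C$ and $P$ are the unit's citation and publication counts, and $e^{J}_{i}$ and $e^{F}_{i}$ denote the journal- and subfield-expected citation rates of paper $i$ in its publication year:

$$\mathrm{MOCR} = \frac{C}{P}, \qquad \mathrm{MECR} = \frac{1}{P}\sum_{i=1}^{P} e^{J}_{i}, \qquad \mathrm{FECR} = \frac{1}{P}\sum_{i=1}^{P} e^{F}_{i},$$

$$\mathrm{RCR} = \frac{\mathrm{MOCR}}{\mathrm{MECR}}, \qquad \mathrm{NMCR} = \frac{\mathrm{MOCR}}{\mathrm{FECR}}.$$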
These metrics facilitate the normalization of citation rates by benchmarking against journal or
subfield standards, but they may need redefinition. Journals function as small neighborhoods,
while broader subfields encompass diverse areas with unique citation practices. This diversity
necessitates reexamining traditional indicators to ensure they accurately reflect research impact
across different citation practices.

2.3. LLM-Based Lexical Similarity


To capture semantic relationships among words and sentences in our corpus, we use LLMs to
create embeddings, representing documents in a lower-dimensional space. Unlike traditional
methods that rely on shared keywords, LLMs capture deeper contextual similarities, enabling
us to identify similar documents even when they do not share exact words.
We leverage LLMs, specifically all-MiniLM-L6-v2, to generate embeddings from the abstracts and titles of documents. With an input size of 512 tokens (approximately 350 words), this model is designed to efficiently create embeddings that reflect the semantic meaning of short texts such as abstracts and titles.
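A minimal sketch of this embedding step with the sentence-transformers library, assuming titles and abstracts are concatenated into a single input string per paper (the authors' exact preprocessing is not specified):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Title of paper one. Abstract text of paper one ...",
    "Title of paper two. Abstract text of paper two ...",
]
# Inputs beyond the model's token limit are truncated automatically;
# normalized vectors make cosine similarity a plain dot product.
embeddings = model.encode(docs, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384): all-MiniLM-L6-v2 produces 384-d vectors
```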
The embeddings are stored in ChromaDB, a vector database, to enable efficient similarity
calculations. We use cosine similarity to measure the closeness between documents and
determine the top 100 most similar publications for each document in our CE dataset. This
threshold of 100 is derived from a distribution analysis, ensuring a balance in the breadth of our
selection to avoid diluting the similarity measure's predictive power.
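Continuing that sketch, storage and top-k retrieval could look as follows; the collection name and client configuration are illustrative assumptions, not details from the paper:

```python
import chromadb

client = chromadb.Client()  # in-memory; a persistent client works the same
collection = client.create_collection(
    name="ce_papers",
    metadata={"hnsw:space": "cosine"},  # use cosine distance for queries
)
collection.add(
    ids=["p1", "p2"],
    embeddings=embeddings.tolist(),  # from the sentence-transformers step
    documents=docs,
)
# Nearest neighbours for one paper (n_results=100 in the study); cosine
# similarity = 1 - returned distance, and the query paper itself comes
# back as the top hit, so it is excluded in practice.
hits = collection.query(query_embeddings=[embeddings[0].tolist()], n_results=2)
```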
These similarity-based connections form the basis of a large-scale document network with over
one million nodes. This network structure allows us to explore expected citation rates by
examining cosine-weighted citation counts of these similar papers, providing a comprehensive
framework for analyzing the citation impact of CE research. This approach offers a robust
methodology for understanding how semantic relationships translate into citation patterns.

$$ECitR_{LLM,i} = \frac{\sum_{j \in S_i} \mathrm{cosine}(i,j) \cdot cit_j}{\sum_{j \in S_i} \mathrm{cosine}(i,j)}$$

where $S_i$ denotes the set of the 100 most lexically similar papers to paper $i$, $\mathrm{cosine}(i,j)$ the cosine similarity between their embeddings, and $cit_j$ the citation count of paper $j$.
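In code, this weighted average is a one-liner; the similarity and citation values below are made up for illustration:

```python
import numpy as np

sims = np.array([0.92, 0.88, 0.85])  # cosine similarities to the top papers
cits = np.array([14, 3, 7])          # their observed citation counts

expected = np.sum(sims * cits) / np.sum(sims)
print(round(expected, 2))  # the paper's cosine-weighted expected citations
```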
2.4. Bottom-Up Clustering

We employed a modularity-based community detection technique to identify communities of documents sharing similar thematic content, in a bottom-up approach. This method assigns each
document to a single community, resulting in 29 distinct clusters across 1.1 million publications
spanning twenty years. This clustering methodology reveals cohesive groups of documents
representing distinct topics within the CE-related literature.
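The paper does not name the specific modularity-based algorithm, so the sketch below assumes Louvain as implemented in networkx, run on the similarity network with cosine values as edge weights:

```python
import networkx as nx

# Toy similarity network; in the study the graph has over a million nodes,
# each linked to its top-100 most similar papers.
G = nx.Graph()
G.add_weighted_edges_from([
    ("p1", "p2", 0.91), ("p2", "p3", 0.88), ("p4", "p5", 0.90),
])
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
print(communities)  # e.g. [{'p1', 'p2', 'p3'}, {'p4', 'p5'}]
```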
After creating the clusters, we labeled them through content analysis, which involved
examining common keywords and central papers within each group. The labeling process
provided insights into the underlying themes and allowed a deeper understanding of the topics
prevalent in CE-related research. The labels (Appendix a.) reflect the diversity of the CE field,
with some clusters exhibiting direct connections to CE topics and others exploring less intuitive
themes.
The clusters also reveal varied patterns of citation and network-based characteristics, indicating
robust and distinct thematic clustering. The majority of research activity within most clusters
has occurred in the last five years, with 47% of documented efforts concentrated in this
timeframe. Some clusters, like Cluster 18 on microplastics, have seen a significant focus on
recent research (82% between 2018 and 2022), while others display a more balanced
distribution across the study period.

3. Results and Discussion


In essence, for each paper in our corpus, we selected the 100 most lexically similar papers from
the same dataset, with similarity measured using cosine similarity on their embeddings created
by an LLM. We also identified 29 distinct clusters, each representing a thematic area within the
CE literature, with each paper belonging to one cluster. The objective is to calculate the
expected citations for each paper based on a weighted average of its most similar set of papers,
with weights being the cosine similarity calculated between each pair of documents. We then
analyze this ratio through the lens of thematic clusters over time.
Initially, we calculated LLM-based expected citations at the paper level and then aggregated them for a given set, e.g., the countries with a high share of the CE literature, and compared them with the Normalized Mean Citation Rate (NMCR) across those countries, to determine whether this new metric correlates with the traditional method of calculating NMCR (Figure 1). The comparison between NMCR
and LLM-based expected citation rates was made by a regression analysis, which yielded a
correlation coefficient of 0.59 with a p-value of 2.378e-22, suggesting a statistically significant
relationship between these two variables. This result indicates a moderate positive correlation,
implying that variations in NMCR are likely linked to changes in LLM-based expected citation
rates.
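The country-level comparison reduces to a Pearson correlation over paired scores; the arrays below are placeholders, not the study's data:

```python
from scipy.stats import pearsonr

nmcr = [1.10, 0.95, 1.30, 0.88, 1.02]          # traditional NMCR per country
llm_expected = [1.05, 0.90, 1.20, 0.99, 1.08]  # LLM-based score per country

r, p_value = pearsonr(nmcr, llm_expected)
print(r, p_value)  # the study reports r = 0.59, p = 2.378e-22
```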

Figure 1. Correlation Between NMCR and LLM-Based Expected Citations


Next, we validate the scores through a more fine-grained investigation, hence the clustering, and examine their stability over time as the network expands. In Figure
2, we can see the ratio of actual citations to cosine-weighted expected citations across each
cluster. In most clusters, during the early years, there's a noticeable overestimation (estimation
values less than 1), which improves over time but tends to underestimate (estimation values
greater than 1) in recent years. This may result from the trend of rapid growth and an increasing
pattern of citations among most clusters, except for a few. The CE field has rapidly evolved in recent years, with papers from 2001–2003 generally receiving fewer citations compared to those in
later periods. This trend suggests that comparing each paper with others from the same period
can lead to a more realistic estimation.

Figure 2. Cosine-Weighted Expected Citations Ratio

Considering the entire CE period without conditioning on publication-year windows, the weighted average tends to overestimate citation rates for CE papers by nearly 30%. This could be due to several reasons:
1- High-impact Paper Outliers: If highly cited papers within the top 100 similar papers
significantly impact the estimated citation rate, it could skew the expected citation rate
upwards. These papers might be outliers with exceptionally high citation counts, not
due to lexical similarity but other factors like being authored by leading researchers in
the field. Too many outliers with disproportionately high citations in each set can dilute
the results.
2- Bias Towards High-Citation Papers: The LLM might unintentionally favor content
from papers that are already well-cited. If high-citation papers tend to use certain
keywords or phrases that become defining characteristics of what the model perceives
as important within the dataset, the model could be more likely to select these papers as
similar to others. In fields like CE, specific keywords might gain popularity at certain
times. If our model is picking up on these popular terms or themes disproportionately,
it might identify similarity based on these hot topics rather than broader, more genuine
lexical similarity.
3- Common Themes and Vocabulary: High-citation papers might discuss cutting-edge
topics or popular methodologies that set trends within the field. As these topics become
more prevalent, the vocabulary and context associated with them may also become more
common in subsequent publications. Thus, a highly cited paper could set trends that
influence the word choice criteria for lexical similarity in later papers.
4- Comprehensive Abstracts in High-Citation Papers: High-citation papers often have
well-crafted abstracts that cover the study's objectives, methodologies, and findings
comprehensively, making them rich in keywords and contextual content. When
embeddings are generated from these abstracts, they may be contextually richer, leading
to higher similarity scores with other similarly comprehensive abstracts.
Investigating the top 10% of highly cited documents among the most similar papers shows that some semi-highly-cited papers recur in the lists of most lexically similar papers for 1% to nearly 10% of papers. Although this is not a prevalent pattern, and there is no significant correlation between high citedness and being recognized as the most lexically similar paper (so reason 1 does not apply), the overrepresentation of these highly cited papers can become more problematic in high-granularity analyses.
When examining these recurring most-similar papers, many of them, particularly the highly cited ones, are reviews, rich in keywords such as "comprehensive review", "from...to...", "paper reviews...", "literature", "towards...", "summarizes", "Trends in", "A rift into", "advances in". Additionally, these publications, besides reviews, tend to contain other markers of recency (like "recent" and "latest") or are rich in diverse keywords, especially in multidisciplinary studies, and have rather longer abstracts (reason 4), making them more likely to be recognized as lexically similar to other papers.
To increase the homogeneity of the keywords affecting LLM decisions, we normalize our
calculations both by document type (as each type comprises a specific set of keywords and
some types, like reviews, generally receive more citations) and by thematic clusters (each
representing a subject area with its related keywords). This normalization aims to improve the
expected citation ratio estimation.
Normalizing by document type means that, within each cluster and period, each paper's citations
are estimated based on the weighted citations of the most similar papers only from the same
document type as the document itself. For example, a review paper would be compared only
with the citations of other review papers that share the most lexical similarity. Similarly,
normalization by cluster means within each time window and cluster, papers are only compared
with their most lexically similar papers within the same thematic cluster as themselves.
In comparison, results from normalization by cluster (Figure 3) show better outcomes during the early years, suggesting an improvement in overestimation. On the other hand, normalization by document type (Figure 4) performs better in recent year windows, minimizing underestimation and keeping the expected citation ratio closer to 1.
Figure 3. Cosine-Weighted Expected Citation Ratio Normalized to Cluster

Figure 4. Cosine-Weighted Expected Citation Ratio Normalized to Document Type

Considering the impact of publication year on the estimation of expected citations, we finally
normalized by publication year. This means within each cluster and publication year window,
each document is only compared with the most lexically similar publications with less than a
three-year absolute difference in their publication years.
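A sketch of this combined filter, with hypothetical field names, showing how a paper's expected citations would be computed only from comparable neighbours:

```python
def expected_citations(paper, neighbours):
    """Cosine-weighted expected citations from neighbours of the same
    document type and cluster published within three years."""
    num = den = 0.0
    for nb in neighbours:  # the cosine-ranked top-100 similar papers
        if (nb["doc_type"] == paper["doc_type"]
                and nb["cluster"] == paper["cluster"]
                and abs(nb["year"] - paper["year"]) < 3):
            num += nb["cosine"] * nb["citations"]
            den += nb["cosine"]
    return num / den if den else None  # None if no comparable neighbour

paper = {"doc_type": "Article", "cluster": 18, "year": 2019}
neighbours = [
    {"doc_type": "Article", "cluster": 18, "year": 2020,
     "cosine": 0.91, "citations": 12},
    {"doc_type": "Review", "cluster": 18, "year": 2019,
     "cosine": 0.89, "citations": 40},  # filtered out: different type
]
print(expected_citations(paper, neighbours))  # 12.0
```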
Figure 5 shows the ultimate version of the model, estimating the citation ratio of papers across clusters and publication year windows, normalized by publication year, thematic cluster, and document type. This model provides the best results compared to previous versions, with expected citation ratios closest to 1, indicating more accurate estimations. One drawback of environment-based approaches is that the environment-normalized citation score for the complete set is not guaranteed to equal 1, as it is with journal- and field-based expected citation scores.

Figure 5. Cosine-Weighted Expected Citation Ratio Normalized to Document Type, Cluster, and Publication-Year Vicinity

The results of the correlation analysis (Figure 6) confirm that while controlling for publication
year vicinity and document type significantly improves citation estimation, incorporating
thematic clusters further enhances the accuracy, highlighting the importance of combining all
factors for the best results.

Figure 6. Pearson's Correlation Between Actual Citations and the Expected Citations from Different Estimation Models

4. Conclusion and Future Work


Our study on expected citation impact in circular economy research revealed the benefits of
using Large Language Models to create lexical relationships between papers, enabling the
creation of environments or clusters that share common themes. This clustering approach
allows for a more granular analysis, contrasting with traditional journal-based or subject-based
standards. The expected citation rates were derived from cosine-weighted citation counts, and
normalization to publication year, thematic cluster, and document type significantly improves the estimation results.
However, utilizing LLMs to establish lexical similarities is a relatively new method and requires
further exploration. The outcomes suggest that these models, trained on our corpora, have
learned patterns that can be useful in bibliometric analysis, but their exact functioning demands
a deeper understanding. Techniques like model explainability and sensitivity analysis could
help assess whether inherent biases affect how the LLM selects words or phrases.

While the model's outcomes were promising, several areas require future investigation. First,
examining other measures of similarity besides cosine could mitigate the impact of high-
dimensionality, where minor variations might lead to significant ranking shifts, potentially
causing over- or underestimation. Additionally, the selection of the top 100 similar papers could
be refined by introducing a mechanism for random sampling among high-similarity candidates,
reducing outlier effects. This method could provide a more varied and potentially less biased
set of similar papers.

5. Bibliographic References
Glänzel, W., & Thijs, B. (2017). Bridging another gap between research assessment and information retrieval – The delineation of document environments [Conference presentation]. STI 2017, France. https://sti2017.ifris.org/wp-content/uploads/2017/11/is3-glanzel-thys.pdf

Glänzel, W., Thijs, B., Schubert, A., & Debackere, K. (2009). Subfield-specific normalized
relative indicators and a new generation of relational charts: Methodological
foundations illustrated on the assessment of institutional research performance.
Scientometrics, 165–188.

Schubert, A., & Braun, T. (1993). Reference standards for citation-based assessments. Scientometrics, 26(1), 21–35.

6. Appendix

Appendix a. Labels of the 29 clusters identified within the Circular Economy literature, compiled for an ECOOM internal report and published previously.
