** [email protected]
https://orcid.org/0000-0002-1288-2181
KU Leuven, FEB, ECOOM, Belgium
***[email protected]
https://orcid.org/0000-0003-0446-8332
KU Leuven, FEB, ECOOM, Belgium
Abstract
We present a novel approach for estimating citation impact in Circular Economy research using
Large Language Models to create lexical similarity relationships between papers. By applying
cosine similarity, we weight the estimated citations at the paper level based on the citations of
their most similar papers. This approach builds on the concept of related records and employs
a bottom-up clustering methodology for citation-based assessments, enhancing the granularity
and accuracy of bibliometric analysis. Our dataset consists of publications from 2001 to 2022
sourced from the Web of Science Core Collection and processed by ECOOM. Using this
comprehensive dataset, we identify thematic clusters and ultimately refine citation estimates by accounting for publication year vicinity, thematic cluster, and document type. This combined normalization strategy yields improved results, providing a more nuanced understanding of citation impact in Circular Economy research.
1. Introduction
Assessing the impact of academic research necessitates a comprehensive evaluation of its
influence, which is often measured by citation rates. However, conventional methods, such as
journal-based or subject-based standards, may fail to capture the intricate contextual and
thematic nuances that contribute to a paper's relevance and its subsequent citation potential.
These traditional metrics often lack the necessary granularity, potentially overlooking subtle
distinctions within a field and diminishing the accuracy of citation-based assessments. The
problem has been acknowledged in the literature, and several remedies have been attempted, e.g., fine-grained top-down clustering approaches in which each paper is assigned to a single, fixed cluster. Yet these methods struggle with papers that span multiple topics, limiting their flexibility for interdisciplinary research.
Our study introduces a new approach to estimating expected citation rates. Drawing inspiration from the environment-based citation normalization approach proposed by Glänzel and Thijs (2017), we implement this methodology using lexical-similarity relations generated by Large Language Models.
When establishing citation indicator standards, we must recognize that patterns and practices
vary within the same subject. Imposing a unified disciplinary standard based on cognitive
classification overlooks these nuances. Two approaches are therefore possible: the bottom-up solution proposed by Schubert and Braun (1986), which uses related records as reference standards, and the top-down solution suggested by Waltman and Van Eck (2012), which clusters the document space "down" to the necessary micro-environments. Top-down high-resolution clustering enables fully automated processes to create unlabeled micro-environments, starting from the whole dataset and dividing it into smaller clusters. However, it has serious drawbacks, such as micro-clusters bound to static environments and large numbers of singletons and minute clusters. In contrast, in bottom-up approaches, documents are clustered based on organic
relationships. This approach builds the clusters from the ground up, starting from individual
papers and looking for direct relationships. Related-record-based environments serve as a
promising alternative to cognitive subject categories in providing reference standards for
citation rates of individual documents.
Following the approach proposed by Schubert and Braun (1993) for citation-based assessments
using bibliographic coupling, we applied bottom-up clustering to organize the CE literature into
meaningful environments based on their inherent semantic characteristics, avoiding reliance on
preassigned subject categories or top-down expert-delineated topics. This clustering approach
supports our primary goal of analyzing expected citation rates, as it offers a more contextual
and broader perspective on the field's thematic diversity. By crafting these environments, we
can assess expected citations based on lexical similarity within specific thematic clusters.
Our research commences with a curated set of publications about the Circular Economy (CE),
compiled within the framework of a domain study commissioned by the Flemish government
to map related research activities. These publications represent a diverse, evolving, and highly
multidisciplinary corpus.
Within this study, we present and analyze a novel method whereby LLMs generate embeddings
for individual papers within the CE corpus, enabling the calculation of cosine similarities
between them. Leveraging these embeddings, we construct a network that identifies the top 100
most lexically similar papers for each document. This similarity-based network forms the
foundation for calculating expected citation rates.
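The paper does not name the embedding model or software used, so the following Python sketch only illustrates the idea under stated assumptions: a sentence-transformers model ("all-MiniLM-L6-v2") stands in for the LLM, scikit-learn computes the cosine similarities, and `abstracts` is a hypothetical list of the CE papers' abstracts.

```python
# Minimal sketch of the similarity network construction. The embedding
# model and the `abstracts` list are illustrative assumptions, not the
# authors' actual setup.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(abstracts, normalize_embeddings=True)

# Pairwise cosine similarity between every pair of papers in the corpus.
sim = cosine_similarity(embeddings)
np.fill_diagonal(sim, -1.0)  # a paper must not appear in its own neighbour set

# Indices of the 100 most lexically similar papers for each document.
top100 = np.argsort(sim, axis=1)[:, -100:]
```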
By deriving expected citation rates from the cosine-weighted citation counts of these lexically
similar papers, our approach offers a nuanced perspective on estimating the potential impact of
research. By integrating context and meaning into the estimation process, we aim to provide a
more comprehensive and accurate assessment of research impact.
ECitR_{LLMs} = \frac{\sum_{j \in S_i} \cos(i, j) \cdot cit_j}{\sum_{j \in S_i} \cos(i, j)}

where S_i is the set of the 100 most lexically similar papers to paper i, \cos(i, j) is the cosine similarity between the embeddings of papers i and j, and cit_j is the citation count of paper j.
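As a minimal illustration of this formula, the sketch below computes the expected citation rate for one paper from the `sim` and `top100` arrays of the previous sketch; `citations` is an assumed array of citation counts per paper.

```python
import numpy as np

def expected_citation_rate(i, sim, top100, citations):
    """Cosine-weighted mean of the citation counts of paper i's
    100 most lexically similar papers (ECitR_LLMs above)."""
    neighbours = top100[i]
    weights = sim[i, neighbours]  # cosine similarities act as weights
    return float(np.dot(weights, citations[neighbours]) / weights.sum())
```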
2.3. Bottom-Up Clustering
When the entire CE period is considered without publication year windows, the weighted average tends to overestimate citation rates for CE papers by nearly 30%. This could be due to several reasons:
1- High-impact Paper Outliers: If highly cited papers within the top 100 similar papers
significantly impact the estimated citation rate, it could skew the expected citation rate
upwards. These papers might be outliers with exceptionally high citation counts, not
due to lexical similarity but other factors like being authored by leading researchers in
the field. Too many outliers with disproportionately high citations in a neighbour set can skew the results.
2- Bias Towards High-Citation Papers: The LLM might unintentionally favor content
from papers that are already well-cited. If high-citation papers tend to use certain
keywords or phrases that become defining characteristics of what the model perceives
as important within the dataset, the model could be more likely to select these papers as
similar to others. In fields like CE, specific keywords might gain popularity at certain
times. If our model is picking up on these popular terms or themes disproportionately,
it might identify similarity based on these hot topics rather than broader, more genuine
lexical similarity.
3- Common Themes and Vocabulary: High-citation papers might discuss cutting-edge
topics or popular methodologies that set trends within the field. As these topics become
more prevalent, the vocabulary and context associated with them may also become more
common in subsequent publications. Thus, a highly cited paper could set trends that
influence the word choice criteria for lexical similarity in later papers.
4- Comprehensive Abstracts in High-Citation Papers: High-citation papers often have
well-crafted abstracts that cover the study's objectives, methodologies, and findings
comprehensively, making them rich in keywords and contextual content. When
embeddings are generated from these abstracts, they may be contextually richer, leading
to higher similarity scores with other similarly comprehensive abstracts.
Investigating the top 10% of highly cited documents among the most similar papers shows that some moderately highly cited papers recur in the most-lexically-similar sets of 1% to nearly 10% of papers. Although this is not a prevalent pattern and there is no significant correlation between high citedness and recognition as the most lexically similar paper (ruling out reason 1), the overrepresentation of these highly cited papers can become more problematic in high-granularity analyses.
When examining these recurring most-similar papers, many of them, particularly the highly cited ones, are reviews, rich in keywords such as "comprehensive review", "from...to...", "paper reviews...", "literature", "towards...", "summarizes", "Trends in", "A rift into", "advances in". Beyond reviews, these publications also tend to contain mentions indicating recency (such as "recent" and "latest") or to be rich in diverse keywords, especially in multidisciplinary studies with rather long abstracts (reason 4), making them more likely to be recognized as lexically similar to other papers.
To increase the homogeneity of the keywords affecting LLM decisions, we normalize our
calculations both by document type (as each type comprises a specific set of keywords, and some types, such as reviews, generally receive more citations) and by thematic cluster (each
representing a subject area with its related keywords). This normalization aims to improve the
expected citation ratio estimation.
Normalizing by document type means that, within each cluster and period, each paper's citations are estimated from the weighted citations of only those most-similar papers that share its document type. For example, a review paper would be compared only with the citations of the most lexically similar review papers. Similarly, normalization by cluster means that, within each time window and cluster, papers are compared only with their most lexically similar papers in the same thematic cluster.
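A minimal sketch of this combined filter, reusing the hypothetical arrays from the earlier sketches and assuming `clusters` and `doc_types` arrays with one label per paper:

```python
import numpy as np

def normalized_expected_citations(i, sim, top100, citations, clusters, doc_types):
    """Expected citations for paper i, restricted to most-similar papers
    that share its thematic cluster and document type."""
    neighbours = top100[i]
    same_context = (clusters[neighbours] == clusters[i]) & \
                   (doc_types[neighbours] == doc_types[i])
    neighbours = neighbours[same_context]
    if neighbours.size == 0:  # no comparable paper left in the neighbour set
        return float("nan")
    weights = sim[i, neighbours]
    return float(np.dot(weights, citations[neighbours]) / weights.sum())
```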
In comparison, results from normalization by cluster (Figure 3) show better outcomes during the early years, reducing overestimation. Normalization by document type, on the other hand, performs better in recent year windows, minimizing underestimation and keeping the expected citation ratio closer to 1.
Figure 3. Cosine-Weighted Expected Citation Ratio Normalized to Cluster
Considering the impact of publication year on the estimation of expected citations, we finally
normalized by publication year. This means that, within each cluster and publication year window, each document is compared only with its most lexically similar publications whose publication years differ by less than three years.
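Combining all three constraints, a sketch of the final estimator follows, under the same assumptions as the earlier sketches plus a hypothetical `years` array with one publication year per paper:

```python
import numpy as np

def fully_normalized_expected_citations(i, sim, top100, citations,
                                        clusters, doc_types, years):
    """Expected citations for paper i, restricted to most-similar papers in
    the same cluster and document type, published less than 3 years apart."""
    neighbours = top100[i]
    keep = (
        (clusters[neighbours] == clusters[i])
        & (doc_types[neighbours] == doc_types[i])
        & (np.abs(years[neighbours] - years[i]) < 3)
    )
    neighbours = neighbours[keep]
    if neighbours.size == 0:
        return float("nan")
    weights = sim[i, neighbours]
    return float(np.dot(weights, citations[neighbours]) / weights.sum())
```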
The corresponding figure shows the final version of the model, estimating the citation ratio of papers across clusters and publication year windows, normalized by publication year, thematic cluster, and document type. This model provides the best results of all versions, with expected citation ratios closest to 1, indicating more accurate estimation. One drawback of environment-based approaches is that the environment-normalized citation score for the complete set is not guaranteed to equal 1, as it is with journal- and field-based expected citation scores.
The results of the correlation analysis (Figure 6) confirm that while controlling for publication
year vicinity and document type significantly improves citation estimation, incorporating
thematic clusters further enhances the accuracy, highlighting the importance of combining all
factors for the best results.
Figure 6. Pearson's Correlation Between Actual Citations and the Expected Citations from Different Estimation Models
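A minimal sketch of such a correlation check, assuming `actual` and `expected` are equal-length arrays of actual and estimated citation counts:

```python
import numpy as np
from scipy.stats import pearsonr

# Papers with no comparable neighbours yield NaN estimates and are excluded.
valid = ~np.isnan(expected)
r, p_value = pearsonr(actual[valid], expected[valid])
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```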
While the model's outcomes were promising, several areas require future investigation. First,
examining other measures of similarity besides cosine could mitigate the impact of high-
dimensionality, where minor variations might lead to significant ranking shifts, potentially
causing over- or underestimation. Additionally, the selection of the top 100 similar papers could
be refined by introducing a mechanism for random sampling among high-similarity candidates,
reducing outlier effects. This method could provide a more varied and potentially less biased
set of similar papers.
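As an illustration of this proposed refinement, the sketch below samples 100 neighbours at random from a larger high-similarity pool rather than always taking the top 100; the pool size of 300 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# High-similarity candidate pool for paper i (top 300, illustrative choice).
pool = np.argsort(sim[i])[-300:]
# Random sample of 100 candidates, damping the effect of papers that
# recur as "most similar" throughout the corpus.
neighbours = rng.choice(pool, size=100, replace=False)
```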
5. Bibliographic References
Glänzel, W., & Thijs, B. (2017). Bridging another gap between research assessment and information retrieval – The delineation of document environments [conference presentation]. STI 2017, France. https://sti2017.ifris.org/wp-content/uploads/2017/11/is3-glanzel-thys.pdf
Glänzel, W., Thijs, B., Schubert, A., & Debackere, K. (2009). Subfield-specific normalized relative indicators and a new generation of relational charts: Methodological foundations illustrated on the assessment of institutional research performance. Scientometrics, 78(1), 165–188.
Schubert, A., & Braun, T. (1986). Relative indicators and relational charts for comparative assessment of publication output and citation impact. Scientometrics, 9(5–6), 281–291.
Schubert, A., & Braun, T. (1993). Reference standards for citation based assessments. Scientometrics, 26(1), 21–35.
Waltman, L., & Van Eck, N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63(12), 2378–2392.
6. Appendix
Appendix A. Labeling of the 29 Clusters Within the Circular Economy Literature, prepared for an ECOOM internal report and previously published.