Information Retrieval From Scientific Abstract and Citation Databases: A Query-by-Documents Approach Based on Monte-Carlo Sampling
Keywords: Systematic literature review; Decision-making support; Recommender system; Monte-Carlo sampling; Knowledge management

Abstract: The rapidly increasing amount of information and entries in abstract and citation databases steadily complicates the information retrieval task. In this study, a novel query-by-document approach using Monte-Carlo sampling of relevant keywords is presented. From a set of input documents (seed), keywords are extracted using TF-IDF and subsequently sampled to repeatedly construct queries to the database. The occurrence of returned documents is counted and serves as a proxy relevance metric. Two case studies based on the Scopus® database are used to demonstrate the method and its key advantages. No expert knowledge or human intervention is needed to construct the final search strings, which reduces human bias. The method's practicality is supported by the high re-retrieval of seed documents, 7/8 and 26/31, in high ranks in the two presented case studies.
∗ Corresponding author.
E-mail addresses: [email protected] (F. Lechtenberg), [email protected] (J. Farreres), [email protected] (A.-L. Galvan-Cara), [email protected] (A. Somoza-Tornos), [email protected] (A. Espuña), [email protected] (M. Graells).
https://doi.org/10.1016/j.eswa.2022.116967
Expert Systems With Applications 199 (2022) 116967
Received 10 May 2021; Received in revised form 13 January 2022; Accepted 20 March 2022; Available online 29 March 2022
0957-4174/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Advances in communication technologies enable researchers from every part of the world to share information with peers. As a result, a 9% growth of yearly published articles in academic journals has been recorded (Landhuis, 2016). On the one hand, this ever-increasing volume of accessible information and knowledge can be reused for solving problems and supporting decision-making. On the other hand, the higher volume of information also implies an increased effort to find and utilize it. Hence, well-performing information retrieval (IR) systems are key to aid query formulation and facilitate the search for relevant information within the big data.

Scientific abstract and citation databases, such as Scopus® or Web of Science, are large indexes of abstracts and metadata that can be sampled by user-defined query strings. However, researchers report difficulties in finding appropriate combinations of keywords to construct a corpus that properly responds to their research question (Mergel et al., 2015).

Query-by-document (Yang et al., 2009) is an information retrieval approach that relies on example documents that satisfy the user's information need. While a human may have difficulties extracting and connecting the most important keywords to find and retrieve further similar documents related to the topic or question, information systems can detect the most relevant keywords and connect them to adequate queries.

This work presents a novel Query-by-Document (QbD) method that can be applied to access-restricted scientific abstract and citation databases. The proposed procedure makes use of a feature vector representation of seed documents via a bag-of-words approach (TF-IDF). Based on this weighted feature vector, a Monte-Carlo sampling strategy is applied to repeatedly construct query strings from the previously identified keywords and automatically execute the query using the Application Programming Interface (API) of the database. This new methodology not only avoids the need for an expert decision when constructing query strings but also avoids the possible bias that the expert could introduce. Moreover, and to the best of the authors' knowledge for the first time, a query-by-document method is directly applied to an access-restricted scientific abstract and citation database.

2. Related work

Query Expansion

Query Expansion (QE) is the task of reformulating user queries, which are often too simplistic or unspecific, by adding additional meaningful
terms with similar significance. The target of QE and QbD is similar, that is, retrieving information that responds to a user's need. However, in QbD the user's initial query is replaced by a set of documents. Once the initial string has been extracted from the documents, QE methods can be incorporated into QbD.

For a comprehensive overview of the state of the art, the reader is referred to the review paper by Azad and Deepak (2019). Their review summarizes a general working methodology of QE: (1) data preprocessing & term extraction, (2) term weights & ranking, (3) term selection and (4) query reformulation. For each of these steps various methods have been proposed and evaluated. The studied works are discriminated by (1) application, (2) data source, and (3) core approaches. In the case of our proposed methodology, TF-IDF is used for term weights & ranking, while Monte-Carlo sampling is used for term selection and query reformulation.

Yusuf et al. (2021) review more recent contributions focusing on query expansion in text retrieval of search engines. They conclude that semantic-ontology and pseudo-relevance feedback methods are the most studied and promising QE approaches. One recent contribution that shares an idea with the presented work is the one by Han et al. (2021). They propose a method based on pseudo-relevance feedback via text classification. The approach builds on well-known elements from the literature (BM25, LR, SVM, ensemble avg/RRF, RM3) and combines them in simple ways, arguing that, in QE, simplicity can be a virtue.

Query by Example

Query by Example targets the retrieval of elements that are similar to an example element. In order to achieve this, the main characteristics of the example element must be extracted and processed in a way that other elements of the same kind can be queried for, and ranked according to some criterion. This concept has been used in many different applications: Query by Voice (Lee et al., 2015), Query by Music (Foote, 1997), Query by Image and Videos (Araujo & Girod, 2018) and, most recently, Query by Webpage (Geng et al., 2022). A well-known commercial example is the "search by image" function offered by Google that enables users to upload images and find similar images from the web.

Query by Document

Query by Document can be considered a variant of query by example. It was first introduced by Yang et al. (2009). Their methodology uses a "part-of-speech tagger" to extract candidate phrases from the seed documents that should act as query strings. It has been demonstrated on the BlogScope search engine to retrieve documents similar to a set of 34 articles from the New York Times. Weng et al. (2011) presented an approach that exploits Latent Semantic Indexing (LSI) as a strategy to project documents into a lower-dimensional vector space. The focus of this work lies on efficient indexing for subsequent retrieval enhancement. The authors comment that LSI can be substituted by other dimensionality reduction techniques. Williams et al. (2014) present SimSeerX, a platform for the query-by-document task that operates on the CiteSeerX database. The methodology also relies on dimensionality reduction of the seed documents and the documents in the database. Using various ranking functions, the system returns a ranked list of candidate documents that respond to the query documents. Chen et al. (2018) presented a strategy based on continuous active learning, a concept that is frequently implemented in other citation screening and content recommender systems (Howard et al., 2016; Wallace et al., 2010). Yang et al. (2018) use the "More Like This" function from Elasticsearch, a distributed search engine built on Lucene, in order to convert a query document into up to 25 relevant terms. Using these keywords, a disjunctive search is performed on the RCV1-v2 text categorization test collection.

Most recently, Le et al. (2021) presented a QbD method on top of a search interface. Their principled technique formulates the query selection task as an optimization problem (Docs2Query) that minimizes the position (maximizes the rank) of relevant documents. In the Docs2Query-Self problem, these relevant documents are the example documents used as a seed. Their approach makes use of statistics from the sampled corpus in order to solve the problem using their proposed "Best Position Algorithm". They find that their method outperforms state-of-the-art QbD methods on two test corpora (TREC-8, Voorhees & Harman, 1999, and TREC-9, Robertson & Hull, 2000).

These methods are either not directly applicable to abstract and citation databases (e.g. because of missing corpora statistics), must be partially adapted to comply with API requirements of the databases, or require a continuous learning and classification approach. Recently, Marcos-Pablos and García-Peñalvo (2018) published a method that addresses a very similar problem statement and application as the one in the present study. A more detailed comparison of their methodology and the one presented in this work is given in subsequent sections.

3. Materials and methods

3.1. Monte-Carlo sampling

The Monte-Carlo (MC) method is a statistical approach based on repeated random sampling that is used to approximate solutions to complex or expensive-to-evaluate mathematical problems. It was first formulated by Metropolis and Ulam (1949) and has been applied in several research fields such as bio-/chemical and environmental systems engineering (Sin & Espuña, 2020) and statistical physics (Landau & Binder, 2014).

It has also found application in the field of information retrieval. Burgin (1999) demonstrated its use in the evaluation of information retrieval system performance (recall, precision, F-value). Through repeated random sampling of corpora of known size and known number of relevant documents, the statistical significance of a retrieval result can be determined, and the probability of an observation stemming from a random process can be estimated. More recently, Schnabel et al. (2016) also used Monte-Carlo based estimators to determine the performance of ranking functions in information retrieval. Their work deals with corpora of known size but unknown number of relevant documents, so expert judgement to classify the relevance of the retrieved documents is required. Their approach allows choosing appropriate query-results pairs in an unbiased manner for manual relevance judgement. It was shown that through this selection the number of required relevance judgements could be halved compared to other heuristic methods. Alexandrov et al. (2003) showed that Monte-Carlo algorithms can be useful in the efficient calculation of eigenvalues of sparse matrices, such as the term-by-document matrices that often appear in information retrieval tasks. A dimensionality reduction of the matrix can be achieved, which can significantly speed up the ranking function calculations.

In this study, Monte-Carlo sampling is used to formalize the implicit knowledge captured in a seed corpus, in order to support the query-construction step in information retrieval. Queries are performed on the whole scientific abstract and citation database of huge but unknown size and unknown number of relevant documents.

3.2. Citation databases

This work focuses on information retrieval from scientific abstract and citation databases. Among the largest databases are Google Scholar, Scopus®, ScienceDirect, Web of Science, PubMed, and arXiv. For the implementation and validation of the methodology we used Scopus® (Burnham, 2006) due to its large number of entries (72.2 million in 2019 according to Gusenbauer, 2019) in multi-disciplinary fields and the convenient API, provided by Elsevier, that allows automatic sampling of the database. It has restricted access, meaning that a subscription is necessary to use its features. A main drawback is that Scopus®, like the majority of scientific databases, does not provide free full-text information. This implies that the screening step during information retrieval can only be performed on the abstract, title and keywords information. The database used in the methodology can be exchanged, but the specific requirements and limitations of alternative APIs must be considered and adapted in the implementation.
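Since the methodology hinges on querying Scopus® automatically, a minimal sketch of how one search call could be assembled is given below. The endpoint, header and parameter names follow Elsevier's public Scopus Search API documentation; the TITLE-ABS-KEY field restriction mirrors the abstract/title/keywords limitation just described. The function name, API key and query are illustrative placeholders, not the authors' actual implementation.

```python
# Sketch (not the authors' code): assembling one Scopus Search API call.
# Endpoint, header and parameter names follow Elsevier's public API docs;
# api_key and query values are placeholders.
SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"

def build_scopus_request(query, api_key, count=25, start=0):
    """Return (url, headers, params) for one paged search call.

    TITLE-ABS-KEY(...) restricts the search to title, abstract and
    author keywords, matching the screening limitation noted above.
    """
    headers = {"X-ELS-APIKey": api_key, "Accept": "application/json"}
    params = {
        "query": f"TITLE-ABS-KEY({query})",
        "count": count,   # results per page
        "start": start,   # paging offset
    }
    return SCOPUS_SEARCH_URL, headers, params

url, headers, params = build_scopus_request("pyrolysis AND waste", "YOUR_KEY")
# An HTTP client (e.g. requests.get(url, headers=headers, params=params))
# would then execute the query; network code is omitted here.
```

Keeping the request construction separate from the HTTP call makes it straightforward to swap in a different database API, as the text suggests.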
Fig. 1. Schematic representation of proposed query-by-documents approach as part of a corpus extension task. Modules for keyword extraction, database sampling, and evaluation techniques could be exchanged and applied to different databases. Sampling strategies: SEQ — Sequential, EXP — Expert, MC — Monte-Carlo.

4. Query-by-document methodology

The proposed query-by-documents approach is part of a question answering task, extending a seed corpus of already detected texts that respond to the information requirements through the inclusion of other relevant documents. Its steps are depicted in Fig. 1. The productive documents (those that provide answers to the query) may be included in the seed corpus and the cycle can be initialized again. This procedure would be repeated until no new information is found or the goal of the information search has been achieved.

4.1. Seed corpus

The methodology requires, as any query-by-documents approach, a set of seed documents. This set is used as the knowledge and information repository that identifies the range of the search. Thus, it should be composed of all available documents clearly relevant to the search topic. Obviously, adding non-relevant documents will increase non-relevant results, and not incorporating documents associated with relevant research will limit the scope of the search. From that point onwards, human intervention in the retrieval process is reduced, which is important because human resources are expensive and limited. Once a Seed Corpus (SC) has been identified, the automation process will speed up the volume and the quality of information gathered, because the resulting documents after one cycle will enlarge the seed corpus, thus feeding the next iteration. Seed corpora may be obtained and provided for instance by experts in the field, such as professors providing a starting point for a project or research line of a coworker or student. The number of documents in the corpus defines the seed corpus length (L). Another use case is the retrieval of similar documents to detect eventual plagiarism, in which L may take the value of 1. There is no

Once the keywords are identified, the next step is to query the database. Construction of appropriate search strings is a hardship in research and investigation, and a ranked candidate list of keywords can aid in this process.

The proposed query-by-documents approach implements a Monte-Carlo (MC) sampling principle. The idea is to construct search strings by picking keywords from the ranked list of keywords with a probability distribution corresponding to their tfidf weight, applying "AND" connectors among the keywords, and repeatedly querying the database (the missing "OR" connector results from the addition of each new query).

The list of keywords ranked by their tfidf weight is constructed following the description and equation given in the Supplementary Material. The probability 𝜙(t_i) of each keyword t_i being selected within the top N_KW keywords (where i = 1 has the highest weight, i = 2 the second highest, and so on) is then determined as:

𝜙(t_i) = tfidf(t_i) / Σ_{j=1}^{N_KW} tfidf(t_j)    (1)

It should be noted that this query construction step could in principle be accompanied by the utilization of semantic knowledge (e.g. using domain ontologies) as demonstrated for instance by Amato et al. (2015). By doing so, vocabulary mismatch, also known as the vocabulary problem, can be reduced. In each performed query the occurrence of a document is registered and counted over the total number of performed MC iterations.

This methodology comes with a few adjustable parameters.

1. The number of MC iterations N_MC determines how well the relevance distribution of the keywords is captured in the sampling procedure. In the presented case studies it was found that a value between 200 and 1000 iterations is sufficient for the ranked candidate list not to change significantly anymore. See the Supplementary Material (Fig. S1) for a description of how this range was determined.
2. The upper limit for the number of documents registered in each iteration N_it is a parameter that controls the trade-off between exploration and exploitation of the database search space: a high value for this parameter registers many documents in each iteration, resulting in the need for more MC iterations to reach a stationary ranking. For lower values stationarity is achieved faster, but relevant documents might be overlooked through the stricter cut-off. Currently, the Scopus® API imposes an upper limit of 2000 documents for this parameter.
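Eq. (1) and the "AND"-connected query construction can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the authors' code: the tfidf values echo a few entries of Table 4, and the number of keywords drawn per query string (terms_per_query) is a placeholder, since that detail is fixed by the implementation.

```python
import random

def keyword_probabilities(tfidf, n_kw):
    """Eq. (1): phi(t_i) = tfidf(t_i) / sum_{j=1..N_KW} tfidf(t_j)."""
    top = sorted(tfidf, key=tfidf.get, reverse=True)[:n_kw]
    total = sum(tfidf[t] for t in top)
    return {t: tfidf[t] / total for t in top}

def mc_query(phi, terms_per_query, rng=random):
    """Draw distinct keywords with probability phi and join them with AND."""
    pool = list(phi)
    weights = [phi[t] for t in pool]
    picked = []
    while pool and len(picked) < terms_per_query:
        t = rng.choices(pool, weights=weights, k=1)[0]  # weighted draw
        i = pool.index(t)
        pool.pop(i)       # remove so each keyword appears at most once
        weights.pop(i)
        picked.append(t)
    return " AND ".join(picked)

# Illustrative weights (cf. the top entries of Table 4).
tfidf = {"waste": 1.12, "pyrolysis": 1.11, "product": 1.03, "oil": 0.90, "gas": 0.87}
phi = keyword_probabilities(tfidf, n_kw=5)
query = mc_query(phi, terms_per_query=3)  # one random "AND"-connected string
```

Each call to mc_query yields a different search string, which is what produces the implicit "OR" across repeated queries described above.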
3. The number of keywords included in the sampling procedure N_KW is a critical parameter with a similar trade-off characteristic as N_it. However, in addition to the computational trade-off, the number of included keywords regulates how "far" from the core domain (i.e. how many "less-relevant" keywords) the sampling procedure should reach.

Tuning of these parameters, which are common to other information retrieval methods, could, in theory, be automated through a parameter sweep procedure that refines some performance metric such as seed recall or average seed position. In practice, however, limits imposed on the number of queries to the database should be taken into account and could potentially prohibit an extensive sensitivity analysis. For that reason, in this study, we limit our analysis of N_MC and N_KW to three alternative values while keeping N_it at the upper limit imposed by the Scopus® API.

After performing the sampling procedure, the number of times an individual document d appeared in the query process, N_d, divided by the number of MC iterations N_MC yields the document frequency DF_d. This is an inherent relevance metric that can be directly used to rank the candidate documents and propose a reading order.

DF_d = N_d / N_MC    (2)

Alternatively, a naive search can be performed on the database by simply connecting the top keywords until the number of results from the database yields the number of documents the user is willing to read. We refer to this method as the sequential sampling method (SEQ).

Finally, instead of blindly connecting the keywords, an expert can use the identified terms to construct more complex strings using different combinations and connectors such as "OR" and "AND NOT". We refer to this as the expert sampling method (EXP). Compared to the MC method, the user must have some degree of expert knowledge to apply it.

The SEQ and EXP methods do not have an inherent relevance metric, and the resulting candidate documents must be ordered by other means, such as the application of the BM25 ranking function or naive metadata like the number of citations.

It must be noted that in this step the database is only sampled by the information available in the abstract, title and keywords. Thus, we suggest using the abstracts of the seed corpus to obtain the set of relevant keywords, based on the assumption that the language used in abstracts may be different from the full texts, consequently providing a fair basis for the query task.

4.4. Evaluation

Once the ranked list of references is obtained, it is possible to evaluate the documents in terms of linguistic relevance. For that purpose, the freely available abstracts could be used, but we suggest including as many full texts as possible in the evaluation step. The reasoning behind this is that abstracts only represent a very small fraction of the full text in condensed form. Information retrieval tasks such as the retrieval of parameters or experimental data will be more successful when looking into the full texts (Kottmann et al., 2010), including the Supplementary Material that often provides more quantitative information than abstracts.

This work purposely skips any discussion on publication policies and the property of the information. The general methodology developed here can be employed in public and private databases, using the total or partial information available (e.g. abstracts) according to the access rights. For more insight into the debate about Open Access (OA), the reader is referred to the review by Piwowar et al. (2018).

As previously mentioned, we decided to use Scopus® for demonstration and validation purposes, which limits the sampling task to abstract, title and keyword information. Sampling and evaluating full texts instead of abstracts is debatable. The search of full texts may provide extra insight, inversely depending on the quality of the abstract, but a trade-off arises when the associated increase of computational effort is considered.

For validation purposes, after determining the ranked candidate list, the full-text information is required. It is unreasonable from an economic and computational resource point of view to download and process a huge amount of full-text information. Thus, we opt for downloading (semi-)manually a number N_DWN of documents. As a result, the performance of the evaluation procedure will vary depending on the institution that performs the retrieval task, since the subscribed journals and databases differ between institutions. However, future changes in publishing policies can be easily incorporated into the methodology.

Once the documents are downloaded, their relevance to the domain can be evaluated using the BM25 ranking function. The linguistic relevance is determined with respect to the weighted feature vector Q that is expected to represent the domain of interest. The user can then manually screen the resulting candidate documents in order of linguistic relevance until a threshold value BM25_min is reached or until they are satisfied with the retrieved information. The document frequency DF or the cosine similarity 𝜃 of the documents with the feature vector could alternatively be used as a metric for linguistic relevance (Marcos-Pablos & García-Peñalvo, 2018).

Apart from the BM25 relevance metric, the performance should be evaluated using the recall of seed documents. Le et al. (2021) verified the hypothesis that IR methods performing well in re-retrieving the seed documents (Docs2Queries-Self problem) also perform well in finding similar documents (Docs2Queries-Sim problem).

4.5. Comparison with other information retrieval methods

Information retrieval procedures have been especially explored and applied in the field of systematic literature review and specialized corpora construction. Marcos-Pablos and García-Peñalvo (2018) propose an iterative methodology to construct search strings, which they applied in their literature review about technological ecosystems (Marcos-Pablos & García-Peñalvo, 2019). A comparison of their approach with the one presented here is summarized in Table 1.

The objectives at the end of each iteration are different: our approach aims at finding an extended corpus departing from a set of relevant documents (SC). The methodology by Marcos-Pablos and García-Peñalvo (2018), on the other hand, results in suggested keywords for search string construction. However, both methods follow the same main steps of keyword construction via TF-IDF, sampling and evaluation.

The main difference lies in the sampling procedure: the Marcos-Pablos and García-Peñalvo (2018) procedure requires the use of expert knowledge to make the final decision on the search string, while the MC procedure avoids this need. Furthermore, a minor difference lies in the departing point of the methodologies, which is shifted due to the different targeted endpoints (keywords vs. retrieved information/documents).

There are various other information retrieval methods that have been briefly addressed in Section 2. In this work we omit the direct comparison to these methods since their scope is not aligned with the scope of this work. On one hand, those works are applied to specialized static corpora and datasets (e.g. TREC-8, Voorhees & Harman, 1999, and TREC-9, Robertson & Hull, 2000, as used in Le et al., 2021) instead of the growing and inter-disciplinary corpora that are the academic abstract and citation databases. Our proposed methodology is designed to be applicable to corpora without the need to analyze the corpus prior to the retrieval task. Furthermore, this work goes beyond what other works are doing by evaluating the performance of the method using the full-text information of the retrieved documents. The corpora that other works deal with are pre-classified, which is not the case in the open question answering approach that is envisioned in this work and illustrated in Fig. 1.
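Put together, the sampling loop and Eq. (2) amount to counting how often each document identifier comes back over N_MC queries. The sketch below substitutes a toy stand-in for the database call (the real implementation queries the Scopus® API); document names and match probabilities are illustrative only.

```python
import random
from collections import Counter

def mc_rank(run_query, n_mc):
    """Run the query n_mc times, count document occurrences N_d and
    rank by document frequency DF_d = N_d / N_MC (Eq. (2))."""
    counts = Counter()
    for _ in range(n_mc):
        counts.update(run_query())
    df = {doc: n / n_mc for doc, n in counts.items()}
    return sorted(df.items(), key=lambda kv: kv[1], reverse=True)

# Toy stand-in for one MC query against the database: document D1
# matches almost every sampled keyword combination, D3 only rarely.
rng = random.Random(0)

def fake_query():
    hits = ["D1"]
    if rng.random() < 0.6:
        hits.append("D2")
    if rng.random() < 0.1:
        hits.append("D3")
    return hits

ranking = mc_rank(fake_query, n_mc=500)
# ranking[0] is ("D1", 1.0); D2 and D3 follow with DF near 0.6 and 0.1.
```

Documents that match many different keyword combinations accumulate a high DF_d and rise to the top of the reading order, which is exactly the proxy relevance metric described above.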
Table 1
Comparison of information retrieval methodologies.

Marcos-Pablos and García-Peñalvo (2018):
Input: Search string S; stop words vector SW; minimum cosine similarity distance 𝜃_min.
Output: Recommended new terms T for building a new search string S1.
1. Use S as input search string on academic databases and construct an abstract corpus D.
2. Project D on vector space and compute tfidf values (corresponds to step 1 in this study).
3. Classify documents in D as relevant (R) and non-relevant (NR) from cosine similarity.
4. Compute term weights w_{t,D} in R and NR.
5. Suggest new terms T based on w_{t,D} sorted values.
6. Construct a new search string S1 and repeat from step 1.

This study:
Input: Seed Corpus SC; stop words vector SW; (optional) minimum BM25 value BM25_min.
Output: Ranked list of relevant documents RL.
1. Project SC on vector space and compute tfidf values.
2. Perform MC sampling method on academic databases using the filtered top keywords N_KW.
3. Obtain a candidate list CL sorted by the document frequency DF_d.
4. (Optional) Download the top N_DWN full-text documents of CL. Apply BM25 ranking function to determine linguistic relevance order.
5. Use those documents with high document frequency DF_d or higher relevance than BM25_min for information extraction.
6. Extend SC with newly identified truly relevant documents and repeat from step 1.

5. Case studies

The proposed methodology has been tested and illustrated on two case studies that are detailed in the following sections.

5.1. Case study I: Technological ecosystems in care and assistance

The goal of this case study is to emulate the findings from the literature review by Marcos-Pablos and García-Peñalvo (2019), departing from a subset of the documents that have been identified as truly relevant and using them as a seed corpus in the methodology. The original systematic literature review deals with technological ecosystems in care and assistance. This topic comes with the difficulty of being based in two different fields. Therefore, the reasonable combination of suggested keywords requires a significant degree of expert knowledge. On the other hand, the proposed methodology is expected to account for, and combine, both fields implicitly in the tfidf values during the sampling procedure.

Using an initial search string on Scopus and Web of Science, Marcos-Pablos and García-Peñalvo (2019) narrowed down the candidate list of potentially relevant documents to 8394. Then, they applied a cosine similarity threshold to only consider the top 809 documents. These documents were then screened using a quality assessment checklist to further reduce the selection to 194 documents. Finally, 37 documents were included for the quantitative synthesis of the literature review. This list of relevant documents is given in Table S1 in the Supplementary Material. Note that five of these documents are not available in Scopus® and therefore cannot be retrieved with the applied methodology.

In this case study we depart from randomly selected subsets of L documents (Table S1) taken from these 37 relevant documents and follow the steps of the proposed methodology. The chosen quality criterion for assessing the performance in this case study is the number of relevant documents and seed documents re-retrieved by the methodology and their position in the ranked list. After selecting an appropriate seed corpus length L_best using 10 keywords during sampling, we vary the number of included keywords N_KW. The tested configurations are summarized in Table 2. The number of registered documents per iteration N_it and the total number of MC iterations N_MC are both chosen to be 1000.

Table 2
Tested configurations in case-study I.
1. Seed corpus length L (N_KW = 10): 1, 8, 20. Choose best performing seed corpus length L_best.
2. Number of keywords N_KW (L = L_best): 7, 20, 30.
Other parameters: N_it = N_MC = 1000.

Table 3
Sampling procedures tested and evaluated in case-study II.
Method: SEQ | EXP A (FL) | EXP B (AST) | EXP C (APL) | MC
N_KW: 4 | 9 | 9 | 16 | 10, 15, 20, 25, 29, 30

5.2. Case study II: Pyrolysis of plastic waste

The goal of this second case study is to apply and compare the three presented sampling method alternatives in the domain of chemical engineering. The targeted information is the retrieval of documents containing parameters that describe pyrolysis processes of plastic waste. This study is motivated by the need to populate a process ontology with information for the selection of sustainable waste-to-resource alternatives (Pacheco-López et al., 2020).

The starting point for initialization is a seed corpus consisting of eight papers. They originate from the review performed by Somoza-Tornos et al. (2021) and are given in Table S2 (Supplementary Material). After extracting the weighted feature vector for sampling, we apply the SEQ, EXP and MC sampling methods as summarized in Table 3 and compare (1) the position of the seed documents in the resulting ranked lists and (2) the linguistic relevance distributions of the identified candidate lists. As for the EXP method, three members of the research group (FL, AST, APL) proposed search strings using the keywords from the extraction step (FL, AST) or alternative ones (APL) based on their experience in the field.

6. Results and discussion

6.1. Case study I: Technological ecosystems in care and assistance

Table S3 shows the position of all the seed papers as well as the remaining relevant papers in the ranked candidate lists. In a first step, the top ten keywords were used for MC sampling. The search was restricted to the years between 2002 and 2019 to better emulate the results of Marcos-Pablos and García-Peñalvo (2019). Fig. 2 illustrates the results.

It can be seen that using a single paper as seed corpus does not lead to satisfactory results. The seed paper itself ranks in position one with 689 appearances in 1000 iterations. Out of the remaining possible papers only 6 appear in the candidate list, while only one of them ranks high (A2 in position 12). Fig. 2(a) shows the placement of the relevant papers in the candidate lists using different seed corpus lengths L. Better results are obtained when using eight seed papers. In total, 26 of the 31 relevant papers available in Scopus® are found,
Fig. 2. Sensitivity analysis results for seed corpus length 𝐿 and number of keywords 𝑁𝐾𝑊 in case-study I.
Table 4
Case study II: Extracted keywords from the eight-paper seed corpus.

Keyword      tfidf   Keyword        tfidf   Keyword        tfidf
waste        1.12    yield          0.74    recycling      0.42
pyrolysis    1.11    plastic        0.74    gasoline       0.42
product      1.03    increase       0.66    ldpe           0.41
oil          0.90    bed [a]        0.66    char           0.40
gas          0.87    polyethylene   0.59    polymer        0.39
process      0.84    feedstock      0.55    distribution   0.36
wt           0.83    time           0.49    reactor        0.34
catalyst     0.78    residence      0.49    material       0.33
temperature  0.77    flash          0.45    recovery       0.33
monomer      0.76    hydrocarbon    0.44    work           0.32

[a] Excluded: out of scope.

Fig. 4. Position of seed papers and relevant documents when ordering the candidate list by BM25 and DF values.
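Keyword scores of the kind listed in Table 4 follow from a TF-IDF weighting of the seed abstracts. Below is a stdlib-only sketch of one common corpus-level variant (average relative term frequency times a smoothed inverse document frequency); the exact formula used in the study may differ, and the example abstracts in the usage note are invented:

```python
import math
import re
from collections import Counter

def tfidf_keywords(abstracts, top_n=10):
    """Rank the vocabulary of a seed corpus by a corpus-level TF-IDF score."""
    docs = [re.findall(r"[a-z]+", text.lower()) for text in abstracts]
    n_docs = len(docs)
    df = Counter()       # number of documents containing each term
    tf_sum = Counter()   # summed relative term frequencies
    for tokens in docs:
        counts = Counter(tokens)
        for term, c in counts.items():
            df[term] += 1
            tf_sum[term] += c / len(tokens)
    # average TF across documents, weighted by a smoothed IDF that never
    # zeroes out terms occurring in every document
    scores = {
        term: (tf_sum[term] / n_docs) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term in df
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

With toy abstracts such as "pyrolysis of plastic waste" and "waste plastic pyrolysis products", terms frequent across the whole corpus (here "pyrolysis") dominate the ranking, mirroring how "waste" and "pyrolysis" top Table 4.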
7
F. Lechtenberg et al. Expert Systems With Applications 199 (2022) 116967
Table 5
Summary of sampling procedures.

Procedure  Search string                                            Hits     Download  Recall
SEQ        waste AND pyrolysis AND product AND oil                  2015     1154      2/8
EXP A      pyrolysis AND plastic AND waste AND (gas OR oil          1156     727       6/8
           OR product) AND (temperature OR catalyst OR yield)
EXP B      pyrolysis AND plastic AND waste AND product              853      548       5/8
           AND (temperature OR catalyst OR yield)
EXP C      pyrolysis AND (plastic OR polyolefin OR polymer)         1127     681       4/8
           AND waste AND (gas OR oil OR product OR
           biofuel OR chemical OR ethylene OR methane OR
           benzene) AND (recycling OR upcycling OR treatment)
MC         Combinations of 29 KWs                                   116,435  1196      8/8 [a]

[a] 7/8 within top 2000.
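The Recall column counts how many of the eight seed documents each procedure recovers, and the footnote additionally restricts the count to the top 2000 ranks. This bookkeeping reduces to set intersections, as the small helper below illustrates (the document identifiers are hypothetical):

```python
def seed_recall(retrieved_ids, seed_ids):
    """Return (hits, total) for seed documents present in a retrieved set."""
    found = set(retrieved_ids) & set(seed_ids)
    return len(found), len(seed_ids)

def seed_recall_at(ranked_ids, seed_ids, k):
    """Recall restricted to the top-k positions of a ranked candidate list."""
    return seed_recall(ranked_ids[:k], seed_ids)
```

A procedure that retrieves every seed document reports 8/8, as the MC row does; restricting the MC candidate list to its top 2000 positions drops one seed, giving 7/8.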
• High number of additional relevant papers in the retrieved documents
• High recall value of seed documents

Finally, the method proved to offer a variety of benefits in terms of applicability and flexibility that can be summarized as follows:

• In principle, no need for expert knowledge
• Flexible in terms of exploration and exploitation
• Reasonable pre-download ordering based on abstracts through DF

7. Conclusions

Literature search is a specific and essential task in scientific research. Access to digital databases has boosted search capabilities, but the scientific community worldwide still requires a lot of time and expert dedication to retrieve relevant information. This work presents a novel methodology that improves the information retrieval task from scientific abstract and citation databases via a query-by-documents approach.

The main contribution of this work is the inclusion of a Monte-Carlo sampling procedure during the query string construction step, which leads to two desirable outcomes: (1) human expert intervention (an expensive and scarce resource) is decreased and (2) potential human bias is avoided. The proposed method has been developed, implemented and tested on the Scopus® database using two case studies.

The two case studies demonstrated the methodology's applicability to various fields of research. Remarkably, one of the studies is itself based in two distinct fields (technological ecosystems and healthcare). The retrieval results are satisfactory, i.e. a high recall value of the truly relevant papers declared by the reference work, considering that the authors are not experts in these fields and only a small amount of initial information (the seed corpus) has been taken from the reference (Marcos-Pablos & García-Peñalvo, 2019). These results imply that corpora for multidisciplinary collaboration can be easily identified by our approach. The case-study on information retrieval of waste plastic pyrolysis processes suggests that the proposed methodology performs better, in terms of number and linguistic relevance (BM25) of retrieved documents, than a naive sequential sampling method as well as query string construction by three experts.

In general, the methodology is expected to accelerate the information retrieval process by reducing the need to screen less relevant papers. Through the automation of abstract screening using various combinations of keywords, the search can go beyond what manual search could achieve, thus finding relevant papers that could otherwise have been overlooked. Systematic literature reviews will benefit most from the methodology, but any research that starts with a literature review will find it useful.

Technical limitations such as the speed of sampling and the request limits of the available APIs should be addressed to further improve performance. Furthermore, active learning strategies (Chen et al., 2018) could be integrated into the methodology to adapt the candidate ranking based on expert feedback during the manual classification step.

CRediT authorship contribution statement

Fabian Lechtenberg: Conceptualization, Methodology, Software, Data curation, Writing – original draft, Visualization. Javier Farreres: Conceptualization, Methodology, Software, Writing – review & editing. Aldwin-Lois Galvan-Cara: Methodology, Software, Data curation, Writing – original draft. Ana Somoza-Tornos: Conceptualization, Writing – review & editing. Antonio Espuña: Conceptualization, Writing – review & editing, Supervision, Project administration, Funding acquisition. Moisès Graells: Validation, Writing – review & editing, Supervision, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Financial support received from the Spanish "Ministerio de Ciencia e Innovación" and the European Regional Development Fund, both funding the research projects AIMS (DPI2017-87435-R) and CEPI (PID2020-116051RB-I00), is fully acknowledged. Fabian Lechtenberg gratefully acknowledges the Universitat Politècnica de Catalunya for the financial support of his predoctoral grant FPU-UPC, with the collaboration of Banco de Santander. The authors would like to thank Adrián Pacheco-López (APL) for contributing one of the expert query strings.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.eswa.2022.116967.

References

Alexandrov, V. N., Dimov, I. T., Karaivanova, A., & Tan, C. J. K. (2003). Parallel Monte Carlo algorithms for information retrieval. Mathematics and Computers in Simulation, 62, 289–295. http://dx.doi.org/10.1016/S0378-4754(02)00252-5.

Amato, F., De Santo, A., Gargiulo, F., Moscato, V., Persia, F., Picariello, A., & Sperli, G. (2015). A novel approach to query expansion based on semantic similarity measures. In DATA 2015 - 4th international conference on data management technologies and applications, proceedings (pp. 344–353). http://dx.doi.org/10.5220/0005579703440353.

Araujo, A., & Girod, B. (2018). Large-scale video retrieval using image queries. IEEE Transactions on Circuits and Systems for Video Technology, 28, 1406–1420. http://dx.doi.org/10.1109/TCSVT.2017.2667710.

Azad, H. K., & Deepak, A. (2019). Query expansion techniques for information retrieval: A survey. Information Processing & Management, 56, 1698–1735. http://dx.doi.org/10.1016/j.ipm.2019.05.009.

Burgin, R. (1999). The Monte Carlo method and the evaluation of retrieval system performance. Journal of the American Society for Information Science, 50, 181–191. http://dx.doi.org/10.1002/(SICI)1097-4571(1999)50:2<181::AID-ASI8>3.0.CO;2-9.
Burnham, J. F. (2006). Scopus database: A review. Biomedical Digital Libraries, 3. http://dx.doi.org/10.1186/1742-5581-3-1.

Chen, S., Nourashrafeddin, S., Moh’D, A., & Milios, E. (2018). Active high-recall information retrieval from domain-specific text corpora based on query documents. In Proceedings of the ACM symposium on document engineering 2018 (pp. 1–10). http://dx.doi.org/10.1145/3209280.3209532.

Foote, J. T. (1997). Content-based retrieval of music and audio. In Multimedia storage and archiving systems II (pp. 138–147). http://dx.doi.org/10.1117/12.290336.

Geng, Q., Chuai, Z., & Jin, J. (2022). Webpage retrieval based on query by example for think tank construction. Information Processing & Management, 59, Article 102767. http://dx.doi.org/10.1016/j.ipm.2021.102767.

Gusenbauer, M. (2019). Google Scholar to overshadow them all? Comparing the sizes of 12 academic search engines and bibliographic databases. Scientometrics, 118, 177–214. http://dx.doi.org/10.1007/s11192-018-2958-5.

Han, X., Liu, Y., & Lin, J. (2021). The simplest thing that can possibly work: (Pseudo-)relevance feedback via text classification. In Proceedings of the 2021 ACM SIGIR international conference on theory of information retrieval (pp. 123–129). http://dx.doi.org/10.1145/3471158.3472261.

Howard, B. E., Phillips, J., Miller, K., Tandon, A., Mav, D., Shah, M. R., Holmgren, S., Pelch, K. E., Walker, V., Rooney, A. A., Macleod, M., Shah, R. R., & Thayer, K. (2016). SWIFT-Review: A text-mining workbench for systematic review. Systematic Reviews, 5, 1–16. http://dx.doi.org/10.1186/s13643-016-0263-z.

Kottmann, R., Radom, M., Formanowicz, P., Glöckner, F., Rybarczyk, A., Szachniuk, M., & Błażewicz, J. (2010). Cerberus: A new information retrieval tool for marine metagenomics. Foundations of Computing and Decision Sciences, 35, 107–126.

Landau, D. P., & Binder, K. (2014). A guide to Monte Carlo simulations in statistical physics (4th ed.). Cambridge University Press. http://dx.doi.org/10.1017/cbo9781139696463.

Landhuis, E. (2016). Scientific literature: Information overload. Nature, 535, 457–458. http://dx.doi.org/10.1038/nj7612-457a.

Le, N. X. T., Shahbazi, M., Almaslukh, A., & Hristidis, V. (2021). Query by documents on top of a search interface. Information Systems, 101, Article 101793. http://dx.doi.org/10.1016/j.is.2021.101793.

Lee, L. S., Glass, J., Lee, H. Y., & Chan, C. A. (2015). Spoken content retrieval - beyond cascading speech recognition with text retrieval. IEEE Transactions on Audio, Speech and Language Processing, 23, 1389–1420. http://dx.doi.org/10.1109/TASLP.2015.2438543.

Marcos-Pablos, S., & García-Peñalvo, F. J. (2018). Information retrieval methodology for aiding scientific database search. Soft Computing, 24, 5551–5560. http://dx.doi.org/10.1007/s00500-018-3568-0.

Marcos-Pablos, S., & García-Peñalvo, F. J. (2019). Technological ecosystems in care and assistance: A systematic literature review. Sensors, 19, 708. http://dx.doi.org/10.3390/s19030708.

Mergel, G. D., Silveira, M. S., & Da Silva, T. S. (2015). A method to support search string building in systematic literature reviews through visual text mining. In Proceedings of the ACM symposium on applied computing (pp. 1594–1601). http://dx.doi.org/10.1145/2695664.2695902.

Metropolis, N., & Ulam, S. (1949). The Monte Carlo method. Journal of the American Statistical Association, 44, 335–341. http://dx.doi.org/10.1080/01621459.1949.10483310.

Pacheco-López, A., Somoza-Tornos, A., Muñoz, E., Capón-García, E., Graells, M., & Espuña, A. (2020). Synthesis and assessment of waste-to-resource routes for circular economy. In 30 European symposium on computer aided process engineering (pp. 1933–1938). http://dx.doi.org/10.1016/B978-0-12-823377-1.50323-2.

Piwowar, H., Priem, J., Larivière, V., Alperin, J. P., Matthias, L., Norlander, B., Farley, A., West, J., & Haustein, S. (2018). The state of OA: A large-scale analysis of the prevalence and impact of open access articles. PeerJ, 2018, 1–23. http://dx.doi.org/10.7717/peerj.4375.

Robertson, S. E., & Hull, D. A. (2000). The TREC-9 filtering track final report. In Proceedings of the ninth text retrieval conference.

Schnabel, T., Frazier, P. I., Swaminathan, A., & Joachims, T. (2016). Unbiased comparative evaluation of ranking functions. In ICTIR 2016 - proceedings of the 2016 ACM international conference on the theory of information retrieval (pp. 109–118). http://dx.doi.org/10.1145/2970398.2970410.

Sin, G., & Espuña, A. (2020). Editorial: Applications of Monte Carlo method in chemical, biochemical and environmental engineering. Frontiers in Energy Research, 8, 1–2. http://dx.doi.org/10.3389/fenrg.2020.00068.

Somoza-Tornos, A., Pozo, C., Graells, M., Espuña, A., & Puigjaner, L. (2021). Process screening framework for the synthesis of process networks from a circular economy perspective. Resources, Conservation and Recycling, 164, Article 105147. http://dx.doi.org/10.1016/j.resconrec.2020.105147.

Voorhees, E. M., & Harman, D. K. (1999). Overview of the eighth text retrieval conference (TREC-8). In Proceedings of the eighth text retrieval conference.

Wallace, B. C., Small, K., Brodley, C. E., & Trikalinos, T. A. (2010). Active learning for biomedical citation screening categories and subject descriptors. In Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 173–181). http://dx.doi.org/10.1145/1835804.1835829.

Weng, L., Li, Z., Cai, R., Zhang, Y., Zhou, Y., Yang, L. T., & Zhang, L. (2011). Query by document via a decomposition-based two-level retrieval approach. In SIGIR’11 - proceedings of the 34th international ACM SIGIR conference on research and development in information retrieval (pp. 505–514). http://dx.doi.org/10.1145/2009916.2009985.

Williams, K., Wu, J., & Giles, C. L. (2014). SimSeerX: A similar document search engine. In DocEng 2014 - proceedings of the 2014 ACM symposium on document engineering (pp. 143–146). http://dx.doi.org/10.1145/2644866.2644895.

Yang, Y., Bansal, N., Dakka, W., Ipeirotis, P., Koudas, N., & Papadias, D. (2009). Query by document. In Proceedings of the 2nd ACM international conference on web search and data mining (pp. 34–43). http://dx.doi.org/10.1145/1498759.1498806.

Yang, E., Lewis, D. D., Frieder, O., Grossman, D., & Yurchak, R. (2018). Retrieval and richness when querying by document. In CEUR workshop proceedings (pp. 68–75).

Yusuf, N., Yunus, M. A. M., Wahid, N., Mustapha, A., & Salleh, M. N. M. (2021). A survey of query expansion methods to improve relevant search engine results. International Journal on Advanced Science, Engineering and Information Technology, 11, 1352–1359. http://dx.doi.org/10.18517/ijaseit.11.4.8868.