0% found this document useful (0 votes)

80 views9 pages

Information Retrieval From Scientific Abstract and Citation Database Query by Documents Approach Based On Monte Carlo Sampling

Uploaded by

VorVlo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

80 views9 pages

Information Retrieval From Scientific Abstract and Citation Database Query by Documents Approach Based On Monte Carlo Sampling

Uploaded by

VorVlo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

Expert Systems With Applications 199 (2022) 116967

Contents lists available at ScienceDirect

Expert Systems With Applications

journal homepage: www.elsevier.com/locate/eswa

Information retrieval from scientific abstract and citation databases:

A query-by-documents approach based on Monte-Carlo sampling
Fabian Lechtenberg a , Javier Farreres b , Aldwin-Lois Galvan-Cara a , Ana Somoza-Tornos a,c ,
Antonio Espuña a , Moisès Graells a ,∗
a
Department of Chemical Engineering, Universitat Politècnica de Catalunya, Campus Diagonal-Besòs, Eduard Maristany, 16, 08019 Barcelona, Spain
b Computer Science Department, Universitat Politècnica de Catalunya, Campus Diagonal-Besòs, Eduard Maristany, 16, 08019 Barcelona, Spain
c Renewable and Sustainable Energy Institute (RASEI), University of Colorado Boulder, Boulder, Colorado 80303, United States

ARTICLE INFO ABSTRACT

Keywords: The rapidly increasing amount of information and entries in abstract and citation databases steadily complicates
Systematic literature review the information retrieval task. In this study, a novel query-by-document approach using Monte-Carlo sampling
Decision-making support of relevant keywords is presented. From a set of input documents (seed) keywords are extracted using TF-
Recommender system
IDF and subsequently sampled to repeatedly construct queries to the database. The occurrence of returned
Monte-Carlo sampling
documents is counted and serves as a proxy relevance metric. Two case studies based on the Scopus® database
Knowledge management
are used to demonstrate the method and its key advantages. No expert knowledge and human intervention
is needed to construct the final search strings which reduces the human bias. The methods practicality is
supported by the high re-retrieval of seed documents of 7/8 and 26/31 in high ranks in the two presented
case studies.

1. Introduction can detect the most relevant keywords and connect them to adequate
queries.
Advances in communication technologies enable researchers from This work presents a novel Query-by-Document (QbD) method
every part of the world to share information with peers. As a result, that can be applied to access-restricted scientific abstract and citation
a 9% growth of yearly published articles in academic journals has databases. The proposed procedure makes use of a feature vector
been recorded (Landhuis, 2016). On one hand, this ever-increasing representation of seed documents via a bag of words approach (TF-IDF).
volume of accessible information and knowledge can be reused for Based on this weighted feature vector, a Monte-Carlo sampling strategy
solving problems and supporting decision-making. On the other hand, is applied to repeatedly construct query strings from the previously
the higher volume of information also implies an increased effort to identified keywords and automatically execute the query using the
find and utilize it. Hence, well performing information retrieval (IR)
Application Programming Interface (API) of the database. This new
systems are key to aid the query formulation and facilitate the search
methodology not only avoids the need of an expert decision when
of relevant information within the big data.
constructing query strings but it also avoids the possible bias that
Scientific abstract and citation databases, such as Scopus® or Web
the expert could introduce. Moreover, and to the best of authors’
of Science, are large indexes of abstracts and metadata that can be
knowledge for the first time, a query-by-document method is directly
sampled by user defined query strings. However, researchers report dif-
applied to an access-restricted scientific abstract and citation database.
ficulties in finding appropriate combinations of keywords to construct a
corpus that properly responds to their research question (Mergel et al.,
2015). 2. Related work
Query-by-document (Yang et al., 2009) is an information retrieval
approach that relies on example documents that satisfy the user’s
Query Expansion
information need. While a human may have difficulties to extract
Query Expansion (QE) is the task of reformulating user queries, that
and connect the most important keywords to find and retrieve further
are often too simplistic or unspecific, by adding additional meaningful
similar documents related to the topic or question, information systems

∗ Corresponding author.
E-mail addresses: [email protected] (F. Lechtenberg), [email protected] (J. Farreres), [email protected]
(A.-L. Galvan-Cara), [email protected] (A. Somoza-Tornos), [email protected] (A. Espuña), [email protected] (M. Graells).

https://doi.org/10.1016/j.eswa.2022.116967
Received 10 May 2021; Received in revised form 13 January 2022; Accepted 20 March 2022
Available online 29 March 2022
0957-4174/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-
nc-nd/4.0/).
F. Lechtenberg et al. Expert Systems With Applications 199 (2022) 116967

terms with similar significance. The target of QE and QbD is similar, documents used as a seed. Their approach makes use of statistics from
that is, retrieving information that responds to a user’s need. However, the sampled corpus in order to solve the problem using their proposed
in QbD the user’s initial query is replaced by a set of documents. Once ‘‘Best Position Algorithm’’. They find that their method outperforms
the initial string has been extracted from the documents, QE methods state-of-the-art QbD methods on two test corpora (TREC-8 Voorhees &
can be incorporated into QbD. Harman, 1999 and TREC-9 Robertson & Hull, 2000).
For a comprehensive overview of the state-of-the-art the reader These methods are either not directly applicable to abstract and cita-
is referred to the review paper by Azad and Deepak (2019). Their tion databases (e.g. because of missing corpora statistics), must be par-
review summarizes a general working methodology of QE: (1) data tially adapted to comply with API requirements of the databases, or re-
preprocessing & term extraction, (2) term weights & ranking, (3) term quire a continuous learning and classification approach. Recently, Marcos-
selection and (4) query reformulation. For each of these steps various Pablos and García-Peñalvo (2018) published a method that solves a
methods have been proposed and evaluated. The studied works are very similar problem statement and application as the one in the
discriminated by (1) application, (2) data source, and (3) core ap- present study. A more detailed comparison of their methodology and
proaches. In the case of our proposed methodology, TF-IDF is used for the one presented in this work is given in subsequent sections.
term weights & ranking while Monte-Carlo sampling is used for term
3. Materials and methods
selection and query reformulation.
Yusuf et al. (2021) review more recent contributions focusing on
3.1. Monte-Carlo sampling
query expansion in text retrieval of search engines. They conclude that
semantic-ontology and pseudo-relevant feedback methods are the most The Monte-Carlo (MC) method is a statistical approach based on
studied and promising QE approaches. One recent contribution that repeated random sampling that is used to approximate solutions to
shares an idea with the presented work is the one by Han et al. (2021). complex or expensive to evaluate mathematical problems. It was first
They propose a method based on Pseudo relevance feedback via text formulated by Metropolis and Ulam (1949) and has been applied in
classification. The approach builds on well-known elements from the several research fields such as bio-/chemical and environmental sys-
literature (BM25, LR, SVM, ensemble avg/RRF, RM3) and combines tems engineering (Sin & Espuña, 2020) and statistical physics (Landau
them in simple ways, arguing that, in QE, simplicity can be a virtue. & Binder, 2014).
Query by Example It has also found application in the field of information retrieval.
Query by Example targets the retrieval of elements that are similar Burgin (1999) demonstrated its use in the evaluation of information
to an example element. In order to achieve this, the main characteristics retrieval system performance (recall, precision, F-value). Through re-
of the example element must be extracted and processed in a way peated random sampling of corpora of know size and number of
that other elements of the same kind can be queried for, and ranked relevant documents the statistical significance of a retrieval result can
according to some criterion. This concept has been used in many be determined, and the probability of an observation stemming from a
different applications: Query by Voice (Lee et al., 2015), Query by random process can be estimated. More recently, Schnabel et al. (2016)
Music (Foote, 1997), Query by Image and Videos (Araujo & Girod, also used Monte-Carlo based estimators to determine the performance
2018) and, most recently, Query by Webpage (Geng et al., 2022). A well of ranking functions in information retrieval. Their work deals with
known commercial example is the ‘‘search by image’’ function offered corpora of known size but unknown number of relevant documents, so
by Google that enables users to upload images and find similar images expert judgement to classify the relevance of the retrieved documents is
from the web. required. Their presented approach allows to choose appropriate query-
Query by Document results pairs in an unbiased manner for manual relevance judgement. It
Query by Document can be considered a variant of query by exam- was shown that through this selection the number of required relevance
ple. It was first introduced by Yang et al. (2009). Their methodology judgements could be halved compared to other heuristic methods.
uses a ‘‘part-of-speech tagger’’ to extract some candidate phrases from Alexandrov et al. (2003) showed that Monte-Carlo algorithms can be
the seed documents that should act as query strings. It has been demon- useful in the efficient calculation of eigenvalues of sparse matrices, such
strated on the BlogScope search engine to retrieve similar documents as the term-by-document matrices that often appear in information re-
to a set of 34 article from the New York Times. Weng et al. (2011) trieval tasks. A dimensionality reduction of the matrix can be achieved,
presented an approach that exploits Latent Semantic Indexing (LSI) which can significantly speed up the ranking function calculations.
as a strategy to project documents into a lower dimensional vector In this study, Monte-Carlo sampling is used to formalize the implicit
space. The focus of this work lies on efficient indexing for subsequent knowledge captured in a seed corpus, in order to support the query-
construction step in information retrieval. Queries are performed on
retrieval enhancement. The authors comment that LSI can be substi-
the whole scientific abstract and citation database of huge but unknown
tuted by other dimensionality reduction techniques. Williams et al.
size and unknown number of relevant documents.
(2014) present SimSeerX, a platform for query-by-document task that
performs on the CiteSeerX database. The methodology also relies on
3.2. Citation databases
dimensionality reduction of the seed documents and the documents in
the database. Using various ranking functions the system returns ranked This work focuses on information retrieval from scientific abstract
list of candidate documents that respond to the query documents. Chen and citation databases. Among the largest databases are Google Scholar,
et al. (2018) presented a strategy based on continuous active learning, Scopus® , ScienceDirect, Web of Science, PubMed, and arXiv. For the
a concept that is frequently implemented in other citation screening implementation and validation of the methodology we used Scopus®
and content recommender systems (Howard et al., 2016; Wallace et al., (Burnham, 2006) due to its large amount of entries (72.2 Mio. in 2019
2010). Yang et al. (2018) use the ‘‘More Like This’’ function from according to Gusenbauer, 2019) in multi-disciplinary fields and the
Elasticsearch, a distributed search engine built on Lucene, in order convenient API that allows automatic sampling of the database pro-
to convert a query document into up to 25 relevant terms. Using vided by Elsevier. It has restricted access, meaning that a subscription
these keywords, a disjunctive search is performed on the RCV1-v2 text is necessary to use its features. A main drawback is that Scopus® ,
categorization test collection. like the majority of scientific databases, does not provide free full-text
Most recently, Le et al. (2021) presented a QbD method on top information. This implies that the screening step during information
of a search interface. Their principled technique formulates the query retrieval can only be performed on the abstract, title and keywords
selection task as an optimization problem (Docs2Query) that mini- information. The database used in the methodology can be exchanged
mizes the position (maximizes the rank) of relevant documents. In the but specific requirements and limitations of alternative APIs must be
Docs2Query-Self problem, these relevant documents are the example considered and adapted in the implementation.

2
F. Lechtenberg et al. Expert Systems With Applications 199 (2022) 116967

generally applicable minimum number of seed documents that leads to

good retrieval results.

4.2. Keyword extraction

From the seed corpus, a list of relevant domain keywords must be

extracted to characterize the domain knowledge. The ranked list of
keywords is obtained by computing the tfidf value of each lemmatized
term in the corpus, excluding stopwords. Since the method is solely
based on the occurrence of terms in a set of texts, the ranked list may
contain keywords with seemingly high relevance that are not relevant
in the context of investigation, or different keywords may be synonyms
or related by hyponymy. Because of this, a filtering method could be
applied to the list, which may be aided through the application of
semantic knowledge or rules. Manual filtering of the keywords is, by
no means, necessary. However, in case that the retrieval results do
not match the user’s expectations a (wanted) bias may be introduced
by adjusting the keywords. The resulting weighted feature vector Q
represents the seed corpus and the underlying domain.

4.3. Query sampling

Once the keywords are identified, the next step is to query the
database. Construction of appropriate search strings is a hardship in
research and investigation, and a ranked candidate list of keywords can
aid in this process.
The proposed query-by-documents approach implements a Monte-
Carlo (MC) sampling principle. The idea is to construct search strings
by picking keywords from the ranked list of keywords with a proba-
Fig. 1. Schematic representation of proposed query-by-documents approach as part bility distribution corresponding to their tfidf weight, applying ‘‘AND’’
of a corpus extension task. Modules for keyword extraction, database sampling, and
connectors among the keywords and repeatedly query the database (the
evaluation techniques could be exchanged and applied to different databases. Sampling
strategies: SEQ — Sequential, EXP — Expert, MC — Monte-Carlo.
missing ‘‘OR’’ connector results from the addition of each new query).
The list of keywords ranked by their tfidf weight is constructed
following the description and equation given in the Supplementary
Material. The probability 𝜙(𝑡𝑖 ) of each keyword 𝑡𝑖 being selected within
4. Query-by-document methodology
the top 𝑁𝐾𝑊 keywords (where 𝑖 = 1 has the highest weight, 𝑖 = 2 the
second highest . . . ) is then determined as:
The proposed query-by-documents approach is part of a question
answering task, extending a seed corpus of already detected texts that 𝑡𝑓 𝑖𝑑𝑓 (𝑡𝑖 )
𝜙(𝑡𝑖 ) = ∑𝑁 (1)
respond to the information requirements, through inclusion of other 𝐾𝑊
𝑡𝑓 𝑖𝑑𝑓 (𝑡𝑗 )
𝑗=1
relevant documents. Its steps are depicted in Fig. 1. The productive
documents (those that provide answers to the query) may be included It should be noted that this query construction step could in princi-
in the seed corpus and the cycle can be initialized again. This procedure ple be accompanied by the utilization of semantic knowledge (e.g. us-
would be repeated until no new information is found or the goal of the ing domain ontologies) as demonstrated for instance by Amato et al.
information search has been achieved. (2015). By doing so, vocabulary mismatch, also known as the vocabu-
lary problem, can be reduced. In each performed query the occurrence
4.1. Seed corpus of a document is registered and counted over the total amount of
performed MC iterations.
The methodology requires, as any query-by-documents approach, This methodology comes with a few adjustable parameters.
a set of seed documents. This set is used as the knowledge infor-
mation repository that identifies the range of the search. Thus, it 1. The amount of MC iterations 𝑁𝑀𝐶 determines how well the
should be composed by all available documents clearly relevant to the relevance distribution of the keywords is captured in the sam-
search topic. Obviously, adding non-relevant documents will increase pling procedure. In the presented case studies it was found that
non-relevant results, and not incorporating documents associated to a value between 200 and 1000 iterations is sufficient for the
relevant research will limit the scope of the search. From that point ranked candidate list not to change significantly anymore. See
onwards, human intervention in the retrieval process is reduced, which the Supplementary Material (Fig. S1) for a description of how
is important because human resources are expensive and limited. Once this range was determined.
a Seed Corpus (SC) has been identified, the automation process will 2. The upper limit for the number of documents registered in each
speed up the volume and the quality of information gathered, because iteration 𝑁𝑖𝑡 is a parameter that controls the trade-off between
the resulting documents after one cycle will enlarge the seed corpus, exploration and exploitation of the database search space: a
thus feeding the next iteration. Seed corpora may be obtained and high value for this parameter registers many documents in each
provided for instance by experts in the field such as professors as iteration, resulting in the need of more MC iterations to reach
starting point for a project or research line of a coworker or student. a stationary ranking. For lower values stationarity is achieved
The number of documents in the corpus defines the seed corpus length faster but relevant documents might be overlooked through the
(L). Another use case is the retrieval of similar documents to detect stricter cut-off. Currently, the Scopus® API imposes an upper
eventual plagiarism, in which L may take the value of 1. There is no limit of 2000 documents for this parameter.

3
F. Lechtenberg et al. Expert Systems With Applications 199 (2022) 116967

3. The number of keywords included in the sampling procedure extra insight, inversely depending on the quality of the abstract, but a
𝑁𝐾𝑊 is a critical parameter with a similar trade-off characteris- trade-off arises when the associated increase of computational effort is
tic as 𝑁𝑖𝑡 . However, additionally to the computational trade-off, considered.
the amount of included keywords regulates how ‘‘far’’ from For validation purpose, after determining the ranked candidate
the core domain (i.e. how many ‘‘less-relevant’’ keywords) the list, the full-text information is required. It is unreasonable from and
sampling procedure should reach. economic and computational resource point of view to download and
process a huge amount of full-text information. Thus, we opt for down-
Tuning of these parameters, that are common to other information loading (semi-) manually a number 𝑁𝐷𝑊 𝑁 of documents. As a result,
retrieval methods, could, in theory, be automized through a parameter the performance of the evaluation procedure will vary depending on
sweep procedure that refines some performance metric such as seed the institution that performs the retrieval task since the subscribed
recall or average seed position. In practice however, limits imposed on journals and databases are different for most institutions. However,
the amount of queries to the database should be taken into account future changes in publishing policies can be easily incorporated into
and could potentially prohibit an extensive sensitivity analysis. For that the methodology.
reason, in this study, we limit our analysis of 𝑁𝑀𝐶 and 𝑁𝐾𝑊 to three Once the documents are downloaded, their relevance to the domain
alternative values while keeping 𝑁𝑖𝑡 at the upper limit imposed by the can be evaluated using the BM25 ranking function. The linguistic
Scopus® API. relevance is determined with respect to the weighted feature vector
After performing the sampling procedure, the amount of times an Q that is expected to represent the domain of interest. The user can
individual document d appeared in the query process 𝑁𝑑 divided by then manually screen the resulting candidate documents in order of
the number of MC iterations 𝑁𝑀𝐶 yields the document frequency 𝐷𝐹𝑑 . linguistic relevance until a threshold value 𝐵𝑀25𝑚𝑖𝑛 or until he is
This is an inherent relevance metric that can be directly used to rank satisfied with the retrieved information. The document frequency DF or
the candidate documents and propose a reading order. the cosine similarity 𝜃 of the documents with the feature vector could
𝑁𝑑 be used as a metric for linguistic relevance alternatively (Marcos-Pablos
𝐷𝐹𝑑 = (2)
𝑁𝑀𝐶 & García-Peñalvo, 2018).
Alternatively, a naive search can be performed on the database by Apart from the BM25 relevance metric, the performance should be
simply connecting the top keywords until the number of results from evaluated using the recall of seed documents. Le et al. (2021) verified
the database yields the amount of documents the user is willing to read. the hypothesis that IR method performing well in re-retrieving the seed
We refer to this method as the sequential sampling method (SEQ). documents (Docs2Queries-Self problem) also perform well in finding
Finally, instead of blindly connecting the keywords an expert can similar documents (Docs2Queries-Sim problem).
use the identified terms to construct more complex strings using differ-
ent combinations and connectors such as ‘‘OR’’ and ‘‘AND NOT’’. We 4.5. Comparison with other information retrieval methods
refer to this as the expert sampling method (EXP). Compared to the
MC method the user must have some sort degree of expert knowledge Information retrieval procedures have been especially explored and
to apply it. applied in the field of systematic literature review and specialized cor-
The SEQ and EXP methods do not have an inherent relevance metric pora construction. Marcos-Pablos and García-Peñalvo (2018) propose
and the resulting candidate documents must be ordered by other means an iterative methodology to construct search strings which they ap-
such as the application of BM25 ranking function or naive metadata like plied in their literature review about technological ecosystems (Marcos-
the amount of citations. Pablos & García-Peñalvo, 2019). A comparison of their approach with
It must be noted that in this step the database is only sampled by the one presented here is summarized in Table 1.
the information available in the abstract, title and keywords. Thus, The objectives at the end of each iteration are different: Our ap-
we suggest using the abstracts of the seed corpus to obtain the set proach aims at finding an extended corpus departing from a set of rel-
of relevant keywords based on the assumption that the language used evant documents (SC). The methodology by Marcos-Pablos and García-
in abstracts may be different from the full texts and, consequently, Peñalvo (2018) on the other hand, results in suggested keywords for
providing a fair basis for the query task. search string construction. However, both methods follow the same
main steps of keyword construction via TF-IDF, sampling and evalu-
4.4. Evaluation ation.
The main difference lies in the sampling procedure: Marcos-Pablos
Once the ranked list of references is obtained it is possible to evalu- and García-Peñalvo (2018) procedure requires the use of expert knowl-
ate the documents in terms of linguistic relevance. For that purpose, the edge to make the final decision on search string while the MC procedure
freely available abstracts could be used, but we suggest to include as avoids this need. Furthermore, a minor difference lies in the departing
many full texts as possible in the evaluation step. The reasoning behind point of the methodologies that is shifted due to the different targeted
this is that abstracts only represent a very small fraction of the full- endpoints (keywords vs. retrieved information/documents).
text in condensed form. Information retrieval tasks such as retrieval of There are various other information retrieval methods that have
parameters or experimental data will be more successful when looking been briefly addressed in Section 2. In this work we omit the direct
into the full texts (Kottmann et al., 2010), including the Supplemen- comparison to these methods since their scope is not aligned with the
tary Material that often provides more quantitative information than scope of this work. On one hand, those works are applied to specialized
abstracts. static corpora and datasets (e.g. TREC-8 Voorhees & Harman, 1999 and
This work purposely skips any discussion on publication policies and TREC-9 Robertson & Hull, 2000 as used in Le et al., 2021) instead of the
the property of the information. The general methodology developed growing and inter-disciplinary corpora that are the academic abstract
here can be employed in public and private databases, using the total and citation databases. Our proposed methodology is designed to be
or partial information available (e.g. abstracts) according to the access applicable to corpora without the need to analyze the corpus previous
rights. For more insight into the debate about Open Access (OA) the to the retrieval task. Furthermore, this work goes beyond what other
reader is referred to the review by Piwowar et al. (2018). works are doing by evaluating the performance of the method by using
As previously mentioned we decided to use Scopus® for demonstra- the full-text information of the retrieved documents. The corpora that
tion and validation purpose, which limits the sampling task to abstract, other works deal with are pre-classified which is not the case in the
title and keyword information. Sampling and evaluating full texts open question answering approach that is envisioned in this work and
instead of abstracts is debatable. The search of full texts may provide illustrated in Fig. 1.

4
F. Lechtenberg et al. Expert Systems With Applications 199 (2022) 116967

Table 1
Comparison of information retrieval methodologies.
Marcos-Pablos and García-Peñalvo (2018) This study
Input: Search string S; stop words vector SW; minimum Input: Seed Corpus SC; stop words vector SW; (optional)
cosine similarity distance 𝜃𝑚𝑖𝑛 minimum BM25 value 𝐵𝑀25𝑚𝑖𝑛
Output: Recommended new terms 𝑇 for building a new Output: Ranked list of relevant documents RL
search string S1
1. Use S as input search string on academic databases and 1. Project SC on vector space and compute tfidf values.
construct an abstract corpus D.
2. Project D on vector space and compute tfidf values. 2. Perform MC sampling method on academic databases
(corresponds to step 1 in this study) using the filtered top keywords 𝑁𝐾𝑊 .
3. Classify documents in D as relevant (R) and non-relevant 3. Obtain a candidate list CL sorted by the document
(NR) from cosine similarity. frequency 𝐷𝐹𝑑 .
4. Compute term weights 𝑤𝑡,𝐷 in R and NR. 4. (Optional) Download the top 𝑁𝐷𝑊 𝑁 full-text documents of
CL. Apply BM25 ranking function to determine linguistic
relevance order.
5. Suggest new terms 𝑇 based on 𝑤𝑡,𝐷 sorted values. 5. Use those documents with high document frequency 𝐷𝐹𝑑
or higher relevance than 𝐵𝑀25𝑚𝑖𝑛 for information extraction.
6. Construct a new search string S1 and repeat from step 1 6. Extend SC with newly identified truly relevant documents
and repeat from step 1.

5. Case studies Table 2

Sampling procedures tested and evaluated in case-study I.

The proposed methodology has been tested and illustrated on two 1. Seed corpus length L (𝑁𝐾𝑊 = 10) 1 8 20

case studies that are detailed in the following sections. Choose best performing seed corpus length 𝐿𝑏𝑒𝑠𝑡
2. Number of keywords 𝑁𝐾𝑊 (𝐿 = 𝐿𝑏𝑒𝑠𝑡 ) 7 20 30
5.1. Case study I: Technological ecosystems in care and assistance Other parameters: 𝑁𝑖𝑡 = 𝑁𝑀𝐶 = 1000.

The goal of this case-study is to emulate the findings from the lit- Table 3
erature review by Marcos-Pablos and García-Peñalvo (2019) departing Sampling procedures tested and evaluated in case-study II.
from a subset of the documents that have been identified as truly rel- Method: SEQ EXP A EXP B EXP C MC
evant and use them as a seed corpus in the methodology. The original (FL) (AST) (APL)
systematic literature review deals with technological ecosystems in care 𝑁𝐾𝑊 4 9 9 16 10, 15, 20, 25, 29, 30

and assistance. This topic comes with the difficulty of being based in
two different fields. Therefore, the reasonable combination of suggested
keywords requires a significant degree of expert knowledge. On the This study is motivated by the need to populate a process ontology
other hand, the proposed methodology is expected to account for, and with information for the selection of sustainable waste-to-resource
combine, both fields implicitly in the tfidf values during the sampling alternatives (Pacheco-López et al., 2020).
procedure. The starting point for initialization is a seed corpus consisting of
Using an initial search string on Scopus and Web of Science, Marcos- eight papers. They originate from the review performed by Somoza-
Pablos and García-Peñalvo (2019) narrowed down the candidate list of Tornos et al. (2021) and are given in Table S2 (Supplementary Ma-
potentially relevant documents to 8394. Then, they applied a cosine terial). After extracting the weighted feature vector for sampling, we
similarity threshold to only consider the top 809 documents. These apply the SEQ, EXP and MC sampling methods as summarized in
documents were then screened using a quality assessment checklist Table 3 and compare (1) the position of the seed documents in the
to further reduce the selection to 194 documents. Finally, 37 docu- resulting ranked lists and (2) the linguistic relevance distributions of
ments were included for the quantitative synthesis of the literature the identified candidate lists. As for the EXP method three of the
review. This list of relevant documents is given in Table S1 in the members of the research group (FL, AST, APL) proposed search strings
Supplementary Material. Note that five of these documents are not using the keywords from the extraction step (FL, AST) or alternative
available in Scopus® and therefore cannot be retrieved with the applied ones (APL) based on their experience in the field.
methodology.
In this case-study we depart from randomly selected subsets of L 6. Results and discussion
documents (Table S1) taken from these 37 relevant documents and
follow the steps of the proposed methodology. The chosen quality cri- 6.1. Case study I: Technological ecosystems in care and assistance
terion for assessing the performance in this case-study is the number of
relevant documents and seed documents re-retrieved by the methodol- Table S3 shows the position of all the seed papers as well as the
ogy and their position in the ranked list. After selecting an appropriate remaining relevant papers in the ranked candidate lists. In a first
seed corpus length 𝐿𝑏𝑒𝑠𝑡 using 10 keywords during sampling we vary step, the top ten keywords were used for MC sampling. Search was
the number of included keywords 𝑁𝐾𝑊 . The tested configurations restricted to the years between 2002 and 2019 to better emulate the
are summarized in Table 2. The number of registered documents per results of Marcos-Pablos and García-Peñalvo (2019). Fig. 2 illustrates
iteration 𝑁𝑖𝑡 and the total number of MC iterations 𝑁𝑀𝐶 are both the results.
chosen to be 1000. It can be seen that using a single paper as seed corpus does not
lead to satisfactory results. The seed paper itself ranks in position one
5.2. Case study II: Pyrolysis of plastic waste with 689 appearances in 1000 iterations. Out of the remaining possible
papers only 6 appear in the candidate list while only one of them ranks
The goal of this second case-study is to apply and compare the three high (A2 in position 12). Fig. 2(a) shows the placement of the relevant
presented sampling method alternatives in the domain of chemical en- papers in the candidate lists using different seed corpus length 𝐿.
gineering. The targeted information is the retrieval of documents con- Better results are obtained when using eight seed papers. In total 26
taining parameters that describe pyrolysis processes of plastic waste. out of the 31 available relevant papers in Scopus® (83.9%) are found,

5
F. Lechtenberg et al. Expert Systems With Applications 199 (2022) 116967

Fig. 2. Sensitivity analysis results for seed corpus length 𝐿 and number of keywords 𝑁𝐾𝑊 in case-study I.

including seven (87.5%) of the seed papers. Moreover, these papers

rank high in the list with paper A11 being the lowest one on position
7524 where the total length of the candidate list is 20,494. We consider
these recall values satisfactory, having in mind that in the keyword
extraction and sampling steps no expert knowledge was applied.
Increasing the number of seed papers to 20 does not improve
retrieval performance. In total, 18 out of 20 seed papers appear in the
candidate list (90%) and the same total number of relevant papers are
found (26 out of 31 → 83.9%). Even though the seed recall value is
slightly higher, the total information gain through the retrieval process
is not as effective. Doubling the amount of initial information does not
produce any significant effect in the number of relevant documents
retrieved.
After identifying an appropriate seed corpus length (L), the in-
fluence of the number of keywords included (𝑁𝐾𝑊 ) is investigated
Fig. 3. Linguistic relevance of retrieved documents vs. document frequency resulting
(Fig. 2(b)). This parameter has an inherent trade-off characteristic from MC sampling procedure. Shown are the position of the 37 truly relevant papers
(exploration vs. exploitation in the search space): while a low number from Marcos-Pablos and García-Peñalvo (2019) and the newly identified candidates.
of included keywords exploits well a limited search space, it might
miss out on relevant papers associated with excluded keywords. On the
other hand, a high number of included keywords will explore well the 2979 documents form the set D and the BM25 ranking function was
whole search space but will inevitably include more irrelevant papers. applied to determine their linguistic relevance.
It is evident from Table S3 and Fig. 2(b) that increasing 𝑁𝐾𝑊 leads Fig. 3 depicts the calculated normalized linguistic relevance of each
to longer candidate lists. On the other hand, no clear trend can be document plotted against its document frequency from the MC sam-
identified with respect to the position of the seed documents in those pling procedure. Normalization was applied by dividing the calculated
lists. It can be seen that using seven keywords also leads to a total recall BM25 values by the maximum calculated value and the document
value of 83.9% with the lowest paper being A11 in position 4299. Using frequency was normalized by dividing the number of appearances by
20 or 30 keywords results in higher recall values of 29 out of 31 papers the number of MC iterations (1000). In this Figure, the positions of
(93.5%) but the seed and relevant papers are generally found in lower the found and not found seed and relevant documents, as well as
positions (lowest paper in rank 49,068 and 26,150 respectively). the remaining candidate documents and the seed average values, are
For the final evaluation step, the seven-keyword list from the eight- highlighted.
paper seed corpus is chosen because of the fact that more relevant It can be seen that there is a weak correlation between BM25 and
papers appear in higher positions. It is expected that this implies that DF values. We obtain a triangular shaped cloud of points, i.e., the
the remaining highly ranked documents also have a higher linguistic documents with high document frequency also have a high linguistic
relevance. relevance. On the other hand, it is not ensured that documents with
The top 5000 documents from the candidate list were checked high linguistic relevance appear frequently in the sampling procedure.
for downloading (it includes all the found seed papers, as well as Therefore, it can be concluded that the ranking by document frequency
the remaining relevant documents). Using the Endnote Click (Google yields good results in the higher positions but misses relevant papers
Chrome browser extension) a total of 2979 full-text documents could in the lower ranks. When downloading the full text documents and
be finally downloaded, including the seed and relevant papers. These applying the BM25 ranking function a more reliable ranking can be

6
F. Lechtenberg et al. Expert Systems With Applications 199 (2022) 116967

Table 4
Case study II: Extracted keywords from the eight-paper seed corpus.
Keyword tfidf Keyword tfidf Keyword tfidf
waste 1.12 yield 0.74 recycling 0.42
pyrolysis 1.11 plastic 0.74 gasoline 0.42
product 1.03 increase 0.66 ldpe 0.41
oil 0.90 beda 0.66 char 0.40
gas 0.87 polyethylene 0.59 polymer 0.39
process 0.84 feedstock 0.55 distribution 0.36
wt 0.83 time 0.49 reactor 0.34
catalyst 0.78 residence 0.49 material 0.33
temperature 0.77 flash 0.45 recovery 0.33
monomer 0.76 hydrocarbon 0.44 work 0.32
Fig. 4. Position of seed papers and relevant documents when ordering the candidate
list by BM25 and DF values. a Excluded: out of scope

obtained. The triangular shape is a characteristic result from the MC

sampling that has been found also frequently in other preliminary tests
of the methodology.
Fig. 4 is an alternative representation of the data in Fig. 3. It shows
the position of each document in the candidate list when ordering
it by BM25 values and document frequency respectively. Apart from
the fact that the relevant papers really rank comparably high in the
linguistic relevance, a large number of other seemingly relevant papers
are found with similar BM25 values as the seed papers. Using the lowest
BM25 value of the seed papers as cut-off value (𝐵𝑀25𝑚𝑖𝑛 = 0.709) the
number of relevant papers to be analyzed for information extraction
can be reduced by 83.5% (2979 → 475 papers). In this case, as can Fig. 5. Ordering candidate lists from different sampling procedures by BM25 values.
be seen in Fig. 4, only three relevant papers (out of the ones identified
by Marcos-Pablos & García-Peñalvo, 2019) would be wrongly discarded
while significantly reducing the amount of papers to be checked or read During sequential sampling (SEQ) the top four keywords are com-
during the information extraction step. bined in a search string to yield a list of 2000 candidate documents, the
Moreover, the BM25 metric allows sorting of the documents by upper limit for downloading references imposed by Scopus. Compared
relevance. The document frequency, on the other hand, shows plateaus with the MC sampling, the chances are high that the results from
that correspond to papers with identical amounts of appearances during this procedure are too general and do not correspond properly to the
the sampling procedure. This makes the determination of a cut-off value domain of interest.
based on DF non-trivial. Table 5 shows the search strings that were constructed by three
The findings from the first case-study can be summarized as follows: member of the authors research group for the expert sampling method
(EXP). While EXP A and B used only the suggested keywords from
• The optimal seed corpus length L cannot be determined a priori. the extraction step EXP C used additional keywords that can lead to
Here, eight seed papers lead to satisfactory results but for other
results out of the domain that is represented by the extracted keywords.
applications this number might be too small or even less papers
On a closer inspection of the search strings, it can be seen that EXP
can be used.
B is a subset of EXP A, meaning that all candidate documents form
• The number of keywords 𝑁𝐾𝑊 included in the sampling can be
search string B are included in A. Nevertheless, the documents were
used to manage the exploration vs. exploitation trade-off.
also retrieved and included in the evaluation step. 𝑁𝐷𝑊 𝑁 was chosen
• The proposed methodology has a strong capability of retrieving
as 2000 in order to make the comparison with the Sequential procedure
highly relevant documents and information requiring no expert
fair. The expert knowledge strings yield less candidates than SEQ and
knowledge in the keyword extraction, sampling and evaluation
MC procedures.
procedure.
Again, the complete set of downloaded documents (4306) was eval-
The availability of 37 pre-classified relevant papers served as a good uated in terms of linguistic relevance using the BM25 ranking function
basis for quantifying the performance of the proposed methodology. and the 29 keywords used for MC sampling. Fig. 5 shows the relevance
distribution of the three sets and the position that the seed papers rank
6.2. Case study II: Pyrolysis of plastic waste within the MC sampling candidate list.
It is evident that in the upper ranks of each list the linguistic
The second case-study departs from eight seed documents (Table relevance is very similar. This is because many of those documents
S2), without any available test set of other truly relevant papers. have been retrieved in every sampling procedure. The MC candidates
Therefore, the focus of discussion lies purely on the BM25 linguistic are consistently more relevant than the other candidates at a given
relevance and re-retrieval of seed documents. position. This means that the MC sampling procedure outperforms the
Table 4 shows the keywords and associated tfidf values extracted other procedures in terms of retrieving a high volume of relevant
from the seed documents. Using the top 10, 15, 20, 25, 29 and 30 documents.
chosen keywords we performed the corresponding MC sampling and Moreover, the MC sampling procedure has the highest recall value
observed where the seed papers rank in the candidate lists. It can of seed documents. In fact, every seed document appears in the candi-
be seen from Table S4 that using 20 keywords leads to the most date list with B3 being the only one not included in the downloaded
promising results, according to amount and position of seed papers in papers since its position (3531) is below 𝑁𝐷𝑊 𝑁 .
the retrieved list. Again, there is a trade-off between exploration vs. Based on these findings we conclude that the MC sampling pro-
exploitation capabilities, so the value to be chosen will depend on the cedure performs better compared to the competing procedures in all
importance given to these two properties. aspects of our investigation:

7
F. Lechtenberg et al. Expert Systems With Applications 199 (2022) 116967

Table 5
Summary of sampling procedures.
Procedure Search string Hits Download Recall
SEQ waste AND pyrolysis AND product AND oil 2015 1154 2/8
EXP A pyrolysis AND plastic AND waste AND (gas OR oil 1156 727 6/8
OR product) AND (temperature OR catalyst OR
yield)
EXP B pyrolysis AND plastic AND waste AND product 853 548 5/8
AND (temperature OR catalyst OR yield)
EXP C pyrolysis AND (plastic OR polyolefin OR polymer) 1127 681 4/8
AND waste AND (gas OR oil OR product OR
biofuel OR chemical OR ethylene OR methane OR
benzene) AND (recycling OR upcycling OR
treatment)
MC Combinations of 29 KWs 116,435 1196 8/8a
a
7/8 within top 2000.

• High amount of more relevant papers in the retrieved documents CRediT authorship contribution statement
• High recall value of seed documents
Fabian Lechtenberg: Conceptualization, Methodology, Software,
Finally, the method proved to offer a variety of benefits in terms of
Data curation, Writing – original draft, Visualization. Javier Farreres:
applicability and flexibility that can be summarized as follows:
Conceptualization, Methodology, Software, Writing – review & edit-
• In principle, no need for expert knowledge ing. Aldwin-Lois Galvan-Cara: Methodology, Software, Data curation,
• Flexible in terms of exploration and exploitation Writing – original draft. Ana Somoza-Tornos: Conceptualization, Writ-
• Reasonable pre-download ordering based on abstracts through DF ing – review & editing. Antonio Espuña: Conceptualization, Writing
– review & editing, Supervision, Project administration, Funding ac-
7. Conclusions quisition. Moisès Graells: Validation, Writing – review & editing,
Supervision, Project administration, Funding acquisition.
Literature search is a specific and essential task in scientific re-
search. The access to digital databases has boosted the search capa- Declaration of competing interest
bilities, but the scientific community worldwide still requires a lot
of time and expert dedication to retrieve relevant information. This The authors declare that they have no known competing finan-
work presents a novel methodology that improves the information cial interests or personal relationships that could have appeared to
retrieval task from scientific abstract and citation databases via a influence the work reported in this paper.
query-by-documents approach.
The main contribution of this work consists of the inclusion of Acknowledgments
a Monte-Carlo sampling procedure during the query string construc-
tion step which leads to two desirable outcomes: (1) human expert Financial support received from the Spanish ‘‘Ministerio de Cien-
intervention (an expensive and scarce resource) is decreased and (2) cia e Innovación’’ and the European Regional Development Fund,
potential human bias is avoided. The proposed method has been devel- both funding the research Projects AIMS (DPI2017-87435-R) and CEPI
oped, implemented and tested on the Scopus® database using two case (PID2020-116051RB-I00) is fully acknowledged. Fabian Lechtenberg
studies. gratefully acknowledges the Universitat Politècnica de Catalunya for
The two case studies demonstrated the methodology’s applicability the financial support of his predoctoral grant FPU-UPC, with the
to various fields of research. Remarkably, one of the studies itself is
collaboration of Banco de Santander. The authors would like to thank
based in two distinct fields (technological ecosystems and healthcare).
Adrián Pacheco-López (APL) for contributing one of the expert query
The retrieval results are satisfactory, i.e. high recall value of truly
strings.
relevant papers declared by the reference work, considering that the
authors are no experts in these fields and only a small amount of
Appendix A. Supplementary data
initial information (seed corpus) has been taken from the reference
(Marcos-Pablos and García-Peñalvo (2019)). These results imply that
Supplementary material related to this article can be found online
corpora for multidisciplinary collaboration can be easily identified by
at https://doi.org/10.1016/j.eswa.2022.116967.
our approach. A case-study on information retrieval of waste plastic
pyrolysis processes suggests that the proposed methodology performs
better in terms of number and linguistic relevance (BM25) of retrieved References
documents than a naive sequential sampling method as well as the
Alexandrov, V. N., Dimov, I. T., Karaivanova, A., & Tan, C. J. K. (2003). Parallel Monte
query string construction by three experts. Carlo algorithms for information retrieval. Mathematics and Computers in Simulation,
In general, the methodology is expected to accelerate the infor- 62, 289–295. http://dx.doi.org/10.1016/S0378-4754(02)00252-5.
mation retrieval process through reducing the need of screening less Amato, F., De Santo, A., Gargiulo, F., Moscato, V., Persia, F., Picariello, A., &
relevant papers. Through the automatization of abstract screening using Sperli, G. (2015). A novel approach to query expansion based on semantic
various combinations of keywords the search can go beyond what similarity measures. In DATA 2015 - 4th international conference on data management
technologies and applications, proceedings (pp. 344–353). http://dx.doi.org/10.5220/
manual search could achieve, thus, finding relevant papers that could
0005579703440353.
have been overlooked otherwise. Systematic literature reviews will Araujo, A., & Girod, B. (2018). Large-scale video retrieval using image queries. IEEE
benefit most from the methodology but really any research that starts Transactions on Circuits and Systems for Video Technology, 28, 1406–1420. http:
with a literature review will find it useful. //dx.doi.org/10.1109/TCSVT.2017.2667710.
Technical limitations like the speed of sampling and request limits Azad, H. K., & Deepak, A. (2019). Query expansion techniques for information retrieval:
A survey. Information Processing & Management, 56, 1698–1735. http://dx.doi.org/
using the available APIs should be addressed to further improve the per-
10.1016/j.ipm.2019.05.009.
formance. Furthermore, active learning strategies (Chen et al., 2018) Burgin, R. (1999). The Monte Carlo method and the evaluation of retrieval system per-
could be integrated in the methodology to adapt the candidate ranking formance. Journal of the American Society for Information Science, 50, 181–191. http:
based on expert feedback during the manual classification step. //dx.doi.org/10.1002/(SICI)1097-4571(1999)50:2<181::AID-ASI8>3.0.CO;2-9.

8
F. Lechtenberg et al. Expert Systems With Applications 199 (2022) 116967

Burnham, J. F. (2006). Scopus database: A review. Biomedical Digital Libraries, 3, Metropolis, N., & Ulam, S. (1949). The Monte Carlo method. Journal of the Ameri-
http://dx.doi.org/10.1186/1742-5581-3-1. can Statistical Association, 44, 335–341. http://dx.doi.org/10.1080/01621459.1949.
Chen, S., Nourashrafeddin, S., Moh’D, A., & Milios, E. (2018). Active high-recall 10483310.
information retrieval from domain-specific text corpora based on query documents. Pacheco-López, A., Somoza-Tornos, A., Muñoz, E., Capón-García, E., Graells, M., &
In Proceedings of the ACM symposium on document engineering 2018 (pp. 1–10). Espuña, A. (2020). Synthesis and assessment of waste-to-resource routes for circular
http://dx.doi.org/10.1145/3209280.3209532. economy. In 30 European symposium on computer aided process engineering (pp.
Foote, J. T. (1997). Content-based retrieval of music and audio. In Multimedia storage 1933–1938). http://dx.doi.org/10.1016/B978-0-12-823377-1.50323-2.
and archiving systems II (pp. 138–147). http://dx.doi.org/10.1117/12.290336. Piwowar, H., Priem, J., Larivière, V., Alperin, J. P., Matthias, L., Norlander, B.,
Geng, Q., Chuai, Z., & Jin, J. (2022). Webpage retrieval based on query by example for Farley, A., West, J., & Haustein, S. (2018). The state of OA: A large-scale analysis
think tank construction. Information Processing & Management, 59, Article 102767. of the prevalence and impact of open access articles. PeerJ, 2018, 1–23. http:
http://dx.doi.org/10.1016/j.ipm.2021.102767. //dx.doi.org/10.7717/peerj.4375.
Gusenbauer, M. (2019). Google Scholar to overshadow them all? Comparing the sizes Robertson, S. E., & Hull, D. A. (2000). The TREC-9 filtering track final report. In
of 12 academic search engines and bibliographic databases. Scientometrics, 118, Proceedings of the ninth text retrieval conference.
177–214. http://dx.doi.org/10.1007/s11192-018-2958-5. Schnabel, T., Frazier, P. I., Swaminathan, A., & Joachims, T. (2016). Unbiased
Han, X., Liu, Y., & Lin, J. (2021). The simplest thing that can possibly work: (Pseudo- comparative evaluation of ranking functions. In ICTIR 2016 - proceedings of the 2016
)relevance feedback via text classification. In Proceedings of the 2021 ACM SIGIR ACM international conference on the theory of information retrieval (pp. 109–118).
international conference on theory of information retrieval (pp. 123–129). http://dx. http://dx.doi.org/10.1145/2970398.2970410.
doi.org/10.1145/3471158.3472261. Sin, G., & Espuña, A. (2020). Editorial: Applications of Monte Carlo method in chemical,
Howard, B. E., Phillips, J., Miller, K., Tandon, A., Mav, D., Shah, M. R., Holmgren, S., biochemical and environmental engineering. Frontiers in Energy Research, 8, 1–2.
Pelch, K. E., Walker, V., Rooney, A. A., Macleod, M., Shah, R. R., & Thayer, K. http://dx.doi.org/10.3389/fenrg.2020.00068.
(2016). SWIFT-Review: A text-mining workbench for systematic review. Systematic Somoza-Tornos, A., Pozo, C., Graells, M., Espuña, A., & Puigjaner, L. (2021). Process
Reviews, 5, 1–16. http://dx.doi.org/10.1186/s13643-016-0263-z. screening framework for the synthesis of process networks from a circular economy
Kottmann, R., Radom, M., Formanowicz, P., Glöckner, F., Rybarczyk, A., Szachniuk, M., perspective. Resources, Conservation and Recycling, 164, Article 105147. http://dx.
& Błażewicz, J. (2010). Cerberus: A new information retrieval tool for marine doi.org/10.1016/j.resconrec.2020.105147.
metagenomics. Foundations of Computing and Decision Sciences, 35, 107–126. Voorhees, E. M., & Harman, D. K. (1999). Overview of the eighth text retrieval
Landau, D. P., & Binder, K. (2014). A guide to Monte Carlo simulations in statis- conference (TREC-8). In Proceedings of the eighth text retrieval conference.
tical physics (4th ed.). Cambridge University Press, http://dx.doi.org/10.1017/ Wallace, B. C., Small, K., Brodley, C. E., & Trikalinos, T. A. (2010). Active learning
cbo9781139696463. for biomedical citation screening categories and subject descriptors. In Proceedings
Landhuis, E. (2016). Scientific literature: Information overload. Nature, 535, 457–458. of the 16th ACM SIGKDD international conference on knowledge discovery and data
http://dx.doi.org/10.1038/nj7612-457a. mining (pp. 173–181). http://dx.doi.org/10.1145/1835804.1835829.
Le, N. X. T., Shahbazi, M., Almaslukh, A., & Hristidis, V. (2021). Query by documents Weng, L., Li, Z., Cai, R., Zhang, Y., Zhou, Y., Yang, L. T., & Zhang, L. (2011).
on top of a search interface. Information Systems, 101, Article 101793. http://dx. Query by document via a decomposition-based two-level retrieval approach. In
doi.org/10.1016/j.is.2021.101793. SIGIR’11 - proceedings of the 34th international ACM SIGIR conference on research
Lee, L. S., Glass, J., Lee, H. Y., & Chan, C. A. (2015). Spoken content retrieval - beyond and development in information retrieval (pp. 505–514). http://dx.doi.org/10.1145/
cascading speech recognition with text retrieval. IEEE Transactions on Audio, Speech 2009916.2009985.
and Language Processing, 23, 1389–1420. http://dx.doi.org/10.1109/TASLP.2015. Williams, K., Wu, J., & Giles, C. L. (2014). SimSeerX: A similar document search engine.
2438543. In DocEng 2014 - Proceedings of the 2014 ACM symposium on document engineering
Marcos-Pablos, S., & García-Peñalvo, F. J. (2018). Information retrieval methodology (pp. 143–146). http://dx.doi.org/10.1145/2644866.2644895.
for aiding scientific database search. Soft Computing, 24, 5551–5560. http://dx.doi. Yang, Y., Bansal, N., Dakka, W., Ipeirotis, P., Koudas, N., & Papadias, D. (2009). Query
org/10.1007/s00500-018-3568-0. by document. In Proceedings of the 2nd ACM international conference on web search
Marcos-Pablos, S., & García-Peñalvo, F. J. (2019). Technological ecosystems in care and data mining (pp. 34–43). http://dx.doi.org/10.1145/1498759.1498806.
and assistance: A systematic literature review. Sensors, 19, 708. http://dx.doi.org/ Yang, E., Lewis, D. D., Frieder, O., Grossman, D., & Yurchak, R. (2018). Retrieval and
10.3390/s19030708. richness when querying by document. In CEUR workshop proceedings (pp. 68–75).
Mergel, G. D., Silveira, M. S., & Da Silva, T. S. (2015). A method to support search Yusuf, N., Yunus, M. A. M., Wahid, N., Mustapha, A., & Salleh, M. N. M. (2021).
string building in systematic literature reviews through visual text mining. In A survey of query expansion methods to improve relevant search engine results.
Proceedings of the ACM symposium on applied computing (pp. 1594–1601). http: International Journal on Advanced Science, Engineering and Information Technology,
//dx.doi.org/10.1145/2695664.2695902. 11, 1352–1359. http://dx.doi.org/10.18517/ijaseit.11.4.8868.

Java Ebook by Durga Sir
67% (3)
Java Ebook by Durga Sir
477 pages
Macmillan Next Move Level 2 Pupil S Book Sample
No ratings yet
Macmillan Next Move Level 2 Pupil S Book Sample
10 pages
Taplin, O. (2007) Pots - Plays PDF
100% (2)
Taplin, O. (2007) Pots - Plays PDF
322 pages
Classroom Observation and Research
100% (1)
Classroom Observation and Research
12 pages
Systems Engineering Management Plan
No ratings yet
Systems Engineering Management Plan
27 pages
Bla Power Pvt. LTD: Woodward 505 Governor Valve / Actuator Calibration &test
No ratings yet
Bla Power Pvt. LTD: Woodward 505 Governor Valve / Actuator Calibration &test
23 pages
IRS Automatic Indexing UNIT-2
67% (3)
IRS Automatic Indexing UNIT-2
18 pages
Intelligent Database Systems
No ratings yet
Intelligent Database Systems
32 pages
IRS III Year UNIT-3 Part 1
50% (2)
IRS III Year UNIT-3 Part 1
18 pages
IMRAD
No ratings yet
IMRAD
18 pages
Alstom - P642 P643 P645 Cortec and Ordering Information
No ratings yet
Alstom - P642 P643 P645 Cortec and Ordering Information
3 pages
Intertextuality Quiz
No ratings yet
Intertextuality Quiz
1 page
Ms Excel: Essential Training For The 70-779 EXAM
No ratings yet
Ms Excel: Essential Training For The 70-779 EXAM
11 pages
Appendix 1: Lesson Plan (Template) : Lesson Plan Subject: English Trainee: Bashayer Abdul-Aziz Topic or Theme: Phonics
No ratings yet
Appendix 1: Lesson Plan (Template) : Lesson Plan Subject: English Trainee: Bashayer Abdul-Aziz Topic or Theme: Phonics
5 pages
A Brief Survey On Data Mining For Biological and Environmental Problems.
No ratings yet
A Brief Survey On Data Mining For Biological and Environmental Problems.
46 pages
Information Retrieval Models
No ratings yet
Information Retrieval Models
4 pages
Assembly Language For x86 Processors: Chapter 1: Basic Concepts
No ratings yet
Assembly Language For x86 Processors: Chapter 1: Basic Concepts
41 pages
Biomedical Knowledge Engineering Using A Computational Grid: Marcello Castellano and Raffaele Stifini
No ratings yet
Biomedical Knowledge Engineering Using A Computational Grid: Marcello Castellano and Raffaele Stifini
25 pages
ARTS
No ratings yet
ARTS
50 pages
Mock Assessment Informatica - Practitioner & Specialist Level
No ratings yet
Mock Assessment Informatica - Practitioner & Specialist Level
5 pages
Information Retrieval System
No ratings yet
Information Retrieval System
4 pages
Paper 21-Mining Scientific Data From Pub-Med Database
No ratings yet
Paper 21-Mining Scientific Data From Pub-Med Database
3 pages
Decision Trees For Handling Uncertain Data To Identify Bank Frauds
No ratings yet
Decision Trees For Handling Uncertain Data To Identify Bank Frauds
4 pages
IRS Unit-3
No ratings yet
IRS Unit-3
30 pages
Rijul Research Paper
No ratings yet
Rijul Research Paper
9 pages
Use of Information Retrieval Systems in Scientific Research
No ratings yet
Use of Information Retrieval Systems in Scientific Research
4 pages
Crawling Hidden Objects With KNN Queries
No ratings yet
Crawling Hidden Objects With KNN Queries
6 pages
Using Text Mining To Locate and Classify Research Papers: Mathematical Methods and Systems in Science and Engineering
No ratings yet
Using Text Mining To Locate and Classify Research Papers: Mathematical Methods and Systems in Science and Engineering
7 pages
Modeling Documents As Mixtures of Persons For Expert Finding
No ratings yet
Modeling Documents As Mixtures of Persons For Expert Finding
12 pages
International Journal On Natural Language Computing (IJNLC)
No ratings yet
International Journal On Natural Language Computing (IJNLC)
15 pages
Information Retrieval: Adt-V Unit
No ratings yet
Information Retrieval: Adt-V Unit
106 pages
Artificial Intelligence Assignment
No ratings yet
Artificial Intelligence Assignment
3 pages
Article Writing
No ratings yet
Article Writing
8 pages
The Me 3 Post
No ratings yet
The Me 3 Post
19 pages
Information Retrieval (IR) Is The Science of
No ratings yet
Information Retrieval (IR) Is The Science of
10 pages
Semantic Information Retrieval Based On Domain Ontology
No ratings yet
Semantic Information Retrieval Based On Domain Ontology
3 pages
Information Retrieval Is A Complex Process Because There Is No Infallible Way To Provide A Direct Connection Between A User
No ratings yet
Information Retrieval Is A Complex Process Because There Is No Infallible Way To Provide A Direct Connection Between A User
4 pages
QUESEM: Towards Building A Meta Search Service Utilizing Query Semantics
No ratings yet
QUESEM: Towards Building A Meta Search Service Utilizing Query Semantics
10 pages
Bees Swarm Optimization Based Approach For Web Information Retrieval
No ratings yet
Bees Swarm Optimization Based Approach For Web Information Retrieval
8 pages
A Considerable Speck ICSE Class 10 English Answers, Notes
No ratings yet
A Considerable Speck ICSE Class 10 English Answers, Notes
3 pages
1210 Mbio Master Thesis
No ratings yet
1210 Mbio Master Thesis
98 pages
Lesson 3 - Question Forms
No ratings yet
Lesson 3 - Question Forms
32 pages
Enunciado de La Pregunta: Pregunta Finalizado Puntúa 1,0 Sobre 1,0
No ratings yet
Enunciado de La Pregunta: Pregunta Finalizado Puntúa 1,0 Sobre 1,0
11 pages
Pybibx - A Python Library For Bibliometric and Scientometric Analysis Powered With Artificial Intelligence Tools
No ratings yet
Pybibx - A Python Library For Bibliometric and Scientometric Analysis Powered With Artificial Intelligence Tools
30 pages
Lesson Plan: Class-V Subject: English Language and Spelling and Dictation
No ratings yet
Lesson Plan: Class-V Subject: English Language and Spelling and Dictation
6 pages
Formulating Research Question
No ratings yet
Formulating Research Question
8 pages
Elt 124 Castaneda Ubbanan
No ratings yet
Elt 124 Castaneda Ubbanan
12 pages
Kickstart Arrays Lesson
100% (1)
Kickstart Arrays Lesson
3 pages
Irt Ans
No ratings yet
Irt Ans
9 pages
Expert Finding by Means of Plausible Inferences
No ratings yet
Expert Finding by Means of Plausible Inferences
7 pages
ISE Information Retrieval Mod-V
No ratings yet
ISE Information Retrieval Mod-V
48 pages
Patriachal History Notes Students-1
No ratings yet
Patriachal History Notes Students-1
71 pages
Speaking-Sample 2
No ratings yet
Speaking-Sample 2
3 pages
NLP Review 3 Formatted 2
No ratings yet
NLP Review 3 Formatted 2
27 pages
An - Ontological - Framework - For - Information - Extraction - From - Diverse - Scientific - Sources-Gohar-Zaman SB
No ratings yet
An - Ontological - Framework - For - Information - Extraction - From - Diverse - Scientific - Sources-Gohar-Zaman SB
14 pages
Poem Extract
No ratings yet
Poem Extract
4 pages
Schenql: In-Depth Analysis of A Query Language For Bibliographic Metadata
No ratings yet
Schenql: In-Depth Analysis of A Query Language For Bibliographic Metadata
20 pages
User Recommendation System On Text Based Images
No ratings yet
User Recommendation System On Text Based Images
5 pages
Application of A Configurable Keywords-Based Query Language To The Healthcare Domain
No ratings yet
Application of A Configurable Keywords-Based Query Language To The Healthcare Domain
6 pages
Common Core Diagnostic Test
No ratings yet
Common Core Diagnostic Test
3 pages
Extending Co-Citation Using Sections of Research Articles
No ratings yet
Extending Co-Citation Using Sections of Research Articles
12 pages
Information Retrieval 1
No ratings yet
Information Retrieval 1
10 pages
Quiz - Flottation - Corrigé
No ratings yet
Quiz - Flottation - Corrigé
2 pages
Pe Ii6
No ratings yet
Pe Ii6
166 pages
ISR Lab Manual
No ratings yet
ISR Lab Manual
110 pages
10 2023
No ratings yet
10 2023
53 pages
(Importante) Frma 06 694307
No ratings yet
(Importante) Frma 06 694307
6 pages
Text Extraction Research Paper
No ratings yet
Text Extraction Research Paper
6 pages
IRS Unit 4
No ratings yet
IRS Unit 4
63 pages
Unit-4 1
No ratings yet
Unit-4 1
7 pages
ART Py-Pde A Python Package For Solving Partial Differential Equtions
No ratings yet
ART Py-Pde A Python Package For Solving Partial Differential Equtions
4 pages
Jurnal TI 3
No ratings yet
Jurnal TI 3
10 pages
Unit - 3 Irs
No ratings yet
Unit - 3 Irs
57 pages
IRS Unit 4 by Krishna
No ratings yet
IRS Unit 4 by Krishna
23 pages
PSO4
No ratings yet
PSO4
8 pages
1210 Mbio Master Thesis
No ratings yet
1210 Mbio Master Thesis
98 pages
Chapter 06 File Management
No ratings yet
Chapter 06 File Management
35 pages
Statystical Analysis
No ratings yet
Statystical Analysis
9 pages
Unit Iii
No ratings yet
Unit Iii
100 pages
Artificial Intelligence in Information Retrieval
No ratings yet
Artificial Intelligence in Information Retrieval
5 pages
1 s2.0 S0950705125003272 Main
No ratings yet
1 s2.0 S0950705125003272 Main
12 pages
Annex C 3 COT Rating Sheet For Proficient Teacher For SY 2024 2025
No ratings yet
Annex C 3 COT Rating Sheet For Proficient Teacher For SY 2024 2025
1 page
AKMiner Domain-Specific Knowledge Graph Mining
No ratings yet
AKMiner Domain-Specific Knowledge Graph Mining
15 pages
ABoolean Modelin Information Retrievalfor Search Engines PDF
No ratings yet
ABoolean Modelin Information Retrievalfor Search Engines PDF
6 pages
English FAL NSC P2 May June 2021
No ratings yet
English FAL NSC P2 May June 2021
27 pages
Information Retrieval Methodology For Aiding Scienti C Database Search - 2018
No ratings yet
Information Retrieval Methodology For Aiding Scienti C Database Search - 2018
10 pages
LIBS 894 Assignment Three Classic Models
No ratings yet
LIBS 894 Assignment Three Classic Models
8 pages
Holmes Cunningham DM93
No ratings yet
Holmes Cunningham DM93
10 pages

Information Retrieval From Scientific Abstract and Citation Database Query by Documents Approach Based On Monte Carlo Sampling

Uploaded by

Information Retrieval From Scientific Abstract and Citation Database Query by Documents Approach Based On Monte Carlo Sampling

Uploaded by

Expert Systems With Applications 199 (2022) 116967

Contents lists available at ScienceDirect

Expert Systems With Applications

Information retrieval from scientific abstract and citation databases:

ARTICLE INFO ABSTRACT

generally applicable minimum number of seed documents that leads to

4.2. Keyword extraction

From the seed corpus, a list of relevant domain keywords must be

4.3. Query sampling

5. Case studies Table 2

including seven (87.5%) of the seed papers. Moreover, these papers

obtained. The triangular shape is a characteristic result from the MC

You might also like