
Information Processing and Management 57 (2020) 102342


A Pseudo-relevance feedback framework combining relevance matching and semantic matching for information retrieval

Junmei Wang (a,d,e,1), Min Pan (b,d,e,1), Tingting He (c,d,e,⁎), Xiang Huang (c,d,e), Xueyan Wang (c,d,e), Xinhui Tu (c,d,e)
a School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China
b School of Computer and Information Engineering, Hubei Normal University, Huangshi 435002, China
c School of Computer, Central China Normal University, Wuhan 430079, China
d Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Wuhan 430079, China
e National Language Resources Monitor & Research Center for Network Media, Wuhan 430079, China

ARTICLE INFO

Keywords: Information retrieval; Pseudo-relevance feedback; Text similarity; Semantic matching

ABSTRACT

Pseudo-relevance feedback (PRF) is a well-known method for addressing the mismatch between query intention and query representation. Most current PRF methods consider relevance matching only from the perspective of terms used to sort feedback documents, thus possibly leading to a semantic gap between query representation and document representation. In this work, a PRF framework that combines relevance matching and semantic matching is proposed to improve the quality of feedback documents. Specifically, in the first round of retrieval, we propose a reranking mechanism in which the exact term information and the semantic similarity between the query and document representations are calculated by bidirectional encoder representations from transformers (BERT); this mechanism reduces the semantic text gap and improves the quality of feedback documents. Then, our proposed PRF framework processes the results of the first round of retrieval by using probability-based PRF methods and language-model-based PRF methods. Finally, we conduct extensive experiments on four Text Retrieval Conference (TREC) datasets. The results show that the proposed models outperform the robust baseline models in terms of mean average precision (MAP) and precision at position 10 (P@10), and they also highlight that combining relevance matching and semantic matching is more effective than using relevance matching or semantic matching alone in terms of improving the quality of feedback documents.

1. Introduction

The rapid rise of the search engine industry has greatly stimulated research interest in information retrieval (IR). In the past
several decades, various classical retrieval models have been proposed, including probability models, statistical language models, and
vector space models. These models have been successfully applied in retrieval systems to address many issues in IR (Nasir et al., 2019;
Yin et al., 2011).
Is ad hoc retrieval relevance matching or semantic matching? Semantic matching involves identifying semantic information and


⁎ Corresponding author. E-mail address: [email protected] (T. He).
1 Junmei Wang and Min Pan contributed equally to this work and should be regarded as co-first authors.

https://doi.org/10.1016/j.ipm.2020.102342
Received 23 January 2020; Received in revised form 14 June 2020; Accepted 14 June 2020
Available online 27 June 2020
© 2020 Elsevier Ltd. All rights reserved.

inferring the semantic relationships between two paragraphs of text. In contrast, relevance matching involves identifying whether a
document is relevant to a given query. Guo et al. (2016) believed that the matching problems in many natural language processing
(NLP) tasks are fundamentally different from ad hoc retrieval tasks. Most NLP tasks involve semantic matching, and ad hoc retrieval
tasks mainly involve relevance matching. This paper argues that combining relevance matching and semantic matching is the key to
bridging the semantic text gap between query and document representations. In recent years, an important problem in IR has been analyzing
semantic information with context and matching the semantics of the query with web page data. For example, suppose that a user
query is "which hospital provides a high level of skin disease treatment" and a relevant website title is "The effect of Peking union
medical college hospital on skin disease". For the query, the semantic core is "see a doctor for skin disease", and the core of the web
information is "treat skin disease". In general, the semantic backgrounds of "see" and "treat" are different; thus, if the relatedness between "see" and "treat" is calculated directly, then the matching degree between the user's query and the page will be relatively poor. However, when combined with the context of "hospital" and "skin disease", the semantic relatedness between "seeing" and "treating" is very strong. The difficulty arises mainly because the input query is based on the knowledge accumulated by the user, whereas a computer has no comparable knowledge base. Therefore, it is difficult for a computer to accurately understand the actual query intention of the user, which leads to a semantic deviation between the documents and the query. Recently, the bidirectional encoder representations from transformers (BERT) model (Devlin et al., 2019), proposed by Google, has performed well in 11 NLP tasks, thereby becoming one of the
most popular deep learning models. The model uses a transformer framework, which is more effective than a recurrent neural
network (RNN) (Chen et al., 2018; T. Shen et al., 2018), and can capture information over long distances. Therefore, BERT can
capture contextual information better than previous pretraining models. To research the roles of semantic matching and relevance
matching in IR, this paper takes BERT as the representative model in our proposed framework.
Another problem is that, during the actual retrieval, some terms will not be expressed explicitly by users and are rather hidden in
the query terms (Nasir et al., 2019). The search engine will miss this information, thereby resulting in a deviation between the user's
query intention and the actual query representation. Pseudo-relevance feedback (PRF) (Basile et al., 2011; Raman et al., 2010;
Wang et al., 2008), as an important branch of IR, can effectively improve retrieval performance through query expansion. In PRF, it is
assumed that the top-ranked documents in the first retrieval result are relevant to a given query (Zhou et al., 2013). Then, the top-
ranked documents are used as feedback documents, from which possible relevant terms are selected and added to the original query
to refine the expression of the original query. PRF is currently one of the most effective ways to bridge the gap between user query
intent and actual query representation (Pan et al., 2019). In this work, a PRF framework that combines relevance matching and
semantic matching is proposed to address the above problems.

1.1. Research objective

The main aim of this work is to provide a PRF framework that combines relevance matching and semantic matching, and the
objectives are as follows:

• To determine the importance of relevance matching and semantic matching for IR.
• To reduce the semantic gap between query intention and query representation and between query representation and document
representation in IR.
• To improve retrieval performance by increasing the precision of the top 10 documents retrieved and the mean average precision
(MAP) of the top 1000 results.

The major contributions of this paper are as follows:

• The reranking mechanism avoids scoring all documents with BERT and reduces the computational time because we first use
BM25 (best matching 25) to select the top N documents and then use BERT to calculate the semantic matching scores between
queries and sentences in the N documents. Thus, only the N documents are reranked.
• Experiments are performed to verify that the proposed reranking method, which combines relevance matching and semantic
matching, is more effective (in terms of improving the quality of feedback documents) than using either relevance matching or
semantic matching alone.
• A PRF framework combining relevance matching and semantic matching is proposed, and five enhanced models (denoted by
BRoc, BPRoc2, BKRoc, BRM3 and BKRM3) are generated by merging the framework with probability-based PRF models
(Rocchio+BM25 (Rocchio, 1971), PRoc2 (Miao et al., 2012), and KRoc (Pan et al., 2020)) and language-model-based PRF models
(RM3 (Lv & Zhai, 2009) and KRM3 (Pan et al., 2020)). A series of experiments are used to verify that the proposed framework is
universal. The results of experiments with the five models and different values of N indicate that the framework is robust.
• A series of experiments involving standard Text Retrieval Conference (TREC) datasets was performed to evaluate the proposed
models from different perspectives. The results show that the proposed models can achieve better performance than the baseline
models in terms of mean average precision (MAP) and precision P at position 10 (P@10). Our PRF framework may reduce the
semantic text gap between query intention and query representation and between query representation and document representation during IR.
• Our proposed PRF framework can improve the quality of feedback documents.

The remainder of this paper is organized as follows: In Section 2, related studies are reviewed. In Section 3, we introduce the


proposed PRF framework combining relevance matching and semantic matching and the five improved models. In Section 4, we describe the experimental setup and the four TREC datasets. In Section 5, we compare and analyze the experimental results of the proposed models and different baselines to test the performance of the proposed framework and the five models. Finally, in Section 6, we summarize the paper, provide a brief conclusion, and discuss future research directions.

2. Related work

PRF is a common and effective technique that can improve retrieval performance (Lv & Zhai, 2009). This method extracts the
expansion terms from the feedback documents and uses these terms to refine the representation of the original query. Subsequently, a
second round of retrieval is performed (J. X. Huang et al., 2013). In 1971, Rocchio (Rocchio, 1971) proposed the first well-known
relevance feedback technique, which improved query representation by adding new terms arranged in descending order according to
their term frequency weights. Many relevance feedback models based on the technique developed by Rocchio have been proposed, and
they have achieved good performance (Ksentini et al., 2016). For example, He et al. (He et al., 2011) proposed four novel methods for
improving the classical BM25 model by utilizing term proximity evidence. In 2012, Miao et al. proposed a novel model called PRoc
(proximity-based Rocchio model), which used the Rocchio model to capture the proximity relationships between candidate terms and
the corresponding queries in feedback documents. Three proximity measures, namely, the window-based method, the kernel-based
method and the Hyperspace Analog to Language (HAL) method, were then proposed for evaluating the relationship between expansion terms and query terms. The three variants of PRoc are called PRoc1, PRoc2, and PRoc3. In addition, many other relevant
methods have also achieved remarkable results in improving retrieval performance (Colace et al., 2015; Daoud & Huang, 2013;
Metzler & Croft, 2005; Ye & Huang, 2014).
The rapid development of the language model (LM) has provided favorable conditions for further research on PRF models (Ponte
& Croft, 1998). However, a core problem in LM estimation is smoothing. Zhai et al. (Zhai & Lafferty, 2001a) compared three popular smoothing methods (i.e., the Jelinek-Mercer method, the Dirichlet prior method, and absolute discounting) and noted that the Dirichlet prior method generally performs well. Subsequently, many LM-based retrieval methods have been successively proposed (e.g.,
(Lavrenko & Croft, 2001); (Lv & Zhai, 2009); (Hazimeh & Zhai, 2015); (Wu, 2015)), and for such methods, feedback documents are
often used to reevaluate the query LM. In 2001, Zhai et al. (Zhai & Lafferty, 2001b) proposed an LM-based feedback model in which
the authors evaluated two different methods for updating queries based on feedback documents. RM1 and RM2 are relevance-based
LMs (Lavrenko & Croft, 2001) that calculate the probabilities of terms in the relevant class. RM3 and RM4 are extensions of RM1 and RM2, respectively, obtained by interpolating RM1 and RM2 with the original query model; they are thus interpolated versions of the relevance model. Furthermore, a series of query likelihood models are used
as retrieval models with Dirichlet prior smoothing. In 2001, Zhai and Lafferty proposed a simple mixture model (SMM) and a
divergence minimization model (DMM), which are two different approaches for updating a query LM. SMM and DMM differ in terms
of their method for estimating the query model based on feedback documents. The SMM method assumes that feedback documents
are generated by a mixture model in which one component is the query topic model and the other is the collection LM. Given the
observed feedback documents, the maximum likelihood criterion is used to estimate a query topic model. The DMM method uses a
completely different estimation criterion; this method chooses the query model that has the smallest average Kullback-Leibler (KL)
divergence from the smoothed empirical term distribution of the feedback documents. In 2006, Tao and Zhai proposed a query-
regularized mixture model (RMM) for pseudo feedback (Tao & Zhai, 2006). The authors integrated the original query with feedback
documents in a single probabilistic mixture model and regularized the estimation of the LM parameters in the model so that the
information in the feedback documents can be gradually added to the original query. A major advantage of this model is that it has no
parameter to tune. In 2014, Lv and Zhai revealed that DMM inappropriately handles the entropy of the feedback model, thereby
resulting in a highly skewed feedback model. To address this problem, the authors proposed a maximum-entropy divergence
minimization model (MEDMM) by introducing an entropy term to regularize the DMM (Lv & Zhai, 2014). In 2009, Lv and Zhai (Lv &
Zhai, 2009) compared the following five methods for estimating query LMs by using PRF in ad hoc IR: RM3, RM4, SMM (Zhai &
Lafferty, 2001b), RMM (Tao & Zhai, 2006) and DMM (Zhai & Lafferty, 2001b). The authors found that RM3 is more robust and
comparable to any LM method in many tasks; thus, RM3 remains a strong baseline for comparison. In 2020, Pan et al. (Pan et al.,
2020) integrated cooccurrence information into the Rocchio model and RM3 model and proposed the KRoc model and KRM3 model,
respectively, thereby improving retrieval performance.
Some other PRF methods provide solutions for research on IR. In 2016, Zamani et al. (Zamani et al., 2016) proposed a PRF
method based on matrix factorization (RFMF) that tries to expand the query by using not only the terms that discriminate the
feedback documents from a collection but also terms that are relevant to the original query terms. This study was the first to
formulate PRF as a matrix decomposition problem and compute a latent factor representation of documents/queries and terms by
using nonnegative matrix factorization. In contrast, in 2019, Valcarce et al. proposed a linear method (LiMe) for the PRF task (Valcarce et al., 2019). The LiMe framework computes similarities within the query and the pseudo-relevant set. Then, the
similarity information of these relationships between documents or terms is used to expand the original query. However, most current
PRF methods consider relevance matching only from the perspective of terms used to sort feedback documents and do not consider
the semantic information between queries and documents to be an important index for calculating relevance. These methods may therefore yield feedback documents of insufficient quality. In current research, the semantics of terms are often considered to be closely related to the
context of the terms (Pang et al., 2017). The semantics of the query express the true intentions of the users. On this basis, we suggest
that, when the semantic similarity of the query and the document is considered, the results of the first round of retrieval may be
improved. Additionally, the quality of the feedback documents may be improved.


In recent years, deep learning methods have been applied to different scenarios in speech recognition, computer vision, NLP, etc.
(Marchesin et al., 2019) because these methods can automatically learn effective data representations (features). When applied to ad
hoc retrieval, the task is usually formalized as a semantic matching problem between two texts (Guo et al., 2019). Some classical
neural IR models related to this task include the deep structured semantic model (DSSM) (P.-S. Huang et al., 2013), convolutional
DSSM (CDSSM) (Y. Shen et al., 2014) and deep relevance matching model (DRMM) (Guo et al., 2016). The DSSM uses a fully
connected feedforward network to construct the presentations of the query and document and then generates a ranking score by
calculating the cosine similarity of the two vectors. The CDSSM, an extension of the DSSM, uses a convolutional neural network
(CNN) to better preserve the local word order information when capturing the contextual information of the query and the document.
Then, max-pooling strategies are adopted to filter the salient semantic concepts to form a sentence-level representation. However, the
DSSM and CDSSM consider only semantic matching between queries and documents (Guo et al., 2016). Other important matching
information, such as a precise matching signal, the importance of query terms and different matching requirements, is ignored. In
2016, Guo et al. noted that ad hoc retrieval tasks involve mainly relevance matching, and the authors proposed a DRMM for ad hoc
retrieval. Specifically, the DRMM model employs a joint deep architecture at the query term level for relevance matching. DeepRank
is inspired by the steps of human relevance judgment (Pang et al., 2017). Specifically, DeepRank splits the document into term-
centric contexts with respect to each query term. Then, the interaction function is defined for term-level computation, term-level
aggregation, and global aggregation. However, few studies combine relevance matching and semantic matching for IR.
In 2018, the BERT model achieved the best results among those of other previous models in 11 NLP tasks (Devlin et al., 2019). The
architecture of the BERT model is a multilayer bidirectional transformer encoder composed of a stack of identical layers (12 in BERT-base). Each layer has two sublayers. The first sublayer is a multihead self-attention mechanism, and the second sublayer is a simple,
position-wise fully connected feedforward network. Instead of pretraining the LM from left to right, BERT enables the representation
to fuse the bidirectional context. BERT is divided into two main stages (Devlin et al., 2019). The first stage involves training a
common "language understanding" model on a large text corpus (e.g., Wikipedia). In the second stage, the model is fine-tuned for
specific tasks (e.g., machine translation, text embedding, named-entity recognition, automatic abstracts, questions and answers). For
the task in IR, multigenre natural language inference (MNLI) (Williams et al., 2018) was used as the fine-tuning corpus. MNLI is a
large-scale, crowdsourced corpus for textual entailment classification. Given a pair of sentences, the task is to predict whether the second sentence is entailed by, contradicts, or is neutral with respect to the first sentence. Given the successful implementation of the BERT model
in retrieval tasks (Yang, Zhang, & Lin, 2019), this paper proposes a PRF framework that combines relevance matching and semantic
matching methods. In the framework, BERT is used in the first round of PRF to reorder the first N documents after BM25 scoring and
to improve the quality of the feedback documents. Then, we apply the framework in combination with probability-based PRF models
(i.e., Rocchio+BM25 (Rocchio, 1971), PRoc2 (Miao et al., 2012), KRoc (Pan et al., 2020)) and LM-based PRF models (i.e., RM3 (Lv &
Zhai, 2009) and KRM3 (Pan et al., 2020)) to measure the quality of the feedback documents. The main difference between the
proposed PRF framework and other methods is that the proposed method combines relevance matching and semantic matching to
improve the quality of the feedback documents by evaluating two aspects of relevance (semantic relevance and term-level relevance)
from the documents and the initial query instead of using exact matching techniques.

3. Our method

We introduce the proposed PRF framework, which combines relevance matching and semantic matching, in Section 3.1. Then, we apply the framework in combination with the probability-based PRF models in Section 3.2 and the LM-based PRF models in Section 3.3.

3.1. Our PRF framework combining relevance matching and semantic matching

Our approach is motivated by the success of Yang et al. (2019) in applying the BERT model to retrieval tasks. This paper proposes
a simple extended application based on the above guiding approach in which the BERT model is applied for PRF. Specifically, we
initially use the relevance matching method to obtain the exact relevance weight of the query and document at the lexical level. The
score calculated by the traditional BM25 method is used as the score of the initial ranking of documents, and the top N documents are
chosen for reranking. Then, we use the semantic matching method to acquire the semantic relevance between the query and the
document. Due to the limitation on the input length of the BERT model, the model cannot be applied directly to entire documents in retrieval tasks.
Therefore, this paper considers the segmentation processing of the top N documents obtained from the previous step, splices each
sentence with a query, and enters the pair into the BERT model to obtain classification results (semantic relevance or semantic irrelevance). The entailment score is chosen as the semantic similarity score of the two sentences. Based on local
relevance, if part of a document is related to a query, then the document is considered to be related to the query. Based on this
assumption, a semantic linear combination of the top M sentences is selected as the semantic score of the document, and then this
score is combined with the document score obtained by the relevance matching method (by using the linear combination method) to
reorder the first N documents. The calculation method is shown in Equation (1).
$$S_d = \alpha \times S_d^e + (1 - \alpha) \times \sum_{i=1}^{M} (w_i \times S_i^d) \qquad (1)$$

$$S_d^e = \sum_{t \in Q} \log\left(\frac{N' - df_t + 0.5}{df_t + 0.5}\right) \times \frac{(k_1 + 1) \times tf(t, d)}{tf(t, d) + K} \times \frac{(k_3 + 1) \times tf(t, Q)}{k_3 + tf(t, Q)} \qquad (2)$$


[Fig. 1. The relevance matching and semantic matching process based on BERT reranking for IR. The pipeline proceeds from query understanding and retrieval against the index (relevance matching and ranking) and from document understanding and indexing of the TREC data, to BERT-based semantic matching (fine-tuned on MNLI) that reranks the initial results.]

The relevance matching score of document d calculated by the traditional BM25 method is denoted by $S_d^e$, and the calculation method is shown in Equation (2), where $N'$ represents the total number of documents in the index; $df_t$ is the number of documents in which the term t appears; and K is equal to $k_1 \times (1 - b + b \times \frac{dl}{avgdl})$, where $k_1$ and b are the adjustment factors that balance the effect of document length, $k_3$ is a parameter that adjusts the term frequency in the query, tf(t, d) represents the number of occurrences of term t in document d, and tf(t, Q) represents the number of occurrences of term t in query Q. The sentences from document d are ranked by the semantic similarity score between the query and each sentence. The score of the i-th ranked sentence is denoted by $S_i^d$, and $w_i$ is the weight of the i-th sentence from document d. The weighted sum of the scores of the top M sentences is taken as the semantic score of document d. α is the moderating factor, and the score of document d in the first round of retrieval is denoted by $S_d$.
The relevance matching and semantic matching process based on BERT reranking for IR is shown in Fig. 1.
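To make the reranking computation concrete, the following Python sketch (our illustration, not the authors' released code) scores a document with BM25 as in Equation (2), scores each of its sentences against the query with a BERT classifier fine-tuned on MNLI, and fuses the two signals according to Equation (1). The checkpoint name, the entailment label index, and the sentence weights are assumptions, not values from the paper.

```python
# Illustrative sketch of the first-round reranking in Equations (1)-(2).
import math
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MNLI_CHECKPOINT = "bert-base-uncased-mnli"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MNLI_CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(MNLI_CHECKPOINT)
model.eval()

def bm25_score(query_terms, doc_tf, query_tf, df, n_docs, dl, avgdl,
               k1=1.2, b=0.75, k3=8.0):
    """Relevance matching score S_d^e of Equation (2)."""
    K = k1 * (1 - b + b * dl / avgdl)
    score = 0.0
    for t in query_terms:
        if doc_tf.get(t, 0) == 0:
            continue
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5))
        dtf = (k1 + 1) * doc_tf[t] / (doc_tf[t] + K)
        qtf = (k3 + 1) * query_tf[t] / (k3 + query_tf[t])
        score += idf * dtf * qtf
    return score

@torch.no_grad()
def semantic_score(query, sentence, entail_idx=0):
    """Entailment probability of the (query, sentence) pair under BERT/MNLI;
    the index of the entailment label depends on the checkpoint."""
    inputs = tokenizer(query, sentence, return_tensors="pt",
                       truncation=True, max_length=512)
    probs = torch.softmax(model(**inputs).logits, dim=-1)
    return probs[0, entail_idx].item()

def fused_score(query, sentences, s_bm25, weights, alpha=0.9, top_m=3):
    """Final first-round score S_d of Equation (1): BM25 score fused with
    the weighted sum of the top M sentence-level semantic scores."""
    sent_scores = sorted((semantic_score(query, s) for s in sentences),
                         reverse=True)[:top_m]
    semantic = sum(w * s for w, s in zip(weights, sent_scores))
    return alpha * s_bm25 + (1 - alpha) * semantic
```

In a full run, BM25 first ranks the collection, only the top N documents are passed through `fused_score`, and the reranked list feeds the PRF stage; sentence weights of the form w = [1.0, 0.8, 0.9, ...] are explored in Section 5.6.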

[Fig. 2. A PRF framework combining relevance matching and semantic matching for IR. The flow is: the query and documents are tokenized and stemmed; BM25 retrieval returns the top N documents; BERT reranks these N documents at the sentence level; expansion terms are extracted from the reranked documents to form a new query; and a second round of retrieval produces the final ranked documents.]


3.2. Integration with probability-based PRF models

We extract the expansion terms from the results of first-round retrieval and generate a new query for the second round of
retrieval. The flow diagram of the PRF framework combining relevance matching and semantic matching for IR is shown in Fig. 2. To
prove the effectiveness of our framework in improving the quality of feedback documents, we use different PRF methods to extract
the expansion terms from the feedback documents we obtain and perform the second round of retrieval.
There are two kinds of PRF models: probability-based PRF models and statistical-language-based PRF models. Probability-based
PRF models mainly include Rocchio+BM25, PRoc2, and KRoc; KRoc is the most recently proposed model (2020) and yields excellent results on nine TREC datasets, including GOV2.
The methods used to calculate the query expansion terms vary based on how the framework is combined with different PRF
models. Rocchio+BM25 (Rocchio, 1971) uses term frequency-inverse document frequency (TF-IDF) to calculate the term weights in the feedback documents. The weight of term t is $w_t$, as shown in Equation (3), where tf(t, d_i) represents the number of occurrences of term t in document d_i, and N represents the number of feedback documents.
$$w_t = \log\left(\frac{N' - df_t + 0.5}{df_t + 0.5}\right) \times \sum_{i=1}^{N} tf(t, d_i) \qquad (3)$$

PRoc2 (Miao et al., 2012) considers not only the importance of different query terms but also the average proximity of terms to
the query. In the kernel-based method, the weight of term t in the neighborhood of query term qi is wt, and the calculation method is
as shown in Equation (4).
$$w_t = \sum_{i=1}^{|Q|} K(t, q_i) \times \log\left(\frac{N' - df_{q_i} + 0.5}{df_{q_i} + 0.5}\right) \qquad (4)$$

In the PRoc2 model, |Q| is the number of query terms. A Gaussian kernel is used to measure the proximity K(t, qi) between the
candidate extension term t and query term qi, as shown in Equation (5). pt and pq denote the locations of candidate expansion term t
and query term q in the document, respectively, and σ is a tuning parameter used to control the scale of the Gaussian distribution.

$$K(t, q) = \exp\left(-\frac{(p_t - p_q)^2}{2\sigma^2}\right) \qquad (5)$$

The KRoc model, proposed by Pan, Huang, et al. (2019), considers both the term frequency and the proximity information of the cooccurrence of candidate expansion terms with the query terms.

$$w(t_i \otimes q_j, D) = \frac{(k_1 + 1) \times tf(t_i \otimes q_j, D)}{K + tf(t_i \otimes q_j, D)} \times \log\left(\frac{N' - n(t_i \otimes q_j) + 0.5}{n(t_i \otimes q_j) + 0.5}\right) \qquad (6)$$

Here, $tf(t_i \otimes q_j, D) = \sum_{m=1}^{tf(t_i)} \sum_{n=1}^{tf(q_j)} K(t_i, q_j)$, the sum of the kernel values over all cooccurring position pairs of $t_i$ and $q_j$, and $n(t_i \otimes q_j)$ is the number of occurrences of terms $t_i$ and $q_j$ in the current document. The
top |Tf| terms with high weights calculated by different methods are selected as the query expansion terms and constitute the new
query Q together with the original query, as shown in Equation (7). In this case, Q0 is the original query, and Q1 is the vector
combination of the query extension terms. Moreover, λ is the contribution weight constant between the original query and the
feedback terms.
$$Q = (1 - \lambda) \times Q_0 + \lambda \times Q_1 \qquad (7)$$

The three models, namely, Rocchio+BM25, PRoc2 and KRoc, were combined with our framework to form BRoc, BPRoc2, and
BKRoc, respectively.
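To illustrate how these probability-based weighting schemes differ, the sketch below (ours, following the notation of Equations (3), (4), (5) and (7)) computes Rocchio-style TF-IDF weights, kernel-weighted PRoc2-style weights, and the interpolated new query. Using the closest position pair as the proximity signal is a simplifying assumption; documents are represented as token lists.

```python
# Illustrative weighting of candidate expansion terms (Equations (3)-(5), (7)).
import math
from collections import Counter

def rocchio_weight(term, feedback_docs, df, n_docs):
    """Equation (3): IDF times the total term frequency in the feedback docs."""
    idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
    return idf * sum(doc.count(term) for doc in feedback_docs)

def gaussian_kernel(pos_t, pos_q, sigma=50.0):
    """Equation (5): Gaussian proximity of two term positions."""
    return math.exp(-((pos_t - pos_q) ** 2) / (2 * sigma ** 2))

def proc2_weight(term, query_terms, doc, df, n_docs, sigma=50.0):
    """Equation (4): kernel-weighted IDF summed over the query terms.
    The closest cooccurring position pair is used as a simplification."""
    positions = {t: [i for i, w in enumerate(doc) if w == t]
                 for t in set(query_terms) | {term}}
    weight = 0.0
    for q in query_terms:
        if not positions[term] or not positions[q]:
            continue
        k = max(gaussian_kernel(pt, pq, sigma)
                for pt in positions[term] for pq in positions[q])
        weight += k * math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5))
    return weight

def expand_query(q0, candidate_weights, tf_size=30, lam=0.5):
    """Equation (7): interpolate the top |Tf| expansion terms with the query."""
    top = dict(Counter(candidate_weights).most_common(tf_size))
    norm = sum(top.values()) or 1.0
    new_q = {t: (1 - lam) * w for t, w in q0.items()}
    for t, w in top.items():
        new_q[t] = new_q.get(t, 0.0) + lam * w / norm
    return new_q
```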

3.3. Integration with LM-based PRF models

In addition to combining the framework with the probability-based PRF model, the framework must also be combined with the
LM-based PRF model to comprehensively measure the performance of the framework. The LM approach essentially estimates a model for each document; the documents are then ranked according to the likelihood that each document's LM generates the query. The LM-based PRF model uses the feedback documents from the first round of retrieval to reestimate the LM of the query. Common LM-based PRF models
include RM3 and KRM3. For most standard TREC datasets, RM3 and KRM3 have achieved impressive retrieval performance based on
the retrieval accuracy and recall rate. KRM3 was first proposed in 2020. This section incorporates a robust baseline model (RM3)
and a state-of-the-art model (KRM3) into the framework to form new retrieval LMs, i.e., BRM3 and BKRM3, respectively.
The RM3 model (Lv & Zhai, 2009) selects the expansion terms from the feedback document of the first round of retrieval. The
weight calculation method of the expansion terms is shown in Equation (8):

$$w_t = \log\left(\frac{dl}{dl + \mu} \times p_{ml}(t|d_i) + \frac{\mu}{dl + \mu} \times p_{ml}(t|c)\right) \qquad (8)$$


where $p_{ml}$ denotes the maximum likelihood estimate, c represents the entire document collection, and dl represents the document length. Furthermore, a
series of query likelihood models (including RM3 and KRM3) are used as retrieval models with Dirichlet prior smoothing, and μ is the
smoothing factor. The RM3 model reevaluates the query LM by using the feedback document retrieved in the first step and then
performs a second round of retrieval. The new query LM is shown in Equation (9):

$$Q = (1 - \lambda) \times Q_0 + \lambda \times F \qquad (9)$$

where λ is as defined in Equation (7) and is used to adjust the contribution weight between the original query and the expansion
terms.
The KRM3 model (Pan et al., 2020) adds the term frequency of cooccurring words and the proximity information of the query
based on the RM3 model. The KRM3 model generates new queries, as given in Equation (10):

$$Q = (1 - \lambda) \times Q_0 + \lambda \times ((1 - \beta) \times F + \beta \times F^*) \qquad (10)$$

where F* is a feedback LM based on the cooccurrence distribution of query terms and expansion terms, and β balances the contributions of the two feedback models.
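A minimal sketch of the RM3-style update of Equations (8) and (9) might look as follows; collection statistics are assumed to be precomputed, documents are token lists, and query models are term-to-probability dictionaries.

```python
# Illustrative RM3-style term weighting and query-model interpolation
# (Equations (8) and (9)); p_ml denotes a maximum likelihood estimate.
import math

def rm3_term_weight(term, doc, collection_tf, collection_len, mu=1500):
    """Equation (8): log of the Dirichlet-smoothed probability of `term`
    in feedback document `doc`; assumes `term` occurs in the collection."""
    dl = len(doc)
    p_doc = doc.count(term) / dl                   # p_ml(t|d_i)
    p_coll = collection_tf[term] / collection_len  # p_ml(t|c)
    return math.log(dl / (dl + mu) * p_doc + mu / (dl + mu) * p_coll)

def update_query_model(q0, feedback_model, lam=0.5):
    """Equation (9): interpolate the original query LM with the feedback LM."""
    terms = set(q0) | set(feedback_model)
    return {t: (1 - lam) * q0.get(t, 0.0) + lam * feedback_model.get(t, 0.0)
            for t in terms}
```

The KRM3 update of Equation (10) follows the same pattern, with the feedback model replaced by the interpolation (1 - β) × F + β × F*.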

4. Experimental data and parameter settings

We first introduce the experimental data in Section 4.1; then, we describe the parameter settings in Section 4.2.

4.1. Experimental data

In this section, to test the effectiveness of the five models proposed in this paper, we conduct a series of experiments on four
standard TREC datasets: AP90, AP88-89, DISK4&5, and WT10G. Because these text datasets are different in terms of size and type, the
proposed models can be effectively evaluated. The AP90 dataset contains articles published by the Associated Press in 1990. AP88-89
contains articles published by the Associated Press in 1988 and 1989. The DISK4&5 collection contains newswire articles from
various sources, such as the Associated Press, the Wall Street Journal, and the Financial Times, and is generally considered to contain
high-quality text data with little noise. The DISK4&5 collection is a set of documents on TREC Disks 4 and 5 minus the Congressional
Record documents. The WT10G collection is a medium-sized web document crawler for the TREC9 and TREC10 web tracks and
contains 10 gigabytes of uncompressed data. The topic numbers associated with each collection are shown in Table 1. Each topic in
these collections contains three fields (i.e., title, description and narrative). For each query, we use only the title field, which contains just a few keywords related to the topic.
During retrieval, queries without relevance judgments are removed. In the experiment, Lucene is used as the main retrieval system. For all datasets used, each term is stemmed with Porter's English stemmer (Porter, 1980). Stop words are removed according to the standard 418-word InQuery stop word list. MAP and normalized discounted cumulative gain (NDCG) based on the top 1000 documents are
used as the evaluation indexes for our experiment because they facilitate comparing our results to those of other methods, and these
metrics are the primary metrics used in the corresponding TREC evaluation. In addition, we use P@10 for evaluation because, when
browsing results, users tend to pay more attention to documents that rank high than to documents that rank low. All statistical
assessments are based on the Wilcoxon matched-pairs signed-rank test. To further analyze whether our method is beneficial, we
measured the robustness index (RI) (Sakai et al., 2005) of our method against the corresponding baseline model. The RI metric, which
takes values in the interval [−1, 1], is computed as the number of topics improved by using our models minus the number of topics
hurt by our models divided by the number of topics (Valcarce et al., 2019).
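For reference, the per-query measures used in our experiments can be computed as in the following sketch (our implementation of the standard definitions; the NDCG variant below uses binary gains, whereas TREC judgments may be graded).

```python
# Standard per-query evaluation measures (our sketch).
import math

def average_precision(ranked_ids, relevant_ids, k=1000):
    """AP over the top k results; MAP is the mean of AP over all queries."""
    hits, ap = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            ap += hits / rank
    return ap / max(len(relevant_ids), 1)

def precision_at(ranked_ids, relevant_ids, k=10):
    """P@k: fraction of the top k documents that are relevant."""
    return sum(d in relevant_ids for d in ranked_ids[:k]) / k

def ndcg_at(ranked_ids, relevant_ids, k=1000):
    """Binary-gain NDCG@k."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, d in enumerate(ranked_ids[:k], start=1)
              if d in relevant_ids)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal else 0.0

def robustness_index(baseline_ap, model_ap):
    """RI = (improved - hurt) / number of topics (Sakai et al., 2005)."""
    improved = sum(m > b for b, m in zip(baseline_ap, model_ap))
    hurt = sum(m < b for b, m in zip(baseline_ap, model_ap))
    return (improved - hurt) / len(baseline_ap)
```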

4.2. Parameter settings

All the PRF retrieval models used in this paper have several hyperparameters that need to be adjusted. For a fair comparison, we
adjusted these hyperparameters to the optimal values. The following parameters are used in the baseline models and the proposed
model, which is generally used in the IR domain to construct robust baselines. First, in BM25, the value of b ranged from 0 to 1.0, with
an increment of 0.05, and k1 and k3 are set to 1.2 and 8, respectively, as recommended by Robertson et al. (Robertson et al., 1996).
Second, in the Dirichlet LM, the value of μ ranged from 500 to 2000, with an increment of 50. Finally, based on experience, the
number of top-ranking documents in all PRF models is set to 10, the number of expansion terms is set to |Tf| ∈ {10, 20, 30, 50}, and the fusion parameters are set to λ, α ∈ {0.0, 0.1, 0.2, …, 1.0}. The parameter σ of the kernel function is set to σ ∈ {10, 25, 50, 80, 100, 200, 500, 1000, 1500}.

Table 1
The TREC tasks and topic numbers associated with each collection.

Collection   # of Docs   Size      Queries   # of Queries
AP90         78,321      0.24 GB   51-100    50
AP88-89      164,597     0.49 GB   51-100    50
DISK4&5      528,155     1.85 GB   301-450   150
WT10G        1,692,096   10 GB     451-550   100


In our framework, we have two hyperparameters. One is N, the number of initially ranked documents selected for reranking, with values ranging from 500 to 5000 in increments of 500, yielding 10 different values. The other is α, the fusion parameter that combines the document score calculated by the BM25 method with the document score calculated by BERT when ranking documents in the first round of retrieval.
To verify that the baselines and the method proposed in this paper are scientifically reasonable, we use 2-fold cross-validation in
the experiment. First, for each dataset, we randomly split the queries into equally sized subsets (half the queries were used for
training, and the other half were used for testing). Then, the parametric model learns from the training set and applies this knowledge
to the test set for evaluation. We choose the TREC datasets and topics to ensure fair comparisons with the baselines, and all metrics
and parameters are adjusted using the same experimental settings.
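The tuning protocol above can be sketched as follows; `evaluate` is a placeholder for running retrieval with a given parameter setting and returning the MAP on a query subset.

```python
# Illustrative 2-fold cross-validation over the hyperparameter grid.
import random

def two_fold_tune(queries, grid, evaluate, seed=42):
    """Split the queries in half, pick the best setting on each training
    half, and report the mean MAP on the held-out halves."""
    rng = random.Random(seed)
    shuffled = queries[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    folds = [(shuffled[:half], shuffled[half:]),
             (shuffled[half:], shuffled[:half])]
    test_scores = []
    for train, test in folds:
        best = max(grid, key=lambda params: evaluate(params, train))
        test_scores.append(evaluate(best, test))
    return sum(test_scores) / len(test_scores)

# Example grid matching the settings above (|Tf|, lambda, alpha):
grid = [dict(tf_size=tf, lam=l / 10, alpha=a / 10)
        for tf in (10, 20, 30, 50)
        for l in range(11)
        for a in range(11)]
```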

5. Experimental results and analysis

In this section, we report our experimental results and conduct an extensive analysis. We compare our models with the probability-based PRF models in Section 5.1 and with the LM-based PRF models in Section 5.2. Furthermore, we compare the MAP values of the four TREC collections in three steps (relevance matching, relevance matching and semantic matching, and after PRF) in Section 5.3 and compare the neural IR models in Section 5.4. Finally, we investigate the parameter sensitivity in Sections 5.5–5.7. Specifically, we investigate the sensitivity of parameter N for our five models in Section 5.5; the sensitivity of parameter α in the five models in Section 5.6; and the sensitivity of parameter σ for BKRoc, BKRM3, and BPRoc2 in Section 5.7. For the five models, we give empirical values that improve their performance. In Section 5.8, we analyze the interaction of the parameters.

5.1. Comparison of probability-based PRF models

Since our framework is applied to five different PRF models, there are five different models (i.e., BRoc, BPRoc2, BKRoc, BRM3,
and BKRM3) to analyze. To test the effectiveness of the combination of our framework and the probability-based PRF models, the
three improved models (BRoc, BPRoc2 and BKRoc) were first compared with three robust PRF models (Rocchio+BM25, PRoc2 and KRoc). Table 2 shows the comparison of MAP values on the four datasets for different numbers of expansion terms.
The results in Table 2 show that the performance of the six probability-based PRF models generally improves as the number of
expansion terms increases. When comparing the number of expansion terms, the results show that the optimal MAP value is obtained
when the number of expansion terms is 30 or 50 for most datasets. However, for the WT10G dataset, smaller |Tf| values yield better MAP results. The comparison of the three proposed models (BRoc, BPRoc2, and BKRoc) and the
corresponding robust baseline models (Rocchio+BM25, PRoc2 and KRoc) indicates that the MAP values are improved with the
proposed models. This improvement was statistically significant for all datasets. Specifically, compared with the MAP value of the
corresponding baseline model, BRoc increased the MAP by the highest percentage, reaching 20.84% and 17.03% for the AP90 and
DISK4&5 datasets, respectively. BPRoc2 yielded the largest improvement in MAP for AP90 at 9.21%. BKRoc obtained the largest
improvement for the DISK4&5 dataset at 13.53%.
In addition, the experimental results of the three proposed improved models (BRoc, BPRoc2 and BKRoc) and the three PRF models (Rocchio+BM25, PRoc2 and KRoc) are compared based on P@10; the results are shown in Table 3. The average P@10 of the three proposed models is improved compared to that of the corresponding baseline models for the four datasets. The largest improvements are observed for the DISK4&5 dataset, at 15.50%, 9.58%, and 17.98%, respectively.
To further evaluate the effectiveness of our method, we use the average NDCG of the top 1000 documents with different feedback
terms as an evaluation indicator to compare the experimental results of our models with those of the corresponding baseline models.
The results are shown in Table 4. The experimental results show that our three improved models are significant improvements over
the corresponding baseline models in terms of NDCG on the four TREC datasets. Therefore, the method proposed in this paper is
superior to traditional probability-based PRF models.
To analyze the robustness of our models, we measured their RIs for comparison against that of the corresponding baseline model.
The comparison results are shown in Table 4. All our models have positive RI values on all four datasets. Thus, our PRF approach
combining relevance matching and semantic matching tends to make queries better rather than worse. In addition, comparing the RI
values of different models shows that BKRoc achieves the highest RI value on WT10G, and BRoc achieves the highest RI value on the
other three datasets.

5.2. Comparison of LM-based PRF models

To further assess the performance of the framework, we apply our framework to the LM-based PRF model. The results of the
comparison of the proposed BRM3 and BKRM3 models with RM3 and KRM3 on the four datasets are shown in Tables 5–7.
As presented in Table 5, the MAP values of BRM3 and BKRM3 are improved compared with those of RM3 and KRM3 for the four
datasets. Compared with the corresponding models (RM3 and KRM3), BRM3 and BKRM3 both performed best, with average improvement rates of 13.59% and 11.01%, respectively, for DISK4&5.
generally improved as the number of expansion terms increased.
Furthermore, the experimental results of the two proposed LM-based PRF models (BRM3 and BKRM3) are compared with those of
two robust LM-based PRF models (RM3 and KRM3) based on P@10, as shown in Table 6. The two proposed models yield a higher


Table 2
Comparison of MAP values obtained by robust baselines (Rocchio+BM25, PRoc2, and KRoc) on four TREC collections. The MAP value changes with
the number of expansion terms (|Tf| ∈ {10, 20, 30, 50}). "Avg" in the last row of each collection denotes the average MAP performance for that
dataset. The values in parentheses represent the improvements over Rocchio+BM25, PRoc2, or KRoc. "*", "+" and "#" denote statistically significant
improvement over Rocchio+BM25, PRoc2, or KRoc, respectively. (Wilcoxon signed-rank test at p < 0.05).
Model |Tf| AP90 AP88-89 DISK4&5 WT10G

Rocchio+BM25 10 0.2858 0.2908 0.2307 0.2064
20 0.2884 0.2938 0.2328 0.2076
30 0.2899 0.2950 0.2337 0.2077
50 0.2926 0.2965 0.2349 0.2088
Avg 0.2892 0.2940 0.2330 0.2076
BRoc 10 0.3441* 0.3256* 0.2799* 0.2266*
(+ 20.40%) (+11.97%) (+21.33%) (+ 9.79%)
20 0.3494* 0.3308* 0.2661* 0.2260*
(+21.15%) (+12.59%) (+14.30%) (+ 8.86%)
30 0.3509* 0.3345* 0.2800* 0.2257*
(+ 21.04%) (+ 13.39%) (+19.81%) (+8.67%)
50 0.3533* 0.3398* 0.2648* 0.2213*
(+20.75%) (+14.60%) (+12.73%) (+ 5.99%)
Avg 0.3494* 0.3327* 0.2727* 0.2249*
(+ 20.84%) (+ 13.15%) (+17.03%) (+8.32%)
PRoc2 10 0.3031 0.3128 0.2506 0.2212
20 0.3132 0.3176 0.2572 0.2258
30 0.3204 0.3200 0.2588 0.2248
50 0.3243 0.3212 0.2618 0.2226
Avg 0.3153 0.3179 0.2571 0.2236
BPRoc2 10 0.3412+ 0.3319+ 0.2703+ 0.2384+
(+12.57%) (+ 6.11%) (+7.86%) (+7.78%)
20 0.3384+ 0.3368+ 0.2760+ 0.2359+
(+8.05%) (+ 6.05%) (+7.31%) (+4.47%)
30 0.3464+ 0.3403+ 0.2787+ 0.2357+
(+ 8.11%) (+ 6.34%) (+7.69%) (+4.85%)
50 0.3512+ 0.3410+ 0.2795+ 0.2375+
(+ 8.29%) (+6.16%) (+6.76%) (+6.69%)
Avg 0.3443+ 0.3375+ 0.2761+ 0.2369+
(+9.21%) (+ 6.17%) (+ 7.40%) (+5.94%)
KRoc 10 0.3287 0.3131 0.2563 0.2174
20 0.3367 0.3172 0.2621 0.2166
30 0.3357 0.3214 0.2657 0.2162
50 0.3383 0.3222 0.2654 0.2125
Avg 0.3349 0.3185 0.2624 0.2157
BKRoc 10 0.3538# 0.3434# 0.2938# 0.2446#
(+ 7.64%) (+ 9.68%) (+14.63%) (+12.51%)
20 0.3600# 0.3507# 0.2986# 0.2420#
(+6.92%) (+ 10.56%) (+13.93%) (+11.73%)
30 0.3585# 0.3500# 0.2997# 0.2386#
(+ 6.79%) (+ 8.90%) (+12.80%) (+10.36%)
50 0.3606# 0.3455# 0.2994# 0.2373#
(+ 5.59%) (+ 7.23%) (+12.81%) (+ 11.67%)
Avg 0.3582# 0.3474# 0.2979# 0.2406#
(+6.98%) (+9.08%) (+13.53%) (+11.57%)

average P@10 than the corresponding baseline models based on the four datasets. BRM3 and BKRM3 yield the largest improvement
at 14.50% and 14.56%, respectively, for DISK4&5.
To further evaluate the effectiveness of our method, we use the NDCG value as an evaluation indicator to compare the experimental results of our models with those of the corresponding baseline models. The results are shown in Table 7. The experimental
results show that our improved BRM3 and BKRM3 models are significant improvements over the corresponding baseline models in
terms of the NDCG values obtained on the four TREC datasets. Therefore, it is reasonable to conclude that BRM3 and BKRM3 are
superior to the corresponding LM-based PRF models.
To analyze the robustness of our models, we measured their RIs against the corresponding baseline model. The comparison results
are shown in Table 7. All our models have positive RI values on all four datasets. Thus, our PRF approach combining relevance
matching and semantic matching tends to improve rather than worsen queries. In addition, comparing the RI values of different
models shows that BRM3 achieves the highest RI value on all datasets.

5.3. Comparison before and after semantic matching

In this section, to intuitively show the effect of the proposed method, we compare the MAP and P@10 values of relevance


Table 3
Comparison of P@10 values obtained by robust baselines (Rocchio+BM25, PRoc2 and KRoc) on the four TREC collections. The P@10 value changes
with the number of expansion terms (|Tf| ∈ {10, 20, 30, 50}). "Avg" in the last row of each collection denotes the average P@10 performance for that
dataset. The values in parentheses represent the improvements over Rocchio+BM25, PRoc2, or KRoc. "*", "+" and "#" denote statistically significant
improvement over Rocchio+BM25, PRoc2, or KRoc, respectively. (Wilcoxon signed-rank test at p < 0.05).
Model |Tf| AP90 AP88-89 DISK4&5 WT10G

Rocchio+BM25 10 0.4468 0.4571 0.4247 0.3092
20 0.4426 0.4571 0.4240 0.3071
30 0.4426 0.4551 0.4233 0.3082
50 0.4447 0.4571 0.4220 0.3082
Avg 0.4442 0.4566 0.4235 0.3082
BRoc 10 0.4915* 0.4980* 0.4980* 0.3347*
(+10.00%) (+8.95%) (+17.26%) (+8.25%)
20 0.4851* 0.5020* 0.4853* 0.3357*
(+9.60%) (+9.82%) (+14.46%) (+9.31%)
30 0.5000* 0.5082* 0.4883* 0.3429*
(+12.97%) (+11.67%) (+14.65%) (+11.26%)
50 0.4979* 0.5163* 0.4880* 0.3418*
(+11.96%) (+12.95%) (+15.64%) (+10.90%)
Avg 0.4936* 0.5061* 0.4892* 0.3388*
(+11.13%) (+ 10.85%) (+15.50%) (+9.93%)
PRoc2 10 0.4553 0.4633 0.4353 0.3286
20 0.4553 0.4755 0.4393 0.3245
30 0.4638 0.4735 0.4380 0.3337
50 0.4617 0.4653 0.4407 0.3235
Avg 0.4590 0.4694 0.4383 0.3276
BPRoc2 10 0.4851+ 0.5102+ 0.4767+ 0.3388+
(+6.55%) (+10.12%) (+ 9.51%) (+3.10%)
20 0.4766+ 0.5020+ 0.4793+ 0.3541+
(+4.68%) (+5.57%) (+9.11%) (+ 9.12%)
30 0.4851+ 0.5041+ 0.4793+ 0.3531+
(+4.59%) (+6.46%) (+9.43%) (+5.81%)
50 0.4936+ 0.5020+ 0.4860+ 0.3459+
(+6.91%) (+7.89%) (+10.28%) (+6.92%)
Avg 0.4851+ 0.5046+ 0.4803+ 0.3480+
(+5.68%) (+7.49%) (+9.58%) (+6.23%)
KRoc 10 0.4660 0.4653 0.4313 0.3133
20 0.4681 0.4755 0.4387 0.3133
30 0.4681 0.4673 0.4380 0.3133
50 0.4638 0.4714 0.4393 0.3112
Avg 0.4665 0.4699 0.4368 0.3128
BKRoc 10 0.5021# 0.5041# 0.5080# 0.3459#
(+7.75%) (+8.34%) (+ 17.78%) (+10.41%)
20 0.5085# 0.5122# 0.5207# 0.3337#
(+8.63%) (+ 7.72%) (+ 18.69%) (+6.51%)
30 0.5170# 0.5184# 0.5160# 0.3418#
(+10.45%) (+10.94%) (+17.81%) (+9.10%)
50 0.5085# 0.5184# 0.5167# 0.3398#
(+9.64%) (+9.97%) (+ 17.62%) (+9.19%)
Avg 0.5090# 0.5133# 0.5154# 0.3403#
(+9.12%) (+9.24%) (+ 17.98%) (+8.80%)

Table 4
Comparison of NDCG and RI values obtained by robust baselines (Rocchio+BM25, PRoc2, and KRoc) on the four TREC collections. NDCG denotes
the average NDCG value for different numbers of expansion terms (|Tf| ∈ {10, 20, 30, 50}) and different datasets. The values in bold represent the
best results on the corresponding dataset. "*", "+" and "#" denote statistically significant improvement over Rocchio+BM25, PRoc2, or KRoc,
respectively. (Wilcoxon signed-rank test at p < 0.05).
Collection Metric Roc BRoc PRoc2 BPRoc2 KRoc BKRoc

AP90 NDCG 0.6682 0.6925* 0.6689 0.6898+ 0.6810 0.7051#
RI - 0.2800 - 0.1600 - 0.2400
AP88-89 NDCG 0.6745 0.7026* 0.6716 0.7001+ 0.6812 0.7113#
RI - 0.4800 - 0.4400 - 0.2000
DISK4&5 NDCG 0.6712 0.6964* 0.6580 0.6871+ 0.6701 0.6991#
RI - 0.3467 - 0.3200 - 0.1733
WT10G NDCG 0.5778 0.5828* 0.5838 0.5957+ 0.5787 0.6001#
RI - 0.1633 - 0.2449 - 0.4082


Table 5
Comparison of MAP values obtained by robust baselines (RM3 and KRM3) on the four TREC collections. The MAP value changes with the number of
expansion terms (|Tf| ∈ {10, 20, 30, 50}). "Avg" in the last row of each collection denotes the average MAP performance for that dataset. The values
in parentheses represent improvements over RM3 or KRM3. "*" and "#" denote statistically significant improvement over RM3 or KRM3, respec-
tively. (Wilcoxon signed-rank test at p < 0.05).
Model |Tf| AP90 AP88-89 DISK4&5 WT10G

RM3 10 0.2984 0.3074 0.2529 0.2044
20 0.3075 0.3107 0.2556 0.2133
30 0.3069 0.3166 0.2555 0.2171
50 0.3036 0.3194 0.2602 0.2139
Avg 0.3041 0.3135 0.2561 0.2122
BRM3 10 0.3231* 0.3298* 0.2844* 0.2202*
(+8.28%) (+7.29%) (+12.46%) (+7.73%)
20 0.3398* 0.3429* 0.2880* 0.2261*
(+10.50%) (+10.36%) (+12.68%) (+ 6.00%)
30 0.3394* 0.3492* 0.2935* 0.2224*
(+10.59%) (+10.30%) (+14.87%) (+2.44%)
50 0.3418* 0.3515* 0.2975* 0.2273*
(+12.58%) (+10.05%) (+14.34%) (+ 6.26%)
Avg 0.3360* 0.3434* 0.2909* 0.2240*
(+10.50%) (+ 9.51%) (+13.59%) (+5.57%)
KRM3 10 0.3103 0.3186 0.2649 0.2349
20 0.3249 0.3273 0.2696 0.2392
30 0.3219 0.3272 0.2698 0.2347
50 0.3250 0.3273 0.2690 0.2383
Avg 0.3205 0.3251 0.2683 0.2368
BKRM3 10 0.3456# 0.3473# 0.2918# 0.2538#
(+11.38%) (+9.01%) (+ 10.15%) (+ 8.05%)
20 0.3464# 0.3589# 0.2991# 0.2542#
(+6.62%) (+ 9.65%) (+ 10.94%) (+ 6.27%)
30 0.3483# 0.3570# 0.3009# 0.2528#
(+8.17%) (+9.11%) (+ 11.53%) (+ 7.71%)
50 0.3551# 0.3538# 0.2997# 0.2545#
(+9.43%) (+8.10%) (+ 11.41%) (+ 6.80%)
Avg 0.3489# 0.3543# 0.2979# 0.2538#
(+8.86%) (+8.97%) (+ 11.01%) (+ 7.20%)

matching, after semantic matching, and after PRF. The experimental results are shown in Tables 8 and 9.
The results in Tables 8 and 9 show that the MAP yielded by the proposed method of combined relevance matching and semantic
matching is a significant improvement over the MAP yielded by the BM25 model for AP90, AP88-89, DISK4&5, and WT10G. The MAP
values of the five methods after PRF are compared with the MAP values after the first round of retrieval, and the five models produce
statistically significant improvements on all datasets. The P@10 values after the second round of retrieval are a significant improvement over the P@10 values after the first round for most datasets. The results indicate that the proposed BERT-based reranking
method of combined relevance matching and semantic matching is effective and that the KRoc model combined with the proposed
method yields better MAP and P@10 results than the other four models.

5.4. Comparison with neural IR models

To further verify the effectiveness of the method proposed in this paper, we compared the proposed method with neural IR
models; this comparison is shown in Table 10. The results of the comparison models (DSSM, CDSSM, and DRMM) in Table 10 are derived from the results reported in the corresponding papers, which used MAP and P@20 as the metrics. To make a more intuitive comparison, we use P@20 instead of P@10 as the evaluation index in Table 10. DSSM and CDSSM use the semantic matching method,
and DRMM uses the relevance matching method. The five proposed models perform better than the DSSM, CDSSM and DRMM
models based on MAP and P@20. The results show that the PRF framework combining relevance matching and semantic matching is
effective. In addition, BKRM3 performs the best.

5.5. Sensitivity analysis of N

Parameter N is the number of documents used in the BM25 method in the first round of retrieval. Then, these documents are
reranked by BERT. The larger the value of N, the better the model reflects the real retrieval environment. To analyze the influence of
N on the performance of the proposed model, we investigate N values from 500 to 5000 and observe the MAP of the five models for
different values of |Tf|. As seen in Figs. 3, 4, 5, 6 and 7, the five models display relatively stable performance as the value of N varies.
Moreover, BRoc, BRM3, and BPRoc2 are more stable than BKRoc and BKRM3 when N varies.
Specifically, for the BRoc model at an N value of 1000, the optimal MAP value is obtained for AP90. The optimal MAP value for


Table 6
Comparison of P@10 values obtained by robust baselines (RM3 and KRM3) on the four TREC collections. The P@10 value changes with the number
of expansion terms (|Tf| ∈ {10, 20, 30, 50}). "Avg" in the last row of each collection denotes the average P@10 performance for that dataset. The
values in parentheses represent improvements over RM3 or KRM3. "*" and "#" denote statistically significant improvement over RM3 or KRM3,
respectively. (Wilcoxon signed-rank test at p < 0.05).
Model |Tf| AP90 AP88-89 DISK4&5 WT10G

RM3 10 0.4404 0.4510 0.4280 0.3061
20 0.4426 0.4531 0.4240 0.3122
30 0.4426 0.4592 0.4293 0.3163
50 0.4426 0.4653 0.4293 0.2990
Avg 0.4421 0.4572 0.4277 0.3084
BRM3 10 0.4787* 0.4939* 0.4867* 0.3133*
(+8.70%) (+9.51%) (+13.71%) (+2.35%)
20 0.4894* 0.4980* 0.4853* 0.3153*
(+10.57%) (+9.91%) (+14.46%) (+0.99%)
30 0.4979* 0.5041* 0.4880* 0.3194*
(+12.49%) (+9.78%) (+13.67%) (+0.98%)
50 0.5000* 0.5122* 0.4987* 0.3143*
(+12.97%) (+10.08%) (+16.17%) (+5.12%)
Avg 0.4915* 0.5021* 0.4897* 0.3156*
(+11.19%) (+ 9.82%) (+14.50%) (+2.33%)
KRM3 10 0.4426 0.4776 0.4373 0.3245
20 0.4489 0.4939 0.4360 0.3296
30 0.4596 0.5061 0.4267 0.3224
50 0.4468 0.4776 0.4360 0.3316
Avg 0.4495 0.4888 0.4340 0.3270
BKRM3 10 0.4830# 0.4939# 0.4853# 0.3398#
(+9.13%) (+3.41%) (+ 10.98%) (+ 4.71%)
20 0.5043# 0.5102# 0.4927# 0.3408#
(+12.34%) (+ 3.30%) (+13.00%) (+3.40%)
30 0.5085# 0.5122# 0.5020# 0.3418#
(+10.64%) (+1.21%) (+17.65%) (+6.02%)
50 0.5085# 0.5061# 0.5087# 0.3418#
(+13.81%) (+5.97%) (+16.67%) (+3.08%)
Avg 0.5011# 0.5056# 0.4972# 0.3411#
(+11.48%) (+3.44%) (+14.56%) (+4.29%)

Table 7
Comparison of NDCG and RI values obtained by robust baselines (RM3 and KRM3) on the four TREC collections. NDCG denotes the average NDCG
value for different numbers of expansion terms (|Tf| ∈ {10, 20, 30, 50}) and different datasets. The bold values represent the best results on the
corresponding dataset. "*" and "#" denote statistically significant improvement over RM3 or KRM3, respectively. (Wilcoxon signed-rank test at p <
0.05).
Collection Metric RM3 BRM3 KRM3 BKRM3

AP90 NDCG 0.6617 0.6850* 0.6602 0.6960#
RI - 0.2400 - 0.2000
AP88-89 NDCG 0.6737 0.7101* 0.6805 0.7091#
RI - 0.2800 - 0.2400
DISK4&5 NDCG 0.6758 0.7011* 0.6770 0.7091#
RI - 0.3467 - 0.1733
WT10G NDCG 0.5681 0.5880* 0.5948 0.6114#
RI - 0.1633 - 0.1224

AP88-89 is obtained when N is 2000. Additionally, for DISK4&5, the optimal value is obtained when N is greater than 3500, and for
WT10G, the optimal value is obtained when N is greater than 4500. The larger the dataset, the larger the value of N that must be selected to yield optimal performance with the BRoc model.

5.6. Sensitivity analysis of parameter α

The parameter α is the adjustment factor between the relevance matching method and the semantic matching method in
Equation (1). The smaller the value of α, the higher the contribution of the BERT method in our framework. Thus, semantic matching
plays the more significant role. To analyze the influence of the fusion ratio of BM25 and BERT on the first round of retrieval results,
we designed an experiment and fixed the parameter N to 5000. w1, w2, w3 and w4 are set to 1.0, 0.8, 0.9 and 0.9, respectively.
Additionally, α is set from 0 to 1.0. The MAP and P@10 results experimentally generated based on four different TREC datasets are
shown in Figs. 8 and 9, respectively.


Table 8
Comparison of MAP values obtained on the four TREC collections in three steps (relevance matching, combined relevance matching and semantic
matching, and the combined method after PRF). The results in column BM25 are the MAP values based on relevance matching. The BM25+BERT
column denotes the MAP values based on relevance matching and semantic matching, and the values in parentheses in this column indicate the
percent improvements after the first round of retrieval compared to the BM25 values. The results in the last five columns represent the average MAP
after PRF, and the values in parentheses represent the percent improvements over the results of the first round of retrieval. "*" denotes a statistically
significant improvement over BM25, and "#" denotes a statistically significant improvement over BM25+BERT (Wilcoxon signed-rank test at p <
0.05).
Collection   BM25     BM25+BERT           BRoc               BPRoc2             BKRoc               BRM3               BKRM3
AP90         0.2688   0.3328* (+23.81%)   0.3494# (+4.99%)   0.3443# (+3.46%)   0.3582# (+7.63%)    0.3360# (+0.96%)   0.3490# (+4.84%)
AP88-89      0.2831   0.3305* (+16.74%)   0.3327# (+0.67%)   0.3375# (+2.12%)   0.3474# (+5.11%)    0.3434# (+3.90%)   0.3540# (+7.20%)
DISK4&5      0.2293   0.2706* (+18.01%)   0.2727# (+0.78%)   0.2761# (+2.03%)   0.2978# (+10.05%)   0.2909# (+7.50%)   0.2980# (+10.09%)
WT10G        0.2050   0.2209* (+7.76%)    0.2249# (+1.81%)   0.2369# (+7.24%)   0.2406# (+8.92%)    0.2240# (+1.40%)   0.2540# (+14.89%)

Table 9
Comparison of P@10 values obtained on the four TREC collections in three steps (relevance matching, combined relevance matching and semantic
matching, and the combined method after PRF). The results in column BM25 are the P@10 values based on relevance matching. The BM25+BERT
column shows the P@10 values based on relevance matching and semantic matching, and the values in parentheses in this column indicate the
percent improvements after the first round of retrieval over the values for BM25. The results in the last five columns represent the average P@10
values after PRF, where the values in parentheses represent the percent improvements over the results of the first round of retrieval. "*" denotes a
statistically significant improvement over BM25, and "#" denotes a statistically significant improvement over BM25+BERT (Wilcoxon signed-rank
test at p < 0.05).
Collection   BM25     BM25+BERT           BRoc               BPRoc2             BKRoc              BRM3              BKRM3
AP90         0.4468   0.4843* (+8.39%)    0.4936# (+1.92%)   0.4851 (+0.17%)    0.5090# (+5.10%)   0.4915# (+1.49%)   0.5011# (+3.47%)
AP88-89      0.4531   0.5000* (+10.35%)   0.5061# (+1.22%)   0.5046# (+0.92%)   0.5133# (+2.66%)   0.5021 (+0.42%)    0.5056# (+1.12%)
DISK4&5      0.4260   0.4800* (+12.68%)   0.4892# (+1.92%)   0.4803 (+0.06%)    0.5154# (+7.38%)   0.4897# (+2.02%)   0.4972# (+3.58%)
WT10G        0.3071   0.3357* (+9.31%)    0.3388# (+0.92%)   0.3480# (+3.66%)   0.3403# (+1.37%)   0.3156 (-5.99%)    0.3411# (+1.61%)

Table 10
Comparison of MAP and P@20 values obtained by the baseline model, proposed models, and neural IR models on the Robust04 dataset. SM stands
for semantic matching method, and RM stands for relevance matching method. The values in parentheses represent the percent improvements over
the results of BM25. "*" denotes a statistically significant improvement over BM25 (Wilcoxon signed-rank test at p < 0.05). The references are as
follows: DSSM (P.-S. Huang et al., 2013) [1], CDSSM (Y. Shen et al., 2014) [2], and DRMM (Guo et al., 2016) [3].
Models      RM or SM   MAP                 P@20
BM25        RM         0.2553              0.3554
DSSM [1]    SM         0.0952 (-62.71%)    0.1713 (-51.80%)
CDSSM [2]   SM         0.0674 (-73.60%)    0.1256 (-64.66%)
DRMM [3]    RM         0.2793 (+9.40%)     0.3821 (+7.51%)
BRoc        RM+SM      0.3138* (+22.91%)   0.4235* (+19.16%)
BPRoc2      RM+SM      0.3123* (+22.33%)   0.4156* (+16.94%)
BKRoc       RM+SM      0.3241* (+26.95%)   0.4353* (+22.48%)
BRM3        RM+SM      0.3219* (+26.09%)   0.4274* (+20.26%)
BKRM3       RM+SM      0.3314* (+29.81%)   0.4363* (+22.76%)

As shown in Figs. 8 and 9, the MAP and P@10 results for the four datasets gradually increase as α increases from 0. When α is set to 0.9, MAP reaches its maximum and then begins to decline. An α value of 0 indicates that documents are ranked only by the BERT method, and an α value of 1.0 indicates that documents are ranked only by BM25. When only BERT is used, the result is not as good as when only BM25 is used. However, when BERT and BM25 are combined at a ratio of 1:9 (α = 0.9), we achieve the best results. The experiments show that relevance matching or semantic matching alone is not ideal, but the model can achieve improved performance when these matching processes are combined in a suitable proportion. The recommended setting of the fusion parameter α for relevance matching and semantic matching is approximately 0.9. The results also indicate that relevance matching plays a more important role than semantic matching.
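Under the same illustrative assumptions as the sketch in Section 5.6 above, the sensitivity sweep behind Figs. 8 and 9 reduces to evaluating the reranked list at each grid point:

for alpha in [i / 10 for i in range(11)]:  # 0.0, 0.1, ..., 1.0
    reranked = fuse_and_rerank(doc_ids, bm25_scores, bert_scores, alpha)
    # compute MAP and P@10 of `reranked` against the relevance judgments here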


Fig. 3. The sensitivity of parameter N when the BRoc model takes different |Tf| values

Fig. 4. The sensitivity of parameter N when the BRM3 model takes different |Tf| values

5.7. Sensitivity analysis of parameter σ

Gaussian kernel functions are used in the BKRoc, BKRM3 and BPRoc2 methods. Therefore, sensitivity to the kernel parameter σ is an important factor affecting the robustness of these three models. In the BKRoc and BKRM3 models, a kernel function is used to estimate the weights that reflect the co-occurrence relationship between query terms and expansion terms.


Fig. 5. The sensitivity of parameter N when the BPRoc2 model takes different |Tf| values

Fig. 6. The sensitivity of parameter N when the BKRM3 model takes different |Tf| values

Thus, we focus on the kernel parameter σ. Figs. 10–12 illustrate the three proposed models' performance on different datasets for different σ values. σ is set from 10 to 1500, and the number of feedback terms |Tf| is set to 10, 20, 30 and 50. The values of these two parameters considerably affect the performance of the PRF model; the corresponding curve varies with the number of feedback terms in each setting.
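As a rough illustration of the role of σ (a sketch of Gaussian kernel weighting, not the authors' exact estimator), the kernel assigns a co-occurrence weight that decays with the positional distance between a query-term occurrence and an expansion-term occurrence; the helper names below are ours.

import math

def gaussian_kernel(distance, sigma):
    # Weight decays with squared positional distance; sigma controls the spread.
    return math.exp(-(distance ** 2) / (2 * sigma ** 2))

def cooccurrence_weight(query_positions, term_positions, sigma):
    # Kernel-weighted co-occurrence between one query term and one candidate
    # expansion term within a feedback document (illustrative helper).
    return sum(gaussian_kernel(abs(i - j), sigma)
               for i in query_positions
               for j in term_positions)

# A small sigma rewards close proximity, while a very large sigma (e.g., 1500)
# makes the weight nearly position-independent:
print(round(gaussian_kernel(5, 10), 3))    # 0.883
print(round(gaussian_kernel(5, 1500), 6))  # 0.999994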


Fig. 7. The sensitivity of parameter N when the BKRoc model takes different |Tf| values

Fig. 8. The sensitivity of the fusion parameter α to MAP for the results of BERT-based reranking based on four datasets

As shown in Figs. 10 and 11, the BKRoc and BKRM3 models display optimal performance in the range σ ∈ (10, 30). In general, BKRoc performance stabilizes when the σ value is relatively large; this result is consistent with that for KRoc. Moreover, the performance of the models is also affected by the number of expansion terms: the result when |Tf| is 10 is not as good as the results obtained when |Tf| is 20, 30, or 50. However, BKRoc and BKRM3 perform better on WT10G when |Tf| is 10. In Fig. 12, the BPRoc2 model tends to yield the best results in the range σ ∈ (100, 200).
Similarly, BPRoc2 stabilizes as σ increases. Moreover, the performance of this model is affected by the number of expansion terms: the larger the value of |Tf|, the better the performance of the BPRoc2 model. However, for WT10G, BPRoc2 performs best when |Tf| is 10.


Fig. 9. The sensitivity of the fusion parameter α to P@10 for the results of BERT-based reranking based on four datasets

Fig. 10. The sensitivity of the BKRoc model to the kernel parameter σ. The model is evaluated on four TREC datasets and with different |Tf| values

5.8. Analysis of parameter interactions

In Sections 5.5–5.7, we analyze the changes in the MAP values of our models at different levels of the parameters N, α and σ. In this section, we investigate whether combinations of the three parameters at different levels interactively affect the MAP values of the models. The values of parameter α are {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. The values of parameter N are selected from {500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000}. The values of parameter σ are {10, 25, 50, 80, 100, 200, 500, 1000, 1500}. The values of parameters N, α and σ are combined to obtain 990 combinations. We conduct experiments with these parameter combinations and obtain the average MAP values of the BKRM3, BKRoc and BPRoc2 models on the AP90 dataset. The experiment is repeated three times.


Fig. 11. The sensitivity of the BKRM3 model to the kernel parameter σ. The model is evaluated on four TREC datasets and with different |Tf| values

Fig. 12. The sensitivity of the BPRoc2 model to the kernel parameter σ. The model is evaluated on four TREC datasets and with different |Tf| values

Through three-way analysis of variance (ANOVA), the p values for the parameters N, α and σ and for the interactions between each pair of parameters and among all three parameters are shown in Table 11. The results show that, for parameters α and σ, the p values of the three models are much less than 0.05, whereas for parameter N, the p values of the three models are greater than 0.05. Therefore, we can conclude that different levels of α and σ significantly affect the MAP results of our models, while different levels of N do not.


Table 11
The three-way ANOVA results for parameters N, α and σ (p values).
Factors     BKRM3       BKRoc       BPRoc2
σ           << 10^-8    << 10^-8    << 10^-8
α           << 10^-8    << 10^-8    << 10^-8
N           0.1965      0.1059      0.6673
σ × α       << 10^-8    << 10^-8    << 10^-8
σ × N       0.4096      0.5637      0.9284
α × N       0.9078      0.6006      0.4553
σ × α × N   0.8756      0.6452      0.7344

In addition, as Table 11 shows, the interaction of parameters σ and α significantly affects the three models, while the interaction of parameters σ and N and the interaction of parameters α and N do not. Additionally, the interaction of all three parameters α, σ and N does not significantly affect the average MAP values of the BKRM3, BKRoc and BPRoc2 models.
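For reproducibility, the significance tests in Table 11 can be run with a standard full-factorial ANOVA. The Python sketch below assumes the per-run MAP values are stored in a CSV with one row per run (990 parameter combinations × 3 repetitions); the file name and column names are hypothetical, not artifacts released by the authors.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("map_results.csv")  # hypothetical file: sigma, alpha, N, MAP

# Full factorial model: main effects plus all two- and three-way interactions,
# with each parameter treated as a categorical factor.
model = ols("MAP ~ C(sigma) * C(alpha) * C(N)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # p values corresponding to Table 11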

6. Conclusions and future work

In this paper, a PRF framework that combines relevance matching and semantic matching is proposed to enhance retrieval performance. The experiments conducted in this study suggest that, compared to using only one matching method, using relevance matching and semantic matching in combination achieves better retrieval performance. Moreover, comparisons of the MAP and P@10 results obtained by the five enhanced models and the corresponding robust baseline models on four TREC datasets show that the models proposed in this paper effectively improve retrieval performance. Additionally, through analyses of the parameters N, α and σ, we provide the empirical values that yield optimal model performance.
Our work has both theoretical and practical implications. It provides further empirical support regarding the roles of relevance matching and semantic matching in IR. We believe we have taken the first steps toward incorporating both relevance matching and semantic matching in PRF. Furthermore, this work shows that using both relevance matching and semantic matching is more effective than using either alone for improving the quality of feedback documents. The proposed PRF methods are especially effective at increasing the precision of the top 10 returned documents and the MAP of the top 1000 results.
This paper focuses on combining relevance matching and semantic matching for retrieval rather than on studying different semantic matching methods. Therefore, BERT, one of the most recently proposed pretraining models, is adopted as the semantic matching component in the proposed PRF framework. In the future, we will study and discuss cases in which different semantic matching models are combined with relevance matching methods.

CRediT authorship contribution statement

Junmei Wang: Conceptualization, Methodology, Writing - original draft. Min Pan: Conceptualization, Validation, Writing -
review & editing. Tingting He: Methodology, Funding acquisition, Supervision. Xiang Huang: Validation. Xueyan Wang: Data
curation. Xinhui Tu: Supervision, Writing - review & editing.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (61532008), the National Natural Science Foundation of China (61572223), the National Key Research and Development Program of China (2017YFC0909502), and the Wuhan Science and Technology Program (2019010701011392). This work was partly supported by the innovation team for research on basic intelligent education service innovation models and technology at Hubei Normal University.

References

Basile, P., Caputo, A., & Semeraro, G. (2011). Negation for document re-ranking in ad-hoc retrieval. Lecture Notes in Computer Science (Including Subseries Lecture Notes
in Artificial Intelligence and Lecture Notes in Bioinformatics), 6931 LNCS, 285–296. https://doi.org/10.1007/978-3-642-23318-0_26.
Chen, Q., Hu, Q., Huang, J. X., & He, L. (2018). CA-RNN: Using context-aligned recurrent neural networks for modeling sentence similarity. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 (pp. 265–273). www.aaai.org.
Colace, F., De Santo, M., Greco, L., & Napoletano, P. (2015). Improving relevance feedback-based query expansion by the use of a weighted word pairs approach.
Journal of the Association for Information Science and Technology, 66(11), 2223–2234. https://doi.org/10.1002/asi.23331.
Daoud, M., & Huang, J. X. (2013). Modeling geographic, temporal, and proximity contexts for improving geotemporal search. Journal of the American Society for
Information Science and Technology, 64(1), 190–212. https://doi.org/10.1002/asi.22648.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-
HLT 2019, 4171–4186. http://arxiv.org/abs/1810.04805.
Guo, J., Fan, Y., Ai, Q., & Croft, W. B. (2016). A deep relevance matching model for Ad-hoc retrieval. International Conference on Information and Knowledge
Management, Proceedings, 55–64. https://doi.org/10.1145/2983323.2983769.
Guo, J., Fan, Y., Pang, L., Yang, L., Ai, Q., Zamani, H., Wu, C., Croft, W. B., & Cheng, X. (2019). A deep look into neural ranking models for information retrieval. Information Processing & Management. http://arxiv.org/abs/1903.06902.
Hazimeh, H., & Zhai, C. X. (2015). Axiomatic analysis of smoothing methods in language models for Pseudo-Relevance Feedback. ICTIR 2015 - Proceedings of the 2015
ACM SIGIR International Conference on the Theory of Information Retrieval (pp. 141–150). . https://doi.org/10.1145/2808194.2809471.
He, B., Huang, J. X., & Zhou, X. (2011). Modeling term proximity for probabilistic information retrieval models. Information Sciences, 181(14), 3017–3031. https://doi.org/10.1016/j.ins.2011.03.007.
Huang, J. X., Miao, J., & He, B. (2013). High performance query expansion using adaptive co-training. Information Processing & Management, 49(2), 441–453. https://doi.org/10.1016/j.ipm.2012.08.002.
Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., & Heck, L. (2013). Learning deep structured semantic models for web search using clickthrough data. Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management - CIKM ’13 (pp. 2333–2338). https://doi.org/10.1145/2505515.2505665.
Ksentini, N., Tmar, M., & Gargouri, F. (2016). The impact of term statistical relationships on Rocchio's model parameters for pseudo relevance feedback. International Journal of Computer Information Systems and Industrial Management Applications, 8, 135–144. www.mirlabs.net/ijcisim/index.html.
Lavrenko, V., & Croft, W. B. (2001). Relevance-Based Language Models. SIGIR’01, 120–127.
Lv, Y., & Zhai, C. (2009). A comparative study of methods for estimating query language models with pseudo feedback. Proceeding of the 18th ACM Conference on
Information and Knowledge Management - CIKM ’09 (pp. 1895–1898). . https://doi.org/10.1145/1645953.1646259.
Lv, Y., & Zhai, C. X. (2014). Revisiting the divergence minimization feedback model. CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information
and Knowledge Management (pp. 1863–1866). . https://doi.org/10.1145/2661829.2661900.
Marchesin, S., Purpura, A., & Silvello, G. (2019). Focal elements of neural information retrieval models: An outlook through a reproducibility study. Information Processing & Management, Article 102109. https://doi.org/10.1016/j.ipm.2019.102109.
Metzler, D., & Croft, W. B. (2005). A Markov random field model for term dependencies. SIGIR 2005 - Proceedings of the 28th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval (pp. 472–479). . https://doi.org/10.1145/1076034.1076115.
Miao, J., Huang, J. X., & Ye, Z. (2012). Proximity-based Rocchio's model for pseudo relevance. Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR ’12 (p. 535). https://doi.org/10.1145/2348283.2348356.
Nasir, J. A., Varlamis, I., & Ishfaq, S. (2019). A knowledge-based semantic framework for query expansion. Information Processing & Management, 56(5), 1605–1617.
https://doi.org/10.1016/j.ipm.2019.04.007.
Pan, M., Huang, J. X., He, T., Mao, Z., Ying, Z., & Tu, X. (2020). A simple kernel co-occurrence-based enhancement for pseudo-relevance feedback. Journal of the
Association for Information Science and Technology, 71(3), 264–281. https://doi.org/10.1002/asi.24241.
Pan, M., Zhang, Y., Zhu, Q., Sun, B., He, T., & Jiang, X. (2019). An adaptive term proximity based Rocchio's model for clinical decision support retrieval. BMC Medical Informatics and Decision Making, 19(9), 251. https://doi.org/10.1186/s12911-019-0986-6.
Pang, L., Lan, Y., Guo, J., Xu, J., Xu, J., & Cheng, X. (2017). DeepRank: A new deep architecture for relevance ranking in information retrieval. International Conference
on Information and Knowledge Management, Proceedings, Part F1318 (pp. 257–266). . https://doi.org/10.1145/3132847.3132914.
Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. Proceedings of the 21st Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval - SIGIR ’98 (pp. 275–281). . https://doi.org/10.1145/290941.291008.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137. https://doi.org/10.1108/eb046814.
Raman, K., Udupa, R., Bhattacharya, P., & Bhole, A. (2010). On improving pseudo-relevance feedback using pseudo-irrelevant documents. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 5993 LNCS (pp. 573–576). https://doi.org/10.1007/978-3-642-12275-0-50.
Robertson, S. E., Walker, S., Beaulieu, M. M., Gatford, M., & Payne, A. (1996). Okapi at TREC-4. Proceedings of the Fourth Text REtrieval Conference, TREC 1995. https://www.researchgate.net/publication/246553132.
Rocchio, J. J. (1971). Relevance Feedback in Information Retrieval. The SMART Retrieval System, 313–323.
Sakai, T., Manabe, T., & Koyama, M. (2005). Flexible Pseudo-Relevance Feedback via Selective Sampling. ACM Transactions on Asian Language Information Processing,
4(2), 111–135. https://doi.org/10.1145/1105696.1105699.
Shen, T., Jiang, J., Zhou, T., Pan, S., Long, G., & Zhang, C. (2018). DiSAN: Directional self-attention network for RNN/CNN-free language understanding. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 (pp. 5446–5455). http://arxiv.org/abs/1709.04696.
Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). Learning semantic representations using convolutional neural networks for web search. Proceedings of the 23rd
International Conference on World Wide Web - WWW ’14 Companion (pp. 373–374). . https://doi.org/10.1145/2567948.2577348.
Tao, T., & Zhai, C. X. (2006). Regularized estimation of mixture models for robust pseudo-relevance feedback. Proceedings of the Twenty-Ninth Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval, 2006 (pp. 162–169). . https://doi.org/10.1145/1148170.1148201.
Valcarce, D., Parapar, J., & Barreiro, Á. (2019). Document-based and term-based linear methods for pseudo-relevance feedback. Applied Computing Review, 18(4), 5–17.
Wang, X., Fang, H., & Zhai, C. X. (2008). A study of methods for negative relevance feedback. ACM SIGIR 2008 - 31st Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, Proceedings (pp. 219–226). . https://doi.org/10.1145/1390334.1390374.
Williams, A., Nangia, N., & Bowman, S. (2018). A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. The North American Chapter of the
Association for Computational Linguistics 2018,NAACL 2018 (pp. 1112–1122). . https://doi.org/10.18653/v1/n18-1101.
Wu, M.-S. (2015). Modeling query-document dependencies with topic language models for information retrieval. Information Sciences, 312, 1–12. https://doi.org/10.1016/j.ins.2015.03.056.
Yang, W., Zhang, H., & Lin, J. (2019). Simple Applications of BERT for Ad Hoc Document Retrieval. http://arxiv.org/abs/1903.10972.
Ye, Z., & Huang, J. X. (2014). A simple term frequency transformation model for effective Pseudo Relevance Feedback. SIGIR 2014 - Proceedings of the 37th International
ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 323–332). . https://doi.org/10.1145/2600428.2609636.
Yin, X., Huang, J. X., & Li, Z. (2011). Mining and modeling linkage information from citation context for improving biomedical literature retrieval. Information
Processing & Management, 47(1), 53–67. https://doi.org/10.1016/j.ipm.2010.03.010.
Zamani, H., Dadashkarimi, J., Shakery, A., & Croft, W. B. (2016). Pseudo-relevance feedback based on matrix factorization. International Conference on Information and Knowledge Management, Proceedings (pp. 1483–1492). https://doi.org/10.1145/2983323.2983844.
Zhai, C., & Lafferty, J. (2001a). A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. SIGIR’01, 334–342.
Zhai, C., & Lafferty, J. (2001b). Model-based feedback in the language modeling approach to information retrieval. International Conference on Information and
Knowledge Management, Proceedings (pp. 403–410). . https://doi.org/10.1145/502653.502654.
Zhou, D., Truran, M., Liu, J., & Zhang, S. (2013). Collaborative pseudo-relevance feedback. Expert Systems with Applications, 40(17), 6805–6812. https://doi.org/10.1016/j.eswa.2013.06.030.
