Information Processing and Management 57 (2020) 102342
https://doi.org/10.1016/j.ipm.2020.102342

Junmei Wang¹, Min Pan¹, Tingting He⁎, Xiang Huang, Xueyan Wang, Xinhui Tu

⁎ Corresponding author. E-mail address: [email protected] (T. He).
¹ Junmei Wang and Min Pan contributed equally to this work and should be regarded as co-first authors.

Keywords: Information retrieval; Pseudo-relevance feedback; Text similarity; Semantic matching

Abstract: Pseudo-relevance feedback (PRF) is a well-known method for addressing the mismatch between query intention and query representation. Most current PRF methods consider relevance matching only from the perspective of terms used to sort feedback documents, thus possibly leading to a semantic gap between query representation and document representation. In this work, a PRF framework that combines relevance matching and semantic matching is proposed to
improve the quality of feedback documents. Specifically, in the first round of retrieval, we
propose a reranking mechanism in which the information of the exact terms and the semantic
similarity between the query and document representations are calculated by bidirectional encoder representations from transformers (BERT); this mechanism reduces the text semantic gap
by using the semantic information and improves the quality of feedback documents. Then, our
proposed PRF framework is constructed to process the results of the first round of retrieval by
using probability-based PRF methods and language-model-based PRF methods. Finally, we
conduct extensive experiments on four Text Retrieval Conference (TREC) datasets. The results
show that the proposed models outperform the robust baseline models in terms of the mean
average precision (MAP) and precision P at position 10 (P@10), and the results also highlight
that using the combined relevance matching and semantic matching method is more effective
than using relevance matching or semantic matching alone in terms of improving the quality of
feedback documents.
1. Introduction
The rapid rise of the search engine industry has greatly stimulated research interest in information retrieval (IR). In the past
several decades, various classical retrieval models have been proposed, including probability models, statistical language models, and
vector space models. These models have been successfully applied in retrieval systems to address many issues in IR (Nasir et al., 2019;
Yin et al., 2011).
Is ad hoc retrieval relevance matching or semantic matching? Semantic matching involves identifying semantic information and
inferring the semantic relationships between two paragraphs of text. In contrast, relevance matching involves identifying whether a
document is relevant to a given query. Guo et al. (2016) believed that the matching problems in many natural language processing
(NLP) tasks are fundamentally different from ad hoc retrieval tasks. Most NLP tasks involve semantic matching, and ad hoc retrieval
tasks mainly involve relevance matching. This paper argues that combining relevance matching and semantic matching is the key to
bridging the semantic text gap between query and document representations. In recent years, an important problem in IR has been analyzing semantic information in context and matching the semantics of the query with web page data. For example, suppose that a user query is "which hospital provides a high level of skin disease treatment" and a relevant website title is "The effect of Peking Union Medical College Hospital on skin disease". The semantic core of the query is "see a doctor for skin disease", and the core of the web information is "treat skin disease". In general, the semantic backgrounds of "see" and "treat" are different; thus, if the relatedness of "see" and "treat" is calculated directly, the matching degree between the user's query and the page will be relatively poor. However, when combined with the context of "hospital" and "skin disease", the semantic relatedness of "seeing" and "treating" is very strong. The input query is based on the user's accumulated knowledge, whereas a computer has no comparable knowledge base; therefore, it is difficult for a computer to accurately understand the actual query intention of the user, which leads to semantic deviation between the documents and the query. Recently, the bidirectional encoder representations from transformers (BERT) model (Devlin et al., 2019), proposed by Google, has performed well on 11 NLP tasks, thereby becoming one of the
most popular deep learning models. The model uses a transformer framework, which is more effective than a recurrent neural
network (RNN) (Chen et al., 2018; T. Shen et al., 2018), and can capture information over long distances. Therefore, BERT can
capture contextual information better than previous pretraining models. To investigate the roles of semantic matching and relevance
matching in IR, this paper takes BERT as the representative model in our proposed framework.
Another problem is that, during actual retrieval, some terms are not expressed explicitly by users but are instead implied by the query terms (Nasir et al., 2019). The search engine will miss this information, thereby resulting in a deviation between the user's
query intention and the actual query representation. Pseudo-relevance feedback (PRF) (Basile et al., 2011; Raman et al., 2010;
Wang et al., 2008), as an important branch of IR, can effectively improve retrieval performance through query expansion. In PRF, it is
assumed that the top-ranked documents in the first retrieval result are relevant to a given query (Zhou et al., 2013). Then, the top-
ranked documents are used as feedback documents, from which possible relevant terms are selected and added to the original query
to refine the expression of the original query. PRF is currently one of the most effective ways to bridge the gap between user query
intent and actual query representation (Pan et al., 2019). In this work, a PRF framework that combines relevance matching and
semantic matching is proposed to address the above problems.
The main aim of this work is to provide a PRF framework that combines relevance matching and semantic matching, and the
objectives are as follows:
• To determine the importance of relevance matching and semantic matching for IR.
• To reduce the semantic gap between query intention and query representation and between query representation and document
representation in IR.
• To improve retrieval performance by increasing the precision of the top 10 documents retrieved and the mean average precision
(MAP) of the top 1000 results.
• The reranking mechanism avoids scoring all documents with BERT and reduces the computational time because we first use
BM25 (best matching 25) to select the top N documents and then use BERT to calculate the semantic matching scores between
queries and sentences in the N documents. Thus, only the N documents are reranked.
• Experiments are performed to verify that the proposed reranking method, which combines relevance matching and semantic
matching, is more effective (in terms of improving the quality of feedback documents) than using either relevance matching or
semantic matching alone.
• A PRF framework combining relevance matching and semantic matching is proposed, and five enhanced models (denoted by
BRoc, BPRoc2, BKRoc, BRM3 and BKRM3) are generated by merging the framework with probability-based PRF models
(Rocchio+BM25 (Rocchio, 1971), PRoc2 (Miao et al., 2012), and KRoc (Pan et al., 2020)) and language-model-based PRF models
(RM3 (Lv & Zhai, 2009) and KRM3 (Pan et al., 2020)). A series of experiments are used to verify that the proposed framework is
universal. The results of experiments with the five models and different values of N indicate that the framework is robust.
• A series of experiments involving standard Text Retrieval Conference (TREC) datasets was performed to evaluate the proposed
models from different perspectives. The results show that the proposed models can achieve better performance than the baseline
models in terms of mean average precision (MAP) and precision P at position 10 (P@10). Our PRF framework may reduce the
semantic text gap between query intention and query representation and between query representation and document representation during IR.
• Our proposed PRF framework can improve the quality of feedback documents.
The remainder of this paper is organized as follows: In Section 2, related studies are reviewed. In Section 3, we introduce the proposed PRF framework combining relevance matching and semantic matching and the five improved models. In Section 4, we describe the setup of the experiment and the four TREC datasets. In Section 5, we compare and analyze the experimental results of the proposed models and different baselines to test the performance of the proposed framework and the five models. Finally, in Section 6, we summarize the paper, provide a brief conclusion, and discuss future research directions.
2. Related work
PRF is a common and effective technique that can improve retrieval performance (Lv & Zhai, 2009). This method extracts the
expansion terms from the feedback documents and uses these terms to refine the representation of the original query. Subsequently, a
second round of retrieval is performed (J. X. Huang et al., 2013). In 1971, Rocchio (Rocchio, 1971) proposed the first well-known
relevance feedback technique, which improved query representation by adding new terms arranged in descending order according to
their term frequency weights. Many relevant feedback models based on the technique developed by Rocchio have been proposed, and
they have achieved good performance (Ksentini et al., 2016). For example, He et al. (He et al., 2011) proposed four novel methods for
improving the classical BM25 model by utilizing term proximity evidence. In 2012, Miao et al. proposed a novel model called PRoc
(proximity-based Rocchio model), which used the Rocchio model to capture the proximity relationships between candidate terms and
the corresponding queries in feedback documents. Three proximity measures, namely, the window-based method, the kernel-based
method and the Hyperspace Analog to Language (HAL) method, were then proposed for evaluating the relationship between expansion terms and query terms. The three variants of PRoc are called PRoc1, PRoc2, and PRoc3. In addition, many other relevant
methods have also achieved remarkable results in improving retrieval performance (Colace et al., 2015; Daoud & Huang, 2013;
Metzler & Croft, 2005; Ye & Huang, 2014).
The rapid development of the language model (LM) has provided favorable conditions for further research on PRF models (Ponte
& Croft, 1998). However, a core problem in LM estimation is smoothing. Zhai et al. (Zhai & Lafferty, 2001a) compared three popular smoothing methods (i.e., the Jelinek-Mercer method, the Dirichlet prior method, and absolute discounting) and noted that the Dirichlet prior method generally performs well. Subsequently, many LM-based retrieval methods have been successively proposed (e.g.,
(Lavrenko & Croft, 2001); (Lv & Zhai, 2009); (Hazimeh & Zhai, 2015); (Wu, 2015)), and for such methods, feedback documents are
often used to reevaluate the query LM. In 2001, Zhai et al. (Zhai & Lafferty, 2001b) proposed an LM-based feedback model in which
the authors evaluated two different methods for updating queries based on feedback documents. RM1 and RM2 are relevance-based
LMs (Lavrenko & Croft, 2001) that calculate the probabilities of terms in the relevant class. RM3 and RM4 are extensions of RM1 and RM2, respectively, created by interpolating RM1 and RM2 with the original query model. Furthermore, a series of query likelihood models are used
as retrieval models with Dirichlet prior smoothing. In 2001, Zhai and Lafferty proposed a simple mixture model (SMM) and a
divergence minimization model (DMM), which are two different approaches for updating a query LM. SMM and DMM differ in terms
of their method for estimating the query model based on feedback documents. The SMM method assumes that feedback documents
are generated by a mixture model in which one component is the query topic model and the other is the collection LM. Given the
observed feedback documents, the maximum likelihood criterion is used to estimate a query topic model. The DMM method uses a
completely different estimation criterion; this method chooses the query model that has the smallest average Kullback-Leibler (KL)
divergence from the smoothed empirical term distribution of the feedback documents. In 2006, Tao and Zhai proposed a query-
regularized mixture model (RMM) for pseudo feedback (Tao & Zhai, 2006). The authors integrated the original query with feedback
documents in a single probabilistic mixture model and regularized the estimation of the LM parameters in the model so that the
information in the feedback documents can be gradually added to the original query. A major advantage of this model is that it has no
parameter to tune. In 2014, Lv and Zhai revealed that DMM inappropriately handles the entropy of the feedback model, thereby
resulting in a highly skewed feedback model. To address this problem, the authors proposed a maximum-entropy divergence
minimization model (MEDMM) by introducing an entropy term to regularize the DMM (Lv & Zhai, 2014). In 2009, Lv and Zhai (Lv &
Zhai, 2009) compared the following five methods for estimating query LMs by using PRF in ad hoc IR: RM3, RM4, SMM (Zhai &
Lafferty, 2001b), RMM (Tao & Zhai, 2006) and DMM (Zhai & Lafferty, 2001b). The authors found that RM3 is more robust and
comparable to any LM method in many tasks; thus, RM3 remains a strong baseline for comparison. In 2020, Pan et al. (Pan et al.,
2020) integrated cooccurrence information into the Rocchio model and RM3 model and proposed the KRoc model and KRM3 model,
respectively, thereby improving retrieval performance.
Some other PRF methods provide solutions for research on IR. In 2016, Zamani et al. (Zamani et al., 2016) proposed a PRF
method based on matrix factorization (RFMF) that tries to expand the query by using not only the terms that discriminate the
feedback documents from a collection but also terms that are relevant to the original query terms. This study was the first to
formulate PRF as a matrix decomposition problem and compute a latent factor representation of documents/queries and terms by
using nonnegative matrix factorization. In contrast, in 2019, Valcarce et al. proposed a linear method (LiMe) for the PRF task
(Valcarce et al., 2019). The LiMe framework computes similarities yielded within the query and the pseudo relevant set. Then, the
similarity information of these relationships between documents or terms is used to expand the original query. However, most current
PRF methods consider relevance matching only from the perspective of terms used to sort feedback documents and do not consider
the semantic information between queries and documents to be an important index for calculating relevance. These methods may
lead to low-quality feedback documents. In current research, the semantics of terms are often considered to be closely related to the
context of the terms (Pang et al., 2017). The semantics of the query express the true intentions of the users. On this basis, we suggest
that, when the semantic similarity of the query and the document is considered, the results of the first round of retrieval may be
improved. Additionally, the quality of the feedback documents may be improved.
In recent years, deep learning methods have been applied to different scenarios in speech recognition, computer vision, NLP, etc.
(Marchesin et al., 2019) because these methods can automatically learn effective data representations (features). When applied to ad
hoc retrieval, the task is usually formalized as a semantic matching problem between two texts (Guo et al., 2019). Some classical
neural IR models related to this task include the deep structured semantic model (DSSM) (Huang et al., 2013) (P.-S.), convolutional
DSSM (CDSSM) (Y. Shen et al., 2014) and deep relevance matching model (DRMM) (Guo et al., 2016). The DSSM uses a fully
connected feedforward network to construct the presentations of the query and document and then generates a ranking score by
calculating the cosine similarity of the two vectors. The CDSSM, an extension of the DSSM, uses a convolutional neural network
(CNN) to better preserve the local word order information when capturing the contextual information of the query and the document.
Then, max-pooling strategies are adopted to filter the salient semantic concepts to form a sentence-level representation. However, the
DSSM and CDSSM consider only semantic matching between queries and documents (Guo et al., 2016). Other important matching
information, such as a precise matching signal, the importance of query terms and different matching requirements, is ignored. In
2016, Guo et al. noted that ad hoc retrieval tasks involve mainly relevance matching, and the authors proposed a DRMM for ad hoc
retrieval. Specifically, the DRMM employs a joint deep architecture at the query term level for relevance matching. DeepRank (Pang et al., 2017) is inspired by the steps of human relevance judgment. Specifically, DeepRank splits the document into term-centric contexts with respect to each query term. Then, the interaction function is defined for term-level computation, term-level
aggregation, and global aggregation. However, few studies combine relevance matching and semantic matching for IR.
In 2018, the BERT model achieved the best results among those of other previous models in 11 NLP tasks (Devlin et al., 2019). The
architecture of the BERT model is a multilayer bidirectional transformer encoder. The encoder is composed of a stack of identical layers (12 in BERT-base and 24 in BERT-large). Each layer has two sublayers. The first sublayer is a multihead self-attention mechanism, and the second sublayer is a simple,
position-wise fully connected feedforward network. Instead of pretraining the LM from left to right, BERT enables the representation
to fuse the bidirectional context. BERT is divided into two main stages (Devlin et al., 2019). The first stage involves training a
common "language understanding" model on a large text corpus (e.g., Wikipedia). In the second stage, the model is fine-tuned for
specific tasks (e.g., machine translation, text embedding, named-entity recognition, automatic abstracts, questions and answers). For
the task in IR, multigenre natural language inference (MNLI) (Williams et al., 2018) was used as the fine-tuning corpus. MNLI is a large-scale, crowdsourced corpus for textual entailment classification. Given a pair of sentences, the task is to predict whether the second sentence is entailed by, contradicts, or is neutral with respect to the first sentence. Given the successful implementation of the BERT model
in retrieval tasks (Yang, Zhang, & Lin, 2019), this paper proposes a PRF framework that combines relevance matching and semantic
matching methods. In the framework, BERT is used in the first round of PRF to reorder the first N documents after BM25 scoring and
to improve the quality of the feedback documents. Then, we apply the framework in combination with probability-based PRF models
(i.e., Rocchio+BM25 (Rocchio, 1971), PRoc2 (Miao et al., 2012), KRoc (Pan et al., 2020)) and LM-based PRF models (i.e., RM3 (Lv &
Zhai, 2009) and KRM3 (Pan et al., 2020)) to measure the quality of the feedback documents. The main difference between the
proposed PRF framework and other methods is that the proposed method combines relevance matching and semantic matching to
improve the quality of the feedback documents by evaluating two aspects of relevance (semantic relevance and term-level relevance)
from the documents and the initial query instead of using exact matching techniques.
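To make the sentence-pair step concrete, the following is a minimal sketch of scoring a query-sentence pair with a BERT model fine-tuned on MNLI, written against the Hugging Face transformers API. The checkpoint name, the label index, and the use of the entailment probability as a similarity score are illustrative assumptions, not the exact configuration used in the papers cited above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical MNLI-fine-tuned BERT checkpoint; the exact model may differ.
MODEL_NAME = "textattack/bert-base-uncased-MNLI"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def semantic_score(query: str, sentence: str) -> float:
    """Probability that the sentence is entailed by (relevant to) the query."""
    inputs = tokenizer(query, sentence, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    # Assumption: index 1 is the "entailment" label for this checkpoint;
    # check model.config.id2label before relying on it.
    return probs[1].item()

print(semantic_score("which hospital treats skin disease",
                     "Peking Union Medical College Hospital treats skin disease."))
```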
3. Our method
We introduce the proposed PRF framework, which combines relevance matching and semantic matching, in section 3.1. Then, we
apply the framework in combination with the probability-based PRF model in section 3.2 and the LM-based PRF model in section 3.3.
3.1. Our PRF framework combining relevance matching and semantic matching
Our approach is motivated by the success of Yang et al. (2019) in applying the BERT model to retrieval tasks. This paper proposes
a simple extended application based on the above guiding approach in which the BERT model is applied for PRF. Specifically, we
initially use the relevance matching method to obtain the exact relevance weight of the query and document at the lexical level. The
score calculated by the traditional BM25 method is used as the score of the initial ranking of documents, and the top N documents are
chosen for reranking. Then, we use the semantic matching method to acquire the semantic relevance between the query and the
document. Due to the input length limitation of the BERT model, the model cannot be applied directly to long documents in retrieval tasks. Therefore, this paper segments the top N documents obtained from the previous step, splices each sentence with the query, and enters each pair into the BERT model to obtain a classification result (semantic relevance or semantic irrelevance). The entailment score is taken as the semantic similarity score of the two sentences. Based on local
relevance, if part of a document is related to a query, then the document is considered to be related to the query. Based on this
assumption, a weighted linear combination of the scores of the top M sentences is selected as the semantic score of the document, and then this
score is combined with the document score obtained by the relevance matching method (by using the linear combination method) to
reorder the first N documents. The calculation method is shown in Equation (1).
$$S_d = \alpha \times S_d^e + (1 - \alpha) \times \sum_{i=1}^{M} \left( w_i \times S_i^d \right) \tag{1}$$
Fig. 1. The relevance matching and semantic matching process based on BERT reranking for IR.
The relevance matching score of document d calculated by the traditional BM25 method is denoted by S_d^e and is computed as shown in Equation (2):

$$S_d^e = \sum_{t \in Q} \log\frac{N' - df_t + 0.5}{df_t + 0.5} \times \frac{(k_1 + 1) \times tf(t, d)}{K + tf(t, d)} \times \frac{(k_3 + 1) \times tf(t, Q)}{k_3 + tf(t, Q)} \tag{2}$$

where N′ represents the total number of documents in the index; df_t is the number of documents in which the term t appears; K is equal to k_1 × (1 − b + b × dl/avgdl), where k_1 and b are the adjustment factors that balance the effect of document length; k_3 is a parameter that adjusts the term frequency in the query; tf(t, d) represents the number of occurrences of term t in document d; and tf(t, Q) represents the number of occurrences of term t in query Q. The sentences from document d are ranked by the semantic similarity score between the query and each sentence. The score of the i-th ranked sentence is denoted by S_i^d, and w_i is the weight of the i-th sentence from document d. The weighted sum of the scores of the top M sentences is taken as the semantic score of document d. α is the moderating factor, and the score of document d in the first round of retrieval is denoted by S_d.
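As a concrete reading of Equations (1) and (2), the sketch below fuses a document's BM25 score with the BERT scores of its top M sentences. The function is ours; the values α = 0.9 and w = (1.0, 0.8, 0.9, 0.9) follow the settings reported later in Section 5.6, while the scores in the example are made up.

```python
def fuse_scores(bm25_score, sentence_scores, weights, alpha=0.9, M=4):
    """Equation (1): S_d = alpha * S_d^e + (1 - alpha) * sum_i (w_i * S_i^d)."""
    # Rank the document's sentences by semantic score and keep the top M.
    top_m = sorted(sentence_scores, reverse=True)[:M]
    semantic = sum(w * s for w, s in zip(weights, top_m))
    return alpha * bm25_score + (1 - alpha) * semantic

# bm25_score would come from Equation (2); sentence scores from BERT.
print(fuse_scores(12.3, [0.91, 0.40, 0.77, 0.12, 0.66],
                  weights=[1.0, 0.8, 0.9, 0.9], alpha=0.9))
```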
The relevance matching and semantic matching process based on BERT reranking for IR is shown in Fig. 1.
Fig. 2. A PRF framework combining relevance matching and semantic matching for IR.
We extract the expansion terms from the results of first-round retrieval and generate a new query for the second round of
retrieval. The flow diagram of the PRF framework combining relevance matching and semantic matching for IR is shown in Fig. 2. To
prove the effectiveness of our framework in improving the quality of feedback documents, we use different PRF methods to extract
the expansion terms from the feedback documents we obtain and perform the second round of retrieval.
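To summarize the flow in Fig. 2, the sketch below strings the stages together. retrieve_bm25, bert_sentence_scores and extract_expansion_terms are hypothetical helper names standing in for components described in this section (fuse_scores is from the sketch above); this is an outline of the framework under our assumptions, not the authors' implementation.

```python
def prf_pipeline(query_terms, index, N=5000, M=4, alpha=0.9, k_terms=30):
    """Sketch of the PRF framework in Fig. 2 (hypothetical helper functions)."""
    # Round 1: BM25 relevance matching; keep only the top N documents.
    top_docs = retrieve_bm25(query_terms, index, top_k=N)

    # Rerank the N documents by fusing BM25 and BERT scores (Equation (1)).
    def fused(doc):
        s_sent = bert_sentence_scores(query_terms, doc.sentences)
        return fuse_scores(doc.bm25_score, s_sent, [1.0, 0.8, 0.9, 0.9], alpha, M)
    reranked = sorted(top_docs, key=fused, reverse=True)

    # Treat the top-ranked documents as pseudo-relevant feedback and extract
    # expansion terms with any PRF model (Rocchio+BM25, PRoc2, KRoc, RM3, KRM3).
    feedback = reranked[:10]
    expansion = extract_expansion_terms(query_terms, feedback, k=k_terms)

    # Round 2: retrieve again with the refined query.
    return retrieve_bm25(query_terms + expansion, index, top_k=1000)
```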
There are two kinds of PRF models: probability-based PRF models and language-model-based (LM-based) PRF models. Probability-based PRF models mainly include Rocchio+BM25, PRoc2, and KRoc; KRoc is the most recently proposed of these (2020) and yields excellent results on nine TREC datasets, including GOV2.
The methods used to calculate the query expansion terms vary based on how the framework is combined with different PRF
models. Rocchio+BM25 (Rocchio, 1971) uses term frequency–inverse document frequency (TF-IDF) to calculate the term weights in the feedback documents. The weight w_t of term t is calculated as shown in Equation (3), where tf(t, d_i) represents the number of occurrences of term t in document d_i and N represents the number of feedback documents:

$$w_t = \log\frac{N - df_t + 0.5}{df_t + 0.5} \times \sum_{i=1}^{N} tf(t, d_i) \tag{3}$$
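A minimal sketch of the expansion-term weighting in Equation (3), following the notation above (feedback documents are given as token lists; the toy data and df value are placeholders):

```python
import math
from collections import Counter

def rocchio_term_weight(term, feedback_docs, df_t):
    """Equation (3): w_t = log((N - df_t + 0.5) / (df_t + 0.5)) * sum_i tf(t, d_i),
    where N is the number of feedback documents."""
    n = len(feedback_docs)
    idf = math.log((n - df_t + 0.5) / (df_t + 0.5))
    tf_sum = sum(Counter(doc)[term] for doc in feedback_docs)
    return idf * tf_sum

docs = [["skin", "disease", "hospital"],
        ["treat", "skin", "disease"],
        ["hospital", "doctor"]]
print(rocchio_term_weight("treat", docs, df_t=1))  # ~0.51 * 1
```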
PRoc2 (Miao et al., 2012) considers not only the importance of different query terms but also the average proximity of terms to
the query. In the kernel-based method, the weight of term t in the neighborhood of query term qi is wt, and the calculation method is
as shown in Equation (4).
$$w_t = \sum_{i=1}^{|Q|} K(t, q_i) \times \log\frac{N - df_{q_i} + 0.5}{df_{q_i} + 0.5} \tag{4}$$
In the PRoc2 model, |Q| is the number of query terms. A Gaussian kernel is used to measure the proximity K(t, qi) between the
candidate extension term t and query term qi, as shown in Equation (5). pt and pq denote the locations of candidate expansion term t
and query term q in the document, respectively, and σ is a tuning parameter used to control the scale of the Gaussian distribution.
$$K(t, q) = \exp\left( -\frac{(p_t - p_q)^2}{2\sigma^2} \right) \tag{5}$$
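A sketch of the kernel-weighted weighting in Equations (4) and (5); the positions, document frequencies, N and σ are illustrative, and each query term is represented by a single position for simplicity:

```python
import math

def gaussian_kernel(p_t, p_q, sigma):
    """Equation (5): proximity of the positions of term t and query term q."""
    return math.exp(-((p_t - p_q) ** 2) / (2 * sigma ** 2))

def proc2_weight(term_pos, query_positions, dfs, n_docs, sigma=100.0):
    """Equation (4): kernel proximity to each query term, weighted by its IDF."""
    w = 0.0
    for p_q, df_q in zip(query_positions, dfs):
        idf = math.log((n_docs - df_q + 0.5) / (df_q + 0.5))
        w += gaussian_kernel(term_pos, p_q, sigma) * idf
    return w

print(proc2_weight(term_pos=12, query_positions=[10, 40],
                   dfs=[120, 300], n_docs=100000))
```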
The KRoc model, proposed by Pan et al. (2020), considers both the term frequency and the proximity information of the cooccurrence of the candidate expansion terms with the query.
The three models, namely, Rocchio+BM25, PRoc2 and KRoc, were combined with our framework to form BRoc, BPRoc2, and
BKRoc, respectively.
In addition to combining the framework with the probability-based PRF model, the framework must also be combined with the
LM-based PRF models to comprehensively measure the performance of the framework. The LM approach essentially estimates an LM for each document; the documents are then ranked according to the likelihood that the estimated LM generates the query. The LM-based PRF model uses the feedback documents from the first round of retrieval to reestimate the LM of the query. Common LM-based PRF models
include RM3 and KRM3. For most standard TREC datasets, RM3 and KRM3 have achieved impressive retrieval performance based on
the retrieval accuracy and recall rate. KRM3 was first developed in 2020. This section will incorporate a robust baseline model (RM3)
and a state-of-the-art model (KRM3) into the framework to form new retrieval LMs, i.e., BRM3 and BKRM3, respectively.
The RM3 model (Lv & Zhai, 2009) selects the expansion terms from the feedback documents of the first round of retrieval. The weight calculation method of the expansion terms is shown in Equation (8):

$$w_t = \log\left( \frac{dl}{dl + \mu} \times p_{ml}(t \mid d_i) + \frac{\mu}{dl + \mu} \times p_{ml}(t \mid c) \right) \tag{8}$$
where p_ml is the maximum likelihood probability function, c represents the whole document collection, and dl represents the document length. Furthermore, a
series of query likelihood models (including RM3 and KRM3) are used as retrieval models with Dirichlet prior smoothing, and μ is the
smoothing factor. The RM3 model reevaluates the query LM by using the feedback document retrieved in the first step and then
performs a second round of retrieval. The new query LM is shown in Equation (9):
$$\theta_{Q'} = (1 - \lambda) \times \theta_{Q_0} + \lambda \times \theta_F \tag{9}$$
where λ is as defined in Equation (7) and is used to adjust the contribution weight between the original query and the expansion
terms.
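A sketch of the RM3 term weighting in Equation (8) and the interpolation in Equation (9); μ, λ and the dictionary-based query models are illustrative values, not tuned settings:

```python
import math

def rm3_term_weight(tf_d, doc_len, tf_c, collection_len, mu=1000.0):
    """Equation (8): Dirichlet-smoothed log-probability of term t in doc d_i."""
    p_ml_d = tf_d / doc_len          # p_ml(t|d_i)
    p_ml_c = tf_c / collection_len   # p_ml(t|c)
    return math.log(doc_len / (doc_len + mu) * p_ml_d
                    + mu / (doc_len + mu) * p_ml_c)

def interpolate_query_model(original, feedback, lam=0.5):
    """Equation (9): theta_Q' = (1 - lambda) * theta_Q0 + lambda * theta_F."""
    terms = set(original) | set(feedback)
    return {t: (1 - lam) * original.get(t, 0.0) + lam * feedback.get(t, 0.0)
            for t in terms}

original = {"skin": 0.5, "disease": 0.5}
feedback = {"skin": 0.3, "treatment": 0.4, "hospital": 0.3}
print(interpolate_query_model(original, feedback))
```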
The KRM3 model (Pan et al., 2020) adds the term frequency of cooccurring words and the proximity information of the query to the RM3 model. The KRM3 model generates new queries as given in Equation (10):

$$\theta_{Q'} = (1 - \lambda) \times \theta_{Q_0} + \lambda \times \theta_{F^*} \tag{10}$$

where θ_F* is a feedback LM based on the cooccurrence distribution of query terms and expansion terms.
4. Experimental setup

We first introduce the experimental data in Section 4.1; then, we describe the parameter settings in Section 4.2.
4.1. Datasets

In this section, to test the effectiveness of the five models proposed in this paper, we conduct a series of experiments on four
standard TREC datasets: AP90, AP88-89, DISK4&5, and WT10G. Because these text datasets are different in terms of size and type, the
proposed models can be effectively evaluated. The AP90 dataset contains articles published by the Associated Press in 1990. AP88-89
contains articles published by the Associated Press in 1988 and 1989. The DISK4&5 collection contains newswire articles from
various sources, such as the Associated Press, the Wall Street Journal, and the Financial Times, and is generally considered to contain
high-quality text data with little noise. The DISK4&5 collection is a set of documents on TREC Disks 4 and 5 minus the Congressional
Record documents. The WT10G collection is a medium-sized collection of web documents crawled for the TREC9 and TREC10 web tracks and
contains 10 gigabytes of uncompressed data. The topic numbers associated with each collection are shown in Table 1. Each topic in
these collections contains three fields (i.e., title, description and narrative). In each query, we use only the title, which contains very
few keywords related to the topic.
During retrieval, queries without relevance judgments are deleted. In the experiment, Lucene is used as the main retrieval system. For all
datasets used, each term is stemmed using Porter's English stemmer (Porter, 1980). A total of 418 stop words from the standard InQuery stop word list are removed. MAP and normalized discounted cumulative gain (NDCG) based on the top 1000 documents are
used as the evaluation indexes for our experiment because they facilitate comparing our results to those of other methods, and these
metrics are the primary metrics used in the corresponding TREC evaluation. In addition, we use P@10 for evaluation because, when
browsing results, users tend to pay more attention to documents that rank high than to documents that rank low. All statistical
assessments are based on the Wilcoxon matched-pairs signed-rank test. To further analyze whether our method is beneficial, we
measured the robustness index (RI) (Sakai et al., 2005) of our method against the corresponding baseline model. The RI metric, which
takes values in the interval [−1, 1], is computed as the number of topics improved by using our models minus the number of topics
hurt by our models divided by the number of topics (Valcarce et al., 2019).
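The RI computation is straightforward; below is a sketch under the definition above, with scipy's Wilcoxon signed-rank test included for the significance assessments (the per-topic scores are made-up placeholders):

```python
from scipy.stats import wilcoxon

def robustness_index(baseline_scores, model_scores):
    """RI = (topics improved - topics hurt) / total topics, on paired scores."""
    improved = sum(m > b for b, m in zip(baseline_scores, model_scores))
    hurt = sum(m < b for b, m in zip(baseline_scores, model_scores))
    return (improved - hurt) / len(baseline_scores)

baseline = [0.21, 0.35, 0.10, 0.44, 0.28]
model = [0.25, 0.33, 0.18, 0.51, 0.30]
print(robustness_index(baseline, model))  # (4 - 1) / 5 = 0.6
print(wilcoxon(baseline, model))          # matched-pairs signed-rank test
```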
4.2. Parameter settings

All the PRF retrieval models used in this paper have several hyperparameters that need to be adjusted. For a fair comparison, we
adjusted these hyperparameters to the optimal values. The following parameter settings, which are generally used in the IR domain to construct robust baselines, are used in the baseline models and the proposed models. First, in BM25, the value of b ranges from 0 to 1.0, with
an increment of 0.05, and k1 and k3 are set to 1.2 and 8, respectively, as recommended by Robertson et al. (Robertson et al., 1996).
Second, in the Dirichlet LM, the value of μ ranges from 500 to 2000, with an increment of 50. Finally, based on experience, the
number of top-ranking documents in all PRF models is set to 10, the number of expansion terms is set to |Tf| ∈ {10, 20, 30, 50}, and
the fusion parameters are set to α, λ ∈ {0.0, 0.1, 0.2, …, 1.0}. The parameter σ of the kernel function is set to σ ∈ {10, 25, 50, 80, 100, 200,
500, 1000, 1500}.
Table 1
The TREC tasks and topic numbers associated with each collection (columns: Collection, # of Docs, Size, Queries, # of Queries).
In our framework, we have two hyperparameters. One is N, the number of top-ranked documents selected from the initial ranking for reordering, with values ranging from 500 to 5000 and an increment of 500, yielding 10 different values. The other is the fusion parameter α between the document score calculated by the BM25 method and the document score calculated by BERT when ranking documents in the first round of retrieval.
To verify that the baselines and the method proposed in this paper are scientifically reasonable, we use 2-fold cross-validation in
the experiment. First, for each dataset, we randomly split the queries into equally sized subsets (half the queries were used for
training, and the other half were used for testing). Then, the parametric model learns from the training set and applies this knowledge
to the test set for evaluation. We choose the TREC datasets and topics to ensure fair comparisons with the baselines, and all metrics
and parameters are adjusted using the same experimental settings.
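A sketch of the 2-fold protocol described above; the topic numbers are placeholders:

```python
import random

def two_fold_split(query_ids, seed=42):
    """Randomly split queries into two equally sized folds."""
    ids = list(query_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]

# Tune parameters on fold_a and evaluate on fold_b, then swap the roles
# and average the two test scores.
fold_a, fold_b = two_fold_split(range(301, 351))
```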
5. Experimental results and analysis

In this section, we report our experimental results and conduct an extensive analysis. We compare our models with the probability-based PRF models in Section 5.1 and with the LM-based PRF models in Section 5.2. Furthermore, we compare the MAP values of the four TREC collections in three steps (relevance matching, relevance matching and semantic matching, and after PRF) in Section 5.3 and compare the neural IR models in Section 5.4. Finally, we investigate the parameter sensitivity in Sections 5.5–5.7. Specifically, we investigate the sensitivity of parameter N for our five models in Section 5.5; the sensitivity of parameter α in the five models in Section 5.6; and the sensitivity of parameter σ for BKRoc, BKRM3, and BPRoc2 in Section 5.7. For the five models, we give the empirical values that improve the performance of the models. In Section 5.8, we analyze the interaction of the parameters.
5.1. Comparison with probability-based PRF models

Since our framework is applied to five different PRF models, there are five different models (i.e., BRoc, BPRoc2, BKRoc, BRM3, and BKRM3) to analyze. To test the effectiveness of the combination of our framework and the probability-based PRF models, the three improved models (BRoc, BPRoc2 and BKRoc) were first compared with the three robust baseline PRF models (Rocchio+BM25, PRoc2 and KRoc). Table 2 shows the comparison of MAP values on the four datasets for different numbers of expansion terms.
The results in Table 2 show that the performance of the six probability-based PRF models generally improves as the number of
expansion terms increases. When comparing the number of expansion terms, the results show that the optimal MAP value is obtained
when the number of expansion terms is 30 or 50 for most datasets. However, for the WT10G dataset, smaller |Tf| values yield better MAP results for all six models. The comparison of the three proposed models (BRoc, BPRoc2, and BKRoc) and the
corresponding robust baseline models (Rocchio+BM25, PRoc2 and KRoc) indicates that the MAP values are improved with the
proposed models. This improvement was statistically significant for all datasets. Specifically, compared with the MAP value of the
corresponding baseline model, BRoc increased the MAP by the highest percentage, reaching 20.84% and 17.03% for the AP90 and
DISK4&5 datasets, respectively. BPRoc2 yielded the largest improvement in MAP for AP90 at 9.21%. BKRoc obtained the largest
improvement for the DISK4&5 dataset at 13.53%.
In addition, the experimental results of the three proposed improved models (BRoc, BPRoc2 and BKRoc) and the three baseline PRF models (Rocchio+BM25, PRoc2 and KRoc) are compared based on P@10, and the results are shown in Table 3. The average P@10 of the three proposed models is improved compared to that of the corresponding baseline models for the four datasets. The largest
improvements are observed for the DISK4&5 dataset at 15.50%, 9.58%, and 17.98%.
To further evaluate the effectiveness of our method, we use the average NDCG of the top 1000 documents with different feedback
terms as an evaluation indicator to compare the experimental results of our models with those of the corresponding baseline models.
The results are shown in Table 4. The experimental results show that our three improved models are significant improvements over
the corresponding baseline models in terms of NDCG on the four TREC datasets. Therefore, the method proposed in this paper is
superior to traditional probability-based PRF models.
To analyze the robustness of our models, we measured their RIs for comparison against that of the corresponding baseline model.
The comparison results are shown in Table 4. All our models have positive RI values on all four datasets. Thus, our PRF approach
combining relevance matching and semantic matching tends to make queries better rather than worse. In addition, comparing the RI
values of different models shows that BKRoc achieves the highest RI value on WT10G, and BRoc achieves the highest RI value on the
other three datasets.
5.2. Comparison with LM-based PRF models

To further assess the performance of the framework, we apply our framework to the LM-based PRF models. The results of the comparison of the proposed BRM3 and BKRM3 models with RM3 and KRM3 for the four datasets are shown in Tables 5–7.
As presented in Table 5, the MAP values of BRM3 and BKRM3 are improved compared with those of RM3 and KRM3 for the four
datasets. Compared with the corresponding models (RM3 and KRM3), BRM3 and BKRM3 performed best on DISK4&5, with average improvement rates of 13.59% and 11.01%, respectively. In addition, the MAP values of the four LM-based PRF methods
generally improved as the number of expansion terms increased.
Furthermore, the experimental results of the two proposed LM-based PRF models (BRM3 and BKRM3) are compared with those of
two robust LM-based PRF models (RM3 and KRM3) based on P@10, as shown in Table 6. The two proposed models yield a higher
Table 2
Comparison of MAP values obtained by robust baselines (Rocchio+BM25, PRoc2, and KRoc) on four TREC collections. The MAP value changes with
the number of expansion terms (|Tf| ∈ {10, 20, 30, 50}). "Avg" in the last row of each collection denotes the average MAP performance for that
dataset. The values in parentheses represent the improvements over Rocchio+BM25, PRoc2, or KRoc. "*", "+" and "#" denote statistically significant
improvement over Rocchio+BM25, PRoc2, or KRoc, respectively. (Wilcoxon signed-rank test at p < 0.05).
Model |Tf| AP90 AP88-89 DISK4&5 WT10G
average P@10 than the corresponding baseline models based on the four datasets. BRM3 and BKRM3 yield the largest improvement
at 14.50% and 14.56%, respectively, for DISK4&5.
To further evaluate the effectiveness of our method, we use the NDCG value as an evaluation indicator to compare the experimental results of our models with those of the corresponding baseline models. The results are shown in Table 7. The experimental
results show that our improved BRM3 and BKRM3 models are significant improvements over the corresponding baseline models in
terms of the NDCG values obtained on the four TREC datasets. Therefore, it is reasonable to conclude that BRM3 and BKRM3 are
superior to the corresponding LM-based PRF models.
To analyze the robustness of our models, we measured their RIs against the corresponding baseline model. The comparison results
are shown in Table 7. All our models have positive RI values on all four datasets. Thus, our PRF approach combining relevance
matching and semantic matching tends to improve rather than worsen queries. In addition, comparing the RI values of different
models shows that BRM3 achieves the highest RI value on all datasets.
5.3. Comparison of the results in three steps

In this section, to intuitively show the effect of the proposed method, we compare the MAP and P@10 values of relevance
Table 3
Comparison of P@10 values obtained by robust baselines (Rocchio+BM25, PRoc2 and KRoc) on the four TREC collections. The P@10 value changes
with the number of expansion terms (|Tf| ∈ {10, 20, 30, 50}). "Avg" in the last row of each collection denotes the average P@10 performance for that
dataset. The values in parentheses represent the improvements over Rocchio+BM25, PRoc2, or KRoc. "*", "+" and "#" denote statistically significant
improvement over Rocchio+BM25, PRoc2, or KRoc, respectively. (Wilcoxon signed-rank test at p < 0.05).
Model |Tf| AP90 AP88-89 DISK4&5 WT10G
Table 4
Comparison of NDCG and RI values obtained by robust baselines (Rocchio+BM25, PRoc2, and KRoc) on the four TREC collections. NDCG denotes
the average NDCG value for different numbers of expansion terms (|Tf| ∈ {10, 20, 30, 50}) and different datasets. The values in bold represent the
best results on the corresponding dataset. "*", "+" and "#" denote statistically significant improvement over Rocchio+BM25, PRoc2, or KRoc,
respectively. (Wilcoxon signed-rank test at p < 0.05).
Collection Metric Roc BRoc PRoc2 BPRoc2 KRoc BKRoc
Table 5
Comparison of MAP values obtained by robust baselines (RM3 and KRM3) on the four TREC collections. The MAP value changes with the number of
expansion terms (|Tf| ∈ {10, 20, 30, 50}). "Avg" in the last row of each collection denotes the average MAP performance for that dataset. The values
in parentheses represent improvements over RM3 or KRM3. "*" and "#" denote statistically significant improvement over RM3 or KRM3, respec-
tively. (Wilcoxon signed-rank test at p < 0.05).
Model |Tf| AP90 AP88-89 DISK4&5 WT10G
matching, combined relevance and semantic matching, and the combined method after PRF. The experimental results are shown in Tables 8 and 9.
The results in Tables 8 and 9 show that the MAP yielded by the proposed method of combined relevance matching and semantic
matching is a significant improvement over the MAP yielded by the BM25 model for AP90, AP88-89, DISK4&5, and WT10G. The MAP
values of the five methods after PRF are compared with the MAP values after the first round of retrieval, and the five models produce
statistically significant improvements on all datasets. The P@10 values after the second round of retrieval are a significant improvement over the P@10 values after the first round for most datasets. The results indicate that the proposed BERT-based reranking
method of combined relevance matching and semantic matching is effective and that the KRoc model combined with the proposed
method yields better MAP and P@10 results than the other four models.
5.4. Comparison with neural IR models

To further verify the effectiveness of the method proposed in this paper, we compared the proposed method with neural IR models; this comparison is shown in Table 10. The results of the comparison models (DSSM, CDSSM, and DRMM) in Table 10 are derived from the results reported in the corresponding papers, which used MAP and P@20 as the metrics. To make a more intuitive comparison, we use P@20 instead of P@10 as the evaluation index in Table 10. DSSM and CDSSM use the semantic matching method,
and DRMM uses the relevance matching method. The five proposed models perform better than the DSSM, CDSSM and DRMM
models based on MAP and P@20. The result shows that the PRF framework combining relevance matching and semantic matching is
effective. In addition, BKRM3 performs the best.
5.5. Sensitivity of parameter N

Parameter N is the number of documents used in the BM25 method in the first round of retrieval. Then, these documents are
reranked by BERT. The larger the value of N, the better the model reflects the real retrieval environment. To analyze the influence of
N on the performance of the proposed model, we investigate N values from 500 to 5000 and observe the MAP of the five models for
different values of |Tf|. As seen in Figs. 3, 4, 5, 6 and 7, the five models display relatively stable performance as the value of N varies.
Moreover, BRoc, BRM3, and BPRoc2 are more stable than BKRoc and BKRM3 when N varies.
Specifically, for the BRoc model at an N value of 1000, the optimal MAP value is obtained for AP90. The optimal MAP value for
Table 6
Comparison of P@10 values obtained by robust baselines (RM3 and KRM3) on the four TREC collections. The P@10 value changes with the number
of expansion terms (|Tf| ∈ {10, 20, 30, 50}). "Avg" in the last row of each collection denotes the average P@10 performance for that dataset. The
values in parentheses represent improvements over RM3 or KRM3. "*" and "#" denote statistically significant improvement over RM3 or KRM3,
respectively. (Wilcoxon signed-rank test at p < 0.05).
Model |Tf| AP90 AP88-89 DISK4&5 WT10G
Table 7
Comparison of NDCG and RI values obtained by robust baselines (RM3 and KRM3) on the four TREC collections. NDCG denotes the average NDCG
value for different numbers of expansion terms (|Tf| ∈ {10, 20, 30, 50}) and different datasets. The bold values represent the best results on the
corresponding dataset. "*" and "#" denote statistically significant improvement over RM3 or KRM3, respectively. (Wilcoxon signed-rank test at p <
0.05).
Collection Metric RM3 BRM3 KRM3 BKRM3
AP88-89 is obtained when N is 2000. Additionally, for DISK4&5, the optimal value is obtained when N is greater than 3500, and for
WT10G, the optimal value is obtained when N is greater than 4500. The larger the dataset, the larger the value of N that must be
selected to yield optimal performance with the BRoc model.
5.6. Sensitivity of parameter α

The parameter α is the adjustment factor between the relevance matching method and the semantic matching method in
Equation (1). The smaller the value of α, the higher the contribution of the BERT method in our framework. Thus, semantic matching
plays the more significant role. To analyze the influence of the fusion ratio of BM25 and BERT on the first round of retrieval results,
we designed an experiment and fixed the parameter N to 5000. w1, w2, w3 and w4 are set to 1.0, 0.8, 0.9 and 0.9, respectively.
Additionally, α is set from 0 to 1.0. The MAP and P@10 results experimentally generated based on four different TREC datasets are
shown in Figs. 8 and 9, respectively.
Table 8
Comparison of MAP values obtained on the four TREC collections in three steps (relevance matching, combined relevance matching and semantic
matching, and the combined method after PRF). The results in column BM25 are the MAP values based on relevance matching. The BM25+BERT
column denotes the MAP values based on relevance matching and semantic matching, and the values in parentheses in this column indicate the
percent improvements after the first round of retrieval compared to the BM25 values. The results in the last five columns represent the average MAP
after PRF, and the values in parentheses represent the percent improvements over the results of the first round of retrieval. "*" denotes a statistically
significant improvement over BM25, and "#" denotes a statistically significant improvement over BM25+BERT (Wilcoxon signed-rank test at p <
0.05).
Collection   BM25     BM25+BERT            BRoc                BPRoc2              BKRoc                BRM3                BKRM3
AP90         0.2688   0.3328* (+23.81%)    0.3494# (+4.99%)    0.3443# (+3.46%)    0.3582# (+7.63%)     0.3360# (+0.96%)    0.3490# (+4.84%)
AP88-89      0.2831   0.3305* (+16.74%)    0.3327# (+0.67%)    0.3375# (+2.12%)    0.3474# (+5.11%)     0.3434# (+3.90%)    0.3540# (+7.20%)
DISK4&5      0.2293   0.2706* (+18.01%)    0.2727# (+0.78%)    0.2761# (+2.03%)    0.2978# (+10.05%)    0.2909# (+7.50%)    0.2980# (+10.09%)
WT10G        0.2050   0.2209* (+7.76%)     0.2249# (+1.81%)    0.2369# (+7.24%)    0.2406# (+8.92%)     0.2240# (+1.40%)    0.2540# (+14.89%)
Table 9
Comparison of P@10 values obtained on the four TREC collections in three steps (relevance matching, combined relevance matching and semantic
matching, and the combined method after PRF). The results in column BM25 are the P@10 values based on relevance matching. The BM25+BERT
column shows the P@10 values based on relevance matching and semantic matching, and the values in parentheses in this column indicate the
percent improvements after the first round of retrieval over the values for BM25. The results in the last five columns represent the average P@10
values after PRF, where the values in parentheses represent the percent improvements over the results of the first round of retrieval. "*" denotes a
statistically significant improvement over BM25, and "#" denotes a statistically significant improvement over BM25+BERT (Wilcoxon signed-rank
test at p < 0.05).
Collection BM25 BM25+BERT BRoc BPRoc2 BKRoc BRM3 BKRM3
Table 10
Comparison of MAP and P@20 values obtained by the baseline model, proposed models, and neural IR models on the Robust04 dataset. SM stands
for semantic matching method, and RM stands for relevance matching method. The values in parentheses represent the percent improvements over
the results of BM25. "*" denotes a statistically significant improvement over BM25 (Wilcoxon signed-rank test at p < 0.05). The references are as
follows: DSSM (Huang et al., 2013) (P.-S.) [1], CDSSM (Y. Shen et al., 2014) [2] and DRMM (Guo et al., 2016) [3].
Models RM or SM MAP P@20
BM25 RM 0.2553 0.3554
As shown in Figs. 8 and 9, the MAP and P@10 results for the four datasets gradually increase as α increases from 0. When α is set
to 0.9, MAP reaches a maximum and then begins to decline. An α value of 0 indicates that documents are ranked only by the BERT
method. An α value of 1.0 indicates that documents are ranked only by BM25. When only BERT is used, the result is not as good as
when only BM25 is used. However, when BERT and BM25 are combined at a ratio of 1:9, we can achieve the best results. The
experiments show that relevance matching or semantic matching alone is not ideal, but the model can achieve improved performance
when these matching processes are combined in a certain proportion. The recommended setting of the fusion parameter α for
relevance matching and semantic matching is approximately 0.9. The results also indicate that relevance matching plays a more
important role than semantic matching.
Fig. 3. The sensitivity of parameter N when the BRoc model takes different |Tf| values
Fig. 4. The sensitivity of parameter N when the BRM3 model takes different |Tf| values
5.7. Sensitivity of parameter σ

Gaussian kernel functions are used in the BKRoc, BKRM3 and BPRoc2 methods. Therefore, sensitivity to the kernel parameter σ is
an important factor affecting the robustness of the three models. In the BKRoc and BKRM3 models, a kernel function is used to
Fig. 5. The sensitivity of parameter N when the BPRoc2 model takes different |Tf| values
Fig. 6. The sensitivity of parameter N when the BKRM3 model takes different |Tf| values
estimate the weights that influence the cooccurrence relationship between query items and expansion terms. Thus, we focus on the
kernel parameter σ. Figs. 10–12 illustrate the three proposed models’ performance based on different datasets for different σ values. σ
is set from 10 to 1500, and the number of feedback terms |Tf| is set to 10, 20, 30 and 50. The values of the two parameters
considerably affect the performance of the PRF model. The number of feedback terms in each set varies, as does the corresponding
curve.
Fig. 7. The sensitivity of parameter N when the BKRoc model takes different |Tf| values
Fig. 8. The sensitivity of the fusion parameter α to MAP for the results of BERT-based reranking based on four datasets
As shown in Figs. 10 and 11, the BKRoc and BKRM3 models display optimal performance in the range σ ∈ (10, 30). In general,
BKRoc performance stabilizes when the σ value is relatively large. This result is consistent with that for KRoc. Moreover, the performance of the model is also affected by the number of expansion terms. The result when |Tf| is 10 is not as good as the result
obtained when |Tf| is 20, 30, or 50. However, BKRoc and BKRM3 perform better based on WT10G when |Tf| is 10. In Fig. 12, the
BPRoc2 model tends to yield the best results in the range σ ∈ (100, 200).
Similarly, BPRoc2 stabilizes as σ increases. Moreover, its performance is affected by the number of expansion terms. The larger the value of |Tf|, the better the performance of the BPRoc2 model. However, for WT10G, BPRoc2 performs best when |Tf| is 10.
Fig. 9. The sensitivity of the fusion parameter α to P@10 for the results of BERT-based reranking based on four datasets
Fig. 10. The sensitivity of the BKRoc model to the kernel parameter σ. The model is evaluated on four TREC datasets and with different |Tf| values
5.8. Interaction of the parameters

In Sections 5.5–5.7, we analyzed the change in the MAP values of our models with different levels of parameters N, α and σ. In this section, we investigate whether combinations of the three parameters at different levels interactively affect the MAP values of the models. The values of parameter α are selected from {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. The values of parameter N are selected from {500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000}. The values of parameter σ are selected from {10, 25, 50, 80, 100, 200, 500, 1000,
1500}. The values of parameters N, α and σ are combined to obtain 990 combinations. We conduct experiments with these parameter
combinations and obtain the average MAP values of the BKRM3, BKRoc and BPRoc2 models on the AP90 dataset. The experiment is
Fig. 11. The sensitivity of the BKRM3 model to the kernel parameter σ. The model is evaluated on four TREC datasets and with different |Tf| values
Fig. 12. The sensitivity of the BPRoc2 model to the kernel parameter σ. The model is evaluated on four TREC datasets and with different |Tf| values
repeated three times. Through three-way analysis of variance (ANOVA), the p values of parameters N, α and σ and of their two-way and three-way interactions are shown in Table 11. The results show that, for parameters α and σ, the p values of the three models are much less than 0.05, whereas for parameter N, the p values of the three models are greater than 0.05. Therefore, different levels of α and σ significantly affect the MAP results of our models, while different levels of N do not. In addition, as Table 11 shows, the interaction of parameters σ and α significantly affects the three models, while the interaction of parameters σ and N and the interaction of parameters α and N do not significantly affect the three models. Additionally, the interaction of all three parameters α, σ and N does not significantly affect the average MAP values of the BKRM3, BKRoc and BPRoc2 models.

Table 11
The three-way ANOVA results (p values) for parameters N, α and σ and their interactions, for the BKRM3, BKRoc and BPRoc2 models.
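A sketch of the three-way ANOVA described above using statsmodels; the results file and its columns are hypothetical placeholders for the 990 parameter combinations run in triplicate:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical file: one row per run, with factor levels N, alpha, sigma
# and the measured MAP value.
df = pd.read_csv("map_runs.csv")

model = ols("MAP ~ C(N) * C(alpha) * C(sigma)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)  # p values for main effects and all interactions
```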
6. Conclusion

In this paper, a PRF framework that combines relevance matching and semantic matching is proposed to enhance retrieval
performance. The experiments conducted in this study suggest that, compared to using only one matching method, using relevance
matching and semantic matching in combination can achieve optimal retrieval performance. Moreover, the model proposed in this
paper effectively increases retrieval performance by comparing the MAP and P@10 results obtained using the five enhanced models
and the corresponding robust baseline models on four TREC datasets. Additionally, through analyses of the parameters N, α and σ, we
give the empirical values that yield optimal model performance.
Our work has both theoretical and practical implications. It first provides further empirical support regarding the role of relevance
matching and semantic matching in IR. We believe we have taken the first steps in incorporating both relevance matching and
semantic matching in PRF. Furthermore, this work shows that using both relevance matching and semantic matching is more effective than using either relevance matching or semantic matching alone to improve the quality of feedback documents. The proposed
PRF methods can be especially effective at increasing the precision of returning the top 10 documents and the MAP of the top 1000
results.
This paper focuses on combining relevance matching and semantic matching for retrieval instead of studying different semantic
matching methods. Therefore, BERT, one of the most recently proposed pretraining models, is used as the semantic matching component in the proposed PRF framework. In the future, we will study and discuss cases in which different
semantic matching models are combined with relevance matching methods.
CRediT authorship contribution statement

Junmei Wang: Conceptualization, Methodology, Writing - original draft. Min Pan: Conceptualization, Validation, Writing -
review & editing. Tingting He: Methodology, Funding acquisition, Supervision. Xiang Huang: Validation. Xueyan Wang: Data
curation. Xinhui Tu: Supervision, Writing - review & editing.
Acknowledgments
This research is supported by the National Natural Science Foundation of China (61532008), the National Natural Science
Foundation of China (61572223), the National Key Research and Development Program of China (2017YFC0909502), and Wuhan
Science and Technology Program (2019010701011392). This work was partly supported by the innovation team of the basic intelligent education service innovation model and technology research at Hubei Normal University.
References
Basile, P., Caputo, A., & Semeraro, G. (2011). Negation for document re-ranking in ad-hoc retrieval. Lecture Notes in Computer Science (Including Subseries Lecture Notes
in Artificial Intelligence and Lecture Notes in Bioinformatics), 6931 LNCS, 285–296. https://doi.org/10.1007/978-3-642-23318-0_26.
Chen, Q., Hu, Q., Huang, J. X., & He, L. (2018). CA-RNN: Using context-aligned recurrent neural networks for modeling sentence similarity. 32nd AAAI Conference on
Artificial Intelligence, AAAI 2018 (pp. 265–273). www.aaai.org.
Colace, F., De Santo, M., Greco, L., & Napoletano, P. (2015). Improving relevance feedback-based query expansion by the use of a weighted word pairs approach.
Journal of the Association for Information Science and Technology, 66(11), 2223–2234. https://doi.org/10.1002/asi.23331.
Daoud, M., & Huang, J. X. (2013). Modeling geographic, temporal, and proximity contexts for improving geotemporal search. Journal of the American Society for
Information Science and Technology, 64(1), 190–212. https://doi.org/10.1002/asi.22648.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-
HLT 2019, 4171–4186. http://arxiv.org/abs/1810.04805.
Guo, J., Fan, Y., Ai, Q., & Croft, W. B. (2016). A deep relevance matching model for Ad-hoc retrieval. International Conference on Information and Knowledge
Management, Proceedings, 55–64. https://doi.org/10.1145/2983323.2983769.
Guo, J., Fan, Y., Pang, L., Yang, L., Ai, Q., Zamani, H., Wu, C., Croft, W. B., & Cheng, X. (2019). A Deep Look into Neural Ranking Models for Information Retrieval.