Learning To Rank Short Text
Neural Networks
ABSTRACT
Learning a similarity function between pairs of objects is at the core of learning to rank approaches. In information retrieval tasks we typically deal with query-document pairs, and in question answering with question-answer pairs. However, before learning can take place, such pairs need to be mapped from the original space of symbolic words into some feature space encoding various aspects of their relatedness, e.g., lexical, syntactic and semantic. Feature engineering is often a laborious task and may require external knowledge sources that are not always available or are difficult to obtain. Recently, deep learning approaches have gained a lot of attention from the research community and industry for their ability to automatically learn optimal feature representation for a given task, while claiming state-of-the-art performance in many tasks in computer vision, speech recognition and natural language processing. In this paper, we present a convolutional neural network architecture for reranking pairs of short texts, where we learn the optimal representation of text pairs and a similarity function to relate them in a supervised way from the available training data. Our network takes only words in the input, thus requiring minimal preprocessing. In particular, we consider the task of reranking short text pairs where elements of the pair are sentences. We test our deep learning system on two popular retrieval tasks from TREC: Question Answering and Microblog Retrieval. Our model demonstrates strong performance on the first task, beating previous state-of-the-art systems by about 3% absolute points in both MAP and MRR, and shows comparable results on tweet reranking, while enjoying the benefits of no manual feature engineering and no additional syntactic parsers.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval; I.5.1 [Pattern Recognition]: Models - Neural nets

Keywords
Convolutional neural networks; learning to rank; Question Answering; Microblog search

∗ The work was carried out at University of Trento.
† Professor at University of Trento, DISI.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SIGIR’15, August 09 - 13, 2015, Santiago, Chile.
© 2015 ACM. ISBN 978-1-4503-3621-5/15/08 ...$15.00.
DOI: http://dx.doi.org/10.1145/2766462.2767738.

1. INTRODUCTION
Encoding query-document pairs into discriminative feature vectors that are input to a learning-to-rank algorithm is a critical step in building an accurate reranker. The core assumption is that relevant documents have high semantic similarity to the queries and, hence, the main effort lies in mapping a query and a document into a joint feature space where their similarity can be efficiently established.

The most widely used approach is to encode input text pairs using many complex lexical, syntactic and semantic features and then compute various similarity measures between the obtained representations. For example, in answer passage reranking, [31] employ complex linguistic features, modelling syntactic and semantic information as bags of syntactic and semantic role dependencies, and build similarity and translation models over these representations.

However, the choice of representations and features is a completely empirical process, driven by intuition, experience and domain expertise. Moreover, although using syntactic and semantic information has been shown to improve performance, it can be computationally expensive and requires a large number of external tools (syntactic parsers, lexicons, knowledge bases, etc.). Furthermore, adapting to new domains requires additional effort to tune feature extraction pipelines and to add new resources that may not even exist.

Recently, it has been shown that the problem of semantic text matching can be efficiently tackled using distributional word matching, where a large number of lexical semantic resources are used for matching questions with a candidate answer [33].

Deep learning approaches generalize the distributional word matching problem to matching sentences and take it one step further by learning the optimal sentence representations for a given task. Deep neural networks are able to effectively capture the compositional process of mapping the meaning of individual words in a sentence to a continuous representation of the sentence. In particular, it has been recently shown that convolutional neural networks are able to efficiently learn to embed input sentences into a low-dimensional vector space preserving important syntactic and semantic aspects of the input sentence, which leads to state-of-the-art results in many NLP tasks [18, 19, 38]. Perhaps one of the greatest advantages of deep neural networks is that they are trained in an end-to-end fashion, thus removing the need for manual feature engineering and greatly reducing the need for adapting to new tasks and domains.

In this paper, we describe a novel deep learning architecture for reranking short texts, where questions and documents are limited to a single sentence. The main building blocks of our architecture are two distributional sentence models based on convolutional neural networks. These underlying sentence models work in parallel, mapping queries and documents to their distributional vectors, which are then used to learn the semantic similarity between them.
The distinctive properties of our model are: (i) we use a state-of-the-art distributional sentence model for learning to map input sentences to vectors, which are then used to measure the similarity between them; (ii) our model encodes query-document pairs in a rich representation using not only their similarity score but also their intermediate representations; (iii) the architecture of our network makes it straightforward to include any additional similarity features to the model; and finally (iv) our model does not require manual feature engineering or external resources. We only require to initialize word embeddings from some large unsupervised corpora¹.

Our sentence model is based on a convolutional neural network architecture that has recently shown state-of-the-art results on many NLP sentence classification tasks [18, 19]. However, our model uses it only to generate intermediate representations of input sentences for computing their similarity. To compute the similarity score we use an approach used in the deep learning model of [38], which recently established new state-of-the-art results on the answer sentence selection task. However, their model operates only on unigrams or bigrams, while our architecture learns to extract and compose n-grams of higher degrees, thus allowing for capturing longer range dependencies. Additionally, our architecture uses not only the intermediate representations of questions and answers to compute their similarity but also includes them in the final representation, which constitutes a much richer representation of the question-answer pairs. Finally, our model is trained end-to-end, while in [38] the output of the deep learning model is used to learn a logistic regression classifier.

We test our model on two popular retrieval tasks from TREC: answer sentence selection and Microblog retrieval. Our model shows a considerable improvement on the first task, beating recent state-of-the-art systems. On the second task, our model demonstrates that previous state-of-the-art retrieval systems can benefit from using our deep learning model.

In the following, we give a problem formulation and provide a brief overview of learning to rank approaches. Next, we describe our deep learning model and describe our experiments.

2. LEARNING TO RANK
This section briefly describes the problem of reranking text pairs which encompasses a large set of tasks in IR, e.g., answer sentence selection in question answering, microblog retrieval, etc. We argue that deriving an efficient representation of query-document pairs required to train a learning to rank model plays an important role in training an accurate reranker.

where function $\psi(\cdot)$ maps query-document pairs to a feature vector representation where each component reflects a certain type of similarity, e.g., lexical, syntactic, and semantic. The weight vector $w$ is a parameter of the model and is learned during the training.

2.2 Learning to Rank approaches
There are three common approaches in IR to learn the ranking function $h$, namely, pointwise, pairwise and listwise.

The pointwise approach is perhaps the simplest way to build a reranker: the training instances are triples $(q_i, d_{ij}, y_{ij})$ and it is enough to train a binary classifier $h(w, \psi(q_i, d_{ij})) \to y_{ij}$, where $\psi$ maps a query-document pair to a feature vector and $w$ is a vector of model weights.

The decision function $h(\cdot)$ typically takes a linear form, simply computing a dot product between the model weights $w$ and a feature representation of a pair generated by $\psi(\cdot)$. At test time, the learned model is used to classify unseen pairs $(q_i, d_{ij})$, where the raw scores are used to establish the global rank $R$ of the documents in the retrieved set. This approach is widely used in practice because of its simplicity and effectiveness.

A more advanced approach to reranking is pairwise, where the model is explicitly trained to score correct pairs higher than incorrect pairs with a certain margin:

$$h(w, \psi(q_i, d_{ij})) \ge h(w, \psi(q_i, d_{ik})) + \epsilon,$$

where document $d_{ij}$ is relevant and $d_{ik}$ is not. Conceptually similar to the pointwise method described above, the pairwise approach exploits more information about the ground truth labelling of the input candidates. However, it requires considering a larger number of training instances (potentially quadratic in the size of the candidate document set) than the pointwise method, which may lead to slower training times. Still, both pointwise and pairwise approaches ignore the fact that ranking is a prediction task on a list of objects.

The third method, referred to as a listwise approach [6], treats a query with its list of candidates as a single instance in learning, and is thus able to capture considerably more information about the ground truth ordering of input candidates.

While pairwise and listwise approaches claim to yield better performance, they are more complicated to implement and less efficient to train. Most often, producing a better representation $\psi(\cdot)$ that encodes various aspects of similarity between the input query-document pairs plays a far more important role in training an accurate reranker than choosing between different ranking approaches. Hence, in this paper we adopt a simple pointwise method for reranking and focus on modelling a rich representation of query-document pairs using deep learning approaches, which is described next.
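To make the contrast between the pointwise and pairwise formulations concrete, the following is a minimal Python sketch, not the implementation used in this paper. It assumes a linear decision function and a hinge-style penalty for the pairwise margin constraint; the function names, the toy feature vectors and the margin value are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def score(w: Sequence[float], features: Sequence[float]) -> float:
    """Linear decision function h(w, psi(q, d)): a dot product."""
    return sum(wi * xi for wi, xi in zip(w, features))


def rank_candidates(w: Sequence[float],
                    candidate_features: Sequence[Sequence[float]]) -> List[int]:
    """Pointwise reranking: sort candidate indices by descending raw score."""
    scores = [score(w, x) for x in candidate_features]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def pairwise_hinge_loss(w: Sequence[float],
                        relevant: Sequence[float],
                        non_relevant: Sequence[float],
                        margin: float = 1.0) -> float:
    """Zero when the relevant document outscores the non-relevant one
    by at least `margin`; grows linearly with the violation otherwise."""
    return max(0.0, margin - (score(w, relevant) - score(w, non_relevant)))


# Hypothetical weights and two-dimensional feature vectors psi(q, d)
# for three candidate documents of one query.
w = [0.5, 2.0]
candidates = [[1.0, 0.1], [0.2, 0.9], [0.4, 0.4]]

print(rank_candidates(w, candidates))  # [1, 2, 0]
print(pairwise_hinge_loss(w, candidates[1], candidates[0]))  # 0.0
```

In this sketch the pointwise model only needs one score per candidate, whereas the pairwise loss has to be evaluated for every (relevant, non-relevant) pair of a query, which is where the potentially quadratic number of training instances comes from.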
$$\mathrm{sim}(x_q, x_d) = x_q^{T} M x_d, \qquad (2)$$

where $M \in \mathbb{R}^{d \times d}$ is a similarity matrix. The Eq. 2 can be viewed

namely the word embeddings matrix W, filter weights and biases of the convolutional layers, similarity matrix M, weights and biases of the hidden and softmax layers.
Figure 2: Our deep learning architecture for reranking short text pairs.
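The bilinear similarity of Eq. 2 can be illustrated with a short Python sketch. It is an illustration under simplifying assumptions rather than the code of our network: vectors and the matrix M are plain Python lists, the function name is hypothetical, and M is fixed by hand here whereas in the model it is a learned parameter.

```python
def bilinear_similarity(x_q, M, x_d):
    """Compute sim(x_q, x_d) = x_q^T M x_d using plain Python lists."""
    if len(M) != len(x_q) or any(len(row) != len(x_d) for row in M):
        raise ValueError("M must have shape len(x_q) x len(x_d)")
    # First transform the document vector: M x_d.
    transformed = [sum(m * v for m, v in zip(row, x_d)) for row in M]
    # Then take the dot product with the query vector.
    return sum(q * t for q, t in zip(x_q, transformed))


# Hypothetical 2-dimensional sentence vectors.
x_q = [1.0, 2.0]
x_d = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

print(bilinear_similarity(x_q, identity, x_d))  # 11.0
print(bilinear_similarity(x_q, swap, x_d))      # 10.0
```

With M set to the identity the score reduces to an ordinary dot product; a different M changes which components of the query and document vectors are matched against each other.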
The parameters of the network are optimized with stochastic gradient descent (SGD) using the backpropagation algorithm to compute the gradients. To speed up the convergence rate of SGD, various modifications to the update rule have been proposed: momentum, Adagrad [12], Adadelta [39], etc. Adagrad scales the learning rate of SGD on each dimension based on the $\ell_2$ norm of the history of the error gradient. Adadelta uses both the error gradient history like Adagrad and the weight update history. It has the advantage of not having to set a learning rate at all.

3.4 Regularization
While neural networks have a large capacity to learn complex decision functions, they tend to easily overfit, especially on small and medium sized datasets. To mitigate the overfitting issue we augment the cost function with $\ell_2$-norm regularization terms for the parameters of the network.

We also experiment with another popular and effective technique to improve regularization of the NNs: dropout [30]. Dropout prevents feature co-adaptation by setting to zero (dropping out) a portion of hidden units during the forward phase when computing the activations at the softmax output layer. As suggested in [14], dropout acts as an approximate model averaging.

4. EXPERIMENTS AND EVALUATION
We evaluate our deep learning model on two popular retrieval benchmarks from TREC: answer sentence selection and TREC microblog retrieval.

4.1 Training and hyperparameters
The parameters of our deep learning model (chosen on a dev set of the answer sentence selection dataset) were as follows: the width m of the convolution filters is set to 5 and the number of convolutional feature maps is 100. We use the ReLU activation function and a simple max-pooling. The size of the hidden layer is equal to the size of the $x_{join}$ vector obtained after concatenating query and document vectors from the distributional models, similarity score and additional features (if used).

To train the network we use stochastic gradient descent with shuffled mini-batches. We eliminate the need to tune the learning rate by using the Adadelta update rule [39]. The batch size is set to 50 examples. The network is trained for 25 epochs with early stopping, i.e., we stop the training if no update to the best accuracy on the dev set has been made for the last 5 epochs. The accuracy computed on the dev set is the MAP score. At test time we use the parameters of the network that were obtained with the best MAP score on the development (dev) set, i.e., we compute the MAP score after every 10 mini-batch updates and save the network parameters if a new best dev MAP score was obtained. In practice, the training converges after a few epochs. We set the L2 regularization term to 1e-5 for the parameters of convolutional layers and 1e-4 for all the others. The dropout rate is set to p = 0.5.

4.2 Word embeddings
While our model allows for learning the word embeddings directly for a given task, we keep the word matrix parameter W static. This is due to a common experience that the minimal size of the dataset required for tuning the word embeddings for a given task should be at least in the order of hundreds of thousands, while in our case the number of query-document pairs is one order of magnitude smaller. Hence, similar to [11, 19, 38], we keep the word embeddings fixed and initialize the word matrix W from an unsupervised neural language model. We choose the dimensionality of our word embeddings to be 50, in line with the deep learning model of [38].

4.3 Size of the model
Given that the dimensionality of the word embeddings is 50, the number of parameters in the convolution layer of each sentence model is 100 × 5 × 50. Hence, the total number of parameters in each of the two convolutional networks that map sentences to vectors is 25k. The similarity matrix is $M \in \mathbb{R}^{100 \times 100}$, which adds another 10k parameters to the model. The fully connected hidden layer and the softmax layer add about 40k parameters. Hence the total number of parameters in the network is about 100k.

Table 1: Summary of TREC QA datasets for answer reranking.
Data         # Questions   # QA pairs   % Correct
TRAIN-ALL    1,229         53,417       12.0%
TRAIN        94            4,718        7.4%
DEV          82            1,148        19.3%
TEST         100           1,517        18.7%
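The parameter counts quoted in Section 4.3 can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope illustration under stated assumptions: bias terms are ignored, the join vector is taken to hold the two 100-dimensional sentence vectors plus one similarity score (no additional features), the hidden layer is square in the join size, and the softmax is assumed to have two output classes. The function and key names are hypothetical.

```python
def count_parameters(embedding_dim=50, filter_width=5, num_feature_maps=100,
                     num_extra_features=0, num_classes=2):
    """Approximate weight counts per component (biases ignored)."""
    # One convolutional sentence model: feature maps x filter width x embedding dim.
    conv = num_feature_maps * filter_width * embedding_dim
    # Similarity matrix M is square in the sentence vector size.
    similarity = num_feature_maps * num_feature_maps
    # Join vector: query vector + similarity score + document vector (+ extras).
    join_size = 2 * num_feature_maps + 1 + num_extra_features
    # Hidden layer has the same size as the join vector.
    hidden = join_size * join_size
    # Assumed two-class softmax on top of the hidden layer.
    softmax = join_size * num_classes
    return {
        "conv_per_sentence_model": conv,
        "similarity_matrix": similarity,
        "hidden_and_softmax": hidden + softmax,
        "total": 2 * conv + similarity + hidden + softmax,
    }


counts = count_parameters()
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the sketch gives 25,000 weights per sentence model, 10,000 for the similarity matrix, 40,803 for the hidden and softmax layers and 100,803 in total, which agrees with the rounded figures of 25k, 10k, 40k and 100k given above.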
5. ANSWER SENTENCE SELECTION
Our first experiment is on the answer sentence selection dataset, where answer candidates are limited to a single sentence. Given a question with its list of candidate answers, the task is to rank the candidate answers based on their relatedness to the question.

5.1 Experimental setup
Data and setup. We test our model on the manually curated TREC QA dataset³ from Wang et al. [36], which appears to be one of the most widely used benchmarks for answer reranking. The dataset contains a set of factoid questions, where candidate answers are limited to a single sentence. The set of questions is collected from TREC QA tracks 8-13. The manual judgement of candidate answer sentences is provided for the entire TREC 13 set and for the first 100 questions from TREC 8-12. The motivation behind this annotation effort is that TREC provides only the answer patterns to identify if a given passage contains a correct answer key or not. This results in many unrelated candidate answers marked as correct simply because regular expressions cannot always match the correct answer keys.

³ http://cs.stanford.edu/people/mengqiu/data/qg-emnlp07-data.tgz

To enable direct comparison with the previous work, we use the same train, dev and test sets. Table 1 summarizes the datasets used in our experiments. An additional training set TRAIN-ALL provided by Wang et al. [36] contains 1,229 questions from the entire TREC 8-12 collection and comes with automatic judgements. This set represents a more noisy setting; nevertheless, it provides many more QA pairs for learning.

Word embeddings. We initialize the word embeddings by running the word2vec tool [20] on the English Wikipedia dump and the AQUAINT corpus⁴ containing roughly 375 million words. To train the embeddings we use the skipgram model with window size 5 and filter out words with frequency less than 5. The resulting model contains 50-dimensional vectors for about 3.5 million words. Embeddings for words not present in the word2vec model are randomly initialized with each component sampled from the uniform distribution U[−0.25, 0.25].

⁴ https://catalog.ldc.upenn.edu/LDC2002T31

We minimally preprocess the data, only performing tokenization and lowercasing all words. To reduce the size of the resulting vocabulary V, we also replace all digits with 0. The size of the word vocabulary V for experiments using the TRAIN set is 17,023, with approximately 95% of words initialized using word2vec embeddings and the remaining 5% of words initialized at random as described in Sec. 4.2. For the TRAIN-ALL setting, |V| = 56,953 with 85% of words found in the word2vec model.

Additional features. Given that a certain percentage of the words in our word embedding matrix are initialized at random (about 15% for TRAIN-ALL) and that the relatively small number of QA pairs prevents the network from learning them directly from the training data, similarity matching performed by the network will be suboptimal between many question-answer pairs.

Additionally, even for the words found in the word matrix, as noted in [38], one of the weaknesses of approaches relying on distributional word vectors is their inability to deal with numbers and proper nouns. This is especially important for factoid question answering, where most of the questions are of type what, when, who that are looking for answers containing numbers or proper nouns.

To mitigate the above two issues, we follow the approach in [38] and include additional features establishing relatedness between question-answer pairs. In particular, we compute word overlap measures between each question-answer pair and include them as an additional feature vector $x_{feat}$ in our model. This feature vector contains only four features: word overlap and IDF-weighted word overlap computed between all words and only non-stop words. Computing these features is straightforward and does not require additional pre-processing or external resources.

Evaluation. The two metrics used to evaluate the quality of our model are Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR), which are common in Information Retrieval and Question Answering.

MRR is computed as follows: $MRR = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{rank(q)}$, where $rank(q)$ is the position of the first correct answer in the candidate list. MRR only looks at the rank of the first correct answer, hence it is more suitable in cases where for each question there is only a single correct answer. In contrast, MAP examines the ranks of all the correct answers. It is computed as the mean over the average precision scores for each query $q \in Q$: $\frac{1}{|Q|} \sum_{q=1}^{|Q|} AveP(q)$. We use the official trec_eval scorer to compute the above metrics.

5.2 Results and discussion
We report the results of our deep learning model on the TRAIN and TRAIN-ALL sets, also when additional word overlap features are used. Additionally, we report the results from a recent deep learning system in [38] that has established the new state-of-the-art results in the same setting.

Table 2 summarises the results for the setting when the network is trained using only input question-answer pairs without using any additional features. As we can see, our deep learning architecture demonstrates a much stronger performance compared to the system in [38]. The deep learning model from [38], similarly to ours, relies on a convolutional neural network to learn intermediate representations. However, their convolutional neural network operates only on unigrams or bigrams, while in our architecture we use a larger width of the convolution filter, thus allowing for capturing longer range dependencies. Additionally, along with the question-answer similarity score, our architecture includes intermediate representations of the question and the answer, which together constitute a much richer representation. This results in a large improvement of about 8% absolute points in MAP for TRAIN and almost 10% when trained with more data from TRAIN-ALL. This emphasizes the importance of learning high quality sentence models.

Table 3 provides the results when additional word overlap features are added to the model. Simple word overlap features help to improve the question-answer matching. Our model shows a significant improvement over previous state-
of-the-art in both MAP and MRR when training on TRAIN and TRAIN-ALL. Note that the results are significantly better than when no overlap features are used. This is possibly due to the fact that the distributional representations fail to establish the relatedness in some cases and simple word overlap matching can help to drive the model in the right direction.

Table 4 reports the results of the previously published systems on this task. Our model trained on a small TRAIN dataset beats all of the previous state-of-the-art systems. The improvement is further emphasized when the system is trained using more question-answer pairs from TRAIN-ALL, showing an improvement of about 3% absolute points in both MAP and MRR. The results are very promising considering that our system requires no manual feature engineering (other than simple word overlap features), no expensive preprocessing using various NLP parsers, and no external semantic resources other than using pre-initialized word embeddings that can be easily trained provided a large amount of unsupervised text.

In spirit, our system is most similar to a recent deep learning architecture from Yu et al. (2014) [38]. However, we employ a more expressive convolutional neural network for learning intermediate representations of the query and the answer. This allows for performing a more accurate matching between question-answer pairs. Additionally, our architecture includes intermediate question and answer representations in the model, which result in a richer representation of question-answer pairs. Finally, we train our system in an end-to-end fashion, while [38] use the output of their deep learning system as a feature in a logistic regression classifier.

Table 2: Results on TRAIN and TRAIN-ALL from TREC QA.
Model                       MAP     MRR
TRAIN
Yu et al. [38] (unigram)    .5387   .6284
Yu et al. [38] (bigram)     .5476   .6437
Our model                   .6258   .6591
TRAIN-ALL
Yu et al. [38] (unigram)    .5470   .6329
Yu et al. [38] (bigram)     .5693   .6613
Our model                   .6709   .7280

Table 3: Results on TREC QA when augmenting the deep learning model with word overlap features.
Model                       MAP     MRR
TRAIN
Yu et al. [38] (unigram)    .6889   .7727
Yu et al. [38] (bigram)     .7058   .7800
Our model                   .7329   .7962
TRAIN-ALL
Yu et al. [38] (unigram)    .6934   .7677
Yu et al. [38] (bigram)     .7113   .7846
Our model                   .7459   .8078

Table 4: Survey of the results on the QA answer selection task.
Model                              MAP     MRR
Wang et al. (2007) [36]            .6029   .6852
Heilman and Smith (2010) [15]      .6091   .6917
Wang and Manning (2010) [35]       .5951   .6951
Yao et al. (2013) [37]             .6307   .7477
Severyn & Moschitti (2013) [26]    .6781   .7358
Yih et al. (2013) [33]             .7092   .7700
Yu et al. (2014) [38]              .7113   .7846
Our model (TRAIN)                  .7329   .7962
Our model (TRAIN-ALL)              .7459   .8078

6. TREC MICROBLOG RETRIEVAL
To assess the effectiveness and generality of our deep learning model for text matching, we apply it to the tweet reranking task. We focus on the 2011 and 2012 editions of the ad-hoc retrieval task at TREC microblog tracks [23, 29]. We follow the setup in [27], where they represent query-tweet pairs with shallow syntactic models to learn a tree kernel reranker. In contrast, our model does not rely on any syntactic parsers and requires virtually no preprocessing other than tokenization and lower-casing. Our main research question is: Can our neural network, which requires no manual feature engineering or expensive pre-processing steps, improve on top of the state-of-the-art learning-to-rank and retrieval algorithms?

To answer this question, we test our model in the following setting: we treat the participant systems in the TREC microblog tasks as a black box, and implement our model on top of them using only their raw scores (ranks) as a single feature in our model. This allows us to see whether our model is able to learn information complementary to the approaches used by such retrieval algorithms. Our setup replicates the experiments in [27] to allow for comparing to their model.

6.1 Experimental setup
Data and setup. Our dataset is the tweet corpus used in both TREC Microblog tracks in 2011 (TMB2011) and 2012 (TMB2012). It consists of 16M tweets spread over two weeks, and a set of 49 (TMB2011) and 59 (TMB2012) timestamped topics. We minimally preprocess the tweets: we normalize elongations (e.g., sooo → so), and normalize URLs and author ids. Additionally, we use the system runs submitted at TMB2011 and TMB2012, which contain 184 and 120 models, respectively. This is summarized in Table 5.

Table 5: Summary of TREC Microblog datasets.
Data       # Topic   # Tweet pairs   % Correct   # Runs
TMB2011    49        60,129          5.1%        184
TMB2012    59        73,073          8.6%        120

Word embeddings. We used the word2vec tool to learn the word embeddings from the provided 16M tweet corpus, with the following setting: (i) we removed non-English tweets, which reduces the corpus to 8.4M tweets, and (ii) we used the skipgram model with window size 5 and filtered out words with frequency less than 5. The trained model contains 330k words. We use word embeddings of size 50, the same as for the previous task. To build the word embedding matrix W, we extract the vocabulary from all tweets present in TMB2011 and TMB2012. The resulting vocabulary contains 150k words, out of which only 60% are found in the word embeddings model. This is due to a very large number of misspellings and words occurring only once (hence they are filtered out by the word2vec tool). This has a negative impact on the performance of our deep learning model since around 40% of the word vectors are randomly initialized. At the same time it is not possible to tune the word embeddings on the training set, as it will overfit due to the small number of query-tweet pairs available for training.
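The construction of the word embedding matrix W described above, including the share of words that must be initialized at random, can be sketched as follows. This is a minimal illustration and not the code used in our experiments: the pre-trained vectors are assumed to be available as a plain dictionary, out-of-vocabulary words are drawn from U[−0.25, 0.25] as in Section 5.1, and the function name, the toy vocabulary and the seed are hypothetical.

```python
import random


def build_embedding_matrix(vocabulary, pretrained, dim=50,
                           low=-0.25, high=0.25, seed=0):
    """Return (matrix, coverage): one row per vocabulary word, plus the
    fraction of words whose vector came from the pre-trained model."""
    rng = random.Random(seed)
    matrix = []
    found = 0
    for word in vocabulary:
        vector = pretrained.get(word)
        if vector is not None:
            # Word is covered by the pre-trained embeddings: copy its vector.
            found += 1
            matrix.append(list(vector))
        else:
            # Out-of-vocabulary word: sample each component uniformly at random.
            matrix.append([rng.uniform(low, high) for _ in range(dim)])
    coverage = found / len(vocabulary) if vocabulary else 0.0
    return matrix, coverage


# Toy example with 3-dimensional vectors (hypothetical data).
toy_pretrained = {"so": [0.1, 0.2, 0.3], "good": [0.0, -0.1, 0.2],
                  "news": [0.3, 0.3, -0.2]}
toy_vocabulary = ["so", "good", "news", "sooo", "gooood"]
W, coverage = build_embedding_matrix(toy_vocabulary, toy_pretrained, dim=3)
print(len(W), coverage)  # 5 0.6
```

In the toy example two of the five words are misspelled elongations that the pre-trained model does not contain, so coverage is 60% and the remaining rows are random, mirroring the situation reported for the tweet vocabulary.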
Training. We train our system on the runs submitted at TMB2011, and test it on the TMB2012 runs. We focus on one direction only to avoid training bias, since TMB2011 topics were already used for learning systems in TMB2012.

Submission run as a feature. We use the output of participant systems as follows: we use rank positions of each tweet rather than raw scores, since scores for each system are scaled differently, while ranks are uniform across systems. We apply the following transformation of the rank $r$: $1/\log(r + 1)$. In the training phase, we take the top 30 systems from the TMB2011 track (in terms of P@30). For each query-tweet pair we compute the average transformed rank over the top systems. This score is then used as a single feature $x_{feat}$ by our model. In the testing phase, we generate this feature as follows: for each participant system that we want to improve, we use the transformed rank of the query-tweet pair taken from their submission run.

Evaluation. We report on the official evaluation metric for the TREC 2012 Microblog track, i.e., precision at 30 (P@30), and also on mean average precision (MAP). Following [4, 23], we regard minimally and highly relevant documents as relevant and use the TMB2012 evaluation script. For significance testing, we use a pairwise t-test, where △ and ▲ denote significance at α = 0.05 and α = 0.01, respectively. Triangles point up for improvement over the baseline, and down otherwise. We also report the improvement in the absolute rank (R) in the official TMB2012 ranking.

6.2 Results and discussion
Table 6 reports the results for re-ranking runs of the best 30 systems from TMB2012 (based on their P@30 score) when we train our system using the top 30 runs from TMB2011.

First, we note that our model improves P@30 for the majority of the systems with a relative improvement ranging from several points up to 10%, with about 6% on average. This is remarkable, given that the pool of participants in TMB2012 was large, and the top systems are therefore likely to be very strong baselines.

Secondly, we note that the relative improvement of our model is on par with the STRUCT model from [27], which relies on using syntactic parsers to train a tree kernel reranker. In contrast, our model requires no manual feature engineering and virtually no preprocessing or external resources. Similar to the observation made in [27], our model has a precision-enhancing effect. In cases where MAP drops a bit, it can be seen that our model sometimes lowers relevant documents in the runs. It is possible that our model favours query-tweet pairs that exhibit semantic matching of higher quality, and that it down-ranks tweets that are of lower quality but are nonetheless relevant. Another important aspect is the fact that a large portion of the word embeddings (about 40%) used by the network are initialized at random, which has a negative impact on the accuracy of our model.

Looking at the improvement in absolute position in the official ranking (R), we see that, on average, our deep learning model boosts the absolute position in the official ranking for the top 30 systems by about 7.8 positions.

All in all, the results suggest that our deep learning model with

Table 7: Comparison of the averaged relative improvements for the top, middle (mid), and bottom (btm) 30 systems from TMB2012.
        STRUCT [27]        Our model
band    MAP      P@30      MAP      P@30
top     3.3%     5.3%      2.0%     6.2%
mid     12.2%    12.9%     9.8%     13.7%
btm     22.1%    25.1%     18.7%    24.3%

We find that the improvement over underperforming systems is much larger than for stronger systems. In particular, for the bottom 30 systems, our approach achieves an average relative improvement of about 20% in both MAP and P@30. The performance of our model is on par with the STRUCT model [27].

We expect that learning word embeddings on a larger corpus, so that a higher percentage of the words is present in the word embedding matrix W, should help to improve the accuracy of our system. Moreover, similar to the situation observed with answer selection experiments, we expect that using more training data would improve the generalization of our model. As one possible solution to getting more training data, it could be interesting to experiment with training our model on much larger pseudo test collections similar to the ones proposed in [4]. We leave it for future work.

7. RELATED WORK
Our learning to rank method is based on a deep learning model for advanced text representations using distributional word embeddings. Distributional representations have a long tradition in IR, e.g., Latent Semantic Analysis [10], which more recently has also been characterized by studies on distributional models based on word similarities. Their main property is to alleviate the problem of data sparseness. In particular, such representations can be derived with several methods, e.g., by counting the frequencies of co-occurring words around a given token in large corpora. Such distributed representations can be obtained by applying neural language models that learn word embeddings, e.g., [3], and more recently using recursive autoencoders [34] and convolutional neural networks [8].

Our application of learning to rank models concerns passage reranking. For example, [17, 24] designed classifiers of question and answer passage pairs. Several approaches were devoted to reranking passages containing definition/description, e.g., [21, 28, 31]. [1] used a cascading approach, where the ranking produced by one ranker is used as input to the next stage.

Language models for reranking were applied in [7], where answer ranks were computed based on the probabilities of bigram models generating candidate answers. Language models were also applied to definitional QA in [9, 25, 32].

Our work more directly targets the task of answer sentence selection, i.e., the task of selecting a sentence that contains the information required to answer a given question from a set of candidates
no changes in its architecture is able to capture additional infor- (for example, provided by a search engine). In particular, the state
mation and can be useful when coupled with many state-of-the-art of the art in answer sentence selection is given by Wang et al., 2007
microblog search algorithms. [36], who use quasi-synchronous grammar to model relations be-
While improving the top systems from 2012 represents a chal- tween a question and a candidate answer with the syntactic transfor-
lenging task, it is also interesting to assess the potential improve- mations. Heilman & Smith, 2010 [15] develop an improved Tree
ment for lower ranking systems. We follow [27] and report our Edit Distance (TED) model for learning tree transformations in a
results on the 30 systems from the middle and the bottom of the q/a pair. They search for a good sequence of tree edit operations
official ranking. Table 7 summarizes the average improvements for using complex and computationally expensive Tree Kernel-based
three groups of systems: top-30, middle-30, and bottom-30. heuristic. Wang & Manning, 2010 [35] develop a probabilistic
Table 6: System performance on the top 30 runs from TMB2012, using the top 10, 20 or 30 runs from TMB2011 for training.
                  TMB2012         STRUCT [27]                           Our model
 #  runs          MAP     P@30    MAP              P@30             R%  MAP              P@30             R%
 1  hitURLrun3    0.3469  0.4695  0.3328 (-4.1%)▽  0.4774 (1.7%)     0  0.3326 (-4.1%)▽  0.4836 (3.0%)     0
 2  kobeMHC2      0.3070  0.4689  0.3037 (-1.1%)   0.4768 (1.7%)     1  0.3052 (-0.6%)   0.4899 (4.5%)△    1
 3  kobeMHC       0.2986  0.4616  0.2965 (-0.7%)   0.4718 (2.2%)     2  0.2999 (0.4%)    0.4830 (4.6%)△    2
 4  uwatgclrman   0.2836  0.4571  0.2995 (5.6%)▲   0.4712 (3.1%)△    3  0.2738 (-3.5%)▽  0.4516 (-1.2%)   -1
 5  kobeL2R       0.2767  0.4429  0.2744 (-0.8%)   0.4463 (0.8%)     0  0.2677 (-3.3%)▽  0.4409 (-0.5%)   -2
 6  hitQryFBrun4  0.3186  0.4424  0.3118 (-2.1%)   0.4554 (2.9%)     2  0.3220 (1.1%)▲   0.4849 (9.6%)▲    5
 7  hitLRrun1     0.3355  0.4379  0.3226 (-3.9%)▽  0.4525 (3.3%)     2  0.3188 (-5.0%)▽  0.4610 (5.3%)▲    3
 8  FASILKOM01    0.2682  0.4367  0.2820 (5.2%)▲   0.4531 (3.8%)▲    3  0.2622 (-2.2%)▽  0.4346 (-0.5%)   -1
 9  hitDELMrun2   0.3197  0.4345  0.3105 (-2.9%)   0.4424 (1.8%)     4  0.3246 (1.5%)    0.4723 (8.7%)▲    8
10  tsqe          0.2843  0.4339  0.2836 (-0.3%)   0.4441 (2.4%)     5  0.2917 (2.6%)    0.4660 (7.4%)▲    7
11  ICTWDSERUN1   0.2715  0.4299  0.2862 (5.4%)▲   0.4582 (6.6%)▲    7  0.2765 (1.8%)▲   0.4484 (4.3%)△    6
12  ICTWDSERUN2   0.2671  0.4266  0.2785 (4.3%)△   0.4475 (4.9%)▲    7  0.2786 (4.3%)△   0.4478 (5.0%)△    7
13  cmuPrfPhrE    0.3179  0.4254  0.3172 (-0.2%)   0.4469 (5.1%)▲    8  0.3321 (4.5%)△   0.4585 (7.8%)▲    9
14  cmuPrfPhrENo  0.3198  0.4249  0.3179 (-0.6%)   0.4486 (5.6%)▲    9  0.3359 (5.0%)△   0.4591 (8.1%)▲   10
15  cmuPrfPhr     0.3167  0.4198  0.3130 (-1.2%)   0.4379 (4.3%)△    8  0.3282 (3.6%)△   0.4572 (8.9%)▲   11
16  FASILKOM02    0.2454  0.4141  0.2718 (10.8%)▲  0.4508 (8.9%)▲   11  0.2489 (1.4%)    0.4201 (1.5%)△    1
17  IBMLTR        0.2630  0.4136  0.2734 (4.0%)△   0.4441 (7.4%)▲   10  0.2703 (2.8%)△   0.4346 (5.1%)▲    8
18  otM12ihe      0.2995  0.4124  0.2969 (-0.9%)   0.4322 (4.8%)▲    7  0.2900 (-3.2%)▽  0.4239 (2.8%)△    3
19  FASILKOM03    0.2716  0.4124  0.2859 (5.3%)▲   0.4452 (8.0%)▲   14  0.2740 (0.9%)    0.4270 (3.5%)▲    7
20  FASILKOM04    0.2461  0.4113  0.2575 (4.6%)▲   0.4294 (4.4%)▲    9  0.2414 (-1.9%)▽  0.4220 (2.6%)△    5
21  IBMLTRFuture  0.2731  0.4090  0.2808 (2.8%)    0.4311 (5.4%)▲   10  0.2785 (2.0%)△   0.4415 (8.0%)▲   14
22  uiucGSLIS01   0.2445  0.4073  0.2575 (5.3%)▲   0.4260 (4.6%)▲    9  0.2478 (1.4%)    0.4233 (3.9%)△    7
23  PKUICST4      0.2786  0.4062  0.2909 (4.4%)△   0.4514 (11.1%)▲  18  0.2832 (1.7%)▲   0.4491 (10.6%)▲  18
24  uogTrLsE      0.2909  0.4028  0.2977 (2.3%)    0.4282 (6.3%)▲    9  0.3131 (7.6%)▲   0.4484 (11.3%)▲  19
25  otM12ih       0.2777  0.3989  0.2810 (1.2%)    0.4175 (4.7%)▲   10  0.2752 (-0.9%)   0.4119 (3.3%)△    5
26  ICTWDSERUN4   0.1877  0.3887  0.1985 (5.8%)▲   0.4164 (7.1%)▲   10  0.2040 (8.7%)▲   0.4220 (8.6%)▲   11
27  uwatrrfall    0.2620  0.3881  0.2812 (7.3%)▲   0.4136 (6.6%)▲    9  0.2942 (12.3%)△  0.4314 (11.2%)▲  16
28  cmuPhrE       0.2731  0.3842  0.2797 (2.4%)    0.4136 (7.7%)▲   12  0.2972 (8.8%)△   0.4352 (13.3%)▲  19
29  AIrun1        0.2237  0.3842  0.2339 (4.6%)▲   0.4102 (6.8%)▲    5  0.2285 (2.2%)△   0.4157 (8.2%)▲   13
30  PKUICST3      0.2118  0.3825  0.2318 (9.4%)▲   0.4119 (7.7%)▲   14  0.2363 (11.6%)▲  0.4415 (15.4%)▲  23
    Average                       3.3%             5.3%            7.3  2.0%             6.2%            7.8
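As a rough illustration of how the figures in Table 6 relate to one another, the following Python sketch computes P@30 from binary relevance labels and the relative change of a metric over its baseline. The function names and the 0/1/2 grade encoding are assumptions made for this example and are not taken from the TMB2012 evaluation script; the two sample calls use the hitURLrun3 row of Table 6.

```python
def binarize(grades):
    """Map graded labels to binary relevance.

    Assumed encoding (hypothetical): 0 = not relevant, 1 = minimally
    relevant, 2 = highly relevant. Both relevant grades count as relevant.
    """
    return [1 if g >= 1 else 0 for g in grades]


def precision_at_k(relevance, k=30):
    """Fraction of the top-k ranked documents that are relevant.

    `relevance` is a list of 0/1 labels in ranked order. The sum is
    divided by k even if fewer than k documents were retrieved.
    """
    return sum(relevance[:k]) / k


def relative_change(baseline, reranked):
    """Relative change of a metric over the baseline, in percent."""
    return 100.0 * (reranked - baseline) / baseline


# hitURLrun3, our model: P@30 goes from 0.4695 to 0.4836.
print(round(relative_change(0.4695, 0.4836), 1))  # 3.0
# hitURLrun3, our model: MAP goes from 0.3469 to 0.3326.
print(round(relative_change(0.3469, 0.3326), 1))  # -4.1
```

The percentages in parentheses in Table 6 are relative changes of this kind, so a small absolute difference in P@30 on a strong baseline still shows up as a change of a few percent.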
model to learn tree-edit operations on dependency parse trees. They cast the problem into the framework of structured output learning with latent variables. The model of Yao et al., 2013 [37] applies linear chain CRFs with features derived from TED to automatically learn associations between questions and candidate answers. Severyn and Moschitti [26] applied SVM with tree kernels to shallow syntactic representations, which provides automatic feature engineering. Yih et al. [33] use distributional models based on lexical semantics to match semantic relations of aligned words in QA pairs.

More recently, Bordes et al. [5] used siamese networks for learning to project question and answer pairs into a joint space, whereas Iyyer et al. [16] modelled semantic composition with a recursive neural network for a question answering task. The work closest to ours is [38], where they apply deep learning to learn to match question and answer sentences. However, their sentence model to map questions and answers to vectors operates only on unigrams or bigrams. Our sentence model is based on a convolutional neural network with a state-of-the-art architecture, and we use a relatively large width of the convolution filter (5), thus allowing the network to capture longer range dependencies. Moreover, along with the question-answer similarity score, the architecture of our deep learning model also encodes the question and answer vector representations. Hence, our model constructs and learns a richer representation of the question-answer pairs, which yields superior results on the answer sentence selection dataset. Finally, our deep learning reranker is trained end-to-end, while in [38] they use the output of their neural network in a separate logistic scoring model.

Regarding learning to rank systems applied to TREC microblog datasets, recently [27] have shown that richer linguistic representations of tweets can improve upon state of the art systems in TMB-2011 and TMB-2012. We directly compare with their system, showing that our deep learning model without any changes to its architecture (we only pre-train word embeddings) is on par with their reranker. This is remarkable since, unlike [27], which requires additional pre-processing using syntactic parsers to construct syntactic trees, our model requires no expensive pre-processing and does not rely on any external resources.

8. CONCLUSIONS

In this paper, we propose a novel deep learning architecture for reranking short texts. It has the benefit of requiring no manual feature engineering or external resources, which may be expensive or not available. The model with the same architecture can be successfully applied to other domains and tasks.

Our experimental findings show that our deep learning model: (i) greatly improves on the previous state-of-the-art systems and a recent deep learning approach in [38] on the answer sentence selection task, showing a 3% absolute improvement in MAP and MRR; (ii) is able to improve even the best system runs from the TREC Microblog 2012 challenge; (iii) is comparable to the syntactic reranker in [27], while requiring no external parsers or resources.

Acknowledgments. This work has been supported by the EC project CogNet, 671625 (H2020-ICT-2014-2, Research and Innovation action). The first author was supported by the Google Europe Doctoral Fellowship Award 2013.
REFERENCES
[1] A. Agarwal, H. Raghavan, K. Subbian, P. Melville, D. Gondek, and R. Lawrence. Learning to rank for robust question answering. In CIKM, 2012.
[2] A. Bordes, J. Weston, and N. Usunier. Open question answering with weakly supervised embedding models. In ECML, Nancy, France, September 2014.
[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
[4] R. Berendsen, M. Tsagkias, W. Weerkamp, and M. de Rijke. Pseudo test collections for training and tuning microblog rankers. In SIGIR, 2013.
[5] A. Bordes, S. Chopra, and J. Weston. Question answering with subgraph embeddings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 615–620, Doha, Qatar, October 2014. Association for Computational Linguistics.
[6] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 129–136, New York, NY, USA, 2007. ACM.
[7] Y. Chen, M. Zhou, and S. Wang. Reranking answers from definitional QA using language models. In ACL, 2006.
[8] R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, pages 160–167, 2008.
[9] H. Cui, M. Kan, and T. Chua. Generic soft pattern models for definitional QA. In SIGIR, Salvador, Brazil, 2005. ACM.
[10] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 1990.
[11] M. Denil, A. Demiraj, N. Kalchbrenner, P. Blunsom, and N. de Freitas. Modelling, visualising and summarising documents with a single convolutional neural network. Technical report, University of Oxford, 2014.
[12] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, 2011.
[13] A. Echihabi and D. Marcu. A noisy-channel approach to question answering. In ACL, 2003.
[14] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio. Maxout networks. In ICML, pages 1319–1327, 2013.
[15] M. Heilman and N. A. Smith. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In NAACL, 2010.
[16] M. Iyyer, J. Boyd-Graber, L. Claudino, R. Socher, and H. Daumé III. A neural network for factoid question answering over paragraphs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 633–644, Doha, Qatar, October 2014. Association for Computational Linguistics.
[17] J. Jeon, W. B. Croft, and J. H. Lee. Finding similar questions in large question and answer archives. In CIKM, 2005.
[18] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, June 2014.
[19] Y. Kim. Convolutional neural networks for sentence classification. In EMNLP, pages 1746–1751, Doha, Qatar, October 2014.
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119, 2013.
[21] A. Moschitti, S. Quarteroni, R. Basili, and S. Manandhar. Exploiting syntactic and shallow semantic kernels for question/answer classification. In ACL, 2007.
[22] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[23] I. Ounis, C. Macdonald, J. Lin, and I. Soboroff. Overview of the TREC-2011 microblog track. In TREC, 2011.
[24] F. Radlinski and T. Joachims. Query chains: Learning to rank from implicit feedback. CoRR, 2006.
[25] Y. Sasaki. Question answering as question-biased term extraction: A new approach toward multilingual QA. In ACL, 2005.
[26] A. Severyn and A. Moschitti. Automatic feature engineering for answer selection and extraction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 458–467, Seattle, Washington, USA, October 2013. Association for Computational Linguistics.
[27] A. Severyn, A. Moschitti, M. Tsagkias, R. Berendsen, and M. de Rijke. A syntax-aware re-ranker for microblog retrieval. In SIGIR, 2014.
[28] D. Shen and M. Lapata. Using semantic roles to improve question answering. In EMNLP-CoNLL, 2007.
[29] I. Soboroff, I. Ounis, J. Lin, and I. Soboroff. Overview of the TREC-2012 microblog track. In TREC, 2012.
[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[31] M. Surdeanu, M. Ciaramita, and H. Zaragoza. Learning to rank answers to non-factoid questions from web collections. Comput. Linguist., 37(2):351–383, June 2011.
[32] J. Suzuki, Y. Sasaki, and E. Maeda. SVM answer selection for open-domain question answering. In COLING, 2002.
[33] W.-t. Yih, M.-W. Chang, C. Meek, and A. Pastusiak. Question answering using enhanced lexical semantic models. In ACL, August 2013.
[34] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, Dec. 2010.
[35] M. Wang and C. D. Manning. Probabilistic tree-edit models with structured latent variables for textual entailment and question answering. In ACL, 2010.
[36] M. Wang, N. A. Smith, and T. Mitamura. What is the Jeopardy model? A quasi-synchronous grammar for QA. In EMNLP, 2007.
[37] X. Yao, B. Van Durme, C. Callison-Burch, and P. Clark. Answer extraction as sequence tagging with tree edit distance. In NAACL, 2013.
[38] L. Yu, K. M. Hermann, P. Blunsom, and S. Pulman. Deep learning for answer sentence selection. CoRR, 2014.
[39] M. D. Zeiler. Adadelta: An adaptive learning rate method. CoRR, 2012.
[40] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. CoRR, abs/1301.3557, 2013.