Detection of Imperative and Declarative Question-Answer Pairs in Email Conversations
characteristics. For instance, over 90% of forum threads contain Q-A knowledge (more than for email), the number of participants is often higher, multiple questions and answers are often highly interleaved, quoting is used in different ways, and message reply relationships are usually unavailable.

The approach of Cong et al. (henceforth CWLSS) for question detection is based on characterizing questions and non-questions by extracting labelled sequence patterns (LSPs), as opposed to POS analysis only, and using the discovered patterns as features to learn a classification model for question detection. For online forums, they find the LSP-based classifier outperforms the S&M POS-based classifier. CWLSS develop a graph-based propagation method for question-answer linkage, leveraging the linkage structure within forum threads, and combine it with syntactic and classification-based methods for answer identification.

Besides email threads and online forums, there has also been work on extracting question-answer pairs in multi-party dialogue [Kathol and Tur, 2008]. While extraction of questions in meeting transcripts could be similar to that in email text, the structure of email threads is very different from that of meeting speech. For example, while question detection is relatively simpler in email than in speech, the asynchronous nature of email and its rich context (e.g., quoting) make question-answer pairing relatively more difficult.

Harder still than automated answer detection is automated question answering. While question-answering systems serve purposes very different from ours, ideas from the literature on question answering can be of benefit.

3 Algorithms

The starting point for our work is an email thread. Given a set of messages, thread reassembly is the task of extracting the individual threads [Yeh and Harnly, 2006]; we assume this procedure has been performed.

3.1 Question Detection

For us, question detection serves the greater purpose of question-answer linkage. We consider three algorithms for question detection that work at the sentence level, i.e., the algorithms look at each sentence in an email thread and classify each as either a question or not. To serve question-answer linkage, we seek an algorithm that exhibits a sufficiently high F1 score (the harmonic mean of precision and recall) on real data, coupled with a low running cost.

Naïve. A baseline question detection algorithm had been implemented in our email assistant. Algorithm Naïve employs regular expressions to classify sentences that end with a question mark as questions, except for sentences that fit the pattern of a URL. A common extension is to detect 5W-1H question words (Who, What, Where, When, Why, or How).

S&M. The state of the art in question detection in email threads is the work of Shrestha and McKeown [2004]. Algorithm S&M uses Ripper [Cohen, 1996] to learn a model that classifies each sentence as a question or a non-question, based on parts-of-speech features. The scope of the S&M algorithm is detection of questions in interrogative form only.

There are two broad possibilities to create a more general question detector: a method based on learning, as S&M, or not, as Naïve. We chose to examine whether the performance of a simpler, non-learning method would suffice.

Regex. Like the Naïve question detector, algorithm Regex is also based entirely on regular expressions. A sentence is detected as a question if it fulfills any of the following:

• It ends with a question mark, and is not a URL.
• It contains a phrase that begins with words that fit an interrogative question pattern. This is a generalization of 5W-1H question words. For example, the second phrase of “When you are free, can you give me a call” is a strong indicator that the sentence is a question.¹ This condition is designed to catch sentences that should end with a question mark but were not typed with one.
• It fits the pattern of common questions that are not in the interrogative form. For instance, “Let me know when you will be free” is one such question.

Our hand-crafted database contains approximately 50 patterns; it was assembled rapidly. The alternative approach is to learn patterns; CWLSS take this approach, and present a generalized learned classification model to acquire patterns.

3.2 Answer Detection and Q-A Pairing

Traditional document retrieval methods can be applied to email threads, by treating each message or each candidate answer as a separate document. However, as has been observed in the direct application of information retrieval literature for extractive email summarization, these methods, such as cosine similarity, query likelihood language models, and KL-divergence language models, do not on their own exploit the content, context, and structural features of email threads.

We again consider three algorithms: one initially implemented in our email assistant, one identified as state-of-the-art in the literature (S&M), and a new heuristic-based algorithm of our construction that builds upon the literature.

Naïve Q-A. We again regard the Naïve question-answer pairing algorithm as a baseline method. It is based on detecting answer types, and subsequent simple matching of question and answer sentences by type. Given an email thread, the algorithm iterates through each original (i.e., unquoted) sentence² and determines whether it is a question using the Naïve question detector described above. For each detected question, it guesses the expected answer type based on regular expression patterns. For example, Naïve categorizes a question that begins with “Are you” as a yes/no question. For each question, the algorithm looks at each original sentence in every subsequent email in the thread for a sentence that fits the expected answer type. For example, the sentence “I think so” fits a yes/no question. The first sentence that fits the expected answer type of a detected question is naïvely paired as the answer to that question (i.e., a sentence is paired as the answer to at most one question). The underlying heuristic is that earlier questions are answered earlier than later questions; the asynchronous nature of conversations in email threads means the assumption does not hold in general.

The classification of question types is: yes/no, essay (why and how questions), what, when, where, who, number, and choice (e.g., “Is it a house or a flat?”). However, when the Naïve algorithm looks for an answer to a question, the only types of answers captured are yes/no, essay, and what.

S&M Q-A. For answer detection and question-answer pairing, S&M again use Ripper to learn a classification model. They work at the paragraph level. For training purposes, a paragraph is considered a question segment if it contains a sentence marked by the human annotator as a question, and a paragraph is considered an answer segment to a question paragraph if it contains at least one sentence marked by the human annotator as an answer to the question. The candidate answer paragraphs of a question paragraph are all the original paragraphs in the thread subsequent to the message containing the question paragraph. Candidate answer paragraphs that do not contain sentences marked as an answer by the annotator are used as negative examples.

As noted earlier, the features that the S&M algorithm uses include standard textual analysis features from information retrieval, features derived from the thread structure, and features based on comparison with other candidate paragraphs.

Heuristic Q-A. We present a portfolio of heuristic algorithms that operate at the sentence level. The algorithms use a common set of features that extends those considered by S&M. The first algorithm variant uses hand-picked parameter values, the second (like S&M) learns a classification model (also using Ripper), while the third algorithm learns a linear regression model. We examine for questions each content sentence, i.e., each original sentence that is not classified as a greeting or signature line. (Culling greetings and signatures improves performance by as much as 25%.) Any question detector could be used; we use the Regex algorithm described earlier. For each question, we obtain a set of candidate answers according to the following heuristic. A candidate answer is a content sentence in a subsequent message in the thread that is: (1) not from the sender of the question email, (2) an individual’s first reply to the question email, and (3) not one of the detected question sentences.

Our heuristic algorithms score each of the candidate answers based on a weighted set of features. In variants that employ learning, we use the same set of features (described below) to train answer detectors, and to train Q-A pairing classifiers that assign each candidate answer a probability that it is the answer to a given question. We take the highest-scoring or most probable candidate (as appropriate to the variant) as the answer to a given question, provided that the score or probability is above a minimum threshold (default 0.5). Finally, we limit the number of questions to which an answer can be assigned; for the experiments reported, we use a limit of two.

The set of features we use for answer detection and question-answer pairing is a combination of textual features, structural features, entity tags, and expected answer types. Let Q be a question in message m_Q and A be a candidate answer found in message m_A. The features are as follows.

1. Number of non stop words in Q and A (S&M feature a)
2. Cosine similarity between Q and A (part of S&M feature b)
3. Cosine similarity between Q and A, after named entity tagging, stemming, and removal of stop words
4. Number of intermediate messages between m_Q and m_A (S&M feature c)
5. Ratio of the number of messages in the thread prior to m_Q to the total number of messages, and similarly for m_A (S&M feature d)
6. Number of candidate answers that come before A (similar to part of S&M feature f)
7. Ratio of the number of candidate answers that come before A to the total number of candidate answers (S&M feature g)
8. Whether m_A is the first reply to m_Q
9. Whether Q is a question addressed to the sender of m_A: for example, “Bill, can you clarify?” is addressed to a user whose name or address includes “bill”
10. Semantic similarity between the sentences Q and A
11. Whether A matches the expected answer type of Q

We thus include most of the features found to be useful in S&M. We omit S&M feature e, since it is superseded by our feature 8, and S&M feature h, since it relies on In-Reply-To header information which is surprisingly often not available [Yeh and Harnly, 2006] (for instance, it is not available in the Enron corpus), and moreover since the intent of this feature is superseded by taking the best-scoring match in Heuristic.

Computation of the features is mostly straightforward. We find that cosine similarity suffices and do not also use Euclidean distance (unlike S&M feature b). We compute semantic similarity between question and candidate answers based on WordNet relations [Pirrò and Seco, 2008].

The most difficult feature to capture is the last: expected answer type. The algorithm Naïve detects answer type by the crude means of regular expressions. This suffices for yes/no questions. We built a named entity recognizer that identifies and classifies proper names, places, temporal expressions, currency, and numbers, among other entities. The recognizer tags phrases in message bodies. Thus, for a when question, a match occurs if any phrases in the candidate answer are tagged as times or dates. We do not attempt to detect essay, what, and choice answers by type.

¹ There is an improvement to S&M, discussed below, that breaks a sentence into comma-delimited phrases, in an effort to catch such cases.
² While quoted material is used in email to respond to segments of earlier messages, the usage of this practice varies; it is much more common in online forums.

4 Empirical Analysis

We undertook experiments to assess the behaviour of the algorithms described in the previous section. The algorithms were implemented in Java 1.6, and the experiments were performed on an Intel Core 2 Duo 2.20GHz machine with 2GB memory, running Windows Vista.

4.1 Methodology and Datasets

What makes a segment of text a question or not, and what constitutes an answer to a question, are both subjective judgments. We follow prior work in information retrieval to train (where relevant) and test algorithms in our experiments on human-annotated datasets.
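As a concrete illustration of one of the textual features used for Q-A pairing in Section 3.2 (feature 2, cosine similarity between a question and a candidate answer), the following is a minimal Python sketch. It is not the authors' Java implementation: the tokenizer, the tiny stop-word list, and the function names are assumptions made for illustration, and the sketch omits the named entity tagging and stemming of feature 3.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the paper does not publish its list).
STOP_WORDS = {"a", "an", "the", "is", "are", "it", "to", "of", "and", "or",
              "in", "on", "at", "for", "you", "we", "i"}


def content_words(sentence):
    """Lower-case the sentence, tokenize it, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def cosine_similarity(question, answer):
    """Cosine similarity between the term-frequency vectors of two sentences."""
    q_counts = Counter(content_words(question))
    a_counts = Counter(content_words(answer))
    dot = sum(q_counts[t] * a_counts[t] for t in q_counts)
    norm = (math.sqrt(sum(c * c for c in q_counts.values()))
            * math.sqrt(sum(c * c for c in a_counts.values())))
    # Sentences with no content words have no direction; treat similarity as 0.
    return dot / norm if norm else 0.0
```

For instance, a question and a candidate answer that share one of their two content words each would score 0.5 under this sketch, while sentences with no shared content words score 0.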
In contrast to email summarization, where standardized datasets now exist [Ulrich et al., 2008], there are unfortunately no annotated email corpora available for question-answer pairing. The ACM student corpus used by S&M and the annotations upon it are not available for reasons of privacy. This state of affairs also differs from online forums, in many of which the community rates the posts in a thread, thus providing large, ready-made datasets [Cong et al., 2008]. However, extensive email corpora are available, although they are not annotated with questions (or at least not with sufficient fidelity on question types) nor with question-answer pairings. We used the widely-studied Enron corpus [Klimt and Yang, 2004] and the Cspace corpus [Minkov et al., 2005].

4.2 Question Detection

Data and Metrics. To evaluate the question detection algorithms Naïve, S&M, and Regex (Section 3.1), we created a test dataset of sentences as follows. We randomly selected 10,000 sentences from emails from the Enron corpus and the Cspace corpus. A human annotator looked at each sentence and marked all questions until 350 questions were marked. For each question the annotator also marked its sentence form as interrogative, declarative, imperative, or other. Out of the 350 questions, about 80% were marked as interrogative, 11% as imperative, 5% as declarative, and 4% as other. In addition to these 350 question sentences, we randomly selected 350 sentences out of those that had been passed over as non-questions to obtain a collection of 700 sentences in total.³

We measured the precision, recall, and F1-score for each of the three algorithms, on a restricted set with only the interrogative questions and the 350 non-questions, and on the complete set of 700 sentences. The Naïve and Regex algorithms do not require training, while the S&M algorithm had been previously trained on the transcribed telephone speech corpus [Shrestha and McKeown, 2004].⁴ Thus all 700 sentences were used for testing purposes.

Algorithm   Precision   Recall   F1-score   Time (ms)
Naïve       0.956       0.918    0.936      0.0254
S&M         0.623       0.865    0.724      48.30
Regex       0.954       0.964    0.959      0.243
Table 1: Question detection, interrogative questions only

Algorithm   Precision   Recall   F1-score   Time (ms)
Naïve       0.958       0.786    0.863      0.0286
S&M         0.662       0.823    0.734      45.90
Regex       0.959       0.860    0.907      0.227
Table 2: Question detection, including non-interrogative questions

Results. The results are shown in Tables 1 and 2. We can see that S&M performs relatively less well than the other two algorithms on both datasets. The fact that its precision is so low is at first surprising, because it is reported that the S&M algorithm achieves high precision [Shrestha and McKeown, 2004; Cong et al., 2008]. The difference may be understood in that S&M tested only on questions in interrogative form and statements in declarative form, whereas we tested on questions in any form and non-questions in any form. Examples of non-questions that are not in declarative form, that S&M incorrectly detected as questions, are “Brian, just get here as soon as you can” and “Let’s gear up for the game”.⁵

The results for the two regular expression algorithms are more expected. Both perform very well on interrogative questions only, emphasizing that question detection is not so challenging a task. Both have lower recall scores on the overall set, since non-interrogative questions are harder to detect. Since S&M does not consider declarative or imperative questions, its performance is essentially the same on both datasets.

As expected, Regex achieves higher recall than Naïve because of its greater sophistication. Although the runtime increases by one order of magnitude, the absolute runtime remains modest. The tables report the mean runtimes of the algorithms in milliseconds per sentence. The median time per sentence for both Regex and Naïve is essentially zero; for S&M it is 18.5ms. POS tagging is the main reason for the considerable extra time taken by S&M. The variance of Regex is greater than that of Naïve: 8.61 and 0.185ms respectively (full dataset, including non-interrogative questions). S&M exhibits considerable variance of 1750ms.

Algorithm   Precision   Recall   F1-score
5W-1H       0.690       0.151    0.248
Naïve       0.978       0.780    0.868
S&M         0.873       0.873    0.871
CWLSS       0.971       0.978    0.975
Table 3: Question detection on online forums, including non-interrogative questions [Cong et al., 2008]

Table 3 reports results from Cong et al. [2008]. Although these results are on a different dataset and for online forums rather than email threads, we give them for comparison. It can be seen that Naïve based on 5W-1H words performs very poorly, while Naïve based on question marks (as our Naïve) performs similarly to how it does in our experiments. The notable difference is S&M, which exhibits the higher precision also reported by S&M. However, Naïve continues to perform as well as S&M in these reported experiments.

The LSP-based classifier learned by CWLSS (algorithm CWLSS) detects interrogative and non-interrogative questions. It is found to outperform Naïve and S&M, and, albeit on different datasets, has higher scores than Regex. However, and again comparing unfairly across datasets, Regex still has scores better than S&M, particularly precision. The use of CWLSS for email threads is to be explored.

³ We chose a 50-50 distribution in order to perform an evaluation similar to that of S&M; a more natural distribution would have many more non-questions. Increasing the number of non-questions can be expected to leave recall unchanged, since it depends more on non-interrogative questions than non-questions. By contrast, precision can be expected to fall across the algorithms, supposing each question-detection algorithm mis-classifies a fixed fraction of non-questions as false positives.
⁴ We maintain the same training as the original so as to compare results directly. If trained on email data, the precision of S&M can be expected to improve. However, its training requires POS-tagged data; we were not aware of POS-tagged email corpora. Further, available trained POS taggers are mostly trained on traditional text, whereas email data has been described as more similar to speech.
⁵ The precision reported in CWLSS may be understood because they trained their S&M question detector on the same kind of online forum data as they were testing, instead of phone speech data as S&M and ourselves. As noted above, training S&M on email corpora was infeasible without POS tagging.
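To make the evaluation in this section more concrete, the following Python sketch shows a detector in the style of Regex (the three conditions of Section 3.1) together with the precision, recall, and F1 computation. It is only a sketch under stated assumptions: the original system was written in Java, the three patterns below are invented for illustration and stand in for the authors' database of approximately 50 patterns, and the URL in the sample is hypothetical.

```python
import re

# All patterns below are illustrative assumptions, not the authors' pattern database.
URL = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
# Phrase-initial interrogative patterns (a generalization of 5W-1H words).
INTERROGATIVE = re.compile(
    r"^(?:(?:who|what|where|when|why|how)\s+"
    r"(?:is|are|do|does|did|can|could|should|will|would)"
    r"|(?:can|could|should|will|would|do|does|did|is|are)\s+"
    r"(?:you|we|i|it|they|he|she|there))\b",
    re.IGNORECASE,
)
# Common questions that are not in interrogative form.
NON_INTERROGATIVE = re.compile(r"\b(?:let me know|please advise)\b", re.IGNORECASE)


def is_question(sentence):
    """Return True if the sentence meets any of the three Regex-style conditions."""
    s = sentence.strip()
    if s.endswith("?") and not URL.match(s):
        return True
    # Check each comma-delimited phrase for an interrogative opening.
    if any(INTERROGATIVE.match(phrase.strip()) for phrase in s.split(",")):
        return True
    return bool(NON_INTERROGATIVE.search(s))


def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 (their harmonic mean) over parallel boolean lists."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Example sentences quoted in the paper, plus one hypothetical URL.
SAMPLE = [
    ("When you are free, can you give me a call", True),
    ("Let me know when you will be free", True),
    ("Is it a house or a flat?", True),
    ("Brian, just get here as soon as you can", False),
    ("Let's gear up for the game", False),
    ("http://example.com/search?", False),
]
predicted = [is_question(text) for text, _ in SAMPLE]
gold = [label for _, label in SAMPLE]
print(precision_recall_f1(predicted, gold))
```

The six sample sentences are hand-picked, so the scores printed for them say nothing about performance on real email; they only show how the figures in Tables 1 and 2 are defined.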
Algorithm                                Abbreviation
S&M original                             SMQA
S&M with Regex                           SMQA-regex
S&M with Regex and highest prob.         SMQA-highest
Naïve type matching                      Naïve
Heuristic hand-tuned, random selection   Random
Heuristic hand-tuned                     Heuristic
Heuristic by classifier                  Heuristic-C
Heuristic by linear regression           Heuristic-LR

               LCS                             SM
Algorithm      Precision  Recall  F1-score    Precision  Recall  F1-score
SMQA           0.0506     0.2568  0.0842      0.0934     0.4213  0.1519
SMQA-regex     0.2101     0.2245  0.2166      0.3832     0.3693  0.3756
SMQA-highest   0.1979     0.1665  0.1805      0.4429     0.3068  0.3582
Naïve          0.2615     0.0473  0.0779      0.4544     0.0909  0.1482
Random         0.1117     0.1013  0.1059      0.4187     0.3120  0.3521
Heuristic      0.2444     0.2012  0.2195      0.5376     0.3655  0.4259
Heuristic-C    0.2058     0.1717  0.1854      0.5238     0.3621  0.4202
Heuristic-LR   0.2396     0.1814  0.2039      0.5335     0.3459  0.4113
Table 6: Question-answer pairing, Annotator 2
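The selection step shared by the Heuristic variants (Section 3.2) can be sketched in Python as follows: each candidate answer receives a weighted feature score, the best candidate above the threshold (default 0.5) is paired with the question, and an answer may be assigned to at most two questions. The paper does not state how the limit of two is enforced, so processing questions in thread order and skipping exhausted answers is an assumption here, as are the function names and the feature names used as dictionary keys.

```python
def weighted_score(features, weights):
    """Linear combination of feature values; both arguments map feature name -> number."""
    return sum(weights[name] * value for name, value in features.items())


def pair_answers(scores, threshold=0.5, max_questions_per_answer=2):
    """Pick at most one answer per question.

    scores maps question id -> {candidate answer id -> score or probability},
    with questions listed in thread order (an assumption of this sketch).
    """
    times_used = {}  # answer id -> number of questions it has been paired with
    pairs = {}
    for question, candidates in scores.items():
        eligible = [
            (score, answer)
            for answer, score in candidates.items()
            if score > threshold
            and times_used.get(answer, 0) < max_questions_per_answer
        ]
        if eligible:
            # Take the highest-scoring (or most probable) eligible candidate.
            _, best = max(eligible, key=lambda pair: pair[0])
            pairs[question] = best
            times_used[best] = times_used.get(best, 0) + 1
    return pairs
```

A question whose candidates all fall at or below the threshold, or whose only strong candidate has already been used twice, is left unpaired in this sketch.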
Algorithm      Mean     Variance   Median
SMQA           1450     1480000    1340
SMQA-regex     69.7     4570       62.0
SMQA-highest   57.6     4570       32.0
Naïve          0.0769   0.0710     0
Random         2.46     67.9       0
Heuristic      296      84100      172
Heuristic-C    372      130000     249
Heuristic-LR   256      62300      171
Table 7: Question-answer pairing computation times (ms)

in terms of number of milliseconds per thread. Mean thread length is 5.2 email messages; mean sentences per message is 13.9. We can see that the median amount of time taken by all the algorithms, except for SMQA, is extremely low, but that our Heuristic variants take an order of magnitude more time than our SMQA variants, although an order of magnitude less than the original S&M. This relatively higher runtime, and variance, is the price of the higher precision and recall. Heuristic-C is the slowest of the three variants. The bulk of the time for S&M is taken by question detection; for Heuristic, it is finding semantic similarity (feature 10). Feature computation is reflected in the runtime of Naïve versus Heuristic.

Learning and Features. Learning curves (not shown for reasons of space) show that, under both SM and LCS metrics, both Heuristic-C and Heuristic-LR converge with a training set of approximately 30 threads. The comparable-or-better performance of hand-tuned Heuristic, after only modest effort in tuning the feature weights, suggests that it is identification of appropriate features that is more significant in this problem than the values of the weights. The similar performance of the two learning variants, despite their different models, lends support to this. However, automated acquisition of feature weights, since training is straightforward, does offer the flexibility of adaptation to characteristics of particular datasets.

A factor analysis of the relative benefit of features indicates that, on our benchmark corpora, the features that give the greatest benefit are numbers 2–7. Features 1 (non stop words) and 9–11 are also beneficial, but feature 8 (whether m_A is the first reply to m_Q) has surprisingly little contribution. Although visual inspection reveals that answers are frequently seen in the first reply to m_Q, we hypothesize that this occurrence is accounted for by feature 4.

5 Conclusion and Future Work

Question-answer identification and pairing provides a generative summary of an email thread. For question detection, on threads drawn from the Enron and Cspace corpora, we find that the learning algorithm of Shrestha and McKeown [2004] suffers from poor precision when exposed to non-interrogative questions. A generalized expression matching algorithm performs adequately with very low runtime. Future work is to investigate the promise of the question detection method of Cong et al. [2008] for email conversations.

For answer detection and question-answer pairing, we presented a generalization of Shrestha and McKeown’s feature-based algorithm, and found that our heuristic-based method balances precision and recall, while maintaining a modest computation time. Future work is to (1) examine further the interplay of sentence-level and paragraph-level features, (2) more fully exploit named entity extraction, and (3) consider explicitly thread structure by means of the induced graph of message relationships, together with quoted material (compare [Carenini et al., 2007]).

Acknowledgments. We gratefully thank the volunteers who annotated the thread corpora. This material is based upon work supported by DARPA under Contract No. FA8750-07-D-0185. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s).

References

[Belotti et al., 2005] V. Belotti, N. Ducheneaut, M. Howard, I. Smith, and R. Grinter. Quality vs. quantity: Email-centric task management and its relations with overload. Human-Computer Interaction, 20(2/3):89–138, 2005.
[Carenini et al., 2007] G. Carenini, R. T. Ng, and X. Zhou. Summarizing email conversations with clue words. In Proc. of WWW’07, pages 91–100, 2007.
[Cohen, 1996] W. Cohen. Learning trees and rules with set-valued features. In Proc. of AAAI-96, pages 709–716, 1996.
[Cong et al., 2008] G. Cong, L. Wang, C.-Y. Lin, Y.-I. Song, and Y. Sun. Finding question-answer pairs from online forums. In Proc. of SIGIR’08, pages 467–474, 2008.
[Dabbish and Kraut, 2006] L. Dabbish and R. Kraut. Email overload at work: An analysis of factors associated with email strain. In Proc. of CSCW’06, pages 431–440, 2006.
[Dredze et al., 2008] M. Dredze, V. R. Carvalho, and T. Lau, editors. Enhanced Messaging: Papers from the 2008 AAAI Workshop, Menlo Park, CA, 2008. AAAI Press.
[Freed et al., 2008] M. Freed, J. Carbonell, G. Gordon, J. Hayes, B. Myers, D. Siewiorek, S. Smith, A. Steinfeld, and A. Tomasic. RADAR: A personal assistant that learns to reduce email overload. In Proc. of AAAI-08, pages 1287–1293, 2008.
[Kathol and Tur, 2008] A. Kathol and G. Tur. Extracting question/answer pairs in multi-party meetings. In Proc. of ICASSP’08, pages 5053–5056, 2008.
[Klimt and Yang, 2004] B. Klimt and Y. Yang. The Enron corpus: A new dataset for email classification research. In Proc. of ECML’04, pages 217–226, 2004.
[Minkov et al., 2005] E. Minkov, R. Wang, and W. Cohen. Extracting personal names from emails: Applying named entity recognition to informal text. In Proc. of HLT-EMNLP’05, 2005.
[Pirrò and Seco, 2008] G. Pirrò and N. Seco. Design, implementation and evaluation of a new similarity metric combining feature and intrinsic information content. In Proc. of ODBASE’08, 2008.
[Shrestha and McKeown, 2004] L. Shrestha and K. McKeown. Detection of question-answer pairs in email conversations. In Proc. of COLING’04, pages 542–550, 2004.
[Ulrich et al., 2008] J. Ulrich, G. Murray, and G. Carenini. A publicly available annotated corpus for supervised email summarization. In Proc. of AAAI’08 Workshop on Enhanced Messaging, pages 77–81, 2008.
[Yeh and Harnly, 2006] J.-Y. Yeh and A. Harnly. Email thread reassembly using similarity matching. In Proc. of CEAS’06, 2006.