
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Incorporating Expert Knowledge into Keyphrase Extraction

Sujatha Das Gollapalli, Xiao-Li Li
Institute for Infocomm Research, A*STAR, Singapore
{gollapallis,xlli}@i2r.a-star.edu.sg

Peng Yang
Tencent AI Lab, Shenzhen, China
[email protected]

Abstract

Keyphrases that efficiently summarize a document's content are used in various document processing and retrieval tasks. Current state-of-the-art techniques for keyphrase extraction operate at a phrase level and involve scoring candidate phrases based on features of their component words. In this paper, we learn keyphrase taggers for research papers using token-based features incorporating linguistic, surface-form, and document-structure information through sequence labeling. We experimentally illustrate that, using within-document features alone, our tagger trained with Conditional Random Fields performs on par with existing state-of-the-art systems that rely on information from Wikipedia and citation networks. In addition, we are also able to harness recent work on feature labeling to seamlessly incorporate expert knowledge and predictions from existing systems, enhancing extraction performance further. We highlight the modeling advantages of our keyphrase taggers and show significant performance improvements on two recently-compiled datasets of keyphrases from Computer Science research papers.

Introduction

Keyphrases (or keywords) that provide a concise representation of the topical content of a document are used in various data mining and web-related tasks (Hammouda, Matute, and Kamel 2005; Bao et al. 2007; Xu et al. 2008; Li et al. 2010). Keyphrase extraction, the challenging task of automatically extracting a small set of representative keyphrases, continues to garner research interest in the AI and natural language processing (NLP) communities (Wan and Xiao 2008; Hasan and Ng 2010).

Various supervised and unsupervised techniques are available for keyphrase extraction (Hasan and Ng 2014). Most state-of-the-art systems first extract a set of candidate phrases for a given document (Frank et al. 1999; Medelyan, Frank, and Witten 2009; Gollapalli and Caragea 2014). These systems employ phrase-filtering on the set of all n-grams of an input document to remove phrases that are unlikely to be human-generated keyphrases. For instance, n-grams ending in stopwords or prepositions are unlikely to be author-specified keyphrases. Linguistic filters effectively reduce the number of possible n-grams that need to be considered by subsequent scoring and classification modules. Typically, the value of n is set to {1,2,3} based on the observation that author-specified keyphrases tend to be uni/bi/tri-grams in practice (Caragea et al. 2014).

In unsupervised models, candidate phrases are scored based on the individual tokens comprising them.[1] Various "goodness" or "interestingness" measures that reflect document-level, corpus-level, and external statistics are used in this scoring (Hasan and Ng 2010). For example, the TextRank algorithm builds a graph based on neighboring words in a document and computes the score of each word as the PageRank centrality measure of its corresponding node in the word graph (Mihalcea and Tarau 2004).

In contrast, supervised models use known ("correct") keyphrases to frame keyphrase identification as a binary classification task. Candidate phrases from the training set of documents are assigned positive and negative labels, and features such as part-of-speech (POS) tags, phrase length, occurrence frequency, and position information in the document are used for learning keyphrase classifiers (Hasan and Ng 2014). Ranking approaches, which use ordering information among candidate phrases to train extraction models, were also investigated previously (Jiang, Hu, and Li 2009).

In this paper, we avoid the candidate phrase extraction step by formulating keyphrase extraction as a sequence tagging/labeling task. Given a stream of tokens corresponding to the content of a document,[2] a keyphrase tagger assigns to each token position a tag/label from the set {KP, O}, where the label KP corresponds to a keyphrase token and O refers to a non-keyphrase token. An example is shown in Table 1. Unlike phrase-based approaches, where candidate phrases comprise (multiple) training/test instances for a document, the entire content of the document comprises a single instance for a sequence tagging model.

The example in Table 1 refers to the title of a research paper published in the World Wide Web conference in the year 2010 and is part of the recently-compiled datasets for keyphrase extraction described further in the Experiments section. We highlight some shortcomings of existing systems in handling this example.

[1] We use "token" and "word" interchangeably.
[2] We assume textual content and whitespace tokenization.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Tokens: Visualizing differences in web search algorithms using the expected weighted Hoeffding distance
POS tags: VBG NNS IN NN NN VBZ VBG DT VBN JJ VBG NN
Phrase tags: NP NP PP NP NP VP VP NP NP NP NP NP
Labels: O O O O O O O O KP KP KP KP

Table 1: (Example) The title of a research paper is shown with its tokens, POS and phrase tags, and keyphrase labels.
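To make the {KP, O} tagging output concrete, the following minimal Python sketch (illustrative only; not part of any system evaluated in this paper) recovers keyphrases from a predicted label sequence by collecting maximal runs of KP-labeled tokens:

    def decode_keyphrases(tokens, labels):
        """Collect maximal runs of KP-labeled tokens into keyphrases."""
        phrases, current = [], []
        for token, label in zip(tokens, labels):
            if label == "KP":
                current.append(token)
            elif current:
                phrases.append(" ".join(current))
                current = []
        if current:
            phrases.append(" ".join(current))
        return phrases

    tokens = ("Visualizing differences in web search algorithms "
              "using the expected weighted Hoeffding distance").split()
    labels = ["O"] * 8 + ["KP"] * 4
    print(decode_keyphrases(tokens, labels))
    # -> ['expected weighted Hoeffding distance']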

Several keyphrase extraction algorithms, including the recent ExpandRank, CiteTextRank, and CeKE systems (Wan and Xiao 2008; Gollapalli and Caragea 2014; Caragea et al. 2014), employ part-of-speech criteria during phrase filtering. Specifically, these systems only consider phrases comprised of nouns and adjectives with POS tags from the set {NN, NNS, NNP, NNPS, JJ} for scoring.[3] In addition, based on the value of n during n-gram generation, these systems may not generate candidate phrases with more than three tokens. In Table 1, the author-specified keyphrase (the four tokens labeled KP) has four words as well as POS tags referring to verbs[4] and is automatically excluded from consideration by several existing systems (Hasan and Ng 2014).

Indeed, we noticed that about 8% of author-specified keyphrases in our experimental datasets have tags other than nouns and adjectives, and about 1% of them have more than three tokens. The keyphrase extraction algorithm based on sequence tagging described in this paper does not involve an explicit phrase extraction step and is able to consider all possible candidate phrases of arbitrary length by default. Although concerns related to missing phrases in classifiers may be addressed by including all possible n-grams, in practice such an inclusion results in several noisy phrases that affect learning algorithms.

Incorporating Expert Knowledge: Several recent state-of-the-art keyphrase extraction systems incorporate external sources of evidence and domain-specific knowledge along with document and corpus-level information while scoring candidate keyphrases. For example, Maui uses semantic information based on Wikipedia (Medelyan, Frank, and Witten 2009), whereas the CeKE system (Caragea et al. 2014) includes features based on the document-citation network obtained from CiteSeerx (Li et al. 2006).

In most existing systems, specialized knowledge is incorporated into the extraction process through complex features such as "the position of the first occurrence of a phrase divided by the total number of tokens" (Frank et al. 1999; Hulth 2003), "the distribution of terms among different document sections" (Nguyen and Kan 2007), "the distance of the first occurrence of a phrase from the beginning of a paper is below some value β" (Caragea et al. 2014), and "number of links to the Wikipedia article referring to the phrase" (Medelyan, Frank, and Witten 2009).

In lieu of intricate features such as the above, we harness recent work on weak supervision to specify expert hints and external knowledge during keyphrase extraction through simple label-probability distributions. For example, specifying the expert knowledge that "a noun word that occurs in the document's title is more likely to be a keyphrase" involves simply expressing a label-distribution preference with the corresponding feature as ("isInTitleAndNoun" {KP:0.9 O:0.1}). We use label distributions to incorporate expert hints into model training through the posterior regularization framework (Mann and McCallum 2008). Our contributions in this paper are listed below:

1. We study keyphrase extraction as a sequence tagging task and design features for learning a keyphrase tagger using Conditional Random Fields or CRFs (Lafferty, McCallum, and Pereira 2001). In contrast with several existing works, our set of features is minimalistic, with all features representing linguistic, orthographic, and structure information extracted from within the document.

2. We investigate feature labeling and posterior regularization as a means to seamlessly integrate expert knowledge and domain-specific hints during keyphrase extraction. To the best of our knowledge, we are the first to study weak supervision as an alternative to intricate feature design to achieve this objective.

3. We illustrate the performance of our keyphrase taggers on two recently-compiled datasets of research papers in Computer Science. Our models are able to perform on par with several state-of-the-art systems that make use of external evidence from citations and Wikipedia despite only using within-document features. Additionally, when external evidence is incorporated through feature labeling, we significantly outperform existing baselines on both datasets. We show performance benefits with both expert-specified and automatically-generated labeled features on our experimental datasets.

We summarize the features used to train our keyphrase taggers and introduce the feature labeling framework in the next section (Proposed Methods). Datasets, baselines, and the experimental setup used to evaluate our models are described in the Experiments section, whereas closely-related recent work is briefly summarized in the Related Work section. Finally, we conclude the paper with a summary and notes on future directions.

[3] The Penn Treebank list of tags is available at: https://catalog.ldc.upenn.edu/docs/LDC99T42/tagguid1.pdf.
[4] We obtain the POS and phrase tags using the Stanford parser.

Proposed Methods

Sequence tagging involves the prediction of a sequence of labels y = <y_1 ... y_N> given an input sequence of tokens t = <t_1 ... t_N> (Sarawagi 2005). Each position i: 1 ... N in the input sequence of tokens can be modeled by a vector of features, giving <x_1 ... x_N>. Although various generative and discriminative models exist for learning sequence taggers, Conditional Random Fields were shown to obtain state-of-the-art performance on several IE and NLP tagging tasks that involve complex, interdependent features (Sutton and McCallum 2012).
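For completeness, in the standard linear-chain formulation (Lafferty, McCallum, and Pereira 2001) a CRF models the conditional probability of a label sequence given the input as

    p_θ(y | x) = (1 / Z(x)) exp( Σ_{i=1..N} Σ_k θ_k f_k(y_{i-1}, y_i, x, i) )

where the f_k are feature functions instantiated from the templates described in the next subsection, the θ_k are learned weights, and Z(x) normalizes over all possible label sequences. The dependence of each factor on the label pair (y_{i-1}, y_i) yields the edge-transition parameters that allow adjacent KP labels to reinforce one another.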
Illustrative Example: A partial list of features extracted
Features for Keyphrase Tagging

We train a keyphrase tagger using CRFs with the following three types of features.

1. Word, orthographic, and stopword features: We use whitespace tokenization, convert all tokens to lowercase after removing punctuation, and use the stemmed form of the token (obtained using the Porter stemmer (1997)) for word features. We add a special feature "allPunct" to capture tokens comprised only of punctuation, as well as boolean features "isCapitalized" and "isStopword" to indicate whether the word is capitalized or a stopword.[5] In addition, the end of a sentence is explicitly indicated using an "EOL" feature to capture sentence boundary information.

2. Parse-tree features: We obtain the lexicalized parse of the document content using the case and punctuation cues provided by the author of the document. The Stanford Parser (Finkel, Grenager, and Manning 2005)[6] was used to obtain the level-1 and level-2 parse tags comprising the part-of-speech (POS) and phrase features at each word position. Hulth (2003) showed that incorporating linguistic knowledge such as NP-chunking and POS tags dramatically improves extraction performance over using statistical features alone, and almost all existing works incorporate POS tags in their models (Hasan and Ng 2014).

3. Title features: We indicate if a non-stopword is part of the document's title using a boolean feature ("isInTitle"). The title of a document can be considered a summary sentence describing the document, and authors often add discriminative words to their titles. The isInTitle feature depends on document structure information that is often part of research paper datasets (Kim et al. 2010).

Given the token stream corresponding to a document, let F, G represent feature types described above (word, POS, etc.) and i represent a token position. The feature templates used for training our keyphrase tagger are listed below:

Unigram features: F_i
Bigram features: F_{i-1} F_i and F_i F_{i+1}
Skipgram features: F_{i-1} F_{i+1}
Compound features: F_i G_i

Unigram features refer to the features generated at position i using the token at that position (e.g., the POS tag of the word at position i). The neighborhood information for a given position is incorporated using the bigram and skipgram features that reference tokens at the previous and next positions relative to i. Intuitively, if the current token is part of a multiterm phrase, this may be indicated via suitable bigram and skipgram features (e.g., they may share the same phrase tags). Compound features are conjunctions combining features at a given position. For example, the feature "isInTitle and POS=NN" is stronger evidence to the tagger than each of these features in isolation.

Illustrative Example: A partial list of the features extracted for the token "expected" from the anecdotal example in Table 1 is shown in Table 2. The unigram features comprise the stemmed token, the POS and phrase tags, and boolean features indicating the lack of capitalization and presence in the title, as well as a feature indicating that this token is not a stopword. The "big1" and "big-1" prefixes indicate bigrams involving the current token with its next and previous token positions, respectively. For example, the feature "big1 notStopword notStopword" captures the information that both the current token ("expected") and the next token ("weighted") are not stopwords. The "cmpd-L1-VBN isInTitle" feature captures the information that the token is both a verb and in the title, while "skip-1-L1-DT L1-JJ" captures the adjacent POS tag features of the tokens "the" and "weighted", respectively.

We train our CRF tagger using unigram features corresponding to all feature types, and bigram, skipgram, and compound features corresponding to orthographic, stopword, parse-tree, and title features. In experiments, we commonly refer to "bigram, skipgram, and compound features" as neighborhood features. Note that, in contrast with some of the intricate features mentioned in the Introduction section, our features are fairly simple in design and are also commonly employed in other IE and NLP tagging tasks (Sarawagi 2005; Indurkhya and Damerau 2010). A sketch of how these templates expand into concrete features follows Table 2 below.

Baselines: We compare our tagger with recent state-of-the-art systems: Kea, Maui, and CeKE. The Kea system originally proposed in (Frank et al. 1999) has been significantly enhanced since and forms a competitive baseline using document, thesaurus, and corpus-based features such as TFIDF, length of the phrase, and first occurrence. Maui augments the features in Kea with several novel features such as spread of the phrase and keyphraseness. In addition, phrases are mapped to specific Wikipedia article pages, and features such as the node degree of the page in the Wikipedia graph and the occurrence of the phrase in the link of the page are used in Maui (Medelyan, Frank, and Witten 2009). The CeKE system, designed for research papers, augments features from Kea with several additional features based on term occurrence in citation contexts and citation-based TFIDF (Caragea et al. 2014).

All the above baseline systems are phrase-based supervised techniques and have publicly-available implementations. Unlike Kea, Maui and CeKE use external evidence from Wikipedia and the citation network, respectively. To the best of our knowledge, these systems comprise the most recent algorithms involving supervised techniques. Additionally, for the research paper datasets used in this paper, CeKE was shown to outperform both Kea and the TextRank family of unsupervised techniques (Caragea et al. 2014).

[5] We used the stopword list from Maui (Medelyan, Frank, and Witten 2009).
[6] http://nlp.stanford.edu/software/lex-parser.shtml

Type of feature    Partial list of features
Unigrams           expect, L1-VBN, L2-NP, isInTitle, notCapitalized, notStopword
Bigrams            big1 notStopword notStopword, big-1-L1-DT L1-VBN, big-1-L2-NP L2-NP
Skipgrams          skip-1-L1-DT L1-JJ, skip-1-L2-NP L2-NP, skip-1-isStopword notStopword
Compounds          cmpd-L1-VBN L2-NP, cmpd-L1-VBN isInTitle, cmpd-L1-VBN notStopword

Table 2: Sample features are shown for the token "expected" in the example from Table 1.
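To illustrate how the feature templates expand into concrete feature strings like those in Table 2, the following Python sketch (a simplified rendering for illustration, not the authors' released code; only the word, POS, and stopword feature types are shown, and the exact prefix conventions vary slightly in Table 2) generates unigram, bigram, skipgram, and one compound feature at a token position:

    def unigram_features(i, words, pos, stopwords):
        """Per-position features: lowercased word, level-1 POS tag, stopword flag."""
        word = words[i].lower()
        return [word, "L1-" + pos[i],
                "isStopword" if word in stopwords else "notStopword"]

    def expand_templates(i, words, pos, stopwords):
        """Expand unigram features into bigram, skipgram, and compound features."""
        here = unigram_features(i, words, pos, stopwords)
        feats = list(here)                                   # unigram: F_i
        if i + 1 < len(words):                               # bigram: F_i F_{i+1}
            nxt = unigram_features(i + 1, words, pos, stopwords)
            feats += ["big1-%s %s" % fp for fp in zip(here, nxt)]
        if i > 0:                                            # bigram: F_{i-1} F_i
            prv = unigram_features(i - 1, words, pos, stopwords)
            feats += ["big-1-%s %s" % fp for fp in zip(prv, here)]
        if 0 < i < len(words) - 1:                           # skipgram: F_{i-1} F_{i+1}
            feats += ["skip-1-%s %s" % fp for fp in
                      zip(unigram_features(i - 1, words, pos, stopwords),
                          unigram_features(i + 1, words, pos, stopwords))]
        feats.append("cmpd-%s %s" % (here[1], here[2]))      # compound: F_i G_i
        return feats

    words = "using the expected weighted Hoeffding distance".split()
    pos = ["VBG", "DT", "VBN", "JJ", "VBG", "NN"]
    print(expand_templates(2, words, pos, {"the"}))
    # output includes 'skip-1-L1-DT L1-JJ' and 'cmpd-L1-VBN notStopword', as in Table 2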

Feature Labeling and Posterior Regularization

Mann, Druck, and McCallum proposed the feature labeling framework as a means to incorporate expert-provided hints into CRF-based taggers (2008). For example, to capture the expert intuition that noun words occurring in paper titles are more likely to be keyphrases, the feature "cmpd-L1-NN isInTitle" (using our previous notation) can be assigned a label distribution {KP:0.9 O:0.1}, indicating a preference for tokens with this feature to be marked with the label 'KP' with very high probability (90% of the time). Unlike standard CRF models that are trained on fully-annotated training instances (supervised models), the Posterior Regularization (PR) framework incorporates information from individual labeled features such as the above into the CRF parameter estimation process, thus allowing "weak supervision" into the learning process.

Set 1 (Predictions from Phrase-based Classifiers)
cmpd-CeKEKP MauiKP               KP:0.9 O:0.1
CeKEKP                           KP:0.8 O:0.2
MauiKP                           KP:0.8 O:0.2

Set 2 (Presence in Document Title and Citation Contexts)
cmpd-L2-NP isInCitingContexts    KP:0.8 O:0.2
cmpd-L2-NP isInCitedContexts     KP:0.8 O:0.2
cmpd-L2-NP isInTitle             KP:0.8 O:0.2

Set 3 (Predictions from Unsupervised Models)
cmpd-L2-NP OneUKP                KP:0.7 O:0.3
cmpd-L2-NP TwoUKP                KP:0.8 O:0.2
cmpd-L2-NP AllUKP                KP:0.9 O:0.1

Table 3: Sample 'expert' features and label distributions.

In the PR framework, the specified feature-label distributions are converted to a set of linear constraints on model posterior expectations for the features. The objective function of the CRF is suitably modified to include an additional factor capturing the KL-divergence between the posteriors based on labeled features and the original model-estimated posteriors for the same features (Mann and McCallum 2010; Ganchev et al. 2010). Mann and McCallum showed that, given limited annotation time, expert-specified labeled features can be used to improve discriminative models over other semi-supervised approaches that use fully-labeled instances. In addition, given sufficient annotated data, various techniques were studied to automatically estimate feature-label distributions for specific tagging problems (Haghighi and Klein 2006; Druck, Mann, and McCallum 2008; Gollapalli et al. 2014).
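Schematically, and only as a sketch (the precise formulations appear in Mann and McCallum (2010) and Ganchev et al. (2010)), training with labeled features maximizes a penalized likelihood of the form

    max_θ  Σ_d log p_θ(y_d | x_d)  −  γ Σ_{f ∈ L} KL( q_f || p_{θ,f} )

where L is the set of labeled features, q_f is the specified label distribution for feature f (e.g., {KP:0.9 O:0.1}), p_{θ,f} is the model's expected label distribution over the token positions where f fires, and γ weights the constraint term (we set the constraint weights to 50, as noted in the Experiments section).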
The standard model training process in CRFs also estimates label distributions for features based on the (feature, label) co-occurrence counts in the training data during likelihood computation (Sutton and McCallum 2012). Labeled features are useful when such an estimation is not accurately possible due to a lack of a sufficient number of training instances capturing the (feature, label) co-occurrence. Consequently, this information, when specified explicitly using an 'expert' label distribution (e.g., {KP:0.9 O:0.1}), comprises additional information for the learning algorithm.

Expert Features: We capture observations based on previous research in keyphrase extraction using three sets of "expert" labeled features and enforce them using the posterior regularization framework in experiments.

The first set of features in Table 3 captures predictions from the baseline systems CeKE and Maui (described in the previous section). That is, when a given phrase is identified by known supervised techniques (CeKE and Maui), we indicate the high likelihood of this word indeed being a keyphrase through the corresponding labeled features. Thus, we fold in predictions from phrase-based classifiers within the tagger in a two-step setting with this set of labeled features.

The second set of features captures preferences for noun phrases occurring in document titles and citation contexts. Previous studies related to keyphrase extraction in research papers have found these features to be highly indicative of keyphrases (Kim et al. 2010; Gollapalli and Caragea 2014).

Finally, for the third set of labeled features, we incorporate information from existing unsupervised keyphrase (UKP) extraction algorithms: TFIDF, TextRank, SingleRank, and ExpandRank (Mihalcea and Tarau 2004; Wan and Xiao 2008). We indicate preferences for words in noun phrases that were marked among the top-10 predictions from at least one, two, and all of the unsupervised methods, indicated by OneUKP, TwoUKP, and AllUKP, respectively.

Labeled Features through Feature Selection: We also study automatic techniques to extract labeled features by applying standard feature selection measures on instances in the training data. Similar to prior works (Haghighi and Klein 2006; Druck, Mann, and McCallum 2008; Gollapalli et al. 2014), we extract the features that co-occur with the 'KP' label with larger than average frequency. The autoPMI list refers to features having larger Pointwise Mutual Information with the label 'KP' than with the label 'O', ranked based on PMI values, whereas autoFreq refers to features ranked based on their occurrence frequency. The top-10 features from these two rankings are assigned a heuristic label distribution {KP:0.9 O:0.1} to form labeled feature sets. A sketch of the autoPMI computation is given below.
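A minimal Python sketch of the autoPMI selection just described (illustrative only; the function names and token-level representation are our assumptions, not the paper's implementation):

    import math
    from collections import Counter

    def auto_pmi_features(token_features, token_labels, top_k=10):
        """token_features: list of feature sets, one per token position;
        token_labels: the corresponding 'KP'/'O' labels. Returns the top_k
        features ranked by PMI with 'KP' (restricted to features with larger
        PMI with 'KP' than with 'O'), each assigned {KP:0.9, O:0.1}."""
        feat_counts, joint = Counter(), Counter()
        label_counts = Counter(token_labels)
        total = float(len(token_labels))
        for feats, label in zip(token_features, token_labels):
            for f in feats:
                feat_counts[f] += 1
                joint[(f, label)] += 1

        def pmi(f, label):
            if joint[(f, label)] == 0:
                return float("-inf")
            return math.log((joint[(f, label)] / total) /
                            ((feat_counts[f] / total) * (label_counts[label] / total)))

        # keep features more strongly associated with 'KP' than with 'O'
        candidates = [f for f in feat_counts if pmi(f, "KP") > pmi(f, "O")]
        ranked = sorted(candidates, key=lambda f: pmi(f, "KP"), reverse=True)
        return {f: {"KP": 0.9, "O": 0.1} for f in ranked[:top_k]}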

Experiments

Datasets
Venue #Abs/#KPs (Org) #Abs/#KPs (Locatable) Number of keyphrases with different lengths
KDD 365/1467 315/717 {#unigrams 221, #bigrams 404, #trigrams 80, #>trigrams 12}
WWW 425/2065 388/905 {#unigrams 368, #bigrams 451, #trigrams 79, #>trigrams 7}

Table 4: Summary of Datasets. The total numbers of abstracts and keyphrases in the original dataset are shown with the numbers
of abstracts for which at least one author-specified keyphrase could be located along with the total number of keyphrases located.

We evaluate our models using the research paper datasets collected by recent works on keyphrase extraction (Gollapalli and Caragea 2014). To the best of our knowledge, these datasets comprise the largest publicly-available benchmark datasets of research paper abstracts containing both author-specified keyphrases and citation network information. The abstracts in these datasets are from papers published in two premier conferences: the World Wide Web (WWW) Conference and the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). The incoming and outgoing citation contexts for each paper were obtained from CiteSeerx, the digital library portal for Computer Science related literature (Li et al. 2006).

Similar to previous works, we evaluate the predictions of each extraction algorithm against the author-specified keyphrases that can be located in the corresponding paper abstracts in the dataset (the "gold standard"). We employ 10-fold cross-validation and present (micro) averaged results for all our experiments using the precision, recall, and F1 measures. For comparing two methods, we choose the F1 measure, which represents a balance between precision and recall (Manning, Raghavan, and Schütze 2008).
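For reference, a minimal sketch of this micro-averaged evaluation, under the assumption of exact string matching between predicted and gold keyphrases (the sketch is ours; matching details may differ across the evaluated systems):

    def micro_prf(predicted, gold):
        """predicted, gold: one set of keyphrases per document (parallel lists).
        Returns micro-averaged precision, recall, and F1."""
        tp = sum(len(p & g) for p, g in zip(predicted, gold))
        n_pred = sum(len(p) for p in predicted)
        n_gold = sum(len(g) for g in gold)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1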
The datasets are summarized in Table 4, along with the number of keyphrases originally specified with each paper and the number of keyphrases locatable in the paper abstracts.[7] We also indicate the number of keyphrases with one, two, three, and more than three tokens found in these abstracts. As observed previously (Caragea et al. 2014), very few (only about 1%) author-specified keyphrases have more than three tokens. We found that about 8-9% of the gold keyphrases do not satisfy the noun or adjective POS filters used in previous works.

We used the CRF and posterior regularization implementations provided as part of the Mallet toolkit (McCallum 2002). Default parameter settings were used while training the standard CRF models. For posterior regularization, we set the constraint weights to 50 and the number of iterations for the EM-style optimization algorithm to 100.[8] Publicly-available implementations for Kea,[9] Maui,[10] and CeKE[11] were used in the baseline experiments.[12]

Results and Discussion

Tagging compared with baselines: The results of ten-fold cross-validation experiments with CRF taggers trained using unigram and neighborhood features are compared with the baselines in Figure 1 (a). Our CRF taggers significantly outperform Kea, which only uses document and corpus-level features, and also perform on par with CeKE and Maui, which incorporate features from CiteSeerx and Wikipedia, respectively. We did not directly use the CeKE numbers from (Caragea et al. 2014) since they are only computed on phrases that satisfy the POS filters mentioned in the Introduction section. Based on the results in this table, we conclude that sequence tagging models are more effective for keyphrase extraction than phrase-based classifiers. However, phrase-based models obtain higher recall compared to the CRF taggers, which achieve higher precision and an overall better F1 score. In a later experiment, we incorporate predictions from phrase-based models using Set-1 labeled features and PR to further improve extraction performance.

Effect of neighborhood and title features: The ten-fold CV tagging performance on the WWW dataset is shown in Figure 1 (b) using unigram features (UF) alone, both unigram and neighborhood features (UF+NF), and without the title features (noTitle). When predicting tags at a given token position, neighborhood information incorporated via bigrams, skipgrams, and conjunctions is effectively harnessed via the edge-transition parameters (Sutton and McCallum 2012) in a CRF, resulting in better performance with the UF+NF features. Similarly, given the noticeable dip in the performance measures when title-based features are excluded, we conclude that content words present in the titles of research papers are very likely to be part of keyphrases.

Performance with expert-labeled features: For these experiments, we first train the standard CRF tagger on the train split of the data (as before). Next, posterior regularization is applied on the test instances in transductive mode (Bishop 2006). The extraction performance using the different sets of expert labeled features listed in Table 3, as well as the labeled features extracted automatically (the autoFreq and autoPMI settings), is shown in Figure 1 (a). As the numbers indicate, the PR framework is extremely effective in incorporating the external knowledge specified as labeled features into the model estimation process for both datasets. All sets of labeled features, including the automatically extracted ones, result in performance benefits over using the CRF alone. In particular, the best performance (the rows shown in bold in the original figure) is obtained with Set-1, which incorporates predictions from CeKE and Maui.

In Figure 2 (c), we illustrate PR with Set-1 for the WWW dataset. By employing predictions from Maui and CeKE as labeled features (CRF+Set-1 w PR), we are able to do significantly better than both these systems (Maui is the best performing baseline method on the WWW dataset) as well as the original CRF tagger. Specifically, we are able to improve the tagger's recall, as highlighted in the previous discussion. The CRF+Set-1 bars indicate tagging performance when CeKE and Maui predictions are incorporated as regular features in the CRF. Note that these additional features yield no additional performance benefits. As described previously, a potential reason for this behaviour is the lack of sufficient evidence in the training data for accurate model parameter estimation, thus requiring explicit preference specification via labeled features.

[7] Some gold keyphrases, probably part of the fulltext, are missing from the abstracts.
[8] http://mallet.cs.umass.edu/semi-sup-fst.php
[9] http://www.nzdl.org/Kea/index.html
[10] https://github.com/zelandiya/maui
[11] http://www.cse.unt.edu/~ccaragea/keyphrases.html
[12] Processed datasets and code are available upon request.

Figure 1: (a) Ten-fold CV performance of the baseline methods, CRF, and posterior regularization; (b) performance of the CRF tagger with different feature sets on the WWW dataset. Panel (a) is reproduced as a table below (the Set 1 rows, bold in the original, give the best F1); panel (b) is a bar chart comparing Precision, Recall, and F1 for the UF, UF+NF, and noTitle settings, whose bar values are not recoverable from the extracted text.

(a)
Setting              KDD: Precision / Recall / F1      WWW: Precision / Recall / F1
Kea                  0.1551 / 0.3278 / 0.2105          0.1549 / 0.3182 / 0.2084
CeKE                 0.2174 / 0.3905 / 0.2793          0.2251 / 0.2519 / 0.2377
Maui                 0.1695 / 0.3724 / 0.2329          0.1837 / 0.3934 / 0.2504
CRF                  0.4068 / 0.2162 / 0.2823          0.3689 / 0.1945 / 0.2547
CRF with labeled features and PR:
Set 1                0.2777 / 0.4407 / 0.3407          0.2668 / 0.3558 / 0.3049
Set 2                0.3130 / 0.3013 / 0.3070          0.2995 / 0.2575 / 0.2768
Set 3                0.3091 / 0.3026 / 0.3058          0.2969 / 0.2674 / 0.2814
Top-10 PMI           0.2469 / 0.4114 / 0.3085*         0.2511 / 0.3282 / 0.2845*
Top-10 Frequency     0.3282 / 0.2649 / 0.2932          0.3323 / 0.2298 / 0.2717

Figure 2: (a) and (b) Top-10 features based on PMI and frequency; (c) ten-fold CV tagging performance with PR (WWW dataset). Panels (a) and (b) are reproduced as lists below; panel (c) is a bar chart comparing Precision, Recall, and F1 for Maui, CRF, CRF+Set-1, and CRF+Set-1 w PR, whose bar values are not recoverable from the extracted text.

(a) Top-10 features (PMI)
big-1-isInTitle isInTitle
cmpd-isInTitle isCapitalized
cmpd-isInTitle oneUKP
cmpd-isInTitle AllUKP
cmpd-isInTitle TwoUKP
big1-isInTitle isInTitle
cmpd-L2-NP isInTitle
cmpd-L1-NN isInTitle
cmpd-isInTitle isInCitingContext
cmpd-isInTitle isInCitedContext

(b) Top-10 features (Frequency)
big-1-L2-NP L2-NP
big1-isInCitedContext isInCitedContext
big1-isInCitingContext isInCitingContext
big1-L2-NP L2-NP
big-1-isInCitedContext isInCitedContext
big-1-isInCitingContext isInCitingContext
cmpd-isInCitingContext isInCitedContext
cmpd-L2-NP isInCitingContext
cmpd-L2-NP isInCitedContext
cmpd-isInTitle isInCitingContext

Performance with automatically-extracted labeled features: We incorporate 'isInCitedContexts', 'isInCitingContexts', 'OneUKP', 'TwoUKP', and 'AllUKP' as regular features into the training data and apply feature selection using frequency and PMI information. The top-10 features extracted by these methods are assigned the heuristic distribution {KP:0.9 O:0.1} to form labeled features for the autoFreq and autoPMI runs, respectively. The top-10 features extracted from the WWW dataset using this process are shown in Figures 2 (a) and (b).

From the results in Figure 1 (a), it can be seen that the PMI-based features are the better performing of the two sets, and also the best among all sets other than Set-1 (rows marked with '*'). The automatically-extracted PMI features in Figure 2 (a) also make intuitive sense despite no 'expert' guidance.

We note that the observations and trends shown with the WWW dataset also hold for the KDD dataset, the plots of which are not included due to space limitations.

Anecdotes: Our best-performing models correctly identified 37% of the gold keyphrases with more than three tokens from the experimental datasets. Examples of these keyphrases, which also include POS tags not considered in phrase-filtering based approaches, are "learning to rank relational objects" (L1-NNP L1-TO L1-VB L1-JJ L1-NNS), "end-user quality of experience" (L1-JJ L1-NN L1-IN L1-NN), and "quadratically constrained quadratic programming" (L1-RB L1-VBN L1-JJ L1-NN).

Related Work

Keyphrase extraction is widely studied in various domains (Frank et al. 1999; Kim et al. 2010; Bong and Hwang 2011), for different document types (Liu et al. 2009; Marujo et al. 2013), and for tag recommendation (Bao et al. 2007; Xu et al. 2008). Supervised techniques for keyphrase extraction are often phrase-based models trained using document and corpus-level features such as POS tags, position of the word, and TFIDF information (Frank et al. 1999; Witten et al. 1999; Turney 2000; Hulth 2003). Recent systems also incorporate external features based on citation networks as well as Wikipedia into keyphrase extraction (Caragea et al. 2014; Medelyan, Frank, and Witten 2009). In contrast, several unsupervised extraction techniques score keyphrases based on "goodness" measures of the words comprising them using graphs constructed from documents (Mihalcea and Tarau 2004; Boudin 2013; Gollapalli and Caragea 2014; Wang, Liu, and McDonald 2015).

Bhaskar et al. (2012) employ CRFs trained on features such as word presence in document sections such as the abstract and title, as well as linguistic features such as POS, chunking, and named-entity tags, for keyphrase extraction in scientific articles.

Similar features were employed by Zhang et al. for documents in Chinese (2008). We investigated CRFs for their modeling advantages as well as their ability to incorporate expert knowledge via weak supervision. Weak supervision was previously studied for several classification and information extraction problems using techniques such as feature labeling (Haghighi and Klein 2006; Druck, Mann, and McCallum 2008) and known knowledge-bases (Hoffmann et al. 2011).

Conclusions

We studied keyphrase extraction as a tagging task with Conditional Random Fields using simple token, parse, and orthographic features. We showed experimentally that CRFs have both modeling and performance advantages over the current state-of-the-art, phrase-based models on research paper datasets. In addition, we are able to incorporate domain knowledge into the extraction process via the feature labeling framework for CRFs to further enhance extraction performance. In the future, we would like to explore weak supervision for other types of documents (such as news articles and product reviews) as well as in parallel corpora (Arcan et al. 2014).

Acknowledgments

We are grateful to the reviewers for their help in improving the presentation of this paper and to fellow researchers who provided their keyphrase extraction systems and datasets for comparative evaluation.

References

Arcan, M.; Turchi, M.; Tonelli, S.; and Buitelaar, P. 2014. Enhancing statistical machine translation with bilingual terminology in a CAT environment. In Proceedings of the Eleventh Biennial Conference of the Association for Machine Translation in the Americas.

Bao, S.; Xue, G.; Wu, X.; Yu, Y.; Fei, B.; and Su, Z. 2007. Optimizing web search using social annotations. In WWW.

Bhaskar, P.; Nongmeikapam, K.; and Bandyopadhyay, S. 2012. Keyphrase extraction in scientific articles: A supervised approach. In COLING.

Bishop, C. M. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc.

Bong, S.-Y., and Hwang, K.-B. 2011. Keyphrase extraction in biomedical publications using MeSH and intraphrase word co-occurrence information. In Proceedings of the ACM Fifth International Workshop on Data and Text Mining in Biomedical Informatics.

Boudin, F. 2013. A comparison of centrality measures for graph-based keyphrase extraction. In IJCNLP.

Caragea, C.; Bulgarov, F. A.; Godea, A.; and Gollapalli, S. D. 2014. Citation-enhanced keyphrase extraction from research papers: A supervised approach. In EMNLP.

Druck, G.; Mann, G.; and McCallum, A. 2008. Learning from labeled features using generalized expectation criteria. In SIGIR.

Finkel, J. R.; Grenager, T.; and Manning, C. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL.

Frank, E.; Paynter, G. W.; Witten, I. H.; Gutwin, C.; and Nevill-Manning, C. G. 1999. Domain-specific keyphrase extraction. In IJCAI.

Ganchev, K.; Graça, J.; Gillenwater, J.; and Taskar, B. 2010. Posterior regularization for structured latent variable models. JMLR.

Gollapalli, S. D., and Caragea, C. 2014. Extracting keyphrases from research papers using citation networks. In AAAI.

Gollapalli, S. D.; Qi, Y.; Mitra, P.; and Giles, C. L. 2014. Extracting researcher metadata with labeled features. In SDM.

Haghighi, A., and Klein, D. 2006. Prototype-driven learning for sequence models. In HLT-NAACL.

Hammouda, K. M.; Matute, D. N.; and Kamel, M. S. 2005. CorePhrase: Keyphrase extraction for document clustering. In Machine Learning and Data Mining in Pattern Recognition.

Hasan, K. S., and Ng, V. 2010. Conundrums in unsupervised keyphrase extraction: Making sense of the state-of-the-art. In COLING.

Hasan, K. S., and Ng, V. 2014. Automatic keyphrase extraction: A survey of the state of the art. In ACL.

Hoffmann, R.; Zhang, C.; Ling, X.; Zettlemoyer, L.; and Weld, D. S. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In HLT.

Hulth, A. 2003. Improved automatic keyword extraction given more linguistic knowledge. In EMNLP.

Indurkhya, N., and Damerau, F. J. 2010. Handbook of Natural Language Processing. Chapman & Hall/CRC, 2nd edition.

Jiang, X.; Hu, Y.; and Li, H. 2009. A ranking approach to keyphrase extraction. In SIGIR.

Kim, S. N.; Medelyan, O.; Kan, M.-Y.; and Baldwin, T. 2010. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. In SemEval.

Lafferty, J. D.; McCallum, A.; and Pereira, F. C. N. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.

Li, H.; Councill, I. G.; Bolelli, L.; Zhou, D.; Song, Y.; Lee, W.-C.; Sivasubramaniam, A.; and Giles, C. L. 2006. CiteSeerx: A scalable autonomous scientific digital library. In Proceedings of the 1st International Conference on Scalable Information Systems.

Li, Z.; Zhou, D.; Juan, Y.-F.; and Han, J. 2010. Keyword extraction for social snippets. In WWW.

Liu, F.; Pennell, D.; Liu, F.; and Liu, Y. 2009. Unsupervised approaches for automatic keyword extraction using meeting transcripts. In NAACL.

Mann, G. S., and McCallum, A. 2008. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Proceedings of ACL-08: HLT.

Mann, G. S., and McCallum, A. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. J. Mach. Learn. Res.

Manning, C. D.; Raghavan, P.; and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.

Marujo, L.; Ribeiro, R.; de Matos, D. M.; Neto, J. P.; Gershman, A.; and Carbonell, J. G. 2013. Key phrase extraction of lightly filtered broadcast news. CoRR.

McCallum, A. K. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Medelyan, O.; Frank, E.; and Witten, I. H. 2009. Human-competitive tagging using automatic keyphrase extraction. In EMNLP.

Mihalcea, R., and Tarau, P. 2004. TextRank: Bringing order into text. In EMNLP.

Nguyen, T. D., and Kan, M.-Y. 2007. Keyphrase extraction in scientific publications. In Proceedings of the 10th International Conference on Asian Digital Libraries: Looking Back 10 Years and Forging New Frontiers.

Porter, M. F. 1997. An algorithm for suffix stripping. Chapter in Readings in Information Retrieval. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Sarawagi, S. 2005. Sequence data mining. Chapter in Advanced Methods for Knowledge Discovery from Complex Data.

Sutton, C., and McCallum, A. 2012. An introduction to conditional random fields. Found. Trends Mach. Learn.

Turney, P. D. 2000. Learning algorithms for keyphrase extraction. Information Retrieval 2(4).

Wan, X., and Xiao, J. 2008. Single document keyphrase extraction using neighborhood knowledge. In AAAI.

Wang, R.; Liu, W.; and McDonald, C. 2015. Corpus-independent generic keyphrase extraction using word embedding vectors. In Deep Learning for Web Search and Data Mining.

Witten, I. H.; Paynter, G. W.; Frank, E.; Gutwin, C.; and Nevill-Manning, C. G. 1999. Kea: Practical automatic keyphrase extraction. In Proceedings of the Fourth ACM Conference on Digital Libraries.

Xu, S.; Bao, S.; Fei, B.; Su, Z.; and Yu, Y. 2008. Exploring folksonomy for personalized search. In SIGIR.

Zhang, C.; Wang, H.; Liu, Y.; Wu, D.; Liao, Y.; and Wang, B. 2008. Automatic keyword extraction from documents using conditional random fields. Journal of Computational Information Systems 4(3).
