
Accurate Information Extraction from Research Papers using Conditional Random Fields

Fuchun Peng and Andrew McCallum
Department of Computer Science, University of Massachusetts, Amherst, MA 01003
[email protected], [email protected]

Abstract

With the increasing use of research paper search engines, such as CiteSeer, for both literature search and hiring decisions, the accuracy of such systems is of paramount importance. This paper employs Conditional Random Fields (CRFs) for the task of extracting various common fields from the headers and citations of research papers. The basic theory of CRFs is becoming well understood, but best practices for applying them to real-world data require additional exploration. This paper makes an empirical exploration of several factors, including variations on Gaussian, exponential and hyperbolic-L1 priors for improved regularization, and several classes of features and Markov order. On a standard benchmark data set, we achieve new state-of-the-art performance, reducing error in average F1 by 36%, and word error rate by 78%, in comparison with the previous best SVM results. Accuracy compares even more favorably against HMMs.

1 Introduction

Research paper search engines, such as CiteSeer (Lawrence et al., 1999) and Cora (McCallum et al., 2000), give researchers tremendous power and convenience in their research. They are also becoming increasingly used for recruiting and hiring decisions. Thus the information quality of such systems is of significant importance. This quality critically depends on an information extraction component that extracts meta-data, such as title, author, institution, etc., from paper headers and references, because these meta-data are further used in many component applications such as field-based search, author analysis, and citation analysis.

Previous work in information extraction from research papers has been based on two major machine learning techniques. The first is hidden Markov models (HMMs) (Seymore et al., 1999; Takasu, 2003). An HMM learns a generative model over input sequence and labeled sequence pairs. While enjoying wide historical success, standard HMM models have difficulty modeling multiple non-independent features of the observation sequence. The second technique is based on discriminatively-trained SVM classifiers (Han et al., 2003). These SVM classifiers can handle many non-independent features. However, for this sequence labeling problem, Han et al. (2003) work in a two-stage process: first classifying each line independently to assign it a label, then adjusting these labels based on an additional classifier that examines larger windows of labels. Solving the information extraction problem in two steps loses the tight interaction between state transitions and observations.

In this paper, we present results on this research paper meta-data extraction task using a Conditional Random Field (Lafferty et al., 2001), and explore several practical issues in applying CRFs to information extraction in general. The CRF approach draws together the advantages of both finite state HMM and discriminative SVM techniques by allowing the use of arbitrary, dependent features and joint inference over entire sequences.

CRFs have been previously applied to other tasks such as named entity extraction (McCallum and Li, 2003), table extraction (Pinto et al., 2003) and shallow parsing (Sha and Pereira, 2003). The basic theory of CRFs is now well understood, but best practices for applying them to new, real-world data are still in an early exploration phase. Here we explore two key practical issues: (1) regularization, with an empirical study of Gaussian (Chen and Rosenfeld, 2000), exponential (Goodman, 2003), and hyperbolic-L1 (Pinto et al., 2003) priors; (2) exploration of various families of features, including text, lexicons, and layout, as well as proposing a method for the beneficial use of zero-count features without incurring large memory penalties.

We describe a large collection of experimental results on two traditional benchmark data sets. Dramatic improvements are obtained in comparison with previous SVM and HMM based results.

2 Conditional Random Fields

Conditional random fields (CRFs) are undirected graphical models trained to maximize a conditional probability (Lafferty et al., 2001). A common special-case graph structure is a linear chain, which corresponds to a finite state machine, and is suitable for sequence labeling. A linear-chain CRF with parameters \Lambda = \{\lambda_1, \lambda_2, \ldots\} defines a conditional probability for a state (or label¹) sequence s = s_1 \ldots s_T, given an input sequence o = o_1 \ldots o_T, to be

P_\Lambda(s|o) = \frac{1}{Z_o} \exp\Big( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \Big)   (1)

where Z_o is the normalization constant that makes the probability of all state sequences sum to one, f_k(s_{t-1}, s_t, o, t) is a feature function which is often binary-valued, but can be real-valued, and \lambda_k is a learned weight associated with feature f_k. The feature functions can measure any aspect of a state transition, s_{t-1} \to s_t, and the observation sequence, o, centered at the current time step, t. For example, one feature function might have value 1 when s_{t-1} is the state TITLE, s_t is the state AUTHOR, and o_t is a word appearing in a lexicon of people's first names. Large positive values for \lambda_k indicate a preference for such an event, while large negative values make the event unlikely.

¹ We consider here only finite state models in which there is a one-to-one correspondence between states and labels; this is not, however, strictly necessary.

Given such a model as defined in Equ. (1), the most probable labeling sequence for an input o,

s^* = \arg\max_s P_\Lambda(s|o),

can be efficiently calculated by dynamic programming using the Viterbi algorithm.
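To make Equ. (1) and the Viterbi step concrete, the following is a minimal Python sketch (our illustration, not the authors' implementation). It assumes the weighted feature sums have already been folded into per-position score matrices, and that a dummy start state 0 precedes the sequence; these assumptions, and all names, are ours.

import numpy as np

# scores[t][i][j] = sum_k lambda_k * f_k(s_{t-1}=i, s_t=j, o, t)

def log_Z(scores):
    """Normalizer log(Z_o) of Equ. (1), computed by the forward algorithm."""
    T, S, _ = scores.shape
    alpha = scores[0, 0, :]                 # dummy start state 0 (assumption)
    for t in range(1, T):
        m = alpha[:, None] + scores[t]      # m[i, j]: path ending i -> j
        alpha = np.logaddexp.reduce(m, axis=0)
    return np.logaddexp.reduce(alpha)

def viterbi(scores):
    """Most probable state sequence s* = argmax_s P(s|o)."""
    T, S, _ = scores.shape
    delta = scores[0, 0, :]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        m = delta[:, None] + scores[t]
        back[t] = m.argmax(axis=0)          # best previous state per state
        delta = m.max(axis=0)
    states = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1]

def log_prob(scores, states):
    """log P(s|o) for a given labeling, per Equ. (1)."""
    s = scores[0, 0, states[0]]
    for t in range(1, len(states)):
        s += scores[t, states[t - 1], states[t]]
    return s - log_Z(scores)

# toy usage: 4 time steps, 3 states, random feature scores
scores = np.random.randn(4, 3, 3)
path = viterbi(scores)
print(path, log_prob(scores, path))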
The marginal probability of states or transitions at each position in the sequence can be calculated by a dynamic-programming-based inference procedure very similar to forward-backward for hidden Markov models.

The parameters may be estimated by maximum likelihood—maximizing the conditional probability of a set of label sequences, each given their corresponding input sequences. The log-likelihood of training set \{(o^{(i)}, s^{(i)})\} is written

L_\Lambda = \sum_i \log P_\Lambda(s^{(i)}|o^{(i)}) = \sum_i \Big( \sum_t \sum_k \lambda_k f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \log Z_{o^{(i)}} \Big)   (2)

Maximizing (2) corresponds to satisfying the following equality, wherein the empirical count of each feature matches its expected count according to the model P_\Lambda(s|o):

\sum_i \sum_t f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) = \sum_i \sum_s P_\Lambda(s|o^{(i)}) \sum_t f_k(s_{t-1}, s_t, o^{(i)}, t)

CRFs share many of the advantageous properties of standard maximum entropy models, including their convex likelihood function, which guarantees that the learning procedure converges to the global maximum. Traditional maximum entropy learning algorithms, such as GIS and IIS (Pietra et al., 1995), can be used to train CRFs; however, it has been found that a quasi-Newton gradient climber, BFGS, converges much faster (Malouf, 2002; Sha and Pereira, 2003). We use BFGS for optimization. In our experiments, we shall focus instead on two other aspects of CRF deployment, namely regularization and the selection of different model structures and feature types.

2.1 Regularization in CRFs

To avoid over-fitting, log-likelihood is often penalized by some prior distribution over the parameters. Figure 1 shows an empirical distribution of parameters, \lambda, learned from an unpenalized likelihood, including only features with non-zero count in the training set.

[Figure 1: Empirical distribution of \lambda, plotting counts of \lambda (in log scale) against \lambda.]

Three prior distributions that have a shape similar to this empirical distribution are the Gaussian prior, the exponential prior, and the hyperbolic-L1 prior, each shown in Figure 2. In this paper we provide an empirical study of these three priors.
[Figure 2: Shapes of the prior distributions: Gaussian (variance = 2), exponential (a = 0.5), and hyperbolic.]

2.1.1 Gaussian prior

With a Gaussian prior, log-likelihood (2) is penalized as follows:

\sum_i \log P_\Lambda(s^{(i)}|o^{(i)}) - \sum_k \frac{\lambda_k^2}{2\sigma^2}   (3)

where \sigma^2 is a variance.

Maximizing (3) corresponds to satisfying

\sum_i \sum_t f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \frac{\lambda_k}{\sigma^2} = \sum_i \sum_s P_\Lambda(s|o^{(i)}) \sum_t f_k(s_{t-1}, s_t, o^{(i)}, t)

This adjusted constraint (as well as the adjustments imposed by the other two priors) is intuitively understandable: rather than matching exact empirical feature frequencies, the model is tuned to match discounted feature frequencies. Chen and Rosenfeld (2000) discuss this in the context of other discounting procedures common in language modeling. We call the term subtracted from the empirical counts (in this case \lambda_k / \sigma^2) a discounted value.

The variance can be feature dependent. However, for simplicity, constant variance is often used for all features. In this paper, however, we experiment with several alternate versions of the Gaussian prior in which the variance is feature dependent.

Gaussian (and other) priors are gradually overcome by increasing amounts of training data, but perhaps not at the right rate. The three methods below all provide ways to alter this rate by changing the variance of the Gaussian prior dependent on feature counts (a code sketch following the list illustrates all three).

1. Threshold Cut: In language modeling, e.g., Good-Turing smoothing, only low frequency words are smoothed. Here we apply the same idea and only smooth those features whose frequencies are lower than a threshold (7 in our experiments, following standard practice in language modeling).

2. Divide Count: Here we let the discounted value for a feature depend on its frequency in the training set, c_k = \sum_i \sum_t f_k(s_{t-1}, s_t, o, t). The discounted value used here is \lambda_k / (\sigma^2 \sqrt{c_k}), where \sigma^2 is a constant over all features. In this way, we increase the smoothing on the low frequency features more so than on the high frequency features.

3. Bin-Based: We divide features into classes based on frequency. We bin features by frequency in the training set, and let the features in the same bin share the same variance. The discounted value is set to be \lambda_k / (\sigma^2 \lceil c_k / u \rceil), where c_k is the count of feature k, u is the bin size, and \lceil \cdot \rceil is the ceiling function. Alternatively, the variance in each bin may be set independently by cross-validation.
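As an illustration, here is a minimal Python sketch computing the discounted value for one feature under the three feature-dependent variance methods. The formulas follow the equations as reconstructed above, and all names (discounted_value, sigma2, and the default settings) are ours, not the paper's.

import math

def discounted_value(lam, count, sigma2=5.0, scheme="divide",
                     threshold=7, bin_size=15):
    """Discount subtracted from the empirical count of one feature with
    weight `lam` and training-set count `count` (see Section 2.1.1)."""
    if scheme == "threshold":
        # Threshold Cut: smooth only low-frequency features
        return lam / sigma2 if count < threshold else 0.0
    if scheme == "divide":
        # Divide Count: low-frequency features receive a larger discount
        return lam / (sigma2 * math.sqrt(count))
    if scheme == "bin":
        # Bin-Based: features in the same frequency bin share one variance
        return lam / (sigma2 * math.ceil(count / bin_size))
    return lam / sigma2  # plain constant-variance Gaussian prior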
2.1.2 Exponential prior

Whereas the Gaussian prior penalizes according to the square of the weights (an L2 penalizer), the intention here is to create a smoothly differentiable analogue to penalizing the absolute value of the weights (an L1 penalizer). L1 penalizers often result in more "sparse solutions," in which many features have weight nearly at zero, and thus provide a kind of soft feature selection that improves generalization.

Goodman (2003) proposes an exponential prior, specifically a Laplacian prior, as an alternative to the Gaussian prior. Under this prior, log-likelihood (2) is penalized as follows:

\sum_i \log P_\Lambda(s^{(i)}|o^{(i)}) - \sum_k \alpha_k |\lambda_k|   (4)

where \alpha_k is a parameter of the exponential distribution.

Maximizing (4) would satisfy

\sum_i \sum_t f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \alpha_k = \sum_i \sum_s P_\Lambda(s|o^{(i)}) \sum_t f_k(s_{t-1}, s_t, o^{(i)}, t)

This corresponds to the absolute smoothing method in language modeling. We set \alpha_k = \alpha; i.e., all features share the same constant, whose value can be determined using absolute discounting, \alpha = n_1 / (n_1 + 2 n_2), where n_1 and n_2 are the number of features occurring once and twice (Ney et al., 1995).

2.1.3 Hyperbolic-L1 prior

Another L1 penalizer is the hyperbolic-L1 prior, described in (Pinto et al., 2003). The hyperbolic distribution has log-linear tails. Consequently the class of hyperbolic distributions is an important alternative to the class of normal distributions and has been used for analyzing data from various scientific areas such as finance, though it is less frequently used in natural language processing.

Under a hyperbolic prior, log-likelihood (2) is penalized as follows:

\sum_i \log P_\Lambda(s^{(i)}|o^{(i)}) - \sum_k \log\Big( \frac{e^{\lambda_k} + e^{-\lambda_k}}{2} \Big)   (5)

which corresponds to satisfying

\sum_i \sum_t f_k(s_{t-1}^{(i)}, s_t^{(i)}, o^{(i)}, t) - \frac{e^{\lambda_k} - e^{-\lambda_k}}{e^{\lambda_k} + e^{-\lambda_k}} = \sum_i \sum_s P_\Lambda(s|o^{(i)}) \sum_t f_k(s_{t-1}, s_t, o^{(i)}, t)

The hyperbolic prior was also tested with CRFs in McCallum and Li (2003).
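Seen through the matching constraints above, the three priors differ only in the discounted value they subtract from each empirical feature count. A small hedged sketch (our own naming, assuming the reconstructed penalties) makes the comparison explicit:

import math

def prior_discount(lam, prior="gaussian", sigma2=5.0, alpha=0.5):
    """Derivative of each penalty w.r.t. lambda_k, i.e. the 'discounted
    value' subtracted from the empirical count in the equalities above."""
    if prior == "gaussian":      # penalty: lambda^2 / (2 sigma^2)
        return lam / sigma2
    if prior == "exponential":   # penalty: alpha * |lambda|  (Laplacian)
        return math.copysign(alpha, lam)
    if prior == "hyperbolic":    # penalty: log((e^l + e^-l) / 2)
        return math.tanh(lam)    # smooth, saturating analogue of sign(lambda)
    raise ValueError("unknown prior: " + prior)

Note that tanh(lambda) approaches sign(lambda) for large weights, which is why the hyperbolic penalty behaves as a smooth L1 analogue.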
2.2 Exploration of Feature Space

Wise choice of features is always vital to the performance of any machine learning solution. Feature induction (McCallum, 2003) has been shown to provide significant improvements in CRF performance. In some experiments described below we use feature induction. The focus in this section is on three other aspects of the feature space.

2.2.1 State transition features

In CRFs, state transitions are also represented as features. The feature function f_k(s_{t-1}, s_t, o, t) in Equ. (1) is a general function over states and observations. Different state transition features can be defined to form different Markov-order structures. We define four different state transition features corresponding to different Markov orders for different classes of features (a code sketch after this list shows one way to generate them). Higher order features model dependencies better, but also create more data sparseness problems and require more memory in training.

1. First-order: Here the inputs are examined in the context of the current state only. The feature functions are represented as f(s_t, o). There are no separate parameters or preferences for state transitions at all.

2. First-order+transitions: Here we add parameters corresponding to state transitions. The feature functions used are f(s_t, o) and f(s_{t-1}, s_t).

3. Second-order: Here inputs are examined in the context of the current and previous states. Feature functions are represented as f(s_{t-1}, s_t, o).

4. Third-order: Here inputs are examined in the context of the current and two previous states. Feature functions are represented as f(s_{t-2}, s_{t-1}, s_t, o).
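A minimal sketch of the four templates, assuming observation tests have already been computed per position; the key encodings ("obs", "trans", etc.) are illustrative, not the paper's.

def transition_features(prev2, prev, cur, obs_tests, structure):
    """Yield feature keys for one position; `obs_tests` are the
    observation tests (e.g. INITCAP) that fired at this time step."""
    if structure == "first":            # f(s_t, o): no transition parameters
        for g in obs_tests:
            yield ("obs", cur, g)
    elif structure == "first+trans":    # f(s_t, o) plus bare f(s_{t-1}, s_t)
        yield ("trans", prev, cur)
        for g in obs_tests:
            yield ("obs", cur, g)
    elif structure == "second":         # f(s_{t-1}, s_t, o)
        for g in obs_tests:
            yield ("obs2", prev, cur, g)
    elif structure == "third":          # f(s_{t-2}, s_{t-1}, s_t, o)
        for g in obs_tests:
            yield ("obs3", prev2, prev, cur, g)

# e.g.: list(transition_features("TITLE", "AUTHOR", "AFFIL", ["INITCAP"], "second"))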
2.2.2 Supported features and unsupported features

Before the use of prior distributions over parameters was common in maximum entropy classifiers, standard practice was to eliminate all features with zero count in the training data (the so-called unsupported features). However, unsupported, zero-count features can be extremely useful for pushing Viterbi inference away from certain paths by assigning such features negative weight. The use of a prior allows the incorporation of unsupported features; however, doing so often greatly increases the number of parameters and thus the memory requirements.

Below we experiment with models containing and not containing unsupported features—both with and without regularization by priors—and we argue that unsupported features are useful.

We present here incremental support, a method of introducing some useful unsupported features without exploding the number of parameters with all unsupported features (sketched in code below). The model is trained for several iterations with supported features only. Then inference determines the label sequences assigned high probability by the model. Incorrect transitions assigned high probability by the model are used to selectively add to the model those unsupported features that occur on those transitions, which may help improve performance by being assigned negative weight in future training. If desired, several iterations of this procedure may be performed.
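A schematic sketch of incremental support, under assumed interfaces: train_fn(features) returns a trained model, viterbi_fn(model, obs) returns a predicted label sequence, and only bare transition features are activated here for brevity. All names and the feature encoding are hypothetical.

def incremental_support(train_fn, viterbi_fn, examples, features, rounds=2):
    """Grow the feature set with selected unsupported features.
    `features` starts as the supported (non-zero-count) feature set."""
    model = train_fn(features)                 # several iterations, supported only
    for _ in range(rounds):
        for obs, gold in examples:
            pred = viterbi_fn(model, obs)      # labelings the model finds likely
            for t in range(1, len(gold)):
                # wherever a high-probability transition is wrong, activate the
                # unsupported features that fire on that (wrong) transition
                if (pred[t - 1], pred[t]) != (gold[t - 1], gold[t]):
                    features.add(("trans", pred[t - 1], pred[t]))
        model = train_fn(features)             # retrain with the enlarged set
    return model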
2.2.3 Local features, layout features and lexicon features

One of the advantages of CRFs, and of maximum entropy models in general, is that they easily afford the use of arbitrary features of the input. One can encode local spelling features, layout features such as positions of line breaks, as well as external lexicon features, all in one framework. We study all these features in our research paper extraction problem, evaluate their individual contributions, and give some guidelines for selecting good features.

3 Empirical Study

3.1 Hidden Markov Models

Here we also briefly describe the HMM model we used in our experiments. We relax the independence assumption made in standard HMMs and allow Markov dependencies among observations, e.g., P(o_t | s_t, o_{t-1}). We can vary the Markov order in state transitions and in observation transitions. In our experiments, a model with second-order state transitions and first-order observation transitions performs the best. The state transition probabilities and emission probabilities are estimated using maximum likelihood estimation with absolute smoothing, which was found to be effective in previous experiments, including Seymore et al. (1999).
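A toy sketch of the relaxed emission model P(o_t | s_t, o_{t-1}) with absolute discounting; the discount D and the uniform backoff are illustrative choices of ours, not the paper's exact recipe.

from collections import defaultdict

def emission_probs(sequences, D=0.5):
    """sequences: list of [(word, state), ...]; returns a function giving
    absolutely-discounted estimates of P(word | state, previous word)."""
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for seq in sequences:
        prev = "<s>"
        for word, state in seq:
            counts[(state, prev)][word] += 1
            vocab.add(word)
            prev = word
    V = len(vocab)
    def prob(word, state, prev):
        c = counts[(state, prev)]
        total = sum(c.values())
        if total == 0:
            return 1.0 / V                      # unseen context: uniform
        seen = len(c)                           # distinct words observed
        p = max(c.get(word, 0) - D, 0) / total  # discounted ML estimate
        return p + (D * seen / total) / V       # redistribute mass uniformly
    return prob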
3.2 Datasets

We experiment with two datasets of research paper content. One consists of the headers of research papers. The other consists of pre-segmented citations from the reference sections of research papers. These datasets have been used as standard benchmarks in several previous studies (Seymore et al., 1999; McCallum et al., 2000; Han et al., 2003).

3.2.1 Paper header dataset

The header of a research paper is defined to be all of the words from the beginning of the paper up to either the first section of the paper, usually the introduction, or to the end of the first page, whichever occurs first. It contains 15 fields to be extracted: title, author, affiliation, address, note, email, date, abstract, introduction, phone, keywords, web, degree, publication number, and page (Seymore et al., 1999). The header dataset contains 935 headers. Following previous research (Seymore et al., 1999; McCallum et al., 2000; Han et al., 2003), for each trial we randomly select 500 for training and the remaining 435 for testing. We refer to this dataset as H.

3.2.2 Paper reference dataset

The reference dataset was created by the Cora project (McCallum et al., 2000). It contains 500 references; we use 350 for training and the remaining 150 for testing. References contain 13 fields: author, title, editor, booktitle, date, journal, volume, tech, institution, pages, location, publisher, note. We refer to this dataset as R.

3.3 Performance Measures

To give a comprehensive evaluation, we measure performance using several different metrics. In addition to the previously-used word accuracy measure (which over-emphasizes accuracy of the abstract field), we use per-field F1 measure (both for individual fields and averaged over all fields—called a "macro average" in the information retrieval literature), and whole instance accuracy for measuring overall performance in a way that is sensitive to even a single error in any part of a header or citation.

3.3.1 Measuring field-specific performance

1. Word Accuracy: We define A as the number of true positive words, B as the number of false negative words, C as the number of false positive words, and D as the number of true negative words; A + B + C + D is the total number of words. Word accuracy is calculated as (A + D) / (A + B + C + D).

2. F1-measure: Precision, recall and F1 measure are defined as follows: Precision = A / (A + C), Recall = A / (A + B), and F1 = (2 × Precision × Recall) / (Precision + Recall).

3.3.2 Measuring overall performance

1. Overall word accuracy: Overall word accuracy is the percentage of words whose predicted labels equal their true labels. Word accuracy favors fields with a large number of words, such as the abstract.

2. Averaged F-measure: Averaged F-measure is computed by averaging the F1-measures over all fields. Average F-measure favors labels with a small number of words, which complements word accuracy. Thus, we consider both word accuracy and average F-measure in evaluation.

3. Whole instance accuracy: An instance here is defined to be a single header or reference. Whole instance accuracy is the percentage of instances in which every word is correctly labeled.
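The measures above, directly in code form (A, B, C, D follow the definitions in 3.3.1; function names are ours):

def word_accuracy(A, B, C, D):
    """(true pos + true neg) over all words."""
    return (A + D) / (A + B + C + D)

def f1(A, B, C):
    precision = A / (A + C) if A + C else 0.0
    recall = A / (A + B) if A + B else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def average_f1(per_field_counts):
    """Macro average over fields; each entry is an (A, B, C) triple."""
    scores = [f1(A, B, C) for A, B, C in per_field_counts]
    return sum(scores) / len(scores)

def instance_accuracy(predictions, golds):
    """Fraction of headers/references with every word labeled correctly."""
    exact = sum(p == g for p, g in zip(predictions, golds))
    return exact / len(golds)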
3.4 Experimental Results

We first report the overall results by comparing CRFs with HMMs, and with the previously best benchmark results obtained by SVMs (Han et al., 2003). We then break down the results to analyze various factors individually.

Table 1 shows the results on dataset H with the best results in bold (the intro and page fields are not shown, following past practice (Seymore et al., 1999; Han et al., 2003)). The results we obtained with CRFs use second-order state transition features, layout features, as well as supported and unsupported features. Feature induction is used in the experiments on dataset R (it didn't improve accuracy on H). The results we obtained with the HMM model use a second-order model for transitions, and a first-order model for observations. The SVM results are obtained from (Han et al., 2003) by computing F1 measures from the precision and recall numbers they report.

                 HMM            CRF            SVM
Overall acc.     93.1%          98.3%          92.9%
Instance acc.    4.13%          73.3%          -
                 acc.    F1     acc.    F1     acc.    F1
Title            98.2    82.2   99.7    97.1   98.9    96.5
Author           98.7    81.0   99.8    97.5   99.3    97.2
Affiliation      98.3    85.1   99.7    97.0   98.1    93.8
Address          99.1    84.8   99.7    95.8   99.1    94.7
Note             97.8    81.4   98.8    91.2   95.5    81.6
Email            99.9    92.5   99.9    95.3   99.6    91.7
Date             99.8    80.6   99.9    95.0   99.7    90.2
Abstract         97.1    98.0   99.6    99.7   97.5    93.8
Phone            99.8    53.8   99.9    97.9   99.9    92.4
Keyword          98.7    40.6   99.7    88.8   99.2    88.5
Web              99.9    68.6   99.9    94.1   99.9    92.4
Degree           99.5    68.8   99.8    84.9   99.5    70.1
Pubnum           99.8    64.2   99.9    86.6   99.9    89.2
Average F1               75.6           93.9           89.7

Table 1: Extraction results for paper headers on H
Table 2 shows the results on dataset R. SVM results are not available for these datasets.

                 HMM            CRF
Overall acc.     85.1%          95.37%
Instance acc.    10%            77.33%
                 acc.    F1     acc.    F1
Author           96.8    92.7   99.9    99.4
Booktitle        94.4    0.85   97.7    93.7
Date             99.7    96.9   99.8    98.9
Editor           98.8    70.8   99.5    87.7
Institution      98.5    72.3   99.7    94.0
Journal          96.6    67.7   99.1    91.3
Location         99.1    81.8   99.3    87.2
Note             99.2    50.9   99.7    80.8
Pages            98.1    72.9   99.9    98.6
Publisher        99.4    79.2   99.4    76.1
Tech             98.8    74.9   99.4    86.7
Title            92.2    87.2   98.9    98.3
Volume           98.6    75.8   99.9    97.8
Average F1               77.6%          91.5%

Table 2: Extraction results for paper references on R

Method                        support feat. F1   all features F1
Gaussian infinity             90.5               93.3
Gaussian variance = 0.1       81.7               91.8
Gaussian variance = 0.5       87.2               93.0
Gaussian variance = 5         90.1               93.7
Gaussian variance = 10        89.9               93.5
Gaussian cut 7                90.1               93.4
Gaussian divide count         90.9               92.8
Gaussian bin 5                90.9               93.6
Gaussian bin 10               90.2               92.9
Gaussian bin 15               91.2               93.9
Gaussian bin 20               90.4               93.2
Hyperbolic                    89.4               92.8
Exponential                   80.5               85.6

Table 3: Regularization comparisons: Gaussian infinity is non-regularized, Gaussian variance = X sets the variance to X, Gaussian cut 7 refers to the Threshold Cut method, Gaussian divide count refers to the Divide Count method, and Gaussian bin N refers to the Bin-Based method with bin size equal to N, as described in 2.1.1.

3.5 Analysis
3.5.1 Overall performance comparison

From Tables 1 and 2, one can see that CRFs perform significantly better than HMMs, which again supports the previous findings (Lafferty et al., 2001; Pinto et al., 2003). CRFs also perform significantly better than the SVM-based approach, yielding new state-of-the-art performance on this task. CRFs increase the performance on nearly all the fields. The overall word accuracy is improved from 92.9% to 98.3%, which corresponds to a 78% error rate reduction. However, as we can see, word accuracy can be misleading, since the HMM model even has a higher word accuracy than the SVM, although it performs much worse than the SVM in most individual fields except abstract. Interestingly, the HMM performs much better on the abstract field (98% versus 93.8% F-measure), which pushes the overall accuracy up. A better comparison can be made by comparing the field-based F-measures. Here, in comparison to the SVM, CRFs improve the F1 measure from 89.7% to 93.9%, an error reduction of 36%.

3.5.2 Effects of regularization

The results of different regularization methods are summarized in Table 3. Setting the Gaussian variance of features depending on feature count performs better, from 90.5% to 91.2%, an error reduction of 7%, when only using supported features, and an error reduction of 9% when using supported and unsupported features. Results are averaged over 5 random runs, with an average variance of 0.2%. In our experiments we found the Gaussian prior to consistently perform better than the others. Surprisingly, the exponential prior hurts the performance significantly. It over-penalizes the likelihood (significantly increasing cost, defined as negative penalized log-likelihood). We hypothesized that the problem could be that the choice of the constant \alpha is inappropriate. So we tried varying \alpha instead of computing it using absolute discounting, but found the alternatives to perform worse. These results suggest that the Gaussian prior is a safer prior to use in practice.

3.5.3 Effects of exploring feature space

State transition features and unsupported features. We summarize the comparison of different state transition models, using or not using unsupported features, in Table 4. The first column describes the four different state transition models, the second column contains the overall word accuracy of these models using only supported features, and the third column contains the results of using all features, including unsupported features. Comparing the rows, one can see that the second-order model performs the best, but not dramatically better than the first-order+transitions and third-order models. However, the first-order model performs significantly worse. The difference does not come from sharing the weights, but from ignoring the f(s_{t-1}, s_t) features. The first-order transition feature is vital here. We would expect the third-order model to perform better if enough training data were available. Comparing the second and third columns, we can see that using all features, including unsupported features, consistently performs better than ignoring them. Our preliminary experiments with incremental support have shown performance in between that of supported-only and all features, and are still ongoing.

                     support   all
first-order          89.0      90.4
first-order+trans    95.6      -
second-order         96.0      96.5
third-order          95.3      96.1

Table 4: Effects of using unsupported features and state transitions on H

Effects of layout features. To analyze the contribution of different kinds of features, we divide the features into three categories: local features, layout features, and external lexicon resources. The features we used are summarized in Table 5.

Feature name       Description
Local features
INITCAP            Starts with a capitalized letter
ALLCAPS            All characters are capitalized
CONTAINSDIGITS     Contains at least one digit
ALLDIGITS          All characters are digits
PHONEORZIP         Phone number or zip code
CONTAINSDOTS       Contains at least one dot
CONTAINSDASH       Contains at least one -
ACRO               Acronym
LONELYINITIAL      Initials such as A.
SINGLECHAR         One character only
CAPLETTER          One capitalized character
PUNC               Punctuation
URL                Regular expression for URL
EMAIL              Regular expression for e-address
WORD               Word itself
Layout features
LINE START         At the beginning of a line
LINE IN            In middle of a line
LINE END           At the end of a line
External lexicon features
BIBTEX AUTHOR      Match word in author lexicon
BIBTEX DATE        Words like Jan. Feb.
NOTES              Words like appeared, submitted
AFFILIATION        Words like institution, Labs, etc.

Table 5: List of features used
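A few of the Table 5 local features written as predicates, to show how cheaply such tests compose; this is an illustrative sketch, since the exact regular expressions used in the paper are not specified.

import re

LOCAL_FEATURES = {
    "INITCAP":        lambda w: w[:1].isupper(),
    "ALLCAPS":        lambda w: w.isalpha() and w.isupper(),
    "CONTAINSDIGITS": lambda w: any(ch.isdigit() for ch in w),
    "ALLDIGITS":      lambda w: w.isdigit(),
    "CONTAINSDOTS":   lambda w: "." in w,
    "CONTAINSDASH":   lambda w: "-" in w,
    "LONELYINITIAL":  lambda w: re.fullmatch(r"[A-Z]\.", w) is not None,
    "SINGLECHAR":     lambda w: len(w) == 1,
    "CAPLETTER":      lambda w: re.fullmatch(r"[A-Z]", w) is not None,
    "PUNC":           lambda w: len(w) > 0 and all(not c.isalnum() for c in w),
    "EMAIL":          lambda w: re.fullmatch(r"\S+@\S+\.\S+", w) is not None,
}

def active_local_features(word):
    """Names of the local tests that fire on `word`, plus the word itself."""
    feats = [name for name, test in LOCAL_FEATURES.items() if test(word)]
    feats.append("WORD=" + word.lower())
    return feats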
The results of using different features are shown in Table 6. The layout features dramatically increase the performance, raising the F1 measure from 88.8% to 93.9% and whole sentence accuracy from 40.1% to 72.4%. Adding lexicon features alone improves the performance. However, when combining lexicon features and layout features, the performance is worse than using layout features alone.

                     Word Acc.   F1      Inst. Acc.
local feature        96.5%       88.8%   40.1%
+ lexicon            96.9%       89.9%   53.1%
+ layout feature     98.2%       93.4%   72.4%
+ layout + lexicon   98.0%       93.0%   71.7%

Table 6: Results of using different features on H

The lexicons were gathered from a large collection of BibTeX files, and upon examination contained noise that was difficult to remove, for example words in the author lexicon that were also affiliations. In previous work, we have gained significant benefits by dividing each lexicon into sections based on point-wise information gain with respect to the lexicon's class.

3.5.4 Error analysis

Table 7 is the classification confusion matrix of header extraction (the page field is not shown to save space). Most errors happen at the boundaries between two fields, especially the transitions from author to affiliation and from abstract to keyword. The note field is the one most confused with others, and upon inspection is actually labeled inconsistently in the training data. Other errors could be fixed with additional feature engineering—for example, including additional specialized regular expressions should make email accuracy nearly perfect. Increasing the amount of training data would also be expected to help significantly, as indicated by consistent nearly perfect accuracy on the training set.
          title  auth.  pubnum  date  abs.   aff.   addr.  email  deg.  note  ph.  intro  k.w.  web
title     3446   0      6       0     22     0      0      0      9     25    0    0      12    0
author    0      2653   0       0     7      13     5      0      14    41    0    0      12    0
pubnum    0      14     278     2     0      2      7      0      0     39    0    0      0     0
date      0      0      3       336   0      1      3      0      0     18    0    0      0     0
abstract  0      0      0       0     53262  0      0      1      0     0     0    0      0     0
affil.    19     13     0       0     10     3852   27     0      28    34    0    0      0     1
address   0      11     3       0     0      35     2170   1      0     21    0    0      0     0
email     0      0      1       0     12     2      3      461    0     2     2    0      15    0
degree    2      2      0       2     0      2      0      5      465   95    0    0      2     0
note      52     2      9       6     219    52     59     0      5     4520  4    3      21    3
phone     0      0      0       0     0      0      0      1      0     2     215  0      0     0
intro     0      0      0       0     0      0      0      0      0     32    0    625    0     0
keyword   57     0      0       0     18     3      15     0      0     91    0    0      975   0
web       0      0      0       0     2      0      0      0      0     31    0    0      0     294

Table 7: Confusion matrix on H

4 Conclusions and Future Work

This paper investigates the issues of regularization, feature spaces, and efficient use of unsupported features in CRFs, with an application to information extraction from research papers.

For regularization we find that the Gaussian prior with variance depending on feature frequencies performs better than several other alternatives in the literature. Feature engineering is a key component of any machine learning solution—especially in conditionally-trained models with such freedom to choose arbitrary features—and plays an even more important role than regularization.

We obtain new state-of-the-art performance in extracting standard fields from research papers, with a significant error reduction by several metrics. We also suggest better evaluation metrics to facilitate future research in this task—especially field-F1, rather than word accuracy.

We have provided an empirical exploration of a few previously-published priors for conditionally-trained log-linear models. Fundamental advances in regularization for CRFs remain a significant open research area.

5 Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval, in part by SPAWARSYSCEN-SD grant number N66001-02-1-8903, in part by the National Science Foundation Cooperative Agreement number ATM-9732665 through a subcontract from the University Corporation for Atmospheric Research (UCAR), and in part by the Central Intelligence Agency, the National Security Agency and the National Science Foundation under NSF grant #IIS-0326249. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.

References

S. Chen and R. Rosenfeld. 2000. A Survey of Smoothing Techniques for ME Models. IEEE Transactions on Speech and Audio Processing, 8(1):37-50.

J. Goodman. 2003. Exponential Priors for Maximum Entropy Models. MSR Technical Report.

H. Han, C. Giles, E. Manavoglu, H. Zha, Z. Zhang, and E. Fox. 2003. Automatic Document Meta-data Extraction using Support Vector Machines. In Proceedings of the Joint Conference on Digital Libraries 2003.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on Machine Learning 2001.

S. Lawrence, C. L. Giles, and K. Bollacker. 1999. Digital Libraries and Autonomous Citation Indexing. IEEE Computer, 32(6):67-71.

R. Malouf. 2002. A Comparison of Algorithms for Maximum Entropy Parameter Estimation. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL).

A. McCallum. 2003. Efficiently Inducing Features of Conditional Random Fields. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).

A. McCallum and W. Li. 2003. Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL).

A. McCallum, K. Nigam, J. Rennie, and K. Seymore. 2000. Automating the Construction of Internet Portals with Machine Learning. Information Retrieval Journal, 3:127-163. Kluwer.

H. Ney, U. Essen, and R. Kneser. 1995. On the Estimation of Small Probabilities by Leaving-One-Out. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12):1202-1212.

S. Pietra, V. Pietra, and J. Lafferty. 1995. Inducing Features of Random Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4).

D. Pinto, A. McCallum, X. Wei, and W. Croft. 2003. Table Extraction Using Conditional Random Fields. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'03).

K. Seymore, A. McCallum, and R. Rosenfeld. 1999. Learning Hidden Markov Model Structure for Information Extraction. In Proceedings of the AAAI'99 Workshop on Machine Learning for Information Extraction.

F. Sha and F. Pereira. 2003. Shallow Parsing with Conditional Random Fields. In Proceedings of the Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL'03).

A. Takasu. 2003. Bibliographic Attribute Extraction from Erroneous References Based on a Statistical Model. In Proceedings of the Joint Conference on Digital Libraries 2003.
