memory penalties.

2. Divide Count: We divide the default variance by the count of each feature, $\sigma_k^2 = \sigma^2 / c_k$, rather than sharing one variance across all features. In this way, we increase the smoothing on the low frequency features more so than the high frequency features.

3. Bin-Based: We divide features into classes based on frequency. We bin features by frequency in the training set, and let the features in the same bin share the same variance. The discounted value is set to be $\lceil c_k / N \rceil$, where $c_k$ is the count of feature $k$ and $N$ is the bin size.

[Figure: empirical distribution of parameter values; x-axis from -10 to 10, y-axis from 0 to 0.3.]
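As a concrete illustration of the bin-based scheme, the sketch below shares one variance per frequency bin; the function name and the exact bin-to-variance mapping are our assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def bin_based_variances(feature_counts, bin_size, base_variance):
    """Assign each feature the variance shared by its frequency bin.

    A sketch of the Bin-Based method: features with counts 1..N fall in
    bin 1, N+1..2N in bin 2, and so on, so low-frequency features get a
    smaller variance (stronger smoothing). The linear bin-to-variance
    mapping is an illustrative assumption.
    """
    variances = {}
    for feat, count in feature_counts.items():
        bin_index = math.ceil(count / bin_size)   # the discounted value
        variances[feat] = base_variance * bin_index
    return variances

# Example with toy counts gathered from a training set.
counts = Counter({"WORD=abstract": 120, "INITCAP": 45, "ACRO": 3})
print(bin_based_variances(counts, bin_size=15, base_variance=0.5))
```

With bin size 15 and the counts above, the three features land in bins 8, 3, and 1 respectively.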
2. First-order+transitions: Here we add parameters corresponding to state transitions. The feature functions used are $f(s_t, o_t)$ and $f(s_{t-1}, s_t)$.

3. Second-order: Here inputs are examined in the context of the current and previous states. Feature functions are represented as $f(s_{t-1}, s_t, o_t)$.

4. Third-order: Here inputs are examined in the context of the current and two previous states. Feature functions are represented as $f(s_{t-2}, s_{t-1}, s_t, o_t)$. (All four templates are sketched in code below.)
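To make these templates concrete, here is a minimal sketch of the feature tuples each model generates at position t; the encoding and names are ours, not the paper's implementation.

```python
def state_transition_features(states, obs, t, model):
    """Feature tuples generated at position t under each of the four
    state-transition models (an illustrative encoding)."""
    feats = [("s+o", states[t], obs[t])]  # f(s_t, o_t), shared by all models
    if model == "first-order+trans":
        feats.append(("s'-s", states[t - 1], states[t]))          # f(s_{t-1}, s_t)
    elif model == "second-order":
        feats.append(("s'-s+o", states[t - 1], states[t], obs[t]))  # f(s_{t-1}, s_t, o_t)
    elif model == "third-order":
        feats.append(("s''-s'-s+o",
                      states[t - 2], states[t - 1], states[t], obs[t]))
    return feats

# Example: second-order features at t=2 for a toy header sequence.
print(state_transition_features(["title", "author", "author"],
                                ["Extraction", "Fuchun", "Peng"],
                                2, "second-order"))
```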
2.2.2 Supported features and unsupported features

Before the use of prior distributions over parameters was common in maximum entropy classifiers, standard practice was to eliminate all features with zero count in the training data (the so-called unsupported features).
3.1 Hidden Markov Models

Here we also briefly describe an HMM model we used in our experiments. We relax the independence assumption made in standard HMMs and allow Markov dependencies among observations, e.g., $P(o_t \mid s_t, o_{t-1})$. We can vary the Markov orders in state transitions and observation transitions. In our experiments, a model with second-order state transitions and first-order observation transitions performs the best. The state transition probabilities and emission probabilities are estimated using maximum likelihood estimation with absolute smoothing, which was found to be effective in previous experiments, including Seymore et al. (1999).
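A minimal sketch of that smoothing step: absolute discounting subtracts a fixed discount from every seen count and spreads the freed mass over unseen outcomes. The uniform redistribution and the discount value here are assumptions; the paper does not spell out its exact variant.

```python
from collections import Counter, defaultdict

def absolute_discount(counts, discount, num_outcomes):
    """Maximum likelihood estimate with absolute smoothing (one common
    form of absolute discounting; the paper's exact variant may differ)."""
    total = sum(counts.values())
    unseen = num_outcomes - len(counts)
    probs = {outcome: (c - discount) / total for outcome, c in counts.items()}
    # Mass freed by discounting, shared uniformly by unseen outcomes.
    unseen_prob = (discount * len(counts) / total / unseen) if unseen else 0.0
    return probs, unseen_prob

# Second-order state transitions: condition s_t on (s_{t-2}, s_{t-1}).
transitions = defaultdict(Counter)
transitions[("author", "author")].update(["author", "affiliation", "author"])
probs, p_unseen = absolute_discount(transitions[("author", "author")],
                                    discount=0.5, num_outcomes=15)
```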
3.2 Datasets

We experiment with two datasets of research paper content. One consists of the headers of research papers. The other consists of pre-segmented citations from the reference sections of research papers. These datasets have been used as standard benchmarks in several previous studies (Seymore et al., 1999; McCallum et al., 2000; Han et al., 2003).

3.2.1 Paper header dataset

The header of a research paper is defined to be all of the words from the beginning of the paper up to either the first section of the paper, usually the introduction, or to the end of the first page, whichever occurs first. It contains 15 fields to be extracted: title, author, affiliation, address, note, email, date, abstract, introduction, phone, keywords, web, degree, publication number, and page (Seymore et al., 1999). The header dataset contains 935 headers. Following previous research (Seymore et al., 1999; McCallum et al., 2000; Han et al., 2003), for each trial we randomly select 500 for training and the remaining 435 for testing. We refer to this dataset as H.
3.2.2 Paper reference dataset

The reference dataset was created by the Cora project (McCallum et al., 2000). It contains 500 references; we use 350 for training and the remaining 150 for testing. References contain 13 fields: author, title, editor, booktitle, date, journal, volume, tech, institution, pages, location, publisher, and note. We refer to this dataset as R.
3.3 Performance Measures

To give a comprehensive evaluation, we measure performance using several different metrics. In addition to the previously used word accuracy measure (which over-emphasizes accuracy of the abstract field), we use per-field F1 measure (both for individual fields and averaged over all fields—called a "macro average" in the information retrieval literature), and whole instance accuracy for measuring overall performance in a way that is sensitive to even a single error in any part of a header or citation.
3.3.1 Measuring field-specific performance

1. Word Accuracy: We define $a$ as the number of true positive words, $b$ as the number of false negative words, $c$ as the number of false positive words, and $d$ as the number of true negative words; $a + b + c + d$ is the total number of words. Word accuracy is calculated to be $\frac{a+d}{a+b+c+d}$.

2. F1-measure: Precision, recall and F1 measure are defined as follows: $\mathrm{Precision} = \frac{a}{a+c}$, $\mathrm{Recall} = \frac{a}{a+b}$, and $\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.
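A minimal sketch of these field-specific measures in code, following the definitions above (the function and variable names are ours):

```python
def field_scores(true_labels, pred_labels, field):
    """Word accuracy and F1 for a single field from token-level labels,
    as defined in 3.3.1."""
    pairs = list(zip(true_labels, pred_labels))
    a = sum(t == field and p == field for t, p in pairs)  # true positives
    b = sum(t == field and p != field for t, p in pairs)  # false negatives
    c = sum(t != field and p == field for t, p in pairs)  # false positives
    d = sum(t != field and p != field for t, p in pairs)  # true negatives
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + c) if a + c else 0.0
    recall = a / (a + b) if a + b else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1
```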
3.3.2 Measuring overall performance

1. Overall word accuracy: Overall word accuracy is the percentage of words whose predicted labels equal their true labels. Word accuracy favors fields with a large number of words, such as the abstract.

2. Averaged F-measure: Averaged F-measure is computed by averaging the F1-measures over all fields. Average F-measure favors labels with a small number of words, which complements word accuracy. Thus, we consider both word accuracy and average F-measure in evaluation.

3. Whole instance accuracy: An instance here is defined to be a single header or reference. Whole instance accuracy is the percentage of instances in which every word is correctly labeled.

3.4 Experimental Results

We first report the overall results by comparing CRFs with HMMs, and with the previously best benchmark results obtained by SVMs (Han et al., 2003). We then break down the results to analyze various factors individually.

Table 1 shows the results on dataset H, with the best results in bold (the intro and page fields are not shown, following past practice (Seymore et al., 1999; Han et al., 2003)). The results we obtained with CRFs use second-order state transition features, layout features, as well as supported and unsupported features. Feature induction is used in experiments on dataset R (it did not improve accuracy on H). The results we obtained with the HMM model use a second-order model for transitions and a first-order model for observations. The SVM results are obtained from (Han et al., 2003) by computing F1 measures from the precision and recall numbers they report.

                HMM            CRF            SVM
Overall acc.    93.1%          98.3%          92.9%
Instance acc.   4.13%          73.3%          -
                acc.   F1      acc.   F1      acc.   F1
Title           98.2   82.2    99.7   97.1    98.9   96.5
Author          98.7   81.0    99.8   97.5    99.3   97.2
Affiliation     98.3   85.1    99.7   97.0    98.1   93.8
Address         99.1   84.8    99.7   95.8    99.1   94.7
Note            97.8   81.4    98.8   91.2    95.5   81.6
Email           99.9   92.5    99.9   95.3    99.6   91.7
Date            99.8   80.6    99.9   95.0    99.7   90.2
Abstract        97.1   98.0    99.6   99.7    97.5   93.8
Phone           99.8   53.8    99.9   97.9    99.9   92.4
Keyword         98.7   40.6    99.7   88.8    99.2   88.5
Web             99.9   68.6    99.9   94.1    99.9   92.4
Degree          99.5   68.8    99.8   84.9    99.5   70.1
Pubnum          99.8   64.2    99.9   86.6    99.9   89.2
Average F1             75.6           93.9           89.7

Table 1: Extraction results for paper headers on H

Table 2 shows the results on dataset R. SVM results are not available for these datasets.
                HMM            CRF
Overall acc.    85.1%          95.37%
Instance acc.   10%            77.33%
                acc.   F1      acc.   F1
Author          96.8   92.7    99.9   99.4
Booktitle       94.4   0.85    97.7   93.7
Date            99.7   96.9    99.8   98.9
Editor          98.8   70.8    99.5   87.7
Institution     98.5   72.3    99.7   94.0
Journal         96.6   67.7    99.1   91.3
Location        99.1   81.8    99.3   87.2
Note            99.2   50.9    99.7   80.8
Pages           98.1   72.9    99.9   98.6
Publisher       99.4   79.2    99.4   76.1
Tech            98.8   74.9    99.4   86.7
Title           92.2   87.2    98.9   98.3
Volume          98.6   75.8    99.9   97.8
Average F1             77.6%          91.5%

Table 2: Extraction results for paper references on R

Method                     support feat. F1   all features F1
Gaussian infinity          90.5               93.3
Gaussian variance = 0.1    81.7               91.8
Gaussian variance = 0.5    87.2               93.0
Gaussian variance = 5      90.1               93.7
Gaussian variance = 10     89.9               93.5
Gaussian cut 7             90.1               93.4
Gaussian divide count      90.9               92.8
Gaussian bin 5             90.9               93.6
Gaussian bin 10            90.2               92.9
Gaussian bin 15            91.2               93.9
Gaussian bin 20            90.4               93.2
Hyperbolic                 89.4               92.8
Exponential                80.5               85.6

Table 3: Regularization comparisons: Gaussian infinity is non-regularized, Gaussian variance = X sets the variance to X, Gaussian cut 7 refers to the Threshold Cut method, Gaussian divide count refers to the Divide Count method, and Gaussian bin N refers to the Bin-Based method with bin size N, as described in 2.1.1.

3.5 Analysis
3.5.1 Overall performance comparison
From Tables 1 and 2, one can see that CRFs perform significantly better than HMMs, which again supports the previous findings (Lafferty et al., 2001; Pinto et al., 2003). CRFs also perform significantly better than the SVM-based approach, yielding new state-of-the-art performance on this task. CRFs increase the performance on nearly all the fields. The overall word accuracy is improved from 92.9% to 98.3%, which corresponds to a 78% error rate reduction. However, as we can see, word accuracy can be misleading: the HMM model even has a higher word accuracy than the SVM, although it performs much worse than the SVM on most individual fields except abstract. Interestingly, the HMM performs much better on the abstract field (98% versus 93.8% F-measure), which pushes its overall accuracy up. A better comparison can be made by comparing the field-based F-measures. Here, in comparison to the SVM, CRFs improve the F1 measure from 89.7% to 93.9%, an error reduction of 36%.
3.5.2 Effects of regularization

The results of different regularization methods are summarized in Table 3. Setting the Gaussian variance of features depending on feature count performs better, from 90.5% to 91.2%, an error reduction of 7%, when only using supported features, and an error reduction of 9% when using supported and unsupported features. Results are averaged over 5 random runs, with an average variance of 0.2%. In our experiments we found the Gaussian prior to consistently perform better than the others. Surprisingly, the exponential prior hurts the performance significantly. It over-penalizes the likelihood (significantly increasing cost—defined as negative penalized log-likelihood). We hypothesized that the problem could be that the choice of the constant is inappropriate, so we tried varying it instead of computing it using absolute discounting, but found the alternatives to perform worse. These results suggest that the Gaussian prior is a safer prior to use in practice.
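For reference, the costs being compared take the standard penalized forms from the priors literature (Chen and Rosenfeld, 2000; Goodman, 2003). Writing $\mathcal{L}(\Lambda)$ for the log-likelihood:

$$\mathrm{cost}_{\mathrm{Gaussian}} = -\mathcal{L}(\Lambda) + \sum_k \frac{\lambda_k^2}{2\sigma_k^2}, \qquad \mathrm{cost}_{\mathrm{exponential}} = -\mathcal{L}(\Lambda) + \sum_k \alpha_k \,|\lambda_k|$$

The exponential penalty grows linearly in $|\lambda_k|$, so its strength hinges directly on the choice of the constants $\alpha_k$, which are the constants varied in the experiments above.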
3.5.3 Effects of exploring feature space

State transition features and unsupported features. We summarize the comparison of different state transition models, using or not using unsupported features, in Table 4. The first column describes the four different state transition models, the second column contains the overall word accuracy of these models using only supported features, and the third column contains the results using all features, including unsupported features. Comparing the rows, one can see that the second-order model performs the best, but not dramatically better than first-order+transitions and the third-order model. However, the first-order model performs significantly worse. The difference does not come from sharing the weights, but from ignoring the $f(s_{t-1}, s_t)$ features: the first-order transition feature is vital here. We would expect the third-order model to perform better if enough training data were available. Comparing the second and the third columns, we can see that using all features, including unsupported features, consistently performs better than ignoring them. Our preliminary experiments with incremental support have shown performance in between that of supported-only and all features, and are still ongoing.

Model                support   all
first-order          89.0      90.4
first-order+trans    95.6      -
second-order         96.0      96.5
third-order          95.3      96.1

Table 4: Effects of using unsupported features and state transitions on H (overall word accuracy)
Effects of layout features. To analyze the contribution of different kinds of features, we divide the features into three categories: local features, layout features, and external lexicon resources. The features we used are summarized in Table 5.

Feature name       Description
Local features
INITCAP            Starts with a capitalized letter
ALLCAPS            All characters are capitalized
CONTAINSDIGITS     Contains at least one digit
ALLDIGITS          All characters are digits
PHONEORZIP         Phone number or zip code
CONTAINSDOTS       Contains at least one dot
CONTAINSDASH       Contains at least one -
ACRO               Acronym
LONELYINITIAL      Initials such as A.
SINGLECHAR         One character only
CAPLETTER          One capitalized character
PUNC               Punctuation
URL                Regular expression for URL
EMAIL              Regular expression for e-address
WORD               Word itself
Layout features
LINE START         At the beginning of a line
LINE IN            In the middle of a line
LINE END           At the end of a line
External lexicon features
BIBTEX AUTHOR      Match word in author lexicon
BIBTEX DATE        Words like Jan., Feb.
NOTES              Words like appeared, submitted
AFFILIATION        Words like institution, Labs, etc.

Table 5: List of features used
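As an illustration, here is a sketch of how the local features in Table 5 might be computed per token; the regular expressions are our assumptions, since the paper does not give its exact patterns.

```python
import re

def local_features(token):
    """Binary orthographic features for one token, in the spirit of the
    local features in Table 5 (illustrative patterns, not the paper's)."""
    feats = {
        "INITCAP": token[:1].isupper(),
        "ALLCAPS": token.isupper(),
        "CONTAINSDIGITS": any(ch.isdigit() for ch in token),
        "ALLDIGITS": token.isdigit(),
        "CONTAINSDOTS": "." in token,
        "CONTAINSDASH": "-" in token,
        "LONELYINITIAL": re.fullmatch(r"[A-Z]\.", token) is not None,
        "SINGLECHAR": len(token) == 1,
        "PUNC": re.fullmatch(r"\W+", token) is not None,
        "EMAIL": re.fullmatch(r"\S+@\S+\.\S+", token) is not None,
        "WORD": True,  # the word identity itself is always emitted
    }
    return [name for name, on in feats.items() if on]

print(local_features("F."))   # e.g. ['INITCAP', 'CONTAINSDOTS', 'LONELYINITIAL', 'WORD']
```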
The results of using different features are shown in Table 6. The layout features dramatically increase the performance, raising the F1 measure from 88.8% to 93.9% and whole instance accuracy from 40.1% to 72.4%. Adding lexicon features alone improves the performance. However, when combining lexicon features and layout features, the performance is worse than using layout features alone.

                     Word Acc.   F1      Inst. Acc.
local feature        96.5%       88.8%   40.1%
+ lexicon            96.9%       89.9%   53.1%
+ layout feature     98.2%       93.4%   72.4%
+ layout + lexicon   98.0%       93.0%   71.7%

Table 6: Results of using different features on H

The lexicons were gathered from a large collection of BibTeX files, and upon examination had difficult-to-remove noise, for example words in the author lexicon that were also affiliations. In previous work, we have gained significant benefits by dividing each lexicon into sections based on point-wise information gain with respect to the lexicon's class.

3.5.4 Error analysis

Table 7 is the classification confusion matrix of header extraction (the page field is not shown to save space). Most errors happen at the boundaries between two fields, especially the transitions from author to affiliation and from abstract to keyword. The note field is the one most confused with others, and upon inspection is actually labeled inconsistently in the training data. Other errors could be fixed with additional feature engineering—for example, including additional specialized regular expressions should make email accuracy nearly perfect. Increasing the amount of training data would also be expected to help significantly, as indicated by consistently near-perfect accuracy on the training set.

          title  auth.  pubnum  date   abs.   aff.  addr.  email  deg.  note  ph.  intro  k.w.  web
title      3446      0       6     0     22      0      0      0     9    25    0      0    12    0
author        0   2653       0     0      7     13      5      0    14    41    0      0    12    0
pubnum        0     14     278     2      0      2      7      0     0    39    0      0     0    0
date          0      0       3   336      0      1      3      0     0    18    0      0     0    0
abstract      0      0       0     0  53262      0      0      1     0     0    0      0     0    0
affil.       19     13       0     0     10   3852     27      0    28    34    0      0     0    1
address       0     11       3     0      0     35   2170      1     0    21    0      0     0    0
email         0      0       1     0     12      2      3    461     0     2    2      0    15    0
degree        2      2       0     2      0      2      0      5   465    95    0      0     2    0
note         52      2       9     6    219     52     59      0     5  4520    4      3    21    3
phone         0      0       0     0      0      0      0      1     0     2  215      0     0    0
intro         0      0       0     0      0      0      0      0     0    32    0    625     0    0
keyword      57      0       0     0     18      3     15      0     0    91    0      0   975    0
web           0      0       0     0      2      0      0      0     0    31    0      0     0  294

Table 7: Confusion matrix of header extraction on H (page field not shown)

4 Conclusions and Future Work

This paper investigates the issues of regularization, feature spaces, and efficient use of unsupported features in CRFs, with an application to information extraction from research papers.

For regularization, we find that the Gaussian prior with variance depending on feature frequencies performs better than several other alternatives in the literature. Feature engineering is a key component of any machine learning solution—especially in conditionally-trained models with such freedom to choose arbitrary features—and plays an even more important role than regularization.

We obtain new state-of-the-art performance in extracting standard fields from research papers, with a significant error reduction by several metrics. We also suggest better evaluation metrics to facilitate future research in this task—especially field-F1, rather than word accuracy.

We have provided an empirical exploration of a few previously-published priors for conditionally-trained log-linear models. Fundamental advances in regularization for CRFs remain a significant open research area.

5 Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval, in part by SPAWARSYSCEN-SD grant number N66001-02-1-8903, in part by the National Science Foundation Cooperative Agreement number ATM-9732665 through a
subcontract from the University Corporation for Atmospheric Research (UCAR), and in part by the Central Intelligence Agency, the National Security Agency and the National Science Foundation under NSF grant #IIS-0326249. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.

References

S. Chen and R. Rosenfeld. 2000. A Survey of Smoothing Techniques for ME Models. IEEE Transactions on Speech and Audio Processing, 8(1):37-50.

J. Goodman. 2003. Exponential Priors for Maximum Entropy Models. Microsoft Research technical report.

H. Han, C. Giles, E. Manavoglu, H. Zha, Z. Zhang, and E. Fox. 2003. Automatic Document Meta-data Extraction Using Support Vector Machines. In Proceedings of the Joint Conference on Digital Libraries 2003.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on Machine Learning 2001.

S. Lawrence, C. L. Giles, and K. Bollacker. 1999. Digital Libraries and Autonomous Citation Indexing. IEEE Computer, 32(6):67-71.

R. Malouf. 2002. A Comparison of Algorithms for Maximum Entropy Parameter Estimation. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL).

A. McCallum. 2003. Efficiently Inducing Features of Conditional Random Fields. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).

A. McCallum, K. Nigam, J. Rennie, and K. Seymore. 2000. Automating the Construction of Internet Portals with Machine Learning. Information Retrieval, 3:127-163. Kluwer.

A. McCallum and W. Li. 2003. Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL).

H. Ney, U. Essen, and R. Kneser. 1995. On the Estimation of Small Probabilities by Leaving-One-Out. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12):1202-1212.

S. Della Pietra, V. Della Pietra, and J. Lafferty. 1997. Inducing Features of Random Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4).

D. Pinto, A. McCallum, X. Wei, and W. Croft. 2003. Table Extraction Using Conditional Random Fields. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'03).

K. Seymore, A. McCallum, and R. Rosenfeld. 1999. Learning Hidden Markov Model Structure for Information Extraction. In Proceedings of the AAAI'99 Workshop on Machine Learning for Information Extraction.

F. Sha and F. Pereira. 2003. Shallow Parsing with Conditional Random Fields. In Proceedings of the Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL'03).

A. Takasu. 2003. Bibliographic Attribute Extraction from Erroneous References Based on a Statistical Model. In Proceedings of the Joint Conference on Digital Libraries 2003.