Improving Twitter Named Entity Recognition Using Word Representations
Zhiqiang Toh, Bin Chen and Jian Su
Institute for Infocomm Research
1 Fusionopolis Way
Singapore 138632
{ztoh,bchen,sujian}@i2r.a-star.edu.sg
2014)¹, (2) English Gigaword Fifth Edition², and (3) raw tweets collected between March 2015 and April 2015.

For English Gigaword, all articles of story type are collected and tokenized. Further preprocessing is performed by following the cleaning step described in Turian et al. (2010). This results in a corpus consisting of 76 million sentences.

The collected raw tweets are tokenized³ and non-English tweets are removed using langid.py (Lui and Baldwin, 2012), resulting in a total of 14 million tweets.
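To make the filtering step concrete, here is a minimal sketch using langid.py's classify API; the input file name and the one-tweet-per-line assumption are illustrative, and tokenization is handled separately by the script in footnote 3.

```python
# Minimal sketch: keep only tweets that langid.py identifies as English.
import langid

def keep_english(tweets):
    for tweet in tweets:
        lang, _score = langid.classify(tweet)  # returns (language, score)
        if lang == "en":
            yield tweet

# Illustrative usage, assuming one raw tweet per line.
with open("raw_tweets.txt", encoding="utf-8") as f:
    english_tweets = list(keep_english(line.strip() for line in f))
```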
3 Features

This section briefly describes the features used in our system. Besides the features commonly used in traditional NER systems, we focus on the use of word cluster features, which have been shown to be effective in previous work (Ratinov and Roth, 2009; Turian et al., 2010; Cherry and Guo, 2015).
3.1 Word Feature

The current word and its lowercased form are used as features. To provide additional context information, the previous word and the next word (in their original form) are also used.
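Written as a feature template, this amounts to the following sketch; the feature names and boundary symbols are our own illustration rather than the paper's implementation.

```python
def word_features(tokens, i):
    """Word features for the token at position i: the current word, its
    lowercased form, and the neighbouring words in their original form."""
    return {
        "word": tokens[i],
        "word.lower": tokens[i].lower(),
        "prev.word": tokens[i - 1] if i > 0 else "<BOS>",
        "next.word": tokens[i + 1] if i + 1 < len(tokens) else "<EOS>",
    }
```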
3.2 Orthographic Features

Orthographic features based on regular expressions are often used in NER systems. We only use the following two orthographic features: InitialCap ([A-Z][a-z].*) and AllCaps ([A-Z]+). In addition, the first character and the last two characters of each word are used as features.
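A sketch of these orthographic features, assuming the two regular expressions are matched against the whole token:

```python
import re

# The two patterns given above, applied to the full token.
INITIAL_CAP = re.compile(r"[A-Z][a-z].*")
ALL_CAPS = re.compile(r"[A-Z]+")

def orthographic_features(word):
    return {
        "initial_cap": INITIAL_CAP.fullmatch(word) is not None,
        "all_caps": ALL_CAPS.fullmatch(word) is not None,
        "first_char": word[0],
        "last_two_chars": word[-2:],
    }
```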
3.3 Gazetteer Feature

The current word is matched against entries in the Freebase entity lists, and the feature value is the type of the entity list matched.
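A sketch of the lookup, assuming the Freebase entity lists have been loaded into a mapping from list type to a set of entries; details such as case handling are not specified in the paper.

```python
def gazetteer_feature(word, gazetteers):
    """Return the type of the first Freebase entity list containing the
    word, or None if no list matches. `gazetteers` maps a list type
    (e.g. "person") to a set of entries; both names are assumptions."""
    for list_type, entries in gazetteers.items():
        if word in entries:
            return list_type
    return None
```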
3.4 Word Cluster Features

Unsupervised word representations (e.g., Brown clustering) have been shown to improve the performance of NER. Besides Brown clusters, we also use clusters generated using the K-means algorithm. These two kinds of clusters are generated from the processed Gigaword and tweet corpora (Section 2.2).
Brown clusters are generated using the implementation by Percy Liang⁴. We experiment with different cluster sizes ({100, 200, 500, 1000}), resulting in different cluster files for each of the corpora. For each cluster file, different minimum occurrences ({5, 10, 20}) and binary prefix lengths ({4, 6, ..., 14, 16}) are tested. For each word in the tweet, its corresponding binary prefix string representation is used as the feature value.
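A sketch of the binary prefix feature, assuming the tab-separated paths file written by the percyliang/brown-cluster tool (one bit-string, word, count triple per line):

```python
def load_brown_paths(path):
    """Map each word to its Brown cluster bit string (assumed paths
    format: <bit-string>\t<word>\t<count>)."""
    word2bits = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            bits, word, _count = line.rstrip("\n").split("\t")
            word2bits[word] = bits
    return word2bits

def brown_prefix_features(word, word2bits, lengths=(4, 6, 8, 10, 12, 14, 16)):
    """One binary prefix string feature per prefix length."""
    bits = word2bits.get(word)
    if bits is None:
        return {}
    return {"brown_prefix_%d" % n: bits[:n] for n in lengths}
```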
K-means clusters are generated using two different methods. The first method uses the word2vec tool (Mikolov et al., 2013)⁵. By varying the minimum occurrences ({5, 10, 20}), word vector size ({50, 100, 200, 500, 1000}), cluster size ({50, 100, 200, 500, 1000}) and sub-sampling threshold ({0.00001, 0.001}), different cluster files are generated and tested. Similar to the Brown cluster feature, the name of the cluster that each word belongs to is used as the feature value.

The second method uses the GloVe tool to generate global vectors for word representation⁶. As the GloVe tool does not output any form of clusters, K-means clusters are generated from the global vectors using the K-means implementation from Apache Spark MLlib⁷. Similarly, by varying the minimum count ({5, 10, 20, 50, 100}), window size ({5, 10, 15, 20}), vector size ({50, 100, 200, 500, 1000}), and cluster size ({50, 100, 200, 500, 1000}), different cluster files are generated and tested.
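The word2vec tool can produce such clusters directly through its -classes option; as an illustrative stand-in for that step and for the Spark MLlib pipeline, the sketch below clusters vectors in the common text format with scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cluster_file(vectors_path, n_clusters):
    """Cluster word vectors and map each word to a cluster id.

    Assumes the text format written by word2vec/GloVe: one word followed
    by its vector components per line (a word2vec "vocab dim" header
    line, if present, is skipped)."""
    words, vecs = [], []
    with open(vectors_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) == 2:  # word2vec header line
                continue
            words.append(parts[0])
            vecs.append([float(x) for x in parts[1:]])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.array(vecs))
    return {w: int(c) for w, c in zip(words, labels)}
```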
We also generate K-means cluster files using the pre-trained GloVe word vectors (trained on Wikipedia 2014 and Gigaword Fifth Edition, Common Crawl, and Twitter data) in the same manner.

We create a cluster feature for each cluster file that is found to improve the 5-fold cross-validation performance. As there are over 800 cluster files, we only test a random subset of cluster files each time and select the best cluster file from the subset to create a new cluster feature. The procedure is repeated for a new subset of cluster files until no (or negligible) improvement is obtained. Our final settings use one Brown cluster feature and six K-means cluster features (for both the 10types and notypes settings).
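The selection procedure amounts to a greedy loop over random subsets of cluster files. In the sketch below, cv_f1 (a callback that runs the 5-fold cross-validation), the subset size, and the improvement threshold are all assumptions for illustration.

```python
import random

def select_cluster_features(cluster_files, cv_f1, subset_size=50, tol=0.05):
    """Greedily grow the set of cluster features: score a random subset
    of the remaining cluster files, keep the best one if it improves the
    cross-validation F1, and stop once the gain is negligible."""
    selected = []
    best = cv_f1(selected)  # baseline without any cluster feature
    remaining = list(cluster_files)
    while remaining:
        subset = random.sample(remaining, min(subset_size, len(remaining)))
        score, candidate = max((cv_f1(selected + [c]), c) for c in subset)
        if score - best <= tol:  # no (or negligible) improvement
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = score
    return selected
```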
¹ http://nlp.stanford.edu/projects/glove/
² https://catalog.ldc.upenn.edu/LDC2011T07
³ The tweet tokenization script can be found at https://github.com/myleott/ark-twokenize-py
⁴ https://github.com/percyliang/brown-cluster/
⁵ https://code.google.com/p/word2vec/
⁶ Due to memory constraints, only the tweet corpus is used to generate global vectors.
⁷ https://spark.apache.org/mllib/
10types

Feature                    Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Overall
Word Feature                25.76    23.93    27.59    28.01     9.69     23.94
+ Orthographic Features     36.48    35.64    41.20    43.27    25.34     37.03
+ Gazetteer Feature         44.36    43.94    48.22    44.84    30.35     42.94
+ Word Cluster Features     55.85    57.49    60.07    58.35    44.99     55.95
+ Postprocessing            56.09    57.82    60.07    58.88    45.78     56.31

Table 1: 5-fold cross-validation F1 performance for the 10types evaluation. Each row uses all features added in the previous rows.
notypes

Feature                    Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Overall
Word Feature                30.91    30.08    33.41    36.99    20.69     31.09
+ Orthographic Features     52.06    54.29    52.22    53.11    44.49     51.62
+ Gazetteer Feature         52.26    56.70    58.78    56.74    47.45     54.77
+ Word Cluster Features     65.14    65.57    66.77    68.13    55.31     64.66
+ Postprocessing            65.44    65.85    67.30    68.70    56.00     65.13

Table 2: 5-fold cross-validation F1 performance for the notypes evaluation. Each row uses all features added in the previous rows.
                       10types                           notypes
System      Rank  Precision  Recall     F1    Rank  Precision  Recall     F1
NLANGP         2      63.62   43.12  51.40       2      67.74   54.31  60.29
1st            1      57.66   55.22  56.41       1      72.20   69.14  70.63
2nd            2      63.62   43.12  51.40       2      67.74   54.31  60.29
3rd            3      53.24   38.58  44.74       3      63.81   56.28  59.81
Baseline       –      35.56   29.05  31.97       –      53.86   46.44  49.88

Table 3: Comparison of our system (NLANGP) with the top three participating systems and the official baselines for the 10types and notypes evaluations.
                               10types                     notypes
System                   Precision  Recall     F1    Precision  Recall     F1
NLANGP                       63.62   43.12  51.40        67.74   54.31  60.29
- Word Cluster Features      57.99   25.26  35.19        62.56   38.43  47.61

Table 4: System performance on the test data when word cluster features are not used.
References

Colin Cherry and Hongyu Guo. 2015. The Unreasonable Effectiveness of Word Representations for Twitter Named Entity Recognition. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 735–745, Denver, Colorado, May–June. Association for Computational Linguistics.