
Improving Twitter Named Entity Recognition using Word Representations
Zhiqiang Toh, Bin Chen and Jian Su
Institute for Infocomm Research
1 Fusionopolis Way
Singapore 138632
{ztoh,bchen,sujian}@i2r.a-star.edu.sg

Proceedings of the ACL 2015 Workshop on Noisy User-generated Text, pages 141–145, Beijing, China, July 31, 2015. © 2015 Association for Computational Linguistics

Abstract

This paper describes our system used in the ACL 2015 Workshop on Noisy User-generated Text (W-NUT) Shared Task for Named Entity Recognition (NER) in Twitter. Our system uses Conditional Random Fields to train two separate classifiers for the two evaluations: predicting 10 fine-grained types, and segmenting named entities. We focus our efforts on generating word representations from large amounts of unlabeled newswire data and tweets. Our experimental results show that cluster features derived from word representations significantly improve Twitter NER performance. Our system is ranked 2nd in both evaluations.

1 Introduction

Named Entity Recognition (NER) is the task of identifying and categorizing mentions of people, organizations and other named entities within text. NER has been an essential analysis component in many Natural Language Processing (NLP) systems, especially information extraction and question answering.

Traditionally, NER systems are trained on and applied to long, formal text such as newswire. Since the beginning of the new millennium, user-generated content from social media websites such as Twitter and Weibo has produced a huge volume of informative but noisy and informal text. This rapidly growing text collection is becoming more and more important for NLP tasks such as sentiment analysis and emerging topic detection.

However, standard NER systems trained on formal text do not work well on this new and challenging style of text. Adapting NER systems to the Twitter domain has therefore attracted increasing attention from researchers. The ACL 2015 Workshop on Noisy User-generated Text (W-NUT) Shared Task for NER in Twitter was organized in response to these changes (Baldwin et al., 2015).

We participated in the above Shared Task, which consists of two separate evaluations: one where the task is to predict 10 fine-grained types (10types) and the other in which only named entity segments are predicted (notypes).

For both evaluations, we model the problem as a sequential labeling task, using Conditional Random Fields (CRF) as the training algorithm. An additional postprocessing step is applied to further refine the system output.

The remainder of this paper is structured as follows. In Section 2, we report on the external resources used by our system and how they are obtained and processed. In Section 3, the features used are described in detail. In Section 4, the experiments and official results are presented. Finally, Section 5 summarizes our work.

2 External Resources

External resources have been shown to improve the performance of Twitter NER (Ritter et al., 2011). Our system uses a variety of external resources, either publicly available, or collected and preprocessed by us.

2.1 Freebase Entity Lists

We use the Freebase entity lists provided by the task organizers. For some of the lists that are not provided (e.g. a list of sports facilities), we manually collect them by calling the appropriate Freebase API.

2.2 Unlabeled Corpora

We gather unlabeled corpora from three different sources: (1) pre-trained word vectors generated using the GloVe tool (Pennington et al., 2014) (https://nlp.stanford.edu/projects/glove/), (2) English Gigaword Fifth Edition (https://catalog.ldc.upenn.edu/LDC2011T07), and (3) raw tweets collected between March 2015 and April 2015.

For English Gigaword, all articles of story type are collected and tokenized. Further preprocessing is performed by following the cleaning steps described in Turian et al. (2010). This results in a corpus of 76 million sentences.

The collected raw tweets are tokenized (using the script at https://github.com/myleott/ark-twokenize-py) and non-English tweets are removed using langid.py (Lui and Baldwin, 2012), resulting in a total of 14 million tweets.

3 Features

This section briefly describes the features used in our system. Besides the features commonly used in traditional NER systems, we focus on the use of word cluster features, which have been shown to be effective in previous work (Ratinov and Roth, 2009; Turian et al., 2010; Cherry and Guo, 2015).

3.1 Word Feature

The current word and its lowercase form are used as features. To provide additional context information, the previous word and next word (in original form) are also used.

3.2 Orthographic Features

Orthographic features based on regular expressions are often used in NER systems. We use only the following two orthographic features: InitialCap ([A-Z][a-z].*) and AllCaps ([A-Z]+). In addition, the first character and last two characters of each word are used as features.

3.3 Gazetteer Feature

The current word is matched against entries in the Freebase entity lists, and the feature value is the type of the entity list matched.

3.4 Word Cluster Features

Unsupervised word representations (e.g. Brown clustering) have been shown to improve the performance of NER. Besides Brown clusters, we also use clusters generated using the K-means algorithm. Both kinds of clusters are generated from the processed Gigaword and tweet corpora (Section 2.2).

Brown clusters are generated using the implementation by Percy Liang (https://github.com/percyliang/brown-cluster/). We experiment with different cluster sizes ({100, 200, 500, 1000}), resulting in different cluster files for each of the corpora. For each cluster file, different minimum occurrences ({5, 10, 20}) and binary prefix lengths ({4, 6, ..., 14, 16}) are tested. For each word in the tweet, its corresponding binary prefix string representation is used as the feature value.

K-means clusters are generated using two different methods. The first method uses the word2vec tool (Mikolov et al., 2013) (https://code.google.com/p/word2vec/). By varying the minimum occurrences ({5, 10, 20}), word vector size ({50, 100, 200, 500, 1000}), cluster size ({50, 100, 200, 500, 1000}) and sub-sampling threshold ({0.00001, 0.001}), different cluster files are generated and tested. As with the Brown cluster feature, the name of the cluster that each word belongs to is used as the feature value.

The second method uses the GloVe tool to generate global vectors for word representation (due to memory constraints, only the tweet corpus is used to generate global vectors). As the GloVe tool does not output any form of clusters, K-means clusters are generated from the global vectors using the K-means implementation from Apache Spark MLlib (https://spark.apache.org/mllib/). Similarly, by varying the minimum count ({5, 10, 20, 50, 100}), window size ({5, 10, 15, 20}), vector size ({50, 100, 200, 500, 1000}), and cluster size ({50, 100, 200, 500, 1000}), different cluster files are generated and tested.

We also generate K-means cluster files in the same manner from the pre-trained GloVe word vectors (trained on Wikipedia 2014 plus Gigaword Fifth Edition, Common Crawl, and Twitter data).

We create a cluster feature for each cluster file that is found to improve the 5-fold cross-validation performance. As there are over 800 cluster files, we test only a random subset of cluster files each time and select the best cluster file from the subset to create a new cluster feature. The procedure is repeated for a new subset of cluster files until no (or negligible) improvement is obtained. Our final settings use one Brown cluster feature and six K-means cluster features (for both the 10types and notypes settings).
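The per-token features of Sections 3.1–3.4 can be sketched as follows. This is a minimal illustration, not the authors' code: the gazetteer, Brown-cluster, and K-means lookups are stood in by small hard-coded dictionaries, and all names (`extract_features`, the sample cluster strings and ids) are hypothetical.

```python
import re

# Hypothetical lookup tables standing in for the Freebase entity lists,
# a Brown cluster file, and a K-means cluster file (Sections 3.3-3.4).
GAZETTEER = {"giants": "sports_team", "obama": "person"}
BROWN_PATHS = {"giants": "11010110", "obama": "0111010010"}
KMEANS_CLUSTERS = {"giants": 417, "obama": 93}

def extract_features(tokens, i, brown_prefix_len=6):
    """Features for token i, following Sections 3.1-3.4."""
    w = tokens[i]
    feats = {
        # 3.1 Word feature: current word, its lowercase form, and context words
        "w": w,
        "w_lower": w.lower(),
        "w_prev": tokens[i - 1] if i > 0 else "<BOS>",
        "w_next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
        # 3.2 Orthographic features: the two regexes plus character affixes
        "initial_cap": bool(re.fullmatch(r"[A-Z][a-z].*", w)),
        "all_caps": bool(re.fullmatch(r"[A-Z]+", w)),
        "first_char": w[0],
        "last_two": w[-2:],
    }
    key = w.lower()
    # 3.3 Gazetteer feature: type of the entity list the word matches
    if key in GAZETTEER:
        feats["gazetteer"] = GAZETTEER[key]
    # 3.4 Brown cluster feature: binary prefix of the cluster bit string
    if key in BROWN_PATHS:
        feats["brown"] = BROWN_PATHS[key][:brown_prefix_len]
    # 3.4 K-means cluster feature: id of the cluster the word belongs to
    if key in KMEANS_CLUSTERS:
        feats["kmeans"] = KMEANS_CLUSTERS[key]
    return feats

feats = extract_features(["Giants", "vs", "Dodgers"], 0)
```

In a CRF setting, each such feature dictionary would be serialized into one row of the training file, one row per token.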

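The K-means half of Section 3.4 reduces to clustering word vectors and writing out a word-to-cluster mapping. The paper uses word2vec's built-in clustering and Apache Spark MLlib over Gigaword/tweet embeddings; the sketch below substitutes a plain Lloyd's-algorithm K-means over a toy random embedding matrix, so every name and number here is illustrative only.

```python
import numpy as np

# Toy vocabulary with random "word vectors" standing in for the word2vec /
# GloVe embeddings trained on Gigaword and tweets (Sections 2.2 and 3.4).
vocab = ["giants", "dodgers", "obama", "merkel", "vs", "the"]
rng = np.random.default_rng(0)
vectors = rng.standard_normal((len(vocab), 50))  # vector size 50

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm (the paper uses word2vec's clustering
    and Spark MLlib instead)."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned vectors.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# A "cluster file" is then just a word -> cluster-id mapping; the id of the
# cluster a word belongs to is the CRF feature value (Section 3.4).
labels = kmeans(vectors, k=3)
cluster_of = {w: int(c) for w, c in zip(vocab, labels)}
```

The paper's grid over vector sizes, cluster sizes, and minimum counts simply repeats this procedure once per parameter combination, producing one cluster file each.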
10types
Feature Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Overall
Word Feature 25.76 23.93 27.59 28.01 9.69 23.94
+ Orthographic Features 36.48 35.64 41.20 43.27 25.34 37.03
+ Gazetteer Feature 44.36 43.94 48.22 44.84 30.35 42.94
+ Word Cluster Features 55.85 57.49 60.07 58.35 44.99 55.95
+ Postprocessing 56.09 57.82 60.07 58.88 45.78 56.31

Table 1: 5-fold cross-validation F1 performances for the 10types evaluation. Each row uses all features
added in the previous rows.

notypes
Feature Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Overall
Word Feature 30.91 30.08 33.41 36.99 20.69 31.09
+ Orthographic Features 52.06 54.29 52.22 53.11 44.49 51.62
+ Gazetteer Feature 52.26 56.70 58.78 56.74 47.45 54.77
+ Word Cluster Features 65.14 65.57 66.77 68.13 55.31 64.66
+ Postprocessing 65.44 65.85 67.30 68.70 56.00 65.13

Table 2: 5-fold cross-validation F1 performances for the notypes evaluation. Each row uses all features
added in the previous rows.

4 Experiments and Results

Our system is trained using the CRF++ tool (https://taku910.github.io/crfpp/). We trained separate classifiers for the two evaluations (10types and notypes).

To select the optimal settings, we make use of all available training data (train, dev, dev_2015) and conduct 5-fold cross-validation experiments. For easier comparison with other systems, the 5 folds are split such that dev is the test set for Fold 1, while dev_2015 is the test set for Fold 5.

4.1 Preliminary Results on Training Data

Table 1 and Table 2 show the 5-fold cross-validation performances after adding each feature group, for the 10types and notypes evaluations respectively. The use of word clusters significantly improves performance in both evaluations. There is an overall improvement of 13% and 9% for the 10types and notypes evaluations respectively when word cluster features are added. This demonstrates the usefulness of word vectors in improving the accuracy of a Twitter NER system.

Comparing the performances of Fold 1 (tested on dev) and Fold 5 (tested on dev_2015), we observe a significant performance difference. Similar observations can be made for the other three folds (tested on subsets of train) when compared with Fold 5. This suggests that there are notable differences between the data provided during the training period (train and dev) and the evaluation period (dev_2015), probably because the two sets of data were collected in different time periods.

4.2 Postprocessing

We also experiment with a postprocessing step based on heuristic rules to further refine the system output (last row of Table 1 and Table 2). The heuristic rules are based on string matching of words against name list entries. To prevent false positives, we require entries in some of the name lists to contain at least two words and no common words or stop words. For certain name lists where single-word entries are common but ambiguous (e.g. names of sports clubs), we check for the presence of cue words in the tweet before matching. For example, for single-word sports team names that are common in tweets, we check for the presence of cue words such as "vs". Examples of name lists used include names of professional athletes, music composers and sports facilities.
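The cue-word heuristic of Section 4.2 can be sketched roughly as follows. The list contents, the cue-word set, and the `label_sport_teams` name are invented for illustration; the real rules match against many Freebase-derived name lists and also enforce the multi-word and stop-word constraints described above.

```python
# Hypothetical name list of single-word sports teams (ambiguous on their own)
# and the cue words whose presence licenses a match (Section 4.2).
SPORT_TEAMS = {"giants", "dodgers", "reds"}
CUE_WORDS = {"vs", "vs.", "game", "beat"}

def label_sport_teams(tokens, labels):
    """Relabel single-word team names as SPORTSTEAM, but only when a
    cue word such as "vs" also appears somewhere in the tweet."""
    has_cue = any(t.lower() in CUE_WORDS for t in tokens)
    out = list(labels)
    if has_cue:
        for i, t in enumerate(tokens):
            if t.lower() in SPORT_TEAMS and out[i] == "O":
                out[i] = "B-SPORTSTEAM"
    return out

tokens = ["Giants", "vs", "Dodgers", "tonight"]
labels = label_sport_teams(tokens, ["O", "O", "O", "O"])
```

Without a cue word the same tokens are left untouched, which is exactly how the rule avoids the false positives mentioned above.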

10types notypes
System Rank Precision Recall F1 Rank Precision Recall F1
NLANGP 2 63.62 43.12 51.40 2 67.74 54.31 60.29
1st 1 57.66 55.22 56.41 1 72.20 69.14 70.63
2nd 2 63.62 43.12 51.40 2 67.74 54.31 60.29
3rd 3 53.24 38.58 44.74 3 63.81 56.28 59.81
Baseline – 35.56 29.05 31.97 – 53.86 46.44 49.88

Table 3: Comparison of our system (NLANGP) with the top three participating systems and official
baselines for the 10types and notypes evaluations.

10types notypes
System Precision Recall F1 Precision Recall F1
NLANGP 63.62 43.12 51.40 67.74 54.31 60.29
- Word Cluster Features 57.99 25.26 35.19 62.56 38.43 47.61

Table 4: System performances on the test data when word cluster features are not used.

4.3 Evaluation Results

Table 3 presents the official results of our 10types and notypes submissions. We also include the results of the top three participating systems and the official baselines for comparison.

As shown in the table, our system (NLANGP) is ranked 2nd in both evaluations. Based on our preliminary Fold 5 performances, our system's performance on the test data (test_2015, collected in the same period as dev_2015) is within expectation. In general, the fine-grained evaluation is the more challenging task, as seen from the large difference between the F1 scores for 10types and notypes.

Type Precision Recall F1
COMPANY 80.00 41.03 54.24
FACILITY 52.17 31.58 39.34
GEO-LOC 63.81 57.76 60.63
MOVIE 100.00 33.33 50.00
MUSICARTIST 50.00 9.76 16.33
OTHER 50.00 30.30 37.74
PERSON 70.70 64.91 67.68
PRODUCT 20.00 8.11 11.54
SPORTSTEAM 79.41 38.57 51.92
TVSHOW 0.00 0.00 0.00
Overall 63.62 43.12 51.40

Table 5: Performance of our system on each fine-grained type.

Table 5 shows the performance of our system on each fine-grained type. Unlike traditional NER, where state-of-the-art systems can achieve performances over 90 F1 for the 3 MUC types (PERSON, LOCATION and ORGANIZATION), Twitter NER poses new challenges in accurately extracting entity information from this genre, challenges that did not exist in the past.

We are also interested in the contribution of the word clusters on the test data. Table 4 shows the performance on the test data when word cluster features are not used. Similar to the observations on the training data, word clusters are important features for our system: a performance drop of more than 16 and 12 F1 points is observed for the 10types and notypes evaluations respectively.

5 Conclusion

In this paper, we describe our system used in the W-NUT Shared Task for NER in Twitter. We focus our efforts on improving Twitter NER using word representations, namely Brown clusters and K-means clusters, generated from large amounts of unlabeled newswire data and tweets. Our experiments and evaluation results show that cluster features derived from word representations are effective in improving Twitter NER performance. In future work, we hope to investigate the use of distant supervision techniques to build a better system that performs more robustly across tweets from different time periods. We would also like to perform an error analysis to help us understand which other problems persist, so as to address them in future work.

References

Timothy Baldwin, Bo Han, Marie Catherine de Marneffe, Young-Bum Kim, Alan Ritter, and Wei Xu. 2015. Findings of the 2015 Workshop on Noisy User-generated Text. In Proceedings of the Workshop on Noisy User-generated Text (WNUT 2015). Association for Computational Linguistics.

Colin Cherry and Hongyu Guo. 2015. The Unreasonable Effectiveness of Word Representations for Twitter Named Entity Recognition. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 735–745, Denver, Colorado, May–June. Association for Computational Linguistics.

Marco Lui and Timothy Baldwin. 2012. langid.py: An Off-the-shelf Language Identification Tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30, Jeju Island, Korea, July. Association for Computational Linguistics.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October. Association for Computational Linguistics.

Lev Ratinov and Dan Roth. 2009. Design Challenges and Misconceptions in Named Entity Recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 147–155, Boulder, Colorado, June. Association for Computational Linguistics.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524–1534, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-Supervised Learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden, July. Association for Computational Linguistics.
