Text Preparation
Abstract
Tokenization is commonly understood as the first step of any kind of natural
language text preparation. The major goal of this early (pre-linguistic) task is to
convert a stream of characters into a stream of processing units called tokens.
Outside the text mining community this job is taken for granted: it is commonly
seen as an already solved problem, comprising the identification of word borders
and punctuation marks separated by spaces and line breaks. In our view, however,
tokenization should also manage language-related word dependencies, incorporate
domain-specific knowledge, and handle morphosyntactically relevant linguistic
specificities. We therefore propose rule-based extended tokenization that includes
all sorts of linguistic knowledge (e.g., grammar rules, dictionaries). The core
features of our implementation are the identification and disambiguation of all
kinds of linguistic markers, the detection and expansion of abbreviations, the
treatment of special formats, and the typing of tokens, including single- and
multi-tokens. To improve the quality of text mining we suggest linguistically
based tokenization as a necessary step preceding further text processing tasks.
In this paper, we focus on the task of improving the quality of standard tagging.
Keywords: text preparation, natural language processing, tokenization, tagging
improvement, tokenization prototype.
1 Introduction
Nearly all researchers concerned with text mining presuppose tokenization as the
first step of text preparation [1–5]. Good surveys of tokenization techniques are
provided by Frakes and Baeza-Yates [6], Baeza-Yates and Ribeiro-Neto [7], and
Manning and Schütze [8, pp. 124–136]. But, as far as we know, only very few
treat tokenization as a task of multi-language text processing with far-reaching
impact [9]. This involves language-related knowledge about linguistically …
Category        Characters
alpha           abcdefghijklmnopqrstuvwxyzüäöß
alpha capital   ABCDEFGHIJKLMNOPQRSTUVWXYZÜÄÖ
numeric         0123456789
sentence end    .?!
punctuation     ,:;"’()[]<>
hyphen          -
delimiters      \u0003 \u0009 \u000A \u000B \u000C \u000D \u0020
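To make the character categories concrete, the following is a minimal Java
sketch, not the original JavaTok code; the class and method names are our own
illustrative assumptions. It maps a character to its basic category from the
table above:

import java.util.LinkedHashMap;
import java.util.Map;

public class CharClassifier {
    private static final Map<String, String> CATEGORIES = new LinkedHashMap<>();
    static {
        CATEGORIES.put("alpha", "abcdefghijklmnopqrstuvwxyzüäöß");
        CATEGORIES.put("alpha capital", "ABCDEFGHIJKLMNOPQRSTUVWXYZÜÄÖ");
        CATEGORIES.put("numeric", "0123456789");
        CATEGORIES.put("sentence end", ".?!");
        CATEGORIES.put("punctuation", ",:;\"’()[]<>");
        CATEGORIES.put("hyphen", "-");
        // Delimiters from the table: ETX, TAB, LF, VT, FF, CR, SPACE.
        CATEGORIES.put("delimiters", "\u0003\t\n\u000B\f\r ");
    }

    // Returns the basic category of c, or "unknown" if it is not listed.
    public static String categoryOf(char c) {
        for (Map.Entry<String, String> e : CATEGORIES.entrySet()) {
            if (e.getValue().indexOf(c) >= 0) {
                return e.getKey();
            }
        }
        return "unknown";
    }
}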
During the next step, punctuation marks are identified and separated (step 3 in
Alg. 1). Only tokens typed as mixtures (Tm1) are investigated. If a token string
does not match an entry in one of the repositories (e.g., abbreviations, acronyms,
regular-expression rules for single-token or multi-token typing), the last character
is split off and forms a new token with its corresponding token type (see 1 in
Fig. 2). To ensure the correctness of this splitting operation, basic context-specific
rules are applied: a token ending in a period and followed by a lower-case token is
not split, because the period does not mark the end of a sentence (see 2 in Fig. 2).
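A minimal Java sketch of this splitting step follows. It assumes a plain string
repository and approximates the mixture test by a trailing punctuation character;
all names are hypothetical, not the original JavaTok code:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PunctuationSplitter {
    // Assumed repository of known strings (abbreviations, acronyms, ...).
    private final Set<String> repository;

    public PunctuationSplitter(Set<String> repository) {
        this.repository = repository;
    }

    public List<String> separate(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            char last = tok.isEmpty() ? ' ' : tok.charAt(tok.length() - 1);
            boolean candidate = tok.length() > 1 && ".?!,:;".indexOf(last) >= 0
                    && !repository.contains(tok);
            // Context rule: a token ending in a period followed by a
            // lower-case token is not split (no sentence end).
            boolean nextIsLower = i + 1 < tokens.size()
                    && !tokens.get(i + 1).isEmpty()
                    && Character.isLowerCase(tokens.get(i + 1).charAt(0));
            if (candidate && !(last == '.' && nextIsLower)) {
                out.add(tok.substring(0, tok.length() - 1));
                out.add(String.valueOf(last)); // new punctuation token
            } else {
                out.add(tok);
            }
        }
        return out;
    }
}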
A set of user-defined token types is used to reinterpret and group (basic) token
types and strings (steps 4–6 in Alg. 1). The user can define custom types
to support domain-specific needs. Such token types are simply expressed as
strings, which are assigned to recognized tokens. The definition of token types
can draw on different sources of knowledge motivating the token interpretation.
This includes domain knowledge (e.g., the structure of an organization,
knowledge about data warehouses), gazetteer knowledge (e.g., country names,
river names), expert knowledge (e.g., medicine, astronomy), and purely linguistic
knowledge (e.g., morphological and syntactic rules, the subject of a sentence).
Examples of user-defined types are stopwords (U1), abbreviations (U2), dates
and times (U3), phone numbers (U4), email addresses (U5), sequences of
capitalized single-tokens (U6, in many cases extended keywords), etc. These
types are identified by applying two strategies: first, tokens are compared with a
repository of reliable (string; token type) entries created by a human
expert or by some kind of (semi-)automatic machinery. If no match is found, an
ordered list of rules is applied to process the sequence of tokens. The rules include
regular-expression matching of token strings (see 3 in Fig. 2), matching of token
types (see 4 in Fig. 2), and combinations of both (see 5 and 6 in Fig. 2).
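The two-stage strategy can be sketched in Java as follows. The type codes U2,
U3, and U5 are taken from the paper; the class name, the example repository
entry, and the concrete patterns are illustrative assumptions:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class TokenTyper {
    private final Map<String, String> repository = new LinkedHashMap<>();
    private final Map<Pattern, String> rules = new LinkedHashMap<>();

    public TokenTyper() {
        // Strategy 1: reliable (string -> token type) entries created by
        // an expert or by (semi-)automatic machinery.
        repository.put("aka.", "U2"); // abbreviation
        // Strategy 2: ordered regular-expression rules, applied on lookup failure.
        rules.put(Pattern.compile("\\d{1,2}\\.\\d{1,2}\\.\\d{4}"), "U3"); // date
        rules.put(Pattern.compile("[\\w.+-]+@[\\w.-]+"), "U5");           // email
    }

    public String typeOf(String token) {
        String type = repository.get(token);               // strategy 1
        if (type != null) return type;
        for (Map.Entry<Pattern, String> r : rules.entrySet()) { // strategy 2
            if (r.getKey().matcher(token).matches()) return r.getValue();
        }
        return null; // no user-defined type: keep the basic token type
    }
}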
The examples in Fig. 2 and Fig. 3 also outline the rule syntax. Each rule
consists of a condition part (an input sequence of typed tokens) and a consequence
part (an output sequence of typed tokens). The numbered indices of tokens indicate
relative token positions. Our rule-based approach rests on a simple, purely
linguistic functional interpretation of basic-token types and token strings in a given
context. Example rule classes cover morphological, syntactic, and general patterns
such as the following (one of them is sketched in the code after this list):
• suffix identification of well-known endings (e.g., “-ly”, “-ness”)
• identification and re-concatenation of hyphenated words at line breaks
• sentence border disambiguation
• multi-token identification
• special character treatment (e.g., apostrophes, slashes, ampersands)
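As an illustration of the second rule class, a hedged Java sketch of
re-concatenating words hyphenated at line breaks; it assumes the line break
survives tokenization as its own token, and the names are our own (the original
rules use the format shown in Fig. 3):

import java.util.ArrayList;
import java.util.List;

public class HyphenRejoiner {
    // Joins "pro-" + <line break> + "cessing" back into "processing".
    public static List<String> rejoin(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            if (tok.endsWith("-") && i + 2 < tokens.size()
                    && tokens.get(i + 1).equals("\n")
                    && Character.isLowerCase(tokens.get(i + 2).charAt(0))) {
                out.add(tok.substring(0, tok.length() - 1) + tokens.get(i + 2));
                i += 2; // consume the line break and the continuation
            } else {
                out.add(tok);
            }
        }
        return out;
    }
}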
4 JavaTok
This section describes the architecture of JavaTok 1.0, a freely configurable
tokenizer developed in Java. To cope with the language-dependent occurrence
of special characters (country-specific characters such as Slavic diacritics, French
accents, umlauts and sharp s in German, etc.), JavaTok provides Unicode-conformant
initialization and input/output (http://www.unicode.org, accessed 30.03.2006).
Example (rule RuleID_003 below merges a phone number into one multi-token):
input:  … call/Ta1 +43/Tm1 (0)462/Tm1 2700/Tn1 for/Ta1 …
output: … call/Ta1 (+43 (0)462 2700)/U4 for/Ta1 …
RuleID_003:
IN:  t_in,1.type = Tm1 AND t_in,2.type = Tm1 AND t_in,3.type = Tn1 AND
     (t_in,1.str t_in,2.str t_in,3.str).match(\+[0-9]+\s\(0\)[0-9]+\s[0-9]+)
OUT: t_out,1.str = (t_in,1.str t_in,2.str t_in,3.str) AND t_out,1.type = U4
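A minimal Java sketch of how such a rule could be applied; this is our own
illustrative code, not the JavaTok rule engine, though the type strings follow
the paper:

import java.util.regex.Pattern;

public class PhoneNumberRule {
    private static final Pattern PHONE =
            Pattern.compile("\\+[0-9]+\\s\\(0\\)[0-9]+\\s[0-9]+");

    // Returns the merged token string if RuleID_003 fires, null otherwise;
    // the merged token would then be typed U4 (phone number).
    public static String apply(String s1, String t1, String s2, String t2,
                               String s3, String t3) {
        if (t1.equals("Tm1") && t2.equals("Tm1") && t3.equals("Tn1")) {
            String joined = s1 + " " + s2 + " " + s3;
            if (PHONE.matcher(joined).matches()) {
                return joined;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Prints: +43 (0)462 2700
        System.out.println(apply("+43", "Tm1", "(0)462", "Tm1", "2700", "Tn1"));
    }
}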
Figure 4: Tokenizer output depending on the modes S (single-token typing),
M (multi-token typing), and R (abbreviation replacement); x marks an enabled mode:

Output                                                                                 S  M  R
The Red Cross is aka. RK .                                                             -  -  -
The Red Cross is also known as RK .                                                    -  -  x
The (Red Cross)/INST is aka. RK .                                                      -  x  -
The (Red Cross)/INST is (also known as)/ABBR RK .                                      -  x  x
The/Ta2 Red/Ta2 Cross/Ta2 is/Ta1 aka./ABBR RK/Ta3 ./Tp1                                x  -  -
The/Ta2 Red/Ta2 Cross/Ta2 is/Ta1 also/Ta1 known/Ta1 as/Ta1 RK/Ta3 ./Tp1                x  -  x
The/Ta2 (Red/Ta2 Cross/Ta2)/INST is/Ta1 aka./ABBR RK/Ta3 ./Tp1                         x  x  -
The/Ta2 (Red/Ta2 Cross/Ta2)/INST is/Ta1 (also/Ta1 known/Ta1 as/Ta1)/ABBR RK/Ta3 ./Tp1  x  x  x
In the example given in Fig. 4 the basic token types used are '/Ta1', '/Ta2',
'/Ta3', '/Tm1', and '/Tp1' (see Sec. 3). The user-defined token types are
'/ABBR' (abbreviation) and '/INST' (institution). The mode describes whether
single-token typing is enabled (S), whether multi-token typing is enabled (M),
and whether known abbreviations are to be replaced (R). The only known
abbreviation in the example is 'aka.', standing for 'also known as'. Also,
'Red Cross' is a known institution. 'RK' (Red Cross) is an unknown abbreviation;
since regular words do not contain upper-case letters in non-initial position, it
is marked as irregular by a rule (sketched below).
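A one-method Java sketch of that irregularity test, under the stated assumption
about word-internal capitals (illustrative code, not the original rule):

public class IrregularWordRule {
    // A token is irregular if it carries an upper-case letter anywhere
    // after its first character, as in "RK".
    public static boolean isIrregular(String token) {
        for (int i = 1; i < token.length(); i++) {
            if (Character.isUpperCase(token.charAt(i))) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isIrregular("RK"));    // true: candidate acronym
        System.out.println(isIrregular("Cross")); // false: regular word
    }
}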
5 Conclusion
Extended tokenization can be seen as one of the core steps of any kind of text
preparation. It is crucial for all subsequent text processing tasks. To cope with
NLP difficulties we introduced the notion of extended tokenization, including
token definitions and user-defined token types. With our multi-token concept we
are able to classify, split, and recombine tokens and token chains into semantic
units for further processing. Our rule-based token typing approach carries out
reinterpretation and substitution of token strings and token types on two different
levels.
Our implementation, JavaTok 1.0, allows proper treatment of both general and
language-related tokenization difficulties. To circumvent early misinterpretation
of tokens, the tokenizer can leave segmentation decisions open, avoiding
hypothetically motivated decisions in ambiguous contexts. JavaTok is optimized
for reducing data and time complexity with respect to further processing tasks
(e.g., named entity recognition, tagging, etc.). A short draft of how tagging
output can be improved through our tokenization method is outlined at the end
of the paper, showing promising results. However, more empirical work is
certainly needed, together with an examination of methods for automatic rule
elicitation.
References
[1] Webster, J.J. & Kit, C., Tokenization as the initial phase in NLP. Proceedings
of the 14th International Conference on Computational Linguistics (COLING-92),
volume 4, pp. 1106–1110, 1992.
[2] Fox, C., Lexical analysis and stoplists. Information Retrieval: Data Structures
and Algorithms, eds. W.B. Frakes & R. Baeza-Yates, Prentice Hall, pp. 102–130,
1992.
[3] Grefenstette, G. & Tapanainen, P., What is a word, what is a
sentence? problems of tokenization. The 3rd Conference on Computational
Lexicography and Text Research (COMPLEX’94), pp. 79–87, 1994.
[4] Guo, J., Critical tokenization and its properties. Computational Linguistics,
23(4), pp. 569–596, 1997.
[5] Barcala, F.M., Vilares, J., Alonso, M.A., Graña, J. & Vilares, M., Tokenization
and proper noun recognition for information retrieval. 3rd International
Workshop on Natural Language and Information Systems (NLIS ’02),
pp. 246–250, 2002.
[6] Frakes, W.B. & Baeza-Yates, R., Information Retrieval: Data Structures and
Algorithms. Prentice Hall, Englewood Cliffs, NJ, USA, 1992.
[7] Baeza-Yates, R. & Ribeiro-Neto, B., Modern Information Retrieval. Addison
Wesley, ACM Press, New York: Essex, England, 1999.
[8] Manning, C.D. & Schütze, H., Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, Massachusetts: London, England, 5th
edition, 2002.
[9] Giguet, E., The stakes of multilinguality: Multilingual text tokenization
in natural language diagnosis. Proceedings of the 4th Pacific Rim
International Conference on Artificial Intelligence Workshop Future issues
for Multilingual Text Processing, Cairns, Australia, 1996.
[10] Jackson, P. & Moulinier, I., Natural Language Processing for Online Appli-
cations: Text Retrieval, Extraction and Categorisation. John Benjamins,
Amsterdam, Netherlands: Wolverhampton, United Kingdom, 2002.
[11] Say, B. & Akman, V., An information-based approach to punctuation.
Proceedings ICML ’96: Second International Conference on Mathematical
Linguistics, Tarragona, Spain, pp. 93–94, 1996.
[12] Palmer, D.D., Tokenisation and sentence segmentation. Handbook of Natural
Language Processing, eds. R. Dale, H. Moisl & H. Somers, Marcel Dekker,
Inc., pp. 11–35, 2000.
[13] Guo, J., One tokenization per source. Proceedings of the Thirty-Sixth Annual
Meeting of the Association for Computational Linguistics and Seventeenth
International Conference on Computational Linguistics, pp. 457–463, 1998.
[14] Mikheev, A., Periods, capitalized words, etc. Computational Linguistics,
28(3), pp. 289–318, 2002.