An Unsupervised Model For Text Message Normalization
need for applications such as translation and question answering.

We observe that many creative texting forms are the result of a small number of specific word formation processes. Rather than using a generic error model to capture all of them, we propose a mixture model in which each word formation process is modeled explicitly according to linguistic observations specific to that formation.

2 Analysis of Texting Forms

To better understand the creative processes present in texting language, we categorize the word formation process of each texting form in our development data, which consists of 400 texting forms paired with their standard forms.[4] Several iterations of categorization were done in order to determine sensible categories, and ensure categories were used consistently. Since this data is only to be used to guide the construction of our system, and not for formal evaluation, only one judge (a native English speaking author of this paper) categorized the expressions. The findings are presented in Table 1.

Formation type         Freq.   Example
Stylistic variation      152   betta (better)
Subseq. abbrev.          111   dng (doing)
Prefix clipping           24   hol (holiday)
Syll. letter/digit        19   neway (anyway)
G-clipping                14   talkin (talking)
Phonetic abbrev.          12   cuz (because)
H-clipping                10   ello (hello)
Spelling error             5   darliog (darling)
Suffix clipping            4   morrow (tomorrow)
Punctuation                3   b/day (birthday)
Unclear                   34   mobs (mobile)
Error                     12   gal (*girl)
Total                    400

Table 1: Frequency of texting forms in the development set by formation type.

Stylistic variations, by far the most frequent category, exhibit non-standard spelling, such as representing sounds phonetically. Subsequence abbreviations, also very frequent, are composed of a subsequence of the graphemes in a standard form, often omitting vowels. These two formation types account for approximately 66% of our development data; the remaining formation types are much less frequent. Prefix clippings and suffix clippings consist of a prefix or suffix, respectively, of a standard form, and in some cases a diminutive ending; we also consider clippings which omit just a g or h from a standard form as they are rather frequent.[5] A single letter or digit can be used to represent a syllable; we refer to these as syllabic (syll.) letter/digit. Phonetic abbreviations are variants of clippings and subsequence abbreviations where some sounds in the standard form are represented phonetically. Several texting forms appear to be spelling errors; we took the layout of letters on cell phone keypads into account when making this judgement. The items that did not fit within the above texting form categories were marked as unclear. Finally, for some expressions the given standard form did not appear to be appropriate. For example, girl is not the standard form for the texting form gal; rather, gal is an English word that is a colloquial form of girl. Such cases were marked as errors.

No texting forms in our development data correspond to multiple standard form words, e.g., wanna for want to.[6] Since such forms are not present in our development data, we assume that a texting form always corresponds to a single standard form word.

It is important to note that some texting forms have properties of multiple categories, e.g., bak (back) could be considered a stylistic variation or a subsequence abbreviation. In such cases, we simply attempt to assign the most appropriate category.

The design of our model for text message normalization, presented below, uses properties of the observed formation processes.

[4] Most texting forms have a unique standard form; however, some have multiple standard forms, e.g., will and well can both be shortened to wl. In such cases we choose the category of the most frequent standard form; in the case of frequency ties we choose arbitrarily among the categories of the standard forms.
[5] Thurlow (2003) also observes an abundance of g-clippings.
[6] A small number of similar forms, however, appear with a single standard form word, and are therefore marked as errors.
3 An Unsupervised Noisy Channel Model for Text Message Normalization

Let S be a sentence consisting of standard forms s_1 s_2 ... s_n; in this study the standard forms are regular English words. Let T be a sequence of texting forms t_1 t_2 ... t_n, which are the texting language realization of the standard forms, and may differ from the standard forms. Given a sequence of texting forms T, the challenge is then to determine the corresponding standard forms S.

Following Choudhury et al. (2007)—and various approaches to spelling error correction, such as, e.g., Mays et al. (1991)—we model text message normalization using a noisy channel. We want to find argmax_S P(S|T). We apply Bayes rule and ignore the constant term P(T), giving argmax_S P(T|S) P(S). Making the independence assumption that each t_i depends only on s_i, and not on the context in which it occurs, as in Choudhury et al., we express P(T|S) as a product of probabilities: argmax_S (∏_i P(t_i | s_i)) P(S).

We note in Section 2 that many texting forms are created through a small number of specific word formation processes. Rather than model each of these processes at once using a generic model for P(t_i | s_i), as in Choudhury et al., we instead create several such models, each corresponding to one of the observed common word formation processes. We therefore rewrite P(t_i | s_i) as ∑_{wf} P(t_i | s_i, wf) P(wf), where wf is a word formation process, e.g., subsequence abbreviation. Since, like Choudhury et al., we focus on the word model, we simplify our model as below.

    argmax_{s_i} ∑_{wf} P(t_i | s_i, wf) P(wf) P(s_i)

We next explain the components of the model, P(t_i | s_i, wf), P(wf), and P(s_i), referred to as the word model, word formation prior, and language model, respectively.
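For concreteness, the following sketch (in Python) shows how a texting form could be normalized under this model by scoring every lexicon entry; the function names and the word-model interface are illustrative assumptions, not part of the system described here.

    def rank_standard_forms(t, lexicon, word_models, p_lm, p_wf=None):
        """Rank candidate standard forms s for a texting form t under
        argmax_s sum_wf P(t|s, wf) P(wf) P(s)."""
        if p_wf is None:
            # Fully unsupervised setting: uniform word formation prior.
            p_wf = {wf: 1.0 / len(word_models) for wf in word_models}
        scores = {}
        for s in lexicon:
            # word_models maps a formation type wf to a function returning P(t|s, wf).
            mixture = sum(model(t, s) * p_wf[wf] for wf, model in word_models.items())
            scores[s] = mixture * p_lm(s)  # p_lm(s) is the unigram language model P(s)
        return sorted(scores, key=scores.get, reverse=True)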
3.1 Word Models

We now consider which of the word formation processes discussed in Section 2 should be captured with a word model P(t_i | s_i, wf). We model stylistic variations and subsequence abbreviations simply due to their frequency. We also choose to model prefix clippings since this word formation process is common outside of text messaging (Kreidler, 1979; Algeo, 1991) and fairly frequent in our data. Although g-clippings and h-clippings are moderately frequent, we do not model them, as these very specific word formations are also (non-prototypical) subsequence abbreviations. We do not model syllabic letters and digits, or punctuation, explicitly; instead, we simply substitute digits with a graphemic representation (e.g., 4 is replaced by for), and remove punctuation, before applying the model. The other less frequent formations—phonetic abbreviations, spelling errors, and suffix clippings—are not modeled; we hypothesize that the similarity of these formation processes to those we do model will allow the system to perform reasonably well on them.
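The digit and punctuation preprocessing mentioned above can be sketched as follows; only the substitution of 4 by for is attested in the text, and the remaining digit mappings are illustrative assumptions.

    import re

    # Hypothetical digit-to-grapheme substitutions; only "4" -> "for" is given
    # as an example above, the others are assumptions for illustration.
    DIGIT_GRAPHEMES = {"2": "to", "4": "for", "8": "ate"}

    def preprocess(texting_form):
        """Replace digits with a graphemic representation and strip punctuation."""
        for digit, graphemes in DIGIT_GRAPHEMES.items():
            texting_form = texting_form.replace(digit, graphemes)
        return re.sub(r"[^a-z\s]", "", texting_form.lower())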
3.1.1 Stylistic Variations

We propose a probabilistic version of edit-distance—referred to here as edit-probability—inspired by Brill and Moore (2000) to model P(t_i | s_i, stylistic variation). To compute edit-probability, we consider the probability of each edit operation—substitution, insertion, and deletion—instead of its cost, as in edit-distance. We then simply multiply the probabilities of edits as opposed to summing their costs.

In this version of edit-probability, we allow two-character edits. Ideally, we would compute the edit-probability of two strings as the sum of the edit-probability of each partitioning of those strings into one or two character segments. However, following Brill and Moore, we approximate this by the probability of the partition with maximum probability. This allows us to compute edit-probability using a simple adaptation of edit-distance, in which we consider edit operations spanning two characters at each cell in the chart maintained by the algorithm.

We then estimate two probabilities: P(g_t | g_s, pos) is the probability of texting form grapheme g_t given standard form grapheme g_s at position pos, where pos is the beginning, middle, or end of the word; P(h_t | p_s, h_s, pos) is the probability of texting form graphemes h_t given the standard form phonemes p_s and graphemes h_s at position pos. h_t, p_s, and h_s can be a single grapheme or phoneme, or a bigram.

We compute edit-probability between the graphemes of s_i and t_i. When filling each cell in the chart, we consider edit operations between segments of s_i and t_i of length 0–2, referred to as a and b, respectively. If a aligns with phonemes in s_i, we also consider those phonemes, p. In our lexicon, the graphemes and phonemes of each word are aligned according to the method of Jiampojamarn et al. (2007). For example, the alignment for without is given in Table 2. The probability of each edit operation is then determined by three properties—the length of a, whether a aligns with any phonemes in s_i, and if so, p—as shown below:

    |a| = 0 or 1, not aligned with s_i phonemes:  P(b | a, pos)
    |a| = 2, not aligned with s_i phonemes:       0
    |a| = 1 or 2, aligned with s_i phonemes:      P(b | p, a, pos)

graphemes   w   i   th   ou   t
phonemes    w   I   T    au   t

Table 2: Grapheme–phoneme alignment for without.
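A simplified sketch of the edit-probability computation follows. It implements the Viterbi-style adaptation of edit-distance described above, with a generic edit-probability function standing in for the P(b | a, pos) and P(b | p, a, pos) estimates (the phoneme-aligned case is folded into that function for brevity).

    def edit_probability(s, t, p_edit):
        """Probability of the most probable partition of s (standard form) and
        t (texting form) into aligned segments of length 0-2.
        p_edit(a, b, pos) stands in for the estimates described above."""
        n, m = len(s), len(t)
        chart = [[0.0] * (m + 1) for _ in range(n + 1)]
        chart[0][0] = 1.0
        for i in range(n + 1):
            for j in range(m + 1):
                if chart[i][j] == 0.0:
                    continue
                # Position of the edit within the standard form.
                pos = "beginning" if i == 0 else "end" if i == n else "middle"
                for di in range(3):        # segment a of s, length 0-2
                    for dj in range(3):    # segment b of t, length 0-2
                        if (di == 0 and dj == 0) or i + di > n or j + dj > m:
                            continue
                        p = chart[i][j] * p_edit(s[i:i + di], t[j:j + dj], pos)
                        if p > chart[i + di][j + dj]:
                            chart[i + di][j + dj] = p
        return chart[n][m]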
3.1.2 Subsequence Abbreviations

We model subsequence abbreviations according to the equation below:

    P(t_i | s_i, subseq abbrev) = c   if t_i is a subsequence of s_i
                                  0   otherwise

where c is a constant.

Note that this is similar to the error model for spelling correction presented by Mays et al. (1991), in which all words (in our terms, all s_i) within a specified edit-distance of the out-of-vocabulary word (t_i in our model) are given equal probability. The key difference is that in our formulation, we only consider standard forms for which the texting form is potentially a subsequence abbreviation.

In combination with the language model, P(t_i | s_i, subseq abbrev) assigns a non-zero probability to each standard form s_i for which t_i is a subsequence, according to the likelihood of s_i (under the language model). The models interact in this way since we expect a standard form to be recognizable relative to the other words for which t_i could be a subsequence abbreviation.
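The subsequence test underlying this word model can be sketched as follows (the function names are illustrative). For example, is_subsequence('dng', 'doing') holds, so doing receives probability c for the texting form dng.

    def is_subsequence(t, s):
        """True if the texting form t is a subsequence of the standard form s."""
        remaining = iter(s)
        return all(ch in remaining for ch in t)

    def p_subseq_abbrev(t, s, c):
        """P(t | s, subseq abbrev): the constant c if t is a subsequence of s,
        and 0 otherwise."""
        return c if is_subsequence(t, s) else 0.0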
3.1.3 Prefix Clippings

We model prefix clippings similarly to subsequence abbreviations.

    P(t_i | s_i, prefix clipping) = c   if t_i is a possible prefix clipping of s_i
                                    0   otherwise

Kreidler (1979) observes that clippings tend to be mono-syllabic and end in a consonant. Furthermore, when they do end in a vowel, it is often of a regular form, such as telly for television and breaky for breakfast. We therefore only consider P(t_i | s_i, prefix clipping) if t_i is a prefix clipping according to the following heuristics: t_i is monosyllabic after stripping any word-final vowels, and subsequently removing duplicated word-final consonants (e.g., telly becomes tel, which is a candidate prefix clipping). If t_i is not a prefix clipping according to these criteria, P(t_i | s_i) simply sums over all models except prefix clipping.
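One possible reading of this heuristic is sketched below. The monosyllabicity test is a rough approximation that counts groups of adjacent vowels, and word-final y is treated as a vowel so that telly reduces to tel as in the example above; the text does not spell out these details.

    import re

    FINAL_VOWELS = "aeiouy"  # y treated as a vowel so that telly -> tell -> tel

    def clipping_stem(t):
        """Strip word-final vowels, then collapse a duplicated final consonant."""
        stem = t.rstrip(FINAL_VOWELS)
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in FINAL_VOWELS:
            stem = stem[:-1]
        return stem

    def is_candidate_prefix_clipping(t, s):
        """One reading of the heuristic: the stem of t is (roughly) monosyllabic
        and is a prefix of the candidate standard form s."""
        stem = clipping_stem(t)
        monosyllabic = len(re.findall(r"[aeiou]+", stem)) <= 1  # rough approximation
        return monosyllabic and s.startswith(stem)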
3.2 Word Formation Prior

Keeping with our goal of an unsupervised method, we estimate P(wf) with a uniform distribution. We also consider estimating P(wf) using maximum likelihood estimates (MLEs) from our observations in Section 2. This gives a model that is not fully unsupervised, since it relies on labelled training data. However, we consider this a lightly-supervised method, since it only requires an estimate of the frequency of the relevant word formation types, and not labelled texting form–standard form pairs.

3.3 Language Model

Choudhury et al. (2007) find that using a bigram language model estimated over a balanced corpus of English had a negative effect on their results compared with a unigram language model, which they attribute to the unique characteristics of text messaging that were not reflected in the corpus. We therefore use a unigram language model for P(s_i), which also enables comparison with their results. Nevertheless, alternative language models, such as higher order ngram models, could easily be used in place of our unigram language model.

4 Materials and Methods

4.1 Datasets

We use the data provided by Choudhury et al. (2007) which consists of texting forms—extracted from a collection of 900 text messages—and their manually determined standard forms. Our development data—used for model development and discussed in Section 2—consists of the 400 texting form types that are not in Choudhury et al.'s held-out test set, and that are not the same as one of their standard forms. The test data consists of 1213 texting forms and their corresponding standard forms. A subset of 303 of these texting forms differ from their standard form.[7] This subset is the focus of this study, but we also report results on the full dataset.

[7] Choudhury et al. report that this dataset contains 1228 texting forms. We found it to contain 1213 texting forms corresponding to 1228 standard forms (recall that a texting form may have multiple standard forms). There were similar inconsistencies with the subset of texting forms that differ from their standard forms. Nevertheless, we do not expect these small differences to have an appreciable effect on the results.

4.2 Lexicon

We construct a lexicon of potential standard forms such that it contains most words that we expect to encounter in text messages, yet is not so large as to make it difficult to identify the correct standard form. Our subjective analysis of the standard forms in the development data is that they are frequent, non-specialized, words. To reflect this observation, we create a lexicon consisting of all single-word entries containing only alphabetic characters found in both the CELEX Lexical Database (Baayen et al., 1995) and the CMU Pronouncing Dictionary.[8] We remove all words of length one (except a and I) to avoid choosing, e.g., the letter r as the standard form for the texting form r. We further limit the lexicon to words in the 20K most frequent alphabetic unigrams, ignoring case, in the Web 1T 5-gram Corpus (Brants and Franz, 2006). The resulting lexicon contains approximately 14K words, and excludes only three of the standard forms—cannot, email, and online—for the 400 development texting forms.

[8] http://www.speech.cs.cmu.edu/cgi-bin/cmudict
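The lexicon construction just described can be sketched as follows, with the arguments standing in for the CELEX entries, the CMU Pronouncing Dictionary entries, and case-folded Web 1T unigram counts (none of which are reproduced here).

    def build_lexicon(celex_words, cmudict_words, unigram_counts, top_n=20000):
        """Intersect CELEX and CMU dictionary entries, drop one-letter words other
        than 'a' and 'i', and keep only words among the top_n most frequent
        alphabetic unigrams (unigram_counts is assumed to be case-folded)."""
        candidates = ({w.lower() for w in celex_words if w.isalpha()} &
                      {w.lower() for w in cmudict_words if w.isalpha()})
        candidates = {w for w in candidates if len(w) > 1 or w in {"a", "i"}}
        frequent = sorted((w for w in unigram_counts if w.isalpha()),
                          key=unigram_counts.get, reverse=True)[:top_n]
        return candidates & set(frequent)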
4.3 Model Parameter Estimation

MLEs for P(g_t | g_s, pos)—needed to estimate P(t_i | s_i, stylistic variation)—could be estimated from texting form–standard form pairs. However, since our system is unsupervised, no such data is available. We therefore assume that many texting forms, and other similar creative shortenings, occur on the web. We develop a number of character substitution rules, e.g., s ⇒ z, and use them to create hypothetical texting forms from standard words. We then compute MLEs for P(g_t | g_s, pos) using the frequencies of these derived forms on the web.

We create the substitution rules by examining examples in the development data, considering fast speech variants and dialectal differences (e.g., voicing), and drawing on our intuition. The derived forms are produced by applying the substitution rules to the words in our lexicon. To avoid considering forms that are themselves words, we eliminate any form found in a list of approximately 480K words taken from SOWPODS[9] and the Moby Word Lists.[10] Finally, we obtain the frequency of the derived forms from the Web 1T 5-gram Corpus.

[9] http://en.wikipedia.org/wiki/SOWPODS
[10] http://icon.shef.ac.uk/Moby/

To estimate P(h_t | p_s, h_s, pos), we first estimate two simpler distributions: P(h_t | h_s, pos) and P(h_t | p_s, pos). P(h_t | h_s, pos) is estimated in the same manner as P(g_t | g_s, pos), except that two character substitutions are allowed. P(h_t | p_s, pos) is estimated from the frequency of p_s, and its alignment with h_t, in a version of CELEX in which the graphemic and phonemic representation of each word is many–many aligned using the method of Jiampojamarn et al. (2007).[11] P(h_t | p_s, h_s, pos) is then an evenly-weighted linear combination of P(h_t | h_s, pos) and P(h_t | p_s, pos). Finally, we smooth each of P(g_t | g_s, pos) and P(h_t | p_s, h_s, pos) using add-alpha smoothing.

[11] We are very grateful to Sittichai Jiampojamarn for providing this alignment.

We set the constant c in our word models for subsequence abbreviations and prefix clippings such that ∑_{s_i} P(t_i | s_i, wf) P(s_i) = 1. We similarly normalize P(t_i | s_i, stylistic variation) P(s_i).

We use the frequency of unigrams (ignoring case) in the Web 1T 5-gram Corpus to estimate our language model. We expect the language of text messaging to be more similar to that found on the web than that in a balanced corpus of English.

4.4 Evaluation Metrics

To evaluate our system, we consider three accuracy metrics: in-top-1, in-top-10, and in-top-20.[12] In-top-n considers the system correct if a correct standard form is in the n most probable standard forms. The in-top-1 accuracy shows how well the system determines the correct standard form; the in-top-10 and in-top-20 accuracies may be indicative of the usefulness of the output of our system in other tasks which could exploit a ranked list of standard forms, such as machine translation.

[12] These are the same metrics used by Choudhury et al. (2007), although we refer to them by different names.
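The in-top-n metrics can be computed as in the following sketch, assuming a ranked candidate list for each texting form and, possibly, multiple correct standard forms per texting form.

    def in_top_n_accuracy(predictions, gold, n):
        """predictions: texting form -> ranked list of candidate standard forms;
        gold: texting form -> set of correct standard forms."""
        hits = sum(1 for t, ranked in predictions.items()
                   if gold[t] & set(ranked[:n]))
        return hits / len(predictions)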
Model               % accuracy
                    Top-1   Top-10   Top-20
Uniform              59.4     83.8     87.8
MLE                  55.4     84.2     86.5
Choudhury et al.     59.9     84.3     88.7

Table 3: % in-top-1, in-top-10, and in-top-20 accuracy on test data using both estimates for P(wf). The results reported by Choudhury et al. (2007) are also shown.
5 Results and Discussion

In Table 3 we report the results of our system using both the uniform estimate and the MLE of P(wf). Note that there is no meaningful random baseline to compare against here; randomly ordering the 14K words in our lexicon gives very low accuracy. The results using the uniform estimate of P(wf)—a fully unsupervised system—are very similar to the supervised results of Choudhury et al. (2007). Surprisingly, when we estimate P(wf) using MLEs from the development data—resulting in a lightly-supervised system—the results are slightly worse than when using the uniform estimate of this probability. Moreover, we observe the same trend on development data where we expect to have an accurate estimate of P(wf) (results not shown). We hypothesize that the ambiguity of the categories of texting forms (see Section 2) results in poor MLEs for P(wf), thus making a uniform distribution, and hence fully-unsupervised approach, more appropriate.

Results by Formation Type   We now consider in-top-1 accuracy for each word formation type, in Table 4. We show results for the same word formation processes as in Table 1, except for h-clippings and punctuation, as no words of these categories are present in the test data. We present results using the same experimental setup as before with a uniform estimate of P(wf) (All), and using just the model corresponding to the word formation process (Specific), where applicable.[13]

[13] In this case our model then becomes, for each word formation process wf, argmax_{s_i} P(t_i | s_i, wf) P(s_i).

Formation type        Freq.        % in-top-1 acc.
                      (n = 303)    Specific    All
Stylistic variation     121            62.8   67.8
Subseq. abbrev.          65            56.9   46.2
Prefix clipping          25            44.0   20.0
G-clipping               56               -   91.1
Syll. letter/digit       16               -   50.0
Unclear                  12               -    0.0
Spelling error            5               -   80.0
Suffix clipping           1               -    0.0
Phonetic abbrev.          1               -    0.0
Error                     1               -    0.0

Table 4: Frequency (Freq.), and % in-top-1 accuracy using the formation-specific model where applicable (Specific) and all models (All) with a uniform estimate for P(wf), presented by formation type.

We first examine the top panel of Table 4, where we compare the performance on each word formation type for both experimental conditions (Specific and All). We first note that the performance using the formation-specific model on subsequence abbreviations and prefix clippings is better than that of the overall model. This is unsurprising since we expect that when we know a texting form's formation process, and invoke a corresponding specific model, our system should outperform a model designed to handle a range of formation types. However, this is not the case for stylistic variations; here the overall model performs better than the specific model. We observed in Section 2 that some texting forms do not fit neatly into our categorization scheme; indeed, many stylistic variations are also analyzable as subsequence abbreviations. Therefore, the subsequence abbreviation model may benefit normalization of stylistic variations. This model, used in isolation on stylistic variations, gives an in-top-1 accuracy of 33.1%, indicating that this may be the case.

Comparing the performance of the individual word models on only word types that they were designed for (column Specific in Table 4), we see that the prefix clipping model is by far the lowest, indicating that in the future we should consider ways of improving this word model. One possibility is to incorporate phonemic knowledge. For example, both friday and friend have the same probability under P(t_i | s_i, prefix clipping) for the texting form fri, which has the standard form friday in our data. (The language model, however, does distinguish between these forms.) However, if we consider the phonemic representations of these words, friday might emerge as more likely. Syllable structure information may also be useful, as we hypothesize that clippings will tend to be formed by truncating a word at a syllable boundary. We may similarly be able to improve our estimate of P(t_i | s_i, subseq abbrev). For example, both text and taxation have the same probability under this distribution, but intuitively text, the correct standard form in our data, seems more likely. We could incorporate knowledge about the likelihood of omitting specific characters, as in Choudhury et al. (2007), to improve this estimate.

We now examine the lower panel of Table 4, in which we consider the performance of the overall model on the word formation types that are not explicitly modeled. The very high accuracy on g-clippings indicates that since these forms are also a type of subsequence abbreviation, we do not need to construct a separate model for them. We in fact also conducted experiments in which g-clippings and h-clippings were modeled explicitly, but found these extra models to have little effect on the results.

Recall from Section 3.1 our hypothesis that suffix clippings, spelling errors, and phonetic abbreviations have common properties with formation types that we do model, and therefore the system will perform reasonably well on them. Here we find preliminary evidence to support this hypothesis as the accuracy on these three word formation types (combined) is 57.1%. However, we must interpret this result cautiously as it only considers seven expressions. On the syllabic letter and digit texting forms the accuracy is 50.0%, indicating that our heuristic to replace digits in texting forms with an orthographic representation is reasonable.

The performance on types of expressions that we did not consider when designing the system—unclear and error—is very poor. However, this has little impact on the overall performance as these expressions are rather infrequent.

Results by Model   We now consider in-top-1 accuracy using each model on the 303 test expressions; results are shown in Table 5.

Model                  % in-top-1 accuracy
Stylistic variation                   51.8
Subseq. abbrev.                       44.2
Prefix clipping                       10.6

Table 5: % in-top-1 accuracy on the 303 test expressions using each model individually.

No model on its own gives results comparable to those of the overall model (59.4%, see Table 3). This indicates that the overall model successfully combines information from the specific word formation models.

Each model used on its own gives an accuracy greater than the proportion of expressions of the word formation type for which the model was designed (compare accuracies in Table 5 to the number of expressions of the corresponding word formation type in the test data in Table 4). As we note in Section 2, the distinctions between the word formation types are not sharp; these results show that the shared properties of word formation types enable a model for a specific formation type to infer the standard form of texting forms of other formation types.

All Unseen Data   Until now we have discussed results on our test data of 303 texting forms which differ from their standard forms. We now consider the performance of our system on all 1213 unseen texting forms, 910 of which are identical to their standard form. Since our model was not designed with such expressions in mind, we slightly adapt it for this new task; if t_i is in our lexicon, we return that form as s_i, otherwise we apply our model as usual, using the uniform estimate of P(wf). This gives an in-top-1 accuracy of 88.2%, which is very similar to the results of Choudhury et al. (2007) on this data of 89.1%. Note, however, that Choudhury et al. only report results on this dataset using a uniform language model;[14] since we use a unigram language model, it is difficult to draw firm conclusions about the performance of our system relative to theirs.

[14] Choudhury et al. do use a unigram language model for their experiments on the 303 texting forms which differ from their standard forms (see Section 3.3).
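The adaptation described above amounts to a thin wrapper around the ranking model, as in the following sketch (rank stands in for any ranking function, such as rank_standard_forms sketched in Section 3).

    def normalize(t, lexicon, rank):
        """Return t itself when it is already in the lexicon; otherwise fall back
        on the ranked candidates from the model."""
        return [t] if t in lexicon else rank(t)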
6 Related Work

Aw et al. (2006) model text message normalization as translation from the texting language into the standard language. Kobus et al. (2008) incorporate ideas from both machine translation and automatic speech recognition for text message normalization. However, both of these approaches are supervised, and have only limited means for normalizing texting forms that do not occur in the training data.

Our work, like that of Choudhury et al. (2007), can be viewed as a noisy-channel model for spelling error correction (e.g., Mays et al., 1991; Brill and Moore, 2000), in which texting forms are seen as a kind of spelling error. Furthermore, like our approach to text message normalization, approaches to spelling correction have incorporated phonemic information (Toutanova and Moore, 2002).

The word model of the supervised approach of Choudhury et al. consists of hidden Markov models, which capture properties of texting language similar to those of our stylistic variation model. We propose multiple word models—corresponding to frequent texting language formation processes—and an unsupervised method for parameter estimation.

7 Conclusions

We analyze a sample of texting forms to determine frequent word formation processes in creative texting language. Drawing on these observations, we construct an unsupervised noisy-channel model for text message normalization. On an unseen test set of 303 texting forms that differ from their standard form, our model achieves 59% accuracy, which is on par with that obtained by the supervised approach of Choudhury et al. (2007) on the same data.

More research is required to determine the impact of our normalization method on the performance of a system that further processes the resulting text. In the future, we intend to improve our word models by incorporating additional linguistic knowledge, such as information about syllable structure. Since context likely plays a role in human interpretation of texting forms, we also intend to examine the performance of higher order ngram language models.

Acknowledgements

This work is financially supported by the Natural Sciences and Engineering Research Council of Canada, the University of Toronto, and the Dictionary Society of North America.

References

John Algeo, editor. 1991. Fifty Years Among the New Words. Cambridge University Press, Cambridge.

AiTi Aw, Min Zhang, Juan Xiao, and Jian Su. 2006. A phrase-based statistical model for SMS text normalization. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 33–40. Sydney.

R. H. Baayen, R. Piepenbrock, and L. Gulikers. 1995. The CELEX Lexical Database (release 2). Linguistic Data Consortium, University of Pennsylvania.

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Corpus version 1.1.

Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of ACL 2000, pages 286–293. Hong Kong.

Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu. 2007. Investigation and modeling of the structure of texting language. International Journal of Document Analysis and Recognition, 10(3/4):157–174.

Cédrick Fairon and Sébastien Paumier. 2006. A translated corpus of 30,000 French SMS. In Proceedings of LREC 2006. Genoa, Italy.

Rebecca E. Grinter and Margery A. Eldridge. 2001. y do tngrs luv 2 txt msg. In Proceedings of the 7th European Conference on Computer-Supported Cooperative Work (ECSCW '01), pages 219–238. Bonn, Germany.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. In Proceedings of NAACL-HLT 2007, pages 372–379. Rochester, NY.

Catherine Kobus, François Yvon, and Géraldine Damnati. 2008. Normalizing SMS: are two metaphors better than one? In Proceedings of the 22nd International Conference on Computational Linguistics, pages 441–448. Manchester.

Charles W. Kreidler. 1979. Creating new words by shortening. English Linguistics, 13:24–36.

Rich Ling and Naomi S. Baron. 2007. Text messaging and IM: Linguistic comparison of American college data. Journal of Language and Social Psychology, 26:291–298.

Eric Mays, Fred J. Damerau, and Robert L. Mercer. 1991. Context based spelling correction. Information Processing and Management, 27(5):517–522.

Richard Sproat, Alan W. Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. 2001. Normalization of non-standard words. Computer Speech and Language, 15:287–333.

Crispin Thurlow. 2003. Generation txt? The sociolinguistics of young people's text-messaging. Discourse Analysis Online, 1(1).

Kristina Toutanova and Robert C. Moore. 2002. Pronunciation modeling for improved spelling correction. In Proceedings of ACL 2002, pages 144–151. Philadelphia.