An Unsupervised Model For Text Message Normalization
need for applications such as translation and question answering.

We observe that many creative texting forms are the result of a small number of specific word formation processes. Rather than using a generic error model to capture all of them, we propose a mixture model in which each word formation process is modeled explicitly according to linguistic observations specific to that formation.

2 Analysis of Texting Forms

To better understand the creative processes present in texting language, we categorize the word formation process of each texting form in our development data, which consists of 400 texting forms paired with their standard forms.[4] Several iterations of categorization were done in order to determine sensible categories, and ensure categories were used consistently. Since this data is only to be used to guide the construction of our system, and not for formal evaluation, only one judge (a native English speaking author of this paper) categorized the expressions. The findings are presented in Table 1.

Formation type         Freq.   Example
Stylistic variation      152   betta (better)
Subseq. abbrev.          111   dng (doing)
Prefix clipping           24   hol (holiday)
Syll. letter/digit        19   neway (anyway)
G-clipping                14   talkin (talking)
Phonetic abbrev.          12   cuz (because)
H-clipping                10   ello (hello)
Spelling error             5   darliog (darling)
Suffix clipping            4   morrow (tomorrow)
Punctuation                3   b/day (birthday)
Unclear                   34   mobs (mobile)
Error                     12   gal (*girl)
Total                    400

Table 1: Frequency of texting forms in the development set by formation type.

Stylistic variations, by far the most frequent category, exhibit non-standard spelling, such as representing sounds phonetically. Subsequence abbreviations, also very frequent, are composed of a subsequence of the graphemes in a standard form, often omitting vowels. These two formation types account for approximately 66% of our development data; the remaining formation types are much less frequent. Prefix clippings and suffix clippings consist of a prefix or suffix, respectively, of a standard form, and in some cases a diminutive ending; we also consider clippings which omit just a g or h from a standard form as they are rather frequent.[5] A single letter or digit can be used to represent a syllable; we refer to these as syllabic (syll.) letter/digit. Phonetic abbreviations are variants of clippings and subsequence abbreviations where some sounds in the standard form are represented phonetically. Several texting forms appear to be spelling errors; we took the layout of letters on cell phone keypads into account when making this judgement. The items that did not fit within the above texting form categories were marked as unclear. Finally, for some expressions the given standard form did not appear to be appropriate. For example, girl is not the standard form for the texting form gal; rather, gal is an English word that is a colloquial form of girl. Such cases were marked as errors.

No texting forms in our development data correspond to multiple standard form words, e.g., wanna for want to.[6] Since such forms are not present in our development data, we assume that a texting form always corresponds to a single standard form word.

It is important to note that some texting forms have properties of multiple categories, e.g., bak (back) could be considered a stylistic variation or a subsequence abbreviation. In such cases, we simply attempt to assign the most appropriate category.

The design of our model for text message normalization, presented below, uses properties of the observed formation processes.

[4] Most texting forms have a unique standard form; however, some have multiple standard forms, e.g., will and well can both be shortened to wl. In such cases we choose the category of the most frequent standard form; in the case of frequency ties we choose arbitrarily among the categories of the standard forms.
[5] Thurlow (2003) also observes an abundance of g-clippings.
[6] A small number of similar forms, however, appear with a single standard form word, and are therefore marked as errors.
3 An Unsupervised Noisy Channel Model for Text Message Normalization

Let S be a sentence consisting of standard forms s_1 s_2 ... s_n; in this study the standard forms are regular English words. Let T be a sequence of texting forms t_1 t_2 ... t_n, which are the texting language realization of the standard forms, and may differ from the standard forms. Given a sequence of texting forms T, the challenge is then to determine the corresponding standard forms S.

Following Choudhury et al. (2007)—and various approaches to spelling error correction, such as, e.g., Mays et al. (1991)—we model text message normalization using a noisy channel. We want to find argmax_S P(S|T). We apply Bayes rule and ignore the constant term P(T), giving argmax_S P(T|S) P(S). Making the independence assumption that each t_i depends only on s_i, and not on the context in which it occurs, as in Choudhury et al., we express P(T|S) as a product of probabilities: argmax_S (∏_i P(t_i | s_i)) P(S).

We note in Section 2 that many texting forms are created through a small number of specific word formation processes. Rather than model each of these processes at once using a generic model for P(t_i | s_i), as in Choudhury et al., we instead create several such models, each corresponding to one of the observed common word formation processes. We therefore rewrite P(t_i | s_i) as ∑_{wf} P(t_i | s_i, wf) P(wf), where wf is a word formation process, e.g., subsequence abbreviation. Since, like Choudhury et al., we focus on the word model, we simplify our model as below.

    argmax_{s_i} ∑_{wf} P(t_i | s_i, wf) P(wf) P(s_i)

We next explain the components of the model, P(t_i | s_i, wf), P(wf), and P(s_i), referred to as the word model, word formation prior, and language model, respectively.
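For concreteness, the following sketch (in Python) shows how a texting form could be normalized under this model by scoring every lexicon entry; the function names and the word-model interface are illustrative assumptions, not part of the system described here.

    def rank_standard_forms(t, lexicon, word_models, p_lm, p_wf=None):
        """Rank candidate standard forms s for a texting form t under
        argmax_s sum_wf P(t|s, wf) P(wf) P(s)."""
        if p_wf is None:
            # Fully unsupervised setting: uniform word formation prior.
            p_wf = {wf: 1.0 / len(word_models) for wf in word_models}
        scores = {}
        for s in lexicon:
            # word_models maps a formation type wf to a function returning P(t|s, wf).
            mixture = sum(model(t, s) * p_wf[wf] for wf, model in word_models.items())
            scores[s] = mixture * p_lm(s)  # p_lm(s) is the unigram language model P(s)
        return sorted(scores, key=scores.get, reverse=True)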
3.1 Word Models

We now consider which of the word formation processes discussed in Section 2 should be captured with a word model P(t_i | s_i, wf). We model stylistic variations and subsequence abbreviations simply due to their frequency. We also choose to model prefix clippings since this word formation process is common outside of text messaging (Kreidler, 1979; Algeo, 1991) and fairly frequent in our data. Although g-clippings and h-clippings are moderately frequent, we do not model them, as these very specific word formations are also (non-prototypical) subsequence abbreviations. We do not model syllabic letters and digits, or punctuation, explicitly; instead, we simply substitute digits with a graphemic representation (e.g., 4 is replaced by for), and remove punctuation, before applying the model. The other less frequent formations—phonetic abbreviations, spelling errors, and suffix clippings—are not modeled; we hypothesize that the similarity of these formation processes to those we do model will allow the system to perform reasonably well on them.
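The digit and punctuation preprocessing mentioned above can be sketched as follows; only the substitution of 4 by for is attested in the text, and the remaining digit mappings are illustrative assumptions.

    import re

    # Hypothetical digit-to-grapheme substitutions; only "4" -> "for" is given
    # as an example above, the others are assumptions for illustration.
    DIGIT_GRAPHEMES = {"2": "to", "4": "for", "8": "ate"}

    def preprocess(texting_form):
        """Replace digits with a graphemic representation and strip punctuation."""
        for digit, graphemes in DIGIT_GRAPHEMES.items():
            texting_form = texting_form.replace(digit, graphemes)
        return re.sub(r"[^a-z\s]", "", texting_form.lower())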
3.1.1 Stylistic Variations

We propose a probabilistic version of edit-distance—referred to here as edit-probability—inspired by Brill and Moore (2000) to model P(t_i | s_i, stylistic variation). To compute edit-probability, we consider the probability of each edit operation—substitution, insertion, and deletion—instead of its cost, as in edit-distance. We then simply multiply the probabilities of edits as opposed to summing their costs.

In this version of edit-probability, we allow two-character edits. Ideally, we would compute the edit-probability of two strings as the sum of the edit-probability of each partitioning of those strings into one or two character segments. However, following Brill and Moore, we approximate this by the probability of the partition with maximum probability. This allows us to compute edit-probability using a simple adaptation of edit-distance, in which we consider edit operations spanning two characters at each cell in the chart maintained by the algorithm.

We then estimate two probabilities: P(g_t | g_s, pos) is the probability of texting form grapheme g_t given standard form grapheme g_s at position pos, where pos is the beginning, middle, or end of the word; P(h_t | p_s, h_s, pos) is the probability of texting form graphemes h_t given the standard form phonemes p_s and graphemes h_s at position pos. h_t, p_s, and h_s can be a single grapheme or phoneme, or a bigram.

We compute edit-probability between the graphemes of s_i and t_i. When filling each cell in the chart, we consider edit operations between segments of s_i and t_i of length 0–2, referred to as a and b, respectively. If a aligns with phonemes in s_i, we also consider those phonemes, p. In our lexicon, the graphemes and phonemes of each word are aligned according to the method of Jiampojamarn et al. (2007). For example, the alignment for without is given in Table 2. The probability of each edit operation is then determined by three properties—the length of a, whether a aligns with any phonemes in s_i, and if so, p—as shown below:

    |a| = 0 or 1, not aligned with s_i phonemes:  P(b | a, pos)
    |a| = 2, not aligned with s_i phonemes:       0
    |a| = 1 or 2, aligned with s_i phonemes:      P(b | p, a, pos)

graphemes   w   i   th   ou   t
phonemes    w   I   T    au   t

Table 2: Grapheme–phoneme alignment for without.
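A simplified sketch of the edit-probability computation follows. It implements the Viterbi-style adaptation of edit-distance described above, with a generic edit-probability function standing in for the P(b | a, pos) and P(b | p, a, pos) estimates (the phoneme-aligned case is folded into that function for brevity).

    def edit_probability(s, t, p_edit):
        """Probability of the most probable partition of s (standard form) and
        t (texting form) into aligned segments of length 0-2.
        p_edit(a, b, pos) stands in for the estimates described above."""
        n, m = len(s), len(t)
        chart = [[0.0] * (m + 1) for _ in range(n + 1)]
        chart[0][0] = 1.0
        for i in range(n + 1):
            for j in range(m + 1):
                if chart[i][j] == 0.0:
                    continue
                # Position of the edit within the standard form.
                pos = "beginning" if i == 0 else "end" if i == n else "middle"
                for di in range(3):        # segment a of s, length 0-2
                    for dj in range(3):    # segment b of t, length 0-2
                        if (di == 0 and dj == 0) or i + di > n or j + dj > m:
                            continue
                        p = chart[i][j] * p_edit(s[i:i + di], t[j:j + dj], pos)
                        if p > chart[i + di][j + dj]:
                            chart[i + di][j + dj] = p
        return chart[n][m]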
3.1.2 Subsequence Abbreviations

We model subsequence abbreviations according to the equation below:

    P(t_i | s_i, subseq abbrev) = c   if t_i is a subsequence of s_i
                                  0   otherwise

where c is a constant.

Note that this is similar to the error model for spelling correction presented by Mays et al. (1991), in which all words (in our terms, all s_i) within a specified edit-distance of the out-of-vocabulary word (t_i in our model) are given equal probability. The key difference is that in our formulation, we only consider standard forms for which the texting form is potentially a subsequence abbreviation.

In combination with the language model, P(t_i | s_i, subseq abbrev) assigns a non-zero probability to each standard form s_i for which t_i is a subsequence, according to the likelihood of s_i (under the language model). The models interact in this way since we expect a standard form to be recognizable relative to the other words for which t_i could be a subsequence abbreviation.
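The subsequence test underlying this word model can be sketched as follows (the function names are illustrative). For example, is_subsequence('dng', 'doing') holds, so doing receives probability c for the texting form dng.

    def is_subsequence(t, s):
        """True if the texting form t is a subsequence of the standard form s."""
        remaining = iter(s)
        return all(ch in remaining for ch in t)

    def p_subseq_abbrev(t, s, c):
        """P(t | s, subseq abbrev): the constant c if t is a subsequence of s,
        and 0 otherwise."""
        return c if is_subsequence(t, s) else 0.0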
3.1.3 Prefix Clippings

We model prefix clippings similarly to subsequence abbreviations.

    P(t_i | s_i, prefix clipping) = c   if t_i is a possible prefix clipping of s_i
                                    0   otherwise

Kreidler (1979) observes that clippings tend to be mono-syllabic and end in a consonant. Furthermore, when they do end in a vowel, it is often of a regular form, such as telly for television and breaky for breakfast. We therefore only consider P(t_i | s_i, prefix clipping) if t_i is a prefix clipping according to the following heuristics: t_i is monosyllabic after stripping any word-final vowels, and subsequently removing duplicated word-final consonants (e.g., telly becomes tel, which is a candidate prefix clipping). If t_i is not a prefix clipping according to these criteria, P(t_i | s_i) simply sums over all models except prefix clipping.
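One possible reading of this heuristic is sketched below. The monosyllabicity test is a rough approximation that counts groups of adjacent vowels, and word-final y is treated as a vowel so that telly reduces to tel as in the example above; the text does not spell out these details.

    import re

    FINAL_VOWELS = "aeiouy"  # y treated as a vowel so that telly -> tell -> tel

    def clipping_stem(t):
        """Strip word-final vowels, then collapse a duplicated final consonant."""
        stem = t.rstrip(FINAL_VOWELS)
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in FINAL_VOWELS:
            stem = stem[:-1]
        return stem

    def is_candidate_prefix_clipping(t, s):
        """One reading of the heuristic: the stem of t is (roughly) monosyllabic
        and is a prefix of the candidate standard form s."""
        stem = clipping_stem(t)
        monosyllabic = len(re.findall(r"[aeiou]+", stem)) <= 1  # rough approximation
        return monosyllabic and s.startswith(stem)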
3.2 Word Formation Prior

Keeping with our goal of an unsupervised method, we estimate P(wf) with a uniform distribution. We also consider estimating P(wf) using maximum likelihood estimates (MLEs) from our observations in Section 2. This gives a model that is not fully unsupervised, since it relies on labelled training data. However, we consider this a lightly-supervised method, since it only requires an estimate of the frequency of the relevant word formation types, and not labelled texting form–standard form pairs.

3.3 Language Model

Choudhury et al. (2007) find that using a bigram language model estimated over a balanced corpus of English had a negative effect on their results compared with a unigram language model, which they attribute to the unique characteristics of text messaging that were not reflected in the corpus. We therefore use a unigram language model for P(s_i), which also enables comparison with their results. Nevertheless, alternative language models, such as higher order ngram models, could easily be used in place of our unigram language model.

4 Materials and Methods

4.1 Datasets

We use the data provided by Choudhury et al. (2007) which consists of texting forms—extracted from a collection of 900 text messages—and their manually determined standard forms. Our development data—used for model development and discussed in Section 2—consists of the 400 texting form types that are not in Choudhury et al.'s held-out test set, and that are not the same as one of their standard forms. The test data consists of 1213 texting forms and their corresponding standard forms. A subset of 303 of these texting forms differ from their standard form.[7] This subset is the focus of this study, but we also report results on the full dataset.

[7] Choudhury et al. report that this dataset contains 1228 texting forms. We found it to contain 1213 texting forms corresponding to 1228 standard forms (recall that a texting form may have multiple standard forms). There were similar inconsistencies with the subset of texting forms that differ from their standard forms. Nevertheless, we do not expect these small differences to have an appreciable effect on the results.

4.2 Lexicon

We construct a lexicon of potential standard forms such that it contains most words that we expect to encounter in text messages, yet is not so large as to make it difficult to identify the correct standard form. Our subjective analysis of the standard forms in the development data is that they are frequent, non-specialized, words. To reflect this observation, we create a lexicon consisting of all single-word entries containing only alphabetic characters found in both the CELEX Lexical Database (Baayen et al., 1995) and the CMU Pronouncing Dictionary.[8] We remove all words of length one (except a and I) to avoid choosing, e.g., the letter r as the standard form for the texting form r. We further limit the lexicon to words in the 20K most frequent alphabetic unigrams, ignoring case, in the Web 1T 5-gram Corpus (Brants and Franz, 2006). The resulting lexicon contains approximately 14K words, and excludes only three of the standard forms—cannot, email, and online—for the 400 development texting forms.

[8] http://www.speech.cs.cmu.edu/cgi-bin/cmudict
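The lexicon construction just described can be sketched as follows, with the arguments standing in for the CELEX entries, the CMU Pronouncing Dictionary entries, and case-folded Web 1T unigram counts (none of which are reproduced here).

    def build_lexicon(celex_words, cmudict_words, unigram_counts, top_n=20000):
        """Intersect CELEX and CMU dictionary entries, drop one-letter words other
        than 'a' and 'i', and keep only words among the top_n most frequent
        alphabetic unigrams (unigram_counts is assumed to be case-folded)."""
        candidates = ({w.lower() for w in celex_words if w.isalpha()} &
                      {w.lower() for w in cmudict_words if w.isalpha()})
        candidates = {w for w in candidates if len(w) > 1 or w in {"a", "i"}}
        frequent = sorted((w for w in unigram_counts if w.isalpha()),
                          key=unigram_counts.get, reverse=True)[:top_n]
        return candidates & set(frequent)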
4.3 Model Parameter Estimation

MLEs for P(g_t | g_s, pos)—needed to estimate P(t_i | s_i, stylistic variation)—could be estimated from texting form–standard form pairs. However, since our system is unsupervised, no such data is available. We therefore assume that many texting forms, and other similar creative shortenings, occur on the web. We develop a number of character substitution rules, e.g., s ⇒ z, and use them to create hypothetical texting forms from standard words. We then compute MLEs for P(g_t | g_s, pos) using the frequencies of these derived forms on the web.

We create the substitution rules by examining examples in the development data, considering fast speech variants and dialectal differences (e.g., voicing), and drawing on our intuition. The derived forms are produced by applying the substitution rules to the words in our lexicon. To avoid considering forms that are themselves words, we eliminate any form found in a list of approximately 480K words taken from SOWPODS[9] and the Moby Word Lists.[10] Finally, we obtain the frequency of the derived forms from the Web 1T 5-gram Corpus.

[9] http://en.wikipedia.org/wiki/SOWPODS
[10] http://icon.shef.ac.uk/Moby/

To estimate P(h_t | p_s, h_s, pos), we first estimate two simpler distributions: P(h_t | h_s, pos) and P(h_t | p_s, pos). P(h_t | h_s, pos) is estimated in the same manner as P(g_t | g_s, pos), except that two character substitutions are allowed. P(h_t | p_s, pos) is estimated from the frequency of p_s, and its alignment with h_t, in a version of CELEX in which the graphemic and phonemic representation of each word is many–many aligned using the method of Jiampojamarn et al. (2007).[11] P(h_t | p_s, h_s, pos) is then an evenly-weighted linear combination of P(h_t | h_s, pos) and P(h_t | p_s, pos). Finally, we smooth each of P(g_t | g_s, pos) and P(h_t | p_s, h_s, pos) using add-alpha smoothing.

[11] We are very grateful to Sittichai Jiampojamarn for providing this alignment.

We set the constant c in our word models for subsequence abbreviations and prefix clippings such that ∑_{s_i} P(t_i | s_i, wf) P(s_i) = 1. We similarly normalize P(t_i | s_i, stylistic variation) P(s_i).

We use the frequency of unigrams (ignoring case) in the Web 1T 5-gram Corpus to estimate our language model. We expect the language of text messaging to be more similar to that found on the web than that in a balanced corpus of English.

4.4 Evaluation Metrics

To evaluate our system, we consider three accuracy metrics: in-top-1, in-top-10, and in-top-20.[12] In-top-n considers the system correct if a correct standard form is in the n most probable standard forms. The in-top-1 accuracy shows how well the system determines the correct standard form; the in-top-10 and in-top-20 accuracies may be indicative of the usefulness of the output of our system in other tasks which could exploit a ranked list of standard forms, such as machine translation.

[12] These are the same metrics used by Choudhury et al. (2007), although we refer to them by different names.
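The in-top-n metrics can be computed as in the following sketch, assuming a ranked candidate list for each texting form and, possibly, multiple correct standard forms per texting form.

    def in_top_n_accuracy(predictions, gold, n):
        """predictions: texting form -> ranked list of candidate standard forms;
        gold: texting form -> set of correct standard forms."""
        hits = sum(1 for t, ranked in predictions.items()
                   if gold[t] & set(ranked[:n]))
        return hits / len(predictions)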
Model               % accuracy
                    Top-1   Top-10   Top-20
Uniform              59.4     83.8     87.8
MLE                  55.4     84.2     86.5
Choudhury et al.     59.9     84.3     88.7

Table 3: % in-top-1, in-top-10, and in-top-20 accuracy on test data using both estimates for P(wf). The results reported by Choudhury et al. (2007) are also shown.
5 Results and Discussion

In Table 3 we report the results of our system using both the uniform estimate and the MLE of P(wf). Note that there is no meaningful random baseline to compare against here; randomly ordering the 14K words in our lexicon gives very low accuracy. The results using the uniform estimate of P(wf)—a fully unsupervised system—are very similar to the supervised results of Choudhury et al. (2007). Surprisingly, when we estimate P(wf) using MLEs from the development data—resulting in a lightly-supervised system—the results are slightly worse than when using the uniform estimate of this probability. Moreover, we observe the same trend on development data where we expect to have an accurate estimate of P(wf) (results not shown). We hypothesize that the ambiguity of the categories of texting forms (see Section 2) results in poor MLEs for P(wf), thus making a uniform distribution, and hence fully-unsupervised approach, more appropriate.

Results by Formation Type   We now consider in-top-1 accuracy for each word formation type, in Table 4. We show results for the same word formation processes as in Table 1, except for h-clippings and punctuation, as no words of these categories are present in the test data. We present results using the same experimental setup as before with a uniform estimate of P(wf) (All), and using just the model corresponding to the word formation process (Specific), where applicable.[13]

[13] In this case our model then becomes, for each word formation process wf, argmax_{s_i} P(t_i | s_i, wf) P(s_i).

Formation type        Freq.        % in-top-1 acc.
                      (n = 303)    Specific    All
Stylistic variation     121            62.8   67.8
Subseq. abbrev.          65            56.9   46.2
Prefix clipping          25            44.0   20.0
G-clipping               56               -   91.1
Syll. letter/digit       16               -   50.0
Unclear                  12               -    0.0
Spelling error            5               -   80.0
Suffix clipping           1               -    0.0
Phonetic abbrev.          1               -    0.0
Error                     1               -    0.0

Table 4: Frequency (Freq.), and % in-top-1 accuracy using the formation-specific model where applicable (Specific) and all models (All) with a uniform estimate for P(wf), presented by formation type.

We first examine the top panel of Table 4, where we compare the performance on each word formation type for both experimental conditions (Specific and All). We first note that the performance using the formation-specific model on subsequence abbreviations and prefix clippings is better than that of the overall model. This is unsurprising since we expect that when we know a texting form's formation process, and invoke a corresponding specific model, our system should outperform a model designed to handle a range of formation types. However, this is not the case for stylistic variations; here the overall model performs better than the specific model. We observed in Section 2 that some texting forms do not fit neatly into our categorization scheme; indeed, many stylistic variations are also analyzable as subsequence abbreviations. Therefore, the subsequence abbreviation model may benefit normalization of stylistic variations. This model, used in isolation on stylistic variations, gives an in-top-1 accuracy of 33.1%, indicating that this may be the case.

Comparing the performance of the individual word models on only word types that they were designed for (column Specific in Table 4), we see that the prefix clipping model is by far the lowest, indicating that in the future we should consider ways of improving this word model. One possibility is to incorporate phonemic knowledge. For example, both friday and friend have the same probability under P(t_i | s_i, prefix clipping) for the texting form fri, which has the standard form friday in our data. (The language model, however, does distinguish between these forms.) However, if we consider the phonemic representations of these words, friday might emerge as more likely. Syllable structure information may also be useful, as we hypothesize that clippings will tend to be formed by truncating a word at a syllable boundary. We may similarly be able to improve our estimate of P(t_i | s_i, subseq abbrev). For example, both text and taxation have the same probability under this distribution, but intuitively text, the correct standard form in our data, seems more likely. We could incorporate knowledge about the likelihood of omitting specific characters, as in Choudhury et al. (2007), to improve this estimate.

We now examine the lower panel of Table 4, in which we consider the performance of the overall model on the word formation types that are not explicitly modeled. The very high accuracy on g-clippings indicates that since these forms are also a type of subsequence abbreviation, we do not need to construct a separate model for them. We in fact also conducted experiments in which g-clippings and h-clippings were modeled explicitly, but found these extra models to have little effect on the results.

Recall from Section 3.1 our hypothesis that suffix clippings, spelling errors, and phonetic abbreviations have common properties with formation types that we do model, and therefore the system will perform reasonably well on them. Here we find preliminary evidence to support this hypothesis as the accuracy on these three word formation types (combined) is 57.1%. However, we must interpret this result cautiously as it only considers seven expressions. On the syllabic letter and digit texting forms the accuracy is 50.0%, indicating that our heuristic to replace digits in texting forms with an orthographic representation is reasonable.

The performance on types of expressions that we did not consider when designing the system—unclear and error—is very poor. However, this has little impact on the overall performance as these expressions are rather infrequent.

Results by Model   We now consider in-top-1 accuracy using each model on the 303 test expressions; results are shown in Table 5.

Model                  % in-top-1 accuracy
Stylistic variation                   51.8
Subseq. abbrev.                       44.2
Prefix clipping                       10.6

Table 5: % in-top-1 accuracy on the 303 test expressions using each model individually.

No model on its own gives results comparable to those of the overall model (59.4%, see Table 3). This indicates that the overall model successfully combines information from the specific word formation models.

Each model used on its own gives an accuracy greater than the proportion of expressions of the word formation type for which the model was designed (compare accuracies in Table 5 to the number of expressions of the corresponding word formation type in the test data in Table 4). As we note in Section 2, the distinctions between the word formation types are not sharp; these results show that the shared properties of word formation types enable a model for a specific formation type to infer the standard form of texting forms of other formation types.

All Unseen Data   Until now we have discussed results on our test data of 303 texting forms which differ from their standard forms. We now consider the performance of our system on all 1213 unseen texting forms, 910 of which are identical to their standard form. Since our model was not designed with such expressions in mind, we slightly adapt it for this new task; if t_i is in our lexicon, we return that form as s_i, otherwise we apply our model as usual, using the uniform estimate of P(wf). This gives an in-top-1 accuracy of 88.2%, which is very similar to the results of Choudhury et al. (2007) on this data of 89.1%. Note, however, that Choudhury et al. only report results on this dataset using a uniform language model;[14] since we use a unigram language model, it is difficult to draw firm conclusions about the performance of our system relative to theirs.

[14] Choudhury et al. do use a unigram language model for their experiments on the 303 texting forms which differ from their standard forms (see Section 3.3).
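The adaptation described above amounts to a thin wrapper around the ranking model, as in the following sketch (rank stands in for any ranking function, such as rank_standard_forms sketched in Section 3).

    def normalize(t, lexicon, rank):
        """Return t itself when it is already in the lexicon; otherwise fall back
        on the ranked candidates from the model."""
        return [t] if t in lexicon else rank(t)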
6 Related Work

Aw et al. (2006) model text message normalization as translation from the texting language into the standard language. Kobus et al. (2008) incorporate ideas from both machine translation and automatic speech recognition for text message normalization. However, both of these approaches are supervised, and have only limited means for normalizing texting forms that do not occur in the training data.

Our work, like that of Choudhury et al. (2007), can be viewed as a noisy-channel model for spelling error correction (e.g., Mays et al., 1991; Brill and Moore, 2000), in which texting forms are seen as a kind of spelling error. Furthermore, like our approach to text message normalization, approaches to spelling correction have incorporated phonemic information (Toutanova and Moore, 2002).

The word model of the supervised approach of Choudhury et al. consists of hidden Markov models, which capture properties of texting language similar to those of our stylistic variation model. We propose multiple word models—corresponding to frequent texting language formation processes—and an unsupervised method for parameter estimation.

7 Conclusions

We analyze a sample of texting forms to determine frequent word formation processes in creative texting language. Drawing on these observations, we construct an unsupervised noisy-channel model for text message normalization. On an unseen test set of 303 texting forms that differ from their standard form, our model achieves 59% accuracy, which is on par with that obtained by the supervised approach of Choudhury et al. (2007) on the same data.

More research is required to determine the impact of our normalization method on the performance of a system that further processes the resulting text. In the future, we intend to improve our word models by incorporating additional linguistic knowledge, such as information about syllable structure. Since context likely plays a role in human interpretation of texting forms, we also intend to examine the performance of higher order ngram language models.

Acknowledgements

This work is financially supported by the Natural Sciences and Engineering Research Council of Canada, the University of Toronto, and the Dictionary Society of North America.

References

John Algeo, editor. 1991. Fifty Years Among the New Words. Cambridge University Press, Cambridge.

AiTi Aw, Min Zhang, Juan Xiao, and Jian Su. 2006. A phrase-based statistical model for SMS text normalization. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 33–40. Sydney.

R. H. Baayen, R. Piepenbrock, and L. Gulikers. 1995. The CELEX Lexical Database (release 2). Linguistic Data Consortium, University of Pennsylvania.

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Corpus version 1.1.

Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of ACL 2000, pages 286–293. Hong Kong.

Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu. 2007. Investigation and modeling of the structure of texting language. International Journal of Document Analysis and Recognition, 10(3/4):157–174.

Cédrick Fairon and Sébastien Paumier. 2006. A translated corpus of 30,000 French SMS. In Proceedings of LREC 2006. Genoa, Italy.

Rebecca E. Grinter and Margery A. Eldridge. 2001. y do tngrs luv 2 txt msg. In Proceedings of the 7th European Conference on Computer-Supported Cooperative Work (ECSCW '01), pages 219–238. Bonn, Germany.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. In Proceedings of NAACL-HLT 2007, pages 372–379. Rochester, NY.

Catherine Kobus, François Yvon, and Géraldine Damnati. 2008. Normalizing SMS: are two metaphors better than one? In Proceedings of the 22nd International Conference on Computational Linguistics, pages 441–448. Manchester.

Charles W. Kreidler. 1979. Creating new words by shortening. English Linguistics, 13:24–36.

Rich Ling and Naomi S. Baron. 2007. Text messaging and IM: Linguistic comparison of American college data. Journal of Language and Social Psychology, 26:291–298.

Eric Mays, Fred J. Damerau, and Robert L. Mercer. 1991. Context based spelling correction. Information Processing and Management, 27(5):517–522.

Richard Sproat, Alan W. Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. 2001. Normalization of non-standard words. Computer Speech and Language, 15:287–333.

Crispin Thurlow. 2003. Generation txt? The sociolinguistics of young people's text-messaging. Discourse Analysis Online, 1(1).

Kristina Toutanova and Robert C. Moore. 2002. Pronunciation modeling for improved spelling correction. In Proceedings of ACL 2002, pages 144–151. Philadelphia.