A Phrase-Based Statistical Model For SMS Text Normalization
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 33–40,
Sydney, July 2006. © 2006 Association for Computational Linguistics
a consensus translation technique to bootstrap parallel data using off-the-shelf translation systems for training a hierarchical statistical translation model for general-domain instant messaging used in Internet chat rooms. Their method deals with the special phenomena of the instant messaging language (rather than the SMS language) in each individual MT system. Clark (2003) proposed to unify the process of tokenization, segmentation and spelling correction for the normalization of general noisy text (rather than SMS or instant messaging texts) based on a noisy channel model at the character level. However, results of the normalization are not reported. Aw et al. (2005) gave a brief description of their input pre-processing work for an English-to-Chinese SMS translation system using a word-group model. In addition, most commercial SMS translation applications² provide an SMS lingo (i.e., SMS short form) dictionary to replace SMS short-forms with normal English words. Most of these systems do not handle OOV (out-of-vocabulary) items and ambiguous inputs. The following compares SMS text normalization with other similar or related applications.

2.1 SMS Normalization versus General Text Normalization

General text normalization deals with Non-Standard Words (NSWs) and has been well studied in text-to-speech (Sproat et al., 2001), while SMS normalization deals with Non-Words (NWs) or lingoes and has seldom been studied before. NSWs, such as digit sequences, acronyms, mixed-case words (WinNT, SunOS), abbreviations and so on, are grammatically correct in linguistics. However, lingoes, such as "b4" (before) and "bf" (boyfriend), which are usually self-created and only accepted by young SMS users, are not yet formalized in linguistics. Therefore, the special phenomena in SMS texts pose a big challenge to SMS normalization.

2.2 SMS Normalization versus Spelling Correction Problem

Intuitively, many would regard SMS normalization as a spelling correction problem where the lingoes are erroneous words or non-words to be replaced by English words. Research on spelling correction centers on typographic and cognitive/orthographic errors (Kukich, 1992) and uses approaches (Kernighan, Church and Gale, 1990) that mostly model the edit operations using distance measures (Damerau, 1964; Levenshtein, 1966), specific word-set confusions (Golding and Roth, 1999) and pronunciation modeling (Brill and Moore, 2000; Toutanova and Moore, 2002). These models are mostly character-based or string-based without considering the context. In addition, the author might not be aware of the errors introduced into the word during the edit operations, as most errors are due to mistyping characters near each other on the keyboard, or to homophones such as "poor" and "pour".

In SMS, errors are not isolated within a word and are usually not surrounded by clean context. Words are altered deliberately to reflect the sender's distinct creation and idiosyncrasies. A character can be deleted on purpose, as in "wat" (what) and "hv" (have). SMS text also contains short-forms such as "b4" (before) and "bf" (boyfriend). In addition, normalizing SMS text might require the context to span more than one lexical unit, as in "lemme" (let me), "ur" (you are), etc. Therefore, the models used in spelling correction are inadequate for providing a complete solution for SMS normalization.

2.3 SMS Normalization versus Text Paraphrasing Problem

Others may regard SMS normalization as a paraphrasing problem. Broadly speaking, paraphrases capture core aspects of variability in language by representing equivalencies between different expressions that correspond to the same meaning. In most recent works (Barzilay and McKeown, 2001; Shimohata, 2002), they are acquired (semi-)automatically from large comparable or parallel corpora using lexical and morpho-syntactic information.

Text paraphrasing works on clean texts from which contextual and lexical-syntactic features can be extracted and used to find "approximate conceptual equivalence". In SMS normalization, we are dealing with non-words and "ungrammatical" sentences, with the purpose of normalizing or standardizing these words and forming better sentences. The SMS normalization problem is thus different from text paraphrasing. On the other hand, it bears some similarities with MT, as we are trying to "convert" text from one language to another. However, it is a simpler problem as, most of the time, we can find the same word in both the source and target text, making alignment easier.
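The limitation of pure edit-distance models noted in Section 2.2 can be made concrete. Below is a minimal sketch of standard dynamic-programming Levenshtein distance (not code from the paper); it shows that a deliberate SMS short-form is "far" from its normal form, while a genuine typo is close, so distance alone misranks lingo candidates:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (Levenshtein, 1966) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# The routine SMS mapping "b4" -> "before" costs 5 edits, while the real
# typo "poor" vs. "pour" costs only 1, the opposite of their usefulness
# as spelling-correction candidates.
print(edit_distance("b4", "before"))   # 5
print(edit_distance("poor", "pour"))   # 1
```

This is why character-level models built for accidental typos transfer poorly to intentional SMS short-forms.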
² http://www.etranslator.ro and http://www.transl8bit.com
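The dictionary-based replacement used by such commercial applications can be sketched as follows. The tiny lingo dictionary here is invented for illustration; the point is that context-free look-up must commit to a single expansion for ambiguous short-forms such as "2", and passes OOV tokens through untouched:

```python
# Hypothetical lingo dictionary; real products ship much larger lists.
LINGO = {
    "b4": ["before"],
    "bf": ["boyfriend"],
    "2": ["to", "too", "two"],   # ambiguous: look-up alone cannot choose
    "u": ["you"],
}

def dict_normalize(msg: str) -> str:
    """Replace each token by its first dictionary entry; OOV tokens pass through."""
    return " ".join(LINGO.get(tok, [tok])[0] for tok in msg.split())

# "c" is untouched (OOV in this toy dictionary) and "2" is blindly
# mapped to "to", even where "two" is the correct reading.
print(dict_normalize("i will c u b4 2 pm"))
```

Resolving such ambiguity is what the statistical models of Section 4 are designed to do.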
3 Characteristics of English SMS

Our corpus consists of 55,000 messages collected from two sources, an SMS chat room and correspondence between university students. The content is mostly related to football matches, making friends and casual conversations on "how, what and where about". We summarize the text behaviors into two categories, as below.

3.1 Orthographic Variation

The most significant orthographic variant in SMS texts is the use of non-standard, self-created short-forms. Usually, the sender takes advantage of phonetic spellings, initial letters or number homophones to mimic spoken conversation or to shorten words or phrases (hw vs. homework or how, b4 vs. before, cu vs. see you, 2u vs. to you, oic vs. oh I see, etc.) in an attempt to minimize key strokes. In addition, senders create new forms of written representation to express their oral utterances. Emoticons, such as ":(" symbolizing sad, ":)" symbolizing smiling and ":()" symbolizing shocked, are representations of body language. Verbal effects such as "hehe" for laughter, and emphatic discourse particles such as "lor", "lah" and "meh" for colloquial English, are prevalent in the text collection.

The loss of "alpha-case" information poses another challenge in lexical disambiguation and introduces difficulty in identifying sentence boundaries, proper nouns and acronyms. With the flexible use of punctuation, or no punctuation at all, translation of SMS messages without prior processing is even more difficult.

3.2 Grammar Variation

SMS messages are short, concise and convey much information within the limited space quota (160 letters for English); thus they tend to be implicit and influenced by pragmatic and situational reasons. These inadequacies of language expression, such as deletion of articles and subject pronouns, as well as problems in number agreement or tense, make SMS normalization more challenging. Table 1 illustrates some orthographic and grammar variations of SMS texts.

3.3 Corpus Statistics

We investigate the corpus to assess the feasibility of replacing the lingoes with normal English words and performing limited adjustment to the text structure. Similar to Aw et al. (2005), we focus on the three major cases of transformation as shown in the corpus: (1) replacement of OOV words and non-standard SMS lingoes; (2) removal of slang; and (3) insertion of auxiliary or copula verbs and subject pronouns.

Phenomena | Messages
1. Dropping "?" at the end of a question | btw, wat is ur view (By the way, what is your view?)
2. Not using any punctuation at all | Eh speak english mi malay not tt good (Eh, speak English! My Malay is not that good.)
3. Using spelling/punctuation for emphasis | goooooood Sunday morning !!!!!! (Good Sunday morning!)
4. Using phonetic spelling | dat iz enuf (That is enough)
5. Dropping vowels | i hv cm to c my luv. (I have come to see my love.)
6. Introducing local flavor | yar lor where u go juz now (yes, where did you go just now?)
7. Dropping verbs | I hv 2 go. Dinner w parents. (I have to go. Have dinner with parents.)
Table 1. Examples of SMS Messages

Transformation | Percentage (%)
Insertion | 8.09
Deletion | 5.48
Substitution | 86.43
Table 2. Distribution of Insertion, Deletion and Substitution Transformations

Substitution | Deletion | Insertion
u → you | m | are
2 → to | lah | am
n → and | t | is
r → are | ah | you
ur → your | leh | to
dun → don't | 1 | do
man → manchester | huh | a
no → number | one | in
intro → introduce | lor | yourself
wat → what | ahh | will
Table 3. Top 10 Most Common Substitutions, Deletions and Insertions

Table 2 shows the statistics of these transformations based on 700 randomly selected messages, of which 621 (88.71%) required
normalization, with a total of 2,300 transformations. Substitution accounts for almost 86% of all transformations; deletion and insertion make up the rest. Table 3 shows the top 10 most common transformations.

4 SMS Normalization

We view the SMS language as a variant of the English language with some deviations in vocabulary and grammar. Therefore, we can treat SMS normalization as an MT problem where the SMS language is to be translated to normal English. We thus propose to adapt the statistical machine translation model (Brown et al., 1993; Zens and Ney, 2004) for SMS text normalization. In this section, we discuss the three components of our method: modeling, training and decoding for SMS text normalization.

4.1 Basic Word-based Model

The SMS normalization model is based on the source channel model (Shannon, 1948). Assuming that an English sentence e of length N is "corrupted" by a noisy channel to produce an SMS message s of length M, the English sentence e could be recovered through a posteriori distribution for a channel target text given the source text, P(s | e), and a prior distribution for the channel source text, P(e):

ê_1^N = argmax_{e_1^N} { P(e_1^N | s_1^M) } = argmax_{e_1^N} { P(s_1^M | e_1^N) · P(e_1^N) }    (1)

Assuming an alignment A between the SMS words and the English words, the channel model can be approximated as

P(s_1^M | e_1^N) = ∑_A P(s_1^M, A | e_1^N) ≈ ∑_A ∏_{m=1}^{M} { P(a_m | a_{m-1}) · P(s_m | e_{a_m}) }    (2)

If we include the word "null" in the English vocabulary, the above model can fully address the deletion and substitution transformations, but it is inadequate to address the insertion transformation. For example, the lingoes "duno" and "ysnite" have to be normalized using an insertion transformation to become "don't know" and "yesterday night". Moreover, we also want the normalization to have better lexical affinity and linguistic equivalence; thus we extend the model to allow many-to-many word alignment, allowing a sequence of SMS words to be normalized to a sequence of contiguous English words. We call this updated model a phrase-based normalization model.

4.2 Phrase-based Model

Given an English sentence e and an SMS sentence s, if we assume that e can be decomposed into K phrases with a segmentation T, such that each phrase e_k in e corresponds to one phrase s_k in s, we have e_1^N = e_1 … e_k … e_K and s_1^M = s_1 … s_k … s_K. The channel model can be rewritten as equation (3):

P(s_1^M | e_1^N) = ∑_T P(s_1^M, T | e_1^N)
                = ∑_T P(T | e_1^N) · P(s_1^M | T, e_1^N)
                = ∑_T P(T | e_1^N) · P(s_1^K | e_1^K)
                ≈ ∑_A ∏_{k=1}^{K} { P(a_k | a_{k-1}) · P(s_k | e_{a_k}) }    (3)

We are now able to model the three transformations through the normalization pair (s_k, e_{a_k}), with the mapping probability P(s_k | e_{a_k}). The following shows the scenarios in which the three transformations occur:

Insertion: |s_k| < |e_{a_k}|

[…]

Finally, the SMS normalization model consists of two sub-models: a word-based language model (LM), characterized by P(e_n | e_{n-1}), and a phrase-based lexical mapping model (channel model), characterized by P(s_k | e_k).
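A toy rendering may make the two sub-models concrete. All probabilities below are invented for illustration (log10); this is not the paper's implementation, but a sketch in the spirit of equations (1)-(3): it enumerates monotone phrase segmentations of the SMS input and scores each candidate with the channel model P(s_k | e_k) plus a bigram LM P(e_n | e_{n-1}):

```python
import math

# Hypothetical phrase-based lexical mapping model P(s_k | e_k), log10.
PHRASE = {
    "w r u": [("who are you", -0.48)],
    "w": [("with", -0.76), ("who", -1.84)],
    "r": [("are", -0.20)],
    "u": [("you", -0.10)],
}

# Hypothetical bigram LM P(e_n | e_{n-1}), also log10.
BIGRAM = {
    ("<s>", "who"): -0.5, ("<s>", "with"): -1.0,
    ("who", "are"): -0.3, ("with", "are"): -2.5,
    ("are", "you"): -0.2,
}
OOV_LOGPROB = -2.0  # crude constant back-off for unseen bigrams

def normalize(sms: str):
    """Best English normalization of `sms` and its log10 score, maximizing
    P(s | e) * P(e) over monotone phrase segmentations (assumes the phrase
    table covers the input)."""
    toks = sms.split()
    n = len(toks)

    def lm_score(prev, words):
        total = 0.0
        for w in words:
            total += BIGRAM.get((prev, w), OOV_LOGPROB)
            prev = w
        return total, prev

    def search(i, prev):
        # Best continuation for toks[i:], given the previous English word.
        if i == n:
            return 0.0, []
        best_score, best_words = -math.inf, None
        for j in range(i + 1, n + 1):
            for e_phrase, chan in PHRASE.get(" ".join(toks[i:j]), []):
                words = e_phrase.split()
                lm, last = lm_score(prev, words)
                tail_score, tail = search(j, last)
                if tail is None:
                    continue  # toks[j:] not covered by the phrase table
                if chan + lm + tail_score > best_score:
                    best_score, best_words = chan + lm + tail_score, words + tail
        return best_score, best_words

    score, words = search(0, "<s>")
    return " ".join(words), score

# The whole-phrase mapping beats composing "w"->"who", "r"->"are", "u"->"you".
print(normalize("w r u"))  # best candidate: "who are you"
```

The segmentation-dependent channel score is what lets a multi-word pair like ("w r u", "who are you") outscore a word-by-word composition of the same output.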
[…] Orthographic similarities captured by edit distance, and an SMS lingo dictionary³ which contains the commonly used short-forms, are first used to establish phrase-mapping boundary candidates. Heuristics are then exploited to match tokens within the pairs of boundary candidates, by trying to combine consecutive tokens within the boundary candidates if the numbers of tokens do not agree.

³ The entries are collected from various websites such as http://www.handphones.info/sms-dictionary/sms-lingo.php and http://www.funsms.net/sms_dictionary.htm.

Finally, a filtering process is carried out to manually remove the low-frequency noisy alignment pairs. Table 4 shows some of the extracted normalization pairs. As can be seen from the table, our algorithm automatically discovers ambiguous mappings that are otherwise missing from most lingo dictionaries.

(s, e) | log P(s | e)
(2, 2) | 0
(2, to) | -0.579466
(2, too) | -0.897016
(2, null) | -2.97058
(4, 4) | 0
(4, for) | -0.431364
(4, null) | -3.27161
(w, who are) | -0.477121
(w, with) | -0.764065
(w, who) | -1.83885
(dat, that) | -0.726999
(dat, date) | -0.845098
(tmr, tomorrow) | -0.341514
Table 4. Examples of normalization pairs

Given the phrase-aligned SMS corpus, the lexical mapping model, characterized by P(s_k | e_k), is easily trained using equation (6). Our n-gram LM P(e_n | e_{n-1}) is trained on English Gigaword, provided by LDC, using the SRILM language modeling toolkit (Stolcke, 2002). Backoff smoothing (Jelinek, 1991) is used to adjust and assign a non-zero probability to unseen words, to address data sparseness.

4.4 Monotone Search

Given an input s, the search, characterized in equation (7), is to find a sentence e that maximizes P(s | e) · P(e) using the normalization model. In this paper, the maximization problem in equation (7) is solved using a monotone search, implemented as a Viterbi search through dynamic programming.

5 Experiments

The aim of our experiments is to verify the effectiveness of the proposed statistical model for SMS normalization, and the impact of SMS normalization on MT.

A set of 5,000 parallel SMS messages, which consists of raw (un-normalized) SMS messages and reference messages manually prepared by two project members with inter-normalization agreement checked, was prepared for training and testing. For evaluation, we use IBM's BLEU score (Papineni et al., 2002) to measure the performance of SMS normalization. The BLEU score, which is already widely used in MT evaluation, measures the similarity between two sentences using n-gram statistics with a penalty for overly short sentences.

Setup | BLEU score (3-gram)
Raw SMS without Normalization | 0.5784
Dictionary Look-up plus Frequency | 0.6958
Bi-gram Language Model Only | 0.7086
Table 5. Performance of different setups of the baseline experiments on the 5000 parallel SMS messages

5.1 Baseline Experiments: Simple SMS Lingo Dictionary Look-up and Using Language Model Only

The baseline experiment is to moderate the texts using a lingo dictionary comprising 142 normalization pairs, which is also used in bootstrapping the phrase alignment learning process.

Table 5 compares the performance of the different setups of the baseline experiments. We first measure the complexity of the SMS normalization task by directly computing the similarity between the raw SMS text and the normalized English text. The 1st row of Table 5 reports the similarity as 0.5784 in BLEU score, which implies that there are quite a number of English word 3-grams common to the raw and normalized messages. The 2nd experiment is carried out using only simple dictionary look-up.
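The BLEU measure used for evaluation here can be sketched as a modified n-gram precision combined with a brevity penalty (Papineni et al., 2002). The simplified single-reference, unsmoothed version below is for illustration only:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 3) -> float:
    """Cumulative BLEU up to max_n with brevity penalty, single reference,
    no smoothing (a zero n-gram precision yields a zero score)."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clip candidate counts by reference counts (modified precision).
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(len(cand) - n + 1, 0)
        if overlap == 0 or total == 0:
            return 0.0
        log_prec += math.log(overlap / total) / max_n
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec)

print(bleu("what is your view", "what is your view"))   # 1.0
print(bleu("wat is ur view", "what is your view"))      # 0.0
```

The second example shows why raw SMS scores poorly: lingo tokens break almost every higher-order n-gram match against the normalized reference.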
Lexical ambiguity is addressed by selecting the highest-frequency normalization candidate, i.e., only a unigram LM is used. The performance of the 2nd experiment is 0.6958 in BLEU score. This suggests that the lingo dictionary plus the unigram LM is very useful for SMS normalization. Finally, we carry out the 3rd experiment using dictionary look-up plus a bi-gram LM. Only a slight improvement of 0.0128 (0.7086 - 0.6958) is obtained. This is largely because the English words in the lingo dictionary are mostly high-frequency and commonly used; thus the bi-gram LM does not show much more discriminative ability than the unigram LM without the help of the phrase-based lexical mapping model.

5.2 Using the Phrase-based Model

We then conducted the experiment using the proposed method (bi-gram LM plus a phrase-based lexical mapping model) through a five-fold cross validation on the 5000 parallel SMS messages. Table 6 shows the results. An average score of 0.8070 is obtained. Compared with the baseline performance in Table 5, the improvement is very significant. It suggests that the phrase-based lexical mapping model is very useful, and that our method is effective for SMS text normalization. Figure 2 shows the learning curve: our algorithm converges when the training data is increased to 3000 parallel SMS messages. This suggests that our collected corpus is representative and sufficient for training our model. Table 7 illustrates some examples of the normalization results.

5-fold cross validation | BLEU score (3-gram)
Setup 1 | 0.8023
Setup 2 | 0.8236
Setup 3 | 0.8071
Setup 4 | 0.8113
Setup 5 | 0.7908
Ave. | 0.8070
Table 6. Normalization results for 5-fold cross validation test

[Figure 2: learning curve plotting BLEU score (0.70 to 0.82) against the number of training messages (1000 to 5000)]
Figure 2. Learning Curve

Experimental result analysis reveals that the strength of our model is in its ability to disambiguate mappings, as in "2" to "two" or "to", and "w" to "with" or "who". Error analysis shows that the challenge of the model lies in the proper insertion of subject pronouns and auxiliary or copula verbs, which serve to give further semantic information about the main verb; however, this requires significant context understanding. For example, a message such as "u smart" gives little clue as to whether it should be normalized to "Are you smart?" or "You are smart." unless the full conversation is studied.

Takako w r u?
Takako who are you?
Im in ns, lik soccer, clubbin hangin w frenz! Wat bout u mee?
I'm in ns, like soccer, clubbing hanging with friends! What about you?
fancy getting excited w others' boredom
Fancy getting excited with others' boredom
If u ask me b4 he ask me then i'll go out w u all lor. N u still can act so real.
If you ask me before he asked me then I'll go out with you all. And you still can act so real.
Doing nothing, then u not having dinner w us?
Doing nothing, then you do not having dinner with us?
Aiyar sorry lor forgot 2 tell u... Mtg at 2 pm.
Sorry forgot to tell you... Meeting at two pm.
tat's y I said it's bad dat all e gals know u... Wat u doing now?
That's why I said it's bad that all the girls know you... What you doing now?
Table 7. Examples of Normalization Results

5.3 Effect on English-Chinese MT

An experiment was also conducted to study the effect of normalization on MT, using 402 messages randomly selected from the text corpus. We compare three types of SMS message: raw SMS messages, normalized messages using simple dictionary look-up, and normalized messages using our method. The messages are passed to two different English-to-Chinese translation systems, provided by Systran⁴ and the Institute for Infocomm Research⁵ (I2R), to produce three sets of translation output.

⁴ http://www.systranet.com/systran/net
⁵ http://nlp.i2r.a-star.edu.sg/techtransfer.html

The translation quality is measured using 3-gram cumulative BLEU score against two reference messages. 3-gram is
used as most of the messages are short, with an average length of seven words. Table 8 shows the details of the BLEU scores. We obtain an average BLEU score of 0.3770 for normalized messages, against 0.1926 for raw messages. The significant performance improvement suggests that normalizing SMS text using our method before MT is an effective way to adapt a general MT system to the SMS domain.

Setup | I2R | Systran | Ave.
Raw Message | 0.2633 | 0.1219 | 0.1926
Dict Lookup | 0.3485 | 0.1690 | 0.2588
Normalization | 0.4423 | 0.3116 | 0.3770
Table 8. SMS translation BLEU score with or without SMS normalization

6 Conclusion

In this paper, we study the differences among SMS normalization, general text normalization, spelling check and text paraphrasing, and investigate the different phenomena of SMS messages. We propose a phrase-based statistical method to normalize SMS messages. The method produces messages that collate well with manually normalized messages, achieving a 0.8070 BLEU score against a 0.6958 baseline score. It also significantly improves SMS translation accuracy from 0.1926 to 0.3770 in BLEU score, without adjusting the MT model.

These experimental results provide a good indication of the feasibility of using this method for the normalization task. We plan to extend the model to incorporate a mechanism to handle missing punctuation (which potentially affects MT output and is not addressed at the moment), and to make use of pronunciation information to handle OOV words caused by the use of phonetic spelling. A bigger data set will also be used to test the robustness of the system, leading to more accurate alignment and normalization.

References

A.T. Aw, M. Zhang, Z.Z. Fan, P.K. Yeo and J. Su. 2005. Input Normalization for an English-to-Chinese SMS Translation System. MT Summit-2005

S. Bangalore, V. Murdock and G. Riccardi. 2002. Bootstrapping Bilingual Data using Consensus Translation for a Multilingual Instant Messaging System. COLING-2002

R. Barzilay and K. R. McKeown. 2001. Extracting paraphrases from a parallel corpus. ACL-2001

E. Brill and R. C. Moore. 2000. An Improved Error Model for Noisy Channel Spelling Correction. ACL-2000

P. F. Brown, S. D. Pietra, V. D. Pietra and R. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2)

A. Clark. 2003. Pre-processing very noisy text. In Proceedings of Workshop on Shallow Processing of Large Corpora, Lancaster, 2003

F. J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM 7, 171-176

A.P. Dempster, N.M. Laird and D.B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, Vol. 39, 1-38

A. Golding and D. Roth. 1999. A Winnow-Based Approach to Spelling Correction. Machine Learning 34: 107-130

F. Jelinek. 1991. Self-organized language modeling for speech recognition. In A. Waibel and K.F. Lee, editors, Readings in Speech Recognition, pages 450-506. Morgan Kaufmann, 1991

M. D. Kernighan, K. Church and W. Gale. 1990. A spelling correction program based on a noisy channel model. COLING-1990

P. Koehn, F.J. Och and D. Marcu. 2003. Statistical Phrase-Based Translation. HLT-NAACL-2003

K. Kukich. 1992. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377-439

K. A. Papineni, S. Roukos, T. Ward and W. J. Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. ACL-2002

C. Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal 27(3): 379-423

M. Shimohata and E. Sumita. 2002. Automatic Paraphrasing Based on Parallel Corpus for Normalization. LREC-2002

R. Sproat, A. Black, S. Chen, S. Kumar, M. Ostendorf and C. Richards. 2001. Normalization of Non-Standard Words. Computer Speech and Language, 15(3):287-333

A. Stolcke. 2002. SRILM – An extensible language modeling toolkit. ICSLP-2002

K. Toutanova and R. C. Moore. 2002. Pronunciation Modeling for Improved Spelling Correction. ACL-2002

R. Zens and H. Ney. 2004. Improvements in Phrase-Based Statistical MT. HLT-NAACL-2004