
A Phrase-based Statistical Model for SMS Text Normalization

AiTi Aw, Min Zhang, Juan Xiao, Jian Su


Institute for Infocomm Research
21 Heng Mui Keng Terrace
Singapore 119613
{aaiti,mzhang,stuxj,sujian}@i2r.a-star.edu.sg
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 33-40, Sydney, July 2006. (c) 2006 Association for Computational Linguistics

Abstract

Short Messaging Service (SMS) texts behave quite differently from normal written texts and have some very special phenomena. To translate SMS texts, traditional approaches model such irregularities directly in Machine Translation (MT). However, such approaches suffer from the customization problem, as tremendous effort is required to adapt the language model of the existing translation system to handle the SMS text style. We offer an alternative approach that resolves such irregularities by normalizing SMS texts before MT. In this paper, we view the task of SMS normalization as a translation problem from the SMS language to the English language [1] and we propose to adapt a phrase-based statistical MT model for the task. Evaluation by 5-fold cross validation on a parallel SMS normalization corpus of 5000 sentences shows that our method can achieve 0.80702 in BLEU score against a baseline BLEU score of 0.6958. Another experiment, translating SMS texts from English to Chinese on a separate SMS text corpus, shows that using SMS normalization as MT preprocessing can largely boost SMS translation performance from 0.1926 to 0.3770 in BLEU score.

[1] This paper only discusses English SMS text normalization.

1 Motivation

SMS translation is a mobile Machine Translation (MT) application that translates a message from one language to another. Though there exist many commercial MT systems, direct use of such systems fails to work well due to the special phenomena in SMS texts, e.g. the unique relaxed and creative writing style and the frequent use of unconventional and not yet standardized short-forms. Direct modeling of these special phenomena in MT requires tremendous effort. Alternatively, we can normalize SMS texts into grammatical texts before MT. In this way, the traditional MT system is treated as a "black box" with little or minimal adaptation. One advantage of this pre-translation normalization is that the diversity in different user groups and domains can be modeled separately without accessing and adapting the language model of the MT system for each SMS application. Another advantage is that the normalization module can be easily utilized by other applications, such as SMS to voicemail and SMS-based information query.

In this paper, we present a phrase-based statistical model for SMS text normalization. The normalization is visualized as a translation problem where messages in the SMS language are to be translated to normal English using a similar phrase-based statistical MT method (Koehn et al., 2003). We use IBM's BLEU score (Papineni et al., 2002) to measure the performance of SMS text normalization. BLEU computes the similarity between two sentences using n-gram statistics and is widely used in MT evaluation. A set of parallel SMS messages, consisting of 5000 raw (un-normalized) SMS messages and their manually normalized references, is constructed for training and testing. Evaluation by 5-fold cross validation on this corpus shows that our method can achieve 0.80702 in BLEU score, compared to the baseline system's 0.6958. We also study the impact of our SMS text normalization on the task of SMS translation. The experiment of translating SMS texts from English to Chinese on a corpus comprising 402 SMS texts shows that SMS normalization as a preprocessing step of MT can boost the translation performance from 0.1926 to 0.3770 in BLEU score.

The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 summarizes the characteristics of English SMS texts. Section 4 discusses our method and Section 5 reports our experiments. Section 6 concludes the paper.

2 Related Work

There is little work reported on SMS normalization and translation.
Bangalore et al. (2002) used a consensus translation technique to bootstrap parallel data using off-the-shelf translation systems for training a hierarchical statistical translation model for general-domain instant messaging used in Internet chat rooms. Their method deals with the special phenomena of the instant messaging language (rather than the SMS language) in each individual MT system. Clark (2003) proposed to unify the process of tokenization, segmentation and spelling correction for normalization of general noisy text (rather than SMS or instant messaging texts) based on a noisy channel model at the character level. However, results of the normalization are not reported. Aw et al. (2005) gave a brief description of their input pre-processing work for an English-to-Chinese SMS translation system using a word-group model. In addition, in most commercial SMS translation applications [2], an SMS lingo (i.e., SMS short-form) dictionary is provided to replace SMS short-forms with normal English words. Most of these systems do not handle OOV (out-of-vocabulary) items and ambiguous inputs. The following compares SMS text normalization with other similar or related applications.

2.1 SMS Normalization versus General Text Normalization

General text normalization deals with Non-Standard Words (NSWs) and has been well studied in text-to-speech (Sproat et al., 2001), while SMS normalization deals with Non-Words (NWs) or lingoes and has seldom been studied before. NSWs, such as digit sequences, acronyms, mixed-case words (WinNT, SunOS), abbreviations and so on, are grammatically correct in linguistics. However, lingoes, such as "b4" (before) and "bf" (boyfriend), which are usually self-created and only accepted by young SMS users, are not yet formalized in linguistics. Therefore, the special phenomena in SMS texts impose a big challenge to SMS normalization.

2.2 SMS Normalization versus the Spelling Correction Problem

Intuitively, many would regard SMS normalization as a spelling correction problem where the lingoes are erroneous words or non-words to be replaced by English words. Research on spelling correction centers on typographic and cognitive/orthographic errors (Kukich, 1992) and uses approaches (Kernighan, Church and Gale, 1990) that mostly model the edit operations using distance measures (Damerau, 1964; Levenshtein, 1966), specific word-set confusions (Golding and Roth, 1999) and pronunciation modeling (Brill and Moore, 2000; Toutanova and Moore, 2002). These models are mostly character-based or string-based without considering the context. In addition, the author might not be aware of the errors introduced in a word during the edit operations, as most errors are due to mistyping of characters near to each other on the keyboard or homophones, such as "poor" and "pour".

In SMS, errors are not isolated within a word and are usually not surrounded by clean context. Words are altered deliberately to reflect the sender's distinct creation and idiosyncrasies. A character can be deleted on purpose, as in "wat" (what) and "hv" (have). SMS text also contains short-forms such as "b4" (before) and "bf" (boyfriend). In addition, normalizing SMS text might require the context to span more than one lexical unit, as in "lemme" (let me), "ur" (you are), etc. Therefore, the models used in spelling correction are inadequate for providing a complete solution for SMS normalization.

2.3 SMS Normalization versus the Text Paraphrasing Problem

Others may regard SMS normalization as a paraphrasing problem. Broadly speaking, paraphrases capture core aspects of variability in language by representing equivalencies between different expressions that correspond to the same meaning. In most of the recent works (Barzilay and McKeown, 2001; Shimohata, 2002), they are acquired (semi-)automatically from large comparable or parallel corpora using lexical and morpho-syntactic information.

Text paraphrasing works on clean texts, in which contextual and lexical-syntactic features can be extracted and used to find "approximate conceptual equivalence". In SMS normalization, we are dealing with non-words and "ungrammatical" sentences, with the purpose of normalizing or standardizing these words and forming better sentences. The SMS normalization problem is thus different from text paraphrasing. On the other hand, it bears some similarities with MT, as we are trying to "convert" text from one language to another. However, it is a simpler problem because, most of the time, we can find the same word in both the source and target text, making alignment easier.

[2] http://www.etranslator.ro and http://www.transl8bit.com
3 Characteristics of English SMS

Our corpus consists of 55,000 messages collected from two sources, an SMS chat room and correspondences between university students. The content is mostly related to football matches, making friends and casual conversations on the "how, what and where about". We summarize the text behaviors into two categories, as below.

3.1 Orthographic Variation

The most significant orthographic variant in SMS texts is the use of non-standard, self-created short-forms. Usually, the sender takes advantage of phonetic spellings, initial letters or number homophones to mimic spoken conversation or to shorten words or phrases (hw vs. homework or how, b4 vs. before, cu vs. see you, 2u vs. to you, oic vs. oh I see, etc.) in an attempt to minimize key strokes. In addition, senders create new forms of written representation to express their oral utterances. Emoticons, such as ":(" symbolizing sadness, ":)" symbolizing smiling and ":()" symbolizing shock, are representations of body language. Verbal effects such as "hehe" for laughter and emphatic discourse particles such as "lor", "lah" and "meh" for colloquial English are prevalent in the text collection.

The loss of "alpha-case" information poses another challenge in lexical disambiguation and introduces difficulty in identifying sentence boundaries, proper nouns and acronyms. With the flexible use of punctuation, or no punctuation at all, translation of SMS messages without prior processing is even more difficult.

3.2 Grammar Variation

SMS messages are short and concise and convey much information within the limited space quota (160 letters for English), thus they tend to be implicit and influenced by pragmatic and situational factors. These inadequacies of language expression, such as deletion of articles and subject pronouns, as well as problems in number agreement or tense, make SMS normalization more challenging. Table 1 illustrates some orthographic and grammar variations of SMS texts.

Phenomenon | Example message
1. Dropping '?' at the end of a question | "btw, wat is ur view" (By the way, what is your view?)
2. Not using any punctuation at all | "Eh speak english mi malay not tt good" (Eh, speak English! My Malay is not that good.)
3. Using spelling/punctuation for emphasis | "goooooood Sunday morning !!!!!!" (Good Sunday morning!)
4. Using phonetic spelling | "dat iz enuf" (That is enough.)
5. Dropping a vowel | "i hv cm to c my luv." (I have come to see my love.)
6. Introducing local flavor | "yar lor where u go juz now" (Yes, where did you go just now?)
7. Dropping a verb | "I hv 2 go. Dinner w parents." (I have to go. Have dinner with parents.)
Table 1. Examples of SMS messages

3.3 Corpus Statistics

We investigate the corpus to assess the feasibility of replacing the lingoes with normal English words and performing limited adjustment to the text structure. Similarly to Aw et al. (2005), we focus on the three major cases of transformation shown in the corpus: (1) replacement of OOV words and non-standard SMS lingoes; (2) removal of slang; and (3) insertion of auxiliary or copula verbs and subject pronouns.

Transformation | Percentage (%)
Insertion | 8.09
Deletion | 5.48
Substitution | 86.43
Table 2. Distribution of insertion, deletion and substitution transformations

Substitution | Deletion | Insertion
u -> you | m | are
2 -> to | lah | am
n -> and | t | is
r -> are | ah | you
ur -> your | leh | to
dun -> don't | 1 | do
man -> manchester | huh | a
no -> number | one | in
intro -> introduce | lor | yourself
wat -> what | ahh | will
Table 3. Top 10 most common substitutions, deletions and insertions

Table 2 shows the statistics of these transformations based on 700 randomly selected messages, of which 621 (88.71%) required normalization, with a total of 2300 transformations. Substitution accounts for almost 86% of all transformations; deletion and insertion make up the rest. Table 3 shows the top 10 most common transformations.
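As a rough illustration of how such insertion, deletion and substitution counts could be computed from a raw message and its normalized reference, here is a hedged sketch that uses Python's difflib for token-level alignment; the alignment method and function name are assumptions for illustration, not the paper's actual procedure.

```python
from collections import Counter
from difflib import SequenceMatcher

def transformation_counts(raw: str, normalized: str) -> Counter:
    """Count word-level insertions, deletions and substitutions between
    a raw SMS message and its normalized reference (illustrative only)."""
    raw_toks = raw.lower().split()
    norm_toks = normalized.lower().split()
    counts = Counter()
    for op, i1, i2, j1, j2 in SequenceMatcher(None, raw_toks, norm_toks).get_opcodes():
        if op == "replace":            # e.g. "wat" replaced by "what"
            counts["substitution"] += max(i2 - i1, j2 - j1)
        elif op == "delete":           # SMS token dropped in the reference, e.g. "lah"
            counts["deletion"] += i2 - i1
        elif op == "insert":           # word added in the reference, e.g. "are"
            counts["insertion"] += j2 - j1
    return counts

print(transformation_counts("I hv 2 go. Dinner w parents.",
                            "I have to go. Have dinner with parents."))
```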
4 SMS Normalization

We view the SMS language as a variant of the English language with some derivations in vocabulary and grammar. Therefore, we can treat SMS normalization as an MT problem where the SMS language is to be translated to normal English. We thus propose to adapt the statistical machine translation model (Brown et al., 1993; Zens and Ney, 2004) for SMS text normalization. In this section, we discuss the three components of our method: modeling, training and decoding for SMS text normalization.

4.1 Basic Word-based Model

The SMS normalization model is based on the source channel model (Shannon, 1948). Assuming that an English sentence e of length N is "corrupted" by a noisy channel to produce an SMS message s of length M, the English sentence e can be recovered through a posteriori distribution for the channel target text given the source text, P(s | e), and a prior distribution for the channel source text, P(e):

\hat{e}_1^N = \arg\max_{e_1^N} \{ P(e_1^N \mid s_1^M) \} = \arg\max_{e_1^N} \{ P(s_1^M \mid e_1^N) \cdot P(e_1^N) \}   (1)

Assuming that one SMS word is mapped exactly to one English word in the channel model P(s | e) under an alignment A, we need to consider only two types of probabilities: the alignment probabilities denoted by P(m | a_m) and the lexicon mapping probabilities denoted by P(s_m | e_{a_m}) (Brown et al., 1993). The channel model can be written as in the following equation, where m is the position of a word in s and a_m its alignment in e:

P(s_1^M \mid e_1^N) = \sum_A P(s_1^M, A \mid e_1^N) = \sum_A P(A \mid e_1^N) \cdot P(s_1^M \mid A, e_1^N) \approx \sum_A \prod_{m=1}^{M} P(m \mid a_m) \cdot P(s_m \mid e_{a_m})   (2)
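To make the word-based formulation concrete, here is a minimal sketch of the decision rule in equation (1) under the one-word-to-one-word assumption; the toy mapping table and unigram prior are invented for illustration and are not the paper's trained parameters.

```python
import math

# Toy parameters (invented): lexicon mapping probabilities P(s | e)
# and a unigram prior P(e) standing in for the language model.
P_S_GIVEN_E = {
    ("2", "to"): 0.6, ("2", "two"): 0.3, ("2", "too"): 0.1,
    ("u", "you"): 0.9, ("r", "are"): 0.8, ("wat", "what"): 0.9,
}
P_E = {"to": 0.02, "two": 0.005, "too": 0.004,
       "you": 0.03, "are": 0.02, "what": 0.01}

def normalize_word(s: str) -> str:
    """Pick the English word e maximizing P(s | e) * P(e); unknown SMS
    tokens are passed through unchanged."""
    candidates = [(e, p) for (src, e), p in P_S_GIVEN_E.items() if src == s]
    if not candidates:
        return s
    return max(candidates,
               key=lambda c: math.log(c[1]) + math.log(P_E[c[0]]))[0]

print(" ".join(normalize_word(t) for t in "wat r u up 2".split()))
# -> what are you up to
```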
If we include the word "null" in the English vocabulary, the above model can fully address the deletion and substitution transformations, but it is inadequate for the insertion transformation. For example, the lingoes "duno" and "ysnite" have to be normalized using an insertion transformation to become "don't know" and "yesterday night". Moreover, we also want the normalization to have better lexical affinity and linguistic equivalence, so we extend the model to allow many-words-to-many-words alignment, allowing a sequence of SMS words to be normalized to a sequence of contiguous English words. We call this updated model a phrase-based normalization model.

4.2 Phrase-based Model

Given an English sentence e and an SMS sentence s, if we assume that e can be decomposed into K phrases with a segmentation T, such that each phrase e_k in e can be corresponded with one phrase s_k in s, we have e_1^N = e_1 ... e_k ... e_K and s_1^M = s_1 ... s_k ... s_K. The channel model can be rewritten as equation (3):

P(s_1^M \mid e_1^N) = \sum_T P(s_1^M, T \mid e_1^N) = \sum_T P(T \mid e_1^N) \cdot P(s_1^M \mid T, e_1^N) = \sum_T P(T \mid e_1^N) \cdot P(s_1^K \mid e_1^K) \approx \max_T \{ P(T \mid e_1^N) \cdot P(s_1^K \mid e_1^K) \}   (3)

This is the basic function of the channel model for the phrase-based SMS normalization model, where we use the maximum approximation for the sum over all segmentations. We then further decompose the probability P(s_1^K | e_1^K) using a phrase alignment A, as done in the previous word-based model:

P(s_1^K \mid e_1^K) = \sum_A P(s_1^K, A \mid e_1^K) = \sum_A \{ P(A \mid e_1^K) \cdot P(s_1^K \mid A, e_1^K) \} = \sum_A \{ \prod_{k=1}^{K} P(k \mid a_k) \cdot P(s_k \mid s_1^{k-1}, e_{a_1}^{a_k}) \} \approx \sum_A \{ \prod_{k=1}^{K} P(k \mid a_k) \cdot P(s_k \mid e_{a_k}) \}   (4)

We are now able to model the three transformations through the normalization pair (s_k, e_{a_k}) with the mapping probability P(s_k | e_{a_k}). The following shows the scenarios in which the three transformations occur:

Insertion: s_k < e_{a_k}
Deletion: e_{a_k} = null
Substitution: s_k = e_{a_k}

The statistics of our training corpus show that, by selecting an appropriate phrase segmentation, position re-ordering at the phrase level occurs rarely. This is not surprising, since most of the English words or phrases in normal English text are replaced with lingoes in SMS messages without position change, to make the SMS text short and concise and to retain the meaning. Thus we need to consider only monotone alignment at the phrase level, i.e., k = a_k, in equation (4). In addition, the word-level reordering within a phrase is learned during training. Now we can further derive equation (4) as follows:

P(s_1^K \mid e_1^K) \approx \sum_A \{ \prod_{k=1}^{K} P(k \mid a_k) \cdot P(s_k \mid e_{a_k}) \} \approx \prod_{k=1}^{K} P(s_k \mid e_k)   (5)

The mapping probability P(s_k | e_k) is estimated via relative frequencies as follows:

P(s_k \mid e_k) = \frac{N(s_k, e_k)}{\sum_{s_k'} N(s_k', e_k)}   (6)

Here, N(s_k, e_k) denotes the frequency of the normalization pair (s_k, e_k).
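A minimal sketch of the relative-frequency estimate in equation (6), assuming a list of phrase-aligned normalization pairs is already available (the toy pairs below are invented):

```python
from collections import Counter

# Invented phrase-aligned normalization pairs (s_k, e_k); in the paper
# these come out of the EM phrase-alignment step described in Section 4.3.
pairs = [("2", "to"), ("2", "to"), ("to", "to"), ("2", "too"),
         ("w", "with"), ("wif", "with"), ("w", "who are")]

pair_counts = Counter(pairs)              # N(s_k, e_k)
e_counts = Counter(e for _, e in pairs)   # sum over s'_k of N(s'_k, e_k)

def mapping_prob(s: str, e: str) -> float:
    """Equation (6): P(s_k | e_k) = N(s_k, e_k) / sum_{s'_k} N(s'_k, e_k)."""
    return pair_counts[(s, e)] / e_counts[e] if e_counts[e] else 0.0

print(mapping_prob("2", "to"))    # 2/3
print(mapping_prob("w", "with"))  # 1/2
```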
Using a bigram language model and assuming Bayes' decision rule, we finally obtain the following search criterion for equation (1):

\hat{e}_1^N = \arg\max_{e_1^N} \{ P(e_1^N) \cdot P(s_1^M \mid e_1^N) \} \approx \arg\max_{e_1^N} \{ \prod_{n=1}^{N} P(e_n \mid e_{n-1}) \cdot \max_T [ P(T \mid e_1^N) \cdot \prod_{k=1}^{K} P(s_k \mid e_k) ] \} \approx \arg\max_{e_1^N, T} \{ \prod_{n=1}^{N} P(e_n \mid e_{n-1}) \cdot \prod_{k=1}^{K} P(s_k \mid e_k) \}   (7)

For the above equation, we assume the segmentation probability P(T | e_1^N) to be constant. Finally, the SMS normalization model consists of two sub-models: a word-based language model (LM), characterized by P(e_n | e_{n-1}), and a phrase-based lexical mapping model (channel model), characterized by P(s_k | e_k).

4.3 Training Issues

For phrase-based model training, the sentence-aligned SMS corpus needs to be aligned first at the phrase level. The maximum likelihood approach, through the EM algorithm and Viterbi search (Dempster et al., 1977), is employed to infer such an alignment. Here, we make a reasonable assumption on the alignment unit: a single SMS word can be mapped to a sequence of contiguous English words, but not vice versa. The EM algorithm for phrase alignment is illustrated in Figure 1 and is formulated by equation (8).

The Expectation-Maximization Algorithm
(1) Bootstrap an initial alignment using orthographic similarities.
(2) Expectation: update the joint probabilities P(s_k, e_k).
(3) Maximization: apply the joint probabilities P(s_k, e_k) to get a new alignment using the Viterbi search algorithm.
(4) Repeat (2) and (3) until the alignment converges.
(5) Derive normalization pairs from the final alignment.
Figure 1. Phrase alignment using the EM algorithm

\hat{\gamma}_{\langle s_k, e_k \rangle} = \arg\max_{\gamma_{\langle s_k, e_k \rangle}} \prod_{k=1}^{K} P(s_k, e_k \mid s_1^M, e_1^N)   (8)

The alignment process given in equation (8) differs from the normalization given in equation (7) in that here we have an aligned input sentence pair, s_1^M and e_1^N. The alignment process just finds the alignment segmentation \hat{\gamma}_{\langle s_k, e_k \rangle} = \langle s_k, e_k \rangle_{k=1,...,K} between the two sentences that maximizes the joint probability. Therefore, in step (2) of the EM algorithm given in Figure 1, only the joint probabilities P(s_k, e_k) are involved and updated.

Since EM may fall into a local optimum, in order to speed up convergence and find a near-global optimum, a string matching technique is exploited at the initialization step to identify the most probable normalization pairs. Orthographic similarities captured by edit distance and an SMS lingo dictionary [3], which contains the commonly used short-forms, are first used to establish phrase-mapping boundary candidates. Heuristics are then exploited to match tokens within the pairs of boundary candidates, by trying to combine consecutive tokens within the boundary candidates if the numbers of tokens do not agree.

[3] The entries are collected from various websites such as http://www.handphones.info/sms-dictionary/sms-lingo.php and http://www.funsms.net/sms_dictionary.htm.
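As a hedged sketch of this bootstrapping idea, the code below pairs each SMS token with the orthographically closest reference word, consulting a tiny stand-in lingo dictionary first and a character-similarity score otherwise; the threshold, dictionary contents and difflib-based similarity are assumptions for illustration, not the paper's exact heuristics.

```python
from difflib import SequenceMatcher

# A tiny stand-in for the SMS lingo dictionary used in bootstrapping.
LINGO = {"b4": "before", "2": "to", "w": "with", "u": "you", "wat": "what"}

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a proxy for edit distance."""
    return SequenceMatcher(None, a, b).ratio()

def bootstrap_anchors(sms_tokens, eng_tokens, threshold=0.5):
    """Propose initial (SMS position, English position) anchor points for
    phrase-boundary candidates: lingo-dictionary lookup first, then
    orthographic similarity against every reference word."""
    anchors = []
    for i, s in enumerate(sms_tokens):
        target = LINGO.get(s.lower(), s.lower())
        j, score = max(((j, similarity(target, e.lower()))
                        for j, e in enumerate(eng_tokens)), key=lambda x: x[1])
        if score >= threshold:
            anchors.append((i, j))
    return anchors

print(bootstrap_anchors("i hv 2 go".split(), "I have to go".split()))
# -> [(0, 0), (1, 1), (2, 2), (3, 3)]
```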
Finally, a filtering process is carried out to manually remove the low-frequency noisy alignment pairs. Table 4 shows some of the extracted normalization pairs. As can be seen from the table, our algorithm automatically discovers ambiguous mappings that are otherwise missing from most lingo dictionaries.

(s, e) | log P(s | e)
(2, 2) | 0
(2, to) | -0.579466
(2, too) | -0.897016
(2, null) | -2.97058
(4, 4) | 0
(4, for) | -0.431364
(4, null) | -3.27161
(w, who are) | -0.477121
(w, with) | -0.764065
(w, who) | -1.83885
(dat, that) | -0.726999
(dat, date) | -0.845098
(tmr, tomorrow) | -0.341514
Table 4. Examples of normalization pairs

Given the phrase-aligned SMS corpus, the lexical mapping model, characterized by P(s_k | e_k), is easily trained using equation (6). Our n-gram LM P(e_n | e_{n-1}) is trained on English Gigaword, provided by the LDC, using the SRILM language modeling toolkit (Stolcke, 2002). Backoff smoothing (Jelinek, 1991) is used to assign a non-zero probability to unseen words to address data sparseness.
4.4 Monotone Search

Given an input s, the search, characterized in equation (7), is to find a sentence e that maximizes P(s | e) * P(e) using the normalization model. In this paper, the maximization problem in equation (7) is solved using a monotone search, implemented as a Viterbi search through dynamic programming.
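A compact sketch of such a monotone Viterbi search is shown below; the phrase-table entries (loosely inspired by Table 4), the toy bigram scores and the pass-through penalty for unknown tokens are all invented for illustration, not the paper's actual models.

```python
import math

# Hypothetical phrase table: SMS phrase -> list of (English phrase, P(s|e)).
PHRASE_TABLE = {
    "w": [("with", 0.17), ("who", 0.15), ("who are", 0.33)],
    "r": [("are", 0.8), ("or", 0.1)],
    "u": [("you", 0.9)],
    "2": [("to", 0.6), ("two", 0.2), ("too", 0.1)],
}

# Hypothetical bigram log-probabilities; unseen bigrams get a flat back-off.
BIGRAM = {("who", "are"): math.log(0.2), ("are", "you"): math.log(0.2),
          ("with", "are"): math.log(0.001), ("takako", "who"): math.log(0.05)}

def lm_logprob(prev, word):
    return BIGRAM.get((prev, word), math.log(0.01))

def normalize(sms, max_phrase_len=3):
    """Monotone Viterbi search over phrase segmentations of the SMS input,
    maximizing the sum of log P(s_k | e_k) and bigram LM log-probabilities
    (equation (7) with P(T | e) treated as constant)."""
    toks = sms.lower().split()
    n = len(toks)
    # State: (covered prefix length, last English word) -> (score, output words)
    best = {(0, "<s>"): (0.0, [])}
    for i in range(1, n + 1):
        for j in range(max(0, i - max_phrase_len), i):
            phrase = " ".join(toks[j:i])
            # Unknown single tokens pass through unchanged with a small penalty.
            options = PHRASE_TABLE.get(phrase, [(phrase, 1e-4)] if i - j == 1 else [])
            for (jj, last), (base_score, base_words) in list(best.items()):
                if jj != j:
                    continue
                for eng, p in options:
                    score, prev = base_score + math.log(p), last
                    for w in eng.split():
                        score += lm_logprob(prev, w)
                        prev = w
                    key = (i, prev)
                    if key not in best or score > best[key][0]:
                        best[key] = (score, base_words + eng.split())
    finals = [val for (pos, _), val in best.items() if pos == n]
    return " ".join(max(finals, key=lambda v: v[0])[1])

print(normalize("Takako w r u ?"))   # -> takako who are you ?
```

The dynamic-programming state keeps the last emitted English word so that the bigram LM can prefer, for example, "who are you" over "with are you", which is the kind of disambiguation highlighted in the error analysis of Section 5.2.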
5 Experiments

The aim of our experiments is to verify the effectiveness of the proposed statistical model for SMS normalization and the impact of SMS normalization on MT.

A set of 5000 parallel SMS messages, consisting of raw (un-normalized) SMS messages and reference messages manually prepared by two project members with inter-normalization agreement checked, was prepared for training and testing. For evaluation, we use IBM's BLEU score (Papineni et al., 2002) to measure the performance of SMS normalization. BLEU measures the similarity between two sentences using n-gram statistics with a penalty for overly short sentences, and is widely used in MT evaluation.
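For reference, a 3-gram BLEU of this kind can be computed with NLTK's corpus_bleu as a convenient stand-in for IBM's BLEU script; the toy hypothesis and reference below are invented.

```python
# pip install nltk
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is scored against a list of reference normalizations.
hypotheses = ["i have to go have dinner with parents".split()]
references = [["i have to go have dinner with my parents".split()]]

score = corpus_bleu(references, hypotheses,
                    weights=(1/3, 1/3, 1/3),           # 3-gram BLEU
                    smoothing_function=SmoothingFunction().method1)
print(f"3-gram BLEU: {score:.4f}")
```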
Setup | BLEU score (3-gram)
Raw SMS without normalization | 0.5784
Dictionary look-up plus frequency | 0.6958
Bi-gram language model only | 0.7086
Table 5. Performance of different setups of the baseline experiments on the 5000 parallel SMS messages

5.1 Baseline Experiments: Simple SMS Lingo Dictionary Look-up and Using a Language Model Only

The baseline experiment moderates the texts using a lingo dictionary comprising 142 normalization pairs, which is also used in bootstrapping the phrase alignment learning process.

Table 5 compares the performance of the different setups of the baseline experiments. We first measure the complexity of the SMS normalization task by directly computing the similarity between the raw SMS text and the normalized English text. The first row of Table 5 reports this similarity as 0.5784 in BLEU score, which implies that there are quite a number of English word 3-grams common to the raw and normalized messages. The second experiment is carried out using only simple dictionary look-up. Lexical ambiguity is addressed by selecting the highest-frequency normalization candidate, i.e., only a unigram LM is used. The performance of the second experiment is 0.6958 in BLEU score. It suggests that the lingo dictionary plus the unigram LM is very useful for SMS normalization. Finally, we carry out the third experiment using dictionary look-up plus a bi-gram LM. Only a slight improvement of 0.0128 (0.7086 - 0.6958) is obtained. This is largely because the English words in the lingo dictionary are mostly high-frequency and commonly used, so the bi-gram LM does not show much more discriminative ability than the unigram LM without the help of the phrase-based lexical mapping model.
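A minimal sketch of this dictionary look-up baseline, where each token is replaced by its most frequent normalization candidate; the toy dictionary and its frequencies are invented (the real lingo dictionary contains 142 normalization pairs).

```python
# Hypothetical lingo dictionary: SMS form -> normalization candidates with
# corpus frequencies; the highest-frequency candidate is always chosen.
LINGO_DICT = {
    "2": {"to": 520, "two": 140, "too": 90},
    "u": {"you": 800},
    "w": {"with": 230, "who": 40},
    "wat": {"what": 310},
    "r": {"are": 450},
}

def baseline_normalize(sms: str) -> str:
    """Replace each token by its most frequent dictionary candidate;
    out-of-dictionary tokens are left untouched."""
    out = []
    for tok in sms.lower().split():
        candidates = LINGO_DICT.get(tok)
        out.append(max(candidates, key=candidates.get) if candidates else tok)
    return " ".join(out)

print(baseline_normalize("wat r u up 2"))   # -> what are you up to
```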
5.2 Using the Phrase-based Model

We then conducted the experiment using the proposed method (bi-gram LM plus a phrase-based lexical mapping model) through five-fold cross validation on the 5000 parallel SMS messages. Table 6 shows the results. An average score of 0.8070 is obtained. Compared with the baseline performance in Table 5, the improvement is very significant. It suggests that the phrase-based lexical mapping model is very useful and that our method is effective for SMS text normalization. Figure 2 shows the learning curve: our algorithm converges when the training data is increased to 3000 parallel SMS messages. This suggests that our collected corpus is representative and sufficient for training our model. Table 7 illustrates some examples of the normalization results.

5-fold cross validation | BLEU score (3-gram)
Setup 1 | 0.8023
Setup 2 | 0.8236
Setup 3 | 0.8071
Setup 4 | 0.8113
Setup 5 | 0.7908
Ave. | 0.8070
Table 6. Normalization results for the 5-fold cross validation test

[Figure 2. Learning curve: BLEU score (about 0.70 to 0.82) as the number of parallel training messages increases from 1000 to 5000.]

Experimental result analysis reveals that the strength of our model is its ability to disambiguate mappings such as "2" to "two" or "to" and "w" to "with" or "who". Error analysis shows that the challenge for the model lies in the proper insertion of subject pronouns and auxiliary or copula verbs, which give further semantic information about the main verb but require significant context understanding. For example, a message such as "u smart" gives little clue as to whether it should be normalized to "Are you smart?" or "You are smart." unless the full conversation is studied.

"Takako w r u?" -> "Takako who are you?"
"Im in ns, lik soccer, clubbin hangin w frenz! Wat bout u mee?" -> "I'm in ns, like soccer, clubbing hanging with friends! What about you?"
"fancy getting excited w others' boredom" -> "Fancy getting excited with others' boredom"
"If u ask me b4 he ask me then i'll go out w u all lor. N u still can act so real." -> "If you ask me before he asked me then I'll go out with you all. And you still can act so real."
"Doing nothing, then u not having dinner w us?" -> "Doing nothing, then you do not having dinner with us?"
"Aiyar sorry lor forgot 2 tell u... Mtg at 2 pm." -> "Sorry forgot to tell you... Meeting at two pm."
"tat's y I said it's bad dat all e gals know u... Wat u doing now?" -> "That's why I said it's bad that all the girls know you... What you doing now?"
Table 7. Examples of normalization results (SMS input -> system output)

5.3 Effect on English-Chinese MT

An experiment was also conducted to study the effect of normalization on MT, using 402 messages randomly selected from the text corpus. We compare three types of SMS message: raw SMS messages, messages normalized using simple dictionary look-up and messages normalized using our method. The messages are passed to two different English-to-Chinese translation systems, provided by Systran [4] and the Institute for Infocomm Research [5] (I2R), to produce three sets of translation output. The translation quality is measured using 3-gram cumulative BLEU score against two reference messages; 3-gram BLEU is used because most of the messages are short, with an average length of seven words. Table 8 shows the BLEU scores. We obtain an average of 0.3770 in BLEU score for normalized messages against 0.1926 for raw messages. The significant performance improvement suggests that normalizing SMS text using our method before MT is an effective way to adapt a general MT system to the SMS domain.

Setup | I2R | Systran | Ave.
Raw message | 0.2633 | 0.1219 | 0.1926
Dictionary look-up | 0.3485 | 0.1690 | 0.2588
Normalization | 0.4423 | 0.3116 | 0.3770
Table 8. SMS translation BLEU scores with and without SMS normalization

[4] http://www.systranet.com/systran/net
[5] http://nlp.i2r.a-star.edu.sg/techtransfer.html

6 Conclusion

In this paper, we study the differences among SMS normalization, general text normalization, spelling correction and text paraphrasing, and investigate the different phenomena of SMS messages. We propose a phrase-based statistical method to normalize SMS messages. The method produces messages that collate well with manually normalized messages, achieving a 0.8070 BLEU score against a 0.6958 baseline. It also significantly improves SMS translation accuracy from 0.1926 to 0.3770 in BLEU score without adjusting the MT model.

These experimental results give us a good indication of the feasibility of using this method for the normalization task. We plan to extend the model to incorporate a mechanism to handle missing punctuation (which potentially affects MT output and is not being taken care of at the moment) and to make use of pronunciation information to handle OOV items caused by phonetic spelling. A bigger data set will also be used to test the robustness of the system, leading to more accurate alignment and normalization.

References

A. T. Aw, M. Zhang, Z. Z. Fan, P. K. Yeo and J. Su. 2005. Input Normalization for an English-to-Chinese SMS Translation System. MT Summit 2005.

S. Bangalore, V. Murdock and G. Riccardi. 2002. Bootstrapping Bilingual Data using Consensus Translation for a Multilingual Instant Messaging System. COLING 2002.

R. Barzilay and K. R. McKeown. 2001. Extracting Paraphrases from a Parallel Corpus. ACL 2001.

E. Brill and R. C. Moore. 2000. An Improved Error Model for Noisy Channel Spelling Correction. ACL 2000.

P. F. Brown, S. A. Della Pietra, V. J. Della Pietra and R. L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2).

A. Clark. 2003. Pre-processing Very Noisy Text. In Proceedings of the Workshop on Shallow Processing of Large Corpora, Lancaster.

F. J. Damerau. 1964. A Technique for Computer Detection and Correction of Spelling Errors. Communications of the ACM, 7:171-176.

A. P. Dempster, N. M. Laird and D. B. Rubin. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38.

A. Golding and D. Roth. 1999. A Winnow-Based Approach to Spelling Correction. Machine Learning, 34:107-130.

F. Jelinek. 1991. Self-organized Language Modeling for Speech Recognition. In A. Waibel and K. F. Lee, editors, Readings in Speech Recognition, pages 450-506. Morgan Kaufmann.

M. D. Kernighan, K. Church and W. Gale. 1990. A Spelling Correction Program Based on a Noisy Channel Model. COLING 1990.

P. Koehn, F. J. Och and D. Marcu. 2003. Statistical Phrase-Based Translation. HLT-NAACL 2003.

K. Kukich. 1992. Techniques for Automatically Correcting Words in Text. ACM Computing Surveys, 24(4):377-439.

K. A. Papineni, S. Roukos, T. Ward and W. J. Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002.

C. Shannon. 1948. A Mathematical Theory of Communication. Bell System Technical Journal, 27(3):379-423.

M. Shimohata and E. Sumita. 2002. Automatic Paraphrasing Based on Parallel Corpus for Normalization. LREC 2002.

R. Sproat, A. Black, S. Chen, S. Kumar, M. Ostendorf and C. Richards. 2001. Normalization of Non-Standard Words. Computer Speech and Language, 15(3):287-333.

A. Stolcke. 2002. SRILM - An Extensible Language Modeling Toolkit. ICSLP 2002.

K. Toutanova and R. C. Moore. 2002. Pronunciation Modeling for Improved Spelling Correction. ACL 2002.

R. Zens and H. Ney. 2004. Improvements in Phrase-Based Statistical MT. HLT-NAACL 2004.
