DeepNorm - A Deep Learning Approach to Text Normalization
Shaurya Rohatgi Maryam Zare
Pennsylvania State University Pennsylvania State University
State College, Pennsylvania State College, Pennsylvania
[email protected] [email protected]
ABSTRACT
This paper presents a simple yet sophisticated approach to the challenge posed by Sproat and Jaitly (2016): given a large corpus of written text aligned to its normalized spoken form, train an RNN to learn the correct normalization function. Text normalization of a token seems straightforward without its context, but given the context in which a token is used, normalization becomes tricky for some classes. We present a novel approach in which the prediction of our classification algorithm is used by our sequence to sequence model to predict the normalized text of the input token. Our approach requires far less training time than what has been reported by Google (5 days on their GPU cluster), while still performing well. We achieve an accuracy of 97.62%, which is impressive given the resources we use. Our approach combines the best of both worlds: gradient boosting, the state of the art in most classification tasks, and sequence to sequence learning, the state of the art in machine translation. We present our experiments and report results with various parameter settings.

KEYWORDS
encoder-decoder framework, deep learning, text normalization

1 INTRODUCTION
Within the last few years a major shift has taken place in speech and language technology: the field has been taken over by deep learning approaches. For example, at a recent NAACL conference well more than half the papers related in some way to word embeddings or deep or recurrent neural networks. This change is surely justified by the impressive performance gains to be had by deep learning, something that has been demonstrated in a range of areas from image processing, handwriting recognition, acoustic modeling in automatic speech recognition (ASR), parametric speech synthesis for text-to-speech (TTS), machine translation, parsing, and Go playing, to name but a few. While various approaches have been taken and some NN architectures have surely been carefully designed for the specific task, there is also a widespread feeling that with deep enough architectures, and enough data, one can simply feed the data to one's NN and have it learn the necessary function. In this paper we present an example of an application that is unlikely to be amenable to such a "turn-the-crank" approach. The example is text normalization, specifically in the sense of a system that converts from a written representation of a text into a representation of how that text is to be read aloud. The target applications are TTS and ASR - in the latter case mostly for generating language modeling data from raw written text. This problem, while often considered mundane, is in fact very important, and a major source of degradation of perceived quality in TTS systems in particular can be traced to problems with text normalization.

We start by describing the prior work in this area, which includes the use of RNNs in text normalization. We describe the dataset provided by Google and Kaggle, and then we discuss our approach and the experiments¹ we performed with different neural network architectures.

¹ https://github.com/shauryr/google_text_normalization

2 RELATED WORK
Text normalization has a long history in speech technology, dating back to the earliest work on full TTS synthesis (Allen et al., 1987). Sproat (1996) provided a unifying model for most text normalization problems in terms of weighted finite-state transducers (WFSTs). The first work to treat the problem of text normalization as essentially a language modeling problem was (Sproat et al., 2001). More recent machine learning work specifically addressed to TTS text normalization includes (Sproat, 2010; Roark and Sproat, 2014; Sproat and Hall, 2014).

In the last few years there has been a lot of work that focuses on social media (Xia et al., 2006; Choudhury et al., 2007; Kobus et al., 2008; Beaufort et al., 2010; Kaufmann, 2010; Liu et al., 2011; Pennell and Liu, 2011; Aw and Lee, 2012; Liu et al., 2012a; Liu et al., 2012b; Hassan and Menezes, 2013; Yang and Eisenstein, 2013). This work tends to focus on different problems from those of TTS: on the one hand, in social
media one often has to deal with odd spellings of words such as "cu l8r", "coooooooooooooooolllll", or "dat suxx", which are less of an issue in most applications of TTS; on the other hand, expansion of digit sequences into words is critical for TTS text normalization, but of no interest to the normalization of social media texts.

Some previous work, also on social media normalization, that has made use of neural techniques includes (Chrupała, 2014; Min and Mott, 2015). The latter work, for example, achieved second place in the constrained track of the ACL 2015 W-NUT Normalization of Noisy Text shared task (Baldwin et al., 2015), achieving an F1 score of 81.75%.

3 DATASET
The original work by Sproat and Jaitly uses 1.1 billion words of English text and 290 million words of Russian text. In this work we used a subset of the dataset submitted by the authors for the Kaggle competition² (Table 1). The dataset is derived from Wikipedia regions which could be decoded as UTF-8. The text is divided into sentences and run through the Google TTS system's Kestrel text normalization system to produce the normalized version of that text. A snippet is shown in Figure 1. As described in (Ebden and Sproat, 2014), Kestrel's verbalizations are produced by first tokenizing the input and classifying the tokens, and then verbalizing each token according to its semiotic class. The majority of the rules are hand-built using the Thrax finite-state grammar development system (Roark et al., 2012). Most ordinary words are of course left alone (represented here as <self>), and punctuation symbols are mostly transduced to <sil> (for "silence").

² https://www.kaggle.com/c/text-normalization-challenge-english-language

Sproat and Jaitly report that a manual analysis of about 1,000 examples from the test data suggests an overall error rate of approximately 0.1% for English. Note that although the test data were of course taken from a different portion of the Wikipedia text than the training and development data, nonetheless a huge percentage of the individual tokens of the test data - 99.5% in the case of English - were found in the training set. This in itself is perhaps not so surprising, but it does raise the concern that the RNN models may in fact be memorizing their results without doing much generalization.

Table 1: Kaggle Dataset

Data    No. of tokens
Train   9,918,442
Test    1,088,565

3.1 Data Exploratory Analysis
In total, only about 7% of the tokens in the training data, or about 660k tokens, were changed during the process of text normalization. This explains the high baseline accuracies we can achieve even without any adjustment of the test data input.

The authors of the challenge refer to the classes of tokens as semiotic classes. The classes can be seen in Figure 1. In total there are 16 classes. The "PLAIN" class is by far the most frequent, followed by "PUNCT" and "DATE"; "TIME", "FRACTION", and "ADDRESS" have the lowest numbers of occurrences (around or below 100 tokens each).
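The counts above are straightforward to reproduce. The short pandas sketch below shows one way to do so; it assumes the Kaggle training file en_train.csv with columns "class", "before" (written form), and "after" (normalized form), which is our reading of the data layout rather than something specified in this paper.

import pandas as pd

# Load the Kaggle training data (file name and column names are assumptions).
train = pd.read_csv("en_train.csv")

# Fraction of tokens actually changed by normalization (about 7% here).
changed = train["before"].astype(str) != train["after"].astype(str)
print("changed tokens:", changed.sum(), f"({changed.mean():.1%})")

# Distribution over the 16 semiotic classes: PLAIN and PUNCT dominate,
# while TIME, FRACTION and ADDRESS are rare.
print(train["class"].value_counts())

# Length of the tokens that need normalization; the longest strings
# mostly come from the ELECTRONIC class (URLs).
to_norm = train[changed]
print(to_norm["before"].astype(str).str.len().describe())
print(to_norm.groupby("class")["before"].apply(lambda s: s.astype(str).str.len().max()))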
Exploring the dataset further, we find that the "PLAIN" and "PUNCT" semiotic classes need no transformation, i.e. they need not be normalized. We exploit this fact to our advantage when we train our sequence to sequence text normalizer by feeding it only the tokens which need normalization. This reduces the burden on our model and filters out what may be noise for it. This is not to say that no "PLAIN" tokens changed - a small fraction did, for example "mr" to "mister" or "No." to "number" - but the fraction was too small to be considered for training our model. We also analyzed the length of the tokens to be normalized in the dataset. We find that short strings dominate our data, but longer ones with up to a few hundred characters can occur. This was common for the "ELECTRONIC" class, as it contains URLs, which can be long.

4 BASELINE
As mentioned above, most of the tokens in the test data also appear in the training data. We exploited this fact by holding the training data in memory and predicting the class of each test token from the training set. We also wrote a set of 16 functions, one for each semiotic class, to normalize tokens of that class. Using the predicted class, we apply the corresponding regular-expression-based function to normalize the test data. We understand this is not the correct way to solve the task, but it provides a very good and competitive baseline for our algorithm. We score 98.52% on the test data using this approach. This also draws the line that tells us whether our model is better or worse than memorizing the data.

5 METHODOLOGY
Our approach models the task as a combined classification and translation problem. The model has two major parts: a classifier which determines the tokens that need to be normalized, and a sequence to sequence model that normalizes the non-standard tokens (Figure 2). We first explain the training and testing process, then we explain the classifier and the sequence to sequence model in more detail.

Figure 2 shows the whole process of training and testing. We trained the classifier and the sequence to sequence model individually and in parallel. The training set has 16 classes, 2 of which do not need any normalization, so we separated the tokens of those two classes from the others and fed only the tokens of the remaining 14 classes to the sequence to sequence model. The classifier, on the other hand, is trained on the whole dataset, since it needs to distinguish between standard and non-standard tokens.

Once training is done, we have a two-stage pipeline. Raw data is fed to the classifier, which splits the tokens into two sets. Those that do not need to be normalized are left alone. Those that need to be normalized are passed to the sequence to sequence model, which converts the non-standard tokens to their standard forms.
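To make the two-stage pipeline concrete, the sketch below shows one way the inference-time routing could look. The classifier and sequence to sequence interfaces (predict_class, decode) are hypothetical stand-ins for our XGBoost classifier and encoder-decoder model, and prepending the predicted class label to the character sequence follows the label-plus-token input discussed in the results analysis; the sketch illustrates the data flow, not our exact implementation.

# Minimal sketch of the two-stage inference pipeline described above.
# `clf` stands in for the trained classifier (e.g. XGBoost) and
# `seq2seq` for the trained encoder-decoder; the method names
# predict_class and decode are hypothetical.

def needs_normalization(clf, token, context):
    # Stage 1: predict the semiotic class from the token and its context,
    # and flag everything outside PLAIN and PUNCT as non-standard.
    label = clf.predict_class(token, context)
    return label not in ("PLAIN", "PUNCT"), label

def normalize(seq2seq, token, label):
    # Stage 2: label-conditioned normalization. The predicted class is fed
    # together with the character sequence, e.g. ["DATE", "2", "0", "1", "6"].
    source = [label] + list(token)
    return seq2seq.decode(source)

def normalize_sentence(clf, seq2seq, tokens):
    output = []
    for token in tokens:
        non_standard, label = needs_normalization(clf, token, tokens)
        if non_standard:
            output.append(normalize(seq2seq, token, label))
        else:
            output.append(token)  # PLAIN and PUNCT tokens pass through unchanged
    return output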
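For comparison, the memorization baseline of Section 4 can be sketched in a similar way: remember the most frequent semiotic class seen for each written token in the training data, then dispatch each test token to the corresponding hand-written normalizer. File names, column names, and the fallback behaviour below are assumptions made for illustration.

import pandas as pd

# Kaggle files and column names are assumed, as in the earlier sketch.
train = pd.read_csv("en_train.csv")
test = pd.read_csv("en_test.csv")

# Most frequent semiotic class observed for each written token in train.
class_lookup = (
    train.groupby("before")["class"]
    .agg(lambda s: s.value_counts().idxmax())
    .to_dict()
)

def normalize_by_class(token, cls):
    # Stand-in for the 16 hand-written, largely regex-based normalizers;
    # only the trivial classes are handled in this sketch.
    if cls in ("PLAIN", "PUNCT"):
        return token
    return token  # class-specific expansion (DATE, CARDINAL, ...) goes here

predictions = [
    normalize_by_class(tok, class_lookup.get(tok, "PLAIN"))
    for tok in test["before"].astype(str)
]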
Table 5: Results Analysis of Seq2Seq Model - the prediction gets worse as we go down the table.

Class        Input                       Prediction
CARDINAL     1341833                     one million three hundred fourteen thousand eight hundred thirty three
TELEPHONE    0-89879-762-4               o sil eight nine eight seven seven sil nine six two sil four
MONEY        14 trillion won             fourteen won
VERBATIM     ω                           wmsb
LETTERS      mdns                        cftt
ELECTRONIC   www.sports-reference.com    w w r w dot t i s h i s h e n e n e dot c o m
Also for the ELECTRONIC class the window size of the encoder input was the constraint. We can see from Table 5 that it starts well, but as the sequence gets longer it predicts irrelevant characters. We believe increasing the encoder sequence length can improve this aspect of our model.

Table 5 shows the results. For three classes, DATE, CARDINAL, and DIGIT, the model works very well, and the accuracy is very close to Google's model. For example, in the case of the token '2016', the model can distinguish the different concepts very well and outputs the correct tokens. We think this is because we feed the label together with the tokens to the sequence to sequence model, so it learns the differences between these classes quite well.

The next three classes show acceptable results. The model has some difficulties with telephone numbers, big cardinal numbers, and the MONEY class. The errors are not very bad: in most cases one word is missed or the order is reversed. We got low accuracy on the last three classes shown in Table 5. We see that the VERBATIM and ELECTRONIC classes have the lowest accuracy. For VERBATIM we think the reason is the size of the vocabulary: since this class consists of special characters that have low frequency in the dataset, a larger vocabulary could have improved the accuracy a lot. For the ELECTRONIC class we think a larger encoder size would be very helpful, since this class has tokens of up to length 40, which do not fit in the encoder we used.

7 CONCLUSION
In this project we proposed a model for the task of text normalization. We present a context-aware classification model and how we used it to clear out "noisy" samples. We then discuss our model, which at its core is a sequence to sequence model that takes in the label and the input sequence and predicts the normalized sequence based on the label. We share our insights and analysis, with examples of where our model shines and where we can improve, and we list possible ways of improving the results further. We compare our results with the state-of-the-art results and show that, given limited computation power, we can achieve promising results. This project helped us understand sequence to sequence models and the related classification tasks very well. We also learned how much parameter tuning can affect the results and how small changes can make a big difference. We could also try bidirectional RNNs, as we saw that the model was less accurate on longer sequences.

Finally, we conclude that higher accuracy can be achieved by having a very good classifier. The classifier has an important role in this model and there is still a lot of room for improvement. Using an LSTM instead of XGBoost could have made the classifier stronger, but we focused mostly on the sequence to sequence model, as we wanted to understand and implement it ourselves. Due to the lack of time and limited resources we could not try this, and we list it as future work.

REFERENCES
Sproat, R. (1996). Multilingual text analysis for text-to-speech synthesis. Natural Language Engineering, 2(4), 369-380.
Sproat, R., Black, A., Chen, S., Kumar, S., Ostendorf, M., & Richards, C. (1999). Normalization of non-standard words: WS'99 final report. Johns Hopkins University.
Sproat, R. (2010, December). Lightly supervised learning of text normalization: Russian number names. In Spoken Language Technology Workshop (SLT), 2010 IEEE (pp. 436-441). IEEE.
Xia, Y., Wong, K. F., & Li, W. (2006, July). A phonetic-based approach to Chinese chat text normalization. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (pp. 993-1000). Association for Computational Linguistics.
Choudhury, M., Saraf, R., Jain, V., Mukherjee, A., Sarkar, S., & Basu, A. (2007). Investigation and modeling of the structure of texting language. International Journal on Document Analysis and Recognition, 10(3), 157-174.