
DeepNorm - A Deep learning approach to Text Normalization

Shaurya Rohatgi, Pennsylvania State University, State College, Pennsylvania, [email protected]
Maryam Zare, Pennsylvania State University, State College, Pennsylvania, [email protected]

IST 597-003 Fall '17, December 2017, State College, PA, USA
arXiv:1712.06994v1 [cs.CL] 17 Dec 2017

ABSTRACT
This paper presents a simple yet effective approach to the challenge posed by Sproat and Jaitly (2016): given a large corpus of written text aligned to its normalized spoken form, train an RNN to learn the correct normalization function. Normalizing a token in isolation seems straightforward, but once the context in which the token is used has to be taken into account, normalization becomes tricky for some classes. We present a novel approach in which the prediction of our classification algorithm is used by our sequence to sequence model to predict the normalized text of the input token. Our approach takes far less time to train than the 5 days on a GPU cluster reported by Google, and still performs well. We achieve an accuracy of 97.62%, which is impressive given the resources we use. Our approach combines the best of both worlds: gradient boosting, the state of the art in most classification tasks, and sequence to sequence learning, the state of the art in machine translation. We present our experiments and report results with various parameter settings.

KEYWORDS
encoder-decoder framework, deep learning, text normalization

1 INTRODUCTION
Within the last few years a major shift has taken place in speech and language technology: the field has been taken over by deep learning approaches. For example, at a recent NAACL conference well more than half the papers related in some way to word embeddings or deep or recurrent neural networks. This change is surely justified by the impressive performance gains to be had by deep learning, something that has been demonstrated in a range of areas from image processing, handwriting recognition, acoustic modeling in automatic speech recognition (ASR), parametric speech synthesis for text-to-speech (TTS), machine translation, parsing, and Go playing, to name but a few. While various approaches have been taken and some NN architectures have surely been carefully designed for the specific task, there is also a widespread feeling that with deep enough architectures, and enough data, one can simply feed the data to one's NN and have it learn the necessary function. In this paper we present an example of an application that is unlikely to be amenable to such a "turn-the-crank" approach. The example is text normalization, specifically in the sense of a system that converts from a written representation of a text into a representation of how that text is to be read aloud. The target applications are TTS and ASR - in the latter case mostly for generating language modeling data from raw written text. This problem, while often considered mundane, is in fact very important, and a major source of degradation of perceived quality in TTS systems in particular can be traced to problems with text normalization.
We start by describing prior work in this area, which includes the use of RNNs in text normalization. We then describe the dataset provided by Google and Kaggle, and discuss our approach and the experiments¹ we performed with different neural network architectures.

¹ https://github.com/shauryr/google_text_normalization

2 RELATED WORK
Text normalization has a long history in speech technology, dating back to the earliest work on full TTS synthesis (Allen et al., 1987). Sproat (1996) provided a unifying model for most text normalization problems in terms of weighted finite-state transducers (WFSTs). The first work to treat the problem of text normalization as essentially a language modeling problem was (Sproat et al., 2001). More recent machine learning work specifically addressed to TTS text normalization includes (Sproat, 2010; Roark and Sproat, 2014; Sproat and Hall, 2014).
In the last few years there has been a lot of work that focuses on social media (Xia et al., 2006; Choudhury et al., 2007; Kobus et al., 2008; Beaufort et al., 2010; Kaufmann, 2010; Liu et al., 2011; Pennell and Liu, 2011; Aw and Lee, 2012; Liu et al., 2012a; Liu et al., 2012b; Hassan and Menezes, 2013; Yang and Eisenstein, 2013).
This work tends to focus on different problems from those of TTS: on the one hand, in social media one often has to deal with odd spellings of words such as "cu 18r", "coooooooooooooooolllll", or "dat suxx", which are less of an issue in most applications of TTS; on the other, expansion of digit sequences into words is critical for TTS text normalization, but of no interest to the normalization of social media texts.
Some previous work on social media normalization that has made use of neural techniques includes (Chrupała, 2014; Min and Mott, 2015). The latter work, for example, achieved second place in the constrained track of the ACL 2015 W-NUT shared task on Normalization of Noisy Text (Baldwin et al., 2015), with an F1 score of 81.75%.

3 DATASET
The original work by Sproat and Jaitly uses 1.1 billion words of English text and 290 million words of Russian text. In this work we use a subset of the dataset submitted by the authors for the Kaggle competition² (Table 1). The dataset is derived from Wikipedia regions which could be decoded as UTF-8. The text is divided into sentences and run through the Google TTS system's Kestrel text normalization system to produce the normalized version of that text. A snippet is shown in Figure 1. As described in (Ebden and Sproat, 2014), Kestrel's verbalizations are produced by first tokenizing the input and classifying the tokens, and then verbalizing each token according to its semiotic class. The majority of the rules are hand-built using the Thrax finite-state grammar development system (Roark et al., 2012). Most ordinary words are of course left alone (represented here as <self>), and punctuation symbols are mostly transduced to <sil> (for "silence").

² https://www.kaggle.com/c/text-normalization-challenge-english-language

Sproat and Jaitly report that a manual analysis of about 1,000 examples from the test data suggests an overall error rate of approximately 0.1% for English. Note that although the test data were of course taken from a different portion of the Wikipedia text than the training and development data, nonetheless a huge percentage of the individual tokens of the test data - 99.5% in the case of English - were found in the training set. This in itself is perhaps not so surprising, but it does raise the concern that the RNN models may in fact be memorizing their results without doing much generalization.

Table 1: Kaggle Dataset
Data    No. of tokens
Train   9,918,442
Test    1,088,565

Figure 1: Train Data Semiotic Class Analysis (source: Kaggle). (a) Semiotic Class Distribution; (b) Tokens to be Transformed vs Non-Transformed.

3.1 Data Exploratory Analysis
In total, only about 7% of the tokens in the training data, or about 660k tokens, are changed during text normalization. This explains the high baseline accuracies we can achieve even without any adjustment of the test data input.
The authors of the challenge refer to the classes of tokens as semiotic classes. The classes can be seen in Figure 1. In total there are 16 classes. The "PLAIN" class is by far the most frequent, followed by "PUNCT" and "DATE"; "TIME", "FRACTION", and "ADDRESS" have the lowest number of occurrences (around or below 100 tokens each).
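For concreteness, the kind of exploratory analysis described above can be reproduced with a few lines of pandas. This is a minimal sketch, assuming the Kaggle en_train.csv file with columns sentence_id, token_id, class, before, and after; the file name and column names are as provided by the competition, not something defined in this paper.

    import pandas as pd

    # Kaggle training file; one row per token.
    train = pd.read_csv("en_train.csv")

    # Semiotic class distribution (cf. Figure 1a).
    print(train["class"].value_counts())

    # Fraction of tokens that actually change under normalization (cf. Figure 1b):
    # roughly 7% of all tokens, i.e. about 660k tokens.
    changed = train["before"].astype(str) != train["after"].astype(str)
    print(f"changed tokens: {changed.sum()} ({changed.mean():.2%})")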
Exploring the dataset further, we find that the "PLAIN" and "PUNCT" semiotic classes do not need transformation, i.e. they need not be normalized. We exploit this fact to our advantage when we train our sequence to sequence text normalizer, by feeding it only the tokens which need normalization. This reduces the burden on our model and filters out what may be noise for it. This is not to say that no "PLAIN" tokens change at all - some do, for example "mr" to "mister" or "No." to "number" - but the fraction was too small to be worth including in training.
We also analyzed the length of the tokens to be normalized. We find that short strings dominate the data, but longer ones of up to a few hundred characters can occur. This is common for the "ELECTRONIC" class, as it contains URLs, which can be long.

4 BASELINE
As mentioned above, most of the tokens in the test data also appear in the train data. We exploited this fact by holding the train set in memory and predicting the class of a test token by looking it up in the train set.
We then wrote a set of 16 functions, one per semiotic class, to normalize tokens of that class. Using the predicted class, we applied the corresponding regular-expression function to normalize the test data. We understand this is not the "correct" way to solve the task, but it provides a very good and competitive baseline for our algorithm: we score 98.52% on the test data using this approach. It also defines the line for judging whether our model is better or worse than simply memorizing the data.
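A minimal sketch of this baseline is shown below, reusing the train frame loaded in the previous sketch. The two rule functions are illustrative stand-ins for our 16 hand-written normalizers, not the exact code we used.

    import re

    # Baseline: memorize the train-set mapping token -> (semiotic class, normalized form),
    # keeping the first occurrence of each token.
    lookup = {}
    for before, cls, after in train[["before", "class", "after"]].itertuples(index=False):
        lookup.setdefault(before, (cls, after))

    def normalize_letters(token):
        # e.g. "mdns" -> "m d n s"
        return " ".join(re.sub(r"[^A-Za-z]", "", token).lower())

    # One hand-written function per semiotic class in the real baseline; classes without
    # a rule here (e.g. PLAIN, PUNCT) fall through to the identity.
    RULES = {"LETTERS": normalize_letters, "VERBATIM": lambda t: t}

    def baseline_normalize(token):
        # Predict the class by train-set lookup (~99.5% of test tokens are covered),
        # then apply the hand-written rule for that class.
        cls = lookup.get(token, ("PLAIN", token))[0]
        return RULES.get(cls, lambda t: t)(token)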
Figure 2: Our Model for Kaggle's Text Normalization Challenge

5 METHODOLOGY
Our approach models the problem as a combined classification and translation task. The model has two major parts: a classifier, which determines the tokens that need to be normalized, and a sequence to sequence model, which normalizes the non-standard tokens (Figure 2). We first explain the training and testing process, and then explain the classifier and the sequence to sequence model in more detail.
Figure 2 shows the whole process of training and testing. We trained the classifier and the sequence to sequence model individually and in parallel. The training set has 16 classes, 2 of which do not need any normalization, so we separated the tokens of those two classes from the others and fed only the tokens of the remaining 14 classes to the sequence to sequence model. The classifier, on the other hand, is trained on the whole data set, since it needs to distinguish between standard and non-standard tokens.
Once training is done, we have a two-stage pipeline. Raw data is fed to the classifier, which splits the tokens into two sets: those that do not need to be normalized are left alone, while those that do are passed to the sequence to sequence model, which converts them to their standard forms. Finally, the output is merged with the tokens that the classifier marked as standard, giving the final result.
We now explain the classifier and the normalizer in more detail.

5.1 Context Aware Classification Model (CAC)
Detecting the semiotic class of a token is the key part of this task: once we have determined the class of a token correctly, we can normalize it accordingly. The usage of a token in a sentence determines its semiotic class, so the surrounding tokens play an important role in determining the class of the token in focus. This is especially true for differentiating between classes like DATE and CARDINAL: for example, CARDINAL 2016 is normalized as "two thousand and sixteen", while DATE 2016 is "twenty sixteen", so the surrounding context is very important.
Our context aware classification model is illustrated in Figure 3. We choose a window size k and represent every character in the token by its ASCII value, padding the empty window positions with zeros. We use the k characters preceding and the k characters following the token in focus. This helps the classifier understand the context in which the token in focus has been used. We use the vanilla gradient boosting algorithm without any parameter tuning; other experiment details are in the next section.

Figure 3: Context Aware Classification Model - XGBoost Semiotic Class Classifier
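A minimal sketch of the CAC classifier follows. The exact feature layout is our assumption (here: k context characters on each side plus the first MAX_TOKEN_LEN characters of the token itself, all as zero-padded ASCII codes), and windows / class_ids stand for the extracted (previous context, token, next context) triples and the integer-encoded semiotic classes; they are placeholders, not variables defined in the paper.

    import numpy as np
    import xgboost as xgb

    K = 10              # context window: characters taken before and after the token in focus
    MAX_TOKEN_LEN = 30  # characters of the token itself that are kept (our assumption)

    def encode(s, length):
        """ASCII codes of the first `length` characters, zero-padded to a fixed width."""
        codes = [ord(c) if ord(c) < 256 else 0 for c in s[:length]]
        return codes + [0] * (length - len(codes))

    def ascii_features(prev_context, token, next_context):
        return encode(prev_context[-K:], K) + encode(token, MAX_TOKEN_LEN) + encode(next_context[:K], K)

    # windows: iterable of (previous context, token, next context) strings;
    # class_ids: the corresponding integer-encoded semiotic classes (16 classes).
    X = np.array([ascii_features(p, t, n) for p, t, n in windows])
    y = np.array(class_ids)

    # Vanilla gradient boosting without parameter tuning; a 10% split serves as the dev set
    # (early stopping, as in the paper, can be enabled via xgboost's early_stopping_rounds option).
    split = int(0.9 * len(X))
    clf = xgb.XGBClassifier(objective="multi:softmax")
    clf.fit(X[:split], y[:split], eval_set=[(X[split:], y[split:])])
    print(clf.score(X[split:], y[split:]))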
Figure 4: Sequence to Sequence Model

5.2 Sequence to Sequence Model
In this section we explain the sequence to sequence model in detail. We used a 2-layer LSTM reader that reads the input tokens, a layer of 256 attentional units, an embedding layer, and a 2-layer decoder that produces word sequences. We used Gradient Descent with decay as the optimizer.
The encoder gets the input sequence (x_1, x_2, ..., x_{T_1}), and the decoder gets the encoded sequence (h_1, h_2, ..., h_{T_1}) as well as the previous hidden state s_{t-1} and previous token y_{t-1}, and outputs the sequence (y_1, y_2, ..., y_{T_2}). The following steps are executed by the decoder to predict the next token:

    r_t = σ(W_r y_{t-1} + U_r s_{t-1} + C_r c_t)
    z_t = σ(W_z y_{t-1} + U_z s_{t-1} + C_z c_t)
    g_t = tanh(W_p y_{t-1} + U_p (r_t ◦ s_{t-1}) + C_p c_t)        (1)
    s_t = (1 - z_t) ◦ s_{t-1} + z_t ◦ g_t
    y_t = σ(W_o y_{t-1} + U_o s_{t-1} + C_o c_t)

The model first computes a fixed-dimensional context vector c_t, which is the weighted sum of the encoded sequence. The reset gate r_t controls how much information from the previous hidden state s_{t-1} is used to create the proposal hidden state g_t. The update gate z_t controls how much of the proposal we use in the new hidden state s_t. Finally, we calculate the t-th output token with a simple one-layer neural network using the context vector, the hidden state, and the previous token.
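To make the decoder update concrete, here is a direct NumPy transcription of one step of Eq. (1). The weight shapes and random initialisation are placeholders for the trained parameters, and the attention weights alpha are assumed to be given; this is a sketch of the recurrence, not the training code.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    H, E = 256, 256  # hidden state size and previous-output size (for illustration)
    rng = np.random.default_rng(0)
    W_r, W_z, W_p, W_o = (rng.normal(scale=0.01, size=(H, E)) for _ in range(4))
    U_r, U_z, U_p, U_o = (rng.normal(scale=0.01, size=(H, H)) for _ in range(4))
    C_r, C_z, C_p, C_o = (rng.normal(scale=0.01, size=(H, H)) for _ in range(4))

    def decoder_step(y_prev, s_prev, h_enc, alpha):
        """One decoder step of Eq. (1). h_enc: encoder states (T, H); alpha: attention weights (T,)."""
        c_t = alpha @ h_enc                                             # context vector: weighted sum of encoder states
        r_t = sigmoid(W_r @ y_prev + U_r @ s_prev + C_r @ c_t)          # reset gate
        z_t = sigmoid(W_z @ y_prev + U_z @ s_prev + C_z @ c_t)          # update gate
        g_t = np.tanh(W_p @ y_prev + U_p @ (r_t * s_prev) + C_p @ c_t)  # proposal hidden state
        s_t = (1 - z_t) * s_prev + z_t * g_t                            # new hidden state
        y_t = sigmoid(W_o @ y_prev + U_o @ s_prev + C_o @ c_t)          # output of the one-layer network
        return y_t, s_t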
We fed tokens to the encoder in a window of size 20, with the first symbol being the class label (Figure 4). For example, if we want to get the normalized form of 2017, we feed it as <label> <2> <0> <1> <7> <PAD> ... <PAD>. In cases where the input is shorter than 20 symbols, we fill the empty slots with the reserved token <PAD>. The batch size is set to 64 and the vocabulary size to 100,000. We tried smaller vocabulary sizes, but since our data set is very sparse we did not get good accuracy; after enlarging the vocabulary the accuracy improved significantly.
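A small sketch of this input encoding is shown below. The bracketed symbol names are illustrative; in the actual model each symbol is mapped to an integer id from the 100,000-entry vocabulary.

    ENC_LEN = 20
    PAD = "<PAD>"

    def encode_input(label, token):
        """Build the 20-slot encoder input: the semiotic class label first,
        then one symbol per character, then padding."""
        symbols = [f"<{label}>"] + [f"<{c}>" for c in token]
        return (symbols + [PAD] * ENC_LEN)[:ENC_LEN]

    # encode_input("DATE", "2017")
    # -> ['<DATE>', '<2>', '<0>', '<1>', '<7>', '<PAD>', ..., '<PAD>']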
6 EXPERIMENTS AND RESULTS

6.1 Classification
For classification we use gradient boosted trees with the default parameters and early stopping, using the XGBoost³ module for Python. Table 2 shows the results for different window sizes. We used 10% of the train data as our validation set. Training this classifier on 9 million tokens takes a long time, on the order of 22 hours.

Table 2: Context aware classification model - varying window size
Window Size   Dev Set Accuracy
10            99.8087
20            99.7999
40            99.7841

We see an interesting behavior: as the window size is decreased, the classifier's accuracy increases. This behavior is reasonable, as most of the tokens are short (less than 10 characters), and the starting characters and the surrounding context of the long tokens are enough to determine their semiotic class. Once we have trained this classifier, we predict the classes for the test data and label each token with its semiotic class. These labeled tokens are then normalized by the sequence to sequence model, which we discuss in the following section.

³ http://xgboost.readthedocs.io/en/latest/get_started/index.html

6.2 Sequence to Sequence Model
We build our model using TensorFlow's⁴ Python module. Here are the other details of the model:
• Number of encoder/decoder layers: 2-3
• One embedding layer
• Number of hidden units: 256/128/64
• Encoder size: 20
• Decoder size: 25
• Latent space representation size: 256
• Vocabulary size: 100,000
• Optimizer: Gradient Descent with learning rate decay
• Number of epochs: 10
Every other parameter was left at the default value provided by the TensorFlow framework.

⁴ https://www.tensorflow.org/
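Our original implementation used TensorFlow's 2017-era seq2seq interface. Purely as an illustration of a model with roughly these hyperparameters, here is a minimal sketch in the current tf.keras functional API; the layer choices, the exponential learning-rate decay schedule, and the teacher-forced decoder input are our assumptions, not the exact code we ran.

    import tensorflow as tf
    from tensorflow.keras import layers

    VOCAB, EMB, HIDDEN, ENC_LEN, DEC_LEN = 100_000, 256, 256, 20, 25

    # Encoder: embedding + 2 LSTM layers over the 20-symbol input window.
    enc_in = layers.Input(shape=(ENC_LEN,), dtype="int32")
    e = layers.Embedding(VOCAB, EMB)(enc_in)
    e = layers.LSTM(HIDDEN, return_sequences=True)(e)
    enc_out, state_h, state_c = layers.LSTM(HIDDEN, return_sequences=True, return_state=True)(e)

    # Decoder: embedding + 2 LSTM layers, initialised from the final encoder state (teacher forcing).
    dec_in = layers.Input(shape=(DEC_LEN,), dtype="int32")
    d = layers.Embedding(VOCAB, EMB)(dec_in)
    d = layers.LSTM(HIDDEN, return_sequences=True)(d, initial_state=[state_h, state_c])
    d = layers.LSTM(HIDDEN, return_sequences=True)(d)

    # Attention over the encoder outputs, then a softmax over the 100k-symbol vocabulary.
    context = layers.Attention()([d, enc_out])
    probs = layers.Dense(VOCAB, activation="softmax")(layers.Concatenate()([d, context]))

    model = tf.keras.Model([enc_in, dec_in], probs)
    lr = tf.keras.optimizers.schedules.ExponentialDecay(0.5, decay_steps=10_000, decay_rate=0.96)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  loss="sparse_categorical_crossentropy")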
Table 4: Accuracy on test data - experiments with varying number of nodes and layers
              2 Layers   3 Layers
64 Nodes      97.46      97.41
128 Nodes     97.55      97.53
256 Nodes     97.62      97.60

Table 4 shows that the accuracy on the test data increases significantly as we increase the number of nodes on the encoder side. We also see that increasing the number of layers has very little effect. We wanted to experiment with more nodes, but given the time and resources available we could only try these parameter settings. The test data had approximately 60,000 tokens that needed to be normalized, and using such a model to predict their normalized versions took about 6 hours. We present the class-wise comparison of the results in Table 3. One thing to note here is that we evaluated our model on 600,000 samples, whereas Google reports results for only 20,000 samples. We can see that our model performs nearly as well as Google's RNN, but it also suffers in classes such as VERBATIM and ELECTRONIC. As discussed below, VERBATIM contains special characters from different languages, and we kept only the top 100,000 entries in our vocabulary (GPU memory constraints); we think that if the vocabulary size is increased we can achieve far better results. Also, for the ELECTRONIC class the window size of the encoder input was the constraint: we can see from Table 5 that the prediction starts well, but as the sequence gets longer the model predicts irrelevant characters. We believe increasing the encoder sequence length can improve this aspect of our model.

Table 3: Class-wise accuracy comparison with Google's RNN - our model comes close to the existing state-of-the-art deep learning model in some classes
Class         Google's RNN   CAC+Seq2seq
All           0.995          0.9762
PLAIN         0.999          -
PUNCT         1.00           -
DATE          1.00           0.998
LETTERS       0.964          0.818
CARDINAL      0.998          0.996
VERBATIM      0.990          0.252
MEASURE       0.979          0.955
ORDINAL       1.00           0.982
DECIMAL       0.995          0.993
ELECTRONIC    1.00           0.133
DIGIT         1.00           0.995
MONEY         0.955          0.824
FRACTION      1.00           0.847
TIME          1.00           0.872
ADDRESS       1.00           0.931

Table 5: Results analysis of the Seq2Seq model - the prediction gets worse as we go down the table
Semiotic Class   before                      after (predicted)
DATE             2016                        twenty sixteen
CARDINAL         2016                        two thousand and sixteen
DIGIT            2016                        two o one six
CARDINAL         1341833                     one million three hundred fourteen thousand eight hundred thirty three
TELEPHONE        0-89879-762-4               o sil eight nine eight seven seven sil nine six two sil four
MONEY            14 trillion won             fourteen won
VERBATIM         ω                           wmsb
LETTERS          mdns                        cftt
ELECTRONIC       www.sports-reference.com    w w r w dot t i s h i s h e n e n e dot c o m

Table 5 shows the results. For three classes, DATE, CARDINAL, and DIGIT, the model works very well, and the accuracy is very close to Google's model. For example, in the case of the token '2016', the model can distinguish the different concepts very well and outputs the correct tokens. We think this is because we feed the label together with the tokens to the sequence to sequence model, so it learns the differences between these classes quite well.
The next three classes show acceptable results. The model has some difficulties with telephone numbers, big cardinal numbers, and the MONEY class, but the errors are not very bad: in most cases one word is missed or the order is reversed.
We got low accuracy on the last three classes shown in Table 5. The VERBATIM and ELECTRONIC classes have the lowest accuracy. For VERBATIM we think the reason is the size of the vocabulary: since this class consists of special characters that have low frequency in the data set, a larger vocabulary could have improved the accuracy a lot. For the ELECTRONIC class we think a larger encoder size would be very helpful, as this class has tokens of up to length 40, which do not fit into the encoder we used.
7 CONCLUSION
In this project we proposed a model for the task of text normalization. We presented a context aware classification model and showed how we used it to filter out "noisy" samples. We then discussed our model, which at its core is a sequence to sequence model that takes in the class label and the input sequence and predicts the normalized sequence based on that label. We shared our insights and analysis, with examples of where our model shines and where it can improve, and listed possible ways of improving the results further. We compared our results with the state of the art and showed that, given limited computation power, we can achieve promising results. This project helped us understand sequence to sequence models and the related classification tasks very well. We also learned how much parameter tuning can affect the results: small changes make a big difference. We could also try bidirectional RNNs, since we saw that the model became less accurate as sequences grew longer.
Finally, we conclude that higher accuracy can be achieved by having a very good classifier. The classifier has an important role in this model and there is still a lot of room for improvement; using an LSTM instead of XGBoost could have made the classifier stronger. But we focused mostly on the sequence to sequence model, as we wanted to understand and implement it. Due to the lack of time and limited resources we could not try this, and we list it as future work.
REFERENCES
Sproat, R. (1996). Multilingual text analysis for text-to-speech synthesis. Natural Language Engineering, 2(4), 369-380.
Sproat, R., Black, A., Chen, S., Kumar, S., Ostendorf, M., & Richards, C. (1999). Normalization of non-standard words: WS'99 final report. Johns Hopkins University.
Sproat, R. (2010, December). Lightly supervised learning of text normalization: Russian number names. In Spoken Language Technology Workshop (SLT), 2010 IEEE (pp. 436-441). IEEE.
Xia, Y., Wong, K. F., & Li, W. (2006, July). A phonetic-based approach to Chinese chat text normalization. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (pp. 993-1000). Association for Computational Linguistics.
Choudhury, M., Saraf, R., Jain, V., Mukherjee, A., Sarkar, S., & Basu, A. (2007). Investigation and modeling of the structure of texting language. International Journal on Document Analysis and Recognition, 10(3), 157-174.
Marais, K. (2008). The wise translator: reflecting on judgement in translator education. Southern African Linguistics and Applied Language Studies, 26(4), 471-477.
Kaufmann, M., & Kalita, J. (2010, January). Syntactic normalization of Twitter messages. In International Conference on Natural Language Processing, Kharagpur, India.
Clark, E., & Araki, K. (2011). Text normalization in social media: progress, problems and applications for a pre-processing system of casual English. Procedia - Social and Behavioral Sciences, 27, 2-11.
Pennell, D., & Liu, Y. (2011, May). Toward text message normalization: Modeling abbreviation generation. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (pp. 5364-5367). IEEE.
Liu, F., Weng, F., & Jiang, X. (2012, July). A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (pp. 1035-1044). Association for Computational Linguistics.
Hassan, H., & Menezes, A. (2013, August). Social text normalization using contextual graph random walks. In ACL (1) (pp. 1577-1586).
Yang, Y., & Eisenstein, J. (2013, October). A log-linear model for unsupervised text normalization. In EMNLP (pp. 61-72).
Chrupała, G. (2014). Normalizing tweets with edit scripts and recurrent neural embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Vol. 2, pp. 680-686). Baltimore, Maryland: Association for Computational Linguistics.
Min, W., Leeman-Munk, S. P., Mott, B. W., James, C. L. I., & Cox, J. A. (2015). U.S. Patent Application No. 14/967,619.
Baldwin, T., de Marneffe, M. C., Han, B., Kim, Y. B., Ritter, A., & Xu, W. (2015). Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. In Proceedings of the Workshop on Noisy User-generated Text (pp. 126-135).
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (pp. 3104-3112).
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794). ACM.
