Bangla Text Normalization For Text-To-speech Synthesizer Using Machine Learning
Keywords: Text-to-speech; Text normalization; Machine learning algorithms; Large Bangla text corpus; ROC curve; Confusion matrices

Abstract

Text normalization (TN) for a text-to-speech (TTS) synthesizer is the transformation of non-standard words, such as times, ordinal numbers, equations, ranges, dates, etc., into standard words that are close to their pronunciations. Text normalization is an essential part of every TTS synthesizer; without it, the voice generated by the TTS synthesizer will be unintelligible. Because of the unsatisfactory performance of previous research, a text normalization method for the Bangla language is proposed in this paper. At first, we have produced a tokenized dataset with a semiotic class for each token, using regular expressions over a Bangla corpus. Then, each token has been trained using the XGBClassifier algorithm. After that, the method identifies the semiotic class for each token in a new Bangla text corpus using the trained XGBClassifier model. Finally, it produces a normalized text for each token by calling the class function according to the predicted class. This text normalization method will help the Bangla TTS synthesizer produce more intelligible voices. The token classification accuracy of this method is 99.997 %.
* Corresponding authors.
E-mail addresses: [email protected] (Md.R. Islam), [email protected] (A. Ahmad), [email protected] (M.S. Rahman).
https://doi.org/10.1016/j.jksuci.2023.101807
Received 11 July 2023; Received in revised form 16 September 2023; Accepted 16 October 2023
Available online 20 October 2023
1319-1578/© 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. With a 99.997 % accuracy rate, this research work has surpassed the earlier accuracy for text normalization reported by researchers in other languages.
2. To find the optimum-accuracy classifiers, 19 machine learning methods, together with word embedding and oversampling, are applied to the Bangla text corpus in this research.
3. The highest accuracy among the 19 classifiers has been demonstrated by XGBClassifier and LightGBM with dart, with scores of 99.997 % and 99.993 %, respectively. The results of these two classifiers are displayed in Fig. 3.

The remainder of the paper is organized as follows. Section 2 is a collection of related works. Section 3 provides an overview of text-to-speech synthesizers, Bangla TTS, and text normalization. Different machine learning techniques and word embedding are covered in Sections 4 and 5. In Section 6, the token classification technique for text normalization is illustrated; Section 6.1 discusses the token classification method, and Section 6.2 presents the results and analysis for the token classification approach, where the results of the 19 classifiers are explained using precision, accuracy, F1-score, recall, heatmaps, and ROC curves. The full procedure of the proposed text normalization method is discussed in Section 7. The work of this research is concluded in Section 8.

Fig. 2. Text normalization (TN) and inverse text normalization (ITN).

2. Related works

The text-to-speech (TTS) synthesizer depends significantly on text normalization, and a large amount of work has already been done for well-resourced languages; in speech technology, text normalization has a long history. From Text to Speech: The MITalk System is an early work in which text normalization was discussed (Allen et al., 1987). In 1996, Sproat proposed a weighted finite-state transducer (WFST) based approach that can solve the majority of text normalization issues and provides the text analysis component of the multilingual TTS synthesizer from Bell Labs (Sproat, 1996). The transducers permit declarative descriptions of lexicons, morphological rules, numeral-expansion rules, and phonological rules that have been built using a lexical toolkit. French, German, Spanish, Russian, Mandarin, Italian, Romanian, and Japanese are among the eight languages to which the proposed approach has been applied. The normalization method for non-standard words (NSWs) was the initial attempt to resolve the text normalization issue, which is fundamentally a language modeling issue (Sproat et al., 2001). A taxonomy of NSWs was created based on four distinctly different forms of documents: news text, a newsgroup for recipes, a newsgroup for hardware products, and classified advertisements for real estate. The approach examined how different general techniques, such as n-gram language models, decision trees, and weighted finite-state transducers (WFSTs), may be applied to a variety of NSW types. More recently, different machine learning algorithms have been applied to TTS text normalization (Sproat, 2010; Ebden and Sproat, 2015). Some papers have experimented with different architectures such as LSTMs, attention-based RNN sequence-to-sequence models, transformer-based frameworks, Proteno, etc. (Sproat and Jaitly, 2016; Pramanik and Hussain, 2019; Lai et al., 2021; Tyagi et al., 2021; Ro et al., 2022). However, research on low-resourced languages like Bangla is very rare. A rule-based text normalization method was proposed for the Bangla language in (Alam et al., 2008); this first Bangla text normalization work used regular expressions in JFlex format. The authors proposed multiple semiotic classes for the Bangla language and used verbalization to expand a token. The performance of the method is 99 % for three classes: floating point, currency, and time. They also showed that the accuracy is 100 % for currency, 100 % for floating point, and 62 % for time. However, they did not explain the method for all types of semiotic classes such as date, time, etc., so the methodology is not sufficient to build a Bangla text corpus for the TTS synthesizer. In another paper, the authors explained multiple semiotic classes of text normalization and their ambiguities, with an accuracy below 90 % (Rashid et al., 2010). They used a rule-based approach together with a database approach: words that do not obey the rules are handled through a database after the rule-based method has been applied as a first normalization step. Google has described only a process of text normalization grammars for six low-resourced languages such as Bangla, Nepali, and Khmer (Sodimana et al., 2018). Therefore, a complete text normalization method of the Bangla language for the TTS synthesizer is proposed in this paper.

3. Text-to-Speech

A text-to-speech (TTS) synthesizer is a computer-based tool that can read text aloud. Text in TTS is either a computer input stream or scanned input that has been sent to an OCR engine (Sasirekha and Chandra, 2012). Very rapid improvement in this area has been made over the last two decades, and there are now numerous excellent TTS synthesizers available for commercial use. Recently, personal assistant applications have been growing in popularity since they enable users to communicate with devices like mobile phones and tablets using natural language. These applications use NLP approaches to deliver appropriate answers to users' questions. Numerous applications (such as Siri, Google Now, Cortana, and Robin) exist in this field. The TTS synthesizer has multiple real-life applications such as personalization, gaming, and digital assistants. Fig. 4 shows the different types of applications of the TTS synthesizer. It is possible to create a TTS synthesizer using both hardware and software. Fig. 5 shows the TTS synthesizer's whole architecture.
Table 1
Examples of translating written texts into spoken forms for the Bangla language.
There are two components to a TTS synthesizer: a front-end (module for text-to-phonetics) and a back-end (module for phoneme-to-speech) (Handley, 2005). There are two main purposes for the front end. First, it transforms unprocessed text that contains symbols like dates, ranges, times, and abbreviations into a spoken equivalent; text normalization is another name for this procedure. The front end then assigns each word a phonetic transcript, and the text is divided into prosodic units like clauses and sentences. Text-to-phoneme or grapheme-to-phoneme conversion is the process of assigning phonetic transcripts to words (Kim et al., 2021; Shamsi et al., 2020; Dutoit, 1997). The back-end, which is frequently referred to as the synthesizer, transforms the symbolic verbal representation into sound.

3.1. Bangla text-to-speech synthesizer

It is rare to find TTS development work in Bangla. Concatenative methods were initially utilized to create Bangla TTS. A unit-selection TTS synthesizer called Katha (Alam et al., 2007), created in 2007 by Firoz Alam and others from BRAC University, is based on the Festival toolkit (The festival speech synthesis system, 2019). In 2009, the diphone concatenation-based TTS synthesizer Subachan was developed by Abu Naser et al. at the Shahjalal University of Science and Technology (Naser et al., 2010). A Hidden Markov Model-based statistical parametric speech synthesis (SPSS) system was developed in 2012 (Mukherjee and Mandal, 2014). The use of DNNs for acoustic modeling in SPSS systems has considerably enhanced TTS research. The DNN-based Bangla TTS synthesizer's architecture is depicted in Fig. 6 (Raju et al., 2019). The major processing elements of this system include the text normalizer, the front-end processor, two deep neural networks (DNNs) for duration and acoustic modeling, and a vocoder for producing speech from acoustic characteristics.

3.2. Text normalization

Text normalization for a text-to-speech (TTS) synthesizer refers to the process of converting input text into a phonetic representation that can be accurately rendered into speech by a TTS synthesizer. It is an essential part of numerous applications for speech and natural language processing. The TTS synthesizer has gained popularity globally in recent years; it has made significant progress and now produces human-like speech quality. As an input, a normalized text is required to develop a TTS synthesizer. By transforming an unambiguous non-standard word into a standard spoken form, the input text corpus is processed into normalized text as part of the text normalization process. A text normalization method for the Bangla language is quite uncommon, which is a serious issue for the Bangla text-to-speech (TTS) synthesizer. To train a TTS synthesizer, researchers require a large unambiguous normalized text with spoken forms.

4. Classification algorithms

The 19 classifiers presented in this section, including the K-Nearest Neighbor classifier, Gaussian Naïve Bayes classifier, Multinomial Naïve Bayes classifier, Decision Tree classifier, Random Forest classifier, Bernoulli Naïve Bayes classifier, Logistic Regression, Gradient Boosting classifier, LGBMClassifiers with dart and gbdt, XGBClassifier, Linear Support Vector Machine, CatBoost, AdaBoost, Nearest Centroid classifier, Voting classifier, Bagging Classifier, and LightGBM, are used to measure performance. Then, the classifier with the highest accuracy rating among all of these models is evaluated.
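For illustration only (this sketch is not part of the original study), a set of such scikit-learn classifiers could be trained on pre-vectorized token features and ranked by accuracy as shown below; the names X_train, y_train, X_test, and y_test are placeholders for the embedded, balanced data described in Section 6.

from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Candidate classifiers (a subset of the 19 models compared in the paper).
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(n_estimators=100),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GradientBoosting": GradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "NearestCentroid": NearestCentroid(),
    "LinearSVM": LinearSVC(),
}

def rank_by_accuracy(models, X_train, y_train, X_test, y_test):
    # Fit every model on the same training split and score it on the test split.
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    # Highest-accuracy classifier first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)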
4.1. Linear Support Vector Machine

The linear support vector machine assigns a document to the class with the highest linear score:

C = argmax_m ( Wm^T · X + b )    (1)

where the predicted class for the document X is C, the bias term is b, the weight vector for class m is Wm, and the transpose operator is T.
4.2. Decision tree

For multi-class text classification, a decision tree provides a straightforward and understandable classification paradigm. The decision tree divides the feature space into a hierarchical structure of nodes. The decision tree's leaf nodes indicate the predicted classes, and the path from the root to a leaf node represents the decision rules that lead to the predicted class (Kowsari et al., 2019). Typically, the decision rules in a decision tree are threshold tests on the feature values, such as "if feature i > t, then go left, else go right". The following equation can be utilized to implement a decision tree for multi-class text classification:

C = f(X)    (2)

where X is a vector of feature values for the document, C is the predicted class for it, and f is the function that maps X to the predicted class C.

4.3. Random Forest

Multiple decision trees are used in Random Forest, an ensemble learning technique, to increase the classification's robustness and accuracy (Kowsari et al., 2019). In order to construct the final classification, it builds a number of decision trees based on various subsets of the training data and features. The Random Forest builds each decision tree by recursively dividing the training data according to the feature values. The decision tree must meet a stopping condition before it can cease splitting, such as a minimum number of training instances in each leaf node or a maximum depth of the tree. The primary goal of RF is to create random decision trees, as seen in Fig. 8. The following equation can be utilized to implement Random Forest for multi-class text classification:

C = argmax_{Cm} Σ_{k=1}^{n} I(Tk(D) = Cm)    (3)

where C is the predicted class for the document D, n is the number of decision trees in the random forest, Tk is the kth decision tree in the forest, and Cm is the mth class in the classification.

4.4. K-Nearest Neighbor

By locating the k training instances in the training set that are closest to the input document, the K-Nearest Neighbor (k-NN) classifier assigns the input document to the class with the highest frequency among its k neighbors (Kowsari et al., 2019). Using the equation below, the k-NN classifier can categorize text into many groups.

C = argmax_{Ck} Σ_{m=1}^{n} I(ym = Ck)    (4)

where C is the predicted class for the document D, n is the number of neighbors taken into consideration, ym is the label of the mth neighbor, Ck is the kth class in the classification task, and the indicator function I takes the value 1 if the condition ym = Ck is true and 0 otherwise.

The k-NN classifier can compare documents using a variety of distance metrics, including Euclidean distance, cosine similarity, and Jaccard similarity. The formulas for the Euclidean, Manhattan, and Minkowski distance metrics are shown in Eqs. (5), (6), and (7), respectively. The KNN architecture is illustrated in Fig. 9.

Euclidean distance = sqrt( Σ_{i=1}^{n} (pi − qi)^2 )    (5)

Manhattan distance = Σ_{i=1}^{n} |pi − qi|    (6)

Minkowski distance = ( Σ_{i=1}^{n} |pi − qi|^r )^(1/r)    (7)
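For illustration, and not taken from the paper, a k-NN text classifier with the three distance metrics of Eqs. (5)-(7) can be configured in scikit-learn through the Minkowski parameter p (p = 2 gives Euclidean, p = 1 Manhattan, other values the general Minkowski form); the feature matrices are assumed to be the word-embedding vectors described later.

from sklearn.neighbors import KNeighborsClassifier

# p = 2 -> Euclidean (Eq. 5), p = 1 -> Manhattan (Eq. 6), other p -> Minkowski (Eq. 7).
def make_knn(p, k=5):
    return KNeighborsClassifier(n_neighbors=k, metric="minkowski", p=p)

knn_euclidean = make_knn(p=2)
knn_manhattan = make_knn(p=1)
knn_minkowski = make_knn(p=3)   # r = 3 in Eq. (7)

# Each classifier is then used the same way, e.g.:
# knn_euclidean.fit(X_train, y_train); y_pred = knn_euclidean.predict(X_test)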
4.5. Nearest Centroid (Rocchio) classifier

Assigning new documents to the class whose centroid is closest to the document is the foundation of the Rocchio classification method (Kowsari et al., 2019). Each class is represented by its centroid in the feature space. The Rocchio algorithm's equation for multi-class text categorization looks like this:

C = argmax_i ( sim(X, Ci) )    (8)

where the similarity measure sim(X, Ci) compares the document X and the centroid Ci, C is the predicted class for the document X, Ci is the centroid of class i in the feature space, and the document X belongs to the predicted class C. One frequently employed similarity measure is the cosine similarity, which determines the cosine of the angle between the two vectors.

sim(X, Ci) = (X^T · Ci) / (‖X‖ · ‖Ci‖)    (9)

where T is the transpose operator, and ‖X‖ and ‖Ci‖ are the L2 norms of the vectors X and Ci. For the Nearest Centroid algorithm, the similarity metric is the following:

sim(X, Ci) = ‖X − Ci‖    (10)

4.6. Voting classifier

A multi-class text classification ensemble method called the Voting Classifier combines the predictions of different base classifiers into a single, concluding prediction. The Voting Classifier uses a majority vote or a weighted vote to aggregate the predictions of the base classifiers. Hard voting and soft voting are the two types; soft voting is used in this work.
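As an illustration only (the specific base estimators of the paper's Voting classifier are not listed here), a soft-voting ensemble can be set up in scikit-learn as follows; soft voting averages the predicted class probabilities and returns the class with the largest averaged probability.

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Soft voting: average the class probabilities of the base classifiers.
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("gnb", GaussianNB()),
    ],
    voting="soft",
)
# voting_clf.fit(X_train, y_train)
# y_pred = voting_clf.predict(X_test)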
4.7. Naïve Bayes classifier

P(X|C) is the likelihood probability of the data point X given class C, P(C|X) is the posterior probability of class C, p(C) is the prior probability of class C, and P(X) is the probability of the data point X across all classes, according to the Bayes theorem (Murphy, 2006).

p(C|X) = p(X, C) / p(X) = p(X|C) p(C) / Σ_{C'} p(X|C') p(C')    (13)

Since this model explains how to produce feature vectors for each potential class of C, it is known as a generative model.

C_MAP = argmax_{c ∈ C} p(C|X)    (14)

4.7.1. Gaussian Naïve Bayes classifier

The features are assumed to be continuous and to have a Gaussian distribution in this Naive Bayes classifier version (Xu, 2018). The Gaussian Naive Bayes classifier can be used to classify text into many categories using the following equation:

C = argmax( P(C1|D), P(C2|D), …, P(Ck|D) )    (15)

where the number of classes in the classification task is k, and P(Ci|D) is the posterior probability that the document D belongs to class Ci. C is the predicted class for the document D. The probability P(Ci|D) can be calculated using Bayes' theorem as follows:

P(Ci|D) = P(D|Ci) · P(Ci) / P(D)    (16)

where the prior probability for class Ci is P(Ci), P(D|Ci) is the probability of observing the input document D given class Ci, and P(D) represents the
probability of observing the input document D.

4.7.2. Multinomial Naïve Bayes classifier

A variation of the Naïve Bayes classifier made especially for discrete-feature text classification applications is the multinomial Naïve Bayes classifier. It is predicated on the idea that a data point's characteristics are derived from a multinomial distribution. In text classification, the classes are the possible categories that can be applied to the documents, and the characteristics are typically the frequency of occurrence of each word in the documents (Abbas et al., 2019). If there are n documents that fit into each of the k categories, where k ∈ {c1, c2, …, ck}, the predicted class as output is c ∈ C. The following formula is used to determine the probability that a document belongs to class C using the Bayes theorem:

P(C|D) = P(C) Π_i P(wi|C)^{ni} / P(D)    (20)

where the probability of observing wi in a given class C is P(wi|C), the prior probability of class C is P(C), the document is D, wi is the ith word in the document, and the number of times that wi is mentioned in the document is ni.

P(wi|C) is the probability of observing wi given class C, where the prior probability of class C is P(C), D is the document, wi is the ith word in the document, and word i's presence or absence in the document D is indicated by the binary variable xi, which has a value of either 1 or 0.

4.8. Logistic regression

As a logistic function of the input features, logistic regression models the likelihood of each class, and then utilizes maximum likelihood estimation to determine the model's parameters (Indra et al., 2016). The following is an equation for multi-class text categorization using logistic regression:

P(C|D) = exp( Σ_{i=1}^{n} wi · fi ) / Σ_c exp( Σ_{k=1}^{n} wk · fk )    (23)

where the number of features in the input document D is n, P(C|D) represents the probability of a class given the input document D, fi is the feature vector for document D, wi is the weight vector for class i in the logistic regression model, and Σ_c exp( Σ_{k=1}^{n} wk · fk ) is the sum of the exponential weights over all classes.
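As a hedged illustration of Eqs. (20) and (23), rather than the authors' exact setup, the two models can be fitted on word-count features in scikit-learn as follows (multinomial NB requires non-negative count features, hence the CountVectorizer here):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Multinomial NB over word frequencies (Eq. 20).
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())

# Multinomial logistic regression (softmax over class scores, Eq. 23).
lr_model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# docs = ["...", "..."]; labels = ["DATE", "TIME", ...]
# nb_model.fit(docs, labels); lr_model.fit(docs, labels)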
C = argmax_c ( Σ_{k=1}^{n} αk · hk(x) )    (24)

where C is the projected class label for the input document D, argmax_c selects the class with the highest predicted score, αk is the weight given to the kth weak model, and hk(x) is the prediction of the kth weak model for the input x. Gradient boosting models are gaining popularity (Natekin and Knoll, 2013; Biswas et al., 2022). The gradient step size is represented by Eq. (25).

ρt = argmin_ρ Σ_{i=1}^{N} ψ[ yi, f̂_{t−1}(xi) + ρ h(xi, θi) ]    (25)
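A small sketch, not taken from the paper, of the two boosting ideas behind Eqs. (24) and (25): AdaBoost combines weak learners through the weights αk, while gradient boosting fits each new learner to the loss gradient and scales it by a step size analogous to ρt.

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# AdaBoost: weighted vote of weak learners (depth-1 trees by default), cf. Eq. (24).
ada = AdaBoostClassifier(n_estimators=100)

# Gradient boosting: each stage fits the negative gradient of the loss; the learning rate
# plays the role of the step size rho_t in Eq. (25).
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)

# ada.fit(X_train, y_train); gb.fit(X_train, y_train)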
where the difference between the target value (Yk) and the forecast value (Ŷk) is measured by the differentiable convex loss function l, and Ω penalizes model complexity to prevent over-fitting.

4.13. CatBoost

The fundamental predictors of CatBoost, a gradient boosting technique, are binary decision trees (Prokhorenkova et al., 2018). Assume that we have observed a dataset with samples D = (Xk, yk), k = 1, 2, …, m, where Xk = (xk1, xk2, …, xkn) is a vector of n characteristics and the response feature yk ∈ R can be encoded as a numerical feature or as a binary (yes or no) feature, i.e., 0 or 1. The samples (Xk, yk) are independently and identically distributed according to an unknown distribution p(·,·). The goal of the learning problem is to create a function H: R^n → R that minimizes the expected loss shown in (28).

L(H) = E L(y, H(x))    (28)

where L(·,·) is a loss function, and (X, y) is a sample of the testing data obtained from the training data.

4.14. LightGBM

When working with big amounts of data, conventional GBDT takes a long time. LightGBM is an effective and scalable GBDT implementation that speeds up training while maintaining respectable accuracy (Essa et al., 2023). The process of creating a decision tree is what makes standard GBDT computationally expensive: all of the data instances must be scanned to choose which feature to use as a split point to maximize the information gain. To get around this restriction, LightGBM is introduced. In order to find the appropriate split points for decision trees, LightGBM separates continuous feature values into bins and builds feature histograms using these bins. In regards to memory utilization and training time, this strategy performs better than the GBDT method. In contrast to many other tree-based learning algorithms like XGBoost, which grow the tree level by level, the LightGBM model splits the tree leaf-wise. The level-based tree development approach expands the tree structure level by level, whereas the leaf-based tree development method may form an unbalanced tree by always splitting the node with the greatest loss reduction. For the same tree depth, the trees produced by the level-wise approach usually contain more nodes than those produced by the leaf-wise method, so the training procedure may accelerate greatly when the dataset is large.

5. Word embedding

Typically, unstructured data sets include texts and documents. When using a classifier that incorporates mathematical modeling, it is necessary to convert these unstructured text sequences into a structured feature space. This process is called feature extraction. Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), Word Embedding, and GloVe are the most widely used feature extraction techniques (Kowsari et al., 2019). Each phrase or word from the dictionary is transformed into an N-dimensional vector of real values known as a word embedding using the feature learning method. A way to express a word as a vector that carries meaning is through word embeddings. It enables an almost equal representation of words with related meanings. For instance, the terms "sad" and "sorrow" are frequently used interchangeably; these words will therefore have similar vector representations. A variety of word embedding strategies have been proposed to make unigrams understandable as input for machine learning algorithms. This study focuses on Word2Vec, which is one of the most often used techniques for text classification. Fig. 12 shows how word embedding can represent any word as a two-dimensional vector.

5.1. Word2Vec

A technique called Word2Vec was put forth in 2013 by T. Mikolov and colleagues at Google (Mikolov et al., 2013; Mikolov et al., 2013) to effectively learn a stand-alone word embedding from a text corpus. Every word is projected to a point in a vector space using the Word2Vec technique. For each word, Word2Vec creates an N-dimensional vector using a shallow neural network. It proposes two model architectures:

• Continuous Skip-gram Model
• Continuous Bag-of-Words (CBOW) Model

The CBOW and Skip-gram models are effective resources for identifying connections and parallels between words.
Fig. 13 illustrates a straightforward CBOW model that looks for a word based on the words that have come before it, while Skip-gram looks for the words that might be close to each word. A particular target word is represented by a large number of context words in the CBOW model (Islam et al., 2020). For instance, "breakfast" as the target word would include "bread" and "egg" as context words. Another model architecture that has many similarities to CBOW is the continuous skip-gram model. This model seeks to maximize the classification of a word based on another word in the same phrase and predicts the current word based on its context.
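As an illustrative sketch (the parameter values are assumptions, not the paper's exact configuration), a CBOW Word2Vec model can be trained with Gensim (4.x API) and used to embed each token; sg=0 selects CBOW, while sg=1 would select Skip-gram.

import numpy as np
from gensim.models import Word2Vec

# tokenized_sentences: list of token lists taken from the Bangla corpus.
def train_cbow(tokenized_sentences, dim=100):
    # sg=0 -> CBOW (predict the target word from its context), window of 5 context words.
    return Word2Vec(sentences=tokenized_sentences, vector_size=dim, window=5,
                    min_count=1, sg=0, workers=4)

def embed_token(model, token):
    # Tokens unseen during training fall back to a zero vector.
    return model.wv[token] if token in model.wv else np.zeros(model.vector_size)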
Fig. 15. Word clouds of the Bangla text corpus for the text normalization method.

Table 2
The format of the training data list (columns: Class Name, Number).

Table 3
Training data number for each semiotic class (columns: Class Name, Number).

6. Token classification methodology

6.1. Method

This section provides a description of the token classification approach. Firstly, all data has been preprocessed and labeled using regular expressions. Secondly, the labeled text data has been converted to vectors using the word2vec CBOW model. Thirdly, the embedded text data has been balanced using the random oversampling data balancing technique. Fourthly, the data has been split into three parts: test data, training data, and validation data. Fifthly, the validation data is utilized for fine-tuning, and the training data has been used to train various machine learning algorithms. Finally, the performance of these algorithms has been estimated using the test data. In Fig. 14, the full working procedure for the token classification method is displayed, and the algorithm for this method is described in Algorithm 1.

Algorithm 1. (Working procedure of token classification for Bangla text normalization)

Input: Text normalized labeled datasets from 80k Bangla sentences
Output: Predicted class for each token (18 classes)
1. Begin
2. data ← load labeled dataset
3. X_data ← data[before]
4. y_data ← data[class]
5. X_ctw ← Word2Vec(X_data)
6. X_bal, y_bal ← balanced data of X_ctw, y_data using random oversampler
7. X_train, y_train, X_test, y_test, X_valid, y_valid ← split_data of X_bal and y_bal
8. model ← train model using X_train and y_train
9. predict ← testing model using X_test and y_test
10. compute performance using evaluation metrics
11. End
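The steps of Algorithm 1 could be wired together roughly as follows; this is a sketch under assumed file names, column names ("before", "class"), and split ratios, not the authors' released code, and it reuses the embed_token helper sketched above.

import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# Steps 2-5: load the labelled tokens and embed them with the trained CBOW model.
data = pd.read_csv("labeled_tokens.csv")                 # assumed file: token ("before") + semiotic class
w2v = Word2Vec.load("cbow_bangla.model")                 # assumed CBOW model from Section 5
X = np.vstack([embed_token(w2v, tok) for tok in data["before"]])
y = LabelEncoder().fit_transform(data["class"])          # 18 semiotic classes -> 0..17

# Step 6: balance the classes with random oversampling.
X_bal, y_bal = RandomOverSampler(random_state=42).fit_resample(X, y)

# Step 7: 49 % training, 21 % validation (fine-tuning), 30 % test.
X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal, test_size=0.30, stratify=y_bal)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.30, stratify=y_train)

# Steps 8-10: train the XGBoost classifier and report test-set performance.
clf = XGBClassifier(objective="multi:softmax", learning_rate=0.3, max_depth=10, n_estimators=100)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))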
Table 4
Written text examples for each semiotic class with normalized text.
1) Data Acquisition: The initial step in the token classification approach is data acquisition. A large corpus with more than 80,000 sentences has been collected from online sources, which is suitable for training a TTS synthesizer. The word clouds for the Bangla text corpus are presented in Fig. 15; the more frequently a term appears in the data, the larger it is shown.
2) Text preprocessing: The next step in the token classification approach is text preprocessing. In this step, the text preprocessor
Fig. 16. Class distribution of tokens before being balanced using oversampling.
Fig. 17. Class distribution of tokens after being balanced using oversampling.
Table 7
Results for each class using top six machine learning algorithms by percentage.
Class Name | XGBClassifier, LGBMClassifier with dart, and CatBoost (%): Prec. Rec. F1-s. | Voting Classifier (%): Prec. Rec. F1-s. | LightGBM (%): Prec. Rec. F1-s. | K-Nearest Neighbor Classifier (%): Prec. Rec. F1-s.
ABBREVIATION 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 0.99 0.97 0.98
BNG_TXT 1.00 1.00 1.00 0.98 1.00 0.99 1.00 0.99 0.99 0.97 0.92 0.95
CARDINAL 1.00 1.00 1.00 1.00 0.98 0.99 1.00 0.94 0.97 1.00 0.99 1.00
DATE 1.00 1.00 1.00 0.99 0.98 0.99 1.00 1.00 1.00 1.00 1.00 1.00
ENG_NUM 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00 1.00
ENG_TXT 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00
EQUATION 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00
FLOAT 1.00 1.00 1.00 0.97 1.00 0.98 0.99 1.00 0.99 0.99 1.00 1.00
FRACTIONAL 1.00 1.00 1.00 1.00 0.97 0.99 1.00 1.00 1.00 0.97 0.99 0.98
GAME 1.00 1.00 1.00 0.99 0.99 0.99 1.00 1.00 1.00 0.99 0.99 0.99
MONEY 1.00 1.00 1.00 0.96 0.99 0.98 0.99 1.00 0.99 0.96 0.96 0.96
ORDINAL 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.95 0.97 0.98 1.00 0.99
PERCENTAGE 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.99 0.99 0.99 1.00 1.00
PUNCTUATION 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00
RANGE 1.00 1.00 1.00 0.99 1.00 0.99 0.99 0.99 0.99 0.97 0.97 0.97
TELEPHONE 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.99 0.99 0.98 0.99 0.98
TIME 1.00 1.00 1.00 1.00 0.99 1.00 0.92 1.00 0.96 1.00 1.00 1.00
YEAR 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Accuracy 1.00 0.99 0.99 0.99
Macro avg. 1.00 1.00 1.00 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
Weighted avg. 1.00 1.00 1.00 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
Fig. 20. The average margin of different models according to the baseline.
5) Model Training: Training data, test data, and validation data were separated from the balanced text data. 49 % of the total data has been assigned for training, 30 % is used as test data, and 21 % is assigned for fine-tuning. We have trained 19 types of classifiers using the training data. In Fig. 18, the decrease of log-loss and error during XGBClassifier training is displayed. The hyperparameter values for XGBClassifier training are given below.

• objective: multi:softmax
• num_class: 18
• eta: 0.3
• max_depth: 10
• nthread: −1
• eval_metric: merror, mlogloss
• silent: 1
• iteration: 100
• verbose_eval: 10
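A sketch of how these hyperparameter values map onto an xgboost training call with the library's native API (the DMatrix construction and the split variables are assumptions; the deprecated "silent" flag is replaced by default verbosity in current releases):

import xgboost as xgb

params = {
    "objective": "multi:softmax",
    "num_class": 18,
    "eta": 0.3,
    "max_depth": 10,
    "nthread": -1,
    "eval_metric": ["merror", "mlogloss"],
}

dtrain = xgb.DMatrix(X_train, label=y_train)   # assumed pre-computed splits
dvalid = xgb.DMatrix(X_valid, label=y_valid)

# 100 boosting iterations; log the evaluation metrics every 10 rounds.
booster = xgb.train(params, dtrain, num_boost_round=100,
                    evals=[(dtrain, "train"), (dvalid, "valid")],
                    verbose_eval=10)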
6) Model Evaluation: We have determined the recall, precision, accuracy, and f1-score of the 19 types of classifiers, including the K-Nearest Neighbor classifier, Gaussian Naïve Bayes classifier, Multinomial Naïve Bayes classifier, Decision Tree classifier, Random Forest classifier, Bernoulli Naïve Bayes classifier, Logistic Regression, Gradient Boosting classifier, LGBMClassifiers with dart and gbdt, XGBClassifier, Linear Support Vector Machine, CatBoost, AdaBoost, Nearest Centroid classifier, Voting classifier, Bagging Classifier, and LightGBM, for the token classification method. Each classifier's performance on the positive and negative classes is assessed independently using four outcomes from the confusion matrix. For each classifier in the confusion matrix (shown in Table 5), true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions are given, corresponding to the two kinds of accurate predictions and the two kinds of inaccurate predictions. Table 6 and Fig. 19 contain the recall, precision, accuracy, and f1-score values for the top six classifiers. We also provide the top six classifiers' recall, precision, accuracy, and f1-score values for each semiotic class in Table 7.

Accuracy = (TP + TN) / (TP + FP + FN + TN) × 100    (29)
Fig. 21. Heatmaps for (a) XGBClassifier and (b) LightGBM classifier with dart algorithms.
Precision = Σ_{l=1}^{L} TPl / Σ_{l=1}^{L} (TPl + FPl) × 100    (30)

Recall = Σ_{l=1}^{L} TPl / Σ_{l=1}^{L} (TPl + FNl) × 100    (31)

F1-score = Σ_{l=1}^{L} 2TPl / Σ_{l=1}^{L} (2TPl + FPl + FNl) × 100    (32)

Receiver operating characteristic curves, often known as ROC curves, are helpful visual tools for classifier evaluation. However, class imbalances (variations in prior class probabilities) may prevent ROC curves from accurately depicting the performance of the classifier. On the ROC curve, the true positive rate (TPR) and false positive rate (FPR) are plotted; they are defined as

TPR = TP / (TP + FN)    (33)

FPR = FP / (FP + TN)    (34)

The full area under the ROC curve is measured by AUC, or area under the ROC curve. AUC can be written as follows for tasks requiring binary classification:

AUC = ∫_{−∞}^{∞} TPR(T) FPR′(T) dT = ∫_{−∞}^{∞} ∫_{−∞}^{∞} I(T′ > T) f1(T′) f0(T) dT dT′ = P(X1 > X0)    (35)

An example of a multi-class AUC average is as follows:

AUC_total = 2 / ( |C| (|C| − 1) ) Σ_{ci, cj ∈ C} AUC(ci, cj)    (36)
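For illustration (this is not the authors' evaluation script), the metrics of Eqs. (29)-(36) can be computed with scikit-learn as follows; y_test, y_pred, and the predicted class probabilities y_proba are assumed to come from the trained classifier.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

# Micro-averaged precision/recall/F1 correspond to the summations over classes l = 1..L
# in Eqs. (30)-(32); accuracy implements Eq. (29).
accuracy  = accuracy_score(y_test, y_pred) * 100
precision = precision_score(y_test, y_pred, average="micro") * 100
recall    = recall_score(y_test, y_pred, average="micro") * 100
f1        = f1_score(y_test, y_pred, average="micro") * 100

# Confusion matrix behind the TP/TN/FP/FN counts (Table 5 and the heatmaps of Fig. 21).
cm = confusion_matrix(y_test, y_pred)

# Multi-class AUC: one-vs-one averaging follows the pairwise idea of Eq. (36);
# one-vs-rest is a common alternative.
auc_ovo = roc_auc_score(y_test, y_proba, multi_class="ovo", average="macro")
auc_ovr = roc_auc_score(y_test, y_proba, multi_class="ovr", average="macro")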
Fig. 22. ROC curves for (a) XGBClassifier and (b) LightGBM classifier with dart algorithms.
Table 8
Input written text and output spoken form pairs with class.
Here, the token <self> indicates that the token is to be left alone.
Fig. 24. Working process example of the proposed Bangla text normalization method.
In multi-class text classification tasks, heat maps can offer a visual depiction of the performance measures. The number of instances that were assigned to each class is shown in the heat map's individual cells. In Fig. 21, heatmaps for the XGBClassifier and the LightGBM classifier with dart algorithms are displayed. For the token classification approach employing the XGBClassifier and the LightGBM classifier with dart algorithms, the ROC curves and average AUC are shown in Fig. 22.
Input: Bangla text corpus in written form
Output: Bangla text corpus in spoken form
1. Begin
2. data ← load dataset
3. Labeled_text_data ← Text_Preprocessor(data)
4. Embedded_text_data ← Word2Vec(Labeled_text_data)
5. model ← load trained model
6. semiotic_class_list ← token_classifier(Embedded_text_data, model)
7. normalized_token ← Verbalizer(Labeled_text_data, semiotic_class_list)
8. Final_Output ← Text_Postprocessor(normalized_token)
9. End
1) Text preprocessing: In this function, some Bangla words will be grouped into the same token according to the semiotic classes. For example, the input texts "০৬:৩০ মিঃ", "৪–১ গোল", and "৯৯৯ নম্বর" will be grouped using regular expressions. Then tokenization separates the text stream into tokens, which can be words, phrases, symbols, or other meaningful items.

Labeled_text_data = Text_Preprocessor(Bangla_Text_Corpus)
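As a hedged sketch of what such a Text_Preprocessor might do (the paper's actual regular expressions are not reproduced here), Bangla-digit patterns for times, scores, and numbers can be grouped into single tokens before whitespace tokenization:

import re

BN_DIGIT = "[\u09e6-\u09ef]"   # Bangla digits ০-৯

# Example patterns: a time such as "০৬:৩০ মিঃ", a score such as "৪–১ গোল",
# and a number followed by a noun such as "৯৯৯ নম্বর".
PATTERNS = [
    re.compile(rf"{BN_DIGIT}{{1,2}}:{BN_DIGIT}{{2}}\s*মিঃ"),   # TIME
    re.compile(rf"{BN_DIGIT}+[–-]{BN_DIGIT}+\s*গোল"),          # GAME (score)
    re.compile(rf"{BN_DIGIT}+\s*নম্বর"),                        # CARDINAL + noun
]

def text_preprocessor(sentence):
    # Protect multi-word non-standard tokens by joining their internal spaces,
    # then split the sentence on whitespace.
    for pat in PATTERNS:
        sentence = pat.sub(lambda m: m.group(0).replace(" ", "_"), sentence)
    return sentence.split()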
2) Feature Extraction: For this method, the word2vec CBOW model has been chosen for feature extraction from the text corpus. All tokens are converted into vectors in this step.

Embedded_text_data = Word2Vec(Labeled_text_data)

3) Token Classification using the Trained Model: In this step, the token classifier will predict a semiotic class for each token. The XGBClassifier algorithm has been chosen for the classification. The step has two parts.

a) Train a model using a normalized corpus: First of all, a large normalized training dataset should be produced to train a model. The data has been created using regular expressions for the token classification. This step has produced training data with a semiotic class for 2,220,900 tokens. The full procedure of this step has been explained in Section 6.

b) Token classification with the trained model: In this step, at first, the method will load the trained model. Then it will assign a semiotic class to each token of the embedded text data using the trained model.

model = tf.keras.models.load_model('xgb_model.model')
semiotic_class_list = token_classifier(Embedded_text_data, model)
3) Token Classification using Trained Model: In this step, the token
text.
classifier will predict a semiotic class for each token. XGBClassifier
algorithm has been chosen for the classification. The step has two
parts. Table 10
Environment setup for the text normalization method.
a) Train a model using a normalized corpus: First of all, a large
normalized training data should be produced to train a model. Resource List Details Information
The data has been created using regular expressions for the token CPU Intel® Core™ i5-3427U CPU @ 1.80 GHz
classification. The method has produced training data with se RAM 8 GB
miotic class for 2,220,900 tokens. The full procedure of this step GPU Intel® UHD Graphics
Experimental Tool Jupyter Notebook
has been explained in section-6.
Author contribution

The study's concept and design were developed by MRI, AA, and MSR, who also had complete access to all the study's data. The creation of normalized data and the evaluation of the various models' degrees of accuracy were conducted by MRI. The article's composition was a collaborative effort among all the authors. The report's critical revision included input from AA and MSR. MRI developed all the datasets, results, and text normalization methods. The final version was reviewed and approved by all authors, who also helped with data collection and analysis.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Abbas, M., Memon, K.A., Jamali, A.A., Memon, S., Ahmed, A., 2019. Multinomial Naive Bayes classification model for sentiment analysis. IJCSNS Int. J. Comput. Sci. Netw. Secur. 19 (3), 62.
Alam, F., Nath, P.K., Khan, M., 2007. Text-to-speech for Bangla language using Festival.
Alam, F., Habib, S.M., Khan, M., 2008. Text normalization system for Bangla. BRAC University.
Allen, J., Hunnicutt, M.S., Klatt, D.H., Armstrong, R.C., Pisoni, D.B., 1987. From Text to Speech: The MITalk System. Cambridge University Press.
Bauer, E., Kohavi, R., 1999. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learn. 36, 105–139.
Biswas, N., Uddin, K.M., Rikta, S.T., Dey, S.K., 2022. A comparative analysis of machine learning classifiers for stroke prediction: a predictive analytics approach. Healthcare Analytics 1 (2), 100116.
Bloehdorn, S., Hotho, A., 2006. Boosting for text classification with semantic features. In: Advances in Web Mining and Web Usage Analysis: 6th International Workshop on Knowledge Discovery on the Web, WebKDD 2004, Seattle, WA, USA, August 22–25, 2004, Revised Selected Papers 6. Springer Berlin Heidelberg, pp. 149–166.
Dutoit, T., 1997. High-quality text-to-speech synthesis: an overview. J. Electrical Electron. Eng. Aust. 17 (1), 25–36.
Ebden, P., Sproat, R., 2015. The Kestrel TTS text normalization system. Nat. Lang. Eng. 21 (3), 333–353.
Essa, E., Omar, K., Alqahtani, A., 2023. Fake news detection based on a hybrid BERT and LightGBM models. Complex Intelligent Systems, 1–12.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Ann. Statistics, 1189–1232.
Handley, Z.L., 2005. Evaluating text-to-speech (TTS) synthesis for use in computer-assisted language learning (CALL). The University of Manchester (United Kingdom).
Indra, S.T., Wikarsa, L., Turang, R., 2016. Using logistic regression method to classify tweets into the selected topics. In: 2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS). IEEE, pp. 385–390.
Islam, M.R., Rahman, J., Talha, M.R., Chowdhury, F., 2020. Query expansion for Bangla search engine Pipilika. In: 2020 IEEE Region 10 Symposium (TENSYMP). IEEE, pp. 1367–1370.
Kim, M., Cheon, S.J., Choi, B.J., Kim, J.J., Kim, N.S., 2021. Expressive text-to-speech using style tag. arXiv preprint arXiv:2104.00436.
Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., Brown, D., 2019. Text classification algorithms: a survey. Information 10 (4), 150.
Lai, T.M., Zhang, Y., Bakhturina, E., Ginsburg, B., Ji, H., 2021. A unified transformer-based framework for duplex text normalization. arXiv preprint arXiv:2108.09889.
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26, 3111–3119.
Mukherjee, S., Mandal, S.K.D., 2014. A Bengali HMM based speech synthesis system. arXiv preprint arXiv:1406.3915.
Murphy, K.P., 2006. Naive Bayes classifiers. Generative classifiers. Bernoulli 4701 (October), 1–8. https://doi.org/10.1007/978-3-540-74958-5_35.
Naser, A., Aich, D., Amin, M.R., 2010. Implementation of Subachan: Bengali text-to-speech synthesis software. In: International Conference on Electrical & Computer Engineering (ICECE 2010). IEEE, pp. 574–577.
Natekin, A., Knoll, A., 2013. Gradient boosting machines, a tutorial. Front. Neurorob. 7 (DEC). https://doi.org/10.3389/fnbot.2013.00021.
Onan, A., Korukoğlu, S., Bulut, H., 2016. Ensemble of keyword extraction methods and classifiers in text classification. Expert Syst. Appl. 57, 232–247.
Pramanik, S., Hussain, A., 2019. Text normalization using memory augmented neural networks. Speech Comm. 109, 15–23.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A.V., Gulin, A., 2018. CatBoost: unbiased boosting with categorical features. In: Advances in Neural Information Processing Systems, pp. 6638–6648.
Raju, R.S., Bhattacharjee, P., Ahmad, A., Rahman, M.S., 2019. A Bangla text-to-speech system using deep neural networks. In: 2019 International Conference on Bangla Speech and Language Processing (ICBSLP). IEEE, pp. 1–5.
Rashid, M.M., Hussain, M.A., Rahman, M.S., 2010. Text normalization and diphone preparation for Bangla speech synthesis. J. Multimed. 5 (6), 551.
Ro, J.H., Stahlberg, F., Wu, K., Kumar, S., 2022. Transformer-based models of text normalization for speech applications. arXiv preprint arXiv:2202.00153.
Sasirekha, D., Chandra, E., 2012. Text to speech: a simple tutorial. Int. J. Soft Comput. Eng. (IJSCE) 2 (1), 275–278.
Shamsi, M., Chevelu, J., Barbot, N., Lolive, D., 2020. Corpus design for expressive speech: impact of the utterance length. In: 10th International Conference on Speech Prosody 2020. ISCA, pp. 955–959.
Singh, G., Kumar, B., Gaur, L., Tyagi, A., 2019. Comparison between multinomial and Bernoulli naïve Bayes for text classification. In: 2019 International Conference on Automation, Computational and Technology Management (ICACTM). IEEE, pp. 593–596.
Sodimana, K., De Silva, P., Sproat, R., Theeraphol, A., Li, C.F., Gutkin, A., Sarin, S., Pipatsrisawat, K., 2018. Text normalization for Bangla, Khmer, Nepali, Javanese, Sinhala, and Sundanese TTS systems.
Sproat, R., 1996. Multilingual text analysis for text-to-speech synthesis. Nat. Lang. Eng. 2 (4), 369–380.
Sproat, R., 2010. Lightly supervised learning of text normalization: Russian number names. In: 2010 IEEE Spoken Language Technology Workshop. IEEE, pp. 436–441.
Sproat, R., Black, A.W., Chen, S., Kumar, S., Ostendorf, M., Richards, C., 2001. Normalization of non-standard words. Comput. Speech Lang. 15 (3), 287–333.
Sproat, R., Jaitly, N., 2016. RNN approaches to text normalization: a challenge. arXiv preprint arXiv:1611.00068.
The festival speech synthesis system, accessed: 2019-07-28. [Online]. Available: http://www.cstr.ed.ac.uk/projects/festival/.
Tyagi, S., Bonafonte, A., Lorenzo-Trueba, J., Latorre, J., 2021. Proteno: text normalization with limited data for fast deployment in text to speech systems. arXiv preprint arXiv:2104.07777.
Xu, S., 2018. Bayesian Naïve Bayes classifiers to text classification. J. Inf. Sci. 44 (1), 48–59.