
Journal of King Saud University - Computer and Information Sciences 36 (2024) 101807


Bangla text normalization for text-to-speech synthesizer using machine learning algorithms
Md. Rezaul Islam *, Arif Ahmad, Mohammad Shahidur Rahman *
Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet 3114, Bangladesh

A R T I C L E I N F O

Keywords:
Text-to-speech
Text normalization
Machine learning algorithms
Large Bangla text corpus
ROC curve
Confusion matrices

A B S T R A C T

Text normalization (TN) for a text-to-speech (TTS) synthesizer is the transformation of non-standard words such as times, ordinal numbers, equations, ranges, and dates into standard words that resemble their pronunciations. Text normalization is an essential part of every TTS synthesizer; without it, the voice generated by the synthesizer will be unintelligible. Because of the unsatisfactory performance of previous research, a text normalization method for the Bangla language is proposed in this paper. First, we produced a tokenized dataset with semiotic classes from a Bangla corpus using regular expressions. Then, the tokens were used to train an XGBClassifier model. After that, the trained XGBClassifier identifies the semiotic class of each token in a new Bangla text corpus. Finally, the method produces normalized text for each token by calling the class function corresponding to the predicted class. This text normalization method will help the Bangla TTS synthesizer produce more intelligible voices. The token classification accuracy of this method is 99.997%.

1. Introduction

The purpose of text-to-speech (TTS) synthesizers is to produce a voice output from an input text using several processing steps. Fig. 1 introduces the functional model of a comprehensive TTS synthesizer. As in human reading, a text is turned into a phonetic transcription using a Natural Language Processing (NLP) module that includes the desired intonation and rhythm (commonly referred to as prosody). Text normalization is the first step of this module. Then, a Digital Signal Processing (DSP) module creates a voice from the symbolic data it receives (Dutoit, 1997).

The first stage of a TTS synthesizer is called text normalization (TN). It is the process of converting non-standard written words (NSWs) into standard words that have similarities with the spoken form. The basic function of the text normalization method is displayed in Fig. 2. For well-resourced languages like English, a huge amount of work has been done in the text normalization area, but text normalization work for a low-resourced language like Bangla is sparse. In the past, various works on text normalization for the Bangla language have been completed, although the accuracy is unsatisfactory in a few semiotic classes. Because of that, researchers struggle to produce a TTS synthesizer using a Bangla text corpus: they need to create a normalized Bangla text corpus word by word manually. The proposed text normalization method will therefore reduce the researchers' work to create a large normalized Bangla corpus for the TTS synthesizer. It aims to generate a massive normalized text easily within the shortest time. However, two important questions arise for the text normalization process:

1. How should the labeled training data be produced?
2. Which machine learning algorithm should be applied to the classification of the tokenized text?

To solve these issues, we first use regular expressions to create labeled training data and train various models using this data. Second, in order to create a tokenized text corpus from a new text corpus, the text normalization method calls the text preprocessor. Then, the method uses a highly accurate trained model, such as XGBClassifier, to determine the proper class for each token of the corpus. For example, the token “০৬:৩০ িমঃ” will be classified as a time class, and “১১–০৯-১৯৯৬” will be classified as a date class. To transform each token into normalized text, it then calls the relevant class method. Examples of translating written texts into spoken forms for the Bangla language are given in Table 1.

Our contributions in this research are as follows:

* Corresponding authors.
E-mail addresses: [email protected] (Md.R. Islam), [email protected] (A. Ahmad), [email protected] (M.S. Rahman).

https://doi.org/10.1016/j.jksuci.2023.101807
Received 11 July 2023; Received in revised form 16 September 2023; Accepted 16 October 2023
Available online 20 October 2023
1319-1578/© 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).

Fig. 1. Simple but comprehensive functional model of a TTS synthesizer.

Fig. 2. Text normalization (TN) and inverse text normalization (ITN).

1. With a 99.997 % accuracy rate, this research work has surpassed the earlier accuracy for text normalization reported by other researchers in other languages.
2. To find the optimum classifiers, 19 machine learning methods, combined with word embedding and oversampling, are applied to the Bangla text corpus in this research.
3. The highest accuracy among the 19 classifiers has been demonstrated by XGBClassifier and LightGBM with dart, with ratings of 99.997 % and 99.993 %, respectively. In Fig. 3, the results of the two classifiers are displayed.

The remaining sections of the paper are organized as follows. Section 2 is a collection of related works. Section 3 provides an overview of text-to-speech synthesizers, including Bangla TTS and text normalization. Machine learning techniques and word embedding are covered in Sections 4 and 5. In Section 6, the token classification technique for text normalization is illustrated; Section 6.1 discusses the token classification method, and Section 6.2 discusses the results and analysis for the token classification approach, where the results of the 19 classifiers are explained using precision, accuracy, F1-score, recall, heatmaps, and ROC curves. The full procedure of the proposed text normalization method is discussed in Section 7. The work of this research is concluded in Section 8.

2. Related works

The text-to-speech (TTS) synthesizer significantly depends on text normalization, and a huge amount of work has already been done for well-resourced languages. In speech technology, text normalization has a long history. Text to speech: the MITalk system is an early work where text normalization has been discussed (Allen et al., 1987). In 1996, Sproat proposed a weighted finite-state transducer (WFST) based approach that can be used to solve the majority of text normalization issues and that provides a text analysis component for the multilingual TTS synthesizer from Bell Labs (Sproat, 1996). The transducers permit declarative descriptions of lexicons, morphological rules, numeral-expansion rules, and phonological rules that have been built using a lexical toolkit. French, German, Spanish, Russian, Mandarin, Italian, Romanian, and Japanese are among the eight languages to which the proposed approach has been applied. The normalization method for non-standard words was the initial attempt to resolve the text normalization issue, which is fundamentally a language modeling issue (Sproat et al., 2001). A taxonomy of NSWs was created based on four distinctly different forms of documents: news text, a newsgroup for recipes, a newsgroup for hardware products, and classified advertisements for real estate. The approach examined how different general techniques, such as n-gram language models, decision trees, and weighted finite-state transducers (WFST), may be applied to a variety of NSW types. More recently, different machine learning algorithms have been applied to TTS text normalization (Sproat, 2010; Ebden and Sproat, 2015). Some papers have experimented with different types of RNN architectures like LSTM, attention-based RNN sequence-to-sequence models, transformer-based frameworks, Proteno, etc. (Sproat and Jaitly, 2016; Pramanik and Hussain, 2019; Lai et al., 2021; Tyagi et al., 2021; Ro et al., 2022). But research on low-resourced languages like Bangla is very rare. A rule-based text normalization method was proposed for the Bangla language in (Alam et al., 2008), representing the first Bangla text normalization work; it used regular expressions in jflex format. The authors proposed multiple semiotic classes for the Bangla language, utilizing verbalization to expand a token. The performance of the method is 99 % for three classes: floating point, currency, and time. They also showed that the accuracy for currency is 100 %, for floating point 100 %, and for time 62 %. But they did not explain the method for all types of semiotic classes like date, time, etc., so the methodology is not enough to build a Bangla text corpus for the TTS synthesizer. In another paper, the authors explained multiple semiotic classes of text normalization and their ambiguities, with accuracy below 90 % (Rashid et al., 2010). They used a rule-based approach and a database approach for text normalization: words that do not obey the rules are managed using a database after the rule-based method has been tried as a first step of normalization. Google has described only a process of text normalization grammars for 6 low-resourced languages like Bangla, Nepali, Khmer, etc. (Sodimana et al., 2018). So, a complete text normalization method for the Bangla language is proposed for the TTS synthesizer in this paper.

3. Text-to-Speech

A text-to-speech (TTS) synthesizer is a computer-based tool that can read text aloud. Text in TTS is either a computer input stream or scanned input that has been sent to an OCR engine (Sasirekha and Chandra, 2012). A very rapid improvement in this area has been made over the last two decades, and there are now numerous excellent TTS synthesizers available for commercial use. Recently, personal assistant applications have grown in popularity since they enable users to communicate with devices like mobile phones and tablets using natural language. These applications use NLP approaches to deliver appropriate answers to users' questions. Numerous applications (such as Siri, Google Now, Cortana, and Robin) exist in this field. The TTS synthesizer has multiple real-life applications such as personalization, gaming, and digital assistants. Fig. 4 shows the different types of applications of the TTS synthesizer. It is possible to create a TTS synthesizer using both hardware and software. Fig. 5 shows the TTS synthesizer's whole architecture.


Table 1
Examples of translating written texts into spoken forms for the Bangla language.

There are two components to a TTS synthesizer: a front-end (module for text-to-phonetics) and a back-end (module for phoneme-to-speech) (Handley, 2005). There are two main purposes for the front end. First, it transforms unprocessed text that contains symbols like dates, ranges, times, and abbreviations into a spoken equivalent; text normalization is another name for this procedure. The front end then assigns each word a phonetic transcript. The text is divided into prosodic units like clauses and sentences. Text-to-phoneme or grapheme-to-phoneme conversion is the process of assigning phonetic transcripts to words (Kim et al., 2021; Shamsi et al., 2020; Dutoit, 1997). The back end, which is frequently referred to as the synthesizer, transforms the symbolic verbal representation into sound.

3.1. Bangla text-to-speech synthesizer

It is rare to find TTS development work in Bangla. Concatenative methods were initially utilized to create Bangla TTS. A unit-selection TTS synthesizer called Katha (Alam et al., 2007), created in 2007 by Firoj Alam and others at BRAC University, is based on the Festival toolkit (The festival speech synthesis system, 2019). In 2009, the diphone concatenation-based TTS synthesizer Subachan was developed by Abu Naser et al. at the Shahjalal University of Science and Technology (Naser et al., 2010). A Hidden Markov Model-based statistical parametric speech synthesis (SPSS) system (Mukherjee and Mandal, 2014) was developed in 2012. The use of DNNs for acoustic modeling in SPSS systems has considerably enhanced TTS research. The DNN-based Bangla TTS synthesizer's architecture is depicted in Fig. 6 (Raju et al., 2019). The major processing elements of this system include the text normalizer, the front-end processor, two deep neural networks (DNNs) for duration and acoustic modeling, and a vocoder for producing speech from acoustic features.

3.2. Text normalization

Text normalization for a text-to-speech (TTS) synthesizer refers to the process of converting input text into a phonetic representation that can be accurately rendered into speech by the synthesizer. It is an essential part of numerous speech and natural language processing applications. The TTS synthesizer has gained popularity globally in recent years; it has made significant progress and now produces human-like speech quality. A normalized text is required as input to develop a TTS synthesizer. By transforming unambiguous non-standard words into a standard spoken form, the input text corpus is processed into normalized text as part of the text normalization process. A text normalization method for the Bangla language is quite uncommon, which is a serious issue for the Bangla text-to-speech (TTS) synthesizer. To train a TTS synthesizer, researchers require a large, unambiguous normalized text with spoken forms.

4. Classification algorithms

The 19 classifiers presented in this section, including the K-Nearest Neighbor classifier, Gaussian Naïve Bayes classifier, Multinomial Naïve Bayes classifier, Decision Tree classifier, Random Forest classifier, Bernoulli Naïve Bayes classifier, Logistic Regression, Gradient Boosting classifier, LGBMClassifiers with dart and gbdt, XGBClassifier, Linear Support Vector Machine, CatBoost, AdaBoost, Nearest Centroid classifier, Voting classifier, Bagging classifier, and LightGBM, are used to measure performance. Then, the classifier with the highest accuracy rating among all of these models is evaluated.


Fig. 3. Results of the two best classifier models.

4.1. Support Vector Machine

SVM was developed initially for binary classification applications (Kowsari et al., 2019), but many scholars use these prevalent strategies when working on multi-class problems. For two-dimensional datasets, linear and non-linear classifiers are shown in Fig. 7.

Because SVMs are often employed for binary classification, we must develop a multiple-SVM (MSVM) for multi-class circumstances. One-vs-One is a method for building N(N−1)/2 classifiers in multi-class SVM. The formula below can be used to apply the linear SVM to multi-class text classification:

$C = \arg\max_m (W_m^{T} X + b)$   (1)

where C is the predicted class for the document X, b is the bias term, $W_m$ is the weight vector for class m, and T is the transpose operator.
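A minimal sketch, assuming scikit-learn, of the one-vs-one multi-class linear SVM described around Eq. (1). The tiny token list, class names, and character n-gram features below are illustrative assumptions, not the paper's dataset or configuration.

```python
# Hypothetical example: classify short tokens into semiotic classes with a
# one-vs-one ensemble of linear SVMs (N(N-1)/2 binary models internally).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

tokens = ["06:30", "11-09-1996", "999", "4-1", "50%"]
labels = ["TIME", "DATE", "CARDINAL", "GAME", "PERCENTAGE"]

# Character n-grams turn each short token into a simple feature vector X.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(tokens)

clf = OneVsOneClassifier(LinearSVC()).fit(X, labels)
# May print ['TIME'] given the shared ':' character with the training token.
print(clf.predict(vectorizer.transform(["12:45"])))
```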

Fig. 4. Applications of TTS synthesizer.

Fig. 5. Full architecture of a TTS synthesizer.


Fig. 6. Full architecture of Bangla SPSS.

4.2. Decision tree

For multi-class text classification, a decision tree provides a straightforward and understandable classification paradigm. The decision tree divides the feature space into a hierarchical structure of nodes. The decision tree's leaf nodes indicate the predicted classes, and the path from the root to a leaf node represents the decision rules that lead to the predicted class (Kowsari et al., 2019). Typically, the decision rules in a decision tree are based on threshold tests on the feature values, such as "if feature i > t, then go left, else go right". The following equation can be utilized to implement a decision tree for multi-class text classification:

$C = f(X)$   (2)

where X is a vector of feature values for the document, C is the predicted class for it, and f is a function that maps X to the predicted class C.

4.3. Random Forest

Multiple decision trees are used in Random Forest, an ensemble learning technique, to increase the classification's robustness and accuracy (Kowsari et al., 2019). In order to construct the final classification, it builds a number of decision trees based on various subsets of the training data and features. The Random Forest builds each decision tree by recursively dividing the training data according to the feature values. The decision tree must meet a stopping condition before it can cease splitting, such as a minimum number of training instances in each leaf node or a maximum depth of the tree. The primary goal of RF is to create random decision trees, as seen in Fig. 8. The following equation can be utilized to implement Random Forest for multi-class text classification:

$C = \arg\max_m \left( \sum_{k=1}^{n} T_k(C_m) \right)$   (3)

where C is the predicted class for the document D, n is the number of decision trees in the random forest, $T_k$ is the kth decision tree in the forest, and $C_m$ is the mth class in the classification.

4.4. K-Nearest Neighbor

By locating the k training instances in the training set that are closest to the input document, the K-Nearest Neighbor (k-NN) classifier assigns the input document to the class with the highest frequency among its k neighbors (Kowsari et al., 2019). Using the equation below, the k-NN classifier can categorize text into many groups:

$C = \arg\max_k \left( \sum_{m=1}^{n} l(y_m = C_k) \right)$   (4)

where C is the predicted class for the document D, n is the number of neighbors taken into consideration, $y_m$ is the label of the mth neighbor, $C_k$ is the kth class in the classification task, and the indicator function l takes the value 1 if the condition $y_m = C_k$ is true, and 0 otherwise.

The k-NN classifier can compare documents using a variety of distance metrics, including Euclidean distance, cosine similarity, and Jaccard similarity. The formulas for the Euclidean, Manhattan, and Minkowski distance metrics are shown in Eqs. (5), (6), and (7), respectively. The KNN architecture is illustrated in Fig. 9.

$\text{Euclidean distance} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$   (5)

$\text{Manhattan distance} = \sum_{i=1}^{n} |p_i - q_i|$   (6)

$\text{Minkowski distance} = \left( \sum_{i=1}^{n} |p_i - q_i|^{r} \right)^{1/r}$   (7)
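A short illustrative sketch, assuming NumPy and scikit-learn, of the distance metrics in Eqs. (5)-(7) and of a k-NN classifier that uses the Minkowski family; the points and labels are made up for demonstration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

p, q = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
euclidean = np.sqrt(np.sum((p - q) ** 2))            # Eq. (5)
manhattan = np.sum(np.abs(p - q))                    # Eq. (6)
r = 3
minkowski = np.sum(np.abs(p - q) ** r) ** (1.0 / r)  # Eq. (7)
print(euclidean, manhattan, minkowski)

# k-NN with the same metric family: metric="minkowski", p=2 is Euclidean.
X_train = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
y_train = ["A", "A", "B", "B"]
knn = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=2)
knn.fit(X_train, y_train)
print(knn.predict([[4.5, 5.2]]))  # majority of the 3 nearest neighbors -> ['B']
```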

Fig. 7. Non-linear and linear support vector machine (SVM).


Fig. 8. Random Forest architecture.

Fig. 9. K-nearest neighbor (KNN) classifiers architecture.

4.5. Rocchio classification

Assigning new documents to the class whose centroid is closest to them is the foundation of the Rocchio classification method (Kowsari et al., 2019). Each class is represented by its centroid in the feature space. The Rocchio algorithm's equation for multi-class text categorization is:

$C = \arg\max_i \, \mathrm{sim}(X, C_i)$   (8)

where the similarity measure sim(X, C_i) compares the document X and the centroid $C_i$, C is the predicted class for the document X, and $C_i$ is the centroid of class i in the feature space; the document X is assigned to the predicted class C. One frequently employed similarity measure is the cosine similarity, which determines the cosine of the angle between the two vectors:

$\mathrm{sim}(X, C_i) = \frac{X^{T} C_i}{\lVert X \rVert \, \lVert C_i \rVert}$   (9)

where T is the transpose operator, and ||X|| and ||C_i|| are the L2 norms of the vectors X and $C_i$. For the Nearest Centroid algorithm, the similarity metric is the following:

$\mathrm{sim}(X, C_i) = \lVert X - C_i \rVert$   (10)

4.6. Voting classifier

A multi-class text classification ensemble method called the Voting Classifier combines the predictions of different base classifiers into a single, concluding prediction. The Voting Classifier uses a majority vote or a weighted vote to aggregate the predictions of the base classifiers. Hard voting and soft voting are the two types. Soft voting is used in the proposed work for the k-NN, Random Forest, and multi-layer classifiers (Onan et al., 2016). The Voting Classifier's equation for multi-class text categorization is:

$C = \arg\max \left( \sum_{k=1}^{n} w_k P_k \right)$   (11)

where C is the document's predicted class, n represents the number of classes, $P_k$ is the predicted probability for class k by the base classifier, and $w_k$ is the weight given to the base classifier. All base classifiers have equal weight in the case of a majority vote, and the equation becomes:

$C = \arg\max \left( \sum_{k=1}^{n} P_k \right)$   (12)

4.7. Naïve Bayes

According to the Bayes theorem (Murphy, 2006), P(X|C) is the likelihood probability of the data point X given class C, P(C|X) is the posterior probability of class C, P(C) is the prior probability of class C, and P(X) is the probability of the data point X across all classes:

$p(C|X) = \frac{p(X, C)}{p(X)} = \frac{p(X|C)\, p(C)}{\sum_{C'=1}^{N} p(X|C')\, p(C')}$   (13)

Since this model explains how to produce feature vectors for each potential class C, it is known as a generative model.

$C_{MAP} = \arg\max_{c \in C} \, p(C|X)$   (14)

4.7.1. Gaussian Naïve Bayes classifier
The features are assumed to be continuous and to follow a Gaussian distribution in this Naive Bayes classifier version (Xu, 2018). The Gaussian Naive Bayes classifier can be used to classify text into many categories using the following equation:

$C = \arg\max \left( P(C_1|D), P(C_2|D), \cdots, P(C_k|D) \right)$   (15)

where k is the number of classes in the classification task, P(C_i|D) is the posterior probability that the document D belongs to class $C_i$, and C is the predicted class for the document D. The probability P(C_i|D) can be calculated using Bayes' theorem as follows:

$P(C_i|D) = \frac{P(D|C_i)\, P(C_i)}{P(D)}$   (16)

where P(C_i) is the prior probability for class $C_i$, P(D|C_i) is the probability of observing the input document D given class $C_i$, and P(D) represents the probability of observing the input document D.
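A minimal sketch, assuming scikit-learn, of Eqs. (15)-(16): GaussianNB picks the class with the highest posterior probability for continuous feature vectors. The toy points and class names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [6.8, 7.2], [7.1, 6.9]])
y = ["short", "short", "long", "long"]

gnb = GaussianNB().fit(X, y)
print(gnb.predict([[6.5, 7.0]]))        # -> ['long']
print(gnb.predict_proba([[6.5, 7.0]]))  # posteriors P(C_i | D), as in Eq. (16)
```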


Fig. 10. Boosting architecture.

Fig. 11. Bagging architecture.

4.7.2. Multinomial Naïve Bayes classifier
A variation of the Naïve Bayes classifier made especially for discrete-feature text classification applications is the multinomial Naïve Bayes classifier. It is predicated on the idea that a data point's features are drawn from a multinomial distribution. In text classification, the classes are the possible categories that can be applied to the documents, and the features are typically the frequencies of occurrence of each word in the documents (Abbas et al., 2019). If there are n documents that fit into the k categories, where k ∈ {c1, c2, …, ck}, the predicted class as output is c ∈ C. The following formula is used to determine the probability that a document belongs to class C using the Bayes theorem:

$P(C|D) = \frac{P(C) \prod_i P(w_i|C)^{n_i}}{P(D)}$   (20)

where P(w_i|C) is the probability of observing word $w_i$ given class C, P(C) is the prior probability of class C, D is the document, $w_i$ is the ith word in the document, and $n_i$ is the number of times that $w_i$ is mentioned in the document.

4.7.3. Bernoulli Naïve Bayes classifier
The Bernoulli Naive Bayes model uses a bag of words to represent each document, with each word's presence or absence denoted by a binary value (0 or 1). The Bernoulli Naive Bayes classifier uses Bayes' theorem to calculate the probability of each class given a document (Singh et al., 2019). To assess the probability that a document belongs to a particular class, the equation below is applied:

$P(C|D) = \frac{P(C)\, \prod_i P(w_i|C)^{x_i} \, \prod_i P(w'_i|C)^{(1 - x_i)}}{P(D)}$   (21)

where P(w_i|C) is the probability of observing word $w_i$ given class C, P(C) is the prior probability of class C, D is the document, $w_i$ is the ith word in the document, and the presence or absence of word i in the document D is indicated by the binary variable $x_i$, which has a value of either 1 or 0.

4.8. Logistic regression

Logistic regression models the likelihood of each class as a logistic function of the input features, and then utilizes maximum likelihood estimation to determine the model's parameters (Indra et al., 2016). The following is an equation for multi-class text categorization using logistic regression:

$P(C|D) = \frac{\exp\left(\sum_{i=1}^{n} w_i f_i\right)}{\sum_{c} \exp\left(\sum_{k=1}^{n} w_k f_k\right)}$   (23)

where n is the number of features in the input document D, P(C|D) represents the probability of a class given the input document D, $f_i$ is the feature vector for document D, $w_i$ is the weight vector for class i in the logistic regression model, and $\sum_{c} \exp\left(\sum_{k=1}^{n} w_k f_k\right)$ is the sum of the exponential weights over all classes.
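An illustrative comparison, assuming scikit-learn, of the classifiers in Eqs. (20), (21), and (23): MultinomialNB works on word counts (n_i), BernoulliNB on binary presence (x_i), and logistic regression on any numeric feature vector. The mini-corpus below is an assumption, not the paper's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["good good movie", "bad movie", "good plot", "bad bad acting"]
labels = ["pos", "neg", "pos", "neg"]

counts = CountVectorizer()
X_counts = counts.fit_transform(docs)          # term frequencies n_i
X_binary = (X_counts > 0).astype(int)          # 0/1 word presence x_i

models = {
    "multinomial": MultinomialNB().fit(X_counts, labels),
    "bernoulli": BernoulliNB().fit(X_binary, labels),
    "logistic": LogisticRegression().fit(X_counts, labels),
}
test = counts.transform(["good acting"])
for name, model in models.items():
    print(name, model.predict(test))
```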


Fig. 12. Word embedding illustration in two-dimensional vectors.

4.9. Boosting

A machine learning approach called boosting sequentially trains several weak models, changing the weights of the training instances to concentrate on the incorrectly categorized examples from the prior models (Bloehdorn, 2004). A weighted mixture of the forecasts of the weak models makes up the final prediction. The equation for multi-class text classification with boosting can be represented as follows:

$C = \arg\max_c \left( \sum_{k=1}^{n} \alpha_k h_k(x) \right)$   (24)

where C is the predicted class label for input document D, $\arg\max_c$ selects the class with the highest predicted score, $\alpha_k$ is the weight given to the kth weak model, and $h_k(x)$ is the prediction of the kth weak model for the input document D. Fig. 10 demonstrates the operation of a boosting approach for 2D data sets. It illustrates that the data has been labeled and trained using multi-model architectures (ensemble learning). AdaBoost (Adaptive Boosting) was created as a result of these advances.

4.10. Bagging

Bagging (also known as bootstrap aggregating) runs numerous separate models on various subsets of the training data and aggregates their predictions by majority voting (Bauer and Kohavi, 1999). A bootstrap generates a homogeneous sample from the training set. We have N classifiers (C) if N bootstrap samples (B1, B2, …, BN) have been generated, and Ck is constructed from each bootstrap sample Bk. Our classifier C, which is composed of C1, C2, …, CN, produces the class that is predicted most frequently by its sub-classifiers, with ties being broken randomly. Fig. 11 illustrates a straightforward bagging method that trains N models.

4.11. Gradient boosting

A family of machine learning algorithms known as gradient boosting combines several weak learners to create an effective predictive model. Gradient boosting is frequently accomplished using decision trees. Due to their success in classifying complicated datasets, gradient boosting models are gaining popularity (Natekin and Knoll, 2013; Biswas et al., 2022). The gradient step size is represented by Eq. (25).

$\rho_t = \arg\min_{\rho} \sum_{i=1}^{N} \psi\!\left[ y_i, \hat{f}_{t-1}(x_i) + \rho\, h(x_i, \theta_i) \right]$   (25)

4.12. Extreme gradient boosting

In contrast to previous gradient boosting techniques, extreme gradient boosting (XGBoost) performs better because it adopts a more regularized model formalization to avoid over-fitting (Friedman, 2001). To do this, we must understand the functions $h_i$, each of which contains a tree structure and leaf scores. The outcome in Eq. (26) is predicted by a tree ensemble model using L additive functions, given a set of data with m samples and n features, D = {(X_k, y_k)} (|D| = m, X_k ∈ R^n, y_k ∈ R).

$\hat{Y}_k = \phi(X_k) = \sum_{i=1}^{L} h_i(X_k), \quad h_i \in H$   (26)

where H = {h(X) = w_{q(X)}} (q: R^n → U, w ∈ R^U) is the space of regression trees; q indicates the structure of each tree by mapping a sample to its matching leaf index, and U stands for the tree's leaf count. Each $h_i$ is associated with an independent tree structure q and leaf weights w. In order to discover the set of functions used in the model, the regularized objective (27) is minimized as follows:

$L(\phi) = \sum_{k} l(\hat{Y}_k, Y_k) + \sum_{l} \Omega(h_l), \quad \Omega(h) = \gamma U + \frac{1}{2}\lambda \lVert w \rVert^{2}$   (27)

where the difference between the target value ($Y_k$) and the forecast value ($\hat{Y}_k$) is measured by the differentiable convex loss function l, and Ω penalizes model complexity to prevent over-fitting.
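A brief, assumed illustration of the ensemble ideas of Sections 4.9-4.11 using scikit-learn; the synthetic data, estimator choices, and settings are demonstration-only assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Bagging: trees trained on bootstrap samples, combined by voting (Fig. 11).
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
# AdaBoost: weak learners trained sequentially on re-weighted samples (Eq. 24).
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
# Gradient boosting: each new tree fits the gradient of the loss (Eq. 25).
gbt = GradientBoostingClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bag), ("adaboost", ada), ("gradient boosting", gbt)]:
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```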

Fig. 13. The architecture of CBOW and skip-gram model.


Fig. 14. Working procedure for the token classification method.

4.13. CatBoost

The fundamental predictors of CatBoost, a gradient boosting technique, are binary decision trees (Prokhorenkova et al., 2018). Assume that we have observed a set of data samples D = (X_k, y_k), k = 1, 2, …, m, where X_k = (x_k^1, x_k^2, …, x_k^n) is a vector of n features and the response feature y_k ∈ R can be encoded as a numerical feature or as a binary (yes or no) feature. The samples (X_k, y_k) are independently and identically distributed according to an unknown distribution p(·,·). The goal of the learning problem is to create a function H: R^n → R that minimizes the expected loss shown in (28).

$L(H) = \mathbb{E}\, L(y, H(X))$   (28)

where L(·,·) is a loss function and (X, y) is a testing sample drawn from the same distribution as the training data.

4.14. LightGBM

When working with big amounts of data, conventional GBDT takes a long time. LightGBM is an effective and scalable GBDT implementation that speeds up training while maintaining respectable accuracy (Essa et al., 2023). The process of creating a decision tree is what makes standard GBDT computationally expensive: all data instances are scanned to choose the feature to use as a split point that maximizes the information gain. To get around this restriction, LightGBM was introduced. In order to find the appropriate split points for decision trees, LightGBM separates continuous feature values into bins and builds feature histograms using these bins. In regards to memory utilization and training time, this strategy performs better than the standard GBDT method. In contrast to many other tree-based learning algorithms like XGBoost, which use level-wise tree growth, the LightGBM model splits the tree leaf-wise. The level-wise tree development approach expands the tree structure level by level, whereas the leaf-wise method grows from the node with the greatest loss reduction, which may form an unbalanced tree. For the same tree depth, the tree nodes produced by the level-wise tree approach are usually more numerous than those produced by the leaf-wise tree method; the training procedure may therefore greatly accelerate when the dataset is large.

5. Word embedding

Typically, unstructured data sets include texts and documents. When using a classifier that incorporates mathematical modeling, it is necessary to convert these unstructured text sequences into a structured feature space. This process is called feature extraction. Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), Word Embedding, and GloVe are the most widely used feature extraction techniques (Kowsari et al., 2019). Each phrase or word from the dictionary is transformed into an N-dimensional vector of real values known as a word embedding using a feature learning method. A word embedding is a way to express a word as a vector that carries meaning. It enables words with related meanings to have nearly equal representations. For instance, the terms "sad" and "sorrow" are frequently used interchangeably; these words will therefore have similar vector representations. A variety of word embedding strategies have been proposed to make unigrams understandable as input for machine learning algorithms. This study focuses on Word2Vec, which is one of the most often used techniques for text classification. Fig. 12 shows how word embedding can represent any word as a two-dimensional vector.

5.1. Word2Vec

A technique called Word2Vec was put forth in 2013 by T. Mikolov and colleagues at Google (Mikolov et al., 2013) to effectively learn a stand-alone word embedding from a text corpus. Every word is projected to a point in a vector space using the Word2Vec technique. For each word, Word2Vec creates an N-dimensional vector using a shallow neural network. It proposes two model architectures:

• Continuous Skip-gram Model
• Continuous Bag-of-Words (CBOW) Model


Fig. 15. Word clouds of the Bangla text corpus for the text normalization method.

Table 2
The format of the training data list.

Table 3
Training data number for each semiotic class.

Class Name | Number | Class Name | Number
BNG_TXT | 58,996 | TELEPHONE | 132,418
PUNCTUATION | 141,895 | FRACTIONAL | 180,216
CARDINAL | 104,075 | ORDINAL | 200,113
YEAR | 17,601 | EQUATION | 100,051
RANGE | 202,387 | PERCENTAGE | 200,037
GAME | 129,186 | DATE | 149,577
MONEY | 301,733 | TIME | 1442
ABBREVIATION | 824 | ENG_NUM | 120,025
Total number of tokens = 2,220,900

The CBOW and Skip-gram models are effective resources for identifying connections and parallels between words. Fig. 13 illustrates a straightforward CBOW model that predicts a word from the words that come around it, while Skip-gram looks for the words that might be close to each word. A particular target word is represented by a number of context words in the CBOW model (Islam et al., 2020). For instance, "breakfast" as the target word would include "bread" and "egg" as context words. Another model architecture that has many similarities to CBOW is the continuous Skip-gram model. This model seeks to maximize the classification of a word based on another word in the same phrase and predicts the current word based on context.

6. Token classification methodology

6.1. Method

This section provides a description of the token classification approach. Firstly, all data has been preprocessed and labeled using regular expressions. Secondly, the labeled text data has been converted to vectors using the word2vec CBOW model. Thirdly, the embedded text data has been balanced using the random oversampling data balancing technique. Fourthly, the data has been split into 3 parts: test data, training data, and validation data. Fifthly, the validation data is utilized for fine-tuning, and the training data has been used to train various machine learning algorithms. Finally, the performance of these algorithms has been estimated using the test data. In Fig. 14, the full working procedure for the token classification method is displayed, and the algorithm for this method is described in Algorithm 1.

Algorithm 1. (Working procedure of token classification for Bangla text normalization)

Input: Text-normalized labeled datasets from 80k Bangla sentences
Output: Predicted class for each token (18 classes)
1. Begin
2. data ← load labeled dataset
3. X_data ← data[before]
4. y_data ← data[class]
5. X_ctw ← Word2Vec(X_data)
6. X_bal, y_bal ← balanced data of X_ctw, y_data using random oversampler
7. X_train, y_train, X_test, y_test, X_valid, y_valid ← split_data of X_bal and y_bal
8. model ← train model using X_train and y_train
9. predict ← testing model using X_test and y_test
10. compute performance using evaluation metrics
11. End
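As a rough illustration of Algorithm 1, the sketch below wires the same steps together with off-the-shelf libraries (gensim, imbalanced-learn, scikit-learn, xgboost). The toy token/class pairs, the single-token "sentences", and all parameter values are assumptions, not the paper's actual data or settings.

```python
import numpy as np
from gensim.models import Word2Vec
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Toy (Before, Class) pairs standing in for the labeled dataset of Table 2.
before = [["06:30"], ["11-09-1996"], ["999"], ["hello"], ["4-1"], ["50%"]] * 50
classes = ["TIME", "DATE", "CARDINAL", "BNG_TXT", "GAME", "PERCENTAGE"] * 50

# Step 5: CBOW Word2Vec (sg=0) with 20-dimensional vectors, as in the paper's
# feature-extraction step (real training would use full tokenized sentences).
w2v = Word2Vec(sentences=before, vector_size=20, sg=0, min_count=1, window=2)
X = np.array([w2v.wv[t[0]] for t in before])
y = LabelEncoder().fit_transform(classes)

# Step 6: balance classes by random oversampling; step 7: split the data.
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_bal, y_bal, test_size=0.3,
                                          random_state=0, stratify=y_bal)

# Steps 8-10: train a classifier and measure its token-classification accuracy.
clf = XGBClassifier(objective="multi:softmax", max_depth=10,
                    learning_rate=0.3, n_estimators=100)
clf.fit(X_tr, y_tr)
print("token-class accuracy:", clf.score(X_te, y_te))
```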


Table 4
Written text examples for each semiotic class with normalized text.

1) Data Acquisition: The initial step in the token classification approach is data acquisition. A large corpus with more than 80,000 sentences has been collected from online sources, which is suitable for training a TTS synthesizer. The word clouds for the Bangla text corpus are presented in Fig. 15; the more frequently a term appears in the data, the larger it appears.

2) Text preprocessing: The next step in the token classification approach is text preprocessing. In this step, the text preprocessor analyzes the collected Bangla text corpus and produces an organized written text.

Fig. 16. Class distribution of tokens before being balanced using oversampling.


Fig. 17. Class distribution of tokens after being balanced using oversampling.

Fig. 18. Decreasing of logloss and error during XGBClassifier training.

Some Bangla words are grouped into the same token according to their semiotic classes. For example, the input texts “০৬:৩০ িমঃ”, “৪–১ েগাল” and “৯৯৯ নম্বর” will be grouped together using regular expressions. After that, tokenization divides the text stream into tokens, which can be phrases, symbols, words, or other significant elements.

Labeled_text_data = Text_Preprocessor(Bangla_Text_Corpus)

For example, a Bangla sentence such as the following: “জািরফ ১১–০৯-১৯৯৬ তািরেখ িবেকল ০৬:৩০ িমঃ এ ঢাকােত জন্মগ্রহণ কেরন ।” (ENG: Zarif was born on 11-09-1996 at 06:30 pm in Dhaka.) will be split into the format below:

[‘জািরফ’, ’১১-০৯-১৯৯৬’, ’তািরেখ’, ’িবেকল’, ’০৬:৩০ িমঃ’, ’এ’, ’ঢাকােত’, ’জন্মগ্রহণ’, ’কেরন’, ’ । ’]

After that, each token will be assigned a semiotic class using regular expressions, along with a Sentence_id and Token_id. In Table 2, the training dataset format is explained for this text. There are 4 columns in the training data structure. The first column is "Sentence_id"; a Sentence_id is assigned to each sentence in the text corpus. Each token within a sentence has a "Token_id", which is the name of the second column. The third column, "Before", contains the written text (or token) of a sentence, while the fourth column, "Class", lists the name of the semiotic class of each token.

Table 5
Confusion Matrix.

                     Predicted Class
Actual Class    True                    False
True            True Positive (TP)      False Negative (FN)
False           False Positive (FP)     True Negative (TN)

Table 6
Results on tokens using different machine learning algorithms.

Machine Learning Model | Accuracy (%) | F1-score (%) | Precision (%) | Recall (%)
XGBClassifier | 99.997 | 99.997 | 99.997 | 99.997
LGBMClassifiers with dart | 99.993 | 99.993 | 99.993 | 99.993
Catboost | 99.977 | 99.977 | 99.977 | 99.977
Voting Classifier | 99.337 | 99.336 | 99.344 | 99.337
LightGBM | 99.008 | 99.005 | 99.04 | 99.008
K Nearest Neighbor Classifier | 98.820 | 98.814 | 98.819 | 98.820
names of the semiotic classes for each token. Classifier


Fig. 19. Top six machine learning algorithms results on tokens.

Table 7
Results for each class using top six machine learning algorithms by percentage.
Class Name | XGBClassifier, LGBMClassifiers with dart and Catboost (%): Prec. / Rec. / F1-s. | Voting Classifier (%): Prec. / Rec. / F1-s. | LightGBM (%): Prec. / Rec. / F1-s. | K Nearest Neighbor Classifier (%): Prec. / Rec. / F1-s.

ABBREVIATION 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 0.99 0.97 0.98
BNG_TXT 1.00 1.00 1.00 0.98 1.00 0.99 1.00 0.99 0.99 0.97 0.92 0.95
CARDINAL 1.00 1.00 1.00 1.00 0.98 0.99 1.00 0.94 0.97 1.00 0.99 1.00
DATE 1.00 1.00 1.00 0.99 0.98 0.99 1.00 1.00 1.00 1.00 1.00 1.00
ENG_NUM 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00 1.00
ENG_TXT 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00
EQUATION 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00
FLOAT 1.00 1.00 1.00 0.97 1.00 0.98 0.99 1.00 0.99 0.99 1.00 1.00
FRACTIONAL 1.00 1.00 1.00 1.00 0.97 0.99 1.00 1.00 1.00 0.97 0.99 0.98
GAME 1.00 1.00 1.00 0.99 0.99 0.99 1.00 1.00 1.00 0.99 0.99 0.99
MONEY 1.00 1.00 1.00 0.96 0.99 0.98 0.99 1.00 0.99 0.96 0.96 0.96
ORDINAL 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.95 0.97 0.98 1.00 0.99
PERCENTAGE 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.99 0.99 0.99 1.00 1.00
PUNCTUATION 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00
RANGE 1.00 1.00 1.00 0.99 1.00 0.99 0.99 0.99 0.99 0.97 0.97 0.97
TELEPHONE 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.99 0.99 0.98 0.99 0.98
TIME 1.00 1.00 1.00 1.00 0.99 1.00 0.92 1.00 0.96 1.00 1.00 1.00
YEAR 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Accuracy 1.00 0.99 0.99 0.99
Macro avg. 1.00 1.00 1.00 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
Weighted avg. 1.00 1.00 1.00 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99

From 80,000 Bangla text sentences, we have found 2,220,900 tokens.


The method has identified 18 types of semiotic classes for normalization.
The names of these semiotic classes are BNG_TXT, PUNCTUATION, CARDINAL, YEAR, RANGE, GAME, MONEY, ABBREVIATION, ENG_TXT, TELEPHONE, FRACTIONAL, ORDINAL, EQUATION, PERCENTAGE, DATE, TIME, ENG_NUM and FLOAT. In Table 3, the number of training samples for each semiotic class is shown. We have selected unique words for the BNG_TXT class from 10 lakh (one million) Bangla words. The list of classes with written-text and normalized-text examples is displayed in Table 4.

3) Feature Extraction: For this method, the word2vec CBOW model has been chosen for feature extraction from the text corpus. We have taken a maximum of 20 features for this word2vec CBOW model. Each token is converted into a vector in this step.

Embedded_text_data = Word2Vec(Labeled_text_data)

4) Imbalanced data handling: When the distribution of classes in the dataset is noticeably skewed, it presents a substantial problem for text classification. The model may work better for the majority class and less effectively for the minority classes in such circumstances. Therefore, the imbalanced dataset is handled first in order to achieve promising accuracy with the best outcome. As a result, the dataset is balanced using the random oversampling data balancing technique, a non-heuristic procedure (see Figs. 16 and 17). This method's primary objective is to randomly replicate samples from the minority classes.

Balanced_text_data = Random_Oversampler(Embedded_text_data, y_data)
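An assumed illustration of the random-oversampling step behind Figs. 16 and 17, using imbalanced-learn: minority-class samples are duplicated at random until every class has the same count. The 20-dimensional random vectors and class counts are placeholders.

```python
from collections import Counter
import numpy as np
from imblearn.over_sampling import RandomOverSampler

X = np.random.rand(100, 20)   # stand-in for 20-dim token embeddings
y = np.array(["TIME"] * 5 + ["ABBREVIATION"] * 3 + ["MONEY"] * 92)

ros = RandomOverSampler(random_state=42)
X_bal, y_bal = ros.fit_resample(X, y)
print(Counter(y), "->", Counter(y_bal))  # every class ends up with 92 samples
```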


Fig. 20. The average margin of different models according to the baseline.

5) Model Training: Training data, test data, and validation data were separated from the balanced text data: 49 % of the total data has been assigned for training, 30 % is used for testing, and 21 % is assigned for fine-tuning. We have trained 19 types of classifiers using the training data. In Fig. 18, the decrease of logloss and error during XGBClassifier training is displayed. The hyperparameter values for XGBClassifier training are listed below.

• objective: multi:softmax
• num_class: 18
• eta: 0.3
• iteration: 100
• max_depth: 10
• nthread: −1
• eval_metric: merror, mlogloss
• silent: 1
• verbose_eval: 10
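A sketch, assuming the native xgboost API, of training with the hyperparameter values listed above; the random matrices stand in for the real 20-dimensional token embeddings and 18 semiotic classes, so data and evaluation numbers here are placeholders.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 20)), rng.integers(0, 18, 1000)
X_valid, y_valid = rng.random((200, 20)), rng.integers(0, 18, 200)

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {
    "objective": "multi:softmax",
    "num_class": 18,
    "eta": 0.3,
    "max_depth": 10,
    "nthread": -1,
    "eval_metric": ["merror", "mlogloss"],
}
# 100 boosting rounds, logging merror/mlogloss on the validation split every 10.
booster = xgb.train(params, dtrain, num_boost_round=100,
                    evals=[(dvalid, "valid")], verbose_eval=10)
pred_classes = booster.predict(dvalid)  # class indices, because of multi:softmax
```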

6) Model Evaluation: We have determined the recall, precision, accuracy, and f1-score of the 19 types of classifiers, including the K-Nearest Neighbor classifier, Gaussian Naïve Bayes classifier, Multinomial Naïve Bayes classifier, Decision Tree classifier, Random Forest classifier, Bernoulli Naïve Bayes classifier, Logistic Regression, Gradient Boosting classifier, LGBMClassifiers with dart and gbdt, XGBClassifier, Linear Support Vector Machine, CatBoost, AdaBoost, Nearest Centroid classifier, Voting classifier, Bagging Classifier and LightGBM, for the token classification method. Each classifier's performance on the positive and negative classes is assessed independently using the four outcomes of the confusion matrix. For each classifier, the confusion matrix (shown in Table 5) gives the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions, i.e., two kinds of accurate predictions and two kinds of inaccurate predictions. Table 6 and Fig. 19 contain the recall, precision, accuracy, and f1-score values for the top six classifiers. We also provide the top six classifiers' recall, precision, accuracy, and f1-score values for each semiotic class in Table 7.

$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \times 100$   (29)
Nearest Neighbor classifier, Gaussian Naïve Bayes classifier,


Fig. 21. Heatmaps for (a) XGBClassifier (b) Lightboost Classifiers with dart algorithms.

$\text{Precision} = \frac{\sum_{l=1}^{L} TP_l}{\sum_{l=1}^{L} (TP_l + FP_l)} \times 100$   (30)

$\text{Recall} = \frac{\sum_{l=1}^{L} TP_l}{\sum_{l=1}^{L} (TP_l + FN_l)} \times 100$   (31)

$\text{F1 score} = \frac{\sum_{l=1}^{L} 2\,TP_l}{\sum_{l=1}^{L} (2\,TP_l + FP_l + FN_l)} \times 100$   (32)

Receiver operating characteristic (ROC) curves are helpful visual tools for classifier evaluation. However, class imbalances (variations in prior class probabilities) may prevent ROC curves from accurately depicting the performance of the classifier. On the ROC curve, the true positive rate (TPR) and false positive rate (FPR) are displayed.

$TPR = \frac{TP}{TP + FN}$   (33)

$FPR = \frac{FP}{FP + TN}$   (34)

The area under the ROC curve (AUC) measures the full area under the ROC curve. For binary classification tasks, AUC can be written as follows:

$AUC = \int_{-\infty}^{\infty} TPR(T)\, FPR'(T)\, dT = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} I(T' > T)\, f_1(T')\, f_0(T)\, dT\, dT' = P(X_1 > X_0)$   (35)

The multi-class AUC average is computed as follows:

$AUC = \frac{2}{|C|\,(|C| - 1)} \sum_{i=1}^{C} AUC_i$   (36)
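A small sketch, assuming scikit-learn, of computing the per-class metrics of Eqs. (29)-(32) and the pairwise-averaged multi-class AUC of Eq. (36); the label vectors and probability scores below are fabricated purely for illustration.

```python
import numpy as np
from sklearn.metrics import classification_report, roc_auc_score

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])
print(classification_report(y_true, y_pred, digits=3))  # precision/recall/F1 per class

# multi_class="ovo" averages the pairwise AUCs, as in Eq. (36); it requires
# per-class probability scores rather than hard labels.
y_score = np.array([
    [0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.2, 0.5, 0.3],
    [0.2, 0.6, 0.2], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.3, 0.5],
])
print(roc_auc_score(y_true, y_score, multi_class="ovo", average="macro"))
```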


Fig. 22. ROC curve for (a) XGBClassifier (b) Lightboost Classifiers with dart algorithms.

Fig. 23. Working process of the text normalization method.


Table 8
Input written text and output spoken form pairs with class.

Here, the token <self> indicates that the token is to be left alone.

6.2. Result and analysis

After training the models, we assessed their performance using the test dataset's accuracy, recall, precision, and F1-score, as well as by plotting confusion matrices as heat maps. We have determined the recall, precision, accuracy, and f1-score of the 19 classifiers, including the K-Nearest Neighbor classifier, Gaussian Naïve Bayes classifier, Multinomial Naïve Bayes classifier, Decision Tree classifier, Random Forest classifier, Bernoulli Naïve Bayes classifier, Logistic Regression, Gradient Boosting classifier, LGBMClassifiers with dart and gbdt, XGBClassifier, Linear Support Vector Machine, CatBoost, AdaBoost, Nearest Centroid classifier, Voting classifier, Bagging Classifier and LightGBM, for the token classification step of the text normalization method. In Table 6 and Fig. 19, the accuracy of the top 6 classifiers is displayed. For a more thorough examination, Table 7 shows the recall, precision, and F1-score of the XGBClassifier, LGBMClassifiers with dart, Catboost, Voting Classifier, LightGBM, and K Nearest Neighbor Classifier for each class. The top six models' macro and weighted averages for recall, precision, and F1-score are displayed in the last two rows. In Fig. 20, we have ranked the 19 classifiers using confusion-matrix metrics like accuracy, f1-score, recall, and precision, where the top model, XGBClassifier, achieved the highest score of 11.9 points with respect to the baseline. It is represented by the purple bar in the figure. The proposed model with the lowest accuracy margin is represented by the red bar. In comparison to higher models, using the baseline as the reference, a model with a minus (−) sign has substantially lower accuracy. In the ranking, the top 6 models achieved more than 10 points. Confusion matrices are a useful tool for examining and visualizing multi-class classifier performance. In general, the confusion matrix displays the precise number of correctly and incorrectly classified tokens.

Table 9
Comparison of the proposed method with previous works for Bangla text.

Authors | Research Paper | Method | Accuracy
Firoj Alam et al., 2009 (Alam et al., 2008) | Text normalization system for Bangla | Regular expression in jflex format | 99 % (3 classes)
Muhammad Masud Rashid et al., 2010 (Rashid et al., 2010) | Text normalization and diphone preparation for Bangla speech synthesis | Rule-based approach and database approach | <90 %
Proposed Method | Bangla text normalization for text-to-speech synthesizer using machine learning algorithms | XGBClassifier algorithm | 99.997 % (18 classes)
achieved the highest number with 11.9 points according to the baseline. class classifier performance. In general, the confusion matrix displays
the precise number of properly and incorrectly classified tokens. For

Fig. 24. Working process example of the proposed Bangla text normalization method.


For multi-class text classification tasks, heat maps can offer a visual depiction of the performance measures. The individual cells of the heat map show the number of instances assigned to each class. In Fig. 21, heatmaps for the XGBClassifier and LightGBM with dart algorithms are displayed. For the token classification approach employing the XGBClassifier and LightGBM with dart algorithms, the ROC curves and average AUC are shown in Fig. 22.
7. Text normalization methodology

7.1. Method

This section provides a description of the text normalization method. Firstly, all data is preprocessed and labeled using regular expressions. Secondly, the labeled text data is converted to vectors using the word2vec CBOW model. Thirdly, each token is classified using a top-ranked machine learning algorithm such as XGBClassifier. Fourthly, the verbalizer expands each token into its normalized form according to its semiotic class. Finally, the text postprocessor combines the tokens into sentences according to Sentence_id and Token_id. The complete working process of the text normalization approach is shown in Fig. 23, and the algorithm for this method is described in Algorithm 2.
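Before the formal listing in Algorithm 2 below, a rough, hypothetical sketch of how the five steps just described could be wired together. The helper names (text_preprocessor, embed, classify, verbalize) and the trivial stand-ins passed to them are placeholders for the paper's components; only the data flow is illustrated.

```python
def text_preprocessor(corpus):
    # Step 1: tokenize written text, keeping (sentence_id, token_id, token).
    return [(s_id, t_id, tok)
            for s_id, sent in enumerate(corpus)
            for t_id, tok in enumerate(sent.split())]

def normalize(corpus, embed, classify, verbalize):
    labeled = text_preprocessor(corpus)                  # 1) text preprocessing
    vectors = [embed(tok) for _, _, tok in labeled]      # 2) word2vec features
    classes = [classify(v) for v in vectors]             # 3) trained classifier
    spoken = [verbalize(tok, cls)                        # 4) verbalization
              for (_, _, tok), cls in zip(labeled, classes)]
    sentences = {}                                       # 5) text post-processing
    for (s_id, _, _), word in zip(labeled, spoken):
        sentences.setdefault(s_id, []).append(word)
    return [" ".join(words) for words in sentences.values()]

# Example wiring with trivial stand-ins for the real embed/classify/verbalize.
out = normalize(["meeting at 06:30 today"],
                embed=lambda tok: tok,
                classify=lambda v: "TIME" if ":" in v else "ENG_TXT",
                verbalize=lambda tok, cls: "six thirty" if cls == "TIME" else tok)
print(out)  # ['meeting at six thirty today']
```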
Algorithm 2. (Working Procedure of Bangla Text Normalization for TTS Synthesizer)

Input: Bangla text corpus in written form
Output: Bangla text corpus in spoken form
1. Begin
2. data ← load dataset
3. Labeled_text_data ← Text_Preprocessor(data)
4. Embedded_text_data ← Word2Vec(Labeled_text_data)
5. model ← load trained model
6. semiotic_class_list ← token_classifier(Embedded_text_data, model)
7. normalized_token ← Verbalizer(Labeled_text_data, semiotic_class_list)
8. Final_Output ← Text_Postprocessor(normalized_token)
Some resources were required to carry out this research. The re­
sources listed in Table 10 were used to build the suggested model.
1) Text preprocessing: In the function, some Bangla words will be
grouped into the same token according to semiotic classes. For
8. Conclusion
example, the input text “০৬:৩০ িমঃ”, “৪–১ েগাল” and “৯৯৯ ন�র” will be
grouped using regular expressions. Then tokenization separates a
In this research, a text normalization method has been proposed to
text stream into tokens, which can be words, phrases, symbols, or
produce a normalized text corpus for a text-to-speech (TTS) synthesizer.
other meaningful items.
It translates a written text into normalized text which has more simi­
Labeled_text_data = Text_Preprocessor(Bangla_Text_Corpus) larity to the spoken form. The method utilized the XGBClassifier algo­
rithm to select the semiotic class for a token. Then, the verbalizer
function converts the token into spoken form according to the predicted
class of the token. The normalized text corpus can be used to generate
2) Feature Extraction: For this method, the word2vec CBOW model
more accurate results for the TTS synthesizer. In the future work, we will
has been chosen for feature extraction from a text corpus. All tokens
produce a text normalization method for the Bangla language using deep
will convert into the vector in this step.
learning algorithms to achieve more accuracy. We will also build a
Embedded_text_data = Word2Vec(Labeled_text_data) machine translation-based model in which a text normalization system
will not require verbalization to generate normalized text. The main
challenge of this research is minimizing the normalized text errors, as
the accuracy of a TTS synthesizer highly dependent on the normalized
3) Token Classification using Trained Model: In this step, the token
text.
classifier will predict a semiotic class for each token. XGBClassifier
algorithm has been chosen for the classification. The step has two
parts. Table 10
Environment setup for the text normalization method.
a) Train a model using a normalized corpus: First of all, a large
normalized training data should be produced to train a model. Resource List Details Information
The data has been created using regular expressions for the token CPU Intel® Core™ i5-3427U CPU @ 1.80 GHz
classification. The method has produced training data with se­ RAM 8 GB
miotic class for 2,220,900 tokens. The full procedure of this step GPU Intel® UHD Graphics
Experimental Tool Jupyter Notebook
has been explained in section-6.


Author contribution

The study's concept and design were developed by MRI, AA, and MSR, who also had complete access to all the study's data. The creation of normalized data and evaluation of the various models' degrees of accuracy was conducted by MRI. The article's composition was a collaborative effort among all the authors. The report's critical revision included input from AA and MSR. MRI developed all the datasets, results and text normalization methods. The final version was reviewed and approved by all authors, who also helped with data collection and analysis.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Abbas, M., Memon, K.A., Jamali, A.A., Memon, S., Ahmed, A., 2019. Multinomial Naive Bayes classification model for sentiment analysis. IJCSNS Int. J. Comput. Sci. Netw. Secur. 19 (3), 62.
Alam, F., Nath, P.K., Khan, M., 2007. Text-to-speech for Bangla language using Festival.
Alam, F., Habib, S.M., Khan, M., 2008. Text normalization system for Bangla. BRAC University.
Allen, J., Hunnicutt, M.S., Klatt, D.H., Armstrong, R.C., Pisoni, D.B., 1987. From Text to Speech: the MITalk System. Cambridge University Press.
Bauer, E., Kohavi, R., 1999. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learn. 36, 105–139.
Biswas, N., Uddin, K.M., Rikta, S.T., Dey, S.K., 2022. A comparative analysis of machine learning classifiers for stroke prediction: a predictive analytics approach. Healthcare Analytics 1 (2), 100116.
Bloehdorn, S., Hotho, A., 2006. Boosting for text classification with semantic features. In: Advances in Web Mining and Web Usage Analysis: 6th International Workshop on Knowledge Discovery on the Web, WebKDD 2004, Seattle, WA, USA, August 22-25, 2004, Revised Selected Papers 6. Springer Berlin Heidelberg, pp. 149-166.
Dutoit, T., 1997. High-quality text-to-speech synthesis: an overview. J. Electrical Electron. Eng. Aust. 17 (1), 25–36.
Ebden, P., Sproat, R., 2015. The Kestrel TTS text normalization system. Nat. Lang. Eng. 21 (3), 333–353.
Essa, E., Omar, K., Alqahtani, A., 2023. Fake news detection based on a hybrid BERT and LightGBM models. Complex Intelligent Systems, 1–12.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Ann. Statistics, 1189–1232.
Handley, Z.L., 2005. Evaluating text-to-speech (TTS) synthesis for use in computer-assisted language learning (CALL). The University of Manchester (United Kingdom).
Indra, S.T., Wikarsa, L., Turang, R., 2016. Using logistic regression method to classify tweets into the selected topics. In: 2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS). IEEE, pp. 385-390.
Islam, M.R., Rahman, J., Talha, M.R., Chowdhury, F., 2020. Query Expansion for Bangla Search Engine Pipilika. In: 2020 IEEE Region 10 Symposium (TENSYMP). IEEE, pp. 1367-1370.
Kim, M., Cheon, S.J., Choi, B.J., Kim, J.J., Kim, N.S., 2021. Expressive text-to-speech using style tag. arXiv preprint arXiv:2104.00436.
Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., Brown, D., 2019. Text classification algorithms: a survey. Information 10 (4), 150.
Lai, T.M., Zhang, Y., Bakhturina, E., Ginsburg, B., Ji, H., 2021. A Unified Transformer-based Framework for Duplex Text Normalization. arXiv preprint arXiv:2108.09889.
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26, 3111–3119.
Mukherjee, S., Mandal, S.K.D., 2014. A Bengali HMM based speech synthesis system. arXiv preprint arXiv:1406.3915.
Murphy, K.P., 2006. Naive Bayes classifiers. Generative classifiers. Bernoulli 4701 (October), 1–8. https://doi.org/10.1007/978-3-540-74958-5_35.
Naser, A., Aich, D., Amin, M.R., 2010. Implementation of Subachan: Bengali text-to-speech synthesis software. In: International Conference on Electrical & Computer Engineering (ICECE 2010). IEEE, pp. 574–577.
Natekin, A., Knoll, A., 2013. Gradient boosting machines, a tutorial. Front. Neurorob. 7 (DEC). https://doi.org/10.3389/fnbot.2013.00021.
Onan, A., Korukoğlu, S., Bulut, H., 2016. Ensemble of keyword extraction methods and classifiers in text classification. Expert Syst. Appl. 57, 232–247.
Pramanik, S., Hussain, A., 2019. Text normalization using memory augmented neural networks. Speech Comm. 109, 15–23.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A.V., Gulin, A., 2018. CatBoost: unbiased boosting with categorical features. In: Advances in Neural Information Processing Systems, pp. 6638-6648.
Raju, R.S., Bhattacharjee, P., Ahmad, A., Rahman, M.S., 2019. A Bangla text-to-speech system using deep neural networks. In: 2019 International Conference on Bangla Speech and Language Processing (ICBSLP). IEEE, pp. 1-5.
Rashid, M.M., Hussain, M.A., Rahman, M.S., 2010. Text normalization and diphone preparation for Bangla speech synthesis. J. Multimed. 5 (6), 551.
Ro, J.H., Stahlberg, F., Wu, K., Kumar, S., 2022. Transformer-based Models of Text Normalization for Speech Applications. arXiv preprint arXiv:2202.00153.
Sasirekha, D., Chandra, E., 2012. Text to speech: a simple tutorial. Int. J. Soft Comput. Eng. (IJSCE) 2 (1), 275–278.
Shamsi, M., Chevelu, J., Barbot, N., Lolive, D., 2020. Corpus design for expressive speech: impact of the utterance length. In: 10th International Conference on Speech Prosody 2020. ISCA, pp. 955-959.
Singh, G., Kumar, B., Gaur, L., Tyagi, A., 2019. Comparison between multinomial and Bernoulli naïve Bayes for text classification. In: 2019 International Conference on Automation, Computational and Technology Management (ICACTM). IEEE, pp. 593-596.
Sodimana, K., De Silva, P., Sproat, R., Theeraphol, A., Li, C.F., Gutkin, A., Sarin, S., Pipatsrisawat, K., 2018. Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala, and Sundanese TTS Systems.
Sproat, R., 1996. Multilingual text analysis for text-to-speech synthesis. Nat. Lang. Eng. 2 (4), 369–380.
Sproat, R., 2010. Lightly supervised learning of text normalization: Russian number names. In: 2010 IEEE Spoken Language Technology Workshop. IEEE, pp. 436-441.
Sproat, R., Jaitly, N., 2016. RNN approaches to text normalization: a challenge. arXiv preprint arXiv:1611.00068.
Sproat, R., Black, A.W., Chen, S., Kumar, S., Ostendorf, M., Richards, C., 2001. Normalization of non-standard words. Comput. Speech Lang. 15 (3), 287–333.
The festival speech synthesis system, 2019. Accessed: 2019-07-28. [Online]. Available: http://www.cstr.ed.ac.uk/projects/festival/.
Tyagi, S., Bonafonte, A., Lorenzo-Trueba, J., Latorre, J., 2021. Proteno: Text normalization with limited data for fast deployment in text to speech systems. arXiv preprint arXiv:2104.07777.
Xu, S., 2018. Bayesian Naïve Bayes classifiers to text classification. J. Inf. Sci. 44 (1), 48–59.

