Multi-Task Deep Learning For Vietnamese Capitalization and Punctuation Recognition
Multi-Task Deep Learning For Vietnamese Capitalization and Punctuation Recognition
Corresponding Author:
Tuan-Linh Nguyen
Faculty of Electronics Engineering, Thai Nguyen University of Technology
Thai Nguyen, Vietnam
Email: [email protected]
1. INTRODUCTION
Automatic speech recognition (ASR) refers to the various processes, technologies, and methodologies
that facilitate improved interaction between humans and computers by converting spoken language
into written text [1], [2]. The output text of the ASR system, besides containing some insertion, deletion,
and word substitution errors, often has no punctuation and no capitalization. The issue of standardizing text
by restoring punctuation and capitalization is necessary. It will make the text more structured and understand-
able. Furthermore, it will facilitate NLP tasks including named entity recognition (NER), syntactic parsing,
part-of-speech tagging, and discourse segmentation [3]. Additionally, it benefits practical applications such as
generating movie subtitles, broadcasting news transcripts, creating documents for online meetings, and extract-
ing customer information more effectively [4].
Research in this field has traditionally focused on individual tasks, such as restoring punctuation
[5]–[7] or capitalization [8]–[10]. However, these separate processing techniques are challenging to improve
the overall effectiveness of the ASR systems. Therefore, recently, authors have proposed models to simultane-
ously handle two tasks [4], [11]–[13].
Initially, research in this field utilized methods such as maximum entropy [14], hidden markov model
(HMM) [15], long short-term memory (LSTM) [16], and conditional random field (CRF) [17], [18].
Currently, the most advanced models for both punctuation restoration and capitalization recognition use
neural network models. Studies have explored the character-level recurrent neural network (RNN) model [19],
bidirectional RNN [20], [21]. Particularly, with the emergence of the transformer model and its variations
like BERT, RoBERTa, ALBERT, DistilBERT, mBERT, and XLM-RoBERTa, studies on restoring punctuation
and capitalization on the transformer model [12], [22], BERT [10], [23], RoBERTa [24], have shown positive
results. Moreover, research on the Vietnamese language has also followed this innovative and promising trend,
using CRFs combined with CNNs and LSTM [25], transformer [26]–[28]. Notably, the ViBERT model, based
on RoBERTa for Vietnamese, is the latest model that achieves better performance in both tasks [29].
There are also have some limitations and challenges that need to be solved for the problem of restoring
punctuation and capitalization in general and Vietnamese in particular. Regarding data-related issues, to deal
with low-resource language data, some studies have experimented with using standardized text without punc-
tuation and capitalization [3], [6], [26]. Because commas and periods appear more frequently than other marks,
most research focuses only on these marks [12], [30], [31]. ASR output text does not have punctuation, so it is
often long and endless. To process such text in models, input sequences are usually randomly cut into ranges
of 20-30 words [32] or 20-50 words [24], maximum length 100 words [33] and 150 words [25]. Determining
the appropriate length for cutting is a matter that needs to be considered. The proposal to use overlap-chunking
in cutting and concatenating sequences has helped improve the model [9]. The issue of mismatched sequence
lengths due to errors like insertions, deletions, and substitutions in the ASR output text is one of the major
challenges that need to be addressed. Determining the order of punctuation restoration and capitalization is
still an important issue as it can significantly impact the final results [29].
Multi-task deep learning (MTDL) has gained significant attention in the field of machine learning
[34] and natural language processing (NLP) [35]. It has been used to address challenges such as overfitting and
data scarcity in NLP tasks. Different architectures, including parallel, hierarchical, modular, and generative
adversarial architectures, have been employed for multi-task learning (MTL) in NLP. Optimization techniques
such as loss construction, data sampling, and task scheduling have been utilized to train multi-task models
effectively. MTL has shown promising results in various NLP tasks, including machine translation, dialogue-
based systems, sentiment analysis, and machine reading comprehension.
In this study, our goal is to propose a MTDL for simultaneous capitalization and punctuation recogni-
tion (CPR). We combine MTL with the hypothesis that integrating a NER model will improve the performance
of punctuation restoration and capitalization. Additionally, we also employ the overlap-chunking technique
and the ViBERT language model, which have shown effectiveness in NER [13]. Particularly, to cope with
low-resource language data, we consider using text-to-speech (TTS) to augment the training data.
This paper is structured as follows: the initial session addresses the problem and reviews related
research. The second session discusses our proposed approach and the architecture of the MTDL CPR system.
In the third session, the focus is on the dataset’s implementation and preparation for both training and testing.
Session four details the configuration steps for the testing process and presents the evaluation outcomes. Lastly,
the fifth session offers a conclusion and recommendations for future research avenues.
2. METHOD
2.1. Multi-task deep learning for capitalization and punctuation recognition
Using the idea from MTL for the task of punctuation and capitalization restoration, a MTDL approach
is proposed with the aim of improving the performance of the CPR task. Figure 1 illustrates the proposed
MTDL model, which consists of the main CPR task combined with an auxiliary NER stream that provides
additional information about the named entity (NE) for the punctuation and capitalization restoration task. The
input data to the model is the output text of Vietnamese ASR without punctuation and capitalization, with a
length of N . During recognition, some sentences may contain errors such as substitutions, insertions, and
deletions, making the punctuation and capitalization restoration task more challenging. The sentence from
input is transferred through the ViBERT, a Vietnamese language representation model. In this study, a transfer
learning approach is applied with ViBERT, a pre-trained model that is kept fixed in the proposed MTDL model.
The output of ViBERT is a matrix of size (N x768). This representation matrix is simultaneously fed into three
blocks: (i) the NER information extraction block, (ii) the main CPR restoration block, and (iii) the NER
auxiliary learning block using a MTL mechanism.
Figure 1 shows the proposed multitask deep learning model, which includes the main CPR stream and
NER stream to support the NE information for CPR. Both CPR and NER tagger are based on the transformer
decoder structure in [29]. The input data consisted of unpunctuated and uncapitalized Vietnamese sentences,
which were extracted from the output of an ASR system, with a set size of N.
The input sentence underwent processing using the representation of Vietnamese language model
known as ViBERT. This model produced an output in the form of a matrix, characterized by dimensions of
(Nx768), which encode the representation of the input sentence. This matrix representation was simultaneously
directed toward three distinct components: (i) a block that extracts NE information, referred to as the NE
encoder; (ii) the main CPR block; and (iii) an auxiliary block for NER.
The supplemental block for extracting NER information incorporated a NER tagging block along
with an NER embedding block, which functioned to encode recognizable NER tags. These encoded tags were
meant to supply pertinent information to the CPR block. The CPR block represented the core task of the
MTDL model and utilized output from the sentence’s representation matrix produced by ViBERT. This input
was merged (concatenated) with the NER tag encoding matrix, emanating as the output from block (ii). This
merger with the output from block (ii) enriched the NER information, thereby enhancing the precision of the
CPR labeling process.
The output from the block was a collection of probabilities corresponding to NER tags for the specific
input sentence, and the loss function for this process was calculated based on these NER labels. The encoder
block was linked to the NER embedding block through a mechanism that shared parameters, as illustrated in
(1). Within this configuration, the weight matrix of the NER embedding was replicated from the transposed
matrix of the encoder in block (iii).
T
Wemb = Wenc (1)
The network was trained using a MTL approach that involved two tasks: CPR, which served as the
primary task, and NER, which functioned as a supportive, auxiliary task. The loss value for the MTDL model
Multi-task deep learning for Vietnamese capitalization and punctuation ... (Phuong-Nhung Nguyen)
1608 ❒ ISSN: 2252-8938
was determined by taking a weighted sum of the loss values from both the CPR and NER tasks. The MTDL
model’s loss value is computed as follows: the losses from the CPR and NER tasks are each assigned a weight,
and then these weighted losses are summed together to yield the final loss value for the model, as shown in (2).
Where α is the weight for the loss value of the CPR task, and β is the weight for the loss value of the NER
task. The choice of α and β depends on the importance of each task. In this study, the CPR task is considered
the primary task and the NER task as the auxiliary task. Therefore, α and β are chosen to be 0.6 and 0.4,
correspondently.
The chosen length for the segment cuts is an even number of words. Let l represent the length of the
segment cut, and k is the length of the overlapping part, then we have l = 2k. Each input string S, containing
n words denoted as w1 , w2 , ..., wn , will be divided into n/l + (n − k)/l overlapping segments. Here, the
ith segment cut is a substring of words wi−1 k + 1, . . . , wi+1 k. The study has examined various values for
l and k, and these values have been selected empirically to be appropriate. As the input sentence is divided
into overlapping segments, the challenge in merging these segments is to determine which words should be
discarded and which should be retained in the merged portion of the final sentence.
Let c be the length of the segment that will be either retained or discarded in the overlapping parts. For
simplicity in calculation, we take c = k/2. Observations indicate that the ending words of the first overlapping
segment and the initial words of the second overlapping segment (the words around the cutting point) lack
substantial context. Therefore, the algorithm will discard the c segment at the end of the first overlapping part
and retain the c segment in the second overlapping part. Consequently, the remaining words at the beginning of
the first overlapping segment are kept, and the remaining words at the start of the second overlapping segment
are discarded. This ensures that the words retained in the overlapping sections, always situated in the middle of
segments, will have ample context, aiding in more accurate restoration. The process of discarding and retaining
portions of the overlapping segments is repeated for subsequent overlapping segments. The post-catenation
merging process is described as (3).
n−1
X
[[w1 , . . . w(2k−c) ] + [w((i−1)k+c) , . . . w(ik+c) ] + [w(n−2k+c) , . . . wn ] (3)
i=2
Multi-task deep learning for Vietnamese capitalization and punctuation ... (Phuong-Nhung Nguyen)
1610 ❒ ISSN: 2252-8938
An alternative method for data collection utilizing TTS systems has been outlined in Figure 3, which
is suggested as a substitute for traditional recording techniques:
− Phase 1: Transform written text into audible speech with the use of a TTS tool
TTS technology has been under exploration in Vietnam for some time. Initially, VnSpeech was
recognized as the pioneering Vietnamese TTS system. The VietSound system, which is a product of Ho Chi
Minh City University of Technology, followed suit. Yet, these early systems had setbacks, particularly in pro-
ducing speech that sounded disconnected and unnatural. In recent times, more robust and widely implemented
TTS systems have emerged. Notably, FPT.AI offers a TTS service renowned for its effectiveness. Additionally,
the Viettel’s VTCC.AI system and Vbee’s AI TTS system are among the powerful players in the field. Telecom
companies like MobiPhone have also introduced their TTS solutions. Google has been providing a TTS tool
that is celebrated for its natural-sounding voice, positioning it as one of the top contenders in TTS software.
Leveraging DeepMind’s speech synthesis expertise, Google has crafted a voice quality that listeners may find
challenging to differentiate from human speech. In our study, we utilized the dataset of VLSP 2018 along with
Google’s TTS system to convert text into speech for both training and validation purposes. Due to resource
constraints, our focus was restricted to recording audio for the test set within the dataset, which comprises
4,272 samples totaling approximately 242,000 words. The speech dataset was generated by four individuals
who read the text in various settings, resulting in over 26 hours of recorded audio.
− Phase 2: Utilizing the ASR system for speech recognition
Following the acquisition of the audio dataset from both TTS and recordings, the data underwent
processing via an ASR system. For this phase, the Vietnam artificial intelligence solutions (VAIS) ASR
system was selected due to its superior performance in recognizing Vietnamese speech when compared to
other systems such as those by FPT, Viettel, Zalo, and Google, as noted in [40]. Additionally, the VAIS ASR
system was previously utilized in research focused on identifying markers in Vietnamese speech processing
using a Pipeline approach, as mentioned in [13]. To maintain consistency for result comparison, we opted to
continue with the VAIS system for our experimental framework.
− Phase 3: Evaluation of ASR text from synthesized speech
Table 4 presents the error rates for the TTS-ASR process and the recorded text through ASR (REC-
ASR) method, assessed across 241,899 words from the dataset. The findings showed only a minor discrepancy
in ASR accuracy when comparing two data collection methods, which was deemed provisionally acceptable.
In order to amass a substantial volume of training data, we sourced information from authoritative Vietnamese
news websites, including vnexpress.net, dantri.com.vn, and vietnamnet.vn. This content was then processed
using a speech synthesis system to generate the voice data. We employed the recorded speech from the VLSP
2018 dataset, featuring four distinct speakers, as our test set.
The dataset of VLSP2018 is segmented into training, development, and testing subsets. The training
subset contains 30,280 Vietnamese sentences; the development subset includes 1,593 sentences; and the testing
subset comprises 4,271 sentences, all formatted with uppercase letters and punctuation. The VLSP sentences
were annotated for capitalization, punctuation, and NE recognition. For creating the test subset’s audio, four
individuals recorded the test text of the subset. The VAIS ASR system (https://vais.vn/en/speech-to-text-core/)
was utilized to convert the spoken words into text data for evaluation purposes. For the generation of synthe-
sized audio data from the textual VLSP data, Google’s TTS system was utilized. Additionally, VAIS’s ASR
system was employed to transcribe the synthetic audio, producing a text dataset for training the MTDL model.
Table 4. Analysis of ASR error rates (%) of synthesized and recorded speech
Number (%) Outlier (%) Other (%) Total (%)
TTS-ASR 1.061 2.418 1.582 5.068
REC-ASR 1.427 2.371 2.228 6.032
TP
precision = (4)
TP + FP
TP
recall = (5)
TP + FN
2 ∗ precision ∗ recall
recall = (6)
precision + recall
Multi-task deep learning for Vietnamese capitalization and punctuation ... (Phuong-Nhung Nguyen)
1612 ❒ ISSN: 2252-8938
where true positive (TP) denotes the number of punctuation and capitalization tags that were correctly identified
by the model. False positive (FP) signifies the number of punctuation and capitalization tags that were mistak-
enly identified. On the other hand, false negative (FN) refers to the number of punctuation and capitalization
tags that the model fail to recognize.
The MTL approach outperforms the single-task learning by a significant margin, as evidenced by
the scores of 0.8375 and 0.7818, respectively. This suggests that MTL is more effective at generalizing from
uncased reference text, potentially due to its ability to capture and utilize shared representations from NER that
are relevant to the input text. The high performance of 0.8375 for MTL in processing uncased reference text
indicates robust feature extraction and learning, which can handle the absence of case information effectively.
This could be attributed to the multi-task framework’s potential to leverage auxiliary information from NER
task, improving the model’s understanding of context.
The findings further confirmed that integrating the NER component enhances the performance of
the CPR model processing ASR-generated text. The F1 score for the CPR model saw a notable increase of
approximately 6.2%, rising from 0.7319 to 0.778, after adding the NE recognition model to the text of ASR
output. Additionally, the NER model proved to be highly valuable, as it substantially raised the F1 score of
the CPR model by 7.1% with unprocessed text, which in this case, lacked capitalization and punctuation in the
reference text.
4. CONCLUSIONS
In our research, we introduced the application of a NER model to enhance capitalization and
punctuation recovery in Vietnamese speech processing with the MTL scheme. The experimental results pro-
vided evidence of the synergy between these methodologies. We also created and presented the inaugural
speech dataset that establishes a foundation for future explorations into entity extraction from Vietnamese
speech. Additionally, we highlighted the significant benefits of utilizing a pre-trained language model tailored
for the Vietnamese language processing in CPR and NER multi-tasks, which set a new benchmark on the VLSP
2018 dataset. We have made this pre-trained model available to the research community to encourage further
exploration. This contribution is particularly valuable for underrepresented languages such as Vietnamese.
ACKNOWLEDGEMENTS
This work was supported by the Institute of Information Technology, Vietnam Academy of Science
and Technology under project number CS23.05.
REFERENCES
[1] D. Yu and L. Deng, Automatic speech recognition: a deep learning approach. in Signals and Communication Technology. London:
Springer, 2015, doi: 10.1007/978-1-4471-5779-3.
[2] A. Jarin, A. Santosa, M. T. Uliniansyah, L. R. Aini, E. Nurfadhilah, and Gunarso, “Automatic speech recognition for indonesian
medical dictation in cloud environment,” IAES International Journal of Artificial Intelligence, vol. 13, no. 2, pp. 1762–1772, 2024,
doi: 10.11591/ijai.v13.i2.pp1762-1772.
[3] R. Pappagari, P. Zelasko, A. Mikolajczyk, P. Pezik, and N. Dehak, “Joint prediction of truecasing and punctuation for conversational
speech in low-resource scenarios,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021,
pp. 1185–1191, doi: 10.1109/ASRU51503.2021.9687976.
[4] F. M. M. Batista and D. N. J. N. Mamede, “Recovering capitalization and punctuation marks on speech transcriptions,” Ph.D Thesis,
Fields of Science and Technology, Instituto Superior Técnico, Lisboa, Portugal, 2011.
[5] N. Ueffing, M. Bisani, and P. Vozila, “Improved models for automatic punctuation prediction for spoken and written text,” in
Interspeech 2013, ISCA, 2013, pp. 3097–3101, doi: 10.21437/Interspeech.2013-675.
[6] Q. H. Pham, B. T. Nguyen, and N. V. Cuong, “Punctuation prediction for vietnamese texts using condi tional random fields,” in
Proceedings of the Tenth International Symposium on Information and Communication Technology - SoICT 2019, New York, USA:
ACM Press, 2019, pp. 322–327, doi: 10.1145/3368926.3369716.
[7] P. Żelasko, P. Szymański, J. Mizgajski, A. Szymczak, Y. Carmiel, and N. Dehak, “Punctuation prediction model for conversational
speech,” in arXiv-Computer Science, 2018, pp. 1–5.
[8] R. H. Susanto, H. L. Chieu, and W. Lu, “Learning to capitalize with character-level recurrent neural networks: an empirical study,”
in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, USA: Association for
Computational Linguistics, 2016, pp. 2090–2095, doi: 10.18653/v1/D16-1225.
[9] H. N. T. Thu, B. N. Thai, H. N. V. Bao, T. D. Quoc, M. L. Chi, and H. N. T. Minh, “Recovering capitalization for automatic speech
recognition of vietnamese using transformer and chunk merging,” in 2019 11th International Conference on Knowledge and Systems
Engineering (KSE), IEEE, 2019, pp. 1–5, doi: 10.1109/KSE.2019.8919342.
[10] R. Rei, N. M. Guerreiro, and F. Batista, “Automatic truecasing of video subtitles using bert: a multilingual adaptable approach,”
in Information Processing and Management of Uncertainty in Knowledge-Based Systems, 2020, pp. 708–721, doi: 10.1007/978-3-
030-50146-4 52.
[11] A. Gravano, M. Jansche, and M. Bacchiani, “Restoring punctuation and capitalization in transcribed speech,” in 2009 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing, IEEE, 2009, pp. 4741–4744, doi: 10.1109/ICASSP.2009.4960690.
[12] A. Vāravs and A. Salimbajevs, “Restoring punctuation and capitalization using transformer models,” in Statistical Language and
Speech Processing, Cham, Switzerland: Springer, 2018, pp. 91–102, doi: 10.1007/978-3-030-00810-9 9.
[13] T. B. Nguyen, Q. M. Nguyen, T. T. H. Nguyen, Q. T. Do, and C. M. Luong, “Improving vietnamese named entity recogni-
tion from speech using word capitalization and punctuation recovery models,” in Interspeech 2020, 2020, pp. 4263–4267, doi:
10.21437/Interspeech.2020-1896.
[14] F. Batista, I. Trancoso, and N. Mamede, “Automatic recovery of punctuation marks and capitalization information for iberian
languages,” in Proceedings of the I Iberian SLTech 2009, 2009, pp. 99–102.
[15] J. P. Yamron, I. Carp, L. Gillick, S. Lowe, and P. V. Mulbregt, “A hidden markov model approach to text segmentation and event
tracking,” in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 1998, pp.
333–336, doi: 10.1109/ICASSP.1998.674435.
[16] O. Tilk and T. Alumäe, “LSTM for punctuation restoration in speech transcripts,” in Interspeech 2015, ISCA, 2015, pp. 683–687,
doi: 10.21437/Interspeech.2015-240.
[17] X. Wang, H. T. Ng, and K. C. Sim, “Dynamic conditional random fields for joint sentence boundary and punctuation prediction,” in
Interspeech 2012, ISCA, 2012, pp. 1384–1387, doi: 10.21437/Interspeech.2012-398.
[18] M. Lui and L. Wang, “Recovering casing and punctuation using conditional random fields,” in Proceedings of Australasian Lan-
guage Technology Association Workshop, 2013, pp. 137–141.
[19] G. Ramena, D. Nagaraju, S. Moharana, D. P. Mohanty, and N. Purre, “An efficient architecture for predicting the case of characters
using sequence models,” in 2020 IEEE 14th International Conference on Semantic Computing (ICSC), IEEE, 2020, pp. 174–177,
doi: 10.1109/ICSC.2020.00035.
Multi-task deep learning for Vietnamese capitalization and punctuation ... (Phuong-Nhung Nguyen)
1614 ❒ ISSN: 2252-8938
[20] O. Tilk and T. Alumäe, “Bidirectional recurrent neural network with attention mechanism for punctuation restoration,” in Interspeech
2016, 2016, pp. 3047–3051, doi: 10.21437/Interspeech.2016-1517.
[21] A. Zahra, A. F. Hidayatullah, and S. Rani, “Bidirectional long-short term memory and conditional random field for tourism
named entity recognition,” IAES International Journal of Artificial Intelligence, vol. 11, no. 4, pp. 1270–1277, 2022, doi:
10.11591/ijai.v11.i4.pp1270-1277.
[22] R. Pan, J. A. Garcı́a-Dı́az, and R. Valencia-Garcı́a, “Evaluation of transformer-based models for punctuation and capitaliza-
tion restoration in spanish and portuguese,” in Natural Language Processing and Information Systems, 2023, pp. 243–256, doi:
10.1007/978-3-031-35320-8 17.
[23] A. Nagy, B. Bial, and J. Ács, “Automatic punctuation restoration with bert models,” arXiv-Computer Science, pp. 1–11, 2021.
[24] M. Courtland, A. Faulkner, and G. McElvain, “Efficient automatic punctuation restoration using bidirectional transformers with
robust inference,” in Proceedings of the 17th International Conference on Spoken Language Translation, 2020, pp. 272–279, doi:
10.18653/v1/2020.iwslt-1.33.
[25] H. T. T. Uyen, N. A. Tu, and T. D. Huy, “Vietnamese capitalization and punctuation recovery models,” in Interspeech 2022, ISCA,
2022, pp. 3884–3888, doi: 10.21437/Interspeech.2022-931.
[26] B. Nguyen et al., “Fast and accurate capitalization and punctuation for automatic speech recognition using transformer and
chunk merging,” in 2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and
Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), IEEE, 2019, pp. 1–5, doi: 10.1109/O-
COCOSDA46868.2019.9041202.
[27] V. T. Bui and O. T. Tran, “Punctuation prediction in vietnamese asrs using transformer-based models,” in PRICAI 2021: Trends in
Artificial Intelligence, 2021, pp. 191–204, doi: 10.1007/978-3-030-89363-7 15.
[28] H. Tran, C. V. Dinh, Q. Pham, and B. T. Nguyen, “An efficient transformer-based model for vietnamese punctuation prediction,” in
International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, 2021, pp. 47–58, doi:
10.1007/978-3-030-79463-7 5.
[29] T. T. H. Nguyen, T. B. Nguyen, N. P. Pham, Q. Truong, T. L. Le, and C. M. Luong, “Toward human-friendly ASR systems:
recovering capitalization and punctuation for vietnamese text,” IEICE TRANSACTIONS on Information and Systems, vol. 104, no.
8, pp. 1195–1203, 2021.
[30] F. Batista, D. Caseiro, N. Mamede, and I. Trancoso, “Recovering punctuation marks for automatic speech recognition,” in Inter-
speech 2007, ISCA: ISCA, 2007, pp. 2153–2156, doi: 10.21437/Interspeech.2007-581.
[31] W. Lu and H. T. Ng, “Better punctuation prediction with dynamic conditional random fields,” in EMNLP 2010 - Conference on
Empirical Methods in Natural Language Processing, 2010, pp. 177–186.
[32] E. Cho, J. Niehues, and A. Waibel, “NMT-based segmentation and punctuation insertion for real-time spoken language translation,”
in Interspeech 2017, ISCA: ISCA, Aug. 2017, pp. 2645–2649, doi: 10.21437/Interspeech.2017-1320.
[33] T. Pham, N. Nguyen, Q. Pham, H. Cao, and B. Nguyen, “Vietnamese punctuation prediction using deep neural networks,” in
SOFSEM 2020: Theory and Practice of Computer Science, 2020, pp. 388–400, doi: 10.1007/978-3-030-38919-2 32.
[34] N. Nikentari and H. L. Wei, “Multi-task learning using non-linear autoregressive models and recurrent neural networks for
tide level forecasting,” International Journal of Electrical and Computer Engineering, vol. 14, no. 1, pp. 960–970, 2024, doi:
10.11591/ijece.v14i1.pp960-970.
[35] R. M. Samant, M. R. Bachute, S. Gite, and K. Kotecha, “Framework for deep learning-based language models using multi-task learn-
ing in natural language understanding: a systematic literature review and future directions,” IEEE Access, vol. 10, pp. 17078–17097,
2022, doi: 10.1109/ACCESS.2022.3149798.
[36] T.-N. Phung, M. C. Luong, and M. Akagi, “An investigation on perceptual line spectral frequency (PLP-LSF) target stability against
the vowel neutralization phenomenon,” in 2011 3rd International Conference on Signal Acquisition and Processing (ICSAP 2011),
2011, pp. 512–514.
[37] P. T. Nghia, L. C. Mai, and M. Akagi, “Improving the naturalness of concatenative vietnamese speech synthesis under limited data
conditions,” Journal of Computer Science and Cybernetics, vol. 31, no. 1, 2015, doi: 10.15625/1813-9663/31/1/5064.
[38] V. L. Phung, H. K. Phan, A. T. Dinh, and Q. B. Nguyen, “Data processing for optimizing naturalness of vietnamese text-
to-speech system,” in 2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and
Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), IEEE, 2020, pp. 1–6, doi: 10.1109/O-
COCOSDA50338.2020.9295025.
[39] H. T. M. Nguyen, Q. T. Ngo, L. X. Vu, V. M. Tran, and H. T. T. Nguyen, “VLSP shared task: named entity recognition,” Journal of
Computer Science and Cybernetics, vol. 34, no. 4, pp. 283–294, 2019, doi: 10.15625/1813-9663/34/4/13161.
[40] C. H. Nga, C.-T. Li, Y.-H. Li, and J.-C. Wang, “A survey of vietnamese automatic speech recognition,” in 2021 9th International
Conference on Orange Technology (ICOT), IEEE, 2021, pp. 1–4, doi: 10.1109/ICOT54518.2021.9680652.
BIOGRAPHIES OF AUTHORS
Phuong-Nhung Nguyen received the master degree from the University of Technology -
Hanoi National University, majoring in information technology. Her research interests include data
mining, machine learning, soft computing, and fuzzy computing. She is currently a researcher at
the Institute of Information Technology, Vietnam Academy of Science and Technology. She can be
contacted at email: [email protected].
Thu-Hien Nguyen has received a Ph.D. in information systems from the Institute of Infor-
mation Technology, Vietnam Academy of Science and Technology. Her dissertation was “Research
on text normalization methods and named entity recognition in Vietnamese speech recognition”. She
has been a lecturer at the Faculty of Mathematics, Thai Nguyen University of Education since 2005.
Her research interests include natural language processing, artificial intelligence, databases, informa-
tion system analysis and design, and digital transformation in education. She can be contacted at
email: [email protected].
Nguyen Truong Thang received the Ph.D. degree from the Japan Advanced Institute of
Science and Technology (JAIST), Japan, in 2005. He is currently with the Institute of Information
Technology (IoIT), Vietnam Academy of Science and Technology. His major research interests in-
clude software quality assurance, software verification, program analysis, data mining, and machine
learning. He can be contacted at email: [email protected].
Nguyen Thi Thu Nga received her Ph.D. in mathematics at the Department of Training,
Military Academy of Science and Technology, Vietnam since 2023. In addition, she also holds the
position of researcher in Computer Science and Computational Intelligence Research (2016-present).
Her research field is technology and computer science. She currently working at the Institute of
Information and Technology, Vietnam Academy of Science and Technology. She can be contacted at
email: [email protected].
Nguyen Thi Anh Phuong is currently a master at Hanoi University of Science – VietNam
National University HaNoi, majoring applied mathematics computing. Her research interests include
data mining, soft computing, and fuzzy computing. She is currently a researcher at the Institute of
Information Technology, Vietnam Academy of Science and Technology. She can be contacted at
email: [email protected].
Tuan-Linh Nguyen received the Ph.D. degree from the Kyungpook National University
(KNU), Korea, in 2020. He is currently with the Thainguyen University of Technology (TNUT),
Vietnam. His major research interests include deep fuzzy-neural network, deep learning, interpretable
AI, and NLP. He can be contacted at email: [email protected].
Multi-task deep learning for Vietnamese capitalization and punctuation ... (Phuong-Nhung Nguyen)