
INTERSPEECH 2023

20-24 August 2023, Dublin, Ireland

Embedding Articulatory Constraints for Low-resource Speech Recognition Based on Large Pre-trained Model
Jaeyoung Lee, Masato Mimura, Tatsuya Kawahara

Graduate School of Informatics, Kyoto University, Japan


{jaeyoung, mimura, kawahara}@sap.ist.i.kyoto-u.ac.jp

Abstract

Knowledge about phonemes and their articulatory attributes can help improve automatic speech recognition (ASR) of low-resource languages. In this study, we propose a simple and effective approach to embed prior knowledge about phonemes into end-to-end ASR based on a large pre-trained model. An articulatory attribute prediction layer is constructed by embedding articulatory constraints in layer initialization, which allows for predicting articulatory attributes without the need for explicit training. The final ASR transcript is inferred by combining the output of this layer with encoded speech features. We apply our method to finetune a pre-trained XLS-R model using Ainu and Mboshi corpora, and achieve a 12% relative improvement when target data of only 1 hour is available. This demonstrates that the approach of incorporating phonetic prior knowledge is useful when combined with a large pre-trained model.

Index Terms: Low-resource speech recognition, articulatory attributes, wav2vec2.0

1. Introduction

Over the last decade, deep neural network (DNN) based approaches have significantly improved the performance of automatic speech recognition (ASR) and have made the end-to-end approach possible [1][2][3], where output prediction is directly inferred from acoustic features through a single neural network, without the need for manual pronunciation modeling. However, this has been possible with large language corpora, and only a handful of languages have sufficient language resources to achieve high performance for ASR, among the approximately 7,000 languages [4] in the world. In particular, the number of nodes in the output layer of end-to-end models may be too large for low-resource languages, and this can be partially mitigated by encoding the relationships between output tokens.

Articulatory attributes are a set of distinct features that describe how speech sounds are produced by the articulators in the mouth, such as the lips, tongue, and vocal cords. It has been shown that articulatory attributes can be recognized across different languages [5]. The approach of incorporating articulatory attributes into ASR systems has been investigated for traditional GMM-HMM based models and shown to contribute to making models more robust to speaker or channel variability [6]. In Automatic Speech Attribute Transcription (ASAT) [7][8], a bank of speech attribute detectors was placed at the lowest level of the ASR pipeline hierarchy, and the detected attributes were combined to predict phones, syllables, and words. Articulatory modeling has also been applied to DNN-HMM and end-to-end models in multilingual settings, improving robustness to spontaneous and non-native speech [9] and benefiting performance in low-resource scenarios [10][11][12].

In recent years, a prominent trend in low-resource speech recognition research has been to use self- or semi-supervised pre-training on high-resource speech corpora to learn a universal speech representation, which can then be finetuned for downstream tasks [3]. Large pre-trained multilingual models such as wav2vec2.0 and XLS-R [13] have been shown to learn general representations that are applicable even to unseen languages, and have greatly benefited low-resource ASR performance, as shown in [14]. It is known that these models learn high-level representations corresponding to phonemes without any explicit supervision [3], and it is possible that part of the learned representation corresponds to articulatory attributes as well. Therefore, the approach of incorporating articulatory information is expected to be effective when used in combination with a large-scale pre-trained model for developing an ASR system in very low-resource settings.

In this study, we propose a simple method to incorporate articulatory information by embedding it into layer initialization in end-to-end ASR. First, we construct a fixed-length encoding vector for each phoneme, using knowledge about articulatory attributes. Then, these attribute vectors are stacked to form an articulatory attribute projection matrix, which projects articulatory attribute predictions into output phoneme predictions. This articulatory attribute prediction layer is combined with another conventional projection layer to generate the final outputs. We finetune a pre-trained XLS-R model with this output layer placed on top. We also explore multilingual training that exploits a high-resource language to enhance the representation ability of the model for articulatory attributes. The proposed method is applied to two low-resource languages, Ainu and Mboshi, with target training data of around 34 and 4 hours, respectively.

2. Related Work

2.1. Articulatory Modeling

Müller et al. [11] attempted to improve low-resource ASR by using articulatory attributes along with language feature vectors and acoustic features. They introduced seven articulatory attribute classifiers and one special phoneme type classifier, one for each of the 8 categories they defined, such as phoneme type, manner and place of articulation, and vowel frontness. Each category has a predefined number of classes, including a special class representing "not applicable"; for example, the frontness category has 4 classes: front, central, back, and not applicable. This modeling is rather restrictive because each phoneme can only have one attribute class for each category, making it difficult to apply to multilingual settings with languages requiring different articulatory modeling. Moreover, it was not implemented in an end-to-end ASR model.

[Figure 1 (architecture diagram) appears here in the original paper; its block labels are: Raw Waveform, CNN, Pre-trained wav2vec2.0, Transformer Encoder Layers, Contextualized Input Features, Projection, Articulatory Attribute Prediction Layer, Attribute Predictions, Attribute Projection Matrix, Output Token Scores, Output Layer, Output Probabilities, CTC Loss.]

Figure 1: An overview of the proposed model architecture. Each language has a respective output layer, while the articulatory attribute prediction layer is shared across languages. L stands for the number of languages. A_l stands for the articulatory attribute projection matrix for language l; its elements are set as trainable parameters, and it is implemented simply as a linear projection layer.

Li et al. [12] introduced articulatory modeling to end-to-end speech recognition. They mapped each character token to a sequence of tokens representing articulatory attributes. For example, the character representing /g/ is mapped to the sequence of two tokens ⟨voiced⟩ ⟨velar⟩. They trained a Transformer-based model to predict articulatory attribute token sequences from speech features in an end-to-end fashion. They observed that while this method underperforms in monolingual settings, it can significantly outperform usual end-to-end models in multilingual settings, where the model can effectively learn articulatory representations shared across target languages. This suggests the importance of multilingual training when using articulatory features for speech recognition.

In this study, we incorporate articulatory attribute predictors and combine their outputs with acoustic features to form the final token prediction. Unlike [11], we treat each articulatory attribute independently, allowing phonemes to be modeled with multiple attributes at the same time; e.g., affricates can be modeled as being both plosive and fricative. Moreover, we do not train such predictors in a supervised manner; instead, they are induced to make articulatory attribute predictions by layer initialization. Then, the entire network is finetuned in an end-to-end manner. The modeling is thus flexible enough to handle attributes with a continuous scale, e.g. vowel height, and it can work with imperfect knowledge about the phonemes of the target language and inaccurate articulations in the input speech.

2.2. Wav2vec2.0

Wav2vec 2.0 is a self-supervised learning framework that learns latent representations from raw speech data. In this framework, the speech input is first encoded by a multi-layer convolutional neural network (CNN). This latent representation is masked and input to a Transformer encoder network, producing contextualized representations. The model is trained by predicting the true latent representation from the other contextualized representations, in a similar fashion to masked language models such as BERT [15]. Vector-quantized codes are used as a similarity measure. Wav2vec 2.0 models can be trained on large multilingual corpora [13] and can then be used as a pre-trained model for finetuning on a small amount of labeled data. It has been shown to outperform previous approaches, especially in terms of low-resource ASR [14].
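For concreteness, the sketch below (not part of the original paper) shows how contextualized representations, including those of intermediate encoder layers that Section 5.1 later feeds to the attribute prediction layer, can be obtained from a pre-trained wav2vec 2.0 / XLS-R checkpoint, assuming the HuggingFace transformers implementation; the checkpoint name and the dummy waveform are illustrative.

```python
# Minimal sketch: extracting contextualized representations from a pre-trained
# wav2vec 2.0 / XLS-R checkpoint with HuggingFace `transformers`.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-xls-r-300m"   # XLS-R (0.3B), as used in this paper
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
model.eval()

waveform = torch.randn(16000)  # 1 second of dummy 16 kHz audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # output_hidden_states=True also exposes intermediate encoder layers,
    # which Section 5.1 uses as alternative inputs to the attribute layer.
    outputs = model(inputs.input_values, output_hidden_states=True)

contextual = outputs.last_hidden_state    # shape (1, T', 1024) for XLS-R 0.3B
eighth_layer = outputs.hidden_states[8]   # representation after the 8th encoder layer
```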
3. Proposed Method

3.1. Articulatory Modeling for Phonemes

First, we identify all relevant articulatory attributes for the target language, which are categorized into three groups: vowel attributes, and the place and manner of articulation for consonants. Additionally, there is a special category of attributes that applies to all output tokens. Table 1 lists all of the attributes covered in this study, for Ainu and Mboshi. Since different languages have different contrasting features, the set of articulatory attributes should be modified to encompass all contrasting features present in other target languages.

Table 1: Articulatory attributes.

Category            |      1      |     0      |    -1
Special             | sound       | -          | symbol
                    | consonant   | semi-vowel | vowel
                    | voiced      | -          | voiceless
Vowel               | back        | central    | front
                    | open        | mid        | closed
                    | high-toned  | -          | low-toned
Consonant (place)   | bilabial    | -          | else
                    | labiodental | -          | else
                    | alveolar    | -          | else
                    | palatal     | -          | else
                    | velar       | -          | else
                    | glottal     | -          | else
Consonant (manner)  | nasal       | -          | else
                    | plosive     | -          | else
                    | fricative   | -          | else
                    | flap        | -          | else
                    | approximant | -          | else

Vowel attributes (e.g. vowel height, backness, etc.) tend to be best represented on a continuous scale, while consonant attributes (i.e. place or manner of articulation) tend to be best represented in a one-hot encoding scheme.
In our method, we represent every attribute on a continuous scale ranging from -1 to 1, e.g. front (-1) to back (1). In the case of consonant attributes, a value of -1 denotes the attribute's absence, analogous to 0 in one-hot encoding. The interpretation of a value of 0 depends on the attribute and the token. For vowel attributes of vowel tokens, a value of 0 denotes the middle point on the -1 to 1 scale. In other instances, a value of 0 indicates that the attribute is not applicable, such as consonant attributes for vowel tokens.

We flexibly utilize this encoding scheme to embed phonetic and phonological knowledge, as it will only be used in layer initialization and not as training labels, and deep learning will finetune any incomplete, ambiguous or incorrect details. For example, the phoneme /c/ in Ainu is an affricate and is encoded by setting both the plosive and fricative attributes to 1. The palatal attribute is set to 0, as it is sometimes realized as a palato-alveolar affricate [tʃ], though usually realized as an alveolar affricate [ts]. Moreover, the voiced attribute is set to 0, encoding the fact that there is no phonemic contrast between voiced and voiceless consonants in Ainu. Table 2 shows examples of the assignment of articulatory attributes for Ainu and Japanese phoneme tokens.

Table 2: Examples of the assignment of articulatory attribute encodings. Encodings that deserve special attention are rendered bold in the original paper.

Language | Token | sound/symbol | consonant/vowel | back/front | open/closed | voiced | plosive | fricative | bilabial | alveolar | palatal
Ainu     | ⟨wb⟩  |      -1      |        0        |      0     |      0      |    0   |    0    |     0     |     0    |     0    |    0
Ainu     | a     |       1      |       -1        |      0     |      1      |    1   |    0    |     0     |     0    |     0    |    0
Ainu     | u     |       1      |       -1        |      1     |     -1      |    1   |    0    |     0     |     0    |     0    |    0
Ainu     | c     |       1      |        1        |      0     |      0      |    0   |    1    |     1     |    -1    |     1    |    0
Japanese | f     |       1      |        1        |      0     |      0      |   -1   |   -1    |     1     |     1    |    -1    |   -1
Japanese | by    |       1      |        1        |      0     |      0      |    1   |    1    |    -1     |     1    |    -1    |    1

By representing each token as a fixed-length encoding vector, we can construct an articulatory attribute projection matrix A_l ∈ R^(V×N) for a target language l, where V represents the size of the vocabulary and N represents the number of articulatory attributes. Each row in A_l is normalized to have unit variance and zero mean. This matrix can be regarded as a mapping from [-1,1]-normalized attribute predictions to token predictions.
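The following sketch (not the authors' code) illustrates how this encoding can be turned into the projection matrix A_l; the attribute order and the example phoneme entries mirror Tables 1 and 2, and the dictionary here is a small illustrative subset rather than a full inventory.

```python
# Building the articulatory attribute projection matrix A_l from per-phoneme
# attribute encodings, with per-row zero-mean / unit-variance normalization
# as described in Section 3.1.
import numpy as np

ATTRIBUTES = ["sound/symbol", "consonant/vowel", "back/front", "open/closed",
              "voiced", "plosive", "fricative", "bilabial", "alveolar", "palatal"]

# Values in [-1, 1]; 0 marks "not applicable" or an intermediate point on a
# continuous vowel attribute such as backness or openness.
PHONEME_ATTRS = {
    "<wb>": [-1,  0, 0,  0, 0, 0, 0,  0, 0, 0],   # word-boundary symbol
    "a":    [ 1, -1, 0,  1, 1, 0, 0,  0, 0, 0],   # open central vowel
    "u":    [ 1, -1, 1, -1, 1, 0, 0,  0, 0, 0],   # closed back vowel
    "c":    [ 1,  1, 0,  0, 0, 1, 1, -1, 1, 0],   # Ainu affricate: plosive AND fricative
}

def attribute_projection_matrix(vocab):
    """Stack the attribute vectors into A_l (V x N) and standardize each row
    to zero mean and unit variance (rows are assumed non-constant)."""
    A = np.array([PHONEME_ATTRS[token] for token in vocab], dtype=np.float32)
    A = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)
    return A

A_l = attribute_projection_matrix(["<wb>", "a", "u", "c"])  # shape: (4, 10)
```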
For languages with a restricted syllabic structure such as Ainu, it is straightforward to extend this encoding to syllabic tokens. Syllables in Ainu have a (C)V(C) structure, with at most 3 phonemes. Thus, each syllable token in Ainu can be represented by a 3N-dimensional encoding vector, and an attribute projection matrix can be constructed accordingly.
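A hypothetical helper illustrating this syllabic extension is sketched below; zero-filling a missing onset or coda slot is our own assumption, as the paper does not specify how absent phoneme positions are encoded.

```python
# Encode a (C)V(C) syllable as the concatenation of its (up to) three phoneme
# encodings, giving a 3N-dimensional vector (missing slots filled with zeros
# here as one plausible choice).
import numpy as np

def syllable_encoding(onset, nucleus, coda, phoneme_attrs, n_attrs=10):
    def enc(p):
        if p is None:
            return np.zeros(n_attrs, dtype=np.float32)
        return np.array(phoneme_attrs[p], dtype=np.float32)
    return np.concatenate([enc(onset), enc(nucleus), enc(coda)])  # shape: (3N,)

# e.g. the Ainu syllable "ca" (onset "c", nucleus "a", no coda):
# vec = syllable_encoding("c", "a", None, PHONEME_ATTRS)
```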
3.2. Articulatory Attribute Prediction Layer

As shown in the right-most part of Figure 1, the articulatory attribute prediction layer is placed to project the input features to the attribute space using the tanh activation function, as we represent each attribute on a [-1,1] scale. This layer is followed by another projection layer from the attribute space to output tokens, which is initialized with the attribute projection matrix A_l defined earlier. The outputs are then fed into a softmax, yielding the token predictions. The attribute prediction layer is trained as a predictor for articulatory attributes without explicit supervision. However, we observed that using only the outputs from this layer as the final token predictions leads to suboptimal performance. To address this, we combine them with a conventional projection from the input speech features to make the final predictions, as illustrated in Figure 1.
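A minimal PyTorch sketch of such an output layer is given below; it reflects our reading of Section 3.2 and Figure 1 rather than the authors' implementation, and in particular summing the two score streams before the softmax is an assumption about how the combination is realized.

```python
# Hybrid output layer: tanh projection into attribute space, a projection into
# token scores initialized with A_l, and a conventional linear projection,
# whose scores are combined (here: summed) before the softmax / CTC loss.
import torch
import torch.nn as nn

class HybridOutputLayer(nn.Module):
    def __init__(self, hidden_dim: int, attr_matrix: torch.Tensor):
        """attr_matrix: A_l of shape (vocab_size, num_attributes)."""
        super().__init__()
        vocab_size, num_attrs = attr_matrix.shape
        # Articulatory attribute prediction layer (no explicit supervision).
        self.to_attributes = nn.Linear(hidden_dim, num_attrs)
        # Attribute-to-token projection, initialized with A_l and kept trainable.
        self.attr_to_token = nn.Linear(num_attrs, vocab_size, bias=False)
        with torch.no_grad():
            self.attr_to_token.weight.copy_(attr_matrix)  # weight is (out, in) = (V, N)
        # Conventional projection directly from the encoded speech features.
        self.direct = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        """features: (batch, frames, hidden_dim) -> log-probs (batch, frames, vocab)."""
        attrs = torch.tanh(self.to_attributes(features))     # attribute predictions in [-1, 1]
        scores = self.attr_to_token(attrs) + self.direct(features)
        return scores.log_softmax(dim=-1)                    # suitable for nn.CTCLoss
```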
4. Experimental Setup

4.1. Datasets

4.1.1. Speech Corpus of Ainu Folklore

Ainu is a language spoken by the Ainu, a minority ethnic group in the northern part of Japan, and is classified as critically endangered by UNESCO. The Speech Corpus of Ainu Folklore [16] is a collection of speech recordings of Ainu stories, myths, and legends that have been collected to preserve the Ainu language and culture. It consists of utterances from 8 Ainu speakers speaking the Saru dialect, amounting to 38.9 hours. We split it into a train set of 33.7 hours and a dev set of 5.2 hours. Subsets of the train set with varying amounts of data are used for training, to investigate the effect of the amount of training data on our method. Furthermore, additional utterances of 14 hours spoken by another Ainu speaker with a distinct dialect, the Shizunai dialect, are employed as the test set. The characters used in the transcription correspond 1-to-1 to phonemes.

In addition, we conduct experiments in bilingual settings where we adopt Japanese as an auxiliary language, because Japanese and Ainu share most of their phonemes, and most Ainu speakers also speak Japanese. We use utterances of about 300 hours from the Corpus of Spontaneous Japanese (CSJ) [17] as an additional training corpus.

Both Ainu and Japanese have a restricted syllabic structure of (C)V(C), and thus it is straightforward to employ articulatory modeling for syllables in both languages. We use the syllable as the target unit and employ syllabic articulatory modeling, as described in Section 3.1.

4.1.2. Mboshi Parallel Corpus

Mboshi (Bantu C25, Congo-Brazzaville) is a Bantu language spoken by the Mboshi people in the Republic of Congo. The Mboshi parallel corpus [18] contains speech utterances of around 4.5 hours from 3 speakers. The data is split into a train set of 3.9 hours and a dev set of 26.4 minutes. We adopt the train/dev split as defined in the original paper [18].

The characters used in the transcription of this corpus do not correspond to phonemes in a 1-to-1 fashion. For example, the character h only appears in the combinations /bh/ and /gh/, which correspond to the voiced bilabial fricative and the voiced velar fricative, respectively. To handle such cases, we leverage the flexibility of our articulatory modeling and assign a value of 1 to both the bilabial and velar attributes for h.

4.2. Model Training and Evaluation Measure

We used a publicly available pretrained multilingual model for all our experiments, namely XLS-R (0.3B) [13].

Table 3: CER(%) with varying amounts of Ainu training data and with Mboshi. The final row shows the relative improvement of the best performing proposed model over the baseline model.

Target language (unit)                 |             Ainu (syll)              | Mboshi (char)
Training data                          |  10m    1h    4h    10h  all (33.7h) |  all (3.9h)
baseline                               |  35.1  19.6  16.5  15.5     14.1     |   6.10
proposed (attribute prediction only)   |  26.0  18.0  15.8  15.2     14.1     |   7.28
proposed (hybrid)                      |  24.8  17.6  15.5  14.8     13.4     |   5.76
proposed (hybrid + bilingual training) |  23.8  17.2  15.6  14.5     14.1     |    -
relative improvement (%)               |  32.1  12.2   6.1   6.1      4.8     |   5.6

It is a 317M-parameter model with 24 Transformer encoder layers, an embedding size of 1024, and 16 attention heads. It is trained on various speech corpora such as VoxPopuli, MLS, and CommonVoice, totaling 436K hours across 128 different languages. We conduct finetuning experiments with the CTC loss, where we freeze the convolution-based feature extractor and finetune all of the Transformer encoder layers, along with the output layers.

Models are trained using the Adam optimizer [19]. The learning rate is linearly warmed up for the first 10% of training steps, peaks at 5e-3 for Ainu and 6e-3 for Mboshi, is held for 40% of the training steps, and then decays exponentially. The batch size is 45 s, and we mix the target and auxiliary languages in a 1:1 ratio in bilingual settings. We employ different numbers of training steps depending on the amount of target training data: 16K, 20K, 30K, 40K, and 50K steps for training data of 10m, 1h, 4h, 10h, and 33.7h, respectively. For bilingual training, we use 60% more training steps.
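The following sketch illustrates this finetuning setup, assuming the HuggingFace transformers XLS-R checkpoint and PyTorch; the exponential-decay rate is an assumption (the paper does not give it), and in the authors' setup the hybrid output layer of Section 3.2 sits on top of this encoder in place of a stock CTC head.

```python
# Sketch of the Section 4.2 setup: frozen CNN feature extractor, Adam, and a
# warm-up / hold / exponential-decay learning-rate schedule.
import torch
from torch.optim.lr_scheduler import LambdaLR
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
model.freeze_feature_encoder()   # freeze the CNN feature extractor (recent transformers versions)

total_steps = 30_000                      # e.g. the 4h Ainu setting
warmup_steps = int(0.10 * total_steps)    # linear warm-up for the first 10% of steps
hold_steps = int(0.40 * total_steps)      # hold the peak LR for the next 40%
peak_lr = 5e-3                            # 5e-3 for Ainu, 6e-3 for Mboshi

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=peak_lr)

def lr_lambda(step):
    if step < warmup_steps:                   # linear warm-up
        return step / max(1, warmup_steps)
    if step < warmup_steps + hold_steps:      # constant phase at the peak LR
        return 1.0
    # exponential decay afterwards; the 0.999 rate is an assumption, not from the paper
    return 0.999 ** (step - warmup_steps - hold_steps)

scheduler = LambdaLR(optimizer, lr_lambda)   # call scheduler.step() once per update
```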
Three different configurations of the proposed method are trained and evaluated, and compared to the baseline configuration, in which the output layer simply consists of one linear projection layer (the difference in model size between the baseline and proposed configurations is marginal, as the pretrained XLS-R model already has 317M parameters). In the attribute prediction only configuration, only the output from the articulatory attribute prediction layer is used for the final prediction. It is combined with the input speech features in the hybrid configuration, as illustrated in the right-hand side of Figure 1. In the hybrid + bilingual training configuration for Ainu, the model is trained in a bilingual setting with additional training data in Japanese.

Character error rate (CER) is employed as the evaluation metric, given the difficulty of computing word error rate (WER) for very low-resource languages where determining word units is not obvious. For Ainu, the model with the lowest CER on the dev set is selected and evaluated on the test set. For Mboshi, we simply evaluate the model at the end of training on the dev set.
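Since the paper does not name a particular scoring toolkit, the following pure-Python edit-distance function is an illustrative stand-in for how CER can be computed.

```python
# Character error rate: Levenshtein distance between character sequences,
# divided by the reference length (substitutions + deletions + insertions).
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    # prev[j] = edit distance between the previous prefix of r and h[:j]
    prev = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        curr = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[len(h)] / max(1, len(r))

# e.g. cer("inkar", "inkan") == 0.2  (one substitution over five characters)
```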
5. Experimental Results

The comparison of the character error rate (CER) of the baseline and proposed models in various training settings is presented in Table 3. We observe a significant improvement in performance for all hybrid models, especially when the available training data is smaller. The attribute prediction only models perform consistently worse than the hybrid models, sometimes even worse than the baseline model. This is likely because the articulatory constraints can be too strict for the model to learn an accurate representation of the output tokens. The use of bilingual data improves performance by a small margin, although not significantly, when the training data is less than 10 hours. However, bilingual training with the entire 33.7h Ainu training data results in a decline in performance. This may be because we are modeling the articulatory attributes of different languages in the same way, even though they are physically realized in slightly different ways.

Our method benefited performance for Mboshi as well as Ainu, where we observe a 5.6% relative improvement. The absolute CERs for Mboshi are much lower than those of Ainu, which is likely due to the fact that the speakers and dialects used in training and evaluation are the same, whereas the Ainu test set consists of a different speaker with a distinct dialect. The results demonstrate that our method can be applied and improves performance across different languages and different target units.

5.1. Effect of Inputs to the Attribute Prediction Layer

We conducted an additional series of experiments in which we change the source of the input to the articulatory attribute prediction layer. We tested latent representations from the final (24th), 16th, and 8th Transformer encoder layers, as well as directly from the CNN feature extractor. Note that the final output of the Transformer encoder is still combined to produce the final output; only the input to the attribute prediction layer is changed. The results are presented in Table 4. Notably, the results show a consistent trend of performance improvement with higher representation inputs. This finding suggests that articulatory attributes are high-level features that benefit from contextualized high-level representations.

Table 4: CER(%) results with different inputs to the articulatory attribute prediction layer.

Input to attribute prediction layer | Ainu (4h)
Transformer encoder layer 24 (final)| 15.5
Transformer encoder layer 16        | 16.0
Transformer encoder layer 8         | 16.2
CNN features                        | 16.5

6. Conclusions

We have presented a method for improving ASR in low-resource settings by incorporating linguistic prior knowledge about phonemes. It is simple yet flexible for modeling continuous and ambiguous articulatory constraints. It does not require explicit supervision with articulatory attribute labels and can model phonemes from multilingual inventories. We conducted finetuning experiments with XLS-R models on two very low-resource languages, Ainu and Mboshi. The results show that our method can improve CER by a relative 12% when the amount of training data is limited to 1 hour, and still produces a meaningful improvement when using 33.7 hours of training data. This study demonstrates the importance of incorporating linguistic prior knowledge about the target language, even when state-of-the-art pre-trained models are employed.

7. References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.

[2] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech 2020, 2020, pp. 5036–5040.

[3] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS'20. Red Hook, NY, USA: Curran Associates Inc., 2020.

[4] M. P. Lewis, Ed., Ethnologue: Languages of the World, sixteenth ed. Dallas, TX, USA: SIL International, 2009.

[5] S. Stüker, T. Schultz, F. Metze, and A. Waibel, “Multilingual articulatory features,” in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 1, 2003.

[6] F. Metze and A. Waibel, “A flexible stream architecture for ASR using articulatory features,” in 7th International Conference on Spoken Language Processing, ICSLP2002 - INTERSPEECH 2002, Denver, Colorado, USA, September 16-20, 2002, J. H. L. Hansen and B. L. Pellom, Eds. ISCA, 2002.

[7] C.-H. Lee, M. A. Clements, S. Dusan, E. Fosler-Lussier, K. Johnson, B.-H. Juang, and L. R. Rabiner, “An overview on automatic speech attribute transcription (ASAT),” in Proc. Interspeech 2007, 2007, pp. 1825–1828.

[8] C.-H. Lee and M. Siniscalchi, “An information-extraction approach to speech processing: Analysis, detection, verification and recognition,” Proc. IEEE, vol. 101, no. 5, pp. 1089–1115, 2013.

[9] V. Mitra, W. Wang, C. Bartels, H. Franco, and D. Vergyri, “Articulatory Information and Multiview Features for Large Vocabulary Continuous Speech Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5634–5638.

[10] L. Badino, C. Canevari, L. Fadiga, and G. Metta, “Integrating articulatory data in deep neural network-based acoustic modeling,” vol. 36, no. C, pp. 173–195, 2016.

[11] M. Müller, S. Stüker, and A. Waibel, “Towards Improving Low-Resource Speech Recognition Using Articulatory and Language Features,” in Proceedings of the 13th International Conference on Spoken Language Translation. International Workshop on Spoken Language Translation, 2016.

[12] S. Li, C. Ding, X. Lu, P. Shen, T. Kawahara, and H. Kawai, “End-to-End Articulatory Attribute Modeling for Low-Resource Multilingual Speech Recognition,” in Interspeech 2019. ISCA, 2019, pp. 2145–2149.

[13] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,” in Proc. Interspeech 2022, 2022, pp. 2278–2282.

[14] C. Yi, J. Wang, N. Cheng, S. Zhou, and B. Xu, “Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages,” 2021.

[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2019, pp. 4171–4186.

[16] K. Matsuura, S. Ueno, M. Mimura, S. Sakai, and T. Kawahara, “Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language,” in Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association, 2020, pp. 2622–2628.

[17] K. Maekawa, H. Koiso, S. Furui, and H. Isahara, “Spontaneous Speech Corpus of Japanese,” in Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'00). European Language Resources Association (ELRA), 2000.

[18] A. Rialland, M. Adda-Decker, G.-N. Kouarata, G. Adda, L. Besacier, L. Lamel, E. Gauthier, P. Godard, and J. Cooper-Leavitt, “Parallel Corpora in Mboshi (Bantu C25, Congo-Brazzaville),” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), 2018.

[19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.
