[Figure 1 diagram components: raw waveform, CNN, contextualized input features, articulatory attribute prediction layer, projection, output layer, output probabilities, CTC loss.]
Figure 1: An overview of the proposed model architecture. Each language has its own output layer, while the articulatory attribute prediction layer is shared across languages. L denotes the number of languages. A_l denotes the articulatory attribute projection matrix for language l; its elements are trainable parameters, and it is implemented simply as a linear projection layer.
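As a concrete illustration of Figure 1, the sketch below shows how such an output side could be wired in PyTorch, assuming the attribute prediction layer emits one score per articulatory attribute and each per-language matrix A_l is an ordinary linear layer; the class and argument names are ours, not taken from the authors' code.

import torch
import torch.nn as nn

class HybridAttributeHead(nn.Module):
    """Illustrative sketch of the output side of Figure 1 (not the authors' code).

    A single attribute prediction layer is shared across languages, while each
    language l has its own trainable projection A_l mapping the attribute scores
    (optionally concatenated with the acoustic features, i.e. the "hybrid"
    configuration) to that language's output tokens for CTC.
    """

    def __init__(self, feat_dim: int, n_attributes: int,
                 vocab_sizes: dict[str, int], hybrid: bool = True):
        super().__init__()
        self.hybrid = hybrid
        # Shared articulatory attribute prediction layer.
        self.attribute_layer = nn.Linear(feat_dim, n_attributes)
        # One output projection A_l per language (L languages in total).
        in_dim = n_attributes + feat_dim if hybrid else n_attributes
        self.output_layers = nn.ModuleDict(
            {lang: nn.Linear(in_dim, vocab) for lang, vocab in vocab_sizes.items()}
        )

    def forward(self, features: torch.Tensor, lang: str) -> torch.Tensor:
        # features: (batch, time, feat_dim) contextualized features from the encoder.
        attributes = self.attribute_layer(features)
        head_in = torch.cat([attributes, features], dim=-1) if self.hybrid else attributes
        logits = self.output_layers[lang](head_in)
        return logits.log_softmax(dim=-1)  # fed to nn.CTCLoss

Setting hybrid=False corresponds to the attribute prediction only configuration, in which the concatenation with the acoustic features is dropped.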
end speech recognition. They mapped each character token to a sequence of tokens representing articulatory attributes. For example, the character representing /g/ is mapped to the sequence of two tokens ⟨voiced⟩ ⟨velar⟩. They trained a Transformer-based model to predict articulatory attribute token sequences from speech features in an end-to-end fashion. They observed that while this method underperforms in monolingual settings, it can significantly outperform standard end-to-end models in multilingual settings, where the model can effectively learn articulatory representations shared across the target languages. This suggests the importance of multilingual training when using articulatory features for speech recognition.

In this study, we incorporate articulatory attribute predictors and combine their outputs with acoustic features to form the final token prediction. Unlike [11], we treat each articulatory attribute independently, allowing a phoneme to be modeled with multiple attributes at the same time; for example, an affricate can be modeled as both plosive and fricative. Moreover, we do not train these predictors in a supervised manner; instead, they are induced to make articulatory attribute predictions through layer initialization, and the entire network is then finetuned in an end-to-end manner. This modeling is more flexible for attributes on a continuous scale, e.g. vowel height, and it can work with imperfect knowledge about the phonemes of the target language and with inaccurate articulations in the input speech.

2.2. Wav2vec 2.0

Wav2vec 2.0 is a self-supervised learning framework that learns latent representations from raw speech data. In this framework, the speech input is first encoded by a multi-layer convolutional neural network (CNN). This latent representation is masked and fed into a Transformer encoder network, producing contextualized representations. The model is trained by predicting the true latent representation from the other contextualized representations, in a similar fashion to masked language models such as BERT [15]; vector-quantized codes are used as the similarity measure. Wav2vec 2.0 models can be trained on large multilingual corpora [13] and then used as pre-trained models for finetuning on a small amount of labeled data. This has been shown to outperform previous approaches especially in terms of low-resource ASR [14].

3. Proposed Method

3.1. Articulatory Modeling for Phonemes

First, we identify all relevant articulatory attributes for the target language, which are categorized into three groups: vowel attributes, and place and manner of articulation for consonants. Additionally, there is a special category of attributes that applies to all output tokens. Table 1 lists all of the attributes covered in this study for Ainu and Mboshi. Since different languages have different contrasting features, the set of articulatory attributes should be modified to encompass all contrasting features present in other target languages.

Vowel attributes (e.g. vowel height, backness) tend to be best represented on a continuous scale, while consonant attributes (i.e. place or manner of articulation) tend to be best

Table 1: Articulatory attributes.

Category             Attribute values (1 / 0 / -1)
Special              sound / - / symbol
                     consonant / semi-vowel / vowel
                     voiced / - / voiceless
Vowel                back / central / front
                     open / mid / closed
                     high-toned / - / low-toned
Consonant (place)    bilabial / - / else
                     labiodental / - / else
                     alveolar / - / else
                     palatal / - / else
                     velar / - / else
                     glottal / - / else
Consonant (manner)   nasal / - / else
                     plosive / - / else
                     fricative / - / else
                     flap / - / else
                     approximant / - / else
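To make the encoding concrete, a few illustrative attribute vectors under the 1 / 0 / -1 scheme of Table 1 are sketched below. The exact assignments used in the paper are given in Table 2, so the specific values, phoneme choices, and the reduced attribute subset here are our own examples, chosen only to be consistent with the prose (e.g. an affricate being marked as both plosive and fricative, and vowel height taking intermediate values).

import numpy as np

# Columns follow a subset of Table 1; each attribute takes a value in {1, 0, -1}
# (or an intermediate value for continuous vowel attributes such as height).
ATTRIBUTE_NAMES = [
    "sound",      # 1 = sound, -1 = symbol
    "consonant",  # 1 = consonant, 0 = semi-vowel, -1 = vowel
    "voiced",     # 1 = voiced, -1 = voiceless
    "backness",   # 1 = back, 0 = central, -1 = front
    "height",     # 1 = open, 0 = mid, -1 = closed
    "velar", "plosive", "fricative",
]

# Illustrative encodings (not the paper's actual Table 2 values):
#   /g/  is a voiced velar plosive,
#   /ts/ (an affricate) is marked as both plosive and fricative,
#   /e/  is a front mid vowel, with height on a continuous scale.
PHONEME_ATTRIBUTES = {
    "g":  np.array([1,  1,  1,  0,  0,   1,  1, -1], dtype=np.float32),
    "ts": np.array([1,  1, -1,  0,  0,  -1,  1,  1], dtype=np.float32),
    "e":  np.array([1, -1,  1, -1,  0.0, -1, -1, -1], dtype=np.float32),
}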
Table 2: Examples of assignment of articulatory attribute encoding. Encodings that deserve special attention are rendered bold.
Table 3: CER (%) with varying amounts of Ainu training data and with the Mboshi data. The final row shows the relative improvement of the best-performing proposed model over the baseline.
Target language (unit)                    Ainu (syll)                            Mboshi (char)
Training data                             10m    1h     4h     10h    all (33.7h)   all (3.9h)
baseline                                  35.1   19.6   16.5   15.5   14.1          6.10
proposed (attribute prediction only)      26.0   18.0   15.8   15.2   14.1          7.28
proposed (hybrid)                         24.8   17.6   15.5   14.8   13.4          5.76
proposed (hybrid + bilingual training)    23.8   17.2   15.6   14.5   14.1          -
relative improvement (%)                  32.1   12.2   6.1    6.1    4.8           5.6
parameter model with 24 Transformer encoder layers, an embedding size of 1024, and 16 attention heads. It is trained on various speech corpora such as VoxPopuli, MLS, and CommonVoice, totaling 436K hours across 128 different languages. We conduct finetuning experiments with the CTC loss, where we freeze the convolution-based feature extractor and finetune all of the Transformer encoder layers, along with the output layers.
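A minimal sketch of this finetuning setup, assuming the HuggingFace transformers implementation of wav2vec 2.0 / XLS-R; the checkpoint name and vocabulary size are illustrative, and the paper does not specify its training code.

from transformers import Wav2Vec2ForCTC

# Load a pretrained XLS-R checkpoint and attach a fresh CTC output head
# (checkpoint name and vocab size are illustrative, not from the paper).
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",   # 24 Transformer layers, 1024-dim, 16 heads
    vocab_size=64,                     # size of the target token inventory (e.g. Ainu syllables)
    ctc_loss_reduction="mean",
    pad_token_id=0,
)

# Freeze the convolutional feature extractor; the Transformer encoder layers
# and the output head remain trainable.
model.freeze_feature_encoder()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")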
Models are trained using the Adam optimizer [19]. The learning rate is linearly warmed up over the first 10% of the training steps, peaks at 5e-3 for Ainu and 6e-3 for Mboshi, is held for 40% of the training steps, and then decays exponentially. The batch size is 45 s of audio, and in bilingual settings we mix the target and auxiliary languages in a 1:1 ratio. We employ different numbers of training steps depending on the amount of target training data: 16K, 20K, 30K, 40K, and 50K steps for 10m, 1h, 4h, 10h, and 33.7h of training data, respectively. For bilingual training, we use 60% more training steps.
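The schedule described above can be written as a simple learning-rate multiplier, sketched below for use with a PyTorch LambdaLR scheduler; the final decay target (final_scale) is an assumption, since the paper only states that the rate decays exponentially after the hold phase.

import math
from torch.optim.lr_scheduler import LambdaLR

def tri_stage_lambda(total_steps: int, warmup_frac: float = 0.1,
                     hold_frac: float = 0.4, final_scale: float = 0.05):
    """Linear warmup, hold, then exponential decay, as described above.

    final_scale (the LR multiplier reached at the last step) is our assumption.
    """
    warmup_steps = int(total_steps * warmup_frac)
    hold_steps = int(total_steps * hold_frac)
    decay_steps = max(1, total_steps - warmup_steps - hold_steps)
    decay_rate = math.log(final_scale) / decay_steps

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)       # linear warmup to the peak LR
        if step < warmup_steps + hold_steps:
            return 1.0                               # hold at the peak (5e-3 / 6e-3)
        return math.exp(decay_rate * (step - warmup_steps - hold_steps))

    return lr_lambda

# e.g. 30K steps for the 4h Ainu subset:
# scheduler = LambdaLR(optimizer, tri_stage_lambda(total_steps=30_000))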
Three configurations of the proposed method are trained and evaluated, and compared to a baseline configuration in which the output layer simply consists of one linear projection layer¹. In the attribute prediction only configuration, only the output of the articulatory attribute prediction layer is used for the final prediction. In the hybrid configuration, it is combined with the input speech features, as illustrated on the right-hand side of Figure 1. In the hybrid + bilingual training configuration for Ainu, the model is trained in a bilingual setting with additional training data in Japanese.

¹ The difference in model size between the baseline and proposed configurations is marginal, as the pretrained XLS-R model already has 317M parameters.
Character error rate (CER) is employed as the evaluation metric, given the difficulty of computing word error rate (WER) for very low-resource languages where word units are not obvious. For Ainu, the model with the lowest CER on the dev set is selected and evaluated on the test set. For Mboshi, we simply evaluate the model at the end of training on the dev set.
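For reference, CER as used here is the character-level edit distance normalized by the reference length; a compact implementation is sketched below (details such as the handling of empty references are our own choice).

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Standard dynamic-programming edit distance with a rolling 1-D table.
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dist[j] + 1,        # deletion
                      dist[j - 1] + 1,    # insertion
                      prev + (r != h))    # substitution
            prev, dist[j] = dist[j], cur
    return dist[-1] / max(1, len(ref))

# Example: cer("itak", "itah") -> 0.25 (one substitution over four characters)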
5. Experimental Results

The character error rates (CER) of the baseline and proposed models under various training settings are compared in Table 3. We observe a significant improvement in performance for all hybrid models, especially when less training data is available. The attribute prediction only models perform consistently worse than the hybrid models, and sometimes even worse than the baseline model. This is likely because the articulatory constraints can be too strict for the model to learn accurate representations of the output tokens. The use of bilingual data improves performance by a small, though not significant, margin when the training data is less than 10 hours. However, bilingual training with the entire 33.7h Ainu training set results in a decline in performance. This may be because we model the articulatory attributes of different languages in the same way, even though they are physically realized in slightly different ways.

Our method benefited Mboshi as well as Ainu, with a 5.6% relative improvement. The absolute CERs for Mboshi are much lower than those for Ainu, most likely because the speakers and dialects used for training and evaluation are the same, whereas the Ainu test set consists of a different speaker with a distinct dialect. These results demonstrate that our method can be applied to, and improves performance across, different languages and different target units.

5.1. Effect of Inputs to the Attribute Prediction Layer

We conducted an additional series of experiments in which we change the source of the input to the articulatory attribute prediction layer. We tested latent representations from the final (24th), 16th, and 8th Transformer encoder layers, as well as the output of the CNN feature extractor. Note that the final output of the Transformer encoder is still combined to produce the final output; only the input to the attribute prediction layer is changed. The results are presented in Table 4. Notably, they show a consistent trend of better performance with higher-level representation inputs. This finding suggests that articulatory attributes are high-level features that benefit from contextualized, high-level representations.

Table 4: CER (%) with different inputs to the articulatory attribute prediction layer (Ainu, 4h).

Input to attribute prediction layer    CER (%)
Transformer encoder layer 24           15.5
Transformer encoder layer 16           16.0
Transformer encoder layer 8            16.2
CNN features                           16.5
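A sketch of how the alternative inputs in Table 4 could be obtained, assuming the HuggingFace wav2vec 2.0 / XLS-R implementation; the checkpoint name and dummy input are illustrative, and only the tapping of intermediate representations is shown, not the rest of the model.

import torch
from transformers import Wav2Vec2Model

# Illustrative: expose intermediate representations that could feed the
# articulatory attribute prediction layer.
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
waveform = torch.randn(1, 16000)  # ~1 s of 16 kHz audio (dummy, unnormalized input)

with torch.no_grad():
    out = encoder(waveform, output_hidden_states=True)

cnn_features = out.extract_features   # CNN feature extractor output (512-dim)
layer_8 = out.hidden_states[8]        # 8th Transformer encoder layer (1024-dim)
layer_16 = out.hidden_states[16]      # 16th Transformer encoder layer
layer_24 = out.hidden_states[24]      # final (24th) Transformer encoder layer

# Any of these can be fed to the attribute prediction layer, while the encoder's
# final output still drives the token output layer (the hybrid path).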
6. Conclusions

We have presented a method for improving ASR in low-resource settings by incorporating linguistic prior knowledge about phonemes. It is simple yet flexible enough to model continuous and ambiguous articulatory constraints. It does not require explicit supervision with articulatory attribute labels and can model phonemes from multilingual inventories. We conducted finetuning experiments with XLS-R models on two very low-resource languages, Ainu and Mboshi. The results show that our method can improve CER by a relative 12% when the amount of training data is limited to 1 hour, and it still produces a meaningful improvement when using 33.7 hours of training data. This study demonstrates the importance of incorporating linguistic prior knowledge about the target language, even when state-of-the-art pre-trained models are employed.
7. References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.

[2] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech 2020, 2020, pp. 5036–5040.

[3] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS’20. Red Hook, NY, USA: Curran Associates, Inc., 2020.

[4] M. P. Lewis, Ed., Ethnologue: Languages of the World, sixteenth ed. Dallas, TX, USA: SIL International, 2009.

[5] S. Stüker, T. Schultz, F. Metze, and A. Waibel, “Multilingual articulatory features,” in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), vol. 1, 2003.

[6] F. Metze and A. Waibel, “A flexible stream architecture for ASR using articulatory features,” in 7th International Conference on Spoken Language Processing, ICSLP 2002 - INTERSPEECH 2002, Denver, Colorado, USA, September 16-20, 2002, J. H. L. Hansen and B. L. Pellom, Eds. ISCA, 2002.

[7] C.-H. Lee, M. A. Clements, S. Dusan, E. Fosler-Lussier, K. Johnson, B.-H. Juang, and L. R. Rabiner, “An overview on automatic speech attribute transcription (ASAT),” in Proc. Interspeech 2007, 2007, pp. 1825–1828.

[8] C.-H. Lee and M. Siniscalchi, “An information-extraction approach to speech processing: Analysis, detection, verification and recognition,” Proc. IEEE, vol. 101, no. 5, pp. 1089–1115, 2013.

[9] V. Mitra, W. Wang, C. Bartels, H. Franco, and D. Vergyri, “Articulatory Information and Multiview Features for Large Vocabulary Continuous Speech Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5634–5638.

[10] L. Badino, C. Canevari, L. Fadiga, and G. Metta, “Integrating articulatory data in deep neural network-based acoustic modeling,” Computer Speech & Language, vol. 36, no. C, pp. 173–195, 2016.

[11] M. Müller, S. Stüker, and A. Waibel, “Towards Improving Low-Resource Speech Recognition Using Articulatory and Language Features,” in Proceedings of the 13th International Conference on Spoken Language Translation. International Workshop on Spoken Language Translation, 2016.

[12] S. Li, C. Ding, X. Lu, P. Shen, T. Kawahara, and H. Kawai, “End-to-End Articulatory Attribute Modeling for Low-Resource Multilingual Speech Recognition,” in Interspeech 2019. ISCA, 2019, pp. 2145–2149.

[13] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,” in Proc. Interspeech 2022, 2022, pp. 2278–2282.

[14] C. Yi, J. Wang, N. Cheng, S. Zhou, and B. Xu, “Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages,” 2021.

[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2019, pp. 4171–4186.

[16] K. Matsuura, S. Ueno, M. Mimura, S. Sakai, and T. Kawahara, “Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language,” in Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association, 2020, pp. 2622–2628.

[17] K. Maekawa, H. Koiso, S. Furui, and H. Isahara, “Spontaneous Speech Corpus of Japanese,” in Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00). European Language Resources Association (ELRA), 2000.

[18] A. Rialland, M. Adda-Decker, G.-N. Kouarata, G. Adda, L. Besacier, L. Lamel, E. Gauthier, P. Godard, and J. Cooper-Leavitt, “Parallel Corpora in Mboshi (Bantu C25, Congo-Brazzaville),” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), 2018.

[19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.