[Figure 1 diagram components: raw waveform, CNN, contextualized input features, articulatory attribute prediction layer, projection, output layer, output probabilities, CTC loss.]
Figure 1: An overview of the proposed model architecture. Each language has its own output layer, while the articulatory attribute prediction layer is shared across languages. L denotes the number of languages. A_l denotes the articulatory attribute projection matrix for language l; its elements are trainable parameters, and it is implemented simply as a linear projection layer.
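As a concrete illustration of Figure 1, the sketch below shows how such an output side could be wired in PyTorch, assuming the attribute prediction layer emits one score per articulatory attribute and each per-language matrix A_l is an ordinary linear layer; the class and argument names are ours, not taken from the authors' code.

import torch
import torch.nn as nn

class HybridAttributeHead(nn.Module):
    """Illustrative sketch of the output side of Figure 1 (not the authors' code).

    A single attribute prediction layer is shared across languages, while each
    language l has its own trainable projection A_l mapping the attribute scores
    (optionally concatenated with the acoustic features, i.e. the "hybrid"
    configuration) to that language's output tokens for CTC.
    """

    def __init__(self, feat_dim: int, n_attributes: int,
                 vocab_sizes: dict[str, int], hybrid: bool = True):
        super().__init__()
        self.hybrid = hybrid
        # Shared articulatory attribute prediction layer.
        self.attribute_layer = nn.Linear(feat_dim, n_attributes)
        # One output projection A_l per language (L languages in total).
        in_dim = n_attributes + feat_dim if hybrid else n_attributes
        self.output_layers = nn.ModuleDict(
            {lang: nn.Linear(in_dim, vocab) for lang, vocab in vocab_sizes.items()}
        )

    def forward(self, features: torch.Tensor, lang: str) -> torch.Tensor:
        # features: (batch, time, feat_dim) contextualized features from the encoder.
        attributes = self.attribute_layer(features)
        head_in = torch.cat([attributes, features], dim=-1) if self.hybrid else attributes
        logits = self.output_layers[lang](head_in)
        return logits.log_softmax(dim=-1)  # fed to nn.CTCLoss

Setting hybrid=False corresponds to the attribute prediction only configuration, in which the concatenation with the acoustic features is dropped.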
end speech recognition. They mapped each character token to a sequence of tokens representing articulatory attributes. For example, the character representing /g/ is mapped to the sequence of two tokens ⟨voiced⟩ ⟨velar⟩. They trained a Transformer-based model to predict articulatory attribute token sequences from speech features in an end-to-end fashion. They observed that while this method underperforms in monolingual settings, it can significantly outperform standard end-to-end models in multilingual settings, where the model can effectively learn articulatory representations shared across the target languages. This suggests the importance of multilingual training when using articulatory features for speech recognition.

In this study, we incorporate articulatory attribute predictors and combine their outputs with acoustic features to form the final token prediction. Unlike [11], we treat each articulatory attribute independently, allowing a phoneme to be modeled with multiple attributes at the same time; for example, an affricate can be modeled as both plosive and fricative. Moreover, we do not train these predictors in a supervised manner; instead, they are induced to make articulatory attribute predictions through layer initialization, and the entire network is then finetuned in an end-to-end manner. This modeling is more flexible for attributes on a continuous scale, e.g. vowel height, and it can work with imperfect knowledge about the phonemes of the target language and with inaccurate articulations in the input speech.

2.2. Wav2vec 2.0

Wav2vec 2.0 is a self-supervised learning framework that learns latent representations from raw speech data. In this framework, the speech input is first encoded by a multi-layer convolutional neural network (CNN). This latent representation is masked and fed into a Transformer encoder network, producing contextualized representations. The model is trained by predicting the true latent representation from the other contextualized representations, in a similar fashion to masked language models such as BERT [15]; vector-quantized codes are used as the similarity measure. Wav2vec 2.0 models can be trained on large multilingual corpora [13] and then used as pre-trained models for finetuning on a small amount of labeled data. This has been shown to outperform previous approaches especially in terms of low-resource ASR [14].

3. Proposed Method

3.1. Articulatory Modeling for Phonemes

First, we identify all relevant articulatory attributes for the target language, which are categorized into three groups: vowel attributes, and place and manner of articulation for consonants. Additionally, there is a special category of attributes that applies to all output tokens. Table 1 lists all of the attributes covered in this study for Ainu and Mboshi. Since different languages have different contrasting features, the set of articulatory attributes should be modified to encompass all contrasting features present in other target languages.

Vowel attributes (e.g. vowel height, backness) tend to be best represented on a continuous scale, while consonant attributes (i.e. place or manner of articulation) tend to be best

Table 1: Articulatory attributes.

Category             Attribute values (1 / 0 / -1)
Special              sound / - / symbol
                     consonant / semi-vowel / vowel
                     voiced / - / voiceless
Vowel                back / central / front
                     open / mid / closed
                     high-toned / - / low-toned
Consonant (place)    bilabial / - / else
                     labiodental / - / else
                     alveolar / - / else
                     palatal / - / else
                     velar / - / else
                     glottal / - / else
Consonant (manner)   nasal / - / else
                     plosive / - / else
                     fricative / - / else
                     flap / - / else
                     approximant / - / else
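To make the encoding concrete, a few illustrative attribute vectors under the 1 / 0 / -1 scheme of Table 1 are sketched below. The exact assignments used in the paper are given in Table 2, so the specific values, phoneme choices, and the reduced attribute subset here are our own examples, chosen only to be consistent with the prose (e.g. an affricate being marked as both plosive and fricative, and vowel height taking intermediate values).

import numpy as np

# Columns follow a subset of Table 1; each attribute takes a value in {1, 0, -1}
# (or an intermediate value for continuous vowel attributes such as height).
ATTRIBUTE_NAMES = [
    "sound",      # 1 = sound, -1 = symbol
    "consonant",  # 1 = consonant, 0 = semi-vowel, -1 = vowel
    "voiced",     # 1 = voiced, -1 = voiceless
    "backness",   # 1 = back, 0 = central, -1 = front
    "height",     # 1 = open, 0 = mid, -1 = closed
    "velar", "plosive", "fricative",
]

# Illustrative encodings (not the paper's actual Table 2 values):
#   /g/  is a voiced velar plosive,
#   /ts/ (an affricate) is marked as both plosive and fricative,
#   /e/  is a front mid vowel, with height on a continuous scale.
PHONEME_ATTRIBUTES = {
    "g":  np.array([1,  1,  1,  0,  0,   1,  1, -1], dtype=np.float32),
    "ts": np.array([1,  1, -1,  0,  0,  -1,  1,  1], dtype=np.float32),
    "e":  np.array([1, -1,  1, -1,  0.0, -1, -1, -1], dtype=np.float32),
}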
Table 2: Examples of assignment of articulatory attribute encoding. Encodings that deserve special attention are rendered bold.
Table 3: CER (%) with varying amounts of Ainu training data and with the Mboshi data. The final row shows the relative improvement of the best-performing proposed model over the baseline.
Target language (unit)                    Ainu (syll)                            Mboshi (char)
Training data                             10m    1h     4h     10h    all (33.7h)   all (3.9h)
baseline                                  35.1   19.6   16.5   15.5   14.1          6.10
proposed (attribute prediction only)      26.0   18.0   15.8   15.2   14.1          7.28
proposed (hybrid)                         24.8   17.6   15.5   14.8   13.4          5.76
proposed (hybrid + bilingual training)    23.8   17.2   15.6   14.5   14.1          -
relative improvement (%)                  32.1   12.2   6.1    6.1    4.8           5.6
parameter model with 24 Transformer encoder layers, an embedding size of 1024, and 16 attention heads. It is trained on various speech corpora such as VoxPopuli, MLS, and CommonVoice, totaling 436K hours across 128 different languages. We conduct finetuning experiments with the CTC loss, where we freeze the convolution-based feature extractor and finetune all of the Transformer encoder layers, along with the output layers.
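A minimal sketch of this finetuning setup, assuming the HuggingFace transformers implementation of wav2vec 2.0 / XLS-R; the checkpoint name and vocabulary size are illustrative, and the paper does not specify its training code.

from transformers import Wav2Vec2ForCTC

# Load a pretrained XLS-R checkpoint and attach a fresh CTC output head
# (checkpoint name and vocab size are illustrative, not from the paper).
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",   # 24 Transformer layers, 1024-dim, 16 heads
    vocab_size=64,                     # size of the target token inventory (e.g. Ainu syllables)
    ctc_loss_reduction="mean",
    pad_token_id=0,
)

# Freeze the convolutional feature extractor; the Transformer encoder layers
# and the output head remain trainable.
model.freeze_feature_encoder()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")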
Models are trained using the Adam optimizer [19]. The learning rate is linearly warmed up over the first 10% of the training steps, peaks at 5e-3 for Ainu and 6e-3 for Mboshi, is held for 40% of the training steps, and then decays exponentially. The batch size is 45 s of audio, and in bilingual settings we mix the target and auxiliary languages in a 1:1 ratio. We employ different numbers of training steps depending on the amount of target training data: 16K, 20K, 30K, 40K, and 50K steps for 10m, 1h, 4h, 10h, and 33.7h of training data, respectively. For bilingual training, we use 60% more training steps.
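The schedule described above can be written as a simple learning-rate multiplier, sketched below for use with a PyTorch LambdaLR scheduler; the final decay target (final_scale) is an assumption, since the paper only states that the rate decays exponentially after the hold phase.

import math
from torch.optim.lr_scheduler import LambdaLR

def tri_stage_lambda(total_steps: int, warmup_frac: float = 0.1,
                     hold_frac: float = 0.4, final_scale: float = 0.05):
    """Linear warmup, hold, then exponential decay, as described above.

    final_scale (the LR multiplier reached at the last step) is our assumption.
    """
    warmup_steps = int(total_steps * warmup_frac)
    hold_steps = int(total_steps * hold_frac)
    decay_steps = max(1, total_steps - warmup_steps - hold_steps)
    decay_rate = math.log(final_scale) / decay_steps

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)       # linear warmup to the peak LR
        if step < warmup_steps + hold_steps:
            return 1.0                               # hold at the peak (5e-3 / 6e-3)
        return math.exp(decay_rate * (step - warmup_steps - hold_steps))

    return lr_lambda

# e.g. 30K steps for the 4h Ainu subset:
# scheduler = LambdaLR(optimizer, tri_stage_lambda(total_steps=30_000))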
Three configurations of the proposed method are trained and evaluated, and compared to a baseline configuration in which the output layer simply consists of one linear projection layer¹. In the attribute prediction only configuration, only the output of the articulatory attribute prediction layer is used for the final prediction. In the hybrid configuration, it is combined with the input speech features, as illustrated on the right-hand side of Figure 1. In the hybrid + bilingual training configuration for Ainu, the model is trained in a bilingual setting with additional training data in Japanese.

¹ The difference in model size between the baseline and proposed configurations is marginal, as the pretrained XLS-R model already has 317M parameters.
Character error rate (CER) is employed as the evaluation metric, given the difficulty of computing word error rate (WER) for very low-resource languages where word units are not obvious. For Ainu, the model with the lowest CER on the dev set is selected and evaluated on the test set. For Mboshi, we simply evaluate the model at the end of training on the dev set.
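For reference, CER as used here is the character-level edit distance normalized by the reference length; a compact implementation is sketched below (details such as the handling of empty references are our own choice).

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Standard dynamic-programming edit distance with a rolling 1-D table.
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dist[j] + 1,        # deletion
                      dist[j - 1] + 1,    # insertion
                      prev + (r != h))    # substitution
            prev, dist[j] = dist[j], cur
    return dist[-1] / max(1, len(ref))

# Example: cer("itak", "itah") -> 0.25 (one substitution over four characters)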
5. Experimental Results

The character error rates (CER) of the baseline and proposed models under various training settings are compared in Table 3. We observe a significant improvement in performance for all hybrid models, especially when less training data is available. The attribute prediction only models perform consistently worse than the hybrid models, and sometimes even worse than the baseline model. This is likely because the articulatory constraints can be too strict for the model to learn accurate representations of the output tokens. The use of bilingual data improves performance by a small, though not significant, margin when the training data is less than 10 hours. However, bilingual training with the entire 33.7h Ainu training set results in a decline in performance. This may be because we model the articulatory attributes of different languages in the same way, even though they are physically realized in slightly different ways.

Our method benefited Mboshi as well as Ainu, with a 5.6% relative improvement. The absolute CERs for Mboshi are much lower than those for Ainu, most likely because the speakers and dialects used for training and evaluation are the same, whereas the Ainu test set consists of a different speaker with a distinct dialect. These results demonstrate that our method can be applied to, and improves performance across, different languages and different target units.

5.1. Effect of Inputs to the Attribute Prediction Layer

We conducted an additional series of experiments in which we change the source of the input to the articulatory attribute prediction layer. We tested latent representations from the final (24th), 16th, and 8th Transformer encoder layers, as well as the output of the CNN feature extractor. Note that the final output of the Transformer encoder is still combined to produce the final output; only the input to the attribute prediction layer is changed. The results are presented in Table 4. Notably, they show a consistent trend of better performance with higher-level representation inputs. This finding suggests that articulatory attributes are high-level features that benefit from contextualized, high-level representations.

Table 4: CER (%) with different inputs to the articulatory attribute prediction layer (Ainu, 4h).

Input to attribute prediction layer    CER (%)
Transformer encoder layer 24           15.5
Transformer encoder layer 16           16.0
Transformer encoder layer 8            16.2
CNN features                           16.5
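A sketch of how the alternative inputs in Table 4 could be obtained, assuming the HuggingFace wav2vec 2.0 / XLS-R implementation; the checkpoint name and dummy input are illustrative, and only the tapping of intermediate representations is shown, not the rest of the model.

import torch
from transformers import Wav2Vec2Model

# Illustrative: expose intermediate representations that could feed the
# articulatory attribute prediction layer.
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
waveform = torch.randn(1, 16000)  # ~1 s of 16 kHz audio (dummy, unnormalized input)

with torch.no_grad():
    out = encoder(waveform, output_hidden_states=True)

cnn_features = out.extract_features   # CNN feature extractor output (512-dim)
layer_8 = out.hidden_states[8]        # 8th Transformer encoder layer (1024-dim)
layer_16 = out.hidden_states[16]      # 16th Transformer encoder layer
layer_24 = out.hidden_states[24]      # final (24th) Transformer encoder layer

# Any of these can be fed to the attribute prediction layer, while the encoder's
# final output still drives the token output layer (the hybrid path).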
6. Conclusions

We have presented a method for improving ASR in low-resource settings by incorporating linguistic prior knowledge about phonemes. It is simple yet flexible enough to model continuous and ambiguous articulatory constraints. It does not require explicit supervision with articulatory attribute labels and can model phonemes from multilingual inventories. We conducted finetuning experiments with XLS-R models on two very low-resource languages, Ainu and Mboshi. The results show that our method can improve CER by a relative 12% when the amount of training data is limited to 1 hour, and it still produces a meaningful improvement when using 33.7 hours of training data. This study demonstrates the importance of incorporating linguistic prior knowledge about the target language, even when state-of-the-art pre-trained models are employed.
7. References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.

[2] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proc. Interspeech 2020, 2020, pp. 5036–5040.

[3] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS’20. Red Hook, NY, USA: Curran Associates, Inc., 2020.

[4] M. P. Lewis, Ed., Ethnologue: Languages of the World, sixteenth ed. Dallas, TX, USA: SIL International, 2009.

[5] S. Stüker, T. Schultz, F. Metze, and A. Waibel, “Multilingual articulatory features,” in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), vol. 1, 2003.

[6] F. Metze and A. Waibel, “A flexible stream architecture for ASR using articulatory features,” in 7th International Conference on Spoken Language Processing, ICSLP 2002 - INTERSPEECH 2002, Denver, Colorado, USA, September 16-20, 2002, J. H. L. Hansen and B. L. Pellom, Eds. ISCA, 2002.

[7] C.-H. Lee, M. A. Clements, S. Dusan, E. Fosler-Lussier, K. Johnson, B.-H. Juang, and L. R. Rabiner, “An overview on automatic speech attribute transcription (ASAT),” in Proc. Interspeech 2007, 2007, pp. 1825–1828.

[8] C.-H. Lee and M. Siniscalchi, “An information-extraction approach to speech processing: Analysis, detection, verification and recognition,” Proc. IEEE, vol. 101, no. 5, pp. 1089–1115, 2013.

[9] V. Mitra, W. Wang, C. Bartels, H. Franco, and D. Vergyri, “Articulatory Information and Multiview Features for Large Vocabulary Continuous Speech Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5634–5638.

[10] L. Badino, C. Canevari, L. Fadiga, and G. Metta, “Integrating articulatory data in deep neural network-based acoustic modeling,” Computer Speech & Language, vol. 36, no. C, pp. 173–195, 2016.

[11] M. Müller, S. Stüker, and A. Waibel, “Towards Improving Low-Resource Speech Recognition Using Articulatory and Language Features,” in Proceedings of the 13th International Conference on Spoken Language Translation. International Workshop on Spoken Language Translation, 2016.

[12] S. Li, C. Ding, X. Lu, P. Shen, T. Kawahara, and H. Kawai, “End-to-End Articulatory Attribute Modeling for Low-Resource Multilingual Speech Recognition,” in Interspeech 2019. ISCA, 2019, pp. 2145–2149.

[13] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,” in Proc. Interspeech 2022, 2022, pp. 2278–2282.

[14] C. Yi, J. Wang, N. Cheng, S. Zhou, and B. Xu, “Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages,” 2021.

[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2019, pp. 4171–4186.

[16] K. Matsuura, S. Ueno, M. Mimura, S. Sakai, and T. Kawahara, “Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language,” in Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association, 2020, pp. 2622–2628.

[17] K. Maekawa, H. Koiso, S. Furui, and H. Isahara, “Spontaneous Speech Corpus of Japanese,” in Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00). European Language Resources Association (ELRA), 2000.

[18] A. Rialland, M. Adda-Decker, G.-N. Kouarata, G. Adda, L. Besacier, L. Lamel, E. Gauthier, P. Godard, and J. Cooper-Leavitt, “Parallel Corpora in Mboshi (Bantu C25, Congo-Brazzaville),” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), 2018.

[19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.