
WENETSPEECH: A 10000+ HOURS MULTI-DOMAIN MANDARIN CORPUS FOR SPEECH RECOGNITION

Binbin Zhang1,2,4,*, Hang Lv1,4,*, Pengcheng Guo1, Qijie Shao1, Chao Yang2,4, Lei Xie1B, Xin Xu3, Hui Bu3, Xiaoyu Chen2, Chenchen Zeng2, Di Wu2,4, Zhendong Peng2,4

1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University
2 Mobvoi Inc.  3 Beijing Shell Shell Technology Co., Ltd.  4 WeNet Open Source Community

* Co-first authors, equal contribution. B Corresponding author.

arXiv:2110.03370v5 [cs.SD] 23 Feb 2022

[email protected], [email protected]

ABSTRACT

In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, 22400+ hours in total. We collected the data from YouTube and Podcast, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) method is introduced to generate the audio/text segmentation candidates for the YouTube data from the corresponding video subtitles, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. We then propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labeled high-quality test sets along with WenetSpeech for evaluation: Dev for cross-validation during training, Test Net, collected from the Internet as a matched test set, and Test Meeting, recorded from real meetings as a more challenging mismatched test set. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are provided as benchmarks. To the best of our knowledge, WenetSpeech is currently the largest open-source Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.

Index Terms— automatic speech recognition, corpus, multi-domain

1. INTRODUCTION

In the past decade, the performance of automatic speech recognition (ASR) systems has improved significantly. On the one hand, the development of neural networks has increased model capacity, pushing the dominant framework from hybrid hidden Markov models [1, 2] to end-to-end models such as CTC [3, 4], RNN-T [5, 6, 7, 8], and encoder-decoder based models [9, 10, 11, 12, 13]. To make such advanced models easy to implement and to obtain state-of-the-art reproducible results, researchers have also released several open source toolkits, including Kaldi [14], Sphinx [15], Fairseq [16], ESPnet [17], and recently WeNet [18]. On the other hand, self-supervised speech representation learning methods have been proposed to make better use of large amounts of untranscribed data, such as wav2vec [19], wav2vec 2.0 [20], HuBERT [21], and wav2vec-U [22]. In addition to these algorithm-level efforts, the development of open source corpora is also crucial to the research community, especially for academia and small-scale research groups.

Most of the current open source speech corpora for ASR benchmarking remain small in size and lack domain diversity. For example, the commonly used English corpus Librispeech [23], which includes about 1000 hours of read speech from audiobooks, currently has a word error rate (WER) of 1.9% on its test-clean benchmark. However, industrial ASR systems are usually trained with tens of thousands of hours of data with acoustic diversity and broad domain coverage. To close the gap between industrial systems and academic research, several large-scale multi-domain English corpora, including The People's Speech [24], MLS [25], and GigaSpeech [26], have been made available recently. The representative GigaSpeech consists of 10000 hours of high-quality transcribed English speech for supervised ASR training and 40000 hours of audio in total for semi-supervised or unsupervised training, helping the research community develop more generalized ASR systems. Compared with those English corpora, the largest open source Mandarin speech dataset is AIShell-2 [27], which includes 1000 hours of speech recorded in a quiet environment and has a state-of-the-art character error rate of 5.35%. It is too simple to support further research, and ASR systems developed on it may suffer performance degradation in complex real-world scenarios. In addition, current open source Mandarin corpora are also unable to train a well-generalized pre-trained model, since both the wav2vec 2.0 large model [20] and the XLSR-53 model [28] are trained on more than 50000 hours of English speech data.

In this work, we release WenetSpeech, a large multi-domain Mandarin corpus licensed for non-commercial usage under CC-BY 4.0. "We" stands for connection and sharing, while "net" indicates that all of the data are collected from the Internet, which is a repository of diversity. The key features of WenetSpeech include:

• Large scale. 10000+ hours of labeled data, 2400+ hours of weakly labeled data, and about 10000 hours of unlabeled data are provided, resulting in 22400+ hours of audio in total.

• Diverse. The data are collected from multiple speaking styles, scenarios, domains, topics, and noisy conditions.

• Extensible. An extensible metadata format is designed so the data can be extended in the future.

To the best of our knowledge, WenetSpeech is currently the largest open source Mandarin speech corpus with the domain diversity needed to support various speech recognition tasks. In Section 2, we introduce the construction procedure of WenetSpeech, a reliable pipeline to obtain high-quality transcriptions, including OCR-based caption recognition on YouTube videos, ASR-based automatic transcription of Podcast audio, and a new end-to-end label error detection approach. The corpus composition is described in Section 3, and baseline benchmarks built with the Kaldi, ESPnet, and WeNet toolkits are introduced in Section 4. We believe our corpus will help the research community develop more generalized ASR systems.
2. CREATION PIPELINE

In this section, we introduce the detailed creation pipeline of the WenetSpeech corpus, including original audio collection, audio/text segmentation candidate generation, and candidate validation.

2.1. Stage 1: Audio Collection

In the beginning, we manually define 10 domain categories, including audiobook, commentary, documentary, drama, interview, news, reading, talk, variety, and others. Then we collected and tagged audio files from YouTube and Podcast playlists according to the selected categories. For the YouTube data, the videos are also downloaded so that audio/text segmentation candidates can be prepared with our OCR system. For Podcast, since manually transcribed Mandarin data is limited, we only kept the category information and transcribed the audio with a high-quality ASR system.

2.2. Stage 2: Candidates Generation

In this part, we introduce the specific pipelines used to obtain the audio/text segmentation candidates: an OCR system for the YouTube data and a high-quality ASR system for the Podcast data. Finally, a text normalization technique1 is applied to all the candidates.

1 https://github.com/speechio/chinese_text_normalization

2.2.1. YouTube OCR

As shown in Figure 1, an OCR-based pipeline is applied to generate candidates from the subtitles embedded in YouTube videos.

1. Text Detection: apply CTPN [29] based text detection to each frame image in the video.

2. Subtitle Validation: mark frame t as the start point of a specific subtitle phrase when a phrase of text is detected in the subtitle area at the bottom of the screen at frame t.

3. Subtitle Change Detection: compute the structural similarity (SSIM) of the subtitle area frame by frame until a change is detected at frame t + n. Then frame t + n − 1 is marked as the end point of this subtitle phrase.

4. Text Recognition: a CRNN-CTC [30] based text recognition approach is used to recognize the detected subtitle area.

5. Audio/Text Pair: prepare each audio/text segmentation candidate as the corresponding (start point, end point, subtitle phrase) tuple.

Fig. 1. OCR-based YouTube data collection pipeline: Text Detection → Subtitle Validation (subtitle start t) → Subtitle Change Detection (subtitle ending t + n) → Text Recognition → Text/Audio Pair.

To verify whether the proposed OCR-based pipeline is reliable, we randomly extracted 5000 subtitle transcriptions from the YouTube data with different themes and had them manually annotated by professionals as a benchmark for the Text Recognition module. We obtain 98% accuracy on this test set, which confirms the reliability of the OCR-based pipeline for the WenetSpeech corpus.

Figure 2 shows 4 typical examples of our OCR-based system. The results of the Text Detection module are marked with boxes. If a box lies in a reasonable subtitle area, it is marked in red, and the subtitle box is further processed by the Text Recognition module. Finally, the recognition result is shown above each subtitle box.

Fig. 2. Example outputs of the OCR pipeline: (a) Audiobook, (b) Game commentary, (c) Drama, (d) Lecture.

In addition, we find that some annotators prefer to split a long subtitle phrase into many pieces within a video, which makes the audio and subtitles asynchronous and leads to inaccurate subtitle boundary detection in our OCR system. To alleviate this, we merge consecutive subtitle phrases until the corresponding audio is over 8 seconds long.
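The subtitle change detection (step 3) and the 8-second merging rule above can be made concrete with a short sketch. The code below is only an illustration of the idea, not the authors' pipeline: it assumes OpenCV for frame decoding and scikit-image for SSIM, and the 0.8 similarity threshold is an arbitrary choice that the paper does not specify.

import cv2
from skimage.metrics import structural_similarity as ssim

def subtitle_spans(video_path, box, threshold=0.8):
    """Yield (start_frame, end_frame) spans whose subtitle area stays visually stable.

    box is an (x, y, w, h) subtitle region proposed by the text detector.
    """
    x, y, w, h = box
    cap = cv2.VideoCapture(video_path)
    prev, start, idx = None, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        crop = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
        # A drop in SSIM means the subtitle changed at frame idx,
        # so the previous phrase ends at frame idx - 1.
        if prev is not None and ssim(prev, crop) < threshold:
            yield start, idx - 1
            start = idx
        prev, idx = crop, idx + 1
    cap.release()
    if idx > 0:
        yield start, idx - 1

def merge_short_phrases(spans, fps, min_dur=8.0):
    """Merge consecutive spans until each merged span covers more than 8 seconds."""
    merged, cur = [], None
    for span in spans:
        cur = span if cur is None else (cur[0], span[1])
        if (cur[1] - cur[0] + 1) / fps > min_dur:
            merged.append(cur)
            cur = None
    if cur is not None:
        merged.append(cur)
    return merged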
2.2.2. Podcast Transcription

We use a third-party commercial ASR transcription system to transcribe all the Podcast data. The transcription system is one of the best systems on the public benchmark platform2, and it achieves accuracy above 95% in most of the test scenarios, including news, reading, talk show, conversation, education, games, TV, drama, and so on.

2 https://github.com/SpeechColab/Leaderboard

The transcription system first segments the original audio into short segments with a VAD module, and then the audio/text pair segmentation candidates are generated by speech recognition.

2.3. Stage 3: Candidates Validation

Although the OCR system and the transcription system are of high quality, errors from candidate generation, such as subtitle annotation errors, timestamp inaccuracies, OCR mistakes, transcription word errors, and text normalization errors, are still unavoidable. To further improve the quality of the WenetSpeech corpus, we apply the following validation approach, which classifies the YouTube OCR and Podcast transcription candidates according to their confidences and filters out the extremely bad ones.

2.3.1. Force Alignment Graph

Here, we propose a novel CTC-based end-to-end force alignment approach to detect transcription errors. The transcription is first segmented by the model unit of CTC, and then a force alignment graph (L) is built for each candidate, as shown in Figure 3. The key features of the alignment graph are:

1. The oracle transcription alignment path is included.

2. Arbitrary deletion operations at any position are allowed through the tag ⟨del⟩ with penalty p1.

3. Arbitrary insertions or substitutions at any position are allowed. From each reasonable position, a start tag ⟨is⟩ and an end tag ⟨/is⟩ are connected to a global filler state. On this filler state, each CTC modeling unit has a corresponding self-loop arc with penalty p2, represented by the tag ⟨gbg⟩.
This makes it possible to capture the errors between the audio and the corresponding transcription through decoding.

Compared with the traditional hybrid validation approach used in Librispeech, our proposed approach is purely end-to-end. There is no need for HMM topologies, a lexicon, language model components, or careful design of the filler state, so the proposed approach simplifies the whole pipeline considerably. The force alignment graph is implemented in the WeNet toolkit, and it is publicly available3.

3 https://github.com/wenet-e2e/wenet/blob/main/runtime/core/bin/label_checker_main.cc

Fig. 3. An example force alignment graph L of "不忘初心": states 0–4 form the oracle path over the reference characters, each with a ⟨del⟩/p1 skip arc; ⟨is⟩ and ⟨/is⟩ arcs connect every position to a global filler state, which carries ⟨gbg⟩/p2 self-loops.
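As a reading aid for Figure 3, the arcs of such a graph can be enumerated for a given reference. This is only a toy illustration of the structure described above, not the WFST built by the WeNet label checker referenced in the footnote; a single symbolic ⟨gbg⟩ arc stands in for the per-unit self-loops on the filler state.

def alignment_graph_arcs(ref_tokens, p1=2.3, p2=4.6):
    """Return (src, dst, label, penalty) arcs for a reference such as ['不', '忘', '初', '心'].

    States 0..len(ref_tokens) form the oracle path; 'filler' is the global garbage state.
    Penalties p1 and p2 follow the values reported in Section 2.3.2.
    """
    arcs = []
    for i, tok in enumerate(ref_tokens):
        arcs.append((i, i + 1, tok, 0.0))        # oracle transcription path
        arcs.append((i, i + 1, "<del>", p1))     # deletion allowed at any position
    for i in range(len(ref_tokens) + 1):
        arcs.append((i, "filler", "<is>", 0.0))  # start an insertion/substitution
        arcs.append(("filler", i, "</is>", 0.0)) # return from the filler state
    arcs.append(("filler", "filler", "<gbg>", p2))  # garbage self-loop with penalty p2
    return arcs

print(len(alignment_graph_arcs(list("不忘初心"))))  # 19 arcs for the Figure 3 example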
2.3.2. Label Error Detection

After defining the force alignment graph L, it is further composed with the CTC topology graph T [31] to build the final force decoding graph F = T ◦ L for label error detection and validation.

In addition, we assign a confidence to each candidate based on its reference (ref) and the force decoding hypothesis (hyp). The confidence c is computed as

c = 1 − EditDistance(ref, hyp) / max(len(ref), len(hyp)).

With this confidence, we classify the audio/text segmentation candidates of the WenetSpeech corpus into Strong Label and Weak Label sets, and filter out the extremely bad ones into the Others set. Figure 4 shows two real examples found by applying label error detection: in Case 1, a human subtitle error was successfully detected, and in Case 2, an OCR error was successfully detected.

Fig. 4. Examples of label error detection.

Please note that the models used in the transcription system and the force alignment system are developed and trained with different methods and data. p1 is set to 2.3 and p2 is set to 4.6 in our pipeline.
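The confidence defined above is straightforward to reproduce. The sketch below is a plain-Python restatement of the formula and of the Table 1 thresholds, using a standard Levenshtein distance; it is not taken from the WenetSpeech tooling.

def edit_distance(ref, hyp):
    """Standard Levenshtein distance over token sequences."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur_row = [i]
        for j, h in enumerate(hyp, 1):
            cur_row.append(min(prev_row[j] + 1,              # deletion
                               cur_row[j - 1] + 1,           # insertion
                               prev_row[j - 1] + (r != h)))  # substitution or match
        prev_row = cur_row
    return prev_row[-1]

def confidence(ref, hyp):
    """c = 1 - EditDistance(ref, hyp) / max(len(ref), len(hyp))."""
    longest = max(len(ref), len(hyp))
    return 1.0 if longest == 0 else 1.0 - edit_distance(ref, hyp) / longest

c = confidence(list("不忘初心"), list("不忘初衷"))  # one substituted character
label = "Strong Label" if c >= 0.95 else "Weak Label" if c >= 0.60 else "Others"
print(round(c, 2), label)  # 0.75 Weak Label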
3. THE WENETSPEECH CORPUS

In this section, we describe the metadata, audio format, partition by confidence, data diversity, and the training and evaluation set design of the WenetSpeech corpus. Instructions and scripts are available in the WenetSpeech GitHub repo4.

4 https://github.com/wenet-e2e/WenetSpeech

3.1. Metadata and Audio Format

We save all the metadata information in a single JSON file. The local path, original public URL, domain tags, md5 checksum, and segments are provided for each audio recording, and the timestamp, label, confidence, and subset information are provided for each segment. The design is extensible, and we plan to add more diverse data in the future.
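Section 3.1 only names the fields, so the snippet below illustrates one plausible shape for such a metadata entry and how the per-segment confidence could be used to pick supervised data. The key names (path, url, tags, md5, segments, begin_time, end_time, text, confidence, subsets) are assumptions for illustration; consult the released JSON metadata for the authoritative schema.

# Hypothetical metadata entry mirroring the fields listed in Section 3.1.
example_audio = {
    "path": "audio/train/youtube/B00000/X0000000000.opus",  # local path (hypothetical)
    "url": "<original public URL>",
    "tags": ["drama"],                                       # domain tag(s)
    "md5": "0123456789abcdef0123456789abcdef",               # placeholder checksum
    "segments": [
        {
            "begin_time": 10.26,   # timestamps in seconds
            "end_time": 16.63,
            "text": "不忘初心",     # label
            "confidence": 1.0,
            "subsets": ["L", "M", "S"],
        }
    ],
}

def strong_label_segments(audios, threshold=0.95):
    """Yield (audio path, segment) pairs usable as supervised training data (cf. Table 1)."""
    for audio in audios:
        for seg in audio["segments"]:
            if seg["confidence"] >= threshold:
                yield audio["path"], seg

print(sum(1 for _ in strong_label_segments([example_audio])))  # 1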
The original audio files are downloaded and converted to 16 kHz sampling rate, single-channel, 16-bit signed-integer format. Opus compression is then applied at an output bit rate of 32 kbps to reduce the size of the WenetSpeech corpus.
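The conversion targets above can be reproduced with standard tooling. The paper does not say which converter was used; the sketch below simply assumes a local ffmpeg binary and encodes the stated 16 kHz, single-channel audio to Opus at 32 kbps.

import subprocess

def to_opus(src, dst):
    """Downmix/resample to 16 kHz mono and encode to Opus at 32 kbps."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ar", "16000",   # 16 kHz sampling rate
         "-ac", "1",       # single channel
         "-c:a", "libopus",
         "-b:a", "32k",    # 32 kbps output bit rate
         dst],
        check=True,
    )

to_opus("raw_download.m4a", "X0000000000.opus")  # hypothetical file names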
3.2. Size and Confidence

We assign a confidence to each valid segment to measure its label quality, where the confidence is defined in Section 2.3.2. As shown in Table 1, we select 10005 hours of Strong Label data, whose confidence is at least 0.95, as the supervised training data. The 2478 hours of Weak Label data, whose confidence is between 0.60 and 0.95, are reserved in our metadata for semi-supervised or other usage. Finally, Others represents all data that is invalid for the speech recognition task (i.e., data whose confidence is below 0.60 or that is unrecognized). In summary, WenetSpeech contains 22435 hours of raw audio.

Table 1. WenetSpeech partition
  Set            Confidence      Hours
  Strong Label   [0.95, 1.00]    10005
  Weak Label     [0.60, 0.95)     2478
  Others         /                9952
  Total (hrs)    /               22435

3.3. Training Data Diversity and Subsets

We tag all the training data with its source and domain. All of the training data comes from YouTube and Podcast. As shown in Table 2, we classify the data into 10 groups according to category. Note that about 4k hours of the data come from drama, a special domain with a wide range of themes, topics, and scenarios that may overlap with any of the other categories.

Table 2. Training data in different domains with duration (hrs)
  Domain        YouTube    Podcast     Total
  audiobook         0        250.9     250.9
  commentary      112.6      135.7     248.3
  documentary     386.7       90.5     477.2
  drama          4338.2        0      4338.2
  interview       324.2      614       938.2
  news               0       868       868
  reading            0      1110.2    1110.2
  talk             204        90.7     294.7
  variety          603.3     224.5     827.8
  others           144       507.5     651.5
  Total           6113      3892     10005

As shown in Table 3, we provide three training subsets, namely S, M, and L, for building ASR systems on different data scales. Subsets S and M are sampled from all the training data with an oracle confidence of 1.0.

Table 3. The training data subsets
  Training Subset   Confidence    Hours
  L                 [0.95, 1.0]   10005
  M                 1.0            1000
  S                 1.0             100
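The released subset lists are fixed in the metadata, but the description above (S and M drawn from segments with oracle confidence 1.0) can be illustrated with a simple duration-budgeted sampler. This is a sketch of the idea only, with hypothetical field names, and not the procedure used to build the official subsets.

import random

def sample_subset(segments, target_hours, seed=0):
    """Randomly pick confidence-1.0 segments until the duration budget is reached."""
    pool = [s for s in segments if s["confidence"] == 1.0]
    random.Random(seed).shuffle(pool)
    picked, total = [], 0.0
    for seg in pool:
        if total >= target_hours * 3600:
            break
        picked.append(seg)
        total += seg["end_time"] - seg["begin_time"]
    return picked

segs = [{"begin_time": 0.0, "end_time": 6.0, "confidence": 1.0} for _ in range(100)]
print(len(sample_subset(segs, target_hours=0.1)))  # 60 six-second segments ≈ 0.1 h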
3.4. Evaluation Sets

We will release the following evaluation sets with WenetSpeech; the main information is summarized in Table 4.

1. Dev is designed for speech tools that require cross-validation during training.

2. Test Net is a matched test set collected from the Internet. Compared with the training data, it also covers several popular and difficult domains such as game commentary and live commerce.

3. Test Meeting is a mismatched test set of far-field, conversational, spontaneous meeting speech. It is sampled from 197 real meetings held in a variety of rooms, and its topics cover education, real estate, finance, house and home, technology, interviews, and so on.

The three evaluation sets are carefully checked by professional annotators to ensure the transcription quality.

Table 4. The WenetSpeech evaluation sets
  Evaluation Set   Hours   Source
  Dev                20    Internet
  Test Net           23    Internet
  Test Meeting       15    Real meeting

4. EXPERIMENTS

In this section, we introduce the baseline systems and experimental results on three popular speech recognition toolkits: Kaldi [14], ESPnet [17], and WeNet [18].

4.1. Kaldi Benchmark5

The Kaldi baseline implements a classical chain model [32] using various amounts of WenetSpeech data (i.e., S, M, L). We choose the open source vocabulary BigCiDian6 as our lexicon and segment our transcriptions with the open source word segmentation toolkit jieba [33]. First, we train a GMM-HMM model to obtain the training alignments. Then, we train a chain model, which stacks 6 convolutional neural network (CNN) layers, 9 factored time-delay neural network (TDNN-F) layers [34], 1 time-restricted attention layer [35] (H = 30), and 2 projected long short-term memory (LSTMP) with TDNN-F blocks. We feed 40-dimensional filterbank (FBank) features and 100-dimensional i-vector features as the input. To be consistent with the other systems, we only use the SpecAugment [36] technique and abandon speed/volume perturbation. The chain model is trained with the LF-MMI criterion and a cross-entropy loss (10 epochs for subset S, and 4 epochs for subsets M and L). A 3-gram language model (LM) is used for decoding and generating the lattices. Finally, a recurrent neural network LM (RNNLM) is adopted to rescore the lattices. The 3-gram LM and the RNNLM are both trained on all the WenetSpeech transcriptions.

5 https://github.com/wenet-e2e/WenetSpeech/tree/main/toolkits/kaldi
6 https://github.com/speechio/BigCiDian

4.2. ESPnet Benchmark7

The ESPnet baseline employs a Conformer model [13, 37], which captures the global context with multi-head self-attention modules and simultaneously learns local correlations with convolution modules. Our Conformer model consists of a 12-block Conformer encoder (dff = 2048, H = 8, datt = 512, CNNkernel = 15) and a 6-block Transformer [38] decoder (dff = 2048, H = 8). A set of 5535 Mandarin characters and 26 English letters is used as the modeling units. The objective is a logarithmic linear combination of the CTC (λ = 0.3) and attention objectives, with label smoothing applied to the attention objective. During data preparation, we generate 80-dimensional FBank features with a 32 ms window and an 8 ms frame shift. SpecAugment with 2 frequency masks (F = 30) and 2 time masks (T = 40) and global CMVN are used as data pre-processing. During training, we choose the Adam optimizer with a maximum learning rate of 0.0015 and a Noam learning rate scheduler with 30k warm-up steps. The model is trained with dynamic batching for 30 epochs, and the last 10 best checkpoints are averaged to form the final model. For decoding, the ESPnet system follows the joint CTC/attention beam search strategy [39].

7 https://github.com/wenet-e2e/WenetSpeech/tree/main/toolkits/espnet

4.3. WeNet Benchmark8

The WeNet baseline implements a U2 model [18], which unifies streaming and non-streaming end-to-end (E2E) speech recognition in a single model. The basic setup of our WeNet model is the same as the ESPnet model except for the following minor points: 1) we prepare 80-dimensional FBank features with a 25 ms window and a 10 ms frame shift, and apply SpecAugment with 2 frequency masks (F = 30) and 3 time masks (T = 50) and global CMVN on top of the features; 2) the maximum number of training epochs is 25, and the models of the last 10 epochs are averaged to form the final model. The key difference between WeNet and ESPnet lies in the decoding strategy: unlike ESPnet's auto-regressive decoding, WeNet generates the N-best hypotheses with the CTC branch and rescores them with the attention branch.

8 https://github.com/wenet-e2e/WenetSpeech/tree/main/toolkits/wenet

4.4. Experimental Results

We must emphasize that the results listed here are purely intended to provide a baseline system for each toolkit; they might not reflect the state-of-the-art performance of each toolkit.

In Table 5, we report the experimental results in Mixture Error Rate (MER) [40], which counts Mandarin characters and English words as the tokens in the edit distance calculation, on our three designed test sets and one well-known, publicly available test set (the AIShell-1 [41] test set) for the Kaldi, ESPnet, and WeNet toolkits. The good performance on AIShell-1 reflects the diversity and reliability of the WenetSpeech corpus, and the results on our designed test sets show that they are quite challenging. In Table 6, we provide the Kaldi baseline results for the different WenetSpeech training subsets. As the amount of training data grows, the performance improves steadily.

Table 5. Results (MER%) on different test sets for baseline systems trained on WenetSpeech training subset L
  Toolkit   Dev     Test Net   Test Meeting   AIShell-1
  Kaldi      9.07   12.83      24.72          5.41
  ESPnet     9.70    8.90      15.90          3.90
  WeNet      8.88    9.70      15.59          4.61

Table 6. Kaldi baseline results (MER%) for different WenetSpeech training subsets
  Subset    Dev     Test Net   Test Meeting   AIShell-1
  L          9.07   12.83      24.72          5.41
  M          9.81   14.19      28.22          5.93
  S         11.70   17.47      37.27          7.66

5. ACKNOWLEDGEMENTS

We thank Jiayu Du and Guoguo Chen for their suggestions on this work. We thank Tencent Ethereal Audio Lab and Xi'an Future AI Innovation Center for providing hosting services for WenetSpeech. We also thank MindSpore, a new deep learning computing framework9, for supporting this work. Our gratitude goes to Lianhui Zhang and Yu Mao for collecting some of the YouTube data.

9 https://www.mindspore.cn/
6. REFERENCES

[1] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, Brian Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," in IEEE Signal Processing Magazine, 2012, vol. 29, no. 6, pp. 82–97.
[2] George E Dahl, Dong Yu, Li Deng, Alex Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," in IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2012, vol. 20, no. 1, pp. 30–42.
[3] Alex Graves, Santiago Fernández, Faustino Gomez, Jürgen Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ACM International Conference on Machine Learning (ICML), 2006, pp. 369–376.
[4] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., "Deep speech 2: End-to-end speech recognition in English and Mandarin," in ACM International Conference on Machine Learning (ICML), 2016, pp. 173–182.
[5] Alex Graves, "Sequence transduction with recurrent neural networks," in ACM International Conference on Machine Learning Representation Learning Workshop (ICML), 2012.
[6] Alex Graves, Abdel-rahman Mohamed, Geoffrey Hinton, "Speech recognition with deep recurrent neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.
[7] Xiong Wang, Zhuoyuan Yao, Xian Shi, Lei Xie, "Cascade RNN-transducer: Syllable based streaming on-device Mandarin speech recognition with a syllable-to-character converter," in IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 15–21.
[8] Xiong Wang, Sining Sun, Lei Xie, Long Ma, "Efficient conformer with prob-sparse attention mechanism for end-to-end speech recognition," in ISCA Conference of the International Speech Communication Association (Interspeech), 2021, pp. 4578–4582.
[9] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio, "Attention-based models for speech recognition," in Annual Conference on Neural Information Processing Systems (NIPS), 2015, pp. 577–585.
[10] William Chan, Navdeep Jaitly, Quoc Le, Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
[11] Suyoun Kim, Takaaki Hori, Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4835–4839.
[12] Linhao Dong, Shuang Xu, Bo Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5884–5888.
[13] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang, "Conformer: Convolution-augmented transformer for speech recognition," in ISCA Conference of the International Speech Communication Association (Interspeech), 2020, pp. 5036–5040.
[14] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, Karel Veselý, "The Kaldi speech recognition toolkit," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.
[15] K-F Lee, H-W Hon, Raj Reddy, "An overview of the SPHINX speech recognition system," in IEEE Transactions on Acoustics, Speech, and Signal Processing (TASLP), 1990, vol. 38, no. 1, pp. 35–45.
[16] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli, "Fairseq: A fast, extensible toolkit for sequence modeling," in The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 48–53.
[17] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, Tsubasa Ochiai, "ESPnet: End-to-end speech processing toolkit," in ISCA Conference of the International Speech Communication Association (Interspeech), 2018, pp. 2207–2211.
[18] Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, Xin Lei, "WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit," in ISCA Conference of the International Speech Communication Association (Interspeech), 2021, pp. 4054–4058.
[19] Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli, "wav2vec: Unsupervised pre-training for speech recognition," in ISCA Conference of the International Speech Communication Association (Interspeech), 2019, pp. 3465–3469.
[20] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Annual Conference on Neural Information Processing Systems (NeurIPS), 2020, vol. 33, pp. 12449–12460.
[21] Wei-Ning Hsu, Yao-Hung Hubert Tsai, Benjamin Bolte, Ruslan Salakhutdinov, Abdelrahman Mohamed, "HuBERT: How much can a bad teacher benefit ASR pre-training?," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6533–6537.
[22] Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli, "Unsupervised speech recognition," in Annual Conference on Neural Information Processing Systems (NeurIPS), 2021, vol. 34.
[23] Vassil Panayotov, Guoguo Chen, Daniel Povey, Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[24] Daniel Galvez, Greg Diamos, Juan Manuel Ciro Torres, Juan Felipe Cerón, Keith Achorn, Anjali Gopi, David Kanter, Max Lam, Mark Mazumder, Vijay Janapa Reddi, "The People's Speech: A large-scale diverse English speech recognition dataset for commercial usage," in Annual Conference on Neural Information Processing Systems (NeurIPS), 2021.
[25] Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert, "MLS: A large-scale multilingual dataset for speech research," in ISCA Conference of the International Speech Communication Association (Interspeech), 2020, pp. 2757–2761.
[26] Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, Zhiyong Yan, "GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio," in ISCA Conference of the International Speech Communication Association (Interspeech), 2021, pp. 3670–3674.
[27] Jiayu Du, Xingyu Na, Xuechen Liu, Hui Bu, "AISHELL-2: Transforming Mandarin ASR research into industrial scale," in arXiv preprint arXiv:1808.10583, 2018.
[28] Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli, "Unsupervised cross-lingual representation learning for speech recognition," in ISCA Conference of the International Speech Communication Association (Interspeech), 2021, pp. 2426–2430.
[29] Zhi Tian, Weilin Huang, Tong He, Pan He, Yu Qiao, "Detecting text in natural image with connectionist text proposal network," in Springer European Conference on Computer Vision (ECCV), 2016, pp. 56–72.
[30] Baoguang Shi, Xiang Bai, Cong Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016, vol. 39, no. 11, pp. 2298–2304.
[31] Yajie Miao, Mohammad Gowayyed, Florian Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 167–174.
[32] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, Sanjeev Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in ISCA Conference of the International Speech Communication Association (Interspeech), 2016, pp. 2751–2755.
[33] J Sun, "Jieba Chinese word segmentation tool," 2012.
[34] Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, Sanjeev Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in ISCA Conference of the International Speech Communication Association (Interspeech), 2018, pp. 3743–3747.
[35] Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, Sanjeev Khudanpur, "A time-restricted self-attention layer for ASR," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5874–5878.
[36] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, Quoc V Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in ISCA Conference of the International Speech Communication Association (Interspeech), 2019, pp. 2613–2617.
[37] Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, Yuekai Zhang, "Recent developments on ESPnet toolkit boosted by Conformer," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 5874–5878.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, Illia Polosukhin, "Attention is all you need," in Annual Conference on Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
[39] Takaaki Hori, Shinji Watanabe, John R Hershey, "Joint CTC/attention decoding for end-to-end speech recognition," in Annual Meeting of the Association for Computational Linguistics (ACL), 2017, pp. 518–529.
[40] Xian Shi, Qiangze Feng, Lei Xie, "The ASRU 2019 Mandarin-English code-switching speech recognition challenge: Open datasets, tracks, methods and results," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2020.
[41] Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, Hao Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in IEEE Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), 2017, pp. 1–5.
