End To End CSR
Ne Luo*, Dongwei Jiang*, Shuaijiang Zhao*, Caixia Gong*, Wei Zou, Xiangang Li
* Joint contributors.

ABSTRACT

Code-switching speech recognition has attracted increasing interest recently, but the need for expert linguistic knowledge has always been a major obstacle. End-to-end automatic speech recognition (ASR) simplifies the building of ASR systems considerably by predicting graphemes or characters directly from acoustic input. At the same time, the need for expert linguistic knowledge is eliminated, which makes it an attractive choice for code-switching ASR. This paper presents a hybrid CTC-Attention based end-to-end Mandarin-English code-switching (CS) speech recognition system and studies the effect of hybrid CTC-Attention based models, different modeling units, the inclusion of language identification and different decoding strategies on the task of code-switching ASR. On the SEAME corpus, our system achieves a mixed error rate (MER) of 34.24%.

Index Terms— speech recognition, code-switching, end-to-end methods, attention, connectionist temporal classification

1. INTRODUCTION

As the multilingual phenomenon becomes more and more common in real life [1], there has been increasing interest in code-switching speech recognition. Code-switching speech is defined as speech that contains more than one language within an utterance [2].

Several challenges arise in this area, including the lack of training data for language modeling [3], co-articulation effects [4], and the need for expert linguistic knowledge, which together make it difficult to build a good ASR system that handles the code-switching phenomenon. Previous work mainly focuses on the first two challenges. Statistical machine translation (SMT) has been used to generate artificial code-switching texts [4]. Recurrent neural network language models (RNNLMs) and factored language models (FLMs) that integrate part-of-speech (POS) tags, language information, or syntactic and semantic features have been proposed to improve language modeling of code-switching speech [3, 5, 6]. To tackle the co-articulation problem, speaker adaptation, phone sharing and phone merging have been applied [4]. Additionally, language information has been incorporated into ASR systems by introducing a language identifier [7, 8, 9].

Recently, end-to-end speech recognition systems [10] have become increasingly popular while achieving promising results on various ASR benchmarks. End-to-end systems reduce the effort of building ASR systems considerably by predicting graphemes or characters directly from acoustic information without a predefined alignment.

In a code-switching scenario, we believe end-to-end models have a competitive advantage over traditional systems, since they do not need expert linguistic knowledge of the target languages and the burden of generating specific lexicons is relieved. There is only one previous work on building end-to-end code-switching speech recognition systems [11], but it uses artificially generated data, produced by concatenating monolingual utterances, rather than spontaneous code-switching speech.

The two major types of end-to-end architectures are connectionist temporal classification (CTC) [12, 13] and attention-based methods [14, 15, 16]. The CTC objective can be used to train end-to-end systems that directly predict grapheme sequences without requiring a frame-level alignment of the target labels for a training utterance. An attention-based model consists of an encoder network and an attention-based decoder, which map the acoustic input into a high-level representation and recognize symbols conditioned on previous predictions, respectively. [13] presents a joint CTC-Attention multi-task learning model that combines the benefits of both types of systems. Their model achieves state-of-the-art results on multiple public benchmarks while improving robustness and speed of convergence compared to other end-to-end models.

In this work, we apply a framework similar to the joint CTC-Attention model to Mandarin-English code-switching speech to observe whether it can match the performance of traditional systems while preserving the benefits of end-to-end models. We also study the effect of different modeling units, the inclusion of language identification and different decoding strategies on end-to-end code-switching ASR. All of our experiments are conducted on the SEAME corpus [17].

The rest of this paper is organized as follows. Section 2 introduces the details of the attention and CTC frameworks. End-to-end code-switching speech recognition, including modeling units, language identification and decoding strategies, is studied in Section 3. Section 4 describes the details of our model and analyzes the results of our experiments. Section 5 draws conclusions and discusses future work.

2. END-TO-END FRAMEWORK

2.1. Connectionist temporal classification (CTC)

The key idea of CTC [12] is that it removes the need for a prior alignment between input and output sequences. Taking the network outputs as a probability distribution over all possible label sequences, conditioned on a given input sequence x, we can define an objective function that maximizes the probability of the correct labelling. To achieve this, an extra blank label, denoted ⟨b⟩, is introduced to map frames and labels to the same length; it can be interpreted as "no target label". CTC computes the conditional probability by marginalizing over all possible alignments and assuming conditional independence between output predictions at different time steps given the aligned inputs.
Given a label sequence y corresponding to an utterance x, y is typically much shorter than x in speech recognition. Let β(y, x) be the set of all sequences over the labels Y ∪ {⟨b⟩} that have length |x| = T and that are identical to y after first collapsing consecutive repeated targets and then removing any blank symbols (e.g., A⟨b⟩AA⟨b⟩B → AAB). CTC defines the probability of the label sequence conditioned on the acoustics as in Equation 1:

P_CTC(y|x) = Σ_{ŷ ∈ β(y,x)} P(ŷ|x) = Σ_{ŷ ∈ β(y,x)} Π_{t=1}^{T} P(ŷ_t|x)        (1)
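For concreteness, the minimal sketch below enumerates β(y, x) by brute force for a toy three-frame example and evaluates Equation 1 directly. This is purely illustrative (the label inventory and posterior matrix are invented here); practical CTC training replaces the enumeration with the efficient forward-backward recursion.

```python
# Toy illustration of Equation 1: enumerate beta(y, x) and marginalize.
import itertools
import numpy as np

BLANK = "<b>"

def collapse(path):
    """Map a frame-level path to a label sequence: merge repeats, drop blanks."""
    merged = [s for s, _ in itertools.groupby(path)]
    return [s for s in merged if s != BLANK]

def ctc_probability(posteriors, labels, target):
    """P_CTC(target | x) with per-frame posteriors of shape (T, len(labels))."""
    T = posteriors.shape[0]
    total = 0.0
    for path in itertools.product(range(len(labels)), repeat=T):
        if collapse([labels[i] for i in path]) == list(target):
            total += np.prod([posteriors[t, i] for t, i in enumerate(path)])
    return total

labels = ["A", "B", BLANK]
posteriors = np.array([[0.6, 0.1, 0.3],
                       [0.5, 0.2, 0.3],
                       [0.1, 0.7, 0.2]])            # T = 3 frames
print(collapse(["A", BLANK, "A", "A", BLANK, "B"]))  # ['A', 'A', 'B']
print(ctc_probability(posteriors, labels, ["A", "B"]))
```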
2.2. Attention based models

Chan et al. [18] proposed Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. LAS is based on the sequence-to-sequence learning framework with attention and consists of two sub-modules: the listener and the speller.

Most attention models used in speech recognition share a similar structure with LAS and are well suited to variable-length input and output sequences. An attention-based model contains an encoder network and an attention-based decoder network. The attention-based encoder-decoder network can be defined as:

h = Encoder(x),        (2)

P(y_t | x, y_{1:t-1}) = AttentionDecoder(h, y_{1:t-1}),        (3)

where Encoder(·) can be a Long Short-Term Memory (LSTM) or Bidirectional LSTM (BLSTM) network and AttentionDecoder(·) can be an LSTM or Gated Recurrent Unit (GRU) network.

The encoder network maps the input acoustics into a higher-level representation h. The attention-based decoder network predicts the next output symbol conditioned on the full sequence of previous predictions and the acoustics, i.e., P(y_t | x, y_{1:t-1}). The attention mechanism selects (or weights) the input frames used to generate the next output element. Two of the main attention mechanisms are content-based attention [19] and location-based attention [15]. Borrowed from neural machine translation, content-based attention can be used directly in speech recognition. For location-based attention, location-awareness is added to the attention mechanism to better fit the speech recognition task.
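As an illustration of Equations 2 and 3, the short numpy sketch below performs one decoder step of content-based attention over a random encoder output. The dimensions and variable names are chosen for the example only and are not taken from the paper's implementation.

```python
# One decoder step of content-based attention (illustrative sizes).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, enc_dim, dec_dim = 50, 512, 256             # illustrative sizes

h = rng.standard_normal((T, enc_dim))           # encoder outputs, Eq. (2)
s = rng.standard_normal(dec_dim)                # current decoder state
W = rng.standard_normal((dec_dim, enc_dim))     # projection used for scoring

scores = h @ (W.T @ s)                          # content-based energies per frame
alpha = softmax(scores)                         # attention weights over frames
context = alpha @ h                             # weighted sum of encoder frames

# The decoder would consume [context, previous label embedding] to produce
# P(y_t | x, y_{1:t-1}) as in Eq. (3).
print(alpha.shape, context.shape)               # (50,), (512,)
```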
3. METHODS

3.1. Hybrid CTC-Attention based models

Inspired by [13], we add a CTC objective function as an auxiliary task to train the encoder of the attention model. The forward-backward algorithm of CTC enforces a monotonic alignment between input and output sequences, which helps the attention model converge. The attention decoder learns label dependencies and thus often shows improved performance over CTC when no external language model is used.

We combine the CTC and attention models by defining a hybrid CTC-Attention objective function with two losses:

L_MTL = λ L_Att + (1 − λ) L_CTC,        (4)

where λ is a tunable hyper-parameter in the range [0, 1], dictating the weight assigned to the attention loss.

3.2. Acoustic modeling units

Syllables and characters are common acoustic modeling units for Mandarin speech recognition systems. We choose the character as the Mandarin acoustic modeling unit, as it is the most common choice for end-to-end Mandarin ASR and has shown state-of-the-art performance on several public benchmarks [20, 21]. For English, the frequently used acoustic modeling units in end-to-end speech recognition systems are characters [13, 18] and subwords [22, 23]. In this paper, we explore two acoustic modeling unit combinations for Mandarin-English code-switching speech recognition: character units for both languages (Character-Character), and character units for Mandarin plus subword units for English (Character-Subword).

The Character-Character model takes acoustic features as input and outputs sequences consisting of Chinese and English characters. Let Y be the output sequence, Y = (⟨sos⟩, y_1, y_2, ..., y_T, ⟨eos⟩), y_i ∈ {y_CH, y_EN, ⟨apostrophe⟩, ⟨space⟩, ⟨unk⟩}, where y_CH contains a few thousand frequently used Chinese characters, y_EN contains the 26 English characters, and ⟨sos⟩, ⟨eos⟩ represent the start and the end of a sentence respectively.

The Character-Subword model is built with a vocabulary containing Chinese characters and English subwords. In this paper, we adopt Byte Pair Encoding (BPE) [24] as the subword segmentation method. BPE is an algorithm originally used in data compression: it replaces the most frequent pair of bytes (characters) in a sequence with a single unused byte (character sequence). We iteratively replace the most frequent pair of symbols with a new symbol, and every new symbol is added to the subword set; the process ends when the number of subwords reaches a preset value. We insert a special boundary symbol before every English word to mark the start of a word. Once the subword set is generated, we split English words into subwords by greedily segmenting the longest subword in a word. After decoding, word sequences are reconstructed from the subword-based output sequences by replacing all word boundary marks in the subwords with spaces (an illustrative sketch follows Fig. 1 below).

According to the segmentation methods above, a Mandarin-English code-switching sentence can be converted into the two kinds of modeling units, as shown in Fig. 1.

Fig. 1. An example of converting one Mandarin-English code-switching sentence (请问music是什么意思) into the two kinds of modeling units (Character-Character and Character-Subword).
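The following sketch illustrates the greedy longest-match segmentation and the word reconstruction described above. The tiny subword inventory and the '_' boundary mark are stand-ins chosen for illustration; the actual subword set is learned with BPE on the training transcripts.

```python
# Greedy longest-match subword segmentation and word reconstruction (sketch).
subwords = {"_mu", "_m", "sic", "si", "c", "_is", "_what"}   # hypothetical set

def segment(word, subword_set):
    """Split '_word' into subwords by greedily taking the longest known prefix."""
    token, pieces = "_" + word, []
    while token:
        for end in range(len(token), 0, -1):
            if token[:end] in subword_set:
                pieces.append(token[:end])
                token = token[end:]
                break
        else:
            pieces.append(token[0])     # back off to a single character
            token = token[1:]
    return pieces

def restore(pieces):
    """Rebuild words from subwords: the boundary mark '_' becomes a space."""
    return "".join(pieces).replace("_", " ").strip()

print(segment("music", subwords))       # e.g. ['_mu', 'sic']
print(restore(["_mu", "sic", "_is"]))   # 'music is'
```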
3.3. Joint language identification (LID)

In code-switching ASR, words with similar pronunciations in different languages are very likely to be recognized incorrectly. To deal with this problem, we consider including language identification in our system, and we propose two specific strategies to incorporate it.

The first is LID-Label, which is similar to Seki et al.'s work [11]. In this strategy, we use an augmented vocabulary, adding the LID symbols 'CH' and 'EN' to the output symbols. The decoder network predicts the corresponding LID symbol before the following characters/subwords once it
meets a code-switching point. In this way the network is forced to learn language information.

The other method, LID-MTL for short, trains the network to recognize speech and language simultaneously through a multi-task learning framework. Similar to [9], we create the alignment of Chinese characters and English words in advance and generate LID sequences based on the alignments. We then add a new network that shares the encoder with the attention model and the CTC model. The loss of this new network is the cross entropy between the predicted LID and the ground-truth LID from the alignments. We combine the losses using Equation 5:

L_MTL = λ_Att L_Att + λ_CTC L_CTC + λ_LID L_LID,        (5)

where λ_Att, λ_CTC and λ_LID are tunable hyper-parameters that sum to 1, dictating the weight assigned to the corresponding loss. Fig. 2 shows the architecture of the proposed LID-MTL model.
Fig. 2. Our proposed joint speech recognition and language identification multi-task learning framework. The encoder is shared by the attention decoder, the CTC component and the LID component. It transforms the input sequence x into high-level features h. The attention decoder and CTC generate the output sequence y, while the LID component outputs a language ID for each frame.
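As a sketch of how Equations 4 and 5 combine the objectives, the snippet below mixes placeholder loss values with weights that sum to one. The numeric loss values are illustrative only; the weights follow the settings discussed later (λ = 0.8 in Sec. 4.3 and λ_LID = 0.10 in Table 4).

```python
# Weighted combination of the attention, CTC and LID losses (Eq. 4 / Eq. 5 sketch).
def mtl_loss(l_att, l_ctc, l_lid=0.0, w_att=0.8, w_ctc=0.1, w_lid=0.1):
    assert abs(w_att + w_ctc + w_lid - 1.0) < 1e-6   # weights sum to 1
    return w_att * l_att + w_ctc * l_ctc + w_lid * l_lid

# Eq. (4) is the special case without the LID head (lambda = 0.8 as in Sec. 4.3):
loss_eq4 = mtl_loss(l_att=1.2, l_ctc=2.5, l_lid=0.0, w_att=0.8, w_ctc=0.2, w_lid=0.0)
# Eq. (5) with the best setting from Table 4 (lambda_LID = 0.10):
loss_eq5 = mtl_loss(l_att=1.2, l_ctc=2.5, l_lid=0.7, w_att=0.8, w_ctc=0.1, w_lid=0.1)
print(loss_eq4, loss_eq5)
```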
3.4. Decoding Strategy

As we are using a hybrid CTC-Attention model, the joint CTC-Attention beam-search decoding introduced in [25] is applied as our basic strategy.

After some experiments and analysis of the decoding results, we found an interesting phenomenon: although the token error rate (TER) of our end-to-end model is relatively low, the MER is higher than expected, because some of the final winners in beam search contain subword sequences that cannot form valid words. To overcome this problem, we generated a word dictionary containing a few thousand frequently used English words plus the words appearing in the SEAME train set, and we developed two decoding strategies to increase the odds that candidates forming correct words are selected (a sketch of the dictionary check follows the list below).

• Decode1: At the end of beam search, only candidates whose subword sequences form correct words compete for the final winner.

• Decode2: During beam search, candidates whose subword sequences cannot form correct words are discarded.
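A rough sketch of the dictionary check behind Decode1 and Decode2 is given below. The toy word list, beam candidates and scores are invented for illustration; in the real system the same check is applied to hypotheses inside the joint CTC-Attention beam search.

```python
# Dictionary check used by Decode1/Decode2 (illustrative sketch).
def forms_valid_words(subword_pieces, word_dict):
    text = "".join(subword_pieces).replace("_", " ").strip()
    return all(w in word_dict for w in text.split())

word_dict = {"music", "meaning"}                      # toy dictionary
beam = [(["_mu", "sic"], -3.2),                       # valid: 'music'
        (["_mu", "s", "ik"], -3.0),                   # invalid: 'musik'
        (["_mea", "ning"], -4.1)]                     # valid: 'meaning'

# Decode1: filter only when picking the final winner.
finalists = [c for c in beam if forms_valid_words(c[0], word_dict)]
winner = max(finalists, key=lambda c: c[1]) if finalists else max(beam, key=lambda c: c[1])
print(winner)   # (['_mu', 'sic'], -3.2): the raw best hypothesis is rejected

# Decode2 would apply the same check to partial hypotheses at every expansion
# step of beam search, pruning invalid candidates early instead of only at the end.
```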
4. EXPERIMENTS

4.1. Data

We conduct our experiments on the SEAME (South East Asia Mandarin-English) corpus. SEAME is a 66.8-hour Mandarin-English code-switching corpus containing spontaneous conversations and interview talks recorded from Singaporean and Malaysian speakers. The corpus includes 155 speakers, of whom 115 are Singaporean and the rest are Malaysian. The gender ratio is quite balanced, with females and males accounting for 55% and 45% respectively. There is only a small proportion of monolingual segments in this corpus: only 12% and 6% of the transcribed segments are Mandarin and English monolingual utterances respectively. We divide the SEAME corpus into three sets (train, development and test) based on several criteria such as gender, speaking style and speaker nationality. The detailed statistics of the SEAME corpus are presented in Table 1.

4.2. Training

The model we use is a hybrid CTC-Attention model. The shared encoder has 2 convolutional layers, followed by 4 bi-directional GRU layers with 256 GRU units per direction, interleaved with 2 time-pooling layers, which results in a 4-fold reduction of the input sequence length. The decoder has 1 GRU layer with 256 GRU units, and the output consists of 2376 Chinese characters, 1 unknown character, 1 sentence start token, 1 sentence end token, and the English character/subword set.

During training, scheduled sampling and unigram label smoothing are applied as described in [26, 27, 28]. The Adam optimization method with gradient clipping is used for optimization. We initialize all weights randomly from an isotropic Gaussian distribution with variance 0.1, and the learning rate is decayed from 5e-4 to 5e-5 during training. The model is trained using TensorFlow [29].

A recurrent neural network language model (RNNLM) is incorporated into the hybrid CTC-Attention based model. The RNNLM is composed of 2 LSTM layers with 800 hidden units each and has the same output vocabulary as the hybrid CTC-Attention based model. The RNNLM is trained on the SEAME train set and validated on the dev set. The AdaDelta algorithm with gradient clipping is used for its optimization, with an initial learning rate of 0.05. All experiments below incorporate the RNNLM.
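For reference, the encoder/decoder and optimization settings quoted in Sec. 4.2 can be summarized as a configuration sketch like the one below; this is a restatement of the stated hyper-parameters, not the authors' released configuration.

```python
# Configuration sketch restating the hyper-parameters described in Sec. 4.2.
config = {
    "encoder": {
        "conv_layers": 2,
        "bi_gru_layers": 4,
        "gru_units_per_direction": 256,
        "time_pooling_layers": 2,        # 4-fold reduction of input length overall
    },
    "decoder": {
        "gru_layers": 1,
        "gru_units": 256,
    },
    "output_vocab": 2376 + 1 + 1 + 1,    # Chinese chars + <unk> + <sos> + <eos>,
                                         # plus the English character/subword set
    "optimizer": "adam",                 # with gradient clipping
    "learning_rate": (5e-4, 5e-5),       # decayed during training
    "init": "gaussian(variance=0.1)",
}
```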
4.3. Choice of MTL weight

We first conduct experiments with different choices of the MTL weight using the Character-Character model. As shown in Table 2, our model obtains the lowest MER with λ = 0.8. This is consistent with our expectation that models trained with the multi-task objective function perform better than those trained with the attention objective alone. Therefore, we choose λ = 0.8 in the following experiments.

Table 2. MERs (%) of different hyper-parameter λ on the development set (Dev) and test set (Test) of SEAME for character based systems.

λ      Dev      Test
0.2    39.72    40.94
0.5    38.24    39.97
0.8    37.59    39.31
1.0    38.03    40.27

Table 4. MERs (%) on the development set (Dev) and test set (Test) of SEAME. λ_LID represents the weight of the LID loss in LID-MTL, while λ_Att = 0.8 and λ_CTC = 0.2 − λ_LID.

Model        λ_LID    Dev      Test
Att + CTC    -        35.44    37.83
LID-Label    -        35.48    37.98
LID-MTL      0.05     34.45    37.03
LID-MTL      0.10     34.13    36.48
LID-MTL      0.20     35.43    37.82
4.6. Effect of different decoding strategies

Table 5 shows MERs on SEAME using different decoding strategies. Decode2 clearly imposes a stronger restriction on beam-search candidates, and it may also remove correct decoding results when mistakes occur at an early stage of decoding. However, the final MER of Decode2 is lower than that of Decode1. This seems to suggest that our end-to-end model has a hard time relating the different parts of a subword sequence to one another, and one possible explanation is that the size of SEAME is too small. We are interested to figure out ...
REFERENCES

[1] Colin Baker, Foundations of bilingual education and bilingualism, vol. 79, Multilingual Matters, 2011.

[2] Peter Auer, Code-switching in conversation: Language, interaction and identity, Routledge, 2013.

[3] Heike Adel, Ngoc Thang Vu, Katrin Kirchhoff, Dominic Telaar, and Tanja Schultz, "Syntactic and semantic features for code-switching factored language models," IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 431–440, 2015.

[4] Ngoc Thang Vu, Dau-Cheng Lyu, Jochen Weiner, Dominic Telaar, Tim Schlippe, Fabian Blaicher, Eng-Siong Chng, Tanja Schultz, and Haizhou Li, "A first speech recognition system for mandarin-english code-switch conversational speech," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4889–4892.

[5] Heike Adel, Ngoc Thang Vu, and Tanja Schultz, "Combination of recurrent neural networks and factored language models for code-switching language modeling," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2013, vol. 2, pp. 206–211.

[6] Heike Adel, Katrin Kirchhoff, Dominic Telaar, Ngoc Thang Vu, Tim Schlippe, and Tanja Schultz, "Features for factored language models for code-switching speech," in Spoken Language Technologies for Under-Resourced Languages, 2014.

[7] K. Bhuvanagiri and Sunil Kopparapu, "An approach to mixed language automatic speech recognition," Oriental COCOSDA, Kathmandu, Nepal, 2010.

[8] Dau-Cheng Lyu, Ren-Yuan Lyu, Yuang-Chin Chiang, and Chun-Nan Hsu, "Speech recognition on code-switching among the chinese dialects," in 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), Toulouse, France, May 14-19, 2006, pp. 1105–1108.

[9] Jochen Weiner, Ngoc Thang Vu, Dominic Telaar, Florian Metze, Tanja Schultz, Dau-Cheng Lyu, Eng-Siong Chng, and Haizhou Li, "Integration of language identification into a recognition system for spoken conversations containing code-switches," in Spoken Language Technologies for Under-Resourced Languages, 2012.

[10] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., "Deep speech 2: End-to-end speech recognition in english and mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.

[11] Hiroshi Seki, Shinji Watanabe, Takaaki Hori, Jonathan Le Roux, and John R. Hershey, "An end-to-end language-tracking speech recognizer for mixed-language speech," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, AB, Canada, April 15-20, 2018, pp. 4919–4923.

[12] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.

[13] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint ctc-attention based end-to-end speech recognition using multi-task learning," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4835–4839.

[14] Rohit Prabhavalkar, Kanishka Rao, Tara N. Sainath, Bo Li, Leif Johnson, and Navdeep Jaitly, "A comparison of sequence-to-sequence models for speech recognition," in Proc. Interspeech, 2017, pp. 939–943.

[15] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.

[16] William Chan and Ian Lane, "On online attention-based speech recognition and joint mandarin character-pinyin training," in INTERSPEECH, 2016, pp. 3404–3408.

[17] Dau-Cheng Lyu, Tien Ping Tan, Chng Eng Siong, and Haizhou Li, "Seame: a mandarin-english code-switching speech corpus in south-east asia," in INTERSPEECH, 2010.

[18] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in ICASSP, 2016.

[19] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.

[20] Wei Zou, Dongwei Jiang, Shuaijiang Zhao, and Xiangang Li, "A comparable study of modeling units for end-to-end mandarin speech recognition," arXiv preprint arXiv:1805.03832, 2018.

[21] Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu, "A comparison of modeling units in sequence-to-sequence speech recognition with the transformer on mandarin chinese," arXiv preprint arXiv:1805.06239, 2018.

[22] Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar, "Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer," in Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 2017, pp. 193–199.

[23] Thomas Zenkel, Ramon Sanabria, Florian Metze, and Alex Waibel, "Subword and crossword units for ctc acoustic models," arXiv preprint arXiv:1712.06855, 2017.

[24] Rico Sennrich, Barry Haddow, and Alexandra Birch, "Neural machine translation of rare words with subword units," arXiv preprint arXiv:1508.07909, 2015.

[25] Takaaki Hori, Shinji Watanabe, and John R. Hershey, "Joint ctc/attention decoding for end-to-end speech recognition," in ACL, 2017.

[26] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Katya Gonina, et al., "State-of-the-art speech recognition with sequence-to-sequence models," arXiv preprint arXiv:1712.01769, 2017.

[27] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, "End-to-end attention-based large vocabulary speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4945–4949.

[28] Jan Chorowski and Navdeep Jaitly, "Towards better decoding and language model integration in sequence to sequence models," arXiv preprint arXiv:1612.02695, 2016.

[29] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.

[30] Taku Kudo and John Richardson, "Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," CoRR, vol. abs/1808.06226, 2018.