Conformer

Uploaded by

Sandeep Panchal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

61 views5 pages

Conformer

Uploaded by

Sandeep Panchal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 5

Conformer: Convolution-augmented Transformer for Speech Recognition

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo
Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang

Google Inc.
{anmolgulati, jamesqin, chungchengc, nikip, ngyuzh, jiahuiyu, weihan, shibow, zhangzd,
yonghui, rpang}@google.com

Abstract 40 ms rate
Layernorm

Recently Transformer and Convolution neural network (CNN) Conformer Blocks x N

based models have shown promising results in Automatic +
1/2 x
Speech Recognition (ASR), outperforming Recurrent neural
arXiv:2005.08100v1 [eess.AS] 16 May 2020

networks (RNNs). Transformer models are good at captur- Feed Forward Module
Dropout
ing content-based global interactions, while CNNs exploit lo-
cal features effectively. In this work, we achieve the best of 40 ms rate
+
both worlds by studying how to combine convolution neural Linear
networks and transformers to model both local and global de- Convolution Module

pendencies of an audio sequence in a parameter-efficient way. 40 ms rate

To this regard, we propose the convolution-augmented trans- Convolution +

Subsampling
former for speech recognition, named Conformer. Conformer
Multi-Head Self Attention
significantly outperforms the previous Transformer and CNN 10 ms rate Module
based models achieving state-of-the-art accuracies. On the
SpecAug
widely used LibriSpeech benchmark, our model achieves WER
+
of 2.1%/4.3% without using a language model and 1.9%/3.9% 10 ms rate
1/2 x
with an external language model on test/testother. We also
Feed Forward Module
observe competitive performance of 2.7%/6.3% with a small
model of only 10M parameters.
Index Terms: speech recognition, attention, convolutional neu-
ral networks, transformer, end-to-end Figure 1: Conformer encoder model architecture. Conformer
comprises of two macaron-like feed-forward layers with half-
1. Introduction step residual connections sandwiching the multi-headed self-
attention and convolution modules. This is followed by a post
End-to-end automatic speech recognition (ASR) systems based
layernorm.
on neural networks have seen large improvements in recent
years. Recurrent neural networks (RNNs) have been the de-
facto choice for ASR [1, 2, 3, 4] as they can model the temporal
dependencies in the audio sequences effectively [5]. Recently, self-attention improves over using them individually [14]. To-
the Transformer architecture based on self-attention [6, 7] has gether, they are able to learn both position-wise local features,
enjoyed widespread adoption for modeling sequences due to its and use content-based global interactions. Concurrently, papers
ability to capture long distance interactions and the high train- like [15, 16] have augmented self-attention with relative posi-
ing efficiency. Alternatively, convolutions have also been suc- tion based information that maintains equivariance. Wu et al.
cessful for ASR [8, 9, 10, 11, 12], which capture local context [17] proposed a multi-branch architecture with splitting the in-
progressively via a local receptive field layer by layer. put into two branches: self-attention and convolution; and con-
However, models with self-attention or convolutions each catenating their outputs. Their work targeted mobile applica-
has its limitations. While Transformers are good at modeling tions and showed improvements in machine translation tasks.
long-range global context, they are less capable to extract fine- In this work, we study how to organically combine con-
grained local feature patterns. Convolution neural networks volutions with self-attention in ASR models. We hypothesize
(CNNs), on the other hand, exploit local information and are that both global and local interactions are important for being
used as the de-facto computational block in vision. They learn parameter efficient. To achieve this, we propose a novel combi-
shared position-based kernels over a local window which main- nation of self-attention and convolution will achieve the best of
tain translation equivariance and are able to capture features like both worlds – self-attention learns the global interaction whilst
edges and shapes. One limitation of using local connectivity is the convolutions efficiently capture the relative-offset-based lo-
that you need many more layers or parameters to capture global cal correlations. Inspired by Wu et al. [17, 18], we introduce
information. To combat this issue, contemporary work Con- a novel combination of self-attention and convolution, sand-
textNet [10] adopts the squeeze-and-excitation module [13] in wiched between a pair feed forward modules, as illustrated in
each residual block to capture longer context. However, it is still Fig 1.
limited in capturing dynamic global context as it only applies a Our proposed model, named Conformer, achieves state-of-
global averaging over the entire sequence. the-art results on LibriSpeech, outperforming the previous best
Recent works have shown that combining convolution and published Transformer Transducer [7] by 15% relative improve-
1D
Pointwise Glu Swish Pointwise +
Layernorm Depthwise BatchNorm Dropout
Conv Activation Activation Conv
Conv

Figure 2: Convolution module. The convolution module contains a pointwise convolution with an expansion factor of 2 projecting the
number of channels with a GLU activation layer, followed by a 1-D Depthwise convolution. The 1-D depthwise conv is followed by a
Batchnorm and then a swish activation layer.

ment on the testother dataset with an external language model. 2.2. Convolution Module
We present three models based on model parameter limit con-
Inspired by [17], the convolution module starts with a gating
straints of 10M , 30M and 118M. Our 10M model shows an im-
mechanism [23]—a pointwise convolution and a gated linear
provement when compared to similar sized contemporary work
unit (GLU). This is followed by a single 1-D depthwise convo-
[10] with 2.7%/6.3% on test/testother datasets. Our medium
lution layer. Batchnorm is deployed just after the convolution
30M parameters-sized model already outperforms transformer
to aid training deep models. Figure 2 illustrates the convolution
transducer published in [7] which uses 139M model parameters.
block.
With the big 118M parameter model, we are able to achieve
2.1%/4.3% without using language models and 1.9%/3.9% with
2.3. Feed Forward Module
an external language model.
We further carefully study the effects of the number of at- The Transformer architecture as proposed in [6] deploys a feed
tention heads, convolution kernel sizes, activation functions, forward module after the MHSA layer and is composed of two
placement of feed-forward layers, and different strategies of linear transformations and a nonlinear activation in between. A
adding convolution modules to a Transformer-based network, residual connection is added over the feed-forward layers, fol-
and shed light on how each contributes to the accuracy improve- lowed by layer normalization. This structure is also adopted by
ments. Transformer ASR models [7, 24].
We follow pre-norm residual units [21, 22] and apply layer
normalization within the residual unit and on the input before
2. Conformer Encoder the first linear layer. We also apply Swish activation [25] and
dropout, which helps regularizing the network. Figure 4 illus-
Our audio encoder first processes the input with a convolution trates the Feed Forward (FFN) module.
subsampling layer and then with a number of conformer blocks,
as illustrated in Figure 1. The distinctive feature of our model is 2.4. Conformer Block
the use of Conformer blocks in the place of Transformer blocks
Our proposed Conformer block contains two Feed Forward
as in [7, 19].
modules sandwiching the Multi-Headed Self-Attention module
A conformer block is composed of four modules stacked and the Convolution module, as shown in Figure 1.
together, i.e, a feed-forward module, a self-attention module, This sandwich structure is inspired by Macaron-Net [18],
a convolution module, and a second feed-forward module in which proposes replacing the original feed-forward layer in the
the end. Sections 2.1, 1, and 2.3 introduce the self-attention, Transformer block into two half-step feed-forward layers, one
convolution, and feed-forward modules, respectively. Finally, before the attention layer and one after. As in Macron-Net, we
2.4 describes how these sub blocks are combined. employ half-step residual weights in our feed-forward (FFN)
modules. The second feed-forward module is followed by a
2.1. Multi-Headed Self-Attention Module final layernorm layer. Mathematically, this means, for input xi
to a Conformer block i, the output yi of the block is:
We employ multi-headed self-attention (MHSA) while integrat- 1
ing an important technique from Transformer-XL [20], the rel- x˜i = xi + FFN(xi )
2
ative sinusoidal positional encoding scheme. The relative po- 0
xi = x˜i + MHSA(x˜i )
sitional encoding allows the self-attention module to general- (1)
ize better on different input length and the resulting encoder is x00i = x0i + Conv(x0i )
more robust to the variance of the utterance length. We use pre- 1
norm residual units [21, 22] with dropout which helps training yi = Layernorm(x00i + FFN(x00i ))
2
and regularizing deeper models. Figure 3 below illustrates the
multi-headed self-attention block. where FFN refers to the Feed forward module, MHSA refers to
the Multi-Head Self-Attention module, and Conv refers to the
Convolution module as described in the preceding sections.
Our ablation study discussed in Sec 3.4.3 compares the
Multi-Head Attention with Macaron-style half-step FFNs with the vanilla FFN as used in
Layernorm Relative Positional Dropout +
Embedding previous works. We find that having two Macaron-net style
feed-forward layers with half-step residual connections sand-
Figure 3: Multi-Headed self-attention module. We use multi- wiching the attention and convolution modules in between pro-
headed self-attention with relative positional embedding in a vides a significant improvement over having a single feed-
pre-norm residual unit. forward module in our Conformer architecture.
The combination of convolution and self-attention has been
studied before and one can imagine many ways to achieve
Linear Swish Linear
Layernorm Dropout Dropout +
Layer Activation Layer

Figure 4: Feed forward module. The first linear layer uses an expansion factor of 4 and the second linear layer projects it back to the
model dimension. We use swish activation and a pre-norm residual units in feed forward module.

that. Different options of augmenting convolutions with self- Table 1: Model hyper-parameters for Conformer S, M, and L
attention are studied in Sec 3.4.2. We found that convolution models, found via sweeping different combinations and choos-
module stacked after the self-attention module works best for ing the best performing models within the parameter limits.
speech recognition.
Conformer Conformer Conformer
Model
(S) (M) (L)
3. Experiments
3.1. Data Num Params (M) 10.3 30.7 118.8
Encoder Layers 16 16 17
We evaluate the proposed model on the LibriSpeech [26] Encoder Dim 144 256 512
dataset, which consists of 970 hours of labeled speech and Attention Heads 4 4 8
an additional 800M word token text-only corpus for building Conv Kernel Size 32 32 32
language model. We extracted 80-channel filterbanks features Decoder Layers 1 1 1
computed from a 25ms window with a stride of 10ms. We use Decoder Dim 320 640 640
SpecAugment [27, 28] with mask parameter (F = 27), and ten
time masks with maximum time-mask ratio (pS = 0.05), where
the maximum-size of the time mask is set to pS times the length Table 2: Comparison of Conformer with recent published mod-
of the utterance. els. Our model shows improvements consistently over various
model parameter size constraints. At 10.3M parameters, our
3.2. Conformer Transducer model is 0.7% better on testother when compared to contempo-
rary work, ContextNet(S) [10]. At 30.7M model parameters our
We identify three models, small, medium and large, with 10M, model already significantly outperforms the previous published
30M, and 118M params, respectively, by sweeping different state of the art results of Transformer Transducer [7] with 139M
combinations of network depth, model dimensions, number of parameters.
attention heads and choosing the best performing one within
model parameter size constraints. We use a single-LSTM-layer Method #Params (M) WER Without LM WER With LM
decoder in all our models. Table 1 describes their architecture testclean testother testclean testother
hyper-parameters. Hybrid
For regularization, we apply dropout [29] in each residual Transformer [33] - - - 2.26 4.85
CTC
unit of the conformer, i.e, to the output of each module, before QuartzNet [9] 19 3.90 11.28 2.69 7.25
it is added to the module input. We use a rate of Pdrop = 0.1. LAS
Variational noise [5, 30] is introduced to the model as a regu- Transformer [34] 270 2.89 6.98 2.33 5.17
Transformer [19] - 2.2 5.6 2.6 5.7
larization. A `2 regularization with 1e − 6 weight is also added LSTM 360 2.6 6.0 2.2 5.2
to all the trainable weights in the network. We train the models Transducer
Transformer [7] 139 2.4 5.6 2.0 4.6
with the Adam optimizer [31] with β1 = 0.9, β2 = 0.98 and ContextNet(S) [10] 10.8 2.9 7.0 2.3 5.5
= 10−9 and a transformer learning rate schedule √ [6], with ContextNet(M) [10] 31.4 2.4 5.4 2.0 4.5
ContextNet(L) [10] 112.7 2.1 4.6 1.9 4.1
10k warm-up steps and peak learning rate 0.05/ d where d is
Conformer (Ours)
the model dimension in conformer encoder. Conformer(S) 10.3 2.7 6.3 2.1 5.0
We use a 3-layer LSTM language model (LM) with width Conformer(M) 30.7 2.3 5.0 2.0 4.3
4096 trained on the LibriSpeech langauge model corpus with Conformer(L) 118.8 2.1 4.3 1.9 3.9

the LibriSpeech960h transcripts added, tokenized with the 1k

WPM built from LibriSpeech 960h. The LM has word-level
perplexity 63.9 on the dev-set transcripts. The LM weight λ error rate among all the existing models. This clearly demon-
for shallow fusion is tuned on the dev-set via grid search. All strates the effectiveness of combining Transformer and convo-
models are implemented with Lingvo toolkit [32]. lution in a single neural network.

3.3. Results on LibriSpeech 3.4. Ablation Studies

Table 2 compares the (WER) result of our model on Lib- 3.4.1. Conformer Block vs. Transformer Block
riSpeech test-clean/test-other with a few state-of-the-art mod-
els include: ContextNet [10], Transformer transducer [7], and A Conformer block differs from a Transformer block in a
QuartzNet [9]. All our evaluation results round up to 1 digit number of ways, in particular, the inclusion of a convolution
after decimal point. block and having a pair of FFNs surrounding the block in the
Without a language model, the performance of our Macaron-style. Below we study these effects of these differ-
medium model already achieve competitive results of 2.3/5.0 ences by mutating a Conformer block towards a Transformer
on test/testother outperforming the best known Transformer, block, while keeping the total number of parameters unchanged.
LSTM based model, or a similar sized convolution model. With Table 3 shows the impact of each change to the Conformer
the language model added, our model achieves the lowest word block. Among all differences, convolution sub-block is the most
important feature, while having a Macaron-style FFN pair is Table 5: Ablation study of Macaron-net Feed Forward mod-
also more effective than a single FFN of the same number of ules. Ablating the differences between the Conformer feed for-
parameters. Using swish activations led to faster convergence ward module with that of a single FFN used in Transformer
in the Conformer models. models: (1) Conformer; (2) Conformer with full-step residuals
in Feed forward modules; (3) replacing the Macaron-style FFN
Table 3: Disentangling Conformer. Starting from a Conformer pair with a single FFN.
block, we remove its features and move towards a vanilla Trans-
former block: (1) replacing SWISH with ReLU; (2) remov- Model dev dev test test
ing the convolution sub-block; (3) replacing the Macaron-style Architecture clean other clean other
FFN pairs with a single FFN; (4) replacing self-attention with Conformer 1.9 4.4 2.1 4.3
relative positional embedding [20] with a vanilla self-attention Single FFN 1.9 4.5 2.1 4.5
layer [6]. All ablation study results are evaluated without the Full step residuals 1.9 4.5 2.1 4.5
external LM.

Model dev dev test test

Architecture clean other clean other 3.4.4. Number of Attention Heads
Conformer Model 1.9 4.4 2.1 4.3 In self-attention, each attention head learns to focus on different
– SWISH + ReLU 1.9 4.4 2.0 4.5 parts of the input, making it possible to improve predictions
– Convolution Block 2.1 4.8 2.1 4.9 beyond the simple weighted average. We perform experiments
– Macaron FFN 2.1 5.1 2.1 5.0 to study the effect of varying the number of attention heads from
– Relative Pos. Emb. 2.3 5.8 2.4 5.6 4 to 32 in our large model, using the same number of heads
in all layers. We find that increasing attention heads up to 16
3.4.2. Combinations of Convolution and Transformer Modules improves the accuracy, especially over the devother datasets, as
shown in Table 6.
We study the effects of various different ways of combining the
multi-headed self-attention (MHSA) module with the convolu-
Table 6: Ablation study on the attention heads in multi-headed
tion module. First, we try replacing the depthwise convolution
self attention.
in the convolution module with a lightweight convolution [35],
see a significant drop in the performance especially on the dev- Attention Dim per dev dev test test
other dataset. Second, we study placing the convolution mod- Heads Head clean other clean other
ule before the MHSA module in our Conformer model and find 4 128 1.9 4.6 2.0 4.5
that it degrades the results by 0.1 on dev-other. Another pos- 8 64 1.9 4.4 2.1 4.3
sible way of the architecture is to split the input into parallel 16 32 2.0 4.3 2.2 4.4
branches of multi-headed self attention module and a convolu- 32 16 1.9 4.4 2.1 4.5
tion module with their output concatenated as suggested in [17].
We found that this worsens the performance when compared to 3.4.5. Convolution Kernel Sizes
our proposed architecture.
These results in Table 4 suggest the advantage of placing To study the effect of kernel sizes in the depthwise convolu-
the convolution module after the self-attention module in the tion, we sweep the kernel size in {3, 7, 17, 32, 65} of the large
Conformer block. model, using the same kernel size for all layers. We find that the
performance improves with larger kernel sizes till kernel sizes
Table 4: Ablation study of Conformer Attention Convolution 17 and 32 but worsens in the case of kernel size 65, as show
Blocks. Varying the combination of the convolution block with in Table 7. On comparing the second decimal in dev WER, we
the multi-headed self attention: (1) Conformer architecture; (2) find kernel size 32 to perform better than rest.
Using Lightweight convolutions instead of depthwise convolu-
tion in the convolution block in Conformer; (3) Convolution be- Table 7: Ablation study on depthwise convolution kernel sizes.
fore multi-headed self attention; (4) Convolution and MHSA in
Kernel dev dev test test
parallel with their output concatenated [17].
size clean other clean other
dev dev 3 1.88 4.41 1.99 4.39
Model Architecture 7 1.88 4.30 2.02 4.44
clean other
Conformer 1.9 4.4 17 1.87 4.31 2.04 4.38
– Depthwise conv + Lightweight convolution 2.0 4.8 32 1.83 4.30 2.03 4.29
Convolution block before MHSA 1.9 4.5 65 1.89 4.47 1.98 4.46
Parallel MHSA and Convolution 2.0 4.9
4. Conclusion
3.4.3. Macaron Feed Forward Modules
In this work, we introduced Conformer, an architecture that
Instead of a single feed-forward module (FFN) post the atten- integrates components from CNNs and Transformers for end-
tion blocks as in the Transformer models, the Conformer block to-end speech recognition. We studied the importance of each
has a pair of macaron-like Feed forward modules sandwiching component, and demonstrated that the inclusion of convolution
the self-attention and convolution modules. Further, the Con- modules is critical to the performance of the Conformer model.
former feed forward modules are used with half-step residuals. The model exhibits better accuracy with fewer parameters than
Table 5 shows the impact of changing the Conformer block to previous work on the LibriSpeech dataset, and achieves a new
use a single FFN or full-step residuals. state-of-the-art performance at 1.9%/3.9% for test/testother.
5. References [18] Y. Lu, Z. Li, D. He, Z. Sun, B. Dong, T. Qin, L. Wang, and
T.-Y. Liu, “Understanding and improving transformer from a
[1] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, multi-particle dynamic system point of view,” arXiv preprint
Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State- arXiv:1906.02762, 2019.
of-the-art speech recognition with sequence-to-sequence models,”
in 2018 IEEE International Conference on Acoustics, Speech and [19] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang,
Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778. M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang et al., “A
comparative study on transformer vs rnn in speech applications,”
[2] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring architectures, arXiv preprint arXiv:1909.06317, 2019.
data and units for streaming end-to-end speech recognition with
rnn-transducer,” in 2017 IEEE Automatic Speech Recognition and [20] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhut-
Understanding Workshop (ASRU). IEEE, 2017, pp. 193–199. dinov, “Transformer-xl: Attentive language models beyond a
fixed-length context,” 2019.
[3] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez,
D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, [21] Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, and L. S. Chao,
D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, “Learning deep transformer models for machine translation,” in
S.-Y. Chang, K. Rao, and A. Gruenstein, “Streaming End-to-end Proceedings of the 57th Annual Meeting of the Association for
Speech Recognition For Mobile Devices,” in Proc. ICASSP, 2019. Computational Linguistics. Association for Computational Lin-
guistics, Jul. 2019, pp. 1810–1822.
[4] T. N. Sainath, Y. He, B. Li, A. Narayanan, R. Pang, A. Bruguier,
S.-y. Chang, W. Li, R. Alvarez, Z. Chen, and et al., “A streaming [22] T. Q. Nguyen and J. Salazar, “Transformers without tears:
on-device end-to-end model surpassing server-side conventional Improving the normalization of self-attention,” arXiv preprint
model quality and latency,” in ICASSP, 2020. arXiv:1910.05895, 2019.
[5] A. Graves, “Sequence transduction with recurrent neural net- [23] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language
works,” arXiv preprint arXiv:1211.3711, 2012. modeling with gated convolutional networks,” in Proceedings of
the 34th International Conference on Machine Learning-Volume
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. 70. JMLR. org, 2017, pp. 933–941.
Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,”
2017. [24] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence
sequence-to-sequence model for speech recognition,” in 2018
[7] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and IEEE International Conference on Acoustics, Speech and Signal
S. Kumar, “Transformer transducer: A streamable speech recog- Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
nition model with transformer encoders and rnn-t loss,” in ICASSP
2020-2020 IEEE International Conference on Acoustics, Speech [25] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activa-
and Signal Processing (ICASSP). IEEE, 2020, pp. 7829–7833. tion functions,” arXiv preprint arXiv:1710.05941, 2017.
[8] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Co- [26] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib-
hen, H. Nguyen, and R. T. Gadde, “Jasper: An end-to-end convo- rispeech: an asr corpus based on public domain audio books,”
lutional neural acoustic model,” arXiv preprint arXiv:1904.03288, in 2015 IEEE International Conference on Acoustics, Speech and
2019. Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[9] S. Kriman, S. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, [27] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D.
V. Lavrukhin, R. Leary, J. Li, and Y. Zhang, “Quartznet: Deep Cubuk, and Q. V. Le, “Specaugment: A simple data augmen-
automatic speech recognition with 1d time-channel separable con- tation method for automatic speech recognition,” arXiv preprint
volutions,” arXiv preprint arXiv:1910.10261, 2019. arXiv:1904.08779, 2019.
[10] W. Han, Z. Zhang, Y. Zhang, J. Yu, C.-C. Chiu, J. Qin, A. Gulati, [28] D. S. Park, Y. Zhang, C.-C. Chiu, Y. Chen, B. Li, W. Chan, Q. V.
R. Pang, and Y. Wu, “Contextnet: Improving convolutional neural Le, and Y. Wu, “Specaugment on large scale datasets,” arXiv
networks for automatic speech recognition with global context,” preprint arXiv:1912.05533, 2019.
arXiv preprint arXiv:2005.03191, 2020. [29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
[11] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhad- R. Salakhutdinov, “Dropout: A simple way to prevent neural net-
ran, “Deep convolutional neural networks for lvcsr,” in 2013 IEEE works from overfitting,” Journal of Machine Learning Research,
international conference on acoustics, speech and signal process- vol. 15, no. 56, pp. 1929–1958, 2014.
ing. IEEE, 2013, pp. 8614–8618. [30] K.-C. Jim, C. L. Giles, and B. G. Horne, “An analysis of noise
[12] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, in recurrent neural networks: convergence and generalization,”
and D. Yu, “Convolutional neural networks for speech recogni- IEEE Transactions on neural networks, vol. 7, no. 6, pp. 1424–
tion,” IEEE/ACM Transactions on audio, speech, and language 1438, 1996.
processing, vol. 22, no. 10, pp. 1533–1545, 2014. [31] D. P. Kingma and J. Ba, “Adam: A method for stochastic opti-
[13] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” mization,” arXiv preprint arXiv:1412.6980, 2014.
in Proceedings of the IEEE conference on computer vision and [32] J. Shen, P. Nguyen, Y. Wu, Z. Chen, and et al., “Lingvo: a modu-
pattern recognition, 2018, pp. 7132–7141. lar and scalable framework for sequence-to-sequence modeling,”
[14] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, “Attention 2019.
augmented convolutional networks,” in Proceedings of the IEEE [33] Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar,
International Conference on Computer Vision, 2019, pp. 3286– H. Huang, A. Tjandra, X. Zhang, F. Zhang et al., “Transformer-
3295. based acoustic modeling for hybrid speech recognition,” arXiv
[15] B. Yang, L. Wang, D. Wong, L. S. Chao, and Z. Tu, “Convolu- preprint arXiv:1910.09799, 2019.
tional self-attention networks,” arXiv preprint arXiv:1904.03107, [34] G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave,
2019. V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert, “End-to-
[16] A. W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen, M. Norouzi, end asr: from supervised to semi-supervised learning with modern
and Q. V. Le, “Qanet: Combining local convolution with architectures,” 2019.
global self-attention for reading comprehension,” arXiv preprint [35] F. Wu, A. Fan, A. Baevski, Y. N. Dauphin, and M. Auli, “Pay
arXiv:1804.09541, 2018. less attention with lightweight and dynamic convolutions,” arXiv
preprint arXiv:1901.10430, 2019.
[17] Z. Wu, Z. Liu, J. Lin, Y. Lin, and S. Han, “Lite transformer with
long-short range attention,” arXiv preprint arXiv:2004.11886,
2020.

Attention Is All You Need
No ratings yet
Attention Is All You Need
15 pages
Transformer Network For Video To Text Translation
No ratings yet
Transformer Network For Video To Text Translation
6 pages
Deeplearning - Ai Deeplearning - Ai
No ratings yet
Deeplearning - Ai Deeplearning - Ai
58 pages
15 - NEW 2020 ATTENTION ENC DEC TRANSFORMERS Lect15
No ratings yet
15 - NEW 2020 ATTENTION ENC DEC TRANSFORMERS Lect15
50 pages
AE556 2024 Topic7 Transformer
No ratings yet
AE556 2024 Topic7 Transformer
49 pages
English Translation of A Birth Certificate From Honduras PDF
78% (9)
English Translation of A Birth Certificate From Honduras PDF
1 page
Attn Is All You Need
No ratings yet
Attn Is All You Need
15 pages
ScalableAI Transformers
No ratings yet
ScalableAI Transformers
131 pages
5 Attention
No ratings yet
5 Attention
50 pages
Transformers Vs RNNS: Deeplearning - Ai
No ratings yet
Transformers Vs RNNS: Deeplearning - Ai
57 pages
Template Master USDB
No ratings yet
Template Master USDB
53 pages
Attention: Sharad Jones
No ratings yet
Attention: Sharad Jones
25 pages
Multi-View Self-Supervised Learning and Multi-Scale Feature Fusion For Automatic Speech Recognition
No ratings yet
Multi-View Self-Supervised Learning and Multi-Scale Feature Fusion For Automatic Speech Recognition
20 pages
An Energy-Efficient Transformer Processor Exploiting Dynamic Weak Relevances in Global Attention
No ratings yet
An Energy-Efficient Transformer Processor Exploiting Dynamic Weak Relevances in Global Attention
16 pages
Computer Vision 11 Transformers
No ratings yet
Computer Vision 11 Transformers
63 pages
Unlocking Linguistic Intelligence - Attention Mechanisms and Transformer Architectures in NLP
No ratings yet
Unlocking Linguistic Intelligence - Attention Mechanisms and Transformer Architectures in NLP
117 pages
Correia Et Al. - 2019 - Adaptively Sparse Transformers
No ratings yet
Correia Et Al. - 2019 - Adaptively Sparse Transformers
11 pages
Conformer
No ratings yet
Conformer
5 pages
Lec 7 Trans (Decoder) +ViT
No ratings yet
Lec 7 Trans (Decoder) +ViT
20 pages
Generative AI
No ratings yet
Generative AI
54 pages
2015 Attention Based Models For Speech Recognition Paper
No ratings yet
2015 Attention Based Models For Speech Recognition Paper
9 pages
Attention - Attention! - Lil'Log
No ratings yet
Attention - Attention! - Lil'Log
23 pages
End-to-End Automatic Speech Recognition
No ratings yet
End-to-End Automatic Speech Recognition
19 pages
2024 Transformer Master
No ratings yet
2024 Transformer Master
50 pages
Pervasive Attention 2D Convolutional Neural Networks For Sequence-to-Sequence Prediction
No ratings yet
Pervasive Attention 2D Convolutional Neural Networks For Sequence-to-Sequence Prediction
11 pages
ASR - VLSP 2021: Conformer With Gradient Mask and Stochastic Weight Averaging For Vietnamese Automatic Speech Recognition
No ratings yet
ASR - VLSP 2021: Conformer With Gradient Mask and Stochastic Weight Averaging For Vietnamese Automatic Speech Recognition
7 pages
Deep Neural Network Module 7 Attention Transformer
No ratings yet
Deep Neural Network Module 7 Attention Transformer
40 pages
JAILBREAKER-Automated Jailbreak Across Multiple Large Language Model Chatbots-2023 7
100% (2)
JAILBREAKER-Automated Jailbreak Across Multiple Large Language Model Chatbots-2023 7
15 pages
Hyb Conformer
No ratings yet
Hyb Conformer
5 pages
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
No ratings yet
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
10 pages
Pappagari, Paper 8
No ratings yet
Pappagari, Paper 8
8 pages
Lesson 4: Attention Is All You Need Encoder and Decoder Processes
No ratings yet
Lesson 4: Attention Is All You Need Encoder and Decoder Processes
5 pages
TRILLsson Distilled Universal Paralinguistic Speec
No ratings yet
TRILLsson Distilled Universal Paralinguistic Speec
6 pages
3MYou Et Al - 2022
No ratings yet
3MYou Et Al - 2022
5 pages
Transformer-Transducer End-to-End Speech Recognition With Self-Attention
No ratings yet
Transformer-Transducer End-to-End Speech Recognition With Self-Attention
5 pages
Multimodal Grounding For Sequence-To-Sequence Speech Recognition
No ratings yet
Multimodal Grounding For Sequence-To-Sequence Speech Recognition
5 pages
Lecture Notes - Advanced Language Model - BERT, GPT
No ratings yet
Lecture Notes - Advanced Language Model - BERT, GPT
24 pages
GetResponse ELP NehaShah
100% (2)
GetResponse ELP NehaShah
14 pages
Full Single-Type Deep Learning Models With Multihead Attention For Speech Enhancement
No ratings yet
Full Single-Type Deep Learning Models With Multihead Attention For Speech Enhancement
3 pages
A Lightweight CNN-Conformer Model For Automatic Speaker Verification
No ratings yet
A Lightweight CNN-Conformer Model For Automatic Speaker Verification
5 pages
How To Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not
No ratings yet
How To Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not
5 pages
Cmgan
No ratings yet
Cmgan
5 pages
KKDAT Form 1
No ratings yet
KKDAT Form 1
2 pages
Transformer 1803.02155
No ratings yet
Transformer 1803.02155
5 pages
Modality Adaptation For End-to-End Speech-to-Text Translation
No ratings yet
Modality Adaptation For End-to-End Speech-to-Text Translation
5 pages
Transformers
No ratings yet
Transformers
15 pages
Deep Audio-Visual Speech Recognition
No ratings yet
Deep Audio-Visual Speech Recognition
13 pages
Connectionist Temporal Classification
No ratings yet
Connectionist Temporal Classification
6 pages
PHYHOME - FTTH PON Series
No ratings yet
PHYHOME - FTTH PON Series
37 pages
Transformers 1
No ratings yet
Transformers 1
6 pages
INS12 Hardware Description (01) (PDF) - EN
No ratings yet
INS12 Hardware Description (01) (PDF) - EN
8 pages
Attentive Visual Semantic Specialized Network For Video Captioning
No ratings yet
Attentive Visual Semantic Specialized Network For Video Captioning
8 pages
TUV Certificate - HC900 Safety
No ratings yet
TUV Certificate - HC900 Safety
1 page
Transformer Architecture
No ratings yet
Transformer Architecture
18 pages
Attention Is All You Need: Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit
No ratings yet
Attention Is All You Need: Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit
15 pages
Transformer
No ratings yet
Transformer
10 pages
495 Lecture 10 Attall
No ratings yet
495 Lecture 10 Attall
18 pages
Attention
No ratings yet
Attention
15 pages
Transformer Concepts
No ratings yet
Transformer Concepts
8 pages
Attention Is All You Need
No ratings yet
Attention Is All You Need
15 pages
Multi-Modal Hierarchical Attention-Based Dense Video Captioning
No ratings yet
Multi-Modal Hierarchical Attention-Based Dense Video Captioning
5 pages
A1
No ratings yet
A1
11 pages
Notes 2 Transformer Model Architecture
No ratings yet
Notes 2 Transformer Model Architecture
4 pages
Business Communication Skills UNIT 1
No ratings yet
Business Communication Skills UNIT 1
23 pages
Module 7
No ratings yet
Module 7
15 pages
Example File
No ratings yet
Example File
3 pages
Attention Is All You Need
No ratings yet
Attention Is All You Need
4 pages
BDP and CapDev Format Sample
No ratings yet
BDP and CapDev Format Sample
17 pages
Speidel Braumeister Brochure
No ratings yet
Speidel Braumeister Brochure
56 pages
Separation of The Common-Mode - and Differential-Mode-Conducted EM1 Noise
No ratings yet
Separation of The Common-Mode - and Differential-Mode-Conducted EM1 Noise
9 pages
Ce6306 Strength of Materials Ii/Iii Mechanical Engineering
No ratings yet
Ce6306 Strength of Materials Ii/Iii Mechanical Engineering
29 pages
Re Wiring Feeder
100% (1)
Re Wiring Feeder
31 pages
Syamali Ray
No ratings yet
Syamali Ray
2 pages
KRA-KYC Text File Download Structure - Ver2.2
No ratings yet
KRA-KYC Text File Download Structure - Ver2.2
49 pages
Catalogue of Unbalanced Chromosome Aberrations in Man 2nd Edition Albert Schinzel PDF Download
100% (3)
Catalogue of Unbalanced Chromosome Aberrations in Man 2nd Edition Albert Schinzel PDF Download
19 pages
Whats New in PLS-CADD Handout
No ratings yet
Whats New in PLS-CADD Handout
6 pages
Longest Palindromic Subsequence
No ratings yet
Longest Palindromic Subsequence
26 pages
CFD Approach For Wind Analysis of High Rise
No ratings yet
CFD Approach For Wind Analysis of High Rise
14 pages
Introduction To Computer
No ratings yet
Introduction To Computer
15 pages
Bank Management
100% (1)
Bank Management
19 pages
BasIc Structure of Computer
No ratings yet
BasIc Structure of Computer
19 pages
Hal 91-104
No ratings yet
Hal 91-104
14 pages
Mohammed Radwan CV PDF
No ratings yet
Mohammed Radwan CV PDF
6 pages
Effect of Speedometer Positioning: Distraction and Workload While Driving
No ratings yet
Effect of Speedometer Positioning: Distraction and Workload While Driving
6 pages
Creating A Standby Using RMAN Duplicate (RAC or Non RAC) (Doc ID 1617946.1)
No ratings yet
Creating A Standby Using RMAN Duplicate (RAC or Non RAC) (Doc ID 1617946.1)
11 pages
TKR-720 (N) /820 (N) : VHF/UHF Desktop Repeater
No ratings yet
TKR-720 (N) /820 (N) : VHF/UHF Desktop Repeater
2 pages
OOPS Project Proposal-3
No ratings yet
OOPS Project Proposal-3
3 pages
SEIR Catalog Flyer
No ratings yet
SEIR Catalog Flyer
2 pages
Natural Computing with Python: Learn to implement genetic and evolutionary algorithms to solve problems in a pythonic way
From Everand
Natural Computing with Python: Learn to implement genetic and evolutionary algorithms to solve problems in a pythonic way
Giancarlo Zaccone
No ratings yet
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
From Everand
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
Robert Johnson
No ratings yet

Conformer

Uploaded by

Conformer

Uploaded by

Conformer: Convolution-augmented Transformer for Speech Recognition

Recently Transformer and Convolution neural network (CNN) Conformer Blocks x N

pendencies of an audio sequence in a parameter-efficient way. 40 ms rate

To this regard, we propose the convolution-augmented trans- Convolution +

the LibriSpeech960h transcripts added, tokenized with the 1k

3.3. Results on LibriSpeech 3.4. Ablation Studies

Model dev dev test test

You might also like