Are you wearing a mask? Improving mask detection from speech using augmentation by cycle-consistent GANs

Nicolae-Cătălin Ristea¹, Radu Tudor Ionescu²
¹University Politehnica of Bucharest, Romania
²University of Bucharest, Romania
[email protected], [email protected]

Abstract
The task of detecting whether a person wears a face mask from speech is useful in modelling speech in forensic investigations, communication between surgeons or people protecting themselves against infectious diseases such as COVID-19. In this paper, we propose a novel data augmentation approach for mask detection from speech. Our approach is based on (i) training Generative Adversarial Networks (GANs) with cycle-consistency loss to translate unpaired utterances between two classes (with mask and without mask), and on (ii) generating new training utterances using the cycle-consistent GANs, assigning opposite labels to each translated utterance. Original and translated utterances are converted into spectrograms which are provided as input to a set of ResNet neural networks with various depths. The networks are combined into an ensemble through a Support Vector Machines (SVM) classifier. With this system, we participated in the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 Computational Paralinguistics Challenge, surpassing the baseline proposed by the organizers by 2.8%. Our data augmentation technique provided a performance boost of 0.9% on the private test set. Furthermore, we show that our data augmentation approach yields better results than other baseline and state-of-the-art augmentation methods.
Index Terms: mask detection, data augmentation, Generative Adversarial Networks, neural networks ensemble, ComParE.
1. Introduction
In this paper, we describe our system for the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 Computational Paralinguistics Challenge (ComParE) [1]. In MSC, the task is to determine if an utterance belongs to a person wearing a face mask or not. As noted by Schuller et al. [1], the task of detecting whether a speaker wears a face mask is useful in modelling speech in forensics or communication between surgeons. In the context of the COVID-19 pandemic, another potential application is to verify if people wear surgical masks.

We propose a system based on Support Vector Machines (SVM) [2] applied on top of feature embeddings concatenated from multiple ResNet [3] convolutional neural networks (CNNs). In order to improve our mask detection performance, we propose a novel data augmentation technique that is aimed at eliminating biases in the training data distribution. Our data augmentation method is based on (i) training Generative Adversarial Networks (GANs) with cycle-consistency loss [4, 5] for unpaired utterance-to-utterance translation between two classes (with mask and without mask), and on (ii) generating new training utterances using the cycle-consistent GANs, assigning opposite labels to each translated utterance.

While deep neural networks attain state-of-the-art results in various domains [3, 6, 7, 8, 9], such models can easily succumb to the pitfall of overfitting [10]. This means that deep models can take decisions based on various biases existing in the training data. A notorious example is an image of a wolf being correctly labeled, but only because of the snowy background [11]. In our case, the training samples belonging to one class may have a different gender and age distribution than the training samples belonging to the other class, among other unknown biases. Instead of finding relevant features to discriminate utterances with and without mask, a neural network might consider features for gender prediction or age estimation, which is undesired. With our data augmentation approach, all utterances with mask are translated to utterances without mask and the other way around, as shown in Figure 1. Any potential bias in the distribution of training data samples is eliminated through the compensation that comes with the augmented data samples from the opposite class. This forces the neural networks to discover features that discriminate the training data with respect to the desired task, i.e. classification into mask versus non-mask.

We conduct experiments on the Mask Augsburg Speech Corpus (MASC), showing that our data augmentation approach attains superior results in comparison to a set of baselines, e.g. noise perturbation and time shifting, and a set of state-of-the-art data augmentation techniques, e.g. speed perturbation [12], conditional GANs [13] and SpecAugment [14].
Figure 1: Our mask detection pipeline with data augmentation based on cycle-consistent GANs. Original training spectrograms are transferred from one class to the other using two generators, G and G'. Original and augmented spectrograms are further used to train an ensemble of ResNet models with depths ranging from 18 layers to 101 layers. Feature vectors from the penultimate layer of each ResNet are concatenated and provided as input to an SVM classifier, which makes the final prediction. Best viewed in color.

2. Related Work
Successful communication is an important component in performing effective tasks, e.g. consider doctors in surgery rooms. While communication is crucial, doctors often wear surgical masks, which could lead to less effective communication. Although surgical masks affect voice clarity, human listeners reported small effects on speech understanding [15]. Furthermore, there is limited research addressing the effects of different face covers on voice acoustic properties. The speaker recognition task was studied in the context of wearing a face cover [16, 17], but the results indicated a small accuracy degradation ratio. In addition, a negligible level of artifacts is introduced by surgical masks in automatic speech understanding [18].

To our knowledge, there are no previous works on mask detection from speech. We therefore consider augmentation methods for audio data as related work. The superior performance of deep neural networks relies heavily on large amounts of training data [19]. However, labeled data in many real-world applications is hard to collect. Therefore, data augmentation has been proposed as a method to generate additional training data, improving the generalization capacity of neural networks. As discussed in the recent survey of Wen et al. [20], a wide range of augmentation methods have been proposed for time series data, including speech-related tasks. A classic data augmentation method is to perturb a signal with noise in accordance with a desired signal-to-noise ratio (SNR). Other augmentation methods with proven results on speech recognition and related tasks are time shifting and speed perturbation [12]. While these data augmentation methods are applied on raw signals, some of the most recent techniques [13, 14] are applied on spectrograms. Representing audio signals through spectrograms goes hand in hand with the usage of CNNs or similar models on speech recognition tasks, perhaps due to their outstanding performance on image-related tasks. Park et al. [14] performed augmentation on the log mel spectrogram through time warping or by masking blocks of frequency channels and time steps. Their experiments showed that their technique, SpecAugment, prevents overfitting and improves performance on automatic speech recognition tasks. More closely related to our work, Chatziagapi et al. [13] proposed to augment the training data by generating new data samples using conditional GANs [21, 22]. Since conditional GANs generate new data samples following the training data distribution, unwanted and unknown distribution biases in the training data can only get amplified after augmentation. Unlike Chatziagapi et al. [13], we employ cycle-consistent GANs [4, 5], learning to transfer training data samples from one class to another while preserving other aspects. By transferring samples from one class to another, our data augmentation technique is able to level out any undesired distribution biases. Furthermore, we show in the experiments that our approach provides superior results.
3. Method
Data representation. CNNs attain state-of-the-art results in computer vision [3, 8], the convolutional operation being initially applied on images. In order to employ state-of-the-art CNNs for our task, we first transform each audio signal sample into an image-like representation. Therefore, we compute the discrete Short-Time Fourier Transform (STFT), as follows:

STFT\{x[n]\}(m, k) = \sum_{n=-\infty}^{\infty} x[n] \cdot w[n - mR] \cdot e^{-j \frac{2\pi}{N_x} k n},   (1)

where x[n] is the discrete input signal, w[n] is a window function (in our approach, Hamming), N_x is the STFT length and R is the hop (step) size [23]. Prior to the transformation, we scaled the raw audio signal, dividing it by its maximum. In the experiments, we used N_x = 1024, R = 64 and a window size of 512. We preserved the complex values (real and imaginary) of the STFT and kept only one side of the spectrum, considering that the spectrum is symmetric because the raw input signal is real. Finally, each utterance is represented as a spectrogram of 2×513×250 components, where 250 is the number of time bins.
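For clarity, a minimal PyTorch sketch of this spectrogram extraction is given below. The parameter values follow the description above, but the exact framing (and thus whether the time axis has exactly 250 bins) depends on the padding convention, which is not specified in the paper; the centering choice here is an assumption.

```python
import torch

def utterance_to_spectrogram(waveform: torch.Tensor) -> torch.Tensor:
    """Map a 1-second, 16 kHz utterance to a 2 x 513 x T spectrogram.

    The two channels hold the real and imaginary parts of the one-sided STFT,
    computed with N_x = 1024, hop R = 64 and a Hamming window of size 512,
    as described above.
    """
    # Scale the raw signal by its maximum absolute value.
    waveform = waveform / waveform.abs().max()
    # One-sided STFT with complex output: shape (513, T).
    stft = torch.stft(
        waveform,
        n_fft=1024,
        hop_length=64,
        win_length=512,
        window=torch.hamming_window(512),
        center=True,  # assumption: the padding convention is not given in the paper
        return_complex=True,
    )
    # Stack real and imaginary parts as two channels: shape (2, 513, T), T close to 250.
    return torch.stack([stft.real, stft.imag], dim=0)

# Example: a random 1-second utterance sampled at 16 kHz.
spec = utterance_to_spectrogram(torch.randn(16000))
print(spec.shape)  # torch.Size([2, 513, 251]) with center=True
```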

Learning framework. Our learning model is based on an ensemble of residual neural networks (ResNets) [3] that produce feature vectors which are subsequently joined together (concatenated) and given as input to an SVM classifier, as illustrated in Figure 1. We employ ResNets because residual connections alleviate vanishing or exploding gradient problems in training very deep neural models, providing alternative pathways for the gradients during back-propagation. We employed four ResNet models with depths ranging from 18 to 101 layers in order to generate embeddings with different levels of abstraction. In order to combine the ResNet models, we remove the Softmax classification layers and concatenate the feature vectors (activation maps) resulting from the last remaining layers. ResNet-18 and ResNet-34 provide feature vectors of 512 components, while ResNet-50 and ResNet-101 produce 2048-dimensional feature vectors. After concatenation, each utterance is represented by a feature vector of 5120 components. On top of the combined feature vectors, we train an SVM classifier. The SVM model [2] aims at finding a hyperplane separating the training samples by a maximum margin, while including a regularization term in the objective function, controlling the degree of data fitting through the number of support vectors. We validate the regularization parameter C on the development set. The SVM model relies on a kernel (similarity) function [24, 25] to embed the data in a Hilbert space, in which non-linear relations are transformed into linear relations. We hereby consider the Radial Basis Function (RBF) kernel defined as k_{RBF}(x, y) = e^{-\gamma \|x - y\|^2}, where x and y are two feature vectors and γ is a parameter that controls the range of possible output values.
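A condensed sketch of this ensemble is shown below, assuming the four ResNets have already been fine-tuned on the mask detection task (the training loop is omitted). The helper names and the use of torchvision and scikit-learn are illustrative assumptions; only the feature dimensions and the RBF-kernel SVM follow the description above.

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

def build_feature_extractors():
    """Four ResNets whose classification layers are replaced by identities, so
    that the forward pass returns the penultimate feature vectors
    (512 + 512 + 2048 + 2048 = 5120 components in total)."""
    nets = [models.resnet18(), models.resnet34(), models.resnet50(), models.resnet101()]
    for net in nets:
        net.fc = nn.Identity()  # drop the Softmax classification layer
        net.eval()
    return nets

@torch.no_grad()
def extract_features(nets, spectrograms):
    """Concatenate the penultimate-layer features of all ResNets."""
    feats = [net(spectrograms) for net in nets]
    return torch.cat(feats, dim=1)  # shape: (batch, 5120)

# Hypothetical usage with a toy batch of inputs and labels. A standard 3-channel
# input is used here so the stock torchvision models run as-is; the 2-channel
# adaptation used in the paper is sketched in the experiments section.
nets = build_feature_extractors()
batch = torch.randn(8, 3, 224, 224)
features = extract_features(nets, batch).numpy()
labels = [0, 1, 0, 1, 0, 1, 0, 1]
svm = SVC(kernel="rbf", gamma=1e-2, C=1e-3)  # gamma and C as validated in the paper
svm.fit(features, labels)
```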
Data augmentation. Our data augmentation method is inspired by the success of cycle-consistent GANs [4] in image-to-image translation for style transfer. Based on the assumption that style is easier to transfer than other aspects, e.g. geometrical changes, cycle-GANs can replace the style of an image with a different style, while keeping its content. In a similar way, we assume that cycle-GANs can transfer between utterances with and without mask, while preserving other aspects of the utterances, e.g. the spoken words are the same. We therefore propose to use cycle-GANs for utterance-to-utterance (spectrogram-to-spectrogram) transfer, as illustrated in Figure 2. The spectrogram x (with mask) is translated using the generator G into ŷ, to make it seem that ŷ was produced by a speaker not wearing a mask. The spectrogram ŷ is translated back to the original domain X through the generator F. The generator G is optimized to fool the discriminator D_Y, while the discriminator D_Y is optimized to separate generated samples without mask from real samples without mask, in an adversarial fashion. In addition, the GAN is optimized with respect to the reconstruction error computed between the original spectrogram x and the spectrogram x̂. Adding the reconstruction error to the overall loss function ensures the cycle-consistency.

Figure 2: Translating utterances (spectrograms) using cycle-consistent GANs. The spectrogram x (with mask) is translated using the generator G into ŷ (without mask). The spectrogram ŷ is translated back to the original domain X through the generator F. The generator G and the discriminator D_Y are optimized in an adversarial fashion, just as in any other GAN. In addition, the GAN is optimized with respect to the cycle-consistency loss between the original spectrogram x and the spectrogram x̂. Best viewed in color.

The complete loss function of a cycle-GAN [4] for spectrogram-to-spectrogram translation in both directions is:

L_{cycle-GAN}(G, F, D_X, D_Y, x, y) = L_{GAN}(G, D_Y, x, y) + L_{GAN}(F, D_X, x, y) + \lambda \cdot L_{cycle}(G, F, x, y),   (2)

where G and F are generators, D_X and D_Y are discriminators, x is a spectrogram from the mask class, y is a spectrogram from the non-mask class and λ is a parameter that controls the importance of cycle-consistency with respect to the two GAN losses. The first GAN loss is the least squares loss that corresponds to the translation from domain X (with mask) to domain Y (without mask):

L_{GAN}(G, D_Y, x, y) = E_{y \sim P_{data}(y)}[(D_Y(y))^2] + E_{x \sim P_{data}(x)}[(1 - D_Y(G(x)))^2],   (3)

where E[\cdot] is the expected value and P_{data}(\cdot) is the probability distribution of data samples. Analogously, the second GAN loss is the least squares loss that corresponds to the translation from domain Y (without mask) to domain X (with mask):

L_{GAN}(F, D_X, x, y) = E_{x \sim P_{data}(x)}[(D_X(x))^2] + E_{y \sim P_{data}(y)}[(1 - D_X(F(y)))^2].   (4)

The cycle-consistency loss in Equation (2) is defined as the sum of cycle-consistency losses for both translations:

L_{cycle}(G, F, x, y) = E_{x \sim P_{data}(x)}[\|F(G(x)) - x\|_1] + E_{y \sim P_{data}(y)}[\|G(F(y)) - y\|_1],   (5)

where \|\cdot\|_1 is the l_1 norm.

Although the cycle-GAN is trained to simultaneously transfer spectrograms in both directions, we observed that, in practice, the second generator F does not perform as well as the first generator G. We therefore use an independent cycle-GAN to transfer spectrograms without mask to spectrograms with mask. We denote the first generator of this cycle-GAN as G'. Upon training the two cycle-GANs, we keep only the generators G and G' for data augmentation. Hence, in the end, we are able to accurately transfer spectrograms both ways. By transferring spectrograms from one class to the other, we level out any undesired or unknown distribution biases in the training data.
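To make Equations (2) to (5) concrete, the following is a minimal PyTorch sketch of the combined objective for one batch, assuming the generators and discriminators are given as modules. It mirrors the least-squares and cycle-consistency terms exactly as written above; it is not the authors' training code, which relies on the official U-GAT-IT implementation (see Section 4).

```python
import torch

def cycle_gan_loss(G, F, D_X, D_Y, x, y, lam=10.0):
    """Combined loss of Equation (2) for a batch of spectrograms x (mask) and
    y (non-mask). The value of lambda is an assumption; the paper does not
    report the weight it used."""
    # Least-squares GAN loss for the X -> Y translation, Equation (3).
    loss_gan_g = (D_Y(y) ** 2).mean() + ((1 - D_Y(G(x))) ** 2).mean()
    # Least-squares GAN loss for the Y -> X translation, Equation (4).
    loss_gan_f = (D_X(x) ** 2).mean() + ((1 - D_X(F(y))) ** 2).mean()
    # Cycle-consistency loss, Equation (5): l1 reconstruction in both directions.
    loss_cycle = (F(G(x)) - x).abs().mean() + (G(F(y)) - y).abs().mean()
    return loss_gan_g + loss_gan_f + lam * loss_cycle
```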
In our experiments, we employ a more recent version of cycle-consistent GANs, termed U-GAT-IT [5]. Different from cycle-GAN [4], U-GAT-IT incorporates attention modules in both generators and discriminators, along with a new normalization function (Adaptive Layer-Instance Normalization), with the purpose of improving the translation from one domain to the other. The attention maps are produced by an auxiliary classifier, while the parameters of the normalization function are learned during training. Furthermore, the loss function used to optimize U-GAT-IT contains two losses in addition to those included in Equation (2). The first additional loss is the sum of identity losses ensuring that the amplitude distributions of input and output spectrograms are similar:

L_{identity}(G, F, x, y) = E_{y \sim P_{data}(y)}[\|G(y) - y\|_1] + E_{x \sim P_{data}(x)}[\|F(x) - x\|_1].   (6)

The second additional loss is the sum of the least squares losses that introduce the attention maps:

L_{CAM}(G, F, D_X, D_Y, x, y) = E_{y \sim P_{data}(y)}[(D_Y(y))^2] + E_{x \sim P_{data}(x)}[(1 - D_Y(G(x)))^2] + E_{x \sim P_{data}(x)}[(D_X(x))^2] + E_{y \sim P_{data}(y)}[(1 - D_X(F(y)))^2].   (7)
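A sketch of the two additional terms, transcribed directly from Equations (6) and (7), is given below. In the actual U-GAT-IT implementation these terms are computed with the auxiliary attention classifiers, so the plain discriminator calls here are a simplification.

```python
import torch

def identity_loss(G, F, x, y):
    """Equation (6): l1 identity losses keeping the amplitude distributions
    of input and output spectrograms similar."""
    return (G(y) - y).abs().mean() + (F(x) - x).abs().mean()

def cam_loss(G, F, D_X, D_Y, x, y):
    """Equation (7): least-squares terms that introduce the attention maps."""
    return ((D_Y(y) ** 2).mean() + ((1 - D_Y(G(x))) ** 2).mean()
            + (D_X(x) ** 2).mean() + ((1 - D_X(F(y))) ** 2).mean())
```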
4. Experiments
Data set. The data set provided by the ComParE organizers for MSC is the Mask Augsburg Speech Corpus (MASC). The data set is partitioned into a training set of 10,895 samples, a development set of 14,647 samples and a test set of 11,012 samples. It comprises recordings of 32 native German speakers, with or without wearing surgical masks. Each data sample (utterance) is a recording of 1 second at a sampling rate of 16 kHz.
Performance measure. The organizers decided to rank participants based on the unweighted average recall (UAR). We therefore report our performance in terms of this measure.
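The unweighted average recall is the recall averaged over the two classes with equal weight; a possible way to compute it with scikit-learn (an assumption on tooling, not mentioned in the paper) is:

```python
from sklearn.metrics import recall_score

# Unweighted average recall: mean of per-class recalls, i.e. macro-averaged recall.
y_true = [0, 0, 1, 1, 1]  # 0 = without mask, 1 = with mask (toy example)
y_pred = [0, 1, 1, 1, 0]
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.3f}")  # (0.5 + 0.667) / 2 = 0.583
```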
Baselines. The ComParE organizers [1] provided some baseline results on the development and the private test sets. We considered their top baseline results, obtained either by a ResNet-50 model or by an SVM trained on a fusion of features. In addition, we compare our novel data augmentation method based on U-GAT-IT with several data augmentation approaches, ranging from standard approaches such as noise perturbation and time shifting to state-of-the-art methods such as speed perturbation [12], conditional GANs [13] and SpecAugment [14].
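For reference, the two standard baselines can be implemented in a few lines; the SNR value and shift range below are illustrative assumptions, since the paper does not report the exact settings used for these baselines.

```python
import numpy as np

def add_noise(signal: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Noise perturbation: add white Gaussian noise at a desired SNR (in dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.randn(len(signal)) * np.sqrt(noise_power)
    return signal + noise

def time_shift(signal: np.ndarray, max_shift: int = 1600) -> np.ndarray:
    """Time shifting: circularly shift the waveform by up to max_shift samples
    (1600 samples = 0.1 s at 16 kHz)."""
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(signal, shift)
```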
Parameter tuning and implementation details. For data augmentation, we adapted U-GAT-IT [5] in order to fit our larger input images (spectrograms). We employed the shallower architecture provided in the official U-GAT-IT code release (https://github.com/znxlwm/UGATIT-pytorch). We adapted the number of input and output channels in accordance with our complex data representation, considering the real and the imaginary parts of the STFT as two different channels. We trained U-GAT-IT for 100 epochs on mini-batches of 2 samples using the Adam optimizer [26] with a learning rate of 10^{-4} and a weight decay of 10^{-4}. For the ResNet models, we used the official PyTorch implementation (https://pytorch.org/hub/pytorch_vision_resnet). We only adjusted the number of input channels of the first convolutional layer, allowing us to input spectrograms with complex values instead of RGB images. We tuned the hyperparameters of the ResNet models on the development set. All models are trained for 60 epochs with a learning rate between 10^{-3} and 10^{-4} and a mini-batch size of 16. In order to reduce the influence of the random weight initialization on the performance, we trained each model in three trials (runs), reporting the performance corresponding to the best run. For a fair evaluation, we apply the same approach to the data augmentation baselines, i.e. we consider the best performance in three runs. For the SVM, we experiment with the RBF kernel, setting γ = 10^{-2}. For the regularization parameter C of the SVM, we consider values in the set {10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}}. We tuned the regularization parameter on the development data set. For the final evaluation on the private test set, we added the development data samples to the training set.
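The adjustment of the first convolutional layer amounts to replacing the stock 3-channel stem with a 2-channel one; a minimal sketch with torchvision is shown below. The function name is illustrative, and the optimizer used for the ResNets is not stated in the paper, so only the layer replacement is shown.

```python
import torch.nn as nn
from torchvision import models

def make_two_channel_resnet(depth: int = 50, num_classes: int = 2) -> nn.Module:
    """ResNet from torchvision with a 2-channel first convolution (real and
    imaginary STFT channels) and a 2-way classification head."""
    constructors = {18: models.resnet18, 34: models.resnet34,
                    50: models.resnet50, 101: models.resnet101}
    net = constructors[depth]()
    # Replace the 3-channel RGB stem with a 2-channel one.
    net.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # Replace the 1000-way ImageNet head with a mask / non-mask head.
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net
```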
Preliminary results. In Table 1, we present the results obtained by each ResNet model using various data augmentation techniques. First, we note that the augmentation based on conditional GANs [13] reduces the performance with respect to the baseline without data augmentation. While training the conditional GANs, we faced convergence issues, which we believe to be caused by the large size of the input spectrograms, which are more than twice as large compared to those used in the original paper [13]. We hereby note that GANs that learn to transfer samples [4, 5] are much easier to train than GANs that learn to generate new samples from random noise vectors [21, 22], since the transfer task is simply easier (the input is not a random noise vector, but a real data sample). While noise perturbation and speed perturbation [12] bring performance improvements for only one of the four ResNet models, SpecAugment manages to bring improvements for two ResNet models. There are only two data augmentation methods that bring improvements for all four ResNet models. These are time shifting and U-GAT-IT. However, we observe that U-GAT-IT provides superior results compared to time shifting in each and every case. While speed perturbation brings the largest improvement for ResNet-18, our augmentation method based on U-GAT-IT brings the largest improvements for ResNet-34, ResNet-50 and ResNet-101. Among the individual augmentation methods, we conclude that U-GAT-IT attains the best results. Since time shifting and U-GAT-IT are the only augmentation methods that bring improvements for all ResNet models, we decided to combine them in order to increase our rank in the competition. We observe further performance improvements on the development set after combining U-GAT-IT with time shifting.

Table 1: Results of four ResNet models (ResNet-18, ResNet-34, ResNet-50, ResNet-101) in terms of unweighted average recall (%) on the development set, with various data augmentation methods.

Augmentation method             | ResNet-18 | ResNet-34 | ResNet-50 | ResNet-101
--------------------------------|-----------|-----------|-----------|-----------
none                            |   69.03   |   68.62   |   68.68   |   69.01
noise perturbation              |   68.37   |   69.57   |   67.77   |   68.95
time shifting                   |   69.35   |   69.39   |   69.15   |   69.42
speed perturbation [12]         |   70.14   |   68.35   |   68.68   |   66.13
conditional GAN [13]            |   60.23   |   56.05   |   58.17   |   55.02
SpecAugment [14]                |   67.38   |   69.72   |   69.53   |   68.19
U-GAT-IT (ours)                 |   69.86   |   70.22   |   69.88   |   70.02
U-GAT-IT + time shifting (ours) |   71.34   |   70.85   |   71.16   |   70.73
Submitted results. In Table 2, we present the results obtained by various ensembles based on SVM applied on concatenated ResNet feature vectors. Our SVM ensemble without data augmentation is already better than the baselines provided by the ComParE organizers [1]. By including the ResNet models trained with augmentation based on U-GAT-IT, we observe a performance boost of 0.9% on the private test set. This confirms the effectiveness of our data augmentation approach. As time shifting seems to bring only minor improvements for the SVM, we turned our attention in another direction. Noting that the validated value of C (10^{-3}) is likely in the underfitting zone, we tried to validate it again by switching the training and the development sets or by moving 5,000 samples from the development set to the training set. This generated our fourth and fifth submissions, with C = 10^{0} and C = 10^{2}, respectively. Our top score for MSC is 74.6%.

Table 2: Results of SVM ensembles based on ResNet features, with and without data augmentation, in comparison with the official baselines [1]. Unweighted average recall values (%) are provided for both the development and the private test sets.

Approach                       | SVM C   | Dev  | Test
-------------------------------|---------|------|------
DeepSpectrum [1]               |    -    | 63.4 | 70.8
Fusion Best [1]                |    -    |  -   | 71.8
SVM (no augmentation)          | 10^{-3} | 71.3 | 72.6
SVM + U-GAT-IT                 | 10^{-3} | 72.0 | 73.5
SVM + U-GAT-IT + time shifting | 10^{-3} | 72.2 |  -
SVM + U-GAT-IT + time shifting | 10^{0}  | 71.8 | 74.6
SVM + U-GAT-IT + time shifting | 10^{2}  | 71.4 | 72.6

5. Conclusion
In this paper, we presented a system based on an SVM applied on top of feature vectors concatenated from multiple ResNets. Our main contribution is a novel data augmentation approach for speech, which aims at reducing the undesired distribution bias in the training data. This is achieved by transferring data from one class to another through cycle-consistent GANs.

Acknowledgements. The research leading to these results has received funding from the EEA Grants 2014-2021, under project contract no. EEA-RO-NO-2018-0496.
6. References
[1] B. W. Schuller, A. Batliner, C. Bergler, E.-M. Messner, A. Hamilton, S. Amiriparian, A. Baird, G. Rizos, M. Schmitt, L. Stappen, H. Baumeister, A. D. MacIntyre, and S. Hantke, "The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing & Masks," in Proceedings of INTERSPEECH, 2020.
[2] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of CVPR, 2016, pp. 770–778.
[4] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of ICCV, 2017, pp. 2223–2232.
[5] J. Kim, M. Kim, H. Kang, and K. H. Lee, "U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation," in Proceedings of ICLR, 2020.
[6] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2011.
[7] M.-I. Georgescu, R. T. Ionescu, and N. Verga, "Convolutional Neural Networks with Intermediate Loss for 3D Super-Resolution of CT and MRI Scans," IEEE Access, vol. 8, pp. 49112–49124, 2020.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Proceedings of NIPS, 2012, pp. 1097–1105.
[9] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of CVPR, 2015, pp. 3431–3440.
[10] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding Deep Learning Requires Rethinking Generalization," in Proceedings of ICLR, 2017.
[11] M. T. Ribeiro, S. Singh, and C. Guestrin, "Why Should I Trust You?: Explaining the Predictions of Any Classifier," in Proceedings of KDD, 2016, pp. 1135–1144.
[12] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio Augmentation for Speech Recognition," in Proceedings of INTERSPEECH, 2015, pp. 3586–3589.
[13] A. Chatziagapi, G. Paraskevopoulos, D. Sgouropoulos, G. Pantazopoulos, M. Nikandrou, T. Giannakopoulos, A. Katsamanis, A. Potamianos, and S. Narayanan, "Data Augmentation using GANs for Speech Emotion Recognition," in Proceedings of INTERSPEECH, 2019, pp. 171–175.
[14] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," in Proceedings of INTERSPEECH, 2019, pp. 2613–2617.
[15] L. L. Mendel, J. A. Gardino, and S. R. Atcherson, "Speech understanding using surgical masks: a problem in health care?" Journal of the American Academy of Audiology, vol. 19, no. 9, pp. 686–695, 2008.
[16] R. Saeidi, T. Niemi, H. Karppelin, J. Pohjalainen, T. Kinnunen, and P. Alku, "Speaker recognition for speech under face cover," in Proceedings of INTERSPEECH, 2015, pp. 1012–1016.
[17] R. Saeidi, I. Huhtakallio, and P. Alku, "Analysis of Face Mask Effect on Speaker Recognition," in Proceedings of INTERSPEECH, 2016, pp. 1800–1804.
[18] M. Ravanelli, A. Sosi, M. Matassoni, M. Omologo, M. Benetti, and G. Pedrotti, "Distant talking speech recognition in surgery room: The DOMHoS project," in Proceedings of AISV, 2013.
[19] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[20] Q. Wen, L. Sun, X. Song, J. Gao, X. Wang, and H. Xu, "Time Series Data Augmentation for Deep Learning: A Survey," arXiv preprint arXiv:2002.12478, 2020.
[21] M. Mirza and S. Osindero, "Conditional Generative Adversarial Nets," arXiv preprint arXiv:1411.1784, 2014.
[22] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi, "BAGAN: Data Augmentation with Balancing GAN," arXiv preprint arXiv:1803.09655, 2018.
[23] J. B. Allen and L. R. Rabiner, "A unified approach to short-time Fourier analysis and synthesis," Proceedings of the IEEE, vol. 65, no. 11, pp. 1558–1564, 1977.
[24] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[25] R. T. Ionescu and M. Popescu, Knowledge Transfer between Computer Vision and Text Mining, ser. Advances in Computer Vision and Pattern Recognition. Springer International Publishing, 2016.
[26] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of ICLR, 2015.
