Are You Wearing a Mask? Improving Mask Detection from Speech Using Augmentation by Cycle-Consistent GANs
arXiv:2006.10147v2 [eess.AS] 25 Jul 2020

Abstract

The task of detecting whether a person wears a face mask from speech is useful in modelling speech in forensic investigations, communication between surgeons or people protecting themselves against infectious diseases such as COVID-19. In this paper, we propose a novel data augmentation approach for mask detection from speech. Our approach is based on (i) training Generative Adversarial Networks (GANs) with cycle-consistency loss to translate unpaired utterances between two classes (with mask and without mask), and on (ii) generating new training utterances using the cycle-consistent GANs, assigning opposite labels to each translated utterance. Original and translated utterances are converted into spectrograms which are provided as input to a set of ResNet neural networks with various depths. The networks are combined into an ensemble through a Support Vector Machines (SVM) classifier. With this system, we participated in the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 Computational Paralinguistics Challenge, surpassing the baseline proposed by the organizers by 2.8%. Our data augmentation technique provided a performance boost of 0.9% on the private test set. Furthermore, we show that our data augmentation approach yields better results than other baseline and state-of-the-art augmentation methods.

Index Terms: mask detection, data augmentation, Generative Adversarial Networks, neural networks ensemble, ComParE.

1. Introduction

In this paper, we describe our system for the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 Computational Paralinguistics Challenge (ComParE) [1]. In MSC, the task is to determine whether an utterance belongs to a person wearing a face mask or not. As noted by Schuller et al. [1], the task of detecting whether a speaker wears a face mask is useful in modelling speech in forensics or communication between surgeons. In the context of the COVID-19 pandemic, another potential application is to verify whether people wear surgical masks.

We propose a system based on Support Vector Machines (SVM) [2] applied on top of feature embeddings concatenated from multiple ResNet [3] convolutional neural networks (CNNs). In order to improve our mask detection performance, we propose a novel data augmentation technique that is aimed at eliminating biases in the training data distribution. Our data augmentation method is based on (i) training Generative Adversarial Networks (GANs) with cycle-consistency loss [4, 5] for unpaired utterance-to-utterance translation between two classes (with mask and without mask), and on (ii) generating new training utterances using the cycle-consistent GANs, assigning opposite labels to each translated utterance.

While deep neural networks attain state-of-the-art results in various domains [3, 6, 7, 8, 9], such models can easily succumb to the pitfall of overfitting [10]. This means that deep models can make decisions based on various biases existing in the training data. A notorious example is an image of a wolf being correctly labeled, but only because of the snowy background [11]. In our case, the training samples belonging to one class may have a different gender and age distribution than the training samples belonging to the other class, among other unknown biases. Instead of finding relevant features to discriminate utterances with and without mask, a neural network might consider features for gender prediction or age estimation, which is undesired. With our data augmentation approach, all utterances with mask are translated to utterances without mask and the other way around, as shown in Figure 1. Any potential bias in the distribution of training data samples is eliminated through the compensation that comes with the augmented data samples from the opposite class. This forces the neural networks to discover features that discriminate the training data with respect to the desired task, i.e. classification into mask versus non-mask.

We conduct experiments on the Mask Augsburg Speech Corpus (MASC), showing that our data augmentation approach attains superior results in comparison to a set of baselines, e.g. noise perturbation and time shifting, and a set of state-of-the-art data augmentation techniques, e.g. speed perturbation [12], conditional GANs [13] and SpecAugment [14].

2. Related Work

Successful communication is an important component in performing effective tasks, e.g. consider doctors in surgery rooms. While communication is crucial, doctors often wear surgical masks, which could lead to less effective communication. Although surgical masks affect voice clarity, human listeners reported small effects on speech understanding [15]. Furthermore, there is limited research addressing the effects of different face covers on voice acoustic properties. The speaker recognition task was studied in the context of wearing a face cover [16, 17], but the results indicated a small accuracy degradation ratio. In addition, a negligible level of artifacts is introduced by surgical masks in automatic speech understanding [18].

To our knowledge, there are no previous works on mask detection from speech. We therefore consider augmentation methods for audio data as related work. The superior performance of deep neural networks relies heavily on large amounts of training data [19]. However, labeled data in many real-world applications is hard to collect. Therefore, data augmentation has been proposed as a method to generate additional training data, improving the generalization capacity of neural networks. As discussed in the recent survey of Wen et al. [20], a wide range of augmentation methods have been proposed for time series data, including speech-related tasks. A classic data augmentation method is to perturb a signal with noise in accordance with a desired signal-to-noise ratio (SNR).
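As a concrete illustration of this classic baseline, additive noise injection at a target SNR can be sketched as follows. This is a generic NumPy example, not code from any of the works cited here; the function name and signature are our own:

```python
import numpy as np

def add_noise_at_snr(signal: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Perturb a raw audio signal with white Gaussian noise at a desired SNR (in dB)."""
    rng = np.random.default_rng(rng)
    signal_power = np.mean(signal ** 2)
    # Choose the noise power so that 10 * log10(signal_power / noise_power) == snr_db.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```

Lower `snr_db` values yield stronger perturbations; each noisy copy keeps the label of the original utterance.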
Other augmentation methods with proven results on speech recognition and related tasks are time shifting and speed perturbation [12]. While these data augmentation methods are applied on raw signals, some of the most recent techniques [13, 14] are applied on spectrograms. Representing audio signals through spectrograms goes hand in hand with the usage of CNNs or similar models on speech recognition tasks, perhaps due to their outstanding performance on image-related tasks. Park et al. [14] performed augmentation on the log mel spectrogram through time warping or by masking blocks of frequency channels and time steps. Their experiments showed that their technique, SpecAugment, prevents overfitting and improves performance on automatic speech recognition tasks. More closely related to our work, Chatziagapi et al. [13] proposed to augment the training data by generating new data samples using conditional GANs [21, 22]. Since conditional GANs generate new data samples following the training data distribution, unwanted and unknown distribution biases in the training data can only get amplified after augmentation. Unlike Chatziagapi et al. [13], we employ cycle-consistent GANs [4, 5], learning to transfer training data samples from one class to another while preserving other aspects. By transferring samples from one class to another, our data augmentation technique is able to level out any undesired distribution biases. Furthermore, we show in the experiments that our approach provides superior results.

Figure 1: Our mask detection pipeline with data augmentation based on cycle-consistent GANs. Original training spectrograms are transferred from one class to the other using two generators, G and G′. Original and augmented spectrograms are further used to train an ensemble of ResNet models with depths ranging from 18 layers to 101 layers. Feature vectors from the penultimate layer of each ResNet are concatenated and provided as input to an SVM classifier which makes the final prediction. Best viewed in color.

3. Method

Data representation. CNNs attain state-of-the-art results in computer vision [3, 8], the convolutional operation being initially applied on images. In order to employ state-of-the-art CNNs for our task, we first transform each audio signal sample into an image-like representation. Therefore, we compute the discrete Short-Time Fourier Transform (STFT), as follows:

STFT\{x[n]\}(m, k) = \sum_{n=-\infty}^{+\infty} x[n] \cdot w[n - mR] \cdot e^{-j \frac{2\pi}{N_x} kn},   (1)

where x[n] is the discrete input signal, w[n] is a window function (in our approach, Hamming), N_x is the STFT length and R is the hop (step) size [23]. Prior to the transformation, we scaled the raw audio signal, dividing it by the maximum. In the experiments, we used N_x = 1024, R = 64 and a window size of 512. We preserved the complex values (real and imaginary) of the STFT and kept only one side of the spectrum, considering that the spectrum is symmetric because the raw input signal is real. Finally, each utterance is represented as a spectrogram of 2 × 513 × 250 components, where 250 is the number of time bins.

Learning framework. Our learning model is based on an ensemble of residual neural networks (ResNets) [3] that produce feature vectors which are subsequently joined together (concatenated) and given as input to an SVM classifier, as illustrated in Figure 1. We employ ResNets because residual connections eliminate vanishing or exploding gradient problems in training very deep neural models, providing alternative pathways for the gradients during back-propagation. We employed four ResNet models with depths ranging from 18 to 101 layers in order to generate embeddings with different levels of abstraction. In order to combine the ResNet models, we remove the Softmax classification layers and concatenate the feature vectors (activation maps) resulting from the last remaining layers. ResNet-18 and ResNet-34 provide feature vectors of 512 components, while ResNet-50 and ResNet-101 produce 2048-dimensional feature vectors. After concatenation, each utterance is represented by a feature vector of 5120 components. On top of the combined feature vectors, we train an SVM classifier. The SVM model [2] aims at finding a hyperplane separating the training samples by a maximum margin, while including a regularization term in the objective function, controlling the degree of data fitting through the number of support vectors. We validate the regularization parameter C on the development set. The SVM model relies on a kernel (similarity) function [24, 25] to embed the data in a Hilbert space, in which non-linear relations are transformed into linear relations. We hereby consider the Radial Basis Function (RBF) kernel defined as k_{RBF}(x, y) = e^{-\gamma \|x - y\|^2}, where x and y are two
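To make the learning framework concrete, the feature concatenation and the RBF kernel above can be sketched in NumPy as follows. The embedding vectors here are zero-valued placeholders standing in for the actual ResNet activations; only the dimensions (512 + 512 + 2048 + 2048 = 5120) follow the text:

```python
import numpy as np

def rbf_kernel(x: np.ndarray, y: np.ndarray, gamma: float = 1e-3) -> float:
    """k_RBF(x, y) = exp(-gamma * ||x - y||^2), as defined in the text."""
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

# Concatenate the per-network embeddings into one 5120-dimensional vector:
# ResNet-18 and ResNet-34 give 512 components each; ResNet-50 and ResNet-101 give 2048 each.
embeddings = [np.zeros(512), np.zeros(512), np.zeros(2048), np.zeros(2048)]
utterance_vector = np.concatenate(embeddings)
```

In practice, such vectors would be fed to an RBF-kernel SVM, with C and gamma validated on the development set.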
domain Y (without mask) to domain X (with mask):

L_{GAN}(F, D_X, x, y) = E_{x \sim p_{data}(x)} [(D_X(x))^2] + E_{y \sim p_{data}(y)} [(1 - D_X(F(y)))^2].   (4)
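For concreteness, the least-squares adversarial loss of Eq. (4) can be sketched as follows, with the two expectations estimated by batch averages. The discriminator outputs are passed in as arrays (d_real holds D_X(x) for real samples of domain X, d_fake holds D_X(F(y)) for samples translated from Y); this is an illustrative NumPy sketch, not the authors' implementation:

```python
import numpy as np

def lsgan_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Least-squares adversarial loss of Eq. (4):
    E[(D_X(x))^2] + E[(1 - D_X(F(y)))^2], estimated over a batch."""
    return float(np.mean(d_real ** 2) + np.mean((1.0 - d_fake) ** 2))
```

Under this formulation, the loss reaches its minimum (zero) when D_X outputs 0 on the samples drawn from p_data(x) and 1 on the translated samples F(y).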