Open Access (CC BY 4.0 license) | Published by De Gruyter, July 18, 2023

Replay attack detection based on deformable convolutional neural network and temporal-frequency attention model

  • Dang-en Xie, Hai-na Hu and Qiang Xu

Abstract

As an important identity authentication method, speaker verification (SV) has been widely used in many domains, e.g., mobile finance. At the same time, existing SV systems are insecure under replay spoofing attacks. Toward a more secure and stable SV system, this article proposes a replay attack detection system based on deformable convolutional neural networks (DCNNs) and a time–frequency double-channel attention model. In a DCNN, the positions of the elements in the convolution kernel are not fixed; instead, they are adjusted by trainable offsets, which helps the model extract more useful local information from the input spectrograms. Meanwhile, a time–frequency double-channel attention model is adopted to extract more discriminative features for distinguishing genuine and replayed speech. Experimental results on the ASVspoof 2019 dataset show that the proposed model can detect replay attacks accurately.

1 Introduction

With the continuous development of Internet technology and the progress of mobile communication technology, voiceprint recognition, as a non-contact identification method that can be applied to telecommunication systems and networks, has been widely used because of its low cost and wide application range. In 2018, the People’s Bank of China officially issued and implemented the Technical Specification for the Secure Application of Voiceprint Recognition in Mobile Finance (Standard No. JR/T 0164-2018). This means that voiceprint recognition will be widely used in the field of domestic mobile finance as an officially recognized identity authentication technology.

However, in recent years, with the development of the mobile Internet and the rise of social networks, coupled with people’s relatively weak awareness of personal privacy protection, criminals can easily obtain users’ voice data through social network platforms or by covert recording. Through such record-and-replay attacks, the protection of a voiceprint recognition system can be bypassed, endangering users’ property and personal safety. A large number of experiments show that existing classical voiceprint recognition systems have difficulty distinguishing natural speech from replayed speech [1,2,3]. Therefore, building a safe and robust voiceprint recognition system that can resist replay attacks has become a focus of recent research. Without changing the original framework of the voiceprint recognition system, designing a front-end detector that can detect replayed speech is a simple and feasible means of improving its security. Therefore, in recent years, researchers have carried out a series of studies on the design of replay speech detectors [4,5,6].

A replay speech detection model usually includes two modules: a front-end feature extraction module and a back-end classifier module.

For front-end feature extraction, a variety of acoustic features have been used to build effective natural/replay speech classifiers. Among them, various acoustic features extracted from the speech energy spectrum have achieved good results in the replay speech detection task. In Wu et al. [7], various cepstral coefficient features based on the speech energy spectrum, such as Mel frequency cepstral coefficients (MFCCs), inverse MFCCs (IMFCCs), and linear frequency cepstral coefficients (LFCCs), are used as front-end features of a replay attack detector. Frequency components hidden in the high-frequency region of the speech energy spectrum, which are not easily perceived by the auditory system, often play an important role in the replay attack detection task. Therefore, in Witkowski et al. [8], features extracted from the high-frequency part of the energy spectrum or from specific frequency bands, including high-frequency cepstral coefficients, single frequency filtering cepstral coefficients, and constant Q cepstral coefficients with non-uniform time–frequency resolution (CQCCs) [9], are also used as front-end features for training natural/replay speech classifiers. Researchers have also designed other novel front-end features for replay attack detection, for example, variable length Teager energy operator energy separation algorithm-instantaneous frequency cosine coefficients, which focus on extracting energy spectrum envelope change information [10]. In addition to the aforementioned single-feature models, some detection algorithms based on multi-feature fusion have also been proposed. In Font et al. [11], the authors fused different instantaneous frequency features and cepstral coefficient features to construct a fused front end. In Jelil et al. [12], the authors fused the outputs of classifiers trained with different cepstral coefficient features.

At the back end, discriminative models such as support vector machines and deep neural networks (DNNs) [13,14,15], as well as generative models such as the Gaussian mixture model (GMM), are used as the back-end module of the replay attack detector [16]. Especially in recent years, with the continuous development of deep learning technology, various deep learning models have been applied to replay attack detection. On the one hand, deep learning models can extract sentence-level deep features and detect replayed speech in combination with cosine distance, probabilistic linear discriminant analysis, and other scoring methods [17]. On the other hand, an end-to-end DNN classifier can be built to obtain detection results directly from the original acoustic features. Transfer learning strategies and data augmentation schemes used to increase the generalization ability of deep learning models have also been used to increase the accuracy and robustness of replay speech detectors [18]. Some deep network structures that have been successful in image recognition tasks have been applied to replay attack detection, including the residual network (ResNet) [19] and the light convolutional neural network (LCNN) [20,21]. Some recently proposed deep network structures for voiceprint recognition, such as the x-vector and SincNet, have also been applied to spoofed speech detection [17,22]. In addition, some deep network models with attention mechanisms have been applied to detect replayed speech [23].

This article directly uses the original spectral features as the front-end input and focuses on how to automatically analyze and extract the information suitable for replay attack detection from the original high-dimensional front-end features through careful design of the back end. In order to obtain the distinguishing features of natural/replayed speech more freely and flexibly, this article adopts deformable (variable) convolution and modifies the fixed positions of the convolution kernel in a learnable way, so that the network can automatically find the information more conducive to extracting distinguishing features. Because the feature distributions of natural speech and replayed speech are similar as a whole, a time–frequency attention mechanism is added to the model to effectively extract the distinguishing components, so that the classifier can detect replayed speech more effectively.

2 System model

In order to build a DNN classifier that can accurately distinguish natural from replayed speech, this article constructs a classifier based on a variable convolution module and a time–frequency attention model. The whole network adopts the model structure of ResNet-34 [24], converts some ordinary convolution layers into variable convolution layers, and introduces a time–frequency attention module at the back end of each residual connection layer. This section describes the variable convolution module, the time–frequency attention module, and the overall structure of the network used in the classifier in detail.

2.1 DCNN

The operation process of ordinary convolution is shown in Figure 1(a). The output feature map is obtained by sliding a convolution kernel (the light part in Figure 1(a)) over the input feature map, and its calculation formula is as follows:

(1) $F_{\mathrm{out}}(x, y) = \sum_{\Delta x \in [-1,1]} \sum_{\Delta y \in [-1,1]} k(\Delta x, \Delta y)\, F_{\mathrm{in}}(x + \Delta x, y + \Delta y).$

Figure 1: The difference between normal convolution, deformable convolution, and dilated convolution. (a) Normal convolution, (b) deformable convolution, and (c) dilated convolution.

The value of the output feature map at coordinates (x, y) is obtained as the weighted sum, with kernel weights $k(\Delta x, \Delta y)$, of the element of the input feature map $F_{\mathrm{in}}$ at (x, y) and its surrounding elements $F_{\mathrm{in}}(x + \Delta x, y + \Delta y)$, where the values $\Delta x, \Delta y$ are fixed integers (for a 3 × 3 convolution kernel, the values are {−1, 0, 1}). Such a fixed-position convolution kernel can only gather information from fixed positions, but in the replay speech detection task, the regions that distinguish natural speech from replayed speech are not “regular.” Therefore, in order to obtain more effective distinguishing features, this article introduces variable convolution, whose convolution process is represented by the following formula:

(2) $F_{\mathrm{out}}(x, y) = \sum_{\Delta x \in [-1,1]} \sum_{\Delta y \in [-1,1]} k(\Delta x, \Delta y)\, F_{\mathrm{in}}(x + d_x, y + d_y).$

As shown in Figure 1(b), during convolution, the convolution kernel $k(\Delta x, \Delta y)$ (the light squares in Figure 1(b)) is multiplied and summed with the dark squares in the input feature map. Here, $(\Delta x, \Delta y)$ and $(d_x, d_y)$ are in one-to-one correspondence, and $d_x$ and $d_y$ are no longer integers but learned offsets that may take fractional values. Through training, the sampling positions of the convolution kernel are no longer fixed, and the regions that provide more effective discrimination information can be found automatically. Figure 1(c) shows a special case of variable convolution: after learning, $d_x = 2\Delta x$ and $d_y = 2\Delta y$, and the variable convolution becomes a dilated convolution with a dilation factor of 2.

The specific operation process of variable convolution is shown in Figure 2 [25].

Figure 2: The processing of deformable convolutions.

First, a CNN operation is performed on the input feature map to obtain the offsets to be learned. Suppose the dimension of the input feature map is [T, F, C], where T and F represent the time and frequency axes of the feature map, respectively, and C represents the number of channels. After processing by a convolution layer with 2C output channels, an offset feature map with dimension [T, F, 2C] is obtained; it describes the offsets $d_x$ and $d_y$ of each pixel in the X-axis and Y-axis directions of the learned feature map. To ensure that the learned offsets do not exceed the range of the feature map, a clamp function is used to limit the output range of $d_x$ and $d_y$.

(3) $F_{\mathrm{new}}(x, y) = F_{\mathrm{in}}(x + d_x, y + d_y) = \sum_{m \in M} w_m F_{\mathrm{in}}(x_m, y_m).$

Second, bilinear interpolation is used to deform the input feature map according to the learned offsets, that is, the element value at the position $F_{\mathrm{in}}(x + d_x, y + d_y)$ is computed as in equation (3), where M represents the set of coordinates of the four integer points closest to $(x + d_x, y + d_y)$ and $w_m$ denotes the corresponding interpolation weights.

Finally, a traditional convolution is applied to the deformed feature map to obtain the variable convolution output $F_{\mathrm{out}}$.
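The three steps above can be prototyped in a few lines of PyTorch. The following is a minimal sketch, not the authors' implementation: for brevity the offset branch predicts a single (dx, dy) pair per position rather than the per-channel [T, F, 2C] offset map described in the text, and the clamping range, layer names, and hyperparameters are illustrative assumptions (the 5 × 5 offset kernel follows the setting mentioned in Section 2.3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VariableConv2d(nn.Module):
    """Variable (deformable) convolution: learn offsets, warp, then convolve."""

    def __init__(self, in_channels, out_channels, kernel_size=3, max_offset=2.0):
        super().__init__()
        # Offset-learning branch; the 5 x 5 kernel follows the setting in Section 2.3.
        # It predicts 2 channels = one (dx, dy) pair per position (a simplification).
        self.offset_conv = nn.Conv2d(in_channels, 2, kernel_size=5, padding=2)
        # Ordinary convolution applied to the deformed (warped) feature map.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)
        self.max_offset = max_offset

    def forward(self, x):                                   # x: [N, C, H, W]
        n, c, h, w = x.shape
        # 1) Learn offsets and clamp them to a bounded range (the article clamps
        #    them so that sampling stays inside the feature map).
        offset = self.offset_conv(x).clamp(-self.max_offset, self.max_offset)
        dx, dy = offset[:, 0], offset[:, 1]                 # each [N, H, W]

        # 2) Build a sampling grid (normalized to [-1, 1]) and warp the input
        #    with bilinear interpolation, as in equation (3).
        ys, xs = torch.meshgrid(torch.arange(h, device=x.device, dtype=x.dtype),
                                torch.arange(w, device=x.device, dtype=x.dtype),
                                indexing="ij")
        grid_x = (xs.unsqueeze(0) + dx) / max(w - 1, 1) * 2 - 1
        grid_y = (ys.unsqueeze(0) + dy) / max(h - 1, 1) * 2 - 1
        grid = torch.stack((grid_x, grid_y), dim=-1)        # [N, H, W, 2]
        x_deformed = F.grid_sample(x, grid, mode="bilinear", align_corners=True)

        # 3) Ordinary convolution on the deformed feature map gives F_out.
        return self.conv(x_deformed)
```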

2.2 Temporal-frequency attention model (TF-AM)

In recent years, attention models have been widely studied in machine vision, natural language processing, and speech processing. The attention mechanism makes the constructed neural network model pay more attention to information related to the training target and suppress information irrelevant to it. Existing attention models for speech processing mainly apply attention along the time dimension. In order to also make full use of the sequential information along the frequency dimension, this article adopts a time–frequency attention mechanism to find the information that can effectively distinguish natural from replayed speech in both the time and frequency dimensions.

The internal structure of the time–frequency attention model is shown in Figure 3. First, the input feature map $F_{\mathrm{in}}$ is processed with attention along the frequency dimension to obtain the frequency domain weight template $W_f$. This template is used to weight $F_{\mathrm{in}}$ in the frequency domain to obtain the frequency domain attention feature map $F_{\mathrm{out\_f}}$. Then, attention is computed along the time dimension to obtain the time domain weight template $W_t$, which is used to weight $F_{\mathrm{out\_f}}$ in the time domain. The resulting time–frequency attention feature map $F_{\mathrm{out\_ft}}$, weighted in both the time and frequency dimensions, makes replay speech detection easier.

Figure 3: Illustration of the time–frequency attention model.

The specific algorithm for obtaining the frequency domain weight template is shown in Figure 4. First, the input feature map is pooled along the channel axis (C axis) to aggregate channel information. Then, maximum pooling and average pooling are carried out along the time axis (T axis) to obtain two feature maps, each with dimension [1, F, 1]. The two feature maps are concatenated along the C axis and fed into a CNN layer to compute the frequency domain weight template $W_f$. The CNN layer uses a 1 × 7 convolution kernel with a stride of 1 × 1, the number of output channels (C) is set to 1, and the activation function (AF) is the sigmoid. After this layer, the frequency domain weight template $W_f$ with dimension [1, F, 1] is obtained.

Figure 4: The processing of computing the frequency domain weight mask.
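A compact sketch of this module in PyTorch is given below, assuming feature maps laid out as [batch, C, T, F]. The frequency branch follows the steps just described, and the time branch mirrors it with the roles of T and F swapped. The use of mean pooling for channel aggregation, the layer names, and the exact kernel of the time branch are assumptions rather than details taken from the article.

```python
import torch
import torch.nn as nn


class TimeFrequencyAttention(nn.Module):
    """Sequential frequency-then-time attention over feature maps [B, C, T, F]."""

    def __init__(self):
        super().__init__()
        # 2 input channels: stacked max-pooled and average-pooled maps.
        self.freq_conv = nn.Conv2d(2, 1, kernel_size=(1, 7), padding=(0, 3))
        self.time_conv = nn.Conv2d(2, 1, kernel_size=(7, 1), padding=(3, 0))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                # x: [B, C, T, F]
        # Frequency attention: aggregate channels, pool along time (T),
        # concatenate along C, 1 x 7 conv, sigmoid -> weight template W_f.
        ch = x.mean(dim=1, keepdim=True)                 # [B, 1, T, F]
        f_max = ch.max(dim=2, keepdim=True).values       # [B, 1, 1, F]
        f_avg = ch.mean(dim=2, keepdim=True)             # [B, 1, 1, F]
        w_f = self.sigmoid(self.freq_conv(torch.cat([f_max, f_avg], dim=1)))
        x = x * w_f                                      # F_out_f

        # Time attention: the same procedure with T and F swapped -> W_t.
        ch = x.mean(dim=1, keepdim=True)
        t_max = ch.max(dim=3, keepdim=True).values       # [B, 1, T, 1]
        t_avg = ch.mean(dim=3, keepdim=True)
        w_t = self.sigmoid(self.time_conv(torch.cat([t_max, t_avg], dim=1)))
        return x * w_t                                   # F_out_ft
```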

2.3 Overall network structure

The replay speech detection network adopts the overall structure of ResNet-34, with adjusted parameters. The specific network structure is shown in Figure 5. The white rectangles in the figure represent ordinary CNN layers, and the three parameters denote the size of the convolution kernel, the stride of the convolution kernel, and the number of output channels, respectively. Each CNN layer uses a ReLU AF followed by batch normalization.

Figure 5: Illustration of the proposed replay attack detection neural network.

The dark rectangles represent variable convolution layers, and the three internal parameters refer to the parameters of the offset-learning CNN layer on the right side of Figure 2; its convolution kernel size is set to 5 × 5 and its stride to (1, 1).

As shown in Figure 5, the whole network mainly includes an initial CNN layer and four residual network layers, which contain 2, 3, 3, and 3 residual modules, respectively. A variable convolution layer is introduced into the first residual module of each layer in order to flexibly extract the natural/replay speech discrimination information in that layer. A time–frequency attention module is added after the last residual module of each layer to weight the output feature map of the layer in the time–frequency dual domains and obtain more effective discrimination information.

After the output of the residual network layers passes through a max pooling layer and a convolution layer, a global average pooling layer is used to aggregate information over the time and frequency dimensions. Finally, an N-dimensional probability output is obtained through a Softmax layer, representing natural speech and replay attacks under N − 1 different recording conditions, respectively.
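The assembly described above can be summarized in the following sketch, which reuses the VariableConv2d and TimeFrequencyAttention classes from the earlier snippets. The block counts follow the text (2, 3, 3, 3); the channel widths, strides, and the omission of the intermediate max-pooling and convolution layers before global average pooling are simplifying assumptions.

```python
import torch.nn as nn


class BasicBlock(nn.Module):
    """Residual block; the first block of each layer uses variable convolution."""

    def __init__(self, in_ch, out_ch, deformable=False):
        super().__init__()
        make_conv = (VariableConv2d if deformable
                     else (lambda i, o: nn.Conv2d(i, o, 3, padding=1)))
        self.conv1 = make_conv(in_ch, out_ch)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.proj = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.proj(x))


def make_layer(in_ch, out_ch, num_blocks):
    """First block uses variable convolution; TF attention follows the last block."""
    blocks = [BasicBlock(in_ch, out_ch, deformable=True)]
    blocks += [BasicBlock(out_ch, out_ch) for _ in range(num_blocks - 1)]
    blocks.append(TimeFrequencyAttention())
    return nn.Sequential(*blocks)


class ReplayDetector(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1),
                                  nn.BatchNorm2d(16), nn.ReLU(inplace=True))
        self.layers = nn.Sequential(make_layer(16, 16, 2),   # block counts (2, 3, 3, 3)
                                    make_layer(16, 32, 3),
                                    make_layer(32, 64, 3),
                                    make_layer(64, 128, 3))
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average over T and F
        self.fc = nn.Linear(128, n_classes)   # Softmax applied in the loss function

    def forward(self, spec):                  # spec: [B, 1, T, 512]
        h = self.layers(self.stem(spec))
        return self.fc(self.pool(h).flatten(1))
```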

3 Results and discussion

In this section, the ASVspoof 2019 dataset used for model evaluation, the specific experimental details, and the experimental results are described (the most recent ASVspoof dataset is ASVspoof 2021 [26]).

3.1 ASVspoof 2019 dataset

The ASVspoof 2019 dataset was released for the automatic speaker verification spoofing and countermeasures challenge held in 2019 [3]. It is used to evaluate the resistance of speaker verification (voiceprint recognition) systems to spoofing attacks. The database provides two types of attack, namely synthetic speech attacks and replay attacks. This article only uses the replay attack part to evaluate the proposed model.

The replay speech data set is divided into three parts: training set, development set, and evaluation set. The training set is used to train the replay speech detection model, the development set is used to optimize the model, and the evaluation set is used to finally evaluate the performance of the optimized system. The specific description of each part is shown in Table 1.

Table 1

Description of replay spoof dataset in ASVspoof 2019

Database set        Speakers (male)   Speakers (female)   Genuine utterances   Replayed utterances
Training set        8                 12                  5,400                48,600
Development set     8                 12                  5,400                24,300
Evaluation set      30                37                  18,089               134,630

It can be seen from Table 1 that the number of genuine utterances is far smaller than the number of replayed utterances. This is because three recording distances (near, medium, and far) and three recording devices (high, medium, and low quality) were used when recording the replayed speech; these recording conditions are combined to form nine different recording environments. In order to balance the training data during model training, the final training label dimension is set to 10 in this article (i.e., N = 10 in Figure 5), representing genuine speech and replayed speech recorded in the nine different recording environments, respectively.

3.2 Description of experiment

In this article, the speech spectrum features are used as the training features of the model. The input speech is divided into frames with a 25 ms frame length and a 10 ms step. After applying a Hamming window to each speech frame, the short-time Fourier transform is performed to extract the magnitude spectrum features. The Fourier transform length is set to 1,024, the 0th dimension is discarded, and finally a spectral feature with dimension 512 is obtained. Many experimental results show that silent frames also contribute to replay speech detection [5,6,19], so endpoint detection and silence removal are not carried out during feature extraction.
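For concreteness, the front end can be sketched as follows. This is an illustrative reading of the description above rather than the authors' code: the 16 kHz sampling rate and the use of librosa are assumptions, and the mean/variance normalization is applied per utterance here although the text applies it to the cropped training sequences.

```python
import numpy as np
import librosa


def extract_magnitude_spectrogram(wav_path, sr=16000):
    """25 ms frames, 10 ms hop, Hamming window, 1,024-point FFT, 0th bin dropped."""
    y, _ = librosa.load(wav_path, sr=sr)
    spec = librosa.stft(y,
                        n_fft=1024,
                        win_length=int(0.025 * sr),   # 25 ms frame length
                        hop_length=int(0.010 * sr),   # 10 ms step
                        window="hamming")
    mag = np.abs(spec)[1:, :]                         # drop 0th bin -> 512 x n_frames
    # Mean/variance normalization over the utterance (one reading of the
    # "regularized by the mean and variance of the feature sequence" step).
    mag = (mag - mag.mean(axis=1, keepdims=True)) / (mag.std(axis=1, keepdims=True) + 1e-8)
    return mag.T                                      # [n_frames, 512]
```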

In the training process, in order to increase the training speed, 256 frames are randomly cropped from each input feature sequence and used as network input. For short training utterances, the features are first repeated and then cropped. The input features are normalized by the mean and variance of the feature sequence and then sent to the network for training. During testing, the features are not cropped and are sent directly to the network.

During model training, the batch size is set to 32, and the cross-entropy loss function is used for gradient descent training. The number of training epochs is set to 200. The Adam optimizer is used, with the two momentum values set to 0.9 and 0.999, respectively. The initial learning rate is 0.00005, and during epochs 120–150 the learning rate is reduced by 10% relative to the previous epoch. After epoch 100, the model is evaluated on the development set, and the model with the best detection performance is taken as the final output.
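A sketch of the corresponding training configuration is shown below, assuming a PyTorch DataLoader (batch size 32) that yields the 256-frame crops described above and the ReplayDetector sketch from Section 2.3; the scheduler choice is an assumption that reproduces the stated 10% per-epoch decay during epochs 120–150.

```python
import torch


def train(model, train_loader, epochs=200):
    """Training loop matching the described configuration (batch size set in the loader)."""
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.999))
    # 10% per-epoch learning-rate decay during epochs 120-150, constant otherwise.
    scheduler = torch.optim.lr_scheduler.MultiplicativeLR(
        optimizer, lr_lambda=lambda epoch: 0.9 if 120 <= epoch <= 150 else 1.0)
    for epoch in range(epochs):
        for specs, labels in train_loader:   # batches of 256-frame spectrogram crops
            optimizer.zero_grad()
            loss = criterion(model(specs), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```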

3.3 Experiment results and analysis

In this article, the 0th dimension of the output, that is, the output of the natural speech node, is used as the replay attack detection score. According to the rules of the ASVspoof 2019 challenge, the tandem detection cost function (t-DCF) and the equal error rate (EER) are used as the primary and secondary evaluation metrics, respectively. EER only considers the replay attack detection results, while t-DCF jointly considers the speaker verification results. The speaker verification scores used are provided by the ASVspoof 2019 database.
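As a reference point, the EER (the secondary metric) can be computed from detection scores as in the sketch below, assuming that higher scores indicate natural speech; t-DCF additionally requires the ASV scores and cost parameters from the official ASVspoof evaluation toolkit and is not reproduced here.

```python
import numpy as np


def compute_eer(genuine_scores, spoof_scores):
    """EER: the operating point where false acceptance and false rejection rates are equal."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(spoof_scores >= t)     # false acceptance rate (spoof accepted)
        frr = np.mean(genuine_scores < t)    # false rejection rate (genuine rejected)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```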

The specific experimental results are shown in Table 2. In order to evaluate the detection performance, the proposed algorithm is compared with several baseline algorithms, including a detection algorithm based on a statistical probability model (the GMM) and algorithms based on deep neural network models.

Table 2

Performance of different replay attack detection models on replay spoof evaluation set in ASVspoof 2019

Models                     t-DCF    EER (%)
GMM + LFCC                 0.302    13.54
GMM + CQCC                 0.245    11.04
LCNN                       0.079    2.03
ResNet                     0.068    1.97
ResNet + DefCNN            0.045    1.46
ResNet + TF-AM             0.036    1.22
ResNet + DefCNN + TF-AM    0.025    1.03

The GMM models are trained with CQCC and LFCC features. The DNN baselines are ResNet and LCNN, both of which perform well in replay attack detection. For a fair comparison, all deep network models use the same spectral features as input. The baseline ResNet settings are the same as those in Figure 5, except that the variable convolution layers and time–frequency attention layers are removed. For the settings of the LCNN network, refer to previous literature [19].

It can be seen from the experimental results that the detection models based on deep learning perform better than the detector based on the probability statistical model, and the performance of ResNet is basically the same as that of LCNN. In order to evaluate the performance improvement that the proposed variable convolution layer and time–frequency attention module bring to the general ResNet network, the model ResNet + DefCNN with only the variable convolution layer and the model ResNet + TF-AM with only the time–frequency attention module are designed. The results show that adding either the variable convolution or the time–frequency attention module improves the detection performance, and the time–frequency attention module improves the performance more obviously. ResNet + DefCNN + TF-AM with both modules has the best detection performance: t-DCF reaches 0.025, and EER is as low as 1.03%. The experimental results show that the addition of variable convolution and the time–frequency attention model can greatly improve the accuracy of replay speech detection.

4 Conclusion

Voiceprint recognition technology is a kind of biometric technology that identifies a speaker by voice. In practical applications, this technology is subject to many kinds of spoofing attacks, among which the voice replay attack is the most common. In order to address the vulnerability of existing voiceprint recognition systems to replay speech attacks, this article proposed a replay attack detection model based on DNNs.

The detection model designed in this article adds variable convolution and time–frequency attention modules to the classical ResNet model, which improves the network’s learning ability and makes it more sensitive in automatically distinguishing natural speech from replayed speech. Experiments on the ASVspoof 2019 dataset show that the proposed algorithm can effectively detect replay speech attacks, with t-DCF and EER reaching 0.025 and 1.03%, respectively. The model achieves a good EER on this standard dataset. In future work, it will be tested and improved in combination with specific application scenarios of public security and judicial institutions.

The main shortcoming of the algorithm in this article lies in the hyperparameter tuning during model training, which mainly relies on manual trial and error guided by experience and takes a lot of time. In future work, we will build a model for each hyperparameter combination with the help of grid search, evaluate the performance of each model, and select the model and parameter values that produce the best results.

  1. Funding information: This work was supported by the Key Technologies R&D Program of He’nan Province under Grant Nos. 212102210084 and 222102210048, the Foundation of He’nan Educational Committee under Grant No. 18A520047, and the Scientific Research Innovation Team of Xuchang University under Grant No. 2022CXTD003.

  2. Conflict of interest: The authors declare that there is no conflict of interest.

  3. Data availability statement: The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

[1] Wu Z, Kinnunen T, Evans N, Yamagishi J, Hanilçi C, Sahidullah M. ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge. In: Proc. Interspeech 2015; 2015. doi: 10.21437/Interspeech.2015-462.

[2] Wu Z, Yamagishi J, Kinnunen T, Hanilçi C, Sahidullah M, Sizov A, et al. ASVspoof: The automatic speaker verification spoofing and countermeasures challenge. IEEE J Sel Top Signal Process. 2017;11(4):588–604. doi: 10.1109/JSTSP.2017.2671435.

[3] Adiban M, Sameti H, Shehnepoor S. Replay spoofing countermeasure using autoencoder and siamese networks on ASVspoof 2019 challenge. Comput Speech Lang. 2020;64:101105. doi: 10.1016/j.csl.2020.101105.

[4] Kamble MR, Tak H, Patil HA. Amplitude and frequency modulation-based features for detection of replay spoof speech. Speech Commun. 2020;125(4):114–27. doi: 10.1016/j.specom.2020.10.003.

[5] Kamble MR, Patil HA. Analysis of reverberation via Teager energy features for replay spoof speech detection. In: Proc. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2019. doi: 10.1109/ICASSP.2019.8683830.

[6] Mankad SH, Garg S. On the performance of empirical mode decomposition-based replay spoofing detection in speaker verification systems. Prog Artif Intell. 2020;9:325–39. doi: 10.1007/s13748-020-00216-0.

[7] Wu Z, Gao S, Cling ES, Li H. A study on replay attack and anti-spoofing for text-dependent speaker verification. In: Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA); 2014. doi: 10.1109/APSIPA.2014.7041636.

[8] Witkowski M, Kacprzak S, Żelasko P, Kowalczyk K, Gałka J. Audio replay attack detection using high-frequency features. In: Proc. Interspeech 2017; 2017. doi: 10.21437/Interspeech.2017-776.

[9] Wang X, Xiao Y, Zhu X. Feature selection based on CQCCs for automatic speaker verification spoofing. In: Proc. Interspeech 2017; 2017. doi: 10.21437/Interspeech.2017-304.

[10] Patil HA, Kamble MR, Patel TB, Soni M. Novel variable length Teager energy separation based instantaneous frequency features for replay detection. In: Proc. Interspeech 2017; 2017. doi: 10.21437/Interspeech.2017-1362.

[11] Font R, Espín JM, Cano MJ. Experimental analysis of features for replay attack detection — Results on the ASVspoof 2017 challenge. In: Proc. Interspeech 2017; 2017. doi: 10.21437/Interspeech.2017-450.

[12] Jelil S, Das RK, Prasanna SRM, Sinha R. Spoof detection using source, instantaneous frequency and cepstral features. In: Proc. Interspeech 2017; 2017. doi: 10.21437/Interspeech.2017-930.

[13] Zhang C, Yu C, Hansen JHL. An investigation of deep-learning frameworks for speaker verification antispoofing. IEEE J Sel Top Signal Process. 2017;11(4):684–94. doi: 10.1109/JSTSP.2016.2647199.

[14] Nagarsheth P, Khoury E, Patil K, Garland M. Replay attack detection using DNN for channel discrimination. In: Proc. Interspeech 2017; 2017. doi: 10.21437/Interspeech.2017-1377.

[15] Fatehi N, Alasad Q, Alawad M. Towards adversarial attacks for clinical document classification. Electronics. 2023;12(1):129. doi: 10.3390/electronics12010129.

[16] Patel TB, Patil HA. Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech. In: Proc. Interspeech 2015; 2015. doi: 10.21437/Interspeech.2015-467.

[17] Williams J, Rownicka J. Speech replay detection with x-vector attack embeddings and spectral features. In: Proc. Interspeech 2019; 2019. doi: 10.21437/Interspeech.2019-1760.

[18] Kumar DR, Yang JC, Li HZ. Data augmentation with signal companding for detection of logical access attacks. In: Proc. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2021.

[19] Avila AR, Alam J, Prado FOC, O’Shaughnessy D, Falk TH. On the use of blind channel response estimation and a residual neural network to detect physical access attacks to speaker verification systems. Comput Speech Lang. 2021;66:101163. doi: 10.1016/j.csl.2020.101163.

[20] Lavrentyeva G, Novoselov S, Tseren A, Volkova M, Gorlanov A, Kozlov A. STC antispoofing systems for the ASVspoof2019 challenge. In: Proc. Interspeech 2019; 2019. doi: 10.21437/Interspeech.2019-1768.

[21] Huang C, Sorger VJ, Miscuglio M, Al-Qadasi M, Mukherjee A, Lampe L, et al. Prospects and applications of photonic neural networks. Adv Phys X. 2022;7(1):1981155. doi: 10.1080/23746149.2021.1981155.

[22] Zeinali H, Stafylakis T, Athanasopoulou G, Rohdin J, Gkinis I, Burget L, et al. Detecting spoofing attacks using VGG and SincNet: BUT-Omilia submission to ASVspoof 2019 challenge. In: Proc. Interspeech 2019; 2019. doi: 10.21437/Interspeech.2019-2892.

[23] Liu M, Wang L, Dang J, Nakagawa S, Guan HT, Li XG. Replay attack detection using magnitude and phase information with attention-based adaptive filters. In: Proc. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2019. doi: 10.1109/ICASSP.2019.8682739.

[24] Wu Z, Shen C, van den Hengel A. Wider or deeper: Revisiting the ResNet model for visual recognition. Pattern Recognition; 2016.

[25] Zhu J, Fang L, Ghamisi P. Deformable convolutional neural networks for hyperspectral image classification. IEEE Geosci Remote Sens Lett. 2018;15(8):1254–8. doi: 10.1109/LGRS.2018.2830403.

[26] Liu X, Wang X, Sahidullah M, Patino J, Delgado H, Kinnunen T, et al. ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild. IEEE/ACM Trans Audio Speech Lang Process; 2022. https://arxiv.org/abs/2210.02437.

Received: 2022-11-14
Revised: 2023-04-06
Accepted: 2023-04-27
Published Online: 2023-07-18

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
