
Audio Splicing Detection using Convolutional Neural Network

Shital Jadhav, Rashmika Patole, Priti Rege
Department of Electronics and Telecommunication, College of Engineering Pune
[email protected], [email protected], [email protected]

10th ICCCNT 2019, July 6-8, 2019, IIT Kanpur, Kanpur, India

Abstract—Audio forensics includes audio authentication, in which a major investigation topic is audio tampering detection. In this paper, we present a novel method of splicing detection using a convolutional neural network (CNN). Since high-level features of audio are effectively estimated by a convolutional neural network, the frequency spectrogram of the audio is fed directly as input to the network. The proposed work uses 10-13 s anechoic audio signals into which an audio slice of 1 s, 2 s, or 3 s is inserted at the middle of the audio. Results show that insertion of the 3 s part gives better accuracy than the other two: slice_1 insertion gives 82.80% accuracy, slice_2 gives 87.54%, and slice_3 gives 96.67%. The proposed method is also robust to audio compression as well as to additive white Gaussian noise.

Index Terms—Audio Authentication, Splicing Detection, Tampering Detection, Convolutional Neural Network, Spectrogram, STFT.

I. INTRODUCTION

Audio authentication is a preliminary task in the area of audio forensics whenever an audio clip is used as a piece of evidence. Nowadays, due to advances in technology, altering audio is easy. Tampering of audio is made by splicing, deletion, or copy-move operations [8][9]. In a splicing forgery, a segment taken from one audio recording is inserted at the start, middle, or end of another recording [8][7][15]. Such tampering may be present in manipulated audio, so detection of splicing is worthwhile in audio forensics.

In previous work on audio splicing detection, features are extracted such as the Electric Network Frequency (ENF), Linear Prediction Coefficients (LPC), Mel Frequency Cepstral Coefficients (MFCC), and the decay rate parameter (a reverberation parameter). Inconsistency in these features is an indication of tampering. The calculated features are then applied to a classifier such as a Support Vector Machine (SVM), Hidden Markov Model (HMM), or Gaussian Mixture Model (GMM) for detection of tampering [8][12]. Deep learning has not yet been widely used in the field of forgery detection.

In image forensics, substantial progress has been made on images and video using CNNs, for example in face identification and anomaly detection. In audio forensics, much work has been completed on speaker identification and speech recognition, and some work on audio recapture detection [6][5] and acoustic scene classification [13]. Up to now, little research has been done on audio tampering detection using deep learning. To our knowledge, a CNN is used here for the first time for splicing detection.

The rest of the paper is organized as follows: Section II reviews related work on audio authentication. Section III discusses the proposed work and gives the CNN model. Section IV gives details of the experimental setup and the results. Future work and the conclusion are presented in Section V.

II. RELATED WORK

Various methods have so far been used in forensics for audio authentication. They are divided into two categories:
(i) Passive techniques, which focus on authenticating the audio using the signal and its properties.
(ii) Active techniques, which embed extra information such as a watermark in the audio [2].

With passive techniques, an abrupt change in the audio can be observed, which is why passive techniques are preferable. Watermarking can be defeated, as forgeries can be made without changing the watermark embedded in the audio.

Authentication may also include recording device identification. Different recording devices have distinct signatures, which are useful to verify the recording location and eventually determine the recording's ownership [1]. No recording device contains an ideal voltage regulator, so audio recorded on a device carries a trace: an electric network frequency (ENF) signal that is a remnant of the mains signal. The ENF varies slowly around its nominal 50 Hz or 60 Hz, and in densely populated areas its fluctuation is tightly controlled; to determine the recording location and to detect tampering, the ENF is extracted from the audio and compared with a database of ENF values [6].

The properties extracted from audio differ for different recording environments, so identification of the environment is useful for authentication. Using the Room Impulse Response (RIR) and the reverberation component, identification of the environment is possible [10][11]. Reverberation is the presence of sound after the source terminates, so recorded audio consists of direct sound, early reverberation, and late reverberation, which is why temporal and spectral smearing is present in recorded audio [4].


Reverberation differs from room to room according to the room geometry, and compression does not affect the reverberation parameters present in an audio recording [12].

The local noise level in an audio clip shows a different structure when the audio has been tampered with. This structural difference, i.e., abnormality in the local noise level, is estimated using kurtosis. The local noise estimator requires no knowledge of the file format or recording device, which makes it well suited for tampering detection [3].

The presence of two or more ENF, MFCC, DRD, or local noise level signatures in an audio clip is an indication of forgery [12]. Sudden changes in phase are also an indication of tampering [1], so audio authentication includes a frequency spectrogram test that detects such phase changes. Instead of explicitly extracting these features and applying them to various machine learning algorithms, the use of deep learning is advantageous: the frequency spectrogram of the audio is applied directly as input, from which high-level features are effectively estimated.

In recaptured audio detection, CNNs give good results. When audio is recaptured, there is a variation in the noise level and the ENF present in the audio [6]; a CNN is able to differentiate this structural difference, which cannot be discerned through listening or visual tests [5].

When the inputs are highly correlated, a CNN gives good results because it extracts dense features; that is, a CNN is used not only as a classifier but also as a feature extractor [14]. CNNs are booming in image forensics because of their automatic feature extraction, but a CNN requires a large database; nowadays, due to the widespread use of multimedia, huge databases are available, and we exploit the same in audio forensics.
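The estimator in [3] is more elaborate, but a minimal sketch of the underlying idea, flagging frames whose sample kurtosis deviates from the rest of the clip, could look as follows in Python (the frame length and threshold are illustrative assumptions, not values from [3]):

```python
import numpy as np
from scipy.stats import kurtosis

def local_kurtosis_anomalies(audio, frame_len=2048, z_thresh=3.0):
    """Flag frames whose sample kurtosis deviates strongly from the
    clip-wide average; a crude proxy for local-noise-level inconsistency,
    since spliced-in material recorded under different conditions tends
    to shift the frame statistics."""
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    k = kurtosis(frames, axis=1)                 # per-frame kurtosis
    z = (k - k.mean()) / (k.std() + 1e-12)       # standardize across the clip
    return np.where(np.abs(z) > z_thresh)[0]     # indices of suspicious frames
```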
III. PROPOSED WORK iii) ACTIVATION FUNCTION
A. SPECTROGRAM ANALYSIS AND PREPROCESSING

Since the CNN requires its input in the form of an image, preprocessing of the audio is required: the spectrogram of the audio is computed and applied as input to the CNN. Because the frequency content of audio varies continuously over time, the short-time Fourier transform (STFT) is significant in both the time and frequency domains; it gives time-localized frequency information of the signal, as shown in Fig. 1. The STFT is represented as a matrix whose rows are associated with frequency and whose columns are associated with time. The magnitude of this matrix is the spectrogram, and the spectrogram is treated as a mono-channel image at the input of the CNN.

Figure 1. Spectrogram of original audio
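As a concrete illustration, this preprocessing can be sketched with scipy (a minimal sketch assuming a mono WAV input; the frame durations and overlaps actually compared in the paper are listed in Section IV):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

def audio_to_spectrogram(path, frame_ms=20, overlap=0.5):
    """Magnitude STFT of a mono WAV file as a 2-D array (rows = frequency,
    columns = time), used as a mono-channel image input to the CNN."""
    fs, x = wavfile.read(path)
    x = x.astype(np.float32)
    nperseg = int(fs * frame_ms / 1000)        # frame size, e.g. 20 ms
    noverlap = int(nperseg * overlap)          # e.g. 50% overlap
    _, _, Z = stft(x, fs=fs, window='hamming',
                   nperseg=nperseg, noverlap=noverlap)
    return np.abs(Z)                           # spectrogram = |STFT|
```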
B. PROPOSED NETWORK ARCHITECTURE

The convolutional neural network is a modified version of the neural network and falls under deep learning. In audio, CNNs can be used for speech recognition, speaker identification, and music information retrieval. A CNN has multiple layers with many neurons that adjust weights and biases among themselves; the sequence of layers transfers values from one layer to the next through an activation function. We use an 11-layer CNN architecture with 2 convolutional layers, 2 batch normalization layers, 2 ReLU layers, and 1 max pooling layer.

i) CONVOLUTIONAL LAYER
   The main building block of the network is the convolutional layer, in which the basic convolution operation is performed on the input: a filter traverses the whole input. Two convolutional layers are used in this network, with 16 and 32 filters respectively.
ii) BATCH NORMALIZATION LAYER
   Because the input to the convolutional layer is an STFT matrix, which can be very large, this layer is used for normalization: the mean is subtracted and the result is divided by the standard deviation. Scaling the activations improves the regularization effect, which reduces overfitting of the kernel outputs. Batch normalization increases network complexity, but it gives better performance, speeds up processing, and achieves the same accuracy with fewer epochs and a higher learning rate.
iii) ACTIVATION FUNCTION
   The activation function acts like the transfer function of the CNN, producing output in the range -1 to 1 or 0 to 1. The Rectified Linear Unit (ReLU) maps input to output by a piecewise linear function: it is the identity for inputs greater than zero and zero for inputs less than zero.
iv) MAX POOLING LAYER
   The pooling layer is used mainly for dimensionality reduction: the input is divided into non-overlapping frames of a region R, and the operation takes the maximum value within each frame. Because of the pre-filtering, the input spectral representation becomes sparse, so the max pooling layer reduces the amount of data and gives robust features.
v) FULLY CONNECTED LAYER
   The function of a fully connected layer is to classify an input image into its respective class using the training database. For classification, the fully connected layer combines the features extracted by the lower layers in an abstracted form; a softmax layer is used as the activation function, producing a probability distribution whose values sum to 1. In this way, the CNN transfers the input from the first layer to the last with adjustment of weights and biases and gives the result as original or forged. A sketch of this architecture appears below.
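The paper gives layer counts but not kernel sizes or other hyperparameters, so the following Keras sketch is only one plausible reading of this description: two convolutional layers with 16 and 32 filters, each followed by batch normalization and ReLU, one max pooling layer, and a fully connected softmax layer deciding original versus forged (kernel size, pool size, input shape, and optimizer are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_splicing_cnn(input_shape):
    """CNN along the lines of Section III-B: conv(16) -> BN -> ReLU ->
    conv(32) -> BN -> ReLU -> max pool -> FC softmax over {original, forged}."""
    return models.Sequential([
        layers.Input(shape=input_shape),         # (freq_bins, time_frames, 1)
        layers.Conv2D(16, (3, 3), padding='same'),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.Conv2D(32, (3, 3), padding='same'),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.MaxPooling2D(pool_size=(2, 2)),   # non-overlapping regions R
        layers.Flatten(),
        layers.Dense(2, activation='softmax'),   # probabilities sum to 1
    ])

# Input shape depends on the STFT settings chosen in Section III-A.
model = build_splicing_cnn((81, 1000, 1))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```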


Figure 2. CNN Architecture

IV. EXPERIMENTAL SETUP

The complete process is divided into three steps; the proposed procedure is shown in Fig. 3.

• For preprocessing, the audio is divided into frames of different sizes and a Hamming window, which gives less spectral leakage, is applied; after that, the FFT is computed for each frame. The absolute value of each frame after the FFT gives the spectrogram, which is the input to the CNN.
• After the input is applied, the main step is feature extraction using the CNN, and the CNN is trained for further classification using all of its layers.
• The last stage is classification, which is done by the softmax layer.

Figure 3. Block Diagram of Proposed Work

A. DATABASE

In this paper, a database of digit utterances by four speakers having low, medium, and high pitch is used. This four-speaker dataset is available as the free spoken digit dataset (FSDD). The available database consists of the individual digits 0 through 9. For the original database, the digits of an individual speaker are concatenated with some silence between consecutive digits, so each original audio contains the utterances of the digits 0 to 9 by one speaker and is 9 to 10 s long.

Since the four-speaker digit utterance database is available, the splicing dataset is made by inserting digits of one speaker into another speaker's original digit utterances: a random digit from one of the other three speakers is inserted in the middle of the audio, after the fifth digit. First only 1 digit (of almost 1 s) is inserted, then 2 digits, and then 3 digits of another speaker, which creates three types of spliced database; the spliced audio length varies from 10 to 13 s. For the CNN, 4000 original and 4400 spliced audios are selected at random, of which 70% are used for training and 30% for testing. In order to extract the spectrograms, windowing is first performed on the audio with different durations and different overlaps for result comparison. A sketch of the splicing procedure follows.
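A minimal sketch of this splicing procedure (FSDD recordings are 8 kHz mono; the inter-digit silence length is an illustrative assumption, as the paper does not specify it):

```python
import numpy as np

FS = 8000                                        # FSDD sampling rate

def make_spliced(original_digits, foreign_digits, insert_after=5):
    """Concatenate one speaker's digits 0-9 with short silences and insert
    1-3 digits of another speaker after the fifth digit (Section IV-A)."""
    gap = np.zeros(int(0.2 * FS), dtype=np.float32)   # assumed 0.2 s silence
    parts = []
    for i, digit in enumerate(original_digits):       # 10 arrays, digits 0-9
        parts.extend([digit, gap])
        if i + 1 == insert_after:                     # splice point: middle
            for foreign in foreign_digits:            # 1, 2, or 3 digits
                parts.extend([foreign, gap])
    return np.concatenate(parts)
```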
accuracy percentage and error rate, where accuracy percentage
is the ratio of a total number of correctly classified output to
the total number of samples in test dataset. Error rate is the
ratio of the total number of misclassified outputs to the total
number of samples in the test dataset.

Correctly Classif ied Samples


Accuracy = × 100 (1)
T otal N umber of Samples

M isclassif ied Samples


Figure 3. Block Diagram of Proposed Work Error Rate = × 100 (2)
T otal N umber of Samples
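Equations (1) and (2) amount to the following computation on the held-out test split (a trivial sketch):

```python
import numpy as np

def accuracy_and_error(y_true, y_pred):
    """Accuracy (Eq. 1) and error rate (Eq. 2) as percentages."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracy = 100.0 * np.sum(y_true == y_pred) / len(y_true)
    return accuracy, 100.0 - accuracy            # error rate = 100 - accuracy

print(accuracy_and_error([0, 1, 1, 0], [0, 1, 0, 0]))   # (75.0, 25.0)
```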

Figure 4. Accuracy of all slice databases with a 20 ms frame and 50% overlap


Figure 5. Accuracy comparison on the database with varying frame size and overlap

The accuracy percentages of slice_1, slice_2, and slice_3 with a 20 ms frame and 50% overlap are shown in Fig. 4. According to the results, when a larger part is inserted, the CNN is better able to differentiate the original and spliced parts.

C. RESULTS

In the STFT there is a trade-off between time and frequency resolution as the frame duration varies: a narrow window gives better resolution in the time domain, while a broad window gives better resolution in the frequency domain. Therefore, the comparison is performed on the dataset using frame durations of 20 ms, 30 ms, and 50 ms, each with 50% overlap, 25% overlap, and no overlap, to assess the significance of the STFT parameters; the results are shown in Fig. 5.

According to the results for slice_1, slice_2, and slice_3, there is no significant variation in accuracy across the spliced databases when the frame duration is 20 ms with 50% overlap, so the 20 ms frame with 50% overlap is used for the performance checks with noise attack, compression, and silence removal. As the recordings are anechoic, white Gaussian noise is added to the created database at a signal-to-noise ratio of 3 dB; the results show little variation in accuracy.

The presence of silence in an audio clip increases its length. Since this database is anechoic, the silence parts contain no information and processing them is a waste of time, so for the performance check the silence is removed from the audio by applying a threshold: values below the threshold are considered silence. Because of the silence removal, however, accuracy decreases to 81.95% for the slice_3 part. The reason for this decrease is that segments where the pitch is low are sometimes treated as silence, which degrades performance.

In the past, once audio was compressed it was difficult to detect tampering, but with the CNN, compression does not affect accuracy. A dynamic range compressor, which attenuates loud sounds, is used. Both perturbations are sketched below.

Figure 6. Performance of the method with noise attack, compression, and silence removal
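A minimal sketch of these two perturbations (the 3 dB SNR is from the paper; the silence threshold and frame length are illustrative assumptions):

```python
import numpy as np

def add_awgn(x, snr_db=3.0):
    """Add white Gaussian noise at the given signal-to-noise ratio."""
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    return x + np.random.randn(len(x)) * np.sqrt(p_noise)

def remove_silence(x, frame_len=160, thresh=0.01):
    """Drop frames whose mean absolute amplitude is below a threshold.
    Low-pitched speech can also fall below the threshold, which explains
    the accuracy drop to 81.95% for slice_3 after silence removal."""
    frames = [x[i:i + frame_len] for i in range(0, len(x), frame_len)]
    kept = [f for f in frames if np.mean(np.abs(f)) >= thresh]
    return np.concatenate(kept) if kept else x[:0]
```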
V. CONCLUSION

To our knowledge, this is the first paper to employ a convolutional neural network for audio splicing detection. The model is able to extract high-level features from the spectrogram of the audio, which acts as an image at the input of the convolutional neural network, and to classify on its own, so the proposed method can be used effectively for forensic authentication, detecting tampering directly. We are able to demonstrate audio splicing detection with high accuracy and robustness to noise attack and compression. Although training requires a huge database and considerable computational power, this approach is useful for audio forensics. As the database contains four speakers and splicing is performed by inserting a different speaker's digits, a future goal is to detect splicing when the insertion is performed with recaptured audio or with the same speaker's audio recorded in a different environment.


A limitation of this paper is that, because the CNN extracts features directly from the spectrogram of the audio, we are not able to define the significance of those features.

REFERENCES

[1] Daniel Patricio Nicolalde and Jose Antonio Apolinario. "Evaluating digital audio authenticity with spectral distances and ENF phase change". In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE. 2009, pp. 1417–1420.
[2] Swati Gupta, Seongho Cho, and C-C Jay Kuo. "Current developments and future trends in audio authentication". In: IEEE MultiMedia 19.1 (2011), pp. 50–59.
[3] Xunyu Pan, Xing Zhang, and Siwei Lyu. "Detecting splicing in digital audios using local noise level estimation". In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2012, pp. 1841–1844.
[4] Hafiz Malik. "Acoustic environment identification and its applications to audio forensics". In: IEEE Transactions on Information Forensics and Security 8.11 (2013), pp. 1827–1837.
[5] Da Luo, Haojun Wu, and Jiwu Huang. "Audio recapture detection using deep learning". In: 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP). IEEE. 2015, pp. 478–482.
[6] Xiaodan Lin, Jingxian Liu, and Xiangui Kang. "Audio recapture detection with convolutional neural networks". In: IEEE Transactions on Multimedia 18.8 (2016), pp. 1480–1487.
[7] Hong Zhao et al. "Anti-Forensics of Environmental-Signature-Based Audio Splicing Detection and Its Countermeasure via Rich-Features Classification". In: IEEE Transactions on Information Forensics and Security 11.7 (2016), pp. 1603–1617.
[8] Z. Ali, M. Imran, and M. Alsulaiman. "An Automatic Digital Audio Authentication/Forensics System". In: IEEE Access 5 (2017), pp. 2994–3007. ISSN: 2169-3536. DOI: 10.1109/ACCESS.2017.2672681.
[9] M. Imran et al. "Blind Detection of Copy-Move Forgery in Digital Audio Forensics". In: IEEE Access 5 (2017), pp. 12843–12855. ISSN: 2169-3536. DOI: 10.1109/ACCESS.2017.2717842.
[10] Miloš Marković and Jürgen Geiger. "Reverberation-based feature extraction for acoustic scene classification". In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 781–785.
[11] Prateek Murgai, Mark Rau, and Jean-Marc Jot. "Blind estimation of the reverberation fingerprint of unknown acoustic environments". In: Audio Engineering Society Convention 143. Audio Engineering Society. 2017.
[12] Rashmika Patole, Gunda Kore, and Priti Rege. "Reverberation based tampering detection in audio recordings". In: Audio Engineering Society Conference: 2017 AES International Conference on Audio Forensics. Audio Engineering Society. 2017.
[13] Michele Valenti et al. "A convolutional neural network approach for acoustic scene classification". In: 2017 International Joint Conference on Neural Networks (IJCNN). IEEE. 2017, pp. 1547–1554.
[14] Michele Valenti et al. "A convolutional neural network approach for acoustic scene classification". In: 2017 International Joint Conference on Neural Networks (IJCNN). IEEE. 2017, pp. 1547–1554.
[15] Hong Zhao et al. "Audio splicing detection and localization using environmental signature". In: Multimedia Tools and Applications 76.12 (2017), pp. 13897–13927.
