Scream and Gunshot Detection and Localization For Audio-Surveillance Systems
2 Audio Features

A considerable number of audio features have been used for the tasks of audio analysis and content-based audio retrieval. Traditionally, these features have been classified into temporal features, e.g. Zero Crossing Rate (ZCR); energy features, e.g. Short Time Energy (STE); spectral features, e.g. spectral moments, spectral flatness; and perceptual features, e.g. loudness, sharpness or Mel Frequency Cepstral Coefficients (MFCCs). In this work, we have chosen to discard audio features which are too sensitive to the SNR conditions, like STE and loudness. In addition to the traditional features listed above, we employ some features which have not been used before in similar works, such as spectral distribution descriptors (spectral slope, spectral decrease, spectral roll-off) and periodicity descriptors. In this paper we also introduce a few innovative features based on the auto-correlation function: correlation roll-off, correlation decrease, correlation slope, modified correlation centroid and correlation kurtosis.

These features are similar to the spectral distribution descriptors (spectral roll-off, spectral decrease and spectral slope [8]), but, in lieu of the spectrogram, they are computed starting from the auto-correlation function of each frame. The goal of these features is to describe the energy distribution over different time lags. For impulsive noises, like gunshots, much of the energy is concentrated in the first time lags, while for harmonic sounds, like screams, the energy is spread over a wider range of time lags. Features based on the auto-correlation function are labeled in two different ways, filtered or not filtered, depending on whether the auto-correlation function is computed on a band-pass filtered version of the signal or on the original signal. The rationale behind this filtering approach is that much of the energy of some signals (e.g. screams) is distributed in a relatively narrow range of frequencies; thus the auto-correlation function of the filtered signal is much more robust to noise. In this paper, the limits of the frequency range for filtering the auto-correlation function have been fixed to 1000-2500 Hz: experimental results have shown that most of the energy of the scream harmonics is concentrated in this frequency range.

Table 1 lists the feature set composition. All the features are extracted from 23 ms analysis frames (at a sampling frequency of 22050 Hz) with 1/3 overlap.

  #      Feature type            Features                                        Ref.
  1      Temporal                ZCR                                             [7]
  2-6    Spectral                4 spectral moments + SFM                        [8]
  7-36   Perceptual              30 MFCCs                                        [9]
  37-39  Spectral distribution   spectral slope, spectral decrease,              [8]
                                 spectral roll-off
  40-49  Correlation-based       (filtered) periodicity, (filtered) correlation  [7][8]
                                 slope, decrease and roll-off, modified
                                 correlation centroid, correlation kurtosis

  Table 1: Audio features used for classification.
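To make the correlation-based descriptors concrete, here is a minimal sketch of a correlation roll-off, built by direct analogy with the spectral roll-off; the function name, the 85% threshold and the fourth-order Butterworth band-pass are our own illustrative choices (the paper fixes only the 1000-2500 Hz band for the filtered variants).

```python
import numpy as np
from scipy.signal import butter, lfilter

def correlation_rolloff(frame, fs=22050, threshold=0.85, band=None):
    """Roll-off lag of a frame's auto-correlation: the smallest lag such
    that the cumulative |r(l)| reaches `threshold` of the total. Impulsive
    sounds (gunshots) concentrate energy at small lags; harmonic sounds
    (screams) spread it over a wider range of lags."""
    if band is not None:  # "filtered" variant, e.g. band=(1000, 2500)
        b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
        frame = lfilter(b, a, frame)
    r = np.abs(np.correlate(frame, frame, mode="full")[len(frame) - 1:])  # lags >= 0
    c = np.cumsum(r)
    return int(np.searchsorted(c, threshold * c[-1]))
```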
3 Feature Selection

Starting from the full set of 49 features, we can build a feature vector of any dimension l, 1 ≤ l ≤ 49. It is desirable to keep l small in order to reduce the computational complexity of the feature extraction process and to limit the overfitting produced by the increasing number of parameters associated with the features in the classification model.

Two main feature selection approaches have been discussed in the literature. In the filter method, the feature selection algorithm filters out features that have little chance to be useful for classification, according to some performance evaluation metric calculated directly from the data, without direct feedback from the particular classifier used. In the second approach, known as the wrapper approach, the performance evaluation metric is some form of feedback provided by the classifier (e.g. accuracy). Wrapper approaches generally outperform filter methods, since they are tightly coupled with the employed classifier, but they require much more computation time.

The feature selection process adopted in this work is a hybrid filter/wrapper method. First, a feature subset of size l is assembled from the full set of features according to some class-separability measure and a heuristic search algorithm, as detailed in Section 3.1. The resulting feature vector is evaluated by a GMM classifier, which returns a classification performance indicator for that subset (this procedure is explained in Section 3.2). Repeating this procedure for different l's, one can choose the feature vector dimension that optimizes the desired target performance.

3.1 Selection of a Feature Vector of size l

This section reviews some heuristic methods used to explore the feature space, searching for a (locally) optimal feature vector. We consider two kinds of search algorithms [10]: scalar methods and vectorial methods.

3.1.1 Scalar Selection

In this work, we adopt a feature selection procedure described in [10]. The method builds a feature vector iteratively, starting from the most discriminating feature and including at each step k the feature r̂ that maximizes the following function:

$$ J(r) = \alpha_1 C(r) - \frac{\alpha_2}{k-1} \sum_{i \in F_{k-1}} |\rho_{ri}|, \qquad r \neq i. \tag{1} $$

In words, Eq. (1) says that the feature to be included in the feature vector of dimension k has to be chosen from the set of features not yet included in the feature subset F_{k-1}. The objective function is composed of two terms: C(r) is a class separability measure of the r-th feature, while ρ_{ij} indicates the cross-correlation coefficient between the i-th and j-th features. The weights α1 and α2 determine the relative importance that we give to the two terms. In this paper, we use either the Kullback-Leibler divergence (KL) or the Fisher Discriminant Ratio (FDR) to compute the class separability C(r) [10].

Table 1: Audio features used for classification.
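The greedy search of Eq. (1) is compact enough to sketch directly. Below is a minimal numpy version (ours, not from the paper), using the FDR as the class separability C(r) for a two-class problem; function names and defaults are illustrative.

```python
import numpy as np

def fdr(x, y):
    """Fisher Discriminant Ratio of a single feature, two-class labels y."""
    m0, m1 = x[y == 0].mean(), x[y == 1].mean()
    v0, v1 = x[y == 0].var(), x[y == 1].var()
    return (m0 - m1) ** 2 / (v0 + v1 + 1e-12)

def scalar_selection(X, y, l, a1=1.0, a2=1.0):
    """Greedy scalar selection maximizing Eq. (1).
    X: (n_samples, n_features) feature matrix, y: binary labels."""
    n_feat = X.shape[1]
    C = np.array([fdr(X[:, r], y) for r in range(n_feat)])
    rho = np.abs(np.corrcoef(X, rowvar=False))   # |cross-correlation| matrix
    selected = [int(np.argmax(C))]               # most discriminating feature
    while len(selected) < l:
        k = len(selected) + 1                    # dimension after inclusion
        J = a1 * C - (a2 / (k - 1)) * rho[:, selected].sum(axis=1)
        J[selected] = -np.inf                    # only features not in F_{k-1}
        selected.append(int(np.argmax(J)))
    return selected
```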
3.1.2 Vectorial Selection

The vectorial feature selection is carried out using the floating search algorithm [10]. This procedure builds a feature vector iteratively and, at each iteration, reconsiders features previously discarded or excludes from the current feature vector features selected in previous iterations. Though not optimal, this algorithm provides better results than scalar selection, but at an increased computational cost. The floating search algorithm requires the definition of a vectorial class separability metric. In the proposed system, we use either one of the scatter-matrix criteria defined in [10] from the within-class and mixture scatter matrices (the J2 criterion employed in Section 6.2 is one of them); such criteria grow as the between-class variance increases relative to the within-class variances.

[Figure: classification accuracy as a function of the feature vector dimension l.]
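For reference, a sketch of the two scatter-matrix criteria J2 and J3 as defined in [10]; the text only names J2 explicitly, so using exactly this pair is our assumption.

```python
import numpy as np

def scatter_criteria(X, y):
    """Scatter-matrix separability criteria from [10].
    Sw: within-class scatter (prior-weighted class covariances);
    Sm: mixture scatter (covariance of the pooled data).
    Returns J2 = |Sm|/|Sw| and J3 = trace(Sw^-1 Sm)."""
    Sw = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        Xc = X[y == c]
        Sw += (len(Xc) / len(X)) * np.cov(Xc, rowvar=False, bias=True)
    Sm = np.cov(X, rowvar=False, bias=True)
    J2 = np.linalg.det(Sm) / np.linalg.det(Sw)
    J3 = np.trace(np.linalg.solve(Sw, Sm))
    return J2, J3
```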
The decision about whether an event has occurred is then taken by computing the logical OR of the two classifiers.

5 Localization

5.1 Time Delay Estimation

The localization system employs a T-shaped microphone array composed of 4 sensors, spaced 30 cm apart from each other. The center microphone is taken as the reference sensor (hereafter referred to with the number 0), and the three Time Differences of Arrival (TDOAs) of the signal between the other microphones and the reference microphone are estimated. We use the Maximum-Likelihood Generalized Cross Correlation (GCC) method for estimating the time delays [12], i.e. we search for

$$ \hat{\tau}_{i0} = \arg\max_{\tau} \hat{\Psi}_{i0}(\tau), \qquad i = 1, 2, 3, \tag{5} $$
where

$$ \hat{\Psi}_{i0}(\tau) = \sum_{k=0}^{N-1} \frac{S_{x_i x_0}(k)}{|S_{x_i x_0}(k)|} \cdot \frac{|\gamma_{i0}(k)|^2}{|S_{x_i x_0}(k)|\,\left(1 - |\gamma_{i0}(k)|^2\right)} \cdot e^{j 2 \pi \tau k / N} \tag{6} $$

is the generalized cross correlation function, S_{x_i x_0}(k) = E{X_i(k) X_0*(k)} is the cross spectrum, X_i(k) is the discrete Fourier transform (DFT) of x_i(n), γ_{i0} is the Magnitude Squared Coherence (MSC) function between x_i and x_0, and N denotes the number of samples in the observation interval.
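For concreteness, here is a minimal numpy sketch (ours, not from the paper) of the search in Eqs. (5)-(6), assuming the cross spectrum and the MSC have already been estimated for the microphone pair (Section 5.1 obtains them via MVDR). The last lines apply the parabolic peak refinement of [13] discussed next, and the lag range anticipates the search-space narrowing of Section 6.3.

```python
import numpy as np

def gcc_ml_tdoa(Sx, msc, tau_max, N):
    """TDOA via the ML-weighted GCC of Eqs. (5)-(6).
    Sx:  cross spectrum S_{x_i x_0}(k), k = 0..N-1 (complex)
    msc: squared coherence |gamma_i0(k)|^2, same length
    tau_max: half-width of the lag search range, in samples."""
    eps = 1e-12
    phase = Sx / (np.abs(Sx) + eps)                     # S / |S|
    w = msc / ((np.abs(Sx) + eps) * (1.0 - msc + eps))  # ML weight of Eq. (6)
    taus = np.arange(-tau_max, tau_max + 1)
    k = np.arange(N)
    # Psi(tau) = sum_k (S/|S|) * w(k) * exp(j 2 pi tau k / N), on the lag grid
    psi = (np.exp(2j * np.pi * np.outer(taus, k) / N) * (phase * w)).sum(axis=1).real
    i = int(np.argmax(psi))
    if 0 < i < len(taus) - 1:                           # parabolic refinement [13]
        y0, y1, y2 = psi[i - 1], psi[i], psi[i + 1]
        return taus[i] + 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2 + eps)
    return float(taus[i])
```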
To increase the precision, the estimate of τ̂_{i0} can be refined by parabolic interpolation [13]. However, a fundamental requirement for increasing the performance of (5) is a high-resolution estimate of the cross-spectrum and of the coherence function. We use a non-parametric technique, known as minimum variance distortionless response (MVDR), to estimate the cross spectrum and therefore the MSC function [14]. The MVDR spectrum can be viewed as the output of a bank of filters, with each filter centered at one of the analysis frequencies. Following this approach, the MSC is given by:
$$ |\gamma_{i0}(k)|^2 = \frac{\left|\mathbf{f}_k^H \mathbf{R}_{ii}^{-1} \mathbf{R}_{i0} \mathbf{R}_{00}^{-1} \mathbf{f}_k\right|^2}{\left(\mathbf{f}_k^H \mathbf{R}_{ii}^{-1} \mathbf{f}_k\right)^2 \left(\mathbf{f}_k^H \mathbf{R}_{00}^{-1} \mathbf{f}_k\right)^2}, \tag{7} $$

where superscript H denotes the transpose conjugate of a vector or a matrix, R_xx = E{x(n) x(n)^H} indicates the covariance matrix of a signal x, f_k = 1/√L · [1, e^{jω_k}, ..., e^{jω_k(L−1)}]^T, and ω_k = 2πk/K, k = 0, 1, ..., K − 1. Assuming that K = L and observing that the matrices R have a Toeplitz structure, we can compute (7) efficiently by means of the Fast Fourier Transform. In our experiments we set K = L = 200 and an observation time of N = 4096 samples.
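The sketch below evaluates Eq. (7), as reconstructed above, in direct form; the paper exploits the Toeplitz structure and the FFT for speed, while here the biased sample correlations and explicit matrix inverses are our simplifications for readability.

```python
import numpy as np
from scipy.linalg import toeplitz

def mvdr_msc(x, y, L=200, K=200):
    """Direct-form MVDR-based squared coherence of Eq. (7).
    x, y: zero-mean microphone signals of equal length N >= L."""
    N = len(x)

    def r(u, v, lag):  # biased sample cross-correlation at a given lag
        return np.dot(u[lag:], v[:N - lag]) / N if lag >= 0 else r(v, u, -lag)

    lags = range(L)
    Rxx = toeplitz([r(x, x, l) for l in lags])           # Toeplitz covariances
    Ryy = toeplitz([r(y, y, l) for l in lags])
    Rxy = toeplitz([r(x, y, -l) for l in lags],          # cross-covariance;
                   [r(x, y, l) for l in lags])           # lag sign only flips the delay
    Rxx_inv, Ryy_inv = np.linalg.inv(Rxx), np.linalg.inv(Ryy)
    msc = np.empty(K)
    for k in range(K):
        f = np.exp(2j * np.pi * k / K * np.arange(L)) / np.sqrt(L)  # f_k
        num = np.abs(f.conj() @ Rxx_inv @ Rxy @ Ryy_inv @ f) ** 2
        den = ((f.conj() @ Rxx_inv @ f).real *
               (f.conj() @ Ryy_inv @ f).real) ** 2
        msc[k] = num / den
    return msc
```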
5.2 Source Localization

Unlike popular localization algorithms, the approach we use needs no far-field hypothesis about the source location, and is based on the spherical error function [6]

$$ \mathbf{e}_{sp}(\mathbf{r}_s) = \mathbf{A}\boldsymbol{\theta} - \mathbf{b}, \tag{8} $$

where

$$ \mathbf{A} \triangleq \begin{bmatrix} x_1 & y_1 & d_{10} \\ x_2 & y_2 & d_{20} \\ x_3 & y_3 & d_{30} \end{bmatrix}, \quad \boldsymbol{\theta} \triangleq \begin{bmatrix} x_s \\ y_s \\ R_s \end{bmatrix}, \quad \mathbf{b} \triangleq \frac{1}{2} \begin{bmatrix} R_1^2 - d_{10}^2 \\ R_2^2 - d_{20}^2 \\ R_3^2 - d_{30}^2 \end{bmatrix} \tag{9} $$

for a two-dimensional problem. The pairs (x_i, y_i) are the coordinates of the i-th microphone, (x_s, y_s) are the unknown coordinates of the sound source, R_i and R_s denote, respectively, the distance of microphone i and of the sound source from the reference microphone, and d_{i0} = c · τ̂_{i0}, with c being the speed of sound.

To find an estimate of the source location we solve the linear minimization problem

$$ \min_{\boldsymbol{\theta}} \; (\mathbf{A}\boldsymbol{\theta} - \mathbf{b})^T (\mathbf{A}\boldsymbol{\theta} - \mathbf{b}) \tag{10} $$

subject to the constraint x_s^2 + y_s^2 = R_s^2. The solution of (10) can be found in [6].
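A sketch of the estimator, with one simplification we should flag: for brevity it solves the unconstrained least-squares problem of Eq. (10) and ignores the constraint x_s^2 + y_s^2 = R_s^2, whose full treatment is in [6]. The reference microphone is assumed at the origin, so that R_i^2 = x_i^2 + y_i^2.

```python
import numpy as np

def spherical_localization(mics, d):
    """Least-squares source location from Eqs. (8)-(10).
    mics: (3, 2) array, coordinates of microphones 1..3
          (reference microphone 0 assumed at the origin)
    d:    (3,) range differences d_i0 = c * tdoa_i0 (meters)."""
    A = np.column_stack([mics[:, 0], mics[:, 1], d])   # Eq. (9)
    R2 = (mics ** 2).sum(axis=1)                       # R_i^2, reference at origin
    b = 0.5 * (R2 - d ** 2)
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)      # theta = [x_s, y_s, R_s]
    return theta[:2], theta[2]                         # position, source range
```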
6 Experimental Results

In our simulations we have used audio recordings taken from movie soundtracks and internet repositories. Some screams have been recorded live from people asked to shout into a microphone. Finally, noise samples have been recorded live in a public square in Milan.

6.1 Classification performance with varying SNR conditions

This experiment aims at verifying the effects of the noise level on the training and test sets. We have added noise both to the audio events of the training set and to those of the test set, changing the SNR from 0 to 20 dB with a 5 dB step. The performance indicators used in this test are the false rejection rate, defined in (4), and the false detection rate (FD), defined as follows:

$$ \mathrm{FD} = \frac{\text{number of detected events that were actually noise}}{\text{number of noise samples in the test set}}, \tag{11} $$

where, as usual, an event can be either a scream or a gunshot. The results for scream/noise classification are reported in Figure 2. As expected, performance degrades noticeably as the SNR of both training and test sequences decreases. In particular, as the training SNR decreases, the false detection rate tends systematically to increase. At the same time, once the training SNR has been fixed, a reduction of the SNR on the test set degrades performance further.

[Figure 2: scream/noise classification performance for varying training and test SNR conditions.]

[Figure 3: normalized MSE of the TDOA estimate (in dB) as a function of the SNR, from −25 to 25 dB.]
6.2 Combined system

Putting together the scream/noise classifier and the gunshot/noise classifier, we achieve a precision of 93% with a false rejection rate of 5%, using samples at 10 dB SNR. We have used a feature vector of 13 features for scream/noise classification, and a feature vector of 14 features for gunshot/noise classification. In both cases the J2 criterion has been employed. The two feature vectors are reported in Table 2.

6.3 TDE error with different SNR conditions

Localization has been evaluated against different values of SNR by mixing audio events with colored noise of pre-specified power. To generate the noise samples, we feed an AR process with white noise; the AR coefficients have been obtained by LPC analysis of ambient noise records. This is necessary to simulate isotropic noise conditions. TDOAs are estimated as explained in Section 5.1; we narrow the search space of Eq. (5) to time lags τ ∈ [−T_max, T_max], where T_max = ⌈d/c · f_s⌉, d is the distance between the microphones of a pair (here d = 30 cm) and f_s is the sampling frequency (f_s = 44100 Hz). The GCC peak estimate is refined using parabolic interpolation.
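A minimal sketch of this noise generation step (the LPC order and the power normalization are our own choices, not specified in the text):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def colored_noise_like(ambient, n_samples, order=12, power=1.0):
    """Colored noise for the mixing experiments: white noise driving an
    AR model whose coefficients come from LPC analysis of an ambient
    recording (autocorrelation method)."""
    r = np.correlate(ambient, ambient, mode="full")[len(ambient) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])  # normal equations R a = r
    ar = np.concatenate(([1.0], -a))               # A(z) = 1 - sum_k a_k z^-k
    noise = lfilter([1.0], ar, np.random.randn(n_samples))  # AR synthesis
    return noise * np.sqrt(power / np.mean(noise ** 2))     # set target power
```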
Figure 3 shows the mean square error (MSE) of the TDOA between a pair of microphones for a scream sample, normalized by (2T_max + 1)²/12, which corresponds to the variance of a uniform distribution over the search interval. Values in the figure are expressed in decibels, while the true time delay for the simulation has been set to 0 without any loss of generality. Analogous results are obtained for gunshot recordings. The so-called "threshold effect" in the performance of the GCC is clearly visible in the figure: below some threshold SNR*, in this example about −10 dB, the error of the time delay estimate suddenly degrades
until the estimated TDOA becomes just a random guess. This phenomenon agrees with theoretical results [13]. An immediate consequence of this behavior is that no steering is applied to the video-camera if the estimated SNR is below the threshold. This is feasible in our system since the audio stream is classified as either an audio event or ambient noise. Under the assumption that the two classes of sounds are uncorrelated, the SNR can be easily computed from the difference in power between events and noise, and tracked in real time.
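As an illustration of this SNR tracking, a minimal sketch (the frame powers and the event/noise labels are assumed to come from the classification stage; all names are ours):

```python
import numpy as np

def track_snr(frame_power, is_event, eps=1e-12):
    """Running SNR estimate from classified frames: the noise power is
    measured on frames labeled as ambient noise; the event power, assumed
    uncorrelated with the noise, is whatever exceeds it."""
    p_noise = np.mean(frame_power[~is_event]) + eps
    p_total = np.mean(frame_power[is_event])
    return 10.0 * np.log10(max(p_total - p_noise, eps) / p_noise)
```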
6.4 Localization error

The audio localization system has been tested by varying the actual position of the sound source, spanning a range of ±90° with respect to the axis of the array. A source positioned at −90° is on the left of the array, one positioned at 0° is in front of the array, while a source located at +90° is on the right. Figure 4 shows the standard deviation of the estimated source angle ϑ̂ for some SNRs above the threshold. For a T-shaped array, the expected angular error is symmetric around 0°. As can be seen from the graph, if the actual sound source is in the range [−80°, 80°], the standard deviation of ϑ̂ is below one degree, even at 0 dB SNR. As the sound source moves completely towards the left or the right of the array, the standard deviation of ϑ̂ increases, especially when the ambient noise level is higher. This behavior can be used to decide whether the video-camera should be zoomed or not. If ϑ̂ is known with sufficient precision, the camera can be zoomed to capture more details. If the estimate is uncertain, a wider angle should be used. A conservative policy could be to zoom the camera only if |ϑ̂| falls outside the interval [90° ± σ90], where σ90 is the standard deviation of ϑ̂ for a given SNR when the true angle is either 90° or −90°. For example, at 10 dB SNR, σ90 is approximately 20° (see Figure 4).

Figure 4: Standard deviation of the estimated angle ϑ̂ between the sound source and the axis of the array, as a function of the true angle. The distance of the source has been fixed to 50 m.
7 Conclusions

In this paper we analyzed a system able to detect and localize audio events such as gunshots and screams in noisy environments. A real-time implementation of the system is going to be installed in the public square outside the Central Train Station of Milan, Italy. Future work will be dedicated to the formalization of the feature dimension selection algorithm and to the integration of multiple microphone arrays into a sensor network, in order to increase the range and the precision of the audio localization.

References

[1] C. Clavel, T. Ehrette, and G. Richard, "Events detection for an audio-based surveillance system," in Proc. IEEE International Conference on Multimedia and Expo (ICME), 2005, pp. 1306-1309.

[2] J. Rouas, J. Louradour, and S. Ambellouis, "Audio events detection in public transport vehicle," in Proc. 9th International IEEE Conference on Intelligent Transportation Systems, 2006.

[3] T. Zhang and C. Kuo, "Hierarchical system for content-based audio classification and retrieval," in Conference on Multimedia Storage and Archiving Systems III, SPIE, vol. 3527, 1998, pp. 398-409.

[4] D. Hoiem, Y. Ke, and R. Sukthankar, "SOLAR: Sound object localization and retrieval in complex audio environments," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 2005.

[5] P. Atrey, N. Maddage, and M. Kankanhalli, "Audio based event detection for multimedia surveillance," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2006.

[6] J. Chen, Y. Huang, and J. Benesty, Audio Signal Processing for Next-Generation Multimedia Communication Systems. Kluwer, 2004, ch. 4-5.

[7] L. Lu, H. Zhang, and H. Jiang, "Content analysis for audio classification and segmentation," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 7, pp. 504-516, 2002.

[8] G. Peeters, "A large set of audio features for sound description (similarity and classification) in the CUIDADO project," CUIDADO Project Report, 2004.

[9] S. Sigurdsson, K. B. Petersen, and T. Lehn-Schiøler, "Mel frequency cepstral coefficients: An evaluation of robustness of MP3 encoded music," in Proc. 7th International Conference on Music Information Retrieval (ISMIR), 2006.

[10] S. Theodoridis and K. Koutroumbas, Pattern Recognition. Academic Press, 2006.

[11] M. Figueiredo and A. Jain, "Unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, 2002.

[12] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.

[13] J. Ianniello, "Time delay estimation via cross-correlation in the presence of large estimation errors," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, no. 6, pp. 998-1003, 1982.

[14] J. Benesty, J. Chen, and Y. Huang, "A generalized MVDR spectrum," IEEE Signal Processing Letters, vol. 12, no. 12, pp. 827-830, 2005.