
Scream and Gunshot Detection and Localization for Audio-Surveillance Systems∗

G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, A. Sarti


Dipartimento di Elettronica e Informazione – Politecnico di Milano
Piazza Leonardo da Vinci 32, 20133 Milano, Italy
Email: [email protected], [email protected], tagliasa/antonacc/[email protected]

Abstract

This paper describes an audio-based video-surveillance system which automatically detects anomalous audio events in a public square, such as screams or gunshots, and localizes the position of the acoustic source so that a video-camera can be steered accordingly. The system employs two parallel GMM classifiers for discriminating screams from noise and gunshots from noise, respectively. Each classifier is trained using different features, chosen from a set of both conventional and innovative audio features. The location of the acoustic source which has produced the sound event is estimated by computing the time difference of arrivals of the signal at a microphone array and using a linear-correction least-squares localization algorithm. Experimental results show that our system can detect events with a precision of 93% at a false rejection rate of 5% when the SNR is 10 dB, while the source direction can be estimated with a precision of one degree. A real-time implementation of the system is going to be installed in a public square of Milan.

1. Introduction

Video-surveillance applications are becoming increasingly important in both private and public environments. As the number of sensors grows, manually detecting an event becomes impracticable and very expensive. For this reason, research on automatic surveillance systems has recently received particular attention. In particular, the use of audio sensors in surveillance and monitoring applications has proved to be particularly useful for the detection of events like screams or gunshots [1][2]. Such detection systems can be efficiently used to signal to an automated system that an event has occurred and, at the same time, to enable further processing like acoustic source localization for steering a video-camera.

Much of the previous work on audio-based surveillance systems has concentrated on the task of detecting particular audio events. Early research stems from the field of automatic audio classification and matching [3]. More recently, specific works covering the detection of particular classes of events for multimedia-based surveillance have been developed. The SOLAR system [4] uses a series of boosted decision trees to classify sound events belonging to a set of predefined classes, such as screams, barks, etc. Later works have shown that classification performance can be considerably improved if a hierarchical classification scheme, composed of different levels of binary classifiers, is used in place of a single-level multi-class classifier [5]. This hierarchical approach has been employed in [2] to design a specific system able to detect screams/shouts in public transport environments. A slightly different technique is used in [1] to detect gunshots in public environments: several binary sub-classifiers for different types of firearms are run in parallel. In this way, the false rejection rate of the system is reduced by 50% on average with respect to a single gunshot/noise classifier.

The final objective of sound localization in most surveillance systems is to localize the acoustic source position over a topological grid. The most popular technique for source localization in environments with a small reverberation time (such as a typical public square) is based on the Time Difference of Arrivals (TDOA) of the signal at an array of microphones. These time delays are further processed to estimate the source location [6].

In this paper we propose a surveillance system that is able to accurately detect and localize screams and gunshots. The audio stream is recorded by a microphone array. Audio segments are classified as screams, gunshots or noise. Audio classified as noise is discarded. If an anomalous event (scream or gunshot) is detected, the localization module estimates the TDOAs at each sensor pair of the array and computes the position of the sound source, steering the video-camera accordingly.

Our approach differs from previous works in the following aspects. First, we give more weight to the feature selection phase for event detection. In traditional audio-surveillance works, features have been either selected by the classification algorithm itself [4] or reduced in dimensionality by Principal Component Analysis (PCA) [1]. In most cases, features have been manually selected on the basis of some heuristic criteria [7]. We provide an exhaustive analysis of the feature selection process, mixing the classical filter and wrapper feature selection approaches. Second, in addition to video-camera steering based on localization of the sound source, we compare time delay estimation errors with theoretical results, and we give some hints on heuristic methods for zooming the camera based on the confidence of the localization.

∗ The work presented was developed within VISNET II, a network of excellence of the European Commission (http://www.visnet-noe.org).

978-1-4244-1696-7/07/$25.00 ©2007 IEEE.
2 Audio Features

A considerable number of audio features have been used for the tasks of audio analysis and content-based audio retrieval. Traditionally, these features have been classified into temporal features, e.g. Zero Crossing Rate (ZCR); energy features, e.g. Short Time Energy (STE); spectral features, e.g. spectral moments, spectral flatness; and perceptual features, e.g. loudness, sharpness or Mel Frequency Cepstral Coefficients (MFCCs). In this work, we have chosen to discard audio features which are too sensitive to the SNR conditions, like STE and loudness. In addition to the traditional features listed above, we employ some features which have not been used before in similar works, such as spectral distribution descriptors (spectral slope, spectral decrease, spectral roll-off) and periodicity descriptors. In this paper we also introduce a few innovative features based on the auto-correlation function: correlation roll-off, correlation decrease, correlation slope, modified correlation centroid and correlation kurtosis.

These features are similar to the spectral distribution descriptors (spectral roll-off, spectral decrease and spectral slope [8]), but, in lieu of the spectrogram, they are computed starting from the auto-correlation function of each frame. The goal of these features is to describe the energy distribution over different time lags. For impulsive noises, like gunshots, much of the energy is concentrated in the first time lags, while for harmonic sounds, like screams, the energy is spread over a wider range of time lags. Features based on the auto-correlation function are labeled in two different ways, filtered or not filtered, depending on whether the autocorrelation function is computed on a band-pass filtered version of the signal or on the original signal, respectively. The rationale behind this filtering approach is that much of the energy of some signals (e.g. screams) is distributed in a relatively narrow range of frequencies; thus the autocorrelation function of the filtered signal is much more robust to noise. In this paper, the limits of the frequency range for filtering the autocorrelation function have been fixed to 1000-2500 Hz: experimental results have shown that most of the energy of the scream harmonics is concentrated in this frequency range.
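By way of illustration, the sketch below computes a frame's normalized auto-correlation and derives a correlation roll-off and a correlation centroid from it. The paper does not give closed-form definitions of these descriptors, so the formulas here simply transplant the spectral roll-off and centroid [8] onto the correlogram and should be read as assumptions.

```python
import numpy as np

def autocorr(frame):
    """Normalized auto-correlation of one analysis frame (non-negative lags)."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return r / (r[0] + 1e-12)

def correlation_rolloff(r, fraction=0.95):
    """Lag below which `fraction` of the total correlation mass lies
    (the spectral roll-off transplanted onto the correlogram)."""
    c = np.cumsum(np.abs(r))
    return int(np.searchsorted(c, fraction * c[-1]))

def correlation_centroid(r):
    """Center of mass of |r| over time lags (analogue of the spectral centroid)."""
    lags = np.arange(len(r))
    w = np.abs(r)
    return float(np.sum(lags * w) / (np.sum(w) + 1e-12))

fs = 22050                      # sampling frequency used in the paper
n = int(0.023 * fs)             # 23 ms analysis frame
t = np.arange(n) / fs
scream_like = np.sin(2 * np.pi * 1500 * t)               # harmonic: mass spread over lags
gunshot_like = np.exp(-t / 0.002) * np.random.randn(n)   # impulsive: mass in first lags
for name, x in [("scream-like", scream_like), ("gunshot-like", gunshot_like)]:
    r = autocorr(x)
    print(name, correlation_rolloff(r), round(correlation_centroid(r), 1))
```

On such toy signals the impulsive frame yields a much smaller roll-off lag and centroid than the harmonic one, which is exactly the contrast these descriptors are meant to capture.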
Table 1 lists the feature set composition. All the features are extracted from 23 ms analysis frames (at a sampling frequency of 22050 Hz) with 1/3 overlap.

#      Feature Type           Features                                 Ref.
1      Temporal               ZCR                                      [7]
2-6    Spectral               4 spectral moments + SFM                 [8]
7-36   Perceptual             30 MFCC                                  [9]
37-39  Spectral distribution  spectral slope, spectral decrease,       [8]
                              spectral roll-off
40-49  Correlation-based      (filtered) periodicity, (filtered)       [7][8]
                              correlation slope, decrease and
                              roll-off, modified correlation
                              centroid, correlation kurtosis

Table 1: Audio features used for classification.

3 Feature Selection

Starting from the full set of 49 features, we can build a feature vector of any dimension l, 1 ≤ l ≤ 49. It is desirable to keep l small in order to reduce the computational complexity of the feature extraction process and to limit the overfitting produced by the growing number of parameters associated with the features in the classification model.

Two main feature selection approaches have been discussed in the literature. In the filter method, the feature selection algorithm filters out features that have little chance of being useful for classification, according to some performance evaluation metric calculated directly from the data, without direct feedback from the particular classifier used. In the second approach, known as the wrapper approach, the performance evaluation metric is some form of feedback provided by the classifier (e.g. accuracy). Wrapper approaches generally outperform filter methods, since they are tightly coupled with the employed classifier, but they require much more computation time.

The feature selection process adopted in this work is a hybrid filter/wrapper method. First, a feature subset of size l is assembled from the full set of features according to some class-separability measure and a heuristic search algorithm, as detailed in Section 3.1. The resulting feature vector is evaluated by a GMM classifier, which returns some classification performance indicator related to that subset (this procedure is explained in Section 3.2). Repeating this procedure for different l's, one can choose the feature vector dimension that optimizes the desired target performance.

3.1 Selection of a Feature Vector of Size l

This section reviews some heuristic methods used to explore the feature space, searching for a (locally) optimal feature vector. We consider two kinds of search algorithms [10]: scalar methods and vectorial methods.

3.1.1 Scalar Selection

In this work, we adopt a feature selection procedure described in [10]. The method builds a feature vector iteratively, starting from the most discriminating feature and including at each step k the feature r̂ that maximizes the following function:

    J(r) = \alpha_1 C(r) - \frac{\alpha_2}{k-1} \sum_{i \in F_{k-1}} |\rho_{ri}|, \quad r \neq i.    (1)

In words, Eq. (1) says that the feature to be included in the feature vector of dimension k has to be chosen from the set of features not yet included in the feature subset F_{k-1}. The objective function is composed of two terms: C(r) is a class-separability measure of the rth feature, while ρ_{ij} indicates the cross-correlation coefficient between the ith and jth features. The weights α1 and α2 determine the relative importance that we give to the two terms. In this paper, we use either the Kullback-Leibler divergence (KL) or the Fisher Discriminant Ratio (FDR) to compute the class separability C(r) [10].
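As an illustration, the greedy ranking of Eq. (1) can be sketched as follows, using the FDR as C(r); the two-class FDR formula and the unit weights α1 = α2 = 1 are assumptions made for the example, not values taken from the paper.

```python
import numpy as np

def fdr(x0, x1):
    """Fisher Discriminant Ratio of one feature for two classes."""
    return (x0.mean() - x1.mean()) ** 2 / (x0.var() + x1.var() + 1e-12)

def scalar_selection(X0, X1, l, a1=1.0, a2=1.0):
    """Greedy scalar selection maximizing Eq. (1).
    X0, X1: (samples x features) matrices for the two classes."""
    n_feat = X0.shape[1]
    C = np.array([fdr(X0[:, r], X1[:, r]) for r in range(n_feat)])
    rho = np.corrcoef(np.vstack([X0, X1]).T)        # feature cross-correlations
    selected = [int(np.argmax(C))]                  # most discriminating feature first
    while len(selected) < l:
        k = len(selected) + 1
        best, best_J = None, -np.inf
        for r in range(n_feat):
            if r in selected:
                continue
            J = a1 * C[r] - a2 / (k - 1) * sum(abs(rho[r, i]) for i in selected)
            if J > best_J:
                best, best_J = r, J
        selected.append(best)
    return selected

# toy example: 6 features, two classes
rng = np.random.default_rng(0)
X0 = rng.normal(0, 1, (200, 6))
X1 = rng.normal(0.5, 1, (200, 6))
print(scalar_selection(X0, X1, l=3))
```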

3.1.2 Vectorial Selection

The vectorial feature selection is carried out using the floating search algorithm [10]. This procedure builds a feature vector iteratively and, at each iteration, reconsiders features previously discarded or excludes features selected in previous iterations from the current feature vector. Though not optimal, this algorithm provides better results than scalar selection, at an increased computational cost. The floating search algorithm requires the definition of a vectorial class-separability metric. In the proposed system, we use either one of the following objective metrics [10]:

    J_1 = \frac{\mathrm{trace}(S_m)}{\mathrm{trace}(S_w)}, \qquad J_2 = \frac{\det(S_m)}{\det(S_w)}    (2)

where Sw is the within-class scatter matrix, which carries information about the intra-class variance of the features, while Sm = Sw + Sb is the mixture scatter matrix; Sb, the between-class scatter matrix, gives information about inter-class covariances.
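Both criteria can be computed directly from labeled feature vectors. The sketch below follows the standard definitions of Sw and Sb with empirical class priors, a conventional choice [10] rather than a detail given in the paper.

```python
import numpy as np

def scatter_criteria(X, y):
    """J1 = trace(Sm)/trace(Sw), J2 = det(Sm)/det(Sw) for labeled data.
    X: (samples x features), y: integer class labels."""
    classes = np.unique(y)
    mu = X.mean(axis=0)                      # global mean
    Sw = np.zeros((X.shape[1], X.shape[1]))  # within-class scatter
    Sb = np.zeros_like(Sw)                   # between-class scatter
    for c in classes:
        Xc = X[y == c]
        p = len(Xc) / len(X)                 # empirical class prior
        mu_c = Xc.mean(axis=0)
        Sw += p * np.cov(Xc, rowvar=False, bias=True)
        Sb += p * np.outer(mu_c - mu, mu_c - mu)
    Sm = Sw + Sb                             # mixture scatter matrix
    return np.trace(Sm) / np.trace(Sw), np.linalg.det(Sm) / np.linalg.det(Sw)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(1.0, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)
print(scatter_criteria(X, y))   # both grow as the classes separate
```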
3.2 Selection of the Feature Vector Dimension l

The optimal vector dimension is determined using a wrapper approach. The two classification feedbacks we take into consideration are the precision and the false rejection rate (FR), defined as follows:

    precision = (number of events correctly detected) / (number of events detected)    (3)

    FR = (number of events not detected) / (number of events to detect)    (4)

where the term "event" denotes either a scream or a gunshot. The rationale behind the choice of precision and false rejection rate as performance metrics is that in an audio-surveillance system the focus is on minimizing the number of events "missed" by the control system, while at the same time keeping the number of false alarms as small as possible.

We evaluate the precision and false rejection rate for feature vectors of every dimension l. Figure 1 shows how the performance varies as l increases, for the case of scream events (analogous results are obtained with gunshot samples). From these graphs, it is clear that good performance may be obtained with a small number of features, while increasing l above a certain dimension l̂ (e.g. 13-15 in the case of screams) not only fails to improve the performance significantly, but makes the results worse due to overfitting. The choice of l̂ can be formalized as a trade-off optimization problem and will be further investigated in future work. For now, l̂ is selected empirically by inspection of the graphs shown in Figure 1 (l̂ = 13 for screams and l̂ = 14 for gunshots).

[Figure 1: Classification precision (a) and false rejection rate (b) of scream detection with increasing feature vector dimension l.]

4 Classification

The event classification system is composed of two Gaussian Mixture Model (GMM) classifiers that run in parallel to discriminate, respectively, between screams and noise, and between gunshots and noise. Each binary classifier is trained separately with the samples of the respective classes (gunshot and noise, or scream and noise), using the Figueiredo and Jain algorithm [11]. This method is conceived to avoid the limitations of the classical Expectation-Maximization (EM) algorithm for estimating the parameters of a mixture model: through an automatic "component annihilation" procedure, the Figueiredo-Jain algorithm automatically selects the number of components and rules out the problem of determining adequate initial conditions; furthermore, singular estimates of the mixture parameters are automatically avoided by the algorithm.

For the testing step, each frame from the input audio stream is classified independently by the two binary classifiers. The decision that an event (scream or gunshot) has occurred is then taken by computing the logical OR of the two classifiers.
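A minimal sketch of this decision stage follows. Scikit-learn's EM-based GaussianMixture is used as a stand-in for the Figueiredo-Jain estimator [11] (which, unlike plain EM, selects the number of components automatically); the component count and the toy features are assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class BinaryGMM:
    """Event-vs-noise classifier: one GMM per class, likelihood-ratio decision."""
    def __init__(self, n_components=4):              # component count fixed by hand here;
        self.event = GaussianMixture(n_components)   # Figueiredo-Jain selects it
        self.noise = GaussianMixture(n_components)   # automatically in the paper

    def fit(self, X_event, X_noise):
        self.event.fit(X_event)
        self.noise.fit(X_noise)
        return self

    def predict(self, X):
        # True where the event model explains the frame better than the noise model
        return self.event.score_samples(X) > self.noise.score_samples(X)

# toy 13-dimensional features (the paper uses distinct 13- and 14-dimensional
# feature vectors for the two classifiers; one dimension is used here for brevity)
rng = np.random.default_rng(2)
scream_clf = BinaryGMM().fit(rng.normal(2, 1, (300, 13)), rng.normal(0, 1, (300, 13)))
gunshot_clf = BinaryGMM().fit(rng.normal(-2, 1, (300, 13)), rng.normal(0, 1, (300, 13)))
frames = rng.normal(2, 1, (10, 13))
alarm = scream_clf.predict(frames) | gunshot_clf.predict(frames)   # logical OR
print(alarm)
```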
5 Localization

5.1 Time Delay Estimation

The localization system employs a T-shaped microphone array composed of 4 sensors, spaced 30 cm apart from each other. The center microphone is taken as the reference sensor (hereafter referred to with the number 0) and the three Time Differences of Arrival (TDOAs) of the signal between the other microphones and the reference microphone are estimated. We use the Maximum-Likelihood Generalized Cross Correlation (GCC) method for estimating time delays [12], i.e. we search

    \hat{\tau}_{i0} = \arg\max_{\tau} \hat{\Psi}_{i0}(\tau), \quad i = 1, 2, 3,    (5)

where

    \hat{\Psi}_{i0}(\tau) = \sum_{k=0}^{N-1} \frac{S_{x_i x_0}(k)}{|S_{x_i x_0}(k)|} \cdot \frac{|\gamma_{i0}(k)|^2}{1 - |\gamma_{i0}(k)|^2} \, e^{j 2\pi \tau k / N}    (6)

is the generalized cross-correlation function, S_{x_i x_0}(k) = E{X_i(k) X_0^*(k)} is the cross spectrum, X_i(k) is the discrete Fourier transform (DFT) of x_i(n), γ_{i0} is the Magnitude Squared Coherence (MSC) function between x_i and x_0, and N denotes the number of samples in the observation interval.
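A sketch of the search in Eqs. (5)-(6) follows. Here the cross-spectrum and coherence are estimated by plain segment averaging rather than by the MVDR technique introduced below, so this illustrates the ML weighting, not the paper's exact estimator; segment length and noise levels are arbitrary.

```python
import numpy as np

def gcc_ml_delay(x, x0, nfft=256, tmax=39):
    """Delay estimate via the ML-weighted GCC of Eqs. (5)-(6); spectra are
    averaged over nfft-long segments (not the MVDR estimate of Section 5.1)."""
    nseg = min(len(x), len(x0)) // nfft
    X = np.fft.fft(x[:nseg * nfft].reshape(nseg, nfft), axis=1)
    X0 = np.fft.fft(x0[:nseg * nfft].reshape(nseg, nfft), axis=1)
    Sxx = np.mean(np.abs(X) ** 2, axis=0)
    S00 = np.mean(np.abs(X0) ** 2, axis=0)
    Sx0 = np.mean(X * np.conj(X0), axis=0)                 # cross spectrum
    msc = np.clip(np.abs(Sx0) ** 2 / (Sxx * S00 + 1e-12), 0, 0.999)
    psi = Sx0 / (np.abs(Sx0) + 1e-12) * msc / (1 - msc)    # weighting of Eq. (6)
    cc = np.real(np.fft.ifft(psi))
    lags = np.r_[0:tmax + 1, -tmax:0]                      # tau in [-Tmax, Tmax]
    return lags[np.argmax(np.r_[cc[:tmax + 1], cc[-tmax:]])]

fs = 44100
rng = np.random.default_rng(3)
s = rng.standard_normal(fs)                    # 1 s broadband source
delay = 12                                     # true delay in samples
x0 = s + 0.1 * rng.standard_normal(fs)         # reference microphone
x = np.r_[np.zeros(delay), s[:-delay]] + 0.1 * rng.standard_normal(fs)
print(gcc_ml_delay(x, x0))                     # expect approximately 12
```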
To increase the precision, the estimate τ̂_{i0} can be refined by parabolic interpolation [13]. However, a fundamental requirement for increasing the performance of (5) is a high-resolution estimate of the cross-spectrum and of the coherence function. We use a non-parametric technique, known as minimum variance distortionless response (MVDR), to estimate the cross spectrum and therefore the MSC function [14]. The MVDR spectrum can be viewed as the output of a bank of filters, with each filter centered at one of the analysis frequencies. Following this approach, the MSC is given by:

    |\gamma_{i0}(k)|^2 = \frac{\left| f_k^H R_{ii}^{-1} R_{i0} R_{00}^{-1} f_k \right|^2}{\left[ f_k^H R_{ii}^{-1} f_k \right]^2 \left[ f_k^H R_{00}^{-1} f_k \right]^2},    (7)

where the superscript H denotes the conjugate transpose of a vector or matrix, R_{xx} = E{x(n) x(n)^H} indicates the covariance matrix of a signal x, f_k = \frac{1}{\sqrt{L}} [1 \; e^{j\omega_k} \; \dots \; e^{j\omega_k (L-1)}]^T and ω_k = 2πk/K, k = 0, 1, ..., K − 1. Assuming that K = L and observing that the matrices R have a Toeplitz structure, we can compute (7) efficiently by means of the Fast Fourier Transform. In our experiments we set K = L = 200 and an observation time of N = 4096 samples.
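The parabolic refinement mentioned above has a simple closed form, obtained by fitting a parabola through the three correlation samples around the integer peak; the following is a generic sketch of that standard refinement [13], not code from the paper.

```python
import numpy as np

def parabolic_refine(cc, k):
    """Refine an integer GCC peak index k to sub-sample precision by fitting
    a parabola through (k-1, k, k+1); returns the fractional peak offset."""
    y0, y1, y2 = cc[k - 1], cc[k], cc[k + 1]
    denom = y0 - 2 * y1 + y2
    if abs(denom) < 1e-12:     # flat neighborhood: no refinement possible
        return 0.0
    return 0.5 * (y0 - y2) / denom

# usage: samples of a correlation peak whose true maximum sits at lag 12.3
lag = np.arange(10, 16)
cc = -(lag - 12.3) ** 2 + 5.0
k = int(np.argmax(cc))                     # integer peak index in this window
print(lag[k] + parabolic_refine(cc, k))    # approximately 12.3
```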
5.2 Source Localization

Differently from popular localization algorithms, the approach we use needs no far-field hypothesis about the source location, and is based on the spherical error function [6]

    e_{sp}(r_s) = A\theta - b,    (8)

where

    A \triangleq \begin{bmatrix} x_1 & y_1 & d_{10} \\ x_2 & y_2 & d_{20} \\ x_3 & y_3 & d_{30} \end{bmatrix}, \quad \theta \triangleq \begin{bmatrix} x_s \\ y_s \\ R_s \end{bmatrix}, \quad b \triangleq \frac{1}{2} \begin{bmatrix} R_1^2 - d_{10}^2 \\ R_2^2 - d_{20}^2 \\ R_3^2 - d_{30}^2 \end{bmatrix}    (9)

for a two-dimensional problem. The pairs (x_i, y_i) are the coordinates of the ith microphone, (x_s, y_s) are the unknown coordinates of the sound source, R_i and R_s denote, respectively, the distance of microphone i and of the sound source from the reference microphone, and d_{i0} = c · τ̂_{i0}, with c being the speed of sound.

To find an estimate of the source location we solve the linear minimization problem

    \min_{\theta} (A\theta - b)^T (A\theta - b)    (10)

subject to the constraint x_s^2 + y_s^2 = R_s^2. The solution of (10) can be found in [6].
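As an illustration of Eqs. (8)-(10), the sketch below solves only the unconstrained least-squares problem, ignoring the quadratic constraint (the exact linear-correction solution that enforces it is derived in [6]); the geometry, source position and speed of sound are made up for the example.

```python
import numpy as np

def localize_ls(mics, d):
    """Unconstrained least-squares solution of Eq. (10), with the reference
    microphone at the origin. mics: (3 x 2) coordinates of the other sensors;
    d: range differences d_i0 = c * tau_i0. The constraint xs^2 + ys^2 = Rs^2
    is ignored here; the exact linear-correction solution is derived in [6]."""
    R = np.linalg.norm(mics, axis=1)               # R_i: sensor distances from reference
    A = np.column_stack([mics[:, 0], mics[:, 1], d])
    b = 0.5 * (R ** 2 - d ** 2)
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta[:2]                               # (xs, ys); theta[2] approximates Rs

mics = np.array([[0.3, 0.0], [-0.3, 0.0], [0.0, -0.3]])   # T-shaped array, 30 cm spacing
src = np.array([10.0, 25.0])                              # made-up source position
c = 343.0                                                 # assumed speed of sound (m/s)
tau = (np.linalg.norm(mics - src, axis=1) - np.linalg.norm(src)) / c   # ideal TDOAs
print(localize_ls(mics, c * tau))                         # approximately [10, 25]
```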

6 Experimental Results

In our simulations we have used audio recordings taken from movie soundtracks and internet repositories. Some screams have been recorded live from people asked to shout into a microphone. Finally, noise samples have been recorded live in a public square of Milan.

6.1 Classification performance with varying SNR conditions

This experiment aims at verifying the effects of the noise level on the training and test sets. We have added noise both to the audio events of the training set and to the audio events of the test set, changing the SNR from 0 to 20 dB in 5 dB steps. The performance indicators used in this test are the false rejection rate, defined in (4), and the false detection rate (FD), defined as follows:

    FD = (number of detected events that were actually noise) / (number of noise samples in the test set)    (11)

where, as usual, an event can be either a scream or a gunshot.
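Mixing an event with noise at a prescribed SNR only requires scaling the noise to the target power ratio. A minimal sketch follows, with a tone and white noise as stand-ins for the actual recordings:

```python
import numpy as np

def mix_at_snr(event, noise, snr_db):
    """Scale `noise` so that the event-to-noise power ratio equals snr_db."""
    g = np.sqrt(np.mean(event ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return event + g * noise[:len(event)]

rng = np.random.default_rng(4)
event = np.sin(2 * np.pi * 1500 * np.arange(22050) / 22050)  # 1 s tone as a stand-in
noise = rng.standard_normal(22050)                           # stand-in for ambient noise
for snr in range(0, 25, 5):                  # 0 to 20 dB in 5 dB steps, as in the text
    x = mix_at_snr(event, noise, snr)
    achieved = 10 * np.log10(np.mean(event ** 2) / np.mean((x - event) ** 2))
    print(snr, round(achieved, 1))           # achieved SNR matches the target
```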
The results for scream/noise classification are reported in Figure 2. As expected, performance degrades noticeably as the SNR of both training and test sequences decreases. In particular, as the training SNR decreases, the false detection rate tends systematically to increase. At the same time, once the training SNR has been fixed, a reduction of the SNR of the test set leads to worse performance in terms of false rejection rate. To account for this behavior, we must consider that using a high-SNR training set implies that the classifier is trained with almost clean scream/gunshot events. On the contrary, a noisy training set implies that the classifier is trained to detect events plus noise; in this way, the probability of labeling noise as a scream or gunshot is clearly greater. On the other hand, if the training set SNR is high but the system is tested in a noisy environment, the classifier is able to correctly detect only a small fraction of the actual events, since it was not trained to be robust to noise. This experiment illustrates the trade-off existing between the false rejection and false detection rates. According to the average noise conditions of the environment in which the system will be deployed, one should choose the appropriate SNR for the training database. Similar results have been obtained with the gunshot/noise classifier.

[Figure 2: False rejection rate as a function of false detection rate for various SNRs of the training database and test sequences. The graph refers to the scream/noise classifier using l̂ = 20 features.]
6.2 Combined system

Putting together the scream/noise classifier and the gunshot/noise classifier, we obtain a precision of 93% at a false rejection rate of 5%, using samples at 10 dB SNR. We have used a feature vector of 13 features for scream/noise classification, and a feature vector of 14 features for gunshot/noise classification. In both cases the J2 criterion has been employed. The two feature vectors are reported in Table 2.

#    Scream/Noise classifier        Gunshot/Noise classifier
1    ZCR                            SFM
2    SFM                            spectral centroid
3    MFCC 2                         spectral kurtosis
4    MFCC 3                         MFCC 2
5    MFCC 4                         MFCC 4
6    MFCC 9                         MFCC 6
7    MFCC 11                        MFCC 7
8    periodicity                    MFCC 19
9    (filtered) periodicity         MFCC 20
10   correlation decrease           MFCC 28
11   filtered correlation decrease  MFCC 29
12   correlation slope              MFCC 30
13   correlation centroid           periodicity
14                                  spectral slope

Table 2: Feature vectors used in the combined system.

6.3 TDE error with different SNR conditions

Localization has been evaluated against different values of SNR by properly mixing the audio events with colored noise of a pre-specified power. To generate the noise samples, we feed white noise into an AR process whose coefficients have been obtained by LPC analysis of ambient noise records. This is necessary to simulate isotropic noise conditions. TDOAs are estimated as explained in Section 5.1; we narrow the search space of Eq. (5) to time lags τ ∈ [−Tmax, Tmax], where Tmax = ⌈d/c · fs⌉, d is the distance between the microphones of a pair (here d = 30 cm) and fs is the sampling frequency (fs = 44100 Hz). The GCC peak estimate is refined using parabolic interpolation.
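As a worked example, with d = 0.30 m, fs = 44100 Hz and an assumed speed of sound of about 343 m/s (the paper does not state the value used), Tmax = ⌈(0.30/343) · 44100⌉ = ⌈38.6⌉ = 39 samples, so the peak search runs over 2 · 39 + 1 = 79 candidate lags and the uniform-guess variance used below for normalization is 79²/12 ≈ 520.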
Figure 3 shows the mean square error (MSE) of the TDOA between a pair of microphones for a scream sample, normalized by (2Tmax + 1)²/12, which corresponds to the variance of a uniform distribution over the search interval. Values in the figure are expressed in decibels, while the true time delay for the simulation has been set to 0 without any loss of generality. Analogous results are obtained for gunshot records. From the figure, the so-called "threshold effect" in the performance of GCC is clearly observable: below some threshold SNR∗, in this example about −10 dB, the error of the time delay estimate suddenly degrades, to the point that the estimated TDOA becomes just a random guess.

[Figure 3: Mean square error of the delay estimate for gunshot and scream samples at 95% confidence level. Data is normalized to the variance of a uniform random guess.]
This phenomenon agrees with theoretical results [13]. An immediate consequence of this behavior is that no steering is applied to the video-camera if the estimated SNR is below the threshold. This is feasible in our system since the audio stream is classified as either an audio event or ambient noise. Under the assumption that the two classes of sounds are uncorrelated, the SNR can be easily computed from the difference in power between events and noise, and tracked in real time.

6.4 Localization error

The audio localization system has been tested by varying the actual position of the sound source, spanning a range of ±90° with respect to the axis of the array. A source positioned at −90° is on the left of the array, one positioned at 0° is in front of the array, and a source located at +90° is on the right. Figure 4 shows the standard deviation of the estimated source angle ϑ̂ for some SNRs above the threshold. For a T-shaped array, the expected angular error is symmetric around 0°. As can be seen from the graph, if the actual sound source is in the range [−80°, 80°], the standard deviation of ϑ̂ is below one degree, even at 0 dB SNR. As the sound source moves completely towards the left or the right of the array, the standard deviation of ϑ̂ increases, especially when the ambient noise level is higher. This behavior can be used to decide whether the video-camera should be zoomed or not: if ϑ̂ is known with sufficient precision, the camera can be zoomed to capture more details; if the estimate is uncertain, a wider angle should be used. A conservative policy could be to zoom the camera only if |ϑ̂| falls outside the interval [90° ± σ90], where σ90 is the standard deviation of ϑ̂ for a given SNR when the true angle is either 90° or −90°. For example, at 10 dB SNR, σ90 is approximately 20° (see Figure 4).

[Figure 4: Standard deviation of the estimated angle ϑ̂ between the sound source and the axis of the array, as a function of the true angle. The distance of the source has been fixed to 50 m.]
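As a sketch, this policy reduces to a one-line test; the per-SNR table of σ90 below is hypothetical (only the 10 dB value is quoted in the text) and would in practice be measured offline, as in Figure 4.

```python
# Hypothetical per-SNR table of sigma_90 in degrees; only the 10 dB entry
# is quoted in the text, the others would be measured as in Figure 4.
SIGMA_90 = {0: 40.0, 5: 30.0, 10: 20.0, 15: 12.0, 20: 8.0}

def should_zoom(theta_hat_deg, snr_db):
    """Zoom only when the estimate is away from the array endfire (+/-90 deg),
    where the angular standard deviation blows up."""
    sigma = SIGMA_90[min(SIGMA_90, key=lambda s: abs(s - snr_db))]
    return abs(theta_hat_deg) < 90.0 - sigma

print(should_zoom(30.0, 10))   # True: confident estimate, zoom in
print(should_zoom(85.0, 10))   # False: near endfire, keep a wide angle
```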

7 Conclusions

In this paper we presented a system able to detect and localize audio events such as gunshots and screams in noisy environments. A real-time implementation of the system is going to be installed in the public square outside the Central Train Station of Milan, Italy. Future work will be dedicated to the formalization of the feature dimension selection algorithm and to the integration of multiple microphone arrays into a sensor network for increasing the range and the precision of the audio localization.

References

[1] C. Clavel, T. Ehrette, and G. Richard, "Events detection for an audio-based surveillance system," in Proc. IEEE International Conference on Multimedia and Expo (ICME), pp. 1306-1309, 2005.
[2] J. Rouas, J. Louradour, and S. Ambellouis, "Audio events detection in public transport vehicle," in Proc. 9th International IEEE Conference on Intelligent Transportation Systems, 2006.
[3] T. Zhang and C. Kuo, "Hierarchical system for content-based audio classification and retrieval," in Conference on Multimedia Storage and Archiving Systems III, SPIE, vol. 3527, pp. 398-409, 1998.
[4] D. Hoiem, Y. Ke, and R. Sukthankar, "SOLAR: Sound Object Localization and Retrieval in complex audio environments," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 2005.
[5] P. Atrey, N. Maddage, and M. Kankanhalli, "Audio based event detection for multimedia surveillance," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2006.
[6] J. Chen, Y. Huang, and J. Benesty, Audio Signal Processing for Next-Generation Multimedia Communication Systems. Kluwer, 2004, ch. 4-5.
[7] L. Lu, H. Zhang, and H. Jiang, "Content analysis for audio classification and segmentation," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 7, pp. 504-516, 2002.
[8] G. Peeters, "A large set of audio features for sound description (similarity and classification) in the CUIDADO project," CUIDADO Project Report, 2004.
[9] S. Sigurdsson, K. B. Petersen, and T. Lehn-Schiøler, "Mel frequency cepstral coefficients: An evaluation of robustness of MP3 encoded music," in Proc. 7th International Conference on Music Information Retrieval (ISMIR), 2006.
[10] S. Theodoridis and K. Koutroumbas, Pattern Recognition. Academic Press, 2006.
[11] M. Figueiredo and A. Jain, "Unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, 2002.
[12] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.
[13] J. Ianniello, "Time delay estimation via cross-correlation in the presence of large estimation errors," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, no. 6, pp. 998-1003, 1982.
[14] J. Benesty, J. Chen, and Y. Huang, "A generalized MVDR spectrum," IEEE Signal Processing Letters, vol. 12, no. 12, pp. 827-830, 2005.

