Gravitational Wave Data Analysis (GWDA)
Sanjeev Dhurandhar
July 23, 2014
Contents

1 Statistical Detection of GW Signals in Noise
  1.1 Introduction
  1.2 Random variables and their distributions
    1.2.1 Distribution and density functions
    1.2.2 Multivariate random variables

2 The Matched Filter
  2.1 Introduction
  2.2 White, coloured and stationary noise
  2.3 The matched filter

3 The statistical theory of detection
  3.1 Hypothesis testing
  3.2 The one dimensional case
  3.3 The 2 dimensional case
  3.4 Neyman-Pearson criterion
Topics
Random variables, distributions of random variables, uniform distribution, univariate
Gaussian or normal distribution, multivariate Gaussian distribution.
Noise as a random variable, statistical characterisation of noise, autocorrelation function,
wide sense stationary noise, power spectral density of noise, white and coloured noise,
Gaussian noise, matched filter.
Hypothesis testing, binary hypothesis, Neyman-Pearson criterion, likelihood ratio, false
alarm and detection probabilities.
Composite hypothesis, geometric formulation, ambiguity function, metric, template banks,
computational costs.
Lecture 1
1.1 Introduction
Ground based as well as space based gravitational wave (GW) detectors will continuously produce data. Since gravity couples to matter weakly, the gravitational wave signals are expected
to be weak and so will be buried in the noise. The aim of the data analyst is to extract the
signal buried in the noise by detecting it and measuring its parameters. The signal that the
detectors are expected to see is a metric perturbation. For simplicity if we take the background
metric $\eta_{ik}$ to be flat, the full metric $g_{ik}$ can be written as:
$$ g_{ik} = \eta_{ik} + h_{ik} \,, \qquad (1.1) $$
where $h_{ik}$ is a tiny perturbation in the metric describing the gravitational wave. This perturbation may be due to compact binary star coalescences, supernovae, asymmetric spinning neutron
stars, stochastic GW background, etc. We denote a typical component of the perturbation by
simply h. We disregard the indices, which essentially provide directional information. The metric (and the perturbation) is a dimensionless quantity; typically $h \sim 10^{-23}$ or smaller. For
a GW source, h can be estimated from the well-known Landau-Lifschitz quadrupole formula.
The GW amplitude h is related to the second time derivative of the quadrupole moment (this quantity has the dimensions of energy) of the source:
$$ h \sim \frac{4}{r}\, \frac{G}{c^4}\, E^{\rm kinetic}_{\rm nonspherical} \,, \qquad (1.2) $$
where r is the distance to the source, G is the gravitational constant, c is the speed of light and $E^{\rm kinetic}_{\rm nonspherical}$ is the kinetic energy in the nonspherical motion of the source. If we consider $E^{\rm kinetic}_{\rm nonspherical}/c^2$ a fraction of a solar mass and the distance to the source ranging from the galactic scale of tens of kpc to cosmological distances of Gpc, then h ranges from $10^{-17}$ to $10^{-23}$. These
numbers then set the scale for the sensitivities at which the GW detectors must operate. The
GW is detected by measuring the change in armlength of a laser interferometric detector which
is observed as a phase shift on a photodiode. If the change in the armlength L of the detector is $\Delta L$, then,
$$ \Delta L \sim h L \,. \qquad (1.3) $$
The noise on the other hand is of the same order or larger. Therefore one needs to extract the
GW signal given by the function h(t) where t denotes the time, out of the noise n(t). The noise
n(t) is a stochastic process and is random. Any detection is a statistical problem. One can
never be sure of a 100% detection. Because there is noise in the data, it is always possible that what appears to be a signal is only the noise masquerading as one. One can only assign
probabilities to the detection process and provide confidence levels for claiming a detection.
In these lectures I will start with random variables and their probability distributions, the
most important of these being the Gaussian random variables, which are the most common because
of the central limit theorem. Then I will talk about matched filters and their optimality. The
Neyman-Pearson criterion is most appropriate in the context of GW detection. I will discuss
this and the Neyman-Pearson lemma in the context of hypothesis testing. This will bring us to
the likelihood ratio and maximum likelihood. Finally I will discuss how geometry can be used
in statistical detection problems and also how it provides an elegant framework of manifolds
and the metric.
1.2 Random variables and their distributions
Our approach here will be more intuitive than rigorous. We will explain concepts via examples
rather than rigorous definitions. Consider a sample space denoted by $\Omega$ on which a probability measure is defined. A simple example is $\Omega = \{H, T\}$, where H and T are respectively heads and tails in a coin tossing experiment. The (measurable) subsets of $\Omega$ are called events. A random variable is a (measurable) function from the sample space $\Omega$ to $\mathbb{R}$, the set of real numbers. For
example we may define X(H) = 1 and X(T ) = 0, then X is an example of a random variable.
The random variables we are interested in, are more complex than this simple example; it is
the data generated by the detector which contains noise.
1.2.1 Distribution and density functions
Given the random variable X (we will often abridge random variable by r.v.), we denote the
set $X^{-1}((-\infty, x])$ simply by $\{X \le x\}$. Then the distribution function of X is defined as:
$$ F_X(x) = P\{X \le x\} \,, \qquad (1.4) $$
where P denotes the probability of the event under consideration. We will require such functions when we talk about false alarm and detection probabilities later. FX has the following
properties:
1. $F_X$ is a non-decreasing function of x, i.e. whenever $x_1 < x_2$, we have $F_X(x_1) \le F_X(x_2)$;
2. $F_X(-\infty) = 0$, $F_X(+\infty) = 1$;
3. FX is right continuous.
Usually we deal with the derivative of FX which is called the probability density function or
in short just pdf. We have:
$$ p_X(x) = \frac{dF_X}{dx} \,, \qquad (1.5) $$
where we have denoted the pdf by $p_X$. From the above properties of $F_X$, we must have $p_X(x) \ge 0$ and
$$ \int_{-\infty}^{\infty} p_X(x)\, dx = 1 \,. \qquad (1.6) $$
When FX is discontinuous at some point we obtain a delta function at that point for pX .
Some simple examples of distributions are the uniform, the Gaussian, etc. A phase $\Phi$, for example, could be uniformly distributed between 0 and $2\pi$. What this means is that $p(\phi)$ (we drop the subscript on p) is a constant. This constant can be evaluated by noting that its integral over the range $[0, 2\pi]$ must be 1. Hence we must have $p(\phi) = 1/2\pi$. An important distribution is the Gaussian or the normal distribution; its pdf is given by:
$$ p(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \,, \qquad -\infty < x < \infty \,. \qquad (1.7) $$
It is defined over the real line. Its mean is zero and variance is unity. The mean is the first
moment and its variance is the second moment. They are given by:
$$ \mu = \langle x \rangle = \int_{-\infty}^{\infty} x\, p(x)\, dx = 0 \,, \qquad \sigma^2 = \langle x^2 \rangle = \int_{-\infty}^{\infty} x^2\, p(x)\, dx = 1 \,. \qquad (1.8) $$
Such a distribution is also described by the statement $X \sim N(0, 1)$, where N stands for normal; 0 is the mean and 1 the variance. We can give a frequentist interpretation to this pdf. Let a large number N of trials of X be taken; then we ask how many of these values lie between, say, 0.1 and 0.2. On average, $N\, P\{0.1 \le X \le 0.2\}$ of the trial values will lie between 0.1 and 0.2. The probability $P\{0.1 \le X \le 0.2\}$ is the number,
$$ P\{0.1 \le X \le 0.2\} = \int_{0.1}^{0.2} p_X(x)\, dx \,. \qquad (1.9) $$
We can easily transform this distribution by translating and scaling it. Let,
y=
x
,
(1.10)
1
2
2
e(y) /2 .
2
(1.11)
Note that the Jacobian factor of the transformation (dy = dx/) is included in the pdf - in
this case 1/. This distribution of the r.v. Y has mean and variance 2 and is Gaussian
distributed. So we may write, Y N (, 2 ). The square root of the variance is and is called
the standard deviation of the r.v. Y .
1.2.2 Multivariate random variables
The motivation for studying multivariate random variables comes from the fact that noise is
a stochastic process and we will be considering data trains containing samples which are all
r.v.s. Normally, we have a data segment [0, T ] and it is sampled uniformly with N points i.e.
$\Delta = T/N$ is the sampling interval between two consecutive samples and is constant. So we have samples of the data $x(t_k = k\Delta) = x_k$, $k = 0, 1, 2, \ldots, N-1$. We can look upon the samples $x_k$ as components of an N-dimensional vector, say a column vector, which we denote by x. Then each of the $x_k$ is an r.v. and x is a random vector distributed according to an N-dimensional multivariate
distribution. If the stochastic process is Gaussian, then x is a multivariate Gaussian. Therefore
we need to understand how the r.v. of one variable generalises to a random vector of N variables.
To gain insight into general multivariate distributions it is best to begin with a two dimensional distribution or a bivariate distribution. The bivariate distribution has the advantage that
one can easily visualise it and at the same time it has interesting features which are missing in
the univariate (1 dimensional) distribution. For example two random variables can be correlated and this fact is encoded in their covariance. We will first analyse a 2 dimensional Gaussian
distribution.
Let us take two r.v.s X and Y . Their joint pdf is p(x, y) which is defined as follows: for any
event $A \subset \mathbb{R}^2$,
$$ \int_A p(x, y)\, dx\, dy = P\{(x, y) \in A\} \,. \qquad (1.12) $$
In particular, the probability of (X, Y) falling in an infinitesimal rectangle of sides dx and dy around the point (x, y) is
$$ dx\, dy\; p(x, y) \,. \qquad (1.13) $$
Consider first the case in which the joint pdf is
$$ p(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{1}{2\sigma^2}(x^2 + y^2)} \,. \qquad (1.14) $$
One observes that p(x, y) can be factorised into a product of two univariate Gaussians $p_X(x)\, p_Y(y)$, each of which is $N(0, \sigma^2)$. It immediately follows from this that the r.v.s X and Y are independent. One can similarly factorise the distribution function, $F(x, y) = F_X(x) F_Y(y)$, into distribution functions of the individual r.v.s X and Y. The covariance between X and Y is zero. One can easily check this by computing:
$$ \langle xy \rangle = \int\!\!\int dx\, dy\; xy\, \frac{1}{2\pi\sigma^2}\, e^{-\frac{1}{2\sigma^2}(x^2+y^2)} = 0 \,. \qquad (1.15) $$
Because the means vanish, $\langle x \rangle = \langle y \rangle = 0$, the covariance $\langle xy \rangle - \langle x \rangle \langle y \rangle = 0$. Similarly one can show, by computing the corresponding integrals, that the variances are $\langle x^2 \rangle = \langle y^2 \rangle = \sigma^2$.
A more interesting example is the distribution when the r.v.s X, Y are correlated. Then the
Gaussian distribution is given by:
$$ p(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1-\rho^2}}\, e^{-Q(x,y)} \,, \qquad (1.16) $$
where
$$ Q(x, y) = \frac{1}{2(1-\rho^2)} \left[ \frac{x^2}{\sigma_x^2} - \frac{2\rho x y}{\sigma_x \sigma_y} + \frac{y^2}{\sigma_y^2} \right] \,. \qquad (1.17) $$
Here $\sigma_x, \sigma_y$ are respectively the standard deviations of X and Y and $\rho$ is the correlation coefficient, which satisfies $-1 < \rho < 1$. When $\rho > 0$, X, Y are positively correlated, while if $\rho < 0$ they are negatively correlated or anticorrelated. We may calculate the first and second moments of this distribution directly by integration. The means as before are zero, $\langle x \rangle = \langle y \rangle = 0$. The second moments are $\langle x^2 \rangle = \sigma_x^2$, $\langle y^2 \rangle = \sigma_y^2$, $\langle xy \rangle = \rho \sigma_x \sigma_y$. Computing the second moments requires some involved integration. The second moments can be collected together in a covariance matrix which we denote by C. Then:
$$ C = \begin{pmatrix} \langle x^2 \rangle & \langle xy \rangle \\ \langle xy \rangle & \langle y^2 \rangle \end{pmatrix} = \begin{pmatrix} \sigma_x^2 & \rho \sigma_x \sigma_y \\ \rho \sigma_x \sigma_y & \sigma_y^2 \end{pmatrix} \,. \qquad (1.18) $$
If we write the random vector:
$$ \mathbf{x} = \begin{pmatrix} x \\ y \end{pmatrix} \,, \qquad (1.19) $$
then we can write in short $C = \langle \mathbf{x} \mathbf{x}^T \rangle$, where $\mathbf{x}^T$ denotes the transpose of $\mathbf{x}$ and is therefore a row vector. The quadratic form in terms of C can be written in short as $Q(x, y) = \frac{1}{2}\, \mathbf{x}^T C^{-1} \mathbf{x}$. Note that the inverse of the covariance matrix, namely $C^{-1}$, makes an appearance in the pdf. $C^{-1}$ is called the Fisher information matrix. Explicitly,
$$ C^{-1} = \frac{1}{1-\rho^2} \begin{pmatrix} \frac{1}{\sigma_x^2} & -\frac{\rho}{\sigma_x \sigma_y} \\ -\frac{\rho}{\sigma_x \sigma_y} & \frac{1}{\sigma_y^2} \end{pmatrix} \,. \qquad (1.20) $$
Note that if $\rho \neq 0$ then the r.v.s X and Y are correlated. To gain further insight we can plot contours of $p(x, y) = {\rm const.}$ and understand what correlation means. In order to do this it helps to rotate the axes by $45^\circ$. Just to keep the algebra simple we will set $\sigma_x = \sigma_y = 1$; then we have,
$$ p(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-Q(x,y)} \,, \qquad (1.21) $$
where Q simplifies to,
$$ Q(x, y) = \frac{1}{2(1-\rho^2)}\, (x^2 + y^2 - 2\rho x y) \,. \qquad (1.22) $$
We now define the rotated variables,
$$ u = \frac{1}{\sqrt{2}}(x - y) \,, \qquad v = \frac{1}{\sqrt{2}}(x + y) \,, \qquad (1.23) $$
where the corresponding r.v.s U, V are related to X, Y by identical expressions. With these
transformations, we have,
$$ Q(u, v) = \frac{1}{2} \left[ \frac{u^2}{1-\rho} + \frac{v^2}{1+\rho} \right] = \frac{1}{2} \left[ \frac{u^2}{\sigma_u^2} + \frac{v^2}{\sigma_v^2} \right] \,, \qquad (1.24) $$
where $\sigma_u^2 = 1 - \rho$ and $\sigma_v^2 = 1 + \rho$. The corresponding pdf in terms of u, v is given by:
$$ p(u, v) = \frac{1}{2\pi \sigma_u \sigma_v}\, e^{-\frac{1}{2}\left( \frac{u^2}{\sigma_u^2} + \frac{v^2}{\sigma_v^2} \right)} \,, \qquad (1.25) $$
which factorises as,
$$ p(u, v) = \frac{1}{\sqrt{2\pi}\,\sigma_u}\, e^{-\frac{u^2}{2\sigma_u^2}} \; \frac{1}{\sqrt{2\pi}\,\sigma_v}\, e^{-\frac{v^2}{2\sigma_v^2}} \equiv p_U(u)\, p_V(v) \,. \qquad (1.26) $$
Each of these factors is a univariate Gaussian. This also shows that the r.v.s U and V are independent. We may draw the contours of equal probability density, which are essentially given by $Q(u, v) = {\rm const.}$ These are just ellipses whose axes are the (u, v) axes, or in the (x, y) plane these are ellipses whose axes are rotated by $45^\circ$ with respect to the (x, y) axes. We show them below in the figures. For $\rho > 0$ the ellipses have the v-axis as the major axis. In fact when $\rho \to 1$, the ellipses collapse onto the v-axis and all the probability density is concentrated along the v-axis; since then $\sigma_u \to 0$, the pdf $p_U(u) \to \delta(u)$, the delta function. Since $x = y$ on the v-axis, if X has taken the value $x_0$, the probability is very high that Y will also take the same value $x_0$ or a value very close to it, since the pdf is concentrated around this line. The r.v.s X and Y are then positively correlated.
On the other hand, when $\rho < 0$, the contours are ellipses with the u-axis as the major axis. If now X takes the value $x_0$, then it is most likely that Y will take values near $-x_0$, because now the pdf is concentrated around the u-axis, which is just the line $y = -x$. The r.v.s X and Y are then said to be anticorrelated. The figures for $\rho > 0$ and $\rho < 0$ are shown below.
Figure 1.1: $\rho > 0$: The constant pdf contours are ellipses with the v-axis being the major axis.
The contours are shown with unbroken curves while the axes are shown as dashed lines.
Figure 1.2: $\rho < 0$: The constant pdf contours are ellipses with the u-axis being the major axis.
The contours are shown with unbroken curves while the axes are shown as dashed lines.
One can now generalise to an N-dimensional random vector. Now the random vector $\mathbf{x}^T = (x_0, x_1, \ldots, x_{N-1})$ has N components and we have an $N \times N$ covariance matrix, which we continue to denote by C. The zero mean multivariate Gaussian pdf is then given by:
$$ p(\mathbf{x}) = \frac{1}{(2\pi)^{N/2} \sqrt{\det C}}\, e^{-\frac{1}{2} \mathbf{x}^T C^{-1} \mathbf{x}} \,, \qquad (1.27) $$
where $C = \langle \mathbf{x} \mathbf{x}^T \rangle$. One can show this by direct integration, i.e. $\langle x_i x_k \rangle = C_{ik}$ (one may employ
certain tricks like differentiation with respect to a parameter to show this in an efficient and
elegant way). If the noise vector n has such a pdf, then the noise is called Gaussian. Also in
this case the noise has zero mean. We can always do this for detector noise by subtracting out
the DC component and thus making the mean zero. Also if this is time series data, the nearby
samples in time can be correlated. In the subsequent lectures we will discuss white and coloured
noise where this issue is relevant.
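As a concrete illustration of Eqs. (1.18) and (1.27), the short Python sketch below draws samples from a correlated bivariate Gaussian and checks that the sample covariance reproduces C. The particular values of the standard deviations and of the correlation coefficient are arbitrary choices made for the demonstration, not values taken from the text.

```python
import numpy as np

# Illustrative (assumed) parameters: sigma_x, sigma_y and the correlation rho
sigma_x, sigma_y, rho = 1.0, 2.0, 0.7

# Covariance matrix C of Eq. (1.18)
C = np.array([[sigma_x**2,              rho * sigma_x * sigma_y],
              [rho * sigma_x * sigma_y, sigma_y**2             ]])

rng = np.random.default_rng(seed=0)
# Draw many zero-mean samples distributed according to the pdf (1.27) with N = 2
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=C, size=100_000)

# The sample covariance <x x^T> should approach C for a large number of samples
C_est = np.cov(samples, rowvar=False)
print("C     =\n", C)
print("C_est =\n", C_est)

# The quadratic form Q(x, y) = (1/2) x^T C^{-1} x appearing in the exponent
Cinv = np.linalg.inv(C)
x = samples[0]
print("Q(x, y) for the first sample:", 0.5 * x @ Cinv @ x)
```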
Lecture 2
2.1 Introduction
When a known signal is embedded in the noise, matched filtering is the optimal method in
extracting the signal from the noise. We will discuss here in what sense the matched filter is
optimal. To give an example consider a sinusoidal signal say,
$$ h(t) = A \cos 2\pi f_0 t \,, \qquad (2.1) $$
where f0 is a constant frequency and A the amplitude. The signal is embedded in the noise
n(t). We will consider here the time t as continuous and the signal and the noise as functions
of the continuous variable t. When handling data one must of course sample the data and then
one gets a discrete time series. We will switch from discrete to continuous time domains and
vice-versa without any cause for confusion. We take the noise to be additive, then the data
which we denote by x(t) is given by,
x(t) = n(t) + h(t) .
(2.2)
Figure 2.1: The time series of a sinusoidal signal embedded in the noise. The amplitude of the
signal is small compared with the noise so that the signal is not visible.
We now apply the matched filter to this data. In this case the matched filter is also a
sinusoid and is easily implemented by taking the Fourier transform of the data. We compute
the statistic C(f):
$$ C(f) = |\tilde{x}(f)| \,, \qquad \tilde{x}(f) = \int_{-\infty}^{\infty} dt\; x(t)\, e^{-2\pi i f t} \,, \qquad (2.3) $$
where $\tilde{x}(f)$ is the Fourier transform of x(t) as defined above. The statistic C(f) is just the
modulus of the Fourier transform. This is shown in the figure below:
Figure 2.2: The statistic C(f) is plotted versus the frequency f. Two peaks are seen at $\pm f_0$.
There are two peaks, occurring at $\pm f_0$, because $\cos 2\pi f_0 t = (e^{2\pi i f_0 t} + e^{-2\pi i f_0 t})/2$. If we set a
threshold sufficiently high so that it is unlikely that the noise crosses this threshold we would
claim a detection of the signal. Apart from detection, we also recover parameters of the signal
from the statistic such as the amplitude, the frequency and the phase.
This example shows how matched filtering works. We will now discuss this method in detail.
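A minimal numerical version of this example is sketched below: a weak sinusoid is buried in white Gaussian noise and recovered as a peak in the modulus of the discrete Fourier transform, the discrete analogue of the statistic C(f) in Eq. (2.3). The sampling rate, signal amplitude and noise level are illustrative choices, not values from the text.

```python
import numpy as np

fs = 1024.0                      # sampling rate in Hz (assumed)
T = 8.0                          # duration of the data segment in s
t = np.arange(0.0, T, 1.0 / fs)

f0, A = 60.0, 0.5                # signal frequency and (small) amplitude
h = A * np.cos(2 * np.pi * f0 * t)

rng = np.random.default_rng(1)
n = rng.normal(scale=5.0, size=t.size)   # white Gaussian noise, sigma >> A
x = n + h                                # data, Eq. (2.2)

# Discrete analogue of C(f) = |x~(f)|: modulus of the FFT at positive frequencies
C = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)

print("frequency of the largest peak: %.2f Hz" % freqs[np.argmax(C)])
```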
2.2 White, coloured and stationary noise
We will consider the noise to be a zero mean stochastic process - at each time t, n(t) is a r.v.
Also any DC which is there in the noise is removed from the data. At each time t we write,
$$ \langle n(t) \rangle = 0 \,, \qquad (2.4) $$
where now the angular brackets denote ensemble average. One imagines a large number of
identical detectors producing data. Then at a given time t we take the average of n(t) over the
ensemble of detectors. We characterise the noise by its autocorrelation function K defined as
follows:
$$ \langle n(t)\, n(t') \rangle = K(t, t') \,. \qquad (2.5) $$
Clearly from the definition K(t, t') is symmetric in its arguments, namely, $K(t, t') = K(t', t)$.
We will normally put restrictions on this function and hence the noise. The noise is called
stationary if for any N samples, the distribution function or the pdf is invariant under time
translations. That is,
$$ F(n(t_0), n(t_1), \ldots, n(t_{N-1})) = F(n(t_0 + \tau), n(t_1 + \tau), \ldots, n(t_{N-1} + \tau)) \,, \qquad (2.6) $$
for all time translations $\tau$. We will put a weaker restriction on the noise, requiring only the first two moments to be invariant under time translations. That is:
$$ \langle n(t) \rangle = \langle n(t + \tau) \rangle \,, \qquad \langle n(t)\, n(t') \rangle = \langle n(t + \tau)\, n(t' + \tau) \rangle \,, \qquad (2.7) $$
for all time translations $\tau$. Such a stochastic process is called wide sense stationary or in short
WSS. We will assume that the noise we have is WSS. This is actually not the case, but if the
signals are for short durations, then the noise for that duration can be considered to satisfy the
WSS conditions.
When the noise is WSS, we have $K(t, t') = K(t + \tau, t' + \tau)$ for all $\tau$. We may now choose $\tau = -t'$; then $K(t, t') = K(t - t', 0)$, or K becomes a function of only one variable $t - t'$. Without any cause for confusion, we denote it by the same symbol $K(t - t')$. Physically, this means that
the mean and the covariance between noise samples do not depend on absolute time but only
on the time difference.
Because of the time translation symmetry, in Fourier space we have a simple representation
of the autocorrelation function. In the Fourier space, we do the following calculation:
$$\begin{aligned}
\langle \tilde{n}(f)\, \tilde{n}^*(f') \rangle &= \int\!\!\int dt\, dt'\; \langle n(t)\, n(t') \rangle\, e^{-2\pi i f t + 2\pi i f' t'} \\
&= \int\!\!\int dt\, dt'\; K(t - t')\, e^{-2\pi i f t + 2\pi i f' t'} \\
&= \int\!\!\int dt\, d\tau\; K(\tau)\, e^{-2\pi i f t + 2\pi i f'(t + \tau)} \\
&= \int d\tau\; K(\tau)\, e^{2\pi i f' \tau}\; \delta(f - f') \\
&= S(f)\, \delta(f - f') \,. \qquad (2.8)
\end{aligned}$$
Here the star denotes the complex conjugate of the quantity. Note that $\tilde{n}(f)$ is a complex r.v. for each frequency f. We have also changed variables by defining $t' = t + \tau$ in the integrals and written S(f) for the Fourier transform of K. Because of stationarity we get a delta function $\delta(f - f')$ in the frequency space. Since $K(\tau) = K(-\tau)$, that is, it is an even function, its Fourier transform S(f) is real. Also $S(f) > 0$, that is, it is positive definite. It is called the power spectral density (PSD) of the noise. S(f) is also an even function, namely, $S(f) = S(-f)$, and it can be folded on itself and defined only for $f \ge 0$. Then it is called the one sided PSD. We will not discuss this here. If the time is measured in units of seconds, S(f) has dimensions of Hz$^{-1}$.
We can characterise the noise as white or coloured. If S(f ) = S0 a constant, then the noise
is called white; otherwise it is coloured. In case of white noise, the autocorrelation function
becomes,
$$ K(\tau) = S_0 \int_{-\infty}^{\infty} df\; e^{2\pi i f \tau} = S_0\, \delta(\tau) \,. \qquad (2.9) $$
Equivalently, in the frequency domain,
$$ \langle \tilde{n}(f)\, \tilde{n}^*(f') \rangle = S_0\, \delta(f - f') \,. \qquad (2.10) $$
The noise samples at different times are uncorrelated, however small the difference $t - t'$ is.
White noise is an idealisation. Noise can be considered white in some limited band where the
function S(f ) is more or less flat.
We now define Gaussian white noise. Consider a data segment [0, T ] uniformly sampled with
N samples and sampling interval $\Delta$; we then have $T = N\Delta$ and $t_k = k\Delta$, $k = 0, 1, \ldots, N-1$. We integrate $\langle n(t)\, n(t_k) \rangle$ over a sample width:
$$ \int_{t_k - \Delta/2}^{t_k + \Delta/2} dt\; \langle n(t)\, n(t_k) \rangle = S_0 \int_{t_k - \Delta/2}^{t_k + \Delta/2} dt\; \delta(t - t_k) \,. \qquad (2.11) $$
The LHS gives $\Delta \langle n_k^2 \rangle$, while the RHS is just $S_0$. Thus we have the relation:
$$ \langle n_k^2 \rangle = \frac{S_0}{\Delta} \equiv \sigma^2 \,. \qquad (2.12) $$
Thus the pdf for the noise vector $\mathbf{n}^T = (n_0, n_1, \ldots, n_{N-1})$ is:
$$ p(\mathbf{n}) = \frac{1}{(\sqrt{2\pi}\,\sigma)^N}\; e^{-\frac{|\mathbf{n}|^2}{2\sigma^2}} \,, \qquad (2.13) $$
where $|\mathbf{n}|^2 = \mathbf{n}^T \mathbf{n} = \sum_{k=0}^{N-1} n_k^2$, which is in fact the square of the Euclidean norm. This is the pdf of white Gaussian noise.
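As an aside, the PSD of a stretch of data is usually estimated numerically by averaging periodograms. The sketch below does this for simulated white noise using Welch's method from SciPy (an implementation choice of this note, not a method discussed in the text); the estimate comes out roughly flat, as expected for white noise.

```python
import numpy as np
from scipy.signal import welch

fs = 1024.0                       # sampling rate in Hz (assumed)
rng = np.random.default_rng(2)
n = rng.normal(scale=1.0, size=int(64 * fs))   # 64 s of white Gaussian noise

# Welch estimate of the (one sided) PSD: average of windowed periodograms
freqs, S_est = welch(n, fs=fs, nperseg=4096)

# For white noise of variance sigma^2 sampled at fs, the two sided PSD is
# S_0 = sigma^2 / fs (cf. <n_k^2> = S_0 / Delta), so the one sided value is 2 sigma^2 / fs.
print("mean estimated PSD: %.2e   expected (one sided): %.2e"
      % (S_est[1:-1].mean(), 2.0 / fs))
```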
We can extend this discussion to coloured noise. We define the covariance matrix of the
noise vector as:
$$ C_{ik} = K(t_i - t_k) = K((i - k)\Delta) \,. \qquad (2.14) $$
Then the noise pdf is:
$$ p(\mathbf{n}) = \frac{1}{(2\pi)^{N/2} \sqrt{\det C}}\; e^{-\frac{1}{2} \mathbf{n}^T C^{-1} \mathbf{n}} \,. \qquad (2.15) $$
We may define the matrix $\gamma$ as the inverse of C; then in the exponent of the pdf we get the quantity $\gamma_{ik} n^i n^k$. This motivates the definition of a scalar product. We define a scalar product of two data vectors x and y as:
$$ (\mathbf{x}, \mathbf{y}) = \gamma_{ik}\, x^i y^k \,, \qquad (2.16) $$
then we can write:
$$ p(\mathbf{n}) = \frac{1}{(2\pi)^{N/2} \sqrt{\det C}}\; e^{-\frac{1}{2} (\mathbf{n}, \mathbf{n})} \,. \qquad (2.17) $$
If we call the set of data trains D, which is essentially $\mathbb{R}^N$, then with the scalar product so defined, D is a Hilbert space. For the special case of white noise the metric $\gamma_{ik}$ reduces to $\delta_{ik}$, the Kronecker delta.
In the Fourier space, in the case of WSS noise, the metric $\gamma_{ik}$ is diagonalised; $\gamma_{ik} \rightarrow 1/S(f)$. In fact, the equation (2.16) goes over to,
$$ (\mathbf{x}, \mathbf{y}) = \int_{-\infty}^{\infty} df\; \frac{\tilde{x}^*(f)\, \tilde{y}(f)}{S(f)} \,. \qquad (2.18) $$
Since the vectors x, y are real in the time domain, the scalar product is real.
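The scalar product (2.18) is straightforward to evaluate on sampled data with a discrete Fourier transform. The sketch below is one possible discretisation (the function name inner_product and all numerical values are assumptions of this note): the factor dt converts the FFT sum into an approximation of the continuous transform, df = 1/T converts the frequency sum into an integral, and a two sided PSD array S evaluated on the FFT frequency grid is assumed.

```python
import numpy as np

def inner_product(x, y, S, dt):
    """Discrete approximation of (x, y) = int df  x~*(f) y~(f) / S(f).

    x, y : real time series sampled with spacing dt
    S    : two sided PSD evaluated at np.fft.rfftfreq(len(x), dt)
    """
    T = len(x) * dt
    xf = np.fft.rfft(x) * dt          # approximate continuous Fourier transform
    yf = np.fft.rfft(y) * dt
    df = 1.0 / T
    # Real part of the sum over positive frequencies; the factor 2 accounts
    # for the negative frequencies of the real time series
    return 2.0 * np.real(np.sum(np.conj(xf) * yf / S)) * df

# Example with white noise of two sided PSD S0 (illustrative numbers)
dt, N, S0 = 1.0 / 1024, 4096, 1e-2
f = np.fft.rfftfreq(N, dt)
S = np.full(f.size, S0)
t = np.arange(N) * dt
h = np.sin(2 * np.pi * 50.0 * t)
print("(h, h) =", inner_product(h, h, S, dt))
```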
The noise in a real detector is coloured and a detector has a bandwidth in which it is
sensitive to signals. Any real detector has a finite response time in which it reacts to a signal.
This response time determines the upper limit frequency of the frequency band in which the
detector is sensitive.
2.3 The matched filter
Consider a data segment [0, T ] and a known signal h(t). This signal is buried in the noise whose
autocorrelation function is K(t, t'). In general we have not put any restrictions of stationarity at this stage. Then the matched filter q(t) is defined as the solution of the integral equation:
$$ \int_0^T dt'\; K(t, t')\, q(t') = h(t) \,. \qquad (2.19) $$
If we sample the data segment then this becomes a matrix equation. The matrix must be
inverted to obtain the filter vector q.
If the noise is WSS, we have the equation:
$$ \int dt'\; K(t - t')\, q(t') = h(t) \,. \qquad (2.20) $$
The integral is a convolution and we can readily obtain q by going over to the frequency domain by taking Fourier transforms. Indeed, in the Fourier space the above equation reduces to $\tilde{q}(f)\, S(f) = \tilde{h}(f)$ and, since h is known, we immediately have the solution for the matched filter q:
$$ \tilde{q}(f) = \frac{\tilde{h}(f)}{S(f)} \,. \qquad (2.21) $$
The figure below shows the plot of the correlation when a matched filter is applied to data
containing the GW binary signal. It is difficult to see the signal in the data. A peak in the
correlation is observed at the time of arrival of the signal.
Figure 2.3: The operation of matched filtering is shown in the figure. The top shows the signal.
The middle shows the signal buried in the noise. The bottom shows the output of the matched
filter.
We now prove that the matched filter produces the highest signal to noise ratio (SNR)
among all linear filters. This is one sense in which the matched filter is optimal. This statement is true irrespective of whether the noise is Gaussian or not. First we must define the SNR of a given statistic. In Fourier space we have:
$$ \tilde{x}(f) = \tilde{h}(f) + \tilde{n}(f) \,. \qquad (2.22) $$
Applying the filter $\tilde{q}(f)$ to the data, we have the statistic c given by,
$$ c = \int \tilde{q}^*(f)\, \tilde{x}(f)\, df \,. \qquad (2.23) $$
The integrals go from $-\infty$ to $\infty$, which we have not written explicitly. We note that c is real
for real filters and real data. Note that c is also a r.v. since it depends on the data x which is a
r.v. The SNR is the ratio of the mean of c to its standard deviation. The mean of c is denoted by $\bar{c}$ and it is given by:
$$\begin{aligned}
\bar{c} = \langle c \rangle &= \left\langle \int \tilde{q}^*(f)\, \tilde{x}(f)\, df \right\rangle \\
&= \int \tilde{q}^*(f)\, \tilde{h}(f)\, df + \left\langle \int \tilde{q}^*(f)\, \tilde{n}(f)\, df \right\rangle \\
&= \int \tilde{q}^*(f)\, \tilde{h}(f)\, df \,. \qquad (2.24)
\end{aligned}$$
The last line follows since the second term is zero because the mean of the noise is zero and
the first term is deterministic. The variance of c, say $\sigma_c^2$, is obtained by taking the expected value of the square of the filter operating on the noise only. Thus we have:
$$\begin{aligned}
\sigma_c^2 &= \left\langle \int \tilde{q}^*(f)\, \tilde{n}(f)\, df \int \tilde{q}(f')\, \tilde{n}^*(f')\, df' \right\rangle \\
&= \int\!\!\int df\, df'\; \tilde{q}^*(f)\, \tilde{q}(f')\, \langle \tilde{n}(f)\, \tilde{n}^*(f') \rangle \\
&= \int\!\!\int df\, df'\; \tilde{q}^*(f)\, \tilde{q}(f')\, S(f)\, \delta(f - f') \\
&= \int df\; |\tilde{q}(f)|^2\, S(f) \,. \qquad (2.25)
\end{aligned}$$
The SNR, which we denote by $\rho$, is just $\bar{c}/\sigma_c$ and is given by,
$$ \rho(q) = \frac{\int \tilde{q}^*(f)\, \tilde{h}(f)\, df}{\left[ \int df\; |\tilde{q}(f)|^2\, S(f) \right]^{1/2}} \,. \qquad (2.26) $$
We observe that $\rho$ depends on the filter q we apply, which is itself a function, while the SNR is a number. Thus $\rho$ maps a function to a real number; it is called a functional. Secondly, we also observe that scaling q, that is, $q \rightarrow Aq$, where A is a constant, does not change $\rho$. Thus $\rho$ is invariant under scalings of the filter q. We can use this fact to simplify the algebra in the calculations. Accordingly we set,
$$ \int df\; |\tilde{q}(f)|^2\, S(f) = 1 \equiv \| q \|^2 \,. \qquad (2.27) $$
We have then normalised the filter. The SNR for the normalised filter q is given by:
$$ \rho = \int \tilde{q}^*(f)\, \tilde{h}(f)\, df \,. \qquad (2.28) $$
We now maximise $\rho(q)$ subject to the condition $\int df\; |\tilde{q}(f)|^2 S(f) = 1$. This is most conveniently done by invoking the Schwarz inequality. Write,
$$ \int \tilde{q}^*(f)\, \tilde{h}(f)\, df = \int \left( \tilde{q}^*(f) \sqrt{S(f)} \right) \left( \frac{\tilde{h}(f)}{\sqrt{S(f)}} \right) df \,. \qquad (2.29) $$
Then we have from the Schwarz inequality,
$$ \left| \int \tilde{q}^*(f)\, \tilde{h}(f)\, df \right| \le \left[ \int |\tilde{q}(f)|^2\, S(f)\, df \right]^{\frac{1}{2}} \left[ \int df\; \frac{|\tilde{h}(f)|^2}{S(f)} \right]^{\frac{1}{2}} = \left[ \int df\; \frac{|\tilde{h}(f)|^2}{S(f)} \right]^{\frac{1}{2}} \,. \qquad (2.30) $$
The maximisation is attained when we have proportionality between the two vectors or when
the vectors point in the same direction. That is when,
$$ \tilde{q}(f)\, \sqrt{S(f)} = A\, \frac{\tilde{h}(f)}{\sqrt{S(f)}} \,, \qquad (2.31) $$
where A is a constant. The matched filter is determined up to a scaling. A can be fixed by requiring the filter $\tilde{q}(f)$ to have unit norm. Thus,
$$ \tilde{q}(f) = A\, \frac{\tilde{h}(f)}{S(f)} \,. \qquad (2.32) $$
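To make Eqs. (2.21)-(2.32) concrete, the sketch below filters a toy signal buried in simulated Gaussian noise with the frequency domain matched filter $\tilde{q} = \tilde{h}/S$ and evaluates the statistic c, its standard deviation and the optimal SNR. For simplicity the noise here is white, so S(f) is constant; a coloured S(f) would enter the expressions in exactly the same way. The waveform, sampling rate and all numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
fs, T = 1024.0, 16.0                 # sampling rate and duration (assumed)
N = int(fs * T)
t = np.arange(N) / fs
dt, df = 1.0 / fs, 1.0 / T

# White Gaussian noise of standard deviation sigma; its two sided PSD is S0 = sigma^2 / fs
sigma = 1.0
S0 = sigma**2 / fs
n = rng.normal(scale=sigma, size=N)

# Known signal: a Gaussian-windowed sinusoid, purely illustrative
h = 0.5 * np.exp(-((t - 8.0) ** 2) / 0.5) * np.sin(2 * np.pi * 100.0 * t)
x = n + h

freqs = np.fft.rfftfreq(N, dt)
S = np.full(freqs.size, S0)          # constant PSD for white noise

xf = np.fft.rfft(x) * dt             # approximate continuous Fourier transforms
hf = np.fft.rfft(h) * dt

qf = hf / S                                                 # matched filter, Eq. (2.21)
c = 2.0 * np.real(np.sum(np.conj(qf) * xf)) * df            # statistic c, Eq. (2.23)
sigma_c = np.sqrt(2.0 * np.sum(np.abs(qf) ** 2 * S) * df)   # its std. deviation, Eq. (2.25)
print("observed SNR c/sigma_c = %.2f" % (c / sigma_c))
print("optimal SNR            = %.2f" % np.sqrt(2.0 * np.sum(np.abs(hf) ** 2 / S) * df))
```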
Lecture 3
3.1 Hypothesis testing
Given a data vector x we must decide whether a signal is present or absent in the data. The
data vector x is a vector random variable with a multivariate probability distribution. The
probability distributions will differ depending on whether there is signal present in the data or
not. Given x we must decide which of these distributions the vector x belongs to; when there is
no signal in the data, the data is just noise x = n, while if there is a signal s in the data then,
x = s + n. The hypotheses are called H0 and H1 respectively. The multivariate probability
distributions are denoted by p0 (x) when the hypothesis is H0 and p1 (x) when the hypothesis
is H1. Given x we must decide between the hypotheses H0 and H1. Since we have assumed additive noise, we also have $p_1(\mathbf{x}) = p_0(\mathbf{x} - \mathbf{s})$.
To decide between the hypotheses, we devise a test. A test amounts to partitioning the range of x into two disjoint regions R and $R^c$; that is, if the range of x is $\mathbb{R}^N$, then $\mathbb{R}^N = R + R^c$, where the plus sign represents disjoint union. Thus if $\mathbf{x} \in R$, the signal is present or H1 is true, or else $\mathbf{x} \in R^c$ and the signal is absent. Generally, the region R is far away from the origin of $\mathbb{R}^N$ while $R^c$ is closer to the origin. The boundary of R, namely $\partial R$, is called the decision surface D. Usually, for well behaved probability distributions like the Gaussian for example, D is a smooth $(N-1)$-dimensional hypersurface of $\mathbb{R}^N$. So now the problem of deciding whether H0 or H1 is
true amounts to determining the region R according to some specific criteria. In order to do
this we define the false alarm and detection probabilities as follows:
$$ P_F(R) = \int_R p_0(\mathbf{x})\; d^N x \,, \qquad P_D(R) = \int_R p_1(\mathbf{x})\; d^N x \,. \qquad (3.1) $$
PF is called the false alarm probability and PD is called the detection probability. Further
$1 - P_D$ is called the false dismissal probability while $1 - P_F$ is called the confidence. The false
alarm probability is the probability of the noise masquerading as a signal, that is, choosing H1
when H0 is in fact true, while the false dismissal probability is the probability of saying there
is no signal in the data, when there is in fact a signal, that is choosing H0 over H1 when H1 in
fact is true.
We will see how we can use these probabilities in deciding R. First we discuss the one
dimensional case, when the data is just one quantity. This is the simplest case and easy to understand, but it nevertheless illustrates some important aspects of the problem. However, we
are not in this situation - we have a data train and consequently a data vector. We will next
look at the 2-dimensional case which brings out features of the problem which are not present
in the 1 dimensional case, but is relatively simpler than the N dimensional case. Finally, we
will discuss the full N -dimensional case.
3.2 The one dimensional case
In discussing this case we will take Gaussian distributions. We take $\lambda$ as the r.v.; then we have,
$$ p_0(\lambda) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{\lambda^2}{2}} \,, \qquad p_1(\lambda) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(\lambda - A)^2}{2}} \,, \qquad (3.2) $$
where A is the amplitude of the signal. Implicit is the assumption that the noise is additive. We
now fix the false alarm probability to be some small number, say $\alpha$. Normally, $\alpha$ is chosen small so that we will not often make the mistake of saying there is a signal when there is not. For GWDA, we must choose the false alarm probability that we can tolerate. Typically, $\alpha \sim 10^{-10}$ or $10^{-14}$, an extremely small number. It is usually decided by the event rate we expect from astrophysics. The false alarm rate must be much smaller than the event rate if we are to have confidence in our detection. Accordingly, we choose a value of $\lambda$, say $\lambda_0$, so that we have,
$$ P_F = \frac{1}{\sqrt{2\pi}} \int_{\lambda_0}^{\infty} d\lambda\; e^{-\frac{\lambda^2}{2}} = \alpha \,. \qquad (3.3) $$
Thus fixing $\alpha$ determines $\lambda_0$. The quantity $\lambda_0$ is called the threshold. We say that a signal is present if $\lambda \ge \lambda_0$; otherwise we say that the signal is absent. Thus $\lambda_0$ determines the region R, namely, $R = (\lambda_0, \infty)$ while $R^c = (-\infty, \lambda_0]$. The detection probability is then also determined because we have,
$$ P_D = \frac{1}{\sqrt{2\pi}} \int_{\lambda_0}^{\infty} d\lambda\; e^{-\frac{(\lambda - A)^2}{2}} \,. \qquad (3.4) $$
The figure below depicts this situation:
Figure 3.1: The probability distributions $p_0(\lambda)$ and $p_1(\lambda)$ are shown. The threshold $\lambda_0$ is also shown, which is fixed by the false alarm probability $P_F = \alpha$.
In fact in the one dimensional case, since the region R gets fixed by the false alarm probability, the detection probability is also determined. It is important to realise that this is not
the situation for vector data of more than 1 dimension; the false alarm probability does not fix the region R. We will discuss this situation in the next section.
We observe that the detection probability also depends on the signal amplitude A: the greater the amplitude, the higher the detection probability. Thus in this case this is how one handles the situation. Normally, we fix $P_F = \alpha$ and $P_D = \beta$, say 90%; then both these quantities fix A, which we call $A_{\rm critical}$. We then take the decision of observing the signal with $P_F = \alpha$ and with $P_D > \beta$ if $A > A_{\rm critical}$. A worked numerical version of this procedure is sketched below.
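This is a minimal sketch of the one dimensional Neyman-Pearson bookkeeping using SciPy's normal distribution routines; the values chosen for $\alpha$ and $\beta$ are illustrative, not the ones a real search would use.

```python
from scipy.stats import norm

alpha = 1e-10          # tolerated false alarm probability (illustrative)
beta = 0.90            # desired detection probability (illustrative)

# Threshold lambda_0 from P_F = alpha, Eq. (3.3): upper tail of N(0, 1)
lam0 = norm.isf(alpha)

# Detection probability for a given amplitude A, Eq. (3.4): upper tail of N(A, 1)
def detection_probability(A):
    return norm.sf(lam0 - A)

# Critical amplitude for which P_D = beta
A_critical = lam0 - norm.isf(beta)

print("threshold lambda_0 = %.2f" % lam0)
print("A_critical         = %.2f" % A_critical)
print("P_D at A_critical  = %.3f" % detection_probability(A_critical))
```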
3.3 The 2 dimensional case
When the data vector has more than one component, additional features of the problem come
into play. The simplest case here is the two dimensional case where these features can be
understood and pictures can be drawn to analyse the situation. We again consider Gaussian
distributions for p0 and p1 and also for simplicity we will consider white noise, that is, the
samples of x = (x0 , x1 ) are uncorrelated and identically distributed. Thus we take,
$$ p_0(\mathbf{x}) = \frac{1}{2\pi}\, e^{-\frac{1}{2}(x_0^2 + x_1^2)} \,, \qquad p_1(\mathbf{x}) = \frac{1}{2\pi}\, e^{-\frac{1}{2}\left[ (x_0 - s_0)^2 + (x_1 - s_1)^2 \right]} \,. \qquad (3.5) $$
The signal now is a 2 dimensional vector s = (s0 , s1 ). We could have considered correlation
between x0 and x1 but that does not change the salient features while on the other hand it
makes the algebra more complex. We will remark later on how one can generalise to this case.
The false alarm and detection probabilities are defined as,
$$ P_F(R) = \int_R p_0(x_0, x_1)\; dx_0\, dx_1 \,, \qquad P_D(R) = \int_R p_1(x_0, x_1)\; dx_0\, dx_1 \,, \qquad (3.6) $$
where the region R is to be determined. We may fix the false alarm probability to some value $\alpha$,
but now fixing the false alarm probability does not fix the region R nor the detection probability.
In 2 dimensions, the decision surface or the boundary $\partial R$ is a curve and has more degrees of
freedom. An infinity of curves and regions R are possible which yield the same false alarm
probability but different detection probabilities. This is the crucial difference between the
one dimensional case and multidimensional case.
So then how does one go about the problem? One way is to assign costs for each type of
decision. Then we have a cost matrix cij , where cij is the cost incurred by choosing hypothesis
$H_i$, when $H_j$ is true. A criterion could be to minimise the average cost given the prior probabilities for the hypotheses H0 and H1. This is called the Bayes criterion. However, in GWDA one
cannot assign costs in any meaningful way; there are no tangible costs to saying there is a GW
signal, when there is no signal and vice-versa. The criterion that seems to be most appropriate
in this situation is the Neyman-Pearson criterion. Here we fix the false alarm probability to
some small value that we can tolerate and maximise the detection probability with respect to
this constraint. This is a well posed problem and the solution is given by the Neyman-Pearson
lemma which then determines the region R. We will discuss this later in detail. For now we
focus on the 2 dimensional case.
Let us fix the false alarm probability to some value $\alpha$. Now there is a wide choice for the
region R and the decision surface - which is a curve. For simplicity, we take these curves as
straight lines. Note that in principle it could be any complicated curve. We will see later from
the Neyman-Pearson lemma that the optimal curve turns out to be in fact a straight line. Given
the circular symmetry of p0 (x) around the origin, it is easy to check that the decision curves
are any of the straight lines tangent to the circle of radius $\lambda_0$, where the threshold $\lambda_0$ is the solution to the equation:
$$ \frac{1}{\sqrt{2\pi}} \int_{\lambda_0}^{\infty} dx\; e^{-\frac{x^2}{2}} = \alpha \,. \qquad (3.7) $$
See the figure below. Since $p_1(\mathbf{x})$ is centred around the signal vector s, the different regions R give different detection probabilities.
Figure 3.2: The regions R are the half planes whose boundaries are straight lines tangent to the circle of radius $\lambda_0$ centred at the origin. We choose the region R which maximises the detection probability, namely when R is furthest away from the point s, that is when the tangent is orthogonal to the direction of s.
The maximum detection probability is obtained when the
decision surface $\partial R$ is furthest away from the signal position vector s. This occurs when the decision line is orthogonal to the direction of s. Any other line tangent to the circle will be closer to s and consequently will yield a lower detection probability. Of course, at this point we are only considering straight lines as the decision curves. How do we know that there is no other decision curve (and region R) which yields a higher detection probability? The answer is found from the Neyman-Pearson lemma, which asserts that in this case the decision curve must be a straight line.
Also, if we choose to fix the detection probability, then it must be less than this maximum detection probability. If we do fix it to be less than the maximum, there could be multiple solutions for R. In fact, if we restrict to straight lines as decision curves, there are two tangents which are equidistant from the position vector s, which will yield the same detection probabilities. In higher dimensions the choice is even wider, since there are more degrees of freedom. In the
next section we go over to the general case of multidimensions and discuss the Neyman-Pearson
criterion and the lemma.
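A quick Monte Carlo check of this geometric picture, under the same assumption of unit variance white Gaussian noise, is sketched below: among half planes whose boundaries are tangent to the circle of radius $\lambda_0$, the one whose boundary is perpendicular to s detects the largest fraction of signals. The signal vector and the value of $\alpha$ are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
alpha = 1e-3
# Threshold lambda_0: the Gaussian mass beyond a line at this distance equals alpha
lam0 = norm.isf(alpha)

s = np.array([3.0, 1.5])                   # illustrative signal vector
data = s + rng.normal(size=(200_000, 2))   # samples of x = s + n under H1

# Decision regions: half planes {x . u >= lambda_0} for unit vectors u at various angles;
# each of them has the same false alarm probability alpha by rotational symmetry of p0
angles = np.linspace(0.0, 2 * np.pi, 360, endpoint=False)
PD = [np.mean(data @ np.array([np.cos(a), np.sin(a)]) >= lam0) for a in angles]

best = angles[int(np.argmax(PD))]
print("best normal direction: %.1f deg" % np.degrees(best))
print("direction of s       : %.1f deg" % np.degrees(np.arctan2(s[1], s[0])))
```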
3.4 Neyman-Pearson criterion
When events are rare and the costs of making mistakes unquantifiable, the Neyman-Pearson criterion is appropriate. This is the case for GW detection. We fix the false alarm probability $P_F(R) = \alpha$ to a small number that can be tolerated. It is fixed essentially by the event rate
that one expects. If one expects 40 or 50 events (say compact binary coalescences) in a year,
then one may fix the false alarm rate to one false alarm per year. We may afford the mistake
of making one false detection among 50 true events.
The Neyman-Pearson criterion states:
Maximise the detection probability $P_D(R)$ subject to $P_F(R) = \alpha$.
The question essentially is: how do we find R? The answer is supplied by the Neyman-Pearson lemma (which we do not prove). It essentially provides a prescription to find R, which is the following:
Define the likelihood ratio:
$$ \Lambda(\mathbf{x}) = \frac{p_1(\mathbf{x})}{p_0(\mathbf{x})} \,, \qquad (3.8) $$
then the regions which maximise $P_D$ are of the form $R(\Lambda_0) = \{\mathbf{x} \,|\, \Lambda(\mathbf{x}) \ge \Lambda_0\}$, where $\Lambda_0$ is a parameter.
Now we choose the parameter $\Lambda_0$ such that $P_F(R(\Lambda_0)) = \alpha$. This fixes $\Lambda_0$, which is called the threshold. The optimal region R in the sense of Neyman-Pearson is also fixed, as well as the decision surface D, which is then given by the equation $\Lambda(\mathbf{x}) = \Lambda_0$.
We illustrate the situation for Gaussian noise which in general could be coloured. Then the
pdfs for the signal absent and signal present case are the following:
$$ p_0(\mathbf{x}) = A_N\, e^{-\frac{1}{2} \gamma_{ik} x^i x^k} \,, \qquad (3.9) $$
$$ p_1(\mathbf{x}) = A_N\, e^{-\frac{1}{2} \gamma_{ik} (x^i - s^i)(x^k - s^k)} \,, \qquad (3.10) $$
where $A_N$ is the normalisation constant of Eq. (2.15). The monotonicity of the functions involved allows us to translate the equation $\Lambda(\mathbf{x}) \ge \Lambda_0$ to $c \ge c_0$, where,
$$ c = (\mathbf{x}, \mathbf{s}) = \gamma_{ik}\, x^i s^k \,. \qquad (3.11) $$
Thus we just need to compute the scalar product of the data vector with the signal vector. We may write,
$$ c = q_i x^i \,, \qquad q_i = \gamma_{ik}\, s^k \,, \qquad (3.12) $$
where we recognise that $q_i$ is just the matched filter!
Thus in Gaussian noise we have shown that the matched filter is also optimal in the Neyman-Pearson sense: it maximises the detection probability for a given false alarm probability. The decision surface is just $\mathbf{q} \cdot \mathbf{x} = c_0$, which is a hyperplane in D.
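The equivalence between thresholding the likelihood ratio (3.8) and thresholding the linear statistic (3.11) is easy to verify numerically. The sketch below does this for a small coloured Gaussian example: the log likelihood ratio and $\gamma_{ik} x^i s^k$ differ only by the data-independent constant $-\frac{1}{2}(\mathbf{s}, \mathbf{s})$, so they induce the same decision regions. All numbers are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 4

# An (assumed) covariance matrix C for the noise and its inverse, the metric gamma
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)        # symmetric positive definite by construction
gamma = np.linalg.inv(C)

s = rng.normal(size=N)                              # a fixed, known signal vector
x = rng.multivariate_normal(np.zeros(N), C) + s     # one data realisation under H1

def log_p(x, mean):
    d = x - mean
    return -0.5 * d @ gamma @ d    # log pdf up to the common normalisation A_N

log_Lambda = log_p(x, s) - log_p(x, np.zeros(N))    # log of Eq. (3.8)
c = gamma @ s @ x                                    # statistic of Eq. (3.11)

# log Lambda = c - (1/2)(s, s): the two printed numbers should agree
print(log_Lambda, c - 0.5 * s @ gamma @ s)
```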
Lecture 4
Although the above discussion is simple and easy to apply, we are not in this situation where
there is a single known signal buried in the data. What we have in fact is a family of signals,
say $h(t; \theta^\alpha)$, which depend on a number of parameters $\theta^\alpha$, $\alpha = 1, 2, \ldots, p$. For example, in the case of coalescing binary sources, the signal depends on the individual masses $m_1, m_2$, the amplitude A and other kinematical parameters like the time of arrival, initial phase etc. All these constitute the $\theta^\alpha$ in this case and they span some subset $U \subset \mathbb{R}^p$. We collectively denote the $\theta^\alpha$ by the vector $\boldsymbol{\theta}$. Thus $\boldsymbol{\theta} \in U$, where U is the domain of the parameters. Also we will denote the class of signals as $h(\boldsymbol{\theta})$, where now $h(\boldsymbol{\theta}) \in D$, the space of data trains. Thus we have a mapping from $U \rightarrow D$ where $\boldsymbol{\theta} \in U$ is mapped to $h(\boldsymbol{\theta}) \in D$. The image of U under this mapping is a p-dimensional manifold, a submanifold of D. This manifold we call the signal manifold. The signal parameters can be regarded as coordinates on the signal manifold. This is why we denote them with the superscript $\alpha$.
We now have to deal with a composite hypothesis:
H0: No signal present, that is, $\mathbf{x} = \mathbf{n}$, noise only, and the corresponding pdf is $p_0(\mathbf{x})$.
H1: One among the family of signals $h(\boldsymbol{\theta})$ is present, that is, $\mathbf{x} = \mathbf{n} + h(\boldsymbol{\theta})$, with the corresponding pdf now denoted by $p_1(\mathbf{x}; \boldsymbol{\theta})$.
We have a more complex situation here because now we have a family of pdfs $p_1(\mathbf{x}; \boldsymbol{\theta})$. It turns out that we can still apply the Neyman-Pearson criterion, but now it is applicable to the average detection probability, which we denote by $\langle P_D \rangle$; namely, it is this average detection probability which is maximised for a given false alarm $P_F = \alpha$. The average is defined
in the following way:
Consider a prior on the parameter space, which we denote by $z(\boldsymbol{\theta})$. We take it to be normalised:
$$ \int_U z(\boldsymbol{\theta})\; d\boldsymbol{\theta} = 1 \,. \qquad (4.1) $$
The average detection probability is then
$$ \langle P_D \rangle(R) = \int_U z(\boldsymbol{\theta})\, P_D(R; \boldsymbol{\theta})\; d\boldsymbol{\theta} \,. \qquad (4.2) $$
Then $\langle P_D \rangle$ is maximised for a given $P_F = \alpha$ for the region R which is now defined via the average likelihood ratio:
$$ \langle \Lambda \rangle(\mathbf{x}) = \int_U z(\boldsymbol{\theta})\, \Lambda(\mathbf{x}; \boldsymbol{\theta})\; d\boldsymbol{\theta} \,, \qquad (4.3) $$
where $\Lambda(\mathbf{x}; \boldsymbol{\theta}) = p_1(\mathbf{x}; \boldsymbol{\theta})/p_0(\mathbf{x})$. The regions $R(\Lambda_0)$ are now of the form $\langle \Lambda \rangle(\mathbf{x}) \ge \Lambda_0$. We fix $\Lambda_0$ by requiring $P\{\langle \Lambda \rangle(\mathbf{x}) \ge \Lambda_0\} = \alpha$ (under H0). Then the region R is fixed and the average detection probability $\langle P_D \rangle$ is maximised.
If $z(\boldsymbol{\theta})$ is a slowly varying function of the parameters and the likelihood ratio $\Lambda(\mathbf{x}; \boldsymbol{\theta})$ has a sharp peak at some $\boldsymbol{\theta} = \boldsymbol{\theta}_{\rm max}$, then the dominant contribution to the average likelihood ratio $\langle \Lambda \rangle$ comes from this peak. Therefore we have,
$$ \langle \Lambda \rangle \propto \max_{\boldsymbol{\theta}} \Lambda(\mathbf{x}; \boldsymbol{\theta}) \,. \qquad (4.4) $$
The decision is therefore taken by comparing the maximised likelihood ratio with a threshold,
$$ \max_{\boldsymbol{\theta} \in U} \Lambda(\mathbf{x}; \boldsymbol{\theta}) \ge \Lambda_0 \,. \qquad (4.5) $$
This is called maximum likelihood detection (MLD). Therefore we compute $\Lambda(\mathbf{x}; \boldsymbol{\theta})$ for all $\boldsymbol{\theta} \in U$, take the maximum over $\boldsymbol{\theta}$ and compare it with a pre-assigned threshold to decide on the detection.
Apart from deciding on the detection, we also obtain the estimates of the parameters of the
signal, that is, the parameters $\boldsymbol{\theta}_{\rm max}$ at which the likelihood ratio $\Lambda(\mathbf{x}; \boldsymbol{\theta})$ is maximised. These
are called maximum likelihood estimates (MLE).
4.2
Sometimes if some of the parameters of the signal are kinematical such as the amplitude, time
of arrival or the phase, they can be easily dealt with in various quick ways. If the statistic
is a linear function of the data such as the matched filter, we can analytically maximise over
the amplitude. We take the compact binary inspiral as a typical example. In order to search
over the amplitude, we may define a normalised template $s(\boldsymbol{\theta})$ such that $\| s(\boldsymbol{\theta}) \| = 1$, where now $\boldsymbol{\theta}$ represents all parameters other than the amplitude. The signal is then $h(\boldsymbol{\theta}) = A\, s(\boldsymbol{\theta})$, where A is the amplitude of the signal. The data is then:
$$ \mathbf{x} = h(\boldsymbol{\theta}) + \mathbf{n} = A\, s(\boldsymbol{\theta}) + \mathbf{n} \,, \qquad (4.6) $$
and the matched filter output for the normalised template is
$$ c = \langle s, \mathbf{x} \rangle = A + \langle s, \mathbf{n} \rangle \,. \qquad (4.7) $$
In a real search the signal parameters are not known in advance, so one filters the data with a discrete bank of templates; a template that does not exactly match the signal loses some SNR and hence reduces the distance to which we can observe the source. The mismatch is quantified in terms of what is called the
ambiguity function. We define the ambiguity function as:
$$ H(\boldsymbol{\theta}, \boldsymbol{\theta}') = (s(\boldsymbol{\theta}), s(\boldsymbol{\theta}')) \,, \qquad (4.8) $$
where the signals $s(\boldsymbol{\theta})$ are normalised as mentioned before and the brackets denote the scalar product as defined earlier. One immediately deduces that $|H(\boldsymbol{\theta}, \boldsymbol{\theta}')| \le 1$. This is immediate from the Schwarz inequality,
$$ H(\boldsymbol{\theta}, \boldsymbol{\theta}') = (s(\boldsymbol{\theta}), s(\boldsymbol{\theta}')) \le \| s(\boldsymbol{\theta}) \|\, \| s(\boldsymbol{\theta}') \| = 1 \,. \qquad (4.9) $$
Since the templates have norm unity, they lie on a submanifold of a hypersphere of $N - 1$ dimensions. The ambiguity function can be thought of as the cosine of the angle between the unit vectors $s(\boldsymbol{\theta})$ and $s(\boldsymbol{\theta}')$.
We can now place the templates with the help of the ambiguity function. If the maximum
mismatch is taken to be small, like 3% for example, then the template parameters do not differ much, so that we may write $\boldsymbol{\theta}' = \boldsymbol{\theta} + \Delta\boldsymbol{\theta}$, where $\Delta\boldsymbol{\theta}$ is small. Then H is Taylor expanded to second order; the first derivative vanishes because when the parameters match, H is maximum. Thus,
$$ H(\boldsymbol{\theta}, \boldsymbol{\theta} + \Delta\boldsymbol{\theta}) \simeq 1 + \frac{1}{2} \frac{\partial^2 H}{\partial \theta^\alpha \partial \theta^\beta}\, \Delta\theta^\alpha \Delta\theta^\beta \equiv 1 - g_{\alpha\beta}\, \Delta\theta^\alpha \Delta\theta^\beta \,, \qquad (4.10) $$
where
$$ g_{\alpha\beta} = -\frac{1}{2} \frac{\partial^2 H}{\partial \theta^\alpha \partial \theta^\beta} \,. \qquad (4.11) $$
This is how it has been defined in some of the literature. However, geometrically, if $\Delta\sigma$ is the angle between $s(\boldsymbol{\theta})$ and $s(\boldsymbol{\theta} + \Delta\boldsymbol{\theta})$, then $H = \cos\Delta\sigma \simeq 1 - \Delta\sigma^2/2$. And because the angle is geometrically the distance on a unit (hyper)sphere, we should actually define the metric as:
$$ H(\boldsymbol{\theta}, \boldsymbol{\theta} + \Delta\boldsymbol{\theta}) = 1 - \frac{1}{2}\, g_{\alpha\beta}\, \Delta\theta^\alpha \Delta\theta^\beta \,, \qquad (4.12) $$
where $g_{\alpha\beta}\, \Delta\theta^\alpha \Delta\theta^\beta$ is now the square of the distance $\Delta\sigma^2$ on the unit hypersphere. Further, this is also the induced metric on the parameter space, where the parameter space is regarded as a submanifold of D on which we already have the metric $\gamma_{ik}$. Then it is easy to show that,
$$ g_{\alpha\beta} = \gamma_{ik}\, \frac{\partial s^i}{\partial \theta^\alpha}\, \frac{\partial s^k}{\partial \theta^\beta} \,. \qquad (4.13) $$
However, since this factor of 2 in the definitions does not affect the physical results when used consistently, it may be regarded as a convention. We therefore follow the first formula given in Eq. (4.10) so as to be in tune with the literature.
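As an illustration of Eqs. (4.8)-(4.13), the sketch below computes the template-space metric numerically for a toy one-parameter family of normalised templates in white noise, using a finite difference of the ambiguity function. The template family (a sinusoid parametrised by its frequency) and all numbers are assumptions made purely for the demonstration.

```python
import numpy as np

fs, T = 1024.0, 4.0
t = np.arange(0.0, T, 1.0 / fs)

def template(f):
    """Normalised sinusoidal template; for white noise (a, b) = sum(a*b) up to a constant."""
    s = np.sin(2 * np.pi * f * t)
    return s / np.sqrt(np.sum(s * s))

def ambiguity(f1, f2):
    # H(theta, theta') = (s(theta), s(theta')), Eq. (4.8), for unit-norm templates
    return np.sum(template(f1) * template(f2))

f0, df = 100.0, 1e-3
# g = -(1/2) d^2H/df^2 at zero offset, Eq. (4.11), via a central finite difference
g_ff = -0.5 * (ambiguity(f0, f0 + df) - 2.0 * ambiguity(f0, f0)
               + ambiguity(f0, f0 - df)) / df**2

# Spacing between templates for a 3% maximal mismatch: g_ff * (delta f)^2 = 0.03
mismatch = 0.03
delta_f = np.sqrt(mismatch / g_ff)
print("g_ff = %.3e   template spacing in f ~ %.3e Hz" % (g_ff, delta_f))
```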
Since some of the parameters, namely, the kinematical parameters like the time of arrival or
the phase can be dealt with in a quick manner, the remaining parameters must be searched over
with the help of templates. We therefore maximise the match over the kinematical parameters
and redefine the ambiguity function over these remaining parameters and call it the reduced
ambiguity function. We will still denote the parameters by with the understanding that now
does not include the kinematical parameters and also continue to denote the reduced ambiguity
function by H and the corresponding metric by g .
If the match between a signal and the nearest template is not allowed to fall below a prescribed value MM, called the minimal match, then the region of the parameter space covered by a single template is bounded by the hyperellipsoid
$$ g_{\alpha\beta}\, \Delta\theta^\alpha \Delta\theta^\beta = 1 - MM \,. \qquad (4.14) $$
Figure 4.1: A plot of $g_{\alpha\beta}\, \Delta\theta^\alpha \Delta\theta^\beta = 1 - MM$ is shown for the parameters $\Delta\theta^1$ and $\Delta\theta^2$, where only
two dimensions are considered. The origin of the coordinate system has been shifted to the
coordinates of the template. The contour is an ellipse.
Given the parameter space we can also compute the number of templates required to cover
the parameter space with a given mismatch in terms of the metric. This formula is due to
Owen. Let l be the side of the hypercube which fits inside the hyperellipsoid. The template is
at the centre of the hypercube. The vertex furthest away from the centre is at a distance whose square is $p\,(l/2)^2$, where p is the dimension of the parameter space. This results from the repeated application of Pythagoras' theorem. Thus we have the relations:
$$ g_{\alpha\beta}\, \Delta\theta^\alpha \Delta\theta^\beta = p \left( \frac{l}{2} \right)^2 = 1 - MM \,. \qquad (4.15) $$
The number of templates $N_{\rm templates}$ is the volume of the parameter space divided by the volume of the hypercube, that is,
$$ N_{\rm templates} = \frac{\int_U d\boldsymbol{\theta}\, \sqrt{\det(g)}}{l^p} \,. \qquad (4.16) $$
Solving for l in terms of the mismatch gives us the result:
$$ N_{\rm templates} = \frac{\int_U d\boldsymbol{\theta}\, \sqrt{\det(g)}}{\left( 2\sqrt{(1 - MM)/p} \right)^p} \,. \qquad (4.17) $$
The above formula gives a rough estimate of the number of templates required to search the parameter space. However, the problem is really that of tiling the parameter space. That is, the templates must span the parameter space with the given minimal match, that is, leave no holes in the parameter space; at no point in the parameter space should the match fall below the minimal match. This is one condition. The opposing condition is that one must be able
the minimal match. This is one condition. The opposing condition is that one must be able
to achieve this with a minimum number of templates in order to minimise the computational
cost. These are two opposing criteria which then govern the placement of templates. There
are significant overlaps among the neighbouring templates which must be minimised. So the
packing of templates matters. Also there are boundary effects where templates placed near
the boundary of the parameter space spill outside the boundary - for example the parameter
space can become narrow in certain regions, nevertheless templates are required in order to
span the space but then they spill out in other directions outside the parameter space. These
considerations affect the number of templates required.
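For a rough feel of the numbers, the sketch below evaluates the counting formula (4.17) for a hypothetical two dimensional parameter space with a constant metric; the metric values, parameter ranges and minimal match are invented for illustration and do not correspond to any particular detector or waveform family.

```python
import numpy as np

p = 2                                   # dimension of the parameter space
MM = 0.97                               # minimal match (a common illustrative choice)

# Hypothetical constant metric in two chirp-time-like coordinates (units: s^-2)
g = np.array([[4.0e4, 1.5e4],
              [1.5e4, 1.0e4]])

# Hypothetical rectangular parameter range: 0-25 s and 0-1 s
volume = 25.0 * 1.0                                   # coordinate volume of U
proper_volume = volume * np.sqrt(np.linalg.det(g))    # int_U sqrt(det g) d theta

l = 2.0 * np.sqrt((1.0 - MM) / p)       # side of the inscribed hypercube, Eq. (4.15)
N_templates = proper_volume / l**p      # Eq. (4.17)
print("side l = %.3f, number of templates ~ %.2e" % (l, N_templates))
```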
Figure 4.2: An hexagonal packing of templates for the nonspinning binary inspiral parameter
space. Hexagonal packing is more efficient than square packing by about 23 %.
Coming back to the arrangement or the patterns, in two dimensions, the hypercube (in
this case a square) is not the optimal way to tile. A hexagonal packing reduces the overlap
significantly; this reduces the number of templates by about 23%. In Figure 4.2 we show how templates can be packed in a hexagonal pattern in the case of the inspiraling binary signals. The parameters are the chirp times $\tau_0$ and $\tau_3$, which are functions of the two masses $m_1$ and $m_2$.