
Derivations NIP course

Mark van Rossum

Version 1. 8th February 2017

Some notation:

• δij : Kronecker delta. δij = 1 if i = j and 0 otherwise. i, j ∈ N.



• δ(x−y): Dirac delta. ∫ dx f(x)δ(x−y) = f(y). Can be seen as an infinitely narrow (and high) Gaussian distribution, or as a continuous version of the Kronecker delta.

• ∂x = ∂/∂x: partial derivative w.r.t. x.

• ⟨x⟩: expectation (see lectures for interpretation).

• N (µ, σ 2 ): Gaussian distribution with mean µ and variance σ 2 .

1 Encoding lectures
1.1 Higher moments of Gaussian
Q: Consider an uncorrelated (white) Gaussian signal s(t). Calculate its 3rd and 4th order moments,
m3 = ⟨s(t1 )s(t2 )s(t3 )⟩ and m4 = ⟨s(t1 )s(t2 )s(t3 )s(t4 )⟩. Confirm with simulation.

The third moment $m_3 = \langle s(t_1)s(t_2)s(t_3)\rangle$ is zero. Namely, if $t_1 \neq t_2 \neq t_3$ then $m_3 = \langle s\rangle^3 = 0^3 = 0$. If $t_1 = t_2 = t_3$, $m_3 = \langle s^3\rangle = \int N(0,\sigma^2)\, x^3\, dx = 0$, as this is the integral of an odd function. For $t_1 = t_2 \neq t_3$, $m_3 = \langle s\rangle\langle s^2\rangle = 0\cdot\sigma^2 = 0$.
Consider $m_4 = \langle s(t_1)s(t_2)s(t_3)s(t_4)\rangle$. Here we have contributions when $t_1 = t_2$ and $t_3 = t_4$, and permutations:
\begin{align*}
m_4 &= \delta(t_1-t_2)\delta(t_3-t_4)\langle s^2\rangle^2 + \delta(t_1-t_3)\delta(t_2-t_4)\langle s^2\rangle^2 + \delta(t_1-t_4)\delta(t_2-t_3)\langle s^2\rangle^2 \\
    &= \left[\delta(t_1-t_2)\delta(t_3-t_4) + \delta(t_1-t_3)\delta(t_2-t_4) + \delta(t_1-t_4)\delta(t_2-t_3)\right]\sigma^4
\end{align*}

The fact that there is no additional contribution when $t_1 = t_2 = t_3 = t_4$ is related to the fourth cumulant of the Gaussian being zero.
In Matlab you can test this by trying things like mean(randn(1,n).*randn(1,n).*randn(1,n).*randn(1,n)) and mean(randn(1,n).^2.*randn(1,n).*randn(1,n)).
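A slightly fuller simulation check of the moment results above (a sketch; the variance value is an arbitrary choice):

\begin{verbatim}
% check 3rd and 4th moments of white Gaussian noise by simulation
n   = 1e6;  sig = 1.5;                  % arbitrary example std, variance sig^2
s1  = sig*randn(1,n); s2 = sig*randn(1,n); s3 = sig*randn(1,n);
m3_distinct = mean(s1.*s2.*s3)          % ~0        (t1, t2, t3 all different)
m3_equal    = mean(s1.^3)               % ~0        (t1 = t2 = t3)
m4_pairs    = mean(s1.^2 .* s2.^2)      % ~sig^4    (t1 = t2, t3 = t4)
m4_equal    = mean(s1.^4)               % ~3*sig^4  (all equal: the three delta terms coincide)
\end{verbatim}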

1.2 Wiener Kernel


Q: Suppose that the response of a system is given by a 2nd order Wiener approximation:
\[
r(t) = g_0 + \int dt_1\, g_1(t_1)\, s(t-t_1) + \int\!\!\int dt_1\, dt_2\, g_2(t_1,t_2)\, s(t-t_1)\, s(t-t_2) - \sigma^2\!\int dt_1\, g_2(t_1,t_1)
\]

Show that when the stimulus is Gaussian with variance $\sigma^2$, one can indeed extract the kernels from the correlations of the stimulus with the response without contamination by higher orders, i.e. show that $\langle r\rangle = g_0$ and $\langle r(t)s(t-\tau)\rangle = \sigma^2 g_1(\tau)$.

Zero order term:

\begin{align*}
\langle r\rangle &= g_0 + \left\langle \int dt_1\, g_1(t_1)\, s(t-t_1)\right\rangle + \left\langle \int\!\!\int dt_1\, dt_2\, g_2(t_1,t_2)\, s(t-t_1)s(t-t_2)\right\rangle - \sigma^2\!\int dt_1\, g_2(t_1,t_1) \\
&= g_0 + \int dt_1\, \langle s\rangle\, g_1(t_1) + \int\!\!\int dt_1\, dt_2\, g_2(t_1,t_2)\, \langle s(t-t_1)s(t-t_2)\rangle - \sigma^2\!\int dt_1\, g_2(t_1,t_1) \\
&= g_0 + 0 + \int\!\!\int dt_1\, dt_2\, g_2(t_1,t_2)\, \delta(t_1-t_2)\,\sigma^2 - \sigma^2\!\int dt_1\, g_2(t_1,t_1) \\
&= g_0
\end{align*}

First order: a similar calculation gives
\begin{align*}
\langle r(t)s(t-\tau)\rangle &= 0 + \left\langle \int dt_1\, g_1(t_1)\, s(t-t_1)\, s(t-\tau)\right\rangle + 0 \\
&= \int dt_1\, g_1(t_1)\,\langle s(t-t_1)s(t-\tau)\rangle = \sigma^2\!\int dt_1\, g_1(t_1)\,\delta(\tau-t_1) = \sigma^2 g_1(\tau)
\end{align*}
(the second-order term drops out because it only involves odd moments of $s$, which vanish).
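A discrete-time sketch of this reverse-correlation result in Matlab, using an assumed example kernel and the filter function to generate the response:

\begin{verbatim}
% recover g1 by correlating a white Gaussian stimulus with the response
n  = 1e5;  sigma = 1;
g1_true = exp(-(0:19)/5);                 % assumed example kernel, 20 taps
s  = sigma*randn(1, n);
r  = 2 + filter(g1_true, 1, s);           % r(t) = g0 + sum_k g1(k) s(t-k), with g0 = 2
g1_est = zeros(size(g1_true));
for tau = 0:19
    % <r(t) s(t-tau)> / sigma^2
    g1_est(tau+1) = mean(r(tau+1:end) .* s(1:end-tau)) / sigma^2;
end
plot(0:19, g1_true, 'k', 0:19, g1_est, 'r--'); legend('true', 'estimated');
\end{verbatim}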

1.3 Discrete time kernel


Q: Assume a linear system, discrete in time, described by 0th and 1st order kernels, so that we can write $\hat r = Sg$ (see lecture slides for the construction of $S$ and the definition of $g$). Derive the kernels $g_0$ and $g_{1i}$ that minimize the mean square error $E = (r - Sg)^T(r - Sg)$.

Minimize $E = (r - Sg)^T(r - Sg)$ w.r.t. $g$, where $g = (g_0, g_{11}, \ldots, g_{1N})$. For the definition of $S$ see the lecture notes.
\begin{align*}
\frac{dE}{dg_j} &= \frac{d}{dg_j}\left[(r-Sg)^T(r-Sg)\right] \\
&= -2(r-Sg)\cdot\frac{d(Sg)}{dg_j} \\
&= -2\sum_i (r-Sg)_i\,\frac{d}{dg_j}\sum_k S_{ik}\, g_k \\
&= -2\sum_i (r-Sg)_i \sum_k S_{ik}\,\delta_{j,k} \\
&= -2\sum_i (r-Sg)_i\, S_{ij} \\
&= -2(S^T r)_j + 2(S^T S g)_j
\end{align*}

As this needs to be zero for all $j$, $S^T r = S^T S g$, or $g = (S^T S)^{-1} S^T r$. This is an equation you often see in linear regression. Note that at this point we have not used anything about the stimulus.
$(S^T S)_{ik} = \sum_j S_{ji} S_{jk}$, i.e. it equals the dot product of the columns of $S$. If $S$ is a design matrix for a Gaussian stimulus (column $0$ all ones, column $i \ge 1$ the stimulus at lag $i$) then the expected value is $\langle S^T S\rangle = \mathrm{diag}(n, n\sigma^2, n\sigma^2, \ldots)$, or $\langle S^T S\rangle_{ik} = n\,\delta_{i,k}\bigl(\sigma^2 + (1-\sigma^2)\delta_{i,0}\bigr)$. So
\[
g_0 = \frac{1}{n}\sum_j S_{j0}\, r_j = \langle r\rangle, \qquad
g_{1i} = \frac{1}{n\sigma^2}\sum_j S_{ji}\, r_j = \frac{1}{n\sigma^2}\sum_j s_{j-i}\, r_j
\]
This is a discrete formulation of the first two Wiener kernels.
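A minimal Matlab sketch of this least-squares estimate, with an assumed exponential kernel; the particular values of n, sigma, L and the added noise are arbitrary choices:

\begin{verbatim}
% least-squares estimate g = (S'S)^(-1) S'r of the discrete kernels
n = 5000;  sigma = 2;  L = 10;
g0_true = 1.5;  g1_true = 0.8.^(0:L-1);               % assumed example kernels
s = sigma*randn(n, 1);
r = g0_true + filter(g1_true, 1, s) + 0.1*randn(n, 1);% response plus a little extra noise
S = [ones(n,1), zeros(n, L)];                         % design matrix: ones, then lagged stimulus
for k = 1:L
    S(k:n, 1+k) = s(1:n-k+1);                         % column 1+k holds s(t-(k-1))
end
g = (S'*S) \ (S'*r);                                  % solve the normal equations
disp([[g0_true; g1_true(:)], g]);                     % true vs estimated kernels
\end{verbatim}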

1.4 Optimality of STA as stimulus



Q: Assume a linear system described by 0th and 1st order kernels, $r(t) = r_0 + \int dt_1\, g_1(t_1)\, s(t-t_1)$. Derive the stimulus that maximizes the response $r(t)$. Constrain the 'energy' of the stimulus to be $\int s(t)^2\, dt = 1$.


The response of the linear system is $r(t) = r_0 + \int dt_1\, g_1(t_1)\, s(t-t_1)$. With a Lagrange multiplier to constrain $|s|^2$, we maximize $f(t) = r(t) - \lambda\int dt_1\, s^2(t_1)$ w.r.t. $s(t)$. Note, this is a functional derivative; if unfamiliar, you can discretize in time (integrals become sums).


\begin{align*}
\frac{\delta f(t)}{\delta s(t_0)} &= \frac{\delta r(t)}{\delta s(t_0)} - \lambda\,\frac{\delta}{\delta s(t_0)}\int dt_1\, s^2(t_1) \\
&= 0 + \int dt_1\, g_1(t_1)\,\frac{\delta s(t-t_1)}{\delta s(t_0)} - \lambda\int dt_1\, 2 s(t_1)\,\delta(t_0 - t_1) \\
&= \int dt_1\, g_1(t_1)\,\delta(t_0 - t + t_1) - 2\lambda\int dt_1\, s(t_1)\,\delta(t_0 - t_1) \\
&= g_1(t - t_0) - 2\lambda\, s(t_0)
\end{align*}

Setting this to zero for all $t_0$ gives $s(t_0) \propto g_1(t - t_0)$.
Extra: what happens for different constraints such as $\int s(t)^k\, dt = 1$?
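A small numerical illustration of this result, with an assumed example kernel: among unit-energy stimulus segments, none of a set of random ones beats the kernel itself (which, read back in time, is the stimulus $s(t_0) \propto g_1(t-t_0)$).

\begin{verbatim}
% among unit-energy stimuli, s(t0) ~ g1(t - t0) maximizes r - r0 = sum_tau g1(tau) s(t - tau)
L  = 30;  tau = (0:L-1)';
g1 = exp(-tau/8) .* sin(tau/3);          % assumed example kernel
u_opt  = g1 / norm(g1);                  % u(tau) = s(t - tau): optimal stimulus segment
r_opt  = g1' * u_opt;                    % = |g1|
r_rand = zeros(1000, 1);
for k = 1:1000
    u = randn(L, 1);  u = u / norm(u);   % random unit-energy stimulus segment
    r_rand(k) = g1' * u;
end
fprintf('optimal %.3f, best of 1000 random %.3f\n', r_opt, max(r_rand));
\end{verbatim}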

2 Decoding lecture
2.1 Fisher Information Gaussian noise
Q: Assume $N$ neurons with independent Gaussian noise and tuning curves $f_i(\theta)$ ($i = 1\ldots N$). Show that the Fisher info equals $I_f(\theta) = \frac{1}{\sigma^2}\sum_{i=1}^N [f_i'(\theta)]^2$.
The Fisher information is defined as $I_f \equiv -\int P(r|\theta)\,\frac{\partial^2 \log P(r|\theta)}{\partial\theta^2}\, dr$. For independent Gaussian noise, $P(r|\theta) = \prod_{i=1}^N P(r_i|\theta) = \prod_{i=1}^N \frac{1}{Z}\exp\!\left(-[r_i - f_i(\theta)]^2/2\sigma^2\right)$.

So that
\begin{align*}
\frac{\partial^2 \log P(r|\theta)}{\partial\theta^2} &= \left(-N\log Z - \sum_i [r_i - f_i(\theta)]^2/2\sigma^2\right)'' \\
&= \left(\frac{1}{\sigma^2}\sum_i [r_i - f_i(\theta)]\, f_i'(\theta)\right)' \\
&= \frac{1}{\sigma^2}\sum_i \left[(r_i - f_i(\theta))\, f_i''(\theta) - (f_i'(\theta))^2\right]
\end{align*}
Now use that $\int P(r|\theta)\, dr = 1$ and $\int P(r|\theta)\, r_i\, dr = f_i$. Also note that for this factorizing probability and a general function $g(r_i)$, $\int dr\, P(r|\theta)\sum_i g(r_i) = \sum_i \int dr_i\, P(r_i|\theta)\, g(r_i)$. So that

\begin{align*}
I_f(\theta) &= -\int P(r|\theta)\,\frac{1}{\sigma^2}\sum_i \left[(r_i - f_i(\theta))\, f_i''(\theta) - (f_i'(\theta))^2\right] dr \\
&= -\frac{1}{\sigma^2}\sum_i \left\{ f_i''(\theta)\int P(r_i|\theta)\,[r_i - f_i(\theta)]\, dr_i - (f_i'(\theta))^2\int P(r_i|\theta)\, dr_i \right\} \\
&= \frac{1}{\sigma^2}\sum_{i=1}^N [f_i'(\theta)]^2 .
\end{align*}
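A Monte Carlo check of this result, using the fact that the Fisher information also equals the variance of the score $\partial_\theta\log P(r|\theta)$; the Gaussian tuning-curve shape and all parameter values below are assumptions for illustration:

\begin{verbatim}
% compare (1/sigma^2) sum_i f_i'(theta)^2 with the variance of the score
N = 10;  sigma = 0.3;  theta = 0.1;  w = 0.2;
centres = linspace(-1, 1, N);
f  = exp(-(theta - centres).^2 / (2*w^2));        % tuning curves at theta (1 x N)
fp = -(theta - centres)/w^2 .* f;                 % their derivatives w.r.t. theta
I_formula = sum(fp.^2) / sigma^2;

K = 1e5;
r = repmat(f, K, 1) + sigma*randn(K, N);          % K noisy population responses
score = (r - repmat(f, K, 1)) * fp' / sigma^2;    % d/dtheta log P(r|theta), K x 1
fprintf('formula %.2f, Monte Carlo %.2f\n', I_formula, var(score));
\end{verbatim}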

2.2 Fisher Information Poisson noise


Q: Assume $N$ neurons with independent Poisson noise and homogeneous tuning curves $f_i(\theta)$ ($i = 1\ldots N$). Assume that the coding is dense (i.e. each stimulus leads to the response of many neurons).
First show that the only stimulus-dependent term of $\sum_i \log P(n_i)$ equals $\sum_i n_i\log f_i(\theta)$. Next, show that the Fisher info equals $I_f = T\sum_i\left(\frac{f_i'^2}{f_i} - f_i''\right)$. Finally, show that for dense tuning curves $I_f(\theta) = T\sum_{i=1}^N \frac{f_i'(\theta)^2}{f_i(\theta)}$.

The probability that a Poisson neuron with rate $f_i$ fires $n_i$ spikes in a time $T$ is $P(n_i) = [f_i(\theta)T]^{n_i}\exp(-f_i T)/n_i!$. To calculate the Fisher info:
$\sum_i \log P(n_i|\theta) = \sum_i [n_i\log(f_i T) - \log n_i! - f_i T]$. Now $\sum_i f_i \approx$ const for dense coding, so the only $\theta$-dependent term is $\sum_i n_i\log f_i(\theta)$. Therefore
\[
\frac{\partial^2\log P}{\partial\theta^2} = \sum_i n_i\left(\frac{f_i'}{f_i}\right)' = \sum_i n_i\left[\frac{f_i''}{f_i} - \left(\frac{f_i'}{f_i}\right)^2\right]
\quad\text{and}\quad
I_f = -\left\langle \frac{\partial^2\log P}{\partial\theta^2}\right\rangle = \sum_i \langle n_i\rangle\left[\left(\frac{f_i'}{f_i}\right)^2 - \frac{f_i''}{f_i}\right].
\]
Note that $\langle n_i\rangle = \sum_{n_i} p(n_i)\, n_i = T f_i$ (the mean spike count is the tuning curve times $T$). So that
\[
I_f(\theta) = T\sum_i\left(\frac{f_i'(\theta)^2}{f_i(\theta)} - f_i''(\theta)\right)
\]

For dense tuning curves we can replace the sum over units by an integral over the centres of the tuning curves. This cancels the second term: for shift-invariant tuning curves, differentiating w.r.t. $\theta$ is (up to sign) the same as differentiating w.r.t. the centre, so the integral of $f_i''$ over centres is a boundary term and vanishes. You can see this for instance by substituting Gaussian-shaped tuning curves $f_i(\theta) = A\exp(-(\Delta_i - \theta)^2/2\sigma^2)$.
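A quick numerical check of this cancellation, with assumed Gaussian tuning curves and arbitrary parameter values:

\begin{verbatim}
% for dense, shift-invariant tuning curves, sum_i f_i'' is (nearly) zero
N = 50;  T = 1;  A = 10;  w = 0.5;  theta = 0.3;
centres = linspace(-5, 5, N);
f   = A*exp(-(centres - theta).^2/(2*w^2));
fp  = (centres - theta)/w^2 .* f;                    % d f_i / d theta
fpp = ((centres - theta).^2/w^4 - 1/w^2) .* f;       % d^2 f_i / d theta^2
I_full    = T * sum(fp.^2 ./ f - fpp);
I_reduced = T * sum(fp.^2 ./ f);
fprintf('with f'''' term: %.3f, without: %.3f, sum of f'''': %.2g\n', ...
        I_full, I_reduced, sum(fpp));
\end{verbatim}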

2.3 Example Fisher information correlated neurons
Q: Consider two neurons, with responses $r_1$ and $r_2$, with correlated Gaussian noise $r_i = f_i(\theta) + \eta_i$. Assume the covariance matrix of the noise $\eta$ equals $Q_{ij} = \sigma^2(\delta_{ij} + c(1-\delta_{ij}))$. The neurons have tuning curves $f_1(\theta)$ and $f_2(\theta) = \alpha f_1(\theta)$. Calculate the Fisher info $I_F(\theta)$. You can use that for stimulus-independent noise $I_F = f'(\theta)^T Q^{-1} f'(\theta)$.

For correlated Gaussian noise (e.g. Abbott and Dayan, Neural Comput. 1999),
\[
I = f'(\theta)^T Q^{-1}(\theta) f'(\theta) + \frac{1}{2}\mathrm{Tr}\!\left(Q'(\theta)Q^{-1}(\theta)\, Q'(\theta)Q^{-1}(\theta)\right),
\]
which in our case (stimulus-independent noise) simplifies to $I = f'(\theta)^T Q^{-1} f'(\theta)$.
As
\[
Q^{-1} = \frac{1}{\sigma^2(1-c^2)}\begin{pmatrix} 1 & -c \\ -c & 1 \end{pmatrix},
\]
we get $I = \frac{1}{\sigma^2(1-c^2)}\left[f_1'^2 + f_2'^2 - 2c\, f_1' f_2'\right]$. Now assume that $f_2(\theta) = \alpha f_1(\theta)$ (so the neurons have identical tuning apart from a scale factor, which could be negative). Then
\[
I = \frac{f_1'^2}{\sigma^2(1-c^2)}\left[1 - 2c\alpha + \alpha^2\right].
\]
If $\alpha = 1$, then $I = \frac{2}{1+c}\frac{f_1'^2}{\sigma^2}$. Note that if $\alpha = -1$, $I = \frac{2}{1-c}\frac{f_1'^2}{\sigma^2}$, so now correlation helps the information.
Note that if α = 0 (one neuron does not code for the stimulus), the information diverges for c → ±1.
How would you build a perfect decoder in that limit?
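A small Matlab check of this expression against a direct evaluation of $f'^T Q^{-1} f'$, with assumed example values for $\sigma$, $c$, $\alpha$ and $f_1'(\theta)$:

\begin{verbatim}
% Fisher info of two neurons with correlated Gaussian noise
sigma = 1;  c = 0.5;  alpha = -1;  f1p = 2.0;   % assumed example values
fp = [f1p; alpha*f1p];                          % tuning-curve derivatives
Q  = sigma^2 * [1 c; c 1];                      % noise covariance
I_direct  = fp' * (Q \ fp);
I_formula = f1p^2 * (1 - 2*c*alpha + alpha^2) / (sigma^2*(1 - c^2));
fprintf('direct %.4f, formula %.4f\n', I_direct, I_formula);
\end{verbatim}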

2.4 Cramer-Rao bound with bias


Q: Given data x, define estimator T (x) of a quantity θ. If unbiased ⟨T ⟩ = θ, but if the estimator is biased
⟨T ⟩ = θ + b(θ), where b(θ) denotes the bias. Derive the Cramer-Rao bound for a biased estimator.

Define the 'score' $V \equiv \partial_\theta \log P(x|\theta) = \frac{\partial_\theta P(x|\theta)}{P(x|\theta)}$. Note that for a 1D Gaussian $V \propto (x-\theta)/\sigma^2$. The Fisher info is defined as $I_f = \langle V^2\rangle$.
According to Cauchy-Schwarz, $(x\cdot y)^2 \le |x|^2|y|^2$. This is usually stated for vectors $x$ and $y$; here we replace the sum over the dimensions in the inner products with the average over trials:
\[
\langle (V - \langle V\rangle)(T - \langle T\rangle)\rangle^2 \le \left\langle (V - \langle V\rangle)^2\right\rangle\left\langle (T - \langle T\rangle)^2\right\rangle .
\]
Note that $\langle V\rangle = \int dx\, P(x|\theta)\,\frac{\partial_\theta P(x|\theta)}{P(x|\theta)} = \partial_\theta\int dx\, P(x|\theta) = \partial_\theta 1 = 0$. So
\begin{align*}
\langle VT - V\langle T\rangle\rangle^2 &\le \langle V^2\rangle\left\langle (T - \langle T\rangle)^2\right\rangle \\
\langle VT\rangle^2 &\le \langle V^2\rangle\, \mathrm{var}(T)
\end{align*}
Now $\langle VT\rangle = \int dx\, P(x|\theta)\,\frac{\partial_\theta P(x|\theta)}{P(x|\theta)}\, T(x) = \partial_\theta\int dx\, P(x|\theta)\, T(x) = \partial_\theta\langle T\rangle = 1 + b'(\theta)$. Since $\mathrm{var}(T) = \langle (T-\theta)^2\rangle - b^2(\theta)$, we have $\langle (T-\theta)^2\rangle - b^2(\theta) \ge [1+b'(\theta)]^2/I_f$, or
\[
\mathrm{var}(T) \ge \frac{[1 + b'(\theta)]^2}{I_f}.
\]
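A quick simulation of the bound for a simple biased estimator (a shrunk sample mean of Gaussian data; all values are assumed for illustration). For this particular estimator the bound happens to be met with equality.

\begin{verbatim}
% biased Cramer-Rao bound for T = a*mean(x), x_1..x_n ~ N(theta, sigma^2), bias b = (a-1)*theta
n = 10;  sigma = 1;  theta = 2;  a = 0.8;
If = n / sigma^2;                             % Fisher info of n samples
bound = (1 + (a - 1))^2 / If;                 % [1 + b'(theta)]^2 / If
K = 1e5;
T = a * mean(theta + sigma*randn(n, K), 1);   % K repetitions of the estimator
fprintf('var(T) = %.4f, bound = %.4f\n', var(T), bound);
\end{verbatim}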

2.5 Convolution in Fourier domain


Q: Show that the Fourier transform of a convolution of two functions equals the product of the Fourier transforms. The Fourier transform of a function is defined as $\tilde f(\omega) = \int dt\, \exp(i\omega t)\, f(t)$.

Define the Fourier transform $\tilde f(\omega) = \int dt\, \exp(i\omega t)\, f(t)$ (there are various normalization conventions for the Fourier transform; in the end this should not matter) and the convolution $(f\star g)(\tau) \equiv \int dt\, f(t)\, g(\tau - t)$. So
\[
\widetilde{f\star g}(\omega) = \int d\tau\int dt\, f(t)\, g(\tau - t)\,\exp(i\omega\tau).
\]

On the other hand

\begin{align*}
\tilde f(\omega)\,\tilde g(\omega) &= \int dt_1\int dt_2\, \exp(i\omega t_1)\, f(t_1)\,\exp(i\omega t_2)\, g(t_2) \\
&= \int dt\int d\tau\, \exp(i\omega t)\,\exp(-i\omega t + i\omega\tau)\, f(t)\, g(\tau - t) \\
&= \int dt\int d\tau\, \exp(i\omega\tau)\, f(t)\, g(\tau - t)
\end{align*}
(using the substitution $t_1 = t$, $t_2 = \tau - t$). So indeed $\widetilde{f\star g}(\omega) = \tilde f(\omega)\,\tilde g(\omega)$.
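A discrete sanity check in Matlab: for the DFT the same statement holds with circular convolution (this is the discrete analogue, not the continuous theorem itself).

\begin{verbatim}
% DFT of a circular convolution equals the product of the DFTs
n = 256;
f = randn(1, n);  g = randn(1, n);
via_fft = ifft(fft(f) .* fft(g));                  % product in the Fourier domain
direct  = zeros(1, n);                             % direct circular convolution
for tau = 1:n
    direct(tau) = sum(f .* g(mod(tau - (1:n), n) + 1));
end
fprintf('max abs difference: %.2e\n', max(abs(direct - via_fft)));
\end{verbatim}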

3 Information Theory
3.1 Derivation of mutual information
Q: Derive the expression for the mutual information Im = H(R)−H(R|S) in terms of the distribution p(r, s).

Use that $p(r|s)p(s) = p(r,s)$ and $p(r) = \sum_s p(r,s)$.

\begin{align*}
I_m &= H(R) - H(R|S) \\
&= -\sum_r p(r)\log p(r) + \sum_{r,s} p(r|s)\, p(s)\log p(r|s) \\
&= -\sum_{r,s} p(r,s)\log p(r) + \sum_{r,s} p(r,s)\log\frac{p(r,s)}{p(s)} \\
&= \sum_{r,s} p(r,s)\log\frac{p(r,s)}{p(r)\, p(s)}
\end{align*}
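A numerical check of this identity on an arbitrary random joint table (a sketch; the table size is an arbitrary choice, and $H(R|S) = \sum_s p(s) H(R|s)$ is used):

\begin{verbatim}
% check H(R) - H(R|S) = sum_{r,s} p(r,s) log2( p(r,s) / (p(r) p(s)) )
prs = rand(4, 3);  prs = prs / sum(prs(:));     % random joint p(r,s): 4 responses x 3 stimuli
pr  = sum(prs, 2);  ps = sum(prs, 1);           % marginals
H   = @(p) -sum(p(p > 0) .* log2(p(p > 0)));
HRgS = sum(ps .* arrayfun(@(j) H(prs(:, j) / ps(j)), 1:3));   % H(R|S)
I1 = H(pr) - HRgS;
ratio = prs ./ (pr * ps);                       % p(r,s) / (p(r) p(s))
I2 = sum(prs(:) .* log2(ratio(:)));
fprintf('H(R)-H(R|S) = %.4f bits, sum formula = %.4f bits\n', I1, I2);
\end{verbatim}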

3.2 Mutual information of two correlated Gaussians


Q: Assume two correlated Gaussian variables $y_1$ and $y_2$. Show that the mutual information is $I(Y_1, Y_2) = -\frac{1}{2}\log(1-\rho^2)$, with Pearson correlation coefficient $\rho = \frac{\sigma_{12}}{\sigma_1\sigma_2}$.
Introduce $y = (y_1, y_2)$ and, if needed, translate the integrals so that they are centred around zero mean.

\begin{align*}
I(Y_1,Y_2) &= \int P(y_1,y_2)\log\frac{P(y_1,y_2)}{P(y_1)P(y_2)}\, dy_1\, dy_2 \\
&= \int N_{12}\exp(-y^T C^{-1} y/2)\,\log\!\left[\frac{N_{12}}{N_1 N_2}\exp\!\left(-y^T C^{-1} y/2 + y_1^2/2\sigma_1^2 + y_2^2/2\sigma_2^2\right)\right] dy_1\, dy_2 \\
&= \int N_{12}\exp(-y^T C^{-1} y/2)\left[-y^T C^{-1} y/2 + y_1^2/2\sigma_1^2 + y_2^2/2\sigma_2^2 + \log\frac{N_{12}}{N_1 N_2}\right] dy_1\, dy_2 \\
&= -1 + 1/2 + 1/2 + \log\frac{N_{12}}{N_1 N_2}
\end{align*}
where the normalisation factors are $1/N_{12} = 2\pi\sqrt{\det C}$ and $1/N_1 = \sqrt{2\pi}\,\sigma_1$.
\[
\left(\frac{N_{12}}{N_1 N_2}\right)^2 = \frac{\sigma_1^2\sigma_2^2}{\det C} = \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2\sigma_2^2 - \sigma_{12}^2} = [1-\rho^2]^{-1},
\quad\text{with Pearson correlation coefficient } \rho = \frac{\sigma_{12}}{\sigma_1\sigma_2}.
\]
So that $I(Y_1,Y_2) = -\frac{1}{2}\log(1-\rho^2)$. Note, if $\rho\to\pm 1$ then $I\to\infty$, and if $\rho\to 0$ then $I\to 0$.
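A simulation sketch with assumed example values of $\rho$, $\sigma_1$, $\sigma_2$: draw correlated Gaussian samples and compare the closed form with the standard identity $I = H(Y_1) + H(Y_2) - H(Y_1,Y_2)$ evaluated on the sample covariance.

\begin{verbatim}
% mutual information of two correlated Gaussians, in nats
rho = 0.7;  s1 = 1.0;  s2 = 2.0;                 % assumed example parameters
C = [s1^2, rho*s1*s2; rho*s1*s2, s2^2];
n = 2e5;
y = randn(n, 2) * chol(C);                       % correlated samples (rows)
Chat = cov(y);
I_closed  = -0.5 * log(1 - rho^2);               % -1/2 log(1 - rho^2), true parameters
I_entropy = 0.5*log(2*pi*exp(1)*Chat(1,1)) + 0.5*log(2*pi*exp(1)*Chat(2,2)) ...
            - 0.5*log((2*pi*exp(1))^2 * det(Chat));   % H(Y1)+H(Y2)-H(Y1,Y2), sample covariance
fprintf('closed form %.4f, from sample entropies %.4f nats\n', I_closed, I_entropy);
\end{verbatim}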

3.3 Example of synergistic code


Q: Consider the following table of stimulus s and the response of two units r1 and r2 and their probability
P.

s r1 r2 P (s, r1 , r2 )
0 0 0 0
1 0 0 1/4
0 0 1 1/4
1 0 1 0
0 1 0 1/4
1 1 0 0
0 1 1 0
1 1 1 1/4
Show that this is a synergistic code.

Note that $P(r) = \prod_{i=1,2} P(r_i)$. $I_m(r_1, s) = I_m(r_2, s) = 0$, but $I_m((r_1, r_2), s) > 0$. In other words, observation of just $r_1$ or $r_2$ provides no information about the stimulus. However, observing both does.
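The three mutual informations can be computed directly from the table; a Matlab sketch using the identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$:

\begin{verbatim}
% mutual information from the joint table P(s, r1, r2)
T = [0 0 0 0;   1 0 0 .25;  0 0 1 .25;  1 0 1 0;
     0 1 0 .25; 1 1 0 0;    0 1 1 0;    1 1 1 .25];   % columns: s, r1, r2, P
P = zeros(2, 2, 2);                                    % P(s+1, r1+1, r2+1)
for k = 1:size(T, 1)
    P(T(k,1)+1, T(k,2)+1, T(k,3)+1) = T(k,4);
end
H    = @(p) -sum(p(p > 0) .* log2(p(p > 0)));
Ps   = sum(sum(P, 3), 2);          Pr1  = sum(sum(P, 3), 1);
Pr2  = sum(sum(P, 2), 1);          Psr1 = sum(P, 3);
Psr2 = sum(P, 2);                  Pr12 = sum(P, 1);
I_r1  = H(Ps(:)) + H(Pr1(:))  - H(Psr1(:));            % = 0
I_r2  = H(Ps(:)) + H(Pr2(:))  - H(Psr2(:));            % = 0
I_r12 = H(Ps(:)) + H(Pr12(:)) - H(P(:));               % = 1 bit
fprintf('I(r1;s)=%.2f  I(r2;s)=%.2f  I((r1,r2);s)=%.2f bits\n', I_r1, I_r2, I_r12);
\end{verbatim}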

3.4 Maximal entropy distributions



(Discrete distributions) Q: Which distribution $p(r_i)$ maximizes the entropy $H(R) = -\sum_i p(r_i)\log_2 p(r_i)$? What if the mean is constrained? Assume that the bins are linearly spaced and start at $r = 0$, thus $r_i = i\Delta r$. What if the variance is constrained?

We write pi = p(ri ).
• The only constraint is that $p$ is a distribution, so that $\sum_i p_i = 1$. Maximize $E = H - \lambda(\sum_i p_i - 1)$, where $\lambda$ is a Lagrange multiplier. Require $\frac{dE}{dp_j} = 0\ \forall j$ (using different indices $i$ and $j$ to prevent errors): $\frac{dE}{dp_j} = \sum_i \frac{d}{dp_j}(-p_i\log p_i - \lambda p_i) = \sum_i \delta_{ij}(-\log p_i - 1 - \lambda) = 0$. So $\log p_j = -1 - \lambda\ \forall j$, i.e. $p_j$ = constant: a uniform distribution.
• Constrain the mean to be $\bar r$. Now $E = H - \lambda(\sum_i p_i - 1) - \lambda_2(\sum_i p_i r_i - \bar r)$. So $\log p_i = -1 - \lambda - \lambda_2 r_i$, hence $p_i \propto \exp(-\lambda_2 r_i)$, which after normalization and setting the mean gives $p_i = \frac{1}{\bar r}\exp(-r_i/\bar r)$ (an exponential distribution).
• Constrain the variance: same as above, but now $E = H - \lambda(\sum_i p_i - 1) - \lambda_2(\sum_i p_i r_i^2 - \sigma_r^2)$ (assuming zero mean). Now $p_i \propto \exp(-c\, r_i^2)$, i.e. a Gaussian.

3.5 Gaussian variable with noise


Q: Consider a noisy response $r = s + \eta$, where both the signal $s$ and the noise $\eta$ are Gaussian distributed. Calculate the mutual information between $r$ and $s$.
What is the MAP estimate of $s$ given $r$?

$H_{\mathrm{noise}} = \frac{1}{2}\log_2(2\pi e\,\sigma_\eta^2)$, $H_r = \frac{1}{2}\log_2\!\left(2\pi e\,(\sigma_\eta^2 + \sigma_s^2)\right)$. Hence
\[
I = H_r - H_{\mathrm{noise}} = \frac{1}{2}\log_2\frac{\sigma_\eta^2 + \sigma_s^2}{\sigma_\eta^2}
  = \log_2\sqrt{1 + \sigma_s^2/\sigma_\eta^2} = \log_2\sqrt{1 + SNR},
\]
with $SNR \equiv \sigma_s^2/\sigma_\eta^2$.
Note if $r = r(f)$, $s = s(f)$, then $I = \int df\, \log_2(1 + SNR(f))$.
MAP estimate of $s$:

\begin{align*}
P(s|r) &= P(s)P(r|s)/P(r) \\
&= \frac{1}{P(r)}\, N \exp[-s^2/2\sigma_s^2]\,\exp[-(r-s)^2/2\sigma_\eta^2] \\
&\propto \exp\!\left[-\tfrac{1}{2}s^2\left(\sigma_s^{-2} + \sigma_\eta^{-2}\right) + s r\,\sigma_\eta^{-2}\right]
\end{align*}
Setting $d/ds = 0$, the maximum is at $s = r\,\frac{\sigma_s^2}{\sigma_s^2 + \sigma_\eta^2}$. Interpretation: because $r$ can be large simply due to the noise, the estimate "conservatively" shrinks $r$ towards zero.
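A simulation sketch of the shrinkage result (assumed parameter values): bin the observed $r$ and compare the empirical conditional mean of $s$ (which for Gaussians equals the MAP estimate) with the predicted linear shrinkage.

\begin{verbatim}
% check s_hat = r * sigma_s^2 / (sigma_s^2 + sigma_eta^2) by simulation
n = 2e5;  sig_s = 1.0;  sig_eta = 0.5;
s = sig_s*randn(n, 1);  r = s + sig_eta*randn(n, 1);
shrink = sig_s^2 / (sig_s^2 + sig_eta^2);
edges = -3:0.5:3;  centres = edges(1:end-1) + 0.25;
emp = zeros(size(centres));
for k = 1:numel(centres)
    idx = r >= edges(k) & r < edges(k+1);
    emp(k) = mean(s(idx));                       % empirical E[s | r in bin k]
end
disp([centres(:), shrink*centres(:), emp(:)]);   % bin centre, predicted, empirical
fprintf('I = %.3f bits\n', log2(sqrt(1 + sig_s^2/sig_eta^2)));   % mutual information check
\end{verbatim}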
