Derivations NIP Course
Some notation:
• $\partial_x = \frac{\partial}{\partial x}$: partial derivative w.r.t. $x$.
1 Encoding lectures
1.1 Higher moments of a Gaussian
Q: Consider an uncorrelated (white) Gaussian signal $s(t)$. Calculate its 3rd and 4th order moments, $m_3 = \langle s(t_1)s(t_2)s(t_3)\rangle$ and $m_4 = \langle s(t_1)s(t_2)s(t_3)s(t_4)\rangle$. Confirm with simulation.
The third moment $m_3 = \langle s(t_1)s(t_2)s(t_3)\rangle$ is zero. Namely, if $t_1 \neq t_2 \neq t_3$ then $m_3 = \langle s\rangle^3 = 0^3 = 0$. If $t_1 = t_2 = t_3$, $m_3 = \langle s^3\rangle = \int N(0,\sigma^2)(x)\, x^3\, dx = 0$, as this is the integral over an odd function. For $t_1 = t_2 \neq t_3$, $m_3 = \langle s\rangle\langle s^2\rangle = 0 \cdot \sigma^2 = 0$.
Consider $m_4 = \langle s(t_1)s(t_2)s(t_3)s(t_4)\rangle$. Here we have contributions when $t_1 = t_2$ and $t_3 = t_4$, and permutations thereof:
\[
\begin{aligned}
m_4 &= \delta(t_1-t_2)\delta(t_3-t_4)\langle s^2\rangle^2 + \delta(t_1-t_3)\delta(t_2-t_4)\langle s^2\rangle^2 + \delta(t_1-t_4)\delta(t_2-t_3)\langle s^2\rangle^2 \\
&= \left[\delta(t_1-t_2)\delta(t_3-t_4) + \delta(t_1-t_3)\delta(t_2-t_4) + \delta(t_1-t_4)\delta(t_2-t_3)\right]\sigma^4
\end{aligned}
\]
The fact that there is no additional contribution when $t_1 = t_2 = t_3 = t_4$ is related to the fourth cumulant of the Gaussian being zero.
In Matlab you can test this by trying things like mean(randn(1,n).*randn(1,n).*randn(1,n).*randn(1,n)) for distinct times and mean(randn(1,n).^2 .* randn(1,n).*randn(1,n)) for $t_1 = t_2 \neq t_3 \neq t_4$; both should be close to zero for large n.
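A slightly fuller check (a minimal sketch, with $\sigma = 1$ so that $\sigma^4 = 1$ and $3\sigma^4 = 3$):

n = 1e6;
s = randn(1, n);                              % white Gaussian signal, sigma = 1
mean(s.^3)                                    % ~0 : m3 with t1 = t2 = t3
mean(randn(1,n).*randn(1,n).*randn(1,n))      % ~0 : m3 with all times distinct
mean(randn(1,n).^2 .* randn(1,n).^2)          % ~1 : m4 with t1 = t2, t3 = t4 -> sigma^4
mean(s.^4)                                    % ~3 : m4 with all times equal -> 3*sigma^4 (three pairings)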
Q: Show that when the stimulus is Gaussian with variance $\sigma^2$, one can indeed extract the kernels from the correlations of the stimulus with the response without contamination from higher orders, i.e. show that $\langle r\rangle = g_0$ and $\langle r(t)s(t-\tau)\rangle = \sigma^2 g_1(\tau)$.
\[
\begin{aligned}
\langle r\rangle &= \left\langle g_0 + \int dt_1\, g_1(t_1)s(t-t_1) + \int dt_1 dt_2\, g_2(t_1,t_2)s(t-t_1)s(t-t_2) - \sigma^2 \int dt_1\, g_2(t_1,t_1) \right\rangle \\
&= g_0 + \int dt_1\, \langle s\rangle\, g_1(t_1) + \int dt_1 dt_2\, g_2(t_1,t_2)\,\langle s(t-t_1)s(t-t_2)\rangle - \sigma^2 \int dt_1\, g_2(t_1,t_1) \\
&= g_0 + 0 + \int dt_1 dt_2\, g_2(t_1,t_2)\,\sigma^2\delta(t_1-t_2) - \sigma^2 \int dt_1\, g_2(t_1,t_1) \\
&= g_0
\end{aligned}
\]
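The second identity follows in the same way, using the vanishing third moment from 1.1:
\[
\begin{aligned}
\langle r(t)s(t-\tau)\rangle &= g_0\langle s(t-\tau)\rangle + \int dt_1\, g_1(t_1)\langle s(t-t_1)s(t-\tau)\rangle \\
&\quad + \int dt_1 dt_2\, g_2(t_1,t_2)\langle s(t-t_1)s(t-t_2)s(t-\tau)\rangle - \sigma^2 \int dt_1\, g_2(t_1,t_1)\langle s(t-\tau)\rangle \\
&= 0 + \int dt_1\, g_1(t_1)\,\sigma^2\delta(t_1-\tau) + 0 - 0 = \sigma^2 g_1(\tau)
\end{aligned}
\]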
Minimize $E = (r - Sg)^T(r - Sg)$ w.r.t. $g$, where $g = (g_0, g_{11}, \ldots, g_{1N})$. For the definition of $S$ see the lecture notes.
\[
\begin{aligned}
\frac{dE}{dg_j} &= \frac{d}{dg_j}\left[(r - Sg)^T(r - Sg)\right] \\
&= -2(r - Sg)^T \frac{d(Sg)}{dg_j} \\
&= -2\sum_i (r - Sg)_i \frac{d}{dg_j}\sum_k S_{ik} g_k \\
&= -2\sum_i \sum_k (r - Sg)_i S_{ik}\,\delta_{j,k} \\
&= -2\sum_i (r - Sg)_i S_{ij} \\
&= -2(S^T r)_j + 2(S^T S g)_j
\end{aligned}
\]
As this needs to be zero for all $j$, $S^T r = S^T S g$, or $g = (S^T S)^{-1} S^T r$. This is an equation you often see in linear regression.
Note that at this point we have not used anything about the stimulus.
$(S^T S)_{ik} = \sum_j S_{ji} S_{jk}$, i.e. it equals the dot product of the columns of $S$. If $S$ is a design matrix for a Gaussian stimulus then the expected value is $\langle S^T S\rangle = \mathrm{diag}(n, n\sigma^2, n\sigma^2, n\sigma^2, \ldots)$, or $\langle S^T S\rangle_{ik} = n\delta_{i,k}(\sigma^2 + (1-\sigma^2)\delta_{i,1})$. So
\[
g_0 = \frac{1}{n}\sum_i S_{i1} r_i = \langle r\rangle
\]
\[
g_{1i} = \frac{1}{n\sigma^2}\sum_j S_{ji} r_j = \frac{1}{n\sigma^2}\sum_j s_{j-i}\, r_j
\]
This is a discrete formulation of the first two Wiener kernels.
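A minimal Matlab sketch of this regression (the kernel, sizes and noise level below are made up for illustration; only a linear response is simulated):

n  = 1e5;                                  % number of time samples
N  = 20;                                   % kernel length
sigma = 1;                                 % stimulus standard deviation
s  = sigma*randn(n,1);                     % white Gaussian stimulus
k  = exp(-(0:N-1)'/5);                     % example kernel g1
g0 = 0.5;                                  % offset
r  = g0 + filter(k,1,s) + 0.1*randn(n,1);  % response = offset + linear filter + noise

S = ones(n, N+1);                          % design matrix: ones, then lagged stimulus
for i = 1:N
    S(:,i+1) = [zeros(i-1,1); s(1:n-i+1)]; % column i+1 holds s delayed by i-1 samples
end
g_ls = (S'*S)\(S'*r);                      % least-squares estimate g = (S'S)^{-1} S'r
g_xc = [mean(r); (S(:,2:end)'*r)/(n*sigma^2)];  % correlation estimate (valid for white s)
% g_ls(2:end) and g_xc(2:end) should both approximate k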
The response of a linear system is $r(t) = r_0 + \int dt_1\, g_1(t_1)s(t-t_1)$. Using a Lagrange multiplier to constrain the power $\int dt_1\, s^2(t_1)$, we maximize $f(t) = r(t) - \lambda \int dt_1\, s^2(t_1)$ w.r.t. $s(t)$. Note, this is a functional derivative; if unfamiliar, you can discretize in time (integrals become sums).
\[
\begin{aligned}
\frac{\delta f(t)}{\delta s(t_0)} &= \frac{\delta r(t)}{\delta s(t_0)} - \lambda \frac{\delta}{\delta s(t_0)}\int dt_1\, s^2(t_1) \\
&= 0 + \int dt_1\, g_1(t_1)\frac{\delta s(t-t_1)}{\delta s(t_0)} - \lambda \int dt_1\, 2 s(t_1)\,\delta(t_0-t_1) \\
&= \int dt_1\, g_1(t_1)\,\delta(t_0 - t + t_1) - 2\lambda \int dt_1\, s(t_1)\,\delta(t_0-t_1) \\
&= g_1(t - t_0) - 2\lambda s(t_0)
\end{aligned}
\]
Setting this to zero gives $s(t_0) \propto g_1(t - t_0)$: the optimal stimulus is the time-reversed kernel.
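A quick numerical illustration (the kernel below is made up; all stimuli are normalized to unit power, and the response is read out at the end of the window):

T  = 50;
g1 = exp(-(0:T-1)/10) .* sin((0:T-1)/3);   % example kernel
g1 = g1/norm(g1);
s_opt = fliplr(g1);                        % time-reversed kernel, unit power
r_opt = sum(g1 .* fliplr(s_opt))           % = 1, the maximal response
r_rand = zeros(1,1000);
for trial = 1:1000
    s = randn(1,T); s = s/norm(s);         % random stimulus with the same power
    r_rand(trial) = sum(g1 .* fliplr(s));  % response r = sum_tau g1(tau) s(T-tau)
end
max(abs(r_rand))                           % < r_opt, by Cauchy-Schwarz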
2 Decoding lecture
2.1 Fisher Information Gaussian noise
Q: Assume $N$ neurons with Gaussian independent noise and tuning curves $f_i(\theta)$ ($i = 1 \ldots N$). Show that the Fisher info equals $I_f(\theta) = \frac{1}{\sigma^2}\sum_{i=1}^N f_i'(\theta)^2$.
The Fisher information is defined as $I_f \equiv -\int P(r|\theta)\,\frac{\partial^2 \log P(r|\theta)}{\partial\theta^2}\, dr$. For independent Gaussian noise $P(r|\theta) = \prod_{i=1}^N P(r_i|\theta) = \prod_{i=1}^N \frac{1}{Z}\exp\!\left(-[r_i - f_i(\theta)]^2/2\sigma^2\right)$.
So that
\[
\begin{aligned}
\frac{\partial^2 \log P(r|\theta)}{\partial\theta^2} &= \left(-N\log Z - \sum_i [r_i - f_i(\theta)]^2/2\sigma^2\right)'' \\
&= \left(\frac{1}{\sigma^2}\sum_i [r_i - f_i(\theta)]f_i'(\theta)\right)' \\
&= \frac{1}{\sigma^2}\sum_i \left[(r_i - f_i(\theta))f_i''(\theta) - (f_i'(\theta))^2\right]
\end{aligned}
\]
Now use that $\int P(r|\theta)\,dr = 1$ and $\int P(r|\theta)\, r_i\, dr = f_i$. Also note that for this factorizing probability and a general function $g(r_i)$, $\int dr\, P(r|\theta)\sum_i g(r_i) = \sum_i \int dr_i\, P(r_i|\theta)\, g(r_i)$. So that
\[
\begin{aligned}
I_f(\theta) &= -\int P(r|\theta)\,\frac{1}{\sigma^2}\sum_i \left[(r_i - f_i(\theta))f_i''(\theta) - (f_i'(\theta))^2\right] dr \\
&= -\frac{1}{\sigma^2}\sum_i \left\{ f_i''(\theta)\int P(r_i|\theta)[r_i - f_i(\theta)]\,dr_i - (f_i'(\theta))^2 \int P(r_i|\theta)\,dr_i \right\} \\
&= \frac{1}{\sigma^2}\sum_{i=1}^N f_i'(\theta)^2 .
\end{aligned}
\]
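A quick Monte-Carlo sanity check (toy tuning curves and parameters chosen here for illustration): the Fisher information equals the variance of the score $\partial_\theta \log P(r|\theta) = \sum_i (r_i - f_i(\theta))f_i'(\theta)/\sigma^2$, which we can estimate from simulated responses.

N = 5; sigma = 0.2; theta = 0.3; w = 0.5;
pref = linspace(-1, 1, N);                    % preferred stimuli (assumed)
f  = exp(-(pref - theta).^2/(2*w^2));         % f_i(theta)
fp = f .* (pref - theta)/w^2;                 % f_i'(theta)
ntrial = 1e5;
eta   = sigma*randn(ntrial, N);               % Gaussian noise; responses are f + eta
score = eta*fp'/sigma^2;                      % d/dtheta log P(r|theta), one value per trial
If_mc     = var(score)                        % Monte-Carlo estimate
If_theory = sum(fp.^2)/sigma^2                % the analytic result above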
The number of spikes $n_i$ that a Poisson neuron with rate $f_i(\theta)$ fires in a time $T$ is distributed as $P(n_i|\theta) = [f_i(\theta)T]^{n_i}\exp(-f_i(\theta)T)/n_i!$. To calculate the Fisher info: $\sum_i \log P(n_i|\theta) = \sum_i [n_i\log f_i(\theta)T - \log n_i! - f_i(\theta)T]$. Now $\sum_i f_i \approx$ const for dense coding, so the only $\theta$-dependent term is $\sum_i n_i \log f_i(\theta)$. Therefore
\[
\frac{\partial^2 \log P}{\partial\theta^2} = \sum_i n_i \left(\frac{f_i'}{f_i}\right)' = \sum_i n_i \left[\frac{f_i''}{f_i} - \left(\frac{f_i'}{f_i}\right)^2\right]
\quad\text{and}\quad
I_f = -\left\langle \frac{\partial^2 \log P}{\partial\theta^2} \right\rangle = \sum_i \langle n_i\rangle \left[\left(\frac{f_i'}{f_i}\right)^2 - \frac{f_i''}{f_i}\right].
\]
Note that $\langle n_i\rangle = \sum_{n_i} P(n_i|\theta)\, n_i = T f_i$ (the average spike count is the tuning curve times $T$). So that
\[
I_f(\theta) = T\sum_i \left[\frac{f_i'(\theta)^2}{f_i(\theta)} - f_i''(\theta)\right]
\]
For dense tuning curves we can replace the sum over units by an integral over the centres of the tuning curves. The second term then vanishes (the integral of $f_i''$ over centres is zero). You can see this by, for instance, substituting Gaussian-shaped tuning curves $f_i(\theta) = A\exp(-(\Delta_i - \theta)^2/2\sigma^2)$.
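A short numerical check of this claim (parameters below are arbitrary; a dense grid of Gaussian tuning curves around the stimulus):

A = 10; w = 0.5; T = 1; theta = 0.3;
centres = -10:0.1:10;                           % dense grid of tuning-curve centres
f   = A*exp(-(centres - theta).^2/(2*w^2));     % f_i(theta)
fp  = f .* (centres - theta)/w^2;               % f_i'(theta)
fpp = f .* ((centres - theta).^2/w^4 - 1/w^2);  % f_i''(theta)
T*sum(fp.^2 ./ f)                               % dominant first term
T*sum(fpp)                                      % second term, approximately zero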
2.3 Example Fisher information correlated neurons
Q: Consider two neurons, with response r1 and r2 with correlated Gaussian noise ri = fi (θ) + ηi . As-
sume the correlation matrix of noise η equals Qij = σ 2 (δij + c(1 − δij )). The neurons have tuning curves
f1 (θ) and f2 (θ) = αf1 (θ). Calculate Fisher info IF (θ). You can use that for stimulus-independent noise
IF = f ′ (θ)Q−1 f ′ (θ).
For correlated Gaussian noise (e.g. Abbott and Dayan, Neural Comput. 1999),
\[
I = f'(\theta)^T Q^{-1}(\theta) f'(\theta) + \frac{1}{2}\mathrm{Tr}\!\left(Q'(\theta)Q^{-1}(\theta)Q'(\theta)Q^{-1}(\theta)\right)
\]
which in our case simplifies to $I = f'(\theta)^T Q^{-1} f'(\theta)$.
As
\[
Q^{-1} = \frac{1}{\sigma^2(1-c^2)}\begin{pmatrix} 1 & -c \\ -c & 1 \end{pmatrix},
\]
we get $I = \frac{1}{\sigma^2(1-c^2)}\left[f_1'^2 + f_2'^2 - 2c f_1' f_2'\right]$. Now assume that $f_2(\theta) = \alpha f_1(\theta)$ (so the neurons have identical tuning apart from a scale factor, that could be negative). Then
\[
I = \frac{f_1'^2}{\sigma^2(1-c^2)}\left[1 - 2c\alpha + \alpha^2\right]
\]
If $\alpha = 1$, then $I = \frac{2}{1+c}\frac{f_1'^2}{\sigma^2}$. Note that if $\alpha = -1$, $I = \frac{2}{1-c}\frac{f_1'^2}{\sigma^2}$, so now correlation helps the information.
Note that if α = 0 (one neuron does not code for the stimulus), the information diverges for c → ±1.
How would you build a perfect decoder in that limit?
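A small numerical sketch of these expressions (the derivative value $f_1'(\theta) = 2$ and the other parameters are arbitrary); it also hints at the decoder question: for $\alpha = 0$ and $c \to 1$, the second neuron reports the noise of the first, so $r_1 - r_2 \to f_1(\theta)$ exactly.

sigma = 1; c = 0.9; alpha = 0;        % noise std, correlation, tuning scale factor
f1p = 2;                              % assumed value of f1'(theta)
fp  = [f1p; alpha*f1p];               % f'(theta)
Q   = sigma^2*[1 c; c 1];             % noise covariance
I_numeric = fp'*(Q\fp)                % Fisher information f'^T Q^{-1} f'
I_closed  = f1p^2*(1 - 2*c*alpha + alpha^2)/(sigma^2*(1 - c^2))   % closed form above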
On the other hand
\[
\begin{aligned}
\tilde f(\omega)\,\tilde g(\omega) &= \int dt_1 dt_2\, \exp(i\omega t_1) f(t_1)\exp(i\omega t_2) g(t_2) \\
&= \int dt \int d\tau\, \exp(i\omega t)\exp(-i\omega t + i\omega\tau) f(t) g(\tau - t) \qquad (t = t_1,\ \tau = t_1 + t_2) \\
&= \int dt \int d\tau\, \exp(i\omega\tau) f(t) g(\tau - t)
\end{aligned}
\]
which is the Fourier transform of the convolution $\int dt\, f(t)g(\tau - t)$.
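A quick numerical check of the convolution theorem (Matlab's fft uses the opposite sign convention in the exponent, which does not affect the statement; zero-padding makes the circular convolution equal to the linear one):

n = 16;
f = randn(1,n); g = randn(1,n);
F = fft([f zeros(1,n-1)]);            % transforms of the zero-padded signals
G = fft([g zeros(1,n-1)]);
c1 = ifft(F.*G);                      % inverse transform of the product
c2 = conv(f,g);                       % direct convolution
max(abs(c1 - c2))                     % ~1e-15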
3 Information Theory
3.1 Derivation of mutual information
Q: Derive the expression for the mutual information $I_m = H(R) - H(R|S)$ in terms of the distribution $p(r,s)$. Use that $p(r|s)p(s) = p(r,s)$ and $p(r) = \sum_s p(r,s)$.
\[
\begin{aligned}
I_m &= H(R) - H(R|S) \\
&= -\sum_r p(r)\log p(r) + \sum_{r,s} p(r|s)p(s)\log p(r|s) \\
&= -\sum_{r,s} p(r,s)\log p(r) + \sum_{r,s} p(r,s)\log\left[\frac{p(r,s)}{p(s)}\right] \\
&= \sum_{r,s} p(r,s)\log\frac{p(r,s)}{p(r)p(s)}
\end{aligned}
\]
s r1 r2 P (s, r1 , r2 )
0 0 0 0
1 0 0 1/4
0 0 1 1/4
1 0 1 0
0 1 0 1/4
1 1 0 0
0 1 1 0
1 1 1 1/4
Show that this is a synergistic code.
Note that $P(r) = \prod_{i=1,2} P(r_i)$. $I_m(r_1,s) = I_m(r_2,s) = 0$, but $I_m((r_1,r_2),s) > 0$ (here it equals $H(S) = 1$ bit, since $r_1$ and $r_2$ together determine $s$). In other words, observation of just $r_1$ or $r_2$ provides no information about the stimulus. However, observing both does.
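A direct numerical check on the table above (a minimal sketch; probabilities are indexed as P(s+1, r1+1, r2+1)):

P = zeros(2,2,2);
P(2,1,1) = 1/4;  P(1,1,2) = 1/4;       % (s=1,r1=0,r2=0), (s=0,r1=0,r2=1)
P(1,2,1) = 1/4;  P(2,2,2) = 1/4;       % (s=0,r1=1,r2=0), (s=1,r1=1,r2=1)
Ps   = sum(sum(P,3),2);                % P(s)
Psr1 = squeeze(sum(P,3));              % P(s,r1)
Pr1  = sum(Psr1,1)';                   % P(r1)
Pr12 = squeeze(sum(P,1));              % P(r1,r2)
I1 = 0; I12 = 0;
for s = 1:2
  for a = 1:2
    if Psr1(s,a) > 0
      I1 = I1 + Psr1(s,a)*log2(Psr1(s,a)/(Ps(s)*Pr1(a)));
    end
    for b = 1:2
      if P(s,a,b) > 0
        I12 = I12 + P(s,a,b)*log2(P(s,a,b)/(Ps(s)*Pr12(a,b)));
      end
    end
  end
end
I1      % = 0 (and by symmetry the same holds for r2)
I12     % = 1 bit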
We write $p_i = p(r_i)$ and maximize the entropy $H = -\sum_i p_i \log p_i$ under various constraints:
∑ ∑
• Only constrained is that p is a distribution, so that pi = 1. Maximize E = H − λ( pi − 1),
dE
where λ is a Lagrange multiplier. dp = 0∀j (using different indices i and j to prevent errors) so
∑ j
∑ pi
i dpj (−pi log pi − λpi ) = i δij − log pi − pi − λ = 0. So log pi = −1 − λ∀i so
dE d
dpj = 0 =
pi = constant i.e. a uniform distribution
• Constrain the mean to be $\bar r$. Now $E = H - \lambda(\sum_i p_i - 1) - \lambda_2(\sum_i p_i r_i - \bar r)$. So $\log p_i = -1 - \lambda - \lambda_2 r_i$, i.e. $p_i \propto \exp(-\lambda_2 r_i)$, which after normalization and setting the mean gives $p_i = \frac{1}{\bar r}\exp(-r_i/\bar r)$.
• Constrain the variance: the same as above, but now add a term, $E = H - \lambda(\sum_i p_i - 1) - \lambda_2(\sum_i p_i r_i^2 - \sigma_r^2)$ (assuming zero mean). Now $p_i \propto \exp(-c r_i^2)$.
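For the continuous, zero-mean case, fixing the normalization and the variance gives $c = 1/(2\sigma_r^2)$, so
\[
p(r) = \frac{1}{\sqrt{2\pi\sigma_r^2}}\exp\!\left(-\frac{r^2}{2\sigma_r^2}\right),
\]
i.e. the maximum-entropy distribution with fixed variance is the Gaussian.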