Information theory provides a framework for understanding neural coding by quantifying how much information is contained in neural signals. Key points: 1) entropy measures the uncertainty in a signal, and mutual information measures how much one signal reduces the uncertainty of another; 2) for a single neuron, the firing distribution that maximizes entropy (and hence information) is uniform, exponential, or Gaussian, depending on the constraints; 3) experimental evidence shows that neurons can encode information efficiently through strategies such as histogram equalization, which matches the response distribution to the stimulus distribution; 4) for time-varying signals, maximal information is achieved by using all frequency bands up to the level where signal plus noise is constant, analogous to water filling.

Information Theory

Mark van Rossum

School of Informatics, University of Edinburgh

January 24, 2018

Version: January 24, 2018
1 / 35
Why information theory

Understanding the neural code.


Encoding and decoding: we imposed coding schemes, such as a 2nd-order kernel or an NLP model, and possibly lost information in doing so.
Instead, use information theory:
No need to impose an encoding or decoding scheme (non-parametric).
Particularly important for 1) spike-timing codes, 2) higher areas.
Estimates how much information is coded in a given signal.
Caveats:
No easy decoding scheme for the organism (upper bound only).
Requires more data, and biases are tricky.

2 / 35
Overview

Entropy, Mutual Information


Entropy Maximization for a Single Neuron
Maximizing Mutual Information
Estimating information
Reading: Dayan and Abbott, ch. 4; Rieke et al. [1996]

3 / 35
Definition

The entropy of a quantity X is defined as

H(X) = −Σ_x P(x) log2 P(x)

This is not 'derived', nor fully unique, but it fulfills these criteria:
Continuous.
If p_i = 1/n, it increases monotonically with n: H = log2 n.
Parallel independent channels add.
"Unit": bits.
Entropy can be thought of as physical entropy, the "richness" of a distribution
[Shannon and Weaver, 1949, Cover and Thomas, 1991, Rieke et al., 1996]
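A minimal Python/NumPy sketch of this definition (not part of the original slides); the example distributions are arbitrary illustrations.

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x P(x) log2 P(x), in bits; zero-probability outcomes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy(np.ones(8) / 8))        # uniform over n = 8 outcomes: log2(8) = 3.0 bits
print(entropy([0.7, 0.1, 0.1, 0.1]))  # a peaked distribution: about 1.36 bits
```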

4 / 35
Entropy

Discrete variable R:

H(R) = −Σ_r p(r) log2 p(r)

Continuous variable at resolution ∆r:

H(R) = −Σ_r p(r)∆r log2(p(r)∆r) = −Σ_r p(r)∆r log2 p(r) − log2 ∆r

Letting ∆r → 0 we have

lim_{∆r→0} [H + log2 ∆r] = −∫ p(r) log2 p(r) dr

(also called differential entropy)

5 / 35
Joint, Conditional entropy

Joint entropy:

H(S, R) = −Σ_{r,s} P(s, r) log2 P(s, r)

Conditional entropy:

H(S|R) = Σ_r P(R = r) H(S|R = r)
       = −Σ_r P(r) Σ_s P(s|r) log2 P(s|r)
       = H(S, R) − H(R)

If S, R are independent:

H(S, R) = H(S) + H(R)

6 / 35
Mutual information

Mutual information:

Im(R; S) = Σ_{r,s} p(r, s) log2 [p(r, s) / (p(r) p(s))]
         = H(R) − H(R|S) = H(S) − H(S|R)

Measures the reduction in uncertainty of R from knowing S (or vice versa).
Im(R; S) ≥ 0
The continuous version is the difference of two entropies; the ∆r divergence cancels.

7 / 35
Mutual Information

The joint histogram determines mutual information.


Given P(r, s) ⇒ Im.

8 / 35
Mutual Information: Examples

Two joint distributions P(Y1, Y2), with Y1 = smoker / non-smoker and Y2 = lung cancer / no lung cancer:

Left table:                smoker   non-smoker
    lung cancer              1/3        0
    no lung cancer            0        2/3

Right table:               smoker   non-smoker
    lung cancer              1/9       2/9
    no lung cancer           2/9       4/9

Only for the left joint probability is Im > 0 (how much? see the sketch below). On the right,
knowledge about Y1 does not inform us about Y2.
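To make the "how much?" concrete: a small Python/NumPy sketch (not part of the slides) that evaluates Im for the two joint tables above, with rows and columns as in the reconstruction.

```python
import numpy as np

def mutual_information(joint):
    """Im(Y1; Y2) = sum_ij p(i,j) log2[ p(i,j) / (p_i p_j) ] for a 2-D joint table (bits)."""
    joint = np.asarray(joint, dtype=float)
    p_row = joint.sum(axis=1, keepdims=True)   # marginal over columns
    p_col = joint.sum(axis=0, keepdims=True)   # marginal over rows
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (p_row @ p_col)[nz]))

# Rows: lung cancer / no lung cancer; columns: smoker / non-smoker
left = np.array([[1/3, 0.0],
                 [0.0, 2/3]])   # perfectly correlated
right = np.array([[1/9, 2/9],
                  [2/9, 4/9]])  # independent (joint = product of marginals)

print(mutual_information(left))   # ~0.918 bits, i.e. H(1/3, 2/3)
print(mutual_information(right))  # 0.0 bits
```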

10 / 35
Kullback-Leibler divergence

KL-divergence measures the distance between two probability distributions:

DKL(P||Q) = ∫ P(x) log2 [P(x)/Q(x)] dx,   or   DKL(P||Q) ≡ Σ_i P_i log2(P_i/Q_i)

Not symmetric, but can be symmetrized.
Im(R; S) = DKL(p(r, s) || p(r)p(s)).
Often used as a probabilistic cost function: DKL(data||model).
Other probability distances exist (e.g. the earth mover's distance).

11 / 35
Mutual info between jointly Gaussian variables

I(Y1; Y2) = ∫∫ P(y1, y2) log2 [P(y1, y2) / (P(y1) P(y2))] dy1 dy2 = −(1/2) log2(1 − ρ²)

ρ is the (Pearson-r) correlation coefficient.
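A quick numerical sketch (Python/NumPy, not from the slides) evaluating −½ log2(1 − ρ²) and checking it against the correlation estimated from samples; ρ = 0.8 is an arbitrary assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8

# Analytic result for jointly Gaussian Y1, Y2 with correlation coefficient rho
I_analytic = -0.5 * np.log2(1.0 - rho**2)

# Sample-based check: estimate rho from data and plug it into the same formula
y = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=100_000)
rho_hat = np.corrcoef(y[:, 0], y[:, 1])[0, 1]
I_estimate = -0.5 * np.log2(1.0 - rho_hat**2)

print(I_analytic, I_estimate)   # both close to 0.737 bits for rho = 0.8
```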


12 / 35
Populations of Neurons

Given

H(R) = −∫ p(r) log2 p(r) dr − N log2 ∆r

and

H(Ri) = −∫ p(ri) log2 p(ri) dri − log2 ∆r

we have

H(R) ≤ Σ_i H(Ri)

(proof: consider the KL divergence)

13 / 35
Mutual information in populations of Neurons

Redundancy can be defined as (compare to above)

R = Σ_{i=1}^{n_r} I(r_i; s) − I(r; s)

Some codes have R > 0 (redundant code), others R < 0 (synergistic).

Example of a synergistic code (verified in the sketch below): P(r1, r2, s) with
P(0, 0, 1) = P(0, 1, 0) = P(1, 0, 0) = P(1, 1, 1) = 1/4
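A small Python/NumPy sketch (not part of the slides) verifying that this example is synergistic: each response alone carries zero information about s, but the pair carries 1 bit, so R = −1.

```python
import numpy as np

def mi(joint):
    """Mutual information (bits) from a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    p_row = joint.sum(axis=1, keepdims=True)
    p_col = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (p_row @ p_col)[nz]))

# The synergistic example: P(r1, r2, s) = 1/4 for (0,0,1), (0,1,0), (1,0,0), (1,1,1)
p = {(0, 0, 1): 0.25, (0, 1, 0): 0.25, (1, 0, 0): 0.25, (1, 1, 1): 0.25}

def single_cell_joint(which):
    """Joint table P(r_i, s), marginalizing out the other response."""
    table = np.zeros((2, 2))
    for (r1, r2, s), prob in p.items():
        table[(r1, r2)[which], s] += prob
    return table

# Joint table P((r1, r2), s): treat the response pair as one 4-valued variable
population = np.zeros((4, 2))
for (r1, r2, s), prob in p.items():
    population[2 * r1 + r2, s] += prob

I1, I2 = mi(single_cell_joint(0)), mi(single_cell_joint(1))   # each 0 bits
I_pop = mi(population)                                        # 1 bit
print(I1 + I2 - I_pop)   # redundancy R = -1 < 0: synergistic
```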

14 / 35
Entropy Maximization for a Single Neuron

Im(R; S) = H(R) − H(R|S)

If the noise entropy H(R|S) is independent of the transformation S → R, we can maximize mutual information by maximizing H(R) under the given constraints.
Possible constraint: the response lies in 0 < r < rmax. H(R) is then maximal for p(r) ~ U(0, rmax) (U is the uniform distribution).
If the average firing rate is limited and 0 < r < ∞: the exponential distribution is optimal, p(r) = (1/r̄) exp(−r/r̄), with H = log2(e r̄).
If the variance is fixed and −∞ < r < ∞: the Gaussian distribution, with H = (1/2) log2(2πeσ²) (note the funny units).

15 / 35
Let r = f(s) with s ~ p(s). Which f (assumed monotonic) maximizes H(R) under the maximal-firing-rate constraint? Require:

P(r) = 1/rmax

p(s) = p(r) dr/ds = (1/rmax) df/ds

Thus df/ds = rmax p(s), and

f(s) = rmax ∫_{smin}^{s} p(s') ds'

This strategy is known as histogram equalization in signal processing.
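A sketch in Python/NumPy (not from the slides) of histogram equalization for an assumed Gaussian stimulus distribution; the empirical CDF plays the role of ∫ p(s′) ds′.

```python
import numpy as np

rng = np.random.default_rng(1)
r_max = 100.0                                # assumed maximal firing rate
s = rng.normal(0.0, 2.0, size=50_000)        # assumed Gaussian stimulus ensemble p(s)

# f(s) = r_max * integral_{s_min}^{s} p(s') ds', here built from the empirical CDF
s_sorted = np.sort(s)
def f(x):
    return r_max * np.searchsorted(s_sorted, x) / len(s_sorted)

r = f(s)
counts, _ = np.histogram(r, bins=10, range=(0.0, r_max))
print(counts / counts.sum())    # roughly 0.1 per bin: the response distribution is flat
```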

16 / 35
Fly retina
Evidence that the large monopolar cell in the fly visual system carries
out histogram equalization

The contrast response of the fly large monopolar cell (points) matches the environmental contrast statistics (line) [Laughlin, 1981] (but this changes under high-noise conditions).
17 / 35
V1 contrast responses

Similar in V1, but with separate On and Off channels [Brady and Field, 2000]
18 / 35
Information of time varying signals

Single analog channel with Gaussian signal s and Gaussian noise η:

r = s + η

I = (1/2) log2(1 + σs²/ση²) = (1/2) log2(1 + SNR)

For time-dependent signals:

I = (T/2) ∫ (dω/2π) log2(1 + s(ω)/n(ω))

To maximize information when the variance of the signal is constrained, use all frequency bands such that signal + noise = constant: whitening. Water-filling analogy (see the sketch below):
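A minimal water-filling sketch in Python/NumPy (not from the slides); the noise spectrum and power budget are arbitrary assumed numbers.

```python
import numpy as np

noise = np.array([1.0, 2.0, 4.0, 8.0, 16.0])   # assumed noise power n(w) per frequency band
P_total = 10.0                                 # assumed total signal-power budget

# Bisect on the "water level" so that sum_w max(0, level - n(w)) equals the budget
lo, hi = noise.min(), noise.max() + P_total
for _ in range(100):
    level = 0.5 * (lo + hi)
    if np.maximum(0.0, level - noise).sum() > P_total:
        hi = level
    else:
        lo = level

signal = np.maximum(0.0, level - noise)        # signal + noise = level in every used band
info = 0.5 * np.log2(1.0 + signal / noise)     # bits per sample per band
print(signal)       # power goes only into the quieter bands
print(info.sum())
```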

19 / 35
Information of graded synapses

Light → (photon noise) → photoreceptor → (synaptic noise) → LMC

At low light levels photon noise dominates and synaptic noise is negligible.
Information rate: 1500 bits/s [de Ruyter van Steveninck and Laughlin, 1996].

20 / 35
Spiking neurons: maximal information

Spike train with N = T/δt bins [Mackay and McCullogh, 1952], δt the "time resolution".
N1 = pN spikes; number of words = N!/(N1!(N − N1)!)
Maximal entropy if all words are equally likely:
H = −Σ_i p_i log2 p_i = log2 N! − log2 N1! − log2(N − N1)!
Using, for large x, log x! ≈ x(log x − 1):

H = −(T/δt) [p log2 p + (1 − p) log2(1 − p)]

For low rates p ≪ 1, setting p = λδt:

H = Tλ log2(e/(λδt))
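A quick numerical check (Python/NumPy, not from the slides) of the exact binary-bin entropy against the low-rate approximation, for assumed values T = 1 s, δt = 1 ms, λ = 10 spikes/s.

```python
import numpy as np

T = 1.0        # assumed window length (s)
dt = 1e-3      # assumed time resolution (s)
rate = 10.0    # assumed firing rate lambda (spikes/s); low-rate regime: lambda*dt << 1
p = rate * dt  # spike probability per bin

H_exact = -(T / dt) * (p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
H_lowrate = T * rate * np.log2(np.e / (rate * dt))

print(H_exact, H_lowrate)   # about 80.8 vs 80.9 bits: the approximation is close
```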

21 / 35
Spiking neurons

The calculation is incorrect when there are multiple spikes per bin. Instead, for large bins the maximal information is obtained with an exponential count distribution:

P(n) = (1/Z) exp[−n log(1 + 1/⟨n⟩)]

H = log2(1 + ⟨n⟩) + ⟨n⟩ log2(1 + 1/⟨n⟩) ≈ log2(1 + ⟨n⟩) + 1

22 / 35
Spiking neurons: rate code

[Stein, 1967]

Measure the rate in a window T, during which the stimulus is constant.

A periodic neuron can maximally encode 1 + (fmax − fmin)T stimuli, so
H ≈ log2[1 + (fmax − fmin)T]. Note: this grows only ∝ log(T).

23 / 35
[Stein, 1967]
Similar behaviour for a Poisson neuron: H ∝ log(T)

24 / 35
Spiking neurons: dynamic stimuli

[de Ruyter van Steveninck et al., 1997], but see [Warzecha and Egelhaaf, 1999].
25 / 35
Maximizing Information Transmission: single output

Single linear neuron with post-synaptic noise

v =w·u+η

where η is an independent noise variable

Im (u; v ) = H(v ) − H(v |u)

Second term depends only on p(η)


To maximize Im we need to maximize H(v); a sensible constraint is ‖w‖² = 1.
If u ~ N(0, Q) and η ~ N(0, ση²) then v ~ N(0, wᵀQw + ση²)

26 / 35
For a Gaussian RV with variance σ² we have H = (1/2) log2(2πeσ²). To maximize H(v) we need to maximize wᵀQw subject to the constraint ‖w‖² = 1.
Thus w ∝ e1, the leading eigenvector of Q, so we obtain PCA (see the sketch below).
If v is non-Gaussian then this calculation gives an upper bound on H(v) (as the Gaussian distribution is the maximum-entropy distribution for a given mean and covariance).
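A small Python/NumPy sketch (not from the slides) of this result: with ‖w‖ = 1, the output variance wᵀQw (and hence H(v)) is maximized by the leading eigenvector of Q. The covariance Q and noise level are assumed example values.

```python
import numpy as np

rng = np.random.default_rng(2)

Q = np.array([[1.0, 0.8],
              [0.8, 1.0]])                    # assumed input covariance (correlated inputs)
sigma_eta = 0.1                               # assumed noise standard deviation

eigvals, eigvecs = np.linalg.eigh(Q)
w = eigvecs[:, -1]                            # leading eigenvector e1, with ||w|| = 1

u = rng.multivariate_normal([0.0, 0.0], Q, size=50_000)
v = u @ w + sigma_eta * rng.standard_normal(len(u))

# Output variance (and hence H(v)) equals w^T Q w + sigma_eta^2, maximal for w = e1
print(np.var(v), eigvals[-1] + sigma_eta**2)
```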

27 / 35
Infomax
Infomax: maximize the information in multiple outputs with respect to the weights [Linsker, 1988]

v = Wu + η

H(v) = (1/2) log det(⟨vvᵀ⟩) (up to an additive constant)

Example: 2 inputs and 2 outputs; the input is correlated; constraint w_{k1}² + w_{k2}² = 1.

At low noise: independent coding; at high noise: joint coding.


28 / 35
Estimating information
Information estimation requires a lot of data.
Many standard statistical estimators are unbiased (mean, variance, ...).
But both the entropy and the noise entropy estimates are biased.

[Panzeri et al., 2007]


29 / 35
One approach: fit a 1/N correction and extrapolate to the infinite-data limit [Strong et al., 1998]
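A small Python/NumPy sketch (not part of the slides) illustrating the bias: the naive plug-in entropy estimate is systematically too low for small sample sizes, and the bias shrinks roughly as 1/N, which is what the extrapolation exploits. The distribution and category count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 16                            # assumed number of response categories
p_true = np.ones(K) / K           # assumed true distribution (uniform), H_true = 4 bits
H_true = np.log2(K)

def plugin_entropy(samples):
    """Naive (plug-in) entropy estimate from observed frequencies."""
    counts = np.bincount(samples, minlength=K)
    freqs = counts[counts > 0] / counts.sum()
    return -np.sum(freqs * np.log2(freqs))

for N in [20, 50, 100, 500, 5000]:
    est = np.mean([plugin_entropy(rng.choice(K, size=N, p=p_true)) for _ in range(200)])
    print(N, round(est, 3), "bias:", round(est - H_true, 3))
# The downward bias shrinks roughly as 1/N, so plotting the estimate against 1/N and
# extrapolating to 1/N -> 0 recovers (approximately) the true entropy.
```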

30 / 35
Common technique for Im: shuffle correction [Panzeri et al., 2007]
See also: [Paninski, 2003, Nemenman et al., 2002]

31 / 35
Summary

Information theory provides a non-parametric framework for studying neural coding.

Optimal coding schemes depend strongly on the noise assumptions and the optimization constraints.
In data analysis, biases can be substantial.

32 / 35
References I

Brady, N. and Field, D. J. (2000).


Local contrast in natural images: normalisation and coding efficiency.
Perception, 29(9):1041–1055.
Cover, T. M. and Thomas, J. A. (1991).
Elements of information theory.
Wiley, New York.
de Ruyter van Steveninck, R. R. and Laughlin, S. B. (1996).
The rate of information transfer at graded-potential synapses.
Nature, 379:642–645.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., and Bialek, W.
(1997).
Reproducibility and variability in neural spike trains.
Science, 275:1805–1809.
Laughlin, S. B. (1981).
A simple coding procedure enhances a neuron’s information capacity.
Zeitschrift für Naturforschung, 36:910–912.
Linsker, R. (1988).
Self-organization in a perceptual network.
Computer, 21(3):105–117.

33 / 35
References II

Mackay, D. and McCullogh, W. S. (1952).


The limiting information capacity of a neuronal link.
Bull Math Biophys, 14:127–135.
Nemenman, I., Shafee, F., and Bialek, W. (2002).
Entropy and Inference, Revisited.
Advances in Neural Information Processing Systems, 14.
Paninski, L. (2003).
Estimation of Entropy and Mutual Information.
Neural Comp., 15:1191–1253.
Panzeri, S., Senatore, R., Montemurro, M. A., and Petersen, R. S. (2007).
Correcting for the sampling bias problem in spike train information measures.
J Neurophysiol, 98(3):1064–1072.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., and Bialek, W. (1996).
Spikes: Exploring the neural code.
MIT Press, Cambridge.
Shannon, C. E. and Weaver, W. (1949).
The mathematical theory of communication.
University of Illinois Press, Illinois.

34 / 35
References III

Stein, R. B. (1967).
The information capacity of nerve cells using a frequency code.
Biophys J, 7:797–826.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., and Bialek, W. (1998).
Entropy and Information in Neural Spike Trains.
Phys Rev Lett, 80:197–200.
Warzecha, A. K. and Egelhaaf, M. (1999).
Variability in spike trains during constant and dynamic stimulation.
Science, 283(5409):1927–1930.

35 / 35
