Journal of Mathematical Psychology 76 (2017) 198–211


A tutorial on the free-energy framework for modelling perception and learning

Rafal Bogacz ∗

MRC Unit for Brain Network Dynamics, University of Oxford, Mansfield Road, Oxford, OX1 3TH, UK
Nuffield Department of Clinical Neurosciences, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DU, UK

highlights
• Bayesian inference about stimulus properties can be performed by networks of neurons.
• Learning about statistics of stimuli can be achieved by Hebbian synaptic plasticity.
• Structure of the model resembles the hierarchical organization of the neocortex.

article info

Article history: Available online 14 December 2015

abstract

This paper provides an easy to follow tutorial on the free-energy framework for modelling perception developed by Friston, which extends the predictive coding model of Rao and Ballard. These models assume that the sensory cortex infers the most likely values of attributes or features of sensory stimuli from the noisy inputs encoding the stimuli. Remarkably, these models describe how this inference could be implemented in a network of very simple computational elements, suggesting that this inference could be performed by biological networks of neurons. Furthermore, learning about the parameters describing the features and their uncertainty is implemented in these models by simple rules of synaptic plasticity based on Hebbian learning. This tutorial introduces the free-energy framework using very simple examples, and provides step-by-step derivations of the model. It also discusses in more detail how the model could be implemented in biological neural circuits. In particular, it presents an extended version of the model in which the neurons only sum their inputs, and synaptic plasticity only depends on activity of pre-synaptic and post-synaptic neurons.

© 2015 The Author. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

The model of Friston (2005) and the predictive coding model of Rao and Ballard (1999) provide a powerful mathematical framework to describe how the sensory cortex extracts information from noisy stimuli. The predictive coding model (Rao & Ballard, 1999) suggests that visual cortex infers the most likely properties of stimuli from noisy sensory input. The inference in this model is implemented by a surprisingly simple network of neuron-like nodes. The model is called ''predictive coding'', because some of the nodes in the network encode the differences between inputs and predictions of the network. Remarkably, learning about features present in sensory stimuli is implemented by simple Hebbian synaptic plasticity, and Rao and Ballard (1999) demonstrated that the model presented with natural images learns features resembling receptive fields of neurons in the primary visual cortex.

Friston (2005) has extended the model to also represent uncertainty associated with different features. He showed that learning about the variance and co-variance of features can also be implemented by simple synaptic plasticity rules based on Hebbian learning. As the extended model (Friston, 2005) learns the variance and co-variance of features, it offers several new insights. First, it describes how the perceptual systems may differentially weight sources of sensory information depending on their level of noise. Second, it shows how the sensory networks can learn to recognize features that are encoded in the patterns of covariance between inputs, such as textures. Third, it provides a natural way to implement attentional modulation as the reduction in variance of the attended features (we come back to these insights in Discussion). Furthermore, Friston (2005) pointed out that this model can be viewed as an approximate Bayesian inference based on minimization of a function referred to in statistics as free-energy.

∗ Correspondence to: Nuffield Department of Clinical Neurosciences, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DU, UK. E-mail address: [email protected].

The free-energy framework (Friston, 2003, 2005) has been recently extended by Karl Friston and his colleagues to describe how the brain performs different cognitive functions including action selection (FitzGerald, Schwartenbeck, Moutoussis, Dolan, & Friston, 2015; Friston et al., 2013). Furthermore, Friston (2010) proposed that the free-energy theory unifies several theories of perception and action which are closely related to the free-energy framework.

There are many articles which provide an intuition for the free-energy framework and discuss how it relates to other theories and experimental data (Friston, 2003, 2005, 2010; Friston et al., 2013). However, the description of the mathematical details of the theory in these papers requires a very deep mathematical background. The main goal of this paper is to provide an easy to follow tutorial on the free-energy framework. To make the tutorial accessible to a wide audience, it only assumes basic knowledge of probability theory, calculus and linear algebra. This tutorial is planned to be complementary to the existing literature, so it does not focus on the relationship to other theories and experimental data, or on applications to more complex tasks, which are described elsewhere (Friston, 2010; Friston et al., 2013).

In this tutorial we also consider in more detail the neural implementation of the free-energy framework. Any computational model would need to satisfy the following constraints to be considered biologically plausible:

1. Local computation: A neuron performs computations only on the basis of the activity of its input neurons and the synaptic weights associated with these inputs (rather than information encoded in other parts of the circuit).
2. Local plasticity: Synaptic plasticity is only based on the activity of pre-synaptic and post-synaptic neurons.

The model of Rao and Ballard (1999) fully satisfied these constraints. The model of Friston (2005) did not satisfy them fully, but we show that after small modifications and extensions it can satisfy them. So the descriptions of the model in this tutorial slightly differ in a few places or extend the original model to better explain how the proposed computation could be implemented in neural circuits. All such differences or extensions are indicated by footnotes or in the text, and the original model is presented in Appendix A.

It is commonly assumed in theoretical neuroscience (O'Reilly & Munakata, 2000) that the basic computations a neuron performs are the summation of its inputs weighted by the strengths of synaptic connections, and the transformation of this sum through a (monotonic) function describing the relationship between the neuron's total input and output (also termed the firing–input or f–I curve). Whenever possible, we will assume that the computation of the neurons in the described model is limited to these computations (or even just to linear summation of inputs).

We feel that the neural implementation of the model is worth considering, because if the free-energy principle indeed describes the computations in the brain, it can provide an explanation for why the cortex is organized in a particular way. However, to gain such insight it is necessary to start comparing the neural networks implementing the model with those in the real brain. Consequently, we consider in this paper possible neural circuits that could perform the computations required by the theory. Although the neural implementations proposed here are not the only possible ones, it is worth considering them as a starting point for comparison of the model with details of neural architectures in the brain. We hope that such comparison could iteratively lead to refined neural implementations that are more and more similar to real neural circuits.

To make this tutorial as easy to follow as possible we introduce the free-energy framework using a simple example, and then illustrate how the model can scale up to more complex neural architectures. The tutorial provides step-by-step derivations of the model. Some of these derivations are straightforward, and we feel that it would be helpful for the reader to do them on their own to gain a better understanding of the model and to ''keep in mind'' the notation used in the paper. Such straightforward derivations are indicated by ''(TRY IT YOURSELF)'', so after encountering such a label we recommend trying to do the calculation described in the sentence with this label and then comparing the obtained results with those in the paper. To illustrate the model we include simple simulations, but again we feel it would be helpful for a reader to perform them on their own, to get an intuition for the model. Therefore we describe these simulations as exercises.

The paper is organized as follows. Section 2 introduces the model using a very simple example, using as basic mathematical concepts as possible, so it is accessible to a particularly wide audience. Section 3 provides mathematical foundations for the model, and shows how the inference in the model is related to minimization of free-energy. Section 4 then shows how the model scales up to describe the neural circuits in sensory cortex. In these three sections we use notation similar to that used by Friston (2005). Section 5 describes an extended version of the model which satisfies the constraint of local plasticity described above. Finally, Section 6 discusses insights provided by the model.

2. Simplest example of perception

We start by considering in this section a simple perceptual problem in which the value of a single variable has to be inferred from a single observation. To make it more concrete, consider a simple organism that tries to infer the size or diameter of a food item, which we denote by v, on the basis of the light intensity it observes. Let us assume that our simple animal has only one light sensitive receptor which provides it with a noisy estimate of light intensity, which we denote by u. Let g denote a non-linear function relating the average light intensity with the size. Since the amount of light reflected is related to the area of an object, in this example we will consider a simple function g(v) = v². Let us further assume that the sensory input is noisy—in particular, when the size of the food item is v, the perceived light intensity is normally distributed with mean g(v) and variance Σu (although a normal distribution is not the best choice for a distribution of light intensity, as it includes negative numbers, we will still use it for simplicity):

p(u|v) = f(u; g(v), Σu). (1)

In Eq. (1) f(x; µ, Σ) denotes the density of a normal distribution with mean µ and variance Σ:

f(x; µ, Σ) = 1/√(2πΣ) · exp(−(x − µ)²/(2Σ)). (2)

Due to the noise present in the observed light intensity, the animal can refine its guess for the size v by combining the sensory stimulus with the prior knowledge of how large the food items usually are, which it has learnt from experience. For simplicity, let us assume that our animal expects this size to be normally distributed with mean vp and variance Σp (subscript p stands for ''prior''), which we can write as:

p(v) = f(v; vp, Σp). (3)

Let us now assume that our animal observed a particular value of light intensity, and attempts to estimate the size of the food item on the basis of this observation. We will first consider an exact solution to this problem, and illustrate why it would be difficult to compute it in a simple neural circuit. Then we will present an approximate solution that can be easily implemented in a simple network of neurons.
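The generative model of Eqs. (1)–(3) is simple to write down in code. The following is a minimal sketch in Python (the paper's own code, given at the end of the paper, is in Matlab); the parameter values are illustrative assumptions (they match those used later in Exercise 1).

```python
import numpy as np

# Minimal sketch of the generative model of Eqs. (1)-(3): a Gaussian prior over
# the size v and a Gaussian likelihood of the light intensity u around g(v) = v**2.
v_p, Sigma_p = 3.0, 1.0       # prior mean and variance of the size, Eq. (3)
Sigma_u = 1.0                 # variance of the sensory noise

def f(x, mu, Sigma):
    """Density of a normal distribution with mean mu and variance Sigma, Eq. (2)."""
    return np.exp(-(x - mu) ** 2 / (2 * Sigma)) / np.sqrt(2 * np.pi * Sigma)

def g(v):
    """Average light intensity reflected by a food item of size v."""
    return v ** 2

v, u = 1.5, 2.0
print(f(v, v_p, Sigma_p))     # prior density p(v)
print(f(u, g(v), Sigma_u))    # likelihood p(u|v), Eq. (1)
```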

Fig. 1. The posterior probability of the size of the food item in the problem given in Exercise 1.

2.1. Exact solution

To compute how likely different sizes v are given the observed sensory input u, we could use Bayes' theorem:

p(v|u) = p(v) p(u|v) / p(u). (4)

The term p(u) in the denominator of Eq. (4) is a normalization term, which ensures that the posterior probabilities of all sizes p(v|u) integrate to 1:

p(u) = ∫ p(v) p(u|v) dv. (5)

The integral in the above equation sums over the whole range of possible values of v, so it is a definite integral, but for brevity of notation we do not state the limits of integration in this and all other integrals in the paper.

Now combining Eqs. (1)–(5) we can compute numerically how likely different sizes are given the sensory observation. For readers who are not familiar with such Bayesian inference we recommend doing the following exercise now.

Exercise 1. Assume that our animal observed the light intensity u = 2, the level of noise in its receptor is Σu = 1, and the mean and variance of its prior expectation of size are vp = 3 and Σp = 1. Write a computer program that computes the posterior probabilities of sizes from 0.01 to 5, and plots them.

The Matlab code performing this calculation is given at the end of the paper, and the resulting plot is shown in Fig. 1. It is worth observing that such a Bayesian approach integrates the information brought by the stimulus with prior knowledge: please note that the most likely value of v lies between that suggested by the stimulus (i.e. √2) and the most likely value based on prior knowledge (i.e. 3). It may seem surprising that the posterior probability is so low for v = 3, i.e. the mean prior expectation. It comes from the fact that g(3) = 9, which is really far from the observed value u = 2, so p(u = 2|v = 3) is very close to zero. This illustrates how non-intuitive Bayesian inference can be once the relationship between variables is non-linear.
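A possible solution sketch for Exercise 1 in Python (the paper's own Matlab code is given at the end of the paper); the grid step of 0.01 is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Exercise 1: posterior over the size v given the observed light intensity u.
u, Sigma_u = 2.0, 1.0            # observed light intensity and receptor noise
v_p, Sigma_p = 3.0, 1.0          # prior mean and variance of the size

def f(x, mu, Sigma):
    return np.exp(-(x - mu) ** 2 / (2 * Sigma)) / np.sqrt(2 * np.pi * Sigma)

v = np.arange(0.01, 5.01, 0.01)                          # candidate sizes
numerator = f(v, v_p, Sigma_p) * f(u, v ** 2, Sigma_u)   # p(v) p(u|v)
p_u = np.sum(numerator) * 0.01                           # Eq. (5), over the plotted range
posterior = numerator / p_u                              # Eq. (4)

plt.plot(v, posterior)
plt.xlabel('v'); plt.ylabel('p(v|u)')
plt.show()
```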
Let us now discuss why performing such an exact calculation is challenging for a simple biological system. First, as soon as the function g relating the variable we wish to infer with the observations is non-linear, the posterior distribution p(v|u) may not take a standard shape—for example the distribution in Fig. 1 is not normal. Thus representing the distribution p(v|u) requires representing infinitely many values p(v|u) for different possible v rather than a few summary statistics like mean and variance. Second, the computation of the posterior distribution involves computation of the normalization term. Although it has been proposed that circuits within the basal ganglia can compute the normalization term in the case of discrete probability distributions (Bogacz & Gurney, 2007), computation of the normalization for continuous distributions involves evaluating the integral of Eq. (5). Calculating such an integral would be challenging for a simple biological system. This is especially true when the dimensionality of the integrals (i.e., the number of unknown variables) increases beyond a trivial number. Even mathematicians resort to (computationally very expensive) numerical or sampling techniques in this case.

We will now present an approximate solution to the above inference problem, which could be easily implemented in a simple biological system.

2.2. Finding the most likely feature value

Instead of finding the whole posterior distribution p(v|u), let us try to find the most likely size of the food item v which maximizes p(v|u). We will denote this most likely size by φ, and its posterior probability density by p(φ|u). It is reasonable to assume that in many cases the brain represents at a given moment of time only the most likely values of features. For example, in the case of binocular rivalry, only one of the two possible interpretations of sensory inputs is represented.

We will look for the value φ which maximizes p(φ|u). According to Eq. (4), the posterior probability p(φ|u) depends on a ratio of two quantities, but the denominator p(u) does not depend on φ. Thus the value of φ which maximizes p(φ|u) is the same one which maximizes the numerator of Eq. (4). We will denote the logarithm of the numerator by F, as it is related to the negative of the free energy (as we will describe in Section 3):

F = ln p(φ) + ln p(u|φ). (6)

In the above equation we used the property of logarithms ln(ab) = ln a + ln b. We will maximize the logarithm of the numerator of Eq. (4), because it has the same maximum as the numerator itself, as ln is a monotonic function, and it is easier to compute, as the expressions for p(u|φ) and p(φ) involve exponentiation.

To find the parameter φ that describes the most likely size of the food item, we will use a simple gradient ascent: i.e. we will modify φ proportionally to the gradient of F, which will turn out to be a very simple operation. It is relatively straightforward to compute F by substituting Eqs. (1)–(3) into Eq. (6) and then to compute the derivative of F (TRY IT YOURSELF).

F = ln f(φ; vp, Σp) + ln f(u; g(φ), Σu)
  = ln [ 1/√(2πΣp) · exp(−(φ − vp)²/(2Σp)) ] + ln [ 1/√(2πΣu) · exp(−(u − g(φ))²/(2Σu)) ]
  = ln(1/√(2π)) − ½ ln Σp − (φ − vp)²/(2Σp) + ln(1/√(2π)) − ½ ln Σu − (u − g(φ))²/(2Σu)
  = ½ [ −ln Σp − (φ − vp)²/Σp − ln Σu − (u − g(φ))²/Σu ] + C. (7)

We incorporated the constant terms ln(1/√(2π)) into a constant C. Now we can compute the derivative of F over φ:

∂F/∂φ = (vp − φ)/Σp + ((u − g(φ))/Σu) g′(φ). (8)

Fig. 2. Solutions to Exercises 2 and 3. In panel b we have also included quantities that we will see later can be regarded as prediction errors.

In the above equation we used the chain rule to compute the second term, and g′(φ) is the derivative of function g evaluated at φ, so in our example g′(φ) = 2φ. We can find our best guess φ for v simply by changing φ in proportion to the gradient:

φ̇ = ∂F/∂φ. (9)

In the above equation φ̇ is the rate of change of φ with time. Let us note that the update of φ is very intuitive. It is driven by the two terms in Eq. (8): the first moves it towards the mean of the prior, the second moves it according to the sensory stimulus, and both terms are weighted by the reliabilities of the prior and the sensory input respectively.

Now please note that the above procedure for finding the approximate size of the food item is computationally much simpler than the exact method presented at the start of the paper. To gain more appreciation for the simplicity of this computation we recommend doing the following exercise.

Exercise 2. Write a computer program finding the most likely size of the food item φ for the situation described in Exercise 1. Initialize φ = vp, and then find its values in the next 5 time units (you can use Euler's method, i.e. update φ(t + Δt) = φ(t) + Δt ∂F/∂φ with Δt = 0.01).

Fig. 2(a) shows a solution to Exercise 2. Please notice that it rapidly converges to the value of φ ≈ 1.6, which is also the value that maximizes the exact posterior probability p(v|u) shown in Fig. 1.
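A possible solution sketch for Exercise 2 in Python (the paper's own Matlab code is given at the end of the paper):

```python
import numpy as np
import matplotlib.pyplot as plt

# Exercise 2: gradient ascent on F using Eqs. (8)-(9) with Euler integration,
# for the values from Exercise 1.
u, Sigma_u, v_p, Sigma_p = 2.0, 1.0, 3.0, 1.0
dt, T = 0.01, 5.0
phi = v_p                                    # initialize at the prior mean
trajectory = [phi]

for _ in range(int(T / dt)):
    dF = (v_p - phi) / Sigma_p + (u - phi ** 2) / Sigma_u * 2 * phi   # Eq. (8), g'(phi) = 2 phi
    phi += dt * dF                                                     # Eq. (9)
    trajectory.append(phi)

plt.plot(np.arange(len(trajectory)) * dt, trajectory)
plt.xlabel('time'); plt.ylabel('phi')        # phi converges to about 1.6
plt.show()
```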
2.3. A possible neural implementation

One can envisage many possible ways in which the computation described in the previous subsection could be implemented in neural circuits. In this paper we will present a possible implementation which satisfies the constraints of local computation and plasticity described in the Introduction. It slightly differs from the original implementation, which is contained in Appendix A.

While thinking about the neural implementation of the above computation, it is helpful to note that there are two similar terms in Eq. (8), so let us denote them by new variables:

εp = (φ − vp)/Σp (10)
εu = (u − g(φ))/Σu. (11)

The above terms are the prediction errors¹: εu expresses how much the light intensity differs from that expected if the size of the food item was φ, while εp denotes how the inferred size differs from prior expectations. With these new variables the equation for updating φ simplifies to:

φ̇ = εu g′(φ) − εp. (12)

The neural implementation of the model assumes that the model parameters vp, Σp, and Σu are encoded in the strengths of synaptic connections (as they need to be maintained over the animal's lifetime), while the variables φ, εu, and εp and the sensory input u are maintained in the activity of neurons or neuronal populations (as they change rapidly when the sensory input is modified). In particular, we will consider very simple neural ''nodes'' which simply change their activity proportionally to the input they receive, so, for example, Eq. (12) is implemented in the model by a node receiving input equal to the right hand side of this equation. The prediction errors could be computed by nodes with the following dynamics²:

ε̇p = φ − vp − Σp εp (13)
ε̇u = u − g(φ) − Σu εu. (14)

It is easy to show that the nodes with dynamics described by Eqs. (13)–(14) converge to the values defined in Eqs. (10)–(11). Once Eqs. (13)–(14) converge, then ε̇ = 0, so setting ε̇ = 0 and solving Eqs. (13)–(14) for ε, one obtains Eqs. (10)–(11).

The architecture of the network described by Eqs. (12)–(14) is shown in Fig. 3. Let us consider the computations in its nodes. The node εp receives excitatory input from node φ, inhibitory input from a tonically active neuron via a connection with strength vp, and inhibitory input from itself via a connection with strength Σp, so it implements Eq. (13). The nodes φ and εu analogously implement Eqs. (12) and (14), but here the information exchange between them is additionally affected by the function g, and we will discuss this issue in more detail in Section 2.5. We have now described all the details necessary to simulate the model.

Exercise 3. Simulate the model from Fig. 3 for the problem from Exercise 1. In particular, initialize φ = vp, εp = εu = 0, and find their values for the next 5 units of time.

¹ In the original model (Friston, 2005) the prediction errors were normalized slightly differently, as explained in Appendix A.
² The original model does not provide details on the dynamics of the nodes computing prediction errors, but we consider a sample description of their dynamics to illustrate how these nodes can perform their computation.
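A possible solution sketch for Exercise 3 in Python (the paper's own Matlab code is given at the end of the paper):

```python
import numpy as np
import matplotlib.pyplot as plt

# Exercise 3: Euler integration of the three nodes of Fig. 3, i.e. Eqs. (12)-(14),
# for the values from Exercise 1.
u, Sigma_u, v_p, Sigma_p = 2.0, 1.0, 3.0, 1.0
dt, T = 0.01, 5.0
phi, eps_p, eps_u = v_p, 0.0, 0.0
record = []

for _ in range(int(T / dt)):
    d_phi = eps_u * 2 * phi - eps_p            # Eq. (12), with g'(phi) = 2 phi
    d_eps_p = phi - v_p - Sigma_p * eps_p      # Eq. (13)
    d_eps_u = u - phi ** 2 - Sigma_u * eps_u   # Eq. (14)
    phi, eps_p, eps_u = phi + dt * d_phi, eps_p + dt * d_eps_p, eps_u + dt * d_eps_u
    record.append((phi, eps_p, eps_u))

plt.plot(np.arange(len(record)) * dt, np.array(record))
plt.xlabel('time'); plt.legend(['phi', 'eps_p', 'eps_u'])
plt.show()
```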

Fig. 3. The architecture of the model performing simple perceptual inference. Circles denote neural ''nodes'', arrows denote excitatory connections, while lines ended with circles denote inhibitory connections. Labels above the connections encode their strength, and lack of a label indicates a strength of 1. Rectangles indicate the values that need to be transmitted via the connections they label.

The solution to Exercise 3 is shown in Fig. 2(b). The model converges to the same value as in Fig. 2(a), but the convergence is slower, as the model now includes multiple nodes connected by excitatory and inhibitory connections, and such networks have oscillatory tendencies, so these oscillations need to settle for the network to converge.

2.4. Learning model parameters

As our imaginary animal perceives food items through its lifetime, it may wish to refine its expectation about the typical sizes of food items, described by parameters vp and Σp, and about the amount of error it makes observing light intensity, described by parameter Σu. Thus it may wish to update the parameters vp, Σp, and Σu after each stimulus to gradually refine them.

We wish to choose the model parameters for which the perceived light intensities u are least surprising, or in other words most expected. Thus we wish to choose parameters that maximize p(u). However, please recall that p(u) is described by the complicated integral of Eq. (5), so it would be difficult to maximize p(u) directly. Nevertheless, it is simple to maximize a related quantity p(u, φ), which is the joint probability of the sensory input u and our inferred food size φ. Note that p(u, φ) = p(φ)p(u|φ), so F = ln p(u, φ), thus maximization of p(u, φ) can be achieved by maximizing F. A more formal explanation for why the parameters can be optimized by maximizing F will be provided in Section 3.

The model parameters can hence be optimized by modifying them proportionally to the gradient of F. Starting with the expression in Eq. (7) it is straightforward to find the derivatives of F over vp, Σp and Σu (TRY IT YOURSELF):

∂F/∂vp = (φ − vp)/Σp (15)
∂F/∂Σp = ½ ( (φ − vp)²/Σp² − 1/Σp ) (16)
∂F/∂Σu = ½ ( (u − g(φ))²/Σu² − 1/Σu ). (17)

Let us now provide an intuition for why the parameter update rules have their particular form. We note that since the parameters are updated after observing each food item, and different food items observed during the animal's lifetime have different sizes, the parameters never converge. Nevertheless it is useful to consider the values of parameters for which the expected value of the change is 0, as these are the values in the vicinity of which the parameters are likely to be. For example, according to Eq. (15), the expected value of the change in vp is 0 when ⟨(φ − vp)/Σp⟩ = 0, where ⟨⟩ denotes the expected value over trials. This will happen if vp = ⟨φ⟩, i.e. when vp is indeed equal to the expected value of φ. Analogously, the expected value of the change in Σp is 0 when:

⟨ (φ − vp)²/Σp² − 1/Σp ⟩ = 0. (18)

Rearranging the above condition one obtains Σp = ⟨(φ − vp)²⟩, thus the expected value of the change in Σp is 0 when Σp is equal to the variance of φ. An analogous analysis can be made for Σu.

Eqs. (15)–(17) for the update of the model parameters simplify significantly when they are written in terms of prediction errors (TRY IT YOURSELF):

∂F/∂vp = εp (19)
∂F/∂Σp = ½ ( εp² − Σp⁻¹ ) (20)
∂F/∂Σu = ½ ( εu² − Σu⁻¹ ). (21)

The above rules for the update of the parameters correspond to very simple synaptic plasticity mechanisms. All rules include only values that can be ''known'' by the synapse, i.e. the activities of pre-synaptic and post-synaptic neurons, and the strength of the synapse itself. Furthermore, the rules are Hebbian, in the sense that they depend on the products of activity of pre-synaptic and post-synaptic neurons. For example, the change in vp in Eq. (19) is equal to the product of the pre-synaptic activity (i.e. 1) and the post-synaptic activity εp. Similarly, the changes in Σ in Eqs. (20)–(21) depend on the products of pre-synaptic and post-synaptic activities, both equal to ε.

The plasticity rules of Eqs. (20)–(21) also depend on the value of the synaptic weights themselves, as they include terms Σ⁻¹. For the simple case considered in this section, the synapse ''has access'' to the information on its weight. Moreover, the dependence of synaptic plasticity on initial weights has been seen experimentally (Chen et al., 2013), so we feel it is plausible for the dependence predicted by the model to be present in real synapses. However, when the model is scaled up to include multiple features and sensory inputs in Section 4.1, the terms Σ⁻¹ will turn into a matrix inverse (in Eqs. (48)–(49)), so the required changes in each weight will depend on the weights of other synapses in the network. Nevertheless, we will show in Section 5 how this problem can be overcome.

Finally, we would like to discuss the limits on the parameters Σ. Although in principle the variance of a random variable can be equal to 0, if Σp = 0 or Σu = 0, then Eq. (13) or (14) would not converge but instead εp or εu would diverge to positive or negative infinity. Similarly, if Σ were close to 0, the convergence would be very slow. To prevent this from happening, a minimum value of 1 is imposed by Friston (2005) on the estimated variance.³

³ In the original model, variables λ = Σ − 1 were defined, and these variables were encoded in the synaptic connections. Formally, this constraint is known as a hyperprior. This is because the variance or precision parameters are often referred to mathematically as hyperparameters. Whenever we place constraints on hyperparameters we necessarily invoke hyperpriors.
So far we have assumed for simplicity that the relationship g
Let us now provide an intuition for why the parameter update between the variable being inferred and the stimulus is known.
rules have their particular form. We note that since parameters However, in general it may not be known, or may need to be tuned.
are updated after observing each food item, and different food
items observed during animal’s life time have different sizes, the
parameters never converge. Nevertheless it is useful to consider √
3 In the original model, variables λ = Σ − 1 were defined, and these variables
the values of parameters for which the expected value of change
were encoded in the synaptic connections. Formally, this constraint is known
is 0, as these are the values in vicinity of which the parameters are as a hyperprior. This is because the variance or precision parameters are often
likely to be. For example, according to Eq. (15), the expected value referred to mathematically as hyperparameters. Whenever we place constraints on
of change in vp is 0 when ⟨(φ − vp )/Σp ⟩ = 0, where ⟨⟩ denotes hyperparameters we necessarily invoke hyperpriors.

Fig. 4. Architectures of models with linear and nonlinear function g. Circles and hexagons denote linear and nonlinear nodes respectively. Filled arrows and lines ended with circles denote excitatory and inhibitory connections respectively, and an open arrow denotes a modulatory influence.

So we will now consider a function g(v, θ) that also depends on a parameter which we denote by θ.

We will consider two special cases of the function g(v, θ), in which the parameter θ has a clear biological interpretation. First, let us consider the simple case of a linear function, g(v, θ) = θv, as then the model has a straightforward neural implementation. In this case, Eqs. (12)–(14) describing the model simplify to:

φ̇ = θεu − εp (22)
ε̇p = φ − vp − Σp εp (23)
ε̇u = u − θφ − Σu εu. (24)

In this model, the nodes φ and ε simply communicate through connections with weight θ, as shown in Fig. 4(a). Furthermore, we can also derive the rule for updating the parameter θ by finding the gradient of F over θ, as now the function g in Eq. (7) depends on θ (TRY IT YOURSELF):

∂F/∂θ = εu φ. (25)

Please note that this rule is again Hebbian, as the synaptic weights encoding θ are modified proportionally to the activities of pre-synaptic and post-synaptic neurons (see Fig. 4(a)).
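The following sketch (not from the paper) illustrates learning of θ with the Hebbian rule of Eq. (25). For brevity it uses the fixed-point value of φ from Eqs. (22)–(24) rather than simulating the nodes, and the true scaling factor, noise levels and learning rate are assumptions.

```python
import numpy as np

# Illustrative sketch of learning the linear relation g(v, theta) = theta * v.
rng = np.random.default_rng(0)
theta_true = 2.0                        # relation actually generating the input
v_p, Sigma_p, Sigma_u = 3.0, 1.0, 1.0
theta, alpha = 0.5, 0.005               # initial estimate and learning rate

for trial in range(4000):
    v = rng.normal(v_p, np.sqrt(Sigma_p))
    u = rng.normal(theta_true * v, np.sqrt(Sigma_u))
    # fixed point of Eqs. (22)-(24): most likely phi given u and the current theta
    phi = (v_p / Sigma_p + theta * u / Sigma_u) / (1 / Sigma_p + theta ** 2 / Sigma_u)
    eps_u = (u - theta * phi) / Sigma_u
    theta += alpha * eps_u * phi        # Eq. (25)

print(theta)   # should rise from 0.5 towards the generating value (about 2 here,
               # up to a bias introduced by the single-value approximation of v)
```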
Second, let us consider the case of a nonlinear function⁴ g(v, θ) = θh(v), where h(v) is a nonlinear function that depends only on v, as it results in an only slightly more complex neural implementation. Furthermore, this situation is relevant to the example of the simple animal considered at the start of this section, as the light is proportional to the area, but the proportionality constant may not be known (this case is also relevant to the network that we will discuss in Section 4.1). In this case, Eqs. (12)–(14) describing the model become:

φ̇ = θεu h′(φ) − εp (26)
ε̇p = φ − vp − Σp εp (27)
ε̇u = u − θh(φ) − Σu εu. (28)

⁴ Although this case has not been discussed by Friston (2005), it was discussed by Rao and Ballard (1999).

A possible network implementing this model is illustrated in Fig. 4(b), which now includes non-linear elements. In particular, the node φ sends to node εu its activity transformed by a non-linear function, i.e. θh(φ). One could imagine that this could be implemented by an additional node receiving input from node φ, transforming it via a non-linear transformation h and sending its output to node εu via a connection with the weight θ. Analogously, the input from node εu to node φ needs to be scaled by θh′(φ). Again, one could imagine that this could be implemented by an additional node receiving input from node φ, transforming it via a non-linear transformation h′ and modulating the input received from node εu via a connection with weight θ (alternatively, this could be implemented within the node φ by making it react to its input differentially depending on its level of activity). The details of the neural implementation of these non-linear transformations depend on the form of the function h, and would be an interesting direction for future work.

We also note that the update of the parameter θ, i.e. the gradient of F over θ, becomes:

∂F/∂θ = εu h(φ). (29)

This rule is Hebbian for the top connection labelled by θ in Fig. 4(b), as it is a product of the activity of the pre-synaptic and post-synaptic nodes. It would be interesting to investigate how such a plasticity rule could be realized for the other connection with the weight θ (from node εu to φ). We just note that for this connection the rule also satisfies the constraint of local plasticity (stated in the Introduction), as φ fully determines h(φ), so the change in weight is fully determined by the activity of pre-synaptic and post-synaptic neurons.

3. Free-energy

In this section we discuss how the computations in the model relate to a technique of statistical inference involving minimization of free-energy. There are three reasons for describing this relationship. First, it will provide more insight into why the parameters can be optimized by maximization of F. Second, the concept of free-energy is critical for understanding more complex models (Friston et al., 2013), which estimate not only the most likely values of variables, but their distribution. Third, the free-energy is a very interesting concept on its own, and has applications in mathematical psychology (Ostwald, Kirilina, Starke, & Blankenburg, 2014).

We now come back to the example of inference by a simple organism, and discuss how the exact inference described in Section 2.1 can be approximated. As we noted in Section 2.1, the posterior distribution p(v|u) may have a complicated shape, so we will approximate it with another distribution, which we denote q(v). Importantly, we will assume that q(v) has a standard shape, so we will be able to characterize it by the parameters of this typical distribution. For example, if we assume that q(v) is normal, then to fully describe it, we can infer just two numbers: its mean and variance, instead of the infinitely many numbers potentially required to characterize a distribution of an arbitrary shape.

For simplicity, here we will use an even simpler shape for the approximate distribution, namely the delta distribution, which has all its mass cumulated in one point which we denote by φ (i.e. the delta distribution is equal to 0 for all values different from φ, but its integral is equal to 1). Thus we will try to infer from observation

just one parameter φ, which will characterize the most likely value of v.

We now describe what criterion we wish our approximate distribution to satisfy. We will seek the approximate distribution q(v) which is as close as possible to the actual posterior distribution p(v|u). Mathematically, the dissimilarity between two distributions is measured by the Kullback–Leibler divergence defined as:

KL(q(v), p(v|u)) = ∫ q(v) ln [ q(v)/p(v|u) ] dv. (30)

For readers not familiar with the Kullback–Leibler divergence we would like to clarify why it is a measure of dissimilarity between the distributions. Please note that if the two distributions q(v) and p(v|u) were identical, the ratio q(v)/p(v|u) would be equal to 1, so its logarithm would be equal to 0, and so the whole expression in Eq. (30) would be 0. The Kullback–Leibler divergence also has the property that the more different the two distributions are, the higher its value is (see Ostwald et al. (2014) for more details).

Since we assumed above that our simplified distribution is a delta function, we will simply seek the value of its centre parameter φ which minimizes the Kullback–Leibler divergence defined in Eq. (30).

It may seem that the minimization of Eq. (30) is still difficult, because to compute the term p(v|u) present in Eq. (30) from Bayes' theorem (Eq. (4)) one needs to compute the difficult normalization integral (Eq. (5)). However, we will now show that there exists another way of finding the approximate distribution q(v) that does not involve the complicated computation of the normalization integral.

Substituting the definition of conditional probability p(v|u) = p(u, v)/p(u) into Eq. (30) we obtain:

KL(q(v), p(v|u)) = ∫ q(v) ln [ q(v)p(u)/p(u, v) ] dv
                 = ∫ q(v) ln [ q(v)/p(u, v) ] dv + ∫ q(v) dv · ln p(u)
                 = ∫ q(v) ln [ q(v)/p(u, v) ] dv + ln p(u). (31)

In the transition from the second to the third line we used the fact that q(v) is a probability distribution, so its integral is 1. The integral in the last line of the above equation is called the free-energy, and we will denote its negative by F, because we will show below that, under certain assumptions, the negative free-energy is equal (modulo a constant) to the function F we defined and used in the previous section:

F = ∫ q(v) ln [ p(u, v)/q(v) ] dv. (32)

In the above equation we used the property of logarithms that −ln(a/b) = ln(b/a). So, the negative free-energy is related to the Kullback–Leibler divergence in the following way:

KL(q(v), p(v|u)) = −F + ln p(u). (33)

Now please note that ln p(u) does not depend on φ (which is a parameter describing q(v)), so the value of φ that minimizes the distance between q(v) and p(v|u) is the same value as that which maximizes F. Therefore instead of minimizing the Kullback–Leibler divergence we can maximize F, and this will have two benefits: first, as we already mentioned above, F is easier to compute as it does not involve the complicated computation of the normalization term. Second, as we will see later, it will allow us to naturally introduce learning about the parameters of the model.

Let us first note that by assuming that q(v) is a delta distribution, the negative free energy simplifies to:

F = ∫ q(v) ln [ p(u, v)/q(v) ] dv
  = ∫ q(v) ln p(u, v) dv − ∫ q(v) ln q(v) dv
  = ln p(u, φ) + C1. (34)

In the transition from the first to the second line above we used the property of logarithms ln(a/b) = ln a − ln b. In the transition from the second line to the third line we used the property of a delta function δ(x) with centre φ that, for any function h(x), the integral of δ(x)h(x) is equal to h(φ). Furthermore, since the value of the second integral in the second line of the above equation does not depend on φ (so it will cancel when we compute the derivative over φ), we denote it by a constant C1.

Now using p(u, φ) = p(φ)p(u|φ), and ignoring the constant C1, we obtain the expression for F we introduced previously in Eq. (6). Thus finding the approximate delta distribution q(v) through minimization of free-energy is equivalent to the inference of features in the model described in the previous section. It is worth noting that Eq. (34) states that the best centre for our approximate distribution (i.e. our best guess for the size of the food item) is the value v = φ which maximizes the joint probability p(u, φ).

We now discuss how the concept of free-energy helps us to understand why the parameters of the model can be learnt by maximization of F. Recall from Section 2.4 that we wish to find parameters for which the sensory observations are least surprising, i.e. those which maximize p(u). To see the relationship between maximizing p(u) and maximizing F, we note that according to Eq. (33), p(u) is related to the negative free-energy in the following way:

ln p(u) = F + KL(q(v), p(v|u)). (35)

Since the Kullback–Leibler divergence is non-negative, F is a lower bound on ln p(u), thus by maximizing F we maximize a lower bound on ln p(u). So in summary, by maximizing F we can both find an approximate distribution q(v) (as discussed earlier), and optimize the model parameters. However, there is a twist here: we wish to maximize the average of p(u) across trials (or here across observations of different food items). Thus on each trial we need to modify the model parameters just a little bit (rather than until the minimum of free energy is reached, as was the case for φ).
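The decomposition of Eqs. (30)–(35) can be checked numerically. The sketch below does so for the setting of Exercise 1 using a Gaussian q(v) (an illustrative choice of approximate distribution, rather than the delta distribution used in the text); the grid limits and the parameters of q are assumptions.

```python
import numpy as np

# Numerical check of Eqs. (30)-(35) for the Exercise 1 setup with a Gaussian q(v).
v_p, Sigma_p, Sigma_u, u = 3.0, 1.0, 1.0, 2.0
g = lambda v: v ** 2
normal = lambda x, m, s2: np.exp(-(x - m) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

v = np.linspace(-5.0, 10.0, 30001)          # integration grid
dv = v[1] - v[0]
joint = normal(v, v_p, Sigma_p) * normal(u, g(v), Sigma_u)   # p(u, v) = p(v) p(u|v)
p_u = np.sum(joint) * dv                                     # p(u), Eq. (5)

q = normal(v, 1.6, 0.1)                     # an assumed approximate posterior q(v)
mask = q > 1e-12                            # avoid 0 * log(0) terms on the grid
F  = np.sum(q[mask] * np.log(joint[mask] / q[mask])) * dv          # Eq. (32)
KL = np.sum(q[mask] * np.log(q[mask] / (joint[mask] / p_u))) * dv  # Eq. (30)

print(np.log(p_u), F + KL)    # Eq. (35): these should match up to discretization error
print(F <= np.log(p_u))       # and F is a lower bound on ln p(u)
```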
4. Scaling up the model of perception

In this section we will show how the model scales up to networks inferring multiple features and involving hierarchy.

4.1. Increasing the dimension of sensory input

The model naturally scales up to the case of multiple sensory inputs from which we estimate multiple variables. Such a scaled-up model could be used to describe information processing within a cortical area (e.g. primary visual cortex) which infers multiple features (e.g. edges at different positions and orientations) on the basis of multiple inputs (e.g. information from multiple retinal receptors preprocessed by the thalamus). This section shows that when the dimensionality of inputs and features is increased, the dynamics of the nodes in the networks and the synaptic plasticity are described by the same rules as in Section 2, just generalized to multiple dimensions.

Table 1
Rules for computation of derivatives. A denotes a symmetric matrix.

Original rule                                            Generalization to matrices
∂(ax²)/∂x = 2ax                                          ∂(xᵀAx)/∂x = 2Ax
if z = f(y), y = g(x), then ∂z/∂x = (∂y/∂x)(∂z/∂y)       if z = f(y), y = g(x), then ∂z/∂x = (∂y/∂x)ᵀ(∂z/∂y)
∂(ln a)/∂a = 1/a                                         ∂(ln |A|)/∂A = A⁻¹
∂(x²/a)/∂a = −x²/a²                                      ∂(xᵀA⁻¹x)/∂A = −(A⁻¹x)(A⁻¹x)ᵀ

The only complication in explaining this case lies in the necessity of using matrix notation, so let us make this notation very explicit: we will denote single numbers in italic, column vectors by a bar over the symbol, and matrices in bold. So we assume the animal has observed a vector of sensory inputs u and estimates the most likely values φ of the variables v. We further assume that the animal has the prior expectation that the variables v come from a multivariate normal distribution with mean vp and covariance matrix Σp, i.e. p(v) = f(v; vp, Σp), where:

f(x; µ, Σ) = 1/√((2π)^N |Σ|) · exp( −½ (x − µ)ᵀ Σ⁻¹ (x − µ) ). (36)

In the above equation N denotes the length of the vector x, and |Σ| denotes the determinant of the matrix Σ. Analogously, the probability of observing the sensory input given the values of the variables is given by p(u|v) = f(u; g(v, Θ), Σu), where Θ are parameters of the function g. We denote these parameters by a matrix Θ, as we will consider a generalization of the function g discussed in Section 2.5, i.e. g(v, Θ) = Θh(v), where each element i of the vector h(v) depends only on vi. This function corresponds to an assumption often made by models of feature extraction (Bell & Sejnowski, 1995; Olshausen & Field, 1995), that stimuli are formed by a linear combination of features.⁵ Moreover, such a function g can be easily computed, as it is equal to the input to a layer of neurons from another layer with activity h(v) via connections with strength Θ.

⁵ In the model of Rao and Ballard (1999) sparse coding was achieved through the introduction of an additional prior expectation that most φi are close to 0, but sparse coding can also be achieved by choosing a shape of the function h such that h(vi) are mostly close to 0, and only occasionally significantly different from zero (Friston, 2008).

We can state the negative free energy analogously as for the simple model considered in Eq. (7) (TRY IT YOURSELF):

F = ln p(φ) + ln p(u|φ)
  = ½ ( −ln |Σp| − (φ − vp)ᵀ Σp⁻¹ (φ − vp) − ln |Σu| − (u − g(φ, Θ))ᵀ Σu⁻¹ (u − g(φ, Θ)) ) + C. (37)

Analogously as before, to find the vector of most likely values of the features φ, we will calculate the gradient (the vector of derivatives ∂F/∂φi), which we will denote by ∂F/∂φ. We will use the elegant property that the rules for computation of derivatives generalize to vectors and matrices. To get an intuition for these rules we recommend the following exercise, which shows how the rule ∂x²/∂x = 2x generalizes to vectors.

Exercise 4. Show that for any vector x the gradient of the function y = xᵀx is equal to: ∂y/∂x = 2x.

Using an analogous method as that in the solution to Exercise 4 (at the end of the paper) one can see that several other rules generalize, as summarized in Table 1. These rules can be applied for symmetric matrices, but since the Σ are covariance matrices, they are symmetric, so we can use the top two rules in Table 1 to compute the gradient of the negative free energy (TRY IT YOURSELF):

∂F/∂φ = −Σp⁻¹(φ − vp) + (∂g(φ, Θ)/∂φ)ᵀ Σu⁻¹(u − g(φ, Θ)). (38)

In the above equation, terms appear which are generalizations of the prediction errors we defined for the simple model:

εp = Σp⁻¹(φ − vp) (39)
εu = Σu⁻¹(u − g(φ, Θ)). (40)

With the error terms defined, the equation describing the update of φ becomes:

φ̇ = −εp + (∂g(φ, Θ)/∂φ)ᵀ εu. (41)

The partial derivative term in the above equation is a matrix that contains in each entry with co-ordinates (i, j) the derivative of element i of the vector g(φ, Θ) over φj. To see how the above equation simplifies for our choice of the function g, it is helpful, without loss of generality, to consider the case of 2 features being estimated from 2 stimuli. Then:

g(φ, Θ) = Θh(φ) = [ θ1,1 h(φ1) + θ1,2 h(φ2) ; θ2,1 h(φ1) + θ2,2 h(φ2) ]. (42)

Hence we can find the derivatives of the elements of the above vector over the elements of the vector φ:

∂g(φ, Θ)/∂φ = [ θ1,1 h′(φ1), θ1,2 h′(φ2) ; θ2,1 h′(φ1), θ2,2 h′(φ2) ]. (43)

Now we can see that Eq. (41) can be written as:

φ̇ = −εp + h′(φ) × Θᵀεu. (44)

In the above equation × denotes element-by-element multiplication, so the term h′(φ) × Θᵀεu is a vector whose element i is equal to the product of h′(φi) and element i of the vector Θᵀεu. Analogously as for the simple model, the prediction errors could be computed by nodes with the following dynamics:

ε̇p = φ − vp − Σp εp (45)
ε̇u = u − Θh(φ) − Σu εu. (46)

It is easy to see that Eqs. (45)–(46) have fixed points at the values given by Eqs. (39)–(40), by setting the left hand sides of Eqs. (45)–(46) to 0. The architecture of the network with the dynamics described by Eqs. (44)–(46) is shown in Fig. 5, and it is analogous to that in Fig. 4(b).
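A minimal sketch (not from the paper) of the dynamics of Eqs. (44)–(46) for two features and two inputs, with linear h (so h′(x) = 1); the weights, covariances, prior and input are assumed values chosen for illustration.

```python
import numpy as np

# Illustrative sketch of the two-feature, two-input network of Fig. 5, Eqs. (44)-(46).
Theta = np.array([[1.0, 0.5],
                  [0.2, 1.0]])
Sigma_p, Sigma_u = np.eye(2), np.eye(2)
v_p = np.array([1.0, 1.0])
u = np.array([2.0, 1.5])
h = lambda x: x                                     # linear h, so h'(x) = 1

phi, eps_p, eps_u = v_p.copy(), np.zeros(2), np.zeros(2)
dt = 0.01
for _ in range(int(20 / dt)):
    d_phi = -eps_p + Theta.T @ eps_u                # Eq. (44)
    d_eps_p = phi - v_p - Sigma_p @ eps_p           # Eq. (45)
    d_eps_u = u - Theta @ h(phi) - Sigma_u @ eps_u  # Eq. (46)
    phi, eps_p, eps_u = phi + dt * d_phi, eps_p + dt * d_eps_p, eps_u + dt * d_eps_u

print(phi.round(3))    # inferred feature values after the network has settled
```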
Analogously as for the simple model, one can also find the rules for updating the parameters encoded in synaptic connections, which generalize the rules presented previously. In particular, using the top formula in Table 1 it is easy to see that:

∂F/∂vp = εp. (47)

Fig. 5. The architecture of the model inferring 2 features from 2 sensory stimuli. Notation as in Fig. 4(b). To help identify which connections are intrinsic and extrinsic to each level of hierarchy, the nodes and their projections in each level of hierarchy are shown in green, blue and purple respectively (in the online version). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Using the two bottom formulas in Table 1 one can find the rules for the update of the covariance matrices (TRY IT YOURSELF):

∂F/∂Σp = ½ ( εp εpᵀ − Σp⁻¹ ) (48)
∂F/∂Σu = ½ ( εu εuᵀ − Σu⁻¹ ). (49)

The derivation of the update of the parameters Θ is a bit more tedious, but we show in Appendix B that:

∂F/∂Θ = εu h(φ)ᵀ. (50)

The above plasticity rules of Eqs. (47)–(50) are Hebbian in the same sense they were for the simple model—for example Eq. (48) implies that Σp,i,j should be updated proportionally to εp,i εp,j, i.e. to the product of the activity of pre-synaptic and post-synaptic neurons. However, the rules for the update of the covariance matrices in Eqs. (48)–(49) contain matrix inverses Σ⁻¹. The value of each entry of a matrix inverse depends on all matrix elements, so it is difficult to see how it could be ''known'' by a synapse that encodes just a single element. Nevertheless, we will show in Section 5 how the model can be extended to satisfy the constraint of local plasticity.

4.2. Introducing hierarchy

Sensory cortical areas are organized hierarchically, such that areas in lower levels of the hierarchy (e.g. primary visual cortex) infer the presence of simple features of stimuli (e.g. edges), on the basis of which the sensory areas in higher levels of the hierarchy infer the presence of more and more complex features. It is straightforward to generalize the model from 2 layers to multiple layers. In such a generalized model the rules describing the dynamics of neurons and the plasticity of synapses remain exactly the same, and only the notation has to be modified to describe the presence of multiple layers of hierarchy.

We assume that the expected value of activity in one layer vi depends on the activity in the next layer vi+1:

E(u) = g1(v2, Θ1) (51)
E(v2) = g2(v3, Θ2)
E(v3) = ...

To simplify the notation we could denote u by v1, and then the likelihood of activity in layer i becomes:

p(vi|vi+1) = f(vi; gi(vi+1, Θi), Σi). (52)

In this model, the Σi parametrize the covariance between features in each level, and the Θi parametrize how the mean value of features in one level depends on the next. Let us assume the same form of the function g as before, i.e. gi(vi+1, Θi) = Θi h(vi+1). By analogy to the model described in the previous subsection, one can see that inference of the features in all layers on the basis of sensory input can be achieved in the network shown in Fig. 6(a). In this network the dynamics of the nodes are described by:

φ̇i = −εi + h′(φi) × Θi−1ᵀ εi−1 (53)
ε̇i = φi − Θi h(φi+1) − Σi εi. (54)

Furthermore, by analogy to the previous section, the rules for modifying the synaptic connections in the model become:

∂F/∂Σi = ½ ( εi εiᵀ − Σi⁻¹ ) (55)
∂F/∂Θi = εi h(φi+1)ᵀ. (56)

The hierarchical structure of the model in Fig. 6(a) parallels the hierarchical structure of the cortex. Furthermore, it is worth noting that different layers within the cortex communicate with higher and lower sensory areas (as illustrated schematically in Fig. 6(b)), which parallels the fact that different nodes in the model communicate with other levels of the hierarchy (Fig. 6(a)).

Fig. 6. (a) The architecture of the model including multiple layers. For simplicity only the first two layers are shown. Notation as in Fig. 5. (b) Extrinsic connectivity of cortical layers.
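A schematic sketch (an illustrative toy, not code from the paper) of the hierarchical dynamics of Eqs. (53)–(54) with three layers and linear h; layer sizes, weights and the input are assumptions, and the top layer is given no prior, so it is driven only by the prediction errors from the layer below.

```python
import numpy as np

# Illustrative sketch of the hierarchical dynamics of Eqs. (53)-(54).
rng = np.random.default_rng(0)
sizes = [4, 3, 2]                              # layer 1 is the input u; layers 2-3 are features
Theta = [rng.normal(0.0, 0.5, (sizes[i], sizes[i + 1])) for i in range(2)]
Sigma = [np.eye(n) for n in sizes]
h = lambda x: x                                # linear h, so h'(x) = 1

u = rng.normal(0.0, 1.0, sizes[0])
phi = [u] + [np.zeros(n) for n in sizes[1:]]   # phi[0] stays clamped to the input
eps = [np.zeros(n) for n in sizes]

dt = 0.01
for _ in range(int(20 / dt)):
    for i in range(1, len(sizes)):             # Eq. (53) for the feature layers
        phi[i] = phi[i] + dt * (-eps[i] + Theta[i - 1].T @ eps[i - 1])
    for i in range(len(sizes) - 1):            # Eq. (54): error between layer i and its prediction
        eps[i] = eps[i] + dt * (phi[i] - Theta[i] @ h(phi[i + 1]) - Sigma[i] @ eps[i])

print([p.round(2) for p in phi[1:]])           # inferred features after relaxation
```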
5. Local plasticity

The plasticity rules for the synapses encoding the matrices Σ (describing the variance and co-variance of features or sensory inputs) introduced in the previous section (Eqs. (48), (49) and (55)) include terms equal to the matrix inverse Σ⁻¹.
Fig. 7. Prediction error networks that can learn the uncertainty parameter with local plasticity. Notation as in Fig. 4(b). (a) Single node. (b) Multiple nodes for multidimensional features.

Computing each element of the inverse Σ⁻¹ requires not only the knowledge of the corresponding element of Σ, but also of other elements. For example, in the case of a 2-dimensional vector u, the update rule for the synaptic connection encoding Σu,1,1 (Eq. (49)) requires the computation of (Σu⁻¹)1,1 = Σu,2,2/|Σu|. Hence the change of the synaptic weight Σu,1,1 depends on the value of the weight Σu,2,2, but these are the weights of connections between different neurons (see Fig. 5), thus the update rule violates the principle of local plasticity stated in the Introduction. Nevertheless, in this section we show that by slightly modifying the architecture of the network computing prediction errors, the need for computing matrix inverses in the plasticity rules disappears. In other words, we present an extension of the model from the previous section in which learning the values of the parameters Σ satisfies the constraint of local plasticity. To make the description as easy to follow as possible, we start by considering the case of a single sensory input and a single feature on each level, and then generalize it to an increased dimension of inputs and features.

5.1. Learning variance of a single prediction error node

Instead of considering the whole model we now focus on the computations in a single node computing a prediction error. In the model we wish the prediction error on each level to converge to:

εi = (φi − gi(φi+1))/Σi. (57)

In the above equation Σi is the variance of the feature φi (around the mean predicted by the level above):

Σi = ⟨(φi − gi(φi+1))²⟩. (58)

A sample architecture of the model that can achieve this computation with local plasticity is shown in Fig. 7(a). It includes an additional inhibitory inter-neuron ei which is connected to the prediction error node, and receives input from it via the connection with the weight encoding Σi. The dynamics of this model are described by the following set of equations:

ε̇i = φi − gi(φi+1) − ei (59)
ėi = Σi εi − ei. (60)

The levels of activity at the fixed point can be found by setting the left hand sides of Eqs. (59)–(60) to 0 and solving the resulting set of simultaneous equations (TRY IT YOURSELF):

εi = (φi − gi(φi+1))/Σi (61)
ei = φi − gi(φi+1). (62)

Thus we see that the prediction error node has a fixed point at the desired value (cf. Eq. (57)). Let us now consider the following rule for the plasticity of the connection encoding Σi:

ΔΣi = α(εi ei − 1). (63)

According to this rule the weight is modified proportionally to the product of the activities of pre-synaptic and post-synaptic neurons decreased by a constant, with a learning rate α. To analyse to what values this rule converges, we note that the expected change is equal to 0 when:

⟨εi ei − 1⟩ = 0. (64)

Substituting Eqs. (61)–(62) into the above equation and rearranging terms we obtain:

⟨(φi − gi(φi+1))²⟩/Σi = 1. (65)

Solving the above equation for Σi we obtain Eq. (58). Thus in summary the network in Fig. 7(a) computes the prediction error and learns the variance of the corresponding feature with a local Hebbian plasticity rule. To gain more intuition for how this model works we suggest the following exercise.

Exercise 5. Simulate learning of the variance Σi over trials. For simplicity, only simulate the network described by Eqs. (59)–(60), and assume that the variables φ are constant. On each trial generate the input φi from a normal distribution with mean 5 and variance 2, while setting gi(φi+1) = 5 (so that the upper level correctly predicts the mean of φi). Simulate the network for 20 time units, and then update the weight Σi with learning rate α = 0.01. Simulate 1000 trials and plot how Σi changes across trials.
changes across trials.
A sample architecture of the model that can achieve this
computation with local plasticity is shown in Fig. 7(a). It includes The results of simulations are shown in Fig. 8, and they illustrate
an additional inhibitory inter-neuron ei which is connected to the that the synaptic weight Σi approaches the vicinity of the variance
prediction error node, and receives input from it via the connection of φi .
with weight encoding Σi . The dynamics of this model is described It is also worth adding that εi in the model described by
by the following set of equations: Eqs. (59)–(60) converges to the prediction error (Eq. (61)), when
one assumes that φ are constant or change on much slower time-
ε̇i = φi − gi (φi+1 ) − ei (59) scale than εi and ei . This convergence takes place because the
ėi = Σi εi − ei . (60) fixed point of the model is stable, which can be shown using the
standard dynamical systems theory (Strogatz, 1994). In particular,
The levels of activity at the fixed point can be found by setting since Eqs. (59)–(60) only contain linear functions of variables εi and
the left hand sides of Eqs. (59)–(60) to 0 and solving the resulting ei , their solution has a form of exponential functions of time t, e.g.
set of simultaneous equations (TRY IT YOURSELF): εi (t ) = c exp(λt ) + εi∗ , where c and λ are constants, and εi∗ is the
φi − gi (φi+1 ) value at the fixed point. The sign of λ determines the stability of
εi = (61) the fixed point: when λ < 0, the exponential term decreases with
Σi time, and εi converges to the fixed point, while if λ > 0, the fixed
ei = φi − gi (φi+1 ). (62) point is unstable. The values of λ are equal to the eigenvalues of

the matrix in the equation below (Strogatz, 1994), which rewrites Eqs. (59)–(60) in vector form (rows of the matrices are separated by semicolons):

[ ε̇i ; ėi ] = [ 0, −1 ; Σi, −1 ] [ εi ; ei ] + [ φi − gi(φi+1) ; 0 ]. (66)

To show that the eigenvalues of the matrix in the above equation are negative we use the property that the sum of the eigenvalues is equal to the trace and their product to the determinant. The trace and determinant of this matrix are −1 and Σi, respectively. Since the sum of the eigenvalues is negative and their product positive, both eigenvalues are negative (or, if they are complex, have negative real parts), so the system is stable.
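This argument can also be verified numerically; the following sketch (with assumed example values of Σi) computes the eigenvalues directly.

```python
import numpy as np

# Quick numerical check of the stability argument around Eq. (66): for several
# assumed values of Sigma_i, the matrix [[0, -1], [Sigma_i, -1]] has eigenvalues
# with negative real part, so the (eps_i, e_i) dynamics decay to the fixed point.
for Sigma_i in [0.5, 1.0, 2.0, 10.0]:
    A = np.array([[0.0, -1.0], [Sigma_i, -1.0]])
    eig = np.linalg.eigvals(A)
    print(Sigma_i, eig, bool(np.all(eig.real < 0)))
```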

5.2. Learning the covariance matrix

The model described in the previous subsection scales up to a larger dimension of features and sensory inputs. The architecture of the scaled-up network is shown in Fig. 7(b), and its dynamics are described by the following equations:

ε̇i = φi − gi(φi+1) − ei (67)
ėi = Σi εi − ei. (68)

Analogously as before, we can find the fixed point by setting the left hand sides of the equations to 0:

εi = Σi⁻¹(φi − gi(φi+1)) (69)
ei = φi − gi(φi+1). (70)

Thus we can see that the nodes ε have fixed points at the values equal to the prediction errors. We can now consider a learning rule analogous to that in the previous subsection:

ΔΣi = α(εi eiᵀ − I). (71)

In the above equation I denotes the identity matrix. To find the values in the vicinity of which the above rule may converge, we can find the value of Σi for which the expected value of the right hand side of the above equation is equal to 0:

⟨εi eiᵀ − I⟩ = 0. (72)

Substituting Eqs. (69)–(70) into the above equation, and solving for Σi, we obtain (TRY IT YOURSELF):

Σi = ⟨(φi − gi(φi+1))(φi − gi(φi+1))ᵀ⟩. (73)

We can see that the learning rule has a stochastic fixed point at the values corresponding to the covariance matrix. In summary, the nodes in the network described in this section have fixed points at the prediction errors and can learn the covariance of the corresponding features, thus the proposed network may substitute for the prediction error nodes in the model shown in Fig. 6, and the computation will remain the same. But importantly, in the proposed network the covariance is learnt with local plasticity involving simple Hebbian learning.
at the values corresponding to the covariance matrix. In summary, energy framework in general. Existing models of feature extraction
the nodes in network described in this section have fixed points at (Bell & Sejnowski, 1997; Bogacz, Brown, & Giraud-Carrier, 2001;
prediction errors and can learn the covariance of the corresponding Olshausen & Field, 1995) and predictive coding (Rao & Ballard,
features, thus the proposed network may substitute the prediction 1999) have been shown to be able to find features efficiently and
error nodes in the model shown in Fig. 6, and the computation reproduce the receptive fields of neurons in the primary visual
will remain the same. But importantly in the proposed network the cortex when trained with natural images. It would be interesting
covariance is learnt with local plasticity involving simple Hebbian to explicitly test in simulations if the model based on the free-
learning. energy framework can equally efficiently extract features from
natural stimuli and additionally learn the variance and covariance
6. Discussion of features.
We have also demonstrated that if the dynamics within the
In this paper we presented the model of perception and learning nodes computing prediction errors takes place on a time-scale
in neural circuits based on the free-energy framework. This model much faster than in the whole network, these nodes converge
extends the predictive coding model (Rao & Ballard, 1999) in to stable fixed points. It is also worth noting that under the
that it represents and learns not only mean values of stimuli or assumption of separation of time scales, the nodes computing φ
features, but also their variances, which gives the model several also converge to a stable fixed point, because variables φ converge
new computational capabilities, as we now discuss. to the values that maximize function F . It would be interesting
First, the model can weight incoming sensory information by to investigate how to ensure that the model converges to desired
their reliability. This property arises in the model, because the values (rather than engaging into oscillatory behaviour) also when
prediction errors are normalized by dividing them by the variance one considers a more realistic case of time-scales not being fully
of noise. Thus the more noisy is a particular dimension of the separated.
In summary, in this paper we presented the free-energy theory, which offers a powerful framework for describing computations performed by the brain during perception and learning. The appeal of the similarity between the organization of networks suggested by this theory and that observed in the brain invites attempts to map the currently relatively abstract models onto the details of cortical micro-circuitry, i.e. to map different elements of the model onto different neural populations within the cortex. For example, Bastos et al. (2012) compared a more recent version of the model (Friston, 2008) with the details of cortical organization. Such comparisons of the models with biological circuits are likely to lead to iterative refinement of the models.

Even if the free-energy framework does describe cortical computation, the mapping between the variables in the model and the elements of the neural circuit may not be "clean" but rather "messy", i.e. each model variable or parameter may be represented by multiple neurons or synapses. The particular implementation of the framework in the cortical circuit may be influenced by other constraints that evolutionary pressure optimizes, such as robustness to damage, energy efficiency, speed of processing, etc. In any case, the comparison of the predictions of a theoretical framework like the free-energy framework with experimental data offers hope for understanding cortical micro-circuits.

Acknowledgments

This work was supported by Medical Research Council grant MC UU 12024/5. The author thanks Karl Friston, John-Stuart Brittain, Daniela Massiceti, Linus Schumacher and Rui Costa for reading the previous version of the manuscript and very useful suggestions, and Chris Mathys, Peter Dayan and Diego Vidaurre for discussion.

Appendix A. The original neural implementation

In the original model (Friston, 2005), the prediction errors were defined in a slightly different way:

ξ_p = (φ − v_p) / σ_p   (74)
ξ_u = (u − g(φ)) / σ_u.   (75)

In the above equations, σ_p = √Σ_p and σ_u = √Σ_u, i.e. σ_p and σ_u denote the standard deviations of the distributions p(v) and p(u|v) respectively. With the prediction error terms defined in this way, the negative free energy computed in Eq. (7) can be written as:

F = −ln σ_p − (1/2) ξ_p² − ln σ_u − (1/2) ξ_u² + C_2.   (76)

The dynamics of variable φ is proportional to the derivative of the above equation over φ:

φ̇ = ξ_u g′(φ) / σ_u − ξ_p / σ_p.   (77)

Analogously as in Section 2.3, the prediction errors defined in Eqs. (74)–(75) could be computed in the nodes with the following dynamics:

ξ̇_p = φ − v_p − σ_p ξ_p   (78)
ξ̇_u = u − g(φ) − σ_u ξ_u.   (79)

The architecture of the model described by Eqs. (77)–(79) is shown in Fig. 10. It is similar to that in Fig. 3, but differs in the information received by node φ (we will discuss this difference in more detail at the end of this Appendix).

Fig. 10. The architecture of the original model performing simple perceptual inference. Notation as in Fig. 3.

Analogously as before, we can find the rules describing synaptic plasticity in the model by calculating the derivatives of F (Eq. (76)) over v_p, σ_p and σ_u (TRY IT YOURSELF):

∂F/∂v_p = ξ_p / σ_p   (80)
∂F/∂σ_p = (ξ_p² − 1) / σ_p   (81)
∂F/∂σ_u = (ξ_u² − 1) / σ_u.   (82)

The original model does not satisfy the constraint of local computation stated in the Introduction, because the node computing φ receives input from the prediction error nodes scaled by the parameters σ (see Fig. 10), but the parameters σ are not encoded in the connections between node φ and the prediction error nodes, but instead in the connections among the prediction error neurons. Nevertheless, we have shown in Section 2.3 that by just changing the way in which prediction errors are normalized, the computation in the model becomes local.
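Eqs. (77)–(79) above can be simulated in the same way as in the solution to Exercise 3, i.e. with g(φ) = φ², v_p = 3, u = 2 and unit variances, so that σ_p = σ_u = 1. The sketch below is not one of the paper's exercises; the function name and the plotted quantities are chosen here purely for illustration of how the σ-normalized prediction errors behave.

function simulate_original_model

v_p = 3; % mean of prior distribution of food size
sigma_p = 1; % standard deviation of prior
sigma_u = 1; % standard deviation of sensory noise

u = 2; % observed light intensity

DT = 0.01; % integration step
MAXT = 5; % maximum time considered

phi(1) = v_p; % initializing the best guess of food size
xi_p(1) = 0; % initializing the prediction error of food size
xi_u(1) = 0; % initializing the prediction error of sensory input

for i = 2:MAXT/DT
    % Eq. (77): phi is driven by errors scaled by the standard deviations
    phi(i) = phi(i-1) + DT * (xi_u(i-1)*(2*phi(i-1))/sigma_u - xi_p(i-1)/sigma_p);
    % Eqs. (78)-(79): dynamics of the prediction error nodes
    xi_p(i) = xi_p(i-1) + DT * (phi(i-1) - v_p - sigma_p*xi_p(i-1));
    xi_u(i) = xi_u(i-1) + DT * (u - phi(i-1)^2 - sigma_u*xi_u(i-1));
end

plot([DT:DT:MAXT], phi, 'k');
hold on
plot([DT:DT:MAXT], xi_p, 'k--');
plot([DT:DT:MAXT], xi_u, 'k:');
xlabel('Time');
ylabel('Activity');
legend('\phi', '\xi_p', '\xi_u');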
Appendix B. Derivation of plasticity rule for connections between layers

This Appendix derives the rule for the update of Θ given in Eq. (50). In order to use the two top formulas in Table 1 we have to reshape the matrix Θ into a vector. To avoid death by notation, without loss of generality, let us consider the case of 2-dimensional stimuli and features. So let us define the vector of parameters: θ = [θ_{1,1}, θ_{1,2}, θ_{2,1}, θ_{2,2}]. Now, using the two top formulas in Table 1 one can find that:

∂F/∂θ = (∂g(φ, Θ)^T/∂θ) ε_u.   (83)

From Eq. (42) we find (writing matrix rows separated by semicolons):

∂g(φ, Θ)^T/∂θ = [h(φ_1), 0; h(φ_2), 0; 0, h(φ_1); 0, h(φ_2)].   (84)

We can now evaluate the right hand side of Eq. (83):

(∂g(φ, Θ)^T/∂θ) ε_u = [h(φ_1), 0; h(φ_2), 0; 0, h(φ_1); 0, h(φ_2)] [ε_{u,1}; ε_{u,2}] = [h(φ_1)ε_{u,1}; h(φ_2)ε_{u,1}; h(φ_1)ε_{u,2}; h(φ_2)ε_{u,2}].   (85)

Reshaping the right hand side of the above equation into a matrix, we can see how it can be decomposed into the product of vectors in Eq. (50):

[h(φ_1)ε_{u,1}, h(φ_2)ε_{u,1}; h(φ_1)ε_{u,2}, h(φ_2)ε_{u,2}] = [ε_{u,1}; ε_{u,2}] [h(φ_1), h(φ_2)].   (86)
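The reshaping step in Eqs. (85)–(86) can also be verified numerically. The fragment below is only an illustrative check (the chosen numbers are arbitrary): it fills h(φ) and ε_u with sample values and confirms that reshaping the vector from Eq. (85) into a 2 × 2 matrix gives the outer product ε_u h(φ)^T used in Eq. (50).

h_phi = [0.3; -1.2]; % arbitrary values of h(phi_1) and h(phi_2)
eps_u = [2.0; 0.5]; % arbitrary prediction errors epsilon_u

% matrix of Eq. (84) and the 4-element vector of Eq. (85)
dg_dtheta = [h_phi(1) 0; h_phi(2) 0; 0 h_phi(1); 0 h_phi(2)];
v = dg_dtheta * eps_u;

% reshape into the 2x2 matrix of Eq. (86) and compare with the outer product
M = reshape(v, 2, 2)';
disp(M - eps_u * h_phi'); % should display a matrix of zeros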
Solutions to exercises

Exercise 1

function exercise1

v_p = 3; % mean of prior distribution of food size
sigma_p = 1; % standard deviation of prior = sqrt(1)
sigma_u = 1; % standard deviation of sensory noise = sqrt(1)

u = 2; % observed light intensity

MINV = 0.01; % minimum value of v for which posterior computed
DV = 0.01; % interval between values of v for which posterior found
MAXV = 5; % maximum value of v for which posterior computed
vrange = [MINV:DV:MAXV];

numerator = normpdf(vrange, v_p, sigma_p) .* normpdf(u, vrange.^2, sigma_u);
normalization = sum(numerator * DV);
p = numerator / normalization;

plot(vrange, p, 'k');
xlabel('v');
ylabel('p(v|u)');

Exercise 2

function exercise2

v_p = 3; % mean of prior distribution of food size
Sigma_p = 1; % variance of prior distribution
Sigma_u = 1; % variance of sensory noise

u = 2; % observed light intensity

DT = 0.01; % integration step
MAXT = 5; % maximum time considered

phi(1) = v_p; % initializing the best guess of food size

for i = 2:MAXT/DT
    phi(i) = phi(i-1) + DT * ((v_p - phi(i-1))/Sigma_p + ...
        (u - phi(i-1)^2)/Sigma_u * (2*phi(i-1)));
end

plot([DT:DT:MAXT], phi, 'k');
xlabel('Time');
ylabel('\phi');
axis([0 MAXT -2 3.5]);

Exercise 3

function exercise3

v_p = 3; % mean of prior distribution of food size
Sigma_p = 1; % variance of prior distribution
Sigma_u = 1; % variance of sensory noise

u = 2; % observed light intensity

DT = 0.01; % integration step
MAXT = 5; % maximum time considered

phi(1) = v_p; % initializing the best guess of food size
error_p(1) = 0; % initializing the prediction error of food size
error_u(1) = 0; % initializing the prediction error of sensory input

for i = 2:MAXT/DT
    phi(i) = phi(i-1) + DT * (-error_p(i-1) + error_u(i-1) * (2*phi(i-1)));
    error_p(i) = error_p(i-1) + DT * (phi(i-1) - v_p - Sigma_p * error_p(i-1));
    error_u(i) = error_u(i-1) + DT * (u - phi(i-1)^2 - Sigma_u * error_u(i-1));
end

plot([DT:DT:MAXT], phi, 'k');
hold on
plot([DT:DT:MAXT], error_p, 'k--');
plot([DT:DT:MAXT], error_u, 'k:');
xlabel('Time');
ylabel('Activity');
legend('\phi', '\epsilon_p', '\epsilon_u');
axis([0 MAXT -2 3.5]);

Exercise 4

It is easiest to consider a vector of two numbers (an analogous result can be shown for longer vectors):

x = [x_1; x_2].   (87)

Then y = x^T x = x_1² + x_2², so the gradient is equal to:

∂y/∂x = [∂y/∂x_1; ∂y/∂x_2] = [2x_1; 2x_2] = 2x.   (88)
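As an additional illustrative check (not part of the original solution), the gradient above can be compared with finite differences; the test vector and step size below are arbitrary.

x = [1.5; -0.7]; % an arbitrary test vector
d = 1e-6; % step for the finite differences
grad = zeros(2,1);
for k = 1:2
    e = zeros(2,1); e(k) = d; % perturbation of the k-th component
    grad(k) = ((x+e)'*(x+e) - (x-e)'*(x-e)) / (2*d);
end
disp([grad, 2*x]); % the two columns should be (nearly) identical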

Exercise 5

function exercise5

mean_phi = 5; % mean of input from the current level
Sigma_phi = 2; % variance of input from the current level
phi_above = 5; % input from the level above

DT = 0.01; % integration step
MAXT = 20; % maximum time considered
TRIALS = 1000; % number of simulated trials
LRATE = 0.01; % learning rate

Sigma(1) = 1; % initializing the value of weight

for trial = 2:TRIALS
    error(1) = 0; % initializing the prediction error
    e(1) = 0; % initializing the interneuron
    phi = mean_phi + sqrt(Sigma_phi) * randn;

    for i = 2:MAXT/DT
        error(i) = error(i-1) + DT * (phi - phi_above - e(i-1));
        e(i) = e(i-1) + DT * (Sigma(trial-1) * error(i-1) - e(i-1));
    end

    Sigma(trial) = Sigma(trial-1) + LRATE * (error(end)*e(end) - 1);
end

plot(Sigma, 'k');
xlabel('Trial');
ylabel('\Sigma');
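Exercise 5 simulates the scalar model of Section 5.1. As a further illustration (this is a sketch rather than one of the paper's exercises), the code below extends the same simulation to the matrix rule of Section 5.2, Eqs. (67), (68) and (71), for two input dimensions; the true covariance of the input, here assumed to be [2 1; 1 2] purely for illustration, should be approximately recovered in the learnt weights Sigma.

function covariance_demo

mean_phi = [5; 5]; % mean of input from the current level
Cov_phi = [2 1; 1 2]; % covariance of the input (assumed for illustration)
phi_above = [5; 5]; % input from the level above

DT = 0.01; % integration step
MAXT = 20; % maximum time considered within a trial
TRIALS = 2000; % number of simulated trials
LRATE = 0.01; % learning rate

Sigma = eye(2); % initializing the matrix of synaptic weights
C = chol(Cov_phi)'; % used to draw samples with covariance Cov_phi

for trial = 1:TRIALS
    err = zeros(2,1); % prediction error nodes (cf. error in exercise5)
    e = zeros(2,1); % inter-neurons
    phi = mean_phi + C * randn(2,1); % sample input with the assumed covariance

    for i = 2:MAXT/DT
        % Eqs. (67)-(68): dynamics of error nodes and inter-neurons
        err_new = err + DT * (phi - phi_above - e);
        e = e + DT * (Sigma * err - e);
        err = err_new;
    end

    % Eq. (71): Hebbian update of the weights encoding the covariance
    Sigma = Sigma + LRATE * (err * e' - eye(2));
end

disp(Sigma); % should lie close to Cov_phi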

References

Bastos, Andre M., Usrey, W. Martin, Adams, Rick A., Mangun, George R., Fries, Pascal, & Friston, Karl J. (2012). Canonical microcircuits for predictive coding. Neuron, 76, 695–711.
Bell, Anthony J., & Sejnowski, Terrence J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, Anthony J., & Sejnowski, Terrence J. (1997). The independent components of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Bogacz, Rafal, Brown, Malcolm W., & Giraud-Carrier, Christophe (2001). Emergence of movement sensitive neurons' properties by learning a sparse code for natural moving images. Advances in Neural Information Processing Systems, 13, 838–844.
Bogacz, Rafal, & Gurney, Kevin (2007). The basal ganglia and cortex implement optimal decision making between alternative actions. Neural Computation, 19, 442–477.
Chen, J.-Y., Lonjers, P., Lee, C., Chistiakova, M., Volgushev, M., & Bazhenov, M. (2013). Heterosynaptic plasticity prevents runaway synaptic dynamics. Journal of Neuroscience, 33, 15915–15929.
Feldman, Harriet, & Friston, Karl (2010). Attention, uncertainty, and free-energy. Frontiers in Human Neuroscience, 4, 215.
FitzGerald, Thomas H. B., Schwartenbeck, Philipp, Moutoussis, Michael, Dolan, Raymond J., & Friston, Karl (2015). Active inference, evidence accumulation and the urn task. Neural Computation, 27, 306–328.
Friston, Karl (2003). Learning and inference in the brain. Neural Networks, 16, 1325–1352.
Friston, Karl (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society B, 360, 815–836.
Friston, Karl (2008). Hierarchical models in the brain. PLoS Computational Biology, 4, e1000211.
Friston, Karl (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11, 127–138.
Friston, Karl, Schwartenbeck, Philipp, FitzGerald, Thomas, Moutoussis, Michael, Behrens, Timothy, & Dolan, Raymond J. (2013). The anatomy of choice: active inference and agency. Frontiers in Human Neuroscience, 7, 598.
Harwood, David, Ojala, Timo, Pietikäinen, Matti, Kelman, Shalom, & Davis, Larry (1995). Texture classification by center-symmetric auto-correlation, using Kullback discrimination of distributions. Pattern Recognition Letters, 16, 1–10.
Olshausen, Bruno A., & Field, David J. (1995). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
O'Reilly, Randall C., & Munakata, Yuko (2000). Computational explorations in cognitive neuroscience. MIT Press.
Ostwald, Dirk, Kirilina, Evgeniya, Starke, Ludger, & Blankenburg, Felix (2014). A tutorial on variational Bayes for latent linear stochastic time-series models. Journal of Mathematical Psychology, 60, 1–19.
Rao, Rajesh P. N., & Ballard, Dana H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2, 79–87.
Strogatz, Steven (1994). Nonlinear dynamics and chaos. Westview Press.
