Journal of Mathematical Psychology: Rafal Bogacz
Highlights
• Bayesian inference about stimulus properties can be performed by networks of neurons.
• Learning about statistics of stimuli can be achieved by Hebbian synaptic plasticity.
• Structure of the model resembles the hierarchical organization of the neocortex.
1. Introduction

The model of Friston (2005) and the predictive coding model of Rao and Ballard (1999) provide a powerful mathematical framework to describe how the sensory cortex extracts information from noisy stimuli. The predictive coding model (Rao & Ballard, 1999) suggests that visual cortex infers the most likely properties of stimuli from noisy sensory input. The inference in this model is implemented by a surprisingly simple network of neuron-like nodes. The model is called ''predictive coding'', because some of the nodes in the network encode the differences between inputs and predictions of the network. Remarkably, learning about features present in sensory stimuli is implemented by simple Hebbian synaptic plasticity, and Rao and Ballard (1999) demonstrated that the model presented with natural images learns features resembling receptive fields of neurons in the primary visual cortex.

Friston (2005) has extended the model to also represent uncertainty associated with different features. He showed that learning about the variance and co-variance of features can also be implemented by simple synaptic plasticity rules based on Hebbian learning. As the extended model (Friston, 2005) learns the variance and co-variance of features, it offers several new insights. First, it describes how the perceptual systems may differentially weight sources of sensory information depending on their level of noise. Second, it shows how the sensory networks can learn to recognize features that are encoded in the patterns of covariance between inputs, such as textures. Third, it provides a natural way to implement attentional modulation as the reduction in variance of the attended features (we come back to these insights in the Discussion). Furthermore, Friston (2005) pointed out that this model can be viewed as an approximate Bayesian inference based on minimization of a function referred to in statistics as free-energy.
The free-energy framework (Friston, 2003, 2005) has recently been extended by Karl Friston and his colleagues to describe how the brain performs different cognitive functions including action selection (FitzGerald, Schwartenbeck, Moutoussis, Dolan, & Friston, 2015; Friston et al., 2013). Furthermore, Friston (2010) proposed that the free-energy theory unifies several theories of perception and action which are closely related to the free-energy framework.

There are many articles which provide an intuition for the free-energy framework and discuss how it relates to other theories and experimental data (Friston, 2003, 2005, 2010; Friston et al., 2013). However, the description of the mathematical details of the theory in these papers requires a very deep mathematical background. The main goal of this paper is to provide an easy-to-follow tutorial on the free-energy framework. To make the tutorial accessible to a wide audience, it only assumes basic knowledge of probability theory, calculus and linear algebra. This tutorial is planned to be complementary to the existing literature, so it does not focus on the relationship to other theories and experimental data, or on applications to more complex tasks, which are described elsewhere (Friston, 2010; Friston et al., 2013).

In this tutorial we also consider in more detail the neural implementation of the free-energy framework. Any computational model would need to satisfy the following constraints to be considered biologically plausible:

1. Local computation: A neuron performs computations only on the basis of the activity of its input neurons and the synaptic weights associated with these inputs (rather than information encoded in other parts of the circuit).
2. Local plasticity: Synaptic plasticity is only based on the activity of pre-synaptic and post-synaptic neurons.

The model of Rao and Ballard (1999) fully satisfied these constraints. The model of Friston (2005) did not satisfy them fully, but we show that after small modifications and extensions it can satisfy them. So the descriptions of the model in this tutorial slightly differ in a few places or extend the original model to better explain how the proposed computation could be implemented in neural circuits. All such differences or extensions are indicated by footnotes or in the text, and the original model is presented in Appendix A.

It is commonly assumed in theoretical neuroscience (O'Reilly & Munakata, 2000) that the basic computations a neuron performs are the summation of its inputs weighted by the strengths of synaptic connections, and the transformation of this sum through a (monotonic) function describing the relationship between a neuron's total input and output (also termed the firing-input or f-I curve). Whenever possible, we will assume that the computation of the neurons in the described model is limited to these computations (or even just to linear summation of inputs).

We feel that the neural implementation of the model is worth considering, because if the free-energy principle indeed describes the computations in the brain, it can provide an explanation for why the cortex is organized in a particular way. However, to gain such insight it is necessary to start comparing the neural networks implementing the model with those in the real brain. Consequently, we consider in this paper possible neural circuits that could perform the computations required by the theory. Although the neural implementations proposed here are not the only possible ones, it is worth considering them as a starting point for comparison of the model with the details of neural architectures in the brain. We hope that such comparison could iteratively lead to refined neural implementations that are more and more similar to real neural circuits.

To make this tutorial as easy to follow as possible we introduce the free-energy framework using a simple example, and then illustrate how the model can scale up to more complex neural architectures. The tutorial provides a step-by-step derivation of the model. Some of these derivations are straightforward, and we feel that it would be helpful for readers to do them on their own, to gain a better understanding of the model and to ''keep in mind'' the notation used in the paper. Such straightforward derivations are indicated by ''(TRY IT YOURSELF)'', so after encountering this label we recommend trying to do the calculation described in the sentence carrying the label, and then comparing the obtained results with those in the paper. To illustrate the model we include simple simulations, but again we feel it would be helpful for readers to perform them on their own, to get an intuition for the model. Therefore we describe these simulations as exercises.

The paper is organized as follows. Section 2 introduces the model using a very simple example, using mathematical concepts as basic as possible, so it is accessible to a particularly wide audience. Section 3 provides mathematical foundations for the model, and shows how the inference in the model is related to minimization of free-energy. Section 4 then shows how the model scales up to describe the neural circuits in sensory cortex. In these three sections we use notation similar to that used by Friston (2005). Section 5 describes an extended version of the model which satisfies the constraint of local plasticity described above. Finally, Section 6 discusses insights provided by the model.

2. Simplest example of perception

We start by considering in this section a simple perceptual problem in which the value of a single variable has to be inferred from a single observation. To make it more concrete, consider a simple organism that tries to infer the size or diameter of a food item, which we denote by v, on the basis of the light intensity it observes. Let us assume that our simple animal has only one light-sensitive receptor which provides it with a noisy estimate of light intensity, which we denote by u. Let g denote a non-linear function relating the average light intensity to the size. Since the amount of light reflected is related to the area of an object, in this example we will consider the simple function g(v) = v². Let us further assume that the sensory input is noisy: in particular, when the size of the food item is v, the perceived light intensity is normally distributed with mean g(v) and variance Σu (although a normal distribution is not the best choice for a distribution of light intensity, as it includes negative numbers, we will still use it for simplicity):

p(u|v) = f(u; g(v), Σu). (1)

In Eq. (1), f(x; µ, Σ) denotes the density of a normal distribution with mean µ and variance Σ:

f(x; µ, Σ) = 1/√(2πΣ) · exp(−(x − µ)²/(2Σ)). (2)

Due to the noise present in the observed light intensity, the animal can refine its guess for the size v by combining the sensory stimulus with the prior knowledge of how large the food items usually are, which it has learnt from experience. For simplicity, let us assume that our animal expects this size to be normally distributed with mean vp and variance Σp (subscript p stands for ''prior''), which we can write as:

p(v) = f(v; vp, Σp). (3)

Let us now assume that our animal observed a particular value of light intensity, and attempts to estimate the size of the food item on the basis of this observation. We will first consider an exact solution to this problem, and illustrate why it would be difficult to compute it in a simple neural circuit. Then we will present an approximate solution that can be easily implemented in a simple network of neurons.
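To make the exact computation concrete, the following minimal sketch discretizes v, applies Bayes' theorem, and normalizes numerically (it is essentially the solution to Exercise 1 given at the end of the paper). The statement of Exercise 1 is not reproduced in this excerpt, so the parameter values below are assumptions for illustration, chosen to be consistent with the estimate φ ≈ 1.6 discussed later.

% Exact inference on a grid; all parameter values are assumed examples
v_p = 3; Sigma_p = 1;       % assumed prior mean and variance of the size v
Sigma_u = 1; u = 2;         % assumed noise variance and observed intensity
DV = 0.01; vrange = 0.01:DV:5;                       % grid of candidate sizes
prior = normpdf(vrange, v_p, sqrt(Sigma_p));         % p(v), Eq. (3)
likelihood = normpdf(u, vrange.^2, sqrt(Sigma_u));   % p(u|v), Eq. (1), g(v) = v^2
posterior = prior .* likelihood;                     % numerator of Bayes' theorem
posterior = posterior / sum(posterior * DV);         % normalize to integrate to 1
plot(vrange, posterior);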
Fig. 2. Solutions to Exercises 2 and 3. In panel b we have also included quantities that we will see later can be regarded as prediction errors.
In the above equation we used the chain rule to compute the second term, and g′(φ) is the derivative of function g evaluated at φ, so in our example g′(φ) = 2φ. We can find our best guess φ for v simply by changing φ in proportion to the gradient:

φ̇ = ∂F/∂φ. (9)

In the above equation φ̇ is the rate of change of φ with time. Let us note that the update of φ is very intuitive. It is driven by the two terms in Eq. (8): the first moves it towards the mean of the prior, the second moves it according to the sensory stimulus, and both terms are weighted by the reliabilities of the prior and the sensory input respectively.

Now please note that the above procedure for finding the approximate distribution of the size of the food item is computationally much simpler than the exact method presented at the start of the paper. To gain more appreciation for the simplicity of this computation we recommend doing the following exercise.

Exercise 2. Write a computer program finding the most likely size of the food item φ for the situation described in Exercise 1. Initialize φ = vp, and then find its values in the next 5 time units (you can use Euler's method, i.e. update φ(t + Δt) = φ(t) + Δt · ∂F/∂φ with Δt = 0.01).

Fig. 2(a) shows a solution to Exercise 2. Please notice that φ rapidly converges to the value φ ≈ 1.6, which is also the value that maximizes the exact posterior probability p(v|u) shown in Fig. 1.

2.3. A possible neural implementation

One can envisage many possible ways in which the computation described in the previous subsection could be implemented in neural circuits. In this paper we will present a possible implementation which satisfies the constraints of local computation and plasticity described in the Introduction. It slightly differs from the original implementation, which is contained in Appendix A.

While thinking about the neural implementation of the above computation, it is helpful to note that there are two similar terms in Eq. (8), so let us denote them by new variables:

εp = (φ − vp)/Σp (10)
εu = (u − g(φ))/Σu. (11)

The above terms are the prediction errors¹: εu expresses how much the light intensity differs from that expected if the size of the food item was φ, while εp denotes how the inferred size differs from prior expectations. With these new variables the equation for updating φ simplifies to:

φ̇ = εu g′(φ) − εp. (12)

The neural implementation of the model assumes that the model parameters vp, Σp, and Σu are encoded in the strengths of synaptic connections (as they need to be maintained over the animal's lifetime), while the variables φ, εu, and εp and the sensory input u are maintained in the activity of neurons or neuronal populations (as they change rapidly when the sensory input is modified). In particular, we will consider very simple neural ''nodes'' which simply change their activity proportionally to the input they receive, so, for example, Eq. (12) is implemented in the model by a node receiving input equal to the right hand side of this equation. The prediction errors could be computed by nodes with the following dynamics²:

ε̇p = φ − vp − Σp εp (13)
ε̇u = u − g(φ) − Σu εu. (14)

It is easy to show that nodes with the dynamics described by Eqs. (13)–(14) converge to the values defined in Eqs. (10)–(11): once Eqs. (13)–(14) converge, ε̇ = 0, so setting ε̇ = 0 and solving Eqs. (13)–(14) for ε, one obtains Eqs. (10)–(11).
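This convergence can also be checked numerically. A minimal sketch, holding φ fixed and reusing the assumed example values from the sketch above:

% Check that Eqs. (13)-(14) settle at the values given by Eqs. (10)-(11)
v_p = 3; Sigma_p = 1; Sigma_u = 1; u = 2; phi = 1.6;  % assumed values
DT = 0.01; e_p = 0; e_u = 0;
for i = 1:500                % integrate for 5 units of time
    e_p = e_p + DT * (phi - v_p - Sigma_p * e_p);     % Eq. (13)
    e_u = e_u + DT * (u - phi^2 - Sigma_u * e_u);     % Eq. (14)
end
% compare with the fixed points of Eqs. (10)-(11)
disp([e_p, (phi - v_p)/Sigma_p; e_u, (u - phi^2)/Sigma_u]);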
The architecture of the network described by Eqs. (12)–(14) is shown in Fig. 3. Let us consider the computations in its nodes. The node εp receives excitatory input from node φ, inhibitory input from a tonically active neuron via a connection with strength vp, and inhibitory input from itself via a connection with strength Σp, so it implements Eq. (13). The nodes φ and εu analogously implement Eqs. (12) and (14), but here the information exchange between them is additionally affected by function g, and we will discuss this issue in more detail in Section 2.5. We have now described all the details necessary to simulate the model.

Exercise 3. Simulate the model from Fig. 3 for the problem from Exercise 1. In particular, initialize φ = vp, εp = εu = 0, and find their values for the next 5 units of time.

1 In the original model (Friston, 2005) the prediction errors were normalized slightly differently, as explained in Appendix A.
2 The original model does not provide details on the dynamics of the nodes computing prediction errors, but we consider a sample description of their dynamics to illustrate how these nodes can perform their computation.
the expected value over trials. This will happen if vp = ⟨φ⟩, i.e. when vp is indeed equal to the expected value of φ. Analogously, the expected value of the change in Σp is 0 when:

(φ − vp)²/Σp² − 1/Σp = 0. (18)

Averaging over trials and solving Eq. (18) for Σp shows that the expected change vanishes when Σp = ⟨(φ − vp)²⟩, i.e. when Σp equals the variance of φ around vp.
Fig. 4. Architectures of models with linear and nonlinear function g. Circles and hexagons denote linear and nonlinear nodes respectively. Filled arrows and lines ended
with circles denote excitatory and inhibitory connections respectively, and an open arrow denotes a modulatory influence.
So we will now consider a function g(v, θ) that also depends on a parameter, which we denote by θ.

We will consider two special cases of the function g(v, θ), where the parameter θ has a clear biological interpretation. First, let us consider the simple case of a linear function: g(v, θ) = θv, as then the model has a straightforward neural implementation. In this case, Eqs. (12)–(14) describing the model simplify to:

φ̇ = θεu − εp (22)
ε̇p = φ − vp − Σp εp (23)
ε̇u = u − θφ − Σu εu. (24)

In this model, nodes φ and ε simply communicate through connections with weight θ, as shown in Fig. 4(a). Furthermore, we can also derive the rule for updating the parameter θ by finding the gradient of F over θ, as now the function g in Eq. (7) depends on θ (TRY IT YOURSELF):

∂F/∂θ = εu φ. (25)

Please note that this rule is again Hebbian, as the synaptic weights encoding θ are modified proportionally to the activities of the pre-synaptic and post-synaptic neurons (see Fig. 4(a)).
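To illustrate how this learning could proceed across many observations, here is a sketch combining the relaxation of Eqs. (22)–(24) with the Hebbian update of Eq. (25). The generative values (a true slope of 2, i.e. u ≈ 2v), the learning rate and all other constants are assumptions for illustration:

% Sketch: learning theta in the linear model, Eqs. (22)-(25); assumed values
v_p = 3; Sigma_p = 1; Sigma_u = 1; theta = 0.5; LRATE = 0.01; DT = 0.01;
for trial = 1:2000
    v = v_p + sqrt(Sigma_p)*randn;       % sample a 'true' size
    u = 2*v + sqrt(Sigma_u)*randn;       % sensory input; the true slope is 2
    phi = v_p; e_p = 0; e_u = 0;
    for i = 1:500                        % relax Eqs. (22)-(24) on this trial
        phi = phi + DT * (theta*e_u - e_p);
        e_p = e_p + DT * (phi - v_p - Sigma_p*e_p);
        e_u = e_u + DT * (u - theta*phi - Sigma_u*e_u);
    end
    theta = theta + LRATE * e_u * phi;   % Hebbian update of Eq. (25)
end
disp(theta);                             % should move towards the true slope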
Second, let us consider the case of a nonlinear function⁴ g(v, θ) = θh(v), where h(v) is a nonlinear function that depends only on v, as it results in an only slightly more complex neural implementation. Furthermore, this situation is relevant to the example of the simple animal considered at the start of this section, as the light is proportional to the area, but the proportionality constant may not be known (this case is also relevant to the network that we will discuss in Section 4.1). In this case, Eqs. (12)–(14) describing the model become:

φ̇ = θεu h′(φ) − εp (26)
ε̇p = φ − vp − Σp εp (27)
ε̇u = u − θh(φ) − Σu εu. (28)

A possible network implementing this model is illustrated in Fig. 4(b), which now includes non-linear elements. In particular, the node φ sends to node εu its activity transformed by a non-linear function, i.e. θh(φ). One could imagine that this could be implemented by an additional node receiving input from node φ, transforming it via the non-linear transformation h, and sending its output to node εu via a connection with the weight θ. Analogously, the input from node εu to node φ needs to be scaled by θh′(φ). Again one could imagine that this could be implemented by an additional node receiving input from node φ, transforming it via the non-linear transformation h′, and modulating the input received from node εu via a connection with weight θ (alternatively, this could be implemented within the node φ by making it react to its input differentially depending on its level of activity). The details of the neural implementation of these non-linear transformations depend on the form of function h, and would be an interesting direction for future work.

We also note that the update of the parameter θ, i.e. the gradient of F over θ, becomes:

∂F/∂θ = εu h(φ). (29)

This rule is Hebbian for the top connection labelled by θ in Fig. 4(b), as it is a product of the activity of the pre-synaptic and post-synaptic nodes. It would be interesting to investigate how such a plasticity rule could be realized for the other connection with the weight θ (from node εu to φ). We just note that for this connection the rule also satisfies the constraint of local plasticity (stated in the Introduction), as φ fully determines h(φ), so the change in weight is fully determined by the activity of the pre-synaptic and post-synaptic neurons.

4 Although this case has not been discussed by Friston (2005), it was discussed by Rao and Ballard (1999).

3. Free-energy

In this section we discuss how the computations in the model relate to a technique of statistical inference involving minimization of free-energy. There are three reasons for describing this relationship. First, it will provide more insight into why the parameters can be optimized by maximization of F. Second, the concept of free-energy is critical for understanding more complex models (Friston et al., 2013), which estimate not only the most likely values of variables, but their distribution. Third, free-energy is a very interesting concept in its own right, and has applications in mathematical psychology (Ostwald, Kirilina, Starke, & Blankenburg, 2014).

We now come back to the example of inference by a simple organism, and discuss how the exact inference described in Section 2.1 can be approximated. As we noted in Section 2.1, the posterior distribution p(v|u) may have a complicated shape, so we will approximate it with another distribution, which we denote q(v). Importantly, we will assume that q(v) has a standard shape, so we will be able to characterize it by the parameters of this typical distribution. For example, if we assume that q(v) is normal, then to fully describe it we can infer just two numbers, its mean and variance, instead of the infinitely many numbers potentially required to characterize a distribution of an arbitrary shape.

For simplicity, here we will use an even simpler shape for the approximate distribution, namely the delta distribution, which has all its mass concentrated in one point, which we denote by φ (i.e. the delta distribution is equal to 0 for all values different from φ, but its integral is equal to 1). Thus we will try to infer from the observation
just one parameter φ, which will characterize the most likely value of v.

We now describe what criterion we wish our approximate distribution to satisfy. We will seek the approximate distribution q(v) which is as close as possible to the actual posterior distribution p(v|u). Mathematically, the dissimilarity between two distributions is measured by the Kullback–Leibler divergence, defined as:

KL(q(v), p(v|u)) = ∫ q(v) ln (q(v)/p(v|u)) dv. (30)

For readers not familiar with the Kullback–Leibler divergence, we would like to clarify why it is a measure of dissimilarity between the distributions. Please note that if the two distributions q(v) and p(v|u) were identical, the ratio q(v)/p(v|u) would be equal to 1, so its logarithm would be equal to 0, and so the whole expression in Eq. (30) would be 0. The Kullback–Leibler divergence also has the property that the more different the two distributions are, the higher its value is (see Ostwald et al. (2014) for more details).
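These two properties are easy to confirm numerically. A minimal sketch using discretized normal distributions (the grid and the shifts are arbitrary choices for illustration):

% KL divergence of Eq. (30), discretized: 0 for identical distributions,
% and growing as p is shifted away from q
x = -5:0.01:5; dx = 0.01;
q = normpdf(x, 0, 1); q = q / sum(q*dx);
for mu = [0 1 2]
    p = normpdf(x, mu, 1); p = p / sum(p*dx);
    kl = sum(q .* log(q ./ p) * dx);     % discretized Eq. (30)
    fprintf('shift %g: KL = %.3f\n', mu, kl);
end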
Since we assumed above that our simplified distribution is a delta function, we will simply seek the value of its centre parameter φ which minimizes the Kullback–Leibler divergence defined in Eq. (30).

It may seem that the minimization of Eq. (30) is still difficult, because to compute the term p(v|u) present in Eq. (30) from Bayes' theorem (Eq. (4)) one needs to compute the difficult normalization integral (Eq. (5)). However, we will now show that there exists another way of finding the approximate distribution q(v) that does not involve the complicated computation of the normalization integral.

Substituting the definition of conditional probability p(v|u) = p(u, v)/p(u) into Eq. (30) we obtain:

KL(q(v), p(v|u)) = ∫ q(v) ln (q(v)p(u)/p(u, v)) dv
  = ∫ q(v) ln (q(v)/p(u, v)) dv + ∫ q(v) dv · ln p(u)
  = ∫ q(v) ln (q(v)/p(u, v)) dv + ln p(u). (31)

In the transition from the second to the third line we used the fact that q(v) is a probability distribution, so its integral is 1. The integral in the last line of the above equation is called the free-energy, and we will denote its negative by F, because we will show below that, under certain assumptions, the negative free-energy is equal (modulo a constant) to the function F we defined and used in the previous section:

F = ∫ q(v) ln (p(u, v)/q(v)) dv. (32)

In the above equation we used the property of logarithms that −ln(a/b) = ln(b/a). So, the negative free-energy is related to the Kullback–Leibler divergence in the following way:

KL(q(v), p(v|u)) = −F + ln p(u). (33)

Now please note that ln p(u) does not depend on φ (which is a parameter describing q(v)), so the value of φ that minimizes the distance between q(v) and p(v|u) is the same value as that which maximizes F. Therefore instead of minimizing the Kullback–Leibler divergence we can maximize F, and this will have two benefits: first, as we already mentioned above, F is easier to compute, as it does not involve the complicated computation of the normalization term. Second, as we will see later, it will allow us to naturally introduce learning about the parameters of the model.

Let us first note that by assuming that q(v) is a delta distribution, the negative free-energy simplifies to:

F = ∫ q(v) ln (p(u, v)/q(v)) dv
  = ∫ q(v) ln p(u, v) dv − ∫ q(v) ln q(v) dv
  = ln p(u, φ) + C1. (34)

In the transition from the first to the second line above we used the property of logarithms ln(a/b) = ln a − ln b. In the transition from the second line to the third line we used the property of a delta function δ(x) with centre φ that, for any function h(x), the integral of δ(x)h(x) is equal to h(φ). Furthermore, since the value of the second integral in the second line of the above equation does not depend on φ (so it will cancel when we compute the derivative over φ), we denote it by a constant C1.

Now using p(u, φ) = p(φ)p(u|φ), and ignoring the constant C1, we obtain the expression for F we introduced previously in Eq. (6). Thus finding the approximate delta distribution q(v) through minimization of free-energy is equivalent to the inference of features in the model described in the previous section. It is worth noting that Eq. (34) states that the best centre for our approximate distribution (i.e. our best guess for the size of the food item) is the value v = φ which maximizes the joint probability p(u, φ).

We now discuss how the concept of free-energy will help us to understand why the parameters of the model can be learnt by maximization of F. Recall from Section 2.4 that we wish to find parameters for which the sensory observations are least surprising, i.e. those which maximize p(u). To see the relationship between maximizing p(u) and maximizing F, we note that according to Eq. (33), p(u) is related to the negative free-energy in the following way:

ln p(u) = F + KL(q(v), p(v|u)). (35)

Since the Kullback–Leibler divergence is non-negative, F is a lower bound on ln p(u), thus by maximizing F we maximize the lower bound on ln p(u). So in summary, by maximizing F we can both find an approximate distribution q(v) (as discussed earlier) and optimize the model parameters. However, there is a twist here: we wish to maximize the average of p(u) across trials (or, here, observations of different food items). Thus on each trial we need to modify the model parameters just a little bit (rather than until the minimum of free-energy is reached, as was the case for φ).
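Eq. (34) also suggests a direct check of the earlier simulations: scanning ln p(u, φ) = ln p(φ) + ln p(u|φ) over a grid of φ and taking the maximum should recover the mode of the exact posterior. A minimal sketch with the same assumed example values as before:

% With q(v) a delta at phi, maximizing F amounts to maximizing ln p(u,phi)
v_p = 3; Sigma_p = 1; Sigma_u = 1; u = 2;    % assumed example values
DV = 0.01; vrange = 0.01:DV:5;
logjoint = log(normpdf(vrange, v_p, sqrt(Sigma_p))) + ...
    log(normpdf(u, vrange.^2, sqrt(Sigma_u)));   % ln p(phi) + ln p(u|phi)
[~, imax] = max(logjoint);
fprintf('phi maximizing F: %.2f\n', vrange(imax));   % close to 1.6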
4. Scaling up the model of perception

In this section we will show how the model scales up to networks inferring multiple features and involving hierarchy.

4.1. Increasing the dimension of sensory input

The model naturally scales up to the case of multiple sensory inputs from which we estimate multiple variables. Such a scaled model could be used to describe information processing within a cortical area (e.g. primary visual cortex) which infers multiple features (e.g. edges at different positions and orientations) on the basis of multiple inputs (e.g. information from multiple retinal receptors preprocessed by the thalamus). This section shows that when the dimensionality of the inputs and features is increased, the dynamics of the nodes in the networks and the synaptic plasticity are described by the same rules as in Section 2, just generalized to multiple dimensions.
Table 1
Rules for computation of derivatives. A denotes a symmetric matrix.

Original rule: ∂(ax²)/∂x = 2ax
Generalization to matrices: ∂(x̄ᵀAx̄)/∂x̄ = 2Ax̄

Original rule: if z = f(y), y = g(x), then ∂z/∂x = (∂y/∂x)(∂z/∂y)
Generalization to matrices: if z = f(ȳ), ȳ = g(x̄), then ∂z/∂x̄ = (∂ȳ/∂x̄)ᵀ(∂z/∂ȳ)

Original rule: ∂(ln a)/∂a = 1/a
Generalization to matrices: ∂(ln |A|)/∂A = A⁻¹

Original rule: ∂(x²/a)/∂a = −x²/a²
Generalization to matrices: ∂(x̄ᵀA⁻¹x̄)/∂A = −(A⁻¹x̄)(A⁻¹x̄)ᵀ
The only complication in explaining this case lies in the necessity to use matrix notation, so let us make this notation very explicit: we will denote single numbers in italic (e.g. x), column vectors by a bar (e.g. x̄), and matrices in bold (e.g. x). So we assume the animal has observed sensory input ū and estimates the most likely values φ̄ of variables v̄. We further assume that the animal has the prior expectation that the variables v̄ come from a multivariate normal distribution with mean v̄p and covariance matrix Σp, i.e. p(v̄) = f(v̄; v̄p, Σp), where:

f(x̄; µ̄, Σ) = 1/√((2π)ᴺ|Σ|) · exp(−½ (x̄ − µ̄)ᵀΣ⁻¹(x̄ − µ̄)). (36)

In the above equation N denotes the length of vector x̄, and |Σ| denotes the determinant of matrix Σ. Analogously, the probability of observing the sensory input given the values of the variables is given by p(ū|v̄) = f(ū; g(v̄, Θ), Σu), where Θ are the parameters of function g. We denote these parameters by a matrix Θ, as we will consider a generalization of the function g discussed in Section 2.5, i.e. g(v̄, Θ) = Θh(v̄), where each element i of vector h(v̄) depends only on vi. This function corresponds to an assumption often made by models of feature extraction (Bell & Sejnowski, 1995; Olshausen & Field, 1995), that stimuli are formed by a linear combination of features.⁵ Moreover, such a function g can be easily computed, as it is equal to the input to a layer of neurons from another layer with activity h(v̄) via connections with strength Θ.

5 In the model of Rao and Ballard (1999) sparse coding was achieved through the introduction of an additional prior expectation that most φi are close to 0, but sparse coding can also be achieved by choosing a shape of function h such that the h(vi) are mostly close to 0, and only occasionally significantly different from zero (Friston, 2008).

We can state the negative free-energy analogously as for the simple model considered in Eq. (7) (TRY IT YOURSELF):

F = ln p(φ̄) + ln p(ū|φ̄)
  = ½(−ln |Σp| − (φ̄ − v̄p)ᵀΣp⁻¹(φ̄ − v̄p) − ln |Σu| − (ū − g(φ̄, Θ))ᵀΣu⁻¹(ū − g(φ̄, Θ))) + C. (37)

Analogously as before, to find the vector of most likely values of features φ̄, we will calculate the gradient (the vector of derivatives ∂F/∂φi), which we will denote by ∂F/∂φ̄. We will use the elegant property that the rules for computation of derivatives generalize to vectors and matrices. To get an intuition for these rules we recommend the following exercise, which shows how the rule ∂x²/∂x = 2x generalizes to vectors.

Exercise 4. Show that for any vector x̄ the gradient of the function y = x̄ᵀx̄ is equal to: ∂y/∂x̄ = 2x̄.
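The matrix rules can also be verified numerically by finite differences; a minimal sketch for the first and third rules in Table 1, with an arbitrary symmetric matrix A:

% Finite-difference checks of Table 1 rules; A and x are arbitrary choices
A = [2 1; 1 3]; x = [0.5; -1]; d = 1e-6;
g_num = zeros(2,1);
for i = 1:2                       % numerical gradient of x'Ax
    e = zeros(2,1); e(i) = d;
    g_num(i) = ((x+e)'*A*(x+e) - (x-e)'*A*(x-e)) / (2*d);
end
disp([g_num, 2*A*x]);             % first rule: the columns should match
dlogdet = (log(det(A + d*eye(2))) - log(det(A - d*eye(2)))) / (2*d);
disp([dlogdet, trace(inv(A))]);   % third rule: derivative along I equals tr(A^-1)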
Using a method analogous to that in the solution to Exercise 4 (at the end of the paper) one can see that several other rules generalize, as summarized in Table 1. These rules can be applied to symmetric matrices, but since the Σ are covariance matrices, they are symmetric, so we can use the top two rules in Table 1 to compute the gradient of the negative free-energy (TRY IT YOURSELF):

∂F/∂φ̄ = −Σp⁻¹(φ̄ − v̄p) + (∂g(φ̄, Θ)/∂φ̄)ᵀ Σu⁻¹(ū − g(φ̄, Θ)). (38)

In the above equation, terms appear which are generalizations of the prediction errors we defined for the simple models:

ε̄p = Σp⁻¹(φ̄ − v̄p) (39)
ε̄u = Σu⁻¹(ū − g(φ̄, Θ)). (40)

With the error terms defined, the equation describing the update of φ̄ becomes:

φ̄̇ = −ε̄p + (∂g(φ̄, Θ)/∂φ̄)ᵀ ε̄u. (41)

The partial derivative term in the above equation is a matrix that contains in each entry with co-ordinates (i, j) the derivative of element i of vector g(φ̄, Θ) over φj. To see how the above equation simplifies for our choice of function g, it is helpful, without loss of generality, to consider the case of 2 features being estimated from 2 stimuli. Then:

g(φ̄, Θ) = Θh(φ̄) = [θ1,1 h(φ1) + θ1,2 h(φ2); θ2,1 h(φ1) + θ2,2 h(φ2)]. (42)

Hence we can find the derivatives of the elements of the above vector over the elements of vector φ̄:

∂g(φ̄, Θ)/∂φ̄ = [θ1,1 h′(φ1), θ1,2 h′(φ2); θ2,1 h′(φ1), θ2,2 h′(φ2)]. (43)

Now we can see that Eq. (41) can be written as:

φ̄̇ = −ε̄p + h′(φ̄) × Θᵀε̄u. (44)

In the above equation × denotes element-by-element multiplication, so the term h′(φ̄) × Θᵀε̄u is a vector whose element i is equal to the product of h′(φi) and element i of the vector Θᵀε̄u. Analogously as for the simple model, the prediction errors could be computed by nodes with the following dynamics:

ε̄̇p = φ̄ − v̄p − Σp ε̄p (45)
ε̄̇u = ū − Θh(φ̄) − Σu ε̄u. (46)

It is easy to see that Eqs. (45)–(46) have fixed points at the values given by Eqs. (39)–(40), by setting the left hand sides of Eqs. (45)–(46) to 0. The architecture of the network with the dynamics described by Eqs. (44)–(46) is shown in Fig. 5, and it is analogous to that in Fig. 4(b).
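A minimal simulation of these multivariate dynamics, for 2 features and 2 stimuli; here h(v) = v, and all numeric values (Θ, priors, input) are assumptions for illustration:

% Sketch of the network of Eqs. (44)-(46); h(v) = v, values assumed
Theta = [1 0.5; 0.5 1]; v_p = [1; 1];
Sigma_p = eye(2); Sigma_u = eye(2); u = [2; 1.5]; DT = 0.01;
phi = v_p; e_p = zeros(2,1); e_u = zeros(2,1);
h = @(x) x; hprime = @(x) ones(size(x));
for i = 1:1000
    phi = phi + DT * (-e_p + hprime(phi) .* (Theta' * e_u));   % Eq. (44)
    e_p = e_p + DT * (phi - v_p - Sigma_p * e_p);              % Eq. (45)
    e_u = e_u + DT * (u - Theta * h(phi) - Sigma_u * e_u);     % Eq. (46)
end
disp(phi);   % inferred feature values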
Analogously as for the simple model, one can also find the rules for updating the parameters encoded in synaptic connections, which generalize the rules presented previously. In particular, using the top formula in Table 1 it is easy to see that:

∂F/∂v̄p = ε̄p. (47)
Fig. 5. The architecture of the model inferring 2 features from 2 sensory stimuli. Notation as in Fig. 4(b). To help identify which connections are intrinsic and extrinsic to each level of hierarchy, the nodes and their projections in each level of hierarchy are shown in green, blue and purple respectively (in the online version). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Using the two bottom formulas in Table 1 one can find the rules for the update of the covariance matrices (TRY IT YOURSELF):

∂F/∂Σp = ½(ε̄p ε̄pᵀ − Σp⁻¹) (48)
∂F/∂Σu = ½(ε̄u ε̄uᵀ − Σu⁻¹). (49)

The derivation of the update of the parameters Θ is a bit more tedious, but we show in Appendix B that:

∂F/∂Θ = ε̄u h(φ̄)ᵀ. (50)

The above plasticity rules of Eqs. (47)–(50) are Hebbian in the same sense as they were for the simple model (cf. Eq. (48)).

In the hierarchical version of the model, the expected activity on each level is determined by the level above:

E(ū) = g1(v̄2, Θ1)
E(v̄2) = g2(v̄3, Θ2) (51)
E(v̄3) = …

To simplify the notation we could denote ū by v̄1, and then the likelihood of the activity in layer i becomes:

p(v̄i|v̄i+1) = f(v̄i; gi(v̄i+1, Θi), Σi). (52)

In this model, the Σi parametrize the covariance between features in each level, and the Θi parametrize how the mean value of the features in one level depends on the next. Let us assume the same form of function g as before, i.e. gi(v̄i+1, Θi) = Θi h(v̄i+1). By analogy to the model described in the previous subsection, one can see that inference of the features in all layers on the basis of sensory input can be achieved in the network shown in Fig. 6(a). In this network the dynamics of the nodes are described by:

φ̄̇i = −ε̄i + h′(φ̄i) × Θi−1ᵀ ε̄i−1 (53)
ε̄̇i = φ̄i − Θi h(φ̄i+1) − Σi ε̄i. (54)

Furthermore, by analogy to the previous section, the rules for modifying the synaptic connections in the model become:

∂F/∂Σi = ½(ε̄i ε̄iᵀ − Σi⁻¹). (55)
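A sketch of how this hierarchical inference could run for two levels above the input, again with h(v) = v; the weights, the input, and the fixed top-level activity (which here plays the role of the prior mean) are assumptions for illustration:

% Sketch of two-level hierarchical dynamics, Eqs. (53)-(54); values assumed
Theta1 = [1 0.5; 0.5 1]; Theta2 = eye(2);
Sigma1 = eye(2); Sigma2 = eye(2);
u = [2; 1.5];                        % sensory input, i.e. level 1 activity
phi2 = zeros(2,1); phi3 = [1; 1];    % phi3 held fixed at the top of the hierarchy
e1 = zeros(2,1); e2 = zeros(2,1); DT = 0.01;
for i = 1:1000
    e1 = e1 + DT * (u - Theta1*phi2 - Sigma1*e1);     % Eq. (54), level 1
    e2 = e2 + DT * (phi2 - Theta2*phi3 - Sigma2*e2);  % Eq. (54), level 2
    phi2 = phi2 + DT * (-e2 + Theta1'*e1);            % Eq. (53) with h(v) = v
end
disp(phi2);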
Fig. 6. (a) The architecture of the model including multiple layers. For simplicity only the first two layers are shown. Notation as in Fig. 5. (b) Extrinsic connectivity of cortical
layers.
R. Bogacz / Journal of Mathematical Psychology 76 (2017) 198–211 207
Fig. 7. Prediction error networks that can learn the uncertainty parameter with local plasticity. Notation as in Fig. 4(b). (a) Single node. (b) Multiple nodes for multidimensional features.

Fig. 8. Changes in estimated variance during learning in Exercise 5.
In matrix notation, the dynamics of a single pair of nodes can be written as:

[ε̇i; ėi] = [0, −1; Σi, −1] · [εi; ei] + [φi − gi(φi+1); 0]. (66)

In the multivariate case the corresponding dynamics are:

ε̄̇i = φ̄i − gi(φ̄i+1) − ēi (67)
ē̇i = Σi ε̄i − ēi. (68)

Analogously as before, we can find the fixed point by setting the left hand sides of the equations to 0:

ε̄i = Σi⁻¹(φ̄i − gi(φ̄i+1)) (69)
ēi = φ̄i − gi(φ̄i+1). (70)
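The convergence of these error-node dynamics can be seen from the linear system in Eq. (66): for a single node the system matrix has trace −1 and determinant Σi, so both eigenvalues have negative real parts whenever Σi > 0, and the fixed point is stable. A quick numerical check:

% Stability of the single-node dynamics of Eq. (66) for a few values of Sigma
for Sigma = [0.5 1 2 5]
    M = [0 -1; Sigma -1];
    fprintf('Sigma = %g: real parts of eigenvalues: %s\n', ...
        Sigma, mat2str(real(eig(M))', 3));
end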
Thus we can see that the nodes ε̄ have fixed points at the values equal to the prediction errors. We can now consider a learning rule analogous to that in the previous subsection:

ΔΣi = α(ε̄i ēiᵀ − 1). (71)

To find the values in the vicinity of which the above rule may converge, we can find the value of Σi for which the expected value of the right hand side of the above equation is equal to 0:

⟨ε̄i ēiᵀ − 1⟩ = 0. (72)

Substituting Eqs. (69)–(70) into the above equation, and solving for Σi, we obtain (TRY IT YOURSELF):

Σi = ⟨(φ̄i − gi(φ̄i+1))(φ̄i − gi(φ̄i+1))ᵀ⟩. (73)

We can see that the learning rule has a stochastic fixed point at the values corresponding to the covariance matrix. In summary, the nodes in the network described in this section have fixed points at the prediction errors and can learn the covariance of the corresponding features; thus the proposed network may substitute for the prediction error nodes in the model shown in Fig. 6, and the computation will remain the same. But importantly, in the proposed network the covariance is learnt with local plasticity involving simple Hebbian learning.
natural stimuli and additionally learn the variance and covariance
6. Discussion of features.
We have also demonstrated that if the dynamics within the
In this paper we presented the model of perception and learning nodes computing prediction errors takes place on a time-scale
in neural circuits based on the free-energy framework. This model much faster than in the whole network, these nodes converge
extends the predictive coding model (Rao & Ballard, 1999) in to stable fixed points. It is also worth noting that under the
that it represents and learns not only mean values of stimuli or assumption of separation of time scales, the nodes computing φ
features, but also their variances, which gives the model several also converge to a stable fixed point, because variables φ converge
new computational capabilities, as we now discuss. to the values that maximize function F . It would be interesting
First, the model can weight incoming sensory information by to investigate how to ensure that the model converges to desired
their reliability. This property arises in the model, because the values (rather than engaging into oscillatory behaviour) also when
prediction errors are normalized by dividing them by the variance one considers a more realistic case of time-scales not being fully
of noise. Thus the more noisy is a particular dimension of the separated.
ξ̇u = u − g(φ) − σu ξu. (79)

The architecture of the model described by Eqs. (77)–(79) is shown in Fig. 10. It is similar to that in Fig. 3, but differs in the information received by node φ (we will discuss this difference in more detail at the end of this Appendix).

Reshaping the right hand side of the above equation into a matrix, we can see how it can be decomposed into the product of vectors in Eq. (50):

[h(φ1)εu,1, h(φ2)εu,1; h(φ1)εu,2, h(φ2)εu,2] = [εu,1; εu,2] · [h(φ1), h(φ2)]. (86)
Solutions to exercises
Exercise 1
function exercise1
% values assumed (Exercise 1's statement is not in this excerpt); phi ~ 1.6
v_p = 3; sigma_p = 1; sigma_u = 1; u = 2;
DV = 0.01; vrange = 0.01:DV:5;
% normpdf expects standard deviations, hence the sqrt of the variances
numerator = normpdf(vrange, v_p, sqrt(sigma_p)) .* normpdf(u, vrange.^2, sqrt(sigma_u));
normalization = sum(numerator * DV);
p = numerator / normalization;
Exercise 2
function exercise2
v_p = 3; Sigma_p = 1; Sigma_u = 1; u = 2;  % assumed values as in exercise1
DT = 0.01; MAXT = 5; phi(1) = v_p;         % Euler step, duration, initial phi
for i = 2:MAXT/DT
    % Euler integration of Eq. (9), i.e. gradient ascent on F of Eq. (8)
    phi(i) = phi(i-1) + DT * ((v_p - phi(i-1))/Sigma_p + ...
        (u - phi(i-1)^2)/Sigma_u * (2*phi(i-1)));
end
Exercise 3
function exercise3
v_p = 3; Sigma_p = 1; Sigma_u = 1; u = 2;  % assumed values as in exercise1
DT = 0.01; MAXT = 5;
phi(1) = v_p; error_p(1) = 0; error_u(1) = 0;   % initial conditions
for i = 2:MAXT/DT
    % Euler integration of Eqs. (12)-(14) with g(phi) = phi^2
    phi(i) = phi(i-1) + DT * (-error_p(i-1) + error_u(i-1) * (2*phi(i-1)));
    error_p(i) = error_p(i-1) + DT * (phi(i-1) - v_p - Sigma_p * error_p(i-1));
    error_u(i) = error_u(i-1) + DT * (u - phi(i-1)^2 - Sigma_u * error_u(i-1));
end
Exercise 4
It is easiest to consider a vector of two numbers (the analogous result can be shown for longer vectors):

x̄ = [x1; x2]. (87)

Then y = x̄ᵀx̄ = x1² + x2², so the gradient is equal to:

∂y/∂x̄ = [∂y/∂x1; ∂y/∂x2] = [2x1; 2x2] = 2x̄. (88)
Exercise 5
function exercise5
% The statement of Exercise 5 is not in this excerpt; the values below are
% assumed examples: phi varies around phi_above with variance 2, so the
% learnt Sigma should approach 2 (cf. Fig. 8).
phi_above = 5; Sigma(1) = 1; LRATE = 0.01; DT = 0.01; MAXT = 20;
for trial = 2:1000
    phi = phi_above + sqrt(2)*randn;   % sample this trial's activity
    error(1) = 0; e(1) = 0;
    for i = 2:MAXT/DT                  % Euler integration of Eqs. (67)-(68)
        error(i) = error(i-1) + DT * (phi - phi_above - e(i-1));
        e(i) = e(i-1) + DT * (Sigma(trial-1) * error(i-1) - e(i-1));
    end
    Sigma(trial) = Sigma(trial-1) + LRATE * (error(end)*e(end) - 1);  % Eq. (71)
end
References

Bastos, Andre M., Usrey, W. Martin, Adams, Rick A., Mangun, George R., Fries, Pascal, & Friston, Karl J. (2012). Canonical microcircuits for predictive coding. Neuron, 76, 695–711.
Bell, Anthony J., & Sejnowski, Terrence J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, Anthony J., & Sejnowski, Terrence J. (1997). The independent components of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Bogacz, Rafal, Brown, Malcolm W., & Giraud-Carrier, Christophe (2001). Emergence of movement sensitive neurons' properties by learning a sparse code for natural moving images. Advances in Neural Information Processing Systems, 13, 838–844.
Bogacz, Rafal, & Gurney, Kevin (2007). The basal ganglia and cortex implement optimal decision making between alternative actions. Neural Computation, 19, 442–477.
Chen, J.-Y., Lonjers, P., Lee, C., Chistiakova, M., Volgushev, M., & Bazhenov, M. (2013). Heterosynaptic plasticity prevents runaway synaptic dynamics. Journal of Neuroscience, 33, 15915–15929.
Feldman, Harriet, & Friston, Karl (2010). Attention, uncertainty, and free-energy. Frontiers in Human Neuroscience, 4, 215.
FitzGerald, Thomas H. B., Schwartenbeck, Philipp, Moutoussis, Michael, Dolan, Raymond J., & Friston, Karl (2015). Active inference, evidence accumulation and the urn task. Neural Computation, 27, 306–328.
Friston, Karl (2003). Learning and inference in the brain. Neural Networks, 16, 1325–1352.
Friston, Karl (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society B, 360, 815–836.
Friston, Karl (2008). Hierarchical models in the brain. PLoS Computational Biology, 4, e1000211.
Friston, Karl (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11, 127–138.
Friston, Karl, Schwartenbeck, Philipp, FitzGerald, Thomas, Moutoussis, Michael, Behrens, Timothy, & Dolan, Raymond J. (2013). The anatomy of choice: active inference and agency. Frontiers in Human Neuroscience, 7, 598.
Harwood, David, Ojala, Timo, Pietikäinen, Matti, Kelman, Shalom, & Davis, Larry (1995). Texture classification by center-symmetric auto-correlation, using Kullback discrimination of distributions. Pattern Recognition Letters, 16, 1–10.
Olshausen, Bruno A., & Field, David J. (1995). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
O'Reilly, Randall C., & Munakata, Yuko (2000). Computational explorations in cognitive neuroscience. MIT Press.
Ostwald, Dirk, Kirilina, Evgeniya, Starke, Ludger, & Blankenburg, Felix (2014). A tutorial on variational Bayes for latent linear stochastic time-series models. Journal of Mathematical Psychology, 60, 1–19.
Rao, Rajesh P. N., & Ballard, Dana H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2, 79–87.
Strogatz, Steven (1994). Nonlinear dynamics and chaos. Westview Press.