PCM for Python
Release v.0.9
Jörn Diedrichsen
2 Documentation
  2.1 Installation
  2.2 Introduction
  2.3 Model Specification
  2.4 Model Fitting
  2.5 Visualisation
  2.6 Inference
  2.7 Regularized regression
  2.8 Application examples
  2.9 Mathematical details
  2.10 API reference
  2.11 References
The Pattern Component Modelling (PCM) toolbox is designed to analyze multivariate brain activity patterns using a Bayesian approach. The theory is laid out in Diedrichsen et al. (2017) as well as in this documentation. We provide details for model specification, model estimation, visualisation, and model comparison. The documentation also refers to the empirical examples in the demos folder.
The original Matlab version of the toolbox is available at https://github.com/jdiedrichsen/pcm_toolbox. The practical
examples in this documentation are available as jupyter notebooks at https://github.com/DiedrichsenLab/PcmPy/tree/
master/docs/demos.
Note that the toolbox does not provide functions to extract the required data from the first-level GLM or raw data, or to run searchlight or ROI analyses. We have omitted these functions, as they strongly depend on the analysis package used for the basic imaging analysis. Some useful tools for the extraction of multivariate data from first-level GLMs can be found in the RSA toolbox (https://github.com/rsagroup/rsatoolbox) and, of course, nilearn.
CHAPTER ONE
The PcmPy toolbox is being developed by members of the Diedrichsen lab, including Jörn Diedrichsen, Giacomo Ariani, Spencer Arbuckle, Eva Berlot, and Atsushi Yokoi. It is distributed under the MIT License, meaning that it can be freely used and re-used, as long as proper attribution in the form of acknowledgments and links (for online use) or citations (in publications) is given. When using the toolbox, please cite the relevant references:
• Diedrichsen, J., Yokoi, A., & Arbuckle, S. A. (2018). Pattern component modeling: A flexible approach for
understanding the representational structure of brain activity patterns. Neuroimage. 180(Pt A), 119-133.
• Diedrichsen, J., Ridgway, G., Friston, K.J., Wiestler, T., (2011). Comparing the similarity and spatial structure
of neural representations: A pattern-component model. Neuroimage.
CHAPTER TWO

DOCUMENTATION
2.1 Installation
2.1.1 Dependencies
You can also clone or fork the whole repository from https://github.com/diedrichsenlab/PCMPy. Place the entire repository in a folder of your choice. Then add the folder to your Python path by adding the following lines to your .bash_profile or other shell startup file:
PYTHONPATH=/DIR/PcmPy:${PYTHONPATH}
export PYTHONPATH
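After restarting your shell, a quick import check verifies that the package is found (FreeModel is simply used here as an arbitrary example class from the toolbox):

import PcmPy as pcm
print(pcm.model.FreeModel)   # prints the class if PcmPy is importable from the path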
2.2 Introduction
The study of brain representations aims to illuminate the relationship between brain activity patterns and “things in the
world” - be it objects, actions, or abstract concepts. Understanding the internal syntax of brain representations, and how
this structure changes across different brain regions, is essential in gaining insights into the way the brain processes
information.
Central to the definition of representation is the concept of decoding. A feature (i.e. a variable that describes some
aspect of the “things in the world”) that can be decoded from the ongoing neural activity in a region is said to be
represented there. For example, a feature could be the direction of a movement, the orientation and location of a visual
stimulus, or the semantic meaning of a word. Of course, if we allow the decoder to be arbitrarily complex, we would
use the term representation in the most general sense. For example, using a computer vision algorithm, one may be able
to identify objects based on activity in primary visual cortex. However, we may not necessarily conclude that object identity is represented in V1 - at least not explicitly. Therefore, it makes sense to restrict our definition of an explicit
representation to features that can be linearly decoded by a single neuron from some population activity (Kriegeskorte
& Diedrichsen, 2017, 2019).
While decoding is a popular approach when analyzing multi-variate brain activity patterns, it is not the most useful
tool when we aim to make inferences about the nature of brain representations. The fact that we can decode feature X
well from region A does not imply that the representation in A is well characterized by feature X - there may be many
other features that better determine the activity patterns in this region.
In an Encoding model, we characterize how well we can explain the activities in a specific region using a set of features. The activity profile of each voxel (here shown as columns in the activity data matrix) is modeled as the linear combination of a set of features (Fig. 1a). Each voxel, or more generally measurement channel, has its own set of parameters (W) that determine the weight of each feature. This can be visualized by plotting the activity profile of each
voxel into the space spanned by the experimental conditions (Fig. 1b). Each dot refers to the activity profile of a channel
(here a voxel), indicating how strongly the voxel is activated by each condition. Estimating the weights is equivalent to
a projection of each of the activity profiles onto the feature vectors. The quality of the model can then be evaluated by
determining how well unseen test data can be predicted. When estimating the weights, encoding models often use some
form of regularization, which essentially imposes a prior on the feature weights. This prior is an important component
of the model. It determines a predicted distribution of the activity profiles (Diedrichsen & Kriegeskorte, 2017). An
encoding model that matches the real distribution of activity profiles best will show the best prediction performance.
The interpretational problem for encoding models is that for each feature set that predicts the data well, there is an
infinite number of other (rotated) feature sets that describe the same distribution of activity profiles (and hence predict
the data) equally well (Diedrichsen, 2019). The argument may be made that to understand brain representations, we
should not think about specific features that are encoded, but rather about the distribution of activity profiles. This can
be justified by considering a read-out neuron that receives input from a population of neurons. From the standpoint
of this neuron, it does not matter which neuron has which activity profile (as long as it can adjust input weights), and
which features were chosen to describe these activity profiles - all that matters is what information can be read out from
the code. Thus, from this perspective it may be argued that the formulation of specific feature sets and the fitting of
feature weights for each voxel are unnecessary distractions.
Therefore, pattern component modeling (PCM) abstracts from specific activity patterns. This is done by summarizing the data using a suitable summary statistic (Fig. 1a) that describes the shape of the activity profile distribution (Fig.
1c). This critical characteristic of the distribution is the covariance matrix of the activity profile distribution or - more
generally - the second moment. The second moment determines how well we can linearly decode any feature from
the data. If, for example, activity measured for two experimental conditions is highly correlated in all voxels, then the
difference between these two conditions will be very difficult to decode. If however, the activities are uncorrelated, then
decoding will be very easy. Thus, the second moment is a central statistical quantity that determines the representational
content of the brain activity patterns of an area.
Similarly, a representational model is formulated in PCM not by its specific feature set, but by its predicted second
moment matrix. If two feature sets have the same second moment matrix, then the two models are equivalent. Thus,
PCM makes hidden equivalences between encoding models explicit. To evaluate models, PCM simply compares the likelihood of the data under the distribution predicted by the model. To do so, we rely on a generative model of brain activity data, which fully specifies the distribution and relationship between the random variables. Specifically, true activity profiles are assumed to have a multivariate Gaussian distribution and the noise is also assumed to be Gaussian, with known covariance structure. Having a fully-specified generative model allows us to calculate the likelihood of data under the model, averaged over all possible values of the feature weights. This results in the so-called model evidence, which can be used to compare different models directly, even if they have different numbers of features.

Fig. 1: Decoding, encoding, and representational models. (A) The matrix of activity data consists of rows of activity patterns for each condition, or of columns of activity profiles for each voxel (or, more generally, measurement channel). The data can be used to decode specific features that describe the experimental conditions (decoding). Alternatively, a set of features can be used to predict the activity data (encoding). Representational models work at the level of a sufficient statistic (the second moment) of the activity profiles. Models are formulated in this space and possibly combined and changed using higher-order model parameters. (B) Encoding analysis: the activity profiles of different voxels are points in the space of the experimental conditions. Features in encoding models are vectors that describe the overall distribution of the activity profiles. (C) PCM: the distribution of activity profiles is directly described using a multivariate normal distribution. (D) Representational similarity analysis (RSA): the activity patterns for different conditions are plotted in the space defined by different voxels. The distances between activity patterns serve here as the sufficient statistic, which is fully defined by the second moment matrix.
In summarizing the data using a sufficient statistic, PCM is closely linked to representational similarity analysis (RSA), which characterizes the second moment of the activity profiles in terms of the distances between activity patterns (Fig. 1D; also see Diedrichsen & Kriegeskorte, 2017). Thus, in many ways PCM can be considered an intermediate approach that unifies the strengths of RSA and encoding models.
By removing the requirement to fit and cross-validate individual voxel weights, PCM enables the user to concentrate
on a different kind of free parameter, namely model parameters that determine the shape of the distribution of activity
profiles. From the perspective of encoding models, these would be hyper-parameters that change the form of the feature
or regression matrix. For example, we can fit the distribution of activity profiles using a weighted combination of 3 different feature sets (Fig. 1a). Such component models (see section Component models) are extremely useful if we hypothesize that a region cares about different groups of features (i.e. colour, size, orientation), but we do not know how
strongly each feature is represented. In encoding models, this would be equivalent to providing a separate regularization
factor to different parts of the feature matrix. Most encoding models, however, use a single regularization factor, making
them equivalent to a fixed PCM model.
In this manual we will show how to use the PCM toolbox to estimate and compare flexible representational models.
We will present the fundamentals of the generative approach taken in PCM and outline different ways in which flexible
representational models with free parameters can be specified. We will then discuss methods for model fitting and
for model evaluation. We will also walk in detail through three illustrative examples from our work on finger representations in primary sensory and motor cortices, also providing demo code for the examples presented in the paper (Diedrichsen, Yokoi, & Arbuckle, 2018).
PCM is based on a generative model of the measured brain activity data Y, a matrix of N x P activity measurements,
referring to N time points (or trials) and P voxels (or channels). The data can refer to the minimally preprocessed raw
activity data, or to already deconvolved activity estimates, such as those obtained as beta weights from a first-level time
series model. U is the matrix of true activity patterns (a number of conditions x number of voxels matrix) and Z the
design matrix. Also influencing the data are effects of no interest B and noise:
Y = ZU + XB + 𝜖
u𝑝 ∼ 𝑁 (0, G)
𝜖𝑝 ∼ 𝑁 (0, S𝜎 2 )
The activity profiles (u_p, the columns of U) are considered to be random variables. PCM models do not specify the exact activity profiles of specific voxels, but rather their probability distribution. Also, PCM is not interested in how the
different activity profiles are spatially arranged. This makes sense considering that activity patterns can vary widely
across different participants and do not directly impact what can be decoded from a region. For this, only the distribution
of activity profiles in a region is important.
PCM assumes that the expected mean of the activity profiles is zero. In many cases, we are not interested in how much a voxel is activated, but only in how activity differs between conditions. In these cases, we model the mean for each voxel using the fixed effects X.
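To make the generative model concrete, here is a small numpy simulation of data from this model (the sizes and the example G are arbitrary and purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
n_cond, n_part, n_vox = 5, 8, 100
G = 0.5 * np.eye(n_cond) + 0.5                       # example second moment of the activity profiles
Z = np.kron(np.ones((n_part, 1)), np.eye(n_cond))    # each condition measured once per partition
U = rng.multivariate_normal(np.zeros(n_cond), G, size=n_vox).T   # true patterns: columns u_p ~ N(0, G)
E = rng.standard_normal((n_part * n_cond, n_vox))    # i.i.d. noise (S = I, sigma^2 = 1)
Y = Z @ U + E                                        # simulated data matrix (N x P)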
Note that this mean-pattern removal does not change the information contained in a region. In contrast, sometimes researchers also remove the mean value (Walther et al., 2016), i.e., the mean of each condition across voxels. We discourage this approach, as it would remove differences that, from the perspective of decoding and representation, are highly meaningful.
The third assumption is that the activity profiles come from a multivariate Gaussian distribution. This is likely the
most controversial assumption, but it is motivated by a few reasons: First, for fMRI data the multi-variate Gaussian
is often a relatively appropriate description. Secondly, the definition causes us to focus on the mean and covariance
matrix, G, as sufficient statistics, as these completely determine the Gaussian. Thus, even if the true distribution of the
activity profiles is better described by a non-Gaussian distribution, the focus on the second moment is sensible as it
characterizes the linear decodability of any feature of the stimuli.
We assume that the noise of each voxel is Gaussian with a temporal covariance that is known up to a constant term
𝜎 2 . Given the many additive influences of various noise sources on fMRI signals, Gaussianity of the noise is, by
the central limit theorem, most likely a very reasonable assumption, commonly made in the fMRI literature. The
original formulation of PCM used a model which assumed that the noise is also temporally independent and identically
distributed (i.i.d.) across different trials, i.e. S = I . However, as pointed out recently (Cai et al., 2016), this assumption
is often violated in non-random experimental designs with strong biasing consequences for estimates of the covariance
matrix. If this is violated, we can either assume that we have a valid estimate of the true covariance structure of the
noise (S), or we can model different parts of the noise structure (see Noise Models).
PCM also assumes that different voxels are independent from each other. For fMRI data, this assumption would clearly be violated, given the strong spatial correlation of noise processes in fMRI. To reduce these dependencies we typically use spatially pre-whitened data, which is divided by an estimate of the spatial covariance matrix (Walther et al., 2016). Recent results from our lab show that this approach is sufficient to obtain correct marginal likelihoods.
Marginal likelihood
When we fit a PCM model, we are not trying to estimate the specific values of the true activity patterns U. This differs from encoding approaches, in which we would estimate the values of U by estimating the feature weights W. Rather, we want to assess how likely the data is under any possible value of U, as specified by the prior distribution. Thus we wish to calculate the marginal likelihood

$$p(\mathbf{Y}|\theta) = \int p(\mathbf{Y}|\mathbf{U},\theta)\, p(\mathbf{U}|\theta)\, d\mathbf{U}.$$

This is the likelihood that is maximized in PCM with respect to the model parameters $\theta$. For more details, see Mathematical and algorithmic details.
2.3 Model Specification

The main intellectual work when using PCM is to build the appropriate models. There are basically two complementary approaches or philosophies when it comes to specifying a representational model. Which way you feel more
comfortable with will likely depend on whether you are already familiar with Encoding models or RSA. Ultimately,
most problems can be formulated in both ways, and the results will be identical. Nonetheless, it is useful to become
familiar with both styles of model building, as they can also be combined to find the most intuitive and computationally
efficient way of writing a particular model.
Example
An empirical example of how to construct the same model as either an Encoding- or RSA-style PCM model comes from Yokoi et al. (2018). In this experiment, participants learned six motor sequences, each a different permutation of presses of fingers 1, 3, and 5 (see Fig. 2A). We would like to model the activity patterns in terms of two model components: in the first component, each finger contributes a specific pattern, and the pattern of the first finger has a particularly strong weight. In the second component, each transition between two subsequent fingers contributes a unique pattern.
Encoding-style models
When constructing an encoding-style model, the model components are formulated as sets of features, which are encoded into the design matrix. In our example, the first feature set (first finger) has 3 columns with variables that indicate whether the first finger was digit 1, 3, or 5 (because each finger occurs exactly once in each sequence, we can ignore the subsequent presses). The second feature set has 6 features, indicating which of the 6 possible transitions between fingers were present in the sequence. Therefore, the design matrix Z has 9 columns (Fig. 2B). The encoding-style model, however, is not complete without a prior on the underlying activity patterns U, i.e. the feature weights. As implicit in the use of ridge regression, we assume here that all features within a feature set are independent and equally strongly encoded. Therefore, the second moment matrix for the first model component is a 3x3 identity matrix and for the second component a 6x6 identity matrix. Each component is then weighted by the relative weight of the component, $\exp(\theta_i)$. The overall second moment matrix G is then the sum of the two weighted model component matrices. This model would be equivalent to an encoding model in which each of the feature sets has its own ridge coefficient. PCM will automatically find the optimal values of the two ridge coefficients (the importance of each feature set). If you want to use PCM to tune the regularization parameters for Tikhonov regularization, the regression module provides a simplified interface to do so (see Regularized regression).
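To make this concrete, here is a sketch of the encoding-style construction (variable names are illustrative; Z would be the 12 x 9 condition matrix shown in Fig. 2B, and the component matrices mirror the diagonal indicator matrices described above):

import numpy as np
import PcmPy as pcm

# Component second moments in the 9-dimensional feature space of Z:
# one diagonal indicator for the 3 first-finger features, one for the 6 transition features
G_first = np.diag([1, 1, 1, 0, 0, 0, 0, 0, 0])
G_trans = np.diag([0, 0, 0, 1, 1, 1, 1, 1, 1])
M_enc = pcm.ComponentModel('first_finger+transitions', [G_first, G_trans])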
Fig. 2: (A) Set of six multi-finger sequences used in the task. (B) Encoding-style model construction. The condition matrix Z, here shown for 12 trials of 2 x 6 sequences, contains all features. Feature set F contains indicators for the first finger in each sequence, and feature set T contains the finger transitions for each sequence. The second moment matrices for the two feature sets are diagonal matrices, indicating which features are taken into account. Each feature set is multiplied by an overall importance weight for that feature set (analogous to its ridge coefficient). (C) RSA-style model construction. The design matrix indicates which of the 6 sequences was executed. The second moment matrix determines the hypothesized covariances between the 6 sequence patterns. In both cases, the overall second moment matrix G is the weighted sum of the two component matrices.
RSA-style models
In RSA, models are specified in terms of the predicted similarity or dissimilarity between discrete experimental condi-
tions. Therefore, when constructing our model in RSA-style, the design matrix Z simply indicates which trial belonged
to which sequence. The core of the model is specified in the second moment matrix, which specifies the covariances
between conditions, and hence both the Euclidean and correlation distances (Fig. 2C). For example, the first component predicts that sequences I and II, which both start with digit 1, share a high covariance. The predicted covariances can be calculated from an encoding-style model by taking the inner product of the feature sets, $\mathbf{F}\mathbf{F}^T$.
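The same components can therefore be expressed at the level of the 6 sequences by taking these inner products (a small numpy sketch; the exact assignment of first fingers to sequences is illustrative):

import numpy as np

# Hypothetical first-finger indicator for the 6 sequences (e.g. I/II start with finger 1,
# III/IV with finger 3, V/VI with finger 5)
F = np.kron(np.eye(3), np.ones((2, 1)))   # 6 sequences x 3 first-finger features
G_first_cond = F @ F.T                     # predicted 6 x 6 covariance component (cf. Fig. 2C)
# the transition component is obtained analogously as T @ T.T from the transition indicator T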
The models depicted in Fig. 2B and 2C are identical. So when should we prefer one approach over the other? Some of this is up to personal taste. However, some experiments have no discrete conditions. For example, each trial may also be characterized by an action or stimulus property that varies in a continuous fashion. For these experiments, the best way is to formulate the model in an encoding style. Even if there are discrete conditions, the conditions may
differ across subjects. Because the group fitting routines (see below) allow for subject-specific design matrices, but
not for subject-specific second-moment matrices, encoding-style models are the way to go. In other situations, for
example experiments with fewer discrete conditions and many feature sets, the RSA-style formulation can be more
straightforward and faster.
Finally, the two approaches can be combined to achieve the most natural way of expressing a model. In our example, we used the design matrix from the first-finger model (Fig. 2B), combined with a second moment derived from the natural statistics to capture the known covariance structure of activity patterns associated with single finger movements (Ejaz et al., 2015).
Independently of whether you choose an Encoding- or RSA-style approach to building your model, the PCM toolbox distinguishes between a number of different model types, each of which has its own model class.
Fixed models
In fixed models, the second moment matrix G is exactly predicted by the model. The most common example is the
Null model G = 0. This is equivalent to assuming that there is no difference between any of the activity patterns.
The Null-model is useful if we want to test whether there are any differences between experimental conditions. An
alternative Null model would be G = I, i.e. to assume that all patterns are uncorrelated and equally far away from each other.
Fixed models also occur when the representational structure can be predicted from some independent data. An example is shown below, where we predict the structure of finger representations directly from the correlational structure of finger movements in every-day life (Ejaz et al., 2015). Importantly, fixed models only predict the second moment matrix up to a proportional constant. The width of the distribution will vary with the overall
scale or signal-to-noise-level. Thus, when evaluating fixed models we usually allow the predicted second moment
matrix to be scaled by an arbitrary positive constant (see Model Fitting).
Example
An empirical example for a fixed representational model comes from Ejaz et al. (2015). Here the representational structure of 5 finger movements was compared to the representational structure predicted by the way the muscles are activated during finger movements (Muscle model), or by the covariance structure of natural movements of the 5 fingers. That is, the predicted second moment matrix is derived from data completely independent of our imaging data.
Models are objects of a specific class, inherited from the class Model. To define a fixed model, we simply need to load the predicted second moment matrix and define a model object as follows (see Application examples):
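A minimal sketch (the file name is hypothetical, and the FixedModel class name follows the naming of the other model classes - see the Application examples for the exact call):

import numpy as np
import PcmPy as pcm

G_muscle = np.load('muscle_G.npy')                   # hypothetical file containing the predicted second moment
M_muscle = pcm.model.FixedModel('muscle', G_muscle)  # fixed model: G is fully specified up to a scaling factor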
When evaluating the likelihood of a data set under the prediction, the pcm toolbox still needs to estimate the scaling
factor and the noise variance, so even in the case of fixed models, an iterative maximization of the likelihood is required
(see below).
Component models
A more flexible model is to express the second moment matrix as a linear combination of different components. For
example, the representational structure of activity patterns in the human object recognition system in inferior temporal
cortex can be compared to the response of a convolutional neural network that is shown the same stimuli (Khaligh-
Razavi & Kriegeskorte, 2014). Each layer of the network predicts a specific structure of the second moment matrix
and therefore constitutes a fixed model. However, the representational structure may be best described by a mixture of
multiple layers. In this case, the overall predicted second moment matrix is a linear sum of the weighted component matrices:
$$\mathbf{G} = \sum_h \exp(\theta_h)\,\mathbf{G}_h$$
The weights for each component need to be positive - allowing negative weights would not guarantee that the overall second moment matrix would be positive definite. Therefore, we use the exponential of the weighting parameter here, such that we can use unconstrained optimization to estimate the parameters.
For fast optimization of the likelihood, we require the derivative of the second moment matrix with respect to each of the parameters. This derivative can then be used to calculate the derivative of the log-likelihood with respect to the parameters (see the Mathematical details section on the derivative of the log-likelihood). In the case of linear component models this is easy to obtain:
$$\frac{\partial \mathbf{G}}{\partial \theta_h} = \exp(\theta_h)\,\mathbf{G}_h$$
Example
In the example Finger demo, we have two fixed models, the Muscle and the natural statistics model. One question that arises in the paper is whether the real observed structure is better fit by a linear combination of the natural statistics and the muscle activity structure. So we can define a third model, which allows any arbitrary mixture of the two types.
MC = pcm.ComponentModel('muscle+nat', [modelM[0], modelM[1]])
Feature models
A representational model can be also formulated in terms of the features that are thought to be encoded in the voxels.
Features are hypothetical tuning functions, i.e. models of what the activation profiles of single neurons could look like.
Examples of features would be Gabor elements for lower-level vision models, elements with cosine tuning functions for
different movement directions for models of motor areas, and semantic features for association areas. The actual activity
profiles of each voxel are a weighted combination of the feature matrix, u_p = Mw_p. The predicted second moment matrix of the activity profiles is then G = MM^T, assuming that all features are equally strongly and independently encoded, i.e. $E(\mathbf{w}_p\mathbf{w}_p^T) = \mathbf{I}$. A feature model can now be flexibly parametrized by expressing the feature matrix as a weighted sum of different feature sets:

$$\mathbf{M}(\boldsymbol{\theta}) = \sum_h \theta_h \mathbf{M}_h$$
Each parameter $\theta_h$ determines how strongly the corresponding set of features is represented across the population of voxels. Note that this parameter is different from the actual feature weights W. Under this model, the second moment matrix becomes
$$\mathbf{G} = \mathbf{U}\mathbf{U}^T/P = \sum_h \theta_h^2\, \mathbf{M}_h\mathbf{M}_h^T + \sum_i\sum_{j\neq i} \theta_i\theta_j\,\mathbf{M}_i\mathbf{M}_j^T.$$
From the last expression we can see that, if features that belong to different components are independent of each other, i.e. $\mathbf{M}_i\mathbf{M}_j^T = \mathbf{0}$, then a feature model is equivalent to a component model with $\mathbf{G}_h = \mathbf{M}_h\mathbf{M}_h^T$. The only technical difference is that we use the square of the parameter $\theta_h$, rather than its exponential, to enforce non-negativity. Thus, component models assume that the different features underlying each component are encoded independently in the population of voxels - i.e. knowing something about the tuning to a feature of component A does not tell you anything about the tuning to a feature of component B. If this cannot be assumed, then the representational model is better
formulated as a feature model.
By the product rule for matrix derivatives, we have
$$\frac{\partial \mathbf{G}}{\partial \theta_h} = \mathbf{M}_h\mathbf{M}(\boldsymbol{\theta})^T + \mathbf{M}(\boldsymbol{\theta})\,\mathbf{M}_h^T$$
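A small numpy sketch of these two equations (the feature sets here are random placeholders, zero-padded to a common width so that M(θ) = Σ_h θ_h M_h is defined):

import numpy as np

rng = np.random.default_rng(0)
M1 = np.zeros((6, 9)); M1[:, :3] = rng.standard_normal((6, 3))   # feature set 1 (placeholder)
M2 = np.zeros((6, 9)); M2[:, 3:] = rng.standard_normal((6, 6))   # feature set 2 (placeholder)
M_sets = [M1, M2]
theta = np.array([0.5, 1.2])

M = sum(t * Mh for t, Mh in zip(theta, M_sets))    # M(theta) = sum_h theta_h * M_h
G = M @ M.T                                        # predicted second moment matrix
dG = [Mh @ M.T + M @ Mh.T for Mh in M_sets]        # dG/dtheta_h, matching the derivative above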
Correlation model
The correlation model class is designed to model the correlation between specific sets of activity patterns. This problem often occurs in neuroimaging studies: for example, we may have 5 actions that are measured under two conditions (for example observation and execution), and we want to know to what degree the activity patterns of observing an action relate to the patterns measured when executing the same action.
Fixed correlation models: We can use a series of models that test the likelihood of the data under a range of fixed correlations between -1 and 1. This approach allows us to determine how much evidence we have for one specific correlation over another. Even though the correlation is fixed for these models, the variance structure within each of the conditions is flexibly estimated. This is done using a component model within each condition:
$$\mathbf{G}^{(1)} = \sum_h \exp(\theta^{(1)}_h)\,\mathbf{G}_h$$

$$\mathbf{G}^{(2)} = \sum_h \exp(\theta^{(2)}_h)\,\mathbf{G}_h$$
Usually, $\mathbf{G}_h$ is either the identity matrix (all items are equally strongly represented) or a matrix that allows individual scaling of the variances for each item. Of course, you can also model any between-item covariance. The overall model is nonlinear, as the two components interact in the part of the G matrix that indicates the covariance between the patterns of the two conditions (C). Given a constant correlation r, the overall second moment matrix is calculated as:
$$\mathbf{G} = \begin{bmatrix} \mathbf{G}^{(1)} & r\mathbf{C} \\ r\mathbf{C}^T & \mathbf{G}^{(2)} \end{bmatrix}$$

$$\mathbf{C}_{i,j} = \sqrt{\mathbf{G}^{(1)}_{i,j}\,\mathbf{G}^{(2)}_{i,j}}$$
If the parameter within_cov is set to True, the model will also add within-condition covariances, which are not part of the between-condition covariance C. The correlation that we are modelling is then the correlation between the pattern components related to the individual items, after the pattern component related to the overall condition (i.e. observe vs. execute) has been removed.
The derivative of that part of the matrix with respect to the parameters $\theta^{(1)}_h$ then becomes

$$\frac{\partial\, r\mathbf{C}_{i,j}}{\partial \theta^{(1)}_h} = \frac{r}{2\,\mathbf{C}_{i,j}}\,\mathbf{G}^{(2)}_{i,j}\,\frac{\partial \mathbf{G}^{(1)}_{i,j}}{\partial \theta^{(1)}_h}$$
These derivatives are automatically calculated in the predict function. From the log-likelihoods for each model, we can
then obtain an approximation for the posterior distribution.
Flexible correlation model: We can also use a flexible correlation model, which has an additional model parameter for the correlation. To avoid bounds on the correlation, this parameter is the Fisher-z transform of the correlation, which can take any value in $(-\infty, \infty)$:
$$\theta = \frac{1}{2}\log\left(\frac{1+r}{1-r}\right)$$

$$r = \frac{\exp(2\theta)-1}{\exp(2\theta)+1}$$
Example
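A sketch of how the two variants might be set up (the CorrelationModel class is part of the toolbox, but the argument names shown here are assumptions - see the correlation demo and the API reference for the exact signature):

import PcmPy as pcm

# Fixed-correlation models: one model per candidate correlation value (hypothetical arguments)
M_fixed = [pcm.model.CorrelationModel(f'r={r:0.1f}', num_items=5, corr=r, cond_effect=True)
           for r in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]]

# Flexible correlation model: the correlation is a free (Fisher-z transformed) parameter
M_flex = pcm.model.CorrelationModel('flex', num_items=5, corr=None, cond_effect=True)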
Free models
The most flexible representational model is the free model, in which the predicted second moment matrix is unconstrained. Thus, when we estimate this model, we simply derive the maximum-likelihood estimate of the second-moment matrix. This model is mainly useful if we want to obtain an estimate of the maximum likelihood that could be achieved with a fully flexible model, i.e. the noise ceiling (Nili et al., 2014).
In estimating an unconstrained G, it is important to ensure that the estimate will still be a positive definite matrix. For
this purpose, we express the second moment as the square of an upper-triangular matrix, G = AA𝑇 (Diedrichsen et
al., 2011; Cai et al., 2016). The parameters are then simply all the upper-triangular entries of A.
Example
M5 = pcm.model.FreeModel('ceil',n_cond)
If the number of conditions is very large, the crossvalidated estimation of the noise-ceiling model can get rather slow. For a quick and approximate noise ceiling, you can also use an unbiased estimate of the second moment matrix from pcm.util.est_G_crossval to determine the parameters - basically the starting values of the complete model. This will lead to slightly lower noise ceilings compared to the full optimization, but with large improvements in speed.
Custom model
In some cases, the hypotheses cannot be expressed by a model of the type mentioned above. Therefore, the PCM
toolbox allows the user to define their own custom model. In general, the predicted second moment matrix is a non-
linear (matrix valued) function of some parameters, G = 𝐹 (𝜃). One example is a representational model in which the
width of the tuning curve (or the width of the population receptive field) is a free parameter. Such parameters would
influence the features, and hence also the second-moment matrix in a non-linear way. Computationally, such non-linear
models are not much more difficult to estimate than component or feature models, assuming that one can analytically
derive the matrix derivatives 𝜕G/𝜕𝜃ℎ .
To define a custom model, the user needs to define a new Model class, inherited from the abstract class pcm.model.
Model. The main thing is to define the predict function, which takes the parameters as input and returns G and the partial derivatives of G with respect to each of these parameters. The derivatives are returned as an (H x K x K) tensor, where H is the number of parameters.
from PcmPy.model import Model  # the abstract base class (pcm.model.Model)

class CustomModel(Model):
    # Constructor of the class
    def __init__(self, name, ...):
        Model.__init__(self, name)
        ...
    # Prediction function
    def predict(self, theta):
        G = ....          # Calculate the second moment matrix
        dG_dTheta = ....  # Calculate the derivatives of the second moment matrix
        return (G, dG_dTheta)
    # Initialization function
    def set_theta0(self, G_hat):
        """
        Sets theta0 based on the crossvalidated second-moment
        Parameters:
            G_hat (numpy.ndarray)
                Crossvalidated estimate of G
        """
        # The function can use G_hat to get good starting values,
        # or just start at fixed values
        self.theta0 = ....
Note that the predict function is repeatedly called by the optimization routine and needs to execute fast. That is, any
computation that does not depend on the current value of 𝜃 should be performed outside the function and stored in the
object.
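As a concrete toy illustration, here is a sketch of a custom model with a single width parameter, in which G decays exponentially with a fixed distance between conditions (the import path and the n_param/theta0 attributes follow the template above; check the API reference for the exact base-class interface):

import numpy as np
from PcmPy.model import Model  # assumed import path for the abstract base class

class ExponentialDecayModel(Model):
    """Toy nonlinear model: G_ij = exp(-w * d_ij) with width w = exp(theta)."""
    def __init__(self, name, distance):
        Model.__init__(self, name)
        self.distance = np.asarray(distance)  # fixed distances, computed once (independent of theta)
        self.n_param = 1
        self.theta0 = np.zeros(1)

    def predict(self, theta):
        w = np.exp(theta[0])                  # exponential keeps the width positive
        G = np.exp(-w * self.distance)
        dG_dTheta = np.zeros((1,) + G.shape)
        dG_dTheta[0] = -w * self.distance * G  # chain rule: dG/dtheta = -w * d * G
        return (G, dG_dTheta)

# example usage with |i - j| distances between 5 conditions (hypothetical)
M_custom = ExponentialDecayModel('decay', np.abs(np.subtract.outer(np.arange(5), np.arange(5))))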
Noise Models

Unlike RSA and encoding models, PCM also requires an explicit model of the noise. In general, noise is assumed to come from a multivariate normal distribution with covariance matrix Sσ². We further assume that the noise is independent across imaging runs (or partitions), making S a block-diagonal matrix. But what do we assume about the within-run covariance?
Independent Noise: If the data comes from regression estimates from a first-level model, and if the design of the
experiment is balanced, then it is usually also permissible to make the assumption that the noise is independent within
each imaging run, S = I. The raw regression coefficients from a single imaging run, however, are positively correlated with each other. So one solution is to remove the block effect by using fixed_effect='block' during fitting.
Block Effect Plus Independent Noise: We can also estimate the amount of within-block correlation from the data,
rather than remove it. This is especially important for models where the contrast of condition against rest is important.
The BlockPlusIndepNoise model has two parameters - one for the shared within block covariance, one for the variance
for each item. Do not use this if you removed the block effect as a fixed effect.
Custom Model: Assuming equal correlations of the activation estimates within a run is only a rough approximation to the real covariance structure. A better estimate can be obtained by using an estimate derived from the design matrix and the estimated temporal autocorrelation of the raw signal. As pointed out recently (Cai et al., 2016), the particular design can have a substantial influence on the estimation of the second moment matrix. This is especially evident in cases where the trial sequence is not random, but has an invariant structure (where trials of one condition tend to follow trials of another specific condition). The accuracy of our approximation hinges critically on the quality of our estimate of the temporal auto-covariance structure of the true noise. Note that it has recently been demonstrated that, especially for high sampling rates, a simple autoregressive model of the noise is insufficient. A specific noise covariance structure can be specified by passing the corresponding noise model to the fitting routines.
2.4 Model Fitting

Models can be fit either to individual or to group data. For group fits, some or all of the model parameters are shared across the group, while the noise and scale parameters are still fitted individually to each subject. To compare models of different complexity, we have implemented two types of crossvalidation: either within individuals across partitions, or across individuals (Fig. 3).
Fig. 3: Model crossvalidation schemes. (A) Within-subject crossvalidation where the model is fit on N-1 partitions
and then evaluated on the left-out partition N. (B) Group crossvalidation where the model is fit to N-1 subjects and
then evaluated on a left-out subject. For group crossvalidation, individual scaling and noise parameters are fit to each
subject to allow for different signal-to-noise levels.
Models can be fitted to each data set individually, using the function fit_model_individ. Individual fitting makes sense for models with a single component (fixed models), which can be evaluated without crossvalidation.

The output theta is a list of np-arrays, which contain the M.n_param model parameters, as well as the log-scale and log-noise parameters, for each data set.

The output can be used to compare the likelihoods between different models. Alternatively, you can inspect the individual fits by looking at the parameters (theta). The predicted second moment matrix for any model can be obtained by calling the model's predict function with the fitted parameters.
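For example (a sketch; the exact shape and indexing of theta is illustrative):

T_ind, theta = pcm.inference.fit_model_individ(Y, M)      # fit every model in M to every data set in Y
G_pred, _ = M[m].predict(theta[m][:M[m].n_param, s])       # predicted G of model m for data set s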
Crossvalidation within subject is the standard for encoding models and can also be applied to PCM-models.
The function fit_model_group fits a model to a group of subjects. By default, all parameters that change the G matrix, that is theta[0:M.n_param], are shared across all subjects. To account for the individual signal-to-noise level, by default separate signal-strength and noise parameters are fitted for each subject. For each individual subject, the predicted covariance matrix of the data is:

$$\mathbf{V}_i = \exp(\theta_{s,i})\,\mathbf{Z}_i\mathbf{G}\mathbf{Z}_i^T + \exp(\theta_{\epsilon,i})\,\mathbf{S}_i$$

where $\exp(\theta_{s,i})$ is the subject-specific scale (signal-strength) parameter and $\exp(\theta_{\epsilon,i})$ the subject-specific noise parameter.
To control in detail which parameters are fit commonly to the group and which ones are fit individually, one can set the boolean vector M[m].common_param, indicating which parameters are fit to the entire group. The output theta for each model now contains a single vector of the common model parameters, followed by the data-set-specific parameters: possibly the non-common model parameters, and then the scale and noise parameters.
PCM also allows between-subject crossvalidation (Fig. 3B). The common model parameters that determine the representational structure are fitted to all but one of the subjects together, using separate noise and scale parameters for each subject. Then the model is evaluated on the left-out subject, after maximizing its scale and noise parameters (and possibly the non-common model parameters). The function fit_model_group_crossval implements these steps.
The demo demo_finger.ipynb provides a full example of how to use group crossvalidation to compare different models. Three models are being tested: a muscle model, a usage model (both fixed models), and a combination model, in which muscle and usage components can be combined with arbitrary weights. We also fit the noise-ceiling model and a null-
model. Because the combination model has one more parameter than each single model, crossvalidation is necessary
for inferential tests. Note that for the simple models, the simple group fit and the cross-validated group fit are identical,
as in both cases only a scale and noise parameter are optimized for each subject.
# Fit the models to the full group, using an individual scaling parameter for each subject
T_gr, theta = pcm.inference.fit_model_group(Y, M, fit_scale=True)
# Make a plot, using the group fit as the upper and the crossvalidated fit as the lower noise ceiling
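# The crossvalidated group fit and the plotting call might look like this (pcm.vis.model_plot
# and its argument names are assumptions based on the demo notebooks; check demo_finger.ipynb
# for the exact call):
T_cv, theta_cv = pcm.inference.fit_model_group_crossval(Y, M, fit_scale=True)
ax = pcm.vis.model_plot(T_cv.likelihood,
                        null_model='null',
                        noise_ceiling='ceil',
                        upper_ceiling=T_gr.likelihood['ceil'])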
Under the hood, the main work in PCM is accomplished by the routines likelihood_individ and likelihood_group (see Inference), which return the negative log-likelihood of the data under the model, as well as the first (and optionally the second) derivative. This enables PCM to use standard optimization routines, such as scipy.optimize.minimize. For many models, a Newton-Raphson algorithm, implemented in pcm.optimize.newton, provides a fast and stable solution. A custom algorithm for a model can be chosen by setting M.fit to either a string with an algorithm name that is implemented in PCM, or a function that returns the fitted parameters (TO BE IMPLEMENTED).
2.5 Visualisation
One important way to visualize both the data and the model prediction is to plot the second moment matrix as a
colormap, for example using the matplotlib command plt.imshow. The predicted second moment matrix for a fitted
model can be obtained using my_model.predict(theta). For the data, we can get a cross-validated estimate using the function pcm.util.est_G_crossval(). Note that if you removed the block effect as a fixed effect during fitting, then you also need to remove it from the data to have a fair comparison.
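A sketch (Y, cond_vec, and part_vec stand for the data array and its condition/partition labels; the argument order of est_G_crossval and its second return value are assumptions - see the API reference):

import matplotlib.pyplot as plt
import PcmPy as pcm

G_hat, _ = pcm.util.est_G_crossval(Y, cond_vec, part_vec)   # crossvalidated estimate from the data
G_pred, _ = my_model.predict(theta)                          # model prediction for the fitted theta
plt.subplot(1, 2, 1); plt.imshow(G_hat); plt.title('data')
plt.subplot(1, 2, 2); plt.imshow(G_pred); plt.title('model')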
Note also that you can transform a second moment matrix into a representational dissimilarity matrix (RDM) using the following equivalence (see Diedrichsen & Kriegeskorte, 2016):

$$d_{i,j} = \mathbf{G}_{i,i} + \mathbf{G}_{j,j} - 2\mathbf{G}_{i,j}$$

The only difference is that the RDM does not contain information about the baseline.
Another important way of visualizing the second moment matrix is multi-dimensional scaling (MDS), an important technique in representational similarity analysis. When we look at the second moment of a population code, the natural way of performing this is classical multidimensional scaling. This technique plots the different conditions in a space defined by the first few eigenvectors of the second moment matrix, where each eigenvector is weighted by $\sqrt{\lambda}$.
Importantly, MDS provides only one of the many possible 2- or 3-dimensional views of the high-dimensional representational structure. That means that one should never make inferences from this reduced view alone. It is recommended to look at as many different views of the representational structure as possible to obtain an unbiased impression. In a high-dimensional space, you can almost always find one view that shows exactly what you want to show. There are a number of
different statistical visualisation techniques that can be useful here, including the ‘Grand tour’ which provides a movie
that randomly moves through different high-dimensional rotations.
Classical multidimensional scaling from the matlab version still needs to be implemented in Python.
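Until this is available, a minimal numpy sketch of classical MDS on an estimated second moment matrix G_hat:

import numpy as np

lam, V = np.linalg.eigh(G_hat)                     # eigendecomposition of the symmetric matrix
idx = np.argsort(lam)[::-1]                        # sort eigenvalues in descending order
lam, V = lam[idx], V[:, idx]
W = V[:, :3] * np.sqrt(np.maximum(lam[:3], 0))     # conditions on the first 3 axes, weighted by sqrt(lambda)
# each row of W contains the coordinates of one condition; plot e.g. W[:, 0] against W[:, 1]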
Another approach to visualize model results is to plot the model evidence (i.e. marginal likelihoods). The marginal
likelihoods are returned from the modeling routines in arbitrary units, and are thus better understood after normalizing
to a null model at the very least. The lower normalization bound can be a null model, and the upper bound is often a noise ceiling. This technique simply plots the scaled likelihoods for each model fit.
See Application examples for a practical example for this.
2.6 Inference
First, we may make inferences based on the parameters of a single fitted model. The parameter may be the weight of a specific component or another metric derived from the second moment matrix. For example, the estimated correlation coefficient between conditions 1 and 2 would be $r_{1,2} = \mathbf{G}_{1,2}/\sqrt{\mathbf{G}_{1,1}\mathbf{G}_{2,2}}$. We may want to test whether the correlation between the patterns is larger than zero, or whether a parameter differs between two different subject groups or two different regions, or whether it changes with experimental treatments.
The simplest way of testing parameters would be to use the point estimates from the model fit from each subject and
apply frequentist statistics to test different hypotheses, for example using a t- or F-test. Alternatively, one can obtain
estimates of the posterior distribution of the parameters using MCMC sampling [@RN3567] or Laplace approximation
[@RN3255]. This allows the application of Bayesian inference, such as the report of credibility intervals.
One important limitation to keep in mind is that parameter estimates from PCM are not unbiased in small samples. This is because estimates of G are constrained to be positive definite. This means that the variance of each
feature must be larger or equal to zero. Thus, if we want to determine whether a single activity pattern is different
from baseline activity, we cannot simply test our variance estimate (i.e. elements of G) against zero - they trivially
will always be larger, even if the true variance is zero. Similarly, another important statistic that measures the pattern
separability or classifiability of two activity patterns, is the Euclidean distance, which can be calculated from the second
moment matrix as 𝑑 = G1,1 + G2,2 − 2G1,2 . Again, given that our estimate of G is positive definite, any distance
estimate is constrained to be positive. To determine whether two activity patterns are reliably different, we cannot simply test these distances against zero, as they will trivially be larger than zero. A better solution for inferences
from individual parameter estimates is therefore to use a cross-validated estimate of the second moment matrix and the
associated distances [@RN3565][@RN3543]. In this case the expected value of the distances will be zero, if the true
value is zero. As a consequence, variance and distance estimates can become negative. These techniques, however,
take us out of the domain of PCM and into the domain of representational similarity analysis [@RN2697][@RN3672].
As an alternative to parameter-based inference, we can fit multiple models and compare them according to their model
evidence; the likelihood of the data given the models (integrated over all parameters). In encoding models, the weights
W are directly fitted to the data, and hence it is important to use cross-validation to compare models with different
numbers of features. The marginal likelihood already integrates over all possible values of U, and hence W, thereby removing the bulk of the free parameters. Thus, in practice the marginal likelihood will already be close to the true model
evidence.
Our marginal likelihood, however, still depends on the free parameters 𝜃. So, when comparing models, we need to
still account for the risk of overfitting the model to the data. For fixed models, there are only two free parameters: one
relating to the strength of the noise (𝜃𝜖 ) and one relating to the strength of the signal (𝜃𝑠 ). This compares very favorably
to the vast number of free parameters one would have in an encoding model, which is the size of W, the number of
features x number of voxels. However, even the fewer model parameters still need to be accounted for. We consider
here four ways of doing so.
The first option is to use empirical Bayes or Type-II maximum likelihood. This means that we simply replace the unknown
parameters with the point estimates that maximize the marginal likelihood. This is in general a feasible strategy if the
number of free parameters is low and all models have the same numbers of free parameters, which is for example the
case when we are comparing different fixed models. The two free parameters here determine the signal-to-noise ratio.
For models with different numbers of parameters, we can penalize the likelihood by $\frac{1}{2}d_\theta\log(n)$, yielding the Bayes information criterion (BIC) as the approximation to the model evidence.
As an alternative option, we can use cross-validation within the individual (Fig. 3A) to prevent overfitting for more complex flexible models, as is also currently common practice for encoding models [@RN3096]. Taking
one imaging run of the data as test set, we can fit the parameters to data from the remaining runs. We then evaluate
the likelihood of the left-out run under the distribution specified by the estimated parameters. By using each imag-
ing run as a test set in turn, and adding the log-likelihoods (assuming independence across runs), we thus can obtain
an approximation to the model evidence. Note, however, that for a single (fixed) encoding model, cross-validation is
not necessary under PCM, as the activation parameters for each voxel (W or U) are integrated out in the likelihood.
Therefore, it can be handled with the first option we described above.
For the third option, if we want to test the hypothesis that the representational structure in the same region is similar
across subjects, we can perform cross-validation across participants (Fig. 3B). We can estimate the
parameters that determine the representational structure using the data from all but one participant and then evaluate the
likelihood of data from the left-out subject under this distribution. When performing cross-validation within individuals, a flexible model can fit the representational structure of individual subjects in different ways, making the results
hard to interpret. When using the group cross-validation strategy, the model can only fit a structure that is common
across participants. Different from encoding models, representational models can be generalized across participants, as
we do not fit the actual activity patterns, but rather the representational structure. In a sense, this method is performing
“hyper alignment” [@RN3572] without explicitly calculating the exact mapping into voxel space. When using this
approach, we still allow each participant to have its own signal and noise parameters, because the signal-to-noise ratio
is idiosyncratic to each participant’s data. When evaluating the likelihood of left-out data under the estimated model
parameters, we therefore plug in the ML-estimates for these two parameters for each subject.
Finally, a last option is to implement a full Bayesian approach and to impose priors on all parameters, and then use
a Laplace approximation to estimate the model evidence [@RN3654][@RN3255]. While it certainly can be argued
that this is the most elegant approach, we find that cross-validation at the level of model parameters provides us with a
practical, straightforward, and transparent way of achieving a good approximation.
Each of the inference strategies supplies us with an estimate of the model evidence. To compare models, we then
calculate the log Bayes factor, which is the difference between the log model evidences.
Log Bayes factors of over 1 are usually considered positive evidence and above 3 strong evidence for one model over
the other [@RN3654].
How to perform group inference in the context of Bayesian model comparison is a topic of ongoing debate in the
context of neuroimaging. A simple approach is to assume that the data of each subject is independent (a very reasonable
assumption) and that the true model is the same for each subject (a maybe less reasonable assumption). This motivates
the use of the log Group Bayes Factor (GBF), which is the simple sum of the individual log Bayes factors across all subjects $n$:

$$\log GBF = \sum_n \log B_n.$$
Performing inference on the GBF is basically equivalent to a fixed-effects analysis in neuroimaging, in which we combine all time series across subjects into a single data set, assuming they all were generated by the same underlying model. A large GBF could therefore potentially be driven by one or a few outliers. We believe that the GBF therefore
does not provide a desirable way of inferring on representational models - even though it has been widely used in the
comparison of DCM models [@RN2029].
At least the distribution of individual log Bayes factors should be reported for each model. When evaluating model
evidences against a Bayesian criterion, it can be useful to use the average log Bayes factor, rather than the sum. This
stricter criterion is independent of sample size, and therefore provides a useful estimate of effect size. It expresses how much better the favored model is expected to perform on a new, unseen subject. We can also use the individual
log Bayes factors as independent observations that are then submitted to a frequentist test, using either a t-, F-, or
nonparametric test. This provides a simple, practical approach that we will use in our examples here. Note, however,
that in the context of group cross-validation, the log-Bayes factors across participants are not strictly independent.
Finally, it is also possible to build a full Bayesian model on the group level, assuming that the winning model is different
for each subject and comes from a multinomial distribution with unknown parameters [@RN3653].
Showing that a model provides a better explanation of the data as compared to a simpler Null-model is an important
step. Equally important, however, is to determine how much of the data the model does not explain. Noise ceilings [@RN3300] provide us with an estimate of how much systematic structure (either within or across participants) is present in the data, and what proportion is truly random. In the context of PCM, this can be achieved by fitting a
fully flexible model, i.e. a free model in which the second moment matrix can take any form. The non-cross-validated
fit of this model provides an absolute upper bound - no simpler model will achieve a higher average likelihood. As
this estimate is clearly inflated (as it does not account for the parameter fit) we can also evaluate the free model using
cross-validation. Importantly, we need to employ the same cross-validation strategy (within or between subjects) as
used with the models of interest. If the free model performs better than our model of interest even when cross-validated,
then we know that there are definitely aspects of the representational structure that the model did not capture. If the
free model performs worse, it is overfitting the data, and our currently best model provides a more concise description
of the data. In this sense, the performance of the free model in the cross-validated setting provides a lower bound to the
noise ceiling. It still may be the case that there is a better model that will beat the currently best model, but at least the
current model already provides an adequate description of the data. Because they are so useful, noise ceilings should
become a standard reporting requirement when fitting representational models to fMRI data, as they are in other fields
of neuroscientific inquiry already. The Null-model and the upper noise ceiling also allow us to normalize the log model
evidence to be between 0 (Null-model) and 1 (noise ceiling), effectively obtaining a Pseudo-𝑅2 .
Often, we have multiple non-exclusive explanations for the observed activity patterns, and would like to know which
model components (or combinations of model components) are required to explain the data. For example, for sequence
representations, we may consider as model components the representation of single fingers, finger transitions, or whole
sequences (see Yokoi et al., 2019). To assess the importance of each of the components, we could fit each component separately and test how much the marginal likelihood increases relative to the Null-model (knock-in). We can also fit
the full model containing all components and then assess how much the marginal likelihood decreases when we leave
a single model component out (knock-out). The most comprehensive approach, however, is to fit all combinations of
components separately (Shen and Ma, 2017).
To do this, we can construct a model family containing all possible combination models by switching the individual
components either on or off. If we have $k$ components of interest, we will end up with $2^k$ models.
After fitting all possible model combinations, one could simply select the model combination with the highest marginal
likelihood. The problem, however, is that often there are a number of combinations, which all achieve a relatively
high likelihood - such that the winning model changes from data set to data set. Because the inference on individual
components can depend very strongly on the winning model, this approach is inherently unstable.
To address this issue we can use Bayesian model averaging: we compute the posterior probability of each model component, averaged across all possible model combinations (Clyde 1999). In the context of a model family, we can calculate the
posterior probability of a model component being present ($F = 1$) from the summed posteriors of all models that contained that component ($M: F = 1$):

$$p(F=1|data) = \frac{\sum_{M:F=1} p(data|M)\,p(M)}{\sum_{M} p(data|M)\,p(M)}$$
Finally, we can also obtain a Bayes factor as a measure of the evidence that the component is present:

$$BF_{F=1} = \frac{\sum_{M:F=1} p(data|M)}{\sum_{M:F=0} p(data|M)}$$
See the Component inference and model families example to see how to construct and fit a model family, and how to
then make inference on the individual model components.
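In the toolbox, these quantities are computed by the ModelFamily methods that are also used in that example, given a model family MF and the fitted likelihoods in a data frame T:

# Posterior probability and Bayes factor for each component of a model family
cposterior = MF.component_posterior(T.likelihood, method='AIC', format='DataFrame')
c_bf = MF.component_bayesfactor(T.likelihood, method='AIC', format='DataFrame')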
2.7 Regularized regression

PCM can be used to tune the regularization parameter for ridge regression. Specifically, ridge regression is a special case of the PCM model

$$\mathbf{y}_i = \mathbf{Z}\mathbf{u}_i + \mathbf{X}\boldsymbol{\beta}_i + \boldsymbol{\epsilon}_i$$

where $\mathbf{u} \sim N(\mathbf{0}, \mathbf{I}s)$ are the vectors of random effects, and $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_{\epsilon})$ the measurement error.
$\boldsymbol{\beta}$ are the fixed effects; in standard ridge regression this is just the intercept, in which case $\mathbf{X}$ is a vector of 1s.
The more general implementation allows arbitrary fixed effects, which may also be correlated with the random effects.
Assuming that the intercept is already removed, the random effect estimates are:

$$\hat{\mathbf{u}} = \left(\mathbf{Z}^T\mathbf{Z} + \mathbf{I}\lambda\right)^{-1}\mathbf{Z}^T\mathbf{y}_i$$

$$\lambda = \frac{\sigma^2_{\epsilon}}{s} = \frac{\exp(\theta_{\epsilon})}{\exp(\theta_s)}$$

This makes the random effects estimates in PCM identical to ridge regression with an optimal regularization coefficient $\lambda$.
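As a quick numerical check of this equivalence (a standalone sketch, assuming the intercept has already been removed), the closed-form estimate above can be compared against scikit-learn's Ridge with alpha set to $\lambda$:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N, Q = 50, 8
Z = rng.standard_normal((N, Q))
y = rng.standard_normal(N)
lam = 2.5                                                      # lambda = sigma_eps^2 / s

u_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(Q), Z.T @ y)    # closed-form estimate
ridge = Ridge(alpha=lam, fit_intercept=False).fit(Z, y)
print(np.allclose(u_hat, ridge.coef_))                         # True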
The PCM regularized regression model is designed to work with multivariate data, i.e. many variables $\mathbf{y}_1, \dots, \mathbf{y}_P$ that all share the same generative model ($\mathbf{X}$, $\mathbf{Z}$), but have different random and fixed effects. Of course, the model also works for univariate regression with only a single data vector.
Most importantly, the PCM regression model allows you to estimate different ridge coefficients for different columns of the design matrix. In general, we can set the covariance matrix of $\mathbf{u}$ to

$$\mathbf{G} = \begin{bmatrix} \exp(\theta_1) & & & \\ & \exp(\theta_1) & & \\ & & \ddots & \\ & & & \exp(\theta_Q) \end{bmatrix}$$
where Q groups of effects share the same variance (and therefore the same ridge coefficient). In the extreme, every column in the design matrix would have its own regularization parameter to be estimated. The use of Restricted Maximum Likelihood (ReML) makes the estimation of such more complex regularisation schemes both stable and computationally feasible.
See the Jupyter notebook demos/demo_regression.ipynb for a working example, which also shows a direct comparison to ridge regression. In this notebook, we generate an example with N = 100 observations, P = 10 variables, and Q = 10 regressors:
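The notebook's exact data-generation code is not reproduced here; a hypothetical stand-in with the stated dimensions could look like this:

import numpy as np
import PcmPy as pcm

rng = np.random.default_rng(1)
N, P, Q = 100, 10, 10
Z = rng.standard_normal((N, Q))               # random-effects design matrix
U = rng.standard_normal((Q, P)) * 0.5         # true random-effect weights
Y = Z @ U + rng.standard_normal((N, P))       # data = signal + i.i.d. noise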
Given these data, we can now define a ridge regression model in which all regressors share the same ridge coefficient. comp is an index vector indicating, for each column of Z, which regularization group it belongs to.
# Vector indicates that all columns are scaled by the same parameter
comp = np.array([0,0,0,0,0,0,0,0,0,0])
# Make the model
M1 = pcm.regression.RidgeDiag(comp, fit_intercept = True)
# Estimate optimal regularization parameters from training data
M1.optimize_regularization(Z,Y)
After estimation, the two theta parameters (for signal and noise) can be retrieved from M1.theta_. The regularization parameter for ridge regression is then exp(M1.theta_[1])/exp(M1.theta_[0]).
The model can then be fitted to the training (or other) data to determine the coefficients.
The random effect coefficients are stored in M1.coefs_ and the fixed effects in M1.beta_s.
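The fitting call itself is not shown above; following the RidgeDiag API in the API reference below, it is simply:

M1.fit(Z, Y)            # estimate coefficients with the optimized regularization
u_hat = M1.coefs_       # random-effect (ridge) coefficients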
Finally, we can predict the data for the independent test set and evaluate this prediction.
Yp = M1.predict(Zt)
R2 = 1- np.sum((Yt-Yp)**2)/np.sum((Yt)**2)
Finally, if we want to estimate the importance of different groups of columns, we can define different ridge coefficients
for different groups of columns:
comp = np.array([0,0,1,1,1,1,1,2,2,2])
M2 = pcm.regression.RidgeDiag(comp, fit_intercept = True)
In this example, the first 2, the next 5, and the last 3 columns each share one ridge coefficient. The call to M2.optimize_regularization(Z,Y) causes 4 theta parameters (3 signal-variance groups plus the noise) and hence 3 regularization coefficients to be estimated.
If the importance of different columns of the design matrix is truly different, this will provide better predictions.
2.8 Application examples

Here are some full application examples, using Jupyter notebooks. The notebooks and the underlying data can be found at https://github.com/DiedrichsenLab/PcmPy/tree/master/demos.
Example of a fit of fixed and component PCM models to data from M1 (primary motor cortex). The models and data are taken from Ejaz et al. (2015), Hand usage predicts the structure of representations in sensorimotor cortex, Nature Neuroscience.
We will fit the following 5 models:
• null: G = np.eye; all finger patterns are equally far away from each other. Note that in many situations the no-information null model, G = np.zeros, may be more appropriate
• Muscle: Fixed model with G = covariance of muscle activities
• Natural: Fixed model with G = covariance of natural movements
• Muscle+nat: Combination model of muscle and natural covariance
• Noiseceil: Noise ceiling model
Read in the activity Data (Data), condition vector (cond_vec), partition vector (part_vec), and model matrices for
Muscle and Natural stats Models (M):
[2]: f = open('data_demo_finger7T.p','rb')
Data,cond_vec,part_vec,modelM = pickle.load(f)
f.close()
Now we build a list of datasets (one per subject) from the data and condition vectors:
[3]: Y = list()
for i in range(len(Data)):
obs_des = {'cond_vec': cond_vec[i],
'part_vec': part_vec[i]}
Y.append(pcm.dataset.Dataset(Data[i],obs_descriptors = obs_des))
Before fitting the models, it is very useful to first visualize the different data sets to see if there are outliers. One powerful way is to compute a cross-validated estimate of the second moment matrix. This matrix is just another way of representing a cross-validated representational dissimilarity matrix (RDM).
[16]: # Estimate and plot the second moment matrices across all data sets
N=len(Y)
G_hat = np.zeros((N,5,5))
for i in range(N):
G_hat[i,:,:],_ = pcm.est_G_crossval(Y[i].measurements,
Y[i].obs_descriptors['cond_vec'],
Y[i].obs_descriptors['part_vec'],
X=pcm.matrix.indicator(Y[i].obs_descriptors['part_vec']))
Nice - up to a scaling factor (subject 4 has an especially high signal-to-noise ratio), all seven subjects have a very similar structure of the representation of fingers in M1.
If you are more used to looking at representational dissimilarity matrices (RDMs), you can also transform the second moment matrix into an RDM (see Diedrichsen & Kriegeskorte, 2017).
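A minimal sketch of that transformation, using the standard relationship $d_{ij} = g_{ii} + g_{jj} - 2g_{ij}$ between a second moment matrix and squared Euclidean distances (the toolbox may also provide a helper for this):

# Convert the (subject-averaged) second moment matrix into an RDM
G = G_hat.mean(axis=0)
d = np.diag(G)
RDM = d[:, None] + d[None, :] - 2 * G
plt.imshow(RDM)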
Now we build a list of models, using a list of second moment matrices (a sketch of such a model list follows below).
Let's also look at two of the underlying second moment matrices - these are pretty similar.
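The model-building code is not reproduced in this excerpt. A hypothetical reconstruction, using the model classes from the API reference below (the layout of modelM and the exact FreeModel signature are assumptions), could be:

M = []
M.append(pcm.FixedModel('null', np.eye(5)))                    # equal distances between fingers
M.append(pcm.FixedModel('muscle', modelM[0]))                  # muscle covariance
M.append(pcm.FixedModel('usage', modelM[1]))                   # natural-usage covariance
M.append(pcm.ComponentModel('muscle+usage', [modelM[0], modelM[1]]))
M.append(pcm.FreeModel('ceil', 5))                             # noise-ceiling model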
Model fitting
Now let's fit the models to the individual data sets. There are three ways to do this. We can fit the models
• to each individual participant, with its own parameters 𝜃
• to all participants together, with shared group parameters, but with an individual parameter for the signal strength (scale) and the noise of each participant
• in a cross-participant cross-validated fashion: the models are fit to N-1 subjects and evaluated on the Nth subject.
[22]: # Fit the models to the full group, using an individual scaling parameter for each subject
T_gr, theta_gr = pcm.fit_model_group(Y, M, fit_scale=True)
Fitting model 0
Fitting model 1
Fitting model 2
Fitting model 3
Fitting model 4
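The individual fits (T_in) and cross-validated group fits (T_cv) used below are produced by calls like the following sketch; the name of the cross-validated group fitting function follows the Matlab toolbox and is an assumption:

# Individual fits: every subject gets its own parameters
T_in, theta_in = pcm.fit_model_individ(Y, M, fit_scale=True)

# Cross-validated group fit: fit to N-1 subjects, evaluate on the left-out subject
T_cv, theta_cv = pcm.fit_model_group_crossval(Y, M, fit_scale=True)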
The results are returned as a nested data frame with the likelihood, noise, and scale parameters for each individual:
[24]: T_in
[24]: variable likelihood \
model null muscle usage muscle+usage
0 -42231.412711 -41966.470799 -41786.672956 -41786.672927
1 -34965.171104 -34923.791342 -34915.406608 -34914.959612
2 -34767.538097 -34679.107626 -34632.643241 -34632.642946
3 -45697.970627 -45609.052395 -45448.518276 -45448.518254
4 -31993.363827 -31866.288313 -31806.982719 -31806.982521
5 -41817.234010 -41632.061473 -41543.438786 -41543.438769
6 -50336.142592 -50201.799362 -50173.300358 -50173.300306
variable noise \
model ceil null muscle usage muscle+usage ceil
0 -41689.860467 0.875853 0.871286 0.868482 0.868483 0.872297
1 -34889.042762 1.070401 1.067480 1.069075 1.068119 1.066987
2 -34571.750931 1.026408 1.021219 1.019122 1.019123 1.023299
3 -45225.784824 1.480699 1.479592 1.474026 1.474025 1.478701
4 -31707.184233 0.808482 0.805621 0.805774 0.805774 0.807319
5 -41439.111953 1.035696 1.031827 1.031649 1.031648 1.034879
6 -50099.140706 1.479001 1.472401 1.474430 1.474428 1.476145
variable
model usage muscle+usage ceil
0 0.786771 1.000000 0.996839
1 0.322917 0.963006 0.998003
2 0.463987 1.000000 0.992453
3 1.235628 1.000000 0.998176
4 0.532421 1.000000 0.999360
5 0.828773 1.000000 0.999325
6 0.723969 1.000000 0.981997
The likelihoods are very negative and quite different across participants, which is expected (see documentation). What we need to interpret are the differences in likelihood relative to a null model. We can visualize these using model_plot:
[25]: ax = pcm.model_plot(T_in.likelihood,
null_model = 'null',
noise_ceiling= 'ceil')
The problem with the noise ceiling is that it is individually fit to each subject. It has many more parameters than the models it is competing against, so it is overfitting. To compare models with different numbers of parameters directly, we need to look at our cross-validated group fits. The (non-cross-validated) group fit can then be used as an upper noise ceiling.
[26]: ax = pcm.model_plot(T_cv.likelihood,
null_model = 'null',
noise_ceiling= 'ceil',
upper_ceiling = T_gr.likelihood['ceil'])
As you can see, the likelihoods of the individual, group, and cross-validated group fits for the fixed models (null, muscle, usage) are all identical, because these models do not have common group parameters - in all cases we are only fitting an individual scale and noise parameter.
Finally, it is very useful to visualize the model predictions in comparison to the fitted data. The model parameters are stored in the return argument theta. We can pass these to the Model.predict() function to get the predicted second moment matrix.
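For example, the predicted second moment matrix of one of the group-fitted models might be obtained as follows; this is a sketch, and the way the model parameters are sliced out of theta_gr is an assumption:

# Predicted G of the muscle+usage model, using the group-fit parameters
G_pred, _ = M[3].predict(theta_gr[3][:M[3].n_param])
plt.imshow(G_pred)
plt.title('muscle+usage model prediction')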
This demo shows two ways to use PCM models to test hypotheses about the correlation between activity patterns.
• In the first part of this jupyter notebook, we’ll focus on how to assess the true correlation between two activity
patterns.
• In the second part, we will consider a slightly more complex situation in which we want to estimate the true
correlation between two sets of activity patterns measured under two different conditions.
For example, we might want to know how the activity patterns related to the observation of 3 hand gestures
correlate (at a gesture-specific level) with the activity patterns related to the execution of the same 3 hand
gestures.
How similar/correlated are two activity patterns? It is easy to test whether 2 activity patterns are more correlated than
chance (i.e., zero correlation). However, even if the two conditions elicit exactly the same pattern, the correlation will
not be 1, simply because of measurement noise. Thus, it is very hard to estimate the true correlation between condition
A and B. As explained in our blog Brain, Data, and Science, cross-validation does not result in unbiased estimates.
To solve this problem, PCM turns the problem around: rather than asking which correlation is the best estimate given
the data, let’s instead ask how likely the data is given different levels of correlations. Thus, we will calculate the
likelihood of the data given a range of specific correlations, 𝑝(𝑌 |𝑟), and then compare this likelihood across a range of
correlation models.
In estimating the likelihood, we also need to estimate two additional model parameters:
• The strength (variance across voxels) of the activity patterns associated with condition A.
• The strength (variance across voxels) of the activity patterns associated with condition B.
And one additional noise parameter:
• The variance of the measurement noise across repeated measures of the same pattern.
These hyper-parameters are collectively referred to as $\theta$ (thetas). Here we will compare different models by using the (Type II) maximum likelihood to approximate the model evidence:

$$p(Y|r) \approx \max_{\theta} p(Y|r, \theta)$$
This may seem like a bit of a cheat, but it works quite well in this case. Estimating the parameters on the same data that you use to evaluate the model of course leads to an overestimation of the likelihood.
However, as the number of hyper-parameters is low and all correlation models have the same number of parameters, this bias will be approximately constant across models. Since we are interested in the difference in log-likelihood across models, this small bias simply cancels out.
If you want to compare models with different numbers of parameters, a more sophisticated approach (such as group cross-validation) is required.
We will use the PCM toolbox to simulate data from a given model of the underlying true (noiseless) correlation between
two activity patterns (Mtrue). In this example, we set that our two activity patterns are positively correlated with a true
correlation of 0.7.
Note that in pcm.CorrelationModel we set num_items to 1, as we have only 1 activity pattern per
condition, and we set cond_effect to False, as we do not want to model the overall effect between
different conditions (each with multiple items, see section 2 below).
The true model (Mtrue) has 2 hyper-parameters reflecting the signal strength (or true pattern variance) for each activity
pattern (item). In addition to the 2 model parameters, PCM also fits one parameter for the variance of the measurement
noise.
We can now use the simulation module to create 20 datasets (e.g., one per simulated participant) with a relatively low signal-to-noise level (0.2:1). We will use a design with 2 conditions and 8 partitions/runs per dataset. Note that the thetas are specified as log(variance).
[33]: # Create the design. In this case it's 2 conditions, across 8 runs (partitions)
cond_vec, part_vec = pcm.sim.make_design(n_cond=2, n_part=8)
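The simulation call itself is not shown in this excerpt; a minimal sketch of it, using the design above (the theta values and the number of channels are illustrative assumptions), could be:

# Simulate 20 datasets from the true model (signal-to-noise 0.2:1)
D = pcm.sim.make_dataset(Mtrue, np.array([0.0, 0.0]),
                         signal=0.2,
                         n_sim=20,
                         n_channel=50,
                         cond_vec=cond_vec,
                         part_vec=part_vec)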
First let’s look at the correlation that we get when we calculate the naive correlation—i.e. the correlation between the
two estimated activity patterns.
r = np.empty((20,))
for i in range(20):
data = D[i].measurements
r[i] = get_corr(data, cond_vec)
print(f'Estimated mean correlation: {r.mean():.4f}')
Estimated mean correlation: 0.4127
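The helper get_corr is defined earlier in the notebook and not shown in this excerpt. A minimal sketch of what it computes - the correlation between the two mean condition patterns - might look like this (assuming conditions are coded 0 and 1 in cond_vec):

def get_corr(data, cond_vec):
    # Average the activity patterns of each condition across runs,
    # then correlate the two mean patterns across channels
    pattern_a = data[cond_vec == 0].mean(axis=0)
    pattern_b = data[cond_vec == 1].mean(axis=0)
    return np.corrcoef(pattern_a, pattern_b)[0, 1]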
As we can see, due to measurement noise, the estimated mean correlation is much lower than the true value of 0.7.
This is not a problem if we just want to test the hypothesis that the true correlation is larger than zero. Then we can just
calculate the individual correlations per subject and test them against zero using a t-test.
However, if we want to test whether the true correlation has a specific value (for example true_corr=1, indicating
that the activity patterns are the same), or if we want to test whether the correlations are higher in one brain area than
another, then this becomes an issue.
Different brain regions measured with fMRI often differ dramatically in their signal-to-noise ratio. Thus, we need to
take into account our level of measurement noise. PCM can do that.
We can solve this problem by making a series of PCM correlation models in the range, e.g., [0, 1] (or -1 to 1 if you want).
We also generate a flexible model (Mflex) that has the correlation as a free parameter that is optimized.
We can now fit the models to the datasets in one go. The resulting dataframe T has the log-likelihoods for each model
(columns) and dataset (rows). The second return argument theta contains the parameters for each model fit.
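A sketch of how such a set of models and the fit might be constructed; the name of the correlation argument of pcm.CorrelationModel is an assumption, and the grid of 20 correlation values matches the columns shown below:

nsteps = 20
M = [pcm.CorrelationModel(f'{r:0.2f}', num_items=1, corr=r, cond_effect=False)
     for r in np.linspace(0, 1, nsteps)]
Mflex = pcm.CorrelationModel('flex', num_items=1, corr=None, cond_effect=False)
M.append(Mflex)

# Fit all 21 models to the 20 simulated datasets
T, theta = pcm.fit_model_individ(D, M, verbose=False)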
variable ... \
model 0.26 0.32 0.37 0.42 0.47 ...
0 -234.517117 -234.526138 -234.569756 -234.650865 -234.773248 ...
1 -246.045157 -245.504529 -244.985186 -244.486599 -244.008726 ...
2 -277.221540 -277.086717 -276.968453 -276.867193 -276.783574 ...
3 -260.192870 -259.812921 -259.455499 -259.120782 -258.809456 ...
4 -252.122629 -251.797543 -251.497447 -251.222995 -250.975441 ...
variable iterations
model 0.58 0.63 0.68 0.74 0.79 0.84 0.89 0.95 1.00 flex
0 4.0 4.0 5.0 5.0 5.0 6.0 7.0 8.0 9.0 4.0
1 4.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 6.0
2 3.0 3.0 3.0 3.0 3.0 3.0 4.0 4.0 4.0 5.0
3 3.0 3.0 3.0 3.0 3.0 2.0 3.0 3.0 4.0 5.0
4 3.0 3.0 3.0 3.0 3.0 3.0 4.0 4.0 4.0 5.0
[5 rows x 63 columns]
Note that the log-likelihood values are negative and differ substantially across datasets. This is normal - the only thing we can interpret is the difference between log-likelihoods for different models on the same data set.
Therefore, we first remove the mean log-likelihood for each dataset across correlation models, expressing each log-likelihood as the difference from that mean.
Next, we plot the full log-likelihood curves (solid lines) and the maximum likelihood estimate (filled circles) of the
correlation for each participant. We can also add the mean log-likelihood curve (dotted line) and the mean of the
maximum log-likelihood estimates (vertical blue line) across participants.
[37]: L = T.likelihood.to_numpy()
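A sketch of the mean-centering and plotting step, assuming the flexible model occupies the last column of L:

n_fixed = L.shape[1] - 1                              # number of fixed-correlation models
r_grid = np.linspace(0, 1, n_fixed)
L_fixed = L[:, :n_fixed]
L_centered = L_fixed - L_fixed.mean(axis=1, keepdims=True)

plt.plot(r_grid, L_centered.T, linewidth=0.8)         # individual curves
plt.plot(r_grid, L_centered.mean(axis=0), 'k:')       # mean curve across participants
r_max = r_grid[np.argmax(L_centered, axis=1)]         # ML estimate per dataset
plt.scatter(r_max, L_centered.max(axis=1))            # filled circles
plt.axvline(r_max.mean(), color='b')                  # mean of the ML estimates
plt.xlabel('correlation')
plt.ylabel('log-likelihood (relative to mean)')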
As we can see, the maximum-likelihood estimates (filled circles) behave quite well, as they are at least around the true
correlation value (0.7).
However, the mean of the maximum-likelihood estimates (vertical line) is unfortunately not unbiased, but slightly biased towards zero (see the Brain, Data, and Science blog). Therefore, the best way to use the log-likelihoods is to do a paired-samples t-test between the log-likelihoods for two correlation values.
For example, 0.7 vs. 1, or 0.7 vs. 0:
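A minimal sketch of such a test with scipy (the column indices are derived from the correlation grid assumed above):

from scipy import stats

n_fixed = L.shape[1] - 1
r_grid = np.linspace(0, 1, n_fixed)
i_07 = np.argmin(np.abs(r_grid - 0.7))                    # model closest to r = 0.7

t0, p0 = stats.ttest_rel(L[:, i_07], L[:, 0])             # 0.7 vs. 0
t1, p1 = stats.ttest_rel(L[:, i_07], L[:, n_fixed - 1])   # 0.7 vs. 1
print(f'0.7 vs 0: t = {t0:.2f}, p = {p0:.3g}')
print(f'0.7 vs 1: t = {t1:.2f}, p = {p1:.3g}')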
Thus, we have clear evidence that the correlation of 0.68 is more likely than a correlation of zero (i.e., that the patterns
are unrelated), and more likely than a correlation of one (i.e., that the patterns are identical).
2. Testing for specific correlations between activity patterns across two conditions
In the second part of this notebook, we will simulate data from a hypothetical experiment, in which participants ob-
served 3 hand gestures or executed the same 3 hand gestures. Thus, we have 3 items (i.e., the hand gestures) in each of
2 conditions (i.e., either observe or execute).
We are interested in the average correlation between the patterns associated with observing and executing action A,
observing and executing action B, and observing and executing action C, while accounting for overall differences in
the average patterns of observing and executing. To solve this problem, we again calculate the likelihood of the data
given a range of specific correlations 𝑝(𝑌 |𝑟).
In this case we have a few more hyper-parameters to estimate:
• The variance of the movement-specific activity patterns associated with observing actions.
• The variance of the movement-specific activity patterns associated with executing actions.
These hyper-parameters express the strength of encoding and are directly related to the average inter-item distance in an RSA analysis.
• The variance of the pattern component that is common to all observed actions.
• The variance of the pattern component that is common to all executed actions.
• Finally, we again have a hyper-parameter for the noise variance.
First, we create our true model (Mtrue): one where all actions are equally strongly encoded in each condition, but where the strength of encoding can differ between conditions (i.e., between observation and execution).
For example, we could expect the difference between actions to be smaller during observation than during execution (simply due to overall levels of brain activation).
Next, we also model the covariance between items within each condition with a condition effect (i.e., by setting cond_effect to True). Finally, we set the ground-truth correlation to be 0.7.
These four parameters are concerned with the condition effect and item effect for observation and execution, respectively. Visualizing the components of the second moment matrix (also known as the variance-covariance, or simply covariance matrix) helps to understand this:
[40]: H = Mtrue.n_param
for i in range(H):
plt.subplot(1, H, i+1)
plt.imshow(Mtrue.Gc[i,:,:])
The first two components plotted above reflect the condition effect and model the covariance between
items within each condition (observation, execution). The second two components reflect the item effect
and model the item-specific variance for each item (3 hand gestures) in each condition.
To simulate a dataset, we need to simulate an experimental design. Let's assume we measure the 6 trial types (3 items
x 2 conditions) in 8 imaging runs and submit the beta-values from each run to the model as Y.
We then generate a dataset where there is a strong overall effect for both observation (exp(0)) and execution (exp(1)).
In comparison, the item-specific effects for observation (exp(-1.5)) and execution (exp(-1)) are pretty weak (this is a
rather typical finding).
Note that all hyper parameters are log(variances)—this helps us to keep variances positive and the math
easy.
[41]: # Create the design. In this case it's 8 runs, 6 trial types
cond_vec, part_vec = pcm.sim.make_design(n_cond=6, n_part=8)
#print(cond_vec)
#print(part_vec)
As a quick check, let’s plot the predicted second moment matrix of our true model (using the simulation parameters)
and the crossvalidated estimate from the first dataset.
plt.subplot(1,2,2)
plt.imshow(G_hat)
plt.title('dataset')
[42]: Text(0.5,1,'dataset')
Now we fit these datasets with a range of models, each assuming a correlation value between 0 and 1. The other parameters will still be included, as they were for the true model.
For comparison, we also include a flexible correlation model (Mflex), which has an additional free parameter that models the correlation.
We can now fit the model to the datasets in one go. The resulting dataframe T has the log-likelihoods for each model
(columns) / dataset (rows). The second return argument theta contains the parameters for each model fit.
T.head()
[45]: variable likelihood \
model 0.00 0.01 0.02 0.03 0.04
0 -887.831487 -887.705964 -887.582884 -887.462226 -887.343969
1 -897.418265 -897.100121 -896.784548 -896.471492 -896.160898
2 -938.057184 -937.851286 -937.647757 -937.446563 -937.247669
3 -938.816255 -938.592624 -938.371687 -938.153403 -937.937735
4 -875.956085 -875.773279 -875.593541 -875.416842 -875.243151
variable ... \
model 0.05 0.06 0.07 0.08 0.09 ...
0 -887.228095 -887.114588 -887.003433 -886.894614 -886.788120 ...
1 -895.852715 -895.546892 -895.243381 -894.942133 -894.643103 ...
2 -937.051042 -936.856650 -936.664465 -936.474457 -936.286599 ...
3 -937.724648 -937.514106 -937.306077 -937.100530 -936.897437 ...
4 -875.072443 -874.904691 -874.739873 -874.577969 -874.418959 ...
variable iterations
model 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1.00 flex
0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 5.0
1 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 7.0
2 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
3 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
4 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 5.0
Again, note that the absolute values of the log-likelihoods don't mean much. Therefore, we again first remove the mean log-likelihood for each dataset across correlation models, expressing each log-likelihood as the difference from that mean.
Next, we plot the full log-likelihood curves (solid lines) and the maximum likelihood estimate (filled circles) of the
correlation for each participant. We can also add the mean log-likelihood curve (dotted line) and the mean of the
maximum log-likelihood estimates (vertical blue line) across participants.
[46]: L = T.likelihood.to_numpy()
Again, the best way to use the log-likelihoods is to do a paired-samples t-test between the log-likelihoods for two
correlation values: e.g., 0.7 vs 0.3:
Thus, we have clear evidence that the true correlation is much more likely to be 0.7 than 0.3.
Alternatively, we can transform the log-likelihoods into approximate posterior distributions and proceed with a full Bayesian group analysis. For more accurate results, you probably want to space your correlation models more tightly.
This notebook shows a series of examples of how to use model families and component inference with PCM.
This is a classical example of a fully crossed design with two factors (A and B) and the possibility of an interaction between those two factors. An experimental example would be that you measure the response to 6 stimuli: 3 different objects in two different colors. You want to test whether a) objects are represented, b) colors are represented, and c) the unique combination of color and object is represented. Thus, this is a MANOVA-like design where you want to test for the main effects of A and B, as well as their interaction.
Note: When building the features for the 3 different components, the interaction usually has to be orthogonalized with respect to the main effects. If the interaction is not orthogonalized, the interaction effect can explain part of the variance explained by the main effects. Try out what happens when you use the non-orthogonalized version of the interaction effect - you will see that a model family deals correctly with this situation as well!
Note that for data generation we consider the interaction fixed - meaning that when the interaction effect is present, no differences between the categories of A and B occur.
[2]: # Generate the three model components, each one as a fixed model
A = np.array([[1.0,0,0],[1,0,0],[0,1,0],[0,1,0],[0,0,1],[0,0,1]])
B = np.array([[1.0,0],[0,1],[1,0],[0,1],[1,0],[0,1]])
I = np.eye(6)
# Orthogonalize the interaction effect
X= np.c_[A,B]
Io = I-X @ np.linalg.pinv(X) @ I
# Now Build the second moment matrix and create the full model
# for data generation:
Gc = np.zeros((3,6,6))
Gc[0]=A@A.T
Gc[1]=B@B.T
Gc[2]=Io@Io.T
trueModel = pcm.ComponentModel('A+B+I',Gc)
Now we generate 20 data sets from the full model. The vector theta gives the log-variance of the A, B, and interaction components. Here the A effect is absent, and the interaction is stronger than the B effect. You can play around with the values to check out other combinations.
[3]: [cond_vec,part_vec]=pcm.sim.make_design(6,8)
theta = np.array([-np.inf,-1,0])
D = pcm.sim.make_dataset(trueModel,theta,
signal=0.1,
n_sim = 20,
n_channel=20,
part_vec=part_vec,
cond_vec=cond_vec)
Now we can fit the data with the entire model family. An intercept is added as a fixed effect for each partition (block) separately, as is common for fMRI data. The result is a likelihood for each of the model combinations.
For the inference, we can use either cross-validated pseudo-likelihoods (within-subject or between-subjects - see Inference), or we can use the fitted likelihood, correcting for the number of parameters using an AIC approach. We use the latter approach here.
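The fit itself is not shown above; it uses the same calls as the base-component example further below (here as a sketch):

# Build the model family from the three components and fit it to all datasets
MF = pcm.model.ModelFamily(Gc, comp_names=['A','B','I'])
T, theta = pcm.fit_model_individ(D, MF, fixed_effect='block',
                                 fit_scale=False, verbose=False)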
[6]: # We can compute the posterior probability of each of the mixture models.
# This uses a flat prior over all possible model combinations.
# The whole model family can be visualized as a model tree.
plt.figure(figsize=(5,7))
# Get the mean likelihood
mposterior = MF.model_posterior(T.likelihood.mean(axis=0),method='AIC',format='DataFrame')
pcm.vis.plot_tree(MF,mposterior,show_labels=True,show_edges=True)
# mposterior.to_numpy()
We can also get the posterior probability for each component. This is simply the sum of the posterior probabilities of all the model combinations that contain that component. The 0.5 line (the prior probability) is drawn. The lower line marks the most evidence we can get for the absence of a model component using AIC. This is because, in the worst case, a new component does not increase the likelihood at all; it would then have an AIC-corrected relative likelihood that is 1.0 lower than the simpler model (parameter penalty). Thus, the worst we can get is p = 1/(1+exp(1)) ≈ 0.27.
[7]:
# Component posterior
plt.figure(figsize=(4,3))
cposterior = MF.component_posterior(T.likelihood,method='AIC',format='DataFrame')
pcm.vis.plot_component(cposterior,type='posterior')
For frequentist statistical testing and display, it is also often useful to use the log-odds of the posterior, $log(p/(1-p))$. For a flat prior across the model family, this is the log Bayes factor for the specific component.
Base components
Often, there are components in a model family that we always want to include in our model - these can be specified as "base components". By default (i.e., if no base components are specified), the base model will be the zero model (there are no differences between any of the patterns). In this example, we add a strong pattern component that predicts that patterns 1-3 and patterns 4-6 are correlated.
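The construction of the base component is not shown in this excerpt; a hypothetical version with one covariance block for patterns 1-3 and one for patterns 4-6 is:

basecomp = np.zeros((1, 6, 6))
basecomp[0, :3, :3] = 1     # patterns 1-3 share a common component
basecomp[0, 3:, 3:] = 1     # patterns 4-6 share a common component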
plt.imshow(basecomp[0])
[9]: <matplotlib.image.AxesImage at 0x1296d46d0>
First we fit the data with a model family that does not include the base component. You will see that the base effect mimics the main effect A and gives us a false positive - component A is not really present.
[11]: MF=pcm.model.ModelFamily(Gc,comp_names=['A','B','I'])
# Fit the data and display the relative likelihood.
T,theta=pcm.fit_model_individ(D,MF,verbose=False,fixed_effect='block',fit_scale=False)
Now we are adding the base component to all the models - so all models fitted will contain this extra component.
[12]: MF=pcm.model.ModelFamily(Gc,comp_names=['A','B','I'],basecomponents=basecomp)
# Fit the data and display the relative likelihood.
T,theta=pcm.fit_model_individ(D,MF,verbose=False,fixed_effect='block',fit_scale=False)
Now the inference is correct again. So if you have a strong correlation structure in your data that needs to be modeled
(but that you do not want to draw inferences on), add it as a base component!
In this example, we use random components that partly overlap with each other. We also provide a function that performs the different forms of component inference.
# pcm.vis.model_plot(T.likelihood-MF.num_comp_per_m)
mposterior = MF.model_posterior(T.likelihood.mean(axis=0),method='AIC',format='DataFrame')
cposterior = MF.component_posterior(T.likelihood,method='AIC',format='DataFrame')
c_bf = MF.component_bayesfactor(T.likelihood,method='AIC',format='DataFrame')
fig=plt.figure(figsize=(18,3.5))
plt.subplot(1,3,1)
pcm.vis.plot_tree(MF,mposterior,show_labels=False,show_edges=True)
ax=plt.subplot(1,3,2)
pcm.vis.plot_component(cposterior,type='posterior')
ax=plt.subplot(1,3,3)
pcm.vis.plot_component(c_bf,type='bf')
[20]: # Let's check the co-linearity of these particular components by looking at their correlation
[cond_vec,part_vec]=pcm.sim.make_design(N,8)
D = pcm.sim.make_dataset(M,theta,
signal=0.2,
n_sim = 20,
n_channel=20,part_vec=part_vec,
cond_vec=cond_vec)
component_inference(D,MF)
2.9 Mathematical details

2.9.1 Likelihood
In this section, we derive the likelihood in the case that there are no fixed effects. In this case the distribution of the data would be

$$\mathbf{y} \sim N(\mathbf{0}, \mathbf{V})$$
$$\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^T + \mathbf{S}\sigma^2_{\epsilon}$$
To calculate the likelihood, let us consider the data at the level of a single voxel, namely $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, ..., \mathbf{y}_P]$. Then the likelihood over all voxels, assuming that the voxels are independent (e.g. effectively pre-whitened), is

$$p(\mathbf{Y}|\mathbf{V}) = \prod_{i=1}^{P} (2\pi)^{-\frac{N}{2}}\,|\mathbf{V}|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\mathbf{y}_i^T\mathbf{V}^{-1}\mathbf{y}_i\right)$$
When we take the logarithm of this expression, the product over the individual Gaussian probabilities becomes a sum and the exponential disappears:

$$L = \ln\left(p(\mathbf{Y}|\mathbf{V})\right) = \sum_{i=1}^{P}\ln p(\mathbf{y}_i)$$
$$= -\frac{NP}{2}\ln(2\pi) - \frac{P}{2}\ln(|\mathbf{V}|) - \frac{1}{2}\sum_{i=1}^{P}\mathbf{y}_i^T\mathbf{V}^{-1}\mathbf{y}_i$$
$$= -\frac{NP}{2}\ln(2\pi) - \frac{P}{2}\ln(|\mathbf{V}|) - \frac{1}{2}trace\left(\mathbf{Y}^T\mathbf{V}^{-1}\mathbf{Y}\right)$$
Using the trace trick, which allows $trace(\mathbf{ABC}) = trace(\mathbf{BCA})$, we can obtain a form of the likelihood that depends only on the second moment of the data, $\mathbf{Y}\mathbf{Y}^T$, as a sufficient statistic:

$$L = -\frac{NP}{2}\ln(2\pi) - \frac{P}{2}\ln(|\mathbf{V}|) - \frac{1}{2}trace\left(\mathbf{Y}\mathbf{Y}^T\mathbf{V}^{-1}\right)$$
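A minimal numpy sketch of this expression (not the toolbox implementation, which additionally handles fixed effects and returns derivatives):

import numpy as np

def mvn_loglik(YY, V, N, P):
    # L = -NP/2 ln(2*pi) - P/2 ln|V| - 1/2 trace(Y Y^T V^-1),
    # where YY = Y @ Y.T is the quadratic form of the (N x P) data matrix Y
    sign, logdet = np.linalg.slogdet(V)
    return (-N * P / 2 * np.log(2 * np.pi)
            - P / 2 * logdet
            - 0.5 * np.trace(np.linalg.solve(V, YY)))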
In the presence of fixed effects (usually effects of no interest), we have the problem that the estimation of these fixed effects depends iteratively on the current estimate of $\mathbf{V}$, and hence on the estimates of the second moment matrix and the noise covariance:

$$\hat{\mathbf{B}} = \left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{V}^{-1}\mathbf{Y}$$

Under the assumption of fixed effects, the distribution of the data is

$$\mathbf{y}_i \sim N(\mathbf{X}\mathbf{b}_i, \mathbf{V})$$
To compute the likelihood, we need to remove these fixed effects from the data, using the residual-forming matrix

$$\mathbf{R} = \mathbf{I} - \mathbf{X}\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{V}^{-1}$$
$$\mathbf{r}_i = \mathbf{R}\mathbf{y}_i$$

For the optimization of the random effects, we therefore also need to take into account the uncertainty in the fixed effects estimates. Together this leads to a modified likelihood - the restricted likelihood:

$$L_{ReML} = -\frac{NP}{2}\ln(2\pi) - \frac{P}{2}\ln(|\mathbf{V}|) - \frac{1}{2}trace\left(\mathbf{Y}\mathbf{Y}^T\mathbf{R}^T\mathbf{V}^{-1}\mathbf{R}\right) - \frac{P}{2}\ln|\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}|$$
Note that the third term can be simplified by noting that

$$\mathbf{R}^T\mathbf{V}^{-1}\mathbf{R} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{V}^{-1} = \mathbf{V}^{-1}\mathbf{R} = \mathbf{V}_R^{-1}$$
Next, we find the derivatives of $L$ with respect to each hyper-parameter $\theta_i$ that influences $\mathbf{G}$. We also need to estimate the hyper-parameters that describe the noise, at least the noise parameter $\sigma^2_{\epsilon}$. To take these derivatives, we need two general rules for derivatives of matrices (and of determinants of matrices):

$$\frac{\partial\ln|\mathbf{V}|}{\partial\theta_i} = trace\left(\mathbf{V}^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\right)$$

$$\frac{\partial\mathbf{V}^{-1}}{\partial\theta_i} = -\mathbf{V}^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\mathbf{V}^{-1}$$
Therefore the derivative of the (unrestricted) log-likelihood above with respect to each parameter is given by:

$$\frac{\partial L_{ML}}{\partial\theta_i} = -\frac{P}{2}trace\left(\mathbf{V}^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\right) + \frac{1}{2}trace\left(\mathbf{V}^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\mathbf{V}^{-1}\mathbf{Y}\mathbf{Y}^T\right)$$
First, let's tackle the last term of the restricted likelihood function:

$$l = -\frac{P}{2}\ln|\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}|$$

$$\frac{\partial l}{\partial\theta_i} = -\frac{P}{2}trace\left(\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T\frac{\partial\mathbf{V}^{-1}}{\partial\theta_i}\mathbf{X}\right)$$
$$= \frac{P}{2}trace\left(\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{V}^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\mathbf{V}^{-1}\mathbf{X}\right)$$
$$= \frac{P}{2}trace\left(\mathbf{V}^{-1}\mathbf{X}\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{V}^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\right)$$
Secondly, the derivative of the third term is

$$l = -\frac{1}{2}trace\left(\mathbf{V}_R^{-1}\mathbf{Y}\mathbf{Y}^T\right)$$

$$\frac{\partial l}{\partial\theta_i} = \frac{1}{2}trace\left(\mathbf{V}_R^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\mathbf{V}_R^{-1}\mathbf{Y}\mathbf{Y}^T\right)$$
The last step is not easily proven, except by diligently applying the product rule and seeing a lot of terms cancel. Putting these two results together with the derivative of the normal likelihood gives us:

$$\frac{\partial L_{ReML}}{\partial\theta_i} = -\frac{P}{2}trace\left(\mathbf{V}^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\right) + \frac{1}{2}trace\left(\mathbf{V}_R^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\mathbf{V}_R^{-1}\mathbf{Y}\mathbf{Y}^T\right) + \frac{P}{2}trace\left(\mathbf{V}^{-1}\mathbf{X}\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{V}^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\right)$$
$$= -\frac{P}{2}trace\left(\mathbf{V}_R^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\right) + \frac{1}{2}trace\left(\mathbf{V}_R^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\mathbf{V}_R^{-1}\mathbf{Y}\mathbf{Y}^T\right)$$
From the general term for the derivative of the log-likelihood, we can derive the specific expressions for each parameter. In general, we model the covariance matrix of the data $\mathbf{V}$ as:

$$\mathbf{V} = s\mathbf{Z}\mathbf{G}(\boldsymbol{\theta}_h)\mathbf{Z}^T + \mathbf{S}\sigma^2_{\epsilon}$$
$$s = \exp(\theta_s)$$
$$\sigma^2_{\epsilon} = \exp(\theta_{\epsilon})$$

where $\theta_s$ is the signal scaling parameter and $\theta_{\epsilon}$ the noise parameter. We use the exponential of the parameters to ensure that the noise variance and the scaling will always be strictly positive. When taking the derivatives, we use the simple rule $\partial\exp(x)/\partial x = \exp(x)$. Each model provides the partial derivatives of $\mathbf{G}$ with respect to its model parameters (see above). From these we can easily obtain the derivative of $\mathbf{V}$:

$$\frac{\partial\mathbf{V}}{\partial\theta_h} = \mathbf{Z}\frac{\partial\mathbf{G}(\boldsymbol{\theta}_h)}{\partial\theta_h}\mathbf{Z}^T\exp(\theta_s)$$

The derivative with respect to the noise parameter is

$$\frac{\partial\mathbf{V}}{\partial\theta_{\epsilon}} = \mathbf{S}\exp(\theta_{\epsilon})$$

and with respect to the signal scaling parameter

$$\frac{\partial\mathbf{V}}{\partial\theta_s} = \mathbf{Z}\mathbf{G}(\boldsymbol{\theta}_h)\mathbf{Z}^T\exp(\theta_s).$$
One way of optimizing the likelihood is simply to use the first derivative and perform a conjugate-gradient descent algorithm. For this, the routines pcm_likelihoodIndivid and pcm_likelihoodGroup return the negative log-likelihood, as well as a vector of the first derivatives of the negative log-likelihood with respect to the parameters. The implementation of conjugate-gradient descent we are using here is based on Carl Rasmussen's excellent function minimize.
An alternative to conjugate gradients, which can be considerably faster, are optimisation routines that exploit the matrix of second derivatives of the log-likelihood. The local curvature information is then used to "jump" to the suspected bottom of the bowl of the likelihood surface. The negative expected second derivative of the restricted log-likelihood, also called the Fisher information, can be calculated efficiently from terms that we needed to compute for the first derivative anyway:

$$\mathbf{F}_{i,j}(\boldsymbol{\theta}) = -E\left[\frac{\partial^2}{\partial\theta_i\partial\theta_j}L_{ReML}\right] = \frac{P}{2}trace\left(\mathbf{V}_R^{-1}\frac{\partial\mathbf{V}}{\partial\theta_i}\mathbf{V}_R^{-1}\frac{\partial\mathbf{V}}{\partial\theta_j}\right).$$
The update then uses a slightly regularised version of the second derivative to compute the next update of the parameters:

$$\boldsymbol{\theta}^{u+1} = \boldsymbol{\theta}^{u} - \left(\mathbf{F}(\boldsymbol{\theta}^u) + \mathbf{I}\lambda\right)^{-1}\frac{\partial L_{ReML}}{\partial\boldsymbol{\theta}^u}.$$

Because the update can become unstable, we regularise the Fisher information matrix by adding a small value to the diagonal, similar to a Levenberg regularisation for least-squares problems. If the likelihood increased, $\lambda$ is decreased; if the likelihood accidentally decreased, then we take a step backwards and increase $\lambda$. The algorithm is implemented in optimize.newton (pcm_NR in the Matlab toolbox).
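A minimal sketch of one such regularised update step (the toolbox's optimize.newton additionally adapts $\lambda$ between iterations and checks convergence); here dL and F are the gradient and the Fisher information of the negative log-likelihood, matching the loss-function convention of optimize.newton:

import numpy as np

def newton_step(theta, dL, F, lam=1e-4):
    # One regularised Newton-Raphson update on the parameter vector theta
    return theta - np.linalg.solve(F + lam * np.eye(F.shape[0]), dL)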
While the Newton-Raphson algorithm can be considerably faster for many problems, this is not always the case. Newton-Raphson usually arrives at the goal in many fewer steps than conjugate gradient descent, but on each step it has to calculate the matrix of second derivatives, which grows with the square of the number of parameters. So for highly parametrized models, the simple conjugate gradient algorithm is better. You can set the desired algorithm for each model by setting the field M.fitAlgorithm = 'NR' for Newton-Raphson or M.fitAlgorithm = 'minimize' for conjugate gradient descent. If no such field is given, the fitting function will call M = pcm_optimalAlgorithm(M) to obtain a guess of what will be the best algorithm for the problem. While this function provides a good heuristic strategy, it is recommended to try both and compare both the returned likelihood and the time. Small differences in the likelihood (< 0.1) are due to different stopping criteria and should be of no concern. Larger differences can indicate failed convergence.
When calculating the likelihood or its derivatives, the inverse of the variance-covariance matrix has to be computed. Because this can quickly become very costly (especially if original time-series data is to be fitted), we can exploit the special structure of $\mathbf{V}$ to speed up the computation:

$$\mathbf{V}^{-1} = \left(s\mathbf{Z}\mathbf{G}\mathbf{Z}^T + \mathbf{S}\sigma^2_{\varepsilon}\right)^{-1}$$
$$= \mathbf{S}^{-1}\sigma^{-2}_{\varepsilon} - \mathbf{S}^{-1}\mathbf{Z}\sigma^{-2}_{\varepsilon}\left(s^{-1}\mathbf{G}^{-1} + \mathbf{Z}^T\mathbf{S}^{-1}\mathbf{Z}\sigma^{-2}_{\varepsilon}\right)^{-1}\mathbf{Z}^T\mathbf{S}^{-1}\sigma^{-2}_{\varepsilon}$$
$$= \left(\mathbf{S}^{-1} - \mathbf{S}^{-1}\mathbf{Z}\left(s^{-1}\mathbf{G}^{-1}\sigma^2_{\varepsilon} + \mathbf{Z}^T\mathbf{S}^{-1}\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{S}^{-1}\right)/\sigma^2_{\varepsilon}$$

With pre-inversion of $\mathbf{S}$ (which can occur once outside of the iterations), we turn an $N \times N$ matrix inversion into a $K \times K$ matrix inversion.
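A quick numerical check of this identity (a standalone sketch; in the toolbox the same trick is used, for example, in regression.compute_iVr):

import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 5
Z = rng.standard_normal((N, K))
G = 0.5 * np.eye(K)          # random-effect covariance (illustrative)
s, sig2 = 1.3, 0.8           # signal scale and noise variance
S = np.eye(N)                # i.i.d. noise structure for simplicity

V = s * Z @ G @ Z.T + S * sig2
iV_direct = np.linalg.inv(V)                                        # direct N x N inversion

iS = np.linalg.inv(S)                                               # pre-inverted once
inner = np.linalg.inv(np.linalg.inv(G) / s * sig2 + Z.T @ iS @ Z)   # only a K x K inversion
iV_fast = (iS - iS @ Z @ inner @ Z.T @ iS) / sig2

print(np.allclose(iV_direct, iV_fast))                              # True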
2.10 API reference

Model Classes
class model.Model(name)
Abstract PCM Model Class
Parameters
name ([str]) – Name of the model
get_prior()
Returns prior mean and precision
predict(theta)
Prediction function: Needs to be implemented
class model.FixedModel(name, G)
Fixed PCM with a rigid predicted G matrix and no parameters
Parameters
• name (string) – name of the particular model for identification
• G (numpy.ndarray) – 2-dimensional array giving the predicted second moment
predict(theta=None)
Calculation of G
Returns
• G (np.ndarray) – 2-dimensional (K,K) array of predicted second moment
• dG_dTheta (None)
class model.NoiseModel
Abstract PCM Noise model class
class model.IndependentNoise
Simple independent noise model (i.i.d.); the only parameter is the noise variance
derivative(theta, n=0)
Returns the derivative of S with respect to its own parameters
Parameters
• theta ([np.array]) – Array-like of noise parameters
• n (int, optional) – Index of the parameter to get the derivative for. Defaults to 0.
Returns
d (np.array) – derivative of S with respect to theta
inverse(theta)
Returns S^{-1}
Parameters
theta ([np.array]) – Array-like of noise parameters
Returns
s (double) – Inverse of noise variance (scalar)
predict(theta)
Prediction function returns S - predicted noise covariance matrix
Parameters
theta ([np.array]) – Array-like of noise parameters
Returns
s (double) – Noise variance (for simplicity as a scalar)
set_theta0(Y , Z, X=None)
Makes an initial guess on the noise parameters
Parameters
• Y ([np.array]) – Data
• Z ([np.array]) – Random Effects matrix
• X ([np.array], optional) – Fixed effects matrix.
class model.BlockPlusIndepNoise(part_vec)
This noise model uses correlated noise per partition (block) plus independent noise per observation. For beta-values from an fMRI analysis, this is an adequate model.
Parameters
part_vec ([np.array]) – vector indicating the block membership for each observation
derivative(theta, n=0)
Returns the derivative of S with respect to its own parameters
Parameters
• theta (np.array) – Array-like of noise parameters
• n (int, optional) – Index of the parameter to get the derivative for. Defaults to 0.
Returns
d (np.array) – derivative of S with respect to theta
inverse(theta)
Returns S^{-1}
Parameters
theta (np.array) – Array-like of noise parameters
Returns
iS (np.array) – Inverse of noise covariance
predict(theta)
Prediction function returns S - predicted noise covariance matrix
Parameters
theta ([np.array]) – Array-like of noise parameters
Returns
s (np.array) – Noise covariance matrix
set_theta0(Y, Z, X=None)
Makes an initial guess on the noise parameters
Parameters
• Y ([np.array]) – Data
• Z ([np.array]) – Random Effects matrix
• X ([np.array], optional) – Fixed effects matrix.
Inference module for PCM toolbox with main functionality for model fitting and evaluation. @author: jdiedrichsen
inference.fit_model_group(Data, M, fixed_effect='block', fit_scale=False, scale_prior=1000.0,
noise_cov=None, algorithm=None, optim_param={}, theta0=None, verbose=True,
return_second_deriv=False)
Fits PCM models(s) to a group of subjects
The model parameters are (by default) shared across subjects. Scale and noise parameters are individual for each
subject. Some model parameters can also be made individual by setting M.common_param
Parameters
• Data (list of pcm.Datasets) – List data set has partition and condition descriptors
• M (pcm.Model or list of pcm.Models) – Models to be fitted on the data sets. Optional
field M.common_param indicates which model parameters are common to the group (True)
and which ones are fit individually (False)
• fixed_effect – None, 'block', or nd-array / list of nd-arrays. The default ('block') adds an intercept for each partition
• fit_scale (bool) – Fit an additional scale parameter for each subject? Default is set to False.
• scale_prior (float) – Prior variance for log-normal prior on the scale parameter
• algorithm (string) – Either 'newton' or 'minimize' - provides an overwrite for model-specific algorithms
• noise_cov – None (i.i.d), 'block', or optional specific covariance structure of the noise
• optim_param (dict) – Additional parameters to be passed to the optimizer
• theta0 (list of np.arrays) – List of starting values (same format as return argument theta)
Returns
• Z (np.array) – Design matrix for random effects
• X (np.array) – Design matrix for fixed effects
• YY (np.array) – Quadratic form of the data (Y Y’)
• Noise (pcm.model.NoiseModel) – Noise model
• G_hat (np.array) – Crossvalidated estimate of second moment of U
inference.set_up_fit_group(Data, fixed_effect='block', noise_cov=None)
Pre-calculates and sets design matrices, etc for the PCM fit for a full group
Parameters
• Data (list of pcm.dataset) – Contains activity data (measurement), and
obs_descriptors partition and condition
• fixed_effect – Can be None, ‘block’, or a design matrix. ‘block’ includes an intercept for
each partition.
• noise_cov – Can be None (i.i.d. noise), 'block' (a common noise parameter), or a list of
noise covariances for the different partitions
Returns
• Z (np.array) – Design matrix for random effects
• X (np.array) – Design matrix for fixed effects
• YY (np.array) – Quadratic form of the data (Y Y’)
• Noise (NoiseModel) – Noise model
• G_hat (np.array) – Crossvalidated estimate of second moment of U
Optimization module for PCM toolbox with main functionality for model fitting. @author: jdiedrichsen
optimize.best_algorithm(M, algorithm=None)
Parameters
• M (List of pcm.Model) –
• algorithm (string) – Overwrite for algorithm
optimize.mcmc(th0, likelihood_fcn, proposal_sd=0.1, burn_in=100, n_samples=1000, verbose=1)
Implement Markov Chain Monte Carlo sampling for PCM models Metropolis-Hastings algorithm with adaptive
proposal distribution
optimize.newton(theta0, lossfcn, max_iter=80, thres=0.0001, hess_reg=0.0001, regularization='sEig',
verbose=0, fit_param=None)
Minimize a loss function using Newton-Raphson with automatic regularization
Parameters
• theta (np.array) – Vector of parameter starting values
• lossfcn (fcn) – Handle to loss function that needs to return a) Loss (Negative log-
likelihood) b) First derivative of the Loss c) Expected second derivative of the loss
Regression module: contains a bare-bones version of the PCM toolbox that can be used to tune ridge/Tikhonov coefficients in the context of traditional regression models. No assumptions are made about independent data partitions.
class regression.RidgeDiag(components, theta0=None, fit_intercept=True,
noise_model=<PcmPy.model.IndependentNoise object>)
Class for Linear Regression with Tikhonov (L2) regularization. The regularization matrix for this class is diagonal, with groups of elements along the diagonal sharing the same regularisation factor.
Constructor
Parameters:
components (1d-array like)
Indicator to which column of design matrix belongs to which group
theta0 (1d np.array)
Vector of of starting values for optimization
fit_intercept (Boolean)
Should an intercept be added to the fixed effects (Default: True)
noise_model (pcm.model.NoiseModel)
Model specifying the full-rank noise effects
fit(Z, Y, X=None)
Estimates the regression parameters, given a specific regularization
Parameters
• Z (2d-np.array) – Design matrix for random effects NxQ
• Y (2d-np.array) – NxP Matrix of data
• X (np.array) – Fixed effects design matrix - will be accounted for by ReML
Returns
self – Model with fitted parameters
optimize_regularization(Z, Y, X=None, optim_param={}, like_fcn='auto')
Optimizes the hyper-parameters (regularisation) of the regression model
Parameters
• Z (2d-np.array) – Design matrix for random effects NxQ
• Y (2d-np.array) – NxP Matrix of data
• X (np.array) – Fixed effects design matrix - will be accounted for by ReML
• optim_param (dict) – Parameters for the optimization routine
Returns
self – Model with fitted parameters
predict(Z, X=None)
Predicts new data based on a fitted model
Parameters
• Z (2d-np.array) – Design matrix for random effects NxQ
• X (np.array) – Fixed effects design matrix - will be accounted for by ReML
Returns
self – Model with fitted parameters
regression.compute_iVr(Z, G, iS, X=None)
Fast inverse of V matrix using the matrix inversion lemma
Parameters
• Z (2d-np.array) – Design matrix for random effects NxQ
• G (1d or 2d-np.array) – Q x Q Matrix: variance of random effect
• iS (scalar or NxN matrix) – Inverse variance of noise matrix
• X (2d-np.array) – Design matrix for random effects
Returns
• iV (2d-np.array) – inv(Z*G*Z’ + S);
• iVr (2d-np.array) – iV - iV * X inv(X’ * iV *X) * X’ *iV
• ldet (scalar) – log(det(iV))
regression.likelihood_diagYTY_ZZT(theta, Z, Y , comp, X=None, Noise=<PcmPy.model.IndependentNoise
object>, return_deriv=0)
Negative Log-Likelihood of the data and derivative in respect to the parameters. This function is faster when
N>>P.
Parameters
• theta (np.array) – Vector of (log-)model parameters: These include model, signal and
noise parameters
• Z (2d-np.array) – Design matrix for random effects NxQ
• Y (2d-np.array) – NxP Matrix of data
• comp (1d-np.array or list) – Q-length: Indicates for each column of Z, which theta
will be used for the weighting
• X (np.array) – Fixed effects design matrix - will be accounted for by ReML
• Noise (pcm.NoiseModel) – PCM noise model to model block effects (default: Identity)
• return_deriv (int) – 0: Only return negative loglikelihood 1: Return first derivative 2:
Return first and second derivative (default)
Returns
• negloglike – Negative log-likelihood of the data under a model
• dLdtheta (1d-np.array) – First derivative of negloglike in respect to the parameters
• ddLdtheta2 (2d-np.array) – Second derivative of negloglike in respect to the parameters
2.11 References
• Cai, M.B., Schuck, N.W., Pillow, J., and Niv, Y. (2016). A Bayesian method for reducing bias in neural repre-
sentational similarity analysis. In Advances in Neural Information Processing Systems, pp. 4952–4960.
• Diedrichsen, J., Ridgway, G.R., Friston, K.J., and Wiestler, T. (2011). Comparing the similarity and spatial
structure of neural representations: a pattern-component model. Neuroimage 55, 1665–1678.
• Diedrichsen, J. (2019). Representational models and the feature fallacy. In The Cognitive Neurosciences, M.S. Gazzaniga, G.R. Mangun, and D. Poeppel, eds. (Cambridge, MA: MIT Press).
• Diedrichsen, J., and Kriegeskorte, N. (2017). Representational models: A common framework for understanding
encoding, pattern-component, and representational-similarity analysis. PLOS Comput. Biol. 13, e1005508.
• Diedrichsen, J., Yokoi, A., and Arbuckle, S.A. (2018). Pattern component modeling: A flexible approach for
understanding the representational structure of brain activity patterns. Neuroimage 180, 119–133.
• Ejaz, N., Hamada, M., and Diedrichsen, J. (2015). Hand use predicts the structure of representations in sensori-
motor cortex. Nat Neurosci 18, 1034–1040.
• Khaligh-Razavi, S.M., and Kriegeskorte, N. (2014). Deep supervised, but not unsupervised, models may explain
IT cortical representation. PLoS Comput Biol 10, e1003915.
• Kriegeskorte, N., and Diedrichsen, J. (2016). Inferring brain-computational mechanisms with models of activity
measurements. Philos. Trans. R. Soc. B Biol. Sci. 371.
• Kriegeskorte, N., and Diedrichsen, J. (2019). Peeling the Onion of Brain Representations. Annu. Rev. Neurosci.
42, 407–432.
• Kriegeskorte, N., Mur, M., and Bandettini, P. (2008). Representational similarity analysis - connecting the
branches of systems neuroscience. Front Syst Neurosci 2, 4.
• Kriegeskorte, N., Mur, M., Ruff, D.A., Kiani, R., Bodurka, J., Esteky, H., Tanaka, K., and Bandettini, P.A.
(2008). Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron 60,
1126–1141.
• Naselaris, T., Kay, K.N., Nishimoto, S., and Gallant, J.L. (2011). Encoding and decoding in fMRI. Neuroimage
56, 400–410.
• Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., and Kriegeskorte, N. (2014). A toolbox for
representational similarity analysis. PLoS Comput Biol 10, e1003553.
• Walther, A., Nili, H., Ejaz, N., Alink, A., Kriegeskorte, N., and Diedrichsen, J. (2016). Reliability of dissimilarity
measures for multi-voxel pattern analysis. Neuroimage 137, 188–200.
• Yokoi, A., and Diedrichsen, J. (2019). Neural Organization of Hierarchical Motor Sequence Representations in
the Human Neocortex. Neuron.
• Yokoi, A., Arbuckle, S.A., and Diedrichsen, J. (2018). The Role of Human Primary Motor Cortex in the Pro-
duction of Skilled Finger Sequences. J. Neurosci. 38, 1430–1442.