
Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation


Seyed-Mahdi Khaligh-Razavi*, Nikolaus Kriegeskorte*
Medical Research Council, Cognition and Brain Sciences Unit, Cambridge, United Kingdom

Abstract
Inferior temporal (IT) cortex in human and nonhuman primates serves visual object recognition. Computational object-
vision models, although continually improving, do not yet reach human performance. It is unclear to what extent the
internal representations of computational models can explain the IT representation. Here we investigate a wide range of
computational model representations (37 in total), testing their categorization performance and their ability to account for
the IT representational geometry. The models include well-known neuroscientific object-recognition models (e.g. HMAX,
VisNet) along with several models from computer vision (e.g. SIFT, GIST, self-similarity features, and a deep convolutional
neural network). We compared the representational dissimilarity matrices (RDMs) of the model representations with the
RDMs obtained from human IT (measured with fMRI) and monkey IT (measured with cell recording) for the same set of
stimuli (not used in training the models). Better performing models were more similar to IT in that they showed greater
clustering of representational patterns by category. In addition, better performing models also more strongly resembled IT
in terms of their within-category representational dissimilarities. Representational geometries were significantly correlated
between IT and many of the models. However, the categorical clustering observed in IT was largely unexplained by the
unsupervised models. The deep convolutional network, which was trained by supervision with over a million category-
labeled images, reached the highest categorization performance and also best explained IT, although it did not fully explain
the IT data. Combining the features of this model with appropriate weights and adding linear combinations that maximize
the margin between animate and inanimate objects and between faces and other objects yielded a representation that fully
explained our IT data. Overall, our results suggest that explaining IT requires computational features trained through
supervised learning to emphasize the behaviorally important categorical divisions prominently reflected in IT.

Citation: Khaligh-Razavi S-M, Kriegeskorte N (2014) Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation. PLoS Comput
Biol 10(11): e1003915. doi:10.1371/journal.pcbi.1003915
Editor: Jörn Diedrichsen, University College London, United Kingdom
Received March 26, 2014; Accepted September 11, 2014; Published November 6, 2014
Copyright: © 2014 Khaligh-Razavi, Kriegeskorte. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. The data has been used in previous studies,
including a recent PLOS Computational Biology paper (‘A Toolbox for Representational Similarity Analysis’ Nili et al. 2014), and is already available from here: http://
www.mrc-cbu.cam.ac.uk/methods-and-resources/toolboxes/.
Funding: This work was funded by Cambridge Overseas Trust and Yousef Jameel Scholarship to SMKR; and by the Medical Research Council of the UK
(programme MC-A060-5PR20) and a European Research Council Starting Grant (ERC-2010-StG 261352) to NK. The funders had no role in study design, data
collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* Email: [email protected] (SMKR); [email protected] (NK)

Author Summary

Computers cannot yet recognize objects as well as humans can. Computer vision might learn from biological vision. However, neuroscience has yet to explain how brains recognize objects and must draw from computer vision for initial computational models. To make progress with this chicken-and-egg problem, we compared 37 computational model representations to representations in biological brains. The more similar a model representation was to the high-level visual brain representation, the better the model performed at object categorization. Most models did not come close to explaining the brain representation, because they missed categorical distinctions between animates and inanimates and between faces and other objects, which are prominent in primate brains. A deep neural network model that was trained by supervision with over a million category-labeled images and represents the state of the art in computer vision came closest to explaining the brain representation. Our brains appear to impose upon the visual input certain categorical divisions that are important for successful behavior. Brains might learn these divisions through evolution and individual experience. Computer vision similarly requires learning with many labeled images so as to emphasize the right categorical divisions.

Introduction

Visual object recognition is thought to rely on a high-level representation in the inferior temporal (IT) cortex, which has been intensively studied in humans and monkeys [1–12]. Object images that are less distinct in the IT representation are perceived as more similar by humans [10] and are more frequently confused by humans [13] and monkeys [6]. IT cortex represents object images by response patterns that cluster according to conventional categories [6,7,9,14–16]. The strongest categorical division appears to be that between animates and inanimates. Within the animates, faces and bodies form separate sub-clusters [6,7,15].

Previous studies have compared the representational dissimilarity matrices (RDMs) of a small number of models (mainly low-level models) with human IT and some other brain areas [7,17–19]. One of the previously tested models was the HMAX model [20,21], which was designed as a model of IT, taking many of its architectural parameters from the neuroscience literature. The internal representation of one variant of the HMAX model failed to fully explain the IT representational geometry [7]. In particular, the HMAX model did not account for the category clustering observed in the IT representation.

This raises the question of whether any existing computational vision models, whether motivated by engineering or neuroscientific objectives, can more fully explain the IT representation and account for the IT category clustering. IT clearly represents visual shape. However, the degree to which categorical divisions and semantic dimensions are also represented is a matter of debate [22,23]. If visual features constructed without any knowledge of either category boundaries or semantic dimensions reproduced the categorical clusters, then we might think of IT as a purely visual representation. To the extent that knowledge of categorical boundaries or semantic dimensions is required to build an IT-like representation, IT is better conceptualized as a visuo-semantic representation.


Here we investigate a wide range of computational models [24] and assess their ability to account for the representational geometry of primate IT. Our study addresses the question of how well computational models from computer vision and neuroscience can explain the IT representational geometry. In particular, we investigated whether models not specifically optimized to distinguish categories can explain IT's categorical clusters and whether models trained using supervised learning with category labels better explain the IT representational geometry.

Evaluating a computational model requires a framework for relating brain representations and model representations. One approach is to directly predict the brain responses to a set of stimuli by means of the computational models. Because of its roots in the computational neuroscience of early visual areas, this approach is often referred to as receptive-field modeling. It has been successfully applied to cell recording, e.g. [25], and fMRI data, e.g. [26–28]. Here we attempt to test complex network models whose internal representations comprise many units (ranging from 99 to 2,904,000). The brain-activity data consist of hundreds of measured brain responses. In this scenario, the linear correspondence mapping between model units and brain responses is complex (a matrix of number of model units by number of brain responses). Estimating this linear map is statistically costly, requiring a combination of substantial additional data (for a separate set of stimuli) and prior assumptions (for regularizing the fit). Here we avoid these complications by testing the models in the framework of representational similarity analysis (RSA) [17,18,29,30], in which brain and model representations are compared at the level of the dissimilarity structure of the response patterns. The models, thus, predict the dissimilarities among the stimuli in the brain representation. This approach relies on the assumption that the measured responses preserve the geometry of the neuronal representational space. The representational geometry would be conserved to high precision if the measured responses sampled random dimensions of the neuronal representational space [31,32]. The RSA framework enables us to test any pre-trained model directly with data from a single stimulus set.
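For illustration, the RSA comparison described above can be sketched in a few lines: compute a correlation-distance RDM for each representation and correlate the lower triangles of two RDMs with Kendall's tau. The array names and toy dimensions below are assumptions for the example, not the actual data; with continuous dissimilarities there are no ties, so SciPy's tau-b coincides with the τA used throughout this paper.

```python
# A minimal sketch of the RSA comparison (illustrative; names and sizes are assumptions).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import kendalltau

def rdm(patterns):
    """Correlation-distance RDM: patterns is [n_stimuli x n_units]."""
    return squareform(pdist(patterns, metric="correlation"))

def compare_rdms(rdm_a, rdm_b):
    """Kendall tau between the lower triangles of two RDMs.
    With continuous dissimilarities (no ties), tau-b equals tau-a."""
    idx = np.triu_indices_from(rdm_a, k=1)
    tau, p = kendalltau(rdm_a[idx], rdm_b[idx])
    return tau, p

# toy example: 96 stimuli, hypothetical model features and brain response patterns
rng = np.random.default_rng(0)
model_features = rng.standard_normal((96, 4096))
brain_patterns = rng.standard_normal((96, 316))
tau, p = compare_rdms(rdm(model_features), rdm(brain_patterns))
print(f"Kendall tau = {tau:.3f}, p = {p:.3f}")
```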
We tested a total of 37 computational model representations. Some of the models mimic the structure of the ventral visual pathway (e.g. HMAX, VisNet, Stable model, SLF) [20,21,33–37]; others are more broadly biologically motivated (e.g. Biotransform, convolutional network) [38–41]; and the others are well-known computer-vision models (e.g. GIST, SIFT, PHOG, PHOW, self-similarity features, geometric blur) [42–48]. Some of the models use features constructed by engineers without training with natural images (e.g. GIST, SIFT, PHOG). Others were trained in an unsupervised fashion (e.g. HMAX and VisNet).

We also tested models that were supervised with category labels. Two of the models (GMAX and supervised HMAX) [35] were trained in a supervised fashion to distinguish animates from inanimates, using 884 training images. In addition, we tested a deep supervised convolutional neural network [41], trained by supervision with over a million category-labeled images from ImageNet [49].

We also attempted to recombine model features, so as to construct a representation resembling IT in both its categorical divisions and within-category representational geometry. We linearly recombined the features in two ways: (a) by reweighting features (thus stretching and squeezing the representational space along its original axes) and (b) by remixing the features, creating new features as linear combinations of the original features (thus performing general affine transformations). All unsupervised and supervised training and all reweighting and remixing were based on sets of images nonoverlapping with the image set used to assess how well models accounted for IT.

We analyzed brain responses in monkey IT (mIT; cell-recording data acquired by Kiani and colleagues [6]) and human IT (hIT; fMRI data from [7]) for a rich set of color images of isolated objects spanning multiple animate and inanimate categories. The human fMRI measurements covered the entire ventral stream, so we also tested the models on fMRI data from the foveal confluence of early visual cortex (EVC), the lateral occipital complex (LOC), the fusiform face area (FFA), and the parahippocampal place area (PPA).

Internal representations of the HMAX model (the C2 stage) and several computer-vision models performed well on EVC. Most of the models captured some component of the representational dissimilarity structure in IT and other visual regions. Several models clustered the human faces, which were mostly frontal and had a high amount of visual similarity. However, all the unsupervised models failed to cluster human and animal faces, which were very different in visual appearance, in a single face cluster, as seen for human and monkey IT. The unsupervised models also failed to replicate IT's clear animate/inanimate division. The deep supervised convolutional network better captured the categorical divisions, but did not fully replicate the categorical clustering observed in IT. We proceeded to remix the features of the deep supervised model to emphasize the major categorical divisions of IT using maximum-margin linear discriminants. In order to construct a representation resembling IT, we combined these discriminants with the different representational stages of the deep network, weighting each discriminant and layer of the deep network so as to best explain the IT representational geometry. The resulting IT-geometry model, when tested with crossvalidation to avoid overfitting to the image set, explains our IT data. Our results suggest that intensive supervised training with large sets of labeled images might be necessary to model the IT representational space.

Results

The results for the 37 model representations are presented separately for two sets of representations. The first set comprises the not-strongly-supervised representations (Figures 1–5).


Figure 1. Representational dissimilarity matrices for IT and for the seven best-fitting not-strongly-supervised models. The IT RDMs (black frames) for human (A) and monkey (B) and the seven most highly correlated model RDMs (excluding the representations in the strongly supervised deep convolutional network). The model RDMs are ordered from left to right and top to bottom by their correlation with the respective IT RDM. These are the seven most highly correlated RDMs among the 27 models that were not strongly supervised and their combination model (combi27). Biologically motivated models are in black, computer-vision models are in gray. The number below each RDM is the Kendall τA correlation coefficient between the model RDM and the respective IT RDM. All correlations are statistically significant. For statistical inference, see Figure 2. For model abbreviations and RDM-correlation p values, see Table 1. For other brain ROIs (i.e. LOC, PPA, FFA, EVC) see Figure S1 and Table 1. The RDMs here are 96×96, including the four stimuli we did not have monkey data for. The corresponding rows and columns are shown in blue in the mIT RDM and were ignored in the RDM comparisons.
doi:10.1371/journal.pcbi.1003915.g001

The second set comprises the layers of a strongly supervised deep convolutional network and an IT-like representation constructed by remixing and reweighting the features of the deep supervised model (Figures 6–10). The not-strongly-supervised set (Table 1) includes two supervised models: GMAX and Supervised HMAX (Materials and Methods). These were supervised much more weakly than the deep convolutional network, using merely hundreds of images. The deep convolutional network (Table 2) was supervised with 1.2 million category-labeled images. Note that the first set contains many independent model representations, whereas the second set contains the stages of a single deep strongly supervised object-vision model.

Most models explain a small component of the IT representational geometry

Among the not-strongly-supervised models, the seven models with the highest RDM correlations with hIT and mIT are shown in Figure 1 (for other brain regions, see Figure S1 and Table 1). Visual inspection suggests that the models capture the human-face cluster, which is also prevalent in IT. However, the models do not appear to place human and animal faces in a single cluster. In addition, the inanimate objects appear less clustered in the models. All models shown in Figure 1 have small, but highly significant (p<0.0001) RDM correlations with hIT and mIT (Figure 1A, 1B, respectively; for RDM correlations with other brain regions see Figure S2 for the not-strongly-supervised models, and Figure S3 for the deep supervised model representations). Most of the other not-strongly-supervised models also have significant RDM correlations (Table 1, Figure 2; inference by randomization of stimulus labels). Although often significant, all RDM correlations between not-strongly-supervised models and IT were small (Kendall τA < 0.17 for hIT; τA < 0.26 for mIT).

Combining features from multiple models improves the explanation of IT

Combining features from the not-strongly-supervised models improved the RDM correlations to IT. Model features were combined by summarizing each model representation by its first 95 principal components and then concatenating these sets of principal components. This approach ensured that each model contributed equally to the combination (same number of features and same total variance contributed).
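A minimal sketch of this feature-combination step is given below, assuming 95 principal components per model; the per-model variance equalization shown (z-scoring the components) is one plausible reading of "same total variance contributed", and the variable names are illustrative rather than the authors' code.

```python
# Illustrative sketch of the combi27-style feature combination (assumed names/shapes).
import numpy as np
from sklearn.decomposition import PCA

def combine_models(model_feature_list, n_components=95):
    """Summarize each model by its first principal components, z-score each
    summary so every model contributes equal variance, then concatenate."""
    parts = []
    for features in model_feature_list:        # each: [n_stimuli x n_units]
        pcs = PCA(n_components=n_components).fit_transform(features)
        pcs = (pcs - pcs.mean(axis=0)) / pcs.std(axis=0)   # assumed normalization
        parts.append(pcs)
    return np.concatenate(parts, axis=1)       # [n_stimuli x (n_models * n_components)]

# toy usage: three hypothetical model representations for 96 stimuli
rng = np.random.default_rng(1)
models = [rng.standard_normal((96, d)) for d in (200, 512, 1000)]
combined = combine_models(models)
print(combined.shape)   # (96, 285)
```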
The combination of the 27 not-strongly-supervised models (combi27) has a higher RDM correlation with both hIT and mIT than any of the 27 contributing models. Second to the combi27 model, internal representations of the HMAX model have the highest RDM correlation with hIT and mIT. This might reflect the fact that the architecture and parameters of the HMAX model closely follow the literature on the primate ventral stream.

In addition to the combi27, we also tested the combination of untrained models, the combination of unsupervised trained models, and the combination of weakly supervised trained models (Figure S4). The combi27 explained IT equally well or better than other combinations of the not-strongly-supervised models. In the remaining analyses, we therefore omit the other combinations and consider the combi27 along with each individual model.

Monkey IT was significantly better explained by the combi27 than by the second best among the not-strongly-supervised models (HMAX-C2UT; p = 0.02; inference by bootstrap resampling of the stimulus set [50], not shown). This suggests that the models are somewhat complementary in explaining the IT feature space. For hIT, the second best model was also a version of HMAX (HMAX-allUT), but it did not explain hIT significantly worse than combi27 (p = 0.261, not shown).

Model RDM correlations with mIT tended to be higher than model correlations with the hIT RDM. For example, the dissimilarity correlation of the combi27 with mIT was 0.25, whereas for hIT it was 0.17. This difference is statistically significant (p = 0.001), suggesting that the models were able to better explain the mIT RDM compared to the hIT RDM. This could be caused by a lower level of noise in the mIT RDM (estimated from cell-recording data) than in the hIT RDM (from fMRI data).

None of the not-strongly-supervised models fully explains the IT data

For the human data we were able to estimate a noise ceiling [30] (Materials and Methods), indicating the RDM correlation expected for the true model, given the noise in the data. None of the 28 not-strongly-supervised models reached the noise ceiling (Figure 2A). The combi27 representation came closest, but at τA = 0.17, it was far from the lower bound of the noise ceiling (τA = 0.26). This indicates that the fMRI data capture a component of the hIT representation that all the not-strongly-supervised models leave unexplained. For mIT, we could not estimate the noise ceiling because we had data from only two animals.

IT is more categorical than any of the not-strongly-supervised models

The main categorical divisions observed in IT appear weak or absent in the best-fitting models (Figure 1). To measure the strength of categorical clustering in each model and brain representation, we fitted a linear model of category-cluster RDMs to each model and brain RDM (Materials and Methods, Figure S5). The fitted models (Figure 3) descriptively visualize the categorical component of each RDM, summarizing sets of within- and between-category dissimilarities by their averages. The fits for several computational models show a strong human-face cluster and a weak animate cluster. The human-face cluster is expected on the basis of the visual similarity of the human-face images (all frontally aligned human faces of approximately the same size). The animate cluster could reflect the similar colors and more rounded shapes shared by the animate objects. However, IT in both human and monkey exhibits additional categorical clusters that are not easily accounted for in terms of visual similarity. First, the IT representation has a strong face cluster that includes human and animal faces of different species, which differ widely in shape, color, and pose. Second, the IT representation has an inanimate cluster, which includes a wide variety of natural and artificial objects and scenes of totally different visual appearance. These clusters are largely absent from the not-strongly-supervised models (Figures 3, S6, S7, S8).

In order to statistically compare the overall strength of categorical divisions between IT and each of the models, we computed a categoricality index for each representation. The categoricality index is the proportion of RDM variance explained by categorical divisions.


Figure 2. The not-strongly-supervised models fail to fully explain the IT data. The bars show the Kendall-τA RDM correlations between the not-strongly-supervised models and IT for human (A) and monkey (B). The error bars are standard errors of the mean estimated by bootstrap resampling of the stimuli. Asterisks indicate significant RDM correlations (random permutation test based on 10,000 randomizations of the stimulus labels; ns: not significant, p<0.05: *, p<0.01: **, p<0.001: ***, p<0.0001: ****). Most models explain a small, but significant portion of the variance of the IT representational geometry. The noise ceiling (gray bar) indicates the expected correlation of the true model (given the noise in the data). The upper and lower edges of the gray horizontal bar are upper and lower bound estimates of the maximum correlation any model can achieve given the noise. None of the not-strongly-supervised models reaches the noise ceiling. The noise ceiling could not be estimated for mIT, because the available data were from only two animals. Models with the subscript 'UT' are unsupervised trained models, models with the subscript 'ST' are supervised trained models, and others without a subscript are untrained models. Note that the supervised models included here were "weakly supervised", i.e. with small numbers (884) of category-labeled images. Biologically motivated models are set in black font, and computer-vision models are set in gray font.
doi:10.1371/journal.pcbi.1003915.g002
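The stimulus-label randomization test referenced in this caption can be sketched as follows; this is an illustrative reimplementation with assumed variable names (a brain RDM and a model RDM of matching size), not the analysis code used in the paper.

```python
# Sketch of the stimulus-label randomization test for an RDM correlation
# (illustrative; assumes rdm_brain and rdm_model are [n x n] RDMs already computed).
import numpy as np
from scipy.stats import kendalltau

def rdm_correlation_permutation_test(rdm_brain, rdm_model, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    n = rdm_brain.shape[0]
    iu = np.triu_indices(n, k=1)
    observed, _ = kendalltau(rdm_brain[iu], rdm_model[iu])
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(n)                     # shuffle stimulus labels
        shuffled = rdm_brain[np.ix_(perm, perm)]      # permute rows and columns together
        null[i], _ = kendalltau(shuffled[iu], rdm_model[iu])
    p = (np.sum(null >= observed) + 1) / (n_perm + 1) # one-sided p value
    return observed, p
```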

The categoricality index is calculated as the squared correlation between the fitted category-cluster model (Figure S5) and the RDM it is fitted to (Figure 4). The model RDMs are noise-less. However, the brain RDMs are affected by noise, which lowers the categoricality index. To account for the noise and make the categoricality indices comparable between models and IT, we added noise matching the noise level of hIT to the model representations (Materials and Methods). We then compared the categoricality indices of the 28 not-strongly-supervised models to that of hIT (Figure 4). Human IT has a categoricality index of 0.4. All of the not-strongly-supervised models have categoricality indices below 0.16; most of them below 0.1.
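For illustration, the category-cluster fit and categoricality index can be sketched as follows. The binary category-cluster predictors used here are an assumption about the form of the RDMs in Figure S5, and the plain least-squares fit is the simplest version of the procedure detailed in Materials and Methods; all names and the toy data are hypothetical.

```python
# Sketch of the category-cluster fit and categoricality index (illustrative).
import numpy as np

def category_cluster_rdm(member_mask):
    """Predictor RDM: 1 for member/non-member pairs, 0 otherwise (assumed form)."""
    m = member_mask.astype(float)
    return np.abs(m[:, None] - m[None, :])

def categoricality_index(target_rdm, category_masks):
    iu = np.triu_indices_from(target_rdm, k=1)
    y = target_rdm[iu]
    # design matrix: one column per category-cluster predictor plus an intercept
    X = np.column_stack([category_cluster_rdm(m)[iu] for m in category_masks]
                        + [np.ones_like(y)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares fit
    fitted = X @ beta
    r = np.corrcoef(fitted, y)[0, 1]
    return r ** 2                                      # proportion of RDM variance explained

# toy usage: 96 stimuli, hypothetical animate and face categories
rng = np.random.default_rng(2)
animate = np.arange(96) < 48
face = np.arange(96) < 24
target = np.abs(rng.standard_normal((96, 96))); target = (target + target.T) / 2
print(categoricality_index(target, [animate, face]))
```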
Inferential comparisons show that the categoricality index is significantly higher for hIT than for any of the models (inference by bootstrap resampling of the image set). We also compared the categoricality indices between models and IT without equating the noise levels. In this analysis, the categoricality index reflects the categoricality of the models without noise. For hIT and mIT, the noise lowers the categoricality estimate. Nevertheless, the hIT categoricality index remains significantly greater than that of any of the models. For mIT, similarly, the categoricality index is significantly greater than for all but three of the models (Figure S9).

We also analyzed the clustering strength separately for each of the categories (Figure S6). For animates, clustering strength was significant for a few models (Lab joint color histogram, PHOG, and HMAX-all). For human faces, significant clustering was observed for several computational models (convNet, bioTransform, dense SIFT, LBP, silhouette image, gist, geometric blur, local self-similarity descriptor, global self-similarity descriptor, stable model, HMAX-C1, and combi27). These significant category clusters reflect the visual similarity of the members of these categories.

Inferential comparisons of clustering strength between each of the models and hIT (Figure S7) and mIT (Figure S8) for each of the categories revealed that IT clusters animates, inanimates, and faces (including human and animal faces) significantly more strongly in both species than most of the models (blue bars in Figures S7 and S8). There are only a few cases in which a model clusters one of the categories more strongly than IT.

Remixing and reweighting of the features of the not-strongly-supervised models does not improve the explanation of the IT data

The finding that categoricality is stronger in IT than in any of the models raises the question of what the models are missing. One possibility is that the models contain all essential nonlinear features, but in proportions different from IT, thus emphasizing the features differently in the representational geometry. In that case, reweighting of the features (i.e. stretching and squeezing the representational space along its original axes) should help approximate the IT representational geometry.

For example, the representation might contain a feature perfectly discriminating animates from inanimates. This single categorical feature would not have been reflected strongly in the overall RDM if none of the other features emphasized this categorical division. The influence of such a feature on the overall representational geometry could be increased either by replicating the feature in the representation or by amplifying the feature values. These two alternatives are equivalent in their effects on the RDM, so we consider only the latter.

Another possibility is that all essential nonlinearities are present, but the features need to be linearly recombined (i.e. performing general affine transformations) to approximate the IT representational geometry. We therefore investigated whether linear remixing and reweighting of the features of the not-strongly-supervised models could provide a better explanation of the IT representational geometry.

Remixing of features. We attempted to create new features as linear combinations of the original features. The space of all linear recodings is difficult to search given limited data. We therefore restricted this analysis to the combi27 features (which represent a combination of the not-strongly-supervised models) and attempted to find linear combinations that specifically emphasize the missing categorical divisions. In order to find such linear combinations, we trained three linear support vector machine (SVM) classifiers for body/nonbody, face/non-face, and animate/inanimate categorization. The SVMs were trained on a set of 884 labeled images of isolated objects nonoverlapping with the set of 96 images we had brain data for. We used the decision-value outputs of the classifiers as new features. The resulting single-feature RDMs (Figure 5, top; one RDM for each SVM) are not highly categorical and have only a low correlation (τA < 0.1) with the IT RDMs for human and monkey. This is consistent with the fact that the combi27 representation does not perform very well on categorization tasks (Figures 11, S11).
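The remixing step just described, training linear SVMs on a separate image set and using their decision values as new one-dimensional features, can be sketched as below; the classifier settings, labels, and array shapes are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of remixing features via linear SVM decision values (illustrative).
import numpy as np
from sklearn.svm import LinearSVC

def svm_discriminant_features(train_features, train_labels_dict, test_features):
    """Train one linear SVM per binary division and return its decision values
    on the test images as new (one-dimensional) features."""
    new_features = {}
    for name, labels in train_labels_dict.items():     # e.g. 'animate', 'face', 'body'
        clf = LinearSVC(C=1.0, max_iter=10000).fit(train_features, labels)
        new_features[name] = clf.decision_function(test_features)   # shape: [n_test]
    return new_features

# toy usage: 884 hypothetical training images, 96 test images, combi27-like features
rng = np.random.default_rng(3)
train_X, test_X = rng.standard_normal((884, 285)), rng.standard_normal((96, 285))
labels = {"animate": rng.integers(0, 2, 884),
          "face": rng.integers(0, 2, 884),
          "body": rng.integers(0, 2, 884)}
decision_values = svm_discriminant_features(train_X, labels, test_X)
print({k: v.shape for k, v in decision_values.items()})
```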
Feature reweighting. Combining the not-strongly-supervised models with equal weight in the combi27 representation improved the explanation of our IT data. We wanted to test whether appropriate weighting of the not-strongly-supervised models could further improve the explanation of the IT geometry. In addition to the 27 not-strongly-supervised models, we included the combi27 model and the three categorical SVM discriminants in the set of representations to be combined. We fitted one weight for each of these representations (27+1+3 = 31 weights in total), so as to best explain the hIT RDM (Figure 5, middle row).

Flipping the sign of a feature (weight = −1) has no effect on the representational distances. We can, thus, consider only positive weights, without loss of generality. We therefore used a non-negative-least-squares fitting algorithm [51] to find the non-negative weights for the models that minimize the sum of squared deviations between the hIT RDM and the RDM of the weighted combination of models. The RDM of the weighted combination of the model features is equivalent to a weighted combination of the RDMs of the models (Materials and Methods) when the squared Euclidean distance is used.
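The equivalence invoked here can be made explicit with a short derivation (notation ours, not the paper's): if model m's normalized pattern for stimulus i is x_i^{(m)} and its features are scaled by \sqrt{w_m}, with w_m \geq 0, before concatenation across models, then for any stimulus pair (i, j)

d_{ij} \;=\; \sum_{m=1}^{M} \bigl\lVert \sqrt{w_m}\,x_i^{(m)} - \sqrt{w_m}\,x_j^{(m)} \bigr\rVert^{2} \;=\; \sum_{m=1}^{M} w_m \,\bigl\lVert x_i^{(m)} - x_j^{(m)} \bigr\rVert^{2},

so the squared-Euclidean RDM of the weighted feature combination is exactly the w_m-weighted sum of the individual model RDMs, which is why the non-negative weights can be fitted directly at the level of the RDMs.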


Figure 3. IT-like categorical structure is not apparent in any of the not-strongly-supervised models. Brain and model RDMs are shown in
the left columns of each panel. We used a linear combination of category-cluster RDMs (Figure S5) to model the categorical structure (least-squares
fit). The categories modeled were animate, inanimate, face, human face, non-human face, body, human body, non-human body, natural inanimates,
and artificial inanimates. The fitted linear-combination of category-cluster RDMs is shown in the middle columns. This descriptive visualization shows
to what extent different categorical divisions are prominent in each RDM. The residual RDMs of the fits are shown in the right column. For statistical
inference, see Figure 4.
doi:10.1371/journal.pcbi.1003915.g003

We used the squared Euclidean distance for normalized representational patterns, which is equivalent to the correlation distance, as used throughout this paper. We therefore applied the nonnegative least-squares algorithm at the level of the RDMs.

In order to avoid overestimation of the RDM correlation between the fitted model and hIT due to overfitting to the image set, we fitted the weights to random subsets of 88 of the 96 images in a crossvalidation procedure, holding out 8 images on each fold. We then estimated the representational dissimilarities for the weighted-combination model for the 8 held-out images. We repeated this procedure until the entire RDM of 96 by 96 images was estimated (Figure 5, bottom row, center).
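A minimal sketch of this RDM-level reweighting with image-set crossvalidation is given below, using SciPy's non-negative least-squares routine as a stand-in for the algorithm cited as [51]; the fold construction, variable names, and dissimilarity bookkeeping are illustrative simplifications, not the paper's implementation.

```python
# Sketch of non-negative least-squares RDM reweighting with crossvalidation
# over the image set (illustrative; model_rdms is an array [n_models x n x n]).
import numpy as np
from scipy.optimize import nnls

def crossvalidated_weighted_rdm(model_rdms, target_rdm, n_folds=12, seed=0):
    n = target_rdm.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    predicted = np.zeros_like(target_rdm)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        iu = np.triu_indices(len(train), k=1)
        # design matrix: one column per model RDM, restricted to the training images
        X = np.stack([m[np.ix_(train, train)][iu] for m in model_rdms], axis=1)
        y = target_rdm[np.ix_(train, train)][iu]
        w, _ = nnls(X, y)                                   # non-negative weights
        # predict dissimilarities involving held-out images from the weighted sum
        combined = np.tensordot(w, model_rdms, axes=1)      # [n x n]
        predicted[np.ix_(held_out, np.arange(n))] = combined[held_out]
        predicted[np.ix_(np.arange(n), held_out)] = combined[:, held_out]
    return predicted
```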
Feature reweighting and remixing did not reproduce the categorical structure observed in IT (Figure 5, bottom row). In fact, the weighted-combination model did slightly worse than combi27 at explaining hIT and mIT (τA = 0.13 for hIT, τA = 0.20 for mIT). The lower performance, despite the inclusion of combi27 as one of the component representations, reflects the cost of overfitting. However, since we fitted only 31 weights in the reweighting step, that cost is small. The failure to improve the explanation of the IT geometry through remixing and reweighting, thus, suggests that the not-strongly-supervised models are missing features important to the IT representational geometry. Different nonlinear features and more powerful supervised learning methods may be needed to fully capture the structure of the IT representation. We therefore next tested a deep supervised convolutional neural network [52].

A strongly supervised deep convolutional network better explains the IT data

So far, we showed that none of the not-strongly-supervised models were able to reproduce the categorical structure present in IT. Most of these models were untrained or trained without supervision. A few of them were weakly supervised (i.e. supervised with merely 884 training images). Their failure at explaining our IT data suggests that computational features trained to cluster the categories through supervised learning with many labeled images might be needed to explain the IT representational geometry. We therefore tested a deep convolutional neural network trained with 1.2 million labeled images [52], nonoverlapping with the set of 96 images used here.

Figure 4. The not-strongly-supervised models are less categorical than IT. Categoricality was measured using a categoricality index (vertical axis) for each model and brain RDM. The categoricality index is defined as the proportion of RDM variance explained by the category-cluster model (Figure S5), i.e. the squared correlation between the fitted category-cluster model and the RDM it is fitted to. Bars show the categoricality index for each of the not-strongly-supervised models. The blue (gray) line shows the categoricality index for hIT (mIT). Error bars show 95%-confidence intervals of the categoricality index estimates for the models. The 95%-confidence intervals for hIT and mIT are shown by the blue and gray shaded regions, respectively. Significant categoricality indices are marked by stars underneath the bars (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). Error bars are based on bootstrapping of the stimulus set, and the p-values are obtained by category-label randomization test. Significant differences between the categoricality indices of each model and hIT (inference by bootstrap resampling of the stimuli) are indicated by blue vertical arrows (p<0.05, Bonferroni-adjusted for 28 tests). The corresponding inferential comparisons for mIT are indicated by gray vertical arrows. Categoricality is significantly greater in hIT and mIT than in any of the 28 models. This analysis is based on equating the noise level in the models with that of hIT (Materials and Methods). Similar results obtain for a conservative inferential analysis comparing the categoricality of the noise-less models with that of the noisy estimates for hIT and mIT (Figure S9).
doi:10.1371/journal.pcbi.1003915.g004


Figure 5. Remixing and reweighting features of the not-strongly-supervised models does not explain IT. In order to build an IT-like representation, we attempted to remix the features to strengthen relevant categorical divisions. We trained three linear SVM classifiers (for animate/inanimate, face/nonface, and body/nonbody) on the combi27 features using 884 training images (separate from the set we had brain data for). RDMs for the resulting SVM decision values for the 92 images presented to humans and monkeys are shown at the top. The Kendall-τA RDM correlations with hIT and mIT are stated underneath the RDMs. The RDM correlations are low, but all three are statistically significant (p<0.05). We further attempted to create an IT-like representation as a reweighted combination of the models. We fitted one weight for each of the 27 not-strongly-supervised models, the combi27 model, and the three SVM decision values. The weights were fitted by non-negative least squares, so as to minimize the sum of squared deviations between the RDM of the weighted combination of the features and the hIT RDM. The resulting weights are shown in the second row. Error bars indicate 95%-confidence intervals obtained by bootstrap resampling of the stimulus set. The resulting IT-geometry-supervised RDM is shown at the bottom (center) in juxtaposition to hIT (left) and mIT (right). Importantly, the RDM was obtained by cross-validation to avoid overfitting to the image set (Materials and Methods). The RDMs here are 92×92, excluding the four stimuli that we did not have monkey data for.
doi:10.1371/journal.pcbi.1003915.g005

The model has eight layers. The RDM for each of the layers and the RDM correlations with hIT and mIT are shown in Figure 6. The deep supervised convolutional network explains the IT geometry better than any of the not-strongly-supervised models. The RDM correlation between hIT and the deep convolutional network's best-performing layer (layer 7) is τA = 0.24. Layer 7 explains the hIT representation significantly better (p<0.05; obtained by bootstrap resampling of the stimulus set) than combi27 (τA = 0.17), the best-performing of the not-strongly-supervised models. Monkey IT, as well, is better explained by layer 7 (τA = 0.29) than by combi27 (τA = 0.25), although the difference is not significant.

Layer 7 is the deep network's highest continuous representational space, followed only by the readout layer (layer 8, also known as the "scores"). The readout layer is composed of 1000 features, one for each of the 1000 category labels used in training the network. The readout layer has a lower RDM correlation with hIT (τA = 0.13) and mIT (τA = 0.18) than layer 7.

From layer 1 to layer 7 the RDM correlation with IT rises roughly monotonically (Figure 7, Table 2) and many of the pairwise comparisons between RDM correlations for higher and lower layers are significant (Figure 7, horizontal lines at the top). Nevertheless, even the best-performing layer 7 does not reach the noise ceiling (Figure 7). Although the deep convolutional network outperforms all not-strongly-supervised models, it does not fully explain our IT data.
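As an illustration of how layer-wise RDMs like those in Figure 6 can be obtained, the sketch below extracts activations from torchvision's pretrained AlexNet (a stand-in for the Krizhevsky et al. network used here, not the authors' exact model or weights) and computes a correlation-distance RDM per probed layer; image preprocessing is assumed to have produced the stimulus tensor.

```python
# Sketch: layer-wise RDMs from a pretrained AlexNet (stand-in for the network in [41]);
# preprocessing and the stimulus tensor are assumed to be prepared elsewhere.
import torch
import numpy as np
from scipy.spatial.distance import pdist, squareform
from torchvision.models import alexnet, AlexNet_Weights

model = alexnet(weights=AlexNet_Weights.DEFAULT).eval()

# modules to probe (names chosen for convenience; indices follow torchvision's AlexNet)
probes = {"conv1": model.features[0], "conv5": model.features[10],
          "fc6": model.classifier[1], "fc7": model.classifier[4]}
activations = {}
for name, module in probes.items():
    module.register_forward_hook(
        lambda m, inp, out, name=name: activations.__setitem__(name, out.detach()))

# `stimuli` is assumed: a [96 x 3 x 224 x 224] tensor of preprocessed images
stimuli = torch.randn(96, 3, 224, 224)            # placeholder for the real stimuli
with torch.no_grad():
    model(stimuli)

rdms = {name: squareform(pdist(act.flatten(1).numpy(), metric="correlation"))
        for name, act in activations.items()}
print({name: r.shape for name, r in rdms.items()})
```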

Figure 6. RDMs of all layers of the strongly supervised deep convolutional network. RDMs for all layers of the deep convolutional network (Krizhevsky et al. 2012) [41] are shown for the set of the 96 images (L1: layer 1 to L7: layer 7). Kendall-τA RDM correlations of the models with hIT and mIT are stated underneath each RDM. All correlations are statistically significant. For inferential comparisons to IT and other regions, see Figure 7 and Table 2, respectively.
doi:10.1371/journal.pcbi.1003915.g006


Figure 7. The strongly supervised deep network, with features remixed and reweighted, fully explains the IT data. The bars show the Kendall-τA RDM correlations between the layers of the strongly supervised deep convolutional network and human IT. The error bars are standard errors of the mean estimated by bootstrap resampling of the stimuli. Asterisks indicate significant RDM correlations (random permutation test based on 10,000 randomizations of the stimulus labels; p<0.05: *, p<0.01: **, p<0.001: ***, p<0.0001: ****). As we ascend the layers of the deep network, model RDMs explain increasing proportions of the variance of the hIT RDM. The noise ceiling (gray bar) indicates the expected correlation of the true model (given the noise in the data). The upper and lower edges of the gray horizontal bar are upper and lower bound estimates of the maximum correlation any model can achieve given the noise. None of the layers of the deep network reaches the noise ceiling. However, the final fully connected layers 6 and 7 come close to the ceiling. Remixing the features of layer 7 (Figure 10) using linear SVMs to strengthen the categorical divisions provides a representation composed of three discriminants (animate/inanimate, face/nonface, and body/nonbody) that reaches the noise ceiling. Reweighting the model layers and the three discriminants (see Figure 10 for details) yields a representation that explains the hIT geometry even better. A horizontal line over two bars indicates that the two models perform significantly differently (inference by bootstrap resampling of the stimulus set). Multiple testing across the many pairwise comparisons is accounted for by controlling the expected FDR at 0.05. The pairwise statistical comparisons show that the IT-geometry-supervised deep model explains IT significantly better than all other candidate representations.
doi:10.1371/journal.pcbi.1003915.g007

As for the not-strongly-supervised models, we analyzed the categoricality of the layers of the deep supervised model (Figures 8, 9). All layers of the deep supervised model, including layer 7 and layer 8 (the readout layer), have significantly lower categoricality indices than hIT and mIT (Figure 9). This might reflect the fact that the stimulus set was equally divided into animates and inanimates and this division, thus, strongly influences our categoricality index. Importantly, the deep supervised network emphasizes some categorical divisions more strongly and others less strongly than IT (Figure 8). For example, layer 7 emphasizes the division between human and animal faces and the division between artificial and natural inanimate objects more strongly than IT. However, IT emphasizes the animate/inanimate and the face/body division more strongly than layer 7.

Figure 8. IT-like categorical structure emerges across the layers of the deep supervised model, culminating in the IT-geometry-
supervised layer. Descriptive category-clustering analysis as in Figure 3, but for the deep supervised network. We used a linear combination of
category-cluster RDMs (Figure S5) to model the categorical structure. The fitted linear-combination of category-cluster RDMs is shown in the middle
columns. This descriptive visualization shows to what extent different categorical divisions are prominent in each layer of the deep supervised model.
The layers show some of the categorical divisions emerging. However, remixing of the features (linear SVM readout) is required to emphasize the
categorical divisions to a degree that is similar to IT. The final IT-geometry-supervised layer (weighted combination of layers and SVM discriminants)
has a categorical structure that is very similar to IT. Overfitting to the image set was avoided by crossvalidation. For statistical inference, see Figure 9.
doi:10.1371/journal.pcbi.1003915.g008


Figure 9. The layers of the deep supervised model are less categorical than IT, but remixing and reweighting achieves IT-level categoricality. Bars show the categoricality index for each layer of the deep convolutional network and for the IT-geometry-supervised layer. For conventions and for the definition of the categoricality index, see Figure 4. Error bars and shaded regions indicate 95%-confidence intervals. Significant categoricality indices are indicated by stars underneath the bars (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). Significant differences between the categoricality index of each model and the hIT categoricality index are indicated by blue vertical arrows (p<0.05, Bonferroni-adjusted for 9 tests). The corresponding inferential comparisons for mIT are indicated by gray vertical arrows. Categoricality is significantly greater in hIT and mIT than in any of the internal layers of the deep convolutional network. However, the IT-geometry-supervised layer (remixed and reweighted) achieves a categoricality similar to (and not significantly different from) IT. This analysis is based on equating the noise level in the models with that of hIT (Materials and Methods). Similar results obtain for a conservative inferential analysis comparing the categoricality of the noise-less models with that of the noisy estimates for hIT and mIT (Figure S10).
doi:10.1371/journal.pcbi.1003915.g009

Remixing and reweighting of the deep supervised features fully explains the IT data

We have seen that the deep supervised model provides better separation of the categories than the not-strongly-supervised models and that it also better explains IT. However, it did not reach the noise ceiling. As for the not-strongly-supervised models, we therefore asked whether remixing the features linearly (by adding linear readout features emphasizing the right categorical divisions) and reweighting of the different layers and readout features could provide a better model of the IT representation.

The method for remixing and reweighting was exactly the same as for the not-strongly-supervised models (Figure 5). However, the linear SVM features were based on layer 7 (instead of combi27) and the reweighting involved fitting one weight for each of the layers (1–8) and one weight for each of the three linear SVM features.

As before, the linear SVM features were trained for body/nonbody, face/non-face, and animate/inanimate categorization using the nonoverlapping set of 884 training images. The RDMs for the SVM readout features show strong categorical divisions (Figure 10, top row). This is consistent with the fact that the layer-7 representation performs well on categorization tasks (Figures 11, S11).

As before, we used non-negative least-squares fitting to find the weighted combination of the representations that best approximates hIT. Again, we avoided overfitting to the image set by fitting the weights to random subsets of 88 of the 96 images in a crossvalidation procedure, holding out 8 images on each fold. This procedure yielded a weight for each of the eight layers of the deep network and for each of the three linear SVM readout features (11 weights in total; Figure 10, middle row; Materials and Methods).

We refer to this weighted combination as the IT-geometry-supervised deep model. Inspecting the RDM reveals the similarity of its representational geometry to hIT and mIT (Figure 10, bottom row). The model emphasizes the major categorical divisions similarly to IT (Figure 8, bottom right). In contrast to all other models, this model has a categoricality index matching mIT and not significantly different from either mIT or hIT (Figure 9). The IT-geometry-supervised deep model explains hIT better than any layer of the deep network (Figure 7, horizontal lines at the top). It has the highest RDM correlation with hIT (τA = 0.38) and mIT (τA = 0.40) among all model representations considered in this paper. Importantly, it falls well within the upper and lower bounds of the noise ceiling and, thus, fully explains the non-noise component of our hIT data.

Model representations more similar to IT categorize better

Figure 11 shows the animate/inanimate categorization accuracy of linear SVM classifiers taking each of the model representations as their input (for the face/body dichotomy and the artificial/natural dichotomy among inanimates, see Figure S11). The categorization accuracy for each model was estimated by 12-fold crossvalidation of the 96 stimuli (Materials and Methods). The deep convolutional network model (layer 7) has the highest animate/inanimate categorization performance (96%), and the combi27 has the second highest performance (76%).
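The crossvalidated categorization readout used for Figure 11 can be sketched as follows; the linear-SVM settings and the toy feature array are assumptions for illustration, with the actual procedure described in Materials and Methods.

```python
# Sketch of 12-fold crossvalidated animate/inanimate categorization from a
# model representation (illustrative; features is [96 x n_units], labels is [96]).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def categorization_accuracy(features, labels, n_folds=12):
    clf = LinearSVC(C=1.0, max_iter=10000)
    scores = cross_val_score(clf, features, labels, cv=n_folds)
    return scores.mean()

# toy usage with random features and balanced animate/inanimate labels
rng = np.random.default_rng(4)
features = rng.standard_normal((96, 4096))
labels = np.repeat([0, 1], 48)
print(f"mean accuracy over 12 folds: {categorization_accuracy(features, labels):.2f}")
```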


Figure 10. Remixing and reweighting features of the deep supervised network achieves an IT-like representational geometry. All analyses and conventions here are analogous to Figure 5, but applied to the strongly supervised deep convolutional network, rather than to the not-strongly-supervised models. Remixing the features of layer 7 by fitting linear SVMs (separate set of training images) for the major categorical divisions (animate/inanimate, face/nonface, and body/nonbody) helped account for the categorical clusters in IT. The Kendall-τA RDM correlations between the SVM decision values and IT (stated underneath the RDMs in the top row) are statistically significant (p<0.05). For the deep convolutional network used here, feature remixing accounted for the animate/inanimate division of IT. We attempted to create an IT-like representation as a reweighted combination of the layers of the deep network and the SVM decision values. We fitted one weight for each of the layers and one weight for each of the three decision values. The bar graph in the middle row shows the weights, with 95%-confidence intervals obtained by bootstrap resampling of the stimulus set. As before, the weights were fitted using non-negative least squares to minimize the sum of squared deviations between the RDM of the weighted combination and the hIT RDM. The resulting IT-geometry-supervised RDM (bottom row, center) is very similar to the RDMs of hIT (left) and mIT (right). The τA RDM correlation between the fitted model and IT is about equal for monkey IT (0.40) and human IT (0.38). Both of these RDM correlations are higher than the RDM correlation between hIT and mIT, reflecting the effect of noise on the empirical RDM estimates. As in Figure 5, the fitted model RDM was obtained by cross-validation to avoid overfitting to the image set.
doi:10.1371/journal.pcbi.1003915.g010

Figure 12 shows that models whose representations were more similar to IT tended to have a higher animate/inanimate categorization performance. The Pearson correlation between the IT-to-model representational similarity (τA RDM correlation) and categorization accuracy was 0.75 for hIT and 0.68 for mIT across the 28 not-strongly-supervised model representations and the seven layers of the deep supervised model. This finding could simply reflect the fact that the categories correspond to clusters in the IT representation and any representation clustering the categories will be well-suited for categorization. Indeed, categorization performance is also predicted by the RDM correlation between a model and an animate-inanimate categorical RDM, albeit with a lower correlation coefficient (r = 0.38, not shown).

In order to further assess whether it was only the category clustering that predicted categorization accuracy or something deeper about the similarity of the model representation to IT, we considered the within-category dissimilarity correlation between each model and IT as a predictor of categorization accuracy. Models that were more similar to IT in terms of their within-category representational geometry (dissimilarities among animates and dissimilarities among inanimates) also tended to have higher categorization performance (Pearson r = 0.45 for hIT, r = 0.67 for mIT; p<0.01, p<0.0001, respectively).

These results may add to the motivation for computer vision to learn from biological vision. If computational feature spaces more similar to the IT representation yield better categorization performance within the present set of models, then it might be a good strategy for computer vision to seek to construct features even more similar to IT.

Several models using Gabor filters and other low-level features explain human early visual cortex

We could not distinguish early visual areas V1, V2, and V3, because stimuli were presented foveally in the human fMRI experiment (2.9° visual angle in diameter, centered on fixation). Instead we defined an ROI for early visual cortex (EVC), which covered the foveal confluence of these retinotopic representations. Several models using Gabor filters (SIFT, gist, PHOG, HMAX, ConvNet) and other features (geometric blur, local self-similarity descriptor, global self-similarity descriptor, silhouette image) explained the early visual RDM estimated from fMRI (Figures S1A, S2A). These models not only explained significant dissimilarity variance, but reached the noise ceiling, indicating that they explain the EVC representation to the extent that the noise in our data enables us to assess this. For the HMAX model (as implemented by Serre et al. [20]), we tested several internal representations. The HMAX-C2 layer had the highest RDM correlation with EVC among all models. The HMAX-C2 layer falls within the early stages (above the S1, C1, and S2 layers, and below the S2b, S3, C2b, C3, and S4 layers) of the HMAX model and its features closely parallel the initial stages of primate visual processing. For the deep supervised model, the RDM correlations of different layers with EVC are shown in Figure S3A. Layers 2 and 3 of the model have the highest RDM correlation with EVC and reach the noise ceiling. However, their correlation with EVC is lower than that of the HMAX-C2 layer.

Object-vision models and other brain regions

We also compared the model RDMs with brain areas other than IT and EVC (i.e. FFA, LOC, and PPA). Figure S2 shows how well each of the 28 not-strongly-supervised models explained EVC, FFA, LOC, and PPA. The seven not-strongly-supervised models with the highest RDM correlations to these brain regions are shown in Figure S1. Among the not-strongly-supervised models, the HMAX model showed the highest RDM correlation with EVC and FFA. Specifically, the HMAX-C2 layer had the highest RDM correlation with EVC (τA = 0.22) and HMAX-all had the highest RDM correlation with the FFA (τA = 0.13). The combi27 model had the highest RDM correlation with LOC and PPA (τA = 0.14 and τA = 0.03, respectively).

For the deep supervised model, Figure S3 shows how well different layers explain EVC, FFA, LOC, and PPA. Layers 2 and 3 reached the noise ceiling for EVC. Subsequent layers along the deep network's processing stream exhibited decreasing RDM correlations with EVC and increasing RDM correlations with LOC. Layer 7 gets closest to the LOC noise ceiling, but does not reach it. For FFA, however, layer 6 reaches the noise ceiling.

PPA exhibited the lowest RDM correlations with the models, including both the not-strongly-supervised and the deep supervised representations. The only model with a significant RDM correlation with PPA was combi27 (τA = 0.034, p<0.001; Table 1), which was far below the noise ceiling. This somewhat puzzling result might reflect a limitation of our stimulus set for investigating PPA. Konkle and Oliva [53] have shown that a bilateral parahippocampal region that overlaps with PPA responds more strongly to objects that are big than to objects that are small in the real world. Our stimulus set included a limited set of place and scene images and mostly objects that are small in the real world.

Materials and Methods

Object-vision models

We used a wide range of computational models to explore many different ways of extracting visual features. We selected some of the well-known biologically motivated object-recognition models as well as several models and feature extractors from computer vision. Some of the models need a training phase (these are shown by a subscript, either 'ST' for supervised trained or 'UT' for unsupervised trained) and some others do not (models without any subscript).

For the models with a training phase, we used a new set of 884 training images. Half of the images were animates and the other half were inanimates. Then, all models were tested using the testing stimuli (the set of 96 images). In the training set, similar to the testing set, animate images had subcategories of human/animal faces and human/animal bodies. Inanimate images had subcategories of artificial and natural inanimates.

Below is a description of all models used in this study (see [24] for a more comprehensive explanation of the models). For those



Table 1. RDM correlations between brain regions and not-strongly-supervised models.

| model | mIT | hIT [0.26, 0.48] | LOC [0.20, 0.41] | FFA [0.10, 0.39] | PPA [0.08, 0.38] | EVC [0.13, 0.40] |
|---|---|---|---|---|---|---|
| without training |  |  |  |  |  |  |
| 1. V1 model | 0.104**** | 0.080*** | 0.048** | 0.045* | 0.006 ns | 0.123**** |
| 2. Convolutional network | 0.132**** | 0.111**** | 0.058*** | 0.083*** | −0.023 ns | 0.174**** |
| 3. Bio transform, 1st stage | 0.117**** | 0.059** | 0.031 ns | 0.023 ns | 0.018 ns | 0.075* |
| 4. Bio transform, 2nd stage | 0.096**** | 0.015 ns | 0.029 ns | 0.077** | −0.008 ns | 0.044 ns |
| 5. Bio transform, both | 0.105**** | 0.056** | 0.066*** | 0.067** | −0.026 ns | 0.037 ns |
| 6. Radon | 0.013 ns | 0.035 ns | −0.002 ns | −0.031 ns | 0.016 ns | 0.004 ns |
| 7. Dense SIFT | 0.145**** | 0.101**** | 0.077**** | 0.067*** | 0.021 ns | 0.171**** |
| 8. Stimulus image (Lab) | 0.044** | 0.023 ns | 0.044* | 0.083* | −0.033 ns | −0.119 ns |
| 9. LBP | 0.067*** | 0.078** | 0.084**** | 0.051* | −0.057 ns | −0.025 ns |
| 10. Lab joint histogram | 0.147**** | 0.092**** | 0.081**** | 0.094*** | −0.055 ns | 0.011 ns |
| 11. Silhouette image | 0.168**** | 0.092**** | 0.061*** | 0.052* | −0.003 ns | 0.209**** |
| 12. Gist | 0.164**** | 0.120**** | 0.065**** | 0.059** | −0.014 ns | 0.185**** |
| 13. PHOG | 0.120**** | 0.094**** | 0.059** | 0.061* | −0.010 ns | 0.212**** |
| 14. Geometric blur | 0.123**** | 0.061*** | 0.028 ns | 0.058* | 0.010 ns | 0.172**** |
| 15. Ssim | 0.171**** | 0.126**** | 0.068**** | 0.104*** | −0.048 ns | 0.169**** |
| 16. Gssim | 0.158**** | 0.106**** | 0.069**** | 0.063** | −0.000 ns | 0.191**** |
| with training (unsupervised) |  |  |  |  |  |  |
| 17. VisNet_UT | 0.012 ns | 0.037** | 0.016 ns | −0.005 ns | −0.001 ns | 0.057*** |
| 18. Stable model_UT | 0.092**** | 0.081**** | 0.056** | 0.088** | −0.083 ns | 0.004 ns |
| 19. HMAX-C1_UT | 0.127**** | 0.085**** | 0.051*** | 0.036 ns | 0.025 ns | 0.154**** |
| 20. HMAX-C2_UT | 0.182**** | 0.114**** | 0.055*** | 0.098*** | −0.012 ns | 0.217**** |
| 21. HMAX-C2b_UT | 0.114**** | 0.112**** | 0.095**** | 0.065** | −0.043 ns | −0.011 ns |
| 22. HMAX-C3_UT | 0.139**** | 0.114**** | 0.074**** | 0.081*** | −0.047 ns | 0.078** |
| 23. HMAX-All_UT | 0.165**** | 0.139**** | 0.108**** | 0.132**** | −0.067 ns | 0.081** |
| 24. SLF_UT | 0.061**** | 0.054** | 0.015 ns | 0.091*** | −0.074 ns | −0.027 ns |
| 25. PHOW_UT | 0.031* | 0.054* | 0.046* | 0.038 ns | −0.067 ns | −0.021 ns |
| with training (supervised) |  |  |  |  |  |  |
| 26. GMAX_ST | 0.078**** | 0.060*** | 0.023 ns | 0.098*** | −0.074 ns | −0.038 ns |
| 27. Supervised HMAX_ST | 0.085**** | 0.079**** | 0.041* | 0.109*** | −0.074 ns | −0.032 ns |
| 28. Combination of all 27 | 0.253**** | 0.169**** | 0.137**** | 0.045**** | 0.034*** | 0.096**** |

Kendall τA RDM correlation coefficients between brain regions and not-strongly-supervised models. Significant correlations are indicated by asterisks (ns: not significant, * p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). The brain regions are the lateral occipital complex (LOC), the fusiform face area (FFA), the parahippocampal place area (PPA), and the foveal confluence of early visual areas (EVC). For each brain region, the highest RDM correlation is set in bold. Lower and upper bounds of the noise ceiling are stated in brackets next to the labels of the human brain ROIs (header row).
doi:10.1371/journal.pcbi.1003915.t001


Table 2. RDM correlations between brain regions and layers of the deep convolutional network.

| model | mIT | hIT [0.26, 0.48] | LOC [0.20, 0.41] | FFA [0.10, 0.39] | PPA [0.08, 0.38] | EVC [0.13, 0.40] |
|---|---|---|---|---|---|---|
| deep convolutional network layers (Krizhevsky et al. 2012) |  |  |  |  |  |  |
| Layer 1 (convolutional) | 0.08* | 0.03* | 0.03** | 0.04* | −0.03 ns | 0.01 ns |
| Layer 2 (convolutional) | 0.21**** | 0.12**** | 0.08**** | 0.09**** | −0.01 ns | 0.17**** |
| Layer 3 (convolutional) | 0.23**** | 0.15**** | 0.10**** | 0.07*** | −0.01 ns | 0.16**** |
| Layer 4 (convolutional) | 0.25**** | 0.17**** | 0.12**** | 0.06*** | 0.01 ns | 0.11**** |
| Layer 5 (convolutional) | 0.24**** | 0.17**** | 0.14**** | 0.05** | 0.01 ns | 0.09**** |
| Layer 6 (fully connected) | 0.29**** | 0.23**** | 0.18**** | 0.12**** | −0.02 ns | 0.07**** |
| Layer 7 (fully connected) | 0.29**** | 0.24**** | 0.18**** | 0.09**** | −0.02 ns | 0.06** |
| Layer 8 (scores) | 0.18**** | 0.13**** | 0.12**** | 0.02 ns | −0.02 ns | 0.01 ns |
| remixed features (linear SVM readout) |  |  |  |  |  |  |
| animate/inanimate | 0.20**** | 0.25**** | 0.20**** | 0.02 ns | 0.07**** | 0.03* |
| face/nonface | 0.13**** | 0.08*** | 0.08*** | 0.04*** | 0.02** | 0.03** |
| body/nonbody | 0.06** | 0.05*** | 0.05*** | 0.00 ns | 0.01 ns | −0.01 ns |
| reweighted combination of the above |  |  |  |  |  |  |
| IT-geometry supervised layer | 0.40**** | 0.38**** | 0.27**** | 0.07**** | 0.05*** | 0.07**** |

Kendall τA RDM correlation coefficients between brain regions and layers of the deep supervised network. Conventions as in Table 1. Significant correlations are indicated by asterisks (ns: not significant, * p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). For each brain region, the highest RDM correlation is set in bold.
doi:10.1371/journal.pcbi.1003915.t002



Figure 11. Animate/inanimate categorization accuracy for all models. Each dark blue bar shows the categorization accuracy of a linear SVM applied to one of the computational model representations. Categorization accuracy for each model was estimated by 12-fold cross-validation on the 96 stimuli. To assess whether categorization accuracy was above chance level, we performed a permutation test, in which we retrained the SVMs on 10,000 random (category-orthogonalized) dichotomies among the stimuli. Light blue bars show the average model categorization accuracy for the random label permutations. Categorization performance was significantly greater than chance for most models (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). The deep convolutional network model (final fully connected layer 7) has the highest animate/inanimate categorization performance (96%). The combi27 has the second-highest performance (76%).
doi:10.1371/journal.pcbi.1003915.g011
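For readers who want to reproduce this style of analysis, the sketch below illustrates the procedure summarized in the legend: a linear SVM evaluated by 12-fold cross-validation, with chance level estimated by retraining on permuted labels. It is a minimal illustration using scikit-learn, not the code used for the published analyses; it uses simple label shuffling rather than the category-orthogonalized dichotomies used in the paper, and all variable names are placeholders.

```python
# Minimal sketch (scikit-learn stand-in, not the original analysis code):
# 12-fold cross-validated linear-SVM accuracy plus a permuted-label baseline.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score, StratifiedKFold

def cv_accuracy(features, labels, n_folds=12, random_state=0):
    """features: (n_stimuli x n_features); labels: e.g. 1 = animate, 0 = inanimate."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    scores = cross_val_score(LinearSVC(), features, labels, cv=cv)
    return scores.mean()

def permutation_baseline(features, labels, n_permutations=1000, random_state=0):
    """Chance-level accuracy estimated by retraining on shuffled labels
    (a simplification of the category-orthogonalized dichotomies used in the paper)."""
    rng = np.random.default_rng(random_state)
    null_scores = np.array([cv_accuracy(features, rng.permutation(labels))
                            for _ in range(n_permutations)])
    return null_scores.mean(), null_scores

# A p-value can then be taken as the fraction of permuted-label accuracies
# that are greater than or equal to the observed accuracy.
```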

For those models whose code was freely available online, we have provided the link.
Stimulus image (Lab). Lab color space approximates a linear representation of human perceptual color space. Each Lab image was obtained by transferring the color image (175×175 pixels) from RGB color space to Lab color space. The image was then converted to a pixel vector of length 175×175×3.
Color set (Lab joint histogram). First, images (175×175) were transferred from RGB color space to Lab color space. Then, the three Lab dimensions were divided into 6 bins of equal width. The joint histogram was computed by counting the number of figure pixels falling into each of the 6×6×6 bins. Finally, the obtained Lab joint histogram was converted to a vector of length 6×6×6 (i.e. 216).
Radon. The Radon transform of an image is a matrix in which each column corresponds to a set of integrals of the image intensities along parallel lines at a given angle. The Matlab function radon was used to compute the Radon transform of each luminance image.
Silhouette image. All RGB color images were converted to binary silhouette images by setting all background pixels to 0 and all figure pixels to 1. Each image was then converted to a vector of length 175×175.
Unsupervised convolutional network. This is a hierarchical architecture with two stages of feature extraction, each formed by random convolutional filters and subsampling layers [39]. Convolutional layers scan the input image inside their receptive field. Receptive fields (RFs) of convolutional layers get their input from various places on the input image, and RFs with identical weights make a unit. The outputs of each unit make a feature map. Convolutional layers are then followed by subsampling layers that perform local averaging and subsampling, which makes the feature maps invariant to small shifts [40]. The convolutional network we used had two stages of unsupervised random filters, denoted RR in table 1 of Jarrett et al. (2009) [39]. The obtained output for each image was then vectorized. The parameters were exactly the same as used in [39] (http://koray.kavukcuoglu.org/code.html).
Deep supervised convolutional network. This is a supervised convolutional neural network, trained with 1.2 million labelled images from ImageNet (1000 category labels) [52]. The network has 8 layers: 5 convolutional layers followed by 3 fully connected layers. The output of the last layer is a distribution over the 1000 class labels, obtained by applying a 1000-way softmax to the output of the last fully connected layer [54] [http://caffe.berkeleyvision.org/ (Caffe: Convolutional Architecture for Fast Feature Embedding)].
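As an illustration of how layer-wise activations of such a network can be extracted and vectorized for RSA, the sketch below uses the torchvision AlexNet as a modern stand-in for the Caffe pipeline referenced above; the preprocessing and the choice of the penultimate fully connected layer (analogous to the "layer 7" features used in this paper) are assumptions of this illustration, not the exact published setup.

```python
# Minimal sketch (torchvision stand-in for the original Caffe feature extraction):
# extract the penultimate fully-connected-layer activations and vectorize them.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def layer7_features(image_path):
    """Return the activation vector of the second fully connected layer (fc7-like)."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        x = model.features(x)              # convolutional stages (layers 1-5)
        x = model.avgpool(x).flatten(1)    # spatial pooling + flatten
        for layer in list(model.classifier)[:-1]:  # stop before the 1000-way scores
            x = layer(x)
    return x.squeeze(0).numpy()            # 4096-dimensional feature vector
```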




Figure 12. Model representations resembling IT afford better categorization accuracy. A model's IT-resemblance (measured by the RDM correlation between IT and the model) predicts its categorization accuracy (animate/inanimate). This holds for both human-IT resemblance (top) and monkey-IT resemblance (bottom). The substantial positive correlation between IT-resemblance and categorization accuracy could reflect the categorical clustering of IT (left panels). However, the within-category RDM correlation between a model and IT also predicts model categorization accuracy (right panels). Each panel shows the least-squares fit (gray line) and the Spearman rank correlation r (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). Each circle shows one of the models. Numbers indicate the model (see Table 1 for model numbering). Different layers of the deep supervised convolutional network are indicated by colored labels "L1" (layer 1) to "L7" (layer 7). The deep model's layers are color-coded from light blue to light red (from lower to higher layers). Computer-vision models are shown by gray circles; biologically motivated models are shown by black circles. The transparent horizontal and vertical rectangles cover non-significant ranges along each axis.
doi:10.1371/journal.pcbi.1003915.g012
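The relationship plotted in each panel can be quantified as in the short sketch below, which rank-correlates per-model IT-resemblance with per-model categorization accuracy; the numbers shown are illustrative placeholders, not values from the paper.

```python
# Sketch: Spearman rank correlation between per-model IT-resemblance and accuracy.
import numpy as np
from scipy.stats import spearmanr

# Illustrative placeholder values (NOT results from the paper): one entry per model.
it_resemblance = np.array([0.05, 0.10, 0.14, 0.18, 0.25])  # RDM correlation with IT
accuracy       = np.array([0.55, 0.62, 0.70, 0.74, 0.90])  # animate/inanimate accuracy

rho, p_value = spearmanr(it_resemblance, accuracy)
print(f"Spearman r = {rho:.2f}, p = {p_value:.3f}")
```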

Biological Transform (BT). BT is a hierarchical transform based on local spatial-frequency analysis of oriented segments. The transform has two stages, each of which has an edge detector followed by an interval detector [38]. The edge detector consists of a bar-edge filter and a box filter. For a given interval I and angle θ, the interval detector finds edges that have angle θ and are separated by the interval I. In the first stage, for any given θ and I, all pixels of the filtered image were summed and then normalized by the squared sum of the input; they were then rectified by the Heaviside function. The second stage was the same as the first, except that in the first stage θ varied between 0–180° and I between 100–700 pixels, and the input to the first stage did not have a periodic boundary condition on the θ axis (repeating the right-hand side of the image to the left of the image and vice versa), whereas in the second stage the input (the output of the first stage) was given a periodic boundary condition on the θ axis, and I varied between 15–85 pixels.
Gist. Each image was divided into 16 bins, and oriented Gabor filters (8 orientations) were applied over different scales (4 scales) in each bin. Finally, the average filter energy in each bin was calculated [42,55]. Each image was thus converted to a vector of length 8×8×8. The code is available from http://people.csail.mit.edu/torralba/code/spatialenvelope/.
Geometric Blur (GB). 289 uniformly distributed points were selected on each image, and the geometric blur descriptors [56–58] were then calculated by applying spatially varying blur around the feature points. We used the GB features that were part of the multiple kernels for image classification described in [59] (http://www.robots.ox.ac.uk/~vgg/software/MKL/#download). The blur parameters were set to α = 0.5 and β = 1; the number of descriptors was set to 300.
Dense SIFT. For each grayscale image, SIFT descriptors [60] of 16×16-pixel patches were sampled uniformly on a regular grid. All descriptors were then concatenated into a vector as the SIFT representation of that image. We used the dense SIFT descriptors that were used in [44] to extract the PHOW features described below.
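The following sketch illustrates dense SIFT sampling on a regular grid, in the spirit of the descriptor just described. It uses OpenCV's SIFT implementation rather than the original Matlab code, and the grid step and patch size are assumptions of this illustration.

```python
# Illustrative sketch (OpenCV stand-in, not the original Matlab dense-SIFT code):
# compute SIFT descriptors on a regular grid and concatenate them into one vector.
import cv2

def dense_sift_vector(image_path, step=8, patch_size=16):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    # One keypoint per grid position; 'size' sets the patch the descriptor covers.
    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                 for y in range(patch_size // 2, img.shape[0] - patch_size // 2, step)
                 for x in range(patch_size // 2, img.shape[1] - patch_size // 2, step)]
    _, descriptors = sift.compute(img, keypoints)   # one 128-d descriptor per keypoint
    return descriptors.reshape(-1)                  # concatenated SIFT representation
```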




Pyramid Histogram of Visual Words (PHOW_UT). Dense SIFT descriptors were calculated for each image and then quantized using k-means clustering to form a visual vocabulary. A spatial pyramid of three levels was then created, and the histogram of SIFT visual words was calculated for each bin. The concatenation of all histograms was used as the PHOW representation of that image [44]. We used the implementation available online (http://www.cs.unc.edu/~lazebnik/research/spatial_pyramid_code.zip). The dictionary size was fixed to 200 and the number of spatial-pyramid levels was fixed to three.
Pyramid Histogram of Gradients (PHOG). The Canny edge detector was applied to the grayscale images, and a spatial pyramid with four levels was created [45]. The histogram of orientation gradients was calculated for all bins in each level. All histograms were then concatenated to create the PHOG representation of the input image. We used a Matlab implementation that was freely available online (http://www.robots.ox.ac.uk/~vgg/research/caltech/phog.html). The number of quantization bins was set to forty, the number of pyramid levels to four, and the angular range to 360°.
VisNet_UT. VisNet is a hierarchical model of the ventral visual pathway for invariant object recognition that has four successive layers of self-organizing maps. Neurons higher in the hierarchy have larger receptive fields. Each layer in the model corresponds to a specific area of the primate ventral visual pathway in terms of the size of its receptive fields [34,61]. The model was trained with the trace learning rule [62]. The learning rate was set to 0.1 and the number of epochs in each of the four layers was fixed to 100. Finally, the representation of the last layer was vectorized and used as the VisNet features.
Local self-similarity descriptor (ssim). This is a descriptor that is not directly based on the image appearance; instead, it is based on the correlation surface of local self-similarities. To compute local self-similarity features at a specific point p on the image, a local internal correlation surface can be created around p by correlating the image patch centred at p with its immediate neighbours [46,63]. We used the code available for the ssim features that were part of the multiple kernels for image classification described in [59] (http://www.robots.ox.ac.uk/~vgg/software/SelfSimilarity/). The ssim descriptors were computed uniformly at every five pixels in both the X and Y directions.
Global self-similarity descriptor (gssim). This descriptor is an extension of the local self-similarity descriptor mentioned above. Gssim uses self-similarity globally to capture the spatial arrangement of self-similarities and long-range similarities within the entire image [47]. We used the gssim Matlab implementation available online (http://www.vision.ee.ethz.ch/~calvin/software.html). The number of clusters for the patch-prototype codebook was set to 400, with 20,000 patches to be clustered. D1 and D2 for the self-similarity hypercube were both set to 10.
Local Binary Patterns (LBP). Local binary patterns are usually used in texture categorization. The underlying idea of LBP is that a 2-dimensional surface can be described by two complementary measures: local spatial patterns and grayscale contrast. For a given pixel, the LBP descriptor assigns binary labels to the surrounding pixels by thresholding the difference between the intensity of the center pixel and the surrounding pixels [48,64,65]. We used an LBP Matlab implementation freely available online (http://www.cse.oulu.fi/CMV/Downloads/LBPMatlab). The number of sampling points was fixed to eight.
V1 model. A population of simple and complex cells was modelled and fed with the luminance images as inputs. Gabor filters of 4 different orientations (0°, 90°, −45°, and 45°) and 12 sizes (7–29 pixels) were used as simple-cell receptive fields. The receptive fields of complex cells were then modelled by performing the MAX operation over neighboring simple cells with similar orientations. The outputs of all simple and complex cells were concatenated in a vector as the V1 representational pattern of each image.
HMAX_UT. The HMAX model developed by Serre et al. [20] has a hierarchical architecture inspired by the well-known simple-to-complex-cells model of Hubel & Wiesel [66,67]. There have been several extensions of the HMAX model, improving its feature-selection process (e.g. [37]) or adding new processing layers to the model [68]. The HMAX model used here adds three more layers (ending at S4) on top of the complex-cell outputs of the V1 model described above. The model has alternating S and C layers. S layers perform a Gaussian-like operation on their inputs, and C layers perform a max-like operation, which makes the output invariant to small shifts in scale and position. We used the freely available version of the HMAX model (http://cbcl.mit.edu/software-datasets/pnas07/index.html). All simple and complex layers up to the S4 layer were included.
Note: the HMAX model used in [18] was a pre-trained version of the HMAX model; in this study, however, we trained the HMAX model using a dataset that contains 442 animate and 442 inanimate objects, so the obtained RDMs are different.
Sparse Localized Features (SLF_UT). This is a biologically motivated model based on the HMAX C2 features. The model introduces sparsified and localized intermediate-level visual features [33]. We used the Matlab code available for these features (http://www.mit.edu/~jmutch/fhlib/); the default model parameters were used.
GMAX_ST. This model is an extension of the HMAX model's C2 features in which the authors used feedback from the classification layer (analogous to PFC) to extract informative visual features. Their method uses an optimization algorithm (a genetic algorithm) to select informative patches from a large pool of patches [35]. Using the genetic algorithm, a subset of patches that gives the best categorization performance is selected. A linear SVM classifier was used to calculate the categorization performance; in other words, in the training phase of the model the categorization performance is used as the fitness function of the genetic algorithm. To run this model we used the same set of model parameters suggested in [35]. In the process of finding optimal patches with the optimization algorithm, we used a random subset of the 884 training images described before.
Stable Model_UT. This is another biologically motivated model, which has a hierarchy of simple to complex cells. The model uses the adaptive resonance theory (ART) mechanism [69] for extracting informative intermediate-level visual features. This makes the model stable against forgetting previously learned patterns [36]. Similar to the HMAX model, it extracts C2-like features, except that in the training phase it only selects the most active C2 units as prototypes that represent the input image. This is done using top-down connections from the C2 layer to the C1 layer. The connections match the C1-like features of the input image to the prototypes of the C2 layer. The matching degree is controlled by a vigilance parameter that is fixed separately on a validation set. We set the model parameters as suggested by the authors, except that instead of using all patch sizes, we used patches of size 12, which made the output RDM more correlated with the brain RDMs. It is also shown in [36] that patches of size 12 make the model more stable; furthermore, when using patches of size 12, the model performs better in the face/non-face classification task [36].
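To make the simple-to-complex (S/C) alternation shared by the V1 model and the HMAX-family models described above concrete, the sketch below implements a toy S1/C1 stage: Gabor filtering ("simple cells") followed by local MAX pooling ("complex cells"). The filter parameters and pooling neighbourhood are illustrative assumptions, not the values of any implementation used in this study.

```python
# Toy S1/C1 sketch in the spirit of HMAX-style models: Gabor "simple cells"
# followed by local MAX pooling ("complex cells"). Parameters are illustrative.
import numpy as np
from scipy.signal import fftconvolve
from scipy.ndimage import maximum_filter
from skimage.filters import gabor_kernel

def s1_c1_features(image, orientations=(0, 45, 90, 135), frequencies=(0.1, 0.2),
                   pool_size=8):
    """image: 2-D float array (luminance). Returns a concatenated C1 feature vector."""
    c1_maps = []
    for theta_deg in orientations:
        for freq in frequencies:
            kernel = np.real(gabor_kernel(freq, theta=np.deg2rad(theta_deg)))
            s1 = np.abs(fftconvolve(image, kernel, mode="same"))   # simple-cell energy
            c1 = maximum_filter(s1, size=pool_size)                # local MAX pooling
            c1 = c1[::pool_size, ::pool_size]                      # subsample
            c1_maps.append(c1.ravel())
    return np.concatenate(c1_maps)
```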




Supervised HMAX_ST. We used this approach to remove non-discriminative patches from the HMAX model. After training the HMAX model with the training images of animates and inanimates, the extracted patches were divided into two clusters using k-means clustering. One cluster represented the patches extracted from animate images, and the other cluster represented the patches extracted from inanimate images. Then, in order to remove the non-discriminative patches (i.e. patches that do not distinguish between animates and inanimates), those patches that were extracted from the animate images but fell nearer to the center of the inanimate cluster were removed. Similarly, the patches that were extracted from the inanimate images but fell nearer to the center of the animate cluster were removed. The remaining patches were used for the test phase.
Combination of all not-strongly-supervised models (combi27). This is the concatenation of the features extracted by all of the above-mentioned models. Given an input stimulus, features from all of the above-mentioned models were extracted. Because the dimension of the extracted features differs across models, we used principal component analysis (PCA) to reduce the dimension of all of them to a common number. We used the first 95 PCs from each of the models and concatenated them into a vector (95 was the largest possible number of PCs we could use, because we had 96 images, so the covariance matrix has only 95 non-zero eigenvalues). Therefore, the combi27 feature vector for each image has length 95×27 = 2565.
For some of the above-mentioned models that had a hierarchical architecture, we made an RDM for each of the stages in the hierarchy, as well as an RDM from the concatenation of the model representation across all stages.

Fitting of category-cluster RDMs to model and brain RDMs
Ten category-cluster RDMs (Figure S5) were created as predictors for a linear model of each RDM. The category clusters were: animate, inanimate, face, human face, non-human face, body, human body, non-human body, natural inanimate, and artificial inanimate. To measure the clustering strength for each of the categories in each brain and computational-model RDM, we fit the category-cluster RDMs to each brain and computational-model RDM, minimizing the sum of squared dissimilarity deviations (Figure 3).
The design matrix for the least-squares fitting was created from the ten category RDMs (each RDM was vectorized to form a column of the design matrix) with the addition of a constant vector of 1s (confound-mean RDM). The category model RDMs were then fitted to the object-vision model RDMs. Bars in Figure S6 show the fitted coefficients (beta values). Standard errors and p values are based on bootstrapping of the stimulus set. For each bootstrap sample of the stimulus set, a new instance is generated for the reference RDM (e.g. the hIT RDM) and for each of the candidate RDMs (e.g. the model RDMs). We did stratified resampling, which means that the proportion of categories was the same across all bootstrap resamples. Because bootstrap resampling is resampling with replacement, the same condition can appear multiple times in a sample. This entails 0 entries (from the diagonal of the original RDM) in off-diagonal positions of the RDM for a bootstrap sample. These zeros are treated as missing values and excluded from the dissimilarities across which the RDM correlations are computed. The number of bootstrap resamplings used in the bootstrap tests was 10,000.

Weighting model features
Remixing of features. For the not-strongly-supervised models as well as the deep supervised model representations, we attempted to create new features as linear combinations of the original features that specifically emphasize the missing categorical divisions. For the not-strongly-supervised models, we used the combi27 features to find these linear combinations. Three linear support-vector-machine (SVM) classifiers for body/nonbody, face/non-face, and animate/inanimate categorization were trained on a set of 884 labeled images of isolated objects nonoverlapping with the set of 96 images. We then used the decision-value outputs of the classifiers as new features. The resulting single-feature RDMs for the not-strongly-supervised models are shown in Figure 5, top (one RDM for each SVM). For the deep supervised model, we used features from layer 7 to find linear combinations that emphasize the categorical divisions. The resulting single-feature RDMs for the deep supervised model are shown in Figure 10, top.
Reweighting of features. We tested whether appropriate weighting of the combination of the original model features and the new features learned by remixing could further improve the explanation of the IT geometry. We did the reweighting for both the not-strongly-supervised model features and the deep supervised model features. For the not-strongly-supervised models, in addition to the 27 not-strongly-supervised models, we included the combi27 model and the three categorical SVM discriminants (learned through remixing) in the set of representations to be combined. We fitted one weight for each of these representations (27+1+3 = 31 weights in total), so as to best explain the hIT RDM. Figure 5, middle row, shows the weights obtained for each of the model representations. For the deep supervised model representations, we weighted the combination of all eight layers of the deep convolutional network and the three categorical SVM discriminants obtained by remixing the deep supervised features. Note that one weight is learned for each layer and for each of the SVM discriminants (8+3 = 11 weights in total). Figure 10, middle row, shows the weights obtained for each of the layers of the deep convolutional network and the SVM discriminants.
We used a non-negative least-squares fitting algorithm [51] to find the non-negative weights for the models that minimize the sum of squared deviations between the hIT RDM and the RDM of the weighted combination of models.
The RDM of the weighted combination of the model features is equivalent to a weighted combination of the RDMs of the models when the squared Euclidean distance is used. We used the squared Euclidean distance for normalized representational patterns, which is equivalent to the correlation distance used throughout this paper. We therefore applied the non-negative least-squares algorithm at the level of the RDMs. This procedure is expressed in equations (1) and (2). Equation (1) states that the squared distance between weighted model features equals the weighted squared distance of the features:

\[
\left[\,w_k f_{k,l}(i) - w_k f_{k,l}(j)\,\right]^2 = \left[\,f_{k,l}(i) - f_{k,l}(j)\,\right]^2 w_k^2 \qquad (1)
\]

where w_k is the weight given to model k and f_{k,l}(i) is the lth feature extracted by model k for stimulus i.
Equation (2) shows how each of the n model representations is weighted by minimizing the sum of squared deviations between the hIT RDM and the RDM of the weighted combination of model representations:

\[
\mathbf{w} = \underset{\mathbf{w}\,\in\,\mathbb{R}_+^{n}}{\operatorname{arg\,min}} \; \sum_{i \neq j} \left( d_{i,j} - \sum_{k=1}^{n} \sum_{l=1}^{m_k} \left[\,f_{k,l}(i) - f_{k,l}(j)\,\right]^2 w_k^2 \right)^{2} \qquad (2)
\]

where d_{i,j} is the distance between stimuli i and j in the hIT RDM, and w is the weight vector that minimizes the sum of squared errors between the pairwise dissimilarities of the stimuli in the hIT representation and the pairwise dissimilarities of the weighted combination of model features. k ranges from 1 to n, where n is the number of model representations to be weighted, and m_k is the number of features of model k.
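To make this fitting procedure concrete, the sketch below fits non-negative weights at the level of the RDMs, as described above. It is a minimal illustration using scipy's nnls solver (not the original Matlab implementation of [51]); each fitted weight plays the role of w_k² in equation (2), and variable names are placeholders.

```python
# Minimal sketch (assumed setup, not the original code): fit non-negative weights
# so that a weighted sum of model RDMs approximates the hIT RDM (equation 2).
import numpy as np
from scipy.optimize import nnls

def upper_triangle(rdm):
    """Vectorize the upper triangle (excluding the diagonal) of a square RDM."""
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]

def fit_rdm_weights(model_rdms, target_rdm):
    """model_rdms: list of (n_stim x n_stim) squared-distance RDMs; target_rdm: hIT RDM."""
    X = np.column_stack([upper_triangle(r) for r in model_rdms])  # one column per model
    y = upper_triangle(target_rdm)
    weights, residual = nnls(X, y)   # non-negative least squares; weights ~ w_k**2
    return weights

# The reweighted-combination RDM is then the weighted sum of the model RDMs:
# rdm_combined = sum(w * r for w, r in zip(weights, model_rdms))
```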




To avoid overfitting to the image set, we fitted the weights to random subsets of 88 of the 96 images in a cross-validation procedure, holding out 8 images on each fold. The representational dissimilarities for the weighted-combination model were then estimated for the 8 held-out images. This procedure was repeated until the pairwise dissimilarities for the entire 96×96 RDM had been estimated.

IT-geometry model
The IT-geometry-supervised models (i.e. the IT-geometry-supervised combi27 and the IT-geometry-supervised deep convolutional network) are made by remixing and reweighting the model features. For the IT-geometry-supervised combi27, only the not-strongly-supervised models were used for remixing and reweighting; for the IT-geometry-supervised deep convolutional network, only the deep supervised model representations were used for remixing and reweighting.
For both, as explained above in the context of remixing and reweighting, we trained three SVM classifiers for animate/inanimate, face/nonface, and body/nonbody classification using the 884 training images. The SVM classifiers were then fed the 96 stimuli, and we used the SVM decision values as new features. Non-negative least-squares fitting was then used to find the optimal weights for the model representations and the SVM discriminant features, so as to minimize the sum of squared errors between the RDM of the weighted combination of features and the hIT RDM.
To make the IT-geometry-supervised RDM, which is a weighted combination of the model representations and the SVM discriminants, we fit the non-negative weights by cross-validation over the stimulus set. Each time, we randomly left out 8 stimuli (4 animates and 4 inanimates) from the set of 96 and learned the optimal weights on the remaining 88 images, so as to minimize the sum of squared errors between the RDM of the weighted combination of features and the hIT RDM. Note that the hIT RDM and the model RDMs become 88×88 (not 96×96) because 8 stimuli are left out. The obtained weights were then applied to weight the model features for the left-out stimuli. The result is an 8×8 weighted RDM containing the pairwise dissimilarities for the left-out stimuli. This procedure was repeated until we had cross-validated pairwise dissimilarities for all 96 stimuli.

Categorization performance of models
We calculated the categorization performance of the object-vision models in the following categorization tasks: animates vs. inanimates (Figure 11), faces vs. bodies (Figure S11B), and artificial inanimates vs. natural inanimates (Figure S11A). For each of the models, an SVM classifier [70] with a linear kernel was trained using k-fold cross-validation (k = 12). The 96 stimuli were randomly partitioned into k = 12 equal-size folds. Of the k folds, a single fold was retained as validation data for testing the model's categorization performance, and the remaining k−1 folds were used as training data. The cross-validation process was then repeated k times, with each of the k folds used exactly once as the validation data. The k results from the folds were then averaged.
For each of the categorization tasks the SVM was trained in the following way:
a) For the animate vs. inanimate categorization task, we had 96 stimuli. We left out 8 stimuli (4 animates and 4 inanimates) as the validation data, and the SVM was trained using the remaining stimuli.
b) For the face vs. body categorization task, we had 48 stimuli. We left out 4 stimuli (2 faces and 2 bodies) as the validation data, and the SVM was trained using the remaining stimuli.
c) For the artificial vs. natural inanimate categorization task, again we had 48 stimuli. We left out 4 stimuli (2 artificial and 2 natural inanimates) as the validation data, and the SVM was trained using the remaining stimuli.
To see whether a model's categorization performance differed significantly from chance, we performed a permutation test by retraining the models after category-orthogonalized permutation of the labels.

Representational similarity analysis (RSA)
RSA enables us to relate representations obtained from different modalities (e.g. computational models and fMRI patterns) by comparing the dissimilarity patterns of the representations. In this framework, representational dissimilarity matrices (RDMs) are used to link the different modalities. An RDM is a square symmetric matrix in which the diagonal entries reflect comparisons between identical stimuli and are 0, by definition. Each off-diagonal value indicates the dissimilarity between the activity patterns associated with two different stimuli. An RDM summarizes the information carried by a given representation in a brain area or a computational model.
We had 96 stimuli, of which half were animates and the other half were inanimates. To calculate the RDM for a brain region or a computational model, a 96×96 matrix was made in which each cell was filled with the dissimilarity between the response patterns elicited by two stimuli. For each pair of stimuli, the dissimilarity measure was 1 minus the Pearson correlation between the response patterns elicited by those stimuli in a brain region or a computational model.

Kendall τA (tau-a) correlation and noise ceiling
To judge the ability of a model RDM to explain a brain RDM, we used Kendall's rank correlation coefficient τA (the proportion of pairs of values that are consistently ordered in both variables). When comparing models that predict tied ranks (e.g. category-model RDMs) to models that make more detailed predictions (e.g. brain RDMs, object-vision model RDMs), Kendall's τA correlation is recommended: in these cases τA is more likely than the Pearson and Spearman correlation coefficients to prefer the true model over a simplified model that predicts tied ranks for a subset of pairs of dissimilarities. For more information, please refer to the RSA Toolbox paper [30]. This is the first toolbox to implement RSA. It is a modular, workflow-based toolbox that supports an analysis approach that is simultaneously data- and hypothesis-driven. A set of "Recipe" functions in the toolbox allows automatic ROI analysis as well as whole-brain searchlight analysis. Tools for visualization and inference enable the user to relate sets of models to sets of brain regions and to statistically test and compare the models using nonparametric inference methods.
Figure 2 shows the τA correlations of the hIT/mIT RDMs with the model RDMs. To estimate significance, randomization and bootstrap tests were used. Randomization tests permute the stimulus labels, whereas bootstrap tests bootstrap-resample the set of conditions.
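The sketch below shows one way to compute a correlation-distance RDM from a stimulus-by-feature (or stimulus-by-voxel) response matrix and to compare two RDMs with Kendall's τA. The τA implementation is a direct O(n²) version written for clarity; it is an illustration, not the RSA Toolbox code used for the published analyses.

```python
# Sketch of RDM construction (1 - Pearson correlation) and Kendall tau-a comparison.
# Written for clarity; the analyses in this paper used the RSA Toolbox [30].
import numpy as np

def compute_rdm(responses):
    """responses: (n_stimuli x n_features) array. Returns an (n_stimuli x n_stimuli) RDM."""
    return 1.0 - np.corrcoef(responses)          # correlation distance

def upper_triangle(rdm):
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]

def kendall_tau_a(x, y):
    """(concordant - discordant) pairs divided by all n*(n-1)/2 pairs (ties count 0)."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    numerator = 0.0
    for a in range(n - 1):
        numerator += np.sum(np.sign(x[a + 1:] - x[a]) * np.sign(y[a + 1:] - y[a]))
    return numerator / (n * (n - 1) / 2)

# Example: tau = kendall_tau_a(upper_triangle(model_rdm), upper_triangle(brain_rdm))
```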




The noise in the brain-activity data limits the amount of dissimilarity variance that a model RDM can explain. An estimate of the noise ceiling was therefore needed to indicate how much variance of a brain RDM, given the noise level, would be expected to be explained by an ideal model RDM (i.e. a model RDM that is able to perfectly capture the true dissimilarity structure of the brain RDM).
The noise ceiling in Figure 2A is shown by a gray horizontal bar. The upper and lower edges of this bar correspond to upper- and lower-bound estimates of the group-average correlation with the RDM predicted by the unknown true model. There is a hard upper limit to the average correlation with the single-subject reference-RDM estimates that any RDM can achieve for a given data set. Intuitively, the RDM maximizing the group-average correlation lies at the center of the cloud of single-subject RDM estimates. To find an upper bound, we averaged the rank-transformed single-subject RDMs and used an iterative procedure to find the RDM that has the maximum average Kendall's τA correlation with the single-subject RDMs. This average RDM can be thought of as an estimate of the true model's RDM. This estimate is overfitted to the single-subject RDMs; its average correlation with the latter therefore overestimates the true model's average correlation, thus providing an upper bound. To estimate a lower bound, we employed a leave-one-subject-out approach: we computed each single-subject RDM's correlation with the average of the other subjects' RDMs. This prevents overfitting and underestimates the true model's average correlation because the amount of data is limited, thus providing a lower bound on the ceiling. For more information about the noise ceiling, please refer to the toolbox paper [30]. We did not estimate a noise ceiling for the cell-recording data, because our procedure requires several individuals to be measured and we only had data for two monkeys.

Equating the noise level in the models and the human IT
To compare the categoricality of the models with the categoricality of human IT, we added Gaussian noise to the models to equate the level of noise in the models with that of the fMRI data. To this end, we averaged the pairwise correlations between the IT RDMs of the four human subjects; let us denote the obtained value by q. To add the same amount of noise to the models, we then iteratively and increasingly added noise to the model outputs until they reached the same level of noise as human IT. The procedure for each model was as follows. We made new instantiations of the model by adding random Gaussian noise to the model output. We did this four times for each model, giving four noisy instantiations per model. We then made four model RDMs from the noisy model features and calculated the mean of their pairwise correlations, which we denote by q_m. If this mean equals the mean pairwise correlation between the four hIT RDMs, q (i.e. |q − q_m| < 10⁻³), we stop the iteration; otherwise the procedure is repeated, and in each iteration the noise added to the model output is updated. At the end, when the stopping criterion is satisfied, the four model RDMs are averaged and used as the noise-equated model RDM.

Stimuli and response measurements
We used the experimental stimuli from Kriegeskorte et al. [7]. The stimuli were 96 images, of which half were animates and the other half were inanimates. The animate cluster consisted of faces and bodies, and the inanimate cluster consisted of natural and artificial inanimates.
For the cell-recording data, we had 92 stimuli. To make the 92×92 RDMs comparable with the 96×96 RDMs, we made a 96×96 RDM from each 92×92 RDM by filling the gaps with NaN.
The fMRI and cell-recording data used here have been previously described and analyzed to address different questions. See [6,7,18] for further experimental details.

Discussion

Computer vision has made great strides in recent years. Early attempts to achieve vision by fitting generative graphics models to images faltered because of the exponential complexity of the search space. However, computer vision made progress in practical applications using hand-engineered feedforward features in combination with machine-learning classifiers. In recent years, the advent of efficient training algorithms for deep neural networks [40,71,72,41] has made it possible to learn from image data not just the final classification step, but also the internal representations. This approach has yielded unprecedented object-recognition performance, reaching levels comparable to humans on certain tasks (e.g. [41,73]).
These new deep vision models share certain features, some of which parallel the primate visual system. First, they are feedforward hierarchical models: they are composed of a series of stages of representation, where each stage is computed from the output of the previous stage. Moderate modifications of this scheme with bypass connections are also sometimes used. Second, each stage is composed of features, which are linear filters of the previous stage followed by a static nonlinearity. The nonlinearity is key to the representational power of these networks, because a sequence of linear transforms would reduce to a single linear transformation. Third, they are convolutional [40], computing each linear feature of the input at all visual-field locations. This architectural constraint reduces the effective number of parameters and automatically confers translation invariance. Fourth, they compute visually local features with receptive-field sizes increasing from stage to stage, thus gradually transforming a space-based, image-like representation into a space-insensitive, shape-based and semantic representation. Fifth, they are deep, typically including four or more layers of representation. Even shallow neural networks can approximate any nonlinear mapping from input to output; however, deep networks can find concise representations (requiring fewer units) of complex functions. This is essential to make them realistic in terms of both physical implementation and learnability [72]. Sixth, they are trained with many category-labeled example images, typically more than a million (e.g. [41]).
A few studies have begun to compare recognition performance and internal representations between these models and primate IT. These investigations have so far given largely convergent results. First, models that perform better at object recognition tend to have representations more similar to IT [74–76]. Second, the new deep supervised models perform at unprecedented levels at predicting the IT representation [75–77].
Our exploration here placed deep supervised models in the context of a wide range of computer-vision features, revealing the extent to which each of these computational mechanisms can explain the IT representational geometry in human and monkey. In addition, we analyzed the degree to which each of the models emphasizes various categorical divisions. Our results, spanning the gamut from unsupervised to strongly supervised models, suggest that strong supervision with many category-labeled images is essential for building features that explain the IT representational geometry. The not-strongly-supervised models were significantly less categorical than IT, and this was part of the reason why they failed to explain the IT representational geometry. In addition, IT appears to have a particular categorical geometry. This is consistent with the idea that IT is visuo-semantic, representing visual features including shape, but also imposing categorical divisions (or emphasizing semantic dimensions) that are relevant to the organism's survival and reproduction.




We find strong similarities between the representational geometries of a deep supervised model and IT (see also [76]). This is important because it suggests that deep supervised models capture something essential about the IT representation. However, the fact that our IT-geometry-supervised deep representation fully explains our IT data should not be overinterpreted.
First, these models operate in a feedforward fashion, and do not capture the recurrent dynamics in the visual hierarchy. This component of visual processing might be sufficient for "core object recognition" [78], i.e. rapid recognition at a glance. However, vision provides us with a much more complex appreciation of our surroundings and supports a wide array of tasks. Biological vision involves recurrent processing as well as active exploration of the scene with attentional shifts and eye movements. In the present experiments, stimuli were presented for 105 ms (monkeys) and 300 ms (humans), and eye movements and object-related attentional processes were minimized by using fixation tasks. The experiments were thus designed to focus on automatic, task-independent processing. However, recurrent processing is nevertheless likely to have contributed to the emergence of the IT representation. Indeed, recent human magnetoencephalography studies using the same stimulus set [11,12] suggest that the major categorical divisions take slightly longer to emerge than a purely feedforward account would predict. This evidence is not unequivocally localized to IT and thus should be interpreted with caution. However, the categorical clustering achieved in a purely feedforward fashion in the deep supervised model considered here might be achieved with some degree of involvement of recurrent computations in the brain.
Second, the IT-geometry-supervised model needed to be explicitly trained to emphasize the same categorical divisions as IT. The analysis of our human and monkey data in [77] similarly found that a hierarchical feedforward model optimized for invariant object recognition could account for the IT representational geometry only when linear readout features emphasizing the appropriate categorical divisions were fitted to the data. On the one hand, our study suggests that visual similarity (as operationalized by the wide range of unsupervised visual features we investigated) cannot explain the categorical clustering. On the other hand, it begs the question why IT emphasizes the particular divisions between faces and bodies and between animates and inanimates, while deemphasizing other divisions (such as the one between human and animal faces).
Third, our study is limited by the image set. All objects were centered on fixation and presented in isolation on a gray background at the same retinal size. The stimulus set, thus, was not challenging in terms of position, size, and clutter invariance. However, the IT representation has been shown to be less sensitive to changes of position, size, and context than earlier stages of processing [79]. Cadieu et al. (2014) [76] used an image set with substantial variations of position, size, and clutter to compare the representation in the same deep supervised model to monkey-IT data and found the categorical clustering to be robust to these variations. Although our image set did not vary position, size, and clutter, note that it covered a broad range of categories and, within each category, there was substantial variation among the exemplars in terms of both their intrinsic properties and accidental properties of their appearance, including pose and lighting. The wide exemplar variations within broad categories like animates might present an even more difficult challenge than varying position and size. The human face photos were mostly frontal and therefore visually similar (as reflected in the clustering of human faces in many of the unsupervised feature models). However, the variation among the animal faces and among the exemplars within the animate and inanimate categories was substantial. Taken together, current evidence suggests that the categorical clustering we observed is not an artefact of the stimulus set.
Finally, our data set was affected by noise and intersubject variation. The human fMRI data, for which we were able to compute the noise ceiling, was from 8 sessions (2 sessions in each of four subjects [7]). The fact that the IT-geometry-supervised deep model fully explained the representational geometry of IT does not mean that its representation is identical to IT, but just that given noise and intersubject variability it is not significantly different. Future studies should use more comprehensive data sets to reveal remaining representational discrepancies between IT and deep supervised models.

What does it mean for a representation to be "categorical" or "semantic"?
The IT representation has been described as categorical by some authors [7] and as a visual shape space by others [23]. How should a "categorical" representation be defined in this context? One meaning of categoricality refers to the degree to which categorical divisions are explicit in the representation. The images themselves (and their retinal representations) clearly contain category information. However, this information is not explicit. Instead it requires a highly nonlinear readout mechanism commonly referred to as object recognition. An explicit representation is sometimes defined [78,80] as one that enables linear readout of the category dichotomy. Since linear readout is a trivial one-step operation in a biological neuronal network, this definition of "explicit" is arguably only slightly broader than requiring single-cell step-like responses encoding the category dichotomy. Linear discriminability does not require that the categories form separate clusters in the representational space. A bimodal distribution in representational space, with two clusters corresponding to the categories and divided by a margin or region of lower density, could be considered to be an even more explicitly categorical representation than one that merely enabled linear readout.
Defining categoricality as the degree to which category information is explicit (as all the above definitions do) may be useful in some contexts. However, it misses a crucial point. Depending on the nature of the images and categories, "explicit" category representations could be observed in: (1) pixel images or color histograms, (2) simple computational features (e.g. Gabor filters or gist features), (3) more complex unsupervised features (e.g. HMAX features). If the features happen to be sufficiently correlated with a categorical division, these "visual" representations would be considered explicitly categorical by the above definitions. This illustrates the difficulty of drawing a clear line between visual and categorical (or semantic) representations.
We would rather not refer to the representation as "categorical" when the categories are already separated in the distribution of the sensory input patterns. We therefore suggest a criterion distinct from category explicitness as the defining property of a categorical representation. A representation is "categorical" when it affords better category discriminability than any feature set that can be learned without category supervision, i.e. when it is designed to emphasize categorical divisions. A categorical representation in this sense can be interpreted as serving the function to emphasize behaviorally relevant categorical divisions or semantic dimensions.
A category is a discrete semantic variable. A semantic representation could also include continuous variables that describe visual objects. Categorical clusters in the representational space do not require discrete categorical variables. A sufficient prevalence of continuous semantic variables that are correlated




with a given categorical division could also produce categorical clusters. Future studies should investigate in greater detail whether the semantic component of the IT representation is better accounted for by categorical or continuous semantic dimensions.

The IT representation appears to be both visual and semantic
Several studies suggested that the IT representation is not purely visual but also semantic [7,14,22,81]. Our study provides additional support for this claim by showing that IT exhibits significantly stronger category clustering than a wide range of unsupervised models. It is impossible to prove that no visual feature model built without category-label supervision can explain the IT representation. However, our current interpretation is that IT reflects knowledge of category boundaries or semantic dimensions, and is thus not purely visual.
This finding may appear to contradict a previous study suggesting that the IT representation is better accounted for by visual shape than by semantic category [23]. Note, however, that the representation of visual shape in IT is uncontroversial. A better account on the basis of visual shape does not preclude an additional semantic component. There is clearly a continuum between visual and semantic, between the representation of the appearance and the representation of the behavioral significance of an object. Our working hypothesis is that the function of the primate ventral stream is to achieve this transformation. Intermediate-level features detecting parts of objects (e.g. eyes, noses, ears) might provide a stepping stone toward semantics and could lead to clustering of faces and animates [82,83].
Recognition requires abstracting from several sources of within-category variation among object images. One source of variation lies in the accidental properties [84] of the appearance of the object, such as its pose, distance, and lighting. Another source of within-category variation are the substantial differences between exemplars. In our study, the winning model was supervised with category-labeled images, learning to abstract from both of these sources of variation. It would be interesting to investigate whether training a representation to abstract from accidental properties only with exemplar-label supervision (where multiple images of the same particular object have the same exemplar label) can also produce a representation similar to IT. To our knowledge, however, the previous studies [75,76] that investigated accidental property variation in greater detail also required category-label supervision to derive representational geometries resembling that of IT.

How do biological brains acquire categorical divisions?
In this study, we were looking to discover a model of the mechanism of biological object vision. We did not attempt to model the developmental process that builds that mechanism. Creating a viable model of IT appeared to require supervised learning. How might biological development implement this process? Biologically plausible implementations of backpropagation and related rules for supervised learning have been proposed (e.g. [85]). However, it is unclear what supervision signal such a process would use. What is the equivalent of the category labels in the biological development of the IT representation? One possibility is that the perceptual and behavioral context provides the equivalent of the supervision signal in natural development. For example, visual images appearing in the same temporal context will often represent the same object in different retinal positions, poses, distances, and sizes. It has been argued that invariance to accidental properties can be learned from temporal proximity in natural experience [86–88]. Different visual images in the same temporal context will also tend to represent the same scene. A biological learning mechanism that associates visual inputs that tend to co-occur with similar representational patterns would learn features that are more stable across time, abstracting from rapidly changing aspects of visual appearance. Moreover, objects present in a given scene might tend to be semantically related. Such a mechanism might therefore even learn semantic features.
Another way that context might provide a stepping stone toward a semantic representation is through perceptual channels beyond the current retinal image. Natural perception provides a rich multimodal and dynamic stream of information. Distinct visual patterns associated with similar context percepts might come to be represented together in the representational space. For example, visual motion is associated with animacy [89], so dissimilar shapes associated with the same visual motion patterns might come to be co-located in the representational space.
The argument from context can be extended to other sensory modalities (e.g. the same sound associated with two distinct visual stimuli), and to behavioral and social context, which might contain signals correlated with the categories of the objects present in the scene [90]. Visually dissimilar stimuli may be associated with the same linguistic utterances of contemporaries, or with the same physical actions [91] or emotional states. Finally, the cognitive context, including conscious inferences based on our perception of the current scene and behavioral goals, might influence the development of the IT representation through feedback signals from frontal regions that provide an endogenous context to natural visual experience.
An unsupervised learning process that receives such context signals alongside the visual input would be expected to cluster percepts that are similar in this more complex multimodal input space. The resulting representational clusters might then persist when the context is removed from the input and only static visual shapes are presented, as in our experiments. The argument from context illustrates how the distinction between supervised and unsupervised learning, which is clearly defined in computer science, is blurred for biological brains. Unsupervised learning from a richly contextualized sensory input might achieve a result similar to that of supervised learning.

Explaining the IT representation requires considering what it is for
The ultimate purpose of vision is not to provide a veridical representation of our visual environment, but to support successful behavior. An explanation of the IT representation, then, requires consideration of behavioral affordances. It appears plausible that any primate faced with an unknown object might want to determine whether it is animate with high priority. Similarly, faces are important to recognize because they confer a host of information that renders animates somewhat more predictable. In computational modelling, such behavioral affordances can be brought in by optimizing the representations for particular categorization tasks, using supervised training. Such task-specific performance optimization appears essential to explaining IT. Models with higher recognition accuracy better explained not only the categorical clusters, but also the within-category representational geometries observed in IT.
Our results suggest that the IT representation is visuo-semantic. Explaining IT requires consideration of the perceptual and cognitive context and of behavioral affordances. Through phylo- and ontogenesis, IT appears to have learned to emphasize certain behaviorally important divisions that transcend visual appearance
and relate to the meaning of objects in the context of the organism's survival and reproduction.
Supporting Information

Figure S1 The not-strongly-supervised models best explaining EVC (A), FFA (B), LOC (C), and PPA (D). This figure shows the most correlated model RDMs (from left to right and top to bottom) with the EVC (A), FFA (B), LOC (C) and PPA (D) RDMs. Biologically motivated models are set in black font, and computer-vision models are set in gray font. Models with the subscript 'UT' are unsupervised-trained models, and others without a subscript are untrained models. The number below each RDM is the Kendall τA correlation coefficient between the model RDM and the respective brain RDM. All correlations are statistically significant, except those that are shown by 'ns'. Correlation p-values are reported in Table 1.
(TIF)

Figure S2 Kendall's τA RDM correlation of the not-strongly-supervised models with EVC (A), FFA (B), LOC (C), and PPA (D). The bars show the Kendall τA RDM correlation between the not-strongly-supervised model RDMs and EVC (A), FFA (B), LOC (C) and PPA (D). The error bars are standard errors of the mean estimated by bootstrap resampling. Asterisks across the x-axis show the p-values obtained by a random permutation test based on 10,000 randomizations of the condition labels (ns: not significant, p<0.05: *, p<0.01: **, p<0.001: ***, p<0.0001: ****). These p-values assess the relatedness of different model RDMs with a brain RDM. The grey horizontal rectangle shows the noise ceiling.
(TIF)
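The caption above summarizes how model RDMs were compared to brain RDMs: Kendall's τA rank correlation between the RDMs' upper triangles, a condition-label randomization test for relatedness, and bootstrap resampling of the stimulus set for error bars. The sketch below is a minimal, illustrative re-implementation of those three steps (the published analyses used the RSA toolbox [30]); the function names, the naive quadratic τA implementation, and the reduced default iteration counts are assumptions made for brevity.

    import numpy as np

    def upper_triangle(rdm):
        """Vectorize the upper triangle of a square RDM (diagonal excluded)."""
        i, j = np.triu_indices(rdm.shape[0], k=1)
        return rdm[i, j]

    def kendall_tau_a(a, b):
        """Kendall's tau-a: (concordant - discordant pairs) / (n*(n-1)/2).
        Naive O(n^2) implementation; adequate for illustration only."""
        n = len(a)
        i, j = np.triu_indices(n, k=1)
        return np.sum(np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])) / (n * (n - 1) / 2)

    def rdm_tau_a(model_rdm, brain_rdm):
        return kendall_tau_a(upper_triangle(model_rdm), upper_triangle(brain_rdm))

    def label_randomization_p(model_rdm, brain_rdm, n_perm=1000, seed=0):
        """Relatedness test: randomly re-order the conditions of the brain RDM.
        (The paper reports 10,000 randomizations; fewer are used here for speed.)"""
        rng = np.random.default_rng(seed)
        observed = rdm_tau_a(model_rdm, brain_rdm)
        n = brain_rdm.shape[0]
        null = np.array([rdm_tau_a(model_rdm, brain_rdm[np.ix_(p, p)])
                         for p in (rng.permutation(n) for _ in range(n_perm))])
        return observed, (np.sum(null >= observed) + 1) / (n_perm + 1)

    def bootstrap_se(model_rdm, brain_rdm, n_boot=1000, seed=0):
        """Standard error of the RDM correlation under bootstrap resampling of stimuli."""
        rng = np.random.default_rng(seed)
        n = brain_rdm.shape[0]
        taus = []
        for _ in range(n_boot):
            s = rng.integers(0, n, n)      # resample stimuli with replacement
            taus.append(rdm_tau_a(model_rdm[np.ix_(s, s)], brain_rdm[np.ix_(s, s)]))
        return np.std(taus)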
Figure S3 Kendall's τA RDM correlation of the deep convolutional network with EVC (A), FFA (B), LOC (C), and PPA (D). The bars show the Kendall τA RDM correlations between the layers of the deep supervised convolutional network and EVC (A), FFA (B), LOC (C) and PPA (D). The error bars are standard errors of the mean estimated by bootstrap resampling. Asterisks across the x-axis show the p-values obtained by a random permutation test based on 10,000 randomizations of the condition labels (ns: not significant, p<0.05: *, p<0.01: **, p<0.001: ***, p<0.0001: ****). The grey horizontal rectangles show the noise ceiling in each of the brain ROIs. The upper and lower edges of the gray horizontal bar are upper and lower bound estimates of the maximum correlation any model can achieve given the noise.
(TIF)

Figure S4 Different combinations of the not-strongly-supervised models. Each of the first four RDMs (A, B, C, D) was calculated by combining the internal representations of object-vision models for all images and then measuring the pairwise dissimilarity between the combined feature vectors. E and F are categorical model RDMs; F shows animate-inanimate category structure, and E comes with extra information about the within-animate category structure (i.e. face clusters). Underneath each RDM, the Kendall τA correlations of that RDM with the hIT and mIT RDMs are stated. The statistical significance of the correlations is shown by asterisks (p<0.05: *, p<0.01: **, p<0.001: ***, p<0.0001: ****). To estimate significance, a randomization test was used.
(TIF)

Figure S5 Ten category RDMs used as linear predictors in the RDM model. These ten category models and a confound mean (all-1) RDM were linearly combined to explain each of the brain and model RDMs (Figures 3, 4).
(TIF)

Figure S6 Clustering strength for different categories in IT and not-strongly-supervised models. We measured the strength of clustering for each of the categories (animate, inanimate, face, human face, non-human face, body, human body, non-human body, natural inanimates, and artificial inanimates) by least-squares fitting of a set of category-cluster RDMs (shown in Figure S5) to each brain and computational-model RDM. Bars in this figure show the fitted coefficients (clustering strengths). The higher the bar, the more tightly clustered are the objects in that category. Error bars show the 95% confidence interval of the coefficient estimates. Significance is shown in red (legend), corrected for 30 * 10 multiple comparisons. Standard errors and p values are based on bootstrapping of the stimulus set.
(TIF)
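Below is a minimal sketch of the kind of least-squares fit described in the Figure S6 caption: binary category-cluster RDMs, plus an all-1 confound RDM as in Figure S5, serve as linear predictors of a target RDM, and the fitted coefficients are read as clustering strengths. The binary coding of the predictors and the use of unconstrained ordinary least squares are assumptions made for illustration; they are not necessarily the exact design or solver used in the paper.

    import numpy as np

    def vectorize_rdm(rdm):
        i, j = np.triu_indices(rdm.shape[0], k=1)
        return rdm[i, j]

    def category_cluster_rdm(labels, category):
        """Binary predictor RDM: 0 for pairs within the category, 1 for all other pairs,
        so a larger (positive) coefficient means a more tightly clustered category."""
        member = np.array([lab == category for lab in labels], dtype=float)
        return 1.0 - np.outer(member, member)

    def clustering_strengths(target_rdm, labels, categories):
        """Least-squares fit of category-cluster RDMs plus a confound mean (all-1) RDM."""
        predictors = [vectorize_rdm(category_cluster_rdm(labels, c)) for c in categories]
        predictors.append(np.ones_like(predictors[0]))   # confound mean (all-1) RDM
        X = np.column_stack(predictors)
        y = vectorize_rdm(target_rdm)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return dict(zip(categories, beta[:-1])), X @ beta   # strengths, fitted dissimilarities

    # Hypothetical usage: strengths, _ = clustering_strengths(hIT_rdm, stimulus_labels,
    #     ["animate", "inanimate", "face", "body"])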
Figure S7 Category-clustering strengths of not-strongly-supervised models relative to hIT. For each of the categories (animate, inanimate, face, human face, non-human face, body, human body, non-human body, natural inanimates, and artificial inanimates), the difference in clustering strength between the models and hIT was measured. Bars show the difference in clustering strength between the models and hIT. Model clustering strengths that were significantly lower/higher than the hIT clustering strength are shown by blue/red bars (legend). Error bars show the 95% confidence interval of the difference in clustering strength estimates between the models and hIT. P values are based on bootstrapping of the stimulus set.
(TIF)

Figure S8 Category-clustering strengths of not-strongly-supervised models relative to mIT. For each of the categories (animate, inanimate, face, human face, non-human face, body, human body, non-human body, natural inanimates, and artificial inanimates), the difference in clustering strength between the models and mIT was measured. Bars show the difference in clustering strength between the models and mIT. Model clustering strengths that were significantly lower/higher than the mIT clustering strength are shown by blue/red bars (legend). Error bars show the 95% confidence interval of the difference in clustering strength estimates between the models and mIT. P values are based on bootstrapping of the stimulus set.
(TIF)

Figure S9 Categoricality in noise-less models compared with the categoricality in IT. Bars show categoricality (measured by the category clustering index, CCI) for each of the not-strongly-supervised models. The category clustering index (CCI) for each model and brain RDM is defined as the proportion of RDM variance explained by the category cluster model (Figure S5), i.e. the squared correlation between the fitted category-cluster model and the RDM it is fitted to. Error bars and shaded regions indicate 95%-confidence intervals. Significant CCIs are indicated by stars underneath the bars (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). Significant differences between the CCI of each model and the hIT/mIT CCI are indicated by blue/gray vertical arrows (p<0.05, Bonferroni-adjusted for 28 tests). The corresponding inferential comparisons for mIT are indicated by gray vertical arrows. The categoricality in hIT is significantly higher than in any of the 28 not-strongly-supervised models. This analysis is based on the noise-less model representations.
(TIF)
Figure S10 Categoricality in the noise-less representations of the deep convolutional network compared with hIT and mIT. Bars show categoricality (measured by the category clustering index, CCI) for each layer of the deep convolutional network and for the IT-geometry-supervised layer. For conventions and for the definition of the CCI, see Figure S9. Error bars and shaded regions indicate 95%-confidence intervals. Significant CCIs are indicated by stars underneath the bars (* p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001). Significant differences between the CCI of each model and the hIT/mIT CCI are indicated by blue/gray vertical arrows (p<0.05, Bonferroni-adjusted for 9 tests). The corresponding inferential comparisons for mIT are indicated by gray vertical arrows. Categoricality is significantly greater in hIT and mIT than in any of the internal layers of the deep convolutional network. However, the IT-geometry-supervised layer (remixed and reweighted) achieves a categoricality similar to IT. This analysis is based on the noise-less model representations.
(TIF)

Figure S11 Categorization accuracy of all models for natural/artificial (A) and face/body (B). Each dark blue bar shows the categorization accuracy of a linear SVM applied to one of the computational model representations. Categorization accuracy for each model was estimated by 12-fold crossvalidation on the 96 stimuli. To assess whether categorization accuracy was above chance level, we performed a permutation test, in which we retrained the SVMs on (category-orthogonalized) 10,000 random dichotomies among the stimuli. Light blue bars show the average model categorization accuracy for random label permutations. Categorization performance was significantly greater than chance for most models (ns: not significant, * p<0.05, ** p<0.01, *** p<0.001, **** p<0.0001).
(TIF)
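A minimal scikit-learn sketch of the readout described above: a linear SVM applied to a model's feature representation, scored by 12-fold cross-validation, with a label-permutation test as a chance reference. The regularization setting, the stratified fold scheme, and the use of plain label permutations instead of the paper's 10,000 category-orthogonalized dichotomies are simplifying assumptions.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    def cv_accuracy(features, labels, n_folds=12, seed=0):
        """Cross-validated accuracy of a linear SVM on a model's feature vectors."""
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        clf = LinearSVC(C=1.0, max_iter=10000)
        return cross_val_score(clf, features, labels, cv=cv).mean()

    def permutation_p_value(features, labels, n_perm=1000, seed=0):
        """Chance-level reference: accuracy for random relabelings of the stimuli.
        (Plain random permutations are used here to keep the sketch short.)"""
        rng = np.random.default_rng(seed)
        observed = cv_accuracy(features, labels)
        null = np.array([cv_accuracy(features, rng.permutation(labels))
                         for _ in range(n_perm)])
        return observed, (np.sum(null >= observed) + 1) / (n_perm + 1)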
Figure S12 Kendall's τA RDM correlation of the not-strongly-supervised models with the hIT animate (A) and inanimate (B) sub-clusters. The bars show the Kendall τA RDM correlations of the not-strongly-supervised models with the hIT RDM for animate images (A), and inanimate images (B). The error bars are standard deviations of the mean estimated by bootstrap resampling. Asterisks across the x-axis show the p-values obtained by a random permutation test based on 10,000 randomizations of the condition labels (ns: not significant, p<0.05: *, p<0.01: **, p<0.001: ***, p<0.0001: ****). The p-values assess the relatedness of different model RDMs with a brain RDM. The grey horizontal rectangles show the noise ceiling. Models with the subscript 'UT' are unsupervised-trained models, models with the subscript 'ST' are supervised-trained models, and others without a subscript are untrained models.
(TIF)

Figure S13 Kendall's τA RDM correlation of the not-strongly-supervised models with the mIT animate (A) and inanimate (B) sub-clusters. The bars show the Kendall τA RDM correlations of the not-strongly-supervised models with the mIT RDM for animate images (A), and inanimate images (B). The error bars are standard deviations of the mean estimated by bootstrap resampling. Asterisks across the x-axis show the p-values obtained by a random permutation test based on 10,000 randomizations of the condition labels (ns: not significant, p<0.05: *, p<0.01: **, p<0.001: ***, p<0.0001: ****). The p-values assess the relatedness of different model RDMs with a brain RDM. Models with the subscript 'UT' are unsupervised-trained models, models with the subscript 'ST' are supervised-trained models, and others without a subscript are untrained models.
(TIF)

Text S1 Models better explain the representation of the animate objects than the inanimate objects in IT.
(DOCX)

Acknowledgments

We would like to thank all those who kindly shared their model implementation with us. In particular Simon Stringer, Bedeho Mender, Benjamin Evans, Masoud Ghodrati, Karim Rajaei, Pavel Sountsov, and John Lisman, who kindly helped us to set up their code.

Author Contributions

Conceived and designed the experiments: SMKR NK. Performed the experiments: SMKR NK. Analyzed the data: SMKR NK. Wrote the paper: SMKR NK. Implemented the models: SMKR.

References
1. Desimone R, Albright TD, Gross CG, Bruce C (1984) Stimulus-selective properties of inferior temporal neurons in the macaque. J Neurosci 4: 2051–2062.
2. Gross CG (1994) How Inferior Temporal Cortex Became a Visual Area. Cereb Cortex 4: 455–469. doi:10.1093/cercor/4.5.455.
3. Tanaka K (1996) Inferotemporal Cortex and Object Vision. Annu Rev Neurosci 19: 109–139. doi:10.1146/annurev.ne.19.030196.000545.
4. Hung CP, Kreiman G, Poggio T, DiCarlo JJ (2005) Fast Readout of Object Identity from Macaque Inferior Temporal Cortex. Science 310: 863–866. doi:10.1126/science.1117593.
5. Zoccolan D, Kouh M, Poggio T, DiCarlo JJ (2007) Trade-Off between Object Selectivity and Tolerance in Monkey Inferotemporal Cortex. J Neurosci 27: 12292–12307. doi:10.1523/JNEUROSCI.1897-07.2007.
6. Kiani R, Esteky H, Mirpour K, Tanaka K (2007) Object Category Structure in Response Patterns of Neuronal Population in Monkey Inferior Temporal Cortex. J Neurophysiol 97: 4296–4309. doi:10.1152/jn.00024.2007.
7. Kriegeskorte N, Mur M, Ruff DA, Kiani R, Bodurka J, et al. (2008) Matching Categorical Object Representations in Inferior Temporal Cortex of Man and Monkey. Neuron 60: 1126–1141. doi:10.1016/j.neuron.2008.10.043.
8. Sato T, Uchida G, Tanifuji M (2009) Cortical Columnar Organization Is Reconsidered in Inferior Temporal Cortex. Cereb Cortex 19: 1870–1888. doi:10.1093/cercor/bhn218.
9. Bell AH, Hadj-Bouziane F, Frihauf JB, Tootell RBH, Ungerleider LG (2009) Object Representations in the Temporal Cortex of Monkeys and Humans as Revealed by Functional Magnetic Resonance Imaging. J Neurophysiol 101: 688–700. doi:10.1152/jn.90657.2008.
10. Mur M, Bodurka J, Goebel R, Bandettini PA, Kriegeskorte N (2013) Human object-similarity judgments reflect and transcend the primate-IT object representation. Front Psychol 4: 128. doi:10.3389/fpsyg.2013.00128.
11. Carlson T, Tovar DA, Alink A, Kriegeskorte N (2013) Representational dynamics of object vision: The first 1000 ms. J Vis 13: 1. doi:10.1167/13.10.1.
12. Cichy RM, Pantazis D, Oliva A (2014) Resolving human object recognition in space and time. Nat Neurosci 17: 455–462. doi:10.1038/nn.3635.
13. Majaj N, Hong H, Solomon E, DiCarlo J (2012) A unified neuronal population code fully explains human object recognition. Cosyne 2012.
14. Connolly AC, Guntupalli JS, Gors J, Hanke M, Halchenko YO, et al. (2012) The Representation of Biological Classes in the Human Brain. J Neurosci 32: 2608–2618. doi:10.1523/JNEUROSCI.5547-11.2012.
15. Naselaris T, Stansbury DE, Gallant JL (n.d.) Cortical representation of animate and inanimate objects in complex natural scenes. Journal of Physiology-Paris. Available: http://www.sciencedirect.com/science/article/pii/S092842571200006X. Accessed 20 September 2012.
16. Mur M, Ruff DA, Bodurka J, De Weerd P, Bandettini PA, et al. (2012) Categorical, Yet Graded – Single-Image Activation Profiles of Human Category-Selective Cortical Regions. J Neurosci 32: 8649–8662. doi:10.1523/JNEUROSCI.2334-11.2012.
17. Kriegeskorte N (2009) Relating population-code representations between man, monkey, and computational models. Front Neurosci 3: 363–73. Available: http://www.frontiersin.org/neuroscience/10.3389/neuro.01.035.2009/pdf/full. Accessed 18 January 2012.
18. Kriegeskorte N, Mur M, Bandettini P (2008) Representational similarity analysis – connecting the branches of systems neuroscience. Front Syst Neurosci 2: 4. doi:10.3389/neuro.06.004.2008.
19. Leeds DD, Seibert DA, Pyles JA, Tarr MJ (2013) Comparing visual representations across human fMRI and computational vision. J Vis 13: 25. doi:10.1167/13.13.25.
20. Serre T, Oliva A, Poggio T (2007) A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences 104: 6424–6429. doi:10.1073/pnas.0700622104.


21. Riesenhuber M, Poggio T (1999) Hierarchical models of object recognition in cortex. Nature Neuroscience 2: 1019–1025.
22. Huth AG, Nishimoto S, Vu AT, Gallant JL (2012) A Continuous Semantic Space Describes the Representation of Thousands of Object and Action Categories across the Human Brain. Neuron 76: 1210–1224. doi:10.1016/j.neuron.2012.10.014.
23. Baldassi C, Alemi-Neissi A, Pagan M, DiCarlo JJ, Zecchina R, et al. (2013) Shape Similarity, Better than Semantic Membership, Accounts for the Structure of Visual Object Representations in a Population of Monkey Inferotemporal Neurons. PLoS Comput Biol 9: e1003167. doi:10.1371/journal.pcbi.1003167.
24. Khaligh-Razavi S-M (2014) What you need to know about the state-of-the-art computational models of object-vision: A tour through the models. arXiv:1407.2776 [cs, q-bio]. Available: http://arxiv.org/abs/1407.2776. Accessed 11 July 2014.
25. Pillow JW, Shlens J, Paninski L, Sher A, Litke AM, et al. (2008) Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature 454: 995–999. doi:10.1038/nature07140.
26. Mitchell TM, Shinkareva SV, Carlson A, Chang K-M, Malave VL, et al. (2008) Predicting Human Brain Activity Associated with the Meanings of Nouns. Science 320: 1191–1195. doi:10.1126/science.1152876.
27. Kay KN, Naselaris T, Prenger RJ, Gallant JL (2008) Identifying natural images from human brain activity. Nature 452: 352–355. doi:10.1038/nature06713.
28. Dumoulin SO, Wandell BA (2008) Population receptive field estimates in human visual cortex. NeuroImage 39: 647–660. doi:10.1016/j.neuroimage.2007.09.034.
29. Kriegeskorte N, Kievit RA (2013) Representational geometry: integrating cognition, computation, and the brain. Trends in Cognitive Sciences 17: 401–412. doi:10.1016/j.tics.2013.06.007.
30. Nili H, Wingfield C, Walther A, Su L, Marslen-Wilson W, et al. (2014) A Toolbox for Representational Similarity Analysis. PLoS Comput Biol 10: e1003553. doi:10.1371/journal.pcbi.1003553.
31. Ganguli S, Sompolinsky H (2012) Compressed sensing, sparsity, and dimensionality in neuronal information processing and data analysis. Annu Rev Neurosci 35: 485–508. doi:10.1146/annurev-neuro-062111-150410.
32. Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26: 1.
33. Mutch J, Lowe DG (2008) Object Class Recognition and Localization Using Sparse Features with Limited Receptive Fields. International Journal of Computer Vision 80: 45–57. doi:10.1007/s11263-007-0118-0.
34. Wallis G, Rolls ET (1997) A model of invariant object recognition in the visual system. Prog Neurobiol 51: 167–194.
35. Ghodrati M, Khaligh-Razavi S-M, Ebrahimpour R, Rajaei K, Pooyan M (2012) How Can Selection of Biologically Inspired Features Improve the Performance of a Robust Object Recognition Model? PLoS ONE 7: e32357. doi:10.1371/journal.pone.0032357.
36. Rajaei K, Khaligh-Razavi S-M, Ghodrati M, Ebrahimpour R, Shiri Ahmad Abadi ME (2012) A Stable Biologically Motivated Learning Mechanism for Visual Feature Extraction to Handle Facial Categorization. PLoS ONE 7: e38478. doi:10.1371/journal.pone.0038478.
37. Ghodrati M, Farzmahdi A, Rajaei K, Ebrahimpour R, Khaligh-Razavi S-M (2014) Feedforward Object-Vision Models Only Tolerate Small Image Variations Compared to Human. Frontiers in Computational Neuroscience 8. doi:10.3389/fncom.2014.00074.
38. Sountsov P, Santucci DM, Lisman JE (2011) A biologically plausible transform for visual recognition that is invariant to translation, scale, and rotation. Frontiers in Computational Neuroscience 5: 53.
39. Jarrett K, Kavukcuoglu K, Ranzato MA, LeCun Y (2009) What is the best multi-stage architecture for object recognition? Computer Vision, 2009 IEEE 12th International Conference on. pp. 2146–2153.
40. LeCun Y, Bengio Y (1995) Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks 3361.
41. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet Classification with Deep Convolutional Neural Networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, editors. Advances in Neural Information Processing Systems 25. Curran Associates, Inc. pp. 1097–1105.
42. Oliva A, Torralba A (2001) Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision 42: 145–175.
43. Lowe DG (1999) Object recognition from local scale-invariant features. ICCV. p. 1150.
44. Lazebnik S, Schmid C, Ponce J (2006) Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 2. pp. 2169–2178. doi:10.1109/CVPR.2006.68.
45. Bosch A, Zisserman A, Munoz X (2007) Representing shape with a spatial pyramid kernel. Proceedings of the 6th ACM International Conference on Image and Video Retrieval. CIVR '07. New York, NY, USA: ACM. pp. 401–408. Available: http://doi.acm.org/10.1145/1282280.1282340. Accessed 6 April 2012.
46. Shechtman E, Irani M (2007) Matching Local Self-Similarities across Images and Videos. IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR '07. pp. 1–8. doi:10.1109/CVPR.2007.383198.
47. Deselaers T, Ferrari V (2010) Global and efficient self-similarity for object classification and detection. Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. pp. 1633–1640. doi:10.1109/CVPR.2010.5539775.
48. Ojala T, Pietikäinen M, Mäenpää T (2001) A generalized local binary pattern operator for multiresolution gray scale and rotation invariant texture classification. Advances in Pattern Recognition—ICAPR 2001: 399–408.
49. Deng J, Dong W, Socher R, Li L-J, Li K, et al. (2009) ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009. pp. 248–255. doi:10.1109/CVPR.2009.5206848.
50. Efron B, Tibshirani R (1986) Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statist Sci 1: 54–75. doi:10.1214/ss/1177013815.
51. Lawson CL, Hanson RJ (1974) Solving least squares problems. Englewood Cliffs, NJ: Prentice-Hall. Vol. 161.
52. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet Classification with Deep Convolutional Neural Networks. NIPS. Lake Tahoe, Nevada.
53. Konkle T, Oliva A (2012) A Real-World Size Organization of Object Responses in Occipitotemporal Cortex. Neuron 74: 1114–1124. doi:10.1016/j.neuron.2012.04.036.
54. Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, et al. (2013) DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. arXiv:1310.1531 [cs]. Available: http://arxiv.org/abs/1310.1531. Accessed 7 May 2014.
55. Oliva A, Torralba A (2006) Building the gist of a scene: the role of global image features in recognition. Progress in Brain Research. p. 2006.
56. Belongie S, Malik J, Puzicha J (2002) Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence: 509–522.
57. Berg AC, Berg TL, Malik J (2005) Shape matching and object recognition using low distortion correspondences. Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. pp. 26–33.
58. Zhang H, Berg AC, Maire M, Malik J (2006) SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 2. pp. 2126–2136. doi:10.1109/CVPR.2006.301.
59. Vedaldi A, Gulshan V, Varma M, Zisserman A (2009) Multiple kernels for object detection. Computer Vision, 2009 IEEE 12th International Conference on. pp. 606–613. doi:10.1109/ICCV.2009.5459183.
60. Lowe DG (2004) Distinctive Image Features from Scale-Invariant Keypoints. Int J Comput Vision 60: 91–110. doi:10.1023/B:VISI.0000029664.99615.94.
61. Tromans JM, Harris M, Stringer SM (2011) A Computational Model of the Development of Separate Representations of Facial Identity and Expression in the Primate Visual System. PLoS ONE 6: e25616. doi:10.1371/journal.pone.0025616.
62. Stringer SM, Rolls ET, Tromans JM (2007) Invariant object recognition with trace learning and multiple stimuli present during training. Network 18: 161–187. doi:10.1080/09548980701556055.
63. Chatfield K, Philbin J, Zisserman A (2009) Efficient retrieval of deformable shape classes using local self-similarities. 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops). pp. 264–271. doi:10.1109/ICCVW.2009.5457691.
64. Ojala T, Pietikainen M, Maenpaa T (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on 24: 971–987.
65. Pietikäinen M (2010) Local Binary Patterns. Scholarpedia 5: 9775. doi:10.4249/scholarpedia.9775.
66. Hubel D, Wiesel T (1962) Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology 160: 106–154.
67. Hubel DH, Wiesel TN (1968) Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology 195: 215.
68. Zabbah S, Rajaei K, Mirzaei A, Ebrahimpour R, Khaligh-Razavi S-M (2014) The impact of the lateral geniculate nucleus and corticogeniculate interactions on efficient coding and higher-order visual object processing. Vision Research 101: 82–93. doi:10.1016/j.visres.2014.05.006.
69. Grossberg S (1988) Adaptive pattern classification and universal recoding. I.: parallel development and coding of neural feature detectors: 243–258.
70. Chang C-C, Lin C-J (2011) LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol 2: 27:1–27:27. doi:10.1145/1961189.1961199.
71. Hinton GE, Osindero S, Teh Y-W (2006) A fast learning algorithm for deep belief nets. Neural Computation 18: 1527–1554.
72. Bengio Y (2009) Learning Deep Architectures for AI. Found Trends Mach Learn 2: 1–127. doi:10.1561/2200000006.
73. Zeiler MD, Fergus R (2013) Visualizing and Understanding Convolutional Networks. arXiv:1311.2901 [cs]. Available: http://arxiv.org/abs/1311.2901. Accessed 26 March 2014.
74. Khaligh-Razavi S-M, Kriegeskorte N (2013) Object-vision models that better explain IT also categorize better, but all models fail at both. Cosyne Abstracts, Salt Lake City USA.
75. Yamins DLK, Hong H, Cadieu CF, Solomon EA, Seibert D, et al. (2014) Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc Natl Acad Sci USA 111: 8619–8624. doi:10.1073/pnas.1403112111.


76. Cadieu CF, Hong H, Yamins DLK, Pinto N, Ardila D, et al. (2014) Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition. arXiv:1406.3284 [cs, q-bio]. Available: http://arxiv.org/abs/1406.3284. Accessed 17 July 2014.
77. Yamins DL, Hong H, Cadieu C, DiCarlo JJ (2013) Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in Neural Information Processing Systems 26. Curran Associates, Inc. pp. 3093–3101.
78. DiCarlo JJ, Cox DD (2007) Untangling invariant object recognition. Trends in Cognitive Sciences 11: 333–341.
79. Rust NC, DiCarlo JJ (2010) Selectivity and Tolerance ("Invariance") Both Increase as Visual Information Propagates from Cortical Area V4 to IT. J Neurosci 30: 12978–12995. doi:10.1523/JNEUROSCI.0179-10.2010.
80. Kriegeskorte N (2011) Pattern-information analysis: from stimulus decoding to computational-model testing. Neuroimage 56: 411–421. doi:10.1016/j.neuroimage.2011.01.061.
81. Carlson TA, Simmons RA, Kriegeskorte N, Slevc LR (2013) The Emergence of Semantic Meaning in the Ventral Temporal Pathway. Journal of Cognitive Neuroscience: 1–12. doi:10.1162/jocn_a_00458.
82. Devereux BJ, Clarke A, Marouchos A, Tyler LK (2013) Representational Similarity Analysis Reveals Commonalities and Differences in the Semantic Processing of Words and Objects. J Neurosci 33: 18906–18916. doi:10.1523/JNEUROSCI.3809-13.2013.
83. Clarke A, Tyler LK (2014) Object-specific semantic coding in human perirhinal cortex. J Neurosci 34(14): 4766–75.
84. Biederman I (1987) Recognition-by-components: A theory of human image understanding. Psychological Review 94: 115–147.
85. Stork DG (1989) Is backpropagation biologically plausible? International Joint Conference on Neural Networks, 1989. IJCNN. pp. 241–246 vol. 2. doi:10.1109/IJCNN.1989.118705.
86. Földiák P (1991) Learning Invariance from Transformation Sequences. Neural Computation 3: 194–200. doi:10.1162/neco.1991.3.2.194.
87. Li N, DiCarlo JJ (2010) Unsupervised Natural Visual Experience Rapidly Reshapes Size-Invariant Object Representation in Inferior Temporal Cortex. Neuron 67: 1062–1075. doi:10.1016/j.neuron.2010.08.029.
88. Li N, DiCarlo JJ (2012) Neuronal Learning of Invariant Object Representation in the Ventral Visual Stream Is Not Dependent on Reward. J Neurosci 32: 6611–6620. doi:10.1523/JNEUROSCI.3786-11.2012.
89. Schultz J, Friston KJ, O'Doherty J, Wolpert DM, Frith CD (2005) Activation in Posterior Superior Temporal Sulcus Parallels Parameter Inducing the Percept of Animacy. Neuron 45: 625–635. doi:10.1016/j.neuron.2004.12.052.
90. Riesenhuber M (2007) Appearance Isn't Everything: News on Object Representation in Cortex. Neuron 55: 341–344. doi:10.1016/j.neuron.2007.07.017.
91. Mahon BZ, Milleville SC, Negri GAL, Rumiati RI, Caramazza A, et al. (2007) Action-Related Properties Shape Object Representations in the Ventral Stream. Neuron 55: 507–520. doi:10.1016/j.neuron.2007.07.011.
