Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis
NeuroImage
http://dx.doi.org/10.1016/j.neuroimage.2014.06.077

Article history:
Accepted 28 June 2014
Available online xxxx

Keywords:
Alzheimer's Disease
Mild Cognitive Impairment
Multimodal data fusion
Deep Boltzmann Machine
Shared feature representation

Abstract

For the last decade, it has been shown that neuroimaging can be a potential tool for the diagnosis of Alzheimer's Disease (AD) and its prodromal stage, Mild Cognitive Impairment (MCI), and also that the fusion of different modalities can further provide complementary information to enhance diagnostic accuracy. Here, we focus on the problems of both feature representation and fusion of multimodal information from Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET). To the best of our knowledge, the previous methods in the literature mostly used hand-crafted features, such as cortical thickness and gray matter densities from MRI or voxel intensities from PET, and then combined these multimodal features by simply concatenating them into a long vector or transforming them into a higher-dimensional kernel space. In this paper, we propose a novel method for a high-level latent and shared feature representation from neuroimaging modalities via deep learning. Specifically, we use a Deep Boltzmann Machine (DBM)2, a deep network with a restricted Boltzmann machine as a building block, to find a latent hierarchical feature representation from a 3D patch, and then devise a systematic method for a joint feature representation from the paired patches of MRI and PET with a multimodal DBM. To validate the effectiveness of the proposed method, we performed experiments on the ADNI dataset and compared with the state-of-the-art methods. In the three binary classification problems of AD vs. healthy Normal Control (NC), MCI vs. NC, and MCI converter vs. MCI non-converter, we obtained maximal accuracies of 95.35%, 85.67%, and 75.92%, respectively, outperforming the competing methods. By visual inspection of the trained model, we observed that the proposed method could hierarchically discover the complex latent patterns inherent in both MRI and PET.

© 2014 Elsevier Inc. All rights reserved.

⁎ Corresponding author at: Department of Radiology and Biomedical Research Imaging Center (BRIC), University of North Carolina at Chapel Hill, NC, USA. E-mail address: [email protected] (D. Shen).
1 Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://www.loni.ucla.edu/ADNI). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete list of ADNI investigators is available at http://adni.loni.ucla.edu/wpcontent/uploads/how_to_apply/ADNI_Authorship_List.pdf.
2 Although it is clear from the context that the acronym DBM denotes "Deep Boltzmann Machine" in this paper, we would like to clearly indicate that DBM here is not related to "Deformation Based Morphometry".
Introduction

Alzheimer's Disease (AD), characterized by progressive impairment of cognitive and memory functions, is the most prevalent cause of dementia in elderly subjects. According to a recent report by the Alzheimer's Association, the number of subjects with AD is increasing significantly every year, and 10 to 20% of people aged 65 or older have Mild Cognitive Impairment (MCI), known as a prodromal stage of AD (Alzheimer's Association, 2012). However, due to the limited period for which symptomatic treatments can be effective, early diagnosis and prognosis of AD/MCI have become of great importance in the clinic.

To this end, many researchers have devoted their efforts to finding biomarkers and developing computer-aided systems with which we can effectively predict or diagnose the diseases. Recent studies have shown that neuroimaging such as Magnetic Resonance Imaging (MRI) (Cuingnet et al., 2011; Davatzikos et al., 2011; Li et al., 2012; Wee et al., 2011; Zhang et al., 2012; Zhou et al., 2011), Positron Emission Tomography (PET) (Nordberg et al., 2010), and functional MRI (fMRI) (Greicius et al., 2004; Suk et al., 2013) can be useful tools for the diagnosis or prognosis of AD/MCI. Furthermore, fusing the complementary information from multiple modalities helps enhance the diagnostic accuracy (Cui et al., 2011; Fan et al., 2007a; Hinrichs et al., 2011; Kohannim et al., 2010; Perrin et al., 2009; Suk and Shen, 2013; Walhovd et al., 2010; Wee et al., 2012; Westman et al., 2012; Yuan et al., 2012; Zhang and Shen, 2012; Zhang et al., 2011).

Various types of features or patterns extracted from neuroimaging modalities have been considered for brain disease diagnosis with machine learning methods. Here, we divide the previous feature extraction approaches into three categories: the voxel-based approach, the Region Of Interest (ROI)-based approach, and the patch-based approach. A voxel-based approach is the most simple and direct way, using the voxel intensities as features in classification (Baron et al., 2001; Ishii et al., 2005). Although it is simple and intuitive in terms of interpretation of
the results, its main limitations are the high dimensionality of the feature vectors and also the ignorance of regional information. An ROI-based approach considers structurally or functionally predefined brain regions and extracts representative features from each region (Cuingnet et al., 2011; Davatzikos et al., 2011; Kohannim et al., 2010; Nordberg et al., 2010; Suk and Shen, 2013; Walhovd et al., 2010; Zhang and Shen, 2012). Thanks to the relatively low feature dimensionality and the whole-brain coverage, it is widely used in the literature. However, the features extracted from ROIs are very coarse in the sense that they cannot reflect the small or subtle changes involved in the brain diseases. Note that the disease-related structural/functional changes occur in multiple brain regions. Furthermore, since the abnormal regions affected by neurodegenerative diseases can be part of ROIs or span over multiple ROIs, the simple voxel- or ROI-based approach may not effectively capture the disease-related pathologies. To tackle these limitations, Liu et al. recently proposed a patch-based method that first dissected brain areas into small 3D patches, extracted features from each selected patch individually, and then combined the features hierarchically at the classifier level (Liu et al., 2012, 2013).

As for the fusion of multiple modalities, including MRI, PET, and biological and neurological data, for discriminating AD/MCI patients from healthy Normal Controls (NC), Kohannim et al. (2010) concatenated features from modalities into a vector and used a Support Vector Machine (SVM) classifier. Walhovd et al. (2010) applied multi-method stepwise logistic regression analyses, and Westman et al. (2012) exploited hierarchical modeling of orthogonal partial least squares to latent structures. Hinrichs et al. (2011), Suk and Shen (2013), and Zhang et al. (2011), independently, utilized kernel-based machine learning techniques.

In this paper, we consider the problems of both feature representation and multimodal data fusion for computer-aided AD/MCI diagnosis. Specifically, for feature representation, we exploit a patch-based approach, since it can be considered as an intermediate level between the voxel-based and ROI-based approaches, thus efficiently handling the concerns of high feature dimensionality and also the sensitivity to small changes. Furthermore, from a clinical perspective, neurologists or radiologists examine brain images by searching for local distinctive regions and then combining the interpretations with neighboring ones and ultimately with the whole brain. In these regards, we believe that the patch-based approach can effectively handle the region-wide pathologies, which may not be limited to specific ROIs, and accords with the neurologists' or radiologists' perspective in terms of examining images, i.e., investigating local patterns and then combining the local information distributed over the whole brain for making a clinical decision. In this way, we can also extract richer information that helps enhance diagnostic accuracy.

However, unlike Liu et al.'s method, which directly used the gray matter density values in each patch as features, we propose to use a latent high-level feature representation. Meanwhile, in the fusion of multimodal information, the previous methods often applied either simple concatenation of the features extracted from multiple modalities or kernel methods to combine them in a high-dimensional kernel space; the feature extraction and feature combination were thus often performed independently. In this work, we propose a novel method of extracting a shared feature representation from multiple modalities, i.e., MRI and PET. As investigated in previous studies (Catana et al., 2012; Pichler et al., 2010), there exist inherent relations between the MRI and PET modalities. Thus, finding a shared feature representation, which combines the complementary information from the modalities, is helpful to enhance performance on AD/MCI diagnosis.

From a feature representation perspective, it is noteworthy that, unlike the previous approaches (Hinrichs et al., 2011; Kohannim et al., 2010; Liu et al., 2012, 2013; Walhovd et al., 2010; Westman et al., 2012; Zhang and Shen, 2012; Zhang et al., 2011) that considered simple low-level features, which are often vulnerable to noise, we propose to consider high-level or abstract features for improving the robustness to noise. For obtaining the latent high-level feature representations inherent in a patch observation, such as correlations among voxels that cover different brain regions, we exploit a deep learning strategy (Bengio, 2009; LeCun et al., 1998), which has been successfully applied to medical imaging analysis (Ciresan et al., 2013; Hjelm et al., 2014; Liao et al., 2013; Shin et al., 2013; Suk and Shen, 2013). Among various deep models, we use a Deep Boltzmann Machine (DBM) (Salakhutdinov and Hinton, 2009), which can hierarchically find feature representations in a probabilistic manner. Rather than using the noisy voxel intensities as features, as Liu et al. (2013) did, the high-level representation obtained via DBM is more robust to noise and thus helps enhance diagnostic performance. Meanwhile, from a multimodal data fusion perspective, unlike the conventional multimodal feature combination methods that first extract modality-specific features and then fuse their complementary information during classifier learning, the proposed multimodal DBM fuses the complementary information from different modalities during the feature representation step. Note that once we extract features from each modality separately, we may already lose some of the useful correlation information between modalities. Therefore, it is important to discover a shared representation by fully utilizing the original information in each modality during the feature representation procedure. In our multimodal data fusion method, thanks to the methodological characteristic of the DBM (i.e., an undirected graphical model), bidirectional information flow is allowed from one modality (e.g., MRI) to the other (e.g., PET) and vice versa. Therefore, we can distribute feature representations over different layers in the path between modalities and thus efficiently discover a shared representation while still utilizing the full information in the observations.

Materials and image processing

Subjects

In this work, we use the ADNI dataset publicly available on the web,3 but consider only the baseline MRI and 18-Fluoro-DeoxyGlucose PET (FDG-PET) data acquired from 93 AD subjects, 204 MCI subjects, including 76 MCI converters (MCI-C) and 128 MCI non-converters (MCI-NC), and 101 NC subjects.4 The demographics of the subjects are detailed in Table 1.

With regard to the general eligibility criteria in ADNI, subjects were aged between 55 and 90 and had a study partner who could provide an independent evaluation of functioning. The general inclusion/exclusion criteria5 are as follows: 1) NC subjects: MMSE scores between 24 and 30 (inclusive), a Clinical Dementia Rating (CDR) of 0, non-depressed, non-MCI, and non-demented; 2) MCI subjects: MMSE scores between 24 and 30 (inclusive), a memory complaint, objective memory loss measured by education-adjusted scores on the Wechsler Memory Scale Logical Memory II, a CDR of 0.5, absence of significant levels of impairment in other cognitive domains, essentially preserved activities of daily living, and an absence of dementia; and 3) mild AD: MMSE scores between 20 and 26 (inclusive), a CDR of 0.5 or 1.0, and meeting the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer's Disease and Related Disorders Association (NINCDS/ADRDA) criteria for probable AD.

3 Available at 'http://www.loni.ucla.edu/ADNI'.
4 Although there exist in total more than 800 subjects in the ADNI database, only 398 subjects have the baseline data including the modalities of both MRI and FDG-PET.
5 Refer to 'http://www.adni-info.org' for the details.

MRI/PET scanning and image processing

The structural MR images were acquired from 1.5 T scanners. We downloaded data in the Neuroimaging Informatics Technology Initiative (NIfTI) format, which had been pre-processed for spatial distortion correction caused by gradient nonlinearity and B1 field inhomogeneity.
Table 1
Demographic and clinical information of the subjects. (SD: standard deviation).
The FDG-PET images were acquired 30–60 min post-injection, averaged, spatially aligned, interpolated to a standard voxel size, normalized in intensity, and smoothed to a common resolution of 8 mm full width at half maximum.

The MR images were preprocessed by applying the typical procedures of Anterior Commissure (AC)–Posterior Commissure (PC) correction, skull-stripping, and cerebellum removal. Specifically, we used MIPAV software6 for AC–PC correction, resampled images to 256 × 256 × 256, and applied the N3 algorithm (Sled et al., 1998) to correct non-uniform tissue intensities. After skull stripping (Wang et al., 2014) and cerebellum removal, we manually checked the skull-stripped images to ensure clean skull and dura removal. Then, FAST in the FSL package7 (Zhang et al., 2001) was used to segment the structural MR images into three tissue types: Gray Matter (GM), White Matter (WM), and CerebroSpinal Fluid (CSF). Finally, all three tissues of the MR image were spatially normalized onto a standard space, for which in this work we used a brain atlas already aligned with the MNI coordinate space (Kabani et al., 1998), via HAMMER (Shen and Davatzikos, 2002), although other advanced registration methods can also be applied for this process (Friston, 1995; Jia et al., 2010; Tang et al., 2009; Xue et al., 2006; Yang et al., 2008). Then, the regional volumetric maps, called RAVENS maps, were generated by a tissue-preserving image warping method (Davatzikos et al., 2001). It is noteworthy that the values of RAVENS maps are proportional to the amount of original tissue volume for each region, giving a quantitative representation of the spatial distribution of tissue types. Due to its relatively high relatedness to AD/MCI compared to WM and CSF (Liu et al., 2012), in this work we considered only the spatially normalized GM volumes, called GM tissue densities, for classification. Regarding the FDG-PET images, they were rigidly aligned to the respective MR images. The GM density maps and the PET images were further smoothed using a Gaussian kernel (with unit standard deviation) to improve the signal-to-noise ratio. We downsampled both the GM density maps and the PET images to 64 × 64 × 64 voxels8 according to Liu et al.'s (2013) work, which saved computational time and memory cost without sacrificing classification accuracy.

6 Available at 'http://mipav.cit.nih.gov/clickwrap.php'.
7 Available at 'http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/'.
8 The final voxel size is 4 × 4 × 4 mm3.

Method

In Fig. 1, we illustrate a schematic diagram of our framework for AD/MCI diagnosis. Given a pair of MRI and PET images, we first select class-discriminative patches by means of a statistical significance test between classes. Using the tissue densities of an MRI patch and the voxel intensities of a PET patch as observations, we build a patch-level feature learning model, called a MultiModal DBM (MM-DBM), that finds a shared feature representation from the paired patches. Here, instead of using the original real-valued tissue densities of MRI and voxel intensities of PET as inputs to the MM-DBM, we first train a Gaussian Restricted Boltzmann Machine (RBM) and use it as a preprocessor to transform the real-valued observations into binary vectors, which become the input to the MM-DBM. After finding latent and shared feature representations of the paired patches from the trained MM-DBM, we construct an image-level classifier by fusing multiple classifiers in a hierarchical manner, i.e., patch-level classifier learning, mega-patch construction, and a final ensemble classification.

Patch extraction

For the class-discriminative patch extraction, we exploit the statistical significance of voxels in each patch, i.e., p-values, following Liu et al.'s (2013) work. It is noteworthy that in this step we take advantage of a group-wise analysis via a voxel-wise statistical test. That is, by first performing a group comparison, e.g., AD and NC, we can find the statistically significant voxels, which can provide useful information for brain disease diagnosis. Based on these voxels, we can then define the class-discriminative patches to further utilize local regional information. By considering only the selected discriminative patches, rather than all patches in an image, we can obtain both a performance improvement in classification and a reduction in computational cost. Throughout this paper, a patch is defined as a three-dimensional cube with a size of w × w × w in a brain image, i.e., MRI or PET. Given a set of training images, we first perform a two-sample t-test on each voxel and then select the voxels whose p-value is smaller than a predefined threshold.9 For each of the selected voxels, by taking each of them as a center, we extract patches with a size of w × w × w, and then compute a mean p-value by averaging the p-values of all voxels within a patch. Finally, by scanning all the extracted patches, we select class-discriminative patches in a greedy manner with the following rules (a code sketch follows at the end of this subsection):

• The candidate patch should overlap less than 50% with any of the already selected patches.
• Among the candidate patches that satisfy the rule above, we select patches whose mean p-values are smaller than the average p-value of all candidate patches.

9 In this work, we set the threshold to 0.05.

For the multimodal case, i.e., MRI and PET in our work, we apply the steps of testing the statistical significance, extracting patches, and computing the mean p-values, as explained above, for each modality independently. But for the last step of selecting class-discriminative patches, we consider multiple modalities together. That is, regarding the second rule, the mean p-value of a candidate patch should be smaller than that of all candidate patches of all the modalities. Once a patch location is determined from one modality, a patch at the same location in the other modality is paired with it for multimodal joint feature representation, which is described in the following section.
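For concreteness, the greedy selection above can be summarized in code. The following is an illustrative sketch, not the implementation used for the reported results; the array layout (subjects × X × Y × Z), the function name select_patches, and the border handling are our assumptions.

```python
# A minimal sketch of the class-discriminative patch selection described above.
# Assumption (ours): the training images of each group are stacked in numpy
# arrays of shape (subjects, X, Y, Z), already spatially normalized.
import numpy as np
from scipy.stats import ttest_ind

def select_patches(group_a, group_b, w=11, p_thresh=0.05):
    # Voxel-wise two-sample t-test between the two groups (e.g., AD vs. NC).
    _, pval = ttest_ind(group_a, group_b, axis=0)

    half = w // 2
    candidates = []
    # Every statistically significant voxel proposes a patch centered on it.
    for center in zip(*np.where(pval < p_thresh)):
        x, y, z = center
        block = pval[x - half:x + half + 1,
                     y - half:y + half + 1,
                     z - half:z + half + 1]
        if block.shape == (w, w, w):                  # skip border voxels
            candidates.append((block.mean(), center))

    avg_p = np.mean([m for m, _ in candidates])
    selected = []
    for mean_p, c in sorted(candidates):              # most significant first
        if mean_p >= avg_p:                           # rule 2: below-average mean p-value
            break
        # Rule 1: overlap volume with every already selected patch below 50%.
        if all(np.prod(np.maximum(0, w - np.abs(np.subtract(c, s)))) < 0.5 * w**3
               for s in selected):
            selected.append(c)
    return selected
```

For the multimodal case, the candidate pools of both modalities would simply be merged before applying the second rule, and each selected location would be paired across MRI and PET as described above.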
Patch-level deep feature learning

Recently, Liu et al. (2013) presented a hierarchical framework that gradually integrated features from a number of local patches extracted from a GM density map. Although they showed the efficacy of their method for AD/MCI diagnosis, it is well known that structural or functional images are susceptible to acquisition noise, intensity inhomogeneity, artifacts, etc. Furthermore, the raw voxel density or intensity values in a patch can be considered as low-level features that do not efficiently capture more informative high-level characteristics. To this end, in this paper, we propose a deep learning based high-level structural and functional feature representation from MRI and PET, respectively, for AD/MCI classification.
Fig. 1. Schematic illustration of the proposed method for hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis. (I: image size; w: patch size; K: number of selected patches; m: modality index; FG: number of hidden units in a Gaussian restricted Boltzmann machine, i.e., the preprocessor; FS: number of hidden units in the top layer of a multimodal Deep Boltzmann Machine (DBM).)
In the following, we first introduce an RBM, which has recently become a prominent tool for feature learning with applications in a wide variety of machine learning fields. Then, we describe a DBM, a network of stacked RBMs, with which we discover a latent hierarchical feature representation from a patch. We finally explain a systematic method to find a joint feature representation from multimodal neuroimaging data.

Restricted Boltzmann machine

An RBM is a two-layer undirected graphical model with visible units v and hidden units h, where symmetric connections exist between the two layers but not within a layer (Fig. 2). The hidden units discover the regularities among the visible units, i.e., voxels in a patch, that can be captured by the symmetric matrix W. It is worth noting that, because of the symmetry of the matrix W, we can also reconstruct the input observations, i.e., a patch, from the hidden representations. Therefore, an RBM is also considered as an auto-encoder (Hinton and Salakhutdinov, 2006). This favorable characteristic is also used in RBM parameter learning (Hinton et al., 2006).

In an RBM, the joint probability of (v, h) is given by:

P(v, h; Θ) = (1 / Z(Θ)) exp[−E(v, h; Θ)]    (1)

where Θ = {W = [W_ij] ∈ R^{D×F}, a = [a_i] ∈ R^D, b = [b_j] ∈ R^F}, E(v, h; Θ) is an energy function, and Z(Θ) is a partition function that can be obtained by summing over all possible pairs of v and h. For the sake of simplicity, by assuming binary visible and hidden units, the energy function E(v, h; Θ) is defined by

E(v, h; Θ) = −h⊤Wv − a⊤v − b⊤h = −∑_{i=1}^{D} ∑_{j=1}^{F} W_ij v_i h_j − ∑_{i=1}^{D} a_i v_i − ∑_{j=1}^{F} b_j h_j    (2)

The conditional distributions over the hidden and visible units are then given by

P(h_j = 1 | v; Θ) = sigm(∑_{i=1}^{D} W_ij v_i + b_j)    (3)

P(v_i = 1 | h; Θ) = sigm(∑_{j=1}^{F} W_ij h_j + a_i)    (4)

where sigm(x) = 1 / (1 + exp[−x]) is a logistic sigmoid function. Due to the unobservable hidden variables, the objective function is defined as the marginal distribution of the visible variables as follows:

P(v; Θ) = (1 / Z(Θ)) ∑_h exp(−E(v, h; Θ))    (5)

In our work, the observed patch values from MRI and PET are real-valued, v ∈ R^D. For this case, it is common to use a Gaussian RBM
Fig. 2. An architecture of a restricted Boltzmann machine (a) and its simplified representation (b).
(Hinton and Salakhutdinov, 2006), in which the energy function is given by

E(v, h; Θ) = ∑_{i=1}^{D} (v_i − a_i)² / (2σ_i²) − ∑_{i=1}^{D} ∑_{j=1}^{F} (v_i / σ_i) W_ij h_j − ∑_{j=1}^{F} b_j h_j    (6)

where σ_i denotes the standard deviation of the i-th visible variable and Θ = {W, a, b, σ = [σ_i] ∈ R^D}. This variation leads to the following conditional distribution of the visible variables given the binary hidden variables:

p(v_i | h; Θ) = (1 / (√(2π) σ_i)) exp(−(v_i − a_i − σ_i ∑_{j=1}^{F} h_j W_ij)² / (2σ_i²))    (7)
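To illustrate how such a Gaussian RBM can be trained and then used as the preprocessor described in the Method section, a minimal contrastive-divergence (CD-1) sketch is given below. It assumes σ_i = 1 (as in our experimental setup) and is an illustration rather than the code used for the reported experiments.

```python
# Illustrative CD-1 training step for a Gaussian-Bernoulli RBM with sigma_i = 1.
import numpy as np

rng = np.random.default_rng(0)

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(V, W, a, b, lr=1e-3):
    """One contrastive-divergence update on a minibatch V of shape (batch, D)."""
    ph = sigm(V @ W + b)                       # p(h_j = 1 | v)
    h = (rng.random(ph.shape) < ph).astype(V.dtype)
    V_neg = a + h @ W.T                        # Gaussian mean of p(v | h), cf. Eq. (7)
    ph_neg = sigm(V_neg @ W + b)
    # Data-dependent minus model-dependent statistics.
    W += lr * (V.T @ ph - V_neg.T @ ph_neg) / len(V)
    a += lr * (V - V_neg).mean(axis=0)
    b += lr * (ph - ph_neg).mean(axis=0)

def binarize(v, W, b):
    """Convert a real-valued patch into the vector that feeds the (MM-)DBM."""
    return sigm(v @ W + b)
```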
Deep Boltzmann Machine

A DBM is an undirected graphical model, structured by stacking multiple RBMs in a hierarchical manner. That is, a DBM contains a visible layer v and a series of hidden layers h1 ∈ {0, 1}^{F1}, ⋯, hi ∈ {0, 1}^{Fi}, ⋯, hL ∈ {0, 1}^{FL}, where Fi denotes the number of units in the i-th hidden layer and L is the number of hidden layers. We should note that, hereafter, for simplicity, we omit the bias terms and assume that the visible and hidden variables are binary10 or probabilities; the following description of the DBM is based on Salakhutdinov and Hinton (2009).

Fig. 3(a) shows an example of a three-layer DBM. The energy of the state (v, h1, h2) in the DBM is given by

E(v, h1, h2; Θ) = −v⊤W1h1 − (h1)⊤W2h2    (8)

where W1 = [W1_ij] ∈ R^{D×F1} and W2 = [W2_jk] ∈ R^{F1×F2} are, respectively, the symmetric connections of (v, h1) and (h1, h2), and Θ = {W1, W2}. Then the probability that the model assigns to a visible vector v is given by:

P(v; Θ) = (1 / Z(Θ)) ∑_{h1,h2} exp(−E(v, h1, h2; Θ))    (9)

where Z(Θ) is a normalizing factor. Given the values of the units in the neighboring layer(s), the probability of the binary visible or binary hidden units being set to 1 is computed as follows:

P(h1_j = 1 | v, h2) = sigm(∑_{i=1}^{D} W1_ij v_i + ∑_{k=1}^{F2} W2_jk h2_k)    (10)

P(h2_k = 1 | h1) = sigm(∑_{j=1}^{F1} W2_jk h1_j)    (11)
Fig. 3. An architecture of (a) a conventional Deep Boltzmann Machine and (b) its discriminative version with label information at the top layer.
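In code, the conditionals (10) and (11) translate directly into a mean-field style inference loop that yields a deterministic latent representation for a patch. The following sketch is ours, with the bias terms omitted as in the text:

```python
# Sketch of inference in a trained two-hidden-layer DBM (Eqs. (10)-(11)).
# mu1 and mu2 hold the probabilities that the hidden units are on; the
# top-layer probabilities mu2 serve as the latent feature vector.
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbm_features(v, W1, W2, n_iter=30):
    mu2 = np.full(W2.shape[1], 0.5)          # initial guess for q(h2)
    for _ in range(n_iter):
        mu1 = sigm(v @ W1 + mu2 @ W2.T)      # Eq. (10), with h2 replaced by mu2
        mu2 = sigm(mu1 @ W2)                 # Eq. (11), with h1 replaced by mu1
    return mu2
```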
For the label layer, we use a logistic function

P(o_l = 1 | h2) = exp(∑_{k=1}^{F2} U_lk h2_k) / ∑_{l'=1}^{C} exp(∑_{k=1}^{F2} U_{l'k} h2_k)    (16)

where C is the number of classes and U = [U_lk] denotes the connections between the top hidden layer and the label layer. In this way, the hidden units capture class-predictive information about the input vector. Here, we should note that the label layer connected to the top hidden layer is considered only during the training phase, for finding the class-discriminative parameters.

From a feature learning perspective, in the low layers of our model, basic image features such as spots and edges are captured from the input data. The learned low-level features are further fed into the higher levels of the network, which encode more abstract and higher-level semantic information inherent in the input data. But, here, we should note that the output layer linked to the top hidden layer imposes the learned features to be discriminative between classes.

In order to learn the parameters Θ = {W1, W2, U}, we maximize the log-likelihood of the observed data (v, o). The derivative of the log-likelihood of the observed data with respect to the model parameters takes the simple form of a difference between data-dependent and model-dependent expectations, e.g., ∂log P(v, o; Θ)/∂W1 = E_data[v(h1)⊤] − E_model[v(h1)⊤].

The key idea in greedy layer-wise learning is to train one layer at a time by maximizing the variational lower bound. That is, we first train the 1st hidden layer with the training data as input, then train the 2nd hidden layer with the outputs from the 1st hidden layer as input, and so on. That is, the representation of the l-th hidden layer is used as input for the (l + 1)-th hidden layer, and this pairwise model becomes an RBM. Here, it should be mentioned that, unlike other deep networks, because the DBM integrates both bottom-up and top-down information, the first and last RBMs in the network need modification by using weights twice as big in one direction. Since a detailed explanation of this issue is beyond the scope of our work, please refer to Salakhutdinov and Hinton (2012) for details.

In a nutshell, the learning proceeds in two steps: (1) greedy layer-wise pre-training for a good initial setup of the model parameters, and (2) iterative alternation of a variational mean-field approximation to estimate the posterior probabilities of the hidden units and stochastic approximation to update the model parameters (refer to Appendix A). After learning the parameters, we can then obtain a latent feature representation for an input sample by inferring the probabilities of the hidden units in the trained DBM.11
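A skeletal version of the pre-training in step (1), showing only its control flow, is given below. Here train_rbm is a hypothetical stand-in for any RBM trainer (e.g., a contrastive-divergence routine as sketched earlier), and the two flags only mark where the doubled-weight modification of the first and last RBMs would apply.

```python
# Control-flow sketch of greedy layer-wise pre-training for a DBM.
# `train_rbm(data, n_visible, n_hidden, double_up, double_down)` is a
# hypothetical stand-in returning a trained weight matrix.
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_dbm(data, layer_sizes, train_rbm):
    weights, X = [], data
    n_rbms = len(layer_sizes) - 1
    for l in range(n_rbms):
        W = train_rbm(X, layer_sizes[l], layer_sizes[l + 1],
                      double_up=(l == 0),             # first RBM: doubled bottom-up weights
                      double_down=(l == n_rbms - 1))  # last RBM: doubled top-down weights
        weights.append(W)
        X = sigm(X @ W)   # hidden probabilities become the next layer's training data
    return weights
```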
Multimodal DBM

where the subscripts M, P, and S denote, respectively, the units of the MRI pathway, the PET pathway, and the shared hidden layer. Regarding parameter learning for the MM-DBM, the same strategy as for the unimodal DBM learning can be applied. For details, please refer to Appendix B.

Image-level hierarchical classifier learning

In order to combine the distributed patch information over an image and build an image-level classifier, we use a hierarchical classifier learning scheme proposed by Liu et al. (2013). That is, we first build a classifier for each patch independently and then combine them in a hierarchical manner by feeding the outputs from the lower-level classifiers to the upper-level classifier. Specifically, we build a three-level classifier for decision: patch-level, mega-patch-level, and image-level. For the patch-level classification, a linear Support Vector Machine (SVM) is trained for each patch location independently, with the (MM-)DBM-learned feature representations as input. The output from a patch-level SVM, measured by the relative distance from the decision hyperplane, is then converted to a probability via a softmax function. Here, we should note that in patch-level classifier learning, we randomly partition the training data into a training set and a validation set.12 The patch-level classifier is trained on the training set, and then the classification accuracy is obtained with the validation set.

12 In our work, we set 80% of the entire training data as a training set and the rest as a validation set.
13 In this work, we set the number of subsets to 10.
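The patch-level stage can be pictured with the short sketch below. The two-class softmax over the signed decision value is our illustrative reading of the conversion described above, and the scikit-learn classes are our stand-in choice (the LIBSVM toolbox actually used is named in the Experimental setup section).

```python
# Sketch of one patch-level classifier: a linear SVM on the (MM-)DBM-learned
# features, with its decision value mapped to a probability.
import numpy as np
from sklearn.svm import LinearSVC

def train_patch_classifier(features, labels, C=1.0):
    clf = LinearSVC(C=C)    # soft-margin parameter; set by nested CV in practice
    clf.fit(features, labels)
    return clf

def patch_probability(clf, features):
    d = clf.decision_function(features)          # signed distance to the hyperplane
    return np.exp(d) / (np.exp(d) + np.exp(-d))  # softmax over {+d, -d}
```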
In the following hierarchy, instead of considering all the patch-level classifiers' outputs simultaneously, we agglomerate the information of the locally distributed patches by constructing spatially distributed 'mega-patches', under the consideration that the disease-related brain areas are distributed over some distant brain regions with arbitrary shape and size (Liu et al., 2013). Similar to the patch extraction described above, we construct mega-patches and the respective classifiers in a greedy manner. Concretely, we first sort the patches in a descending order based on the classification accuracy obtained with the validation set in patch-level classifier learning. Starting with the patch with the highest classification accuracy as a new mega-patch, we greedily merge the neighboring patches into the mega-patch. The merging condition is that, if and only if, the mega-patch classifier newly trained on the enlarged mega-patch improves the classification accuracy on the validation set, the merge is accepted.

Experiments

In this section, we evaluate the effectiveness of the proposed method for (1) a latent feature representation with a DBM and (2) a shared feature representation between MRI and PET with an MM-DBM, by considering three binary classification problems: AD vs. NC, MCI vs. NC, and MCI converter (MCI-C) vs. MCI non-converter (MCI-NC). Due to the limited number of data, we applied a 10-fold cross-validation technique. Specifically, we randomly partitioned the dataset into 10 subsets, each of which included 10% of the total data. We repeated the experiments for each classification problem 10 times, using 9 out of the 10 subsets for training and the remaining one for testing each time. It is worth noting that, for each classification problem, during the training phase, we performed patch selection and (MM-)DBM and SVM model learning using only the 9 training subsets. Based on the selected patches and the trained (MM-)DBM and SVM models, we finally evaluated the performance on the left-out testing subset. We compare the proposed method with Liu et al.'s (2013) method, using the same training and testing sets in each experiment for a fair comparison.
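The protocol can be summarized with the sketch below; StratifiedKFold and the pipeline stand-in build_and_eval are our illustrative choices. The essential point is that patch selection and all model fitting happen inside the training folds only.

```python
# Sketch of the 10-fold protocol: every data-dependent step (patch selection,
# (MM-)DBM and SVM training) is fit on the 9 training folds and only then
# applied to the held-out fold. `build_and_eval` is a hypothetical stand-in
# for the full pipeline, returning a test accuracy.
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, build_and_eval, seed=0):
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    accs = [build_and_eval(X[tr], y[tr], X[te], y[te])
            for tr, te in folds.split(X, y)]
    return sum(accs) / len(accs)
```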
classifiers in a greedy manner. Concretely, we first sort the patches in with 1, 331(= 113) Gaussian visible units and also 500 binary hidden
a descending order based on the classification accuracy obtained with units by using contrastive divergence learning (Hinton et al., 2006) for
the validation set in patch-level classifier learning. Starting with the 1000 epochs.14 After training a Gaussian RBM for each modality, we
patch with the highest classification accuracy as a new mega-patch, used it as a preprocessor, following Nair and Hinton's (2008) work that
we greedily merge the neighboring patches into the mega-patch. The effectively converts GM tissue densities or PET voxel intensities into
merging condition is that, if and only if, a mega-patch classifier, which
13
In this work, we set the number of subsets to 10.
12 14
In our work, we set 80% of the entire training data as a training set and the rest for a The input data were first normalized and whitened by zero component analysis, and
validation set. the standard deviation was fixed to 1 during the parameter updates.
We then used the binary vectors as 'preprocessed data' to train our (MM-)DBMs. We should note that the Gaussian RBMs were not updated during (MM-)DBM learning.

We structured a three-layer DBM for MRI (MRI-DBM) and for PET (PET-DBM), respectively, and a four-layer DBM for MRI + PET (MM-DBM). For all these models, we used binary visible and binary hidden units. Both the MRI-DBM and the PET-DBM were structured with 500(visible)–500(hidden)–500(hidden) units, and the MM-DBM was structured with 500(visible)–500(hidden)–500(hidden) units for the MRI pathway, 500(visible)–500(hidden)–500(hidden) units for the PET pathway, and finally 1000 hidden units for the shared hidden layer. In (MM-)DBM learning, we updated the parameters, i.e., weights and biases, with a learning rate of 10⁻³ and a momentum of 0.5, gradually incremented up to 0.9, for 500 epochs. We used the trained parameters of the MRI-DBM and PET-DBM as the initial setup of the MRI and PET pathways in MM-DBM learning. We implemented the DBM method based on Salakhutdinov's codes.15

We used a linear SVM for the hierarchical classifiers, i.e., the patch-level classifier, mega-patch-level classifier, and image-level classifier. The LIBSVM toolbox16 was used for SVM learning and classification. The free parameter that controls the soft margin was determined by a nested cross-validation.

15 Available at 'http://www.cs.toronto.edu/rsalakhu/DBM.html'.
16 Available at 'http://www.csie.ntu.edu.tw/cjlin/libsvm/'.

Extracted patches and trained DBMs

In Fig. 5, we present example images overlaid with the p-values of the voxels, obtained from the AD and NC groups, based on which we selected patch locations for AD and NC classification. It is worth noting that, for both modalities, the voxels in the subcortical and medial temporal areas showed low p-values, i.e., statistically different between classes, while for other areas each modality presents slightly different p-value distributions, from which we could possibly obtain complementary information for classification. Samples of the selected 3D patches are also presented in Fig. 6, in which one 3D volume is displayed in each row for each modality. Taking these patches as input data to a Gaussian RBM and then transforming them into binary vectors, we trained our feature representation models, i.e., MRI-DBM, PET-DBM, and MM-DBM. Regarding the trained MM-DBM, we visualized the trained weights in Fig. 7 by linearly projecting them to the input space for an intuitive interpretation of the feature representations.17 In the figure, the left images represent the trained weights of our Gaussian RBMs, which were used to convert the real-valued patches into binary vectors as a preprocessor, and the right images represent the trained weights of the first-layer hidden units of the respective modality's pathway in our MM-DBM. From the figure, we can regard the hidden units in the Gaussian RBM as simple cells of a human visual cortex that maximally respond to specific spot- or edge-like stimulus patterns within the receptive field, i.e., a patch in our case. In particular, each hidden unit in a Gaussian RBM finds simple volumetric or functional patterns in the input 3D patch by assigning different weights to the corresponding voxels. For example, hidden units of the Gaussian RBM for MRI (left in Fig. 7(a)) focus on different parts of a patch to detect a simple spot- or edge-like pattern in the input 3D GM patch. The hidden units in the Gaussian RBM for PET (left in Fig. 7(b)) can be understood as descriptors that discover local functional relations among voxels within a patch.

Note that the hidden units of a Gaussian RBM for either MRI or PET find, respectively, the structural or functional relations among voxels in a localized way. Meanwhile, the hidden units in our (MM-)DBM serve as the complex filters of a human visual cortex, which combine the outputs from the simple cells and maximally respond to more complex patterns within the receptive field. For example, the weights of the hidden units in the hidden layer of the MRI pathway in the MM-DBM (right in Fig. 7(a)) discover more complicated structural patterns in the input 3D GM patch, such as combinations of edges orienting in different directions. With respect to PET, the weights of the hidden units in the hidden layer of the PET pathway in the MM-DBM (right in Fig. 7(b)) discover non-linear functional relations among voxels within a 3D patch. In this way, as it forwards to the higher layers, the MM-DBM finds complex latent features in the input patch, and ultimately, in the top hidden layer, the hidden units discover the inter-modality relations between the pair of MRI and PET patches, each of which comes from the same location in the brain.

17 For the hidden units of the MRI pathway and the PET pathway, their weights were visualized as a weighted linear combination of the weights of the Gaussian RBM, similar to Lee et al.'s (2009) work.

Performance evaluation

Let TP, TN, FP, and FN denote, respectively, True Positive, True Negative, False Positive, and False Negative. In this work, we consider the following quantitative measurements and present the performances of the competing methods in Table 2 (a computational sketch follows the list):

• ACCuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
• SENsitivity (SEN) = TP / (TP + FN)
• SPECificity (SPEC) = TN / (TN + FP)
• Balanced ACcuracy (BAC) = (SEN + SPEC) / 2
• Positive Predictive Value (PPV) = TP / (TP + FP)
• Negative Predictive Value (NPV) = TN / (TN + FN)
• Area Under the receiver operating characteristic Curve (AUC)
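These definitions map one-to-one onto code. The short sketch below computes the count-based measures from the confusion counts; AUC is left to a library routine since it requires the continuous classifier outputs.

```python
# The count-based measures above, computed from confusion counts. AUC is
# computed from continuous scores (e.g., SVM decision values), for example
# with sklearn.metrics.roc_auc_score(y_true, decision_values).
def diagnostics(tp, tn, fp, fn):
    sen = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {"ACC": (tp + tn) / (tp + tn + fp + fn),
            "SEN": sen,
            "SPEC": spec,
            "BAC": (sen + spec) / 2,
            "PPV": tp / (tp + fp),
            "NPV": tn / (tn + fn)}
```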
In the classification of AD and NC, the proposed method showed mean accuracies of 92.38% (MRI), 92.20% (PET), and 95.35% (MRI + PET). Compared to Liu et al.'s method, which showed accuracies of 90.18% (MRI), 89.13% (PET), and 90.27% (MRI + PET),18 the proposed method improved by 2.2% (MRI), 3.07% (PET), and 5.08% (MRI + PET). That is, the proposed method outperformed Liu et al.'s method in all the cases of MRI, PET, and MRI + PET. In the discrimination of MCI from NC, the proposed method showed accuracies of 84.24% (MRI), 84.29% (PET), and 85.67% (MRI + PET). Meanwhile, Liu et al.'s method showed accuracies of 81% (MRI), 81.14% (PET), and 83.90% (MRI + PET). Again, the proposed method outperformed Liu et al.'s method, making performance improvements of 3.24% (MRI), 3.15% (PET), and 1.77% (MRI + PET). In the classification between MCI-C and MCI-NC, which is the most important for early diagnosis and treatment, Liu et al.'s method achieved accuracies of 64.75% (MRI), 67.17% (PET), and 73.33% (MRI + PET). Compared to these results, the proposed method improved the accuracies by 7.67% (MRI), 3.58% (PET), and 2.59% (MRI + PET), respectively. Concisely, in our three binary classifications, based on the classification accuracy, the proposed method clearly outperformed Liu et al.'s method by achieving maximal accuracies of 95.35% (AD vs. NC), 85.67% (MCI vs. NC), and 75.92% (MCI-C vs. MCI-NC), respectively.

Regarding sensitivity and specificity, the higher the sensitivity, the lower the chance of mis-diagnosing AD/MCI patients; likewise, the higher the specificity, the lower the chance of mis-diagnosing NC as AD/MCI. Although the proposed method had a lower sensitivity than Liu et al.'s method in a couple of cases, e.g., 90.06% (Liu et al.'s method) vs. 88.04% (proposed) with PET in the AD diagnosis, 98.97% (Liu et al.'s method) vs. 95.37% (proposed) with MRI + PET in the MCI diagnosis, and 40.02% (Liu et al.'s method) vs. 25.45% (proposed) with PET in the MCI-C diagnosis, in general the proposed method showed higher sensitivity and specificity in all three classification problems. Hence, from a clinical point of view, the proposed method is less likely to mis-diagnose subjects with AD/MCI, and vice versa, compared to Liu et al.'s method.

18 For the multimodal case, we concatenated the patches of the modalities into a single vector for Liu et al.'s method.
Fig. 5. Visualization of the p-value distributions used to select the patch locations of (a) MRI and (b) PET in AD and NC classification.
Meanwhile, because of the data imbalance between classes, i.e., AD (93 subjects), MCI (204 subjects; 76 MCI-C and 128 MCI-NC subjects), and NC (101 subjects), we obtained a low sensitivity (MCI vs. NC) or specificity (MCI-C vs. MCI-NC). The balanced accuracy, which is calculated by taking the average of sensitivity and specificity, avoids inflated performance estimates on imbalanced datasets. Based on this metric, we clearly see that the proposed method is superior to the competing method. Note that in the discrimination between MCI and NC, while the accuracy improvement by the proposed method with MRI + PET was 1.43% and 1.38% compared to the same method with MRI and PET,
Fig. 7. Visualization of the trained weights of our modality-specific Gaussian RBMs (left), used for data conversion from a real-valued vector to a binary vector, and those of our MM-DBM (right), used for latent feature representations, for (a) MRI and (b) PET. For the weights of our MM-DBM, they correspond to the first hidden layer in the respective modality's pathway in the model. In each subfigure, one row corresponds to one hidden unit in the respective Gaussian RBM or MM-DBM.
respectively, in terms of the balanced accuracy, the improvements went up to 3.93% (vs. MRI) and 2.95% (vs. PET).

With a further concern for the low sensitivity and specificity, especially in the classifications of MCI vs. NC and MCI-C vs. MCI-NC, we also computed the Positive Predictive Value (PPV) and the Negative Predictive Value (NPV). Statistically, PPV and NPV measure, respectively, the proportion of subjects with AD, MCI, or MCI-C who are correctly diagnosed as patients, and the proportion of subjects without AD, MCI, or MCI-C who are correctly diagnosed as cognitively normal.

Table 2
A summary of the performances of the two methods. The boldface denotes the best performance in each metric for each classification task.

Method | Modality | ACC (%) | SEN (%) | SPEC (%) | BAC (%) | PPV (%) | NPV (%) | AUC

AD/NC
Liu et al. | MRI | 90.18 ± 5.25 | 91.54 | 90.61 | 91.08 | 88.94 | 90.67 | 0.9620
Liu et al. | PET | 89.13 ± 6.81 | 90.06 | 89.36 | 89.71 | 88.49 | 89.26 | 0.9594
Liu et al. | MRI + PET | 90.27 ± 7.02 | 89.48 | 92.44 | 90.96 | 90.56 | 88.70 | 0.9655
Proposed | MRI | 92.38 ± 5.32 | 91.54 | 94.56 | 93.05 | 92.65 | 90.84 | 0.9697
Proposed | PET | 92.20 ± 6.70 | 88.04 | 96.33 | 92.19 | 95.03 | 89.66 | 0.9798
Proposed | MRI + PET | 95.35 ± 5.23 | 94.65 | 95.22 | 94.93 | 96.80 | 95.67 | 0.9877

MCI/NC
Liu et al. | MRI | 81.00 ± 4.98 | 97.08 | 48.18 | 72.63 | 79.14 | 88.99 | 0.8352
Liu et al. | PET | 81.14 ± 10.22 | 96.03 | 52.59 | 74.31 | 80.26 | 84.16 | 0.8231
Liu et al. | MRI + PET | 83.90 ± 5.80 | 98.97 | 52.59 | 75.78 | 81.18 | 97.22 | 0.8301
Proposed | MRI | 84.24 ± 6.26 | 99.58 | 53.79 | 76.69 | 81.23 | 98.75 | 0.8478
Proposed | PET | 84.29 ± 7.22 | 98.69 | 56.87 | 77.78 | 81.99 | 94.57 | 0.8297
Proposed | MRI + PET | 85.67 ± 5.22 | 95.37 | 65.87 | 80.62 | 85.02 | 89.00 | 0.8808

MCI-C/MCI-NC
Liu et al. | MRI | 64.75 ± 14.83 | 22.22 | 89.57 | 55.90 | 46.29 | 77.39 | 0.6355
Liu et al. | PET | 67.17 ± 13.43 | 40.02 | 82.61 | 61.32 | 64.13 | 70.31 | 0.6911
Liu et al. | MRI + PET | 73.33 ± 12.47 | 33.25 | 97.52 | 65.38 | 80.00 | 73.18 | 0.7159
Proposed | MRI | 72.42 ± 13.09 | 36.70 | 90.98 | 63.84 | 65.49 | 77.84 | 0.7342
Proposed | PET | 70.75 ± 13.23 | 25.45 | 96.55 | 61.00 | 75.00 | 70.69 | 0.7215
Proposed | MRI + PET | 75.92 ± 15.37 | 48.04 | 95.23 | 71.63 | 83.50 | 74.33 | 0.7466
Based on a recent report by the Alzheimer's Association (2012), the AD prevalence is projected to be 11 million to 16 million by 2050. For MCI and MCI-C, although there is high variation among reports depending on the definitions, the median of the prevalence estimates of MCI or MCI-C in the literature is 26.4% (MCI) and 4.9% (amnestic MCI) (Ward et al., 2012). Regarding the AD prevalence by 2050, the proposed method, which achieved a PPV of 96.80% in the classification of AD and NC, can correctly identify 10.648 million to 15.488 million subjects with AD, while Liu et al.'s method, whose respective PPV was 90.56%, can identify 9.9616 million to 14.4896 million subjects with AD. Accordingly, our method can correctly identify as many as 0.6864 million to 0.9984 million more subjects.

The Receiver Operating Characteristic (ROC) curve19 and the Area Under the ROC Curve (AUC) are also widely used metrics to evaluate the performance of diagnostic tests in brain disease as well as in other medical areas. In particular, the AUC can be thought of as a measure of the overall performance of a diagnostic test. The proposed method with MRI + PET showed the best AUCs of 0.9877 in AD vs. NC, 0.8808 in MCI vs. NC, and 0.7466 in MCI-C vs. MCI-NC. Compared to Liu et al.'s method with MRI + PET, the proposed multimodal method increased the AUCs by 0.0222 (AD vs. NC), 0.0507 (MCI vs. NC), and 0.0307 (MCI-C vs. MCI-NC). Noticeably, the proposed method with MRI enhanced the AUC by as much as 0.0987 over the corresponding AUC of Liu et al.'s method. It is also noteworthy that in the classification of MCI and NC, the proposed method with MRI + PET improved the AUC by 0.0330 (vs. MRI) and 0.0389 (vs. PET), while the improvements in the classifications of AD vs. NC and MCI-C vs. MCI-NC were, respectively, 0.0180/0.0079 (vs. MRI/PET) and 0.0124/0.0251 (vs. MRI/PET).

19 A plot of the test's true positive rate versus its false positive rate.

Based on the quantitative measurements depicted above, the proposed method clearly outperforms Liu et al.'s method. In terms of the modalities used for classification, similar to the previous work (Hinrichs et al., 2011; Suk and Shen, 2013; Zhang et al., 2011), we also obtained the best performances with the complementary information from multiple modalities, i.e., MRI + PET.

Comparison with state-of-the-art methods

In Table 3, we also compare the classification accuracies of the proposed method with those of the state-of-the-art methods that considered multi-modality in the classifications of AD vs. NC, MCI vs. NC, and MCI-C vs. MCI-NC. Note that, due to the different datasets and the different approaches to extracting features and building classifiers, it is not fair to directly compare the performances among the methods. Nonetheless, it is remarkable that the proposed method showed the highest accuracies among the methods in all the binary classification problems. It is also worth noting that our method is the only one that considered a patch-based approach for feature extraction, while the other methods used an ROI-based approach.

For the investigation of the relative importance of the different brain areas determined by the proposed method for AD/MCI diagnosis, we visualized the weights of the selected patches in Fig. 8. Specifically, the weight of each patch was calculated by accumulating the selection frequency of the mega-patches in the final ensemble classifiers over the cross-validations. That is, the weight of a patch was determined as the sum of the weights of the mega-patches that included the patch and were used in the final decision. The highly weighted patches were in accordance with the previous reports on AD/MCI studies. They were distributed around the medial temporal lobe (which includes the amygdala, hippocampal formation, and entorhinal cortex) (Braak and Braak, 1991; Burton et al., 2009; Desikan et al., 2009; Devanand et al., 2007; Ewers et al., 2012; Lee et al., 2006; Mosconi, 2005; Visser et al., 2002; Walhovd et al., 2010), superior/medial frontal gyrus (Johnson et al., 2005), precentral/postcentral gyrus (Belleville et al., 2011), precuneus (Bokde et al., 2006; Davatzikos et al., 2011; Singh et al., 2006), thalamus, putamen (de Jong et al., 2008), caudate nucleus (Dai et al., 2009), etc.

Limitations

In our experiments, we validated the efficacy of the proposed method in three classification problems by achieving the best performances. However, there still exist some limitations of the proposed method.

First, even though we could visualize the trained weights of our MM-DBMs in Fig. 7, from a clinical perspective it is difficult to understand or interpret the resulting feature representations. Particularly, with respect to the investigation of brain abnormalities affected by neurodegenerative disease, i.e., AD or MCI, our method cannot provide useful clinical information. In this regard, it could be a good research direction to further extend the proposed method to find or detect brain abnormalities in terms of brain regions or areas, for easy understanding by clinicians.

Second, in our experiments, we manually determined the number of hidden units in each layer. Furthermore, we used a relatively small number of data samples (93 AD, 76 MCI-C, 128 MCI-NC, and 101 NC). Therefore, the network structures used to discover high-level feature representations in our experiments were not necessarily optimal. We believe that more intensive studies, such as learning the optimal network structure from big data, are needed for the practical use of deep learning in clinical settings.

Third, as the graphical model illustrated in Fig. 4 shows, the current method only considers the two modalities of MRI and PET. However, it is generally beneficial to combine as many modalities as possible to use their richer information. Therefore, it is necessary to build a more systematic model that can efficiently find and use complementary information from genetics, proteomics, imaging, cognition, disease status, and other phenotypic modalities.

Lastly, according to a recent broad spectrum of studies, there is increasing evidence that subjective cognitive complaint is one of the important genetic risk factors, which increases the risk of progression to MCI or AD (Loewenstein et al., 2012; Mark and Sitskoorn, 2013). That is, among the cognitively normal elderly individuals who have subjective cognitive impairments, there exists a high possibility for some of them to be in the stage of 'pre-MCI'. However, in the ADNI dataset, there is no related information. Thus, in our experiments, the NC group could include both genuine controls and those with subjective cognitive complaints.

Conclusion

In this paper, we proposed a method for a shared latent feature representation from MRI and PET via deep learning. Specifically, we used a DBM to find a latent feature representation from a volumetric patch, and further devised a method to systematically discover a joint feature representation from multiple modalities. Unlike the previous methods, which mostly considered the direct use of GM tissue densities from MRI and/or voxel intensities from PET and then fused the complementary information with a kernel technique, the proposed method learns high-level features in a self-taught manner via deep learning, and thus could efficiently combine the complementary information from MRI and PET during the feature representation procedure. Experimental results on the ADNI dataset showed that the proposed method is superior to the previous methods in terms of various quantitative metrics.
Table 3
Comparison of classification accuracy with state-of-the-art methods. The numbers in parentheses denote the number of AD/MCI(MCI-C, MCI-NC)/NC subjects in the dataset used. The boldface denotes the best performance in each classification task.

Methods | Dataset | Features | AD vs. NC (%) | MCI vs. NC (%) | MCI-C vs. MCI-NC (%)
Kohannim et al. (2010) | MRI + PET + CSF (40/83(43,40)/43) | ROI | 90.7 | 75.8 | n/a
Walhovd et al. (2010) | MRI + CSF (38/73/42) | ROI | 88.8 | 79.1 | n/a
Hinrichs et al. (2011) | MRI + PET (48/119(38,81)/66) | ROI | 92.4 | n/a | 72.3
Westman et al. (2012) | MRI + CSF (96/162(81,81)/111) | ROI | 91.8 | 77.6 | 66.4
Zhang and Shen (2012) | MRI + PET + CSF (45/91(43,48)/50) | ROI | 93.3 | 83.2 | 73.9
Proposed method | MRI + PET (93/204(76,128)/101) | Patch | 95.35 | 85.67 | 75.92
where H(·) is the entropy functional, KL[·||·] denotes Kullback–Leibler The learning proceeds by iteratively alternating the variational
divergence, and Ω is a variational parameter set. mean-field inference to find the values of Ω for the fixed current
For computational simplicity and learning speed, the naïve mean- model parameters Θ and the stochastic approximation procedure to up-
field approximation, which uses a fully factorized distribution, is gener- date model parameters Θ given the variational parameters Ω. Finally,
ally used in the literature (Tanaka, 1998). That is, the shared feature representations can be obtained by inferring the
F1 F2 values of the hidden units in the top hidden layer from the trained
1 2 1 2
Q h ; h v; Ω ¼ ∏ q h j ∏ q hk ðA:3Þ MM-DBM.
j¼1 k¼1
where Ω = {μ1, μ2}, μ 1 ¼ μ 11 ; ⋯; μ 1F 1 , μ 2 ¼ μ 21 ; ⋯; μ 2F 2 , q(hj1 = 1) = μj1 References
(j ∈ {1,⋯, F1}), and q(hj2 = 1) = μk2 (k ∈ {1,⋯, F2}). It alternatively esti- Alzheimer's Association, 2012. 2012 Alzheimer's disease facts and figures. Alzheimers
mates the state of the hidden units, μ1 and μ2, for fixed Θ until conver- Dement. 8, 131–168.
gence: Baron, J., Chtelat, G., Desgranges, B., Perchey, G., Landeau, B., de la Sayette, V., Eustache, F.,
\mu_{j}^{1} \leftarrow \mathrm{sigm}\left( \sum_{i=1}^{D} W_{ij}^{1} v_{i} + \sum_{k=1}^{F_{2}} W_{jk}^{2} \mu_{k}^{2} \right)    (A.4)

\mu_{k}^{2} \leftarrow \mathrm{sigm}\left( \sum_{j=1}^{F_{1}} W_{jk}^{2} \mu_{j}^{1} + \sum_{l=1}^{C} U_{lk} o_{l} \right)    (A.5)
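For concreteness, the fixed-point updates of Eqs. (A.4) and (A.5) amount to a few lines of NumPy. The following is a minimal sketch under our own assumptions, not the authors' implementation: the function and variable names, the 0.5 initialization of the variational parameters, and the stopping criterion are ours, and the label weights U are assumed to be stored as a C x F2 matrix so that U_{lk} matches Eq. (A.5).

import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, o, W1, W2, U, n_iters=50, tol=1e-6):
    """Naive mean-field inference for a two-layer DBM with label units.

    v  : (D,)   visible vector
    o  : (C,)   label (output) vector
    W1 : (D, F1)  visible-to-hidden1 weights
    W2 : (F1, F2) hidden1-to-hidden2 weights
    U  : (C, F2)  label-to-hidden2 weights
    Returns the variational parameters (mu1, mu2).
    """
    mu1 = np.full(W1.shape[1], 0.5)  # q(h_j^1 = 1), assumed initialization
    mu2 = np.full(W2.shape[1], 0.5)  # q(h_k^2 = 1), assumed initialization
    for _ in range(n_iters):
        mu1_new = sigm(v @ W1 + mu2 @ W2.T)   # Eq. (A.4)
        mu2_new = sigm(mu1_new @ W2 + o @ U)  # Eq. (A.5)
        converged = max(np.abs(mu1_new - mu1).max(),
                        np.abs(mu2_new - mu2).max()) < tol
        mu1, mu2 = mu1_new, mu2_new
        if converged:
            break
    return mu1, mu2

Iterating these two coupled updates to convergence yields, for each training sample, the data-dependent statistics used in the parameter updates below.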
Regarding the data-independent statistics, we apply a stochastic approximation procedure to obtain samples, also called particles, of \tilde{\mathbf{v}}, \tilde{\mathbf{h}}^{1}, \tilde{\mathbf{h}}^{2}, and \tilde{\mathbf{o}} by repeatedly running the alternating Gibbs sampler on a set of particles. Once both the data-dependent and data-independent statistics are computed, we then update the parameters as follows:
W^{1,(t+1)} = W^{1,(t)} + \alpha_{t} \left( \frac{1}{N} \sum_{n=1}^{N} \mathbf{v}^{n} \left(\boldsymbol{\mu}^{1,n}\right)^{\top} - \frac{1}{M} \sum_{m=1}^{M} \tilde{\mathbf{v}}^{m} \left(\tilde{\mathbf{h}}^{1,m}\right)^{\top} \right)    (A.6)

W^{2,(t+1)} = W^{2,(t)} + \alpha_{t} \left( \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{\mu}^{1,n} \left(\boldsymbol{\mu}^{2,n}\right)^{\top} - \frac{1}{M} \sum_{m=1}^{M} \tilde{\mathbf{h}}^{1,m} \left(\tilde{\mathbf{h}}^{2,m}\right)^{\top} \right)    (A.7)

U^{(t+1)} = U^{(t)} + \alpha_{t} \left( \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{\mu}^{2,n} \left(\mathbf{o}^{n}\right)^{\top} - \frac{1}{M} \sum_{m=1}^{M} \tilde{\mathbf{h}}^{2,m} \left(\tilde{\mathbf{o}}^{m}\right)^{\top} \right)    (A.8)

where α_t is a learning rate, N and M denote, respectively, the numbers of training samples and particles, and the superscripts n and m denote, respectively, the indices of an observation and a particle.
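The data-independent side can be sketched in the same style. Below, a set of M persistent particles is advanced by one alternating Gibbs sweep and the gradient step of Eqs. (A.6)–(A.8) is applied. The batch layout, the binary sampling, the softmax sampling of the label particles, and the constant learning rate are our assumptions rather than details taken from the paper; U is again stored as a C x F2 matrix, so the updates below are the transposes of the expressions in Eqs. (A.5) and (A.8).

import numpy as np

rng = np.random.default_rng(0)

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h1, h2, o, W1, W2, U):
    """One alternating Gibbs sweep over a batch of M persistent particles.

    Sampling order: h1 | (v, h2), then v | h1, then h2 | (h1, o),
    then o | h2 (softmax over the one-of-C label units).
    """
    h1 = (rng.random(h1.shape) < sigm(v @ W1 + h2 @ W2.T)).astype(float)
    v = (rng.random(v.shape) < sigm(h1 @ W1.T)).astype(float)
    h2 = (rng.random(h2.shape) < sigm(h1 @ W2 + o @ U)).astype(float)
    logits = h2 @ U.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    o = np.array([rng.multinomial(1, pi) for pi in p], dtype=float)
    return v, h1, h2, o

def update_params(W1, W2, U, v, mu1, mu2, o, vt, h1t, h2t, ot, lr=0.01):
    """Eqs. (A.6)-(A.8): data-dependent minus data-independent statistics.

    v, o       : (N, D) / (N, C) training batch and labels
    mu1, mu2   : mean-field statistics stacked over the N samples
                 (cf. mean_field above)
    vt, h1t, h2t, ot : the M Gibbs particles
    """
    N, M = v.shape[0], vt.shape[0]
    W1 += lr * (v.T @ mu1 / N - vt.T @ h1t / M)     # Eq. (A.6)
    W2 += lr * (mu1.T @ mu2 / N - h1t.T @ h2t / M)  # Eq. (A.7)
    U += lr * (o.T @ mu2 / N - ot.T @ h2t / M)      # Eq. (A.8), transposed
    return W1, W2, U

In practice the learning rate α_t would be decayed over iterations, as the stochastic approximation argument requires; a constant lr is used here only to keep the sketch short.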
Appendix B. Learning multimodal DBM parameters

The same approach as for the unimodal DBM described in the Deep Boltzmann Machine section can be applied, i.e., iterative alternation of the variational mean-field approximation for the data-dependent statistics and the stochastic approximation procedure for the data-independent statistics, followed by a parameter update. Let H = {h_M^1, h_M^2, h_P^1, h_P^2, h_S^3} and V = {v_M, v_P}. In variational learning of our MM-DBM, a fully factorized mean-field variational function for the approximation of the true posterior distribution P(H | V, o; Θ = {W_M^1, W_M^2, W_P^1, W_P^2, W_S^3, U}) is defined as follows:

Q(H \mid V, \mathbf{o}; \Omega) = \prod_{i=1}^{F_{M}^{1}} q(h_{M,i}^{1}) \prod_{j=1}^{F_{M}^{2}} q(h_{M,j}^{2}) \prod_{k=1}^{F_{P}^{1}} q(h_{P,k}^{1}) \prod_{l=1}^{F_{P}^{2}} q(h_{P,l}^{2}) \prod_{m=1}^{F_{S}} q(h_{S,m}^{3})
= \prod_{i=1}^{F_{M}^{1}} \mu_{M,i}^{1} \prod_{j=1}^{F_{M}^{2}} \mu_{M,j}^{2} \prod_{k=1}^{F_{P}^{1}} \mu_{P,k}^{1} \prod_{l=1}^{F_{P}^{2}} \mu_{P,l}^{2} \prod_{m=1}^{F_{S}} \mu_{S,m}^{3}    (B.1)

where Ω = {μ_M^1, μ_M^2, μ_P^1, μ_P^2, μ_S^3} is a mean-field parameter set with μ_M^1 = [μ_{M,1}^1, ⋯, μ_{M,F_M^1}^1], μ_M^2 = [μ_{M,1}^2, ⋯, μ_{M,F_M^2}^2], μ_P^1 = [μ_{P,1}^1, ⋯, μ_{P,F_P^1}^1], μ_P^2 = [μ_{P,1}^2, ⋯, μ_{P,F_P^2}^2], and μ_S^3 = [μ_{S,1}^3, ⋯, μ_{S,F_S}^3]. Referring to Eqs. (A.4) and (A.5), given a fixed model parameter Θ, it is straightforward to estimate the mean-field parameters Ω.

The learning proceeds by iteratively alternating the variational mean-field inference to find the values of Ω for the fixed current model parameters Θ and the stochastic approximation procedure to update the model parameters Θ given the variational parameters Ω. Finally, the shared feature representations can be obtained by inferring the values of the hidden units in the top hidden layer of the trained MM-DBM.
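To illustrate how such a shared representation could be computed, the sketch below runs mean-field inference over both pathways of a trained MM-DBM and returns the top-layer activations. The wiring is our reading of the appendix — W3_S acting on the concatenation [h2_M; h2_P], and U linking the label to the shared layer — and all identifiers are hypothetical, not taken from the authors' code.

import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def mmdbm_shared_features(v_M, v_P, theta, o=None, n_iters=50):
    """Mean-field inference in the MM-DBM of Eq. (B.1); returns mu3_S,
    the shared feature representation read off the top hidden layer.

    theta: dict with pathway weights W1_M, W2_M, W1_P, W2_P, the shared
    weights W3_S (rows = [h2_M; h2_P] units), and label weights U (C x F_S).
    """
    W1M, W2M = theta["W1_M"], theta["W2_M"]  # MRI pathway
    W1P, W2P = theta["W1_P"], theta["W2_P"]  # PET pathway
    W3S = theta["W3_S"]
    F2M = W2M.shape[1]
    W3M, W3P = W3S[:F2M], W3S[F2M:]          # split shared rows per pathway

    mu1M = np.full(W1M.shape[1], 0.5)
    mu2M = np.full(F2M, 0.5)
    mu1P = np.full(W1P.shape[1], 0.5)
    mu2P = np.full(W2P.shape[1], 0.5)
    mu3S = np.full(W3S.shape[1], 0.5)
    for _ in range(n_iters):
        # pathway-wise analogues of Eq. (A.4)
        mu1M = sigm(v_M @ W1M + mu2M @ W2M.T)
        mu1P = sigm(v_P @ W1P + mu2P @ W2P.T)
        # second hidden layers see their first layer and the shared layer
        mu2M = sigm(mu1M @ W2M + mu3S @ W3M.T)
        mu2P = sigm(mu1P @ W2P + mu3S @ W3P.T)
        # shared layer fuses both pathways, cf. Eq. (A.5)
        top = mu2M @ W3M + mu2P @ W3P
        if o is not None:                    # label clamped during training only
            top += o @ theta["U"]
        mu3S = sigm(top)
    return mu3S

At test time the label is unknown, so o defaults to None and the label term is dropped; mu3S then serves as the shared MRI–PET feature vector fed to the classifier.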
References

Alzheimer's Association, 2012. 2012 Alzheimer's disease facts and figures. Alzheimers Dement. 8, 131–168.
Baron, J., Chételat, G., Desgranges, B., Perchey, G., Landeau, B., de la Sayette, V., Eustache, F., 2001. In vivo mapping of gray matter loss with voxel-based morphometry in mild Alzheimer's disease. Neuroimage 14, 298–309.
Belleville, S., Clément, F., Mellah, S., Gilbert, B., Fontaine, F., Gauthier, S., 2011. Training-related brain plasticity in subjects at risk of developing Alzheimer's disease. Brain 134, 1623–1634.
Bengio, Y., 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2, pp. 1–127.
Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., 2007. Greedy layer-wise training of deep networks. In: Schölkopf, B., Platt, J., Hoffman, T. (Eds.), Advances in Neural Information Processing Systems, 19. MIT Press, Cambridge, MA, pp. 153–160.
Bokde, A.L.W., Lopez-Bayo, P., Meindl, T., Pechler, S., Born, C., Faltraco, F., Teipel, S.J., Möller, H.J., Hampel, H., 2006. Functional connectivity of the fusiform gyrus during a face-matching task in subjects with mild cognitive impairment. Brain 129, 1113–1124.
Braak, H., Braak, E., 1991. Neuropathological stageing of Alzheimer-related changes. Acta Neuropathol. 82, 239–259.
Burton, E.J., Barber, R., Mukaetova-Ladinska, E.B., Robson, J., Perry, R.H., Jaros, E., Kalaria, R.N., O'Brien, J.T., 2009. Medial temporal lobe atrophy on MRI differentiates Alzheimer's disease from dementia with Lewy bodies and vascular cognitive impairment: a prospective study with pathological verification of diagnosis. Brain 132, 195–203.
Catana, C., Drzezga, A., Heiss, W.D., Rosen, B.R., 2012. PET/MRI for neurologic applications. J. Nucl. Med. 53, 1916–1925.
Ciresan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J., 2013. Mitosis detection in breast cancer histology images with deep neural networks. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013, pp. 411–418.
Cui, Y., Liu, B., Luo, S., Zhen, X., Fan, M., Liu, T., Zhu, W., Park, M., Jiang, T., Jin, J.S., the Alzheimer's Disease Neuroimaging Initiative, 2011. Identification of conversion from mild cognitive impairment to Alzheimer's disease using multivariate predictors. PLoS One 6, e21896.
Cuingnet, R., Gerardin, E., Tessieras, J., Auzias, G., Lehéricy, S., Habert, M.O., Chupin, M., Benali, H., Colliot, O., the Alzheimer's Disease Neuroimaging Initiative, 2011. Automatic classification of patients with Alzheimer's disease from structural MRI: a comparison of ten methods using the ADNI database. Neuroimage 56, 766–781.
Dai, W., Lopez, O., Carmichael, O., Becker, J., Kuller, L., Gach, H., 2009. Mild cognitive impairment and Alzheimer disease: patterns of altered cerebral blood flow at MR imaging. Radiology 250, 856–866.
Davatzikos, C., Genc, A., Xu, D., Resnick, S.M., 2001. Voxel-based morphometry using the RAVENS maps: methods and validation using simulated longitudinal atrophy. Neuroimage 14, 1361–1369.
Davatzikos, C., Bhatt, P., Shaw, L.M., Batmanghelich, K.N., Trojanowski, J.Q., 2011. Prediction of MCI to AD conversion, via MRI, CSF biomarkers, and pattern classification. Neurobiol. Aging 32 (2322.e19–2322.e27).
de Jong, L.W., van der Hiele, K., Veer, I.M., Houwing, J.J., Westendorp, R.G.J., Bollen, E.L.E.M., de Bruin, P.W., Middelkoop, H.A.M., van Buchem, M.A., van der Grond, J., 2008. Strongly reduced volumes of putamen and thalamus in Alzheimer's disease: an MRI study. Brain 131, 3277–3285.
Desikan, R., Cabral, H., Hess, C., Dillon, W., Salat, D., Buckner, R., Fischl, B., the Alzheimer's Disease Neuroimaging Initiative, 2009. Automated MRI measures identify individuals with mild cognitive impairment and Alzheimer's disease. Brain 132, 2048–2057.
Devanand, D.P., Pradhaban, G., Liu, X., Khandji, A., De Santi, S., Segal, S., Rusinek, H., Pelton, G.H., Honig, L.S., Mayeux, R., Stern, Y., Tabert, M.H., de Leon, M.J., 2007. Hippocampal and entorhinal atrophy in mild cognitive impairment. Neurology 68, 828–836.
Dinov, I., Boscardin, J., Mega, M., Sowell, E., Toga, A., 2005. A wavelet-based statistical analysis of fMRI data. Neuroinformatics 3, 319–342.
Ewers, M., Walsh, C., Trojanowski, J.Q., Shaw, L.M., Petersen, R.C., Jack Jr., C.R., Feldman, H.H., Bokde, A.L., Alexander, G.E., Scheltens, P., Vellas, B., Dubois, B., Weiner, M., Hampel, H., 2012. Prediction of conversion from mild cognitive impairment to Alzheimer's disease dementia based upon biomarkers and neuropsychological test performance. Neurobiol. Aging 33, 1203–1214 (e2).
Fan, Y., Rao, H., Hurt, H., Giannetta, J., Korczykowski, M., Shera, D., Avants, B.B., Gee, J.C., Wang, J., Shen, D., 2007a. Multivariate examination of brain abnormality using both structural and functional MRI. Neuroimage 36, 1189–1199.
Fan, Y., Shen, D., Gur, R., Gur, R., Davatzikos, C., 2007b. COMPARE: classification of morphological patterns using adaptive regional elements. IEEE Trans. Med. Imaging 26, 93–105.
Friston, K.J., 1995. Functional and effective connectivity in neuroimaging: a synthesis. Hum. Brain Mapp. 2, 56–78.
Greicius, M.D., Srivastava, G., Reiss, A.L., Menon, V., 2004. Default-mode network activity distinguishes Alzheimer's disease from healthy aging: evidence from functional MRI. Proc. Natl. Acad. Sci. U. S. A. 101, 4637–4642.
Hackmack, K., Paul, F., Weygandt, M., Allefeld, C., Haynes, J.D., 2012. Multi-scale classification of disease using structural MRI and wavelet transform. Neuroimage 62, 48–58.
Hinrichs, C., Singh, V., Xu, G., Johnson, S.C., 2011. Predictive markers for AD in a multi-modality framework: an analysis of MCI progression in the ADNI population. Neuroimage 55, 574–589.
Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science 313, 504–507.
Hinton, G.E., Osindero, S., Teh, Y.W., 2006. A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554.
Hjelm, R.D., Calhoun, V.D., Salakhutdinov, R., Allen, E.A., Adali, T., Plis, S.M., 2014. Restricted Boltzmann machines for neuroimaging: an application in identifying intrinsic networks. Neuroimage 96, 245–260.
Ishii, K., Kawachi, T., Sasaki, H., Kono, A.K., Fukuda, T., Kojima, Y., Mori, E., 2005. Voxel-based morphometric comparison between early- and late-onset mild Alzheimer's disease and assessment of diagnostic performance of z score images. Am. J. Neuroradiol. 26, 333–340.
Jia, H., Wu, G., Wang, Q., Shen, D., 2010. ABSORB: atlas building by self-organized registration and bundling. Neuroimage 51, 1057–1070.
Johnson, N.A., Jahng, G.H., Weiner, M.W., Miller, B.L., Chui, H.C., Jagust, W.J., Gorno-Tempini, M.L., Schuff, N., 2005. Pattern of cerebral hypoperfusion in Alzheimer disease and mild cognitive impairment measured with arterial spin-labeling MR imaging: initial experience. Radiology 234, 851–859.
Kabani, N., MacDonald, D., Holmes, C., Evans, A., 1998. A 3D atlas of the human brain. Neuroimage 7, S717.
Kohannim, O., Hua, X., Hibar, D.P., Lee, S., Chou, Y.Y., Toga, A.W., Jack Jr., C.R., Weiner, M.W., Thompson, P.M., 2010. Boosting power for clinical trials using classifiers based on multiple biomarkers. Neurobiol. Aging 31, 1429–1442.
Larochelle, H., Bengio, Y., 2008. Classification using discriminative restricted Boltzmann machines. Proceedings of the 25th International Conference on Machine Learning, pp. 536–543.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324.
Lee, A.C.H., Buckley, M.J., Gaffan, D., Emery, T., Hodges, J.R., Graham, K.S., 2006. Differentiating the roles of the hippocampus and perirhinal cortex in processes beyond long-term declarative memory: a double dissociation in dementia. J. Neurosci. 26, 5198–5203.
Lee, H., Grosse, R., Ranganath, R., Ng, A.Y., 2009. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings of the 26th International Conference on Machine Learning, pp. 609–616.
Li, Y., Wang, Y., Wu, G., Shi, F., Zhou, L., Lin, W., Shen, D., 2012. Discriminant analysis of longitudinal cortical thickness changes in Alzheimer's disease using dynamic and network features. Neurobiol. Aging 33 (427.e15–427.e30).
Liao, S., Gao, Y., Oto, A., Shen, D., 2013. Representation learning: a unified deep learning framework for automatic prostate MR segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013. Lecture Notes in Computer Science, vol. 8150, pp. 254–261.
Liu, M., Zhang, D., Shen, D., 2012. Ensemble sparse classification of Alzheimer's disease. Neuroimage 60, 1106–1116.
Liu, M., Zhang, D., Shen, D., the Alzheimer's Disease Neuroimaging Initiative, 2013. Hierarchical fusion of features and classifier decisions for Alzheimer's disease diagnosis. Hum. Brain Mapp. 35, 1305–1319.
Loewenstein, D.A., Greig, M.T., Schinka, J.A., Barker, W., Shen, Q., Potter, E., Raj, A., Brooks, L., Varon, D., Schoenberg, M., Banko, J., Potter, H., Duara, R., 2012. An investigation of PreMCI: subtypes and longitudinal outcomes. Alzheimers Dement. 8, 172–179.
Mark, R.E., Sitskoorn, M.M., 2013. Are subjective cognitive complaints relevant in preclinical Alzheimer's disease? A review and guidelines for healthcare professionals. Rev. Clin. Gerontol. 23, 61–74.
Mohamed, A., Dahl, G.E., Hinton, G.E., 2012. Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process. 20, 14–22.
Montavon, G., Braun, M.L., Müller, K.R., 2012. Deep Boltzmann machines as feed-forward hierarchies. Journal of Machine Learning Research—Proceedings Track, 22, pp. 798–804.
Mosconi, L., 2005. Brain glucose metabolism in the early and specific diagnosis of Alzheimer's disease. Eur. J. Nucl. Med. Mol. Imaging 32, 486–510.
Nair, V., Hinton, G.E., 2008. Implicit mixtures of restricted Boltzmann machines. Advances in Neural Information Processing Systems, pp. 1145–1152.
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y., 2011. Multimodal deep learning. Proceedings of the 28th International Conference on Machine Learning, pp. 689–696.
Nordberg, A., Rinne, J.O., Kadir, A., Langstrom, B., 2010. The use of PET in Alzheimer disease. Nat. Rev. Neurol. 6, 78–87.
Perrin, R.J., Fagan, A.M., Holtzman, D.M., 2009. Multimodal techniques for diagnosis and prognosis of Alzheimer's disease. Nature 461, 916–922.
Pichler, B.J., Kolb, A., Nägele, T., Schlemmer, H.P., 2010. PET/MRI: paving the way for the next generation of clinical multimodality imaging applications. J. Nucl. Med. 51, 333–336.
Salakhutdinov, R., Hinton, G.E., 2009. Deep Boltzmann machines. Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 448–455.
Salakhutdinov, R., Hinton, G., 2012. An efficient learning procedure for deep Boltzmann machines. Neural Comput. 24, 1967–2006.
Shen, D., Davatzikos, C., 2002. HAMMER: hierarchical attribute matching mechanism for elastic registration. IEEE Trans. Med. Imaging 21, 1421–1439.
Shin, H.C., Orton, M.R., Collins, D.J., Doran, S.J., Leach, M.O., 2013. Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4D patient data. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1930–1943.
Singh, V., Chertkow, H., Lerch, J.P., Evans, A.C., Dorr, A.E., Kabani, N.J., 2006. Spatial patterns of cortical thinning in mild cognitive impairment and Alzheimer's disease. Brain 129, 2885–2893.
Sled, J.G., Zijdenbos, A.P., Evans, A.C., 1998. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging 17, 87–97.
Srivastava, N., Salakhutdinov, R., 2012. Multimodal learning with deep Boltzmann machines. Advances in Neural Information Processing Systems, 25, pp. 2231–2239.
Suk, H.I., Shen, D., 2013. Deep learning-based feature representation for AD/MCI classification. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013. Lecture Notes in Computer Science, vol. 8150, pp. 583–590.
Suk, H.I., Wee, C.Y., Shen, D., 2013. Discriminative group sparse representation for mild cognitive impairment classification. Machine Learning in Medical Imaging. Lecture Notes in Computer Science, vol. 8184, pp. 131–138.
Tanaka, T., 1998. A theory of mean field approximation. Advances in Neural Information Processing Systems (NIPS). The MIT Press, pp. 351–360.
Tang, S., Fan, Y., Wu, G., Kim, M., Shen, D., 2009. RABBIT: rapid alignment of brains by building intermediate templates. Neuroimage 47, 1277–1287.
Visser, P.J., Verhey, F.R.J., Hofman, P.A.M., Scheltens, P., Jolles, J., 2002. Medial temporal lobe atrophy predicts Alzheimer's disease in patients with minor cognitive impairment. J. Neurol. Neurosurg. Psychiatry 72, 491–497.
Walhovd, K., Fjell, A., Brewer, J., McEvoy, L., Fennema-Notestine, C., Hagler Jr., D.J., Jennings, R., Karow, D., Dale, A., the Alzheimer's Disease Neuroimaging Initiative, 2010. Combining MR imaging, positron-emission tomography, and CSF biomarkers in the diagnosis and prognosis of Alzheimer disease. Am. J. Neuroradiol. 31, 347–354.
Wang, Y., Nie, J., Yap, P.T., Li, G., Shi, F., Geng, X., Guo, L., Shen, D., 2014. Knowledge-guided robust MRI brain extraction for diverse large-scale neuroimaging studies on humans and non-human primates. PLoS One 9, e77810.
Ward, A., Arrighi, H.M., Michels, S., Cedarbaum, J.M., 2012. Mild cognitive impairment: disparity of incidence and prevalence estimates. Alzheimers Dement. 8, 14–21.
Wee, C.Y., Yap, P.T., Li, W., Denny, K., Browndyke, J.N., Potter, G.G., Welsh-Bohmer, K.A., Wang, L., Shen, D., 2011. Enriched white matter connectivity networks for accurate identification of MCI patients. Neuroimage 54, 1812–1822.
Wee, C.Y., Yap, P.T., Zhang, D., Denny, K., Browndyke, J.N., Potter, G.G., Welsh-Bohmer, K.A., Wang, L., Shen, D., 2012. Identification of MCI individuals using structural and functional connectivity networks. Neuroimage 59, 2045–2056.
Westman, E., Muehlboeck, J.S., Simmons, A., 2012. Combining MRI and CSF measures for classification of Alzheimer's disease and prediction of mild cognitive impairment conversion. Neuroimage 62, 229–238.
Xue, Z., Shen, D., Davatzikos, C., 2006. Statistical representation of high-dimensional deformation fields with application to statistically constrained 3D warping. Med. Image Anal. 10, 740–751.
Yang, J., Shen, D., Davatzikos, C., Verma, R., 2008. Diffusion tensor image registration using tensor geometry and orientation features. In: Metaxas, D., Axel, L., Fichtinger, G., Székely, G. (Eds.), Medical Image Computing and Computer-Assisted Intervention—MICCAI 2008. Springer, Berlin Heidelberg, pp. 905–913.
Yuan, L., Wang, Y., Thompson, P.M., Narayan, V.A., Ye, J., 2012. Multi-source feature learning for joint analysis of incomplete multiple heterogeneous neuroimaging data. Neuroimage 61, 622–632.
Zhang, D., Shen, D., 2012. Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer's disease. Neuroimage 59, 895–907.
Zhang, Y., Brady, M., Smith, S., 2001. Segmentation of brain MR images through a hidden Markov random field model and the expectation–maximization algorithm. IEEE Trans. Med. Imaging 20, 45–57.
Zhang, D., Wang, Y., Zhou, L., Yuan, H., Shen, D., 2011. Multimodal classification of Alzheimer's disease and mild cognitive impairment. Neuroimage 55, 856–867.
Zhang, D., Shen, D., the Alzheimer's Disease Neuroimaging Initiative, 2012. Predicting future clinical changes of MCI patients using longitudinal and multimodal biomarkers. PLoS One 7, e33182.
Zhou, L., Wang, Y., Li, Y., Yap, P.T., Shen, D., the Alzheimer's Disease Neuroimaging Initiative, 2011. Hierarchical anatomical brain networks for MCI prediction: revisiting volumetric measures. PLoS One 6, e21935.