Transformer
Keywords: Autism spectrum disorder (ASD); Functional magnetic resonance imaging (fMRI); Deep learning (DL); Transformer; Adversarial Generation Network (GAN); ABIDE
As the prevalence of autism spectrum disorder (ASD) increases globally, more and more patients need to receive timely diagnosis and treatment to alleviate their suffering. However, the current diagnosis method of ASD still adopts the subjective symptom-based criteria through clinical observation, which is time-consuming and costly. In recent years, functional magnetic resonance imaging (fMRI) neuroimaging techniques have emerged to facilitate the identification of potential biomarkers for diagnosing ASD. In this study, we developed a deep learning framework named spatial–temporal Transformer (ST-Transformer) to distinguish ASD subjects from typical controls based on fMRI data. Specifically, a linear spatial–temporal multi-headed attention unit is proposed to obtain the spatial and temporal representation of fMRI data. Moreover, a Gaussian GAN-based data balancing method is introduced to solve the data unbalance problem in real-world ASD datasets for subtype ASD diagnosis. Our proposed ST-Transformer is evaluated on a large cohort of subjects from two independent datasets (ABIDE I and ABIDE II) and achieves robust accuracies of 71.0% and 70.6%, respectively. Compared with state-of-the-art methods, our results demonstrate competitive performance in ASD diagnosis.
✩ This work was supported in part by the Natural Science Foundation of Chongqing, China under Grant cstc2020jcyj-msxmX0284; in part by The Science
and Technology Research Program of Chongqing Municipal Education Commission, China under Grant KJQN202000625; in part by the National Natural Science
Foundation of China under Grant 61806033, Grant 61703065; and in part by the Educational Reform Project of CQUPT, China under Grant XJG20207.
∗ Corresponding author.
E-mail address: [email protected] (R. Liu).
https://doi.org/10.1016/j.compbiomed.2022.106320
Received 3 March 2022; Received in revised form 11 October 2022; Accepted 14 November 2022
Available online 17 November 2022
0010-4825/© 2022 Elsevier Ltd. All rights reserved.
X. Deng et al. Computers in Biology and Medicine 151 (2022) 106320
Attention mechanism [8], a well-known deep learning technique, has been widely adopted in natural language processing (NLP) [9–12], computer vision (CV) [13–16], and speech processing [17,18]. The attention-based approach, which simplifies the complex feature extraction process by focusing on important parts during neural network training, has also been applied to develop advanced CAD methods for brain disease diagnosis [19–21]. Since most ASD datasets are collected from different clinical sites with different sampling periods, there is a complex multi-site data problem. To address the multi-site data problem of ASD, researchers used multi-head attention to compute independent features in parallel and then concatenated these independent features. This parallel structure can obtain information from fMRI data from different perspectives and solve the multi-site data problem to a certain extent [22]. However, the above methods only addressed the multi-site data problem of ASD in a static way, where the temporal characteristic of fMRI data is not considered [23]. In addition, the diagnosis of ASD subtypes is also crucial for planning the treatment of ASD patients. The significant data imbalance in ASD subtypes raises several issues, including poor diagnostic performance and imbalanced specificity and sensitivity in current subtype diagnostic studies.
To address the above issues, an end-to-end deep learning framework named spatial–temporal Transformer (ST-Transformer) is proposed to effectively distinguish ASD with subtypes based on time-series fMRI data. To be specific, a linear spatial–temporal multi-headed attention (LSTMA) unit is proposed to simultaneously learn a joint feature representation on the spatial–temporal domain. In addition, for ASD subtype diagnosis, an approach named Gaussian-GAN data balancing (GGDB) is proposed to address the data imbalance issue in real-world ASD datasets. Comprehensive experiments are conducted for performance evaluation based on two ASD datasets (i.e., Autism Brain Imaging Data Exchange (ABIDE) I/II). The contributions of this paper can be summarized as follows:
(1) A linear spatial–temporal multi-headed attention (LSTMA) unit is introduced in this work. Specifically, a linear multi-headed attention method is applied to ROI-averaged time series from both spatial and temporal perspectives. With the help of the attention mechanism, the LSTMA unit is able to hierarchically complement fMRI features in the spatial and temporal domains to learn a fine-grained feature representation that facilitates diagnosis performance. In addition, by using the LSTMA unit, the training process of neural networks can be accelerated.
(2) To address the data imbalance problem of ASD subtype samples, we propose a GGDB strategy. Our GGDB is capable of learning the hidden representation and distribution of the original data to generate pseudo data samples.
(3) Based on the LSTMA unit and GGDB, the end-to-end ST-Transformer is constructed to diagnose ASD with subtypes. The effectiveness and reliability of the proposed methods are validated on two real-world datasets. Compared with state-of-the-art methods, our results show competitive performance in ASD diagnosis.
The rest of this paper is organized as follows. In Section 2, we briefly review previous studies on fMRI-based CAD methods for ASD diagnosis. Section 3 describes the studied datasets and data preprocessing. In Section 4, we introduce the details of the proposed ST-Transformer framework and the data balancing strategy for ASD subtypes. The experimental results and analysis are discussed in Section 5. Finally, the paper is concluded in Section 6.
2. Related works
In this section, we briefly review previous work on fMRI-based CAD methods for ASD diagnosis. The existing CAD methods are mainly divided into two categories: conventional learning-based methods and deep learning-based methods.
2.1. Conventional learning-based methods
Conventional learning-based CAD methods typically apply machine learning methods to perform diagnosis tasks based on fMRI features. Functional connectivity (FC), the temporal correlation between blood oxygen level-dependent signals from separate brain regions, is the most commonly used fMRI feature for CAD models. FC can reflect the functional interactions among different brain regions. Thanks to the strong ability of FC to characterize connection patterns of brain activity, numerous conventional learning-based studies are built on FC measures. In particular, many conventional learning CAD methods based on FC measures proceed in two steps (feature selection and classifier construction).
Typically, feature extraction is performed to find potential biomarkers for ASD identification. Then, a well-constructed classifier is adapted to perform the ASD classification task. For example, Wang et al. [24] chose 35 ROIs to construct the FC matrix; after finding the optimal features, a support vector machine (SVM) with Gaussian kernel was used to identify fMRI scans of ASD. Sadeghian et al. [25] constructed a whole-brain FC with feature dimension reduction by a genetic algorithm, and achieved the final ASD diagnosis using a k-nearest neighbor classifier. Wang et al. [26] used a similarity-driven multi-view linear reconstruction model to learn potential representations and perform topic clustering in ASD and healthy controls. Then, a nested singular value decomposition method was designed to extract FC features. Finally, the extracted FC features were fed to a linear SVM classifier for ASD detection. Yap et al. [27] proposed a framework based on penalized SVM clusters that combine the selection of significant FCs from the original FCs as input features for the final SVM classifier. To explore the FC of the brain, Ma et al. [28] used the Hilbert transform to determine the phase synchrony among brain regions. Principal component analysis and SVM were utilized to develop a discriminant model for identifying ASD.
Although conventional learning-based methods achieve good classification results on homogeneous data sites, well-designed feature selection (dimensionality reduction) methods are still the core of their good performance. Moreover, these data-dependent feature selection methods are time-consuming, labor-intensive, and also lack generality.
2.2. Deep learning-based methods
Deep learning has been successfully applied in brain disease diagnosis [29–32]. In FC-based deep learning, many researchers directly use deep learning models to capture potential biomarkers, rather than carefully designed feature extraction algorithms. For example, Leming et al. [33] used the FC matrix extracted from a multi-source fMRI connected group dataset to train a convolutional neural network (CNN) for ASD diagnosis. Heinsfeld et al. [34] designed a stacked denoising autoencoder to find the hidden patterns of FC for further diagnosis of ASD. Eslami et al. [35] also combined an autoencoder with a single-layer perceptron to perform FC selection and classification in an end-to-end manner. However, FC-based methods for ASD diagnosis do not consider the time-series nature of fMRI data and therefore lose the temporal variation information.
In recent years, recurrent neural network (RNN), long short-term memory (LSTM), and attention mechanism-based methods, which can capture feature information in the temporal domain, have shown great potential in the diagnosis of brain diseases. For example, Liu et al. [36] developed a novel multi-network of LSTM for the identification of attention deficit hyperactivity disorder (ADHD). Dvornek et al. [37] proposed a recurrent neural network with LSTM for the classification of individuals with ASD and typical controls directly from the fMRI time series. A framework with an RNN unit was presented by Byeon et al. [38] to extract temporal properties of fMRI data for multi-site ASD classification. The RNN, LSTM, CNN, and multiple hybrid models were proposed together for the diagnosis of ASD by Bayram et al. [39]. It was shown that the RNN exhibited better performance than the other methods. Niu et al. [23] proposed a multichannel deep attention neural network, which integrates multilayer neural networks, attention mechanisms, and feature fusion for ASD recognition. Zhang
Table 1
Demographic information on ABIDE I.
Site ASD TC
Age avg(SD) Sex(M/F) Handedness(L/M/R) FIQ avg(SD) Age avg(SD) Sex(M/F) Handedness(L/M/R) FIQ avg(SD)
CALTECH 27.4(10.3) 15/4 0/5/14 108.2(12.2) 28.0(10.9) 14/4 1/3/14 114.8(9.3)
CMU 26.4(5.8) 11/3 1/1/12 114.5(11.2) 26.8(5.7) 10/3 0/1/12 114.6(9.3)
KKI 10.0(1.4) 16/4 1/3/16 97.9(17.1) 10.0(1.2) 20/8 1/3/24 112.1(9.2)
LEUVEN 17.8(5.0) 26/3 3/0/26 109.4(12.6) 18.2(5.1) 29/5 4/0/30 114.8(12.4)
MAX_MUN 26.1(14.9) 21/3 2/0/22 109.9(14.2) 24.6(8.8) 27/1 0/0/28 111.8(9.1)
NYU 14.7(7.1) 65/10 N/A 107.1(16.3) 15.7(6.2) 74/26 N/A 113.0(13.3)
OLIN 16.5(3.4) 16/3 4/0/15 112.6(17.8) 16.7(3.6) 13/2 2/0/13 113.9(16.0)
PITT 19.0(7.3) 25/4 3/1/25 110.2(14.3) 18.9(6.6) 23/4 1/1/25 110.1(9.2)
SBL 35.0(10.4) 15/0 N/A N/A 33.7(6.6) 15/0 N/A N/A
SDSU 14.7(1.8) 13/1 1/0/13 111.4(17.4) 14.2(1.9) 16/6 3/0/19 108.1(10.3)
STANFORD 10.0(1.6) 15/4 3/1/15 110.7(15.7) 10.0(1.6) 16/4 0/2/18 112.1(15.0)
TRINITY 16.8(3.2) 22/0 0/0/22 108.9(15.2) 17.1(3.8) 25/0 0/0/25 112.5(9.2)
UCLA 13.0(2.5) 48/6 6/0/48 100.4(13.4) 13.0(1.9) 38/6 4/0/40 106.4(11.1)
UM 13.2(2.4) 57/9 7/8/51 105.5(17.1) 14.8(3.6) 56/18 9/2/63 108.2(9.7)
USM 23.5(8.3) 46/0 N/A 99.7(16.4) 21.3(8.4) 25/0 N/A 115.4(14.8)
YALE 12.7(3.0) 20/8 5/0/23 94.6(21.2) 12.7(2.8) 20/8 4/0/24 105.0(17.1)
FIQ: Full Scale Intelligence Quotient; SD: Standard Deviation; TC: Typical Control; Avg: Average.
et al. [20] proposed a new two-stage network structure for the classification of ADHD by combining a split-channel convolutional network with an attention-based network. The split-channel convolutional network was used to learn temporal features of each brain region, while the attention-based network was used to discover temporal correlation features among brain regions and extract fusion features.
Although temporal deep learning methods have made great progress, they only focus on temporal information while ignoring the spatial domain. Temporal domain-based deep learning models fail to adequately utilize the information from fMRI data. As a result, temporal domain-based deep learning methods do not provide excellent generalization ability, which limits classification performance. In this paper, we propose an ST-Transformer deep learning framework that attends to the information in the spatio-temporal domain of fMRI data for the diagnosis of ASD.
3. Materials
In this section, we introduce the fMRI datasets, the image pre-processing pipeline, and the data augmentation strategy used in our study.
3.1. Data preprocessing
In this study, the preprocessed ABIDE I dataset is downloaded from [40]. The ABIDE I dataset is collected from 17 different sites with 1112 subjects, including 539 patients with ASD and 573 typical controls. A total of 1035 subjects with complete labeling information are available. To fix the time length of the ROI-averaged time series, 1009 valid subjects are selected in this work. The detailed ABIDE I dataset information is presented in Table 1. The preprocessed fMRI data selected in this work employs the Configurable Pipeline for the Analysis of Connectomes (C-PAC) pipeline with Craddock 200 (CC200) functional parcellation.
Besides, we also selected valid fMRI data from ABIDE II to validate the generalization ability of our proposed model. ABIDE II involves 19 sites containing a combined sample of 1114 subjects, consisting of 521 ASD subjects and 593 healthy controls. To fix the time length of the ROI-averaged time series, 1058 valid subjects are selected in this work. The detailed ABIDE II dataset information is presented in Table 2. However, there is no processed ABIDE II data available online. To preprocess the fMRI raw data, the pipeline of Data Processing Assistant for Resting-State fMRI (DPARSF) [41] is used in this work. The preprocessing steps in the DPARSF pipeline include removal of the first few volumes, slice timing correction and realignment, motion correction, spatial normalization, bandpass filtering, normalization by the MNI template, and smoothing with a 6-mm Gaussian kernel of full width at half maximum (FWHM). The details of the preprocessing pipelines are available on the website (https://rfmri.org/DPARSF). In this work, we leveraged the CC200 functional parcellation atlas to partition the whole brain into 200 ROIs for extracting the time series data.
3.2. Data augmentation
Although ABIDE provides over a thousand subjects, the training of neural network models tends to require a large amount of data to prevent overfitting. We adopt the same simple cropping data augmentation strategy as in [37]. Since the time length of the data obtained by each site is different, we fix a time length of 90 as the final data length to ensure that the time length of our sliced data remains consistent. We crop 10 sequences of the ROI-averaged time series for each subject by randomly sliding windows. This increases the number of fMRI samples tenfold for further model training.
4. Model design
In this study, the overall flowchart of the proposed model is shown in Fig. 1. Transformer is applied as the backbone of our proposed model owing to its multi-headed self-attention mechanism, which attends well to the global information of the fMRI time series. Based on the Transformer, an LSTMA unit is designed in our proposed ST-Transformer to capture the spatio-temporal properties of fMRI data. In addition, we propose a GGDB method to address the data imbalance problem of ASD subtype samples.
4.1. Preliminaries
The overall structure of the Transformer encoder consists of a multi-headed self-attention module, a position-wise feed-forward network (FFN), residual connections, and layer normalization, as shown in Fig. 2a. The self-attention mechanism is the core of the Transformer encoder and is computed from the Query, Key, and Value matrices. Given packed matrices representing queries $Q \in \mathbb{R}^{N \times D_k}$, keys $K \in \mathbb{R}^{M \times D_k}$, and values $V \in \mathbb{R}^{M \times D_v}$, the scaled dot-product attention is formulated by Eq. (1),
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{D_k}}\right) V$$ (1)
where $N$ and $M$ denote the lengths of queries and keys (or values). $D_k$ and $D_v$ denote the dimensions of keys (or queries) and values. Softmax is an activation function that converts the attention score into a probability. The dot product of $Q$ and $K^{T}$ is divided by $\sqrt{D_k}$ to alleviate the gradient vanishing problem. In contrast to the simple single attention function, Transformer applies a multi-headed self-attention
Table 2
Demographic information on ABIDE II.
Site ASD TC
Age avg(SD) Sex(M/F) Handedness(L/M/R) FIQ avg(SD) Age avg(SD) Sex(M/F) Handedness(L/M/R) FIQ avg(SD)
BNI 37.4(15.8) 29/0 0/0/29 107.8(13.5) 39.6(14.8) 29/0 0/0/29 112.4(11.9)
EMC 8.0(1.2) 22/5 5/0/22 N/A 8.1(1.0) 22/5 6/0/21 N/A
ETH 20.6(3.3) 13/0 0/0/13 109(12.5) 23.9(4.4) 24/0 0/0/24 116.5(9.3)
GU 10.9(1.5) 43/8 8/0/43 118.3(15.2) 10.4(1.7) 28/27 3/0/52 121.5(13.7)
IU 25.0(9.1) 16/4 2/3/15 116.3(11.5) 23.8(4.8) 15/5 1/2/17 117(10.4)
KKI 10.3(1.5) 41/15 2/8/46 103.4(15.8) 10.3(1.2) 99/56 10/12/133 114.3(10.5)
KUL 23.6(4.8) 28/0 6/0/22 106.6(15.5) N/A N/A N/A N/A
NYU 8.9(4.8) 67/8 6/15/54 103.8(16.8) 9.5(3.3) 28/2 0/1/29 116.1(15.4)
OHSU 11.8(2.2) 30/7 1/1/35 106.0(16.5) 10.4(1.6) 27/29 0/1/55 117.5(11.9)
OILH 21.8(3.6) 20/4 4/4/16 114.0(15.9) 24.0(3.6) 20/15 0/3/32 111.2(12.7)
SDSU 12.9(3.2) 26/7 4/2/27 99.8(14.5) 13.3(3.0) 23/2 1/3/21 103.0(11.5)
SU 11.2(1.2) 19/2 0/0/21 111.8(15.4) 11.0(1.3) 19/2 0/3/18 116.1(13.7)
TCD 14.9(3.2) 21/0 0/0/21 108.5(15.0) 15.6(3.0) 21/0 0/0/21 118.5(12.9)
UCD 14.8(1.9) 14/4 0/1/17 103.4(11.8) 14.8(1.7) 10/4 0/0/14 113.0(10.8)
UCLA 11.7(2.2) 15/1 2/0/14 102.1(13.5) 9.7(2.1) 11/5 1/1/14 114.5(13.4)
U_MIA 9.9(1.9) 11/2 0/1/12 100.8(19.2) 9.7(2.1) 11/4 0/0/15 115.9(14.2)
USM 18.3(6.8) 15/2 0/2/15 99.3(19.1) 24.0(7.5) 13/3 0/1/15 115.2(14.1)
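The cropping augmentation of Section 3.2 (10 random windows of length 90 per subject) can be illustrated with the minimal sketch below. It is not the authors' code: the function name `random_crops`, the seed handling, and the uniform choice of window start are assumptions made for illustration, and the toy subject with 150 time points is invented.

```python
import random


def random_crops(series, crop_len=90, n_crops=10, seed=0):
    """Return n_crops windows of crop_len consecutive time points.

    series: list of T rows, one row per time point; each row holds the
    ROI-averaged values (200 ROIs for the CC200 atlas).
    The window start is drawn uniformly at random (an assumption).
    """
    total_len = len(series)
    if total_len < crop_len:
        raise ValueError("time series is shorter than the crop length")
    rng = random.Random(seed)
    crops = []
    for _ in range(n_crops):
        start = rng.randint(0, total_len - crop_len)  # both bounds inclusive
        crops.append(series[start:start + crop_len])
    return crops


# Toy subject: 150 time points x 200 ROIs; row t is filled with the value t
subject = [[float(t)] * 200 for t in range(150)]
crops = random_crops(subject)
print(len(crops), len(crops[0]), len(crops[0][0]))  # 10 90 200
```

Each subject therefore contributes 10 training samples of identical shape (90 x 200), which is what allows sites with different scan lengths to be batched together.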
mechanism. The rationale behind multi-headed self-attention is that the queries, keys, and values matrices of the original $D_m$ dimension are mapped to $h$ different $D_k$, $D_k$, $D_v$ dimensions by $h$ different projection mappings. For each projection mapping of the queries, keys, and values matrices, the output can be calculated as shown in Eq. (1). Finally, the $h$ different outputs are concatenated and projected back to the original $D_m$ dimension, which gives the output of multi-headed self-attention. The multi-headed self-attention can be given by Eq. (2),
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$$ (2)
where the projections are parameter matrices $W_i^{Q} \in \mathbb{R}^{D_m \times D_k}$, $W_i^{K} \in \mathbb{R}^{D_m \times D_k}$, $W_i^{V} \in \mathbb{R}^{D_m \times D_v}$, and $W^{O} \in \mathbb{R}^{h D_v \times D_m}$. $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the weight matrices corresponding to the $Q$, $K$, and $V$ vectors, respectively. $W^{O}$ is the weight matrix applied to the attention outputs concatenated from the multiple heads. Then, the FFN applies two linear transformations with a ReLU activation function to the output of multi-headed self-attention as follows,
$$\mathrm{FFN}(x) = \mathrm{ReLU}(x W_1 + b_1) W_2 + b_2$$ (3)
where $x$ denotes the output of the previous layer, and $W_1$, $W_2$, $b_1$, $b_2$ denote the trainable parameters. Finally, to solve the problem that deep neural network models are difficult to train, two residual connections with layer normalization are applied to the outputs of the multi-headed self-attention and the FFN, respectively.
4.2. Spatial–temporal transformer
Based on the Transformer encoder, a novel ST-Transformer is proposed to hierarchically complement fMRI features in both spatial and temporal domains to learn a fine-grained feature representation that facilitates diagnosis performance. The architecture of ST-Transformer is shown in Fig. 2b. ST-Transformer is a variant of the Transformer encoder in which a linear spatial–temporal multi-headed attention (LSTMA) unit replaces the self-attention mechanism. The self-attention mechanism in the vanilla Transformer is more likely to focus on the information of the sequence data in the temporal domain, whereas the spatial information of sequence data is ignored by the traditional self-attention.
To extract the spatial–temporal dependency of sequential data, a spatial–temporal multi-headed self-attention (STMA) unit is proposed, as shown in Fig. 2b. The STMA unit obtains the spatio-temporal feature representation of fMRI data by first conducting spatial self-attention and then temporal self-attention. Besides, because the data volume of the fMRI time series is relatively large, the self-attention requires a large number
Fig. 2. Architectures of the Transformer encoder and ST-Transformer. 𝑁 represents the number of layers. 𝑄, 𝐾, and 𝑉 are Query vector, Key vector, and Value vector respectively.
𝜙(⋅) is a linear transformation. (⋅) is the dot product. 𝑍 is the output matrix of 𝑄, 𝐾, 𝑉 calculated by 𝐿𝐴.
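To make the scaled dot-product attention of Eq. (1) and the multi-headed attention of Eq. (2) concrete, the following NumPy sketch computes both on random data. It is an illustrative sketch only: the token count, the model dimension `d_model`, the number of heads, and the random weights are all assumed values, and the same input `X` is used for queries, keys, and values (self-attention).

```python
import numpy as np


def softmax(scores):
    # Subtract the row maximum for numerical stability
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def attention(Q, K, V):
    # Eq. (1): softmax(Q K^T / sqrt(D_k)) V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (N, M), each row sums to 1
    return weights @ V                         # (N, D_v)


def multi_head(X, W_q, W_k, W_v, W_o):
    # Eq. (2): one attention pass per head, then concatenate and project
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o


rng = np.random.default_rng(0)
n_tokens, d_model, h = 6, 16, 4      # illustrative sizes, not from the paper
d_k = d_v = d_model // h
X = rng.standard_normal((n_tokens, d_model))
W_q = rng.standard_normal((h, d_model, d_k))
W_k = rng.standard_normal((h, d_model, d_k))
W_v = rng.standard_normal((h, d_model, d_v))
W_o = rng.standard_normal((h * d_v, d_model))

out = multi_head(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (6, 16)
```

Note that the intermediate `weights` matrix has one entry per pair of tokens, which is the quadratic cost that the linear attention module is designed to avoid.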
of computational resources and excessive time to train the model on a large dataset. Therefore, to speed up the model training process, we propose a linear attention module to replace the self-attention unit in Transformer. For the given matrices of queries $Q \in \mathbb{R}^{N \times D_k}$, keys $K \in \mathbb{R}^{M \times D_k}$, and values $V \in \mathbb{R}^{M \times D_v}$, the keys $K$, after passing through a feature map, are first multiplied with $V$. The final attention result is obtained by multiplying the feature-mapped $Q$ with this product. Linear attention can be written as,
$$\mathrm{LA}(Q, K, V) = \phi(Q) \cdot \left(\phi(K)^{\top} \cdot V\right)$$ (4)
where $\phi$ is a feature map that is applied in a row-wise manner. In this study, we select the feature map given by Eq. (5),
$$\phi(x) = \mathrm{elu}(x) + 1$$ (5)
where elu [42] is an activation function. Replacing self-attention with linear attention and comparing Eqs. (1) and (4), we can notice that the time complexity of computing attention decreases from $O(n^2)$ to $O(n)$, which significantly reduces the model training time.
Based on the STMA and linear attention module, we propose the LSTMA unit to efficiently learn the spatio-temporal feature representation of fMRI time-series data. The LSTMA unit consists of spatial multi-headed linear attention and temporal multi-headed linear attention. Given $x \in \mathbb{R}^{T \times N}$ as input, spatial multi-headed linear attention takes each column ($1 \times T$) as one token. Therefore, the spatial multi-headed linear attention can learn the dependency between tokens (brain regions). After the spatial multi-headed linear attention, the dimensionality of the adjusted $x$ is $N \times T$, and temporal multi-headed linear attention takes each row ($1 \times N$) as one token. The temporal multi-headed linear attention can learn correlations between tokens (time points) simultaneously. The detailed operation of the LSTMA unit can be illustrated as,
$$\Phi(x) = \varphi_1\left(\varphi_2(x)^{\top}\right)$$ (6)
where $x$ refers to the ROI-averaged time series, $\varphi_1$ refers to temporal multi-headed linear attention, and $\varphi_2$ refers to spatial multi-headed linear attention. To ease the understanding of the ST-Transformer framework, the pseudo-code is shown in Algorithm 1.
Algorithm 1: Pseudo-code of the ST-Transformer
Input: Preprocessed data X_train, X_test, phenotypic data X_p, and labels Y_train, Y_test
Output: Predicted probabilities of the test set Y_test^pred
Initialize ST-Transformer;
// n: number of training epochs
for n = 1, ..., epochs do
    // Q, K, V are obtained by linear transformation of X_train
    // φ is a feature map
    LA(Q, K, V) ← φ(Q) · (φ(K)^⊤ · V);
    // head represents a pass through LA
    MultiHead(Q, K, V) ← Concat(head_1, …, head_h) W^O;
    // spatial multi-headed linear attention
    X_s ← MultiHead(Q, K, V);
    // temporal multi-headed linear attention
    X_t ← MultiHead(Q, K, V)′;
    // LN: Layer Normalization, RC: Residual Connection
    X_l1 ← LN(RC(X_train, X_t));
    // FFN: feed-forward network
    X_f ← FFN(X_l1);
    X_l2 ← LN(RC(X_l1, X_f));
    X′ ← Concatenate(X_l2, X_p);
    output ← FullyConnectNetworks(X′);
    Loss ← CrossEntropyLoss(output, Y_train);
    ST-Transformer.update(Loss)
end
Y_test^pred ← ST-Transformer.predict(X_test);
4.3. Gaussian GAN-based data balancing strategy for ASD subtypes
ASD as a spectrum of psychiatric disorders can be further divided into other subtypes, including autism, Asperger, and pervasive developmental disorder not otherwise specified (PDD-NOS). In this study, we attempt to explore the possibility of identifying ASD subtypes. Due to the paucity of ASD subtype fMRI data, we combine the fMRI data with ASD subtype labels from the ABIDE I and ABIDE II datasets. Table 3 shows the number of each specific ASD subtype. Although combining the ABIDE I and ABIDE II datasets increases the sample size of the ASD subtype fMRI data, the data volume for Asperger and PDD-NOS is still small compared with autism. It will cause a significant data imbalance problem, which makes the model biased to
Table 4
Model parameter settings for ST-Transformer and GGDB.
For ST-Transformer
Epochs: 100 | Optimizer: Adam (lr = 0.0001) | Batch size: 32
Settings in LSTMA unit: LSMA: heads = 10, $D_k = D_v = 9$; LTMA: heads = 8, $D_k = D_v = 25$
Configuration of final fully connected layers: 204-2
Activation function of output layer: Sigmoid
For GGDB
Epochs: 80 | Optimizer: Adam (lr = 0.0001) | Batch size: 64
Configuration of generator: 200-2048-18000
Configuration of discriminator: 18000-4096-1
Activation function of hidden layers: LeakyReLU
LSMA denotes spatial multi-headed linear attention. LTMA denotes temporal multi-headed linear attention.
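The head settings in Table 4 line up with the input shape used in this work: 10 heads of size 9 give 90 (the cropped time length), and 8 heads of size 25 give 200 (the number of CC200 ROIs). That reading is an inference from the numbers, not something stated in the text. Under that assumption, the NumPy sketch below illustrates linear attention as in Eq. (4) with the feature map of Eq. (5), and the spatial-then-temporal ordering of Eq. (6). The helper names are hypothetical, and the learned projection matrices are replaced by identity mappings to keep the example short.

```python
import numpy as np


def elu_plus_one(x):
    # Eq. (5): phi(x) = elu(x) + 1, which is x + 1 for x > 0 and exp(x) otherwise
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(Q, K, V):
    # Eq. (4): phi(Q) (phi(K)^T V); the (D_k, D_v) summary is built once,
    # so the cost grows linearly with the number of tokens
    return elu_plus_one(Q) @ (elu_plus_one(K).T @ V)


def quadratic_form(Q, K, V):
    # Same product with the token-by-token matrix formed explicitly
    return (elu_plus_one(Q) @ elu_plus_one(K).T) @ V


def multi_head_linear(tokens, n_heads):
    """tokens: (n_tokens, dim). Identity projections (a simplifying assumption)."""
    dim = tokens.shape[1]
    assert dim % n_heads == 0, "dim must be divisible by the number of heads"
    chunks = np.split(tokens, n_heads, axis=1)
    return np.concatenate([linear_attention(c, c, c) for c in chunks], axis=1)


rng = np.random.default_rng(0)
T, N = 90, 200                              # time points x ROIs
x = 0.1 * rng.standard_normal((T, N))

x_s = multi_head_linear(x.T, n_heads=10)    # spatial: 200 tokens of size 90
x_t = multi_head_linear(x_s.T, n_heads=8)   # temporal: 90 tokens of size 200
print(x_s.shape, x_t.shape)                 # (200, 90) (90, 200)

# Both orderings of the matrix product give the same result
Qs, Ks, Vs = (rng.standard_normal(s) for s in [(7, 5), (11, 5), (11, 3)])
print(np.allclose(linear_attention(Qs, Ks, Vs), quadratic_form(Qs, Ks, Vs)))  # True
```

The last check shows why the reordering is safe: matrix multiplication is associative, so only the cost changes, not the value.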
5.2. Performance comparison
In this section, Section 5.2.1 describes in detail the experimental comparison of our proposed model with Transformer-based methods and traditional deep learning methods. Several state-of-the-art methods are compared with our proposed model in Section 5.2.2.
5.2.1. Comparison with different models
In this experiment, we compare our proposed ST-Transformer with the following five methods, including three Transformer-based variants, LSTM, and a one-dimensional convolutional neural network (1DCNN). The performance of the final model is validated on both ABIDE I and ABIDE II datasets by 10-fold cross-validation.
Transformer: The vanilla Transformer is a representative sequential model for sequence data. In this experiment, the vanilla Transformer is constructed with eight-headed self-attention blocks. A fully connected layer with softmax activation function is appended to perform ASD classification. Moreover, we select the encoder and decoder as separate models for experimental comparison. In the decoder, we remove the cross-attention and keep only the masked self-attention part in the attention module. The other settings in the separate encoder and decoder models are the same as in the Transformer.
LSTM: LSTM, as one of the most widely used classification models in neuroimaging analysis, achieves good results among traditional deep learning methods for processing sequence data. In this experiment, the hidden size is 64. We add a fully connected layer after the LSTM for the final ASD classification.
1DCNN: 1DCNN, as one of the traditional deep learning methods, has been successfully applied in the field of CV. In this method, the number of output channels is set to 128, the size of the convolution kernel is 2, and the stride is fixed to 1. To alleviate the overfitting problem, we add batch normalization after the convolution layer along with maximum pooling. Finally, a fully connected layer is appended for the diagnosis of ASD.
For a fair comparison, we generally keep the number of parameters of our proposed ST-Transformer consistent with the comparison algorithms. Furthermore, we combine the demographic information with the output of each model separately to form models with prior information. For each model with demographic information, a fully connected layer is added to achieve ASD identification.
We tabulate the classification performance obtained by different methods on ABIDE I and ABIDE II after 10-fold cross-validation in Table 5. As shown in Table 5, our proposed model achieves reliable accuracy of 71.01% (SPE: 70.01%, SEN: 72.02%) and 70.61% (SPE: 72.27%, SEN: 68.75%) on ABIDE I and ABIDE II, respectively. It is probably due to the fact that our proposed model is able to effectively utilize the spatio-temporal domain representation of fMRI data. Based on the results of the rank-sum test, our proposed model outperforms LSTM, CNN, and Transformers on most evaluation metrics. Among the traditional deep learning methods, the accuracies obtained by LSTM and 1DCNN are approximately the same on ABIDE I, and the performance of the 1DCNN is slightly better than LSTM on ABIDE II. Among the Transformer-based models, Transformer outperforms LSTM and 1DCNN for ASD classification, where the accuracy of Transformer is 1%–2% higher than LSTM and 1DCNN on both ABIDE I and ABIDE II. It might be attributed to the fact that the multi-headed self-attention module of the Transformer-based method pays more attention to the global information of the fMRI time series. Among the Transformer variants, the Transformer encoder is superior to the Transformer decoder for ASD identification. A possible reason for this is that the masked multi-headed self-attention module of the Transformer decoder blocks part of the information in the fMRI time series, making the Transformer decoder structure unsuitable for our task. Finally, we also compare the results with or without the demographic information. Among these methods, demographic information assists the fMRI data to improve the classification performance of the models on both ABIDE I and ABIDE II.
5.2.2. Comparison with state-of-the-art
In addition to 10-fold CV, the existing methods also applied independent sets of training/testing (IS) for performance evaluation. However, there is a wide variation in the distribution of data from site to site with different equipment and parameter settings. IS is susceptible to the training/testing split and cannot reliably assess the models. In contrast, 10-fold CV is more reliable for validating the model performance.
We summarize the state-of-the-art studies on the CAD methods for ASD diagnosis in Tables 6 and 7. We also show the accuracy of the state-of-the-art methods versus sample size in Fig. 4. Our proposed model shows superior performance on ABIDE I and ABIDE II. In comparison with traditional machine learning approaches, deep learning-based methods have dominated the recent advances in mental disorder diagnosis studies. Most deep learning-based methods outperform conventional machine learning-based methods on both ABIDE I and ABIDE II datasets. It could be attributed to the effectiveness of deep learning in learning discriminative nonlinear feature representations. In addition, in previous studies, researchers tended to choose smaller datasets because of limited equipment performance or limitations in data collection. The heterogeneity introduced by larger data samples may cause model performance to decrease. The classification performance obtained by our proposed model is not only better than that of all competitors, but is also obtained on a data sample larger than that used in the majority of studies. Generally speaking, the proposed model can achieve competitive and robust results in ASD diagnosis.
5.3. Ablation study
In this section, we perform ablation experiments to validate the effectiveness of the proposed modules. Firstly, we make a comparison between the proposed model with or without positional encoding, as
Table 5
Performance compared with Transformer-based methods and traditional deep learning methods.
Model ABIDE I ABIDE II
seq_acc seq_sen seq_spe sub_acc sub_sen sub_spe seq_acc seq_sen seq_spe sub_acc sub_sen sub_spe
LSTM 64.39%* 63.09%* 65.69%* 66.34%* 67.33%* 65.35%* 63.88%* 60.90%* 68.09%* 64.84%* 60.11%* 69.03%*
LSTM(w/pheno) 65.68% 62.31%* 68.86% 67.73%* 66.50%* 68.96% 64.48%* 60.52%* 68.02%* 66.07%* 64.92%* 67.09%*
1DCNN 64.64%* 62.35%* 66.82% 66.30%* 66.51%* 66.11%* 64.03%* 61.01%* 66.72%* 66.36%* 66.54% 66.19%*
1DCNN(w/pheno) 65.43%* 62.84%* 68.01% 68.18% 66.30%* 69.96% 65.49% 60.71% 69.76% 68.14% 64.54%* 71.37%
Transformer 65.61%* 64.21%* 67.02% 68.09%* 68.98%* 67.20%* 65.48% 62.19% 68.76% 67.81% 66.55% 69.07%*
Transformer(w/pheno) 66.46% 65.18% 67.74% 68.63% 68.34%* 68.92% 66.06% 63.23% 68.88% 68.44% 67.35% 69.54%
Transformer encoder 65.60%* 61.96%* 69.23% 67.63%* 66.97%* 68.29% 65.37% 62.35% 68.38%* 67.38%* 65.15%* 69.60%
Transformer encoder(w/pheno) 66.21% 64.76%* 67.57% 68.63% 68.55%* 68.71% 65.87% 63.71% 68.03%* 68.72% 69.14% 68.33%*
Transformer decoder 64.39%* 62.34%* 66.28%* 67.79%* 70.14% 65.48%* 62.42%* 58.78%* 65.67%* 64.84%* 63.93%* 65.66%*
Transformer decoder(w/pheno) 65.63%* 64.81%* 66.35%* 68.18% 68.74%* 67.57% 63.69%* 55.80%* 70.74% 66.07%* 59.93%* 71.56%
ST-Transformer\LA 66.67%(± 4.3%) 63.84% 69.49% 69.18%(± 5.9%) 67.33% 70.91% 66.41%(± 4.0%) 63.07% 69.38% 68.71%(± 4.3%) 68.54% 68.86%
ST-Transformer\LA(w/pheno) 67.25%(± 3.8%) 64.90% 69.46% 69.77%(± 5.6%) 69.56% 69.94% 67.20%(± 4.6%) 63.05% 70.91% 69.28%(± 5.6%) 66.33% 71.91%
ST-Transformer 67.92%(± 3.8%) 65.55% 70.15% 70.46%(± 4.8%) 70.76% 70.14% 67.61%(± 4.4%) 62.86% 71.83% 69.85%(± 4.7%) 66.53% 72.79%
ST-Transformer(w/pheno) 68.56%(± 3.1%) 67.85% 69.26% 71.01%(± 3.9%) 72.02% 70.01% 68.09%(± 3.7%) 64.26% 71.51% 70.61%(± 3.6%) 68.75% 72.27%
1 w/pheno denotes adding demographic information to the model.
2 seq_acc, seq_sen, and seq_spe refer to sequence accuracy, sequence sensitivity, and sequence specificity, respectively.
3 sub_acc, sub_sen, and sub_spe refer to subject accuracy, subject sensitivity, and subject specificity, respectively.
4 The ST-Transformer\LA method is another version of our proposed method, which uses self-attention instead of linear attention.
5 The numbers in () represent the standard deviation.
*p-value ≤ 0.05.
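The asterisks in Table 5 come from a rank-sum test. As a rough illustration of how such a comparison could be run, the sketch below implements a two-sided Wilcoxon rank-sum test with the normal approximation. The per-fold accuracies in `proposed` and `baseline` are hypothetical placeholders, and the choice of per-fold accuracies as the compared samples is an assumption; the paper does not state which samples were tested.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Tied values share the average of the ranks they span; no tie or
    continuity correction is applied to the variance.
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((value, group)
                    for group, sample in enumerate((x, y))
                    for value in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return z, p_value


# Hypothetical per-fold accuracies (NOT values from the paper).
proposed = [0.72, 0.70, 0.74, 0.69, 0.71, 0.73, 0.68, 0.75, 0.70, 0.72]
baseline = [0.66, 0.65, 0.68, 0.64, 0.67, 0.66, 0.63, 0.69, 0.65, 0.67]

z, p_value = rank_sum_test(proposed, baseline)
print(f"z = {z:.2f}, p = {p_value:.4f}, significant at 0.05: {p_value <= 0.05}")
```

With only ten folds per model the normal approximation is coarse, so an exact test may be preferable in practice.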
Table 6
Performance compared with previous literature on ABIDE I.
Method Classifier Validation Sample# Accuracy
Ours ST-Transformer 10-fold CV 1009 71.01%
Yang 2022[43] kSVM 5-fold CV 871 69.43%
Almuqhim 2021[44] SAE+DNN 10-fold CV 1035 70.80%
Abdelbasset 2020[45] SVC intra-site CV 172 70.36%
Shahamat 2020[46] 3D-CNN 5-fold CV 1000 70.00%
You 2020[47] CNN 10-fold CV 106 68.54%
Bengs 2019[48] convGRU-CNN3D IS 194 67.00%
El-Gazzar 2019[49] 1DCNN 5-fold CV 1100 64.00%
Heinsfeld 2018[34] AE+ANN 10-fold CV 1035 70.00%
Dvornek 2018[50] RawPhenotype-LSTM 10-fold CV 1100 70.10%
Dvornek 2017[51] LSTM 10-fold CV 1100 68.50%
Rane 2017[52] LR 5-fold CV 1112 62.00%
Table 7
Performance compared with previous literature on ABIDE II.
Method Classifier Validation Sample# Accuracy
Ours ST-Transformer 10-fold CV 1058 70.61%
Liu 2021[53] BL 10-fold CV 1043 65.29%
Chen 2020[54] HBM 10-fold CV 250 65.00%
Aghdam 2019[55] MCNNEs 10-fold CV 343 70.00%
Zhao 2019[56] CAE+CNN IS 693 65.30%
seen in Table 8 section A, the classification accuracy is reduced by about 1–2 percentage points when positional encoding is added compared to that without positional encoding. The results suggest that, although positional encoding can provide position information, the input fMRI data are already sequential time series that implicitly contain position information. Adding an extra positional embedding may over-emphasize the location information and interfere with the attention mechanism's learning of discriminative feature representations. Secondly, we evaluate the performance of our proposed linear attention mechanism in terms of accuracy and training time. As shown in Table 8 section B, our proposed model achieves higher accuracy than the ST-Transformer without linear attention (denoted as ST-Transformer\LA) on both ABIDE I and ABIDE II, and the time for model training is 29% less. Thirdly, we performed experiments on the proposed method with and without the data augmentation strategy to assess the necessity of data augmentation for fMRI. As shown in Table 8 section C, data augmentation raises the subject accuracy of our proposed method from 65.02% to 71.01% on ABIDE I and from 65.32% to 70.61% on ABIDE II. This might be due to the reliance of deep learning methods on large sample data. Fourthly, we conducted experiments with our proposed model on the combined ABIDE (ABIDE I & II) dataset. From Table 8 section D we can see that the performance is slightly degraded compared to that on ABIDE I and ABIDE II separately. This is probably because, as the data sample grows, more heterogeneous ASD samples emerge, which makes it difficult for our model to discover potential biomarkers for ASD diagnosis.
Furthermore, we carry out experiments with our proposed model only in the temporal domain and only in the spatial domain with multi-headed linear attention, respectively. As Fig. 5 shows, our proposed model makes good use of both temporal and spatial feature information, and the classification accuracies obtained by the model are better than those obtained by multi-headed linear attention in the temporal domain only, or in the spatial domain only. To explore the structure of the proposed ST-Transformer, we compare different numbers of encoder layers. As shown in Fig. 6, the classification accuracy decreases as the number of layers increases. This is probably because our dataset is relatively small compared with those in the CV and NLP domains, so significant overfitting occurs when the number of layers is increased.
5.4. Identifying ASD subtypes
In this section, we evaluate the proposed GGDB strategy with our ST-Transformer model. To verify the effectiveness of our
X. Deng et al. Computers in Biology and Medicine 151 (2022) 106320
Table 8
Results of the ablation study.
Method ABIDE I ABIDE II
Section A: Performance comparison to evaluate the effectiveness of positional encoding.
seq_acc seq_sen seq_spe sub_acc sub_sen sub_spe seq_acc seq_sen seq_spe sub_acc sub_sen sub_spe
ST-Transformer(w/pe) 67.58% 66.91% 68.21% 69.47% 70.58% 68.40% 65.85% 62.89% 68.49% 68.53% 68.75% 68.33%
ST-Transformer(w/o_pe) 68.56% 67.85% 69.26% 71.01% 72.02% 70.01% 68.09% 64.26% 71.51% 70.61% 68.75% 72.27%
Section B: Performance comparison to evaluate the effectiveness of our proposed linear attention mechanism.
sub_acc Time (h) sub_acc Time (h)
ST-Transformer\LA(w/pheno) 69.77% 1.8 69.28% 2.1
ST-Transformer(w/pheno) 71.01% 1.3 70.61% 1.7
Section C: Performance comparison to evaluate the effectiveness of data augmentation.
sub_acc sub_sen sub_spe sub_acc sub_sen sub_spe
ST-Transformer(w/o_da) 65.02% 65.89% 64.14% 65.32% 61.75% 68.52%
ST-Transformer(w/da) 71.01% 72.02% 70.01% 70.61% 68.75% 72.27%
Section D: The results of our method on ABIDE I & II.
Method Dataset seq_acc seq_sen seq_spe sub_acc sub_sen sub_spe
ST-Transformer(w/pheno) ABIDE I 68.56% 67.85% 69.26% 71.01% 72.02% 70.01%
ST-Transformer(w/pheno) ABIDE II 68.09% 64.26% 71.51% 70.61% 68.75% 72.27%
ST-Transformer(w/pheno) ABIDE I&II 66.80% 64.72% 68.72% 69.24% 68.75% 69.72%
1 w/pe and w/o_pe denote the model with and without positional encoding, respectively.
2 w/da and w/o_da denote our method with and without data augmentation, respectively.
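Section B of Table 8 contrasts linear attention with standard self-attention. The exact formulation used in ST-Transformer is not reproduced in this part of the paper, so the following is only a generic sketch of one common way to linearize attention: a positive kernel feature map, here elu(x) + 1, which is an assumption and not necessarily the authors' choice. It shows why the cost can grow linearly with sequence length: the key-value summary is accumulated once and reused for every query. All function names and the toy inputs are hypothetical.

```python
import math


def feature_map(x):
    """elu(x) + 1, a strictly positive feature map (assumed, for illustration)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Non-causal kernel attention in O(n) time with respect to sequence length."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d_k                     # sum_j phi(k_j)
    for k, v in zip(keys, values):
        phi_k = feature_map(k)
        for a in range(d_k):
            k_sum[a] += phi_k[a]
            for b in range(d_v):
                kv[a][b] += phi_k[a] * v[b]

    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        norm = sum(phi_q[a] * k_sum[a] for a in range(d_k))
        outputs.append([
            sum(phi_q[a] * kv[a][b] for a in range(d_k)) / norm
            for b in range(d_v)
        ])
    return outputs


def quadratic_reference(queries, keys, values):
    """The same kernel attention computed pairwise in O(n^2), for checking."""
    d_v = len(values[0])
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        weights = [sum(a * b for a, b in zip(phi_q, feature_map(k))) for k in keys]
        total = sum(weights)
        outputs.append([
            sum(w * v[b] for w, v in zip(weights, values)) / total
            for b in range(d_v)
        ])
    return outputs


# Toy inputs: 3 time points, key dimension 2, value dimension 2.
Q = [[0.1, -0.2], [0.5, 0.3], [-1.0, 0.7]]
K = [[0.4, 0.0], [-0.3, 0.9], [0.2, -0.6]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
print(all(abs(f - s) < 1e-9 for fr, sr in zip(fast, slow) for f, s in zip(fr, sr)))
```

Both functions compute the same quantity; only the order of the sums differs, which is where the saving in training time for long sequences would come from.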
Table 9
Results of using data balancing strategies.
seq_F1-score seq_Macro-acc sub_F1-score sub_Macro-acc
Balancing strategy Autism Asperger PDD-NOS Total Autism Asperger PDD-NOS Total
w/o_db 75.90% 46.47% 52.10% 58.16% 77.51% 48.41% 55.87% 60.60%
Slicing 74.59% 45.18% 57.96% 59.24% 75.11% 50.19% 58.56% 61.29%
Gaussian 74.97% 46.58% 55.83% 59.13% 77.20% 49.48% 57.86% 61.51%
GAN 74.78% 49.35% 54.47% 59.53% 76.73% 50.40% 60.28% 62.47%
GGDB 75.94% 52.04% 54.07% 60.68% 77.88% 55.60% 60.11% 64.53%
1 w/o_db means that no data balancing strategy is used.
2 seq_F1-score, seq_Macro-acc, sub_F1-score, and sub_Macro-acc represent sequence F1-score, sequence Macro-average, subject F1-score, and subject Macro-average, respectively.
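The "Total" columns of Table 9 are consistent with an unweighted mean of the three per-class F1-scores. The short check below recomputes that mean from the subject-level values in the table. The helper names are hypothetical, and `f1_score` is included only to recall the standard per-class definition; the paper's raw confusion counts are not available here.

```python
def f1_score(tp, fp, fn):
    """Per-class F1 from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_average(per_class_scores):
    """Unweighted mean over classes, so each class counts equally."""
    return sum(per_class_scores) / len(per_class_scores)


# Subject-level F1-scores (%) from Table 9: Autism, Asperger, PDD-NOS.
subject_f1 = {
    "w/o_db": [77.51, 48.41, 55.87],
    "Slicing": [75.11, 50.19, 58.56],
    "Gaussian": [77.20, 49.48, 57.86],
    "GAN": [76.73, 50.40, 60.28],
    "GGDB": [77.88, 55.60, 60.11],
}
# Subject-level "Total" column (%) as reported in Table 9.
reported_total = {
    "w/o_db": 60.60,
    "Slicing": 61.29,
    "Gaussian": 61.51,
    "GAN": 62.47,
    "GGDB": 64.53,
}

for strategy, scores in subject_f1.items():
    recomputed = macro_average(scores)
    print(f"{strategy}: recomputed {recomputed:.2f}, reported {reported_total[strategy]:.2f}")
```

Because the mean is unweighted, a gain on a minority class such as Asperger moves the total as much as the same gain on the majority class would.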
Fig. 5. Performance comparison to evaluate the effectiveness of LSTMA unit on ABIDE I&II.
proposed method, we compare our method with three commonly used data augmentation methods: slicing, adding Gaussian noise, and GAN. It is worth mentioning that these three strategies for handling data imbalance are also applied to the training set only. Table 9 shows the results of the GGDB method and the other three methods for identifying ASD subtypes from fMRI data. As we can see from Table 9, our proposed method achieves subject-level F1-scores of 77.88%, 55.60%, and 60.11% on autism, Asperger, and PDD-NOS, respectively. Its corresponding Macro-average is 64.53%, which is higher than those of the other three methods. This is likely because GGDB is able to learn the hidden representation and distribution of the original data for generating additional data samples. It can also be seen that all four methods for dealing with data imbalance exhibit biased predictive capability to varying degrees: the F1-score is higher for the majority-class samples and relatively lower for the minority-class samples. These results suggest that our proposed method addresses the data imbalance problem in ASD subtype diagnosis to a certain extent and outperforms the other methods.
In Fig. 7, we plot the distributions of the real and generated samples using PCA (Principal Component Analysis) to further analyze the quality of the GGDB-generated data. PCA is a commonly used method to project features from a high-dimensional space onto a two-dimensional plane for better visualization. Polynomial regression (PR) is also used to fit the two-dimensional points for analysing the distribution of real and generated data from Asperger and PDD-NOS. The PR function approximates the boundaries of real and generated data from Asperger and PDD-NOS with different colored curves. According to the curves plotted by the PR functions, the fitted boundaries of the real data and the generated data for the two classes are very close, which demonstrates the high quality of our generated data. Besides, compared with the real data, the generated data have a larger confidence interval (the shaded region around the polynomial curves), which helps to enlarge the decision boundaries on the hyperplane for the classifier. In other words, the data generated by GGDB contain not only sample points near the original sample distribution, but also data samples outside the original sample distribution. Therefore, GGDB can lead to better performance and robustness of the classifier.
atlases to increase the sample amount required for deep learning model
training. Second, due to the different scanning machines and parameter
settings at each ABIDE site, there may be domain gaps in fMRI data
collected from different sites. In future work, transfer learning can be adopted in our framework to address the domain gap problem and further improve our classification performance. Third, although GGDB is able to learn hidden representations and distributions of the original data to generate additional data, its F1-score remains strongly biased toward the majority class. We will aim to improve GGDB to address this issue.
Fig. 7. Two-dimensional PCA visualizations of the real and generated features from Asperger and PDD-NOS.
6. Conclusion
Declaration of competing interest
[7] Y. Kong, J. Gao, Y. Xu, Y. Pan, J. Wang, J. Liu, Classification of autism spectrum disorder by combining brain connectivity and deep neural network classifier, Neurocomputing 324 (2019) 63–68.
[8] V. Mnih, N. Heess, A. Graves, et al., Recurrent models of visual attention, in: Advances in Neural Information Processing Systems, 2014, pp. 2204–2212.
[9] J.F. DeRose, J. Wang, M. Berger, Attention flows: Analyzing and comparing attention mechanisms in language models, IEEE Trans. Vis. Comput. Graphics 27 (2) (2021) 1160–1170.
[10] S. Kitada, H. Iyatomi, Attention meets perturbations: Robust and interpretable attention with adversarial training, IEEE Access 9 (2021) 92974–92985.
[11] G. Liu, J. Guo, Bidirectional LSTM with attention mechanism and convolutional layer for text classification, Neurocomputing 337 (2019) 325–338.
[12] A. Roy, M. Saffar, A. Vaswani, D. Grangier, Efficient content-based sparse attention with routing transformers, Trans. Assoc. Comput. Linguist. 9 (2021) 53–68.
[13] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, I. Sutskever, Generative pretraining from pixels, 2020, pp. 1691–1703.
[14] S. Woo, J. Park, J.-Y. Lee, I.S. Kweon, Cbam: Convolutional block attention module, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 3–19.
[15] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, 2020, pp. 213–229.
[16] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, D. Tran, Image transformer, 2018, pp. 4055–4064.
[17] X. Chen, Y. Wu, Z. Wang, S. Liu, J. Li, Developing real-time streaming transformer transducer for speech recognition on large-scale dataset, in: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, IEEE, 2021, pp. 5904–5908.
[18] L. Dong, S. Xu, B. Xu, Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, IEEE, 2018, pp. 5884–5888.
[19] Y. Qiu, S. Yu, Y. Zhou, D. Liu, X. Song, T. Wang, B. Lei, Multi-channel sparse graph transformer network for early Alzheimer’s disease identification, in: 2021 IEEE 18th International Symposium on Biomedical Imaging, ISBI, 2021, pp. 1794–1797.
[20] T. Zhang, C. Li, P. Li, Y. Peng, X. Kang, C. Jiang, F. Li, X. Zhu, D. Yao, B. Biswal, P. Xu, Separated channel attention convolutional neural network (SC-CNN-attention) to identify ADHD in multi-site rs-fMRI dataset, Entropy 22 (8) (2020).
[21] C. Yang, P. Wang, J. Tan, Q. Liu, X. Li, Autism spectrum disorder diagnosis using graph attention network based on spatial-constrained sparse functional brain networks, Comput. Biol. Med. 139 (2021) 104963.
[22] W. Yin, L. Li, F.-X. Wu, A graph attention neural network for diagnosing ASD with fMRI data, in: 2021 IEEE International Conference on Bioinformatics and Biomedicine, BIBM, IEEE, 2021, pp. 1131–1136.
[23] K. Niu, J. Guo, Y. Pan, X. Gao, X. Peng, N. Li, H. Li, Multichannel deep attention neural networks for the classification of autism spectrum disorder using neuroimaging and personal characteristic data, Complexity 2020 (2020).
[24] C. Wang, Z. Xiao, J. Wu, Functional connectivity-based classification of autism and control using SVM-RFECV on rs-fMRI data, Phys. Medica 65 (2019) 99–105.
[25] F. Sadeghian, H. Hasani, M. Jafari, Feature selection based on genetic algorithm in the diagnosis of autism disorder by fMRI, Casp. J. Neurol. Sci. 7 (2) (2021) 74–83.
[26] N. Wang, D. Yao, L. Ma, M. Liu, Multi-site clustering and nested feature extraction for identifying autism spectrum disorder with resting-state fMRI, Med. Image Anal. 75 (2022) 102279.
[27] S.Y. Yap, W.H. Chan, Elastic SCAD SVM cluster for the selection of significant functional connectivity in autism spectrum disorder classification, Acad. Fundam. Comput. Res. 1 (2) (2020).
[28] X. Ma, X.-H. Wang, L. Li, Identifying individuals with autism spectrum disorder based on the principal components of whole-brain phase synchrony, Neurosci. Lett. 742 (2021) 135519.
[29] G. Wen, P. Cao, H. Bao, W. Yang, T. Zheng, O. Zaiane, MVS-GCN: A prior brain structure learning-guided multi-view graph convolution network for autism spectrum disorder diagnosis, Comput. Biol. Med. (2022) 105239.
[30] A. Loddo, S. Buttau, C. Di Ruberto, Deep learning based pipelines for Alzheimer’s disease diagnosis: A comparative study and a novel deep-ensemble method, Comput. Biol. Med. (2021) 105032.
[31] H. Jiang, P. Cao, M. Xu, J. Yang, O. Zaiane, Hi-GCN: A hierarchical graph convolution network for graph embedding learning of brain network and brain disorders prediction, Comput. Biol. Med. 127 (2020) 104096.
[32] A. Puente-Castro, E. Fernandez-Blanco, A. Pazos, C.R. Munteanu, Automatic assessment of Alzheimer’s disease diagnosis based on deep learning techniques, Comput. Biol. Med. 120 (2020) 103764.
[33] M. Leming, J.M. Górriz, J. Suckling, Ensemble deep learning on large, mixed-site fMRI datasets in autism and other tasks, Int. J. Neural Syst. 30 (07) (2020) 2050012.
[34] A.S. Heinsfeld, A.R. Franco, R.C. Craddock, A. Buchweitz, F. Meneguzzi, Identification of autism spectrum disorder using deep learning and the ABIDE dataset, NeuroImage: Clinical 17 (2018) 16–23.
[35] T. Eslami, V. Mirjalili, A. Fong, A.R. Laird, F. Saeed, ASD-DiagNet: A hybrid learning approach for detection of autism spectrum disorder using fMRI data, Front. Neuroinform. 13 (2019) 70.
[36] R. Liu, Z.-a. Huang, M. Jiang, K.C. Tan, Multi-LSTM networks for accurate classification of attention deficit hyperactivity disorder from resting-state fMRI data, in: 2020 2nd International Conference on Industrial Artificial Intelligence, IAI, 2020, pp. 1–6.
[37] N.C. Dvornek, P. Ventola, K.A. Pelphrey, J.S. Duncan, Identifying autism from resting-state fMRI using long short-term memory networks, in: Q. Wang, Y. Shi, H.-I. Suk, K. Suzuki (Eds.), Machine Learning in Medical Imaging, Springer International Publishing, Cham, 2017, pp. 362–370.
[38] K. Byeon, J. Kwon, J. Hong, H. Park, Artificial neural network inspired by neuroimaging connectivity: Application in autism spectrum disorder, in: 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), 2020, pp. 575–578.
[39] M.A. Bayram, Ö. İlyas, F. Temurtaş, Deep learning methods for autism spectrum disorder diagnosis based on fMRI images, Sakarya Univ. J. Comput. Inf. Sci. 4 (1) (2021) 142–155.
[40] C. Craddock, Y. Benhajali, C. Chu, F. Chouinard, A. Evans, A. Jakab, B.S. Khundrakpam, J.D. Lewis, Q. Li, M. Milham, et al., The neuro bureau preprocessing initiative: open sharing of preprocessed neuroimaging data and derivatives, Front. Neuroinform. 7 (2013).
[41] C.-G. Yan, X.-D. Wang, X.-N. Zuo, Y.-F. Zang, DPABI: data processing & analysis for (resting-state) brain imaging, Neuroinformatics 14 (3) (2016) 339–351.
[42] A.D. Rasamoelina, F. Adjailia, P. Sinčák, A review of activation function for artificial neural network, in: 2020 IEEE 18th World Symposium on Applied Machine Intelligence and Informatics, SAMI, 2020, pp. 281–286.
[43] X. Yang, N. Zhang, P. Schrader, A study of brain networks for autism spectrum disorder classification using resting-state functional connectivity, Mach. Learn. Appl. 8 (2022) 100290.
[44] F. Almuqhim, F. Saeed, ASD-SAENet: A sparse autoencoder, and deep-neural network model for detecting autism spectrum disorder (ASD) using fMRI data, Front. Comput. Neurosci. 15 (2021) 27.
[45] A. Brahim, N. Farrugia, Graph Fourier transform of fMRI temporal signals based on an averaged structural connectome for the classification of neuroimaging, Artif. Intell. Med. 106 (2020) 101870.
[46] H. Shahamat, M.S. Abadeh, Brain MRI analysis using a deep learning based evolutionary approach, Neural Netw. 126 (2020) 218–234.
[47] Y. You, H. Liu, S. Zhang, L. Shao, Classification of autism based on fMRI data with feature-fused convolutional neural network, in: Cyberspace Data and Intelligence, and Cyber-Living, Syndrome, and Health, Springer, 2020, pp. 77–88.
[48] M. Bengs, N.T. Gessert, A. Schlaefer, 4D spatio-temporal deep learning with 4D fMRI data for autism spectrum disorder classification, in: Medical Imaging with Deep Learning, MIDL 2019 Conference, 2019, pp. 1–4.
[49] A. El-Gazzar, M. Quaak, L. Cerliani, P. Bloem, G.v. Wingen, R. Mani Thomas, A hybrid 3DCNN and 3DC-LSTM based model for 4D spatio-temporal fMRI data: an ABIDE autism classification study, in: OR 2.0 Context-Aware Operating Theaters and Machine Learning in Clinical Neuroimaging, Springer, 2019, pp. 95–102.
[50] N.C. Dvornek, P. Ventola, J.S. Duncan, Combining phenotypic and resting-state fMRI data for autism classification with recurrent neural networks, in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE, 2018, pp. 725–728.
[51] N.C. Dvornek, P. Ventola, K.A. Pelphrey, J.S. Duncan, Identifying autism from resting-state fMRI using long short-term memory networks, 2017, pp. 362–370.
[52] S. Rane, E. Jolly, A. Park, H. Jang, C. Craddock, Developing predictive imaging biomarkers using whole-brain classifiers: Application to the ABIDE I dataset, Res. Ideas and Outcomes 3 (2017) e12733.
[53] J.-c. Liu, J.-z. Ji, Classification method of fMRI data based on broad learning system, J. ZheJiang Univ. (Engineering Science) 55 (7) 1270–1278.
[54] T. Chen, Y. Chen, M. Yuan, M. Gerstein, T. Li, H. Liang, T. Froehlich, L. Lu, et al., The development of a practical artificial intelligence tool for diagnosing and evaluating autism spectrum disorder: multicenter study, JMIR Med. Inform. 8 (5) (2020) e15767.
[55] M.A. Aghdam, A. Sharifi, M.M. Pedram, Diagnosis of autism spectrum disorders in young children based on resting-state functional magnetic resonance imaging data using convolutional neural networks, J. Digit. Imaging 32 (6) (2019) 899–918.
[56] Y. Zhao, H. Dai, W. Zhang, F. Ge, T. Liu, Two-stage spatial temporal deep learning framework for functional brain network modeling, in: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), IEEE, 2019, pp. 1576–1580.