1. Introduction
A brain-computer interface (BCI) is one of the current research hotspots; it enables direct communication between the brain and the external world without relying on peripheral nerves or muscles. BCI therefore allows users to control external devices such as computers, wheelchairs and prostheses through brain activity [1,2,3]. BCI technology has been widely applied in fields such as medical treatment [4], rehabilitation [5,6], intelligent control [7], augmented reality [8], virtual reality [9] and neuroscience research [10], and it continues to show new potential and promising development prospects. Among the signal modalities used in BCIs, EEG is the most common choice for brain-signal decoding because of its high temporal resolution, non-invasiveness, relatively low cost, and portability. EEG-based BCI systems use different paradigms, including event-related potentials (ERPs), steady-state visual evoked potentials (SSVEPs) [11], and motor imagery [12]. Oscillatory brain activity associated with states such as drowsiness can also be detected effectively from EEG signals. Recently, decoding of semantic concepts for imagination and perception (SCIP) has attracted increasing interest. Decoding semantic information has the advantage of generalizing the main concepts of a task [13]. For example, when viewing an image, people attend not only to low-level sensory details such as colour and shape, but also to the high-level semantic concepts of the object in the image. Visual decoding is often interfered with by irrelevant information such as shape, size and colour, whereas semantic decoding extracts conceptual information such as object type or category. The ability to ignore low-level sensory details is considered a desirable quality for BCI systems. There is growing evidence that imagination and perception overlap at the neural level [14], and this overlap may make cross-task decoding feasible [15]. The neural overlap between imagination and perception can be explored by analysing how different sensory inputs (e.g., pictorial, orthographic and audio) are processed by the brain during perception and imagination. This line of research opens up the possibility of developing more robust BCI systems and is important for understanding the neural mechanisms of semantic concepts during imagination and perception in the field of rehabilitation [16,17]. However, EEG signals are prone to significant noise and interference, making the decoding of SCIP-EEG signals a critical challenge to address.
Initial feature extraction from EEG signals can be categorized into time-domain [18], frequency-domain [19], and spatial-domain [20] methods, as well as multi-domain combinations [21,22,23]. Frequency-domain features describe the energy distribution across frequencies and are typically obtained with the Fast Fourier Transform [24], wavelet transform [25], autoregressive modelling, and other spectral analysis methods. Spatial-domain analysis applies spatial mapping techniques to project EEG data onto spatial distributions that enhance class separability, with Common Spatial Patterns (CSP) [26] being the most widely used. The extracted features are then classified using machine learning methods such as the Support Vector Machine (SVM) [27], Linear Discriminant Analysis (LDA) [28], and K-Nearest Neighbours (KNN) [29]. SVM is a binary classification model that maps feature vectors into a feature space and seeks an optimal separating hyperplane that best differentiates the two classes. LDA is a traditional binary classification algorithm that reduces the dimensionality of the features by projecting them onto a one-dimensional axis, where classification is performed with an optimal threshold. KNN is well suited to multi-class tasks: it determines the class of a feature vector from its distances to labelled samples, using majority voting among its nearest neighbours. Various other traditional machine learning methods are also used for EEG signal classification.
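As a concrete illustration of this classical pipeline, the following minimal sketch extracts FFT band-power features from surrogate EEG data and evaluates SVM, LDA and KNN classifiers with scikit-learn. The data shapes, frequency bands and hyperparameters are illustrative assumptions and are not taken from the cited studies.

```python
# Minimal sketch of a classical EEG pipeline: FFT band-power features + SVM/LDA/KNN.
# All shapes and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_channels, n_samples, fs = 200, 32, 512, 256   # toy dimensions
X_raw = rng.standard_normal((n_trials, n_channels, n_samples))  # surrogate EEG
y = rng.integers(0, 2, size=n_trials)                           # binary labels

def band_power_features(x, fs, bands=((4, 8), (8, 13), (13, 30))):
    """Average FFT power per channel in theta/alpha/beta bands (frequency-domain features)."""
    freqs = np.fft.rfftfreq(x.shape[-1], d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x, axis=-1)) ** 2            # (trials, channels, freqs)
    feats = [psd[..., (freqs >= lo) & (freqs < hi)].mean(axis=-1) for lo, hi in bands]
    return np.concatenate(feats, axis=-1)                  # (trials, channels * n_bands)

X = band_power_features(X_raw, fs)

for name, clf in [("SVM", SVC(kernel="rbf", C=1.0)),
                  ("LDA", LinearDiscriminantAnalysis()),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")   # near chance level on random surrogate data
```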
With the advent of deep learning [30], multiple deep neural network architectures have been applied to EEG feature extraction and classification. Compared with traditional machine learning, deep learning shows greater capability in EEG feature extraction. Convolutional Neural Networks (CNNs) were among the earliest deep learning models used for decoding EEG signals in BCIs [31,32,33]. However, owing to the inherent limitations of CNNs, purely convolutional approaches have nearly reached their maximum potential. Recurrent Neural Networks (RNNs) [34] and Long Short-Term Memory networks (LSTMs) [35] were subsequently introduced to the BCI field, although they are primarily suited to time-series prediction rather than EEG signal decoding. Compared with RNNs and LSTMs, the Transformer model [36] supports parallel computation and effectively captures long-range dependencies. However, the available EEG datasets are usually small and may not be sufficient to adequately train data-hungry deep learning models such as the Transformer. The strong spatial correlations of EEG signals make them well suited to Capsule Neural Networks [37]. Capsule Neural Networks exhibit several advantages for EEG signal classification:
Firstly, the capsule network exploits the directional properties of its vector-valued units (a brief sketch of such units follows this list). This helps it to sense and resolve EEG signals recorded under different poses, because brainwave characteristics vary with head position and the angle of electrode placement.
Secondly, the multilayer capsule structure of Capsule Neural Networks allows a hierarchical representation of features, so that each unit can accurately represent features at a different level of abstraction.
Thirdly, the dynamic routing mechanism can learn and optimize the correlations among different electrodes, time-domain features, and frequency-domain features.
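These advantages all rest on capsules being vector-valued units. As a brief point of reference, the sketch below implements the squash nonlinearity commonly used in capsule networks, which preserves a capsule vector's orientation while mapping its length into [0, 1), so that the length can be read as the probability that the represented entity is present. The tensor shapes are purely illustrative assumptions.

```python
# Minimal sketch of the vector-capsule idea: the squash nonlinearity keeps a
# capsule's orientation (its "pose") and maps its length into [0, 1).
# Shapes are illustrative assumptions for an EEG-style input.
import torch

def squash(s: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||), applied along `dim`."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / torch.sqrt(sq_norm + eps)

# A batch of 32 trials, each described by 16 primary capsules of dimension 8.
primary = torch.randn(32, 16, 8)
v = squash(primary)            # same shape; vector lengths now lie in [0, 1)
print(v.norm(dim=-1).max())    # < 1: capsule length encodes presence probability
```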
Typically, a capsule neural network is divided into three parts: the convolutional layers, the primary capsule layers, and the digital capsule layers. First, the convolutional layers extract features from the input and encode them as instantiation vectors. Then, the primary capsule layers receive the feature representations from the convolutional layers and organize them into capsules. The capsule is the core unit of the capsule neural network, consisting of a vector or matrix formed by a group of neurons. These neurons collectively represent various attributes of a specific object, such as position and orientation. Unlike traditional neural networks, where each neuron outputs a single scalar value, a capsule outputs a vector. Finally, the representation generated by the primary capsule layers is fed to the digital capsule layers through a dynamic routing mechanism to obtain the desired output. Most previous capsule neural networks for decoding EEG signals [38,39,40,41,42] create capsules in the manner shown in Figure 1. However, EEG signals have complex dynamic characteristics in both the time and spatial domains. N-by-N convolutional kernels may fail to capture important features in both domains, leading to a loss of information, so this approach cannot represent the features of EEG signals well in capsule form.
Exploiting the properties of EEG signals, we improve the feature extraction part of the capsule neural network. The spatio-temporal features of the EEG signals are extracted with a convolutional neural network and stored in capsules that differ from the traditional instantiation attributes; we call these spatio-temporal capsules. In addition, the traditional dynamic routing mechanism requires complex iterative computations to update the weights between capsules, which increases the overall complexity of the model while establishing only weak connections between capsules. In the method proposed in this paper, we replace dynamic routing with a novel non-iterative, highly parallel routing mechanism. Overall, compared to previous capsule-based methods, the proposed method improves the capture of the spatio-temporal features of EEG signals, optimizes the inter-capsule routing mechanism, and improves model efficiency and stability. These advantages make the proposed method better suited to complex decoding tasks such as SCIP-EEG. The main contributions of this paper are as follows:
We improve the feature extraction part of the capsule neural network. A convolutional neural network is used to extract the spatio-temporal features of the EEG signals, and these spatio-temporal features replace the traditional instantiation attributes, making the capsules more adaptive to the complex dynamic characteristics of EEG signals.
We propose a novel non-iterative routing mechanism called the self-correlation routing mechanism (an illustrative sketch follows this list). This mechanism computes similarities between different capsules and assigns weights based on these similarities, which efficiently captures key features and improves the robustness and performance of the model. Compared with traditional dynamic routing, it is non-iterative and highly parallel, reducing model complexity while establishing better connections between capsules.
Validation was carried out on a publicly available EEG-based BCI dataset from the University of Bath for imagination and perception tasks of semantic concepts. We classified perception and imagination tasks across three sensory modalities (pictorial, orthographic, and audio), achieving average accuracies of 94.9%, 93.3%, and 78.4%, respectively, with an overall average accuracy of 88.9%. The proposed model significantly enhances classification accuracy and outperforms existing state-of-the-art methods.
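The precise formulation of the self-correlation routing mechanism is given in Section 3. Purely as an illustration of the general idea, and not of the exact formulation used in this paper, the sketch below shows one way a non-iterative, similarity-weighted routing step could be written: capsule-to-capsule similarities are computed in a single pass and converted into weights for the votes sent to the output capsules. The layer sizes, the dot-product similarity and the softmax normalization are all assumptions.

```python
# Illustrative, non-iterative "self-correlation"-style routing step (NOT this
# paper's exact formulation; see Section 3). Similarities between capsules are
# computed once and turned into weights, with no iterative coupling updates.
import torch
import torch.nn.functional as F

def self_correlation_routing(u_hat: torch.Tensor) -> torch.Tensor:
    """
    u_hat: prediction vectors of shape (batch, n_in, n_out, d_out),
           i.e. each input capsule's vote for each output capsule.
    Returns output capsules of shape (batch, n_out, d_out).
    """
    # Dot-product self-similarity of the votes aimed at the same output capsule.
    sim = torch.einsum('bijd,bkjd->bijk', u_hat, u_hat)       # (batch, n_in, n_out, n_in)
    # Each input capsule's weight = how strongly its vote agrees with the others.
    weights = F.softmax(sim.mean(dim=-1), dim=1)               # (batch, n_in, n_out)
    s = (weights.unsqueeze(-1) * u_hat).sum(dim=1)             # (batch, n_out, d_out)
    # Squash so that output capsule lengths lie in [0, 1).
    sq_norm = (s ** 2).sum(dim=-1, keepdim=True)
    return sq_norm / (1.0 + sq_norm) * s / torch.sqrt(sq_norm + 1e-8)

u_hat = torch.randn(8, 16, 4, 12)   # e.g., 16 input capsules voting for 4 classes
v = self_correlation_routing(u_hat)
print(v.shape)                      # torch.Size([8, 4, 12])
```

Because the weights are obtained in a single forward pass rather than through iterative agreement updates, a step of this kind can be parallelized across all capsules, which is the property the proposed mechanism exploits.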
The rest of this paper is organized as follows. Section 2 introduces related work on the differences between capsule neural networks and traditional neural networks. Section 3 provides a detailed description of the proposed method. Section 4 describes the dataset and discusses the experimental results, the sensitivity of the parameters, and the ablation study. In Section 5, we summarize the paper.