¹ Southern University of Science and Technology, Shenzhen 518055, China
email: [email protected]
² Advanced Computing and Storage Lab, Huawei Technologies Co., Ltd., Shenzhen 518055, China
email: [email protected]

Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications

Zhou Zhou¹*, Guohang He¹*, Zheng Zhang¹,²†, Luziwei Leng², Qinghai Guo², Jianxing Liao², Xuan Song¹, Ran Cheng¹†

* Zhou Zhou and Guohang He contribute equally to this work.
† Corresponding authors: Zheng Zhang and Ran Cheng.

Abstract

Traditional invasive Brain-Computer Interfaces (iBCIs) typically depend on neural decoding processes conducted on workstations within laboratory settings, which prevents their everyday usage. Implementing these decoding processes on edge devices, such as wearables, introduces considerable challenges related to computational demands, processing speed, and maintaining accuracy. This study seeks to identify an optimal neural decoding backbone that combines robust performance with inference fast enough for edge deployment. We executed a series of neural decoding experiments involving nonhuman primates engaged in random reaching tasks, evaluating four prospective models, Gated Recurrent Unit (GRU), Transformer, Receptance Weighted Key Value (RWKV), and Selective State Space model (Mamba), across several metrics: single-session decoding, multi-session decoding, new session fine-tuning, inference speed, calibration speed, and scalability. The findings indicate that although the GRU model delivers sufficient accuracy, the RWKV and Mamba models are preferable due to their superior inference and calibration speeds. Additionally, RWKV and Mamba comply with the scaling law, demonstrating improved performance with larger datasets and increased model sizes, whereas GRU shows less pronounced scalability, and the Transformer model requires computational resources that scale prohibitively. This paper presents a thorough comparative analysis of the four models in various scenarios. The results help pinpoint an optimal backbone that can handle increasing data volumes and is viable for edge implementation. This analysis provides essential insights for ongoing research and practical applications in the field.

Keywords:

Neural decoding · Brain-computer interfaces · Deep neural networks.

1 Introduction

Advancements in invasive Brain-Computer Interfaces (iBCIs) have demonstrated promising results across various applications, including speech decoding [10, 27, 28], prosthesis control [7, 31], rehabilitation of neurological disorders [5, 15, 21, 22] and more. Accurately decoding brain activity is crucial for the success of these applications. Previous efforts have focused on employing adaptive filters such as Kalman Filters [7, 29, 30] or traditional machine learning models such as Recurrent Neural Networks (RNNs) [23, 27]. However, with the expansion of the available neural data, significant progress has been made using Transformer-based architectures. Models such as the Neural Data Transformer (NDT1) [32] leverage multi-session, multi-task and multi-subject neural data, yielding improved decoding performance and enhanced generalization to unseen data.

Limitations still exist among these methods. Despite the advantages of RNNs in handling long-term dependencies, their inherent serial dependency significantly affects the model’s inference speed [32]. Meanwhile, it remains unclear whether scaling up the GRU model size with data volume improves neural decoding accuracy. Transformers facilitate parallel computation and adhere to the scaling laws [12], but increasing the model size and sequence length leads to quadratic growth in model complexity ($O(n^{2})$), demanding computational resources that are difficult to fit on edge devices for portable BCI applications in daily use.

Models such as Receptance Weighted Key Value model (RWKV) [18] and Selective State Space model (Mamba) [8] have been designed utilizing linear attention mechanisms that offer reduced temporal and spatial complexity compared to traditional transformers. These models have demonstrated competitive performance in natural language processing and computer vision tasks [8, 16, 18], but it remains unclear which model is most suitable as the backbone for neural decoding.

This paper investigates whether recent advancements in model architectures can enhance neural decoding. Instead of benchmarking against state-of-the-art (SoTA) architectures, we compare the RWKV and Mamba models with the GRU and Transformer models in terms of computational efficiency and decoding accuracy. We have designed a series of experiments to assess various parameters: decoding accuracy, adaptiveness to new sessions, inference time, and scalability trends on model size, to identify an optimal neural decoding backbone. To the best of our knowledge, this work might be the first effort to investigate linear attention mechanisms in neural decoding, targeting fast and low-power inference on edge devices.

2 Related Work

2.1 Neural Decoding

Neural decoding has traditionally relied on adaptive filters or traditional machine learning methods such as Kalman Filters [7, 29, 30], Wiener Filters [11] or SVMs [24]. However, with the advent of deep learning, particularly the emergence of large-scale models, there has been a significant shift in neural decoding approaches. Deep learning models facilitate automated feature learning, reducing the impact of subjective factors and greatly improving decoding accuracy and efficiency. Recurrent neural networks and Transformers have now found more applications in neural decoding tasks [20]. Contemporary applications of brain decoding technologies extend to medical rehabilitation, assistive communication, and human-computer interaction [19].

2.2 RWKV

The Transformer has precipitated a disruptive revolution, particularly due to the widespread application of its attention mechanism across multiple domains. However, a significant issue arises as the memory and computational complexity of the Transformer grows quadratically with increasing sequence length. Concurrently, RNNs exhibit linear growth in memory and computational demands but are significantly outperformed by the Transformer due to limitations in parallelization and scalability [18]. To address this challenge, Peng et al. proposed RWKV, which integrates the efficient parallel training of the Transformer with the efficient inference of RNNs while performing comparably to similarly scaled Transformers, underscoring its potential and effectiveness in handling large-scale sequence data [2].

2.3 Mamba

The state space model is a mathematical framework used to describe the evolution of systems over time. It employs state vectors to represent the current state of the system and uses state transition equations and observation equations to describe how the state changes and how it relates to the observed data [9]. Mamba is an enhanced approach based on the structured state space model S4, integrating the recurrent structure of recurrent neural networks and the parallel characteristics of convolutional neural networks. This approach excels in capturing long-term dependencies in sequential data and facilitates efficient parallel computation. By combining structured state space models with deep learning techniques, Mamba can handle sequential data more effectively, exhibiting higher modeling capability and predictive performance. Mamba has demonstrated superior performance in various domains, including language modeling, DNA sequence modeling, and audio modeling and generation [8].

3 Methods

Figure 1: Raw neural signals were recorded from the primary motor cortex (M1) area of a monkey using a 96-channel Utah microelectrode array during random reach tasks. Spike activity detected from these neural signals was binned temporally across the 96 channels. The resulting matrix of spike counts served as inputs for various methods after normalization and smoothing, and the outputs were the predicted finger velocities along the x and y axes. Experiments conducted under different scenarios facilitated comparisons of predictive accuracy, inference speed, and scalability among the four types of backbone models.

The system architecture is shown in Fig.1. The raw neural recording from the Utah array is processed into spike count bins and decoded using GRU, Transformer, RWKV and Mamba as four different backbones. The decoded output is compared with the ground truth motion activities. The detailed workflow is given below.

3.1 Data Processing

3.1.1 Datasets

The dataset from [17] is used in this study. It includes a rich collection of neural and behavioral data recorded from nonhuman primates engaged in a random target reaching task. This task requires the subject to control a ticker that moves the computer cursor and to reach a series of randomly distributed targets displayed on screen in succession. During the execution of the task, neural activity from the primary motor cortex (M1) and primary sensory cortex (S1) is collected using Utah microelectrode arrays, and the subject’s hand kinematic trajectories are recorded using motion tracking systems.

The neural recordings in this dataset consist of extracellular spike recordings, with the event times of threshold crossings sorted into discrete units. The recordings collected from subject Indy, 30 sessions in total, are used in this study. The kinematic measurements contain the x and y coordinates of the subject’s fingertip and cursor position as it reaches out, as well as the x and y coordinates of the set targets, both sampled at a frequency of 250 Hz.

3.1.2 Data processing

In this study, we only used recordings collected from the M1 cortex. We partitioned each session of the recorded task into multiple temporal bins with a duration of 10 ms. Because the kinematic data are sampled at 250 Hz, their sampling frequency is increased to 1000 Hz using linear interpolation. Within these bins, we counted the spike events (threshold crossings) for each neural recording channel, thereby capturing the discrete neural firing patterns over time. It is worth noting that we use the unsorted spike events, known as multi-unit activity. In practice, spike sorting can require too much computation for on-chip processing, while using the sorted single-unit activity brings only limited improvement in decoding accuracy, as shown in [25].
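The binning and interpolation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `bin_spikes` and `upsample_kinematics`, the per-channel list of spike times in seconds, and the explicit start and stop times are all hypothetical.

```python
import numpy as np

def bin_spikes(spike_times, t_start, t_stop, bin_s=0.010):
    """Count threshold crossings per channel in fixed-width bins.

    spike_times: list with one 1-D array of event times (seconds) per channel.
    Returns an integer array of shape (n_bins, n_channels).
    """
    n_channels = len(spike_times)
    n_bins = int(round((t_stop - t_start) / bin_s))
    edges = t_start + bin_s * np.arange(n_bins + 1)
    counts = np.zeros((n_bins, n_channels), dtype=np.int64)
    for ch in range(n_channels):
        # np.histogram returns (counts, edges); only the counts are kept.
        counts[:, ch] = np.histogram(spike_times[ch], bins=edges)[0]
    return counts

def upsample_kinematics(t_src, values, t_dst):
    """Linearly interpolate each kinematic column onto new timestamps."""
    cols = [np.interp(t_dst, t_src, values[:, k]) for k in range(values.shape[1])]
    return np.stack(cols, axis=1)
```

With 10 ms bins, a 1280 ms window yields 128 rows, one per timestep.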

The cursor’s velocity is used to characterize the kinematics of the reaching movement. The binned spike events and cursor velocity were temporally aligned, normalized and smoothed with a Gaussian smoothing operation, which attenuates high-frequency noise and reveals the underlying signal trends, following [14].
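As a rough sketch of the normalization and Gaussian smoothing step, the snippet below applies a per-channel z-score followed by a truncated Gaussian kernel. The kernel width (`sigma_bins`), the truncation radius, the z-score choice, and the order of the two operations are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def gaussian_kernel(sigma_bins, truncate=3.0):
    """Unit-sum Gaussian kernel truncated at `truncate` standard deviations."""
    radius = int(truncate * sigma_bins + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma_bins) ** 2)
    return k / k.sum()

def zscore(x, eps=1e-8):
    """Normalize each column to zero mean and (approximately) unit variance."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps)

def gaussian_smooth(x, sigma_bins=4.0):
    """Smooth each column of x (time, channels) along the time axis.

    Assumes the recording is longer than the kernel.
    """
    kernel = gaussian_kernel(sigma_bins)
    cols = [np.convolve(x[:, c], kernel, mode="same") for c in range(x.shape[1])]
    return np.stack(cols, axis=1)

def preprocess(counts, sigma_bins=4.0):
    """Normalize, then smooth (order assumed from the wording of the text)."""
    return gaussian_smooth(zscore(counts.astype(float)), sigma_bins)
```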

In the experiments, the input to the model is denoted as $Spk\in\mathbb{R}^{S\times C}$ , where $S$ represents the number of timesteps used for each prediction and $C$ denotes the number of channels. The ground truth is denoted as $Vel\in\mathbb{R}^{S\times 2}$ , which represents the finger velocity along the x and y axes at each timestep.
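To make the shapes concrete, the sketch below cuts a preprocessed session into windows of $S$ timesteps. Non-overlapping windows and the helper name `make_windows` are assumptions for illustration; the text does not state how windows are drawn.

```python
import numpy as np

def make_windows(spk, vel, S):
    """Cut a session into non-overlapping windows of S timesteps.

    spk: (T, C) binned spike features; vel: (T, 2) velocities.
    Returns X of shape (n, S, C) and Y of shape (n, S, 2); trailing samples
    that do not fill a whole window are dropped.
    """
    n = spk.shape[0] // S
    X = spk[: n * S].reshape(n, S, spk.shape[1])
    Y = vel[: n * S].reshape(n, S, vel.shape[1])
    return X, Y
```

For example, a session of 300 bins with $S=128$ and $C=96$ gives two windows of shape $128\times 96$.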

3.2 Backbone Models

3.2.1 GRU

Proposed by Cho et al. [4], GRU is a variant of the Recurrent Neural Network (RNN), specifically designed to address the challenges of gradient explosion and gradient vanishing in training. GRU achieves this by employing update gates and reset gates, which selectively update useful information and capture long-term dependencies within time series data. GRU can be characterized by the following formulations [6]:

$\displaystyle h_{t}^{j}$	$\displaystyle=(1-z_{t}^{j})\odot h_{t-1}^{j}+z_{t}^{j}\odot\tilde{h}_{t}^{j}$	(1)
$\displaystyle z_{t}^{j}$	$\displaystyle=\sigma(W_{z}x_{t}+U_{z}h_{t-1}^{j})$	(2)
$\displaystyle\tilde{h}_{t}^{j}$	$\displaystyle=\tanh(Wx_{t}+U(r_{t}\odot h_{t-1}^{j}))$	(3)
$\displaystyle r_{t}^{j}$	$\displaystyle=\sigma(W_{r}x_{t}+U_{r}h_{t-1}^{j})$	(4)

The reset gate ( $r$ ) is a gating mechanism that modulates the flow of information from the previous activation, allowing the model to discard irrelevant past state information, thus mitigating the vanishing gradient problem. The update gate ( $z$ ) determines the extent to which the unit updates its activation, or hidden state ( $h$ ). It controls the degree of information transfer from the previous state to the current state, enabling the model to capture long-term dependencies. The activation ( $h$ ), commonly referred to as the hidden state, captures the learned information at the current time step and is recursively influenced by past activation. In our work, we employ hidden size $d_{h}=256$ . The candidate activation ( $\tilde{h}$ ) is a proposed update to the hidden state, which incorporates new input while being modulated by the reset gate to potentially discard the irrelevant previous state.
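A single GRU update following Eqs. 1-4 can be written in a few lines of NumPy. This is a didactic sketch rather than the implementation used in the experiments; the parameter dictionary layout and the initializer `init_gru_params` are hypothetical, and bias terms are omitted because the equations above do not include them.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_gru_params(n_in, n_hidden, rng):
    """Small random weights; names mirror the symbols in Eqs. 1-4."""
    shapes = {
        "Wz": (n_hidden, n_in), "Uz": (n_hidden, n_hidden),
        "Wr": (n_hidden, n_in), "Ur": (n_hidden, n_hidden),
        "W": (n_hidden, n_in), "U": (n_hidden, n_hidden),
    }
    return {name: rng.normal(0.0, 0.1, size=shape) for name, shape in shapes.items()}

def gru_step(x_t, h_prev, p):
    """One recurrent update for a single timestep."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)            # update gate, Eq. 2
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)            # reset gate, Eq. 4
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))  # candidate, Eq. 3
    return (1.0 - z) * h_prev + z * h_tilde                  # new state, Eq. 1

def gru_forward(x, p, n_hidden):
    """Run the recurrence over a (S, C) input; timesteps are processed serially."""
    h = np.zeros(n_hidden)
    states = []
    for x_t in x:
        h = gru_step(x_t, h, p)
        states.append(h)
    return np.stack(states)
```

The explicit loop in `gru_forward` is the serial dependency discussed in the Introduction: step $t$ cannot start before step $t-1$ has finished.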

3.2.2 Transformer

The foundational mechanism of the Transformer is its self-attention mechanism, which enables the model to dynamically adjust the weighting of input data, such as tokens or sequence elements, based on their contextual relevance [26]. Unlike GRUs, which process data sequentially, Transformers handle input in parallel during the training phase, significantly expediting the training process.

We employ the classic Multihead Scaled Dot-Product Attention mechanism along with an encoder-decoder architecture. Traditional approaches pass vocabulary tokens through an embedding layer to obtain the feature dimensions. Instead, we directly take the spike matrix $x\in\mathbb{R}^{S\times C}$ as the input for both encoder and decoder, treat the channel dimension $C$ of the input spike matrix as the feature dimension, and project it to the model hidden dimension following Eq.5. This projection is also used in the RWKV and Mamba models.

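$\displaystyle A$	$\displaystyle=f(x)=Wx+b$	(5)
$\displaystyle B$	$\displaystyle=E[\text{positions}]$	(6)
$\displaystyle input$	$\displaystyle=\text{Dropout}(A+B)$	(7)
$\displaystyle output$	$\displaystyle=\text{Decoder}(\text{Encoder}(input),input)$	(8)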

The function $f$ represents a linear mapping layer, where $W$ and $b$ denote the weights and biases of the input layer, respectively. $E$ corresponds to the positional embedding matrix, from which an embedding vector is selected for each positional index. Here, $input\in\mathbb{R}^{S\times d_{\text{model}}}$ is the input to the encoder and decoder, and $output\in\mathbb{R}^{S\times 2}$ is the predicted velocity along the x and y axes. In the encoder and decoder, the attention is implemented as below:
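The input stage of Eqs. 5-7 amounts to a linear projection of the channel dimension plus a learned positional embedding. A minimal NumPy sketch is given below; the function name `embed_spikes`, the weight layout ($W$ of shape $d_{\text{model}}\times C$), and the inverted-dropout formulation are assumptions for illustration.

```python
import numpy as np

def embed_spikes(x, W, b, E, drop_p=0.0, rng=None):
    """Project a spike window and add positional embeddings.

    x: (S, C) spike features; W: (d_model, C); b: (d_model,);
    E: (max_len, d_model) positional embedding table with max_len >= S.
    """
    S = x.shape[0]
    A = x @ W.T + b            # Eq. 5: linear map from C channels to d_model
    B = E[np.arange(S)]        # Eq. 6: one embedding row per position
    out = A + B
    if drop_p > 0.0:           # Eq. 7: dropout, only active during training
        rng = rng if rng is not None else np.random.default_rng()
        mask = rng.random(out.shape) >= drop_p
        out = out * mask / (1.0 - drop_p)
    return out
```

With $C=96$ channels and $d_{\text{model}}=128$, a window of $S=128$ timesteps becomes a $128\times 128$ matrix.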

$\displaystyle\text{Attention}(Q,K,V)$	$\displaystyle=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$	(9)
$\displaystyle\text{MultiHead}(Q,K,V)$	$\displaystyle=\text{Concat}(\text{head}_{1},\ldots,\text{head}_{h})W^{O}$	(10)
$\displaystyle\text{where }\text{head}_{i}$	$\displaystyle=\text{Attention}(Q^{\prime},K^{\prime},V^{\prime})$

Here, $Q$, $K$ and $V$ are calculated following Eq.5 with independent weights and zero bias. The output projection matrix is $W^{O}\in\mathbb{R}^{hd_{v}\times d_{\text{model}}}$ , and the per-head tensors are $Q^{\prime}\in\mathbb{R}^{S\times h\times d_{q}}$ , $K^{\prime}\in\mathbb{R}^{S\times h\times d_{k}}$ , $V^{\prime}\in\mathbb{R}^{S\times h\times d_{v}}$ . We employ $h=2$ heads, $d_{q}=d_{k}=d_{v}=\frac{d_{\text{model}}}{h}$ and $d_{\text{model}}=128$ .

Given the limited variance in input data patterns, the data is processed through two separate attention heads. The system comprises three layers each of encoders and decoders, culminating in the prediction of velocities in the x and y axes.
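For reference, Eqs. 9-10 can be sketched in NumPy as below. This is the standard scaled dot-product formulation, not the authors' code; the weight names `Wq`, `Wk`, `Wv`, `Wo` are hypothetical, and masking, layer normalization, and feed-forward sublayers are left out.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Eq. 9. The score matrix is S x S, hence the quadratic cost in S."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(x, Wq, Wk, Wv, Wo, h=2):
    """Eq. 10 for self-attention over x of shape (S, d_model)."""
    S, d_model = x.shape
    d_head = d_model // h

    def split(t):                              # (S, d_model) -> (h, S, d_head)
        return t.reshape(S, h, d_head).transpose(1, 0, 2)

    heads = attention(split(x @ Wq), split(x @ Wk), split(x @ Wv))
    concat = heads.transpose(1, 0, 2).reshape(S, h * d_head)
    return concat @ Wo
```

With $h=2$ and $d_{\text{model}}=128$, each head works on 64 features.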

3.2.3 RWKV

Unlike most RNNs, RWKV is a recurrent model that combines the efficient parallelizable training of Transformers with the fast inference of RNNs. RWKV reformulates the attention mechanism with a variant of linear attention, replacing traditional dot-product token interaction with more effective channel-directed attention [18]. It mitigates the memory bottleneck and quadratic scaling issues inherent in Transformers through efficient linear scaling. It also preserves the ability for parallelized training and ensures robust scalability.

$\displaystyle r_{t}$	$\displaystyle=W_{r}(\mu_{r}\odot x_{t}+(1-\mu_{r})\odot x_{t-1})$	(11)
$\displaystyle k_{t}$	$\displaystyle=W_{k}(\mu_{k}\odot x_{t}+(1-\mu_{k})\odot x_{t-1})$	(12)
$\displaystyle v_{t}$	$\displaystyle=W_{v}(\mu_{v}\odot x_{t}+(1-\mu_{v})\odot x_{t-1})$	(13)

R encodes historical information, activated via a Sigmoid function and incorporating a forgetting mechanism. W signifies the positional weight decay vector, a trainable parameter within the model. The terms K and V function analogously to the key and value in Transformer architectures. Distinct from traditional models where $x$ is simply the embedding of the current token, in the RWKV, $x$ is calculated as the weighted sum of the embeddings of the current token and the previous token.

$\displaystyle wkv_{t}$	$\displaystyle=\frac{\sum_{i=1}^{t-1}e^{-(t-1-i)w+k_{i}}\odot v_{i}+e^{u+k_{t}}\odot v_{t}}{\sum_{i=1}^{t-1}e^{-(t-1-i)w+k_{i}}+e^{u+k_{t}}}$	(14)

Equation 14 functions similarly to an attention mechanism, representing position $t$ as a learnable weighted sum of past content. In RWKV, $w$ is treated as a channel-wise vector that adjusts according to the relative position, requiring the training of only a single parameter vector $w$ . $u$ is designated for individual processing of the current token’s position, serving to circumvent any potential degradation of $w$ .
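Equation 14 can be evaluated either directly or with a running numerator and denominator, which is what makes constant-cost recurrent inference possible. The sketch below implements both in NumPy so that they can be compared; the function names are hypothetical, and the numerical stabilization used in practical implementations (tracking a running maximum exponent) is deliberately omitted.

```python
import numpy as np

def wkv_direct(k, v, w, u):
    """Literal evaluation of Eq. 14; k, v: (T, D), w, u: (D,). Cost is O(T^2)."""
    T, D = k.shape
    out = np.zeros((T, D))
    for t in range(T):                  # 0-based t is position t + 1 in Eq. 14
        cur = np.exp(u + k[t])          # bonus term for the current token
        num = cur * v[t]
        den = cur.copy()
        for i in range(t):              # past tokens, decayed by (t - 1 - i) * w
            weight = np.exp(-(t - 1 - i) * w + k[i])
            num = num + weight * v[i]
            den = den + weight
        out[t] = num / den
    return out

def wkv_recurrent(k, v, w, u):
    """Same quantity with running sums a (numerator) and b (denominator)."""
    T, D = k.shape
    a = np.zeros(D)
    b = np.zeros(D)
    out = np.zeros((T, D))
    decay = np.exp(-w)
    for t in range(T):
        cur = np.exp(u + k[t])
        out[t] = (a + cur * v[t]) / (b + cur)
        a = decay * a + np.exp(k[t]) * v[t]   # fold token t into the past sums
        b = decay * b + np.exp(k[t])
    return out
```

The recurrent form keeps only two vectors of size $D$ per layer, so each new timestep costs the same regardless of how long the history is.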

3.2.4 Mamba

In contrast to the quadratic scaling observed with traditional models, Mamba demonstrates a throughput up to five times faster than the Transformer and exhibits linear scaling with sequence length [8]. Unlike RNNs, which compress all information into a hidden space and struggle with long-term memory issues, Mamba introduces a selective state-space model. This model offers the benefits of a linear recurrent network, enhanced by mechanisms for rapid training and effective context retention. Improvements in Mamba’s Structured State Spaces (SSM) include a selection mechanism that filters out irrelevant information while enabling indefinite memory retention, and a hardware-aware algorithm optimized for GPU memory layouts to facilitate hardware acceleration. This ensures efficient computation cycling without extending the state unnecessarily, thus enhancing performance.

The SSM underlying Mamba is described by the following two equations:

$\displaystyle x_{t}$	$\displaystyle=f(x_{t-1},u_{t},w_{t})$	(15)
$\displaystyle y_{t}$	$\displaystyle=h(x_{t},v_{t})$	(16)

Equation 15 represents the state transition equation, describing how the system state evolves over time. Here, $x_{t}$ denotes the system state at time step $t$ , $u_{t}$ represents the control input, $w_{t}$ is the process noise, and $f$ is the state transition function. Equation 16 is the observation equation, where $y_{t}$ represents the observation data at time step $t$ , $v_{t}$ denotes the observation noise, and $h$ is the observation function. The concept of selectivity in Mamba allows the model to selectively remember or forget information at each time step.
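To illustrate what selectivity means in Eqs. 15-16, the toy scan below uses a noise-free, linear, diagonal state-space model whose step size and input/output vectors depend on the current input. It is a heavily simplified sketch for a single scalar input channel: the parameter names (`w_B`, `w_C`, `w_dt`), the softplus step size, and the discretization are illustrative assumptions and should not be read as Mamba's actual implementation, which also relies on a hardware-aware parallel scan.

```python
import numpy as np

def selective_ssm_scan(u, A, w_B, w_C, w_dt):
    """Toy selective state-space recurrence.

    u: (T,) scalar input sequence; A: (N,) negative diagonal state matrix;
    w_B, w_C: (N,) weights that make B_t and C_t depend on the input;
    w_dt: scalar weight controlling the input-dependent step size.
    """
    T = u.shape[0]
    x = np.zeros_like(A, dtype=float)       # hidden state x_t of Eq. 15
    y = np.zeros(T)
    for t in range(T):
        dt = np.log1p(np.exp(w_dt * u[t]))  # softplus keeps the step positive
        B_t = w_B * u[t]                    # input-dependent input vector
        C_t = w_C * u[t]                    # input-dependent output vector
        A_bar = np.exp(dt * A)              # discretized transition (decay)
        x = A_bar * x + dt * B_t * u[t]     # state transition, Eq. 15
        y[t] = C_t @ x                      # observation, Eq. 16
    return y
```

A large step size lets the current input overwrite the state, while a small one preserves the existing state, which is one way to picture "selectively remember or forget".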

4 Experiments and Key Results

4.1 Experiment settings

To evaluate the capabilities of different backbone models across various dimensions, four distinct experiments were established: single-session, multi-session, new session finetuning, and scaling experiments (with the number of timesteps set to 128, 1024, 128 and 1024, respectively). A total of 30 sessions, collected over different days from the same subject, were used. All neural recordings from these 30 days were divided into training and testing datasets with an 8:2 ratio, consistently applied across all experiments.

Single-Session Experiment: This experiment assessed the ability of the backbone models to perform effectively on small datasets. Each of the four models was trained independently on data from individual sessions, with recording lengths varying from 360 s to 3363 s.

Multi-Session Experiment: This experiment focused on the models’ capacity to extract deep latent representations from neural recordings whose input features shift over time. A unified model was trained using training sets from all sessions. Over time, the quality of the recordings degraded due to scar tissue encapsulation around the implants, leading to increased noise levels and a decrease in detected neural firing rates from over 20 Hz to below 10 Hz. Additionally, the neurons observed on different channels changed over time. Various training strategies were explored to help models adapt to these shifting input features.

New Session Finetuning Experiment: This experiment tested the models’ ability to generalize and adapt to unseen data. Models were initially trained with datasets from the first 25 days, and then incrementally finetuned using datasets from the last five days (10 seconds per iteration). This setup mirrors practical scenarios for BCI calibration on new days, where a shorter calibration time is often critical. The aim was to identify the model that could quickly return to acceptable performance levels, making it more suitable for real-world use outside the laboratory.

Scaling Experiment: This experiment investigated whether increased model size could enhance performance. The scaling law has been a key principle in designing large language models [13], but its applicability in neural decoding remains unexplored.

Table 1: Parameter counts and hyperparameters of models

| Model | Parameters | Epochs Single (Multi) | Layers | Embedding Size |
|---|---|---|---|---|
| GRU | 272k | 30 (50) | 1 | 256 |
| Transformer | 316k | 50 (50) | 3 | 128 |
| RWKV | 294k | 30 (50) | 2 | 88 |
| Mamba | 306k | 30 (50) | 2 | 144 |

The same hyperparameter settings are used for all experiments except the scaling experiment, with details on the parameter counts and hyperparameters presented in Table 1. The requirement for the Transformer model to undergo 50 epochs may be attributed to its attention mechanism, which necessitates numerous iterations to effectively optimize attention weights. Additionally, the design of the Transformer, which processes entire sequences simultaneously, may contribute to slower convergence rates during training [26].

R² is used to evaluate the neural decoding performance following [1, 33]. R² typically ranges from 0 to 1: a value of 0 indicates that the model fails to explain any variance in the dependent variable, while a value of 1 indicates a perfect fit of the model to the data. The formula for calculating R² is as follows:

	$\displaystyle RSS$	$\displaystyle=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}$		(17)
	$\displaystyle TSS$	$\displaystyle=\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}$		(18)

	$\displaystyle R^{2}$	$\displaystyle=1-\frac{RSS}{TSS}$		(19)

where $RSS$ is the residual sum of squares (the sum of the squared differences between the actual values $y_{i}$ and the predicted values $\hat{y}_{i}$ ), and $TSS$ is the total sum of squares (the sum of the squared differences between the actual values and the mean of the observed values $\bar{y}$ ). Table 2 summarizes the evaluation results of the four models in different experiments.
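Equations 17-19 translate directly into code. The sketch below computes R² per output column, so for velocities of shape $S\times 2$ it returns one score for each axis; how the two axes are combined into the reported score is not stated in the text, so averaging them would be an assumption.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination, computed column-wise along axis 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2, axis=0)                # Eq. 17
    tss = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)   # Eq. 18
    return 1.0 - rss / tss                                      # Eq. 19
```

As a worked example, for $y=(1,2,3,4)$ and $\hat{y}=(1,2,3,5)$ we get $RSS=1$, $TSS=5$ and $R^{2}=0.8$. A prediction worse than the mean gives a negative value, which is why the range of 0 to 1 is typical rather than guaranteed.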

4.2 Single-session experiment

As shown in Table 2, the RWKV model achieves the highest score in the single-session experiment, narrowly surpassing the GRU model by 0.002 in R² (0.717 versus 0.715). However, both the Mamba and Transformer models score below 0.7, indicating that these models are less effective when dataset sizes are limited.

The inference time for processing 1280 ms of neural data (1 batch), detailed in Table 2, varies among the models. The GRU model requires the longest processing time due to its sequential processing nature. The Transformer model also exhibits relatively long inference times due to its computationally intensive operations. In contrast, the RWKV and Mamba models demonstrate significant advantages in inference speed over both the GRU and Transformer models.
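A per-batch latency figure of this kind could be collected with a small timing harness such as the one below. This is a generic sketch, not the measurement protocol used for Table 2: `predict` and `batch` are hypothetical placeholders for a model's forward function and one 1280 ms input window, and the warm-up and repeat counts are arbitrary.

```python
import statistics
import time

def time_inference(predict, batch, warmup=3, repeats=20):
    """Return the median wall-clock time in seconds of predict(batch)."""
    for _ in range(warmup):              # warm-up runs are not timed
        predict(batch)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(batch)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

Reporting the median rather than the mean reduces the influence of occasional scheduling delays, which matters on resource-constrained edge hardware.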

Specifically, RWKV, which is a recurrent neural network devoid of an attention mechanism, avoids the computational overhead associated with computing attention matrices. This model incorporates Token Shift and Channel Mix mechanisms to optimize position encoding and channel blending, thereby enhancing both efficiency and performance. On the other hand, Mamba achieves rapid inference and maintains linear scalability with sequence length through dynamic and selective retention or dismissal of information based on input. Its streamlined and homogeneous architecture, coupled with a selective state space, markedly boosts inference speed.

4.3 Multi-session experiment

In the multi-session experiment, we explored three different data partitioning strategies during training to identify the most effective approach for aiding models to learn as input features shifted. These strategies are as follows:

$\bullet$ Random partitioning: Batches are randomly selected from random sessions to be fed into the model.

$\bullet$ Sequential partitioning: Data batches are fed into the model in a sequential order, day by day.

$\bullet$ Random session partitioning: Sessions are selected randomly, but within each selected session, data batches are fed sequentially.
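The three strategies differ only in the order in which (session, batch) pairs are presented to the model. The sketch below generates that order under the assumption that each session has already been cut into an ordered list of batches; the function name `batch_order` and the strategy labels are hypothetical.

```python
import random

def batch_order(batches_per_session, strategy, seed=0):
    """Return a list of (session_index, batch_index) pairs for one epoch.

    batches_per_session: number of batches in each session, in day order.
    strategy: "random", "sequential", or "random_session".
    """
    rng = random.Random(seed)
    sessions = list(range(len(batches_per_session)))
    if strategy == "random_session":
        rng.shuffle(sessions)            # shuffle sessions, keep inner order
    order = [(s, b) for s in sessions for b in range(batches_per_session[s])]
    if strategy == "random":
        rng.shuffle(order)               # shuffle across sessions and batches
    elif strategy not in ("sequential", "random_session"):
        raise ValueError(f"unknown strategy: {strategy}")
    return order
```

For two sessions with 2 and 3 batches, `batch_order([2, 3], "sequential")` yields the pairs in day order, while `"random"` returns the same five pairs shuffled.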

The random training strategy results in significantly higher stability and decoding accuracy of the model compared to the other two strategies. Although the data are strongly time-correlated, this approach of random input enhances gradient diversity, reduces cyclic biases in data appearance, and helps prevent overfitting.

Sequential training resulted in limited improvement over the single-session experiment for both the GRU and RWKV models. Although these models can memorize historical information, sequential training may still lead to catastrophic forgetting, thereby only marginally enhancing performance compared to the single-session results. In contrast, the Mamba model demonstrated a significant improvement, nearly 0.1 increase in R², over the single-session experiment. This suggests that Mamba’s selective state space mechanism is more effective at preserving useful information and handling long-term dependencies compared to the gating mechanisms of GRU or the RWKV model in neural decoding.

However, the random session training strategy failed to provide a diverse training gradient and the data order could not convey long-term dependencies, resulting in underfitting of the model.

Another observation during the multi-session training is the difficulty in achieving convergence with the Transformer model, which required careful tuning of its hyper-parameters. In contrast, the other models exhibited less sensitivity to training hyper-parameter settings.

Table 2: Experiments results on all models

| Experiment | Indicator | GRU | Transformer | RWKV | Mamba |
|---|---|---|---|---|---|
| Single-session | Average R² | 0.715 | 0.633 | 0.717 | 0.660 |
| | Inference time/s | 0.941 | 0.822 | 0.303 | 0.434 |
| Multi-session | Random train | 0.838 | 0.720 | 0.812 | 0.810 |
| | Sequence train | 0.749 | 0.523 | 0.726 | 0.752 |
| | Random session | 0.560 | 0.314 | 0.600 | 0.556 |
| Fine-tuning | Average R² | 0.773 | 0.748 | 0.763 | 0.756 |
| | Recovery time/s | 214 | - | 202 | 178 |
| | Zero shot | 0.4811 | 0.383 | 0.452 | 0.370 |
| Scaling | Max R² | 0.846 | - | 0.843 | 0.851 |
| | Increment | 0.010 | - | 0.031 | 0.041 |

4.4 Fine-tuning the model on new sessions

As shown in Table 2, among the four models, GRU achieved the highest average R² score over 5 days of fine-tuning on new sessions, reaching 0.773. The RWKV and Mamba models scored 1-2% lower, while the Transformer model recorded the lowest score at 0.748. Regarding zero-shot performance, only RWKV achieved an R² of 0.7, and only in one session out of five. On average, none of the models achieved adequate zero-shot performance.

The results from the finetuning experiment indicate that all models are capable of surpassing their performance when trained solely on single-session data. This demonstrates that despite variations in firing rates and neuron-channel mappings over time, the models can distill useful information to enhance neural decoding. The quality of the base model significantly influences the effectiveness of the finetuned model. However, the backbone model alone does not provide zero-shot capability, suggesting that additional architectural designs or training strategies are necessary to enhance the models’ adaptability to input feature shifts and improve zero-shot performance.

In terms of the data length required to achieve an acceptable R² score of 0.7 through fine-tuning, Mamba outperformed both RWKV and GRU. This superior performance likely stems from Mamba’s enhanced ability to resolve long-term dependencies, which facilitates its calibration to unseen data more effectively. Consequently, Mamba emerges as a more viable option for real-world deployment in practical BCI applications due to its robust adaptability.
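One way to turn a fine-tuning curve into a recovery time is to find the first calibration iteration at which R² reaches the threshold and multiply by the 10 s of data consumed per iteration. The helper below sketches that calculation; the exact definition of recovery time used for Table 2 is an assumption here, and the function name is hypothetical.

```python
def recovery_time(r2_per_iteration, threshold=0.7, seconds_per_iteration=10):
    """Seconds of calibration data needed before R² first reaches threshold.

    r2_per_iteration: R² measured after each fine-tuning iteration, in order.
    Returns None if the threshold is never reached.
    """
    for n, r2 in enumerate(r2_per_iteration, start=1):
        if r2 >= threshold:
            return n * seconds_per_iteration
    return None
```

For a made-up curve such as `[0.40, 0.60, 0.71, 0.80]`, the threshold is reached at the third iteration, giving 30 s.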

4.5 Scaling analysis

In multi-session training, the parameter count for the models we used is approximately 300k. To explore whether increasing the model size could enhance decoding performance, we examined the improvements achieved by increasing the number of layers in the GRU, RWKV and Mamba models (the Transformer failed to converge in many cases and is therefore excluded). The variation in the models’ decoding R² scores as a function of parameter count, ranging from 300k to 3M, is illustrated in Fig. 2.

With an increase in model parameters, the R² scores for RWKV and Mamba show significant improvement, reaching 0.843 and 0.851 respectively. This represents increases of 0.031 and 0.041 over their 300k parameter models. In contrast, the GRU model demonstrates only a mild improvement of 0.01 when parameters are increased, and further scaling leads to a declining trend in performance. Despite its gate mechanisms to mitigate the vanishing gradient problem, GRU’s inherent sequential processing nature restricts its scalability and limits its efficiency in handling large-scale sequence tasks.

Conversely, RWKV and Mamba exhibit superior scalability and computational efficiency, outperforming GRU. This advantage is largely due to their innovative structural designs and optimization strategies that effectively address the limitations typically associated with recurrent neural networks and traditional Transformers.

While performance gains for RWKV and Mamba level off as model size increases, this plateau is mainly attributable to the limited size of the current dataset. However, with the rapid advancement of BCI technology and the anticipated increase in available data, it is reasonable to predict that RWKV or Mamba could serve as robust backbones for neural decoding in future applications.

5 Discussion

5.1 Suggestions on model selection

Each of the four models evaluated demonstrates distinct strengths and weaknesses. On the dataset used in this work, the GRU model achieves the second-highest prediction accuracy in the single-session experiment and the best prediction accuracy in the multi-session experiment. However, its inference time and calibration recovery time are constrained by its inherent serial structure. In contrast, the RWKV and Mamba models have significantly faster inference and calibration recovery times. Additionally, both RWKV and Mamba adhere to the scaling law, demonstrating a gradual improvement in predictive accuracy as model sizes increase. Mamba eventually achieves an R² of 0.851 when scaled up to 3M parameters, the highest score among all models across the experiment settings. This makes it comparable to the SoTA neural decoding model POYO [3], which was trained on a much larger dataset and tested on the same task. The Transformer model, however, lags in nearly all performance metrics and was difficult to train to convergence in our experiments.

Consequently, Mamba or RWKV could be suitable backbones for future neural decoding tasks, especially with an increasing amount of available neural recordings. Their scalability and linear computational complexity can significantly enhance decoding performance without the need for excessive computational resources, making them preferable for wearable devices used daily. For BCI applications, this choice can also lead to reduced training times, faster response times, and quicker calibration speeds. However, for studies involving a limited amount of data and those not sensitive to response times, RNN models like GRU or LSTM may suffice to provide high decoding performance in most use cases.

5.2 Limitations and future work

One significant challenge within the BCI field is achieving long-term stable neural decoding. Based on our experiments, none of the four models provides stable long-term decoding without fine-tuning. This work uses only one dataset; training on more diverse data could enable the models to learn a broader array of data features, thereby enhancing their robustness in practical applications.

The degradation in long-term decoding performance is primarily due to shifts in the input features [33]. To manage potential data drift over prolonged periods, continuous or online learning strategies could be implemented, allowing the model to adapt continually to new data. To save computation, tuning only the input and output layers, or employing transfer learning strategies, might accommodate input variations with less overhead than updating all parameters.
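The idea of tuning only the input and output layers can be illustrated with a toy example. The sketch below is not the training code used in this work: the scalar "decoder", the parameter names (`w_in`, `w_core`, `w_out`), the simulated gain and offset shift, and the learning rate are all assumptions chosen to keep the example self-contained.

```python
import math


def forward(x, p):
    """Toy decoder: input layer -> frozen backbone -> output layer."""
    h = p["w_in"] * x + p["b_in"]          # input layer (trainable)
    z = math.tanh(p["w_core"] * h)         # backbone (kept frozen)
    return p["w_out"] * z + p["b_out"], z  # output layer (trainable)


def mse(data, p):
    return sum((forward(x, p)[0] - y) ** 2 for x, y in data) / len(data)


def finetune_io_layers(data, params, lr=0.05, epochs=200):
    """Gradient descent that updates only the input and output layers."""
    p = dict(params)
    n = len(data)
    for _ in range(epochs):
        grads = {"w_in": 0.0, "b_in": 0.0, "w_out": 0.0, "b_out": 0.0}
        for x, y in data:
            pred, z = forward(x, p)
            err = 2.0 * (pred - y) / n
            grads["w_out"] += err * z
            grads["b_out"] += err
            # Back-propagate through the frozen backbone to the input layer.
            dh = err * p["w_out"] * (1.0 - z * z) * p["w_core"]
            grads["w_in"] += dh * x
            grads["b_in"] += dh
        for name, g in grads.items():  # "w_core" is never updated
            p[name] -= lr * g
    return p


# Hypothetical pretrained decoder and a new session whose inputs are
# rescaled and offset relative to the original recording conditions.
pretrained = {"w_in": 1.0, "b_in": 0.0, "w_core": 1.5, "w_out": 1.0, "b_out": 0.0}
clean_inputs = [i / 10 - 1 for i in range(21)]
new_session = [(0.5 * x + 0.3, math.tanh(1.5 * x)) for x in clean_inputs]

tuned = finetune_io_layers(new_session, pretrained)
print("MSE before:", mse(new_session, pretrained))
print("MSE after: ", mse(new_session, tuned))
```

In this toy setting the input layer can absorb the simulated gain and offset change while the backbone weight stays untouched, which is the behavior the strategy relies on. Whether it holds for real recordings remains to be evaluated.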

New training strategies can also be explored to guide the model in learning useful latent representations. In our preliminary experiments, a weighted loss scheme that prioritizes chronologically recent sessions has already notably improved zero-shot results.
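One way such a recency-weighted loss could be set up is sketched below. The exponential decay rule, the `decay` value, and the function names are assumptions made for illustration; the exact weighting used in the preliminary experiments is not specified here.

```python
def session_weights(num_sessions, decay=0.8):
    """Normalized weights for sessions ordered oldest to newest.

    The newest session gets the largest weight, and each step back in
    time multiplies the weight by `decay` (assumed to lie in (0, 1]).
    """
    raw = [decay ** (num_sessions - 1 - i) for i in range(num_sessions)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_session_loss(per_session_losses, decay=0.8):
    """Combine per-session losses so that recent sessions dominate."""
    weights = session_weights(len(per_session_losses), decay)
    return sum(w * loss for w, loss in zip(weights, per_session_losses))
```

With `decay=1.0` this reduces to a plain average over sessions, so the decay factor controls how strongly training is biased toward recent recordings.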

Additionally, the backbone models in this study were trained only on a random reaching task with one subject. Adaptation across different tasks and subjects also needs to be carefully evaluated in future studies.

6 Conclusion

This study has conducted a comprehensive comparison of GRU, Transformer, RWKV, and Mamba models in the context of neural decoding for random reach tasks. RWKV and Mamba, which demonstrate faster inference, lower computational complexity, and better scalability than GRU and Transformer, emerge as preferred choices for deployment on wearable devices. This detailed evaluation of the strengths and weaknesses of each model not only highlights their individual capabilities but also establishes a robust foundation for future advances in model architecture. The insights gained from this work can guide the development of more efficient and effective neural decoding architectures, paving the way for enhanced performance in practical applications.

7 Acknowledgement

This work was supported in part by Guangdong Natural Science Funds for Distinguished Young Scholar under Grant 2024B1515020019.

References

[1] Ahmadi, N., Constandinou, T.G., Bouganis, C.S.: Robust and accurate decoding of hand kinematics from entire spiking activity using deep learning. Journal of Neural Engineering 18(2), 026011 (2021)
[2] Alam, M.M., Raff, E., Biderman, S., Oates, T., Holt, J.: Recasting self-attention with holographic reduced representations. In: ICML (2023)
[3] Azabou, M., Arora, V., Ganesh, V., Mao, X., Nachimuthu, S., Mendelson, M., Richards, B.A., Perich, M.G., Lajoie, G., Dyer, E.L.: A unified, scalable framework for neural population decoding. In: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems, NeurIPS (2023)
[4] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR (2015)
[5] Cheng, N., Phua, K.S., Lai, H.S., Tam, P.K., Tang, K.Y., Cheng, K.K., Yeow, R.C.H., Ang, K.K., Guan, C., Lim, J.H.: Brain-computer interface-based soft robotic glove rehabilitation for stroke. IEEE Transactions on Biomedical Engineering 67(12), 3339–3351 (2020)
[6] Chung, J., Gülçehre, Ç., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR 1412.3555 (2014)
[7] Gilja, V., Nuyujukian, P., Chestek, C.A., Cunningham, J.P., Yu, B.M., Fan, J.M., Churchland, M.M., Kaufman, M.T., Kao, J.C., Ryu, S.I., et al.: A high-performance neural prosthesis enabled by control algorithm design. Nature Neuroscience 15(12), 1752–1757 (2012)
[8] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023)
[9] Gu, A., Goel, K., Ré, C.: Efficiently modeling long sequences with structured state spaces. In: ICLR. OpenReview.net (2022)
[10] Heelan, C., Lee, J., O’Shea, R., Lynch, L., Brandman, D.M., Truccolo, W., Nurmikko, A.V.: Decoding speech from spike-based neural population recordings in secondary auditory cortex of non-human primates. Communications Biology 2(1), 1–12 (2019)
[11] Hochberg, L.R., Serruya, M.D., Friehs, G.M., Mukand, J.A., Saleh, M., Caplan, A.H., Branner, A., Chen, D., Penn, R.D., Donoghue, J.P.: Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature 442(7099), 164–171 (2006)
[12] Ivgi, M., Carmon, Y., Berant, J.: Scaling laws under the microscope: Predicting transformer performance from small scale experiments. In: Goldberg, Y., Kozareva, Z., Zhang, Y. (eds.) Findings of the Association for Computational Linguistics: EMNLP. pp. 7354–7371. Association for Computational Linguistics (2022)
[13] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. CoRR 2001.08361 (2020)
[14] Keshtkaran, M.R., Sedler, A.R., Chowdhury, R.H., Tandon, R., Basrai, D., Nguyen, S.L., Sohn, H., Jazayeri, M., Miller, L.E., Pandarinath, C.: A large-scale neural network training framework for generalized estimation of single-trial population dynamics. Nature Methods 19(12), 1572–1577 (2022)
[15] Lazarou, I., Nikolopoulos, S., Petrantonakis, P.C., Kompatsiaris, I., Tsolaki, M.: EEG-based brain–computer interfaces for communication and rehabilitation of people with motor impairment: a novel approach of the 21st century. Frontiers in Human Neuroscience 12, 14 (2018)
[16] Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Liu, Y.: Vmamba: Visual state space model. CoRR 2401.10166 (2024)
[17] Makin, J.G., O’Doherty, J.E., Cardoso, M.M., Sabes, P.N.: Superior arm-movement decoding from cortex with a new, unsupervised-learning algorithm. Journal of Neural Engineering 15(2), 026010 (2018)
[18] Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Derczynski, L., Du, X., Grella, M., Gv, K., He, X., Hou, H., Kazienko, P., Kocon, J., Kong, J., Koptyra, B., Lau, H., Lin, J., Mantri, K.S.I., Mom, F., Saito, A., Song, G., Tang, X., Wind, J.S., Wozniak, S., Zhang, Z., Zhou, Q., Zhu, J., Zhu, R.: RWKV: reinventing rnns for the transformer era. In: Findings of the Association for Computational Linguistics: EMNLP. pp. 14048–14077. Association for Computational Linguistics (2023)
[19] Rapeaux, A.B., Constandinou, T.G.: Implantable brain machine interfaces: first-in-human studies, technology challenges and trends. Current Opinion in Biotechnology 72, 102–111 (2021)
[20] Roy, Y., Banville, H.J., Albuquerque, I., Gramfort, A., Falk, T.H., Faubert, J.: Deep learning-based electroencephalography analysis: a systematic review. CoRR 1901.05498 (2019)
[21] Stanslaski, S., Afshar, P., Cong, P., Giftakis, J., Stypulkowski, P., Carlson, D., Linde, D., Ullestad, D., Avestruz, A.T., Denison, T.: Design and validation of a fully implantable, chronic, closed-loop neuromodulation device with concurrent sensing and stimulation. IEEE Transactions on Neural Systems and Rehabilitation Engineering 20(4), 410–421 (2012)
[22] Stanslaski, S., Herron, J., Chouinard, T., Bourget, D., Isaacson, B., Kremen, V., Opri, E., Drew, W., Brinkmann, B.H., Gunduz, A., Adamski, T., Worrell, G.A., Denison, T.: A chronically implantable neural coprocessor for investigating the treatment of neurological disorders. IEEE Transactions on Biomedical Circuits and Systems 12(6), 1230–1245 (2018)
[23] Sussillo, D., Nuyujukian, P., Fan, J.M., Kao, J.C., Stavisky, S.D., Ryu, S., Shenoy, K.: A recurrent neural network for closed-loop intracortical brain–machine interface decoders. Journal of Neural Engineering 9(2), 026027 (2012)
[24] Taghizadeh-Sarabi, M., Daliri, M.R., Niksirat, K.S.: Decoding objects of basic categories from electroencephalographic signals using wavelet transform and support vector machines. Brain Topography 28(1), 33–46 (2015)
[25] Todorova, S., Sadtler, P., Batista, A., Chase, S., Ventura, V.: To sort or not to sort: the impact of spike-sorting on neural decoding performance. Journal of Neural Engineering 11(5), 056005 (2014)
[26] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems. pp. 5998–6008 (2017)
[27] Willett, F.R., Avansino, D.T., Hochberg, L.R., Henderson, J.M., Shenoy, K.V.: High-performance brain-to-text communication via handwriting. Nature 593(7858), 249–254 (2021)
[28] Wilson, G.H., Stavisky, S.D., Willett, F.R., Avansino, D.T., Kelemen, J.N., Hochberg, L.R., Henderson, J.M., Druckmann, S., Shenoy, K.V.: Decoding spoken English from intracortical electrode arrays in dorsal precentral gyrus. Journal of Neural Engineering 17(6), 066007 (2020)
[29] Wu, W., Black, M., Gao, Y., Serruya, M., Shaikhouni, A., Donoghue, J., Bienenstock, E.: Neural decoding of cursor motion using a Kalman filter. Advances in Neural Information Processing Systems 15 (2002)
[30] Wu, W., Hatsopoulos, N.G.: Real-time decoding of nonstationary neural activity in motor cortex. IEEE Transactions on Neural Systems and Rehabilitation Engineering 16(3), 213–222 (2008)
[31] Xu, H., Han, Y., Han, X., Xu, J., Lin, S., Cheung, R.C.: Unsupervised and real-time spike sorting chip for neural signal processing in hippocampal prosthesis. Journal of Neuroscience Methods 311, 111–121 (2019)
[32] Ye, J., Pandarinath, C.: Representation learning for neural population activity with neural data transformers. arXiv preprint arXiv:2108.01210 (2021)
[33] Zhang, Z., Constandinou, T.G.: Firing-rate-modulated spike detection and neural decoding co-design. Journal of Neural Engineering 20(3), 036003 (2023)