1 Southern University of Science and Technology, Shenzhen 518055, China. Email: [email protected]
2 Advanced Computing and Storage Lab, Huawei Technologies Co., Ltd., Shenzhen 518055, China. Email: [email protected]

Benchmarking Neural Decoding Backbones towards Enhanced On-edge iBCI Applications

Zhou Zhou 1,*, Guohang He 1,*, Zheng Zhang 1,2,†, Luziwei Leng 2, Qinghai Guo 2, Jianxing Liao 2, Xuan Song 1, Ran Cheng 1,†

* Zhou Zhou and Guohang He contribute equally to this work.
† Corresponding authors: Zheng Zhang and Ran Cheng.

Abstract

Traditional invasive Brain-Computer Interfaces (iBCIs) typically depend on neural decoding processes conducted on workstations within laboratory settings, which prevents their everyday usage. Implementing these decoding processes on edge devices, such as wearables, introduces considerable challenges related to computational demands, processing speed, and maintaining accuracy. This study seeks to identify an optimal neural decoding backbone that offers robust performance and fast inference suitable for edge deployment. We executed a series of neural decoding experiments involving nonhuman primates engaged in random reaching tasks, evaluating four prospective models, Gated Recurrent Unit (GRU), Transformer, Receptance Weighted Key Value (RWKV), and Selective State Space model (Mamba), across several metrics: single-session decoding, multi-session decoding, new session fine-tuning, inference speed, calibration speed, and scalability. The findings indicate that although the GRU model delivers sufficient accuracy, the RWKV and Mamba models are preferable due to their superior inference and calibration speeds. Additionally, RWKV and Mamba comply with the scaling law, demonstrating improved performance with larger data sets and increased model sizes, whereas GRU shows less pronounced scalability, and the Transformer model requires computational resources that scale prohibitively. This paper presents a thorough comparative analysis of the four models in various scenarios. The results help pinpoint a backbone that can handle increasing data volumes and is viable for edge implementation, and they provide insights for ongoing research and practical applications in the field.

Keywords:
Neural decoding · Brain-computer interfaces · Deep neural networks.

1 Introduction

Advancements in invasive Brain-Computer Interfaces (iBCIs) have demonstrated promising results across various applications, including speech decoding [10, 27, 28], prosthesis control [7, 31], and rehabilitation of neurological disorders [5, 15, 21, 22]. Accurately decoding brain activity is crucial for the success of these applications. Previous efforts have focused on employing adaptive filters such as Kalman Filters [7, 29, 30] or traditional machine learning models such as Recurrent Neural Networks (RNNs) [23, 27]. However, with the expansion of the available neural data, significant progress has been made using Transformer-based architectures. Models such as the Neural Data Transformer (NDT1) [32] leverage multi-session, multi-task and multi-subject neural data, yielding improved decoding performance and enhanced generalization to unseen data.

Limitations still exist among these methods. Despite the advantages of RNNs in handling long-term dependencies, their inherent serial dependency significantly slows the model's inference [32]. Meanwhile, it remains unclear whether scaling up the GRU model size with data volume improves neural decoding accuracy. Transformers facilitate parallel computation and adhere to the scaling laws [12], but increasing the model size and sequence length leads to quadratic growth in model complexity ($O(n^2)$), demanding a dramatic escalation in computational resources that makes such models difficult to fit on edge devices for portable BCI applications in daily use.

Models such as Receptance Weighted Key Value model (RWKV) [18] and Selective State Space model (Mamba) [8] have been designed utilizing linear attention mechanisms that offer reduced temporal and spatial complexity compared to traditional transformers. These models have demonstrated competitive performance in natural language processing and computer vision tasks [8, 16, 18], but it remains unclear which model is most suitable as the backbone for neural decoding.

This paper investigates whether recent advancements in model architectures can enhance neural decoding. Instead of benchmarking against state-of-the-art (SoTA) architectures, we compare the RWKV and Mamba models with the GRU and Transformer models in terms of computational efficiency and decoding accuracy. We have designed a series of experiments to assess various parameters: decoding accuracy, adaptiveness to new sessions, inference time, and scalability trends on model size, to identify an optimal neural decoding backbone. To the best of our knowledge, this work might be the first effort to investigate linear attention mechanisms in neural decoding, targeting fast and low-power inference on edge devices.

2 Related Work

2.1 Neural Decoding

Neural decoding has traditionally relied on adaptive filters or classical machine learning methods such as Kalman Filters [7, 29, 30], Wiener Filters [11] or SVMs [24]. However, with the advent of deep learning, particularly the emergence of large-scale models, there has been a significant shift in neural decoding approaches. Deep learning models facilitate automated feature learning, reducing the impact of subjective factors and greatly improving decoding accuracy and efficiency. Recurrent neural networks and Transformers have now found more applications in neural decoding tasks [20]. Contemporary applications of brain decoding technologies extend to medical rehabilitation, assistive communication, and human-computer interaction [19].

2.2 RWKV

The Transformer has precipitated a disruptive revolution, particularly due to the widespread application of its attention mechanism across multiple domains. However, a significant issue arises as the memory and computational complexity of the Transformer grows quadratically with increasing sequence length. Conversely, RNNs exhibit linear growth in memory and computational demands but are significantly outperformed by Transformers due to limitations in parallelization and scalability [18]. To address this challenge, Peng et al. proposed RWKV, which combines the efficient parallel training of Transformers with the efficient inference of RNNs while performing comparably to similarly sized Transformers, underscoring its potential and effectiveness in handling large-scale sequence data [2].

2.3 Mamba

The state space model is a mathematical framework used to describe the evolution of systems over time. It employs state vectors to represent the current state of the system and uses state transition equations and observation equations to correlate the changes between system states and the relationship with observed data [9]. Mamba is an enhanced approach based on the structured state space model S4, integrating the recurrent structure of recurrent neural networks and the parallel characteristics of convolution neural networks. This approach excels in capturing long-term dependencies in sequential data and facilitates efficient parallel computation. By combining structured state space models with deep learning techniques, Mamba can handle sequential data more effectively, exhibiting higher modeling capability and predictive performance. Mamba has demonstrated superior performance in various domains, including language modeling, DNA sequence modeling, audio modeling and generation [8].

3 Methods

Figure 1: Raw neural signals were recorded from the primary motor cortex (M1) area of a monkey using a 96-channel Utah microelectrode array during random reach tasks. Spike activity detected from these neural signals was binned temporally across the 96 channels. The resulting matrix of spike counts served as inputs for various methods after normalization and smoothing, and the outputs were the predicted finger velocities along the x and y axes. Experiments conducted under different scenarios facilitated comparisons of predictive accuracy, inference speed, and scalability among the four types of backbone models.

The system architecture is shown in Fig. 1. The raw neural recordings from the Utah array are processed into spike count bins and decoded using GRU, Transformer, RWKV and Mamba as four different backbones. The decoded output is compared with the ground-truth motion. The detailed workflow is given below.

3.1 Data Processing

3.1.1 Datasets

The dataset from [17] is used in this study. It includes a rich collection of neural and behavioral data recorded from nonhuman primates engaged in a random target reaching task. This task requires the subject to control a ticker to move the computer cursor and reach a series of randomly distributed targets displayed on screen in succession. During the execution of the task, neural activity from the primary motor cortex (M1) and primary somatosensory cortex (S1) is collected using Utah arrays, and the subject’s hand kinematic trajectories are recorded using motion tracking systems.

The neural recordings in this dataset consist of extracellular spike recordings, with the event times of threshold crossings sorted into discrete units. The recordings collected from subject Indy, 30 sessions in total, are used in this study. The kinematic measurements contain the x and y coordinates of the subject’s fingertip and cursor position as it reaches out, as well as the x and y coordinates of the set targets, both sampled at a frequency of 250 Hz.

3.1.2 Data processing

In this study, we only used recordings collected from the M1 cortex. We partitioned each session of the recorded task into multiple temporal bins with a duration of 10 ms. Because the kinematic data are sampled at 250 Hz, their sampling frequency is increased to 1000 Hz using linear interpolation. Within these bins, we counted the number of spike events (threshold crossings) for each neural recording channel, thereby capturing the discrete neural firing patterns over time. It is worth noting that we use the unsorted spike events, known as multi-unit activity. In practice, spike sorting can require too much computation for on-chip processing, while using the sorted single-unit activity brings only a limited improvement in decoding accuracy, as shown in [25].

The cursor’s velocity is used to characterize the kinematics of the reaching movement. The binned spike events and cursor velocity were temporally aligned, normalized and smoothed with a Gaussian smoothing operation, which attenuates high-frequency noise and elucidates the underlying signal trends, following [14].

In the experiments, the input to the model is denoted as $Spk \in \mathbb{R}^{S \times C}$, where $S$ represents the number of timesteps used for each prediction and $C$ denotes the number of channels. The ground truth is denoted as $Vel \in \mathbb{R}^{S \times 2}$, which represents the finger velocity along the x and y axes at each timestep.
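The preprocessing steps described above (binning threshold crossings, upsampling the 250 Hz kinematics by linear interpolation, and Gaussian smoothing) can be sketched in plain Python as follows. This is only an illustrative sketch, not the authors' pipeline: the function names, the kernel width `sigma_bins`, the kernel `radius` and the edge handling are all assumptions, since the text does not specify them.

```python
import math

def bin_spikes(spike_times, t_start, t_end, bin_width):
    """Count threshold crossings per time bin for a single channel."""
    n_bins = int(round((t_end - t_start) / bin_width))
    counts = [0] * n_bins
    for t in spike_times:
        idx = int((t - t_start) / bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

def upsample_linear(samples, factor):
    """Linearly interpolate between consecutive samples (e.g. factor=4 for 250 Hz -> 1000 Hz)."""
    out = []
    for a, b in zip(samples, samples[1:]):
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    out.append(samples[-1])
    return out

def gaussian_kernel(sigma_bins, radius):
    """Normalized Gaussian weights; sigma_bins and radius are hypothetical choices."""
    weights = [math.exp(-0.5 * (k / sigma_bins) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, kernel):
    """Convolve a 1-D signal with the kernel, replicating the edge values."""
    radius = len(kernel) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)
            acc += w * signal[j]
        out.append(acc)
    return out

# Example: 10 ms bins over a 40 ms window for one channel.
counts = bin_spikes([0.001, 0.004, 0.015, 0.039], 0.0, 0.04, 0.01)
smoothed = smooth(counts, gaussian_kernel(sigma_bins=2.0, radius=6))
```

Applying `bin_spikes` and `smooth` to every channel and stacking the results column-wise would yield a matrix with the $S \times C$ layout described above.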

3.2 Backbone Models

3.2.1 GRU

Proposed by Cho et al. [4], GRU is a variant of the Recurrent Neural Network (RNN), specifically designed to address the challenges of gradient explosion and gradient vanishing in training. GRU achieves this by employing update gates and reset gates, which selectively update useful information and capture long-term dependencies within time series data. GRU can be characterized by the following formulations [6]:

$$h_t^j = (1 - z_t^j) \odot h_{t-1}^j + z_t^j \odot \tilde{h}_t^j \tag{1}$$
$$z_t^j = \sigma(W_z x_t + U_z h_{t-1}^j) \tag{2}$$
$$\tilde{h}_t^j = \tanh(W x_t + U(r_t \odot h_{t-1}^j)) \tag{3}$$
$$r_t^j = \sigma(W_r x_t + U_r h_{t-1}^j) \tag{4}$$

The reset gate ($r$) is a gating mechanism that modulates the flow of information from the previous activation, allowing the model to discard irrelevant past state information, thus mitigating the vanishing gradient problem. The update gate ($z$) determines the extent to which the unit updates its activation, or hidden state ($h$). It controls the degree of information transfer from the previous state to the current state, enabling the model to capture long-term dependencies. The activation ($h$), commonly referred to as the hidden state, captures the learned information at the current time step and is recursively influenced by past activation. In our work, we employ hidden size $d_h = 256$. The candidate activation ($\tilde{h}$) is a proposed update to the hidden state, which incorporates new input while being modulated by the reset gate to potentially discard the irrelevant previous state.
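As an illustration of Eqs. 1 to 4, a single GRU update can be written in a few lines of plain Python. This is a minimal sketch under stated assumptions: the helper names and the dictionary layout of the weights are hypothetical, bias terms are omitted because the equations above omit them, and a practical implementation would rely on an optimized library.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update; p holds the weight matrices W_z, U_z, W_r, U_r, W, U."""
    # Eq. 2: update gate
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev))]
    # Eq. 4: reset gate
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev))]
    # Eq. 3: candidate activation, computed from the reset-gated previous state
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b) for a, b in zip(matvec(p["W"], x_t), matvec(p["U"], gated))]
    # Eq. 1: interpolate between the previous state and the candidate
    return [(1.0 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Toy sizes (2 inputs, 2 hidden units); the paper uses a hidden size of 256.
params = {"W_z": zeros(2, 2), "U_z": zeros(2, 2), "W_r": zeros(2, 2),
          "U_r": zeros(2, 2), "W": zeros(2, 2), "U": zeros(2, 2)}
h_next = gru_step([1.0, 2.0], [0.4, -0.2], params)
```

With all-zero weights both gates evaluate to 0.5 and the candidate activation to 0, so the new state is simply half of the previous one, which makes a convenient sanity check. Because each step needs `h_prev`, the loop over timesteps cannot be parallelized, which is the serial dependency discussed in the introduction.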

3.2.2 Transformer

The foundational mechanism of the Transformer is its self-attention mechanism, which enables the model to dynamically adjust the weighting of input data, such as tokens or sequence elements, based on their contextual relevance [26]. Unlike GRUs, which process data sequentially, Transformers handle input in parallel during the training phase, significantly expediting the training process.

We employ the classic Multihead Scaled Dot-Product Attention mechanism along with an encoder-decoder architecture. Traditional approaches pass vocabulary tokens through an embedding layer to obtain feature vectors. Instead, we directly take the spike matrix $x \in \mathbb{R}^{S \times C}$ as the input for both the encoder and decoder, treat its channel dimension $C$ as the feature dimension, and project it to the model hidden dimension following Eq. 5. This projection is also used in the RWKV and Mamba models.

$$A = f(x) = Wx + b \tag{5}$$
$$B = E[\text{positions}] \tag{6}$$
$$input = \text{Dropout}(A + B) \tag{7}$$
$$output = \text{Decoder}(\text{Encoder}(input), input) \tag{8}$$

The function $f$ represents a linear mapping layer, where $W$ and $b$ denote the weights and biases of the input layer, respectively. $E$ corresponds to the positional embedding matrix, from which an embedding vector is selected for each positional index. Here, $input \in \mathbb{R}^{S \times d_{\text{model}}}$ is the input to the encoder and decoder, and $output \in \mathbb{R}^{S \times 2}$ is the predicted velocity along the x and y axes. In the encoder and decoder, attention is implemented as follows:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{9}$$
$$\begin{aligned} \text{MultiHead}(Q, K, V) &= \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \\ \text{where head}_i &= \text{Attention}(Q', K', V') \end{aligned} \tag{10}$$

Here, $Q, K, V$ are calculated following Eq. 5 with independent weights and zero bias, the parameter matrix $W^O \in \mathbb{R}^{hC \times d_{\text{model}}}$, and $Q' \in \mathbb{R}^{S \times h \times d_q}$, $K' \in \mathbb{R}^{S \times h \times d_k}$, $V' \in \mathbb{R}^{S \times h \times d_v}$. We employ $h = 2$ heads, $d_q = d_k = d_v = \frac{d_{\text{model}}}{h}$ and $d_{\text{model}} = 128$.

Given the limited variance in input data patterns, the data is processed through two separate attention heads. The system comprises three layers each of encoders and decoders, culminating in the prediction of velocities in the x and y axes.
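To make Eqs. 9 and 10 concrete, the sketch below implements scaled dot-product attention and a simplified multi-head wrapper in plain Python. It is an illustration under stated assumptions rather than the model used in the paper: the learned projections that produce $Q$, $K$, $V$ and the output matrix $W^O$ are omitted (treated as identity), and the heads are formed by slicing the feature dimension into $h$ equal parts.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Eq. 9: softmax(Q K^T / sqrt(d_k)) V, with matrices as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

def multi_head(Q, K, V, h):
    """Eq. 10 without learned projections: attend per head, then concatenate."""
    d_head = len(Q[0]) // h
    heads = []
    for i in range(h):
        cols = slice(i * d_head, (i + 1) * d_head)
        heads.append(attention([q[cols] for q in Q],
                               [k[cols] for k in K],
                               [v[cols] for v in V]))
    # Concatenate the per-head outputs along the feature dimension.
    return [sum((head[t] for head in heads), []) for t in range(len(Q))]

# One query attending over two timesteps with h = 2 heads and a toy width of 4.
Q = [[0.0, 0.0, 0.0, 0.0]]
K = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
V = [[2.0, 0.0, 6.0, 0.0], [0.0, 4.0, 0.0, 8.0]]
mixed = multi_head(Q, K, V, h=2)
```

The score matrix has one entry per pair of timesteps, so both compute and memory grow quadratically with the sequence length $S$; this is the cost that the linear-attention backbones discussed next avoid.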

3.2.3 RWKV

Unlike most RNNs, RWKV is a recurrent model that combines the efficient parallelizable training of Transformers with the fast inference of RNNs. RWKV reformulates the attention mechanism with a variant of linear attention, replacing traditional dot-product token interaction with more effective channel-directed attention [18]. It mitigates the memory bottleneck and quadratic scaling issues inherent in Transformers through efficient linear scaling. It also preserves the ability for parallelized training and ensures robust scalability.

$$r_t = W_r(\mu_r \odot x_t + (1 - \mu_r) \odot x_{t-1}) \tag{11}$$
$$k_t = W_k(\mu_k \odot x_t + (1 - \mu_k) \odot x_{t-1}) \tag{12}$$
$$v_t = W_v(\mu_v \odot x_t + (1 - \mu_v) \odot x_{t-1}) \tag{13}$$

R encodes historical information, activated via a Sigmoid function and incorporating a forgetting mechanism. W signifies the positional weight decay vector, a trainable parameter within the model. The terms K and V function analogously to the key and value in Transformer architectures. Distinct from traditional models where $x$ is simply the embedding of the current token, in RWKV, $x$ is calculated as the weighted sum of the embeddings of the current token and the previous token.

$$wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} \odot v_i + e^{u + k_t} \odot v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}} \tag{14}$$

Equation 14 functions similarly to an attention mechanism, representing position $t$ as a learnable weighted sum of past content. In RWKV, $w$ is treated as a channel-wise vector that adjusts according to the relative position, requiring the training of only a single parameter vector $w$. $u$ is designated for individual processing of the current token’s position, serving to circumvent any potential degradation of $w$.
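The reason Eq. 14 permits fast inference is that its two sums can be carried forward as running state instead of being recomputed at every step. The sketch below, for a single channel with scalar $w$ and $u$, compares a direct evaluation of Eq. 14 with an equivalent recurrent form. It is a simplified illustration: the function names are hypothetical, and practical implementations additionally rescale the exponentials for numerical stability, which is omitted here.

```python
import math

def wkv_direct(k, v, w, u):
    """Evaluate Eq. 14 literally for one channel; costs O(t) work at step t."""
    out = []
    for t in range(len(k)):  # 0-based index; corresponds to position t + 1 in Eq. 14
        bonus = math.exp(u + k[t])
        num = bonus * v[t]
        den = bonus
        for i in range(t):  # contributions of past tokens, decayed by relative distance
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out

def wkv_recurrent(k, v, w, u):
    """Same quantity with two running sums; O(1) work and memory per step."""
    a = 0.0  # running numerator over past tokens
    b = 0.0  # running denominator over past tokens
    out = []
    for k_t, v_t in zip(k, v):
        bonus = math.exp(u + k_t)
        out.append((a + bonus * v_t) / (b + bonus))
        # Decay the past by e^{-w} and absorb the current token for the next step.
        a = math.exp(-w) * a + math.exp(k_t) * v_t
        b = math.exp(-w) * b + math.exp(k_t)
    return out

keys = [0.1, -0.3, 0.5, 0.2]
values = [1.0, 2.0, -1.0, 0.5]
direct = wkv_direct(keys, values, w=0.7, u=0.3)
recurrent = wkv_recurrent(keys, values, w=0.7, u=0.3)
```

Both functions return the same sequence up to floating-point error, but the recurrent version keeps only two numbers of state per channel, which is what makes per-timestep inference cost independent of the history length.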

3.2.4 Mamba

In contrast to the quadratic scaling observed with traditional models, Mamba demonstrates a throughput up to five times faster than the Transformer and exhibits linear scaling with sequence length [8]. Unlike RNNs, which compress all information into a hidden space and struggle with long-term memory issues, Mamba introduces a selective state-space model. This model offers the benefits of a linear recurrent network, enhanced by mechanisms for rapid training and effective context retention. Improvements in Mamba’s Structured State Spaces (SSM) include a selection mechanism that filters out irrelevant information while enabling indefinite memory retention, and a hardware-aware algorithm optimized for GPU memory layouts to facilitate hardware acceleration. This ensures efficient computation cycling without extending the state unnecessarily, thus enhancing performance.

The state space model (SSM) underlying Mamba is described by the following two equations:

$$x_t = f(x_{t-1}, u_t, w_t) \tag{15}$$
$$y_t = h(x_t, v_t) \tag{16}$$

Equation 15 represents the state transition equation, describing how the system state evolves over time. Here, $x_t$ denotes the system state at time step $t$, $u_t$ represents the control input, $w_t$ is the process noise, and $f$ is the state transition function. Equation 16 is the observation equation, where $y_t$ represents the observation data at time step $t$, $v_t$ denotes the observation noise, and $h$ is the observation function. The concept of selectivity in Mamba allows the model to selectively remember or forget information at each time step.
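A toy special case of Eqs. 15 and 16 may help illustrate what selectivity means. The sketch below assumes a scalar state, linear $f$ and $h$, and zero noise ($w_t = v_t = 0$), and lets the transition and input coefficients depend on the current input. The functions `a_fn` and `b_fn` are hypothetical stand-ins for illustration only; they are not Mamba's actual parameterization, which uses learned, discretized matrices.

```python
def ssm_scan(inputs, a_fn, b_fn, c):
    """Run a scalar, noise-free linear state space recurrence over a sequence."""
    x = 0.0
    outputs = []
    for u in inputs:
        # Eq. 15 with linear f and w_t = 0; the coefficients may depend on the input u.
        x = a_fn(u) * x + b_fn(u) * u
        # Eq. 16 with linear h and v_t = 0.
        outputs.append(c * x)
    return outputs

# Fixed coefficients: a classical (non-selective) linear SSM.
fixed = ssm_scan([1.0, 1.0, 1.0], a_fn=lambda u: 0.5, b_fn=lambda u: 1.0, c=2.0)

# Input-dependent coefficients: hold the state when the input is zero,
# overwrite it when a non-zero input arrives.
selective = ssm_scan([3.0, 0.0, 0.0, 5.0, 0.0],
                     a_fn=lambda u: 1.0 if u == 0.0 else 0.0,
                     b_fn=lambda u: 1.0,
                     c=1.0)
```

In the second call the state is retained across the zero inputs and replaced when a new non-zero input arrives, a crude picture of selectively remembering or forgetting. In both cases the cost per step is constant, so the total cost is linear in the sequence length.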

4 Experiments and Key Results

4.1 Experiment settings

To evaluate the capabilities of different backbone models across various dimensions, four distinct experiments were established: single-session, multi-session, new session finetuning, and scaling experiments (with the number of timesteps set to 128, 1024, 128 and 1024, respectively). A total of 30 sessions, collected over different days from the same subject, were used. All neural recordings from these 30 days were divided into training and testing datasets with an 8:2 ratio, consistently applied across all experiments.

Single-Session Experiment: This experiment assessed the ability of the backbone models to perform effectively on small datasets. Each of the four models was trained independently on data from individual sessions, with recording lengths varying from 360 s to 3363 s.

Multi-Session Experiment: This experiment focused on the models’ capacity to extract deep latent representations from neural recordings whose input features shift over time. A unified model was trained using training sets from all sessions. Over time, the quality of the recordings degraded due to scar tissue encapsulation around the implants, leading to increased noise levels and a decrease in detected neural firing rates from over 20 Hz to below 10 Hz. Additionally, the neurons observed on different channels changed over time. Various training strategies were explored to help models adapt to these shifting input features.

New Session Finetuning Experiment: This experiment tested the models’ ability to generalize and adapt to unseen data. Models were initially trained with datasets from the first 25 days, and then incrementally finetuned using datasets from the last five days (10 seconds per iteration). This setup mirrors practical scenarios for BCI calibration on new days, where a shorter calibration time is often critical. The aim was to identify the model that could quickly return to acceptable performance levels, making it more suitable for real-world use outside the laboratory.

Scaling Experiment: This experiment investigated whether increased model size could enhance performance. The scaling law has been a key principle in designing large language models [13], but its applicability in neural decoding remains unexplored.

Table 1: Parameter counts and hyperparameters of models

| Model | Parameters | Epochs, Single (Multi) | Layers | Embedding Size |
|---|---|---|---|---|
| GRU | 272k | 30 (50) | 1 | 256 |
| Transformer | 316k | 50 (50) | 3 | 128 |
| RWKV | 294k | 30 (50) | 2 | 88 |
| Mamba | 306k | 30 (50) | 2 | 144 |

The same hyperparameter settings are used for all experiments except the scaling experiment, with details on the parameter counts and hyperparameters presented in Table 1. The requirement for the Transformer model to undergo 50 epochs in the single-session setting may be attributed to its attention mechanism, which necessitates numerous iterations to effectively optimize attention weights. Additionally, the design of the Transformer, which processes entire sequences simultaneously, may contribute to slower convergence rates during training [26].
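As a rough plausibility check on the parameter budget in Table 1, the GRU count can be estimated by hand. The calculation below is a sketch under assumptions that the text does not state explicitly: 96 input channels (one per electrode of the Utah array in Fig. 1), a single GRU layer with hidden size 256 that keeps two bias vectors per gate (as PyTorch's `nn.GRU` does), and a linear readout to the two velocity outputs. The authors' exact layer layout may differ.

```python
def gru_parameter_count(input_size, hidden_size, output_size):
    """Count parameters of a one-layer GRU followed by a linear readout."""
    gates = 3  # update gate, reset gate, candidate activation
    input_weights = gates * hidden_size * input_size
    recurrent_weights = gates * hidden_size * hidden_size
    biases = 2 * gates * hidden_size  # assumed: separate input and recurrent biases
    readout = hidden_size * output_size + output_size
    return input_weights + recurrent_weights + biases + readout

total = gru_parameter_count(input_size=96, hidden_size=256, output_size=2)
print(total)  # 272386
```

Under these assumptions the total is 272,386, which is in line with the 272k listed for the GRU in Table 1. The recurrent weights (3 × 256 × 256) dominate the count, so the hidden size is the main lever when matching parameter budgets across backbones.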

The coefficient of determination ($R^2$) is used to evaluate the neural decoding performance following [1, 33]. $R^2$ typically ranges from 0 to 1: a value of 0 indicates that the model fails to explain any variance in the dependent variable, while a value of 1 indicates a perfect fit of the model to the data. The formula for calculating $R^2$ is as follows:

$$RSS = \sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2} \tag{17}$$
$$TSS = \sum_{i=1}^{n}(y_{i}-\bar{y})^{2} \tag{18}$$
$$R^{2} = 1-\frac{RSS}{TSS} \tag{19}$$

where $RSS$ is the residual sum of squares (the sum of the squared differences between the actual values $y_{i}$ and the predicted values $\hat{y}_{i}$), and $TSS$ is the total sum of squares (the sum of the squared differences between the actual values and the mean of the observed values $\bar{y}$). Table 2 summarizes the evaluation results of the four models in the different experiments.
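As a concrete reading of Eqs. (17)-(19), the following minimal sketch computes the score for a single one-dimensional target with NumPy. The function name `r2_score` and the toy values are illustrative assumptions; how the score is aggregated across output dimensions (for example, x and y velocity) is not specified here, so the sketch does not attempt it.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination for a 1-D target, following Eqs. (17)-(19)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares, Eq. (17)
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares, Eq. (18)
    return 1.0 - rss / tss                       # Eq. (19)

# Toy values (hypothetical, not taken from the dataset).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
score = r2_score(y_true, y_pred)
```

Note that the value can fall below 0 when the predictions fit worse than a constant prediction at the mean, which is why the range 0 to 1 is described as typical rather than guaranteed.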

4.2 Single-session experiment

As shown in Table 2, the RWKV model achieves the highest score in the single-session experiment, narrowly surpassing the GRU model by 0.002 in R2 (0.717 versus 0.715). However, both the Mamba and Transformer models score below 0.7, indicating that these models are less effective when dataset sizes are limited.

The inference time for processing 1280 ms of neural data (1 batch), detailed in Table 2, varies among the models. The GRU model requires the longest processing time due to its sequential processing nature. The Transformer model also exhibits relatively long inference times due to its computationally intensive operations. In contrast, the RWKV and Mamba models demonstrate significant advantages in inference speed over both the GRU and Transformer models.

Specifically, RWKV, which is a recurrent neural network devoid of an attention mechanism, avoids the computational overhead associated with computing attention matrices. This model incorporates Token Shift and Channel Mix mechanisms to optimize position encoding and channel blending, thereby enhancing both efficiency and performance. On the other hand, Mamba achieves rapid inference and maintains linear scalability with sequence length through dynamic and selective retention or dismissal of information based on input. Its streamlined and homogeneous architecture, coupled with a selective state space, markedly boosts inference speed.

4.3 Multi-session experiment

In the multi-session experiment, we explored three different data partitioning strategies during training to identify the most effective approach for aiding models to learn as input features shifted. These strategies are as follows:

  • Random partitioning: Batches are randomly selected from random sessions to be fed into the model.
  • Sequential partitioning: Data batches are fed into the model in a sequential order, day by day.
  • Random session partitioning: Sessions are selected randomly, but within each selected session, data batches are fed sequentially.
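The three strategies differ only in the order in which (session, batch) pairs reach the model. The sketch below makes that ordering explicit; the function name `batch_order`, its arguments, and the strategy labels are hypothetical and are not taken from the authors' code.

```python
import random

def batch_order(batches_per_session, strategy, seed=0):
    """Return (session, batch) index pairs in the order they would be fed to the model.

    batches_per_session: list whose entry s is the number of batches in
    session s, with sessions listed chronologically (assumed layout).
    """
    rng = random.Random(seed)
    chronological = [(s, b)
                     for s, n in enumerate(batches_per_session)
                     for b in range(n)]
    if strategy == "random":
        # Random partitioning: shuffle batches across all sessions.
        shuffled = list(chronological)
        rng.shuffle(shuffled)
        return shuffled
    if strategy == "sequential":
        # Sequential partitioning: day by day, batches in recorded order.
        return chronological
    if strategy == "random_session":
        # Random session partitioning: shuffle the sessions only,
        # keep the batches inside each session in recorded order.
        sessions = list(range(len(batches_per_session)))
        rng.shuffle(sessions)
        return [(s, b) for s in sessions for b in range(batches_per_session[s])]
    raise ValueError(f"unknown strategy: {strategy}")

# Three sessions with 2, 3 and 1 batches (toy sizes).
order = batch_order([2, 3, 1], "random_session", seed=1)
```

All three orderings visit every batch exactly once per epoch, so any difference in decoding accuracy comes from the ordering alone.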

The random training strategy results in significantly higher stability and decoding accuracy of the model compared to the other two strategies. Although the data are strongly time-correlated, this approach of random input enhances gradient diversity, reduces cyclic biases in data appearance, and helps prevent overfitting.

Sequential training resulted in limited improvement over the single-session experiment for both the GRU and RWKV models. Although these models can memorize historical information, sequential training may still lead to catastrophic forgetting, thereby only marginally enhancing performance compared to the single-session results. In contrast, the Mamba model demonstrated a significant improvement, nearly 0.1 increase in R2, over the single-session experiment. This suggests that Mamba’s selective state space mechanism is more effective at preserving useful information and handling long-term dependencies compared to the gating mechanisms of GRU or the RWKV model in neural decoding.

However, the random session training strategy failed to provide a diverse training gradient and the data order could not convey long-term dependencies, resulting in underfitting of the model.

Another observation during the multi-session training is the difficulty in achieving convergence with the Transformer model, which required careful tuning of its hyper-parameters. In contrast, the other models exhibited less sensitivity to training hyper-parameter settings.

Table 2: Experiment results for all models

| Experiment | Indicator | GRU | Transformer | RWKV | Mamba |
|---|---|---|---|---|---|
| Single-session | Average R2 | 0.715 | 0.633 | 0.717 | 0.660 |
| | Inference time/s | 0.941 | 0.822 | 0.303 | 0.434 |
| Multi-session | Random train | 0.838 | 0.720 | 0.812 | 0.810 |
| | Sequence train | 0.749 | 0.523 | 0.726 | 0.752 |
| | Random session | 0.560 | 0.314 | 0.600 | 0.556 |
| Fine-tuning | Average R2 | 0.773 | 0.748 | 0.763 | 0.756 |
| | Recovery time/s | 214 | - | 202 | 178 |
| | Zero shot | 0.4811 | 0.383 | 0.452 | 0.370 |
| Scaling | Max R2 | 0.846 | - | 0.843 | 0.851 |
| | Increment | 0.010 | - | 0.031 | 0.041 |

4.4 Fine-tuning the model on new sessions

As shown in Table 2, among the four models, GRU achieved the highest average R2 score over 5 days of fine-tuning on new sessions, reaching 0.773. The RWKV and Mamba models scored 0.010 and 0.017 lower, respectively, while the Transformer model recorded the lowest score at 0.748. Regarding zero-shot performance, only RWKV achieved an R2 of 0.7, and only in one session out of five. On average, none of the models achieved adequate zero-shot performance.

The results from the finetuning experiment indicate that all models are capable of surpassing their performance when trained solely on single-session data. This demonstrates that despite variations in firing rates and neuron-channel mappings over time, the models can distill useful information to enhance neural decoding. The quality of the base model significantly influences the effectiveness of the finetuned model. However, the backbone model alone does not provide zero-shot capability, suggesting that additional architectural designs or training strategies are necessary to enhance the models’ adaptability to input feature shifts and improve zero-shot performance.

In terms of the data length required to achieve an acceptable R2 score of 0.7 through fine-tuning, Mamba outperformed both RWKV and GRU. This superior performance likely stems from Mamba’s enhanced ability to resolve long-term dependencies, which facilitates its calibration to unseen data more effectively. Consequently, Mamba emerges as a more viable option for real-world deployment in practical BCI applications due to its robust adaptability.
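One way to read the recovery time is as the amount of calibration data consumed before the R2 score first reaches the 0.7 threshold, with each fine-tuning iteration adding 10 seconds of data. The sketch below follows that reading; the function name, the "first crossing" criterion, and the example curve are assumptions made for illustration, not the authors' exact procedure.

```python
def recovery_time(r2_per_iteration, threshold=0.7, seconds_per_iteration=10):
    """Seconds of calibration data used before R2 first reaches the threshold.

    r2_per_iteration: R2 measured after each fine-tuning iteration, in order.
    Returns None if the threshold is never reached.
    """
    for iteration, r2 in enumerate(r2_per_iteration, start=1):
        if r2 >= threshold:
            return iteration * seconds_per_iteration
    return None

# Hypothetical fine-tuning curve for one new session.
curve = [0.45, 0.58, 0.66, 0.71, 0.74]
seconds_needed = recovery_time(curve)  # threshold crossed at the 4th iteration
```

Averaging this quantity over the five held-out sessions would give a per-model figure of the kind reported in Table 2, assuming the same threshold is applied to every session.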

4.5 Scaling analysis

Figure 2: Scaling parameter counts for the models range from 300k to 3.8M with error

In multi-session training, the parameter count for the models we used is approximately 300k. To explore whether increasing the model size could enhance decoding performance, we examined the improvements achieved by increasing the number of layers in the GRU, RWKV and Mamba models (the Transformer fails to converge in many cases and is therefore excluded). The variation in decoding R2 scores as a function of the parameter count of these models, ranging from 300K to 3M, is illustrated in Fig. 2.

With an increase in model parameters, the R2 scores for RWKV and Mamba show significant improvement, reaching 0.843 and 0.851, respectively. This represents increases of 0.031 and 0.041 over their 300k-parameter models. In contrast, the GRU model demonstrates only a mild improvement of 0.01 when parameters are increased, and further scaling leads to a declining trend in performance. Despite its gate mechanisms to mitigate the vanishing gradient problem, GRU’s inherent sequential processing nature restricts its scalability and limits its efficiency in handling large-scale sequence tasks.

Conversely, RWKV and Mamba exhibit superior scalability and computational efficiency, outperforming GRU. This advantage is largely due to their innovative structural designs and optimization strategies that effectively address the limitations typically associated with recurrent neural networks and traditional Transformers.

While performance gains for RWKV and Mamba level off as model size increases, this plateau is mainly attributable to the limited size of the current dataset. However, with the rapid advancement of BCI technology and the anticipated increase in available data, it is reasonable to predict that RWKV or Mamba could serve as robust backbones for neural decoding in future applications.

5 Discussion

5.1 Suggestions on model selection

Each of the four models evaluated demonstrates distinct strengths and weaknesses. The GRU model achieves the second-highest prediction accuracy in the single-session experiment and the best prediction accuracy in the multi-session experiment on the dataset used in this work. However, its inference time and calibration recovery time are constrained by its inherent serial structure. In contrast, the RWKV and Mamba models have significantly faster inference and calibration recovery times. Additionally, both RWKV and Mamba adhere to the scaling law, demonstrating a gradual improvement in predictive accuracy as model sizes increase. Mamba eventually achieves an R2 of 0.851 when scaled up to 3M parameters, the highest score among all models across the different experiment settings. This score is also comparable to that of the SoTA neural decoding model POYO [3], which was trained on a much larger dataset and tested on the same task. The Transformer model, however, lags in nearly all performance metrics and was difficult to bring to convergence in our experiments.

Consequently, Mamba or RWKV could be suitable backbones for future neural decoding tasks, especially with an increasing amount of available neural recordings. Their scalability and linear computational complexity can significantly enhance decoding performance without the need for excessive computational resources, making them preferable for wearable devices used daily. For BCI applications, this choice can also lead to reduced training times, faster response times, and quicker calibration speeds. However, for studies involving a limited amount of data and those not sensitive to response times, RNN models like GRU or LSTM may suffice to provide high decoding performance in most use cases.

5.2 Limitation and future works

One significant challenge within the BCI field is achieving long-term stable neural decoding. Unfortunately, none of the four models can provide long-term stable decoding capabilities without finetuning, based on our experiments. While this work utilizes only one dataset, introducing a more diverse dataset could enable the model to learn a broader array of data features, thereby enhancing its robustness in practical applications.

The degradation in long-term decoding performance is primarily due to input feature shifting [33]. To manage potential data drift over prolonged periods, continuous or online learning strategies could be implemented, allowing the model to continually adapt to new data. From a computational-saving perspective, instead of full parameter updating, tuning only the input and output layers or employing some transfer learning strategies might better accommodate input variations with less computational overhead.

New training strategies can also be explored to guide the model in learning useful latent representations. By implementing a weighted loss scheme that prioritizes recent sessions chronologically, our preliminary results have already shown notably improved zero-shot outcomes.
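The exact form of the weighting is not given here. As one possible illustration, the sketch below assigns each sample a weight that decays exponentially with the age of its session and uses those weights in a mean squared error. The exponential form, the `decay` value, and both function names are assumptions for illustration only, not the scheme used in the preliminary results.

```python
import numpy as np

def recency_weights(session_ids, decay=0.9):
    """Per-sample weights: the latest session gets 1, each older session is scaled by `decay`.

    session_ids: integer session index of every sample, larger meaning more recent.
    The exponential form is an assumption; the text only says recent sessions are prioritized.
    """
    session_ids = np.asarray(session_ids)
    age = session_ids.max() - session_ids  # 0 for the most recent session
    return decay ** age

def weighted_mse(y_true, y_pred, weights):
    """Mean squared error in which each sample contributes in proportion to its weight."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    squared_error = (y_true - y_pred) ** 2
    return np.sum(weights * squared_error) / np.sum(weights)

# Three samples from sessions 0, 1 and 2 (toy values).
w = recency_weights([0, 1, 2], decay=0.5)
loss = weighted_mse([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], w)
```

With `decay=1.0` the weights are all equal and the loss reduces to the ordinary mean squared error, which gives a simple baseline for comparison.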

Additionally, the backbone models in this study were only trained on a random track task with one subject. The adaptation across different tasks and subjects also needs to be carefully evaluated in future studies.

6 Conclusion

This study has conducted a comprehensive comparison of GRU, Transformer, RWKV, and Mamba models in the context of neural decoding for random reach tasks. RWKV and Mamba, which demonstrate faster inference speeds, lower computational complexity, and better scalability than GRU and Transformer, emerge as preferred choices for deployment on wearable devices. This detailed evaluation of the strengths and weaknesses of each model not only highlights their individual capabilities but also establishes a robust foundation for future advances in model architecture. The insights gained from this work can guide the development of more efficient and effective neural decoding architectures, paving the way for enhanced performance in practical applications.

7 Acknowledgement

This work was supported in part by Guangdong Natural Science Funds for Distinguished Young Scholar under Grant 2024B1515020019.

References

  • [1] Ahmadi, N., Constandinou, T.G., Bouganis, C.S.: Robust and accurate decoding of hand kinematics from entire spiking activity using deep learning. Journal of Neural Engineering 18(2), 026011 (2021)
  • [2] Alam, M.M., Raff, E., Biderman, S., Oates, T., Holt, J.: Recasting self-attention with holographic reduced representations. In: ICML (2023)
  • [3] Azabou, M., Arora, V., Ganesh, V., Mao, X., Nachimuthu, S., Mendelson, M., Richards, B.A., Perich, M.G., Lajoie, G., Dyer, E.L.: A unified, scalable framework for neural population decoding. In: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems, NeurIPS (2023)
  • [4] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR (2015)
  • [5] Cheng, N., Phua, K.S., Lai, H.S., Tam, P.K., Tang, K.Y., Cheng, K.K., Yeow, R.C.H., Ang, K.K., Guan, C., Lim, J.H.: Brain-computer interface-based soft robotic glove rehabilitation for stroke. IEEE Transactions on Biomedical Engineering 67(12), 3339–3351 (2020)
  • [6] Chung, J., Gülçehre, Ç., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR 1412.3555 (2014)
  • [7] Gilja, V., Nuyujukian, P., Chestek, C.A., Cunningham, J.P., Yu, B.M., Fan, J.M., Churchland, M.M., Kaufman, M.T., Kao, J.C., Ryu, S.I., et al.: A high-performance neural prosthesis enabled by control algorithm design. Nature Neuroscience 15(12), 1752–1757 (2012)
  • [8] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023)
  • [9] Gu, A., Goel, K., Ré, C.: Efficiently modeling long sequences with structured state spaces. In: ICLR. OpenReview.net (2022)
  • [10] Heelan, C., Lee, J., O’Shea, R., Lynch, L., Brandman, D.M., Truccolo, W., Nurmikko, A.V.: Decoding speech from spike-based neural population recordings in secondary auditory cortex of non-human primates. Communications Biology 2(1), 1–12 (2019)
  • [11] Hochberg, L.R., Serruya, M.D., Friehs, G.M., Mukand, J.A., Saleh, M., Caplan, A.H., Branner, A., Chen, D., Penn, R.D., Donoghue, J.P.: Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature 442(7099), 164–171 (2006)
  • [12] Ivgi, M., Carmon, Y., Berant, J.: Scaling laws under the microscope: Predicting transformer performance from small scale experiments. In: Goldberg, Y., Kozareva, Z., Zhang, Y. (eds.) Findings of the Association for Computational Linguistics: EMNLP. pp. 7354–7371. Association for Computational Linguistics (2022)
  • [13] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. CoRR 2001.08361 (2020)
  • [14] Keshtkaran, M.R., Sedler, A.R., Chowdhury, R.H., Tandon, R., Basrai, D., Nguyen, S.L., Sohn, H., Jazayeri, M., Miller, L.E., Pandarinath, C.: A large-scale neural network training framework for generalized estimation of single-trial population dynamics. Nature Methods 19(12), 1572–1577 (2022)
  • [15] Lazarou, I., Nikolopoulos, S., Petrantonakis, P.C., Kompatsiaris, I., Tsolaki, M.: EEG-based brain–computer interfaces for communication and rehabilitation of people with motor impairment: a novel approach of the 21st century. Frontiers in Human Neuroscience 12, 14 (2018)
  • [16] Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Liu, Y.: Vmamba: Visual state space model. CoRR 2401.10166 (2024)
  • [17] Makin, J.G., O’Doherty, J.E., Cardoso, M.M., Sabes, P.N.: Superior arm-movement decoding from cortex with a new, unsupervised-learning algorithm. Journal of Neural Engineering 15(2), 026010 (2018)
  • [18] Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Derczynski, L., Du, X., Grella, M., Gv, K., He, X., Hou, H., Kazienko, P., Kocon, J., Kong, J., Koptyra, B., Lau, H., Lin, J., Mantri, K.S.I., Mom, F., Saito, A., Song, G., Tang, X., Wind, J.S., Wozniak, S., Zhang, Z., Zhou, Q., Zhu, J., Zhu, R.: RWKV: reinventing rnns for the transformer era. In: Findings of the Association for Computational Linguistics: EMNLP. pp. 14048–14077. Association for Computational Linguistics (2023)
  • [19] Rapeaux, A.B., Constandinou, T.G.: Implantable brain machine interfaces: first-in-human studies, technology challenges and trends. Current Opinion in Biotechnology 72, 102–111 (2021)
  • [20] Roy, Y., Banville, H.J., Albuquerque, I., Gramfort, A., Falk, T.H., Faubert, J.: Deep learning-based electroencephalography analysis: a systematic review. CoRR 1901.05498 (2019)
  • [21] Stanslaski, S., Afshar, P., Cong, P., Giftakis, J., Stypulkowski, P., Carlson, D., Linde, D., Ullestad, D., Avestruz, A.T., Denison, T.: Design and validation of a fully implantable, chronic, closed-loop neuromodulation device with concurrent sensing and stimulation. IEEE Transactions on Neural Systems and Rehabilitation Engineering 20(4), 410–421 (2012)
  • [22] Stanslaski, S., Herron, J., Chouinard, T., Bourget, D., Isaacson, B., Kremen, V., Opri, E., Drew, W., Brinkmann, B.H., Gunduz, A., Adamski, T., Worrell, G.A., Denison, T.: A chronically implantable neural coprocessor for investigating the treatment of neurological disorders. IEEE Transactions on Biomedical Circuits and Systems 12(6), 1230–1245 (2018)
  • [23] Sussillo, D., Nuyujukian, P., Fan, J.M., Kao, J.C., Stavisky, S.D., Ryu, S., Shenoy, K.: A recurrent neural network for closed-loop intracortical brain–machine interface decoders. Journal of Neural Engineering 9(2), 026027 (2012)
  • [24] Taghizadeh-Sarabi, M., Daliri, M.R., Niksirat, K.S.: Decoding objects of basic categories from electroencephalographic signals using wavelet transform and support vector machines. Brain Topography 28(1), 33–46 (2015)
  • [25] Todorova, S., Sadtler, P., Batista, A., Chase, S., Ventura, V.: To sort or not to sort: the impact of spike-sorting on neural decoding performance. Journal of Neural Engineering 11(5), 056005 (2014)
  • [26] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems. pp. 5998–6008 (2017)
  • [27] Willett, F.R., Avansino, D.T., Hochberg, L.R., Henderson, J.M., Shenoy, K.V.: High-performance brain-to-text communication via handwriting. Nature 593(7858), 249–254 (2021)
  • [28] Wilson, G.H., Stavisky, S.D., Willett, F.R., Avansino, D.T., Kelemen, J.N., Hochberg, L.R., Henderson, J.M., Druckmann, S., Shenoy, K.V.: Decoding spoken english from intracortical electrode arrays in dorsal precentral gyrus. Journal of Neural Engineering 17(6), 066007 (2020)
  • [29] Wu, W., Black, M., Gao, Y., Serruya, M., Shaikhouni, A., Donoghue, J., Bienenstock, E.: Neural decoding of cursor motion using a kalman filter. Advances in Neural Information Processing Systems 15 (2002)
  • [30] Wu, W., Hatsopoulos, N.G.: Real-time decoding of nonstationary neural activity in motor cortex. IEEE Transactions on Neural Systems and Rehabilitation Engineering 16(3), 213–222 (2008)
  • [31] Xu, H., Han, Y., Han, X., Xu, J., Lin, S., Cheung, R.C.: Unsupervised and real-time spike sorting chip for neural signal processing in hippocampal prosthesis. Journal of Neuroscience Methods 311, 111–121 (2019)
  • [32] Ye, J., Pandarinath, C.: Representation learning for neural population activity with neural data transformers. arXiv preprint arXiv:2108.01210 (2021)
  • [33] Zhang, Z., Constandinou, T.G.: Firing-rate-modulated spike detection and neural decoding co-design. Journal of Neural Engineering 20(3), 036003 (2023)