A Transformer-Based Multi-Task Learning Framework for Myoelectric Pattern Recognition Supporting Muscle Force Estimation


IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 31, 2023

A Transformer-Based Multi-Task Learning Framework for Myoelectric Pattern Recognition Supporting Muscle Force Estimation

Xinhui Li, Xu Zhang, Member, IEEE, Liwei Zhang, Xiang Chen, Member, IEEE, and Ping Zhou, Senior Member, IEEE

Abstract— Simultaneous implementation of myoelectric pattern recognition and muscle force estimation is highly demanded in building natural gestural interfaces but a challenging task due to the gesture classification accuracy degradation under varying muscle strengths. To address this problem, a novel method using transformer-based multi-task learning (MTL-Transformer) for the prediction of both myoelectric patterns and corresponding muscle strengths was proposed to describe the inherent characteristics of an individual gesture pattern under different force conditions, thereby improving the accuracy of myoelectric pattern recognition. In addition, the transformer model enabled the characterization of long-term temporal correlations to ensure precise and smooth estimation of the muscle force. The performance of the proposed MTL-Transformer framework was evaluated via experiments of classifying eleven hand gestures and estimating the corresponding muscle force simultaneously, using high-density surface electromyogram (HD-sEMG) recordings from forearm flexor muscles of eleven intact-limbed subjects. The MTL-Transformer framework yielded high classification accuracy (98.70±1.21%) and low root mean square deviation (12.59±2.76%), and significantly outperformed two other common temporal modelling methods (p < 0.05) in terms of both improved gesture recognition accuracies and reduced muscle force estimation errors. The MTL-Transformer framework is demonstrated as an effective solution for simultaneous implementation of myoelectric pattern recognition and muscle force estimation. This study promotes the development of robust and smooth myoelectric control systems, with wide applications in gestural interfaces, prosthetic and orthotic control.

Index Terms— Myoelectric pattern recognition, muscle force estimation, varying muscle strengths, transformer model, multi-task learning.

Manuscript received 10 February 2023; revised 17 June 2023; accepted 12 July 2023. Date of publication 25 July 2023; date of current version 18 August 2023. This work was supported by the National Natural Science Foundation of China under Grant 62271464. (Corresponding author: Xu Zhang.)
This work involved human subjects or animals in its research. Approval of all ethical and experimental procedures and protocols was granted by the Ethics Review Board of the University of Science and Technology of China (USTC), Hefei, Anhui, China, under Application No. 2022-N(H)-163, in February 2022.
Xinhui Li, Xu Zhang, and Xiang Chen are with the School of Microelectronics, University of Science and Technology of China, Hefei, Anhui 230027, China (e-mail: [email protected]).
Liwei Zhang is with the First Affiliated Hospital, University of Science and Technology of China, Hefei, Anhui 230001, China.
Ping Zhou is with the Department of Biomedical and Rehabilitation Engineering, University of Health and Rehabilitation Sciences, Qingdao, Shandong 266024, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TNSRE.2023.3298797
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

I. INTRODUCTION

MYOELECTRIC control is a technology that converts human movement intentions into machine commands by sensing and processing electromyographic (EMG) signals to control peripheral devices. It has been widely used as gestural interfaces in prosthetic and orthotic robots [1], [2], [3]. Due to its favorable non-invasive property, the surface EMG (sEMG) is usually used as the command source in myoelectric control systems [4], [5], [6]. In recent years, a number of studies in myoelectric control have been devoted to the interpretation of movement patterns from the sEMG signals [7], [8], [9]. Many machine learning methods, such as the linear discriminant classifier [10], Gaussian mixture model [11], and support vector machine [12], have been adopted to process the sEMG signals and improve the number of recognizable patterns and the recognition accuracy, with significant progress [13], [14], [15]. In particular, the rapid development of deep learning algorithms in recent years has significantly advanced the techniques for myoelectric control [16], [17], [18]. To reduce adverse interference when exploring the feasibility of the recognition methods, these studies are usually carried out with different movement patterns under constant medium force levels, without considering potential variations of the muscle force. Intuitively, movement pattern recognition and muscle force estimation are not separate tasks. For instance, when gripping an object with a prosthetic control system, both the movement pattern and the muscle force are generated in a well-coordinated manner so as to achieve natural and smooth control. Consequently, it is necessary to validate the gesture pattern recognition algorithm for sEMG signals under the condition of varying forces, and this further motivates the research on simultaneous implementation of both gesture recognition and muscle force estimation.
Towards advanced myoelectric pattern recognition supporting muscle force estimation, several studies have been

conducted. For example, Baldacchino et al. [19] proposed a multivariate Bayesian hybrid model based on the gate function, which can achieve the pattern recognition of nine finger movements and force estimation of the fingertip. Fang et al. [20] proposed an attribute-driven granular (AGrM) model for recognizing eight finger pinch patterns and estimating fingertip forces. Despite the achievements of these works, their pattern recognition performance degraded significantly under variable forces. One main reason was that both the gesture recognition and the muscle force estimation were treated as independent tasks, ignoring their complementary properties underlying complex muscle coordination.
Since the gesture pattern and muscle force can both be predicted from the sEMG signal, the multi-task learning (MTL) framework is naturally considered. The MTL approaches aim to learn multiple related tasks simultaneously; by sharing the feature representations of different tasks, they can achieve better generalization ability than learning each individual task independently [21]. Hua et al. [22] used an MTL framework based on the multi-stream temporal convolutional neural network (TCNN) to simultaneously make decisions within eight movement patterns and three corresponding force levels. This method considered only three fixed force levels for each pattern, which had certain limitations in real-world applications. Hu et al. [23] proposed an MTL framework based on the long short-term memory (LSTM) network and the multi-layer perceptron (MLP), incorporating a post-processing approach. It enabled the recognition of eleven gestures while supporting instantaneous estimation of the muscle force of the activated gesture. However, the post-processing algorithm can lead to a large time delay, which was not conducive to the real-time requirement of the myoelectric control system. Besides, these methods achieved only unsatisfying and limited performance (an average accuracy of just around 90%).
In the simultaneous control task, gesture recognition is usually the more important issue, and the prediction of muscle force is meaningful only when the gesture patterns are recognized correctly. Meanwhile, the main difficulty of the simultaneous control task also lies in overcoming the degradation of gesture recognition accuracy under the influence of variable forces. Movement patterns have been frequently characterized by sEMG features. Most of them are associated with sEMG amplitudes [6], such as time domain (TD) features [24], [25], and they may change obviously with varying forces, leading to decreased pattern recognition performance. This places a higher demand on the user's operational normality in the application of myoelectric control systems, resulting in poor user experience [22]. To deal with this problem, some methods have been proposed to improve the generalization ability of the classification algorithm by extracting sEMG features that are insensitive to force changes, thus reducing the variation in feature space caused by variable contraction forces [26], [27], [28]. For example, Al-Timemy et al. [27] used the time-dependent power spectral descriptors (TD-PSD) of sEMG signals on a six-class classification task under three force levels, reducing the classification error significantly when compared to conventional characteristics such as the autoregressive model coefficient, discrete Fourier transform coefficient, and wavelet transform coefficient. Pancholi and Joshi [28] proposed an energy kernel-based feature extraction method, with an average classification accuracy of 92% for six gestures under three force levels, which achieved a 2%-9% improvement over TD-PSD and wavelet transform coefficients.
Although some progress has been made, the decoding of movement patterns under varying forces is still unsatisfactory. Due to the sequential properties of sEMG data, it is essential to mine the temporal relevance along the data sequence. It is hypothesized that temporal modeling of signal sequences helps to learn robust features of one gesture by aggregating information from sEMG data over varying force levels, thus improving the accuracy of gesture pattern recognition. In recent years, the transformer model has attracted wide attention due to its powerful temporal modeling capability and has been successfully applied in speech recognition, machine translation and many other computer vision tasks [29], [30], [31]. The key of the transformer model lies in the self-attention mechanism, which allows the data points in the input sequence to interact with each other by computing the similarity scores (attention weights) among them [32]. The self-attention mechanism can help to capture long-term dependencies in the time sequence and aggregate global information of the data, instead of only focusing on the local context information as in convolutional neural networks. Besides, compared with recurrent neural networks (RNNs) such as the LSTM [33], [34], [35], which have a similar capability of aggregating global context information, the transformer has the property of parallel computation [36], which can reduce the training time cost and improve the execution efficiency. Although the transformer model has been utilized in myoelectric pattern recognition tasks with promising performance [37], [38], [39], its effectiveness has not been investigated in the simultaneous implementation of both the myoelectric pattern recognition task and the muscle force estimation task.
To reduce the negative impacts of varying muscle strengths and simultaneously predict both the gestural pattern and the force, we proposed a novel transformer-based multi-task learning (MTL-Transformer) method for myoelectric pattern recognition supporting muscle force estimation. In this method, the sEMG data samples were characterized as features and fed into the transformer model, and then went through the classification module and the regression module simultaneously to obtain decisions of both the gesture pattern and the instantaneous muscle force. Our proposed method can achieve efficient and robust myoelectric control, which is of great significance to gestural interfaces, prosthetic and orthotic control.

II. METHODOLOGY

Figure 1 demonstrates the flowchart of the proposed method. First, high-density sEMG (HD-sEMG) from the forearm flexor and the corresponding grasping force are collected simultaneously when the gestures are executed. The HD-sEMG data are used to extract sEMG envelopes in a channel-wise manner. The multi-channel sEMG signals and the corresponding multi-channel envelope signals are stitched together, which are segmented into a series of multi-channel time windows and then fed into the transformer model. For

each sEMG sample in one window, features obtained from the transformer model are fed into the classification module and the regression module simultaneously to obtain the gesture pattern and instantaneous muscle force.

Fig. 1. The flowchart of the proposed MTL-Transformer method.

Fig. 2. Illustration of 11 different gesture patterns.

Fig. 3. The diagram of the experiment set-up. (a) Three force sensors including two pressure sensors and a torque sensor. (b) Schematic diagram of force sensor and HD-sEMG array placement in a gesture task.
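The envelope extraction and window segmentation sketched in the overview above can be made concrete as follows. This is a minimal NumPy/SciPy illustration, not the authors' code: the variable names and the toy 4-channel input are invented for the example, the filter parameters (3 Hz cutoff, 80th-order Hanning-window FIR, 64 ms windows with a 32 ms increment at 2 kHz) are taken from Section II-B, and the zero-phase `filtfilt` call is a simplification since the paper does not state whether filtering was causal.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 2000              # sampling rate (Hz), as in the paper
WIN = int(0.064 * FS)  # 64 ms window  -> 128 samples
HOP = int(0.032 * FS)  # 32 ms increment -> 64 samples

def envelopes(emg):
    """Full-wave rectification + 3 Hz low-pass FIR (Hanning window,
    80th order), applied per channel; emg is (n_samples, n_channels)."""
    taps = firwin(81, 3.0, window="hann", fs=FS)  # 81 taps = 80th order
    return filtfilt(taps, [1.0], np.abs(emg), axis=0)

def make_windows(emg):
    """Stitch raw channels with their envelopes, then slice overlapping
    analysis windows of shape (WIN, 2 * n_channels)."""
    fused = np.concatenate([emg, envelopes(emg)], axis=1)
    starts = range(0, fused.shape[0] - WIN + 1, HOP)
    return np.stack([fused[s:s + WIN] for s in starts])

emg = np.random.randn(FS * 2, 4)   # 2 s of 4-channel toy data
wins = make_windows(emg)
print(wins.shape)                  # (61, 128, 8) for this toy input
```

Each resulting window carries both the raw sEMG and the slowly varying envelope per channel, which is the channel-augmented sample format fed to the model.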
A. Subjects and Experiments

Eleven male subjects aged from 23 to 27 years were recruited in this study. None of the subjects had any neuromuscular diseases; all were informed of the experimental procedures and signed the informed consent. The study was approved by the Ethics Review Board of the University of Science and Technology of China (Hefei, China).
Eleven gestures involving pressure, pinch, grip and twist were selected from commonly used daily gestures to form the target gesture set in this study, as shown in Fig. 2. Several 3-D printed hand molds were adopted to assist the data collection of the twist gestures. The diagram of the experiment set-up is shown in Fig. 3. As shown in Fig. 3(a), two pressure sensors (LOADING SEN, LDCZL-FA & LDCZL-SC, China) and a torque sensor (LOADING SEN, LDN-08A, China) were used to record the grasping force. An HD-sEMG electrode array consisting of 128 electrodes arranged in a 16 × 8 grid was used to collect HD-sEMG signals. Each electrode had a circular recording probe of 3-mm diameter, and the center-to-center inter-electrode distance between two neighboring electrodes was 8 mm. Each electrode in the array worked in a monopolar manner with respect to the common reference electrode placed on the back of the other hand, constituting 128 recording channels.
At the beginning of the experiment, the skin of the subjects' forearms was cleaned with medical-grade alcohol to reduce the skin-electrode impedance. As shown in Fig. 3(b), the HD-sEMG array was placed on the skin surface of the forearm flexor muscles (containing the ulnar carpal flexor, radial carpal flexor, and intra-hand flexors), whose primary function corresponds to the eleven gestures, and an extra inelastic bandage was used to secure the HD-sEMG array and reduce motion artifacts. Then the subjects were asked to perform three maximal voluntary contractions (MVCs) while their corresponding forces were recorded, and the largest force was used as the final MVC for every subject. Then, subjects were instructed to perform the target gestures in a variable force generation pattern that rises smoothly from the initial baseline (resting state) to the 60% MVC force level and then falls smoothly to baseline. The duration of each force generation pattern was maintained between 2 s and 3 s, and the pattern was repeated 20 times for each subject performing each gesture. The target force and the actual force generation curve were displayed on a human-computer interaction interface to help the subjects better complete the force generation task, and the actual measured force was used in subsequent signal processing.
A homemade recording system was used for both force and sEMG data collection. There is a two-stage amplifier with a total gain of 60 dB and a band-pass filter at 20–500 Hz per

channel. The HD-sEMG and force signals were sampled at 2 kHz and digitized via a 16-bit A/D converter (ADS1299, Texas Instruments, TX). All collected sEMG and force data were saved to the hard disk of a computer through a high-speed USB cable. All subsequent data processing and analyses were conducted on a desktop PC equipped with an Intel i7 CPU and an NVIDIA GeForce GTX 1080Ti.

B. Signal Pre-Processing

A 20-500 Hz band-pass filter was applied to the HD-sEMG signals first to eliminate low-frequency noise artifacts. Then a set of second-order notch filters was adopted to remove the 50-Hz power line interference and its harmonics for each sEMG channel. For the three twist gestures (Ges 9, Ges 10, and Ges 11), the torque recorded by the force sensor was converted into the corresponding twist force:

D̂ = I/d, (1)

where D̂ represents the calculated twist force, I is the measured torque, and d is the moment arm of the hand mold. For each gesture and each subject, both the HD-sEMG data and the force data were normalized separately. We first calculated a global maximum absolute value of the sEMG amplitude over all channels. Once the maximum absolute amplitude value was obtained, it was used to normalize each channel of the sEMG signal between −1 and 1. In addition, the force signal was normalized between 0 and 1 using the previously recorded MVC value. Furthermore, we used full-wave rectification and low-pass filtering (cutoff frequency 3 Hz, finite impulse response filter, Hanning window, 80th order) to process each channel of the normalized HD-sEMG signals in the temporal dimension, obtaining a corresponding signal envelope. These multi-channel envelopes were also normalized in the same way according to the maximum absolute value across all channels. Considering the following supervised learning analyses, all of the above normalization parameters were derived only from the training data; they were stored and applied to any test data. Thus a channel-augmented data stream was obtained by concatenating the normalized multi-channel envelopes and the normalized multi-channel sEMG data. These data also carried their original labels including both the gesture pattern and the corresponding measured force. The sEMG data along with the force signal were segmented into several overlapping analysis windows with a window length of 64 ms and a window increment of 32 ms. These multi-channel analysis windows were considered as the basic samples for both the myoelectric pattern recognition and muscle force estimation tasks.

C. Model Structure and Model Training

In this section, we introduce the detailed design of our MTL-Transformer model. Consider the input data X ∈ R^(B×T×M) with pattern labels Y^cls ∈ R^B and force labels Y^reg ∈ R^(B×T) in a mini-batch, where B is the batch size, T is the length of data samples in a window and M is the channel number of the fusion samples. To fuse the feature information of different electrodes, we first adopted a fully connected layer and a ReLU activation layer to map X into P channels (i.e., empirically 256 channels in this study) as follows:

U = Max(XW + b, 0), (2)

where W ∈ R^(M×P) and b ∈ R^P. Note that the T data samples in a window correspond to the same gesture pattern label but have different muscle force values; the muscle force is continuous, and it is important to consider the smoothness of the muscle force prediction results. To correctly recognize the gesture pattern of these data samples and estimate the muscle force, temporal modeling is essential for improving the feature robustness and smoothness. Consequently, the output U was then fed into the multi-head attention module to aggregate temporal information. Specifically, the multi-head attention module consists of N heads, and each head processes the input independently. For the n-th head, we first map U into query Q_n, key K_n and value V_n:

Q_n = U W_n^Q, K_n = U W_n^K, V_n = U W_n^V, (3)

where W_n ∈ R^(P×P/N) represents the learnable weights. Then the query Q_n and key K_n were used to calculate the similarity matrix among data samples:

S_n = softmax(Q_n K_n′ / τ), (4)

where τ = sqrt(P/N) is the scale factor and K_n′ is the transpose of K_n. With the similarity matrix S_n, the output H_n of the n-th head can be obtained:

H_n = S_n V_n. (5)

By concatenating the outputs of the N heads together, we can obtain the aggregated temporal feature for each data sample according to Equation (6). In this paper, the head number N was set to 4, so the dimension of H is R^(B×T×256). To keep the original feature so as to prevent over-smoothing for muscle force estimation, U and H were fused together and then fed into a layer-normalization layer to obtain U′ according to Equation (7). At last, U′ was fed into a feed-forward module FFN and a layer-normalization layer to obtain the output of the transformer model according to Equation (8).

H = Concat(H_1, . . . , H_N) (6)
U′ = LayerNormalization(U + H) (7)
U″ = LayerNormalization(FFN(U′) + U′) (8)

The feed-forward module FFN consists of two fully connected layers, with a ReLU activation layer between them and a normalization layer after them. For the FFN module, each FC layer had 256 input channels and 256 output channels. With the output features of each sample U″ ∈ R^(256×T), the classification scores Ŷ^cls ∈ R^(B×C) and muscle force estimation results Ŷ^reg ∈ R^(B×T) were obtained by the classification module N^cls and the regression module N^reg according to Equation (9). In the classification module, we used the temporal average pooling operation on U″ in the second dimension to obtain the output U‴ ∈ R^256, which was then fed into the fully connected layer.
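The transformer computations in Equations (2)–(8) can be sketched for a single window as follows. This is a minimal NumPy re-implementation for illustration only: the weight matrices are randomly initialized stand-ins for the learned parameters, layer normalization is reduced to its parameter-free core, and the toy shapes (T = 128, M = 8) are assumptions rather than the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, P, N = 128, 8, 256, 4       # window length, input channels, width P, N heads
X = rng.standard_normal((T, M))   # one fused window (T samples, M channels)

def layer_norm(z):
    # parameter-free LayerNorm over the feature dimension
    return (z - z.mean(-1, keepdims=True)) / (z.std(-1, keepdims=True) + 1e-5)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Eq. (2): fully connected layer + ReLU, mapping M -> P channels
W, b = rng.standard_normal((M, P)) * 0.1, np.zeros(P)
U = np.maximum(X @ W + b, 0.0)

# Eqs. (3)-(5): N-head self-attention with scale factor tau = sqrt(P / N)
heads = []
for _ in range(N):
    Wq, Wk, Wv = (rng.standard_normal((P, P // N)) * 0.1 for _ in range(3))
    Q, K, V = U @ Wq, U @ Wk, U @ Wv
    S = softmax(Q @ K.T / np.sqrt(P / N))   # Eq. (4): similarity matrix
    heads.append(S @ V)                     # Eq. (5): head output

H = np.concatenate(heads, axis=-1)          # Eq. (6): concat heads -> (T, P)
U1 = layer_norm(U + H)                      # Eq. (7): residual + LayerNorm

# Eq. (8): two-layer feed-forward with ReLU, then residual + LayerNorm
W1, W2 = rng.standard_normal((P, P)) * 0.1, rng.standard_normal((P, P)) * 0.1
U2 = layer_norm(np.maximum(U1 @ W1, 0.0) @ W2 + U1)
print(U2.shape)   # (128, 256): one 256-dim feature per time sample
```

Because every row of S sums to 1, each output sample is a weighted average over all T samples in the window, which is the mechanism the paper relies on to aggregate information across force levels.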

For the classification module, the number of input channels was 256 and the number of output channels was the pattern number of the gestures (i.e., 11 in this study). In addition, the regression module was composed of a fully connected layer and a sigmoid activation layer. For the regression module, there were 256 input channels and just 1 output channel.

Ŷ^cls = N^cls(U″), Ŷ^reg = N^reg(U″) (9)

In the training stage, the real gesture label Y^cls and muscle force value Y^reg were available. The network was trained with the stochastic gradient descent (SGD) algorithm [40], with a batch size of B (i.e., 10 in this study). Consequently, the classification loss L^cls and the muscle force estimation loss L^reg can be calculated as the cross-entropy loss and the mean square error loss according to Equation (10) and Equation (11), respectively, where C represents the pattern number of the gestures, Y_{j,c}^cls means the gesture label value of the j-th sample labeled as pattern c in a mini-batch, Ŷ_{j,c}^cls means the predicted probability that the j-th sample belongs to pattern c, Y_{j,t}^reg means the real force value of the t-th sampling point of the j-th sample, and Ŷ_{j,t}^reg means the corresponding predicted force value. The final loss used for model training was the weighted sum of the two losses defined in Equation (12), where the weight α was used to balance the contribution of the muscle force estimation loss. Since the regression loss was found to be about ten times smaller than the classification loss, α was set from 0 to 10 in increments of 1 to find an appropriate value leading to optimal performance. The learning rate was set to 1×10−3 in this paper, and the number of training epochs was set to 20.

L^cls = −(1/B) Σ_{j=1}^{B} Σ_{c=1}^{C} Y_{j,c}^cls log(Ŷ_{j,c}^cls) (10)
L^reg = (1/B) Σ_{j=1}^{B} Σ_{t=1}^{T} (Y_{j,t}^reg − Ŷ_{j,t}^reg)^2 (11)
L = L^cls + αL^reg (12)

D. Model Testing and Decision Making

In the testing stage, given the testing data X ∈ R^(L×M), we feed it into the MTL-Transformer model to obtain the gesture classification result Ŷ_cls ∈ R^C and the muscle force estimation result Ŷ_reg ∈ R^T. Since Ŷ_cls was a pattern distribution, the predicted pattern label c can be obtained:

c = argmax_i Ŷ_cls(i) (13)

E. Performance Evaluation

To evaluate the effectiveness of the proposed method, the training set and testing set were divided in the proportion of 3:7 for each subject's data. The classification accuracy (CA) described in Equation (14) and the root mean square deviation (RMSD) defined in Equation (15) were used to evaluate the performance of gesture recognition and force estimation, respectively, where F̂ and F are the predicted force and the measured force, respectively.

CA = (CorrectInstances / TotalInstances) × 100% (14)
RMSD = sqrt( Σ_{t=1}^{T} [F̂(t) − F(t)]^2 / L ) × 100% (15)

To validate the advantage of the proposed method, two common temporal modeling approaches were applied to construct the MTL framework, which can realize simultaneous gesture recognition and force estimation. One was based on the LSTM model (termed MTL-LSTM), where an FC layer, a ReLU layer and an LSTM layer were used to obtain the features of each sample, which then went through the classification and regression modules; thus the predicted gesture pattern and muscle force can be obtained. The other one was based on the multi-stream temporal convolutional neural network (termed MTL-TCNN) according to the previous study [10]. In this work, data from each channel of the fusion samples were used as the input of each stream. Three conv blocks, each containing a BN layer, a conv layer and a max-pooling layer, were utilized to extract the features of each sample, which were then fed into the classification and regression modules. In both methods, the structure of the classification and regression modules was the same as in the proposed method. All the experiments were conducted on a single GTX 2080 GPU and an Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz. The details of the implemented neural network layers and parameters are shown in Table I.

F. Statistical Analysis

Two one-way repeated-measure ANOVAs were performed on the CA and RMSD, respectively, to examine the effect of different methods on gesture recognition and muscle force estimation. The LSD post hoc test was employed for multiple pairwise comparisons. The significance level was set as p < 0.05. All statistical analyses were implemented in SPSS software (version 24.0, SPSS Inc., Chicago, IL, USA).

III. RESULTS

Table II shows the performance of gesture recognition and muscle force estimation, respectively, when the weight coefficient α was set from 0 to 10. Please note that α = 0 means that only the gesture recognition task was performed without force estimation; thus there was no result of force estimation. In this case, the CA of the single gesture recognition task was 95.00±5.15%. When α was increased from 1 to 10, the CA was improved obviously with statistical significance (p < 0.05), and the maximal value was 98.70±1.21% when α was set to 6. In this case, the muscle force estimation performance was also competitive. Thus, α = 6 was selected and it was consistently applied in subsequent analyses.
Fig. 4 reports the CA when varying the model depth from 1 to 4 layers. Both the MTL-Transformer method and the MTL-LSTM method achieved their best performance with just 1 layer, and the MTL-TCNN method had the optimal performance with 3 layers.
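The weighted multi-task objective of Equations (10)–(12) can be written out directly. The following NumPy sketch uses toy one-hot labels and random predictions in place of the network outputs Ŷ^cls and Ŷ^reg; the batch size, class count, window length and α = 6 follow the paper, everything else is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
B, C, T = 10, 11, 128            # batch size, gesture classes, window length
alpha = 6.0                      # loss weight selected in the paper

# toy one-hot gesture labels and per-sample force targets in [0, 1]
Y_cls = np.eye(C)[rng.integers(0, C, B)]
Y_reg = rng.random((B, T))

# toy predictions: a softmax class distribution and sigmoid-squashed forces
logits = rng.standard_normal((B, C))
Yh_cls = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
Yh_reg = 1.0 / (1.0 + np.exp(-rng.standard_normal((B, T))))

L_cls = -(Y_cls * np.log(Yh_cls)).sum() / B    # Eq. (10): cross entropy
L_reg = ((Y_reg - Yh_reg) ** 2).sum() / B      # Eq. (11): squared force error
L = L_cls + alpha * L_reg                      # Eq. (12): weighted sum

print(L_cls, L_reg, L)
```

Scaling only the regression term by α matches the paper's observation that the regression loss is roughly an order of magnitude smaller than the classification loss, so α compensates for that imbalance during SGD training.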

TABLE I
THE DETAILS OF THE DESIGNED NEURAL NETWORK. B REPRESENTS THE BATCH SIZE

Fig. 4. The CA achieved by three methods as the number of layers varies from 1 to 4.

Notably, the MTL-Transformer method outperformed any other method under any model depth setting.
Fig. 5 and Fig. 6 exhibit the CA and RMSD values of all subjects using the proposed method and the two comparison methods, respectively, and the mean value averaged over all subjects is shown on the right side of each figure. It can be seen that the CA of the MTL-TCNN, the MTL-LSTM, and the proposed method were 69.10±20.816%, 95.85±4.63% and 98.70±1.21%, respectively. The proposed method had the highest average CA, which outperformed the two contrast methods significantly (p < 0.05). At the same time, the average RMSD of the proposed method was 12.59±2.76%, achieving a reduction in estimation error compared to 13.79±3.20% by the MTL-LSTM method and 21.69±1.92% by the MTL-TCNN method, with statistical significance (p < 0.05).
Besides, the computational time cost in the testing stage was also calculated as the average time cost over all windows from the testing dataset for the three methods. The mean time cost of the proposed method was 0.20 ms; it was a bit longer than the 0.13 ms achieved by the MTL-TCNN method, but much shorter than the 0.36 ms resulting from the MTL-LSTM method. The resultant time cost of any method was much less than the window increment (32 ms). In addition, the sum of the individual window length (64 ms) and the time cost per window, i.e., the total time delay for a testing window, was less than 300 ms (the tolerance for real-time myoelectric control).
Fig. 7 displays representative examples of the estimated force curve with respect to the actually measured force curve, using the three methods (the MTL-Transformer, MTL-LSTM and MTL-TCNN methods), respectively. It can be found that the predicted muscle force curve of the MTL-TCNN was very noisy and therefore failed to fit the measured force well. The MTL-LSTM method had a predicted muscle force curve much smoother than that of the MTL-TCNN method. However, it still had many fluctuations not in accordance with the true force curve. By contrast, the estimated force curve derived from the proposed method fits the measured force curve better, by capturing the fluctuations of the actual force precisely. In this case, the proposed method achieved the lowest average RMSD value (9.27%), outperforming the MTL-LSTM method (9.72%) and the MTL-TCNN method (26.48%).
To evaluate the real-time classification performance of the proposed MTL-Transformer method, we visualized the real-time classification results [6], [41] derived from a representative subject, S4, as shown in Figure 8. In this test, S4 was asked to cycle through every gesture. It can be found that most of the samples were classified correctly, and misclassifications usually occurred during gesture transitions.

IV. DISCUSSION

This paper presents a novel method for simultaneous implementation of gesture recognition and muscle force estimation using the MTL-Transformer method. The transformer model was embedded in the MTL framework to mine the context information of sEMG sequences. The innovations and major contributions are as follows: (1) A novel transformer-based multi-task learning method is proposed for simultaneous gesture recognition and muscle force estimation. (2) The

TABLE II
THE CA OF THE GESTURE RECOGNITION AND THE RMSD OF THE FORCE ESTIMATION, RESPECTIVELY, WHEN THE WEIGHT COEFFICIENT α RANGED FROM 0 TO 10

Fig. 5. The CA of gesture recognition for all subjects using the proposed method and two contrast methods, respectively.

invariance of the sEMG characteristics inherent to patterns under variable forces is explored by temporal modeling using the transformer model, and the smoothness of both gesture recognition and muscle force estimation is ensured simultaneously. (3) Better recognition performance is achieved by sharing feature representations between both the muscle force estimation task and the gesture recognition task through the MTL design, as compared with a method implementing just one individual task.
In the proposed method, the MTL framework was used to improve the generalization ability of the model by learning a shared feature space of gesture recognition and force

Fig. 6. The RMSD of muscle force estimation for all subjects using the proposed method and two contrast methods, respectively.

estimation, compared to the single-task independent learning


way. Besides, the end-to-end implementation improves the
convenience of the myoelectric control systems. In the MTL framework, it is essential to balance the contribution of each task. Since gesture recognition is often a more important task than muscle force estimation, muscle force estimation can be considered an auxiliary task, and the model can work better by dynamically adjusting the importance of the force estimation. In this paper, this was achieved by adding a weighting factor α to the regression loss corresponding to the muscle force estimation. When α is too small, the model training may be dominated by the classification loss, neglecting the contribution of the regression task. Conversely, when α is too large, the model training may place too much emphasis on the muscle force estimation loss and lead to degradation of the gesture classification. As shown in Table II, without the involvement of muscle force estimation (i.e., α = 0), the lowest performance of gesture recognition was obtained. This demonstrated the effectiveness of MTL, i.e., the addition of the muscle force estimation task had a positive effect on the performance of gesture recognition. In contrast, when the auxiliary task was added (with a non-zero balance factor), the model performance was found to be improved, and the best average classification accuracy was achieved when α equaled 6. Since gesture recognition is more important than muscle force estimation, we chose the value of α at which the highest accuracy of gesture recognition and a competitive RMSD of muscle force estimation were obtained; thus, α was determined as 6. Notably, there was no significant difference in the recognition results when α ranged from 1 to 10 (p > 0.05), suggesting that the model was not sensitive to the value of the balance factor for this task. This is consistent with a previous finding [22], and can provide guidance for similar tasks based on the MTL framework in myoelectric control.

Fig. 7. Representative examples of the estimated force curve (blue) with respect to the actually measured force curve (red), selected from a gesture performed by subject S4, using the proposed MTL-Transformer method (a), the MTL-LSTM method (b), and the MTL-TCNN method (c), respectively.
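The α-weighted objective discussed above can be written compactly. The following is a minimal plain-Python sketch under our own assumptions (softmax cross-entropy for the gesture head and mean squared error for the force head; the function name `mtl_loss` is ours), not the authors' training code.

```python
import math


def mtl_loss(gesture_logits, gesture_label, force_pred, force_true, alpha=6.0):
    """Joint objective L = L_cls + alpha * L_reg for one analysis window."""
    # Softmax cross-entropy for the gesture-recognition head
    # (max-shifted for numerical stability).
    m = max(gesture_logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in gesture_logits))
    l_cls = log_norm - gesture_logits[gesture_label]
    # Mean squared error for the force-estimation head (force in %MVC).
    l_reg = sum((p - t) ** 2
                for p, t in zip(force_pred, force_true)) / len(force_true)
    # alpha = 0 reduces to single-task classification; the sweep in Table II
    # found alpha = 6 best, with little sensitivity over the range 1..10.
    return l_cls + alpha * l_reg
```

Setting `alpha = 0` recovers the classification-only baseline that gave the lowest accuracy in the sweep, while large `alpha` lets the regression term dominate training.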
As clarified in the Introduction section, the changes in the sEMG feature space under variable forces may inevitably degrade the gesture recognition accuracy. In this paper, we adopted the transformer model to simultaneously implement gesture recognition and muscle force estimation. As shown in Fig. 5 and Fig. 6, the proposed MTL-Transformer method achieved a CA of 98.70±1.21% and an RMSD of 12.59±2.76%, demonstrating the best performance in both the gesture recognition and the muscle force estimation. This verified the previous scientific hypothesis that the negative effect of force variation on the sEMG features can be mitigated by the temporal modeling ability of the transformer model through aggregating features of each sample with different force values. Meanwhile, the proposed MTL-Transformer method ensured the smoothness of the muscle force estimation curve, as visualized in Fig. 7.
As a commonly used powerful model for temporal modeling, the LSTM model was designed as an MTL structure and used for comparison in this paper. Not surprisingly, the proposed MTL-Transformer outperformed the MTL-LSTM method in terms of both gesture recognition and muscle force estimation. This is because the LSTM relies on historical memory, and thus the initial performance of the model is limited.
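The LSTM's reliance on step-by-step historical memory contrasts with self-attention, where every output position attends to the whole window at once. A minimal plain-Python sketch of single-head scaled dot-product self-attention (with Q = K = V set to the input for brevity, a simplification of ours):

```python
import math


def self_attention(x):
    """Single-head scaled dot-product self-attention over a sequence x of
    d-dimensional frame embeddings; Q = K = V = x for brevity. Each output
    row depends on ALL T positions via one T x T score matrix, and the rows
    are mutually independent, so they can be computed in parallel, unlike an
    RNN state that must be unrolled step by step."""
    d = len(x[0])
    scale = math.sqrt(d)
    out = []
    for q in x:  # rows are independent of one another -> parallelizable
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in x]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        weights = [wi / z for wi in w]  # softmax over all T positions
        out.append([sum(wt * v[j] for wt, v in zip(weights, x))
                    for j in range(d)])
    return out
```

Each output frame is a convex combination of all input frames, which is what gives the transformer its global receptive field over the sEMG window; the loop over query rows is written sequentially here only for readability.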
When compared with recurrent neural networks (including LSTM) that process temporal data in a sequential manner, the self-attention operation can be conducted in a parallel way for all data samples, which makes it very time efficient [32], as verified by the shorter time delay achieved by the proposed method.
Besides, the temporal CNN (TCNN) is also a typical method used to characterize temporal relationships, which has been designed and validated as a multi-stream structure in previous work [22]. However, the performance of this method in terms of both gesture recognition and muscle force estimation was unsatisfactory. There are two possible reasons. The first is the simple structure of the TCNN with a small number of parameters. Although this property can reduce the inference time, the simple model cannot be adapted to the complex gesture recognition and muscle force estimation tasks. Previous work only carried out gesture recognition and force level estimation for sEMG signals at three fixed force levels, whereas the force varied continuously over a great range from resting (almost 0% MVC) to 60% MVC when performing different gestures in this study. Greater force changes make the distribution of sEMG features more variable, and more powerful models are needed to achieve satisfactory results. Secondly, the receptive field of the TCNN is limited, and it can only focus on the information within a short period of time in a long time series, so the stationarity of gesture recognition and muscle force estimation cannot be guaranteed. Compared with the TCNN model, the transformer model is able to well characterize global context information through the self-attention mechanism and can improve the model performance. All of these advantageous aspects of the proposed MTL-Transformer method can be used to explain its significant performance gains (over 30%) at the cost of a slightly longer time delay, with respect to the MTL-TCNN method. Meanwhile, the prolonged testing time delay is too small to affect the common usability of the myoelectric control system. Therefore, the proposed method is regarded to achieve superior gesture recognition and muscle force estimation performance along with sufficient computational efficiency.

Fig. 8. The classification visualization results during a real-time test performed on S4, in which S4 was asked to cycle through every gesture. Green blocks indicate correct gesture decisions, black blocks indicate correct predictions of the resting state, and just some sporadic red blocks indicate incidental errors of the gesture classification.

Although the results are promising, there are still some limitations in this study. First, the target gestures in the experimental scheme of this paper only include several comprehensive gestures involving press, pinch, grasp, and twist. More complex and dexterous gestures from daily life can be considered to expand the gesture set. Second, the proposed method in this paper is based on the user-specific condition where each new user needs to provide certain training data, which may be inconvenient in practical use. Thus, this method can be explored in the future in conjunction with strategies such as unsupervised domain adaptation for cross-user simultaneous gesture recognition and muscle force estimation.

V. CONCLUSION

In this paper, a novel MTL-Transformer method is presented using the transformer model embedded in the MTL framework for predicting both the gesture pattern and muscle force, which can mitigate the negative effect of force variation on the sEMG features through temporal modeling. The proposed MTL-Transformer method outperformed common temporal modeling methods-based MTL frameworks in terms of both gesture recognition and muscle force estimation (p < 0.05). The experimental results demonstrated the effectiveness of the transformer model in mining the context information of sEMG sequences. This study offers a promising method for robust and smooth myoelectric control systems, with wide applications in gestural interfaces, prosthetic and orthotic control.

REFERENCES

[1] M. Asghari Oskoei and H. Hu, "Myoelectric control systems—A survey," Biomed. Signal Process. Control, vol. 2, no. 4, pp. 275–294, Oct. 2007.
[2] C. Cipriani, F. Zaccone, S. Micera, and M. C. Carrozza, "On the shared control of an EMG-controlled prosthetic hand: Analysis of user–prosthesis interaction," IEEE Trans. Robot., vol. 24, no. 1, pp. 170–184, Feb. 2008.
[3] A. A. Adewuyi, L. J. Hargrove, and T. A. Kuiken, "An analysis of intrinsic and extrinsic hand muscle EMG for improved pattern recognition control," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 24, no. 4, pp. 485–494, Apr. 2016.
[4] X. Li et al., "Decoding muscle force from individual motor unit activities using a twitch force model and hybrid neural networks," Biomed. Signal Process. Control, vol. 72, Feb. 2022, Art. no. 103297.
[5] S. Lee, M.-O. Kim, T. Kang, J. Park, and Y. Choi, "Knit band sensor for myoelectric control of surface EMG-based prosthetic hand," IEEE Sensors J., vol. 18, no. 20, pp. 8578–8586, Oct. 2018.
[6] V. E. Kosmidou and L. J. Hadjileontiadis, "Sign language recognition using intrinsic-mode sample entropy on sEMG and accelerometer data," IEEE Trans. Biomed. Eng., vol. 56, no. 12, pp. 2879–2890, Dec. 2009.
[7] X. Zhang, X. Chen, Y. Li, V. Lantz, K. Wang, and J. Yang, "A framework for hand gesture recognition based on accelerometer and EMG sensors," IEEE Trans. Syst., Man, Cybern. A, Syst. Humans, vol. 41, no. 6, pp. 1064–1076, Nov. 2011.
[8] Y. Du, W. Jin, W. Wei, Y. Hu, and W. Geng, "Surface EMG-based inter-session gesture recognition enhanced by deep domain adaptation," Sensors, vol. 17, no. 3, p. 458, Feb. 2017.
[9] X. Chen, Y. Li, R. Hu, X. Zhang, and X. Chen, "Hand gesture recognition based on surface electromyography using convolutional neural network with transfer learning method," IEEE J. Biomed. Health Informat., vol. 25, no. 4, pp. 1292–1304, Apr. 2021.
[10] K. Englehart and B. Hudgins, "A robust, real-time control scheme for multifunction myoelectric control," IEEE Trans. Biomed. Eng., vol. 50, no. 7, pp. 848–854, Jul. 2003.
[11] Y. Huang, K. B. Englehart, B. Hudgins, and A. D. C. Chan, "A Gaussian mixture model based classification scheme for myoelectric control of powered upper limb prostheses," IEEE Trans. Biomed. Eng., vol. 52, no. 11, pp. 1801–1811, Nov. 2005.
[12] M. A. Oskoei and H. Hu, "Support vector machine-based classification scheme for myoelectric control applied to upper limb," IEEE Trans. Biomed. Eng., vol. 55, no. 8, pp. 1956–1965, Aug. 2008.
[13] X. Wu, B. Zhou, Z. Lv, and C. Zhang, "To explore the potentials of independent component analysis in brain-computer interface of motor imagery," IEEE J. Biomed. Health Informat., vol. 24, no. 3, pp. 775–787, Mar. 2020.
[14] G. R. Naik, D. K. Kumar, and Jayadeva, "Twin SVM for gesture classification using the surface electromyogram," IEEE Trans. Inf. Technol. Biomed., vol. 14, no. 2, pp. 301–308, Mar. 2010.
[15] L. J. Hargrove, K. Englehart, and B. Hudgins, "A comparison of surface and intramuscular myoelectric signal classification," IEEE Trans. Biomed. Eng., vol. 54, no. 5, pp. 847–853, May 2007.
[16] A. Kashizadeh, K. Peñan, A. Belford, A. Razmjou, and M. Asadnia, "Myoelectric control of a biomimetic robotic hand using deep learning artificial neural network for gesture classification," IEEE Sensors J., vol. 22, no. 19, pp. 18914–18921, Oct. 2022.
[17] A. Ameri, M. A. Akhaee, E. Scheme, and K. Englehart, "A deep transfer learning approach to reducing the effect of electrode shift in EMG pattern recognition-based control," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 28, no. 2, pp. 370–379, Feb. 2020.
[18] S. Tam, M. Boukadoum, A. Campeau-Lecours, and B. Gosselin, "A fully embedded adaptive real-time hand gesture classifier leveraging HD-sEMG and deep learning," IEEE Trans. Biomed. Circuits Syst., vol. 14, no. 2, pp. 232–243, Apr. 2020.
[19] T. Baldacchino, W. R. Jacobs, S. R. Anderson, K. Worden, and J. Rowson, "Simultaneous force regression and movement classification of fingers via surface EMG within a unified Bayesian framework," Frontiers Bioeng. Biotechnol., vol. 6, p. 13, Feb. 2018.
[20] Y. Fang, D. Zhou, K. Li, Z. Ju, and H. Liu, "Attribute-driven granular model for EMG-based pinch and fingertip force grand recognition," IEEE Trans. Cybern., vol. 51, no. 2, pp. 789–800, Feb. 2021.
[21] S. Ruder, "An overview of multi-task learning in deep neural networks," 2017, arXiv:1706.05098.
[22] S. Hua, C. Wang, Z. Xie, and X. Wu, "A force levels and gestures integrated multi-task strategy for neural decoding," Complex Intell. Syst., vol. 6, no. 3, pp. 469–478, Oct. 2020.
[23] R. Hu, X. Chen, H. Zhang, X. Zhang, and X. Chen, "A novel myoelectric control scheme supporting synchronous gesture recognition and muscle force estimation," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 30, pp. 1127–1137, 2022.
[24] S.-H. Park and S.-P. Lee, "EMG pattern recognition based on artificial intelligence techniques," IEEE Trans. Rehabil. Eng., vol. 6, no. 4, pp. 400–405, Dec. 1998.
[25] G. N. Saridis and T. P. Gootee, "EMG pattern analysis and classification for a prosthetic arm," IEEE Trans. Biomed. Eng., vol. BME-29, no. 6, pp. 403–412, Jun. 1982.
[26] J. He, D. Zhang, X. Sheng, S. Li, and X. Zhu, "Invariant surface EMG feature against varying contraction level for myoelectric control based on muscle coordination," IEEE J. Biomed. Health Informat., vol. 19, no. 3, pp. 874–882, May 2015.
[27] A. H. Al-Timemy, R. N. Khushaba, G. Bugmann, and J. Escudero, "Improving the performance against force variation of EMG controlled multifunctional upper-limb prostheses for transradial amputees," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 24, no. 6, pp. 650–661, Jun. 2016.
[28] S. Pancholi and A. M. Joshi, "Advanced energy kernel-based feature extraction scheme for improved EMG-PR-based prosthesis control against force variation," IEEE Trans. Cybern., vol. 52, no. 5, pp. 3819–3828, May 2022.
[29] A. T. Liu, S.-W. Li, and H.-y. Lee, "TERA: Self-supervised learning of transformer encoder representation for speech," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 29, pp. 2351–2366, 2021.
[30] Y. Kawara, C. Chu, and Y. Arase, "Preordering encoding on transformer for translation," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 29, pp. 644–655, 2021.
[31] Y. Shi et al., "Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Jun. 2021, pp. 6783–6787.
[32] C. Yang et al., "Lite vision transformer with enhanced self-attention," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11988–11998.
[33] A. Shrestha, H. Li, J. L. Kernec, and F. Fioranelli, "Continuous human activity classification from FMCW radar with bi-LSTM networks," IEEE Sensors J., vol. 20, no. 22, pp. 13607–13619, Nov. 2020.
[34] X. Yuan, L. Li, Y. A. W. Shardt, Y. Wang, and C. Yang, "Deep learning with spatiotemporal attention-based LSTM for industrial soft sensor model development," IEEE Trans. Ind. Electron., vol. 68, no. 5, pp. 4404–4414, May 2021.
[35] A. Zollanvari, K. Kunanbayev, S. Akhavan Bitaghsir, and M. Bagheri, "Transformer fault prognosis using deep recurrent neural network over vibration signals," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–11, 2021.
[36] X. Dong et al., "CSWin transformer: A general vision transformer backbone with cross-shaped windows," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 12114–12124.
[37] S. Shen, X. Wang, F. Mao, L. Sun, and M. Gu, "Movements classification through sEMG with convolutional vision transformer and stacking ensemble learning," IEEE Sensors J., vol. 22, no. 13, pp. 13318–13325, Jul. 2022.
[38] Z. Chen, H. Wang, H. Chen, and T. Wei, "Continuous motion finger joint angle estimation utilizing hybrid sEMG-FMG modality driven transformer-based deep learning model," Biomed. Signal Process. Control, vol. 85, Aug. 2023, Art. no. 105030.
[39] R. Song et al., "Decoding silent speech from high-density surface electromyographic data using transformer," Biomed. Signal Process. Control, vol. 80, Feb. 2023, Art. no. 104298.
[40] W. E. L. Ilboudo, T. Kobayashi, and K. Sugimoto, "Robust stochastic gradient descent with student-t distribution based first-order momentum," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 3, pp. 1324–1337, Mar. 2022.
[41] S. Tam, M. Boukadoum, A. Campeau-Lecours, and B. Gosselin, "Intuitive real-time control strategy for high-density myoelectric hand prosthesis using deep and transfer learning," Sci. Rep., vol. 11, no. 1, May 2021, Art. no. 11275.
