A Review of Classification Algorithms For EEG-based Brain–Computer Interfaces: A 10 Year Update
Topical Review
E-mail: [email protected]
Abstract
Objective. Most current electroencephalography (EEG)-based brain–computer interfaces
(BCIs) are based on machine learning algorithms. There is a large diversity of classifier types
that are used in this field, as described in our 2007 review paper. Now, approximately ten years
after this review publication, many new algorithms have been developed and tested to classify
EEG signals in BCIs. The time is therefore ripe for an updated review of EEG classification
algorithms for BCIs. Approach. We surveyed the BCI and machine learning literature from
2007 to 2017 to identify the new classification approaches that have been investigated to
design BCIs. We synthesize these studies in order to present such algorithms, to report how
they were used for BCIs, what were the outcomes, and to identify their pros and cons. Main
results. We found that the recently designed classification algorithms for EEG-based BCIs can
be divided into four main categories: adaptive classifiers, matrix and tensor classifiers, transfer
learning and deep learning, plus a few other miscellaneous classifiers. Among these, adaptive
classifiers were demonstrated to be generally superior to static ones, even with unsupervised
adaptation. Transfer learning can also prove useful, although its benefits remain
unpredictable. Riemannian geometry-based methods have reached state-of-the-art
performances on multiple BCI problems and deserve to be explored more thoroughly, along
with tensor-based methods. Shrinkage linear discriminant analysis and random forests also
appear particularly useful for small training-sample settings. On the other hand, deep learning
methods have not yet shown convincing improvement over state-of-the-art BCI methods.
Significance. This paper provides a comprehensive overview of the modern classification
algorithms used in EEG-based BCIs, presents the principles of these methods and guidelines
on when and how to use them. It also identifies a number of challenges to further advance
EEG classification in BCI.
1. Introduction

A brain–computer interface (BCI) can be defined as a system that translates the brain activity patterns of a user into messages or commands for an interactive application, this activity being measured and processed by the system [44, 139, 229]. A BCI user's brain activity is typically measured using electroencephalography (EEG). For instance, a BCI can enable a user to move a cursor to the left or to the right of a computer screen by imagining left or right hand movements, respectively [230]. As they make computer control possible without any physical activity, EEG-based BCIs promise to revolutionize many application areas, notably to enable severely motor-impaired users to control assistive technologies, e.g. text input systems or wheelchairs [181], as rehabilitation devices for stroke patients [8], as new gaming input devices [52], or to design adaptive human–computer interfaces that can react to the user's mental states [237], to name a few [45, 216].

In order to use a BCI, two phases are generally required: (1) an offline training phase during which the system is calibrated and (2) the operational online phase in which the system can recognize brain activity patterns and translate them into commands for a computer [136]. An online BCI system is a closed loop, starting with the user producing a specific EEG pattern (e.g. using motor imagery) and these EEG signals being measured. Then, EEG signals are typically pre-processed using various spatial and spectral filters [23], and features are extracted from these signals in order to represent them in a compact form [140]. Finally, these EEG features are classified [141] before being translated into a command for an application [45] and before feedback is provided to users to inform them whether a specific mental command was recognized or not [170].

Although much effort is currently under way towards calibration-free modes of operation, an offline calibration is currently used and is necessary in most BCIs to obtain a reliable system. In this stage, the classification algorithm is calibrated and the optimal features from multiple EEG channels are selected. For this calibration, a training data set needs to be pre-recorded from the user. EEG signals are highly user-specific, and as such, most current BCI systems are calibrated specifically for each user. This training data set contains EEG signals recorded while the user performed each mental task of interest several times, according to given instructions.

There are various key elements in the BCI closed loop, one being the classification algorithms, a.k.a. classifiers, used to recognize the users' EEG patterns based on EEG features. There was, and still is, a large diversity of classifier types that are used and have been explored to design BCIs, as presented in our 2007 review of classifiers for EEG-based BCIs [141]. Now, approximately ten years after this initial review was published, many new algorithms have been designed and explored in order to classify EEG signals in BCI, and BCIs are more popular than ever. We therefore believe that the time is ripe to update this review of EEG classifiers. Consequently, in this paper, we survey the literature on BCI and machine learning from 2007 to 2017 in order to identify which new EEG classification algorithms have been investigated to design BCIs, and which appear to be the most efficient. (This updated review describes more advanced classification concepts and algorithms than the ones presented in the initial review [141]. We thus advise readers new to the EEG classification field to start by reading [141], as that paper is more accessible, and the concepts it presented will not be explained again in the current manuscript.) Note that we also include in the present review machine learning methods for EEG feature extraction, notably to optimize spatial filters, which have become a key component of BCI classification approaches. We synthesize these readings in order to present these algorithms, to report how they were used for BCIs and what were the outcomes. We also identify their pros and cons in order to provide guidelines regarding how and when to use a specific classification method, and propose some challenges that must be solved to enable further progress in EEG signal classification.

This paper is organized as follows. Section 2 briefly presents the typically used EEG feature extraction and selection techniques, as these features are usually the input to classifiers. It also summarizes the classifier performance evaluation metrics. Then, section 3 provides a summary of the classifiers that were used for EEG-based BCIs up to 2007, many of which are still in use today, as well as the challenges faced by current EEG classification methods. Section 4 is the core of the paper, as it reviews the classification algorithms for BCI that have been explored since 2007 to address these various challenges. These algorithms are discussed in section 5, where we also propose guidelines on how and when to use them, and identify some remaining challenges. Finally, section 6 concludes the paper.

2. Feature extraction and selection, and performance measures in brief

The present paper is dedicated to classification methods for BCI. However, most pattern recognition/machine learning pipelines, and BCIs are no exception, not only use a classifier, but also apply feature extraction/selection techniques to represent EEG signals in a compact and relevant manner. In particular for BCI, EEG signals are typically filtered both in the time domain (band-pass filter) and the spatial domain (spatial filter) before features are extracted from the resulting signals. The best subsets of features are then identified using feature selection algorithms, and these features are used to train a classifier. This process is illustrated in figure 1. In this section, we briefly discuss which features are typically used in BCI, how to select the most relevant features amongst these, and how to evaluate the resulting pattern recognition pipeline.
Figure 1. Typical classification process in EEG-based BCI systems. The oblique arrow denotes algorithms that can be or have to be
optimized from data. A training phase is typically necessary to identify the best filters and features and to train the classifier. The resulting
filters, features and classifier are then used online to operate the BCI.
2.1. Feature extraction

While there are many ways in which EEG signals can be represented (e.g. [16, 136, 155]), the two most common types of features used to represent EEG signals are frequency band power features and time point features.

Band power features represent the power (energy) of EEG signals in a given frequency band and a given channel, averaged over a given time window (typically 1 second for many BCI paradigms). Band power features can be computed in various ways [28, 87], and are extensively used for BCIs exploiting oscillatory activity, i.e. changes in EEG rhythm amplitudes. As such, band power features are the gold standard features for BCIs based on motor and mental imagery, for many passive BCIs aiming at decoding mental states such as mental workload or emotions, and for steady state visual evoked potential (SSVEP)-based BCIs.

Time point features are a concatenation of EEG samples from all channels. Typically, such features are extracted after some pre-processing, notably band-pass or low-pass filtering and down-sampling. They are the typical features used to classify event-related potentials (ERPs), which are temporal variations in EEG signal amplitudes time-locked to a given event/stimulus [22, 136]. These are the features used in most P300-based BCIs.

Both types of features benefit from being extracted after spatial filtering [22, 136, 185, 188]. Spatial filtering consists of combining the original sensor signals, usually linearly, which can result in a signal with a higher signal-to-noise ratio than that of individual sensors. Spatial filtering can be data independent, e.g. based on physical considerations regarding how EEG signals travel through the skin and skull, leading to spatial filters such as the well-known Laplacian filter [159] or inverse solution based spatial filtering [18, 101, 124, 173]. Spatial filters can also be obtained in a data-driven and unsupervised manner with methods such as principal component analysis (PCA) or independent component analysis (ICA) [98]. Finally, spatial filters can be obtained in a data-driven manner with supervised learning, which is currently one of the most popular approaches. Supervised spatial filters include the well-known common spatial patterns (CSP) [23, 185], dedicated to band-power features and oscillatory activity BCIs, and spatial filters such as xDAWN [188] or Fisher spatial filters [92] for ERP classification based on time point features. Owing to the good classification performances obtained by such supervised spatial filters in practice, many variants of these algorithms have been developed that are more robust to noise or non-stationary signals, using regularization approaches, robust data averaging and/or new divergence measures (e.g. [143, 187, 194, 211, 233]). Similarly, extensions of these approaches have been proposed to optimize spectral and spatial filters simultaneously (e.g. the popular filter bank CSP (FBCSP) method [7] and others [61, 88, 161]). Finally, some approaches have combined physically-driven spatial filters based on inverse models with data-driven spatial filters (e.g. [49, 148]).

While spatial filtering followed by either band power or time point feature extraction is by far the most common approach in current EEG-based BCIs, other feature types have also been explored and used. Firstly, an increasingly used type is connectivity features. Such features measure the correlation or synchronization between signals from different sensors and/or frequency bands. This can be measured using features such as spectral coherence, phase locking values or directed transfer functions, among many others [31, 79, 110, 167, 225, 240]. Researchers have also explored various EEG signal complexity measures or higher order statistics as features of EEG signals (e.g. [11, 29, 135, 248]).
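As an illustration of this standard pipeline, the following is a minimal sketch (in Python, using NumPy and SciPy) that band-pass filters a set of trials, computes CSP spatial filters through one common generalized eigenvalue formulation, and extracts log band-power features for a two-class oscillatory BCI. The 8–30 Hz band, the number of filters and the array shapes are illustrative assumptions, not prescriptions from the works cited above.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.signal import butter, filtfilt

def bandpass(trials, fs, band=(8.0, 30.0), order=4):
    """Band-pass filter trials of shape (n_trials, n_channels, n_samples)."""
    b, a = butter(order, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    return filtfilt(b, a, trials, axis=-1)

def csp_filters(trials_a, trials_b, n_filters=6):
    """CSP via the generalized eigenvalue problem Ca w = lambda (Ca + Cb) w."""
    mean_cov = lambda trials: np.mean([np.cov(t) for t in trials], axis=0)
    Ca, Cb = mean_cov(trials_a), mean_cov(trials_b)
    eigvals, eigvecs = eigh(Ca, Ca + Cb)       # eigenvalues in ascending order
    idx = np.argsort(eigvals)                  # keep both ends of the spectrum
    picks = np.r_[idx[:n_filters // 2], idx[-(n_filters // 2):]]
    return eigvecs[:, picks].T                 # (n_filters, n_channels)

def logvar_features(trials, W):
    """Log-variance (band power) features of the spatially filtered signals."""
    filtered = np.einsum("fc,ncs->nfs", W, trials)
    return np.log(np.var(filtered, axis=-1))   # (n_trials, n_filters)
```

Such features would then feed any of the classifiers discussed in the remainder of this paper.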
Finally, rather than using vectors of features, recent research has also explored how to represent EEG signals by covariance matrices or by tensors (i.e. multi-way arrays, with two or more dimensions), and how to classify these matrices or tensors directly [38, 47, 232]. Such approaches are discussed in section 4.2. It should be mentioned that when using matrix or tensor decompositions, the resulting features are linear combinations of various sensors' data, time points or frequencies (among others). As such, they may not have an obvious physical/physiological interpretation, but nonetheless prove useful for BCI design.

Finally, it is interesting to note that several BCI studies have reported that combining various types of features, e.g. time points with band powers, or band powers with connectivity features, generally leads to higher classification accuracies than using a single feature type (e.g. [29, 60, 70, 93, 166, 191]). Combining multiple feature types typically increases dimensionality; hence it requires the selection of the most relevant features to avoid the curse of dimensionality. Methods to reduce dimensionality are described in the following section.

2.2. Feature selection

A feature selection step can be applied after the feature extraction step to select a subset of features, with various potential benefits [82]. Firstly, among the various features that one may extract from EEG signals, some may be redundant or may not be related to the mental states targeted by the BCI. Secondly, the number of parameters that the classifier has to optimize is positively correlated with the number of features. Reducing the number of features thus leads to fewer parameters to be optimized by the classifier. It also reduces possible overtraining effects and can thus improve performance, especially if the number of training samples is small. Thirdly, from a knowledge extraction point of view, if only a few features are selected and/or ranked, it is easier to observe which features are actually related to the targeted mental states. Fourthly, a model with fewer features, and consequently fewer parameters, can produce faster predictions for a new sample, as it should be computationally more efficient. Fifthly, the collection and storage of data will be reduced. Three feature selection approaches have been identified [106]: the filter, wrapper and embedded approaches. Many alternative methods have been proposed for each approach.

Filter methods rely on measures of the relationship between each feature and the target class, independently of the classifier to be used. The coefficient of determination, which is the square of the estimate of the Pearson correlation coefficient, can be used as a feature ranking criterion [85]. The coefficient of determination can also be used for a two-class problem, labelling the classes as −1 and +1. The correlation coefficient can only detect linear dependencies between features and classes, though. To exploit non-linear relationships, a simple solution is to apply non-linear pre-processing, such as taking the square or the log of the features. Ranking criteria based on information theory can also be used, e.g. the mutual information between each feature and the target variable [82, 180]. Many filter feature selection approaches require estimates of the probability densities and of the joint density of the feature and class label from the data. One solution is to discretize the features and class labels. Another solution is to approximate their densities with a non-parametric method such as Parzen windows [179]. If the densities are estimated by a normal distribution, the result obtained by the mutual information will be similar to the one obtained by the correlation coefficient. Filter approaches have a linear complexity with respect to the number of features. However, this may lead to a selection of redundant features [106].

Wrapper and embedded approaches solve this problem at the cost of a longer computation time. These approaches use a classifier to obtain a subset of features. Wrapper methods select a subset of features, present it as input to a classifier for training, observe the resulting performance, and either stop the search according to a stopping criterion or propose a new subset if the criterion is not satisfied. Embedded methods integrate the feature selection and the evaluation in a unique process, e.g. in a decision tree [27, 184] or a multilayer perceptron with optimal cell damage [37].

Feature selection has provided important improvements in BCI, e.g. stepwise linear discriminant analysis (an embedded method) for P300-BCI [111] and frequency band selection for motor imagery using maximal mutual information (a filter method) [7]. Let us also mention the support vector machine for channel selection [115], linear regressors for knowledge extraction [123], genetic algorithms for spectral feature selection [50] and P300-based feature selection [201], or evolutionary algorithms for feature selection based on multiresolution analysis [176] (all being wrapper methods). Indeed, metaheuristic techniques (also including ant colony, swarm search, tabu search and simulated annealing) [152] are becoming more and more frequently used for feature selection in BCI [174] in order to avoid the curse of dimensionality. Other popular methods used in EEG-based BCIs notably include filter methods such as maximum relevance minimum redundancy (mRMR) feature selection [166, 180] or R2 feature selection [169, 217]. It should be mentioned that five feature selection methods, namely information gain ranking, correlation-based feature selection, Relief (an instance-based feature ranking method for multiclass problems), consistency-based feature selection and 1R ranking (one-rule classification), have been evaluated on the BCI competition III data sets [107]. Amongst ten classifiers, the top three feature selection methods were correlation-based feature selection, information gain and 1R ranking, respectively.
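As a concrete example of a filter approach, the sketch below ranks features by their estimated mutual information with the class labels using scikit-learn. This is only one of the many criteria and implementations discussed above, and the number of retained features k is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def select_features(X, y, k=20):
    """X: (n_trials, n_features) feature matrix; y: (n_trials,) class labels."""
    selector = SelectKBest(score_func=mutual_info_classif, k=k)
    X_selected = selector.fit_transform(X, y)   # keep the k highest-MI features
    kept = selector.get_support(indices=True)   # indices of retained features
    return X_selected, kept
```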
2.3. Performance measures

To evaluate BCI performance, one must bear in mind that different components of the BCI loop are at stake [212]. Regarding the classifier alone, the most basic performance measure is classification accuracy. This is valid only if the classes are balanced [66], i.e. with the same number of samples per class, and if the classifier is unbiased, i.e. it has the same performance for each class [199].
If these conditions are not met, the Kappa metric or the confusion matrix are more informative performance measures [66]. The sensitivity-specificity pair, or precision, can be computed from the confusion matrix. When the classification depends on a continuous parameter (e.g. a threshold), the receiver operating characteristic (ROC) curve and the area under the curve (AUC) are often used.

Classifier performance is generally computed offline on pre-recorded data, using a hold-out strategy: some data sets are set aside to be used for the evaluation, and are not part of the training data set. However, some authors also report cross-validation measures estimated on training data, which may over-rate the performance.

The contribution of classifier performance to overall BCI performance strongly depends on the orchestration of the BCI subcomponents. This orchestration is highly variable given the variety of BCI systems (co-adaptive, hybrid, passive, self- or system-paced). The reader is referred to [212] for a comprehensive review of evaluation strategies in such BCI contexts.
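The classifier-level measures discussed above can be computed as in the following sketch (using scikit-learn). The predicted labels and continuous classifier outputs are assumed to come from a held-out test set, as per the hold-out strategy described above.

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, roc_auc_score)

def report_performance(y_true, y_pred, y_score):
    """y_pred: predicted labels; y_score: continuous outputs (e.g. distances
    to the separating hyperplane), needed for the ROC/AUC (binary case)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),  # valid for balanced classes
        "kappa": cohen_kappa_score(y_true, y_pred),  # corrects for chance level
        "confusion": confusion_matrix(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
    }
```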
3. Past methods and current challenges

3.1. A brief overview of methods used ten years ago

In our original review of classification algorithms for EEG-based BCIs published ten years ago, we identified five main families of classifiers that had been explored: linear classifiers, neural networks, non-linear Bayesian classifiers, nearest neighbour classifiers and classifier combinations [141].

Linear classifiers gather discriminant classifiers that use linear decision boundaries between the feature vectors of each class. They include linear discriminant analysis (LDA), regularized LDA and support vector machines (SVMs). Both LDA and SVM were, and still are, the most popular types of classifiers for EEG-based BCIs, particularly for online and real-time BCIs. The previous review highlighted that, in terms of performance, SVM often outperformed other classifiers.

Neural networks (NNs) are assemblies of artificial neurons, arranged in layers, which can be used to approximate any non-linear decision boundary. The most common type of NN used for BCI at that time was the multi-layer perceptron (MLP), typically employing only one or two hidden layers. Other NN types were explored more marginally, such as the Gaussian classifier NN or learning vector quantization (LVQ) NN.

Non-linear Bayesian classifiers are classifiers modelling the probability distributions of each class, using Bayes' rule to select the class to assign to the current feature vector. Such classifiers notably include Bayes quadratic classifiers and hidden Markov models (HMMs).

Nearest neighbour classifiers assign a class to the current feature vector according to its nearest neighbours. Such neighbours could be training feature vectors or class prototypes. Such classifiers include the k-nearest neighbour (kNN) algorithm or Mahalanobis distance classifiers.

Finally, classifier combinations are algorithms combining multiple classifiers, either by combining their outputs and/or by training them in ways that maximize their complementarity. Classifier combinations used for BCI at the time included boosting, voting or stacking combination algorithms. Classifier combination appeared to be amongst the best performing classifiers for EEG-based BCIs, at least in offline evaluations.

3.2. Challenges faced by current EEG signal classification methods

Ten years ago, most classifiers explored for BCI were rather standard classifiers used in multiple machine learning problems. Since then, research efforts have focused on identifying and designing classification methods dedicated to the specificities of EEG-based BCIs. In particular, the main challenges faced by classification methods for BCI are the low signal-to-noise ratio of EEG signals [172, 228], their non-stationarity over time, within or between users, with same-user EEG signals varying between or even within runs [56, 80, 109, 145, 164, 202], the limited amount of training data that is generally available to calibrate the classifiers [108, 137], and the overall low reliability and performance of current BCIs [109, 138, 139, 229].

Therefore, most of the algorithms studied these past ten years aimed at addressing one or more of these challenges. More precisely, adaptive classifiers, whose parameters are incrementally updated online, were developed to deal with EEG non-stationarity in order to track changes in EEG properties over time. Adaptive classifiers can also be used to deal with limited training data by learning online, thus requiring fewer offline training data. Transfer learning techniques aim at transferring features or classifiers from one domain, e.g. BCI subjects or sessions, to another domain, e.g. other subjects or other sessions from the same subject. As such, they also aim at addressing within- or between-subject non-stationarity and limited training data, by complementing the few training data available with data transferred from other domains. Finally, in order to compensate for the low EEG signal-to-noise ratio and the poor reliability of current BCIs, new methods were explored to process and classify signals in a single step by merging feature extraction, feature selection and classification. This was achieved by using matrix (notably Riemannian) and tensor classifiers, as well as deep learning. Additional methods explored were targeted specifically at learning from limited amounts of data and at dealing with multiple-class problems. We describe these new families of methods in the following.

4. New EEG classification methods since 2007

4.1. Adaptive classifiers

4.1.1. Principles. Adaptive classifiers are classifiers whose parameters, e.g. the weights attributed to each feature in a linear discriminant hyperplane, are incrementally re-estimated and updated over time as new EEG data become available [200, 202]. This enables the classifier to track possibly changing feature distributions, and thus to remain effective even with non-stationary signals such as EEG. Adaptive classifiers for BCI were first proposed in the mid-2000s, e.g. in [30, 72, 163, 202, 209], and were shown to be promising in offline analysis.
Since then, more advanced adaptation techniques have been proposed and tested, including in online experiments.

Adaptive classifiers can employ both supervised and unsupervised adaptation, i.e. with or without knowledge of the true class labels of the incoming data, respectively. With supervised adaptation, the true class labels of the incoming EEG signals are known, and the classifier is either retrained on the available training data augmented with these new, labelled incoming data, or updated based on this new data only [200, 202]. Supervised BCI adaptation requires guided user training, for which the users' commands are imposed and thus the corresponding EEG class labels are known. Supervised adaptation is not possible with free BCI use, as the true labels of the incoming EEG data are unknown. With unsupervised adaptation, the label of the incoming EEG data is unknown. As such, unsupervised adaptation is based either on an estimation of the data class labels for retraining/updating, as discussed in [104], or on class-unspecific adaptation, e.g. the general, all-classes EEG data mean [24, 219] or a covariance matrix [238] is updated in the classifier model. A third type of adaptation, in between supervised and unsupervised methods, has also been explored: semi-supervised adaptation [121, 122]. Semi-supervised adaptation consists of using both initial labelled data and incoming unlabelled data to adapt the classifier. For BCI, semi-supervised adaptation is typically performed by (1) initially training a supervised classifier on the available labelled training data, then (2) estimating the labels of incoming unlabelled data with this classifier, and (3) adapting/retraining the classifier using these initially unlabelled data assigned to their estimated labels, combined with the known available labelled training data. This process is repeated as new batches of unlabelled incoming EEG data become available.
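To make these principles concrete, here is a minimal sketch of supervised incremental adaptation for an LDA classifier: class means and a pooled covariance are re-estimated with exponential forgetting as new labelled trials arrive, and the discriminant is recomputed from the running estimates. The forgetting factor and the binary setup are illustrative assumptions; the works cited in the next section study principled update rules.

```python
import numpy as np

class AdaptiveLDA:
    """Supervised incremental LDA: means/covariance updated per labelled trial."""

    def __init__(self, n_features, eta=0.05):
        self.eta = eta                      # adaptation speed (forgetting factor)
        self.means = {0: np.zeros(n_features), 1: np.zeros(n_features)}
        self.cov = np.eye(n_features)       # pooled covariance estimate

    def update(self, x, label):
        """Incremental supervised update with one new labelled trial x."""
        self.means[label] += self.eta * (x - self.means[label])
        d = x - self.means[label]
        self.cov = (1 - self.eta) * self.cov + self.eta * np.outer(d, d)

    def predict(self, x):
        w = np.linalg.solve(self.cov, self.means[1] - self.means[0])
        b = -0.5 * w @ (self.means[1] + self.means[0])
        return int(w @ x + b > 0)
```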
4.1.2. State-of-the-art. So far, the majority of the work on adaptive classifiers for BCI has been based on supervised adaptation. Multiple adaptive classifiers were explored offline, such as LDA or quadratic discriminant analysis (QDA) [200] for motor imagery-based BCI. An adaptive LDA was also proposed based on Kalman filtering to track the distribution of each class [96]. In order to deal with possibly imperfect labels in supervised adaptation, [236] proposed and evaluated offline an adaptive Bayesian classifier based on sequential Monte Carlo sampling that explicitly models uncertainty in the observed labels. For ERP-based BCI, [227] explored an offline adaptive support vector machine (SVM), adaptive LDA, a stochastic gradient-based adaptive linear classifier, and online passive-aggressive (PA) algorithms. Interestingly, McFarland and colleagues demonstrated in offline analysis of EEG data over multiple sessions that continuously retraining the weights of linear classifiers in a supervised manner improved the performance of sensorimotor rhythm (SMR)-based BCI, but not of the P300-based BCI speller [160]. However, results presented in [197] suggested that continuous adaptation was beneficial for the asynchronous P300-BCI speller, and [227] suggested the same for passive BCI based on the P300.

Online, still using supervised adaptation, both adaptive LDA and QDA have been explored successfully in [222]. In [86], an adaptive probabilistic neural network was also used for online adaptation with a motor imagery BCI. Such a classifier models the feature distributions of each class in a non-parametric fashion, and updates them as new trials become available. Classifier ensembles were also explored to create adaptive classifiers. In [119], a dynamic ensemble of five SVM classifiers was created by training a new SVM for each batch of new incoming labelled EEG trials, adding it to the ensemble and removing the oldest SVM. Classification was performed using a weighted sum of each SVM output. This approach was shown online to be superior to a static classifier.

Regarding supervised adaptation, it should be mentioned that adaptive spatial filters were also proposed, notably several variants of adaptive CSP [204, 247], but also adaptive xDAWN [227].

Unsupervised adaptation of classifiers is obviously much more difficult, as the class labels, and hence the class-specific variability, are unknown. Thus, unsupervised methods have been proposed to estimate the class labels of new incoming samples before adapting the classifier based on this estimation. This technique was explored offline in [24] and [129], and online in [83], for an LDA classifier and Gaussian mixture model (GMM) estimation of the incoming class labels, with motor imagery data. Offline, fuzzy C-means (FCM) were also explored instead of GMM to track the class means and covariance for an LDA classifier [130]. Similarly, a non-linear Bayesian classifier was adapted using either unsupervised or semi-supervised learning (i.e. only some of the incoming trials were labelled), using extended Kalman filtering to track the changes in the class distribution parameters with auto-regressive (AR) features [149]. Another simple unsupervised adaptation of the LDA classifier for motor imagery data was proposed and evaluated on both offline and online data [219]. The idea was to not incrementally adapt all of the LDA parameters, but only its bias, which can be estimated without knowing the class labels if we know that the data are balanced, i.e. with the same number of trials per class on average. This approach was extended to the multiclass LDA case, and evaluated in an offline scenario in [132].
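The bias-only unsupervised adaptation of [219] described above admits a particularly compact implementation, sketched below under the stated assumption of balanced classes: the global mean of incoming trials is tracked without labels, and only the LDA bias is moved so that the hyperplane keeps passing through that mean. The learning rate is an illustrative choice.

```python
import numpy as np

def adapt_lda_bias(w, global_mean, x, eta=0.05):
    """w: fixed LDA weight vector; x: new unlabelled trial (feature vector)."""
    global_mean = (1 - eta) * global_mean + eta * x   # label-free running mean
    bias = -w @ global_mean                           # re-centre the hyperplane
    return global_mean, bias
```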
Adaptation can be performed according to reinforcement signals (RS), indicating whether a trial was erroneously classified by the BCI. Such reinforcement signals can be deduced from error-related potentials (ErrP), potentials appearing following a perceived error which may have been committed by either the user or the machine [68]. In [133], an incremental logistic regression classifier was proposed, which was updated along the error gradient when a trial was judged to be misclassified according to the detection of an ErrP. The strength of the classifier update was also proportional to the probability of this ErrP. A Gaussian probabilistic classifier incorporating an RS was later proposed in [131], in which the update rules for the mean and covariance of each class depend on the probability of the RS. This classifier can thus incorporate a supervised, unsupervised or semi-supervised adaptation mode, according to whether the probability of the RS is always correct as either 0 or 1 (supervised case), uniform, i.e. uninformative (unsupervised case), or a continuous probability with some uncertainty (partially supervised case). Using simulated supervised RS, this method was shown to be superior to static LDA and to the other supervised and unsupervised adaptive LDAs discussed above [131]. Evaluations with real-world data remain to be performed. Also using ErrP, in offline simulations of an adaptive movement-related potential (MRP)-BCI, [9] augmented the training set with incoming trials, but only with those that were classified correctly, as determined by the absence of an ErrP following feedback to the user. They also removed the oldest trials from the training set as new trials became available. Then, the parameters of the classifier, an incremental SVM, were updated based on the updated training set. ErrP-based classifier adaptation was explored online for code-modulated visual evoked potential (c-VEP) classification in [206]. In this work, the label of the incoming trial was estimated as the one decided by the classifier if no ErrP was detected, and as the opposite label otherwise (for binary classification). Then, this newly labelled trial was added to the training set, and the classifier and spatial filter, a one-class SVM and canonical correlation analysis (CCA), respectively, were retrained on the new data. Finally, [239] demonstrated that classifier adaptation based on RS could also be performed using classifier confidence, and that such adaptation was beneficial to P300-BCI.

For ERP-based BCI, semi-supervised adaptation was explored with SVM, and enabled the calibration of a P300 speller with less data as compared to a fixed, non-adaptive classifier [122, 151]. This method was later tested and validated online in [81]. For P300-BCI, a co-training semi-supervised adaptation was performed in [178]. In this work, two classifiers were used: a Bayesian LDA and a standard LDA. Each was initially trained on labelled training data, and then used to estimate the labels of unlabelled incoming data. The latter were labelled with their estimated class label and used as additional training data to retrain the other classifier, hence the co-training. This semi-supervised approach was shown offline to lead to higher bit-rates than a fully supervised method, which requires more supervised training data. On the other hand, offline semi-supervised adaptation with an LDA as classifier failed on mental imagery data, probably owing to the poor robustness of the LDA to mislabelling [137]. Finally, for both offline and online data, [104, 105] proposed a probabilistic method to adaptively estimate the parameters of a linear classifier in P300-based spellers, which led to a drastic reduction in calibration time, essentially removing the need for the initial calibration. This method exploited the specific structure of the P300 speller, and notably the frequency of samples from each class at each time, to estimate the probability of the most likely class label. In a related work, [78] proposed a generic method to adaptively estimate the parameters of the classifier without knowing the true class labels, by exploiting any structure that the application may have. Semi-supervised adaptation was also used offline for multi-class motor imagery with a kernel discriminant analysis (KDA) classifier in [171]. This method showed its superiority over non-adaptive methods, as well as over adaptive unsupervised LDA methods.

Vidaurre et al also explored co-adaptive training, where both the machine and the user are continuously learning, by using adaptive features and an adaptive LDA classifier [220, 221]. This enabled some users who were initially unable to control the BCI to achieve better than chance classification performances. This work was later refined in [64] by using a simpler but fully adaptive setup with auto-calibration, which proved to be effective both for healthy users and for users with disabilities [63]. Co-adaptive training using adaptive CSP patches proved to be even more efficient [196].

Adaptive classification approaches used in BCI are summarized in tables 1 and 2, for supervised and unsupervised methods, respectively.

Table 1. Summary of adaptive supervised classification methods explored offline.

EEG pattern | Features | Classifier | References
Motor imagery | Band power | Adaptive LDA/QDA | [200]
Motor imagery | Fractal dimension | Adaptive LDA | [96]
Motor imagery | Band power | Adaptive LDA/QDA | [222]
Motor imagery | Band power | Adaptive probabilistic NN | [86]
Motor imagery | CSP | Dynamic SVM ensemble | [119]
Motor imagery | Adaptive CSP | SVM | [204, 247]
Motor execution | AR parameters | Adaptive Gaussian classifier | [236]
P300 | Time points with adaptive xDAWN | Adaptive LDA/SVM, online PA classifier | [227]

Table 2. Summary of adaptive unsupervised classification methods explored.

EEG pattern | Features | Classifier | References
Motor imagery | Band power | Adaptive LDA with GMM | [24, 83, 129]
Motor imagery | Band power | Adaptive LDA with FCM | [130]
Motor execution | AR parameters | Adaptive Gaussian classifier | [149]
Motor imagery | Band power | Adaptive LDA | [132, 219]
Motor imagery | Band power | Adaptive Gaussian classifier | [131]
Motor imagery | Band power | Semi-supervised CSP+LDA | [137]
Motor imagery | Adaptive band power | Adaptive LDA | [63, 64, 220, 221]
Motor imagery | Adaptive CSP patches | Adaptive LDA | [196]
Covert attention | Band power | Incremental logistic regression | [133]
MRP | Band power | Incremental SVM | [9]
c-VEP | CCA | Adaptive one-class SVM | [206]
P300 | Time points | SWLDA | [239]
P300 | Time points | Semi-supervised SVM | [81, 122, 151]
P300 | Time points | Co-training LDA | [178]
P300 | Time points | Unsupervised linear classifier | [104, 105]
ErrP | Time points | Unsupervised linear classifier | [78]

4.1.3. Pros and cons. Adaptive classifiers were repeatedly shown to be superior to non-adaptive ones for multiple types of BCI, notably motor imagery BCIs, but also for some ERP-based BCIs. To the best of our knowledge, adaptive classifiers have apparently not been explored for SSVEP-BCI. Naturally, supervised adaptation is the most efficient type of adaptation, as it has access to the real labels. Nonetheless, unsupervised adaptation has been shown to be superior to static classifiers in multiple studies [24, 130, 132, 149, 219]. It can also be used to shorten or even remove the need for calibration [78, 81, 105, 122, 151]. There is a need for more robust unsupervised adaptation methods, as the majority of actual BCI applications do not provide labels, and thus can only rely on unsupervised methods.

For unsupervised adaptation, reward signals, and notably ErrP, have been exploited in multiple papers (e.g. [9, 206, 239]). Note, however, that ErrP decoding from EEG signals may be a difficult task. Indeed, [157] demonstrated that the decoding accuracy of ErrP was positively correlated with the P300 decoding accuracy. This means that people who make errors in the initial BCI task (here a P300), for whom error correction and ErrP-based adaptation would be the most useful, have a lesser chance that the ErrP will be correctly decoded. There is thus a need to identify robust reward signals.
Only a few of the proposed methods were actually used online. For unsupervised methods, a simple and effective one that demonstrated its value online in several studies is the adaptive LDA proposed by Vidaurre et al [219]. This and other methods that are based on incremental adaptation (i.e. updating the algorithm's parameters rather than fully re-optimizing them) generally have a computational complexity that is low enough to be used online. Adaptive methods that require fully retraining the classifier with new incoming data generally have a much higher computational complexity (e.g. regularly retraining an SVM from scratch in real time requires a lot of computing power), which might prevent them from being actually used online.

However, more online studies are clearly necessary to determine how adaptation should be performed in practice, with a user in the loop. This is particularly important for mental imagery BCIs, in which human learning is involved [147, 170]. Indeed, because the user is adapting to the BCI by learning how to perform mental imagery tasks so that they are recognized by the classifier, adaptation may not always help and may even be confusing to the user, as it may lead to continuously-changing feedback. Both machine and human learning may not necessarily converge to a suitable and stable solution. A recent theoretical model of this two-learner problem was proposed in [168], and indicated that adaptation that is either too fast or too slow can actually be detrimental to user learning. There is thus a need to design adaptive classifiers that ensure and favour human learning.

4.2. Classifying EEG matrices and tensors

4.2.1. Riemannian geometry-based classification.

Principles. The introduction of Riemannian geometry in the field of BCI has challenged some of the conventions adopted in the classic classification approaches; instead of estimating spatial filters and/or selecting features, the idea of a Riemannian geometry classifier (RGC) is to map the data directly onto a geometrical space equipped with a suitable metric. In such a space, data can be easily manipulated for several purposes, such as averaging, smoothing, interpolating, extrapolating and classifying. For example, in the case of EEG data, mapping entails computing some form of covariance matrix of the data. The principle of this mapping is based on the assumption that the power and the spatial distribution of EEG sources can be considered fixed for a given mental state, and that such information can be coded by a covariance matrix. Riemannian geometry studies smooth curved spaces that can be locally and linearly approximated. The curved space is named a manifold, and its linear approximation at each point is the tangent space. In a Riemannian manifold, the tangent space is equipped with an inner product (metric) varying smoothly from point to point. This results in a non-Euclidean notion of distance between any two points (e.g. each point may be a trial) and a consequent notion of centre of mass of any number of points (figure 2). Therefore, instead of using the Euclidean distance, called the extrinsic distance, an intrinsic distance is used, which is adapted to the geometry of the manifold, and thus to the manner in which the data have been mapped [47, 232].

Amongst the most common matrix manifolds used for BCI applications, we encounter the manifold of Hermitian or symmetric positive definite (SPD) matrices [19] when dealing with covariance matrices estimated from EEG trials, and the Stiefel and Grassmann manifolds [62] when dealing with subspaces or orthogonal matrices. Several machine learning problems can be readily extended to those manifolds by taking advantage of their geometrical constraints (i.e. learning on manifolds). Furthermore, optimization problems can be formulated specifically on such spaces, which is leading to several new optimization methods and to the solution of new problems [2]. Although related, manifold learning, which consists of empirically attempting to locate the non-linear subspace in which a data set is defined, is different in concept and will not be covered in this paper. To illustrate these notions, consider the case of SPD matrices. The square of the intrinsic distance between two SPD matrices C1 and C2 has a closed-form expression given by

(1) δ^2(C1, C2) = ||log(C1^(-1/2) C2 C1^(-1/2))||_F^2 = Σ_i log^2(λ_i)

where the λ_i are the eigenvalues of the matrix C1^(-1) C2.
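The following sketch illustrates equation (1) and the simplest use of the intrinsic distance for classification: assigning a trial's covariance matrix to the nearest class reference matrix. The class references are assumed given here; a faithful RMDM implementation (see table 3 below) would use the Riemannian (geometric) mean of each class's training covariances, as provided by libraries such as pyRiemann.

```python
import numpy as np
from scipy.linalg import eigvalsh

def riemann_distance(C1, C2):
    """Affine-invariant Riemannian distance; its square is equation (1)."""
    lam = eigvalsh(C2, C1)            # eigenvalues of C1^(-1) C2
    return np.sqrt(np.sum(np.log(lam) ** 2))

def classify_trial(X, class_refs):
    """X: (n_channels, n_samples) EEG trial; class_refs: one SPD matrix per class."""
    C = np.cov(X)                     # trial covariance matrix
    return int(np.argmin([riemann_distance(C, G) for G in class_refs]))
```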
Figure 2. Schematic representation of a Riemannian manifold. EEG trials are represented by points. Left: representation of the tangent space at point G. The shortest path on the manifold connecting two points C1 and C2 is named the geodesic, and its length is the Riemannian distance between them. Curves on the manifold through a point are mapped onto the tangent space as straight lines (local approximation). Right: G represents the centre of mass (mean) of points C1, C2, C3 and C4. It is defined as the point minimizing the sum of the squared distances between itself and the four points. The centre of mass is often used in RGCs as a representative for a given class.
… tuning, for example by cross-validation. Hence, Riemannian geometry provides new tools for building simple, more robust and accurate prediction models.

Several reasons have been proposed to advocate the use of Riemannian geometry. Due to its logarithmic nature, the Riemannian distance is robust to extreme values, that is, to noise. Also, the intrinsic Riemannian distance for SPD matrices is invariant both to matrix inversion and to any linear invertible transformation of the data, e.g. any mixing applied to the EEG sources does not change the distances among the observed covariance matrices. These properties in part explain why Riemannian classification methods provide a good generalization capability [224, 238], which enabled researchers to set up calibration-free adaptive ERP-BCIs using simple subject-to-subject and session-to-session transfer learning strategies [6].

Interestingly, as illustrated in [94], it is possible not only to interpolate along geodesics (figure 2) on the SPD manifold, but also to extrapolate (e.g. forecast) without leaving the manifold, respecting the geometrical constraints. For example, in [99] interpolation was used for data augmentation by generating artificial covariance matrices along geodesics, but extrapolation could also have been used. Often, the Riemannian interpolation is more relevant than its Euclidean counterpart, as it does not suffer from the so-called swelling effect [232]. This effect describes the fact that a Euclidean interpolation between two SPD matrices does not involve the determinant of the matrix as it should (i.e. the determinant of the Euclidean interpolation can exceed the determinant of the interpolated matrices). In the spirit of [231], the determinant of a covariance matrix can be considered as the volume of the polytope described by the columns of the matrix. Thus, a distance that is immune to the swelling effect will respect the shape of the polytope along geodesics.

As equation (1) indicates, computing the Riemannian distance between two SPD matrices involves adding squared logarithms, which may cause numerical problems; the smallest eigenvalues of the matrix C1^(-1) C2 tend towards zero as the number of electrodes increases and/or the window size for estimating C1 and C2 decreases, making the logarithm operation ill-conditioned and numerically unstable. Further, note that the larger the dimensions, the more the distance is prone to noise. Moreover, Riemannian approaches usually have high computational complexities (e.g. growing cubically with the number of electrodes for computing both the geometric mean and the Riemannian distance). For these reasons, when the number of electrodes is large with respect to the window size, it is advocated to reduce the dimensions of the input matrices. Classical unsupervised methods such as PCA, or supervised methods such as CSP, can be used for this purpose. Recently, Riemannian-inspired dimensionality reduction methods have been investigated as well [94, 95, 189].

Interestingly, some approaches have tried to bridge the gap between Riemannian approaches and more classical paradigms by incorporating some Riemannian geometry in approaches such as CSP [12, 233]. CSP was the previous gold standard and is based on a different paradigm than Riemannian geometry. Taking the best of those two paradigms is expected to gain better robustness while compressing the information.

Table 3. Summary of Riemannian geometry classifiers for EEG-based BCI.

EEG pattern | Features | Classifier | References
Motor imagery | Band-pass covariance | RMDM | [13, 46]
Motor imagery | Band-pass covariance | Tangent space + LDA | [13, 231]
Motor imagery | Band-pass covariance | SVM with Riemannian kernel | [14]
P300 | Special covariance | RMDM | [46]
P300 | Special covariance | RMDM | [15]
P300 | Special covariance | RMDM | [158]
SSVEP | Band-pass covariance | RMDM | [34, 100]

4.2.2. Other matrix classifiers.

Principles. As mentioned previously, the classification pipeline in BCI typically involves spatial filtering of the EEG signals followed by classification of the filtered data. This results in the independent optimization of several sets of parameters, namely for the spatial filters and for the final classifier. For instance, the typical linear classifier decision function for an oscillatory activity BCI would be the following:

(3) f(X, w, S) = Σ_i w_i log(var(s_i^T X)) + w_0

where X is the EEG signal matrix, w = [w_0, w_1, ..., w_N] is the linear classifier weight vector, and S = [s_1, s_2, ..., s_N] is a matrix of spatial filters s_i. Optimizing w and s_i separately may thus lead to suboptimal solutions, as the spatial filters do not consider the objective function of the classifier. Therefore, in addition to RGCs, several authors have shown that it is possible to formulate this dual optimization problem as a single one, where the parameters of the spatial filters and of the linear classifier are optimized simultaneously, with the potential to obtain improved performance. The key principle of these approaches is to learn classifiers (either linear vector classifiers or matrix classifiers) that directly use covariance matrices, or their vectorised version, as input. We briefly present these approaches below.

State-of-the-art. In [214], the EEG data were represented as an augmented covariance matrix A, containing as block diagonal terms both the first order term X, i.e. the signal time course, and, as second order terms, the covariance matrices of EEG trials band-pass filtered in various frequency bands. The learned classifier is thus a matrix of weights W (rather than a vector), with the decision function f(A, W) = ⟨A, W⟩ + b. Due to the large dimensionality of the augmented covariance matrix, a matrix regularization term is necessary with such classifiers, e.g. to obtain sparse temporal or spatial weights.
Note that this approach can be applied to both ERP- and oscillatory-based BCIs, as the first order terms capture the temporal variations, and the covariance matrices capture the EEG signals' band power variations.

Following similar ideas in parallel, [65] represented this learning problem in tensor space by constructing tensors of frequency-band specific covariance matrices, which can then be classified using a linear classifier as well, provided appropriate regularization is used.

Finally, [190] demonstrated that equation (3) can be rewritten as follows, if we drop the log-transform:

(4) f(Σ, w_Σ) = vec(Σ)^T w_Σ + w_0

with w_Σ = Σ_i w_i vec(s_i s_i^T), Σ = X^T X being the EEG covariance matrix, and vec(M) being the vectorisation of matrix M. Thus, equation (3) can be optimized directly in the space of vectorised covariance matrices by optimizing the weights w_Σ. Here as well, owing to the usually large dimensionality of vec(Σ), appropriate regularization is necessary, and [190] explored different approaches to do so.

These different approaches all demonstrated higher performance than the basic CSP+LDA methods on motor imagery data sets [65, 190, 214]. This suggests that such formulations can be worthy alternatives to the standard CSP+LDA pipelines.
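As a simple instance of the formulation in equation (4), the sketch below trains a regularized linear classifier directly on vectorised covariance matrices. The l2 penalty and its strength are arbitrary illustrative choices; [190, 214] investigate more structured regularizers (e.g. sparsity-inducing ones), which this sketch does not reproduce.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def vec_cov(trials):
    """(n_trials, n_channels, n_samples) -> (n_trials, Nc*(Nc+1)/2) features."""
    iu = np.triu_indices(trials.shape[1])           # unique covariance entries
    return np.array([np.cov(t)[iu] for t in trials])

# Strong regularization is needed given the dimensionality of vec(Sigma):
clf = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
# clf.fit(vec_cov(train_trials), train_labels)
# predictions = clf.predict(vec_cov(test_trials))
```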
Pros and cons. By simultaneously optimizing spatial filters and classifiers, such formulations usually achieve better solutions than the independent optimization of the individual sets of components. Their main advantage is thus increased classification performance. This formulation nonetheless comes at the expense of a larger number of classifier weights, due to the large increase in the dimensionality of the input features (a covariance matrix has Nc(Nc + 1)/2 unique values, versus Nc values when using only the channels' band power). Appropriate regularization is thus necessary. It remains to be evaluated how such methods perform for various amounts of training data, as they are bound to suffer more severely from the curse of dimensionality than simpler methods with fewer parameters. These methods have also not been used online to date. From a computational complexity point of view, such methods are more demanding than traditional methods, given their increased number of parameters, as mentioned above. They also generally require heavy regularization, which can make their calibration longer. However, their decision functions being linear, they should be easily applicable in online scenarios. Still, it remains to be seen whether they can be calibrated quickly enough for online use, and what their performance will be on online data.

4.2.3. Feature extraction and classification using tensors.

Principles. Tensors (i.e. multi-way arrays) provide a natural representation for EEG data, and higher order tensor decompositions and factorizations are emerging as promising (but not yet very well established and not yet fully explored) tools for the analysis of EEG data, particularly for feature extraction, clustering and classification tasks in BCI [38–40, 42, 43].

The concept of tensorization refers to the generation of higher-order structured tensors (multi-way arrays) from lower-order data formats, especially time series EEG data represented as vectors or organized as matrices. This is an essential step prior to tensor (multiway) feature extraction and classification [41, 42, 182].

The order of a tensor is the number of modes, also known as ways or dimensions (e.g. for EEG BCI data: space (channels), time, frequency, subjects, trials, groups, conditions, wavelets, dictionaries). In the simplest scenario, multichannel EEG signals can be represented as a 3rd-order tensor that has three physical modes: space (channel) × time × frequency. In other words, S channels of EEG recorded over T time samples can produce S matrices of F × T dimensional time-frequency spectrograms, stacked together into an F × T × S dimensional third-order tensor. For multiple trials and multiple subjects, the EEG data sets can naturally be represented by higher-order tensors, e.g. for a 5th-order tensor: space × time × frequency × trial × subject.

It should be noted that almost all basic vector- and matrix-based machine learning algorithms for feature extraction and classification have been, or can be, extended or generalized to tensors. For example, the SVM for classification has been naturally generalized to the tensor support machine (TSM), kernel TSM and higher rank TSM. Furthermore, the standard LDA method has been generalized to tensor Fisher discriminant analysis (TFDA) and/or higher order discriminant analysis (HODA) [41, 183]. Moreover, tensor representations of BCI data are often very useful in mitigating the small sample size problem in discriminative subspace selection, because the information about the structure of the data is often inherent in tensors and is a natural constraint which helps reduce the number of unknown feature parameters in the description of a learning model. In other words, when the number of EEG training measurements is limited, tensor-based learning machines are often expected to perform better than the corresponding vector- or matrix-based learning machines, as vector representations are associated with problems such as loss of information for structured data and over-fitting for high-dimensional data.
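As an illustration of the tensorization step described above, the sketch below stacks per-channel spectrograms into a third-order space × time × frequency tensor (ordered here as F × T' × S). The sampling rate and window settings are illustrative assumptions; stacking over trials or subjects would extend this to 4th- or 5th-order tensors.

```python
import numpy as np
from scipy.signal import spectrogram

def tensorize_eeg(eeg, fs=250.0):
    """eeg: (S_channels, T_samples) -> third-order tensor of shape (F, T', S)."""
    tfrs = [spectrogram(ch, fs=fs, nperseg=128, noverlap=64)[2] for ch in eeg]
    return np.stack(tfrs, axis=-1)     # one spectrogram (F, T') per channel
```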
allow us to represent the data in a more efficient way, i.e. in a sparse manner with different sparsity profiles [43, 183].

Moreover, in order to increase the performance of BCI classification, we can apply two or more time-frequency representations, or the same frequency transform with two or more different parameter settings. Different frequency transforms (or different mother wavelets) allow us to obtain different sparse tensor representations with various sparsity profiles and some complementary information. For multichannel EEG signals we can generate a block of at least two tensors, which can be concatenated into a single data tensor: space x time x frequency x trial [43, 182, 183].

The key problem in tensor representation is the choice of a suitable time-frequency representation (TFR) or frequency transform, and the selection of optimal, or close to optimal, corresponding transformation parameters. By exploiting various TFRs, possibly with suitably selected different parameter settings for the same data, we may potentially improve the classification accuracy of the BCI thanks to the additional (partially redundant) information. Such approaches have been implemented, e.g. for motor imagery (MI) BCI, by employing different complex Morlet (Gabor) wavelets for EEG data sets with 62 channels [183]. For such data sets, the authors selected different complex Morlet wavelets with two different bandwidth frequency parameters, fb = 1 Hz and fb = 6 Hz, for the same centre frequency fc = 1 Hz. For each mother wavelet the authors constructed a 4th-order tensor: 62 channels x 23 frequency bins x 50 time frames x 120 trials, for both training and test EEG data. The block of training tensor data can then be concatenated into a 5th-order tensor: 62 channels x 23 frequency bins x 50 time frames x 2 wavelets x 120 trials.

The HODA algorithm was used to estimate discriminant bases. The four most significant features were selected to classify the data, and led to an improved accuracy higher than 95%. Thus, it appears that by applying tensor decompositions to suitably constructed data tensors, considerable performance improvements in comparison with standard approaches can be achieved for both motor-imagery BCI [183, 223] and P300 paradigms [175].
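To make this pipeline concrete, the following sketch illustrates the same idea on a smaller scale using the tensorly library; the array shapes, wavelet settings and decomposition rank are our own assumptions, and a plain (unsupervised) CP decomposition stands in for the discriminant bases of HODA:

```python
# Sketch: tensor feature extraction for motor imagery EEG (illustrative only).
# Assumes epochs of shape (n_trials, n_channels, n_samples); requires tensorly.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac
from scipy.signal import cwt, morlet2
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def tfr_tensor(epochs, widths):
    """Stack |CWT| time-frequency maps into a (channels, freqs, times, trials) tensor."""
    n_trials, n_channels, n_samples = epochs.shape
    T = np.empty((n_channels, len(widths), n_samples, n_trials))
    for t in range(n_trials):
        for c in range(n_channels):
            T[c, :, :, t] = np.abs(cwt(epochs[t, c], morlet2, widths))
    return T

rng = np.random.default_rng(0)
epochs = rng.standard_normal((60, 16, 128))       # hypothetical data set
labels = rng.integers(0, 2, 60)
T = tfr_tensor(epochs, widths=np.arange(1, 13))   # 16 x 12 x 128 x 60 tensor

# Rank-R CP decomposition; the trial-mode factor yields one R-dimensional
# feature vector per trial, which can feed any standard classifier.
weights, factors = parafac(tl.tensor(T), rank=4, n_iter_max=100)
trial_features = factors[3]                       # shape (60, 4)
clf = LinearDiscriminantAnalysis().fit(trial_features, labels)
```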
In this approach, the transformation of the data with a dictionary aims to decorrelate the raw data and express them in a sparse domain. Different dictionaries (transformations) contribute to obtaining different sparse representations with various sparsity profiles. Moreover, augmenting the dimensionality to create samples with additional modes improved the performance.

To summarize, tensor decompositions with nonnegative, orthonormal or discriminant bases improved the classification accuracy for the BCI dataset by almost 10%. A comparison of all methods mentioned is provided in table 4.

Table 4. Summary of tensor classifiers for EEG-based BCI.

EEG pattern      Features/Methods                Classifier   References
Motor imagery    Topographic map, TFR, Connect.  LDA/HODA     [183]
P300             Multilinear PCA                 SVM/TSM      [223]
P300             Time-space-freq.                HODA         [175]
SSVEP            TCCA, MsetCCA, Bayesian         LDA          [243-246]

From the time-frequency analysis perspective, tensor decompositions are very attractive, even for a single channel, because they simultaneously take into account temporal and spectral information, as well as the variability and/or consistency of time-frequency representations (TFRs) across trials and/or subjects. Furthermore, they provide links among various latent (hidden) variables (e.g. temporal, spectral and spatial components), often with physical or physiological meanings and interpretations [40, 183].

Furthermore, standard canonical correlation analysis (CCA) was generalized to tensor CCA and multiset CCA, and was successfully applied to the classification of SSVEP for BCI [242, 243, 245, 246]. Tensor canonical correlation analysis (TCCA) and its modification, multiset canonical correlation analysis (MsetCCA), have been among the most efficient methods for frequency recognition in SSVEP-BCIs. The MsetCCA method learns multiple linear transforms that implement joint spatial filtering to maximize the overall correlation amongst canonical variates, and hence extracts SSVEP common features from multiple sets of EEG data recorded at the same stimulus frequency. The optimized reference signals are formed by combination of the common features and are completely based on training data. Extensive experimental studies with EEG data demonstrated that the tensor and MsetCCA methods improve the recognition accuracy of the SSVEP frequency in comparison with the standard CCA method and other existing methods, especially for a small number of channels and a short time window. These superior results indicate that the tensor MsetCCA method is a very promising candidate for frequency recognition in SSVEP-based BCIs [243].
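For reference, the standard CCA approach that these tensor extensions generalize can be sketched in a few lines: the multichannel EEG is correlated with sine-cosine templates at each candidate stimulus frequency (including harmonics), and the frequency with the largest canonical correlation wins. This is a minimal illustration with assumed shapes and sampling rate, not the MsetCCA method of [243]:

```python
# Sketch: standard CCA for SSVEP frequency recognition (the baseline that
# TCCA/MsetCCA generalize). Shapes and sampling rate are assumptions.
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_corr(X, Y):
    """First canonical correlation between X (samples x channels) and Y."""
    u, v = CCA(n_components=1).fit_transform(X, Y)
    return np.corrcoef(u[:, 0], v[:, 0])[0, 1]

def recognize_frequency(eeg, fs, candidate_freqs, n_harmonics=3):
    """eeg: (n_samples, n_channels) single trial; returns the best frequency."""
    t = np.arange(eeg.shape[0]) / fs
    scores = []
    for f in candidate_freqs:
        # Sine/cosine reference signals, including harmonics of the stimulus.
        ref = np.column_stack([fn(2 * np.pi * k * f * t)
                               for k in range(1, n_harmonics + 1)
                               for fn in (np.sin, np.cos)])
        scores.append(cca_corr(eeg, ref))
    return candidate_freqs[int(np.argmax(scores))]

# Usage with hypothetical data: 2 s of 8-channel EEG sampled at 250 Hz.
eeg = np.random.default_rng(0).standard_normal((500, 8))
print(recognize_frequency(eeg, fs=250, candidate_freqs=[8.0, 10.0, 12.0, 15.0]))
```

Including the harmonics in the reference signals is precisely the kind of detail that, as noted above, strongly affects the fairness of comparisons against CCA baselines.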
Pros and cons. In summary, recent advances in BCI technologies have generated massive amounts of brain data exhibiting high dimensionality, multiple modalities (e.g. physical modes such as frequency or time, multiple brain imaging techniques or conditions), and multiple couplings such as functional connectivity data. By virtue of their multi-way nature, tensors provide powerful and promising tools for BCI analysis and fusion of massive data, combined with a mathematical backbone for the discovery of underlying hidden complex (space-time-frequency) data structures [42, 183].

Another of their advantages is that, using tensorization and low-rank tensor decomposition, they can efficiently compress large multidimensional data into low-order factor matrices and/or core tensors which usually represent reduced features. Tensor methods can also analyze linked (coupled) blocks of trials, represented as large-scale matrices in the form of tensors, in order to separate common/correlated from independent/uncorrelated components in the observed raw EEG data.

Finally, it is worth mentioning that tensor decompositions are emerging techniques not only for feature extraction/selection and BCI classification, but also for pattern recognition, multiway clustering, sparse representation, data fusion, dimensionality reduction, coding, and multilinear blind brain source separation (MBSS). They can potentially provide convenient multi-channel and multi-subject space-time-frequency
sparse representations, artefact rejection, feature extraction, multi-way clustering and coherence tracking [39, 40].

On the cons side, the complexity of tensor methods is usually much higher than that of standard matrix and vector machine learning methods. Moreover, since tensor methods are only just emerging as potential tools for feature extraction and classification, existing algorithms are not always mature and are still not fully optimized. Thus, some effort is still needed to optimize and test them on real-life, large-scale data sets.

4.3. Transfer learning

4.3.1. Principles. One of the major hypotheses in machine learning is that the training data, on which the classifier is trained, and the test data, on which the classifier is evaluated, belong to the same feature space and follow the same probability distribution. In many applications such as computer vision, biomedical engineering or brain-computer interfaces, this hypothesis is often violated. For BCI, a change in data distribution typically occurs when data are acquired from different subjects and across various time sessions.

Transfer learning aims at coping with data that violate this hypothesis, by exploiting knowledge acquired while learning a given task for solving a different but related task. In other words, transfer learning is a set of methodologies for enhancing the performance of a classifier trained on one task (also denoted as a domain) based on information gained while learning another task. Naturally, the effectiveness of transfer learning strongly depends on how well related the two tasks are. For instance, it is more relevant to perform transfer learning between two P300 speller tasks performed by two different subjects than between one P300 speller task and a motor-imagery task performed by the same subject.

Transfer learning is especially important in situations where there exist abundant labelled data for one given task, denoted as the source domain, whilst data are scarce or expensive to acquire for a second task, denoted as the target domain. Indeed, in such cases, transferring knowledge from the source domain to the target domain acts as a bias or as a regularizer for solving the target task. We provide a more formal description of transfer learning based on the survey of Pan et al [177].

More formally, a domain is defined by a feature space X and a marginal probability distribution P(X), where the random variable X takes values in X. The feature space is associated with a label space Y, and they are linked through a joint probability distribution P(X, Y) with Y = y ∈ Y. A task is defined by a label space Y and a predictive function f(·) which depends on the unknown probability distribution P(X, Y). For a given task, the objective is to learn the function f(·) based on pairs of examples {xi, yi}, i = 1, ..., n, where xi ∈ X and yi ∈ Y.

Define the source and target domains as DS = {XS, PS(X)} and DT = {XT, PT(X)}, and the source and target tasks as TS = {YS, fS(·)} and TT = {YT, fT(·)}, respectively. Hence, given the estimation of fT(·) trained solely on information from the target task, the goal of transfer learning is to improve on this estimation by exploiting knowledge obtained from DS and TS, with DS ≠ DT or TS ≠ TT. Note that DS ≠ DT occurs when either the feature spaces XS and XT are different or the marginal distributions PS(X) and PT(X) are not equal. Similarly, TS ≠ TT indicates that either the label spaces or the predictive functions are different. For the latter situation, this reduces to the case where the two conditional probabilities differ: PS(yS|XS) ≠ PT(yT|XT).

Based on the learning setting, domains and tasks, several situations applicable to transfer learning exist. For instance, homogeneous transfer learning refers to cases where XS = XT, and domain adaptation refers to situations where the marginal probability distributions or the conditional probability distributions do not match in the source and target domains. Settings in which labelled data are available in both source and target domains, and TS ≠ TT, are referred to as inductive transfer learning. In BCI, this may be the case when the source domain and task are related to visual P300 evoked potentials whilst the target domain and task involve auditory P300-evoked potentials. In contrast, transductive transfer learning refers to situations in which the tasks are similar but the domains are different. A particular case is the domain adaptation problem, when the mismatch in domains is caused by a mismatch in the marginal or conditional probability distributions. In BCI, transductive transfer learning is the most frequent situation, as inter-subject or session-to-session variability usually occurs. For more categorizations of transfer learning, we refer the reader to the survey of Pan et al [177].

There exists a flurry of methods and implementations for solving a transfer learning problem, which depend on the specific situation and application domain. For homogeneous transfer learning, which is the most frequent situation encountered in BCIs, there are essentially three main strategies. If the domain distributions do not match, one possible strategy is to learn a transformation of the source or target domain data so as to correct the distribution mismatch [134, 203]. If the mismatch occurs on the marginal distribution, then a possible method for compensating the change in distribution is to consider a reweighting scheme [208]. Many transfer learning approaches are also based on finding a common feature representation for the two (or more) domains. As the representation, or the retrieved latent space, is common to all domains, labelled samples from the source and target domains can be used to train a general classifier [53, 177]. A classic strategy is to consider approaches whose goal is to locate representations in which domains match. Another trend in transfer learning is to consider methods that learn a transformation of the data so that their distributions match. These transformations can either be linear, based for instance on kernel methods [76, 241], or non-linear, through the use of an optimal transport strategy [51].
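As an illustration of the reweighting strategy, a common recipe (sketched below under the usual covariate-shift assumption, and not tied to the specific method of [208]) estimates the density ratio PT(x)/PS(x) with a probabilistic classifier trained to distinguish source from target samples, then uses these ratios as sample weights when training the final classifier:

```python
# Sketch: importance reweighting for covariate shift. A logistic regression
# discriminating source vs target features yields density-ratio estimates
# w(x) ~ P_T(x) / P_S(x), used to reweight the source training samples.
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_source, X_target, clip=10.0):
    X = np.vstack([X_source, X_target])
    d = np.r_[np.zeros(len(X_source)), np.ones(len(X_target))]  # 0=source, 1=target
    disc = LogisticRegression(max_iter=1000).fit(X, d)
    p_t = disc.predict_proba(X_source)[:, 1]
    # Ratio corrected for the source/target sample-size imbalance, then clipped
    # so that a few extreme weights do not dominate training.
    w = (p_t / (1 - p_t)) * (len(X_source) / len(X_target))
    return np.clip(w, 0, clip)

# Usage: train on (labelled) source features, weighted toward the target domain.
rng = np.random.default_rng(0)
Xs, ys = rng.standard_normal((200, 8)), rng.integers(0, 2, 200)
Xt = rng.standard_normal((100, 8)) + 0.5          # shifted target domain
w = importance_weights(Xs, Xt)
clf = LogisticRegression(max_iter=1000).fit(Xs, ys, sample_weight=w)
```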
Note that transfer learning may not always yield enhanced performance on a specific task TT. Theoretical results in domain adaptation and transfer learning [55] show that a gain in performance on TT may be achieved only if the source and target tasks are not too dissimilar. Hence, a careful analysis of how well the tasks relate has to be carried out before considering transfer learning methods. Transfer learning methods are illustrated in figure 4.
Figure 4. Illustrating the objective of domain adaptation. Left: source domain with labelled samples. Middle: target domain (with labels
and decision function for the sake of clarity). A classifier trained on the source domain will perform poorly. Right: a domain adaptation
technique will seek a common representation transformation or a mapping of domains so as to match the source and target domain
distributions.
4.3.2. State-of-the-art. In recent years, transfer learning has gained much attention for improving BCI classification. BCI research has focused on transductive transfer learning, in which tasks are identical between source and target. Motor imagery has been the most-used paradigm to test transfer learning methods, probably owing to the availability of datasets from BCI competitions [4, 10, 35, 67, 97, 102, 103, 143]. A few studies considered other paradigms, such as the P300-speller [74, 151, 218] and visual and spatial attention paradigms [165]. A transfer learning challenge was also recently organized on an error potential dataset [1].

Instead of considering source and target domains one-to-one, a widespread strategy is to perform ensemble analyses, in which many pre-recorded sessions, from possibly different subjects, are jointly analysed. This addresses the well-known problem of data scarcity, especially of labelled data, which makes classifiers prone to overfitting.

There are many methods for combining the features and classifiers within ensembles [205]. A first concern when considering ensembles is to guarantee the quality of the features and classifiers from the source domain. Feature selection is also relevant in this context (see section 2.2) to eliminate outliers. Many methods have been used to select relevant features from the ensemble, for instance mutual information [186], classification accuracy [143] or sparsity-inducing methods.

A second major challenge is to cope with the variability of data across subjects or sessions. Methods from adaptive classification are sometimes applicable in the context of transfer learning. Although the goal of adaptive classification, as explained in section 4.1, is to update classifiers and not to transfer data, transfer learning can benefit from adaptive classification to update classifiers whose initialization is subject-independent. This approach has been proposed for P300 classification by Lu et al [151]. Riemannian geometry can also increase robustness with respect to inter-subject and inter-session variability, as demonstrated in several studies [46, 238].

A particularly fruitful strand of research has focused on building spatial filters based on ensemble data. Common spatial patterns (CSP) and spatial filters in general are able to learn quickly on appropriate training data, but do not perform well with a large quantity of heterogeneous data recorded from other subjects or other sessions [46]. A regularization strategy is effective in this case [103]. A more relevant approach is to directly regularize the CSP objective function rather than the covariance matrices [143]. In this vein, Blankertz et al [21] have proposed an invariant CSP (iCSP), which regularizes the CSP objective function in a manner that diminishes the influence of noise and artefacts. Fazli et al [67] built a subject-independent classifier for movement imagination detection. They first extracted an ensemble of features (spatial and frequency filters) and then applied LDA classifiers across all subjects. They compared various ways of combining these classifiers to classify a new subject's data: simply averaging their outcomes (bagging) performs adequately, but is outperformed by a sparse selection of relevant features.

Sparse representations are indeed relevant when applied to ensemble datasets coming from multiple sessions or subjects. The dictionary of waveforms/topographies/time-frequency representations, from which the sparse representations are derived, can be built so as to span a space that naturally handles the session or subject variability. Sparsity-inducing methods fall into the category of 'invariant feature representation'. Dictionaries can be predefined but, to better represent the data under study, they can also be computed using data-driven methods. Dictionary learning is a data-driven method that alternately adapts the dictionary of representative functions and the coefficients of the data representation with this dictionary. Dictionary learning has been used to reveal inter-trial variability in neurophysiological signals [91]. Morioka et al [165] proposed to learn a dictionary of spatial filters, which is then adapted to the target subject. This method has the benefit of taking into account the target subject's specificities, through their resting-state EEG. Cho et al [35] also exploit target session data by constructing spatiotemporal filters which minimally overlap with noise patterns, an extension of Blankertz's iCSP [21].
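As a minimal illustration of this family of methods (our own sketch with synthetic data, not the algorithm of [165]), a dictionary can be learned from trials pooled over several source subjects, after which target trials are represented by their sparse codes, which then serve as session-robust features:

```python
# Sketch: sparse coding over an ensemble of subjects. A dictionary is learned
# from pooled source trials; sparse codes of target trials act as features.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Hypothetical data: trials flattened to vectors (e.g. channel x time unrolled).
X_sources = rng.standard_normal((600, 128))   # pooled trials, source subjects
X_target = rng.standard_normal((40, 128))     # few labelled trials, new subject
y_target = rng.integers(0, 2, 40)

dico = MiniBatchDictionaryLearning(n_components=32, alpha=1.0,
                                   transform_algorithm='lasso_lars',
                                   random_state=0).fit(X_sources)
codes = dico.transform(X_target)              # sparse codes, shape (40, 32)
clf = LinearDiscriminantAnalysis().fit(codes, y_target)
```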
An even more sophisticated method to address the domain adaptation of features is to model their variability across sessions or subjects. Bayesian models capture variability through their model parameters. These models are generally implemented in a multitask learning context, where an ensemble of tasks TS = {YS, fS(·)} is jointly learned from the source (labelled) domain. For BCIs, 'a task' is typically a distinct
recording session, either for a single subject or for multiple subjects. Bayesian models have hence been built for features in the spectral [4], spatial [102], and recently combined spatial and spectral domains [97]. Combining a Bayesian model and learning from label proportions (LLP) has recently been proposed in [218].
Another interesting domain adaptation method is to actually transport the features of the target data onto the source domain. Once transported to the source domain, the target data can then be classified with the existing classifier trained on the source data. Arvaneh et al [10] apply this approach to session-to-session transfer for motor imagery BCI, by estimating a linear transformation of the target data which minimizes the Kullback-Leibler distance between the source and transformed target distributions. Recently, session-to-session transfer of P300 data has been accomplished using a nonlinear transform obtained by solving an optimal transport problem [74]. Optimal transport is well suited to domain adaptation, as its algorithms can be used for transporting probability distributions from one domain onto another [51]. These various works are summarized in table 5.

Table 5. Summary of transfer learning methods for BCI.

EEG pattern      Features/Method        Classifier/Transfer              References
Motor imagery    CSP + band power       Linear SVM, subject-to-subject   [103, 143]
Motor imagery    Sparse feature set     LDA                              [67]
Motor imagery    CSP                    Fisher LDA, session-to-session   [35]
Motor imagery    Surface Laplacian      LDA, Bayesian multitask,         [4, 97]
                                        subject-to-subject
Motor imagery    PCSP                   LDA, Bayesian model,             [102]
                                        multisubject
Motor imagery    CSP + band power       LDA, session-to-session          [10]
Visual, spatial  Dictionary learning    Linear SVM,                      [165]
attention        of spatial filters     subject-to-subject
P300             Time points            Mixture of Bayesian classifiers  [218]
P300             Time points            Fisher LDA, multisubject         [151]
P300             xDAWN                  LDA, optimal transport,          [74]
                                        session-to-session
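Such transports are readily available in open-source software. The sketch below uses the Sinkhorn-based domain adaptation of the POT (Python Optimal Transport) library on hypothetical feature matrices; it illustrates the principle rather than the exact setup of [74]:

```python
# Sketch: optimal-transport domain adaptation with the POT library.
# Samples from one domain are transported onto the other, after which the
# existing classifier of the destination domain can be applied directly.
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
Xs = rng.standard_normal((200, 10))           # source-session features
Xt = rng.standard_normal((150, 10)) + 1.0     # shifted target-session features

# Entropy-regularized OT; reg_e controls the smoothness of the coupling.
mapping = ot.da.SinkhornTransport(reg_e=1.0)
mapping.fit(Xs=Xs, Xt=Xt)
Xs_transported = mapping.transform(Xs=Xs)     # source samples moved onto target
```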
4.3.3. Pros and cons. As reported in the studies cited above, transfer learning is instrumental in improving session-to-session and subject-to-subject decoding performance. This is essential to achieve a true calibration-free BCI mode of operation in the future, which in turn would improve BCI usability and acceptance. In fact, it is well recognized in the community that the calibration session may be unduly tiring for clinical users, whose cognitive resources are limited, and annoying in general for healthy users. As discussed by Sanelli et al [195], receiving feedback from the very beginning of their BCI experience is highly motivating and engaging for novice users. Transfer learning can then provide users with an adequately performing BCI before applying co-adaptive strategies. In this spirit, transfer learning may be used to initialize a BCI using data from other subjects for a naive user, and data from other sessions for a known user. In any case, such an initialization is suboptimal; such an approach thus entails adapting the classifier during the session, a topic that we have discussed in section 4.1. Therefore, transfer learning and adaptivity must go hand in hand to achieve the final goal of a calibration-free mode of operation [46].

Although suboptimal in general, transfer learning is robust by definition. For instance, subject-to-subject transfer learning can produce better results than subject-specific calibration if the latter is of low quality [15]. This is particularly useful in clinical settings, where obtaining a good calibration is sometimes prohibitive [158].

As we have seen, the approach of seeking invariant spaces for performing classification in transfer learning settings is theoretically appealing and has shown promising results by exploiting Riemannian geometry; however, it comes at the risk of throwing away some of the information that is relevant for decoding. In fact, instead of coping with the variability of data across sessions, as formulated above, it may be wiser to strive to benefit from the variability in the ensemble to better classify the target session. The idea would be to design classifiers able to represent multiple sessions or subjects.

The combination of transfer learning and adaptive classifiers represents a topic at the forefront of current BCI research. It is expected to receive increasing attention in the upcoming years, leading to a much-sought new generation of calibration-free BCIs.

Very few of the presented transfer learning methods have yet been used online, but computational power is not a limitation: these methods do not require extensive computational resources and can be run on simple desktop computers. For methods whose learning phase may take a long time (such as sparsity-inducing methods or dictionary learning), this learning should be performed in advance, so that the adaptation to a new subject or session is time-efficient [165].

4.4. Deep learning

Deep learning is a specific machine learning approach in which features and the classifier are jointly learned directly from data. The term deep learning is coined by the architecture of the model, which is based on a cascade of trainable feature extractor modules and nonlinearities. Owing to such a cascade, the learned features usually correspond to increasing levels of abstraction. We discuss in this section the two most popular deep learning approaches for BCI: convolutional neural networks and restricted Boltzmann machines.
Figure 5. Example architectures of two deep learning frameworks. Left: convolutional neural networks. The blue blocks refer to results of
convolving input signal with several different filters. Right: stacked restricted Boltzmann machines. Hidden layers are trained layer-wise
and the full network can be fine-tuned according to the task at hand.
4.4.1. Principles.

A short introduction to restricted Boltzmann machines. A restricted Boltzmann machine (RBM) is a Markov random field (MRF) [120] associated with a bipartite undirected graph. It is composed of two sets of units: m visible ones, V = (V1, ..., Vm), and n hidden ones, H = (H1, ..., Hn). The visible units are used for representing observable data, whereas the hidden ones capture dependencies between the observed variables. For the usual type of RBM, such as those discussed in this paper, units are considered as random variables that take binary values (v, h), and W is a matrix whose entries wij are the weights associated with the connection between units vi and hj. The joint probability of a given configuration (v, h) can be modelled according to the probability p(v, h) = (1/Z) e^(-E(v,h)), with the energy function E(v, h) being

E(v, h) = -aᵀv - bᵀh - vᵀWh

where a and b are bias weight vectors and Z is a normalizing factor ensuring that p(v, h) sums to one over all possible configurations. Owing to the undirected bipartite graph property, the hidden (respectively visible) variables are independent given the visible (respectively hidden) ones, leading to:

p(v|h) = ∏_{i=1..m} p(vi|h),    p(h|v) = ∏_{j=1..n} p(hj|v)

and the marginal distribution over the visible variables can be easily obtained as [69]:

p(v) = (1/Z) Σ_h e^(-E(v,h)).

Hence, by optimizing all model parameters (W, b, a), it is possible to model the probability distribution of the observable variables. Other properties of RBMs, as well as connections between RBMs and stochastic neural networks, are detailed in [69, 90].

To learn the probability distribution of the input data, RBMs are usually trained according to a procedure denoted as contrastive divergence learning [89]. This learning procedure is based on a gradient ascent of the log-likelihood of the training data. The derivative of the log-likelihood of an input v can be easily derived [69], and the mean of this derivative over the training set leads to the rule:

∂L(W|v)/∂wij ∝ ⟨vi hj⟩_data - ⟨vi hj⟩_model

with the two brackets respectively denoting expectation over p(h|v)q(v) and over the model (p(v, h)), q being the empirical distribution of the inputs. While the first term of this gradient is tractable, the second one has exponential complexity. Contrastive divergence aims at approximating this gradient using a Gibbs chain procedure that computes the binary state of h using p(h|v) and then obtains an estimation of v using p(v|h) [89]. There exist other methods for approximating the gradient of the RBM log-likelihood that may lead to better solutions, as well as methods for learning with continuous variables [17, 213].

The above procedure allows one to learn a generative model of the inputs using a single layer of RBM units. A deep learning strategy can be obtained by stacking several RBMs, with the hidden units of one layer used as inputs of the subsequent layer. Each layer is usually trained in a greedy fashion [90], and fine-tuning can be performed depending on the final objective of the model. An RBM is illustrated in figure 5 (right).
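The whole procedure fits in a few lines of code. The following minimal numpy sketch trains a binary RBM with one step of contrastive divergence (CD-1); the layer sizes, learning rate and data are arbitrary choices for illustration:

```python
# Sketch: binary RBM trained with one-step contrastive divergence (CD-1).
import numpy as np

rng = np.random.default_rng(0)
m, n, lr = 64, 32, 0.1                        # visible units, hidden units, step
W = 0.01 * rng.standard_normal((m, n))
a, b = np.zeros(m), np.zeros(n)               # visible and hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One CD-1 update from a batch of binary visible vectors v0 (batch x m)."""
    global W, a, b
    ph0 = sigmoid(v0 @ W + b)                 # p(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0  # sample binary hidden states
    pv1 = sigmoid(h0 @ W.T + a)               # reconstruction p(v=1 | h0)
    ph1 = sigmoid(pv1 @ W + b)
    # Gradient approximation: <v h>_data - <v h>_model, with one Gibbs step
    # standing in for the intractable model expectation.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)

data = (rng.random((500, m)) < 0.2) * 1.0     # hypothetical binarized inputs
for epoch in range(10):
    for batch in np.split(data, 10):
        cd1_update(batch)
```

Stacking would then proceed as described above: the hidden activations sigmoid(v @ W + b) of this trained layer become the visible data for the next RBM.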
A short introduction to convolutional neural networks. A convolutional neural network (ConvNet or CNN) is a feedforward neural network (a network in which information flows uni-directionally from the input, through the hidden layers, to the output) which has at least one convolutional layer [71, 117]. Such a convolutional layer maps its input to an output through a convolution operator. Suppose that the input is a 1D signal {xn} with N samples; its convolution with a 1D filter {hm} of size M is given by:

y(n) = Σ_{i=0..M-1} hi x_{n-i},    n = 0, ..., N - 1.

This equation can be extended to higher dimensions by augmenting the number of summations in accordance with the dimensions. Several filters can also be used independently in convolution operations, leading to an increased number of channels in the output. This convolutional layer is usually followed by nonlinearities [75] and possibly by a pooling layer
that aggregates the local information of the output into a single value, typically through an average or a max operator [25]. Standard ConvNet architectures usually stack several of these layers (convolution + non-linearity (+ pooling)), followed by other layers, typically fully connected, that act as a classification layer. Note, however, that some architectures use all-convolutional layers as classification layers. Given an architecture, the parameters of the model are the weights of all the filters used for convolution and the weights of the fully connected layers.

ConvNets are usually trained in a supervised fashion by solving an empirical risk minimization problem of the form:

ŵ = arg min_w (1/n) Σ_i L(yi, fw(xi)) + Ω(w)

where {xi, yi}, i = 1, ..., n, are the training data, fw is the prediction function related to the ConvNet, L(·, ·) is a loss function that measures the discrepancy between the true label of xi and fw(xi), and Ω is a regularization function for the parameters of the ConvNet. Owing to the specific form of the global loss (an average of the losses over the individual samples), stochastic gradient descent and its variants are the most popular means of optimizing deep ConvNets. Furthermore, the feedforward architecture of fw(·) allows the computation of the gradient at any given layer using the chain rule. This can be performed efficiently using the back-propagation algorithm [193].

In several application domains, ConvNets have been very successful because they are able to learn the most relevant features for the task at hand. However, their performances strongly depend on their architectures and their learning hyper-parameters. A ConvNet is illustrated in figure 5 (left).
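The convolution equation above maps directly onto code. The sketch below applies a small bank of 1D filters to a toy signal, computing y(n) = Σ hi x_(n-i) with zero padding at the border; the filter values here are arbitrary, whereas a real ConvNet learns them by gradient descent:

```python
# Sketch: the 1D convolutional layer from the equation above, applied to a
# toy signal with a small bank of filters (zero-padded borders).
import numpy as np

def conv1d_layer(x, filters):
    """x: (N,) input signal; filters: (n_filters, M). Returns (n_filters, N)."""
    N, (n_filters, M) = len(x), filters.shape
    x_pad = np.concatenate([np.zeros(M - 1), x])   # so that x[n-i] exists for n < i
    y = np.zeros((n_filters, N))
    for f in range(n_filters):
        for n in range(N):
            # y(n) = sum_i h_i * x_{n-i}, for i = 0..M-1
            y[f, n] = filters[f] @ x_pad[n:n + M][::-1]
    return y

x = np.sin(np.linspace(0, 8 * np.pi, 256))         # toy input signal
filters = np.stack([np.ones(5) / 5,                # moving-average filter
                    np.array([1, 0, -1, 0, 0.])])  # crude derivative filter
y = conv1d_layer(x, filters)                       # output shape (2, 256)
```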
4.4.2. State-of-the-art. Deep neural networks (DNNs) have been explored for all major types of EEG-based BCI systems, that is, P300, SSVEP, motor imagery and passive BCI (for emotion and workload detection). They have also been studied for less commonly used EEG patterns such as slow cortical potentials (SCP) or the motion-onset visual evoked potential (MVEP). It is worth mentioning that all these studies were performed offline.

Regarding P300-based BCI, Cecotti et al published the very first paper which explored CNNs for BCI [32]. Their network comprised two convolutional layers, one to learn spatial filters and the other to learn temporal filters, followed by a fully connected layer. They also explored ensembles of such CNNs. This network outperformed the BCI competition winners on the P300-speller data set used for evaluation. However, an ensemble of SVMs obtained slightly better performances than the CNN approach.

Remaining with P300 classification, but this time in the context of the rapid serial visual presentation (RSVP) paradigm, [156] explored another CNN with one spatial convolution layer, two temporal convolution layers and two dense fully-connected layers. They also used rectified linear units, dropout and spatio-temporal regularization on the convolution layers. This network was reported to be more accurate than a spatially weighted LDA-PCA classifier, by 2%. It was not compared to any other classifier, though. It should be mentioned that in this paper, as in most BCI papers on deep learning, the architecture is not justified and not compared to different architectures, apart from the fact that the architecture was reported to perform well.

For SSVEP, [113] also explored a CNN, with a spatial convolutional layer and a temporal one, that used band power features from two EEG channels. This CNN obtained performance similar to that of a three-layer MLP or that of a classifier based on canonical correlation analysis (CCA) with kNN on data recorded from static users. However, it outperformed both on noisy EEG data recorded from a moving user. That said, the classifiers that were compared to the CNN were not the state-of-the-art for SSVEP classification (e.g. CCA was used neither with any harmonics of the SSVEP stimulus, which are known to improve performance, nor with more channels).

For SCP classification, [59] explored a deep extreme learning machine (DELM), which is a multilayer ELM whose last layer is a kernel ELM. The structure of the network, its number of units, the input features and the hyper-parameters were not justified. This network obtained lower performance than the BCI competition winners on the data set used, and was not significantly better than a standard ELM or a multilayer ELM.

For MVEP, [153] used a deep belief network (DBN) composed of three RBMs. The dimensionality of the input features, EEG time points, was reduced using compressed sensing (CS). This DBN+CS approach outperformed an SVM approach which used neither DBN nor CS.

Regarding passive BCIs, Yin et al explored DNNs for both workload and emotion classification [234, 235]. In [234], they used an adaptive DBN, composed of several stacked autoencoders (AE), for workload classification. Adaptation was performed by retraining the first layer of the network using incoming data labelled with their estimated class. Compared to kNN, an MLP or an SVM, the proposed network outperformed all of them without channel selection, but obtained similar performance with feature selection. As is too often the case in DNN papers for BCI, the proposed approach was not compared to the state-of-the-art, e.g. to methods based on FBCSP. In [235], another DBN composed of stacked AEs was studied. This DNN was, however, a multimodal one, with separate AEs for EEG signals and other physiological signals. Additional layers merged the two feature types. This approach appeared to outperform competing classifiers and published results using the same database. However, the data used to perform model selection of the proposed DNN and to determine its structure comprised all the data, that is, it included the test data, which biased the results.

Several studies have explored DNNs for motor imagery classification, with both DBNs and CNNs [150, 198, 207, 210]. A DBN was explored in [150] to classify band power features from two EEG channels. The network outperformed FBCSP and the BCI competition winner, but only when using an arbitrary structure whose selection was not justified. When removing or adding a single neuron, this network exhibited lower performance than FBCSP or the competition winner, hence casting doubts on its reliability and its initial structure choice. Another DBN was used in [207] for motor imagery classification,
but was outperformed by a simple CSP+LDA classifier. However, the authors proposed a method to interpret what the network has learned and its decisions, which provided useful insights into the possible neurophysiological causes of misclassifications. A combination of CNN and DBN was explored in [210]. They used a CNN whose output served as input to a six-layer SAE. Compared to a CNN, a DBN or an SVM alone, the CNN+DBN approach appeared to be the most effective. It was not compared to the BCI competition winners on this data set, nor to other state-of-the-art methods such as Riemannian geometry and FBCSP. The last study to explore DNNs for motor imagery is that of Schirrmeister et al [198]. This study should be particularly commended as, contrary to most previously mentioned papers, various DNN structures are explored and presented, all carefully justified and not arbitrary, and the networks are rigorously compared to state-of-the-art methods. They explored a shallow CNN (one temporal convolution, one spatial convolution, squaring and mean pooling, and a softmax layer), a deep CNN (temporal convolution, spatial convolution, then three layers of standard convolution and a softmax layer), a hybrid shallow+deep CNN (i.e. their concatenation), and a residual NN (temporal convolution, spatial convolution, 34 residual layers, and a softmax layer). Both the deep and shallow CNNs significantly outperformed FBCSP, whereas the hybrid CNN and the residual NN did not. The shallow CNN was the most effective, with +3.3% of classification accuracy over FBCSP. The authors also proposed methods to interpret what the network has learned, which can provide useful neurophysiological insights.
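To give a flavour of such architectures, the sketch below implements, in PyTorch, a much-simplified network in the spirit of the shallow CNN described above (temporal convolution, spatial convolution, squaring, mean pooling, classification); the layer sizes are arbitrary and this is not the exact architecture of [198]:

```python
# Sketch in the spirit of the shallow ConvNet of [198]: temporal convolution,
# spatial convolution, squaring non-linearity, mean pooling, linear classifier.
# Layer sizes are arbitrary; this is not the exact published architecture.
import torch
import torch.nn as nn

class ShallowConvNet(nn.Module):
    def __init__(self, n_channels=22, n_classes=4, n_filters=40):
        super().__init__()
        # Input: (batch, 1, n_channels, n_times)
        self.temporal = nn.Conv2d(1, n_filters, kernel_size=(1, 25))
        self.spatial = nn.Conv2d(n_filters, n_filters,
                                 kernel_size=(n_channels, 1), bias=False)
        self.pool = nn.AvgPool2d(kernel_size=(1, 75), stride=(1, 15))
        self.classify = nn.LazyLinear(n_classes)  # infers the flattened size

    def forward(self, x):
        x = self.spatial(self.temporal(x))
        # Squaring + mean pooling + log approximates band-power extraction.
        x = torch.log(torch.clamp(self.pool(x ** 2), min=1e-6))
        return self.classify(x.flatten(start_dim=1))

# One empirical-risk-minimization step on a hypothetical batch.
net = ShallowConvNet()
x = torch.randn(16, 1, 22, 500)                 # 16 trials, 22 channels, 500 samples
y = torch.randint(0, 4, (16,))
loss = nn.functional.cross_entropy(net(x), y)   # average of L(y_i, f_w(x_i))
loss.backward()                                 # gradients via back-propagation
```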
Finally, a study explored a generic CNN, a compact one with few layers and parameters, for the classification of multiple EEG patterns, namely P300, movement-related cortical potentials (MRCP), ErrP and motor imagery [116]. This network outperformed another CNN (that of [156], mentioned above), as well as xDAWN+BDA and RCSP+LDA, for subject-to-subject classification. The parameters (number of filters and pass-band used) for xDAWN and RCSP are not specified, though, and would be suboptimal if they used the same band as the CNN. The method is also not compared to the state-of-the-art (FBCSP or Riemannian) methods. The comparison to existing methods is thus, again, unconvincing.

A summary of the methods using deep learning for EEG classification in BCI is provided in table 6.

Table 6. Summary of works using deep learning for EEG-based BCI.

EEG pattern                      Features                Classifier     References
SCP                              Not specified           Deep ELM       [59]
Motion-onset VEP                 EEG time points         DBN            [153]
SSVEP                            Band power              CNN            [113]
P300                             EEG time points         CNN            [32]
P300                             EEG time points         CNN            [156]
Motor imagery                    Band power              DBN            [150]
Motor imagery/execution          Raw EEG                 CNN            [198]
Motor imagery                    Band power              DBN            [207]
Motor imagery                    Band power              CNN+DBN        [210]
Workload                         Band power              Adaptive DBN   [234]
Emotions                         Band power + zero       DBN            [235]
                                 crossing + entropy
ErrP, P300, MRCP, Motor imagery  EEG time points         CNN            [116]

4.4.3. Pros and cons. DNNs have the potential to learn both effective features and classifiers simultaneously from raw EEG data. Given their effectiveness in other fields, DNNs certainly seem promising for obtaining better features and classifiers, and thus much more robust EEG classification. However, so far, the vast majority of published studies on DNNs for EEG-based BCIs have been rather unconvincing in demonstrating their actual relevance and superiority to state-of-the-art BCI methods in practice. Indeed, many studies did not compare the studied DNN to state-of-the-art BCI methods, or performed biased comparisons, with either suboptimal parameters for the state-of-the-art competitors or unjustified choices of parameters for the DNN, which prevents us from ruling out manual tuning of these parameters with knowledge of the test set. There is thus a need to ensure that such issues are solved in future publications on DNNs for BCI. An interesting exception is the work in [198], which rigorously and convincingly showed that a shallow CNN could outperform FBCSP. This suggests that the major limitation of DNNs for EEG-based BCI is that such networks have a very large number of parameters, and thus require a very large number of training examples to calibrate them. Unfortunately, typical BCI data sets and experiments have very small numbers of training examples, as BCI users cannot be asked to perform millions or even thousands of mental commands before actually using the BCI. As a matter of fact, it has been demonstrated outside the BCI field that DNNs are actually suboptimal, and among the worst classifiers, with relatively small training sets [36]. Unfortunately, only small training sets are typically available to design BCIs. This may explain why shallow networks, which have much fewer parameters, are the only ones which have proved useful for BCI so far. In the future, it is thus necessary either to design NNs with few parameters or to obtain BCI applications with very large training databases, e.g. for multi-subject classification.

It is also worth noting that, so far, DNNs have only been explored offline for BCI. This is owing to their very long training times. Indeed, the computational complexity of DNNs is generally very high, both for training and testing. Calibration can take hours or days on current standard computers, and testing, depending on the number of layers and neurons, can also be very demanding. As a result, high-performance computing tools, e.g. multiple powerful graphics cards, may be needed to use them in practice. For practical online BCI applications, the classifier has to be trained in at most a few minutes to enable practical use (BCI users cannot wait for half an hour or more every time they want to use the
BCI). Fast training of a DNN would thus be required for BCI. Designing DNNs that do not require any subject-specific training, i.e. a universal DNN, would be another alternative.

4.5. Other new classifiers

4.5.1. Multilabel classifiers.

Principles. In order to classify more than two mental tasks, two main approaches can be used to obtain a multiclass classification function [215]. The first approach consists in directly estimating the class using multiclass techniques such as decision trees, multilayer perceptrons, naive Bayes classifiers or k-nearest neighbours. The second approach consists of decomposing the problem into several binary classification problems [5]. This decomposition can be accomplished in different ways, using (i) one-against-one pairwise classifiers [20, 84], (ii) one-against-the-rest (or one-against-all) classifiers [20, 84], (iii) hierarchical classifiers similar to a binary decision tree, or (iv) multi-label classifiers [154, 215]. In the latter case, a distinct subset of L labels (or properties) is associated with each class [58]. The predicted class is identified according to the closest distance between the predicted labels and each subset of labels defining a class.
State-of-the-art. The number of commands provided by motor imagery-based BCIs depends on the number of mental imagery states that the system is able to detect. This, in turn, is limited by the number of body parts that users can imagine moving in a manner that generates clear and distinct EEG patterns. Multi-label approaches can thus prove useful for detecting combined motor imagery tasks, i.e. the imagination of two or more body part movements at the same time [125, 192, 226], with each body part corresponding to a single label (indicating whether that body part was used). Indeed, in comparison with the standard approach, this approach has the advantage of considerably increasing the number of different mental states while using the same number of body parts: 2^P compared to P, where P is the number of body parts. Thus, EEG patterns during simple and combined motor imagery tasks were investigated to confirm the separability of seven different classes of motor imagery for BCI [126, 226, 249]. For the purpose of achieving continuous 3D control, both-hands motor imagery was adopted to complement the set of instructions in a simple limb motor imagery-based BCI, to go up (and rest to go down) [114, 192]. The up/down control signal was the inverted addition of the left and right autoregressive spectral amplitudes calculated for each of the electrodes and 3 Hz frequency bins. Another method converted circular ordinal regression into a multi-label classification approach to control a simulated wheelchair, using data set IIIa of the third BCI competition, with the imagination of left hand, right hand, foot and tongue movements as motor tasks [57]. Multiclass and multi-label approaches have been compared to discriminate eight commands from the combination of three motor imagery tasks (left hand, right hand and feet) to control a robotic arm [125]. A first method used a single classifier applied to the concatenated features related to each activity source (C3, Cz, C4), with one source for each limb involved. A second approach consisted of a hierarchical tree of three binary classifiers to infer the final decision. The third approach was a combination of the first two. All methods used the CSP algorithm for feature extraction and linear discriminant analysis (LDA) for classification. All methods were validated and compared to the classical one-versus-one (OVO) and one-versus-rest (OVR) methods. Results obtained with the hierarchical method were similar to the ones obtained with the OVO and OVR approaches. The performances obtained with the first approach (single classifier) and the last one (combined hierarchical classifier) were the best for all subjects. The various multi-label approaches explored are listed in table 7.

Table 7. Summary of multi-label (and related multiclass) approaches for EEG-based BCI.

Pros and cons. Multiclass and multi-label approaches therefore aim to recognize more than two commands. In both cases, the resulting increase in the number of recognized classes potentially provides the user with a greater number of commands, to interact more quickly with the system, without the need for a drop-down menu, for example. The multi-label approach can make learning shorter and less tiring, as it requires learning only a small number of labels. The many possible combinations of these labels lead to a large number of classes, and therefore to more commands. In addition, the multi-label approach allows redundancy in the labels describing a class, which can lead to better class separation. Usually, the number of labels to produce is smaller than the number of classes. Finally, as compared to standard methods, multiclass and multilabel approaches usually have a lower computational complexity, since they can share parameters, e.g. using a multilayer perceptron, or class descriptors (especially if no redundancy is introduced).

However, there might be a lack of relationship between the meaning of a label and the corresponding mental command, e.g. two-hand imagery to go up. This may generate a greater mental workload and therefore fatigue. It is therefore necessary to carefully choose the mapping between mental
commands and corresponding labels. Finally, classification errors of course remain possible. In particular, the set of estimated labels may sometimes not correspond to any class, and several classes may be at equal distances, thus causing class confusion.

4.5.2. Classifiers that can be trained from little data.

Principles. As previously discussed, most EEG-based BCIs are currently optimized for each subject. Indeed, this has been shown to lead, in general, to substantially higher classification performances than subject-independent classifiers. Typical BCI systems can be optimized using only a small amount of training data, typically 20-100 trials per class, as subjects cannot be asked to produce the same mental commands thousands of times before being provided with a functional BCI. Moreover, collecting such training data takes time, which is inconvenient for the subjects, and an ideal BCI would thus require a calibration time as short as possible. This calls for classifiers that can be calibrated using as little training data as possible. In the following we present the classifiers that were shown to be effective for this purpose. They rely either on statistical estimators dedicated to small sample sizes, or on dividing the input features between several classifiers to reduce the dimensionality, thus reducing the amount of training data needed by each classifier.

State-of-the-art. The three main classifiers that have been shown to be effective with little training data, and thus effective for EEG-based BCI design, are the shrinkage LDA classifier [22, 137, 142], random forests [3, 54] and Riemannian classifiers [47, 232].

The shrinkage LDA (sLDA) is a standard LDA classifier in which the class-related covariance matrices used in its optimization are regularized using shrinkage [22]. Indeed, covariance matrices estimated from little data tend to have larger extreme eigenvalues than the real data distribution, leading to poor covariance estimates. This can be solved by shrinking the covariance matrices Σ as Σ̂ = (1 - λ)Σ + λI, with I the identity matrix and λ the regularization parameter. Interestingly enough, there are analytical solutions to automatically determine the best λ value (see [118]). The resulting sLDA classifier has been shown to be superior to the standard LDA classifier for BCI, both for ERP-based BCI [22] and for oscillatory activity BCI [137]. It has also been shown that such a classifier can be calibrated with much less data than an LDA to achieve the same performance [137, 142]. For instance, for mental imagery BCI, an sLDA has been shown to obtain, with ten training trials per class, performance similar to that of a standard LDA with 30 training trials per class, effectively reducing the calibration time three-fold [137].
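In practice, shrinkage LDA is available off the shelf; in scikit-learn, for instance, the analytical shrinkage mentioned above corresponds to the following (the data here are random placeholders):

```python
# Sketch: shrinkage LDA as available in scikit-learn; shrinkage='auto' uses
# the analytical (Ledoit-Wolf) solution for the regularization parameter.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 64))        # only 10 trials per class for
y = np.repeat([0, 1], 10)                # 64 features: far too few for plain LDA

slda = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X, y)
lda = LinearDiscriminantAnalysis().fit(X, y)   # plain LDA, for comparison
```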
Random forest (RF) classifiers are ensembles of several decision tree classifiers [26]. The idea behind this classifier is to randomly select a subset of the available features and to train a decision tree classifier on them, then to repeat the process with many random feature subsets to generate many decision trees, hence the name random forest. The final decision is taken by combining the outputs of all decision trees. Because each tree only uses a subset of the features, it is less sensitive to the curse of dimensionality, and thus requires less training data to be effective. Outside of BCI research, among various classifiers and across various classification problems and domains, random forest algorithms were actually often found to be among the most accurate classifiers, including on problems with small training data sets [26, 36]. RFs were used successfully, even online, both for ERP-based BCI [3] and for motor imagery BCI [54]. They outperformed designs based on LDA classifiers for motor imagery BCI [54].

Riemannian classifiers have been discussed in section 4.2.1. Typically, a simple Riemannian classifier such as the RMDM requires less training data than optimal filtering approaches such as CSP for motor imagery [46] and xDAWN for P300 [15]. This is due to the robustness of the Riemannian distance, which the geometric mean inherits directly, as discussed in [47]. Even more robust mean estimations can be obtained by computing Riemannian medians or trimmed Riemannian means. Shrinkage and other regularization strategies can also be applied in a Riemannian framework to improve the estimation of covariance matrices when a small number of data points is available [100]. These methods are summarized in table 8.

Table 8. Summary of classifiers that can be trained with a limited amount of data.

EEG pattern      Features               Classifier   References
P300             Time points            sLDA         [142]
P300             Time points            sLDA         [22]
P300             Time points            RF           [3]
P300             Special covariance     RMDM         [46]
P300             Special covariance     RMDM         [15]
Motor imagery    CSP + band power       RF           [54]
Motor imagery    CSP + band power       sLDA         [137]
Motor imagery    Band-pass covariance   RMDM         [14, 46]
SSVEP            Band-pass covariance   RMDM         [100]
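As an illustration, with the pyriemann library the full RMDM pipeline (covariance estimation followed by minimum distance to the Riemannian mean) fits in a few lines; the data shapes below are assumptions:

```python
# Sketch: Riemannian minimum distance to mean (RMDM) with pyriemann.
# Trials are mapped to spatial covariance matrices, then classified by their
# Riemannian distance to each class's geometric mean.
import numpy as np
from pyriemann.estimation import Covariances
from pyriemann.classification import MDM

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8, 250))    # 40 trials, 8 channels, 250 samples
y = rng.integers(0, 2, 40)

covs = Covariances(estimator='lwf').fit_transform(X)  # shrunk covariance estimates
clf = MDM(metric='riemann').fit(covs, y)
pred = clf.predict(covs)
```

The 'lwf' (Ledoit-Wolf) covariance estimator combines the shrinkage and Riemannian regularization strategies mentioned above in a single step.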
Pros and cons. sLDA, RF and the RMDM are simple classifiers that are easy to use in practice and provide good results in general, including online. We thus recommend their use. sLDA and RMDM do not have any hyper-parameters, which makes them very convenient to use. sLDA has been shown to be superior to LDA, both for ERP and oscillatory activity-based BCI, across a number of data sets [22, 137]. There is thus no reason to use classic LDA; instead, sLDA should be preferred. RMDM performs as well as CSP+LDA for oscillatory activity-based BCI [13, 46], as well as xDAWN+LDA, and better than a step-wise LDA on time samples, for ERP-based BCI [15, 46], and better than CCA for SSVEP [100]. Note that because LDA is a linear classifier, it may be suboptimal in the hypothetical future case when vast amounts of training data become available. RF, on the other hand, is a non-linear classifier that can be effective with both small and large training sets [36]. RMDM is also non-linear and performs well with small as well as large training sets [46]. In terms of computational complexity, while RF can be more
demanding than RMDM or sLDA since it uses many classifiers, • Many of the classification methods surveyed in this paper
all of them are fairly simple and fast methods, and all have been have been evaluated offline only. However, an actual BCI
used online successfully on standard computers. application is fundamentally online. There is thus a need
to study and validate these classification methods online
as well, to ensure they are sufficiently computationally
5. Discussion and guidelines efficient to be used in real time, can be calibrated quickly
enough to be convenient to use and to ensure that they can
Based on the many papers surveyed in this manuscript, we withstand real-life noise in EEG signals. In fact, online
identify some guidelines on whether to use various types of evaluation of classifiers should be the norm rather than
classification methods, and if so, when and how it seems rel- the exception, as there is relatively little value in studying
evant to do so. We also identify a number of open research classifiers for BCI if they cannot be used online.
questions that deserve to be answered in order to design better • Transfer learning and domain adaptation may be key
classification methods to make BCI more reliable and usable. components for calibration-free BCI. However, at this
These guidelines and open research questions are presented in stage, several efforts must be taken before they can be
the two following sections. routinely used. Among the efforts, coupling advanced
features such as covariance matrices and domain adap-
5.1. Summary and guidelines tation algorithms can further improve on the invariance
ability of BCI systems.
According to the various studies surveyed in this paper, we • There are also several open challenges that, once solved,
extract the following guidelines for choosing appropriate clas- could make Riemannian geometry classifiers even more
sification methods for BCI design: efficient. One would be to design a stable estimator of
• In terms of classification performance, adaptive classifi- the Riemannian median to make RMDM classifiers
cation approaches, both for classifiers and spatial filters, more robust to outliers than when using the Riemannian
should be preferred to static ones. This should be the case mean. Another would be to work on multimodal RMDM,
even if only unsupervised adaptation is possible for the with multiple modes per class, not just one, which could
targeted application. potentially improve their effectiveness. Finally, there is a
• Deep learning networks do not appear to be effective to need for methods to avoid poorly conditioned covariance
date for EEG signals classification in BCI, given the lim- matrices or low rank matrices, as these could cause RGC
ited training data available. Shallow convolutional neural to fail.
networks are more promising. • While deep learning approaches are lagging in per-
• Shrinkage linear discriminant analysis (sLDA) should formance for BCI, mostly due to lack of large training
always be used instead of classic LDA, as it is more effec- datasets, they can be strongly relevant for end-to-end
tive and more robust for limited training data. domain adaptation [73] or for augmenting datasets
• When very little training data is available, transfer learning, through the use of generative adversarial networks [77].
sLDA, Riemannian minimum distance to the mean • Classifiers, and the entire machine learning/signal
(RMDM) classifiers or random forest should be used. processing pipeline are not the only considerations in
• When tasks are similar between subjects, domain a BCI system design. In particular, the user should be
adaptation can be considered for enhancing classifier per- considered as well and catered to so as to ensure effi-
formance. However, care should be taken regarding the cient brain–computer communications [33, 112, 144].
effectiveness of the transfer learning, as it may sometimes As such, future BCI classifiers should be designed to
decrease performance. ensure that users can make sense of the feedback from
• Riemannian geometry classifiers (RGC) are very prom- the classifier, and can learn effective BCI control from it
ising, and are considered the current state-of-the-art [146].
5.2. Open research questions and challenges

In addition to these guidelines, our survey also enabled us to identify a number of unresolved challenges and open research questions that must be addressed. These challenges and questions are presented below.
• Online evaluation of classifiers should be the norm rather than the exception, as there is relatively little value in studying classifiers for BCI if they cannot be used online.
• Transfer learning and domain adaptation may be key components for calibration-free BCI. However, at this stage, several efforts must be made before they can be routinely used. In particular, coupling advanced features, such as covariance matrices, with domain adaptation algorithms can further improve the invariance properties of BCI systems.
• There are also several open challenges that, once solved, could make Riemannian geometry classifiers even more efficient. One would be to design a stable estimator of the Riemannian median, to make RMDM classifiers more robust to outliers than when using the Riemannian mean. Another would be to work on multimodal RMDM, with multiple modes per class rather than just one, which could potentially improve their effectiveness. Finally, there is a need for methods to avoid poorly conditioned or low-rank covariance matrices, as these can cause RGC to fail (see the sketch after this list).
• While deep learning approaches currently lag in performance for BCI, mostly due to the lack of large training datasets, they can be strongly relevant for end-to-end domain adaptation [73] or for augmenting datasets through the use of generative adversarial networks [77].
• Classifiers, and the entire machine learning/signal processing pipeline, are not the only considerations in BCI system design. In particular, the user should be considered as well, and catered to, so as to ensure efficient brain–computer communication [33, 112, 144]. As such, future BCI classifiers should be designed to ensure that users can make sense of the feedback from the classifier and can learn effective BCI control from it [146].
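As an illustration of the Riemannian methods discussed in the guidelines and challenges above, the following minimal Python sketch shows an RMDM classifier together with a simple Riemannian re-centring step for transfer learning, assuming the open-source pyriemann package; the random data and parameter choices are placeholders, not a definitive implementation of any surveyed method.

import numpy as np
from pyriemann.estimation import Covariances
from pyriemann.classification import MDM
from pyriemann.utils.mean import mean_covariance
from pyriemann.utils.base import invsqrtm

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 8, 256))   # placeholder trials: 60 x 8 channels x 256 samples
y = np.array([0, 1] * 30)

# Regularized covariance estimation ('lwf' = Ledoit-Wolf [118]) helps avoid the
# poorly conditioned matrices mentioned in the challenges above.
covs = Covariances(estimator='lwf').transform(X)

# RMDM: assign each trial to the class whose Riemannian mean covariance is
# closest. (A robust variant would replace the mean by a Riemannian median,
# which is precisely one of the open challenges listed above.)
rmdm = MDM(metric='riemann')
rmdm.fit(covs, y)
print(rmdm.predict(covs[:5]))

# Transfer-learning flavour, in the spirit of [238]: re-centre a session's
# covariance matrices around the identity using the session mean, so that data
# from different sessions or subjects become more comparable.
M = mean_covariance(covs, metric='riemann')
W = invsqrtm(M)
covs_centered = np.array([W @ C @ W for C in covs])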
6. Conclusion

In this manuscript, we have surveyed the EEG classification approaches that were developed and evaluated between 2007 and 2017 in order to design BCI systems. The numerous approaches explored can be divided into four main categories: adaptive classifiers, matrix and tensor classifiers, transfer learning methods, and deep learning. In addition, a few miscellaneous methods were identified outside these categories, notably the promising shrinkage LDA and random forest classifiers.

Overall, our review revealed that adaptive classifiers, both supervised and unsupervised, generally outperform static ones. Matrix and tensor classifiers are also very promising.
[20] Bishop C M 2006 Pattern Recognition and Machine Learning (Berlin: Springer)
[21] Blankertz B, Kawanabe M, Tomioka R, Hohlefeld F, Nikulin V and Müller K R 2008 Invariant common spatial patterns: alleviating nonstationarities in brain–computer interfacing Advances in Neural Information Processing Systems vol 20 (Cambridge, MA: MIT Press)
[22] Blankertz B, Lemm S, Treder M, Haufe S and Müller K R 2010 Single-trial analysis and classification of ERP components: a tutorial NeuroImage 56 814–25
[23] Blankertz B, Tomioka R, Lemm S, Kawanabe M and Müller K R 2008 Optimizing spatial filters for robust EEG single-trial analysis IEEE Signal Process. Mag. 25 41–56
[24] Blumberg J, Rickert J, Waldert S, Schulze-Bonhage A, Aertsen A and Mehring C 2007 Adaptive classification for brain computer interfaces 29th Annual Int. Conf. of the IEEE Engineering in Medicine and Biology Society pp 2536–9
[25] Boureau Y L, Bach F, LeCun Y and Ponce J 2010 Learning mid-level features for recognition IEEE Conf. on Computer Vision and Pattern Recognition pp 2559–66
[26] Breiman L 2001 Random forests Mach. Learn. 45 5–32
[27] Breiman L, Friedman J H, Olshen R A and Stone C J 1984 Classification and Regression Trees (Monterey, CA: Wadsworth and Brooks)
[28] Brodu N, Lotte F and Lécuyer A 2011 Comparative study of band-power extraction techniques for motor imagery classification IEEE Symp. on Computational Intelligence, Cognitive Algorithms, Mind, and Brain pp 1–6
[29] Brodu N, Lotte F and Lécuyer A 2012 Exploring two novel features for EEG-based brain–computer interfaces: multifractal cumulants and predictive complexity Neurocomputing 79 87–94
[30] Buttfield A, Ferrez P and Millán J 2006 Towards a robust BCI: error potentials and online learning IEEE Trans. Neural Syst. Rehabil. Eng. 14 164–8
[31] Caramia N, Lotte F and Ramat S 2014 Optimizing spatial filter pairs for EEG classification based on phase synchronization Int. Conf. on Audio, Speech and Signal Processing
[32] Cecotti H and Graser A 2011 Convolutional neural networks for P300 detection with application to brain–computer interfaces IEEE Trans. Pattern Anal. Mach. Intell. 33 433–45
[33] Chavarriaga R, Fried-Oken M, Kleih S, Lotte F and Scherer R 2017 Heading for new shores! Overcoming pitfalls in BCI design Brain–Comput. Interfaces 4 60–73
[34] Chevallier S, Kalunga E, Barthélemy Q and Yger F 2018 Riemannian classification for SSVEP-based BCI Brain–Computer Interfaces Handbook: Technological and Theoretical Advances ed C Nam et al (London: Taylor & Francis)
[35] Cho H, Ahn M, Kim K and Jun S C 2015 Increasing session-to-session transfer in a brain–computer interface with on-site background noise acquisition J. Neural Eng. 12 066009
[36] Chongsheng Z, Changchang L, Xiangliang Z and George A 2017 An up-to-date comparison of state-of-the-art classification algorithms Expert Syst. Appl. 82 128–50
[37] Cibas T, Soulié F F, Gallinari P and Raudys S 1994 Variable Selection with Optimal Cell Damage (London: Springer) pp 727–30
[38] Cichocki A 2011 Tensor decompositions: a new concept in brain data analysis? J. Soc. Instrum. Control Eng. 58 507–16
[39] Cichocki A, Lee N, Oseledets I, Phan A H, Zhao Q and Mandic D 2016 Tensor networks for dimensionality reduction and large-scale optimization: part 1 low-rank tensor decompositions Found. Trends Mach. Learn. 9 249–429
[40] Cichocki A, Mandic D, Lathauwer L D, Zhou G, Zhao Q, Caiafa C and Phan A H 2015 Tensor decompositions for signal processing applications: from two-way to multiway component analysis IEEE Signal Process. Mag. 32 145–63
[41] Cichocki A, Phan A H, Zhao Q, Lee N, Oseledets I, Sugiyama M and Mandic D 2017 Tensor networks for dimensionality reduction and large-scale optimization: part 2 applications and future perspectives Found. Trends Mach. Learn. 9 431–673
[42] Cichocki A, Washizawa Y, Rutkowski T, Bakardjian H, Phan A H, Choi S, Lee H, Zhao Q, Zhang L and Li Y 2008 Noninvasive BCIs: multiway signal-processing array decompositions Computer 41
[43] Cichocki A, Zdunek R, Phan A and Amari S 2009 Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation (New York: Wiley)
[44] Clerc M, Bougrain L and Lotte F 2016 Brain–Computer Interfaces 1: Foundations and Methods (New York: Wiley)
[45] Clerc M, Bougrain L and Lotte F 2016 Brain–Computer Interfaces 2: Technology and Applications (New York: Wiley)
[46] Congedo M 2013 EEG Source Analysis (Grenoble: Univ. Grenoble Alpes)
[47] Congedo M, Barachant A and Bhatia R 2017 Riemannian geometry for EEG-based brain–computer interfaces: a primer and a review Brain–Comput. Interfaces 4 155–74
[48] Congedo M, Barachant A and Kharati K 2016 Classification of covariance matrices using a Riemannian-based kernel for BCI applications IEEE Trans. Signal Process. 65 2211–20
[49] Congedo M, Lotte F and Lécuyer A 2006 Classification of movement intention by spatially filtered electromagnetic inverse solutions Phys. Med. Biol. 51 1971–89
[50] Corralejo R, Hornero R and Álvarez D 2011 Feature selection using a genetic algorithm in a motor imagery-based brain computer interface Annual Int. Conf. of the IEEE Engineering in Medicine and Biology Society pp 7703–6
[51] Courty N, Flamary R, Tuia D and Rakotomamonjy A 2017 Optimal transport for domain adaptation IEEE Trans. Pattern Anal. Mach. Intell. 39 1853–65
[52] Coyle D, Principe J, Lotte F and Nijholt A 2013 Guest editorial: brain/neuronal computer games interfaces and interaction IEEE Trans. Comput. Intell. AI Games 5 77–81
[53] Daumé H III 2007 Frustratingly easy domain adaptation Proc. of the Association for Computational Linguistics
[54] David S, Reinhold S, Josef F and Müller-Putz G R 2016 Random forests in non-invasive sensorimotor rhythm brain–computer interfaces: a practical and convenient non-linear classifier Biomed. Eng./Biomed. Tech. 61 77–86
[55] David S B, Lu T, Luu T and Pál D 2010 Impossibility theorems for domain adaptation Proc. of the 13th Int. Conf. on Artificial Intelligence and Statistics pp 129–36
[56] del Millán J 2004 On the need for on-line learning in brain–computer interfaces Proc. 2004 IEEE Int. Joint Conf. on Neural Networks vol 4 (IEEE) pp 2877–82
[57] Devlaminck D, Waegeman W, Bauwens B, Wyns B, Santens P and Otte G 2010 From circular ordinal regression to multilabel classification Preference Learning: ECML/PKDD-10 Tutorial and Workshop p 15
[58] Dietterich T G and Bakiri G 1995 Solving multiclass learning problems via error-correcting output codes J. Artif. Int. Res. 2 263–86
[59] Ding S, Zhang N, Xu X, Guo L and Zhang J 2015 Deep extreme learning machine and its application in EEG classification Math. Probl. Eng. 2015 129021
[60] Dornhege G, Blankertz B, Curio G and Müller K 2004 Boosting bit rates in non-invasive EEG single-trial classifications by feature combination and multi-class paradigms IEEE Trans. Biomed. Eng. 51 993–1002
[61] Dornhege G, Blankertz B, Krauledat M, Losch F, Curio G and Müller K R 2006 Combined optimization of spatial and temporal filters for improving brain–computer interfacing IEEE Trans. Biomed. Eng. 53 2274–81
[62] Edelman A, Tomás A and Smith S T 1998 The geometry of algorithms with orthogonality constraints SIAM J. Matrix Anal. Appl. 20 303–53
[63] Faller J, Scherer R, Costa U, Opisso E, Medina J and Müller-Putz G R 2014 A co-adaptive brain–computer interface for end users with severe motor impairment PLoS One 9 e101168
[64] Faller J, Vidaurre C, Solis-Escalante T, Neuper C and Scherer R 2012 Autocalibration and recurrent adaptation: towards a plug and play online ERD-BCI IEEE Trans. Neural Syst. Rehabil. Eng. 20 313–9
[65] Farquhar J 2009 A linear feature space for simultaneous learning of spatio-spectral filters in BCI Neural Netw. 22 1278–85
[66] Fatourechi M, Ward R, Mason S, Huggins J, Schlogl A and Birch G 2008 Comparison of evaluation metrics in classification applications with imbalanced datasets Int. Conf. on Machine Learning and Applications (IEEE) pp 777–82
[67] Fazli S, Popescu F, Danóczy M, Blankertz B, Müller K R and Grozea C 2009 Subject-independent mental state classification in single trials Neural Netw. 22 1305–12
[68] Ferrez P and Millán J 2008 Error-related EEG potentials generated during simulated brain–computer interaction IEEE Trans. Biomed. Eng. 55 923–9
[69] Fischer A and Igel C 2012 An introduction to restricted Boltzmann machines Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, CIARP 2012 (Lecture Notes in Computer Science vol 7441) ed L Alvarez et al (Berlin: Springer) pp 14–36
[70] Frey J, Appriou A, Lotte F and Hachet M 2015 Classifying EEG signals during stereoscopic visualization to estimate visual comfort Comput. Intell. Neurosci. 2016 2758103
[71] Fukushima K and Miyake S 1982 Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition Competition and Cooperation in Neural Nets (Berlin: Springer) pp 267–85
[72] Gan J 2006 Self-adapting BCI based on unsupervised learning 3rd Int. Brain–Computer Interface Workshop
[73] Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, Marchand M and Lempitsky V 2016 Domain-adversarial training of neural networks J. Mach. Learn. Res. 17 1–35
[74] Gayraud N T, Rakotomamonjy A and Clerc M 2017 Optimal transport applied to transfer learning for P300 detection 7th Graz Brain–Computer Interface Conf.
[75] Glorot X, Bordes A and Bengio Y 2011 Deep sparse rectifier neural networks Proc. of the 14th Int. Conf. on Artificial Intelligence and Statistics pp 314–23
[76] Gong B, Shi Y, Sha F and Grauman K 2012 Geodesic flow kernel for unsupervised domain adaptation IEEE Conf. on Computer Vision and Pattern Recognition pp 2066–73
[77] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y 2014 Generative adversarial nets Advances in Neural Information Processing Systems pp 2672–80
[78] Grizou J, Iturrate I, Montesano L, Oudeyer P Y and Lopes M 2014 Calibration-free BCI based control AAAI pp 1213–20
[79] Grosse-Wentrup M 2009 Understanding brain connectivity patterns during motor imagery for brain–computer interfacing Advances in Neural Information Processing Systems vol 21
[80] Grosse-Wentrup M 2011 What are the causes of performance variation in brain–computer interfacing? Int. J. Bioelectromagn. 13 115–6
[81] Gu Z, Yu Z, Shen Z and Li Y 2013 An online semi-supervised brain–computer interface IEEE Trans. Biomed. Eng. 60 2614–23
[82] Guyon I and Elisseeff A 2003 An introduction to variable and feature selection J. Mach. Learn. Res. 3 1157–82
[83] Hasan B A S and Gan J Q 2012 Hangman BCI: an unsupervised adaptive self-paced brain–computer interface for playing games Comput. Biol. Med. 42 598–606
[84] Hastie T and Tibshirani R 1997 Classification by pairwise coupling Advances in Neural Information Processing Systems Conf. (Denver, CO, USA, 1997) vol 10 pp 507–13
[85] Hastie T, Tibshirani R and Friedman J 2001 The Elements of Statistical Learning (Berlin: Springer)
[86] Hazrati M K and Erfanian A 2010 An online EEG-based brain–computer interface for controlling hand grasp using an adaptive probabilistic neural network Med. Eng. Phys. 32 730–9
[87] Herman P, Prasad G, McGinnity T and Coyle D 2008 Comparative analysis of spectral approaches to feature extraction for EEG-based motor imagery classification IEEE Trans. Neural Syst. Rehabil. Eng. 16 317–26
[88] Higashi H and Tanaka T 2013 Simultaneous design of FIR filter banks and spatial patterns for EEG signal classification IEEE Trans. Biomed. Eng. 60 1100–10
[89] Hinton G E 2002 Training products of experts by minimizing contrastive divergence Neural Comput. 14 1771–800
[90] Hinton G E, Osindero S and Teh Y W 2006 A fast learning algorithm for deep belief nets Neural Comput. 18 1527–54
[91] Hitziger S, Clerc M, Saillet S, Benar C and Papadopoulo T 2017 Adaptive waveform learning: a framework for modeling variability in neurophysiological signals IEEE Trans. Signal Process. 65 4324–38
[92] Hoffmann U, Vesin J and Ebrahimi T 2006 Spatial filters for the classification of event-related potentials European Symp. on Artificial Neural Networks
[93] Höhne J, Holz E, Staiger-Sälzer P, Müller K R, Kübler A and Tangermann M 2014 Motor imagery for severely motor-impaired patients: evidence for brain–computer interfacing as superior control solution PLoS One 9 e104854
[94] Horev I, Yger F and Sugiyama M 2015 Geometry-aware principal component analysis for symmetric positive definite matrices Mach. Learn. 106 493–522
[95] Horev I, Yger F and Sugiyama M 2016 Geometry-aware stationary subspace analysis ACML
[96] Hsu W Y 2011 EEG-based motor imagery classification using enhanced active segment selection and adaptive classifier Comput. Biol. Med. 41 633–9
[97] Jayaram V, Alamgir M, Altun Y, Scholkopf B and Grosse-Wentrup M 2016 Transfer learning in brain–computer interfaces IEEE Comput. Intell. Mag. 11 20–31
[98] Kachenoura A, Albera L, Senhadji L and Comon P 2008 ICA: a potential tool for BCI systems IEEE Signal Process. Mag. 25 57–68
[99] Kalunga E, Chevallier S and Barthélemy Q 2015 Data augmentation in Riemannian space for brain–computer interfaces ICML Workshop on Statistics, Machine Learning and Neuroscience (Stamlins 2015)
[100] Kalunga E, Chevallier S, Barthélemy Q, Djouani K and Monacelli E 2016 Online SSVEP-based BCI using Riemannian geometry Neurocomputing 191 55–68
[101] Kamousi B, Liu Z and He B 2005 Classification of motor imagery tasks for brain–computer interface applications by means of two equivalent dipoles analysis IEEE Trans. Neural Syst. Rehabil. Eng. 13 166–71
[102] Kang H and Choi S 2014 Bayesian common spatial patterns for multi-subject EEG classification Neural Netw. 57 39–50
[103] Kang H, Nam Y and Choi S 2009 Composite common spatial pattern for subject-to-subject transfer IEEE Signal Process. Lett. 16 683–6
[104] Kindermans P J, Schreuder M, Schrauwen B, Müller K R and Tangermann M 2014 Improving zero-training brain–computer interfaces by mixing model estimators PLoS One 9 e102504
[105] Kindermans P J, Tangermann M, Müller K R and Schrauwen B 2014 Integrating dynamic stopping, transfer learning and language models in an adaptive zero-training ERP speller J. Neural Eng. 11 035005
[106] Kohavi R and John G H 1997 Wrappers for feature subset selection Artif. Intell. 97 273–324
[107] Koprinska I 2010 Feature Selection for Brain–Computer Interfaces (Berlin: Springer) pp 106–17
[108] Krauledat M, Schröder M, Blankertz B and Müller K R 2007 Reducing calibration time for brain–computer interfaces: a clustering approach Advances in Neural Information Processing Systems vol 19 ed B Scholkopf et al (Cambridge, MA: MIT Press)
[109] Krusienski D, Grosse-Wentrup M, Galán F, Coyle D, Miller K, Forney E and Anderson C 2011 Critical issues in state-of-the-art brain–computer interface signal processing J. Neural Eng. 8 025002
[110] Krusienski D, McFarland D and Wolpaw J 2012 Value of amplitude, phase, and coherence features for a sensorimotor rhythm-based brain–computer interface Brain Res. Bull. 87 130–4
[111] Krusienski D, Sellers E, Cabestaing F, Bayoudh S, McFarland D, Vaughan T and Wolpaw J 2006 A comparison of classification techniques for the P300 speller J. Neural Eng. 3 299–305
[112] Kübler A, Holz E M, Riccio A, Zickler C, Kaufmann T, Kleih S C, Staiger-Sälzer P, Desideri L, Hoogerwerf E J and Mattia D 2014 The user-centered design as novel perspective for evaluating the usability of BCI-controlled applications PLoS One 9 e112392
[113] Kwak N S, Müller K R and Lee S W 2017 A convolutional neural network for steady state visual evoked potential classification under ambulatory environment PLoS One 12 e0172578
[114] LaFleur K, Cassady K, Doud A, Shades K, Rogin E and He B 2013 Quadcopter control in three-dimensional space using a non-invasive motor imagery-based brain–computer interface J. Neural Eng. 10 046003
[115] Lal T, Schröder M, Hinterberger T, Weston J, Bogdan M, Birbaumer N and Schölkopf B 2004 Support vector channel selection in BCI IEEE Trans. Biomed. Eng. 51 1003–10
[116] Lawhern V J, Solon A J, Waytowich N R, Gordon S M, Hung C P and Lance B J 2016 EEGNet: a compact convolutional network for EEG-based brain–computer interfaces (arXiv:1611.08024)
[117] LeCun Y, Boser B, Denker J S, Henderson D, Howard R E, Hubbard W and Jackel L D 1989 Backpropagation applied to handwritten zip code recognition Neural Comput. 1 541–51
[118] Ledoit O and Wolf M 2004 A well-conditioned estimator for large-dimensional covariance matrices J. Multivariate Anal. 88 365–411
[119] Li J and Zhang L 2010 Bilateral adaptation and neurofeedback for brain computer interface system J. Neurosci. Methods 193 373–9
[120] Li S Z 2009 Markov Random Field Modeling in Image Analysis (Berlin: Springer)
[121] Li Y and Guan C 2008 Joint feature re-extraction and classification using an iterative semi-supervised support vector machine algorithm Mach. Learn. 71 33–53
[122] Li Y, Guan C, Li H and Chin Z 2008 A self-training semi-supervised SVM algorithm and its application in an EEG-based brain computer interface speller system Pattern Recogn. Lett. 29 1285–94
[123] Liang N and Bougrain L 2012 Decoding finger flexion from band-specific ECoG signals in humans Front. Neurosci. 6 91
[124] Lindgren J T 2017 As above, so below? Towards understanding inverse models in BCI J. Neural Eng. 15 012001
[125] Lindig-León C 2017 Multilabel classification of EEG-based combined motor imageries implemented for the 3D control of a robotic arm (Classification multilabels à partir de signaux EEG d'imaginations motrices combinées : application au contrôle 3D d'un bras robotique) PhD Thesis University of Lorraine, Nancy, France
[126] Lindig-León C and Bougrain L 2015 Comparison of sensorimotor rhythms in EEG signals during simple and combined motor imageries over the contra and ipsilateral hemispheres 37th Annual Int. Conf. of the IEEE Engineering in Medicine and Biology Society (Milan, Italy)
[127] Lindig-León C and Bougrain L 2015 A multilabel classification method for detection of combined motor imageries IEEE Int. Conf. on Systems, Man, and Cybernetics
[128] Lindig-León C, Gayraud N, Bougrain L and Clerc M 2016 Hierarchical classification using Riemannian geometry for motor imagery based BCI systems BCI Meeting (Asilomar, CA, USA, 2016)
[129] Liu G, Huang G, Meng J, Zhang D and Zhu X 2010 Improved GMM with parameter initialization for unsupervised adaptation of brain–computer interface Int. J. Numer. Methods Biomed. Eng. 26 681–91
[130] Liu G, Zhang D, Meng J, Huang G and Zhu X 2012 Unsupervised adaptation of electroencephalogram signal processing based on fuzzy C-means algorithm Int. J. Adapt. Control Signal Process. 26 482–95
[131] Llera A, Gómez V and Kappen H J 2012 Adaptive classification on brain–computer interfaces using reinforcement signals Neural Comput. 24 2900–23
[132] Llera A, Gómez V and Kappen H J 2014 Adaptive multiclass classification for brain computer interfaces Neural Comput. 26 1108–27
[133] Llera A, van Gerven M A, Gómez V, Jensen O and Kappen H J 2011 On the use of interaction error potentials for adaptive brain computer interfaces Neural Netw. 24 1120–7
[134] Long M, Wang J, Ding G, Sun J and Yu P S 2013 Transfer feature learning with joint distribution adaptation Proc. of the IEEE Int. Conf. on Computer Vision pp 2200–7
[135] Lotte F 2012 A new feature and associated optimal spatial filter for EEG signal classification: waveform length Int. Conf. on Pattern Recognition pp 1302–5
[136] Lotte F 2014 A tutorial on EEG signal-processing techniques for mental-state recognition in brain–computer interfaces Guide to Brain–Computer Music Interfacing (Berlin: Springer) pp 133–61
[137] Lotte F 2015 Signal processing approaches to minimize or suppress calibration time in oscillatory activity-based brain–computer interfaces Proc. IEEE 103 871–90
[138] Lotte F 2016 Towards usable electroencephalography-based brain–computer interfaces Habilitation Thesis (Habilitation à diriger des recherches, HDR) Univ. Bordeaux
[139] Lotte F, Bougrain L and Clerc M 2015 Electroencephalography (EEG)-based brain–computer interfaces Wiley Encyclopedia of Electrical and Electronics Engineering (New York: Wiley)
[140] Lotte F and Congedo M 2016 EEG Feature Extraction (New York: Wiley) pp 127–43
[141] Lotte F, Congedo M, Lécuyer A, Lamarche F and Arnaldi B 2007 A review of classification algorithms for EEG-based brain–computer interfaces J. Neural Eng. 4 R1–13
[142] Lotte F and Guan C 2009 An efficient P300-based brain–computer interface with minimal calibration time Assistive Machine Learning for People with Disabilities Symp.
[143] Lotte F and Guan C 2011 Regularizing common spatial patterns to improve BCI designs: unified theory and new algorithms IEEE Trans. Biomed. Eng. 58 355–62
[144] Lotte F and Jeunet C 2015 Towards improved BCI based on human learning principles 3rd Int. Brain–Computer Interfaces Winter Conf.
[145] Lotte F and Jeunet C 2017 Online classification accuracy is a poor metric to study mental imagery-based BCI user learning: an experimental demonstration and new metrics Int. Brain–Computer Interface Conf.
[146] Lotte F, Jeunet C, Mladenovic J, N'Kaoua B and Pillette L 2018 Signal Processing and Machine Learning for Brain–Machine Interfaces ed T Tanaka and M Arvaneh (Stevenage: Institution of Engineering and Technology (IET))
[147] Lotte F, Larrue F and Mühl C 2013 Flaws in current human training protocols for spontaneous brain–computer interfaces: lessons learned from instructional design Front. Human Neurosci. 7
[148] Lotte F, Lécuyer A and Arnaldi B 2009 FuRIA: an inverse solution based feature extraction algorithm using fuzzy set theory for brain–computer interfaces IEEE Trans. Signal Process. 57 3253–63
[149] Lowne D, Roberts S J and Garnett R 2010 Sequential non-stationary dynamic classification with sparse feedback Pattern Recogn. 43 897–905
[150] Lu N, Li T, Ren X and Miao H 2017 A deep learning scheme for motor imagery classification based on restricted Boltzmann machines IEEE Trans. Neural Syst. Rehabil. Eng. 25 566–76
[151] Lu S, Guan C and Zhang H 2009 Unsupervised brain computer interface based on inter-subject information and online adaptation IEEE Trans. Neural Syst. Rehabil. Eng. 17 135–45
[152] Luke S 2013 Essentials of Metaheuristics (Lulu) (https://cs.gmu.edu/~sean/book/metaheuristics/Essentials.pdf)
[153] Ma T, Li H, Yang H, Lv X, Li P, Liu T, Yao D and Xu P 2017 The extraction of motion-onset VEP BCI features based on deep learning and compressed sensing J. Neurosci. Methods 275 80–92
[154] Madjarov G, Kocev D, Gjorgjevikj D and Džeroski S 2012 An extensive experimental comparison of methods for multi-label learning Pattern Recogn. 45 3084–104 (Best Papers of Iberian Conf. on Pattern Recognition and Image Analysis)
[155] Makeig S, Kothe C, Mullen T, Bigdely-Shamlo N, Zhang Z and Kreutz-Delgado K 2012 Evolving signal processing for brain–computer interfaces Proc. IEEE 100 1567–84
[156] Manor R and Geva A B 2015 Convolutional neural network for multi-category rapid serial visual presentation BCI Front. Comput. Neurosci. 9
[157] Margaux P, Emmanuel M, Sébastien D, Olivier B and Jérémie M 2012 Objective and subjective evaluation of online error correction during P300-based spelling Adv. Human-Comput. Interact. 2012 4
[158] Mayaud L et al 2016 Brain–computer interface for the communication of acute patients: a feasibility study and a randomized controlled trial comparing performance with healthy participants and a traditional assistive device Brain–Comput. Interfaces 3 197–215
[159] McFarland D J, McCane L M, David S V and Wolpaw J R 1997 Spatial filter selection for EEG-based communication Electroencephalogr. Clin. Neurophysiol. 103 386–94
[160] McFarland D, Sarnacki W and Wolpaw J 2011 Should the parameters of a BCI translation algorithm be continually adapted? J. Neurosci. Methods 199 103–7
[161] Meng J, Yao L, Sheng X, Zhang D and Zhu X 2015 Simultaneously optimizing spatial spectral features based on mutual information for EEG classification IEEE Trans. Biomed. Eng. 62 227–40
[162] Meng J, Zhang S, Bekyo A, Olsoe J, Baxter B and He B 2016 Noninvasive electroencephalogram based control of a robotic arm for reach and grasp tasks Sci. Rep. 6 38565
[163] Millán J, Renkens F, Mouriño J and Gerstner W 2004 Noninvasive brain-actuated control of a mobile robot by human EEG IEEE Trans. Biomed. Eng. 51 1026–33
[164] Mladenovic J, Mattout J and Lotte F 2017 A generic framework for adaptive EEG-based BCI training and operation Handbook of Brain–Computer Interfaces ed C Nam et al (London: Taylor & Francis)
[165] Morioka H, Kanemura A, Hirayama J I, Shikauchi M, Ogawa T, Ikeda S, Kawanabe M and Ishii S 2015 Learning a common dictionary for subject-transfer decoding with resting calibration NeuroImage 111 167–78
[166] Mühl C, Jeunet C and Lotte F 2014 EEG-based workload estimation across affective contexts Front. Neurosci. 8 114
[167] Mullen T, Kothe C, Chi Y M, Ojeda A, Kerth T, Makeig S, Cauwenberghs G and Jung T P 2013 Real-time modeling and 3D visualization of source dynamics and connectivity using wearable EEG Annual Int. Conf. of the IEEE Engineering in Medicine and Biology Society vol 2013 pp 2184–7
[168] Müller J S, Vidaurre C, Schreuder M, Meinecke F C, von Bünau P and Müller K R 2017 A mathematical model for the two-learners problem J. Neural Eng. 14 036005
[169] Müller K R, Krauledat M, Dornhege G, Curio G and Blankertz B 2004 Machine learning techniques for brain–computer interfaces Biomed. Technol. 49 11–22
[170] Neuper C and Pfurtscheller G 2010 Neurofeedback training for BCI control Brain–Computer Interfaces: Revolutionizing Human–Computer Interaction ed B Graimann, G Pfurtscheller and B Allison (Berlin: Springer) pp 65–78
[171] Nicolas-Alonso L F, Corralejo R, Gomez-Pilar J, Álvarez D and Hornero R 2015 Adaptive semi-supervised classification to reduce intersession non-stationarity in multiclass motor imagery-based brain–computer interfaces Neurocomputing 159 186–96
[172] Niedermeyer E and da Silva F L 2005 Electroencephalography: Basic Principles, Clinical Applications, and Related Fields 5th edn (Philadelphia, PA: Lippincott Williams & Wilkins)
[173] Noirhomme Q, Kitney R and Macq B 2008 Single trial EEG source reconstruction for brain–computer interface IEEE Trans. Biomed. Eng. 55 1592–601
[174] Nurse E S, Karoly P J, Grayden D B and Freestone D R 2015 A generalizable brain–computer interface (BCI) using machine learning for feature discovery PLoS One 10 1–22
[175] Onishi A, Phan A, Matsuoka K and Cichocki A 2012 Tensor classification for P300-based brain computer interface IEEE Int. Conf. on Acoustics, Speech and Signal Processing (IEEE) pp 581–4
[176] Ortega J, Asensio-Cubero J, Gan J Q and Ortiz A 2016 Classification of motor imagery tasks for BCI with multiresolution analysis and multiobjective feature selection Biomed. Eng. Online 15 (Suppl. 1)
[177] Pan S J and Yang Q 2010 A survey on transfer learning IEEE Trans. Knowl. Data Eng. 22 1345–59
[178] Panicker R C, Puthusserypady S and Sun Y 2010 Adaptation in P300 brain–computer interfaces: a two-classifier cotraining approach IEEE Trans. Biomed. Eng. 57 2927–35
[179] Parzen E 1962 On estimation of a probability density function and mode Ann. Math. Stat. 33 1065–76
[180] Peng H, Long F and Ding C 2005 Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy IEEE Trans. Pattern Anal. Mach. Intell. 27 1226–38
[181] Pfurtscheller G, Müller-Putz G, Scherer R and Neuper C 2008 Rehabilitation with brain–computer interface systems IEEE Comput. 41 58–65
[182] Phan A, Cichocki A, Tichavský P, Zdunek R and Lehky S 2013 From basis components to complex structural patterns IEEE Int. Conf. on Acoustics, Speech and Signal Processing (IEEE) pp 3228–32
[183] Phan A H and Cichocki A 2010 Tensor decompositions for feature extraction and classification of high dimensional datasets Nonlinear Theory Appl. 1 37–68
[184] Quinlan J R 1986 Induction of decision trees Mach. Learn. 1 81–106
[185] Ramoser H, Muller-Gerking J and Pfurtscheller G 2000 Optimal spatial filtering of single trial EEG during imagined hand movement IEEE Trans. Rehabil. Eng. 8 441–6
[186] Ray A M et al 2015 A subject-independent pattern-based brain–computer interface Front. Behav. Neurosci. 9
[187] Rivet B, Cecotti H, Phlypo R, Bertrand O, Maby E and Mattout J 2010 EEG sensor selection by sparse spatial filtering in P300 speller brain–computer interface Annual Int. Conf. of the IEEE Engineering in Medicine and Biology Society (IEEE) pp 5379–82
[188] Rivet B, Souloumiac A, Attina V and Gibert G 2009 xDAWN algorithm to enhance evoked potentials: application to brain computer interface IEEE Trans. Biomed. Eng. 56 2035–43
[189] Rodrigues P, Bouchard F, Congedo M and Jutten C 2017 Dimensionality reduction for BCI classification using Riemannian geometry 7th Graz Brain–Computer Interface Conf. (Graz, Austria, September 2017)
[190] Roijendijk L, Gielen S and Farquhar J 2016 Classifying regularized sensor covariance matrices: an alternative to CSP IEEE Trans. Neural Syst. Rehabil. Eng. 24 893–900
[191] Roy R N, Charbonnier S, Campagne A and Bonnet S 2016 Efficient mental workload estimation using task-independent EEG features J. Neural Eng. 13 026019
[192] Royer A S, Doud A J, Rose M L and He B 2010 EEG control of a virtual helicopter in 3-dimensional space using intelligent control strategies IEEE Trans. Neural Syst. Rehabil. Eng. 18 581–9
[193] Rumelhart D E et al 1988 Learning representations by back-propagating errors Cogn. Model. 5 1
[194] Samek W, Kawanabe M and Müller K R 2014 Divergence-based framework for common spatial patterns algorithms IEEE Rev. Biomed. Eng. 7 50–72
[195] Sannelli C, Vidaurre C, Müller K R and Blankertz B 2011 CSP patches: an ensemble of optimized spatial filters. An evaluation study J. Neural Eng. 8 025012
[196] Sannelli C, Vidaurre C, Müller K R and Blankertz B 2016 Ensembles of adaptive spatial filters increase BCI performance: an online evaluation J. Neural Eng. 13 046003
[197] Schettini F, Aloise F, Aricò P, Salinari S, Mattia D and Cincotti F 2014 Self-calibration algorithm in an asynchronous P300-based brain–computer interface J. Neural Eng. 11 035004
[198] Schirrmeister R T, Springenberg J T, Fiederer L D J, Glasstetter M, Eggensperger K, Tangermann M, Hutter F, Burgard W and Ball T 2017 Deep learning with convolutional neural networks for EEG decoding and visualization Human Brain Mapp.
[199] Schlögl A, Kronegg J, Huggins J and Mason S G 2007 Evaluation criteria in BCI research Towards Brain–Computer Interfacing (Cambridge, MA: MIT Press) pp 327–42
[200] Schlögl A, Vidaurre C and Müller K R 2010 Adaptive methods in BCI research: an introductory tutorial Brain–Computer Interfaces (Berlin: Springer) pp 331–55
[201] Seno B D, Matteucci M and Mainardi L 2008 A genetic algorithm for automatic feature extraction in P300 detection IEEE Int. Joint Conf. on Neural Networks pp 3145–52
[202] Shenoy P, Krauledat M, Blankertz B, Rao R and Müller K R 2006 Towards adaptive classification for BCI J. Neural Eng. 3 R13
[203] Si S, Tao D and Geng B 2010 Bregman divergence-based regularization for transfer subspace learning IEEE Trans. Knowl. Data Eng. 22 929–42
[204] Song X, Yoon S C and Perera V 2013 Adaptive common spatial pattern for single-trial EEG classification in multisubject BCI Int. IEEE/EMBS Conf. on Neural Engineering pp 411–4
[205] Soria-Frisch A 2012 A critical review on the usage of ensembles for BCI Towards Practical Brain–Computer Interfaces (Biological and Medical Physics, Biomedical Engineering) ed B Z Allison et al (Berlin: Springer) pp 41–65
[206] Spüler M, Rosenstiel W and Bogdan M 2012 Online adaptation of a c-VEP brain–computer interface (BCI) based on error-related potentials and unsupervised learning PLoS One 7 e51077
[207] Sturm I, Lapuschkin S, Samek W and Müller K R 2016 Interpretable deep neural networks for single-trial EEG classification J. Neurosci. Methods 274 141–5
[208] Sugiyama M, Nakajima S, Kashima H, Buenau P V and Kawanabe M 2008 Direct importance estimation with model selection and its application to covariate shift adaptation Advances in Neural Information Processing Systems pp 1433–40
[209] Sykacek P, Roberts S J and Stokes M 2004 Adaptive BCI based on variational Bayesian Kalman filtering: an empirical evaluation IEEE Trans. Biomed. Eng. 51 719–29
[210] Tabar Y R and Halici U 2016 A novel deep learning approach for classification of EEG motor imagery signals J. Neural Eng. 14 016003
[211] Thiyam D B, Cruces S, Olias J and Cichocki A 2017 Optimization of alpha-beta log-det divergences and their application in the spatial filtering of two class motor imagery movements Entropy 19 89
[212] Thomas E, Dyson M and Clerc M 2013 An analysis of performance evaluation for motor-imagery based BCI J. Neural Eng. 10 031001
[213] Tieleman T 2008 Training restricted Boltzmann machines using approximations to the likelihood gradient Proc. of the 25th Int. Conf. on Machine Learning (ACM) pp 1064–71
[214] Tomioka R and Müller K R 2010 A regularized discriminative framework for EEG analysis with application to brain–computer interface NeuroImage 49 415–32
[215] Tsoumakas G and Katakis I 2007 Multilabel classification: an overview Int. J. Data Warehousing Mining 3 1–13
[216] van Erp J, Lotte F and Tangermann M 2012 Brain–computer interfaces: beyond medical applications IEEE Comput. 45 26–34
[217] Vaughan T, McFarland D, Schalk G, Sarnacki W, Krusienski D, Sellers E and Wolpaw J 2006 The Wadsworth BCI research and development program: at home with BCI IEEE Trans. Neural Syst. Rehabil. Eng. 14 229–33
[218] Verhoeven T, Hübner D, Tangermann M, Müller K R, Dambre J and Kindermans P J 2017 True zero-training brain–computer interfacing—an online study J. Neural Eng. 14 036021
[219] Vidaurre C, Kawanabe M, von Bünau P, Blankertz B and Müller K 2011 Toward unsupervised adaptation of LDA for brain–computer interfaces IEEE Trans. Biomed. Eng. 58 587–97
[220] Vidaurre C, Sannelli C, Müller K R and Blankertz B 2011 Co-adaptive calibration to improve BCI efficiency J. Neural Eng. 8 025009
[221] Vidaurre C, Sannelli C, Müller K R and Blankertz B 2011 Machine-learning-based coadaptive calibration for brain–computer interfaces Neural Comput. 23 791–816
[222] Vidaurre C, Schlögl A, Cabeza R, Scherer R and Pfurtscheller G 2007 Study of on-line adaptive discriminant analysis for EEG-based brain computer interfaces IEEE Trans. Biomed. Eng. 54 550–6
[223] Washizawa Y, Higashi H, Rutkowski T, Tanaka T and Cichocki A 2010 Tensor based simultaneous feature extraction and sample weighting for EEG classification Int. Conf. on Neural Information Processing (ICONIP 2010): Neural Information Processing. Models and Applications (Berlin: Springer) pp 26–33
[224] Waytowich N, Lawhern V, Bohannon A, Ball K and Lance B 2016 Spectral transfer learning using information geometry for a user-independent brain–computer interface Front. Neurosci. 10 430
[225] Wei Q, Wang Y, Gao X and Gao S 2007 Amplitude and phase coupling measures for feature extraction in an EEG-based brain–computer interface J. Neural Eng. 4 120
[226] Yi W, Qiu S, Qi H, Zhang L, Wan B and Ming D 2013 EEG feature comparison and classification of simple and compound limb motor imagery J. Neuroeng. Rehabil. 10 106
[227] Woehrle H, Krell M M, Straube S, Kim S K, Kirchner E A and Kirchner F 2015 An adaptive spatial filter for user-independent single trial detection of event-related potentials IEEE Trans. Biomed. Eng. 62 1696–705
[228] Wolpaw J, Birbaumer N, McFarland D, Pfurtscheller G and Vaughan T 2002 Brain–computer interfaces for communication and control Clin. Neurophysiol. 113 767–91
[229] Wolpaw J and Wolpaw E 2012 Brain–Computer Interfaces: Principles and Practice (Oxford: Oxford University Press)
[230] Wolpaw J R, McFarland D J, Neat G W and Forneris C A 1991 An EEG-based brain–computer interface for cursor control Electroencephalogr. Clin. Neurophysiol. 78 252–9
[231] Yger F 2013 A review of kernels on covariance matrices for BCI applications IEEE Int. Workshop on Machine Learning for Signal Processing pp 1–6
[232] Yger F, Berar M and Lotte F 2017 Riemannian approaches in brain–computer interfaces: a review IEEE Trans. Neural Syst. Rehabil. Eng. 25 1753–62
[233] Yger F, Lotte F and Sugiyama M 2015 Averaging covariance matrices for EEG signal classification based on the CSP: an empirical study 23rd European Signal Processing Conf. pp 2721–5
[234] Yin Z and Zhang J 2017 Cross-session classification of mental workload levels using EEG and an adaptive deep learning model Biomed. Signal Process. Control 33 30–47
[235] Yin Z, Zhao M, Wang Y, Yang J and Zhang J 2017 Recognition of emotions using multimodal physiological signals and an ensemble deep learning model Comput. Methods Programs Biomed. 140 93–110
[236] Yoon J W, Roberts S J, Dyson M and Gan J Q 2009 Adaptive classification for brain computer interface systems using sequential Monte Carlo sampling Neural Netw. 22 1286–94
[237] Zander T and Kothe C 2011 Towards passive brain–computer interfaces: applying brain–computer interface technology to human–machine systems in general J. Neural Eng. 8 025005
[238] Zanini P, Congedo M, Jutten C, Said S and Berthoumieu Y 2017 Transfer learning: a Riemannian geometry framework with applications to brain–computer interfaces IEEE Trans. Biomed. Eng.
[239] Zeyl T, Yin E, Keightley M and Chau T 2016 Partially supervised P300 speller adaptation for eventual stimulus timing optimization: target confidence is superior to error-related potential score as an uncertain label J. Neural Eng. 13 026008
[240] Zhang H, Chavarriaga R and Millán J D R 2015 Discriminant brain connectivity patterns of performance monitoring at average and single-trial levels NeuroImage 120 64–74
[241] Zhang K, Zheng V, Wang Q, Kwok J, Yang Q and Marsic I 2013 Covariate shift in Hilbert space: a solution via surrogate kernels Int. Conf. on Machine Learning pp 388–95
[242] Zhang Y, Zhou G, Jin J, Wang M, Wang X and Cichocki A 2013 L1-regularized multiway canonical correlation analysis for SSVEP-based BCI IEEE Trans. Neural Syst. Rehabil. Eng. 21 887–96
[243] Zhang Y, Zhou G, Jin J, Wang X and Cichocki A 2014 Frequency recognition in SSVEP-based BCI using multiset canonical correlation analysis Int. J. Neural Syst. 24 1450013
[244] Zhang Y, Zhou G, Jin J, Wang X and Cichocki A 2015 Optimizing spatial patterns with sparse filter bands for motor-imagery based brain–computer interface J. Neurosci. Methods 255 85–91
[245] Zhang Y, Zhou G, Jin J, Zhang Y, Wang X and Cichocki A 2017 Sparse Bayesian multiway canonical correlation analysis for EEG pattern recognition Neurocomputing 225 103–10
[246] Zhang Y, Zhou G, Zhao Q, Onishi A, Jin J, Wang X and Cichocki A 2011 Multiway canonical correlation analysis for frequency components recognition in SSVEP-based BCIs Neural Information Processing (Berlin: Springer)
[247] Zhao Q, Zhang L, Cichocki A and Li J 2008 Incremental common spatial pattern algorithm for BCI IEEE Int. Joint Conf. on Neural Networks
[248] Zhou S M, Gan J Q and Sepulveda F 2008 Classifying mental tasks based on features of higher-order statistics from EEG signals in brain–computer interface Inf. Sci. 178 1629–40
[249] Zhou Z, Wan B, Ming D and Qi H 2010 A novel technique for phase synchrony measurement from the complex motor imaginary potential of combined body and limb action J. Neural Eng. 7 046008