0% found this document useful (0 votes)

6 views

Using Machine Learning to Detect Events in Eye-tracking Data

This study presents a machine-learning approach using a random forest classifier for automated event detection in eye-tracking data, specifically identifying fixations, saccades, and post-saccadic oscillations. The proposed method outperforms existing hand-crafted algorithms and approaches the accuracy of manual coding, offering a more efficient solution for analyzing eye movement data. The research highlights the potential for machine learning to revolutionize event detection in eye-tracking applications, providing robust performance across varying noise levels and sampling frequencies.

Uploaded by

gozdens2002

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

6 views

Using Machine Learning to Detect Events in Eye-tracking Data

Uploaded by

gozdens2002

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

Behav Res (2018) 50:160–181

DOI 10.3758/s13428-017-0860-3

Using machine learning to detect events in eye-tracking data

Raimondas Zemblys1,2 · Diederick C. Niehorster3,4 · Oleg Komogortsev5 ·
Kenneth Holmqvist2,6

Published online: 23 February 2017

Abstract Event detection is a challenging stage in eye compared to current state-of-the-art event detection algo-
movement data analysis. A major drawback of current event rithms and can reach the performance of manual coding.
detection methods is that parameters have to be adjusted
based on eye movement data quality. Here we show that Keywords Eye movements · Event detection · Machine
a fully automated classification of raw gaze samples as learning · Fixations · Saccades
belonging to fixations, saccades, or other oculomotor events
can be achieved using a machine-learning approach. Any
already manually or algorithmically detected events can be Introduction
used to train a classifier to produce similar classification of other
data without the need for a user to set parameters. In this In eye movement research, the goal of event detection is
study, we explore the application of random forest machine- to robustly extract events, such as fixations and saccades,
learning technique for the detection of fixations, saccades, from the stream of raw data samples from an eye tracker,
and post-saccadic oscillations (PSOs). In an effort to show based on a set of basic rules and criteria which are appropri-
practical utility of the proposed method to the applications ate for the recorded signal. Until recently, researchers who
that employ eye movement classification algorithms, we ventured to record eye movements were required to conduct
provide an example where the method is employed in an time-consuming manual event detection. For instance, Har-
eye movement-driven biometric application. We conclude tridge and Thomson (1948) devised a method to analyze eye
that machine-learning techniques lead to superior detection movements at a rate of 10000 s (almost 3 h) of analysis time
for 1 s of recorded data, and as Monty (1975) remarked: “It
is not uncommon to spend days processing data that took
only minutes to collect” (p. 331–332).
Raimondas Zemblys Computers have fundamentally changed how eye move-
[email protected] ment data are analyzed. Today, event detection is almost
1
exclusively done by applying a detection algorithm to
Department of Engineering, Siauliai University, Siauliai, Lithuania
the raw gaze data. For a long time, two broad classes
2 Humanities Laboratory, Lund University, Lund, Sweden of algorithms were used: First, the velocity-based algo-
rithms that detect saccades and assume the rest to be
3 Humanities Laboratory and Department of Psychology, fixations. The most well-known is the I-VT algorithm of
Lund University, Lund, Sweden
Bahill, Brockenbrough, and Troost (1981) and Salvucci and
4 Institute for Psychology, University of Muenster, Muenster, Goldberg (2000), but the principle can be traced back to
Germany algorithms by Boyce from 1965 and 1967 as referred to by
5
Ditchburn (1973). The dispersion-based algorithms instead
Department of Computer Science, Texas State University,
San Marcos, TX, USA
detect fixations and assume the rest to be saccades. The
best known is the I-DT algorithm of Salvucci and Goldberg
6 UPSET, NWU Vaal, Vanderbijlpark, South Africa (2000).
Behav Res (2018) 50:160–181 161

Both the velocity-based and the dispersion-based algo- manner that generalizes to other previously unseen data. The
rithms come with thresholds that the user needs to set. Both resulting event detector, which we name identification by
classify data incorrectly if they are run on data with a sam- random forest (IRF), can simultaneously perform multiple
pling frequency outside the intended range. Data containing detection tasks, categorizing data into saccades, fixations,
noise, post-saccadic oscillations and smooth pursuit also and post-saccadic oscillation. Here we report that the perfor-
result in erroneous classification (Holmqvist et al., 2011). mance of the classifier constructed in this manner exceeds
The velocity and dispersion algorithms only detect fixa- that of the best existing hand-crafted event detection algo-
tions and saccades and tell us nothing about other events. rithms, and approaches the performance of manual event
The last decade has seen several improved event-detection coding done by eye-movement experts. The classifier fur-
algorithms, as researchers have hand-crafted new measures thermore provides event detection output that remains stable
of properties of the eye-movement signal and hand-tuned over a large range of noise levels and data sampling frequen-
algorithms to exploit these measures for event detection. cies. An interesting auxiliary result of training the classifier
For instance, Engbert and Kliegl (2003), Nyström and is that it provides insight into which detection features carry
Holmqvist (2010) and Mould, Foster, Amano, and Oakley the most information for classifying eye-movement data
(2012) use adaptive thresholds to free the researcher from into events.
having to set different thresholds per trial when the noise When evaluating the performance of any event detection
level varies between trials. Nonetheless, these algorithms method, it is important to recognize that the detected events
still only work over a limited range of noise levels (Hessels, are in most cases only an intermediate step that enable
Niehorster, Kemner, and Hooge, 2016). A very recent devel- further analysis steps. These further steps may ultimately
opment (Larsson et al., 2013, 2015) has enabled the auto- provide evidence for or against a research hypothesis, or
matic detection of smooth pursuit and post-saccadic oscil- may enable a practical application where eye movements
lations in clean data recorded at a high sampling frequency are being used. In this paper, we therefore analyze how the
that contains these events intermixed with fixations and sac- distribution of selected fixation and saccades-based met-
cades. Furthermore, Hessels et al. (2016) have presented a rics change as a function of noise level and data sampling
novel largely noise-resilient algorithm that can successfully frequency when event detection is performed by our IRF
detect fixations in data with varying noise levels, ranging algorithm. We furthermore examine whether the events
from clean data to the very noisy data typical of infant detected by our classifier can drive an eye-movement-driven
research. These algorithms are designed to solve a specific biometric application, and if it does so better than a com-
problem—smooth pursuit detection, or noise resilience— mon state-of-the-art hand-crafted event detection algorithm
using algorithmic rules and criteria specifically designed for (Nyström and Holmqvist, 2010). It is reasonable to expect
that problem. There are many other algorithms with specific that more precise eye movement classification should result
purposes, such as separating the slow and fast phase in nys- in better biometric performance.
tagmus, detecting microsaccades, online event detection for In summary, here we introduce an entirely new design
use in gaze-contingent research, or removing saccades from principle for event detection algorithms, where machine
smooth pursuit data (Holmqvist et al., 2016, Chapter 7). learning does the job of choosing feature combinations and
Most of these algorithms work well within the assump- selecting appropriate thresholds. We argue that this work is
tions they make of the data. Examples of common assump- the onset of a paradigm shift in the design of event detection
tions are that the input must be high-quality data, or data algorithms.
recorded at high sampling frequencies, and there is no
smooth pursuit in it. All algorithms come with overt settings
that users must experiment with to achieve satisfactory event Methods
detection in their data set, or covert settings that users have
no access to. When the sampling frequency is too low, or In this study, we use a similar approach as in Zemblys
too high, or the precision of the data is poor, or there is data loss, et al. (2015). We use clean data recorded with a high-end
many of these algorithms fail (Holmqvist et al. 2012, 2016). eye tracker and systematically resample and add increas-
In this paper, we present a new technique for developing ing levels of noise to simulate recordings from other eye
eye-movement event detectors that uses machine learning. trackers. We then train a random forest classifier to pre-
More specifically, a random forest classifier is built to find dict eye-movement events from features used by existing
combinations of data description measures and detection event detection algorithms, in conjunction with descrip-
features (e.g., velocity, acceleration, dispersion, etc.) that tors of data quality. It was shown in Zemblys (2016) that
enable robust event detection. Through training with manu- among ten tested machine-learning algorithms, Random for-
ally coded example data, the classifier automatically learns est resulted to the best eye-movement event classification
to perform event detection based on these features in a performance.
162 Behav Res (2018) 50:160–181

Machine-learning models are known to suffer from over- starting point in the sequence. Monocular eye movement
fitting. This happens when a model describes random error data were recorded at 1000 Hz using the EyeLink 1000 eye
or noise instead of the underlying relationship. Measures of tracker set up in tower mode. We chose to use data from this
model fit are not a good guide to how well a model will gen- eye tracker as it is known to exhibit low trackloss, and low
eralize to other data: a high R 2 does not necessarily mean a RMS noise that is more or less constant across the screen
good model. It is easy to over-fit the data by including too (Holmqvist et al., 2015). The average noise level2 in the
many degrees of freedom. Machine-learning-based models dataset was 0.015 ± 0.0029 ◦ RMS.
usually are very complex, can be non-linear and have a mul- The baseline dataset is then split into development and
titude of parameters. Therefore it is common practice when testing sets by randomly selecting data from one subject
performing a (supervised) machine learning to hold out part (i.e., 20 % of the data) to belong to the testing set. We do
of the available data as a test set. This data-split some- so in order to calculate out-of-the-bag error and evaluate
times is called cross-validation and is used to calculating how our algorithm generalizes to completely unseen data,
‘out-of-the-bag error’ i.e., evaluating how well the model i.e., data from a different subject. Further, the development
generalizes to completely unseen data. set is split into training and validation sets by randomly
Moreover, different machine-learning algorithms may assigning 25 % of each trial to the validation sets and leav-
have a number of settings (known as hyperparameters) that ing the remaining 75 % of the data in the training set. This
must be manually set. When evaluating different hyperpa- validation dataset will be used when tuning the classifier.
rameters there is still a risk of over-fitting on the test set An expert with 9 years of eye-tracking experience (author
because the parameters can be tweaked until the model per- RZ) manually tagged raw data into fixations, saccades, post-
forms optimally. This way, knowledge about the test set saccadic oscillations (PSOs) and undefined events, which
can “leak” into the model and evaluation metrics no longer were used as baseline events. In the eye-tracking field, man-
report on generalization performance. To solve this prob- ual event classification was the dominant option long into
lem, yet another part of the dataset needs be held out as a the 1970s. Manual coding has also been used as a “golden
so-called validation set: training proceeds on the training standard” when comparing existing algorithms (Andersson
set, after which hyperparameters are adjusted and evaluation et al., 2016) and when developing and testing new algo-
is done on the validation set, and when the trained model rithms, e.g., Munn et al. (2008), Larsson et al. (2013, 2015).
seems to be optimal, the final evaluation is performed on From this literature, we know that human coders agree with
the test set. This is how we conducted the development and each other to a larger extent than existing event detection
evaluation in this paper. algorithms. We did not have multiple coders to analyze
To train a classifier we use the random forest implemen- inter-rater reliability, as this would open another research
tation in the Scikit-learn library (Pedregosa et al., 2011) and question of how the coder’s background and experience
the LUNARC1 Aurora computer cluster to speed up train- affect the events produced. Instead, the focus in this paper is
ing. To visualize the result we use Seaborn; a python pack- on how well we can recreate original events using machine
age developed for statistical data visualization (Waskom learning–irrespective of what coder produced those events.
et al., 2016). We also use the lme4 package (Bates et al., A total of 560 fixations, 555 saccades and 549 PSOs were
2015) in R to perform statistical analysis when compar- tagged in the raw data. The total number of saccades and
ing the output of our classifier at different hyperparameter other event in the dataset might seem very low, however
settings. after the data augmentation step (which we describe next),
the total number of each event approaches 54,000. Fixation
Baseline dataset durations ranged from 21 to 1044 ms, saccades had dura-
tions from 6 to 92 ms and amplitudes ranging from 0.1◦ to
Our baseline dataset consisted of eye movement record- 29.8◦ , while PSO durations were 2–60 ms. The durations
ings where five participants perform a simple fixate-saccade of the manually tagged fixations and saccades have a some-
task. We asked participants to track a silver 0.2◦ dot with what bimodal distributions (see swarmplots in Fig. 1). This
a 2 × 2 pixel black center that jumped between positions reflects a common behavior in the type of fixation-saccade
in an equally spaced 7 × 7 grid, and showed for 1 s at task we employed: a large amplitude saccade towards a new
each point. PsychoPy (Peirce, 2007) was used to present target often undershoots and is followed by (multiple) fixa-
the dot on a black background. Stimulus targets changed tions of short duration and small corrective saccades, which
their position every second and were presented in the same are then finally followed by a long fixation while looking
pseudo-random order for all participants, but with a random
2 Weused the method based on kernel density estimation to select
1 Lunarc is the center for scientific and technical computing at Lund sample windows in the data for which the noise level was calculated
University. http://www.lunarc.lu.se/. (Holmqvist et al., 2015).
Behav Res (2018) 50:160–181 163

data, i.e., recorded with any eye tracker and thus having very
different noise levels and sampling rates. Commercial eye-
trackers sample data starting from 30 Hz (the Eye Tribe, and
the first versions of the SMI and Tobii Glasses) and rang-
ing up to 2000 Hz (e.g., Eyelink 1000 and 1000 Plus, and
the ViewPixx TRACKPixx). In this study, we augment our
baseline dataset to simulate recordings of the most common
eye trackers on the market. First-order spline interpolation
is used to resample data to 60, 120, 200, 250, 300, 500,
and 1250 Hz. Before resampling, the data was low-pass fil-
tered using a Butterworth filter with cut-off frequency of
0.8 times the Nyquist frequency of the new data rate and a
window size of 20 ms.
Next, we systematically added white Gaussian noise to
Fig. 1 Tukey boxplots of manually tagged event durations. Red cir-
cles indicate means. Overlaid are the swarmplots, which show a
the resampled dataset at each sampling frequency. More
representation of the underlying distribution specifically, for each recording in each data set, we added
a white Gaussian noise signal generated using the Box–
Muller method (Thomas et al., 2007), separately for the
horizontal and vertical gaze components. The choice to
at the target. Figure 2 shows the amplitude distribution of use white Gaussian noise to model measurement noise is
all the saccades in the dataset. This distribution is less typi- motivated by recent findings that when recording from an
cal of psychological experiments, such as visual search and artificial eye (which has no oculomotor noise), eye track-
scene viewing tasks, because of the many saccades with ers generally produce white noise (Coey et al., 2012; Wang
amplitudes of less than 2◦ (Holmqvist et al., 2016, p. 466). et al., 2016a). Both studies showed that the power spectrum
However, in reading research and fixate-saccade experi- is in this case constant and independent of frequency.
ments small saccades are abundant. Furthermore, small However, it remains an open question what type of
saccades are harder to reliably detect as they easily drown measurement noise is found in recordings of human eye
in noise, and small saccades have more prominent PSOs movements. Next to inherent oculomotor noise (tremor
(Hooge et al., 2015). Therefore, our dataset not only makes and drift), there are many possible eye tracker-dependent
a good basis for training an universal event detection algo- sources of colored noise: filtering, fluctuations in lighting,
rithm, but also provides a challenge for any event detection pupil diameter changes, head movements, etc. To account
algorithm. for these additional noise sources, other studies modeled
the noise using other methods, e.g., an autoregressive pro-
Data augmentation cess Otero-Millan et al. (2014) or power spectral density
(PSD) estimation methods Wang et al. (2016b) and Hessels
Our goal in this paper is to develop a universal classifier et al. (2016), which model noise as a sum of oculomotor
that would be able to work with any type of eye-tracking and measurement noise. It is, however, questionable if it
is valid to add such noise to a signal of human origin that
already includes an oculomotor noise component. More-
over, it is unclear if using a noise model obtained from
real data and scaling it to create different noise levels is
valid. This assumes that oculomotor and measurement noise
always scale in the same ratio—it is more likely that oculo-
motor noise is relatively constant and that high noise in the
recording is mostly caused by the eye trackers. In this study,
we therefore chose to use white noise to model noise incre-
ments in our data as white noise transparently simulates well
understood reasons for different noise levels in eye-tracking
data: lower-quality sensors used in lower-quality eye track-
Fig. 2 Distribution of amplitudes of manually tagged saccades. Bin ers or lower pixel density eye image in remote eye trackers,
size 1◦ compared to tower mounted systems, where the eye is close
164 Behav Res (2018) 50:160–181

to the camera sensor. Our baseline data set is recorded from After data augmentation, we have around 6.4 million
humans and already includes natural oculomotor noise, such samples in our training set and more than 2.1 million
as microsaccades, and drifts. The color of the noise spec- samples in both the validation and testing sets each.
trum due to the oculomotor noise is inherited to all the
augmented data after adding white noise. At higher noise Feature extraction
levels, however, the oculomotor noise will drown in the data,
and the noise color disappears as the signal-to-noise ratio Right before feature extraction, we interpolate missing data
gets lower and lower. using a Piecewise Cubic Hermite Interpolating Polynomial.
Holmqvist et al. (2015) show that for most of the eye This kind of interpolation preserves monotonicity in the data
trackers, noise in the corners of the screen is up to three and does not overshoot if the data is not smooth. At this
times higher than in the middle of the screen. In the noise stage, there is no limit on the duration of periods of miss-
we add to our datasets, we simulate this increase of noise ing data, but later we remove long sequences of interpolated
with distance from the middle of the screen by using a 2-D data (see “Post-processing: Labeling final events”).
Gaussian noise mapping function, which is minimal in the We then perform feature extraction on the raw data at
middle and maximal in the corners. The standard deviation each sampling frequency and noise level. For each feature,
of this Gaussian mapping function was chosen such that the this process produces one transformed sample for each input
noise level at the corners of the stimulus plane at ±15◦ was sample. For instance, the velocity feature is computed by
three times higher than in the middle of the screen. calculating the gaze velocity for each sample in the input
We further scale the variance of the generated noise sig- data. In this paper, we use the 14 features listed in Table 1,
nal to ten levels starting from 0.005◦ RMS, where each yielding a 14-dimensional feature vector for each sample.
subsequent noise level is the double of the previous one. Most of these features are based on the local 100–200-ms
This results in additive noise levels ranging from 0.005◦ surroundings of each sample. The features we employ either
to 2.56◦ RMS in the middle of the screen and three times describe the data in terms of sampling frequency and preci-
higher noise at the corners. Figure 3 shows distributions of sion, or are features that are used in common or state-of-the-
the resulting noise levels in part of our augmented dataset art hand-crafted event detection algorithms. Next to these,
(500 Hz), along with the noise level in the baseline data set. we also propose several new features, which we hypothesize
Note how the resulting noise distributions overlap at each are likely to be useful for the detection of the onset and off-
level of added noise. Modeling the noise this way not only set of saccades: rms-diff, std-diff and bcea-diff. These new
represents the real case scenario with variable noise in each features are inspired by Olsson (2007) and are calculated
of the recordings, but also covers the whole range of noise by taking the difference in the RMS, STD, and BCEA pre-
from minimum 0.0083◦ to maximum 7.076◦ RMS in a con- cision measures calculated for 100-ms windows preceding
tinuous fashion. In comparison, the baseline data cover a and following the current sample. Obviously, the largest dif-
noise range from min 0.0058◦ to max 0.074◦ RMS. ferences (and therefore peaks in the feature) should occur
around the onset and offset of the saccades. We expect that
many of the features used in this paper will be highly cor-
related with other features. This provides room to optimize
the computational complexity of our model by removing
some of the correlated features. In the next step, the 14-
dimensional feature vector produced by feature extraction
for each sample is fed to the machine-learning algorithm.

Algorithm

In this study, we use a random forest classifier to perform

the initial classification of each raw data sample. We then
apply heuristics to the output of the classifier, such as the
merging of nearby fixations and the removal of fixations
and saccades that are too short, to produce the final events.
Fig. 3 Tukey boxplots of measured noise level in ◦ RMS in our base-
No user-adaptable settings will result from these heuristics.
line and augmented 500-Hz data, as a function of the amount of added An advantage of using a random forest classifier instead of
noise other machine-learning algorithms is that we can use the 14
Behav Res (2018) 50:160–181 165

Table 1 Features used to train random forest classifier

Feature Description

fs sampling frequency (Hz). As some features may provide different information at different sampling
rates (e.g., SMI BeGaze uses velocity for data sampled at 200 Hz and more and dispersion at lower
frequencies), providing the classifier with information about sampling frequency may allow it to
make better decision trees
rms root mean square (◦ ) of the sample-to-sample displacement in a 100-ms window centered on a
sample. The most used measure to describe eye-tracker noise (Holmqvist et al., 2011)
std standard deviation (◦ ) of the recorded gaze position in a 100-ms window centered on a sample.
Another common noise measure (Holmqvist et al., 2011)
bcea bivariate contour ellipse area (◦2 ). Measures the area in which the recorded gaze position in a 100-ms
window is for P % of the time (Blignaut and Beelders, 2012). P = 68
disp dispersion (◦ ). The most common measure in dispersion-based algorithms (Salvucci & Goldberg,
2000). Calculated as (xmax − xmin ) + (ymax − ymin ) over a 100-ms window
vel, acc velocity (◦ /s) and acceleration (◦ /s 2 ), calculated using a Savitzky–Golay filter with polynomial
order 2 and a window size of 12 ms—half the duration of shortest saccade, as suggested by Nyström
and Holmqvist (2010)
med-diff distance (◦ ) between the median gaze in a 100-ms window before the sample, and an equally sized
window after the sample. Proposed by Olsson (2007)
mean-diff distance (◦ ) between the mean gaze in a 100-ms window before the sample, and an equally sized win-
dow after the sample. Proposed by Olsson (2007) and used in the default fixation detection algorithm
in Tobii Studio
Rayleightest a feature used by Larsson et al. (2015) that indicates whether the sample-to-sample directions in a
22-ms window are uniformly distributed
i2mc introduced by Hessels et al. (2016) to find saccades in very noisy data. We used the final weights
provided by the two-means clustering procedure as generated by the original implementation of the
algorithm. A window size of 200 ms, centered on the sample was used
rms-diff, std-diff, bcea-diff features inspired by Olsson (2007), but instead of differences in position, we take the difference
between noise measures calculated for 100-ms windows preceding and succeeding the sample

A minimum of three samples are used in case there are not enough samples in the defined window, as may happen for lower frequency data

features as they are. There is no need to scale, center, or rules and thresholds set on these features by the algorithm’s
transform them in any way. designer, derive which event the sample likely belongs to.
A random forest is an ensemble method in the sense that
Random forest classifier it builds several independent estimators (trees). For each
sample, it then either produces a classification by a majority
A random forest classifier works by producing many deci- vote procedure (“this sample is part of a saccade, because
sion trees. Each tree, from its root to each of its leaves, 45 out of 64 trees classified it as such”), or it produces a
consists of a series of decisions, made per sample in the probabilistic classification (“the probability that this sam-
input data, based on the 14 features that we provide the ple is part of a saccade is 4564 = 70 %”). We use a fully
classifier with. A tree could for instance contain a decision probabilistic approach, where the class probability of a sin-
such as “if around this sample, RMS is smaller than 0.1◦ , gle tree is the fraction of samples of the same class in a
and the sampling frequency is less than 100 Hz, use disp, leaf and where individual trees are combined by averaging
else use i2mc”. Every tree node—equaling a singular log- their probabilistic prediction, instead of letting each clas-
ical proposition—is a condition on a single feature, bound sifier vote for a single class. Each of the decision trees in
to other nodes in a tree with if-then clauses, which brings the ensemble is built using a random subset of the fea-
the algorithm closer to deciding whether the sample belongs tures and a random subset of training samples from the data.
to e.g., a fixation or a saccade. These decisions are similar This approach goes by the name of bootstrap aggregation,
to how traditional hand-crafted event detection algorithms known in the machine-learning literature as bagging. As the
work. These also take a number of features (such as veloc- result of bagging, the bias (underfitting) of the forest usu-
ity, acceleration, noise level, etc.) as input and, by means of ally increases slightly but, due to averaging, its variance
166 Behav Res (2018) 50:160–181

(overfitting) decreases and compensates for the increase Machine-learning techniques for building a classifier
in bias, hence yielding an overall better model (Breiman, allow assessing feature importance. Feature importance
2001). indicates how useful a given feature is for correctly clas-
sifying the input data. This can be used to help develop a
Training parameters better understanding of how certain properties of the input
data affect event detection, such as whether sampling fre-
When training a random forest classifier, a few parameters quency is important to the detection of saccades (which is
need to be set. Two important parameters are the number of debated, see Holmqvist et al. 2011, p. 32). Measures of fea-
estimators, i.e., the number of trees in the forest, and the cri- ture importance however also allow reducing the number of
terion, i.e., a function used to measure the quality of a node features used by the classifier, which might improve gen-
split, that is, a proposed border between saccade samples eralizability of the classifier to unseen data and reduce the
and fixation samples. Selecting the number of trees is an computational and memory requirements for running the
empirical problem and it is usually done by means of cross- classifier.
validation. For example, Oshiro, Perez, and Baranauskas There are several measures of feature importance. A ran-
(2012) trained a classifier on 29 datasets of human medi- dom forest classifier directly gives an assessment of feature
cal data, and found that there was no benefit in using more importance in the form of the mean decrease impurity.
than 128 trees when predicting human medical conditions. This number tells us how much each feature decreases
We chose to use 200 trees as a starting point, because ran- the weighted impurity in a tree. It should, however, be
dom forest classifiers do not overfit (Breiman, 2001). As a noted that some of the features we use are highly cor-
function to measure the quality of a decision made by each related with each other (see Fig. 4), as expected. Highly
tree, we use the Gini impurity measure. It can be understood correlated features complicate assessing feature importance
as a criterion to minimize the probability of misclassifica- with the mean decease of impurity method. When train-
tion. Another commonly used criterion is the information ing a model, any of the correlated features can be used
gain which is based on entropy. Raileanu and Stoffel (2004) as the predictor, but once one of them is used, the impor-
found that these two metrics disagree about only 2 % of tance of the other highly correlated features is signifi-
decision made by the tree, which means it is normally not cantly reduced since the other features provide little extra
worth spending time on training classifiers using different information.
impurity criteria. The Gini impurity criterion was chosen There are a number of other feature selection meth-
because it is faster to calculate than the information gain ods in machine learning, e.g., Correlation Criteria, Mutual
criterion. information and maximal information coefficient, Lasso
To deal with our unbalanced dataset where most sam- regression, etc., that each have their specific strengths and
ples belong to fixations, we use the balanced subsample weaknesses (Guyon & Elisseeff, 2003). In this study, in
weighting method.3 We further limit each tree in the ensem- addition to mean decrease impurity (MDI), we chose to
ble to use a maximum of three features, which is close to use two additional methods of assessing feature impor-
the square root of the number of features we provide the tance that are well suited for non-linear classifiers; mean
classifier with. This is one of the invisible hyperparameters decrease accuracy (MDA) and univariate feature selec-
that makes random forests powerful. However, we do not tion (UFS). Mean decrease accuracy directly measures the
limit the depth of the tree, the minimum number of sam- impact of each feature on the accuracy of the model. After
ples required to split an internal tree node nor the minimum training a model, the values of each feature are permuted
number of samples in newly created leaves. and we measure how much the permutation decreases the
accuracy (we use Cohen’s kappa to measure accuracy, see
Classifier optimization “Sample-to-sample classification accuracy”) of the classi-
fier. The idea behind this technique is that if a feature is
After training the full classifier using all 14 features and 200 unimportant, the permutation will have little to no effect
trees, we reduced the computational and memory require- on model accuracy, while permuting an important feature
ments of the classifier by removing unimportant features would significantly decrease accuracy. Univariate feature
and reducing the number of trees. The procedure for both selection, and more specifically single variable classifiers,
these optimizations is described in this section. assesses feature importance by building a classifier using
only an individual feature, and then measure the perfor-
mance of each of these classifiers.
3 Fora detailed description see http://scikit-learn.org/stable/modules/ We optimize our model by performing recursive fea-
generated/sklearn.ensemble.RandomForestClassifier.html. ture elimination using the following equation as a feature
Behav Res (2018) 50:160–181 167

Fig. 4 Spearman’s rank correlation between the features in the training dataset

elimination criterion, which sums the squared feature meaningful eye-tracking events (apply a categorization rule
importances and then removes the one with the lowest value: according to Hessels et al. 2016). For each of the sam-
ples, our random forest classifier outputs three probabilities
argmin m2 (1) (summing to 1), indicating how likely the sample belongs
∀f eat ∀mf eat to a fixation, a saccade or a PSO. This is done internally,
where m is the MDI, MDA, and UFS measures of a feature’s with no user-accessible settings. We first apply a Gaus-
importance, normalized to [0-1] range. sian smoother (σ = 1 sample) over time for each of the
Specifically, we train a classifier using all 14 features, three probabilities, then label each sample according to what
find the least important feature using Eq. 1, remove this fea- event it most likely belongs to, and then use the following
ture and retrain the classifier using less and less features, heuristics to determine the final event labels.
until there are only four left—one more than the maximum – mark events that contain more than 75 ms of interpo-
number of features used in each tree. lated data as undefined.
The size of the classifier is further reduced by find- – merge fixations which are less than 75 ms and 0.5◦
ing the number of trees after which there is no further apart.
improvement in classifier performance. For each of the dif- – make sure that all saccades have a duration of at least
ferent number of features, we trained classifiers with 1, 4, three samples, expand if required, which means that
8, up to 196 (with a step size of 4) trees. We then run if we have a one sample saccade, we also label the
each of these reduced classifiers on the validation set and preceding and following samples as saccade.
assess their performance by means of Cohen’s kappa (see – merge saccades that are closer together than 25 ms.
“Sample-to-sample classification accuracy” below). We – remove saccades that are too short (<6 ms) or too long
then employ a linear mixed effects model (Bates, Mächler, (>150 ms).
Bolker, and Walker, 2015) with number of trees and num- – remove PSOs that occur in other places than directly
ber of features as categorical predictors to test below which after a saccade and preceding a fixation.
number of features and trees the performance of the clas-
– remove fixations shorter than 50 ms.
sifier as indicated by Cohen’s kappa starts to decrease
– remove saccades and following PSO events that sur-
significantly compared to the full classifier using all features
round episodes of missing data as these are likely blink
and 200 trees. The linear mixed effects model included sub-
events.
ject, sampling rate and added noise level as random factors
with random intercepts. Removal of a saccade, PSOs or fixations means that the
sample is marked as unclassified, a fourth class. Unclassi-
Post-processing: Labeling final events fied samples also existed in the manual coding, but in the
below we do not compare agreement between the manual
After initial classification of raw data samples (Hessels et al. coder and the algorithm on which samples are unclassi-
2016 refers to it as a search rule), the next step is to produce fied. While these parameters of heuristic post-processing are
168 Behav Res (2018) 50:160–181

accessible to the user, they are designed to work with all would be poor choices in our case because of our unbal-
types of data that we use in this paper, and as such, we do not anced data set with nearly 89 % of the samples tagged as
expect that users would need to change these parameters. fixations, while only 6.8 and 4.3 %, respectively, are sac-
cade and PSO samples. Larsson et al. (2013, 2015), for
Performance evaluation instance, also report sensitivity (recall) and specificity, and
in the machine-learning literature the F1 score is common.
We evaluate our identification by random forest (IRF) algo- For our dataset, where almost 90 % of the samples belong
rithm, as optimized by the procedure detailed above, using to a fixation, a majority class classifier that indicates that all
three approaches: sample-to-sample classification accuracy, samples are a fixation would result in a high score for these
ability to reproduce ground truth event measures and fun- measures. The advantage of using Cohen’s kappa in our case
damental saccadic properties (main sequence), and perfor- is that the majority class model would result in a score of
mance in a eye-movement biometrics application. Currently, 0, correctly indicating that our classifier fails to provide us
there are only two other algorithms, which are able to with a meaningful classification.
detect all the three events we concern ourselves with—
fixations, saccades, and PSOs. One of these is the algorithm Evaluation of event measures
by Nyström and Holmqvist (2010)4 (hereafter NH) and
the other is the algorithm by Larsson et al. (2013, 2015) To test whether our algorithm produces event measures that
(hereafter LNS). Unfortunately, an implementation of the are similar to those provided by the manual coder, we exam-
latter is not publicly available. Implementing it ourselves ine the durations and number of fixations, saccades and
is tricky and might lead to painting an incorrect picture of PSOs produced by IRF, as well as main sequence param-
this algorithm’s performance. In the following, we there- eters. The main sequence and amplitude-duration relation-
fore only compare the performance of our algorithm to that ships are fundamental properties of saccadic eye movements
of Nyström and Holmqvist (2010). In order to ensure that that should be maintained by any classification algorithm.
the NH algorithm performs optimally, we manually checked To evaluate how well our algorithm reproduces the main
the output of the algorithm and adjusted settings to best suit sequence compared to manually coded data, we first cal-
the input data. We found that default initial velocity thresh- culated saccade amplitude vs. peak velocity and amplitude
old of 100◦ /s works fine for data with average noise level vs. duration relationships on the high-quality manual data.
up to 0.5◦ RMS and increase it to 200–300◦ /s for noisier
A
We used Vpeak = Vmax ∗ (1 − e− C ) to fit the amplitude-
input. These initial thresholds then adapted (decreased) to peak velocity relationship, where Vpeak and A here are
the noise in the data. saccade peak velocity and amplitude, while Vmax together
with C are parameters to be determined by the model fit
Sample-to-sample classification accuracy (Leigh & Zee, 2006, p. 111). For the amplitude vs. duration
relationship we used a linear function (Carpenter, 1988).
To evaluate the performance of our algorithm, we com- Next we parsed our augmented data (which were down-
pare manual coding with the output of the algorithm using sampled and had added noise) using our and the NH algo-
Cohen’s kappa (K), which measures inter-rater agreement rithms. We then calculated saccade amplitudes from the
for categorical data (Cohen, 1960). Cohen’s kappa is a num- detected events and predicted saccade peak velocities and
ber between -1 and 1, where 1 means perfect agreement and saccade durations using the previously obtained parame-
0 means no agreement between the raters other than what ters for the main sequence relationship in the baseline data.
would be expected by chance. Scores above 0.8 are consid- We assume that output data from a better performing clas-
ered as almost perfect agreement. Using K as our evaluation sification algorithm (compared to a baseline) will lead to
metric will allow to directly compare the performance of our estimated saccadic peak velocities and duration that closer
algorithm to that reported in the literature, because Cohen’s match those observed in the baseline data set. This allows
kappa has previously been used in the eye-tracking field us to evaluate how well saccade parameters are preserved
to assess the performance of newly developed event detec- when the data is degraded or different algorithms are used
tion algorithms (Larsson et al., 2013, 2015) and to compare to detect events. We use the coefficient of determination
algorithms to manual coding (Andersson et al., 2016). (R 2 ) as a metric for the goodness of fit. Note that R 2 can
While there are a number of other metrics to assess be negative in this context, because the predictions that are
sample-to-sample classification accuracy, these methods compared to the corresponding measured data have not been
derived from a model-fitting procedure using those data, i.e.,
the fits from ground truth data can actually be worse than
4 A MATLAB implementation is available for download at just fitting a horizontal line to the data obtained from the
http://www.humlab.lu.se/en/person/MarcusNystrom/. algorithm. We trim negative values to 0.
Behav Res (2018) 50:160–181 169

Eye-movement driven biometrics performance resulting operational performance of a biometric system in

an authentication scenario.
Last, in an effort to show the practical utility of the IRF To calculate the biometric performance for each classifi-
algorithm to applications that employ eye movement clas- cation algorithm, dataset, and stimulus, the recordings were
sification algorithms, we provide an example where the randomly partitioned into training and testing sets by sub-
method is employed in an eye movement-driven biometric ject, with each subject having an equal chance to be placed
application. A biometric application provides a good test- in each set. Half of the subject pool was assigned to the
ing ground because its performance can be expressed with a training set and half of the subject pool was assigned to the
single number that indicates the accuracy of person recog- testing set, with no overlap between them. Biometric algo-
nition. This makes it possible to straightforwardly interpret rithm thresholds were selected on the training set and the
the effect of changes that occur in the components of the average EER was computed on the testing set over 100 such
biometric framework, such as the detected eye-movement random partitions.
events that it consumes.
As an eye movement-driven biometric application, we
decided to use the complex eye movement extended (CEM- Results
E) biometrics approach (Rigas et al., 2016). The CEM-E
approach is an extension of the complex eye movement Model optimization
patterns (CEM-P) (Holland & Komogortsev, 2013b) and
the complex eye movement behavior (CEM-B) (Holland & At the outset, our goal was to get the best possible clas-
Komogortsev, 2013a) approaches. It currently represents the sification performance with all 14 features. Therefore we
best performing eye-movement-driven biometrics approach trained a classifier consisting of 200 decision trees and did
when such things as computational complexity and simplic- not limit tree depth or number of samples in newly cre-
ity of biometrics features and their fusion are taken into ated leaves, which allowed it to fit the training data nearly
account. In short, CEM-E consists of several fixation and perfectly. This resulted in an extremely complex and large
saccade-related features that together provide a biometric classifier with large computational and memory require-
fingerprint of a person. Fixation features are: onset, off- ments (see Appendix D). We will refer to this classifier as
set times, duration, and vertical and horizontal coordinates. the full classifier.
Saccade features are: onset, offset times, duration, verti- The average performance of this full classifier on the
cal and horizontal amplitudes, vertical and horizontal mean validation dataset is impressive: K = 0.85. For compari-
and peak velocities, horizontal and vertical vigor, horizontal son, the full classifier achieves an F1 score of 0.97 while
and vertical mean acceleration, and horizontal and vertical the majority class model gets F 1 = 0.83 and K = 0.
mean acceleration ratio. It is clear that the features described Note that here we are examining what we will refer to as
here are directly affected by eye-movement classification raw performance, i.e., the performance of only the classifier
performance. itself, before applying the heuristics described in the meth-
To test biometric performance when eye-movement clas- ods section to its output. The heuristics will only be applied
sification is done by our algorithm, we have selected two after we have optimized the classifier.
datasets of different levels of data quality. Each dataset Next, after we have established the baseline performance
contains three different types of stimulus. The difference of the full classifier, we do model optimization by recur-
in the dataset quality and employed stimulus is important sively removing the least important features (see Eq. 1) and
because it provides a variety of challenges to the classi- limiting the number of trees. We do this by measuring the
fication algorithms. A detailed description of used stimuli raw performance of each of the simplified classifiers on the
and those datasets are provided in Appendix A. While the validation dataset. This way, we assure that the performance
IRF and NH algorithms output fixations, saccades, and post- of the final classifier (in terms of detecting event candidates,
saccadic oscillations, the CEM-E framework only employs the “search rule”) we end up with after simplification is as
features that are fixation and saccade related. Post-saccadic similar to the full classifier as possible.
oscillations are not employed for biometrics in the current
implementation of CEM-E framework. Feature correlations
One of the core metrics describing biometric perfor-
mance is the equal error rate (EER). The EER metric is We first calculate Spearman’s rank correlation between the
employed in an authentication biometric scenario and it features in the training dataset. Feature correlations are a
is an operating point at which the false acceptance rate good first indicator of which features add little extra infor-
(FAR) is equal to the false rejection rate (FRR) (Jain et al., mation over the other features they are highly correlated
2007). In short, the lower the EER number, the better is the with. Correlations are calculated including all samples, i.e.,
170 Behav Res (2018) 50:160–181

fixations, saccades, and PSO, over all sampling frequencies

and noise levels. As expected, features that describe simi-
lar properties of the data are highly correlated. For example,
all the measures of precision that we used, RMS, STD,
BCEA and disp, are highly correlated with each other (r =
0.88−0.99), which suggests that all precision measures pro-
vide very similar information. Unsurprisingly, the velocity
and acceleration measures are also highly correlated with
each other (r = 0.83) and with all four precision measures
(r = 0.82 − 0.87 for velocity and r = 0.76 − 0.86 for accel-
eration). The reason for such high correlations is probably Fig. 5 Performance of classifiers trained using only individual
that all these measures describe the amount of sample-to- features
sample movement or spatial spread of the raw data, be it
be noise or a saccade. The difference between positions of
a window preceding and succeeding the sample (mean-diff contains noisy and low sampling rate data. This is consis-
and med-diff) also represent movement, but its value is only tent with the common practice to use algorithms based on
large in the case of a real movement, i.e., it is largely unaf- spatial features for noisy and low sampling rate data.
fected by noise. This is somewhat reflected in correlation Figure 6 shows feature importances, calculated using
values: while mean-diff and med-diff are highly correlated Eq. 1, which is the sum of the three feature importance mea-
(r = 0.88), correlations between these two features and pre- sures introduced in the methods section, normalized such
cision measures are lower—in the range of r = 0.42 − 0.78) that the largest value is 1. After training the full classifier,
and between 0.33 − 0.53 when correlated to velocity and we removed the least important feature, and retrained the
acceleration. classifier using the remaining features, until a classifier with
Our newly proposed features rms-diff, std-diff, and bcea- only four features is left.
diff are correlated between each other almost to the same Surprisingly, sampling rate was the first feature we
degree as the precision measures from which they were removed, which indicates that it does not help in deciding
derived. However, the finding that the correlations of these whether a sample belongs to, e.g., a fixation or saccade. This
features with all other features are practically zero indicates may be because sampling frequency probably did not help
that these three new features hold some unique informa- the classifier decide whether to use, e.g., the vel or the disp
tion that is not reflected in any of the other features. The feature for classification. Although it is not directly impor-
Rayleightest and i2mc features are similar in that, although tant to the classifier, sampling rate is still implicitly used
they correlate a bit more with other features, they hold a when calculating other features. It for instance determines
significant amount of unique information. If that unique the size of the windows (in terms of number of samples)
information indeed turns out to be useful for event detection, over which all other features are calculated.
such features are good candidates for surviving the feature Further analyzing Fig. 6 we can spot interaction between
elimination procedure which we describe next. highly correlated features. For example, after removing

Recursive feature elimination

First, univariate feature selection (UFS) scores are cal-

culated by training random forest classifiers using only
individual features. The largest Cohen’s kappa (K) results
from models which use med-diff and mean-diff features-
0.35 and 0.33, respectively (see Fig. 5), followed by stan-
dard deviation (std) and two recently proposed features in
eye-movement event detection, Rayleightest (Larsson et al.,
2015) and i2mc (Hessels et al., 2016).
Surprisingly, velocity-based features (vel, acc, and rms),
at least on their own, are not good features to predict
whether a sample belongs to a fixation, saccade, or PSO. It
is furthermore interesting to note that while velocity domain
features are known to work well with high-quality data, spa- Fig. 6 Normalized feature importances during recursive feature
tial features apparently work better for our dataset that also elimination
Behav Res (2018) 50:160–181 171

bcea-diff, the importance of std-diff increases (see Fig. 6,

when the number of features decreases from 8 to 7). This
indicates that, as expected, once a feature that is highly cor-
related with others is removed from the classifier, the other
features that were highly correlated with the removed fea-
ture now provide more unique information to the classifier,
making them more important. The most important features
in our classifier are med-diff, mean-dif, and std; the same
ones which resulted in the best performance for single-
feature classifiers (see Fig. 5). We have trained this classifier
multiple times, and each and every time it resulted in the
same feature ranking, suggesting that our feature elimina-
tion procedure is robust and repeatable, i.e., there is no much
variation in the feature importance scores.
Figure 7 shows the performance of the full classifier, Fig. 8 Performance as a function of the number trees used in the
along with the classifiers resulting from recursive feature classifier
elimination, on the validation set, as a function of sampling
rate. The figures in the plot are averages of K over all noise
levels. These results reveal that the performance of our clas- which features are removed first from the full classifier
sifier was stable down to 200 Hz, even when half of the (Fig. 6) reveals a potential reason for this. For 30-Hz data,
features of lesser importance are removed from the classi- the classifier starts performing better when some of the
fier. For the full classifier, the average performance in data velocity-related features—acceleration, rms and rms-diff—
sampled at 200–1250 Hz ranged from K = 0.86 to K = are removed. This observation allows us to hypothesize that
0.83, and performance only slightly decreased to K = 0.81 the feature ranking would be different for a classifier trained
when only six features remain in the classifier. Models with to deal with only 30-Hz data or data of a similarly low
less than six features performed considerably worse, as indi- sampling frequency, compared to the classifier presented
cated by an decrease in K to 0.7–0.74. As may be expected, here. Most likely, all velocity-related features would be less
the performance of all classifiers also decreased when data important, and would therefore be removed earlier in the
is sampled at 120 Hz or less. feature elimination procedure. In contrast, if we build a
An interesting anomaly in this data is that for 30Hz classifier only for data sampled at higher rates, velocity-
data, classifiers trained with an intermediate number of related features would most likely be more important than
features (6–10) performed better (K = 0.51 − −0.53) for the current classifier. In that sense, the features that are
than the classifiers trained with more features, such as found to be most important in our classifier likely provide
the full classifier which scored K = 0.48. Looking at information for event detection at all sampling frequencies.

Fig. 7 Performance of classifiers for different sampling rates. Performance is measured using Cohen’s kappa K and the presented figures are
averages of K over all noise levels. Note that this is only the performance of classifier without post-processing step
172 Behav Res (2018) 50:160–181

Extensive testing of this hypothesis is beyond the scope of computed performance of each of these trimmed random
this paper, but we made a small test by training a specialist forest classifiers using Cohen’s kappa.
classifier, using only high-quality data at 500–1000 Hz and As an example, Fig. 8 shows K as a function of the num-
having an average noise level of up to 0.04◦ RMS. The four ber of trees for 500-Hz data, for the full classifier along with
most important features in such a specialist classifier were a subset of the reduced classifiers using a limited number of
velocity, std, acceleration, and bcea (see Appendix B). features. It is very clear from this plot that at least in 500-Hz
data, there is no decrease in classification performance until
Limiting the number of trees less than 8–16 trees are used. In the next section, we per-
form statistical analysis to find out to what extent the forest
All classifiers above were trained using 200 trees, which can be trimmed until performance starts to decrease.
is clearly too much according to, e.g., Oshiro et al. (2012)
and results in classifiers with large computational and mem- The final model
ory requirements. To reduce the number of trees in each
of the trained models, we trained classifiers with 1, 4, 8, Linear mixed effect modeling confirms that there is no
up to 196 (with a step size of 4) trees, and used these significant decrease in performance compared to the full
reduced classifiers to perform event detection. We then classifier when using 16 trees or more trees (see Table 2).

Table 2 Linear mixed-effect model fit for raw performance, measured as Cohen’s kappa (K)

Fixed effects:
Estimate Std. error df t value Pr(>|t|)
(Intercept) 7.76e–01 4.58e–02 2.40e+01 16.928 5.11e–15 ***
ntrees1 −3.25e–02 1.12e–03 7.84e+04 −29.022 <2e–16 ***
ntrees4 −9.40e–03 1.12e–03 7.84e+04 −8.383 <2e–16 ***
ntrees8 −4.07e–03 1.12e–03 7.84e+04 −3.626 0.000288 ***
ntrees12 −2.76e–03 1.12e–03 7.84e+04 −2.462 0.013806 *
ntrees16 −1.41e–03 1.12e–03 7.84e+04 −1.26 0.207644
ntrees20 −1.77e–03 1.12e–03 7.84e+04 −1.58 0.114205
ntrees24 −1.53e–03 1.12e–03 7.84e+04 −1.367 0.171585
...
nfeat4 −1.33e–01 8.76e–04 7.84e+04 −151.369 <2e–16 ***
nfeat5 −9.49e–02 8.76e–04 7.84e+04 −108.276 <2e–16 ***
nfeat6 −1.43e–02 8.76e–04 7.84e+04 −16.333 <2e–16 ***
nfeat7 −8.98e–03 8.76e–04 7.84e+04 −10.244 <2e–16 ***
nfeat8 −1.16e–03 8.76e–04 7.84e+04 −1.323 0.185882
nfeat9 1.78e–03 8.76e–04 7.84e+04 2.032 0.042207 *
nfeat10 2.89e–03 8.76e–04 7.84e+04 3.297 0.000979 ***
nfeat11 −7.95e–06 8.76e–04 7.84e+04 −0.009 0.992766
nfeat12 −2.73e–04 8.76e–04 7.84e+04 −0.312 0.755049
nfeat13 −2.07e–04 8.76e–04 7.84e+04 −0.236 0.813646

Random effects:
Groups Name Variance Std.Dev.
noise (Intercept) 0.0062052 0.07877
fs (Intercept) 0.0126143 0.11231
sub (Intercept) 0.0005378 0.02319
Residual 0.0027374 0.05232

The intercept represents a predicted K of a classifier with 200 trees and using all 14 features. Subject (sub), sampling rate (f s), and noise level
(noise) are modeled as random factors. t tests use Satterthwaite approximations to degrees of freedom
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Number of obs: 78408, groups: 11 noise levels (noise); 9 sampling frequencies (fs); 4 subjects (sub)
Behav Res (2018) 50:160–181 173

Using only 16 trees reduces the classifier computational

requirements and makes it possible to easily run it on a
regular laptop (see Appendix D).
An interesting finding is that using all features does not
lead to the best classifier. Instead, the linear mixed effects
analysis shows that classification performance is signifi-
cantly better when using nine or ten features, compared to
the full classifier with 14 features. In previous analyses, we
saw this effect in data sampled at 120 and 30 Hz when
analyzing the performance over sampling rates (Fig. 7).
For other sampling frequencies, the performance always
decreases slightly when the number of features is reduced.
However, in this study, we are interested in building an uni-
versal algorithm, that works optimally for any kind of data,
Fig. 9 Example of 500-Hz data with ground truth (hand-coded) events
finding the events that are possible to find, given the lim- and events detected by our (IRF) and NH (Nyström & Holmqvist,
its posed by sampling frequency and noise level. Therefore 2010) algorithms
in our final model we reduce the number of features to ten.
This also makes our classifier even less computationally
expensive. Sample-to-sample classification performance
Analysis of the random effects in the LMER model
showed that most of the variance in our classifier’s perfor- Figure 10 shows the performance of the IRF algorithm on
mance is due to data sampling rate, but not noise level or the testing dataset. IRF’s performance on the testing dataset,
subject. This indicates that the classifier generalizes well containing data that our algorithm has never seen before, is
over individual differences and varying amounts of noise a bit worse compared to the validation dataset. K, averaged
in the data, whereas dealing with lower sampling rate data over all noise levels, for all three events (Fig. 10a) is around
could still be improved. 0.77 in data down to 200 Hz and gradually drops from 0.72
to 0.51 and to 0.36 in data sampled at 120 Hz, 60 Hz,
Performance and 30 Hz, respectively. Overall, average performance on
the validation dataset (K = 0.75) is 7 % better com-
Next, we assess the performance of the final classifier pared to the testing dataset (K = 0.7), indicating that our
on the testing dataset, i.e., data that the algorithm has classifier slightly overfits the data. Slight overfitting is not
never seen before, in terms of sample-to-sample classifi- unusual in machine learning, and is also seen when devel-
cation accuracy, the number, distribution and properties of oping hand-crafted event detection algorithms. For instance,
the detected fixations and saccades, and performance in a Larsson et al. (2013) reports 9 % better performance on a
biometric authentication application. Here we examine the development dataset compared to a testing dataset.
performance of the full IRF algorithm, including both the Higher noise levels (above 0.26◦ RMS on average) starts
classifier we built above, and the post-processing heuristics minimally affecting the performance of our IRF algorithm.
detailed in the methods section. Only when the noise level equals or exceeds 0.519◦ RMS on
Figure 9 shows an example of 500-Hz data with ground average, does performance start to degrade noticeably. For
truth (hand-coded) events and events detected by our algo- data sampled at 120 Hz and more, K gets as low as 0.56 in
rithm and the NH algorithm. It is clear from the plot that the extremely noisy data, while the effect of noise is less vis-
output from the IRF algorithm is closer to the hand-coded ible in data sampled at 60 and 30 Hz. The performance in
ground truth—our algorithm detects all saccades and sub- data sampled at 120 Hz and above is just a bit worse com-
sequent PSOs, while in this example NH algorithm misses pared to what the NH algorithm achieves in the same, but
two out of three PSOs and tags them as part of the preced- clean, 500-Hz data (K = 0.604, see Table 3). Compared to
ing saccades. The IRF algorithm accurately tags saccades, the other state-of-the art algorithm by Larsson et al. (2013,
but has issues finding the exact offsets of PSO (when com- 2015) which achieves K = 0.745, IRF outperforms it even
pared to our manually coded data). Andersson et al. (2016) when the data has an average noise level of up to 1◦ RMS.
report that even two expert coders do not agree when tag- Figure 10b shows the performance, averaged over all
ging PSO samples, suggesting that PSOs are actually very noise levels, in detecting each of the three eye-movement
difficult events to reliably find in raw data. events that our algorithm reports. This data reveals that IRF
174 Behav Res (2018) 50:160–181

a b

Fig. 10 Performance of our algorithm on the testing dataset. Left side averaged across all noise levels. On the right side b, we separate data
a shows performance for all events, with blue lines showing perfor- for the three events our algorithm reports
mance at different noise levels, while the red line depicts performance

algorithm is best at correctly labeling saccades, while the RMS), because this is the kind of data these two algorithms
performance of detecting PSO is considerably poorer. This were tested on by Larsson et al. (2013) and Andersson
replicates the previous findings of Andersson et al. (2016), et al. (2016). Table 3 also includes performance of special-
who found that expert coders are also best at finding sac- ist version of the IRF classifier, which was trained using
cades, and agree less with each other when it comes to only high-quality data—500–1000 Hz and having average
indicating PSOs. It may very well be that the performance of noise level up to 0.04◦ RMS. The largest performance gain
our classifier is worse at detecting PSOs because of poten- is obtained in PSO classification—the specialist classifier
tially inconsistently tagged PSOs samples in the training is around 7 % better at detecting these events than the
data set. The classifier thus either learns less well as it tries universal classifier. Overall performance of this specialist
to work from imprecise input, or sometimes correctly report classifier is around 2 % better compared to the univer-
PSOs that the expert coder may have missed or tagged incor- sal classifier, and marginally outperforms the expert coders
rectly. We can see from the Fig. 9 that it is really hard to tell reported by Larsson et al. (2015).
the exact offset of a PSO.
Table 3 shows that IRF outperforms two state-of-the-art Event measures
event detection algorithms that were specifically designed
to detect fixations, saccades and PSOs, and approaches the Raw sample-to-sample performance, i.e., only that of classifiers
performance of human expert coders. We compare the per- itself, before applying heuristics, is around 5.5 % better than
formance on 500-Hz clean data (average noise level 0.042◦ that after heuristic post-processing, meaning that there is
still room for improvement in the design of our heuris-
tics. For instance, our post-processing step removes all saccades
with amplitudes up to 0.5◦ , because of our choice to merge
Table 3 Comparison of Cohen’s kappa in clean data, sampled at 500 Hz nearby fixations (see “Post-processing: Labeling final
events”). This is reflected in the number of fixations and
All events Fixations Saccades PSO
average fixation duration as reported by the IRF algorithm
IRF 0.829 0.854 0.909 0.697 (see the offsets between the manual coding results and those
IRF (specialist) 0.846 0.874 0.905 0.746 of our algorithm in clean data in Figs. 11 and 12). These fig-
Expert (Andersson et al., 2016) 0.92 0.95 0.88 ures show that compared to ground truth (manual coding),
Expert (Larsson et al., 2015) 0.834 our algorithm misses around 10 % of the small saccades and
LNS (Andersson et al., 2016) 0.81 0.64 the same percentage for fixations. This causes an overes-
LNS (Larsson et al., 2013) 0.745 timation in the average fixation duration of approximately
NH 0.604 0.791 0.576 0.134 60 ms, as two fixations get merged into one. When the noise
NH (Andersson et al., 2016) 0.52 0.67 0.24 increases over 0.26◦ RMS, more and more smaller saccades
NH (Larsson et al., 2013) 0.484 are missed. The number of detected fixations therefore starts
decreasing, while the mean fixation duration increases. This
IRF - our algorithm, LNS - algorithm by Larsson et al. (2013, 2015), behavior is consistently seen in the output of our algorithm
NH - algorithm by Nyström and Holmqvist (2010) down to a sampling rate of 120 Hz. Similar analyses for
Behav Res (2018) 50:160–181 175

Fig. 11 Number of detected fixations in the testing dataset. Green - Fig. 13 Performance of our (blue) and NH (red) algorithms in terms
ground truth (handcoded), blue - our algorithm. Different intensities of of reproducing ground truth main sequence
blue show results for different sampling rates

Saccade peak velocities (and therefore main sequence)

cannot be reliably calculated for low sampling frequencies
the number and duration of detected saccades and PSOs are
(Holmqvist et al., 2011, p. 32) and therefore reproduction
presented in Appendix C.
of the main sequence relationship fails completely at 30 and
Figure 13 shows an analysis of the main sequence fits.
60 Hz. For higher sampling rate data, our algorithm repro-
Note that negative R 2 were clipped to 0. Ground truth sac-
duces the main sequences nearly perfectly for data with
cade amplitude-peak velocity fit had parameters Vmax =
noise level up to 1◦ RMS – R 2 ranges from 0.96 to 0.87.
613.76 and C = 6.74 (R 2 = 0.96), while the amplitude-
In comparison, the NH algorithm (Fig. 13, red line) per-
duration relationship was best fitted by the line 0.0124 +
forms considerably worse, except in data with a noise level
0.00216 ∗ A (R 2 = 0.93). Interestingly, the slope of the
between 0.07–0.13◦ RMS. According to our hypothesis, a
amplitude-duration relationship resembles that of Carpenter
better performing classification algorithm will have a higher
(1988), but the intercept is only a bit more than half, indicat-
R 2 for the degraded data, because it will have less outliers in
ing that our manually coded saccades have shorter durations
the data (saccade amplitudes and peak velocities). The NH
(by on average 9 ms) for the same amplitude than his. That
algorithm detects approximately twice as many saccades
is probably because Carpenter used older data from Yarbus
in clean data than the ground truth, thus leading to more
and Robinson, recorded with corneal reflection eye trackers
outliers in the main sequence fit. We believe that the prob-
and coils, which do not record PSOs (Hooge et al., 2016;
lem with the NH algorithm lies in its adaptive behavior: in
Frens and Van Der Geest, 2002). The signal from such eye
very clean data, the velocity threshold becomes so low, that
trackers contain saccades that are known to be longer than
many PSO events are classified as saccades. As the noise
saccades from a pupil-CR eye tracker (Hooge et al., 2016).
increases, the adaptive threshold also increases causing only
actual saccades to be detected, which results in near perfect
reproduction of the main sequence relationship.
As for the amplitude–duration relationship, the NH algo-
rithm is not very good at detecting the precise location of
saccade onsets and offsets. To further investigate this, we
fitted the amplitude-duration relationship to the output of
the NH algorithm when provided with 500-Hz clean data.
This revealed that the intercept of this relationship is 0.044
s, almost four times the intercept of 0.012 for the manually
coded data. This indicates that the NH algorithm produces
considerably longer saccades. Slope of this fit was close
to ground truth—0.0025 s. The IRF algorithm on the other
hand reproduces the ground truth relationship quite well.
Results for amplitude–duration fit look largely the same as
Fig. 12 Mean fixation duration in the testing dataset. Green - ground
in Fig. 13—in data sampled at 120 Hz and above and hav-
truth (handcoded), blue - our algorithm. Different intensities of blue ing average noise level up to 1◦ RMS, R 2 ranges from 0.95
show results for different sampling rates to 0.69, and gets lower in more noisy data.
176 Behav Res (2018) 50:160–181

Biometric results This optimized algorithm can be run on a laptop and is

what we recommend for users to process their data with. It
Results for the biometric performance are presented in takes around 100 ms to classify 1 min of 1000-Hz data, in
Table 4. As a lower EER indicates better performance, addition to around 600 ms to extract features and 50 ms to
the results indicate that the events detected by IRF allow post-process predictions.
the biometric application to have better performance in A limitation of our approach is that heuristics are needed
authentication scenario than the events detected by the NH to produce meaningful events. This is because the classi-
algorithm. fication of each sample is independent from the context
Separate 2×3 fixed-effects ANOVAs for the two datasets of the surrounding samples. For instance, our classifier is
revealed that EER was lower for both datasets when the not aware that a PSO can only follow a stream of saccade
events fed to the biometric procedure were detected by samples, nor that fixation samples might only occur after
IRF than when they were detected by NH (F(1,594)<72, a saccade or PSO and need to last for a minimum amount
p<0.001). For both datasets, there also were differences in of time. The post-processing step provides this context, but
EER between the three tasks (F(2,594)<38, p<0.001) and we expect that it can be avoided by using other machine-
significant interactions between event detection algorithm learning algorithms and paradigm known as end-to-end
and task (F(1,594)<3.8, p<0.023). For both datasets, Tukey training, where the algorithm learns also the sequence and
HSD post-hoc tests confirmed that the EER rate was lower context information directly from the raw data and produces
for IRF than for NH for all tasks (p<0.0005). the final output without the need for any post-processing.
Previously, researchers carefully hand-crafted their eye-
movement event detection algorithms by assembling sim-
Discussion ilarly hand-crafted signal processing features in a specific
order with many specific settings that may or may not
Our data show that the machine-learning algorithm we be accessible to the user (e.g., Engbert & Kliegl, 2003;
have presented outperforms current state-of-the-art event Nyström & Holmqvist, 2010; Larsson et al., 2015; Hessels
detection algorithms. In clean data, performance almost et al., 2016). We propose that the time has come for a
reaches that of a human expert. Performance is stable down paradigm shift in the research area of developing event
to 200 Hz, and the algorithm performs quite well even detection algorithms. When using random forests and sim-
at lower sampling frequencies. Classification performance ilar classifiers, the designer still needs to craft a collection
slowly degrades with increasing noise. Biometric testing has of data descriptors and signal processing features, but the
indicated that the IRF algorithm provides superior authenti- assembly of the algorithm and the setting of thresholds
cation performance as indicated by lower Equal Error Rates is done using machine learning. Training a classifier on a
(EER) when compared to the algorithm by Nyström and wide variety of input data allows a machine-learning-based
Holmqvist (2010). algorithm to generalize better than hand-crafted algorithms.
The full random forest classifier, containing 14 features In this paper, we have shown that the machine-learning
and 200 trees, takes up 800 MB of storage space and over approach leads to better classification performance than
8 GB of RAM memory to run. Optimization allowed us to hand-crafted algorithms while it is also computationally
obtain a final classifier whose classification performance is inexpensive to use.
virtually indistinguishable from the full classifier. Our final Further supporting the notion that a paradigm shift in
classifier has 16 trees and ten features (med-diff, mean-diff, event detection algorithm design is happening is that shortly
std, vel, rayleightest, std-diff, i2mc, disp, bcea-diff, bcea). after we presented the IRF algorithm at the Scandina-
vian Workshop of Applied Eye Tracking in June 2016,
Table 4 The EER performance for each dataset, stimulus, and classi- Anantrasirichai et al. (2016) published a machine-learning-
fication algorithm based algorithm designed to work with 30-Hz SMI mobile
eye-tracker data. In that paper, the authors employ temporal
Medium quality dataset
characteristics of the eye positions and local visual features
HOR RAN TEX extracted by a deep convolutional neural network (CNN),
Stimulus and then classify the eye movement events via a support
IRF 22.30 % 21.00 % 21.80 % vector machine (SVM). It is not clear from the paper what
NH 23.30 % 24.60 % 28.00 %
was used as ground truth, but the authors report that their
High-quality dataset
algorithm outperformed three hand-crafted algorithms.
IRF 17.80 % 18.40 % 12.40 %
Using our approach, classifiers can also be built that are
not aimed to be general and to apply to many types of data,
NH 20.00 % 20.80 % 13.40 %
but that instead are trained to work with a specific type of
Behav Res (2018) 50:160–181 177

data. One can imagine specialist classifiers, trained to work maximum 25 types of eye-movement events will be possi-
only on Eyelink, SMI, or Tobii data. The only thing that ble too using this approach. The very recent and, to the best
is needed is a representative dataset and desired output— of our knowledge, the very first attempt to use deep learn-
whether it be manual coding or events derived by any other ing for eye movement detection is the algorithm by Hoppe
method. Our results show that such specialized machine- (2016). This algorithm is still not entirely end-to-end as it
learning-based algorithms have the potential to work better uses hand-crafted features—input data needs to be trans-
than hand-crafted algorithms. Feeding the machine-learning formed into the frequency domain first, but as authors write:
algorithm with event data from another event detector raises “it would be conceptually appealing to eliminate this step
other possibilities. If the original event detector is com- as well”. Hoppe (2016) show that a simple one-layer con-
putationally inefficient but results in good performance, volutional neural network (CNN), followed by max pooling
its output can be used to train a machine-learning algo- and a fully connected layer outperforms algorithms based on
rithm that has the same performance but is computationally simple dispersion and velocity and PCA-based dispersion
efficient. thresholding.
But the real promise of machine learning is that in the If deep learning with end-to-end training works for event
near future we may have a single algorithm that can detect detection, there will be less of a future for feature devel-
not only the basic five eye-movement events (fixations, sac- opers. The major bottle neck will instead be the amount
cades, PSOs, smooth pursuit, and blinks), but distinguish of available hand-coded event data. This is time-consuming
between all 15–25 events that exist in the psychological and because it requires domain experts, it is also expensive.
and neurological eye-movement literature (such as different From the perspective of the machine-learning algorithm,
types of nystagmus, square-wave jerks, opsoclonus, ocular the hand-coded events are the goal, the objective ground
flutter, and various forms of noise). All that is needed to truth, perchance, that the algorithm should strive towards.
reach this goal is a lot of data, and time and expertise to However, we know from Andersson et al. (2016) that hand-
produce the hand-coded input needed for training the clas- coded data represent neither the golden standard, nor the
sifier. Performance of this future algorithm is not unlikely objective truth on what fixations and saccades are. They
to be comparable to or even better than any hand-crafted show that agreement between coders is nowhere close to
algorithm specifically designed for a subset of events. perfect, most likely because expert coders often have dif-
The shift toward automatically assembled eye-movement ferent conceptions of what a fixation, saccade, or another
event classifiers exemplified by this paper mirrors what has event is in data. If a machine-learning algorithm uses a
happened in the computer vision community. At first, anal- training set from one expert coder it will be a different algo-
ysis of image content was done using simple hand-crafted rithm than if a training set from another human coder would
approaches, like edge detection or template matching. Later, have been used. The event detector according to Kenneth
machine-learning approaches such as SVM (support vec- Holmqvist’s coding will be another event detector compared
tor machine) using hand-crafted features such as LBP (local to the event detector according to Raimondas Zemblys’ cod-
binary pattern) and SIFT (scale-invariant feature transform) ing. However, it is very likely that the differences between
(Russakovsky et al., 2015) became popular. Starting in these two algorithms will be much smaller than the dif-
2012, content-based image analysis quickly became domi- ferences between previous hand-crafted algorithms, given
nated by deep learning, i.e., an approach where a computer that Andersson et al. (2016) showed that human expert
learns features itself using convolutional neural networks coders agree with each other to a larger extent than existing
(Krizhevsky et al., 2012). state-of-the-art event detection algorithms.
Following the developments that occurred in the com- We should also note that the machine-learning approach
puter vision community and elsewhere, we envision that poses a potential issue for reproducibility. You would need
using deep learning methods will be the next step for the exact same training data to be able to reasonably repro-
eye-movement event detection algorithm design. Such duce a paper’s event detection. Either that, or authors need
methods—which are now standard in content-based image to make their trained event detectors public, e.g., as supple-
analysis, natural language processing and other fields— mental info attached to a paper, or on another suitable and
allow us to feed the machine-learning system only with data. reasonably permanent storage space. To ensure at least some
It develops the features itself, and finds appropriate weights level of reproducibility of the work, future developers of
and thresholds for sample classification, even taking into machine-learning-based event detection algorithms should
account the context of the sample. Since such deep-learning- report as many details as possible: the algorithms and pack-
based approach works well in image content analysis, ages used, hyperparameters, source of training data, etc.
where the classifier needs to distinguish between thou- We make our classifier and code freely available online5
sands of objects, or in natural language processing, where
5 https://github.com/r-zemblys/irf.
sequence modeling is the key, we expect that classifying the
178 Behav Res (2018) 50:160–181

and strive to further extend the classifier to achieve good were not found to impact subsequent readings. Each
classification in noisy data and data recorded at a low sam- recording session contained a different part of the poem
pling rates. Future work will furthermore focus on detecting text.
other eye-movement events (such as smooth pursuit and
nystagmus) by including other training sets and using other Datasets
machine-learning approaches.
– Medium-quality (MQ) dataset consisted of records
Acknowledgments We thank LUNARC - the center for scientific from 99 subjects among which were 70 males and 29
and technical computing at Lund University for providing us with
computational resources and support, which enabled us to efficiently
females. The ages of the subjects were in the range
train large classifiers (project No. SNIC 2016/4-37). This work is sup- between 18 and 47. The average age was 22 (SD = 4.8).
ported in part by NSF CAREER Grant CNS-1250718 and NIST Grant All 99 subjects participated in two recording sessions,
60NANB15D325 to author OK, and in part by MAW Grant “Eye- which were timed such that there were approximately
Learn: Using visualizations of eye movements to enhance metacogni-
tion, motivation, and learning” to author KH. We express our gratitude
20 min between the first and the second presentation
to Dillon Lohr at Texas State University who performed computations of each stimulus. The MQ dataset was recorded using
related to CEM-E framework. a PlayStation Eye Camera driven by modified version
of the open-source ITU Gaze Tracker software (San
Agustin et al., 2009). The sampling rate of the recording
Appendix A was 75 Hz and average spatial accuracy was 0.9◦ (SD =
0.6◦ ) as reported by the validation procedure performed
Description of stimuli and datasets used when assessing the after the calibration. Because none of the data samples
biometrics performance of our IRF algorithm. are marked as invalid by ITU Gaze Tracker software,
we are unable to report the amount of data loss for this
Presented stimulus dataset. Stimuli were presented on a flat screen monitor
positioned at a distance of 685 mm from each subject.
– Horizontal stimulus (HOR) was a simple step-stimulus The dimensions of the monitor were 375 × 302 mm.
with a small white dot making 30◦ jumps back and The resolution of the screen was 1280 × 1024 pixels.
forth horizontally 50 times across a plain black back- The records from the dataset can be downloaded from
ground. In total, 100 dots were presented to each Komogortsev (2016).
subject. For each subject and for each recording ses- – High-quality (HQ) dataset consisted of records from 32
sion of the same subject the sequence of dots was the subjects among which were 26 males and six females.
same. The subjects were instructed to follow the dot The ages of the subjects were in the range between 18
with their eyes. The goal of this stimulus was to elicit and 40. The average age was 23 (SD = 5.4). Twenty-
a large number of purely horizontal large amplitude nine of the subjects performed four recording sessions
saccades. each, and three of the subjects performed two recording
– The random stimulus (RAN) was a random step- sessions each. The first and second recording sessions
stimulus with a small white dot jumping across a plain were timed such that there were approximately 20 min
black background of the screen in a uniformly dis- between the first and the second presentation of each
tributed random pattern. One hundred dot movements stimulus. For each subject, the 3rd and 4th sessions
were presented to each subject. The subjects were were recorded approximately 2 weeks after the first two
instructed to follow the dot with their eyes. For each sessions. Similar to the first two sessions, the time inter-
subject and for each recording session of the same sub- val between the 3rd and 4th sessions was timed such
ject the sequence of presented dots was completely that there were approximately 20 min between the first
random. The goal of this stimulus was to elicit a large and the second presentation of each stimulus. The data
number of oblique saccades with various points of was recorded with an EyeLink 1000 eye-tracking sys-
origin, directions, curvatures, and amplitudes. tem at 1000 Hz and spatial accuracy as reported by
– The textual stimulus (TEX) consisted of various the validation procedure performed after the calibra-
excerpts from Lewis Carroll’s “The Hunting of the tion was 0.7◦ (SD = 0.5◦ ). The average amount of data
Snark.” The selection of this specific poem aimed to loss was 5 % (SD = 5 %). Stimuli were presented on
encourage the subjects to progress slowly and carefully a flat screen monitor positioned at a distance of 685
through the text. An amount of text was selected from mm from each subject. The dimensions of the monitor
the poem that would take on average approximately 1 were 640 × 400 mm. The resolution of the screen was
min to read. Line lengths and the difficulty of the mate- 2560 × 1600 pixels. The records from the dataset can
rial was consistent, and content-related learning effects be downloaded from Komogortsev (2011).
Behav Res (2018) 50:160–181 179

Fig. 16 Number of detected PSO in the testing dataset

Fig. 14 Normalized feature importances during recursive feature
elimination in the specialist model

Appendix D

Computational performance of IRF algorithm

Appendix B
To test computational performance of our classifier we used
Figure 14 shows feature importances, obtained when train-
a desktop computer with four Intel Core i7 CPUs @ 3.6
ing the specialist classifier. This classifier was trained using
GHz, 16GB RAM and a Seagate 2TB 7200 RPM HDD disk
only high-quality data at 500–1000 Hz with an average
drive, running on Linux Mint 17.3 Cinnamon 64-bit OS
noise level below 0.04◦ RMS. As before, after training
and Dell e5530 laptop with Intel Core i5-3360M dual-core
the classifier, we removed the least important feature, and
CPU @ 2.80 GHz, 4 GB RAM, 7200 rpm HDD, running on
retrained the classifier using the remaining features, until a
Debian 7 OS.
classifier with only four features was left.
The exact figures would depend on a number of factors,
such as the way of saving the trained forest, the computer
storage and memory type, operating system, etc. In our spe-
Appendix C cific case, the full classifier takes almost 800 MB when
stored on a computer drive, takes around 80 s to load and
Analyses for the number and duration of detected saccades
fills over 8 GB of RAM memory when loaded. However,
and PSOs using the IRF algorithm. Green stippled line -
when loaded it works quite fast—it takes around 0.5 s to
ground truth (hand coded), blue - our algorithm. Different
classify 1 min of 1000-Hz data on both desktop and laptop
intensities of blue show results for different sampling rates
PCs tested. Optimized classifier requires less than 400 MB
(Figs. 15, 16, 17, and 18).
of RAM memory, loads in 1.2 s and takes around 100 ms to
classify 1 min of 1000-Hz data.

Fig. 15 Number of detected saccades in the testing dataset Fig. 17 Mean saccade duration in the testing dataset
180 Behav Res (2018) 50:160–181

Hessels, R. S., Niehorster, D. C., Kemner, C., & Hooge, I. T. C. (2016).

Noise-robust fixation detection in eye movement data: Identifica-
tion by two-means clustering (i2mc). Behavior Research Methods,
1–22. doi:10.3758/s13428-016-0822-1
Holland, C. D., & Komogortsev, O. V. (2013a). Complex eye move-
ment pattern biometrics: Analyzing fixations and saccades. In
2013 International conference on biometrics (ICB), (pp. 1–
8).
Holland, C. D., & Komogortsev, O. V. (2013b). Complex eye move-
ment pattern biometrics: The effects of environment and stimulus.
IEEE Transactions on Information Forensics and Security, 8(12),
2115–2126.
Holmqvist, K., Nyström, M., Andersson, R., Dewhurst, R., Jarodzka,
H., & van de Weijer, J. (2011). Eye tracking. A comprehensive
guide to methods and measures. Oxford: Oxford University Press.
Holmqvist, K., Nyström, M., & Mulvey, F. (2012). Eye tracker data
Fig. 18 Mean PSO duration in the testing dataset
quality: What it is and how to measure it. In Proceedings of the
symposium on eye tracking research and applications, (pp. 45–52).
New York, NY: ACM.
In addition, the final IRF algorithm takes around 600 ms Holmqvist, K., Zemblys, R., Cleveland, D.C., Mulvey, F.B., Borah,
to extract features and 50 ms to post-process predictions and J., & Pelz, J.B. (2015). The effect of sample selection methods
only somewhat more when a simpler laptop is used. on data quality measures and on predictors for data quality. In
Ansorge, U., Ditye, T., Florack, A., & Leder, H. (Eds.), Abstracts
of the 18th European Conference on Eye Movements 2015,
volume 8 of Journal of Eye Movement Research.
References Holmqvist, K., Andersson, R., Jarodzka, H., Kok, E., Nyström, M.,
& Dewhurst, R. (2016). Eye tracking. A comprehensive guide to
methods and measures. Oxford: Oxford University Press.
Anantrasirichai, N., Gilchrist, I. D., & Bull, D. R. (2016). Fixa-
Hooge, I., Nyström, M., Cornelissen, T., & Holmqvist, K. (2015). The
tion identification for low-sample-rate mobile eye trackers. In
art of braking: post saccadic oscillations in the eye tracker signal
2016 IEEE international conference on image processing (ICIP),
decrease with increasing saccade size. Vision Research, 112, 55–
(pp. 3126–3130). IEEE.
67.
Andersson, R., Larsson, L., Holmqvist, K., Stridh, M., & Nyström, M.
Hooge, I. T. C., Holmqvist, K., & Nyström, M. (2016). The pupil
(2016). One algorithm to rule them all? An evaluation and discus-
is faster than the corneal reflection (cr): Are video-based pupil-
sion of ten eye movement event-detection algorithms. Behavior
cr eye trackers suitable for studying detailed dynamics of eye
Research Methods, 1–22. doi:10.3758/s13428-016-0738-9
movements? Vision Research, 128, 6–18.
Bahill, A. T., Brockenbrough, A., & Troost, B. T. (1981). Variabil-
Hoppe, S. (2016). End-to-end eye movement detection using convolu-
ity and development of a normative data base for saccadic eye
tional neural networks. ArXiv:1609.02452 e-prints.
movements. Investigative Ophthalmology & Visual Science, 21(1),
116–125. Jain, A. K., Flynn, P., & Ross, A. A. (2007). Handbook of biometrics.
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear Secaucus, NJ, USA: Springer-Verlag New York, Inc.
mixed-effects models using lme4. Journal of Statistical Software, Komogortsev, O. V. (2011). Eye movement biometric database v1.
67(1), 1–48. Komogortsev, O. V. (2016). Eye movement biometric database v2.
Blignaut, P., & Beelders, T. (2012). The precision of eye-trackers: Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classi-
a case for a new measure. In Proceedings of the symposium on fication with deep convolutional neural networks. In Advances in
eye tracking research and applications, ETRA ’12, (pp. 289–292). Neural Information Processing Systems.
New York, NY, USA: ACM. Larsson, L., Nyström, M., Andersson, R., & Stridh, M. (2015). Detec-
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. tion of fixations and smooth pursuit movements in high-speed
Carpenter, R. (1988). Movements of the eyes. Pion. eye-tracking data. Biomedical Signal Processing and Control,
Coey, C. A., Wallot, S., Richardson, M. J., & Van Orden, G. (2012). 18(0), 145–152.
On the structure of measurement noise in eye-tracking. Journal of Larsson, L., Nyström, M., & Stridh, M. (2013). Detection of sac-
Eye Movement Research, 5(4), 1–10. cades and postsaccadic oscillations in the presence of smooth
Cohen, J. (1960). A coefficient of agreement for nominal scales. pursuit. IEEE Transactions on Biomedical Engineering, 60(9),
Educational and Psychological Measurement, 20(1), 37–46. 2484–2493.
Ditchburn, R. W. (1973). Eye Movements and Visual Perception. Leigh, R. J., & Zee, D. S. (2006). The neurology of eye movements.
Oxford: Oxford University Press. Oxford, UK: Oxford University Press.
Engbert, R., & Kliegl, R. (2003). Microsaccades uncover the orienta- Monty, R. A. (1975). An advanced eye-movement measuring and
tion of covert attention. Vision Research, 43(9), 1035–1045. recording system. American Psychologist, 30(3), 331.
Frens, M. A., & Van Der Geest, J. N. (2002). Scleral search coils Mould, M. S., Foster, D. H., Amano, K., & Oakley, J. P. (2012). A
influence saccade dynamics. Journal of Neurophysiology, 88(2), simple nonparametric method for classifying eye fixations. Vision
692–698. Research, 57, 18–25.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and Munn, S. M., Stefano, L., & Pelz, J. B. (2008). Fixation-identification
feature selection. Journal of Machine Learning Research, 3, 1157– in dynamic scenes: Comparing an automated algorithm to manual
1182. coding. In Proceedings of the 5th symposium on applied percep-
Hartridge, H., & Thomson, L. (1948). Methods of investigating eye tion in graphics and visualization, APGV ’08, (pp. 33–42). New
movements. The British Journal of Ophthalmology, 32(9), 581–591. York, NY, USA: ACM.
Behav Res (2018) 50:160–181 181

Nyström, M., & Holmqvist, K. (2010). An adaptive algorithm for fixa- San Agustin, J., Skovsgaard, H., Hansen, J. P., & Hansen, D. W.
tion, saccade, and glissade detection in eyetracking data. Behavior (2009). Low-cost gaze interaction: Ready to deliver the promises.
Research Methods, 42(1), 188–204. In CHI ’09 extended abstracts on human factors in computing
Olsson, P. (2007). Real-time and offline filters for eye tracking. Mas- systems, CHI EA ’09 (pp. 4453–4458). New York, NY, USA:
ter’s thesis, Royal Institute of Technology, Stockholm, Sweden. ACM.
Oshiro, T. M., Perez, P. S., & Baranauskas, J. A. (2012). How Many Thomas, D. B., Luk, W., Leong, P. H., & Villasenor, J. D. (2007).
Trees in a Random Forest? (pp. 154–168). Berlin: Springer. Gaussian random number generators. ACM Computing Surveys,
Otero-Millan, J., Castro, J. L. A., Macknik, S. L., & Martinez-Conde, 39(4).
S. (2014). Unsupervised clustering method to detect microsac- Wang, D., Mulvey, F. B., Pelz, J. B., & Holmqvist, K. (2016a). A study
cades. Journal of Vision, 14(2), 18. of artificial eyes for the measurement of precision in eye-trackers.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Behavior Research Methods, 1–13. doi:10.3758/s13428-016-0755-8
Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Wang, D., Pelz, J. B., & Mulvey, F. (2016b). Characterization and
Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, reconstruction of vog noise with power spectral density analysis.
M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in In Proceedings of the ninth biennial ACM symposium on eye track-
Python. Journal of Machine Learning Research, 12, 2825–2830. ing research & applications, ETRA ’16 (pp. 217–220). New York,
Peirce, J. W. (2007). Psychopy: Psychophysics software in Python. NY, USA: ACM.
Journal Neuroscience Methods, 162(1-2), 8–13. Waskom, M., Botvinnik, O., Drewokane, Hobson, P., Halchenko, Y.,
Raileanu, L. E., & Stoffel, K. (2004). Theoretical comparison between Lukauskas, S., Warmenhoven, J., Cole, J. B., Hoyer, S., Vander-
the gini index and information gain criteria. Annals of Mathemat- plas, J., Gkunter, Villalba, S., Quintero, E., Martin, M., Miles, A.,
ics and Artificial Intelligence, 41(1), 77–93. Meyer, K., Augspurger, T., Yarkoni, T., Bachant, P., Evans, C.,
Rigas, I., Komogortsev, O., & Shadmehr, R. (2016). Biometric recog- Fitzgerald, C., Nagy, T., Ziegler, E., Megies, T., Wehner, D., St-
nition via eye movements: Saccadic vigor and acceleration cues. Jean, S., Coelho, L. P., Hitz, G., Lee, A., & Rocher, L. (2016).
ACM Trans. Appl. Percept., 13(2), 6:1–6:21. seaborn: v0.7.0 (January 2016).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Zemblys, R. (2016). Eye-movement event detection meets
Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., machine learning. In Biomedical Engineering (pp. 98–
& Fei-Fei, L. (2015). Imagenet large scale visual recognition chal- 101).
lenge. International Journal of Computer Vision (IJCV), 115(3), Zemblys, R., Holmqvist, K., Wang, D., Mulvey, F.B., Pelz, J.B., &
211–252. Simpson, S. (2015). Modeling of settings for event detection algo-
Salvucci, D. D., & Goldberg, J. H. (2000). Identifying fixations and rithms based on noise level in eye tracking data. In Ansorge,
saccades in eye-tracking protocols. In Proceedings of the 2000 U., Ditye, T., Florack, A., & Leder, H. (Eds.) Abstracts of the
symposium on eye tracking research & applications, ETRA ’00 18th European Conference on Eye Movements 2015, volume 8 of
(pp. 71–78). Journal of Eye Movement Research.

EEG Eye State Report
No ratings yet
EEG Eye State Report
19 pages
BOS Note Battlespace Operating System
No ratings yet
BOS Note Battlespace Operating System
5 pages
Demo Teaching
No ratings yet
Demo Teaching
19 pages
Fixation Algorithms
No ratings yet
Fixation Algorithms
22 pages
1dcnn With Blstm
No ratings yet
1dcnn With Blstm
17 pages
Supervised Machine Learning for Science: How to stop worrying and love your black box
From Everand
Supervised Machine Learning for Science: How to stop worrying and love your black box
Christoph Molnar
No ratings yet
Fully Convolutional Neural Networks For Raw Eye Tracking Data Segmentation, Generation, and Reconstruction
No ratings yet
Fully Convolutional Neural Networks For Raw Eye Tracking Data Segmentation, Generation, and Reconstruction
9 pages
Event Camera-Based Eye Motion Analysis A Survey
No ratings yet
Event Camera-Based Eye Motion Analysis A Survey
22 pages
Rahul.K T 7233: Covert Information Processing Observable (Quantifiable) Eye Movements
0% (1)
Rahul.K T 7233: Covert Information Processing Observable (Quantifiable) Eye Movements
20 pages
Automatic Target Recognition: Fundamentals and Applications
From Everand
Automatic Target Recognition: Fundamentals and Applications
Fouad Sabry
No ratings yet
Activity Recognition: Fundamentals and Applications
From Everand
Activity Recognition: Fundamentals and Applications
Fouad Sabry
No ratings yet
Automatic Target Recognition: Advances in Computer Vision Techniques for Target Recognition
From Everand
Automatic Target Recognition: Advances in Computer Vision Techniques for Target Recognition
Fouad Sabry
No ratings yet
Multimodal_Fusion_of_EEG_and_Eye_Data_for_Attention_Classification_using_Machine_Learning
No ratings yet
Multimodal_Fusion_of_EEG_and_Eye_Data_for_Attention_Classification_using_Machine_Learning
2 pages
Eye Movement Analysis For Activity Recognition Using Electrooculography
No ratings yet
Eye Movement Analysis For Activity Recognition Using Electrooculography
13 pages
Swarm Intelligence: Fundamentals and Applications
From Everand
Swarm Intelligence: Fundamentals and Applications
Fouad Sabry
No ratings yet
(2017) Bayesian Microsaccade Detection
No ratings yet
(2017) Bayesian Microsaccade Detection
23 pages
I AOI
No ratings yet
I AOI
12 pages
Real-time_eye_blink_detection_using_general_camera
No ratings yet
Real-time_eye_blink_detection_using_general_camera
8 pages
Eyes_Status_Detector_Based_on_Light-weight_Convolutional_Neural_Networks_supporting_for_Drowsiness_Detection_System
No ratings yet
Eyes_Status_Detector_Based_on_Light-weight_Convolutional_Neural_Networks_supporting_for_Drowsiness_Detection_System
6 pages
Optical Flow: Exploring Dynamic Visual Patterns in Computer Vision
From Everand
Optical Flow: Exploring Dynamic Visual Patterns in Computer Vision
Fouad Sabry
No ratings yet
9837 Ce 3 F 83 Eda 769
No ratings yet
9837 Ce 3 F 83 Eda 769
11 pages
Uncertainty Theories and Multisensor Data Fusion
From Everand
Uncertainty Theories and Multisensor Data Fusion
Alain Appriou
No ratings yet
Object Detection: Advances, Applications, and Algorithms
From Everand
Object Detection: Advances, Applications, and Algorithms
Fouad Sabry
No ratings yet
Motion saliency using CNN
No ratings yet
Motion saliency using CNN
12 pages
Computer Vision: Exploring the Depths of Computer Vision
From Everand
Computer Vision: Exploring the Depths of Computer Vision
Fouad Sabry
No ratings yet
Fixation: A Universal Framework For Experimental Eye Movement Research
No ratings yet
Fixation: A Universal Framework For Experimental Eye Movement Research
5 pages
Remote Sensing Technology
From Everand
Remote Sensing Technology
Rajendra Asan
No ratings yet
Nolte Et Al. 2024 - Biorxiv
No ratings yet
Nolte Et Al. 2024 - Biorxiv
50 pages
The Radiology Guide
From Everand
The Radiology Guide
Vincenzo Giuliano
No ratings yet
Dynamical Models in Neurocognitive Psychology
No ratings yet
Dynamical Models in Neurocognitive Psychology
158 pages
EOG-Based Eye Movement Classification and Application On HCI Baseball Game
No ratings yet
EOG-Based Eye Movement Classification and Application On HCI Baseball Game
11 pages
A Method of EOG Signal Processing To Detect The Direction of Eye Movements
No ratings yet
A Method of EOG Signal Processing To Detect The Direction of Eye Movements
6 pages
Computer Vision: Fundamentals and Applications
From Everand
Computer Vision: Fundamentals and Applications
Fouad Sabry
No ratings yet
View of Real Time Eye Tracking For Password Authentication
No ratings yet
View of Real Time Eye Tracking For Password Authentication
5 pages
Yarbus (Greene, Liu, Wolfe)
No ratings yet
Yarbus (Greene, Liu, Wolfe)
8 pages
Comparison of ANN and SVM for classification of eye movements in EOG signals 2018
No ratings yet
Comparison of ANN and SVM for classification of eye movements in EOG signals 2018
12 pages
1
No ratings yet
1
18 pages
Group Method of Data Handling: Fundamentals and Applications for Predictive Modeling and Data Analysis
From Everand
Group Method of Data Handling: Fundamentals and Applications for Predictive Modeling and Data Analysis
Fouad Sabry
No ratings yet
Percept: Fundamentals and Applications
From Everand
Percept: Fundamentals and Applications
Fouad Sabry
No ratings yet
Comparison of SVM and ANN For Classification of Ey
No ratings yet
Comparison of SVM and ANN For Classification of Ey
9 pages
Comparison of SVM and ANN For Classification of Ey PDF
No ratings yet
Comparison of SVM and ANN For Classification of Ey PDF
9 pages
Seminar 1
No ratings yet
Seminar 1
22 pages
Eog-Based Signal Detection and Verification For Hci
No ratings yet
Eog-Based Signal Detection and Verification For Hci
7 pages
Textbook of Urgent Care Management: Chapter 35, Urgent Care Imaging and Interpretation
From Everand
Textbook of Urgent Care Management: Chapter 35, Urgent Care Imaging and Interpretation
Tim Hogan
No ratings yet
Real-Time Eye Gaze Direction Classification Using Convolutional Neural Network
No ratings yet
Real-Time Eye Gaze Direction Classification Using Convolutional Neural Network
6 pages
Augmented Reality Assisted Surgery: Enhancing Surgical Precision through Computer Vision
From Everand
Augmented Reality Assisted Surgery: Enhancing Surgical Precision through Computer Vision
Fouad Sabry
No ratings yet
Basic Research On Eye Tracking
No ratings yet
Basic Research On Eye Tracking
24 pages
Tracing Eye Movement Protocols With Cognitive Process Models
No ratings yet
Tracing Eye Movement Protocols With Cognitive Process Models
6 pages
Visual Sensor Network: Exploring the Power of Visual Sensor Networks in Computer Vision
From Everand
Visual Sensor Network: Exploring the Power of Visual Sensor Networks in Computer Vision
Fouad Sabry
No ratings yet
Aura Final Report
No ratings yet
Aura Final Report
68 pages
Research on Eye Movement Tracking and English Teaching Application Based on CNN-BLSTM Fusion Algorithm
No ratings yet
Research on Eye Movement Tracking and English Teaching Application Based on CNN-BLSTM Fusion Algorithm
4 pages
Neuroevolution: Fundamentals and Applications for Surpassing Human Intelligence with Neuroevolution
From Everand
Neuroevolution: Fundamentals and Applications for Surpassing Human Intelligence with Neuroevolution
Fouad Sabry
No ratings yet
Driver Drowsiness Detection System
No ratings yet
Driver Drowsiness Detection System
5 pages
Eye State Detection Using Image Processing Technique
No ratings yet
Eye State Detection Using Image Processing Technique
6 pages
Eye Fatigue Detection.docx
No ratings yet
Eye Fatigue Detection.docx
8 pages
Optical Braille Recognition: Empowering Accessibility Through Visual Intelligence
From Everand
Optical Braille Recognition: Empowering Accessibility Through Visual Intelligence
Fouad Sabry
No ratings yet
Pupil Detection Algorithm Based On Feature Extraction For Eye Gaze
No ratings yet
Pupil Detection Algorithm Based On Feature Extraction For Eye Gaze
10 pages
Hema, 2014 PDF
No ratings yet
Hema, 2014 PDF
13 pages
Gaze
No ratings yet
Gaze
7 pages
E.O.G. Guidance of A Wheelchair Using Neural Networks
No ratings yet
E.O.G. Guidance of A Wheelchair Using Neural Networks
4 pages
DDD Mydocs
No ratings yet
DDD Mydocs
49 pages
Theoretical method to increase the speed of continuous mapping in a three-dimensional laser scanning system using servomotors control
From Everand
Theoretical method to increase the speed of continuous mapping in a three-dimensional laser scanning system using servomotors control
Lars Lindner
No ratings yet
Image Zooming
No ratings yet
Image Zooming
4 pages
SE SEM III DEC 2023 Compressed Compressed
No ratings yet
SE SEM III DEC 2023 Compressed Compressed
10 pages
Netflix
No ratings yet
Netflix
6 pages
New Microsoft Office Word Document
No ratings yet
New Microsoft Office Word Document
20 pages
DB
No ratings yet
DB
12 pages
ALTIVAR312 P8 2009 07 en
100% (1)
ALTIVAR312 P8 2009 07 en
127 pages
Lab 1: Creating A Simple MVC App Using Java/Swing: 1. Opening The Existing Project
No ratings yet
Lab 1: Creating A Simple MVC App Using Java/Swing: 1. Opening The Existing Project
12 pages
Sap HR - Sap Query
No ratings yet
Sap HR - Sap Query
2 pages
Little Field Paper Version 3
100% (1)
Little Field Paper Version 3
4 pages
1Z0 033 S
No ratings yet
1Z0 033 S
24 pages
Report Anomalies and Normalization Summary
No ratings yet
Report Anomalies and Normalization Summary
5 pages
BCOM ITM - Informatics 1A
No ratings yet
BCOM ITM - Informatics 1A
87 pages
ITEC-322 Discrete Structures: Introduction To Subject
No ratings yet
ITEC-322 Discrete Structures: Introduction To Subject
73 pages
BCPC Permeter PDF
100% (1)
BCPC Permeter PDF
4 pages
Place & Route Tutorial #1: I. Setup
No ratings yet
Place & Route Tutorial #1: I. Setup
13 pages
Bayes_2
No ratings yet
Bayes_2
3 pages
PRS Access Server Solution
No ratings yet
PRS Access Server Solution
2 pages
www.facebook.com
100% (1)
www.facebook.com
32 pages
Online Blood Bank Management System
No ratings yet
Online Blood Bank Management System
3 pages
(Ebook) Just Enough Programming Logic and Design by Joyce Farrell ISBN 9781439039571, 1439039577 - The ebook in PDF format is available for download
100% (2)
(Ebook) Just Enough Programming Logic and Design by Joyce Farrell ISBN 9781439039571, 1439039577 - The ebook in PDF format is available for download
59 pages
02 Program Design PDF
No ratings yet
02 Program Design PDF
32 pages
Manual Ultra Printing
100% (1)
Manual Ultra Printing
21 pages
Women
No ratings yet
Women
4 pages
Enrollment Card For Group Insurance: Member Information
No ratings yet
Enrollment Card For Group Insurance: Member Information
1 page
Complete 8086 Instruction Set: Quick Reference
No ratings yet
Complete 8086 Instruction Set: Quick Reference
55 pages
OS LAB Manual for R23
No ratings yet
OS LAB Manual for R23
71 pages
2nd year NEP Syllabus
No ratings yet
2nd year NEP Syllabus
30 pages
A Roadmap To Develop Enterprise Security Architecture
No ratings yet
A Roadmap To Develop Enterprise Security Architecture
5 pages

Using Machine Learning to Detect Events in Eye-tracking Data

Uploaded by

Using Machine Learning to Detect Events in Eye-tracking Data

Uploaded by

Behav Res (2018) 50:160–181

Using machine learning to detect events in eye-tracking data

Published online: 23 February 2017

In this study, we use a random forest classifier to perform

Table 1 Features used to train random forest classifier

Eye-movement driven biometrics performance resulting operational performance of a biometric system in

fixations, saccades, and PSO, over all sampling frequencies

Recursive feature elimination

First, univariate feature selection (UFS) scores are cal-

bcea-diff, the importance of std-diff increases (see Fig. 6,

Using only 16 trees reduces the classifier computational

Saccade peak velocities (and therefore main sequence)

Biometric results This optimized algorithm can be run on a laptop and is

Fig. 16 Number of detected PSO in the testing dataset

Computational performance of IRF algorithm

Hessels, R. S., Niehorster, D. C., Kemner, C., & Hooge, I. T. C. (2016).

You might also like