
Behavior Research Methods (2019) 51:556–572

https://doi.org/10.3758/s13428-018-1144-2

1D CNN with BLSTM for automated classification of fixations, saccades, and smooth pursuits

Mikhail Startsev¹ · Ioannis Agtzidis¹ · Michael Dorr¹

Published online: 8 November 2018


© Psychonomic Society, Inc. 2018

Abstract
Deep learning approaches have achieved breakthrough performance in various domains. However, the segmentation of
raw eye-movement data into discrete events is still done predominantly either by hand or by algorithms that use hand-
picked parameters and thresholds. We propose and make publicly available a small 1D-CNN in conjunction with a
bidirectional long short-term memory network that classifies gaze samples as fixations, saccades, smooth pursuit, or noise,
simultaneously assigning labels in windows of up to 1 s. In addition to unprocessed gaze coordinates, our approach uses
different combinations of the speed of gaze, its direction, and acceleration, all computed at different temporal scales, as
input features. Its performance was evaluated on a large-scale hand-labeled ground truth data set (GazeCom) and against 12
reference algorithms. Furthermore, we introduced a novel pipeline and metric for event detection in eye-tracking recordings,
which enforce stricter criteria on the algorithmically produced events in order to consider them as potentially correct
detections. Results show that our deep approach outperforms all others, including the state-of-the-art multi-observer smooth
pursuit detector. We additionally test our best model on an independent set of recordings, where our approach stays highly
competitive compared to literature methods.

Keywords Eye-movement classification · Deep learning · Smooth pursuit

Mikhail Startsev
[email protected]

Ioannis Agtzidis
[email protected]

Michael Dorr
[email protected]

¹ Technical University of Munich, Institute for Human-Machine Communication, Arcisstr. 21, Munich, 80333, Germany

Introduction

Eye-movement event detection is important for many eye-tracking applications as well as the understanding of perceptual processes. Automatically detecting different eye movements has been attempted for multiple decades by now, but evaluating the approaches for this task is challenging, not least because of the diversity of the data and the amount of manual labeling required for a meaningful evaluation. To compound this problem, even manual annotations suffer from individual biases and implicitly used thresholds and rules, especially if experts from different sub-areas are involved (Hooge, Niehorster, Nyström, Andersson, & Hessels, 2017). For smooth pursuit (SP), even detecting episodes¹ by hand is not entirely trivial (i.e., requires additional information) when the information about their targets is missing. Especially when pursuit speed is low, it may be confused with drifts or oculomotor noise (Yarbus, 1967).

¹ We use the terms "event" and "episode" interchangeably, both referring to an uninterrupted interval, where all recorded gaze samples have been assigned the same respective label, be that in the ground truth or in algorithmic labels.

Therefore, most algorithms to date are based on hand-tuned thresholds and criteria (Larsson, Nyström, Andersson, & Stridh, 2015; Berg, Boehnke, Marino, Munoz, & Itti, 2009; Komogortsev, Gobert, Jayarathna, Koh, & Gowda, 2010). The few data-driven methods in the literature either do not consider smooth pursuit (Zemblys, Niehorster, Komogortsev, & Holmqvist, 2017), operate on data produced with low-variability synthetic stimuli (Vidal, Bulling, & Gellersen, 2012), or are not yet publicly available (Hoppe & Bulling, 2016).

Here, we propose and make publicly available a neural network architecture that differentiates between the three major eye-movement classes (fixation, saccade, and smooth pursuit) while also taking potential noise (e.g., blinks or lost tracking) into account. Our approach learns from data (simple features) to assign sequences of labels to sequences of data points with a compact one-dimensional convolutional neural network (1D-CNN) and bidirectional long short-term memory block (BLSTM) combination. Evaluated on a fully annotated² (Startsev, Agtzidis, & Dorr, 2016) GazeCom data set of complex natural movies (Dorr, Martinetz, Gegenfurtner, & Barth, 2010), our approach outperforms all 12 evaluated reference models, for both sample- and event-level detection. We additionally test our method's generalization ability on the data set of Andersson, Larsson, Holmqvist, Stridh, and Nyström (2017).

² The recordings were algorithmically pre-labeled to speed up the hand-labeling process, after which three manual annotators have verified and adjusted the labeled intervals in all of the files.

Related work

Data sets

Despite the important role of smooth pursuit in our perception and everyday life (Land, 2006), its detection in free-viewing scenarios has been somewhat neglected. At the very least, it should be considered in event detectors to avoid false detections of other eye-movement types (Andersson et al., 2017). Even when taking into account information about gaze patterns of dozens of observers at once (Agtzidis, Startsev, & Dorr, 2016b), there is a dramatic gap between the performance of detecting saccades or fixations, and detecting smooth pursuits (Startsev et al., 2016).

We will, therefore, use the largest publicly available manually annotated eye-tracking data set that accounts for smooth pursuit to train and validate our models: GazeCom (Dorr et al., 2010; Startsev et al., 2016) (over 4 h of 250-Hz recordings with SR Research EyeLink II). Its data files contain labels of four classes, with noise labels (e.g., blinks and tracking loss) alongside fixations, saccades, and smooth pursuits. We maintain the same labeling scheme in our problem setting (including the introduced, yet unused, "unknown" label).

We additionally evaluate our approach on a small high-frequency data set that was originally introduced by Larsson, Nyström, and Stridh (2013) (data available via Nyström, 2015; ca. 3.5 min of 500-Hz recordings with SensoMotoric Instruments Hi-Speed 1250), also recently used in a larger review of the state of the art by Andersson et al. (2017). This data set considers postsaccadic oscillations in manual annotation and algorithmic analysis, which is not common yet for eye-tracking research.

Another publicly available data set that includes smooth pursuit, but has low temporal resolution, accompanies the work of Santini, Fuhl, Kübler, and Kasneci (2016) (available at Santini, 2016; ca. 15 min of 30-Hz recordings with a Dikablis Pro eye tracker). This work, however, operates in a different data domain: pupil coordinates on raw eye videos. Because it was not necessary for the algorithm, no eye tracker calibration was performed, and therefore no coordinates are provided with respect to the scene camera. Post hoc calibration is possible, but it is recording-dependent. Nevertheless, the approach of Santini et al. (2016) presents an interesting ternary (fixations, saccades, smooth pursuit) probabilistic classifier.

Automatic detection

Many eye-movement detection algorithms have been developed over the years. A simple, yet versatile toolbox for eye-movement detection is provided by Komogortsev (2014). It contains Matlab implementations for a diverse set of approaches introduced by different authors. Out of the eight included algorithms, five (namely, I-VT and I-DT (Salvucci & Goldberg, 2000), I-HMM (Salvucci & Anderson, 1998), I-MST (Goldberg & Schryver, 1995; Salvucci & Goldberg, 2000), and I-KF (Sauter, Martin, Di Renzo, & Vomscheid, 1991)) detect fixations and saccades only (cf. Komogortsev et al. 2010 for details).

I-VVT, I-VMP, and I-VDT, however, detect the three eye-movement types (fixations, saccades, smooth pursuit) at once. I-VVT (Komogortsev & Karpov, 2013) is a modification of the I-VT algorithm, which introduces a second (lower) speed³ threshold. The samples with speeds between the high and the low thresholds are classified as smooth pursuit. The I-VMP algorithm (San Agustin, 2010) keeps the high speed threshold of the previous algorithm for saccade detection, and uses window-based scoring of movement patterns (such as pair-wise magnitude and direction of movement) for further differentiation. When the score threshold is exceeded, the respective sample is labeled as belonging to a smooth pursuit. I-VDT (Komogortsev & Karpov, 2013) uses a high speed threshold for saccade detection, too. It then employs a modified version of I-DT to separate pursuit from fixations.

³ Unfortunately, in the eye-movement literature, the term "velocity" (in physics, a vector) often is used to refer to "speed" (the scalar magnitude of velocity). Here, we try to be consistent and avoid using "velocity" when it is not justified.

Dorr et al. (2010) use two speed thresholds for saccade detection alone: The high threshold is used to detect the peak-speed parts in the middle of saccades. Such detections are then extended in time as long as the speed stays above the low threshold. This helps filter out tracking noise and other artifacts that could be mistaken for a saccade, if only the low threshold was applied.

Fixations are determined while trying to avoid episode contamination with smooth pursuit samples. The approach uses a sliding window inside intersaccadic intervals, and the borders of fixations are determined via a combination of modified dispersion and speed thresholds.

Similarly, the algorithm proposed by Berg et al. (2009) was specifically designed for dynamic stimuli. Here, however, the focus is on distinguishing saccades from pursuit. After an initial low-pass filtering, the subsequent classification is based on the speed of gaze and principal component analysis (PCA) of the gaze traces. If the ratio of explained variances is near zero, the gaze follows an almost straight line. Then, the window is labeled either as a saccade or smooth pursuit, depending on speed. By combining information from several temporal scales, the algorithm is more robust at distinguishing saccades from pursuit. The samples that are neither saccade nor pursuit are labeled as fixations. The implementation is part of a toolbox (Walther & Koch, 2006).

An algorithm specifically designed to distinguish between fixations and pursuit was proposed by Larsson et al. (2015), and its re-implementation is provided by Startsev et al. (2016). It requires a set of already detected saccades, as it operates within intersaccadic intervals. Every such interval is split into overlapping windows that are classified based on four criteria. If all criteria are fulfilled, the window is marked as smooth pursuit. If none are fulfilled, the fixation label is assigned. Windows with one to three fulfilled criteria are labeled based on their similarity to already-labeled windows.

Several machine learning approaches have been proposed as well. Vidal et al. (2012) focus solely on pursuit detection. They utilize shape features computed via a sliding window, to then use k-NN based classification. The reported accuracy of detecting pursuit is over 90%, but the diversity of the data set is clearly limited (only purely vertical and horizontal pursuit), and reporting accuracy without information about class balance is difficult to interpret.

Hoppe and Bulling (2016) propose using convolutional neural networks to assign eye-movement classes. Their approach, too, operates as a sliding window. For each window, the Fourier-transformed data are fed into a network, which contains one convolutional, one pooling, and one fully connected layer. The latter estimates the probabilities of the central sample in the window belonging to a fixation, a saccade, or a pursuit. The network used in this work is rather small, but the collected data seems diverse and promising. The reported scores are fairly low (average F1 score for detecting the three eye movements of 0.55), but without the availability of the test data set, it is impossible to assess the relative performance of this approach.

An approach by Anantrasirichai, Gilchrist, and Bull (2016) is to identify fixations in mobile eye tracking via an SVM, while everything else is attributed to the class "saccades and noise". The model is trained with gaze trace features, as well as image features locally extracted by a small 2D CNN. The approach is interesting, but the description of the data set and evaluation procedure lacks details.

A recently published work by Zemblys et al. (2017) uses Random Forests with features extracted in 100–200-ms windows that are centered around respective samples. It aims to detect fixations, saccades, and postsaccadic oscillations. This work also employs data augmentation to simulate various tracking frequencies and different levels of noise in data, which adds to the algorithm's robustness.

Unfortunately, neither of these machine learning approaches is publicly available, at least in such a form that allows out-of-the-box usage (e.g., the implementation of Zemblys et al. (2017) lacks a pre-trained classifier).

All the algorithms so far process gaze traces individually. Agtzidis et al. (2016b) (toolbox available at Startsev et al., 2016) detect saccades and fixations in the same fashion, but use inter-observer similarities for the more challenging task of smooth pursuit detection. The samples remaining after saccade and fixation detection are clustered in the 3D space of time and spatial coordinates. Smooth pursuit by definition requires a target, so all pursuits on one scene can only be in a limited number of areas. Since no video information is used, inter-observer clustering of candidate gaze samples is used as a proxy for object detection: If many participants' gaze traces follow similar paths, the chance of this being caused by tracking errors or noise is much lower. To take advantage of this effect, only those gaze samples that were part of some cluster are labeled as smooth pursuit. Those identified as outliers during clustering are marked as noise. This way, however, many pursuit-like events can be labeled as noise due to insufficient inter-observer similarity, and multiple recordings for each clip are required in order to achieve reliable results.

Our approach

We here propose using a bidirectional long short-term memory network (one-layer, in our case), which follows the processing of the input feature space by a series of 1D convolutional layers (see Fig. 1). Unlike Hoppe and Bulling (2016) and the vast majority of other literature approaches, we step away from single sample classification in favor of simultaneous window-based labeling, in order to capture temporal correlations in the data. Our network receives and outputs a window of sample-level features and labels, respectively. Unlike most of the methods in the literature, we also assign the "noise" label, which does not force our model to choose only between the meaningful classes, when this is not sensible.

Fig. 1 The architecture of our 1D CNN-BLSTM network. BN stands for "batch normalization", FC – for "fully connected". In this figure, the input is assumed to contain the five different-scale speed features, and the context window size that is available to the network is just above 1 s

Features for classification

We used both the raw x and y coordinates of the gaze and simple pre-computed features in our final models. To avoid overfitting for the absolute gaze location when using the xy-coordinates, we used positions relative to the first gaze location in each processed data window. Our initial experiments showed that a small architecture such as ours noticeably benefits from feature extraction on various temporal scales prior to passing the sequence to the model. This is especially prominent for smooth pursuit detection. With a limited number of small-kernel convolutional layers, network-extracted features are influenced only by a small area in the input-space data (i.e., the feature-extracting sub-network has a small receptive field, seven samples, or 28 ms, in the case of our network). With this architecture it would thus be impossible to learn motion features on coarser temporal scales, which are important, e.g., for detecting the relatively persistent motion patterns which characterize smooth pursuits. To overcome this, we decided to use precomputed features; specifically, we included speed, acceleration, and direction of gaze, all computed at five temporal scales in order to capture larger movement patterns on the feature level already.

The speed of gaze is an obvious and popular choice (Sauter et al., 1991; Salvucci & Goldberg, 2000; Komogortsev et al., 2010; Komogortsev & Karpov, 2013). Acceleration could aid saccade detection, as it is also sometimes used in the literature (Collewijn & Tamminga, 1984; Nyström & Holmqvist, 2010; Behrens, MacKeben, & Schröder-Preikschat, 2010; Larsson et al., 2013) as well as in SR Research's software for the EyeLink trackers.

The effect of using the direction of gaze is less obvious: Horizontal smooth pursuit, for instance, is more natural to our visual system (Rottach, Zivotofsky, Das, Averbuch-Heller, Discenna, Poonyathalang, & Leigh, 1996). The drifts that occur due to tracking artifacts are, however, more pronounced along the vertical axis (Kyoung Ko, Snodderly, & Poletti, 2016).

We consider five different temporal scales for feature extraction: 4, 8, 16, 32, and 64 ms. The speed (in °/s) and direction (in radians, relative to the horizontal vector from left to right) of gaze for each sample were computed via calculating the displacement vector of gaze position on the screen from the beginning to the end of the temporal window of the respective size, centered around the sample. Acceleration (in °/s²) was computed from the speed values of the current and the preceding samples on the respective scale (i.e., numerical differentiation; acceleration for the first sample of each sequence is set to 0). If a sample was near a prolonged period of lost tracking or either end of a recording (i.e., if gaze data in a part of the temporal window was missing), a respectively shorter window was used.

We additionally conducted experiments on feature combinations, concatenating feature vectors of different groups, in order to further enhance performance.
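The feature computation described above can be summarized in a short sketch (an illustrative NumPy reconstruction of the description, not the authors' released feature-extraction code; the function name and the exact border handling are our own assumptions):

```python
import numpy as np

def multi_scale_features(x, y, sampling_rate=250.0, scales_ms=(4, 8, 16, 32, 64)):
    """x, y: gaze coordinates in degrees of visual angle, one sample per time step."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    idx = np.arange(n)
    feats = []
    for scale in scales_ms:
        w = max(1, int(round(scale * sampling_rate / 1000.0)))  # window length in samples
        lo = np.clip(idx - w // 2, 0, n - 1)          # window centred on each sample,
        hi = np.clip(lo + w, 0, n - 1)                # shrunk near the recording borders
        dt = np.maximum(hi - lo, 1) / sampling_rate   # effective window duration in s
        dx, dy = x[hi] - x[lo], y[hi] - y[lo]         # on-screen displacement vector
        speed = np.hypot(dx, dy) / dt                 # deg/s
        direction = np.arctan2(dy, dx)                # rad, relative to the horizontal axis
        accel = np.zeros(n)
        accel[1:] = np.diff(speed) * sampling_rate    # numerical differentiation; first sample stays 0
        feats += [speed, direction, accel]
    return np.stack(feats, axis=1)                    # shape: (n_samples, 3 * len(scales_ms))
```

For the speed-only models, only the five speed columns would be used; for the combination experiments, the per-group feature vectors are concatenated along the last axis.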
Data sets

GazeCom

We used the GazeCom (Dorr et al., 2010; Startsev et al., 2016) recordings for both training and testing (manual annotation in Agtzidis, Startsev, & Dorr, 2016a), with a strict cross-validation procedure. This data set consists of 18 clips, around 20 s each, with an average of 47 observers per clip (total viewing time over 4.5 h). The total number of individual labels is about 4.3 million (including the samples still recorded after a video has finished; 72.5, 10.5, 11, and 5.9% of all samples are labeled as parts of fixations, saccades, pursuits, or noise, respectively). Event-wise, the data set contains 38629 fixations, 39217 saccades, and 4631 smooth pursuits. For training (but not for testing) on this data set, we excluded gaze samples with timestamps over 21 s (confidently outside video durations) and re-sampled to 250 Hz the recordings of one of the observers (SSK), whose files had a higher sampling frequency.

We used leave-one-video-out (LOVO) cross-validation for evaluation: The training and testing is run 18 times, each time training on all the data for 17 videos and testing on all the eye-tracking data collected for the remaining video clip.

This way, the model that generates eye-movement labels on a certain video had never seen any examples of data with this clip during training. We aggregate the labeled outputs for the test sets of all splits before the final evaluation.

There are two major ways to fully utilize an eye-tracking data set in the cross-validation scenario, in the absence of a dedicated test subset. The first, LOVO, is described above, and it ensures that no video clip-specific information can benefit the model. The second would ensure that no observer-specific information would be used. For this, we used a leave-n-observers-out (LnOO) procedure. In our case, to maintain symmetry with the 18 splits of LOVO, we introduced the same number of splits in our data, each containing three unique randomly selected observers (54 participants in total).

We hypothesize that LOVO should be less susceptible to overfitting than LnOO, since smooth pursuit is target-driven, and the observers' scanpaths tend to cluster in space and time (Agtzidis et al., 2016b), meaning that their characteristics for different observers would be similar for the same stimulus. We test this hypothesis with several experiments, where the only varied parameter is the cross-validation type.
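As an illustration, the two schemes could be set up as follows (a sketch that assumes each recording is stored as a dict with 'video' and 'observer' keys; these names are ours, not taken from the released code):

```python
import random

def lovo_test_splits(recordings, videos):
    # leave-one-video-out: one test split per clip, holding out all of its recordings
    return [[r for r in recordings if r['video'] == v] for v in videos]

def lnoo_test_splits(recordings, observers, n_splits=18, seed=0):
    # leave-n-observers-out: 54 observers shuffled into 18 groups of three
    observers = sorted(observers)
    random.Random(seed).shuffle(observers)
    groups = [observers[i::n_splits] for i in range(n_splits)]
    return [[r for r in recordings if r['observer'] in g] for g in groups]
```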
Nyström-Andersson data set

We used an independent set of recordings in order to additionally validate our model. For this, we took the manually annotated eye-tracking recordings that were utilized in a recent study (Andersson et al., 2017). These contain labels provided by two manual raters: One rater ("CoderMN") originally labeled the data for Larsson et al. (2013), another ("CoderRA") was added by Andersson et al. (2017). The annotations of both raters include six labeled classes: fixations, saccades, postsaccadic oscillations (PSOs), smooth pursuit, blinks, and undefined events.

The whole data set comprises three subsets that use moving dots, images, and video clips as stimuli. We focus our evaluation on the "video" part, since our model was trained on this domain.

We will refer to this subset by the abbreviations of the manual labelers' names (in chronological order of publications, containing respective data sets): "MN-RA-data". In total, it contains ca. 58000 gaze samples (or about 2 min at 500 Hz). Notably, only half of this data consists of "unique" samples, the second half being duplicated, but with different ground truth labels (provided by the second rater). 37.7% of all the samples were labeled as fixation, 5.3% as saccade, 3% as PSO, 52.2% as pursuit, 1.7% as blink, and 0.04% as "unknown". Counting events yields 163 fixations, 244 saccades, 186 PSOs, 121 pursuits, and 8 blinks. The high ratio of pursuit is explained by the explicit instructions given to the participants ("follow [...] moving objects for video clips", Larsson et al. 2013) vs. free viewing in GazeCom (Dorr et al., 2010).

As in Andersson et al. (2017), we evaluated all the considered automatic approaches and both manual raters against the "average coder" (i.e., effectively duplicating each recording, but providing the "true" labels by MN in one case and by RA in the other).

Model architecture

We implemented a joint architecture that consists of a small one-dimensional temporal convolutional network and a bidirectional LSTM layer, with one time-distributed dense layer both before and after the BLSTM (for higher-level feature extraction and to match the number of classes in the output without limiting the number of neurons in the BLSTM, respectively). In this work, we implement multi-class classification with the following labels: fixation, saccade, smooth pursuit, noise (e.g., blinks or tracking loss), also "unknown" (for potentially using partially labeled data), in order to comply with the ground truth labeling scheme in Startsev et al. (2016). The latter label was absent in both the training data and the produced outputs. The architecture is also illustrated in Fig. 1 on an example of using a five-dimensional feature space and simultaneously predicting labels in a window equivalent to about 1 s of 250-Hz samples.

The network used here is reminiscent of other deep sequence-processing approaches, which combine either 2D (Donahue, Anne Hendricks, Guadarrama, Rohrbach, Venugopalan, Saenko, & Darrell, 2015) or, more recently, 3D (Molchanov, Yang, Gupta, Kim, Tyree, & Kautz, 2016) convolutions with recurrent units for frame sequence analysis. Since our task is more modest, our parameter count is relatively low (ca. 10000, depending on the input feature space dimensionality, compared to millions of parameters even in compact static CNNs (Hasanpour, Rouhani, Fayyaz, & Sabokrou, 2016), or ca. 6 million parameters only for the convolutional part (Tran, Bourdev, Fergus, Torresani, & Paluri, 2015) of Molchanov et al., 2016).

The convolutional part of our architecture contains three layers with a gradually decreasing number of filters (32, 16, and 8) with the kernel size of 3, and a batch normalization operation before activation. The subsequent fully connected layer contains 32 units. We did not use pooling, as is customary for CNNs, since we wanted to preserve the one-to-one correspondence between inputs and outputs. This part of the network is therefore not intended for high-level feature extraction, but prepares the inputs to be used by the BLSTM that follows.

All layers before the BLSTM, except for the very first one, are preceded by dropout (rate 0.3), and use ReLU as activation function.

The BLSTM (with 16 units) uses tanh, and the last dense layer (5 units, according to the number of classes) – softmax activation.

The input to our network is a window of a pre-set length, which we varied to determine the influence of context size on the detection quality for each class. To minimize the border effects, our network uses valid padding, requiring its input to be padded. For both training and testing, we only mirror-pad the whole sequence (of undetermined length; typically ca. 5000 in our data), and not each window. We pad by three samples on each side (since each of the three convolutional layers uses valid padding and a kernel of size 3). For our context-size experiments, this means that for a prediction window of 129 samples (i.e., classification context size 129), for example, windows of length 135 must be provided as input. So when we generate the labels for the whole sequence, the neighboring input windows overlap by six samples, but the output windows do not.
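A sketch of this whole-sequence labelling scheme (illustrative only; it assumes a Keras-style model with a fixed input length of window + 6 samples, and simplifies the handling of the very end of a recording):

```python
import numpy as np

def label_sequence(model, features, out_win=129, pad=3):
    n = len(features)
    # mirror-pad the whole sequence once, not each window
    padded = np.pad(features, ((pad, pad), (0, 0)), mode='reflect')
    labels = []
    for start in range(0, n, out_win):
        chunk = padded[start:start + out_win + 2 * pad]
        if len(chunk) < out_win + 2 * pad:   # mirror-pad the tail to full input length
            extra = out_win + 2 * pad - len(chunk)
            chunk = np.pad(chunk, ((0, extra), (0, 0)), mode='reflect')
        probs = model.predict(chunk[None])[0]          # (out_win, n_classes)
        labels.append(probs.argmax(axis=-1))
    return np.concatenate(labels)[:n]                  # per-sample class indices
```

With these settings, consecutive input windows overlap by six samples while the output windows tile the recording without overlap, as described above.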
We balanced neither training nor test data in any way. To allow for fair comparison of results with different context sizes, we attempted to keep the training procedure consistent across experiments. To this end, we fixed the number of windows (of any size) that are used for training (90%) and for validation (10%) to 50000. For experiments with context no larger than 65 samples, we used windows without overlap and randomly sampled (without replacement) the required number of windows for training. For larger-context experiments, the windows were extracted with a step of 65 samples (at 250 Hz, equivalent to 260 ms).

We initialized convolutional and fully connected layers with random weights from a normal distribution (mean 0, standard deviation 0.05), and trained the network for 1000 iterations with batch size 5000 with the RMSprop optimizer (Tieleman & Hinton, 2012) with default parameters from the Keras framework (Chollet et al., 2015) (version 2.1.1; learning rate 0.001 without decay) and categorical cross-entropy as the loss function.
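Putting the architecture and training details together, a minimal Keras sketch could look as follows (an illustrative reconstruction of the description above – layer sizes, dropout placement, initialization, and optimizer settings as stated; it is not the authors' published implementation):

```python
from tensorflow.keras import layers, models, optimizers, initializers

def build_1dcnn_blstm(window_out=129, n_features=5, n_classes=5):
    init = initializers.RandomNormal(mean=0.0, stddev=0.05)
    # three valid-padded convolutions with kernel size 3 shorten the sequence
    # by 6 samples in total, hence the longer input window
    inp = layers.Input(shape=(window_out + 6, n_features))
    x = inp
    for i, n_filters in enumerate((32, 16, 8)):       # gradually decreasing filter counts
        if i > 0:
            x = layers.Dropout(0.3)(x)                # all layers but the first are preceded by dropout
        x = layers.Conv1D(n_filters, 3, padding='valid', kernel_initializer=init)(x)
        x = layers.BatchNormalization()(x)            # batch normalization before the activation
        x = layers.Activation('relu')(x)
    x = layers.Dropout(0.3)(x)
    x = layers.TimeDistributed(layers.Dense(32, activation='relu', kernel_initializer=init))(x)
    x = layers.Bidirectional(layers.LSTM(16, return_sequences=True))(x)  # tanh by default
    out = layers.TimeDistributed(
        layers.Dense(n_classes, activation='softmax', kernel_initializer=init))(x)

    model = models.Model(inp, out)
    model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
                  loss='categorical_crossentropy')
    return model
```

Training would then draw 50000 windows (90% for training, 10% for validation) and run for 1000 iterations with a batch size of 5000, as described above.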
Evaluation

Metrics

Similar to Agtzidis et al. (2016b), Startsev et al. (2016), and Hoppe and Bulling (2016), we evaluated sample-level detection results. Even though all our models (and most of the baseline models) treat eye-movement classification as a multi-class problem, for evaluation purposes we consider each eye movement in turn, treating its detection as a binary classification problem (e.g., with labels "fixation" and "not fixation"). This evaluation approach is commonly used in the literature (Larsson et al., 2013; Agtzidis et al., 2016b; Andersson et al., 2017; Hoppe & Bulling, 2016). We can then compute typical performance metrics such as precision, recall, and F1 score.
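For instance (a sketch assuming integer-coded sample labels; the class-to-integer mapping is illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score

def sample_level_f1(y_true, y_pred,
                    classes=(('fixation', 0), ('saccade', 1), ('pursuit', 2))):
    # each eye-movement class in turn is treated as a binary detection problem
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {name: f1_score(y_true == c, y_pred == c) for name, c in classes}
```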
As for the event-level evaluation, there is no consensus in the literature as to which methodology should be employed. Hoppe and Bulling (2016), for example, use ground truth event boundaries as pre-segmentation of the sequences: For each event in the ground truth, all corresponding individual predicted sample labels are considered. The event is classified by the majority vote of these labels. As Hoppe and Bulling (2016) themselves point out, this pre-segmentation noticeably simplifies the problem of eye-movement classification. Additionally, the authors only reported confusion matrices and respective per-class hit rates, which conceal the problem of false detections. Andersson et al. (2017) only assess the detected events in terms of the similarity of event duration distribution parameters to those of the ground truth.

In Hooge, Niehorster, Nyström, Andersson, and Hessels (2017), event-level fixation detection was assessed by an arguably fairer approach with a set of metrics that includes F1 scores for fixation episodes. We computed these for all three main event types in our data (fixations, saccades, and smooth pursuits): For each event in the ground truth, we look for the earliest algorithmically detected event of the same class that intersects with it. Only one-to-one matching is allowed. Thus, each ground truth event can be labeled as either a "hit" (a matching detected event found) or a "miss" (no match found). The detected events that were not matched with any ground truth events are labeled as "false alarms". These labels correspond to true positives, false negatives, and false positives, which are needed to compute the F1 score.

One drawback of such a matching scheme is that the area of event intersection is not taken into account. This way, for a ground truth event E_GT, the earlier detected event E_A will be preferred to a later detected event E_B, even if the intersection with the former is just one sample, i.e., |E_GT ∩ E_A| = 1, while the intersection with E_B is far greater. Hooge et al. (2017) additionally compute measures such as relative timing offset and deviation of matched events in order to tie together agreement measures and eye-movement parameters, which would also penalize potential poor matches. These, however, have to be computed for both onset and offset of each event type, and are more suited for in-detail analysis of particular labeling patterns rather than for a concise quantitative evaluation. We propose using a typical object detection measure (Everingham, Van Gool, Williams, Winn, & Zisserman, 2010; Everingham, Eslami, Van Gool, Williams, Winn, & Zisserman, 2015), the ratio of the two matched events' intersection to their union (often referred to as "intersection over union", or IoU). If a ground truth event is labeled as a "miss", its corresponding "match IoU" is set to 0. This way, the average "match IoU" is influenced both by the number of detected and missed events, and by the quality of correctly identified events.
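The matching procedure and the resulting episode-level scores can be sketched as follows (an illustrative reimplementation of the rules described above, not the published evaluation code; events are (onset, offset) sample-index pairs with exclusive offsets, per eye-movement class, and detected events are assumed to be sorted by onset):

```python
def match_events(gt_events, det_events):
    # one-to-one matching: each ground truth event gets the earliest intersecting,
    # not yet used, detected event of the same class (inputs are per-class lists)
    used, matches = set(), []
    for g_on, g_off in gt_events:
        hit = None
        for j, (d_on, d_off) in enumerate(det_events):
            if j not in used and min(g_off, d_off) - max(g_on, d_on) > 0:
                hit = j
                break
        if hit is not None:
            used.add(hit)
        matches.append(hit)                      # None marks a "miss"
    return matches, used

def episode_f1_and_iou(gt_events, det_events):
    matches, used = match_events(gt_events, det_events)
    tp = sum(m is not None for m in matches)     # hits
    fn = len(gt_events) - tp                     # misses
    fp = len(det_events) - len(used)             # unmatched detections ("false alarms")
    f1 = 2 * tp / (2 * tp + fp + fn) if (gt_events or det_events) else 0.0
    ious = []
    for (g_on, g_off), m in zip(gt_events, matches):
        if m is None:
            ious.append(0.0)                     # missed events contribute an IoU of 0
        else:
            d_on, d_off = det_events[m]
            inter = min(g_off, d_off) - max(g_on, d_on)
            union = max(g_off, d_off) - min(g_on, d_on)
            ious.append(inter / union)
    return f1, (sum(ious) / len(ious) if ious else 0.0)
```

The stricter variant discussed below, which only counts a match once its IoU reaches a threshold, would simply add that condition to the matching step.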

We report this statistic for all event types in the ground truth data, in addition to episode-level F1 scores of Hooge et al. (2017) as well as sample-level F1 scores (for brevity, F1 scores are used instead of individual statistics such as sensitivity or specificity; this metric represents a balanced combination of precision and recall—their harmonic mean), for both GazeCom and MN-RA-data.

Another idea, which we adapt from object detection research, is only registering a "hit" when a certain IoU threshold is reached (Ren, He, Girshick, and Sun, 2015), thus avoiding the low-intersection matches potentially skewing the evaluation. The threshold that is often employed is 0.5 (Everingham et al., 2010). In the case of one-dimensional events, this threshold also gains interpretability: This is the lowest threshold that ensures that no two detected events can be candidate matches for a single ground truth event. Additionally, if two events have the same duration, their relative shift can be no more than one-third of their duration. For GazeCom, we further evaluate the algorithms at different levels of the IoU threshold used for event matching.

We also compute basic statistics over the detected eye-movement episodes, which we compare to those of the ground truth. Among those are the average duration (in milliseconds) and amplitude (in degrees of visual angle) of all detected fixation, saccade, and smooth pursuit episodes. Even though fixations are not traditionally characterized by their amplitude, it reflects certain properties of fixation detection: For instance, where does a fixation end and a saccade begin? While this choice has relatively little bearing on saccade amplitudes, it might significantly affect the amplitudes of fixations.

Data set-specific settings

For MN-RA-data, we focused only on our best (according to GazeCom evaluation) model.

Since MN-RA-data were recorded at 500 Hz (compared to 250 Hz for GazeCom), we simply doubled the sample-level intervals for feature extraction, which preserves the temporal scales of the respective features (as described in the "Features for classification" above). We also used a model that classifies 257-sample windows at once (our largest-context model). This way, the temporal context at 500 Hz is approximately equivalent to that of 129 samples at 250 Hz, which was used for the majority of GazeCom experiments. Notably, the model used for MN-RA-data processing was trained on 250-Hz recordings and tested on the ones with double the sampling frequency. Our estimate of the model's generalization ability is, therefore, conservative.

Due to cross-validation training on GazeCom, and in order to maximize the amount of pursuit examples in the training data, we predict labels in MN-RA-data using a model trained on all GazeCom clips except one without smooth pursuit in its ground truth annotation ("bridge 1").

Andersson et al. (2017) ignore smooth pursuit detection in most of their quantitative evaluation (while separating it from fixations is a challenging problem on its own), and focus on postsaccadic oscillations instead (which our algorithm does not label), so we cannot compare with the reported results directly. However, on the MN-RA-data, we additionally followed the evaluation strategies of Andersson et al. (2017).

In order to compare our approach to the state-of-the-art performances on MN-RA-data that were reported in Andersson et al. (2017), we computed the Cohen's kappa statistic (for each major eye-movement class separately).

Cohen's kappa κ for two binary sets of labels (e.g., A and B) can be computed via the observed proportion of samples, where A and B agree on either accepting or rejecting the considered eye-movement label, p_obs, and the chance probability of agreement. The latter can be expressed through the proportions of samples, where each of A and B has accepted and rejected the label. We denote those as p_+^A, p_-^A, p_+^B, and p_-^B, respectively. In this case,

$$p_{chance} = p_+^A \cdot p_+^B + p_-^A \cdot p_-^B, \qquad (1)$$

$$\kappa(A, B) = \frac{p_{obs} - p_{chance}}{1 - p_{chance}}. \qquad (2)$$

Cohen's kappa can assume values from [−1; 1], higher score is better. We also considered the overall sample-level error rate (i.e., proportion of samples where the models disagree with the human rater, when all six ground truth label classes are taken into account). For this, we consider all "noise" labels assigned by our algorithm as blink labels, as blinks were the primary cause of "noise" samples in the GazeCom ground truth. It has to be noted that all sample-level metrics are, to some extent, volatile with respect to small perturbations in the data—changes in proportions of class labels, almost imperceptible relative shifts, etc. We would, therefore, recommend using event-level scores instead.
Baselines

For both data sets, we ran several literature models as baselines, to give this work's evaluation more context. These were: Agtzidis et al. (2016b) (implementation by Startsev et al. (2016)), Larsson et al. (2015) (re-implemented by Startsev et al. (2016)), Dorr et al. (2010) (the authors' implementation), Berg et al. (2009) (toolbox implementation Walther & Koch, 2006), I-VMP, I-VDT, I-VVT, I-KF, I-HMM, I-VT, I-MST, and I-DT (all as implemented by Komogortsev (2014), with fixation interval merging enabled). For their brief descriptions, see "Automatic detection".

Since several of the baselines (Dorr et al., 2010; Agtzidis et al., 2016b, the used implementation of Larsson et al., 2015) were either developed in connection with or optimized for GazeCom, we performed grid search optimization (with respect to the average of all sample- and event-wise F1 scores, as reported in Table 2) of the parameters of those algorithms in Komogortsev (2014) that detect smooth pursuit: I-VDT, I-VVT, and I-VMP. The ranges and the parameters of the grid search can be seen in Fig. 2. Overall, the best parameter set for I-VDT was 80°/s for the speed threshold and 0.7° for the dispersion threshold. For I-VVT, the low speed threshold of 80°/s and the high threshold of 90°/s were chosen. For I-VMP, the high speed threshold parameter was fixed to the same value as in the best parameter combination of I-VVT (90°/s), and the window duration and the "magnitude of motion" threshold were set to 400 ms and 0.6, respectively.

Fig. 2 Grid search average F1 scores on GazeCom for I-VDT (2a), I-VVT (2b), and I-VMP (2c). Default parameters were (ordered as (x, y) pairs): I-VDT – (70°/s, 1.35°), I-VVT – (20°/s, 70°/s), I-VMP – (0.5 s, 0.1). These are not optimal for our data set

It is an interesting outcome that, when trying to optimize the scores, half of which depend on events, I-VVT abandons pursuit detection (by setting very high speed thresholds) in favor of better-quality saccade and fixation identification. If optimization with respect to sample-level scores only is performed, this behavior is not observed. This indicates that simple speed thresholding is not sufficient for reasonable pursuit episode segmentation. We have, therefore, tried different speed thresholds for I-VMP, but 90°/s was still the best option.

We have to note that taking the best set of parameters selected via an exhaustive search on the entire test set is prone to overfitting, so the reported performance figures for these baseline methods should be treated as optimistic estimates.

Since the fixation detection step of Dorr et al. (2010) targeted avoiding smooth pursuit, we treat missing labels (as long as the respective samples were confidently tracked) as pursuit for this algorithm. We also adapted the parameters of I-MST to the sampling frequency of the data set (for both data sets).

Just as Andersson et al. (2017), we did not re-train any of the models before testing them on the MN-RA-data.

For the model of Agtzidis et al. (2016a), however, we had to set the density threshold (minPts), which is a parameter of its clustering algorithm. This value should be set proportionally to the number of observers, and the sampling frequency (Startsev et al., 2016). It was, therefore, set to 160 ∗ N_observers / 46.9 ∗ 500/250, where N_observers is the number of recordings for each clip (i.e., four for "dolphin", six for "BergoDalbana", and eight for "triple jump"). GazeCom has an average of 46.9 observers per clip, and is recorded at 250 Hz. MN-RA-data, as mentioned already, consists of recordings at 500 Hz. The resulting minPts values were 28, 40, and 54, respectively.

For both data sets, we additionally implemented a random baseline model, which assigns one of the three major eye-movement labels according to their frequency in the ground truth data.

Results and discussion

Cross-validation procedure selection

We first address the cross-validation type selection. We considered two modes, leave-one-video-out (LOVO) and leave-n-observers-out (LnOO, n = 3). If the two cross-validation procedures were identical in terms of the danger of overfitting, we would expect very similar quantitative results. If one is more prone to overfitting behavior than the other, its results would be consistently higher. In this part of the evaluation, therefore, we are, somewhat counterintuitively, looking for a validation procedure with lower scores.

We conducted several experiments to determine the influence of the cross-validation procedure on the performance estimates (while keeping the rest of the training and testing parameters fixed), all of which revealed the same pattern: While being comparable in terms of fixation and saccade detection, these strategies differ consistently and noticeably for smooth pursuit detection (see the results of one of these experiments in Table 1).

Table 1 Experiment on the choice of a suitable cross-validation technique for our 1D CNN-BLSTM model with speed and direction features and a context size of 129 samples (equivalent to ca. 0.5 s at 250 Hz)

Metric                               LOVO    LnOO
Fixations sample F1                  0.937   0.939
Saccade sample F1                    0.892   0.892
Pursuit sample F1                    0.680   0.706
Fixations episode F1                 0.888   0.887
Saccade episode F1                   0.944   0.946
Pursuit episode F1                   0.585   0.583
Fixations episode F1 (IoU >= 0.5)    0.854   0.855
Saccade episode F1 (IoU >= 0.5)      0.922   0.922
Pursuit episode F1 (IoU >= 0.5)      0.456   0.466
Fixations episode IoU                0.902   0.905
Saccade episode IoU                  0.858   0.857
Pursuit episode IoU                  0.521   0.549

LOVO refers to the leave-one-video-out approach, LnOO – to leave-n-observers-out. Both methods split the data in 18 non-overlapping groups of recordings in our case (18 videos in the data set, 18 groups of three observers each). Differences no less than 0.01 are boldified. This suggests that LOVO provides a more conservative estimate, compared to LnOO

LnOO tends to overestimate the performance of the models, yielding higher scores on most of the metrics. We conclude that LOVO is less prone to overfitting and conduct the rest of the experiments using this technique. We note that overfitting seems to affect the detection of the stimulus characteristics-dependent eye-movement type—smooth pursuit—the most. For stimuli that only induce fixations and saccades, the concern of choosing an appropriate cross-validation technique is alleviated.

We conclude that excluding the tested stimulus (video clip, in this case) must be preferred to excluding the tested observer(s), if some form of cross-validation has to be employed, especially if the evaluation involves highly stimulus-driven eye-movement classes (e.g., smooth pursuit).

GazeCom results overview

An overview of all the evaluation results on the full GazeCom data set is presented in Table 2. It reports the models' performance on the three eye movement classes (fixations, saccades, and pursuits) for both sample- and event-level detection. Table 3 additionally provides the IoU values for all the tested algorithms. Bold numbers mark best performance in each category.

Most of our BLSTM models were trained with the context window of 129 samples (ca. 0.5 s) at the output layer, as it presented a reasonable trade-off between training time (ca. 3 s per epoch on NVIDIA 1080Ti GPU) and the saturation of the effect that context size had on performance.

Individual feature groups

Looking at individual feature sets for our model (raw xy-coordinates, speed, direction, and acceleration), we find that speed is the best individual predictor of eye-movement classes.

Acceleration alone, not surprisingly, fails to differentiate between fixations and smooth pursuit (the largest parts of almost 90% of the smooth pursuit episodes are covered by fixation labels), since both perfect fixation and perfect pursuit lack the acceleration component, excepting onset and offset stages of pursuits. Saccade detection performance is, however, impressive.

Interestingly, direction of movement provides a decent feature for eye-movement classification. One would expect that within fixations, gaze trace directions are distributed almost uniformly because of (isotropic) oculomotor and tracker noise. Within pursuits, its distribution should be pronouncedly peaked, corresponding to the direction of the pursuit, and even more so within saccades due to their much higher speeds. Figure 3 plots these distributions of directional features for each major eye-movement type. The directions were computed at a fixed temporal scale of 16 ms and normalized per-episode so that the overall direction is at 0. Unlike perfect fixations, which would be completely stationary, fixations in our data set contain small drifts (mean displacement amplitude during fixation of 0.56° of visual angle, median 0.45°), so the distribution in Fig. 3 is not uniform. Gaze direction features during saccades and pursuits predictably yield much narrower distribution shapes.

Using just the xy coordinates of gaze has an advantage of its simplicity and the absence of any pre-processing. However, according to our evaluation, the models that use either speed or direction features instead generally perform better, especially for smooth pursuit detection. Nevertheless, our model without any feature extraction already outperforms the vast majority of the literature approaches.

Feature combinations

Experimenting with several feature sets at once, we found that acceleration as an additional feature did not improve average detection performance, probably due to its inability to distinguish pursuit from fixation. The combination of direction and speed, however, showed a noticeable improvement over using them separately, and the results for these features we present in the tables.

Table 2 GazeCom evaluation results as F1 scores for sample-level and episode-level detection (sorted by the average of all columns)

Model                               average F1   Sample-level F1             Event-level F1
                                                 Fixation  Saccade  SP       Fixation  Saccade  SP
1D CNN-BLSTM: speed + direction+    0.830        0.939     0.893    0.703    0.898     0.947    0.596
1D CNN-BLSTM: speed + direction     0.821        0.937     0.892    0.680    0.888     0.944    0.585
1D CNN-BLSTM: speed                 0.808        0.932     0.891    0.675    0.877     0.942    0.529
(Agtzidis et al., 2016b)            0.769        0.886     0.864    0.646    0.810     0.884    0.527
1D CNN-BLSTM: direction             0.769        0.919     0.802    0.621    0.862     0.911    0.499
1D CNN-BLSTM: xy                    0.752        0.913     0.855    0.517    0.861     0.932    0.435
(Larsson et al., 2015)              0.730        0.912     0.861    0.459    0.873     0.884    0.392
I-VMP**                             0.718        0.909     0.680    0.581    0.792     0.815    0.531
(Berg et al., 2009)                 0.695        0.883     0.697    0.422    0.886     0.856    0.424
(Dorr et al., 2010)                 0.680        0.919     0.829*   0.381    0.902     0.854    0.193*
1D CNN-BLSTM: acceleration          0.668        0.904     0.876    0.160    0.877     0.943    0.245
I-VDT**                             0.606        0.882     0.676    0.321    0.823     0.781    0.152
I-KF                                0.563        0.892     0.736    –        0.877     0.876    –
I-HMM                               0.546        0.891     0.712    –        0.817     0.857    –
I-VVT**                             0.531        0.890     0.686    0.000    0.778     0.816    0.013
I-VT                                0.528        0.891     0.705    –        0.761     0.810    –
I-MST                               0.497        0.875     0.570    –        0.767     0.773    –
I-DT                                0.480        0.877     0.478    –        0.759     0.765    –
Random Baseline                     0.201        0.750     0.105    0.114    0.098     0.121    0.020

CNN-BLSTM results are for the context window size of just over 0.5 s (129 samples), except where marked with + (1 s, 257 samples). The ∗ signs mark the numbers where the label was assumed from context and not actually assigned by the algorithm – i.e. missing labels were imputed. Performance estimates for models marked with ∗∗ are optimistic (thresholds were optimized on the entire test set). In each column, the highest value is boldified

Fig. 3 Histogram of sample-level direction features, when normalized relative to the overall direction of each respective episode (the panel shows the distribution of direction deviation from the overall episode direction, in degrees from -180 to 180, against the share of samples, separately for fixations, pursuits, and saccades)

We retrained the model that uses direction and speed features for a larger context size (257 samples, ca. 1 s). This model demonstrates the highest F1 scores (or within half a percent of the best score achieved by any model) for all eye-movement types in both evaluation settings. It outperforms the nearest literature approach by 2, 2.9, and 5.7% of the F1 score for fixations, saccades, and smooth pursuits, respectively. The gap is even wider (6.3 and 6.5% for saccades and smooth pursuit, respectively) for episode-level evaluation. Only for fixation episode detection, the Dorr et al. (2010) model performs slightly better (by 0.004). In terms of IoU values, our model improves the state-of-the-art scores by 0.04, 0.05, and 0.09 for fixations, saccades, and pursuits, respectively, indicating the higher quality of the detected episodes across the board.

Table 3 GazeCom evaluation results for event-level detection as intersection-over-union values (sorted by the average of all columns)

Model                               Average IoU   Fixation IoU   Saccade IoU   SP IoU
1D CNN-BLSTM: speed + direction+    0.768         0.906          0.858         0.541
1D CNN-BLSTM: speed                 0.763         0.885          0.856         0.547
1D CNN-BLSTM: speed + direction     0.760         0.902          0.858         0.521
1D CNN-BLSTM: xy                    0.665         0.880          0.801         0.313
(Dorr et al., 2010)                 0.663         0.815          0.808         0.367*
(Agtzidis et al., 2016b)            0.663         0.742          0.799         0.448
1D CNN-BLSTM: direction             0.631         0.834          0.718         0.341
(Larsson et al., 2015)              0.625         0.789          0.809         0.277
I-VMP**                             0.613         0.828          0.556         0.454
1D CNN-BLSTM: acceleration          0.606         0.906          0.834         0.077
I-VDT**                             0.558         0.760          0.555         0.359
(Berg et al., 2009)                 0.541         0.774          0.499         0.351
I-KF                                0.504         0.842          0.671         –
I-HMM                               0.501         0.870          0.633         –
I-VT                                0.492         0.868          0.607         –
I-VVT**                             0.477         0.863          0.567         0.000
I-MST                               0.364         0.694          0.399         –
I-DT                                0.313         0.592          0.347         –
Random baseline                     0.055         0.077          0.071         0.017

CNN-BLSTM results are for the context window size of just over 0.5 s (129 samples), except where marked with + (1 s, 257 samples). The ∗ signs mark the numbers where the label was assumed from context and not actually assigned by the algorithm – i.e. missing labels were imputed. Performance estimates for models marked with ∗∗ are optimistic (thresholds were optimized on the entire test set). In each column, the highest value is boldified

We also varied the IoU threshold that determines whether two episodes constitute a potential match, computing episode-level F1 scores every time (see Fig. 4). From this evaluation it can be seen that not only does our deep learning model outperform all literature models, but it maintains this advantage even when a stricter criterion for an event "hit" is considered (even though it was trained to optimize pure sample-level metrics). For fixations, while similar to the performance of Dorr et al. (2010) at lower IoU thresholds, our model is clearly better when it comes to higher thresholds. For saccades, it has to be noted that the labels of Dorr et al. (2010) were used as initialization for the manual annotators in order to obtain ground truth event labels for GazeCom. This results in a higher number of perfectly matching saccade episodes for Dorr et al. (2010) (as well as for Agtzidis et al. (2016b) and our implementation of Larsson et al. (2015), both of which use a very similar saccade detection procedure), when the manual raters decided not to change the borders of certain saccades.

As mentioned already, a threshold of 0.5 has its theoretical benefits (no two detected episodes can both be matches for a single ground truth event, some interpretability). Here, we can see practical advantages as well, thanks to the Random Baseline model. Due to the prevalence of fixation samples in the GazeCom data set, assigning random labels with the same distribution of classes results in many fixation events, which occasionally intersect with fixations in the ground truth. In the absence of any IoU thresholding (the threshold of 0 in Fig. 4), the F1 scores for fixations and saccades are around 10%. Only by the threshold level of 0.5 does the fixation event-wise F1 score for the Random Baseline reach zero.

Common eye-tracking measures

In order to directly compare the average properties of the detected events to those in the ground truth, we compute the mean durations and amplitudes for the episodes of the three major eye-movement types in our data. The results are presented in Table 4. For this part of the evaluation, we consider only our best model (referred to as 1D CNN-BLSTM (best) in the tables), which uses speed and direction features at a context size of roughly 1 s. We compare it to all the baseline algorithms that consider smooth pursuit.

For both measures, our algorithm is ranked second, while providing average fixation and saccade amplitudes that are the closest to the ground truth.

Table 4 Average durations (in milliseconds) and amplitudes (in degrees of visual angle) of different event types, as labelled by manual annotators
(first row) or algorithmic procedures

Average event duration, ms Average event amplitude, degrees

Algorithm Fixation Saccade SP (rank) avg.  Fixation Saccade SP (rank) avg. 

Ground truth 315.2 41.5 405.2 0.56 6.84 2.38


1D CNN-BLSTM (best) 281.1 38.4 217.0 (2) 75.2 0.53 6.66 1.44 (2) 0.38
(Agtzidis et al., 2016b) 229.5 40.1 244.6 (3) 82.6 0.40 7.19 1.91 (1) 0.33
(Larsson et al., 2015) 335.3 41.0 320.1 (1) 35.2 0.70 7.45 3.15 (3) 0.51
I-VMP∗∗ 256.3 20.8 217.8 (4) 89.0 0.52 6.03 1.64 (4) 0.53
(Berg et al., 2009) 282.6 66.0 164.3 (5) 99.3 0.51 8.26 1.20 (6) 0.88
(Dorr et al., 2010) 340.3 41.7 68.2∗ (6) 120.7 0.53 7.41 1.09∗ (5) 0.63
I-VDT∗∗ 284.7 19.0 45.7 (7) 137.5 0.47 5.41 0.49 (8) 1.13
I-VVT∗∗ 261.1 21.4 0.5 (8) 159.7 0.64 6.11 0.04 (7) 1.05

In each column, the value with the lowest absolute difference to the respective ground truth value is boldified. The averages of these absolute
differences for event durations and amplitudes occupy the fifth and the last columns, respectively, along with the rank of each considered model
(lower is better). The rows are sorted as the respective rows of our main evaluation table – Table 2. The ∗ signs mark the numbers where the
label was assumed from context and not actually assigned by the algorithm – i.e. missing labels were imputed. Performance estimates for models
marked with ∗∗ are optimistic (thresholds were optimized on the entire test set)

We note that the approaches with the average duration and amplitude of events closest to the ground truth differ for the two measures (Larsson et al., 2015; Agtzidis et al., 2016b, respectively).

From this evaluation, we can conclude that our algorithm detects many small smooth pursuit episodes, resulting in comparatively low average smooth pursuit duration and amplitude. This is confirmed by the relatively higher event-level false positive count of our algorithm (3475, compared to 2207 for Larsson et al., 2015). Our model's drastically lower false negative count (1192 vs. 2966), however, allows it to achieve a much higher F1 score for pursuit event detection.

We also have to stress that simple averages do not provide a comprehensive insight into the performance of an eye-movement detection algorithm, but rather offer an intuitively interpretable, though not entirely reliable, measure of detected event quality. There is no matching performed here; the entire set of episodes of a certain type is averaged at once. This is why we recommend using either the temporal offsets of matched episode pairs, as introduced by Hooge et al. (2017), or IoU averaging or thresholding, as we suggest in “Metrics”. The latter allows for evaluating episode-level eye-movement detection performance at varying levels of match quality, which is assessed via a relatively easily interpretable IoU metric.
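As an illustration of the thresholded episode-level evaluation, a minimal sketch (episodes as half-open sample-index intervals; the function names are ours, and the exact matching rules of our evaluation pipeline may differ in details such as tie-breaking):

```python
def interval_iou(a, b):
    """Intersection over union of two half-open intervals (start, end)."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def episode_f1_at_iou(ground_truth, detected, threshold=0.5):
    """Episode-level F1 where a detected episode counts as a true positive
    only if it can be matched to a yet-unmatched ground truth episode
    with IoU >= threshold."""
    matched, true_positives = set(), 0
    for d in detected:
        best_iou, best_idx = 0.0, None
        for i, g in enumerate(ground_truth):
            if i in matched:
                continue
            iou = interval_iou(d, g)
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall > 0 else 0.0)

# e.g., episode_f1_at_iou([(0, 100), (150, 300)], [(10, 90), (160, 290)])
```

With the threshold at 0.5, at most one detection can match each ground truth episode, as argued above; sweeping the threshold from 0 to 1 produces curves like those in Fig. 4.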
Context size matters

We also investigated the influence of the size of the context, where the network simultaneously assigns labels, on detection scores (see Fig. 5). We did this by running the cross-validation process at a range of context sizes, with five speed features defining the input space. We tested contexts of 1, 2 + 1, 4 + 1, 8 + 1, ..., 256 + 1 samples. For the GazeCom sampling frequency, this corresponds to 4, 12, 20, 36 ms, ..., 1028 ms. Training for larger context sizes was computationally impractical.
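The sample-to-millisecond conversion is just a product with the 4-ms sampling interval of GazeCom; a quick sanity check of the values quoted above (plain Python):

```python
sampling_rate = 250.0  # Hz for GazeCom, i.e., 4 ms per sample
context_sizes = [1] + [2 ** p + 1 for p in range(1, 9)]  # 1, 3, 5, 9, ..., 257 samples
print([n * 1000.0 / sampling_rate for n in context_sizes])
# -> [4.0, 12.0, 20.0, 36.0, 68.0, 132.0, 260.0, 516.0, 1028.0]
```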
Context size had the biggest influence on smooth pursuit detection. For speed features, when the context window size was reduced from just over 1 s of gaze data to merely one sample being classified at a time, the F1 scores for fixation and saccade samples decreased (in terms of absolute values) only by 2.8 and 5.1%, respectively, whereas smooth pursuit sample detection performance plummeted (decreased by over 40%).

For all eye movements, however, there is a general positive impact of expanding the context of the analysis. This effect seems to reach a saturation point by the 1-s mark, with absolute improvements in all detection F1 scores being not much higher than 1% (except for smooth pursuit episodes, which could potentially benefit from even larger context sizes).

At the largest context size, this model is better at detecting smooth pursuit (for both sample- and event-level problem settings) than any baseline smooth pursuit detector, including the multi-observer approach in Agtzidis et al. (2016b), which uses information from up to 50 observers at the same time, allowing for higher-level heuristics.

Generalizability
To test our model on additional independent data, we present the evaluation results of our best model (speed and direction used as features, context size of ca. 0.5 s) alongside all the literature models we tested, on the MN-RA-data set: sample- and event-level F1 scores (Table 5), as well as average IoU values (Table 6). This is the model with the largest context we trained, 257 samples. The duration in seconds is reduced due to the doubling of the sampling frequency, compared to GazeCom.
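In other words (our arithmetic, using the sampling rates stated above): $257\ \text{samples} / 500\ \text{Hz} = 514\ \text{ms} \approx 0.5\ \text{s}$ on the MN-RA-data, compared to $257 / 250\ \text{Hz} = 1028\ \text{ms}$ on GazeCom.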
Table 7 combines our evaluation results with the performances reported in Andersson et al. (2017) in the form of Cohen's kappa values and overall error rates for the MN-RA-data (for video stimuli). Evaluation results from Andersson et al. (2017) were included in the table if they represent the best performance with respect to at least one of the statistics that we include in this table. BIT refers to the algorithm in van der Lans, Wedel, and Pieters (2011), and LNS to the one in Larsson et al. (2013).

For our model, performance on this data set is worse compared to GazeCom, but even human raters show substantial differences in labeling the “ground truth” (Hooge et al., 2017; Andersson et al., 2017). Nevertheless, the average out-of-the-box performance of our algorithm compares favorably to the state of the art in terms of sample-level classification (see Table 5). In terms of episode-level evaluation, our model shows somewhat competitive F1 scores (Table 5), but makes up for it in the average intersection-over-union statistic, which accounts for both the number of correctly identified episodes and the quality of the match (see Table 6). While its error rate is similar to that of I-VMP, the F1 and IoU scores are, on average, higher for our model, and its Cohen's κ scores are consistently superior to I-VMP across the board.

Our algorithm's 34% error rate may still be unacceptable for many applications, but so could be the manual rater disagreement of 19%. Our algorithm further demonstrates the highest Cohen's kappa score for smooth pursuit detection, and the second highest for fixation detection.
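For reference, Cohen's kappa relates the observed (sample-level) agreement $p_o$ between an algorithm and a human coder to the agreement $p_e$ expected by chance,

$\kappa = \frac{p_o - p_e}{1 - p_e},$

and, as we read Andersson et al. (2017), it is computed per event type in a one-vs-rest fashion for Table 7.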

Fig. 4 (panels a–c) Episode-level F1 scores at different IoU thresholds: At 0.0, the regular episode F1 score is computed; At 1.0, the episodes have to match exactly; The thresholds in-between represent increasing levels of episode match strictness. The vertical dashed line marks the threshold, which is typically used when considering IoU scores
Fig. 5 (panels a–c) Detection quality plotted against the context size (in samples at 250 Hz; log-scale) that the network classifies at once. Dashed lines represent individually chosen reference algorithms that perform best with respect to each eye movement. For both sample- and event-level F1 evaluation (5a and 5b, respectively), fixation detection results of Dorr et al. (2010) are taken, for saccades – Startsev et al. (2016), for pursuits – I-VMP. For event-level IoU evaluation (5c), “best other” fixation detection IoU is taken from I-HMM, for saccades – Larsson et al. (2015), for pursuits – I-VMP. We separately display the smooth pursuit detection results of the multi-observer algorithm's toolbox (Startsev et al., 2016) (the dotted line), as it belongs to a different class of algorithms

The best saccade detection quality is achieved by LNS, which explicitly labels postsaccadic oscillations and thus increases saccade detection specificity.

For sample-level F1-score evaluation (Table 5), our model achieves the second highest scores for fixation (with a very narrow margin) and pursuit detection, outperforming all competition in saccade detection.

Conclusions

We have proposed a deep learning system for eye-movement classification. Its overall performance surpasses all considered reference models on an independent small-scale data set. For the naturalistic larger-scale GazeCom, our approach outperforms the state of the art with respect to the three major eye-movement classes: fixations, saccades, and smooth pursuits. To the best of our knowledge, this is the first inherently temporal machine learning model for eye-movement event classification that includes smooth pursuit. Unlike Agtzidis et al. (2016b), which implicitly uses full temporal context, and explicitly combines information across a multitude of observers, our model can be adapted for online detection (by re-training without using look-ahead features and preserving the LSTM states [4]). The classification time is kept short due to the low (for deep-learning models, at least) parameter count of the trained models.

[4] For online detection, one would need to either use a unidirectional LSTM and process the samples as they are recorded, or assemble windows of samples that end with the latest available ones and process the full windows with a BLSTM model.
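To illustrate the first option from the footnote, a schematic Keras sketch of a stateful, unidirectional LSTM that labels each gaze sample as it arrives; layer sizes, feature count, and class count are placeholders here, not our trained configuration, and training code is omitted:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

num_features, num_classes = 5, 4  # placeholders (e.g., multi-scale speed features; 4 label classes)

# A fixed batch size of 1 and one time step per call; stateful=True makes the
# LSTM carry its internal state from one incoming sample to the next.
model = Sequential([
    LSTM(16, stateful=True, return_sequences=True,
         batch_input_shape=(1, 1, num_features)),
    TimeDistributed(Dense(num_classes, activation='softmax')),
])

def classify_online(feature_stream):
    """Yield one class index per incoming feature vector of shape (num_features,)."""
    for features in feature_stream:
        x = np.asarray(features, dtype='float32').reshape(1, 1, num_features)
        yield int(np.argmax(model.predict(x, batch_size=1)))

model.reset_states()  # reset between recordings so state does not leak across files
```

The window-based alternative from the footnote would instead buffer the most recent 257 samples and run the regular BLSTM over the full window at every new sample.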
Table 5 MN-RA-data evaluation results as F1 scores for sample-level and episode-level detection (sorted by the average of all columns). CNN-BLSTM results here are for the window size of just over 0.5 s (257 samples at 500 Hz). The ∗ signs mark the numbers where the label was assumed from context and not actually assigned by the algorithm – i.e., missing labels were imputed. In each column, the highest value is boldified

Model | Average F1 | Sample-level F1: Fixation | Sample-level F1: Saccade | Sample-level F1: SP | Event-level F1: Fixation | Event-level F1: Saccade | Event-level F1: SP
(Agtzidis et al., 2016b) | 0.653 | 0.670 | 0.699 | 0.638 | 0.455 | 0.860 | 0.592
1D CNN-BLSTM: speed + direction | 0.650 | 0.667 | 0.720 | 0.663 | 0.550 | 0.826 | 0.475
(Larsson et al., 2015) | 0.633 | 0.609 | 0.698 | 0.424 | 0.741 | 0.871 | 0.456
(Dorr et al., 2010) | 0.630 | 0.614 | 0.691 | 0.446∗ | 0.710 | 0.841 | 0.476∗
I-VMP | 0.572 | 0.593 | 0.699 | 0.739 | 0.455 | 0.564 | 0.385
(Berg et al., 2009) | 0.533 | 0.609 | 0.625 | 0.176 | 0.683 | 0.730 | 0.374
I-VDT | 0.474 | 0.595 | 0.694 | 0.222 | 0.443 | 0.561 | 0.329
I-KF | 0.444 | 0.578 | 0.643 | – | 0.639 | 0.806 | –
I-HMM | 0.421 | 0.577 | 0.711 | – | 0.535 | 0.702 | –
I-DT | 0.381 | 0.573 | 0.439 | – | 0.599 | 0.678 | –
I-VT | 0.375 | 0.575 | 0.701 | – | 0.412 | 0.560 | –
I-VVT | 0.365 | 0.573 | 0.701 | 0.242 | 0.067 | 0.560 | 0.046
I-MST | 0.363 | 0.560 | 0.444 | – | 0.603 | 0.568 | –
Random Baseline | 0.180 | 0.387 | 0.051 | 0.535 | 0.023 | 0.066 | 0.017

Table 6 MN-RA-data evaluation results for event-level detection as intersection-over-union values (sorted by the average of all columns)

Model | Fixation ep. IoU | Saccade ep. IoU | SP ep. IoU
1D CNN-BLSTM: speed + direction | 0.705 | 0.543 | 0.398
I-VMP | 0.368 | 0.623 | 0.619
I-VDT | 0.744 | 0.647 | 0.127
(Agtzidis et al., 2016b) | 0.626 | 0.469 | 0.420
(Larsson et al., 2015) | 0.754 | 0.514 | 0.215
I-VT | 0.798 | 0.665 | −
I-HMM | 0.81 | 0.627 | −
(Dorr et al., 2010) | 0.646 | 0.461 | 0.284∗
I-KF | 0.785 | 0.543 | −
(Berg et al., 2009) | 0.706 | 0.353 | 0.075
I-DT | 0.628 | 0.328 | −
I-VVT | 0.181 | 0.665 | 0.023
I-MST | 0.438 | 0.208 | −
Random baseline | 0.024 | 0.038 | 0.016

BLSTM results here are for the window size of just over 0.5 s (257 samples at 500 Hz). The ∗ signs mark the numbers where the label was assumed from context and not actually assigned by the algorithm – i.e., missing labels were imputed. In each column, the highest value is boldified
Table 7 Cohen's kappa (higher is better) and overall error rates (lower is better) for the MN-RA-data set

Group | Error rate | Fixation κ | Saccade κ | SP κ
CoderMN | 19% | 0.83 | 0.94 | 0.83
CoderRA | 19% | 0.82 | 0.94 | 0.83
1D CNN-BLSTM: speed + direction | 34% | 0.41 | 0.70 | 0.41
I-VMP | 34% | 0.38 | 0.68 | 0.40
(Agtzidis et al., 2016b) | 38% | 0.43 | 0.68 | 0.40
(Dorr et al., 2010) | 46% | 0.25 | 0.67 | 0.20∗
(Larsson et al., 2015) | 47% | 0.23 | 0.68 | 0.19
I-VDT | 53% | 0.16 | 0.67 | 0.09
Random Baseline | 56% | 0.00 | 0.00 | 0.00
(Berg et al., 2009) | 57% | 0.21 | 0.60 | 0.07
I-VVT | 55% | 0.14 | 0.68 | 0.02
I-HMM∗∗ | 59% | 0.13 | 0.71 | −
I-VT∗∗ | 59% | 0.13 | 0.76 | −
I-DT | 60% | 0.09 | 0.40 | −
I-MST | 61% | 0.04 | 0.43 | −
I-KF∗∗ | 62% | 0.14 | 0.59 | −
BIT∗∗ | 67% | 0.14 | 0.00 | −
LNS∗∗ | 92% | 0.00 | 0.81 | −

BLSTM here uses speed and direction features and the context of ca. 0.5 s (257 samples at 500 Hz). The ∗ signs mark the numbers where the label was assumed from context and not actually assigned by the algorithm – i.e., missing labels were imputed. The scores for the algorithms marked with ∗∗ were taken directly from Andersson et al. (2017). The rows are sorted by their error rate. In each column, the best value achieved by any algorithm is boldified (the first two rows correspond to manual annotators)

Furthermore, we introduced and analyzed a new event-level evaluation protocol that considers the quality of the matched episodes through enforcing restrictions on the pair of events that constitute a match. Our experiments additionally highlight the importance of temporal context, especially for detecting smooth pursuit.

The code for our model and results for all evaluated algorithms are provided at http://www.michaeldorr.de/smoothpursuit.

Acknowledgements This research was supported by the Elite Network Bavaria, funded by the Bavarian State Ministry for Research and Education.

References

Agtzidis, I., Startsev, M., & Dorr, M. (2016a). In the pursuit of (ground) truth: A hand-labelling tool for eye movements recorded during dynamic scene viewing. In 2016 IEEE second workshop on eye tracking and visualization (ETVIS) (pp. 65–68).
Agtzidis, I., Startsev, M., & Dorr, M. (2016b). Smooth pursuit detection based on multiple observers. In Proceedings of the ninth biennial ACM symposium on eye tracking research & applications, ETRA ’16 (pp. 303–306). New York: ACM.
Anantrasirichai, N., Gilchrist, I. D., & Bull, D. R. (2016). Fixation identification for low-sample-rate mobile eye trackers. In 2016 IEEE international conference on image processing (ICIP) (pp. 3126–3130).
Andersson, R., Larsson, L., Holmqvist, K., Stridh, M., & Nyström, M. (2017). One algorithm to rule them all? An evaluation and discussion of ten eye movement event-detection algorithms. Behavior Research Methods, 49(2), 616–637.
Behrens, F., MacKeben, M., & Schröder-Preikschat, W. (2010). An improved algorithm for automatic detection of saccades in eye movement data and for calculating saccade parameters. Behavior Research Methods, 42(3), 701–708.
Berg, D. J., Boehnke, S. E., Marino, R. A., Munoz, D. P., & Itti, L. (2009). Free viewing of dynamic stimuli by humans and monkeys. Journal of Vision, 9(5), 1–15.
Chollet, F., et al. (2015). Keras. https://github.com/keras-team/keras
Collewijn, H., & Tamminga, E. P. (1984). Human eye movements during voluntary pursuit of different target motions on different backgrounds. The Journal of Physiology, 351(1), 217–250.
Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In The IEEE conference on computer vision and pattern recognition (CVPR).
Dorr, M., Martinetz, T., Gegenfurtner, K. R., & Barth, E. (2010). Variability of eye movements when viewing dynamic natural scenes. Journal of Vision, 10(10), 28–28.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2015). The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.
Goldberg, J. H., & Schryver, J. C. (1995). Eye-gaze-contingent control of the computer interface: Methodology and example for zoom detection. Behavior Research Methods, Instruments, & Computers, 27(3), 338–350.
Hasanpour, S. H., Rouhani, M., Fayyaz, M., & Sabokrou, M. (2016). Lets keep it simple, using simple architectures to outperform deeper and more complex architectures. CoRR, arXiv:1608.06037
Hooge, I. T. C., Niehorster, D. C., Nyström, M., Andersson, R., & Hessels, R. S. (2017). Is human classification by experienced untrained observers a gold standard in fixation detection? Behavior Research Methods.
Hoppe, S., & Bulling, A. (2016). End-to-end eye movement detection using convolutional neural networks. ArXiv e-prints.
Komogortsev, O. V. (2014). Eye movement classification software. http://cs.txstate.edu/~ok11/emd_offline.html
Komogortsev, O. V., Gobert, D. V., Jayarathna, S., Koh, D. H., & Gowda, S. M. (2010). Standardization of automated analyses of oculomotor fixation and saccadic behaviors. IEEE Transactions on Biomedical Engineering, 57(11), 2635–2645.
Komogortsev, O. V., & Karpov, A. (2013). Automated classification and scoring of smooth pursuit eye movements in the presence of fixations and saccades. Behavior Research Methods, 45(1), 203–215.
Kyoung Ko, H., Snodderly, D. M., & Poletti, M. (2016). Eye movements between saccades: Measuring ocular drift and tremor. Vision Research, 122, 93–104.
Land, M. F. (2006). Eye movements and the control of actions in everyday life. Progress in Retinal and Eye Research, 25(3), 296–324.
Larsson, L., Nyström, M., Andersson, R., & Stridh, M. (2015). Detection of fixations and smooth pursuit movements in high-speed eye-tracking data. Biomedical Signal Processing and Control, 18, 145–152.
Larsson, L., Nyström, M., & Stridh, M. (2013). Detection of saccades and postsaccadic oscillations in the presence of smooth pursuit. IEEE Transactions on Biomedical Engineering, 60(9), 2484–2493.
Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S., & Kautz, J. (2016). Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network. In The IEEE conference on computer vision and pattern recognition (CVPR).
Nyström, M. (2015). Marcus Nyström — Humanities Lab, Lund University. http://www.humlab.lu.se/en/person/MarcusNystrom
Nyström, M., & Holmqvist, K. (2010). An adaptive algorithm for fixation, saccade, and glissade detection in eyetracking data. Behavior Research Methods, 42(1), 188–204.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., & Garnett, R. (Eds.) Advances in neural information processing systems 28 (pp. 91–99). Curran Associates, Inc.
Rottach, K. G., Zivotofsky, A. Z., Das, V. E., Averbuch-Heller, L., Discenna, A. O., Poonyathalang, A., & Leigh, R. J. (1996). Comparison of horizontal, vertical and diagonal smooth pursuit eye movements in normal human subjects. Vision Research, 36(14), 2189–2195.
Salvucci, D. D., & Anderson, J. R. (1998). Tracing eye movement protocols with cognitive process models. In Proceedings of the 20th annual conference of the cognitive science society (pp. 923–928). Lawrence Erlbaum Associates Inc.
Salvucci, D. D., & Goldberg, J. H. (2000). Identifying fixations and saccades in eye-tracking protocols. In Proceedings of the 2000 symposium on eye tracking research & applications, ETRA ’00 (pp. 71–78). New York: ACM.
San Agustin, J. (2010). Off-the-shelf gaze interaction. PhD thesis, IT-Universitetet i København.
Santini, T. (2016). Automatic identification of eye movements. http://ti.uni-tuebingen.de/Eye-Movements-Identification.1845.0.html
Santini, T., Fuhl, W., Kübler, T., & Kasneci, E. (2016). Bayesian identification of fixations, saccades, and smooth pursuits. In Proceedings of the ninth biennial ACM symposium on eye tracking research & applications, ETRA ’16 (pp. 163–170). New York: ACM.
Sauter, D., Martin, B. J., Di Renzo, N., & Vomscheid, C. (1991). Analysis of eye tracking movements using innovations generated by a Kalman filter. Medical and Biological Engineering and Computing, 29(1), 63–69.
Startsev, M., Agtzidis, I., & Dorr, M. (2016). Smooth pursuit. http://michaeldorr.de/smoothpursuit/
Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 26–31.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In 2015 IEEE international conference on computer vision (ICCV) (pp. 4489–4497).
van der Lans, R., Wedel, M., & Pieters, R. (2011). Defining eye-fixation sequences across individuals and tasks: The binocular-individual threshold (BIT) algorithm. Behavior Research Methods, 43(1), 239–257.
Vidal, M., Bulling, A., & Gellersen, H. (2012). Detection of smooth pursuits using eye movement shape features. In Proceedings of the symposium on eye tracking research & applications, ETRA ’12 (pp. 177–180). New York: ACM.
Walther, D., & Koch, C. (2006). Modeling attention to salient proto-objects. Neural Networks, 19(9), 1395–1407.
Yarbus, A. L. (1967). Eye movements during perception of moving objects (pp. 159–170). Boston: Springer.
Zemblys, R., Niehorster, D. C., Komogortsev, O., & Holmqvist, K. (2017). Using machine learning to detect events in eye-tracking data. Behavior Research Methods, 50(1), 160–181. https://doi.org/10.3758/s13428-017-0860-3
