Using Machine Learning to Detect Events in Eye-tracking Data
DOI 10.3758/s13428-017-0860-3
Raimondas Zemblys, [email protected]

1 Department of Engineering, Siauliai University, Siauliai, Lithuania
2 Humanities Laboratory, Lund University, Lund, Sweden
3 Humanities Laboratory and Department of Psychology, Lund University, Lund, Sweden
4 Institute for Psychology, University of Muenster, Muenster, Germany
5 Department of Computer Science, Texas State University, San Marcos, TX, USA
6 UPSET, NWU Vaal, Vanderbijlpark, South Africa

Abstract Event detection is a challenging stage in eye-movement data analysis. A major drawback of current event detection methods is that parameters have to be adjusted based on eye-movement data quality. Here we show that a fully automated classification of raw gaze samples as belonging to fixations, saccades, or other oculomotor events can be achieved using a machine-learning approach. Any already manually or algorithmically detected events can be used to train a classifier to produce a similar classification of other data without the need for a user to set parameters. In this study, we explore the application of the random forest machine-learning technique for the detection of fixations, saccades, and post-saccadic oscillations (PSOs). In an effort to show the practical utility of the proposed method to applications that employ eye-movement classification algorithms, we provide an example where the method is employed in an eye movement-driven biometric application. We conclude that machine-learning techniques lead to superior detection compared to current state-of-the-art event detection algorithms and can reach the performance of manual coding.

Keywords Eye movements · Event detection · Machine learning · Fixations · Saccades

Introduction

In eye movement research, the goal of event detection is to robustly extract events, such as fixations and saccades, from the stream of raw data samples from an eye tracker, based on a set of basic rules and criteria which are appropriate for the recorded signal. Until recently, researchers who ventured to record eye movements were required to conduct time-consuming manual event detection. For instance, Hartridge and Thomson (1948) devised a method to analyze eye movements at a rate of 10,000 s (almost 3 h) of analysis time for 1 s of recorded data, and as Monty (1975) remarked: "It is not uncommon to spend days processing data that took only minutes to collect" (p. 331–332).

Computers have fundamentally changed how eye-movement data are analyzed. Today, event detection is almost exclusively done by applying a detection algorithm to the raw gaze data. For a long time, two broad classes of algorithms were used. First, the velocity-based algorithms detect saccades and assume the rest to be fixations. The most well-known is the I-VT algorithm of Bahill, Brockenbrough, and Troost (1981) and Salvucci and Goldberg (2000), but the principle can be traced back to algorithms by Boyce from 1965 and 1967, as referred to by Ditchburn (1973). The dispersion-based algorithms instead detect fixations and assume the rest to be saccades. The best known is the I-DT algorithm of Salvucci and Goldberg (2000).
Both the velocity-based and the dispersion-based algorithms come with thresholds that the user needs to set. Both classify data incorrectly if they are run on data with a sampling frequency outside the intended range. Data containing noise, post-saccadic oscillations, and smooth pursuit also result in erroneous classification (Holmqvist et al., 2011). The velocity and dispersion algorithms only detect fixations and saccades and tell us nothing about other events.

The last decade has seen several improved event-detection algorithms, as researchers have hand-crafted new measures of properties of the eye-movement signal and hand-tuned algorithms to exploit these measures for event detection. For instance, Engbert and Kliegl (2003), Nyström and Holmqvist (2010), and Mould, Foster, Amano, and Oakley (2012) use adaptive thresholds to free the researcher from having to set different thresholds per trial when the noise level varies between trials. Nonetheless, these algorithms still only work over a limited range of noise levels (Hessels, Niehorster, Kemner, & Hooge, 2016). A very recent development (Larsson et al., 2013, 2015) has enabled the automatic detection of smooth pursuit and post-saccadic oscillations in clean data recorded at a high sampling frequency that contains these events intermixed with fixations and saccades. Furthermore, Hessels et al. (2016) have presented a novel, largely noise-resilient algorithm that can successfully detect fixations in data with varying noise levels, ranging from clean data to the very noisy data typical of infant research. These algorithms are designed to solve a specific problem—smooth pursuit detection, or noise resilience—using algorithmic rules and criteria specifically designed for that problem. There are many other algorithms with specific purposes, such as separating the slow and fast phases in nystagmus, detecting microsaccades, online event detection for use in gaze-contingent research, or removing saccades from smooth pursuit data (Holmqvist et al., 2016, Chapter 7).

Most of these algorithms work well within the assumptions they make of the data. Examples of common assumptions are that the input must be high-quality data, or data recorded at high sampling frequencies, and that there is no smooth pursuit in it. All algorithms come with overt settings that users must experiment with to achieve satisfactory event detection in their data set, or covert settings that users have no access to. When the sampling frequency is too low, or too high, or the precision of the data is poor, or there is data loss, many of these algorithms fail (Holmqvist et al., 2012, 2016).

In this paper, we present a new technique for developing eye-movement event detectors that uses machine learning. More specifically, a random forest classifier is built to find combinations of data-description measures and detection features (e.g., velocity, acceleration, dispersion) that enable robust event detection. Through training with manually coded example data, the classifier automatically learns to perform event detection based on these features in a manner that generalizes to other, previously unseen data. The resulting event detector, which we name identification by random forest (IRF), can simultaneously perform multiple detection tasks, categorizing data into saccades, fixations, and post-saccadic oscillations. Here we report that the performance of the classifier constructed in this manner exceeds that of the best existing hand-crafted event detection algorithms, and approaches the performance of manual event coding done by eye-movement experts. The classifier furthermore provides event detection output that remains stable over a large range of noise levels and data sampling frequencies. An interesting auxiliary result of training the classifier is that it provides insight into which detection features carry the most information for classifying eye-movement data into events.

When evaluating the performance of any event detection method, it is important to recognize that the detected events are in most cases only an intermediate step that enables further analysis. These further steps may ultimately provide evidence for or against a research hypothesis, or may enable a practical application where eye movements are being used. In this paper, we therefore analyze how the distributions of selected fixation- and saccade-based metrics change as a function of noise level and data sampling frequency when event detection is performed by our IRF algorithm. We furthermore examine whether the events detected by our classifier can drive an eye-movement-driven biometric application, and whether it does so better than a common state-of-the-art hand-crafted event detection algorithm (Nyström & Holmqvist, 2010). It is reasonable to expect that more precise eye-movement classification should result in better biometric performance.

In summary, here we introduce an entirely new design principle for event detection algorithms, where machine learning does the job of choosing feature combinations and selecting appropriate thresholds. We argue that this work is the onset of a paradigm shift in the design of event detection algorithms.

Methods

In this study, we use a similar approach as in Zemblys et al. (2015). We use clean data recorded with a high-end eye tracker and systematically resample and add increasing levels of noise to simulate recordings from other eye trackers. We then train a random forest classifier to predict eye-movement events from features used by existing event detection algorithms, in conjunction with descriptors of data quality. Zemblys (2016) showed that, among ten tested machine-learning algorithms, the random forest gave the best eye-movement event classification performance.
Machine-learning models are known to suffer from overfitting. This happens when a model describes random error or noise instead of the underlying relationship. Measures of model fit are not a good guide to how well a model will generalize to other data: a high R² does not necessarily mean a good model. It is easy to overfit the data by including too many degrees of freedom. Machine-learning-based models are usually very complex, can be non-linear, and have a multitude of parameters. It is therefore common practice in supervised machine learning to hold out part of the available data as a test set. This data split is sometimes called cross-validation and is used to calculate the 'out-of-the-bag' error, i.e., to evaluate how well the model generalizes to completely unseen data.

Moreover, different machine-learning algorithms may have a number of settings (known as hyperparameters) that must be manually set. When evaluating different hyperparameters, there is still a risk of overfitting on the test set, because the parameters can be tweaked until the model performs optimally. This way, knowledge about the test set can "leak" into the model, and the evaluation metrics then no longer report on generalization performance. To solve this problem, yet another part of the dataset needs to be held out as a so-called validation set: training proceeds on the training set, after which hyperparameters are adjusted and evaluation is done on the validation set; when the trained model seems optimal, the final evaluation is performed on the test set. This is how we conducted the development and evaluation in this paper.
To train a classifier, we use the random forest implementation in the Scikit-learn library (Pedregosa et al., 2011) and the LUNARC Aurora computer cluster to speed up training (LUNARC is the center for scientific and technical computing at Lund University, http://www.lunarc.lu.se/). To visualize the results, we use Seaborn, a Python package developed for statistical data visualization (Waskom et al., 2016). We also use the lme4 package (Bates et al., 2015) in R to perform statistical analyses when comparing the output of our classifier at different hyperparameter settings.

Baseline dataset

Our baseline dataset consisted of eye-movement recordings in which five participants performed a simple fixate-saccade task. We asked participants to track a silver 0.2° dot with a 2 × 2 pixel black center that jumped between positions in an equally spaced 7 × 7 grid and was shown for 1 s at each point. PsychoPy (Peirce, 2007) was used to present the dot on a black background. Stimulus targets changed their position every second and were presented in the same pseudo-random order for all participants, but with a random starting point in the sequence. Monocular eye-movement data were recorded at 1000 Hz using the EyeLink 1000 eye tracker set up in tower mode. We chose to use data from this eye tracker as it is known to exhibit low trackloss and low RMS noise that is more or less constant across the screen (Holmqvist et al., 2015). The average noise level in the dataset was 0.015 ± 0.0029° RMS (noise levels were calculated using a kernel-density-estimation method to select the sample windows over which the noise was measured; Holmqvist et al., 2015).

The baseline dataset is then split into development and testing sets by randomly selecting the data from one subject (i.e., 20% of the data) to belong to the testing set. We do so in order to calculate the out-of-the-bag error and evaluate how our algorithm generalizes to completely unseen data, i.e., data from a different subject. Further, the development set is split into training and validation sets by randomly assigning 25% of each trial to the validation set and leaving the remaining 75% of the data in the training set. This validation dataset is used when tuning the classifier.

An expert with 9 years of eye-tracking experience (author RZ) manually tagged the raw data into fixations, saccades, post-saccadic oscillations (PSOs), and undefined events, which were used as baseline events. In the eye-tracking field, manual event classification was the dominant option long into the 1970s. Manual coding has also been used as a "gold standard" when comparing existing algorithms (Andersson et al., 2016) and when developing and testing new algorithms, e.g., Munn et al. (2008) and Larsson et al. (2013, 2015). From this literature, we know that human coders agree with each other to a larger extent than existing event detection algorithms do. We did not have multiple coders to analyze inter-rater reliability, as this would open another research question of how the coder's background and experience affect the events produced. Instead, the focus in this paper is on how well we can recreate the original events using machine learning, irrespective of which coder produced those events.
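As an illustration of the splitting scheme just described, the sketch below holds out one subject as a test set and assigns a random quarter of each remaining trial to a validation set. It assumes the samples live in a pandas DataFrame with hypothetical subject and trial columns; it is not the exact code used in this study.

```python
import numpy as np
import pandas as pd

def split_dataset(samples: pd.DataFrame, test_subject, val_fraction=0.25, seed=0):
    """Hold out one subject as the test set, then randomly assign a fraction of
    each remaining trial to the validation set and the rest to the training set."""
    rng = np.random.default_rng(seed)
    test = samples[samples["subject"] == test_subject]
    development = samples[samples["subject"] != test_subject]

    train_parts, val_parts = [], []
    for _, trial in development.groupby(["subject", "trial"]):
        mask = rng.random(len(trial)) < val_fraction   # ~25 % of this trial's samples
        val_parts.append(trial[mask])
        train_parts.append(trial[~mask])
    return pd.concat(train_parts), pd.concat(val_parts), test
```

Keeping the test subject completely out of development is what allows the out-of-the-bag error to reflect generalization to an unseen person rather than to unseen samples from a known person.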
A total of 560 fixations, 555 saccades, and 549 PSOs were tagged in the raw data. The total number of saccades and other events in the dataset might seem very low; however, after the data augmentation step (which we describe next), the total number of each event approaches 54,000. Fixation durations ranged from 21 to 1044 ms, saccades had durations from 6 to 92 ms and amplitudes ranging from 0.1° to 29.8°, while PSO durations were 2–60 ms. The durations of the manually tagged fixations and saccades have somewhat bimodal distributions (see the swarmplots in Fig. 1). This reflects a common behavior in the type of fixation-saccade task we employed: a large-amplitude saccade towards a new target often undershoots and is followed by (multiple) fixations of short duration and small corrective saccades, which are then finally followed by a long fixation while looking at the target. Figure 2 shows the amplitude distribution of all the saccades in the dataset. This distribution is less typical of psychological experiments, such as visual search and scene-viewing tasks, because of the many saccades with amplitudes of less than 2° (Holmqvist et al., 2016, p. 466). However, in reading research and fixate-saccade experiments, small saccades are abundant. Furthermore, small saccades are harder to reliably detect as they easily drown in noise, and small saccades have more prominent PSOs (Hooge et al., 2015). Therefore, our dataset not only makes a good basis for training a universal event detection algorithm, but also provides a challenge for any event detection algorithm.

Fig. 1 Tukey boxplots of manually tagged event durations. Red circles indicate means. Overlaid are the swarmplots, which show a representation of the underlying distribution

Fig. 2 Distribution of amplitudes of manually tagged saccades. Bin size 1°

Data augmentation

Our goal in this paper is to develop a universal classifier that would be able to work with any type of eye-tracking data, i.e., data recorded with any eye tracker and thus having very different noise levels and sampling rates. Commercial eye trackers sample data starting from 30 Hz (the Eye Tribe, and the first versions of the SMI and Tobii Glasses) and ranging up to 2000 Hz (e.g., the Eyelink 1000 and 1000 Plus, and the ViewPixx TRACKPixx). In this study, we augment our baseline dataset to simulate recordings of the most common eye trackers on the market. First-order spline interpolation is used to resample the data to 60, 120, 200, 250, 300, 500, and 1250 Hz. Before resampling, the data was low-pass filtered using a Butterworth filter with a cut-off frequency of 0.8 times the Nyquist frequency of the new data rate and a window size of 20 ms.
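A minimal sketch of this resampling step with SciPy is given below. The filter order, the zero-phase filtering, and skipping the filter when upsampling are our own assumptions, and the 20-ms window mentioned above is not reflected here.

```python
import numpy as np
from scipy import signal, interpolate

def resample_gaze(t, x, y, fs_in, fs_out):
    """Low-pass filter at 0.8 times the new Nyquist frequency, then resample
    the gaze coordinates with first-order (linear) spline interpolation."""
    wn = 0.8 * (fs_out / 2.0) / (fs_in / 2.0)   # cut-off relative to the old Nyquist
    if wn < 1.0:                                # only filter when downsampling
        b, a = signal.butter(2, wn)
        x = signal.filtfilt(b, a, x)
        y = signal.filtfilt(b, a, y)
    t_new = np.arange(t[0], t[-1], 1.0 / fs_out)
    fx = interpolate.interp1d(t, x, kind="slinear")
    fy = interpolate.interp1d(t, y, kind="slinear")
    return t_new, fx(t_new), fy(t_new)
```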
Next, we systematically added white Gaussian noise to the resampled dataset at each sampling frequency. More specifically, for each recording in each data set, we added a white Gaussian noise signal generated using the Box–Muller method (Thomas et al., 2007), separately for the horizontal and vertical gaze components. The choice to use white Gaussian noise to model measurement noise is motivated by recent findings that, when recording from an artificial eye (which has no oculomotor noise), eye trackers generally produce white noise (Coey et al., 2012; Wang et al., 2016a). Both studies showed that the power spectrum is in this case constant and independent of frequency.

However, it remains an open question what type of measurement noise is found in recordings of human eye movements. Next to inherent oculomotor noise (tremor and drift), there are many possible eye-tracker-dependent sources of colored noise: filtering, fluctuations in lighting, pupil-diameter changes, head movements, etc. To account for these additional noise sources, other studies modeled the noise using other methods, e.g., an autoregressive process (Otero-Millan et al., 2014) or power spectral density (PSD) estimation methods (Wang et al., 2016b; Hessels et al., 2016), which model noise as a sum of oculomotor and measurement noise. It is, however, questionable if it is valid to add such noise to a signal of human origin that already includes an oculomotor noise component. Moreover, it is unclear if using a noise model obtained from real data and scaling it to create different noise levels is valid. This assumes that oculomotor and measurement noise always scale in the same ratio—it is more likely that oculomotor noise is relatively constant and that high noise in the recording is mostly caused by the eye tracker. In this study, we therefore chose to use white noise to model noise increments in our data, as white noise transparently simulates well-understood reasons for different noise levels in eye-tracking data: lower-quality sensors used in lower-quality eye trackers, or a lower pixel-density eye image in remote eye trackers compared to tower-mounted systems, where the eye is close to the camera sensor.
Our baseline data set is recorded from humans and already includes natural oculomotor noise, such as microsaccades and drift. The color of the noise spectrum due to this oculomotor noise is inherited by all the augmented data after adding white noise. At higher noise levels, however, the oculomotor noise will drown in the data, and the noise color disappears as the signal-to-noise ratio gets lower and lower.

Holmqvist et al. (2015) show that for most eye trackers, noise in the corners of the screen is up to three times higher than in the middle of the screen. In the noise we add to our datasets, we simulate this increase of noise with distance from the middle of the screen by using a 2-D Gaussian noise-mapping function, which is minimal in the middle and maximal in the corners. The standard deviation of this Gaussian mapping function was chosen such that the noise level at the corners of the stimulus plane, at ±15°, was three times higher than in the middle of the screen.

We further scale the variance of the generated noise signal to ten levels starting from 0.005° RMS, where each subsequent noise level is double the previous one. This results in additive noise levels ranging from 0.005° to 2.56° RMS in the middle of the screen, and three times higher noise at the corners. Figure 3 shows the distributions of the resulting noise levels in part of our augmented dataset (500 Hz), along with the noise level in the baseline data set. Note how the resulting noise distributions overlap at each level of added noise. Modeling the noise this way not only represents the real-case scenario of variable noise in each of the recordings, but also covers the whole range of noise, from a minimum of 0.0083° to a maximum of 7.076° RMS, in a continuous fashion. In comparison, the baseline data cover a noise range from 0.0058° to 0.074° RMS.
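The noise-injection step could be sketched as follows. The exact shape of the 2-D mapping function, its width, and the relation between the noise standard deviation and its sample-to-sample RMS are assumptions chosen only to illustrate the centre-to-corner scaling; they are not taken from the original implementation.

```python
import numpy as np

def add_white_noise(x, y, centre_rms_deg, corner_gain=3.0, half_extent=15.0, seed=0):
    """Add zero-mean white Gaussian noise to gaze coordinates (in degrees).
    The noise level is centre_rms_deg at the screen centre and grows with
    eccentricity, reaching roughly corner_gain times that level at the corners
    of the +-half_extent stimulus plane (an inverted 2-D Gaussian map)."""
    rng = np.random.default_rng(seed)
    r2 = x**2 + y**2
    r2_corner = 2.0 * half_extent**2
    sigma2 = r2_corner / 9.0                      # assumed width of the mapping
    gain = 1.0 + (corner_gain - 1.0) * (1.0 - np.exp(-r2 / (2.0 * sigma2)))
    # sample-to-sample RMS of white noise is sqrt(2) times its standard deviation
    std = centre_rms_deg / np.sqrt(2.0) * gain
    return x + rng.normal(0.0, std), y + rng.normal(0.0, std)

# ten noise levels, each double the previous, starting at 0.005 deg RMS
noise_levels = 0.005 * 2.0 ** np.arange(10)       # 0.005 ... 2.56 deg RMS
```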
After data augmentation, we have around 6.4 million samples in our training set and more than 2.1 million samples in each of the validation and testing sets.

Feature extraction

Right before feature extraction, we interpolate missing data using a Piecewise Cubic Hermite Interpolating Polynomial. This kind of interpolation preserves monotonicity in the data and does not overshoot if the data are not smooth. At this stage, there is no limit on the duration of periods of missing data, but later we remove long sequences of interpolated data (see "Post-processing: Labeling final events").

We then perform feature extraction on the raw data at each sampling frequency and noise level. For each feature, this process produces one transformed sample for each input sample. For instance, the velocity feature is computed by calculating the gaze velocity for each sample in the input data. In this paper, we use the 14 features listed in Table 1, yielding a 14-dimensional feature vector for each sample. Most of these features are based on the local 100–200-ms surroundings of each sample. The features we employ either describe the data in terms of sampling frequency and precision, or are features that are used in common or state-of-the-art hand-crafted event detection algorithms. Next to these, we also propose several new features, which we hypothesize are likely to be useful for the detection of the onset and offset of saccades: rms-diff, std-diff, and bcea-diff. These new features are inspired by Olsson (2007) and are calculated by taking the difference in the RMS, STD, and BCEA precision measures calculated for 100-ms windows preceding and following the current sample. Obviously, the largest differences (and therefore peaks in the feature) should occur around the onset and offset of saccades. We expect that many of the features used in this paper will be highly correlated with other features. This provides room to optimize the computational complexity of our model by removing some of the correlated features. In the next step, the 14-dimensional feature vector produced by feature extraction for each sample is fed to the machine-learning algorithm.
Table 1 The features computed for each raw data sample

fs: sampling frequency (Hz). As some features may provide different information at different sampling rates (e.g., SMI BeGaze uses velocity for data sampled at 200 Hz and more, and dispersion at lower frequencies), providing the classifier with information about sampling frequency may allow it to make better decision trees.
rms: root mean square (°) of the sample-to-sample displacement in a 100-ms window centered on a sample. The most used measure to describe eye-tracker noise (Holmqvist et al., 2011).
std: standard deviation (°) of the recorded gaze position in a 100-ms window centered on a sample. Another common noise measure (Holmqvist et al., 2011).
bcea: bivariate contour ellipse area (°²). Measures the area in which the recorded gaze position in a 100-ms window lies for P% of the time (Blignaut & Beelders, 2012). P = 68.
disp: dispersion (°). The most common measure in dispersion-based algorithms (Salvucci & Goldberg, 2000). Calculated as (xmax − xmin) + (ymax − ymin) over a 100-ms window.
vel, acc: velocity (°/s) and acceleration (°/s²), calculated using a Savitzky–Golay filter with polynomial order 2 and a window size of 12 ms—half the duration of the shortest saccade, as suggested by Nyström and Holmqvist (2010).
med-diff: distance (°) between the median gaze in a 100-ms window before the sample and an equally sized window after the sample. Proposed by Olsson (2007).
mean-diff: distance (°) between the mean gaze in a 100-ms window before the sample and an equally sized window after the sample. Proposed by Olsson (2007) and used in the default fixation detection algorithm in Tobii Studio.
Rayleightest: a feature used by Larsson et al. (2015) that indicates whether the sample-to-sample directions in a 22-ms window are uniformly distributed.
i2mc: introduced by Hessels et al. (2016) to find saccades in very noisy data. We used the final weights provided by the two-means clustering procedure as generated by the original implementation of the algorithm. A window size of 200 ms, centered on the sample, was used.
rms-diff, std-diff, bcea-diff: features inspired by Olsson (2007), but instead of differences in position, we take the difference between noise measures calculated for 100-ms windows preceding and succeeding the sample.

A minimum of three samples is used in case there are not enough samples in the defined window, as may happen for lower-frequency data.
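To make these definitions concrete, here is a sketch of how three of the features (vel, rms, and rms-diff) could be computed per sample. Window handling at the edges and the minimum window size are simplified relative to the description above, and this is not the code used to produce the results.

```python
import numpy as np
from scipy.signal import savgol_filter

def example_features(x, y, fs):
    """Per-sample vel, rms, and rms-diff features (a simplified sketch)."""
    # velocity from first-order Savitzky-Golay derivatives, ~12-ms window, order 2
    win = max(int(round(0.012 * fs)) | 1, 5)             # odd window, >= 5 samples
    vx = savgol_filter(x, win, polyorder=2, deriv=1) * fs
    vy = savgol_filter(y, win, polyorder=2, deriv=1) * fs
    vel = np.hypot(vx, vy)

    half = max(int(round(0.05 * fs)), 3)                  # half of a 100-ms window
    step = np.hypot(np.diff(x), np.diff(y))               # sample-to-sample displacement
    n = len(x)
    rms = np.full(n, np.nan)
    rms_diff = np.full(n, np.nan)
    for i in range(n):
        lo, hi = max(i - half, 0), min(i + half, n - 1)
        if hi > lo:
            rms[i] = np.sqrt(np.mean(step[lo:hi] ** 2))   # centred window
        if i - half >= 0 and i + half <= n - 1:
            before = np.sqrt(np.mean(step[i - half:i] ** 2))
            after = np.sqrt(np.mean(step[i:i + half] ** 2))
            rms_diff[i] = abs(after - before)             # peaks at saccade on/offsets
    return np.column_stack([vel, rms, rms_diff])
```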
Algorithm

The random forest classifier can use the features as they are. There is no need to scale, center, or transform them in any way.

Random forest classifier

A random forest classifier works by producing many decision trees. Each tree, from its root to each of its leaves, consists of a series of decisions, made per sample in the input data, based on the 14 features that we provide the classifier with. A tree could, for instance, contain a decision such as "if around this sample, RMS is smaller than 0.1°, and the sampling frequency is less than 100 Hz, use disp, else use i2mc". Every tree node—equaling a singular logical proposition—is a condition on a single feature, bound to other nodes in the tree with if-then clauses, which brings the algorithm closer to deciding whether the sample belongs to, e.g., a fixation or a saccade. These decisions are similar to how traditional hand-crafted event detection algorithms work. These also take a number of features (such as velocity, acceleration, noise level, etc.) as input and, by means of rules and thresholds set on these features by the algorithm's designer, derive which event the sample likely belongs to.

A random forest is an ensemble method in the sense that it builds several independent estimators (trees). For each sample, it then either produces a classification by a majority-vote procedure ("this sample is part of a saccade, because 45 out of 64 trees classified it as such"), or it produces a probabilistic classification ("the probability that this sample is part of a saccade is 45/64 ≈ 70%"). We use a fully probabilistic approach, where the class probability of a single tree is the fraction of samples of the same class in a leaf, and where individual trees are combined by averaging their probabilistic predictions, instead of letting each classifier vote for a single class. Each of the decision trees in the ensemble is built using a random subset of the features and a random subset of training samples from the data. This approach goes by the name of bootstrap aggregation, known in the machine-learning literature as bagging. As a result of bagging, the bias (underfitting) of the forest usually increases slightly but, due to averaging, its variance (overfitting) decreases and compensates for the increase in bias, hence yielding an overall better model (Breiman, 2001).
Training parameters

When training a random forest classifier, a few parameters need to be set. Two important parameters are the number of estimators, i.e., the number of trees in the forest, and the criterion, i.e., a function used to measure the quality of a node split, that is, a proposed border between saccade samples and fixation samples. Selecting the number of trees is an empirical problem, and it is usually done by means of cross-validation. For example, Oshiro, Perez, and Baranauskas (2012) trained a classifier on 29 datasets of human medical data and found that there was no benefit in using more than 128 trees when predicting human medical conditions. We chose to use 200 trees as a starting point, because random forest classifiers do not overfit (Breiman, 2001). As the function to measure the quality of a decision made by each tree, we use the Gini impurity measure. It can be understood as a criterion to minimize the probability of misclassification. Another commonly used criterion is the information gain, which is based on entropy. Raileanu and Stoffel (2004) found that these two metrics disagree on only about 2% of the decisions made by a tree, which means it is normally not worth spending time on training classifiers using different impurity criteria. The Gini impurity criterion was chosen because it is faster to calculate than the information gain criterion.

To deal with our unbalanced dataset, where most samples belong to fixations, we use the balanced subsample weighting method (for a detailed description, see http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). We further limit each tree in the ensemble to use a maximum of three features, which is close to the square root of the number of features we provide the classifier with. This is one of the invisible hyperparameters that make random forests powerful. However, we do not limit the depth of the trees, the minimum number of samples required to split an internal tree node, nor the minimum number of samples in newly created leaves.
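In terms of the Scikit-learn implementation referred to above, these settings correspond roughly to the following configuration. This is a sketch; any parameter not mentioned in the text is left at its default value as an assumption.

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=200,                   # 200 trees as a starting point
    criterion="gini",                   # Gini impurity for node splits
    max_features=3,                     # at most three features considered per split
    class_weight="balanced_subsample",  # counter the dominance of fixation samples
    max_depth=None,                     # tree depth and leaf sizes are not limited
    min_samples_split=2,
    min_samples_leaf=1,
    n_jobs=-1,
)
# clf.fit(X_train, y_train)             # X_train: n_samples x 14 feature matrix
# proba = clf.predict_proba(X_val)      # class probabilities, averaged over the trees
```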
Classifier optimization

After training the full classifier using all 14 features and 200 trees, we reduced the computational and memory requirements of the classifier by removing unimportant features and reducing the number of trees. The procedure for both of these optimizations is described in this section.

Machine-learning techniques for building a classifier allow assessing feature importance. Feature importance indicates how useful a given feature is for correctly classifying the input data. This can be used to help develop a better understanding of how certain properties of the input data affect event detection, such as whether sampling frequency is important to the detection of saccades (which is debated, see Holmqvist et al., 2011, p. 32). Measures of feature importance, however, also allow reducing the number of features used by the classifier, which might improve the generalizability of the classifier to unseen data and reduce the computational and memory requirements for running the classifier.

There are several measures of feature importance. A random forest classifier directly gives an assessment of feature importance in the form of the mean decrease impurity. This number tells us how much each feature decreases the weighted impurity in a tree. It should, however, be noted that some of the features we use are highly correlated with each other (see Fig. 4), as expected. Highly correlated features complicate assessing feature importance with the mean decrease impurity method. When training a model, any of the correlated features can be used as the predictor, but once one of them is used, the importance of the other highly correlated features is significantly reduced, since they provide little extra information.

There are a number of other feature selection methods in machine learning, e.g., correlation criteria, mutual information and the maximal information coefficient, Lasso regression, etc., each with its specific strengths and weaknesses (Guyon & Elisseeff, 2003). In this study, in addition to mean decrease impurity (MDI), we chose to use two additional methods of assessing feature importance that are well suited for non-linear classifiers: mean decrease accuracy (MDA) and univariate feature selection (UFS). Mean decrease accuracy directly measures the impact of each feature on the accuracy of the model. After training a model, the values of each feature are permuted and we measure how much the permutation decreases the accuracy of the classifier (we use Cohen's kappa to measure accuracy, see "Sample-to-sample classification accuracy"). The idea behind this technique is that if a feature is unimportant, the permutation will have little to no effect on model accuracy, while permuting an important feature would significantly decrease accuracy. Univariate feature selection, and more specifically single-variable classifiers, assesses feature importance by building a classifier using only an individual feature, and then measuring the performance of each of these classifiers.
Fig. 4 Spearman's rank correlation between the features in the training dataset

We optimize our model by performing recursive feature elimination, using the following equation as the feature elimination criterion; it sums the squared feature importances and removes the feature with the lowest value:

$$\operatorname*{argmin}_{\forall feat} \sum_{\forall m_{feat}} m^{2} \qquad (1)$$

where m stands for the MDI, MDA, and UFS measures of a feature's importance, each normalized to the [0, 1] range.

Specifically, we train a classifier using all 14 features, find the least important feature using Eq. 1, remove this feature, and retrain the classifier with fewer and fewer features, until there are only four left—one more than the maximum number of features used in each tree.
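As a sketch, the elimination criterion of Eq. 1 could be written as follows. Min-max normalization is our assumption; the text only states that the importances are normalized to the [0, 1] range.

```python
import numpy as np

def least_important_feature(mdi, mda, ufs):
    """Eq. 1: for each feature, sum the squared normalized importance scores
    (MDI, MDA, UFS) and return the index of the feature with the lowest sum.
    Assumes the scores within each measure are not all identical."""
    def norm(v):
        v = np.asarray(v, dtype=float)
        return (v - v.min()) / (v.max() - v.min())
    score = norm(mdi) ** 2 + norm(mda) ** 2 + norm(ufs) ** 2
    return int(np.argmin(score))
```

The returned index identifies the feature to drop before the classifier is retrained on the remaining features.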
The size of the classifier is further reduced by finding the number of trees after which there is no further improvement in classifier performance. For each number of features, we trained classifiers with 1, 4, 8, up to 196 trees (with a step size of 4). We then run each of these reduced classifiers on the validation set and assess their performance by means of Cohen's kappa (see "Sample-to-sample classification accuracy" below). We then employ a linear mixed-effects model (Bates, Mächler, Bolker, & Walker, 2015) with the number of trees and the number of features as categorical predictors to test below which number of features and trees the performance of the classifier, as indicated by Cohen's kappa, starts to decrease significantly compared to the full classifier using all features and 200 trees. The linear mixed-effects model included subject, sampling rate, and added noise level as random factors with random intercepts.

Post-processing: Labeling final events

After the initial classification of raw data samples (Hessels et al., 2016, refer to this as a search rule), the next step is to produce meaningful eye-tracking events (to apply a categorization rule, in the terms of Hessels et al., 2016). For each of the samples, our random forest classifier outputs three probabilities (summing to 1), indicating how likely the sample is to belong to a fixation, a saccade, or a PSO. This is done internally, with no user-accessible settings. We first apply a Gaussian smoother (σ = 1 sample) over time to each of the three probabilities, then label each sample according to the event it most likely belongs to, and then use the following heuristics to determine the final event labels.
– Mark events that contain more than 75 ms of interpolated data as undefined.
– Merge fixations that are less than 75 ms and 0.5° apart.
– Make sure that all saccades have a duration of at least three samples, expanding if required; this means that if we have a one-sample saccade, we also label the preceding and following samples as saccade.
– Merge saccades that are closer together than 25 ms.
– Remove saccades that are too short (<6 ms) or too long (>150 ms).
– Remove PSOs that occur anywhere other than directly after a saccade and preceding a fixation.
– Remove fixations shorter than 50 ms.
– Remove saccades and following PSO events that surround episodes of missing data, as these are likely blink events.
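The probability smoothing and sample labeling step described above could look roughly as follows (a minimal sketch; the event heuristics themselves are not shown, and the label names are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

EVENTS = np.array(["fixation", "saccade", "pso"])

def label_samples(proba):
    """proba: n_samples x 3 array of class probabilities from the classifier.
    Smooth each probability trace over time with a Gaussian (sigma = 1 sample)
    and label every sample with its most likely event."""
    smoothed = gaussian_filter1d(proba, sigma=1, axis=0)
    return EVENTS[np.argmax(smoothed, axis=1)]
```

The heuristics listed above then operate on runs of identical labels, merging, expanding, or removing candidate events to form the final output.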
Removal of a saccade, PSO, or fixation means that the sample is marked as unclassified, a fourth class. Unclassified samples also existed in the manual coding, but below we do not compare agreement between the manual coder and the algorithm on which samples are unclassified. While these parameters of the heuristic post-processing are accessible to the user, they are designed to work with all types of data that we use in this paper, and as such, we do not expect that users will need to change them.
Performance evaluation

We evaluate our identification by random forest (IRF) algorithm, as optimized by the procedure detailed above, using three approaches: sample-to-sample classification accuracy, the ability to reproduce ground-truth event measures and fundamental saccadic properties (the main sequence), and performance in an eye-movement biometrics application. Currently, there are only two other algorithms that are able to detect all three events we concern ourselves with—fixations, saccades, and PSOs. One of these is the algorithm by Nyström and Holmqvist (2010) (hereafter NH; a MATLAB implementation is available for download at http://www.humlab.lu.se/en/person/MarcusNystrom/), and the other is the algorithm by Larsson et al. (2013, 2015) (hereafter LNS). Unfortunately, an implementation of the latter is not publicly available. Implementing it ourselves is tricky and might lead to painting an incorrect picture of this algorithm's performance. In the following, we therefore only compare the performance of our algorithm to that of Nyström and Holmqvist (2010). In order to ensure that the NH algorithm performs optimally, we manually checked the output of the algorithm and adjusted its settings to best suit the input data. We found that the default initial velocity threshold of 100°/s works fine for data with an average noise level of up to 0.5° RMS, and we increased it to 200–300°/s for noisier input. These initial thresholds then adapted (decreased) to the noise in the data.

Sample-to-sample classification accuracy

To evaluate the performance of our algorithm, we compare manual coding with the output of the algorithm using Cohen's kappa (K), which measures inter-rater agreement for categorical data (Cohen, 1960). Cohen's kappa is a number between -1 and 1, where 1 means perfect agreement and 0 means no agreement between the raters other than what would be expected by chance. Scores above 0.8 are considered almost perfect agreement. Using K as our evaluation metric allows us to directly compare the performance of our algorithm to that reported in the literature, because Cohen's kappa has previously been used in the eye-tracking field to assess the performance of newly developed event detection algorithms (Larsson et al., 2013, 2015) and to compare algorithms to manual coding (Andersson et al., 2016).

While there are a number of other metrics to assess sample-to-sample classification accuracy, these methods would be poor choices in our case because of our unbalanced data set, with nearly 89% of the samples tagged as fixations, while only 6.8 and 4.3%, respectively, are saccade and PSO samples. Larsson et al. (2013, 2015), for instance, also report sensitivity (recall) and specificity, and in the machine-learning literature the F1 score is common. For our dataset, where almost 90% of the samples belong to a fixation, a majority-class classifier that indicates that all samples are a fixation would result in a high score on these measures. The advantage of using Cohen's kappa in our case is that such a majority-class model would result in a score of 0, correctly indicating that the classifier fails to provide a meaningful classification.
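For reference, the sample-to-sample agreement can be computed directly with Scikit-learn; the toy labels below are only illustrative.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# toy per-sample labels (0 = fixation, 1 = saccade, 2 = PSO); in practice these
# are the manual coding and the algorithm output for the same samples
y_manual    = np.array([0, 0, 0, 1, 1, 2, 0, 0])
y_algorithm = np.array([0, 0, 1, 1, 1, 2, 0, 0])
print(cohen_kappa_score(y_manual, y_algorithm))
```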
Evaluation of event measures

To test whether our algorithm produces event measures that are similar to those provided by the manual coder, we examine the durations and numbers of fixations, saccades, and PSOs produced by IRF, as well as main-sequence parameters. The main sequence and the amplitude-duration relationship are fundamental properties of saccadic eye movements that should be maintained by any classification algorithm. To evaluate how well our algorithm reproduces the main sequence compared to manually coded data, we first calculated the saccade amplitude vs. peak velocity and amplitude vs. duration relationships on the high-quality, manually coded data. We used $V_{peak} = V_{max}\,(1 - e^{-A/C})$ to fit the amplitude-peak velocity relationship, where $V_{peak}$ and $A$ are saccade peak velocity and amplitude, while $V_{max}$ together with $C$ are parameters to be determined by the model fit (Leigh & Zee, 2006, p. 111). For the amplitude vs. duration relationship, we used a linear function (Carpenter, 1988).

Next, we parsed our augmented data (which were downsampled and had added noise) using our and the NH algorithms. We then calculated saccade amplitudes from the detected events and predicted saccade peak velocities and saccade durations using the previously obtained parameters for the main-sequence relationship in the baseline data. We assume that output data from a better-performing classification algorithm (compared to a baseline) will lead to estimated saccadic peak velocities and durations that more closely match those observed in the baseline data set. This allows us to evaluate how well saccade parameters are preserved when the data are degraded or different algorithms are used to detect events. We use the coefficient of determination (R²) as a metric for the goodness of fit. Note that R² can be negative in this context, because the predictions that are compared to the corresponding measured data have not been derived from a model-fitting procedure using those data, i.e., the fits from ground-truth data can actually be worse than just fitting a horizontal line to the data obtained from the algorithm. We trim negative values to 0.
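A sketch of this fitting and evaluation procedure with SciPy is shown below; the starting values for the fit are arbitrary assumptions, and the linear amplitude-duration fit is omitted.

```python
import numpy as np
from scipy.optimize import curve_fit

def peak_velocity_model(amplitude, v_max, c):
    """Main-sequence model V_peak = V_max * (1 - exp(-A / C))."""
    return v_max * (1.0 - np.exp(-amplitude / c))

def fit_main_sequence(amplitudes, peak_velocities):
    """Fit V_max and C on the baseline (manually coded) saccades."""
    (v_max, c), _ = curve_fit(peak_velocity_model, amplitudes, peak_velocities,
                              p0=(500.0, 5.0))
    return v_max, c

def r_squared(observed, predicted):
    """Coefficient of determination; negative values are trimmed to 0."""
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - np.mean(observed)) ** 2)
    return max(1.0 - ss_res / ss_tot, 0.0)
```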
Fig. 7 Performance of classifiers for different sampling rates. Performance is measured using Cohen's kappa (K), and the presented figures are averages of K over all noise levels. Note that this is only the performance of the classifier, without the post-processing step
Extensive testing of this hypothesis is beyond the scope of this paper, but we made a small test by training a specialist classifier using only high-quality data at 500–1000 Hz with an average noise level of up to 0.04° RMS. The four most important features in such a specialist classifier were velocity, std, acceleration, and bcea (see Appendix B).

Limiting the number of trees

All classifiers above were trained using 200 trees, which is clearly too many according to, e.g., Oshiro et al. (2012), and results in classifiers with large computational and memory requirements. To reduce the number of trees in each of the trained models, we trained classifiers with 1, 4, 8, up to 196 trees (with a step size of 4), and used these reduced classifiers to perform event detection. We then computed the performance of each of these trimmed random forest classifiers using Cohen's kappa.

As an example, Fig. 8 shows K as a function of the number of trees for 500-Hz data, for the full classifier along with a subset of the reduced classifiers using a limited number of features. It is very clear from this plot that, at least in 500-Hz data, there is no decrease in classification performance until fewer than 8–16 trees are used. In the next section, we perform a statistical analysis to find out to what extent the forest can be trimmed before performance starts to decrease.

The final model

Linear mixed-effect modeling confirms that there is no significant decrease in performance compared to the full classifier when using 16 or more trees (see Table 2).
Table 2 Linear mixed-effect model fit for raw performance, measured as Cohen’s kappa (K)
Fixed effects:
Estimate Std. error df t value Pr(>|t|)
(Intercept) 7.76e–01 4.58e–02 2.40e+01 16.928 5.11e–15 ***
ntrees1 −3.25e–02 1.12e–03 7.84e+04 −29.022 <2e–16 ***
ntrees4 −9.40e–03 1.12e–03 7.84e+04 −8.383 <2e–16 ***
ntrees8 −4.07e–03 1.12e–03 7.84e+04 −3.626 0.000288 ***
ntrees12 −2.76e–03 1.12e–03 7.84e+04 −2.462 0.013806 *
ntrees16 −1.41e–03 1.12e–03 7.84e+04 −1.26 0.207644
ntrees20 −1.77e–03 1.12e–03 7.84e+04 −1.58 0.114205
ntrees24 −1.53e–03 1.12e–03 7.84e+04 −1.367 0.171585
...
nfeat4 −1.33e–01 8.76e–04 7.84e+04 −151.369 <2e–16 ***
nfeat5 −9.49e–02 8.76e–04 7.84e+04 −108.276 <2e–16 ***
nfeat6 −1.43e–02 8.76e–04 7.84e+04 −16.333 <2e–16 ***
nfeat7 −8.98e–03 8.76e–04 7.84e+04 −10.244 <2e–16 ***
nfeat8 −1.16e–03 8.76e–04 7.84e+04 −1.323 0.185882
nfeat9 1.78e–03 8.76e–04 7.84e+04 2.032 0.042207 *
nfeat10 2.89e–03 8.76e–04 7.84e+04 3.297 0.000979 ***
nfeat11 −7.95e–06 8.76e–04 7.84e+04 −0.009 0.992766
nfeat12 −2.73e–04 8.76e–04 7.84e+04 −0.312 0.755049
nfeat13 −2.07e–04 8.76e–04 7.84e+04 −0.236 0.813646
Random effects:
Groups Name Variance Std.Dev.
noise (Intercept) 0.0062052 0.07877
fs (Intercept) 0.0126143 0.11231
sub (Intercept) 0.0005378 0.02319
Residual 0.0027374 0.05232
The intercept represents the predicted K of a classifier with 200 trees using all 14 features. Subject (sub), sampling rate (fs), and noise level (noise) are modeled as random factors. t tests use Satterthwaite approximations to the degrees of freedom
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Number of obs: 78408, groups: 11 noise levels (noise); 9 sampling frequencies (fs); 4 subjects (sub)
Fig. 10 Performance of our algorithm on the testing dataset. Left side (a) shows performance for all events, with blue lines showing performance at different noise levels, while the red line depicts performance averaged across all noise levels. On the right side (b), we separate the data for the three events our algorithm reports
Our algorithm is best at correctly labeling saccades, while the performance of detecting PSOs is considerably poorer. This replicates the previous findings of Andersson et al. (2016), who found that expert coders are also best at finding saccades, and agree less with each other when it comes to indicating PSOs. It may very well be that the performance of our classifier is worse at detecting PSOs because of potentially inconsistently tagged PSO samples in the training data set. The classifier thus either learns less well as it tries to work from imprecise input, or sometimes correctly reports PSOs that the expert coder may have missed or tagged incorrectly. We can see from Fig. 9 that it is really hard to tell the exact offset of a PSO.

Table 3 shows that IRF outperforms two state-of-the-art event detection algorithms that were specifically designed to detect fixations, saccades, and PSOs, and approaches the performance of human expert coders. We compare the performance on 500-Hz clean data (average noise level 0.042° RMS), because this is the kind of data these two algorithms were tested on by Larsson et al. (2013) and Andersson et al. (2016). Table 3 also includes the performance of a specialist version of the IRF classifier, which was trained using only high-quality data—500–1000 Hz, with an average noise level of up to 0.04° RMS. The largest performance gain is obtained in PSO classification—the specialist classifier is around 7% better at detecting these events than the universal classifier. The overall performance of this specialist classifier is around 2% better than that of the universal classifier, and marginally outperforms the expert coders reported by Larsson et al. (2015).

Table 3 Comparison of Cohen's kappa in clean data, sampled at 500 Hz

All events Fixations Saccades PSO
IRF 0.829 0.854 0.909 0.697
IRF (specialist) 0.846 0.874 0.905 0.746
Expert (Andersson et al., 2016) 0.92 0.95 0.88
Expert (Larsson et al., 2015) 0.834
LNS (Andersson et al., 2016) 0.81 0.64
LNS (Larsson et al., 2013) 0.745
NH 0.604 0.791 0.576 0.134
NH (Andersson et al., 2016) 0.52 0.67 0.24
NH (Larsson et al., 2013) 0.484

IRF - our algorithm, LNS - algorithm by Larsson et al. (2013, 2015), NH - algorithm by Nyström and Holmqvist (2010)

Event measures

Raw sample-to-sample performance, i.e., that of the classifier itself before applying the heuristics, is around 5.5% better than that after heuristic post-processing, meaning that there is still room for improvement in the design of our heuristics. For instance, our post-processing step removes all saccades with amplitudes of up to 0.5°, because of our choice to merge nearby fixations (see "Post-processing: Labeling final events"). This is reflected in the number of fixations and the average fixation duration as reported by the IRF algorithm (see the offsets between the manual coding results and those of our algorithm in clean data in Figs. 11 and 12). These figures show that, compared to ground truth (manual coding), our algorithm misses around 10% of the small saccades and the same percentage of fixations. This causes an overestimation of the average fixation duration of approximately 60 ms, as two fixations get merged into one. When the noise increases over 0.26° RMS, more and more smaller saccades are missed. The number of detected fixations therefore starts decreasing, while the mean fixation duration increases. This behavior is consistently seen in the output of our algorithm down to a sampling rate of 120 Hz. Similar analyses for …
Fig. 11 Number of detected fixations in the testing dataset. Green - ground truth (hand-coded), blue - our algorithm. Different intensities of blue show results for different sampling rates

Fig. 13 Performance of our (blue) and NH (red) algorithms in terms of reproducing the ground-truth main sequence
… data. One can imagine specialist classifiers, trained to work only on Eyelink, SMI, or Tobii data. The only thing that is needed is a representative dataset and the desired output—whether it be manual coding or events derived by any other method. Our results show that such specialized machine-learning-based algorithms have the potential to work better than hand-crafted algorithms. Feeding the machine-learning algorithm with event data from another event detector raises other possibilities. If the original event detector is computationally inefficient but results in good performance, its output can be used to train a machine-learning algorithm that has the same performance but is computationally efficient.

But the real promise of machine learning is that in the near future we may have a single algorithm that can detect not only the basic five eye-movement events (fixations, saccades, PSOs, smooth pursuit, and blinks), but distinguish between all 15–25 events that exist in the psychological and neurological eye-movement literature (such as different types of nystagmus, square-wave jerks, opsoclonus, ocular flutter, and various forms of noise). All that is needed to reach this goal is a lot of data, and the time and expertise to produce the hand-coded input needed for training the classifier. The performance of this future algorithm is not unlikely to be comparable to, or even better than, any hand-crafted algorithm specifically designed for a subset of events.

The shift toward automatically assembled eye-movement event classifiers exemplified by this paper mirrors what has happened in the computer vision community. At first, analysis of image content was done using simple hand-crafted approaches, like edge detection or template matching. Later, machine-learning approaches such as the SVM (support vector machine), using hand-crafted features such as LBP (local binary patterns) and SIFT (scale-invariant feature transform) (Russakovsky et al., 2015), became popular. Starting in 2012, content-based image analysis quickly became dominated by deep learning, i.e., an approach where a computer learns the features itself using convolutional neural networks (Krizhevsky et al., 2012).

Following the developments that occurred in the computer vision community and elsewhere, we envision that using deep learning methods will be the next step in eye-movement event detection algorithm design. Such methods—which are now standard in content-based image analysis, natural language processing, and other fields—allow us to feed the machine-learning system only with data. It develops the features itself, and finds appropriate weights and thresholds for sample classification, even taking into account the context of the sample. Since such a deep-learning-based approach works well in image content analysis, where the classifier needs to distinguish between thousands of objects, or in natural language processing, where sequence modeling is the key, we expect that classifying the maximum 25 types of eye-movement events will be possible too using this approach. The very recent and, to the best of our knowledge, very first attempt to use deep learning for eye-movement detection is the algorithm by Hoppe (2016). This algorithm is still not entirely end-to-end, as it uses hand-crafted features—the input data need to be transformed into the frequency domain first—but, as the authors write, "it would be conceptually appealing to eliminate this step as well". Hoppe (2016) shows that a simple one-layer convolutional neural network (CNN), followed by max pooling and a fully connected layer, outperforms algorithms based on simple dispersion and velocity thresholding and PCA-based dispersion thresholding.

If deep learning with end-to-end training works for event detection, there will be less of a future for feature developers. The major bottleneck will instead be the amount of available hand-coded event data. This is time-consuming to produce, and because it requires domain experts, it is also expensive. From the perspective of the machine-learning algorithm, the hand-coded events are the goal, the objective ground truth, perchance, that the algorithm should strive towards. However, we know from Andersson et al. (2016) that hand-coded data represent neither the gold standard nor the objective truth on what fixations and saccades are. They show that agreement between coders is nowhere close to perfect, most likely because expert coders often have different conceptions of what a fixation, saccade, or another event is in the data. If a machine-learning algorithm uses a training set from one expert coder, it will be a different algorithm than if a training set from another human coder had been used. The event detector according to Kenneth Holmqvist's coding will be another event detector compared to the event detector according to Raimondas Zemblys' coding. However, it is very likely that the differences between these two algorithms will be much smaller than the differences between previous hand-crafted algorithms, given that Andersson et al. (2016) showed that human expert coders agree with each other to a larger extent than existing state-of-the-art event detection algorithms do.

We should also note that the machine-learning approach poses a potential issue for reproducibility. You would need the exact same training data to be able to reasonably reproduce a paper's event detection. Either that, or authors need to make their trained event detectors public, e.g., as supplemental information attached to a paper, or on another suitable and reasonably permanent storage space. To ensure at least some level of reproducibility of the work, future developers of machine-learning-based event detection algorithms should report as many details as possible: the algorithms and packages used, hyperparameters, the source of training data, etc.
and strive to further extend the classifier to achieve good were not found to impact subsequent readings. Each
classification in noisy data and data recorded at a low sam- recording session contained a different part of the poem
pling rates. Future work will furthermore focus on detecting text.
other eye-movement events (such as smooth pursuit and
nystagmus) by including other training sets and using other Datasets
machine-learning approaches.
Acknowledgments We thank LUNARC, the center for scientific and technical computing at Lund University, for providing us with computational resources and support, which enabled us to efficiently train large classifiers (project No. SNIC 2016/4-37). This work is supported in part by NSF CAREER Grant CNS-1250718 and NIST Grant 60NANB15D325 to author OK, and in part by MAW Grant "EyeLearn: Using visualizations of eye movements to enhance metacognition, motivation, and learning" to author KH. We express our gratitude to Dillon Lohr at Texas State University, who performed computations related to the CEM-E framework.

Appendix A

Description of stimuli and datasets used when assessing the biometrics performance of our IRF algorithm.

Presented stimulus

– The horizontal stimulus (HOR) was a simple step-stimulus with a small white dot making 30° jumps back and forth horizontally 50 times across a plain black background. In total, 100 dots were presented to each subject. For each subject and for each recording session of the same subject, the sequence of dots was the same. The subjects were instructed to follow the dot with their eyes. The goal of this stimulus was to elicit a large number of purely horizontal, large-amplitude saccades.
– The random stimulus (RAN) was a random step-stimulus with a small white dot jumping across a plain black background of the screen in a uniformly distributed random pattern. One hundred dot movements were presented to each subject. The subjects were instructed to follow the dot with their eyes. For each subject and for each recording session of the same subject, the sequence of presented dots was completely random. The goal of this stimulus was to elicit a large number of oblique saccades with various points of origin, directions, curvatures, and amplitudes.
– The textual stimulus (TEX) consisted of various excerpts from Lewis Carroll's "The Hunting of the Snark." The selection of this specific poem aimed to encourage the subjects to progress slowly and carefully through the text. An amount of text was selected from the poem that would take on average approximately 1 min to read. Line lengths and the difficulty of the material were consistent, and content-related learning effects were not found to impact subsequent readings. Each recording session contained a different part of the poem text.

Datasets

– The medium-quality (MQ) dataset consisted of records from 99 subjects, among whom were 70 males and 29 females. The ages of the subjects ranged from 18 to 47; the average age was 22 (SD = 4.8). All 99 subjects participated in two recording sessions, which were timed such that there were approximately 20 min between the first and the second presentation of each stimulus. The MQ dataset was recorded using a PlayStation Eye Camera driven by a modified version of the open-source ITU Gaze Tracker software (San Agustin et al., 2009). The sampling rate of the recording was 75 Hz, and the average spatial accuracy was 0.9° (SD = 0.6°) as reported by the validation procedure performed after the calibration. Because none of the data samples are marked as invalid by the ITU Gaze Tracker software, we are unable to report the amount of data loss for this dataset. Stimuli were presented on a flat screen monitor positioned at a distance of 685 mm from each subject. The dimensions of the monitor were 375 × 302 mm. The resolution of the screen was 1280 × 1024 pixels. The records from the dataset can be downloaded from Komogortsev (2016).
– The high-quality (HQ) dataset consisted of records from 32 subjects, among whom were 26 males and six females. The ages of the subjects ranged from 18 to 40; the average age was 23 (SD = 5.4). Twenty-nine of the subjects performed four recording sessions each, and three of the subjects performed two recording sessions each. The first and second recording sessions were timed such that there were approximately 20 min between the first and the second presentation of each stimulus. For each subject, the 3rd and 4th sessions were recorded approximately 2 weeks after the first two sessions. Similar to the first two sessions, the time interval between the 3rd and 4th sessions was timed such that there were approximately 20 min between the first and the second presentation of each stimulus. The data were recorded with an EyeLink 1000 eye-tracking system at 1000 Hz, and the spatial accuracy, as reported by the validation procedure performed after the calibration, was 0.7° (SD = 0.5°). The average amount of data loss was 5 % (SD = 5 %). Stimuli were presented on a flat screen monitor positioned at a distance of 685 mm from each subject. The dimensions of the monitor were 640 × 400 mm. The resolution of the screen was 2560 × 1600 pixels. The records from the dataset can be downloaded from Komogortsev (2011).
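Because spatial accuracy is reported in degrees of visual angle while the stimuli and recordings are defined in screen pixels, it may help to relate the two quantities. The sketch below converts between pixels and degrees for the viewing geometries reported above; the function name and the small-angle approximation at the screen center are illustrative assumptions, not code from the datasets.

import math

def pixels_per_degree(distance_mm, screen_width_mm, screen_width_px):
    """Approximate horizontal pixels per degree of visual angle at the screen center."""
    mm_per_degree = 2 * distance_mm * math.tan(math.radians(0.5))
    return mm_per_degree * screen_width_px / screen_width_mm

# HQ dataset: 685 mm viewing distance, 640 mm wide monitor, 2560 px horizontal resolution
ppd_hq = pixels_per_degree(685, 640, 2560)   # roughly 48 px/deg
# MQ dataset: 685 mm viewing distance, 375 mm wide monitor, 1280 px horizontal resolution
ppd_mq = pixels_per_degree(685, 375, 1280)   # roughly 41 px/deg
print(ppd_hq, ppd_mq)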
Appendix D

Fig. 15 Number of detected saccades in the testing dataset
Fig. 17 Mean saccade duration in the testing dataset
Nyström, M., & Holmqvist, K. (2010). An adaptive algorithm for fixation, saccade, and glissade detection in eyetracking data. Behavior Research Methods, 42(1), 188–204.
Olsson, P. (2007). Real-time and offline filters for eye tracking. Master's thesis, Royal Institute of Technology, Stockholm, Sweden.
Oshiro, T. M., Perez, P. S., & Baranauskas, J. A. (2012). How many trees in a random forest? (pp. 154–168). Berlin: Springer.
Otero-Millan, J., Castro, J. L. A., Macknik, S. L., & Martinez-Conde, S. (2014). Unsupervised clustering method to detect microsaccades. Journal of Vision, 14(2), 18.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Peirce, J. W. (2007). PsychoPy: Psychophysics software in Python. Journal of Neuroscience Methods, 162(1-2), 8–13.
Raileanu, L. E., & Stoffel, K. (2004). Theoretical comparison between the Gini index and information gain criteria. Annals of Mathematics and Artificial Intelligence, 41(1), 77–93.
Rigas, I., Komogortsev, O., & Shadmehr, R. (2016). Biometric recognition via eye movements: Saccadic vigor and acceleration cues. ACM Transactions on Applied Perception, 13(2), 6:1–6:21.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252.
Salvucci, D. D., & Goldberg, J. H. (2000). Identifying fixations and saccades in eye-tracking protocols. In Proceedings of the 2000 symposium on eye tracking research & applications, ETRA '00 (pp. 71–78).
San Agustin, J., Skovsgaard, H., Hansen, J. P., & Hansen, D. W. (2009). Low-cost gaze interaction: Ready to deliver the promises. In CHI '09 extended abstracts on human factors in computing systems, CHI EA '09 (pp. 4453–4458). New York, NY, USA: ACM.
Thomas, D. B., Luk, W., Leong, P. H., & Villasenor, J. D. (2007). Gaussian random number generators. ACM Computing Surveys, 39(4).
Wang, D., Mulvey, F. B., Pelz, J. B., & Holmqvist, K. (2016a). A study of artificial eyes for the measurement of precision in eye-trackers. Behavior Research Methods, 1–13. doi:10.3758/s13428-016-0755-8
Wang, D., Pelz, J. B., & Mulvey, F. (2016b). Characterization and reconstruction of VOG noise with power spectral density analysis. In Proceedings of the ninth biennial ACM symposium on eye tracking research & applications, ETRA '16 (pp. 217–220). New York, NY, USA: ACM.
Waskom, M., Botvinnik, O., Drewokane, Hobson, P., Halchenko, Y., Lukauskas, S., Warmenhoven, J., Cole, J. B., Hoyer, S., Vanderplas, J., Gkunter, Villalba, S., Quintero, E., Martin, M., Miles, A., Meyer, K., Augspurger, T., Yarkoni, T., Bachant, P., Evans, C., Fitzgerald, C., Nagy, T., Ziegler, E., Megies, T., Wehner, D., St-Jean, S., Coelho, L. P., Hitz, G., Lee, A., & Rocher, L. (2016). seaborn: v0.7.0 (January 2016).
Zemblys, R. (2016). Eye-movement event detection meets machine learning. In Biomedical Engineering (pp. 98–101).
Zemblys, R., Holmqvist, K., Wang, D., Mulvey, F. B., Pelz, J. B., & Simpson, S. (2015). Modeling of settings for event detection algorithms based on noise level in eye tracking data. In Ansorge, U., Ditye, T., Florack, A., & Leder, H. (Eds.) Abstracts of the 18th European Conference on Eye Movements 2015, volume 8 of Journal of Eye Movement Research.