Eye Tracking for the iPhone using Deep Learning
by Harini D. Kannan
S.B., Massachusetts Institute of Technology (2016)
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
Massachusetts Institute of Technology
February 2017
© Massachusetts Institute of Technology 2017. All rights reserved.
The author hereby grants to MIT permission to reproduce and to distribute publicly paper and
electronic copies of this thesis document in whole and in part in any medium now known or
hereafter created.
Author: __________________________________________________________________
Department of Electrical Engineering and Computer Science
February 3, 2017
Certified by: _________________________________________________________________
Professor Antonio Torralba, Thesis Supervisor
February 3, 2017
Accepted by: _________________________________________________________________
Christopher J. Terman, Chairman, Master of Engineering Thesis Committee
Eye tracking for the iPhone using Deep Learning
by
Harini Kannan
Abstract
Accurate eye trackers on the market today require specialized hardware and are very
costly. If eye-tracking could be available for free to anyone with a camera phone, the
potential impact could be great. For example, free eye tracking assistive technology
could help people with paralysis to regain control of their day-to-day activities, such
as sending email. The first part of this thesis describes the software implementation
and the current performance metrics of the original iTracker neural network, which
was published in the CVPR 2016 paper "Eye Tracking for Everyone." This original
iTracker network had a 1.86 centimeter error for eye tracking on the iPhone. The
second part of this thesis describes the efforts towards creating an improved neural
network with a smaller centimeter error. A new error of 1.66 centimeters (an 11%
improvement over the previous benchmark) was achieved using ensemble learning with
the ResNet10 model with batch normalization.
Acknowledgments
Firstly, I would like to thank my mentors, Aditya Khosla and Bolei Zhou. Aditya
was a very helpful mentor during the implementation phase of this project, when I
worked on implementing the iTracker neural network on the iPhone GPU. When I
worked on the modeling side of the project in order to improve the centimeter error,
Bolei provided lots of insightful guidance, introducing me to new concepts and ideas.
Secondly, I would like to thank my thesis supervisor, Professor Torralba, for many
helpful discussions, especially when I was stuck.
Finally, I would like to thank my parents and my brother for their unwavering
support throughout my MIT career.
Contents
1 Introduction
1.1 Motivations for a software-only eye tracker
1.2 Previous work
1.2.1 GazeCapture dataset
1.2.2 iTracker model
1.3 Goals and direction of M. Eng research project
3 Dataset analysis
3.1 Analysis of bias
3.2 Reasons behind bias
3.3 Effects of training set bias
4.2.1 Motivations
4.2.2 Discussion of incremental improvements
4.2.3 Cropping on the fly
4.2.4 Changing the loss function
4.2.5 Changing the model: Original architecture to AlexNet to ResNet10 with batch normalization
4.2.6 Implementing rectangle crops in Caffe to preserve aspect ratio
4.2.7 Increasing image resolution
4.2.8 Ensemble learning approach
6 Conclusion
6.1 Further work
6.2 Conclusion
Chapter 1
Introduction
1.1 Motivations for a software-only eye tracker
Current state-of-the-art eye trackers include the Tobii X2-60, the Eye Tribe, and Tobii
EyeX, all of which are costly and require specialized hardware (such as infrared
sensors). Moreover, many eye trackers on the market today work accurately in controlled
conditions for experiments, but not in real-world conditions.
An accurate, free eye-tracker has many applications - for example, it could impact
education technology by enabling online educators to assess attention patterns while
students watch online lecture videos - thereby helping educators to tweak and improve
their content. It could impact medical and behavioral studies as many researchers
use eye-tracking to study how people respond to various stimuli. Furthermore, eye-
tracking has the power to transform many people’s lives, especially through assistive
technology. For people who cannot use their hands anymore, eye tracking is an
alternative way to control electronic devices and resume daily activities, such as sending
email. For example, there exist accurate eye trackers to help those with ALS, MND,
and other diseases that cause paralysis, but these eye trackers can cost thousands of
dollars. This can be an impossible amount of money to set aside for those who are
already faced with astronomical health care costs.
Inspired by these compelling applications, the aim of this thesis project was to
create an accurate eye tracker that only requires software, thereby making it completely
free to anyone who already has a normal camera phone. State-of-the-art computer
vision methods in deep learning make it possible for computers to identify minute
details from images. Because of this, deep learning seemed like a natural tool to
predict where a user is looking on a screen, by extracting minute details from a camera
stream of a person.
1.2 Previous work
1.2.1 GazeCapture dataset
Prior to this M. Eng thesis, Krafka et al. made progress towards a software-only
eye tracker [4]. They developed a large-scale eye tracking dataset called GazeCapture,
which consisted of images from 1500 unique test subjects captured by an iPhone or
an iPad, along with the cropped face and eye images belonging to each original full-
frame image. The images were labeled with an (x, y) coordinate corresponding to the
point on the screen that the user looked at when the photo was taken. These images
were captured using the iOS GazeCapture application, which flashed a dot that the
user was instructed to look at. To ensure that the user actually was looking at the
dot, a letter was displayed in the dot ("L" or "R"). The user was then asked to tap
the left or right side of the screen depending on the corresponding letter he or she
saw. Many prior datasets had been collected by inviting participants to a physical
location such as a research lab, but this method was hard to scale up (and also often
led to little variation in the dataset). Instead, online crowd-sourcing was used to
overcome these difficulties, and most of the participants in the dataset were recruited
through Amazon Mechanical Turk, a popular crowd-sourcing platform. Others were
recruited through studies at the University of Georgia.
Figure 1-1: Structure of iTracker neural network
1.2.2 iTracker model
A deep learning model called iTracker [4] was trained with Caffe [3] using the
GazeCapture dataset. Figure 1-1 indicates the structure of the full iTracker neural network,
a large neural network constructed with mini neural networks. There are four total
inputs: a right eye image, a left eye image, a face image, and a facegrid. The facegrid
is a 25x25 grid of 1’s and 0’s that indicates where the face is located relative to the
rest of the camera view. The 25x25 grid represents the whole camera view, and the
region shaded in by 1's represents where the face is located. Each of these four inputs
is fed into its own mini neural network. The outputs from the left eye neural
network and the right eye neural network are concatenated and fed into a fifth neural
network. Finally, the output from this fifth neural network is concatenated with the
output from the face neural network and the facegrid neural network to form the
input for the sixth neural network. This sixth and last neural network outputs an (x,
y) coordinate pair, which is the predicted location of the user’s gaze.
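To make this dataflow concrete, the following Python sketch traces the six mini
neural networks with numpy. The six sub-network callables are illustrative stand-ins
for the actual convolutional and fully connected stacks, not the published Caffe
layer definitions.

import numpy as np

# Schematic sketch of the iTracker dataflow described above. The sub-network
# callables are stand-ins, not the published Caffe architecture.
def itracker_forward(left_eye, right_eye, face, facegrid,
                     left_net, right_net, face_net, grid_net,
                     eye_merge_net, output_net):
    f_left = left_net(left_eye)        # mini-network 1: left eye image
    f_right = right_net(right_eye)     # mini-network 2: right eye image
    f_face = face_net(face)            # mini-network 3: face image
    f_grid = grid_net(facegrid)        # mini-network 4: 25x25 binary facegrid
    # mini-network 5: concatenated left/right eye features
    f_eyes = eye_merge_net(np.concatenate([f_left, f_right]))
    # mini-network 6: eye + face + facegrid features -> predicted (x, y) gaze
    return output_net(np.concatenate([f_eyes, f_face, f_grid]))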
The iTracker model had an average error of 1.86 centimeters on the test set for
iPhones, and a 1.71 centimeter error from a 25-crop average. Because it is unrealistic
in the real-time case to generate 25 random crops and average the results from all
of them, this M. Eng thesis chose to use the 1.86 single-crop centimeter error as a
benchmark to improve upon.
1.3 Goals and direction of M. Eng research project
The goals of this M. Eng research project were twofold:
1. To implement an iOS application that ran the iTracker model on the iPhone
GPU, as this had not been done before.
2. To improve upon the baseline error of 1.86 centimeters by creating a new
model.
This thesis will describe the work done to complete both goals.
Chapter two describes the software work done to implement the iTracker model
on the iPhone.
Chapter three provides an analysis of the dataset used and its biases.
Chapter four describes the steps taken to bring the error down from 1.86
centimeters to 1.66 centimeters.
Chapter five provides visualizations and analysis of the new 1.66 centimeter result.
Chapter six concludes the thesis and contains some suggestions for further work.
Chapter 2
2.1 Objectives
The objective of this part of the project was to write software to run the iTracker
model on the iPhone GPU for real-time eye tracking. Since most mobile deep learning
occurs on the back end by communicating with an external server that has a GPU, there
are very few deep learning libraries that actually run a model on the iPhone GPU. Part
of the challenge of this project was to find and tweak such a library for the purposes
of implementing a real-time eye tracker that ran on the iPhone GPU.
There are two main advantages to implementing a real-time eye tracking library
with the deep learning done locally on the phone itself, instead of on a backend server:
1. The eye-tracking library would not need an internet connection and could
therefore be used in many more settings.
2. In the age of privacy concerns, users would be much less willing to adopt an
application that constantly streamed their faces back to a server. Having the
deep learning done locally erases this concern and would help users feel more
confident about their privacy.
The ultimate goal was to create an eye-tracking library that could theoretically
be used in any iOS app. For the purposes of this project, the eye-tracking library was
used in a very simple app that simply displayed a dot where the user was looking on
the screen.
Firstly, there were no deep learning libraries that directly and accurately converted
a model trained with Caffe to a model that could run on iOS. There was a very new
library called DeepLearningKit 1 that did offer a direct Caffe to iOS conversion, but it
had many bugs and did not work properly. A different library called DeepBeliefSDK 2
was chosen instead for its correctness, lack of bugs, and optimized performance.
However, unlike DeepLearningKit, it did not offer a direct Caffe to iOS conversion.
To solve this issue, the Caffe model was first converted to the format of an intermediate
library called ccv 3. Then, the intermediate ccv model was used with the DeepBeliefSDK
library, which converted it to a model that could be used by iOS.
The main challenge of this solution was in writing the scripts to perform the
manual conversions between libraries correctly (Caffe to ccv to iOS). For example,
each library had its own conventions for the ordering of the 4D tensor dimensions (e.g.
width x height x depth x channels). Since there was no documentation online about
the ccv library's conventions vs. DeepBeliefSDK's conventions, all 24 permutations of
the ordered dimensions needed to be tested. Other discrepancies between the libraries
that were similarly undocumented were the type of mean file (e.g. dimensions of
224x224x3 as opposed to 3x224x224) and color channels (e.g. a BGR convention as
opposed to RGB). Figure 2-1 illustrates this manual conversion process among the
libraries. First, a Matlab script was used to convert a caffemodel file into a SQLite
file. Then, a script modified from Pete Warden's forked version of the ccv library 4
converted the SQLite file into a .ntwk file, which is the format used by DeepBeliefSDK.
Figure 2-1: Flowchart illustrating Caffe to iOS conversion process
1 https://github.com/DeepLearningKit/DeepLearningKit
2 https://github.com/jetpacapp/DeepBeliefSDK
3 https://github.com/liuliu/ccv
4 https://github.com/petewarden/ccv
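Since the dimension-ordering conventions were undocumented, the practical fix was
a brute-force search over all orderings. Below is a minimal Python sketch of that
idea; run_converted_model and reference_output are hypothetical stand-ins for
"run the converted model" and "the original Caffe output".

import itertools
import numpy as np

def find_dimension_order(weights_4d, run_converted_model, reference_output):
    """Try all 4! = 24 orderings of a 4D weight tensor until the converted
    model reproduces the original Caffe output."""
    for perm in itertools.permutations(range(4)):
        candidate = np.transpose(weights_4d, perm)   # reorder the dimensions
        if np.allclose(run_converted_model(candidate), reference_output,
                       atol=1e-4):
            return perm   # the ordering convention that matches Caffe
    return None           # no ordering matched; something else is wrong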
Many tests were written to ensure that the large iTracker model was being
preserved correctly in the manual conversion process. In the end, the final iOS model
outputted (x, y) coordinates that were slightly different from the original Caffe
outputs, but since the differences were around 0.01 cm, they were deemed negligible.
The iTracker model had four different inputs, while the DeepBeliefSDK
library only supported neural networks with one input. To resolve this issue, the
iTracker model was split up into mini neural network components. Each of these
components was converted one at a time into the corresponding DeepBeliefSDK
network. As Figure 1-1 shows, there were 6 mini neural networks that could be
constructed this way: one each for the left eye, right eye, and face, and one each for
the facegrid, the eye concatenation, and the final concatenation of all inputs. The
first set of three neural networks (left eye, right eye, and face) all had convolutional
layers, while the second set of three neural networks (facegrid, eye concatenation, and
final concatenation) all did not have convolutional layers. Therefore, there were two
types of mini neural networks that needed to be implemented: one with convolutional
layers, and one without convolutional layers.
The ccv library had the following two requirements for networks:
1. Networks needed to have at least one convolutional layer (i.e. no networks with
only fully connected layers)
2. Networks needed to have at least one fully connected layer (i.e. no networks with
only convolutional layers)
Three out of the six mini networks did not follow the first requirement (the networks
for the facegrid, eye concatenation, and final concatenation), and two other networks
did not follow the second requirement (the networks for the left eye and right eye).
Therefore, in the beginning, five out of the six mini neural networks were
unable to be handled by the ccv library.
One possible solution to this issue was to change the ccv library itself, but after
multiple days of trying this, it became clear that changing the entire library would be
too time-consuming. Instead, the following solutions were implemented. To satisfy
the first requirement, fully-connected layers were re-implemented within
DeepBeliefSDK by reading in the weights of the fully connected layers from a text file, and
then performing optimized matrix operations. This solution worked since the
networks in question were made up of only fully-connected layers. However, the tradeoff
in implementing this solution was in a larger memory footprint of the app, since the
weights of the layers needed to be stored in memory.
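Since a fully connected layer is just an affine map, re-implementing it outside ccv
reduces to a matrix multiply plus a bias. Below is a minimal numpy sketch of the
idea; the actual re-implementation lived inside DeepBeliefSDK, and the text-file
layout and the ReLU activation here are assumptions of this sketch.

import numpy as np

def load_fc_weights(path, in_dim, out_dim):
    # Assumed layout: one value per line, the (out_dim x in_dim) weight
    # matrix first, followed by out_dim bias values.
    values = np.loadtxt(path)
    W = values[:out_dim * in_dim].reshape(out_dim, in_dim)
    b = values[out_dim * in_dim:]
    return W, b

def fc_forward(x, W, b):
    # Affine map followed by ReLU (whether the original layers used a
    # ReLU activation is an assumption of this sketch).
    return np.maximum(W @ x + b, 0.0)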
To satisfy the second requirement, the iTracker model was retrained with an extra
fully connected layer after the left eye convolutional network, and a second extra
fully connected layer after the right eye convolutional network. The original concern
around this solution was that adding unnecessary layers could increase the number
of parameters in the model and lead to overfitting (and a larger error on the test
set), but the error stayed roughly the same (within 0.01cm), so this solution was
implemented.
The iTracker model requires a lot of preprocessing, including detecting and extracting
the face and eyes from the original frame image. This dependence on external libraries
slows the app down, which was one of the motivations for the research towards a
newer, single-input, end-to-end model described at the end of this thesis.
After the model conversion from Caffe to iOS, what remained was to connect the rest
of the app 5. Apple's native face and eye detection libraries were used to gather
three of the four inputs: the face image, left eye image, and
right eye image. The facegrid was computed separately by computing the position of
the face relative to the rest of the full frame image.
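As an illustration, the following sketch computes a 25x25 facegrid from a face
bounding box. The coordinate conventions are assumptions of this sketch, not
necessarily those of the GazeCapture implementation.

import numpy as np

def make_facegrid(face_box, frame_w, frame_h, grid=25):
    """face_box = (x, y, w, h) in pixels within the full camera frame."""
    x, y, w, h = face_box
    g = np.zeros((grid, grid), dtype=np.uint8)
    # Map the face rectangle onto the 25x25 grid and mark its cells with 1's.
    col0 = int(x / frame_w * grid)
    row0 = int(y / frame_h * grid)
    col1 = int(np.ceil((x + w) / frame_w * grid))
    row1 = int(np.ceil((y + h) / frame_h * grid))
    g[row0:row1, col0:col1] = 1
    return g.flatten()   # fed to the network as a flat binary vector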
The input images were then converted to a format that could be read by the
neural network, which then ran on the GPU of the Apple device and produced the
final coordinates. To remain device independent, the final coordinates that the neural
network predicted were relative to the camera position on the Apple device. Therefore,
there had to be some post-processing of the coordinates to adapt the output to the
iPhone 6s that was being used for testing. Finally, after the post-processing, the
location of a red dot rendered by a CALayer object was updated with each frame,
creating a smoothly moving dot that followed a user's gaze.
5 https://github.com/harini-kannan/EyeTrackerDemo
The work done for this app was published as part of the CVPR 2016 paper "Eye
Tracking for Everyone"[4]. Figure 2-2 is a screenshot of the iPhone app. The three
image inputs (face, left eye, and right eye) are shown on the screen, along with a red
dot that indicates the location of the current gaze.
Below are the current performance metrics of the app:
A video demo 6 of this app was also created, illustrating the movement of the dot
to seven different points: the upper left, upper middle, and upper right of the screen,
the bottom right, bottom middle, and bottom left of the screen, and the center.
6 http://people.csail.mit.edu/hkannan/eye_tracking_demo.html
Figure 2-2: Screenshot of current iPhone app
Chapter 3
Dataset analysis
Prior to efforts to improve the model, it was important to first analyze the original
GazeCapture dataset to better understand its characteristics, since the plan was to
use it to train a new model. During this analysis, a strong
bias in the dataset was discovered: the same set of twenty-six points appeared over
and over again, as shown in Figure 3-1.
3.1 Analysis of bias
The training data for the model in question (the model trained for iPhones in portrait
mode) contained 10,677,300 data points that should have been spread out over the
6.836 cm by 12.154 cm possible screen size (the screen size of the largest iPhone,
which is the iPhone 6 Plus). These ground truth data points were then plotted on a
3D histogram, where each histogram bucket was 0.12 cm by 0.12 cm, with 56 x 100
= 5600 total buckets.
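The bucketing itself is a standard 2D histogram; below is a sketch of the analysis,
assuming the gaze labels arrive as arrays of (x, y) screen coordinates in centimeters.

import numpy as np

def gaze_histogram(xs_cm, ys_cm, screen_w=6.836, screen_h=12.154):
    # 56 x 100 = 5600 buckets, each roughly 0.12 cm by 0.12 cm.
    hist, _, _ = np.histogram2d(xs_cm, ys_cm, bins=(56, 100),
                                range=[[0, screen_w], [0, screen_h]])
    peaks = int((hist > 100_000).sum())   # buckets forming the sharp peaks
    empty = int((hist == 0).sum())        # buckets with no training data
    return hist, peaks, empty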
As Figure 3-1 shows, there are 26 "peaks" in the data, where each peak is well
over 100,000 points. The rest of the locations on the grid have very few points (on
the order of 100 or so). The total number of points contained in these 26 peaks (or
26 grid cells) is 3,804,375. Given that each grid cell is 0.12cm by 0.12cm, the total
area represented by these 26 grid cells is 0.3744 square centimeters, while the total
area represented by the entire possible screen size is 6.836 cm x 12.154 cm, or
83.085 square centimeters.
This means that 3,804,375 / 10,677,300 = 35.6% of the entire data is concentrated
in 0.3744 / 83.085 = 0.45 % of the total area.
These 26 peaks were found when looking at the histogram buckets with more than
100,000 samples each. When looking at the histogram buckets with more than 10,000
samples each, there were 48 peaks. Similar calculations for these 48 peaks show that
42.6% of the entire dataset is concentrated in 0.83% of the total area.
Furthermore, out of the 5600 histogram buckets, 1116 of them did not have any
training data. This means that 20% of the total area was not covered at all by the
training set.
3.2 Reasons behind bias
Upon further inspection, it was observed that the GazeCapture app had thirteen
calibration points that it sent out to each user. Since there were two supported iPhone
screen sizes (a larger and a smaller one), the thirteen calibration points appeared at
two different sets of locations for each screen size. This resulted in a total of 26 points
that appeared over and over again, which is the result seen in Figure 3-1.
3.3 Effects of training set bias
The initial concern around this training dataset distribution was that the original
iTracker model was trained to recognize those twenty-six calibration points.
Moreover, since the testing set was drawn from the same distribution as the training
set, it had the same bias, so the concern was that the numbers reported in the paper
were also biased. In other words, the testing error that was reported may have been
smaller than the actual "real life" error. However, when plotting the 3D histogram of
the predictions on the test set as shown in Figure 3-2, the distribution was relatively
flat, and no sharp peaks were observed as in Figure 3-1. One likely reason is that the
model learned to interpolate well, even though the training set was not very diverse
and fed it the same set of twenty-six points over and over again.
Figure 3-1: 3D histogram of ground truth points from GazeCapture dataset
Because the effects of the dataset bias were small, the same training set with all
of the repeated points was used for the new model. Creating a new training set by
removing all the repeated points would have almost halved the training set size, so
it was convenient that the effects of the dataset bias were small, allowing us to avoid
the expensive and time-consuming process of re-collecting all the data.
Figure 3-2: 3D histogram of predictions on the GazeCapture test set
Chapter 4
As described in the introduction, the second goal of this M. Eng project was to
improve the error from the baseline model’s error of 1.86 centimeters.
4.1 First attempt: Classification loss
4.1.1 Motivations
The original iTracker model used a regression loss to predict the (x, y) coordinate
of the user’s gaze. The first attempt to improve the centimeter error was to use a
classification loss instead, motivated by the work of Recasens et al. [7]. Their work
implemented a model for gaze following 1, or identifying the objects in a photo that
a person in the photo is looking at. Since the problem is similar to the problem of
eye tracking, the classification loss used in their model seemed like a good approach
to improve the iTracker model error.
1 http://gazefollow.csail.mit.edu/
4.1.2 Implementation
The idea behind the classification loss function was to take the output space and
divide it up into a grid. In the case of eye-tracking, the output space would be the
iPhone screen. A softmax classification loss would be used to predict which square
in the grid the user is looking at. In other words, for each square on each grid, the
classification loss function would calculate the probability that the location of the
user’s gaze is in that particular square. The center of the square with the highest
probability would be used as the model’s prediction.
With just one grid, the number of locations the model could predict would be
equal to the number of square centers. However, the regression model has a
clear advantage over this since the regression model’s output space is continuous. To
mitigate this issue, the original grid was shifted in four different directions (up, down,
left, and right) to produce four different grids (for a total of five grids). Figure 4-1
illustrates this process. For each of these five grids, a softmax classification loss layer
was used to predict which square in the grid the user was looking at. The centers
of these five chosen squares were weighted by their respective probabilities, and then
summed up to give the final (x, y) prediction.
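A sketch of this decoding step is shown below, assuming each softmax head returns
one probability per grid square along with that square's center. The normalization
by the summed weights is an assumption of this sketch; the thesis text describes a
probability-weighted sum.

import numpy as np

def decode_gaze(grid_probs, grid_centers):
    """grid_probs: five arrays of shape (n_squares,), one per shifted grid.
    grid_centers: five arrays of shape (n_squares, 2), centers in cm."""
    weighted = np.zeros(2)
    total = 0.0
    for probs, centers in zip(grid_probs, grid_centers):
        i = int(np.argmax(probs))          # most probable square in this grid
        weighted += probs[i] * centers[i]  # center weighted by its probability
        total += probs[i]
    return weighted / total   # normalized weighted average -> final (x, y)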
The original grid dimensions that were used were 5x3, so with an original screen
size of 12.154 cm x 6.836 cm, each grid square was 2.4308 cm x 2.2787 cm. The four
shifted grids were created by shifting the grid by half of a square's width if shifting
horizontally, or by half of a square's height if shifting vertically. This corresponded to
a horizontal shift of 1.2154 cm, and a vertical shift of 1.13935 cm. However, the final error of this
classification model was 1.93 centimeters, which was worse than the baseline iTracker
model.
Many variations on this classification model were tried, such as 4x6 grids, 7x5
grids, grids shifted by thirds, adding dropout layers of various percentages after a
variety of different layers, removing layers to reduce overfitting, shrinking layers to
reduce overfitting, trying different loss layers (Euclidean, Cross-Entropy, and both),
trying many different hyperparameter values (for learning rate, weight decay, step
size, batch size), and trying different weight initialization methods. However, none of
these models did better than the baseline model's error of 1.86 centimeters.
Figure 4-1: Five grids produced by four different shifts.
One reason that the classification loss did poorly could be because it was unable
to interpolate as well as the regression loss. As discussed in the previous chapter,
the training set was heavily skewed towards a set of twenty-six repeated calibration
points. Such a training set requires a model that can interpolate well given the
limited number of points. Since the regression function has a perfectly continuous
and uniform output space, while the classification function does not, the regression
function is likely better suited to such interpolation. Because of this, the classification
loss was discarded, and the regression loss was kept for future iterations of the model.
4.2 Second attempt: End-to-end model
4.2.1 Motivations
One issue with the baseline model was that it required a lot of preprocessing, since
it needed to extract a left eye, right eye, face, and facegrid from a full-frame image.
This meant that on the software side, a separate library was needed to perform
real-time face detection and eye detection. All of this preprocessing hindered the
performance of the app, making the dot movements slow and discrete-looking.
This motivated the idea of a neural network that could accurately predict the
location of a user’s gaze using just a person’s face, or even just a full frame image.
This would remove the need for manual feature extraction (extracting the eyes and
facegrid, for example). Even if this new end-to-end neural network could just match
the existing accuracy, it would substantially cut down the preprocessing time and make
the real-time eye tracker smoother. Advantages to such an end-to-end model include
the following:
1. Faster real-time performance - no separate face and eye detection libraries
would be needed, removing the preprocessing that slowed the app down.
2. A more elegant model - the full frame already contains all the information the
model would need, such as the picture of the eyes, pose of the head, etc. There
is not really a need to parse out these individual components and do the feature
extraction ourselves if we can build a good enough model.
3. Potential for visualizations - The model could be used to visualize which areas
of the image are being utilized the most by the network. For example, the
model could show that some neurons learn to activate on the eyes, or that
some neurons activate on certain key points on the face to determine head pose.
These results would be interesting on their own to shed light on which features
are important for eye tracking.
4.2.2 Discussion of incremental improvements
The new model described in the subsequent sections decreased the error from 1.86
centimeters in the baseline model to 1.66 centimeters in the new model, which is an
11% improvement from the baseline. The following sections describe the incremental
improvements used to achieve this result: from various improvements on the data
side (cropping on the fly, implementing rectangular crops in Caffe, increasing image
resolution) to various improvements on the model side (using AlexNet, then ResNet10
and batch normalization, and then an iterative ensemble learning model).
4.2.3 Cropping on the fly
The first improvement concerned the implementation of the data pipeline. The
previous baseline model had a cropped face training set that had already been
augmented in five different ways before being fed into the Caffe model: the cropped
face had been shifted up, down, left, and right, and kept in the center. The issue
with this was twofold. Firstly, the lmdbs took up five times more storage than
necessary. Secondly, pre-augmenting the lmdb meant that the model was trained on
the same five fixed shifts. Doing a random crop of images on the fly while training,
on the other hand, yields a new crop (and therefore a new position) of the face each
time the training image passes through the model. As the models were trained for
50 epochs, each model saw roughly 50 unique crops of each training image, which
leads to less overfitting on the "real-life" test set.
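Below is a minimal sketch of on-the-fly random cropping, assuming images arrive as
height x width x channels arrays slightly larger than the crop size.

import numpy as np

def random_crop(img, crop_h, crop_w, rng=np.random):
    """Return a different random crop each time the image is seen, so 50
    epochs yield ~50 distinct crops instead of 5 fixed pre-computed shifts."""
    h, w = img.shape[:2]
    top = rng.randint(0, h - crop_h + 1)     # random vertical offset
    left = rng.randint(0, w - crop_w + 1)    # random horizontal offset
    return img[top:top + crop_h, left:left + crop_w]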
4.2.4 Changing the loss function
The original iTracker architecture on 224x224 full face images resulted in a baseline
error of 3.14 cm. The first improvement that decreased the centimeter error was to
use a custom L1 centimeter loss instead of Caffe's Euclidean loss [3]. Caffe's Euclidean
loss is a measure of the (squared) L2 distance, not an L1 distance 2. This decision
was made because an L1 loss function is more resistant to outliers: the L2 loss
function squares the error, so the model is more likely to prioritize fixing the
outlier cases rather than the more average examples. A custom L1 distance loss
was created in Caffe to implement this change, and the code for this can be found on
the author's Github 3. As a result of this new loss layer, the error improved to 2.99
centimeters (a 0.15 centimeter improvement).
2 http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1EuclideanLossLayer.html
3 http://www.github.com/harini-kannan
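The difference between the two losses is sketched below, with predictions and labels
as (N, 2) arrays of screen coordinates in centimeters. This is a numpy sketch of the
forward passes only; the actual change was implemented as a custom Caffe layer.

import numpy as np

def euclidean_l2_loss(pred, label):
    # Caffe's EuclideanLoss: mean *squared* distance (halved), so outliers
    # are weighted quadratically.
    return np.mean(np.sum((pred - label) ** 2, axis=1)) / 2.0

def l1_distance_loss(pred, label):
    # Distance to the first power: this is the reported centimeter error
    # itself, and it is less sensitive to outliers than the squared version.
    return np.mean(np.sqrt(np.sum((pred - label) ** 2, axis=1)))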
4.2.5 Changing the model: Original architecture to AlexNet to ResNet10 with batch normalization
The second improvement that was made was to use AlexNet [5] on the full face images
instead of the original iTracker architecture. The original iTracker architecture was
loosely modeled after the first four layers of AlexNet. As deeper neural networks
generally learn better features, using the full AlexNet would likely result in
a more accurate model. This was shown to be true, as the result from using AlexNet
was a 2.10 cm error, or a 0.89 centimeter improvement.
The third improvement was to use the ResNet10 model [1] with batch
normalization [2]. ResNet is based on building blocks of conv-relu-conv layers, whose output
is denoted by some F(x). The identity x is then added to give the block's output
H(x) = F(x) + x; learning the residual F(x) is easier for the model to optimize than
learning H(x) directly. The ResNet model,
made from these building blocks, won 1st place in the ILSVRC 2015 competition 4.
AlexNet’s top 5 test error was 15.4%, while ResNet’s top 5 test error rate was 3.6%.
Therefore, it seemed clear that ResNet would learn better features than AlexNet,
which was the main motivation for using the ResNet model. The ten-layer version
of ResNet was used in order to train the models more quickly 5, and the rest of this
thesis will refer to this version of ResNet as ResNet10.
4 http://image-net.org/challenges/LSVRC/2015/
5 https://github.com/cvjena/cnn-models
The ResNet10 model was used with batch normalization [2], a fairly new technique
that helps to improve performance. Batch normalization works by seeking to reduce
internal covariate shift. Covariate shift is when the distribution of inputs into a
machine learning model changes, and internal covariate shift is when the distribution
of neuron activations changes from layer to layer inside a neural network. LeCun et
al. [6] and Wiesler et al. [9] showed that whitened inputs, or inputs with a mean of 0
and a variance of 1, make neural networks converge faster than non-whitened inputs.
Motivated by this, a batch normalization layer whitens the inputs between each layer
in a neural network, which drastically reduces the issue of internal covariate shift.
During the training phase of the ResNet10 neural network, the batch normalization
layer performed its calculations on mini-batches, while during the testing phase, the
batch normalization layer used global statistics aggregated during training.
Using the ResNet10 model with batch normalization brought the error down to
1.94 centimeters, or a 0.16 centimeter improvement.
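As an illustration, the following sketch shows what a batch normalization layer
computes on one mini-batch at training time; gamma and beta are the layer's learned
scale and shift parameters.

import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Whiten each feature over the mini-batch, then
    apply the learned scale and shift. At test time the same formula is
    used with mu and var replaced by running averages from training."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta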
4.2.6 Implementing rectangle crops in Caffe to preserve aspect ratio
The fourth improvement was to change the input image size to keep the original
aspect ratio. All of the original images were 640x480 pixels, but since Caffe only
supported square crops, the images needed to be resized to squares. Images that were
256x256 were used for the model that produced the 1.94 centimeter error. Resizing
a rectangular image to a square causes distortion along one axis, which could make the
model unable to find key details in the picture. As a result, Caffe was modified to
allow rectangular crops. Two new parameters (called "crop_h" and "crop_w") were
implemented within Caffe to allow this. The code for this modified version of Caffe
can be found on the author's Github 6. When images that were 320x240 were used
with a rectangle crop size of 288x224, the result improved to 1.90 centimeters, or a
0.04 centimeter improvement.
4.2.7 Increasing image resolution
The fifth improvement was to use an increased resolution for the input images. As
current state-of-the-art eye trackers like the Tobii Pro show 7, it is important to have
high resolution images so that models can identify minute details in the eyes in order to
make a prediction about the user's gaze. The original images were all 640x480 pixels,
and input image sizes of 256x256 (with a crop size of 224x224), 384x384 (with a crop size
of 352x352), and 480x480 (with a crop size of 448x448) were all tried. Figure 4-2 is a
graph illustrating the relationship between image input size and centimeter error. As
shown, the centimeter error decreases roughly linearly as the input image size increases.
The best result was the 480x480 image with a crop size of 448x448, which resulted
in a 1.767 centimeter error, or a 0.13 centimeter improvement.
4.2.8 Ensemble learning approach
The sixth improvement was to use an ensemble learning approach. One important
observation from Figure 4-2 is that when two types of input images were tried, full frames
and cropped faces, the cropped face ones always performed better. The cropped face
input images were created by first cropping out the face from the full frame, and
then enlarging the cropped face to the desired image size. The best full frame model
(448x448 input images) resulted in a 1.767 centimeter error, while the best cropped
face model (448x448 input images) resulted in a 1.752 centimeter error. When
averaging the predictions from both of these models to create an ensemble, the resulting
error was 1.657 centimeters. This represents around an 11% improvement from the
baseline model's error, which was 1.86 centimeters.
Figure 4-2: Centimeter error as a function of image size. Shown for both full frame
images and cropped face images.
6 http://www.github.com/harini-kannan
7 http://www.tobiipro.com/learn-and-support/learn/eye-tracking-essentials/how-do-tobii-eye-trackers-work/
Creating an ensemble model by averaging the predictions from both of these
models likely led to a better result because the ensemble model was able to use the
strengths from both models. The strength of the full frame model was that it
provided extra location information for where the face was, similar to the function of the
facegrid in the original baseline model. The strength of the cropped face model was
that it allowed the model to focus in on fine-grained details of the eyes. By averaging
these two predictions, the strengths of both these models were combined, making it
less likely that either one of them would make an extreme prediction that would drive
the error higher.
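At inference time the ensemble is simply an average of the two models' predictions;
a minimal sketch:

import numpy as np

def ensemble_predict(full_frame_pred, cropped_face_pred):
    """Each argument is an (x, y) gaze prediction in cm from one model;
    averaging tempers extreme predictions from either model."""
    return (np.asarray(full_frame_pred) + np.asarray(cropped_face_pred)) / 2.0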
One point that needed to be taken into consideration was the memory usage for
33
the high resolution images. While the 448x448 input images produced the best result,
it also led to a large increase in memory. To fit an entire mini-batch of images into a
single GPU that had 12GB of memory, the batch size needed to be reduced from 64
to 32. To account for the halving of the batch size, the learning rate was increased
by a factor of √2. This slowed down training significantly - both the cropped face
and the full frame model took around 3 days to train. Each model was trained for
50 epochs, with a training set size of around 400,000 images, and a test set size of
approximately 50,000 images.
This result suggests that it could be possible to merge the full frame and the
cropped face model into one model whose performance would be around 1.66 centimeters.
Inspired by the work of Rosenfeld and Ullman [8], an iterative model was used,
constructed with two smaller models. During the training phase, the first
model took in three input lmdbs (one lmdb for 640x480 full frames, one for the
ground truth gaze labels, and one for the ground truth face bounding boxes). The
face bounding boxes were represented with two points - one for the top left corner,
and one for the bottom right corner. There were three losses used: a Euclidean loss
to measure against the ground truth gaze label, a Euclidean loss to measure against
the top left corner of the face bounding box, and a Euclidean loss to measure against
the bottom right corner of the face bounding box. A weighted average of these three
losses was used to train the model.
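A sketch of that combined objective is shown below; the relative loss weights are
assumptions of this sketch, as the thesis does not state the values used.

import numpy as np

def euclidean(a, b):
    return np.mean(np.sum((a - b) ** 2, axis=1)) / 2.0

def combined_loss(gaze_pred, gaze_gt, tl_pred, tl_gt, br_pred, br_gt,
                  w_gaze=1.0, w_box=0.5):  # illustrative weights only
    # Gaze loss plus losses on the two face bounding box corners
    # (top-left and bottom-right).
    return (w_gaze * euclidean(gaze_pred, gaze_gt)
            + w_box * euclidean(tl_pred, tl_gt)
            + w_box * euclidean(br_pred, br_gt))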
The second model for the training phase was exactly the same as the 480x480
cropped face model. This model took in two input lmdbs (lmdb for 480x480 previously
cropped faces and an lmdb for the ground truth gaze labels). The output loss was a
Euclidean loss on the ground truth labels.
The key novelty of this model is in the testing phase, whose process Figure 4-3
illustrates. For the testing phase, the first model only needs the full frame image -
with this, it produces both the bounding box for a cropped face and the coordinates
of the user’s gaze. The testing phase of the second model only needs a cropped face,
and with this, it produces the coordinates of the user’s gaze. Then, the two sets of
predicted gaze coordinates (from the first model and the second model) are averaged
together to produce the final result. The novelty of this model lies in the fact that no
preprocessing is needed to achieve a better result than the baseline - the model only
needs the full frame image as input. In contrast, the baseline model needed a cropped
face, left eye, right eye, and a facegrid as input, which required a lot of preprocessing.
This new model eliminates the need for any preprocessing and is designed to make
real-time performance faster when implemented on an app.
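A sketch of this test-time pipeline follows, with model1, model2, and
crop_and_resize as hypothetical wrappers around the two trained networks and the
cropping step.

import numpy as np

def predict_gaze(full_frame, model1, model2, crop_and_resize):
    # Model 1: full frame -> face bounding box and a first gaze estimate.
    face_box, gaze1 = model1(full_frame)
    # Model 2: face cropped using model 1's own box -> a second estimate.
    face = crop_and_resize(full_frame, face_box, size=(480, 480))
    gaze2 = model2(face)
    # Final prediction: the average of the two estimates. No external face
    # detector is needed anywhere in this pipeline.
    return (np.asarray(gaze1) + np.asarray(gaze2)) / 2.0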
Figure 4-3: The testing phase of the iterative two-model pipeline.
Chapter 5
The final result achieved with the new deep learning model described in this thesis was
a 1.66 centimeter error, which improved upon the previous baseline of 1.86 centimeters
by 11%. Many visualizations were created in order to better understand how this new
model made its predictions, and the analysis of these visualizations is described in
Section 5.2.
Many machine learning models compare their results to random performance as one
of the baselines to measure against. To compare this to purely random performance,
two types of random models were used:
1. Model that randomly picks one of the twenty-six calibration points described
in Chapter 3: 4.1 centimeter error
As shown, the new model with a 1.66 centimeter error does much better than
random performance.
5.2 Visualizations
1. Visualization 1: 100 full frame images taken randomly from the test set, along
with their corresponding cropped faces
2. Visualization 2: The 100 cropped face images with the highest error in the test
set, along with the corresponding full frame images
3. Visualization 3: The 100 cropped face images with the lowest error in the test
set, along with the corresponding full frame images
4. Visualization 4: The 100 full frame images with the highest error in the test
set, along with the corresponding cropped faces
5. Visualization 5: The 100 full frame images with the lowest error in the test set,
along with the corresponding cropped faces
5.2.2 Observations
Studying the heat maps from these visualizations led to the following observations:
1. The heat maps for the full frame images look very different from the heat maps
for the cropped face images. The model trained on the cropped face images
is able to localize on the eyes much better than the model trained on the full
frame images. Example 58 in Visualization 2, reproduced in Figure 5-1 and
Figure 5-2, illustrates this. This is consistent with the hypothesis described in
the previous section about the relative strengths of the full frame images and
the cropped face images with respect to the ensemble model: the cropped face
model is able to localize on fine-grained details like the eyes, while the full frame
image provides valuable information about the head pose.
2. With regards to the cropped face model’s ability to localize on the eyes, it
appears that there could be some room for improvement. Example 54 in
Visualization 2, reproduced in Figure 5-3, illustrates this. The model does not focus
its attention on the eyes, and instead focuses its attention on the nose. This
suggests that the model could have some trouble with identifying the eyes,
especially with blurry images. This observation could inform the next generation
of models to improve the centimeter error, perhaps by identifying models that
localize on the eyes better.
3. Both the cropped face model and the full frame model highlight the nose many
times. Example 24 in Visualization 3, reproduced in Figure 5-4, illustrates this.
This implies that the orientation of the nose acts as a key point that signals
which direction the head is pointing.
Figure 5-1: Cropped face, Example 58 in Visualization 2 (see previous footnotes for
link).
Figure 5-2: Full frame, Example 58 in Visualization 2 (see previous footnotes for
link).
Figure 5-3: Example 54 in Visualization 2 (see previous footnotes for link).
Figure 5-4: Example 24 in Visualization 3 (see previous footnotes for link).
Chapter 6
Conclusion
6.1 Further work
While this thesis describes a new model with a 1.66 centimeter error, there is still
room for further work to bring the centimeter error down even lower. Below are three
ways that this further work could develop:
1. Collecting more data from a variety of subjects. As Krafka et al. showed [4],
increasing the number of subjects brought down the centimeter error significantly.
2. Developing an accurate deep learning model that uses calibration. The effects
of calibration were not explored in this thesis since the scope was to create a
calibration-free eye tracker, but calibration could be used in the future to create
a much more accurate eye tracker.
3. Developing a deep learning model that localizes on the eyes more accurately.
As the analysis from the visualizations in Chapter 5 showed, there seems to be
room for better localization for the eyes. Finding ways to better localize on the
eyes could help the model find minute details in the eyes better, and therefore
lead to a better centimeter error.
6.2 Conclusion
This thesis discussed the development of a software-only eye tracker, with an error of
1.66 centimeters that represented an 11% improvement over the baseline error of 1.86
centimeters. The first part of this thesis discussed the novel implementation of the
original iTracker model on the iPhone GPU using DeepBeliefSDK, while the second
part of the thesis discussed the range of incremental improvements made to decrease
the centimeter error. These improvements included ones on the data side (cropping
on the fly, implementing rectangular crops in Caffe, larger image resolution) and also
ones on the model side (using AlexNet, then ResNet10 and batch normalization, and
then an iterative learning model). The final model was an iterative model inspired by
an ensemble learning approach that only needed a full frame image to make a final
prediction, in contrast to the original iTracker model that needed heavy preprocessing
and four different inputs (cropped face, cropped left eye, cropped right eye, and
facegrid). In conclusion, this thesis developed an accurate, calibration-free, software-
only eye tracker that could eventually be used for a variety of applications, such as
free assistive technology that could one day help people with paralysis control their
phones with just their eyes.
Bibliography
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[2] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep
network training by reducing internal covariate shift. In Journal of Machine
Learning Research (JMLR), 2015.
[3] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long,
Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional
architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[4] Kyle Krafka*, Aditya Khosla*, Petr Kellnhofer, Harini Kannan, Suchendra
Bhandarkar, Wojciech Matusik, and Antonio Torralba. Eye tracking for
everyone. In Computer Vision and Pattern Recognition (CVPR), 2016.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification
with deep convolutional neural networks. In Advances in Neural Information
Processing Systems (NIPS), 2012.
[6] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller.
Efficient backprop. In Neural Networks: Tricks of the Trade, 1998.
[7] Adria Recasens*, Aditya Khosla*, Carl Vondrick, and Antonio Torralba. Where
are they looking? In Advances in Neural Information Processing Systems
(NIPS), 2015.
[8] Amir Rosenfeld and Shimon Ullman. Visual concept recognition and localization
via iterative introspection. arXiv preprint arXiv:1603.04186v2, 2016.
[9] Simon Wiesler and Hermann Ney. A convergence analysis of log-linear training.
In Advances in Neural Information Processing Systems (NIPS), 2011.
[10] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba.
Learning deep features for discriminative localization. In Computer Vision and
Pattern Recognition (CVPR), 2016.