
International Journal of Human–Computer Interaction

ISSN: 1044-7318 (Print) 1532-7590 (Online). Journal homepage: http://www.tandfonline.com/loi/hihc20

Exploring Relationships Between Eye Tracking and Traditional Usability Testing Data

Jiahui Wang, Pavlo Antonenko, Mehmet Celepkolu, Yerika Jimenez, Ethan Fieldman & Ashley Fieldman

To cite this article: Jiahui Wang, Pavlo Antonenko, Mehmet Celepkolu, Yerika Jimenez, Ethan Fieldman & Ashley Fieldman (2018): Exploring Relationships Between Eye Tracking and Traditional Usability Testing Data, International Journal of Human–Computer Interaction, DOI: 10.1080/10447318.2018.1464776

To link to this article: https://doi.org/10.1080/10447318.2018.1464776

Published online: 30 Apr 2018.

INTERNATIONAL JOURNAL OF HUMAN–COMPUTER INTERACTION
https://doi.org/10.1080/10447318.2018.1464776

Exploring Relationships Between Eye Tracking and Traditional Usability Testing Data

Jiahui Wang (a), Pavlo Antonenko (a), Mehmet Celepkolu (b), Yerika Jimenez (b), Ethan Fieldman (c), and Ashley Fieldman (c)

(a) School of Teaching and Learning, College of Education, University of Florida, Gainesville, FL 32611, USA; (b) Department of Computer & Information Science & Engineering, College of Engineering, University of Florida, Gainesville, FL 32611, USA; (c) Study Edge Corporation, Gainesville, FL 32603, USA

CONTACT Jiahui Wang [email protected] University of Florida, Norman Hall, 1221 SW 5th Avenue, Gainesville, FL 32611, USA
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/hihc.
© 2018 Taylor & Francis Group, LLC

ABSTRACT
This study explored the relationships between eye tracking and traditional usability testing data in the
context of analyzing the usability of Algebra Nation™, an online system for learning mathematics used
by hundreds of thousands of students. Thirty-five undergraduate students (20 females) completed seven
usability tasks in the Algebra Nation™ online learning environment. The participants were asked to log
in, select an instructor for the instructional video, post a question on the collaborative wall, search for an
explanation of a mathematics concept on the wall, find information relating to Karma Points (an
incentive for engagement and learning), and watch two instructional videos of varied content difficulty.
Participants’ eye movements (fixations and saccades) were simultaneously recorded by an eye tracker.
Usability testing software was used to capture all participants’ interactions with the system, task
completion time, and task difficulty ratings. Upon finishing the usability tasks, participants completed
the System Usability Scale. Important relationships were identified between the eye movement metrics
and traditional usability testing metrics such as task difficulty rating and completion time. Eye tracking
data were investigated quantitatively using aggregated fixation maps, and qualitative examination was
performed on video replay of participants’ fixation behavior. Augmenting the traditional usability testing
methods, eye movement analysis provided additional insights regarding revisions to the interface
elements associated with these usability tasks.

1. Introduction

With the exponential growth of information and communication technologies and their widespread application to promote formal and informal learning, online learning platforms have gained wide acceptance among teachers and students. Thus, understanding the user experience in these massive online learning systems is becoming increasingly important. Comprehensive usability studies are essential in informing the design and refinement of online learning systems and interfaces to improve the user experience. So far, a limited number of usability studies have been carried out to examine these massive online learning systems (Hasan, 2014; Ssemugabi & De Villiers, 2007).

The current study adopted a number of measures to comprehensively evaluate the usability of Algebra Nation™, a massive online community for learning mathematics that is used by over 250,000 middle and high school students in the United States. From the methodological perspective, we focused on exploring the relationships between data generated using traditional usability testing techniques, such as task difficulty ratings, and eye-movement analysis data. Algebra Nation™ was designed to help students advance Algebra knowledge and skills and improve performance on the final Algebra exam to get high school graduation credit. The system offers instructional materials and support for students in areas including pre-Algebra, Algebra, and Geometry. For example, within the Algebra domain, Algebra Nation™ has a content review session where Algebra lessons are divided into 11 sections, and each of the sections contains 8–12 videos. The videos are designed as pencasts, where instructors write out the solution to a problem while explaining each step (Herold, Stahovich, Lin, & Calfee, 2011). Additionally, each video offers a picture-in-picture view of the instructor selected by the student from four different study experts that represent different races and genders. To reinforce the skills discussed in the video tutorials and provide a platform for social learning and peer support, Algebra Nation™ also provides an interactive collaborative wall where students can post questions about the material, get answers from their peers and study experts, and search for an answer in the existing threads. To encourage students to contribute to the interactive collaborative wall, Algebra Nation™ employs a rewards system called "Karma Points."

Usability studies of online learning technologies are still not common, and little is understood about the usability aspects of massive online learning systems like Algebra Nation™. From a methodological standpoint, it is not always clear to usability researchers and practitioners when and why traditional usability testing methods like time on task and task

difficulty rating should be augmented with psychophysiological measures like eye tracking. This article addresses both issues by exploring the usability of Algebra Nation™ with a sample of its target users and by converging rigorous eye-movement analysis techniques and traditional usability testing methods to understand the user experience within a large online learning system. Several usability evaluation methods were adopted, including task completion time, task difficulty rating, the System Usability Scale, and eye movement analysis including both gaze fixations and saccades. This study examined the relationships between these metrics relative to understanding the quality of user experience in an online learning system, which contributes to our understanding of these measures and their applicability in various contexts.

2. Literature review

Multiple usability testing methods are employed by usability evaluators and researchers to gather information on the quality of user experience. Widely used usability testing methods include, for example, measures of effectiveness (e.g., task success), efficiency (e.g., time on task), and satisfaction (International Organization for Standardization, 1998). With the advancement of sensing technologies and their availability to researchers and practitioners, eye tracking has gained popularity among usability scholars and professionals (Nielsen & Pernice, 2010). However, each usability testing method has its advantages and disadvantages, and to test a system and identify its usability problems, it is critical to select the most appropriate usability evaluation methods by considering the nature of the human-system interactions being examined, the complexity of the system and interface, the time and cost involved in the usability testing, as well as the expertise of the usability evaluators.

2.1. Traditional usability testing methods

Usability is defined as the degree to which a product can be used by intended users to achieve specified goals with effectiveness, efficiency, and satisfaction (International Organization for Standardization, 1998). Following this widely adopted definition, usability performance concepts such as satisfaction, efficiency, and effectiveness (SEE) have been employed to design measures that assess whether and how a system is easy to use. Effectiveness has been measured by task success (i.e., the user's ability to complete the usability task successfully). Efficiency is typically assessed by how much time it takes the user to complete a usability task or the number of errors the user makes while completing the task. As task completion time and success do not necessarily capture all the elements associated with effectiveness and efficiency, studies have also elicited task difficulty ratings from users to measure effectiveness and efficiency (Tullis & Albert, 2013).

Satisfaction reflects the user's attitude or perceptions about system functionality and aesthetics, and it has been measured using self-reports. Usability evaluators have employed standardized surveys to examine users' satisfaction with an interface (e.g., Everett, Byrne, & Greene, 2006). A number of standardized and validated usability surveys are available to measure participants' satisfaction, such as the Computer System Usability Questionnaire (CSUQ; Lewis, 1995), the System Usability Scale (SUS; Brooke, 1996), and the Questionnaire for User Interface Satisfaction (QUIS; Chin, Diehl, & Norman, 1988). In a systematic comparison of the CSUQ, SUS, and QUIS measures, SUS provided the most reliable results at sample sizes ranging from 6 to 14 (Tullis & Stetson, 2004). SUS, a highly robust measurement tool that consistently produces reliable results for usability researchers (Bangor, Kortum, & Miller, 2008), has been adopted in usability testing of various products, ranging from everyday products such as microwave ovens (Kortum & Bangor, 2013) to mobile apps such as Gmail™ (Kortum & Sorber, 2015).

The traditional SEE usability metrics have been used in many usability studies (e.g., Rashid, Soo, Sivaji, Naeni, & Bahri, 2013). The three core aspects of usability—effectiveness, efficiency, and satisfaction—are equally important in usability testing, and they have been found to be highly dependent. For example, Kortum and Peres (2014) identified a strong positive correlation between task success rates (i.e., a measure of effectiveness) and SUS scores (i.e., a measure of satisfaction), for both laboratory and field studies at individual and system levels.

2.2. Eye tracking

Besides the traditional usability testing methods that focus on satisfaction, effectiveness, and efficiency, usability evaluators have started to adopt psychophysiological techniques to discover more insights about the user's attentional and cognitive processes during usability testing. Eye tracking in particular is a psychophysiological method that has recently gained much popularity among usability professionals. The main assumption behind the use of eye tracking in human factors and usability research is the eye-mind hypothesis (Just & Carpenter, 1980), which suggests that visual attention is a proxy for mental attention, and so visual attention patterns reflect the cognitive strategies used by individuals. Eye tracking has been employed to study visual attention distribution in a wide variety of visual tasks, from visual search (Pomplun, Reingold, & Shen, 2001) to reading (Schneps et al., 2013), viewing advertisements (Maughan, Gutnikov, & Stevens, 2007), and watching online video (Wang & Antonenko, 2017). Eye tracking has also been applied in multiple usability studies to provide insights regarding the design of websites, digital TV menus, and games (Cowen, Ball, & Delin, 2002; Ehmke & Wilson, 2007; Russell, 2005; Wulff, 2007).

Eye tracking has been a useful technique in user research, particularly in situations that require evaluation of the user's attention distribution relative to various (often competing) interface elements. With the recent advancements of sensor technology, eye tracking has also become more affordable and less intrusive to use (Pernice & Nielsen, 2009). However, to benefit from the information provided by eye tracking, one must understand specific eye movement metrics and what they represent.

Most modern eye trackers can accurately record two types of eye movements: gaze fixations and saccades (Rayner, 1998). A gaze fixation occurs when the eye focuses on a visual target for a short period of time (i.e., around 300 ms). A saccade is a rapid eye movement between two fixations, and saccades range in amplitude from small movements to large ones. Usability evaluators have examined these types of eye movement phenomena as quantified indices, such as the duration of each fixation, the number of fixations, and saccade amplitude. For example, Wu and colleagues (2016) found that eye movement data such as fixation duration combined with fixation count were useful in revealing how users search for target information on a smart watch interface, thus providing important information about interface information structure and the meaning of interface element representations. Also, Çöltekin and colleagues (2009) used eye tracking in a usability study of two online map websites, and the eye movement data, including fixation durations and number of fixations, revealed usability issues of specific features in two differently designed online map interfaces.
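To make these indices concrete: commercial analysis packages compute fixation and saccade events automatically (the present study used Eyelink Data Viewer, described in Section 4), but the underlying logic can be illustrated with a simple dispersion-based parser. The sketch below is a minimal illustration only, not the vendor's algorithm; the sample format, dispersion threshold, and minimum duration are assumptions chosen for readability.

```python
def detect_fixations(samples, max_disp_px=25.0, min_dur_ms=100.0):
    """Toy dispersion-based fixation parser.

    samples: list of (t_ms, x_px, y_px) gaze samples at a fixed rate.
    Returns fixations as (start_ms, end_ms, centroid_x, centroid_y).
    """
    fixations, i, n = [], 0, len(samples)
    while i < n:
        j = i
        # Grow the window while its bounding-box dispersion stays small.
        while j + 1 < n:
            xs = [s[1] for s in samples[i:j + 2]]
            ys = [s[2] for s in samples[i:j + 2]]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_disp_px:
                break
            j += 1
        if samples[j][0] - samples[i][0] >= min_dur_ms:
            win = samples[i:j + 1]
            fixations.append((win[0][0], win[-1][0],
                              sum(s[1] for s in win) / len(win),
                              sum(s[2] for s in win) / len(win)))
            i = j + 1  # resume after the fixation ends
        else:
            i += 1     # slide past saccadic or noisy samples
    return fixations

def summarize(fixations):
    """Indices of the kind discussed above: fixation count, mean fixation
    duration (ms), and saccade sizes as centroid-to-centroid distances (px)."""
    if not fixations:
        return 0, 0.0, []
    durations = [end - start for start, end, _, _ in fixations]
    saccades = [((b[2] - a[2]) ** 2 + (b[3] - a[3]) ** 2) ** 0.5
                for a, b in zip(fixations, fixations[1:])]
    return len(fixations), sum(durations) / len(durations), saccades
```

Saccade sizes in this sketch come out in pixels; converting them to the degrees of visual angle conventionally reported in eye tracking research requires the viewing geometry (see the conversion sketch in Section 3.3 below).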
Eye tracking data can be examined not only quantitatively, but also qualitatively. Video replay of visual scan paths and eye movements can provide important insights regarding the patterns of attending to the various features of the interface, thus allowing usability researchers to identify where people focus their attention and for how long (Bojko, 2006; Goldberg & Kotval, 1999). For example, Wu and colleagues (2016) examined eye movement videos to describe the sequence of visual attention targets and identified potential usability issues within a smart watch interface. However, despite the potential to reveal important usability issues about an interface, eye tracking requires significant expertise and can be time consuming and labor intensive compared to traditional usability testing methods such as SEE measures (Pernice & Nielsen, 2009).

As each usability testing method has its advantages and disadvantages, usability researchers often adopt a combination of different usability testing techniques that complement one another (Jaspers, 2009; Tullis & Albert, 2013). An important gap in knowledge, however, is that the relationships between these usability methods for specific purposes and within specific contexts are not well understood. For example, it is not clear to usability researchers whether and which specific eye tracking methods can provide added value to traditional measures of usability (Pernice & Nielsen, 2009), given that this method requires significant time, effort, specialized equipment, and expertise.

The current study adopted a number of usability methods to comprehensively evaluate the usability of the Algebra Nation™ massive online learning environment. The purpose of this study was to contribute to usability professionals' understanding of the relationships between eye movement metrics and the more traditional and easy-to-administer usability testing methods such as task completion time, task difficulty rating, and the System Usability Scale.

3. Method

3.1. Participants

Thirty-five undergraduate students (ages 18–21; 20 females) who had never used Algebra Nation™ were recruited for this study. Approximately 74% of the participants identified themselves as White, and 20% were Hispanic. Participants represented multiple majors including finance, health science, international studies, psychology, and others. Twelve participants wore glasses or contact lenses (Table 1). None of the participants were color-blind.

Table 1. Participant demographics.
Variable | Statistics
Gender | 20 female, 15 male
Age | M = 19.60 (SD = 0.85)
Ethnicity | 26 White, 7 Hispanic, 2 Asian-Pacific Islander
Undergraduate classification | 4 freshmen, 20 sophomores, 10 juniors, 1 senior
Wear glasses | 12 yes, 23 no

3.2. Usability tasks

Seven tasks that users typically complete within Algebra Nation™ were selected (see Table 2). Each task was designed to utilize one main feature of the Algebra Nation™ learning environment. Participants were instructed to complete these tasks to help researchers identify potential usability issues of the system.

Table 2. Tasks used in this study.
Task | Description
1 | Find a way to log in to Algebra Nation™ using the following credentials. School: XXXXXX; Username: xxxxxx; Password: xxxxxx
2 | Open the Algebra I course, find the section on Quadratics—Part 1, and select the instructor you want to work with.
3 | You are trying to solve the following equation: x² + 4x = 17. Seek help from the Algebra Nation™ community.
4 | Find a post explaining parallel lines.
5 | Locate information about what Karma Points are.
6 | Watch a video on Similar Triangles (easy topic).
7 | Watch a video on Trigonometry (difficult topic).

Figures 1 through 4 show the Algebra Nation™ interface features participants used to complete the tasks. The final task focused on watching an instructional video either on Similar Triangles (easy topic) or Trigonometry (difficult topic). The video included a picture-in-picture view of an instructor in the bottom right corner. The instructor explained the learning content using typical non-verbal communication cues such as eye contact, facial expressions, and gestures. Figure 5 shows screenshots of the two instructional videos on Similar Triangles (easy topic) and Trigonometry (difficult topic).

Figure 1. Task 1: Log in to the system.

Figure 2. Task 2: Select an instructor.

3.3. Apparatus

The Algebra Nation™ website was displayed on an external 20-inch flat screen monitor viewed at a 55-cm distance, at a 1600 by 1200 screen resolution and a 60 Hz refresh rate. Participants sat in a chair, and their head was stabilized using a chinrest built into the desk mount (SR Research, Ontario, Canada). An Eyelink 1000 Plus system and its Screen Recorder software were used to simultaneously capture the locus of participants' eye movements and all screen activities as the participants performed the usability tasks (see Figure 6). Eye movements (i.e., fixations and saccades) were recorded at a sampling rate of 1000 Hz. Participants used a keyboard and a Bluetooth mouse as input devices.
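Saccade amplitudes are reported in degrees of visual angle rather than pixels, so a conversion based on this viewing geometry is needed. A small sketch follows; the 55-cm distance and 1600-pixel width come from the setup described above, while the 40.6-cm physical panel width is our assumption (a 20-inch diagonal at the 4:3 aspect ratio implied by 1600 by 1200).

```python
import math

DIST_CM = 55.0        # viewing distance (from the apparatus description)
SCREEN_W_CM = 40.6    # assumed: 20-inch 4:3 panel is about 40.6 cm wide
SCREEN_W_PX = 1600    # horizontal resolution (from the apparatus description)

def px_to_deg(px):
    """Convert an on-screen extent in pixels to degrees of visual angle."""
    size_cm = px * SCREEN_W_CM / SCREEN_W_PX
    return math.degrees(2.0 * math.atan2(size_cm / 2.0, DIST_CM))

# Under this geometry a 100-pixel saccade spans roughly 2.6 degrees,
# the same order of magnitude as the per-task averages in Section 4.3.
print(round(px_to_deg(100), 2))
```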

Figure 3. Task 3: Post an equation to the wall; Task 4: Find a post explaining parallel lines.

3.4. Procedure

After signing informed consent, participants completed a brief demographics survey. At the beginning of the experiment, the gaze of each participant was calibrated and validated with a 13-point calibration algorithm. Then, they were instructed to perform the usability tasks in Algebra Nation™. The instructions for each usability task were displayed in the middle top of the screen using TechSmith Morae™. Morae™ has been widely used for usability testing in various contexts (e.g., Çöltekin et al., 2009; Fagan, Mandernach, Nelson, Paulo, & Saunders, 2012). Morae™ recorded the completion time for each task and elicited a post-task response concerning task difficulty rating upon the completion of each task. Morae's™ task difficulty rating instrument is a five-point scale that ranges from very difficult (1) to very easy (5). While participants worked on each usability task and watched the instructional video, their eye movements (i.e., fixations and saccades) and on-screen activities were simultaneously recorded using Eyelink Screen Recorder. After completing the usability tasks, participants completed the System Usability Scale (Brooke, 1996; see Table 3). Participants also reported their level of satisfaction with the video they watched on a nine-point scale that ranged from extremely dissatisfied (1) to extremely satisfied (9) (Wang & Antonenko, 2017). The entire session lasted about 30 minutes for each participant.

Figure 4. Task 5: Locate information about Karma Points.

Figure 5. Tasks 6 and 7: Watch two instructional videos of varied content difficulty: (a) Similar triangles; (b) Trigonometry.

Table 3. System Usability Scale means for each item (1: strongly disagree to 5: strongly agree).

Statement | Mean
Q1. I think that I would like to use this system frequently. | 4.00
Q2. I found the system unnecessarily complex. | 1.94
Q3. I thought the system was easy to use. | 4.29
Q4. I think that I would need the support of a technical person to be able to use this system. | 1.34
Q5. I found the various functions in this system were well integrated. | 4.00
Q6. I thought there was too much inconsistency in this system. | 1.37
Q7. I would imagine that most people would learn to use this system very quickly. | 4.49
Q8. I found the system very cumbersome to use. | 1.74
Q9. I felt very confident using the system. | 4.11
Q10. I needed to learn a lot of things before I could get going with this system. | 1.63

Figure 6. Experimental setup.

4. Results

The dependent variables measured by the traditional usability testing methods included task completion time, self-reported task difficulty rating, and System Usability Scale ratings. These data were complemented by eye movement data including the number of fixations, average fixation

Figure 7. Average time (a) and task difficulty rating (b) for the usability tasks. Rating: 1 = very difficult, 5 = very easy. * indicates a significant difference between two tasks. Error bars represent ±1 SEM.

duration, and saccade amplitude for each usability task. Eyelink Data Viewer software (SR Research, Ontario, Canada) was used to divide the eye movement data into segments (one for each of the usability tasks) and extract the number of fixations, fixation duration, and saccade amplitude data for each task.

4.1. Task difficulty ratings and completion times

The average task difficulty rating for each task and the average time spent on task are provided in Figure 7. ANOVA results indicated a significant difference in the task difficulty rating for the tasks (F(4, 168) = 5.17, p < .05, η² = .11). Bonferroni post-hoc analyses indicated participants rated task 4 (M = 3.57) as significantly more difficult compared to task 1 (M = 4.60), task 2 (M = 4.44), and task 5 (M = 4.31). Task 4 focused on finding a post explaining parallel lines. There was also a significant difference in task completion time (F(4, 168) = 11.951, p < .05, η² = .222). Participants took significantly longer to complete task 3 (M = 93 s), as compared to task 1 (M = 58.2 s), task 2 (M = 38.4 s), task 4 (M = 64.8 s), and task 5 (M = 34.2 s). Task 3 focused on seeking help from the Algebra Nation™ community to solve an equation. A significant difference was also found between the completion times for task 4 (M = 64.8 s) and task 5 (M = 34.2 s).

4.2. System Usability Scale (SUS) and satisfaction with the videos

The average overall System Usability Scale (SUS) score for all participants was 82, calculated following the method described in Brooke (1996), where the minimum is 0 and the maximum is 100. A higher score indicates a higher usability rating. A score of 82 represents "acceptable" usability (Bangor, Kortum, & Miller, 2009), and it is designated an A according to the Sauro-Lewis curved grading scale (CGS) for the SUS (Lewis & Sauro, 2017; Sauro & Lewis, 2016, p. 204). The mean score for each statement of the SUS is reported in Table 3.
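Brooke's (1996) scoring procedure is simple enough to restate: odd-numbered (positively worded) items contribute their response minus 1, even-numbered (negatively worded) items contribute 5 minus their response, and the sum is multiplied by 2.5 to yield a 0-100 score. A short sketch follows; the function and variable names are ours.

```python
def sus_score(responses):
    """SUS score (0-100) from ten 1-5 Likert responses (Brooke, 1996)."""
    assert len(responses) == 10
    contributions = [(r - 1) if i % 2 == 1 else (5 - r)
                     for i, r in enumerate(responses, start=1)]
    return 2.5 * sum(contributions)

# Because the scoring is linear, feeding in the per-item means from Table 3
# reproduces the mean of the per-participant scores:
item_means = [4.00, 1.94, 4.29, 1.34, 4.00, 1.37, 4.49, 1.74, 4.11, 1.63]
print(sus_score(item_means))  # ~82.2, matching the reported overall score of 82
```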

Figure 8. Number of fixations for each task (a), average fixation duration for each task (b), and average saccade amplitude for each task (c). * indicates a significant difference between two tasks. Error bars represent ±1 SEM.

Figure 9. Aggregated heat maps of fixations on the two instructional videos of varied content difficulty: Similar triangles (a); Trigonometry (b).

For the videos participants watched, they reported a high level of satisfaction on a nine-point scale: 8.11 (SD = 0.68) for the easy topic on Similar Triangles and 7.61 (SD = 1.58) for the difficult video on Trigonometry.

4.3. Quantitative eye tracking data

We examined participants' number of fixations, average fixation duration, and average saccade amplitude for each usability task (see Figure 8). Saccade amplitude refers to the average size of saccades in degrees of visual angle. ANOVA results indicated a significant difference in the number of fixations (F(4, 168) = 4.690, p < .05, η² = .100), average fixation duration (F(4, 168) = 3.204, p < .05, η² = .071), and average saccade amplitude (F(4, 168) = 2.971, p < .05, η² = .066). Bonferroni post-hoc analyses indicated participants performed a significantly higher number of fixations during task 3 (M = 249.23) compared to task 2 (M = 155.83) and task 5 (M = 116.71). Participants' average fixation duration was also longer when working on task 4 (M = 270.53 ms) compared to task 5 (M = 245.10 ms). Participants also produced significantly larger saccade amplitudes during task 1 (M = 3.63°) compared to task 5 (M = 3.18°).

In addition to the five usability tasks, participants watched two instructional videos of varied content difficulty (i.e., Similar Triangles and Trigonometry). Figure 9 represents the aggregated fixation maps across the participants while they attended to the two videos. For the easy topic of Similar Triangles, the instructor attracted 26% of the total fixation time, whereas for the difficult topic of Trigonometry, the instructor attracted 22% of the total fixation time. Considering that the instructor frame constitutes only 7% of the screen size, participants allocated a significant amount of visual attention to the instructor, especially the instructor's face. Participants generally expressed satisfaction with seeing the instructor on the screen and believed the instructor was helpful and engaging. These findings speak in favor of including the instructor in the Algebra Nation™ videos.

4.4. Qualitative eye tracking data

Besides analyzing the eye tracking data quantitatively, qualitative analysis was conducted on each participant's fixation behavior. The qualitative analysis provided information about where participants focused their attention and for how long, thus helping to identify potential usability problems within the system. Table 4 summarizes the usability problems associated with each of the usability tasks based on qualitative eye tracking data.

Table 4. Usability problems identified.

Task 1: Log in to the system
- The "Enter" button on the top right corner of the main page was not immediately attended to by 6 participants.

Task 2: Select an instructor for the video
- The question mark representing the "about instructor" feature was not attended to or used by 34 participants.

Task 3: Post an equation to the wall
- The equation editor sign f(x) was not intuitive to participants; three participants confused it with the special character sign x².

Task 4: Find a post explaining parallel lines
- The "refresh" button was very close to the "search" button on the search bar, and five participants clicked on the "refresh" button when they wanted to search.
- Two participants refreshed the page using the browser's refresh functionality instead of using the "refresh" button on the search bar.
- The search bar was not attended to by two participants, who instead used Ctrl+F to search.

Task 5: Locate information about Karma Points
- The question mark that led to information about Karma Points was not attended to by eight participants.

4.5. Relationships between usability metrics

Relationships between the usability metrics were examined using the Pearson product-moment correlation coefficient. We found a significant positive correlation between average task difficulty rating and the overall score for the System Usability Scale, r(33) = .768, p < .001. Higher SUS scores were associated with rating the tasks as easier to accomplish. Also, several significant positive and negative correlations were identified between task difficulty rating, task completion time, and eye movement metrics (i.e., number of fixations, average fixation duration, and average saccade amplitude). A summary of significant correlations is provided in Table 5.

Table 5. Significant correlations among usability metrics.

Metric pair | Task 1 | Task 2 | Task 3 | Task 4 | Task 5
# of fixations and task difficulty rating | r = −.334* | r = −.484** | r = −.432*** | r = −.622*** | r = −.650***
# of fixations and time on task | r = .934*** | r = .646*** | r = .894*** | r = .972*** | r = .978***
Average saccade amplitude and time on task | — | — | r = −.546*** | — | r = .415*
Average saccade amplitude and task difficulty rating | — | — | — | — | r = −.424*
Time spent on task and task difficulty rating | — | r = −.371* | r = −.451** | r = −.637*** | r = −.675***

Note: For task difficulty rating, 1 = very difficult, 5 = very easy. df = 33. *p < .05. **p < .01. ***p < .001.

We found that there was a moderate to strong negative correlation between the number of fixations and task difficulty rating for each task. Participants who exhibited more fixations during a task tended to rate the task as more difficult. We also identified a strong, positive correlation

between the number of fixations and time on task for each task. This means that participants who spent more time on a task also demonstrated more gaze fixations during that task. This finding was true for all tasks, although the strength of the relationship varied across the tasks.

Time on task was found to be negatively associated with task difficulty rating: spending more time working on a task resulted in rating the task as more difficult. This finding applied to all tasks except task 1 (log in to the system), and for the other four tasks the correlations ranged from strong (e.g., r(33) = −.675 for task 5, locate information on Karma Points) to moderate (e.g., r(33) = −.371 for task 2, select an instructor for the video). We also identified a strong, negative correlation between average saccade amplitude and time on task for task 3, whereas there was a moderate, positive correlation between these two metrics for task 5. Moreover, average saccade amplitude and task difficulty rating were found to be negatively correlated for task 5. Task 5 required participants to locate information related to Karma Points, and for this task, participants who demonstrated larger saccade amplitudes found the task more difficult.

Using average task difficulty rating as a predictor variable, 59% of the variance in the SUS score is explained (R² = .59, F(1, 33) = 47.398, p < .001). Adding average fixation duration as a predictor variable in the model is associated with a statistically significant increase in R² (ΔR² = .048, F(1, 32) = 4.194, p < .05). By using average fixation duration as a predictor, we can predict 4.8% more variance in the SUS score than we could with a model that only contained average task difficulty rating.
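For readers who want to reproduce this style of analysis, a sketch of both steps follows: the Pearson correlations of Table 5 and the two-step hierarchical regression above. The data here are simulated placeholders (35 rows, one per participant) and the variable names are ours; only the procedure mirrors the paper.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 35                                    # one row per participant (hypothetical data)
difficulty = rng.uniform(1, 5, n)         # rating: 1 = very difficult, 5 = very easy
fixations = 300 - 40 * difficulty + rng.normal(0, 20, n)
fix_dur = rng.normal(260, 25, n)          # average fixation duration (ms)
sus = 40 + 10 * difficulty + 0.02 * fix_dur + rng.normal(0, 4, n)

# Pearson correlation, reported as r(df) with df = n - 2 = 33 as in Table 5.
r, p = stats.pearsonr(fixations, difficulty)
print(f"r({n - 2}) = {r:.3f}, p = {p:.4f}")

# Hierarchical regression: step 1 enters difficulty rating alone; step 2
# adds average fixation duration; the R-squared change is F-tested.
step1 = sm.OLS(sus, sm.add_constant(difficulty)).fit()
step2 = sm.OLS(sus, sm.add_constant(np.column_stack([difficulty, fix_dur]))).fit()
f_change, p_change, _ = step2.compare_f_test(step1)
print(f"R2 = {step1.rsquared:.3f}, delta R2 = {step2.rsquared - step1.rsquared:.3f}, "
      f"F(1, {int(step2.df_resid)}) = {f_change:.3f}, p = {p_change:.4f}")
```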
5. Discussion

This study evaluated the usability of Algebra Nation™, a massive online learning environment that is used by hundreds of thousands of students, and investigated relationships between data collected using several usability evaluation methods. Traditional usability testing methods (i.e., standard metrics to gauge effectiveness and efficiency) revealed that the usability tasks resulted in variable task completion times and task difficulty ratings, which helped in identifying the aspects of the interface that need improvement. For example, participants rated task 4 (find a post explaining parallel lines) as significantly more difficult compared to task 1 (log in to the system), task 2 (select an instructor for the video), and task 5 (locate information on Karma Points). Results of the System Usability Scale, another traditional and widely used usability testing technique, suggested that Algebra Nation™ is user-friendly and easy to use. On average, the overall System Usability Scale score was 82, which is an acceptable SUS score for a system/interface evaluation. The levels of agreement with the SUS statements also corroborated this finding. Specifically, participants generally believed the system was easy to use and they were confident in using it.

In the current study, in addition to the traditional usability testing methods, eye movement metrics such as number of fixations, average fixation duration, and average saccade amplitude were examined. Eye tracking data results indicated that participants performed a significantly higher number of fixations during task 3 (seek help to solve an equation) compared to tasks 2 (select an instructor for the video) and 5 (locate information on Karma Points). Participants' average fixation duration was also longer when working on task 4 (find a post explaining parallel lines) compared to task 5 (locate information on Karma Points). Participants also produced significantly larger saccade amplitudes during task 1 compared to task 5. These results provide useful information about how different tasks induced different levels of visual attention from the participants and inform the aspects of the interface that can be improved.

Our study also examined the relationships between users' visual dynamics patterns collected using an eye tracker and these standard usability methods. These eye movement metrics reflected strong positive and negative correlations with task performance variables such as task completion time and task difficulty rating. First, in this study, negative correlations were identified between the number of fixations and self-reported task difficulty rating for each of the usability tasks. Specifically, a higher number of fixations coincided with ratings representing higher levels of difficulty for each task. Importantly, this finding applied to all five usability tasks used in this study, ranging from a moderate correlation (r(33) = −.334 for task 1, log in to the system) to a strong correlation (r(33) = −.650 for task 5, locate information on Karma Points). This finding confirms the results of a study that used a visual search task in an experimental neuro-cognitive paradigm (Goldberg & Kotval, 1999). In that study, researchers evaluated several eye tracking measures relevant to the visual search task and suggested that when searching for a single target in a user interface, a larger number of fixations indicated that the user sampled many other objects prior to selecting the target. In other words, a larger number of fixations was associated with a less efficient visual search strategy due to a less optimal interface layout. Based on these findings, it

is reasonable to conclude that in the current study, more fixations, possibly due to a suboptimal page layout, also resulted in the participants rating those tasks as more difficult. Second, task completion time was systematically related to task difficulty rating: rating a task as more difficult was associated with more time spent on the task (a negative correlation with the rating scale, on which lower values indicate greater difficulty). This association was identified for all tasks except task 1 (log in to the system), ranging from moderate (r(33) = −.371) to strong (r(33) = −.675). This finding is reasonable, as users tend to spend more time figuring out how to complete a usability task when they perceive the task to be more difficult. Thus, it can be concluded that time on task can be used as a proxy for the difficulty in cognitive processing. This conclusion is consistent with the results reported in Cooke (2006), who found the easiest page resulted in the shortest task completion time.

In the current study, eye tracking metrics such as number of fixations and saccade amplitude provide convergent validity for the standard usability evaluation measures. Where eye tracking data provide a lot of added value is in discovering usability issues with individual interface elements that users attend to on the screen. Qualitative examination of eye tracking data can provide relatively unobtrusive measures of visual behavior that offer information about participants' attention and cognition, thus complementing the traditional usability evaluation methods in identifying usability issues at a deeper level. For example, by tracking eye movements, researchers were able to discover how long and how often a user looks at a certain area of interest in the interface and how frequently users switch from one visual component of the interface to others (Duchowski, 2007). In this study, the qualitative examination of eye tracking data provided us with detailed information regarding which features were used a lot or very little, thus leading to important insights on the solutions to the usability issues of the system (Table 6). The analysis especially suggested improving the design of the search bar in the Algebra Nation™ collaborative wall.

Table 6. Solutions to usability problems.

Task 1: Log in to the system
- Problem: The "Enter" button on the top right corner of the main page was not immediately attended to by 6 participants.
- Solutions: Use "Log in" instead of "Enter" on the main page; use a contrasting color such as blue for the "Log in" button; eliminate the "Enter" button in the center of the main page.

Task 2: Select an instructor for the video
- Problem: The question mark representing the "about instructor" feature was not attended to or used by 34 participants.
- Solutions: Make the question mark stand out more by using a different color and a more intuitive icon; create a hover-over feature with a short description of the instructors, or create a separate page for instructors' information in the system.

Task 3: Post an equation to the wall
- Problem: The equation editor sign f(x) was not intuitive to participants; three participants confused it with the special character sign x².
- Solution: Change the appearance of the special characters and equation editor signs to make them more self-explanatory.

Task 4: Find a post explaining parallel lines
- Problems: The "refresh" button was very close to the "search" button on the search bar, and five participants clicked on the "refresh" button when they wanted to search; two participants refreshed the page using the browser's refresh functionality instead of the "refresh" button on the search bar; the search bar was not attended to by two participants, who instead used Ctrl+F to search.
- Solutions: Eliminate the "refresh" button; make the search bar stand out more by using a different color or assigning it a bigger space.

Task 5: Locate information about Karma Points
- Problem: The question mark that led to information about Karma Points was not attended to by eight participants.
- Solutions: Make the question mark stand out more by using a different color (e.g., blue); create a hover-over caption saying "What are Karma Points?" over the question mark.
Before adopting eye tracking methods, usability researchers should consider the characteristics of the interface to be evaluated. Specifically, eye tracking could be useful in providing additional information about how users perceive different designs of an interface by examining the visual attention distribution over several areas of interest (AOIs). For example, in the current study, the aggregated fixation map was helpful in examining the users' visual attention distribution while they watched the two instructional videos, which included the instructor on the screen. On the other hand, qualitative analysis of eye tracking data can provide valuable information on usability issues where users interact with an interface that involves dynamically changing screens, for example, when the user is scrolling up and down a page to locate a piece of information.
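As a concrete illustration of AOI analysis, the share of total fixation time landing in a region such as the picture-in-picture instructor frame can be computed directly from parsed fixations. A minimal sketch follows; the AOI rectangle below is hypothetical, not the actual coordinates of the instructor frame.

```python
INSTRUCTOR_AOI = (1150, 850, 1600, 1200)  # hypothetical left, top, right, bottom (px)

def dwell_share(fixations, aoi):
    """Fraction of total fixation time whose centroid falls inside the AOI.
    fixations: (start_ms, end_ms, x, y) tuples, e.g., from the parser
    sketched in Section 2.2."""
    left, top, right, bottom = aoi
    total = sum(end - start for start, end, _, _ in fixations)
    inside = sum(end - start for start, end, x, y in fixations
                 if left <= x <= right and top <= y <= bottom)
    return inside / total if total else 0.0
```

Comparing this dwell share against the AOI's share of screen area (here, 26% and 22% of fixation time on a frame occupying only 7% of the screen) is what supports the claim that the instructor drew disproportionate attention.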
Unlike most other research on massive online learning systems (Guo, 2013; Kiger, Herro, & Prunty, 2012), the current study focused on exploring the usability of the system, instead of simply examining the learning outcomes from the system. Our study is one of the first few studies that used multiple evaluation methods to examine the usability of a massive online learning system. The eye tracking metrics such as number of fixations provide convergent validity for the standard usability evaluation measures. More importantly, the qualitative examination of eye movement data revealed several design flaws of the system and provided important suggestions on how to improve the interface design, which would otherwise be impossible to acquire using traditional usability testing methods.

6. Conclusion

This study explored the relationships between eye tracking data and standard usability testing data that focus on the effectiveness and efficiency of completing usability tasks. The context of the study was evaluating the usability and cognitive task requirements of Algebra Nation™, a massive online learning environment used by hundreds of thousands of students in the USA. The usability tasks resulted in variable levels of self-reported task difficulty rating and completion time, which helped identify the aspects of the interface that need improvement. Compared to traditional usability metrics that gather data based on participants'

overt behavior (e.g., time on task), qualitative eye movement analysis provided additional insights into how participants respond to visual elements while interacting with the interface. Eye movement analysis augmented the traditional usability testing methods and provided important implications regarding revisions to the interface elements associated with those usability tasks.

Acknowledgment

We would like to thank the Study Edge Corporation for their help with participant recruitment.

Funding

This article is based on work supported by the National Science Foundation under Grant No. 1540888.

ORCID

Jiahui Wang http://orcid.org/0000-0001-5681-5055

References

Bangor, A., Kortum, P., & Miller, J. (2009). Determining what individual SUS scores mean: Adding an adjective rating scale. Journal of Usability Studies, 4(3), 114–123.

Bangor, A., Kortum, P. T., & Miller, J. T. (2008). An empirical evaluation of the System Usability Scale. International Journal of Human-Computer Interaction, 24(6), 574–594. doi:10.1080/10447310802205776

Bojko, A. (2006). Using eye tracking to compare web page designs: A case study. Journal of Usability Studies, 1(3), 112–120.

Brooke, J. (1996). SUS: A quick and dirty usability scale. Usability Evaluation in Industry, 189(194), 4–7.

Chin, J. P., Diehl, V. A., & Norman, K. L. (1988). Development of an instrument measuring user satisfaction of the human-computer interface. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 213–218). ACM.

Çöltekin, A., Heil, B., Garlandini, S., & Fabrikant, S. I. (2009). Evaluating the effectiveness of interactive map interface designs: A case study integrating usability metrics with eye-movement analysis. Cartography and Geographic Information Science, 36(1), 5–17. doi:10.1559/152304009787340197

Cooke, L. (2006). Is eye tracking the next step in usability testing? International Professional Communication Conference, 2006 IEEE (pp. 236–242). doi:10.1109/IPCC.2006.320355

Cowen, L., Ball, L. J., & Delin, J. (2002). An eye movement analysis of web page usability. In People and Computers XVI—Memorable Yet Invisible (pp. 317–335). London: Springer.

Duchowski, A. (2007). Eye tracking methodology: Theory and practice. London: Springer-Verlag.

Ehmke, C., & Wilson, S. (2007). Identifying web usability problems from eye-tracking data. In 21st British HCI Group Annual Conference on HCI 2007: People and Computers XXI (p. 12). Swindon, UK: The British Computer Society. doi:10.1145/1531294.1531311

Everett, S. P., Byrne, M. D., & Greene, K. K. (2006). Measuring the usability of paper ballots: Efficiency, effectiveness, and satisfaction. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 50(24), 2547–2551. doi:10.1177/154193120605002407

Fagan, J. C., Mandernach, M. A., Nelson, C. S., Paulo, J. R., & Saunders, G. (2012). Usability test results for a discovery tool in an academic library. Information Technology and Libraries, 31(1), 83.

Goldberg, J. H., & Kotval, X. P. (1999). Computer interface evaluation using eye movements: Methods and constructs. International Journal of Industrial Ergonomics, 24(6), 631–645. doi:10.1016/S0169-8141(98)00068-7

Guo, P. J. (2013). Online Python Tutor: Embeddable web-based program visualization for CS education. SIGCSE 2013: Proceedings of the 44th ACM Technical Symposium on Computer Science Education, 579–584. doi:10.1145/2445196.2445368

Hasan, L. (2014). Evaluating the usability of educational websites based on students' preferences of design characteristics. International Arab Journal of e-Technology, 3(3), 179–193.

Herold, J., Stahovich, T., Lin, H. L., & Calfee, R. C. (2011). The effectiveness of "pencasts" as an instructional medium. Proceedings of the American Society for Engineering Education 118th Annual Conference and Exposition.

Jaspers, M. W. M. (2009). A comparison of usability methods for testing interactive health technologies: Methodological aspects and empirical evidence. International Journal of Medical Informatics, 78(5), 340–353. doi:10.1016/j.ijmedinf.2008.10.002

Just, M. A., & Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87(4), 329–354. doi:10.1037/0033-295X.87.4.329

Kiger, D., Herro, D., & Prunty, D. (2012). Examining the influence of a mobile learning intervention on third grade math achievement. Journal of Research on Technology in Education, 45(1), 61–82. doi:10.1080/15391523.2012.10782597

Kortum, P., & Peres, S. C. (2014). The relationship between system effectiveness and subjective usability scores using the System Usability Scale. International Journal of Human–Computer Interaction, 30(7), 575–584. doi:10.1080/10447318.2014.904177

Kortum, P., & Sorber, M. (2015). Measuring the usability of mobile applications for phones and tablets. International Journal of Human-Computer Interaction, 31(8), 518–529. doi:10.1080/10447318.2015.1064658

Kortum, P. T., & Bangor, A. (2013). Usability ratings for everyday products measured with the System Usability Scale. International Journal of Human-Computer Interaction, 29(2), 67–76. doi:10.1080/10447318.2012.681221

Lewis, J. R. (1995). IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use. International Journal of Human-Computer Interaction, 7(1), 57–78. doi:10.1080/10447319509526110

Lewis, J. R., & Sauro, J. (2017). Can I leave this one out? The effect of dropping an item from the SUS. Journal of Usability Studies, 13(1).

Maughan, L., Gutnikov, S., & Stevens, R. (2007). Like more, look more. Look more, like more: The evidence from eye-tracking. Journal of Brand Management, 14(4), 335–342. doi:10.1057/palgrave.bm.2550074

Nelsen, B. T. (1998). Ergonomic requirements for office work with visual display terminals, Part 11: Guidance on usability (ISO DIS 9241-11). London: ISO.

Nielsen, J., & Pernice, K. (2010). Eyetracking web usability. Berkeley, CA: New Riders.

Pernice, K., & Nielsen, J. (2009). How to conduct eyetracking studies. Fremont, CA: Nielsen Norman Group.

Pomplun, M., Reingold, E. M., & Shen, J. (2001). Investigating the visual span in comparative search: The effects of task difficulty and divided attention. Cognition, 81(2), 57–67. doi:10.1016/S0010-0277(01)00123-8

Rashid, S., Soo, S., Sivaji, A., Naeni, H. S., & Bahri, S. (2013). Preliminary usability testing with eye tracking and FCAT analysis on occupational safety and health websites. Procedia - Social and Behavioral Sciences, 97, 737–744. doi:10.1016/j.sbspro.2013.10.295

Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3), 372–422. doi:10.1037/0033-2909.124.3.372

Russell, M. (2005). Using eye-tracking data to understand first impressions of a website. Usability News, 7(1), 1–14.

Sauro, J., & Lewis, J. R. (2016). Quantifying the user experience: Practical statistics for user research (2nd ed.). Cambridge, MA: Morgan Kaufmann.

Schneps, M. H., Thomson, J. M., Sonnert, G., Pomplun, M., Chen, C., & Heffner-Wong, A. (2013). Shorter lines facilitate reading in those who struggle. PLoS ONE, 8(8). doi:10.1371/journal.pone.0071161

Ssemugabi, S., & De Villiers, R. (2007). A comparative study of two usability evaluation methods using a web-based e-learning application. Proceedings of the 2007 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists on IT Research in Developing Countries (pp. 132–142). doi:10.1145/1292491.1292507

Tullis, T., & Albert, W. (2013). Measuring the user experience: Collecting, analyzing, and presenting usability metrics (2nd ed.). Amsterdam: Morgan Kaufmann.

Tullis, T. S., & Stetson, J. N. (2004). A comparison of questionnaires for assessing website usability. Usability Professionals Association Conference, 1–12.

Wang, J., & Antonenko, P. (2017). Instructor presence in instructional video: Effects on visual attention, recall, and perceived learning. Computers in Human Behavior, 71, 79–89.

Wu, Y., Cheng, J., & Kang, X. (2016). Study of smart watch interface usability evaluation based on eye-tracking. In International Conference of Design, User Experience, and Usability (pp. 98–109). Cham: Springer International Publishing. doi:10.1007/978-3-319-20886-2

Wulff, A. (2007). Eyes wide shut—or using eye tracking technique to test a website. International Journal of Public Information Systems, 3(1), 1–12.

About the Authors

Jiahui Wang is a PhD candidate and research fellow in the Educational Technology program at the University of Florida. Her research focuses on how individual differences affect learning with technology and how technology-based learning environments can be designed to accommodate individual differences.

Pavlo "Pasha" Antonenko is an Associate Professor of Educational Technology and Director of the NeurAL Lab at the University of Florida. His scholarship focuses on (a) frameworks and technologies to encourage and scaffold learning and twenty-first-century skills and (b) psychophysiological assessment of cognition to optimize the design of technology-enhanced learning environments.

Mehmet Celepkolu is currently a PhD student in the Computer Science program at the University of Florida. His research focuses on how computational models can reveal the hidden phenomena during dialogue and learning, which can create effective strategies for supporting human learning with intelligent systems.

Yerika Jimenez is currently a PhD student in Human-Centered Computing and an NSF graduate research fellow at the University of Florida. Her research focuses on computer science education. She is interested in understanding how much cognitive effort students use to interact with and learn computer science in block-based programming environments.

Ethan Fieldman is the President of Study Edge. Study Edge provides various services including education technology that uses social media, mobile devices, online communities, gamification, personalized learning, and some of the best, most energetic instructors in the world to help students from middle school through college.

Ashley Fieldman is the Vice President of Study Edge. Study Edge provides various services including education technology that uses social media, mobile devices, online communities, gamification, personalized learning, and some of the best, most energetic instructors in the world to help students from middle school through college.
