OUP CORRECTED PROOF – FINAL, 22/12/2017, SPi
Computational Interaction
Edited by
PER OLA KRISTENSSON
University Reader in Interactive Systems Engineering
University of Cambridge
XIAOJUN BI
Assistant Professor
Stony Brook University
ANDREW HOWES
Professor and Head of School at the School of Computer Science
University of Birmingham
Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Oxford University Press 2018
The moral rights of the authors have been asserted
First Edition published in 2018
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2017949954
ISBN 978–0–19–879960–3 (hbk.)
ISBN 978–0–19–879961–0 (pbk.)
Printed and bound by
CPI Group (UK) Ltd, Croydon, CR0 4YY
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
CONTENTS
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Xiaojun Bi, Andrew Howes, Per Ola Kristensson, Antti Oulasvirta, John Williamson
PART II DESIGN
4. Combinatorial Optimization for User Interface Design . . . . . . . . . . . . 97
Antti Oulasvirta, Andreas Karrenbauer
5. Soft Keyboard Performance Optimization . . . . . . . . . . . . . . . . . . . . . . 121
Xiaojun Bi, Brian Smith, Tom Ouyang, Shumin Zhai
6. Computational Design with Crowds . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Yuki Koyama, Takeo Igarashi
Index 421
LIST OF CONTRIBUTORS
Yuki Koyama National Institute of Advanced Industrial Science and Technology, Japan
Q. Vera Liao IBM Thomas J. Watson Research Center, New York, USA
Introduction
xiaojun bi,
andrew howes,
per ola kristensson,
antti oulasvirta,
john williamson
This book is concerned with the design of interactive technology for human use. It pro-
motes an approach, called computational interaction, that focuses on the use of algorithms
and mathematical models to explain and enhance interaction. Computational interaction
involves, for example, research that seeks to formally represent a design space in order to
understand its structure and identify solutions with desirable properties. It involves building
evaluative models that can estimate the expected value of a design either for a designer, or
for a system that continuously adapts its user interface accordingly. It involves constructing
computational models that can predict, explain, and even shape user behaviour.
While interaction may be approached from a user or system perspective, all examples
of computational interaction share a commitment to defining computational models that
gain insight into the nature and processes of interaction itself. These models can then be
used to drive design and decision making. Here, computational interaction draws on a
long tradition of research on human interaction with technology applying human factors
engineering (Fisher, 1993; Hollnagel and Woods, 2005; Sanders and McCormick, 1987;
Wickens et al., 2015), cognitive modelling (Payne and Howes, 2013; Anderson, 2014;
Kieras and Hornof, 2017; Card, Moran, and Newell, 1983; Gray and Boehm-Davis, 2000;
Newell, 1994; Kieras, Wood, and Meyer, 1997; Pirolli and Card, 1999), artificial intelligence
and machine learning (Sutton and Barto, 1998; Brusilovsky and Millan, 2007; Fisher, 1993;
Horvitz et al., 1998; Picard, 1997; Shahriari et al., 2016), information theory (Fitts and
Peterson, 1964; Seow, 2005; Zhai, 2004), design optimization (Light and Anderson, 1993;
Eisenstein, Vanderdonckt, and Puerta, 2001; Gajos and Weld, 2004; Zhai, Hunter and
Smith, 2002), formal methods (Thimbleby, 2010; Dix, 1991; Harrison and Thimbleby,
1990; Navarre et al., 2009), and control theory (Craik, 1947; Kleinman, Baron, and Levison,
1970; Jagacinski and Flach, 2003; Sheridan and Ferrell, 1974).
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
Computational interaction is a science of the artificial (Simon, 1996), where the object of
research is the construction of artefacts for stated goals and function. Herbert Simon took
up construction as the distinctive feature of disciplines like medicine, computer science,
and engineering, distinguishing them from natural sciences, where the subject is natural
phenomena, and from arts, which may not share interest in attaining goals: ‘Engineering,
medicine, business, architecture, and painting are concerned not with the necessary but with
the contingent, not with how things are but with how they might be, in short, with design’
(Simon, 1996, p. xiii).
Simon made three important observations, which we share, about sciences of the arti-
ficial. First, he distinguished the ‘inner environment’ of the artefact, such as an algorithm
controlling a display, and the ‘outer environment’, in which the artefact serves its purpose.
When we design a user interface, we are designing how the outer and the inner environ-
ments are mediated. Both the human and the technology also have invariant properties
that together with the designed variant properties shape the process and outcomes of
interaction (see Figure I.1). The final artefact reflects a designer’s implicit theory of this
interplay (Carroll and Campbell, 1989). When we design computational interaction, we
are designing or adapting an inner environment of the artefact such that it can proactively,
and appropriately, relate its functioning to the user and her context. In computational
interaction, the theory or model is explicit and expressed in code or mathematics. Second,
Simon elevated simulation as a prime method of construction. Simulations, or models in
general, allow ‘imitating’ reality in order to predict future behaviour. Simulations support the
discovery of consequences of premises which would be difficult, often impossible, to obtain
with intuition. In computational interaction, models are used offline to study the conditions
of interaction, and they may be parameterized in a real-time system with data to infer a user's
intentions and adapt interaction. Third, Simon pointed out that since the end-products are
artefacts that are situated in some ‘outer environments’, they can and should be subjected to
Figure I.1 Interaction is an emergent consequence of both variant and invariant aspects of technol-
ogy and people. Technology invariants include, for example, current operating systems that provide
an ecosystem for design. Technology variants include programmed interfaces that give rise to apps,
visualizations, gestural control, etc. A person’s invariants include biologically constrained abilities,
such as the acuity function of human vision. A person’s variants consist of the adaptive behavioural
strategies. To design interaction is to explain and enhance the variant aspects given the invariants
(both human and technology). In computational interaction, variant aspects are changed through
appropriate algorithmic methods and models.
empirical research. Design requires rigorous empirical validation of not only the artefact, but
also of the models and theories that created it. Since models contain verifiable information
about the world, they have face validity which is exposed through empirical observations.
They do not exist as mere sources of inspiration for designers, but systematically associate
variables of interest to events of interaction.
However, computational interaction is not intended to replace designers but to comple-
ment and boost the very human and essential activity of interaction design, which involves
activities such as creativity, sensemaking, reflection, critical thinking, and problem solving
(Cross, 2011). We hold that even artistic and creative efforts can greatly benefit from
a well-articulated understanding of interaction and mastery of phenomena captured via
mathematical methods. By advancing the scientific study of interactive computing artefacts,
computational interaction can create new opportunities and capabilities in creative and
artistic endeavours. We anticipate that the approaches described in this book are useful
in the interaction design processes of ideating, sketching, and evaluating, for example.
Computational interaction can also be studied and developed in broader contexts of socio-
technical systems, where power, labour, and historical settings shape interaction.
The fundamental ideas of computational interaction have been present for many years.
However, there is now strong motivation to collect, rationalize, and extend them. In the early
days of interaction design, design spaces were relatively simple; input devices, for example,
were explicitly designed to be simple mappings of physical action to digital state. However,
computer systems are now vastly more complex. In the last two decades, interaction has bro-
ken out from the limited domains of workstations for office workers to pervade every aspect
of human activity. In the current mobile and post-PC computing era, new technologies have
emerged that bring new challenges, for which the traditional hand-tuned approach to design
is not well equipped. For example, wearable computing, augmented and virtual reality, and
customizable interactive devices pose increasingly wicked challenges, where designers must
consider a multiplicity of problems from low-level hardware, through software, all the way
to human factors. In these increasingly complex design spaces, computational abstraction
and algorithmic solutions are likely to become vital.
Increasing design complexity motivates the search for a scalable interaction engineering
that can complement and exceed the capabilities of human designers. A contention of this
book is that a combination of methods from modelling, automation, optimization, and
machine learning can offer a path to address these challenges. An algorithmic approach
to interaction offers the hope of scalability; an opportunity to make sense of complex
data and systematically and efficiently derive designs from overwhelmingly complex design
spaces. This requires precise expression of interaction problems in a form amenable to
computational analysis.
This book is part of an argument that, embedded in an iterative design process, compu-
tational interaction design has the potential to complement human strengths and provide
a means to generate inspiring and elegant designs. Computational interaction does not
exclude the messy, complicated, and uncertain behaviour of humans. Neither does it seek
to reduce users to mechanistic caricatures for ease of analysis. Instead, a computational
approach recognizes that there are many aspects of interaction that can be augmented by an
algorithmic approach, even if algorithms are fallible at times and based on approximations of
human life. Furthermore, computational interaction has the potential to reduce the design
errors that plague current interactive devices, some of which can be life ending. It can
dramatically reduce the iteration cycles required to create high-quality interactions that
other approaches might write off as impossible to use. It can specialize and tailor general
interaction techniques such as search for niche tasks, for instance, patent searches and
systematic reviews. It can expand the design space and enable people to interact more
naturally with complex machines and artificial intelligence.
In parallel with the increased complexity of the interaction problem, there have been
rapid advances in computational techniques. At the fundamental level, rapid strides in
machine learning, data science, and associated disciplines have transformed the problem
space that can be attacked, generating algorithms and models of transformative power.
At the practical level, affordable high-performance computing, networked systems, and
availability of off-the-shelf libraries, datasets, and software for computational methods,
make it feasible to analyse and enhance the challenging problems of interaction design.
These are developments that should not be ignored.
This book sets out a vision of human use of interactive technology empowered by
computation. It promotes an analytical, algorithmic, and model-led approach that seeks
to exploit the rapid expansion in computational techniques to deal with the diversification
and sophistication of the interaction design space. As noted by John Carroll (1997), and by
others many times after, research on human-computer interaction has had problems assim-
ilating the variety of methodologies, theories, problems, and people. The implementation
of theoretical ideas in code could become a substrate and a nexus for computer scientists,
behavioural and social scientists, and designers alike. If this promise is to be fulfilled, then
serious effort must be made to connect mainstream research in human-computer interaction
to computational foundations.
I.1 Definition
Defining a field is fraught with dangers. Yet there are two reasons to attempt one here: (i) to
provide a set of core objectives and ideas for like-minded researchers in otherwise disparate
fields to coalesce; (ii) to help those not yet using computational methods to identify the
attributes that enable those methods to be successful and understand how this way of
thinking can be brought to bear in enhancing their work. While any definition is necessarily
incomplete, there is a strong sense that there is a nucleus of distinctive and exciting ideas
that distinguish computational from traditional human-computer interaction research. The
goal of our definition is to articulate that nucleus.
Definition Computational interaction applies computational thinking—that is, abstrac-
tion, automation, and analysis—to explain and enhance the interaction between a user
(or users) and a system. It is underpinned by modelling which admits formal reasoning and
involves at least one of the following:
For example, the design of a control panel layout might involve first proposing an abstract
representation of the design space and an objective function (for example, visual search
performance and selection time) for choosing between variants. It might then involve using
an optimization method to search the space of designs and analysing the properties of the
proposed design using formal methods. Alternatively, observed user behaviour given a specific design might be explained by first proposing abstract representations of the
information processing capacities of the mind (for example, models of human memory for
interference and forgetting), then building computer models that automate the calculation
of the implications of these capacities for each particular variant across the distribution of
possible users, and then analysing the results and comparing with human data.
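The second example above, automating the calculation of a model's implications for each design variant, can be sketched in miniature. The exponential forgetting curve, its decay rate, and the usage gaps below are illustrative assumptions chosen for the sketch, not fitted models:

```python
import math

def recall_probability(time_since_use, decay=0.1):
    """Toy model of forgetting: the probability of recalling a command
    decays exponentially with the hours since it was last used.
    The decay rate is an illustrative parameter, not a fitted value."""
    return math.exp(-decay * time_since_use)

def predicted_recall(variant_usage_gaps):
    """Automate the calculation: average predicted recall over the commands
    of a design variant, given each command's typical usage gap (hours)."""
    return sum(recall_probability(g) for g in variant_usage_gaps) / len(variant_usage_gaps)

# Two hypothetical menu designs: one surfaces frequent commands (short gaps
# between uses), the other buries them (long gaps between uses).
variant_a = [1, 2, 4]     # hours between uses of each command
variant_b = [10, 20, 40]
print(predicted_recall(variant_a) > predicted_recall(variant_b))  # True
```

Comparing such predictions against human recall data is then the analysis step: the model's implications are computed automatically for every variant, and only the comparison with observations requires empirical work.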
As in engineering and computer science, the hallmark of computational interaction
is mathematical—and often executable—modelling that connects with data. However,
while computational interaction is founded on fundamental theoretical considerations, it
is constructive in its aims rather than descriptive. It calls for empirical rigour in evaluation
and model construction, but focuses on using computationally powered models to do what
could not be done before, rather than describing what has been done before. To this end,
it emphasizes generating constructive conceptual foundations and robust, replicable and
durable methods that go beyond point sampling of the interaction space.
I.2 Vision
The overarching objective of computational interaction is to increase our capacity to reliably
achieve interactions with desirable qualities. A generic capacity is preferable over point
solutions, because we need to deal with diverse users, contexts, and technologies. By
seeking shared, transferable solution principles, the long-term aim is to effect the
runaway ‘ratcheting’ of a body of research that builds up constructively and composition-
ally; something that research on human-computer interaction has struggled to achieve
in the past.
This motivates several defining qualities of computational interaction: verifiable, mathematical theory that allows results to generalize; scalable, theory-led engineering that does not
require empirical testing of every variant; transparency and reproducibility in research; and
the concomitant requirement for reusable computational concepts, algorithms, data sets,
challenges, and code. In machine learning, for example, the pivot to open-source libraries,
large state-of-the-art benchmark datasets, and rapid publication cycles has facilitated both the uptake of new developments and the rate of research progress.
This vision opens up several possibilities to improve the way we understand, design, and
use interactive technology.
Increase efficiency, robustness, and enjoyability of interaction Models of interaction,
informed by often large-scale datasets, can enable the design of better interactions. For
instance, interactions can demand less time, reduce the number of errors made, decrease frustration, increase satisfaction, and so on. In particular, computational approaches can
help quantify these abstract goals, infer these qualities from observable data, and cre-
ate mechanisms through which interfaces can be optimized to maximize these qualities.
A computational approach aims to create an algorithmic path from observed user-system
interaction data to quantifiably improved interaction.
Proofs and guarantees A clear benefit of some approaches, such as formal, probabilistic, and optimization methods, is that they can offer guarantees and even proofs for some aspects of solution quality. This may be valuable for designers who seek to convince customers, or for adaptive user interfaces that can then achieve desired effects more reliably.
Develop user-centred design Computational interaction not only supports but can re-
envision user-centred design, where techniques such as parametric interfaces, data-driven
optimization, and machine-learned input recognition create direct data paths from usage
and observations to design and interaction. Computation allows interfaces to be precisely
tailored for users, contexts, and devices. Structure and content can be learned from obser-
vation, potentially on a mass scale, rather than dictated in advance.
Reduce the empirical burden Models can predict much of expected user-system
behaviour. Interaction problems can be defined formally, which increases our ability to
reason and avoid blind experimentation. Computational methods should reduce the
reliance on empirical studies and focus experimental results on validating designs based
on sound models. Strong theoretical bases should move 'shot-in-the-dark' point sampling of designs towards informed and data-efficient experimental work.
Reduce design time of interfaces Automation, data, and models can supplant hand-
tweaking in the design of interfaces. It should be quicker, and less expensive, to engineer
complex interactions if the minutiae of design decisions can be delegated to algorithmic
approaches.
Free up designers to be creative In the same vein, algorithmic design can support
designers in tedious tasks. From tuning pose recognizers at public installations to choosing
colour schemes for medical instrument panels, computational thinking can let designers
focus on the big picture creative decisions and synthesize well-designed interactions.
Harness new technologies more quickly Hardware engineering is rapidly expanding
the space of input devices and displays for user interaction. However, traditional research
on human-computer interaction, reliant on traditional design-evaluate cycles, struggles
to tackle these challenges quickly. Parametric interfaces, predictive, simulatable models,
crowdsourcing, and machine learning offer the potential to dramatically reduce the time
it takes to bring effective interactions to new engineering developments.
I.3 Elements
Computational interaction draws on a long history of modelling and algorithmic con-
struction in engineering, computer science, and the behavioural sciences. Despite the
apparent diversity of these disciplines, each provides a body of work that supports explain-
ing and enhancing complex interaction problems. They do so by providing an abstract,
often mathematical, basis for representing interaction problems, algorithms for automating
the calculation of predictions, and formal methods for analysing and reasoning about the
implications of the results.
These models differ in the way they represent elements of interaction: that is, how the
properties, events, processes, and relationships of the human and technology are formally
expressed and reasoned with. They may represent the structure of design, human behaviour,
or interaction, and they may represent some tendencies or values in those spaces. They
allow the computer to reason and test alternative hypotheses with the model, to ‘speculate’
with possibilities, and to go beyond observations. This can be achieved analytically, such as
when finding the minima of a function. However, in most cases, the models are complex and
permit no analytical solution, and an algorithmic solution is needed, in which case methods
like optimization or probabilistic reasoning can be used. Such computations may give rise to
predictions and interpretations that are emergent in the sense that they are not obvious given
lower-level data. Another property shared by these approaches is that the involved models
can be parameterized by some observations (data) that represent the problem at hand.
We have collected a few key concepts of interaction and design, which can be identified
behind the many formalisms presented in this book. This is not meant to exhaustively list
all concepts related to computational interaction and we invite the reader to identify, define,
and elaborate additional concepts central to computational interaction, such as dialogues,
dynamics, game theory, biomechanics, and many others.
Information Information theory, drawing from Shannon’s communication theory, mod-
els interaction as transmission of information, or messages, between the user and the
computer. Transmission occurs as the selection of a message from a set of possible messages and its transfer over a degraded (noisy, delayed, intermittent) channel. The rate of successful transmission, and the complexity of the messages that are passed, are criteria for better interaction. For instance, to use a keyboard, the user communicates to the computer
system via the human neuromuscular system. A user can be seen as a source communicating
messages in some message space over a noisy channel in order to transmit information from
the user’s brain into the computer system. Importantly, throughput and similar information
theoretical constructs can be taken as objectives for design, modelled using human perfor-
mance models such as Fitt’s law and the Hick–Hyman law, and operationalized in code to
solve problems. Applications include input methods, interaction techniques, and sensing
systems.
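These performance models are simple enough to operationalize directly. The sketch below computes Fitts' law movement-time predictions and Hick–Hyman choice reaction times; the intercept and slope coefficients are illustrative placeholders, not values fitted to any dataset:

```python
import math

def fitts_mt(distance, width, a=0.2, b=0.1):
    """Predicted movement time (s) under Fitts' law, MT = a + b * ID,
    where ID = log2(D/W + 1) is the index of difficulty in bits.
    In practice, a and b are fitted to pointing data for a given
    device and user population; these defaults are illustrative."""
    index_of_difficulty = math.log2(distance / width + 1)
    return a + b * index_of_difficulty

def hick_hyman_rt(n_alternatives, a=0.2, b=0.15):
    """Predicted choice reaction time (s) under the Hick-Hyman law,
    RT = a + b * log2(n + 1), for n equally likely alternatives."""
    return a + b * math.log2(n_alternatives + 1)

# A distant, small target is slower to acquire than a near, large one.
print(fitts_mt(distance=512, width=16))  # high index of difficulty
print(fitts_mt(distance=64, width=64))   # low index of difficulty
```

Such functions can serve directly as terms in a design objective, for example when scoring candidate keyboard layouts by their predicted text-entry throughput.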
Probability The Bayesian approach to interaction is explicit in its representation of
uncertainty, represented by probability distributions over random variables. Under this
viewpoint, interaction can be seen, for example, as the problem of inferring intention via
evidence observed from sensors. Intention is considered to be an unobserved random
variable existing in a user’s mind, about which the system maintains and updates a proba-
bilistic belief. An effective interaction is one that causes the belief distribution to converge
to the user’s true intention. This approach allows informative priors to be specified across
potential actions and rigorously combined with observations. Uncertainty about intention
can be propagated through the interface and combined with utility functions to engage
functionality using decision-theoretic principles. As a consequence of proper representation
of uncertainty, Bayesian approaches offer benefits in terms of robustness in interaction.
One major advantage of the probabilistic view is that many components at multiple levels
of interaction can be integrated on a sound basis, because probability theory serves as a
unifying framework; for example, linking probabilistic gesture recognizers to probabilistic
text entry systems. Tools such as recursive Bayesian filters and Gaussian process regression
can be used directly in inferring user intention.
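A minimal sketch of such recursive Bayesian inference of intent, assuming a one-dimensional pointer with Gaussian motor noise and three hypothetical on-screen targets (all coordinates and the noise level are invented for illustration):

```python
import math

def bayes_update(prior, likelihoods):
    """One step of recursive Bayesian filtering over discrete intents:
    posterior is proportional to prior times likelihood, renormalized."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def gaussian_likelihood(observation, target, sigma=20.0):
    """Likelihood of a 1-D pointer observation given an intended target,
    assuming Gaussian motor noise with standard deviation sigma (px)."""
    return math.exp(-((observation - target) ** 2) / (2 * sigma ** 2))

targets = [100.0, 300.0, 500.0]   # hypothetical button centres (px)
belief = [1 / 3, 1 / 3, 1 / 3]    # uniform prior over intents

# Noisy pointer samples drifting toward the middle target.
for obs in [260.0, 290.0, 310.0]:
    belief = bayes_update(belief, [gaussian_likelihood(obs, t) for t in targets])

print(belief)  # the belief concentrates on the 300 px target
```

An informative prior, such as command usage frequencies, would simply replace the uniform initial belief, and a utility function over actions could then trigger functionality once the expected utility of acting exceeds that of waiting for more evidence.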
Learning Machine learning equips a computer with the ability to use data to make
increasingly better predictions. It can transform static data into executable functionality,
typically by optimizing parameters of a model given a set of observations to minimize
some loss. In computational interaction, machine learning is used to both predict likely user
behaviour and to learn mappings from sensing to estimated intentional states. Supervised
machine learning formulates the problem as learning a function y = f(x) that transforms
observed feature vectors to an output space for which some training set of observations is available. After a training phase, predictions of y can be made for unseen x. Estimation
of continuous values such as, for instance, intended cursor position, is regression, and
estimation of discrete values such as, for example, distinct gesture classes, is classification.
Supervised machine learning can replace hand-tweaking of parameters with data-driven
modelling. There are many high-performing and versatile tools for supervised learning,
including support vector machines, deep neural networks, random forests, and many others.
Unsupervised learning learns structure from data without a set of matching target values.
Techniques such as manifold learning (learning a simple smooth low-dimensional space
that explains complex observations) and clustering (inferring a set of discrete classes) have
potential in exploring and eliciting interactions. Unsupervised learning is widely used in
recommender systems and user-modelling in general, often with an assumption that users
fall into distinct clusters of behaviour and characteristics.
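As a toy instance of supervised learning of y = f(x), the sketch below trains a nearest-centroid classifier on invented two-dimensional gesture features. It is a deliberately minimal stand-in for the support vector machines and neural networks mentioned above, but it exhibits the same structure: a training phase on labelled data, then prediction for unseen inputs:

```python
def train_nearest_centroid(examples):
    """Fit a minimal supervised classifier: average the feature vectors
    of each labelled class. 'examples' maps class label -> list of vectors."""
    centroids = {}
    for label, vectors in examples.items():
        dim = len(vectors[0])
        centroids[label] = [sum(v[i] for v in vectors) / len(vectors)
                            for i in range(dim)]
    return centroids

def classify(centroids, x):
    """Predict y = f(x): the label whose centroid is nearest to x."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))

# Hypothetical 2-D features (say, stroke length and curvature)
# for two gesture classes.
training = {
    "swipe": [[0.9, 0.1], [1.0, 0.2], [0.8, 0.15]],
    "circle": [[0.4, 0.9], [0.5, 1.0], [0.45, 0.85]],
}
model = train_nearest_centroid(training)
print(classify(model, [0.95, 0.12]))  # "swipe"
print(classify(model, [0.5, 0.9]))    # "circle"
```

Replacing hand-tweaked recognition thresholds with such data-driven fits is exactly the move from hand-tuning to learning that the paragraph describes.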
Optimization Optimization refers to the process of obtaining the best solution for a
defined problem. For example, a design task can be modelled as a combination of elementary
decisions, such as which functionality to include, which widget type to use, where to place
an element, and so on. Optimization can also use constraints to rule out infeasible designs.
Several approaches exist to modelling design objectives, ranging from heuristics to detailed
behavioural, neural, cognitive, or biomechanical models. The benefit of formulating a design
problem like this is that powerful solution methods can be exploited to find the best designs automatically, rooted in decades of research on optimization algorithms for both offline and online settings.
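A minimal sketch of this formulation: elementary placement decisions, a frequency-weighted objective, a feasibility constraint, and exhaustive search. All frequencies and access times below are hypothetical, and realistic design spaces require the more scalable solvers discussed elsewhere in this book rather than brute-force enumeration:

```python
from itertools import permutations

def expected_cost(assignment, freq, slot_time):
    """Expected selection time of a layout: sum over elements of usage
    frequency times the access time of the slot the element occupies."""
    return sum(freq[e] * slot_time[s] for s, e in enumerate(assignment))

elements = ["save", "open", "export"]
freq = {"save": 0.6, "open": 0.3, "export": 0.1}  # hypothetical usage frequencies
slot_time = [0.4, 0.7, 1.0]                        # access time per slot (s)

# Constraint ruling out an infeasible design: "export" must not
# occupy the first (fastest) slot.
best = min(
    (p for p in permutations(elements) if p[0] != "export"),
    key=lambda p: expected_cost(p, freq, slot_time),
)
print(best)  # the most frequent element gets the fastest slot
```

The objective here is a crude heuristic; it could be swapped for the Fitts'-law or cognitive-model predictions discussed above without changing the structure of the search.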
States State machines are a powerful formalism for representing states and transitions
within an interface. This model of interaction captures the discrete elements of interfaces
and represents states (and their groupings) and events that cause transitions between
states. Formalisms such as finite-state machines (FSMs), and specification languages such
as statecharts, allow for precise analysis of the internal configurations of interfaces. Explicit
modelling permits rigorous analysis of the interface properties, such as reachability of
functionality, critical paths, bottlenecks, and unnecessary steps. Graph properties of FSMs
can be used to model or optimize interfaces; for example, outdegree can be used to study the
number of discrete controls required for an interface. State machines offer both constructive
approaches in rigorously designing and synthesizing interfaces and in analysing and charac-
terizing existing interfaces to obtain quantifiable metrics of usability.
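The sketch below illustrates such analysis on a toy interface: a reachability check over an FSM given as a transition table, plus per-state outdegree as a crude proxy for the number of discrete controls each state requires. The media-player states and events are invented for illustration:

```python
def reachable(transitions, start):
    """States reachable from 'start' in a finite-state machine given as
    {state: {event: next_state}} -- a traversal of the transition graph."""
    frontier, seen = [start], {start}
    while frontier:
        state = frontier.pop()
        for nxt in transitions.get(state, {}).values():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# A hypothetical media-player interface as a transition table.
ui = {
    "stopped": {"play": "playing", "menu": "settings"},
    "playing": {"pause": "paused", "stop": "stopped"},
    "paused": {"play": "playing", "stop": "stopped"},
    "settings": {"back": "stopped"},
}
print(sorted(reachable(ui, "stopped")))  # all four states are reachable

# Outdegree: how many discrete controls each state exposes.
print({s: len(events) for s, events in ui.items()})
```

A state unreachable from the start state, or reachable only through a long critical path, would surface immediately in such an analysis, which is the kind of property that is hard to spot by inspection alone.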
Control With roots in cybernetics and control engineering, control theory provides a
powerful formalism for reasoning about continuous systems. In applications to human-
technology interaction, the user is modelled as a controller aiming to change a control signal
to a desired level (the reference) by updating its behaviour according to feedback about
the system state. The design of the system affects how well the user can achieve the goal
given its own characteristics. Control theory views interaction as continuous, although a
computer may register user behaviour as a discrete event. Modelling interaction as a control
system with a feedback loop overcomes a fundamental limitation of stimulus–response
based approaches, which disregard feedback. The control paradigm permits multi-level
analysis, tracing the progression of user-system behaviour over time (as a process) to explain eventual outcomes to the user. It allows insight into the consequences of changing properties of the user interface, or of the user.
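As a minimal illustration, the sketch below simulates a user modelled as a proportional controller: at every step the control action is proportional to the error between the reference (the desired value, such as a target position) and the current state. The gain is an illustrative parameter, not a fitted human one:

```python
def simulate_p_control(reference, steps, gain=0.3, state=0.0):
    """Simulate a feedback loop in which the controller (the user) acts
    on the observed error between reference and state at each time step,
    e.g. a cursor closing in on a target position."""
    trajectory = [state]
    for _ in range(steps):
        error = reference - state
        state += gain * error  # feedback: the action depends on the error
        trajectory.append(state)
    return trajectory

traj = simulate_p_control(reference=100.0, steps=20)
print(round(traj[-1], 2))  # the state converges toward the reference of 100
```

Changing the gain, adding delay, or injecting noise into the observed error corresponds to changing properties of the user or the interface, and the model then traces how those changes propagate to eventual outcomes.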
Rationality Rational analysis is a theory of human decision-making originating in
behavioural economics and psychology of decision-making. The assumption is that people
strive to maximize utility in their behaviour. Bounded rationality is the idea that rational
behaviour is constrained by capacity and resource limitations. When interacting, users
pursue goals or utility functions to the best of their capability within constraints posed by
user interfaces, environments, and tasks. The idea of bounded rationality is explored in
information foraging theory and economic models of search.
Agents Bounded agents are models of users that take action and optimally adapt their
behaviour to given constraints: environments and capabilities. The bounds include not
only those posed by the environment, which includes the interface, but limitations on
the observation and cognitive functions and on the actions of the agent. These bounds
define a space of possible policies. The hypothesis is that interactive behaviour is rationally
adapted to the ecological structure of interaction, cognitive and perceptual capacities, and
the intrinsic objectives of the user. The interactive problem can be specified, for example, as
a reinforcement learning problem, or a game, and behaviour emerges by finding the optimal
behavioural policy or program for the utility maximization problem. The recent interest in
computationally implemented agents is due to the benefit that, when compared with classic
cognitive models, they require no predefined specification of the user’s task solution, only
the objectives. Increasingly powerful representations and solution methods have emerged
for bounded agents in machine learning and artificial intelligence.
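To make this concrete, the specification of an interactive problem as a reinforcement learning problem can be sketched in a few lines. The 1-D pointing task, the motor-noise model, and the step costs below are illustrative assumptions, not taken from any chapter of this book; the point is only that the policy emerges from optimization rather than from a hand-coded task solution.

```python
# Illustrative sketch (not from the chapter): a bounded agent for a 1-D
# pointing task, specified as a reinforcement learning problem and solved
# by value iteration. The grid, noise model, and costs are assumptions.
N, TARGET = 11, 8
ACTIONS = (-1, +1)            # move cursor left or right
SLIP = 0.1                    # motor noise: the action occasionally slips

def step_distribution(s, a):
    """Next-state distribution under the assumed noisy motor model."""
    intended = min(max(s + a, 0), N - 1)
    if intended == s:
        return {s: 1.0}
    return {intended: 1.0 - SLIP, s: SLIP}

def q_value(V, s, a, gamma=0.95):
    # Each step costs 1; the agent maximizes (negative) expected cost.
    return sum(p * (-1.0 + gamma * V[s2])
               for s2, p in step_distribution(s, a).items())

def value_iteration(gamma=0.95, iters=200):
    V = [0.0] * N
    for _ in range(iters):
        for s in range(N):
            if s != TARGET:   # target is terminal with zero future cost
                V[s] = max(q_value(V, s, a, gamma) for a in ACTIONS)
    return V

V = value_iteration()
# Behaviour emerges from the optimization: the greedy policy moves the
# cursor toward the target from either side, without a hand-coded plan.
policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a))
          for s in range(N) if s != TARGET}
print(policy[2], policy[10])   # → 1 -1
```

Only the objective (reach the target at minimum expected cost) and the bounds (motor noise, grid limits) are specified; the behavioural policy is derived, which is the distinction from classic cognitive models drawn above.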
The chapters of this book present further details on the assumptions, implementation,
applications, as well as limitations of these elementary concepts.
I.4 Outlook
The chapters in this book manifest intellectual progress in the study of computational
principles of interaction, demonstrated in diverse and challenging applications areas such
as input methods, interaction techniques, graphical user interfaces, information retrieval,
information visualization, and graphic design. Much of this progress may have gone
unnoticed in mainstream human-computer interaction because research has been published
OUP CORRECTED PROOF – FINAL, 22/12/2017, SPi
10 | introduction
in disconnected fields. By coalescing efforts and expanding the scope of computationally
solvable interaction problems, an exciting vista opens up for future research.
Both the potential of and the greatest challenge in computational interaction lie in
mathematics and algorithms. A shared objective for research is the mathematical formulation
of phenomena in human use of technology. Only by expanding formulations can one
devise new methods and approaches for new problems. On the one hand, mathematics and
algorithms are the most powerful known representations for capturing complexity. On the
other hand, the complexity of a representation must match the complexity of the behaviour
it tries to capture and control. This means that the only way out of narrow applications is
via increasingly complex models. The challenge is how to obtain and update them without
losing control and interpretability.
A significant frontier, therefore, is to try to capture those aspects of human behaviour and
experience that are essential for good design. Whether users are satisfied with a design is
determined not only by how much time they spend accomplishing the task, but also by
its aesthetic aspects, the ease of learning, whether it fits the culture of certain regions, and so on.
Interaction is also often coupled with the environment and the situated contexts of the user.
For example, a user's typing behaviour depends on whether the user is walking, whether the
user is encumbered, and the way the user is holding the device. The system itself is also
often coupled to the environment. Moreover, computational interaction should not be restricted
to modelling a single user's interaction with a single computer system. A single user may
interact with many devices, many users may interact with a single computer system, or
many users may interact with many computer systems in order to, for instance, carry out a
shared task objective. These challenges call for collaboration with behavioural and social
scientists.
This increasing scale and complexity will also pose a challenge for algorithms. Algorithms
underpin computational interaction, and for systems implementing principles of compu-
tational interaction it is important that the underlying algorithms scale with increasing
complexity. For example, naive optimization is often infeasible due to the complexity of
the optimization problem. However, it is often relatively straightforward to search for an
approximately optimal solution. Complexity is multifaceted in computational interaction
and may, for instance, concern the expressiveness of a system (for instance, the number
of gestures recognized by the system), the ability of a solution to satisfy multi-objective
criteria, or the need to manage a complex and constantly changing environment. Increasing our
ability to deal with complex real-world phenomena implies that we need to search for more
efficient ways to update models with data. Some computational models are heavily reliant on
representative training data. A challenge in data collection is to ensure that the data accurately
reflect realistic interaction contexts, which may be dynamic and constantly changing,
and that they capture the variability of different user groups. Moreover, data may not be
available until the particular user has started adopting the design. Such a 'chicken and
egg' dilemma has long been a problem in computational interaction: interaction data are
needed for designing interfaces or interactive systems; yet the data will not be available until
the design or system is available and users start adopting it. These challenges call for
collaboration with computer and computational scientists.
....................................................................................................
references
Anderson, J. R., 2014. Rules of the mind. Hove, UK: Psychology Press.
Brusilovsky, P., and Millan, E., 2007. User models for adaptive hypermedia and adaptive educational
systems. In: P. Brusilovsky, A. Kobsa, and W. Nejdl, eds. The Adaptive Web: Methods and Strategies
of Web Personalization. Berlin: Springer, pp. 3–53.
Card, S. K., Newell, A., and Moran, T. P., 1983. The psychology of human-computer interaction. New
York, NY: Lawrence Erlbaum.
Carroll, J. M., 1997. Human-computer interaction: psychology as a science of design. Annual Review
of Psychology, 48(1), pp. 61–83.
Carroll, J. M., and Campbell, R. L., 1989. Artifacts as psychological theories: The case of human-
computer interaction. Behaviour and Information Technology, 8, pp. 247–56.
Craik, K. J. W., 1947. Theory of the human operator in control systems: 1. the operator as an
engineering system. British Journal of Psychology General Section, 38(2), pp. 56–61.
Cross, N., 2011. Design thinking: Understanding how designers think and work. Oxford: Berg.
Dix, A. J., 1991. Formal methods for interactive systems. Volume 16. London: Academic Press.
Eisenstein, J., Vanderdonckt, J., and Puerta, A., 2001. Applying model-based techniques to the devel-
opment of UIs for mobile computers. In: Proceedings of the 6th International Conference on Intelligent
User Interfaces. New York, NY: ACM, pp. 69–76.
Fisher, D. L., 1993. Optimal performance engineering: Good, better, best. Human Factors, 35(1), pp.
115–39.
Fitts, P. M., and Peterson, J. R., 1964. Information capacity of discrete motor responses. Journal of
Experimental Psychology, 67(2), pp. 103–12.
Gajos, K., and Weld, D. S., 2004. SUPPLE: automatically generating user interfaces. In: Proceedings of
the 9th International Conference on Intelligent User Interfaces. New York, NY: ACM, pp. 93–100.
Gray, W. D., and Boehm-Davis, D. A., 2000. Milliseconds matter: An introduction to microstrategies
and to their use in describing and predicting interactive behavior. Journal of Experimental Psychol-
ogy: Applied, 6(4), pp. 322–35.
Harrison, M., and Thimbleby, H., eds., 1990. Formal methods in human-computer interaction. Volume 2.
Cambridge: Cambridge University Press.
Hollnagel, E., and Woods, D. D., 2005. Joint cognitive systems: Foundations of cognitive systems engineer-
ing. Columbus, OH: CRC Press.
Horvitz, E., Breese, J., Heckerman, D., Hovel, D., and Rommelse, K., 1998. The Lumiere project:
Bayesian user modeling for inferring the goals and needs of software users. In: UAI ‘98: Proceedings
of the Fourteenth Conference on Uncertainty in Artificial Intelligence. San Francisco, CA: Morgan
Kaufmann, pp. 256–65.
Jagacinski, R. J., and Flach, J. M., 2003. Control Theory for Humans: Quantitative approaches to modeling
performance. Mahwah, NJ: Lawrence Erlbaum.
Kieras, D. E., and Hornof, A., 2017. Cognitive architecture enables comprehensive predictive
models of visual search. Behavioral and Brain Sciences, 40. DOI: 10.1017/S0140525X16000121.
Kieras, D. E., Wood, S. D., and Meyer, D. E., 1997. Predictive engineering models based on the
EPIC architecture for a multimodal high-performance human-computer interaction task. ACM
Transactions on Computer-Human Interaction (TOCHI), 4(3), pp. 230–75.
Kleinman, D. L., Baron, S., and Levison, W. H., 1970. An optimal control model of human response
part I: Theory and validation. Automatica 6(3), pp. 357–69.
Light, L., and Anderson, P., 1993. Designing better keyboards via simulated annealing. AI Expert, 8(9).
Available at: http://scholarworks.rit.edu/article/727/.
Navarre, D., Palanque, P., Ladry, J. F., and Barboni, E., 2009. ICOs: A model-based user
interface description technique dedicated to interactive systems addressing usability, relia-
bility and scalability. ACM Transactions on Computer-Human Interaction (TOCHI), 16(4).
<doi:10.1145/1614390.1614393>.
Newell, A., 1994. Unified Theories of Cognition. Cambridge, MA: Harvard University Press.
Payne, S. J., and Howes, A., 2013. Adaptive interaction: A utility maximization approach to under-
standing human interaction with technology. Synthesis Lectures on Human-Centered Informatics,
6(1): pp. 1–111.
Picard, R. W., 1997. Affective Computing. Volume 252. Cambridge, MA: MIT Press.
Pirolli, P., and Card, S. K., 1999. Information Foraging. Psychological Review, 106(4), pp. 643–75.
Sanders, M. S., and McCormick, E. J., 1987. Human Factors in Engineering and Design. Columbus, OH:
McGraw-Hill.
Seow, S. C., 2005. Information theoretic models of HCI: a comparison of the Hick-Hyman law and
Fitts' law. Human-Computer Interaction, 20(3), pp. 315–52.
Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N., 2016. Taking the human out of
the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), pp. 148–75.
Sheridan, T. B., and Ferrell, W. R., 1974. Man-machine Systems: Information, Control, and Decision
Models of Human Performance. Cambridge, MA: MIT Press.
Simon, H. A., 1996. The Sciences of the Artificial. Cambridge, MA: MIT Press.
Sutton, R. S., and Barto, A. G., 1998. Reinforcement learning: An introduction. Volume 1, Number 1.
Cambridge, MA: MIT Press.
Thimbleby, H., 2010. Press On: Principles of Interaction Programming. Cambridge, MA: MIT Press.
Wickens, C. D., Hollands, J. G., Banbury, S., and Parasuraman, R., 2015. Engineering Psychology &
Human Performance. Hove, UK: Psychology Press.
Zhai, S., 2004. Characterizing computer input with Fitts’ law parameters—the information and
non-information aspects of pointing. International Journal of Human-Computer Studies, 61(6),
pp. 791–809.
Zhai, S., Hunter, M., and Smith, B. A., 2002. Performance optimization of virtual keyboards.
Human–Computer Interaction, 17(2–3), pp. 229–69.
PA RT I
Input and Interaction Techniques
1
1.1 Introduction
What do we really mean when we talk about Human–Computer Interaction (HCI)? It is a
subject with few firm, agreed foundations. Introductory textbooks tend to use phrases like
'designing spaces for human communication and interaction', or 'designing interactive prod-
ucts to support the way people communicate and interact in their everyday lives' (Rogers,
Sharp, and Preece, 2011). Hornbæk and Oulasvirta (2017) provide a recent review of the
way different HCI communities have approached this question, but only touch briefly on
control approaches. Traditionally, HCI research has viewed the challenge as communication
of information between the user and computer, and has used information theory to represent
the bandwidth of communication channels into and out of the computer via an interface:
'By interaction we mean any communication between a user and a computer, be it direct or
indirect' (Dix, Finlay, Abowd, and Beale, 2004), but this does not provide an obvious way
to measure the communication, or whether the communication makes a difference.
The reason that information theory is not sufficient to describe HCI is that, in order to
communicate the simplest symbol of intent, we typically need to move our bodies in some
way that can be sensed by the computer, often based on feedback while we are doing it.
Our bodies move in a continuous fashion through space and time, so any communication
system is going to be based on a foundation of continuous control. However, inferring the
user’s intent is inherently complicated by the properties of the control loops used to generate
the information—intention in the brain becomes intertwined with the physiology of the
human body and the physical dynamics and transducing properties of the computer’s input
device. In a computational interaction context, the software adds a further complication to
the closed-loop behaviour (Figure 1.1). Hollnagel (1999) and Hollnagel and Woods (2005)
make a compelling argument that we need to focus on how the joint human–computer
system performs, not on the communication between the parts.
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
[Figure 1.1: block labels — Effectors, Sensors, Perception, Display.]
Another reason that control theory can be seen as a more general framework is that
often the purpose of communicating information to a computer is to control some aspect
of the world, whether this be the temperature in a room or the volume of a music player,1
the destination of an autonomous vehicle or some remote computational system. This can
be seen in Figure 1.2, which illustrates the evolution of human–machine symbiosis from
direct action with our limbs, via tools and powered control. Over time this has led to an
increasing indirectness of the relationship between the human and the controlled variable,
with a decrease in required muscular strength and an increasing role for sensing and thought
(Kelley, 1968). The coming era of Computational Interaction will further augment or replace
elements of the perceptual, cognitive, and actuation processes in the human with artificial
computation, so we can now adapt the original figure from Kelley (1968) to include control
of computationally enhanced systems, where significant elements of:
1 One might think the reference here is the loudness of the music, but in many social contexts it is probably
the inferred happiness of the people in the room that is actually being controlled, and any feedback from volume
indicators are just intermediate variables to help the user.
[Figure 1.2: block labels — Feedback information; Control signal; Modulated power; Perceptual information; Human; Control junction; Tool/Device; Controlled variable; Power source.]
Figure 1.2 The evolution of human control over the environment from (a) direct muscle power
via (b) use of specialised tools for specific tasks and (c) externally powered devices, potentially
regulated by automatic controllers. Adapted from (Kelley, 1968). Grey lines indicate optional
feedback connections.
[Figure 1.3: block labels — Goals; Human; Effectors; Computational perception of human intent; Computationally augmented control; Interface; Computer; Controlled system; Augmented perception; Computational perception; Display.]
Figure 1.3 The next step in evolution: control with power- and computationally-enhanced devices.
Grey blocks indicate computational intelligence enhancement.
from Craik’s (1947, 1948) early, war-related work, and became better known in the
broader framing of Wiener’s Cybernetics (1948). As observed by Wickens and Hollands
(1999), the approach to modelling human control behaviour came from two major schools:
the skills researchers and the dynamic systems researchers. The ‘skills’ group often focused
on undisturbed environments, while the ‘dynamics’, or ‘manual control theory’ approach
(e.g., Kelley, 1968; Sheridan and Ferrell, 1974) tended to seek to model the interaction
of humans with machines, for example, aircraft pilots or car drivers, usually driven by
engineering motivations and the need to eliminate error, making it a closed-loop system.
The ‘skills’ group tended to focus on learning and acquisition while the ‘dynamics’ group
focused on the behaviour of a well-trained operator controlling dynamic systems to make
them conform with certain space-time trajectories in the face of environmental uncertainty.
This covers most forms of vehicle control, or the control of complex industrial processes.
Poulton (1974) reviews the early tracking literature, and an accessible textbook review of
the basic approaches to manual control can be found in Jagacinski and Flach (2003). Many
of the earlier models described here were based on frequency domain approaches, where
the human and controlled system were represented by Laplace transforms representing
their input/output transfer function. Optimal control theoretic approaches used in the
time domain are described in (Kleinman, Baron, and Levison, 1970, 1971). The well-
established field of human motor control theory, e.g., (Schmidt and Lee, 2005), which seeks
to understand how the human central nervous system controls the body, is an important
component of using control theory in HCI, but this chapter focuses on the basic role of
control concepts in HCI.
[Figure 1.4: block labels — Desired state r; Controller; Control signal; Control effector; Controlled system; Controlled variable; Environmental influences/disturbances.]
the state space. The concepts of state variable, state, and state space are important tools which
can help designers understand the problem. The choice of state variables for your model is a
statement about the important elements of the system. The state dimensions combine to
form the state space. Behaviour is visualized as movement through this space, and the values
of state reflect the position compared to important landmarks such as the goal.2 Problem
constraints can be represented as boundaries in the state space and qualitative properties of
the system can be described as regions in the state space (Bennett and Flach, 2011).
Inputs and Outputs: The controller generates control inputs to the controlled system.
Transformations of the system state are observed, i.e., the outputs. The concepts of controlla-
bility and observability are important, as they describe the degree to which a control system
can observe and control the states of a particular system.
Open/Closed Loop: If an input is transformed in various ways but the control variable
does not depend on feedback from the system state, the system is described as open loop.
Closed loop systems have feedback from the state to the controller, which affects the input to
the controlled system.
Disturbances: External or unpredictable effects which affect the system state. In an open-
loop system, the controller is unable to compensate for these, while a closed-loop controller
can observe the error and change its control variable to compensate.
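The difference between the two loop structures can be sketched in a few lines of simulation. This is a minimal illustration under assumed values: a simple integrator plant, a constant disturbance, and a proportional feedback gain, none of which come from the chapter.

```python
# Illustrative sketch (system, gain, and disturbance values assumed):
# a simple integrator plant x' = u + d, driven toward reference r = 1.0
# with Euler integration. The open-loop controller applies a precomputed
# input and ignores the state; the closed-loop controller feeds back the
# observed error, so only it can compensate for the disturbance d.
def simulate(closed_loop, d=0.3, r=1.0, k=2.0, dt=0.01, steps=1000):
    x = 0.0
    for i in range(steps):
        if closed_loop:
            u = k * (r - x)                   # proportional feedback on error
        else:
            u = r if i * dt < 1.0 else 0.0    # pre-planned open-loop input
        x += (u + d) * dt                     # disturbance enters the plant
    return x

open_final = simulate(closed_loop=False)      # drifts away under disturbance
closed_final = simulate(closed_loop=True)     # settles near r + d/k = 1.15
print(round(open_final, 2), round(closed_final, 2))   # → 4.0 1.15
```

Note the residual offset d/k in the closed-loop case: proportional feedback compensates for the disturbance only partially, which is one reason integral action is commonly added in practical controllers.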
Stability: In technical control systems, stability is a key aspect of design. This has been
less of an issue in modern HCI, but is important in automotive and aircraft control,
where interactions between the human operator and technical control systems can lead to
instability, and ‘pilot-induced oscillations’. Stability is relevant not only for equilibria, but
also for periodic motions. For example, a pendulum is typically stable to minor perturbations
around its normal limit cycle.
Feedback: The purpose of the display is to provide the user with information needed to exercise control;
i.e., predict consequences of control alternatives, evaluate status, and plan control actions, or
2 Note that for dynamic systems position in a state space can describe a rapidly changing situation.
[Figure 1.5: block labels — Desired state r; Human; Controller; Control signal; Control effector; Controlled variable; Perceptual information; Environmental influences/disturbances.]
Figure 1.5 Human–Computer control loop where the ‘control’ element is partitioned into human-
controlled and system-controlled elements.
One important issue is that for most interaction tasks you are moving a cursor towards an
area, not a point, so the conditions for selection are met before the error is brought to zero. As
the size of the area increases relative to the distance travelled (a decreasing index of difficulty
in Fitts’ terminology), this becomes more significant. Müller, Oulasvirta, and Murray-Smith
(2017) found that users tended not to minimize the error in the final endpoint and used
different control behaviours when faced with larger targets. We will discuss this again in
Section 1.6.1.
3 Note that in control theory the term transfer function tends to refer to a linear time-invariant (LTI) system in
Laplace or Fourier transform representation.
1.3.2 Variability
Sources of variability in human action include noise, trajectory planning and delays in the
feedback loop. In some tasks motor planning can be complex, but with most interfaces being
designed to have simple dynamics, and human limbs being typically well controlled over
the range of motions used, most user interface (UI) interaction is still relatively simple. The
variability is therefore dominated by the human’s predictive control interacting with delayed
feedback, and the variability in defining the timing of changes of motion. If the user moves
extremely slowly then the effects of human lags and delays are negligible.
Will control models do a better job of representing user variability? Most of the historical
applications of manual control models did not focus on variability, but the intermittent
predictive control models have done a better job of this. Gawthrop, Lakie, and Loram
(2008) demonstrate that a simple predictive controller can be consistent with Fitts’ law,
while non-predictive controllers cannot. The same paper also presents the link between
intermittent control, predictive control, and human motor control, which is further developed
in (Gawthrop, Loram, Lakie, and Gollee, 2011; Gawthrop, Gollee, and Loram, 2015).
We can see the interaction of difficulty of the targeting task with the speed and variabil-
ity of human movement, even in simple one-dimensional spatial targeting tasks (Müller,
Oulasvirta, and Murray-Smith, 2017). These experiments make clear the anticipatory, or
predictive element in human action, as the behaviour for different levels of difficulty changes
well before the user reaches the target.
Quinn and Zhai (2016) describe the challenges associated with understanding user
behaviour when performing gestural input, focusing on gesture keyboards. They highlight
how users are not tightly constrained, but are attempting to reproduce a prototype shape
as their goal. Earlier models assumed that the motor control process was similar to serially
aiming through a series of landmarks. They used a minimum-jerk model of motor control
to model human behaviour and to describe the variability in user control of gesture keyboards,
which involves several general factors, including an inclination towards biomechanical fluidity,
speed–accuracy trade-offs, awareness of error tolerances in the algorithms, visual online
feedback from the interface, and errors due to sensorimotor noise and mental and
cognitive errors.
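The minimum-jerk model mentioned above has a well-known closed form for a point-to-point movement (the quintic polynomial of Flash and Hogan); the endpoints and duration in this sketch are illustrative assumptions.

```python
# Minimum-jerk trajectory between x0 and xf over duration T (Flash & Hogan
# closed form): x(t) = x0 + (xf - x0) * (10 s^3 - 15 s^4 + 6 s^5), s = t/T.
# The endpoints and duration used here are illustrative assumptions.
def min_jerk(x0, xf, T, t):
    s = t / T
    return x0 + (xf - x0) * (10 * s**3 - 15 * s**4 + 6 * s**5)

# Boundary conditions: starts at x0, ends at xf, with zero velocity and
# zero acceleration at both endpoints, giving the smooth bell-shaped
# velocity profile characteristic of aimed human movements.
xs = [min_jerk(0.0, 1.0, 1.0, t / 10) for t in range(11)]
print(xs[0], round(xs[5], 3), xs[10])   # → 0.0 0.5 1.0
```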
[Figure 1.6: labels — Wheel hard left/right; User trajectory c(s); Constraint (constant wheel angle); Car trajectory w(s); Continuation path; Edge of road.]
Figure 1.6 Two-dimensional ‘tunnel’ task (left). A classical automotive steering task (right).
(see Crossman, 1960) in part by making it easier for the user to use prediction to plan the
exact timing of their actions, and avoid the impact of delays.
4 The analysis of boundary crossings in (Accot and Zhai, 2002) is closely related, especially if there is a
sequence of multiple crossings needed to achieve a specific goal. This can be seen as a discretization of the steering
law task, with intermittent constraints.
but did hypothesize a likely relationship of the tangential velocity v(s) ∝ ρ(s)W(s), where
ρ(s) is the local radius of curvature. The goal here is to create a model of the probability
density function from the current state (x, v) at t_n for the user’s future behaviour at t_{n+1},
t_{n+2}, and so on. Because of motor variability, this will naturally spread out spatially over time in
areas of high curvature, returning close to the optimal path in easier areas. The nature of the
trade-off between spread of trajectories and variation in speed will depend on the implicit
cost function under which the user is operating. Are they being cautious, never breaching the
constraints, or more risk-taking, increasing speed?
Appropriately identified models of human tunnel-following behaviour would allow us to
create a more appropriate function for inferring intent than simply detecting that a user had
exited the tunnel. Predicting the likely dynamic behaviour for a user attempting the tunnel
at a given speed could allow us to infer which of N targets was most likely. An interesting
extension of this is instead of speed-accuracy trade-offs, we could look at effort-accuracy
trade-offs, where users might choose to provide less precision in the location or timing of
their actions. (The techniques can link closely to methods used for filtering noisy inputs).
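One hedged way to realize such a model is a Monte Carlo rollout of a noisy controller through a narrowing tunnel. Everything below (the weak centring controller, the noise level, and the tunnel geometry) is an illustrative assumption; the sketch only shows how a density of future trajectories, and hence a breach probability, could be estimated by sampling.

```python
import random

# Illustrative Monte Carlo sketch: the controller, noise level, and tunnel
# geometry are all assumptions. A noisy point controller is rolled out along
# a straight tunnel whose half-width narrows from 1.0 to 0.2, and the
# probability of breaching the constraint is estimated from the samples.
random.seed(0)

def rollout(steps=100, noise=0.05):
    y, path = 0.0, []
    for _ in range(steps):
        y += -0.2 * y + random.gauss(0.0, noise)   # weak centring + motor noise
        path.append(y)
    return path

def breach_probability(n=2000, steps=100):
    breaches = 0
    for _ in range(n):
        for i, y in enumerate(rollout(steps)):
            half_width = 1.0 - 0.8 * i / steps     # tunnel narrows over time
            if abs(y) > half_width:                # constraint breached
                breaches += 1
                break
    return breaches / n

# The sample paths spread under motor noise, so the narrowing tunnel is
# breached with some estimable probability at a given speed/noise level.
p = breach_probability()
print(0.0 <= p <= 1.0)   # → True
```

In an intent-inference setting, the same sampled density could be evaluated against N candidate tunnels to score which target the user is most plausibly pursuing.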
Feedback during the process can change closed-loop performance significantly. For
example, when examining touch trajectories associated with the ‘slide to open’ on an iPhone,
we could view it as a response to a spatially constrained trajectory following task, but because
of the visual metaphors the user typically perceives it as dragging the slider (which in itself
is a direct position control task) to activate the device. The physical metaphor might make
sense for a new user, but as the user becomes more skilled and confident, they may be
keen to make faster or sloppier movements to achieve the same end. The
actual constraints on the touch input need not correspond to the limits of the drawn object,
depending on the context and design trade-offs. For example, if the system sensed that the
user was walking while using the slide-to-open feature, it could be more forgiving about the
constraints than in a stationary context, or a confident, cleanly contacting, fast swipe might
be allowed to unlock the device, even if it were at the wrong angle.
1.3.5 Gestures
Gestures can be viewed as generating a spatial trajectory through time, and can therefore
be represented by differential equations, and tolerances around a prototypical gesture. For a
review of gestures in interaction, see (Zhai, Kristensson, Appert, Andersen and Cao, 2012).
The underlying control task, then, is almost identical to that of Section 1.3.4, although
typically in gestures there is no visual guide for the user to follow: they are expected to
have memorized the gestures, and implicit constraints, such that they can be generated in an
open-loop fashion. In some cases, a rough reference framework is provided for the gesture.
[Figure 1.7: panels — ‘basic_touch_2cm_fast filtered’: Filtered X time series; x/dx and x/d2x phase plots; axes: Time (s), x, Value, dx, d2x.]
Figure 1.7 Examples of time-series (left) and phase space plots in velocity (middle) and accelera-
tion (right) against position for a finger moving above a touch screen. Note how the particular style
of movement corresponds to a constrained region of the phase space.
For example, the Android gesture lock screen provides a grid of points as a framework
for users’ gestures. This reduces the possible gesture prototypes to a discrete set, as the
gesture passes through these subgoals, and helps normalize the users’ spatial performance
and transparently removes the variability factor of timing.
A differential equation representation of gestures is used by Visell and Cooperstock
(2007). A specific example of differential equations for gesture is that of generating con-
trolled cyclic behaviour as discussed in Lantz and Murray-Smith (2004). The advan-
tage of rhythmic gestures is that they can be repeated until the system recognizes them,
whereas ballistic gestures are more frustrating to repeat from scratch if the system fails to
recognize them.
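As an illustrative sketch of such a differential-equation representation (the choice of oscillator and all parameters here are assumptions, not those of the cited work), a Van der Pol oscillator generates a stable limit cycle that could serve as a rhythmic gesture prototype:

```python
# Illustrative sketch: a rhythmic gesture represented by a differential
# equation with a stable limit cycle. The Van der Pol oscillator and its
# parameters are assumptions chosen for illustration, Euler-integrated.
def van_der_pol(mu=1.0, dt=0.001, steps=60000, x=0.1, v=0.0):
    peak = 0.0
    for i in range(steps):
        a = mu * (1.0 - x * x) * v - x     # Van der Pol acceleration
        x, v = x + v * dt, v + a * dt
        if i > steps // 2:                 # measure after transients decay
            peak = max(peak, abs(x))
    return peak

# From a small initial displacement the trajectory grows onto the limit
# cycle, whose amplitude settles near 2: the repeated (rhythmic) gesture
# traces a reproducible closed curve that a recognizer could match against.
amp = van_der_pol()
print(1.7 < amp < 2.3)   # → True
```

The attracting limit cycle is the point: small perturbations in how the user executes the cycle decay away, which is what makes a repeated rhythmic gesture recognizable and repeatable until the system responds.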
Handwriting can be viewed as a specific case of gesture system, but one which leaves
a visible trace, and which most humans have spent years learning. Recent developments
in handwriting recognition based on the use of recurrent neural networks to learn the
dynamic systems required to both generate and classify handwriting (Graves, 2013) could
be generalized to other areas of gestural input. This work made progress by not requiring the
training data for the handwriting task to be broken down into individual letters, but to work
at a word level, letting the machine learning cope with the variability, and co-articulation
effects from neighbouring letters. This might be of interest in analysis of interactions ‘in the
wild’ where it can be difficult for a human to label when exactly a user changed their goal to
a particular target or task.
\[
\dot{x}_2(t) = a(t) = \dot{v}(t) = -\frac{R}{m}\,x_2(t) + \frac{1}{m}\,u(t), \tag{1.2}
\]
\[
\dot{x}_3(t) = \dot{z}(t) = -\frac{bR}{m}\,x_2(t) - \frac{c}{m}\,x_3(t) + \frac{b}{m}\,u(t), \tag{1.3}
\]
which can then be more conveniently represented in the state space form \(\dot{x} = Ax + Bu\),
\[
\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \dot{x}_3 \end{bmatrix}
=
\begin{bmatrix} 0 & 1 & 0 \\ 0 & -\frac{R}{m} & 0 \\ 0 & -\frac{bR}{m} & -\frac{c}{m} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
+
\begin{bmatrix} 0 \\ \frac{1}{m} \\ \frac{b}{m} \end{bmatrix} u. \tag{1.4}
\]
This shows how a single degree-of-freedom (DOF) input can control both velocity and
zoom-level. The non-zero off-diagonal elements of the A matrix indicate coupling among
states, and the B matrix indicates how the input u affects each state. Note that you can
change from velocity to acceleration input just by changing the values of the B matrix. This
example could be represented as having zoom as an output equation, rather than a state, and
the coupling between zoom and speed would then come primarily through the B matrix.
We can create a control law such that u = L(x_r − x) and write new state equations, ẋ =
Ax + Bu = Ax − BLx + BLx_r = (A − BL)x + BLx_r, which shows how the control augmentation
has changed the closed-loop dynamics. The user input can then be linked to the x_r value, so
that it corresponds to a desired velocity, or a desired position in the space. It is
also possible to have a switched dynamic system which changes the dynamics depending on
the mode the user is in, supporting their inferred activity, and Eslambolchilar and Murray-
Smith (2008) describe examples with different regimes for exploration, cruising and diving
modes. The stability and control properties of such systems are examined in Eslambolchilar
and Murray-Smith (2010).
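A numerical sketch of this kind of coupled scroll–zoom state-space model, with the feedback law u = L(x_r − x) discussed above, can be put together in a few lines. The parameter values and gain vector are illustrative assumptions, and the integration is a simple Euler scheme.

```python
# Numerical sketch of a coupled scroll/zoom state-space model x' = Ax + Bu
# with the feedback law u = L(x_r - x). All parameter values (R, m, b, c)
# and the gain vector L are illustrative assumptions; integration is Euler.
R, m, b, c = 1.0, 1.0, 0.5, 2.0
A = [[0.0, 1.0, 0.0],
     [0.0, -R / m, 0.0],
     [0.0, -b * R / m, -c / m]]   # states: position, velocity, zoom
B = [0.0, 1.0 / m, b / m]
L = [0.0, 4.0, 0.0]               # feed back only the velocity error

def simulate(x_ref, dt=0.001, steps=20000):
    x = [0.0, 0.0, 0.0]
    peak_zoom = 0.0
    for _ in range(steps):
        u = sum(L[i] * (x_ref[i] - x[i]) for i in range(3))
        dx = [sum(A[i][j] * x[j] for j in range(3)) + B[i] * u
              for i in range(3)]
        x = [x[i] + dx[i] * dt for i in range(3)]
        peak_zoom = max(peak_zoom, abs(x[2]))
    return x, peak_zoom

# Command a desired scroll velocity of 1.0: velocity settles near the
# reference (with a proportional-control offset), while zoom responds
# transiently whenever the speed is changing and then decays back.
x_final, peak_zoom = simulate([0.0, 1.0, 0.0])
print(round(x_final[1], 2), round(peak_zoom, 2))
```

Changing L changes the (A − BL) closed-loop dynamics directly, which is exactly the mechanism by which a switched-dynamics design can offer distinct exploration, cruising, or diving regimes.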
This approach was further developed in Kratz, Brodien, and Rohs (2010), who
extended the model to two dimensions and presented a novel interface for mobile map
navigation based on Semi-Automatic Zooming (SAZ). SAZ gives the user the ability to
manually control the zoom level of a Speed-Dependent Automatic Zooming (SDAZ)
interface, while retaining the automatic zooming characteristics of that interface at times
when the user is not explicitly controlling the zoom level.
box’, where we can interact with it, but cannot see the underlying code. This is an area where
systematic approaches to identify the dynamics of system transitions can allow us to create
a canonical representation of the dynamics as a differential equation which is independent
of how it was implemented. An example of this is the work on exposing scrolling transfer
functions by Quinn, Cockburn, Casiez, Roussel and Gutwin (2012). Quinn, Malacria, and
Cockburn (2013) used robots to manipulate various touch devices to infer the scrolling dynamics. A
differential equation approach provides a universal representation of the different imple-
mentations, even if they did not originally use that representation internally. This could
have a role in intellectual property disputes, where more objective similarity measures
could be proposed, which are independent of trivial implementation details. The differential
equation approach can also be applied to other mechanisms for presenting large data spaces,
e.g., fisheye lenses (see Eslambolchilar and Murray-Smith, 2006).
ID = log2(2A/W). (1.6)
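To make Eq. (1.6) concrete, the index of difficulty and the common linear movement-time model MT = a + b·ID can be computed directly; the a and b values below are arbitrary illustrative constants, not fitted data:

```python
import math

def index_of_difficulty(A: float, W: float) -> float:
    """Fitts' index of difficulty (bits): ID = log2(2A / W)."""
    return math.log2(2 * A / W)

def movement_time(A: float, W: float, a: float = 0.1, b: float = 0.15) -> float:
    """Common linear Fitts model MT = a + b * ID (a, b illustrative, in seconds)."""
    return a + b * index_of_difficulty(A, W)

# A target of width 16 at distance 256 has ID = log2(512/16) = 5 bits.
print(index_of_difficulty(A=256, W=16))
print(movement_time(A=256, W=16))
```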
Movement times and error rates are important aspects of human interaction, but they do
not provide a complete picture.
A feedback control based explanation was provided in 1963 by Crossman and Goodeve,
reprinted in (Crossman and Goodeve 1983), where they suggested that Fitts’ Law could be
derived from feedback control, rather than information theory. They proposed that there
OUP CORRECTED PROOF – FINAL, 22/12/2017, SPi
1.5 Models
Models can be used to create the controllers, or can be directly incorporated into the
controller. In some cases, the controller can be seen as an implicit model of the system
and environment (Conant and Ross Ashby, 1970; Eykhoff, 1994). In many areas of HCI
research we need to compare user behaviour to some reference behaviour. How similar are
two trajectories? Use of simple Euclidean distance measures between two time series can be
quite misleading. However, if we can identify model parameters for a specific user, we can
calculate the likelihood of model parameters given the observed data, which can be more
robust in some cases.
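As a toy illustration, consider two noisy first-order trajectories generated from the same dynamics parameter but offset by a reaction-time delay (all values invented): the raw distance between them is large, yet the identified parameters nearly coincide:

```python
import numpy as np

rng = np.random.default_rng(0)

def trajectory(delay, n=200, a=0.95, noise=0.005):
    """Flat for `delay` steps (reaction time), then a noisy first-order rise to 1."""
    x = np.zeros(n)
    for t in range(delay + 1, n):
        x[t] = a * x[t - 1] + (1 - a) + noise * rng.standard_normal()
    return x

y1, y2 = trajectory(delay=0), trajectory(delay=20)

# Pointwise Euclidean distance is inflated by the onset delay alone.
print(np.linalg.norm(y1 - y2))

def fit_a(x, onset):
    """Least-squares estimate of a in (x[t] - 1) = a * (x[t-1] - 1), post-onset."""
    x = x[onset:]
    X = np.column_stack([x[:-1] - 1.0])
    coef, *_ = np.linalg.lstsq(X, x[1:] - 1.0, rcond=None)
    return coef[0]

# The fitted dynamics parameters agree closely despite the large raw distance.
print(fit_a(y1, 0), fit_a(y2, 20))
```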
Figure 1.8 Impact of prediction horizon and narrowing straight tunnel. The same variability that
can be contained in a widening tunnel will breach the constraints of a narrowing one.
The ability to adapt can depend on whether the task is one of forced or unforced refer-
ence following. In unpaced tasks, users can increase speed and accuracy, as their preview
increases. In forced pace tasks, their speed cannot change, but their accuracy improves if
their preview increases (Poulton, 1974).
[Figure: two diagrams of the interaction loop: a stimulus–response chain through the environment, and a closed loop of action, sensation, perception, and decision.]
Figure 1.9 Separation of cause and effect in the human–computer interaction loop is fundamen-
tally problematic. (Adapted from Jagacinski and Flach, 2003).
subsystems: rather, they will adapt their behaviour to take into account the change in the
computer system around them. This is well documented in McRuer et al.’s crossover model,
described in (Sheridan and Ferrell, 1974; Jagacinski and Flach, 2003; McRuer and Jex,
1967), where pilots would adapt their behaviour Y_H so that even with unstable controlled
dynamics, the overall closed-loop behaviour near the ‘crossover frequency’ ω_c remained
close to a ‘good’ servo reference behaviour
Y_H Y_C ≈ ω_c exp(−jωτ)/(jω),
despite changes in the aircraft dynamics (Y_C). τ represents the effective time delay,
combining reaction delay and neuromuscular lag. Young (1969) provides a wide-ranging
review of how the human can adapt in manual control contexts, highlighting the challenges
in understanding human adaptation in complex failure settings.
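The crossover relation can be checked numerically; the crossover frequency and delay below are illustrative values, not McRuer's fitted parameters:

```python
import numpy as np

omega_c = 4.0  # crossover frequency (rad/s), illustrative
tau = 0.2      # effective time delay (s), illustrative

def open_loop(omega):
    """Crossover model: Y_H * Y_C = omega_c * exp(-j*omega*tau) / (j*omega)."""
    return omega_c * np.exp(-1j * omega * tau) / (1j * omega)

# At omega_c the open-loop gain is exactly 1 (0 dB), the defining property of
# the crossover frequency; the delay term adds phase lag on top of the
# integrator's constant -90 degrees.
print(abs(open_loop(omega_c)))
print(np.degrees(np.angle(open_loop(omega_c))))
```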
1968) discusses, mathematical models of human control behaviour often underplayed the
richness of human sensing. Will the recent developments in agents which can learn to link
rich visual perception to action via deep convolutional networks (Mnih et al., 2015) change
the nature of these models?
....................................................................................................
references
Accot, J., and Zhai, S., 1997. Beyond Fitts’ Law: models for trajectory-based HCI tasks. In: CHI ’97:
Proceedings of the SIGCHI conference on Human factors in computing systems. New York, NY: ACM,
pp. 295–302.
Accot, J., and Zhai, S., 2002. More than dotting the i’s — foundations for crossing-based interfaces. In:
CHI ’02: Proceedings of the SIGCHI conference on Human factors in computing systems. New York,
NY: ACM, pp. 73–80.
Apps, M. A. J., Grima, L. L., Manohar, S., and Husain, M., 2015. The role of cognitive effort in
subjective reward devaluation and risky decision-making. Scientific reports, 5, 16880.
Balakrishnan, R., 2004. ‘Beating’ Fitts’ law: virtual enhancements for pointing facilitation. International
Journal of Human-Computer Studies, 61(6), pp. 857–74.
5 https://gym.openai.com/
6 https://deepmind.com/research/open-source/open-source-environments/
Barrett, R. C., Selker, E. J., Rutledge, J. D., and Olyha, R. S., 1995. Negative inertia: A dynamic pointing
function. In: CHI ’95: Conference Companion on Human Factors in Computing Systems, New York,
NY: ACM, pp. 316–17.
Bennett, K. B., and Flach, J. M., 2011. Display and interface design: Subtle science, exact art. Cleveland,
OH: CRC Press.
Blanch, R., Guiard, Y., and Beaudouin-Lafon, M., 2004. Semantic pointing: Improving target acqui-
sition with control-display ratio adaptation. In: CHI ’04: Proceedings of the SIGCHI Conference on
Human Factors in Computing Systems. New York, NY: ACM, pp. 519–26.
Cannon, D. J., 1994. Experiments with a target-threshold control theory model for deriving Fitts’ law
parameters for human-machine systems. IEEE Transactions on Systems, Man, and Cybernetics, 24(8), pp. 1089–98.
Casiez, G., and Roussel, N., 2011. No more bricolage!: Methods and tools to characterize, replicate and
compare pointing transfer functions. In: UIST ’11: Proceedings of the 24th Annual ACM Symposium
on User Interface Software and Technology. New York, NY: ACM, pp. 603–14.
Casiez, G., Vogel, D., Balakrishnan, R., and Cockburn, A., 2008. The impact of control-display gain on
user performance in pointing tasks. Human–computer interaction, 23(3), pp. 215–50.
Cho, S.-J., Murray-Smith, R., and Kim, Y.-B., 2007. Multi-context photo browsing on mobile devices
based on tilt dynamics. In: MobileHCI ’07: Proceedings of the 9th International Conference on Human
Computer Interaction with Mobile Devices and Services. New York, NY: ACM, pp. 190–7.
Clarke, C., Bellino, A., Esteves, A., Velloso, E., and Gellersen, H., 2016. Tracematch: A computer vision
technique for user input by tracing of animated controls. In: UbiComp ’16: Proceedings of the 2016
ACM International Joint Conference on Pervasive and Ubiquitous Computing. New York, NY: ACM,
pp. 298–303.
Cockburn, A., and Firth, A., 2004. Improving the acquisition of small targets. In People and Computers
XVII - Designing for Society. New York, NY: Springer, pp. 181–96.
Conant, R. C., and Ross Ashby, W., 1970. Every good regulator of a system must be a model of that
system. International Journal of Systems Science, 1(2), pp. 89–97.
Connelly, E. M., 1984. Instantaneous Performance Evaluation with Feedback can Improve Training.
In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting. New York, NY: Sage,
pp. 625–8.
Costello, R. G., 1968. The surge model of the well-trained human operator in simple manual control.
IEEE Transactions on Man–Machine Systems, 9(1), pp. 2–9.
Craik, K. J. W., 1947. Theory of the human operator in control systems: 1. the operator as an
engineering system. British Journal of Psychology. General Section, 38(2), pp. 56–61.
Craik, K. J. W., 1948. Theory of the human operator in control systems: 2. man as an element in a
control system. British Journal of Psychology. General Section, 38(3), pp. 142–8.
Crossman, E. R. F. W., 1960. The information-capacity of the human motor-system in pursuit tracking.
Quarterly Journal of Experimental Psychology, 12(1), pp. 1–16.
Crossman, E. R. F. W., and Goodeve, P. J., 1983. Feedback control of hand-movement and fitts’ law.
The Quarterly Journal of Experimental Psychology, 35(2), pp. 251–78.
Dix, A., Finlay, J., Abowd, G., and Beale, R., 2003. Human-computer interaction. Harlow: Pearson
Education Limited.
Drury, C. G., 1971. Movements with lateral constraint. Ergonomics, 14(2), pp. 293–305.
Eslambolchilar, P., and Murray-Smith, R., 2006. Model-based, multimodal interaction in document
browsing. In: Machine Learning for Multimodal Interaction, LNCS Volume 4299. New York, NY:
Springer, pp. 1–12.
Eslambolchilar, P., and Murray-Smith, R., 2008. Control centric approach in designing scrolling
and zooming user interfaces. International Journal of Human-Computer Studies, 66(12),
pp. 838–56.
Eslambolchilar, P., and Murray-Smith, R., 2010. A model-based approach to analysis and calibration of
sensor-based human interaction loops. International Journal of Mobile Human Computer Interaction,
2(1), pp. 48–72.
Eykhoff, P., 1994. Every good regulator of a system must be a model of that system. Modeling,
identification and control, 15(3), pp. 135–9.
Fekete, J.-D., Elmqvist, N., and Guiard, Y., 2009. Motion-pointing: Target selection using elliptical
motions. In: CHI ’09: Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems. New York, NY: ACM, pp. 289–98.
Fitts, P. M., 1954. The information capacity of the human motor system in controlling the amplitude
of movement. Journal of Experimental Psychology, 47(6), pp. 381–91.
Fitts, P. M., and Peterson, J. R., 1964. Information capacity of discrete motor responses. Journal of
experimental psychology, 67(2), pp. 103–12.
Flemisch, F. O., Adams, C. A., Conway, S. R., Goodrich, K. H., Palmer, M. T., and Schutte, P. C.,
2003. The H-Metaphor as a guideline for vehicle automation and interaction. Control (December),
pp. 1–30.
Gawthrop, P., Gollee, H., and Loram, I., 2015. Intermittent control in man and machine. In: M.
Miskowicz, ed. Event-Based Control and Signal Processing. Cleveland, OH: CRC Press, pp. 281–350.
Gawthrop, P., Lakie, M., and Loram, I., 2008. Predictive feedback control and Fitts’ law. Biological
Cybernetics, 98, pp. 229–38.
Gawthrop, P., Loram, I., Lakie, M., and Gollee, H., 2011. Intermittent control: A computational theory
of human control. Biological Cybernetics, 104(1–2), pp. 31–51.
Graves, A., 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.
Grossman, T., and Balakrishnan, R., 2005. The bubble cursor: Enhancing target acquisition by
dynamic resizing of the cursor’s activation area. In: CHI ’05: Proceedings of the SIGCHI Conference
on Human Factors in Computing Systems. New York, NY: ACM, pp. 281–90.
Guiard, Y., and Rioul, O., 2015. A mathematical description of the speed/accuracy trade-off of aimed
movement. In: Proceedings of the 2015 British HCI Conference. New York, NY: ACM, pp. 91–100.
Hollnagel, E., 1999. Modelling the controller of a process. Transactions of the Institute of Measurement
and Control, 21(4-5), pp. 163–70.
Hollnagel, E., and Woods, D. D., 2005. Joint cognitive systems: Foundations of cognitive systems engineer-
ing. Cleveland, OH: CRC Press.
Hornbæk, K., and Oulasvirta, A., 2017. What is interaction? In: Proceedings of the SIGCHI Conference
on Human Factors in Computing Systems. New York, NY: ACM, pp. 5040–52.
Jagacinski, R. J., and Flach, J. M., 2003. Control Theory for Humans: Quantitative approaches to modeling
performance. Mahwah, NJ: Lawrence Erlbaum.
Kelley, C. R., 1968. Manual and Automatic Control. New York, NY: John Wiley and Sons, Inc.
Kleinman, D. L., Baron, S., and Levison, W. H., 1970. An optimal control model of human response
part i: Theory and validation. Automatica, 6(3), pp. 357–69.
Kleinman, D. L., Baron, S., and Levison, W. H., 1971. A control theoretic approach to manned-vehicle
systems analysis. IEEE Transactions Automatic Control, 16, pp. 824–32.
Kratz, S., Brodien, I., and Rohs, M., 2010. Semi-automatic zooming for mobile map navigation. In:
MobileHCI ’10: Proceedings of the 12th International Conference on Human Computer Interaction with
Mobile Devices and Services. New York, NY: ACM, pp. 63–72.
Lank, E., and Saund, E., 2005. Computers and Graphics, 29, pp. 490–500.
Lantz, V., and Murray-Smith, R., 2004. Rhythmic interaction with a mobile device. In: NordiCHI ’04,
Tampere, Finland. New York, NY: ACM, pp. 97–100.
MacKenzie, I. S., 1992. Fitts’ law as a research and design tool in human–computer interaction.
Human-computer interaction, 7(1), pp. 91–139.
McRuer, D. T., and Jex, H. R., 1967. A review of quasi-linear pilot models. IEEE Trans. on Human
Factors in Electronics, 8(3), pp. 231–49.
Meyer, D. E., Smith, J. E. K., Kornblum, S., Abrams, R. A., and Wright, C. E., 1990. Speed-accuracy
trade-offs in aimed movements: Toward a theory of rapid voluntary action. In: M.
Jeannerod, ed. Attention and Performance XIII. Hove, UK: Psychology Press, pp. 173–226.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A.,
Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al., 2015. Human-level control through deep
reinforcement learning. Nature, 518(7540), pp. 529–33.
Müller, J., Oulasvirta, A., and Murray-Smith, R., 2017. Control theoretic models of pointing. ACM
Transactions on Computer-Human Interaction, 24(4), pp. 27:1–27:36.
Pastel, R., 2006. Measuring the difficulty of steering through corners. In: Proceedings of ACM CHI 2006
Conference on Human Factors in Computing Systems, 1(2), pp. 1087–96.
Poulton, E. C., 1974. Tracking skill and manual control. New York, NY: Academic Press.
Powers, W. T., 1973. Behavior: The control of Perception. New York, NY: Routledge.
Powers, W. T., 1989. Living Control Systems: Selected papers of William T. Powers. New York, NY:
Benchmark.
Powers, W. T., 1992. Living Control Systems II: Selected papers of William T. Powers. Bloomfield, NJ:
Control Systems Group.
Quinn, P., Cockburn, A., Casiez, G., Roussel, N., and Gutwin, C., 2012. Exposing and understanding
scrolling transfer functions. In: Proceedings of the 25th annual ACM symposium on User interface
software and technology, New York, NY: ACM, pp. 341–50.
Quinn, P., Malacria, S., and Cockburn, A., 2013. Touch scrolling transfer functions. In: Proceedings of
the 26th Annual ACM Symposium on User Interface Software and Technology. New York, NY: ACM,
pp. 61–70.
Quinn, P., and Zhai, S., 2016. Modeling gesture-typing movements. Human–Computer Interaction.
http://dx.doi.org/10.1080/07370024.2016.1215922.
Rashevsky, N., 1959. Some remarks on the mathematical aspects of automobile driving. Bulletin of
Mathematical Biology, 21(3), pp. 299–308.
Rashevsky, N., 1960. Further contributions to the mathematical biophysics of automobile driving.
Bulletin of Mathematical Biology, 22(3), pp. 257–62.
Rigoux, L., and Guigon, E., 2012. A model of reward-and effort-based optimal decision making and
motor control. PLoS Comput Biol, 8(10), e1002716.
Rogers, Y., Sharp, H., and Preece, J., 2011. Interaction Design: Beyond Human Computer Interaction. 4th
ed. New York, NY: Wiley & Sons.
Schmidt, R. A., and Lee, T. D., 2005. Motor Control and Learning. Champaign, IL: Human Kinetics.
Shadmehr, R., 2010. Control of movements and temporal discounting of reward. Current Opinion in
Neurobiology, 20(6), pp. 726–30.
Shadmehr, R., Huang, H. J., and Ahmed, A. A., 2016. A representation of effort in decision-making
and motor control. Current biology, 26(14), pp. 1929–34.
Sheridan, T. B., and Ferrell, W. R., 1974. Man-machine Systems: Information, Control, and Decision
Models of Human Performance. Cambridge, MA: MIT Press.
Velloso, E., Carter, M., Newn, J., Esteves, A., Clarke, C., and Gellersen, H., 2017. Motion correlation:
Selecting objects by matching their movement. ACM Transactions on Computer-Human Interaction
(ToCHI), 24(3), Article no. 22.
Visell, Y., and Cooperstock, J., 2007. Enabling gestural interaction by means of tracking dynamical
systems models and assistive feedback. In: ISIC. IEEE International Conference on Systems, Man
and Cybernetics, pp. 3373–8.
Viviani, P., and Terzuolo, C., 1982. Trajectory determines movement dynamics. Neuroscience, 7(2),
pp. 431–7.
Wickens, C. D., and Hollands, J. G., 1999. Engineering psychology and human performance. 3rd ed.
New York, NY: Prentice-Hall.
Wiener, N., 1948. Cybernetics: Control and communication in the animal and the machine. Cambridge,
MA: MIT Press.
Williamson, J., and Murray-Smith, R., 2004. Pointing without a pointer. In: ACM SIGCHI, New York,
NY: ACM, pp. 1407–10.
Williamson, J., 2006. Continuous uncertain interaction. PhD thesis, School of Computing Science,
University of Glasgow.
Yamanaka, S., and Miyashita, H., 2016a. Modeling the steering time difference between narrowing
and widening tunnels. In: CHI ’16: Proceedings of the 2016 CHI Conference on Human Factors in
Computing Systems. New York, NY: ACM, pp. 1846–56.
Yamanaka, S., and Miyashita, H., 2016b. Scale effects in the steering time difference between narrowing
and widening linear tunnels. In: NordiCHI ’16: Proceedings of the 9th Nordic Conference on Human-
Computer Interaction. New York, NY: ACM, pp. 12:1–12:10.
Young, L. R., 1969. On adaptive manual control. IEEE Transactions on Man-Machine Systems, 10(4),
pp. 292–331.
Yun, S., and Lee, G., 2007. Design and comparison of acceleration methods for touchpad. In: CHI’07
Extended Abstracts on Human Factors in Computing Systems. New York, NY: ACM, pp. 2801–12.
Zhai, S., Kristensson, P. O., Appert, C., Anderson, T. H., and Cao, X., 2012. Foundational issues in
Touch-Surface Stroke Gesture Design - An Integrative Review. Foundations and Trends in Human
Computer Interaction, 5(2), pp. 95–205.
2
2.1 Introduction
Text entry is an integral activity in our society. We use text entry methods for asynchronous
communication, such as when writing short messages and email, as well as when we
compose longer documents. For nonspeaking individuals with motor disabilities, efficient
text entry methods are vital for everyday face-to-face communication.
Text entry is a process and a text entry method is a system for supporting this process.
Fundamentally, the objective of a text entry method is to allow users to transmit the intended
text from the user’s brain into a computer system as fast as possible. Many transmission
methods are possible, including speech, cursive handwriting, hand printing, typing and
gesturing.
The primary objective when designing a text entry method is to maximise the effective text
entry rate—the entry rate achievable when the error rate is within a reasonable threshold.
Entry rate is typically measured as either characters per second (cps) or words per minute
(wpm), where a word is defined as five consecutive characters, including space.
Error rate can be defined over characters, words or sentences. Character error rate is
defined as the minimum number of insertion, deletion and substitution character edit
operations necessary to transform a user’s response text into the stimulus text. Word error
rate is defined as the minimum number of insertion, deletion and substitution word edit
operations necessary to transform a user’s response text into the stimulus text. Sentence
error rate is the ratio of error-free response sentences to the total number of stimulus
sentences.
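These definitions reduce to the Levenshtein edit distance, normalized by the length of the stimulus; a minimal character-level sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def char_error_rate(response: str, stimulus: str) -> float:
    """Character error rate: edit distance normalized by stimulus length."""
    return edit_distance(response, stimulus) / len(stimulus)

print(char_error_rate("helo wrld", "hello world"))  # 2 edits / 11 characters
```

Word error rate follows the same recurrence applied to lists of words instead of characters.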
Intelligent text entry methods (Kristensson, 2009) are text entry methods which infer or
predict the user’s intended text. Such methods aim to amplify users’ ability to communicate
as quickly and as accurately as possible by exploiting redundancies in natural languages.
This chapter explains the role of computational interaction in the design of text entry
methods. It presents three interlinked modelling approaches for text entry which collectively
allow the design of probabilistic text entry methods. Such methods open up a new design
space by inferring the user’s intended text from noisy input.
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
We begin by explaining how language can be viewed as information (bits) in a framework
provided by information theory. Information theory allows the quantification of informa-
tion transmitted between a user and a computer system and provides a mathematical model
of text entry as communication over a noisy channel. Although this model is limited in
several ways, certain design decisions can be informed by this model.
The fundamentals of information theory reveal that natural languages are highly redun-
dant. Such redundancies can be exploited by probabilistic text entry methods and result in a
lower error rate and potentially a higher entry rate. However, to exploit the redundancies in
natural languages it is necessary to model language. For this purpose the chapter introduces
the principles of language modelling and explains how to build effective language models
for text entry.
Having built a language model, the next step is to infer the user’s intended text from noisy
input. There are several strategies to achieve this. This chapter focuses on the token-passing
statistical decoder, which allows a text entry designer to create a generative probabilistic
model of text entry that can be trained to efficiently infer user’s intended text. In addition,
it can be adapted to perform several related functions, such as merging hypothesis spaces
generated by probabilistic text entry methods.
We then review how probabilistic text entry methods can be designed to correct typing
mistakes, enable fast typing on a smartwatch, improve predictions in augmentative and
alternative communication, enable dwell-free eye-typing, and intelligently support error
correction of probabilistic text entry.
Finally, we discuss the limitations of the models introduced in this chapter and highlight
the importance of establishing solution principles based on engineering science and empir-
ical research in order to guide the design of probabilistic text entry.
implementation is the redundancy. The perplexity PP of the optimal coding scheme is
PP = 2^H ≈ 3.4. In contrast, the perplexity of the naïve coding scheme is 4. However,
if all four symbols were equally likely, then the perplexity of the optimal coding scheme
would also be 4.
Information theory enables a model of text entry as communication over a noisy channel
(Shannon, 1948). Using this model it is possible to calculate the bitrate of a text entry
method. Assuming the random variable I is a distribution over the set of words the user
is intending to write, and the random variable O is a distribution over the set of words the
user is writing, then the rate R (in bits) is:
R = I(I; O)/t, (2.3)
where I(I; O) is the mutual information1 and t is the average time it takes to write a word
in O. Since I(I; O) = H(I) − H(I|O), where H(I|O) is the conditional entropy2, the rate
can be rewritten as:
R = (H(I) − H(I|O))/t. (2.4)
If the probability of error is zero, that is, all words in I can always be inferred from O, then
the conditional entropy H(I|O) will disappear and R = H(I)/t.
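The rate in Eq. (2.4) can be computed for a toy channel; the joint distribution and timing below are invented for illustration:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Invented joint distribution P(I, O) over a two-word vocabulary:
# rows index the intended word, columns the word actually produced.
P = [[0.45, 0.05],
     [0.05, 0.45]]

P_I = [sum(row) for row in P]                             # marginal P(I)
P_O = [sum(P[i][j] for i in range(2)) for j in range(2)]  # marginal P(O)

H_I = entropy(P_I)
H_joint = entropy([p for row in P for p in row])          # H(I, O)
H_I_given_O = H_joint - entropy(P_O)                      # H(I|O) = H(I,O) - H(O)

t = 0.5  # assumed average seconds per word
R = (H_I - H_I_given_O) / t                               # Eq. (2.4), bits per second
print(R)
```

With a noiseless channel the off-diagonal entries go to zero, H(I|O) vanishes, and R reduces to H(I)/t as stated above.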
Information theory can be used to guide the design of text entry methods. Inverse arith-
metic coding (Witten et al., 1987), a near-optimal method of text compression based on lan-
guage modelling, inspired the text entry method Dasher (MacKay et al., 2004; Ward et al.,
2000; Ward and MacKay, 2002). Unlike most other text entry methods, in Dasher a user’s
gesture does not map to a particular letter or word. Instead, text entry in Dasher involves
continuous visually-guided closed-loop control of gestures via, for example, a mouse or
eye gaze. A user writes text by continuously navigating to a desired sequence of letters in
a graphical user interface laid out according to a language model (MacKay et al., 2004).
Dasher has an information-efficient objective in the following sense. If the user writes
at a rate R_D (in bits) then Dasher attempts to zoom in to the region containing the user’s
intended text by a factor of 2^R_D. If the language model generates text at an exchange rate of
R_LM bits per character, then the user will be able to reach an entry rate of R_D/R_LM characters
per second (MacKay et al., 2004).
Dasher can also be controlled by discrete button presses in which timing is disregarded.
Assuming the button pressing is precise, and therefore without errors, the capacity C of the
communication channel is then:
C = log2(n)/t, (2.5)
1 The mutual information I(A; B) of two discrete random variables A and B is I(A; B) = Σ_{a∈A} Σ_{b∈B} P(a, b) log2 [P(a, b)/(P(a)P(b))].
2 The conditional entropy H(B|A) of two discrete random variables A and B is Σ_{a∈A, b∈B} P(a, b) log2 [P(a)/P(a, b)].
where n is the number of buttons and t is the average time to switch buttons. If the process
is noisy, such that the wrong button is pressed a fraction f of the time, then the capacity is:
C = (log2(n)/t) (1 − H2(f)), f ∈ [0, 1/2), (2.6)
where H2 is the binary entropy function3 (MacKay et al., 2004). Critically, even a rather low
error rate of f = 1/10 scales C by approximately 0.53, and thus reduces the channel capacity
by approximately a factor of two.
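The f = 1/10 case can be reproduced numerically with a short sketch of Eq. (2.6); the four-button configuration in the second line is an invented example:

```python
import math

def H2(f: float) -> float:
    """Binary entropy function (bits)."""
    if f <= 0.0 or f >= 1.0:
        return 0.0
    return -f * math.log2(f) - (1 - f) * math.log2(1 - f)

def capacity(n: int, t: float, f: float = 0.0) -> float:
    """Eq. (2.6): capacity for n buttons, switch time t, error fraction f."""
    return (math.log2(n) / t) * (1 - H2(f))

print(1 - H2(0.1))                  # ~0.53, the scaling quoted above
print(capacity(n=4, t=0.4, f=0.1))  # bits/s for an assumed 4-button setup
```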
The above derivation shows that the perplexity is minimized when the conditional
probability of the word sequence is maximized.
If the probability of observing a word is independent of all previous words, then the
entropy for a vocabulary V is:
H = − Σ_{i=1}^{|V|} P(w_i) log2 P(w_i), (2.10)
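Under this independence assumption the entropy, and hence the perplexity 2^H, follow directly from word frequencies. A minimal sketch over an invented toy corpus:

```python
import math
from collections import Counter

# Invented toy corpus; real language models are trained on far larger text.
corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
total = sum(counts.values())

# Unigram probabilities P(w_i) estimated by relative frequency.
probs = {w: c / total for w, c in counts.items()}

H = -sum(p * math.log2(p) for p in probs.values())  # entropy (bits per word)
perplexity = 2 ** H

print(H, perplexity)
```

The perplexity is noticeably below the vocabulary size of 6, reflecting the redundancy introduced by the skewed word frequencies.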
2.4 Decoding
A decoder infers users’ intended text from noisy data. Such a decoding process can be
achieved in a number of ways, although the objective is usually framed within a relatively
simple probabilistic model.
The user wishes to transmit a message y via some form of signal x. For instance, a user
may want to transmit the signal “a” (the character a) by touching in the vicinity of the
a key on a touchscreen keyboard. Assuming an error-free input channel, this objective
may be trivially achieved by implementing a lookup table that maps certain pixels on the
touchscreen keyboard (for instance, all pixels occupying a visual depiction of the letter
key a on the touchscreen keyboard) to the character a. The advantage of this method is
simplicity in its implementation. The disadvantage is an implicit assumption of error-free
input, which is a false assumption as this process is inherently noisy. Noise arises from
several sources, such as users themselves (neuromuscular noise, cognitive overload), the
environment (background noise), and the device sensors. Due to these noise sources the
process is inherently uncertain.
As a consequence, a decoder models this process probabilistically. The objective is to
compute the probability of the message y in some message space Y given an input
signal x. This conditional probability P(y|x) can be computed via Bayes’ rule:
P(y|x) = P(x|y)P(y)/P(x). (2.12)
The most probable message ŷ ∈ Y is then:
ŷ = arg max_{y∈Y} P(x|y)P(y)/P(x). (2.13)
However, since the decoder is only attempting to identify the message that maximizes the
posterior probability, the denominator is just a normalization constant and will be invariant
under the search. It can therefore be ignored:
ŷ = arg max_{y∈Y} P(x|y)P(y). (2.14)
P(x|y) is the likelihood of the input signal x given a particular hypothesis for y. P(y) is the
prior probability of the message y without taking the input signal into account.
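Equation (2.14) can be implemented directly for a single touch observation; the key positions, noise model, and priors below are invented for illustration:

```python
import math

# Assumed 1D key centres (pixels) and a unigram prior over three letters.
keys = {"a": 10.0, "s": 30.0, "d": 50.0}
prior = {"a": 0.5, "s": 0.3, "d": 0.2}
sigma = 12.0  # touch-noise standard deviation (pixels), assumed

def likelihood(x, y):
    """P(x | y): Gaussian likelihood of touch position x given intended letter y."""
    mu = keys[y]
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def decode(x):
    """y_hat = argmax_y P(x | y) P(y); the evidence P(x) is ignored (Eq. 2.14)."""
    return max(keys, key=lambda y: likelihood(x, y) * prior[y])

# A touch at 21 is geometrically closer to 's', yet the stronger prior on 'a'
# wins; a touch at 29 is close enough to 's' for the likelihood to dominate.
print(decode(21.0), decode(29.0))
```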
Various decoding strategies can be used to find the most probable message ŷ, including
hidden Markov models and deep neural networks. A particularly flexible framework is
token-passing (Young et al., 1989), which has three primary advantages. First, it is
straightforward to model sequence decoding problems using token-passing, which will become
evident later in this chapter. Second, it is easy to control the search using beam pruning,
which is crucial when the search space grows exponentially or becomes infinite. Third,
token-passing makes it relatively easy to parallelize the search as global memory access can
be minimized.
Assume the system receives an observation sequence O = {o1 , o2 , . . . , on }. These obser-
vations can, for instance, be a series of touch coordinates on a touchscreen keyboard. The
objective of the decoder is now to identify the most probable hypothesis for the observation
sequence.
In the token-passing paradigm, the system searches for the best hypothesis by propagating
tokens. A token is an abstract data structure tracking at least three pieces of information:
1) the generated sequence of output symbols; 2) the associated accumulated probability;
and 3) the current observation index.
We decode an observation sequence of length n by starting with a single initial token,
which predicts the empty string with 1.0 probability for observing zero observations.
The process now works as follows. We take any token at observation index i ∈ [0, n) and
propagate k tokens to observation index i + 1, where k is, for instance, the number of keys on
the keyboard (possible output symbols). This is repeated for all tokens until a token is in the
last observation and can no longer propagate any tokens as there are no more observations.
These final tokens represent all complete hypotheses that explain the entire observation
sequence. The final token with the highest probability contains the hypothesis which best
explains the observation sequence.
Every time a token is propagated, the system explores a new state. The accumulated
probability of the token for the new state is computed by combining a likelihood model
estimate with a prior probability, as explained above. The exact formulation depends on
the specific implementation, which may include additional terms. Figure 2.1 shows a search
trellis for an observation sequence with three observations.
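The propagation-and-pruning loop described above can be sketched as follows; the two-key layout, unnormalized Gaussian likelihood, and uniform prior are invented placeholders rather than a real keyboard model:

```python
import math

def token_passing(observations, keys, likelihood, prior, beam_width=3):
    """Propagate (log-probability, hypothesis) tokens across the observation
    sequence, keeping only the beam_width best tokens at each index."""
    tokens = [(0.0, "")]  # initial token: empty string, log-probability 0
    for obs in observations:
        expanded = []
        for logp, hyp in tokens:
            for k in keys:  # propagate one token per possible output symbol
                lp = logp + math.log(likelihood(obs, k)) + math.log(prior[k])
                expanded.append((lp, hyp + k))
        # Beam pruning: discard all but the most probable partial hypotheses.
        tokens = sorted(expanded, reverse=True)[:beam_width]
    return max(tokens)  # best complete hypothesis for the whole sequence

# Toy 1D setup: two keys at assumed positions, uniform letter prior.
keys = {"a": 10.0, "b": 30.0}
prior = {"a": 0.5, "b": 0.5}

def likelihood(x, k):
    return math.exp(-((x - keys[k]) ** 2) / (2 * 8.0 ** 2))

logp, best = token_passing([11.0, 28.0, 12.0], keys, likelihood, prior)
print(best)
```

A production decoder would replace the uniform prior with a language model conditioned on the previously generated letters, but the propagation structure is unchanged.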
The previous model essentially propagates tokens which consume observations and
in return generate letter sequence hypotheses. The probabilities of these hypotheses
depend on the language model context (previously generated letters) and the observations
themselves (for example, how close or far away touch point observations are to certain
letter keys).
A limitation with this model is that it only substitutes observations for letters. In practice,
it is common for users to omit certain letters or to accidentally type additional letters.
Transposition errors, where two letters are interchanged, are also common. To handle such
situations we need to extend the decoder to handle insertions and deletions of observations
in the observation sequence.
[Figure: search trellis with state rows s2, s3, s4 over observations o1, o2, o3.]
Figure 2.1 Trellis of an observation sequence O = {o1 , o2 , o3 }. The thick arrows indicate an
example highest probability path.
One solution is to introduce another modality, such as pressure, to allow users to signal
to the decoder whether a particular observation (touchscreen point) is likely to be on
the intended letter or not. By modelling the likelihood as a 2D Gaussian distribution,
the system can allow the user to modulate their uncertainty in a key press by pressure
by adjusting the standard deviation for a touch point by computing σ t = C/ωt , where
ωt is the pressure of the touch t and C is an empirically determined parameter, which
needs to be calibrated on data to ensure the standard deviation is reasonable for moderate
pressure (Weir et al., 2014). A study revealed that this system significantly reduced the
need for users to manually correct their text and significantly increased entry rates (Weir
et al., 2014).
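Under stated assumptions (an invented calibration constant C and arbitrary pressure units), the pressure-modulated likelihood of Weir et al. (2014) can be sketched as follows; the demonstration shows how a firm press concentrates the posterior on the nearest key.

```python
import math

C = 0.3  # hypothetical calibration constant; in practice fitted on user data

def touch_log_likelihood(touch, key_centre, pressure):
    """2D Gaussian log-likelihood of a touch point given a key centre, with
    sigma_t = C / omega_t: firm pressure shrinks sigma, expressing high
    certainty that the touched location is the intended one."""
    sigma = C / pressure
    dx, dy = touch[0] - key_centre[0], touch[1] - key_centre[1]
    return -(dx * dx + dy * dy) / (2 * sigma ** 2) - math.log(2 * math.pi * sigma ** 2)

# An ambiguous touch between two keys, slightly nearer the intended one.
intended, neighbour, touch = (0.0, 0.0), (1.0, 0.0), (0.35, 0.0)
posteriors = {}
for pressure in (0.5, 2.0):   # light vs. firm press (arbitrary units)
    li = touch_log_likelihood(touch, intended, pressure)
    ln = touch_log_likelihood(touch, neighbour, pressure)
    posteriors[pressure] = 1.0 / (1.0 + math.exp(ln - li))
    print(f"pressure {pressure}: P(intended key) = {posteriors[pressure]:.3f}")
```

With the light press the decoder remains open to the neighbouring key; with the firm press nearly all posterior mass lands on the intended key, which is exactly the signalling channel the pressure modality provides.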
The above system suggests that the instinct to redesign a user interface to fit within
additional constraints, in this case the size of the touchscreen display, may not always be the
optimal approach. Here the redundancies in natural languages allowed an efficient decoder
to accurately correct users’ typing patterns, even when the typing was very noisy.
method of signalling to the system when a key has been typed and the system will need to
disambiguate whether the user intended to type a letter key or merely happened to look at
it. This is known as the Midas touch problem (Jacob and Karn, 2003). Eye-typing solves
this problem using a dwell-timeout. If the user fixates at a letter key for a set amount of
time (typically 800–1000 ms) then the system infers that the user intended to type the
letter key. The dwell-timeout is usually indicated to the user graphically, for instance, via
an animated clock.
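The dwell-timeout logic itself is simple to state in code. The sketch below is a hypothetical event-loop fragment, not any particular eye-typing system; the 900 ms timeout and the rule that gaze must leave a key before it can be selected again are illustrative choices.

```python
DWELL_MS = 900  # a typical dwell-timeout; real systems use roughly 800-1000 ms

def dwell_select(samples):
    """samples: list of (timestamp_ms, key_under_gaze or None).
    Returns the sequence of keys selected by dwelling."""
    typed, current, since = [], None, None
    for t, key in samples:
        if key != current:
            current, since = key, t      # gaze moved to a new key: restart the clock
        elif key is not None and t - since >= DWELL_MS:
            typed.append(key)
            current, since = None, None  # require the gaze to leave and return
    return typed

gaze = [(0, 'h'), (300, 'h'), (950, 'h'),   # 950 ms on 'h': selected
        (1000, 'i'), (1400, 'i'),           # only 400 ms on 'i' so far
        (1500, 'i')]                        # still under the timeout: nothing typed
print(dwell_select(gaze))   # ['h']
```

The hard performance bound is visible directly in the code: no key can ever be produced faster than one per DWELL_MS plus the transition time between keys.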
While eye-typing does provide a method of communication, there are several negative
aspects to the process. First, the dwell-timeouts impose a hard performance bound since the
user is forced to fixate at every letter key in a word for a fixed timeout. Second, eye-typing
breaks the flow of writing as every word is broken up into a series of saccades followed by
fixations that must be longer in duration than the dwell-timeout. Third, the eyes are sensory
organs, not control organs. It requires conscious effort to keep fixating at a specific area of
the display.
As a consequence of the above limitations, traditional eye-typing is rather slow, typically
limited to about 5–10 wpm (e.g., Majaranta and Räihä, 2002; Rough et al., 2014). A possible
improvement is to let the user set their own dwell-timeout, with the hope that users will be
able to adopt lower dwell-timeouts and hence write faster. Experiments with such systems
show that able-bodied participants in front of well-calibrated eye-trackers can reach an
entry rate of about 7–20 wpm (Majaranta et al., 2009; Räihä and Ovaska, 2012; Rough
et al., 2014).
An alternative interface is Dasher, which we introduced earlier in this chapter. However,
Dasher entry rates are limited to about 12–26 wpm (Ward and MacKay, 2002; Tuisku et al.,
2008; Rough et al., 2014).
In summary, traditional gaze-based text entry methods are limited to about 20 wpm. It
is also important to remember that nearly all the above studies have been conducted with
able-bodied users positioned in front of a desk with a well-calibrated eye-tracker for an
experimental session.
In practice, at least two design considerations are critical when designing a text entry
method using eye gaze for users with motor disabilities. First, the precision of eye gaze
varies dramatically among different users. Some users can move quickly and precisely from
letter key to letter key and are primarily limited by the dwell-timeout. In practice, such users
tend to rely heavily on word prediction when eye-typing. Other users have difficulty due
to various noise sources, such as an inability to keep the head positioned in the centre of
the eye-tracker’s virtual tracking box, an experience of pain or other issues resulting in sleep
deprivation, or jerky uncontrolled movements of the head and body causing the head and/or
display (if mounted on, for example, a wheelchair) to move.
Second, the tolerance for errors varies depending on the application. In speaking mode,
the tolerance for errors is higher as the primary objective is to ensure understanding. In other
applications, such as email or document writing, it is more important that the text does not
contain obvious errors.
Dwell-free eye-typing (Kristensson and Vertanen, 2012) is a technique that mitigates
these issues. The main principle behind dwell-free eye-typing is that it is possible to elimi-
nate the need for dwell-timeouts by inferring users’ intended text. In dwell-free eye-typing a
user writes a phrase or sentence by serially gazing at the letter keys that comprise the phrase
or sentence (gazing at the spacebar is optional). When the phrase or sentence is complete
the user gazes at the text area. The system uses a statistical decoder to automatically infer
the user’s intended text.
A human performance study using an oracle decoder, which always infers the intended
text correctly, showed that users can in principle reach an average entry rate of 46 wpm
(Kristensson and Vertanen, 2012). By modelling traditional eye-typing as a combination
of dwell-timeout and overhead time (the time to transition between keys, etc.) it becomes
apparent that for every conceivable operating point of eye-typing, dwell-free eye-typing will
be faster. For example, a study with record eye-typing rates (Majaranta et al., 2009) had
an eye-typing entry rate which corresponds to roughly a dwell-timeout and an overhead
time of approximately 300 ms each (Kristensson and Vertanen, 2012). This is roughly half
the empirical average entry rate observed for dwell-free eye-typing using an oracle model
(Kristensson and Vertanen, 2012).
Dwell-free eye-typing has been implemented as a probabilistic text entry method and is
now a component of the AAC communication suite Communicator 5 by Tobii Dynavox.4
The system is designed based on a token-passing statistical decoder and can decode word
sequences of any length and there is no need to type any spaces (Kristensson et al., 2015).
4 https://www.tobiidynavox.com/globalassets/downloads/leaflets/software/tobiidynavox-communicator-5-en.pdf
corrected text at an entry rate of 40 wpm despite an original word error rate of 22 per cent
in the speech recognizer (Vertanen and MacKay, 2010).
An orthogonal approach to error correction is to allow the user to repeat the misrecog-
nized words, and optionally also include some surrounding context (McNair and Waibel,
1994; Vertanen and Kristensson, 2009a).
In particular, if the user’s initial observation sequence generated a word confusion net-
work and the user’s error repair observation sequence generated another word confusion
network, then it is possible to find the hypothesis with the highest probability that takes
both word confusion networks into account (Kristensson and Vertanen, 2011). Figure 2.3
shows two word confusion networks, not necessarily generated by the same probabilistic
text entry method. The solid transitions show the original transitions between the word
confusion clusters. By taking some of the probability mass from the original word confusion
clusters we can soften these networks by connecting each cluster with additional wildcard
and ε-transitions and by adding wildcard self-loops to each cluster, shown in dotted lines
in Figure 2.3. The ε-transition allows the search to proceed to the next cluster without
generating a word. A wildcard-next transition allows the search to proceed to the next cluster
and generate any word. A wildcard self-loop allows the search to generate any word while
remaining in the same cluster.
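A minimal sketch of the softening step, assuming each cluster is represented as a word-to-probability map and using an invented value for the borrowed probability mass α:

```python
WILDCARD, EPSILON = '*', '<eps>'
ALPHA = 0.06  # hypothetical borrowed probability mass; tuned on data in practice

def soften(network, alpha=ALPHA):
    """network: list of clusters, each a dict mapping word -> probability.
    Takes alpha of each cluster's mass and redistributes it to a wildcard
    transition (advance and emit any word) and an epsilon transition
    (advance without emitting a word); wildcard self-loops are handled
    by the search rather than stored in the cluster."""
    softened = []
    for cluster in network:
        c = {w: p * (1.0 - alpha) for w, p in cluster.items()}
        c[WILDCARD] = alpha / 2
        c[EPSILON] = alpha / 2
        softened.append(c)
    return softened

wcn = [{'the': 0.94, 'a': 0.06},
       {'cat': 0.57, 'rat': 0.28, 'bat': 0.15}]
for cluster in soften(wcn):
    print({w: round(p, 3) for w, p in cluster.items()})
```

After softening, each cluster still sums to one, but a small amount of probability now covers words and alignments the original recognizer never hypothesized.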
The end result is a representation of a combined search space for both word confusion
networks. A merge model based on the token-passing paradigm can now search for the most
probable joint path through both word confusion networks. Each token tracks the position
in each of the word confusion networks, an accumulated probability, and the previous few
words of language-model context. The search starts with an initial token in the first cluster
of both networks. The search ends when the token leaves the last cluster in each of the
networks. The search selects a token active in the search and, based on the token’s position
in each word confusion network, identifies all possible moves that can either generate a real
[Figure 2.3 diagram: two word confusion networks over clusters t1–t4 and t1–t5, with solid transitions carrying word hypotheses and probabilities (e.g., 'the 0.94', 'bat 0.57', 'sat 0.47' in the first network; 'the 0.56', 'cat 0.82', 'sat 0.75' in the second), and dotted wildcard (∗) and ε-transitions added by softening.]
Figure 2.3 Two example word confusion networks that could be generated from the same, or a
different, probabilistic text entry method’s search graphs for the same intended text. The original
word confusion networks have been softened by additional dotted transitions.
word or a wildcard word. The search then computes the cross-product between all candidate
moves in each word confusion network subject to two rules. First, if both word confusion
networks generate real words then these words must match. In other words, the model does
not believe in a world in which a user would intend to write two different words at the same
time. Second, only one of the word confusion networks is allowed to generate a wildcard
word. Every move is assessed by a language model, which is the same as the language model
used to generate the word confusion networks. The search space is infinite but beam pruning
can be used to keep the search tractable (Kristensson and Vertanen, 2011).
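The merge search can be sketched as a recursive joint traversal of two toy softened networks. This is a deliberate simplification of the cited model: the language model is omitted, wildcard self-loops are left out, and the probabilities are invented for illustration.

```python
WILDCARD, EPSILON = '*', '<eps>'

# Two toy softened word confusion networks (cluster: word -> probability).
A = [{'the': 0.88, 'a': 0.06, WILDCARD: 0.03, EPSILON: 0.03},
     {'bat': 0.55, 'cat': 0.36, WILDCARD: 0.06, EPSILON: 0.03}]
B = [{'the': 0.56, 'a': 0.38, WILDCARD: 0.03, EPSILON: 0.03},
     {'cat': 0.82, 'fat': 0.12, WILDCARD: 0.03, EPSILON: 0.03}]

def merge(a_pos, b_pos):
    """Best joint probability and word sequence from (a_pos, b_pos) to the end.
    Rules from the merge model: two real words must match, and at most one
    network may emit a wildcard (the other then supplies the actual word).
    A real decoder would additionally weight every emitted word by an n-gram
    language model and prune with a beam."""
    if a_pos == len(A) and b_pos == len(B):
        return 1.0, []
    best = (0.0, None)
    moves = []
    if a_pos < len(A) and b_pos < len(B):
        for wa, pa in A[a_pos].items():
            for wb, pb in B[b_pos].items():
                if wa == EPSILON or wb == EPSILON:
                    continue
                if wa == WILDCARD and wb == WILDCARD:
                    continue   # only one network may produce a wildcard
                if wa != WILDCARD and wb != WILDCARD and wa != wb:
                    continue   # real words must agree
                word = wb if wa == WILDCARD else wa
                moves.append((pa * pb, 1, 1, word))
    if a_pos < len(A):   # A advances on epsilon while B stays put
        moves.append((A[a_pos].get(EPSILON, 0.0), 1, 0, None))
    if b_pos < len(B):
        moves.append((B[b_pos].get(EPSILON, 0.0), 0, 1, None))
    for p, da, db, word in moves:
        if p == 0.0:
            continue
        tail_p, tail_w = merge(a_pos + da, b_pos + db)
        cand = p * tail_p
        if cand > best[0]:
            best = (cand, ([word] if word else []) + tail_w)
    return best

prob, words = merge(0, 0)
print(words)   # the hypothesis both networks jointly support
```

Even in this toy example the effect described in the text is visible: the first network's top hypothesis for the second cluster ('bat') is overridden because only 'cat' is jointly supported by both networks.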
Initial results indicate that this merge model can reduce a word error rate of 27 per cent
for speech recognition and 14 per cent for gesture keyboard down to 6.6 per cent by merging
the word confusion networks generated for the same sentence by both modalities. It is
also possible to allow spotty correction. In spotty correction the user does not re-write or
re-speak the entire phrase or sentence. Instead the user merely re-writes or re-speaks the
words in the original phrase or sentence that have been recognized incorrectly. The system
automatically locates the erroneous words and replaces them. Initial results indicate that
spotty correction of misrecognized speech utterances can reduce an initial error rate of 48.5
per cent by approximately 28 per cent by allowing users to use the gesture keyboard modality
to re-write the incorrect words in the original speech recognition result.
The conclusion is that a probabilistic text entry method does not only provide a means to
infer the user’s intended text. The process outcome of the search itself can be used to design
interaction techniques that mitigate additional problems in the text entry process.
2.6 Discussion
There is no doubt that statistical language processing techniques have been critical in
the design of several text entry methods, such as auto-correcting touchscreen keyboards.
Decades of speech recognition research have been invaluable as a driving force for the
discovery of effective methods for training and smoothing language models and designing
efficient decoding strategies. Speech recognition is a rapidly evolving discipline, and recent
progress in deep neural networks outperforms classical speech recognition approaches
(Hinton et al., 2012). Thus, while the principles and techniques in this chapter remain
relevant, in practice, the precise modelling requirements of a complex task may suggest using
a more advanced decoder than the token-passing model briefly explained in this chapter.
A language model rests on a range of simplifying assumptions, and as a statistical model it is reliant on training text being
representative of unseen text. It is easy to identify examples which make such assumptions
dubious, for instance—as pointed out by Witten and Bell (1990)—Ernest Vincent Wright
wrote a 260-page novel in English without ever using the letter e. As another example, an
n-gram language model assumes that the probability of occurrence of a word only depends
on the last few (typically two) previous words, which is a false assumption. There is also
an implicit assumption that the training text is fully representative of the text the user is
intending to write, which, again, is nearly always false. Finally, smoothing is necessary in
order to be able to estimate probabilities over unseen sequences of characters or words.
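As a concrete illustration, one of the simplest smoothing schemes is linear interpolation of the bigram and unigram estimates. The corpus and interpolation weight below are invented for the example; production keyboards train on far larger corpora and use stronger schemes such as modified Kneser-Ney.

```python
from collections import Counter

# A toy corpus: the bigram "dog ran" never occurs, yet a decoder must not
# assign it zero probability, otherwise no amount of evidence can rescue it.
corpus = "the cat sat the dog sat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
LAMBDA = 0.7  # hypothetical interpolation weight, tuned on held-out text in practice

def p_interp(word, prev):
    """Interpolated bigram model: mix the maximum-likelihood bigram estimate
    with the unigram estimate, a simple form of smoothing."""
    p_bigram = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_unigram = unigrams[word] / len(corpus)
    return LAMBDA * p_bigram + (1 - LAMBDA) * p_unigram

print(p_interp('sat', 'cat'))   # seen bigram: dominated by the bigram estimate
print(p_interp('ran', 'dog'))   # unseen bigram: small but non-zero
```

The unigram term guarantees that any in-vocabulary word retains some probability in every context, which is exactly what a decoder needs in order to recover text the training data never contained.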
The likelihood model in a decoder can be designed in various ways. For a touchscreen
keyboard, the typical choice is 2D Gaussian distributions (Goodman et al., 2002; Vertanen
et al., 2015), although there is evidence that a slight performance improvement can be
achieved using Gaussian process regression (Weir et al., 2014).
In addition, there are other potential sources of information that are usually not taken
into account. Users’ situational context, for example, if they are walking or encumbered
when typing, is typically not modelled or taken into account. However, there is research
that indicates that sensing users’ gait and modelling this additional information source into
the auto-correction algorithm can improve performance (Goel et al., 2012). In practice, it
is difficult to model context as it relies on accurate and practical sensing technologies and
sufficient amounts of relevant training data.
Finally, the design of a decoder involves making a range of assumptions. As previously
discussed, a statistical decoder can, among other things, substitute, delete, or insert observa-
tions, and there are numerous other modifications and additional applications, for instance,
the merge model for word confusion networks which was explained previously. The type
of decoder, the precise model, and the selection and setting of parameters will all affect
performance. In practice, it can be difficult to implement and train a statistical decoder with
complex modelling assumptions.
particular for text entry, that there is resistance to change due to the amount of effort invested
in learning to be proficient with typing on a qwerty layout. This suggests a solution
principle of path dependency (David, 1985; Kristensson, 2015)—user interfaces are familiar
to users for a reason and unless there is a good reason for redesign, it may be better to design
a system that supports existing user behaviour. In the case of text entry design, statistical
decoding allows us to present users with a miniature version of a mobile phone keyboard
for a smartwatch, and have it support the typing behaviour and patterns users are already
familiar with.
Another example discussed in this chapter covered the limitations of language models,
which in turn affect the efficacy of touchscreen keyboard auto-correction algorithms. Earlier
we saw that allowing users to modulate the uncertainty of their input to the auto-correction
algorithm reduces error rate. This suggests a solution principle of fluid regulation of uncer-
tainty. In other words, when an interactive system is interpreting noisy data from a user it
may be worthwhile to investigate if it is possible to design a method that enables users to
express their own certainty in their interaction. This additional information can be encoded
into a statistical decoder or other computational model of interaction. It allows users to
guide the system’s inference in situations when the system’s model may have incomplete
information. Thus it enables users to preempt errors, increasing efficiency and reducing
frustration by preventing an error recovery process.
Finally, another solution principle suggested by the applications presented in this chapter
is flexibility—allow users to optimize their interaction strategy based on situational con-
straints. For example, consider the system we described earlier, which allows users to repair
erroneous recognition results by re-speaking or re-writing the intended text, and optionally
include some surrounding context to maximize the probability of success. Such a system
could potentially allow users to flexibly combine multiple modalities, such as speaking and
typing in thin air with an optical see-through head-mounted display. In such a scenario, both
text entry methods are probabilistic and the user’s interaction is inherently uncertain. A user
may, for instance, adopt a strategy of speaking first and then opt to fix errors by re-typing the
intended text in thin air.
The solution principles we have sketched above are examples of systematic design
insights, which, if underpinned by engineering science or empirical research, allow com-
putational interaction designers to explore a richer design space where the fabric of design
also includes taking into account behavioural aspects of users interacting with inherently
uncertain intelligent interactive systems.
2.7 Conclusions
This chapter has explained how methods from statistical language processing serve as a
foundation for the design of probabilistic text entry methods and error correction methods.
It has reviewed concepts from information theory and language modelling and explained
how to design a statistical decoder for text entry using a generative probabilistic
model based on the token-passing paradigm. It then presented five example applications of
statistical language processing for text entry: correcting typing mistakes, enabling fast typing
on a smartwatch, improving prediction in augmentative and alternative communication,
Kristensson, P. O., Vertanen, K., and Mjelde, M., 2015. Gaze based text input systems and methods.
US Patent App. 14/843,630.
Kristensson, P. O., and Zhai, S., 2005. Relaxing stylus typing precision by geometric pattern matching.
In: Proceedings of the 10th international conference on Intelligent user interfaces, pp. 151–8.
Kurihara, K., Goto, M., Ogata, J., and Igarashi, T., 2006. Speech pen: predictive handwriting based
on ambient multimodal recognition. In: Proceedings of the SIGCHI conference on human factors in
computing systems. New York, NY: ACM, pp. 851–60.
MacKay, D. J., Ball, C. J., and Donegan, M., 2004. Efficient communication with one or two buttons.
In: AIP Conference Proceedings, 735(1), pp. 207–18.
Majaranta, P., Ahola, U.-K., and Špakov, O., 2009. Fast gaze typing with an adjustable dwell time.
In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. New York, NY:
ACM, pp. 357–60.
Majaranta, P., and Räihä, K.-J., 2002. Twenty years of eye typing: systems and design issues. In:
Proceedings of the 2002 symposium on Eye tracking research & applications. New York, NY: ACM,
pp. 15–22.
Mangu, L., Brill, E., and Stolcke, A., 2000. Finding consensus in speech recognition: word error
minimization and other applications of confusion networks. Computer Speech & Language, 14(4),
pp. 373–400.
McNair, A. E., and Waibel, A., 1994. Improving recognizer acceptance through robust, natural speech
repair. In: Third International Conference on Spoken Language Processing. New York, NY: ACM,
pp. 1299–1302.
Miller, G. A., 1957. Some effects of intermittent silence. The American Journal of Psychology, 70(2),
pp. 311–14.
Moore, R. C., and Lewis, W., 2010. Intelligent selection of language model training data. In: Proceedings
of the ACL 2010 conference short papers. Red Hook, NY: Curran Associates, pp. 220–4.
Oney, S., Harrison, C., Ogan, A., and Wiese, J., 2013. Zoomboard: a diminutive qwerty soft keyboard
using iterative zooming for ultra-small devices. In: Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems, pp. 2799–802.
Räihä, K.-J., and Ovaska, S., 2012. An exploratory study of eye typing fundamentals: dwell time, text
entry rate, errors, and workload. In: Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems. New York, NY: ACM, pp. 3001–10.
Rough, D., Vertanen, K., and Kristensson, P. O., 2014. An evaluation of dasher with a high-
performance language model as a gaze communication method. In: Proceedings of the 2014 Inter-
national Working Conference on Advanced Visual Interfaces. New York, NY: ACM, pp. 169–76.
Shannon, C., 1948. A mathematical theory of communication. The Bell System Technical Journal, 27(3),
pp. 379–423.
Speier, W., Arnold, C., and Pouratian, N., 2013. Evaluating true bci communication rate through
mutual information and language models. PLoS One, 8(10), e78432.
Tuisku, O., Majaranta, P., Isokoski, P., and Räihä, K.-J., 2008. Now dasher! dash away!: longitudinal
study of fast text entry by eye gaze. In: Proceedings of the 2008 Symposium on Eye Tracking Research
& Applications. New York, NY: ACM, pp. 19–26.
Vertanen, K., and Kristensson, P. O., 2009a. Automatic selection of recognition errors by respeaking
the intended text. In: Automatic Speech Recognition & Understanding, 2009. ASRU 2009. New York,
NY: IEEE Press, pp. 130–5.
Vertanen, K., and Kristensson, P. O., 2009b. Parakeet: A continuous speech recognition system for
mobile touch-screen devices. In: Proceedings of the 14th international conference on Intelligent user
interfaces. New York, NY: ACM, pp. 237–46.
Vertanen, K., and Kristensson, P. O., 2011. The imagination of crowds: conversational aac language
modeling using crowdsourcing and large data sources. In: Proceedings of the Conference on Empirical
Methods in Natural Language Processing. Red Hook, NY: Curran Associates, pp. 700–11.
Vertanen, K., and MacKay, D. J., 2010. Speech dasher: fast writing using speech and gaze. In: Pro-
ceedings of the SIGCHI Conference on Human Factors in Computing Systems. New York, NY: ACM,
pp. 595–98.
Vertanen, K., Memmi, H., Emge, J., Reyal, S., and Kristensson, P. O., 2015. Velocitap: Investigating fast
mobile text entry using sentence-based decoding of touchscreen keyboard input. In: Proceedings of
the 33rd Annual ACM Conference on Human Factors in Computing Systems. New York, NY: ACM,
pp. 659–68.
Ward, D. J., Blackwell, A. F., and MacKay, D. J., 2000. Dasher—a data entry interface using continuous
gestures and language models. In: Proceedings of the 13th annual ACM symposium on User interface
software and technology. New York, NY: ACM, pp. 129–37.
Ward, D. J., and MacKay, D. J., 2002. Fast hands-free writing by gaze direction. Nature, 418(6900),
p. 838.
Weir, D., Pohl, H., Rogers, S., Vertanen, K., and Kristensson, P. O., 2014. Uncertain text entry on
mobile devices. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
New York, NY: ACM, pp. 2307–16.
Witten, I. H., and Bell, T. C., 1990. Source models for natural language text. International Journal of
Man-Machine Studies, 32(5), pp. 545–79.
Witten, I. H., Neal, R. M., and Cleary, J. G., 1987. Arithmetic coding for data compression. Communi-
cations of the ACM, 30(6), pp. 520–40.
Young, S. J., Russell, N., and Thornton, J., 1989. Token passing: a simple conceptual model for connected
speech recognition systems. Cambridge: University of Cambridge Engineering Department.
Zhai, S., and Kristensson, P.-O., 2003. Shorthand writing on stylus keyboard. In: Proceedings of the
SIGCHI conference on Human factors in computing systems. New York, NY: ACM, pp. 97–104.
Zipf, G. K., 1949. Human Behavior and the Principle of Least Effort. Cambridge, MA: Addison-Wesley.
3
Input Recognition
Otmar Hilliges
3.1 Introduction
Sensing and processing of user input lies at the core of Human-Computer Interaction (HCI)
research. In tandem with various forms of output (e.g., graphics, audio, haptics) input
sensing ties the machine and the user together and enables interactive applications. New
forms of input sensing have triggered important developments in the history of computing.
The invention of the mouse by Douglas Engelbart in the early 1960s and its mainstream
adoption in the late 1980s and early 1990s played a significant role in the transition
from the mainframe computer to the personal computing era. Similarly, the perfection of
capacitive multi-touch input by Apple with the introduction of the iPhone in 2007 has
dramatically changed the computing landscape and has brought computing to entirely new
parts of the population.
Notably, the timeframes from the invention of an input paradigm to its mainstream adoption
can be very long, sometimes lasting up to thirty years. This process can be explained
by the intrinsic connection between input recognition and user experiences. Humans are
remarkably good at detecting isolated events that are not in line with their expectations,
which gives us the evolutionary advantage of detecting and reacting to potentially life-
threatening events. However, this also implies that any input recognition mechanism has to
work accurately almost 100 per cent of the time or a user will very quickly get the impression
that ‘it does not work’ (even when 95 per cent of the time it does). This implies that any
input paradigm has to undergo many iterations of invention, design, and refinement before
it reaches the necessary accuracies. Furthermore, any time we touch the input stack, we also
alter how the user interacts with the machine and thus impact the user experience. Hence,
the design and engineering of input recognition techniques is a crucial part of HCI research
and is fundamental in bringing more alternative and complementary input paradigms closer
to end-user adoption, enabling new forms of post-desktop interaction and hence new forms
of computing.
This chapter provides an overview of the general approach to input sensing and the processing
of the data produced by a variety of sensors. In particular, we focus on the extraction
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
of semantic meaning from low-level sensor data. That is, the chapter provides a general
framework for the processing of sensor data until it becomes usable as an input mechanism
in interactive applications. Typically, sensors collect physical signals, such as pressure or
motion, and convert them into electrical signals. These signals are then streamed to a
machine, which is programmed to recognize, process, and act upon them.
3.2 Challenges
Much research in HCI is dedicated to the questions of what humans can do with computers
and how they interact with them. So far we have seen three dominant waves in computer
interfaces: purely text-based interfaces (e.g., the command line, 1960s), graphical interfaces
based on mouse and keyboard (1980s), and direct-touch based interfaces on mobile phones
and tablet computers (2000s). As digital technology moves further away from the desktop
setting, it is becoming increasingly clear that traditional interfaces are no longer adequate
means for interaction and that the traditional computing paradigm will be replaced or
complemented by new forms of interaction.
It is no longer hard to predict that we are heading towards a future where almost every
man-made object in the world will have some form of input, processing, and output capabil-
ities. This implies that researchers in HCI need to explore and understand the implications
as well as the significant technological challenges of leveraging various sensors to facilitate
and enhance interaction. Some researchers seek to expand and improve the interaction space
for existing devices (e.g., augmented mice or keyboards Taylor et al.). Others try to develop
entirely novel interactive systems, forgoing traditional means of input entirely. Such devel-
opments are often explored alongside the emergence of new sensing modes and interaction
paradigms (e.g., depth cameras for natural user interfaces Hilliges et al. “Holodesk”).
Together these emerging technologies form the basis for new ways of interaction with
machines of all kinds and can, if carefully designed, lay the foundations for a diverse set of
future application domains, including smart home and health support appliances, intelligent
living and working spaces, interactive devices such as mobile and wearable devices, auto-
mobiles and robots, smart tutoring systems, and much more. However, for such benefits
to materialize many challenges in input recognition have to be overcome. These can be
categorized as follows:
input recognition | 67
existing ‘hand-gesture recognition’ sensor that could simply be added to a portable device.
Instead, a variety of solutions have to be considered. One could base the detection on colour
images (Song, Sörös, Pece, Fanello, Izadi, Keskin, and Hilliges, 2014), motion sensors and
magnetometers (Benbasat and Paradiso, 2002), radar waves (Song, Wang, Lien, Poupyrev, and Hilliges, 2016), or piggy-backing on existing Wi-Fi infrastructure (Pu, Gupta, Gollakota, and Patel, 2013). Each of these technologies comes with intrinsic advantages and drawbacks,
e.g., a camera requires line-of-sight and incurs high computational cost, while leveraging IMUs or radio-frequency signals limits the fidelity of inputs that can be recognized. Clearly, a complex
and difficult-to-navigate design space ensues. Unfortunately, there exists no one-size-fits-
all solution to this problem. However, understanding the characteristics of various sensing
modalities and their implications for input sensing can inform the design process and finally
determines the usability of a system.
designing motion-based interfaces, implying that there is a need for sensing technology
and input recognition algorithms that can detect and discriminate subtle and very low-magnitude motions in order to reduce fatigue and effort of use. In terms of context dependency, speech-based interfaces are a prime example. While speaking to a machine may be perfectly acceptable, and potentially even the preferred mode of interaction, in some settings
such as the home, it may be entirely unacceptable in other, more social, settings, such as on
a train or in an elevator. Thus, research in computer vision, signal processing, and machine
learning for sensing and modelling of multimodal human-machine interaction to recognize spatiotemporal activities, to integrate multiple sensor sources, and to learn individual- and context-dependent human-behaviour models are all important research directions.
3.2.4 Personalization
Many HCI research prototypes have employed non-standard forms of input such as gestures
where the exact mapping between human action and system response is created by system
designers and researchers. Although such gestures are appropriate for early investigations,
they are not necessarily reflective of long-term user behaviour. A tension therefore arises between how users learn to operate a system and how they would want to operate it. This issue was first highlighted by Wobbrock, Morris, and Wilson (2009) in a so-called gesture elicitation study, in which users are first shown the effect of an
input and are then asked to perform a matching gesture. The resulting user-defined gesture
sets often vary drastically from ‘expert’ designed gesture sets. Similar findings have been
replicated in many follow-up studies and a pattern emerges in which there is high agreement
only for a small set of system functionalities (often those that leverage some physically
inspired metaphor) and a large number of functionalities that elicit almost no agreement
across users. These results indicate a need for personalization of input recognition systems
(and gesture sets). However, there is very little technical research on how to implement such
personalized recognition systems.
Some of these challenges have known solutions, which we aim to address in this chapter,
whereas others such as online adaptation of input recognition systems and their personal-
ization remain largely unaddressed and open up directions for future research.
Due to the mentioned issues, modern multi-modal interfaces leverage sophisticated
machine learning models to recover user intent and input from often noisy and incomplete
sensor data. In the following sections we provide a brief overview of a general framework that applies such machine-learning models to interaction tasks.
Figure 3.1 General data-driven input recognition pipeline (sensor → features → predictive model → response). A sensor signal produces a feature set, which is used to train a predictive model. This model can then be deployed on computing devices to enable interaction.
Several textbooks provide a good overview of the many machine-learning algorithms available
(e.g., Mohri, Rostamizadeh, and Talwalkar, 2012; Bishop, 2006). Nonetheless, it is important to understand the basics of
machine learning in order to build intelligent and robust interactive systems that ideally are
flexible enough to work well across different users and user groups, as well as being capable
of adapting to individual user preferences and capabilities.
To the uninitiated, the sheer number of algorithms that have been proposed can be
bewildering, and hence many HCI researchers treat machine learning as a ‘magic black
box’. However, due to the unique settings and requirements of HCI, it is not only beneficial
but perhaps even mandatory to go beyond this view and to open up the black box in
order to adjust its workings to the specific needs of interactive systems. In this spirit we
attempt to summarize the most important aspects of machine learning and illustrate how
many important concepts map onto HCI and input recognition problems. We summarize
our discussion by highlighting where HCI settings go beyond the typically considered
problem definition and thus identify fruitful areas for future work.
Mitchell (1997) provides the most compact, formal definition of machine learning: ‘A
computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P if its performance at tasks in T, as measured by P, improves with
experience E.’
Here, the task T refers to the type of skill that the algorithm is supposed to learn
(not the learning itself). Tasks are typically decomposed into supervised, unsupervised,
and reinforcement learning tasks. For the purpose of this discussion we limit ourselves
to the supervised case. Machine learning tasks are usually described in terms of how
the system should process an example. In our context, an example is typically a col-
lection of features extracted from low-level sensor signals. The most prominent task is
then that of classification—that is, the algorithm is supposed to decide to which of k
categories an input sample belongs. Other tasks (which we do not treat in more depth)
include regression, transcription, machine translation, clustering, denoising, and infilling of
missing data.
The performance measure P is a quantitative measure of the system’s ability to perform
the task. Normally the measure is specific to the task T. In classification, one often measures
the accuracy of the model, defined as the proportion of examples for which the model
produces the correct output (i.e., assigns the correct label).
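As a minimal sketch, accuracy as the performance measure P can be computed in a few lines; the label arrays here are invented for illustration:

```python
# Minimal sketch of the performance measure P for classification:
# accuracy is the proportion of examples assigned the correct label.
def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical ground-truth labels and model predictions.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 correct -> 0.75
```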
The categorization of ML tasks is broadly aligned with the type of experience they are
allowed to have during the learning process. Most of the learning algorithms in this chapter
can be understood as experiencing an entire (labelled) dataset. A dataset itself is a collection
of many examples, often also called data points or feature vectors.
predict the value of a scalar y ∈ R as output. In linear regression, the estimated output ŷ is a
linear function of the inputs:
ŷ = wᵀx, (3.1)
where w ∈ Rⁿ is a vector of parameters that we seek to learn from data. One can think
of w as a set of weights that determine how much each feature affects the prediction of the
value of y. This then formally defines the task T, which is to predict y from x by computing
ŷ = wᵀx.
Now we need a definition of the performance measure P. In the 2D case, linear regression
can be thought of as fitting a line to a collection of 2D data points. The standard way of
measuring performance in this setting is to compute the mean squared error (MSE) of the
model on a (previously unseen) test set. The MSE on the test data is defined as:
$$\mathrm{MSE}^{(test)} = \frac{1}{m}\,\big\|\hat{y}^{(test)} - y^{(test)}\big\|_2^2. \qquad (3.2)$$
Intuitively, one can see that the error decreases to 0 exactly when ŷ(test) = y(test) . Figure 3.2,
left schematically illustrates that the error (the area of the green squares, hence the name
of the metric) decreases when the Euclidean distance between the predictions and the true
target values decreases.
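As a short sketch, the test-set MSE of Equation 3.2 can be computed directly; the prediction and target vectors below are invented for illustration:

```python
import numpy as np

# Sketch of Equation 3.2: mean squared error on a held-out test set.
def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

y_test = np.array([1.0, 2.0, 3.0])  # hypothetical targets
y_hat = np.array([1.5, 2.0, 2.0])   # hypothetical predictions
print(mse(y_hat, y_test))            # (0.25 + 0.0 + 1.0) / 3
```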
We now need an algorithm that allows the quality of the predictions to improve by observ-
ing a training set (X (train) , y(train) ). In this case the learning procedure entails adjusting the
set of weights w. This is done via minimization of the MSE on the training set. To minimize
MSE^(train), we can simply solve for where its gradient is 0, ∇_w MSE^(train) = 0,
which can be solved for w in closed form via the normal equations:

$$w = \big(X^{(train)\,T} X^{(train)}\big)^{-1} X^{(train)\,T}\, y^{(train)}. \qquad (3.3)$$
Intuitively, solving Equation 3.3 will ‘move’ the line in Figure 3.2, left, around until the sum
of the areas of the green squares is smallest. In practice one often uses a slightly more
sophisticated model that also includes an intercept term b so that the full model becomes
Figure 3.2 Illustration of linear regression and the mean squared error. Left: the simple model fits a line
through the origin and tries to minimize the mean area of the green squares. Right: the more complex
model with an intercept parameter can fit lines that do not have to pass through the origin.
OUP CORRECTED PROOF – FINAL, 22/12/2017, SPi
ŷ = wᵀx + b. This additional parameter removes the requirement for the line to pass
through the origin (cf. Figure 3.2, right).
Linear regression is a simple yet powerful model and it provides an almost complete
example of how to build a learning algorithm.
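The closed-form solution above can be sketched in a few lines of NumPy; the data is synthetic, and the intercept b is folded into w by appending a constant-one feature:

```python
import numpy as np

# Sketch: linear regression in closed form via the normal equations
# (Equation 3.3), on synthetic data with known weights and intercept.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.01 * rng.normal(size=100)

Xb = np.hstack([X, np.ones((len(X), 1))])  # append intercept column
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)   # solves (X^T X) w = X^T y
print(np.round(w, 2))                       # close to [2, -1, 0.5, 4]
```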
one should we choose? Rather than picking an estimator at random and then analysing
the bias and variance we can leverage the maximum likelihood principle to derive specific
functions that are good estimators for a particular model. Consider the set of m examples
X = {x^(1), …, x^(m)} drawn from the true (but unknown) data generating distribution
p_data(x). We then let p_model(x; θ) be a parametric family of probability distributions over
the same space, indexed by θ. Intuitively speaking, p_model maps any x to a numerical value
that estimates the true probability p_data. The maximum likelihood estimator (MLE) for θ
is then given by:

$$\theta^{*} = \arg\max_{\theta}\, p_{\mathrm{model}}(X; \theta) \qquad (3.4)$$

$$\theta^{*} = \arg\max_{\theta} \prod_{i=1}^{m} p_{\mathrm{model}}\big(x^{(i)}; \theta\big). \qquad (3.5)$$
This product of probabilities has a number of numerical inconveniences and is therefore
substituted by a better-behaved but equivalent formulation, leveraging the observation
that taking the logarithm of the likelihood does not change the location of the optimum
but allows us to rewrite the product as a sum:

$$\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\mathrm{model}}\big(x^{(i)}; \theta\big). \qquad (3.6)$$

Dividing by m gives a version of the objective expressed as an expectation with respect to the
empirical distribution p̂_data given by the m training data samples:

$$\theta^{*} = \arg\max_{\theta}\, \mathbb{E}_{x \sim \hat{p}_{\mathrm{data}}} \log p_{\mathrm{model}}(x; \theta). \qquad (3.7)$$
One way of interpreting this is to think of MLE as a minimization of the dissimilarity
between the empirical distribution p̂_data and the model distribution p_model. In other words,
we want the model to predict values that are as close as possible to the samples in the training
data. If the model predicted all samples in the training data exactly, the two distributions
would be the same. This dissimilarity can be measured by the KL divergence:

$$D_{\mathrm{KL}}(\hat{p}_{\mathrm{data}} \,\|\, p_{\mathrm{model}}) = \mathbb{E}_{x \sim \hat{p}_{\mathrm{data}}}\big[\log \hat{p}_{\mathrm{data}}(x) - \log p_{\mathrm{model}}(x)\big]. \qquad (3.8)$$
To minimize the KL divergence we only need to minimize (since the remaining term does
not depend on the model):

$$-\mathbb{E}_{x \sim \hat{p}_{\mathrm{data}}}\big[\log p_{\mathrm{model}}(x)\big], \qquad (3.9)$$

which obtains the same result as maximizing Equation 3.7. This process is referred to
as minimizing the cross-entropy between the two distributions. Any loss consisting of a
negative log-likelihood forms such a cross-entropy between training and model distribution.
In particular, the previously used mean-squared error is the cross-entropy between the
empirical distribution and a Gaussian model distribution. We will use this aspect shortly
to model linear regression in the MLE framework.
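This equivalence can be made concrete with a small sketch on synthetic one-dimensional data: scanning candidate slopes shows that the Gaussian negative log-likelihood and the MSE are minimized by the same parameter.

```python
import numpy as np

# Sketch: under a Gaussian model distribution, the negative log-likelihood
# is a scaled MSE, so both objectives share the same optimum.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.5 * x + 0.1 * rng.normal(size=200)          # synthetic data, slope 1.5

ws = np.linspace(0.0, 3.0, 601)                    # candidate slopes
mse = np.array([np.mean((y - w * x) ** 2) for w in ws])
nll = np.array([0.5 * np.sum((y - w * x) ** 2)     # Gaussian NLL; constants
                for w in ws])                      # and sigma dropped

print(ws[np.argmin(mse)], ws[np.argmin(nll)])      # identical minimizers
```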
where ŷ(i) is the output of the linear regression for the ith input and m is the number of
training samples. Recalling that m and σ are constants, we can see from the RHS that the
above equation will have a different value than the mean squared error defined in Equation
3.2, but an identical location of the optimum. In other words, optimizing Equation
3.12 with respect to the model parameters w will produce exactly the same parameters as minimizing
the mean squared error.
We now have a framework with desirable properties: the maximum-likelihood estimation
principle. We have also seen that it produces the same result as optimizing the MSE in
the case of linear regression and, due to the MLE properties, this justifies the (previously
arbitrary) choice of MSE for linear regression. Assuming that sufficient training samples
m are given, MLE is often the preferred estimator to use for a wide variety of machine
learning problems. Generalizing the MLE principle to conditional probabilities provides the
basis for most machine learning algorithms, in particular, those that fall into the supervised
learning category.
to be minimized to find the best parameters w, b. This can be solved in closed form via the
normal equations. We have also discussed that cost functions will typically contain at least one
term where parameters have to be estimated, and that MLE is the preferred principle for doing
so in many instances.
Even if we add additional terms to the cost function C(w, b), such as a regularization term
λ‖w‖₂², it can still be solved in closed form. However, as soon as the model becomes non-
linear, most cost functions can no longer be minimized in closed form. This requires the use of
iterative numerical optimization procedures such as gradient descent (Cauchy, 1847) or,
more commonly, stochastic gradient descent (SGD). In the past, gradient descent methods
were considered a bad choice, especially if the cost function is non-convex.
Today we know that SGD often converges to very low values of the cost function, especially
for models with very large numbers of parameters and training datasets of increasing size.
The most prominent example of such ML models are the currently very successful class of
Deep Neural Networks.
In order to make a cost function amenable to optimization via SGD, we can decompose
it as a sum of per-example loss terms, where L(x, y, θ) = − log p(y|x; θ) is the per-
example loss in the case of cross-entropy:

$$C(\theta) = \mathbb{E}_{x,y \sim \hat{p}_{\mathrm{data}}}\, L(x, y, \theta) = \frac{1}{m} \sum_{i=1}^{m} L\big(x^{(i)}, y^{(i)}, \theta\big). \qquad (3.14)$$
The core idea here is that the gradient itself is an expectation and that it can be approxi-
mated using a small set of samples. Hence, in every step of the SGD algorithm a minibatch
of samples is drawn at random from the training data to compute the gradient, which determines
in which direction the algorithm moves. The size m of the minibatch is typically held constant
at a few hundred samples, even if the total training data contains millions of samples. This
gradient approximation is performed as follows:

$$g = \frac{1}{m} \nabla_{\theta} \sum_{i=1}^{m} L\big(x^{(i)}, y^{(i)}, \theta\big). \qquad (3.16)$$
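A minimal sketch of this procedure, applied to linear regression on synthetic data, reads as follows; the learning-rate schedule and batch size are arbitrary illustration choices:

```python
import numpy as np

# Sketch of Equation 3.16: minibatch SGD for linear regression. Each step
# estimates the gradient from a random minibatch and descends along it.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=1000)

w, batch = np.zeros(3), 32
for step in range(2000):
    lr = 0.1 / (1 + 0.01 * step)                        # decaying step size
    idx = rng.choice(len(X), size=batch, replace=False)  # random minibatch
    g = 2.0 / batch * X[idx].T @ (X[idx] @ w - y[idx])   # minibatch gradient
    w -= lr * g
print(np.round(w, 1))  # approaches the true weights [1, -2, 0.5]
```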
goal then becomes to find a function g : X → Y that maps a sample x from the input space X
to the output space Y. The function g is an element of the so-called hypothesis space G that
contains all possible functions. In some algorithms, g is represented by a scoring function
f : X × Y → R so that g returns the label with the highest score under f: g(x) = arg max_y f(x, y). While G and
F can be arbitrary spaces of functions, many machine learning algorithms are probabilistic
in nature, and hence the function g often is either a conditional probability model g(x) =
p(y|x) or a joint probability model f(x, y) = p(x, y). For example, naïve Bayes (which we
will not discuss in this chapter) is a joint probability model and Decision Trees and Random
Forests (which we will discuss in this chapter) are conditional probability models. However,
there are successful and widely applied non-probabilistic models, most notably the Support
Vector Machine (SVM).
As we have seen before, in many cases maximum likelihood estimation is the preferred
approach to optimizing supervised learning models, in particular those that are based on
estimating a conditional probability distribution p(y|x). This learning goal is then achieved
via MLE of the parameter vector θ for a parametric family of distributions p(y|x; θ).
As discussed above, linear regression corresponds to this family of supervised learning
algorithms.
Figure 3.3 Left: Illustration of the max-margin property. Hyperplane H3 does not correctly
separate classes. Hyperplanes H1 and H2 both separate correctly but H2 is the better classifier since
it is furthest from all sample points. Right: Schematic overview of the SVM algorithm. Dotted lines
indicate decision surface and margins. Support vectors are marked with dotted circles.
this topic is not treated here). The goodness of the classifier is parametrized as the distance
to the nearest training data points of any class. Intuitively, this can be seen as a quality
metric of the classification since a large distance to the decision surface, again intuitively,
implies a low generalization error or chance of misclassification. Given a training set
D = {(x_i, y_i)}_{i=1}^{n}, with x_i ∈ R^p the p-dimensional feature vectors
and y_i ∈ {−1, 1} their categorical labels, we want to construct the max-margin hyperplane that separates
the samples with y_i = 1 from those with y_i = −1. A hyperplane is parametrized as the set of
points that satisfy:
wT x − b = 0, (3.19)
where w is a vector that is perpendicular to the hyperplane (i.e., a surface normal) and the
distance of the hyperplane from the origin is given by b/‖w‖. The learning task is then to find
the parameters b and w that maximize the margin to the closest training samples. Intuitively
(and in 2D) this can be seen as a street where the middle line is the decision hyperplane
and the curbs are two parallel hyperplanes, which we wish to push as far away from the centre
line as possible. These two parallel hyperplanes can be written as wᵀx − b = 1 and wᵀx − b = −1,
respectively. Via geometric considerations we can show that the distance between these two
curbs is 2/‖w‖. Maximizing this distance is equivalent to minimizing ‖w‖. In order to prevent any
training samples from falling within this street, we add the constraints wᵀxᵢ − b ≥ 1 for all xᵢ with yᵢ = 1
and wᵀxᵢ − b ≤ −1 for all xᵢ with yᵢ = −1. This can be written more compactly as:
yi (wT xi − b) ≥ 1, ∀ 1 ≤ i ≤ n, (3.20)
Using this constraint, the final optimization criterion is then:
arg min ||w||, s. t. yi (wT xi − b) ≥ 1, ∀ 1 ≤ i ≤ n. (3.21)
w,b
This optimization problem can be solved efficiently and is provably convex (given a linearly
separable training set). Evaluating the classifier at runtime is also efficient as it only depends
on a dot product between the test sample and w. For an illustration of the various terms, see
Figure 3.3, right.
The original SVM algorithm as discussed so far is a linear classifier. However, for the
task of handwritten digit recognition, Boser, Guyon, and Vapnik (1992) extended the
algorithm to also deal with non-linearly separable data by introducing the so-called kernel
trick (Schölkopf, Burges, and Smola, 1999). The algorithm remains almost the same, with
the only difference being that the dot product is replaced by a (non-linear) kernel function, which
allows for fitting of the hyperplane in a transformed, higher-dimensional space. This so-
called non-linear or kernelized SVM is the version most frequently found in practice. The
SVM algorithm has been implemented and is available as a black box classifier in many
frameworks and toolkits (including Weka, scikit-learn, and OpenCV; see Witten, Frank,
Hall, and Pal, 2016; Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel,
Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot, and Duchesnay,
2011; Bradski, 2000). Due to the ease of training, runtime efficiency, and availability
of high-quality implementations, alongside generally good out-of-the-box performance on
many different tasks, this classifier has been an extremely popular tool in the HCI literature.
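As an illustration of the optimization in Equation 3.21, the following sketch trains a soft-margin linear SVM by sub-gradient descent on the hinge loss (a simple stand-in for the dedicated solvers used by the toolkits above, not their implementation); the 2D data is synthetic and linearly separable:

```python
import numpy as np

# Sketch: primal soft-margin linear SVM via sub-gradient descent on
# lambda/2 * ||w||^2 + mean hinge loss, on synthetic separable data.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

w, b, lam, lr, n = np.zeros(2), 0.0, 0.01, 0.1, len(y)
for _ in range(2000):
    viol = y * (X @ w - b) < 1                    # samples inside the street
    gw = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
    gb = y[viol].sum() / n
    w -= lr * gw
    b -= lr * gb

print((np.sign(X @ w - b) == y).all())            # all samples separated
print(2 / np.linalg.norm(w))                      # width of the margin street
```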
distributions Pt (c|I, x) over per-pixel class labels c. These probability distributions describe
the likelihood with which a pixel belongs to any of the classes in the label space. The
distributions of individual trees are then averaged together over all trees in the forest to attain
the final class probability:
$$P(c \mid I, x) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid I, x). \qquad (3.23)$$
These per-pixel probabilities are then often pooled across the image and a simple majority
vote is used to decide the class label for a given frame (or image) I.
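The averaging in Equation 3.23 and the subsequent vote can be sketched directly; the three per-tree distributions below are invented for illustration:

```python
import numpy as np

# Sketch of Equation 3.23: the forest posterior is the average of the
# per-tree posteriors over the label space.
per_tree = np.array([[0.7, 0.2, 0.1],   # tree 1: P_1(c | I, x)
                     [0.5, 0.4, 0.1],   # tree 2
                     [0.6, 0.1, 0.3]])  # tree 3
forest = per_tree.mean(axis=0)          # P(c | I, x)
print(np.round(forest, 2))              # roughly [0.6, 0.23, 0.17]
print(int(forest.argmax()))             # class 0 wins the majority vote
```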
An important aspect is that the order in which the split functions are concatenated (the
tree structure), the thresholds τ, and the probability distributions stored in the leaf nodes are
learned from annotated training data. In the case of RFs this is done by randomly selecting
multiple split candidates and choosing the one that splits the data best. The quality metric
for the split is typically defined by the information gain I_j at node j:

$$I_j = H(C_j) - \sum_{i \in \{L,R\}} \frac{|C_j^{\,i}|}{|C_j|}\, H(C_j^{\,i}). \qquad (3.24)$$
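The information gain of a candidate split can be sketched in a few lines, with H the Shannon entropy of the empirical label distribution at a node; the toy labels are invented:

```python
import numpy as np

# Sketch of Equation 3.24: information gain of a candidate split.
def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in (left, right))

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# A split that separates the classes perfectly gains H(parent) = 1 bit;
# an uninformative split gains nothing.
print(information_gain(labels, labels[:4], labels[4:]))     # 1.0
print(information_gain(labels, labels[::2], labels[1::2]))  # 0.0
```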
The quality (information content) of the extracted features is a decisive factor in the
overall accuracy achieved by a recognition system. The process of feature extraction derives
values from, and hence summarizes, the raw data in such a way that, ideally, the resulting
feature vectors are informative and non-redundant and facilitate the subsequent
learning and generalization steps. Often an attempt is made to extract features that also
provide human intuition or that allow domain knowledge about the signal and
its interpretation to be brought into the otherwise generic machine learning process.
A further benefit of this data reduction step is that it reduces the amount of computational
resources needed to describe a large set of data. Analysis with a large number of variables
generally requires a large amount of memory and computation power, and hence reducing
the number of variables to the most important ones can drastically speed up both training
and testing times.
Furthermore, training an ML model on the raw data may lead to overfitting to contextual
information in the training data. For example, a model trained to recognize human activity
may learn to ‘recognize’ activities such as Tai-Chi (which is often performed in parks and
hence on grass) purely based on the green colour of the background if it was trained using
the entire image. This issue then causes poor generalization to new samples (in which the
context may be different).
predictions against ground-truth labels is used as the scoring function. Due to the need to train
many different models, such algorithms can be inefficient.
Filter methods use a so-called proxy measure to score the different feature subsets.
Popular metrics are often based on mutual information (Peng, Long, and Ding, 2005) or
correlation measures (Senliol, Gulgezen, Yu, and Cataltepe, 2008; Hall and Smith, 1999).
While filter methods are more efficient than wrapper methods, they are not tuned to any
particular machine learning model and hence a high-scoring feature subset may still perform
poorly in combination with some predictive models.
Embedded methods is an umbrella term for various techniques that embed the feature
selection process in the predictive model construction step, for example, training multiple
versions of support vector machines (SVMs) and then iteratively removing features with low
weights. Such methods often lie somewhere between filter and wrapper methods in terms
of computational complexity, but can result in better matches between feature set and
predictive model.
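A filter method can be sketched with a simple correlation score; the data below is synthetic, and only feature 0 carries information about the label:

```python
import numpy as np

# Sketch of a filter method: score each feature by the absolute Pearson
# correlation with the label and keep the top-k.
rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=300)
X = np.column_stack([y + 0.3 * rng.normal(size=300),  # informative feature
                     rng.normal(size=300),            # noise
                     rng.normal(size=300)])           # noise

scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                   for j in range(X.shape[1])])
keep = np.argsort(scores)[::-1][:1]   # indices of the top-1 features
print(keep)                            # selects the informative feature
```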
Figure 3.4 Gestures recognized by an RF-based classifier allow users to combine typing and
gesturing to (A) navigate documents, (B) switch tasks, and (C+D) control complex applications
that require frequent mode changes (Taylor, Keskin, Hilliges, Izadi, and Helmes, 2014).
Figure 3.5 Song, Sörös, Pece, Fanello, Izadi, Keskin, and Hilliges (2014) propose an ML-based
algorithm for gesture recognition, expanding the interaction space around the mobile device (B),
adding in-air gestures and hand-part tracking (D) to commodity off-the-shelf mobile devices,
relying only on the device’s camera (and no hardware modifications). The paper demonstrates
a number of compelling interactive scenarios, including bi-manual input to mapping and gaming
applications (C+D). The algorithm runs in real time and can even be used on ultra-mobile devices
such as smartwatches (E).
performed behind or in front of the screen (cf. Figure 3.5). The technique uses only the
built-in RGB camera, recognizes a wide range of gestures robustly, and copes with user
variation and varying lighting conditions. Furthermore, the algorithm runs in real-time
entirely on off-the-shelf, unmodified mobile devices, including compute-limited smart-
phones and smartwatches. Again, the algorithm is based on random forests (RF), which
trade discriminative power and run-time performance against memory consumption. Given
a complex enough problem and large enough training set, the memory requirement will
grow exponentially with tree depth. This is, of course, a strong limitation for applications
on resource-constrained mobile platforms. The main contribution is a general approach,
dubbed cascaded forests, to reduce memory consumption drastically, while maintaining
classification accuracy.
Figure 3.6 Comparison of traditional and deep-learning approaches. Left: a hand-designed mapping
from low- to high-dimensional spaces is used to transform feature vectors. Right: in deep-learning
the mapping φ(x) and its parameters are learned from a broad family of functions in order to
transform the feature representation. A simple classifier can be used in both cases to separate the
data in the high-dimensional space. In deep neural network training only the desired output of the
top-most (output) layer is specified exactly via the cost function. The learning algorithm needs to
find out how to best utilize model capacity of these layers by itself.
A feed-forward network approximates a function by learning the parameters that yield the
best mapping between inputs and outputs. Such networks are called feed-forward because informa-
tion flows strictly forward from x and no feedback mechanisms exist (unless the models are
put into recurrence, as discussed in the example below). The name network stems from the
fact that the function is decomposed into many different functions and a DAG describes how
information flows through the different functions. Each function is typically modelled by a
layer and the total number of layers determines the depth of the network (hence the name
deep learning). One important aspect here is that there are different types of layers, namely
the input layer and the output layer, with hidden layers in between. During training of the
network (typically via SGD), we constrain the optimization such that the output layer must
produce estimates ŷ that are close to the training labels y. However, for the hidden layers,
no such specification is given and, loosely speaking, the network must figure out on its own
how to use the model capacity these layers provide. Therefore, the training algorithm must
decide how to use these layers so that overall the network produces an as-good-as-possible
approximation of the true function. Because the output of these intermediate layers isn’t
shown or used directly, they are called hidden layers.
A further important aspect is that deep learning models use non-linear activation func-
tions to compute the values of the hidden layers, allowing for non-linearity in the approxi-
mated functions. The general principle in designing deep learning models is the same as with
linear models discussed in this chapter: one needs to define a cost function, an optimization
procedure, and provide a training and test dataset. At the core of the success of deep learning
techniques lies the back-propagation algorithm that computes gradients for the parameters
of the various layers and hence makes the approach amenable to optimization via gradient-
based approaches, and SGD in particular.
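These ingredients can be made concrete with a tiny sketch: a two-layer feed-forward network, trained with back-propagation and gradient descent, learns XOR, a function no linear model can represent. The layer sizes, initialization scale, and learning rate are arbitrary illustration choices:

```python
import numpy as np

# Sketch: a feed-forward network with one hidden (non-linear) layer,
# trained via back-propagation on the binary cross-entropy loss.
rng = np.random.default_rng(5)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])  # XOR truth table

W1, b1 = rng.normal(scale=1.5, size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=1.5, size=(16, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(20000):
    h = np.tanh(X @ W1 + b1)             # hidden layer (non-linear activation)
    p = sigmoid(h @ W2 + b2)             # output layer
    d_out = (p - y) / len(X)             # gradient at the output pre-activation
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1 - h ** 2)  # back-propagate through tanh
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    W1 -= 0.5 * dW1; b1 -= 0.5 * db1     # gradient descent updates
    W2 -= 0.5 * dW2; b2 -= 0.5 * db2

pred = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
print(pred.ravel())                       # learns the XOR truth table
```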
Currently deep learning approaches are of tremendous importance to applied ML since
they form the most successful platform for many application scenarios, including computer
vision, signal processing, speech recognition, natural language processing, and, last but not
least, HCI and input recognition. A complete treatment of the subject is not nearly feasible
in the context of this chapter and we refer the reader to Goodfellow, Bengio and Courville,
(2016) for an in-depth treatment. While HCI is not a ‘natural’ application domain for deep
learning, in the sense that it is more expensive to collect and label data than in many other
domains, it is still an extremely powerful tool in many aspects of input recognition. In
part motivated by difficulties in finding good hand-crafted features, deep neural network
approaches have been successfully applied to various tasks in video-based action recognition
(Simonyan and Zisserman, 2014) and speech recognition (Hinton, Deng, Yu, Dahl, Mohamed,
Jaitly, Senior, Vanhoucke, Nguyen, Kingsbury, and Sainath, 2012). Similarly,
Neverova, Wolf, Taylor, and Nebout (2015) use convolutional neural networks (CNNs)
for sign language recognition based on combined colour and depth data. Song, Wang,
Lien, Poupyrev, and Hilliges (2016) apply deep learning to dynamic gesture recognition
using high-frequency radar data. Furthermore, significant improvements in human activity
recognition based on accelerometers or gyroscopes have been demonstrated (Ordonez and
Roggen, 2016).
Figure 3.7 Soli signal properties. The sensor does not resolve spatial properties of objects but is
very sensitive to fine-grained motion. Pixel intensity corresponds to reflected energy; horizontal
axis is velocity; vertical axis is range (Song, Wang, Lien, Poupyrev, and Hilliges, 2016).
Figure 3.8 (1) Data produced by the sensor when sliding index finger over thumb. (2) Preprocess-
ing and stacking of frames. (3) Convolutional Neural Networks. (4) Recurrent Neural Networks
with per-frame predictions.
itself. Furthermore, the paper highlights the need for modelling of temporal dynamics,
which is inherently difficult with frame-by-frame classification approaches.
In response to these difficulties, a deep learning architecture, schematically shown in
Figure 3.8, is proposed that combines the steps of representation learning and dynamic
modelling into a single end-to-end trainable model.
More specifically, the network learns a feature representation function f(I_t, Θ) that maps
inputs I_t (i.e., Range-Doppler images) to outputs x_t, where Θ contains the weights of the
network. During learning, the classification error of the overall pipeline is used as the
optimization criterion. Designing CNN architectures is a complex task involving many
hyperparameters, such as the number of layers and neurons, activation functions, and filter sizes.
In the experiments section, the paper reports on different CNN variants. Most saliently,
a comparison with a network adapted from computer vision (Simonyan and Zisserman,
2014) shows the need for a custom network architecture. This aspect of the work illustrates
the flexibility that is gained from the modular, network-like properties of deep learning. With
a relatively small number of ingredients one can design custom architectures for different
use-cases and problem settings quickly.
$$q_t = W_{ho}\, h_t + b_o, \qquad (3.26)$$

where the W are weight matrices connecting input, hidden, and output layers, the b are bias vectors,
and H is the hidden layer’s activation function. Crucially, the hidden states ht are passed on
from timestep to timestep while the outputs qt are fed to a softmax layer (a straightforward
multi-class extension of the binary loss we introduced in the logistic regression example in
Section 3.7.1), providing per-frame gesture probabilities ŷt .
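The softmax itself is a short computation; the following sketch uses an invented score vector and the standard max-shift for numerical stability:

```python
import numpy as np

# Sketch: a numerically stable softmax mapping per-frame scores to
# per-frame gesture probabilities.
def softmax(q):
    z = q - q.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(np.round(probs, 3))    # roughly [0.659, 0.242, 0.099], summing to 1
```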
When learning over long sequences, standard RNNs can suffer from numerical instabil-
ities known as the vanishing or exploding gradient problem. To alleviate this issue, LSTMs
use memory cells to store, modify, and access internal state via special gates, allowing for
better modelling of long-range temporal dependencies (see Figure 3.9). Very loosely speaking,
this means a more involved computation of the hidden layer outputs h_t. For each unit, the
relation between input, internal state, and output is formulated as follows:
Figure 3.9 Illustration of LSTM input, output, cell, and gates and their connections. Computations
are listed in Equation 3.28.
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i),$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f),$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c),$$
$$h_t = o_t \odot \tanh(c_t), \qquad (3.28)$$

where σ is the logistic sigmoid function, and the input gate, forget gate, output gate, and cell
activation are respectively represented by i, f, o, and c. The indices on the weight matrices
stand for input-to-hidden, hidden-to-output, and hidden-to-hidden connections,
respectively.
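A single step of such a gated unit can be sketched in a few lines; the stacked-weight layout and the small sizes are illustrative assumptions, not the architecture of the cited system:

```python
import numpy as np

# Sketch: one step of a standard (peephole-free) LSTM cell with input,
# forget, and output gates.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """W maps the concatenated [x; h_prev] to four stacked pre-activations."""
    H = len(h_prev)
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])           # input gate
    f = sigmoid(z[H:2 * H])       # forget gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:4 * H])   # candidate cell update
    c = f * c_prev + i * g        # new internal (cell) state
    h = o * np.tanh(c)            # new output
    return h, c

rng = np.random.default_rng(6)
H, D = 4, 3                       # hidden size, input size (arbitrary)
W = rng.normal(scale=0.1, size=(4 * H, D + H))
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, np.zeros(4 * H))
print(h.shape, c.shape)           # (4,) (4,)
```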
Notably, the whole network is trained in an end-to-end fashion. That is, RNN and CNN
are trained together. For this purpose the total loss for a given sequence of x values paired
with a sequence of y values would then be just the sum of the losses over all the time steps.
For example, if L^(t) is the negative log-likelihood of y_t given (x_1, …, x_T), then

$$L\big((x_1, \ldots, x_T), (y_1, \ldots, y_T)\big) = \sum_t L^{(t)} = -\sum_t \log p_{\mathrm{model}}\big(y_t \mid (x_1, \ldots, x_T)\big), \qquad (3.29)$$
where pmodel (yt |(x1 , …, xT )) is attained by reading the entry for yt from the (model) output
vector ŷ_t. Computing the gradient for this loss is an expensive operation, since for each time
step a forward pass is necessary, at the end of which the softmax layer in the RNN predicts a
gesture label for which we compute a loss value; we then use back-propagation to compute
the gradient of the network’s weights Θ. Gradients and loss values are then summed over
time and finally the network’s parameters are optimized, minimizing the combined loss.
This training procedure is known as backpropagation through time (Werbos, 1990).
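The summation of per-timestep losses in Equation 3.29 can be sketched directly; the per-frame probability vectors below are invented for illustration:

```python
import numpy as np

# Sketch of Equation 3.29: the sequence loss is the sum over time steps of
# the negative log-probability the model assigns to the true label.
y_hat = np.array([[0.7, 0.2, 0.1],   # model output at t = 0
                  [0.1, 0.8, 0.1],   # t = 1
                  [0.2, 0.2, 0.6]])  # t = 2
labels = [0, 1, 2]                   # ground-truth class per frame

loss = -sum(np.log(y_hat[t, c]) for t, c in enumerate(labels))
print(loss)                          # -(ln 0.7 + ln 0.8 + ln 0.6)
```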
3.11 Discussion
This chapter has discussed a principled and unified framework to input recognition that
leverages state-of-the-art methods and algorithms from machine learning. Following the
techniques outlined here enables implementation of sensing-based input recognition algo-
rithms that are both efficient and robust to sensor noise and variation across users and across
instances of the action being performed by the user. We conclude with a discussion on the
limitations of current approaches and interesting directions for future work.
over time. All of these aspects are at odds with the traditional ML setting, where typically
datasets remain fixed, and once a model has been optimized on this dataset, it too remains
fixed. This is especially true for deep learning approaches that require large training data
sets and perform training in an iterative stochastic fashion. This results in models that
are fairly rigid once trained. Technically speaking, once the optimizer has found a local
but deep minimum, there is not much incentive for the algorithm to move. Therefore, a
single sample, or even a small number of new examples provided by the user after
training has finished, will not change the weights of the model much and hence will not change
the behaviour of the system. This calls for new approaches that allow already existing models
to be updated with a small number of additional training samples. Recent approaches
enabling so-called one-shot learning in image classification tasks may be an interesting
starting point.
Furthermore, (training) data in general is a problem in HCI in several ways. First, collecting
data and the associated labels is a laborious and expensive process. In many areas this
can be partially automated via crowd-sourcing or by providing users with additional value
in exchange for their labelling work (e.g., online CAPTCHA verification, tagging friends in
social networks). Unfortunately, this is seldom possible in the HCI context, since data collection
normally requires users to be physically present in the lab, to wear special equipment,
and to perform input-like tasks in a repetitive fashion. Furthermore, due to the proliferation
and fast-paced evolution of sensing technologies, the lifetime of collected datasets is
limited by the lifetime of the underlying acquisition technology. Emerging sensors in particular,
and their signals, can change drastically over short time horizons. Moreover, changes in user
behaviour and usage patterns need to be reflected in such datasets, which would require
constant updating and curation. Finally, currently the most successful models in
ML require large amounts of data for training and have significant memory and runtime
consumption once trained. This means that such models are not directly applicable for
mobile and wearable use since such devices are often much more constrained in memory
and computational resources.
Together these issues call for innovation both algorithmically and in terms of tools and
procedures. First, there is an opportunity for algorithms that combine the discriminative
power and model capacity of current state-of-the-art techniques with more flexibility
in terms of online adaptation and refinement using a small number of sample points.
Furthermore, advancements in terms of computational and memory efficiency would be
necessary to fully leverage the power of ML in mobile and wearable settings. Second, we
currently often use recognition accuracy as the optimization criterion. However, this is
sometimes merely a poor surrogate for the ultimate goal of usability. Therefore, finding ways of
more directly capturing the essence of what makes for a ‘good’ or ‘bad’ UI and using this
information to adjust the parameters of underlying ML models would be a hugely fruitful
direction for research. Finally, HCI could contribute novel approaches and tools in terms of
data-collection, labelling, and long-term curation of datasets. Moreover, there lie opportu-
nities in tools and methods that allow technically untrained users and developers without
an ML background to understand the behaviour of these algorithms more intuitively and
to enable them to make predictions about the impact of changing a classifier or providing
additional data (or labels).
OUP CORRECTED PROOF – FINAL, 22/12/2017, SPi
3.12 Conclusions
In summary, this chapter has discussed the area of sensor-based input recognition. It is an
area of research that lies at the heart of HCI and that has the potential to change the face of
the computing landscape by altering how humans interact with machines at a fundamental
level. This area is full of interesting technical and conceptual challenges and we hope to have
been able to provide a useful summary of the various contributing aspects.
....................................................................................................
references
Benbasat, A. Y., and Paradiso, J. A., 2002. An inertial measurement framework for gesture recognition
and applications. In Revised Papers from the International Gesture Workshop on Gesture and Sign
Languages in Human-Computer Interaction, GW ’01, London, UK: Springer-Verlag, pp. 9–20.
Bishop, C. M., 2006. Pattern Recognition and Machine Learning (Information Science and Statistics).
Springer-Verlag New York, Inc., Secaucus, NJ, USA.
Boser, B. E., Guyon, I. M., and Vapnik, V. N., 1992. A training algorithm for optimal margin classi-
fiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92,
New York, NY: ACM. pp. 144–52.
Bradski, G., 2000. The OpenCV library. Dr. Dobb’s Journal of Software Tools, 25(11), pp. 120–6.
Breiman, L., 2001. Random forests. Machine Learning, 45(1), pp. 5–32.
Cauchy, A., 1847. Méthode générale pour la résolution des systemes d’équations simultanées. Comptes
rendus de l’Académie des Sciences, 25(1847), pp. 536–8.
Chan, L., Hsieh, C.-H., Chen, Y.-L., Yang, S., Huang, D.-Y., Liang, R.-H., and Chen, B.-Y., 2015.
Cyclops: Wearable and single-piece full-body gesture input devices. In: CHI ’15: Proceedings of
the 33rd Annual ACM Conference on Human Factors in Computing Systems. New York, NY: ACM.
pp. 3001–9.
Cramér, H., 1946. Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press.
Goodfellow, I., Bengio, Y., and Courville, A., 2016. Deep Learning. MIT Press. http://www
.deeplearningbook.org.
Guyon, I., and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine
Learning Research, 3, pp. 1157–82.
Hall, M. A., and Smith, L. A., 1999. Feature selection for machine learning: Comparing a correlation-
based filter approach to the wrapper. In: Proceedings of the Twelfth International Florida Artificial
Intelligence Research Society Conference. Palo Alto, CA: AAAI Press. pp. 235–9.
Harrison, C., Tan, D., and Morris, D., 2010. Skinput: Appropriating the body as an input surface.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’10,
New York, NY: ACM. pp. 453–62.
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Kingsbury, B., and Sainath, T., 2012. Deep neural networks for acoustic modeling in
speech recognition. IEEE Signal Processing Magazine, 29, pp. 82–97.
James, G., Witten, D., Hastie, T., and Tibshirani, R., 2013. An Introduction to Statistical Learning: with
Applications in R. New York, NY: Springer.
Keskin, C., Kirac, F., Kara, Y. E., and Akarun, L., 2012. Hand pose estimation and hand shape
classification using multi-layered randomized decision forests. In: European Conference on Computer
Vision. Berlin: Springer. pp. 852–63.
Kim, D., Hilliges, O., Izadi, S., Butler, A. D., Chen, J., Oikonomidis, I., and Olivier, P., 2012. Digits:
Freehand 3D interactions anywhere using a wrist-worn gloveless sensor. In: Proc. ACM UIST,
New York, NY: ACM, p. 167.
Kim, D., Izadi, S., Dostal, J., Rhemann, C., Keskin, C., Zach, C., Shotton, J., Large, T., Bathiche,
S., Niessner, M., Butler, D. A., Fanello, S., and Pradeep, V., 2014. RetroDepth: 3D silhouette
sensing for high-precision input on and above physical surfaces. In: CHI ’14: Proceedings of the
32nd Annual ACM Conference on Human Factors in Computing Systems. New York, NY: ACM,
pp. 1377–86.
LeGoc, M., Taylor, S., Izadi, S., and Keskin, C., 2014. A low-cost transparent electric field sensor for
3d interaction on mobile devices. In: CHI ’14: Proceedings of the 32nd Annual ACM Conference on
Human Factors in Computing Systems. New York, NY: ACM. pp. 3167–70.
Lien, J., Gillian, N., Karagozler, M. E., Amihood, P., Schwesig, C., Olson, E., Raja, H., and Poupyrev, I.,
2016. Soli: Ubiquitous gesture sensing with millimeter wave radar. ACM Transactions on
Graphics, 35(4), pp. 142:1–142:19.
Mitchell, T. M., 1997. Machine learning. Burr Ridge, IL: McGraw Hill.
Mohri, M., Rostamizadeh, A., and Talwalkar, A., 2012. Foundations of Machine Learning. Cambridge,
MA: MIT Press.
Neverova, N., Wolf, C., Taylor, G. W., and Nebout, F., 2015. Hand Segmentation with Structured
Convolutional Learning. Cham, Switzerland: Springer. pp. 687–702.
Ordóñez, F., and Roggen, D., 2016. Deep convolutional and LSTM recurrent neural networks for
multimodal wearable activity recognition. Sensors, 16(1), pp. 1–25.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Pretten-
hofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot,
M., and Duchesnay, E., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning
Research, 12, pp. 2825–30.
Peng, H., Long, F., and Ding, C., 2005. Feature selection based on mutual information criteria of
max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 27(8), pp. 1226–38.
Pu, Q., Gupta, S., Gollakota, S., and Patel, S., 2013. Whole-home gesture recognition using wireless sig-
nals. In: MobiCom ’13: Proceedings of the 19th Annual International Conference on Mobile Computing
& Networking. New York, NY: ACM, pp. 27–38.
Radhakrishna Rao, C., 1945. Information and the accuracy attainable in the estimation of statistical
parameters. Bulletin of Calcutta Mathematical Society, 37, pp. 81–91.
Russell, S. J., and Norvig, P., 2010. Artificial Intelligence: A Modern Approach. New York, NY:
Prentice Hall.
Sato, M., Poupyrev, I., and Harrison, C., 2012. Touché: Enhancing touch interaction on humans,
screens, liquids, and everyday objects. In: CHI ’12: Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems. New York, NY: ACM, pp. 483–92.
Schölkopf, B., Burges, C. J. C., and Smola, A. J., eds. 1999. Advances in Kernel Methods: Support Vector
Learning. Cambridge, MA: MIT Press.
Senliol, B., Gulgezen, G., Yu, L., and Cataltepe, Z., 2008. Fast correlation based filter (fcbf) with a
different search strategy. In: ISCIS 08: 23rd International Symposium on Computer and Information
Sciences. New York, NY: IEEE, pp. 1–4.
Sharp, T., Keskin, C., Robertson, D., Taylor, J., Shotton, J., Kim, D., Rhemann, C., Leichter,
I., Vinnikov, A., Wei, Y., Freedman, D., Kohli, P., Krupka, E., Fitzgibbon, A., and Izadi, S.,
2015. Accurate, Robust, and Flexible Real-time Hand Tracking. In: CHI ’15 Proceedings of the
33rd Annual ACM Conference on Human Factors in Computing Systems. New York, NY: ACM,
pp. 3633–42.
Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., Cook, M., and Moore, R.,
2013. Real-time human pose recognition in parts from single depth images. Communications of the
ACM, 56(1), pp. 116–24.
Simonyan, K., and Zisserman, A., 2014. Two-stream convolutional networks for action recognition in
videos. CoRR, abs/1406.2199.
Song, J., Pece, F., Sörös, G., Koelle, M., and Hilliges, O., 2015. Joint estimation of 3d hand position and
gestures from monocular video for mobile interaction. In: CHI ’15: Proceedings of the 33rd Annual
ACM Conference on Human Factors in Computing Systems. New York, NY: ACM, pp. 3657–60.
Song, J., Sörös, G., Pece, F., Fanello, S. R., Izadi, S., Keskin, C., and Hilliges, O., 2014. In-air Gestures
Around Unmodified Mobile Devices. In: UIST ’14: Proceedings of the 27th Annual ACM Symposium
on User Interface Software and Technology. New York, NY: ACM, pp. 319–29.
Song, J., Wang, S., Lien, J., Poupyrev, I., and Hilliges, O., 2016. Interacting with Soli: Exploring Fine-
Grained Dynamic Gesture Recognition in the Radio-Frequency Spectrum. In: UIST ’16: ACM
Symposium on User Interface Software and Technologies. New York, NY: ACM, pp. 851–60.
Steinwart, I., and Christmann, A., 2008. Support Vector Machines. New York, NY: Springer-Verlag.
Tang, D., Yu, T. H., and Kim, T. K., 2013. Real-time articulated hand pose estimation using semi-
supervised transductive regression forests. In: 2013 IEEE International Conference on Computer
Vision. New York, NY: IEEE, pp. 3224–31.
Taylor, S., Keskin, C., Hilliges, O., Izadi, S., and Helmes, J., 2014. Type-hover-swipe in 96 bytes:
A motion sensing mechanical keyboard. In: CHI ’14: Proceedings of the 32nd Annual ACM Con-
ference on Human Factors in Computing Systems. New York, NY: ACM, pp. 1695–704.
Vapnik, V. N., 1998. Statistical learning theory, Volume 1. New York, NY: Wiley.
Vapnik, V. N., 2013. The nature of statistical learning theory. New York, NY: Springer Science & Business
Media.
Werbos, P. J., 1990. Backpropagation through time: what it does and how to do it. Proceedings of the
IEEE, 78(10), pp. 1550–60.
Witten, I. H., Frank, E., Hall, M. A., and Pal, C. J., 2016. Data Mining: Practical machine learning tools
and techniques. Burlington, MA: Morgan Kaufmann.
Wobbrock, J. O., Ringel Morris, M., and Wilson, A. D., 2009. User-defined gestures for surface
computing. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
New York, NY: ACM, pp. 1083–92.
Yeo, H.-S., Flamich, G., Schrempf, P., Harris-Birtill, D., and Quigley, A., 2016. Radarcat: Radar
categorization for input & interaction. In: UIST ’16: Proceedings of the 29th Annual Symposium on
User Interface Software and Technology. New York, NY: ACM, pp. 833–41.
PART II
Design
4
4.1 Introduction
Combinatorial optimization offers a powerful yet under-explored computational method
for attacking hard problems in user interface design. ‘The science of better’, or optimization
and operations research, has its roots in mathematics, computer science, and economics.
To optimize refers to the act and process of obtaining the best solution under given circum-
stances; combinatorial optimization refers to the case where solutions are defined as combinations
of multiple discrete decisions (Rao, 2009). To design an interactive system by optimization,
a number of discrete decisions are made such that they constitute as good a whole as possible.
The goal of this chapter is to advance applications of combinatorial optimization in user
interface design. We start with a definition of a design task in combinatorial optimization:

max_{d ∈ D} f(d),
where d is a design in a set of feasible designs D, and f is the objective function. In plain words,
the task is to find the design that yields the highest value of this function. For example, we
might be looking for the design of a web form that maximizes user performance in filling it
in. However, this definition does not expose the structure of the design problem.
The following definition makes that explicit (Rao, 2009):

Find x = (x_1, x_2, . . . , x_n)^T ∈ X which maximizes f(x),
where x is an n-dimensional design vector, each dimension describing a design variable, X is
the set of feasible designs (all to-be-considered design vectors), and f (x) is the objective
function.
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
98 | design
This revised task definition exposes the design space and its structure in X. The design
space X is contained in the Cartesian product of the domains X_i of the design variables x_i.
Formally, X ⊆ X_1 × . . . × X_n, where X_i is the domain of the i-th design variable; e.g., X_i =
{0, 1} for a binary decision variable, such as whether to select a feature or not. A design
variable can address any open decision that can be defined via boolean, integer, or categorical
variables. In a real design project, many properties of an interface would be preassigned, but
some are open and treated as variables. Examples include sizes and colours and positions of
an element and their types. In many combinations they have to satisfy functional and other
requirements. These are collectively called design constraints. For example, when designing
a Web layout, we should not place elements such that they overlap.
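Such a feasibility constraint can be checked directly. The following sketch, using an assumed (x, y, width, height) rectangle representation for layout elements, tests the no-overlap requirement for a candidate layout:

```python
from itertools import combinations

def overlaps(a, b):
    """Axis-aligned rectangles given as (x, y, width, height);
    True if their interiors intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def feasible(layout):
    """A candidate layout satisfies the no-overlap design constraint."""
    return not any(overlaps(p, q) for p, q in combinations(layout, 2))

print(feasible([(0, 0, 10, 5), (12, 0, 10, 5)]))  # True: elements are disjoint
print(feasible([(0, 0, 10, 5), (5, 2, 10, 5)]))   # False: elements overlap
```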
The first insight we can make is that some design problems are exceedingly large, far
too large for trial-and-error approaches. For instance, to design an interactive
layout (e.g., menu), one must fix the types, colours, sizes, and positions of elements, as
well as higher-level properties, such as which functionality to include. The number of
combinations of such choices easily gets very large. For example, for n functions there
are 2^n − 1 ways to combine them into an application, which for only fifty functions means
1,125,899,906,842,623 possibilities. Further, assuming that fifty commands have been
selected, they can be organized into a hierarchical menu in 100! ≈ 10^158 ways.
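These counts are easy to verify directly:

```python
import math

# Non-empty subsets of n functions: 2^n - 1.
n = 50
print(2**n - 1)  # 1125899906842623

# Orderings into a 100-slot menu: 100! has 158 digits, i.e. roughly 10^158.
print(len(str(math.factorial(100))))  # 158
```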
The definition also exposes evaluative knowledge in f (x). The objective function1
encodes this evaluative knowledge. Technically, it is a function that assigns an objective
score to a design candidate. It formalizes what is assumed to be ‘good’ or ‘desirable’, or,
inversely, undesirable when the task is to minimize. The design candidate (a design vector)
that obtains the highest (or lowest, when minimizing) score is the optimum design.
In applications in HCI, a key challenge is to formulate objective functions that encap-
sulate goodness in end-users’ terms. This can be surface features of the interface (e.g.,
visual balance) or expected performance of users (e.g., ‘task A should be completed as
quickly as possible’), users’ subjective preferences, and so on. It is tempting but naive
to construct objective functions based on heuristics. Those might be easy to express and
compute, but they might have little value in producing good designs. It must be kept in
mind that the quality of an interface is determined not by the designer, nor some quality of
the interface, but by end-users in their performance and experiences. An objective function
should be viewed as a predictor of quality for end users. It must capture some essential
tendencies in the biological, psychological, behavioural, and social aspects of human con-
duct. This fact drives a departure from traditional application areas of operations research
and optimization, where objective functions have been based on natural sciences and
economics.
Another challenge is the combination of multiple objectives into an objective function:
f(x) = ω_1 f_1(x) + · · · + ω_q f_q(x),    (4.1)
1 Also called loss function, error function, energy function, merit function, criteria function, reward function,
evaluative function, utility function, goodness function, or fitness function.
where q is the number of objectives considered, f_i(x) is the function for objective i, and
ω_i is the weight attributed to i. Linearization is one way to deal with multiple criteria, and
many methods have been presented to capture designers’ intentions (Rao, 2009). While
formulating and calibrating multi-objective tasks is challenging, it must be remembered
that designers solving design problems similarly make assumptions about the way in which
the objectives are traded off. Combinatorial optimization can expose the assumptions and
their consequences.
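A minimal sketch of this weighted combination; the candidate designs, objective scores, and weights below are hypothetical:

```python
def scalarize(objectives, weights):
    """Weighted sum of per-objective scores, as in Eq. (4.1)."""
    return sum(w * f for w, f in zip(weights, objectives))

# Hypothetical design candidates scored on two objectives
# (speed, aesthetics), both scaled so that higher is better.
candidates = {
    "design_a": (0.9, 0.4),
    "design_b": (0.6, 0.8),
}
weights = (0.7, 0.3)  # designer-chosen trade-off weights w1, w2

best = max(candidates, key=lambda d: scalarize(candidates[d], weights))
print(best)  # design_a scores 0.75, design_b scores 0.66
```

Changing the weights shifts which candidate wins, which is exactly the trade-off assumption the chapter says optimization makes explicit.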
When a task has been defined, including the feasible set and an objective function,
algorithmic solvers can attack it. This is the basis of benefits for practical efforts in design.
Solvers can be used to find the optimal design, or surprising solutions, or designs with
distinct tradeoffs, or solutions that strike the best compromise among competing objectives,
or are robust to changes in conditions. A designer can identify how far a present interface
is from the best achievable design. Modern optimization methods offer a higher chance
of finding good or optimal solutions and, in certain conditions, mathematical guarantees
(bounds) can be computed. Moreover, all choices are documented and scrutinable and can
support knowledge sharing in design teams. In this regard, a combinatorial optimization task
naturally incorporates a form of design rationale that is not only descriptive, but executable
(Carroll, 2000).
Notwithstanding the relatively high cost of defining a tractable design task, combinatorial
optimization is not only compatible with user-centred design but can facilitate it by assisting
designers in problem-solving and automating certain tasks entirely. Instead of a designer
generating designs and having them evaluated with users, the designer defines a task and lets
the computer solve it. In one-shot optimization, this is done once (think about keyboards).
However, advances in algorithms and hardware have made it possible to consider integration
of optimization into design tools. In interactive optimization, computation is steered by the
designer in the loop.
Prior to this chapter, there has been no attempt to elaborate and discuss the assumptions
and principles of combinatorial optimization in user interface design. The goal of the chapter
is to lay out the principles of formulating design tasks for combinatorial optimization. We
focus on two obstacles: (1) definition of a design task, and (2) encoding of design and
research knowledge in an objective function.
First, the mathematical definition of interface design problems has been difficult and has become
an impediment to applications. How can we mathematically represent design problems that
consist of multiple interrelated decisions and several objectives? In the absence of task
definitions, applications in HCI were long limited mainly to keyboards. To expand the scope
of optimizable design problems in HCI, it is thus necessary to research the mathematical
definition of design.
Second, what goes in f (x) is pivotal to the quality of designs. In essence, defining f
‘equips’ the search algorithm with design knowledge that predicts how users interact and
experience. Therefore, in the design of objective functions one should favor predictive
models rooted in well-tested theories. With increasing understanding of how to formulate
objective functions gauging design goals such as usability and user experience, new applica-
tions have been made possible, for example, in menu systems, gestural input, visualizations,
widget GUIs, web layouts, and wireframe designs (Oulasvirta, 2017).
The idea of defining a complex design problem via its constituent design variables is
rooted in Fritz Zwicky’s ‘morphological analysis’ (Zwicky, 1948). Zwicky proposed breaking
down core decisions to obtain a finite solution space called the ‘Zwicky cube’. While it
primarily targets understanding of problems, and not their solution, it was the basis of
the central HCI concept of design space (e.g., Card, Mackinlay and Robertson, 1991).
Herbert Simon and colleagues proposed the first structured representation of complex
human activities that permitted algorithmic solutions. They pioneered the concept of a
search tree (Simon, 1973). Human activities that appear complex on the surface, such as chess,
could be broken down to smaller decisions that form a decision structure in the form of a
tree. Problem-solving activities, then, are about traversing this tree. This idea informed the
development of engineering design methods (Cross & Roy, 1989). While applications in
user interfaces remained distant, it put forward the idea that complex design problems can
be algorithmically solved if decomposed in the right way.
The foundations of combinatorial optimization methods were laid during the two
decades following World War II (Rao, 2009). While optimization revolutionized
management, engineering, operations research, and economics, it was in the late 1970s
when it was first applied to interface design. Rainer Burkard and colleagues formulated
the design of typewriter layouts as a quadratic assignment problem (QAP) (Burkard and
Offermann, 1977). However, the objective function used unrealistic estimates of finger
travel times. Nevertheless, Burkard’s work permitted the use of known solvers for QAP.
Their definition also placed the problem in the context of other problems in computer
science, exposing it as an NP-complete problem. Many other problems in user interface
design would fall to this same class. It was more than a decade later, however, when data
from empirical studies of typists were used to form a more realistic objective function (Light
and Anderson, 1993). A milestone was reached in the late 1990s, when Fitts’ law, weighted
by bigram frequencies in English, was introduced as the objective function optimizing for
typing performance (Zhai, Hunter, and Smith, 2002). Emerging empirical evidence suggested
the superiority of performance-optimized keyboard layouts over the Qwerty layout.
The idea of using models of human performance and cognition to inform the choice of
interface designs is rooted in Stuart Card and colleagues’ seminal work in the early 1980s
(Card, Newell, and Moran, 1983). Card and colleagues referred to operations research as a
model for organizing HCI research and proposed a cognitive simulation of a user, called
GOMS, to replace expensive user studies and to be able to predict the consequences of
design choices. A designer could now evaluate an interface by simulating how users perceive,
think, and act when completing tasks. While subsequent work extended modelling to factors
like errors, memory functioning, and learning, cognitive simulations became difficult to use
and fell out of pace with the rapid development of technology. To aid practitioners, math-
ematical simplifications and interactive modelling environments were developed, yet these
were not combined with algorithmic search. The models were not used to generate designs.
A decade later, human factors researchers noted that ergonomics researchers should not
be content with studies and theories, but should actively seek to identify optimal conditions
for human performance (Fisher, 1993; Wickens and Kramer, 1985). The first applications
concerned the arrangement of buttons on menus and multi-function displays. As objective
functions, they developed analytical models of visual search, selection time, and learning
(Francis 2000; Liu, Francis, and Salvendy, 2002). This work extended the use of predictive
models from Fitts’ law to consider aspects of attention and memory. However, since little
advice was given on how to formulate more complex design tasks, or how to solve them
efficiently, designs were still limited to parameter optimizations and simple mappings.
Within the last two decades, these advances have been integrated into what can be
called model-based interface optimization: using models of interaction and technology in the
objective function (Eisenstein, Vanderdonckt, and Puerta, 2001; Gajos and Weld 2004;
Oulasvirta, 2017). Three milestones can be distinguished in this space. The first is the
formal definition of a user interface technology as the object of optimization. Software
engineers proposed formal abstractions of user interfaces (UIDLs) to describe the interface
and its properties, operation logic, and relationships to other parts of the system (Eisenstein,
Vanderdonckt, and Puerta, 2001). UIDLs can be used to compile interfaces in different lan-
guages and to port them across platforms. However, since a UIDL alone does not contain
information about the user, their use was limited to transformations and retargetings instead
of complete redesigns. The second milestone was the encoding of design heuristics in objec-
tive functions. For example, Peter O’Donovan and colleagues have formulated an energy
minimization approach to the design of page layouts using heuristics like alignment, visual
importance, white space, and balance (O’Donovan, Agarwala, and Hertzmann, 2014). The
approach improves the visual appeal of optimized layouts. However, such heuristics are
surface characteristics and they may lack substantive link with objectives that users consider
important, such as task completion time or aesthetics. For example, the amount of white
space may or may not predict aesthetic experience. Moreover, resolving conflicts among
multiple conflicting heuristics became a recognized issue. The third milestone is the direct
use of predictive models and simulations of users in objective functions, extending the work
that started in computer science and human factors. For example, Krzysztof Gajos and
colleagues explored model-based approaches in the design of widget layouts, considering
models of motor performance in conjunction with design heuristics for visual impairments
(Gajos and Weld, 2004). This approach has been extended to the design of menu systems,
gestural inputs, information visualization, and web page layouts (Oulasvirta, 2017).
A parallel thread of research has looked at exploitation of optimization methods in design
tools. Interactive optimization is the idea of a human designer specifying and steering the
search process. The key motivation is the fact that the standard ‘fire and forget’ paradigm
of optimization methods is a poor fit with design practice. Designers cannot be expected to
provide point-precise input values a priori. By contrast, they are known to constantly refine
the problem definition, mix activities like sketching and evaluation and reflection, and draw
from tacit knowledge. Techniques for interactive optimization can be divided according to
four dimensions: (1) interaction techniques and data-driven approaches for specification of
a design task for an optimizer; (2) control techniques offered for steering the search process;
(3) techniques for selection, exploration, and refinement of outputs (designs); and (4) the level
of proactivity taken by the tool, for example in guiding the designer toward good designs
(as determined by an objective function). Some illustrative examples include DesignScape,
which supports rapid exploration of alternative layouts with an energy minimization scheme
(O’Donovan, Agarwala, and Hertzmann, 2015), MenuOptimizer, which provides a high
level of control for specifying a task and actively visualizes usability-related metrics to guide
the designer to choose better designs (Bailly, Oulasvirta, Kötzing, and Hoppe, 2013), and
SketchPlorer, which tries to automatically recognize the design task and explores the design
space for diverse designs to serve as the basis of design exploration (Todi, Weir, and Oulasvirta,
2016).
problems over binary variables. To sum up, the benefits of IP in interface design are three-
fold (Karrenbauer and Oulasvirta, 2014): (1) it offers a universal framework to define and
compare design tasks, (2) powerful solvers are available that are not only fast but use exact
methods to compute bounds for solutions, and (3) several relaxations are known that can
boost performance.
The general structure of a design task consists of decision variables, objective functions,
calibration factors (or functions), and constraints. For the sake of presentation, we here
restrict ourselves to binary variables and linear constraints. Binary variables are used to
model decisions; e.g., in a selection problem, x_i = 1 means that item i is selected and x_i = 0
means that it is not. Suppose that we have n decision variables that we combine in an n-dimensional
vector x ∈ [0, 1]^n in the unit hypercube, where each extreme point represents a configuration.
Configurations that are infeasible are excluded by constraints, i.e., linear inequalities. For
example, x_i + x_j ≤ 1 implies that items i and j are mutually exclusive. Given weights for
each decision/item, we can ask for the feasible solution with maximum total weight by
maximizing
\[ \sum_{i=1}^{n} w_i x_i \]
subject to
\[ Ax \le b, \qquad x \in \{0, 1\}^n \]
where the comparison between the left-hand side vector Ax and the right-hand side b is
meant component-wise such that each row of the matrix A together with the corresponding
entry in b represents a linear constraint.
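As a minimal illustration of this formulation, the program above can be solved for small n by enumerating all binary vectors; the function name and the toy instance below are our own, not from the chapter:

```python
from itertools import product

def solve_binary_ip(w, A, b):
    """Maximize w.x over x in {0,1}^n subject to A x <= b (component-wise)
    by exhaustive enumeration -- exponential in n, so small instances only."""
    n = len(w)
    best_x, best_val = None, float("-inf")
    for x in product((0, 1), repeat=n):
        if all(sum(row[i] * x[i] for i in range(n)) <= bi
               for row, bi in zip(A, b)):
            val = sum(wi * xi for wi, xi in zip(w, x))
            if val > best_val:
                best_val, best_x = val, x
    return best_x, best_val

# Toy instance: three items with weights (2, 3, 2) and a single
# capacity-4 constraint; the numbers are invented for illustration.
x, val = solve_binary_ip(w=[5, 4, 3], A=[[2, 3, 2]], b=[4])
```

A real solver replaces the enumeration with branch-and-bound over exactly this structure.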
In general, it is NP-hard to solve such integer linear optimization problems. This can be
easily seen in the following example for the famous NP-hard IndependentSet problem
where we are given an undirected graph G = (V, E) and are supposed to compute a set of
nodes that does not induce an edge:
\[ \max \Big\{ \sum_{v \in V} x_v \;:\; \forall \{v, w\} \in E: x_v + x_w \le 1,\ \forall v \in V: x_v \in \{0, 1\} \Big\}. \]
However, these optimization problems become polynomial-time solvable if we relax the
integrality constraint.
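A small sketch of the IndependentSet IP, solved by enumeration on a triangle graph; the comment notes the standard relaxation gap for this instance:

```python
from itertools import product

def max_independent_set(n_nodes, edges):
    """Solve the IndependentSet IP (max sum of x_v subject to
    x_v + x_w <= 1 for every edge {v, w}) by enumeration."""
    best_size, best_x = -1, None
    for x in product((0, 1), repeat=n_nodes):
        if all(x[v] + x[w] <= 1 for v, w in edges):
            if sum(x) > best_size:
                best_size, best_x = sum(x), x
    return best_size, best_x

# A triangle: every two nodes share an edge, so the integer optimum is 1.
# The LP relaxation (x_v in [0, 1]) admits x = (0.5, 0.5, 0.5) with value
# 1.5, showing how relaxing integrality can change the optimum.
size, x = max_independent_set(3, [(0, 1), (1, 2), (0, 2)])
```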
We provide a first example next before turning to more general definitions in user
interface design.
experience and heuristics such as ‘place frequently used commands closer to the top’ or
‘place elements that belong together next to each other’ or ‘make the menu consistent with
other menus’. The outcomes would be evaluated using inspection methods or usability
testing, and the process would be repeated until a satisfactory design is attained.
Combinatorial optimization offers a way to mathematically describe such objectives and
algorithmically solve the resulting design problem. Two core requirements are illustrated in
the following:
1. Definition of the design space (feasible set of designs).
2. Definition of an objective function.
subject to
\[ \forall l:\ \sum_{i=1}^{n} x_{il} = 1 \tag{4.2} \]
\[ \forall i:\ \sum_{l=1}^{n} x_{il} = 1 \]
\[ \forall i, l:\ x_{il} \in \{0, 1\} \]
where c_{il} is a design objective capturing the cost (to the user) of accessing command i when it is
assigned to slot l. The two constraints state, respectively, that each slot must be filled by
exactly one command, and that each command must be assigned to exactly one slot. In
plain words, the optimal menu design is an assignment of all n commands to menu slots
that yields the lowest expected cost to the user. This formulation exposes how the number
of commands is linked to the size of the design space: there are n! possible designs for n
commands. However, this definition is not yet operational, because we lack a definition of
the objective function c_{il}.
To that end, let us consider optimizing the menu for selection time: the expected time
for selecting a command. Assuming a novice user who scans menu options linearly from
top to bottom, the optimal design should have the most important (e.g., most frequently
accessed) items on top. For this purpose, we introduce a frequency-based score for each
command: f_i ∈ [0, 1]. When this is multiplied by the position l of the item (counting from
the top), we get an objective function that serves as a proxy model for selection time:
c_{il} = f_i · l. Note that this objective is similar to a linear regression model of novice search
performance in menu selection (Cockburn, Gutwin, and Greenberg, 2007; Liu, Francis, and
Salvendy, 2002). These models compute selection time as the number of fixations required
when starting search from the top of the menu. Neither multiplication by a positive scalar nor
addition (or subtraction) of a constant changes the optimal solution. Thus, the free parameters
of the models do not affect the outcome of optimization and can be ignored.
However, this objective function does not determine how quickly a command can be
selected once it has been found. To that end, we need a model of motor performance.
Assuming that a single end-effector (e.g., a cursor or fingertip) is used for selection, selection
time is governed by the time that it takes to move it across distance on top of the slot.
Movement time in this case is given by Fitts’ law:
\[ MT = a + b \log_2\left(\frac{D}{W} + 1\right) \tag{4.3} \]
where D is the distance from the cursor to the target centre and W is the target’s width. Now, because all targets
in a linear menu are of equal height, our objective function (Eq. 4.2) can be rewritten as a
twofold objective function:
\[ \min\ \sum_{i=1}^{n} \sum_{l=1}^{n} \Big[\, \omega\, f_i \log_2(l) + (1 - \omega)\, f_i\, l \,\Big]\, x_{il} \tag{4.4} \]
where ω ∈ [0, 1] controls the relative importance of expert users, whose performance is
dominated by motor performance. Similarly, 1 − ω denotes the relative importance of novices,
who spend more time scanning the menu one item at a time. With ω = 1, the design task
collapses to that of minimizing the expected log-travel distance. However, in practice, no matter
what ω is set to, the problem can be solved by assigning commands from top to bottom
in order of decreasing frequency.
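The frequency-ordering argument can be checked mechanically: sorting by decreasing frequency minimizes the sum of f_i times slot position (an instance of the rearrangement inequality), and a brute-force check over all permutations confirms it for a small hypothetical frequency profile:

```python
from itertools import permutations

def menu_cost(order, freq):
    """Expected linear-scan cost: access frequency times 1-based slot position."""
    return sum(freq[cmd] * (slot + 1) for slot, cmd in enumerate(order))

def optimal_menu(freq):
    """Assign commands top-to-bottom in order of decreasing frequency."""
    return sorted(range(len(freq)), key=lambda cmd: -freq[cmd])

freq = [0.1, 0.5, 0.15, 0.25]        # hypothetical access frequencies
best = optimal_menu(freq)            # most frequent command on top
brute = min(permutations(range(len(freq))), key=lambda p: menu_cost(p, freq))
```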
The task gets more challenging when commands should be grouped according to a
semantic relationship (e.g., Bailly, Oulasvirta, Kötzing, and Hoppe, 2013). For example,
in many application menus ‘Open’ and ‘Save’ are close to each other. To this end, we
need to define an association score: a(i, j) ∈ [0, 1]. It represents the strength of semantic
relationship between two commands i and j, and could represent, for example, collocation in
previous designs, word association norms, or synonymity. For instance, association between
‘Open’ and ‘Save’ should be high, and lower between ‘Bookmark’ and ‘Quit’. To decide how
commands should be grouped, we need another instance of the decision variable x: let x_{jk}
be 1 when command j is assigned to slot k, and 0 otherwise. Considering visual search only, we now
reformulate the objective function as:
\[ \min\ \sum_{i=1}^{n} \sum_{l=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} (1 - a_{ij})\, f_i\, f_j\, |l - k|\; x_{il}\, x_{jk} \tag{4.5} \]
We can collapse the first part of the objective into a four-dimensional cost matrix c(i, j, k, l),
which gives the cost of assigning i to slot l when j is assigned to slot k:
\[ \min\ \sum_{i=1}^{n} \sum_{l=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} c(i, j, k, l)\; x_{il}\, x_{jk} \tag{4.6} \]
Technically, the problem is now quadratic, because of the presence of a product of two
decision variables. It is harder to solve than the linear version, because the optimum design
is defined by how elements relate to each other, not only by each element’s position
in the menu. However, this definition is operational and could be solved by black-box
optimizers.
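A minimal sketch of solving the quadratic formulation by exhaustive search over permutations; the two-command cost matrix is invented purely for illustration:

```python
from itertools import permutations

def quadratic_cost(assign, c):
    """Cost of a complete assignment: assign[i] is the slot of command i,
    and c[i][l][j][k] is the cost of placing i at l while j sits at k."""
    n = len(assign)
    return sum(c[i][assign[i]][j][assign[j]]
               for i in range(n) for j in range(n) if i != j)

def solve_quadratic(c):
    """Exhaustive search over all n! assignments -- tiny instances only."""
    n = len(c)
    return min(permutations(range(n)), key=lambda p: quadratic_cost(p, c))

# Invented 2-command, 2-slot instance: placing command 0 above command 1
# is cheap (cost 1); the reversed placement is expensive (cost 5).
c = [[[[0.0] * 2 for _ in range(2)] for _ in range(2)] for _ in range(2)]
c[0][0][1][1] = 1.0
c[0][1][1][0] = 5.0
best = solve_quadratic(c)
```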
1. Decision variables: The first step consists of setting up design variables as decision
variables. Though this set of variables may be extended and modified in later
iterations, the initial setup is the cornerstone of the modelling process.
2. Constraints: Given the decision variables along with their respective domains, the
next step deals with the exclusion of infeasible designs by linear constraints.
3. Objective function: The objective function is modelled in the third step. It may
consist of several parts that are linked together.
In addition, calibration factors or functions may need to be added to trade off the objective
terms.
This section describes the creation of the feasible set in the two first steps. Our goal is
to illustrate the expression of design challenges in graphical user interface design, including
how functionalities are selected, widget types determined, chosen elements organized on a
layout, and their temporal behaviour decided. Instead of providing ready-made solutions,
our goal is to illuminate the logic of expressing interface design tasks in IP.
To this end, this section introduces four types of combinatorial optimization prob-
lems for interface design, illustrating how they are defined using decision variables and
constraints:
requirements, say rj , for each attribute aj . The task is to select a subset of U that satisfies
all requirements, i.e., the number of selected elements with attribute aj is at least rj for
each attribute. The goal is to find a selection that minimizes the total costs, i.e., the sum
of the costs for each selected element. Sometimes an element may have an attribute with
some multiplicity, e.g., for the attribute cost we would be given values associated with each
element. Moreover, it may or may not be allowed to select an element multiple times.
Consider the selection of widget types for a graphical interface, for example. Here a
designer sets a minimum set of requirements for an interface, and there are multiple widgets to
select from, each with distinct properties. For example, there may be a requirement that the
user must select one option out of many. This may be implemented as a menu, drop-down
list, etc., each with its associated ‘costs’ in terms of learnability, usability, use of display space,
etc. A good solution to this problem covers all the requirements with minimum cost.
Prototypical covering problems are set-covering problems, where all requirements are 1
and in its cardinality version all costs are 1 as well. The term set refers to the different subsets
of elements that share a common attribute. A special case of the set-covering problem is
vertex cover where the elements are the nodes of a given graph and the attributes are the
edges, i.e., we are asked to select a subset of the nodes of minimum cardinality that contains
at least one node of each edge.
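The widget-selection covering problem can be sketched with the classic greedy heuristic for set cover; the requirements and widget names are hypothetical:

```python
def greedy_set_cover(universe, subsets):
    """Greedy heuristic for set cover: repeatedly pick the candidate that
    covers the most still-uncovered requirements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(subsets)), key=lambda i: len(uncovered & subsets[i]))
        if not uncovered & subsets[best]:
            raise ValueError("requirements cannot be covered")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

# Hypothetical widget selection: four requirements, three candidate
# widgets, each covering a subset of the requirements.
widgets = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
picked = greedy_set_cover({"a", "b", "c", "d"}, widgets)
```

The greedy choice is a well-known ln(n)-approximation; an exact solution would use the IP formulation instead.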
Packing problems are similar to covering problems: again, we are given a set of elements
U = {u1 , …, un } and attributes A = {a1 , …, am }. However, instead of requirements, we
now have positive capacities cj for each attribute aj and only those subsets of U are feasible
that do not exceed the capacity for each attribute. Analogously, a backpack has a certain
volume and a total weight limit over all packed items. The attributes are volume and weight.
Typically, the elements are also associated with a valuation, say pi for element ui , that can
be considered as a profit. Hence, the goal is to find a selection of the elements that does not
exceed the capacities and maximizes the total profit. Again, it may or may not be allowed to
select an item multiple times.
As an HCI example, consider again the widget selection problem: not all available widgets
might fit in a canvas of limited size. If an individual reward is associated with each widget, it
is natural to ask for a fitting selection that yields the highest reward.
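The single-constraint version of this packing problem is the 0/1 knapsack, which admits an exact dynamic-programming sketch; the rewards and sizes below are invented:

```python
def knapsack(profits, weights, capacity):
    """0/1 knapsack by dynamic programming over capacities: a packing
    problem with a single capacity constraint and one profit per element."""
    best = [0] * (capacity + 1)
    for p, w in zip(profits, weights):
        for c in range(capacity, w - 1, -1):   # reverse: each item used once
            best[c] = max(best[c], best[c - w] + p)
    return best[capacity]

# Hypothetical widget rewards and sizes, with a canvas 'size budget' of 5.
reward = knapsack(profits=[6, 5, 4], weights=[4, 3, 2], capacity=5)
```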
Packing problems are in some sense dual to covering problems: when we switch the
role of elements and attributes, the role of capacities and costs as well as the role of profits
and requirements, we obtain the associated dual covering problem. An important fact that
follows from optimization theory is that the profit of any solution for the packing problem
cannot exceed the costs of any solution for the associated dual covering problem. In
particular, this holds for the maximum profit and the minimum costs. The dual problem
of the vertex cover problem mentioned earlier is the so-called matching problem, where we are
asked to find a maximum-cardinality subset of the edges such that each node is incident to at
most one edge in the matching. Other well-known packing problems are the independent set
problem, the set packing problem, and the knapsack problem. The independent set problem
often appears as a subproblem for modelling mutually exclusive choices.
Network design problems are a special type of selection problems in connection with
graphs. Typically, we are given a graph and are supposed to select a subset of the edges such
that certain connectivity constraints are satisfied. A basic example is the so-called Steiner
Tree problem: for a given undirected graph G = (V, E) and a set of terminals T, i.e., T ⊆ V,
we shall select edges, say E′ ⊆ E, such that each pair of terminals is connected by a path
consisting only of edges from E′. Observe that at least one edge in a cycle can be removed
while connectivity is still maintained, and hence, every optimum solution is a tree. The
nodes of the selected edges that are not terminals are called Steiner nodes. For the Directed
Steiner Tree problem, we are given a directed graph G = (V, A), a designated root node
r ∈ V, and a set of terminals T ⊆ V \{r}, and our task is to select arcs such that there is
a directed path from r to each terminal consisting only of selected arcs. Typically, a cost
is associated with each arc, and we are supposed to find a feasible selection of minimum
costs. Sometimes only the root and the terminals are given explicitly and the remaining
nodes are only given implicitly, e.g., we could have a Steiner node associated with each
subset of the terminals. A feasible solution would then define a hierarchical clustering of
the terminals.
This problem appears, for instance, in the context of the design of a hierarchical menu
structure. The terminals are the actions that should be available as menu items. To achieve
quick access times it makes sense to avoid clutter by grouping related actions together in
a submenu. The root node represents the menu bar and an arc determines a parent-child
relation, i.e., selecting an arc from the root to a menu item means to put this item directly
into the menu bar, and hence it is accessible via one click.
Quadratic problems: In addition to the costs of assigning item i to location j, also consider
the costs of dependencies between two selections, i.e., c_{ijkl} determines the cost of assigning
item i to location j conditioned on the choice of also assigning item k to location l.
The Quadratic Assignment Problem (QAP) appears, for example, in the Keyboard
Layout Problem, a.k.a. the Letter Assignment Problem: a set of n letters is supposed to
be mapped to n slots on the keyboard. The objective is to maximize the typing speed or,
equivalently, to minimize the inter-key intervals. To this end, we set c_{ijkl} = p_{ik} × t_{jl}, where
p_{ik} denotes the bigram probabilities and t_{jl} the movement times. In general, QAP can be used
to model mutual dependencies between pairs of selected elements. These occur whenever
the interface is expected to support task sequences where one subtask is followed by another.
However, this modelling power comes at the expense of efficiency of finding optimum
solutions.
Because the most realistic models are expensive to compute, the designer weighs the quality
of estimates against computational efficiency.
that of harmonic colours. It defines a set of colours that combine to provide a pleasant
visual perception. Harmony is determined by relative position in colour space rather than by
specific hues. Templates are provided to test any given set of colours against the harmonic
set. The templates consist of one or two sectors of the hue wheel, with given angular sizes.
They can be arbitrarily rotated to create new sets. The distance of an interface from a template
is computed using the arc-length distance between the hue of an element and the hue of the
closest sector border, together with the saturation channel of the element. Several perception- and
aesthetics-related metrics are available (Miniukovich and De Angeli, 2015).
Simulators are step-wise executed functions M that map the model parameters θ to
behavioural data. In HCI, the parameters θ capture both parameters of the user and those of
the task, design, or context. If random variables are involved in the simulator, the outputs of
the simulator can fluctuate randomly even when θ are fixed. Simulators are implemented
as programs where parameters are provided as input. Simulators can be implemented in
many ways, for example, as control-theoretical, cognitive, biomechanical, or computational
rationality models, as discussed elsewhere in this book.
In contrast to heuristics, computational metrics, and regression models, simulators are
generative models that predict not only the outcomes of interaction but intermediate steps, or
the process of interaction. This is valuable in design, as one can output not only predictions
for aggregates, but also examine predicted interaction in detail.
For example, visual search from graphical 2D layouts is a complex cognitive-perceptual
activity driven by the oculomotor system and affected by the task as well as by incoming
information. The Kieras-Hornof model of visual attention (Kieras and Hornof, 2014) is a simulator
predicting how fixations are placed when looking for a given target on a given graphical
layout. It is based on a set of availability functions that determine how well the features of a target
are perceivable from the user’s current eye location. The availability functions are based
on the eccentricity from the current eye location and angular size s of the target. Additive
Gaussian random noise with variance proportional to the size of the target is assumed. For
each feature, a threshold is computed, along with the probability that the feature is perceivable.
These determine where to look next on the layout. Simulation terminates when a fixation
lands on the target.
Another benefit of simulators over regression models is their high fidelity, which
allows capturing the relationship between multiple parameters and structures describing
the human mind. Some simulation models are Turing-strong formalisms with memory
and ability to execute programs. A simulator can address multiple design objectives in
a coherent manner that does not require tuning. However, they are significantly more
expensive to execute. Certain optimization methods, especially exact methods (e.g., integer
programming), are practically ruled out.
choices and decision-making, and even user experience. These models offer a rich toolbox
for attacking complex and realistic design problems.
Considering the design of a graphical user interface as an example, cost functions can be
distinguished based on how they relate candidate designs to user-related objectives and the
different timescales of interaction. We present some salient categories with examples and
discuss their relative merits.
Surface features Surface metrics and heuristics are functions that capture properties of
an interface that may correlate with user-related qualities. To optimize a surface-level cost
function is to reorganize the UI’s visual or spatial features so as to improve such a metric.
Examples include optimizing a menu system for consistency, a scatterplot design to avoid
overplotting, and a layout for minimum whitespace (Oulasvirta, 2017). A concrete example
of an aesthetic metric used in wireframe optimization (Todi, Weir, and Oulasvirta, 2016) is
grid quality (Miniukovich and De Angeli, 2015). It calculates the distance to the closest
symmetrical layout and was shown to correlate with user aesthetic preferences. Computa-
tion starts with the layout vertex set, rescaled such that the x-axis coincides with the axis
of symmetry and the y-axis contains the centre of mass of the xi . This is mapped to the
complex plane to obtain a set zi = xi + Iyi . These points are horizontally symmetrical if
they consist only of real values and complex conjugate pairs. The asymmetry for a layout
is obtained by constructing a polynomial of involved coefficients and averaging the size of
the imaginary parts of the coefficients. Vertical symmetry is scored in the same way, after a
simple coordinate transformation.
Actions Action or operation level objectives capture user performance in individual
actions like visual search or selection. To the extent they correlate with task-level
performance or user experience, they may be good surrogates that are easier to model
and quicker to evaluate. Examples include optimization of wireframe sketches for visual
search performance, widget layouts for individual motor capabilities, and gesture sets for
learnability (Oulasvirta, 2017). For instance, consider the optimization of a keyboard
layout for aimed movement performance (Zhai, Hunter, and Smith, 2002). In aimed
movements, an end-effector, such as the tip of a finger or a cursor, is brought on top of
a spatially expanded target to select it. Selection forms the basis of communicating intent
with all visuo-spatially organized UIs, such as menus, keyboards, web pages, and GUIs. To
optimize a UI for selection means that the speed and accuracy of selections is optimized.
Fitts’ law provides a well-known regression model that links movement time MT to
task demands. Optimization using Fitts’ law minimizes the distances and maximizes the sizes of
selectable elements, such as keys.
Tasks Task-level models predict users’ performance or success in completing activities
with an interface, thus directly addressing usability-related concerns such as task completion
time and errors. They assume some structure in the task, consisting of sequential or parallel
steps. An example is the Search-Decide-Point (SDP) model used to optimize menus (Bailly,
Oulasvirta, Kötzing, and Hoppe, 2013).
Experience Models of experience predict how users might learn, what they might experience, or how much stress interaction might cause. For example, the Rosenholtz model offers a way to minimize the experience
of visual clutter. It operates by choosing colours, sizes, and shapes of elements such that
feature congestion is minimized. The layout elements have some distribution of features such
as colour, contrast, orientation, and size. With few targets, the distribution is small compared
to the overall size of the feature space. As the feature distribution occupies more of the
available space, the feature space becomes congested. One first computes the mean and
covariance of given feature vectors. The clutter of the display is defined as the determinant
of the covariance matrix.
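The clutter computation can be sketched directly, assuming two features per layout element; the feature vectors below are invented:

```python
def clutter_score(features):
    """Feature-congestion-style clutter: the determinant of the covariance
    matrix of the element feature vectors (two features per element here)."""
    n = len(features)
    mx = sum(f[0] for f in features) / n
    my = sum(f[1] for f in features) / n
    sxx = sum((f[0] - mx) ** 2 for f in features) / n
    syy = sum((f[1] - my) ** 2 for f in features) / n
    sxy = sum((f[0] - mx) * (f[1] - my) for f in features) / n
    return sxx * syy - sxy * sxy      # determinant of the 2x2 covariance

# Tightly clustered features score far lower than spread-out ones.
tight = clutter_score([(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2)])
spread = clutter_score([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
```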
4.6.1 Heuristics
Black box methods do not make any explicit assumptions on the objective function they
are going to optimize, but rather they consider it as a black box or an oracle that tells them
the objective value of a given candidate. There are deterministic and randomized black
box methods (see e.g., Rao, 2009). Though it is possible to randomly try out candidates
and return the best solution after a certain time, it is often more efficient to apply a more
systematic approach. We here outline some well-known approaches. The reader is referred
to any of the many textbooks for more comprehensive reviews.
Greedy Greedy algorithms divide the construction of a solution into multiple subsequent
decisions. These decisions are taken in a certain order and are not revoked later. Typical
examples are packing problems: the solution is built item by item until no further items
can be added. It is natural to proceed by choosing, among the items that still fit, the one
that yields the largest growth in profit w.r.t. the current selection.
Local search The main ingredient of local search is the definition of a neighborhood of a
given configuration. For example, the neighboring layouts of a keyboard layout could be
those that differ by swapping one pair of letters. Starting from an initial configuration, local
search proceeds by choosing a neighboring configuration with an objective value that is at
least as good as the current one. These choices could be, for example, greedy or randomized.
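A minimal sketch of local search over the swap neighborhood; the objective function is invented for the example:

```python
def local_search(layout, cost):
    """Hill climbing over the swap neighborhood: accept any pair swap that
    lowers the objective, until no swap improves it (a local optimum)."""
    layout = list(layout)
    best = cost(layout)
    improved = True
    while improved:
        improved = False
        for i in range(len(layout)):
            for j in range(i + 1, len(layout)):
                layout[i], layout[j] = layout[j], layout[i]
                c = cost(layout)
                if c < best:
                    best, improved = c, True
                else:                       # undo a non-improving swap
                    layout[i], layout[j] = layout[j], layout[i]
    return layout, best

# Invented objective: item v "wants" to sit in slot v.
misplacement = lambda lay: sum(abs(slot - v) for slot, v in enumerate(lay))
final, score = local_search([3, 2, 1, 0], misplacement)
```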
Simulated annealing Simulated annealing can be considered a special case of local
search that uses randomness to control exploration/exploitation behaviour. The main difference
is that a neighboring configuration is not only chosen when it yields a better objective
value, but also, with a certain probability, when it is worse. This probability decreases with
the extent of the deterioration of the objective value. Typically this probability is exp(−βΔ),
where Δ is the difference in the objective value and β is a parameter called the inverse temperature.
Simulated annealing belongs to the class of randomized search heuristics. Further members
of this class are evolutionary algorithms, ant colony optimization, etc. Common to these
is that they implement some logic to control the use of heuristics over the progression
of search.
Because objectives are often conflicting, the optimization task becomes that of finding the best
compromise (or, ‘the least worst compromise’). Multi-objective optimization is in the end
about optimizing a single function into which the multiple objectives have been collapsed.
Tuning an objective function refers to the setting of weights (ω). A designer decides how to
trade off the objectives, for example, how to exchange one ‘unit of usability’ for one
unit of security. In HCI, the choice of ω is not only an empirical decision (what do users
want?) but a strategic decision at the same time (what should they want?). For empirical
tuning, one may run an experimental study to find the set of weights that best predicts some
variable of interest. Statistics or logs from a live system can be used to weigh parameters.
Strategic tuning refers to goals set outside optimization itself, for example, by marketing.
Despite its simplistic appeal, the weighted-sum approach has some recognized issues.
The optimization literature has proposed several advanced methods. They offer alternatives to
weighing to express a designer’s goals. General formulations and their pros and cons are
summarized by Marler and Arora (2004).
designs against human ground truth. Alternatively, they could be based on mining of existing
designs, or inferred from user behaviour in logs.
Challenge 5. Efficient methods that allow real-time optimization. With some notable
exceptions, the field has thus far worked on proofs-of-concepts using ‘one shot’ optimiza-
tion. With increasing need for real-time adaptation and integration with design tools, it will
become important to develop methods specifically targeted at getting good results fast.
Challenge 6. Concepts for interactive definition and steering of optimization. To seriously
re-envision UI design with an optimizer in the loop, the field needs to go beyond the ‘fire
and forget’ paradigm. It is irreconcilable with design practice. A designer cannot be asked
to provide all inputs to an optimizer to a decimal point and come back after a while for
an answer. Designers constantly change the problem definition and the design space. How
might design task be defined for an optimizer, its search process steered, and how could
the optimizer positively encourage the designer to adopt better designs without choking
creativity?
Challenge 7. Critical evaluation against model predictions. It is fair to ask if optimized
designs are actually usable, and how they compare with human-designed interfaces.
Empirically verified improvements have been reported for keyboards, widget layouts, ges-
ture controls, and menu design (Oulasvirta, 2017). In addition, one of the most significant
benefits of the approach is that each design comes with model-based predictions about its
use. Empirical data that is being collected is thus a direct assessment of the objective function
that was used.
Over and beyond such challenges, perhaps the most exciting academic prospect of
combinatorial optimization is that it offers a multi-disciplinary yet principled solution to
one of the most profound questions in HCI research: how to solve a problem using available
empirical and theoretical knowledge (Card, Newell, and Moran, 1983; Carroll, 2000; Gaver,
Beaver, and Benford, 2003; Höök and Löwgren, 2012; Nielsen, 1994; Norman and Draper,
1986; Oulasvirta and Hornbæk, 2016; Winograd and Flores, 1986; Zimmerman, Forlizzi,
and Evenson, 2007). Combinatorial optimization invites computer scientists to define what
‘design’ means and psychologists, cognitive scientists, and designers to define what ‘good’
means in interaction.
....................................................................................................
references
Bailly, G., Oulasvirta, A., Kötzing, T., and Hoppe, S., 2013. Menuoptimizer: Interactive optimization
of menu systems. In: UIST’13: Proceedings of the 26th annual ACM symposium on User Interface
Software and Technology. New York, NY: ACM, pp. 331–42.
Burkard, R. E., and Offermann, J., 1977. Entwurf von Schreibmaschinentastaturen mittels quadratischer Zuordnungsprobleme. Zeitschrift für Operations Research, 21(4), pp. B121–32.
Card, S. K., Mackinlay, J. D., and Robertson, G. G., 1991. A morphological analysis of the design space
of input devices. ACM Transactions on Information Systems (TOIS), 9(2), pp. 99–122.
Card, S. K., Newell, A., and Moran, T. P., 1983. The psychology of human-computer interaction.
Oulasvirta, A., and Hornbæk, K., 2016. HCI research as problem-solving. In: J. Kaye, A. Druin,
C. Lampe, D. Morris, and J. P. Hourcade, eds. Proceedings of the SIGCHI Conference on Human
Factors in Computing. New York, NY: ACM, pp. 4956–67.
Rao, S., 2009. Engineering optimization: Theory and Practice. New York, NY: John Wiley & Sons.
Simon, H. A., 1973. The structure of ill structured problems. Artificial intelligence, 4(3–4),
pp. 181–201.
Todi, K., Weir, D., and Oulasvirta, A., 2016. Sketchplore: Sketch and explore with a layout optimiser.
In: Proceedings of the 2016 ACM Conference on Designing Interactive Systems. New York, NY: ACM,
pp. 543–55.
Wickens, C. D., and Kramer, A., 1985. Engineering psychology. Annual Review of Psychology, 36(1),
pp. 307–48.
Winograd, T., and Flores, F., 1986. Understanding Computers and Cognition: A New Foundation for
Design. Reading, MA: Addison-Wesley.
Zhai, S., Hunter, M., and Smith, B. A., 2002. Performance optimization of virtual keyboards. Human–
Computer Interaction, 17(2–3), pp. 229–69.
Zimmerman, J., Forlizzi, J., and Evenson, S., 2007. Research through design as a method for interaction
design research in HCI. In: CHI 2007: Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems. New York, NY: ACM, pp. 493–502.
Zwicky, F., 1948. The morphological method of analysis and construction. In: Studies and Essays.
Courant Anniversary Volume. New York, NY: Interscience.
5
5.1 Introduction
Text entry is one of the most basic, common and important tasks on touchscreen devices
(e.g., smartphones and tablets). A survey reported by time.com (McMillan, 2011) shows
that the top three most common activities on smartphones are texting, emailing, and
chatting on social networks: all of them involve intensive text input. According to a study from
Nielsen (2010), a teenager, on average, sends over 3,000 messages per month, or more than
six texts per waking hour. Given the popularity of mobile text entry, any improvement on the
text entry technology will likely have a big impact on the user experience on mobile devices.
A soft keyboard (i.e., a virtual keyboard, or on-screen keyboard) is a graphical presentation
of a keyboard on a touchscreen. It is now the major text entry method on touchscreen devices. A user
can enter text either by tapping keys (referred to as tap typing), or by gesture typing, in which a
user enters a word by gliding a finger over the letters. Despite its wide adoption, entering text on
soft keyboards is notoriously inefficient and error-prone, in part due to the keyboard layout
design, and in part due to the ambiguity of finger input.
First, it is widely known that Qwerty is suboptimal for entering text on mobile devices.
Invented in the nineteenth century, one of the rationales of the Qwerty design was to mini-
mize typewriter mechanical jamming by arranging common digraphs on the opposite sides
of the keyboards (Yamada, 1980). Such a design has long been understood to be inefficient for a
single-movement-point (finger or stylus) keyboard. When used with a single stylus or a single
finger, back-and-forth lateral movement is more frequent and covers greater distance than
necessary. Although some users tap a soft keyboard with two thumbs on relatively large
touchscreen devices, one-finger text entry is also very common on phone-sized touchscreen
devices, especially when the other hand is unavailable or when the user is performing gesture
typing.
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
Dij and d can be measured in any distance unit. Assuming all keys have the same size, an
informative distance unit is simply the key width (or diameter, if the key is round), so
that the distance measure is normalized against key size and counted as the number of keys
travelled.
Equation 5.1 is a reasonable but underused optimization objective function. It means
arranging the letter keys in such a way that the average movement distance is minimized. For
example, if the average travel distance to tap a letter on QWERTY is 3.3 keys, a good
optimization result would lower that number to, for example, 1.5 keys. There are
two advantages to such an optimization objective. First, it is simple and direct, involving
no models or parameters that may be subject to debate. Second, it is also very meaningful:
distance is literally and linearly related to ‘work’ in the physics sense. Minimizing the amount of
work is a reasonable optimization goal.
An alternative and more commonly used optimization metric is the average movement
time (MT) taken to reach a key. Since movement time cannot be manipulated directly,
it has to be related to layout in some way. One approach, taken for example by Hughes,
Warren, and Buyukkokten (2002), is to empirically measure Tij , the average movement time
between every pair of unlabelled keys on a grid. From there it was possible to obtain the
average movement time for a given layout with letters assigned to that grid.
$$t = \sum_{i=1}^{N}\sum_{j=1}^{N} P_{ij}\, MT_{ij} \qquad (5.3)$$

Or,

$$t = a\sum_{i=1}^{N}\sum_{j=1}^{N} P_{ij} + b\sum_{i=1}^{N}\sum_{j=1}^{N} P_{ij}\log_2\!\left(\frac{D_{ij}}{W_{ij}} + 1\right) \qquad (5.4)$$

Since

$$\sum_{i=1}^{N}\sum_{j=1}^{N} P_{ij} = 1,$$

$$t = a + b\sum_{i=1}^{N}\sum_{j=1}^{N} P_{ij}\log_2\!\left(\frac{D_{ij}}{W_{ij}} + 1\right) \qquad (5.5)$$
t has the unit of seconds. t can be converted to input speed (V) in characters per minute
(CPM): V = 60/t. Equation 5.5 has been called the Fitts-digraph model (Zhai, Hunter,
and Smith, 2002).
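Given a digraph probability table and a layout's key positions, Equation 5.5 is straightforward to evaluate. A minimal sketch, using the touchscreen parameter values adopted later in the chapter (a = 0.083 s, b = 0.127 s); the two-key digraph table is a toy illustration, not data from the chapter:

```python
import math

# Fitts' law parameters for touchscreen keyboarding (Zhai, Sue, and Accot, 2002),
# as adopted in this chapter: a = 0.083 s, b = 0.127 s.
A, B = 0.083, 0.127

def mean_time(P, centres, key_width):
    """Equation 5.5: predicted mean time (seconds) per tapped character.
    P[(i, j)] is the digraph probability; centres[c] is key c's centre."""
    t = A
    for (i, j), p in P.items():
        x1, y1 = centres[i]
        x2, y2 = centres[j]
        d = math.hypot(x2 - x1, y2 - y1)            # D_ij
        t += B * p * math.log2(d / key_width + 1)   # W_ij = key width
    return t

# Toy digraph table: two keys one key-width apart, always alternating.
centres = {"a": (0.0, 0.0), "b": (1.0, 0.0)}
P = {("a", "b"): 0.5, ("b", "a"): 0.5}
t = mean_time(P, centres, key_width=1.0)
speed_cpm = 60.0 / t  # V = 60 / t, in characters per minute
```

For a real layout, P would come from a corpus digraph count and centres from the candidate layout's geometry.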
There are two main advantages to using movement time as the objective function. First,
time might be what users are most concerned about in entering text. Second, movement
time as the objective function can be converted to the conventional typing speed units of
characters per minute (CPM) or words per minute (WPM). For example, if the conventional
typing speed standard for an office worker on a typewriter is 60 WPM, an optimization
result of 40 WPM for a touchscreen keyboard gives us a good sense of how good the result
is. Converting from CPM to WPM is simple arithmetic, given 1 minute = 60 seconds and
1 word = 5 characters. The latter is merely a convention, and it includes the space character
after each word. The average number of characters in a word depends on the text corpus;
our calculation from the American National Written Text Corpus is 4.7 characters per word,
excluding the space after the word. We use CPM in the rest of this chapter to avoid
confusion.
On the other hand, there are disadvantages to using Fitts’ law, primarily because a
wide range of values for the Fitts’ law parameters (a and b in Equation 5.2) has been reported in the
literature. Based on results from the more general Fitts’ reciprocal tapping task, previous
researchers have selected values such as a = 0, b = 0.204 (MacKenzie and Zhang, 1999;
Zhai, Hunter, and Smith, 2002). Specifically in the context of touchscreen keyboarding,
more appropriate estimates were made at a = 0.083 s and b = 0.127 s (Zhai, Sue, and
Accot, 2002). We employed these parameters in this chapter. The accuracy of these values
was subsequently verified by an empirical study reported later in this chapter. If we
assumed a = 0, as has mistakenly been done in the literature (MacKenzie and Zhang,
1999; Zhai, Hunter, and Smith, 2000), the estimated percentage of time improvement
would tend to be exaggerated. Here, a reflects the ‘non-informational aspect of pointing action’
(Zhai, 2004). Non-informational aspects of pointing could include activation of muscles
and key press actions that are independent of the distance to the targeted letter and,
hence, not influenced by keyboard layout.
Fitts’ law also suggests that movement time might be saved by giving greater size to
the more commonly used letters. Although not a settled topic, previous work on key size
optimization has been unsuccessful for a number of reasons, including the difficulty of
tightly packing keys of varying sizes and the conflict between central positions and size
(Zhai, Hunter, and Smith, 2002).
Given the pros and cons of each approach, this chapter uses movement time as calculated
in Equation 5.5 as the primary objective function, but we also simultaneously report the
mean distance calculated according to Equation 5.1.
While the methods are general and, to a large extent, independent of the scope of keys
included, we chose to first optimize only for the 26 Roman letters, excluding all auxiliary
keys. Previous research in this area tended to include the space key in optimization because
it is the most frequent character (MacKenzie and Zhang, 1999; Zhai, Hunter, and Smith,
2000). The choice to exclude the space key in this chapter was made for three reasons. First,
on almost all existing keyboards, including QWERTY and the keypad on a cell phone,
the 26 Roman letters are grouped together, while all other keys are arranged on the
periphery. To ease access to frequently used auxiliary keys, they are usually given
distinctive shapes and placed at distinct positions (e.g., the spacebar on QWERTY). We
keep this layout style when designing new keyboards to leverage users’ familiarity and the
possible cognitive advantages. Second, it is debatable what the best way to enter a space is.
Besides assigning the space key a distinct shape and position, previous research (Kristensson
and Zhai, 2005) has argued that it is better to use a stroke gesture on the keyboard, such as
a slide over more than one key-width of distance to enter a space, because such an action can
be done anywhere on the keyboard (saving the time to travel to a specific location) and is
also a more robust word segmentation signal for word-level error correction. In the case of
gesture keyboards, i.e., using stroke gestures approximately connecting letters on a keyboard
as a way of entering words (Zhai and Kristensson, 2006), the space is entered automatically.
Furthermore, the same optimization methodology used in this chapter can also be applied if
a space key is included in the optimization process. Including it will not change the essence
and main conclusions of this study.
Many languages use diacritics. It is possible to include letters with diacritics in the
optimization process or to arrange them separately. We deal with this topic later in the
chapter.
To establish reference points, we first calculated the average tapping time and distance
on two existing layouts (QWERTY and ATOMIK) as a touchscreen keyboard for English,
using English digraph frequencies obtained from a modern, large-scale English corpus: the
American National Corpus (ANC), containing 22,162,602 words. Previous research has
shown that layout optimization is not sensitive to corpus source within the same language
(Zhai, Hunter, and Smith, 2002). By applying Equation 5.5, the English input speed VEnglish
of the QWERTY keyboard is estimated at 181.2 CPM.
allows moves with positive energy changes, enabling the search to climb out of local minima.
We use the same basic optimization process in this chapter.
The conventional QWERTY keyboard lays out the 26 English letters in a rectangular shape
with three rows and ten columns, which is well suited to two-handed typing. Such a
constraint is not necessary for touchscreen keyboards. If the keyboard is not constrained
to any particular shape, the objective function of minimizing one-point travel (by a stylus or
a single finger) tends to result in a rounded keyboard (Zhai, Hunter, and Smith, 2000).
Our experience is that a square shape (or near-square, such as a grid of five rows by
six columns) is more practical for graphic design and still gives sufficient flexibility for
optimization (Zhai, Hunter, and Smith, 2002). As shown in Figure 5.2, in practical product
designs the edge keys unused by the alphabet letters can be filled with auxiliary
characters and functions.
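The Metropolis random walk referred to above can be sketched as follows. The swap move, temperature, and iteration count here are illustrative choices rather than the chapter's exact settings; in the chapter's setting, the energy function would be the Fitts-digraph time of Equation 5.5 for the candidate layout.

```python
import math
import random

def metropolis_optimize(keys, energy, iterations=10000, temperature=0.05, seed=1):
    """Metropolis random walk over layouts: propose a random key swap, always
    accept improvements, and accept worsening moves with probability
    exp(-dE / T) so the search can climb out of local minima."""
    rng = random.Random(seed)
    layout = list(keys)
    e = energy(layout)
    best, best_e = list(layout), e
    for _ in range(iterations):
        i, j = rng.sample(range(len(layout)), 2)
        layout[i], layout[j] = layout[j], layout[i]      # propose a swap
        e_new = energy(layout)
        if e_new <= e or rng.random() < math.exp((e - e_new) / temperature):
            e = e_new                                     # accept the move
            if e < best_e:
                best, best_e = list(layout), e
        else:
            layout[i], layout[j] = layout[j], layout[i]   # reject: undo swap
    return best, best_e

# Toy check: with a misplacement-count energy, the walk recovers "abcde".
target = "abcde"
energy = lambda layout: sum(c != t for c, t in zip(layout, target))
best, best_e = metropolis_optimize("edcba", energy)
```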
5.2.4 Results
[Figure: letter frequency (0–16%) of the letters a–z in English and French.]
Figure 5.2 Heat maps of English (left) and French (right) digraph distribution. The shade intensity
reflects each digraph cell’s frequency.
encouraging for our optimization objective. Note that the English corpus is obtained from
ANC, and corpora of other languages are from http://corpus.leeds.ac.uk/list.html.
According to Equation 5.7, the mean time of typing a character t is then calculated as:

$$t = 0.5\,t_{Eng} + 0.5\,t_{Fren} \qquad (5.8)$$

where t_Eng is the mean time of typing an English letter, and t_Fren that of a French letter.
Using the Metropolis algorithm, we obtained a variety of keyboards with similar perfor-
mance. Figure 5.3 shows one of them, which is denoted by K-Eng-Fren.
We also used the Metropolis method to optimize solely for English and French, obtaining
K-English and K-French respectively (Figure 5.3). Figure 5.4 shows the performance of all
three optimization results, as well as the performance of QWERTY. The calculated English
input speed for K-English, K-Eng-Fren, and QWERTY are 230.1, 229.4, and 181.2 CPM
Figure 5.4 Predicted input speed and average travel distance per tap of K-English, K-French, and
K-Eng-Fren and QWERTY.
respectively, and the French input speeds for K-French, K-Eng-Fren, and QWERTY are 235.0,
230.8, and 177.8 CPM, respectively. As one would expect, since K-Eng-Fren takes two
languages into account simultaneously, it is inferior to K-English in English input speed
(VEnglish). However, K-Eng-Fren is only 1 CPM slower than K-English. Such a tiny
performance drop should be negligible in real use. A similar relationship exists between
K-Eng-Fren and K-French. Compared to the standard QWERTY layout, K-Eng-Fren is superior
for both English and French input: it improves English input speed by 48.2
CPM, and French by 53.0 CPM.
If we look at the average travel distance per tap (Figure 5.4b) on these layouts for English
and French respectively, we can draw similar conclusions. For English, the average travel
distance per key press on K-English, K-French, and K-Eng-Fren and QWERTY are 1.76,
1.93, 1.78, and 3.31, respectively. K-Eng-Fren is 46% shorter than QWERTY but only 1%
longer than K-English. For French, the average travel distance per key press on K-English,
K-French, and K-Eng-Fren and QWERTY are 1.85, 1.68, 1.76, and 3.51, respectively. K-Eng-
Fren is 50% shorter than QWERTY but only 5% longer than K-French.
In summary, it is somewhat surprising, and quite encouraging, that it is possible to
simultaneously optimize a keyboard layout for both English and French input efficiency.
The resulting layout loses little for English compared with a keyboard optimized specifically for
English, and little for French compared with a keyboard optimized specifically for French.
Figure 5.5 Letter Frequency in English, French, German, Spanish, and Chinese.
Figure 5.6 Correlation coefficients of letter (bottom-left) and digraph frequency distributions
(top-right).
Figure 5.7 The layouts of K5, and optimized layout for each individual language.
for inputting English, French, Spanish, German, and Chinese. We also obtained optimized
layouts and their performance for each of the five languages (see Figure 5.7). Figure 5.8
summarizes these results in terms of V (calculated input speed) and d (average distance,
in units of key width, for entering a character).
Let us first examine the individual results when optimized specifically for each of the
five languages, summarized in bold numbers in the diagonal cells of Figure 5.8. The first
interesting observation is that, after optimization for each language, the average travel
distance per tap fell into a relatively narrow range: 1.76, 1.68, 1.76, 1.63, and 1.5 keys for
English, French, Spanish, German, and Chinese, respectively. Greater differences between
these languages might have been expected, given their different phonologies. In comparison,
QWERTY is roughly equally bad for all: 3.31, 3.51, 3.7, 3.26, and 3.85 keys for English,
French, Spanish, German, and Chinese, respectively. The ratios between the average travel
distance per tap on QWERTY and that on the keyboards individually optimized for each
language are large: 1.88, 2.09, 2.10, 1.99, and 2.57 for English, French, Spanish, German,
and Chinese, respectively. Figure 5.8 illustrates the travel distance differences among the
various layouts. Although English has been the primary target language in touchscreen
keyboard optimization work (e.g., Zhai, Hunter, and Smith, 2000), English in fact has the
least to gain, and Chinese the most, from touchscreen keyboard optimization. To our
knowledge, these results and observations are new.
When the five languages are considered simultaneously, the optimization effect is still
strong. As shown in Figure 5.8, the average distance d for tapping a character on K5 is
1.88, 1.86, 1.91, 1.77, and 1.68 keys for English, French, Spanish, German, and Chinese,
respectively, much shorter than QWERTY’s 3.31, 3.51, 3.7, 3.26, and 3.85 keys, and close
to the results obtained specifically for each language (1.76, 1.68, 1.76, 1.63, and 1.5 keys for
Figure 5.8 Calculated input speed (CPM) and average travel distance per tap (keys) of various
layouts optimized for English, French, Spanish, German, Chinese, all five, and two previous layouts
(ATOMIK and QWERTY).
                                      English   French   Spanish   German   Chinese
  Individual optimization / QWERTY      0.53      0.48      0.48      0.50     0.39
  Individual optimization / K5          0.94      0.90      0.92      0.92     0.89

Figure 5.9 The ratios of travel distance between layouts for the five languages.
English, French, Spanish, German, and Chinese, respectively). The ratios in travel distance
between K5 and QWERTY, and between the individually optimized layouts and K5, are
summarized in Figure 5.9.
As discussed earlier, the digraph correlations among the five languages are relatively
weak, so optimizing for only one language provides no guarantee that the resulting layout
would also be good for other languages. Note that the optimization process uses a stochastic
method, so each layout obtained is just one instance of many possibilities. The specific
instance of layout we obtained for English happened to be also quite good for the other
four languages (see Figure 5.10), although not as good as K5. On the other hand, the
specific instance of Spanish layout was relatively poor for Chinese. Interestingly, the layout
optimized for Chinese was not very good for any of the other four languages (Figure 5.10).
The computational study and analysis thus far have not only produced optimized layouts
for French, Spanish, German, and Chinese that have not been previously reported in the
literature, but also demonstrated that it is possible to accommodate at least these five lan-
guages in one optimized layout, with about a 10% travel distance increase from individually
optimized layouts (See Figure 5.9).
Having examined the layouts in terms of travel distance, let us now evaluate the calculated
input speeds of all languages as shown in Figure 5.11 and in Figure 5.8. K5 is faster than
Figure 5.10 The average travel distance per tap for English, French, Spanish, German, and Chinese
on various optimized and QWERTY layouts.
Figure 5.11 Calculated input speed of K5, QWERTY, and single-language optimization
keyboards.
QWERTY for all five languages. For example, in English, K5 improves the input
speed by 24% over QWERTY, from 181.2 CPM to 225.1 CPM. Compared with a past
optimization effort, the performance metrics (V and d) of K5 for English are better than
those of the revised ATOMIK, as currently used in ShapeWriter (V = 221.5 CPM, d = 1.94
keys).
L is a 40,000-word lexicon, f_w is the frequency of w, and d_w is the distance between w and
its nearest neighbour. We compute the distance between two ideal traces w and x via
proportional shape matching. Each gesture is sampled into N equidistant points, and the
distance is simply the average of the distances between corresponding points:

$$d(w, x) = \frac{1}{N}\sum_{i=1}^{N} \left\lVert w_i - x_i \right\rVert_2 \qquad (5.11)$$
Since the gesture clarity metric compares the gestures of every pair of words to find each
word’s nearest neighbour, its time complexity is O(N · |L|²), where |L| is the number of
words in the lexicon and N is the number of sample points in each word gesture. Its quadratic
time complexity with respect to |L| stands in stark contrast to the time complexities of the earlier
optimization metrics (which are all linear in |L|), making optimization
with it intractable. For our 40,000-word lexicon, there are nearly 800 million pairs of word
gestures to compare for each keyboard layout that we examine during the optimization
process.
To make the metric more tractable, we made two key algorithmic refinements. First,
when searching for the nearest neighbour for each word, we only considered prospective
neighbours that started and ended with characters that were located within one key diag-
onal of the word’s starting and ending character, respectively. This is similar to the initial
template-pruning step employed in SHARK2 (Kristensson and Zhai, 2004), where the
distance threshold in this case is the diagonal length of a key. Second, we used a small
number of gesture sample points N to represent each word’s gesture. If N were too large, the
computation would be very expensive. If N were too small, word gestures (especially longer
ones) might not be represented properly, leading to incorrectly chosen nearest neighbours.
Figure 5.12 Word gesture neighbour sensitivity. The nearest neighbour that we find for a word
depends on how finely the word gestures are sampled. Here, we show the percentage of nearest
neighbours that are the same as when 100 sample points are used. The darker dot signifies 40 points,
the amount used.
In order to see how small we could make N without affecting the integrity of our results,
we performed a small experiment. First, we found each word’s nearest neighbour on Qwerty
using very fine sampling (N = 100). Then, we repeated this step for smaller values of N
down to N = 20 and counted the number of nearest neighbours that were identical to the
N = 100 case. Figure 5.12 shows the results. When the number of sample points is reduced
to 40, 96.9% of the nearest neighbours are the same as they were before. We used this value
for N in our algorithm.
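With the sampling rate settled at N = 40, the shape-matching distance of Equation 5.11 can be sketched as follows. The arc-length resampling helper is an assumption about how the equidistant points are obtained, not the chapter's code:

```python
import math

def resample(points, n=40):
    """Resample a polyline into n equidistant points along its arc length."""
    d = [0.0]                          # cumulative arc length at each vertex
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        d.append(d[-1] + math.hypot(x2 - x1, y2 - y1))
    total = d[-1]
    out, seg = [], 0
    for k in range(n):
        target = total * k / (n - 1)   # arc length of the k-th sample
        while seg < len(d) - 2 and d[seg + 1] < target:
            seg += 1
        span = d[seg + 1] - d[seg] or 1.0
        t = (target - d[seg]) / span
        (x1, y1), (x2, y2) = points[seg], points[seg + 1]
        out.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return out

def gesture_distance(w, x, n=40):
    """Equation 5.11: mean Euclidean distance between corresponding points."""
    wi, xi = resample(w, n), resample(x, n)
    return sum(math.hypot(a - c, b - d) for (a, b), (c, d) in zip(wi, xi)) / n
```

For example, two parallel horizontal traces one key-width apart have a distance of exactly one key width under this measure.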
$$T(P) = \sum_{AB \in P} T(AB), \qquad (5.13)$$
where P is the polyline and AB is a segment in the polyline. Although Cao and Zhai found
that the angles between polyline segments (that is, of a polyline’s corners) have an effect
on gesture entry time, the magnitude of the effect was small: less than 40 ms per corner
compared to 200–700 ms per segment. Hence, the model uses corners to delineate segments
but omits their 40 ms contribution.
As with the gesture clarity metric, each word in the lexicon is represented as its ideal trace.
To help compute the metric, we store a table of the weighted number of occurrences of each
bigram in our lexicon. The weighted number of occurrences o_{i-j} of a bigram i-j (for
letters i and j) is calculated as follows:

$$o_{i\text{-}j} = \sum_{w \in L} f_w \cdot \left(\text{number of occurrences of } i\text{-}j \text{ in } w\right) \qquad (5.14)$$
Here, L is the lexicon, w is a word in the lexicon, and fw is the frequency of word w in L. Each
bigram is represented by a different line segment in the CLC model. Hence, to estimate G,
the average time it takes to complete a word gesture, we calculate the following:
$$G = \sum_{i,j \in \alpha} o_{i\text{-}j} \cdot T\!\left(\overline{K_i K_j}\right) \qquad (5.15)$$
Here, i and j are both letters in the alphabet α, the set of lowercase letters from ‘a’ to ‘z.’ Ki and
Kj are the key centres of the i and j keys, respectively, Ki Kj is the line segment connecting
the key centres, and the function T is defined in Equation 5.12. Hence, G is measured in
milliseconds.
The last step is to convert the gesture duration G into words per minute (WPM), a
measure of typing speed. Doing so gives us our gesture speed metric score:
$$\text{Speed} = \frac{60{,}000}{G} \qquad (5.16)$$
60,000 represents the number of milliseconds in one minute. When calculating the gesture
typing speed of a keyboard layout, we do not consider the effects of the space bar or
capitalization (and the Shift key). One of the key contributions of gesture typing is the fact
that spaces are automatically added between word gestures, eliminating the need for one in
approximately every 5.7 characters typed (Zhai and Kristensson, 2003). Moreover, most of
today’s gesture-typing systems apply capitalization and diacritics automatically.
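Equations 5.14–5.16 can be sketched as follows. Note that T (the CLC segment-time model of Equation 5.12) is not reproduced in this excerpt, so `linear_time` below is an assumed placeholder cost, not the CLC model; word frequencies f_w are assumed normalized to sum to 1, so that G is the mean per-word gesture time.

```python
import math
from collections import defaultdict

def bigram_occurrences(lexicon):
    """Equation 5.14: weighted number of occurrences o_{i-j} of each bigram.
    lexicon maps each word w to its frequency f_w (assumed to sum to 1)."""
    o = defaultdict(float)
    for w, fw in lexicon.items():
        for i, j in zip(w, w[1:]):
            o[(i, j)] += fw
    return o

def gesture_speed(lexicon, centres, segment_time):
    """Equations 5.15 and 5.16: mean gesture duration G (ms) -> speed (WPM).
    segment_time(Ki, Kj) stands in for the CLC model's T (Equation 5.12)."""
    o = bigram_occurrences(lexicon)
    G = sum(n * segment_time(centres[i], centres[j]) for (i, j), n in o.items())
    return 60000.0 / G

# Placeholder per-segment cost (an assumption, NOT the CLC model): a fixed
# 200 ms per segment plus 300 ms per key width travelled.
def linear_time(p, q):
    return 200.0 + 300.0 * math.hypot(q[0] - p[0], q[1] - p[1])
```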
We should also note that, because the CLC model omits the cost of gesturing corners
and the cost of travelling from the end of one gesture to the beginning of the next, the
calculated speeds generally overestimate the speeds at which users would actually type.
Rick (2010) proposed an alternative to the CLC model that is also based on Fitts’ law, and
although we ultimately chose to use the CLC model for our metric, we implemented Rick’s
model (without key taps for single-character words) to compare the models’ behaviours.
We found that Rick’s model consistently output lower speed estimates than the CLC model,
but that they both followed the same overall trend. More specifically, the mean (std. dev.)
ratio between Rick’s model’s predicted speeds and the CLC model’s predicted speeds for our
final set of optimized layouts is 0.310 (0.004). After normalizing the metrics as described in
Section 5.3.3.1, the mean (std. dev.) ratio becomes 0.995 (0.016).
where i is a letter in the alphabet α, the set of lowercase letters from ‘a’ to ‘z,’ and k_ix and q_ix are
the x-indices of the i key on the given keyboard layout and Qwerty, respectively. Unlike K_i
and K_j in Equation 5.15, which are points with units of millimetres, k_i and q_i are unit-less
ordered pairs of integers that represent the 2D index of key i’s slot in the keyboard grid. In
most of today’s touchscreen keyboard layouts, the second and third rows are offset from the
first row by half of a key width. Hence, in order to properly calculate the Manhattan distance
for this metric, we treat the second and third rows as if they are shifted to the left by another
half of a key width, so that the second row is left-aligned with the first row. The resulting
representation of keyboard layouts is actually identical to the one used for creating Quasi-
Qwerty (Bi, Smith, and Zhai, 2010). The Qwerty similarity metric is the only one that uses
this modified keyboard representation.
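The similarity equation itself falls outside this excerpt, but from the description and the later figures quoted in key slots (42 for GK-T, 102 for GK-D), it can be read as the total Manhattan distance between each letter's grid slot and its Qwerty slot. A sketch under that assumption, with per-row integer (column, row) indices standing in for the left-aligned grid described above:

```python
# Qwerty's letter rows; integer column indices within each row reproduce the
# left-aligned representation described in the text (as in Quasi-Qwerty).
QWERTY = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def slot_index(rows):
    """Map each letter to its (column, row) grid index."""
    return {c: (x, y) for y, row in enumerate(rows) for x, c in enumerate(row)}

def qwerty_distance(layout_rows):
    """Assumed form of the similarity metric: total Manhattan distance, in key
    slots, between each letter's slot and its Qwerty slot."""
    k, q = slot_index(layout_rows), slot_index(QWERTY)
    return sum(abs(k[c][0] - q[c][0]) + abs(k[c][1] - q[c][1]) for c in q)
```

Qwerty itself scores 0; swapping one adjacent pair of keys costs 2 slots.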
interested in understanding the behaviour of each of the metrics and the inherent tradeoffs
between them.
As a result, although we still employ a simple objective function as part of our optimiza-
tion’s second phase, we use another approach called Pareto optimization for the optimiza-
tion at large. Pareto optimization has recently been used to optimize both keyboard layouts
(Dunlop and Levine, 2006) and keyboard algorithms (Bi, Ouyang, and Zhai, 2013). In
this approach, we calculate an optimal set of layouts called a Pareto optimal set or a Pareto
front. Each layout in the set is Pareto optimal, which means that none of its metric scores
can be improved without hurting the other scores. If a layout is not Pareto optimal, then
it is dominated, which means that there exists a Pareto optimal layout that is better than
it with respect to at least one metric, and no worse than it with respect to the others. By
calculating the Pareto optimal set of keyboard layouts, rather than a single keyboard layout,
we can analyse the tradeoffs inherent in choosing a keyboard layout and give researchers the
freedom to choose one that best meets their constraints.
Our optimization procedure is composed of three phases, described in detail in the next
subsections.
three metric scores. We use twenty-two different weightings for the linear combinations and
perform roughly fifteen full 2000-iteration local neighbourhood searches for each weighting.
The purpose is to ensure that the Pareto front includes a broad range of Pareto optimal
keyboard layouts.
The Pareto front starts out empty at the very beginning of this phase, but we update it with
each new candidate keyboard layout that we encounter during the searches (at each iteration
of each search). To update the front, we compare the candidate layout with the layouts
already on the front. Then, we add the candidate layout to the front if it is Pareto optimal
(possibly displacing layouts already on the front that are now dominated by the candidate
layout). The candidate layout is added whether it is ultimately kept in the particular local
neighbourhood search or not.
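The front-update rule just described can be sketched as follows, with each layout reduced to its tuple of metric scores (higher taken as better); the representation is illustrative, not the chapter's implementation:

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every metric and strictly
    better on at least one (scores are tuples, higher = better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def update_front(front, candidate):
    """Return the Pareto front after considering one candidate score tuple:
    discard it if dominated, otherwise add it and drop layouts it dominates."""
    if any(dominates(f, candidate) for f in front):
        return front                      # candidate is dominated: discard
    return [f for f in front if not dominates(candidate, f)] + [candidate]
```

Calling `update_front` once per candidate layout encountered during the local neighbourhood searches yields the final front.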
5.3.4 Results
Figure 5.13 shows the final Pareto front of keyboard layouts optimized for gesture typing.
Overall, the front is composed of 1,725 keyboard layouts chosen from the 900,000+
candidate layouts that we examined in all. No single layout on the front is better than all
of the others—each layout is better than the others in some way, and the tradeoffs that are
inherent in choosing a suitable layout from the front are reflected in the front’s convex shape.
More specifically, the front can be viewed as a 3D design space of performance goals that
one can choose from for different usage scenarios. Layouts with high gesture clarity scores,
Figure 5.13 3D Pareto front (axes: gesture clarity, gesture speed, and Qwerty similarity). The
keyboard layouts with lighter shades are farther from the origin.
gesture speed scores, and Qwerty similarity scores are more apt to exhibit lower error rates,
faster expert-level gesture entry times, and faster initial gesture entry times (respectively) than those
with low scores. However, since each layout on the front represents a compromise between
these three goals, the choice of layout for a particular user or usage scenario depends on the
relative importance of each goal. For example, a fast but less accurate user may prefer a layout
biased towards clarity, while a user who gesture types very accurately may prefer a layout
biased toward speed. Nevertheless, if we know nothing about users’ preferences or wish to
choose a layout that can best accommodate a wide variety of preferences, it is reasonable
to use one that is in the middle of the convex surface (serving each goal on a roughly equal
basis) as Dunlop and Levine (2006) did.
We now highlight layouts optimized for each of the three metrics as well as layouts that
serve roughly equal combinations of metrics. These layouts may serve as useful references
to researchers and designers, and later (in the user study) help us test the effectiveness of
our optimization and its associated metrics.
Figure 5.14(a) shows GK-C (‘Gesture Keyboard—Clarity’), the layout optimized exclu-
sively for gesture typing clarity. Figure 5.14(b) shows GK-S, which was optimized exclu-
sively for speed. The layout optimized for Qwerty similarity is simply Qwerty itself.
Figure 5.14(c) shows GK-D (where the ‘D’ stands for ‘double-optimized’). This keyboard
offers a roughly equal compromise between gesture typing clarity and gesture typing speed
without regard to learnability (Qwerty similarity). To find this layout, we projected the 3D
Pareto front onto the clarity–speed plane to derive a 2D Pareto front between clarity and
speed, then chose the layout on the 2D front that was closest to the 45◦ line. Figure 5.15
shows the 2D Pareto front and GK-D.
Figure 5.14(d) shows GK-T, where the ‘T’ stands for ‘triple optimized.’ This keyboard
offers a roughly equal compromise between all three metrics: gesture typing clarity, gesture
Figure 5.14 Single-optimized keyboard layouts. (a) Our GK-C keyboard (‘Gesture Keyboard—
Clarity’) is optimized for gesture typing clarity only. (b) Our GK-S keyboard (‘Gesture Keyboard—
Speed’) is optimized for gesture typing speed only.
Table 5.1 Keyboard metric score comparison. Shaded rows signify previous layouts.
typing speed, and Qwerty similarity. It is the layout on the 3D Pareto front that is closest to the
45◦ line through the space. As Figure 5.15 illustrates, it is possible to accommodate the extra
dimension of Qwerty similarity without a big sacrifice to clarity and speed.
Table 5.1 shows the metric scores for our optimized layouts as well as previous optimized
layouts. Together, these optimized layouts give us a good understanding of what is possible
in the optimization space for gesture typing.
First, we can improve gesture clarity by 38.8% by optimizing for clarity alone: GK-C’s
raw metric score is 0.543 key widths, while Qwerty’s is 0.391 key widths. Likewise, we also
see that we can improve gesture speed by 24.4% by optimizing for speed alone (resulting
in GK-S).
Second, the 2D Pareto front for gesture clarity and gesture speed (Figure 5.15) shows
that these two metrics conflict with each other. It forms a roughly −45º line, indicating
Figure 5.15 2D Pareto front for gesture typing clarity and gesture typing speed. GK-D, our double-
optimized layout, is the point on the front nearest the 45◦ line. Note that Qwerty is far worse in
both dimensions than GK-D, and that GK-T (which accommodates yet another dimension) is only
slightly worse on these two dimensions than GK-D.
that optimizing for one leads to a decrease in the other. As GK-C and GK-S illustrate, the
clarity metric tends to arrange common letters far apart in a radial fashion while the speed
metric clusters common letters close together.
However, despite the conflict, it is possible to arrange common letters close together
while keeping word gestures relatively distinct, achieving large improvements in both clarity
and speed. In GK-D (our double-optimized keyboard), letters in common n-grams such as
‘the,’ ‘and,’ and ‘ing’ are arranged together, while the n-grams themselves are spaced apart.
This arrangement offers a 17.9% improvement in gesture clarity and a 13.0% improvement
in gesture speed over Qwerty.
Third, accommodating Qwerty similarity (as GK-T does) does little harm to gesture
clarity or gesture speed. GK-T’s gesture clarity is only 0.01 key widths lower than GK-
D's, and GK-T's predicted speed is only 1 WPM lower than GK-D's. Meanwhile, GK-T's
Manhattan distance from Qwerty is just 42 key slots, while GK-D's is 102 key slots.
At an abstract level, the decoder selects a word wi from the keyboard’s lexicon and
calculates its probability of being the intended word based on the information from the
following three sources:
1) The proximities of letters in wi to the touch points s1, s2 , s3 . . . sn .
2) The prior probability of wi from a language model.
3) The possibilities of spelling errors (e.g., inserting/omitting/transposing/
substituting letters).
After pruning out the improbable candidates, the algorithm ranks a list of the most probable
words and suggests them to the user.
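To make the three information sources concrete, here is a minimal sketch (not the AOSP implementation) of a decoder that ranks candidate words by combining a Gaussian spatial likelihood with a language-model prior. The key positions, lexicon, and probabilities are invented for illustration, and spelling-error handling is omitted.

```python
import math

# Hypothetical key centres (key width = 1.0); a real keyboard stores one per key.
KEY_POS = {'t': (4.0, 0.0), 'h': (5.0, 1.0), 'e': (2.0, 0.0),
           'a': (0.0, 1.0), 'n': (5.5, 2.0), 'd': (2.0, 1.0)}

# Toy unigram language model (prior probabilities, made up for illustration).
PRIOR = {'the': 0.05, 'and': 0.04, 'ten': 0.001}

def spatial_log_likelihood(word, touches, sigma=0.5):
    """Log-probability of the touch sequence given the word's letter keys,
    assuming independent 2D Gaussian noise around each key centre."""
    score = 0.0
    for letter, (tx, ty) in zip(word, touches):
        kx, ky = KEY_POS[letter]
        d2 = (tx - kx) ** 2 + (ty - ky) ** 2
        score += -d2 / (2 * sigma ** 2)
    return score

def decode(touches, lexicon=PRIOR, top_n=3):
    """Rank candidate words by spatial likelihood times language-model prior."""
    scored = []
    for word in lexicon:
        if len(word) != len(touches):  # no spelling-error source in this toy
            continue
        total = spatial_log_likelihood(word, touches) + math.log(lexicon[word])
        scored.append((total, word))
    scored.sort(reverse=True)
    return [w for _, w in scored[:top_n]]

# Noisy touches near 't', 'h', 'e': 'the' should rank first.
print(decode([(4.2, 0.3), (4.9, 0.8), (2.1, -0.2)]))
```

A production decoder traverses a trie over the lexicon instead of scoring every word, and adds an edit-distance term for the spelling-error source.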
While many modern keyboard algorithms may work similarly at this abstract level, the
actual implementations of modern industrial keyboards may be rather complex and vary
across products. Since most of the commercial keyboard algorithms are not published, it
is impossible to develop a general model representing all of them. Fortunately, we could
conduct the present research on open source keyboards.
We developed a keyboard, referred to as PBaseline hereafter, based on the latest version
(Android 4.3_r3) of the Android Open Source Project (AOSP) Keyboard, which is open-
sourced and broadly deployed on Android mobile devices. PBaseline shares the same algorithm
as the AOSP keyboard, which can be read and analysed by any researcher or developer.
The lexicon, composed of around 170,000 words, was stored in a trie data structure. As the
algorithm received spatial signals s1, s2, s3 . . . sn, it traversed the trie and calculated the
probabilities for nodes storing words, based on the aforementioned three sources.
A critical part of the algorithm is to calculate the probabilities of word candidates by
weighting information from the three sources. As in the AOSP keyboard, the weights of
different sources are controlled by 21 parameters (Bi, Ouyang, and Zhai, 2014). These
parameters are pertinent to the performance of PBaseline. Altering them adjusts the relative
weights of information from different sources, leading to different correction and
completion capabilities. For example, reducing the cost of un-typed letters (i.e., lowering
COST_LOOKAHEAD) would bias the keyboard towards suggesting long words,
but it might be detrimental to the correction ability. Since PBaseline is implemented on top of
the AOSP keyboard, its parameters share the same values as those in the AOSP keyboard;
it serves as the baseline condition in the current research.
Correction uses the language regularities for correcting errors due to the imprecision of the finger
touch or spelling errors such as inserting/omitting/substituting/transposing letters, while
completion uses the language regularities for predicting unfinished letters based on partial
(either correct or erroneous) input to offer keystroke savings.
Our goal is to optimize the decoding algorithm for both correction and completion
abilities. Formally, these two abilities were defined as follows.
5.4.2.1 Correction
Correction is measured in word score, which reflects the percentage of correctly recognized
words out of the total test words. The word score (W) is defined as:
W = (the number of correctly recognized words / the number of words in the test) × 100%    (5.18)
5.4.2.2 Completion
Completion is measured in keystroke saving ratio (S). Given a data set with n test words, S
is defined as:
S = [1 − (Σ Smin(wi)) / (Σ Smax(wi))] × 100%, summing over all n test words    (5.19)
Smax(wi) is the maximum number of keystrokes for entering a word wi. In the worst scenario,
where the keyboard fails to complete wi, the user needs to enter all the letters of wi and
press the space bar to commit it. Therefore, Smax(wi) is the number of letters in wi plus one.
Smin (wi ) is the minimum number of keystrokes needed to enter wi . Assuming that users fully
take advantage of the completion capability, wi will be picked as soon as it is suggested on
the suggestion bar. Therefore, Smin(wi) is the least number of keystrokes needed to bring wi
onto the suggestion bar, plus one more keystroke for selection. The number of slots on the
suggestion bar may vary across keyboards. The keyboard used in this work provides three
suggestion slots, the same as the AOSP keyboard.
Note that the measure defined here is the maximum savings offered to a user by the
keyboard algorithm. Whether and how often the user takes them depends on the UI design
and on the user's preferences and biases in motor, visual, and cognitive effort trade-offs,
which are separate research topics.
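A minimal sketch of the two measures defined above, assuming per-word keystroke counts are already known; the helper names are ours, not from the chapter:

```python
def word_score(recognized, intended):
    """Word score W (cf. Equation 5.18): the percentage of test words that
    the decoder recognized correctly."""
    correct = sum(r == t for r, t in zip(recognized, intended))
    return 100.0 * correct / len(intended)

def keystroke_saving_ratio(words, keystrokes_used):
    """Keystroke saving ratio S (cf. Equation 5.19). S_max(w) is taken as
    len(w) + 1 (all letters plus the space bar); keystrokes_used[i] plays
    the role of S_min(w_i): the keystrokes needed to bring w_i onto the
    suggestion bar, plus one tap to select it."""
    s_max = sum(len(w) + 1 for w in words)
    s_min = sum(keystrokes_used)
    return 100.0 * (1.0 - s_min / s_max)

# 'keyboard' needs at most 9 keystrokes (8 letters + space); suppose it is
# suggested after 3 letters, so S_min = 3 + 1 = 4.
print(keystroke_saving_ratio(['keyboard'], [4]))  # ≈ 55.6
```

The decoder determines S_min per word in practice; here it is simply passed in to keep the metric itself isolated.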
Typical laboratory-based studies demand intensive labour and time to evaluate keyboards
and may still lack the necessary sensitivity to reliably detect important differences amid
large variations in text, tasks, habits, and other individual differences. Bi, Azenkot, Partridge,
and Zhai (2013) proposed using remulation—replaying previously recorded data from the
same group from the same experiment in real time simulation—as an automatic approach
to evaluate keyboard algorithms. We adopt a similar approach in the present research:
we evaluated a keyboard algorithm against previously recorded user data to measure its
performance. Unlike the Octopus tool (Bi, Azenkot, Partridge, and Zhai, 2013) which
evaluated keyboards on devices, we simulated the keyboard algorithm and ran tests on a
workstation. Octopus evaluates keyboards in a ‘black box’ fashion without access to their
algorithms and source code. The off-device approach in this work is limited to the keyboards
whose algorithm and source code are accessible, but is much faster than Octopus.
5.4.2.3 Datasets
We ran studies to collect data for optimization and testing. The data collection studies
were similar to those in Bi, Azenkot, Partridge, and Zhai (2013). To avoid the data collection
depending on, or being limited by, any particular keyboard algorithm, a Wizard of Oz
keyboard was used in the study by Bi and colleagues (2013), which provided users with only
asterisks as feedback while they were entering text. The collected data was segmented into
words for evaluation.
After the segmentation, the development dataset included 7,106 words, and the test dataset
had 6,323 words.
unchanged during the sub process; α changed across sub processes. Our purpose was to
ensure that the Pareto set covered a broad range of Pareto optimal solutions.
In each iteration of a sub process, the 21 parameters moved in a random direction
with a fixed step length (0.01). We then evaluated the keyboard algorithm with the new
parameters against the development dataset according to the objective function (Equation
5.21). Whether the new set of parameters and search direction were kept was determined
by the Metropolis function:

W(O → N) = exp(ΔZ/kT) if ΔZ < 0;  W(O → N) = 1 if ΔZ ≥ 0    (5.22)

W(O → N) was the probability of changing from the parameter set O (old) to the parameter
set N (new); ΔZ = Znew − Zold, where Znew and Zold were the values of the objective function
(Equation 5.21) for the new and old sets of parameters, respectively; k was a coefficient; T
was ‘temperature’, which can be interactively adjusted.
This optimization algorithm used a simulated annealing method. The search did not
always move toward a higher value of the objective function. It occasionally allowed moves
with negative objective-function changes, so that it could climb out of local maxima.
After each iteration, the new solution was compared with the solutions in the Pareto set.
If the new solution dominated at least one solution in the Pareto set, it would be added to
the set and the solutions dominated by the new solution would be discarded.
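The acceptance step and the Pareto-set update described above can be sketched as follows. This is a simplified variant (a candidate is kept whenever no existing member dominates it), and k, T, and the sample (W, S) tuples are chosen purely for illustration:

```python
import math
import random

def metropolis_accept(z_new, z_old, k=1.0, temperature=1.0):
    """Metropolis acceptance (cf. Equation 5.22): always accept when the
    objective does not decrease; otherwise accept with probability
    exp(dZ / (k * T))."""
    dz = z_new - z_old
    if dz >= 0:
        return True
    return random.random() < math.exp(dz / (k * temperature))

def dominates(a, b):
    """A (W, S) pair a dominates b if it is at least as good in both
    objectives and not identical to b."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def update_pareto(front, candidate):
    """Keep the Pareto set consistent: drop the candidate if any member
    dominates it; otherwise add it and discard the members it dominates."""
    if any(dominates(p, candidate) for p in front):
        return front
    front = [p for p in front if not dominates(candidate, p)]
    front.append(candidate)
    return front

assert metropolis_accept(2.0, 1.0)  # an improvement is always kept

front = []
for sol in [(82.0, 31.0), (85.0, 33.0), (84.0, 32.0), (81.0, 38.0)]:
    front = update_pareto(front, sol)
print(front)  # → [(85.0, 33.0), (81.0, 38.0)]
```

In the chapter's procedure the accepted parameter set also fixes the search direction for the next iteration; that bookkeeping is omitted here.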
5.4.4 Results
After 1000 iterations, the sub process restarted with another random set of parameters and
a random α for the objective function Equation 5.21. We ensured that there was at least one
sub process for each of the following three weights: α = 0, 0.5, and 1. The optimization
led to a Pareto set of 101 Pareto optimal solutions after 100 sub processes with 100,000
iterations in total, which constituted a Pareto frontier illustrated in Figure 5.16.
The Pareto frontier shows that the Pareto optimal solutions distribute in a small region,
with keystroke saving ratio ranging from 30% to 38% and word score ranging from 81%
to 88%. As shown in Figure 5.16, the Pareto frontier forms a short, convex, L-shaped
curve, indicating that correction and completion have little conflict with each other and the
algorithm can be simultaneously optimized for both with minor loss to each. Among the
101 Pareto optimal solutions, we are particularly interested in three solutions, illustrated in
darker dots in Figure 5.16:
1) The solution with the highest word score (W), denoted by Pw . It is the solution
exclusively optimized for correction.
2) The solution with the highest keystroke saving ratio (S), denoted by PS . It is the
solution exclusively optimized for completion.
3) The solution with highest 0.5W + 0.5S, denoted by PW+S . It is the solution
optimized for both correction and completion, with 50% weight for each objective.
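Given a Pareto set of (W, S) pairs, the three solutions of interest can be picked mechanically; a small sketch with made-up numbers:

```python
def pick_solutions(pareto):
    """From a list of (W, S) pairs, pick the solution with the highest word
    score (Pw), the highest keystroke saving ratio (PS), and the highest
    equally weighted blend 0.5W + 0.5S (PW+S)."""
    p_w = max(pareto, key=lambda p: p[0])
    p_s = max(pareto, key=lambda p: p[1])
    p_ws = max(pareto, key=lambda p: 0.5 * p[0] + 0.5 * p[1])
    return p_w, p_s, p_ws

# Illustrative Pareto set: (word score %, keystroke saving ratio %).
pareto = [(88.0, 30.0), (81.0, 38.0), (86.0, 36.0)]
print(pick_solutions(pareto))  # → ((88.0, 30.0), (81.0, 38.0), (86.0, 36.0))
```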
[Figure: Pareto frontier plot. Horizontal axis: word score (%), 80–90; vertical axis: keystroke saving ratio (%), 30–40; PW and PS are marked at the two ends of the frontier.]
Figure 5.16 The Pareto frontier of the multi-objective optimization. The three darker dots are
the solutions with the highest word score (Pw), the highest keystroke saving ratio (PS), and the
highest 0.5W+0.5S (PW+S), respectively.
Pw and PS reveal the highest correction and completion capabilities a keyboard can reach,
while PW+S is the most balanced solution with equal weights for correction and completion.
The optimization results showed that the default parameter set taken from the AOSP keyboard
in Android 4.3 (16 September 2013) was sub-optimal in both correction and completion,
according to the criteria and the development dataset used in this study. Pw, PS, and
PW+S all improve the word score and keystroke saving ratio by at least 10% over PBaseline.
It is more illuminating to compare the three Pareto optimal solutions Pw , PS , and PW+S ,
since they are all optimized under identical conditions with the same dataset. Figure 5.16
shows that PW+S is close to PS in keystroke saving ratio, and close to Pw in word score.
It indicates that simultaneously optimizing for both objectives causes only minor performance
degradation for each objective. These results were later verified with the separate test dataset.
Parameters moved in various directions after the optimization. For such a complex
optimization problem with twenty-one free parameters, it was difficult to precisely explain
why each parameter moved in such a direction after the optimization. However, we observed
some distinct patterns of parameter value changes, which partially explain the performance
differences across keyboards.
For example, the parameters pertinent to the cost of proximity of touch points to
letters (i.e., PROXIMITY_COST, FIRST_PROXIMITY_COST) decreased substantially
from PBaseline to Pw, PS, and PW+S, indicating that the optimized algorithms tend to be
more tolerant to errors in spatial signals. The cost of untyped letters in a word (i.e.,
COST_FIRST_LOOKAHEAD) also decreased, indicating that the optimized algorithms
were more likely to predict untyped letters as the user’s intention, especially after typing the
first letter, to save keystrokes.
....................................................................................................
references
Bi, X., Azenkot, S., Partridge, K., and Zhai, S., 2013. Octopus: evaluating touchscreen keyboard
correction and recognition algorithms via remulation. In: Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems. New York, NY: ACM, pp. 543–52.
Bi, X., Smith, B. A., and Zhai, S., 2010. Quasi-Qwerty soft keyboard optimization. In: Proceedings of
the 2010 CHI Conference on Human Factors in Computing Systems. New York, NY: ACM Press,
pp. 283–6.
Bi, X., Smith, B. A., and Zhai, S., 2012. Multilingual Touchscreen Keyboard Design and Optimization.
Human-Computer Interaction, 27(4), pp. 352–82.
Bi, X., Ouyang, T., and Zhai, S., 2014. Both complete and correct? Multiobjective optimization of a
touchscreen keyboard. In: Proceedings of the ACM CHI Conference on Human Factors in Computing
Systems. Toronto, Canada, 26 April–1 May 2014. New York, NY: ACM Press, pp. 2297–306.
Bi, X., and Zhai, S., 2016. IJQwerty: What Difference Does One Key Change Make? Gesture Typing
Keyboard Optimization Bounded by One Key Position Change from Qwerty. In: Proceedings of the
2016 CHI Conference on Human Factors in Computing Systems. New York, NY: ACM, pp. 49–58.
Cao, X., and Zhai, S., 2007. Modeling human performance of pen stroke gestures. In: Proceedings of
the ACM CHI Conference on Human Factors in Computing Systems. San Jose, California, 28 April–3
May 2007. New York, NY: ACM, pp. 1495–504.
Dunlop, M. D., and Levine, J., 2012. Multidimensional Pareto optimization of touchscreen keyboards
for speed, familiarity, and improved spell checking. In: Proceedings of the ACM SIGCHI Conference
on Human Factors in Computing Systems. Austin, Texas, 5–10 May 2012. New York, NY: ACM,
pp. 2669–78.
Fitts, P. M., 1954. The information capacity of the human motor system in controlling the amplitude
of movement. Journal of Experimental Psychology, 47, pp. 381–91.
Getschow, C. O., Rosen, M. J., and Goodenough-Trepagnier, C., 1986. A systematic approach to
design a minimum distance alphabetical keyboard. In: Proceedings of the 9th Annual Conference on
Rehabilitation Technology. Minneapolis, Minnesota, 23–26 June 1986. Washington, DC: AART,
pp. 396–8.
Hastings, W. K., 1970. Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57(1), pp. 97–109.
Hughes, D., Warren, J., and Buyukkokten, O., 2002. Empirical bi-action tables: a tool for the evalu-
ation and optimization of text-input systems, applications I; stylus keyboard. Human-Computer
Interaction, 17(2, 3), pp. 271–309.
Jokinen, P. P., Sarcar, S., Oulasvirta A., Silpasuwanchai, C., Wang, Z., and Ren, X., 2017. Modelling
Learning of New Keyboard Layouts. In: Proceedings of the 2017 CHI Conference on Human Factors
in Computing Systems. New York, NY: ACM, pp. 4203–15.
Kristensson, P.-O., and Zhai, S., 2004. SHARK2: a large vocabulary shorthand writing system for pen-
based computers. In: Proceedings of the 17th Annual ACM Symposium on User Interface Software and
Technology. New York, NY: ACM, pp. 43–52.
Kristensson, P. O., and Zhai, S., 2005. Relaxing stylus typing precision by geometric pattern matching.
In: Proceedings of ACM International Conference on Intelligent User Interfaces. New York, NY: ACM,
pp. 151–8.
Lewis, J. R., 1992. Typing-key layouts for single-finger or stylus input: initial user preference and
performance (Technical Report No. 54729). Boca Raton, FL: International Business Machines
Corporation.
Lewis, J. R., Kennedy, P. J., and LaLomia, M. J., 1999. Development of a Digram-Based Typing Key
Layout for Single-Finger/Stylus Input. In: Proceedings of The Human Factors and Ergonomics Society
43rd Annual Meeting, 43(20), pp. 415–19.
Lewis, J. R., Potosnak, K. M., and Magyar, R. L., 1997. Keys and Keyboards. In: M. G. Helander, T.
K. Landauer & P. V. Prabhu, eds. Handbook of Human-Computer Interaction (2nd ed.). Amsterdam:
Elsevier Science, pp. 1285–315.
MacKenzie, I. S., and Zhang, S. X., 1999. The design and evaluation of a high-performance soft
keyboard. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI
‘99), New York, NY: ACM, pp. 25–31.
McMillan, G., 2011. Study: Fewer than 50% of Smartphone Users Make Calls. Time.com. Available
at: techland.time.com/2011/07/21/study-fewer-than-50-of-smartphone-users-make-calls/
The Nielsen Company, 2010. U.S. Teen Mobile Report Calling Yesterday, Texting Today, Using Apps
Tomorrow. Available at: nielsen.com/us/en/newswire/2010/u-s-teen-mobile-report-calling-
yesterday-texting-today-using-apps-tomorrow.html
Rick, J., 2010. Performance optimizations of virtual keyboards for stroke-based text entry on a touch-
based tabletop. In: Proceedings of the ACM Symposium on User Interface Software and Technology.
New York, NY: ACM Press, pp. 77–86.
Soukoreff, W., and MacKenzie, I. S., 1995. Theoretical upper and lower bounds on typing speeds using
a stylus and keyboard. Behaviour & Information Technology, 14, pp. 370–79.
Yamada, H., 1980. A historical study of typewriters and typing methods: from the position of planning
Japanese parallels. Information Processing, 2(4), pp. 175–202.
Zhai, S., 2004. Characterizing computer input with Fitts’ law parameters—The information and non-
information aspects of pointing. International Journal of Human-Computer Studies: Fitts’ Law 50
Years Later: Applications and Contributions from Human-Computer Interaction, 61(6), pp. 791–809.
Zhai, S., Hunter, M., and Smith, B.A., 2000. The metropolis keyboard—an exploration of quantitative
techniques for virtual keyboard design. In: Proceedings of the 13th Annual ACM Symposium on User
Interface Software and Technology. New York, NY: ACM Press, pp. 119–28.
Zhai, S., Hunter, M., and Smith, B. A., 2002. Performance optimization of virtual keyboards. Human-
Computer Interaction, 17(2, 3), pp. 89–129.
Zhai, S., and Kristensson, P. O., 2003. Shorthand writing on stylus keyboard. In: Proceedings of
the SIGCHI Conference on Human Factors in Computing Systems. New York, NY: ACM Press,
pp. 97–104.
Zhai, S., and Kristensson, P. O., 2006. Introduction to Shape Writing: IBM Research Report RJ10393,
and Chapter 7 of I. S. MacKenzie and K. Tanaka-Ishii, eds. Text Entry Systems: Mobility, Accessibility,
Universality. San Francisco: Morgan Kaufmann Publishers, pp. 139–58.
Zhai, S., Sue, A., and Accot, J., 2002. Movement Model, Hits Distribution and Learning in Virtual
Keyboarding. In: Proceedings of the ACM Conference on Human Factors in Computing Systems.
New York, NY: ACM Press, pp. 17–24.
6
6.1 Introduction
Computational design is the emerging form of design activities that are enabled by compu-
tational techniques. It formulates design activities as mathematical optimization problems;
it formulates design criteria as either objective functions or constraints, design space as the
search space (or the choice set), and design exploration, which can be performed by either
systems, users, or their combination, as the process of searching for solutions. This view-
point provides an opportunity to devise new ways of utilizing computational techniques
(i.e., mathematical tools developed in computer science that leverage machine processing
power). The main goal of computational design research is to enable efficient design
workflows or sophisticated design outcomes that are impossible with traditional approaches
relying purely on the human brain.
The quality of designed objects can be assessed using various criteria according to their
usage contexts. Some criteria might work as conditions that should be at least satisfied
(i.e., constraints); other criteria might work as values that should be maximized (i.e.,
objectives). These design criteria can be classified into two groups: functional criteria and
aesthetic criteria. Functional criteria are the criteria about how well the designed object
functions in the expected contexts. For example, a chair is expected to be ‘durable’ when
someone is sitting on it; in this case, durability can be a functional criterion that should be
satisfied. In contrast, aesthetic criteria are about how perceptually preferable (or pleasing)
the designed object looks. A chair might look ‘more beautiful,’ for example, if its shape is
smooth and the width and height follow the golden ratio rule; in this case, beauty in shape
performs the role of an aesthetic criterion that is desired to be maximized. Note that, in
practical design scenarios, these two criteria are sometimes simultaneously considered by
designers.
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
Figure 6.1 An example of design parameter tweaking, in which aesthetic preference is used as a
criterion. Photo colour enhancement is one such design scenario, in which designers tweak
sliders such as ‘brightness’ until they find the parameter set that provides the most preferable
photo enhancement.
design methods. Finally, we conclude this chapter with additional discussion on the remain-
ing challenges and future directions.
1 Interested readers are referred to Oulasvirta and Karrenbauer, this volume, for a detailed discussion on design
criteria for UI design.
Hertzmann, 2011), 3D viewpoint preference (Secord, Lu, Finkelstein, Singh, and Nealen,
2011), and photo colour enhancement (Bychkovsky, Paris, Chan, and Durand, 2011).
Talton, Gibson, Yang, Hanrahan, and Koltun (2009) presented a method for supporting
preference-driven parametric design by involving many people. Their method constructs
a so-called collaborative design space, which is a subset of the target design parameter space
consisting of aesthetically acceptable designs, based on the design history of many voluntary
users. Then, the collaborative design space supports new users’ design exploration. Their
method takes roughly one year to obtain the necessary design history and needs many
volunteers to engage with exactly the same design space. In contrast, more recent crowdsourcing
platforms have enabled on-demand generation of necessary data, which has opened new
opportunities for computational aesthetic design.
. . . a paradigm for utilizing human processing power to solve problems that computers
cannot yet solve.
For example, human processors are much better at perceiving the semantic meanings of
visual contents than machine processors; thus, for building a system that requires perceptive
abilities, it is effective to incorporate human processors as well as machine processors.
Such problems that are difficult for machine processors but easy for human processors,
including visual design driven by preference, are observed in many situations. However,
human processors also have critical limitations: they are extremely slow and
expensive to execute compared to machine processors. Therefore, it is important to carefully
choose how to employ such human processors.
How to Employ Human Processors. A possible solution for employing many human
processors is to implicitly embed human computation tasks in already existing tasks.
reCAPTCHA (von Ahn, Maurer, McMillen, Abraham, and Blum, 2008) takes such an
approach; it embeds optical character recognition (OCR) tasks into web security measures.
Unfortunately, this approach is not easy for everybody to take. Another solution is to motivate
many ordinary people to voluntarily participate in human computation tasks through
‘gamification’, so that people perform tasks purely for entertainment purposes (von Ahn
and Dabbish, 2008). For example, the ESP Game (von Ahn and Dabbish, 2004) is a game in
which players provide semantic labels for images without being aware of it. Recently, since
the emergence of large-scale crowdsourcing markets, it has become increasingly popular to
employ human processors using crowdsourcing. As we take this approach, we detail it in
the following subsections.
6.2.2.2 Crowdsourcing
The term crowdsourcing was first introduced by Howe (2006b) and later more explicitly
defined by Howe (2006a) as follows.
Today, many online marketplaces for crowdsourcing are available for researchers, such
as Upwork,2 Amazon Mechanical Turk,3 and CrowdFlower.4 Since 2006, crowdsourcing
has become an increasingly popular research topic in computer science. The forms of
crowdsourcing can be roughly divided into the following categories.
Microtask-Based Crowdsourcing is a form of crowdsourcing where many workers
are employed to perform a microtask: a task that is very small (usually completed
in a minute) and does not require any special skills or domain knowledge to be
performed. One of the most attractive features of this form is that anyone can
stably employ a large number of crowd workers even for small tasks on demand
with minimum communication cost. Recent crowdsourcing marketplaces, includ-
ing Amazon Mechanical Turk, have enabled this new form. Although crowd workers
in these platforms are usually non-experts, they do have full human intelligence,
which enables many emerging applications.
One of the popular usages of microtask-based crowdsourcing is to outsource
data-annotation tasks (e.g., Bell, Upchurch, Snavely, and Bala, 2013) for machine
learning purposes. Another popular usage is to conduct large-scale perceptual user
studies (e.g., Kittur, Chi, and Suh, 2008). Microtask-based crowdsourcing also
enables crowd-powered systems, which are systems that query crowd workers to use
their human intelligence in run time. For example, Soylent (Bernstein, Little, Miller,
Hartmann, Ackerman, Karger, Crowell, and Panovich, 2010) is a crowd-powered
word processing system that utilizes human intelligence to edit text documents.
Expert Sourcing is a form of crowdsourcing in which skilled experts (e.g., web develop-
ers, designers, and writers) are employed for professional tasks. Some online mar-
ketplaces (e.g., Upwork) provide an opportunity for researchers to reach experts.
However, asking professional designers involves significant communication costs and
2 https://www.upwork.com/
3 https://www.mturk.com/
4 https://www.crowdflower.com/
large variances between individuals’ skills. An expert with the required skill is
difficult to find and not always available. Compared to microtask-based crowdsourc-
ing, expert sourcing may be less suitable for employing human processors and for
designing crowd-powered systems that work stably and on demand.
Volunteer-Based Crowdsourcing is a form of crowdsourcing in which unpaid crowds
voluntarily participate in the microtask execution (Huber, Reinecke, and Gajos,
2017; Morishima, Amer-Yahia, and Roy, 2014). One challenge in this form is to
motivate crowd workers; crowds do not tend to perform tasks unless the
task execution provides certain value other than monetary rewards (e.g., public
recognition for the efforts).
Figure 6.2 Example scenarios of parameter tweaking for visual design, including photo colour
enhancement, image effects for 3D graphics, and 2D graphic designs such as web pages and
presentation slides.
one; the design space here is parametrized by several degrees of freedom mapped to sliders,
and thus it is considered a parametric design.
Parametric designs can be found almost everywhere in visual design production.
Figure 6.2 illustrates a few examples in which the visual content is tweaked so that it becomes
aesthetically the best. For example, in Unity5 (a computer game authoring tool) and Maya6
(a three-dimensional computer animation authoring tool), the control panels include many
sliders, which can be manipulated to adjust the visual nature of the contents.
Finding the best parameter combination is not an easy task. It might be found with a
few mouse drags if the designer were familiar with how each parameter affects the visual
content and very good at predicting the effects without actually manipulating the sliders.
However, this is unrealistic in most cases; several sliders mutually affect the resulting
visuals in complex ways, and each slider also has a different effect when the contents are
different, which makes the prediction difficult. Thus, in practice, it is inevitable that a designer
explores the design space—the set of all the possible design alternatives—in a trial-and-error
explores the design space—the set of all the possible design alternatives—in a trial-and-error
manner, to find the parameter set that he or she believes is best for the target content. This
requires the designer to manipulate sliders many times, as well as to construct a mental
model of the design space. Furthermore, as the number of design parameters increases, the
design space expands exponentially, which makes this exploration very tedious.
Computer graphics researchers have investigated many methods for defining reasonable
parametric design spaces. For 3D modelling, the human face (Blanz and Vetter, 1999) and
body (Allen, Curless, and Popović, 2003) are parametrized using data-driven approaches.
The facial expression of characters is often parametrized using blendshape techniques
(Lewis, Anjyo, Rhee, Zhang, Pighin, and Deng, 2014). For material appearance design,
Matusik, Pfister, Brand, and McMillan (2003) proposed a parametric space based on
measured data, and Nielsen, Jensen, and Ramamoorthi (2015) applied dimensionality
reduction to the space. Procedural modelling of 3D shapes (e.g., botany, see Weber and
5 https://unity3d.com/
6 https://www.autodesk.com/products/maya/overview
Penn, 1995) is also considered parametric design in that the shapes are determined by a set
of tweakable parameters. One of the recent trends is to define parametric spaces based on
semantic attributes for facilitating intuitive exploration; this direction has been investigated
for shape deformation (Yumer, Chaudhuri, Hodgins, and Kara, 2015), cloth simulation
(Sigal, Mahler, Diaz, McIntosh, Carter, Richards, and Hodgins, 2015), and human body
shape (Streuber, Quiros-Ramirez, Hill, Hahn, Zuffi, O’Toole, and Black, 2016), for example.
[Figure: the aesthetic preference (goodness) plotted over the design space X as a function of the design variables x.]
Figure 6.3 Problem formulation. We discuss computational design methods to solve the opti-
mization problem described in Equation 6.3 or to find the optimal solution x∗ that maximizes the
aesthetic preference of the target design.
ridges, or locally flat regions around maxima. Also, we assume that the goodness function
is constant with respect to time. The design space is expected to be parameterized by a
reasonable number of parameters, as in most commercial software packages; the parametrization
itself will not be discussed in this chapter. We handle the design domains in which even
novices can assess relative goodness of designs (for example, given two designs, they are
expected to be able to answer which design looks better); but importantly, they do not need
to know how a design can be improved. Though we narrow down the target problem as
discussed earlier, it still covers a wide range of practical design scenarios including photo
enhancement, material appearance design for computer graphics, and two-dimensional
graphic design (e.g., posters).
Estimation of the Objective Function. The first approach is to estimate the shape
of the goodness function g(·) by using computational tools. In other words, it
is to compute the regression of g(·). Once g(·) is estimated, it can be used for
‘guided’ exploration: supporting users’ free exploration of the design space X to find
their favourite parameter set x∗ through some user interface. One of the
advantages of this approach is that even if the estimation quality is not perfect, it can
still be effective in helping users find x∗. To implement this approach, there
are two important challenges: how to compute this regression problem that deals
with human preference and how to support the users’ manual exploration using the
estimated g(·). We discuss this approach further in Section 6.4.
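One common way to regress such a goodness function from crowd data is to fit a latent score per design from pairwise votes. The chapter does not prescribe this particular estimator, so the following Bradley-Terry-style sketch (with made-up vote data) is only illustrative:

```python
import math

def fit_scores(items, comparisons, lr=0.1, iters=2000):
    """Fit one latent goodness score per design from pairwise votes,
    assuming a Bradley-Terry model: P(a preferred over b) = sigmoid(g_a - g_b).
    Plain gradient ascent on the log-likelihood; a sketch, not a production
    estimator."""
    g = {x: 0.0 for x in items}
    for _ in range(iters):
        for winner, loser in comparisons:
            p = 1.0 / (1.0 + math.exp(-(g[winner] - g[loser])))
            g[winner] += lr * (1.0 - p)
            g[loser] -= lr * (1.0 - p)
    return g

# Made-up crowd votes: design 'a' mostly beats 'b', which mostly beats 'c'.
votes = [('a', 'b'), ('a', 'b'), ('b', 'a'), ('b', 'c'), ('b', 'c'), ('a', 'c')]
scores = fit_scores(['a', 'b', 'c'], votes)
print(sorted(scores, key=scores.get, reverse=True))  # → ['a', 'b', 'c']
```

The fitted scores give a coarse estimate of g(·) over the discrete samples; interpolating them over the continuous design space is the separate regression problem discussed in Section 6.4.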
For both approaches, human-generated preference data are necessary for enabling computation. In this chapter, we focus on the use of human computation to generate the necessary preference data. Such human processors can be employed via crowdsourcing. With this approach, systems can obtain crowd-generated data on demand in the manner of function calls. Here we make an additional assumption: a common ‘general’ preference exists that is shared among crowd workers; though there may be small individual variations, we can observe such a general preference by involving many crowd workers.
Figure 6.4 Microtask design. (Left) Pairwise comparison microtask: ‘Choose the image that looks better.’ (Right) Single-slider manipulation microtask: ‘Adjust the slider so that the image looks the best.’
possible variant of this task is to provide relative scores by an n-pt Likert scale
(the standard pairwise comparison is the special case with n = 2). A/B testing
also belongs to this category. This pairwise comparison microtask is popular in
integrating crowds’ perceptions into systems (e.g., O’Donovan, Lı̄beks, Agarwala,
and Hertzmann, 2014; Gingold, Shamir, and Cohen-Or, 2012). We discuss this
approach further in Section 6.4.
Query about Continuous Space. An alternative approach is to ask crowd workers
to explore a continuous space and to identify the best sample in the space. This
requires more work by crowds, but the system can obtain much richer information
than an evaluation of discrete samples. Asking crowds to directly control all the
raw parameters (i.e., explore the original search space X ) is an extreme case of this
approach. Mathematically speaking, given a high-dimensional parametric space X, crowds directly provide the solution x∗ = arg max_{x ∈ X} g(x) of this maximization problem. In the following sections, we introduce two such crowd-powered systems from our previous studies as illustrative examples.
Suggestive Interface. The system has a suggestive interface called Smart Suggestion
(Figure 6.5). It generates nine parameter sets that have relatively high goodness val-
ues and displays the corresponding designs as suggestions. This interface facilitates
design exploration by giving users a good starting point to find a better parameter
set for the visual design. This implementation takes a simple approach to generate
quality suggestions: the system generates 2,000 parameter sets randomly and then
selects the nine best parameter sets according to their goodness values. This simple
algorithm interactively provides suggestions of adequate quality, which enables users to regenerate suggestions quickly enough for interactive use if none of the suggestions satisfy them.

Figure 6.5 A suggestive interface enabled by the estimated goodness function. The user can obtain appropriate parameter sets as suggestions, which are generated considering the goodness of designs.

Figure 6.6 A slider interface enabled by the estimated goodness function. The user can adjust each parameter effectively by the visualization (Vis) near the slider and the optimization (Opt), which gently guides the current parameters toward the optimal direction.
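As a sketch, the random-sampling strategy above can be written in a few lines. Here, `estimated_goodness` is a hypothetical stand-in for the learned goodness function, and the unit-hypercube sampling range is our assumption, not the system's actual implementation.

```python
import numpy as np

def generate_suggestions(estimated_goodness, dim, n_samples=2000, n_best=9, seed=None):
    """Smart Suggestion strategy: sample parameter sets at random and
    keep the ones with the highest estimated goodness values."""
    rng = np.random.default_rng(seed)
    candidates = rng.random((n_samples, dim))          # uniform in [0, 1]^dim
    scores = np.array([estimated_goodness(x) for x in candidates])
    best = np.argsort(scores)[::-1][:n_best]           # top n_best, best first
    return candidates[best]

# Toy stand-in for the estimated goodness function: prefers parameters near 0.5.
g_hat = lambda x: -np.sum((x - 0.5) ** 2)
suggestions = generate_suggestions(g_hat, dim=6, seed=0)
print(suggestions.shape)  # (9, 6)
```

Because the candidates are scored independently, the sampling budget (2,000 in the chapter's implementation) can be tuned freely to trade quality against latency.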
Slider Interface. The system has a slider interface called VisOpt Slider (Figure 6.6). It displays coloured bars that visualize the results of a crowd-powered analysis. The distribution of goodness values is directly visualized on each slider using colour mapping, which guides the user in exploring the design space. Note that when the user modifies a certain parameter, the visualizations of the other parameters change dynamically. This helps the user not only to find better parameter sets quickly but also to explore the design space effectively without unnecessarily visiting ‘bad’ designs. When the optimization is turned on, the parameters are automatically and interactively optimized while the user is dragging a slider. That is, when a user starts to drag a slider, the other sliders’ ticks simultaneously move in a better direction in response to the user’s manipulation.
These two interfaces are complementary; for example, a user can first obtain a reasonable starting point from the suggestive interface and then interactively fine-tune it using the slider interface.
Figure 6.7 Overview of the crowd-powered algorithm for estimating the goodness function.
Gathering Pairwise Comparisons. The next step is to gather information on the good-
ness of each sampling point. We take the pairwise comparison approach; crowd workers
are shown a pair of designs and asked to compare them. As a result, relative scores (instead
of absolute ones) are obtained. For the microtask instructions given to crowd workers, we prepare a template:
Which of the two images of [noun] is more [adjective]? For example, [clause]. Please choose the
most appropriate one from the 5 options below.
In accordance with the purpose and the content, the user gives a noun, an adjective such as
‘good’ or ‘natural,’ and a clause that explains a concrete scenario to instruct crowd workers
more effectively. After this instruction, two images and five options appear. These options
are linked to the five-pt Likert scale; for example, ‘the left image is definitely more [adjective]
than the right image’ is for option 1, and the complete opposite is option 5. Option 3 is ‘these
two images are equally [adjective], or are equally not [adjective]’.
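For concreteness, the five options can be coded as signed relative scores; the particular coding below is our assumption for illustration, not necessarily the one used in the original system.

```python
# Map Likert option k (1-5) to a signed relative score for the left image:
# option 1 ('left definitely more [adjective]') -> +2,
# option 3 (tie) -> 0, option 5 ('right definitely more') -> -2.
def relative_score(option: int) -> int:
    if option not in (1, 2, 3, 4, 5):
        raise ValueError("option must be 1..5")
    return 3 - option

print([relative_score(k) for k in (1, 2, 3, 4, 5)])  # [2, 1, 0, -1, -2]
```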
Estimating Goodness Values of Sampling Points. Given the relative scores, the next goal is to obtain the absolute goodness values y = [y_1 · · · y_M]^T at the sampling points x_1, …, x_M. Note that the data from crowds contain inconsistencies; thus, a solution that satisfies all the relative orders does not generally exist. We instead seek a solution that is as reasonable as possible. In this work, the problem is formulated and solved as an optimization:
min_y  E_relative(y) + ω E_continuous(y),    (6.9)
where E_relative(·) is a cost function that reflects the relative-score data provided by crowds, E_continuous(·) is a cost function that regularizes the resultant values so that they are distributed smoothly and continuously, and ω > 0 is a hyperparameter that balances these two objectives.
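A minimal least-squares reading of Equation 6.9 can be sketched as follows. Here E_relative penalizes deviation from the crowd-provided relative scores, and the continuity term is simplified to penalizing differences between neighbouring sampling points; both simplifications are our assumptions, not the chapter's exact cost functions.

```python
import numpy as np

def estimate_goodness_values(n_points, comparisons, neighbours, omega=0.1):
    """Least-squares version of Equation 6.9.

    comparisons: (i, j, d) triples meaning 'y[i] - y[j] should be about d'
                 (the E_relative term, built from crowd responses).
    neighbours:  (i, j) pairs whose goodness values should be similar
                 (a simplified stand-in for the E_continuous term).
    """
    rows, rhs = [], []
    for i, j, d in comparisons:                 # E_relative
        r = np.zeros(n_points); r[i], r[j] = 1.0, -1.0
        rows.append(r); rhs.append(d)
    w = np.sqrt(omega)
    for i, j in neighbours:                     # omega * E_continuous
        r = np.zeros(n_points); r[i], r[j] = w, -w
        rows.append(r); rhs.append(0.0)
    r = np.zeros(n_points); r[0] = 1.0          # pin the global offset
    rows.append(r); rhs.append(0.0)
    y, *_ = np.linalg.lstsq(np.vstack(rows), np.array(rhs), rcond=None)
    return y

y = estimate_goodness_values(
    3, comparisons=[(1, 0, 1.0), (2, 1, 1.0)], neighbours=[(0, 1), (1, 2)])
print(y[0] < y[1] < y[2])  # True
```

The offset-pinning row is needed because pairwise comparisons only constrain differences between goodness values, not their absolute level.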
Fitting a Goodness Function. Now, the goal is to obtain a continuous goodness
function from the goodness values at the discrete sampling points obtained in the previous
step. For this purpose, this work adopted the radial basis function (RBF) interpolation
technique (Bishop, 1995), which can be used to smoothly interpolate the values at scattered
data points.
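A minimal Gaussian-RBF interpolant in this spirit is sketched below; the kernel choice and width are generic assumptions, not the chapter's exact implementation.

```python
import numpy as np

def fit_rbf(points, values, eps=1.0):
    """Fit a Gaussian RBF interpolant to goodness values at scattered
    sampling points, yielding a continuous goodness function."""
    points = np.asarray(points, dtype=float)
    # Pairwise squared distances between sampling points.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    phi = np.exp(-eps * d2)                 # interpolation matrix
    weights = np.linalg.solve(phi, np.asarray(values, dtype=float))

    def g(x):
        r2 = ((points - np.asarray(x, dtype=float)) ** 2).sum(-1)
        return np.exp(-eps * r2) @ weights
    return g

pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
vals = [0.0, 1.0, 0.5]
g = fit_rbf(pts, vals)
print(round(g([1.0, 0.0]), 6))  # 1.0 (the interpolant passes through the data)
```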
6.4.3 Applications
This method was applied to four applications from different design domains. In this experiment, each microtask contained ten unique pairwise comparisons, and 0.02 USD was paid for each microtask. For the photo colour enhancement (6D), material appearance design (8D), and camera and light control (8D) applications, we deployed 200 tasks. For the facial expression modelling (53D) application, we deployed 600 tasks.
Figure 6.8 Designs and visualizations of goodness distributions in the photo colour enhancement
application.
Surface Shaders7 , which has eight parameters. We applied this shader to a teapot model.
We asked crowd workers to choose the one that was the most realistic as a stainless steel
teapot. Figure 6.9 (Left) shows typical parameter sets with their visualizations. From these
visualizations, we can learn, without any trial and error, that the ‘Reflection’ parameter (the fifth parameter in Figure 6.9 (Left)) plays the most important role in this application.
Figure 6.9 Designs and visualizations of goodness distributions in the shader application (Left)
and the camera and light application (Right).
7 https://www.assetstore.unity3d.com/#/content/729
Figure 6.10 Designs and estimated goodness values in the facial expression modelling application.
The values shown below the pictures are the estimated goodness values.
Figure 6.11 Concept of the method. This work envisions that the design software equips a
‘Crowdsource’ button for running the crowd-powered search for the slider values that provide the
perceptually ‘best’ design. To enable this, the system decomposes the n-dimensional optimization
problem into a sequence of one-dimensional line search queries that can be solved by crowdsourced
human processors.
For each iteration, we want to arrange the next slider space so that it is as ‘meaningful’ as possible for finding x∗. We propose constructing the slider space S such that one end is at the current-best position x+ and the other is at the position of best expected improvement, xEI. Suppose that we have observed t responses so far and are about to query the oracle again. The slider space for the next iteration (i.e., St+1) is constructed by connecting
x+_t = arg max_{x ∈ {x_i}_{i=1}^{N_t}} μ_t(x)   and   x^EI_t = arg max_{x ∈ X} a^EI_t(x),    (6.10)

where {x_i}_{i=1}^{N_t} is the set of observed data points, and μ_t(·) and a^EI_t(·) are the predicted mean function and the acquisition function calculated from the current data. μ_t(·) and a^EI_t(·) can be calculated based on an extension of Bayesian optimization techniques; see the original publication (Koyama, Sato, Sakamoto, and Igarashi, 2017) for details.
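Below is a simplified sketch of how the two endpoints in Equation 6.10 might be computed with a Gaussian-process surrogate. The RBF kernel, its hyperparameters, and the standard expected-improvement acquisition are generic textbook choices, not the exact formulation of Koyama et al. (2017), which extends Bayesian optimization to line-search oracles.

```python
import numpy as np
from math import erf, sqrt, pi

def rbf_kernel(A, B, length=0.3):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, Xq, noise=1e-4):
    """Posterior mean/std of a zero-mean GP with an RBF kernel."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(Xq, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ij->i', Ks, np.linalg.solve(K, Ks.T).T)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2.0 * pi)
    return (mu - best) * cdf + sigma * pdf

def next_slider_space(X_obs, y_obs, candidates):
    """Return the endpoints (x_plus, x_ei) of the next slider space S."""
    mu_obs, _ = gp_posterior(X_obs, y_obs, X_obs)
    x_plus = X_obs[int(np.argmax(mu_obs))]        # current best (Eq. 6.10)
    mu, sigma = gp_posterior(X_obs, y_obs, candidates)
    ei = expected_improvement(mu, sigma, mu_obs.max())
    x_ei = candidates[int(np.argmax(ei))]         # best expected improvement
    return x_plus, x_ei

rng = np.random.default_rng(0)
X_obs = rng.random((8, 2))
y_obs = -((X_obs - 0.5) ** 2).sum(-1)             # toy goodness: best near the centre
cands = rng.random((200, 2))
x_plus, x_ei = next_slider_space(X_obs, y_obs, cands)
print(x_plus.shape, x_ei.shape)  # (2,) (2,)
```

Connecting `x_plus` and `x_ei` yields the one-dimensional slider space that the next crowd microtask explores.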
Example Optimization Sequence. Figure 6.12 illustrates an example optimization sequence in which the framework is applied to a two-dimensional test function and the oracle responses are synthesized by a machine processor. The process begins with a random slider space. After several iterations, it reaches a good solution.
Figure 6.12 An example sequence of the Bayesian optimization based on line search oracle,
applied to a two-dimensional test function. The iteration proceeds from left to right. From top to
bottom, each row visualizes the black box function g(·) along with the next slider space S and the
chosen parameter set xchosen , the predicted mean function μ(·), and the acquisition function a(·),
respectively. The red dots denote the best parameter sets x+ among the observed data points at
each step.
6.5.3 Applications
We tested our framework in two typical parameter tweaking scenarios: photo colour
enhancement and material appearance design. All the results shown in this section were
generated with fifteen iterations. For each iteration, our system deployed seven microtasks,
and it proceeded to the next iteration once it had obtained at least five responses. We paid
0.05 USD for each microtask execution, so the total payment to the crowds was 5.25 USD for
each result. Typically, we obtained a result in a few hours (e.g., the examples in Figure 6.13
took about sixty-eight minutes on average).
Figure 6.13 Comparison of photo colour enhancement between our crowd-powered optimization
and auto-enhancement in commercial software packages (Photoshop and Lightroom). The num-
ber on each photograph indicates the number of participants who preferred the photograph to the
other three in the study.
Figure 6.14 Comparison of three optimization trials with different initial conditions in photo
colour enhancement. (Left) Transitions of the enhanced images. (Right) Transitions of the differ-
ences between each trial, measured by the perceptual colour metric.
compared the results of our enhancement (with fifteen iterations) with Adobe Photoshop
CC8 and Adobe Photoshop Lightroom CC9 . Figure 6.13 shows the results. To quantify
the degree of success of each enhancement, we conducted a crowdsourced study in which
we asked crowd workers to identify which image looks best among the three enhancement
results and the original image. The numbers in Figure 6.13 represent the results. The photos
enhanced by our crowd-powered optimization were preferred over the others in these cases.
These results indicate that our method can successfully produce a ‘people’s choice.’ This
represents one of the advantages of the crowd-powered optimization.
Next, we repeated the same optimization procedure three times with different initial con-
ditions (Trial A, B, and C). Figure 6.14 (Left) shows the sequences of enhanced photographs
over the iterations. We measured the differences between the trials by using a perceptual
colour metric based on CIEDE2000; we measured the perceptual distance for each pixel in
the enhanced photographs and calculated the mean over all the pixels. Figure 6.14 (Right)
shows the results. The distances decrease rapidly in the first four or five iterations, and the trials approach similar enhancements even though the initial conditions are quite different.
8 http://www.adobe.com/products/photoshop.html
9 http://www.adobe.com/products/photoshop-lightroom.html
Figure 6.15 Results of the crowd-powered material appearance design with reference photographs.
In each pair, the top image shows the reference photograph and the bottom image shows the
resulting appearance. Some photographs were provided by Flickr users: Russell Trow, lastcun,
and Gwen.
image with a slider, side by side, and asked to adjust the slider until their appearances were as similar as possible. Figure 6.15 shows the results for both monotone and full colour spaces.

Another usage is that the user can specify textual instructions instead of reference photographs. Figure 6.16 illustrates the results of this usage, where crowd workers were instructed to adjust the slider so that the appearance looks like ‘brushed stainless,’ ‘dark blue plastic,’ and so on. This is not easy when a human-in-the-loop approach is not taken.

Figure 6.16 Results of the crowd-powered material appearance design with textual instructions.
6.6 Discussions
6.6.1 Usage of Computation: Estimation vs. Maximization
We have considered two usages involving computational techniques. The first approach, that is, the estimation of the goodness function g(·), has several advantages compared to the other approach:

• The user can maintain control over how to explore the design space X. That is, he or she is not forced to follow the system’s computational guidance.
• The chosen solution x∗ is always optimal for the target user, because the ‘true’ goodness function used for deciding the final solution is owned by the user and can differ from the ‘estimated’ goodness function used for guidance.
• Even if the estimation is not perfect, it can still guide the user in exploring the design space X effectively. In Section 6.4, it was observed that the estimated goodness function was often useful for providing a good starting point for further exploration and for eliminating meaningless visits to bad designs.
• This approach can be seamlessly integrated into existing practical scenarios because it does not aim to replace existing workflows but rather augments (or enhances) them.
In contrast, the second approach, that is, the maximization of the goodness function g(·), has different advantages:
• The user does not need to care about the strategy by which design exploration should proceed. This could enable a new paradigm for aesthetic design and resolve many constraints with respect to user experience. For example, users are freed from the need to understand and learn the effects of each design parameter.
• The user no longer needs to interact with the system, enabling fully automatic
workflows. This further broadens the possible usage scenarios.
• The solutions found by this approach can be used either as final products or as good starting points for further manual refinement. In Section 6.5, it was observed that most of the results were not necessarily perfect but were quite acceptable as final products.
• This approach aims to find the optimal solution as efficiently as possible, based on optimization techniques. For example, the second illustrative example uses Bayesian optimization techniques to reduce the number of necessary iterations. Compared to the approach of estimating the goodness function g(·) everywhere in the design space X, whose computational cost grows exponentially with the dimensionality, this approach may scale better to high-dimensional design spaces.
A hybrid approach between these two is also possible. For example, partially estimating the goodness function g(·) around the expected maximum may be useful for helping users explore the design space effectively. Investigating this possibility is important future work.
(Kapoor, Caicedo, Lischinski, and Kang, 2014). Here, we summarize the advantages and disadvantages of this data source.
problems in data science, the resulting space in this case has to be either designer-friendly
or optimization-friendly (or both) for maximizing aesthetic preference. Recently, Yumer,
Asente, Mech, and Kara (2015) showed that autoencoder networks can be used for convert-
ing a high-dimensional, visually discontinuous design space to a lower-dimensional, visually
continuous design space that is more desirable for design exploration. Incorporating human
preference in dimensionality reduction of design spaces is an interesting future work.
Discrete Parameters. We focused on continuous parameters, and did not discuss
how to handle discrete design parameters, such as fonts (O’Donovan, Lı̄beks, Agarwala,
and Hertzmann, 2014) and web design templates (Chaudhuri, Kalogerakis, Giguere, and
Funkhouser, 2013). The remaining challenges in handling such discrete parameters include how to represent goodness functions over design spaces that include discrete parameters and how to facilitate users’ interactive exploration. Investigating techniques for jointly handling discrete and continuous parameters is another potential direction for future work.
Locally Optimal Design Alternatives. In some scenarios, totally different design
alternatives can be equally ‘best,’ and it can be hard to determine which is better. For
example, in Adobe Colour CC,10 which is a user community platform to make, explore, and
share colour palettes (a set of colours usually consisting of five colours), there are a number of
(almost) equally popular colour palettes that have been preferred by many users, as shown
in Figure 6.17. In this case, if we assume the existence of a goodness function for colour
palettes, the popular palettes can be considered as local maximums of the goodness function.
Considering that the goal is to support design activities, it may not be effective to assume that there is a sole global maximum in this design space and to guide the user toward that single maximum; rather, it may be more desirable to provide a variety of good design alternatives. There is room for investigation into how computation can support such design scenarios.
Evaluation Methodology. One of the issues in computational design driven by aesthetic preference is the lack of an established, general methodology for evaluating each new method. Validation of methods in this domain is challenging for several reasons. The first is the difficulty of defining ‘correct’ aesthetic preference, which can be highly dependent on the scenario. Also, as the ultimate goal is the support of design activities, the
effectiveness needs to be evaluated by designers. Methods in this domain are built on many assumptions, each of which is difficult to validate. We consider that establishing general evaluation schemes would be an important direction for future work.

Figure 6.17 ‘Most Popular’ colour palettes in the user community of Adobe Colour CC. Though visually different from each other, they are (mostly) equally popular and preferred by many users.

10 https://color.adobe.com/
More Sophisticated Models of Crowd Behaviours. The two examples described in Sections 6.4 and 6.5 were built on an assumption about crowd workers: that they share a common goodness function and that each worker responds based on that function with some noise. Thus, it is assumed that the common goodness function can be observed by asking many crowd workers and then averaging their responses. This assumption may be valid in some scenarios, but not in many others; for example, crowds may form several clusters with respect to their aesthetic preferences. Modelling such complex properties of crowds is an important future challenge.
Incorporating Domain-Specific Heuristics. We have tried to use minimal domain-specific knowledge so that the discussion in this chapter is as generally applicable as possible. However, computational design methods can be made more practical for specific scenarios by making full use of domain-specific heuristics. For example, if one builds a software program to tweak the viewpoints of 3D objects, the heuristic features and the pre-trained model of Secord, Lu, Finkelstein, Singh, and Nealen (2011) could be used jointly with the methods described in the previous sections.
Combining General and Personal Preference. We discussed how to handle crowds’ general preference. However, it would also be beneficial to handle users’ personal preferences; Koyama, Sakamoto, and Igarashi (2016) presented a method for learning personal preference to facilitate design exploration and reported that this approach was appreciated by professional designers. Both approaches have advantages and disadvantages; to complement each other’s disadvantages, we envision that a combination of the two may be useful and worth investigating.
while accounting for human evaluation in the loop. For example, Sims (1991) showed that
IEC can generate interesting designs beyond predefined parametric design spaces.
where g^male(·) and g^female(·) are the goodness functions owned by men and women, respectively, and w^male and w^female are the weights for adjusting the bias, which can satisfy w^male < w^female in this case. With crowdsourced human computation, this could be solved by utilizing the demographic information of crowd workers (Reinecke and Gajos, 2014). Another complex scenario is one in which some additional design criteria are expected to be satisfied but do not have to be maximized. For example, a client may want a design that is preferred by young people as much as possible and at the same time is ‘acceptable’ to elderly people. In this case, the problem can be formulated as a constrained optimization. Under such complex conditions, it would be difficult for designers to explore designs manually. We believe that this is the part that computational techniques need to facilitate.
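The constrained scenario above can be sketched as follows; `g_young` and `g_elderly` are hypothetical estimated goodness functions for the two groups, and the candidate-filtering strategy is a simple illustration rather than a prescribed algorithm.

```python
import numpy as np

def choose_design(candidates, g_primary, g_constraint, threshold):
    """Maximize the primary group's preference subject to the other
    group's preference being at least 'acceptable' (>= threshold)."""
    scores = np.array([g_primary(x) for x in candidates], dtype=float)
    feasible = np.array([g_constraint(x) for x in candidates]) >= threshold
    if not feasible.any():
        return None                              # no acceptable design exists
    scores[~feasible] = -np.inf                  # rule out unacceptable designs
    return candidates[int(np.argmax(scores))]

# Toy one-parameter example: young people prefer large x, elderly people small x.
candidates = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
g_young = lambda x: x
g_elderly = lambda x: 1.0 - x
best = choose_design(candidates, g_young, g_elderly, threshold=0.5)
print(best)  # 0.4
```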
6.7 Summary

In this chapter, we discussed the possible mechanisms, illustrative examples, and future challenges of computational design methods with crowds. In particular, we focused on the facilitation of parametric design (i.e., parameter tweaking) and formulated the design process as a numerical optimization problem in which the objective function to be maximized was based on perceptual preference. We illustrated the ideas of using crowdsourced human computation for this problem, either for the estimation of the objective function or for its maximization.
....................................................................................................
references
Allen, B., Curless, B., and Popović, Z., 2003. The space of human body shapes: Reconstruction and
parameterization from range scans. ACM Transactions on Graphics, 22(3), pp. 587–94.
Bailly, G., Oulasvirta, A., Kötzing, T., and Hoppe, S., 2013. Menuoptimizer: Interactive optimization
of menu systems. In: UIST ’13: Proceedings of the 26th Annual ACM Symposium on User Interface
Software and Technology. New York, NY: ACM, pp. 331–42.
Bell, S., Upchurch, P., Snavely, N., and Bala, K., 2013. Opensurfaces: A richly annotated catalog of
surface appearance. ACM Transactions on Graphics, 32(4), pp. 111:1–111:17.
Bernstein, M. S., Brandt, J., Miller, R. C., and Karger, D. R., 2011. Crowds in two seconds: Enabling
realtime crowd-powered interfaces. In: UIST ’11: Proceedings of the 24th Annual ACM Symposium
on User Interface Software and Technology. New York, NY: ACM, pp. 33–42.
Bernstein, M. S., Little, G., Miller, R. C., Hartmann, B., Ackerman, M. S., Karger, D. R., Crowell, D.,
and Panovich, K., 2010. Soylent: A word processor with a crowd inside. In: UIST ’10: Proceedings of
the 23rd Annual ACM Symposium on User Interface Software and Technology. New York, NY: ACM,
pp. 313–22.
Bishop, C. M., 1995. Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
Blanz, V., and Vetter, T., 1999. A morphable model for the synthesis of 3d faces. In: SIGGRAPH ’99:
Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. New
York, NY: ACM, pp. 187–94.
Brochu, E., de Freitas, N., and Ghosh, A., 2007. Active preference learning with discrete choice data.
In: NIPS ’07: Advances in Neural Information Processing Systems 20. pp. 409–16.
Bychkovsky, V., Paris, S., Chan, E., and Durand, F., 2011. Learning photographic global tonal adjust-
ment with a database of input/output image pairs. In: CVPR ’11: Proceedings of the 24th IEEE
Conference on Computer Vision and Pattern Recognition. pp. 97–104.
Chaudhuri, S., Kalogerakis, E., Giguere, S., and Funkhouser, T., 2013. Attribit: Content creation with
semantic attributes. In: UIST ’13: Proceedings of the 26th Annual ACM Symposium on User Interface
Software and Technology. New York, NY: ACM, pp. 193–202.
Cohen-Or, D., and Zhang, H., 2016. From inspired modeling to creative modeling. The Visual
Computer, 32(1), pp. 7–14.
Gajos, K., and Weld, D. S., 2004. Supple: Automatically generating user interfaces. In: IUI ’04:
Proceedings of the 9th International Conference on Intelligent User Interfaces. New York, NY: ACM,
pp. 93–100.
Gingold, Y., Shamir, S., and Cohen-Or, D., 2012. Micro perceptual human computation for visual
tasks. ACM Transactions on Graphics, 31(5), pp. 119:1–119:12.
Howe, J., 2006a. Crowdsourcing: A definition. Available at: http://crowdsourcing.typepad.com/
cs/2006/06/crowdsourcing_a.html. Accessed: October 23, 2016.
Howe, J., 2006b. The rise of crowdsourcing. Available at: https://www.wired.com/2006/06/crowds/.
Accessed: October 23, 2016.
Huber, B., Reinecke, K., and Gajos, K. Z., 2017. The effect of performance feedback on social media
sharing at volunteer-based online experiment platforms. In: CHI ’17: Proceedings of the 2017 CHI
Conference on Human Factors in Computing Systems. New York, NY: ACM, pp. 1882–6.
Ipeirotis, P. G., Provost, F., and Wang, J., 2010. Quality management on amazon mechanical turk. In:
HCOMP ’10: Proceedings of the ACM SIGKDD Workshop on Human Computation. New York, NY:
ACM, pp. 64–7.
Kapoor, A., Caicedo, J. C., Lischinski, D., and Kang, S. B., 2014. Collaborative personalization of image
enhancement. International Journal of Computer Vision, 108(1), pp. 148–64.
Kittur, A., Chi, E. H., and Suh, B., 2008. Crowdsourcing user studies with mechanical turk. In: CHI
’08: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. New York, NY:
ACM, pp. 453–56.
Koyama, Y., Sakamoto, D., and Igarashi, T., 2014. Crowd-powered parameter analysis for visual design
exploration. In: UIST ’14: Proceedings of the 27th Annual ACM Symposium on User Interface Software
and Technology. New York, NY: ACM, pp. 65–74.
Koyama, Y., Sakamoto, D., and Igarashi, T., 2016. Selph: Progressive learning and support of manual
photo color enhancement. In: CHI ’16: Proceedings of the 2016 CHI Conference on Human Factors
in Computing Systems. New York, NY: ACM, pp. 2520–32.
Koyama, Y., Sato, I., Sakamoto, D., and Igarashi, T., 2017. Sequential line search for efficient visual
design optimization by crowds. ACM Transactions on Graphics, 36(4), pp. 48:1–48:11.
Koyama, Y., Sueda, S., Steinhardt, E., Igarashi, T., Shamir, A., and Matusik, W., 2015. Autoconnect:
Computational design of 3d-printable connectors. ACM Transactions on Graphics, 34(6), pp.
231:1–231:11.
Lewis, J. P., Anjyo, K., Rhee, T., Zhang, M., Pighin, F., and Deng, Z., 2014. Practice and theory of
blendshape facial models. In: Eurographics 2014 - State of the Art Reports. pp. 199–218.
Little, G., Chilton, L. B., Goldman, M., and Miller, R. C., 2010. Turkit: Human computation algo-
rithms on mechanical turk. In: UIST ’10: Proceedings of the 23rd Annual ACM Symposium on User
Interface Software and Technology. New York, NY: ACM, pp. 57–66.
Matusik, W., Pfister, H., Brand, M., and McMillan, L., 2003. A data-driven reflectance model. ACM
Transactions on Graphics, 22(3), pp. 759–69.
Miniukovich, A., and Angeli, D. A., 2015. Computation of interface aesthetics. In: CHI ’15: Proceedings
of the 33rd Annual ACM Conference on Human Factors in Computing Systems. New York, NY: ACM,
pp. 1163–72.
Morishima, A., Amer-Yahia, S., and Roy, B. S., 2014. Crowd4u: An initiative for constructing an open
OUP CORRECTED PROOF – FINAL, 22/12/2017, SPi
PART III
Systems
7
7.1 Introduction
One might feel that the very association of the terms ‘formal methods’ and ‘human’ is an
oxymoron; how can anything that is formal make any sense when faced with the complexity
and richness of human experience? However, little could be more formal than computer
code, so in that sense everything in Human–Computer Interaction (HCI) is formal. Indeed,
all the chapters in this book are about practical formal methods, in the sense that they involve
a form of mathematical or symbolic manipulation. From Fitts’ Law to statistical analysis of
experimental results, mathematics is pervasive in HCI.
However, in computer science ‘formal methods’ has come to refer to a very specific set
of techniques. Some of these are symbolic or textual, based on sets, logic, or algebraic
representations; others are more diagrammatic. All try to specify some aspect of a system
in precise detail in order to clarify thinking, perform some sort of analysis, or communicate
unambiguously between stakeholders across the design and engineering process.
The use of such methods is usually advocated in order to ensure the correctness, or
more generally, the quality of computer systems. However, they are also usually regarded
as requiring too much expertise and effort for day-to-day use, being principally applied in
safety-critical areas outside academia. Similarly, in HCI, even when not dismissed out of
hand, the principal research and use of formal methods is in safety-critical areas such as
aerospace, the nuclear industry, and medical devices.
However, this chapter demonstrates that, in contrast to this perception, formal methods
can be used effectively in a wider range of applications for the specification or understanding
of user interfaces and devices. Examples show that, with appropriate choice and use, formal
methods can be used to allow faster development and turnaround, and be understood by
those without mathematical or computational background.
We will use two driving examples. One is from many years ago and concerns the develop-
ment of simple transaction-based interactive information systems. The other is more recent
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
and relates to the design of smart-devices where both physical form and digital experience
need to work together.
In both cases, a level of appropriation is central: the adaptation of elements of specific
formal methods so that their power and precision is addressed at specific needs, while being
as lightweight and flexible as possible in other areas. Because of this, the chapter does not
introduce a particular fixed set of notations or methods, but rather, through the examples,
demonstrates a number of heuristics for selecting and adapting formal methods. This said,
it is hoped that the specific methods used in the two examples may also themselves be of
direct use.
The rest of this chapter is structured as follows. First, Section 7.2 looks at some of the
range of formal methods used in computer science in general, and then at the history and
current state of the use of formal methods in HCI. Sections 7.3 and 7.4 look at the two
examples, for each introducing a specific problem, the way formalism was used to address it,
and then the lessons exposed by each. Finally, we bring together these lessons and also look
at emerging issues in HCI where appropriate formalisms may be useful or, in some cases,
essential for both researcher and practitioner as we attempt to address a changing world.
state
    total:   Nat                  -- running total (accumulator)
    disp:    Nat                  -- number currently displayed
    pend_op: {+, −, ∗, /, none}   -- pending operation
    typing:  Bool                 -- true/false flag
Effectively, the first is about formalizing requirements and the second formalizing the
detailed (technical) design. These can be used alone, i.e., (i) can be used on its own to create
test cases or simply clarify requirements; (ii) can be used to analyse the behaviour or to verify
that code does what the specification says it should. Alternatively, the two kinds may be used
together where the detailed specification (ii) is verified to fulfil the formal requirements (i).
Where using both, these may use the same formal notation, or two different ones, as noted
with the use of temporal logics.
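As a concrete illustration of the second kind of specification, the calculator state above can be turned into a small executable model. This is only a sketch: the transition functions (press_digit, press_op, press_equals) and the integer-only arithmetic are illustrative assumptions, not part of any notation in this chapter.

```python
from dataclasses import dataclass

# Executable sketch of the calculator state given above.
# Field names follow the specification; everything else is assumed.

@dataclass
class CalcState:
    total: int = 0          # running total (accumulator)
    disp: int = 0           # number currently displayed
    pend_op: str = 'none'   # pending operation: '+', '-', '*', '/', 'none'
    typing: bool = False    # true while the user is keying a number

def apply_pending(s: CalcState) -> int:
    ops = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b,
           '/': lambda a, b: a // b,
           'none': lambda a, b: b}
    return ops[s.pend_op](s.total, s.disp)

def press_digit(s: CalcState, d: int) -> CalcState:
    disp = s.disp * 10 + d if s.typing else d
    return CalcState(s.total, disp, s.pend_op, True)

def press_op(s: CalcState, op: str) -> CalcState:
    total = apply_pending(s)
    return CalcState(total, total, op, False)

def press_equals(s: CalcState) -> CalcState:
    total = apply_pending(s)
    return CalcState(total, total, 'none', False)
```

A model like this can be used exactly as described for kind (ii): to animate the behaviour, generate test cases, or check that implementation code does what the specification says.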
Simpler diagrammatic formalisms have always been well used, and many, including
Statecharts, have found their way into Unified Modeling Language (UML) (Booch et al.,
1999). There have been classic success stories of formal methods used in industrial practice
dating back many years, notably, a large-scale project to specify large portions of IBM
Customer Information Control Systems (CICS) in Z (Freitas et al., 2009). However, on the
whole, more complex formal methods tend to require too much expertise and are perceived
as taking too much effort. Hence, adoption is very low except for a few specialized safety-
critical fields where the additional effort can be cost effective.
One suggestion to deal with this low adoption has been to develop domain-specific
notations, designed explicitly for a very narrow range of applications using representations
Figure 7.2 Example state transition network of the MESI caching protocol. (From Wikipedia, by Ferry24.Milan, own work, CC BY-SA 3.0, en.wikipedia.org/wiki/Cache_memory#/media/File:MESI_State_Transaction_Diagram.svg)
that are more likely to be comprehensible by domain experts, but which are built over or can
be connected back to some form of semantic core that allows analysis and reasoning (Polak,
2002). The use of formal methods in HCI can be seen as one such specialization, as can the
examples later in this chapter.
During this same period there was also work on formally based architectural description
such as PAC (Coutaz, 1987) and ARCH-Slinky (Bass et al., 1991), computational abstrac-
tions for construction of particular kinds of user interface such as Dynamic Pointers (Dix
1991, 1995), and a number of dedicated user interface formalisms that are still used today,
including Interactive Cooperative Object, or ICO (Navarre et al., 2009), a variant of Petri
Nets, and ConcurTaskTrees, or CTT (Paterno, 1999), a more formally based variant of
hierarchical task analysis (Shepherd, 1989).
As a final word in this section, note again that while some of this early work arose from
more theoretical concerns, the majority had roots in practical problems that arose when
either trying to understand or build interactive user interfaces.
More information about the early development of this area can be found in a number
of monographs and collections (Thimbleby and Harrison, 1990; Dix, 1991; Gram and
Cockton, 1996; Palanque and Paterno, 1997; Paterno, 1999; Weyers et al., 2017).
Meixner et al. (2011), it has also led to a W3C standardization effort (Meixner et al.,
2014).
Another development has been the work on domain specific notations (DSNs). These
are notations designed for a specific application domain, e.g., chemical plants. The visu-
alizations and vocabulary are created to more easily express the particular requirements
and properties of the domain, and also be meaningful to domain experts. Chapter 8 argues
strongly for the need for appropriate languages so that coders and designers can more easily
express their intentions and understand the computer’s semantics, hence improving both
productivity and reliability.
There are elements of this in early work, for example, dynamic pointers (Dix 1991, 1995),
which were designed to specify and construct systems that included complex changing
structures and required locations or parts to be marked (e.g., cursors for editing, current
scroll location in text). This was effectively a bespoke, handcrafted DSN. More recent work
has focused on creating abstractions, tools, and methods that make it easier for a formal
expert to construct a DSN (Weyers, 2017; Van Mierlo et al., 2017).
might get a second page corresponding to the second user’s query. While this was annoying,
it was infrequent enough to be acceptable for purely information access. However, the
author was tasked with creating the council’s first systems to also allow interactive update
of data, and so it would be unacceptable to have similar issues.
The reason for this occasional bugginess was evident in the code of the existing appli-
cations, which was spaghetti-like due to complex decisions primarily aimed at working out
the context of a transaction request. Similar code can indeed often be seen today in poorly
structured web-server code that tries to interpret input from different sources, checking for
the existence of variables to distinguish a submit from a cancel on a web form. In fairness
to the programmers, this was very early in the development of such systems, and also the
individual application did a lot of the work now done by web servers such as Apache,
marshalling variables and directing requests to the appropriate code.
Figure 7.3 Example flowchart showing decisions (diamonds), activities (rectangles), and control
flow (arrows).
Flowcharts developed rich vocabularies of shapes denoting different forms of action, but
the basic elements are rectangular blocks representing activities, diamonds representing
choices, and arrows between them representing control flow, one activity following another
(see Figure 7.3).
Flowcharts have many limitations, not least in that they only deal with sequential activity,
and do not deal with concurrency. There are extensions, such as UML activity diagrams,
which add features, but part of the ongoing popularity of flowcharts is undoubtedly the
simplicity that makes them work so well as a communication tool.
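To show how little machinery such a chart needs, here is a hypothetical sketch of a flowchart as a graph of activities and decisions, executed by following one arrow at a time; the node and action names are invented for illustration:

```python
# A flowchart as nodes plus arrows. Activities name an action and one
# outgoing arrow; decisions inspect the environment and pick a labelled
# arrow. Execution follows a single arrow at a time -- which is exactly
# the 'sequential only' limitation noted above.

flow = {
    'start':  ('activity', 'read record', 'ask'),
    'ask':    ('decision', lambda env: env['answer'],   # Y/N choice
               {'Y': 'delete', 'N': 'finish'}),
    'delete': ('activity', 'delete record', 'finish'),
    'finish': ('activity', 'done', None),
}

def run(flow, env, node='start'):
    trace = []
    while node is not None:
        kind, payload, nxt = flow[node]
        if kind == 'activity':
            trace.append(payload)   # perform the activity
            node = nxt              # follow its single control arrow
        else:
            node = nxt[payload(env)]  # follow the arrow for this outcome
    return trace
```

Running `run(flow, {'answer': 'Y'})` walks start, decision, delete, finish in order; there is no way to express two activities happening at once.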
[Figure: interaction flowchart for a 'delete record' transaction: displays D1 ('Please enter employee no.'), D2 and D3 (record name and department with a 'delete? (Y/N)' prompt, D3 adding 'Please enter Y or N'), and computer activities C1 'read record', C2 'answer?' branching on Y/N/other, and C3 'delete record', ending in Finish.]
The lozenge shapes represented computer activity. This was chosen as it was available in
flowchart templates, and was part way between the diamond (choice) and rectangular block
(actions), as typical computer processing combines the two. Under the ECMA standard
this shape had a specified meaning, but it was rarely used and so deemed unlikely to cause
confusion. Quite complex internal processing would be represented as a single block, with
the internal breakdown only being shown if there was an external difference (e.g., the record
only being deleted if the confirm response was ‘Y’).
The small tape symbols were used to represent interactions with stored data—recall that
this was the early 1980s, when personal computers used cassette tapes for storage, and, in
mainframe computer centres, tapes were still used alongside disk storage for large data files.
Finally, note that each user and computer activity has an identifier at the top right. The
computer ones corresponded to internal labels in the code and the user ones were displayed
as the non-editable id at the top right of the display.
It is also worth noting what the user interaction flow charts did not include. There was
little description of the internal code, for example, the way records would be read from disk.
This was because this level of detail was ‘standard programming’, the thing the developer
could easily do anyway, and which was pretty much self-documenting in the code. Neither
did it give a complete description of the contents of the screens, merely enough to make
the intention clear; again, it was often fairly obvious which fields from a record should
be displayed and how, but if not, then this would be described separately. The formalism
focused on doing the things that were hard to understand from the code alone.
These flowcharts were used in several ways.
In short, this led to systems that were more reliable, were more easily understood by the
clients, and were produced more than ten times faster than with conventional methods at
the time.
7.3.5 Lessons
It was only some years later that it became apparent just how unusual it is to see use of formal
methods that was so clearly effective. In analysing this in a short report (Dix, 2002) on which
the previous description is based, a number of features were identified that contributed to
this success:
• useful—the formalism addressed a real problem, not simply used for its own sake,
or applied to a ‘toy’ problem. Often, formalisms are proposed because of internal
properties, but here it is the problem that drove the formalism.
• appropriate—there was no more detail than needed—what was not included was as
important as what was. Often formal notations force you to work at a level of detail
that is not useful, which both increases the effort needed and may even obfuscate,
especially for non-experts.
• communication—the images of miniature display screens and clear flow meant that it
was comprehensible to developers and clients alike. The purpose of formal methods
has often been phrased in terms of its ability to communicate unambiguously, but if
the precision gets in the way of comprehension, then it is in vain.
• complementary—the formalism used a different paradigm (sequential) than the
implementation (event driven). There is often a temptation to match formalism
and implementation (e.g., both object based), which may help verification, but
this means that the things difficult to understand in the code are also difficult to
understand in the specification.
• fast payback—it was quicker to produce applications (by at least 1000 per cent).
It is often argued that getting the specification right saves time in the long term as
there are fewer bugs, but often at the cost of a long lead time, making it hard to assess
progress. The lightweight and incremental nature of the method allowed rapid bang-
for-buck, useful for both developer motivation and progress monitoring.
• responsive—there was also rapid turnaround of changes. Often, heavy specification-
based methods can mean that change is costly. This is justified, if the time spent at
specification means you have a ‘right first time’ design, but for the user interface, we
know that it is only when a prototype is available that users begin to understand what
they really need.
• reliability—the clear boilerplate code was less error-prone. While the transforma-
tion from diagram to code was not automated, the hand process was straightforward,
and the reuse of code fragments due to the boilerplate process further increased both
readability and reliability.
• quality—it was easy to establish a test cycle due to the labelling, and to ensure that
all paths were well tested.
• maintenance—the unique ids made it easy to relate bugs or requests for enhance-
ments back to the specification and code.
In summary, the formalism was used to fulfil a purpose, and was neither precious nor purist.
1 We acknowledge that parts of section 7.4 were supported by the AHRC/EPSRC-funded project DEPtH
‘Designing for Physicality’ (http://www.physicality.org/).
design is not just about the abstract flow of actions and information, but the way these are
realized in pressable buttons or strokeable screens.
Furthermore, maker and DIY electronics communities have grown across the world,
enabled by affordable 3D printers, open-source hardware such as Arduino and RaspberryPi,
and digital fabrication facilities, such as FabLabs, in most cities. This means that the
design of custom electronic devices has moved from the large-scale R&D lab to the street.
Items produced can be frivolous, but can include prosthetics (Eveleth, 2015), reconstructive
surgery (Bibb et al., 2015), community sensing (Balestrini et al., 2017), and prototypes for
large-scale production through platforms such as Kickstarter. We clearly need to understand
user interaction with these devices and find ways to make the resulting products safe, usable,
and enjoyable.
There are many ways to describe the internal digital behaviour of such devices. DIY-end
users may use graphical systems such as variants of Scratch, or data flow-based systems;
professionals may use appropriate UML models, and researchers may use various user-
interface formalisms, as discussed in section 7.2.1. However, all start with some form of
abstract commands (such as ‘up button pressed’) or numerical sensor input, as it is available
once it hits the computer system.
Similarly, there are many ways to describe the 3D form itself including standard formats
for point clouds, volume and surface models; affordable devices to scan existing objects or
clay formed models into these formats; and tools to help design 3D shapes including the
complexities of how these need to be supported during printing by different kinds of devices.
However, these focus principally on the static physical form, with occasional features to
make it easy, for example, to ensure that doors open freely.
The gap between the two is the need to describe: (i) the way a physical object has
interaction potential in and of itself, before it is connected to digital internals (buttons can
be pressed, knobs turned, a small device turned over in one’s hands); and (ii) the way this
intrinsic physical interaction potential is mapped onto the digital functionality.
This issue has been considered in an informal way within the UI and design community,
not least in the literature on affordance (Gibson, 1979; Gaver, 1991; Norman, 1999), where
issues such as the appropriate placing and ordering of light switches and door handles has
led to a generation of HCI students who are now unable to open a door without confusion.
At a semi-formal level Wensveen et al.’s (2004) Interaction Frogger analyses some of the
properties that lead to effective physical interaction.
Within the more formal and engineering literature in HCI, the vast majority of work has
been at the abstract command level, including virtually all the chapters in a recent state-
of-the-art book (Weyers et al., 2017), with a small amount of research in studying status–
event analysis (Dix, 1991; Dix and Abowd, 1996) and continuous interaction (Massink et
al., 1999; Wüthrich, 1999; Willans and Harrison, 2000; Smith, 2006).
There are some specialized exceptions. Eslambolchilar (2006) studied the cybernetics of
control of mobile devices, taking into account both the device sensors and feedback, and
the human physical motor system. Zhou et al. (2014) have looked at detailed pressure-
movement characteristics of buttons, which is especially important when a button has
to be easy to engage when an emergency arises, but hard to press accidentally (such as
emergency cut-off, or fire alarm). Thimbleby (2007) has also modelled the physical layout
of buttons on a VCR, to investigate whether taking into account Fitts’ Law timings for
inter-button movement was helpful in assessing the overall completion time of tasks, and
similar techniques can also be seen in Chapter 9.
In order to address this gap, physigrams were developed, a variant of state-transition
networks (as discussed in Section 7.2.1), but focused on the physical interactions.
[Figure: physigram of a switch with states UP and DOWN, the user pushing the switch up and down; alongside, the corresponding light states OFF and ON.]
Figure 7.6 Logical states of an electric light map 1–1 with physigram states.
[Figure 7.7: physigram of a button with bounce-back; the user pushes the button in (OUT to IN) and it bounces back from IN to OUT.]
(e.g., outside light controlled from inside). In these cases, assuming there is no fault, the
state of the light can be observed from the state of the switch.
More generally, this one-to-one mapping is not always possible, and in many cases the
physical device controls some form of transition. In these cases, the link between the
physical and logical sides of the system ends up more like the classic user specification
with abstract commands, except in this case, we have a model of how the commands are
produced, not just how they are used. In some ways, the ICO model of input device drivers
in Chapter 9 fulfils a role similar to physigrams, albeit less focused on the actual physical
behaviour. However, the way these are annotated with events and event parameters shows
one way in which device behaviour can be linked to more abstract system models.
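A minimal sketch of these two styles of linkage, with all names assumed: a two-position switch whose states map one-to-one onto the light, versus a bounce-back button whose automatic transition instead produces an abstract toggle command.

```python
# Two ways to link a physigram to underlying system state (names assumed).

# (a) State-to-state: a two-position switch maps 1-1 onto the light, so
# (faults aside) the light state is observable from the switch state.
LIGHT_OF_SWITCH = {'UP': 'OFF', 'DOWN': 'ON'}

def light_state(switch_state):
    return LIGHT_OF_SWITCH[switch_state]

# (b) Transition-to-command: a bounce-back button has only one stable
# state, so its automatic IN -> OUT transition emits an abstract
# 'toggle' command, which the logical side then interprets.
def on_bounce_back(light):
    return 'OFF' if light == 'ON' else 'ON'
```

In case (a) the physigram determines the logical state; in case (b) it only determines how commands are produced, which is why the state of such a device cannot be read from its physical form alone.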
This connection between the physigram and underlying system states is very important,
but for the purposes of the current discussion, we will focus entirely on the physigram itself,
as it is the more novel aspect.
Figures 7.7 and 7.8 show two more examples of two-state devices.
[Figure panels: press up / press down transitions between UP, PART, and DOWN states; the switch 'gives' part way through the movement.]
Figure 7.8 Switch with 'give': (i) detailed physigram; (ii) 'shorthand' physigram with
decorated transition.
Figure 7.7 is a button with ‘bounce-back’. These are common in electronic devices;
indeed, there are seventy-nine on the laptop keyboard being used to type this chapter.
Whereas most light switches stay in the position you move them to, when you press a key
on a keyboard, it bounces back up as soon as you take your finger off the key. Notice that
the IN state has a dashed line around it, which shows that it is a temporary tension state—
unstable without pressure being applied. The arrow from OUT to IN is smooth, denoting a
user action of pressing in, but the arrow from IN to OUT is a ‘lightning bolt’ arrow, denoting
the physical bounce-back of the button.
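The distinction between stable and tension states can be made executable. The sketch below, with assumed names, models Figure 7.7: a user action (smooth arrow) moves the button into the IN tension state, and removing pressure triggers the automatic bounce-back (lightning-bolt arrow).

```python
# Physigram-style model of the bounce-back button (all names assumed).
STABLE = {'OUT'}                              # solid-outline states
TENSION = {'IN'}                              # dashed-outline states
USER_TRANSITIONS = {('OUT', 'press'): 'IN'}   # smooth arrows
BOUNCE_TRANSITIONS = {'IN': 'OUT'}            # lightning-bolt arrows

def step(state, action=None):
    """One user action; with no action (pressure released), any
    tension state resolves via its automatic bounce-back arrow."""
    if action is not None:
        return USER_TRANSITIONS.get((state, action), state)
    return BOUNCE_TRANSITIONS.get(state, state)
```

So `step('OUT', 'press')` holds the button IN while pressure is applied, and a plain `step('IN')` models taking the finger off the key.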
In fact, even a typical up/down light switch has a small amount of bounce-back. If the
switch is up and you press very lightly down, it moves a little, but if you release it before it
gets to halfway it pops back to its original position. When it gets to the halfway position
it suddenly 'gives', snapping to the down position.
This can be represented by adding part-in/part-out states with bounce-back transitions,
as shown in Figure 7.8(i). This detailed representation would be useful if the digital system
had some sort of response to the partly moved states (perhaps an audio feedback to say what
will get turned on/off). However, mostly it is sufficient to know that they have this feel for
the user, and so the shorthand in Figure 7.8(ii) was introduced; the 'bounce' marks where the
arrow exits the state are intended to represent this small amount of resistance, bounce-back,
and give.
Tables 7.1 and 7.2 summarize the symbols we have seen so far for states and transitions
respectively. A more complete list can be found in Dix and Ghazali (2017) and online at
http://physicality.org/physigrams/.
[Figures: physigrams of a rotary control, panels (i)–(iii), showing positions 1–8 arranged around UP and DOWN states.]
from UP to itself, passing through the DOWN state, to denote that it was something that
you know was happening, but without any physical feedback.
There were also detailed differences in the transitions when the knob, dial, or touchpad
was rotated (see Figure 7.10). The knob had a resistance to movement and then gave, and
hence was drawn with the ‘give’ arrows shown in Figure 7.8(ii). In contrast, the stops on
the dial (Figure 7.10(ii)) had some tangible feel, but only slightly, and with no sense of
resistance to motion; hence, the stops were drawn as simple transitions. Finally, the touchpad
(Figure 7.10(iii)) had no tangible feedback at all. The designers ‘knew’ there were seven
stops, but this was not at all apparent from the feel of the device. They felt they wanted to
record the seven stops, but drew the transitions as simply passing through these; it would
only be through visual or audio feedback that the user would be able to actually tell what
was selected.
7.4.4 Lessons
First, this case study demonstrates the complexity of physical interaction. The specialized
notation in the user interaction flowcharts in section 7.3 was developed once on a particular
system design and then reused on other systems without any need for additional constructs.
In contrast, the computer science side of the DEPtH team had analysed numerous devices
in order to create a vocabulary for physigrams, and yet on the first use by others, new cases
were found.
Of course, the good thing is that the designers were able to adapt the physigrams to model
the unexpected physical properties. This demonstrated two things: comprehensibility,
in that the notation was sufficiently comprehensible that they took ownership and felt
confident to modify it, even though they were not mathematicians or computer scientists;
and openness to appropriation, in that the notation had sufficient flexibility that there were
obvious ways to extend or adapt it.
Of particular note in terms of appropriation, the fact that the semantics of STNs use only
connectivity, not layout, meant that the designers were free to use layout themselves,
including 3D layout, in order to create the most meaningful version for human–human
communication. This is an example of the appropriation design principle openness to inter-
pretation, identified elsewhere (Dix, 2007): by leaving parts of a system or notation without
formal interpretation, this makes it available for rich interpretation by users, or in Green and
Petre’s (1996) terms, secondary notation.
The downside of this ease of modification is that the semantics were not fixed, which
would be problematic if we had wished to perform some sort of automatic analysis. In fact,
for this purpose, the clarity for the product designers and the ability to communicate design
intentions between designers and computer scientists was more important than having a
computer readable notation.
7.5 Discussion
7.5.1 Critical Features
The first case study identified a number of features of interaction flowcharts that led to
success: useful, appropriate, communication, complementary, fast payback, responsive, reli-
ability, quality, and maintenance. Some of these, reliability, quality and ease of maintenance,
would be familiar to anyone who has read arguments advocating formal methods, precisely
where the formality and rigour would be expected to give traction. Others, notably, fast
payback and responsiveness, would be more surprising as formal methods are normally seen
as cumbersome with long lead times to benefit.
These additional benefits are effectively outcomes of the first four features, which can
be summarized in two higher-level considerations, which are shared with the physigrams
case study.
Tuned for purpose—The interaction flowcharts were useful and appropriate, i.e., they did
what was necessary for a purpose and no more. Recall the way lower-level details were
ignored, as these were already understood; only the high-level dialogue was represented
and the way this attached to code and screens. Similarly, the physigrams focus on a specific
aspect, i.e., the device unplugged and its interaction potential; all the digital features of the
device are (initially) ignored. It is interesting to note that, while both specify interaction
order and flow, one is at a higher level than most dialogue specifications, and the other a
lower level.
Optimized for human consumption—In the case of the interaction flowcharts, the
choice of a complementary paradigm (sequential rather than event based) helped the
programmer understand aspects that were hard to comprehend in the code itself. Similarly,
the physigrams specify an aspect of the device that normally is not represented in the
code or standard diagrammatic interaction notations. Also, as noted in the first case
study, while formal methods are often advocated as enabling communication, this is
normally meant as unambiguous communication between highly trained members of
a technical team. In contrast, both the interactive flowcharts and physigrams focused
on expressive communication, enabling use by clients and non-formally trained designers
respectively.
Early work emphasized the importance of bridging the formality gap (Dix, 1991), i.e., the
need to be confident that a formal specification reflects the real world issue it is intended
to express. That is, while it is relatively easy to ensure internal validity between formal
expressions within a design process, it is external validity that is hard to achieve. It is precisely
in establishing this external validity that expressive communication helps.
Both tuning and expressive communication were enabled by appropriation. This appro-
priation happened at two levels. In both case studies, an existing notation (flowcharts,
state transition networks) was borrowed and modified to create a new lightweight notation
tuned to the specific purpose. In both cases this extension included both features tuned to
computational or analytic aims (overall flow of states), but also expressive communication
(for example, the miniature screen images in interactive flowcharts and special arrows in
physigrams). However, in the second case study we saw a further level of appropriation
when the product designers themselves modified physigrams to express aspects beyond
those considered by the notation’s creators.
Some of this appropriation is possible because of the flexibility enabled by the uninter-
preted aspects of the notations: the layout on the page, and the descriptive text annotating
states and transitions; as noted previously, all classic ways to design for appropriation (Dix,
2007). However, in some cases, the appropriation may ‘break’ the formalism; for example,
the ‘parallel’ states within the UP and DOWN high-level states in Figure 7.9 are clearly
connected for the human viewer, but would not be so if the diagram were interpreted
formally.
Achieving this flexible formalism is a hard challenge. The idea of domain specific nota-
tions, introduced in section 7.2.3, would certainly help in the first level of appropriation,
as the formal expert effectively creates a rich syntactic sugar that allows the domain expert
to use the DSN and have it transformed into a lower-level standard notation amenable to
automated analysis or transformation.
The second level of appropriation requires something more radical. One option would
be to allow semantics by annotation, where the user of the formalism has a lot of flexibility,
but must map this back to a more restricted predetermined formalism (which itself may
be domain specific). For example, in the product designers’ physigrams in Figure 7.9(i),
the arc between the UP and DOWN high-level states could be selected (either by the
designers themselves or the formal experts) and then declared to represent the series of arcs
between the low-level states. The primary diagram is still as drawn and, for most purposes,
used for human–human communication, but if there is any doubt in the meaning, or if
automated analysis is needed, then the expanded standard form is accessible. In many ways
this is like Knuth’s (1992) concept of literate programming; his WEB system allowed Pascal
programmers to write their code and internal documentation in the order that suited them,
with additional annotations to say how the fragments of code should be re-ordered and
linked in the order needed for the Pascal compiler.
has always recognized that it is important to take into account a diversity of people: different
ages, different abilities, different social and cultural backgrounds. However, digital citizenry
means this can no longer be an afterthought. Similarly, mobile access and digitally enabled
household appliances mean that users encounter a diversity of devices, sometimes simultane-
ously, as with second-screen TV watching. Attempting to think through all the permutations
of devices and personal characteristics and situations is precisely where automated analysis
is helpful.
Looking at the technology itself, both the Internet of Things (IoT) and big data create
situations where we encounter complexity through scale due to interactions of myriad small
parts. This may lead to feature interactions or emergent effects that are hard to predict.
Again, this is an area where more automated and formal analysis is helpful.
However, the above should be set in the context of digital fabrication and maker culture,
where larger numbers of people are involved with hardware and software development and
customization.
So, the circumstances in the world are making the use of formal methods more important,
and yet also mean that those who would need to use them are likely to be less expert; the very
same methods that many computer scientists find difficult. The need for practical formal
methods is clear.2
references
Abowd, G., and Dix, A., 1992. Giving undo attention. Interacting with Computers, 4(3), pp. 317–42.
Balestrini, M., Creus, J., Hassan, C., King, C., Marshall, P., and Rogers, Y., 2017. A City in Common:
A Framework to Orchestrate Large-scale Citizen Engagement around Urban Issues. Proc. CHI 2017,
ACM.
Bass, L., Little, R., Pellegrino, R., Reed, S., Seacord, R., Sheppard, S., and Szczur, M.R., 1991. The
ARCH model: Seeheim Revisited, User Interface Developers’ Workshop, April 26, Version 1.0.
Bibb, R., Eggbeer, D., and Paterson, A., 2015. Medical Modelling: The Application of Advanced Design
and Rapid Prototyping Techniques in Medicine. 2nd ed. Kidlington: Elsevier.
Bolton, M., and Bass, E., 2017. Enhanced Operator Function Model (EOFM): A Task Analytic
Modeling Formalism for Including Human Behavior in the Verification of Complex Systems. In:
B. Weyers, J. Bowen, A. Dix, P. Palanque, eds. The Handbook of Formal Methods in Human-Computer
Interaction. New York, NY: Springer. pp. 343–77.
Booch, G., Rumbaugh, J., and Jacobson, I., 1999. The Unified Modeling Language User Guide. London:
Addison Wesley.
Bowen, J., and Reeves, S., 2017. Combining Models for Interactive System Modelling. In: B. Weyers, J.
Bowen, A. Dix, P. Palanque, eds. The Handbook of Formal Methods in Human-Computer Interaction.
New York, NY: Springer. pp. 161–82.
British Standards Institution, 1987. BS 4058:1987, ISO 5807:1985: Specification for data processing
flow chart symbols, rules and conventions. Milton Keynes: BSI.
Buxton, W., 1990. A three-state model of graphical input. In: Proceedings of Human–Computer
Interaction—INTERACT ’90. Amsterdam: Elsevier, pp. 449–56.
Card, S., Moran, T., and Newell, A., 1983. The Psychology of Human Computer Interaction. Hillsdale,
NJ: Lawrence Erlbaum Associates.
Coutaz, J., 1987. PAC, an object-oriented model for dialogue design. In: H-J. Bullinger and B. Shackel,
eds. Human Computer Interaction INTERACT ‘87, pp. 431–6.
Coutaz, J., 2010. User Interface Plasticity: Model Driven Engineering to the Limit! In Engineering
Interactive Computing Systems. Berlin, Germany, 19–23 June 2010.
Dix, A., 1991. Formal methods for interactive systems. New York: Academic Press. Available through:
http://www.hiraeth.com/books/formal/.
Dix, A., 1995. Dynamic pointers and threads. Collaborative Computing, 1(3):191–216. Available
through: http://alandix.com/academic/papers/dynamic-pntrs-95/.
Dix, A., 2002. Formal Methods in HCI: a Success Story—why it works and how to reproduce it.
Available through: http://alandix.com/academic/papers/formal-2002/.
Dix, A., 2007. Designing for appropriation. In: D. Ramduny-Ellis and D. Rachovides, eds. HCI
2007. . .but not as we know it, Volume 2: People and Computers XXI, The 21st British HCI
Group Annual Conference. Lancaster, UK, 3–7 September 2007. London: BCS. Available from:
http://www.bcs.org/server.php?show=ConWebDoc.13347.
Dix, A., 2016. Human computer interaction, foundations and new paradigms, Journal of Visual Lan-
guages & Computing, in press. doi: 10.1016/j.jvlc.2016.04.001.
Dix, A., and Abowd, G., 1996. Modelling status and event behaviour of interactive systems. Softw Eng
J 11(6), pp. 334–46, http://www.hcibook.com/alan/papers/SEJ96-s+e/
Dix, A., and Ghazali, M., 2017. Physigrams: Modelling Physical Device Characteristics Interaction. In:
B. Weyers, J. Bowen, A. Dix, P. Palanque, eds. The Handbook of Formal Methods in Human-Computer
Interaction. New York, NY: Springer, pp. 247–72.
Dix, A., Ghazali, M., Gill, S., Hare, J., and Ramduny-Ellis, S., 2009. Physigrams: Modelling Devices for
Natural Interaction. Formal Aspects of Computing, 21(6), pp. 613–41.
Dix, A., and Runciman, C., 1985. Abstract models of interactive systems. In: P. Johnson and
S. Cook, eds. People and Computers: Designing the Interface. Cambridge: Cambridge University
Press. pp. 13–22. Available through: http://www.alandix.com/academic/papers/PIE85/.
ECMA International, 1966. Standard: ECMA 4, Flow Charts. 2nd ed. Geneva: European Association
for Standardizing Information and Communication Systems.
Eslambolchilar, P., 2006. Making sense of interaction using a model-based approach. PhD. Hamilton
Institute, National University of Ireland.
Eveleth, R., 2015. DIY prosthetics: the extreme athlete who built a new knee. Mosaic, 19 May 2015,
[online] Available at: https://mosaicscience.com/story/extreme-prosthetic-knee.
Freitas, L., Woodcock, J., and Zhang, Y., 2009. Verifying the CICS File Control API with Z/Eves:
An experiment in the verified software repository. Science of Computer Programming, 74(4),
pp. 197–218. http://dx.doi.org/10.1016/j.scico.2008.09.012
Gaver, W., 1991. Technology affordances. In: CHI ’91: Proceedings of the SIGCHI conference on Human
factors in computing systems. New York, NY: ACM Press, pp. 79–84.
Gibson, J., 1979. The Ecological Approach to Visual Perception. New York, NY: Houghton Mifflin.
Gram, C., and Cockton, G., eds., 1996. Design principles for interactive software. London: Chapman &
Hall.
Green, T., and Petre, M., 1996. Usability analysis of visual programming environments: a ‘cognitive
dimensions’ framework. Journal of Visual Languages and Computing, 7, pp. 131–74.
Harel, D., 1987. Statecharts: A Visual Formalism for Complex Systems. Science of Computer Program-
ming, 8(3), pp. 231–74.
Harrison, M., Masci, P., Creissac Campos, J., and Curzon, P., 2017. The Specification and Analysis
of Use Properties of a Nuclear Control System. In: B. Weyers, J. Bowen, A. Dix, P. Palanque,
eds. The Handbook of Formal Methods in Human-Computer Interaction. New York, NY: Springer.
pp. 379–403.
Kieras, D., and Polson, P., 1985. An approach to the formal analysis of user complexity. International
Journal of Man-Machine Studies, 22, pp. 365–94.
Knuth, D., 1992. Literate Programming. Stanford, CA: Stanford University Center for the Study of
Language and Information.
Manca, M., Paternò, F., and Santoro, C., 2017. A Public Tool Suite for Modelling Interactive Applica-
tions. In: B. Weyers, J. Bowen, A. Dix, P. Palanque, eds. The Handbook of Formal Methods in Human-
Computer Interaction. New York, NY: Springer. pp. 505–28.
Mancini, R., 1997. Modelling Interactive Computing by Exploiting the Undo. PhD. Università degli Studi
di Roma ‘La Sapienza’. Available through: http://www.hcibook.net/people/Roberta/.
Masci, P., Zhang, Y., Jones, P., Curzon, P., and Thimbleby, H., 2014. Formal Verification of Medical
Device User Interfaces Using PVS. In: Proceedings of the 17th International Conference on Fundamen-
tal Approaches to Software Engineering, Volume 8411. New York, NY: Springer-Verlag. pp. 200–14.
Available through: http://dl.acm.org/citation.cfm?id=2731750.
Massink, M., Duke, D., and Smith, S., 1999. Towards hybrid interface specification for virtual environ-
ments. In: DSV-IS 1999 Design, Specification and Verification of Interactive Systems. Berlin: Springer.
pp. 30–51.
Meixner, G., Calvary, G., and Coutaz, J., 2014. Introduction to Model-Based User Interfaces. W3C
Working Group Note, 7 January 2014. Available through: http://www.w3.org/TR/mbui-intro/.
Meixner, G., Paternò, F., and Vanderdonckt, J., 2011. Past, Present, and Future of Model-Based User
Interface Development. i-com, 10(3), pp. 2–11.
Navarre, D., Palanque, P. A., Ladry, J-F., and Barboni, E., 2009. ICOs: a model-based user interface
description technique dedicated to interactive systems addressing usability, reliability and scala-
bility. ACM Transactions on Computer-Human Interaction, 16(4), pp. 1–56.
Norman, D., 1999. Affordance, conventions, and design. Interactions, 6(3), pp. 38–43, New York, NY:
ACM Press.
Palanque, P., and Paternò, F., eds., 1997. Formal Methods in Human-Computer Interaction. London:
Springer-Verlag.
Paternò, F., 1999. Model-Based Design and Evaluation of Interactive Applications. 1st ed. London:
Springer-Verlag.
Pfaff, G., and Hagen, P., eds., 1985. Seeheim Workshop on User Interface Management Systems. Berlin:
Springer-Verlag.
Polak, W., 2002. Formal methods in practice. Science of Computer Programming, 42(1),
pp. 75–85.
Reisner, P., 1981. Formal Grammar and Human Factors Design of an Interactive Graphics System.
IEEE Transactions on Software Engineering, 7(2), pp. 229–40. DOI=10.1109/TSE.1981.234520
Shepherd, A., 1989. Analysis and training in information technology tasks. In D. Diaper and
N. Stanton, eds. The Handbook of Task Analysis for Human-Computer Interaction. Mahwah, NJ:
Lawrence Erlbaum. pp. 15–55.
Smith, S., 2006. Exploring the specification of haptic interaction. In: DS-VIS 2006: Interactive systems:
design, specification and verification. Dublin, Ireland, 26–28 July 2006. Springer, Berlin. Available
through: https://link.springer.com/chapter/10.1007/978-3-540-69554-7_14.
Spivey, J., 1992. The Z Notation: a reference manual. 2nd edn. Upper Saddle River, NJ: Prentice Hall.
Sufrin, B., 1982. Formal specification of a display-oriented text editor. Science of Computer Program-
ming, 1, pp. 157–202.
Thimbleby, H., 2007. Using the Fitts law with state transition systems to find optimal task tim-
ings. In: FMIS2007: Proceedings of the Second International Workshop on Formal Methods for
8
8.1 Introduction
It is a truism that the only reason computers exist is for humans to use them—computers
are not a natural phenomenon that ‘just exist’ (like rocks, flowers, or cats) but they and what
they do are made by people for people. Better computers are easier to use.
Somehow the two concerns, making computers work and making them usable,
have become different specialities with few overlapping interests. University courses may
study one and not the other. Web authoring systems mean that user experience (UX) people
work on web site design and user experience, but the underlying services are done by
completely separate teams of programmers, as are the tools that the UX people use. This
separation misses out a lot of beneficial cross-fertilisation.
We believe that this disciplinary separation is premature and unnecessary. What is taught
in user interface design or human-computer interaction courses raises deep computational
problems. Conversely, what is taught as basic theoretical computer science has applica-
tions directly in user interface design. After all, a programmer understanding computers
and getting them to do things correctly for them is an almost identical problem to users
understanding computers and getting them to do things correctly for them. Pretending that
one is about people and the other about computers over-simplifies.
This chapter explores the split and, specifically, we re-integrate basic computer science
into the user-centred design arena.
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
errors, we would correct them. As computers have become more and more attractive, their
underlying dependability has not improved as much as we think.
To understand these deep problems, it is necessary to go back to the very foundations
of what it means to interact with a computer. When we interact with anything, we, at least
implicitly, use a language. When we interact with humans we may use English or another
natural language, and with programmed computers, whether web sites, PCs, or mobile
phones and even door bells, we use much simpler languages. Language is the mechanism
to convey meaning, be that a complex expression of our thoughts or a simple ‘I’m at
your front door.’
No doubt linguists might object to pressing door bells being a language. It doesn’t
involve vocalisation or written sentences, certainly, but the physical actions are something
like ‘move finger to correct location’ then ‘press.’ This is a language—and perhaps so
seemingly trivial that a computer programmer would probably also overlook the obligation
to think clearly about the language of interaction, even when it seems so obvious it can ‘just
be implemented’—which of course will give it premature semantics. Figure 8.1 gives an
example of how pressing ‘simple’ buttons can go horribly wrong.
All computer programs implicitly define a language of either data or interactions (or both)
that they can accept and give meaning to. Of course, neither the programs nor their developers
may realize that they are giving meaning to a language, but nonetheless the programs act
according to the meaning understood by their developers. If we were building a system to understand English, this
would be very obvious—you’d see things like ‘verb’ and ‘noun’ all over the program—but
most of the time it is too easy to build systems without thinking clearly about the language
and there is nothing visible to show for it. For a door bell, the language of interaction is so
obvious that there is probably nothing left visible that specifies or defines the language, or
even needs to.
Programmers are taught a lot about languages (regular languages, context-free grammars,
and so on). And then they promptly forget them because it is possible to program
anything without going to the trouble of specifying a language clearly. The move
of programming away from low-level details to using packages and complex APIs means
that the languages actually being implemented by the computer are well-hidden from the
programmer.
It follows that programmers typically build user interfaces with the wrong languages,
because they never really thought about what the right languages might be. This, of course,
creates the entire world of HCI: we need methods and processes to find out what the
language should be, and we need ways to help programmers refine the programs to use the
right languages, using iterative design and so on.
Design problems are often accompanied by tell-tale phrases that sound like language
problems:
‘I want to do this, but I don’t know what to do or how to say it’ ... the system (or the user) has an insufficiently expressive language.
‘When I told it to do something, it did something else’ ... the system had a different semantics from the user.
Figure 8.1 A very simple walk-up-and-use user interface. A UK Coast Guard phone (photographed
in 2016) for the public to use to call the Coast Guard in the event of an emergency. In the UK, the
standard emergency number is 999—as the sign makes very clear. But the phone itself does not
have any way to dial 999, so what should you do? It is not very clear! Presumably only three buttons
(labelled 1, 2, 3) were provided either to make it ‘simpler’ or to save money? Figure 8.2 shows the
underlying FSM for this phone, and the FSM shows how to use the phone to correctly call the Coast
Guard.
These problems often arise because the programmer has specified a simpler language than
the user needs or expects. Part of the problem is that the right language is not obvious. On
the other hand, if programmers carefully worked out the right language, they’d spend so
much time planning and working with users they’d never get round to writing any useful
programs—we are not blaming programmers, we are pointing out that implementing the
wrong language is almost always inevitable because life is too short.
If programmers implement a simpler language, something very interesting happens. The
language they implement is easier for the computer (and for them to program) but probably
at the expense of being harder for the human user. The computer program does do something,
but it is not as powerful as users might wish. This is what HCI is all about fixing.
Yet because this problem is invisible, it remains a fundamental problem for HCI.
This chapter puts a name to the problem: premature semantics. Of course HCI tries to
solve the problems caused by premature semantics, but it does so by ‘fixes’—processes, such
as iterative design, that come too late to avoid the premature part of the premature semantics.
If we can recognise and know what premature semantics means, we can take steps to avoid
it, and hence vastly improve normal HCI methods.
We will build up to our discussion of premature semantics by reviewing how user inter-
faces and languages intertwine—using regular languages and other fundamental computer
science concepts—and then premature semantics will fall out of our discussion naturally.
We will show that the idea exposes very clearly some serious issues in user interface design
that have been widely overlooked, and certainly not attributed to a common factor.
We have found many people say, ‘Well that’s what they do!’ But just because we are
familiar with a problem does not mean it is harmless and that we should not try to solve
it. Using computers and other complex devices may well be a form of hazing, which initiates
users into the fraternity of experts; and once hazed, you disdain the naïvety of people who
see problems rather than commit to learning the rituals that would save them and have saved
you. (We will consider a more constructive way of talking about hazing in Section 8.6.4;
more constructive in that it suggests some solutions.)
1. The problems may be small but they are hugely widespread. Problems like these
occur not only in the various designs we mentioned but in the design of all calcu-
lators that we have examined. Such ubiquitous problems may have small individual
cost but have huge cumulative cost, that is, when multiplied up by the numbers of
users of calculators and the number of times they are used by each user. For instance,
nurses use calculators every day to work out drug doses for patients: one wonders
how many design-induced errors happen with catastrophic results.
2. There is a secondary cost to this ubiquity in that we have to take time to teach
our children what the problems are instead of solving them. Schools teach children
not to trust calculators and to try performing calculations in various ways. It is like
teaching people how to drive a car that doesn’t have very good brakes instead of
working out how to make safer brakes.
3. Calculators are only the most familiar representative of the type of device that requires
number entry; there are many variations on the number entry task in many other
devices and we have found that these too have similar or related problems (Thim-
bleby and Cairns 2017; Thimbleby 2015). And when the devices considered are
aircraft altimeters or medical infusion pumps, these fundamental problems become
extremely important. People often die in hospitals from ‘out by ten’ errors where a
drug dose is ten times (or more) too high. How often does it happen that correcting a
number entered like 1..5 to be 1.5 gets mangled by the bad design to become
15? Nobody knows.
4. We have been using Arabic numerals in the West since Fibonacci published his best-
seller Liber Abaci in 1202. Although there were a few issues sorting out decimals and
negative numbers since then, by the twentieth century, certainly, Arabic numerals
were perfectly well understood and, surely, there is now no excuse to program
number entry systems improperly.
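Well-formed decimal numbers are, in fact, a regular language, so a checker that rejects ill-formed entries like 1..5 outright, rather than silently mangling them into 15, is easy to build. A minimal sketch in Python; the regular expression and the rejection policy are our own illustrative choices, not any particular device’s:

```python
import re

# A well-formed decimal number: digits, optionally followed by a
# decimal point and more digits. This is a regular language.
NUMBER = re.compile(r"\d+(\.\d+)?")

def accept_number(keystrokes: str) -> float:
    """Accept the keystroke sequence only if it spells a well-formed
    number; otherwise reject it rather than silently 'correcting' it."""
    if NUMBER.fullmatch(keystrokes):
        return float(keystrokes)
    raise ValueError(f"ill-formed number entry: {keystrokes!r}")

print(accept_number("1.5"))   # prints 1.5
try:
    accept_number("1..5")     # the mangled entry discussed in the text
except ValueError as e:
    print(e)                  # rejected, not turned into 15
```

The essential design choice is that the language of valid entries is written down explicitly, so the program can refuse input outside it instead of inventing premature semantics for it.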
In some cases, the change of state will be quite innocuous. For instance, pressing 1 on
a calculator can make a 1 appear in the display. In some cases, the change of state is quite
important: pressing = on a calculator will lead to the display being cleared and the
result of some calculation being presented, as well as getting the calculator ready to receive
new numbers for further operations; the change of state here is quite complex. In some cases,
a change of state may be very important: changing a setting on an autopilot in a
commercial airliner can make the difference between setting a target altitude and setting the
rate of descent, which can have very different end states (Degani, 2004).
From the earliest days of computer science, it has been recognized that many algorithms
and interactions can be understood as the system moving through a series of different
states, and this led to consideration of an important abstract idea called Finite State Machines
(FSMs) or sometimes Finite State Automata (FSAs).
FSMs are very well established theoretical tools in computer science, though not always
considered as practical tools. They have also been used in HCI to model interactions and to
allow various sorts of analysis.
We can explain FSMs in many essentially equivalent ways:
• Pictorially, each state of a FSM can be drawn as a circle, and each transition from
one state to the next as an arrow joining two circles. Typically the circles will be
labeled with the states or modes they represent (like ‘on’ or ‘off ’) and the arrows
will be labeled with the names of the buttons or other actions that cause the changes
of state. We’d expect, for instance, an off arrow to point to the off state circle. Note
that there may be lots of arrows for the same action—many states can be switched
off by pressing OFF so there may be lots of off arrows, one from each state (at least
from those that can be switched off). However, there is only one circle for each state;
there is usually only one Off circle, for instance.
See Figure 8.2 for an example. As shown in Figure 8.2, normally one state will be
marked as the starting state—typically the state when the device is off.
• Mathematically, a FSM can be understood as an abstraction of this pictorial representation
in the form of an abstract graph, where circles correspond to elements
of a set of states, called nodes, and arrows to ordered pairs of nodes, called edges, that
are mapped to labels via a labelling function. Graph theory is a substantial part of
mathematics, though confusingly ‘graph’ is also used as the term for diagrams that plot
functions like y = x².
• In software, an FSM can easily be represented by numbering each state, numbering
each user action (e.g., each button), then using an array of state numbers T, to
represent the FSM. When the user does action b in state s, the state is simply
updated by doing the assignment s = T[b,s].
• Using algebra we can describe a FSM very abstractly. There is a set of states S and a
set of transitions (typically user actions, button presses, hand waving, etc.) T. Each
transition t ∈ T is a function that changes the current state to the next state, t: S → S.
A FSM starts in an initial state s₀ ∈ S and is operated on by a sequence of transitions
t₁; t₂; t₃; …; tₙ, and it will then be in state tₙ(…t₃(t₂(t₁(s₀)))…). Typically a special
[Figure 8.2: a three-state diagram, with states ‘handset on hook’, ‘handset off hook’, and ‘call connected’, linked by ‘pick up’, ‘put down’, and digit-button (1, 2, 3) transitions.]
Figure 8.2 A drawing of the finite state machine for the Coast Guard phone shown in figure 8.1.
The phone starts in the state indicated by the • arrow (pointing to the bottom left circle, ‘handset
on hook’), then user actions follow arrows to the other two states. For example, in the start state,
pressing any of the digit buttons (1, 2, 3) does not change state, so nothing happens—in this state
the buttons do nothing. The user has to pick up the phone handset to start anything happening.
An analogy may help. We can ‘draw’ the number 6 as six dots, : : :, and this familiar way
of drawing numbers makes numbers very easy to understand. Adding one to six is easy to
draw, and when we were children drawings like this helped us learn how numbers work. But
we can’t usefully draw 1,007 this way—it will look much like any other large number, like
999 for instance. Obviously, we can only sensibly draw relatively small numbers, but never-
theless we know that large numbers are still very useful even though we cannot draw them
reliably.
Note that we can add 6 and 2 by simply putting our drawing of 6 next to a drawing of 2:
: : : + : = : : : :
We get 8 with no effort! In other words, we can add up without consciously adding. Indeed,
if we delete the + symbol, the numbers ‘just’ add themselves.
Similarly, all programs are using FSMs, even if the FSMs cannot easily be drawn or
visualized. A bit of program like if(x<0)x=-x is doing something an FSM can do,
but you cannot see the FSM in the code—just as in : : : : you can’t see the + sign.1
Another problem FSMs have is in the ‘programmer’s imagination,’ which is that pro-
grammers think FSMs are technically limited compared to Turing Machines and general
programming. This is true in infinite theory, but not in practice. Any real PC is a finite state
machine—it is not an infinite machine like a Turing Machine. The practical truth is that
FSMs do not have good ways to abstract them, so programmers tend to write in Java or some
other convenient programming language. Their programs are strictly equivalent to (big)
FSMs—not necessarily the same ones. Thus programmers ‘think in’ Java, not in FSMs, and
the elegance and nice properties of the (correct) FSMs fail to be abstracted coherently.
1 One of us has a car whose user manual has the warning that its clever computer features do not help the driver
overcome the laws of physics. In other words, the car obeys the laws of physics even though the driver cannot
see them, and—as the manual tries to warn—even if the driver foolishly denies the laws, they are there anyway.
Similarly, the laws of FSMs are obeyed even if they cannot be drawn or seen.
down (Hick’s Law),2 here further delaying calls to the Coast Guard—it’d be better to label
all the buttons 999 anyway! (Notice the button labels shown in Figure 8.1 are blank and
could have been filled in with 999—or, better, each button could have been labelled 999
instead of 1, 2 or 3.)
It would be easy to write a program to automatically find such errors, and many others,
in the graphs of designs before they are built, but unfortunately here it is too late for the
Coast Guard to fix the problem easily.
Even from thinking about simple examples like an oven temperature setting, it is clear
that what state the system is in has a significant impact on what actions are possible or safe.
Where a state is not otherwise visible, for instance when the oven is above 40°C, it is important to
indicate this to the user some other way such as a warning light or door lock. Again, we can
automatically analyse FSMs to ensure they have such important properties.
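Such automated property checks are ordinary graph algorithms. As an illustrative sketch (the little oven model and its transitions are invented for this example, not taken from any real device), a breadth-first search over the FSM can confirm that every reachable state can still get back to a designated safe state, exposing ‘trap’ states:

```python
from collections import deque

def reachable(fsm, start):
    """All states reachable from start, by breadth-first search.
    fsm maps state -> {action: next_state}."""
    seen, frontier = {start}, deque([start])
    while frontier:
        for nxt in fsm[frontier.popleft()].values():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def no_traps(fsm, start, safe):
    """Check that every state reachable from start can still reach safe."""
    return all(safe in reachable(fsm, s) for s in reachable(fsm, start))

# A deliberately faulty design: once heating, there is no way back.
oven = {
    "off":     {"power": "idle"},
    "idle":    {"set": "heating", "power": "off"},
    "heating": {"set": "heating"},   # trap: no route back to 'off'
}
print(no_traps(oven, "off", "off"))  # prints False: the checker finds the trap
```

The same few lines of analysis apply unchanged to any FSM in this representation, which is the point: the property is checked mechanically, over every state, rather than by a designer trying to imagine all the permutations.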
2 Hick’s Law states that human choice time is proportional to log(n + 1) for a selection between n choices. In this
example, the user interface provides three buttons when at most one would do; hence, by Hick’s Law (and ignoring
other causes of delay), the user is made log 4/log 2 = 2 times slower than need be.
Asaf Degani (Degani, 2004) uses state machines to consider the interaction with autopi-
lots in airplanes. He shows how interactions with the autopilot can lead to unnoticed state
transitions, particularly when the environment is providing inputs to the device such as
the plane coming close to its target altitude. Where such transitions are not made visible
to the pilots, it is impossible for the pilots to know what state the autopilot is in and
therefore reliably predict the consequences of their interactions with it. Degani shows how
this problem of hidden state transitions has led to fatal and expensive accidents.
With FSMs, it is also possible to add information to the transitions. For instance, if
numbers between 0 and 1 are added to the edges to correspond to the probabilities of a
particular button being pressed, then it is possible to calculate the probabilities of a system
ending up in any particular state (in fact, we end up with a Markov model). We can then
estimate how many steps it takes to get to any states of interest. We have previously used this
idea to show that an error, say in setting a microwave to heat something up for a fixed time,
can lead to users being unable to set the microwave at all. But, also, the introduction of a
Quit key, and the knowledge to use it, can greatly help stop people from getting trapped when
that happens (Thimbleby, Cairns, and Jones, 2001).
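A minimal sketch of the idea (the states and probabilities below are hypothetical, not the microwave model from the cited paper): annotate each transition with the probability of that button press, then propagate a probability distribution over states one keypress at a time.

```python
from collections import defaultdict

# Transition probabilities: P[state][next_state] = probability of the
# button press that causes that transition. Each row sums to 1.
P = {
    'start':   {'start': 0.2, 'setting': 0.8},
    'setting': {'start': 0.5, 'heating': 0.5},
    'heating': {'heating': 1.0},
}

def step(dist, P):
    """Advance a probability distribution over states by one keypress."""
    new = defaultdict(float)
    for state, p in dist.items():
        for nxt, q in P[state].items():
            new[nxt] += p * q
    return dict(new)

dist = {'start': 1.0}
for _ in range(2):
    dist = step(dist, P)
# After two presses, the probability of having reached 'heating' is 0.8 * 0.5
print(dist['heating'])  # 0.4
```

Iterating `step` a few more times estimates how many keypresses it typically takes to reach any state of interest, which is exactly the Markov-model analysis described above.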
It can be seen that FSMs make a bridge between the theoretical tools of computer science
and the ability to analyse interactive systems. It is probably also clear that effective analysis
of real systems requires support from suitable tools that can manage the generally large
sizes of real FSMs. Some such tools, like MAUI and others (Gow and Thimbleby, 2005;
Thimbleby, 2010) exist and can help. However, analysing FSM models is one thing and
building working programs is another. What is still needed is a way to move fluidly between
the analytical power of FSMs and the actual systems being developed, so that a programmer
can do the job of programming but also exploit the analytical strengths of FSMs for
interaction design.
Unfortunately because programmers rarely use explicit FSMs (e.g., as simple as the array
we illustrated above), even these simple questions are usually very difficult to answer. In fact,
they are so hard to answer they are rarely even asked. It is arguable that major advances in
HCI, such as the initial explorations of GOMS and KLM (Card, Moran, and Newell, 1980),
floundered because what an HCI or UX professional is interested in is just too hard for a
programmer to address. Thus GOMS was used for really very trivial systems, and still hasn’t
been used for anything of even ordinary complexity.
226 | systems
A regular expression to capture the format of valid number plate registration numbers is:

L = ’A’ | ’B’ | … | ’Z’
D = ’0’ | ’1’ | … | ’9’
Pl = LLDD’ ’LLL
There are many other notations for regular expressions that can be used to say the
same thing. Many readers will be more familiar with the Unix notation, which in this case
would be
[A-Z][A-Z][0-9][0-9] [A-Z][A-Z][A-Z]
and even then this is only one way Unix can express registration numbers. Our own notation
used above has the advantage that we can name regular expressions—such as using L to
represent letters—and then use the name L later to save rewriting the same idea multiple
times (making it much less likely we make mistakes). For example, if we sorted out how to
handle the fact that the UK registration plate font makes the letters/digits I and 1, and O and 0, identical,
we only have to fix this in two places (in the definitions of L and D), rather than in the seven places
where letters and digits are needed. Conversely, if there is a bug in our regular expression,
then the same bug will reappear up to seven times: this might sound like a disadvantage, but
we are also more likely to spot bugs in the first place, which is a huge advantage, and when
we fix any bug we find, we actually fix all seven occurrences at once. This is a very powerful advantage.
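The naming advantage carries over directly to, say, Python's Unix-style notation (a sketch; the composition below is ours):

```python
import re

# Define L (letter) and D (digit) once; fixing the I/1 and O/0 confusion
# later would mean editing only these two definitions, not seven places.
L = '[A-Z]'
D = '[0-9]'
Pl = re.compile(f'{L}{L}{D}{D} {L}{L}{L}')

print(bool(Pl.fullmatch('AB12 CDE')))  # True
print(bool(Pl.fullmatch('AB12CDE')))   # False: missing space
```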
For real UK registration numbers there are actually more rules than this. For example, the
pair of digits in the middle of the number plate represents the year of manufacture of the car,
so for instance anything above 17 but below 50 is not currently valid (at the time of writing)
but we did not specify this above. Ignoring such details, then, more or less Pl defines valid
number plates and would, for instance, exclude many number plates from outside of the UK.
Thus, regular expressions can be used to both define valid things and to check whether given
things are valid or not.
The three basic operations permissible in regular expressions are sequence, choice, and
iteration.
Sequence is the operation of taking two regular expressions and having one
expression follow on from the other. This is precisely what is seen in defining Pl; a valid
registration is a letter followed by a letter followed by a digit . . . In most regular expression
notations, there is no special symbol for sequence: thus, ‘LD’ means L followed by D.
Choice is that a new regular expression can be made out of two existing ones; it is either
one or the other of them. That is, for regular expressions A and B, C = A|B means C is either
A or B. Here the symbol “|” is the notation for choice. The expression L above is a shorthand
listing the choice options: L3 is one of the characters of the upper-case Roman
alphabet.
3 This example makes it look like choice allows twenty-six choices—more than two choices! In fact,
if we write the choices out explicitly each with exactly two choices: ((’A’|’B’)|’C’)|…|’Z’ or as
’A’|(’B’|(’C’|…|’Z’)), etc, these regular expressions are all the same, so the brackets are unnecessary. In
other words, choice is associative.
Finally, a given expression can be iterated (repeated) a number of times to form a new
expression, represented as B = A∗ which means B is any number of repetitions of A
including no repetitions. If a + is used instead of a ∗ then this means B is at least one
repetition of A. Sometimes it is useful to be specific, and write An for repeating n times
or Am−n for repeating between m and n times. We did not use these ideas above because car
registration numbers are too simple to need them—writing Pl = L2 D2 ’ ’L3 is more
confusing than Pl = LLDD’ ’LLL, which is what it means. In more complicated cases,
though, the additional notation can be very useful (e.g., it is much easier to tell the difference
between D8 and D9 than DDDDDDDD and DDDDDDDDD).
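In Unix notation the counted repetitions D8 and D9 are written with braces; a quick sketch:

```python
import re

# {n} repeats the preceding expression exactly n times.
D8 = re.compile('[0-9]{8}')
D9 = re.compile('[0-9]{9}')

print(bool(D8.fullmatch('12345678')))  # True
print(bool(D9.fullmatch('12345678')))  # False: one digit short
```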
Regular expressions are the tool behind many familiar user interactions like making sure
users enter a date in the correct format or enter a number when they should. However, even
in these simple interactions where regular expressions are useful (and would help ensure
user interfaces did what they were supposed to), they are often not applied when they
might be.
Like FSMs, regular expressions also have a public image problem. They are fantastic
for performing string operations, like finding car registration numbers in a big file of text.
Many programming languages (e.g., JavaScript, PHP, etc) provide powerful built-in features
for using regular expressions for string operations. But just because they are great at string
operations does not mean that is all they can do, and in particular that they should not be
used right across all aspects of user interface design.
Grete Fossbakk lost 500,000 Norwegian kroner (about US$60,000) due to a poorly
programmed user interface (Olsen, 2008). Fossbakk entered an account number but accidentally
pressed one digit twice. The resulting account number was of course then too long,
but the bank’s system truncated it, and the truncated number happened to be a valid account
number, which unfortunately belonged to another person. Fossbakk did not notice this
error, and confirmed her money transfer—to the wrong account. Of course the bank argued
Fossbakk had confirmed what she wanted to do, but she had confirmed a simple error that
the bank should have—and could have—detected.
A regular expression in this case could have been used to validate Fossbakk’s entry,
and would have detected her error because it was too long. For example, D8 would have
done. The point is, had the programmer specified the user interface (here, using a regular
expression), the problems would have been obvious, and a solution found (e.g., warning
the user). In fact, using regular expressions to validate user input is an obvious professional
decision. In this case, though, it seems the programmers did not specify anything, but rather
‘it just happened’ and then nobody thought about it.
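A sketch of the kind of check that was missing (we assume an eight-digit account number, as the D8 in the text suggests; real account-number formats have further rules such as check digits):

```python
import re

ACCOUNT = re.compile('[0-9]{8}')

def check_account(entry: str) -> str:
    """Accept the entry only if it is exactly eight digits; never truncate."""
    if ACCOUNT.fullmatch(entry):
        return entry
    raise ValueError(f'invalid account number: {entry!r}')

print(check_account('12345678'))
# A doubled digit makes the entry nine digits long: rejected, not truncated.
try:
    check_account('123456788')
except ValueError as e:
    print(e)
```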
The user interface (if designed properly) should have forced her or the bank to re-check
the input more carefully. This may seem like a rare and unlikely feature of one particular
bank but it is not. Figure 8.3 shows some code taken from the UK Lloyds Bank’s current
(2016) web pages for the same task. Lloyds Bank uses HTML to truncate account numbers
to 8 digits, and it does so silently. Because the web browser (thanks to the HTML) truncates
the number, the bank’s servers have no idea that an invalid number has been entered.
Though we rarely think about it, Arabic decimal numbers (and indeed all number
systems if they are reliable) have a well-defined structure that can be captured by a regular
expression:
Account number:
<input
type = "text"
autocomplete = "off"
id = "pnlSetupNewPayeePayaPerson:frmPayaPerson:stringBenAccountNumber"
name = "pnlSetupNewPayeePayaPerson:frmPayaPerson:stringBenAccountNumber"
maxlength = "8"
/>
Figure 8.3 HTML for a user to enter an account number, copied from a Lloyds Bank web site in
2016. Notice the code maxlength="8" (on the penultimate line) which will silently truncate
any account number to a maximum of 8 digits: neither the user nor the bank’s server will know.
These rules may seem unduly complicated (and we did not include any rules for signs, +
and -) but they are not really complicated:
The final rule 6 is saying that a number is either a whole number (an integer) or an integer
followed by a decimal fraction. Rule 4 is saying that a valid integer does not start with 0 unless
it is actually 0, and rule 5 says that the decimal fraction of a number does not end in a 0.
Actually, there are contexts where that is valid, for instance to show the number of significant
figures but in general, if we are representing a precise value then rule 5 is necessary.
These rules make Num specify whole numbers (which do not have a decimal point) and
fractional numbers (which do have a decimal point). We have not permitted a number to
start with redundant zeros, and we have not allowed a fractional number to end with a
zero: thus 00050 is not allowed to represent 50, and 0.50 is not allowed because
of its trailing zero. Moreover, we do not allow ‘naked decimal points’—so neither .5
and 5. are allowed. The naked decimal point rule is inspired by the Institute for Safe
Medication Practices (ismp.org), which forbids naked decimals as they encourage errors—
for instance, a nurse could misread .5 as 5 and give somebody a drug dose ten times
too high.
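Our reading of these rules can be collapsed into one Unix-style regular expression (a sketch; signs are omitted, as in the text, and the composition is ours):

```python
import re

# Integer part: either 0 itself or a nonzero leading digit (rule 4);
# optional fraction: a point followed by digits not ending in 0 (rule 5);
# a number is an integer, or an integer plus a fraction (rule 6).
# Naked decimals like .5 and 5. do not match.
NUM = re.compile(r'(0|[1-9][0-9]*)(\.[0-9]*[1-9])?')

for s in ['50', '0.5', '00050', '0.50', '.5', '5.']:
    print(s, bool(NUM.fullmatch(s)))
```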
When it comes to a task like number entry, it seems unlikely that checking that a number
is actually a valid number would make much difference to anyone—and few user interfaces
check anyway. However, we found that when users make errors entering numbers, simply
checking whether the result is a valid number reduces the risk of high impact errors by a
half (where by high impact we mean when the number, perhaps a drug dose, is out by a
factor of 10 from the correct value) (Thimbleby and Cairns, 2010).
Unfortunately many programmers learn about regular expressions as a theoretical computer
science curiosity and never realize they are very useful in user interface design!
4 Technically, a FSM can end up in any state, but a regular expression can only end up either succeeding or failing
to match its input; hence, regular expressions are exactly equivalent to special FSMs called finite state acceptors,
which are FSMs with exactly two designated final states. Such final states can always be added to any FSM to make
it into an acceptor.
5 There are some easily addressed technical differences between FSMs (and non-deterministic FSMs) and REs,
but it is beyond the scope of this chapter to worry about them further.
1. How do buttons work? What are the timing issues for double clicks? What is a long
hold? Do buttons audibly click when pressed? What happens when the user clicks
on a button and moves out of its active region before releasing the mouse?
2. What should the application do when a user clicks on a button? What are its modes,
and how do buttons change their meanings in each mode?
3. How does the application achieve what it should do?
Step 2 above can often be done by a FSM, and doing so creates a useful wall—a separation
of concerns—between steps 1 and 3, which makes those steps easier to implement reliably
and consistently. Otherwise, there would be a temptation to implement particular buttons
that do particular things one at a time, which would entwine the button and what it does,
and hence each button would likely be implemented in a slightly different way.
A corollary of separation is that program code gets reused. Instead of every button being
implemented in its own way all the buttons can share the exact same code. This means that
any bugs in button design become apparent quickly—because testing any button is the same
as testing all of them. In contrast, when buttons are implemented one at a time, then each
button must be rigorously tested, and that is hard work. Separation of concerns concentrates
design effort into a few important places, and those places get greater scrutiny than without
the separation.
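This table-driven style can be sketched as follows (a hypothetical music player; the states and buttons are ours, not from the chapter):

```python
# One transition table instead of per-button handlers: testing one button
# exercises the same shared code as testing all of them.
FSM = {
    ('stopped', 'play'):  'playing',
    ('playing', 'pause'): 'paused',
    ('playing', 'stop'):  'stopped',
    ('paused',  'play'):  'playing',
    ('paused',  'stop'):  'stopped',
}

def press(state: str, button: str) -> str:
    """Single shared handler for every button; unknown presses are ignored."""
    return FSM.get((state, button), state)

state = 'stopped'
for button in ['play', 'pause', 'play', 'stop']:
    state = press(state, button)
print(state)  # stopped
```

Every button goes through `press`, so a bug in the dispatch code shows up whichever button is tested: the separation of concerns the text describes.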
Not only does it make it easier for the programmer (and hence ultimately easier for the
user because the program is better implemented) but the separation of concerns means the
meaning of the interaction has been separated out and becomes a thing-in-itself. Once a FSM
is used, it can be analysed.
For example, whatever a user interface looks like (step 1 above) and whatever it is doing
(step 2 above) we probably do not want the user to get stuck. There are many ways of
expressing this design requirement, but suppose it is expressed as ‘whatever the user is
doing, they always have the option to do anything else.’ If this is the design requirement,
it translates to a trivial test on the FSM—is it strongly connected? Another form of the
requirement (with slightly different connotations) is ‘the user can always undo whatever
they have done.’ If that is the design requirement, again it translates into a simple test on the
FSM: is it symmetric?
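Both checks are a few lines of graph code over the FSM's transition relation (a sketch with hypothetical states):

```python
def reachable(edges, start):
    """All states reachable from start by following transitions."""
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for a, b in edges:
            if a == s and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def strongly_connected(states, edges):
    """'The user always has the option to do anything else':
    every state can reach every other state."""
    return all(reachable(edges, s) == set(states) for s in states)

def symmetric(edges):
    """'The user can always undo': every transition has a reverse."""
    return all((b, a) in edges for a, b in edges)

edges = {('locked', 'open'), ('open', 'locked'), ('open', 'broken')}
print(strongly_connected({'locked', 'open', 'broken'}, edges))  # False
print(symmetric(edges))  # False: nothing leads back from 'broken'
```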
Often, strict requirements like these simple examples are problematic or raise further
design questions that need carefully exploring. For example, once you start using a fire
extinguisher, you cannot go back to any state of not having used it—so fire extinguishers fail
both of the properties above, but they are no less useful for failing them. On the other hand,
you might want to undo activating a fire alarm, because it may be an accidental activation,
and setting off a fire alarm is expensive (e.g., all staff have to stop work and leave the building).
Once a FSM is used, critical user interface design questions can be asked, questions can
be answered, and the design trade-offs are very easy to explore. These are advantages for
programmers, designers, and users that conventional programming rarely benefits from.
That is, the semantics of the input are forced to be a number all the time. For example, if
you try to enter two decimal points in a calculator entry, the second one leads to no change
because a valid number cannot have two decimal points. More generally, we call this issue
premature semantics: the semantics of an input are fixed (to be a valid number) before the
entry is finished. Note that the premature semantics affects the user—an input becomes a
number before they have finished (they accidentally entered a non-number, but something
was mangled to force it to be a number). That means the user’s ability to correct their own
error is compromised; for example, the DEL key won’t work as they expect.
The reasons why calculators are like this are easy to imagine. Early electronic calculators
would hold the input number in a register inside the calculator. The device could block any
input that would lead to an incorrect number or to overflowing the size of the register. And,
because the physical devices have a limited number of buttons, this was easy. In addition, the
early devices did not allow delete, as we would understand it now, but only a Clear function
which cleared the value in the register rather than removing individual digits.
Though they were almost certainly not programmed this way, it is straightforward to
understand the interaction of entering a number in an early calculator through a FSM. With
the underlying principle that the calculator always displays a valid number, there is (or there
is in principle) an underlying FSM that accepts keypresses that lead to valid numbers and
blocks keypresses that do not. So the display starts by showing 0 (a valid number even
though the user has not even started entering a number) and as the user presses keys, the
digits are accepted and appended to the display. A single keypress of a decimal point is a
valid state transition in the FSM so the point is appended to the display. Having entered
states with a decimal point, any further decimal point keypresses do not define valid state
transitions so they are blocked. Pressing Clear resets the display and the FSM back to the
initial state.
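A sketch of such a blocking FSM (our own states; the leading-zero rules, and replacing the initial displayed 0, are omitted for brevity):

```python
# States name which keypresses are still legal.
TRANSITIONS = {
    ('int', 'digit'):  'int',
    ('int', 'point'):  'frac',
    ('frac', 'digit'): 'frac',
    # ('frac', 'point') is absent: a second decimal point is blocked.
}

def press(display, state, key):
    kind = 'point' if key == '.' else 'digit'
    nxt = TRANSITIONS.get((state, kind))
    if nxt is None:
        return display, state       # blocked: no visible effect
    return display + key, nxt

display, state = '0', 'int'
for key in '1.5.2':
    display, state = press(display, state, key)
print(display)  # 01.52 — the second point was silently blocked
```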
It is worth contrasting this with text entry. Because valid words in, say, English do not
have a rigid structure that can be easily captured with a regular expression (or arguably
any formal expression), it is not so tempting to enforce premature semantics in text entry.
A spell checker can only identify an error once a word has been entered and even then,
as we commonly experience, a correctly spelled word is not necessarily semantically rite.
Sentence level checks on semantics are even more difficult. Certainly, there is no way to
enforce semantics in text entry as the text is being entered.
Some systems, nonetheless, do impose premature semantics on text. As you write the
space after, say, ‘thankfull’ it magically turns into ‘thankful.’ But if you are copy typing and
not watching the screen, you would type ‘thankfull# DEL DEL #’ (where # denotes the space
key) and you’d end up with ‘thankfu’, since both you and the system deleted the extra ‘l.’
We now consider how premature semantics leads to two specific types of problem:
problems of construction and problems of representation.
Both Clear and DEL are useful functions. Clear is an all or nothing function which forces the user to
re-start an entry from the beginning. This is of course frustrating if only the last keypress
was incorrect. DEL is more convenient, as it just deletes individual keystrokes and is easier
to control. Unfortunately, as we saw above, DEL is where a lot of problems of premature
semantics come in and create strange special effects a user has to be aware of—and of course
users probably are not aware of these special cases, especially when they are trying to correct
their own errors.
The key DEL can have one of two meanings: delete the last (rightmost in Roman-based
systems) character in the display or delete the last key pressed on the keyboard. In text
entry, these two meanings almost always coincide. If we type A B C the display will
show ABC . Clearly, here the last key pressed was C and the rightmost key displayed is
the same, namely C . Pressing DEL will delete it, leaving AB in the display, which seems
perfectly obvious.
Yet because of the premature semantics, this is not how things always happen.
When a calculator blocks a second decimal point, pressing DEL cannot delete the last
keypress because the last keypress had no action visible in the display. Depending on how
the calculator is implemented, it may delete the rightmost character or do nothing or delete
a digit, or even something else. Premature semantics is the source of problems described in
Section 8.3.1 of this chapter.
Or consider when a user presses the negation ± key on a calculator. When a number
changes sign, the negative sign appears (or disappears) as a prefix in the display and so
deleting the last keypress should remove the negative sign but deleting the last character
shown in the display should remove the rightmost digit. You can’t have it both ways. Also,
if there is only one digit (possibly a zero), how can it be deleted when the calculator always
displays a number? These choices have very different effects, as the following table makes
clear:
And indeed, from the point of view of the calculator, it is not possible to know which is
actually required.
A current example is the Apple iPhone calculator (early 2017): the following sequence
of user keystrokes will make it crash AC ± DEL ± DEL . The sequence of keystrokes
has a simple story: the user keyed ± by mistake and tries to correct it, but that doesn’t
work, so they try ± again (changing sign twice should cancel it out); that doesn’t work
either, so they hit DEL . Whoops, the calculator crashes. This is a bug—actually, it’s the
third bug we’ve seen in three keystrokes. Apple have some of the world’s best programmers,
but evidently they do not always make reliable interactive programs, even when they are
this simple. We should not be blaming Apple; we should be wondering why it has taken us
so long to notice that programming interactive systems is very hard. And the consequence
of it being so hard is that it is rarely done well.
Now, summarizing this to pull out the morals:
It is now possible in the UK to order a passport online. The form asks for several separate
items of information before asking for a digital photograph. This must, understandably, have
very specific formatting and styles which you are not informed of until you reach that page
and have already entered half a dozen other items of information. But you cannot go past this
page or save your form—because it is not yet correct—so if you do not have the photograph
ready, you have to go away and start again when you come back (the form will have timed
out for security reasons).
This is premature semantics during construction. The form is requiring that it is always
correctly filled out (and certainly must be if it is to be saved), including requiring that
all fields that need completing have indeed been completed even though later pages still
have completely unfilled sections. Paper forms, of course, cannot enforce such premature
correctness, allowing users a much more flexible workflow.
It is worth noting that the UK Government has taken usability of its systems very seriously
and has made a lot of effort to make all its online resources as accessible as possible to as wide
a portion of the population as possible. Thus, even when usability is taken seriously, it is still
possible to end up with problematic interactions because of premature semantics. As we’ve
said, we think it is a shame that HCI (and UX in particular) has become quite separated
from computer science. You cannot have good UX (or good HCI) without a foundation of
good CS.
and so results in rounding errors when displayed back to the user. This is premature
semantics at its worst: the meaning of the number is altered even before the user has finished
entering it!
In some cases, problems of representation can also become problems of construction.
Modern calculators now have the capability to display numbers correctly in more sophisticated
forms, such as standard exponent form, for example 6.022 × 10²³. Early calculators
that used seven-segment displays could not represent this format correctly, so instead they
compromised with a notation like 6.022E23, with E standing for Exponent (i.e., the
exponent of a power of ten, so E23 means what we normally write as 10²³). In some early
scientific calculators, a special key allowed users to enter this notation, but not directly. Oddly,
modern calculators, such as iOS calculators, display numbers in this historical and
outmoded format: thus the iPhone (iOS v7.1.2) calculator displays 6.022e+23. Not only
is the representation a historical peculiarity, it interferes with construction. The EE key (not
e !) in the same calculator allows users to enter numbers in this exponential form but acts
like an operator (like + ) and = has to be pressed to find out what it has done. This means
DEL cannot be used conventionally: correction is not possible without a complete clear
and re-entry! It looks like the programmers have not thought this through, yet they think ‘it
works’ (and probably passes all their tests)—that is, it’s premature semantics at work.
• The programmer will optimize their program, doing bit operations, writing in assembler,
and more. Optimizations make programs more efficient, but they do so by
relying on assumptions; and the more assumptions a programmer makes, the more
likely they are to make premature assumptions, further embedding premature
semantics in the program.
These sorts of issues are familiar to programmers, but if you agree with them you have
already fallen into the trap of premature semantics!
• What do we do if the user does not enter a security code? This is undefined, and
undefined is not a value an int can represent. Therefore the programmer must do
a lot of work to handle this case, as no programming language helps directly. Many
quality programming languages, Java being a good example, force the programmer to
ensure that their variable code is defined—but it is possible that the user does not
define it. If it is defined, it will have to be a number like zero (or 000) but that is not
what the user did if the user has not entered it. This is a conflict.6
• What do we do if the user enters an invalid code, perhaps their credit card’s expiry
date 11/19 by mistake? This is an error, and error is not a value for an int either.7
These are both fundamental problems created by premature semantics. If either unde-
fined or error occurs, what should the program do? Can the user still save their form they
are entering the credit card details in? Even though saving a form makes perfect sense—
especially when the user has made a mistake they want to think about—the program (as
usually written) cannot offer the service, because undefined and error themselves cannot be
saved as integers.
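One way out of the conflict is not to make the field an int prematurely: keep exactly what the user typed, and derive a status from it on demand (a sketch; the names are ours):

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class SecurityCodeField:
    raw: Optional[str] = None   # None: the user has not entered anything

    def status(self) -> str:
        if self.raw is None:
            return 'undefined'
        if re.fullmatch(r'[0-9]{3}', self.raw):
            return 'valid'
        return 'error'          # e.g. '11/19' typed by mistake

# All three cases can be saved and shown back to the user unchanged.
print(SecurityCodeField().status())         # undefined
print(SecurityCodeField('123').status())    # valid
print(SecurityCodeField('11/19').status())  # error
```

Because the raw text is what gets stored, saving a half-finished or erroneous form is no longer impossible: undefined and error are representable values rather than exceptions.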
The programmer thought three digit security codes could obviously be implemented
as an int, but—typically later, sometimes even after they have left for another job—
somebody has to start hacking their program to handle ‘special cases’ to keep the usability
professionals happy. These well-intentioned iterative ‘improvements’ are then likely to lead
to further bugs.
Note. When we explain a problem as we did above, not only did we choose a problem
that was simple enough to explain but you have the privilege of hindsight: those errors
we described seem very obvious and easy to avoid, don’t they? But if you think you could
correctly implement a three-digit security code (thanks to our lucid explanation), pause to
think how many people in the world thought they could implement years—how hard could
a four digit number be? Yet we ended up with one of the world’s largest computer fiascos,
the Y2K problem.
6 Forcing variables to be defined reduces the risk that programs will crash, but it does not increase the chances
the programs do what they should. Here, programming language designers (and their compilers) force premature
semantics on users of their languages. Programming languages like Swift allow variables to be undefined, and they
rigorously check how programmers manage undefined values; this helps.
7 No programming language handles errors well—it is still a research problem. This is why many user interfaces
say ‘Unexpected error’—if they had correctly expected the error they would have fixed it!
Report Error
Cancel
Figure 8.4 What the ‘to do list’ app Omnifocus does when it encounters an error it had not
anticipated in its premature implementation. It successfully detects an ‘error’ but has no idea what
to do. Note that the dialog box does not give the user sufficient information to make a wise
decision either; whichever option they choose, they risk losing information. Ironically a ‘to do list’
is supposed to help users remember what to do, not help them forget things!
Figure 8.4 shows what still happens in 2017—almost two decades after Y2K. Here a
programmer thought that a to do list was a simple database, a highly structured type of data.
They then thought they would allow the user to run their to do list from several places
(their desktop, their mobile phone, their tablet). And then they discovered that they had
not implemented distributed error handling correctly. Their program just drops out in a panic, leaving
the user an unanswerable question. Note the premature semantics: Omnifocus prematurely
decided to program perfect databases (analogously to calculators implementing ‘perfect’
numbers but not supporting what users do). For whatever reasons, the programmer’s
utopian (premature semantics) databases do not exist in the real world of the user’s actual
tasks and activities.
Unlike a calculator, Omnifocus at least notices when its premature semantics fails, though
it does not do this very gracefully, even though a brief consideration of what ‘to do lists’ are
would suggest several options of what ‘to do’ with an inconsistent list, including adding it to
a list to sort out later. A user knowing they have a problem is a step better than a nurse using
a calculator for a drug dose and not knowing there is a problem.
However, this sheet of paper does not work like real paper in that you cannot put your
cursor wherever you like and start typing. In fact, there is only initially one place the cursor
can be placed. It is only when text (including spaces) has been entered that the cursor can be
repositioned, but even then it has to be between characters and not in blank spaces where
there are no characters. This is not necessarily bad premature semantics: we do not think
many users would these days expect any different interaction. But at the same time, it is
an unnecessary constraint where cursor position is inherently tied to the textual content of
the document. This contrasts with new opportunities for interaction where with that very
same document on a Microsoft Surface, it is possible to use the pen to make ‘ink’ marks
anywhere on the page. So even if the semantics of the cursor language for the original Word
are appropriate, they are not consistent with the pen language. Interestingly, this problem
was solved in the 1980s (Thimbleby and Bornat, 1989), but hazing or other reasons meant
the solution was never adopted.
However, as our examples show, sometimes the problems arise because of premature oversight
(the main cause, we think, of premature semantics) or the development of technology,
sometimes because of contextual or cultural developments, and sometimes because of the
accumulation of otherwise sensible features that interact with each other incoherently. It
is currently hard to imagine how to define a language involving multiple physical objects
like a supermarket self-checkout. But if there were such a language, then it may be possible
to formally analyse such systems to see how the changing world interacts with the changing
language.
command and its parameters. It is only when they pressed a terminating key (i.e., enter, return, line feed,
etc.) that the command was interpreted; if it was invalid it was rejected, sometimes even with
a helpful error message. This style of interaction has steadily been left behind because of
the burden it puts on users to get things right all in one go (Sharp, Rogers, and Preece,
2007), something we know can be challenging even for the most diligent users (Li, Cox,
Blandford, Cairns, Young, and Abeles, 2006). Instead, what is needed is a form of data entry
that can be done alongside semantics but without constraining what users do—but also
guiding them into correct semantics. Forms are an example of a user interface paradigm that
does do this (to a varying extent, depending on the quality of the implementation) but they
are not necessarily free from premature semantics (see Section 8.5.2).
In order to balance semantics and free-entry, Thimbleby and Gimblett (2011) imple-
mented a system for number entry (and other forms of data entry) using a traffic light coding.
A user is free to enter data however they wish, but as each key is pressed the system also
processes the input using an underlying FSM for the correct semantics. A user can press any
keys, but if a keypress leads to a semantically invalid value, say if it adds a second decimal
point, then the display turns red to indicate this has no meaning. Conversely, if the user has
entered a sequence of keypresses leading to a semantically correct item then the display is
green to indicate that what is currently displayed is semantically correct. The display
will be orange to indicate that it could be the start of a semantically correct entry but needs
further correct keypresses to make it so; for instance, if the display is 1. it needs some
decimal digits before becoming a valid number (at least if we want to avoid ‘naked decimals’
(Institute for Safe Medication Practices, 2015)). Finally, the user interface can make sounds
(or provide tactile feedback) when it transitions between traffic light colours: the
user does not need to look at the display.
With this approach, the user is constantly aware of the potential or contingent semantics
of their entry but without being constrained at all in their entry or correction. The interesting
thing about this traffic light approach (and a range of related variations to better suit different
sorts of user interface) is that the light colours can be automatically determined from the
underlying FSM.
Such a traffic light system uses FSMs, but it is not only an FSM: the display itself acts
as a buffer storing the user’s actions, so DEL and similar keys correctly operate on the buffer
regardless of its semantics. The system does not accept the user’s input, as represented on the
display, until some form of commit key is pressed, and that can only be pressed when the
display is green.
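The traffic-light logic can be sketched directly from such an FSM. The following is a minimal illustration (ours, not Thimbleby and Gimblett's implementation), for an FSM that accepts plain decimal numbers: the buffer is classified green when the FSM is in an accepting state, orange when the keypresses so far form a valid prefix, and red as soon as no transition exists.

```python
# Minimal sketch (illustrative, not the authors' code): a traffic-light
# checker for number entry driven by a finite-state machine that accepts
# digits with at most one decimal point and no 'naked decimal' such as "1."

ACCEPTING = {"INT", "FRAC"}          # states in which the buffer is a valid number

# Transition table: state -> {input class -> next state}.
FSM = {
    "START": {"digit": "INT"},
    "INT":   {"digit": "INT", "dot": "DOT"},
    "DOT":   {"digit": "FRAC"},
    "FRAC":  {"digit": "FRAC"},
}

def classify(buffer: str) -> str:
    """Return 'green', 'orange', or 'red' for the current display buffer."""
    state = "START"
    for ch in buffer:
        kind = "digit" if ch.isdigit() else "dot" if ch == "." else None
        state = FSM.get(state, {}).get(kind)
        if state is None:
            return "red"             # no transition: semantically invalid
    if state in ACCEPTING:
        return "green"               # valid number; the commit key may be enabled
    # Every live state of this FSM can still reach an accepting state,
    # so any other outcome is a valid prefix needing more keypresses.
    return "orange"
```

For instance, `classify("1.")` yields `"orange"` (more digits needed), `classify("1.25")` yields `"green"`, and `classify("1..")` yields `"red"`. In line with the text, the commit key would only be enabled while the classification is green.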
8.6.1 Reengineering
If programming user interfaces causes so many problems for programmers, another
approach is to consider if the designs themselves are faulty in some way, rather than
the programs that implement them. Certainly, computers cannot correctly implement
something that is inconsistent.
We should consider stepping back entirely from traditional user interfaces dictated by
historic and outmoded technology. Why reproduce mechanical calculators, which were inevitably
compromises, when computers can now do anything? We should start by asking
questions like ‘what is the purpose of the intended system?’, or, in our particular examples,
‘what is the purpose of a calculator?’
A calculator is used to perform arithmetic calculations that arise in real situations with
numbers or calculations that are difficult to do in the head or on paper. Some calculations
fit well with entry into traditional calculators: how many oranges do I have if I have
twenty-three crates with thirty-five oranges per crate? This problem can be translated into
a sequence of key presses that look roughly like 23 × 35. However, many calculations we
want to do are far less direct: if a worker works fifty minutes on and ten minutes off every
hour for seven hours, and each of their data entry tasks takes on average thirty-five seconds,
how many data entry tasks will they get through in a typical working day? Normally, you’d
solve this sort of problem with a calculator, helped either by using pencil and paper to write
down the problem and then converting it to a calculation, or by chancing your luck with a
good guess as you work on the calculator. A better solution would be to represent the problem as
you understand it in the calculator and then work with the calculator to get the answer. This
is precisely the philosophy behind a novel gesture-based calculator (Thimbleby, 2004).
This novel calculator avoids the problem of premature semantics by allowing the user to
write whatever they want on the calculating surface. The calculator was designed to allow the
user to enter anything, even nonsense, and therefore does not have premature semantics—it
does not require or force the user’s input to be correct in any way, but uses another colour to
correct the user’s input. It turns out that correcting the user’s input also provides the answer
to the user’s calculation. The relation between correction, error and premature semantics is
profound and worth explaining in more detail.
You may want to know what to multiply 2 by to get 6 (you may well want to do something
harder, but we are interested here in the principles not the actual problem). On a normal
calculator, you might key AC 6 / 2 = to get the answer you want, which here will
be 3. However, note that you have had to convert the original question using multiplication
into a question the calculator can handle, using division.
With the new calculator, typing ‘2× = 6’ is not correct, but would be corrected to show
the answer 3 in the right place: 2 × 3 = 6. In other words, the flexibility of correcting
anything significantly increases the power and flexibility of the calculator.
Or, if the new calculator were used like an old one, typing ‘6/2’ is also not correct as it
stands, so it would immediately be corrected to ‘6/2 = 3’. If the user continued typing
‘6/2 =’ that is fine too, as it will be corrected to ‘6/2 = 3’. Indeed, if the user typed ‘= 6/2’
(i.e., the other way around) that would be corrected to ‘3 = 6/2’.
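The correction principle can be illustrated with a small sketch (ours, not the Thimbleby calculator's implementation): given a binary operation with one missing part, fill in the unknown so that the equation holds.

```python
# Illustrative sketch of the 'correction provides the answer' principle:
# given a declaration such as 2 * ? = 6 with exactly one unknown, complete
# it so the equation holds. (The real gesture calculator is far more general.)

def complete(lhs_a, op, lhs_b, result):
    """Each argument is a number or None (the single unknown).
    Returns the completed (lhs_a, op, lhs_b, result) tuple."""
    if result is None:
        result = {"+": lhs_a + lhs_b, "-": lhs_a - lhs_b,
                  "*": lhs_a * lhs_b, "/": lhs_a / lhs_b}[op]
    elif lhs_b is None:
        lhs_b = {"+": result - lhs_a, "-": lhs_a - result,
                 "*": result / lhs_a, "/": lhs_a / result}[op]
    elif lhs_a is None:
        lhs_a = {"+": result - lhs_b, "-": result + lhs_b,
                 "*": result / lhs_b, "/": result * lhs_b}[op]
    return lhs_a, op, lhs_b, result

# '2 * ? = 6' is corrected to '2 * 3 = 6'
# '6 / 2 = ?' is corrected to '6 / 2 = 3'
```

Note how the same completion mechanism answers both the multiplication form and the division form of the question, which is exactly the flexibility the text describes.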
See Figure 8.5. The calculator interprets whatever the user tries to enter, so if the user
cannot represent the problem exactly, the calculator does not make it worse. And the result
of the calculation and the calculation itself are co-present and persistent, which means that
the user’s representation can be reconstructed as necessary, and even the calculator’s
representation can be reconstructed.
This is a specific solution to the problems of premature semantics that works with
arithmetic calculations (which would be really useful even if that were all it did). However,
it does show that really understanding what the user needs can lead to much better systems
than the blind adherence to existing ways of doing things that computers, or at least
programmers, may struggle with.
Figure 8.5 The classroom version of an intentionally not premature semantics calculator. The cal-
culator also runs well on handheld tablets. A photograph of a user interface cannot do the interactive
experience justice; the person shown using this calculator is Will Thimbleby, the implementer.
emergent property of executing a program. The formal properties we might want to think
about, even for apparently simple devices like a calculator, such as display overflow and the
regular expression for valid numbers, seem a long way from the simple tasks that most
users would encounter. Moreover, the list of formal properties has never been specified,
even for calculators, which in principle do have formal semantic foundations. Diligent user
interface programmers would struggle to formally evaluate a user interface even if they
wanted to.
And at the end of the day, much user interface design is about the acceptability of
interfaces to users. Most users are tempted to say ‘the user interface works’; why worry
about a quirky case, which anyway can be avoided with a work-around if the bug ever affects
a user (assuming the user notices the bug)? In addition, it is well known that conventional
informal user interface development (such as focus groups, scenarios, personas, etc.) is a
critical tool for user interface acceptance. Thus, even what we would identify as bugs in the
semantic interpretation of an interaction, users frequently dismiss as not relevant.
[Dialog contents: ‘Downloading’, ‘Undo Delete’, with buttons ‘Cancel’ and ‘Undo’.]
Figure 8.6 A familiar dialog box for users of iPhones, but what does it mean? Does pressing
“Cancel” mean cancel undo, which means delete, or does pressing “Undo” mean undo undo delete,
which probably means delete? So pressing either button means delete? We imagine, indeed we
hope, that the programmer and other privileged people know what the choice means, and
(presumably!) what these buttons mean was obvious to them, and (presumably!) they mean
different things (otherwise why give a vacuous choice to users?). In any case, what exactly are
we accused of deleting or undoing deleting that we might want to cancel or undo? Presumably
(!) the programmer knows, but they haven’t bothered to tell us to help us out of our confusion.
references
Aho, A. V., Sethi, R., and Ullman, J. D., 1986. Compilers: Principles, Techniques, and Tools. Boston,
MA: Addison Wesley.
Cairns, P., and Thimbleby, H., 2017. Interactive numerals. Royal Society Open Science, 4(160903).
Card, S. K., Moran, T. P., and Newell, A., 1980. The keystroke-level model for user performance time
with interactive systems. Communications of the ACM, 23(7), pp. 396–410.
Degani, A., 2004. Taming HAL: Designing interfaces beyond 2001. New York, NY: Springer.
Gasen, J. B., 1995. Support for HCI educators: A view from the trenches. In: Proceedings BCS HCI
Conference, Volume X, Cambridge: Cambridge University Press, pp. 15–20.
Gow, J., and Thimbleby, H., 2005. Maui: An interface design tool based on matrix algebra. In:
Computer-Aided Design of User Interfaces IV. New York, NY: Springer, pp. 81–94.
Institute for Safe Medication Practices, 2015. ISMP List of Error-Prone Abbreviations, Symbols, and Dose
Designations. Horsham, PA: ISMP.
Jackson, D., 2011. Software Abstractions: Logic, Language, and Analysis. 2nd ed. Cambridge, MA: MIT
Press.
Landauer, T. K., 1995. The trouble with computers: Usefulness, usability, and productivity. Volume 21.
Milton Park, UK: Taylor & Francis.
Li, S., Cox, A. L., Blandford, A., Cairns, P., Young, R. M., and Abeles, A., 2006. Further investigations
into post-completion error: the effects of interruption position and duration. In: Proceedings of
the 28th Annual Meeting of the Cognitive Science Conference. Mahwah, NJ: Lawrence Erlbaum,
pp. 471–6.
Mulligan, R. M., Altom, M. W., and Simkin, D. K., 1991. User interface design in the trenches: Some
tips on shooting from the hip. In: CHI ’91: Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems. New York, NY: ACM, pp. 232–6.
Olsen, K. A., 2008. The $100,000 keying error. IEEE Computer, 41(108), pp. 106–7.
Pinker, S., 2015. The Sense of Style: The Thinking Person’s Guide to Writing in the 21st Century. London:
Penguin Books.
Rogers, Y., 2012. HCI theory: classical, modern, and contemporary. Synthesis Lectures on Human-
Centered Informatics, 5(2), pp. 1–129.
Salah, D., Paige, R. F., and Cairns, P., 2014. A systematic literature review for agile development
processes and user centred design integration. In: Proceedings of the 18th international conference
on evaluation and assessment in software engineering. New York, NY: ACM, p. 5.
Sharp, H., Rogers, Y., and Preece, J., 2007. Interaction design: beyond human-computer interaction. Oxford:
John Wiley.
Sommerville, I., 2015. Software Engineering. 10th ed. London: Pearson.
Thimbleby, H., 2000. Calculators are needlessly bad. International Journal of Human-Computer Studies,
52(6), pp. 1031–69.
Thimbleby, H., 2004. User interface design with matrix algebra. ACM Transactions on Computer-
Human Interaction (TOCHI), 11(2), pp. 181–236.
Thimbleby, H., 2010. Press On: Principles of interaction programming. Cambridge, MA: MIT Press.
Thimbleby, H., 2013. Reasons to question seven segment displays. In: Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems. New York, NY: ACM, pp. 1431–40.
Thimbleby, H., 2015. Safer user interfaces: A case study in improving number entry. IEEE Transactions
on Software Engineering, 41(7), pp. 711–29.
Thimbleby, H., and Bornat, R., 1989. The life and times of Ded, display editor. In: J. B. Long and A.
Whitefield, eds. Cognitive Ergonomics and Human Computer Interaction. Cambridge: Cambridge
University Press, pp. 225–55.
Thimbleby, H., and Cairns, P., 2010. Reducing number entry errors: Solving a widespread, serious
problem. Journal of the Royal Society Interface, rsif20100112.
Thimbleby, H., Cairns, P., and Jones, M., 2001. Usability analysis with Markov models. ACM Transac-
tions on Computer-Human Interaction (TOCHI), 8(2), pp. 99–132.
Thimbleby, H., and Gimblett, A., 2010. User interface model discovery: Towards a generic approach.
In: G. Doherty, J. Nichols, and Michael D. Harrison, eds. Proceedings ACM SIGCHI Symposium on
Engineering Interactive Computing Systems — EICS 2010. New York, NY: ACM, pp. 145–54.
Thimbleby, H., and Gimblett, A., 2011. Dependable keyed data entry for interactive systems. Electronic
Communications of the EASST, 45, pp. 1/16–16/16.
Thimbleby, W., 2004. A novel pen-based calculator and its evaluation. In: Proceedings of the third Nordic
conference on Human-computer interaction. New York, NY: ACM, pp. 445–8.
Thimbleby, W., and Thimbleby, H., 2005. A novel gesture-based calculator and its design principles.
In: Proceedings of the 19th BCS HCI Conference. Volume 2. State College, PA: Citeseer, pp. 27–32.
Turing, A. M., 1936. On computable numbers, with an application to the Entscheidungsproblem.
Proceedings of the London Mathematical Society, 42, pp. 230–65.
9
9.1 Introduction
Arguments supporting the validity of most contributions in the field of Human-Computer
Interaction (HCI) are based on detailed results of empirical studies involving cohorts of
users confronted with a set of tasks and an interactive prototype. Interestingly,
the same approach is used by industry when building new interactive applications,
and by academia when innovating with new interaction techniques. Based on iterative
and User Centered Design processes such as those promoted in (Göransson et al., 2003)
or (Mayhew, 1999), these evaluations are conducted either under controlled conditions (i.e.,
usability labs) or in working environments (e.g., ethnographic studies). Such usability-
test-driven contributions are both extremely empirical (forecasting of user test results
is not usually performed) and highly inefficient in terms of development costs (as several
versions of a system have to be developed and user-tested in order to identify the ones
that are possibly valuable).
Another way of handling the usability problem is to base the work on models
in order to be able to predict the overall performance (or the existence of properties) of the
resulting system. Such models can be used at design level, prior to implementation, and
their use is even required in the area of critical systems following development standards
such as DO-178C (DO-178C, 2012) and, more precisely, its supplement DO-333 (DO-
333, 2011, p. 101). For instance, such standards state that high-level requirements should
be modelled using (for instance) temporal logic (Emerson and Srinivasan, 1988), while
low-level requirements should be modelled using state-based description techniques, e.g.,
Petri nets (Reisig, 2013). Compliance between these two models has to be verified using
model-checking techniques (Clarke et al., 2009). In that area, the term model corresponds
to behavioural descriptions of software systems.
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
In the field of HCI, such model-based approaches have been used in two different
directions. Models (in the meaning of science (Lee, 2016)) are used to understand the world
and to produce predictions (e.g., Fitts’ law (Fitts, 1954), or Hick’s law used in (Cockburn
et al., 2007) to predict user performance in menu usage, or (Lee and Oulasvirta, 2016) to
predict error rates in temporal pointing). In that meaning, the quality of the model depends
on its precision in representing the part of the world it aims to describe; there, the goal
is the proposition and production of the model. Models (in the meaning of engineering)
are used to produce a system that is supposed to carry the properties expressed in the model.
Hence, the quality of the model lies in the information it contains, and the quality of the
system produced depends on how well it covers what is expressed in the model; there, the goal is
the production of the system. Such approaches have been used in the area of engineering
interactive systems, both for understanding what interactive systems are (e.g., the PIE
model (see Dix, 1991)) and for building reliable, dependable, and usable multimodal
interactive systems from formal models ((Ladry et al., 2009), among many others).
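As an illustration of a model in the science sense, Fitts' law predicts movement time from target distance and width. The sketch below uses the common Shannon formulation; the coefficients a and b are illustrative placeholders that would in practice be fitted to the actual device and user population.

```python
import math

def fitts_mt(distance: float, width: float, a: float = 0.05, b: float = 0.12) -> float:
    """Predicted movement time in seconds, Shannon formulation of Fitts' law:
    MT = a + b * log2(D/W + 1). The constants a and b are illustrative and
    must be fitted empirically for a given device and population."""
    return a + b * math.log2(distance / width + 1)

# Comparing two designs: a closer or larger target is predicted to be faster.
near = fitts_mt(distance=100, width=20)
far = fitts_mt(distance=400, width=20)
```

Such a predictive model can rank candidate layouts before any prototype exists, which is precisely the role of science-sense models in the engineering process described in this chapter.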
This chapter presents an approach integrating both views presented here. More precisely,
we present how a model (in the engineering meaning) can be enhanced with models (in
the meaning of science) describing human behaviour (in perceptive, motoric, and cognitive
terms). This chapter thus presents a tool-supported generic framework for integrating
user models into interactive systems engineering technologies. The performance evaluation
techniques proposed aim at providing designers with data for comparing different designs
of interactive systems. Therefore, this chapter does not present any contribution related to
design aspects of interactive systems, nor new models of users, nor means of calculating
absolute values for usability or user experience properties.
Figure 9.1 Generic architecture for interactive systems tuned for multi-touch applications.
interactive application at the higher level of abstraction. The middle part of the figure
provides details on how this layered architecture can be refined and tuned for interactive
systems offering multi-touch interactions. The flow of information is represented as a
cycle from the user (bottom of the figure), exploiting input devices (and their associated
interaction techniques), to the interactive application (top of the figure), and back to the user
by means of information presentations (via the rendering engines) exploiting output devices
and their drivers. As represented in the middle of the figure, this loop may exhibit
shortcuts (double-ended arrows) making it possible, for instance, to produce immediate
feedback to the user (e.g., displaying the new position of the mouse cursor when the
user manipulates the mouse).
This architecture corresponds to a refinement of the ARCH architectural model (Bass et al.,
1991) proposed in the early 1990s. More importantly, this architecture offers a clear separation
between hardware-related elements and the other layers. Finally, it goes beyond ARCH
by making explicit the role of the input and output fusion and fission engines that are in charge
of producing high-level events (and their associated interaction techniques) from the low-level
inputs produced by users when they physically manipulate the input and output devices.
1 We refer here to the definition provided in (Pnueli, 1986) where safety is ‘nothing bad will ever happen’ and
liveness is ‘something good will eventually happen’.
User interface (UI) components might be fully defined at design time (e.g., a set of user
interface buttons in a window) or might be created and presented at runtime (e.g., a new
icon is added to a desktop application each time a new file is created). This creation of UI
objects at runtime is the result of the dynamic instantiation of objects from classes (in the
object-oriented programming paradigm). Being able to handle these new objects within the
formalism is mandatory in order to reason about them (for instance, to ensure that the
number of dynamic instantiations is bounded). In the context of multi-touch interactions,
touching the screen with a finger changes the interaction in a similar way to adding a new
mouse in WIMP interactions, requiring dynamic instantiation management for input and
output devices that can be added or removed at runtime.
More concretely, ICO models are Petri nets in terms of structure and behaviour, but they
hold a much more complex set of information. For instance, in the ICO notation, tokens in
the places can hold a typed set of values that can be references to other objects (allowing,
for instance, its use in a transition for method calls) or preconditions about the actual value
of some attributes of the tokens. For instance, in Figure 9.2, the tokens in places ‘p0’, ‘p1’,
Figure 9.2 ICO model representing events and tokens with value.
and ‘p2’ may hold an integer value respectively labelled ‘a,’ ‘b,’ and ‘c’. The Object Petri nets
formalism also allows the use of test arcs that are used for testing presence and values of
tokens without removing them from the input place while firing the transition (e.g., the arc
connecting place ‘p1’ to transition ‘t0’ in Figure 9.2).
The ICO notation also encompasses multicast asynchronous communication principles
that enable, upon the reception of an event, the firing of some transitions. For instance, the
transition ‘t1’ in Figure 9.2 is fired upon receipt of an event named event, which results
in the deposit of a new token in place ‘p0’.
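Assuming Figure 9.2 behaves as described, the firing of ‘t0’ can be sketched as follows. This is a minimal hand-coded interpreter, not the ICO/PetShop machinery, and the place contents are illustrative: an ordinary arc consumes the token ⟨a⟩ from ‘p0’, the test arc only reads the token ⟨b⟩ in ‘p1’, and the action deposits ⟨c⟩ = a + b in ‘p2’, guarded by the precondition a <= 10.

```python
# Illustrative token game for transition t0 of Figure 9.2 (values assumed):
# places hold lists of valued tokens, mirroring the Object Petri net idea.
places = {"p0": [4], "p1": [10], "p2": []}

def fire_t0(places: dict) -> bool:
    """Fire t0 if enabled; returns True on success."""
    if not places["p0"] or not places["p1"]:
        return False                  # t0 not enabled: a required token is missing
    a = places["p0"][0]
    b = places["p1"][0]               # test arc: token <b> is read, not removed
    if not a <= 10:
        return False                  # precondition on the token's value fails
    places["p0"].pop(0)               # ordinary arc consumes token <a>
    places["p2"].append(a + b)        # action {c = a + b;} deposits <c> in p2
    return True

fire_t0(places)
# places is now {"p0": [], "p1": [10], "p2": [14]}: p1 keeps its token
```

The key observation is the asymmetry between the two input arcs: after firing, ‘p1’ still holds its token because the test arc only checks presence and value, exactly as described for Figure 9.2.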
Figure 9.3 ICO model with an ‘infinite’ number of states, representing the use of parentheses.
the newly created objects will by definition represent a new state for the system. Most of the
time, this characteristic is handled by means of code and remains outside User Interface
Description Languages (UIDLs). Only Petri net-based UIDLs can explicitly represent
such a characteristic, provided they handle high-level data structures, or objects. So, the
description language must be able to receive dynamically created objects. This dynamicity
also has to be addressed at operation time (i.e., when the system is currently in use). An
ICO specification is fully executable, which gives the possibility of prototyping and testing an
application before it is fully implemented (Palanque et al., 2009). Multi-touch interactions
are now widely available but their modelling remains a deep challenge. Indeed, fingers used
on a multi-touch input device can be seen as dynamic instantiation and removal of input
devices (e.g., one finger is similar to one mouse). ICOs are able to address this challenge, as
demonstrated in (Hamon et al., 2013), extending even to finger clustering interaction
techniques, as presented in (Hamon et al., 2014).
service availability, which can be used to identify deadlocks and analyse the model’s liveness
(i.e., for every widget, there will always be at least one sequence of events available on the
user interface that will make that widget enabled). When modelling a system, it may be
important to ensure that the initial state is always reachable (through a sequence of actions).
This is particularly relevant for safety-critical interactive systems that may be required to
reset, restart, or shut down quickly in an emergency situation. Such properties allow evaluation
of the quality of the model before system implementation. A more detailed description of
how to perform formal analysis of Petri net models representing user interface behaviours
can be found in (Palanque and Bastide, 1995).
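Reachability-based checks of this kind can be sketched for a plain Place/Transition net. The code below is an illustrative brute-force exploration, nothing like the dedicated algorithms used in practice; it assumes a bounded net (otherwise the exploration would not terminate) and checks reinitiability, i.e., that the initial marking is reachable again from every reachable marking.

```python
# A marking is a dict from place name to token count; a transition is a
# (consume, produce) pair of such dicts. Markings are frozen for use in sets.

def successors(marking, transitions):
    """All markings reachable from `marking` by firing one enabled transition."""
    out = []
    for consume, produce in transitions:
        if all(marking.get(p, 0) >= n for p, n in consume.items()):
            m = dict(marking)
            for p, n in consume.items():
                m[p] -= n
            for p, n in produce.items():
                m[p] = m.get(p, 0) + n
            out.append(m)
    return out

def reachable(initial, transitions):
    """Exhaustively enumerate reachable markings (assumes a bounded net)."""
    start = frozenset(initial.items())
    seen, frontier = {start}, [start]
    while frontier:
        for s in successors(dict(frontier.pop()), transitions):
            f = frozenset(s.items())
            if f not in seen:
                seen.add(f)
                frontier.append(f)
    return seen

def reinitiable(initial, transitions):
    """True iff the initial marking is reachable from every reachable marking."""
    start = frozenset(initial.items())
    return all(start in reachable(dict(m), transitions)
               for m in reachable(initial, transitions))

# A two-state toggle (cf. the AUTO/MANUAL switch later in the chapter):
toggle = [({"AUTO": 1}, {"MANUAL": 1}),    # switch to manual
          ({"MANUAL": 1}, {"AUTO": 1})]    # switch back to automatic
# reinitiable({"AUTO": 1, "MANUAL": 0}, toggle) -> True
```

Removing the second transition makes the net non-reinitiable, which is exactly the kind of defect (no way back to the initial state) that matters for the reset/restart requirements mentioned above.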
The analysis of ICO models is performed on the underlying Petri net (a basic
Place/Transition Petri net simplification of the original Petri net). The advantage of
the underlying Petri net is that the classical properties listed earlier can be proved
with dedicated algorithms, but an important drawback is that properties verified on the
underlying Petri net are not necessarily true of the original one. Thus, we usually use
the result of classical property analysis as an indicator that highlights potential problems
in the Petri net. Verifying these properties directly on ICO models is still a research
challenge for the Petri nets community, which is still focusing on Petri nets with reduced
expressive power such as coloured Petri nets (Jensen et al., 2007). This chapter focuses on
the use of formal methods (and more particularly of the ICO modelling technique)
for the assessment of the efficiency of an interactive system through the predictive and
summative evaluation of the performance of the couple (user, system). However, the ICO
modelling technique can also be used for the verification of interactive system properties
(e.g., Palanque and Bastide, 1995; Palanque et al., 2006).
• Place this transition between the incoming place and the outgoing place of the existing transition
describing the currently targeted input interaction (in parallel with the described input interaction).
• Add an action in the newly inserted transition to accumulate the estimated time to perform the
possible motoric activity.
Figure 9.4 Steps for enriching an ICO model with motoric time predictions.
sequences of actions, as well as to identify sequences of actions that may decrease user
performance. The predictive evaluation step consists in taking each scenario and summing
the action times for each action described in the scenario. Figure 9.6 presents in detail
the steps of the predictive evaluation stage.
In this process for predictive evaluation of user performance, it is important to note that
real-time execution of the enriched dialogue models is not possible, contrary to real-time
execution of dialogue models. Furthermore, the predictive evaluation process applies to the
application layer of the generic layered architecture for interactive systems (presented in
Section 9.2 and in Figure 9.1). We enrich dialogue models of the application only, and not
device driver models and interaction technique models. These other types of models are
required for the summative evaluation.
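The summing step can be sketched as follows. The action names and time estimates below are illustrative placeholders; in the approach itself they are produced by the human performance laws attached to the enriched ICO models (e.g., Fitts' law for motoric actions).

```python
# Sketch of the predictive evaluation step: sum estimated action times along
# each scenario. All names and values are illustrative, not from the chapter.

ACTION_TIME = {                      # seconds per elementary action (assumed)
    "move_cursor_to_Manual": 1.10,
    "click": 0.20,
    "move_cursor_to_editbox": 0.95,
    "type_angle": 2.40,
    "press_enter": 0.28,
}

def scenario_time(scenario: list) -> float:
    """Predicted completion time for one scenario: the sum of its action times."""
    return sum(ACTION_TIME[a] for a in scenario)

# An illustrative scenario for editing the tilt angle manually:
edit_tilt = ["move_cursor_to_Manual", "click",
             "move_cursor_to_editbox", "click",
             "type_angle", "press_enter"]
# scenario_time(edit_tilt) -> about 5.13 seconds for this scenario
```

Comparing such sums across scenarios for alternative designs is what gives designers the comparative data mentioned earlier, without building or user-testing each alternative.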
Figure 9.6 Processes for a) predictive evaluation of user performance with interactive systems and b) summative evaluation of interaction techniques.
steps such as requirements analysis, design, and low-fidelity prototyping have been carried
out previously.
Figure 9.6(b) presents the process that starts with interactive system modelling. We use
several types of artefacts, including test scenarios and behavioural models of the interactive
system (using the ICO description technique for describing both the general behaviour of
the interaction technique and the treatment of low-level events from input devices). The
prototype, running with ICO models, then follows a usability evaluation plan based on user
testing. The output of the usability evaluation includes log data, which can be classical logs
such as video or questionnaires, as well as model-based logs (that are related to the ICO
models). These logs thus provide support for the assessment of the models of the interaction
technique. They help in analysing whether or not the performance of these techniques meets
the corresponding requirements, and allow iterating (the rhombus box labelled ‘Decision’
in Figure 9.6(b)). The main idea is that usability problems discovered with the usability test
can now be pointed out in the interactive system models themselves, making it
easier for designers both to locate where changes have to be made and to decide how to
make them.
The approach is quite generic but heavily dependent on tool support and on a formalism
able to support the main assumptions made about models. In the present case, the approach
exploits the ICO notation for describing the entire behaviour of the interaction techniques
and relies on the PetShop tool for executing ICO models.
Figure 9.7 Screenshot of a running ICO specification in the PetShop tool and its associated user
interface.
theory (Peterson, 1981). The ICO approach is based on high-level Petri nets. As a result, the
analysis approach builds on and extends these static analysis techniques. The analyst must
carefully consider analysis results as the underlying Petri net model can be quite different
from the ICO model. Such analysis has been included in Petshop and can be interleaved with
the editing and simulation of the model, thus helping to correct it in a style that is similar to
that provided by spell checkers in modern text editors (Fayollas et al., 2017). It is thus pos-
sible to check well-formedness properties of the ICO model, such as absence of deadlocks,
as well as user interface properties, either internal properties (e.g., reinitiability) or external
properties (e.g., availability of widgets). Note that it is not possible to express these user
interface properties explicitly: the analyst needs to express them as structural
and behavioural Petri net properties that can then be analysed automatically in PetShop.
is stored in XML files and it is possible to transform and export it to spreadsheet tools for
statistical analysis.
The structure of the log files is the following: the first field corresponds to the name of the
model concerned by the records, and the second field represents the type of the Petri net node
(i.e., place or transition). The third field corresponds to the given node name. The fourth field
represents the action performed, associated to a node: for a transition, it can be the firing of the
transition, while for a place the actions are token movements in the Petri net, such as tokenAdded
or tokenRemoved. The fifth field represents the time elapsed from the start of the application in
milliseconds. The last two fields (named data1 and data2) represent additional data related
to the node type. For a place, the field data1 represents the number of tokens added
or removed at the same time, and the field data2 points to the objects embedded in the
token. For a transition, the field data2 represents the substitutions used for firing the transition.
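As an illustration, a record with these seven fields could be parsed as follows once exported to a flat, tab-separated form. The separator, field order, and example values here are assumptions for illustration only; the actual logs are stored as XML.

```python
from typing import NamedTuple

class LogRecord(NamedTuple):
    """One record with the seven fields described in the text."""
    model: str            # name of the model concerned by the record
    node_type: str        # "place" or "transition"
    node_name: str        # the Petri net node's name
    action: str           # e.g. "fired", "tokenAdded", "tokenRemoved"
    elapsed_ms: int       # time since application start, in milliseconds
    data1: str            # for a place: number of tokens added/removed
    data2: str            # token contents, or the firing substitution

def parse_record(line: str) -> LogRecord:
    """Parse one tab-separated record line (format assumed for illustration)."""
    model, ntype, name, action, t, d1, d2 = line.rstrip("\n").split("\t")
    return LogRecord(model, ntype, name, action, int(t), d1, d2)

# Example with hypothetical values:
rec = parse_record("WXRControl\tplace\tEditingValue\ttokenAdded\t15320\t1\t<wxr,angle>")
# rec.elapsed_ms -> 15320, rec.node_type -> "place"
```

Records of this shape can then be loaded into spreadsheet tools for the statistical analysis mentioned above, for instance to measure the elapsed time between two transition firings.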
Figure 9.9 Image of a) the control panel of the weather radar; b) the radar display manipulation;
and c) the input device Keyboard and Cursor Control Unit (KCCU).
different range scales (40 NM for the left display and 80 NM for the right display). Spots
in the middle of the images show the current position, importance, composition, and size of
the clouds.
This section focuses on the user tasks performed with the KCCU input device.
Figure 9.11 ICO model of the behaviour of the lower part of the control panel of the weather radar.
The top left part of Figure 9.11 describes the handling of the editing mode of the tilt angle,
the bottom left part describes the handling of the weather radar stabilization, and the right
part describes the behaviour of the manual editing of the tilt angle. In Figure 9.11, the place
‘AUTO’ holds a token; this means that the editing mode of the tilt angle is set to automatic.
If the user clicks on the button ‘Manual’, the transition ‘switchManualcb ::switchManual’
is fired and the token moves from the place ‘AUTO’ to the place ‘MANUAL’, thus setting
the editing mode of the tilt angle to manual. This modification implies that the transition
‘clickInEditBox’ becomes fireable, thus enabling the click on the corresponding editbox. If
the user clicks on the editbox, the transition ‘clickInEditBox’ is fired and a token is set in the
place ‘EditingValue’, thus enabling the user to use the keyboard in order to enter a new tilt
angle value. When the user presses the key ‘Enter’, the transition ‘EnterKeyPress ::KeyPress’ is
fired and the token is moved from the place ‘EditingValue’ to the place ‘UpdateAngleRequired’.
Finally, depending on the entered tilt angle value, one of the three transitions at the bottom is
fired. If the entered angle is correct (between -15.0 and 15.0 inclusive), the value of the
current tilt angle (the value of the token held in the place ‘TILT_ANGLE’) is set to the entered
value (firing of the transition ‘angleIsCorrect’). Otherwise, if the entered value is too low, the
tilt angle is set to its minimum (firing of the transition ‘angleIsLow’), and if the entered value
is too high, the tilt angle is set to its maximum (firing of the transition ‘angleIsHigh’).
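The clamping behaviour of the three bottom transitions can be sketched as follows (a minimal illustration; the transition names follow Figure 9.11):

```python
TILT_MIN, TILT_MAX = -15.0, 15.0

def apply_entered_angle(new_angle):
    """Mirror the three bottom transitions of Figure 9.11."""
    if new_angle < TILT_MIN:
        return TILT_MIN   # 'angleIsLow' fires: clamp to the minimum
    if new_angle > TILT_MAX:
        return TILT_MAX   # 'angleIsHigh' fires: clamp to the maximum
    return new_angle      # 'angleIsCorrect' fires: keep the entered value
```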
• The transition ‘ParkingToManual ::switchManual’ is fired if the user moves the cursor
from the predefined parking point (initial position for the cursor, top left of the
window) to the button ‘Manual’ and clicks.
• The transition ‘AutoToManual ::switchManual’ is fired if the user moves the cursor
from the button ‘Auto’ to the button ‘Manual’ and clicks.
• The transition ‘ONToManual ::switchManual’ is fired if the user moves the cursor
from the button ‘ON’ to the button ‘Manual’ and clicks.
• The transition ‘OFFToManual ::switchManual’ is fired if the user moves the cursor
from the button ‘OFF’ to the button ‘Manual’ and clicks.
Figure 9.12 ICO model of the lower part of the control panel of the weather radar enriched with
possible user motoric actions.
In the same way, in the ICO model depicted in Figure 9.12, in the central part located
around the places labelled ‘EditingValue’ and ‘UpdateAngleRequired’ (rectangle tagged ‘b’ in
Figure 9.12), the transitions represent the possible user motoric actions for validating a value
using the key ‘Enter’ of the keyboard; they replace the transition ‘EnterKeyPress ::KeyPress’ of
Figure 9.11. For example, in the top left of the rectangle ‘b’, the transition ‘Key0ToKeyEnter
::KeyPress’ indicates that the user’s finger is located on the key ‘0’ and that s/he moves it
towards the key ‘Enter’ and presses it.
The aim of the presented model is to illustrate our approach and its main principles. It is
not exhaustive, as the sheer number of possible transitions would make the model illegible.
However, the approach is scalable as the associated tool provides support for editing all the
possible transitions (Navarre et al., 2009).
It is important to note that, following the steps for enriching the system behavioural
models (presented in Figure 9.4), all of the added transitions contain temporal information
that will be used, when playing a scenario, to calculate the time needed to perform it. This
time has been calculated, as presented in Section 9.4.1.1, using Fitts’ law (Fitts, 1954).
The details about the interface elements enabling this calculation (e.g., the width of the
different elements and the distance between them) are given in Section 9.7.5.
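As an illustration of how such a temporal annotation can be derived, the sketch below applies the Shannon formulation of Fitts’ law to the distance between the buttons ‘Auto’ and ‘Manual’ (40 mm centre to centre, Table 9.2) and their width (31 mm). The coefficients a and b are illustrative placeholders, not the values used in the chapter:

```python
import math

def fitts_time_ms(distance_mm, width_mm, a_ms=50.0, b_ms=150.0):
    """Movement time MT = a + b * log2(D / W + 1) (Shannon formulation)."""
    return a_ms + b_ms * math.log2(distance_mm / width_mm + 1)

# Pointing from button 'Auto' to button 'Manual' (40 mm apart, 31 mm wide).
t = fitts_time_ms(distance_mm=40, width_mm=31)
```

Longer distances (e.g., Parking to the Edit Box, 170 mm) yield proportionally larger time annotations for the corresponding transitions.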
9.7.4 ICO Models for Time Prediction of Perceptual and Cognitive Actions
Figure 9.13 presents the ICO model for users’ cognitive activity embedding temporal
information. In Figure 9.13, as for the enriched interactive application model (see Figure
9.12), the important information is associated with the transitions. Therefore, each transition
of the model is associated with a different cognitive activity and a mean time for accomplishing
it (the rationale for the chosen mean times is explained in Section 9.4.1.2):
• An analysis cognitive activity is associated with the mean time 100 ms and the
transition ‘Analyze_’.
• A comparison between two values is associated with the mean time 33 ms
and the transition ‘Compare2Values_’.
• A decision between several items is associated with a time calculated using
Hick’s law and the transition ‘Decide_’. This transition is connected
to the place ‘NumberOfItems’, which contains a set of values for the number of items to
decide from. It is also connected to the place ‘HickLawParameters’, which
contains the values of the Hick’s law parameters necessary to perform the
time calculation.
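The time annotation of the ‘Decide_’ transition (written t = t + a + b·log(x) in the model) can be sketched as follows. The parameter values are illustrative, and Hick’s law is stated here with the conventional base-2 logarithm:

```python
import math

def hick_decision_time_ms(n_items, a_ms, b_ms):
    """Decision time among n_items alternatives, per Hick's law:
    t = a + b * log2(n)."""
    return a_ms + b_ms * math.log2(n_items)

# Illustrative parameters, e.g. as stored in the place 'HickLawParameters'.
t = hick_decision_time_ms(n_items=4, a_ms=0.0, b_ms=150.0)
```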
The ICO model presented in Figure 9.13 is rather simple as cognitive activity is quite
limited for using the devices. However, models that are more complex could be considered
for modelling information that must be remembered. For instance, as introduced by Moher
et al. (1996), due to memory overload some information might be lost in short-term
memory. This would be modelled (in a user-cognitive model) using a time-out transition
automatically removing information from the place representing information in short-term
memory.
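Such a time-out transition could be sketched as follows. This is an illustration only; the retention limit is an assumed parameter, not a value from the chapter:

```python
def expire_short_term_memory(tokens, now_ms, retention_ms=2000):
    """Time-out transition: remove every token older than the retention
    limit from the place representing short-term memory.
    Each token is a (created_ms, info) pair."""
    return [(t0, info) for (t0, info) in tokens if now_ms - t0 <= retention_ms]

# At t = 2100 ms, the token created at t = 0 has exceeded the assumed
# 2000 ms retention limit and is dropped.
remaining = expire_short_term_memory([(0, "value A"), (1900, "value B")], now_ms=2100)
```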
Figure 9.14 presents the ICO model for users’ perceptual activity embedding temporal
information. The model is very simple, as perceptual activity is quite limited for using the
devices: we only consider here the perceptual activity of reading a value. This activity is
associated with the transition ‘ReadValue_’ and the 100 ms mean time (the rationale for the
chosen mean time is explained in Section 9.4.1.3).
Figure 9.13 ICO model of the users’ cognitive activity embedding temporal information.
Figure 9.14 ICO model of the users’ perceptual activity embedding temporal information.
Widget Name Button Auto Button Manual Button ON Button OFF Edit Box
Width (mm) 31 31 31 31 31
Table 9.2 Distances between the buttons (centre to centre) of the lower part of the control panel
of the weather radar.
Widget Name Button Auto Button Manual Button ON Button OFF Edit Box
Parking 95 99 131 134 170
Button Auto 0 40 46 59 106
Button Manual 40 0 59 46 92
Button ON 46 59 0 40 92
Button OFF 59 46 40 0 48
Edit Box 106 92 92 48 0
0 1 2 3 4 5 6 7 8 9 . +/– Enter
Width (mm) 10 10 10 10 10 10 10 10 10 10 10 10 15
Time required by the system to execute an action has an impact on user performance.
In this example, the weather radar system takes approximately 3000 ms to change from
one tilt angle to another. Depending on the type of weather radar, the time required by
the weather radar to scan the airspace in front of the aircraft (getting a reliable image takes
two or three scans) ranges from 2000 to 4000 ms.
Table 9.5 is the output of the predictive evaluation stage presented in Section 9.3.2. It
highlights the results from the application of the steps for predicting time performance
(presented in Figure 9.6) to the scenario of checking the weather on the flight path. From
this table, we see that the total time performance for achieving the scenario is about 153
seconds and that system actions take 15 seconds in total (9.75 per cent of total time
performance).
Figure 9.15 ICO model of the trackball input device driver (similar to a mouse device driver).
two-mice interaction, the approach is generic and could be applied, for instance, to
multi-touch interactions, each finger playing the role of a mouse. The same
interaction can also be performed by two users, each using one mouse. This could
take place, for instance, in modern civil aircraft, where interaction takes place by means
of the KCCUs.
This section focuses on the application of the stages of the process to the production of
systems models and to their use for summative performance evaluation.
Figure 9.16 ICO model of the interaction technique with the KCCU trackball input device.
mousePress and triggers them as higher-level events, such as combinedClick, that will be
handled by the interactive application (for instance, the WXR application presented in
Section 9.7). Place CurrentXY contains a token that keeps track of the mice target pointers’
coordinates. For readability purposes, this place has been replicated into virtual places
(a functionality of the Petshop tool), which are clones of the initial one and provide several
views of this single place; this avoids overcrowding the figure. Each time a mouseMove event
is caught (mouseMove_t1, mouseMove_t2, mouseMove_t3, mouseMove_t4, mouseMove_t5),
the current coordinates are updated with the (x, y) deltas produced by the target
pointers. In the initial state (two tokens in place Idle, one for each mouse), the mousePress_t1
and mouseMove_t5 transitions are fireable (users can move or press their mouse). Once
a mousePress event is caught, the mousePress_t1 transition is fired and the corresponding
mouse token is put into the place Down. If a mouseRelease event occurs, the
mouseRelease_t1 transition will be fired and a timer armed in case of a future double click.
The token is put into the OneClick place. If a mouseMove event occurs, the mouseMove_t4
transition will be fired. If the timer previously set expires, the transition timerExpired_t2 will
be fired. In both cases, the timer is stopped, a mouseClick event is triggered, and the token is
put into the place CombClick.
• If the previously described sequence of events happens for both mice, two tokens
are in the CombClick place and the transition fusionClicks can be fired. It can be fired
only in this case because the weight of the corresponding arc is 2. This fusionClicks
transition, when fired, triggers a combinedClick event (which will be caught by
the interactive application, for instance, the WXR application presented in Section 9.7)
and then deposits the two tokens back into the place Idle.
• Otherwise, if a single token arrives in the place CombClick and no other token
arrives within 200 milliseconds, the click transition will be fired and the token will
go back to the place Idle.
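The fusion logic of the two cases above can be sketched as a simple function over the two click times (a deliberate simplification of the Petri net, keeping only the 200 ms window):

```python
FUSION_WINDOW_MS = 200

def classify_clicks(t_mouse1_ms, t_mouse2_ms):
    """A combinedClick is raised only when both clicks fall within the
    fusion window (the weight-2 'fusionClicks' arc with its [200] timer);
    otherwise each mouse produces an individual mouseClick."""
    if abs(t_mouse1_ms - t_mouse2_ms) <= FUSION_WINDOW_MS:
        return "combinedClick"
    return "click"
```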
The same principle is applied for the combined double click, which is not detailed here
as it was not used in our case study.
the tokens). It is possible to know which model is in charge of the event by looking at the
field class.
Two transitions are important for assessing the correctness of the multimodal combined
click interaction technique. If the transition ‘command1’ is fired with two valid objects as
parameters (visible in column data2 as (w1,w2)), it means that the user completed the launching
of the command named ‘command1’ using the CombinedClick interaction technique. If
the transition ‘command1’ has been fired with one missing object as parameter (visible in
column data2 as (null,w2)), the user failed to launch the command but managed to perform a
proper multimodal combined click. Figure 9.17 presents three successful combined clicks
(lines 1425–1428, 1435–1438, 2897–2900) and one miss (lines 1699–1702).
We can calculate the time for completing the task and the number of missed combined
clicks by hand or with spreadsheet formulas. At this point, log data help to count the number
of successful and failed tasks as well as to measure user performance. However, log data are also
useful for understanding complex sequences of events that may have led to unexpected user
failures while trying to launch the command labelled ‘command1’. For that purpose, we should
take a look at the events recorded at the transducer level.
In the ICO model of the interaction technique (see Figure 9.16), the transition ‘click’ is
fired only when the two clicks are not combined. The substitution contains the ID of a
mouse. According to the log, the delay between the click on mouse 1 (line 617) and the click on
mouse 2 (line 640) is 250 ms. Indeed, one of the main assumptions about this interaction
technique is that a combined click is counted only if the user clicks the two mice within an
interval of less than 200 ms. This makes it possible to count the number of failed combined
clicks. Based on such information, the evaluator can determine the typical delays of failed
combined clicks and propose a better value for the trigger.
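This counting can be automated rather than done by hand. The sketch below counts failed combined clicks from a chronologically ordered list of (mouse_id, timestamp) click records extracted from the log; the record format and the example timestamps are illustrative:

```python
def count_failed_combined_clicks(clicks, window_ms=200):
    """clicks: chronologically ordered (mouse_id, t_ms) click records.
    A pair of consecutive clicks on the two different mice counts as a
    failed combined click when their gap exceeds the fusion window."""
    failed = 0
    for (id1, t1), (id2, t2) in zip(clicks, clicks[1:]):
        if id1 != id2 and (t2 - t1) > window_ms:
            failed += 1
    return failed

# Illustrative timestamps reproducing the logged 250 ms gap between the
# click on mouse 1 and the click on mouse 2.
n = count_failed_combined_clicks([(1, 617), (2, 867)])
```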
From the log data produced by our user while trying to perform her task, we can extract
fine-grained information such as the time between pressing the first mouse’s button with
the cursor on the widget labelled ‘w1’ and the second mouse click on the widget labelled ‘w2’.
This time gives us the total time for completing the task of launching a command. The
sequence of events described by lines 752, 758, 761, 774, 781, and 794 is particularly
revealing of a usability problem that would be very difficult to find without a model-based
log environment. It shows that the user failed a combined click when s/he pressed the two
mice at the same time but forgot to release the button of the first mouse (see mId in column
‘data2’ in Figure 9.17), leading to a click on the second mouse after the 200 ms trigger expired.
of critical systems, it is far from current practice in interactive systems engineering. One
way of addressing this issue is to model only the parts of the interactive system whose
performance must be carefully assessed. This means that some parts would be directly
programmed, while others would be modelled. Integration of models with code can
be done following the approach presented by Fayollas et al. (2016).
The ICO formal notation has been designed with the objective of describing interactive
systems in a complete and unambiguous way, even when they feature complex
interaction techniques (as stated earlier). This means that its expressive power is high
and, for basic interaction techniques (such as those of infusion pumps that have only one
display and a couple of physical buttons (Cairns and Thimbleby, 2017)), its complexity is
not fully exploited. This is the reason why, for these systems, standard automata are still
used. Similarly, ICO models are meant to be executable, and are executed to bridge the
‘classical’ gap between models and implementations. For this reason, too, ICO models might
appear complex and, if the analysis of models is the only goal, executability adds useless
complexity in that context.
Further work would include integrating more complex models from psychology, especially
those related to human error, such as cognitive or perceptual biases (Baron, 2000),
but also to training, as this is an important element of the operations of critical interactive
systems (DoDTD, 1975). Beyond that, integrating descriptions of the operators’ tasks in
such a usability evaluation framework would provide additional benefits, as described by
Barboni et al. (2010).
....................................................................................................
references
Anderson, J. R., 1993. Rules of the Mind. New York, NY: Lawrence Erlbaum.
Barboni, E., Ladry, J-F., Navarre, D., Palanque, P., and Winckler, M., 2010. Beyond Modelling: An
Integrated Environment Supporting Co-Execution of Tasks and Systems Models. In: EICS ’10:
The Proceedings of the 2010 ACM SIGCHI Symposium on Engineering Interactive Computing Systems.
Berlin, Germany June 19–23 2010. New York, NY: ACM, pp. 143–52.
Baron, J., 2000. Thinking and deciding. 3rd ed. New York, NY: Cambridge University Press.
Bass, L., John, B., Juristo Juzgado, N., and Sánchez Segura, M. I., 2004. Usability-Supporting Archi-
tectural Patterns. In: ICSE 2004: 26th International Conference on Software Engineering. Edinburgh,
Scotland, May 23–28 2004. pp. 716–17.
Bass, L., Pellegrino, R., Reed, S., Seacord, R., Sheppard, R., and Szezur, M. R., 1991. The Arch model:
Seeheim revisited. Proceedings of the User Interface Developers’ Workshop Report.
Basnyat, S., Chozos, N., and Palanque, P., 2006. Multidisciplinary perspective on accident investiga-
tion. Reliability Engineering & System Safety, (91)12, pp. 1502–20.
Bastide, R., and Palanque, P. A., 1995. Petri Net based Environment for the Design of Event-driven
Interfaces. In: International Conference on Application and Theory of Petri Nets. Zaragoza, Spain 25–
30 June 1995. pp. 66–83.
Brumby, D. P., Janssen, C. P., Kujala, T. D., and Salvucci, D. D., 2017. Computational Models of
User Multitasking. In: A. Oulasvirta, P. O. Kristensson, X. Bi, and A. Howes, eds. Computational
Interaction. Oxford: Oxford University Press.
Cairns, P., and Thimbleby, H., 2017. From premature semantics to mature interaction programming.
In: A. Oulasvirta, P. O. Kristensson, X. Bi, and A. Howes, eds. Computational Interaction. Oxford:
Oxford University Press.
Card, S. K., Moran, T. P., and Newell, A., 1980. The Keystroke-Level Model for User Performance
Time with Interactive Systems. Communications of the ACM, 23(7), pp. 396–410.
Card, S. K., Moran, T. P., and Newell, A., 1983. The psychology of human-computer interaction. New
York, NY: Lawrence Erlbaum.
Card, S. K., Moran, T. P., and Newell, A., 1986. The Model Human Processor: An Engineering Model
of Human Performance. In: K. Boff and L. Kaufman, eds. The Handbook of Perception and Human
Performance: Sensory Processes and Perception. Volume 1. pp. 1–35.
Chiola, G., Dutheillet, C., Franceschinis, G., and Haddad, S., 1997. A Symbolic Reachability Graph
for Coloured Petri Nets. Theoretical Computer Science, 176(1–2), pp. 39–65.
Clarke, E. M., Emerson, A., and Sifakis, J., 2009. Model checking: algorithmic verification and debug-
ging. Communications of the ACM, 52(11), pp. 74–84.
Cockburn, A., Gutwin, C., and Greenberg, S., 2007. A predictive model of menu performance. In: CHI
’07: The Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. New York,
NY: ACM, pp. 627–36.
Cockburn A., Gutwin C., Palanque P., Deleris Y., Trask C., Coveney A., Yung M., and MacLean
K., 2017. Turbulent Touch: Touchscreen Input for Cockpit Flight Displays. In: CHI ’17: The
Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. New York, NY:
ACM, pp. 6742–53.
Dix, A., 1991. Formal methods for Interactive Systems. London: Academic Press.
DO-178C/ED-12C, 2012. Software Considerations in Airborne Systems and Equipment Certification.
Washington, DC: RTCA, Inc.
DO-333, 2011. Formal Methods Supplement to DO-178C and DO-278A. Washington, DC: RTCA, Inc.
DoDTD, 1975. U.S. Department of Defense Training Document. Pamphlet 350–30.
Emerson, E.A., and Srinivasan, J., 1988. Branching Time Temporal Logic. LNCS, 354, pp. 122–72.
Fayollas, C., Martinie, C., Navarre, D., and Palanque P., 2016. Engineering mixed-criticality interac-
tive applications. In: EICS 2016: ACM SIGCHI symposium on Engineering Interactive Computing
Systems, pp. 108–19.
Fayollas, C., Martinie, C., Palanque, P., Barboni, E., Racim Fahssi, R., and Hamon, A., 2017. Exploit-
ing Action Theory as a Framework for Analysis and Design of Formal Methods Approaches:
Application to the CIRCUS Integrated Development Environment. In: B. Weyers, J. Bowen,
A. Dix, and P. Palanque, eds. The Handbook on Formal Methods for HCI. Berlin: Springer,
pp. 465–504.
Fitts, P. M., 1954. The information capacity of the human motor system in controlling the amplitude
of movement. Journal of Experimental Psychology, 47, pp. 381–91.
Genrich, H. J., 1991. Predicate/Transition nets. In: K. Jensen and G. Rozenberg, eds. High-Level Petri
Nets: Theory and Application. Berlin: Springer, pp. 3–43.
Göransson B., Gulliksen J., and Boivie, I., 2003. The usability design process—integrating user-
centered systems design in the software development process. Software Process: Improvement and
Practice, 8(2), pp. 111–31.
Gram, C., and Cockton, G., 1996. Design principles for Interactive Software. London: Chapman & Hall.
Hamon A., Palanque P., Cronel M., André R., Barboni E., and Navarre, D., 2014. Formal mod-
elling of dynamic instantiation of input devices and interaction techniques: application to multi-
touch interactions. In: EICS ‘14: The Proceedings of the 2014 ACM SIGCHI Symposium on
Engineering Interactive Computing Systems. Rome, Italy, 17–20 June 2014. New York, NY: ACM,
pp. 173–8.
Hamon A., Palanque P., Silva J-L., Deleris Y., and Barboni, E., 2013. Formal description of multi-touch
interactions. In: EICS ‘13: The Proceedings of the 2013 ACM SIGCHI Symposium on Engineering
Interactive Computing Systems. London, UK, 24–27 June 2013. New York, NY: ACM.
Harel D., and Naamad A., 1996. The STATEMATE Semantics of Statecharts. ACM Transactions on
Software Engineering and Methodology, 5(4), pp. 293–333.
Hick, W., 1952. On the rate of gain of information. Quarterly Journal of Experimental Psychology, 4, pp. 11–26.
Hyman, R., 1953. Stimulus information as a determinant of reaction time. Journal of Experimental
Psychology, 45, pp. 188–96.
Jensen, K., Kristensen, L., and Wells, L., 2007. Coloured Petri nets and CPN tools for modelling and
validation of concurrent systems. International Journal of Software Tools for Technology Transfer,
9(3), pp. 213–54.
John, B., and Kieras, D., 1996a. The GOMS Family of User Interface Analysis Techniques: Compari-
son and Contrast. ACM Transactions on Computer Human Interaction, 3(4), pp. 320–51.
John, B., and Kieras, D., 1996b. Using GOMS for User Interface Design and Evaluation: Which
Technique. ACM Transactions on Computer Human Interaction, 3(4), pp. 287–319.
Kieras, D., and Meyer, D., 1997. An Overview of the EPIC Architecture for Cognition and Per-
formance with Application to Human-Computer Interaction. Human Computer Interaction, 12,
pp. 391–438.
Ladry, J. F., Navarre, D., and Palanque, P., 2009. Formal description techniques to support the design,
construction and evaluation of fusion engines for sure (safe, usable, reliable and evolvable) multi-
modal interfaces. In: ICMI-MLMI ‘09 Proceedings of the 2009 international conference on Multimodal
interfaces. Cambridge, MA, 2–4 November 2009.
Lee, B., and Oulasvirta, A., 2016. Modelling Error Rates in Temporal Pointing. In: CHI ‘16: The
Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. San Jose, CA, 7–12
May 2016. New York, NY: ACM, pp. 1857–68.
Lee, E. A., 2016. Fundamental Limits of Cyber-Physical Systems Modeling. ACM Transactions
on Cyber-Physical Systems, 1(1), [online] 26 pages. Available from: http://dl.acm.org/citation.
cfm?id=2912149&CFID=973113565&CFTOKEN=26116238.
Martinie, C., Palanque, P. A., Barboni, E., and Ragosta, M., 2011. Task-model based assessment of
automation levels: Application to space ground segments. In: Proceedings of the IEEE International
Conference on Systems, Man and Cybernetics. Anchorage, Alaska, 9–12 October 2011. Red Hook,
NY: Curran, pp. 3267–73.
Mayhew D. J., 1999. The Usability Engineering Lifecycle. Burlington, MA: Morgan Kaufmann.
Moher, T., Dirda, V., Bastide, R., and Palanque, P., 1996. Monolingual, Articulated Modeling of
Devices, Users, and Interfaces. In: DSVIS’96: Proceedings of the Third International Eurographics
Workshop. Namur, Belgium, 5–7 June 1996. Berlin: Springer, pp. 312–29.
Navarre, D., Palanque, P., Ladry, J.-F., and Barboni, E., 2009. ICOs: a Model-Based User Interface
Description Technique dedicated to Interactive Systems Addressing Usability, Reliability and Scal-
ability. Transactions on Computer-Human Interaction, 16(4). DOI: 10.1145/1614390.1614393.
Palanque, P., Barboni, E., Martinie, C., Navarre, D., and Winckler, M., 2011. A model-based approach
for supporting engineering usability evaluation of interaction techniques. In: EICS ’11: The Pro-
ceedings of the 3rd ACM SIGCHI symposium on Engineering Interactive Computing Systems. New
York, NY: ACM, pp. 21–30.
Palanque, P., and Bastide, R., 1995. Verification of an interactive software by analysis of its formal
specification. In: Interact ’95: The Proceedings of the IFIP Human-Computer Interaction Conference.
Lillehammer, Norway, 14–18 July 1995. pp. 181–97.
Palanque, P., Bernhaupt, R., Navarre, D., Ould, M., and Winckler, M., 2006. Supporting
usability evaluation of multimodal man-machine interfaces for space ground segment
applications using Petri Net based formal specification. In: SpaceOps 2006: The Proceedings
of the 9th International Conference on Space Operations. Rome, Italy, 19–23 June 2006. Available at:
https://doi.org/10.2514/MSPOPS06.
Palanque, P., Ladry, J-F., Navarre, D., and Barboni, E., 2009. High-Fidelity Prototyping of Interactive
Systems Can Be Formal Too. HCI International, (1), pp. 667–76.
Peterson, J. L., 1981. Petri Net Theory and the Modeling of Systems. New York, NY: Prentice Hall.
Petri, C. A., 1962. Kommunikation mit Automaten. Rheinisch-Westfaliches Institut fur Intrumentelle
Mathematik an der Universitat Bonn, Schrift Nr 2.
Pnueli A., 1986. Applications of Temporal Logic to the Specification and Verification of Reactive
Systems: A Survey of Current Trends. LNCS, (224), pp. 510–84. Berlin: Springer.
Reisig, W., 2013. Understanding Petri Nets—Modeling Techniques, Analysis Methods, Case Studies.
Berlin: Springer.
Russo, J. E., 1978. Adaptation of Cognitive Processes to Eye Movements. Eye Movements and Higher
Psychological Functions. New York, NY: Lawrence Erlbaum.
Silva, J. L., Fayollas, C., Hamon, A., Palanque, P., Martinie, C., and Barboni, E., 2014. Analysis
of WIMP and Post WIMP Interactive Systems based on Formal Specification. Electronic Commu-
nications of the EASST, [online] 69(55): 29 pages. Available at: https://journal.ub.tu-berlin.de/
eceasst/article/view/967.
Woods, W. A., 1970. Transition network grammars for natural language analysis. Communications of
the ACM, 13(10), pp. 591–606.
PA RT IV
Human Behaviour
10
10.1 Introduction
A Partially Observable Markov Decision Process (POMDP) is a mathematical framework
for modelling sequential decision problems. We show in this chapter that a range of phe-
nomena in Human-Computer Interaction can be modelled within this framework and we
explore its strengths and weaknesses. One important strength of the framework is that
it embraces the highly adaptive, embodied and ecological nature of human interaction
(Fu and Pirolli, 2007; Howes, Vera, Lewis, and McCurdy, 2004; Payne and Howes, 2013;
Payne, Howes, and Reader, 2001; Pirolli and Card, 1999; Vera, Howes, McCurdy, and Lewis,
2004) and it thereby provides a suitable means of rigorously explaining a wide range of
interaction phenomena.
Consider as an illustration a task where a user searches for an image in the results returned
by a web search engine (Figure 10.1). Tseng and Howes (2015) studied a version of this
task in which a person has the goal of finding an image with a particular set of features,
for example to find an image of a castle with water and trees. A user with this task must
make multiple eye movements and fixations because the relatively high resolution fovea is
sufficient to provide the details of only about 2.5 degrees of visual angle. Eventually, after a
sequence of these partial observations the user might find a relevant image (e.g. the one in
the bottom right of the figure). Evidence shows that people do not perform a search such
as this using a systematic top-left to bottom-right strategy; they do not start at the top left
and then look at each image in turn. But, equally, they do not search randomly. Rather, the
search is a rational adaptation to factors that include the ecological distribution of images
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
Figure 10.1 Simulated results from a search engine. Reproduced with permission from Elsevier.
and the partial observations provided by the visual fixations. We show in this chapter that
rational strategies like these can be modelled as the emergent solution to a POMDP.
The approach that we take is heavily influenced by a long tradition of computational
models of human behaviour (Card, Moran, and Newell, 1983; Howes and Young, 1996;
Howes, Vera, Lewis, and McCurdy, 2004; Byrne, 2001; Gray, Sims, Fu, and Schoelles,
2006; Miller and Remington, 2004; Kieras and Hornof, 2014; Kieras, Meyer, Ballas, and
Lauber, 2000; Halverson and Hornof, 2011; Hornof, Zhang, and Halverson, 2010; Zhang
and Hornof, 2014; Janssen, Brumby, Dowell, Chater, and Howes, 2011). A key insight of
this work has been to delineate the contribution to interactive behaviour of information
processing capacities (e.g. memory, perception, manual movement times), on the one hand,
and strategies (methods, procedures, policies), on the other. A contribution of the POMDP
approach is that strategies emerge through learning given a formal specification of the
interaction problem faced by the user where the problem includes the bounds imposed by
individual processing capacities.
To illustrate the contribution of POMDPs, we examine two examples of their application
to explaining human-computer interaction. The first is a model of how people search
menus and the second of how they use visualizations to support decision making. Before
describing the examples, we first give an overview of POMDPs. In the discussion,
(Figure: schematic of a POMDP. A sequence of states s0 … s3, actions a0 … a2, observations o0 … o2, and rewards r0 … r2, linked by the reward function R, the transition function T, the observation function Z, and the policy.)
reward. However, eye-movements and fixations incur a negative reward (a cost) that is
proportional to the time taken. Therefore, maximizing reward means finding a sequence
of actions that trades more matching features against time. Each state s in this task might
consist of a representation of the thirty-six images on the display and the location of the
current fixation. The images might be represented as a bitmap or in terms of a more abstract
symbolic feature vector. The actions A might include eye movements, mouse movements,
and button presses. Again these might be abstracted to just include fixation locations and
selections. An observation O would encode information from the fixated location (to 2.5
degrees of visual angle) with high reliability and information from the periphery with
lower reliability according to the observation function Z. The transition function T models
consequences of actions in A such as changing the fixation location. The transition function
also models the reliability of action. For example, users might intend to fixate the second
image from the left in the third row, but with a small probability fixate an adjacent image.
The usefulness of the assumption that users can be modelled with POMDPs rests on the
ability of researchers to find psychologically valid definitions of ⟨S, A, O, T, Z, R, γ⟩
and the use of machine learning algorithms to find approximately optimal strategies. Finding
optimal strategies is computationally hard and often requires the application of the latest
machine learning methods. Lastly, the usefulness of POMDPs also rests on demonstrating
some correspondence between the behaviour generated from the learned optimal strategies
and human behaviour.
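The tuple ⟨S, A, O, T, Z, R, γ⟩ can be made concrete in code. The sketch below is a minimal, hypothetical container for a POMDP specification together with a toy two-state instance; the names and structure are illustrative and are not taken from any particular library or from the models discussed in this chapter.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class POMDP:
    """Container for a POMDP specification <S, A, O, T, Z, R, gamma>."""
    states: Sequence[Any]        # S: e.g. display contents plus fixation location
    actions: Sequence[Any]       # A: e.g. fixation locations and selections
    observations: Sequence[Any]  # O: partial views of the state
    T: Callable[[Any, Any, Any], float]  # T(s' | s, a): transition probability
    Z: Callable[[Any, Any, Any], float]  # Z(o | s', a): observation probability
    R: Callable[[Any, Any], float]       # R(s, a): reward (negative for time costs)
    gamma: float = 1.0                   # discount factor

# A toy instance: two states, static dynamics, a noisy observation channel.
toy = POMDP(
    states=[0, 1],
    actions=["look", "select"],
    observations=[0, 1],
    T=lambda s2, s, a: 1.0 if s2 == s else 0.0,   # the state never changes
    Z=lambda o, s2, a: 0.9 if o == s2 else 0.1,   # 90% reliable observation
    R=lambda s, a: -0.1 if a == "look" else (1.0 if s == 1 else -1.0),
)
```

Solving even this toy problem means finding a policy that maps belief states (probability distributions over S) to actions so as to maximize expected cumulative reward.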
people to find highly efficient strategies that trade the minimal number of eye movements
for the highest quality image. Tseng and Howes (2015) found that when people performed
this task in a laboratory they appeared to be computationally rational (Lewis, Howes, and
Singh, 2014; Howes, Lewis, and Vera, 2009). In other words, people were as efficient as they could be given the computational limits of their visual system.
There has been much research in HCI and cognitive science demonstrating computa-
tionally rational adaptation (Fu and Pirolli, 2007; Payne and Howes, 2013; Lewis, Howes,
and Singh, 2014; Trommershäuser, Glimcher, and Gegenfurtner, 2009; Sprague, Ballard,
and Robinson, 2007; Hayhoe and Ballard, 2014; Nunez-Varela and Wyatt, 2013; Russell,
Stefik, Pirolli, and Card, 1993; Russell and Subramanian, 1995). These analyses take into
account the costs and benefits of each action to the user so as to compose actions into
efficient behavioural sequences. One prominent example of this approach is Card’s cost of
knowledge function (Card, Pirolli, and Mackinlay, 1994). In addition, there are models of
multitasking in which the time spent on each of two or more tasks is determined by their
relative benefits and time costs (Zhang and Hornof, 2014). This literature supports the
general idea that finding reward maximizing policies that solve well-defined POMDPs is
a promising approach to modelling interaction.
will perform interactive menu search for newly experienced menus. The requirement is that
the model should learn, from experience, the best way to search for new targets in new,
previously unseen, menus.
Chen, Bailly, Brumby, Oulasvirta, and Howes (2015) hypothesized that a key property of
the menu search task is that human search strategies should be influenced by the distribution
of relevance across menus. If highly semantically relevant items are rare then the model
should learn to select them as soon as they are observed, whereas if they are very common
then they are less likely to be correct, and the model should learn to gather more evidence
before selection. The goal for the model, therefore, is not just to learn how to use a
single menu, but rather it is how to use a new menu that is sampled from an experienced
distribution of menus.
To achieve this goal Chen, Bailly, Brumby, Oulasvirta, and Howes (2015) built a com-
putational model that can be thought of as a simple POMDP. In the model an external
representation of the displayed menu is fixated and an observation is made that encodes
information about the relevance of word shapes (‘Minimize’ and ‘Zoom’, for example, have
different lengths) and semantics (word meanings). This observation is used to update a
vector representing a summary of the observation history (a belief about the state). This
vector has an element for the shape relevance of every item in the menu, an element for
the semantic relevance of every item in the menu, and an element for the current fixation
location. The vector elements are null until estimates are acquired through observation.
An observation is made after each fixation action, e.g. after fixating ‘Minimize’ in the above
example. After having encoded new information through observation, the policy chooses an
action on the basis of the current belief. The chosen action might be to fixate on another item
or to make a selection, or to exit the menu if the target is probably absent. Belief-action values
are updated incrementally (learned) as reward feedback is received from the interaction.
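The belief vector described above can be sketched as a small data structure and update rule. Everything below is illustrative: the element names, the menu size, and the observation format are hypothetical stand-ins, not the published model's representation.

```python
# Belief: one shape-relevance and one semantic-relevance element per menu
# item, plus the current fixation location. Elements start as None ("null")
# and are filled in as observations arrive.
N_ITEMS = 8

def initial_belief():
    return {"shape": [None] * N_ITEMS,
            "semantic": [None] * N_ITEMS,
            "fixation": None}

def update_belief(belief, fixation, observation):
    """Fold one observation (dicts of item -> relevance value) into the belief."""
    belief = {"shape": list(belief["shape"]),
              "semantic": list(belief["semantic"]),
              "fixation": fixation}
    for item, value in observation.get("shape", {}).items():
        belief["shape"][item] = value
    for item, value in observation.get("semantic", {}).items():
        belief["semantic"][item] = value
    return belief

b = initial_belief()
# Fixating item 2 yields its semantic relevance plus shape information
# for the fixated item and its immediate neighbours.
b = update_belief(b, 2, {"semantic": {2: 0.6}, "shape": {1: 1, 2: 0, 3: 0}})
```

The policy would then map such a belief to the next fixation, a selection, or an exit action.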
relevance ranking given by participants. Chen, Bailly, Brumby, Oulasvirta, and Howes’s
(2015) model also observed the length of each menu item (0 for non-target length; 1 for
target length). Observations of alphabetic relevance were determined using the distance
apart in the alphabet of target and fixated first letters. This was then standardized to a four-
level scale between 0 and 1, i.e., [0, 0.3, 0.6, 1]. Further details are reported in Chen, Bailly,
Brumby, Oulasvirta, and Howes (2015).
Visual acuity is known to reduce with eccentricity from the fovea (Kieras and Hornof,
2014). In Chen and colleagues’ (2015) model, the acuity function was represented as the
probability that a visual feature was recognized. Semantic information was available with
probability 0.95 at the fovea and probability 0 elsewhere. The model made use of semantic
features and shape features but could easily be enhanced with other features such as colour.
These parameter settings resulted in the following availability probabilities: 0.95 for the item
fixated, 0.89 for items immediately above or below the fixated item, and 0 for items further
away. On each fixation, the availability of the shape information was determined by these
probabilities.
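The availability probabilities quoted above can be written as a small lookup keyed on the distance (in items) between a menu item and the fixated item. The function names are illustrative, not the model's.

```python
import random

def shape_availability(item, fixated):
    """P(shape feature encoded): 0.95 at the fovea, 0.89 for immediate
    neighbours, 0 for items further away (values quoted in the text)."""
    distance = abs(item - fixated)
    return {0: 0.95, 1: 0.89}.get(distance, 0.0)

def semantic_availability(item, fixated):
    """Semantic information: probability 0.95 at the fovea, 0 elsewhere."""
    return 0.95 if item == fixated else 0.0

def observe_shape(item, fixated, rng=random):
    """Sample whether the item's shape feature is encoded on this fixation."""
    return rng.random() < shape_availability(item, fixated)
```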
Chen and colleagues (2015) defined rewards for saccades and fixations in terms of their
durations as determined by the psychological literature. Saccades were given a duration that
was a function of distance. Fixations were given an average duration measured in previous
work on menus. A large reward was given for successfully finding the target or correctly
responding that it was absent.
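This reward structure can be sketched as follows. The numeric parameter values here are illustrative placeholders, not those estimated by Chen and colleagues (2015); what matters is the shape: saccade cost grows with distance, fixations cost a fixed average duration, and a large reward attaches to a correct final response.

```python
# Hypothetical parameter values, for illustration only.
SACCADE_BASE_MS = 37.0   # assumed intercept of the saccade-duration function
SACCADE_SLOPE_MS = 2.7   # assumed ms per degree of visual angle
FIXATION_MS = 400.0      # assumed average fixation duration on menus
SUCCESS_REWARD = 10000.0 # large reward for a correct selection/exit

def reward(action, distance_deg=0.0, correct_response=False):
    """Negative reward (cost) for eye movements, large reward for success."""
    if action == "fixate":
        return -(SACCADE_BASE_MS + SACCADE_SLOPE_MS * distance_deg + FIXATION_MS)
    if action in ("select", "exit"):
        return SUCCESS_REWARD if correct_response else -SUCCESS_REWARD
    raise ValueError(action)
```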
10.3.3 Learning
Chen and colleagues’ (2015) model solved the POMDP using Q-learning. The details
of the algorithm are not described here but they can be found in any standard Machine
Learning text (e.g. Sutton and Barto (1998)). Q-learning uses the reward signal, defined
above, to learn state-action values (called Q values). A state-action value can be thought of
as a prediction of the future reward (both positive and negative) that will accrue if the action
is taken. Before learning, an empty Q-table was assumed in which all state-action values
were zero. The model therefore started with no control knowledge and action selection was
entirely random. The model was then trained until performance plateaued (requiring 20
million trials). On each trial, the model was trained on a menu constructed by sampling
randomly from the ecological distributions of shape and semantic/alphabetic relevance.
The model explored the action space using ε-greedy exploration. This means that it exploited the greedy (best) action with probability 1 − ε, and explored all the actions randomly with probability ε. Q-values were adjusted according to the reward feedback. The
(approximately) optimal policy acquired through this training was then used to generate the
predictions described below.
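The training procedure described above (an empty Q-table, ε-greedy exploration, reward-driven updates) can be sketched as a standard tabular Q-learning loop. The environment below is a deliberately trivial stand-in, not the menu-search POMDP; the loop structure is the point.

```python
import random
from collections import defaultdict

def q_learning(step, start, actions, episodes=2000, alpha=0.1,
               gamma=0.95, epsilon=0.1, rng=None):
    rng = rng or random.Random(0)
    Q = defaultdict(float)                   # empty Q-table: all values zero
    for _ in range(episodes):
        s = start
        done = False
        while not done:
            if rng.random() < epsilon:       # explore with probability epsilon
                a = rng.choice(actions)
            else:                            # otherwise exploit greedily
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # reward-feedback update
            s = s2
    return Q

# Toy environment: "go" reaches the goal (reward 1); "stay" wastes a step.
def step(s, a):
    if a == "go":
        return 1, 1.0, True
    return 0, -0.1, False

Q = q_learning(step, start=0, actions=["stay", "go"])
```

Starting from all-zero values, action selection is effectively random; as reward feedback accumulates, the greedy action converges on the higher-value choice.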
Figure 10.3 Menu ecology of a real-world menu task (Apple OS X menus). Left panel: The distribution of semantic relevance. Right panel: The distribution of menu length. Legend: non-target group (target-absent), non-target group (target-present), and target group (target-present). Reproduced with permission from ACM.
Figure 10.4 The search duration taken by the optimal strategy for each type of menu (95%
confidence intervals (C.I.s)). Reproduced with permission from ACM.
The optimal policy achieved 99 per cent selection accuracy. The utility of all models plateaued,
suggesting that the learned strategy was a good approximation to the optimal strategy.
Figure 10.4 is a plot of the duration required for the optimal policy to make a selection
given four types of experience crossed with three types of test menu. The purpose of this
analysis is to show how radically different patterns of behaviour emerge from the model
based on how previously experienced menus were organized. Prior to training (the far left
Initial set of three bars in the figure), the model offers the slowest performance; it is unable
to take advantage of the structure in the alphabetic and semantic menus because it has no
control knowledge. After training on a distribution of Unorganized menus (far right set in the
figure), performance time is better than prior to training. However, there is no difference in
performance time between the different menu organizations. After training on a distribution
of semantically organized menus (middle right set in the figure), the model is able to take
advantage of semantic structure, but this training is costly to performance on alphabetic
and unorganized menus. After training on a distribution of alphabetically organized menus
(middle left set in the figure), the model is able to take advantage of alphabetic ordering,
but this training is again costly to the other menu types. The policy must therefore be switched depending on the menu type.
Figure 10.5 shows the effect of different semantic groupings on performance time, as reported by Chen and colleagues (2015). It contrasts the performance time predictions
for menus that are organized into three groups of three or into two groups of five. The
contrast between these kinds of design choices has been studied extensively before (Miller
Figure 10.5 The effect of semantic group size (95% C.I.s). Reproduced with permission
from ACM.
and Remington, 2004). What has been observed is an interaction between the effect of
longer menus and the effect of the number of items in each semantic group (See Miller
and Remington’s (2004) Figure 8). As can be seen in Figure 10.5, while the effect of longer
menus (3 × 3 = 9 versus 2 × 5 = 10) is longer performance times in the unorganized and
alphabetic menus, the effect of organization (three groups of three versus two of five) gives
shorter performance times in the semantic condition. This prediction corresponds closely
to a number of studies (See Miller and Remington’s (2004) Figure 8).
The results show that deriving adaptive strategies using a reinforcement learning algo-
rithm, given a definition of the human menu search problem as a POMDP, bounded by
the constraints of the human visual system (embodiment) and the ecology of the task
environment (ecology), can lead to reasonable predictions about interactive behaviour.
(Four panels, each plotting the proportion of gazes against target location: Alphabetic, top left; Unorganized, top right; Semantic, bottom left; aggregated model and human data, bottom right.)
Figure 10.6 The proportion of gazes on the target location for each of the three types of
menu (95% C.I.s). Reproduced with permission from ACM.
Interestingly, both the model and the participants selectively gazed at targets at either end
of the menu more frequently than targets in the middle. This may reflect the ease with
which words beginning with letters early or late in the alphabet can be located. In the top right
panel, there is no organizational structure to the menu and the model’s gaze distribution
is a consequence of shape relevance only in peripheral vision. The model offers a poor
prediction of the proportion of gazes on the target when it is in position 1, otherwise, as
expected, the distribution is relatively flat in both the model and the participants. In the
bottom left panel, the model’s gaze distribution is a function of semantic relevance and shape
relevance. Here there are spikes in the distribution at position 1 and 5. In the model, this is
because the emergent policy uses the relevance of the first item of each semantic group as
evidence of the content of that group. In other words, the grouping structure of the menu
is evident in the emergent gaze distributions. The aggregated data are shown in the bottom
right panel; the model predicts the significant effect of organization on gaze distribution,
although it predicts a larger effect for alphabetic menus than was observed.
10.3.6 Discussion
Chen and colleagues (2015) reported a model in which search behaviours were an emergent
consequence of an interaction problem specified as a POMDP. The observation function
and state transition function modelled the cognitive and perceptual limits of the user, and
the reward function modelled the user’s preferences for speed and accuracy. Predicted
strategies were generated using machine learning to infer an approximately optimal policy
for the POMDP. The model was tested with two studies. The first study involved applying
the model to a real world distribution of menu items and in the second study the model was
compared to human data from a previously reported experiment.
Unlike in previous models, no assumptions were made about the gaze strategies available to or adopted by users. Instead, a key property of the approach is that behavioural
predictions are derived by maximizing utility given a quantitative theory of the constraints
on behaviour, rather than by maximizing fit to the data. Although this optimality assump-
tion is sometimes controversial, the claim is simply that users will do the best they can
with the resources that are available to them. Further discussions of this issue can be
found in (Howes, Lewis, and Vera, 2009; Lewis, Howes, and Singh, 2014; Payne and
Howes, 2013).
Figure 10.7 Four interface variants for credit card fraud detection. The information cues are
represented with text (left panels) or light and dark grey shaded (right panels) and the information
is either immediately available (bottom panels) or revealed by clicking the ‘Reveal’ buttons (top
panels). Reproduced with permission from ACM.
Block/Allow transaction). Therefore, the size of the action space is eleven (nine cues plus
two decision actions).
At any moment, the environment (in one of the states s) generates a reward (cost if the
value is negative), r(s, a), in response to the action taken a. For the information gathering
actions, the reward is the time cost (the unit is seconds). The time cost includes both the
dwell time on the cues and the saccadic time cost of travelling between cues. The dwell
durations used in the model were determined from experimental data. In the experiment to
be modelled (described below), the participants were asked to complete 100 correct trials
as quickly as possible, so that errors were operationalized as time cost. In the model, the
cost for incorrect decisions is based on participants’ average time cost in seconds for a trial (CT: 17 ± 15; CC: 24 ± 7; VT: 20 ± 8; VC: 13 ± 3). That is, the penalty of an incorrect trial
is the time cost for doing another trial.
In addition to the reward, another consequence of the action is that the environment
moves to a new state according to the transition function. In the current task the states (i.e.
displayed information patterns) stay unchanged across time steps within one trial. There-
fore, T(St+1 |St , At ) equals to 1 only when St+1 = St . T(St+1 |St , At ) equals 0 otherwise.
That is, the state transition matrix is the identity matrix.
After transitioning to a new state, a new observation is received. The observation, ot ∈ O,
is defined as the information gathered at the time step t. An observation is a 9-element
vector, each element of which represents the information gathered for one of the cues. Each
element of the observation has three levels, F (fraudulent), N (normal), and U (unknown).
For example, one observation might be represented as [F,N,U,F,U,U,U,N,N]. Therefore, the
upper bound on the observation space is 3⁹ = 19683.
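This bound on the observation space follows directly from the encoding, and can be checked by enumeration; the snippet below is just an illustration of the nine-cue, three-level representation.

```python
from itertools import product

LEVELS = ("F", "N", "U")   # fraudulent, normal, unknown
N_CUES = 9

# Every possible 9-element observation vector over the three levels.
observations = list(product(LEVELS, repeat=N_CUES))
assert len(observations) == 3 ** N_CUES == 19683

# The example observation from the text is one such vector.
example = ("F", "N", "U", "F", "U", "U", "U", "N", "N")
assert example in set(observations)
```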
For the observation function p(Ot | St, At), the availability of information about a cue
is dependent on the distance between this cue and the fixation location (eccentricity). In
addition, it is known that an object’s colour is more visible in the periphery than the object’s
text label (Kieras and Hornof, 2014). In our model, the observation model is based on the
acuity functions reported in (Kieras and Hornof, 2014), where the visibility of an object
is dependent on, for example, the object size, the object feature (colour or text), and the
eccentricity.
The observation obtained is constrained by a theory of the limits on the human visual
system. The model assumed that the text information was obtained only when it was fixated.
The colour information was obtained based on the colour acuity function reported in
(Kieras and Hornof, 2014). This function was used to determine the availability of the
colour information for each cue given the distance between the cues and the fixated location
(called eccentricity), and the size of the item. Specifically, on each fixation, the availability of
the colour information was determined by the probabilities defined in Equation (10.1).
P(available) = P(size + X > threshold) (10.1)
where size is the object size in terms of visual angle in degrees; X ∼ N(size, v × size); threshold = a × e² + b × e + c; and e is eccentricity in terms of visual angle in degrees. In the model, the function was set with parameter values of v = 0.7, b = 0.1, c = 0.1, a = 0.035, as in Kieras and Hornof (2014).
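Equation (10.1) can be evaluated in closed form with the normal cumulative distribution function. The sketch below treats v × size as the standard deviation of X, which the notation above leaves implicit, so this is an interpretation rather than a definitive transcription of the published model.

```python
import math

V, A, B, C = 0.7, 0.035, 0.1, 0.1   # parameter values given in the text

def p_available(size, eccentricity):
    """P(size + X > threshold), X ~ N(size, (V*size)^2),
    threshold = A*e^2 + B*e + C, all angles in degrees."""
    threshold = A * eccentricity ** 2 + B * eccentricity + C
    mean = size + size                  # mean of size + X
    sd = V * size                       # assumed: V*size is the std deviation
    # P(Y > threshold) for Y ~ N(mean, sd^2), via the error function.
    z = (threshold - mean) / sd
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Because the threshold grows quadratically with eccentricity, availability falls steeply in the periphery, which is the qualitative behaviour the acuity function is meant to capture.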
10.4.3 Learning
Control knowledge was represented in Chen, Starke, Baber, and Howes’s (2017) model
as a mapping between beliefs and actions, which was learned with Q-learning (Watkins
and Dayan, 1992). Further details of the algorithm can be found in any standard Machine
Learning text (e.g., Watkins and Dayan, 1992; Sutton and Barto, 1998).
Before learning, a Q-table was assumed in which the values (Q-values) of all belief-action
pairs were zero. The model therefore started with no control knowledge and action selection
was entirely random. The model was then trained through simulated experience until
performance plateaued. The model explored the action space using ε-greedy exploration.
Q-values of the encountered belief-action pairs were adjusted according to the reward
feedback. The idea is that Q-values are learned (or estimated) by simulated experience of
the interaction tasks. The true Q-values are estimated by the sampled points encountered
during the simulations. The optimal policy acquired through this training was then used to
generate the predictions described below (last 1000 trials of the simulation).
While Chen, Starke, Baber, and Howes (2017) used Q-learning, any reinforcement
learning algorithm that converges on the optimal policy is sufficient to derive the rational
adaptation (Sutton and Barto, 1998). The Q-learning process is not a theoretical commitment: its purpose is merely to find the optimal control policy, not to model the process of learning, and it is therefore used to achieve methodological optimality and determine the computationally rational strategy (Lewis, Howes, and Singh, 2014). Alternative learning
algorithms include QMDP (Littman, Cassandra, and Kaelbling, 1995) or DQN (Mnih,
Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra, and Riedmiller, 2013b).
• Covered-Text (CT) condition (Figure 10.7a): The cue information was presented
in covered text. In order to check each cue, the participants had to click on the
associated button on each cue and wait for 1.5 seconds while a blank screen was
shown.
• Covered-Colour (CC) condition (Figure 10.7b): The cue information was pre-
sented by colour (light grey shade for possibly normal, dark grey shade for possibly
fraudulent). As with CT, the information was covered until clicked.
• Visible-Text (VT) condition (Figure 10.7c): The cue information was presented in
text. The information was visible immediately (no mouse-click was required).
• Visible-Colour (VC) condition (Figure 10.7d): The cue information was presented
in colour and no mouse-click was required to reveal it.
Figure 10.8 The number of cues (left) and accuracy (right) predicted by the model across four
experimental conditions (x-axis). Model predictions (grey crosses) are plotted with the participants’ data (boxplots). The figure shows that the model predicts that an elevated number of cues will be fixated by participants in the Visible-Text condition and a reduced number in the Visible-Colour condition. Reproduced with permission from ACM.
foveated vision is required to read text, and therefore more cues are used. In contrast, in the
covered conditions (Covered-Text and Covered-Colour), information access is expensive,
thus reducing the number of cues used. Lastly, in the Visible-Colour condition, peripheral
vision, rather than only foveated vision, can be used to access information and it appears
that, as a consequence, fewer cues are used, at least in terms of direct fixation.
10.5 Discussion
This chapter has reviewed two models, previously presented by Chen, Bailly, Brumby,
Oulasvirta and Howes (2015) and Chen, Starke, Baber and Howes (2017), and shown
that POMDPs permit a rigorous definition of interaction as an emergent consequence of
constrained sequential stochastic decision processes. Theories of human perception and
action were used to guide the construction of partial observation functions and transition
functions for the POMDP. A reinforcement learning algorithm was then used to find
approximately optimal strategies. The emergent interactive behaviours were compared to
human data and the models were shown to offer a computational explanation of interaction.
In the following paragraphs we summarize the ways in which embodiment, ecology, and
adaptation constrain the emergence of interaction.
Embodied interaction. In the models, interaction was both made possible by and con-
strained by the way in which cognition is embodied. A key element of embodiment concerns
the constraints imposed by the biological mechanisms for encoding information from the
environment. In both the menu and the decision model, the limitations of the observation
function were key factors that shaped the cognitive strategies determining interaction. The
observation function modelled human foveated vision. In menu search, foveated vision
was the primary determinant of beliefs about the semantic relevance of items and in the
decision task it was a primary determinant of beliefs about whether or not a transaction
was fraudulent. However, peripheral vision also played a role. In the menu search model,
peripheral vision provided an extra source of eye-movement guidance through the detection
of items with similar or dissimilar shapes to the target. In the decision model, peripheral
vision encoded cue validities without direct fixation, but only when cues were displayed
with colour, and thereby played a substantive role in distinguishing the properties of the
different visualizations.
Ecological interaction. The ecological nature of interaction was most evident in the menu
search model, where menus were sampled from distributions that determined the propor-
tions of high, medium, and low relevance distractor items. These ecological distributions
were critical to the structure of the emergent cognitive strategies. While we did not report
the analysis, it is obvious that the more discriminable these distributions (the greater the
d’ in signal detection terms) then the more it should be possible for the agent to adopt
strategies that localize fixation to areas where the target is more likely to be found. Ecological
constraints on the cognitive strategies were also evident in the decision model where
the adaptive strategies were dependent on the distribution of cue validities. Here, high
validity cues were rare and low validity cues relatively common resulting in behaviours that
emphasized fixating high validity cues.
Adaptive interaction. In both the menu model and the decision model, the strategies for
information gathering and choice, and consequentially the behaviour, were an adaptive
consequence of embodied interaction with an ecologically determined task environment.
The strategies emerge from the constraints imposed by ecology and embodiment through
experience. They are adaptive to the extent that they approximate the optimal strategies for
the user; that is, those strategies that maximize utility.
Chen, Starke, Baber, and Howes (2017) report that the key property of the model, that
it predicts interactive strategies given constraints, is evident in the fact that it generates a
broad range of strategies that are also exhibited by humans. For example, in addition to
those described earlier, it also predicts a well-known strategy called centre-of-gravity (also
called averaging saccades or the global effect) (Findlay 1982; Vitu 2008; Van der Stigchel
and Nijboer 2011; Venini, Remington, Horstmann, and Becker, 2014), which refers to the
fact that people frequently land saccades on a region of low-interest that is surrounded by
multiple regions of high-interest. Figure 10.9 shows that this ‘centre-of-gravity’ effect is an
emergent effect of our model. The model also predicts inhibition-of-return.
The fact that the model is able to predict the strategies that people use is a departure from
models that are programmed with strategies so as to fit performance time. This is important
because it suggests that the theory might be easily developed in the future so as to rapidly
evaluate the usability of a broader range of interactions. For example, in the near future it
should be possible to consider multidimensional visualizations that not only make use of
colour, but also size, shape, grouping, etc. It should be possible to increment the observation
functions, for example, with a shape detection capacity, and then use the learning algorithm
to find new strategies for the new visualizations.
Figure 10.9 In each row of the figure, the frequency of the model’s cue fixations (right panel)
is shown for a different spatial arrangement of cue validities (left panel). The validity of ROIs in
the left panels is represented as a heat map (high validity is a lighter colour). The frequency of
model fixations is represented as a box plot. The column numbers in the box plot correspond to
the numbers of the ROIs (1..9). In the top row, ROI number 4 has a low validity but is surrounded
by relatively high validity ROIs (1, 5, and 7). In contrast, in the bottom row, ROI number 5
has a low validity and surrounding ROIs 2, 4 and 6 have high validity. In both rows, the model
fixates frequently on the ROI that is surrounded by high validity ROIs. This is known as a centre-
of-gravity effect. Reproduced with permission from ACM.
• Despite the fact that POMDPs are a universal formalism for representing sequential decision processes, the range of human behaviours that has been modelled to date is quite limited. Mostly, the existing models are variants of visual information gathering tasks. The continued expansion of the scope of POMDP models of
humans rests on the ability of researchers to find psychologically valid definitions of
⟨S, A, O, T, Z, R, γ⟩.
In conclusion, evidence supports the claim that POMDPs provide a rigorous framework
for defining interaction as constrained sequential stochastic decision processes. POMDPs
can offer a computational explanation of interactive behaviour as a consequence of embod-
iment, ecology, and adaptation.
....................................................................................................
references
Bailly, G., and Malacria, S., 2013. MenuInspector: Outil pour l’analyse des menus et cas d’étude. In:
IHM ’13 Proceedings of the 25th Conference on l’Interaction Homme-Machine. New York, NY: ACM.
Bailly, G., Oulasvirta, A., Brumby, D. P., and Howes, A., 2014. Model of visual search and selection
time in linear menus. In: ACM CHI’14. New York, NY: ACM, pp. 3865–74.
Brumby, D. P., Cox, A. L., Chung, J., and Fernandes, B., 2014. How does knowing what you are looking
for change visual search behavior? In: ACM CHI’14. New York, NY: ACM, pp. 3895–8.
Butko, N. J., and Movellan, J. R., 2008. I-POMDP: An infomax model of eye movement. In: 2008 IEEE
7th International Conference on Development and Learning. New York, NY: ACM, pp. 139–44.
Byrne, M. D., 2001. ACT-R/PM and menu selection: Applying a cognitive architecture to HCI.
International Journal of Human-Computer Studies, 55(1), pp. 41–84.
Card, S. K., Moran, T. P., and Newell, A., 1983. The Psychology of Human-Computer Interaction.
Hillsdale, NJ: Lawrence Erlbaum.
Card, S. K., Pirolli, P., and Mackinlay, J. D., 1994. The cost-of-knowledge characteristic function: dis-
play evaluation for direct-walk dynamic information visualizations. In: Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems. New York, NY: ACM, pp. 238–44.
Charman, S. C., and Howes, A., 2001. The effect of practice on strategy change. In: Proceedings of
the Twenty-Third Annual Conference of the Cognitive Science Society. London: Psychology Press,
pp. 188–93.
Chen, X., Bailly, G., Brumby, D. P., Oulasvirta, A., and Howes, A., 2015. The Emergence of Interactive
Behaviour: A Model of Rational Menu Search. In: Proceedings of the 33rd Annual ACM Conference
on Human Factors in Computing Systems. New York, NY: ACM, pp. 4217–26.
Chen, X., Starke, S., Baber, C., and Howes, A., 2017. A Cognitive Model of How People Make
Decisions Through Interaction with Visual Displays. In: Proceedings of the ACM CHI’17 Conference
on Human Factors in Computing Systems. New York, NY: ACM, pp. 1205–16.
Dal Pozzolo, A., Caelen, O., Le Borgne, Y.-A., Waterschoot, S., and Bontempi, G., 2014. Learned
lessons in credit card fraud detection from a practitioner perspective. Expert Systems with
Applications, 41(10), pp. 4915–28.
Findlay, J. M., 1982. Global visual processing for saccadic eye movements. Vision Research, 22(8),
pp. 1033–45.
Fu, W.-T., and Pirolli, P., 2007. SNIF-ACT: A cognitive model of user navigation on the World Wide
Web. Human–Computer Interaction, 22(4), pp. 355–412.
Gigerenzer, G., and Todd, P. M., 1999. Fast and frugal heuristics: The adaptive toolbox. In: Simple
Heuristics That Make Us Smart. Oxford: Oxford University Press, pp. 3–34.
Gray, W. D., Sims, C. R., Fu, W.-T., and Schoelles, M. J., 2006. The soft constraints hypothesis: a rational
analysis approach to resource allocation for interactive behavior. Psychological Review, 113(3),
p. 461. doi: 10.1037/0033-295X.113.3.461
Halverson, T., and Hornof, A. J., 2011. A computational model of “active vision” for visual search in
human–computer interaction. Human–Computer Interaction, 26(4), pp. 285–314.
Hayhoe, M., and Ballard, D., 2014. Modeling Task Control of Eye Movements. Current Biology, 24(13),
pp. R622–R628.
Hornof, A. J., Zhang, Y., and Halverson, T., 2010. Knowing where and when to look in a time-critical
multimodal dual task. In: Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems. New York, NY: ACM, pp. 2103–12.
Howes, A., Duggan, G. B., Kalidindi, K., Tseng, Y.-C., and Lewis, R. L., 2015. Predicting Short-Term
Remembering as Boundedly Optimal Strategy Choice. Cognitive Science, 40(5), pp. 1192–223.
Howes, A., Lewis, R. R. L., and Vera, A., 2009. Rational adaptation under task and processing
constraints: implications for testing theories of cognition and action. Psychological Review, 116(4),
p. 717.
Howes, A., Vera, A., Lewis, R. L., and McCurdy, M., 2004. Cognitive constraint modeling: A formal
approach to supporting reasoning about behavior. In: Proceedings of the Cognitive Science Society.
Austin, TX: Cognitive Science Society, pp. 595–600.
Howes, A., and Young, R. M., 1996. Learning Consistent, Interactive, and Meaningful Task-Action
Mappings: A Computational Model. Cognitive Science, 20(3), pp. 301–56.
Janssen, C. P., Brumby, D. P., Dowell, J., Chater, N., and Howes, A., 2011. Identifying optimum
performance trade-offs using a cognitively bounded rational analysis model of discretionary task
interleaving. Topics in Cognitive Science, 3(1), pp. 123–39.
Jha, S., and Westland, J. C., 2013. A Descriptive Study of Credit Card Fraud Pattern. Global Business
Review, 14(3), pp. 373–84.
Kaelbling, L., Littman, M. L., and Cassandra, A., 1998. Planning and Acting in Partially Observable
Stochastic Domains. Artificial Intelligence, 101(1–2), pp. 99–134.
Kieras, D., and Hornof, A., 2014. Towards accurate and practical predictive models of active-vision-
based visual search. In: Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems. New York, NY: ACM, pp. 3875–84.
Kieras, D., Meyer, D., Ballas, J., and Lauber, E., 2000. Modern computational perspectives on exec-
utive mental processes and cognitive control: where to from here? In: S. Monsell and J. Driver,
eds. Control of cognitive processes: Attention and performance XVIII. Cambridge, MA: MIT Press,
pp. 681–712.
Lewis, R., Howes, A., and Singh, S., 2014. Computational rationality: linking mechanism and behavior
through bounded utility maximization. Topics in Cognitive Science, 6(2), pp. 279–311.
Littman, M. L., Cassandra, A. R., and Kaelbling, L. P., 1995. Learning policies for partially observ-
able environments: Scaling up. In: Proceedings of the Twelfth International Conference on Machine
Learning, Tahoe City. Burlington, MA: Morgan Kaufmann, p. 362.
Miller, C. S., and Remington, R. W., 2004. Modeling information navigation: Implications for infor-
mation architecture. Human-Computer Interaction, 19(3), pp. 225–71.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M.,
2013a. Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M.,
2013b. Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602, pp. 1–9.
Available at: http://arxiv.org/abs/1312.5602
Nunez-Varela, J., and Wyatt, J. L., 2013. Models of gaze control for manipulation tasks. ACM Transac-
tions on Applied Perception (TAP), 10(4), p. 20.
Pandey, M., 2010. Operational risk forum: A model for managing online fraud risk using transaction
validation. Journal of Operational Risk, (5)1, pp. 49–63.
Papadimitriou, C. H., and Tsitsiklis, J. N., 1987. The Complexity of Markov Decision Processes.
Mathematics of Operations Research, 12(3), pp. 441–50. doi: 10.1287/moor.12.3.441
Payne, S. J., and Howes, A., 2013. Adaptive Interaction: A Utility Maximization Approach to Under-
standing Human Interaction with Technology. Synthesis Lectures on Human-Centered Informatics,
6(1), pp. 1–111.
Payne, S. J., Howes, A., and Reader, W. R., 2001. Adaptively distributing cognition: a decision-
making perspective on human-computer interaction. Behaviour & Information Technology, 20(5),
pp. 339–46.
Pirolli, P., and Card, S., 1999. Information foraging. Psychological Review, 106, pp. 643–75.
Rao, R. P. N., 2010. Decision making under uncertainty: a neural model based on partially observable
markov decision processes. Frontiers in Computational Neuroscience. https://doi.org/10.3389/
fncom.2010.00146
Russell, D. M., Stefik, M. J., Pirolli, P., and Card, S. K., 1993. The cost structure of sensemaking. In:
Proceedings of the interact ’93 and CHI ’93 Conference on Human Factors in Computing Systems. New
York, NY: ACM, pp. 269–76.
Russell, S., and Subramanian, D., 1995. Provably bounded-optimal agents. Journal of Artificial Intelli-
gence Research, 2, pp. 575–609.
Sánchez, D., Vila, M. A., Cerda, L., and Serrano, J.-M., 2009. Association rules applied to credit card
fraud detection. Expert Systems with Applications, 36(2), pp. 3630–40.
Shani, G., Pineau, J., and Kaplow, R., 2013. A survey of point-based POMDP solvers. Autonomous
Agents and Multi-Agent Systems, 27(1), pp. 1–51.
Sprague, N., Ballard, D., and Robinson, A., 2007. Modeling embodied visual behaviors. ACM Trans-
actions on Applied Perception (TAP), 4(2), p. 11.
Sutton, R. S., and Barto, A. G., 1998. Reinforcement learning: an introduction. Cambridge, MA: MIT
Press.
Trommershäuser, J., Glimcher, P. W., and Gegenfurtner, K. R., 2009. Visual processing, learning and
feedback in the primate eye movement system. Trends in Neurosciences, 32(11), pp. 583–90.
Tseng, Y.-C., and Howes, A., 2015. The adaptation of visual search to utility, ecology and design.
International Journal of Human-Computer Studies, 80, pp. 45–55.
Van der Stigchel, S., and Nijboer, T. C., 2011. The global effect: what determines where the eyes
land? Journal of Eye Movement Research, 4(2), pp. 1–13.
Venini, D., Remington, R. W., Horstmann, G., and Becker, S. I., 2014. Centre-of-gravity fixations in
visual search: When looking at nothing helps to find something. Journal of Ophthalmology, 2014.
http://dx.doi.org/10.1155/2014/237812.
Vera, A., Howes, A., McCurdy, M., and Lewis, R. L., 2004. A constraint satisfaction approach to
predicting skilled interactive cognition. In: Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems. New York, NY: ACM, pp. 121–8.
Vitu, F., 2008. About the global effect and the critical role of retinal eccentricity: Implications for eye
movements in reading. Journal of Eye Movement Research, 2(3), pp. 1–18.
Watkins, C., and Dayan, P., 1992. Q-Learning. Machine Learning, 8, pp. 279–92.
Zhang, Y., and Hornof, A. J., 2014. Understanding multitasking through parallelized strategy
exploration and individualized cognitive modeling. In: ACM CHI’14. New York, NY: ACM,
pp. 3885–94.
11
11.1 Introduction
When interacting with a system, users need to make numerous choices about what actions
to take in order to advance them towards their goals. Each action comes at a cost (e.g., time
taken, effort required, cognitive load, financial cost, etc.), and the action may or may not
lead to some benefit (e.g., getting closer to completing the task, saving time, saving money,
finding out new information, having fun, etc.). Describing Human-Computer Interaction
(HCI) in this way naturally leads to an economic perspective on designing and developing
user interfaces. Economics provides tools to model the costs and benefits of interaction
where the focus is on understanding and predicting the behaviour and interaction of
economic agents/users within an economy/environment. By developing economic models
of interaction, it is possible to make predictions about user behaviour, understand the
choices users make, and inform design decisions. When interaction is framed as an economic
problem, we can examine which actions accrue the most benefit for a given cost, or
incur the least cost for a given level of benefit, from which it is then possible to determine
the optimal course of action that a rational user should take, given the task, interface,
context, and constraints.
Let’s consider a simple example:1 your friend has just completed a marathon, and you
are curious to know how long it took them to complete the race. You have arrived at the web
page showing all the times and names of runners, ordered by time. You consider two options:
(i) scrolling through the list, or (ii) using the ‘find’ command.2 The first option would mean
scrolling through on average about half the list of names, while the second would require
1 This example is based on a study conducted in Russell (2015), where people were challenged to undertake
such a task.
2 Note that we have assumed that you are familiar with using the ‘find’ command (e.g. CTRL-f, CMD-f, etc).
Of course, not all users are familiar with this option, or even aware that it is available.
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
selecting the find command, typing in their name, and then checking through the matches.
Unless the list is very small, then the second option is probably going to be less costly (i.e.,
fewer comparisons) and more accurate.3 In this example, it may seem obvious that using
the ‘find’ option would be preferable in most cases—and indeed it is reasonably trivial to
develop a simple model of the costs and benefits to show at what point it is better to use
the ‘find’ option over the ‘scroll’ option, and vice versa. However, even to arrive at such an
intuition, we have made a number of modelling assumptions:
1. that the user wants to find their friend’s performance (and that the said friend took
part in the marathon);
2. that the user knows and can perform both actions;
3. that the currency of costs and benefits is time, i.e., time spent or saved; and,
4. that the user wants to minimize the amount of time spent completing the task.
Such assumptions provide the basis for a formal model to be developed. The last assumption
is common to most economic models. This is because they are a type of ‘optimization’
model (Lewis, Howes, and Singh, 2014; Murty, 2003; Pirolli and Card, 1999; Payne
and Howes, 2013), which assumes that people attempt to maximize their profit given
their budget (costs) or minimize their budget expenditure given some level of profit.
The other assumptions serve as constraints arising from the environment, the
limitations of the person, and/or the simplifications made by the modeller. Under these
assumptions, the model can be used to consider the trade-offs between different
strategies, to reason about how users will adapt their behaviour as costs and benefits
change, and to make predictions about their behaviour. Consequently, economic models
go beyond approaches that focus solely on cost (e.g., GOMS-KLM (Card, Moran,
and Newell, 1980), Fitts' Law (Fitts, 1954), Hick's Law (Hick, 1952), etc.), as they
also consider the benefit and profit that one derives from the interaction. This
is an important difference, because not all tasks are cost/time driven, where the goal is
to reduce the time taken or minimize friction. For example, when should an author stop
editing a paper, when should an artist stop photoshopping an image, when should a
researcher stop searching for related works? In the above example, the different options
have varying degrees of accuracy when employed to find the correct runner’s name and
subsequent time. This is because as the number of items in the list increases, the chance
of missing or skipping over an item also increases, thus decreasing the accuracy. So, in
this case, there is a trade-off between the speed (minimizing time taken to complete
the task) and the accuracy (finding the correct time). Also when using the ‘find’ option,
there is another trade-off between the number of letters entered (typing cost) versus
the number of matching names (scanning costs, and thus accuracy). In such tasks, it
is clear that understanding the trade-off between the benefits and the costs of different
interaction strategies can help predict user behaviour. Economic models can help to
3 It is easy to skip over records when browsing through thousands of entries. Indeed, in the study conducted
in Russell (2015), subjects that scrolled often reported the incorrect time.
draw insights into these trade-offs and understand when one strategy (sequence of
actions) is better to perform than another or what strategy to adopt under different
circumstances.
In economic models, it is commonly assumed that users are economic agents that are
rational in the sense that they attempt to maximize their benefits, and can learn to evolve
and adapt their strategies towards the optimal course of interaction. Thus the theory is
normative, and gives advice on how a rational user should act given their knowledge and
experience of the system. Going back to the example above, if a user is not aware of the
‘find’ option, then they will be limited in their choices, and so they would select the ‘scroll’
option (or, choose not to complete the task, i.e., the ‘do nothing’ option). However, when
they learn about the existence of the ‘find’ option, perhaps through exploratory interactions
or from other users, then they can decide between the different strategies. While assuming
that users are rational may seem like a rather strong assumption, in the context of search,
a number of works (Azzopardi, 2014; Pirolli, Schank, Hearst, and Diehl, 1996; Smith and
Kantor, 2008; Turpin and Hersh, 2001) have shown that users adapt to systems and tend
to maximize benefit for a given cost (e.g., subscribe to the utility maximization paradigm
(Varian, 1987)) or minimize cost for a given level of benefit (e.g., subscribe to the principle
of least effort (Zipf, 1949)).4 So a user who knows of the 'find' option would select it when
the list of items is sufficiently long such that employing the find command is likely to reduce
the total cost incurred. Once we have a model, we can then test such hypotheses about user
behaviour, e.g., given the cost of using the find command, the cost of scanning items, etc.,
then we may hypothesize that when the length of the list is over say two pages, it is more
efficient to use the ‘find’ option—and then design an experiment to test if this assertion
holds in practice (or not) in order to (in)validate the model.
During the course of this chapter, we first provide an overview of economic modelling in
the context of HCI, where we formalize the example above by developing two models
that lead to quantitative predictions regarding which option a user should employ (i.e., 'find'
or 'scroll') and how they should use the 'find' command when it is chosen. Following on from
this example, we then consider three further search scenarios related to information
seeking and retrieval, where we develop models of: (i) querying, (ii) assessing, and (iii)
searching. The first model provides insights into query length and how to encourage longer
or shorter queries. The next model provides insights into when to stop assessing items in a
ranked list of results and how to design different result pages for different result types. The
third model on searching examines the trade-off between issuing queries and how many
documents to examine per query during the course of a search session. This will lead to
a number of insights into where the system can be improved and how users will respond
to such changes. While these models are focused on search and search behaviour, similar
models could be developed to help describe how people browse products, play games, use
messaging, find apps, enter text, and so on. In the next section, we describe a framework for
building economic models of interaction that you can use to build your own models,
inform your designs, and guide your experimental research.
4 Note that these optimization objectives are essentially two sides of the same coin and arrive at the same
optimal solution, i.e., if the maximum benefit is $10 for five minutes of work, then for a benefit of $10 the minimum
cost is five minutes of work.
Theoretical models aim to develop testable hypotheses about how people will behave
and assume that people are economic agents that maximize specific objectives
subject to constraints (e.g., amount of time available for the task, knowledge of
potential actions, etc.). Such models provide qualitative answers to questions such
as: how does the cost of querying affect the user's behaviour? If the benefit of query
suggestions increases, how will users adapt?
Empirical models aim to evaluate the qualitative predictions of theoretical models and
turn the predictions they make into numerical outcomes. For example, consider a
news app that provides access to news articles for a small payment, and a theoretical
model that says that if the cost of accessing news articles increases, then users will
reduce their consumption of such articles. Then an empirical model would seek to
quantify by how much consumption will drop given a price increase.
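As a toy illustration of this distinction (not from the chapter), an empirical model for the hypothetical news app might estimate a price elasticity of demand from usage data and use it to turn the qualitative prediction into a number. The elasticity value below is invented for the example:

```python
def predicted_consumption_change(elasticity: float, price_change_pct: float) -> float:
    """Toy empirical model: predicted percentage change in articles read,
    given a percentage price change and an assumed constant price
    elasticity of demand (negative for a normal good)."""
    return elasticity * price_change_pct

# With a hypothetical elasticity of -1.5 estimated from usage logs,
# a 10% price increase predicts a 15% drop in article consumption.
drop = predicted_consumption_change(-1.5, 10.0)  # -15.0
```

The theoretical model only says consumption falls when price rises; the empirical model commits to how much, which is what an experiment can then check.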
Step 1 - Describe the Problem Context: First off, outline what is known about the problem
context, the environment, and the interface(s) in which the interaction is occurring. It may
also help to illustrate the interface(s), even if hypothetical, that the user population will
be using. For example, we may want to consider how facets, when added to a shopping
interface, would affect behaviour, and so draw a faceted search interface from which we can
consider different ways in which the user can then interact with it (Kashyap, Hristidis, and
Petropoulos, 2010). According to Varian (2016), all economic models take a similar form,
where we are interested in the behaviour of some economic agents. These agents make
choices to advance towards their objective(s). And these choices need to satisfy various
constraints based upon the individual, the interface, and the environment/context. This
leads to asking the following questions: who are the agents? What are their objectives?
What choices can they make, and under what constraints?
Let’s re-visit the scenario we introduced earlier, where we want to know our friend’s
performance in the marathon. Imagine we are at the page containing a list of names and
their race completion times. And let’s assume we are tech savvy individuals. We are aware
of several choices: (i) search by scrolling, (ii) search via the find command, (iii) some
combination of scrolling and finding. To make things simple we consider only the first two
and assume that we only select one or the other. We would like to try and find out as quickly
as possible our friend’s race time because we’d like to see whether we ought to congratulate
our friend or sympathize with them. So, in this case, the objective is to minimize the time
taken to find their name. Since time is at a premium, we have a constraint such that we want
to complete the search within a certain period of time (after which we may give up), or, if we
believe we could not complete the task within the time constraint, then we may decide not
to search at all. In this latter case, where we decide not to search, we may take some other
action, like asking our friend. Though, of course, we would like to keep the initiative in
the conversation from which we will derive benefit. In terms of interaction with the page,
if we (i) scroll, then we plan to look down the list, one by one, and see if we recognize our
friend’s name in the list, while if we (ii) use the find command, we plan to type in a few letters of
their name, and step through each matching name. In both cases, we also acknowledge that
there is some chance of skipping over their name and so there is some probability associated
with finding their name—such that as the list of names that has to be checked increases, the
chance of missing also increases. We also can imagine that in case (ii) if we enter more letters
the list of names to check decreases proportionally with each additional letter. We can now
formalize the problem with a series of assumptions (like those listed in Section 11.1), and
then start to model the process mathematically.
Step 2 - Specify the Cost and Benefit Functions: For a particular strategy/choice, we
need to identify and enumerate the most salient interactions which are likely to affect
the behaviour when using the given interface. At this point, it is important to model the
interaction at an appropriate level—too low and it becomes unwieldy (i.e., modelling every
keystroke), too high and it becomes uninformative (i.e., simply considering the aggregated
cost/benefit of the scroll option vs. the cost/benefit of the find option). Varian (2016)
suggests keeping the model as simple as possible:
‘The whole point of a model is to give a simplified representation of reality. . .your model should
be reduced to just those pieces that are required to make it work’.
So initially focus on trying to model the simplest course of action, at the highest level
possible, to get a feel for the problem, and then refine. If we start at too high a level, we can
then consider what variables influence the cost of the scroll option (i.e., the length of the
list), and start to parameterize the cost function, etc. We can also reduce the complexity of
the interaction space: for example, in the facet shopping interface, we might start with one
facet, and then progress to two facets. Essentially, make the problem simple and tractable to
understand what is going on. The simple model that is developed will probably be a special
case or an example. The next important step is to generalize the model: e.g., how do we
model f facets?
In our scenario, for option (i) we need to perform two main actions: scroll (scr) and
check (chk), where we will assume that the cost of a scroll is per item (c_scr), and the cost
of checking a name is also per item (c_chk). In the worst case, we'd need to scroll through
and check N names, while in the average case we'd only need to examine approximately half
of the names, N/2, and in the best case our friend came first, so N = 1. Let's consider the
average case, where the cost of option (i) would be:

C_(i)(N) = N · (c_scr + c_chk) / 2    (11.1)
We also know that the benefit is proportional to our likelihood of success and that is
conditioned on how many items we need to check through, so we can let the probability
of successfully finding our friend's name be p_scr(N). Thus, we can formulate an expected
benefit function, i.e., the benefit that we would expect to receive on average:

B_(i)(N) = p_scr(N) · b    (11.2)
where b is the benefit, e.g., the time saved from having to hear your friend going on and
on about how you are not interested in them, and how you couldn’t even find it on the
computer, etc. Now we can create a profit function to denote how much time we expect
to save/lose if we take this option. A profit function is the difference between the benefit
function and the cost function:
π_(i) = B_(i)(N) − C_(i)(N) = p_scr(N) · b − N · (c_scr + c_chk) / 2    (11.3)
On the other hand, for option (ii), we need to perform a different sequence of actions:
command find (cmd), type (typ), skip (skp), and check (chk), where we will assume that
the cost to invoke the find command is c_cmd, to type in a letter is c_typ, and to skip to the
next match is c_skp. For simplicity, we will assume the cost of a skip and the cost of a scroll
are the same, c_skp = c_scr. The course of interaction is: press command find, type in m letters, and
then skip through the results, checking each one. Since typing in m letters will reduce the
number of checks, we assume that there is a function f(N, m), which results in a list of M to
check through (and as m increases, M decreases). Again we are concerned with the average
case, if there are M matches, then we’d only need to examine approximately half, i.e., M/2.
Putting this all together, we can formulate the following cost function, which takes in both
the size of the list and the number of letters we are willing to enter:
C_(ii)(N, m) = c_cmd + m · c_typ + M · (c_scr + c_chk) / 2    (11.4)
We can also formulate the benefit function, which also takes in N and m as follows:
B_(ii)(N, m) = p_scr(f(N, m)) · b = p_scr(M) · b    (11.5)
where, since M will typically be much smaller than N, the expected benefit will generally be
higher. Again, we can formulate a profit function π_(ii) by taking the difference between the
benefit and cost functions.
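The cost, benefit, and profit functions above (Equations 11.1–11.5) translate directly into code. The sketch below is illustrative only: the benefit b, the form of the success probability p_scr, and the narrowing function f are placeholder assumptions that a real model would estimate empirically:

```python
C_SCR, C_CHK, C_CMD, C_TYP = 0.1, 0.5, 15.0, 1.0  # rough cost estimates, in seconds

def p_scr(n):
    """Placeholder success probability: decays as more items must be checked."""
    return 1.0 / (1.0 + 0.001 * n)

def f(N, m):
    """Assumed narrowing function: typing m letters leaves N/(m+1)**2 matches."""
    return N / (m + 1) ** 2

def profit_scroll(N, b=60.0):
    """Equation 11.3: expected benefit minus the cost of checking N/2 names."""
    return p_scr(N) * b - N * (C_SCR + C_CHK) / 2

def profit_find(N, m, b=60.0):
    """Option (ii): benefit (Eq. 11.5) minus cost (Eq. 11.4) of using find."""
    M = f(N, m)
    return p_scr(M) * b - (C_CMD + m * C_TYP + M * (C_SCR + C_CHK) / 2)
```

Comparing profit_scroll(N) with profit_find(N, m) over a range of N and m reproduces the Step 3 analysis; for long lists the find option dominates because M grows much more slowly than N.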
Step 3 - Solve the Model: The next step is to solve / instantiate the model in order
to see what insights it reveals about the problem being studied. This can be achieved
through various means: analytically, computationally or graphically. For example, we could
determine which option would be more profitable, π (i) or π (ii) , by taking the difference
and seeing under what circumstances option (i) is better than (ii), and vice versa. We could
achieve this analytically, i.e., if π (i) >π (ii) , then select option (i), else (ii). Alternatively, we
could instantiate the model, with a range of values, and plot the profit functions of each
to see when π (i) >π (ii) , and under what conditions. Or in the case of option (ii), we could
consider the tradeoff between typing more letters and the total time to find the name—and
find the optimal number of letters to type. Note that here we have not specified the form of
the functions f(N, m) or pscr (·); so in order to solve or plot, we would need to make some
further assumptions, or look to empirical data for estimating functional forms.
If we are interested in deciding which option to take, then we can try to solve the
inequality π_(i) > π_(ii), which we can reduce to:

c_cmd + m · c_typ > (N − M) · (c_scr + c_chk)    (11.6)
where we have assumed that p(N) is approximately equal to p(M) for the sake of simplicity,
i.e., we are assuming that the two methods perform the same (even though this is
probably not the case in reality). To plot the model graphically, we further assumed that
f(N, m) = N / (m + 1)^2, to reflect the intuition that more letters entered will reduce the number
of names to check. Later we could empirically estimate the form based on a computational
simulation, i.e., given a list of names, we could count how many names are returned, on
average, when N and m are varied in order to gather actual data to fit a function. Next we
have to provide some estimates of the different costs. Here, we have set c_typ to be one second
[Figure 11.1: two plots of LHS vs RHS against Size of List (N), for m = 2 (top) and m = 7 (bottom).]
Figure 11.1 Top: Plot of the inequality when only two letters are entered. Bottom: Plot of
the inequality when seven letters are entered, where if LHS > RHS then scroll, else use the find
command. The plots suggest that once the size of the list is greater than thirty to forty items, using
the find command is less costly. But as m increases, a longer list is required to justify the additional
typing cost.
per letter, c_cmd to be 15 seconds, c_scr and c_skp to be 0.1 seconds per scroll/skip, and c_chk
to be 0.5 seconds per name check. Of course, these values are only loosely based on the
time taken to perform such actions—to create a more precise instantiation of the model, we
would need to empirically ground these values. Part of the model building process involves
iteratively refining the parameters and their estimates based on observed data. But, initially,
we can get a ‘feel’ for the model by using some reasonable approximations. Figure 11.1 shows
a plot of the inequality when m = 2 (top) and m = 7 (bottom).
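The comparison in Figure 11.1 can also be reproduced numerically. The sketch below instantiates the inequality of Equation 11.6 with the cost estimates above and the assumed f(N, m) = N/(m+1)^2, then searches for the smallest list size at which the find option becomes cheaper; the crossover values are only as good as these rough estimates:

```python
C_TYP, C_CMD, C_SCR, C_CHK = 1.0, 15.0, 0.1, 0.5  # rough cost estimates, in seconds

def lhs(m):
    """Left-hand side of Equation 11.6: fixed cost of invoking find and typing m letters."""
    return C_CMD + m * C_TYP

def rhs(N, m):
    """Right-hand side of Equation 11.6, with M = N/(m+1)**2 matches remaining."""
    M = N / (m + 1) ** 2
    return (N - M) * (C_SCR + C_CHK)

def crossover(m, n_max=1000):
    """Smallest list size N at which using find beats scrolling (RHS exceeds LHS)."""
    for N in range(1, n_max):
        if rhs(N, m) > lhs(m):
            return N
    return None
```

Under these estimates the crossover falls in the low thirties for m = 2 and the high thirties for m = 7, consistent with the thirty-to-forty range visible in Figure 11.1.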
Now, focusing on option (ii), we can calculate the optimal way to interact when using
the find command, i.e., how many letters should we type? To do this, we can consider
maximizing the profit with respect to the number of letters we need to type (as this reduces
the number of possible matches to skip through). To achieve this, we instantiate the profit
function for option (ii), where we assume, for simplicity, that p_scr(M) = 1, such that:

π_(ii) = b − ( c_cmd + m · c_typ + N · (c_scr + c_chk) / (2 · (m + 1)^2) )    (11.7)
[Figure 11.2: plot of Optimal No. of Letters (m∗) against Size of List (N), for N up to 1000.]
Figure 11.2 Plot of the optimal number of letters (m∗) versus the size of the result list (N).
then we can differentiate the profit function with respect to m to arrive at:

dπ_(ii)/dm = −c_typ + N · (c_scr + c_chk) · (m + 1)^(−3)    (11.8)
Setting dπ_(ii)/dm = 0, we obtain the following expression for m∗, which is the optimal
number of letters to enter for a list size of N:

m∗ = ( N · (c_scr + c_chk) / c_typ )^(1/3) − 1    (11.9)
Figure 11.2 shows a plot of the optimal number of letters (m∗) as N increases. As
expected, more letters are required as N increases, but at a diminishing rate.
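Equation 11.9 can be evaluated directly; here is a quick sketch using the chapter's rough cost estimates:

```python
C_TYP, C_SCR, C_CHK = 1.0, 0.1, 0.5  # rough cost estimates, in seconds

def optimal_letters(N):
    """Equation 11.9: m* = (N * (c_scr + c_chk) / c_typ) ** (1/3) - 1."""
    return (N * (C_SCR + C_CHK) / C_TYP) ** (1 / 3) - 1
```

For example, optimal_letters(250) is about 4.3 and optimal_letters(1000) about 7.4, i.e., roughly four and seven letters respectively, growing with the cube root of N.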
Step 4 - Use the Model and Hypothesise About Interaction: Given the models created
above, we can now consider how different variables will influence interaction and
behaviour, find out what the model tells us about optimal behaviour, and see what
hypotheses can be generated from it.
From Equation 11.6 and Figure 11.1, we can see that if the cost of using the find command
is very high, then the list will have to be longer before it becomes a viable option.
Furthermore, there is a trade-off between the number of letters entered (m) and the
reduction in M, which, of course, grows with m. From the plots, we can see that moving
from m = 2 to m = 7 does not dramatically change the point at which we'd decide to scroll
or find; rather, a longer list is needed to warrant the entry of more letters. Moreover, from the graphs,
we can see that, in these examples, once the list contains more than forty to fifty names, it is
better to use the find command. Exactly where this point is depends on how we estimate
the various costs and instantiate the functions used. However, it is possible to create
hypotheses about how people would change their behaviour in response to different cir-
cumstances. For example, we could imagine a similar scenario where the cost of comparison
is very high, because we are trying to match a long unique number that represents each
runner instead (and so hypothesize that using the find command is preferable when lists
are even shorter).
From Equation 11.9 and Figure 11.2, we can see that as the list size increases the optimal
number of letters to enter (given our model) increases, such that four letters are optimal
when the list is around 250 in size, while around seven letters are required when the list
grows to 1000. Given these estimates, we can then hypothesise that for large lists (around
1000 in size), users will tend to enter, on average, seven letters, while for shorter lists (around
200–300), users will tend to enter, on average, four letters.
Step 5 - Compare with Observed Behaviour: The next step is to determine whether
the hypothesis made using the model is consistent with empirical observations from the
literature, and/or to validate the model by designing empirical experiments that explicitly
test the hypotheses.
This is an important step in the process for two reasons: (a) the model provides a guide
for what variables and factors are likely to influence the behaviour of users, and thus enables
us to inform our experiments, and (b) it provides evidence which (in)validates the models,
which we can use to refine our models. From the experimental data, we may discover that,
for instance, users performed in a variety of ways we did not consider or that we ignored.
For example, maybe a significant proportion of users adopted a mixed approach, scrolling
a bit first, then using the find command. Or when they used the find command, they mis-
spelled the name or couldn’t remember the exact spelling, and so there is some probability
associated with entering the correct partial string to match the name. As a consequence, we
find that the model, or the estimates, need to be refined, and so the final step (6) is to iterate:
refining and revising the model and its parameters accordingly. Once we have conducted
an empirical investigation, we can better estimate the costs and benefits. Alternatively, the
observations allow us to develop new models to cater for different interactions and conditions.
In this respect, Box (1979) points out that it would be remarkable if a simple model could exactly
represent a real-world phenomenon. Consequently, he argues that we should build parsimonious models
because model elaboration is often not practical, but adds increased complexity, without
necessarily improving the precision of the model (i.e., how well the model predicts/explains
the phenomena). This is not to say that we should be only building very simple mod-
els; instead, Box (1979) argues we should start simple, and then only add the necessary
refinements based on our observations and data, to generate the next tentative model,
which is then again iterated and refined, where the process is continued depending on how
useful further revisions are judged to be. That is, there is a trade-off between the abstracted
model and the predictions that it makes: the less abstracted the model, the greater the complexity,
with perhaps an increase in model precision. Therefore, the refinements to the model can be
evaluated by how much more predictive or explanatory power the model provides about
the phenomena.
Following this approach helps to structure how we develop models, and crucially how
we explain them to others and use the models in practice. The approach we have described
here is similar to the methodology adopted when applying the Rational Analysis method
(Anderson, 1991; Chater and Oaksford, 1999), which is a more general approach that
is underpinned by similar assumptions. However, here we are concerned with building
economic models as opposed to other types of formal models, e.g., Pirolli and Card,
1999; Murty, 2003; Payne and Howes, 2013; Lewis, Howes, and Singh, 2014. In the
remainder of the chapter, we shall describe three different, but related, economic models of
human-computer interaction, where a user interacts with a search engine. Our focus is on
showing how theoretical economic models can be constructed (i.e., steps 1–4) and discussing
how they provide insights into observed behaviours and designs.
Agapie, Golovchinsky, and Qvarfordt (2012) used a halo effect around the query box, such
that as the user types a longer query the halo changes from a red glow to a blue glow.
However, these attempts have largely been ineffectual and have not been replicated outside
the lab (Hiemstra, Hauff, and Azzopardi, 2017). So can the model provide insights into why
this is the case, why user queries tend to be short, and how we could improve the system to
encourage longer queries?
11.3.2 Model
To create an economic model of querying, we need to model the benefit associated with
querying and model the cost associated with querying. Let’s assume that the user enters a
query of length W (the number of words in the query). The benefit that a user receives is
given by the benefit function b(W) and the cost (or effort in querying) defined by the cost
function c(W). Here we make a simplifying assumption: that cost and benefit are only a
function of query length.
Now let’s consider a benefit function which denotes the situation where the user experi-
ences diminishing returns such that as the query length increases they receive less and less
benefit as shown by Azzopardi (2009), and Belkin, Kelly, Kim, Kim, Lee, Muresan, Tang,
Yuan, and Cool (2003). This can be modeled with the function:
b(W) = k·log_a(W + 1)   (11.10)
where k represents a scaling factor (for example to account for the quality of the search
technology), and a influences how quickly the user experiences diminishing returns. That
is, as a increases, additional terms contribute less and less to the total benefit, and so the
user will experience diminishing returns sooner. Let’s then assume that the cost of entering
a query is a linear function based on the number of words such that:
c(W) = W · cw (11.11)
where cw represents how much effort must be spent to enter each word. This is, of course, a
simple cost model and it is easy to imagine more complex cost functions. However the point
is to provide a simple, but insightful, abstraction.
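To see how these two functions trade off, we can compute the profit, π(W) = b(W) − c(W), and search numerically for the query length that maximizes it. The following Python sketch uses assumed parameter values (k, a, and c_w below are placeholders for illustration, not empirical estimates):

```python
import math

def profit(W, k=10.0, a=2.0, c_w=1.0):
    """Profit of a W-word query: benefit (Eq. 11.10) minus cost (Eq. 11.11)."""
    return k * math.log(W + 1, a) - W * c_w

def best_length(k=10.0, a=2.0, c_w=1.0, max_w=50):
    """Query length (in words) that maximizes profit, by exhaustive search."""
    return max(range(1, max_w + 1), key=lambda W: profit(W, k, a, c_w))
```

Consistent with the hypotheses discussed below, best_length increases with k, decreases as a grows, and increases as the per-word cost c_w falls.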
Figure 11.3 The top plots show the benefit while the bottom plots show the profit as the length
of the query increases. Plots on the right show when the queries yield greater benefit (left k=10;
right k=15). Each plot shows three levels of a which denotes how quickly diminishing returns
sets in.
11.3.4 Hypotheses
Figure 11.3 illustrates the benefit (top) and profit (bottom) as query length increases.
For the left plots k=10, and for the right plots k=15. Within each plot we show various
levels of a. These plots show that as k increases (i.e., overall the performance of the system
increases), the model suggests that query length, on average, would increase. If a increases
(i.e., additional terms contribute less and less to the overall benefit), then queries decrease
in length. Furthermore the model suggests that as the cost of entering a word, cw , decreases
then users will tend to pose longer queries.
11.3.5 Discussion
This economic model of querying suggests that to motivate longer queries either the cost
of querying needs to decrease or the performance of the system needs to increase (either
by increasing k or decreasing a). The model provides several testable hypotheses, which
provide key insights that inform the design of querying mechanisms and help explain various
attempts to encourage longer queries. For example, the query halo does not reduce cost, nor
increase benefit, and so is not likely to change user behaviour. On the other hand, the inline
query auto-completion functionality now provided by most search engines, reduces the cost
of entering queries (e.g., less typing to enter each query term), and also increases the quality
of queries (e.g., fewer mis-spellings, fewer out-of-vocabulary words, etc.). Thus, according to
the model, since the key drivers are affected, queries are likely to be longer when using query
auto-completion than without.
While this model provides some insights into the querying process and the trade-off
between performance and length, it is relatively crude. We could model more precisely the
relationship between the number of characters, number of terms, and the discriminative
power of those terms, and how they influence performance. Furthermore, depending
on the search engine, the length of a query relative to performance will vary, and may
even decrease as length increases. For instance, if our search engine employed an implicit
Boolean AND between terms, then as the number of query terms increases the number
of results returned decreases, and so fewer and fewer relevant items are returned (if
any). In this case, we would need to employ a different benefit function to reflect and
capture this relationship. It is only when we empirically explore, either through an
analysis of query logs (Hiemstra, Hauff, and Azzopardi, 2017), user judgements and
ratings (Verma and Yilmaz, 2017), or computational simulations (Azzopardi, 2009,
2011) that we can test and refine the model, updating the assumptions and cost/benefit
functions/parameters.
11.4.2 Model
To create an economic model of assessing, we need to formulate cost and benefit functions
associated with the process. Let’s start off by modelling the costs. A user first poses a query
to the search engine and thus incurs a query cost cq . Then for the purposes of the model,
we will assume the user assesses items, one by one, where the cost to assess each item is ca .
If the user assesses A items, then the cost function would be:
c(A) = cq + A·ca   (11.15)
Now, we need a function to model the benefit associated with assessing A items. Consider
the scenario where a user is searching for news articles, and that they are reading about the
latest world disaster. The first article that they read provides key information, e.g., that an
earthquake has hit. The subsequent articles start to fill in the details, while others provide
context and background. As they continue to read more and more news articles, the amount
of new information becomes less and less as the same ‘facts’ are repeated. Essentially, as they
work their way down the ranked list of results, they experience diminishing returns. That is,
each additional item contributes less and less benefit. So we can model the benefit received
as follows:
b(A) = k·A^β   (11.16)
where k is a scaling factor, and β represents how quickly the benefit from the information
diminishes. If β is equal to one, then for each subsequent item examined, the user receives
the same amount of benefit. However, if β is less than one, then for each subsequent item
examined, the user receives less additional benefit. This function is fairly flexible: if k = 0
for a given query, then it can represent a ‘dud’ query, while β = 0 models the case where only one
item is of benefit (since A^0 = 1). So the benefit function can cater for a number of different
scenarios.
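As with the querying model, the profit of assessing is π(A) = b(A) − c(A) = k·A^β − (cq + A·ca), and the best assessment depth can be found numerically. A minimal Python sketch, with parameter values that are assumptions for illustration:

```python
def profit(A, k=20.0, beta=0.5, c_q=10.0, c_a=1.0):
    """Profit after assessing A items: benefit (Eq. 11.16) minus cost (Eq. 11.15)."""
    return k * A ** beta - (c_q + A * c_a)

def optimal_depth(k=20.0, beta=0.5, c_q=10.0, c_a=1.0, max_a=500):
    """Assessment depth that maximizes profit, by exhaustive search."""
    return max(range(1, max_a + 1), key=lambda A: profit(A, k, beta, c_q, c_a))
```

Setting dπ/dA = 0 gives the interior optimum A∗ = (k·β/ca)^(1/(1−β)) for β < 1. Note that the query cost cq only shifts the profit by a constant, so it does not change the optimal depth, which matches the observation made in the discussion below.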
11.4.4 Hypotheses
From Equation 11.19, we can see that the optimal depth is dependent on the cost of
assessment (ca ), and the performance surmised by k and β. Using comparative statics
(Varian, 1987), we can see how a user should respond when one variable changes, and
everything else is held constant. If the cost of assessing increases, and β is less than one
(i.e., diminishing returns), then the model suggests that the user would examine fewer
documents. For example, consider a news app that charges per article, while another does
not. In this case, the model predicts that users would read fewer documents in the first app,
when compared to the second.
Figure 11.4 shows how the profit of assessing changes as the cost of assessing is increased.
If the performance increases, i.e., β tends to one, then the user would examine more
documents. Similarly, as the performance increases via k, then this also suggests that the
user would examine more documents.
11.4.5 Discussion
Intuitively, the model makes sense: if the performance of the query was very poor, there is
little incentive/reason to examine results in the list. And if the cost of assessing documents
is very high, then it constrains how many documents are examined. For example, consider
the case of a user searching on their mobile phone when the internet connection is very
slow. The cost of visiting each page is high (i.e., it takes a lot of time to download the
page, and may not even load properly), so the model predicts that users are less likely
to click and assess documents. Alternatively, consider the case of a user searching for
images. The cost of assessing thumbnails (and thus the image) is very low (compared to
examining text snippets), and so the model predicts that a user will assess lots of images
(but few text snippets). Interestingly, under this model, the cost of a query does not impact
on user behaviour. This is because it is a fixed cost, and the analysis is only concerned
Figure 11.4 Top: Plot of the benefit of assessing. Bottom: Plot of the profit of assessing where the
result list is of low quality (β = 0.3) and higher quality (β = 0.5). The model predicts users will assess
more documents as the result list quality increases.
with the change in cost versus the change in benefit (i.e., stop when the marginal cost
equals the marginal benefit). However, in reality, the cost of the query is likely to influence
how a user behaves. Also, users are likely to issue multiple queries, either new queries or
better reformulations which lead to different benefits. While this simple model of assessing
provides some insights into the process, it is limited, and may not generalize to these other
cases.5 Next, we extend this model and consider when multiple queries can be issued, and
how the trade-off between querying and assessing affects behaviour.
5 As an exercise the reader may wish to consider a model where two queries are issued, and the benefit function
is different between queries. See Azzopardi and Zuccon (2015) for a solution.
Essentially, given a particular context, we would like to know if a user should issue more
queries and assess fewer items per query, or whether they should issue fewer queries and
assess many items per query?
11.5.2 Model
Described more formally, during the course of a search session, a user will pose a number
of queries (Q), examine a number of search result pages per query (V), inspect a number
of snippets per query (S) and assess a number of items per query (A). Each interaction has
an associated cost where cq is the cost of a query, cv is the cost of viewing a page, cs is the
cost of inspecting a snippet, and ca is the cost of assessing an item. With this depiction of
the search interface we can construct a cost function that includes these variables and costs,
such that the total cost of interaction is:
c(Q, V, S, A) = cq·Q + cv·V·Q + cs·S·Q + ca·A·Q   (11.20)
This cost function provides a reasonably rich representation of the costs incurred during
the course of interaction. In modelling the interaction, we have assumed that the number
Figure 11.5 Standard Search Interface – showing results for the query ‘python’.
of pages, snippets, and items viewed is per query. Of course, in reality, the user will vary
the number of pages, snippets, and items viewed for each individual query (see Azzopardi
and Zuccon (2015) for how this can be modelled). So, V, S, and A can be thought of as the
average number of pages, snippets, and items viewed, respectively. We thus are modelling
how behaviour with respect to these actions changes, on average. Nonetheless, the cost
function is quite complex, so we will need to simplify the cost function. To do so, we will
need to make a number of further assumptions.
First, we shall ignore the pagination and assume that all the results are on one page, i.e.,
V = 1. Thus a user does not need to go to subsequent pages (i.e., infinite scroll).6 The
assumption is quite reasonable as, in most cases, users only visit the first page of results
anyway (Azzopardi, Kelly, and Brennan, 2013; Kelly and Azzopardi, 2015).
6 However, it would be possible to encode the number of page views per query more precisely by using a
step function based on the number of snippets viewed, representing the fixed cost incurred to load and view each
page of results. The step function would be such that the number of pages viewed V would be equal to the number
of snippets viewed, divided by the number of snippets shown per page (n), rounded up to the nearest integer,
i.e., V = ⌈S/n⌉.
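A one-line Python sketch of this step function:

```python
import math

def pages_viewed(S, n):
    """Pages of results loaded when S snippets are viewed, with n snippets per page."""
    return math.ceil(S / n)

# e.g., viewing 25 snippets at 10 snippets per page requires loading 3 pages.
```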
Second, we shall assume that the number of items assessed is proportional to the number
of snippets viewed, i.e., that users need to first inspect the result snippet, before clicking
on and examining an item, thus S ≥ A. Furthermore, we can associate a probability to
a user clicking on a result snippet, pa , and examining the item. The expected number of
assessments viewed per query would then be A = S.pa . Substituting these values into the
cost model, we obtain:
c(Q, V, S, A) = cq·Q + cv·Q + cs·(A/pa)·Q + ca·A·Q   (11.21)
We can now reduce the cost function to be dependent only on A and Q, such that:
c(Q, A) = (cq + cv)·Q + (cs/pa + ca)·A·Q   (11.22)
Let’s turn our attention to building the benefit function and characterising how much
benefit the user receives from their interactions. Given the two main interactions: querying
and assessing, we assume, as in the previous model, that as a user examines items, they obtain
some benefit, but as they progress through the list of items, the benefit they experience
diminishes. As previously mentioned, when searching for news about the
latest crisis, as subsequent news articles are read, they become less beneficial because they
begin to repeat the same information contained in previous articles. In this case, to find
out about other aspects of the topic, another related but different query needs to be issued.
Essentially, each query issued contributes to the overall benefit, but again with diminishing
returns, because as more and more aspects of the topic are explored, less new information
about the topic remains to be found. To characterize this, we shall model the benefit function
using the Cobb-Douglas function (Varian, 1987):
b(Q, A) = k·Q^α·A^β   (11.23)
where α represents returns from querying, while β represents the returns from assessing,
and k is a scaling factor.7 Let’s consider two scenarios: when α = 0 and when α = 1. In the
first case, Q^0 = 1 regardless of how many queries are issued, so issuing more than one query
would be a waste as it would not result in more benefit. In the latter case, Q^1 = Q, so there are no
diminishing returns for subsequent queries. This might model the case where the user poses
independent queries, i.e., the user searches for different topics within the same session, poses
queries that retrieve different items for the same topic, or when there is a seemingly endless
supply of beneficial/relevant items, e.g., procrastinating by watching online videos. Given the
form in Equation 11.23, the benefit function is sufficiently flexible to cater for a wide range
of scenarios. Azzopardi (2011) showed this benefit function to fit well with empirical search
performance of querying and assessing.
7 Note that if α = 1 then we arrive at the same benefit as in the model of assessing, see Equation 11.16.
we assume that the objective of the user is to minimize the cost for a given level of benefit
(or alternatively, maximize their benefit for a given cost). This occurs when the marginal
benefit equals the marginal cost. We can solve this optimization problem with the following
objective function (using a Lagrangian multiplier λ):

L = (cq + cv·V)·Q + (cs/pa + ca)·A·Q − λ·(k·Q^α·A^β − b)
where the goal is to minimize the cost subject to the constraint that the amount of benefit is
b. By taking the partial derivatives, we obtain:
∂L/∂A = (cs/pa + ca)·Q − λ·k·β·Q^α·A^(β−1)   (11.24)
and:
∂L/∂Q = cq + cv·V + (cs/pa + ca)·A − λ·k·α·Q^(α−1)·A^β   (11.25)
Setting these both to zero, and then solving, we obtain the following expression for the
optimal number of assessments per query A∗:

A∗ = β·(cq + cv·V) / ((α − β)·(cs/pa + ca))   (11.26)
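Equation 11.26 can be checked numerically: for a required benefit level b, the constraint b = k·Q^α·A^β fixes Q for any choice of A, so we can sweep over A and confirm that the total cost is minimized at A∗. The Python sketch below uses parameter values that are assumptions for illustration only (and the closed form requires α > β):

```python
def a_star(alpha, beta, c_q, c_v, c_s, c_a, p_a, V=1.0):
    """Optimal number of assessments per query (Equation 11.26)."""
    return beta * (c_q + c_v * V) / ((alpha - beta) * (c_s / p_a + c_a))

def cost_for(A, b, k, alpha, beta, c_q, c_v, c_s, c_a, p_a, V=1.0):
    """Cost of reaching benefit b when assessing A items per query.
    The benefit constraint b = k * Q**alpha * A**beta fixes Q."""
    Q = (b / (k * A ** beta)) ** (1.0 / alpha)
    return (c_q + c_v * V) * Q + (c_s / p_a + c_a) * A * Q

# Illustrative parameters (assumptions, not empirical estimates).
params = dict(b=50.0, k=5.0, alpha=0.6, beta=0.3,
              c_q=10.0, c_v=2.0, c_s=1.0, c_a=2.0, p_a=0.5)
best = min((0.5 + 0.1 * i for i in range(100)),
           key=lambda A: cost_for(A, **params))
# The grid minimum coincides with the closed-form optimum from a_star.
```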
11.5.4 Hypotheses
Using this analytical solution we can now generate a number of testable hypotheses about
search behaviour by considering how interaction will change when specific parameters
of the model increase or decrease. From the model it is possible to derive a number of
hypotheses regarding how performance and cost affect behaviour. Rather than enumerate
each one below we provide a few examples (see Azzopardi (2016) for details on each).
Similar to the previous model, we can formulate an hypothesis regarding the quality of
the result list, where as β increases, the number of assessments per query will increase, while
the number of queries will decrease (as shown in Figure 11.7, top left plot). Intuitively, this
makes sense because as β increases the ranked list of results contains more relevant items: it is
better to exploit the current query before switching to a new query.
Regarding costs, we can formulate a query cost hypothesis, such that as the cost of a query
cq increases, the number of items assessed per query will increase, while the number of
queries issued will decrease (as shown in Figure 11.7, top right plot). It should be clear from
Equation 11.26 that this is the case because as cq becomes larger, A also becomes larger. In
turn, the number of queries issued will decrease, because as A becomes larger, Q tends to
zero. Of course, to start the search session, there needs to be at least one query, i.e., Q must
be greater than or equal to one.
Figure 11.6 Top: Plots of the number of queries vs. the number of items examined per query for
a given level of benefit. Any point yields the same amount of benefit. The asterisk indicates the
optimal querying/assessing strategy, i.e., (A∗, Q∗). Bottom: Plots of the cost vs. the number of items
examined per query. The asterisk indicates when the cost is minimized, i.e., at A∗.
11.5.5 Discussion
This economic model of searching provides a useful representation of the interaction
between a user and a search engine. The model provides a number of insights into
Figure 11.7 Top Left: Plot of A and Q as β changes. Top Right: Plot of A and Q as query cost
changes. Bottom Left: Plot of A and Q as assessment cost changes. Bottom Right: Plot of A and
Q as the probability of assessment changes.
different factors that are likely to affect behaviour, and serves as a guide on how and where
we could improve the system. For example, the search engine could focus on improving
the quality of the results, in which case increases in performance should, according to the
model, lead to changes in search behaviour. Or the search engine may want to increase
the number of queries, and so may focus on lowering the cost of querying. Of course,
the usefulness of the model depends on whether the model hypotheses hold in practice.
These hypotheses were tested in two related studies. The first study by Azzopardi, Kelly,
and Brennan (2013) explicitly explored the query cost hypothesis where a between groups
experiment was devised where the search interface was modified to create different query
cost conditions. They used a structured, a standard, and a suggestion-based search interface.
Their results provided evidence to support the query cost hypothesis, such that when the
query cost was high subjects issued fewer queries and examined more items per query,
and vice versa. In a follow-up analysis on the same data, the other hypotheses above were
explored, and it was shown that they also tend to hold in general (Azzopardi, 2014).
Ong, Jarvelin, Sanderson, and Scholer (2017) conducted a between
groups study evaluating the differences in search behaviour when subjects used either a
mobile device or a desktop device to perform search tasks. On mobile devices, the costs
for querying and assessing are much higher due to the smaller keyboard (leading to slower
query entry) and bandwidth/latency limitations (leading to slower page downloads). This
resulted in subjects assigned to the mobile condition issuing fewer queries, but examining
more snippets/items per query (Ong, Jarvelin, Sanderson, and Scholer, 2017). Again, this
is broadly consistent with the model developed here.
While the model provides a number of insights into search behaviour for topic-based
searches, there are a number of limitations and assumptions which could be addressed.
For example, the model currently assumes that the user examines a fixed number of snip-
pets/items per query, yet most users examine a variable number of snippets/items per query.
Essentially, the model assumes that on average this is how many snippets/items are viewed
per query. For a finer grained model, we would have to model each result list individually,
so that the total benefit would be, for example, the sum of the benefit obtained from each
result list over all queries issued. Then it would be possible to determine, given a number of
queries, how far the user should go into each result list (see Azzopardi and Zuccon (2015)
for this extension). Another aspect that could be improved is the cost function. We have
assumed that the cost of different actions are constant, yet they are often variable, and will
change during the course of interaction (e.g., a query refinement may be less costly than
expressing a new query or selecting from among several query suggestions). Also a more
sophisticated and accurate cost model could be developed which may affect the model’s
predictions.
different aspects of cost, and the trade-offs between them, i.e., a user might prefer to expend
physical effort over cognitive effort. Or to combine the different cost functions into an
overall cost function, where the different aspects are weighted according to their impact
on the user’s preferences. For example, Oulasvirta, Feit, Lahteenlahti, and Karrenbauer
(2017) created a benefit function that is a linear combination of several criteria (usefulness,
usability, value, etc.) in order to evaluate which features/actions an interface should afford
users. In this case, rather than having one possible optimal solution, instead, depending
on what aspect(s) are considered most important, different solutions arise. On the other
hand, what is the benefit that a user receives from their interactions with the system? Is
it information, enjoyment, satisfaction, time, money? In our models, we have made the
assumption that the cost and benefit are in the same units; however, if we were to assume
time as the cost, then the benefit would be how much time is saved (as done by Fuhr,
2008). While this makes sense in the case of finding a name in a list, it does not make
sense in all scenarios. For example, in the news search scenario, we could imagine that the
amount of benefit is proportional to the new information found, and the benefit is relative
to how helpful the information is in achieving their higher level work or leisure task. So,
if we were to assume benefit as say, information gain (as done by Zhang and Zhai, 2015),
or user satisfaction (as done by Verma and Yilmaz, 2017), then how do we express cost
in the same unit? In this case, a function is needed to map the benefits and costs into
the same units (as done by Azzopardi and Zuccon, 2015). Alternatively, the ratio between
the benefit and the cost could be used instead, as done in Information Foraging Theory
(IFT), see Pirolli and Card, 1999, or when performing a cost-effectiveness analysis. Once
the units of measurement have been chosen, and instruments have been created to take such
measurements, then the subsequent problem is how to accurately estimate the cost of the
different interactions, and the benefit that is obtained from those interactions. This is very
much an open problem.
A noted limitation of such models is the underlying assumption that people seek to
maximize their benefit (e.g., the utility maximization paradigm). This assumption has been
subject to much scrutiny, and shown to break down in various circumstances leading to the
development of behavioural economics. Kahneman and Tversky have shown that people
are subject to various cognitive biases and that people often adopt heuristics which result in
sub-optimal behaviours (Kahneman and Tversky, 1979). In their work on Prospect Theory,
they argue that people have a more subjective interpretation of costs and benefits, and
that people perceive and understand risk differently (i.e., some are more risk-averse than
others). Simon, on the other hand, argues that people are unlikely to be maximizers who relentlessly
seek to maximize their benefit subject to a given cost (Simon, 1955), but rather satisficers
who seek to obtain a satisfactory amount of benefit for the minimum cost. While the utility
maximization assumption is questionable, there is the opportunity to extend the economic
models presented here, and create more behavioural economic models that encode these
more realistic assumptions about behaviour. As pointed out earlier, though, it is best to start
simple and refine the models accordingly.
Another challenge that arises when developing such models is to ensure that there has
been sufficient consideration of the user and the environment in which the interaction is
taking place. In our models, we have largely ignored the cognitive constraints and limitations
of users, nor have we explicitly modelled the environment. However, such factors are likely
to influence behaviour. For example, Oulasvirta, Hukkinen, and Schwartz (2009) examined
how choice overload (i.e., the paradox of choice) affects search behaviour and performance,
finding that people were less satisfied when provided with more results. White
(2013) showed that searchers would often seek confirmation of their a priori beliefs (i.e.,
confirmation bias), and were less satisfied with results that contradicted those beliefs.
However, within the Adaptive Interaction Framework (Payne and Howes, 2013) it is argued
that the strategies people employ are shaped by the adaptive, ecological, and bounded
nature of human behaviour. As such, these biases and limitations should be taken into
account when developing models of interaction; using the economic modelling approach
presented above, it is possible to develop models that maximize utility subject to
such constraints. Essentially, the economic models here could be extended to incorporate
such constraints (and thus assume Bounded Rationality (Gigerenzer and Selten, 1999;
Simon, 1955), for example). Furthermore, they could be incorporated into approaches
such as Rational Analysis (Anderson, 1991) or the Adaptive Interaction Framework (Payne
and Howes, 2013), whereby a model includes user and environmental factors as well.
Imposing such constraints not only makes the models more realistic, but also makes them
more likely to provide better explanations of behaviour and thus better inform how we design interfaces
and systems.
On a pragmatic point, the design and construction of experiments that specifically
test the models can also be challenging. In the models described above, we used a
technique called comparative statics to consider what happens to behaviour (e.g., issuing
fewer queries) when one variable is changed (e.g., query cost goes up). This requires
the assumption that all other variables are held constant. In practice, however, the
manipulation of one variable will invariably influence other variables. For example, in
the experiments examining the query cost hypothesis, one of the conditions contained
query suggestions, the idea being that clicking on a suggestion would be cheaper than
typing a query (Azzopardi, Kelly, and Brennan, 2013). However,
this inadvertently led to an increase in the amount of time on the search result page for
this condition, which was attributed to the time spent reading through the suggestions
(Azzopardi, Kelly, and Brennan, 2013). However, this cost was not considered in the initial
economic model proposed by Azzopardi (2011). This led to the revised model (Azzopardi,
2014), described above, which explicitly models the cost of interacting with the search
result page as well as the cost of interacting with snippets. These changes subsequently
led to predictions that were consistent with the observed data. This example highlights
the iterative nature of modelling and how refinements are often needed to create higher
fidelity models.
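The comparative statics technique described in this section can be illustrated with a toy model. The gain and cost functions below are illustrative assumptions (diminishing returns from further querying, linear query cost), not any published model of search; they serve only to show that raising one cost variable, all else held constant, shifts the predicted behaviour.

```python
# Toy comparative statics for an economic model of search. The gain and
# cost functions are illustrative assumptions, not the published model:
#   gain from q queries: k * q**b with diminishing returns (0 < b < 1)
#   cost of q queries:   c_q * q
# The searcher is assumed to pick the q maximizing net utility.

def optimal_queries(c_q, k=10.0, b=0.5, q_max=200):
    """Query count q in 1..q_max maximizing net utility k*q**b - c_q*q."""
    return max(range(1, q_max + 1), key=lambda q: k * q**b - c_q * q)

# Comparative statics: raise the per-query cost, hold everything else constant.
assert optimal_queries(c_q=1.0) < optimal_queries(c_q=0.5)  # costlier queries -> fewer issued
```

Under these assumed functions, doubling the per-query cost sharply reduces the predicted number of queries issued, which is exactly the kind of directional prediction the chapter's models are used to test.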
To sum up, this chapter has presented a tutorial for developing economic models of
interaction. While we have presented several examples based on information seeking and
retrieval, the same techniques can be applied to other tasks, interfaces, and applications,
in particular where there is a trade-off between costs and benefits. By creating such
models, we can reason about how changes to the system will impact user behaviour
before even implementing the system. We can identify which variables will have
the biggest impact on user behaviour and focus our attention on addressing those aspects,
thus guiding the experiments that we perform and the designs that we propose. Once a
model has been developed and validated, it provides a compact representation of the
body of knowledge, so that others can use and extend it. While we
have noted above a number of challenges in working with such models, the advantages are
appealing and ultimately such models provide theoretical rigour to our largely experimental
discipline.
....................................................................................................
references
Agapie, E., Golovchinsky, G., and Qvarfordt, P., 2012. Encouraging behavior: A foray into persuasive
computing. In: Proceedings of the Symposium on Human-Computer Interaction and Information Retrieval. New York,
NY: ACM.
Anderson, J. R., 1991. Is human cognition adaptive? Behavioral and Brain Sciences, 14(3), pp. 471–85.
Arampatzis, A., and Kamps, J., 2008. A study of query length. In: SIGIR ’08: Proceedings of the 31st
annual international ACM SIGIR conference on Research and development in information retrieval.
New York, NY: ACM, pp. 811–12.
Azzopardi, L., 2009. Query side evaluation: an empirical analysis of effectiveness and effort. In: Pro-
ceedings of the 32nd international ACM SIGIR conference on Research and development in information
retrieval. New York, NY: ACM, pp. 556–63.
Azzopardi, L., 2011. The economics in interactive information retrieval. In: Proceedings of the 34th
ACM SIGIR Conference. New York, NY: ACM, pp. 15–24.
Azzopardi, L., 2014. Modelling interaction with economic models of search. In: Proceedings of the 37th
ACM SIGIR Conference. New York, NY: ACM, pp. 3–12.
Azzopardi, L., Kelly, D., and Brennan, K., 2013. How query cost affects search behavior. In: Proceedings
of the 36th ACM SIGIR Conference. New York, NY: ACM, pp. 23–32.
Azzopardi, L., and Zuccon, G., 2015. An analysis of theories of search and search behavior. In:
Proceedings of the 2015 International Conference on The Theory of Information Retrieval. New York,
NY: ACM, pp. 81–90.
Azzopardi, L., and Zuccon, G., 2016a. An analysis of the cost and benefit of search interactions. In:
Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval. New
York, NY: ACM, pp. 59–68.
Azzopardi, L., and Zuccon, G., 2016b. Two Scrolls or One Click: A Cost Model for Browsing Search
Results. In: Advances in Information Retrieval: 38th European Conference on IR Research. Padua,
Italy: Springer, pp. 696–702.
Belkin, N. J., Kelly, D., Kim, G., Kim, J.-Y., Lee, H.-J., Muresan, G., Tang, M.-C., Yuan, X.-J., and Cool, C.,
2003. Query length in interactive information retrieval. In: Proceedings of the 26th ACM conference
on research and development in information retrieval (SIGIR). New York, NY: ACM, pp. 205–12.
Box, G. E., 1979. Robustness in the strategy of scientific model building. In: R. L. Launer and
G. N. Wilkinson, eds. Robustness in statistics: Proceedings of a workshop. New York: Academic Press,
pp. 201–36.
Browne, G. J., Pitts, M. G., and Wetherbe, J. C., 2007. Cognitive stopping rules for terminating
information search in online tasks. MIS Quarterly, 31(1), pp. 89–104.
Card, S. K., Moran, T. P., and Newell, A., 1980. The keystroke-level model for user performance time
with interactive systems. Communications of the ACM, 23(7), pp. 396–410.
Chater, N., and Oaksford, M., 1999. Ten years of the rational analysis of cognition. Trends in Cognitive
Sciences, 3(2), pp. 57–65.
Cooper, W. S., 1973. On selecting a measure of retrieval effectiveness. Part II: Implementation of the
philosophy. Journal of the American Society for Information Science, 24(6), pp. 413–24.
Cummins, R., Lalmas, M., and O’Riordan, C. 2011. The limits of retrieval effectiveness. In: European
Conference on Information Retrieval. Berlin: Springer, pp. 277–82.
Dostert, M., and Kelly, D., 2009. Users’ stopping behaviors and estimates of recall. In: Proceedings of
the 32nd ACM SIGIR. New York, NY: ACM, pp. 820–1.
Fitts, P. M., 1954. The information capacity of the human motor system in controlling the amplitude
of movement. Journal of Experimental Psychology, 47(6), p. 381.
Fuhr, N., 2008. A probability ranking principle for interactive information retrieval. Information
Retrieval, 11(3), pp. 251–65.
Gigerenzer, G., and Selten, R., 1999. Bounded Rationality: The Adaptive Toolbox. Cambridge, MA: MIT
Press.
Hick, W. E., 1952. On the rate of gain of information. Quarterly Journal of Experimental Psychology,
4(1), pp. 11–26.
Hiemstra, D., Hauff, C., and Azzopardi, L., 2017. Exploring the query halo effect in site search.
In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in
Information Retrieval. New York, NY: ACM, forthcoming.
Ingwersen, P., and Järvelin, K., 2005. The Turn: Integration of Information Seeking and Retrieval in
Context. New York, NY: Springer-Verlag.
Jin, F., Dougherty, E., Saraf, P., Cao, Y., and Ramakrishnan, N., 2013. Epidemiological modeling of
news and rumors on Twitter. In: Proceedings of the 7th Workshop on Social Network Mining and
Analysis. New York, NY: ACM, p. 8.
Kahneman, D., and Tversky, A., 1979. Prospect theory: An analysis of decision under risk. Economet-
rica: Journal of the Econometric Society, 47(2), pp. 263–91.
Kashyap, A., Hristidis, V., and Petropoulos, M., 2010. Facetor: cost-driven exploration of faceted query
results. In: Proceedings of the 19th ACM conference on Information and knowledge management. New
York, NY: ACM, pp. 719–28.
Kelly, D., and Azzopardi, L., 2015. How many results per page?: A study of serp size, search behavior
and user experience. In: Proceedings of the 38th International ACM SIGIR Conference. New York,
NY: ACM, pp. 183–92.
Kelly, D., Dollu, V. D., and Fu, X., 2005. The loquacious user: a document-independent source of
terms for query expansion. In: Proceedings of the 28th annual international ACM SIGIR conference
on Research and development in information retrieval. New York, NY: ACM, pp. 457–64.
Kelly, D., and Fu, X., 2006. Elicitation of term relevance feedback: an investigation of term source
and context. In: Proceedings of the 29th ACM conference on research and development in information
retrieval. New York, NY: ACM, pp. 453–60.
Lewis, R. L., Howes, A., and Singh, S., 2014. Computational rationality: Linking mechanism and
behavior through bounded utility maximization. Topics in Cognitive Science, 6(2), pp. 279–311.
Murty, K. G., 2003. Optimization models for decision making: Volume 1. University of Michigan, Ann
Arbor.
Ong, K., Järvelin, K., Sanderson, M., and Scholer, F., 2017. Using information scent to understand
mobile and desktop web search behavior. In: Proceedings of the 40th ACM SIGIR. New York, NY:
ACM.
Oulasvirta, A., Feit, A., Lahteenlahti, P., and Karrenbauer, A., 2017. Computational support for
functionality selection in interaction design. In: ACM Transactions on Computer-Human Interaction.
New York, NY: ACM, forthcoming.
Oulasvirta, A., Hukkinen, J. P., and Schwartz, B., 2009. When more is less: The paradox of choice in
search engine use. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and
Development in Information Retrieval. New York, NY: ACM, pp. 516–23.
Ouliaris, S., 2012. Economic models: Simulations of reality. Available at: http://www.imf.org/
external/pubs/ft/fandd/basics/models.htm. Accessed: 30 April, 2017.
Payne, S. J., and Howes, A., 2013. Adaptive interaction: A utility maximization approach to under-
standing human interaction with technology. Synthesis Lectures on Human-Centered Informatics,
6(1), pp. 1–111.
Pirolli, P., and Card, S., 1999. Information foraging. Psychological Review, 106, pp. 643–75.
Pirolli, P., Schank, P., Hearst, M., and Diehl, C., 1996. Scatter/gather browsing communicates the topic
structure of a very large text collection. In: Proceedings of the ACM SIGCHI conference. New York,
NY: ACM, pp. 213–20.
Prabha, C., Connaway, L., Olszewski, L., and Jenkins, L., 2007. What is enough? Satisficing informa-
tion needs. Journal of Documentation, 63(1), pp. 74–89.
Robertson, S. E., 1977. The probability ranking principle in IR. Journal of Documentation, 33(4),
pp. 294–304.
Russell, D., 2015. Mindtools: What does it mean to be literate in the age of Google? Journal of Computing
Sciences in Colleges, 30(3), pp. 5–6.
Simon, H. A., 1955. A behavioral model of rational choice. The Quarterly Journal of Economics, 69(1),
pp. 99–118.
Smith, C. L., and Kantor, P. B., 2008. User adaptation: good results from poor systems. In: Proceedings
of the 31st ACM conference on research and development in information retrieval. New York, NY: ACM,
pp. 147–54.
Turpin, A. H., and Hersh, W., 2001. Why batch and user evaluations do not give the same results. In:
Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in
IR. New York, NY: ACM, pp. 225–31.
Varian, H. R., 1987. Intermediate microeconomics: A modern approach. New York, NY: W.W. Norton.
Varian, H. R., 2016. How to build an economic model in your spare time. The American Economist,
61(1), pp. 81–90.
Verma, M., and Yilmaz, E., 2017. Search Costs vs. User Satisfaction on Mobile. In: ECIR ’17: Proceedings
of the European Conference on Information Retrieval. New York, NY: Springer, pp. 698–704.
White, R., 2013. Beliefs and biases in web search. In: Proceedings of the 36th International ACM SIGIR
Conference on Research and Development in Information Retrieval. New York, NY: ACM, pp. 3–12.
Zach, L., 2005. When is “enough” enough? modeling the information-seeking and stopping behavior
of senior arts administrators: Research articles. Journal of the Association for Information Science and
Technology, 56(1), pp. 23–35.
Zhang, Y., and Zhai, C., 2015. Information retrieval as card playing: A formal model for optimizing
interactive retrieval interface. In: Proceedings of the 38th International ACM SIGIR Conference on
Research and Development in Information Retrieval. New York, NY: ACM, pp. 685–94.
Zipf, G. K., 1949. Human Behavior and the Principle of Least Effort. New York, NY: Addison-Wesley.
12
12.1 Introduction
Our world today is increasingly technological and interconnected: from smart phones to
smart homes, from social media to streaming video, technology users are bombarded with
digital information and distractions. These many advances have led to a steep increase
in user multitasking (see, e.g., Mark, González, and Harris, 2005), as users struggle to
keep pace with the dizzying array of functions at their fingertips. The field of human-
computer interaction (HCI) has also struggled to keep pace, ever trying to understand how
multitasking affects user behaviour and how new technologies might be better designed to
alleviate the potential problems that arise from multitasking and distraction.
The world of user multitasking can be roughly characterized in terms of two broad
categories that span a continuum, defined by the time interval that separates
a switch from one task to another (see Salvucci and Taatgen, 2011). On one end of the
continuum are pairs of tasks that are performed more or less concurrently, with very
little (e.g., sub-second) time between task switches. Psychologists first studied concurrent
multitasking many decades ago (e.g., Telford, 1931) to better understand why, in most cases,
simultaneous tasks are more difficult than individual tasks. Studies of dual-choice phenom-
ena (e.g., Schumacher et al., 2001), in which a user makes two different stimulus-response
decisions in a rapid sequential order, have arguably been the most popular paradigm in
which to study concurrent multitasking behaviour. These studies have yielded some inter-
esting insights, showing that response times for a task generally increase as it is made to
temporally overlap with the processing of another concurrent task. Studies of concurrent
multitasking have broadened to more practical domains in recent years, especially those
that involve performing an occasional short task while doing a continual primary task.
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
Perhaps the best and most common example involves driving while interacting with an
in-car entertainment system or smartphone, and researchers have identified many risks
associated with driver distraction and inattention (see, e.g., Dingus et al., 2016). This
is just one example of a large class of tasks in which a person does a continuous task
while trying to perform a separate concurrent activity, such as interacting with a mobile
computing device.
On the other end of the multitasking continuum, sequential multitasking describes those
situations in which a user switches between two or more tasks, typically with a longer
period of time dedicated to each task before switching. Psychology and HCI researchers
have examined sequential multitasking in several forms using different terminology: ‘task
switching’, in which each task is an isolated unit, and once the task is left it is typically
not returned to at a later time (cf. Monsell, 2003); ‘task interleaving’, in which one task
is temporarily suspended for another task, and the user later returns to the first task (e.g.,
Payne, Duggan, and Neth, 2007; Janssen, Brumby, and Garnett, 2012); and ‘interruption
tasks’, in which the original task is disrupted by another task (e.g., Boehm-Davis and
Remington, 2009). Everyday sequential multitasking has been found to be very common
(González and Mark, 2004), and users can also adapt the time they spend between two
tasks based on the reward rates of tasks (Duggan, Johnson, and Sørli, 2013; Janssen et al.,
2011; Janssen and Brumby, 2015; Payne et al., 2007). Switching from one task to another
task generally incurs a task resumption lag (Altmann and Trafton, 2002; Trafton and Monk,
2008) and increases the probability of making a resumption error (Brumby, Cox, Back, and
Gould, 2013). The longer one is interrupted from working on a task, the longer it typically
takes to resume that task later (Altmann and Trafton, 2002; Monk et al., 2008). Moreover,
being interrupted at an inopportune moment can increase cognitive workload (e.g., Bailey
and Iqbal, 2008), stress (Adamczyk and Bailey, 2004), and the likelihood of making an error
(Borst, Taatgen, and Van Rijn, 2015; Gould, Brumby, and Cox, 2013).
This chapter surveys a number of recent approaches to understanding user multitasking
using computational models. While strengths and weaknesses of the different approaches
vary greatly, together they have shown great promise in helping to understand how users
perform multiple tasks and the implications for task performance. Computational models
are critical not only for scientific exploration of user behaviours, but also as computational
engines embedded into running systems that mimic, predict, and potentially mitigate inter-
ruption and distraction.
quantitative and qualitative results for the impact of multitasking on behaviour and per-
formance. A range of different modelling approaches have evolved to understand human
multitasking behaviour. In this section, we summarize three different approaches: cognitive
architectures, cognitive constraint modelling, and uncertainty modelling. These are among
the most common and powerful approaches to computational modelling of user multitasking,
and have complementary strengths. Later we discuss their benefits and drawbacks
with respect to understanding and improving HCI, as well as illustrate the approach
in the context of modelling various aspects of driver distraction.
hand in hand with threaded cognition: while the threads enable concurrent interleaving of
processes, the problem state plays a central role in more sequential interleaving of tasks as
task information is stored and later recalled.
Figure 12.1 In the cognitive constraints-based approach, external and internal factors (top) narrow
down what set of interaction strategies is possible, given the agent and the environment, out of all
potential behaviours. Once an evaluation criterion is introduced, one can identify the set of
locally optimal strategies within the full set of possible strategies.
of Figure 12.1: the space of all possible behaviours is narrowed down to a space of possible
behaviours for a specific user in a specific task setting once the external and internal constraints
are known (top of figure).
Given the focus on understanding strategic variability, this approach contrasts sharply
with the architecture-based approach that was described earlier, which typically looks to
simulate only a small set of strategies for performing tasks, rather than the broader context
of possible options. Instead, the cognitive constraint approach complements and extends
previous HCI modelling approaches, such as a fast-man/slow-man approach (Card, Moran,
and Newell, 1986) and performance bracketing (Kieras and Meyer, 2000). Whereas these
previous approaches typically focus on performance in the extremes (i.e., predicting the
fastest and slowest performance), the constraint-based approach aims to model a wider
(and potentially, complete) range of strategies to understand why a particular strategy
is chosen.
Objective function. A critical issue that cognitive constraint modelling brings into focus
is the choice between strategies. The approach allows the modeller to predict the range of
behaviours that are possible for an agent in a specific environment; the question then
becomes which of these strategies a user will select. An underlying
assumption of this approach is that people will generally select strategies that maximize
rewards (Howes, Lewis, and Vera, 2009). Such an evaluation criterion can take various
forms. Ideally, there is an objective criterion, such as a score from an explicit payoff function,
which allows identification of the overall best performing strategies (cf. Janssen et al., 2011;
Janssen and Brumby, 2015), as illustrated in Figure 12.1. This allows one to assess how well
people can optimize performance under constraints, which has been one of the focuses of
this modelling approach (Howes, Lewis, and Vera, 2009; Janssen and Brumby, 2015; Lewis,
Howes, and Singh, 2014; Payne and Howes, 2013).
An alternative approach is the use of a priority instruction. How people’s multitasking
performance changes with different priorities has been studied extensively (e.g., building
on work such as Navon and Gopher, 1979; Norman and Bobrow, 1975). The role of
constraint-based models is then to see whether a change in instructed priority results in
a change of behaviour, and how well this performance compares with alternative ways
of interleaving. As our example of driver distraction will later illustrate, this allows one
to understand distraction from a speed-accuracy tradeoff perspective and to dissociate
why instructed ‘safe’ performance might not always align with objectively assessed ‘safe’
performance.
Later, Kujala et al. (2016) introduced a simplified model of driver’s uncertainty accumu-
lation based on the event density of the road (i.e., environmental visual demands of driving):
Umaxk = D(x) × OD(x)k,
in which the constant Umaxk is the driver k’s subjective uncertainty tolerance threshold,
D(x) is the event density of the road at point x, and OD(x)k is the preferred occlusion
distance (the distance driven without visual information of the forward roadway) of the
driver k starting at the road point x. The event density D(x) is a discrete version of the
information density of the road as defined by Senders et al. (1967). Each event represents
an information-processing event of a task-relevant event state in the road environment by
the driver. The model allows a prediction for driver behaviour to be made, that is:
OD(x)k = Umaxk / D(x),
which can be used for predicting individual off-road glance durations if one’s Umaxk and
the event density of the road environment (e.g., curvature; see Tsimhoni and Green, 2001)
are known. The metric of OD has been utilized as a baseline for acceptable off-road glance
durations for in-car user interface testing in dynamic, self-paced driving scenarios (Kujala,
Grahn, Mäkelä, and Lasch, 2016). Based on the findings of Lee et al. (2015), the constant
Umaxk is likely to be determined by two individual factors instead of one: the
uncertainty tolerance threshold and the rate of growth of uncertainty. This
complicates the interpretation of the OD measurements for a road environment, but for
a sufficient driver sample, average OD can be taken as an estimate of the objective visual
demands of the road environment.
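The occlusion-distance relation above can be sketched directly in code. The tolerance threshold and event-density values below are made-up illustrative numbers, not measurements from the cited studies.

```python
# Sketch of the relation OD(x)_k = Umax_k / D(x) from Kujala et al. (2016),
# as described in the text. The tolerance threshold u_max_k and the event
# densities below are made-up illustrative values.

def preferred_occlusion_distance(u_max_k, event_density):
    """Predicted distance (m) driver k will travel without forward vision,
    given uncertainty tolerance u_max_k and road event density (events/m)."""
    return u_max_k / event_density

# Denser road segments (more task-relevant events per metre) predict shorter
# occlusion distances, and hence shorter acceptable off-road glances.
straight_od = preferred_occlusion_distance(u_max_k=12.0, event_density=0.5)
curve_od = preferred_occlusion_distance(u_max_k=12.0, event_density=1.5)
assert curve_od < straight_od
```

Dividing the predicted occlusion distance by driving speed would then give an individual off-road glance-time budget for a given road segment.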
An example of an alternative quantitative model of attentional control is SEEV (Saliency,
Effort, Expectancy, Value; see Wickens et al., 2003), in which the operator’s (internal) uncer-
tainty does not play an explicit key role. SEEV’s modifications have been utilized in mod-
elling and predicting the allocation of visual (focal) attention of a human operator at areas of
interest in aviation, driving, driving with secondary tasks, and visual workspaces in general
(e.g., Steelman et al., 2011, 2016; Wortelen et al., 2013; Horrey et al., 2006). The SEEV
model treats the eye as a single-server queue served by visual scanning (i.e., sampling). The
SEEV model is based on optimal (prescriptive) models of information sampling, which rely
on the product of expectancy (i.e., expected event rate) and value (i.e., relevance, priority)
of information at a channel (Senders, 1964; Carbonell, 1966; Moray, 1986). SEEV adds
the components of information saliency and effort (e.g., distance between displays) to the
equation in order to provide a more accurate description of how a (suboptimal) human
operator shares visual attention over time between targets (Wickens et al., 2003). All of these
components are quantified in the SEEV model(s) for evaluating the probability of gaze
allocation at a task-relevant area of interest in time. SEEV can be applied to predict attention
allocation between concurrent tasks, that is, task management in multitasking environ-
ments. Validation studies on SEEV have produced high correlations between predicted and
observed percentage dwell times (PDTs) on areas of interest in the domains of aviation
(Wickens et al., 2003) and driving (Horrey et al., 2006). An extension of the model (NSEEV,
N for predicting noticing behaviour) is also capable of predicting the speed and probability
with which human operators notice critical visual events (e.g., warning signals) in cockpits
and in alert detection tasks while multitasking (Steelman et al., 2016, 2011). The SEEV-AIE
(Adaptive Information Expectancy) model by Wortelen et al. (2013) is built upon the SEEV
model with an advantage that it is able to derive the expectancy parameters automatically
from a simulation of a dynamic task model in a cognitive architecture, whereas SEEV (and
NSEEV) requires a human expert to provide an estimation of the operator’s expectancies
for the task-relevant event rates at the visual information channels in the task environment.
Also in SEEV, uncertainty is implicitly assumed to be a driver of attention allocation, but
uncertainty is coupled tightly with the objective bandwidth of events in an information
channel (Horrey et al., 2006). It is assumed that the higher the expected event rate in
an information channel, the faster the operator’s uncertainty about these events grows if
the operator does not attend the channel. This is a reasonable assumption, but there are tasks
and task environments in which uncertainty and expectancy are decoupled, and, in fact,
uncertainty is coupled with the unexpected. For instance, a car driver feels a great demand
to keep eyes on the road at the end of an unfamiliar steep hill because the driver does not
know what to expect behind the hill (i.e., high uncertainty), not because of high expectancy
of task-relevant events. Expectancy informs the human operator what the task-relevant
channels are in the task environment and the expected event rates on the channels, but
(individual) uncertainty of these expectations may add a significant (top-down) factor for
explaining human attention allocation between the targets in unfamiliar and/or variable
task environments. From the viewpoint of minimizing uncertainty, attentional demand
for allocating attention to a source of information with an expected event and the related
value of the sampled information can be actually rather low (Clark, 2013). From this
perspective, SEEV and similar models are useful for predicting attention allocation for an
experienced operator for a well-defined task and task environment (a prescriptive model),
but are limited in their ability to describe or predict individual behaviours in uncertain
task contexts (e.g., for operators with different uncertainty growth rates; see Lee et al.,
2015, for a novice operator with under-developed expectancies, or for environments of
high variability in task-relevant event rates). For instance, in the uncertainty model of
Kujala et al. (2016), discussed earlier, the event density and visual demand are still very
much coupled with the (objective) information bandwidth of the road, but overall, the
uncertainty-based models of attention allocation seem to provide feasible solutions for
predicting individual differences in the control of attention (of focal vision, in particular)
in multitasking. Furthermore, these models are well in line with more general theories of
brains as prediction engines (e.g., Clark, 2013, 2015).
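The additive spirit of a SEEV-style score can be sketched as follows. The coefficients, the areas of interest, and their parameter values are illustrative assumptions, not calibrated values from the cited validation studies.

```python
# Hedged sketch of an additive SEEV-style score (cf. Wickens et al., 2003):
# attention paid to an area of interest (AOI) rises with Salience, Expectancy,
# and Value, and falls with the Effort of redirecting gaze. Coefficients and
# AOI parameter values are illustrative assumptions.

def seev_score(salience, effort, expectancy, value,
               s=1.0, ef=1.0, ex=1.0, v=1.0):
    """Additive SEEV-style attention score for one AOI."""
    return s * salience - ef * effort + ex * expectancy + v * value

aois = {
    "road_ahead": dict(salience=0.5, effort=0.0, expectancy=0.9, value=1.0),
    "phone":      dict(salience=0.8, effort=0.4, expectancy=0.3, value=0.4),
}

# Normalize scores into predicted dwell-time proportions (PDT) over the AOIs,
# the quantity validation studies compare against observed dwell times.
scores = {name: seev_score(**params) for name, params in aois.items()}
total = sum(scores.values())
pdt = {name: score / total for name, score in scores.items()}
```

In this toy setting, the road's high expectancy and value outweigh the phone's salience, so the model allocates the larger dwell-time share to the forward roadway.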
risks. Researchers, legislators, and the general public have all sought ways to understand,
predict, and ultimately alleviate this growing problem.
All current definitions of driver distraction are incomplete, as they do not define the
activities necessary for safe driving from which the driver is said to be distracted (e.g.,
Foley et al., 2013; Regan et al., 2011). There are at least three forms of driver distraction:
manual, cognitive, and visual (Foley et al., 2013). Manual distraction (‘hands off the steering
wheel’) refers to ‘any physical manipulation that competes with activities necessary for safe
driving’ and cognitive distraction (‘mind off road’) refers to ‘any epoch of cognitive loading that
competes with activities necessary for safe driving’ (Foley et al., 2013, p. 62). Visual distraction,
in short, refers to the driver's eyes being off the road at the wrong moment. The act of driving
requires visual attention in particular, and visual distraction has been found to be the most
significant type of distraction in terms of its association with safety-critical incident risk (Dingus
et al., 2016). For this reason, optimizing drivers' in-car user interfaces to minimize visual
distraction has been a major goal of HCI design efforts in the domain (e.g., Kun et al.,
2016; Schmidt et al., 2010). This section provides examples of computational modelling of a
driver's time-sharing of visual attention between the task of driving and secondary mobile
computing activities performed at the same time. We are interested
in understanding the effects of driver multitasking on task performance and safety. As we
shall see, a further concern is with visual distraction, which is often deeply entwined with
cognitive distraction.
the interface; in doing so, the user specifies the architecture models not by explicit coding,
but instead by demonstrating the task actions in sequence. Once the interface and tasks are
specified, the user can run the system to simulate driver behaviour, checking how common
measures of driver performance (e.g., deviation from the lane centre) might be affected by
interaction with the proposed interface.
number while also steering a simulated car. While doing this, they were given an explicit task
priority: either to prioritize the dialling of the phone number as fast as possible (‘dialling
focus’ condition) or to prioritize keeping the car as close to the middle of the driving lane
as possible (‘steering focus’ condition). Empirical results consistently showed that dialling
and steering performance were affected by this manipulation of task priority instructions
(Brumby, Salvucci, and Howes, 2009; Janssen and Brumby, 2010; Janssen, Brumby, and
Garnett, 2012). A model was then used to investigate how efficient this performance was
by comparing it with the predicted performance of a range of interleaving strategies.
The model developed for this purpose explored only sequential strategies in which a
driver is either steering, dialling, or switching between dialling and driving (more details
can be found in Janssen and Brumby, 2010). The model combines external factors (e.g.,
dynamics of the driving simulator) with internal factors (e.g., memory
retrieval times). When the model is actively steering, it updates the heading (or lateral
velocity) of the car every 250ms based on the current position of the car in the lane, i.e.,
its lateral deviation (LD, in metres), as defined in this equation (Brumby, Howes, and
Salvucci, 2007):
Lateral Velocity (m/s) = 0.2617 × LD² + 0.0233 × LD − 0.022
The equation captures the intuition that steering movements are sharper when the car is
further away from lane centre and are smaller when the car is close to lane centre. In the
driving simulator, the position of the car was updated every 50ms and on each sample
noise from a Gaussian distribution was added. As a consequence, during periods of driver
inattention the car would gradually drift in the lane.
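The steering dynamics above can be sketched as a small simulation. The equation is the one quoted in the text; the noise standard deviation is an assumed value, since the chapter does not report the simulator’s.

```python
import random

def lateral_velocity(ld):
    """Corrective lateral velocity magnitude (m/s) for an absolute
    lateral deviation ld (m), per Brumby, Howes, and Salvucci (2007)."""
    return 0.2617 * ld ** 2 + 0.0233 * ld - 0.022

def simulate_segment(attending, ld=0.3, duration=2.0, dt=0.05, noise_sd=0.05):
    """Advance the car's lane position in 50 ms simulator steps.

    While the driver attends to steering, the corrective velocity is
    recomputed every 250 ms (every 5 steps); during inattention the car
    only drifts under per-sample Gaussian noise.  noise_sd is an assumed
    value; the chapter does not report the simulator's."""
    v = 0.0
    for step in range(round(duration / dt)):
        if attending and step % 5 == 0:
            correction = max(lateral_velocity(abs(ld)), 0.0)
            v = -correction if ld > 0 else correction  # steer towards centre
        ld += v * dt + random.gauss(0.0, noise_sd) * dt
    return ld
```

Averaging `abs(simulate_segment(False))` over many runs shows how lateral deviation grows during glances away from the road, while `simulate_segment(True)` pulls the car back towards lane centre.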
For the dialling task, the time to type each digit was calibrated based on single-task
dialling time for each study. Typically, it was assumed that each digit required approximately
the same amount of time to type. In addition, retrieval costs were added each time
a chunk needed to be retrieved from memory (i.e., before the first and sixth digit of the
numbers, and after returning to dialling after a period of driving). Switching attention
between tasks was assumed to take 200ms.
The model was used to make performance predictions for different dual-task interleaving
strategies. This was done by systematically varying how many digits are dialled before the
model returns to driving and how much time is spent on driving before returning to dialling.
This approach meant that many different dual-task interleaving strategies were simulated
and performance predictions made using the model. Figure 12.2 plots typical model results.
Each grey dot indicates the performance of a particular strategy for interleaving the dialling
task with driving. The performance of each strategy is expressed as total dialling time
(horizontal axis) and the average lateral deviation of the car (vertical axis).
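The enumeration of strategies can be sketched as follows. The 200ms switch cost is taken from the text; the keypress and retrieval times, the 11-digit number, and the fixed 1.5s driving interval are assumptions standing in for the calibrated, per-study values.

```python
from itertools import combinations

# Illustrative parameters: the switch cost is from the text; the keypress
# and retrieval times stand in for the per-study calibrated values.
DIGIT_T = 0.30     # s per keypress (assumed)
RETRIEVE_T = 0.55  # s per chunk retrieval from memory (assumed)
SWITCH_T = 0.20    # s per attention switch (from the chapter)

def dialling_time(breaks, drive_t):
    """Predicted total time to dial an 11-digit number when the model
    returns to driving for drive_t seconds after each digit position in
    `breaks`.  Retrievals are paid before digits 1 and 6, and again on
    every resumption of dialling (a simplifying assumption)."""
    t = 11 * DIGIT_T + 2 * RETRIEVE_T       # keypresses + chunk retrievals
    t += len(breaks) * (2 * SWITCH_T + drive_t + RETRIEVE_T)
    return t

# Enumerate every sequential strategy: any subset of the ten between-digit
# positions at which the model could return to driving.
strategies = {brk: dialling_time(brk, drive_t=1.5)
              for k in range(11)
              for brk in combinations(range(1, 11), k)}
fastest = min(strategies.values())   # dial all eleven digits in one go
slowest = max(strategies.values())   # return to driving after every digit
```

The lateral-deviation axis of the strategy space would come from pairing each strategy with the steering simulation; here only the dialling-time axis is computed.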
The modelling analysis demonstrates that the choice of dual-task interleaving strategy
has a substantial impact on the performance of each task. In particular, for dialling time, the
fastest and slowest strategy differ by more than a factor of 10 (note the logarithmic scale).
For both dialling time and lateral deviation, lower values are better. More specifically, the
closer that the strategy is to the origin, the better the strategy can be considered. Given a
known criterion of performance on one task (e.g., a specific dialling time), the model can
be used to predict the best possible performance on the other task (e.g., lateral deviation).
OUP CORRECTED PROOF – FINAL, 21/12/2017, SPi
[Figure 12.2 axes: Dialling Time (sec), 5–50 on a logarithmic scale (horizontal); average lateral deviation of the car, 0–0.5 (vertical).]
Figure 12.2 Typical model results of the cognitive constraint model of the dialling-while-steering
task. The human interleaving strategy changes with the given priority focus. It typically lies on the
outer edge of the performance trade-off curve. However, even when prioritizing steering, drivers did
not apply the objectively safest strategy (far right), given the cost of the longer dialling time. Adapted
from Janssen and Brumby (2010).
Examples are highlighted in the figure with the dashed grey lines. If one draws a line through
the points on this lower-left outside edge of the model predictions, a trade-off curve emerges
(the solid light grey line in the figure).
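The trade-off curve is, in effect, the Pareto frontier of the strategy space. A minimal sketch of extracting it from (dialling time, lateral deviation) predictions, using hypothetical values:

```python
def pareto_front(points):
    """Return the strategies on the lower-left trade-off curve: those for
    which no other strategy is at least as fast AND keeps the car at
    least as close to lane centre, with one measure strictly better."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1]
                        and (q[0] < p[0] or q[1] < p[1])
                        for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)

# hypothetical (dialling time s, mean lateral deviation m) predictions
preds = [(5, 0.9), (6, 1.2), (8, 0.4), (10, 0.5), (20, 0.35)]
```

Given a performance criterion on one axis, scanning the sorted front for the first point meeting it yields the best achievable value on the other axis.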
Understanding the shape of the trade-off curve can give insights into why people might
adopt particular dual-task interleaving strategies. To give an example, Figure 12.2 shows
the dual-task performance of participants from Janssen and Brumby’s (2010) dialling and
driving study. Participants in this experiment were instructed to either prioritize the dialling
task or the steering task. It can be seen that participants in these different task priority
conditions made different dual-task tradeoffs: participants instructed to prioritize dialling
were faster at completing the dialling task but maintained worse lateral control of the car,
whereas participants instructed to prioritize steering were slower at completing the dialling
task but maintained better lateral control of the car. The modelling analysis helps understand
these tradeoffs: human performance falls at different points along the tradeoff curve. This
result is important because it demonstrates the substantial impact that strategy choice can
have on performance: people can choose to adopt quite different strategies and this has
implications for how well each task is performed.
Another salient aspect is that participants in the steering focus condition did not apply
the strategy that leads to the ‘best possible’ driving performance (i.e., they did not apply
the strategy on the far right of the performance space). This behaviour can be understood
given the broader context of the strategy space. The modelling analysis suggests that while
better driving performance could be achieved by completing the dialling task very slowly,
the improvements in driving are marginal relative to the additional time cost for completing
the dialling task (i.e., the line flattens out after around 8s to complete the dialling task).
This suggests that participants were ‘satisficing’ by selecting a strategy that was good enough
for the driving task while also completing the dialling task in a reasonable length of time. This
insight is important because it suggests that satisficing could lead people to make decisions
that seem good enough when compared to other multitasking strategies, but which are
nonetheless worse for performance than a mono-tasking strategy.
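Such satisficing can be sketched as walking the trade-off curve until the marginal return flattens; the stopping threshold here is an assumed criterion, not a value from the chapter.

```python
def satisficing_point(curve, min_gain=0.02):
    """Follow the trade-off curve (sorted by dialling time) until the
    reduction in lateral deviation (m) per extra second of dialling time
    drops below min_gain, an assumed 'good enough' criterion."""
    best = curve[0]
    for prev, nxt in zip(curve, curve[1:]):
        dt = nxt[0] - prev[0]
        gain = (prev[1] - nxt[1]) / dt if dt > 0 else 0.0
        if gain < min_gain:
            break          # further improvement is not worth the time cost
        best = nxt
    return best
```

On a curve that flattens out after roughly 8s of dialling time, this rule stops near that knee rather than at the far-right (objectively safest) strategy.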
The cognitive architecture (ACT-R) on which the model was built introduced certain constraints
but, on the other hand, offered validated solutions for modelling some of the details of the
simulated behaviour. For example, the model benefited from ACT-R’s embedded theory of time
perception (Taatgen et al., 2007). The theory posits that time perception acts like a metronome
that ticks slower as time progresses, with additional noise drawn from a logistic distribution.
The end result is time estimates that match well to the abilities and limitations of real
human time perception. The driving behaviour was based on the ACT-R driver model
(Salvucci, 2006), and the ACT-R model of eye movements and visual encoding (Salvucci,
2001b). Model parameters were estimated to achieve the best overall fit to the empirical data
gathered in two experiments.
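The time-perception mechanism can be sketched as a slowing, noisy metronome; the parameter values below are illustrative approximations, not the published fits.

```python
import math
import random

def logistic_noise(s):
    """Draw zero-centred logistic noise with scale s."""
    u = min(max(random.random(), 1e-12), 1 - 1e-12)
    return s * math.log(u / (1.0 - u))

def tick_count(duration, t0=0.011, a=1.1, b=0.015):
    """Number of internal 'ticks' produced in `duration` seconds.

    Each inter-tick interval is a times the previous one plus logistic
    noise, so the metronome slows down as time passes (Taatgen et al.,
    2007).  t0, a, and b are illustrative values, not the published fits."""
    t, interval, n = 0.0, t0, 0
    while t + interval < duration:
        t += interval
        n += 1
        interval = max(a * interval + logistic_noise(b * a * interval), 0.001)
    return n
```

Because intervals lengthen geometrically, the same physical duration yields proportionally fewer ticks the longer it lasts, which is what makes long durations harder to estimate precisely.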
The secondary search tasks simulated a situation in which the participant searches for
a target song in an in-car music menu. The search tasks were performed on a screen with
six different layouts, with six, nine, or twelve items per page, on a list or grid menu layout
(3 × 2 design). For in-car visual search, the model assumes a random starting point,
inhibition of return (Klein, 2000; Posner and Cohen, 1984), and a search strategy that always
seeks the nearest uninspected item (Halverson and Hornof, 2007). A limitation of the search
model was that it probably does not apply to semantic and alphabetic organizations of the
in-car display items (Bailly et al., 2014). We decided to focus on the more general case
in which there is no predetermined ordering for the menu items (e.g., sorting points of
interest by proximity to the car). This served to minimize the potential effects of previous
knowledge and practice on the visual search behaviours. Following Janssen et al. (2012),
we assumed that the press of a scroll button after an unsuccessful search for a page would
act as a natural break point for a task switch, and thus, the model returns eyes back to the
driving environment after each change of a screen (Figure 12.3). Following the NHTSA in-
car device testing guidelines (2013), the model driver was given the goal to drive at 80km/h
on the centre lane of a three-lane road and to follow a simulated car that kept a constant speed
of 80km/h. The model’s predictions were validated with two empirical driving simulator
studies with eye tracking (N = 12 each).
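The nearest-uninspected-item search strategy can be sketched as follows; reducing inhibition of return to simply never revisiting an item is a deliberate simplification.

```python
import math
import random

def search_order(positions, start=None):
    """Inspection order over menu-item positions: begin at a random item
    (or `start`), then repeatedly fixate the nearest not-yet-inspected
    item.  Never revisiting an item is a simple stand-in for inhibition
    of return (Klein, 2000)."""
    remaining = dict(enumerate(positions))
    current = start if start is not None else random.choice(list(remaining))
    order = [current]
    pos = remaining.pop(current)
    while remaining:
        # nearest uninspected item by Euclidean screen distance
        current = min(remaining, key=lambda i: math.dist(pos, remaining[i]))
        pos = remaining.pop(current)
        order.append(current)
    return order

# a hypothetical 3 x 2 grid layout: item index -> (column, row) position
grid = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

Starting from the top-left item, the sketch sweeps along the top row and back along the bottom row, as a nearest-neighbour rule predicts for a grid.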
In terms of its results, the model accurately accounted for the increase in the total in-car
glance durations as the number of display items, and the task time, increased (Exp 1: r² =
0.843, RMSSD = 1.615; Exp 2: r² = 0.950, RMSSD = 1.108). It also accounted generally
for the number of in-car glances, although the model tended to overestimate the number
of glances.

[Figure 12.3 flowchart: a Driving thread (attend lane centre; attend and encode the lead car; if the car is not stable, continue driving; if it is stable, resume search) interleaved with a Search thread (find and encode the next item; if item = target, press the target item; if item ≠ target and time < limit, continue search; if item ≠ target and time ≥ limit, interrupt search; if no more items, press the down button; on the first iteration after an interruption, the time limit is reset or incremented).]
Figure 12.3 Schematic overview of the visual sampling model’s flow of processing. Adapted from
Kujala and Salvucci (2015).

These metrics describe the model’s (and architecture’s) ability to model and
predict the general visual demands (e.g., encoding times) of visual search tasks. However,
here we were also interested in the model’s ability to describe how drivers time-share their
visual attention between the driving and the secondary task. In general, the model not only
provided good predictions on mean in-car glance durations (Exp 1: r² = 0.396; Exp 2:
r² = 0.659), but also revealed the challenge of predicting the probabilities of occasional
long in-car glances. A major accomplishment of the model was its ability to predict that
only the List-6 layout out of all the layout alternatives would (likely) pass the 15 per cent
(max) NHTSA (2013) in-car device verification threshold for the percentage of >2s in-car
glances. Another accomplishment was the model’s ability to predict the in-car task length’s
increasing effect on the percentage of these long in-car glances. The work also provided
a plausible explanation for this in-car task length effect observed in several studies (e.g., Lee
et al., 2012; Kujala and Saariluoma, 2011). Due to the upward adjustment of the time limit
after each in-car glance while driving stays stable, the time limit can grow higher the
longer the task lasts (i.e., the more in-car glances, the more upward adjustments can be made).
A large time limit, together with the inbuilt delay and noise in the human time perception
mechanism (Taatgen et al., 2007), translates to a greater chance of risky overlong glances (i.e.,
lapses of control in visual sampling).
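The limit-adjustment mechanism can be sketched as follows. All numeric values (base limit, increment, noise scale) are illustrative assumptions; in the chapter’s model, glance-length variability comes from the ACT-R time-perception mechanism rather than from parameters fixed directly.

```python
import math
import random

def logistic_noise(s):
    """Draw zero-centred logistic noise with scale s."""
    u = min(max(random.random(), 1e-12), 1 - 1e-12)
    return s * math.log(u / (1.0 - u))

def percent_long_glances(n_glances, base_limit=1.0, step=0.2,
                         noise_s=0.15, threshold=2.0):
    """Percentage of in-car glances longer than `threshold` seconds when
    the driver's glance time limit creeps upward after every glance that
    leaves driving stable, and resets after one that does not.  All
    numeric values are illustrative assumptions."""
    limit, over = base_limit, 0
    for _ in range(n_glances):
        glance = max(limit + logistic_noise(noise_s), 0.1)
        if glance > threshold:   # overlong glance: driving degrades
            over += 1
            limit = base_limit   # so the limit is reset
        else:
            limit += step        # stable driving: risk a longer next glance
    return 100.0 * over / n_glances
```

Running this with more glances per task lets the limit creep further upward, reproducing the task-length effect on the percentage of >2s glances that the NHTSA 15 per cent criterion penalizes.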
Besides these successes, there were significant limitations in the predictive power of the
model. Alternative visual sampling models might achieve similar or
even better fits with the empirical data. In particular, our model overestimated the overall
number of in-car glances, as well as the maximum in-car glance durations. There could
be additional mechanisms at work in limiting in-car glance durations that were absent
from our model. For instance, there was no (explicit) internal uncertainty component in the
model. On the other hand, it is possible that human drivers preferred to extend a glance to
press a scroll button as a natural break point for a task switch, even if the time limit was
close or already passed (Janssen et al., 2012). Our model tended to move attention back to
driving immediately after exceeding the time limit.
The most severe deficiencies of the model become clear if one studies individual
time-sharing protocols at a detailed level. While the fits to the aggregate data are good, at the
individual level one can see variations in the visual sampling strategies, even in the behaviour
of a single participant, that are not taken into account in our original model. In Salvucci
and Kujala (2016), we studied the systematicities in the individual time-sharing protocols
and modelled how the structural (e.g., natural breakpoints) and temporal constraints (e.g.,
uncertainty buildup for the driving task) of the component tasks interplay in determining
switch points between tasks. The proposed computational model unifies the structural and
temporal constraints under the notion of competing urgencies. The model was able to
provide a good fit for the identified individual strategies, as well as for the aggregate data,
comparable to the fit of our original model (Kujala and Salvucci, 2015). The findings support
the computational model of the interplay between subtask structure and the
temporal constraints of the component tasks, and indicate how this interplay strongly
influences people’s time-sharing behaviour. More generally, the findings stress the importance
of modelling individual time-sharing behaviours (including strategies and uncertainties)
besides the aggregate behaviours.
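At its core, the competing-urgencies idea can be caricatured as a comparison between accumulated urgencies, with natural breakpoints weakening the in-car task’s hold on attention. This is an illustrative reading under an assumed functional form and assumed values, not the published model.

```python
def switch_to_driving(driving_urgency, task_urgency, at_breakpoint,
                      breakpoint_bonus=0.5):
    """Attention returns to driving once driving's accumulated urgency
    (e.g., uncertainty about the car's position) outweighs the in-car
    task's urgency; a natural breakpoint (e.g., a completed subtask)
    lowers the in-car task's hold on attention.  The comparison rule and
    bonus value are assumptions for illustration only."""
    hold = task_urgency + (0.0 if at_breakpoint else breakpoint_bonus)
    return driving_urgency > hold
```

Under this caricature, the same level of driving urgency triggers a switch at a subtask boundary but not mid-subtask, which is the qualitative pattern the breakpoint findings describe.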
12.4 Discussion
Throughout this chapter we have tried to emphasize the many benefits of computational
models for understanding multitasking in the context of HCI. One of the most important
overarching benefits of these models is that they capture fundamental characteristics
of human behaviour and can be used to predict performance in novel settings. This is
especially critical for the domain of multitasking, as humans are flexible in the ways
that they combine multiple tasks. For example, new applications for smartphones and
new types of technology are released continuously. Even when few empirical studies
have been run in such novel settings, models can be used to predict human
performance. This is particularly relevant for the domain of driver distraction with the
advent of automated and self-driving vehicles (Kun et al., 2016). As the capabilities of
automated vehicles change, so will the role of the human driver. However, again, even
though the requirements of the car change, the central characteristics of the human
driver (i.e., as captured in a cognitive architecture or model) will not. For example, such
models might be used to predict how quickly drivers can recreate relevant driving context, or
what might be an appropriate moment to request a human driver to take over control
of the vehicle (see also, Van der Heiden, Iqbal, and Janssen, 2017). All the modelling
approaches outlined here can be used to make predictions about how humans might
perform in such changing circumstances, although each approach focuses on a different
grain size of behaviour and/or different tradeoffs of aggregate behaviour vs. individual
behaviours.
One important limitation of the modelling approaches presented here is that they are
developed for goal-oriented tasks, whereas not all tasks in the world are goal-oriented.
Recent efforts are being made towards modelling less goal-oriented activity, such as computational
models of ‘mind wandering’ in a cognitive architecture (e.g., van Vugt, Taatgen, Sackur, and
Bastian, 2015). Another limitation of the approaches is that in the construction and analysis
of the models, we as modellers are often focused on the cognitive, perceptual, and motor
aspects of behaviour, whereas other aspects of human behaviour—such as fatigue, arousal,
mood, and emotion—also affect performance and experience (e.g., Dancy, Ritter, and Berry,
2012; Gunzelmann, Gross, Gluck, and Dinges, 2009), but are typically not addressed in
these modelling efforts.
Although the main example illustrated in this chapter focused on driver distraction,
the essential elements are captured in general constraints and characteristics of the
human cognitive architecture, and therefore the approaches can also be applied to other
settings. For example, the cognitive architecture approach of threaded cognition has been
applied to study various combinations of tasks (Salvucci and Taatgen, 2011), and the
constraint-based approach has been applied to study efficiency in desk-based dynamic
tasks (e.g., Janssen and Brumby, 2015). Ideally, each of these approaches strives to
account for as many known phenomena as possible, while simultaneously offering enough
flexibility to predict and explain behaviour in novel situations and newly discovered
phenomena.
12.5 Conclusion
This chapter introduces three approaches that have been used to model human multi-
tasking: cognitive architectures, cognitive constraint modelling, and uncertainty modelling.
The value of these modelling approaches lies in their detailed description and prediction
of human behaviour, and each approach focuses on slightly different aspects of behaviour.
Given the diversity of situations in which multitasking occurs and the diversity in research
approaches to study multitasking (Janssen, Gould, Li, Brumby, and Cox, 2015), the diversity
in modelling approaches is extremely valuable. The diversity provides complementary
strengths and applicability to different settings. The presented examples offer insights
into how different task models can be combined in a cognitive architecture for providing
estimates of aggregate measures of performance; how the space of task interleaving strategies
can be explored by using cognitive constraint models; and how uncertainty models can
predict occasional failures in visual sampling between tasks.
....................................................................................................
references
Adamczyk, P. D., and Bailey, B. P., 2004. If not now, when? Proceedings of the SIGCHI Conference
on Human Factors in Computing Systems, pp. 271–8.
Altmann, E., and Trafton, J. G., 2002. Memory for goals: An activation-based model. Cognitive Science,
26(1), pp. 39–83.
Anderson, J. R., 1983. The Architecture of Cognition. Cambridge, MA: Harvard University Press.
Anderson, J. R., 2007. How Can the Human Mind Occur In The Physical Universe? Oxford: Oxford
University Press.
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., and Qin, Y., 2004. An integrated
theory of the mind. Psychological Review, 111, pp. 1036–60.
Bailey, B. P., and Iqbal, S. T., 2008. Understanding changes in mental workload during execution
of goal-directed tasks and its application for interruption management. ACM Transactions on
Computer-Human Interaction (TOCHI), 14(4), pp. 1–28.
Bailly, G., Oulasvirta, A., Brumby, D. P., and Howes, A., 2014. Model of visual search and selection time
in linear menus. In: CHI’14: Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems. New York, NY: ACM, pp. 3865–74.
Boehm-Davis, D. A., and Remington, R. W., 2009. Reducing the disruptive effects of interruption:
a cognitive framework for analysing the costs and benefits of intervention strategies. Accident
Analysis & Prevention, 41(5), pp. 1124–9. Available at: http://doi.org/10.1016/j.aap.2009.06.029.
Bogunovich, P., and Salvucci, D. D., 2010. Inferring multitasking breakpoints from single-task data. In:
S. Ohlsson and R. Catrambone, eds. Proceedings of the 32nd Annual Meeting of the Cognitive Science
Society. Austin, TX: Cognitive Science Society, pp. 1732–7.
Borst, J. P., Taatgen, N. A., and van Rijn, H., 2010. The problem state: a cognitive bottleneck
in multitasking. Journal of Experimental Psychology: Learning, Memory, and Cognition, 36(2),
pp. 363–82.
Borst, J. P., Taatgen, N. A., and van Rijn, H., 2015. What makes interruptions disruptive? A process-
model account of the effects of the problem state bottleneck on task interruption and resumption.
In: Proceedings of the 33rd Annual ACM Conference. New York, NY: ACM, pp. 2971–80.
Available at: http://doi.org/10.1145/2702123.2702156.
Brumby, D. P., Cox, A. L., Back, J., and Gould, S. J. J., 2013. Recovering from an interruption: Inves-
tigating speed-accuracy tradeoffs in task resumption strategy. Journal of Experimental Psychology:
Applied, 19, pp. 95–107. Available at: <http://dx.doi.org/10.1037/a0032696>.
Brumby, D. P., Howes, A., and Salvucci, D. D., 2007. A cognitive constraint model of dual-task trade-
offs in a highly dynamic driving task. In: Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems. New York, NY: ACM, pp. 233–42.
Brumby, D. P., Salvucci, D. D., and Howes, A., 2009. Focus on Driving: How Cognitive Con-
straints Shape the Adaptation of Strategy when Dialing while Driving. In: Proceedings of
the SIGCHI conference on Human factors in computing systems. New York, NY: ACM,
pp. 1629–38.
Byrne, M. D., and Anderson, J. R., 2001. Serial modules in parallel: The psychological refractory
period and perfect time-sharing. Psychological Review, 108, pp. 847–69.
Carbonell, J. R., 1966. A queuing model of many-instrument visual sampling. IEEE Transactions on
Human Factors in Electronics, 7(4), pp. 157–64.
Card, S. K., Moran, T., and Newell, A., 1986. The psychology of human-computer interaction. New York,
NY: Lawrence Erlbaum.
Clark, A., 2013. Whatever next? Predictive brains, situated agents, and the future of cognitive science.
Behavioral and Brain Sciences, 36(3), pp. 181–204.
Clark, A., 2015. Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Oxford: Oxford
University Press.
Dancy, C. L., Ritter, F. E., and Berry, K., 2012. Towards adding a physiological substrate to ACT-R. In:
BRiMS 2012: 21st Annual Conference on Behavior Representation in Modeling and Simulation
2012, pp. 75–82.
Dingus, T. A., Guo, F., Lee, S., Antin, J. F., Perez, M., Buchanan-King, M., and Hankey, J., 2016.
Driver crash risk factors and prevalence evaluation using naturalistic driving data. Proceed-
ings of the National Academy of Sciences, 201513271. Available at: <http://doi.org/10.1073/
pnas.1513271113>.
Duggan, G. B., Johnson, H., and Sørli, P., 2013. Interleaving tasks to improve performance: Users
maximise the marginal rate of return. International Journal of Human-Computer Studies, 71(5),
pp. 533–50.
Foley, J. P., Young, R., Angell, L., and Domeyer, J. E., 2013. Towards operationalizing driver distraction.
In: Proceedings of the Seventh International Driving Symposium on Human Factors in Driver Assess-
ment, Training, and Vehicle Design, pp. 57–63.
González, V. M., and Mark, G. J., 2004. ‘Constant, constant, multi-tasking craziness’: managing
multiple working spheres. In: CHI ‘04: Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems, pp. 113–20.
Gould, S. J. J., Brumby, D. P., and Cox, A. L., 2013. What does it mean for an interruption to
be relevant? An investigation of relevance as a memory effect. In: HFES 2013: Proceedings
Of The Human Factors And Ergonomics Society 57th Annual Meeting, pp. 149–53. Available at:
<http://dx.doi.org/10.1177/1541931213571034>.
Gunzelmann, G., Gross, J. B., Gluck, K. A., and Dinges, D. F., 2009. Sleep deprivation and sustained
attention performance: Integrating mathematical and cognitive modeling. Cognitive Science, 33,
pp. 880–910.
Halverson, T., and Hornof, A. J., 2007. A minimal model for predicting visual search in human-
computer interaction. In: CHI’07: Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems. New York, NY: ACM, pp. 431–4.
Horrey, W., and Wickens, C., 2007. In-vehicle glance duration: distributions, tails, and model of crash
risk. Transportation Research Record: Journal of the Transportation Research Board, 2018, pp. 22–8.
Horrey, W. J., Wickens, C. D., and Consalus, K. P., 2006. Modeling drivers’ visual attention allocation
while interacting with in-vehicle technologies. Journal of Experimental Psychology: Applied, 12(2),
pp. 67–78.
Howes, A., Lewis, R. L., and Vera, A., 2009. Rational adaptation under task and processing constraints:
Implications for testing theories of cognition and action. Psychological Review, 116(4), pp. 717–51.
Janssen, C. P., and Brumby, D. P., 2010. Strategic Adaptation to Performance Objectives in a Dual-
Task Setting. Cognitive Science, 34(8), pp. 1548–60. Available at: http://doi.org/10.1111/j.1551-
6709.2010.01124.x.
Janssen, C. P., and Brumby, D. P., 2015. Strategic Adaptation to Task Characteristics, Incentives, and
Individual Differences in Dual-Tasking. PLoS ONE, 10(7), e0130009.
Janssen, C. P., Brumby, D. P., Dowell, J., Chater, N., and Howes, A., 2011. Identifying Optimum
Performance Trade-Offs using a Cognitively Bounded Rational Analysis Model of Discretionary
Task Interleaving. Topics in Cognitive Science, 3(1), pp. 123–39.
Janssen, C. P., Brumby, D. P., and Garnett, R., 2012. Natural Break Points: The Influence of Priorities
and Cognitive and Motor Cues on Dual-Task Interleaving. Journal of Cognitive Engineering and
Decision Making, 6(1), pp. 5–29.
Janssen, C. P., Gould, S. J., Li, S. Y. W., Brumby, D. P., and Cox, A. L., 2015. Integrating Knowledge of
Multitasking and Interruptions across Different Perspectives and Research Methods. International
Journal of Human-Computer Studies, 79, pp. 1–5.
Jin, J., and Dabbish, L., 2009. Self-interruption on the computer: a typology of discretionary task
interleaving. In: CHI 2009: Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems. New York, NY: ACM, pp. 1799–808.
Johnson, L., Sullivan, B., Hayhoe, M., and Ballard, D., 2014. Predicting human visuomotor behaviour
in a driving task. Philosophical Transactions of the Royal Society of London B: Biological Sciences,
369(1636), 20130044.
Kieras, D. E., and Meyer, D. E., 2000. The role of cognitive task analysis in the application of predictive
models of human performance. In: J. Schraagen, S. Chipman, and V. Shalin, eds. Cognitive task
analysis. Mahwah, NJ: Lawrence Erlbaum, pp. 237–60.
Klein, R., 2000. Inhibition of return. Trends in Cognitive Sciences, 4, pp. 138–46.
Kujala, T., Grahn, H., Mäkelä, J., and Lasch, A., 2016. On the visual distraction effects of audio-
visual route guidance. In: AutomotiveUI ’16: Proceedings of the 8th International Conference
on Automotive User Interfaces and Interactive Vehicular Applications. New York, NY: ACM,
pp. 169–76.
Kujala, T., Mäkelä, J., Kotilainen, I., and Tokkonen, T., 2016. The attentional demand of automobile
driving revisited—Occlusion distance as a function of task-relevant event density in realistic
driving scenarios. Human Factors: The Journal of the Human Factors and Ergonomics Society, 58,
pp. 163–80.
Kujala, T., and Saariluoma, P., 2011. Effects of menu structure and touch screen scrolling style
on the variability of glance durations during in-vehicle visual search tasks. Ergonomics, 54(8),
pp. 716–32.
Kujala, T., and Salvucci, D. D., 2015. Modeling visual search on in-car displays: The challenge of
modeling safety-critical lapses of control. International Journal of Human-Computer Studies, 79,
pp. 66–78.
Kun, A. L., Boll, S., and Schmidt, A., 2016. Shifting gears: User interfaces in the age of autonomous
driving. IEEE Pervasive Computing, 15(1), pp. 32–8.
Laird, J. E., 2012. The Soar cognitive architecture. Cambridge, MA: MIT Press.
Laird, J. E., Newell, A., and Rosenbloom, P. S., 1987. Soar: An architecture for general intelligence.
Artificial Intelligence, 33, pp. 1–64.
Lee, J. D., Roberts, S. C., Hoffman, J. D., and Angell, L. S., 2012. Scrolling and driving: How an
MP3 player and its aftermarket controller affect driving performance and visual behavior. Human
Factors, 54(2), pp. 250–63.
Lee, J. Y., Gibson, M., and Lee, J. D., 2015. Secondary task boundaries influence drivers’ glance
durations. In: Automotive UI ’15: Proceedings of the 7th International Conference on Automotive
User Interfaces and Interactive Vehicular Applications. New York, NY: ACM, pp. 273–80.
Lewis, R. L., Howes, A., and Singh, S., 2014. Computational Rationality: Linking Mecha-
nism and Behavior Through Bounded Utility Maximization. Topics in Cognitive Science, 6(2),
pp. 279–311.
Liang, Y., Lee, J.D., and Yekhshatyan, L., 2012. How dangerous is looking away from the road?
Algorithms predict crash risk from glance patterns in naturalistic driving. Human Factors, 54,
pp. 1104–16.
Liu, Y., Feyen, R., and Tsimhoni, O., 2006. Queueing Network-Model Human Processor (QN-
MHP): A computational architecture for multitask performance in human-machine systems. ACM
Transactions on Computer-Human Interaction, 13, pp. 37–70.
Mark, G., Gonzalez, V. M., and Harris, J., 2005. No task left behind?: examining the nature of
fragmented work. In: CHI ’05: Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems. New York, NY: ACM, pp. 321–30.
Meyer, D. E., and Kieras, D. E., 1997. A computational theory of executive cognitive processes and
multiple-task performance: part 1. basic mechanisms. Psychological Review, 104(1), pp. 3–65.
Miyata, Y., and Norman, D. A., 1986. Psychological issues in support of multiple activities. In:
D. A. Norman and S. Draper, eds. User centered system design: New perspectives on human-computer
interaction. Hillsdale, NJ: Lawrence Erlbaum, pp. 265–84.
Monk, C. A., Trafton, J. G., and Boehm-Davis, D. A., 2008. The effect of interruption duration and
demand on resuming suspended goals. Journal of Experimental Psychology: Applied, 14, pp. 299–
313. [online] doi:10.1037/a0014402
Monsell, S., 2003. Task switching. Trends in Cognitive Sciences, 7(3), pp. 134–40.
Moray, N., 1986. Monitoring behavior and supervisory control. In K. R. Boff, L. Kaufman, and
J. P. Thomas, eds. Handbook of Perception and Performance, Volume II: Cognitive Processes and
Performance. New York, NY: Wiley Interscience, pp. 40–51.
Norman, D. A., and Bobrow, D. G., 1975. On Data-limited and Resource-limited Processes. Cognitive
Psychology, 7, pp. 44–64.
Navon, D., and Gopher, D., 1979. On the Economy of the Human-Processing System. Psychological
Review, 86(3), pp. 214–55.
Newell, A., 1990. Unified Theories of Cognition. Cambridge, MA: Harvard University Press.
NHTSA (National Highway Traffic Safety Administration), 2013. Visual-Manual NHTSA Driver
Distraction Guidelines for In-Vehicle Electronic Devices. NHTSA-2010-0053.
Payne, S. J., Duggan, G. B., and Neth, H., 2007. Discretionary task interleaving: heuristics for
time allocation in cognitive foraging. Journal of Experimental Psychology: General, 136(3),
pp. 370–88.
Payne, S. J., and Howes, A., 2013. Adaptive Interaction: A utility maximisation approach to understanding
human interaction with technology. Williston, VT: Morgan & Claypool.
Posner, M. I., and Cohen, Y., 1984. Components of visual orienting. In: H. Bouma and D. Bouwhuis,
eds. Attention and Performance. Volume 10. Hillsdale, NJ: Lawrence Erlbaum, pp. 531–56.
Regan, M. A., Hallet, C., and Gordon, C. P., 2011. Driver distraction and driver inattention: Definition,
relationship and taxonomy. Accident Analysis & Prevention, 43, pp. 1771–81.
Salvucci, D. D., 2001a. Predicting the effects of in-car interface use on driver performance:
An integrated model approach. International Journal of Human Computer Studies, 55, pp. 85–107.
Salvucci, D. D., 2001b. An integrated model of eye movements and visual encoding. Cognitive Systems
Research, 1(4), pp. 201–20.
Salvucci, D. D., 2005. A Multitasking General Executive for Compound Continuous Tasks. Cognitive
Science, 29(3), pp. 457–92.
Salvucci, D. D., 2006. Modeling driver behavior in a cognitive architecture. Human Factors, 48,
pp. 362–80.
Salvucci, D. D., 2009. Rapid prototyping and evaluation of in-vehicle interfaces. ACM Transactions on
Computer-Human Interaction [online], 16. doi:10.1145/1534903.1534906
Salvucci, D. D., and Kujala, T., 2016. Balancing structural and temporal constraints in multitasking
contexts. In: CogSci 2016: Proceedings of the 38th Annual Meeting of the Cognitive Science Society.
Philadelphia, PA, 10–13 August 2016. Red Hook, NY: Curran Associates, pp. 2465–70.
Salvucci, D. D., and Macuga, K. L., 2002. Predicting the effects of cellular-phone dialing on driver
performance. Cognitive Systems Research, 3, pp. 95–102.
Salvucci, D. D., and Taatgen, N. A., 2008. Threaded cognition: An integrated theory of concurrent
multitasking. Psychological Review, 115(1), pp. 101–30.
Salvucci, D. D., and Taatgen, N. A., 2011. The multitasking mind. New York: Oxford University Press.
Schmidt, A., Dey, A. K., Kun, A. L., and Spiessl, W., 2010. Automotive user interfaces: human
computer interaction in the car. In: CHI’10: Extended Abstracts on Human Factors in Computing
Systems. New York, NY: ACM, pp. 3177–80.
Schumacher, E. H., Seymour, T. L., Glass, J. M., Fencsik, D. E., Lauber, E. J., Kieras, D. E., and Meyer, D.
E., 2001. Virtually perfect time sharing in dual-task performance: Uncorking the central cognitive
bottleneck. Psychological Science, 12, pp. 101–8.
Senders, J., 1964. The human operator as a monitor and controller of multidegree of freedom systems.
IEEE Transactions on Human Factors in Electronics, 5, pp. 2–6.
Senders, J. W., Kristofferson, A. B., Levison, W. H., Dietrich, C. W., and Ward, J. L., 1967.
The attentional demand of automobile driving. Highway Research Record, 195, pp. 15–33.
Sprague N., and Ballard D., 2003. Eye movements for reward maximization. In: S. Thrun, L. K. Saul,
and B. Schölkopf, eds. Advances in Neural Information Processing Systems. Volume 16. Cambridge,
MA: MIT Press, pp. 1467–82.
Steelman, K. S., McCarley, J. S., and Wickens, C. D., 2011. Modeling the control of attention in visual
workspaces. Human Factors, 53(2), pp. 142–53.
Steelman, K. S., McCarley, J. S., and Wickens, C. D., 2016. Theory-based models of attention in visual
workspaces. International Journal of Human–Computer Interaction, 33(1), pp. 35–43.
Sullivan, B. T., Johnson, L., Rothkopf, C. A., Ballard, D., and Hayhoe, M., 2012. The role of uncer-
tainty and reward on eye movements in a virtual driving task. Journal of Vision, 12(13). [online]
doi:10.1167/12.13.19
Taatgen, N. A., Van Rijn, H., and Anderson, J., 2007. An integrated theory of prospective time interval
estimation: the role of cognition, attention, and learning. Psychological Review, 114(3), pp. 577–98.
Telford, C. W., 1931. The refractory phase of voluntary and associative response. Journal of Experimen-
tal Psychology, 14, pp. 1–35.
Trafton, J. G., and Monk, C. M., 2008. Task interruptions. In D. A. Boehm-Davis, ed. Reviews of human
factors and ergonomics. Volume 3. Santa Monica, CA: Human Factors and Ergonomics Society,
pp. 111–26.
Tsimhoni, O., and Green, P., 2001. Visual demands of driving and the execution of display-intensive in-
vehicle tasks. Proceedings of the Human Factors and Ergonomics Society 45th Annual Meeting, 45(14),
pp. 1586–90.
Van der Heiden, R. M. A., Iqbal, S. T., and Janssen, C. P., 2017. Priming drivers before handover in
semi-autonomous cars. In: CHI '17: Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems. Denver, Colorado, 6–11 May 2017. New York, NY: ACM, pp. 392–404.
van Vugt, M., Taatgen, N., Sackur, J., and Bastian, M., 2015. Modeling mind-wandering: a tool to better
understand distraction. In: N. Taatgen, M. van Vugt, J. Borst, and K. Mehlhorn, eds. Proceedings of
the 13th International Conference on Cognitive Modeling. Groningen: University of Groningen,
pp. 252–7.
Wickens, C. D., Goh, J., Helleberg, J., Horrey, W. J., and Talleur, D. A., 2003. Attentional mod-
els of multitask pilot performance using advanced display technology. Human Factors, 45(3),
pp. 360–80.
Wierwille, W. W., 1993. An initial model of visual sampling of in-car displays and controls. In: A. G.
Gale, I. D. Brown, C. M. Haslegrave, H. W. Kruysse, and S. P. Taylor, eds. Vision in Vehicles IV.
Amsterdam: Elsevier Science, pp. 271–9.
World Health Organization, 2011. Mobile Phone Use: A Growing Problem of Driver Distrac-
tion. Geneva, Switzerland: World Health Organization. Available at: http://www.who.int/
violence_injury_prevention/publications/road_traffic/en/index.html.
Wortelen, B., Baumann, M., and Lüdtke, A., 2013. Dynamic simulation and prediction of drivers’
attention distribution. Transportation Research Part F: Traffic Psychology and Behaviour, 21,
pp. 278–94.
13
• • • • • • •
13.1 Introduction
While Human-Information Interaction (HII) is dynamic and expresses itself in many different
forms, information retrieval, or search, is almost always involved as an important
component of HII. On the other hand, information retrieval is often not the final goal of
an activity. Rather, information retrievals are often initiated and/or embedded as parts
of a series of goal-directed actions that allow us to accomplish various tasks. As a result,
optimizing information retrievals regardless of the cognitive context often does not help
people accomplish their tasks, and is thus unlikely to lead to a truly intelligent information
system. A user-centred research theme may therefore benefit from conceptualizing information
retrievals as parts of the cognitive computations that allow users to accomplish a wide range
of information tasks. By treating the user and the information system as a coupled cognitive
system, we can also gain valuable insight into design guidelines for intelligent information
systems that are more compatible with human cognitive computations.
Another benefit of adopting a cognitive science perspective is that it allows researchers
to understand the nature of acts of information retrieval in the broader context of
how these cognitive computations are represented, selected, and executed to allow users
to intelligently accomplish their tasks. The general idea is that, by understanding the nature
of these cognitive computations, researchers can more precisely characterize and predict
how users will behave in certain information environments. With better understanding,
researchers will have higher confidence in designing better information environments
(or systems) that fit the cognitive computations of users, allowing them to accomplish
their (information-rich) tasks more effectively.
Computational Interaction. Antti Oulasvirta, Per Ola Kristensson, Xiaojun Bi, Andrew Howes (Eds).
© Oxford University Press 2018. Published 2018 by Oxford University Press.
Cognitive science tends to focus on mechanisms and computations that explain intelli-
gent behavior. By conceptualizing information retrievals as parts of the broader set of goal-
directed cognitive computations, one can develop a more complete theoretical framework
that encompasses cognitive computations that involve external (in the environment) and
internal (inside the head of the user) information accesses (Newell, 1990) as they emerge
during each goal-directed cognitive cycle. This framework may also be extended to account
for how multiple users engage and (indirectly) interact with each other through the infor-
mation environments (Fu, 2008, 2012).
One common concern about studies of human behaviour and HII is that behaviour is influenced
by many factors to varying extents in different situations, and thus it is difficult, if
not impossible, to derive predictive principles. In HII, users may differ in cognitive
abilities, knowledge, experience, or motivation, all of which influence how they behave as they
interact with sources of information. This is a real challenge, but one that cognitive
science has learned to embrace. One useful approach is to identify invariants at the right
level of abstraction, such that approximate but useful predictions can be made (Simon, 1990).
In fact, this approach is commonly adopted in many sciences to various degrees, as most
invariant structures are approximate. For example, in chemistry, atomic weights of elements
are approximately integral multiples of that of hydrogen; and while the law of photosynthesis
varies in its details from one species to another, at some abstract level it is an invariant
that is useful for understanding the behaviour of plants in different environments. In human
cognition, examples of such invariants abound at different granularities: the number of
items one can remember is approximately seven (Anderson, 2002), and the movement time
to a target is proportional to the log of the ratio of the distance to the target and the
size of the target (Fitts and Peterson, 1964). While these invariants are approximate, they are
useful for various reasons; in the context of HII research, they can at least help researchers
characterize and predict how users will behave in information environments.
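The movement-time invariant just mentioned, commonly written as MT = a + b log2(D/W + 1), can be made concrete in a few lines. The sketch below is illustrative only: the intercept and slope values are placeholders, not empirically fitted constants.

```python
import math

def fitts_movement_time(distance, width, a=0.1, b=0.15):
    """Predict movement time (s) to a target from a Fitts-style law.

    MT = a + b * log2(D/W + 1), where the log term is the 'index of
    difficulty' in bits. The intercept a and slope b are device- and
    user-specific; the defaults here are illustrative placeholders.
    """
    index_of_difficulty = math.log2(distance / width + 1)
    return a + b * index_of_difficulty

# Doubling the distance (or halving the target size) raises the index
# of difficulty, and hence predicted time, only logarithmically.
near = fitts_movement_time(distance=100, width=50)  # ID ~ 1.58 bits
far = fitts_movement_time(distance=800, width=50)   # ID ~ 4.09 bits
```

This is the sense in which the invariant is approximate but useful: the functional form transfers across devices and users even though the two constants must be re-estimated for each setting.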
The rest of the chapter is organized as follows. The next section discusses a few foundational
concepts in cognitive science and how they can be applied to understand the nature
of HII. After that, it covers the idea of levels of abstraction in cognitive science, and how
these levels can be applied to understand the spectrum of HII. Finally, it discusses how these
ideas, derived from a cognitive science perspective, can inform the design of information systems.
13.2 Foundation
We will first provide more concrete definitions of what we mean by cognitive computations,
and how they are important for understanding the nature of intelligence in information
systems. In fact, cognitive science emerged together with artificial intelligence in the
1960s, as researchers called for a science of intelligent systems. Among existing theories of
intelligent systems, the physical symbol system proposed by Newell and Simon (1972) is often
considered the ‘classical view’ on how knowledge can be represented as symbol structures and
how the content of knowledge is related to the states of the symbol structures by semantic
relations,1 such that intelligent actions can be produced by an artificial system (Newell,
1990). These ideas provide useful perspectives on how one can characterize and explain acts
of information retrieval as parts of a cognitive task. We discuss some of the foundational
concepts most relevant to HII in the next section.
1 While there has been opposition to the view that humans have symbol structures in their head, many (e.g.,
[43]) argue that there is no alternative theory capable of explaining how (human) intelligent behavior follows
semantic principles while at the same time being governed by physical laws.
[Figure: cognitive processing of information within probabilistically textured information environments]
Given that both internal and external cognitive computations can be characterized under
the framework of a symbol system described earlier, an important advantage of studying HII
as coupled cognitive computations is the potential to provide explanations at both the
algorithm and semantic levels, as well as of how the two levels are connected. This is useful
because one can then provide more direct user-centred guidance for designing more intelligent
information systems, as well as predict changes in user behaviour in response to changes
in information systems or tasks. For example, a better understanding of the changes in users'
cognitive representations of information as they engage in online learning can inform how
information systems can adapt to these cognitive representations, such that learning can be
more effective.
2 Theoretically it should be called a symbol token as the symbol (a word) is situated in a particular context (the
sentence). However, for the sake of simplicity we just call it a symbol here.
In this context, the goal of an intelligent information system should be to facilitate the
cognitive search process in the coupled internal-external system; i.e., it should generate local
structures that help the user access the distal structures most relevant to his or her goals.
There are a few properties that are worth highlighting, given their relevance to HII:
1. The context of each symbol structure plays a pivotal role in determining the search
process. This is because the context determines what local cues should be processed
and how, and guides the subsequent processing of distal structures. This context is
defined by the task, the environment, and the knowledge of the system (and user)
(Simon, 1956).
2. Knowledge is always required to process the local cues to determine what and where
to look for relevant symbol structures. This knowledge may vary across individuals
and may change over time, and is represented through the relations of symbol
structures.
3. Because of uncertainties, the search process involves detecting relevant patterns
in distal structures, which allows the system and user to respond by incrementally
adjusting the search process (Fu and Pirolli, 2013) as new symbol structures are processed.
improving our understanding of which properties can make HII more intelligent in general,
and why. This perspective may, for example, allow better synergies between cognitive science
and HII research.
This classical view of cognitive science may also provide some new perspectives on future
HII research. For example, by conceptualizing external information retrievals as parts of
the cognitive computations, one can ask how to make the system more compatible with the
cognitive processes and representations of the user. A general observation is that research
on information retrieval (IR) systems often focuses on description or explanation at the
algorithm level, whereas research on the user experience of these IR systems often focuses on
the knowledge level. A natural question, then, is how we can design IR systems that operate
at the knowledge level. In other words, how should the system be designed such that it does
not merely optimize performance at the algorithm level, but facilitates the cognitive
computations involved in the coupled internal-external system? Similarly, one can ask how
we can derive descriptions of user behavior at the algorithm level to inform the design of
user-centered IR systems. Research in these areas may lead to better theories and systems
that make HII more intelligent.
To illustrate this point, consider how Web search engines operate. Normally, the user has
to generate query keywords, and the search engine tries to match those keywords to a large
set of documents according to algorithms that calculate relevance. When we compare this
process to, say, how two people communicate and exchange information, one can easily see
how search engines often demand that the user convert knowledge-level representations into
algorithm-level ones; i.e., the system forces the user to adapt to its level of description.
While this by itself may not be too harmful, as people do adapt to technologies over time,
it does limit the potential of knowledge-level IR systems, which could provide better support
for higher-level cognitive activities of the user, such as reasoning, hypothesis generation
and testing, decision making, and sense making.
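The algorithm-level matching described above can be illustrated with a toy relevance computation. The sketch below uses a simple TF-IDF score over hypothetical documents; production search engines are far more elaborate, but the point stands: the user's knowledge-level information need must first be flattened into keywords.

```python
import math
from collections import Counter

def tf_idf_scores(query, documents):
    """Rank documents against a keyword query by a simple TF-IDF score.

    This is the algorithm-level view: relevance is a function of term
    counts, not of the user's knowledge-level representation.
    """
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency of each term across the collection.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query.lower().split()
            if term in tf
        )
        scores.append(score)
    return scores

# Hypothetical mini-collection echoing the study topics later in the chapter.
docs = [
    "aging research on cell biology",
    "kosovo independence history",
    "anti aging skin care advice",
]
print(tf_idf_scores("anti aging", docs))
```

A person interpreting the same request would reason about the asker's goal; the algorithm only counts matching tokens, which is precisely the representational gap the text describes.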
While extracting and representing (semantic) knowledge from large corpora of (online)
documents has been an active area of research in IR, an equally important focus is perhaps
gaining a better understanding of how humans represent and express ideas (or information
needs) at their (natural) knowledge level in the context of the overall cognitive search process,
such that the behavior of the information system can be described and expressed in ways
that adapt to the knowledge-level cognitive activities of the individual user. This is not
an easy task; it is a well-known challenge in artificial intelligence (AI). However, even
with only incremental progress, a better synergy between research on systems and research
on users can potentially lead to important breakthroughs in developing more intelligent
information systems.
In terms of deriving more descriptions of user behavior at the algorithm level, one active
area of research that can be useful is the application of computational cognitive models to
describe and predict user behavior in HII (Fu and Pirolli, 2007; Karanam, van Oostendorp,
and Indurkhya, 2011; Kitajima, Blackmon, and Polson, 2000; Sutcliffe and Ennis, 1998;
van Oostendorp and Goldman, 1999; van Oostendorp and Juvina, 2007). Because of their
explicit representations and processes, these computational models of users can be used
as cognitive agents that help to predict behavior, and potentially help IR systems better
adapt to individual users.
When we look at the levels of activity in HII, one can think of two approaches to informing
better designs, corresponding to Marr's Type 1 and Type 2 theories. One can take the 'high
road' and strive to discover principles that characterize the major variables at a certain
level while abstracting away details at lower levels, and apply these principles to inform
designs (e.g., applying foraging theory (Stephens and Krebs, 1986) to information search
(Pirolli and Card, 1999)). One can also take the 'low road', discovering a range of specific
principles that characterize how people behave in different contexts, and providing customized
designs to support desirable behavior in those contexts.
Although Marr argued strongly for the 'high road', one important question is whether
Type 1 theories exist for all (or any) of these levels of abstraction in HII. In fact,
existing research will likely continue to differ in its style of approach to HII. Some
researchers are more concerned with searching for general principles of behaviour or IR
algorithms, while others are more concerned with predicting behaviour at a certain level
of abstraction in specific contexts, or with improving IR algorithms for certain types of
information environments. It is important to note, however, that even general principles may
apply mostly at the levels of abstraction at which their main variables and relations are
defined. For example, although we are approaching general principles that optimize simple IR,
they may not help HII at higher levels; or at least, the benefits across levels do not come
automatically. An example from Marr may help to illustrate this point: if one wants to know
how birds can fly, knowing the details of how their feathers are optimally structured is less
useful than principles such as the laws of aerodynamics.
To a large extent, research in HII will perhaps generate results that are at least
generalizable enough to be useful for design and understanding. On the other hand,
given the complex interactions among factors in different contexts at multiple levels,
generating a universal Type 1 theory for HII may be difficult, if not impossible. As Marr
(1977) put it, the 'only possible virtue [of a Type 2 theory] might be that it works'.
[Figure: spreading activation among semantic nodes (Sj, Si, ...) links Web documents
Di = {Tj, Tk, Sn, Sm} and Dj = {Tx, Ty, Sk, Sm} to exploratory search (sensemaking)
and the enrichment of concepts and mental categories.]
Figure 13.2 The rational model in Fu (2008). D = document, T = tags, S = semantic nodes,
L = link/bookmark, R = mental categories. The model keeps track of the growth of mental
categories as new Web documents are found and concepts are extracted from them. The new
mental categories lead to the use of new query terms and social tags in the next search cycle,
which in turn leads to new concepts being learned. The model demonstrates the potential of
using computational cognitive models to make information systems more intelligent by
facilitating cognitive computations in the coupled internal-external system.
The model was used to characterize search and learning behaviour over a period of eight
weeks, during which eight users were asked to look for information related to two topics:
(1) the independence of Kosovo, and (2) anti-aging. Users were unfamiliar with both domains,
and they were asked to search online to learn about the topics using a social tagging system.
The system allowed them to bookmark any Web pages that they found and to add social tags
to these pages. Users were told that these social tags would be shared among the users, and
that the tags would help them organize the Web pages they found. At the end of the eight-week
period, users were asked to organize their bookmarked pages into categories and to explain
what they had learned about the topics based on the categorized pages.
Readers can find details of the model, as well as its validation against experimental data,
in Fu (2008). To summarize the results, the model showed good fits to human behaviour. It
predicted how users chose social tags and query terms to search, and what mental categories
were formed as new information was found. The value of the model lies not only in its
predictions of human behaviour, but also in its potential to be integrated into an information
system to enrich knowledge-level representations of information and learning processes so
that they are more cognitively compatible with human users. For example, one can imagine
information systems that represent and organize documents by leveraging human categorization
models, such that better local cues (e.g., the search results returned from a query) can
be presented to help users acquire new information in ways that optimize the learning of new
concepts. In other words, more cognitively compatible semantic representations and processes
in information systems will likely improve learning, as they will more directly facilitate
cognitive computations in the coupled cognitive system.
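The spreading-activation step at the heart of Figure 13.2 can be sketched in a few lines. The network, weights, and decay value below are invented for illustration and are not those of Fu (2008); the sketch only shows the general mechanism of activation flowing from active concepts to related ones.

```python
def spread_activation(network, sources, decay=0.5, iterations=2):
    """Spread activation from source concepts through a weighted
    semantic network, in the style of spreading-activation models.

    network: {node: {neighbour: weight}}; sources: initially active
    nodes. Returns each node's accumulated activation.
    """
    activation = {node: 0.0 for node in network}
    for s in sources:
        activation[s] = 1.0
    for _ in range(iterations):
        incoming = {node: 0.0 for node in network}
        # Each node passes a decayed share of its activation along
        # its weighted links to neighbouring nodes.
        for node, links in network.items():
            for neighbour, weight in links.items():
                incoming[neighbour] += activation[node] * weight * decay
        for node in network:
            activation[node] += incoming[node]
    return activation

# Hypothetical fragment of a semantic network around one study topic.
network = {
    "anti-aging": {"skin": 0.8, "cell": 0.6},
    "skin": {"cream": 0.9},
    "cell": {"telomere": 0.7},
    "cream": {},
    "telomere": {},
}
activation = spread_activation(network, sources=["anti-aging"])
```

Concepts closer to the active source accumulate more activation than distal ones, which is how such a model can rank candidate query terms or tags for the next search cycle.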
Liao, Q. V., and Fu, W.-T., 2014a. Can You Hear Me Now? Mitigating the Echo Chamber Effect by Source
Position Indicators. In: Proceedings of the ACM Conference on Computer Supported Cooperative Work
& Social Computing. New York, NY: ACM, pp. 184–96.
Liao, Q. V., and Fu, W.-T., 2014b. Expert Voices in Echo Chambers: Effects of Source Expertise
Indicators on Exposure to Diverse Opinions. In: Proceedings of the ACM CHI Conference on
Human Factors in Computing Systems. New York, NY: ACM, pp. 2745–54.
Liao, Q. V., Fu, W.-T., and Mamidi, S., 2015. It Is All About Perspective: An Exploration of Mitigating
Selective Exposure with Aspect Indicators. In: CHI '15: Proceedings of the ACM CHI Conference on
Human Factors in Computing Systems. New York, NY: ACM, pp. 1439–48.
Marchionini, G., 2006. Exploratory search: From finding to understanding. Communications of the
ACM, 49, pp. 41–6.
Marr, D., 1977. Artificial Intelligence—A Personal View. Artificial Intelligence, 9, pp. 37–48.
Newell, A., 1990. Unified Theories of Cognition. Cambridge, MA: Harvard University Press.
Newell, A., and Simon, H. A., 1972. Human problem solving. Englewood Cliffs, NJ: Prentice-Hall.
Pirolli, P., and Card, S., 1999. Information foraging. Psychological Review, 106(4), pp. 643–75.
Pylyshyn, Z., 1989. Computing in Cognitive Science. In M. I. Posner, ed. Foundations of Cognitive
Science. Cambridge, MA: MIT Press, pp. 49–92.
Simon, H. A., 1956. Rational choice and the structure of the environment. Psychological Review, 63(2),
pp. 129–38. [online] doi: 10.1037/h0042769
Simon, H., 1990. Invariants of Human Behavior. Annual Review of Psychology, 41, pp. 1–20.
Stephens, D. W., and Krebs, J. R., 1986. Foraging theory. Princeton, NJ: Princeton University Press.
Sutcliffe, A., and Ennis, M., 1998. Towards a Cognitive Theory of Information Retrieval. Interacting
with Computers, 10, pp. 321–51.
van Oostendorp, H., and Juvina, I., 2007. Using a cognitive model to generate web navigation support.
International Journal of Human-Computer Studies, 65(10), pp. 887–97.
van Oostendorp, H., Karanam, S., and Indurkhya, B., 2012. CoLiDeS+ Pic: a cognitive model of
web-navigation based on semantic information from pictures. Behaviour & Information Technology,
31(1), pp. 17–32.
14
• • • • • • •
Computational Model of
Human Routine Behaviours
nikola banovic,
jennifer mankoff,
anind k. dey
14.1 Introduction
User interfaces (UIs) that learn about people’s behaviours by observing them and interacting
with them are exemplars of Computational Interaction (CI). Such human-data supported
interfaces leverage computational models of human behaviour that can automatically reason
about and describe common user behaviours, infer their goals, predict future user actions,
and even coach users to improve their behaviours. Computational models of behaviours
already power technology that is changing many aspects of our lives (Stone et al., 2016).
However, concerns remain about black-box technologies that use large human-behaviour
data traces, but whose inner workings cannot be examined to determine that they are not
negatively impacting the people for whom they make decisions (Pasquale, 2015; Stone et al.,
2016). Understanding computational models of behaviours remains challenging because the
real-world behaviour data stored in behaviour logs and used to train these models is too
heterogeneous to explore and too large to label.
Thus, ensuring that CI has a positive impact on people requires technology that can
sense and model the underlying processes behind people's behaviours and expose the sources
of its conclusions, so that these models can be used to describe, reason about, and act in
response to those behaviours. Without such technology, interface designers are forced to
hardcode limited knowledge, beliefs, and assumptions about people's behaviours into their
interfaces. Such user interfaces are ill suited to enable a future in which human-data
supported interfaces aid people in real-world scenarios, where people choose to perform many
different activities to solve ill-defined problems in complex environments. Such interfaces
cannot act autonomously to learn about people's behaviours and can only respond to a small
subset of predefined commands in well-defined environments to accomplish specialized tasks.
understand and describe behaviours that are characteristic of aggressive drivers). Policy
makers will use this knowledge and these models to reason about behaviours and to inform
policies and guidelines that will ensure future interfaces have a meaningful and positive
impact on people's wellbeing (e.g., to define responsibilities of virtual driving instructor
interfaces).
Interface designers will use such behaviour models to power their future interfaces that
will act in response to people’s behaviours (e.g., design interfaces that coach drivers to
improve their driving and be safer in traffic). Enabling end users to understand their own
behaviours could help them not only evaluate and change their behaviours to improve the
quality of their lives, but also understand the decisions that future interfaces will make
on their behalf (e.g., to understand what makes them aggressive drivers and dangerous
in traffic).
Here, we first scope our discussion to goal-directed human routine behaviours. We
then explore the implications that different data collection and analysis methods have for
understanding data that captures such behaviours. We guide our discussion of the different
methods by how they support the process of making sense of data (Pirolli and Card, 2005).
We contrast explanatory modelling (driven by statistical methods) with predictive
modelling (driven by data mining) (Friedman, 1997) and call out the challenges
of both approaches.
routine behaviours that span long periods of time and different contexts. Field studies collect
behaviour data from natural settings and over time. However, they are resource intensive
and do not scale up. Furthermore, they often produce knowledge about behaviours that is
difficult to operationalize.
We focus on log studies that collect traces of human behaviours in large behaviour
logs. Such behaviour logs complement data from traditional lab and field observational
studies by offering behaviour data collected in natural settings, uninfluenced by observers,
over long periods of time, and at scale (Dumais et al., 2014). Behaviour log data can be
collected using various server- (Google, 2017) and client-side (Capra, 2011) software
logging tools, and from sensors in the environment (Koehler, et al., 2014) and on people’s
personal or wearable devices (Ferreira, Kostakos, and Dey, 2015). Examples of behaviour
logs include web use logs (Adar, Teevan, and Dumais, 2008), social network logs (Starbird
and Palen, 2010), outdoor mobility logs (Davidoff et al., 2011), mobile device usage logs
(Banovic, et al., 2014), and even vehicle operation logs (Hong, Margines, and Dey, 2014).
Stakeholders can then use the trace data stored in behaviour logs to understand complex
routine behaviours.
by visual inspection. However, this can be painstakingly slow, making manual exploration
challenging.
Both EDA and custom visualizations focus on isolated events, or temporal evolution of
a particular state (e.g., sleep vs. awake) or variable (e.g., the number of steps walked per
day). However, such partitions describe people’s general dispositions using the principle
of aggregation, which often ‘does not explain behavioural variability across situations, nor
does it permit prediction of a specific behaviour in a given situation’ (Ajzen, 1991). Routines
are characterized by both specific actions people perform in different situations (Hodgson,
1997) and variability of those actions across situations (Feldman and Pentland, 2003).
This makes it difficult to manually find nuanced relationships in the data. Also, due to the
complexity and size of behaviour logs, EDA methods do not guarantee that the stakeholder
will be able to find patterns of behaviours (e.g., those that form routines) and not some other
patterns in data that are unrelated to behaviours.
Conceptualizing routines using explanatory models can help ensure that the patterns we find
using EDA match the underlying processes that generate behaviour data, through causal
explanation and description (Shmueli, 2010). Past research shows that such models can
explain people's low-level behaviours, e.g., pointing (Banovic, Grossman, and Fitzmaurice,
2013) and text entry (Banovic et al., 2017). Such models perform well on behaviours
that can be described with a few variables, e.g., some pointing models with as few as two
variables: the distance to and the size of the target (MacKenzie, 1992). The main strength of
such models is their support for the sensemaking loop, where they provide a statistical
framework for hypothesis generation and testing.
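The explanatory style described here can be illustrated by fitting such a two-variable pointing model to data with ordinary least squares. The observations below are synthetic, generated from known parameters purely to show the fit-and-inspect workflow; they are not real pointing data.

```python
import math

def fit_fitts_law(distances, widths, times):
    """Fit MT = a + b * log2(D/W + 1) by ordinary least squares.

    Returns (a, b). With only two free parameters, the fitted model
    is easy to inspect and test against hypotheses, which is the
    strength of explanatory models discussed in the text.
    """
    ids = [math.log2(d / w + 1) for d, w in zip(distances, widths)]
    n = len(ids)
    mean_x = sum(ids) / n
    mean_y = sum(times) / n
    # Standard OLS slope and intercept for a one-predictor model.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(ids, times)) \
        / sum((x - mean_x) ** 2 for x in ids)
    a = mean_y - b * mean_x
    return a, b

# Synthetic observations generated from a = 0.2, b = 0.1 (no noise),
# so the fit should recover those parameters.
distances = [100, 200, 400, 800]
widths = [50, 50, 50, 50]
times = [0.2 + 0.1 * math.log2(d / w + 1) for d, w in zip(distances, widths)]
a, b = fit_fitts_law(distances, widths, times)
```

The contrast with the heterogeneous behaviour-log setting discussed next is exactly that this model has two inspectable parameters, while routine behaviour involves many interacting variables.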
However, when applied to complex, heterogeneous, multivariate behaviour data, e.g., to
explain how people adopt information technology (Venkatesh et al., 2003), such explanatory
models are often unable to explain most of the observed data. Such models have less
predictive power than predictive models (Kleinberg et al., 2015), even in some cases where
they closely approximate the true process that generated the data (Shmueli, 2010). Thus,
they may not be a sufficient basis for acting in response to people's behaviours.
disregard variations in human behaviour to focus on classifying and predicting only the
most frequent human activity. Some variations may happen infrequently in data and are
difficult to detect using those existing algorithms. Some specific infrequent variations may
be detectable, e.g., detecting when parents are going to be late to pick up their children
(Davidoff, Ziebart, Zimmerman, and Dey, 2011). However, this requires a case-by-case
approach to address each variation, which can be difficult to apply if all possible variations
are not known a priori.
Most such models require labelled examples indicating which behaviour instances are
characteristic of a routine and which are not. However, the lack of individually labelled
behaviour instances in large behaviour logs makes it challenging to use those existing
supervised machine learning algorithms to classify behaviour instances into routines. For
example, to train a supervised machine learning algorithm to classify behaviour instances
that lead parents to forget to pick up their children, Davidoff, Zimmerman, and Dey (2010)
had to manually label each behaviour instance in the behaviour log they collected, and
confirm this information with the participants in their next study (Davidoff et al., 2011).
This places a significant burden on stakeholders to properly label enough data to be able to
train their data mining algorithms.
Unsupervised machine learning methods cluster behaviours without prior knowledge of
labels. For example, algorithms based on Topic Models (Farrahi and Gatica-Perez, 2012)
allow stakeholders to generate clusters of behaviour instances. However, the main limitation
of general-purpose unsupervised methods is that they offer no guarantees that the resulting
clusters group instances based on the routine they belong to (i.e., the clusters may not
represent routines). Unsupervised anomaly detection algorithms (e.g., McFowland, Speakman,
and Neill, 2013) could be used to find differences between behaviour instances. However,
they detect if a behaviour instance is a deviation from a routine, but not whether it is part of
the routine.
This highlights the major challenge with most traditional machine learning models:
like any other black box predictive model, they offer no easy way to inspect
the model and ensure that it captures meaningful patterns of routine behaviour. Stakeholders can
inspect generative models (Salakhutdinov, 2009) by generating behaviours from the model
and comparing them with empirical behaviour instances in the data. However, this assumes
that the stakeholder already understands behaviour instances characteristic of a routine,
which is what the model is supposed to automatically extract for the stakeholder in the
first place.
Unsupervised methods specifically designed to model routines from behaviour logs
(Eagle and Pentland, 2009; Farrahi and Gatica-Perez, 2012; Li, Kambhampati, and Yoon,
2009; Magnusson, 2000) are designed to capture meaningful patterns of routine behaviour.
Each offers a unique approach to modelling routines. The advantage of such specialized
methods compared to general-purpose machine learning approaches is that each is
grounded in theory about routine behaviours. Stakeholders can explore each model using
various custom visualizations. For example, T-patterns (Magnusson, 2000), which model
multivariate events that reoccur at a specific time interval, can be represented using arc
diagrams (Wattenberg, 2002). This enables stakeholders to check that a model matches the
actual patterns of behaviour in the data. Such models can then act in response to people’s
behaviour, e.g., predict their mobility (Sadilek and Krumm, 2012).
However, it remains unclear which aspects of routines those existing routine modelling
approaches are able to capture. The next section explores different aspects of routine
behaviours and the implications they have on the ability of each of those models to explain
routine behaviours.
Although we consider routines that may include bad habits and other behaviours that could negatively impact
people, we exclude a discussion of pathological behaviours, such as addiction, which require
special consideration.
Behaviour patterns that characterize routines form as people repeatedly perform those behaviours (Agre and Shrager,
1990). Hodgson's (1997) definition further implies a rigid, one-to-one mapping between
situations and actions, which suggests that repeated behaviour instances will also be characterized
by the same rigidity (Brdiczka et al., 2010). However, routines, like most other kinds
of human behaviours, have high variability (Ajzen, 1991). Thus, a unified definition must
consider that routines may vary from enactment to enactment (Hamermesh, 2003). Also,
people adapt their routines over time (Ronis, Yates, and Kirscht, 1989) based on feedback
from different enactments (Feldman and Pentland, 2003).
We combine different aspects of routines and propose our own unified definition of
routine behaviour:
Routines are likely, weakly ordered, interruptible sequences of causally related situations and
actions that a person will perform to create or reach opportunities that enable the person to
accomplish a goal.
Our unified definition strongly builds on Hodgson’s (1997) definition to give structure and
ordering to routine behaviour, while at the same time allowing for variability in behaviour.
We leave the features of situations and actions unspecified because finding such features in
the data is an essential part of exploring and understanding routines. We do not attribute
recurrence and repetitiveness to routines directly, but to features of situations in the environment
(i.e., if situations repeat, so will corresponding actions). We leave the granularity
of such features unspecified for the same reasons. However, we specifically require that data
features that describe situations and actions include information about people’s goals and
opportunities to accomplish those goals. Situations and actions and the behaviour structures
they form are characterized by causal relations between features of situations and actions
that help describe and explain routines.
Both T-patterns (Magnusson, 2000) and Eigenbehaviours (Eagle and Pentland, 2009)
focus on temporal relationships between multivariate events. By considering time, they
still implicitly model other aspects of the routines. For example, sequences of situations
and actions can be expressed over a period of time, and recurrent situations are often
periodic. However, this is a limited view of routines because there are clearly other forces that
influence people’s free choice to act (Hamermesh, 2003). Although time may be correlated
with many aspects of routine behaviours, it may not necessarily have a causal effect on
people’s actions. For example, people will attend a scheduled weekly meeting because of
social interactions and norms, but not simply because of a specific day of week and time of
day (Weiss, 1996).
From this we conclude that while these algorithms are helpful for extracting routines from
behaviour logs, they are not sufficient for providing a holistic understanding of a routine
behaviour. Thus, we propose a new model that is grounded in our unified definition of
routine behaviour.
M_MDP = (S, A, P(s), P(s′|s, a), P(a|s), R(s, a))    (14.1)
In our model, we define a set of situations S, where each unique situation s ∈ S is described
by a set of features F_st, and a set of actions A, where each action a ∈ A is defined by a set of
features F_at that describe the unique actions people can perform. We can automatically
convert raw behaviour traces from behaviour logs into sequences of situations
and actions that describe behaviour instances, which we can use to train our routine model.
A behaviour instance is a finite, ordered sequence of situations and actions {s1, a1, s2,
a2, …, sn, an}, where in each situation si the person performs action ai, which results in a
new situation si+1. Behaviour instances describe people's behaviours as they seek to accomplish
a goal.
We express the likelihood of any of these sequences by capturing the probability distribution
of situations P(s), the probability distribution of actions given situations P(a|s), and the
probability distribution of situation transitions P(s′|s, a), which specifies the probability of
the next situation s′ when the person performs action a in situation s. This situation
transition probability distribution P(s′|s, a) models how the environment responds to
the actions that people perform in different situations. Then, assuming that each situation
depends only on the previous situation and action, we calculate the probability
of any behaviour instance b, where the probability of the initial situation s0, p(s0), and
the conditional probability of actions given situations p(ai|si) are specific to routine
model M:
P(b|M) = p(s0) · ∏i p(ai|si) · p(si+1|si, ai)    (14.2)
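As a concrete illustration, Equation 14.2 can be evaluated in log space to avoid numerical underflow on long sequences. The sketch below assumes the model's distributions are available as plain dictionaries, which are hypothetical stand-ins for whatever representation a trained model actually uses:

```python
import math

def log_prob_behaviour(instance, p_s0, p_a_given_s, p_next):
    """Log-probability of a behaviour instance b = [s0, a0, s1, a1, ...]
    under a routine model M (Equation 14.2), computed in log space to
    avoid underflow on long sequences.

    p_s0, p_a_given_s, and p_next are dictionaries standing in for the
    model's distributions P(s), P(a|s), and P(s'|s, a) respectively;
    they are illustrative placeholders, not the authors' data structures.
    """
    states, actions = instance[0::2], instance[1::2]
    logp = math.log(p_s0[states[0]])
    for i, a in enumerate(actions):
        logp += math.log(p_a_given_s[(a, states[i])])
        if i + 1 < len(states):  # transition to the next observed situation
            logp += math.log(p_next[(states[i + 1], states[i], a)])
    return logp
```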
A reward function R : S × A → ℝ defines the reward the person incurs when performing
action a in situation s. To model people's goal states, we seek situations with high reward,
which characterize situations in which people have an opportunity to accomplish their goals.
In an MDP framework, the reward function is used to compute a deterministic policy
(π : S → A), which specifies actions agents should take in different situations. Traditionally,
the MDP is ‘solved’ using the value iteration algorithm (Bellman, 1957), to find an optimal
policy (with the highest expected cumulative reward). However, our goal is to infer the
reward function from people's demonstrated behaviour.
We use the MaxCausalEnt algorithm (Ziebart, 2010) to infer the reward function that
expresses the preference that people have for different situations and actions. Our main
contribution to training models of routines is our insight that the byproducts of
MaxCausalEnt (Ziebart, 2010), a decision-theoretic Inverse Reinforcement Learning (IRL)
algorithm typically used to train MDP models from data and predict people's activity
(Ziebart, et al., 2008; Ziebart et al., 2009), encode the preference that people have for
specific goal situations, and the causal relationship between routine actions and the situations in
which people perform those actions. Such IRL algorithms (Ng and Russell, 2000; Ziebart,
Bagnell, and Dey, 2013) assume a parametric reward function that is linear in the features F_st,at, given
unknown weight parameters θ:
R(s, a) = θ^T · F_st,at    (14.3)

Incidentally, V_θ^soft(st) and θ represent the preference that people have for different situations
and for the features that describe those situations, respectively, which implies goal states.
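To make the role of V^soft concrete, the following sketch shows the 'soft' value iteration at the heart of MaxCausalEnt's forward pass: the hard max of standard value iteration is replaced by a log-sum-exp softmax, and the resulting stochastic policy is P(a|s) = exp(Q(s, a) − V^soft(s)). This is only the policy computation for a given reward over a small finite model, not the full gradient procedure that fits θ, and all data structures here are illustrative assumptions.

```python
import math

def soft_value_iteration(S, A, R, P_next, gamma=0.95, iters=200):
    """Soft (maximum-causal-entropy) value iteration sketch.

    Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')
    V(s)    = log sum_a exp Q(s, a)     # softmax replaces hard max
    P(a|s)  = exp(Q(s, a) - V(s))

    S, A: finite lists of situations and actions; R: dict (s, a) -> reward;
    P_next: dict (s, a) -> {s': prob}. All are illustrative stand-ins for
    the chapter's model components, not the authors' implementation.
    """
    V = {s: 0.0 for s in S}
    Q = {}
    for _ in range(iters):
        Q = {(s, a): R[(s, a)] + gamma * sum(p * V[s2]
                                             for s2, p in P_next[(s, a)].items())
             for s in S for a in A}
        V = {s: math.log(sum(math.exp(Q[(s, a)]) for a in A)) for s in S}
    # Stochastic policy: higher-reward actions get exponentially more mass.
    policy = {(s, a): math.exp(Q[(s, a)] - V[s]) for s in S for a in A}
    return policy, V
```

Because the policy is a properly normalized distribution over actions in every situation, low-reward actions keep non-zero probability, which is what lets the model capture infrequent variations rather than only the single most likely behaviour.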
Using MaxCausalEnt (Ziebart, 2010), we can build a probabilistic model of routines that,
unlike models that extract only the most frequent routines, also captures likely variations of
those routines, even in infrequent situations. Our approach does this by modelling probability
distributions over all possible combinations of situations and actions using behaviour
traces from large behaviour logs. Our approach supports both individual and population
models of routines, providing the ability to identify the differences in routine behaviour
across different people and populations. By Jaynes's Principle of Maximum Entropy (1957),
the estimated probability distribution of actions given situations P(a|s) is the one that
best fits the situation and action combinations from the sequences in the behaviour logs.
Assuming that the actions a person performs only depend on the information encoded
by the current situation (the Markovian property of the model) makes this approach
computationally feasible. See Banovic et al. (2016) for more details on the routine model,
and Ziebart (2010) for proofs and detailed pseudocode for the MaxCausalEnt algorithm.
Having our probabilistic, generative model of routine behaviour allows us to automatically
classify and generate behaviours that are characteristic of a routine (Banovic, Wang,
et al., 2017). Classifying behaviour instances often involves considering two competing
routines and finding behaviour instances that are characteristic of one routine, but not the
other. To classify a behaviour instance, we need to calculate the probability that it belongs
to a particular routine:
P(M|b) = P(b|M) · P(M) / P(b)    (14.6)
where P(b|M) is the probability of the instance in the routine M, P(M) is the probability
that the routine of the person whose behaviour we are classifying is M, and P(b) is the probability
that people, regardless of their routine, would perform behaviour instance b. Then,
assuming two models of opposing routines M and M′, with probabilities of all possible
behaviour instances in the model, by the law of total probability:
P(M|b) = P(b|M) · P(M) / (P(b|M) · P(M) + P(b|M′) · P(M′))    (14.7)
This allows us to create a classifier that can automatically classify (or detect) whether behaviour
instance b is in routine M, but not in routine M′, for some confidence level α:

h(b) = I(P(M′|b) < α) · I(α ≤ P(M|b))    (14.8)
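A minimal sketch of the resulting classifier, combining Equations 14.7 and 14.8, under the assumption that the two likelihoods P(b|M) and P(b|M′) have already been computed from the trained routine models:

```python
def classify_routine(p_b_M, p_b_M2, prior_M=0.5, alpha=0.95):
    """Bayes classifier for Equations 14.7-14.8: True when behaviour
    instance b is confidently in routine M and not in routine M'.

    p_b_M, p_b_M2: likelihoods P(b|M) and P(b|M') from the two trained
    routine models; prior_M: P(M); alpha: confidence level. Argument
    names are illustrative, not taken from the authors' code.
    """
    # Law of total probability over the two competing routines.
    evidence = p_b_M * prior_M + p_b_M2 * (1.0 - prior_M)
    post_M = p_b_M * prior_M / evidence   # P(M|b), Equation 14.7
    post_M2 = 1.0 - post_M                # P(M'|b)
    # Both indicator functions of Equation 14.8 must hold.
    return post_M2 < alpha <= post_M
```

With only two competing routines the posteriors sum to one, so for α above 0.5 the first indicator is implied by the second; keeping both mirrors the equation as written.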
Classifying and generating behaviour instances automates a fundamental task of the behaviour sensemaking
process: finding evidence to support that the model captures patterns in the data
that represent behaviours and not some other patterns in the data. It also enables a class of
CI UIs that can automatically act in response to people's behaviours.
Figure 14.1 Driving behaviour detection and generation tool user interface from Banovic et
al. (2017). The main animation region (A) shows a scenario in which a vehicle travels straight
through an intersection. The simplified intersection consists of a main road with a 25-mile-per-hour
speed limit and a residential street with stop signs for opposing traffic. The current scenario
depicts an automatically detected aggressive driving behaviour (speeding transparent vehicle) and an
automatically generated non-aggressive behaviour (opaque vehicle, which drives within the posted
speed limit). Dials (A1) show the vehicles’ speed, and gas and brake pedal positions. The user
can: B1) select a driver and a trip to review, B2) replay current driver behaviour, B3 & B5) load
previous and next driving behaviour in the current trip, and B4) play or replay an automatically
generated non-aggressive behaviour for the given driving scenario. Reproduced with permission
from ACM.
section. We identified 20,312 different situations and 43 different actions in the dataset. The
final models had 234,967 different situations, 47 different actions, and 5,371,338 possible
transitions.
Once the system detects an aggressive behaviour instance using the classifiers in Equation
14.9, it uses the non-aggressive driving routine model to sample behaviour instances
that represent what a non-aggressive driver would do in the same situation. Generating
behaviour instances for specific driving situations allows us to explore such ‘what-if’
scenarios for the two driver populations, even if our training data does not contain those
exact scenarios.
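Sampling such 'what-if' behaviour instances follows directly from the model structure: repeatedly draw an action from P(a|s) and a next situation from P(s′|s, a). A minimal sketch, with dictionaries standing in as hypothetical placeholders for the trained model's distributions:

```python
import random

def sample_behaviour(s0, p_a_given_s, p_next, steps=10, rng=None):
    """Sample a behaviour instance from a routine model, starting in
    situation s0: repeatedly draw a ~ P(a|s), then s' ~ P(s'|s, a).

    p_a_given_s maps a situation to a dict of action probabilities;
    p_next maps a (situation, action) pair to a dict of next-situation
    probabilities. Both are illustrative stand-ins for a trained model.
    """
    rng = rng or random.Random()
    trace, s = [s0], s0
    for _ in range(steps):
        actions, a_wts = zip(*p_a_given_s[s].items())
        a = rng.choices(actions, weights=a_wts)[0]
        nexts, s_wts = zip(*p_next[(s, a)].items())
        s = rng.choices(nexts, weights=s_wts)[0]
        trace += [a, s]
    return trace
```

Seeding the trajectory with a detected aggressive driving situation and sampling from the non-aggressive model is one way to produce the kind of counterfactual behaviour the tool displays.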
The advantage of our model over black-box alternatives is that it allows stakeholders to explore the model and the data used to
train it. Our model allows stakeholders to generate knowledge about routines that will
inform the design of human-data supported interfaces. We primarily focused on supporting
stakeholders in the information foraging loop part of the sensemaking process (Pirolli and
Card, 2005). We have shown how to automate the process of extracting salient patterns and
searching for relationships in the data that describe routine behaviours.
We have shown how to schematize this collection of evidence into an automatically
trained computational model of routines. We have verified that domain experts can inspect
and search for relationships in the routine models to ensure the model captures meaningful
patterns of behaviours in the data (Banovic et al., 2016). We also explored different aspects
of the sensemaking loop in which stakeholders generate and test their hypotheses, and
used various visual representations of the data to demonstrate the ability of our approach
to present findings about human behaviours that give stakeholders a holistic picture of
this type of human behaviour (Banovic et al., 2017). Our probabilistic model allows
stakeholders to formalize the process by which they generate and test their hypotheses about
behaviours using Bayesian Hypothesis Testing (Bernardo and Rueda, 2002). We have also
shown how the model can support interfaces that act in response to those behaviours to
prescribe behaviour change (Banovic et al., 2017).
The ability of our proposed routine model to accurately encode behaviours is based on
the assumptions that we can fully observe situations and actions, and that we can encode the
world dynamics. The ability to collect information about situations and actions in behaviour
logs satisfies the assumption that both are fully observable. However, the current model
requires the stakeholders to manually specify how the environment responds to people’s
actions and other external factors that operate in and influence the environment. We can
manually specify world dynamics when they are known ahead of time. This is often the case
when people’s actions fully describe situation transitions (e.g., when the model considers
only factors that people have full control over in the environment). For example, it is easy
to specify world dynamics in a routine model of people’s daily commutes between different
places that are all known ahead of time, because the person always ends up at the place they
intended to go to or stay at with 100 per cent probability. However, if we introduce more
external factors into the model, we must also estimate the effect of those factors on the
environment. The effects of such factors are often not known ahead of time, and even if
they were, it may be tedious to encode such dynamics manually.
Learning world dynamics from the data is challenging because it requires a large number
of training examples to model their complexity accurately. For example, in a model with
|S| situations and |A| actions, we need to estimate the situation
transition probability distribution P(s′|s, a) for |S| × |A| × |S| transitions. This
problem is compounded when modelling human behaviour from behaviour logs. In this
case, transitions involving actions that represent deviations from a routine will be infrequent
in the data (by definition). Some possible, but infrequent transitions will also not be well
represented in the data. However, the nature of log studies prevents the stakeholders from
asking people to go into their environment and perform such actions and hope they end
up in situations that we have not observed. Even in situations when the stakeholders could
contact people, asking them to perform specific actions might be cumbersome (e.g., if it
requires time and resources), inappropriate (e.g., if they are unable to perform such actions),
or unethical (e.g., if those actions could negatively impact them).
For future work, we propose a semi-automated method that learns complex world
dynamics. We will begin by modifying the existing MaxCausalEnt algorithm (Ziebart,
2010) to also estimate situation transition probabilities from the data. This will allow
us to estimate the combined effects of situations and actions on situation transitions for
situation transitions that are well represented in the data. Such improvements will bring us
closer to a general purpose, generative model of routines that enables CI across domains
and applications.
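A natural starting point for estimating transitions from data is the count-based maximum-likelihood estimate P(s′|s, a) = N(s, a, s′)/N(s, a), optionally smoothed for sparsely observed pairs. This simple baseline (a sketch, not the semi-automated method proposed above) also illustrates the coverage problem: (s, a) pairs that never occur in the logs get no estimate at all.

```python
from collections import Counter, defaultdict

def estimate_transitions(instances, smoothing=0.0):
    """Count-based maximum-likelihood estimate of P(s'|s, a) from
    behaviour logs: P(s'|s, a) = N(s, a, s') / N(s, a), with optional
    additive smoothing over the observed successor set.

    instances: iterable of sequences [s0, a0, s1, a1, ..., sn].
    A simple illustrative baseline, not the authors' method.
    """
    counts = defaultdict(Counter)
    for b in instances:
        # Walk (situation, action, next-situation) triples along the trace.
        for i in range(0, len(b) - 2, 2):
            s, a, s2 = b[i], b[i + 1], b[i + 2]
            counts[(s, a)][s2] += 1
    P = {}
    for (s, a), c in counts.items():
        total = sum(c.values()) + smoothing * len(c)
        P[(s, a)] = {s2: (n + smoothing) / total for s2, n in c.items()}
    return P
```

Transitions representing deviations from a routine are, by definition, rare in the logs, so their estimates under this baseline have high variance — which is exactly why the text argues for estimating transitions jointly within the MaxCausalEnt framework.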
In our work, we illustrated a class of CI we define as Human-Data Supported Interfaces.
However, the generalizable nature of the model makes it an ideal candidate for CI in the
context of UI design. For example, such models can model interaction between the user
and a class of devices, such as a mobile device, including the tasks and the goals that the user
wants to accomplish. Such models can then be used to select the best design for the mobile
device UIs that will streamline the user’s interaction to complete tasks and accomplish the
goals faster. Such examples offer the promise of a future in which computation drives UIs in
ways that help improve people's wellbeing and quality of life.
....................................................................................................
references
American Automobile Association, 2009. Aggressive driving: Research update. Washington, DC:
American Automobile Association Foundation for Traffic Safety.
Adar, E., Teevan, J., and Dumais, S. T., 2008. Large-scale analysis of web revisitation patterns. In: CHI
’08 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Florence, Italy,
5–10 April 2008. New York, NY: ACM, pp. 1197–206.
Agre, P. E., and Shrager, J., 1990. Routine Evolution as the Microgenetic Basis of Skill Acquisition. In:
Proceedings of the 12th Annual Conference of the Cognitive Science Society. Cambridge, MA, 25–28
July 1990. Hillsdale, NJ: Lawrence Erlbaum.
Aigner, W., Miksch, S., Thurnher, B., and Biffl, S., 2005. PlanningLines: Novel glyphs for representing
temporal uncertainties and their evaluation. In: Proceedings of the 9th International Conference
on Information Visualisation. London, UK. 6–8 July 2005. Red Hook, NY: Curran Associates,
pp. 457–63.
Ajzen, I., 1985. From intentions to actions: A theory of planned behavior. In: J. Kuhl and J. Beckmann,
eds. Action Control. SSSP Springer Series in Social Psychology. Berlin: Springer.
Ajzen, I., 1991. The theory of planned behavior. Organizational Behavior and Human Decision Processes,
50(2), pp. 179–211.
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., and Qin, Y., 2004. An Integrated
Theory of the Mind. Psychological Review, 111(4), pp. 1036–60.
Banovic, N., Brant, C., Mankoff, J., and Dey, A. K., 2014. ProactiveTasks: the Short of Mobile Device
Use Sessions. In: Proceedings of the 16th International Conference on Human-Computer Interaction
with Mobile Devices & Services. Toronto, ON, Canada, 23–26 September 2014. New York, NY:
ACM, pp. 243–52.
Banovic, N., Buzali, T., Chevalier, F., Mankoff, J., and Dey, A. K., 2016. Modeling and Understanding
Human Routine Behavior. In: Proceedings of the 2016 CHI Conference on Human Factors in Com-
puting Systems. San Jose, CA, 7–12 May 2016. New York, NY: ACM, pp. 248–60.
Banovic, N., Grossman, T., and Fitzmaurice, G., 2013. The effect of time-based cost of error in target-
directed pointing tasks. In: Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems. Paris, France, 27 April–2 May 2013. New York, NY: ACM, pp. 1373–82.
Banovic, N., Rao, V., Saravanan, A., Dey, A. K., and Mankoff, J., 2017. Quantifying Aversion to
Costly Typing Errors in Expert Mobile Text Entry. In: Proceedings of the 2017 CHI Conference
on Human Factors in Computing Systems. Denver, CO, 6–11 May 2017. New York, NY: ACM,
pp. 4229–41.
Banovic, N., Wang, A., Jin, Y., Chang, C., Ramos, J., Dey, A. K., and Mankoff, J., 2017. Leveraging
Human Routine Models to Detect and Generate Human Behaviors. In: Proceedings of the 2017
CHI Conference on Human Factors in Computing Systems. Denver, CO, 6–11 May 2017. New York,
NY: ACM, pp. 6683–94.
Baratchi, M., Meratnia, N., Havinga, P. J. M., Skidmore, A. K., and Toxopeus, B. A. K. G., 2014. A
hierarchical hidden semi-Markov model for modeling mobility data. In: Proceedings of the 2014
ACM International Joint Conference on Pervasive and Ubiquitous Computing. Seattle, WA, 13–17
September 2014. New York, NY: ACM, pp. 401–12.
Becker, M. C., 2004. Organizational routines: A review of the literature. Industrial and Corporate
Change, 13(4), pp. 643–77.
Bellman, R., 1957. A Markovian decision process. Journal of Mathematics and Mechanics, 6,
pp. 679–84.
Bernardo, J., and Rueda, R., 2002. Bayesian hypothesis testing: A reference approach. International
Statistical Review, 70(3), pp. 351–72.
Brdiczka, O., Su, N., and Begole, J., 2010. Temporal task footprinting: identifying routine tasks by their
temporal patterns. In: Proceedings of the 15th international conference on Intelligent user interfaces.
Hong Kong, China, 7–10 February 2010. New York, NY: ACM, pp. 281–4.
Bulling, A., Blanke, U., and Schiele, B., 2014. A tutorial on human activity recognition using body-worn
inertial sensors. ACM Computing Surveys (CSUR), 46(3), pp. 1–33.
Buono, P., Aris, A., Plaisant, C., Khella, A., and Shneiderman, B., 2005. Interactive Pattern Search in
Time Series. In: Proceedings of the Conference on Visualization and Data Analysis (VDA 2005). San
Jose, CA, 3 November 2005. Vol. 5669, pp. 175–86. http://doi.org/10.1117/12.587537
Capra, R., 2011. HCI browser: A tool for administration and data collection for studies of web
search behaviors. In: A Marcus, ed. Lecture Notes in Computer Science: Design, User Experience, and
Usability, Pt II. LNCS 6770, pp. 259–68.
Casarrubea, M., Jonsson, G. K., Faulisi, F., Sorbera, F., Di Giovanni, G., Benigno, A., and Crescimanno,
G., 2015. T-pattern analysis for the study of temporal structure of animal and human behavior: A
comprehensive review. Journal of Neuroscience Methods, 239, pp. 34–46.
Clear, A. K., Shannon, R., Holland, T., Quigley, A., Dobson, S., and Nixon, P., 2009. Situvis: A visual
tool for modeling a user’s behaviour patterns in a pervasive environment. In: H. Tokuda, ed. Lecture
Notes in Computer Science: Pervasive 2009, LNCS 5538, pp. 327–41.
Davidoff, S., Ziebart, B. D., Zimmerman, J., and Dey, A. K., 2011. Learning Patterns of Pick-ups and
Drop-offs to Support Busy Family Coordination. In: Proceedings of the SIGCHI Conference on
Human Factors in Computing Systems. Vancouver, BC, Canada, 7–12 May 2011. New York, NY:
ACM, pp. 1175–84.
Davidoff, S., Zimmerman, J., and Dey, A. K., 2010. How routine learners can support family coordination.
In: Proceedings of the 28th International Conference on Human Factors in Computing Systems.
Atlanta, GA, 10–15 April 2010. New York, NY: ACM, pp. 2461–70.
Dumais, S., Jeffries, R., Russell, D. M., Tang, D., and Teevan, J., 2014. Understanding User Behavior
Through Log Data and Analysis. In: J. Olson and W. Kellogg, eds. Ways of Knowing in HCI. New
York, NY: Springer, pp. 349–72.
Eagle, N., and Pentland, A. S., 2009. Eigenbehaviors: identifying structure in routine. Behavioral
Ecology and Sociobiology, 63(7), pp. 1057–66.
Farrahi, K., and Gatica-Perez, D., 2012. Extracting mobile behavioral patterns with the distant N-gram
topic model. In: Proceedings of the 16th International Symposium on Wearable Computers (ISWC).
Newcastle, UK, 18–22 June 2012. Red Hook, NY: Curran Associates, pp. 1–8.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P., 1996. From Data Mining to Knowledge Discovery in
Databases. AI Magazine, 17(3), pp. 37–54.
Feldman, M. S., 2000. Organizational Routines as a Source of Continuous Change. Organization
Science, 11(6), pp. 611–29.
Feldman, M. S., and Pentland, B. T., 2003. Reconceptualizing Organizational Routines as a Source of
Flexibility and Change. Administrative Science Quarterly, 48(1), pp. 94–118.
Ferreira, D., Kostakos, V., and Dey, A. K., 2015. AWARE: Mobile Context Instrumentation Framework.
Frontiers in ICT, 2, pp. 1–9.
Friedman, H. J., 1997. Data mining and statistics: What’s the connection? In: Proceedings of the 29th
Symposium on the Interface Between Computer Science and Statistics. Houston, TX, 14–17 May 1997,
pp. 1–7.
Good, I. J., 1983. The Philosophy of Exploratory Data Analysis. Philosophy of Science, 50(2),
pp. 283–95.
Google. 2017. Google Analytics. Retrieved 10 April 2017, from http://www.google.com/
analytics/
Hamermesh, D. S., 2003. Routine. NBER Working Paper Series, (9440). Retrieved from http://www.
sciencedirect.com/science/article/pii/S0014292104000182
Hodgson, G. M., 1997. The ubiquity of habit and rules. Cambridge Journal of Economics, 21(6),
pp. 663–83.
Hodgson, G. M., 2009. Choice, habit and evolution. Journal of Evolutionary Economics, 20(1),
pp. 1–18.
Hong, J.-H., Margines, B., and Dey, A. K., 2014. A smartphone-based sensing platform to model
aggressive driving behaviors. In: Proceedings of the 32nd Annual ACM Conference on Human Factors
in Computing Systems. Toronto, ON, Canada, 23–26 September 2014. New York, NY: ACM,
pp. 4047–56.
Hurst, A., Mankoff, J., and Hudson, S. E., 2008. Understanding pointing problems in real world
computing environments. In: Proceedings of the 10th International ACM SIGACCESS Conference on
Computers and Accessibility. Halifax, Nova Scotia, Canada, 13–15 October 2008, New York, NY:
ACM, pp. 43–50.
Jaynes, E. T., 1957. Information Theory and Statistical Mechanics. Physical Review, 108(2),
pp. 171–90.
Jin, J., and Szekely, P., 2010. Interactive querying of temporal data using a comic strip metaphor.
In: Proceedings of the 2010 IEEE Symposium on Visual Analytics Science and Technology
(VAST 2010). Salt Lake City, UT, 25–26 October 2010. Red Hook, NY: Curran Associates,
pp. 163–70.
Kahneman, D., 2011. Thinking, fast and slow. London: Penguin.
Keim, D., Andrienko, G., Fekete, J. D., Görg, C., Kohlhammer, J., and Melançon, G., 2008. Visual
analytics: Definition, process, and challenges. In: A. Kerren et al., eds. Lecture Notes in Computer
Science: Information Visualization. LNCS 4950, pp. 154–75.
Kleinberg, J., Ludwig, J., Mullainathan, S., and Obermeyer, Z., 2015. Prediction Policy Problems.
American Economic Review: Papers & Proceedings, 105(5), pp. 491–5.
Koehler, C., Banovic, N., Oakley, I., Mankoff, J., and Dey, A. K., 2014. Indoor-ALPS: An adaptive
indoor location prediction system. In: Proceedings of the 2014 ACM International Joint Conference on
Pervasive and Ubiquitous Computing. Seattle, WA, 13–17 September 2014. New York, NY: ACM,
pp. 171–81.
Krumm, J., and Horvitz, E., 2006. Predestination: Inferring destinations from partial trajectories.
In: Proceedings of the 8th international conference on Ubiquitous Computing. Irvine, CA, 17–21
September 2006. Berlin: Springer, pp. 243–60.
Kuutti, K., 1995. Activity Theory as a potential framework for human-computer interaction research.
In: B. Nardi, ed. Context and Consciousness: Activity Theory and Human-Computer Interaction.
Cambridge, MA: MIT Press, pp. 17–44.
Li, N., Kambhampati, S., and Yoon, S., 2009. Learning Probabilistic Hierarchical Task Networks
to Capture User Preferences. In: Proceedings of the Twenty-First International Joint Conference on
Artificial Intelligence. Pasadena, CA, 14–17 July 2009. Menlo Park, CA: AAAI, pp. 1754–60.
MacKenzie, I. S., 1992. Fitts’ law as a research and design tool in human-computer interaction. Human-
Computer Interaction, 7(1), pp. 91–139.
Magnusson, M. S., 2000. Discovering hidden time patterns in behavior: T-patterns and their detection.
Behavior Research Methods, Instruments, & Computers, 32(1), pp. 93–110.
McFowland III, E., Speakman, S., and Neill, D. B., 2013. Fast Generalized Subset Scan for Anomalous
Pattern Detection. Journal of Machine Learning Research, 14, pp. 1533–61.
Monroe, M., Lan, R., Lee, H., Plaisant, C., and Shneiderman, B., 2013. Temporal event sequence
simplification. IEEE Transactions on Visualization and Computer Graphics, 19(12), pp. 2227–36.
Ng, A., and Russell, S., 2000. Algorithms for inverse reinforcement learning. In: Proceedings of the
Seventeenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann,
pp. 663–70.
Pasquale, F., 2015. The Black Box Society: The Secret Algorithms That Control Money and Information.
Cambridge, MA: Harvard University Press.
Pentland, B. T., and Rueter, H. H., 1994. Organizational Routines as Grammars of Action. Administrative
Science Quarterly, 39(3), pp. 484–510.
Pirolli, P., and Card, S., 2005. The sensemaking process and leverage points for analyst technology as
identified through cognitive task analysis. In: Proceedings of the 2005 International Conference on
Intelligence Analysis. McLean, VA, May 2–6, 2005.
Plaisant, C., Milash, B., Rose, A., Widoff, S., and Shneiderman, B., 1996. LifeLines: visualizing
personal histories. In: Proceedings of the SIGCHI Conference on Human Factors in Comput-
ing Systems. Vancouver, British Columbia, Canada, 13–18 April 1996. New York, NY: ACM,
pp. 221–7.
Puterman, M., 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Oxford:
Wiley Blackwell.
Rashidi, P., and Cook, D. J., 2010. Mining and monitoring patterns of daily routines for assisted living
in real world settings. In: Proceedings of the 1st ACM International Health Informatics Symposium.
Arlington, VA, 11–12 November 2010, New York, NY: ACM, pp. 336–45.
Reason, J., Manstead, A., Stradling, S., Baxter, J., and Campbell, K., 2011. Errors and violations on the
roads: a real distinction? Ergonomics, 33(10–11), pp. 1315–32.
OUP CORRECTED PROOF – FINAL, 22/12/2017, SPi
Ronis, David L., J., Yates, F., and Kirscht, J. P., 1989. Attitudes, decisions, and habits as determinants of
repeated behavior. In: A. Pratkanis, S. J. Breckler, and A. C. Greenwald, eds. Attitude Structure and
Function. Mahwah, NJ: Lawrence Erlbaum, pp. 213–39.
Russell, D. M., Stefik, M. J., Pirolli, P., and Card, S. K., 1993. The cost structure of sensemaking. In:
Proceedings of the INTERACT ’93 and CHI ’93 Conference on Human Factors in Computing Systems.
Amsterdam, Netherlands, 24–29 April 1993. New York, NY: ACM, pp. 269–76.
Sadilek, A., and Krumm, J., 2012. Far Out: Predicting Long-Term Human Mobility. In: Proceedings of
the Twenty-Sixth AAAI Conference on Artificial Intelligence. Toronto, Ontario, Canada, 22–26 July
2012. New York, NY: ACM, pp. 814–20.
Salakhutdinov, R., 2009. Learning Deep Generative Models. Annual Review of Statistics and Its Applica-
tion, 2, pp. 361–85.
Shmueli, G., 2010. To explain or to predict? Statistical Science, 25, pp. 289–310.
Starbird, K., and Palen, L., 2010. Pass it on?: Retweeting in mass emergency. In: Proceedings of the 7th
International ISCRAM Conference. Seattle, WA, 2–5 May 2010.
Stone, P., Brooks, R., Brynjolfsson, E., Calo, R., Etzioni, O., Hager, G., Hirschberg, J., Kalyanakrishnan,
S., Kamar, E., Kraus, S., Leyton-Brown, K., Parkes, D., Press, W., Saxenian, A., Shah, J., Tambe, M.,
and Teller, A., 2016. Artificial Intelligence and Life in 2030. One Hundred Year Study on Artificial
Intelligence: Report of the 2015–2016 Study Panel. Stanford University, Stanford, CA, September
2016. Available at https://ai100.stanford.edu/2016-report
Taylor, R., 1950. Purposeful and non-purposeful behavior: A rejoinder. Philosophy of Science, 17(4),
pp. 327–32.
Tukey, J., 1977. Exploratory data analysis. Addison-Wesley Series in Behavioral Science. New York, NY:
Pearson.
Venkatesh, V., Morris, M. G., Davis, G. B., and Davis, F. D., 2003. User acceptance of information
technology: Toward a unified view. MIS Quarterly, 27(3), pp. 425–78.
Wattenberg, M., 2002. Arc diagrams: Visualizing structure in strings. In: Proceedings of the IEEE
Symposium on Information Visualization. Boston, MA, 28–29 October 2002. Red Hook, NY:
Curran Associates, pp. 110–16.
Weber, M., Alexa, M., and Müller, W., 2001. Visualizing time-series on spirals. In: Proceedings of the
IEEE Symposium on Information Visualization (INFOVIS). San Diego, CA, 22–23 October 2001.
Red Hook, NY: Curran Associates, pp. 7–14.
Weiss, Y., 1996. Synchronization of work schedules. International Economic Review, 37(1), pp. 157–79.
Ziebart, B., 2010. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy.
PhD. Carnegie Mellon University.
Ziebart, B., Ratliff, N., Gallagher, G., Mertz, C., Peterson, K., Bagnell, J. A., Hebert, M., Dey, A. K.,
and Srinivasa, S., 2009. Planning-based prediction for pedestrians. In: Proceedings of the 2009
IEEE/RSJ International Conference on Intelligent Robots and Systems. St. Louis, MS, 10–15 October
2009. Red Hook, NY: Curran Associates, pp. 3931–6.
Ziebart, B. D., Bagnell, J. A., and Dey, A. K., 2013. The principle of maximum causal entropy for
estimating interacting processes. IEEE Transactions on Information Theory, 59(4), pp. 1966–80.
Ziebart, B. D., Maas, A. L., Dey, A. K., and Bagnell, J. A., 2008. Navigate like a cabbie: Probabilistic
reasoning from observed context-aware behavior. In: Proceedings of the 10th international conference
on Ubiquitous computing. Seoul, Korea, 21–24 September 2008. New York, NY: ACM, pp. 322–31.
OUP CORRECTED PROOF – FINAL, 21/12/2017, SPi
15
15.1 Introduction
In the last two decades or so, various forms of online social media, forums, and networking
platforms have been playing increasingly significant roles in our societies. Recent research
by the Pew Research Center shows that 62 per cent of adults consider social media as their
major sources of information about current events and social issues (Gottfried and Shearer,
2016). To understand how these online social platforms impact information consumption,
researchers have conducted extensive research to investigate how the various features offered
by these platforms may influence behaviour at the individual and collective levels. We call
this general area of research Socio-Computer Interaction (SCI). This chapter starts with a
few notable distinctions between SCI and traditional human-computer interaction (HCI),
and provides examples of some of the general computational methods used to perform analyses
in SCI. Examples of these SCI analyses will be given to illustrate how these computational
methods can be used to answer various research questions.
Our consideration of what SCI is and does implies a basic structure that emerges as a
sequence of behaviour-artifact cycles, in which individual and social behaviour changes as
new artifacts (e.g., online social platforms) become available. The focus of SCI is on analytic
methods that investigate the interactions of humans and technologies at both the individual
and social levels. As such, research methods on SCI have subtle differences from those
commonly used in the area of social computing, as the latter focuses more on the design
and characteristics of the technological systems that support social behaviour. In particular,
this chapter focuses on computational techniques that are useful for characterizing and
predicting the inherent structures of SCI. These techniques provide multiple measurements
useful for understanding the dynamic interactions and their changes at multiple levels.
These structures are dynamic because they often begin with people’s desire to engage
or share opinions with their friends, colleagues, or others. In the process of this, they
sometimes encounter problems and sometimes make new discoveries. Capitalizing on the
understanding of these social goals, new artifacts are developed to facilitate these social
desires and to increase the satisfaction of the users. New artifacts, however, may (sometimes
unintentionally) change the social interactions for which they were originally designed, and
over time may even change the social norms of when and how people engage in these social
interactions. This creates the need for further analysis of behaviour, which may prompt the
development of new artifacts that satisfy the evolving social goals, and so on (Carroll et al.,
1991). In short, SCI is the study of the emerging interactive ecology of social behaviour and
technological artifacts.
The second distinction between SCI and HCI relates to how people’s beliefs, attitudes,
and emotions may influence each other and impact their behaviour. The reason why these
factors are more important for SCI than for HCI is that online social platforms
provide multiple ways people can express social signals that influence other people’s choices,
beliefs, opinions, and behaviour. The influence is often magnified by the platforms as
multiple social signals are aggregated or cascaded, or when the social signals ‘resonate’
with beliefs or emotional states of other users, which may reinforce the social signals or
extend the social influence to more people. For example, research has shown that online
content that leads to high arousal is more likely to be shared, suggesting that the specific
(emotional) responses invoked by online content have a causal impact on its transmission
(Berger and Milkman, 2012). Analysis in SCI therefore should include not only semantic
or informational content, but also how people with different beliefs and attitudes may
have different responses or reactions to the content, as these reactions play a pivotal role in
predicting behaviour in SCI.
Finally, in SCI, the unit of analysis needs to be broader than in traditional HCI. In
HCI, analysis tends to focus only on individual behaviour. In SCI, however, the main
goal of the users is to interact with other users using the same platform, rather than
interact with the interface itself. This distinction implies that the unit-of-analysis needs
to be expanded to understand characteristics and patterns of collective behaviour that
emerge at levels that involve multiple individuals, or how the social networks may
predict or influence collective behaviour. We provide examples of computational methods
for network analysis next, before we introduce examples of how they will be useful for
SCI research.
The first step of the algorithm is to normalize term expressions by tokenizing, in which
a text is broken up into words, phrases, or other meaningful units called tokens. After
tokenizing, one can calculate the cosine-similarity between blocks of tokens. Each block
is represented as a feature vector. Based on the calculated similarities, one can
decide where to split the blocks into different topics by checking whether the dissimilarity
level between adjacent blocks reaches a threshold, which is set through empirical studies.
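This block-comparison step can be sketched in a few lines of Python. The tokenizer, the bag-of-words vectors, the toy blocks, and the threshold value below are all illustrative assumptions (as noted above, the threshold would be tuned empirically):

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def segment(blocks, threshold=0.2):
    """Return indices of gaps whose adjacent blocks fall below the
    similarity threshold; gap i lies between blocks[i] and blocks[i+1]."""
    vectors = [Counter(tokenize(b)) for b in blocks]
    return [i for i in range(len(vectors) - 1)
            if cosine(vectors[i], vectors[i + 1]) < threshold]
```

For instance, four blocks in which the topic shifts from cooking to football would yield a single gap index at the topic boundary.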
Lexical cohesion is an important indicator of the similarity between two sentences or blocks.
LCSeg (Galley, McKeown, Fosler-Lussier, and Jing, 2003) combines text-based segmentation
features and conversational features in a machine learning approach.
The text-based segmentation method uses lexical chains to reflect discourse structure based
on term repetitions. Then, the lexical features are combined with conversational features,
e.g., silences and cue phrases, and fed into the machine learning classifier.
The TextTiling algorithm may encounter problems because its calculations are based on raw
term frequencies. For example, if two blocks contain exactly the same terms, they will be
considered similar. However, two similar sentences or blocks need not contain exactly the
same words: different words can express similar meanings. Thus, Choi and
colleagues (Choi, Wiemer-Hastings, and Moore, 2001) used LSA to measure similarity
based on meanings, rather than term frequencies. In addition, instead of using a threshold
to split blocks, they used clustering to perform topic segmentation.
Another approach is to use graph theory to perform segmentation. In graph theory,
finding the minimum cut is a segmentation problem; one needs to split the graph into several
connected components with the minimum removal of edges. This idea could be applied to
topic segmentation if one could determine the graph structures of the texts in the corpus.
For example, Malioutov and Barzilay (2006) built a weighted undirected graph representing
a large text corpus. Each vertex in the graph represents one sentence, and each edge is
weighted by the cosine similarity between the two sentences' TF.IDF vectors. With this graph structure
of the corpus, partitioning the graph with minimum cut cost is equivalent to partitioning the
corpus with maximized dissimilarity.
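Under the simplifying assumption that the corpus is split at a single boundary, the minimum-cut idea can be sketched as follows. The `tfidf_vectors` helper and the exhaustive search over split points are illustrative stand-ins for Malioutov and Barzilay's dynamic-programming formulation:

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Crude TF.IDF vectors: term frequency scaled by inverse
    document (here, sentence) frequency."""
    docs = [s.lower().split() for s in sentences]
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_split(sentences):
    """Place one boundary where the cut cost (the total similarity of
    edges crossing the boundary) is minimal."""
    vecs = tfidf_vectors(sentences)
    n = len(vecs)
    costs = [(sum(cosine(vecs[i], vecs[j])
                  for i in range(k) for j in range(k, n)), k)
             for k in range(1, n)]
    return min(costs)[1]
```

On a corpus whose first two sentences share vocabulary with each other but not with the last two, the minimal cut falls exactly between the two topics.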
to form a graph called Fragment Quotation Graph (FQG) (Carenini, Ng, and Zhou, 2007).
An FQG is a directed graph where each vertex is one unit of text in the conversation, and
each edge (x, y) indicates that unit x is a reply to unit y. The replying relationship is very
important in a conversation, as usually similar topics continue to appear in the stream of
replies. After the FQG graph is formulated, Carenini and colleagues applied the techniques
mentioned previously on the FQG structure. They combined the LCSeg (Galley, McKe-
own, Fosler-Lussier, and Jing, 2003) and LDA (Blei, Ng, and Jordan, 2003) methods to the
corresponding FQG structure of each conversation. Interested readers are referred to these
studies for more detail on the mechanisms used in combining those methods.
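Building such a graph from reply relations can be sketched as follows. The `build_fqg` and `thread_of` helpers and the unit identifiers are illustrative, not Carenini and colleagues' implementation, and `thread_of` assumes each unit quotes at most one earlier unit:

```python
def build_fqg(units, replies):
    """Build a Fragment Quotation Graph as an adjacency mapping.

    `units` is a list of text-unit ids; `replies` is a list of
    (x, y) pairs meaning unit x replies to (quotes) unit y."""
    graph = {u: set() for u in units}
    for x, y in replies:
        graph[x].add(y)  # directed edge from the reply to its target
    return graph

def thread_of(graph, start):
    """Follow reply edges from `start`, collecting the chain of quoted
    units (assumes at most one quoted unit per fragment here)."""
    chain = [start]
    while graph[chain[-1]]:
        chain.append(next(iter(graph[chain[-1]])))
    return chain
```

The resulting reply chains group fragments that are likely to share a topic, which is what makes the FQG a useful substrate for the segmentation methods above.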
unit that can express a major opinion on some main topics. The length can vary, ranging
from a short message like a Tweet or a movie review, to a long article like a blog post, as long
as it expresses one major opinion on a general aspect.
Sentence-level sentiment analysis classifies a sentence into an associated sentiment.
It assumes that a sentence expresses a single opinion, and thus, it is not appropriate for
complex and compound sentences with multiple opinions. The analysis is often performed
after a subjectivity classification task (Wilson et al., 2005) that determines whether a given
sentence contains subjective opinions. In this way, objective sentences are filtered out
before analysing the sentiment. However, this task is easier said than done, since objective
sentences can convey opinions and not all subjective texts have opinions. Moreover, recent
research has dealt with conditional (Narayanan, Liu, and Choudhary, 2009), comparative
(Ding, Liu, and Zhang, 2009), and figurative sentences (Davidov, Tsur, and Rappoport,
2010; Maynard and Greenwood, 2014).
The aspect-level analysis, or feature-level classification, is the most fine-grained analysis
and focuses on individual aspects of an entity. For instance, given a review about a mobile
phone, the sentiment for each aspect, including camera resolution, screen size, and voice
quality, could be explored. Before recognizing sentiments, it is necessary to extract entities
and aspects in a given corpus, a step collectively called opinion target extraction. How to
identify explicit or implicit entities and aspects is discussed in the literature (Liu, 2015),
such as finding frequent noun phrases, exploiting syntactic patterns, supervised learning
using sequential algorithms (e.g. hidden Markov models, conditional random fields), topic
modelling, and co-occurrence rule mining (Hai, Chang, and Kim, 2011). The aspect-level
analysis is especially valuable in analysing reviews and related works (Jo and Oh, 2011; Thet,
Na, and Khoo, 2010; Xianghua, Guo, Yanyan, and Zhiqiang, 2013).
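A very crude sketch of opinion target extraction follows, using frequent terms as a stand-in for the frequent noun phrases described above. A real system would first run part-of-speech tagging or noun-phrase chunking; the stopword list and the review texts are illustrative assumptions:

```python
from collections import Counter

# A tiny illustrative stopword list; real systems use larger ones.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "but", "i", "it",
             "this", "very", "of", "my", "to", "too"}

def aspect_candidates(reviews, min_freq=2):
    """Propose terms that recur across reviews as aspect candidates,
    ranked by frequency (a frequency-based simplification of
    frequent noun-phrase mining)."""
    counts = Counter(t for r in reviews
                     for t in r.lower().split() if t not in STOPWORDS)
    return [t for t, c in counts.most_common() if c >= min_freq]
```

On a handful of phone reviews, recurring nouns such as 'screen' and 'camera' surface as candidates, while one-off adjectives are filtered out by the frequency cutoff.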
Granularity of Opinion Classifying opinion into binary classes like positive and nega-
tive, thumbs up and thumbs down, like and dislike, is the simplest, and the most studied,
task (e.g., Liu and Zhang, 2012). Simple extensions from binary opinions could consider
a neutral class. Going further, a discrete numeric scale of opinion, such as five-star rating
system, is also exploited in existing literature. Other binary classification tasks include
agreement detection, i.e., whether the opinions are the same or different, and subjectivity
detection, i.e., whether the sentence is subjective or objective. Binary classification is often
too simple to analyse individuals’ thoughts in detail. Thus, many recent works make efforts
to incorporate fine-grained levels of opinion. Fine-grained emotions are widely understood
from two viewpoints: discrete and multi-dimensional emotion. The discrete emotion perspective
argues that our emotions are built from distinct basic emotions. Paul Ekman's theory is a
widely accepted system that distinguishes six basic emotions: happiness, sadness, anger,
surprise, disgust, and fear. On the other hand, a few
studies use a multi-dimensional viewpoint that depicts a sentiment in two dimensions:
valence and arousal.
Domain There is a broad spectrum of domains for sentiment analysis. Domains can be
different in data sources, e.g., microblog, review, news, and blog, as well as in content, e.g.,
commodities, services, health, politics, education, and economics. Typically, each sentiment
analysis task is attuned to one specific domain due to the difference in structures and
words. To illustrate the difference in word use in different domains, consider some simple
sentiment words used in product reviews and political discussions. Words that describe
a product such as ‘beautiful’, ‘malfunctioning’, ‘sturdy’, and ‘heavy’ are likely to appear in
product reviews, whereas ‘support’, ‘disagree’, ‘prosper’, ‘secure’, and ‘progressive’ are likely
to be seen in the context of political discussions. Sentiment analysis tasks that are specialized
for one domain often perform poorly in other domains. To cope with this issue, there have
been several approaches that deal with cross-domain adaptation (Aue and Gamon, 2005;
Glorot, Bordes, and Bengio, 2011; Pan, Ni, Sun, Yang, and Chen, 2010; Yang, Si, and Callan,
2006). For example, Glorot and colleagues performed domain adaptation using a deep
learning technique. Pan and colleagues suggested the Spectral Feature Alignment (SFA)
algorithm that exploits the relationship between domain-specific and domain-independent
words.
There has been a considerable amount of research addressing both the temporal dynam-
ics and rich network of social media. For example, Do, Lim, Kim, and Choi (2016) analysed
emotions in Twitter to investigate the trend of public opinion during the Middle East Respi-
ratory Syndrome (MERS) outbreak in South Korea. De Choudhury, Monroy-Hernández,
and Mark (2014) examined sentiment in Spanish Tweets during the Mexican Drug War.
Analysing social media is a complicated task because the data are unstructured, noisy, and
full of nonstandard language. For instance, Twitter data should be used with caution
because of the short message length (less than 140 characters), novel features (e.g., mentions,
retweets, hashtags), and informal words, including slang, profanity, and emoticons. Despite
the difficulty of analysing Twitter data, such research will continue to gain importance as a
means of evaluating SCI.
Language Most sentiment analysis methods are dependent on emotion lexicons and
training corpora, making sentiment analysis a language-dependent task. In comparison to the
extensive research on textual data written in English and Chinese, there has been limited
research in other languages. The problem is aggravated by the very few non-English annotated
datasets and resources that exist for sentiment analysis. Thus, many recent works focus on language
adaptation (Abbasi, Chen, and Salem, 2008). A typical approach to cope with this language
deficiency is simply to apply machine translation to translate existing corpora into English
(Banea, Mihalcea, Wiebe, and Hassan, 2008). Also, in the efforts to construct non-English
resources, Chen, Brook, and Skiena (2014) used graph propagation methods to generate
emotion lexicons for 136 major languages.
15.3.2.3 Methodology
Most of the existing methodologies for sentiment analysis fall into two categories: super-
vised and unsupervised approaches. The supervised approach trains a classifier using an
annotated dataset and classifies test instances into a finite set of sentiment classes, whereas
unsupervised approaches do not require annotated data sets. In supervised approaches,
sentences have mostly been tagged manually (Do and Choi, 2015a, 2015b) in past literature,
although some researchers employed distant supervision (Go, Bhayani, and Huang, 2009) to
reduce the burden. The method then represents each input instance as a set of features.
Some of the widely used features are n-grams (typically bigrams and trigrams), parts of
speech (especially adjectives and verbs), emotion lexicons, and parse structures, but it varies
by domain and language characteristics. Classifiers are trained using the annotated training
data and feature representations. Some of the most applied machine-learning classification
algorithms are Support Vector Machine (SVM), Logistic Regression (LR), Random Forest
(RF), and Naïve Bayes (NB). Ensemble methods that combine the existing classifiers for a
single task are also actively leveraged in recent works.
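The supervised pipeline can be sketched as follows, with toy training data, unigram and bigram features, and a hand-rolled Naïve Bayes classifier with Laplace smoothing. The data and class names are illustrative; practical systems would use richer features and a library implementation:

```python
import math
from collections import Counter

def ngrams(text):
    """Unigram and bigram features, as in the feature sets above."""
    toks = text.lower().split()
    return toks + [" ".join(toks[i:i + 2]) for i in range(len(toks) - 1)]

class NaiveBayes:
    """A minimal multinomial Naive Bayes sentiment classifier."""
    def fit(self, texts, labels):
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.prior}
        for t, y in zip(texts, labels):
            self.counts[y].update(ngrams(t))
        self.vocab = set(f for c in self.counts.values() for f in c)
        return self

    def predict(self, text):
        def log_post(c):
            total = sum(self.counts[c].values())
            lp = math.log(self.prior[c] / sum(self.prior.values()))
            for f in ngrams(text):
                # Laplace smoothing over the joint vocabulary
                lp += math.log((self.counts[c][f] + 1)
                               / (total + len(self.vocab)))
            return lp
        return max(self.prior, key=log_post)
```

Trained on a few labelled sentences, the classifier picks the class whose feature counts best explain the input.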
In contrast, the unsupervised approach does not require annotated training data.
Instead, it determines the sentiment by the semantic orientation (SO) of the given infor-
mation. The information is classified into an emotion class if the SO is above a threshold.
The lexicon-based approach (Hu and Liu, 2004; Stone, Dunphy, and Smith, 1966) is
the most representative way. It counts or aggregates the corresponding values of predefined
emotion lexicons that appear in the target textual data. For this reason, having good quality
and abundant emotion lexicons is crucial for improving performance. It is known that
the lexicon-based method performs better when it is combined with a machine-learning
algorithm. An alternative unsupervised method uses generative models that deal with
sentiment classification and topic modelling. The models simultaneously separate top-
ics from a document and identify sentiments. Interested readers are referred to Jo and
Oh, 2011; Lin and He, 2009; and Mei, Ling, Wondra, Su, and Zhai, 2007 for further
information.
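The lexicon-based approach can be sketched as follows. The tiny `POSITIVE` and `NEGATIVE` word sets are illustrative placeholders for a real emotion lexicon, and the zero threshold is an assumption:

```python
# Illustrative stand-ins for a predefined emotion lexicon.
POSITIVE = {"good", "great", "excellent", "love", "happy", "nice"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "sad", "poor"}

def semantic_orientation(text):
    """Sum of lexicon hits: +1 per positive word, -1 per negative word."""
    score = 0
    for tok in text.lower().split():
        if tok in POSITIVE:
            score += 1
        elif tok in NEGATIVE:
            score -= 1
    return score

def classify(text, threshold=0):
    """Label the text by whether its semantic orientation clears
    the threshold in either direction."""
    so = semantic_orientation(text)
    if so > threshold:
        return "positive"
    if so < -threshold:
        return "negative"
    return "neutral"
```

As the text notes, performance hinges on lexicon quality and coverage; a weighted lexicon would replace the unit scores with per-word valence values.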
The most actively exploited machine-learning technique in the past few years is the
deep learning approach (Giachanou and Crestani, 2016). The technique utilizes neural
networks (NNs) with many hidden layers. Deep learning has demonstrated its successes
in many fields of research including image recognition and natural language processing
and it is also a promising direction for sentiment analysis. In 2016, during an international
workshop on semantic evaluation (SemEval), multiple representative tasks were selected
for evaluations of computational semantic analysis systems. One of the tasks was about
sentiment analysis on Twitter. Forty-three teams participated in the task and most of the
top-ranked teams used deep learning that included convolutional NNs, recurrent NNs,
and word embedding. Among the methods used by these teams, the word embedding
method was shown to be especially useful in learning continuous representations of words
and phrases that can be leveraged as features for machine-learning classifiers instead of hand-
crafting features. For instance, Tang et al. (2014) encoded the sentiment information into
the representation of words so that positive and negative words are represented differently.
Despite its high performance, the deep learning method requires a huge training corpus and
high computational power to train the weights of the network. Comprehensive surveys of
related papers can be found in Giachanou and Crestani, 2016.
Emotion lexicons are highly utilized in both lexicon-based and machine-learning based
methods. Most of the past literature constructed the lexicons manually. However, this
is undesirable because it requires too much time and laborious effort. A recent work by Hutto
and Gilbert (2014) used crowd sourcing to manually create a large set of positive and
negative lexicons for the purpose of Twitter sentiment analysis. Current sentiment analysis
studies make efforts to construct the lexical resources automatically. Typical approaches for
automatic lexicon construction include statistical methods that investigate the frequency of
lexicons in an annotated corpus (Do and Choi, 2015a), or conjoined occurrence of existing
        John  Mary  Bob  Peter
John     --    1     1    0
Mary     0     --    1    0
Bob      1     0     --   0
Peter    0     0     1    --
Figure 15.1 Graphical (left) and matrix (right) representation of a social network.
Additionally, one may want to know to what extent a network is hierarchical—i.e., how
centralized the network is around a few key nodes (actors, posts, forums). One can compute
the sum of differences between the node with the highest centrality (e.g., betweenness)
and all the other nodes, and then divide this sum of differences by the maximum possible
sum of differences. When this measure is close to 0, it means that the most central node is
not much different from the other nodes—i.e., the network is not hierarchical. However,
if this measure is close to 1, then the most central node is likely connecting all the other
nodes, showing a hierarchical structure between the ‘star’ nodes and the rest of the nodes in
the network.
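This differences-over-maximum measure can be sketched as follows, using degree centrality for brevity; the same formula applies to betweenness or other centrality measures, and the adjacency representation is an illustrative choice:

```python
def degree_centralization(adj):
    """Freeman-style centralization of an undirected network.

    `adj` maps each node to the set of its neighbours. The measure is
    the sum of differences between the highest-centrality node and all
    others, divided by the maximum possible sum (attained by a star)."""
    n = len(adj)
    if n < 3:
        return 0.0
    degrees = [len(nbrs) for nbrs in adj.values()]
    c_max = max(degrees)
    observed = sum(c_max - d for d in degrees)
    maximum = (n - 1) * (n - 2)  # star network: hub n-1, leaves 1
    return observed / maximum
```

A star network scores 1.0 (fully hierarchical, one hub connecting everyone), while a complete network scores 0.0 (no node stands out).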
In a series of studies (Liao and Fu, 2014a,b; 2015), we found that selective exposure is
pervasive. To understand how to mitigate selective exposure, we designed interfaces and
tested how they could encourage people to attend to information inconsistent with their
beliefs. For example, in a recent study (Gao, Do, and Fu, 2017), in addition to semantic
analysis, we included sentiment analysis techniques as discussed earlier to measure the
emotional reactions to different social opinions by people with different existing beliefs.
Figure 15.2 shows the design of the new online social platform. We found that the
addition of sentiment analysis seems to mitigate selective exposure much better
than using only semantic analysis. On this platform, users see sentiments of social opinions
related to a controversial social issue. These opinions were separately presented on two
(or more) areas on the screen. These areas represent different perspectives on the issue by
[Figure 15.2 screenshot: opinion clusters labelled with emotions such as 'Amusement', 'Disgust', 'Anger', and 'Satisfaction', with the prompt 'Click circles to find out how each side sees the same post differently' and a comment thread below.]
Figure 15.2 The interface in Gao et al. (2017). Emotional labels are assigned to social opinions
about the 2016 US presidential election. These labels are then used to organize them into clusters
for users to explore the opinions of supporters of both candidates.
different users. For example, during the 2016 US presidential election, each box represents
opinions from users who support one candidate. Inside each box, opinions are clustered as
circles based on their sentiments, such as ‘anger’, ‘surprise’, ‘pleasure’, etc. These sentiments
represent the reactions to each of the opinions by other users. Users can click on any of the
circles to see the opinions in that sentiment category. The opinions will be displayed at
the bottom. Reactions by other users (who also may have different perspectives) on each
opinion will be displayed. Details of the setup of the platform can be found in Gao, Do,
and Fu, 2017.
In a user study reported in Gao and colleagues (2017), we asked two groups of par-
ticipants to explore a large set of user opinions of the two candidates of the 2016 US
presidential election (Hillary Clinton and Donald Trump) from Reddit.com (the study
was done before the election). This is a highly polarized topic, as supporters of either
candidate tend to have extreme opinions about the other candidate. We recruited partic-
ipants who were supporters of one or the other of the candidates and asked them to use
the interface to understand the social opinions expressed on the platform. One group of
the participants used the new interface, while the other group used the traditional Reddit
interface (which organizes opinions based on topics, i.e., semantic content, only). We found
that participants who used the new interface were more likely to read posts about the
candidates that they did not support, whereas people who used the Reddit interface showed
selective exposure, i.e., they tended to only check opinions about the candidates that they
supported.
This example demonstrated an important characteristic of SCI—the emotional signals
expressed through online social platforms are often magnified and indirectly influence
behaviour at the individual and collective levels. While better organization at the semantic
levels can facilitate cognitive processes that selectively attend to information that is
semantically relevant to the social goals of users, it does not sufficiently address the emotional
component of the social goals. In fact, in many controversial societal issues, the emotional
component plays an important, if not stronger, role in determining user behaviour (e.g.,
which posts they will select, read, and share). A more complete analysis of SCI needs to
consider other social factors that influence user behaviour, which may lead to development
of better online social platforms.
But how do social tags contribute to social indexing of online information, and to what
extent are they effective? One important question is whether the unconstrained, open-
vocabulary approach is effective for creating indices in the long run. For example, if people
use different words to index the same resources, over time these indices may not converge
and their informational value may decrease. To study this, we performed a user study on
how social tags grow when people can see other people’s tags (the social condition), and
compared this with a condition in which people cannot see other people's tags (the nominal
condition). We hypothesized that, when a user can see existing tags created by others, these
tags will serve as cues to invoke semantic representations of the concepts that the tags refer
to, which in turn will influence the user to generate tags that are semantically relevant to the
concepts. If this is true, then social tags may converge and their information value will not
decrease over time. This is because, although people may use different words to represent
similar concepts, the underlying semantic representations should remain relatively stable
and thus, the information value should remain high (when measured at the semantic level,
not the word level). The situation is similar to the growth of natural language—even though
a child learns more and more words as they grow up, the informational value of each word
does not decrease, as the semantic content of each word is enriched by its connections to
meanings in the real world through repeated interactions.
To answer the research question, Fu et al. (2010) represented a social tagging system as a
network, and applied various computational methods from SNA to analyse the social tags.
First, a social tagging system is represented as a tripartite network, in which there are three
main components: a set of users (U), a set of tags created by the users (T), and resources (R)
(URL, books, pictures, movies etc., see Figure 15.3). Resources can be different depending
on the specific purpose of the social tagging system. By looking at any two of the sets, one
can convert the bipartite set of connections into a graphical representation, in which each
node is connected to another if they share the same connections (see Figure 15.4). The
graphical representation enables the social network analyses discussed earlier.
After representing the social tagging system as a network, SNA computational methods
can be used to compare how tags grow as a network in a social group and in a nominal
group. Figure 15.5 shows the node degree distribution of the two groups. Both groups show
a scale-free distribution—i.e., there are a small number of nodes with high degrees (hub
nodes connected to many other nodes), and a long tail of nodes that have few connections
[Figure 15.3 artwork: a tripartite network linking users (U1, U2, …, Ui), tags (T1, T2, …, Tj), and resources (R1, R2, …, Rk).]
[Figure 15.4 artwork: a bipartite layer of users or tags labelled A–K, with the derived one-mode network below it.]
Figure 15.4 The top half shows a bipartite set of connections. The bottom half shows the
corresponding network obtained by linking nodes (tags or users) that are connected to the same
resources.
[Figure 15.5 artwork: two log(freq) versus log(degree) plots, one for the nominal group and one for the social group.]
Figure 15.5 The node degree distributions in the nominal and social groups in the study by Fu
et al. (2010). In the nominal group, participants could not see tags created by others. In the social
group, participants could see tags created by previous participants.
OUP CORRECTED PROOF – FINAL, 21/12/2017, SPi
[Figure 15.6 artwork: the tags career, self-help, exercise, diet, nutrition, advice, health, stress, business, livelihood, and women, laid out under the LSA and CI measures.]
Figure 15.6 Graphical representation of a set of co-occurring tags for “health” in the social
condition. LSA (latent semantic analysis) scores measure the semantic similarity of two tags. The
CI (co-occurrence index) score measures the distance between two tags based on co-occurrence.
(but still connected to the rest of the network through the hub nodes). More importantly,
we found that the average node degree in the social group (14.1) was significantly smaller
than in the nominal group (18.9), and that the social group had a higher clustering coefficient
than the nominal group (0.86 vs. 0.76). This indicates that when people could see others’ tags
in the social group, the tags became more clustered than in the nominal group. This finding is
generally consistent with the hypothesis that social tags converge when people can see tags
created by others.
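Both statistics can be computed directly from an adjacency representation of the tag network. A minimal sketch, using an invented toy graph rather than the study’s data:

```python
from itertools import combinations

def avg_degree(adj):
    """Mean number of neighbours per node in an undirected adjacency dict."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def clustering_coefficient(adj):
    """Mean local clustering coefficient: for each node, the fraction of
    pairs of its neighbours that are themselves connected."""
    coeffs = []
    for node, nbrs in adj.items():
        if len(nbrs) < 2:
            coeffs.append(0.0)      # undefined for degree < 2; count as 0
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs.append(links / (len(nbrs) * (len(nbrs) - 1) / 2))
    return sum(coeffs) / len(coeffs)

# Toy undirected graph as a symmetric adjacency dict
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(avg_degree(adj))              # 2.0
print(clustering_coefficient(adj))  # (1 + 1 + 1/3 + 0) / 4 ≈ 0.583
```

A lower average degree with a higher clustering coefficient, as reported above, corresponds to tags grouping into tighter, more redundant neighbourhoods.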
We used two measures, the intertag co-occurrence index (CI) and latent semantic analysis
(LSA) scores, to measure usage similarity and semantic similarity, respectively. The CI
captures the likelihood of two tags being used together by the same user, whereas LSA
is a statistical estimate of how semantically similar two words are (for more details, see
Fu et al., 2010). Figure 15.6 shows an example of the CI and LSA scores for the tag
‘health’. Tags that are closer to ‘health’ have higher CI and LSA scores, and the sizes of the
circles reflect the frequencies of use. In general, the structures match well. However, there
are notable exceptions. While ‘health’ and ‘advice’ had a high likelihood of being used
together, they had low semantic similarity. It is possible that ‘advice’ was frequently used
together with ‘health’ because ‘health advice’ occurs frequently in online information, so
the co-occurrence could have emerged from a common phrase rather than from shared meaning.
15.5 Conclusion
We defined socio-computer interaction (SCI) as an emerging area of research that studies
the ecology between social behaviour and online artifacts. We outlined distinctive
characteristics of SCI research that demand extensions of the analysis methods commonly
used in traditional HCI research. We discussed a number of computational methods that can
be utilized to analyse SCI behaviour, and demonstrated their uses in two research studies.
These examples were chosen to demonstrate some of the uniqueness of the data collected
in SCI research. In particular, we demonstrated how a combination of computational
methods is often necessary to triangulate the important factors that influence behaviour
at both the individual and social levels, and how the analyses of these factors have
implications for the design of future online social platforms.
....................................................................................................
references
Abbasi, A., Chen, H., and Salem, A., 2008. Sentiment Analysis in Multiple Languages: Feature
Selection for Opinion Classification in Web forums. Transactions on Information Systems, 26(3),
pp. 1–34.
Aue, A., and Gamon, M., 2005. Customizing Sentiment Classifiers to New Domains: A Case Study.
In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP).
Banea, C., Mihalcea, R., Wiebe, J., and Hassan, S., 2008. Multilingual Subjectivity Analysis Using
Machine Translation. In: Proceedings of the 2008 Conference on Empirical Methods in Natural
Language Processing. Red Hook, NY: Curran Associates.
Berger, J., and Milkman, K. L., 2012. What makes online content viral? Journal of Marketing Research,
49(2), pp. 192–205.
Bessette, J., 1994. The Mild Voice of Reason: Deliberative Democracy & American National Government.
Chicago: University of Chicago Press.
Blei, D. M., Ng, A. Y., and Jordan, M. I., 2003. Latent Dirichlet allocation. Journal of Machine Learning
Research, 3(Jan), pp. 993–1022.
Brin, S., and Page, L., 2012. Reprint of: The anatomy of a large-scale hypertextual web search engine.
Computer Networks, 56(18), pp. 3825–33.
Carenini, G., Ng, R. T., and Zhou, X., 2007. Summarizing email conversations with clue words. In:
Proceedings of the 16th International Conference on World Wide Web. New York, NY: ACM,
pp. 91–100.
Carroll, J. M., Kellogg, W. A., and Rosson, M. B., 1991. The Task-Artifact Cycle. In: J. M. Carroll,
ed. Designing Interaction: Psychology at the Human-Computer Interface. Cambridge: Cambridge
University Press.
Chen, Y., and Skiena, S., 2014. Building Sentiment Lexicons for All Major Languages. In:
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Red Hook,
NY: Curran Associates.
Choi, F. Y. Y., Wiemer-Hastings, P., and Moore, J., 2001. Latent Semantic Analysis for Text Segmen-
tation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. New
York, NY: ACM, pp. 109–17.
Davidov, D., Tsur, O., and Rappoport, A., 2010. Semi-Supervised Recognition of Sarcastic Sentences
in Twitter and Amazon. In: CoNLL ‘10: Proceedings of the 14th Conference on Computational Natural
Language Learning. New York, NY: ACM, pp. 107–16.
De Choudhury, M., Monroy-Hernández, A., and Mark, G., 2014. ‘Narco’ Emotions: Affect and
Desensitization in Social Media during the Mexican Drug War. In: Proceedings of the 32nd Annual
ACM Conference on Human Factors in Computing Systems. New York, NY: ACM.
Ding, X., Liu, B., and Zhang, L., 2009. Entity discovery and assignment for opinion mining applica-
tions. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining. New York, NY: ACM Press.
Do, H. J., and Choi, H.-J., 2015a. Korean Twitter Emotion Classification Using Automatically
Built Emotion Lexicons and Fine-grained Features. In: Proceedings of the 29th Pacific Asia
Conference on Language, Information and Computation. Red Hook, NY: Curran Associates,
pp. 142–50.
Do, H. J., and Choi, H.-J., 2015b. Sentiment Analysis of Real-life Situations Using Location, People
and Time as Contextual Features. In: Proceedings of the 3rd International Conference on Big Data
and Smart Computing. New York, NY: IEEE.
Do, H. J., Lim, C. G., Kim, Y. J., and Choi, H.-J., 2016. Analyzing Emotions in Twitter During a Crisis:
A Case Study of the 2015 Middle East Respiratory Syndrome Outbreak in Korea. In: Proceedings
of the 3rd International Conference on Big Data and Smart Computing. New York, NY: IEEE.
Du, W., Tan, S., Cheng, X., and Yun, X., 2010. Adapting Information Bottleneck Method for Automatic
Construction of Domain-oriented Sentiment Lexicon. In: Proceedings of the 3rd International
Conference on Web Search and Data Mining. New York, NY: ACM Press.
Eisenstein, J., and Barzilay, R., 2008. Bayesian unsupervised topic segmentation. In: Proceedings of the
Conference on Empirical Methods in Natural Language Processing. Red Hook, NY: Curran Associates,
pp. 334–43.
Freeman, L. C., 2004. The development of social network analysis: A study in the sociology of science.
Vancouver: Empirical Press.
Fu, W.-T., Kannampallil, T. G., Kang, R., and He, J., 2010. Semantic imitation in social tagging. ACM
Transactions on Computer-Human Interaction, 17(3), pp. 1–37.
Galley, M., McKeown, K., Fosler-Lussier, E., and Jing, H., 2003. Discourse segmentation of multi-
party conversation. In: Proceedings of the 41st Annual Meeting on Association for Computational
Linguistics.Volume 1. Red Hook, NY: Curran Associates, pp. 562–9.
Gao, M., Do, H. J., and Fu, W.-T., 2017. An intelligent interface for organizing online opinions on
controversial topics. In: IUI ’17: Proceedings of the ACM conference on Intelligent User Interfaces.
Limassol, Cyprus, 13–16 March 2017.
Giachanou, A., and Crestani, F., 2016. Like It or Not: A Survey of Twitter Sentiment Analysis Methods.
ACM Computing Surveys, 49(2), pp. 1–41.
Glorot, X., Bordes, A., and Bengio, Y., 2011. Domain Adaptation for Large-Scale Sentiment Classifi-
cation: A Deep Learning Approach. In: Proceedings of the 28th International Conference on Machine
Learning. New York, NY: ACM Press.
Go, A., Bhayani, R., and Huang, L., 2009. Twitter Sentiment Classification using Distant Supervi-
sion. Technical Report, Stanford. Available at: http://cs.stanford.edu/people/alecmgo/papers/
TwitterDistantSupervision09.pdf.
Gottfried, J., and Shearer, E., 2016. News use across social media. Washington, DC: Pew
Research Center. Available at: http://www.journalism.org/2016/05/26/news-use-across-social-
media-platforms-2016/.
Hai, Z., Chang, K., and Kim, J. J., 2011. Implicit Feature Identification via Co-occurrence Association
Rule Mining. In: Proceedings of the 12th International Conference on Intelligent Text Processing and
Computational Linguistics. Tokyo, Japan, 20–26 February 2011. New York, NY: Springer.
Hatzivassiloglou, V., and McKeown, K. R., 2009. Predicting the Semantic Orientation of Adjectives.
ACM Transactions on Information Systems, 21(4), pp. 315–46.
Hearst, M. A., 1997. Texttiling: segmenting text into multi-paragraph subtopic passages. Computa-
tional Linguistics, 23(1), pp. 33–64.
Hoque, E., and Carenini, G., 2016. Multiconvis: A visual text analytics system for exploring a collection
of online conversations. In: Proceedings of the 21st International Conference on Intelligent User
Interfaces. New York, NY: ACM, pp. 96–107.
Hu, M., and Liu, B., 2004. Mining and summarizing customer reviews. In: KDD ‘04: Proceedings
of the 2004 SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle,
Washington, 22–25 August 2004. New York, NY: ACM, pp. 168–77.
Hutto, C. J., and Gilbert, E., 2014. VADER: A Parsimonious Rule-based Model for Sentiment Analysis
of Social Media Text. In: Proceedings of the 8th International AAAI Conference on Weblogs and Social
Media. Ann Arbor, Michigan, 1–4 June 2014. Palo Alto, CA: AAAI Press.
Jo, Y., and Oh, A. H., 2011. Aspect and Sentiment Unification Model for Online Review Analysis. In:
Proceedings of the 4th ACM International Conference on Web Search and Data Mining. New York,
NY: ACM.
Joty, S., Carenini, G., and Ng, R. T., 2013. Topic segmentation and labeling in asynchronous conver-
sations. Journal of Artificial Intelligence Research, 47, pp. 521–73.
Landauer, T. K., and Dumais, S. T., 1997. A solution to Plato’s problem: The latent semantic analysis
theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2),
pp. 211–40.
Liao, Q. V., and Fu, W.-T., 2014a. Can you hear me now? Mitigating the echo chamber effect by source
position indicators. In: Proceedings of the 17th ACM conference on Computer Supported Cooperative
Work and Social Computing. New York, NY: ACM, pp. 184–96.
Liao, Q. V., and Fu, W.-T., 2014b. Expert voices in echo chambers: effects of source expertise indicators
on exposure to diverse opinions. In: Proceedings of the 32nd annual ACM Conference on Human
Factors in Computing Systems. New York, NY: ACM, pp. 2745–54.
Lin, C., and He, Y., 2009. Joint Sentiment/Topic Model for Sentiment Analysis. In: Proceedings of
the 18th ACM Conference on Information and Knowledge Management. Hong Kong, China, 2–6
November 2009. New York, NY: ACM, pp. 375–84.
Liu, B., 2015. Sentiment Analysis. Cambridge: Cambridge University Press.
Liu, B., and Zhang, L., 2012. A Survey of Opinion Mining and Sentiment Analysis. In: Charu C.
Aggarwal and C. Zhai, eds. Mining Text Data. New York, NY: Springer, pp. 415–63.
Malioutov, I., and Barzilay, R., 2006. Minimum cut model for spoken lecture segmentation. In:
Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual
meeting of the Association for Computational Linguistics. Red Hook, NY: Curran Associates,
pp. 25–32.
Maynard, D., and Greenwood, M. A., 2014. Who Cares About Sarcastic Tweets? Investigating the
Impact of Sarcasm on Sentiment Analysis. In: Proceedings of the 9th International Conference on
Language Resources and Evaluation. Reykjavik, Iceland, 26–31 May 2014. Red Hook, NY: Curran
Associates.
Mei, Q., Ling, X., Wondra, M., Su, H., and Zhai, C., 2007. Topic Sentiment Mixture: Modeling Facets
and Opinions in Weblogs. In: Proceedings of the 16th International Conference on World Wide Web.
New York, NY: ACM Press.
Mihalcea, R., and Tarau, P., 2004. TextRank: Bringing Order into Texts. In: Proceedings of the 2004
Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for
Computational Linguistics.
Miller, G. A., 1995. WordNet: A Lexical Database for English. Communications of the ACM, 38(11),
pp. 39–41.
Narayanan, R., Liu, B., and Choudhary, A., 2009. Sentiment Analysis of Conditional Sentences. In:
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. New York, NY:
ACM Press.
Noelle-Neumann, E., 1974. The Spiral of Silence: A Theory of Public Opinion. Journal of Communi-
cation, 24(2), pp. 43–51.
Pan, S. J., Ni, X., Sun, J., Yang, Q., and Chen, Z., 2010. Cross-Domain Sentiment Classification via
Spectral Feature Alignment. In: Proceedings of the 19th International Conference on World Wide Web.
New York, NY: ACM Press.
Purver, M., Griffiths, T. L., Körding, K. P., and Tenenbaum, J. B., 2006. Unsupervised topic modelling
for multi-party spoken discourse. In: Proceedings of the 21st International Conference on Computa-
tional Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Red
Hook, NY: Curran Associates, pp. 17–24.
Stone, P. J., Dunphy, D. C., and Smith, M. S., 1966. The General Inquirer: A Computer Approach to
Content Analysis. Cambridge, MA: MIT Press.
Sunstein, C. R., 2009. Republic.com 2.0. Princeton, NJ: Princeton University Press.
Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., and Qin, B., 2014. Learning Sentiment-Specific
Word Embedding. In: Proceedings of the 52nd Annual Meeting of the Association for Computational
Linguistics. Red Hook, NY: Curran Associates, pp. 1555–65.
Thet, T. T., Na, J.-C., and Khoo, C. S. G. K., 2010. Aspect-based Sentiment Analysis of Movie Reviews
on Discussion Boards. Journal of Information Science, 36(6), pp. 823–48.
Valentino, N. A., Banks, A. J., Hutchings, V. L., and Davis, A. K., 2009. Selective exposure in the
internet age: The interaction between anxiety and information utility. Political Psychology, 30(4),
pp. 591–613.
Wei, F., Liu, S., Song, Y., Pan, S., Zhou, M. X., Qian, W., Shi, L., Tan, L., and Zhang, Q., 2010. Tiara:
a Visual Exploratory Text Analytic System. In: Proceedings of the 16th ACM SIGKDD international
conference on Knowledge discovery and data mining. New York, NY: ACM, pp. 153–62.
Wilson, T., Hoffmann, P., Somasundaran, S., Kessler, J., Wiebe, J., Choi, Y., Cardie, C., Riloff, E.,
and Patwardhan, S., 2005. OpinionFinder: A System for Subjectivity Analysis. In: Proceedings of
HLT/EMNLP on Interactive Demonstrations. Red Hook, NY: Curran Associates.
Xianghua, F., Guo, L., Yanyan, G., and Zhiqiang, W., 2013. Multi-aspect Sentiment Analysis for
Chinese Online Social Reviews Based on Topic Modeling and HowNet lexicon. Knowledge-Based
Systems, 37, pp. 186–95.
Yang, H., Si, L., and Callan, J., 2006. Knowledge Transfer and Opinion Detection in the TREC2006
Blog Track. In: Proceedings of the 15th Text Retrieval Conference. Gaithersburg, Maryland, 14–17
November 2006. Gaithersburg, MD: NIST.
INDEX
ACT-R 116, 259, 343, 344, 349, 353–4, 372
adaptation 33, 53, 68, 91, 117, 188, 287–92, 305, 406; see also personalization
aesthetics 10–11, 102, 110–12, 154–7, 159–63, 176–79
agents 9, 26, 33–37, 158, 289–302, 306, 311–15, 345, 369, 388, 404
aiming, models of 23–7; see also Fitts’ law
appearance design 168, 174
artificial intelligence 1, 4, 9, 37, 289, 365, 369
assignment problems 101, 106–14; see also optimization
augmentative and alternative communication (AAC) 54–6
automation 3, 4, 5, 6, 11, 91, 156, 198, 207–8, 274, 356, 378, 382, 384, 390–1, 393
back-off 48
Bayesian
  beliefs 290, 302, 372
  filters 8
  inference 6, 378
  hypothesis testing 393
  optimization 171, 173, 176
  updating 372
  see also decoding; probability
Burkard, Rainer 101
camera control 82, 168–9
Card, Stuart 100–1, 117, 191, 225, 258, 259–60, 292, 312, 321, 335, 345, 371, 378–81, 393
cardinality 107
Carroll, John 2, 4, 99, 117, 400
channel capacity 46–7
clarity, gestural 136–44, 150
classification 28, 37, 69, 382–3, 389, 404–5; see also recognition
cockpits 243, 252–3, 265–6, 347
cognitive architectures 305, 343–4, 348–9; see also models
cognitive computation 363–5; see also cognitive architectures
cognitive constraint modelling 344–5, 350, 353–7
colour enhancement 155, 157, 168, 173–4
communication theory, see information theory
completion metric 146; see also text entry
computational interaction 1–11, 17–18, 36, 43, 377–8, 393
  definition 4
  vision 5
computational rationality 111, 292
computational thinking 4, 6
control
  loops 17–8, 20–2, 36–7
  optimal 20, 292, 303
  order 24, 35
  shared 35
  theory 1, 8–9, 17–37, 137
cost function 27, 33, 36–7, 72, 75, 85–6, 112, 167, 316–17, 322–30, 334–5; see also objective function
creative design 180
creativity 3, 6, 11, 117, 180; see also creative design
crowd computing 6, 54, 91, 116, 154; see also human computation
crowdsourcing 157–71, 174–80, 407
curse of knowledge, the 245
curves, lines, corners (CLC) 137–8; see also Fitts’ law
cybernetics 9, 20, 199
Dasher 46, 55–6
data analysis 381; see also data science
data science 4, 178; see also machine learning
datasets 4–5, 70–6, 86, 91, 147–9, 390–1, 406
decision boundary 81
decision making 1, 6, 9, 69, 112, 288–9, 299, 303, 369; see also Markov decision process
decision space 11
decision structure 100
decision surface 78–9
decision theory 7, 388
decision trees 80
decision variable 98, 104–5
decoding 49, 52–3, 58–60, 144–6, 150
deep learning 76, 85–92, 406–7
  CNNs 88
  RNNs 88–90
  see also neural networks; machine learning
design 1–11, 17, 22–9, 33, 35–7, 43–4, 46, 53–61, 65–8, 80, 97–117, 121–27, 139–41, 146–7, 187–93, 197–8, 204–7, 213–20, 223–31, 240–6, 249–53, 311–14, 341, 349, 353–4, 363–71, 399–400
  computational 153–80
  objective 5, 73, 104, 180
  parametric 154–61, 179–80
standardization 250, 294
state transition 188, 190, 192, 207, 232, 299, 301
strategy space 344, 352
strong typing 236
support vector machines 8, 77–8, 83, 407
text entry 8, 43–60, 83, 121–4, 144, 232–3, 243, 382
  automatic completion 53, 122, 146
  automatic correction 52
  error correction 44, 52, 56–7, 61, 126, 146
  gestural 25, 135
  intelligent 43
  optimization 121–42
  see also number entry; optimization; recognition; words per minute
tokens 44, 49–51, 53, 56–8, 60, 254–5, 263–5, 277, 402
touch
  Midas problem 55
  modelling 49–53, 145, 149
touchpad 28, 33, 205
touchscreen 23, 24, 52–5, 58–60, 121, 123–7, 130, 132, 135, 139, 141, 150, 349
tracking
  object 20, 25–6, 30, 50, 84
  see also recognition
transfer function 20, 24
uncertainty 7, 20, 26, 32, 36, 45, 53, 60, 78–81, 290, 343, 346–8, 353–7
unified modeling language (UML) 189, 192–5, 199, 231
user-centered design 6, 99, 213–14, 249, 367
user experience 65, 100, 112, 116, 121–2, 150, 176, 213–14, 246, 250, 369
user interface
  description languages 101
  design 11, 97, 99–110, 191, 213–14, 217, 227–8, 231, 244, 246
  graphical 9, 23, 46, 66, 69, 106–7, 111–12, 116, 121, 156, 199
  management systems 191
user models 250, 258–60, 268
variability 10, 25, 28, 32, 35, 154, 260, 344–8, 382, 386
visualizations 2, 9, 21, 100, 102, 166, 168–9, 172, 223, 288, 299–300, 306, 378, 381–3, 408–9
wicked challenges 3
word confusion network 51, 56–9
words per minute (WPM) 43, 53–7, 125, 138, 140, 143–4
York approach, the 191
Z 188, 192
Zipf’s law 47
Zwicky, Fritz 100