Valid Useful UX Measurement

See discussions, stats, and author profiles for this publication at: https://www.researchgate.
net/publication/259823102
Valid Useful User Experience Measurement
Book · January 2008
CITATIONS READS
5 4,842
5 authors, including:
Georgios Christou Mark Springett

European University Cyprus Middlesex University, UK
37 PUBLICATIONS 357 CITATIONS 45 PUBLICATIONS 436 CITATIONS
SEE PROFILE SEE PROFILE
Marta Larusdottir
Reykjavik University
76 PUBLICATIONS 826 CITATIONS
SEE PROFILE
Some of the authors of this publication are also working on these related projects:
A Multi-Dimensional Data Model for Personal Photo Browsing View project
SPARKS - Rethinking Innovation. Together. View project
All content following this page was uploaded by Georgios Christou on 29 January 2015.
The user has requested enhancement of the downloaded file.

Proceedings of the
International Workshop on
eaningful
easures :
Valid Useful User
Experience
Measurement
(VUUM)
Reykjavik, Iceland y June 18th 2008
Effie L-C. Law

Nigel Bevan
Georgios Christou
Mark Springett
& Marta Lárusdóttir
(Editors)
http://cost294.org/
Proceedings of the International Workshop on Meaningful Measures: Valid Useful User Experience
Measurement (VUUM)
Editors: Effie L-C. Law, Nigel Bevan, Georgios Christou, Mark Springett, and Marta Lárusdóttir
Publisher: Institute of Research in Informatics of Toulouse (IRIT) - Toulouse, France
ISBN: 978-2-917490-02-0
© 2008 Copyright for the individual papers by the papers' authors. Copying permitted for private and
academic purposes. Re-publication of material on this page requires permission by the copyright
owners.
VUUM 2008
Acknowledgements
First of all, we are very grateful to the local organizers – University of Iceland and Reykjavik University,
especially Marta Lárusdóttir and Ebba Hvannberg, who have strongly supported us to hold our 5th COST294-
MAUSE Open Workshop “Meaningful Measures: Valid Useful User Experience Measurement (VUUM)”
(http://www.cost294.org/vuum). Thanks must also go to the authors of the workshop’s papers, whose
contributions serve as rich sources of stimulation and inspiration to explore the issues of interest from multiple
perspectives. The quality of the contributions could further be ensured and improved with the generous help of the
program committee members (Table 1). Their effective and efficient review works are highly appreciated.
Besides, we are grateful to our Dissemination Team, Marco Winckler and Philippe Palanque, for designing,
printing and transporting the printed proceedings to the workshop venue.
Table 1: List of the reviewers of the 5th COST294-MAUSE workshop VUUM, 18th June 2008
Name Affiliation Country

Bevan, Nigel Professional Usability Service UK
Christou, Georgios European University Cyprus Cyprus
Cockton, Gilbert University of Sunderland UK
Gulliksen, Jan Uppsala University Sweden
Hassenzahl, Marc University of Koblenz-Landau Germany
Hornbaek, Kasper University of Copenhagen Denmark
Hvannberg, Ebba University of Iceland Iceland
Jokela, Timo Joticon Finland
Larusdottir, Marta Reykjavik University Iceland
Law, Effie ETH Zurich/University of Leicester Switzerland/UK
Springett, Mark Middlesex University UK
Last but not least, we express gratitude to our sponsor – COST (European Cooperation in the field of Scientific
and Technical Research; http://cost.cordis.lu/src/home.cfm). The COST Office operated by the European
Science Foundation (ESF) provides scientific, financial and administrative support to COST Actions. Specifically,
the COST Action 294 (http://www.cost294.org), which is also known as MAUSE, was officially launched in
January 2005. The ultimate goal of COST294-MAUSE is to bring more science to bear on Usability Evaluation
Methods (UEM) development, evaluation, and comparison, aiming for results that can be transferred to industry
and educators, thus leading to increased competitiveness of European industry and benefit to the public. The
current Workshop is the second open workshop implemented under the auspices of COST294-MAUSE. As with
other past and forthcoming events of COST294-MAUSE, we aim to provide the participants with enlightening
environments to further deepen and broaden their expertise and experiences in the area of usability.
1
VUUM 2008
Table of Contents
Meaningful Measures: Valid Useful User Experience Measurement (VUUM): Preface pp. 3-7
Effie L-C. Law & the VUUM Program Committee
Developing Usability Methods for Multimodal Systems: The Use of Subjective and Objective
Measures pp. 8 -12
Anja B. Naumann & Ina Wechsung
Classifying and Selecting UX and Usability Measures pp. 13-18

Nigel Bevan
Towards Practical User Experience Evaluation Methods pp. 19-22

Kaisa Väänänen-Vainio-Mattila, Virpi Roto & Marc Hassenzahl
Exploring User Experience Measurement Needs pp. 23-26

Pekka Ketola & Virpi Roto
Combining Quantitative and Qualitative Data for Measuring User Experience of an Educational
Game pp. 27-31
Carmelo Ardito, Paolo Buono, Maria F. Costabile, Antonella De Angeli, Rosa Lanzilotti
Is User Experience Supported Effectively in Existing Software Development Processes? pp. 32-37
Mats Hellman & Kari Rönkkö
On Measuring Usability of Mobile Applications pp. 38-44

Nikolaos Avouris, Georgios Fiotakis & Dimitrios Raptis
Use Experience in Systems Usability Approach pp. 45-48

Leena Norros & Paula Savioja
Developing the Scale Adoption Framework for Evaluation (SAFE) pp. 49-55
William Green, Greg Dunn & Jettie Hoonhout
A Two-Level Approach for Determining Measurable Usability Targets pp. 56-59

Timo Jokela
What Worth Measuring Is pp. 60-66

Gilbert Cockton
Is what you see what you get? Children, Technology and the Fun Toolkit pp. 67-71
Janet Read
Comparing UX Measurements, a case study pp. 72-78

Arnold P.O.S. Vermeeren, Joke Kort, Anita H.M. Cremers & Jenneke Fokker
Evaluating Migratory User Interfaces pp. 79-85

Fabio Paterno, Carmen Santoro & Antonio Scorcia
Assessing User Experiences within Interaction: Experience as a Qualitative State and Experience
as a Causal Event pp. 86-90
Mark Springett
Only Figures Matter?– If Measuring Usability and User Experience in Practice is Insanity or a
Necessity pp. 91-96
Jan Gulliksen, Åsa Cajander, Elina Eriksson
Measuring the User Experience of a Task Oriented Software pp. 97-102

Jonheidur Isleifsdottir & Marta Larusdottir
2
VUUM 2008
Meaningful Measures: Valid Useful User Experience

Measurement (VUUM)
Preface
Effie L-C. Law & the VUUM Program Committee1
COST294-MAUSE (http://www.cost294.org)
ABSTRACT Lord Kelvin’s dictum on measurement is frequently quoted

In this Preface we first describe the motives underlying the to justify the quantification of theoretical concepts in
workshop VUUM. Next, we provide a short summary of physical sciences, computer science, engineering and social
each of the sixteen accepted submissions, which are sciences [1]. Much HCI evaluation aspires to its scientific
grouped into five categories, namely: Overviews, Validity, philosophy. However, while some HCI researchers and
Comparisons, Commercial relevance, and UX in Particular practitioners are strongly convinced about the need for
Application Contexts. Correspondingly, some questions for measurement, others are ambivalent about the role of
further discussion are raised. numerical values in providing useful, valid and meaningful
assessments and understanding of complex interactions
Author Keywords between humans and machines. Some go further and deny
VUUM, Usability measurements, User experience, the measurability of affective states such as love, beauty,
Meaningfulness, Validity, Usefulness happiness, and frustration. Strictly, one can measure
(almost) anything in some arbitrary way. The compelling
ACM Classification Keywords concern, however, is whether the measure is meaningful,
H5.m. Information interfaces and presentation (e.g., HCI): useful and valid to reflect the state or nature of the object or
Miscellaneous. event in question.
What is measurement actually? One key definition is “the
assignment of numbers to objects or events in accordance
BACKGROUND with a rule of some sort” ([9], p.384); a process that is seen
“To measure is to know” as essential to the empirical determination of functional
“If you cannot measure it, you cannot improve it” relations. Alternatively, measurement can be defined as “an
observation that reduces an uncertainty expressed as a
(Lord Kelvin, a.k.a. Sir William Thomson, n.d.) quantity” (our italics, [5]). In usability evaluation, the
uncertainty is the quality-in-use; the observation can
objectively be taken by usability professionals (e.g. task
completion time) or subjectively by users (e.g. self-
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
perceived duration) [2]. The debate on objective vs.
not made or distributed for profit or commercial advantage and that copies subjective measurements animates many HCI discussions.
bear this notice and the full citation on the first page. To copy otherwise, or In dispute are not only which type of measure is more
republish, to post on servers or to redistribute to lists, requires prior specific appropriate, but also whether and how they are related and
permission and/or a fee.
under which conditions [4]. More important perhaps
VUUM2008, June 18, 2008, Reykjavik, Iceland. however, is the question of how to interpret measurements
1
The list is too long to be presented on the title page: Special contributions from Nigel Bevan, Gilbert Cockton, and
Georgios Christou in preparing the call for papers and in reviewing the submissions together with other Program Committee
members: Mark Springett, Marta Lárusdóttir, Ebba Hvannberg, Kasper Hornbæk, Marc Hassenzahl, Jan Gulliksen,
and Timo Jokela.
3
VUUM 2008
taken and use them to support improvement of an • To identify practical strategies for selecting appropriate
interaction design. usability measures and instruments that meet contextual
requirements, including commercial contexts;
Usability manifests as quality in design, in interaction and
in value [8], with diverse measures from many methods and • To explore the notion of qualia from the philosophical
instruments [3]. The challenge is how to select appropriate perspective and its practical implications for usability
measures to address the particularities of an evaluation engineering;
context. The definitions above strongly support the • To identify whether non-measurable properties of
necessity and utility of usability measures, which should usability/UX exist, and propose alternative critical
provide data either for improving the system under scrutiny methods and techniques for their assessment;
(i.e. formative evaluation), and/or for comparing different • To extend the range of measures to currently tangible
versions of a system or assessing whether user requirements but unmeasured and under-used physical and other
have been achieved (i.e., summative evaluation). However, properties, such as affordances and constraints inherent
both the construct validity and predictive power of some in interfaces;
usability measures are of particular concern. For instance, a • To review and analyse the theoretical frameworks
count of user errors cannot accurately reflect the quality of underlying different usability measures;
interaction, nor can it well predict the actual adoption of a • To examine the applicability of existing usability
product/service, because there are inconsistent ways to measures to new interaction styles;
categorize and quantify errors [4].
Whereas some qualities of intangible interactions/products CATEORIZATION OF SUBMISSIONS
may be considered as non-measurable, there are Sixteen accepted submissions, which are authored by
un-measured but tangible qualities such as affordances and experienced usability and UX researchers and practitioners,
constraints of interfaces. Some researchers argue that they cover the aforementioned objectives to various extents.
are unmeasured not because they have nothing to do with Each of the submissions has been peer reviewed by at least
usability, but because no suitable measure that translates two members of the Program Committee. They are
them well into usability outcomes exists. Furthermore, categorized into five major categories: Overview, Validity,
there is a substantial philosophical literature on qualia [6, Comparisons, Commercial relevance, and UX in particular
10], which are not even directly detectable, never mind application contexts. Subsequently, each submission is
measurable (cf. physicists struggling with the problem of briefly described.
sub-particle physics). It is intriguing to consider indirect
measurement and inference of qualia. Besides, we should Category 1: Overviews
consider alternative approaches from the fine arts where This category covers ISO standards, organizational
there are not systematic measures, but critical assessments stakeholder enquiry approach, systems usability approach,
of artefacts. and worth-centred approach.
Most importantly, all sorts of measurements should be • Nigel Bevan: Classifying and Selecting Usability
rooted in sound theories, usability measures are no Measures
exception. Otherwise, they are just numbers, as remarked The paper refers to a set of old and new standards related
by Thomas S. Kuhn [7]: to usability, accessibility and user experience. The author
"The route from theory or law to measurement can attempts to integrate the interpretations of usability and
almost never be traveled backwards. Numbers UX extracted from several related standards. It is deemed
gathered without some knowledge of the regularity challenging to articulate the intricate inter-relationships
to be expected almost never speak for themselves. between the standards. Nonetheless, it is relevant to know
Almost certainly they remain just numbers." (p.44) how the ISO addresses the UX definition and measures.
GOAL & OBJECTIVES • Pekka Ketola & Virpi Roto: Towards Practical UX
Evaluation Methods
The overall goal of the workshop VUUM is to understand The paper presents the findings of a survey conducted
challenges relating to measures of usability and user with personnel in different units and levels of Nokia
experience (UX), and to identify effective practical about UX measurement needs. The authors characterize
responses to these challenges. The following objectives are UX in two interesting ways, as a longitudinal relationship
addressed: that continues post-use, and as a set of intangible benefits
• To gather evidence of the contextual bases for both for end user and for the organization. The paper also
meaningful and useful usability and UX measures; contributes to an undervalued body of knowledge about
• To identify validity and reliability concerns for specific the needs and perspectives of different roles within
usability measures; organizations.
4
VUUM 2008
• Leena Norros & Paula Savioja: User Experience in the • Jan Gulliksen, Asa Cajander & Elina Eriksson: Only
Systems Usability Approach Figures Matter? – If Measuring Usability and User
The paper discusses user experience in work-related Experience in Practice is Insanity or a Necessity
settings and in the operation of complex systems under The paper presents two case studies to illustrate how
the conceptual framework of systems usability and usability measures have been used in industry. The
activity theory. It highly advocates contextual analysis authors put forward some contentious arguments about
and more holistic perspectives. The authors present in this the necessity and utility of usability measures. The
short paper some strong arguments for observations and controversies lie in the particularities of organizational
interviews as the main data source (i.e. informal goals, which strongly influence whether and how
documentation methods). usability measures are used.
• Gilbert Cockton: What Worth Measuring is
• Jonheidur Isleifsdottir & Marta Lárusdóttir:
The paper presents a very useful and insightful overview
Measuring the User Experience of a Task Oriented
of the state-of-the-art of worth-centred design (WCD),
Software
and some contentious claims about when and why
The paper uses AttrakDiff2 to probe user experience with
emotion should be measured (i.e. “measuring emotions
a business tool. AttrakDiff2 is new, ambitious and
has to be related to their role”). The paper raises a
requires examination. The authors administer it before
number of fundamentally important issues and describes
and after think-aloud test. The timing of UX
a useful approach to understanding UX in context. It is
measurement is an issue to explore.
convincing argued that worth centredness places user-
experience design and measurement meaningfully in
Discussion: Shall we fix the constructs to be measured or
context.
fix the measuring tools/methods to render usability and UX
Discussion: measures more meaningful? How is the meaningfulness of a
What is actually new in the alternative approaches to usability measure determined by contextual factors?
usability and UX measures?
Category 3: Comparisons
Category 2: Validity This category addresses the issue of subjective vs. objective
measures, field vs. lab-based settings, and triangulation of
This category covers the issues on psychometric properties data from multiple resources.
of scales, the notion of qualia, the utility of usability
measures in industry, and the practical use of a newly • Anja B. Naumann & Ina Wechsung: Developing
developed scale. Usability Methods for Multimodal Systems: The Use of
Subjective and Objective Measures
• William Green, Greg Dunn & Jettie Hoonhout: The paper addresses the controversial and yet unsettled
Developing the Scale Adoption Framework for issue about the relationship between the subjective and
Evaluation (SAFE)
objective usability/UX measures. In addition, it looks into
In this paper the authors argue for the use of established the variable interaction modality (mono vs. multi) when
scales for subjective measurements. This paper presents a examining the extensibility of the existing usability
good account of psychometric issues for scale
evaluation methods. Another interesting point is to
development and choice. It summarizes the approach contrast within-data-types correlations with between-
taken in psychology and provides a framework that is data-types correlations.
relevant for HCI.
• Carmelo Ardito et al.: Combining Quantitative and
• Mark Springett: Assessing User Experiences Within Qualitative Data for Measuring User Experience of an
Interaction: Experience as a Qualitative State and Educational Game
Experience as a Causal Event This paper demonstrates the applications of multiple
The paper presents a nice discussion on the notion of measurement techniques and methods to evaluate the
qualia, which is addressed in the literature of UX, albeit complex user experiences engendered by playing the
to a limited extent. It also attempts to link the educational game called Explore! The work reported
persuasiveness of the evidence with first person felt states interestingly revealed aspects of authentic field evaluation
(i.e. qualia) based on some case studies. The author puts and its advantages over contrived evaluation settings.
forward an interesting argument that the ‘soft’ discipline
of user-experience evaluation is not significantly softer • Arnold Vermeeren et al.: Comparing UX
than more traditional measurement of HCI phenomena. Measurements: a Case Study
The paper reports different approaches to capturing data
on the usage of a P2P software application. The authors
5
VUUM 2008
present some interesting empirical studies – field tests, a comprehensive literature review as well as a nice
lab tests and expert reviews. The authors demonstrate summary of the current practice as reported in several
how to triangulate different types of data from related papers of CHI 2008. This material gives a good
longitudinal and field studies. ground for discussing what usability measurements for
mobile applications are selected from a range of choices.
Discussion: Under which conditions are certain types of
• Janet C. Read: Is what you see what you get? Children,
usability and UX measures correlated?
Technology and the Fun Toolkit
The paper addresses the validity and reliability of the UX
Category 4: Commercial Relevance measures with children. The paper describes some
interesting empirical studies on surveying children’s
This category addresses the practical applications of
prediction and experience in using different types of
emerging usability and UX evaluation methods in industry.
technologies.
• Mats Hellman & Kari Rönkkö: Is User Experience
supported effectively in existing software development • Fabio Paterno, Carmen Santoro & Antonio Scorica:
Evaluating Migratory User Interfaces
processes?
The paper describes an empirical evaluation study of
This paper addresses a very valid issue of monitoring UX
migratory user interfaces that allow users to change
into the software development process with a case study
device and continue the task from the point where they
of a mobile phone company. The authors illustrate how
left. A new evaluation paradigm is clearly described
the UX quality can be integrated into the traditional
along with key usability and UX issues specific to it. A
usability framework. They also point out the definitional
useful list of identified issues relevant to migratory
problem of UX and the tradeoff between different
interfaces is identified.
usability metrics such as efficiency and satisfaction.
Discussion: What can we learn from the process of
• Timo Jokela: A Two-Level Approach for Determining adapting the existing usability and UX evaluation
Measurable Usability Targets methods or creating new ones to address context-specific
The paper makes a succinct and important contribution by characteristics?
distinguishing between business-relevant strategic
usability goals and detailed operational usability goals.
Different usability measures are required for each of this CONCLUDING REMARKS
two-tier usability targets.
The workshop VUUM brings together a group of
• Kaisa Väänänen-Vainio-Mattila, Virpi Roto, & Marc experienced HCI researchers and practitioners to explore a
Hassenzahl et al.: Exploring UX Measurement Needs long-standing research problem – the meaningfulness of
The paper summarizes several proposed methods for UX measurable constructs and the measurability of non-
evaluation that have been addressed at a CHI’08 measurable ones. One may argue that basically everything
workshop UX Evaluation Methods in Product can be measured, but some things may be more
Development (UXEM). The useful and interesting “measurable” than the others; how to estimate the threshold
outcome of that workshop is a list of requirements for of measurability remains unclear. The sixteen interesting
practical UX evaluation methods. submissions in this volume touch upon the basic issue of
the formal-empirical dichotomy [9]. Many arguments can
Discussion: How can the industrial partners be convinced be boiled down to the fundamental problem that our
the usefulness and validity of alternative evaluation understanding of how people think and feel is still rather
methods for usability and UX? Do they tend to believe in limited, which is essentially inferred from people’s
adapted traditional methods or in entirely new methods that behaviours. Psycho-physiological and neuro-psychological
might bring in some magic? data seem promising, but the issue of calibration and
integration is a big hurdle to overcome. Nonetheless, we are
Category 5: UX in Particular Application Contexts convinced about the value, meaningfulness and usefulness
of this research endeavour.
This category addresses three specific contexts: mobile
applications, children’s technology uses, and emergent
migratory user interfaces. REFERENCES
• Nikolaos Avouris, Georgios Fiotakis & Dimitrios 1. Bulmer, M. (2001). Social measurement: What stands in
Raptis: On Measuring Usability of Mobile Applications its way? Social Research, 62(2).
This paper addresses some interesting views on
measuring the usability of mobile applications. It presents
6
VUUM 2008
2. Czerwinski, M., Horvitz, E., & Cutrell, E.(2001) UX Manifesto”, 3rd September 2007, Lancaster, UK.
Subjective Duration Assessment: An Implicit Probe for Online at: http://www.cost294.org
Software Usability? Proc. IHM-HCI 2001, 167-170
7. Kuhn,T.S. (1961). The function of measurement in
3. Hornbæk, K .(2006). Current Practice in Measuring modern physical science. In H. Woolf (Ed.),
Usability: Challenges to Usability Studies and Research, Quantification: A History of the Meaning of
International Journal of Human-Computer Studies, 64, Measurement in the Natural and Social Sciences (pp.31-
79-102. 63). Indianapolis: Bobbs-Merrill Co.
4. Hornbæk, K., & Law, E. L-C. (2007). Meta-analysis of 8. Law, E., Hvannberg, E., & Cockton, G. (2008) (Eds.).
correlations among usability measures. In Proc. CHI Maturing usability: Quality in software, interaction and
2007, San Jose, USA. software. Springer
5. Hubbard, D. (2007). How to measure anything: Finding 9. Stevens, S.S. (1958). Measurement and man. Science,
the value of intangibles in business. John Wiley & Sons. 127(3295), 383-389.
6. Kerkow, D. (2007). Don’t have to know what it is like 10. van Gulick, R. (2007). Functionalism and qualia. In M.
to be a bit to build a radar reflector – Functionalism in Velmans & S. Schneider (Eds.) The Blackwell
UX. In E. Law, A. Vermeeren, M. Hassenzahl, & M. Companion to Consciousness. Blackwell.
Blythe (Eds.), Proceedings of the Workshop “Towards a
7
VUUM 2008
Developing Usability Methods for Multimodal Systems:

The Use of Subjective and Objective Measures
Anja B. Naumann Ina Wechsung
Deutsche Telekom Laboratories, TU Berlin Deutsche Telekom Laboratories, TU Berlin
Ernst-Reuter-Platz 7, 10587 Berlin, Germany Ernst-Reuter-Platz 7, 10587 Berlin, Germany
[email protected] [email protected]
+4930-8353-58466 +4930-8353-58329
ABSTRACT Thus the questionnaires seem to measure different
In the present paper different types of data (log-data and constructs. Therefore further validation is needed. The
data from questionnaires) assessing usability parameters of purpose of this paper is first to analyze which questionnaire
two multimodal and one unimodal system were compared. relates most to objective data and second to investigate if
The participants (N=21) performed several tasks with each the results are consistent to our earlier findings.
system and were afterwards asked to rate the system by
filling out different questionnaires. Log-data was recorded RELATED WORK
during the whole test sessions. The results show that since The HCI literature provides a wide range of methods to
the questionnaire ratings differed considerably from each measure usability. Most of them were developed to evaluate
other, questionnaires designed for unimodal interfaces unimodal graphical user interfaces. Parameters used as
should not be used as the only data source when evaluating usability measures include subjective data, often assessed
multimodal systems. The correlations between task duration through questionnaires, and objective data like for example
and questionnaire ratings indicate that subjective measures log-files containing task duration or performance data.
of efficiency via questionnaires do not necessarily match Since all these parameters are measuring at least roughly
with the objective data. the same concept, namely usability, high correlations
among them should be observable.
Author Keywords
Evaluation methods, multimodal interfaces A meta-analysis conducted by Nielsen and Levy [10]
showed that performance and predicted preference are
ACM Classification Keywords indeed correlated. Similar results were reported by Sauro
H.5.2 [Information Interfaces and Presentation]: User and Kundlund [12]. They found negative correlations
Interfaces – theory and methods, interaction styles, haptic between satisfaction (subjective data) and time, errors
I/O, voice I/O, graphical user interfaces (GUI). (objective data) and a positive correlation between
satisfaction (subjective data) and completion (objective
INTRODUCTION data).
In the recent years, an emerging interest in multimodal However, several studies reported opposing findings:
interfaces has become noticeable. But up to now, there is no Krämer and Nitschke [8] showed that user ratings of
standardized method for usability evaluation of multimodal multimodal interfaces are not affected by increased
systems. In particular, it is not clear if established methods intuitivity and efficiency. Moeller [9] could not find a
covering unimodal systems provide valid and reliable correlation between task duration and user judgments when
results also for multimodal systems. evaluating speech dialogue systems. Also Frokjaer and
In an earlier paper we already showed that standardized colleagues [2] could not find correlations between user
questionnaires are not completely applicable for usability ratings and efficiency. Results from a meta-analysis by
evaluation of multimodal systems, since there was only Hornbaek and Lai Chong-Law [5] showed that the user’s
little agreement between the different questionnaires [13]. experience of the interaction (subjective data) and objective
data differ considerably from each other or show even
negative correlations.
personal or classroom use is granted without fee provided that copies are In view of the studies mentioned above, it seems necessary
not made or distributed for profit or commercial advantage and that copies to use both kinds of data in usability evaluation to obtain
bear this notice and the full citation on the first page. To copy otherwise, or reliable results. Thus developing methods for usability
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. evaluation of multimodal systems can only be done by
validating within these data types (e.g. validation across
VUUM2008, June 18, 2008, Reykjavik, Iceland. questionnaires) but also between these data types (e.g.
comparing objective and subjective results).
8
VUUM 2008
METHOD
Participants and Material
Twenty-one German-speaking individuals (11 male, 10
female) between the age of 19 and 69 (M = 31.24) took part
in the study. All users participated in return for a book
token. Due to technical problems log-data was missing from
three participants. Regarding the task duration three further
cases were identified as outliers and therefore excluded
from analyses including task duration.
The multimodal systems adopted for the test were a PDA
(Fujitsu-Siemens Pocket LOOX T830) and a tablet PC
(Samsung Q1-Pro 900 Casomii). Both systems could be
operated via voice control as well as via graphical user
interface with touch screen. Additionally, the PDA could be
operated via motion control. Furthermore, an unimodal
system (a conventional PC controllable with mouse and
keyboard) was used as control condition. The application
MediaScout, a media recommender system, was the same
for all systems.
Procedure
The users performed five different types of tasks: seven
navigation tasks, six tasks where checkboxes had to be
marked or unmarked, four tasks where an option from a Figure 1. Example of the procedure. Order of systems,
drop-down list had to be selected, three tasks where a modalities and questionnaires was randomized.
button had to be pressed, and one task where a phone
number had to be entered. The questionnaires used were the
During the whole experimental session log-data and
AttrakDiff questionnaire [3], the System Usability Scale
psycho-physiological data were recorded. Task duration as
(SUS) [1], the Software Usability Measurement Inventory
a measure of efficiency was assessed with the log-files and
(SUMI) [7], the SASSI questionnaire [4] and a self-
was, for each system, averaged over all tasks.
constructed questionnaire covering overall ratings and
preferences. SUMI, SASSI and AttrakDiff were used in For the current paper only data from the test block in which
their original form. The SUS was adapted for voice control the users could freely choose modalities was analyzed.
by replacing the word “system” by “voice control”. The Since the participants were already familiar with the
order of the questionnaires was randomized. With the help systems and all modalities, it can be assumed that they have
of the questionnaires designed or adapted to cover speech used the modality they preferred most.
based applications (SASSI and SUS) ratings were only
collected for the two multimodal systems (PDA and tablet The scales and subscales for each questionnaire were
PC). calculated according to the instructions in the specific
handbook [1,3,4,11]. All questionnaire items which were
Each test session took approximately three hours. The negatively poled were recoded so that higher values
procedure is shown in Figure 1. Each participant performed indicate better ratings.
the tasks with each system. Participants were verbally
instructed to perform the tasks with a given modality. This The level of significance was set at p<.05.
was repeated for every modality supported by that specific
RESULTS
system. After that, the tasks were presented again and the
participants could freely choose the interaction modality. Task Duration – Objective Measures
Finally, they were asked to fill out the questionnaires in In a first step we compared the systems regarding the task
order to to rate the previously tested system. This procedure duration. There was a significant difference between all
was repeated for each of the three systems. In order to three systems (F(2,28)=32.64, p=.000; part.eta²=.70).
balance fatigue and learning effects the sequence of the Participants solved the tasks fastest when using the
systems was randomized. After the third system, a final unimodal PC. The most time to solve the tasks was needed
questionnaire regarding the overall impressions and with the PDA. Figure 2 visualizes these results.
preferences had to be filled out by the participants.
9
was higher, which means better, for the tablet PC. The
detailed results are given in Table 1.
Ratings on Scales Measuring Global Usability

According to ISO 9241 [6] efficiency is one of the three
main factors determining usability. So we expected global
questionnaire scales to be affected by task duration, as it is
an efficiency measure.
On the SUMI global scale the PDA was rated best and the
unimodal PC worst (F(2,40) = 6.56, p = .003; partial eta² =
.247). The AttrakDiff scale attractiveness pointed to the
tablet PC as the most attractive system (F(2,38)=4.04,
p=.026, part.eta²=.175). Regarding SUS and SASSI ratings
were only assessed for the systems supporting voice
Figure 2. Task duration for all systems. control. No differences but a medium effect could be shown
for the SASSI (t(20)=1.95, p=.059, d=0.6). Like on the
AttrakDiff the tablet PC was rated better than the PDA. The
Questionnaire Ratings -Subjective Measures
SUS ratings for the PDA and the tablet PC did not differ
Ratings on Scales Measuring Efficiency (t(20)=1.23, p=.232, d=0.25) but again the absolute ratings
Task duration is used as an objective measure for efficiency were better for the tablet PC. Table 2 presents the detailed
and should therefore be related to questionnaire scales results.
assessing perceived efficiency. Thus the following section
presents the results of the subscales developed with the Scale System Mean SD
purpose to measure subjective judgments of efficiency.
Tablet PC 40.19 7.87
The results of task duration are contradictory to the SUMI SUMI Global
questionnaire but match the results from the AttrakDiff. PDA 45.29 10.14
(Min.=10/Max.=50)
According to the SUMI efficiency subscale the PDA is Unimodal 38.04 7.06
most efficient and the unimodal PC least efficient
AttrakDiff Tablet PC .99 1.04
(F(2,40)=6.19, p=.005, part.eta²=.236).
Attractiveness PDA .34 .88
Results on the AttrakDiff in contrast agree with the (Min.=-3/Max.=3)
objective task duration data: Best ratings on the AttrakDiff Unimodal .74 .66
pragmatic scale (the scale measuring efficiency according SASSI Global Tablet PC 2.25 .52
to the AttrakDiff authors) got the unimodal system, worst (Min.=0/Max.=4)
ratings were given for the PDA (F(2,38)=16.80, p=.000, PDA 1.96 .48
part.eta²=.469). SUS Global Tablet PC 53.93 16.60
(Min.=0/Max.=100) PDA 50.12 13.38
Scale System Mean SD
Table 2. Ratings on questionnaire scales measuring global
Tablet PC 19.00 3.39
usability.
SUMI Efficiency
PDA 19.90 3.48
(Min.=10/Max.=30)
Correlations between Subjective and Objective
Unimodal 16.67 3.15
Measures.
AttrakDiff Tablet PC .91 .84 In a further step the efficiency measures from the different
Pragmatic Qualities PDA .01 .88 questionnaires were correlated with task duration. All
(Min.=-3/Max.=3) questionnaire ratings and task durations were transformed
Unimodal 1.34 .61 into ranks for each participant. Concerning the
SASSI Speed Tablet PC 1.64 .50 questionnaire ratings, rank one was assigned to the system
(Min.=0/Max.=4) with the best rating. For task duration, the system with the
PDA 1.60 .46 shortest task duration got rank one. Thus positive
correlations show concordance between these objective
Table 1. Ratings on questionnaire subscales measuring
efficiency. (task duration) and subjective measures (questionnaire
ratings).
The SASSI speed scale revealed no significant difference
(t(20)=.40, p=.69, Cohen’s d= 0.09) but the absolute rating
10
VUUM 2008
Correlations between Scales Measuring Efficiency and Task scale measures the construct it was developed for. A similar
Duration conclusion can be drawn from the SASSI results: Ratings
The ranks based on SUMI efficiency correlated negatively on the speed scale matched the task duration data.
with task duration. A positive correlation could be observed Regarding the global scales, all questionnaires except the
between the ranks based on the AttrakDiff pragmatic scale SUMI showed no significant correlation with task duration.
and task duration. Also the rank transformed SASSI speed According to these results, the global usability is hardly
scale was positively correlated with task duration. Table 3 affected by the systems efficiency.
shows the detailed results.
The current results are in line with our previous findings
[13]. Again, the AttrakDiff and the SASSI showed the most
Ranks based on Task Duration concordance. A possible explanation could be that the kind
SUMI Efficiency of rating scale used in the AttrakDiff, the semantic
-.577** differential, is applicable to all systems. It uses no direct
(N=45)
questions but pairs of bipolar adjectives, which are not
AttrakDiff Pragmatic Qualities linked to special functions of a system. The SASSI uses
.529**
(N=45) direct questions but was specifically developed for the
evaluation of voice control systems and may therefore be
SASSI Speed
.324* more suitable for multimodal systems including voice
(N=30) control than questionnaires developed for GUI based
systems. Furthermore all SUMI questions were included
Table 3. Correlations (Kendall's tau-b) between task duration although some of them are only appropriate for the
and subscales measuring efficiency (** p<.01; *p<.05).
evaluation of market-ready interfaces. These inappropriate
questions may have affected the SUMI results.
Correlations between Scales Measuring Global Usability
and Task Duration In summary, some of the questionnaires showed high
Regarding the scales assessing global usability the SUMI correlations in the right direction with the objective data.
showed a negative correlation with task duration. All other Other questionnaires’ ratings stood in contrast or no relation
global scales were not significantly correlated with task with the objective data. Thus the contradictory findings
duration (see Table 4). [5,8,9,10,12] regarding the correlation between objective
and subjective data may partly be caused by questionnaires
Ranks based on Task Duration with a lack of construct validity. Therefore the method
assessing subjective data should be chosen carefully.
SUMI Global Furthermore a reliable, valid and more specific
-.418** questionnaire especially for multimodal interfaces is
(N=45)
desirable. In view of the results reported, the AttrakDiff
AttrakDiff Attractiveness provides a proper basis for this.
.019
(N=45)
REFERENCES
SASSI Global
.230 1. Brooke, J. SUS: A ‘quick and dirty’ usability scale. In
(N=30) Jordan P.W., Thomas B., Weerdmeester B.A.,
SUS Global McClelland I. L. (Eds.) Usability Evaluation in Industry,
.264 pp. 189-194. Taylor & Francis, London, 1996.
(N=30)
2. Frøkjær, E., Hertzum, M., and Hornbæk, K. Measuring
Table 4. Correlations (Kendall's tau-b) between task duration usability: are effectiveness, efficiency, and satisfaction
and scales measuring global usability (*p<.05). really correlated? In Proc. CHI 2000. ACM Press,
(2000), 345-352.
DISCUSSION
3. Hassenzahl, M., Burmester, M. and Koller, F.
The subjective data (questionnaire ratings) and the objective
AttrakDiff: Ein Fragebogen zur Messung
data (task duration) results showed concordances only to a
wahrgenommener hedonischer und pragmatischer
limited extend. The questionnaire ratings most inconsistent
Qualität [A questionnaire for measuring perceived
to the results of all other questionnaires as well as to the
hedonic and pragmatic quality]. In Ziegler J., Szwillus
task duration data were the ratings of the SUMI. The SUMI
G. (Eds.) Mensch & Computer 2003. Interaktion in
results showed correlations in the wrong direction for
Bewegung, B.G. Teubner, Stuttgart (2003), 187-196.
efficiency and for the global scale. That means that the
longer the task duration the better was the rating on the 4. Hone, K.S. and Graham, R. Towards a tool for the
SUMI. The AttrakDiff pragmatic scale in contrast showed subjective assessment of speech system interfaces
the highest agreement with the task duration data. Thus this
11
(SASSI). Natural Language Engineering, 6, 3/4 (2000) 9. Möller, S. Messung und Vorhersage der Effizienz bei
287-305. der Interaktion mit Sprachdialogdiensten [Measuring
5. Hornbæk, K. and Law, E.L. Meta-analysis of and predicting efficiency for the interaction with speech
correlations among usability measures. In Proc. CHI dialogue systems], In S. Langer & W. Scholl (Eds.),
2007. ACM Press (2007), 617-626. Fortschritte der Akustik - DAGA 2006. DEGA, Berlin
(2006), 463-464.
6. ISO 9241-9 Ergonomic Requirements for Office Work
with Visual Display Terminals, Nonkeyboard Input 10. Nielsen, J. & Levy, J. Measuring usability: Preference
Device Requirements, Draft International Standard, vs. performance. Communications of the ACM, 37, 4
International Organization for Standardization (1998). (1994), 66-75.
7. Kirakowski, J. and Corbett, M. SUMI: The software 11. Porteous, M., Jurek, K. and Corbett, M. SUMI: User
usability measurement inventory. British Journal of Handbook. Human Factors Research Group, University
Educational Technology, 24, 3 (1993), 210-212. of Cork, Ireland, 1993.
8. Krämer, N.C. & Nitschke, J. Ausgabemodalitäten im 12. Sauro, J. and Kindlund, E. A method to standardize
Vergleich: Verändern sie das Eingabeverhalten der usability metrics into a single score. In Proc. CHI 2005.
Benutzer? [Output modalities by comparison: Do they ACM Press (2005), 401-409.
change the input behaviour of users?] In R. Marzi, V. 13. Wechsung I. and Naumann, A. Established Usability
Karavezyris, H.-H. Erbe & K.-P. Timpe (Eds..), Evaluation Methods for Multimodal Systems: A
Bedienen und Verstehen. 4. Berliner Werkstatt Mensch- Comparison of Standardized Usability Questionnaires.
Maschine-Systeme. Düsseldorf: VDI-Verlag (2002), In Proc. PIT 08. Heidelberg: Springer (in press).
231-248.
12
VUUM 2008
Classifying and Selecting UX and Usability Measures

Nigel Bevan
Professional Usability Services
12 King Edwards Gardens, London W3 9RG, UK
[email protected]
www.nigelbevan.com
ABSTRACT
There are many different types of measures of usability For example, the test method for everyday products in
and user experience (UX). The overall goal of usability ISO 20282-2 points out that to obtain 95% confidence that
from a user perspective is to obtain acceptable 80% of users could successfully complete a task would
effectiveness, efficiency and satisfaction (Bevan, 1999, for example require 28 out of 30 users tested to be
ISO 9241-11). This paper summarises the purposes of successful. If 4 out of 5 users in a usability test were
measurement (summative or formative), and the measures successful, even if the testing protocol was perfect there is
of usability that can be taken at the user interface level 20% chance that the success rate for a large sample of
and at the system level. The paper suggests that the users might only be 51%.
concept of usability at the system level can be broadened
to include learnability, accessibility and safety, which Although summative measures are most commonly
contribute to the overall user experience. UX can be obtained from user performance and satisfaction,
measured as the user’s satisfaction with achieving summative data can also be obtained from hedonic
pragmatic and hedonic goals, and pleasure. questionnaires (e.g. Hassenzahl et al., 2003; Lavie and
Tractinsky, 2004) or from expert evaluation, such as the
WHY MEASURE UX/USABILITY? degree of conformance with usability guidelines (see for
The most common reasons for measuring usability in example Jokela, et al, 2006).
product development are to obtain a more complete
understanding of users’ needs and to improve the product Formative Measures
in order to provide a better user experience. Formative evaluation can be used to identify UX/usability
problems, to obtain a better understanding of user needs
But it is also important to establish criteria for and to refine requirements. The main data from formative
UX/usability goals at an early stage of design, and to use evaluation is qualitative. When formative evaluation is
summative measures to evaluate whether these have been carried out relatively informally with small numbers of
achieved during development. users, it does not generate reliable data from user
performance and satisfaction.
Summative Measures
Summative evaluation can be used to establish a baseline, However some measures of the product obtained by
make comparisons between products, or to assess whether formative evaluation, either with users or by an expert,
usability requirements have been achieved. For this such as the number of problems identified, may be useful,
purpose, the measures need to be sufficiently valid and although they should be subject to statistical assessment if
reliable to enable meaningful conclusions to be drawn they are to be interpreted.
from the comparisons. One prerequisite is that the In practice, even when the main purpose of an evaluation
measures are taken from an adequate sample of typical is summative, it is usual to collect formative information
users carrying out representative tasks in a realistic to provide design feedback at the same time.
context of use. Any comparative figures should be
accompanied by a statistical assessment of whether the WHAT MEASURES SHOULD BE USED?
results may have been obtained by chance. There are two types of UX/usability measures: those that
measure the result of using the whole system (usability in
use) and measures of the quality of the user interface
personal or classroom use is granted without fee provided that copies are (interface usability).
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, or
VUUM2008, June 18, 2008, Reykjavik, Iceland.
13
SYSTEM USABILITY • Efficiency: “resources expended.” How quickly a
ISO 9241-11 (1998) defines usability as: user can perform work is critical for business
the extent to which a product can be used by specified productivity.
users to achieve specified goals with effectiveness, • Satisfaction: the extent to which expectations are met.
efficiency and satisfaction in a specified context of use Satisfaction is a success factor for any products with
and ISO 9241-171 (2008) defines accessibility as: discretionary use; it’s essential for maintaining
workforce motivation.
usability of a product, service, environment or facility
by people with the widest range of capabilities Usability in use also explicitly identifies the need for a
product to be usable in the specified contexts of use:
These definitions mean that for a product to be usable and
accessible users should be able to use a product or web • Context conformity: the extent to which usability in
site to achieve their goals in an acceptable amount of use meets requirements in all the required contexts of
time, and be satisfied with the results. ISO/IEC standards use.
for software quality refer to this broad view of usability as Flexibility in use: the extent to which the product is
“quality in use”, as it is the user’s overall experience of usable in all potential contexts of use:
the quality of the product (Bevan, 1999). This is a black-
box view of usability: what is achieved, rather than how. • Context conformity in use: the degree to which
The new draft ISO standard ISO/IEC CD 25010.2 (2008) usability in use meets requirements in all the intended
proposes a more comprehensive breakdown of quality in contexts of use.
use into usability in use (which corresponds to the ISO • Context extendibility in use: the degree of usability in
9241-11 definition of usability as effectiveness, efficiency use in contexts beyond those initially intended.
and satisfaction); flexibility in use (which is a measure of
the extent to which the product is usable in all potential • Accessibility in use: the degree of usability in use for
contexts of use, including accessibility); and safety (which users with specified disabilities.
is concerned with minimising undesirable consequences): Safety: acceptable levels of risk of harm to people,
Quality in use business, data, software, property or the environment in
the intended contexts of use.
Usability in use
Safety is concerned with the potential adverse
Effectiveness in use consequences of not meeting the goals. For instance in
Productivity in use Cockton’s (2008) example of designing a van hire system,
Satisfaction in use from a business perspective, what are the potential
Likability (satisfaction with pragmatic goals) consequences of:
Pleasure (satisfaction with hedonic goals)
Comfort (physical satisfaction) Not offering exactly the type of van preferred by a
Trust (satisfaction with security) potential user group?
Flexibility in use The user mistakenly making a booking for the wrong
Context conformity in use dates or wrong type of vehicle?
Context extendibility in use The booking process taking longer than with
Accessibility in use competitor systems?
Safety For a consumer product or game, what are the potential
Operator health and safety adverse consequences of a lack of pleasurable emotional
Public health and safety reactions or of achievement of other hedonic goals?
Environmental harm in use
Commercial damage in use SYSTEM USABILITY MEASURES
Usability in use is similar to the ISO 9241-11 definition Usability in use and flexibility in use are measured by
of usability: effectiveness (task goal completion), efficiency (resources
used) and satisfaction. The relative importance of these
• Effectiveness: “accuracy and completeness.” Error- measures depends on the purpose for which the product is
free completion of tasks is important in both business being used (for example in some personal situations,
and consumer applications. resources may not be important).
Table 1 illustrates how the measures of effectiveness,
resources, safety and satisfaction can be selected to
14
VUUM 2008
measure quality in use from the perspective of different • Pleasure: the extent to which the user is satisfied with
stakeholders. their perceived achievement of hedonic goals of
stimulation, identification and evocation
From an organisational perspective, quality in use and
(Hassenzahl, 2003) and associated emotional
usability in use is about achievement of task goals. But
responses (Norman’s (2004) visceral category).
for the end user there are not only pragmatic task-related
“do” goals, but also hedonic “be” goals (Carver & • Comfort: the extent to which the user is satisfied with
Scheier, 1998). For the end user, effectiveness and physical comfort.
efficiency are the do goals, and stimulation, identification,
evocation and pleasure are the be goals. • Trust: the extent to which the user is satisfied that the
product will behave as intended.
Additional derived user performance measures (Bevan,
2006) include: Satisfaction is most often measured using a questionnaire.
Psychometrically designed questionnaires will give more
• Partial goal achievement. In some cases goals may be reliable results than ad hoc questionnaires (Hornbaek,
only partially achieved, producing useful but 2006).
suboptimal results.
Safety and Risk Measures
• Relative user efficiency. How long a user takes in There are no simple measures of safety. Historical
comparison with an expert. measures can be obtained for the frequency of health and
safety, environmental harm and security failures. A
• Productivity. Completion rate divided by task time,
product can be tested in situations that might be expected
which gives a classical measure of productivity.
to increase risks. Or risks can be estimated in advance.
Evaluation of Data from Usage of an Existing System

Stakeholder: End User Usage Technical Measures of effectiveness, efficiency and satisfaction can
Usability Organisation support also be obtained from usage of an existing system.
Cost- Maintenance
effectiveness Web Metrics
Goal: Personal goals Task goals Support goals Web-based logs contain potentially useful data that can be
Characteristic used to evaluate usability by providing data such as
entrance and exit pages, frequency of particular paths
System User Task Support through the site, and the extent to which search is
effectiveness effectiveness effectiveness effectiveness successful. (Burton and Walther, 2001), although it is
System Productivity Cost Support cost very difficult to track individual user behaviour (Groves,
resources (time) efficiency 2007) without some form of page-tagging combined with
(money) pop-up questions when the system is being used, so that
the results can be related to particular user groups and
Safety Risk to user Commercial System tasks.
(health and risk failure or
safety) corruption
Application Instrumentation
Stakeholder Hedonic and Management Support Data points can be built into code that "count" when an
satisfaction pragmatic satisfaction satisfaction event occurs (for example in Microsoft Office (Harris,
satisfaction 2005)). This could be the frequency with which
Table 1. Stakeholder perspectives of quality in use commands are used or the number of times a sequence
results in a particular type of error. The data is sent
User Satisfaction Measures anonymously to the development organization. This real-
User satisfaction can be measured by the extent to which world data from large populations can help guide future
users have achieved their pragmatic and hedonic goals. design decisions.
ISO/IEC CD 25010.2 suggests the following types of
measure: Satisfaction Surveys
Satisfaction questionnaires distributed to a sample of
• Likability: the extent to which the user is satisfied existing users provide an economical way of obtaining
with their perceived achievement of pragmatic goals, feedback on the usability of an existing product or system.
including acceptable perceived results of use and
consequences of use.
15
USER INTERFACE USABILITY Accessibility may refer to product capabilities (“technical
The broad quality in use perspective contrasts with the accessibility”) or a product usable by people with
narrower interpretation of usability as the attributes of the disabilities (e.g. ISO 9241-171).
user interface that makes the product easy to use. This is
consistent with one of the views of usability in HCI, for UX has even more interpretations. ISO CD 9241-210
example in Nielsen’s (1993) breakdown where a product defines user experience as:
can be usable, even if it has no utility (Figure 1). all aspects of the user’s experience when interacting
System acceptability with the product, service, environment or facility.
Social acceptability This definition can be related to different interpretations
Practical acceptability of UX:
Cost
Compatibility • UX attributes such as aesthetics, designed into the
Reliability product to create a good user experience.
Usefulness
Utility • The user’s pragmatic and hedonic UX goals
Usability (individual criteria for user experience) (Hassenzahl,
Figure 1. Nielsen’s categorisation of usability 2003).
User interface usability is a pre-requisite for system • The actual user experience when using the product
usability. (this is difficult to measure directly).
Expert-based Methods
• The measurable UX consequences of using the
Expert evaluation relies on the expertise of the evaluator, product: pleasure, and satisfaction with achieving
and may involve walking through user tasks or assessing pragmatic and hedonic goals.
conformance to UX/usability guidelines or heuristics. Table 2 shows how measures of system usability and UX
are dependent on product attributes that support different
Measures that can be obtained from expert evaluation
aspects of user experience. In Table 2 the columns are the
include:
quality characteristics that contribute to the overall user
• Number of violations of guidelines or heuristics. experience, with the associated product attributes needed
to achieve these qualities.
• Number of problems identified.
The users’ goals may be pragmatic (to be effective and
• Percentage of interface elements conforming to a efficient), and/or hedonic (stimulation, identification
particular guideline. and/or evocation).
• Whether the interface conforms to detailed Although UX is primarily about the actual experience of
requirements (for example the number of clicks usage, this is difficult to measure directly. The
required to achieve specific goals). measurable consequences are the user’s performance,
satisfaction with achieving pragmatic and hedonic goals,
If the measures are sufficiently reliable, they can be used and pleasure.
to track usability during development.
User performance and satisfaction is determined by
Automated Evaluation Methods qualities including attractiveness, functionality and
There are some automated tools (such as WebSAT and interface usability. Other quality characteristics will also
LIFT) that automatically test for conformance with basic be relevant in determining whether the product is
usability and accessibility rules. Although the measures learnable, accessible, and safe in use.
obtained are useful for screening for basic problems, they Pleasure will be obtained from both achieving goals, and
only test a very limited scope of usability issues (Ivory & as a direct visceral reaction to attractive appearance
Hearst, 2001). (Norman, 2004).
MEASURING UX, USABILITY AND ACCESSIBILITY
Usability is variously interpreted as good user interface
design (ISO 9126-1), an easy to use product (e.g.
Cockton, 2004), good user performance (e.g. Väänänen-
Vainio-Mattila et al, 2008), good user performance and
satisfaction (e.g. ISO 9241-11), or good user performance
and user experience (e.g. ISO 9241-210).
16
VUUM 2008
Quality UX Functionality User Learnability Accessibility Safety

characteristic interface
usability
Product Aesthetic Appropriate Good UI Learnability Technical Safe and
attributes attributes functions design (easy attributes accessibility secure design
to use)
UX pragmat- To be effective and efficient
ic do goals
UX hedonic Stimulation, identification and evocation
be goals
UX: actual Visceral Experience of interaction
experience
Usability (= Effectiveness and Productivity Learnability Accessibility Safety
performance in use: in use: in use: in use:
in use effective and effective and occurrence of
effective task completion and efficient use of
measures) efficient to learn efficient with unintended
time
disabilities consequences
Measures of Satisfaction in use:
UX satisfaction with achieving pragmatic and hedonic goals
consequences
Pleasure Likability and Comfort Trust
Table 2. Factors contributing to system usability and UX
WHAT SHOULD BE MEASURED? CONCLUSIONS

In a systems development environment, UX/usability Discussion of UX and selection of appropriate UX
measures need to be prioritised: measures would be simplified if the different perspectives
on UX were identified and distinguished. The current
1. At a high level, whose stakeholder goals are the main interpretations of “UX” are even more diverse than those of
concern (e.g. users, staff or managers)? “usability”.
2. What aspects of effectiveness, efficiency, satisfaction, This paper proposes a common framework for classifying
flexibility, accessibility and safety are most important usability and UX measures, showing how they relate to
for these stakeholders? broader issues of effectiveness, efficiency, satisfaction, ,
3. What are the risks if the goals for effectiveness, accessibility and safety. It is anticipated that the framework
efficiency, satisfaction, flexibility, accessibility and could to be elaborated to incorporate new conceptual
safety are not achieved in the intended contexts of use? distinctions as they emerge.
4. Which of these UX/system usability measures are Understanding how different aspects of user experience
important enough to validate using user-based testing relate to usability, accessibility, and broader conceptions of
and/or questionnaires, and how should the users, tasks quality in use, will help in the selection of appropriate
and measures be selected? measures.
5. Are baseline measures needed to establish REFERENCES

requirements? (Whiteside et al, 1998) 1. Bevan, N. (1999) Quality in use: meeting user needs for
6. Which aspects of interface usability can be measured quality, Journal of Systems and Software, 49(1), pp 89-
during development by expert evaluation to help 96.
develop a product that achieves the UX/system 2. Bevan, N. (2006) Practical issues in usability
usability goals for the important stakeholders in the measurement. Interactions 13(6): 42-43
important contexts of use?
7. How can UX/usability be monitored during use?
17
3. Carver, C. S., & Scheier, M. F. (1998). On the self- 14. ISO FDIS 9241-171 (2008) Ergonomics of human-
regulation of behavior. New York: Cambridge system interaction -- Part 171: Guidance on software
University Press. accessibility. ISO.
4. Burton, M and Walther, J (2001) The value of web log 15. ISO CD 9241-210 (2008) Ergonomics of human-system
data in use-based design and testing. Journal of interaction -- Part 210: Human-centred design process
Computer-Mediated Communication, 6(3). for interactive systems. ISO.
jcmc.indiana.edu/vol6/issue3/burton.html
16. ISO 13407 (1999) Human-centred design processes for
5. Cockton, G. (2004) From Quality in Use to Value in the interactive systems. ISO.
World. CHI 2004, April 24–29, 2004, Vienna, Austria.
17. ISO TS 20282-2 Ease of operation of everyday products
6. Cockton, G. (2008a) Putting Value into E-valu-ation. -- Part 2: Test method for walk-up-and-use products.
In: Maturing Usability. Quality in Software, Interaction ISO.
and Value. Law, E. L., Hvannberg, E. T., Cockton, G.
18. ISO/IEC 9126-1 (2001) Software engineering - Product
(eds). Springer.
quality - Part 1: Quality model. ISO.
7. Cockton (2008b) What Worth Measuring is. Proceedings
19. ISO/IEC CD 25010.2 (2008) Software engineering –
of Meaningful Measures: Valid Useful User Experience
Software product Quality Requirements and Evaluation
Measurement (VUUM), Reykjavik, Iceland.
(SQuaRE) – Quality model
8. Groves, K (2007). The limitations of server log files for
20. Ivory, M.Y., Hearst, M.A. (2001) State of the Art in
usability analysis. Boxes and Arrows.
Automating Usability Evaluation of User Interfaces.
www.boxesandarrows.com/view/the-limitations-of
ACM Computing Surveys, 33,4 (December 2001) 1-47.
9. Harris, J. (2005) An Office User Interface Blog. Accessible at http://webtango.berkeley.edu/ papers/ue-
http://blogs.msdn.com/jensenh/archive/2005/10/31/4872 survey/ue-survey.pdf
47.aspx Retrieved January 2008.
21. Jokela, T., Koivumaa, J., Pirkola, J., Salminen, P.,
10. Hassenzahl, M. (2002). The effect of perceived hedonic Kantola , N. (2006) “Methods for quantitative usability
quality on product appealingness. International Journal requirements: a case study on the development of the
of Human-Computer Interaction, 13, 479-497. user interface of a mobile phone”, Personal and
Ubiquitous Computing, 10, 345 – 355.Nielsen, J. (1993)
11. Hassenzahl, M. (2003) The thing and I: understanding
Usability Engineering. Academic Press.
the relationship between user and product. In Funology:
From Usability to Enjoyment, M. Blythe, C. Overbeeke, 22. Norman, D. (2004) Emotional design: Why we love (or
A.F. Monk and P.C. Wright (Eds), pp. 31 – 42 hate) everyday things (New York: Basic Books).
(Dordrecht: Kluwer).
23. Väänänen-Vainio-Mattila, K., Roto, V., Hassenzahl, M.
12. Hornbaek, K (2006). Current practices in measuring (2008) Towards Practical UX Evaluation Methods.
usability. Int. J. Human-Computer Studies 64 (2006) Proceedings of Meaningful Measures: Valid Useful
79–102 User Experience Measurement (VUUM), Reykjavik,
Iceland.
13. ISO 9241-11 (1998) Ergonomic requirements for office
work with visual display terminals (VDTs) Part 11: 24. Whiteside, J., Bennett, J., & Holtzblatt, K. (1988).
Guidance on Usability. ISO. Usability engineering: Our experience and evolution. In
M. Helander (Ed.), Handbook of Human-Computer
Interaction (1st Ed.) (pp. 791–817). North-Holland.
18
VUUM 2008
Towards Practical User Experience Evaluation Methods

Kaisa Väänänen-Vainio-Mattila Virpi Roto Marc Hassenzahl
Tampere University of Technology Nokia Research Center University of Koblenz-Landau
Human-Centered Technology P.O.Box 407 Economic Psychology and
(IHTE) 00045 Nokia Group, Finland Human-Computer Interaction,
Korkeakoulunkatu 6 [email protected] Campus Landau, Im Fort 7
33720 Tampere, Finland 76829 Landau, Germany
kaisa.vaananen-vainio- [email protected]
[email protected]
ABSTRACT
In the last decade, User eXperience (UX) research in the
academic community has produced a multitude of UX There are many definitions for UX, but not an agreed one
models and frameworks. These models address the key [16]. However, even the most diverse definitions of user
issues of UX: its subjective, highly situated and dynamic experience all agree that it is more than just a product's
nature, as well as the pragmatic and hedonic factors leading usefulness and usability [2,6,17,18,23,26]. In addition, they
to UX. At the same time, industry is adopting the UX term stress the subjective nature of UX: UX is affected by the
but the practices in the product development are still largely user’s internal state, the context, and perceptions of the
based on traditional usability methods. In this paper we product [2, 6, 17].
discuss the need for pragmatic UX evaluation methods and However, definitions alone are not sufficient for the proper
how such methods can be used in product development in consideration of UX throughout product development.
industry. We conclude that even though UX definition still Product development in its current form needs tools from
needs work it seems that many of the methods from HCI the very early concept to market introduction: UX must be
and other disciplines can be adapted to the particular assessable and manageable. An important element of this is
aspects of UX evaluation. The paper is partly based on the a set of evaluation methods focused on UX.
results of UX evaluation methods in product development
(UXEM) workshop in CHI 2008. Apparently, there is a gap between the research
community's and the product developers' understanding of
Author Keywords what UX is and how it should be evaluated (see Figure 1).
User experience, evaluation methods, product development.
ACM Classification Keywords Research, Companies,

H5.m. Information interfaces and presentation (e.g., HCI): academics: GAP industry:
Miscellaneous.
UX theories, Practical UX work
INTRODUCTION models, in product
Companies in many industrial sectors have become aware development
frameworks
that designing products and services is not enough, but
designing experiences is the next level of competition [19, Hedonic aspects, Functionality,
22]. Product development is no longer only about emotions usability
implementing features and testing their usability, but about
Co-experience Novelty
designing products that are enjoyable and support
fundamental human needs and values. Thus, experience Dynamics of Product life
should be a key concern of product development. experience cycle
etc. etc. etc. etc.
Permission to make digital or hard copies of all or part of this work for Figure 1. Currently the academic UX research and industrial
personal or classroom use is granted without fee provided that copies are UX development are focusing on different issues.
bear this notice and the full citation on the first page. To copy otherwise, or As an attempt to close the gap, we organised a workshop on
UX evaluation methods for product development (UXEM)
[25] in the context of the CHI 2008 conference on human
VUUM2008, June 18, 2008, Reykjavik, Iceland. factors in computing. The aim of the workshop was to
19
identify truly experiential evaluation methods (in the repeatedly prompted to assess their current emotional state
research sense) and discuss their applicability and by going through a number of questions. This approach
practicability in engineering-driven product development. takes UX evaluation a step further, by focusing on the
In addition, we hoped for lively and fruitful discussions experience itself instead of the product. However, in the
between academics and practitioners about UX itself and context of product development additional steps would have
evaluation methods. In this paper, we present the central to be taken to establish a causal link between a positive
findings of this workshop. experience and the product: how does the product affect the
measured experience. Bear in mind, that product evaluation
CURRENT STATE OF UX EVALUATION METHODS is not interested in experiences per se but in experiences
Traditionally, technology-oriented companies have tested caused by the product at hand.
their products against technical and usability requirements. Two further methods presented in the workshop (Repertory
Experiential aspects were predominantly the focus of the Grid, Multiple Sorting) [1,12] make use of Kelly’s
marketing department, which tried to create a certain image "personal construct psychology" [e.g., 13]. Basically, these
of a product through advertising. For example, when are methods to capture the personal meaning of objects.
Internet became an important channel in communicating the They have a strong procedural structure, but are open to any
brand and image, technical and usability evaluations of sort of meaning, whether pragmatic or hedonic.
Web sites needed to be integrated with more experiential Interestingly, the methods derived from Kelly’s theory tend
goals [4,15]. Today, industry is in need of user experience to provide both a tool for analysis and evaluation [7]. The
evaluation methods for a wide variety of products and results give an idea of the themes, topics, concerns people
services. have with a particular group of products (i.e., content). At
User-centered development (UCD) is still the key to the same time, all positive and negative feelings (i.e.,
designing for good user experiences. We must understand evaluations) towards topics and products become apparent.
users’ needs and values first, before designing and Finally, Heimonen and colleagues [8] use "forced choice"
evaluating solutions. Several methods exist for to evaluate the "desirability" of a product. This method
understanding users and generating ideas in the early phases highlights another potential feature of UX, which may pose
of concept design, such as Probes [5] or Contextual Inquiry additional requirements for UX evaluation methods: There
[3]. Fewer methods are available for concept evaluation that might be drivers of product appeal and choice, which are
would assess the experiential aspects of the chosen concept. not obvious to the users themselves. Tractinsky and Zmiri
[24], for example, found hedonic aspects (e.g., symbolism,
EXPERIENTIAL EVALUATION METHODS beauty) to be predictive of product choice. When asked,
A number of evaluation methods were presented and however, participants gave predominantly pragmatic
discussed in the UXEM workshop. However, only a few reasons for the choice. Note that the majority of the
were "experiential" in the sense of going beyond traditional "experiential" methods discussed so far rely on people's self
usability methods by emphasizing the subjective, positive report. This might be misleading, given that experiential
and dynamic nature of UX. aspects are hard to justify or even to verbalize. In other
Isomursu's "experimental pilots" [11], for example, stress words, choice might be driven by criteria not readily
the importance of evaluating before (i.e., expectation), available to the people choosing. Forced choice might bring
while (i.e., experience) and after product use (i.e., this out.
judgment). This acknowledges the subjective and changing, All in all, a number of interesting approaches to measure
dynamic nature of UX: expectations influence experience, UX were suggested and discussed in the workshop. All of
experience influences retrospective judgments and these them addressed at least one key feature of UX, thereby
judgments in turn set stage for further expectations and so demonstrating that "experiential" evaluation is possible.
forth. In addition, Isomursu points at the importance of More work, however, has to be done to integrate methods to
creating an evaluation setting, which resembles an actual capture more aspects of UX simultaneously. In addition,
use setting. UX is highly situated; its assessment requires a methods need to be adapted to the requirements of
strong focus on situational aspects. Roto and colleagues as evaluation in an industrial setting. So far, most suggested
well as Hoonhout [21,10] stress the importance of positive methods are still demanding in the sense of the skills and
emotional responses to products and embrace the notion time required.
that task effectiveness and efficiency (i.e., usability) might
be not the only source for positive emotions. Their focus is REQUIREMENTS FOR PRACTICAL UX EVALUATION
on early phases of development where idea creation and METHODS
evaluation is closely linked and short-cycled. In industry, user experience evaluation is done in order to
improve a product. Product development is often a hectic
Hole and Williams suggest "emotion sampling" as an process and the resources for UX evaluation scarce.
evaluation method [9]. While using a product, people are Evaluating early and often is recommended, as the earlier
20
VUUM 2008
the evaluations can be done, the easier it is to change the In the UXEM workshop, we noticed that there is not always
product to the right direction. a clear line between the design and evaluation methods,
since evaluating current solutions often gives ideas for new
The early phases of product development are challenging
ones. On the other hand, companies do need evaluation
for UX evaluation, since at that point, the material available
methods that focus in producing UX scores or a list of pros
about the concept may be hard to understand and assess for
and cons for a pool of concept ideas in an efficient way.
the participants [10, 21]. In the early phases, it is not
After the product specification has been approved, the
possible to test the non-functional concept in the real
primary interest is to check that the user experience matches
context of use, although user experience is tied to the
the original goal. In this phase, the methods applied are
context [6]. We need good ideas for simulating real context
clearly about evaluation, not about creating new ideas.
in a lab [14]. Later on, when prototypes are stable enough to
be handed for field study participants, UX evaluation
DISCUSSION AND CONCLUSIONS
becomes much easier. The most reliable UX evaluation data Obviously, applying and developing methods for UX
comes from people who have actually purchased and used a evaluation requires an understanding of what UX actually
product on the market. This feedback helps improving the is. This is still far from being settled. Although everybody
future versions of the product. in the workshop agreed that the UX perspective adds
In summary, the UXEM workshop presentations and group something to the traditional usability perspective, it was
works produced the following requirements for practical hard to even put a name to this added component: Is it
UX evaluation methods: "emotional", "experiential" or "hedonic"? The lack of a
shared understanding on what UX means was identified as
Valid, reliable, repeatable one of the major problems of UX evaluation in its current
o For managing UX also in a big company state. As long we do not agree or at least take a decision on
Fast, lightweight, and cost-efficient what we are looking for, we cannot pose the right questions.
Without an idea of the appropriate questions, selecting a
o For fast-pace iterative development method is futile. Nevertheless, once a decision is made ─
Low expertise level required for example to take a look at the emotional consequences of
o For easy deployment (no extensive training needed) product use ─ there seem to be a wealth of methods already
in use within HCI or from other disciplines, which could be
Applicable for various types of products adapted to this particular aspect of evaluation.
o For comparisons and trend monitoring
Working with UX evaluation is a double task: We have to
Applicable for concept ideas, prototypes, and products understand UX and make it manageable and measurable.
o For following how UX develops during the process Given the fruitful discussions in the workshop, a practice-
driven development of the UX concept may be a valid road
Suitable for different target user groups to a better understanding of UX. "UX is what we measure"
o For a fair outcome might be an approach as long as there is no accepted
Suitable for different product lifecycle phases definition of UX at hand. However, this approach requires
some reflection on the evaluation needs and practices. By
o For improving e.g. taking into use, repurchasing UX discussing the implicit notions embedded in the evaluation
Producing comparable output (quantitative and qualitative) requirements and methods, we might be able to better
For UX target setting and iterative improvement articulate what UX actually should be. The UXEM
workshop and this paper hopefully open up the discussion.
Useful for the different in-house stakeholders
o As UX is multidisciplinary, many company ACKNOWLEDGMENTS
departments are interested in UX evaluation results. We thank all participants of the UXEM workshop: Jim
Clearly, it is not possible to have just one method that Hudson, Jon Innes, Nigel Bevan, Minna Isomursu, Pekka
would fulfill all the requirements above. Some of the Ketola, Susan Huotari, Jettie Hoonhout, Audrius
requirements may be contradictory, or even unrealistic. For Jurgelionis, Sylvia Barnard, Eva Wischnewski, Cecilia
example, a method which is very lightweight may not Oyugi, Tomi Heimonen, Anne Aula, Linda Hole, Oliver
necessarily be totally reliable. Also, it might be challenging Williams, Ali al-Azzawi, David Frohlich, Heather Vaughn,
if not impossible to find a method which is suitable for Hannu Koskela, Elaine M. Raybourn, Jean-Bernard
different types of products, product development phases, Martens, and Evangelos Karapanos. We also thank the
and product lifecycle phases. We thus need to have a toolkit programme committee of UXEM: Anne Aula, Katja
of experiential methods to be used for the different Battarbee, Michael "Mitch" Hatscher, Andreas Hauser, Jon
purposes. Innes, Titti Kallio, Gitte Lindgaard, Kees Overbeeke, and
Rainer Wessler.
21
REFERENCES 15. Kuniavsky, M. (2003), Observing The User Experience
1. al-Azzawi, A., Frohlich, D. and Wilson, M. User – A Practitioner’s Guide to User Research. Morgan
Experience: A Multiple Sorting Method based on Kaufmann Publishers, Elsevier Science, USA
Personal Construct Theory, Proc. of UXEM,
16. Law, E., Roto, V., Vermeeren, A., Kort, J., &
www.cs.tut.fi/ihte/CHI08_workshop/papers.shtml
Hassenzahl, M. (2008). Towards a Shared Definition
2. Alben, L. (1996), Quality of Experience: Defining the for User Experience. Special Interest Group in CHI’08.
Criteria for Effective Interaction Design. Interactions, Proc. Human Factors in Computing Systems 2008, pp.
3, 3, pp. 11-15. 2395-2398.
3. Beyer, H. & Holtzblatt, K. (1998). Contextual Design. 17. Mäkelä, A., Fulton Suri, J. (2001), Supporting Users’
Defining Customer-Centered Systems. San Francisco: Creativity: Design to Induce Pleasurable Experiences.
Morgan Kaufmann. Proceedings of the International Conference on
4. Ellis, P., Ellis, S. (2001), Measuring User Experience. Affective Human Factors Design, pp. 387-394.
New Architect 6, 2 (2001), pp. 29-31. 18. Nielsen-Norman group (online). User Experience –
5. Gaver, W. W., Boucher, A., Pennington, S., & Walker, Our Definition.
B. (2004). Cultural probes and the value of uncertainty. http://www.nngroup.com/about/userexperience.html
Interactions. 19. Nokia Corporation (2005), Inspired Human
6. Hassenzahl, M., Tractinsky, N. (2006), User Technology. White paper available at
Experience – a Research Agenda. Behaviour and http://www.nokia.com/NOKIA_COM_1/About_Nokia/
Information Technology, Vol. 25, No. 2, March-April Press/White_Papers/pdf_files/backgrounder_inspired_h
2006, pp. 91-97. uman_technology.pdf
7. Hassenzahl, M. & Wessler, R. (2000). Capturing 20. Norman, D.A., Miller, J., and Henderson, A. (1995)
design space from a user perspective: the Repertory What you see, some of what's in the future, and how
Grid Technique revisited. International Journal of we go about doing it: HI at Apple Computer. Proc. CHI
Human-Computer Interaction, 12, 441-459. 1995, ACMPress (1995), 155.
8. Heimonen, T., Aula, A., Hutchinson, H. and Granka, L. 21. Roto, V., Ketola, P. and Huotari, S. User Experience
Comparing the User Experience of Search User Evaluation in Nokia, Proc. of UXEM,
Interface Designs, Proc. of UXEM, www.cs.tut.fi/ihte/CHI08_workshop/papers.shtml
www.cs.tut.fi/ihte/CHI08_workshop/papers.shtml 22. Seidel, M., Loch, C., Chahil, S. (2005). Quo Vadis,
9. Hole, L. and Williams, O. Emotion Sampling and the Automotiven Industry? A Vision of Possible Industry
Product Development Life Cycle, Proc. of UXEM, Transformations. European Management Journal, Vol.
www.cs.tut.fi/ihte/CHI08_workshop/papers.shtml 23, No. 4, pp. 439–449, 2005
10. Hoonhout, J. Let's start to Create a Fun Product: 23. Shedroff, N. An Evolving Glossary of Experience
Where Is My Toolbox?, Proc. of UXEM, Design, online glossary at
www.cs.tut.fi/ihte/CHI08_workshop/papers.shtml http://www.nathan.com/ed/glossary/
11. Isomursu, M. User experience evaluation with 24. Tractinsky, N. & Zmiri, D. (2006). Exploring attributes
experimental pilots, Proc. of UXEM, of skins as potential antecedents of emotion in HCI. In
www.cs.tut.fi/ihte/CHI08_workshop/papers.shtml P.Fishwick (Ed.), Aesthetic computing (pp. 405-421).
Cambridge, MA: MIT Press.
12. Karapanos, E. and Martens, J.-B. The quantitative side
of the Repertory Grid Technique: some concerns, Proc. 25. Väänänen-Vainio-Mattila, K., Roto, V., & Hassenzahl,
of UXEM, M. (2008). Now Lets Do It in Practice: User
www.cs.tut.fi/ihte/CHI08_workshop/papers.shtml Experience Evaluation Methods for Product
Development. Workshop in CHI’08. Proc. Human
13. Kelly, G. A. (1963). A theory of personality. The Factors in Computing Systems, pp. 3961-3964.
psychology of personal constructs, paperback. New http://www.cs.tut.fi/ihte/CHI08_workshop/index.shtml
York: Norton.
26. UPA (Usability Professionals’ Association): “Usability
14. Kozlow, S., Rameckers, L., Schots, P. (2007). People Body of Knowledge”,
Research for Experience Design. Philips white paper. http://www.usabilitybok.org/glossary (22.5.2008)
http://philipsdesign.trimm.nl/People_Reseach_and_Exp
erience_design.pdf
22
VUUM 2008
Exploring User Experience Measurement Needs

Pekka Ketola Virpi Roto
Nokia Nokia Research Center
ABSTRACT behavioural, and reflective level (Norman 2003), and Nokia

We conducted an empirical study on user experience (UX) followed these lines with the Wow, Flow, Show model
measurement needs at different units and levels of Nokia (Nokia 2005).
product development and asked which kinds of UX
measurements would be useful in different parts of the As UX highlights the emotional aspects, also emotion
organization. In this report, we present the initial results of measurements have been investigated. Most emotion
this study. We found that the needs for UX measures were evaluations concentrate on identifying the emotion a user
not only about design details, but mostly about how the has while interacting with a product, and both objective and
different touch points between user and company are subjective methods are used to collect this information (e.g.
experienced along the product experience lifecycle. Mandryk et al 2006; Desmet et al. 2001).
UX evaluation can take place also after interaction phase.
Author Keywords For example, Hassenzahl (2003) has investigated the
User experience, Measurement pragmatic and hedonic aspects of products from the
perspective of product appraisal. This model helps to
INTRODUCTION measure user experience in real life, preferably after long-
In a big corporate like Nokia, measurements play an term use.
important role in all phases of product development as they
enable systematic improvement of the products. Several Gartner (2007b) considers UX measurements from the
company departments are interested in user experience perspective of return on investment: what is the monetary
measurement. To make the measurements relevant and benefit of spending money on user experience
useful, we first need to find out the needs of different improvement. They study the relations of brand experience,
stakeholders for UX measurements before starting to define company experience, and the implications to related
the metrics. revenues and costs. User experience is claimed to be a
subset of brand experience. According to Gartner research
In Human-Computer Interaction (HCI) field, measurements (2007a), the success of UX can be measured in hard metrics
have traditionally been usability measures, such as and as intangible benefits:
efficiency, effectiveness, and satisfaction (ISO 13407);
learnability, memorability, error prevention, and • Increased revenue: More orders per customer, More
satisfaction (Nielsen 1993); effectiveness, learnability, repeat engagements, More products per order.
flexibility, and attitude (Shackel 1991, 25); guessability, • Reduced cost: Fewer support calls, Fewer returns due
learnability, experienced user performance, system to mistake or misperceptions, More efficient server use
potential, and re-usability (Jordan 1998). Learnability is the
common element that is included in all above measures of • Faster time to market due to accelerated development:
usability. Increased customer satisfaction, Improved brand
image, Positive word of mouth.
When usability evolved to user experience, the
measurements broadened from pragmatic (easy and
THE EMPIRICAL STUDY
efficient) to experiential (delighting). Jordan upgraded his To our mind, people at different roles and levels of product
list to functionality, usability, pleasure, and pride (Jordan development are the suitable population from whom to ask
2002). Norman set the goal in engaging users in visceral, their view on user measurements and produced data. We
selected to make a phenomenographic survey (Marton and
Booth 1997) on the different measurement needs in their
personal or classroom use is granted without fee provided that copies are proper environment.
bear this notice and the full citation on the first page. To copy otherwise, or
We use qualitative Email survey for data collection (Meho
republish, to post on servers or to redistribute to lists, requires prior specific 2006). For practical reasons we invited 42 Nokia people
permission and/or a fee. from different corporate functions which we believe have
interest in UX measurements, to act as subjects in this
study. We sent our short question (below) to selected
23
specialists, senior specialists, managers, senior managers, point of interest is out of box readiness with products and
directors and vice presidents). A total of 18 of them services.
responded within one week
• How easy it is to start using product and services?
Which User Experience information (measurable data
gained from our target users directly or indirectly),is useful • What is customer experience in support activities?
for your organization? How? Quality (n=6). This group consists of quality managers and
The question is intentionally very open and can be specialists, working with concrete products or in company
interpreted in many ways. This way the study participants wide quality development activities. Respondents in quality
are not limited in describing their measurement needs, and are particularly interested in developing the quality
can address any area they think is valid. Only one measurement practises, and understanding the users’
participant asked for further clarification for the question. perceptions about both products and support services:
We analyzed the responses by using the content analysis. • Which metrics should be applied for experienced
The free-text answers were first collected to an Excel product quality?
document, matching the person, role and response. This • What is the perceived performance of products and
grouping told us the key information needs for each services?
discipline (Section “Measurement Needs for Different
Groups”) and the variations between roles within that Common Needs for User Measurement
discipline. Then, one researcher organized and grouped the In this section we provide a consolidated grouping across
answers using mind map technique (see all responses, based on a mind map categorization.
http://en.wikipedia.org/wiki/Mind_map). The grouping was
reviewed by the other researcher. This led to common User experience lifecycle
grouping of topics across disciplines (Section “Common Measurable information is needed not only when the user is
Needs for User Measurement”). We gave the draft of our using the product for its original purpose, but also when the
report to the respondents to confirm that we interpreted and user is planning to buy a new device, when the new device
classified their responses correctly. is being taken into use and when there is a shift from an old
device to a new device.
Measurement Needs for Different Groups
In this section we shortly summarize the key needs for each What should be Examples of measures
studied group. We will discuss only four groups that measured?
answered most actively. Pre-purchase The impact of expected UX on purchase
decisions
Research (n=3). This group represents people who work
with research management or hands-on research, before First use Success of taking the product into use
development takes place. Measurement needs in this group Product upgrade Success in transferring content from old
are seen in two main groups: device to the new device
• How users perceive and use new technologies?
Table 1. Measurement areas in UX lifecycle
• Which are the most important UX or usability
Retention
problems in current products and services?
Retention is a concept and also measurement describing the
Development (n=4). This group represents people who loyalty of the customers. It is assumed that good user
manage and design concrete products and services, such as experience leads to high retention. Retention information
product manager or software designer. This group would tell us how many customers continue with the brand,
emphasizes the first use of products and services: how many newcomers there are in customer base, and how
many customer leave the brand. Among retention topics we
• Which functions are needed the most? can see non-UX information needs, such as the ownership
• What are the first impressions (overall experience, of previous devices.
first use) and level of satisfaction?
Care (n=5). This group represents people who manage and
provide numerous product supports and maintenance
services in online forums and in local support centers. In
most cases they have direct connection to customers. This
group has very a rich set of measurement needs. Major
24
VUUM 2008
What should be Examples of measures What should be Examples of measures

measured? measured?
Expectations vs. Has the device met your expectations? Customer experience in How does customer think & feel about
reality “touchpoints” the interaction in the touch points?
Long term Are you satisfied with the product quality Accuracy of support Does inaccurate support information
experience (after 3 months of use) information result in product returns? How?
Previous devices Which device you had previously? Table 5. Measurement areas in customer care
Engagement Continuous excitement
Experiences with user interface localization
Table 2. Measurement areas in retention English is the primary language in technology
communication, and often English language variants are
User groups and use of features provided first. There is a need to understand and measure
“What product features are used by different demographic how far this is enough, and what is the value of localized
groups” is the information that is mentioned by respondents content in terms of user experience.
from Research and Care organizations. The question has
What should be Examples of measures
two key themes:
measured?
- What kind of differences there are in the use of Effect of How do users perceive content in their local
product function across different user groups? localization language, what does it mean to them, how do
- What functions are most used (and underused)? they feel about it?
What should be Examples of measures Table 6. Measurement areas in localization

measured?
Perceived performance
Use of device What functions are used, how often, In product development it is essential to know the
functions why, how, when, where?
performance targets, and how those targets are met in
Differences in user How different user groups access completed products.
groups features?
What should be Examples of measures
Reliability of product Comparison of target users vs. actual measured?
planning buyers?
Latencies Perceived latencies in key tasks.
Table 3. Measurement in the use of functions Performance Perceived UX on device performance
Where are the experience breakdowns? Perceived complexity Actual and perceived complexity of task
Breakdowns and concrete problems seem to have link to accomplishments.
almost all our respondents. A common challenge is to Table 7. Measurement areas in device performance
identify breakdowns and measure the improvements.
What should be Examples of measures Experiences with new technologies
measured? There are several information needs mentioned by the
people who represent research teams. Especially the
UX Obstacles Why and when the user experiences
frustration? reactions towards new proposed or realized solutions are
interesting.
Malfunction Amount of “reboots” and severe technical
problems experienced. What should be Examples of measures
measured?
Usability problems Top 10 usability problems experienced by
the customers. Change in user How are usage patterns changing when
behaviour new technologies are introduced
Table 4. Measurement areas in breakdowns
Innovation feedback New user ideas & innovations triggered by
new experiences
User experience in customer care and maintenance
Organizations providing customer care and other support Table 8. Measurement areas with new technologies
need to measure how users experience the given support,
for example, was it helpful?
25
value: development and application of an approach for
DISCUSSION
research through design. The Design Journal, 4(1), 32-
Limitations 47.
Primarily the findings should be used as a new data for
3. Gartner. (2007a). Valdes R. and Gootzit D (eds.).
further UX measurement development and research
Usability Drives User Experience; User Experience
activities. Our findings are not complete nor universal since
Delivers Business Values.
the study was conducted in only one firm and with a limited
4. Gartner. (2007b). Valdes R. and Gootzit D (eds.). A
number of respondents.
Value-Driven, User-Centered Design Process for Web
Attention can be paid to the low response rate, but in the Sites and Applications.
phenomenographic study the high response rate is not as 5. Hassenzahl, M. (2003). The thing and I: understanding
important as the saturation level achieved. Alexandersson's the relationship between user and product. In
survey (1994) on more than 500 phenomenographic studies M.Blythe, C. Overbeeke, Monk, A. & Wright, P.
concluded that the variation of a phenomenon reached (Eds.), Funology: From Usability to Enjoyment (pp.
saturation at around 20 research participants. Our number of 31-42). Dordrecht: Kluwer.
informants (18) is close to that figure, and we can argue that 6. ISO 13407:1999, Human-Centred Design Proceses for
the saturation took place in our study. Interactive Systems. International Standardization
Organization (ISO), Switzerland.
Practical Implications
Most of the UX measurement needs are familiar and 7. Jordan, P.W. (1998). An Introduction to Usability.
already handled in existing practises. However, in our view Taylor & Francis.
this study provides new information revealing common 8. Jordan, P.W. (2002). Designing Pleasurable Products:
cross-organizational needs for measurements. When new An Introduction to the New Human Factors. Taylor &
UX measurements are developed or existing measurements Francis.
are improved, there should be sufficient cross-functional 9. Mandryk, R., Inkpen, K., Calvert, T.W. (2006). Using
review to find out who else would benefit of the measures, psychophysiological techniques to measure user
and who else could be already measuring related topics or experience with entertainment technologies. Behaviour
collecting similar data. The same finding can be extended to & Information Technology, 25, 2, 141 – 158.
consider also cross-firm perspective, such as for developing
UX measures together with business partners, ‘third parties’ 10. Marton. F. and Booth S. (1997). Learning and
and developers. awareness. Lawrence Erlbaum, Mahvah N.J.
11. Meho L.I. (2006). E-Mail Interviewing in Qualitative
Implications to Research Research: A Methodological Discussion. Journal of
To our mind the data from our survey is consistent with the the American Society For Information Science And
evolution of measurements that are visible in previous Technology. August 2006., Wiley Periodicals.
research from a few different disciplines. As the discipline
12. Nielsen J. (1993). Usability engineering. Academic
of user experience is now forming, it is beneficial for the
Press. New York.
field to be aware of the kinds of metrics needed in industry.
It is also healthy to start UX metrics work from the needs of 13. Nokia Corporation (2005). Inspired Human
the audience that will use the results. This hopefully helps Technology. White paper available at
UX researchers to establish the boundaries for UX http://www.nokia.com/NOKIA_COM_1/About_Nokia
measurements and even for UX as a discipline. /Press/White_Papers/pdf_files/backgrounder_inspired_
human_technology.pdf
Our research results are still tentative. The current data
requires more thorough analysis and discussion that 14. Norman, D. (2004). Emotional Design: Why We Love
compares our findings with others. Our research will (or Hate) Everyday Things. Basic Books.
continue to look deeper at the responses and contextual 15. Peppard, J. (2000). Customer relationship management
factors (organizational environment), with the aim to (CRM) in financial services, European Management
develop and generalize useful model for further research Journal, 18, 3, 312-327.
and development. 16. Shackel, B. (1991). Usability—context, framework,
definition, design and evaluation. In Human Factors
REFERENCES
For informatics Usability, B. Shackel and S. J.
1. Alexandersson, M. (1994). Metod och medvetande,
Richardson, Eds. Cambridge University Press, New
Acta Universitatis Gothoburgensis, Göteborg.
York, NY, 21-37.
2. Desmet, P.M.A., Overbeeke, C.J., Tax, S.J.E.T.
(2001). Designing products with added emotional
26
VUUM 2008
Combining Quantitative and Qualitative Data

for Measuring User Experience of an Educational Game
Carmelo Ardito1, Paolo Buono1, Maria F. Costabile1, Antonella De Angeli2, Rosa Lanzilotti1
1 2
Dipartimento di Informatica Manchester Business School
Università di Bari The University of Manchester
70125 Bari, Italy Po BOX M15 6PB - UK
{ardito, buono, costabile, lanzilotti }@di.uniba.it [email protected]
ABSTRACT attractiveness (harmonious, clear), emotion (affectionate,
Measuring the user experience (UX) with interactive lovable), engagement (fun, motivation and stimulation).
systems is a complex task. Experiences are influenced not
only by the characteristics of an interactive system, but also The evaluations of mobile systems open further challenges.
by the user’s psychological state and the context within User performance can be disrupted by external
which the interaction occurs. Human-Computer Interaction circumstances, such as noise, distractions, and competition
researchers have emphasised the importance of quantitative for resources in multi-task settings, or other surrounding
measures for providing useful, valid and meaningful people. These factors are difficult to predict and can only be
assessments. Until now, there is no consensus among observed in the field. In fact, the comparison of laboratory
researchers as to which specific techniques should be used and field evaluations reported in [7] demonstrated that users
for evaluating UX. This paper presents an evaluation behave more negatively in the field than in the laboratory.
methodology that combines different techniques to measure Users also take longer time to perform certain tasks and also
the UX. Some of this techniques provides quantitative data, reported more negative feelings, such as dissatisfaction and
others provides qualitative data. The method was applied in difficult of use, to the use of the device in the field.
the evaluation of Explore!, a mobile learning system HCI (Human-Computer Interaction) researchers have long
implementing a game to be played by middle school emphasised the importance of quantitative measures for
students during a visit to an archaeological park. Results providing useful, valid and meaningful evaluation. Until
demonstrate the importance of triangulating qualitative and now, there is little consensus among researchers as to which
quantitative evaluation data to provide a clearer assessment specific techniques should be applied to understand the UX
of the UX. [4]. We think that different techniques should be used to
measure different aspects of the UX. This approach has
INTRODUCTION and motivation been followed in the evaluation of Explore!, an m-learning
Measuring the user experience (UX) with interactive system implementing a game to be played by middle school
systems is a complex task. Experiences are influenced not students during a visit to an archaeological park. We have
only by the characteristics of the interactive system (e.g. conducted a field study in which a range of different
complexity, usability, functionality, etc.), but also by the techniques were used for comparing the experience of
user’s internal states (such as predispositions, expectations, playing the game with and without technological support.
needs, motivation, mood, etc.), and by the context (or the The evaluation study and its results are illustrated in details
environment) within which the interaction occurs (e.g. in [3]. We summarise here the main findings, discussing
organisational/social setting, meaningfulness of the activity, how quantitative and qualitative techniques should be
voluntariness of use, etc.) [8]. Thus, the evaluation of UX integrated to provide a view of UX .
has to consider not only functional aspects of the system
(e.g., fast, easy, functional, error-free performance), but it EXPLORE!
has also to measure non-functional characteristics such as Explore! is an m-learning system that exploits imaging and
multimedia capabilities of the latest generation mobile
devices. Explore! implements an electronic game that
Permission to make digital or hard copies of all or part of this work for support learning of ancient history, transforming a visit to
personal or classroom use is granted without fee provided that copies are archaeological parks into an engaging and culturally rich
bear this notice and the full citation on the first page. To copy otherwise, or experience [1, 2]. The m-learning technique implemented in
republish, to post on servers or to redistribute to lists, requires prior specific Explore! is inspired by the excursion-game, an educational
permission and/or a fee. technique that help students to learn history while they play
in an archaeological park [5, 6].
27
In the study we have evaluated the “Gaius’ Day in qualitative data and quantitative data, that have been
Egnathia” game, designed to be played in the archaeological triangulated to provide a global view of the user experience.
park of Egnathia in Southern Italy. Gaius’ Day is structured
like a treasure hunt to be played by groups of 3-5 students, Observations were based on event-sampling, an approach
which have to discover meaningful places in the park whereby the observers record all instances of a particular
following some indications. The game consists of three behaviour during a specified time period. Each group of
main phases: introduction, game, and debriefing. In the children was shadowed by two independent observers, who
introduction phase, the game master gives a brief had received in-depth training on data-collection. The
description of the park and explains the game. In the game events of interest referred to problem-solving strategies,
phase, groups of students explore the park to discover some social interaction processes (including collaboration and
interesting places. This phase follows the metaphor of a competition), and interaction with the artefacts (mobile,
treasure hunt, where students have to locate physical places map, and glossary). These events were recorded in an
by some cues provided to them in written messages. These observation grid organized on the basis of mission and time.
locations have to be recorded on a map. Finally, in the Naturalistic observations; questionnaire;
Behaviour
debriefing phase, the knowledge which is implicitly learned focus group; essays and drawings.
during the game is reviewed and shared among students.
Engagement Open and closed questionnaires;
naturalistic observations; focus group.
EVALUATION
Learning Observations during the debriefing
The main objective of the evaluation was to compare the
session; questionnaire: learning self-
pupils’ experience while playing Gaius’ Day in its original
assessment immediately after the
paper-based version with the experience of the electronic
debriefing; multiple-choice test
version of the game. The results of this are reported in [3].
administered in school on the day after the
In this paper, we focus on describing the different
visit; essay writing in school.
techniques we applied in the study and discuss their relative
advantages and disadvantages for measuring the user Table 1. Instruments and techniques used in the
experiences of mobile learning. evaluation
The game type (paper-based vs. mobile) was manipulated Two questionnaires were developed for this study based on
between-subjects. Nineteen students, divided into 5 groups, the QSA, an Italian questionnaire measuring learning
played the paper-based version of the game; 23 students, motivation, strategies and behaviour [9]. The first
divided into 6 groups, played the mobile version. The main questionnaire was administered individually at the
difference between the conditions was the way in which archaeological park immediately after the game phase. It
individual missions were communicated to the students. In included 20 Likert-scale items, in the form of short
the mobile version of the game, the missions were statements regarding their game experience from the
communicated to students by text-messages and the location viewpoint of the following factors: collaboration,
was recorded in the mobile phone by entering a code competition, motivation, fun, and challenge. Responses
number identifying the place. The mobile also provide some were modulated on a five point scale, ranging from 1
contextual help, in the form of an ‘Oracle’, displaying a (strongly agree), to 5 (strongly disagree). The second
glossary of the terms directly related to the active mission. questionnaire was administered immediately after the
In the paper-based condition, the missions were debriefing to measure participants’ evaluation on the
communicated to the participants in a letter format discussion session and their opinions on how much they
distributed at the beginning of the game. Participants had to learnt during the game.
record the right location on a map. A glossary containing
the full list of terms was also delivered at the beginning of Learning was assessed in school on the day after the visit,
the study. via a multiple-choice test requiring the memory of facts, and
knowledge application. The test was designed by
Instruments researchers of Historia Ludens in collaboration with the
In order to evaluate the overall user experience with the school teachers.
Gaius’ Day game a wide range of different techniques were
exploited, specifically: naturalistic observations, self-reports Evaluation Study Results
(questionnaires, structured interviews, and focus groups), Results addressing behaviour, engagement, and learning are
post-experience elicitation techniques (drawings and reported in following subsections.
essays), and multiple choice tests. A summary of the main
factors addressed by our evaluation, along with the method Behaviour
and techniques applied is reported in Table 1. The use of Efficiency and effectiveness during the game were analysed
these different techniques allowed us collect both by traditional quantitative measures, namely time needed to
complete the games and percentage of correct answers to
28
VUUM 2008
the missions. Both variables were significantly better in the such questions were: "Have you found ****?", "Have you
paper-based condition of the game. Participants playing the finished?”, “What mission are you carrying out?”. In
game in the paper-based condition completed the challenge contrast, the paper-based groups were more talkative and
faster (mean = 29.5 minutes; std dev = 6.43) than those in often engaged in jokes and chit chat when they met. In
the mobile condition (mean = 38.5 minutes; std dev = 7.66). general, however, winning appeared to be important to
The mobile condition was more prone to errors than the students, who often enquired about the other groups’
paper-based one, with a difference of some 20 percentage performance during the game. We also witnessed a couple
points. of attempts to cheat, where students tried to swap codes
between different locations to make it impossible for others
The naturalistic observations performed during the game to win, or gave false answers to a direct enquiry.
phase were instrumental in understanding the reasons for
these differences. In fact, we observed that groups in the Interestingly these differences did not emerge as significant
paper-based condition preferred to choose their missions in the analysis of self-reported items addressing sociability
according to their locations or contextual knowledge rather issues in the questionnaire.
that following the order in which the missions were
presented in the letter. Another parallel strategy, commonly Engagement
adopted in the paper-based condition, was to read several To analyse strengths and weakness of the game, we
items in the glossary at the same time, making it possible to considered two open questions where participants reported
compare the target details. Once again, this behaviour was the three best and the three worst features of the game. A
not possible in the mobile condition, as the ‘Oracle’ total of 99 positive features were reported, and only 39
displayed only the glossary entry directly related to the negative features. On the average participants in the mobile
active mission. condition reported more positive features (mean per
participant = 2.7) than the paper-based group (mean = 1.9).
Naturalistic observations were also used to understand No differences in the number of negative features reported
problems with the game artefacts experienced during the by the two groups emerged (mean =1).
game (i.e. cell phone, map, and glossary). In particular, we
observed that the mobile groups had little problems in using Analysing the content of participants’ self-reports it
the telephone, only in two cases at the start of the game did emerged a different trend in the two game conditions. The
the technician need to intervene to explain how to use the most frequently reported positive features in the mobile
phone. One group had difficulties in managing paper sheets, game addressed the artefacts used during the game, whereas
i.e. the wind complicated writing the answers. In both in the paper-based condition participants referred most often
groups, students had some difficulties in reading the map, to the archaeological park. A total of 11 out of 19 references
but the game master helped them in overcoming the to artefacts in the mobile condition directly addressed the
problems. cell phone or some interface features, such as the Oracle
and the 3D reconstructions. Overall, the 3D reconstructions
Behavioural observations were also important in were given a score of 4.3 on a 5 point scale. One of the
understanding social dynamics driving group behaviour. In students commented “The mobile and the game are a
particular, we looked at leadership, defined as the winning combination”.
participants’ willingness to take charge of the game, The collaborative nature of the game was indicated as
contributing ideas and suggestions and allocating tasks to another winning factor by both groups. Children enjoyed
the other members. This was done by analysis the notes playing together and demonstrated a good team spirit all
collected in the field by a dedicated observer for each group over the game. The learning potential of the game was
and by video analysis. It was found that 50% of the another positive factor in both conditions. Students,
participants who played the leader role in the mobile especially those in the mobile condition, appreciated the
condition happened to be the one holding the cell phone, difficulty of the game: they enjoyed because “It was
whereas no clear trends emerged in the paper-based challenging”, as reported by a participant in a focus group.
condition.
As regards negative features, the trend of results is more
Looking at participant’s behaviour in the field it also homogeneous between the experimental conditions,
appeared that the mobile groups were more competitive although the mobile groups was more likely to complain
than the paper-based one. Usually, when they met they about the difficulty of the game and the paper-based group
ignored each other and continued to carry out their own was more likely to complain about the duration of the game,
mission. They appeared very concentrated on their tasks and normally considered to be too brief.
did not want to exchange any comments with their
adversaries. The few questions they exchanged were aimed Learning
to get information that could be useful to them. Examples of Learning was evaluated by a range of different techniques
including self-evaluation and formal assessments performed
29
with multiple choices, essays writing and drawing. In the drawings, and quantitative data, collected through
following some data are reported. On average the students questionnaire, multiple-choice learning test, behavioural
were very positive about the educational impact of the analysis.
systems and they all agreed they had learned something The study results have highlighted that in some domains,
(mean = 4.1; std dev = 0.66). No significant differences such as the mobile one, collecting and analyzing qualitative
emerged in the group comparison. Participants’ opinions data gives the possibility to discover UX aspects that should
were confirmed by an objective test. A total of 36 tests were be neglected by only considering data that are quantitatively
returned for analysis (21 from the mobile condition; 15 measurable. The outcome of the study suggests that in order
from the paper-based one). On average, students answered 9 to obtain meaningful, useful and valid results, qualitative
out of 11 questions correctly (std dev = 1.65). No and quantitative data should be combined. It is worth
significant differences between the game conditions mentioning that qualitative data helped us explain some
emerged. The distributions are strikingly similar (mobile unexpected results obtained by quantitative data. For
mean=8.8 (SE=.38); paper-based mean=9 (SE=.40)). example, regarding engagement aspect, no difference
emerged from quantitative data. Otherwise, from
Discussion naturalistic observation, focus group, we can state that
Summarizing, about the behaviour aspect, the evaluation students enjoyed the use of cell phone and appreciated very
study revealed that different problem solving strategies much the 3D reconstructions both on phone during the
emerged. In the paper-based condition, students changed the game phase and during debriefing.
mission order, either firstly performing those missions they
perceived as easier or according to a personal strategy; Quantitative data did not demonstrate the advantage of the
while in the mobile condition students had to solve one electronic game. But, during the focus group, participants of
mission after the other. This could be one of the reasons the mobile condition referred that they were not distracted
why, in the paper-based condition, students completed the by the technology, while participants of the paper-based
challenge in less time and with less errors. From this result condition reported that paper annoyed them it was a windy
we have learnt that mobile games require more interaction day. Questionnaire revealed to be a useful source of
freedom and context-dependent information to enhance the information, but failed to discriminate between the
overall user experience. experimental conditions. This may be due to a ceiling
effect, as evaluations in both conditions were very high.
The evaluation has also demonstrated that users enjoyed
playing the game and, although we could not demonstrate ACKNOWLEDGEMENTS
the expected superiority of the mobile game condition by Partial support for this research was provided by Italian
statistical comparison of questionnaire data, the introduction MIUR (grant "CHAT").
of the mobile appeared to be much appreciated as
demonstrated by qualitative data. The use of the mobile was REFERENCES
1. Ardito C., Lanzilotti R. Isn’t this archaeological site
directly acknowledged as one of the best features of
Explore!. We expect that, as we add to the interactive exciting!”: a mobile system enhancing school trips. AVI
features of the mobile, we will also improve the user 2008. Napoli, Italy, May 28-30, 2008. ACM Press
experience with Explore!. (2008). In print.
2. Ardito C., Buono P., Costabile M., Lanzilotti R., and
Finally, no difference in learning between the two Pederson T. Mobile games to foster the learning of
conditions was found. This can be explained by ceiling history at archaeological sites. VL HCC 2007. Coeur
effect, as the participants’ performance in both conditions d’Alène, Idaho, September 23-27, 2007, 81-84.
was very high. We think that it is not a bad result because
students are not distracted by the technology and also that e- 3. Costabile M., De Angeli A., Lanzilotti R., Ardito C.,
learning is as equally valuable as traditional learning Buono P., Pederson T. Explore! Possibilities and
provided that appropriate techniques are used such as challenges of mobile learning. CHI 2008. Florence,
excursion-game, which is able to engage and stimulate Italy, April 5-10, 2008. ACM Press (2008), 145-154.
students. 4. Bernhaupt, R., et al. (Eds). Proc. of workshop “Methods
for evaluating games-How to measure usability and user
CONCLUSION experience in games”, at ACE 2007.
In this paper, we have illustrated an approach for the 5. Cecalupo, M., Chirantoni, E. Una giornata di Gaio, in
evaluation of m-learning. The study shows the importance Clio al lavoro. Technical report, Didattica della Storia,
of triangulation of different techniques and measures to Università degli Studi di Bari, (1994), 46-57.
capture all the different aspects that the user experience 6. Ciancio, E., Iacobone C. Nella città senza nome. Come
involves. Our understanding of the game was supported by esplorare l’area archeologica di Monte Sannace,
a combination of qualitative data, collected through Laterza, Bari, 2000.
naturalistic observation, focus group, analysis of essays, and
7. Duh, H., Tan, G., and Chen, V. Usability evaluation for
30
VUUM 2008
mobile device: a comparison of laboratory and field 25, 2 (Mar-Apr 2006), 91-97.
tests. Proc. MobileHCI’06, Helsinki, Finland, September 9. Pellerey, M. Questionario sulle Strategie
2006, ACM PRESS, New York, 2006, 181-186. d’Apprendimento (QSA), LAS, Roma, 1996.
8. Hassenzahl, M., Tractinsky, N. User experience - a
research agenda. Behaviour & Information Technology,
31
Is User Experience supported effectively in existing
software development processes?
Mats Hellman Kari Rönkkö
Product Planning User Experience School of Engineering
UIQ Technology Blekinge Institute of Technology
Soft Center VIII Soft Center
SE 372 25 Ronneby, Sweden SE 372 25 Ronneby, Sweden
+46708576497 +46733124892
ABSTRACT Engineering most often attempts to split the product
Software process is one important element in the contextual complexity into smaller more manageable sub-functions. In
foundation for meaningful and useful usability and UX the end all the sub-functions are put together and a product
measures. In this base usability has been around for a long appears, hopefully as the designer or the idea maker
time. Today the challenge in the mobile industry is User intended. Deviations from the intended product idea are
experience (UX), which is starting to affect software handled through iterated defect reporting and defect
engineering processes. A common use or definition of the handling until the product is judged to have sufficient
term UX is still not de facto defined. Industry and academy product quality. Hence, monitoring product quality is
are both in agreement that UX definitely includes more than conducted by processes in which milestone criteria are
the previous usability definition. How do industry and measured mainly by different ways of controlling defect
manufacturers manage to successfully get a UX idea into levels and defect status. So far this approach has been
and through the development cycle? That is, to develop and sufficient enough when striving to secure a product’s
sell it in the market within the right timeframe and with the quality from a task and goal perspective (classic usability
right content. This paper will discuss the above challenges view), but still no guarantee for enhancing the user
based on our own industrial case, and make suggestions on experience (that increases the chances of product success).
how the case related development process needs to change In the goal and task view three canonical usability metrics
in order to support the new quality aspects of UX. have dominated, i.e. effectiveness, efficiency and
satisfaction. Where the latter, satisfaction, has been a term
Author Keywords capturing the feeling experience on a very high level, i.e.
Software development processes, User Experience, without further dividing it into its diverse constituent
Usability, Usability test, Management, Hedonic, Mobile, elements. Today a new level of quality needs to be handled,
Product development, product validation, Concept i.e. user experience. Handling this quality forces us to
validation divide satisfaction into soft values such as: fun, pleasure,
pride, intimacy, joy, etc. [8, 4]
ACM Classification Keywords
H5.m. Information interfaces and presentation (e.g., HCI): One problem that follows with this new approach towards
Miscellaneous. D.2.8 Metrics, D.2.9 Management. quality is the risk of losing the holistic product view if we
split it into smaller manageable sub-functions in the
INTRODUCTION production process. In the quality of user experience
Mobile phones have reached a point beyond the level where apparently small changes made in different subparts can
technical hot news is not enough to satisfy the buyers, for actually constitute a huge user experience change when put
today mobile devices also have to include the aspects of together in the final product. It is also difficult to predict the
user experiences. Apple’s iPhone is one indication of this effects of such separately handled changes. Example of this
change. Software engineering strives to make complex could be that applications in the past have been more or less
things manageable. separate entities or islands in a mobile product. This has
provided opportunities for application designers and
engineers to apply their own solutions and create their own
Permission to make digital or hard copies of all or part of this work for application specific components with “isolated” specific
not made or distributed for profit or commercial advantage and that copies behavior to support a use case. Such isolated behavior can
bear this notice and the full citation on the first page. To copy otherwise, or and will be a big threat to the total UX of a product. In this
republish, to post on servers or to redistribute to lists, requires prior specific aspect designers and engineers need a generic framework of
permission and/or a fee. deliverable components to make sure that the total UX of
32
VUUM 2008
the product is consistent throughout the product The research group U-ODD – Use-Oriented Design and
development. Development [13], belongs to the School of Engineering at
Blekinge Institute of Technology (BTH), and is a part of the
Pushing out ownership and responsibility to the separate
research environment BESQ [1]. The research performed
parts is a common management strategy. Are organizational
by U-ODD is directed towards use-orientation and the
models that push ownership out to the leaves in
social element inherent in software development. Studies
organization really effective in the mobile industry with its
performed by U-ODD are influenced by the application of a
many actors and stakeholders? Doesn’t this model
social science qualitative research methodology, and the
encourages handling risks via a focus on each constituent
application of an end-user's perspective.
part rather than a holistic view on the end product or are
there better and more efficient ways of making an idea The process of cooperation is Action research according to
appear in a product? Ways that could shorten the time to the research- and development methodology called
market, minimize the risk of fragmentation of the product, Cooperative Method Development (CMD), see [3] [11],
and in new effective ways help organizations to prioritize chapter 8) for details.
and secure a successful UX in a product. How can we
maintain a holistic perspective despite multiple splits of HOLISTIC PRODUCT VIEW
functionality during development? For the goal and task There is a risk of losing the UX intent of a product if no
related usability paradigm models simple division and support structure is in place. In order to keep the
delegation has been successful. In this era with a growing organization “mean and lean” and at the same time deliver
need for high level monitoring of UX in products we are UX focused products we need to secure the vision of a
still left with the goal and task oriented development product throughout the development process. Today many
models. Sorry to say but having good quality on each companies have developed methods to validate concepts of
different part may not be sufficient and definitively not a the final product with end users. UIQ technology uses for
guarantee for a good and successful end product. We need instance their UTUM method. [12]. Unfortunately these
to find new ways to measure and monitor this combined kind of validation activities are too often handled by and
quality aspect. To support UX efficiently a process with a within a UI Design/Interaction Design group and not as part
clear product focus is needed in parallel with application of the overall design process, e.g. as ad hoc help in the
development processes. Otherwise, because of the design work at different stages.
prevailing task and goal tradition, there is a risk that we talk
Figure 1 visualizes the separateness product vision into
about a holistic product view but in practice end up
many divided requirements and thereby risk not monitoring
monitoring small identities. Still, we believe the
UX in a holistic way; it also represent today’s goal and task
engineering approach of separation is powerful and
oriented development models. The outcome/product
necessary in large projects. So - what are the possible
includes the risk of becoming something that was not
approaches for ensuring an idea appears throughout the
intended.
prevailing engineering approach of separating the
development? The introduction of an overall design process
is the solution we advocate, as a frame surrounding existing
goal and task related development processes.
RESEARCH COOPERATION
The industrial partner is UIQ Technology AB, a growing
company with currently around 330 employees, situated in
the Soft Center Science Research Park, Ronneby, Sweden.
The company was established in 1999, and has as one of its
goals the creating of a world leading user interface for
mobile phones. Their focus is “to pave the way for the
successful creation of user-friendly, diverse and cost-
efficient mobile phones” [12]. They develop and license an
open software platform to leading mobile phone
manufacturers and support licensees in the drive towards
developing a mass market for open mobile phones. Their Figure 1.
product, UIQ, is a media-rich, flexible and customizable
software platform, pre-integrated and tested with Symbian How do we then keep the end user product in focus in a
OS, providing core technologies and services such as large product organization? Today most companies in the
telephony and networking. Mobile industry face the challenge of increasing demands
on UX products. A lot of effort in improving the UX
33
capabilities in the software is undertaken but there are no Their role would need to be to police the fulfillment of the
new UX criteria included in the measurements of product UX quality criteria in the process defined and decided
quality even if most companies have both verbal and checkpoints. These checkpoints could e.g. be expert
written UX statements and visions on their walls as lead reviews of requirements and expert UX reviewers to get the
goals for their businesses. A product quality definition is authority to set a pass/not pass stamp on the intended
still based on different levels, measurements and delivery. This needs to be agreed and formalized into the
predictions of defects as criteria, and seldom includes development process.
usability and/or UX quality criteria. This means there is no
connection or possible way of measuring the “temperature” UX DEFINITION AND MEASUREMENT
of UX in the product during the development between Beside the introduction of an overall design process
vision and final product. There is also a divergence between monitoring the UX visions we also need to develop ways to
UX quality and existing product quality, meaning that we both define and measure it. Today we use the UTUM
have processes and means (traditional Software method to measure usability according to efficiency,
Engineering) to monitor product quality by defects, but effectiveness and satisfaction and visualize the result for
these are not a guarantee for achieving an envisioned high management as shown in figure 3.
level of UX in the final product. The only way is to change
the culture to become more UX and product focused and
not as, in many cases today, focused on part delivery.
Therefore a more appropriate way to inject UX quality
assurance into the development process would be by:
1. Gaining acceptance of a vision through user research
with end users by means of methods like early prototype
testing.
2. Policing the vision throughout the development process
by internal review methods to secure UX product quality.
UX quality criteria and milestones should be included in an
overall design process influencing the development process.
A new quality assurance role needs to be created for UX
experts to act as guardians for the UX quality.
3. Validating the product and evaluating the result against Figure 3.
the vision, again by formalizing existing methods like P. Jordan [6] writes about Functionality, usability and
UTUM [1] in the development process. This is also pleasure as the essential ingredients for a successful product
visualized in figure 2: or service to provide a consumer experience, see figure 5.
(See also [7] for discussion of contextual factors) Jordan
explains the three levels of consumer needs that have to be
supported as:
Functionality: Meaning that a product without the right
functionality is useless and causes dissatisfaction. A
product has to fulfill the needs of a user.
Usability: When a product has the right functionality the
product has to be easy to use. A user doesn’t just expect the
right functionality they expect ease in use.
Figure 2. Pleasure: When the user is used to usable products they
want something more. Meaning, when functionality and
An organizational set up like the one described in Figure 2
usability have become levels of plain “product hygiene”
would be a better guarantee that the product vision and
they want products that just not bring functional benefits
intent is what will be delivered in the end compared with an
but also give the user emotional benefits.
organizational set up as described in figure 1. Meaning that
the whole of the organization needs to understand and We think Jordan’s view on UX is a good starting point and
prioritize the end result. The way to secure product quality is in line with our thoughts and present thinking. For us the
and to include UX into the product quality aspect has to be Hedonic side of an experience is much about the users
to introduce “UX guards” in all levels of development.
34
VUUM 2008
experience and the Pragmatic side is about the user's period 2001 to 2004 an attempt was made, initiated by Kari
expectations. Rönkkö from the research group U-ODD from Blekinge
Meaning that the Hedonic aspects of UX is more about how Institute of Technology (BTH), to use Personas in UIQ’s
the product creates pleasurable and exciting experiences development processes. The intention was to find
that delights the user in sometimes unexpected but something that could bridge the gap between designers and
hopefully positive ways compared with developers and others within the company, and find a way
the Pragmatic aspect of UX that is about how well/bad a of mediating between many different groups within the
product lives up to the users expectation of that product's in company. The Personas attempt was finally abandoned in
functionality, ease-of-use etc. 2004, as it was found not to be a suitable method within the
This is also applicable to our thinking in this paper about company, for reasons which can be found in [10]. In 2001,
the need to increase the product quality aspect. By Symbian company goals included the study of metrics for
developing methods to understand users in all three areas, aspects of the system development process. An evaluation
and by making it possible to also measure pleasurable tool was developed by the, at that point of time, head of the
aspects of a product, it would be possible to measure UX interaction design group at Symbian, Patrick W Jordan,
product quality. together with Mats Hellman from the Product Planning
User Experience team at UIQ Technology, and Kari
Today we measure in and cover the areas of “Usability” and
Rönkkö from U-ODD/BTH. The first part of the tool
“Functionality” but just partly in the area of “Pleasure” by
consisted of six use cases to be performed on a high fidelity
doing traditional ISO standard measurements as Efficiency,
mock-up [9] within decided time frames. The second part of
effectiveness and user satisfaction. But we propose a shift
the tool was the System Usability Scale (SUS) [2].
in to include aspects of “Hedonic”, “Pragmatic” and
“Business cost” (Development cost, infrastructure, The first testing took place during the last quarter of 2001.
technology maturity etc.) in measuring UX. It is also vital The goal was to see a clear improvement in product
to understand the context in which the product will be used, usability, and the tests were repeated three times at
as visualized in figure 4. important junctures in the development process. The results
of the testing process were seen as rather predictable, and
did not at this time measurably contribute to the
development process, but showed that the test method could
lead to a value for usability. During the period 2004 to
2005, a student project was performed, supported by
cooperation with representatives from U-ODD, which
studied how UIQ Technology measured usability. The
report from the study pointed out that the original method
needed improvements and that the process should contain
some form of user investigation, a way of prioritizing use
cases, and that it should be possible to include the test
leader’s observations in the method. The test method was
refined further during 2005, leading to the development of
UTUM v 1.0 which consisted of three steps. First, a
questionnaire was used to prioritize use cases and to collect
data for analysis and statistical purposes. The use cases
could either be decided by the questionnaire about the
user’s usage of a device (in this choice Jordan’s level of
functionality is visible) or in advance decided by the
company if specific areas were to be evaluated. After each
use case the user performs a small satisfaction evaluation
Figure 4 questionnaire explaining how that specific use case
supported their intentions and expectations. Each use case is
EXTENDING UTUM WITH UX QUALITY carefully monitored videotaped if found necessary and
UTUM is a usability test package for mass market mobile timed by the test expert. The second step was a performance
devices, and is a tool for quality assurance, measuring metric, based on completion of specified use cases,
usability empirically on the basis of metrics for satisfaction, resulting in a value between 0 and 1. The third was an
efficiency and effectiveness, complemented by a test attitudinal metric based on the SUS, also resulting in a
leader’s observations. value between 0 and 1. These values were used as
parameters in order to calculate a Total Usability Metric
The current test package is the result of several years of with a value between 0 and 100. Besides these summative
joint research cooperation and experimentation. During the
35
usability results the test leader also through his/her to act upon the result in a similar way as we do with
observation during the test can directly feed back formative defects. This would also be a possible way to undertake
usability results to designers in the teams within the regular measurements along with existing processes and
organization, giving them user feedback to consider in their using existing forums for displaying and acting on the
improvement and redesign work. received result.
It is also the test expert observation that is decisive in what
NEW PROCESSES NEEDED
the trade offs are in the test. One example could be a use Today many organizations, including UIQ Technology AB,
case that takes too long to perform compared to the norm, have decided to work in multidisciplinary teams to secure
which still is rated as high usability on the satisfaction quality and focus on deliveries. It is our belief that this is
scale, by the user. It is then up to the test expert to explain necessary, but there is a big risk that an organization that
and describe that trade off. focuses on securing component quality loses sight of the
In 2005, the work with UTUM gained impetus. A usability actual holistic product intent, meaning that the delivered
engineer was engaged in developing the test package, and a product doesn’t match the actual product intent. In order to
new PhD student under Rönkkö’s supervision from the secure the vision of product intent in complex and multi
research group U-ODD became engaged in the process and requirement projects the organization needs to acknowledge
began researching the field of usability, observing the the need for policing. Not just defect levels, but also and
testing process, and interviewing staff at UIQ. In an maybe even more important, the holistic product intent
iterative process in 2005 and 2006, the method was further throughout the development cycle and in all different teams
refined and developed into UTUM 2.0. UTUM 2.0 is “a participating in the development process.
method to generate a scale of measurement of the usability This is needed to secure an efficient and effective way of
of our products on a general level, as well as on a functional working towards a successful product. Today it becomes
level.” [12]. The test package was presented for the industry more and more important to deliver UX focus products
at the Symbian Smartphone show in London, October 2006, faster and faster, whereby it also becomes vital for an
and the philosophy behind it is presented in full detail on organization to decrease development times.
UIQ’s website. [12] Since UTUM is available via the
Internet, we do not provide detailed instructions of the "Everything about mobile phone design and production has
testing procedures in this paper. to be quick, so it's months from when there is an idea for a
phone to the roll out on the market," said James Marshall,
Increasing the UTUM test package to a more in depth focus Sony Ericsson's head of product marketing, who is in Las
on UX and measurements of pleasurable and “holistic” Vegas this week for the trade fair. "The market moves very
aspects through for example, questionnaires would help the quickly, so you have to minimize development times."[5]
organization to focus on the UX aspects within the software
development process. The visual presentation for Our suggestion is that companies organize in such a way so
management for the achieved UX in a product at a specific that UX requirements developed by end user understanding
time could then be presented as a radar diagram containing and user knowledge are monitored throughout the
the current status in each area of UX. (See Figure 5) development cycle. Companies need to develop a better
holistic understanding of their product, and the attempt that
product should support and aimed for, even if they optimize
their efforts in multi-disciplinary teams responsible only for
specific components in a system. This could be done by
having UX guarding functionality in leading positions in
the development process. People that monitor the holistic
view of the product and who have the mandate take
necessary actions whenever it is needed to secure the
product intent.
CONCLUSION AND SUMMARY

A new approach to this problem that focuses on the
organization per se, as well as ways of working, is needed.
By prioritizing UX in products a new perspective needs to
be adopted if the intended result should appear in products
in the market and appreciated by users. On a high level
Figure 5 there needs to be a cultural shift into a more UX oriented
This could be a way to include UX quality measurements and UX driven mentality within the whole product
into the software development process as a quality aspect, development organization. On an organizational level UX
36
VUUM 2008
quality assurance needs to be established by recognizing http://www.usabilitynet.org/trump/documents/Suschapt

and given authority to UX expertise that can secure the total .doc , 2008-05-22
UX product quality in all levels of development. In this 3. Dittrich, Y., Rönkkö, K., Erickson, J., Hansson, C. and
paper we want to throw light upon the fact that if mobile Lindeberg, O. Co-operative Method Development:
manufactures want to be competitive and able to deliver Combining qualitative empirical research with method,
high quality UX products fast and with regularity and technique and process improvement, in the Journal of
predictability the need for UX “guardians” are necessary, Empirical Software Engineering, Springer Netherlands,
see Figure 6. (online 18 Dec 2007) 1382-3256 (Print) 1573-7616
(Online), 2007.
4. Hassenzahl, M. and Tractinsky, N. User experience – a
research agenda, in Behaviour & Information
Technology, Vol. 25, No. 2, March-April, pp. 91 – 97,
2006
5. International Herald Tribune/Technology and Media.
Iphone pushes mobile makers to think simpler. By Eric
Sylvers.
http://www.iht.com/articles/2008/01/09/technology/wir
eless10.php, 2008-05-22
6. Jordan, Patrick W., 2008. The four Pleasures:
Understanding User’s holistically. In proceedings of
applied ergonomics International, (AHFE
Figure 6 International), 2008.
We also point out and argue the need for both cultural as 7. Karapanos, E., Hassenzahl, M., and Martens, J. User
well as organizational changes. Today we monitor and experience over time. In CHI '08 Extended Abstracts on
define product quality by measuring defects levels in Human Factors in Computing Systems (Florence, Italy,
different ways. This will still be needed but must be April 05 - 10, 2008). CHI '08. ACM, New York, NY,
complemented by UX quality measurements. The product 3561-356, 2008.
quality definition needs to be increased and widened to
8. Law, E., Hvannberg, E. and Hassenzahl, M. , User
include measurements from the UX area and these new
Experience: Towards a Unified View, in proceedings of
quality criteria need to be accounted for with the same
the NordiCHI 2006 Workshop, Oslo, Norway, 14 Oct.
priority as other quality criteria. More organizational effort
2006.
should be spent on developing Metrics and KPI’s for
monitoring and securing UX product quality. 9. Rettig, M., Prototyping for tiny fingers,"
Communications of the ACM, vol. 37, pp. 21 - 27,
ACKNOWLEDGMENTS 1994.
We wish to thank Gary Denman from UIQ Technology AB 10. Rönkkö, K., B. Kilander, M. Hellman, and Y. Dittrich,
for providing valuable support and insights. This work was "Personas is Not Applicable: Local Remedies
partly funded by Vinnova (a Swedish Governmental Interpreted in a Wider Context," presented at
Agency for Innovation Systems) under a research grant for Participatory Design Conference PDC '04, Toronto,
the project WeBIS [14]; and the Knowledge Foundation in Canada, 2004.
Sweden (work to make Sweden more competitive) under a 11. Rönkkö, K. Making Methods Work in Software
research grant for the software development project Engineering: Method Deployment as a Social
“Blekinge – Engineering Software Qualities”[1]. achievement School of Engineering, Blekinge Institute
of Technology, Ronneby, 2005.
12. UIQ Technology. UIQ Technology Usability Metrics,
REFERENCES
UIQ Technology, http://uiq.com/utum.html. 2008-05-
1. BESQ. Blekinge - Engineering Software Qualities,
22
http://www.bth.se/besq , 2008-05-22.
13. U-ODD. Use-Oriented Design and Development,
2. Brooke, J., "System Usability Scale (SUS) - A quick
http://www.bth.se/tek/U-ODD, 2008-05-22
and dirty usability scale,
14. WeBiS, http://www.webis.se
37
On Measuring Usability of Mobile Applications
Nikolaos Avouris, Georgios Fiotakis and Dimitrios Raptis
University of Patras, Human-Computer Interaction Group
GR-26500 Rio, Patras, Greece
[email protected], [email protected], [email protected]
ABSTRACT As Annett [2] discusses in the debate paper of a special
In this paper we discuss challenges of usability evaluation issue on subjective versus objective method in ergonomic
of mobile applications. We outline some key aspects of practice: “..all knowledge is based on subjective experience.
mobile applications and the special characteristics of their What really matters in establishing scientific truth is the
usability evaluation that have recently lead to the laboratory method by which independent observers agree on the
vs field discussion. Then we review the current trend and meaning of their individual observations. Complex features
practices. We provide an example of a usability evaluation of the external world may be judged subjectively by experts,
study from our own background. We conclude with but the reliability of these judgments depends heavily on the
discussion of some open issues of usability evaluation of use of agreed criteria … that provide the best assurance of
mobile applications. inter-subjectivity”.
Author Keywords These agreed criteria however need to be based on deep

Usability evaluation measures, mobile applications, field understanding of the characteristics of mobile applications.
evaluation studies, user studies. So, to start with, the evaluation of mobile applications
necessitates examination of the factors that affect user
ACM Classification Keywords experience and interaction with such applications. It is
H5.2. Information interfaces and presentation (e.g., HCI): therefore understandable that during this exploratory phase
User Interfaces – Evaluation/Methodology, H5.4 existing evaluation techniques involving established
Information interfaces and presentation (e.g., HCI): Mobile measures and qualitative approaches are mostly to be used.
Applications, Usability Evaluation Methods, Usability So, in a survey of usability attributes in mobile applications
Measures. by Zhang and Adipat [31] nine attributes were identified
that are most often evaluated: learnability, efficiency,
INTRODUCTION memorability, user errors, user satisfaction, effectiveness,
Usability evaluation of mobile applications is an active area simplicity, comprehensibility and learning performance. All
of research and practice in continuous evolution, due to the nine of them are well defined and extensively used
high demands and the evolving context in which mobile measures of usability in more traditional desktop
applications are developed and used. A fundamental applications. Yet it is apparent that mobile applications
concern of researchers and practitioners of this area is the introduce new aspects that need to be considered. We
use of adequate usability methods and measures. This has cannot limit the evaluation only to the device (typical
taken recently the form of a debate between the advocates scenario in desktop applications) but we must extend it,
of field versus laboratory study approaches, e.g. [18] and including aspects of context, which often bears dynamic
[19], that relates to qualitative vs quantitative measures. and complex characteristics. There is the possibility that a
This debate seems to have its roots in the historical single device is used in more than a single context, in
distinction between subjective and objective knowledge and different situations, serving different goals and tasks of a
corresponding epistemological implications, taking the single or a group of users. Also, group interactions, a
form of a discussion on the merits of the scientific versus common characteristic in mobile settings, influence the
the design research method [4]. interaction flow and increases the complexity of the
required analysis as well as the necessity of analysis of
complex observational data. The process of selecting
appropriate usability attributes to evaluate a mobile
Permission to make digital or hard copies of all or part of this work for application depends on the nature of the mobile application
personal or classroom use is granted without fee provided that copies are and the objectives of the study. So far, a variety of specific
not made or distributed for profit or commercial advantage and that copies measures (e.g., task execution time, speed, number of
bear this notice and the full citation on the first page. To copy otherwise, or button clicks, group interactions, support sought, etc.) have
permission and/or a fee. been proposed to be used for evaluation of different
usability attributes of specific mobile applications, as
VUUM2008, June 18, 2008, Reykjavik, Iceland. discussed in the survey section of the paper. Is this due to
the limited way of measuring user behavior and user
38
VUUM 2008
experience? Wilson and Nicholls [29] point out in [18], who found that having participants report usability
discussing performance during the evaluation of virtual problems while sitting down in the laboratory led to more
environments, “There are only a limited number of ways in usability problems being reported than when the
which we can assess people’s performance: we can measure participants performed the evaluation while walking. They
the outcome of what they have done, we can observe them suggested that this result might have arisen from different
doing it, we can measure the effects on them of doing it or demands on attention – in the seated condition, there was
we can ask them about either the behavior or its little distraction from the product and so participants were
consequences.” [29]. This leads to Baber [3] suggesting that able to devote most of their attention to it, but in the
perhaps what is required is not so much a set of new walking conditions attention needed to be divided between
measures, as an adaptation of existing approaches that pay the device and the task of walking.
particular attention to the relatively novel aspects of the
environment and activity that pertain to mobile devices. Tools for Mobile Usability Evaluation
Along these lines, a new breed of techniques for usability A number of tools have been proposed to support this
evaluation has been proposed [17], [18], [9], [14]. In a diversity of approaches. Some tools gather usage data
survey of research methods for mobile environments [13] a remotely through active (e.g., Experience Sampling Method
number of new data sources have been identified, to be - ESM, Diary studies) and passive modes (e.g., logging),
considered in design and evaluation of mobile applications. enabling field evaluation on real settings. The Momento [7]
These new data sources, may lead to new measures of the MyExperience [10] tools are two examples of such
usability and user experience. Mediated data collection systems. Both systems support remote data gathering. The
includes a range of approaches for collecting data remotely first relies on text and media messaging to send data to a
by relying on the participant or mobile technologies remote evaluator. The second also incorporates a logging
themselves. Simulations enable knowledge about physical mechanism that stores usage data for synchronization. An
movement, device input and the ergonomics of using a innovative approach is suggested by Paterno et al. [21])
device while mobile. Enactments enable researchers to which combines a model based representation of the
know more about why we carry these mobile devices with activity with remotely logged data of mobile application
us and what these devices give us the potential to do. user activity. An alternative approach is proposed by Yang
et al. [30] who have applied UVMODE a mixed reality
On the methodological aspect of usability evaluation of based usability evaluation system for mobile information
mobile applications, influencing the usability measures, device evaluation. The system supports usability evaluation
there is a growing interest in the development of scenarios, by replacing real products with virtual models. With this
personas or applying performance techniques [14]. Along system, users can change the design of a virtual product,
these lines, de Sa et al. [8] suggest a set of guidelines for and investigate how it affects its usability. While users can
generation of scenarios in usability evaluation of mobile review and test the virtual product by manipulating it, the
applications. They claim that specific details of the design system also provides evaluation tools for measuring
should be the focus of evaluation studies to be conducted in objective usability measures, including estimated design
the field or the lab. While no specific metrics are suggested, quality and users’ hand load. The metrics supported by this
a framework for definition of usability evaluation scenarios approach include direct measure of physical loads given to
is introduced that covers the following aspects: (1) the user’s hand when user grabs and manipulates a virtual
Locations and settings: lighting, noise, weather, obstacles, product. A custom built glove interface, equipped with
social environment. (2) Movement and posture: variations pressure and EMG (electromyography) sensors, is used for
for sitting, standing and walking (3) Workloads, measuring the hand load. The collected information is
distractions and activities: Critical activities, settings or visualized in real-time, so that usability experts could
domains requiring different degrees of attention, cognitive investigate and analyze the hand load while the user
distractions (e.g., phone ringing, etc), to study cognitive manipulates a virtual product.
recovery, physical distractions (4) Devices and usages:
Single vs dual handed interaction, and Given the theoretical and methodological aspects of the
stylus/finger/keyboard/numeric pad, different devices (e.g., mobile usability evaluation problem, it is worth examining
PDAs, smart phones, etc). (5) Users and Personas: the current trends in mobile usability evaluation practice;
movement and visual impairment, Heterogeneity – age this is the subject of the next section.
(small/large fingers), dexterity, etc. For each of these
aspects a set of variables has been suggested to be used by CURRENT PRACTICE: USABILITY EVALUATION OF
designers while arranging their scenarios and preparing the MOBILE APPLICATIONS
usability evaluation study. In this section we outline typical examples of current
mobile usability evaluation practice. First a survey of
Various aspects related with mobility need to be measured. reported evaluation studies that involved mobile
For instance the effect of mobility on the subjects of field applications has been performed, based on the ACM CHI
experiments was demonstrated by Kjeldskov and Stage 2008 Conference. Five cases that were found are briefly
39
discussed here in terms of context of the study and Guo et al. [12] introduced a new device (based on the
methodological approach, see Table 1. Nintendo Wiimote and Nunchuk) for human-robot
interaction. To evaluate this technique, they have conducted
Jokela et al. [15] described the design of the Mobile
a lab based comparative user study, with 20 participants.
Multimedia Presentation Editor, a mobile application that
This study compared the Wiimote/Nunchuk interface with a
makes it possible to author multimedia presentations on
traditional input device – keypad in terms of speed and
mobile devices. To validate and evaluate the application
accuracy. Two tasks were employed: the posture task
design, they have conducted a series of laboratory
utilized a direct mapping between the tangible interface and
evaluations, that involved 24 participants, and a field trial
the robot, and the navigation task that utilized a less direct,
with 15 participants. No quantitative measures are reported
more abstract mapping. A follow-up questionnaire recorded
from these studies, except of descriptive statistics of
the participants’ opinion about the preferred technique for
participants’ behaviour.
controlling the robot in these tasks.
Source System Number of Evaluation Metrics Used

Participants Method
Jokela et al. Mobile 24 (lab) Laboratory Qualitative measures of user
(2008) Multimedia 15 (field) evaluation and behaviour.
Presentation Field study
Editor
Guo et al. Nintendo 20 Lab based Speed and accuracy in both tasks
(2008) Wiimote and comparative and user preference through
Nunchuk – user study questionnaire
based controller
of a robot
Riegelsberg Use of Google 24 Field study in Qualitative measures and usability
er et al. Maps in Mobile four different problems found using group
(2008) Devices locations. briefing sessions, recorded usage,
multiple telephone interviews and
debriefs in a lab setting
Sanchez et AudioNature, A 10 Case study in Qualitative measures of
al. (2008) pocketPC lab involving effectiveness and performance
device for typical users through pre and post tests and
science learning questionnaires
for the blind
Bellotti et al. Leisure guide 11 Field study Qualitative measures of user
(2008) Magitti over a period experience recorded through
of several questionnaires
days
Table 1. CHI 2008 usability evaluation studies of mobile applications
Riegelsberger et al. [22] discuss a 2-week field trial of scale (12 statements regarding interaction with
Google Maps for Mobile phones, with 24 participants, that AudioNature and 12 statements related to the device used).
was conducted in various parts of the world. The field trial In addition in order to evaluate the impact of AudioNature
combined many methods: group briefing sessions, recorded on learning, preliminary pre-tests and post-tests were
usage, multiple telephone interviews for additional context administered. Cognitive testing consisted of different
around recorded use, and 1:1 debriefs in a lab setting with questionnaires to evaluate the learning of biology concepts,
the development team observing. No metrics were reported. the behaviours and skills of visually impaired users and
As a result over 100 usability problems were found. their performance with the cognitive tasks.
Insights were gained along several dimensions: user
Bellotti et al. [5] presented an evaluation study of a context-
experience at different levels of product familiarity (e.g.
aware mobile recommender system, the leisure guide
from download/install to habitual use) and identified
Magitti. A field evaluation of this mobile application was
hurdles to user experience arising from the mobile
conducted, in which 11 volunteers participated. The main
ecosystem (e.g. carrier and handset platforms).
features of Magitti were assessed in this study. The field
Sanchez et al. [23], evaluated the usability of AudioNature, study involved 60 visits (places to eat, to buy, and to do) in
an audio based interface implemented for pocketPC devices the Palo Alto area. After each outing, participants filled out
to support science learning of users with visual a questionnaire about their activities. In addition, all
impairments. The usability and the cognitive impact of the participants actions with the device were logged and map
device were evaluated. The usability evaluation traces of outings were collected. An experimenter
methodology used was that of a case study involving 10 accompanied each participant on one of their outings to
blind participants in a lab setting. An end-user usability test observe their use of the system. Finally, participants were
was conducted that contained 24 statements in a Likert interviewed after they completed all their outings. The
40
VUUM 2008
reported metrics were related to the user view on these on task completion, task time, error counts per task and
characteristics, while no quantitative usability measures post-task satisfaction rating, each of these four metrics,
were applied. once standardized, contributing equally to the calculation of
the SUM score. While this was a large scale usability
Quantitative Measures of Mobile Usability evaluation study, in which both usability quantitative and
While the CHI 2008 conference papers, typical of the qualitative measures were used, the produced report,
current practice on mobile applications usability evaluation, seemed to include both usability findings and comparative
applied mostly qualitative measures in field studies, there quantitative measures that were effective in communicating
have been cases in which mobile evaluation was based on the quality in use of the evaluated devices. One concern
specific quantitative usability measures. Two typical such related with this study, is the fact that there seem to be lack
examples are the evaluation study of Goodman et al. [11] of consideration on issues of mobility and the effect of the
and the study reported in [28]. mobility dimension on the user experience.
In the first case, [11] a number of measures of usability are A similar case is the study of Jokela et al. [16], who
suggested to be used in location sensitive mobile proposed two composite attributes the Relative average
applications that are, like mobile guides, to be tested in efficiency (RAE) and the Relative overall usability (ROU)
controlled field conditions (a setting termed Field which however do not take in consideration the specific
Experiments). The following measures and methods are characteristics of mobile devices. Finally an alternative
suggested: Performance: measured through timings and approach, focusing on social interaction, is the evaluation
number of errors, identification of particularly hard points study reported by Szymanski et al. [26], which analyses
or tasks: by number of errors or by success in completing user activity while touring a historic house with the Sotto
the tasks or answering questions correctly, Perceived Voce mobile guide.
workload: User satisfaction Through NASA TLX scales
and other questionnaires and interviews, Route taken and IN SEARCH OF A USABILITY EVALUATION METHOD
distance traveled measured using a pedometer, GPS or In the rest of the paper, we describe our own experience
other location-sensing system, or by experimenter with conducting usability evaluation studies of mobile
observation, Percentage preferred walking speed (PPWS) applications over the last years and conclude with an outline
Performance By dividing distance traveled by time to of a methodological proposal for measuring usability in
obtain walking speed, Comfort: User satisfaction. Device such contexts. An example of a typical usability evaluation
acceptability using the Comfort Rating Scale and other study of a mobile application from our own experience is a
questionnaires and interviews, User satisfaction and collaborative game supported by PDAs in a cultural-
preferences and Experimenter observations. In effect, in historical museum [27], [25], an earlier version of which
this scheme, the authors suggest in addition to established was discussed in [6].
measures, to take in consideration issues like walking speed
One of the characteristics of these mobile applications was
and route taken, while they also suggest that existing
that they were context sensitive, allowed interaction with
measures like the Task Load Index and Comfort Rating
objects in the environment (e.g. scan RFID tags), in order to
Scale to be adapted for mobile use. For instance the NASA
harvest information or make gestures for interacting with
Task Load Index (TLX) has been extended for mobile
the application, while there was a strong social aspect as
applications, by addition of the Distraction scale.
users acted mostly in groups. Evaluation studies were
In the second case, Tullis and Albert [28] report a case conducted in various phases of design of these applications.
study of usability evaluation of a mobile music and video Earlier in design time, these were laboratory based. One
device, conducted by practitioners S. Weiss and C. Whitby limitation of the lab evaluation approach was that it was
(pp. 263-271) in lab conditions. In this study a number of particularly difficult to reconstruct the context of use, given
alternative devices where used by participants for executing the above characteristics (social aspect, interaction with
a number of tasks relating to purchase and playback of objects of the environment etc.). One approach used was to
music and video, using three different mobile devices. The create scenarios and run them in a simulated environment.
measures used included: the time to complete the task, the Sketches of the scenarios that included incidents of
success or failure of each task, the number of attempts, interaction were built at these early stages (see Figure 1)
perception metrics, like feeling about the handsets before
and after one hour’s use (affinity), perceived cost, perceived
weight of handset, ease of use, perceived time to complete
the tasks and satisfaction. In the summary findings they
included in addition to qualitative findings, the summative
usability metric (SUM), based on work by Sauro and
Kindlund [24] This is a quantitative model for combining
usability metrics into a single score. The focus of SUM is
41
Then these scenarios were transformed in more detailed
models of the activity, see figure 2 for a task model of the
Inheritance Museum Game, using CTT [20]. These models
were then used for formative evaluation of the design in the
lab, using low fidelity prototypes. No quantitative
measurements were made during this phase, while
interaction was fragmented and focused in specific aspects
of the interaction and tasks, like navigation, scan of
exhibits, exchange of harvested information between group
members, etc.
Figure 1. Inheritance museum game sketch
Figure 2. Task model of the inheritance museum game using CTT notation [20].
and experiences from the activity in the museum, while a
High Fidelity Prototypes Usability Evaluation
As soon as high fidelity prototypes of the application were week later they were asked to report their experience.
made available, more extensive and systematic studies were In order to analyze all the collected data, we used
conducted in the lab, in terms of subjects and tasks. In this ActivityLens (Figure 4) that has already been effectively
case the task execution was recorded using video and audio used in similar studies [6, 25].
recording equipment. In the final phase, a study was
conducted in the field, following a micro-ethnographic Three usability experts, with different level of experience,
approach, which involves, typical users, engaged with the analyzed the collected data, in order to increase the
application for a limited period of time, following a given reliability of the findings. Initially, we created a new
scenario, without any intervention of the evaluators. ActivityLens Study including 4 projects (each project
concerns the observations of a team). We studied then the
In order not to miss important contextual information, integrated multimedia files and annotated the most
multiple video cameras were used in this study. Two of interesting interaction episodes. It should be clarified that
them were steadily placed in positions overlooking the we wished to evaluate the performance of each team and
museum halls, while the third one was handled by an not individual team members. The performed analysis
operator who tenderly followed the users from a convenient through ActivityLens revealed several problems related to
distance. One member of each group wore a small audio user interaction with the device and the overall setting,
recorder in order to capture the dialogues between them, given the social and physical context of use. In figure 3 the
while interacting with the application and the environment. dimensions of analysis used are outlined, these include in
Furthermore, snapshots of the PDA screens were captured addition to typical user device interaction, user-user
during the collaborative activity at a constant rate and interaction, user –setting interaction, while observed
stored in PDA's memory. After the completion of the study phenomena were related to other aspects, like interaction of
the guide, who was member of the evaluator team, the mobile devices with the infrastructure and other
interviewed the users, asking them to provide their opinion applications.
42
VUUM 2008
Various usability problems observed were unexpected and CONCLUSION

need further investigation, for instance the use of scrollbar In this paper, we discussed issues related with measuring
in the textual description of the exhibits. The typical users usability of mobile applications. It has been argued that the
were not familiar with the procedure of scrolling on a PDA nature of these applications necessitate new methods and
using a stylus, a task necessitating both hands, while tools for measuring their usability. Issues like study of
various cases of split attentional resources were observed. attentional resources and mobility need to be included in the
usability measures. Survey of current practice demonstrated
Quantitative measures of usability beyond the standard that so far most usability evaluation studies are of
performance and user satisfaction measures need however qualitative nature, while quantitative measure of usability
yet to be defined. Inspired by the literature, discussed also used are mostly related to well established general purpose
in the survey part of this paper, we currently work towards attributes of usability, like effectiveness, efficiency and user
defining a number of combined measures related to the satisfaction. However as the mobile applications field
dimensions outlined in Figure 3. matures and the usability evaluation methods become more
reliable and theoretically understood, the specific
characteristics and requirements of mobile applications, that
become pervasive in modern societies, we feel that there
will be accordingly adapted.
REFERENCES
1. Andre, T.S., Hartson, H.R., Belz, S.M. and McCreary,
F.A., The user action framework: a reliable foundation
for usability engineering support tools. Int. J. Human-
Computer Studies 54, (2001), pp. 107-136.
2. Annett, J., 2002, Subjective rating scales: science or art?
Ergonomics, 45 (14), pp. 966 – 987, 2002
Figure 3. Dimensions of analysis of observed behaviour 3. Baber C., Evaluating Mobile-Human Interaction, in J.
Lumsden, “Handbook of Research on User Interface
In addition, and using the ActivityLens tool that integrates a
design and Evaluation of Mobile Technology, 2007,
model-based view with observational data, we plan to relate
Hershey, PA, IGI Global
usability problems found to the task structure, the
dimensions of interaction and a classification of usability 4. Bartneck, C., What is good? A Comparison between the
problems scheme (e.g. the user action framework [1]), Quality Criteria Used in Design and Science, CHI 2008,
extending it in order to accommodate the characteristics of pp. 2485-2492, 2008 • Florence, Italy
mobile applications. 5. Bellotti V., Begole B., Chi E. H., Ducheneaut N., Fang
J., Isaacs E., King T., Newman M. W., Partridge K.,
Price B., Rasmussen P., Roberts M., Schiano D. J.,
Walendowski A., Activity-Based Serendipitous
Recommendations with the Magitti Mobile Leisure
Guide. CHI 2008, pp. 1157-1166, 2008, Florence, Italy
6. Cabrera, J. S., Frutos, H. M., Stoica, A. G., Avouris, N.,
Dimitriadis, Y., Fiotakis, G., and Liveri, K. D. 2005.
Mystery in the museum: collaborative learning activities
using handheld devices. In Proceedings of the 7th
MobileHCI '05, (Salzburg, Austria, September 19 - 22,
2005). vol. 111. ACM Press, New York, NY, 315-318.
7. Carter S., Mankoff J., Heer J., Momento: Support for
Situated Ubicomp Experimentation. CHI’07, pp. 125-
134, ACM
8. de Sá M. , Carriço L., Duarte L.. A Framework for
Mobile Evaluation. CHI 2008, Florence, Italy, pp. 2673-
Figure 4. The ActivityLens analysis environment, annotation 2678, ACM Press, April, 2008
of video recording and mapping to task model.
9. de Sa M., Carrico L., Defining Scenarios for Mobile
Design and Evaluation, CHI 2008, pp. 2847-2852, 2008
• Florence, Italy
43
10. Froehlich J., Chen M.Y., Consolvo S.,Harrison B., 22. Riegelsberger J., Nakhimovsky Y., Seeing the Bigger
Landay J.A., MyExperience: A System for In Situ Picture: A Multi- Method Field Trial of Google Maps
Tracing and Capturing of User Feedback on Mobile for Mobile, CHI 2008, pp. 2221-2228, Florence, Italy
Phones. MobiSys’07, pp. 57-70, ACM. 23. Sánchez J.,Flores H., Sáenz M., Mobile Science
11. Goodman J., Brewster S., and Gray P., Using Field Learning for the Blind, CHI 2008, pp. 3201-3206, 2008
Experiments to Evaluate Mobile Guides, in Schmidt- • Florence, Italy
Belz, B. and Cheverst, K. (ed.), Proc. Workshop “HCI 24. Sauro J., Kindlund E., A method to standardize
in Mobile Guides”, Mobile HCI 2004, Glasgow, UK. usability metrics into a single score, Proc. CHI 2005, pp.
12. Guo C., Sharlin E., Exploring the use of tangible user 401-409.
interfaces for human-robot interaction: a comparative 25. Stoica A., Fiotakis G., Raptis D., Papadimitriou I.,
study, CHI 2008, pp. 121-130, Florence, Italy Komis V., Avouris N., Field evaluation of collaborative
13. Hagen P., Robertson T., Kan M., Sadler K., Emerging mobile applications, chapter LVIX in J. Lumsden (ed.),
research methods for understanding mobile technology “Handbook of Research on User Interface Design and
use, Proc. OZCHI '05, Canberra, Australia, 2005 Evaluation for Mobile Technology”, 2007, pp. 994-
14. Jacucci G., Interaction as Performance, PhD Thesis, 1011, Hershey, PA, IGI Global
University of Oulu, Oulu University Press, 2004 26. Szymanski, M. H., Aoki, P. M., Grinter, R. E., Hurst,
15. Jokela T., Lehikoinen J.T., Korhonen, H., Mobile A., Thornton, J. D., and Woodruff, A. 2008. Sotto Voce:
Multimedia Presentation Editor: Enabling Creation of Facilitating Social Learning in a Historic House.
Audio-Visual Stories on Mobile Devices, CHI 2008, pp. Comput. Supported Coop. Work 17, 1 (Feb. 2008), 5-34.
63-72, Florence, Italy 27. Tselios N., Papadimitriou I., Raptis D., Yiannoutsou N.,
16. Jokela, T., J. Koivumaa, J. Pirkola, P. Salminen and N. Komis V., Avouris N. (2007), Designing for Mobile
Kantola (2006). "Quantitative Usability Requirements in Learning in Museums, in J. Lumsden (ed.), “Handbook
the Development of the User Interface of a Mobile of Research on User Interface Design and Evaluation for
Phone. A Case Study." Personal and Ubiquitous Mobile Technology”, Hershey, PA, IGI Global
Computing 10(6): 345-355. 28. Tullis T., Albert B., Measuring the User Experience,
17. Kjeldskov J. & Graham C. (2003) A Review of Mobile Morgan Kaufmann, 2008.
HCI Research Methods. 5th Int. Mobile HCI 2003, 29. Wilson J.R., Nichols S.C., 2002, Measurement in virtual
Udine, Italy, Springer-Verlag, LNCS environments: another dimension to the objectivity/
18. Kjeldskov J., & Stage J., New Techniques for Usability subjectivity debate, Ergonomics, 45 (14), pp. 1031 –
Evaluation of Mobile Systems, International Journal of 1036
Human-Computer Studies, 60, (2004), pp. 599-620. 30. Yang U., Jo D., Son, W., UVMODE: Usability
19. Nielsen C.M., Overgaard M., Pedersen M.P., Stage J., Verification Mixed Reality System for Mobile Devices,
Stenild S., It's worth the hassle!: the added value of CHI 2008, pp. 3573-3578, Florence, Italy
evaluating the usability of mobile systems in the field, 31. Zhang D. and Adipat B., Challenges, Methodologies,
Proc. 4th Nordic CHI, Oslo, 2006, pp.272 - 280. and Issues in the Usability Testing of Mobile
20. Paternò F., Model-Based Design and Evaluation of Applications, Int. Journal of Human-Computer
Interactive Applications, Springer, 2000. Interaction, 18, 3, (2005), 293-308.
21. Paternò F., Russino A., Santoro C., Remote Evaluation
of Mobile Applications, TAMODIA 2007, LNCS 4849,
pp. 155 – 169, 2007
44
VUUM 2008
User Experience in the Systems Usability Approach

Leena Norros Paula Savioja
VTT Technical Research Centre of Finland VTT Technical Research Centre of Finland
P.O.Box 1000, FI-02044 VTT, Finland P.O.Box 1000, FI-02044 VTT, Finland
ABSTRACT The systems can also be characterised complex socio-technical
In this paper we propose how and why the concept of user systems [1] because they comprise both social and technical
experience (UX) could be used in the analysis of user components which have to work together in a uniform manner
interfaces of complex systems. Complex systems often to produce the desired outcome.
relate to work and professional usage of tools. In this
context UX is typically not considered a meaningful The traditional approach to the analyses of the quality and
attribute. As opposed to that, we feel that experience is functioning of complex socio-technical systems has been
especially important in work-related contexts as it has a safety. From a human factors point of view this line of thought
link to development of good work practice, job means the effort of minimising the possibility of human errors.
satisfaction, and motivation. Thus, in this paper we Recently, the focus of the human factors research in safety
describe how we think that UX should and could be made critical complex systems has been extended to cover also the
operational in the analysis of ICT tools used in various possible positive effects of the human components of the
kinds of work. systems for example in tackling unexpected situations. But
even a more systemic approach has been proposed which
Author Keywords emphasises the transaction of the technical and human
User experience, activity theory, practice, systems elements, and their mutual development [2]. A new notion of
usability. “joint cognitive systems” (JCS) proposed recently by
Hollnagel and Woods represents the system-oriented point of
ACM Classification Keywords view [3]. JCS presents the idea that from a functional point of
H5.m. Information interfaces and presentation (e.g., HCI): view, human and technology constitute a unified system.
Miscellaneous. While in the traditional Human-Computer Interaction (HCI) or
Cognitive System Engineering (CSE) approaches a structural
INTRODUCTION perspective was dominant and the information flow between
In this paper we describe how the concept of user two separate elements, i.e. human and technology was in
experience UX could be, and why it should be considered focus, in the JCS approach the intelligent features and patterns
in the analysis of complex system user interfaces. With of functioning of the unit should be the focus of design. From
complex systems we refer to ICT (information and this perspective, rather than safety alone, an overall adaptive
communication technology) based tools that are used in functioning of the system becomes a relevant success
work that can be characterised as outcome critical from criterion. For example features that are typically considered
the point of view of society. Such works include for within usability research may be linked to more global risk
example control of power plants (including nuclear), and reliability features.
control of other industrial plants (manufacturing and
process industry), manoeuvring of large ships, command Our approach proposed recently under the title “systems
and control of emergency services, command and control usability” [4] is another example of a system-oriented
of traffic systems etc. Also smaller scale systems such as approach. We have proposed how to tackle safety and
patient monitoring can be considered to belong to the usability issues in an integrated way in empirical analysis and
class of complex systems. What is in common with all of practical design connections. We have also demonstrated that
the above mentioned systems is that the work conducted exploiting the concept of activity of the cultural historical
with them is demanding and requires thorough training. activity theory (CHAT) opens a possibility to consider system
functionalities more comprehensively than just dealing with
the user interface [see also 5]. In our work we have become
Permission to make digital or hard copies of all or part of this work for aware that experience of usage of complex tools is an intrinsic
personal or classroom use is granted without fee provided that copies are element in understanding capabilities and willingness of users
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
to adopt new tools and technologies. We see however, that
otherwise, or republish, to post on servers or to redistribute to lists, research and innovation is needed to develop an approach that
requires prior specific permission and/or a fee. in an appropriate way comprehends user experience of
complex professional technologies. The aim of this paper is an
45
outline on which bases and how we would like to proceed Both aspects are incorporated in the analysis frame we have
in this work. labelled “contextual assessment of systems usability”.
EXPERIENCE IN THE DYNAMICS OF ACTIVITY Functional Approach to Analysis of Use Technology

Considering user experience is not a well developed In a technology development oriented analysis of an activity
research area with CHAT. Kaptelinin and Nardi (2006) in system the tools that are used in the activity are the key
their recent account on “activity theory and interaction” element of analysis. According to the theory the tools mediate
touch the topic only briefly. They see that CHAT has the role between the actor and the object of activity; it is with
potential to consider the role of emotions in the design tools that the actor can affect his/her environment. The tool’s
and use of technologies but that challenging theoretical effect on the environment is called its instrumental function.
and conceptual work has to be accomplished. Our interpretation is that the instrumental function refers to the
same phenomena as “effectiveness and efficiency” in the
Developing Mediations in Action prevailing ISO definition [12] of usability. Exploiting activity
We take one of the basic ideas of CHAT that all human theory, the tools in an activity system have also further
action is mediated by tools and signs [6] as our starting functions from instrumental. The second function of a tool is
point. CHAT has traditionally focused particularly in the its psychological function. This refers to the tool’s ability to
role of material tools [7]. Symbolic mediation has been shape the actor’s understanding of the world. Use of a tool
the focus of semiotics. Within it, the pragmatist semiotics gives the actor concepts and schemas concerning the world,
founded by Charles Sanders Peirce [8], has close which enable some cognitive processes such as reflection of
resemblance to CHAT, and great relevance to own behaviour and learning. Recently an addition to the
understanding mediated action of the modern age. In functions of tools has been proposed by a prominent advocate
CHAT it is also maintained that development of human of media theory Georg Rückriem [13]. Thus the third function
activity becomes evident in human beings becoming of a tool is its communicative function. This means that tools
capable of making environmental features as signs of new act as mediators that shape our shared understanding and
possibilities of action. People experience new mediations awareness of the surrounding world. Tools are a manifestation
as emotionally meaningful events, “decisive events” [9]. of our culture and communicate meanings among human
These events provide leverage for activity and create new beings.
motives. Drawing on Ch. S. Peirce, Koski-Jännes notes
further that the strength of mediation is then greater when Our claim is that good tools must satisfy all of the three above
all the different sign aspects - iconic showing mentioned functions before they can be considered appropriate
resemblance, indexical pointing to or identifying for a particular activity.
specifics, and symbolic creating generalized connections
Phenomenological Approach to Analysis of Use of
– are all present. A symbolic connection has emerged
Technology
when a sign awakens awareness of a more comprehensive The phenomenological tradition in philosophy and social
context, significance and worth of the event by the actor. sciences has traditionally considered interesting and worth
A further basic notion of CHAT that we consider relevant while to understand how people experience world, and also
is, that creating new meaningful and emotionally technology. Drawing on this tradition John Ihde has proposed
experienced mediations is a process that takes place in a his idea of “phenomenology of technics” [14]. Ihde identifies
“potential space” in which (a child and adult) the user and three forms of experiencing human-technology-environment
the designer jointly experience and develop the mediation relationships. These are the embodiment, hermeneutic and the
function. In this space the tool or sign does not yet have alterity relationships. In the first variant, technology, when
established the mediating function but receives attention functioning well, is experienced invisible and as merging with
and interest as it is, as is typical for play or art. We have the human body. In the second technology is experienced as
borrowed these ideas from Leiman [10], but they are close part of the environment and interpreted as signs denoting
to those of the so-called instrument genesis theory, too environmental possibilities. In the third variant the focus of
[11]. The “potential space” is clearly an experience is on technology as another agent like the human
intercommunicational space. him/herself. This approach does not assume that people
necessarily conceive technology as mediator or in a particular
The principles are the basics of understanding the function. It is interested in the user’s experience of
development of mediated action. Hence, they should be coordination with technology. This is, of course, a very
relevant in understanding how practices and tools should relevant aspect when considering usability of technology from
be developed and how to proceed in evaluating their UX point of view. We see, that understanding experience of
maturity. At the present state, we observe tools and tool this coordination provides added value to our attempt to
usage from two distinct perspectives. First we pay understand users’ experience of the functions of tools and
attention to the function of tools in activity, and, second, technologies.
we consider how people experience technology in usage.
46
VUUM 2008
Systemic Approach to Analysis of Use of Technology nature. We have connected the concept of experience of
We have earlier introduced the concept of systems technology use to the functions that the tool has in an activity
usability which provides a systemic approach to system (Table 1). With this we are able to infer which
understanding use of technology. In the approach we have experiences have meaning in the particular context in which
conceptualised what constitutes quality of tools in an the technology is used.
activity system. We see that the quality of tools becomes
focus of analysis outcome of mode of action experience
visible in the practice of usage. Good tools promote good
action
work practices! Further, the good practice, and its tool functions
measures can be made operational by analysing the
domain and the result critical functions of the particular instrumental task meaningful experience of
achievement, routines appropriate
activity. time, errors, functioning
objective
The systems usability approach presumes the following reference
principles for the analysis of use of technology.
psychological cognitive coordination, experience of
1. The analysis should be contextual. The quality measures, control of own fit for human
requirements for tools are determined by the SA, mental activity use, experience
models of own
objectives and purposes of the domain. The competence,
domain knowledge is elicited by modelling the experience of
critical functions and resources of the domain trust in
technology
2. The analysis should be holistic. The users, tools,
communicative amount and meaning of experience of
and the environment constitute one system the
content of actions and joint culture,
functioning of which is the focus of analysis. We communicati communications tool as a sign of
do not only analyse how the users affect the ons shared culture,
environment with the technical system, but also experience of
fit for own
vice versa: how the environment and the system
style
have an effect on user.
We have used the systemic approach to the analysis of use Table 5. Classes of measures used in CASU framework.
of technology in the development of evaluation
framework: Contextual Assessment of Systems Usability Measures of Experience
(CASU). In evaluating technology with CASU we make a The measures of experience are connected to the functions of
difference between the outcome of action and the mode of the tools in the activity system. Above (Table 1) are stated
action. This means that although a task may be completed also the classes of measures that are used in the functional
(the outcome is the desired one) there might still be evaluation of the tool. But only the measures related to
problems in the way of reaching the outcome. For experience are described in more detail.
example the practice might be such that it somehow
When the instrumental function is considered, the experience
expends resources excessively. Thus the result cannot be
that we are trying to unveil is the experience of the
considered as good as possible even though the objective
appropriate, and probably embodied, functioning of the tool.
has been accomplished. The two aspects of activity is
This is the feeling that the tool functions as expected and that
included in the Core-task analysis approach, in which
it is possible to use the tool for the purpose that it was
actual activity is distinguished from what it is a sign of, or
designed for.
what internal logic it refers to. The latter is the semiotic
point of view which is represented by the concept of In the psychological function the user experience is related to
mode of activity. the tool’s fit for human use. For example the feeling that the
tool functions as expected and that the user is able to use it are
The third dimension (in addition to the outcome and
characterisations of experience in psychological function. If
mode) in the analysis of technology use is the experience
the experience is positive the users feel that they are
of use. Thus in this type of analysis of technology use we
competent users and they are able to formulate appropriate
combine the functional and the phenomenological
trust in technology (not over or under trust).
approaches.
The communicative function of technology is experienced in
EVALUATION OF EXPERIENCE the community. If an individual user has an experience of
Evaluating experience is not easy. Especially so-called belonging to a community (of practice) that shares the same
objective evaluation of experience is at least extremely objective and tools and also rules and norms, which all enable
difficult, but it maybe also unnecessary as the him to interpret the meaning of his fellow actors’ behaviours
phenomenon itself (the experience) is not objective by and tool usages. These are embodiments of positive user
47
experience concerning the communicative function of a users is a natural and indispensable element. UX is part of this
tool. Further, the experience of the communicative dialogue.
function assumes that users’ are able to use the
technology with their own style the fits their own identity REFERENCES
and how they want others to perceive them. 1. VICENTE, K.J., Cognitive Work Analysis. Toward a Safe,
Productive, and Healthy Computer-Based Work. 1999,
Data Concerning Experience Mahwah, NJ: Lawrence Erlbaum Publishers,(1999).
In order to evaluate or measure the experience we need 2. FLACH, J., et al., Global perspectives on the ecology of
data concerning it. In our evaluations we have gathered human-machine systems. 1995, Hillsdale, New Jersey:
data about the experiences with mainly two methods: Lawrence Erlbaum Associates,(1995).
observations and interviews. 3. WOODS, D. and E. HOLLNAGEL, Joint cognitive
systems - patterns in cognitive systems engineering. 2006,
In the careful analysis of observation data it is possible to
Boca Raton: Taylor & Francis,(2006).
recognise experience related features in the users’
4. SAVIOJA, P. and L. NORROS, Systems usability -
behaviour. For example communications and statements
promoting core-task oriented work practices, in Maturing
about own actions give information about experiences.
Usability: Quality in Software, Interaction and Value, E.
Very careful analysis of the usage is needed in order to
Law, et al., Editors. 2008, Springer: London. p. 123-143.
understand the behavioural markers that are related to
5. KAPTELININ, V. and B.A. NARDI, Acting with
experiences.
technology. Activity theory and interaction design. 2006,
The other method to acquire data about experiences of use Cambridge, Massachusetts: The MIT Press. 333,(2006).
is the interview method. We have used interface 6. VYGOTSKY, L.S., Mind in Society. The Development of
interviews that are conducted after the use of the tool. In Higher Psychological Processes. 1978, Cambridge, Mass.:
the interviews we specifically ask the user to comment Harvard University Press,(1978).
how in his/her view the system works. Also a stimulated 7. LEONT'EV, A.N., Activity, Consciousness, and
interview method in which the user comments his or her Personality. 1978, Englewood Cliffs: Prentice Hall,(1978).
own use of technology is suitable in eliciting data 8. PEIRCE, C.S., Pragmatism, in The essential Peirce.
concerning UX. Selected philosophical writings, T.P.e. project, Editor.
1998, Indiana University Press: Bloomington and
CONCLUSIONS Indianapolis. p. 398-433.
In the development of complex system interfaces it is 9. KOSKI-JÄNNES, A., From addiction to self governance,
especially important to understand how the potential users in Perspectives on activity theory, Y. Engeström, R.
experience the new technology. The experience of work Miettinen, and R.-L. Punamäki, Editors. 1999, Cambridge
tools is an evolving phenomenon that might take different University Press: Cambridge, UK.
forms during the technology life cycle. There is a 10. LEIMAN, M., The concept of sign in the work of Vygotsky,
difference in comparison to the development of different Winnicot, and Bakhtin: Further integration of object
consumer products in which the experience is often relations theory and activity theory, in Perspectives on
connected to the decision to buy the product. With work activity theory, R. Engeström, R. Miettinen, and R.
related tools the user is not necessarily the same person Punamäki, Editors. 1999, Cambridge University Press:
who makes the decision to buy the particular tool. This Cambridge, Massachusetts. p. 419-434.
means that the role of experience is also different. The 11. RABARDEL, P. and P. BEGUIN, Instrument mediated
need for positive user experiences is rooted in the need for activity: from subject development to anthropometric
the users to have high motivation in their work and also to design. Theoretical Issues in Ergonomics Science. 6(5): p.
carry out their work with appropriate work practices. 429-461(2005).
Experience is related to both these concepts. If the tools in 12. ISO, ISO 9241-11. Ergonomic requirements for office
the work are experienced somehow negatively this has an work with visual display terminals (VTDs) - Part 11.
effect on the whole activity system. An contrary, positive Guidance on usability. (1998).
experience and activeness is raised if tools can be seen to 13. RÜCKRIEM, G.,Tool or medium? The Meaning of
support professionally and personally worthy Information and Telecommunication technology to Human
development in the activity. Practice. A quest for Systemic Understanding of Activity
Theory. University of Helsinki, Center for Activity Theory
We see that development of new mediations that
and Developmental Work Research:
technologies enable takes place in the “potential space” in
http://www.iscar.org/fi/.
which the role and features of technology have not yet
14. IHDE, D., Technology and the lifeworld. From garden to
been crystallised. In this space a communication and
earth. 1990, Bloomington, Indianapolis: Indiana
mutual understanding between technology developers and
University Press,(1990).
48
VUUM 2008
Developing the Scale Adoption Framework for Evaluation

(SAFE)
William Green, Greg Dunn, and Jettie Hoonhout
Philips Research
High Tech Campus 34
5656 AE, Eindhoven, NL
{first name.last name}@philips.com
statistics expertise. In this paper, the Scale Adoption
ABSTRACT
A growing number of psychometric scales have been Framework for Evaluation (SAFE) is presented and has
developed to measure a plethora of constructs within the been developed in an effort to support practitioners before
umbrella term, User Experience (UX). Unfortunately, selecting psychometric scales. A psychometric scale is a
selecting an appropriate scale for UX can be difficult for the form of measurement used to assess abstract qualia
usability practitioner. Based on psychometric scale subjectively experienced by an individual (e.g., individuals’
development literature, the Scale Adoption Framework for personality, their emotional states, and interpersonal trust).
Evaluation (SAFE) has been designed to support From this point forward, ‘psychometric scale,’ will be
practitioners in selecting appropriate scales. The SAFE is referred to simply as, ‘scale.’ The SAFE is intended to help
presented in this paper, with an example of how it could be inform practitioners, who may not have a psychology or
used. However, utilizing the SAFE has emphasised the psychometric background, about the essential properties of
difficulty in selecting suitable measures, revealing that the valid and reliable measures. To achieve this, the SAFE
SAFE will not solve this issue in isolation. Two main illustrates the most important elements that a practitioner
challenges are presented for the VUUM workshop: more should look for when adopting a measure, to ensure that it
explicit and transparent descriptions of psychometric scale is sufficiently robust and accurately measuring the construct
development, and a need for more sensitive psychometric that the practitioner needs, and it claims to measure.
scales. Future work should build on these challenges by Before presenting the SAFE, the UCD process is briefly
consulting usability practitioners directly. reviewed emphasising how psychometric measurement of
users’ subjective experience supports the evaluation of
Author Keywords products. The reasoning behind why it was believed that
Psychometric scale development, evaluation, UCD. the SAFE would be helpful for practitioners is then
presented. In the next section, we discuss development and
ACM Classification Keywords evaluation of psychometric measures, specifically focusing
H.1.2.,User/Machine System: Design, Experimentation, on what is particularly pertinent to practitioners when
Human Factors, Measurement. selecting a measure. This led to the development of the
SAFE. An example of how to utilise the SAFE is
INTRODUCTION illustrated. Initial use of the SAFE to select a measure of
This paper is concerned with psychometric scale emotion indicated several challenges, which has led to
development and its adoption by practitioners within the conclusions and recommendations for future research.
User-Centred Design (UCD) process. There are many
illustrative texts and practical examples of psychometric Measuring Usability and User Experience for UCD
scale development in psychology literature, e.g., [4, 21, and User-Centred Design (UCD) is a design philosophy that
26]. Interpreting this literature can be difficult however; the focuses on the needs and interests of users and emphasises
majority of psychometric literature is written for an making products usable and understandable [20]. There
assumed audience that will have either psychology or have been many descriptions of UCD in the literature that
vary in the nomenclature used. Nonetheless, [5] identified
three principles for user-centred system design, which are
Permission to make digital or hard copies of all or part of this work for still prevalent in the majority of the UCD descriptions,
personal or classroom use is granted without fee provided that copies are these are: 1) a user focus from the beginning and
bear this notice and the full citation on the first page. To copy otherwise,
throughout the process, 2) measurement of system usage,
or republish, to post on servers or to redistribute to lists, requires prior and 3) iterative design. As [3] noted, the two editions of the
specific permission and/or a fee. Handbook for HCI, e.g., [7], also emphasise the three UCD
principles proposed by [5]. Despite recent suggestion that
the emphasis on evaluation in [5] and [7] is dated [3] and
49
not always appropriate [6], evaluation is undeniably an regarding the low percentage of established satisfaction
important aspect in UCD, often achieved via measurement. scales being used by practitioners becomes even more
poignant. If practitioners are creating or implementing
Measurement of system usage by practitioners involved in
scales that do not provide valid or reliable results, this could
UCD is often concerned with usability evaluation, i.e., are
seriously compromise evaluations of constructs critical to
the products easy and comfortable to use, safe, effective,
the success of a product (e.g., trust measures for users’
efficient, and easy to learn how to use? These usability
perceived trustworthiness of an e-commerce website). This
evaluations have been done through objective (e.g., time or
paper supports [12]’s position recommending that
physiology) and subjective (e.g., perceptions, attitudes, and
practitioners should use established scales, but greater
other scales of psychological constructs) measurement and
emphasis is placed on supporting practitioners when
are typically based on at least one of three dimensions
selecting existing scales. To do this, steps are taken to
outlined by ISO 9241-11 [13], which are: Efficiency,
further understand the benefits of using scales from a
Effectiveness, and Satisfaction.
psychometric perspective, and their contribution to the UX
Subjective evaluation is often done via questionnaires, evaluation process.
described as both the most versatile, but also most often
misused research tool for HCI [19]. Despite [19]’s caveat Using Scales in User Experience
and support for questionnaire design, development, and Psychometric scales have often been employed for
appropriate use, it appears that the questionnaire remains evaluation when measurements are required across a
misused almost twenty years later. For example, using number of studies to determine some psychological
[13]’s dimensions, [12] demonstrated that while the construct (e.g., emotion, trust). This method enables users
measurement of both Efficiency and Effectiveness is to subjectively quantify their experiences, which map onto a
relatively straightforward and measured in similar fashion construct. These findings then indicate to practitioners how
among UCD practitioners, the measurement of users’ a product could be improved. This use of scales has
Satisfaction is diverse. This diversity is illustrated by provided a number of advantages: 1) it enables testing of
findings indicating that only 11% of satisfaction scales used large quantities of participants over short periods of time, at
by practitioners are established (i.e., valid and reliable) relatively low cost, which is practical since time and
[12]. Furthermore, [12] reported that the 89% of financial resources to conduct user-studies have usually
“homegrown” measures vary greatly in terms of reliability, been limited, 2) it is an easy to apply technique (although
and so, recommend that practitioners should use established scale development is more time-consuming and complex),
scales. This is not a new recommendation. For example, and 3) it has usually been non-intrusive for participants.
[15] emphasised that, “questionnaires should be elegant in To support design decisions regarding the dimensions of
terms of their reliability, validity and human factors appeal” UX through evaluation, scales must first be developed
(pg. 210). It is in this vein that [11] proposed a model to based on a UX construct definition. For example, emotion
support practitioners when selecting the best approach for and affect are ambiguous terms but are often postulated as
usability evaluation. His model outlines six approaches for integral to designing UX, e.g., [14]. This ambiguity stems
evaluating usability, both objectively and subjectively for from the literature itself. [22] differentiate between the
each of the three usability dimensions noted within the ISO terminology ‘affect’, ‘emotions’, ‘moods dispositions’ and
standard above. Extending from [11]’s model, this paper ‘preferences’ (pg. 124). They note that difficulty in
focuses on subjective approaches to usability evaluation. answering the question, “what is an emotion?” is related to
Obviously, performance criteria related to Efficiency and the interchangeable use of the terms ‘emotion’, ‘emotional’,
Effectiveness are important for consumer products, and ‘affect.’ An example of this can be seen in [23] in an
especially in the case of safety, comfort, and learnability. attempt to clarify this ambiguity in their use of terminology,
Particularly for consumer products, however, it has been “emotional and affective will be used interchangeably as
increasingly accepted that other requirements related to adjectives describing either physical or cognitive
Satisfaction should also be considered. Using a product components of emotion, although ‘affective’ will
should be enjoyable, engaging, and appealing [e.g., 1, 8]. sometimes be used in a broader sense than ‘emotional” (pg.
These requirements have often been discussed as part of the 24). This ambiguity stresses the importance of evaluating
User Experience (UX). UX has been particularly important UX dimensions with scales based on specific constructs.
given the migration from the workplace, ubiquity and One possible reason for the apparent lack of psychometric
growth of technology in the home, and the emergence of scale usage for UX could be due to practitioners simply not
‘intelligent’ and perhaps more complex products. being familiar with the exhaustive process needed to create
Nevertheless, like usability satisfaction, UX has numerous a new scale. Choosing whether a scale is appropriate for a
subjective meanings that are dependent not only on the particular product is difficult given that part of this decision
individual, but also the context of interaction. Given the involves understanding scale construction, which is perhaps
growing importance of UX evaluation, [12] findings limited to those with psychology or statistics expertise. It is
50
VUUM 2008
anticipated that new psychometric scales will be developed 2a) Construct validity – it is important that practitioners
to evaluate UX [e.g., 10]. Given the high percentage of see evidence from developers that the constructed
homegrown measures used to measure satisfaction and the scale shows some sort of relation (via
similarities between UX and satisfaction (both are
subjective and abstract constructs), it is possible that the 1) Construct definition
nature of UX will inadvertently encourage its practitioners Aim of Pertinent questions regarding construct definition for
construct definition practitioners when adopting a scale
to develop their own scales without properly recognising
issues such as scale validity and reliability. If this were to This is arguably the most
important element of scale
1b) Clarity: need to 1c) Discriminating:
occur, then it would only serve to propagate questions development: good scale
items cannot be formulated
1a) Is the construct
know what you
measure is not
you must know that
you are not
measuring
grounded in ambiguous and
concerning the validity and reliability of UX evaluation. To without it. Determine what
you are and are not
theory? confused, when it is
you are unsure of
something that you
are not intending to
prevent this possibility, this paper presents the Scale intending to measure. what you are
measuring.
measure
(confound).
Adoption Framework for Evaluation (SAFE), which is

intended to encourage and support practitioners to select
established scales for UX, rather than utilising homegrown
ones. But if a homegrown one must be used, then the 2) Scale validity
SAFE also provides a starting point for practitioners to Aim of scale validity Pertinent questions regarding scale validity for practitioners
consider what is vital when attempting to construct a valid when adopting a scale
and reliable scale.

The extent to which a scale
2a) Construct
measures what it intend to 2b) Context 2c) Sample
validity
measure
THE SAFE DEVELOPMENT APPROACH
Based on several methodological and statistical texts that
discuss (aspects of) scale development [e.g., 2, 21, 26], the
SAFE was created to portray the key elements of scale
3) Scale reliability
development that are, debatably, essential for scale
construction. As shown in Figure 1, the SAFE is composed Aim of scale reliability Pertinent questions regarding scale reliability for practitioners
when adopting a scale
of three elements. Within each element, on the left, a Ensure that the measure
description of the aim that developers have when consistently reflects the
construct, not only 3a) Inter-item 3b) Test-retest
considering the scale in scale construction is provided. internally (cronbach’s
alpha) but also over time
reliability reliability
Additionally, for each of these elements three key aspects

(test-retest)
are emphasized for practitioner’s to consider when deciding

on a scale for their usability or UX evaluation. The Figure 1: Scale Adoption Framework for Evaluation (SAFE)
pertinent aspects of these elements are as follows:
prediction or correlation) with existing
1. Construct definition – confirming that the scale construct measures/behaviours connected to the construct. For
(abstract qualia to be measured) is suitable. further information see [2] and [26].
1a) Theory – is concerned with whether scale 2b) Context – practitioners must be aware of the
developers have based their proposed construct on a context of use that the developers first envisaged when
strong theoretical foundation following a literature they constructed their scale. If practitioners use a scale
review, experiments, or consultation with domain in a context that was not originally envisaged by
experts. If this is not done, then there are fundamental developers, then the validity (and reliability) of the
questions regarding what the scale is measuring, scale could be compromised.
despite possibly being valid and reliable (see below).
2c) Sample – similar to Context, this aspect asks
1b) Clarity – recommends that the theory behind the practitioners to be aware of the sample (i.e., users) that
construct is clear, coherent, and straight to the point. If developers used when constructing the scale. If, in the
this is not the case, it becomes more difficult for the opinion of the practitioner, this is distinctly different
practitioner to decide whether the theory adequately from the sample that s/he is intending to recruit, then
describes the construct. this could compromise scale validity and reliability.
1c) Discriminating – acknowledges that it is important 3) Scale reliability – the extent the tool provides stable
for scale developers to also consider what the construct results across repeated administrations.
does not include at a theoretical level. For the
practitioner, this will simply mean that it will make it 3a) Inter-item reliability – Measured via Cronbach’s
easier to see what the scale does and does not measure. alpha or split-half reliability, it determines what extent
scale items measure the same construct that it is said to
2) Scale validity – the extent to which a tool or method be measuring accurately. Generally, Cronbach’s alpha
measures what it is intended to measure. of .7 is a recommended minimum, though higher
51
Cronbach’s alpha are said to be necessary depending being used extensively in usability testing. Without going
on the construct that is said to be measured. into a detailed description of the above mentioned studies,
this section provides a number of discussion points with
3b) Test-retest reliability – Much like Inter-item
respect to the use of these scales, and some lessons learned.
reliability, this aspect tests how reliable the scale is
when individuals are retested with the same scale. The Scale and adherence to SAFE
level of test-retest reliability depends on what is said to Pleasure,
Attribute of SAFE Usability
be measured, but again, a good rule of thumb is .7. arousal &
dominance
Interpersonal
trust [29]
Aesthetics
[18]
satisfaction
[19]
[17]
Example of the SAFE usage a) Theory Yes Yes Yes Yes
To initially ascertain whether or not the SAFE would 1)

b) Clarity
Yes: multi-
Yes: uni-dimentional Yes: bi-dimensional
Yes: multi-
support usability practitioners, the authors of this paper Construct dimensions dimensions
utilised the SAFE to review four established psychometric c) Discriminat’ Yes Yes Yes Yes and No
scales (Table 1). For this, the original publications of these a) Construct
validity
Yes Yes Yes Yes
psychometric scales were obtained. The scales have all 2) Designed for
been referenced in relation to their use in usability Validity b) Context Yes Yes Yes scenario based
evaluation
evaluations and are purposively concerned with disparate c) Sample N=78: 45 Male N=222 N=384: 211 Male N=377
constructs that are all related to usability, satisfaction, and a) Inter-item Split-half reliability:
UX. In Table 1 the four psychometric scales and their 3) reliability
No
0.92
Between 0.60 & 0.78 Exceeding 0.89
Reliability
adherence to the attributes of SAFE are tabulated. b) Test-retest No No Yes No
As seen in Table 1, the four scales do not adhere to all of Table 1: Example use of the SAFE2
the SAFE attributes. Nonetheless, all of these scales are • A general issue with several of the scales used is that
established and have been used to measure their respective quite often the scores obtained for different conditions do
constructs in publications. It is noted then, that it is not not show any significant differences. Obviously, a range
always possible or necessary to create scales that fully meet of causes might have contributed; the conditions not
the SAFE requirements discussed above. This decision is being distinct enough, too many variables in play,
down to the practitioner and whether or not the scale can be selection and number of participants, or the scales not
accepted. To determine how the use of SAFE can support being sensitive enough. The first two possibilities are
practitioners in selecting appropriate scales, several case related to the fact that the studies in question are
studies conducted in an industrial setting are now discussed. conducted in an ecologically valid setting, resulting in
conditions that are not too extreme, and relatively rich in
Determining the Affective Response of Participants: features. After all, in an industry context, testing
Feedback from Case Studies applications in a realistic setting is very important.
In many consumer application studies, a key question often
addressed is, what is the affective response of participants • One could also wonder if these measures are effective in
when using the product being tested? In several projects, typical usability test conditions that often involve a
ranging from studies investigating users’ affective response relatively limited number of participants.
to environments that present coloured lighting effects [9, • The SAM consists of three subscales, pleasure, arousal,
16], to game applications [28], different measures to obtain and dominance. The last subscale, however, is quite
participants’ affective state were used: Self Assessment often not well understood by participants, even when
Manikin [17], Activation-Deactivation (AD-ACL, [27]), provided with the official instructions as recommended
pleasure and arousal as evoked by environmental factors [17]. A growing number of researchers have decided to
[24], and intrinsic motivation inventory (IMI [29]). These simply no longer include the dominance subscale in their
instruments are all claimed to be established measures. It is tests. This raises questions regarding the underlying
important to realise though that these instruments, as well construct of the scale, or at least questions with regard to
as other affect scales that are often used in the context of applicability of part of the construct in a usability test.
UX studies, were never developed with possible application
of usability testing for consumer applications in mind. • Valence measures (i.e., pleasure scales) quite often show
These scales originate, for example, from research in the ceiling effects; obtained scores tend towards the positive
field of organizational psychology (e.g., IMI), or clinical end of the scale. In many cases, such strong positive
psychology (e.g., AD-ACL). scores are already obtained in connection with a baseline
condition. That is, even before the participants
Unfortunately, alternative measures do not seem to be experienced the device or condition under investigation,
readily available. Other means such as psycho-
physiological measures have many drawbacks, especially in
the context of testing consumer applications. So, it seems 2
Table will be illustrated further in the VUUM workshop.
inevitable to adopt these scales, such as the SAM, which is
52
VUUM 2008
they are in a positive state, and this continues for the test, the practitioner should consider that the original context of
unless something dramatic happens to impact mood. scale development may have an effect on the reliability and
• Finally, self-report measures are associated with a validity versus today. Thus it is suggested that practitioners
number of well-known issues: should at least acknowledge this and, if there is no time to
develop another scale, be aware of the potential problems
o Participants’ answers might be biased or guided by that could arise when reviewing findings.
what the participants think is the “right” answer, or by
socially desirable answers. Furthermore, the discussed Thirdly, to the knowledge of this paper’s authors, there is
self-reports are retrospective and, thus, potentially no authority textbook to support the selection of
subject to distortions. psychometric measures specifically for UCD and UX.
Given the abstract constructs that practitioners will be
o In the case of SAM, such self-report issues might be required to measure in UX, a textbook to support and
strengthened because it is a measure that involves only explain to practitioners the advantages of using
three subscales, making it easy to remember what one psychometric measures, and how they are developed, would
filled in before the test condition, for example. perhaps increase the validity and reliability of UX scales
DISCUSSION utilised by practitioners, regardless of whether or not these
The aim of the SAFE is to support the selection of scales are homegrown. It should, however, be noted that
psychometric scales for usability and UX evaluations. The there are some good texts related to the development of
development of the SAFE has built on previous tools to scales for the HCI domain. For example, [15] describe a
support practitioners in selecting and adopting ‘basic lifecycle of a questionnaire’ and call for a ‘battery’ of
psychometric scales. For example, [11]’s proposed model scales to be developed for HCI. They describe the need to
that outlined six approaches to measuring usability. The achieve validity and reliability, but not how this is done.
SAFE has been developed to support the selection of Another important focus is the importance of language and
psychometric scales for (subjective) usability satisfaction choice of scale, amongst others. The SAFE builds on this
and UX, supporting [12]’s and [15]’s position work by supporting the choice of appropriate scales.
recommending that practitioners should use established It seems that the call for a battery of psychometric measures
scales. The example use of the SAFE in this paper [15] has been met and the HCI community is now entering
considered existing scales related to the broad notion of a second generation of evaluation methods. Whilst the
(subjective) usability satisfaction and UX. A number of community has a battery of scales to support evaluation and
discussion points emerged. design, it is clear that the context for usage is now being
First, it became apparent that gathering the information to stretched. It is hoped that the SAFE will iterate to
complete the SAFE was not as easy as initially anticipated. practitioners that just utilising a psychometric scale does
The main reason for this was that scales are developed in not determine if constructs being measured are valid in
many different ways, (e.g., using different formats and every situation, and that using a psychometric scale does
statistical tests). This diversity is also reflected in the not guarantee significant findings. In other words, a
language used to describe scale development. Translating psychometric scale can be the wrong measure for
the original publications of the scales to complete the SAFE evaluating a system. Using the SAFE should encourage
required an understanding of the development of practitioners to question appropriate use of scales.
psychometric scales. It is not anticipated that all Nonetheless, the example use of the SAFE to select a
practitioners wishing to conduct an evaluation will have this psychometric scale in the UX example SAM [17], among
required expertise or time. It is proposed that developers are others, illustrates the need for new scale development.
more transparent in their scale description when publishing Please note that this paper does not discuss scale norms.
their developed scales; [19] is a good example to follow. Psychometric scale norms are not reported in many UX
Secondly, the majority of usability scales were developed in studies. Norms would provide the researcher with the
the 1990s, between 10 and 20 years ago. There are potential ability to compare findings with other studies (assuming the
issues surrounding measures of this age, which practitioners studies were sufficiently controlled) in the same way that
should consider. For example, the majority of usability personality and intelligence of individuals can be compared.
scales were developed for workplace systems where tasks
are structured and efficiency is of primary importance. CONCLUSIONS
When choosing measures to evaluate systems for the home, The preliminary development and example use of the
these factors may not be as prominent in design. SAFE, a tool to support practitioners in selecting
appropriate psychometric scales, was illustrated in this
In addition, the relative age of these scales may affect paper. Two conclusions have emerged, these are:
validity, considering the type of systems that would have
been used when these scales were developed. Therefore, • It can be daunting and difficult to gauge whether or not a
scale is based on a robust construct, and if the scale is
53
valid and reliable. This should be clearer in future Overbeek, T., Pasveer, F. & de Ruyter, B. (Eds.).
publications of scales. Probing Experiences, Springer, Dordrecht, NL, 2008.
• New scales are required to measure new constructs and 11. Hornbæk, K. Current practice in measuring usability:
overcome ceiling effects due to the new emphasis on Challenges to usability studies and research.
augmenting existing and enjoyable experiences. International Journal of Human-Computer Studies, 64,
These challenges are pertinent to the VUUM workshop, and 2 (2006), 79-102.
can be elaborated on for presentation and discussion. 12. Hornbæk, K. and Law, E.L. Meta-analysis of
correlations among usability measures. In Proc. CHI
The SAFE has been developed to support practitioners in 2007, ACM Press (2007), 617-626.
adopting established psychometric measures. Nonetheless,
the SAFE model still has to be evaluated by practitioners. 13. ISO 9241. Ergonomic requirements for office work
We intend to build on the SAFE and develop it into a useful with visual display terminals (VDTs) – Part 11:
and usable design tool, via evaluation with practitioners. Guidance on
usability. ISO 9241-11, Geneva, CH, 1998.
ACKNOWLEDGMENTS
14. Isomursu, M., Tähti, M., Väinämö, S., and Kuutti, K.
We thank Jan Engel, Nele Van den Ende, and Janneke
2007. Experimental evaluation of five methods for
Verhaegh for lending ideas and comments in support of this
collecting emotions in field settings with mobile
research. We thank Marie Curie for funding William
applications. International Journal of Human-
Green’s research (SIFT project: FP6-14360) and Greg
Computer Studies. 65, 4 (2007).
Dunn’s research (CONTACT project: FP6-008201).
15. Kirakowski, J, and Corbett, M, Effective Methodology
REFERENCES for the Study of HCI. Amsterdam: North-Holland,
1. Blythe, M. A., Overbeeke, K., Monk, A. F., & Wright 1990.
P. C. (Eds.). Funology: From usability to enjoyment. 16. Kohne, R. Subjective, behavioral and
Dordrecht, The Netherlands: Kluwer Academic, 2003. psychophysiological responses to lighting induced
2. Clark, L.A., Watson, D. Constructing validity: basic emotional shopping. Unpub master thesis. Utrecht /
issues in objective scale development. Psychological Eindhoven: Universiteit Utrecht / Philips Research,
assessment, 7, 3 (1995), 309-319. 2008.
3. Cockton, G. Revisiting usability's three key principles. 17. Lang, P. J. Behavioral treatment and bio-behavioral
In Proc. CHI 2008:alt.CHI. Florence, Italy, ACM, assessment: computer applications. In J. B. Sidowski,
(2008), 2473-2484. J. H. Johnson, & T. A. Williams (Eds.), Technology in
4. Cohen R. J. Swerdlik, M. E., and Philips, S. M. mental health care delivery systems. Norwood, NJ:
Psychological testing and assessment. An introduction Ablex, (1980), 119-137.
to tests and measurement. Mayfield, 1996. 18. Lavie, T., and Tractinsky, N. Assessing dimensions of
5. Gould, J. D. and Lewis, C. Designing for usability: key perceived visual aesthetics of web sites. International
principles and what designers think. Communications Journal of Human-Computer Studies, 60, 3 (2004),
of the ACM 28, 3 (March 1985), 300-311. 269-298.
6. Greenberg, S. and Buxton, B. Usability evaluation 19. Lewis, J. IBM Computer Usability Satisfaction
considered harmful (some of the time). Proc. CHI Questionnaires: Psychometric Evaluation and
2008. Florence, Italy, ACM, (2008), 111-120. Instructions for Use. International Journal of Human-
Computer Interaction, 7, 1 (1995), 57-78.
7. Helander, M.G. Landauer, T.K. and Prabhu, P. (eds.).
Handbook of Human-Computer Interaction, 2nd 20. Norman, D.A. The psychology of everyday things.
Edition. Amsterdam, North-Holland, 1997. New York: Basic Books, 1988.
8. Helander, M.G. Khalid, H.K., and Tham, (Eds.). Proc 21. Nunnaly, J. C. Psychometric theory. New York:
of the Int. Conf. on Affective Human Factors Design. McGraw-Hill, 1978.
London: Asean Academic Press, 2001. 22. Oatley, K, & Jenkins, J. M. Understanding Emotions.
9. Hoonhout, H.C.M., Human Factors in Matrix LED Cambridge MA: Blackwell Publishers, 2006.
illumination. In: Aarts, E., Diederiks, E. (eds.). 23. Pickard, R. Affective Computing. Cambridge, MA:
Ambient Lifestyle. Amsterdam: BIS Publishers, 2006. MIT Press 1998.
10. Hoonhout, H.C.M. How was the experience for you 24. Russell, J.A., and Pratt, G. A description of the
just now? Inquiring about people’s affective product affective quality attributed to environments. Journal of
judgments. In: Westerink, J., Ouwerkerk, M., Personality and Social Psychology. 38 (1981), 311-
322.
54
VUUM 2008
25. Ryan, R.M. Intrinsic Motivation Inventory: 28. Verhaegh, J., Fontijn, W., and Hoonhout, J. TagTiles:
http://www.psych.rochester.edu/SDT/measures/intrins. optimal challenge in educational electronics. Proc. of
html. Accessed on April 13 2008. the Tangible and Embedded Interaction conference,
26. Spector, P.E. Summated rating scale construction. An Baton Rouge, LA, ACM, 2007.
Introduction. Newbury Park: Sage, 1992. 29. Wheeless, L. and Grotz, J. The Measurement of Trust
27. Thayer, R.E. Measurement of activation through self- and Its Relationship to Self-Disclosure. Human
report. Psychological reports, 20 (1967), 663-678. Communication Research, 3, 3 (1977), 250–257.
55
A Two-Level Approach for Determining Measurable
Usability Targets
Timo Jokela
Joticon, P.O. Box 42
90101 Oulu, Finland
[email protected]
et al. 1988), (Hix and Hartson 1993), (Gould and Lewis

ABSTRACT
Usability measures are required for defining clear, 1985)m (Nielsen 1993) (Mayhew 1999) (Dumas and
unambiguous usability targets for a system or product to be Redish 1993) (Bevan and Macleod 1994) (Macleod,
developed. A two-level approach for defining usability Bowden et al. 1997) (Maguire 1998) (Bevan, Claridge et al.
targets is proposed. Strategic usability measures are used to 2002) (ANSI 2001) (Butler 1985) (Karat 1997).
define the usability targets at business level. Operational In this paper, we propose a two-level approach for
usability measures are used to define the usability targets at measurable usability targets. Strategic usability measures
the level of user performance. Different usability measures are used to define the usability targets at business level.
are required for each of the two levels. Operational usability measures are used to define the
usability targets at the level of user performance. Different
INTRODUCTION AND RATIONALE usability measures are required for each of the two levels.
It is generally agreed as a good project management
practice to define clear, measurable targets for product3 The approach presented in this paper is not fully evaluated.
quality characteristics. Clear quality targets provide a clear Examples of partial application of the approach are
direction of work and acceptance criteria for the product provided.
under development.
ON USABILITY TARGETS AND MEASURES
In practice, however, usability targets are quite seldom Usability is not a generic “on/off” characteristic - one
among the measurable goals in system or product basically cannot state that a product is usable or not usable.
development projects. One of the consequences of not Instead, usability is a continuous variable. Thereby, when a
having usability goals is that other project objectives product is developed, it is not viable to aim at “good
dominate and usability is considered only as a secondary usability” at a generic level. Instead, one should define the
objective of a project. The result is a product with usability level of usability that one is aiming at. In other words, clear
problems. measurable usability targets are required.
Thereby, usability practitioners should aim for defining We basically have two questions when defining usability
measurable usability targets and thereby also make their targets:
work and role more important in the projects.
1. What are the appropriate usability measures to be
The relevance of measuring usability is generally accepted used in the development project?
in literature, for example in (Good, Spine et al. 1986)
2. What are the appropriate usability targets, i.e. the
(Wixon and Wilson 1997) (Gulliksen, Göransson et al.
target values for the measures?
2005) (Jokela, Koivumaa et al. 2006) (Whiteside, Bennett
But the matter, however, is more complex. The challenging
question is: what is the process of finding answers to the
questions above? In other words:
personal or classroom use is granted without fee provided that copies are 1. How does one determine the appropriate usability
measures?
or republish, to post on servers or to redistribute to lists, requires prior 2. How does one set the appropriate usability targets
specific permission and/or a fee. (target values for the measures)?
VUUM2008, June 18, 2008, Reykjavik, Iceland. While usability is not an on/off thing, neither is the process
of developing usability: one can carry out some basic low-
3
cost usability activities, or one can use a lot of resources for
or system the usability activities. If a project team aims to achieve
56
VUUM 2008
ambitious usability targets, they obviously need more becomes a business issue: how many resources and how
resources for usability activities. If the usability targets are much money to assign on usability activities.
modest, less usability work is probably adequate.
The conclusion is that the basis defining usability targets -
Therefore, setting usability targets is necessarily a question and usability measures - is the business context of the
of how much usability resources to assign for a product under development. One should analyse and
development project. And thereby setting usability targets understand the business context, in order to be able to set
the appropriate usability targets.
Figure 1. The JFunnel usability life-cycle model
THE TWO SETS OF USABILITY TARGETS Strategic Usability Targets

We use the JFunnel4 model (illustrated in Figure 1) as a Strategic usability targets are used to define the impact of
reference of usability life cycle. usability at business level: what is the economical and
competitive advantage of usability. Strategic usability
There are two activities related to the determination of
targets can be defined in terms of attributes known as
usability targets, and thereby to the determination of
‘business benefits of usability” (Bias and Mayhew 1994) ,
appropriate usability measures:
such as:
• (0) Strategic usability targets
• Reduced training time
• (3) Operational usability targets
• Better system acceptance
The strategic usability targets define usability goals at the
• Savings in support costs
business level. The operational usability targets define the
usability goals the product level. The achievement of the • No need of a user manual, or a smaller one
usability targets is later checked in the usability verification • Improved efficiency of users’ work
activity (7).
• Better user satisfaction
• Savings in development costs
4
JFunnel is a process model used by Joticon. Its earlier • Etc.
version is the KESSU model of the author, reported e.g. in
Jokela, T. (2007).
57
A company may choose different levels of ambition in The operational level usability targets would then define
strategic usability objectives. High-level ambition leads to a what exactly are those things that should be “learnable
need of more usability resources; lower level strategic without training”. In practice, this would mean the
usability objectives can be reached with fewer resources. definition of those tasks that the user should be able to do
For example, a system that is learnable without training is without training.
obviously more challenging to develop than one without
At the context of this project, usability work has not (yet)
such a target.
continued beyond the level of defining strategic usability
Strategic usability targets should be discussed and decided targets.
with business management.
A Position Tracking Product
Operational Usability Targets This is a type of product the use of which requires special
Strategic usability targets form the basis for operational expertise by the user. In other words, it is not feasible that a
usability targets. person without any background knowledge would be able
to use the product just by intuition.
While strategic usability targets are the ultimate goals for
usability, they are too high-level issues to concretely guide When defining the business objectives, it was therefore
the development process. Operational usability targets are decided that the users would be categorised into two main
more concrete, and they provide guidance for the design. groups: those that have previously used a similar product,
and those that are novices. It was decided as a design target
The main reference of defining usability measures is
that those users with earlier experience should be able to
probably the definition of usability from ISO 9241-11: “the
use the product intuitively, without training or user manual.
extent to which a product can be used by specified users to
This is an obvious strategic usability target: the product
achieve specified goals with effectiveness, efficiency and
should be intuitively learnable by non-novice users.
satisfaction in a specified context of use” (ISO/IEC 1998).
In brief, the definition means that usability requirements are During the next step, the user tasks were identified, from
based on measures of users performing tasks with the the introduction of the product to the daily use of the
product to be developed. product. Operational targets were that the users should be
able to carry out these tasks without training.
• An example of an effectiveness measure is the
percentage of users who can successfully complete a task. For the other user group – novices – the objective was set
that a user should be able to take the product into use
• Efficiency can be measured by the mean time needed
without assistance of sales person. (the detailed discussion
to successfully complete a task.
of this is not in the scope of this paper.
The operational usability targets are derived from the
strategic ones. For example: This case study is at the phase where the user interface has
been designed at the ‘paper prototype’ level and software
• If a strategic target is set to be that the product under development is in progress.
development should be learnable without training, the
operational usability targets define what exactly the users DISCUSSION
should learn. In this paper, a two-level approach for defining usability
• If a strategic target is to improve the efficiency of targets is proposed. The approach is meant to “identify
users’ work, operational usability targets define what practical strategies for selecting appropriate usability
exactly are those user tasks that should be more efficient. measures and instruments that meet contextual
requirements, including commercial contexts” (in the Call
EXAMPLES for papers of VUUM).
The approach is partially applied in a couple of case The specific feature of the proposed approach is that
studies. usability targets should be business driven. In some
business contexts ambitious usability targets are
A Health Care System
appropriate, while in other contexts more modest usability
The user analysis of a healthcare system revealed that
targets may be justified.
doctors do not attend training sessions of new information
systems. Thereby, it was seen a business benefit if the new Two preliminary case studies are presented. In both of
system would be learnable without doctors attending the these cases, a strategic business benefit (among others) was
training sessions. Through this, one would achieve good identified as “learnability without training”. The strategic
acceptance of the system by doctors. In other words, we usability goals would be most likely different in other
had an obvious strategic usability target: the system should cases, depending on the business case and application
be learnable without training. domain.
58
VUUM 2008
REFERENCES 15. John, B. E. (1995). Why GOMS? Interactions. 2: 80-

1. ANSI (2001). Common Industry Format for Usability 89.
Test Reports. NCITS 354-2001. 16. Jokela, T. (2005). Guiding designers to the world of
2. Bevan, N., N. Claridge, et al. (2002). Guide to usability: Determining usability requirements through
specifying and evaluating usability as part of a teamwork. Human-Centered Software Engineering. A.
contract, version1.0. PRUE project. London, Serco Seffah, J. Gulliksen and M. Desmarais, Kluwer HCI
Usability Services: 47. series.
3. Bevan, N. and M. Macleod (1994). "Usability 17. Jokela, T. (2007). Characterizations, Requirements and
measurement in context." Behaviour and Information Activities of User-Centered Design - the KESSU 2.2
Technology 13(1&2): 132-145. Model. Maturing Usability. Quality in Software,
4. Bias, R. and D. Mayhew, Eds. (1994). Cost-Justifying Interaction and Value. E. Law, E. Hvannberg and C.
Usability. San Diego, California, Academic Press. Cockton. London, Springer-Verlag: 168-196.
5. Brooke, J. (1986). SUS - A "quick and dirty" usability 18. Jokela, T., J. Koivumaa, et al. (2006). "Quantitative
scale, Digital Equipment Co. Ltd. Usability Requirements in the Development of the User
6. Butler, K. A. (1985). Connecting theory and practice: a Interface of a Mobile Phone. A Case Study." Personal
case study of achieving usability goals. SIGCHI 1985, and Ubiquitous Computing 10(6): 345-355.
San Francisco, California, United States, ACM Press 19. Karat, J. (1997). User-Centered Software Evaluation
New York, NY, USA. Methodologies. Handbook of Human-Computer
7. Card, S. K., T. P. Moran, et al. (1983). The Psychology Interaction. M. G. Helander, T. K. Landauer and P. V.
of Human-Computer Interaction. Hillsdale, Lawrence Prabhu. Amsterdam, Elsevier Science.
Erlbaum Associates. 20. Kirakowski, J. and M. Corbett (1993). "SUMI: The
8. Chin, J. P., V. A. Diehl, et al. (1988). Development of software usability measurement inventory." British
an instrument measuring user satisfaction of the Journal of Educational Technology 24(3): 210-212.
human-computer interface. Proceedings of SIGCHI '88, 21. Macleod, M., R. Bowden, et al. (1997). "The MUSiC
New York. Performance Measurement Method." Behaviour &
9. Dumas, J. S. and J. C. Redish (1993). A Practical Information Technology 16(4 & 5): 279 - 293.
Guide to Usability Testing. Norwood, Ablex 22. Maguire, M. (1998). RESPECT User-centred
Publishing Corporation. Requirements Handbook. Version 3.3, HUSAT
10. Good, M., T. M. Spine, et al. (1986). User-derived Research Institute (now the Ergonomics and Saftety
impact analysis as a tool for usability engineering. Research Institute, ESRI), Loughborough University,
Conference Proceedings on Human Factors in UK.
Computing Systems, Boston. 23. Mayhew, D. J. (1999). The Usability Engineering
11. Gould, J. D. and C. Lewis (1985). "Designing for Lifecycle. San Fancisco, Morgan Kaufman.
Usability: Key Principles and What Designers Think." 24. Nielsen, J. (1993). Usability Engineering. San Diego,
Communications of the ACM 28(3): 300-311. Academic Press, Inc.
12. Gulliksen, J., B. Göransson, et al. (2005). Key 25. Whiteside, J., J. Bennett, et al. (1988). Usability
Principles of User-Centred Systems Design. Human- Engineering: Our Experience and Evolution. Handbook
Centered Software Engineering: Bridging HCI, of human-computer interaction. M. Helander.
Usability and Software Engineering. M. Desmarais, J. Amsterdam, North-Holland: 791-817.
Gulliksen and A. Seffah. 26. Wixon, D. and C. Wilson (1997). The Usability
13. Hix, D. and H. R. Hartson (1993). Developing User Engineering Framework for Product Design and
Interfaces: Ensuring Usability Through Product & Evaluation. Handbook of Human-Computer
Process. New York, John Wiley & Sons. Interaction. M. Helander, T. Landauer and P. Prabhu.
14. ISO/IEC (1998). 9241-11 Ergonomic requirements for Amsterdam, Elsevier Science B.V: 653-688.
office work with visual display terminals (VDT)s - Part
11 Guidance on usability. ISO/IEC 9241-11: 1998 (E).
59
What Worth Measuring Is
Gilbert Cockton
School of Computing & Technology, Sir Tom Cowie Campus, University of Sunderland,
St. Peter’s Way, Sunderland SR6 0DD, UK.
gilbert.cockton@ sunderland.ac.uk
ABSTRACT
Worth-Centred Development uses worth maps to show There is thus a tension in the title of this workshop. Valid
expected associations between designs, user experiences Useful User Experience Measurement (VUUM) may only
and worthwhile outcomes. Worth map elements focus be possible on balance, that is, we may have to trade-off
support for Element Measurement Strategies and their validity, when judged by psychometric standards, in return
selection of measures, targets and instruments. An example for relevance, that is, what makes measures matter. The
is presented for a van hire web site to relate how and why tension goes once we ignore psychometric disciplinary
user experience is measured in a worth-centred context. values, and see validity as a question of relevance rather
than repeatability or objectivity. Even so, prejudice either
Author Keywords way in relation to psychometric standards would be wrong:
Evaluation, Element Measurement Strategy (EMS), User we want neither a ‘tail’ that ‘wags the dog’, nor a ‘baby’
Experience, Value, Worth Centred Development (WCD). that is ‘thrown out with the bathwater’. Psychometric
approaches from affective psychology should be allowed to
prove their worth, but we should not assume this. There is
H5.m. Information interfaces and presentation (e.g., HCI): enough evidence from usability engineering that cognitive
Miscellaneous.
performance such as time on task, error rate, task
completion and task conformance are not always the
INTRODUCTION
appropriate measures. Too often, usability engineers were
The title of this paper is not in the style of Joda from Star
like ‘drunks under the lamppost’, looking for dropped keys
Wars, but instead promises a worth-centred answer to
where the light is best, and not where they actually lost
“What is Worth Measuring”. Interaction design has
them. A starting place for relevant measures, to continue
inherited a strong evaluation tradition from HCI, which
the analogy, is where the keys were dropped, and not where
largely reflects values from experimental psychology,
the light is best: we should ground all evaluation purpose in
bringing measures that will not be appropriate for all
design purpose, not in favoured measures.
evaluations [4]. Rather than choose measures that matter to
what a design is trying to achieve, usability evaluations We need to clearly express what an interaction design seeks
re-use procedures and measures from cognitive psychology. to achieve, and from this derive and/or anchor evaluation
In early usability work, psychologists applied measures that strategies. This workshop paper motivates and outlines an
they were familiar with, using instruments and procedures approach to grounding Element Measurement Strategies
refined for experimental validity [15]. (EMS) in worth maps, the anchor approach in
Worth-Centred Development (WCD). Next, the logic of
There is a strong risk that affective psychology in user
WCD is briefly motivated, and the evolution of worth maps
experience (UX) research will again privilege disciplinary
is summarised. Next, the anchoring of an EMS is
values from psychology over professional practice values of
presented. The final section presents a van hire example.
Interaction Design. The primary issue for choosing any
measure or instrument is relevance to design purpose.
WORTH AND EVALUATION
Measures should reveal what we really need to know about Worth is a balance of value over costs. A worthwhile
a design, and not simply what psychology can ‘best’ tell us. design delivers sufficient value to its intended beneficiaries
to outweigh costs of ownership and usage. The benefits of
the ends (design purpose) outweigh costs arising from the
Permission to make digital or hard copies of all or part of this work for means (design features and composition). Although
personal or classroom use is granted without fee provided that copies are originally used as a near synonym for value, to avoid
bear this notice and the full citation on the first page. To copy otherwise, confusing value and values [3], worth actually subsumes
or republish, to post on servers or to redistribute to lists, requires prior value as one half of its cost-benefit balance. Worth-Centred
specific permission and/or a fee. Development (WCD) thus needs a survey of costs, including
adverse outcomes (as opposed to worthwhile ones).
Designing becomes more complicated, requiring constant
60
VUUM 2008
consideration of purchase, ownership and usage costs in Laddering has its origins in construct psychology’s
context, and reflecting on the extent to which achievable repertory grids, developed as a clinical instrument to elicit
value could sufficiently offset such costs. people’s understandings of their relationships. Having
identified the significant people in their life, a respondent
In [3], worth was introduced in the sense of “such as to
would be asked to contrast them, in the process exposing
justify or repay; deserving; bringing compensation for”.
constructs that they used to describe personality and
Elaborating on this, ‘worth’ deserves (or compensates)
social relations. Laddering takes elicitation further by
whatever is invested in it, whether this is money (repay), or
asking what each construct means to the respondent, and
time, energy or commitment (justify). In short, worth is a
applies this questioning recursively to each answer. The
motivator. Thus differing motivations of users, sponsors
result is a rich structure of what matters to someone, with
and other stakeholders must be understood, as must factors
personal constructs related via progressive questioning to
that will de-motivate them, perhaps so much that
an individual’s fundamental values. Approaches have
(perceived) costs outweigh (perceived) benefits, when a
been transferred from therapeutic settings to marketing
design will almost certainly fail in use and/or the market.
[10], where ladders become means-end chains, exposing
WCD broadens designing’s view on human motivation, perceived product benefits, especially ones relating to
realising that worth is the outcome of complex interactions personal values. Such elicited benefits form a basis for
of Herzberg’s motivators and hygiene factors [11]. The advertising messages or market positioning.
former are satisfiers and the latter are dissatisfiers. For
Lee and colleagues’ published work is in Korean (Journal
example, at work, salary is a hygiene factor, whereas
of the HCI Society of Korea 2(1), 13-24) but there is a
advancement is a satisfier. Cognitive and affective HCI
draft manuscript in English [12]. Rather than apply
have largely focused on hygiene factors. Motivational HCI
laddering to existing designs (as in [10]), laddering
shifts the focus to Herzberg’s factors, and does not
concepts can be adapted into a design management
automatically associate positive impact with favourable
approach. [5] and [6] present WCD’s early keystone
hedonic factors, which may be short-lived and transient,
approach, Worth/Aversion Maps (W/AMs).
with no long term impact. This is immediately relevant to
choosing UX measures. Contrary to much current thought W/AMs retained much of the structure of HVMs (formed
in HCI, there may often be no need, in summative by merging ladders at points of intersection).
evaluation at least, to measure hedonic factors. If value can Specifically, they incorporated design elements as
be demonstrated to be achieved in a worthwhile manner, concrete and abstract product attributes and human
i.e., with benefits outweighing costs, then one could assume elements as functional and psychosocial usage
that all contributory hedonic factors are good enough, and consequences. These initial categories were quickly
thus detailed measures of specific emotions would add extended by a few example W/AMs in [5,6] to include
nothing to a summative evaluation. However, for formative physiological, environmental and economic
evaluations where adverse hygiene factors do appear to consequences. Use of Rokeach’s Instrumental and
increase costs, degrading or destroying intended worth, then Terminal Values [13] in original HVMs was replaced by a
measurement of hedonic factors could be essential. single W/AM element type (worthwhile outcomes).
Worth thus turned out to be not only less ambiguous than In W/AMs, positive means-ends chains thus began with
value(s), but it also broadened design spaces to cover concrete product attributes and then could connect with
interacting benefits/value(s) and costs (price, usage and worthwhile outcomes via qualities and a range of usage
learning effort, synergies with existing product-service consequences. Negative means-end chains linked
ecosystems). However, moving from models of cost-benefit downover from concrete product attributes to adverse
balances into evaluation criteria needs to be structured, as outcomes via defects and adverse usage consequences.
WCD as presented in [3] is less concrete and focused than
Once people other than W/AMs’ inventor began to use
Value-Centre Design (VCD, [2]). An approach was needed
them, it became clear that modifications were needed.
to bridge between understandings of worth (in terms of
During a WCD workshop for the VALU project
costs and benefits) and worth-centred evaluation criteria.
(webhotel.tut.fi/projects/uservalues), a need became clear
Worth Sketches and Worth Maps have provided this bridge.
to split concrete product attributes into materials (basic
unmodified technical system components) and features
Worth Sketches and Worth Maps
Korean research extended the VCD framework with (technically coherent parameterisations, configurations or
approaches from consumer psychology. Lee, Choi and compositions of materials). When applying W/AMs at
Kim [12] took approaches from advertising and marketing Microsoft Research, it became clear that the extensive
and applied them to an existing design to explore the usage consequences were unwieldly. Information hiding
feasibility of grounding VCD in laddering and techniques were thus applied to replace all basic
hierarchical value models (HVMs). consequence types by a single human value element type
of user experiences. A comparison of Figure 1 below and
61
[5,6] will highlight significant changes from W/AMs to pipeline that tracks potential students from initial interest
Worth Maps. There is now no named distinction between through downloads and registrations of interest, to
Worth Maps and W/AMs, since worth by definition interaction with admissions. If, as preferred in VCD,
considers cost issues associated with product defects, evaluation planning is completed up to initial outlines of
adverse experiences and outcomes. procedures (e.g., test plans) before detailed design
completes, then this lets sales pipeline instrumentation
EVALUATION MEASUREMENT STRATEGIES features be added to the design (e.g., registration forms,
Worth Maps are network structures, drawn as box and downloadable content for registered prospects, information
arrow diagrams, with elements as nodes and associations as and advice service, active management of relationship with
arcs (see Figure 1 later). Paths up from feature elements to potential students through email and/or personalised web
worthwhile outcomes constitute positive means-end chains, pages, integration with events and application process).
where associations between elements indicate a hoped for These example features can measure engagement of
causal relationship. These causal associations hopefully potential students, tracking them through different stages of
combine to deliver intended value in usage. Paths down interest (i.e., registration, information downloads, use of
from features to adverse outcomes constitute negative information and advice service, use of personal pages,
means-end chains. For a worthwhile interaction design, attendance at events, applying for entry). WCD’s close
positive chains will outweigh negative ones in usage. synergies between design and evaluation would actually
shape the design here. DWI, like much measurement,
Worth Map elements focus Element Measurement
changes the behaviour of what is measured, in this case
Strategies (EMSs), which can begin at the worth sketching
through design features that not only measure engagement,
stage [8], before associations between elements. There are
but form a sales pipeline structure that actively increases
three types of positive element (worthwhile outcomes,
engagement. With DWI, the relative effectiveness of
worthwhile user experiences, qualities), three of negative
different features can be assessed, based on targets set for
elements (adverse outcomes, adverse user experiences,
maximum drop outs at each stage of the sales pipeline.
defects), and two of neutral element (features, materials).
Ineffective features can be dropped and perhaps replaced,
Elements also divide into human value elements (outcomes
and poorly performing features can be improved.
and experiences) and design elements (the rest).
Evaluation should never be separated from design. There
Initial worth sketch elements can be inspired, grounded,
will be no interplay between them if they are co-ordinated
derived or otherwise created from a broad range of
from the outset. Interplay occurs between loosely coupled
sensitivities [8,9]. A project team’s receptiveness to facts,
and potentially independent phenomena. In HCI, design
theories, trends, ideas and past designs spreads the breadth
and evaluation have been so separated that their main
of these sensitivities. Some directly address human value(s)
relationship is interplay, but this is a fault line from HCI’s
independently of specific technology usage. Such human
forced marriage between computing and psychology.
sensitivities form a basis for adding human value elements
Aligning evaluation purpose with design purpose replaces
to a worth sketch (i.e., a worth map without associations
fault lines with synergies beyond unpredictable interplay.
[8,9]). Other sensitivities concern creative craft possibilities
In particular, DWI goes beyond measuring design
and technological opportunities. These form a basis for
effectiveness to actually improving it (by adding features
adding design elements to a worth sketch.
that combine instrumentation and user support).
An EMS decides on measures, instruments and targets for
worth-centred elements. It can start from a worth sketch, Evaluation in WCD
but is better supported once elements are associated. Worth sketches and maps identify the main summative
Furthermore, WCD survey data [5,6,9] that makes a worth evaluation criteria for designs as worthwhile and adverse
map more receptive and expressive, and provide a rich outcomes. Evaluation criteria selection is an automatic
context for the simple box label of a worth element. The by-product of worth sketching. Worth maps refine sketched
intended value statements of [2] are replaced by the balance worthwhile and adverse outcomes as associations are
of worth indicated by human value elements. added, since this often results in reflection on elements and
consequent revisions to them. For formative evaluation, all
Available information guides EMS selection of measures,
other elements may be measured to provide diagnostic
instruments and targets. An advantage of WCD is that one information for iteration. User experiences in worth maps
can identify opportunities for direct worth instrumentation are treated as means to the ends of worthwhile outcomes,
(DWI, [6]) (self-instrumentation in [5]). Here system
but the quality of the UX will impact on the worth achieved
features are added to capture evaluation data above low by a design. Thus formative measures of UX can become
level system logging, substantially automating direct summative when costs of interaction reduce achieved
collection of relevant measures. For example, if the
worth. However, the primary evaluation focus remains
intended worth for a university website includes increased worthwhile outcomes, with UX only considered in so far as
student recruitment, then design could include a sales
62
VUUM 2008
it adds or detracts from the balance of worth. Appropriate measure achievement of worthwhile outcomes (i.e.,
methods are indicated by the epoch within which evaluation worthwhile economic transaction, pleasant sequel,
criteria form. Laboratory testing will thus be inadequate to successful gift/transfer/disposal and/or nicer home), plus the
measure value that takes days or weeks to be realised. extent of the adverse outcomes (costs out of control, arrive
late, load won’t fit, can’t collect van/ find depot). The
AN EXAMPLE: A VAN HIRE WEB SITE diamond ended arcs in Figure 1 are aversion blocks, which
Figure 1 below shows a worth map for a hypothetical van express claims that associated design elements will prevent
hire site, based on previous commercial UX work at adverse outcomes. Measures can test these claims. Most
Sunderland. The map is populated by design elements need instrumentation at the van hire depot, with data
(materials, features, qualities) and value elements (user provided directly by customers/and or depot staff. Others
experiences, outcomes). Element colours are: yellow for require instrumentation of customers soon after the end of a
worthwhile outcomes, pink for user experiences, light blue hire, again either with data provided directly at the depot by
for qualities, grey for features, white for materials, and red customers and/or staff, or via customer feedback web
edged for adverse outcomes. Other negative elements pages, phone or email surveys. The ‘moments of truth’ of
(defects, adverse experiences) are omitted for simplicity. all the outcome measures are such that none can be
measured through web-site user testing, and most cannot be
Any focus on UX measures would relate to the three user
fully measured until a customer has completed removal of
experience elements. Figure 1 shows a divide between the
goods, or abandoned delivery for some reason. Referring
experience of forming a good plan (at point of order) and
back to this paper’s title, this is what “worth measuring” is.
the experience of being in control (seeded by a good plan
but unfolding during actual van use). This corresponds to a Worth measuring measures what matters, and what matters
hiatus in user activity, which lets user testing in the hiring is worth. In contrast, [1] associates the van hire W/AM
context (e.g., home, work) formatively explore confidence from [5] with a user dissatisfaction and two difficulties:
in quality of van hire plans up to confirmation of booking.
Such formative evaluation should not be confused with • Not offering exactly preferred type of van
summative evaluation, which should be worth-centred. In • Mistakenly booking for wrong dates or wrong type of van
WCD, measuring achieved worth takes precedence over
• Booking process taking longer than competitor systems
measuring UX.
The first and third have no impact relative to the worth map
The main requirements for summative evaluation are to in Figure 1. If saved time relative to any competitor matters
Worthwhile Economic Pleasant Successful gift, transfer Nicer

Transaction Sequel or disposal home
In Control
Good Value Good Plan
Clear, Concerned, caring, Complete, Helpful,

informative valuing checkable, thorough considerate
Price What to bring Email/fax Van load Depot maps and

information and and when confirmation of information directions
cost summary information booking information
Web pages with Email and fax Image

dowloadable documents capabilities capabilities of
(e.g., pdfs) html, java etc.
Not in control Hirers can’t Load won’t fit Hirers arrive Inability to find
of costs, more collect hired into van/ more late at depot van hire depot
than planned van trips needed
63
Figure 1. Worth Map for Hypothetical Van Hire Site.
(regardless of overall achievable worth), then it must be a result of poor design. In short, you don’t measure UX
added to the worth map as a worthwhile outcome. until you know you have specific problems with worth.
Similarly, if a specific type of van is required, in terms of When you do, measures become valid and useful in so far
the example worth map, its unavailability must impact on as they can locate causes of degraded or destroyed worth in
load handling, and would be covered by the failure to poor UXs that fail to coalesce into their intended positive
achieve a good plan. Just liking Citroen Berlingos isn’t an meanings (or, for adverse experiences, achieve an
issue though. Clearly, date or van selection errors would anticipated but adverse meaning that must be avoided).
lead to users experiencing a bad actual plan, and by
In WCD, UXs are named to reflect their expected
implication a failure of the qualities in Figure 1. In short,
meanings. When summative evaluation reveals a shortfall
these are at best low level formative evaluation concerns
in achieved worth, formative evaluation needs to explore
that could be explored following failures in worthwhile
why. Working into the centre of a worth map from
outcomes. If not, then the worth map is wrong. Such a
outcomes, the next elements to look at are worthwhile and
conclusion may often be reached during EMS formation,
adverse UXs. UX Frames (UEFs) are a WCD notation that
where designs can be fixed before they even form in detail.
supports a detailed focus on the unfolding of UX. They
[1] endorses advice in [15] to base targets on an existing have a flexible table format. The example schematic UEF
system or previous version, strongly competitive systems, in Figure 2 includes feelings, beliefs, system usage, system
or performing tasks with no computer. However [15] response and actions in the world but other UX aspects
advised that such (revisable) targets be chosen by the
development team, engaging engineering staff in feelings
user-centred design but giving them control of development
risks [6]. Importantly (but overlooked in [1]) over half of features beliefs
outcomes
[15] addresses why they abandoned this approach [6]. and system usage
For the VUUM focus, we must state when UX measures qualities system response
become important. In WCD, they are secondary to worth
measures. Note that a pleasant sequel, that is good times
actions in the world
after hiring a van, is a worthwhile outcome that may follow Figure 2. Schematic User Experience Frame (UEF)
from the UXs of receiving good value and feeling in
control. The coalescing of these experiences minimizes could be added, and existing ones removed. The contents
potential attrition from van hire, which may otherwise lead of a UEF depend on the UX being represented.
to disappointment, frustration and/or fatigue. It is important The schematic rotates a UEF, which is normally a table of
to include worth elements that capture the impact of columns below a header of outcomes and a footer of
interactions on user resources for subsequent activities. features and qualities. Header and footer items correspond
Where UX measures do not concern worthwhile outcomes, to associated worth map elements for the UX. Column
their role is largely formative. UX measures tend to be items are time ordered, with the most recent at the top. In
associated with emotion (e.g., joy) or meaning (e.g., fun, the schematic, time runs left to right. The dotted arrow
trust). A now common view in HCI is that these can and represents an abstract scenario, indicating how specific
should be measured in isolation. This however ignores the features and/or qualities give rise to feelings, then beliefs,
reality that UXs are holistic, taking their coherence from the that lead on to usage, further feelings and beliefs, further
meanings that coalesce during them. Feelings, beliefs, system usage and response an action in the world and then
system usage, system response and actions in the world are one last feeling that closes the meaning of the user
almost inseparable in the unfolding of coherent experiences. experience and gives rise to associated outcomes.
Measuring emotions has to be related to their role. At each There is not space to provide a detailed UEF example. The
point in a UX, emotions increase or reduce a user’s purpose of Figure 2 is to illustrate the context within which
confidence and comfort with the progress of interaction. emotions and meanings form. It is this context that
Each strong emotion both evaluates the immediate past and identifies relevant emotions and user interpretations that
judges the immediate future. Emotions indicate comfort could be measured in formative evaluation. A range of
with the experience so far, and expectations for the instruments could be designed to collect specific measures,
remaining experience. They are bound up in interpretation including semantic differentials, questionnaires, playback
and anticipation of something, from which they are (think aloud in response to video replay), facial expression
inseparable. Emotions must thus be measured in the monitoring, and video analysis. Similarly, user perceptions
context of UXs with the aim of understanding their impact of qualities in Figure 1 can be measured using some of
within the dynamics of UX. Once again, formal measures these instruments, especially semantic differentials.
should not become relevant until summative evaluation of
achieved worth has indicated degradation or destruction as
64
VUUM 2008
CONCLUSION measurement of UX. The latter is generally a means to an

The purpose of this paper is to show how and when relevant end, not an end in itself. By placing UX in a worth-centred
measures can be identified. Only once success criteria and context, WCD guards against inappropriate use of measures
related design and interaction elements are identified in and instruments from affective psychology. The key
context can selection of measures, targets and instruments requirement is to understand evaluation in a development
proceed. Evaluators have to be instrument makers, but this context, and not as a scientific enterprise.
should not be their key skill. Instead they should work
closely with design and related roles in a project team to ACKNOWLEDGMENTS
isolate the critical success and failure factors for a proposed The WCD framework was developed with support from a
system. Sensitivities, worth sketching and worth mapping NESTA Fellowship (www.nesta.org). User Experience
are WCD approaches that structure a process that identifies Frames (UEFs) were introduced to replace consequence
critical factors as worthwhile and adverse outcomes. These elements of Hierarchical Value Models (HVMs) at
become the major focus for summative evaluation in WCD. Microsoft Research Cambridge when I was a visiting
researcher. Alan Woolrych directed the commercial UX
Much UX research in HCI is currently approaching
testing of van hire websites on which the example is based.
evaluation from another position, which assumes the value
My initial interest in adapting HVMs for WCD was
of measuring emotions in isolation, without regard to their
inspired by research by Lee and colleagues [12].
actual role in UX. This approach also looks for affective
psychology for measures and instruments. As with
REFERENCES
usability in 1980s HCI, this position starts with a set of
1. Bevan, N. (2008) “UX, Usability and ISO Standards,” in
disciplinary assumptions that do not hold for designing as a
CHI 2008 Workshop on User Experience Evaluation
creative, holistic, judgemental activity. We do not design
Methods in Product Development, last accessed 25/4/08
for specific emotions without regard to context. Instead, we
as www.cs.tut.fi/ihte/CHI08_workshop/papers/
design for positive experiences of which emotions are only
Bevan_UXEM_CHI08_06April08.pdf
one aspect. Meanings and outcomes bring emotions with
them, and not vice-versa. We must thus measure emotions 2. Cockton, G. (2005) “A Development Framework for
only where needed, and mostly in formative evaluation. Value-Centred Design,” in CHI 2005 Extended
Abstracts, ed. C. Gale, ACM, 1292-95,
We must trade proven validity of generic psychometric
tools for relevance of evaluation measures in design 3. Cockton, G. (2006) “Designing Worth is Worth
contexts. The purpose of evaluation is to support design, Designing,” in Proceedings of NordiCHI 2006, eds.
not psychology. This paper has focused on WCD A.I. Mørch, et al., ACM. 165-174.
approaches to highlight what matters in evaluation, as 4. Cockton, G. (2007) “Make Evaluation Poverty History”,
opposed to what can be easily or reliably measured. Where alt.chi 2007, last access 19/4/08 at www.
the two conflict, design’s values take precedence over viktoria.se/altchi/submissions/submission_gilbert_0.pd
science’s. Science can support design, but not direct it. 5. Cockton, G. (2008) “Putting Value into E-valu-ation,” in
Designing is a creative activity involving much judgement Maturing Usability: Quality in Software, Interaction
and interpretation, and this must extend to evaluation too. and Value, eds. E. Law et al., 287-317, Springer.
If HCI evaluation remains a (pseudo-) scientific silo
alongside interaction design, its influence and effectiveness 6. Cockton, G. (2008) “Some Experience! Some
will stagnate. We have an opportunity in early UX research Evolution,” in HCI Remixed, eds. T. Erickson and D.W.
to avoid the mistakes of usability evaluation, which MacDonald, MIT Press, 215-219, 2008
uncritically applied measures, instruments and procedures 7. Cockton, G. (2008) “Revisiting Usability’s Three Key
from cognitive psychology. It would be wrong for affective Principles”, in CHI Extended Abstracts, 2473-2484.
psychology to take UX evaluation into the same blind alley 8. Cockton, G. (2008) “Sketch Worth, Catch Dreams, Be
of evaluation methods with little downstream utility. Fruity, in CHI 2008 Extended Abstracts, 2579-2582.
The D in WCD stands for development, not design. WCD 9. Cockton, G., (2008) “Designing Worth: Connecting
takes a holistic view of development that allocates clear Preferred Means with Probable Ends,” to appear in
synergistic roles to design and evaluation, unified around interactions, 15(4)
the common purpose of achieving intended worth. This
10. Heitmann, M., Prykop, C. and Aschmoneit, P. (2004):
holistic context, supported by representations such as UEFs
“Using Means-End Chains to Build Mobile Brand
and worth maps, lets us identify key criteria for summative
Communities,” in HICSS 2004, last accessed 29/4/08 at
evaluation and potential diagnostic measures for formative
http://www.hsw-basel.ch/iwi/publications.nsf/id/289
evaluation. UX measures largely relate to the latter. If
intended worth is demonstrably achievable through 11. Herzberg, F. (1966) Work and the Nature of Man, Ty
interaction with a design, there is little need for detailed Crowell Co; Reissue edition.
65
12. Lee, I, Choi, B. and Kim, J. (2006) Visual Probing of Usability: Quality in Software, Interaction and Value,
User Experience Structure: Focusing on Mobile Data eds. E. Law, E. Hvannberg and G. Cockton, Springer.
Service Users, HCI Lab, Yonsei University, Seoul. 15. Whiteside, J., Bennett, J., and Holtzblatt, K. (1988)
13. Rokeach, M. (1973) The Nature of Human Values, “Usability engineering: Our experience and evolution,”
Free Press. in Handbook of Human-Computer Interaction, 1st
14. Rosenbaum, S. (2007) “The Future of Usability Edition, ed. M. Helander., North-Holland, 791-817,
Evaluation: Increasing Impact on Value,” in Maturing
The columns on the last page should be of approximately equal length.
66
VUUM 2008
Is what you see what you get? Children, Technology and

the Fun Toolkit
Janet C Read
ChiCI group
University of Central Lancashire
Preston, UK
+44(0) 1772 893285
[email protected]
ABSTRACT INTRODUCTION
There are several metrics used to gather experience data that Asking children about technologies is a commonly used
rely on the user opinions. A common approach is to use method for evaluation, especially where opinions or some
surveys. This paper describes a small empirical study that measure of experience is required. Determining what
was designed to test the validity, and shed light on, opinion children think about technologies can give insights into
data from children that was gathered using the software appeal and can provide a designer or developer
Smileyometer from the Fun Toolkit. with feedback as well as with a very useful endorsement of
a product. Many academic papers, especially design papers,
This study looked at the ratings that children gave to three use the opinions of end users as an indication of value or
different interactive technology installations, and compared worth.
the ratings that they gave before use (expectations) with
ratings they gave after use (experience). Each child rated at Collecting children’s opinions is not always straightforward
least two of the installations and most rated all three and there are several known hurdles that need to be
overcome if the opinions that are gathered are to have real
Different ratings were given for the different installations; value. Where opinions are elicited by questioning, an area
this suggests that children can, and do, discriminate between of concern is the degree to which the child understands the
different experiences. In addition, three quarters of the question being asked. Answering questions in complex as
children, having encountered the product, changed their not only does the child need to understand and interpret the
minds on at least one occasion and gave a different rating question being asked, retrieving relevant information from
after use than they had before. Finally, the results showed memory, and integrating this information into a summarised
that in most cases, children expected a lot from the judgment, in most cases the child then has to code the
technologies and their after use rating confirmed that this answer in some way, typically using rating scales.
was what they had got. .Researchers often discuss the importance of the question-
answer process in determining the reliability of responses
KEYWORDS provided by children in surveys [1].
Fun Toolkit, User Experience, Children, Smileyometer,
Errors Factors that impact on question answering include
developmental effects (the child may be too young to
ACM CLASSIFICATION KEYWORDS understand); language (the child may not have the words
H5.m. Information interfaces and presentation (e.g., HCI): needed), reading age (the child might be unable to read the
Miscellaneous. question), and motor abilities (the child may be unable to
write neatly enough), as well as temperamental effects
including confidence, self-belief and the desire to please.
Another factor, especially evident in surveys where
respondents are being asked to pass attitudinal judgments
[2], is the degree to which the respondent satisfices or
personal or classroom use is granted without fee provided that copies are optimises his or her answers.
bear this notice and the full citation on the first page. To copy otherwise, For validity in a survey, optimising is the preferred process.
or republish, to post on servers or to redistribute to lists, requires prior Optimising occurs when a survey respondent goes
specific permission and/or a fee. thoughtfully and carefully through all three or four stages of
the question and answer sequence. Thus, the respondent
will ensure that he or she understands what the question is
really about, will take care to retrieve an appropriate answer,
67
will think about how the response should be articulated in The Three Main Tools
order to make it clear what he or she is thinking, and, where The Fun Toolkit was originally proposed to have four tools.
there is a rating to apply, will carefully consider where on Over time, this original fun toolkit, has been tested and
the rating scale each answer should be placed. evaluated with the result that it is now considered to be
sufficient to have three main tools for use. These are the
Satisficing is a midway approach (less good than optimising tools described here.
but better than guessing!) and occurs when a respondent
gives more or less superficial responses that generally The Smileyometer (shown in Figure 1) is a discrete visual
appear reasonable or acceptable, but without having so analogue scale (VAS). Research has shown that children
carefully gone through all the steps involved in the question- aged around seven and over can use these VAS effectively
answer process. Satisficing is not to do with guessing and so, for the Fun Toolkit, this was an obvious choice of
answers or choosing random responses, it is to do with tool to consider [8]. In validations of the Smileyometer,
picking a suitable answer. The degree or level of satisficing studies by the authors and others have shown VAS to be
is known to be related to the motivation of the respondent, useful for considerably younger children than was
the difficulties of the task, and the cognitive abilities of the previously reported, but it should also be noted that when
respondent [3]. these scales are used to elicit opinions about software or
hardware products, younger children are inclined to almost
Especially because of motivation and cognitive maturity, it always indicate the highest score on the scale [9].
is the case that children are prone to satisficing - they find
During use children are asked to tick one face. This makes
survey participation difficult and often lack the abilities to
it a very easy tool for the children, but it also includes
articulate answers or to fully understand the questions.
textual information to improve the validity. The
Smileyometer can be easily coded; it is common to
As well as there being concerns with children’s responses to
apportion scores of 1 to 5 for the different faces; if used in
questions given at a moment in time, there are also concerns this way.
about the stability of these responses over time, both of
these concerns affect the reliability of self-reported data
from children. It is generally considered that all that can be
elicited from a survey is a general feel for a product or a
concept with a particular group of children at a particular
time, e.g. [4]. This is not especially good news for
researchers hoping to evaluate user experience with
children, but there is some comfort in that, in HCI work, the
reliability of responses is generally not critical (as could be
the case where a child is being interviewed as part of a
criminal investigation). Figure 1. The Smileyometer
The Smileyometer has been used in many studies to rate
Given the general difficulties of surveying children, there experienced but also anticipated fun; in the latter case the
are several useful approaches that can be taken to make the child is asked, before an experience, to tick a face according
process more valuable and at least satisfactory for all the to ‘How much fun will it be?’
parties. One of these is to use specially designed tools that
fit the task of evaluating user experience for children [5], In many evaluation studies, the desire is not to carry out an
[6]. These tools may be easy to use but because they are isolated evaluation but to rank a series of connected or
relatively novel, they may be unreliable and may be used competing activities or technologies. These comparative
inappropriately. The remainder of this paper describes a evaluations can help the evaluator determine which may be
suite of tools, the Fun Toolkit [7], and then looks at one tool more appealing or which may be least fun. Repeated
in particular, the Smileyometer, before describing a study instances of the Smileyometer can be used but the Fun-
that tested the reliability of the Smileyometer and the use of Sorter, a variation on a repertory grid test [10], can be useful
it with children. in this context. The Fun-Sorter (shown in Figure 2) has one
or more constructs and a divided line (or table) that has as
THE FUN TOOLKIT many spaces in it as there are activities to be compared. The
The fun toolkit is a selection of tools that have been children either write the activities in the spaces, or for
designed to gather opinions from children about interactive younger children, picture cards can be made and placed on
technology. They use only essential language, lend the empty grid.
themselves well to the use of pictures and gluing and
sticking for input, they are fun and attractive, and they
reduce some of the effects of satisficing.
68
VUUM 2008
Tests of the Validity of the Fun Toolkit Instruments

The Fun toolkit has been around for quite some time now
and has been evaluated in several studies. The major
findings to date have been that;
• Children almost always pick 5 for everything, this
is more the case the younger the children (REF
Removed) – Either, everything children see is
great, or, children see everything as great!
• Children tend to report their experience as being
the same as their expectation (REF Removed) –
This suggests that what is seen at the beginning is
very important
• The Smileyometer is measuring Fun rather than
Usability (REF Removed)
• The tools, when used together, give similar results
(REF Removed) – This suggests that the tools are
fundamentally not flawed and that children know
what they are doing when they use them.
Of these findings, there are several research questions that
Figure 2. A completed Fun-Sorter merit further study. In the work that reported that children
The attractiveness of the Fun-Sorter, over repeated use of ‘got what they expected’, the technologies being evaluated
the Smileyometer or the Fun-Sorter is that it forces the were all attractive and were quite ‘even’ in their
children to rank the items or activities in order. Where the attractiveness. Whether this result would hold with less
Smileyometer has been used for comparative studies, the attractive technology was worth investigating as it was
tendency of children to give maximum ratings to products hoped, but not known, that children would be able to
inevitably makes it difficult to see which item, from a set of discriminate more than that research study suggested!
four, is the least fun. THE STUDY
In this study, a group of children were given the
The ‘Again–Again’ tool (see Figure 3) can also be used to Smileyometer to rate three different technology experiences.
compare activities / events. This table lists some activities There were three hypotheses:
or instances on the left hand side, and has three columns
headed Yes, Maybe, and No. 1. Children would give different overall experienced
(after fun) ratings according to the predicted
experiences of the three technology installations
2. Children would generally rate the technologies
highly, with 5 being a common score
3. Some of the children would demonstrate
discriminatory behaviour according to their
expected and experienced fun that would align with
the predictions of the evaluators.
Method
The study took place at a Mess day at the University of
Central Lancashire. 24 children aged 6 and 7 (9 girls, the
rest boys) came to the lab and each child took part in 10
minute activities on the three technologies in question. The
technologies were intended to offer different user
Figure 3. Part of an Again–Again table experiences.
The child ticks either yes, maybe or no for each activity, • Interface A was a tangible block game that was
having in each case considered the question ‘Would you like entirely novel to the children, it looked fun, was
to do this again?’ The background to this is that, for most very easy to play, and it was expected that all the
children, a fun activity would be wanted to be repeated! children would enjoy playing it.
69
• Interface B was a PDA application that was For Interface B, 10 children gave it 5 before playing and 9
probably new to most of the children, it looked gave it 9 after playing with the other children rating it from
interesting, it was not especially easy to use (due to 1 – 4 in both cases.
the size of the stylus), and the children were
Interface C was rated at 5 by 14 children before it was
expected to find it rather dull as they were carrying
played and by 9 after playing. The other children rated it at
out a writing activity.
3 or 4 before playing and between 1 and 4 after playing.
• Interface C was a laptop PC with paint installed on
it. This was not especially novel but was expected Across the interfaces there were some trends in respect of
to be greeted with enthusiasm by the children. individual children, six of the twenty four, in every instance
However, there was a catch as the interface was gave the same ratings before they used the technology as
relatively difficult to work due to there being no they did after use. Another six never changed their ratings
mouse attached to the computer and so children by more than one from before use to after use, whereas half
had to draw using the touch pad. Thus, it was the children, on at least one occasion made a shift of two or
expected that most children would find this hard. more. Overall, the score of 5 before and 5 after was the
most popular choice. Table 2 shows the frequency and
Children moved around the activities in a predetermined spread of the different scores, the row indicates the after
order. Each child moved from A to B to C but this was in a score, the column the before score.
cyclic order and so some began with C (then moved on to A
and B), others started with B (and moved to C then A). 25 3 3 1 5 after
Some children only visited two applications as there were 3 3 3 1 4 after
other activities happening in the room that they were
involved in. The children came to each application in 5 2 1 1 3 after
groups of four and each child participated.
5 2 after
On arriving at an activity, the activity was briefly described,
1 1 1 1 after
the children were told what they were going to do and
shown the technology (but it was not demonstrated), then 5 4 3 2 1
the child completed a ‘Before Smileyometer’ to indicate before before before before before
how much fun they thought it was going to be. Having
completed the activity, the child then completed an ‘After Table 7. The 59 results
Smileyometer’ (the before version was hidden from view at
this point) to indicate how much fun the activity had been. This table shows that out of the 59 pairs, 25 were rated (5,5)
– it also shows that only one interface was rated as ‘awful’
The scores from the children were coded as follows: before use, and only 3 rated as awful after use.
Brilliant = 5, Really good = 4, Good = 3, Not very good = 2,
Awful = 1. Discussion
The results found in this study confirmed the three
At the end of the activity the children were thanked for their hypotheses whilst also indicating some interesting areas for
assistance and given certificates to take home. further study.
Results Looking at the average ratings in Table 1, it appears that
In total, 59 pairs of ratings were gathered. 16 of these children rated fun differently in each interface with there
related to Interface A, 21 related to Interface B, and 22 being a large difference between interface A and the other
related to Interface C. The average before and after ratings two. Given that interface A was designed to be a game, and
for the three interfaces are shown in Table 1: the aim of the interface was to have fun, this result seems to
indicate that children are able to discriminate for fun even
Interface A Interface B Interface C
when using such a raw tool as the Smileyometer. This was
Before 4.75 3.71 4.36 not previously reported as in earlier evaluations of the Fun
Toolkit; the technologies being compared were very similar.
After 4.75 3.52 3.64
Table 2, once again shows (as demonstrated in studies in [7]
Number 16 21 22 that on the whole, children will rate things with a maximum
rating (5) score. 46 of the 59 ratings given overall included at least
one five showing that children generally expect and
Table 6. Average ratings, before and after experience high levels of fun, with their expectations (39
For Interface A, all but two of the children gave a 5 rating with a 5), in this instance, appearing slightly higher than
for expected (before) fun and all but two children (a their experienced fun (32 with a 5), and evidenced by lower
different two), gave a 5 for experienced (after) fun. average ratings as seen in Table 1.
70
VUUM 2008
On interface C, the paint application, at least half the Number of Response Options, Negative Wording, and
children experienced significant difficulties with the touch Neutral Mid-Point. Quality and Quantity, 2004. 38(1): p.
pad but the interface still got an average score that would 17 - 33.
indicate it was better than good and almost ‘really good’ and 2. Krosnick, J.A., Response Strategies for coping with the
this confirms earlier results that suggest that children ‘put Cognitive demands of attitude measures in surveys.
up’ with poor usability if what they experience is fun, and Applied Cognitive Psychology, 1991. 5: p. 213 - 236.
that fun, therefore, is more important than usability (a
finding noted in [11]). 3. Borgers, N. and J. Hox, Item Non response in
Questionnaire Research with Children. Journal of
Considering how the ratings changed from before to after it Official Statistics, 2001. 17(2): p. 321 - 335.
is interesting to note that interface C, which looked easy and
4. Vaillancourt, P.M., Stability of children's survey
familiar had a high rating for expected fun which fell once it
responses. Public opinion quarterly, 1973. 37: p. 373 -
had been played. In this regard the children’s scoring
387.
matched the predictions of the evaluators.
5. Airey, S., et al. Rating Children's Enjoyment of Toys,
CONCLUSION Games and Media. in 3rd World Congress of
This work shows some important trends for measuring the International Toy Research on Toys, Games and Media.
experience of children with technology. It also adds to the already
2002. London.
published work on the use of the Fun Toolkit and, specifically, the
Smileyometer. The relationship between expected and 6. Hanna, L., K. Risden, and K. Alexander, J, Guidelines
experienced fun has been shown to vary according to the for usability testing with children. Interactions, 1997.
technology being evaluated. 1997(5): p. 9-14.
Where good, fun technology is being evaluated, the children will, 7. Read, J.C. and S.J. MacFarlane. Using the Fun Toolkit
by and large, rate it very highly and the expected and experienced and Other Survey Methods to Gather Opinions in Child
fun will hardly vary. Where the technology is familiar but difficult Computer Interaction. in Interaction Design and
to use, expectations will be quite high but children will adjust their Children, IDC2006. 2006. Tampere, Finland: ACM
ratings based on the experience with the product. Technology that Press.
appears uninteresting will be rated lower before use than
8. Shields, B.J., et al., Predictors of a child's ability to use
technology that appears interesting.
a visual analogue scale. Child: Care, Health and
Further work will explore the use of the other Fun Toolkit metrics Development, 2003. 29(4): p. 281 - 290.
with ‘less attractive’ and ‘less functional’ technologies and will 9. Read, J.C., et al. An Investigation of Participatory
test the ideas developed in this small study with a larger cohort of Design with Children - Informant, Balanced and
children and across a more diverse set of technologies. Facilitated Design. in Interaction Design and Children.
2002. Eindhoven: Shaker Publishing.
ACKNOWLEDGEMENTS
10. Fransella, F. and D. Bannister, A manual for repertory
Thanks to the researchers of the ChiCI group at UCLan and
grid technique. 1977, London: Academic Press.
the children of the local primary school for their assistance
in this study. 11. Read, J.C., Validating the Fun Toolkit: an instrument for
measuring children’s opinions of technology Cognition
Technology and Work, 2008. 10(2): p. 119 - 128.
REFERENCES
1. Borgers, N., J. Hox, and D. Sikkel, Response Effects in
Surveys on Children and Adolescents: The Effect of
71
Comparing UX Measurements, a case study
Arnold P.O.S. Vermeeren Anita H.M. Cremers
TU Delft, Industrial Design Engineering TNO Human Factors
Landbergstraat 15, Kampweg 5, P.O. Box 23
2628 CE Delft, The Netherlands 3769 ZG Soesterberg, The Netherlands
E: [email protected],T: +31(0)152784218 E: [email protected], T: +31(0)346356310
Joke Kort Jenneke Fokker

TNO Information and Communication Technology TU Delft, Industrial Design Engineering
Eemsgolaan 3, P.O. Box 1416 Landbergstraat 15
9701 BK Groningen, The Netherlands 2628 CE Delft, The Netherlands
E: [email protected], T: +31(0)505857751 E: [email protected], T: +31(0)152789677
ABSTRACT answer UX research questions based on the UX
In this paper we present our preliminary findings of an framework as described in [3]. Below, the UX framework
informal comparison of different types of User eXperience and different TUMCAT measurement tools are explained
(UX) measurements and methods during the field trial of a in more detail, followed by a description of the field trial,
peer-to-peer file sharing application called Tribler. the results and some conclusions and future work.
Author Keywords UX FRAMEWORK

User experience (UX), usability, framework, measurement, Shifts in focus from functional aspects of Information and
methods, requirements, peer-to-peer file sharing. Communication Technology (ICT) towards ICT as an
integrated part of a user’s everyday life create a situation
ACM Classification Keywords in which the success of a product in terms of value for the
H5.m. Information interfaces and presentation (e.g., HCI): user can no longer be fully understood based on mere
Miscellaneous. usability (e.g. efficiency, effectiveness, satisfaction,
learnability, etc.). It becomes more apparent that the
INTRODUCTION nature of user activity is opportunistic and situated, based
Within the TUMCAT5 project, a generic, in-situ Testbed for on the relation between individual predispositions
UX measurements of Mobile, Context-Aware applicaTions (personality, norms/values, emotions/moods, goals and
is being developed. An informal comparison was made of preferences, earlier experiences/knowledge, etc), product
TUMCAT findings from a first Tribler field trial, a interaction and context (social, physical as well as virtual)
laboratory usability test and an internet survey [2, 3]. In this [4].Furthermore ICT increasingly influences our
paper we report some preliminary findings of a second everyday experiences and emotions, which is often
comparative study. In this second Tribler field trial three referred to as the UX [4, 5]. In [3] we presented a
studies (a longitudinal study with TUMCAT tools, an expert tentative framework for UX addressing the changes
review and a laboratory test) are cross-validated on their mentioned above. This framework draws heavily on
ability to measure different User eXperience (UX) aspects. research by Wright & McCarthy [4] and by Desmet &
The measurements within the three studies were set up to Hekkert [6], as well as on our own previous research [1].
The framework is meant to provide an overview of all
possible aspects of UX. Further, it should help to identify
Permission to make digital or hard copies of all or part of this work for (1) possible important product (interaction) aspects, such
personal or classroom use is granted without fee provided that copies are as design and context features that may play a role in the
not made or distributed for profit or commercial advantage and that copies UX, and (2) suitable evaluation methods and instruments
bear this notice and the full citation on the first page. To copy otherwise, (see Section “From UX Framework to Measurements”)
or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee. for measuring resulting emotions and experiences, at
appropriate moments before, during and/or after product
VUUM2008, June 18, 2008, Reykjavik, Iceland. interaction.
Design elements (outer circle of Figure 1) are the product
features a designer can manipulate, such as form, colour
5
For more information on TUMCAT see: (providing aesthetics), interaction flow or choice of
http://www.freeband.nl/project.cfm?id=1126&language=en
72
VUUM 2008
specific application features/functionality (providing its interaction without giving meaning to it; Interpreting,
compositional experiences and meaning). interpretation of our interaction ((non-) instrumental,
(non-) physical [6]) with the product, relating it to our
Aesthetic design aspects (middle circle) relate to a product’s
goals, desires, hopes, fears and previous experiences,
capacity to delight one or more of our sensory modalities
leading to experiences such as anxiety, unease, desire,
[6]. These aspects are closely related to design elements
willingness to continue); Reflecting, making judgements
such as look, feel, sound, colour, form and their specific
about our experiences, evaluating them and comparing
composition. Aesthetic aspects hardly involve cognitive
them with other experiences, resulting in satisfaction,
processing and lead to emotions (inner circle) such as thrill,
excitement or boredom, a sense of achievement, etc;
fear, excitement, unease, awkwardness, the perception of
Appropriating, trying to identify ourselves with the
speed, time and its boundaries [4, 6].
experience or changing our sense of self as a consequence
of the experience; Recounting, reliving experiences by
recounting them to ourselves and others, finding new
possibilities and meaning in them. In some sense-making
processes such as connecting, cognitive processing is
hardly involved and experiences are a direct result of the
perception or sensation of the aesthetics of the product
[7]. In other sense-making processes cognition plays an
important role in making sense of the product’s workings
(compositional structure) and its attributed meaning (e.g.
in anticipation, interpreting, reflecting, appropriating and
recounting). The sense-making processes described here
can all happen in parallel or successively [4].
FROM UX FRAMEWORK TO MEASUREMENTS

In earlier work we identified requirements for new UX
measurement tools; such tools should [1]:
• Measure qualitative as well as quantitative and
subjective as well as objective data related to the UX,
Figure 2. UX framework as unobtrusively as possible. For example, they
should capture users’ behaviour (e.g. application
Compositional design aspects (middle circle) are closely feature usage) and users’ opinions and emotions
related to usability, pragmatic and behavioural about the application’s features and relate both kinds
characteristics of an interactive product. They encompass of data to understand the results in the context in
aspects directly related to interaction with the product, such which the measurement was taken.
as the design of interactive elements, action possibilities, • Support long term studies and timed or continuous
narrative structure [4], intended use and function [6]. measurements in which different measurement
Compositional aspects result in emotions (inner circle), such qualities (as mentioned above) are combined to create
as (mis)understanding of a product’s workings, a picture of the overall UX or the dynamics therein
(un)predictability of a product’s behaviour or outcomes, and (e.g. changes over time, at a specific moment in time,
feelings of (not) making progress. Aspects related to as a mean over time).
meaning are those through which a designer tries to realize • Enable researchers to perform situated measurements
the users’ higher order goals, by making users attribute to approach or realize UX measurements that reflect
meaning (middle circle) to the product. Through sense- users’ experiences in their every day lives.
making processes people are able to recognize metaphors,
assign personality or other expressive characteristics, and The above requirements were used to develop four
assess the personal or symbolic significance of a product different kinds of TUMCAT measurement tools [1]:
[6]. Attributed meaning can result in emotions (inner circle) • Logging tools: automatic capture of user’s behaviour
such as anger, joy, satisfaction, fulfilment, fun, bliss, or product usage (e.g., mouse-clicks, keystrokes,
closeness to one’s own identity or image, inspiration and application feature usage).
regret [3, 5].
• Sensing tools: automatic capture of a user’s context
According to Wright & McCarthy [4], experiences come to (e.g. physical, social and/or virtual such as physical
life via a process of sense-making. Sense-making refers to location, buddy lists and buddies online, active
processes such as: Anticipation, expectations about our application window, system/application status).
experience with a product; Connecting, our first experience
(sense of speed, thrill, openness, etc.) with the product and
73
• Experience sampling6 tools: automatically generated Together with their download of the Tribler software they
self report requests sent once or multiple times to the also received a customized version of the uLog7 software
user, triggered by user activity (logged data), specific that logged all activities users performed with Tribler. An
context aspects (sensed data) or a predefined time other tailor-made software package sent the loggings to a
schedule. central server that the researchers used for data gathering.
• User generated content tools: provided to the user to This software also sensed application and user related
give feedback about any desired topic when a user feels data that Tribler stores on the user’s computer (e.g., a list
like it (e.g. feedback button linked to a feedback form, of the user’s friends on Tribler and the library of
discussion forum related to the product). downloaded files). To be able to gather subjective data
from the test participants, the server also sent out
Experience Sampling questions to users which opened in
CASE STUDY: TRIBLER
a browser window. These questions were automatically
Peer-to-peer (P2P) networks are networked computer
triggered based on specified logged activities or
devices (nodes, terminals) that permit other computer
combinations of activities.
devices to utilize locally available storage space,
communication bandwidth, processing capacity, and
sometimes even hardware components. P2P technology has
brought clear advantages over client-server architectures [8].
Yet, the success of any P2P system fully depends on the
level of cooperation among users. Technical enforcement of
this cooperation is limited. Therefore, within the Tribler
project an alternative approach is chosen: making use of
knowledge from (social) psychology on altruistic behaviour
for developing cooperation inducing features [8].
What is Tribler?
Tribler is a peer-to-peer television (P2P-TV) system for
downloading, video-on-demand and live streaming of
television content [2]. The system not only gives users
access to all discovered content and other users in the Figure 3. Tribler 4.0 main screen.
network, but also provides the means to browse Responses to these questions were automatically sent
personalized content with a distributed recommendation back to the server. In addition, test participants had a
engine and an advanced social network that each user possibility to provide user generated content to the
creates implicitly and explicitly. An advantage of having research team by clicking a feedback icon in the system
trustworthy friends in Tribler is that they can speed up the tray of their computer and then entering text in a text
downloading process by donating their own idle bandwidth. window that would pop up. By clicking the submit button
Figure 2 shows the main screen of Tribler version 4.0. this feedback was sent to the server and was merged with
the logged data file. Test participants were gathered
Measuring Experiences with Using Tribler through personal contacts, as well as through flyers and
Measurements were taken by conducting a longitudinal field posters at institutes of higher education as well as
study using the TUMCAT measurement tools, as well as by technical institutes of applied science. They were given a
conducting a laboratory study in which test participants € 25 reward after the five weeks had passed. In the fourth
were asked to perform some tasks with Tribler and by week of the study they were asked to perform some
asking international experts on usability and user experience specified tasks and to fill in a questionnaire. Tasks and
from the COST294 network to provide their expectations on questionnaire were similar to those of the laboratory study
the users’ experiences with Tribler.
Laboratory study
Longitudinal field study In the laboratory study ten test participants were asked to
In the field trial 39 users were asked to download Tribler download the Tribler software, explore its use and then
and use it at home for five weeks in any way they wanted. perform some tasks with it, in the presence of a facilitator.
Sessions lasted about one hour to one hour and a half.
Example tasks were: accepting an invitation to become
6
Experience sampling is a set of empirical methods that are
designed to repeatedly request people to document and
7
report their thoughts, feelings, and actions outside the For more information on the uLog software see:
laboratory and within the context of everyday life. http://www.noldus.com/site/doc200603005
74
VUUM 2008
friends with someone; searching for a specific movie; on usage was automatically done over the five week study
asking a Tribler friend to donate bandwidth to speed up period. In case of the laboratory study, users were asked
downloading. Tasks were designed to make users use basic UX and usage related questions after their initial
downloading functionality, as well as features that experiences with Tribler, as well as during the whole
distinguish Tribler from other competing file sharing session and afterwards, imagining future usage. In the
software (e.g., recommendations, donating bandwidth to expert review experts were asked to provide their
friends). Users were asked to think aloud while the expectations on the UX after their first confrontation with
facilitator was sitting next to them. The facilitator took an the software, as well as after further inspection and trial.
active listener approach [9] and asked questions at
appropriate times to probe the user’s understanding of the Usage
software’s functionality, e.g., Have you noticed the In the field study, TUMCAT’s automated logging
checkmark / exclamation mark in the top right corner? What facilities made it possible to monitor actual usage of
does it mean to you? Retrospective interviews were Tribler at the level of UI events (e.g., shifts in input focus,
conducted to further discuss some of the events or problems key strokes) and the abstract interaction level (e.g.,
that came up during sessions and users were asked to fill in providing values in input fields) [10], providing insight
the Attrakdiff8 questionnaire on paper. into how usage developed over the five week time period.
In the laboratory study test participants were asked to
Expert review imagine whether and how they would use Tribler at home
In the expert review five experts were provided with the in three different ways: 1) after a short exploration of the
Tribler software to review it. They were asked various software: what kind of things do you think you would use
questions assessing their opinions on the software. They Tribler for at home; 2) after each task: would this be
were also asked to estimate what problems users would have something you can picture yourself doing at home? (scale
with the software, and to fill in the Attrakdiff questionnaire 1-5) and 3) at the end of the session: Would you consider
imagining what the users would answer. Communication starting to use Tribler at home? How frequently? What
with the experts was solely through email. would you use it for? Under what circumstances? In the
expert review, the experts were asked similar questions:
Measurements in the Different Studies 1) after a quick exploration of Tribler: Do you think that
The various types of measurements taken are discussed the target group may indeed consider downloading, or
using the UX framework explained in section “UX actually download the software once they are aware of its
Framework”, focusing on compositional, meaning and existence? What do you think users would want to use
aesthetics UX aspects. It is assumed here that in case of Tribler for in particular? 2) in relation to Tribler’s friends
software for voluntary use its attractiveness largely and recommendations facilities: Do you think that (in the
determines whether or not and to what extent the software long term) Tribler users will actively engage in using this
will be used. The attractiveness of using the software is facility?
closely related to the emotional response to (the exposure to
and interaction with) the software. In addition, factors like Attractiveness
the user’s context, a user’s pre-dispositions and constraints One of the ways of measuring attractiveness was by
play an important role (e.g., familiarity and availability of having the test participants fill in the Attrakdiff
other software with similar functionality, previous questionnaire. This questionnaire is based on a theoretical
knowledge of P2P file sharing software, technicalities and model in which pragmatic and hedonic aspects of
compatibility of the user’s system). experiences are thought to affect the attractiveness of a
product. A number of questions assess the product’s
The sense-making process of the UX framework (depicted
attractiveness for the user. In the field test, participants
in the outer circle) implies that there are temporal issues
who had actually used Tribler were asked to fill in the
involved. One has to take into account the user’s experience
questionnaire in the fourth week of their use. In the
in relation to the product prior to or at the very start of the
laboratory study participants filled in the questionnaire
actual interaction (e.g., anticipation and connecting), as well
after their task performance. In the expert view, the
as during and after the interaction (i.e., the other elements of
experts were asked to fill in what they thought users
the sense-making process).
would fill in.
Data was gathered on usage and on the attractiveness of the
In addition to this questionnaire, participants were also
software, in addition to data that related to the different UX
asked about their appreciation of specified Tribler features
aspects. Temporal issues were addressed in different ways
on a 5-point scale. In the field study, these questions were
in the three types of studies. In the field trial data gathering
asked as experience sample questions in response to the
use of the specified function (e.g., after the 1st and every
8
For more information see: http://www.attrakdiff.de/ 5th time of using the function). In the laboratory study all
75
participants were asked to answer the questions after having In all three studies measurements included those questions
performed a task related to that function. In the expert view, of the Attrakdiff questionnaire that related to the
experts were instructed to answer the questions imagining software’s pragmatic quality. In the field study, the
how they thought users would appreciate the specified logged and sensed data in combination with spontaneous
functionality. user feedback shed a light on pragmatic issues. In the
laboratory study such data were gathered by observing
Finally, insight into the users’ emotions in reaction to the
task performance and through retrospective interviews. In
product were received through spontaneous (written)
the expert review, the experts were given the tasks used in
feedback by the participants (field test) as well as through
the laboratory study as suggestions to structure their
the retrospective interviews and observed (verbal and no-
search for usability problems in the software and were
verbal) reactions during task performance (laboratory
also asked to provide some reasons on why they thought a
study).
problem would occur.
Compositional aspects, meaning and aesthetics
Compositional aspects relate to the pragmatic aspects of
interactions, including usability problems, effectiveness etc.
7,0
Lab t est
Expert Review
6,0 Field t est
5,0
4,0
3,0
2,0
1,0
PQ HQI HQS ATT
Figure 4. Attrakdiff results (PQ: pragmatic quality, HQI: hedonic quality (identification), HQS: hedonic quality (stimulation),
ATT: attractiveness. 1 indicates low score, 4 indicates neutral. Lab test (n=11), Expert view (n=5), Field test (n=6)).
In addition, the experts were explicitly asked about whether

PRELIMINARY RESULTS AND ANALYSIS
they thought users would understand the logic behind A preliminary analysis of the data was conducted, which
specified functions. Aspects of meaning were measured by allows drawing some tentative conclusions on how the
the questions in the Attrakdiff questionnaire that related to findings from the three studies relate to each other in
the software’s hedonic qualities (all three studies). It was terms of UX framework elements.
also expected that in all three studies spontaneous feedback
from users and explanations on expected future usage would From the field study it became clear that 15 of the 39 test
provide some insights on these aspects. Aesthetic aspects participants had started using Tribler during the five week
were not explicitly included in any of the measurements of period; these were given the questionnaire. Only 6 of
the three studies. However, spontaneous remarks or them filled it in, being active users in the sense that on
feedback could provide some data on such aspects. occasional days they spent some hours on using the
software. From questions asked to the 15 ‘real’ users and
76
VUUM 2008
through informal communication it became clear that DISCUSSION AND CONCLUSIONS

reasons for not using Tribler had to do with software Analysis of these studies is still preliminary and ongoing.
packages crashing, the combination of software making In the following we describe our first findings. We can
computers too slow, as well as them being used to only gain insight in the actual usage of Tribler and its’
competing software. Statements participants in the functions by using tools such as logging and sensing over
laboratory study made, predicted that utility and usability a longer period of time. Laboratory tests and expert
problems (compositional aspects) would prevent some reviews give weak predictions about usage which do not
people from using the software. In the expert review similar agree with the results from logging and sensing. Logging
views were expressed. A superficial analysis of the logging and sensing do not provide any explanation for the actual
and sensing data gathered in the field study could not usage. We found lower scores in the field test for both
provide a clear view on compositional aspects like usability meaning and hedonics (from the Attraktdiff
problems; it was too difficult to trace back what users were questionnaire) than from the laboratory study or the
trying to do from mere log data. The data from the expert reviews. Actual usage makes people aware of the
laboratory study, as well as from the expert review provided match between product and their higher order goals (or
a rich view on problems and their possible causes. This was meaning). This may account for this result.
the kind of data that designers in the Tribler development Laboratory tests and expert reviews give a detailed and
team could most readily use in their attempts to redesign rich insight in compositional, pragmatic issues such as the
Tribler. Aspects of meaning were found mainly in the usability. Logging and sensing do not provide this
laboratory study through spontaneous think aloud detailed insight. Logging tools used, monitored user
utterances, as well as in retrospective interviews. This activity on the levels of UI events (e.g., shifts in input
related to issues like terminology being too dull for them, focus, key events) and the abstract interaction level (e.g.,
not wanting to use social software, not valuing providing values in input fields). To generate information
recommendations of files based on popularity, issues in about usability issues, the data needs to be transformed to
relation to (appreciating or not appreciating) illegal higher levels of abstraction such as domain or task related
downloads and (laughing about or feeling offended by) levels (e.g., providing address information) or to goal and
adult content. In the expert review two experts commented problems related levels (e.g., placing an order) and
on issues of meaning, only in relation to not appreciating compared to predefined or automatically identified
illegal content and feeling offended by adult content. In the sequences of user activity within these levels [10].
field study only two users commented on issues of meaning.
They did so in the same way as the experts in the expert Attractiveness scores were slightly negative on the
review. Attrakdiff questionnaire in all three studies. Though the
score was slightly negative we doubt if this is a valuable
As to the aesthetic aspects, three experts mentioned issues predictor for the (non-)usage as measured during the field
of graphical design and layout in their comments. Generally test (it seems too bold that a slightly negative score can
they indicated they liked the graphical design, although also result in such overwhelming non-usage). The influence of
once ‘bad layout’ was mentioned, as well as being disturbed attractiveness on usage might be dependent on people’s
about the software not showing thumbnails in the files view. motivation to use a product (externally motivated or
In the field test only two participants commented on internally motivated product interaction). Aesthetic
aesthetic aspects, mentioning they disliked the library and aspects are difficult to measure for interactive products.
files view. In the laboratory study a rich mix of comments We found attractiveness as well as aesthetics are best
was given (by 7 participants) on aesthetic issues. Opinions measured through direct interaction with participants. For
here were more mixed in the sense of valuing the design or experts it’s difficult to formulate an opinion about
not, but also of the level of design detail they commented attractiveness and aesthetics from a user’s viewpoint.
on, ranging from comments on specific icons to an opinion
on the general looks of the software. Many of the comments FUTURE WORK
were spontaneous exclamations when confronted with a new Based on the conclusions above, we identified several
screen. issues for future work on improving UX measurement in
From the Attrakdiff questionnaire we found that pragmatic long-term field studies:
aspects in the field trial and the laboratory test were more or 1. Tools or methods to raise the level of logging data
less the same and scored more negatively than in the expert from the UI events and the abstract interaction level
review. Hedonic identification with, and stimulation by to higher levels such as domain and task or even goal
Tribler scored lowest in the field test, followed by the and problem levels and means to analyze the data on
laboratory test and the experts’ opinions. Attractiveness these higher levels (detecting sequences and
scored similar over the different studies, see figure 3 for an interpreting these sequences);
overview.
77
2. Approaches for automated gathering of data that Television Systems: a Pilot Study of Tribler, in
provide (a) insight into reasons of (non-) usage at the EuroITV'07, A. Lugmayr and P. Golebiowsky,
level of products or product features, (b) insight into Editors. 2007, TICSP, Tampere: Amsterdam, The
why or how the product succeeds (or not) in making the Netherlands. p. 196-200.
user attribute meaning to it; (c) rich and detailed data on 3. Kort, J., A.P.O.S. Vermeeren, and J.E. Fokker,
usability issues, especially those that relate to longer Conceptualizing and Measuring UX, in Towards a UX
term usage and are highly affected by the personal Manifesto, COST294-MAUSE affiliated workshop, E.
situation of the user. Especially for topic 2a more Law, et al., Editors. 2007, COST294-MAUSE:
detailed theoretical knowledge in the area of UX would Lancaster. p. 83.
help to relate the various aspects in the framework to
each other and to a product’s attractiveness and (non-) 4. McCarthy, J. and P. Wright, Technology as
usage in real-life. Experience. 2007: The MIT Press. 224.
5. Pals, N., et al., Three approaches to take the user
3. A practical approach for assessing the aesthetic aspects
perspective into account during new product design.
of a product.
International Journal of Innovation Management,
2008. In press.
ACKNOWLEDGEMENTS
We would like to especially thank the following people for 6. Desmet, P. and P. Hekkert, Framework of Product
their contributions to these studies: Gilbert Cockton, Effie Experience. International Journal of Design, 2007.
Law, Jens Gerken, Hans-Christian Jetter and Alan Woolrych 1(1): p. 10.
from the COST294 network. Ashish Krishna and Anneris 7. Hekkert, P., Design Aesthetics: Principles of Pleasure
Tiete for their contribution in executing and analysing the in Design, Delft University of Technology,
study and the preliminary results. The test participants for Department of Industrial Design: Delft. p. 14.
their active contribution and feedback. Leon Roos van
Raadshoven, Paul Brandt, Jan Sipke van der Veen and 8. Pouwelse, J.A., et al., Tribler: A social-based peer-to-
Armin van der Togt for their technical contributions. peer system. Concurrency and computation: Practice
and experience, 2008. 20(2): p. 127-138.
REFERENCES 9. Boren, T.M. and J. Ramey, Thinking aloud:
1. Vermeeren, A.P.O.S. and J. Kort, Developing a testbed reconciling theory and practice. IEEE Transactions on
for automated user experience measurement of context Professional Communication, 2000. 43(3): p. 261-277.
aware mobile applications, in User eXperience, Towards 10. Hilbert, D.M. and Redmiles D.F, Extracting usability
a unified view, E. Law, E.T. Hvannberg, and M. information from user interface events. ACM
Hassenzahl, Editors. 2006, COST294-MAUSE: Olso. p. Computing Surveys (CSUR). 2000. 32(4): p. 384 –
161. 421.
2. Fokker, J.E., A.P.O.S. Vermeeren, and H. de Ridder,
Remote User Experience Testing of Peer-to-Peer
78
VUUM 2008
Evaluating Migratory User Interfaces

Fabio Paternò Carmen Santoro Antonio Scorcia
ISTI-CNR ISTI-CNR ISTI-CNR
Via G. Moruzzi, 1 Via G. Moruzzi, 1 Via G. Moruzzi, 1
56124 Pisa 56124 Pisa 56124 Pisa
[email protected] [email protected] [email protected]
+39 050 315 3066 +39 050 315 3053 +39 050 315 3127
ABSTRACT In the paper, we first introduce some relevant design
Migratory user interfaces are user interfaces that allow the dimensions for migratory interfaces, and next we introduce
users to change device and continue the task from the point the software architecture of our environment for supporting
they left off. In this paper we discuss aspects that are such interfaces. We also discuss usability aspects that
important in evaluating migratory interfaces and report on a characterise the migration and report on some user tests.
usability test we carried out for this purpose. Lastly, some concluding remarks along with indications for
future work are provided.
Author Keywords
Migratory user interfaces, Usability, Adaptation, MIGRATION DIMENSIONS
Consistency Different dimensions have been identified regarding the
migratory user interfaces. Such dimensions can help the
ACM Classification Keywords designer in understanding the characteristics of the different
H5.m. Information interfaces and presentation (e.g., HCI): migrations and also better identify possible relevant aspects
Miscellaneous. as far as the migration evaluation is concerned. Amongst
them we identified the following ones:
INTRODUCTION
Migratory user interfaces provide users with the ability to • Activation Type: the agent triggering the migration
change the interaction device and still continue their tasks (the user or the system or a mixed initiative) is
through an interface adapted to the new platform. We use triggered.
the concept of ‘platform’ to mean those environments that • Type of Migration: This dimension analyses the
have similar interaction capabilities (such as the form-based ‘extent’ of migration, as there are cases in which
graphical desktop, the vocal device, the digital TV). only a portion of the interactive application should
Migration can involve devices belonging to different be migrated.
platforms. In some cases, migration can be used to improve
the user’s experience by switching to a better suited devices • Number/Combinations of Migration Modalities
(bigger screen, more graphical power, …) or to a more This dimension analyses the modalities involved in
efficient communication channels that can guarantee better the migration process.
QoS (shorter delays, higher bandwidth).
• Type of Interface Activated: This dimension
The increasing availability of various types of interactive specifies how the user interface is generated in
devices has raised interest in model-based approaches order to be rendered on the target device(s) (pre-
useful for logically specifying relevant user interface computed or dynamically generated or a mix of the
information in appropriate models which are often two options).
described by using XML-based languages. In our migration
environment we refer to TERESA¬XML language [1]. • Granularity of Adaptation: The adaptation process
can be affected at various levels: the entire
application can be changed depending on the new
context or the user interface components
(presentation, navigation, content).
personal or classroom use is granted without fee provided that copies are • How the UI is Adapted: Several strategies can be
identified regarding how to adapt user interfaces
or republish, to post on servers or to redistribute to lists, requires prior after a migration process occurs.
specific permission and/or a fee.
• The Impact of Migration on Tasks: The impact of
VUUM2008, June 18, 2008, Reykjavik, Iceland. migration on tasks depends on how the UI is
adapted (e.g.: the device characteristics might
79
enable reduction/increase on the range of tasks • the migration server detects the state of the
supported by each device). application modified by the user input (elements
selected, data entered, …) and identifies the last
• Context Aspects. During adaptation of the user
element accessed in the source device. At the same
interface the migration process can consider the
time, the server gets information about the source
context in terms of description of device, user and
device and, depending on such information it
environment.
builds the corresponding logical descriptions by
• Implementation Environment. The migration invoking a reverse engineering process. The result
process can involve different types of applications of the reverse engineering process, together with
(Web, Java, .NET, …). information about source and target platforms is
used as input for carrying out a semantic redesign
• Architecture of the migration infrastructure: phase in order to produce a user interface for the
client/server; and peer to peer, target platform.
• Usability Issues: The adaptation and generation of • afterwards, the migration server identifies on the
user interfaces should be performed taking into target device the logical presentation to be
account the main principles of usability. activated, and it consequently adapts the state of
the concrete user interface with the values that
OUR ARCHITECTURE FOR MIGRATION have been saved previously.
Starting with a pre-existing Web version for the desktop
platform, our environment is able to dynamically generate • Finally, the generation of the final user interface
interfaces for different platforms and modalities exploiting from such a logical description for the target
semantic information, which is derived through reverse platform is delivered, and the resulting page is sent
engineering techniques. The environment currently supports to the browser of the target device in order to be
access to Web applications and is able to generate versions loaded and rendered.
for PDAs, digital TV, various types of mobile phones and
It is worth pointing out that in our environment both the
vocal devices and support migration through them. Our
system and the user can trigger the migration process,
solution migrates the presentations as they are needed.
depending on the surrounding context conditions.
There is no transformation of the entire application in one
shot, which would be very expensive in terms of processing
USABILITY EVALUATION
and time. Consequently, the migration process takes only a The usability evaluation of migratory interfaces should
few seconds from when it is triggered to when the user consider their two main components: continuity and
interface appears on the target device. The application adaptation. In addition, it should also consider the usability
implementation languages currently supported include of the migration client, which is used by the users to trigger
XHTML, XHTML Mobile Profile, VoiceXML and Java. migration and select the target device.
The tool has been tested on a number of Web applications
(e.g. Movie rental, Tourism in Tuscany, Restaurant Continuity includes the ease with which users are able to
reservations, on-line shopping…). Our environment immediately orient themselves in the new interface and
assumes that the desktop Web site has been implemented therefore are able to recognise and have the feeling of a
with standard W3C languages and filters out some dynamic “continuous” interaction. Indeed, it implies that users can
elements of the page (e.g. advertising and animations). easily recognise that the new user interface is the follow-up
of previous interactions through another device while
Our migration platform is composed of a number of phases. carrying out the same task. If users do not recognise such a
The process starts with the environment discovering the situation, they may even be confounded by the user
various devices available in the current context, which also interface presented on the target device. Different factors
implies saving related device information, e.g. device can affect such recognition and they might make continuing
interaction capabilities. Whenever a user accesses an the interaction in a seamless way problematic from the
application, the proxy server included in our migration user’s point of view. For instance, factors might include i)
environment receives the Web page, and modifies it in whether long time has passed since the last interaction,
order to include scripts that allow the collection of which therefore might be difficult to remember, or/and ii)
information about its state. Afterwards, it sends the whether the adaptation process has changed the user
modified page to the web browser of the source device. interface rendered on the target device in such a way that
When the migration server receives the request for the users do not recognise that it enables them to logically
migration (which specifies the source device, the target continue the performance of their tasks from the point
device) from the source device, it triggers the following where they left off in the source device. For instance, the
actions: new user interface should clearly highlight the elements that
were changed during the interaction with the previous
80
VUUM 2008
device. Due to screen size limitations, such highlighting migration client can offer to them, and to what extent such
may not be immediately evident. options were designed in such a way to be able to
effectively communicate to the user the result that will be
In the first case, adequate and (adaptable) feedback
achieved by activating each of them. This might be
messages should be identified in order to take into account
especially relevant when the migration process involves
the aspects related to the time passed [2]. For instance, if a
more than two devices (and/or more than one user),
long time has passed, the user interface activated on the
although different options can be also available with only
target device could enrich the quantity of feedback
one user and two devices involved (partial/total migration).
information initially provided to the users. This should
For example, users should be able to easily understand to
allow them to better contextualise the interaction and
what actual devices correspond to the list of devices that
remember the action(s) already done on the user interface of
can be selected as migration targets. In case of partial
the source device (e.g.: which sub-tasks they already
migration, users should be able to easily predict what part
accomplished on the device, and possibly which overall
of the interface will migrate and on which device the results
task they were about to complete, ..). Regarding time,
of the interactions will appear. An aspect connected with
another aspect that can affect the user experience is the time
the predictability is the learnability, which is the easiness
necessary for the migration to take place: a migration that
with which the users become familiar to the migration
takes a long time to complete may compromise again the
features and therefore they are capable of controlling its
feeling of a continuous interaction and have a negative
features and related effects. If the system is easy to be
impact on the user’s experience.
learned, even occasional users should not have difficulties
In the case of the adaptation process, it should be a trade-off to use it.
between two sometimes conflicting aspects: on the one
hand the devices are different and thus need an interface USER TESTS
adapted to the varying interaction resources, on the other In order to understand the impact of migration on users and
hand users do not want to change their logical model of the usability of migratory user interfaces, some user tests
how to interact with the application at each device change. were performed in order to evaluate some of the
Therefore, suitable mechanisms should be identified in abovementioned usability-related aspects. The trans-modal
order to make the users easily recognise the features for migration functionality, from graphic to vocal, has been
supporting interaction continuity. tested in the first version of the migration service.
Further aspects that should be taken into account regard the Trans-modal Migration Test
users’ model of the application and how to find information The first test was performed on the “Restaurant” application
in it. In particular, the user’s familiarity in using the same [3], which allows users to select a restaurant, accessing its
application through different devices can affect the usability general information and make a reservation. The interfaces
of migratory user interfaces. Indeed, if the users already of the test application for desktop, PDA and vocal platforms
have some familiarity in using the same application through differ both in the number of tasks and their implementation.
different devices (but without having experienced the For example, the date insertion is a text field in the desktop
migration capabilities), on the one hand they might more version and a selection object in the PDA version, while the
easily recognise the effect of adaptation on the user insertion of free comments was removed from the vocal
interface of the target device and therefore they should feel interface. When migrating on the vocal platform, it might
more confident with the adapted user interface. However, happen that there are some vocal inputs which are
on the other hand they are likely to notice the changes that automatically disabled after migration because the user
have occurred on the user interface of the target device(s) as already provided the corresponding values through the
an effect of the migration process, which might become a graphical device. This type of support is provided in order
potential source of disorientation for them because they to facilitate the user in efficiently completing the
might expect a different presentation and/or navigation. application form. Since we were interested in considering
Another factor that can impact the usability of migratory multi-device environments, both a desktop PC and a PDA
user interfaces is the predictability of the effects of the were used as graphic source platforms. The 20 users
migration’s trigger, namely the ability of the user to involved were divided into two groups. The first one started
understand the effect of triggering a specific type of with migration from PDA to vocal platform and then
migration. To this regard, the migration client should be repeated the experiment starting with the desktop. The
designed in such a way to effectively allow the users to second one started with the desktop and repeated the test
understand the effects that a migration trigger will provoke using the PDA. The scenario proposed to the users was that
on the target device(s). The predictability aspect is they start booking a table at a restaurant by using their
particularly relevant especially when different options for graphic device (the PDA is supposed to be used if they are
migration are offered to the users. Therefore, it also on the move, the desktop if they are at home or at office)
depends on the number of different migration options the and load the “Restaurant” application. At some point, due to
81
some external conditions (e.g.: the low battery level of the Table 2. User preferences and salience of task differences
PDA), they had to ask for migration towards the available This might be justified by the fact that the PDA and the
vocal device and then complete the Restaurant Reservation vocal versions are more similar in terms of tasks supported
task. After the session the users filled in the evaluation than the desktop and the vocal ones. In any case, the
questionnaire. difference in ease of continuity between the two platforms
Users were recruited in the research institute community. (desktop and PDA) is small, thus the interaction continuity
The average user age was 33.5 years (min 23 - max 68). ease is influenced, but not compromised. Both the initial
Thirty percent of them were females, 65% had at least and the overall feedback through the vocal application were
undergraduate degrees and 55% had previously used a judged positively (Table 1). The vocal feedback design was
PDA. Users had good experience with graphic interfaces appreciated and 80% of the users would have not changed
but far less with vocal ones: on a scale of 1 to 5, the average its style. One concern was the potential user disorientation
self-rating of graphic interface skill was 4.30 and 2.05 for in continuing interaction, not only due to the change of
vocal interfaces. For each migration experiment, users were modality, but also due to the different range of possible
asked to rate from 1 (bad) to 5 (good) the parameters shown actions to perform. Only 20-25% noticed the difference and
in Table 1. it was perceived more in the desktop-to-vocal case (Table
2). This first study provided some useful suggestions to
Parameters Desktop to PDA to
bear in mind when designing user interfaces intended for
vocal vocal
Migration client 3.4 3.9
trans-modal migration. The modality change does not cause
interface clearness disorientation but must be well supported by proper user
Interaction continuity 4.35 4.65 feedback balancing completeness while avoiding boredom.
easiness The differences in interaction objects used to support the
Initial vocal 4.1 4.2 same task on different platforms were not noticed at all,
feedback usefulness while the difference in the number of tasks supported was.
Vocal feedback 4.25 4.25 This has to be carefully designed in order to reduce as much
usefulness as possible any sudden disruption in the user’s expectation.
Table 1. User rating for trans-modal migration attributes.
Vocal feedback was provided via both the initial message, Test on the New Version of the Migration Environment.
recalling the information inserted before migration, and a The first test confirmed our choices concerning the support
final message at the end of the session about the for trans-modal migration and gave useful suggestions for
information inserted after migration. We chose this solution improving the migration environment. The work undertaken
as the most likely to reduce user memory load but in any after the first test resulted in the new version of the
case, after the test, we asked the users if they would have migration environment discussed in this article and we
preferred only an overall final feedback instead. Also, we performed a new test with a different application: the
asked whether they noticed any difference between the “Virtual Tuscany Tour”. The goal was to have further
graphic and vocal interface with the aim of finding out empirical feedback on the migration concept and the new
whether they could perceive the different number of solution supporting it. Since very few changes concerned
supported tasks. The numeric test results were interpreted the trans-modal interface transformation while many more
taking into account the answer justifications and free affected the unimodal (graphic) migration, the new test
comments left in the questionnaire and considering user considered the (unimodal) desktop-to-PDA migration, in
comments while performing the test. order to evaluate the new functionality.
The service in itself was appreciated by users. Many judged The test application is the Web site of a tourism agency
it interesting and stimulating. The users had never tried any specialized in trips around Tuscany, an Italian region.
migration service before and interacted with it more easily Interested users can request material and get detailed
in the second experiment, thus, showing it was easy to learn information about trips around the sea and the beach or
through practise, once the concepts underlying migration around the countryside and the mountains; it is also
were understood. Interaction continuity received a slightly possible to get overviews of good quality places to sleep.
higher score in the PDA-to-vocal case. From this Web site it is also possible to get information
about the main social events, entertainment, places to taste
Parameters Desktop PDA to good food and wine, sports activities and other useful
to vocal vocal information, such as the weather forecast and public
Only final vocal feedback Yes 20% Yes 20% - transportation. During the test, users could freely interact
preferred - No 80% No 80%
with the application and at a certain point they were asked
Noticed different task set Yes 25% Yes 20% -
to fill in a form requesting tourist information about
- No 75% No 80%
Tuscany. The form had to be filled in partially, choosing
some fields in any preferred order (see for example Figure
82
VUUM 2008
1). Then, users had to interact with the desktop migration

client interface to require migration towards the PDA and
continue the registration on the new device. They also were
asked to check previously inserted information and perform
some more tasks on the PDA among those supported by the
Characteristics Average
application indicated above (e.g.: completing the form or Score
going to another section of the web site). Lastly, users were Visual Impact of PDA Redesigned 3.9
asked to fill in an evaluation questionnaire, whose questions Pages
were identified in such a way to collect user comments on Functional Navigation Structure 4.1
migration features . on PDA
Interaction Difficulties due to 1.6
Redesign Result
Interaction Difficulties due to PDA 1.75
device
Table 4. Redesign transformation evaluation
In Table 4 we show the average scores collected for the
redesign part (which performed the adaptation to the target
device) of the questionnaire and the characteristics we
asked the participants to evaluate, still on a 1 to 5 score
scale. A final question concerning the redesign part was
about whether participants noticed changes in the
redesigned pages compared to the original ones. 40% of the
participants answered positively, and in the next paragraphs
we discuss what this means.
It is worth noting that the evaluation questionnaire was not
Figure 1. The Test Application only a device for a mere collection of numeric values,
because users were also asked to provide justifications for
The experiment involved 20 users whose average age was
each answer and free comments were also always allowed.
32 years (min 22 - max 58). Forty percent of them were
females, 85% were graduated or had an advanced degree.
Users were experienced accessing Web applications
through the desktop platform and far less experienced
through the PDA: on a scale of 1 (minimum value) to 5
(maximum value), the average self-rating of Web access
skill through the desktop platform was 4.35, while PDA
usage ability was 1.95. The evaluation questionnaire was
divided into two parts, one concerning migration and one
related to the result of the automatic page redesign from the
desktop to PDA platform. The migration related part asked
for a 1 to 5 score to the characteristics shown in Table 3
along with the average values collected.
Migration characteristics Average

Score
Migration Service Usefulness 4.1
Migration Client Interface Clearness 4.1
Easiness in Continuing Task 4.4
Performance After Migration Figure 2. Migration evaluation results
Easiness in Retrieving Information 4.75 This gave us the ability to better understand the numeric
Inserted Before Migration results. Looking at Table 3, we can see that all the average
Table 3. Migration service evaluation scores concerning migration evaluation were fully positive;
Users were also asked to say if they would have changed moreover, as you can see in Figure 2, only 10% or less of
anything in the way migration is performed and 45% of the participants rated any of the parameters a score of 1.
them answered positively providing useful suggestions for Some of the people who did not find the migration client
improving the migration client interface. useful said that they simply do not use a PDA.
83
The migration client interface was deemed quite clear since interaction object implementing a given basic task (i.e. a
it was considered basic and easy to use, some of the desktop pull down menu with hundreds of choices
difficulty that was found was because this user interface had transformed to a PDA text input field) when changing
never seen before, as some users commented, at a second platform would have disturbed the users. The comments
usage they would have known what to do. In addition, at a provided in the questionnaire outlined that the changes that
second usage, since the testers already have an idea of what had been noticed were the ones related to the fact that a
would have happened, they are likely to be more focused on single desktop page had been split into multiple PDA pages
the technical evaluation of the features of the application and images where reduced in size. They did not make any
and then less affected by the novelty of the prototype. The observation concerning the different implementations of
values about task performance continuity and access to some tasks, which was a signal of the fact that they did not
information inserted before migration can be considered notice any particular difference that disturbed them. About
quite satisfactory. Regarding the 45% of users who would difficulties of interacting on the PDA platform (see Figure
have changed something in the way the migration was 3b), the numbers must be interpreted in a converse way,
performed, this most often referred to the list of IP since the parameters are related to a negative feature.
numerical addresses, which was used to indicate the Indeed, the vast majority of users did not find any major
possible target devices (instead of using logical names, with difficulty or only some very small problems in interacting
more significance to the user). We understand such with the redesigned user interface.
criticism since the migration client interface was at
prototype stage and it has been improved to this respect. CONCLUSION
In this paper we have discussed usability evaluation of
(a) migratory interfaces in multi-device environments. We also
report on usability tests carried out to better understand the
usability of migration and the interfaces resulting from our
semantic redesign transformation. The results are
encouraging and show that migration is a feature that can be
appreciated by end users because it provides them with
more flexibility in emerging multi-device environments.
The test provided useful results especially regarding how
the user experience is affected by the different adaptation
strategies employed for supporting migratory user
(b) interfaces. For instance, the test showed that the users are
more sensitive to changes affecting the set of available tasks
when changing devices during migration, rather than to
possible changes occurring on the implementation of
different interaction objects supporting the same task on
different platforms.
In addition, the test gave us the opportunity to verify that
the migration features are easily to be familiarised, since the
users interacted more easily during the second experiment
Figure 3(a)- (b). Evaluation of the Redesign Transformation of the test. The test results have also shown that the
proposed redesign transformations are able to automatically
Concerning the evaluation results obtained for the redesign generate adapted presentations that allow users to continue
part (see Table 4 and Figure 3a), one issue about visual their task performance in the target device without being
effectiveness of the redesigned page was that current disoriented by the platform change.
browsers for PDAs have limitations with respect to those
for desktop systems in terms of implementation constructs Future work will better evaluate the issue of continuity and
that are supported. While the navigation structure of the to what extent the time factor (eg: the time passed between
redesigned pages was appreciated, some difficulties were the last interaction on the source device and the first
found while interacting with the PDA (see Table4), but it interaction on the target device, time needed for executing
must be said that most users had never used one before and the migration) can affect the user experience. Future work
many difficulties arose simply because of the lack of will be also dedicated to identifying more suitable
knowledge of this platform. evaluation methods and metrics in order to better quantify
the benefits of the migration from the user’s perspective.
Also, users said that they noticed changes between the
desktop and the PDA presentations. This question was
made in order to understand if changing the specific
84
VUUM 2008
REFERENCES User Interfaces: Cross-Platform Applications and

1. Mori G., Paternò F., Santoro C. Design and Context-Aware Interfaces”, H. Javahery and A. Seffah
Development of Multidevice User Interfaces through (Eds.), 373-385, Wiley and Sons, 2003.
Multiple Logical Descriptions. IEEE Transactions on
3. Bandelloni R., Berti S., Paternò F., Analysing Trans-
Software Engineering August 2004, Vol. 30, No 8, IEEE
Modal Interface Migration, Proceedings INTERACT
Press, pp.507-520.
’05, Roma, September 2005, Lecture Notes Computer
2. Denis, C., and Karsenty, L., Inter-Usability of Multi- Science 3585, pp. 1071-1074, Springer Verlag.
Device Systems−A Conceptul Framework. In “Multiple
85
Assessing User Experiences within Interaction:
Experience As A Qualitative State and Experience As A
Causal Event
Mark Springett
Interaction Design Centre
Middlesex University, Town Hall, Hendon,
London, NW4 4BT, UK
[email protected]
user-experience enquiry than of traditional HCI. The nature

ABSTRACT of qualia, or the first person experience of product use is
This paper discusses issues in assessing the qualitative then considered and its relationship to tacit factors causing
affective elements of human interaction with software qualia, and the consequences of qualitative reactions. The
products. Evaluating qualitative responses seems inherently argument is that whilst qualia are first person felt states and
problematic due to the personal and intangible nature of therefore in themselves explicitly expressible, crucial causal
experience. The paper argues that the ‘soft’ discipline of relationships involving those states are tacit, and only
user-experience evaluation is not significantly softer than measurable in less direct ways.
more traditional ‘measurement’ of HCI phenomena. It then
considers the key phenomena of qualia, tacit knowledge and PROOF AND PERSUASION IN EMPIRICAL RESEARCH
behaviour, considering what formative and comparative Compared to traditional HCI enquiry, user-experience can
assessment needs to know and the availability of that initially seem softer, less tangible and less available to
information. In particular it considers the causes and effects measurement. The best we can hope for it seems is to get
of qualitative states in interaction and their significance. some weak opinion or reaction data, and see if it is
The implications for selection of evaluation approaches is persuasive. However, even where something more like
then discussed using cameo examples of systems that have ‘hard science’ is brought to bear on traditional HCI
a key affective element in use. phenomena, it is persuasion that is ultimately the test of its
worth. Experimental evidence may provide a proof of a
Author Keywords theory pertinent to an HCI problem, but this is evidence to
User-experience, qualia, tacit, evaluation support persuasion. The proponents must then argue for the
significance of the experimental finding in reference to the
INTRODUCTION target HCI phenomena. The proof is a phase on this
This paper considers the nature of the relationship between journey but it is never the entire journey, or its essence.
qualitative experience, tacit causality and evaluation
approaches for artefacts in use. The question addressed is Various approaches have been developed to assess
about how we can appropriately measure qualitative factors experience. For example, Product reaction cards [1] prompt
in interactive products. In this context we use a wide-scope users to identify ‘attributes’ present in a product, some of
interpretation of the term ‘measurement’. Orthodox which are user experience attributes, pairing their emotional
measurement theory implies quantification or proof [16]. A responses with a subset of preset terms. The reactions
more generous definition of measurement means the ability imbue the product with descriptors reflecting the user’s
to assess against criteria or assess progress towards a goal experience of them. Variants on that (admittedly used
or target. In the first section it is argued that HCI predominantly on children) allow a non-verbal expression
evaluation is typically a soft discipline in which persuasive of mood by selection from a range of cards bearing
evidence is used to inform design. This is no less true of alternative facial expressions [8]. The arguable position on
this is that verbalisation or expression of a qualitative state
Permission to make digital or hard copies of all or part of this work for is a translation of the tacit into the explicit. The potential
personal or classroom use is granted without fee provided that copies are flaws in such an exercise are similar to issues in
not made or distributed for profit or commercial advantage and that copies requirements analysis, where expertise that is essentially
bear this notice and the full citation on the first page. To copy otherwise, skill-based and well practised tends to be poorly and
specific permission and/or a fee. inaccurately captured though direct methods and explicit
expression. Similarly, the concern is that a verbal
VUUM2008, June 18, 2008, Reykjavik, Iceland. deconstruction of felt states would be prone to the same sort
86
VUUM 2008
of negative filtering effect, potentially rendering the QUALIA, TACITNESS AND EVALUATION
information unreliable. The qualitative dimension does not Qualia are key phenomena that user experience research
easily lend itself to articulation through language, inevitably deals with are, but the role of qualia in formative
particularly if we are looking for trend data. The selection evaluation enquiry is inextricably linked to causality.
of descriptions for one’s reaction seems to be dependent on Perhaps the goal of a design is to produce a simple non-
shared perception of meanings for the terms used. Where instrumental emotional state (joy, fun?). Perhaps the goal is
multiple data sets are compared with a view to analysing to produce a set of behaviours, and this is a goal that is
trends, the need to make assumptions about shared shared with the user (e.g. persuasive technologies such as
meanings could render any conclusions hazardous. diet assistants). Perhaps the qualitative experience of
Arguably, the greater the complexity of the elicitation (e.g. interaction for the user is not the goal in using the system
multiple possible reaction terms) the greater the risk of bad linked to instrumental goals (e.g. e-banking). Whatever the
data. In turn, the simpler and less explicit approach (the goal of the system is, we want to know what causes quales
Smileycards used in [8]) could be criticised as not (tokens of qualia) and what effect the quales have, in order
expressing enough, although they have the advantage of to engineer better products.
avoiding the linguistic filter. Qualia are broadly referred to as an essentially individual
The notion of proof (over simply persuasion) does not first-person experience, a first person experience that is
appear to be significantly stronger or flimsier in user- essentially unique. So in this sense one can use descriptor
experience evaluation than in traditional HCI evaluation. terms reporting a felt state, but it is not possible to confirm
This sounds an odd claim when we consider something like that your experience is the same as the experience that
an ‘efficiency’ evaluation, for example using the Keystroke others are having. Jackson describes as "...certain features
Level Model [2]. The targets that are set against the quality of the bodily sensations especially, but also of certain
goal are concise and quantifiable. The experiments are perceptual experiences, which no amount of purely physical
rigid and formal, samples carefully selected and potential information includes"[7].
sources of bias carefully removed. What emerges is a A key point about a quale is that it is experienced. For
contingent proof. However, many things have not been something to be experienced presumably it makes little
‘proved’ in such an analysis. The technique involves the sense to say that it is experienced without the subject being
assumption that the operators performing the tasks are aware of it. First-person awareness of it would seem to be
consummate experts and will not make errors. The claim part of its essence. This raises the question of whether
that these contrived findings are useful in assessing their qualia are necessarily what we are referring to when we
real use in real work situations must be put with reason and consider affective aspects of system use. If we are held to
persuasion. Also, it involves the assumption that the the claim that qualia must be experienced, and experienced
identified evaluation criteria (e.g. speed of expert knowingly, then qualia are non-tacit and available to self-
performance) are indeed key to assessing the efficacy of the reporting. But expressions of qualia do not explain key
system. Whether or not those concerned accept or reject cause-and-effect phenomena that are critical to evaluating
the evaluation findings’ validity is dependent on persuasion user-experience. In a concurrent protocol session we can
and the balance of probability. Similarly, evidence from ask a subject “how do you feel now?”. The subject can tell
protocol studies can be used to persuade us about causes of us this but not so easily tell us why. The subject reports
user problems and help specify suitable fixes [3]. Measures awareness of a qualitative state, but the causal reasoning
of performance rate are usually reliant on very strong offered has an inference gap. In some cases (and in service
assumptions about conditions, environment and operator of some evaluation objectives) the inference gap will be
skill. Equally, assessment of experience or the trivial. In a game of Battleships the cathartic moment of
consequences of experience requires investigation, a degree victory is overwhelming likely to be linked to an explicit
of rigour, and ultimately persuasion. Self reporting and report of elation or joy. A sense of unease during an e-
reaction data can fairly reliably denote that a qualitative commerce encounter is rather harder to fathom.
state change is occurring in the subject. These, using the
classification of Hollnagel [6] can be seen as user- What is ironic in this debate is that it would at first seem
experience ‘phenotypes’ indicating a ‘surface level’ event. that the quales in their unextended form are as inaccessible
Inferences about their causality may or may not require a an element of mental phenomena as it is possible to
deeper level of investigation in search of ‘genotypes’, or imagine. However, further analysis of this concept seems
distant causes and consequences of the reported event. The to support the argument that quales (in themselves) are
internal states that are elicited from the user seem to be the more accessible to probing and evaluation than other HCI
key data in user-experience enquiry. The next section evaluation phenomena to which they are relevant. What
considers issues in reliably assessing qualia, articulation of are rather more elusive are the causal relationships, the
qualia, tacitness and causality. elements of an encounter with an artefact that critically
effect experience. Whereas simply reporting how one feels
87
gives an informative account of the felt state, how it was building and maintaining the relationship between customer
caused and what effect it will have on attitudes, disposition and organisation.
and behaviour may not be available to the reporting subject.
Case 1 - Games
This suggests that the essence of this is in the tacit Games and entertainment applications could be seen as
dimension. If this is the case there are implications for its having the goal of a good user experience, of positive felt
characterisation elicitation and assessment. states. Therefore the experience is the key goal, and any
notion of the instrumental is subsumed within the affective.
DEFINITIONS OF TACITNESS
Certain measures relating to engagement, flow, and
The most simple, traditional notion of tacitness is knowing
excitement can be brought to bear in formative assessment
that cannot be made explicit through verbalisation. In
of games. These are terms referring to desirable qualitative
Polyani’s words [10] ‘we know more than we can tell’
states that are caused by using a good game or fun system.
citing examples such as face recognition and bicycle riding
It may be hard however to pair a measure with these
where sophisticated ability to perform contrasts with a
attributes. The cited user experiences are outcomes, and
marked inability to explain. We have natural aptitude for
it is their pairing with design concepts and practice that can
these things that seems to be beyond our conscious
result in those that provides assessable phenomena.
understanding. Work on knowledge acquisition and
Success against the experiential outcome attributes can be
requirements engineering in the 1980’s tended to
assessed summatively, but is not helpful feedback for
characterise tacitness as problematic, a knowledge reef that
design unless it can be paired with more tangible design
must be overcome. A more modern way of thinking is that
concepts, and properties of the artefact. A sense of
some mental phenomena are in essence tacit and the project
challenge in a computer-based game is an outcome,
of making them explicit is fundamentally flawed [14].
experienced by the game player. But could challenge also
Polyani’s assertion is that knowledge is different from
be something that is identifiable in the artefact itself? Are
knowing, knowing being more a process than a state. This
there design techniques in the games domain that support a
seems a useful point. We have tacit skills as described
sense of challenge?
above. Domain experts have knowledge much of which
provides a great deal of skill to elicit for a variety of In a sense game design turns longitudinal design for
reasons. Much of this is described as ‘semi-tacit’ by usability on its head. In most systems the ideal scenario is
Maiden and Rugg [9]. Their examples of semi-tacit that access to the system metaphor is seamless,
knowledge refer to communication failures such as internalisation of system rules of operation facilitated
withholding useful information in elicitation interviews optimally and expertise, once established, exploited and
because they do not sense the value in imparting it. These supported. This reflects the description of knowledge-
types of elicitation problems seem to be different in nature based, rule-based, and skill-based processing using software
from the tacitness issue in user experience evaluation. packages described by Rasmussen [11,12]. Optimal
Whereas Maiden and Rugg describe tacit expertise, tacit usability may (very broadly) be seen as the facilitation of
knowing seems to be more fundamentally about human progress towards skill-based processing, and its support
behaviour and self-awareness. Humans do not track the once arrived at. Conversely, game design tantalisingly
causality of their mental states. Tacit skills, tacit allows its user to reach a level of skill-based competence at
behaviours and tacit processing all seem useful descriptor a particular stage in a game. The skill in good games that
phrases for processes that are key to understanding what is have progressive levels is to incrementally subvert the
truly going on when we form an affective reaction to an expertise of the game-player, taking them to the next level
artefact and the subsequent consequences of that event. If of difficulty. In a sense the (simplified) progression from
tacit skills, process and dispositions are types, tokens of knowledge-based through rule-based to skill-based
them are events. First-person attempts to describe and processing becomes more like a cycle than a simple
account causally for those events are inevitably flawed and progression (n.b. it is acknowledged that this ‘progression’
unreliable. is not quite as neat as this description suggests in real
cases).
VARIANCE IN THE ROLE AND SIGNIFICANCE OF
QUALIA The designers’ manipulation of the game player’s ability to
In this section we look at cameo examples of three deploy cognitive resources seems persuasively to have a
significantly different types of system. In each of these the reasonably tangible link to the qualitative outcome of
qualitative affective dimension plays a different role. In challenge experienced by the game player. Contrast a well-
games it is the main goal of system use. In a personal designed game that maintains the sense of challenge in its
mentoring system it is the ‘coalface’ the problem space with user with one that does not. For example, a golf-playing
which the system is involved for instrumental purposes. In game may have an ‘expertise threshold’ beyond which the
e-banking the affective dimension plays a key role in sense of challenge is insufficient to maintain interest.
Initially the virtual golfer is on a learning curve. The
88
VUUM 2008
delicate skill of lining up and realising the control to play a state of its user, persuading them to keep their discipline
shot takes time to develop. Judgements on how to use and providing encouragement in this, is central to its value.
controls to line up and estimate power and drift on the shot The assessment of the mentoring role appears difficult here.
resemble the real game. The level of challenge seems The affective element is to make the user self aware at key
authentic, like playing an actual game of golf. The aim moments. Also, the sense of having support,
eventually is to resemble the performance of a world-class ‘companionship’, sharing their struggle to resist temptation
golfer. Once too great a level of expertise is reached (e.g. is important. The system variously helps the user
when one has failed if a chip from the fairway fails to go in) recognise, alter and respond to qualitative states. So in a
the sense of challenge is diminished. case such as this what properties or qualities give us targets
buy which to evaluate this device? Are design criteria
So challenge (the outcome) has a connection to some
available that can be usefully linked to experience
tangible design properties fortified by theories of user
outcomes?
processing and actual design features. So there is
something tangible that we can ‘measure’ and evaluate This example seems to contrast with games because the
against the quality attribute of challenge. However, the outcome is defined, longitudinal and instrumental. The
apparent presence of these is not sufficient to guarantee that qualitative experience during use is subsumed within the
an individual game player will sense challenge, or in a instrumental goal of, say, achieving a target weight. The
broader sense, will be satisfied with the experience. Other success of this product in achieving this seems to be in its
factors may cut across what is described. A potential gamer ability to support a complex interplay of emotions, physical
may find the nature of challenge uninteresting, a domain dispositions and critical events. Success of a dietary mentor
that is not engaging or fascinating, and not feel challenge is subject to the unfolding of events in key situations and
for this reason. Conversely, a virtual golfer may derive personal factors, and factors that are often not in the user’s
great satisfaction from being improbably accurate with chip realm of control or awareness.
shots from off the green. What is key is that the experience
Success of a product such as the Diet Companion seems to
attribute links to design criteria. Where a user or users
be in awareness rather than understanding of diversity and
report a lack of challenge in an evaluation exercise, there is
putting products out there that can cater for it. Whilst
an investigative ‘phenotype to genotype’ thread, criteria for
evaluations can help engineer by refining components,
investigating the space of possible redesigns. However,
improving adeptness in individual situations (perhaps
there seems to be a strong connection between the
through shadowing exercises) its effectiveness is difficult to
qualitative sense of challenge and more established
unpack. The complexity of the problem-space and its
usability attributes such as complexity. Controlled
elusively personal nature perhaps does not need deep
progressive introduction of complexity is to some extent
understanding, more a ‘black-box’ knowledge, and
measurable, or at least estimable.
engineering of solutions through feedback from users. As
Case 2 – The diet companion; A Mobile Mentor for with all engineered solutions knowledge is passed on and
dieting feeds formatively into subsequent designs and new
This cameo product example would be in the spirit of products. How should information be presented, should
‘Persuasive Technologies’ [4]. The concept is that the avatars be used, what tone should be used in the alerts? But
mobile device is there with the role of helping an individual predicting the overall suitability of a design for a particular
resist their immediate inclination to eat in service of the user seems elusive. They simply offer choices and
longer term goal of for losing weight. The idea is that the opportunities for candidate users to try.
device informs its user of calorie targets for the day,
provides on-the-spot estimates of the calories of certain Case 3 – E-Banking
items (that the user is contemplating eating). It also keeps a In the case of e-banking the goals and values of participants
continuous record of the accumulated calorie intake for the are separate but linked, and issues of user experience have a
day, issuing warnings when the user is contemplating eating subsumed, instrumental role in it. The aim of the
an unhealthy meal, or is in danger of exceeding a daily organisation is to establish and maintain a trading
limit. Records are kept over time for reference and relationship with the customer. In order to do this they
reminder. The goal of the user in using this system is to need to establish and maintain a trust relationship with
help them to keep to a diet, to lose weight and become potential clients. Previous work [5,13] demonstrates that
healthier. The problem addressed is that short-term urges tangible as well as intangible factors affect the users
or careless food choices cut across this longer term goal. perception of whether or not an organisation is worthy of
The user wants a virtual mentor that helps them with the trust. Therefore an ‘impenetrable’ qualitative dimension at
difficult task of resisting immediate urges and sticking to least partly determines the customer’s willingness to do
the diet. Part of the system’s role is simply information business with an organisation. In the tangible zone, an
provision, but its presence, its ability to affect the mental audit of available features such as the Verisign seal,
declared guarantees and declared commitments such as
89
privacy policies contributes to an evaluation of an e- REFERENCES
banking site’s fitness for purpose. However, assessment of 1. Benedek,J. and Miner, T; Measuring Desirability: New
the affective role of factors such as visual design, methods for evaluating desirability in a usability lab
interactivity, feature layout and clustering is essential to setting, In: Proceedings of Usability Professionals
reliably assert this. These are seen as intangible factors [5] Association, Orlando, (2002), July 8-12.
and a range of personal and cultural factors may influence 2. Card, S, Moran, T., and Newell, A, The keystroke-level
them. Qualitatively the user has perhaps a sense of model for user performance with interactive systems,
confidence, or a sense of unease, suspicion or resentment Communications of the ACM, 23 (1980), 396-210
that they can report. The reasons for that suspicion could in 3. Ericsson, K. A., & Simon, H. A.. Protocol analysis:
some cases be satisfactorily explained by tangible factors, Verbal reports as data. MIT Press, Cambridge, MA.
and directly articulated. (1993)
Whether a prototype design will positively or negatively 4. Fogg B.J., Persuasive Technology: Using Computers to
reinforce trust perceptions and diagnosis of designs that are Change What We Think and Do, Morgan Kaufmann
detected as causing negative trust propagation are key Publishers, (2003); ISBN 1-55860-643-2
evaluation agenda. But the phenomena of interest where 5. French, T. K. Liu, K. and Springett, M. , ‘A Card-
intangible affective factors such as the use of colour, layout, Sorting probe of E-Banking Trust Perceptions’,
style of written content are not so reliably characterised Proceedings HCI 2007, BCS, (2007) ISBN 1-902505-
through direct self reporting. As a consequence, enquiry in 94-8
this area favours techniques such as eye-tracking [13,15] 6. Hollnagel, E. (1998) Cognitive Reliability and Error
and the deployment of indirect probing techniques such as Analysis Method. Oxford: Elsevier Science Ltd
card-sorting [5]. Again an accumulated body of craft 7. Jackson, F., "Epiphenomenal Qualia", Philosophical
knowledge is emerging from evaluation and engineering of Quarterly, (1982) vol. 32, pp. 127–36.
this type of system. 8. MacFarlane, S., Sim, G., Horton, M.(2005), Assessing
Usability and Fun in Educational Software. IDC2005,
CONCLUSION pp.103-109, Boulder, CO, USA
The cameo examples describe the key but essentially causal 9. Maiden, N.A.M., and Rugg, G. ACRE: selecting
role of qualia in assessment of three different types of methods for requirements acquisition, Software
system. In the three cases qualia was variously central to the Engineering Journal (1996) 183–192.
user’s goal, intimately bound to the user’s instrumental goal 10. Polanyi, M. "The Tacit Dimension". First published
and an underlying factor relevant to the users goal. In all Doubleday & Co, 1966. Reprinted Peter Smith,
cases it seems clear that causality around the qualitative Gloucester, Mass, (1983).
episode (pre and post) is of key interest. Despite the 11. Rasmussen, J. Skills, rules, knowledge; signals, signs,
conscious, self-aware nature of qualia, the critical and symbols, and other distinctions in human
knowledge about causality relating to qualia is in the tacit performance models. IEEE Transactions on Systems,
dimension to a greater or lesser degree. This suggests that Man and Cybernetics, (1983). 13, 257-266.
direct approaches to assessment are limited in that they can 12. Rasmussen, J. Mental models and the control of action
only denote an agenda for diagnosis and understanding of in complex environments. In D. Ackermann, D. & M.J.
user experience factors, or at least factors in the experience Tauber (Eds.). Mental Models and Human-Computer
of use. Interaction 1 (pp.41-46). North-Holland: Elsevier
Science Publishers. (1990). ISBN 0-444-88453-X
Evaluation of qualitative user experience seems dependent 13. Riegelsberger, J., Sasse, M.A., and McCarthy, J. Rich
on identifying the role of ‘qualia events’ in the use of an media, poor judgement? A Study of media effects on
artefact, and their relationship to user goals and properties users' trust in expertise. In T. McEwan, J. Gulliksen &.
of the artefact. The measurability of this is dependent on D. Benyon [Eds.]: People and Computers XIX -
the tangibility of the design factors that contribute to them. Proceedings of HCI 2005, Springer, (2005), 267-284.
In games it seems possible to conceive of a link between 14. Senker,J., ‘The Contribution of tacit knowledge to
complexity measures and design that maintains a feeling of innovation’. In AI & Society Volume 7 , Issue 3
challenge. Contrastingly, achieving a sense of trust in e- (1993) Pages: 208 - 224 ISSN:0951-5666
commerce is only partly based on tangible factors. How 15. Sillence, E, Briggs, P. Harris, P. and Fishwick, L., A
trustworthy someone finds a site in a user-trial can be framework for understanding trust factors in web-based
measured, perhaps on a Likert Scale, but assessment (in health advice, International Journal of Human-
support of formative evaluation) of how something like a Computer Studies 64, 8, 2006, 697–713. ISSN:1071-
sense of trust can be engineered is not available to such 5819.
straightforward measurement. 16. Stevens, S.S. On the theory of scales and measurement
1946. Science. 103, 677-680
90
VUUM 2008
Only Figures Matter? – If Measuring Usability and User

Experience in Practice is Insanity or a Necessity
Jan Gulliksen Åsa Cajander Elina Eriksson
Uppsala university, Dept. of Uppsala university, Dept. of Uppsala university, Dept. of
IT/HCI IT/HCI IT/HCI
PO Box 337, PO Box 337, PO Box 337,
75105 Uppsala, Sweden 75105 Uppsala, Sweden 75105 Uppsala, Sweden
[email protected] [email protected] [email protected]
+46 - 70 425 00 86 +46 -70 425 07 86 + 46 70 561 88 58
ABSTRACT measurement as long as there is no common agreement

Measurement of usability and user experience is a method upon what we are measuring. Particularly within COST
much sought after in order to introduce and motivate user- 2949 there has been a strong focus towards arriving at a
centred design activities. Yet, when measurements are unified view on how to define UX (Law, Hvannberg &
available they have little or no influence on forthcoming Hassenzahl, 2006), which mainly shows that there is very
decisions or any impact on future development. What is the little common agreement upon what UX is and even less on
reason for this? how it should be measured. At a Special Interest Group
In this paper we will discuss and question measurement as (SIG) presented at the conference Human Factors in
such, based on previous research and two case studies. In Computing Systems (CHI 2008) the goal was to target a
both organisations thorough methods have been deployed to shared definition of UX. However, this SIG mainly lead to
be able to measure usability and user experiences. Data was a decision to further distribute surveys that had been made
gathered through an interview study in one of the on people’s interpretation and appreciation of the concept.
organisations, and on conversations with two usability Despite this the group developing process standards for
experts in the other organisation. One experience from these human-centred design considers that UX now has matured
cases have been that measurements as such are desired and enough to find its way even into international standards. In
give rise to great expectations, before they have been the revision of ISO 13407 (Human-centred design processes
conducted, but with limited or no impact once the for interactive systems) that in the future will be known as
measurements have been made. ISO 9241-210, the concept of UX have been introduced. At
this moment ISO 9241-210 has only reached a committee
Author Keywords draft (CD) level, meaning that the contents is not yet stable
Usability, user experience, UX, case study, metrics, CIF, and has not undergone an international vote. However,
user-centred design extensive discussions have showed the need for including a
definition of UX into the standard as a pleasant UX may be
ACM Classification Keywords an important additional goal of a human-centred design
H5.5. Information interfaces and presentation process. But, when introducing such a concept into an
international standard it is a requirement that we arrive at a
INTRODUCTION commonly agreed upon definition. The current CD defines
Measuring usability and more recently user experience UX in the following way:
(UX)in a meaningful and valid way has received increasing
interest recently (Law, Roto et al. 2008). This is fair “all aspects of the user's experience when interacting with
enough, but when discussing these issues there is a need to the product, service, environment or facility”
clarify what we intend to measure and how. It is pointless Following the SIG at CHI 2008 the definition might change
discussing and comparing different methods for and not include “when interacting” as the discussion
elaborated on the opportunities of having a UX of
Permission to make digital or hard copies of all or part of this work for something before and after the actual interaction. In
summary, before we have a commonly agreed upon
9
VUUM2008, June 18, 2008, Reykjavik, Iceland. www.cost294.org
91
definition it will be very difficult to reach any consensus on Sauro and Kindlund go even further and suggest a method
to how to measure it. for standardizing all the aspects of usability (according to
ISO) into one single score, that they refer to as SUM
Another relevant discussion is what we intend to achieve by
(summarized usability metric) (Sauro & Kindlund, 2005).
measuring altogether. The focus of this paper is to discuss
These papers, as well as most papers dealing with
the effect of introducing measurements of usability, which
measurement, are neither critical about why measurement is
is exemplified by two different cases in which two
necessary nor do they cite the major publications that
organizations have tried to arrive at usability and UX
criticize quantitative measurement. Rather the tendency is
measurements.
that if you do make claims without justifying it with
Many organizations in Sweden use measurement when quantitative measurements, it is not true science:
assessing and understanding different aspects of their
“Without measurable usability specifications, there is no
organization. These measurements are essential for
way to determine the usability needs of a product, or to
management and include for example Customer
measure whether or not the finished product fulfils those
Satisfaction Index, Productivity Statistics, Employee
needs. If we cannot measure usability, we cannot have
Satisfaction Index and Environmental Index. In this context
usability engineering.” (Good, Spine Whiteside & George,
a measurement of usability might be used to safeguard the
1986)
usability of the systems used in the organization. One might
argue that a measurement could put usability on the agenda, Such a positivistic and objective view of science has been
and that it fits into an organizational culture of objective there for a long time and unfortunately there is a tendency
truths through numbers and measurements. Moreover, it to judge this sort of science to be more mature than
might be a way to overview the problem area like a subjective, qualitative science. This development is not
helicopter scanning large areas of forests looking for fires. new, it can be found even in science from more than 100
The index might locate a fire, but neither understands the years ago:
reasons for the fire nor can it help the organization when
“When you can measure what you are speaking about and
trying to extinguish it.
express it in numbers, you know something about it. But
when you cannot measure it, when you cannot express it in
THEORETICAL BACKGROUND
Measuring usability has been discussed for a long time in numbers, your knowledge is of an unsatisfactory kind: It
the usability engineering community. For example Nielsen may be the beginning of knowledge, but you have scarcely,
& Levy (1994) argue that usability as such cannot be in your thoughts, advanced to the stage of Science.”(Lord
measured but that aspects of usability can. Particularly they Kelvin, 1891)
distinguish two different types of usability metrics that is Another famous quote when it comes to people’s tendency
the subjective preference-related usability problems in to rely on measures and metrics is what is referred to as the
comparison to the objective much more easily quantifiable McNamara Fallacy (for example published in Broadfoot,
performance measurements. Nielsen and Levy showed a 2000). It comes from the American secretary of state,
positive correlation between preference and performance in Robert McNamara who discussed the outcomes of the
this sense. On the other hand Frøkjær, Hertzum and Vietnam war based on body counts:
Hornbæk (2000) contradict this in their study of to what
extent the various components of usability (effectiveness, “The first step is to measure whatever can easily be
efficiency and satisfaction) are correlated. Based on their measured. This is OK as far as it goes.
study they claim that there was no correlation between The second step is to disregard that which can't be easily
effectiveness and efficiency, but that it was difficult to measured or to give it an arbitrary quantitative value. This
judge the potential correlation with the more subjective is artificial and misleading.
satisfaction criteria.
The third step is to presume that what can't be measured
The Common Industry Format (CIF) is an international easily really isn't important. This is blindness.
standard (ISO 25062, 2006) for documenting and reporting
the results of a usability evaluation. It has been developed The fourth step is to say that what can't be easily measured
within ANSI (American National Standards Institute) and really doesn't exist. This is suicide.”
subsequently been proposed as and developed into an
international standard within ISO (International PURPOSE AND JUSTIFICATION
Organization for Standardization). In addition to providing It is still difficult to incorporate the ideas of user-centred
templates for documenting usability problems it also systems design, and usability in practice. It is a complex
provides a procedure for quantitatively grading the severity problem, and our research group has addressed it from
of the findings based on the goals that the user intends to many different angels to make usability a part of systems
achieve. development. In our action research projects we aim at
influencing systems development in practice through close
92
VUUM 2008
collaboration with different organizations. Several of the field, and have developed their products for more than 30
organizations we have cooperated with have expressed a years. The average user of their product is a technically
need for measurements of usability. Hence, in one of the skilled person who uses the device in work related
case studies in this paper (CSN) an IT user index was situations. In Sweden Riley has worked with usability and
developed and introduced, measuring how the user user-centred design for a few years, and they have
perceives the usability of their IT systems and their work employed a few usability experts, that frequently need to be
environment. supported by usability consultants working with design and
development of their products.
The other case study (Riley10) is from a large global product
development company that we have been remotely
METHOD
collaborating with for some years. This company have As a part of our large action research project at CSN we
made a usability assessment using the CIF framework. conducted 36 semi-structured interviews based on an
This paper briefly describes these two case studies interview guide. The interview guide included many
addressing the issue of meaningful measurement of different topics, one of which was related to the usability
usability and user experience. In the paper the cases are index. The following are examples of issues dealing with
described and discussed, and finally some conclusions are the usability index; general opinion of work process when
made concerning the use and efficacy of usability developing the usability index, expectations compared to
measurements in practice. results, future use, benefits for the development of usability
in the organization, utility of the results in the organization,
ORGANISATIONAL SETTING OF OUR CASE STUDIES and organizational belongings. However, questions were
adapted in accordance with the organizational role of each
CSN
informant, and their background. The interviews were
The Swedish National Board of Student Aid (CSN) handles
mostly conducted on site and each lasted for about one
financial aids for students. With offices in 13 cities in
hour. Most of the interviews were conducted by two
Sweden, CSN employs more than 1100 employees of
researchers interviewing one person. We took detailed notes
whom 350 are situated in Sundsvall at headquarters.
on paper and the interviews were audio recorded.
Officially, all IT projects are run according to methods
provided by the development department, and the IT In the case study at Riley we have interviewed one usability
architecture department. These include methods for designer working at the company, and the external usability
acquisition, project management and systems development. expert who was the person working with the key
In short, these methods provide a common framework, with performance index in all phases of the project. These
descriptions of milestones, decision points as well as interviews were informal and conducted by one of the
templates and role descriptions. authors. Detailed notes were taken during the interviews.
The interviews were conducted on site and over the phone
The organization of CSN, as many other organizations, has
and lasted for about an hour, each.
been evolving the last ten years towards an organization
that is focused on developing work process and the IT- After the interviews, the data was analyzed using mind
systems supporting these work processes and the maps where main themes and ideas were highlighted.
organization is structured accordingly. The system Finally, the results were discussed in relation to other
development begins at the development department, where findings from the participatory observations, and hence put
pre-studies and the acquisition process are conducted. The in a wider organizational context.
development department initiates development projects and
system developers from the IT-department staff the A case study is always contextual, and consequently it is
projects. Our research group has had a close collaboration not possible to generalize findings from case studies to
with the organization in an action research project that has other contexts. However, even though these studies are case
run for three years. studies, the organization and the findings are not unique,
nor unusual, and therefore we hope that the results will
Riley Systems, Inc. contribute to a deepened understanding of the use of
Riley Systems Inc is the global leader of product usability measurements in general.
development of a technical device used in a wide range of
RESULTS
industrial, commercial and governmental activities. The
company also develops a series of PC software applications CSN
supposed to support the use of the products. With customers Our research group developed a web based usability index
in more than 60 countries they are seen as pioneers of their method at CSN that resulted in measurements of usability
and UX on three different occasions (Kavathatzopoulos,
2006, 2008a, 2008b). During a trial period the questionnaire
10
This name is an alias due to reasons of anonymity
93
gradually improved in which questions were clarified and “When you fill in the questionnaire, you do it from how
some even deleted. things are at the moment. The workload for a case handler
is distributed unevenly over the year, with peaks and slow
The main purpose of the introduction of a usability index in
periods. And if you fill in the questionnaire in May, [which
the organisation was to measure usability and find usability
a slow period], then they will get a much better result than
problems in their computer systems. However, it was not
if you fill in the questionnaire in January [which is a peak].
clearly stated from the beginning which department that
And the questionnaire has always been distributed in May
should be responsible for it. Finally it was decided that the
or November [both slow periods] in order for the results to
usability index should be incorporated into the existing
be as good as possible. At least I have felt it that way.”
Employee Satisfaction Index (ESI) that is distributed by the
Human Resource department twice a year to all employees
Riley Systems, Inc.
at CSN. This means that the usability index needed to be The purpose of the case at Riley was to establish a key
shortened into only a few questions. performance index for the field of usability in order to be
The interviews that were conducted at the end of the able to assess the effects of usability activities. The
collaboration with the organization show that upper organization had previously experiences with using key
management highly support the usability index, both in its performance indexes from electronics, mechanics, project
longer form and the shortened version that would be management and so forth and therefore wanted to treat the
incorporated into the ESI. However, upper management usability initiatives in the same way. The purpose was not
states that they have no detailed knowledge about the to gather knowledge of the level of usability of the
questionnaire, but that they feel confident in the tool. developed product, nor was it to gather information useful
Overall the interviewees who did have knowledge about the for improving the usability of the product. However, this
usability index where positive, although the expectations should not be interpreted as if they are not interested in
varied. Some believed that the usability index could doing so, but rather seen as a positive bi-effect of the
measure a general level of usability in organizations; others assessment of the performance of the organisational unit.
believed that it could be an important tool in the systems A usability designer was hired as a consultant to perform
development process. However, at the moment, with the the assessment of the key performance index for usability,
incorporation of the short version in ESI, there is no clear particularly targeting a product that was just about to be
indication on how the results could be used in the systems launched onto the market. He produced a form based on the
development process. The Human Resource department CIF and set levels of goal fulfilment and identified tasks the
will initially interpret the results from the questionnaire, but users were supposed to perform. Moreover, he was in
those results that could be used in system development charge of user tests and authored an evaluation report.
should be shared with the Business Development
department. However, in the interviews, the manager for the The first report should be considered as a baseline, which
Business Development department was not aware that the means that they really cannot say much about any effects in
results from the questionnaire should be taken care of by the organisation until they have made a follow up study
her department. later. However, according to the interviewee there are
difficulties concerning dependence associated with the
What is further interesting is that the part of the measurement, for example the person who is measuring,
questionnaire that was mostly appreciated by the who the users are, status of the product on the market, etc.
interviewees concerning usage in the system development This occurs despite the fact that the goal of the CIF is to
process was the open-ended questions at the end of the provide a methodology that is independent of whoever is in
questionnaire. These open-ended questions generated an charge.
enormous amount of improvement suggestions that the
organizations were rather unprepared for at first. However, The usability designer, who is a senior expert with many
it is not clear whether these open-ended questions will be years in the field, experienced great difficulties in finding
kept in the shorter version of the questionnaire, although appropriate tasks that expressed something about the use of
these where the most appreciated by both developers and the system, and also to define appropriate levels for each
case handlers. tasks. He confronted renowned experts on CIF about these
issues and was told that he should be pragmatic and not be
Furthermore, the ESI is distributed twice a year, when the too rigorous when defining the tasks –“it doesn't need to be
workload is at the lowest for the case handlers. This might that thorough”.
influence the results of the questionnaire, since the case
handlers are not working with all parts of the internal The usability field at Riley has experienced some problems
systems in slow periods. As one of the interviewees lately, and it may be because of the difficulties they
commented: experience in showing real evidence of the benefits of their
work. The organisation lacks the ability to see the benefits
beyond what can be quantitatively measured.
94
VUUM 2008
DISCUSSION When do we measure?

One idea is that if usability can be measured on a regular If the goal from within the organization is to show good
basis, the IT department must take usability into account in results it will clearly influence the timing of the
their development process. The underlying assumption here measurement, hence the data becomes even more
seems to be that by expressing usability in numbers, it will unrealistic. In one of the cases we saw a conscious effort to
make things visible and these numbers represent some kind making measurement when it was likely to receive the best
of objective truth about the IT systems. Frequent possible results. This is clearly misleading. The workload
complaints, suggestions for improvements, reported will obviously influence the users’ perception of the system
problems when using the IT systems are seen as subjective and the work situation differently if working under stress.
and do not seem to represent the same kind of truth.
Why do we want to measure?
Usability comprises many different aspects, some of which Measures may often be considered as eternal truths. Many
cannot be easily measured and turned into numbers. people do not believe something until it is confirmed by
Specifying usability and UX as a set of well-defined numerical figures. Therefore numerical measures are
parameters complies with the need for formal attributed heavy weight in strategic argumentation.
representations in the development projects, but obscures
the complex aspects. The difficulties with addressing Decision makers often want measures to base their
aspects, such as numbers, in no way make them less decisions upon, and figures can be the door opener to
important. The question is how to deal with them in an receiving budgets for certain activities. This seems to be the
organization that focuses to that extent on measurement and case regardless of weather or not the measures are realistic
metrics. hypothetical, or even meaningful. In some situations
measurement can even be what constitutes the basis for the
What conclusions can be made from the cases? existence of a particular line of work. And even if this is a
In both cases there was an outspoken request for some form great risk, it may also be a good opportunity. But it is
of metrics to be able to assess the state of the computerised important with an awareness of the limitations that the
work situation or the state of the commercial product. But, measurement induces.
once the usability index was in place the organizations did
not pay much attention to it. The reasons for this may be What are the consequences of the measures?
many. First of all they experienced difficulties knowing Once there was a numerical assessment of some sort the
how to interpret the results and what conclusions to focus of discussion tended to be on the measurement as
draw.Second, it was problematic to turn the acquired results such, ignoring or forgetting the complexity behind the work
into action items that could be dealt with in the upcoming situation in which the measurement was acquired. It is
organisational development.Third, it turned out to be much easier to discuss and trust a numerical value as such,
challenging from an organisational point of view to judge and even more so a change in value from consecutive
whose responsibility it is to deal with whatever comes out measures, compared to investigating the reason for a
of the study. particular measure and what to do about this situation.
What do we measure? CONCLUSION

It is difficult to measure since it is very difficult to come up One might read this paper as arguing against any form of
with questions that really target the crucial aspects that you measurement, but that is not how we intended it to come
do want to highlight. It turns out that the things people tend out. Measurements may be very useful and valid in great
to measure are those aspects that are easily measurable. But, many situations, but once you do measure things you need
do we measure a function, a unit of work or an to be aware of the consequences these measures have. One
organisation? Most often measurement is restricted to an thing to do then is to improve the measurement methods to
application or even to a specific function. That means that provide a better coverage of the aspects we want to assess.
the measurements do not capture and holistic view One important part of increasing the credibility of
whatsoever. Most workers in an organisation deal with a measurement is to clarify what the method measures and
large number of applications simultaneously, and the how these measures should be interpreted. For example, if
usability problems that people experience often occur as a one intends to measure user experience (UX) it needs to
consequence of bad synchronization between the have a commonly agreed upon definition. Otherwise the
applications rather than within them. Only measuring results obtained may be considered nonsensical.
aspects within one application do not lead to any good tool
Most measurement provides far from the whole truth about
for managers and/or organizations to base their decision
a situation and the results must be dealt with caution.
making on. And vice versa, a measurement of an
Usability provides so many non-measureable aspects that
organization is also problematic since it is difficult to know
enforcing measurement upon usability in an organisation
which system or combination of systems that are
troublesome.
95
may risk comparing with other disciplines that do not have 5. Kavathatzopoulos, I. (2006): AvI-enkäten.
the same reliance upon qualitative effects. Ettverktygförattmätaanvändbarhet, stress ochnyttaav IT-
stöd (in Swedish – The AVI survey – a tool for
If the goal is to manage to achieve the highest possible level
measuring usability stress and usefulness of an IT
of usability in a system in an organisation, we cannot really
support system). Report No. 2006-050). Uppsala
see any evidence that measurement techniques and metrics
universitet: InstitutionenförInformationsteknologi.
actually have such effects in the organisations we have
encountered. Therefore organisations may consider weather 6. Kavathatzopoulos, I. (2007): Usability Index. In A.
it is worthwhile investing in measurement techniques over Toomingas, A. Lantz & T. Berns (Eds.) Work with
dedicated usability improvement efforts, such as applying a computing systems (p. 160). Stockholm: Royal Institute
user-centred iterative process of contextual analysis and of Technology and National Institute of Working Life.
system redesign based on formative evaluations. 7. Kavathatzopoulos, I. (2008a):
Ettförbättratverktygförmätningavanvändbarhet, stress
ACKNOWLEDGMENTS ochnytta: Andraförsöketinom CSN.(in Swedish – An
We thank the usability designers and the interview subjects improved tool for measuring usability stress and
that have taken part in our interviews for their contribution. usefulness: The 2nd attempt within CSN) (Report No.
A particular thank you to BengtGöransson who read and 2008-003). Uppsala universitet:
provided comments on the manuscript based on his InstitutionenförInformationsteknologi.
experiences of measurement in practice.
8. Kavathatzopoulos, I. (2008b): AvI-index:
REFERENCES Ettverktygföranvändbar IT, nyttaocharbetsmiljö. (in
1. Broadfoot, P. (2000) Assessment and intuition. In: T. Swedish – The AVI index – a tool for usable IT,
Atkinson and G. Claxton, Editors, The intuitive usefulness and work environment) (Manuscript).
practitioner: on the value of not always knowing what Uppsala universitet:
one is doing, Open University Press, Buckingham, pp. InstitutionenförInformationsteknologi.
199–219.5 9. Law, E., Roto, V., et al. (2008). Towards a shared
2. Frøkjær, E.,Hertzum, M., &Hornbæk, K. (2000). definition of user experience. CHI '08 extended
Measuring usability: are effectiveness, efficiency, and abstracts on Human factors in computing systems.
satisfaction really correlated? Proceedings of the Florence, Italy, ACM
SIGCHI conference on Human factors in computing 10. Law, E., Hvannberg, E. &Hassenzahl, M. (Eds.) (2006)
systems, p.345-352, April 01-06, 2000, The Hague, The UX 2006 : 2nd International Workshop on User
Netherlands. eXperience. 14 October 2006, Olso, Norway. Held in
3. Good, M., Spine, T.M., Whiteside, J., George, P. (1986) conjunction with NordiCHI 2006.
User-derivedimpact analysis as a tool for usability 11. Nielsen, J. and Levy, J. (1994) Measuring usability:
engineering. In: Proceedings of the SIGCHI Conference Preference vs. performance, Communications of the
on Human factors in computing systems ACM 37, 4, 66-75.
4. ISO/IEC 25062 (2006). Software Engineering -Software 12. Sauro, J. & Kindlund E. (2005) “A Method to
product Quality Requirements and Evaluation Standardize Usability Metrics into a Single Score.” in
(SQuaRE)-Common Industry Format (CIF) for Proceedings of the Conference in Human Factors in
Usability Test Reports. Computing Systems (CHI 2005) Portland, OR.
96
VUUM 2008
Measuring the User Experience of a Task Oriented Software

Jonheidur Isleifsdottir Marta Larusdottir
deCODE genetics Reykjavik University
Sturlugötu 8, 101 Reykjavik Kringlan 1, 103 Reykjavik
ABSTRACT beauty and usability related? What contributes to the

In this paper a study on a web based tool is described that is goodness and beauty of products? What effects how the
used to keep track of attendance and work schedules by users summarize experiences during usability evaluations?
employees and managers in large companies. Ten users [5] To address some of these problems Hassenzahl [2] has
particpated in a think aloud test measuring the usability of a proposed a model of user experience that divides the
new version of the software and the user experience was attributes of a product into pragmatic and hedonic
measured before and after each user test. attributes. Based on this model he has made the AttrakDiff
The user experience results show that the group of 2 [4] questionnaire that can be used to measure the user’s
questions that measure the personal growth of the user got experience of these different attributes in the product. That
the lowest scores for this product, but pragmatic attributes, questionnaire has been translated to Icelandic by Marta
hedonic identification and attraction got much higher Larusdottir.
scores. It is not surprising that pragmatic issues get high The goal of this study is to measure the user experience of
scores for a task-oriented software like this one, but it is an users participating in a typical think aloud test of a task-
interesting result that the users value highly the attraction oriented software by using the Icelandic version of
and hedonic identification. AttrakDiff 2 questionnaire. Our goal was to compare the
measurements of the expectations to the tool and the user
Author Keywords experience measured just after taking part in a think aloud
User experience, usability, think-aloud, user testing. test. Secondly we wanted to analyse the Icelandic
translation of the questionnaire.
INTRODUCTION BACKGROUND
In the past decade user experience has become a popular In this section it is explained what is meant by User
field of study within the field of Human Computer Experience, a clarification of the different concepts that lie
Interaction. It challenges the past notion that task related behind the AttrakDiff 2 questionnaire is given and some of
features are the only ones that contribute to usability. User the studies that have used it to measure what attributes to
experience focuses on the user entire experience when attractiveness of products are described.
using a software product, not just the ISO 9241-11 factors,
effectiveness, efficiency and satisfaction. User experience User Experience
introduces new concepts to the quality of software like fun, User Experience is a relatively new field within the larger
beauty and pleasure. scope of Human Computer Interaction. It proposes a more
holistic view of the user’s experience when using a product
The problem with the concept of user experience is how then is usually taken in the evaluation of usability [2]. Until
vague it is and that it can be interpreted in many ways. now usability evaluations have primarily focused on task-
More empirical results are needed to define the concept related issues such as efficiency and effectiveness. Stating
clearly [6] and these will only be obtained by making that a product that is efficient and effective in allowing the
models and using them to measure user experience for user to solve the tasks needed, to fulfill the user’s goals,
different products [3]. There are many unanswered makes the user satisfied [6].
questions that need to be addressed. What is beauty? Are
But is it enough to have a satisfied user? The researchers
leaning toward UX say the answer is no [6]. The user needs
personal or classroom use is granted without fee provided that copies are to experience more that satisfaction with the product for it
not made or distributed for profit or commercial advantage and that copies to be marketable. Hassenzahl [2] proposes a model for the
bear this notice and the full citation on the first page. To copy otherwise, different attributes a product can have and make up a
product character. He states that a product has both
pragmatic and hedonic attributes. The former being the task
VUUM2008, June 18, 2008, Reykjavik, Iceland. related attributes we are used to from the classic usability
literature and the later emphasizing the users well being
97
while using the product. Hassenzahl also introduces in the Attraction
same chapter three different classes of hedonic attributes; When we talk about something as being attractive to us, we
stimulation, identification and evocation. Later Hasssenzahl are usually summarizing the whole experience of the
decided to drop the evocation class from the model and that product. We judge the product as a whole and use words
is not included in AttrakDiff 2. like good, bad, beautiful and ugly to describe things. In
AttrakDiff 2, attraction is used to measure the global appeal
AttrakDiff 2 of a product and to see how the other attribute affect this
AttrakDiff 2 [4] is a questionnaire that measures hedonic global judgment [3].
stimulation and identity and pragmatic qualities of software
products [1]. The questionnaire was originally made in Scale Examples
German but has been translated to English. AttrakDiﬀ 2 has The Pragmatic Quality (PQ) scale has seven items each
four, seven anchor scales, in total 28 questions. These will with bipolar anchors that measure the pragmatic qualities of
be described in more details in the following. the product. This includes anchors such as Technical-
Human, Complicated-Simple, Confusing-Clear. The
Pragmatic Manipulation Hedonic Quality Identification (HQI) and Stimulation
Pragmatic attributes are the ones that are associated with (HQS) scales also have seven anchors each. HQI has
how easy the user finds it to manipulate the environment, in anchors like Isolating-Integrating, Gaudy-Classy, Cheap-
this case the product or software. It is what makes us able to Valuable. HQS has anchors like Typical-Original, Cautious
fulfill our goals and what we have until now talked about as Courageous and Easy-Challenging. AttrakDiff 2 also has a
usability. If we think pragmatically the only requirement seven item anchor scale for overall appeal or attraction
from a product to squeeze juice from an orange is that it (ATT) with anchors like ugly-beautiful and bad-good. The
actually squeezes the juice from the orange and that we can anchors are presented on opposite sides of a seven point
find out how to use it on our own. There is no beauty or likert scale, ranging from -3 to 3, where zero represents the
design needed to make a product pragmatic. neutral value between the two anchors of the scale.
Hedonic Stimulation Related Work

The attributes connected to hedonic stimulation are the ones Hassenzahl used the AttrakDiff 2 questionnaire to study
that encourage personal growth of the user. People want to four different mp3 player skins [3]. He used the
develop their skills and knowledge further and these are the questionnaire to explore the effect of pragmatic and hedonic
attributes of the product that allow for that to happen. As an qualities on beauty and goodness. The four different skins
example Hassenzahl provides unused features of a software had been pre tested and judged ugly or beautiful; two skins
[2]. Those features that the user does not yet use are not a were judged beautiful and two ugly. In the first study 33
part of the pragmatic experience but are rather perceived as students were asked to look at each skin and fill out an
hedonic as they provide stimulation for further AttrakDiff 2 questionnaire for each of the skins without
development. Stimulation can also be provided by using them. Two 2x3 ANOVAs were performed on the
presenting things in a novel way or by a new interaction result data, one for the beautiful skins and another for the
style. ugly ones. The ANOVAs had skin and attribute group (PQ,
HQS, HQI) as within subject factors and score as dependent
Hedonic Identification variable. This revealed that there was one skin in each
Attributes connected to hedonic identity are the ones that group that was significantly more stimulating than the other
make us identify with the product in a social context. What and the other was thought to be significantly more
message are we communicating to other socially by using pragmatic. Other results were that the identity
this product? These attributes are connected to the fact that communication factors, HQI were the factors that had the
all persons communicate their identity through things they highest correlation with the beauty rating, i.e. the HQI
use and own. An example of this would be a personal scores were significantly higher for the more beautiful
website where you can communicate who you are to the skins.
outside world. If a product communicates what we think to A study on the influence of hedonic quality on
be advantageous to others we might prefer that product. attractiveness was done by Schrepp, Held and Laugwitz [7].
Ipods would be a good example of a product that They sent out e-mails to students and asked them to look at
communicates a strong identity. There are several other three different interfaces of business management software.
mp3 players on the market that work the same but the brand Around 90 people responded and the response rate was
name is so strongly connected to the product and its 34%. AttrakDiff 2 was used to measure the user experience
coolness that everyone has to have an ipod. after the user had looked at 11 different screenshots, with an
explanatory text before it, of the interface executing a part
of a business scenario. Schrepp et al. expected, since they
were testing business software that are in their nature meant
98
VUUM 2008
to support people in their work, that pragmatic qualities by one of the developers of the user interface that has good
would have greater influence on attractiveness than hedonic connections to the users.
qualities. What they found on the other hand was that both
HQI and PQ contributed evenly to the attraction and HQS
also contributed significantly.
They also found, as they expected, that more attractive
interfaces were preferred over the less attractive ones.
MATERIALS AND METHODS

In this chapter the study using the newly translated
AttrakDiff 2 questionnaire is described, first the tool which
was evaluated, then the usability tests, the participants and
the measurements in the study.
The tool - Workhour

Usability tests were conducted on a new version of software
called Workhour (Vinnustund in Icelandic) [8]. It is
designed by the Icelandic software company Skyrr, see Figure 1. The new version of the software Workhour
figure 1. An old version had been in use for serveral years,
but in the new version the user interface was changed The user tests were conducted at their ordinary working
exstensively. There are four main user groups of Workhour; place, so a lot of contextual information was also gained.
ordinary users that work on shifts and those that work Two usability specialists conducted the tests; one was the
regular hours. The other two main user groups are managers organiser and one the data recorder. Additionally
that work on shifts and those that don´t. everything that was said was recorded on tape.
The main tasks for ordinary users working on shifts is to The AttrakDiff 2 questionnaire was administered before and
check there monthly plan for shifts, ask for a day off and after the think aloud test. First the participants were asked
check if they have fullfilled all their work obligations for to answer the questionnaire according to their expectations
that month. The main tasks for regular users are asking for to the new version of Workhour they would be trying in a
holidays and check if they have been too many hours off minute. The users are all familiar with older versions of
work. The Workhour system is very useful to managers, Workhour and were chosen to be in the study as typical
because they can do much of their organizing work in users of the system. After the think aloud test was finished
Workhour like check if all timestamps for their employees the user filled in the questionnaire again and now the
are correct, insert information about an employee that is participants were asked to base their answers on the
sick and get an overview of how many have been sick over experience of using Workhour to solve the given tasks. The
a particular period to name a few. reason for measuring both the expectations and the
experience is that user experience a subjective factor that
Participants changes over time and we were interested to see what affect
Ten individuals participated in the study, eight women and actually using the tool would have on the measurements of
two men. Five of the participants were categorized as the user experience for the tool.
managers and the other five as ordinary users. The
participants are employees of Landspitali – University Unlike studies done by Hassenzahl on skins for mp3 players
Hospital and Financial Management Authority in Iceland, [3], our study had participants that are actual users of the
and were divided into two groups but all of them use software being evaluated. The tasks they performed were
Workhour as a part of their workday. One of the tasks they know and perform every day. The expectation of
participants only filled out a very small portion of the pre- the user when answering the AttrakDiff 2 questionnaire
use questionnaire and none of the post-use so we did not before use was purely based on what the user expected of
use any data from him in the results. the new version, because most users had not seen the new
version before answering the questionnaire.
The usability tests
The ten usability tests were conducted by two usability Measurements
specialists on the new version of Workhour running on a The AttrakDiff 2 questionnaire was translated from English
test database two months before it was installed. to Icelandic by Marta Larusdottir and this was the first time
it was used in the Icelandic format. There was one set of
Each user solved six or seven tasks in think aloud tests anchors that were left out when the questionnaire was
which were adjusted to their ordinary tasks. The total translated. It was the HQI item with anchors Alienating-
number of tasks in the study was 17. The tasks were made
99
Integrating. The translator was not able to find a suitable statistically significant (This might be due to the small
translation that differed from the translation of another pair sample we had). We still believe it is relevant because it is
of anchors. So the HQI only had six items. found in all categories. We think that one reason for this is
that users are optimistic that a new version is somehow
The internal consistency of the HQI, HQS, PQ and ATT
better than the old and therefore have high expectations to
scores was measured both before use and after use and
its hedonic and pragmatic qualities, that are lowered when
unfortunately it was not as high as previously measured by
the product is used but not considerably. Participants might
Hassenzahl [3]. Cronbach’s α on the pooled values for the
move their scoring one point closer to the middle.
different scales before use was: PQ, α = .58; HQI, α = .57;
HQS, α = .42; ATT, α = .43. These values for alpha are
rather low and that indicates that there is not a high
correlation between the answers with in each group.
Usually the criteria of internal validity wanted from
questionnaires is an α > .70 [3]. After use the scales internal
validity was higher in three cases: PQ, α = .86; HQS, α =
.55; ATT, α = .70. In the HQI, α = .46 scale the validity was
lower than before. Both PQ and ATT were over the, α > .7
mark when measured after use, which is good but the
internal validity of HQS is still too low.
RESULTS
In the following chapter the results will be described, first Figure 2. Mean scores for each scale of AttrakDiff 2
on the user experience and then the analysis of the
translation of AttrakDiff 2.0 to Icelandic is described. Moreover, even if the scores moved closer to zero they
were over all pretty good compared with both the mp3 skin
The user experience study and the business interface study mentioned earlier.
We calculated the mean score of all user answers for each
In Figure 2 we can see that HQS gives the lowest score.
quality scale (each scale has 7 questions). As mentioned
That category has a mean that is 1-1.5 lower that the other
earlier each answer gets a value from -3 to 3, with zero as
means.
the neutral value between the anchors of the question.
This indicates that the software has less hedonic stimulation
As can be seen in Figure 2 all the quality scales have means
qualities than identification and pragmatic ones. This is an
above zero both before and after use of Workhour. The
interesting result.
post-use line is also always beneath the pre-use line. A
paired T-test that compared the pre and post-use scores for We think this is very understandable since we doubt that
each participant and each category showed that difference stimulation was one of the design goals of Workhour. It is
in mean scores pre and post-use was significant in software that is used to check on work schedules and to
HQS(meanDiff = .67,t = 3.56, p = .007) and in punch in and out of work. We think that the stimulation
ATT(meanDiff = .49,t = 3.43, p = .009) but not in factor is more important when designing software for
HQI(meanDiff = .35,t = 1.62, p = .115) and PQ(meanDiff = creative work rather than support software that most user
.34,t = 1.78, p = .145). It is also noticeable that HQS score use only for a short period of time each day and mainly just
means are much lower than the other means both pre- and have to trust that it works.
post use.
Translation of AttrakDiff 2
Since the internal validity of the scales was not very high As stated before the study above was also done to test how
we decided not to do any further analysis of the effects of the Icelandic translation of the questionnaire was working.
the different qualities (HQI, HQS and PQ) on ATT as was In the description of the variables and measurements above
done in other studies using AttrakDiff 2. One idea was also we showed that the internal validity of the scales was rather
to test the correlation between different scales. At the poor. Since other studies have shown much better validity
AttrakDiff website there is a confidence square diagram scores our first thought was to look closer at how the scores
made from the data which we did not attempt. There online were for each item on the scale. We started with the HQS
experiments can be set up in German and in English [4]. scale because that gave the lowest internal validity in both
the pre- and post-use study and also a much lower mean
Discussion
score than the others.
It is interesting to see that the post-use line (Experience) in
Figure 2 is below the pre-use line (Expectations) and In Figure 3 we see that the third and sixth items of HQS
consistently so. Even though the difference is not give a very different mean than the others in this group.
100
VUUM 2008
This indicates that these items are not measuring effect that It was surprising though how high the scores were in HQI
is similar to the others. The third HQS item has the word considering that identification was probably not a part of
anchors bold - cautious where markings closer to bold give the design. The good score in PQ, HQI and ATT is pleasant
a higher score. The translation was ótraust - traust where to see because that indicates that there is overall happiness
ótraust gave a higher score. This is a somewhat misleading with the software product.
translation, that indicates that the software is not reliable
It is dangerous to draw conclusions about the relationship of
and secure enough. A better translation would be djarft-
hedonic qualities and usability and goodness from the
varfærnislegt. The only problem is that varfærnislegt is
studies that have been done at present. A great deal of
already used in item four in HQS and we propose that
software is used every day for extended periods of time and
íhaldsamt will be used there instead.
not recreationally. It is questionable, if that kind of software
would follow the same patterns as the mp3 player skins in
Hassenzahl’s study, and what about real users? Even though
the people in his study probably use some kind of mp3
players frequently we do not know whether they are to be
considered “real” users.
Further empirical studies are needed to be able to draw any
conclusions about user experience and how it is affected.
Hassenzahl’s model is a step in the right direction and
hopefully we will see a great increase in studies using that
or similar models to evaluate user experience with software
and other products. It is also our hope that translating the
AttrakDiff 2 questionnaire to Icelandic will inspire more
people to use it alone or as an addition to other user testing
Figure 3. Scores of HQS: Pre- and post-use
to gain more knowledge about what factors contribute to
The sixth HQS item has the word anchors undemanding - how attractive our software is to the user.
challenging where markings closer to challenging give a
higher score. The translation was auðvelt-krefjandi. We ACKNOWLEGDEMENTS
don’t think this translation can be improved and the reasons We want to thank the people at Skyrr for their very good
for the low score are most likely that users don’t want co-operation in this project, the users, the developers and
software of this type to be challenging. the managers.
There was also one item missing from the translation. That
REFERENCES
was the fifth item in HQI that has the anchors alienating- 1. Hall, M., & Straub, K. (2005, October).
integrating. The Icelandic translation for integrating was in 2. Ui design newsletter. Internet Resource
use in the first HQI item. http://www.humanfactors.com/downloads/oct05.asp.
Our suggestions are that we change item HQI1 to (Retrieved March 24, 2007)
einangrandi-tengjandi and item HQI5 which was missing to 3. Hassenzahl, M. (2003). The thing and I: Understanding
fráhverft-sameinandi. the relationship between user and product. In M. A.
Blyth, A. F. Monk, K. Overbeeke, & P. C. Wright
The other scales did not seem to be suffering from the same (Eds.), Funology: From usability to enjoyment, 1-12
problems as the HQS scale. It is therefore our belief that if (chap. 3). Kluwer Academic Publishers.
those changes are made the internal consistency of the
scales would improve. If this does not happen it would raise 4. Hassenzahl, M. (2004). The interplay of beauty,
the question whether the HQS scale simply does not apply goodness, and usability in interactive products. Human-
in the same way as the other scales when measuring the Computer Interaction, 19, 319-349.
user experience of very practical software. This is certainly 5. Hassenzahl, M. (2007). AttrakDiff(tm). Internet
a point worth studying further. Resource http://www.attrakdiﬀ.de.
CONCLUSION 6. Hassenzahl, M., & Sandweg, N. (2004). From mental

It was very interesting to see the scores for the different effort to perceived usability: transforming experiences
attributes of the AttrakDiff 2 Questionnaire on Workhour. It into summary assessments. In Chi ’04: Chi ’04 extended
seems that hedonic stimulation is the least important factor abstracts on human factors in computing systems (p.
in such software or at least what the participants thought 1283-1286). New York, NY, USA: ACM Press.
deserved the lowest score.
101
7. Hassenzahl, M., & Tractinsky, N. (2006, March-April). interfaces of business management software. Interacting
User experience - a research agenda. Behavior & with Computers 18 (5), 1055–1069.
Information Technology, 25, 91-97. 9. Skyrr. (2007). Workhour. Internet Resource
8. Schrepp, M., Held, T., & Laugwitz, B. (2006). The http://www.skyrr.is/vorur/vinnustund/.
influence of hedonic quality on the attractiveness of user
102
European Science Foundation provides and manages the scientific and technical secretariat for COST
COST is supported by the EU RTD Framework Programme
View publication stats

Valid Useful UX Measurement

Uploaded by

Copyright:

Available Formats

Valid Useful UX Measurement

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Valid Useful UX Measurement

Uploaded by

Copyright:

Available Formats

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

Valid Useful User Experience Measurement

Book · January 2008

Georgios Christou Mark Springett

SEE PROFILE SEE PROFILE

A Multi-Dimensional Data Model for Personal Photo Browsing View project

SPARKS - Rethinking Innovation. Together. View project

The user has requested enhancement of the downloaded file.

Reykjavik, Iceland y June 18th 2008

Effie L-C. Law

Name Affiliation Country

Classifying and Selecting UX and Usability Measures pp. 13-18

Towards Practical User Experience Evaluation Methods pp. 19-22

Exploring User Experience Measurement Needs pp. 23-26

On Measuring Usability of Mobile Applications pp. 38-44

Use Experience in Systems Usability Approach pp. 45-48

A Two-Level Approach for Determining Measurable Usability Targets pp. 56-59

What Worth Measuring Is pp. 60-66

Comparing UX Measurements, a case study pp. 72-78

Evaluating Migratory User Interfaces pp. 79-85

Measuring the User Experience of a Task Oriented Software pp. 97-102

Meaningful Measures: Valid Useful User Experience

ABSTRACT Lord Kelvin’s dictum on measurement is frequently quoted

Developing Usability Methods for Multimodal Systems:

Ratings on Scales Measuring Global Usability

Classifying and Selecting UX and Usability Measures

VUUM2008, June 18, 2008, Reykjavik, Iceland.

Evaluation of Data from Usage of an Existing System

Quality UX Functionality User Learnability Accessibility Safety

Table 2. Factors contributing to system usability and UX

WHAT SHOULD BE MEASURED? CONCLUSIONS

5. Are baseline measures needed to establish REFERENCES

Towards Practical User Experience Evaluation Methods

ACM Classification Keywords Research, Companies,

Exploring User Experience Measurement Needs

ABSTRACT behavioural, and reflective level (Norman 2003), and Nokia

What should be Examples of measures What should be Examples of measures

What should be Examples of measures Table 6. Measurement areas in localization

Combining Quantitative and Qualitative Data

CONCLUSION AND SUMMARY

quality assurance needs to be established by recognizing http://www.usabilitynet.org/trump/documents/Suschapt

Author Keywords These agreed criteria however need to be based on deep

Source System Number of Evaluation Metrics Used

Table 1. CHI 2008 usability evaluation studies of mobile applications

Various usability problems observed were unexpected and CONCLUSION

User Experience in the Systems Usability Approach

EXPERIENCE IN THE DYNAMICS OF ACTIVITY Functional Approach to Analysis of Use Technology

Developing the Scale Adoption Framework for Evaluation

Adoption Framework for Evaluation (SAFE), which is

and reliable scale.

Additionally, for each of these elements three key aspects

are emphasized for practitioner’s to consider when deciding

To initially ascertain whether or not the SAFE would 1)

et al. 1988), (Hix and Hartson 1993), (Gould and Lewis

Figure 1. The JFunnel usability life-cycle model

THE TWO SETS OF USABILITY TARGETS Strategic Usability Targets

REFERENCES 15. John, B. E. (1995). Why GOMS? Interactions. 2: 80-

Worthwhile Economic Pleasant Successful gift, transfer Nicer

Good Value Good Plan

Clear, Concerned, caring, Complete, Helpful,