AREA Technical Report: User Experience Design For Enterprise Augmented Reality
Figure 1: Increase likelihood of success by selecting the AR presentation device based on user requirements.
Once the use cases are identified and understood, often using a storyboard, the developer
works with team members to determine the best AR presentation system components (e.g.,
hardware and software). Finally, once the AR presentation system is defined, the developer
works within the technology constraints of the chosen system to design the optimal user
experience.
With limited time and resources, early AR projects should also serve to illustrate the future –
providing a glimpse of how many complex processes involving unfamiliar objects can be
connected with digital instructions and other enterprise assets.
In this section, we discuss use cases that are frequently chosen for AR introduction in
industrial settings. Use case analysis and description are an important step that helps
designers to make decisions during the design process. There are several methods in the
scientific literature that analyze the task for which the solution is to be designed and extract
design requirements [1, 2]. Most of these techniques aim at helping the designer
understand the constraints that need to be respected and the features that the user needs
in the final experience, and the extent to which the features are a requirement. The
essential components in these analytical techniques include:
1. Information requirements at each step
2. Environmental conditions during the task
3. Human factors and ergonomics
4. Desired outcome at each step
5. External constraints affecting the workflow
This report uses the five components above to describe a few sample use cases that are
popular for AR in industrial settings. The approach begins with a general analysis of the
target task, which is then broken down along these dimensions.
The use cases then drive the selection of solution components across four categories:
Presentation system
Display technology
Mobility
Connectivity
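To make this analysis concrete, the following sketch captures the five components and the four solution categories as plain data structures, using the warehouse picking use case described below as an example. The Python classes and field names are illustrative assumptions, not part of any AR toolkit or standard.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UseCaseAnalysis:
    name: str
    information_required: List[str]      # 1. information needed at each step
    environmental_conditions: List[str]  # 2. lighting, noise, temperature, ...
    ergonomics: List[str]                # 3. human factors (hands, mobility, ...)
    desired_outcome: List[str]           # 4. what each step must achieve
    external_constraints: List[str]      # 5. constraints outside the workflow

@dataclass
class SolutionComponents:
    presentation_system: str   # handheld, HMD, stationary
    display_technology: str    # VST, OST, projection
    mobility: str              # fixed station, roaming, remote
    connectivity: str          # Bluetooth, Wi-Fi, cellular

# Example: a (partial) warehouse-picking analysis drawn from this report.
warehouse_picking = UseCaseAnalysis(
    name="Warehouse picking",
    information_required=["item location", "item identifier", "action and destination"],
    environmental_conditions=["good lighting", "possible high ambient noise"],
    ergonomics=["both hands needed most of the time"],
    desired_outcome=["item picked/moved and action documented"],
    external_constraints=["few if any"],
)
```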
Warehouse Picking
The scenario characterizing this family of tasks involves moving goods in a well-known
facility. The typical blueprint for warehouse picking tasks includes:
Navigation inside the facility towards the location of interest
Identification of the item of interest
Loading/unloading of the item
Documentation of the action performed
Information required: In this scenario, the worker needs to be able to identify the place
where the item is stored and how to reach it, the identifier of the item (usually an
alphanumeric code or a barcode), and the action to perform, including the relative
destination, if appropriate. Knowing the progress status in the list of actions to perform can
also help the worker with timing and process understanding.
Environmental conditions: It is safe to assume that these tasks are performed in facilities
with reasonably good lighting conditions, dry atmosphere, and normal room temperatures.
However, the level of ambient noise can, in some cases, be relatively high due to forklifts and
other machines operating around the user.
Ergonomics: Workers need the use of both hands most of the time. Although items of
interest are usually large and easy to scan with any device, the code or identifier may be
located in places that are difficult or impossible to access.
External constraints: There are few if any external constraints affecting warehouse picking
use cases.
Assembly
During assembly scenarios, workers use mechanical and electric components following a
standard procedure that guides them through the assembly of an object composed of
multiple parts. These components might have different sizes and can be functional parts or
only serve to hold critical parts together as part of the final object. The procedure can
require using specific tools.
Information required: At each step of the task, the worker must focus on the part that needs
to be assembled, its exact location relative to where and how it should be in its final
position, the space in which the worker will perform the assembly movements, and the tools
necessary to carry out the action.
Environmental conditions: Assembly tasks are usually performed indoors. The entire task is
normally located in a specific place that is dedicated to the assembly of that component or
of a known category of components. Lighting conditions are controlled while noise level can
be high depending on the machines operating in proximity.
Ergonomics: The tasks require the use of both hands at all times. The amount of physical
movement required varies but is frequently limited as the actions are performed around a
stationary object.
Desired outcome: At each step, the worker verifies that the components are assembled in
the correct configuration.
External constraints: In some cases, the task may have time constraints related to the
actions performed (e.g., a part needs to cool down after welding before work can
continue).
Maintenance and Repair Operations
Maintenance and repair operations (MRO) are procedural tasks that involve the revision of
the status of a piece of equipment, the diagnosis of a problem and, when necessary, the
repair of an identified fault. MRO procedures can be performed on any type of component
and in potentially any location. The procedure often follows these steps:
1. Receive notification of a maintenance order
2. Identify the location
3. Identify the object
4. Diagnose the status of the object
5. Identify the fault
6. Perform the repair procedure
7. Report that the procedure has been performed successfully
The repair procedure can include the replacement of a part of the equipment and the
disassembly/reassembly of many adjacent parts.
Information required: The user needs to be guided to the location of the object in need of
maintenance. Once at the location, the user needs to identify the equipment. The user must
retrieve all the information needed to diagnose the problem with the machine and be aware
of the procedure to follow in order to perform the repair.
Environmental conditions: MRO procedures can be carried out in any conditions. Lighting,
temperature and humidity conditions can vary, both in indoor and outdoor settings.
Ergonomics: This category of operations is characterized by the high mobility needed by the
workers. Procedures often require the use of both hands and are sometimes performed in
spaces that do not permit free body movement.
Desired outcome: The task must be completed following the procedure for quality assurance
to ensure that faults are identified and fixed. Lastly, the user must report the completion of
the task to the central system.
External constraints: The successful completion of MRO procedures can depend on the
availability of diagnostic sensor data related to the equipment under inspection.
Presentation Systems
One of the most important decisions is the choice of device for presenting the AR
experience. Currently, the options include:
Handheld devices: The rear camera of tablets and smartphones captures the
environment in front of the user while the display facing the user presents the AR
experience (video see-through). The touch screen is the primary interface for
interaction with the user interface and the content.
Wearable devices / Head-mounted displays (HMD): Camera-equipped smart glasses
and visors display information directly in front of the eyes of the user without the need
for him to hold a device. Industrial-grade systems in this category are often designed
to be attached to or to replace hard hats or industry-compliant safety goggles. In
some cases, these devices are tethered to an external controller that serves as the
interface and provides supplementary battery power.
Stationary: These devices are characterized by a larger form factor and are statically
installed in a fixed physical location. Stationary AR systems are composed of a
camera system capturing the environment around the object of interest and a display
device. Display devices normally used in stationary systems are based on projection
technology or large screens mounted in easily accessible locations. In the first case,
the augmentations are projected directly onto the objects of interest; in the second,
the augmented view is shown on the screen.
When choosing among these presentation hardware options, some considerations are
crucial to take into account at the start of the design phase.
First, these devices clearly offer different capabilities in terms of mobility. While handheld
devices and smart glasses are highly mobile and can be easily carried around, stationary
systems are strictly tied to a precise location, making them unsuitable for tasks that require
roaming or work anywhere other than the station for which they are designated.
Consequently, if a task is somewhat mobile (e.g., roaming inside a facility) or highly mobile
(e.g., tasks in remote locations or outdoors), either of the first two options can be considered.
On the other hand, stationary systems are particularly effective in situations where the task
is always performed in the same place (e.g., the assembly of a mechanical part) as they do
not force the user to wear or physically interact with an external device.
Figure 2: Tablet AR (on the left) and projection-based AR (on the right - Image credits Marner et al. [3]).
Another factor that influences the choice of AR presentation hardware is the user’s need for
free hands to perform a task. Both smart glasses and projection AR systems are hands-free AR
presentation systems, while smartphones and tablets need to be held as they are pointed at
the target object in order to allow the camera to capture the scene and overlay information.
This may be inappropriate for tasks that require simultaneous use of both hands when using
an AR experience.
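As a rough illustration of the reasoning above, the sketch below maps a task’s mobility and hands-free requirements to candidate presentation systems. The function and its rules are illustrative assumptions, not a definitive selection procedure.

```python
def candidate_presentation_systems(mobile: bool, hands_free: bool) -> list[str]:
    """Illustrative mapping from task requirements to device categories,
    following the reasoning in this section (not a definitive rule)."""
    candidates = [
        "handheld (tablet/smartphone)",
        "head-mounted display (smart glasses)",
        "stationary (projection or large screen)",
    ]
    if mobile:
        # Stationary systems are tied to one location.
        candidates.remove("stationary (projection or large screen)")
    if hands_free:
        # Handhelds must be held and pointed at the target.
        candidates.remove("handheld (tablet/smartphone)")
    return candidates

# Warehouse picking: mobile and mostly hands-busy -> smart glasses remain.
print(candidate_presentation_systems(mobile=True, hands_free=True))
```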
Display technologies also play an important role when organizations are choosing
hardware. The display technology options include:
Video See Through (VST): a display screen shows the video feed being captured by
the camera while the augmentations are rendered on top of it, making it appear as if
the virtual objects were embedded into the physical environment. The user perceives
their environment while looking at the display. VSTs are used for both handheld
devices and smart glasses.
Optical See Through (OST): The user perceives the environment through the
semi-transparent glass while also perceiving the augmentations. These displays are
used for smart glasses.
Figure 5: Example of an optical see-through display for an aircraft windshield. Source: Wikipedia
Projection: A projector renders the augmentations directly on the target objects in the
form of monochromatic or color shapes. The user perceives the environment and the
augmentations directly without any mediation.
These three technologies provide very different experiences as they are constrained by their
individual technical limitations.
Video See-Through display technologies are the most widespread as they are available on most
commercial-grade devices, such as tablets and smartphones or camera-equipped virtual
reality visors. The use of this technology for AR experiences introduces a delay between the
user’s movements in space and the final rendering of the same scene with the blended
overlays. This is due to the computational process involved in the creation of the AR scene.
Once a frame is captured by the camera, it is processed for recognition and tracking. Once
the camera pose spatial transformation is calculated, the AR engine calculates the relative
spatial transformation of the overlaid augmentations. Finally, the rendering engine blends
the augmentations in the frame captured and renders the result on the display. Current
technologies allow for a delay of as little as 200 milliseconds, which is, however, large
enough to be perceived. In fact, the latency between the user’s movement and the delayed
movement of the scene observed on the display has been demonstrated to hinder the
experience, causing confusion [5] and, when used for wearable displays, a sense of nausea
called “cybersickness” [6].
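The following sketch illustrates how the stages described above accumulate into the overall motion-to-photon delay of a video see-through pipeline. The stage functions are stubs and the names are assumptions; a real AR engine implements each stage internally.

```python
import time

def process_vst_frame(camera_frame):
    """Illustrative video see-through pipeline. The stage functions are stubs;
    only the structure and the latency bookkeeping matter here."""
    t0 = time.perf_counter()

    target = recognize_and_track(camera_frame)         # image recognition and tracking
    camera_pose = estimate_camera_pose(target)          # camera spatial transformation
    overlay_pose = place_augmentations(camera_pose)     # relative transform of the overlays
    composite = blend_and_render(camera_frame, overlay_pose)  # blended frame for the display

    motion_to_photon_ms = (time.perf_counter() - t0) * 1000
    return composite, motion_to_photon_ms

# Stub implementations so the sketch runs; a real engine replaces these.
def recognize_and_track(frame): return {"target": "marker"}
def estimate_camera_pose(target): return {"pose": "T_world_camera"}
def place_augmentations(pose): return {"pose": "T_world_overlay"}
def blend_and_render(frame, overlay): return (frame, overlay)

frame, latency = process_vst_frame(camera_frame="raw_rgb_frame")
print(f"motion-to-photon latency ~ {latency:.1f} ms")
```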
Guideline: for video see-through display-equipped smart glasses, experiences should not
force the user to wear the device for long sessions.
Guideline: avoid video see-through smart glasses for tasks that require the user to
have full spatial awareness of the surrounding environment.
Even with these limitations, VST is currently used to deliver most of the handheld AR
experiences: the hardware needed is already embedded into tablets and
smartphones and the user is already used to holding camera-equipped devices and
pointing them towards an object of interest (the same interaction can be experienced
when taking a photo), observing the environment through the “digital window.”
In many cases the resolution of the video feed is lower than the display resolution, as
frame rate is prioritized over quality during video capture. This may produce a strongly
visible mismatch between a high-resolution overlay and a low-resolution background.
A common solution is to lower the rendering quality of the overlay to match.
Optical see-through technologies are, in many cases, not affected by the peripheral
FOV obstruction, as the user can perceive the environment through a transparent
material. These displays are usually employed on smart glasses because they are
completely unobtrusive, allowing the wearer to directly perceive the environment. For
this reason, this is the most popular solution adopted for wearable devices. The
possibility of overlaying virtual imagery on the environment around the user makes
this category of displays the most sought after among both providers and users of
AR. However, significant technological impediments limit the applications of these
displays.
The first important limitation affecting optical see-through technology is the limited
portion of FOV that can be used to render virtual imagery. The current state of the art
does not allow the manufacturing of lenses capable of displaying virtual imagery in a
FOV comparable to what the human eye usually perceives (more than 180 degrees).
Guideline: divide information and overlays into small, self-contained chunks if the content
will be delivered on an OST display.
Guideline: display resolution affects the size of the augmentations. Objects that are too
small will not be visible or readable on a low-resolution display.
Designing the AR content and interface as separate visual objects helps the user identify
entities. If an overlay exceeds the useful FOV of the display, the user will be forced to
step back or turn to view it, potentially missing parts of the information if the content
is dynamic.
Guideline: the AR experience should provide visual cues to direct the user’s attention if
the FOV is limited.
Figure 7: A red visual arrow can direct the user's gaze direction.
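A minimal sketch of this guideline, assuming a usable horizontal FOV value and a known bearing of the annotation relative to the user’s gaze (both illustrative):

```python
def annotation_cue(annotation_bearing_deg: float, usable_fov_deg: float = 30.0) -> str:
    """Return what to draw for an annotation at the given bearing (degrees,
    0 = straight ahead, positive = to the user's right). The 30-degree usable
    FOV is an illustrative value, not a device specification."""
    half_fov = usable_fov_deg / 2.0
    if abs(annotation_bearing_deg) <= half_fov:
        return "render annotation"
    # Outside the usable FOV: show an arrow pointing toward the content.
    return "arrow right" if annotation_bearing_deg > 0 else "arrow left"

print(annotation_cue(10))    # render annotation
print(annotation_cue(-75))   # arrow left
```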
The augmented scene is observed through the small window created by the OST
display. Due to the limited FOV of this window, in scenarios where the user has to
observe large objects or wide scenes, only part of the augmented content is visible at
any one time.
Background and lighting conditions heavily affect the visibility and readability of AR
content when visualized with OST displays. Current OST displays are able to render
content with a maximum of 80% opacity, which means that augmentations blend with
at least 20% of the light reflected by the environment. Consequently, bright environments
affect the visibility of the virtual imagery. The same concept applies to the background
color: low contrast between the background and the augmentations hinders their
visibility.
Guideline: avoid optical see-through displays in very bright environments and
outdoors. In addition, test the designed experience in the environment in which it will be
used in order to check color consistency.
On the one hand, monocular HMDs occupy a portion of the FOV of only one of the two
eyes. Also called “assistive displays,” these devices are implemented with both VST
and OST technologies. On the other hand, binocular displays occupy the same portion
of FOV for both eyes. The main difference between these two categories is that a
binocular display is able to deliver stereoscopic augmentations. Therefore binocular
displays are the only ones able to deliver spatially registered AR (also known as Mixed
Reality or MR). Nevertheless, monocular displays are far less obtrusive than binocular
ones and, being placed in the peripheral vision of one eye, can be easily scanned
with a simple eye movement.
Guideline: use binocular displays if spatially registered AR is part of the UX. If only
checklists or contextual information needs to be displayed, the experience might
benefit from a monocular display.
When designing content and interfaces that are going to be projected on objects, it is
important to generate a good representation of the surfaces on which the shapes will be
projected. In fact, because these shapes are made of light beams, if the surface is
irregular, protruding or reflective, the light will not be reflected evenly and the shapes
will appear distorted.
Guideline: make sure that projection surfaces are regular and non-reflective.
Connectivity
Most AR applications rely on the presence of some form of connectivity on the device
they are deployed on, in order to download content, perform indoor positioning and
upload real-time results of operations.
The available technology standards (Bluetooth, Wi-Fi, and cellular networks) vary in data
transmission speed, type of data allowed, and range.
Bluetooth has a very short transmission range (up to 20 meters) and is generally used
for indoor positioning (combined with beacons) and non-data-intensive device
communication. Wi-Fi connections are usually characterized by high-speed data
transmission rates and a maximum transmission range of 50 meters, allowing usage for
indoor applications. The extended range of 3G and 4G networks justifies their use for AR
applications outdoors and in remote facilities that cannot provide Wi-Fi connections.
However, the data transmission speed of these connections can vary according to
atmospheric interference and signal coverage.
Technology | Data transmission speed (Mbps) | Approximate range
3G | 20 – 100 | ~1,500 m
4G | 40 – 150 | ~1,500 m
Table 1: Options for network connectivity for AR experience delivery.
Network availability and data transmission speed can heavily affect the experience and
the efficacy of AR applications: downloading the relevant technical documentation in real
time is essential in many AR applications. For this reason, it is important to investigate in
advance the availability and reliability of the connections that will be used, and design
the content accordingly. For example, it is recommended to design low-resolution
versions of images, textures and 3D models for use on slower networks, so that content
can be downloaded and rendered quickly.
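One hedged way to follow this recommendation is to select an asset variant at load time based on the available downlink speed. The thresholds and variant names below are illustrative assumptions.

```python
def pick_asset_variant(downlink_mbps: float) -> str:
    """Choose a content variant based on available bandwidth (illustrative
    thresholds, to be tuned against the networks listed in Table 1)."""
    if downlink_mbps >= 50:
        return "high_res"     # full-resolution textures and 3D models
    if downlink_mbps >= 10:
        return "medium_res"   # compressed textures, decimated meshes
    return "low_res"          # 2D images and lightweight overlays only

for network, mbps in [("Wi-Fi", 150), ("cellular, good coverage", 60), ("cellular, weak coverage", 8)]:
    print(network, "->", pick_asset_variant(mbps))
```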
Guideline: cellular networks can have highly variable transmission speeds. Data loss and
delays can affect the validity of real-time data.
Industrial Standards
There are other industrial standards that apply in specific settings, such as the need for
sterilization in some food or pharmaceutical manufacturing plants. In oil and gas
facilities, for example, equipment may additionally need to be certified as intrinsically
safe for use in potentially explosive atmospheres.
UX Design Considerations
Some considerations and recommendations in this section are driven by requirements
introduced in the previous section. For example, the UI and interaction design are tightly
constrained by the technological limitations of the hardware platform on which the UX will
be delivered.
Guideline: content for use in AR experiences needs to be large in size (compared to what
is designed for high-resolution tablets or other screens) and other steps may need to be
taken (e.g., animating a simple asset) to ensure the information is clearly visible in the
user’s environment. In addition, crucial parts of the information should be simultaneously
visible without users needing to turn their heads (in the same FOV).
Content positioning plays a very important role in AR experiences. The location of digital
content can be conveniently mapped in two spatial reference systems [8]:
Egocentric: also known as user-centric space, this reference system has its
reference point in the user’s body and, specifically, the center of the user’s
perceptual system, often identified with the head.
Allocentric: also known as object-centric space, this reference system uses the
object of interest as a reference point.
Egocentric interfaces are spatially organized around the user independently from the
scene observed by the AR-enabled device, and can be accessed through the user’s
active query (e.g., turning the head right or left, looking up or down) – see Figure 8. In
the case of handheld devices, these typically take the form of on-screen content that can
occupy a portion (if not all) of the video feed, constantly overlaying it. HMDs can deliver
egocentric content using two different techniques.
The most commonly used technique, known as Heads-Up Display (HUD), implies the
placement of information in the peripheral portion of the useful FOV so that the content is
constantly visible independently of head movements. The second technique makes
use of the inertial motion-sensing units (IMUs) installed in HMDs to place content on the
sides of the user’s head.
The user can visually query this information by rotating the head to one side or the other
where the content is located. This technique assumes that there is a preferred head
direction to keep clear from content, or the ability for the user to change this direction
during usage (personal preferences). Egocentric interfaces are particularly useful when
the user needs to access information while moving or changing environments, as the
content is not registered to any particular spatial reference point.
Figure 8: Reference space of HUD-like (on the left) and head-tracked (on the right) egocentric user interfaces.
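The sketch below contrasts the two egocentric placement techniques in simplified form: a HUD element is fixed in screen space, while a head-tracked element sits at a fixed bearing around the user and becomes visible only when the head turns toward it. The angles and function names are illustrative assumptions.

```python
def hud_position():
    """HUD-style placement: fixed in screen space, always visible."""
    return {"screen_x": 0.9, "screen_y": 0.1}   # e.g., top-right corner of the FOV

def head_tracked_visible(head_yaw_deg: float,
                         content_bearing_deg: float = 90.0,
                         usable_fov_deg: float = 30.0) -> bool:
    """Head-tracked placement: the content sits at a fixed bearing around the
    user (here 90 degrees to the right) and is shown when the head turns toward it."""
    offset = (content_bearing_deg - head_yaw_deg + 180) % 360 - 180
    return abs(offset) <= usable_fov_deg / 2

print(head_tracked_visible(head_yaw_deg=0))    # False: looking straight ahead
print(head_tracked_visible(head_yaw_deg=85))   # True: head turned to the right
```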
Guideline: always-on user-centric interfaces should occupy a small portion of the FOV.
Preferably, external corners and peripheral parts of the FOV are used to place content.
Although peripheral vision cannot be used to recognize patterns and understand the
environment, it is particularly sensitive to variation in light modulation – i.e., a sudden
change in luminosity [7]. This means that object movement (which changes the luminosity
in the peripheral vision) is more likely to attract attention than shape or color change.
Guideline: object movement – bouncing, vibration, etc. – attracts the user’s attention
towards information notifications in the peripheral FOV better than other visual changes.
Guideline: the size and number of digital assets superimposed onto an object of interest
must always be limited. If information about the environment is available during the design
phase, position overlays appropriately and do not occlude surrounding relevant objects.
There are other known challenges. Object-centric annotations are, by definition, attached
to a specific target. Their position with respect to the object may be inconvenient when
performing tasks that require the user to change positions (high mobility). Some of these
tasks require the user to perform operations on large machines or spatially distributed
equipment. In these cases, users need to visualize equipment-related information
independently of their physical locations.
Object-centric visualizations are often dependent not only on the visibility of the object itself,
but also on the relative position and orientation of the user. Assumptions are made during the
design phase about the user’s perspective. The most common is that the user will be looking
at an object from a specific position and angle – in front in most cases. If the digital content is
designed to be static (spatially arranged in a fixed position), the experience designer needs
to make sure that the user is correctly positioned in relation to the object when visualizing the
overlays. This can be achieved by providing spatial cues to direct the user’s point of view
towards a position and orientation where the experience designer assumes the user to be
[13].
Guideline: the user should look at the object of interest from the expected position and
direction. If this cannot be guaranteed, spatial cues to guide the user towards a favorable
position should be provided.
Figure 10: Parafrustum, a technique to direct the user towards the optimal point of observation. Image credits Sukan et al. [13]
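A rough sketch of this idea, loosely inspired by the ParaFrustum approach [13]: compare the user’s current position and viewing direction against the pose assumed by the designer, and keep showing guidance cues until both are within tolerance. The tolerances, vectors and helper names are assumptions.

```python
import math

def viewpoint_ok(user_pos, user_dir, expected_pos, expected_dir,
                 max_distance=2.0, max_angle_deg=25.0) -> bool:
    """True when the user stands close enough to the expected position and
    looks roughly along the expected direction; otherwise the experience
    should display spatial cues guiding the user into place."""
    dx, dy, dz = (u - e for u, e in zip(user_pos, expected_pos))
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)

    dot = sum(a * b for a, b in zip(user_dir, expected_dir))
    norm = math.sqrt(sum(a * a for a in user_dir)) * math.sqrt(sum(b * b for b in expected_dir))
    angle_deg = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

    return distance <= max_distance and angle_deg <= max_angle_deg

# User 1.5 m from the expected spot, looking about 10 degrees off-axis: acceptable.
print(viewpoint_ok((1.5, 0, 0), (0.98, 0.17, 0), (0, 0, 0), (1, 0, 0)))
```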
Graphical Objects
In contrast with VR systems, where the experience is fully immersive, AR blends the real
world and digital assets, suggesting the need for high levels of photographic realism. Such
realism has been achieved with computer graphics special effects systems designed for
television and cinema, but it requires manual post-processing of video streams, which is
not feasible in real time. AR experiences therefore typically rely on simpler graphical
objects.
Spatial annotations, for example, are a particular set of digital objects used to identify
real objects in space, providing indications about the action to perform during the tasks
and graphically labeling the environment. These annotations are superimposed directly
on the object of interest and can often be combined with 3D virtual imagery and
animations. Instances of this category of overlays are arrows, circles, color shapes or
symbols. Spatial annotations can be designed to be rendered as 2D widgets positioned
on a planar surface, or 3D images floating in the environment.
Figure 11: 2D content (on the left) and 3D content (on the right). Image credits [14]
Ideally, these two modalities of visual information presentation carry the same
informative load and are equally effective for task instruction. However, scientific
research demonstrates that the two strategies are not equally valid in every situation. In
fact, while 2D shapes and symbols are clearer and more easily interpretable when
rendered on objects with large planar surfaces, 3D arrows and shapes are more
effective to annotate irregular objects and surfaces, as they more clearly provide depth
cues and three-dimensional information [15]. In addition, 3D content is more
computationally demanding than 2D shapes, and this affects system performance and
battery life, both crucial requirements for industrial-grade AR systems.
Guideline: flat (2D) content should be prioritized in order to improve performance and to
prolong the battery life of the AR system. 3D content should be used only when
necessary: 3D virtual objects are effective in providing depth cues.
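The trade-off above can be expressed as a simple content-selection heuristic; the rule and thresholds in the sketch below are illustrative assumptions rather than recommendations from [15].

```python
def choose_annotation_style(surface_is_planar: bool,
                            depth_cue_needed: bool,
                            battery_fraction: float) -> str:
    """Illustrative heuristic following the 2D-vs-3D discussion above: prefer
    flat content, fall back to 3D only when depth cues matter and the device
    can afford the extra rendering cost."""
    if surface_is_planar and not depth_cue_needed:
        return "2D widget on surface"
    if depth_cue_needed and battery_fraction > 0.2:
        return "3D annotation"
    return "2D widget (power-saving fallback)"

print(choose_annotation_style(surface_is_planar=False, depth_cue_needed=True, battery_fraction=0.8))
```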
Depth sensors allow AR systems to correctly estimate the spatial position of specific
parts of the environment with respect to the target object on which the user will expect to
see (and use) digital information. In addition, with depth data, systems can correctly
position digital content so that it is occluded by real objects when appropriate.
Since, in most cases, current AR presentation engines track the environment using only
monocular RGB cameras, they do not have depth-mapping data. As a result, the rendering
engines cannot hide the portions of a digital object that should be occluded by a real
object. This may cause misinterpretation of spatial cues and mislead the user towards an
object that was not intended for the experience. Without depth-mapping technology, it is
recommended to design experience content so that the spatial location of a digital object
with respect to its real-world target is unambiguous even when the rendering cannot
account for the user’s depth perception.
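The sketch below illustrates one hedged way to keep spatial association unambiguous when no depth data is available: offset the label into free space and draw an explicit leader line back to the anchor point. The data structure and names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Overlay:
    label: str
    anchor_xyz: tuple          # point on the real target (world coordinates)
    offset_xyz: tuple          # where the label floats, away from clutter
    draw_leader_line: bool     # explicit line from label to anchor

def make_unambiguous_overlay(label: str, anchor_xyz: tuple) -> Overlay:
    """Without depth data the overlay cannot be occluded correctly, so we
    offset the label into free space and draw a leader line back to the
    anchor point, making the intended target unambiguous."""
    offset = (anchor_xyz[0], anchor_xyz[1] + 0.3, anchor_xyz[2])  # 30 cm above the anchor
    return Overlay(label, anchor_xyz, offset, draw_leader_line=True)

print(make_unambiguous_overlay("valve #3", (1.2, 0.8, 2.5)))
```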
Guideline: use high-contrast colors to enhance overlay visibility and text readability.
Guideline: take into account the background color and how it modifies the color perceived
by the AR experience user.
AR Interaction Techniques
Gesture
Gesture recognition can be achieved using a camera-enabled device that constantly monitors
the scene searching for pre-defined hand movement patterns (i.e., gestures). Either RGB
cameras combined with computer vision algorithms or depth-sensing cameras can be used to
recognize gestures. Gesture-based interaction is based on the assumption that the user is
aware of and uses only a predefined set of gestures, and how these gestures are interpreted
by the AR system.
Guideline: provide help and training for gesture interaction in order to reduce delays due to
any learning curve.
External factors that can influence the effectiveness of recognition algorithms include:
1. Extreme lighting conditions
2. Reflective surfaces in the surroundings
3. Gloves worn by the user
Guideline: design small simple gestures for user interaction. Test the interaction design in the
expected user environment to ensure that external conditions do not affect robustness of
recognition.
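A minimal sketch of how a gesture recognizer might guard against the external factors listed above: accept a gesture only when the classifier’s confidence stays above a threshold for several consecutive frames. The interface and thresholds are illustrative assumptions, not a specific product’s API.

```python
from collections import deque

class GestureGate:
    """Accepts a gesture only after it has been recognized with sufficient
    confidence for `required_frames` consecutive frames, which reduces false
    triggers caused by glare, reflections or gloved hands."""

    def __init__(self, min_confidence: float = 0.85, required_frames: int = 5):
        self.min_confidence = min_confidence
        self.history = deque(maxlen=required_frames)

    def update(self, gesture: str, confidence: float) -> str | None:
        self.history.append(gesture if confidence >= self.min_confidence else None)
        if (len(self.history) == self.history.maxlen
                and len(set(self.history)) == 1 and self.history[0]):
            return self.history[0]   # stable, confident detection
        return None

gate = GestureGate()
for frame in range(6):
    result = gate.update("pinch", confidence=0.9)
print(result)   # "pinch" once the detection has been stable
```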
There are other considerations for this interaction technique. Because the system has to
continuously execute advanced computer vision algorithms to achieve a successful
interaction, the computational load on the CPU or a dedicated processor is very high. This
makes gesture recognition highly power consuming. Although this may seem an issue with
which only hardware designers need to deal, it is crucial for UX designers to take the battery
life of the device into consideration.
Speech
There are some limitations to speech as a technique for system control. First, the
robustness of speech recognition can be heavily affected by the noise level of the
environment. Despite the fact that many modern microphones have noise canceling
algorithms, the reliability of speech recognition is compromised by noise levels that
drown out the user’s voice. These unwanted sound sources are usually generated by heavy
machinery present in the environment, but they can also be the result of the conversations
of other workers (which can be interpreted as valid commands). However, by using a
microphone array it is possible to locate the position of a sound source and disregard
those that are not created by the authorized user.
Another important consideration when implementing speech interaction is to make sure that
the user is aware of the commands allowed and how to properly formulate them. This
technique involves vocal or auditory interactions between the user and the device, and the
device does not display visual cues of the possible commands the user can choose
from – unlike, for example, a button labeled “Next,” which suggests that tapping it will
most probably move the session to the next phase.
Guideline: display hints and suggest commands that can be used, especially for novice
users.
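One hedged way to implement both points – a constrained vocabulary and visible hints – is sketched below; the command set and matching rule are illustrative assumptions.

```python
# Illustrative command set; a real application would define these per task.
COMMANDS = {
    "next step": "advance",
    "previous step": "go_back",
    "repeat": "replay_instruction",
    "take photo": "capture",
}

def handle_utterance(utterance: str):
    """Match a recognized utterance against the allowed phrases; if nothing
    matches, surface the valid commands as a hint instead of failing silently."""
    normalized = utterance.strip().lower()
    if normalized in COMMANDS:
        return COMMANDS[normalized]
    hint = ", ".join(f'"{phrase}"' for phrase in COMMANDS)
    return f"unrecognized - try: {hint}"

print(handle_utterance("Next step"))
print(handle_utterance("select the second item and the two items below"))
```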
Talking is an activity performed daily and natural language has a complexity that goes far
beyond what a simple command recognition algorithm can process. If complex natural
language processing techniques are not implemented, the algorithm may fail to recognize
the user’s command even if it is properly formulated. For example, a user could say “select
the second item” followed by “and the two items below.” Understanding such commands
implies a complex analysis of the contextual relationship among words that many speech
processors do not support.
Guideline: instruct the user not to give complex instructions using natural language.
Touch and External Devices
Touch input is a familiar mechanism for users and developers. As most AR-enabled
handheld devices offer multi-touch input by design, it is very easy to implement touch-based
interaction.
Figure 12: An example of Direct Manipulation of AR interfaces. Image credits 3D Studio Bloomberg
Touch-based interaction can be implemented following two different paradigms. The first is
what interaction theorists call “Direct Manipulation” [17]: the user manipulates the virtual
object and the interface directly using hand gestures. The virtual object or widget has
properties similar to real-world objects so that the user can use her understanding of the
physical world to manipulate the virtual imagery. Examples of this type of interaction are
tapping on a button, swiping to rotate a 3D object or pinching the screen to zoom in. In AR
user interfaces, the interaction is, in practice, very similar to 2D touch-based interfaces, with
the difference that often content is not attached to the screen, but is rendered spatially at a
precise point in the environment. In these cases, the user touches the visible area of the
virtual object on the screen, trying to manipulate the object in space. Because of the
similarity between spatially registered 3D content (which has a designated position and
orientation in space) and manipulable virtual 3D content (which can be browsed and
explored by direct manipulation), users often try to touch and manipulate spatially registered
AR content in an attempt to explore it, instead of moving the point of view around the object
itself.
Guideline: in cases where the user needs to explore a 3D model, enable direct
manipulation of 3D objects through touch interfaces.
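The sketch below illustrates the usual pattern behind touch-based direct manipulation of spatially registered content: the touch point is converted into a ray in the 3D scene and tested against the objects’ bounding volumes. The pinhole model and sphere test are deliberately simplified assumptions.

```python
import math

def touch_to_ray(touch_x, touch_y, screen_w, screen_h, fov_deg=60.0):
    """Convert a touch position (pixels) into a viewing-space ray direction.
    Simplified pinhole model; a real engine would use its camera matrices."""
    aspect = screen_w / screen_h
    ndc_x = (2.0 * touch_x / screen_w - 1.0) * aspect
    ndc_y = 1.0 - 2.0 * touch_y / screen_h
    z = -1.0 / math.tan(math.radians(fov_deg) / 2.0)
    length = math.sqrt(ndc_x ** 2 + ndc_y ** 2 + z ** 2)
    return (ndc_x / length, ndc_y / length, z / length)

def ray_hits_sphere(ray_dir, center, radius):
    """Ray from the camera origin vs. a bounding sphere around a 3D overlay."""
    # Project the sphere centre onto the ray and check the closest distance.
    t = sum(d * c for d, c in zip(ray_dir, center))
    closest = [d * t for d in ray_dir]
    dist2 = sum((p - c) ** 2 for p, c in zip(closest, center))
    return t > 0 and dist2 <= radius ** 2

ray = touch_to_ray(540, 960, 1080, 1920)             # touch at screen centre
print(ray_hits_sphere(ray, (0.0, 0.0, -2.0), 0.3))   # True: object straight ahead
```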
Figure 13: Smart glasses connected to an external interaction device. Image credits Epson Inc.
This technique can be easy to implement if the presentation device can be connected to
external devices, but it is not recommended for tasks during which body movements are
spatially restricted.
Stare/gaze Tracking
Recently, stare/gaze input mechanisms have been implemented in some AR systems. These
techniques track the user’s head or eye movements in order to interpret the interaction
intention and act accordingly. In the case of head tracking, a small pointer is displayed
in the center of the view. The user stares at interactive elements of the virtual scene –
overlays or UI components – and after a brief confirmation time, the input is accepted as
intentional. This technique, called “stare navigation,” is best used for simple operations and
manipulations as it can emulate only a single input – the equivalent of a mouse click.
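A minimal sketch of dwell-based (“stare navigation”) selection, assuming an illustrative confirmation time:

```python
import time

class DwellSelector:
    """Emulates a single 'click' by requiring the gaze pointer to stay on the
    same element for `dwell_seconds` before confirming the selection."""

    def __init__(self, dwell_seconds: float = 1.0):
        self.dwell_seconds = dwell_seconds
        self.current_target = None
        self.gaze_started_at = None

    def update(self, gazed_element: str | None) -> str | None:
        now = time.monotonic()
        if gazed_element != self.current_target:
            # Gaze moved to a new element (or away): restart the dwell timer.
            self.current_target = gazed_element
            self.gaze_started_at = now if gazed_element else None
            return None
        if gazed_element and now - self.gaze_started_at >= self.dwell_seconds:
            self.gaze_started_at = now   # reset so the element is not re-triggered immediately
            return gazed_element         # confirmed selection
        return None

selector = DwellSelector(dwell_seconds=0.0)   # zero dwell just for the demo
selector.update("next_button")
print(selector.update("next_button"))         # "next_button": selection confirmed
```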
An optimal solution for interaction with AR systems may be to combine multiple techniques
in order to mitigate the limitations of each one. Multimodal interaction can, ideally, provide
the best user experience as long as it is correctly designed. Current technologies make it
possible, for example, to combine gesture and speech interaction in order to provide the
flexibility and intuitiveness of gesture-based input as well as the hands-free experience of
verbal commands. In practice, however, there are limitations that can affect implementations of
multimodal interaction techniques. Firstly, the more that input devices are combined, the
more processing power is needed to detect all the different inputs, drastically reducing
battery life – which is already short for wearable devices. Secondly, when more than one
technique is involved during the interaction, achieving consistency and avoiding confusion
requires extensive design knowledge and experience.
Once these technological limitations are overcome, multimodal input techniques facilitate
the creation of more natural and intuitive interaction techniques, allowing users to focus on
the task at hand, without having to be aware of how the underlying technology works.
The solutions proposed here are not intended to be optimal. Rather, this section attempts to
showcase how the guidelines and the discussions in the previous sections can be applied to
real-life use cases. In addition, these solutions are not meant to be sufficient for every use
case that matches the description: there can be a number of external factors that influence the
effectiveness of the proposed system. In these cases, the designer should follow the same
critical mental process adopted in this report to analyze the characteristics of these external
factors and determine how they impact the AR solution.
Warehouse Picking
From the description of this use case, it seems immediately obvious that users need to be
mobile and easily roam around indoor facilities. Workers in warehouse facilities need to
collect and deliver goods in different locations inside the building and move around on foot or
with small vehicles. Consequently, a mobile solution is needed, thereby excluding projection-
based stationary systems.
Another important consideration concerns the type of content that is going to be displayed.
The essential types of information that need to be delivered by an AR application for
warehouse picking in order to support the successful fulfillment of basic tasks are:
Checklists with the operations to perform or objects of interest (including the relevant
information about their status)
Textual descriptions of the operations, descriptions of objects, and notifications of
status updates
A simple navigation mechanism in the form of arrows or maps
This implies that 3D content and stereoscopy are not necessary for presenting this kind of
information.
As for content design, user-centric textual information and 2D content are sufficient to
implement most of the functions required. Given the simplicity of this type of content, touch-
enabled external devices or speech recognition – for low noise-level warehouses – are easy-
to-implement solutions that work very effectively for 2D content.
Assembly
There are a number of use cases in which 3D spatial representation is not needed, as, for
example, in welding operations. Here it is far more important to precisely signal the
welding location and a few other details, such as welder machine temperature or welding
time. In such cases, stationary projection-based AR systems are better solutions.
Projectors do not require the user to wear or hold a device, which would be uncomfortable
during prolonged work at a fixed assembly station.
As the content is mostly related to the object of interest and its components, object-centric
overlays are usually implemented for supporting assembly: labels and images should be
placed around the objects of interest while animated 3D models are superimposed directly
onto objects, highlighting differences in configuration. Depth-sensing cameras are often used
for these applications in order to improve the spatial precision of overlaid 3D models. In
addition, by using depth maps, it is possible, in some cases, to automatically detect assembly
errors or document operation compliance.
Maintenance and Repair Operations
Many of the considerations for the assembly use case in the previous section are valid also for
maintenance and repair operations. In fact, many MRO tasks include assembly and
disassembly of physical objects. Consequently, spatially positioned overlays and animated 3D
models are important for operators to build a correct mental model of the task and the objects
of interest.
One key difference with the assembly use case lies in the extreme mobility required for MRO
tasks. These tasks are mostly performed around facilities or even in remote sites.
Consequently, stationary solutions are not convenient, while handheld and head-worn devices
are often the recommended options. Smart glasses are, in general, a preferred device choice,
allowing users to view content while still being able to perform two-handed operations. However,
handheld devices, such as tablets, have also proven to be effective and more cost-efficient
solutions in less-complex scenarios that do not require the user to constantly pick up and put
down the device.
A second key difference is that, unlike in the object assembly scenario, the demand for network
connectivity is much higher for MRO. These tasks often make use of Internet connections for a
number of
operations:
Real-time data readings from industrial sensor networks, needed for equipment
monitoring and fault diagnosis
Status updates about concurrent maintenance operations or urgent interventions
needed
Expert remote assistance
Job fulfillment documentation
Conclusions
This report provides general considerations and guidelines for AR experience design in
industrial settings. Its goal is to address a broad spectrum of topics for audiences with a range
of familiarity with Augmented Reality.
It explains decisions to be made throughout the design process and how choices impact the
design of the AR experience. It serves as an introductory overview and may be useful for
supporting discussions among people with diverse exposure to the field of Augmented Reality.
That said, we recommend keeping an open mind. Some of the topics introduced are very
broad and complex, and general guidelines cannot always be applied, as the decision process
varies according to the situation at hand. This should not stop the reader from reflecting upon the
thought processes behind these guidelines and modifying them for their particular cases. More
importantly, the structure of this report provides a guide for the decision-making pipeline and
demonstrates how choices impact the design of the final experiences. We expect that the
themes and guidelines will also need to be adapted as new devices are introduced and some
current obstacles to adoption decline in the future.