DepthLab: Real-Time 3D Interaction With Depth Maps for Mobile Augmented Reality (UIST 2020)
Ruofei Du, Eric Turner, Maksym Dzitsiuk, Luca Prasso, Ivo Duarte, Jason Dourgarian,
Joao Afonso, Jose Pascoal, Josh Gladstone, Nuno Cruces, Shahram Izadi, Adarsh Kowdle,
Konstantine Tsotsos, David Kim†
Google LLC
[Figure 1 panels: (a) oriented reticles and splats; (b) ray-marching-based scene relighting; (c) depth visualization and particles; (d) geometry-aware collisions; (e) 3D-anchored focus and aperture effect; (f) occlusion and path planning]
Figure 1. Real-time interactive components enabled by DepthLab: (a) virtual texture decals “splatting” onto physical trees and a white oriented reticle
as a 3D virtual cursor; (b) relighting of a physical scene with three virtual point lights; (c) AR rain effect on dry stairs on the left and false-color
depth map on the right; (d) virtual objects colliding with physical exercise equipment; (e) “Bokeh”-like effect putting focus on a physical 3D anchor; (f)
occlusion and path planning in a mobile AR game. Please refer to the accompanying video captured in real time for more results.
Figure 2. A high-level overview of DepthLab. We process the raw depth map from the ARCore Depth API and provide customizable and self-contained components such as a 3D cursor, geometry-aware collision, and screen-space relighting. The DepthLab library aims to help mobile app developers build more photo-realistic and interactive AR applications.
In contrast, our system does not depend on dedicated depth sensors and can run instantly on the input depth maps. With live depth maps provided by the ARCore Depth API [46], we are, to the best of our knowledge, the first to demonstrate a range of geometry-aware AR interaction and visual-effect features on smartphones without surface reconstruction.

10 Kinect: https://en.wikipedia.org/wiki/Kinect
11 LeapMotion: https://leapmotion.com
12 Mixed Reality Toolkit: https://github.com/microsoft/MixedRealityToolkit-Unity

Each participant generated individual ideas in a 30-minute session. These ideas were then collected in a shared document and briefly presented to the group by each participant. We collected a total of 120 ideas this way. Afterwards, the session organizer clustered similar ideas, initiated an offline vote, and summarized the ideas with the most votes.

In the second session, we assigned key attributes, then discussed and ranked the technical feasibility, compelling use cases, relevance of depth, whether any form of machine learning (ML) model is required, and the type of depth data needed as a minimum requirement for each idea. Based on the depth requirement, we scoped DepthLab to cover localized depth, surface depth, and dense depth, rather than surface reconstruction with voxels or triangles. We further explain this categorization in Table 1.

In the final session, we discussed the top priorities based on the overall rating of the ideas, organized weekly meetings, and assigned tasks to collaboratively develop DepthLab over a six-month period. We summarize 39 aggregated ideas in the supplementary material and indicate which ones DepthLab implements without 3D surface reconstruction.
Figure 4. System architecture of DepthLab. Our input consists of the RGB camera image, the depth map from the ARCore Depth API, camera parameters, and phone orientation. For each frame, we update the depth array (CPU), depth mesh (CPU+GPU), and depth texture (GPU) from the raw depth buffer. We offer a set of conversion utilities that improve the developer workflow, as well as a set of algorithms that can be dragged and dropped into applications.

System Architecture
DepthLab consists of four main components (Figure 4): tracking and input, data structure generation, conversion utilities, and algorithms for the presented effects.

Tracking and Input
DepthLab uses real-time depth maps provided by the ARCore Depth API, which only requires a single moving RGB camera on the phone to estimate depth. A dedicated depth camera, such as a time-of-flight (ToF) camera, can instead provide depth maps instantly, without any initializing camera motion. Additionally, DepthLab uses the live camera feed, the phone position and orientation, and camera parameters including the focal length, intrinsic matrix, extrinsic matrix, and projection matrix for each frame to establish a mapping between the physical world and virtual objects. We provide extensive conversion utilities and interaction modules to facilitate higher-level mobile AR development.

Data Structures of DepthLab
The depth data is typically stored in a low-resolution depth buffer (160 × 120 in our examples13), which is a perspective camera image that contains a depth value instead of a color in each pixel. For different purposes, we generate three categories of data structures:

1. Depth array stores depth in a 2D array of a landscape image with 16-bit integers on the CPU. With the phone orientation and the maximum sensing range (8 meters in our case), we offer conversion functions to access depth from any screen point or texture coordinate of the camera image (a minimal lookup sketch follows this list).

2. Depth mesh is a real-time triangulated mesh generated for each depth map on both the CPU and the GPU. In contrast to traditional surface reconstruction with persistent voxels or triangles, a depth mesh has little memory and compute overhead and can be generated in real time. We detail its generation procedure in Algorithm 2.

3. Depth texture is copied to the GPU from the depth array for per-pixel depth use cases in each frame. We filter the depth texture with depth-guided anti-aliasing (Figure 11), in addition to hardware-accelerated bilinear filtering, to reduce visual artifacts.

13 The depth map resolution may differ depending on the phone model.
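As a concrete illustration of the depth-array access described in item 1, the sketch below shows how a conversion utility might read a metric depth value for a normalized screen point, assuming a 160 × 120 landscape buffer of 16-bit millimeter values; the class, helper names, and the exact portrait-to-landscape rotation are our own assumptions rather than the shipped DepthLab API.

```csharp
using UnityEngine;

// Illustrative sketch (not the actual DepthLab API): reading a metric depth
// value for a screen point from a 160x120 landscape depth array of 16-bit
// integers, assuming depth is stored in millimeters.
public static class DepthArraySample
{
    public const int DepthWidth = 160;      // landscape width of the depth buffer
    public const int DepthHeight = 120;     // landscape height of the depth buffer
    public const float MaxDepthMeters = 8f; // maximum sensing range

    // screenUv is a normalized screen coordinate in [0, 1]^2 for the current
    // screen; mapping it into the landscape depth image is where the
    // phone-orientation handling happens (rotation convention assumed here).
    public static float GetDepthMeters(ushort[] depthMm, Vector2 screenUv,
                                       ScreenOrientation orientation)
    {
        Vector2 uv = orientation == ScreenOrientation.Portrait
            ? new Vector2(screenUv.y, 1f - screenUv.x)  // rotate into landscape
            : screenUv;

        int x = Mathf.Clamp((int)(uv.x * DepthWidth), 0, DepthWidth - 1);
        int y = Mathf.Clamp((int)(uv.y * DepthHeight), 0, DepthHeight - 1);

        float meters = depthMm[y * DepthWidth + x] * 0.001f;  // mm -> m
        return Mathf.Min(meters, MaxDepthMeters);             // clamp to range
    }
}
```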
Conversion Utilities and Algorithms
The slow adoption of depth in mobile applications may lie in the complexity of processing depth for end-user experiences. Depth data becomes more useful when it is mapped to the camera image and to the real-world geometry. However, even these steps require technical knowledge outside the domain of many application developers. Additional factors that can complicate depth processing include adapting to changes of the phone orientation, converting points between local and global coordinate frames, and the lack of examples. Our conversion utilities are detailed in the next section. Based on the three classes of depth data structures, we provide a series of algorithms and techniques that let developers directly apply high-level concepts such as physics, shadows, texture mapping, and relighting in their applications using popular game editors such as Unity or Unreal. We detail these techniques in the next section.

1. Localized depth uses the depth array to operate on a small number of points directly on the CPU. It is useful for computing physical measurements, estimating normal vectors, and automatically navigating virtual avatars in AR games.

2. Surface depth leverages the CPU or compute shaders on the GPU to create and update depth meshes in real time, thus enabling collision, physics, texture decals, geometry-aware shadows, and more.

3. Dense depth is copied to a GPU texture and is used for rendering depth-aware effects with GPU-accelerated bilinear filtering in screen space. Every pixel in the color camera image has a depth value mapped to it, which is useful for real-time computational photography tasks such as relighting, 3D-anchored focus and aperture, and screen-space occlusion effects.

                  | Localized Depth       | Surface Depth         | Dense Depth
CPU               | ✓                     | ✓                     | ✗ (non-real-time)
GPU               | N/A                   | ✓ (compute shader)    | ✓ (fragment shader)
Prerequisites     | point projection,     | depth mesh,           | anti-aliasing,
                  | normal estimation     | triplanar mapping     | multi-pass rendering
Data Structure    | depth array           | depth mesh            | depth texture
Example Use Cases | physical measure,     | collision & physics,  | scene relighting,
                  | oriented 3D cursor,   | virtual shadows,      | aperture effects,
                  | path planning         | texture decals        | occluded objects

Table 1. Comparison of CPU/GPU real-time performance, key prerequisite techniques, underlying data structures, and example use cases between localized depth, surface depth, and dense depth.

Through close communication with partner developers, we identified that adapting the depth processing steps to dynamically changing screen orientation and resolution is complicated and time consuming. We simplified these steps and provide convenient conversion utilities, which ensure that every pixel on the screen has a corresponding world vertex measured in meters.
Interaction With Localized Depth
In comparison to DepthLab, widely used AR frameworks such as ARCore, ARKit, and AR Toolkit [25] provide hit-testing functions that allow applications to get a real-world point based on the intersection between a camera ray and detected AR planes. Unfortunately, this method often yields errors due to inaccurate measurements around edges and non-planar objects on surfaces [31]. In this section, we introduce fundamental techniques and interactive use cases with localized depth (Figure 5), which yield more accurate hit tests and enable finer-grained use cases than plane-based or anchor-based AR interaction.

Figure 5. Example use cases of localized depth. (a) shows a 3D cursor oriented according to the normal vector of the physical surface, with details about its distance to the ground and to the camera. (b) shows a rain-particle demo where each rain drop tests for a hit with the physical environment and renders a ripple upon a collision. (c) shows a collision-checking example where a virtual chair is occluded by a physical wall.

Screen-space to/from World-space Conversion
Given a screen point p = [x, y], we look up its depth value in the depth array D_{w×h} (in our case: w = 120, h = 160), then re-project it to a camera-space vertex v_p using the camera intrinsic matrix K [15]:

    v_p = D(p) · K^{-1} [p, 1]                                            (1)

Given the camera extrinsic matrix C = [R | t], which consists of a 3 × 3 rotation matrix R and a 3 × 1 translation vector t, we derive the global coordinates g_p in the world space:

    g_p = C · [v_p, 1]                                                    (2)

Hence, we have both virtual objects and the physical environment in the same coordinate system. Hit tests can be directly performed with ray casting from the camera location (translation) t to the screen point p, and then to a vertex g_p in the world space.

The reverse process is simpler. We first project 3D points with the camera's projection matrix P, then normalize the projected depth values and scale the depth projection to the size of the depth map w × h:

    p̂ = P · [g_p, 1],    p = [ w · (p̂_x + p̂_w) / (2 p̂_w),  h · (p̂_y + p̂_w) / (2 p̂_w) ]     (3)
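The following is a minimal sketch of Equations (1)-(3) in Unity-style C#. It assumes K, the camera-to-world transform C, and a combined view-projection matrix P are supplied by the AR framework each frame; the class and parameter names are illustrative, not the DepthLab API.

```csharp
using UnityEngine;

// Minimal sketch of Equations (1)-(3): screen point -> world vertex and back.
public static class ScreenWorldConversion
{
    // Eq. (1)-(2): re-project a depth-map pixel p with depth D(p) (meters)
    // to a world-space vertex g_p.
    public static Vector3 ScreenToWorld(Vector2Int p, float depthMeters,
                                        Matrix4x4 intrinsicsK, Matrix4x4 cameraToWorld)
    {
        // v_p = D(p) * K^-1 [p, 1]
        Vector3 ray = intrinsicsK.inverse.MultiplyPoint3x4(new Vector3(p.x, p.y, 1f));
        Vector3 cameraSpace = depthMeters * ray;
        // g_p = C [v_p, 1]
        return cameraToWorld.MultiplyPoint3x4(cameraSpace);
    }

    // Eq. (3): project a world vertex back to depth-map coordinates of size w x h.
    // viewProjection is assumed to be the camera's combined view-projection matrix.
    public static Vector2 WorldToDepthMap(Vector3 worldVertex, Matrix4x4 viewProjection,
                                          int w, int h)
    {
        Vector4 clip = viewProjection * new Vector4(worldVertex.x, worldVertex.y,
                                                    worldVertex.z, 1f);
        // Perspective divide, then scale from clip space to pixel coordinates.
        float u = (clip.x + clip.w) / (2f * clip.w);
        float v = (clip.y + clip.w) / (2f * clip.w);
        return new Vector2(w * u, h * v);
    }
}
```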
Computing Normal Vectors
Computing usable normal maps out of low-resolution and coarse depth maps can be challenging. With reliable depth values, we could compute a normal vector n with the cross product of vectors formed by adjacent depth values re-projected to 3D vertices [22]:

    n_p = (v_p − v_{p+(1,0)}) × (v_p − v_{p+(0,1)})                       (4)

However, such methods may yield noisy or invalid results due to depth discontinuities, holes, and outliers in the estimated scene depth, as shown in Figure 6(b). We therefore provide two real-time algorithms, on both the CPU and the GPU (fragment shader), to compute a more stable normal map. Both components estimate the average normal from 4-ring neighborhoods and cull outliers:

Figure 6. Comparison of the output normal maps between (c), computed by Equation 4, and our result (d), yielded by Algorithm 1. (a) shows the reference color image and (b) shows the input depth map computed from (a) with [46].

Algorithm 1: Estimation of the Normal Vector of a Screen Point in DepthLab.
  Input:  A screen point p ← (x, y) and focal length f.
  Output: The estimated normal vector n.
   1  Set the sample radius: r ← 2 pixels.
   2  Initialize the counts along two axes: c_X ← 0, c_Y ← 0.
   3  Initialize the correlation along two axes: ρ_X ← 0, ρ_Y ← 0.
   4  for ∆x ∈ [−r, r] do
   5      for ∆y ∈ [−r, r] do
   6          Continue if ∆x = 0 and ∆y = 0.
   7          Set the neighbor's coordinates: q ← [x + ∆x, y + ∆y].
   8          Set q's distance in depth: d_pq ← |D(p) − D(q)|.
   9          Continue if d_pq = 0.
  10          if ∆x ≠ 0 then
  11              c_X ← c_X + 1.
  12              ρ_X ← ρ_X + d_pq / ∆x.
  13          end
  14          if ∆y ≠ 0 then
  15              c_Y ← c_Y + 1.
  16              ρ_Y ← ρ_Y + d_pq / ∆y.
  17          end
  18      end
  19  end
  20  Set the pixel size: λ ← D(p) / f.
  21  return the normal vector n: [ −ρ_Y / (λ c_Y), −ρ_X / (λ c_X), −1 ].
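A CPU-side sketch of Algorithm 1 is shown below, assuming the pixel-size term λ = D(p)/f and the final component ordering as reconstructed above; the depth lookup is passed in as a placeholder delegate, and the guard for degenerate neighborhoods is our own addition.

```csharp
using UnityEngine;

// CPU sketch of Algorithm 1: average normal from a 4-ring neighborhood with
// outlier culling. GetDepth() stands in for the depth-array lookup; names are
// illustrative rather than the shipped DepthLab code.
public static class NormalEstimation
{
    public static Vector3 EstimateNormal(System.Func<int, int, float> GetDepth,
                                         int x, int y, float focalLengthPx)
    {
        const int radius = 2;                       // sample radius in pixels
        int countX = 0, countY = 0;
        float corrX = 0f, corrY = 0f;
        float depthP = GetDepth(x, y);

        for (int dx = -radius; dx <= radius; ++dx)
        {
            for (int dy = -radius; dy <= radius; ++dy)
            {
                if (dx == 0 && dy == 0) continue;
                float depthQ = GetDepth(x + dx, y + dy);
                float dpq = Mathf.Abs(depthP - depthQ);
                if (dpq == 0f) continue;            // skip holes / identical samples
                if (dx != 0) { ++countX; corrX += dpq / dx; }
                if (dy != 0) { ++countY; corrY += dpq / dy; }
            }
        }

        if (countX == 0 || countY == 0)
            return Vector3.back;                    // degenerate neighborhood: face the camera

        // lambda approximates the metric size of one pixel at this depth.
        float pixelSize = depthP / focalLengthPx;
        Vector3 n = new Vector3(-corrY / (pixelSize * countY),
                                -corrX / (pixelSize * countX), -1f);
        return n.normalized;
    }
}
```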
Collision-aware Placement
Collisions can be computed with both localized depth and surface depth. Localized depth allows developers to project a world-space vertex to the depth map to check for a collision. Surface depth, on the other hand, enables features beyond simple collision checking, such as physics simulations and shadows. For simple tasks, such as placing a virtual object, we recommend using localized depth for better performance.

We use a majority-voting approach to check for collision given a noisy depth map. For example, with the collision mesh (e.g., a simplified mesh such as a bounding box or a capsule) of the virtual object, we can transform its eight corner points into screen space and test whether each corner's depth value is larger than the physical environment's depth value. If the majority of the corner points are visible (i.e., not behind a physical surface), the user may safely place the virtual object in the environment, as shown in Figure 5(c) and in the supplementary video for a live demo.
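A minimal sketch of this majority-vote test is given below, assuming helper delegates for the depth-map projection and lookups; the threshold of five corners and all names are illustrative rather than the library's actual implementation.

```csharp
using UnityEngine;

// Sketch of the majority-voting placement test: project the eight corners of a
// virtual object's bounding box into the depth map and count how many are in
// front of the physical geometry.
public static class PlacementTest
{
    public static bool CanPlace(Bounds worldBounds,
                                System.Func<Vector3, Vector2Int> worldToDepthPixel,
                                System.Func<int, int, float> getEnvironmentDepth,
                                System.Func<Vector3, float> getCameraSpaceDepth)
    {
        int visibleCorners = 0;
        for (int i = 0; i < 8; ++i)
        {
            // Enumerate the eight corners of the axis-aligned bounding box.
            Vector3 corner = worldBounds.center + Vector3.Scale(
                worldBounds.extents,
                new Vector3((i & 1) == 0 ? -1 : 1,
                            (i & 2) == 0 ? -1 : 1,
                            (i & 4) == 0 ? -1 : 1));

            Vector2Int px = worldToDepthPixel(corner);
            float environmentDepth = getEnvironmentDepth(px.x, px.y);
            float cornerDepth = getCameraSpaceDepth(corner);

            // The corner is visible if it lies in front of the physical surface.
            if (cornerDepth <= environmentDepth) ++visibleCorners;
        }
        // Majority vote against a noisy depth map.
        return visibleCorners >= 5;
    }
}
```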
Avatar Path Planning
Figure 7. With localized depth, DepthLab can automatically plan a 3D path for the avatar that avoids a collision with the statue by making the avatar hover over the statue.

AR applications without access to a dense depth map rely on gravity-aligned AR planes to digitally represent the real-world environment. Since these planes only coarsely represent flat horizontal or vertical surfaces, existing AR applications show most virtual characters simply moving along a flat ground plane, even when the real world has uneven terrain, or they rely on the user's guidance [56]. With localized depth, we can allow AR characters to respect the geometry of the physical environment, as shown in Figure 7 and in the supplementary video. First, the character is moved parallel to the ground plane. Then the final position of the character is calculated by casting a ray starting at the top of the character down along the gravity vector. At each ray-casting step, the ray's current position is projected to the depth map. If the projected point has a greater depth value than that of the depth map, a physical surface has intersected the virtual avatar. If so, we set the intersection point as the new character position to avoid the obstacle along the way. We apply the 1€ filter [4] to reduce avatar jitter.
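The sketch below illustrates the downward ray-casting step described above under stated assumptions: the step size, drop limit, and the depth/projection helpers are placeholders rather than DepthLab's actual parameters.

```csharp
using UnityEngine;

// Sketch of the avatar path-planning step: march a probe from above the
// character down along gravity and stop where it passes behind the physical
// depth. The depth and projection helpers are placeholders.
public static class AvatarGrounding
{
    public static Vector3 ResolveHeight(Vector3 candidatePosition, float characterHeight,
                                        System.Func<Vector3, float> getEnvironmentDepthAt,
                                        System.Func<Vector3, float> getCameraSpaceDepth,
                                        float stepMeters = 0.05f, float maxDrop = 3f)
    {
        // Start above the character and march downward along the gravity vector.
        Vector3 probe = candidatePosition + Vector3.up * characterHeight;
        for (float traveled = 0f; traveled < characterHeight + maxDrop; traveled += stepMeters)
        {
            float environmentDepth = getEnvironmentDepthAt(probe); // physical depth at probe's pixel
            float probeDepth = getCameraSpaceDepth(probe);         // probe's own depth from the camera

            // Once the probe is farther from the camera than the physical
            // surface, it has crossed real geometry: place the avatar there.
            if (probeDepth > environmentDepth)
                return probe;

            probe += Vector3.down * stepMeters;
        }
        return candidatePosition;  // no surface found: keep the ground-plane position
    }
}
```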
Other Use Cases
Localized depth can also enable many interesting visual effects, such as virtual ray reflections and rain drops hitting arbitrary surfaces. Given a starting vertex in the world space, a direction, and a marching speed, we can estimate when and where a ray will hit a physical surface. We can also compute a ray's new reflected direction based on the surface normal at the collision point. We showcase an example of rain particles in Figure 5(b) and ray reflection in the supplementary material.

Representing the physical environment with depth meshes can add a detailed understanding of the environment geometry to the AR experience.
Virtual Shadows
Shadows provide a strong depth cue and are essential for
increasing the realism of AR experiences. Conventional
mobile experiences without access to depth often render flat
virtual shadows using AR planes on the real world, which
leads to very noticeable artifacts on non-planar objects.
Real objects need to be represented as meshes to solve this
issue, so that they can be treated as part of the virtual scene.
With surface depth, we render a depth map of the physical scene, represented as a screen-space mesh, from the perspective of the light source, following [49]. Any scene point that has a greater depth value than that of the light source's depth map is considered to be in the dark and is shaded accordingly. This allows the real scene geometry to be rendered from arbitrary viewpoints and allows real and virtual objects to cast shadows on each other.
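The depth comparison described above is standard shadow mapping [49]; a conceptual sketch is shown below, with the light's view-projection matrix, the depth sampler, and the bias value all assumed for illustration.

```csharp
using UnityEngine;

// Conceptual shadow-map test following the description above: a point is
// shaded if it is farther from the light than the depth recorded in the
// light's depth map. All helpers here are illustrative placeholders.
public static class DepthShadow
{
    public static bool IsInShadow(Vector3 worldPoint, Matrix4x4 lightViewProjection,
                                  System.Func<Vector2, float> sampleLightDepth,
                                  float bias = 0.02f)
    {
        Vector4 clip = lightViewProjection * new Vector4(worldPoint.x, worldPoint.y,
                                                         worldPoint.z, 1f);
        // Normalize to [0, 1] texture coordinates in the light's depth map.
        Vector2 uv = new Vector2((clip.x / clip.w) * 0.5f + 0.5f,
                                 (clip.y / clip.w) * 0.5f + 0.5f);
        float pointDepthFromLight = clip.w;          // depth along the light's view axis
        float occluderDepth = sampleLightDepth(uv);  // closest surface seen by the light

        return pointDepthFromLight > occluderDepth + bias;
    }
}
```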
Figure 10. Example use cases of dense depth: (a) relighting effect, (b) aperture effect, (c) fog effect. (a) shows animated virtual light sources illuminating the carpet and spherical chairs. (b) shows a photography app where a user can anchor the focus in 3D space and the background is adaptively blurred out according to the distance to the focal point. (c) shows a fog effect where faraway objects are more difficult to see.

Due to the compute constraints on mobile AR, we recommend that interactions with dense depth be implemented on the GPU with compute or fragment shaders. In a 3D AR scene, the real-world environment is often rendered as a quad texture behind all other scene objects. We use this dense depth to supplement the z-buffer, which allows many screen-space computational photography techniques to be seamlessly applied to both real and virtual scenes.

Anti-aliasing
Figure 11. Comparison between the bilinearly upsampled depth map (a), the post-processed depth map with FXAA (b), and our result with depth-guided FXAA (c). Although traditional FXAA smoothes the close (red) depth around curvature, it fails to straighten the lines in the far (blue) regions. With depth-guided anti-aliasing, we can adaptively smooth the edges in both near and far regions. The input color image with an egg-shaped chair can be referenced from the first image shown in Figure 2.

Since the raw depth map has a much lower resolution (e.g., 160 × 120) than the phone screen (e.g., 3040 × 1040 on a Pixel 4) and bilinear upsampling may yield pixelation artifacts, we provide a variation of the fast approximate anti-aliasing (FXAA) algorithm [29] with depth guidance. For each pixel on the screen, let d denote its corresponding normalized depth value, ranging from 0 to 1. Considering that closer depth pixels typically cover larger screen areas, we employ an adaptive kernel size to filter the depth map D with FXAA:

    σ = s_min + S2(1 − d) · (s_max − s_min),  d ∈ D                       (5)

where S2(x) = 6x^5 − 15x^4 + 10x^3, i.e., the fast smoothstep function introduced in [9]. We empirically determined s_min = 2 and s_max = 3 pixels as good values. We show a comparison before and after anti-aliasing the low-resolution depth map on a per-pixel basis in Figure 11.
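A small sketch of Equation (5) follows, assuming d is normalized so that 0 is near and 1 is far (consistent with wider kernels for closer pixels); the class and method names are illustrative.

```csharp
using UnityEngine;

// Sketch of Equation (5): pick an adaptive filter kernel size from the
// normalized depth d using the fast smoothstep S2, so that closer pixels
// are smoothed with a wider kernel.
public static class DepthGuidedFxaa
{
    const float SigmaMin = 2f;  // s_min in pixels
    const float SigmaMax = 3f;  // s_max in pixels

    static float Smoothstep5(float x)  // S2(x) = 6x^5 - 15x^4 + 10x^3
    {
        x = Mathf.Clamp01(x);
        return x * x * x * (x * (6f * x - 15f) + 10f);
    }

    // normalizedDepth is the current pixel's depth in [0, 1] (0 = near, 1 = far).
    public static float KernelSize(float normalizedDepth)
    {
        return SigmaMin + Smoothstep5(1f - normalizedDepth) * (SigmaMax - SigmaMin);
    }
}
```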
Relighting
Figure 12. Given a dense depth texture, a camera image, and virtual light sources, we alter the lighting of the physical environment by tracing occlusions along the light rays in real time.

ARCore Depth API provides live depth maps to make AR experiences geometry aware. Applications such as Google's SceneViewer, Search, and a number of shopping apps allow users to preview furniture and other objects in their own space. We aim to improve the realism of such experiences beyond world-aware placement, to help users answer questions such as: What will be the effect of this lamp and its lighting in my room? Can I try out different light colors and configurations? Can I view my room and objects under different simulated daytime lighting?

There are a number of challenges to enabling relighting with live depth maps. High-quality normal maps are not available: one could compute the intensity with the Lambertian model by using the dot product between the normal vector and the lighting direction, but in our case this method is not preferred, since the normal map computed from the depth map can suffer from many artifacts, including empty regions around object boundaries and over-exposed areas. Moreover, the lighting condition of the physical environment is complex, and we do not know the intrinsic albedo of the materials in the physical world.

In order to dynamically illuminate the physical scene with virtual light sources, we need to compute the photon intensity at the points where the rays intersect physical surfaces. We implemented a per-pixel relighting algorithm that uses low-resolution depth maps and is based on ray marching and a relighting model introduced in Equation 8 of [6]. Methods based on BRDFs (Bidirectional Reflectance Distribution Functions), such as the Phong or Lambertian models [34], require a normal map, which can contain artifacts around object boundaries in low-texture regions, as shown in Figure 6(d). Instead, we chose ray marching [10] to compute occlusion-aware relighting without normals. For each pixel on the screen, we evaluate the overall intensity by marching rays from the light source to the corresponding vertex in the world coordinate system, which naturally yields shadows for occluded areas.

Algorithm 3: Ray-marching-based Real-time Relighting.
  Input:  Depth map D, camera image I, camera intrinsic matrix K, and L light sources 𝓛 = {L_i, i ∈ L}, each with a location v_L and an RGB intensity φ_L.
  Output: Relighted image O.
   1  for each image pixel p ∈ depth map D in parallel do
   2      Sample p's depth value: d ← D(p).
   3      Compute the corresponding 3D vertex v_p of the screen point p with K: v_p = D(p) · K^{-1} [p, 1].
   4      Initialize the relighting coefficients of v_p in RGB: φ_p ← 0.
   5      for each light L ∈ light sources 𝓛 do
   6          Set the current photon coordinates: v_o ← v_p.
   7          Set the current photon energy: E_o ← 1.
   8          while v_o ≠ v_L do
   9              Compute the weighted distance between the photon and the physical environment: ∆d ← α |v_o^xy − v_L^xy| + (1 − α) |v_o^z − v_L^z|, α = 0.5.
  10              Decay the photon energy: E_o ← 0.95 E_o.
  11              Accumulate the relighting coefficients: φ_p ← φ_p + ∆d E_o φ_L.
  12              March the photon towards the light source: v_o ← v_o + (v_L − v_o) / S, with S = 10, depending on the mobile computing budget.
  13          end
  14      end
  15      Sample the pixel's original color: Φ_p ← I(p).
  16      Apply the relighting effect: O(p) ← γ · |0.5 − φ_p| · Φ_p^{1.5 − φ_p} − Φ_p, with γ ← 3.
  17  end
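A CPU reference sketch of the per-pixel loop in Algorithm 3 is given below (the shipped version runs in a fragment shader). The photon march, energy decay, and the final compositing line mirror the listing as reconstructed above, which carries some uncertainty; the struct, method names, and default parameters are our own.

```csharp
using UnityEngine;

// CPU reference sketch of ray-marching relighting (Algorithm 3).
public static class RayMarchRelighting
{
    public struct PointLight { public Vector3 Position; public Vector3 IntensityRgb; }

    public static Color Relight(Color originalColor, Vector3 vertex, PointLight[] lights,
                                int steps = 10, float alpha = 0.5f, float gamma = 3f)
    {
        Vector3 phi = Vector3.zero;  // accumulated relighting coefficients (RGB)
        foreach (PointLight light in lights)
        {
            Vector3 photon = vertex;
            float energy = 1f;
            for (int i = 0; i < steps; ++i)
            {
                // Weighted lateral/axial distance between photon and light (line 9).
                Vector2 lateral = new Vector2(photon.x - light.Position.x,
                                              photon.y - light.Position.y);
                float dd = alpha * lateral.magnitude
                         + (1f - alpha) * Mathf.Abs(photon.z - light.Position.z);

                energy *= 0.95f;                              // decay the photon energy
                phi += dd * energy * light.IntensityRgb;      // accumulate coefficients
                photon += (light.Position - photon) / steps;  // march towards the light
            }
        }
        // Compositing step (line 16), applied per color channel.
        Color output = originalColor;
        for (int c = 0; c < 3; ++c)
        {
            float p = phi[c];
            output[c] = gamma * Mathf.Abs(0.5f - p) * Mathf.Pow(originalColor[c], 1.5f - p)
                        - originalColor[c];
        }
        return output;
    }
}
```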
3D-anchored Aperture Effect
Given the user's touch position p, we first compute a 3D vertex g_p to anchor the focus point. While the user moves the phone, we recompute the distance between the camera and g_p to set a new focal plane for the wide-aperture effect. We convert the 3D anchor to view space and normalize the depth values with the local minimum and maximum values to emphasize the objects that are in focus. Finally, we apply a Gaussian blur and render the wide-aperture effect on the 2.5D RGBD data.
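The sketch below captures the focal-plane bookkeeping described above: the anchored vertex sets the focal depth each frame, and each pixel's blur weight grows with its normalized distance from that plane. The actual Gaussian pass is omitted, and all names are assumptions rather than the DepthLab implementation.

```csharp
using UnityEngine;

// Sketch of the 3D-anchored aperture logic: keep the anchored world vertex in
// focus and blur each pixel according to its distance from the focal plane.
public static class ApertureEffect
{
    // Re-evaluated every frame as the phone moves.
    public static float FocalPlaneDepth(Vector3 cameraPosition, Vector3 anchorWorldVertex)
    {
        return Vector3.Distance(cameraPosition, anchorWorldVertex);
    }

    // Returns 0 for pixels at the focal plane and approaches 1 for pixels far
    // from it, after normalizing with the local depth range of the scene.
    public static float BlurAmount(float pixelDepth, float focalDepth,
                                   float minSceneDepth, float maxSceneDepth)
    {
        float range = Mathf.Max(maxSceneDepth - minSceneDepth, 1e-3f);
        return Mathf.Clamp01(Mathf.Abs(pixelDepth - focalDepth) / range);
    }
}
```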
Occlusion
Occlusion effects are achieved in a per-pixel manner. Each pixel of the virtual object is tested for whether it is located behind surfaces of the physical environment using the GPU-smoothed depth map, as introduced in [20, 46]. Note that to reduce blending artifacts, we perform a soft blur in depth-boundary regions, as shown in Figure 14(b).
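A per-pixel sketch of this test is shown below, expressed as a soft visibility value so that the transition near depth boundaries is blurred rather than binary; the softness parameter and names are illustrative assumptions.

```csharp
using UnityEngine;

// Per-pixel occlusion sketch: a virtual fragment is hidden when the smoothed
// physical depth at that pixel is closer to the camera, with a soft transition
// near the boundary to reduce blending artifacts.
public static class DepthOcclusion
{
    // Returns the visibility of the virtual fragment in [0, 1]:
    // 1 = fully visible, 0 = fully occluded by physical geometry.
    public static float Visibility(float virtualDepth, float physicalDepth,
                                   float softness = 0.05f)
    {
        // Positive when the physical surface is behind the virtual fragment.
        float signedGap = physicalDepth - virtualDepth;
        return Mathf.Clamp01(0.5f + signedGap / (2f * softness));
    }
}
```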
We profiled a minimum viable application with a point depth example (oriented reticle), a surface depth example (depth mesh generator), and a per-pixel depth example (visualization of the depth map) in Unity 2019.2.

Procedure                                                   | Timings (ms)
DepthLab's overall processing and rendering in Unity        | 8.32
DepthLab's data structure update and GPU uploading          | 1.63
Point depth: normal estimation algorithm                    | < 0.01
Surface depth: depth mesh update algorithm                  | 2.41
Per-pixel depth: visualization with a single texture fetch  | 0.32

Figure 15. Profiling analysis for a minimum DepthLab example application with a point depth example (oriented reticle), a surface depth example (depth mesh generation), and a per-pixel depth example (visualization of the depth map).

We ran the experiments with a handheld Android phone released in 2018 (Pixel 3). The application also records the average rendering time per frame in a sliding window of 500 frames to prevent outliers. We ran the experiments in five different locations of a typical household and report the average profiling timings in Figure 15 (a). The average CPU consumption is 13% and the memory usage is 223.3 MB.
In the second test, we evaluated the performance of the real-time relighting. We set the number of sampled photons to 2, 4, 8, 16, 32, 64, and 128, respectively, and ran the experiment in each setting for five rounds in different locations. We report the mean and standard deviation in Figure 15 (b) and suggest a sampling rate of 4-8 photons per ray for real-time deployment on a Pixel 3 phone. To better understand the effect of the number of samples in Algorithm 3, we offer a comparison between 8 samples and 128 samples with a pair of inputs from the Middlebury Stereo Datasets [37]. Based on the results shown in Figure 16, we recommend a sampling rate of 8 photons or fewer per ray for real-time performance on a Pixel 3 or comparable phones.

In the third test, we evaluated the performance of the wide-aperture effect. Similar to relighting, we ran five rounds of experiments across eight kernel sizes: 3, 11, 21, 31, 41, 51, 61, and 71. The kernel sizes are odd numbers so that the receptive field is always centered at the pixel to be rendered. With larger Gaussian kernels, the out-of-focus regions become more blurry, but the performance degrades significantly. Based on the results shown in Figure 17, we recommend a kernel size of 21 or smaller for real-time performance on a Pixel 3 or comparable phones.

Figure 17. Examples and performance evaluation of the real-time aperture effect. (a) shows a pair of input color and depth images and the corresponding results with Gaussian kernel sizes of 21 and 71. (b) shows a performance evaluation with real-time camera images on a Pixel 3. According to the results, we recommend a kernel size of 11-21 to deploy our real-time wide-aperture effect on a Pixel 3 or comparable mobile devices.

DepthLab as a Reusable Library for Depth Rendering and Interaction
After solving many technical challenges for interacting with real-time depth on a mobile phone, we shared DepthLab with selected partners. In the supplementary video, we show a sped-up recording of an external AR developer demonstrating how DepthLab components, provided as Unity prefabs and reusable scripts, can accelerate the mobile AR development process for their AR games.
LIMITATIONS
While we present a self-contained library for rendering and interaction in mobile augmented reality, our work does have limitations.

DepthLab is designed to enable geometry-aware AR experiences on phones with and without time-of-flight sensors; hence, we have yet to explore more of the design space of dynamic depth. With time-of-flight sensors available on many commercial smartphones, we would like to extend DepthLab with motion sensing, gesture recognition, and pose estimation.

We envision live depth becoming available on many IoT devices with cameras or depth sensors in the future. Each pixel in a depth map could be associated with a semantic label and help computers better understand the world around us and make the world more accessible for us.

CONCLUSION AND FUTURE WORK
In this paper, we present DepthLab, an interactive depth library that aims to empower mobile AR designers and developers to more realistically interact with the physical world using virtual content. Our primary contribution is the open-sourced, reusable, real-time, depth-based Unity library DepthLab, which enables novel AR experiences with increased realism and geometry-aware features.

We described our interaction modules and real-time algorithms building upon three data structure representations of depth: localized depth, surface depth, and dense depth. On commodity mobile phones with a single RGB camera, DepthLab can fuse virtual objects into the physical world with geometry-aware shadows and occlusion effects, simulate collision and paint splatting, and add virtual lighting into the real world.

We open sourced the DepthLab library on GitHub (https://github.com/googlesamples/arcore-depth-lab) to facilitate future research and development in depth-aware mobile AR experiences. We believe that this library will allow researchers, developers, and enthusiasts to leverage the base interactions to build novel, realistic AR experiences on regular smartphones. With the general space of perception in AR growing as an active field, we believe there are a number of possibilities that span persistent geometric reconstructions, novel human-computer interaction, and semantic scene understanding that will add to making AR experiences more delightful on modern phones or head-mounted displays.

ACKNOWLEDGEMENT
We would like to extend our thanks to Barak Moshe and Wendy Yang for providing a number of visual assets and UX guidance for our open-source code, and to Sean Fanello and Danhang Tang for providing initial feedback on the manuscript. We would also like to thank our UIST reviewers for their insightful feedback.

REFERENCES
[1] Jake K Aggarwal and Lu Xia. 2014. Human Activity Recognition From 3D Data: A Review. Pattern Recognition Letters 48 (2014), 70–80. DOI: http://dx.doi.org/10.1016/j.patrec.2014.04.011
[2] Ronald Azuma. 1993. Tracking Requirements for Augmented Reality. Commun. ACM 36, 7 (1993), 50–52. DOI: http://dx.doi.org/10.1145/159544.159581
[3] Wolfgang Birkfellner, Michael Figl, Klaus Huber, Franz Watzinger, Felix Wanschitz, Johann Hummel, Rudolf Hanel, Wolfgang Greimel, Peter Homolka, Rolf Ewers, and others. 2002. A Head-Mounted Operating Binocular for Augmented Reality Visualization in Medicine: Design and Initial Evaluation. IEEE Transactions on Medical Imaging 21, 8 (2002), 991–997. DOI: http://dx.doi.org/10.1109/TMI.2002.803099
[4] Géry Casiez, Nicolas Roussel, and Daniel Vogel. 2012. 1€ Filter: A Simple Speed-Based Low-Pass Filter for Noisy Input in Interactive Systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2527–2530. DOI: http://dx.doi.org/10.1145/2207676.2208639
[5] Keith Cheverst, Nigel Davies, Keith Mitchell, Adrian Friday, and Christos Efstratiou. 2000. Developing a Context-Aware Electronic Tourist Guide: Some Issues and Experiences. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI). ACM, 17–24. DOI: http://dx.doi.org/10.1145/332040.332047
[6] Ruofei Du, Ming Chuang, Wayne Chang, Hugues Hoppe, and Amitabh Varshney. 2019a. Montage4D: Real-Time Seamless Fusion and Stylization of Multiview Video Textures. Journal of Computer Graphics Techniques 1, 15 (2019), 1–34. http://jcgt.org/published/0008/01/01
[7] Ruofei Du, David Li, and Amitabh Varshney. 2019b. Geollery: A Mixed Reality Social Media Platform. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI). ACM, 13. DOI: http://dx.doi.org/10.1145/3290605.3300915
[8] Ruofei Du, David Li, and Amitabh Varshney. 2019c. Project Geollery.com: Reconstructing a Live Mirrored World With Geotagged Social Media. In Proceedings of the 24th International Conference on Web3D Technology (Web3D). ACM, 1–9. DOI: http://dx.doi.org/10.1145/3329714.3338126
[9] David S Ebert, F Kenton Musgrave, Darwyn Peachey, Ken Perlin, and Steven Worley. 2003. Texturing & Modeling: A Procedural Approach. Morgan Kaufmann.
[10] Thomas Engelhardt and Carsten Dachsbacher. 2010. Epipolar Sampling for Shadows and Crepuscular Rays in Participating Media With Single Scattering. In Proceedings of the 2010 ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D). ACM, 119–125. DOI: http://dx.doi.org/10.1145/1730804.1730823
[11] Steven Feiner, Blair MacIntyre, Tobias Höllerer, and Anthony Webster. 1997. A Touring Machine: Prototyping 3D Mobile Augmented Reality Systems for Exploring the Urban Environment. Personal Technologies 1, 4 (1997), 74–81. DOI: http://dx.doi.org/10.1109/ISWC.1997.629922
[12] Martin A Fischler and Robert C Bolles. 1981. Random Sample Consensus: A Paradigm for Model Fitting With Applications to Image Analysis and Automated Cartography. Commun. ACM 24, 6 (1981), 381–395. DOI: http://dx.doi.org/10.1145/358669.358692
[13] Ryan Geiss. 2007. Generating Complex Procedural Terrains Using the GPU. GPU Gems 3 (2007), 7–37. https://dl.acm.org/doi/book/10.5555/1407436
[14] Chris Greenhalgh, Shahram Izadi, Tom Rodden, and Steve Benford. 2001. The EQUIP Platform: Bringing Together Physical and Virtual Worlds. Mixed Reality Laboratory, University of Nottingham, UK (2001).
[15] Richard Hartley and Andrew Zisserman. 2003. Multiple View Geometry in Computer Vision. Cambridge University Press. DOI: http://dx.doi.org/10.1017/CBO9780511811685
[16] Jeremy Hartmann, Christian Holz, Eyal Ofek, and Andrew D Wilson. 2019. RealityCheck: Blending Virtual Environments With Situated Physical Reality. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, 1–12. DOI: http://dx.doi.org/10.1145/3290605.3300577
[17] Iris Herbst, Anne-Kathrin Braun, Rod McCall, and Wolfgang Broll. 2008. TimeWarp: Interactive Time Travel With a Mobile Mixed Reality Game. In Proceedings of MobileHCI 2008, Amsterdam, Netherlands, 235–244. DOI: http://dx.doi.org/10.1145/1409240.1409266
[18] Otmar Hilliges, David Kim, Shahram Izadi, Malte Weiss, and Andrew Wilson. 2012. HoloDesk: Direct 3D Interactions With a Situated See-Through Display. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2421–2430. DOI: http://dx.doi.org/10.1145/2207676.2208405
[19] Tobias Höllerer, Steven Feiner, Tachio Terauchi, Gus Rashid, and Drexel Hallaway. 1999. Exploring MARS: Developing Indoor and Outdoor User Interfaces to a Mobile Augmented Reality System. Computers & Graphics 23, 6 (1999), 779–785. DOI: http://dx.doi.org/10.1016/S0097-8493(99)00103-X
[20] Aleksander Holynski and Johannes Kopf. 2018. Fast Depth Densification for Occlusion-Aware Augmented Reality. ACM Transactions on Graphics (TOG) 37, 6 (2018), 1–11. DOI: http://dx.doi.org/10.1145/3272127.3275083
[21] Shahram Izadi, Mike Fraser, Steve Benford, Martin Flintham, Chris Greenhalgh, Tom Rodden, and Holger Schnädelbach. 2002. Citywide: Supporting Interactive Digital Experiences Across Physical Space. Personal and Ubiquitous Computing 6, 4 (2002), 290–298. DOI: http://dx.doi.org/10.1007/s007790200030
[22] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon. 2011. KinectFusion: Real-Time 3D Reconstruction and Interaction Using a Moving Depth Camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11). ACM, Santa Barbara, California, USA, 559–568. DOI: http://dx.doi.org/10.1145/2047196.2047270
[23] Brett Jones, Rajinder Sodhi, Michael Murdock, Ravish Mehra, Hrvoje Benko, Andrew Wilson, Eyal Ofek, Blair MacIntyre, Nikunj Raghuvanshi, and Lior Shapira. 2014. RoomAlive: Magical Experiences Enabled by Scalable, Adaptive Projector-Camera Units. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology. ACM, 637–644. DOI: http://dx.doi.org/10.1145/2642918.2647383
[24] Brett R Jones, Hrvoje Benko, Eyal Ofek, and Andrew D Wilson. 2013. IllumiRoom: Peripheral Projected Illusions for Interactive Experiences. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 869–878. DOI: http://dx.doi.org/10.1145/2468356.2479531
[25] Hirokazu Kato and Mark Billinghurst. 1999. Marker Tracking and HMD Calibration for a Video-Based Augmented Reality Conferencing System. In Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality (IWAR '99). IEEE, 85–94. DOI: http://dx.doi.org/10.1109/IWAR.1999.803809
[26] David Kim, Otmar Hilliges, Shahram Izadi, Alex D Butler, Jiawen Chen, Iason Oikonomidis, and Patrick Olivier. 2012. Digits: Freehand 3D Interactions Anywhere Using a Wrist-Worn Gloveless Sensor. In Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology. ACM, 167–176. DOI: http://dx.doi.org/10.1145/2380116.2380139
[27] Minseok Kim and Jae Yeol Lee. 2016. Touch and Hand Gesture-Based Interactions for Directly Manipulating 3D Virtual Objects in Mobile Augmented Reality. Multimedia Tools and Applications 75, 23 (2016), 16529–16550. DOI: http://dx.doi.org/10.1007/s11042-016-3355-9
[28] Johnny Lee. 2017. Mobile AR in Your Pocket With Google Tango. SID Symposium Digest of Technical Papers 48, 1 (2017), 17–18. DOI: http://dx.doi.org/10.1002/sdtp.11563
[29] Timothy Lottes. 2011. Fast Approximate Anti-Aliasing (FXAA). NVIDIA Whitepaper (2011).
[30] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. 2020. Consistent Video Depth Estimation. ACM Transactions on Graphics 39, 4 (2020). https://arxiv.org/abs/2004.15021
[31] Paweł Nowacki and Marek Woda. 2019. Capabilities of ARCore and ARKit Platforms for AR/VR Applications. In International Conference on Dependability and Complex Systems. Springer, 358–370. DOI: http://dx.doi.org/10.1007/978-3-030-19501-4_36
[32] Peter Ondrúška, Pushmeet Kohli, and Shahram Izadi. 2015. MobileFusion: Real-Time Volumetric Surface Reconstruction and Dense Tracking on Mobile Phones. IEEE Transactions on Visualization and Computer Graphics 21, 11 (2015), 1251–1258. DOI: http://dx.doi.org/10.1109/TVCG.2015.2459902
[33] Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kowdle, Yury Degtyarev, David Kim, Philip L Davidson, Sameh Khamis, Mingsong Dou, Vladimir Tankovich, Charles Loop, Philip A. Chou, Sarah Mennicken, Julien Valentin, Vivek Pradeep, Shenlong Wang, Sing Bing Kang, Pushmeet Kohli, Yuliya Lutchyn, Cem Keskin, and Shahram Izadi. 2016. Holoportation: Virtual 3D Teleportation in Real-Time. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology (UIST). ACM, 741–754. DOI: http://dx.doi.org/10.1145/2984511.2984517
[34] Matt Pharr, Wenzel Jakob, and Greg Humphreys. 2016. Physically Based Rendering: From Theory to Implementation. Morgan Kaufmann. http://www.pbr-book.org
[35] Jing Qian, Jiaju Ma, Xiangyu Li, Benjamin Attal, Haoming Lai, James Tompkin, John F. Hughes, and Jeff Huang. 2019. Portal-ble: Intuitive Free-Hand Manipulation in Unbounded Smartphone-Based Augmented Reality. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology (UIST '19). ACM, 133–145. DOI: http://dx.doi.org/10.1145/3332165.3347904
[36] Gerhard Reitmayr and Tom Drummond. 2006. Going Out: Robust Model-Based Tracking for Outdoor Augmented Reality. In Proceedings of the 5th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR '06), Vol. 6. Washington, DC, USA, 109–118. DOI: http://dx.doi.org/10.1109/ISMAR.2006.297801
[37] Daniel Scharstein and Richard Szeliski. 2002. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. International Journal of Computer Vision 47, 1-3 (2002), 7–42. DOI: http://dx.doi.org/10.1109/SMBV.2001.988771
[38] Toby Sharp, Cem Keskin, Duncan Robertson, Jonathan Taylor, Jamie Shotton, David Kim, Christoph Rhemann, Ido Leichter, Alon Vinnikov, Yichen Wei, Daniel Freedman, Pushmeet Kohli, Eyal Krupka, Andrew Fitzgibbon, and Shahram Izadi. 2015. Accurate, Robust, and Flexible Real-Time Hand Tracking. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI '15). ACM, 3633–3642. DOI: http://dx.doi.org/10.1145/2702123.2702179
[39] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. 2011. Real-Time Human Pose Recognition in Parts From Single Depth Images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1297–1304. DOI: http://dx.doi.org/10.1109/CVPR.2011.5995316
[40] Rajinder S. Sodhi, Brett R. Jones, David Forsyth, Brian P. Bailey, and Giuliano Maciocci. 2013. BeThere: 3D Mobile Collaboration With Spatial Input. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13). ACM, 179–188. DOI: http://dx.doi.org/10.1145/2470654.2470679
[41] Jie Song, Gábor Sörös, Fabrizio Pece, Sean Ryan Fanello, Shahram Izadi, Cem Keskin, and Otmar Hilliges. 2014. In-Air Gestures Around Unmodified Mobile Devices. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology (UIST '14). ACM, 319–329. DOI: http://dx.doi.org/10.1145/2642918.2647373
[42] Ivan E Sutherland. 1968. A Head-Mounted Three Dimensional Display. In Proceedings of the December 9-11, 1968, Fall Joint Computer Conference, Part I. ACM, 757–764. DOI: http://dx.doi.org/10.1145/1476589.1476686
[43] Jonathan Taylor, Lucas Bordeaux, Thomas Cashman, Bob Corish, Cem Keskin, Toby Sharp, Eduardo Soto, David Sweeney, Julien Valentin, Benjamin Luff, Arran Topalian, Erroll Wood, Sameh Khamis, Pushmeet Kohli, Shahram Izadi, Richard Banks, Andrew Fitzgibbon, and Jamie Shotton. 2016. Efficient and Precise Interactive Hand Tracking Through Joint, Continuous Optimization of Pose and Correspondences. ACM Transactions on Graphics (TOG) 35, 4 (2016), 1–12. DOI: http://dx.doi.org/10.1145/2897824.2925965
[44] Bruce Thomas, Benjamin Close, John Donoghue, John Squires, Phillip De Bondi, Michael Morris, and Wayne Piekarski. 2000. ARQuake: An Outdoor/Indoor Augmented Reality First Person Application. In Digest of Papers, Fourth International Symposium on Wearable Computers. IEEE, 139–146. DOI: http://dx.doi.org/10.1109/ISWC.2000.888480
[45] Bruce Thomas, Wayne Piekarski, David Hepworth, Bernard Gunther, and Victor Demczuk. 1998. A Wearable Computer System With Augmented Reality to Support Terrestrial Navigation. In Proceedings of the 2nd IEEE International Symposium on Wearable Computers (ISWC '98). IEEE, 168–171. DOI: http://dx.doi.org/10.1109/ISWC.1998.729549
[46] Julien Valentin, Adarsh Kowdle, Jonathan T. Barron, Neal Wadhwa, Max Dzitsiuk, Michael Schoenberg, Vivek Verma, Ambrus Csaszar, Eric Turner, Ivan Dryanovski, Joao Afonso, Jose Pascoal, Konstantine Tsotsos, Mira Leung, Mirko Schmidt, Onur Guleryuz, Sameh Khamis, Vladimir Tankovitch, Sean Fanello, Shahram Izadi, and Christoph Rhemann. 2018. Depth From Motion for Smartphone AR. ACM Transactions on Graphics (TOG) 37, 6, Article 193 (2018), 19 pages. DOI: http://dx.doi.org/10.1145/3272127.3275041
[47] Julien Valentin, Vibhav Vineet, Ming-Ming Cheng, David Kim, Jamie Shotton, Pushmeet Kohli, Matthias Nießner, Antonio Criminisi, Shahram Izadi, and Philip Torr. 2015. SemanticPaint: Interactive 3D Labeling and Learning at Your Fingertips. ACM Transactions on Graphics (TOG) 34, 5 (2015), 154. DOI: http://dx.doi.org/10.1145/2751556
[48] Jonathan Ventura, Clemens Arth, Gerhard Reitmayr, and Dieter Schmalstieg. 2014. Global Localization From Monocular SLAM on a Mobile Phone. IEEE Transactions on Visualization and Computer Graphics 20, 4 (2014), 531–539. DOI: http://dx.doi.org/10.1109/TVCG.2014.27
[49] Lance Williams. 1978. Casting Curved Shadows on Curved Surfaces. In Proceedings of the 5th Annual Conference on Computer Graphics and Interactive Techniques. 270–274. DOI: http://dx.doi.org/10.1145/965139.807402
[50] Andrew Wilson, Hrvoje Benko, Shahram Izadi, and Otmar Hilliges. 2012. Steerable Augmented Reality With the Beamatron. In Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology (UIST '12). ACM, 413–422. DOI: http://dx.doi.org/10.1145/2380116.2380169
[51] Andrew D. Wilson and Hrvoje Benko. 2010. Combining Multiple Depth Cameras and Projectors for Interactions On, Above and Between Surfaces. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology (UIST '10). ACM, 273–282. DOI: http://dx.doi.org/10.1145/1866029.1866073
[52] Chi-Jui Wu, Steven Houben, and Nicolai Marquardt. 2017. EagleSense: Tracking People and Devices in Interactive Spaces Using Real-Time Top-View Depth-Sensing. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). ACM, 3929–3942. DOI: http://dx.doi.org/10.1145/3025453.3025562
[53] Robert Xiao, Scott Hudson, and Chris Harrison. 2016. DIRECT: Making Touch Tracking on Ordinary Surfaces Practical With Hybrid Depth-Infrared Sensing. In Proceedings of the 2016 ACM International Conference on Interactive Surfaces and Spaces. ACM, 85–94. DOI: http://dx.doi.org/10.1145/2992154.2992173
[54] Robert Xiao, Julia Schwarz, Nick Throm, Andrew D Wilson, and Hrvoje Benko. 2018. MRTouch: Adding Touch Input to Head-Mounted Mixed Reality. IEEE Transactions on Visualization and Computer Graphics 24, 4 (2018), 1653–1660. DOI: http://dx.doi.org/10.1109/TVCG.2018.2794222
[55] Jackie Yang, Christian Holz, Eyal Ofek, and Andrew D Wilson. 2019. DreamWalker: Substituting Real-World Walking Experiences With a Virtual Reality. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology. ACM, 1093–1107. DOI: http://dx.doi.org/10.1145/3332165.3347875
[56] Hui Ye, Kin Chung Kwan, Wanchao Su, and Hongbo Fu. 2020. ARAnimator: In-Situ Character Animation in Mobile AR With User-Defined Motion Gestures. ACM Transactions on Graphics (TOG) 39, 4 (2020), 83. DOI: http://dx.doi.org/10.1145/3386569.3392404
[57] Tao Yu, Kaiwen Guo, Feng Xu, Yuan Dong, Zhaoqi Su, Jianhui Zhao, Jianguo Li, Qionghai Dai, and Yebin Liu. 2017. BodyFusion: Real-Time Capture of Human Motion and Surface Geometry Using a Single Depth Camera. In Proceedings of the IEEE International Conference on Computer Vision. 910–919. DOI: http://dx.doi.org/10.1109/ICCV.2017.104