
Augmented Reality

Unit 5
Computer Vision for Augmented
Reality
Computer vision for AR is concerned with
electronically perceiving and understanding
imagery from camera sensors that can inform
the AR system about the user and the
surrounding environment.
Computer vision forms the backbone that enables AR systems to analyze real-world visuals, achieve spatial understanding, and overlay virtual graphics accurately.
Processes
• Spatial Mapping and Scene Reconstruction
Using visual inputs from smartphone cameras or specialized depth sensors
on AR headsets, computer vision algorithms construct detailed 3D maps of
the environment.
This process, known as simultaneous localization and mapping (SLAM),
tracks feature points in the scene to model depth, surfaces, and spatial
relationships.
• Lighting Estimation
CV algorithms estimate real-world lighting conditions by analyzing
brightness and shadowing patterns in the environment.
This data is used to modulate the rendering properties of virtual AR objects
to blend in seamlessly with ambient illumination.
• Occlusion Handling
By combining environment mapping with object detection outputs, the AR rendering engine determines where virtual objects should be occluded by real surfaces and where they should remain visible for proper visual coherence.
• Motion Tracking
As users or their device cameras move, CV algorithms continuously
track visual motion across frames.
Marker-based or marker-less techniques identify anchor points to
update the 3D world and AR content positions relative to the
changing viewpoint and device motion.
• Object Detection and Recognition
Robust computer vision models identify real-world objects by
detecting their presence across images or video frames and
classifying them into known categories like people, cars, buildings,
furniture, etc.
• Surface Detection and Meshing
Beyond object recognition, CV models identify real-world planar surfaces such as walls, floors, and tabletops through geometric reasoning and shape reconstruction.
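As a rough illustration of how a planar surface can be detected in practice, the sketch below fits a dominant plane to a 3D point set (for example from a depth sensor or a SLAM point cloud) with a simple RANSAC loop; the point source, thresholds, and function name are illustrative assumptions, not part of any particular AR SDK.

```python
import numpy as np

def fit_plane_ransac(points, iters=200, inlier_dist=0.02, rng=np.random.default_rng(0)):
    """Fit a dominant plane (n, d with n.p + d = 0) to an Nx3 point set via RANSAC."""
    best_inliers, best_plane = 0, None
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        n /= norm
        d = -n @ sample[0]
        dist = np.abs(points @ n + d)        # point-to-plane distances
        inliers = int((dist < inlier_dist).sum())
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (n, d)
    return best_plane, best_inliers
```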
Working of Computer Vision
Applications of Computer Vision in AR
• Consumer
AR filters and lenses in social apps like Instagram and Snapchat use CV
for facial recognition, motion tracking and 3D animation. Gaming
companies leverage CV environment mapping for realistic gameplay
rendering.
• Retail and eCommerce
Virtual try-on apps overlay virtual clothing, cosmetic rendering, and
furniture placement in shoppers' environments through real-time CV
mapping. This enhances buyer confidence through visualization.
• Healthcare
AR surgery guidance systems use CV tracking to overlay rendered
anatomy graphics precisely aligned to the patient's body to help
surgeons during procedures.
Benefits of augmented reality powered by
computer vision
• Intuitive User Experience
• Enhanced Context and Understanding
• Remote Assistance
• Visualization and Previews
CASE STUDY
• Case study on marker tracking: This simple example introduces a basic camera
representation, contour-based shape detection, pose estimation from a homography, and
nonlinear pose refinement.
• Case study on multi-camera infrared tracking: This case study presents a crash course in
multi-view geometry. The reader learns about 2D–2D point correspondences in multiple-
camera images, epipolar geometry, triangulation, and absolute orientation.
• Case study on natural feature tracking by detection: This case study introduces interest
point detection in images, creation and matching of descriptors, and robust computation
of the camera pose from known 2D–3D correspondences (Perspective-n-Point Pose,
RANSAC).
• Case study on incremental tracking: This case study explains how to track features across
consecutive frames using active search methods (KLT, ZNCC) and how incremental
tracking can be combined with tracking by detection (a minimal KLT sketch follows this list).
• Case study on simultaneous localization and mapping: This case study explores pose
computation from 2D–2D correspondences (five-point pose, bundle adjustment). We also
look into modern techniques such as parallel tracking and mapping, and dense tracking
and mapping.
• Case study on outdoor tracking: This case study presents methods for tracking in wide-
area outdoor environments—a capability that requires scalable feature matching and
assistance from sensor fusion and geometric priors.
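As a minimal sketch of the incremental (KLT) tracking referred to above, the snippet below tracks interest points from one grayscale frame to the next with OpenCV's pyramidal Lucas-Kanade tracker; the frame and point variables are illustrative.

```python
import cv2
import numpy as np

def klt_track(prev_gray, cur_gray, prev_pts):
    """Track feature points from prev_gray to cur_gray with pyramidal Lucas-Kanade."""
    cur_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, cur_gray, prev_pts, None,
        winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1                 # keep only successfully tracked points
    return prev_pts[ok], cur_pts[ok]

# Usage (illustrative):
# prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
#                                    qualityLevel=0.01, minDistance=7)
# p_prev, p_cur = klt_track(prev_gray, cur_gray, prev_pts)
```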
1. Marker Tracking
Detecting the four corners of a flat marker in an image from a single
calibrated camera delivers just enough information to recover the
pose of the camera relative to the marker.
The following steps provide an overview of the marker tracking pipeline, which consists of five stages:
1. Capturing an image using a camera with a known mathematical representation
2. Marker detection by searching for quadrilateral shapes
3. Pose estimation from a homography
4. Pose refinement by nonlinear reprojection error minimization
5. AR rendering with the recovered camera pose
The tracking pipeline for square fiducial markers begins with
thresholding the image, followed by quadrilateral fitting and pose
estimation. With the recovered pose, AR rendering can be performed.
Camera Representation
Pinhole Camera Model Overview
• A simplified abstraction of a physical camera.
• Describes how a 3D point in object space is projected onto a 2D point in
image space.

Key Components:
• Center of Projection (c): All 3D points' projections pass through this point.
• Image Plane (Π): The plane where the projected image is formed.
• Principal Point (c′): The point on the image plane directly along the
optical axis.
• Optical Axis: The line connecting the center of projection (c) and the
principal point (c′).
• Focal Length (f): The distance between the center of projection (c) and
the principal point (c′).
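To make the projection concrete, here is a small numerical sketch of the pinhole model: a 3D world point is projected to pixel coordinates through an intrinsic matrix K (focal length f, principal point c′) and a camera pose [R | t]. All numeric values are made up for illustration.

```python
import numpy as np

# Assumed intrinsics: focal length f and principal point (cx, cy)
f, cx, cy = 800.0, 320.0, 240.0
K = np.array([[f, 0, cx],
              [0, f, cy],
              [0, 0,  1]])

# Assumed extrinsics: identity rotation, camera shifted 2 units along the optical axis
R = np.eye(3)
t = np.array([[0.0], [0.0], [2.0]])

# A 3D point in world coordinates
q = np.array([[0.5], [0.25], [0.0]])

# Project: p ~ K (R q + t); divide by depth to get pixel coordinates
p_cam = R @ q + t
p_img = K @ p_cam
u, v = (p_img[:2] / p_img[2]).ravel()
print(f"pixel coordinates: ({u:.1f}, {v:.1f})")
```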
Marker Detection

A popular marker design consists of a black square surrounding a 2D barcode. In this case, only a single corner is marked in black to determine the marker’s orientation.

We start by converting a single-channel input image (typically 8-bit grayscale values) into a black-and-white binary image using a threshold operation. Variations in lighting conditions make it necessary to select a suitable threshold. This can be done manually or automatically. Automatic threshold selection can be done by analyzing the histogram of the image or, even better, by adapting the threshold based on the gradient of the logarithm of the image intensities.

These methods can even deal with strong artifacts such as glossy reflections on the marker. Unfortunately, they are computationally intensive. A cheaper method is to determine the threshold locally (e.g., in a 4 × 4 sub-area) and interpolate it linearly over the image.
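A minimal sketch of the thresholding and quadrilateral-fitting stages using OpenCV is shown below; it assumes a grayscale input frame and skips barcode decoding and orientation disambiguation. The block size, area threshold, and file name are illustrative choices.

```python
import cv2
import numpy as np

def detect_marker_quads(gray):
    """Find candidate quadrilaterals (potential square markers) in a grayscale image."""
    # Local (adaptive) thresholding copes better with uneven lighting than a global threshold
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY_INV, blockSize=31, C=7)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    quads = []
    for c in contours:
        # Approximate the contour by a polygon; keep convex 4-corner shapes of reasonable size
        approx = cv2.approxPolyDP(c, 0.03 * cv2.arcLength(c, True), True)
        if len(approx) == 4 and cv2.isContourConvex(approx) and cv2.contourArea(approx) > 500:
            quads.append(approx.reshape(4, 2).astype(np.float32))
    return quads

# Usage (hypothetical file name):
# gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
# corners = detect_marker_quads(gray)
```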
Pose Estimation from Homography
The four corners of a flat marker are an instance of a frequently encountered geometric situation: the known points q_i are constrained to lie on a plane.
We assume that the marker defines the plane Π′: q_z = 0 in world coordinates, and that the marker corners have the coordinates [0 0 0]^T, [1 0 0]^T, [1 1 0]^T, and [0 1 0]^T.
We can then express a 3D point q on Π′ as a homogeneous 2D point q′ = [q_x q_y 1]^T. Mapping from one plane to another can be mathematically modeled as a homography, defined by a 3 × 3 matrix H.
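The following sketch shows one common way to recover the pose from the plane-to-image homography: compute H from the four corner correspondences, remove the intrinsics, and rebuild a rotation matrix from the first two columns of the result. It assumes a known intrinsic matrix K; the normalization and sign handling follow the standard decomposition, not a specific library routine.

```python
import cv2
import numpy as np

def pose_from_homography(corners_2d, K, marker_size=1.0):
    """Sketch: estimate camera pose [R | t] from the 4 detected marker corners."""
    # Marker corners on the plane z = 0, in marker (world) coordinates
    obj = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], np.float32) * marker_size
    H, _ = cv2.findHomography(obj, corners_2d)      # plane-to-image homography
    A = np.linalg.inv(K) @ H                        # remove the intrinsics
    scale = 1.0 / np.linalg.norm(A[:, 0])
    if A[2, 2] * scale < 0:                         # marker must lie in front of the camera
        scale = -scale
    # The first two columns of A are (scaled) rotation columns, the third is the translation
    r1, r2, t = scale * A[:, 0], scale * A[:, 1], scale * A[:, 2]
    r3 = np.cross(r1, r2)
    R = np.column_stack((r1, r2, r3))
    # Re-orthonormalize R, since the estimate is only approximately a rotation
    U, _, Vt = np.linalg.svd(R)
    return U @ Vt, t
```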
Pose Refinement
Pose estimation cannot always be computed directly from imperfect point correspondences with the desired accuracy. Therefore, the pose estimate is refined by iteratively minimizing the reprojection error.
When a first estimate of the camera pose is known, we minimize the displacement between the known 3D points projected into the image using that pose and their observed image locations.
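A minimal sketch of this refinement step, assuming a recent OpenCV (4.1 or later) that provides solvePnPRefineLM for Levenberg-Marquardt minimization of the reprojection error; the zero distortion coefficients are an assumption for simplicity.

```python
import cv2
import numpy as np

def refine_pose(obj_pts, img_pts, K, R, t):
    """Nonlinear (Levenberg-Marquardt) refinement of an initial pose estimate."""
    rvec, _ = cv2.Rodrigues(R)               # rotation matrix -> rotation vector
    tvec = np.asarray(t, np.float64).reshape(3, 1)
    dist = np.zeros(5)                        # assume no lens distortion for simplicity
    # Iteratively minimizes the reprojection error of obj_pts against img_pts
    rvec, tvec = cv2.solvePnPRefineLM(np.asarray(obj_pts, np.float64),
                                      np.asarray(img_pts, np.float64),
                                      K, dist, rvec, tvec)
    R_refined, _ = cv2.Rodrigues(rvec)
    return R_refined, tvec
```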
2. Multiple-Camera Infrared Tracking

Multiple-camera infrared tracking in AR refers to a system where several infrared (IR) cameras are used simultaneously to track the position and orientation of objects or users in a physical space.
These cameras detect IR markers or emitters attached to objects or users, enabling accurate real-time tracking of their movements.
Working
• Infrared Markers/Emitters: Objects or users
are equipped with IR markers that reflect or
emit infrared light.
• Multiple Cameras: Several IR cameras capture
the infrared signals from different angles.
• 3D Positioning: Using data from multiple
camera views, the system triangulates the
precise location and orientation of the tracked
objects or users in 3D space.
Pipeline
The stereo camera tracking pipeline consists of the following
steps:
1. Blob detection in all images to locate the spheres of the rigid
body markers
2. Establishment of point correspondences between blobs using
epipolar geometry between the cameras
3. Triangulation to obtain 3D candidate points from the multiple
2D points
4. Matching of 3D candidate points to 3D target points
5. Determination of the target’s pose using absolute orientation
Blob Detection
The blob detection, which is sometimes done directly on
the camera hardware, is very simple:
• The binary input picture is scanned for connected
regions consisting of white pixels.
• Regions that are too small or too elongated are rejected.
• For the remaining regions, the centroid is computed and
returned as a candidate point. Because all spheres have
similar appearance, the data association required for
target identification must be resolved in a subsequent
step.
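A hedged sketch of this blob-detection step using OpenCV's connected-components analysis; the area and elongation thresholds are illustrative values.

```python
import cv2
import numpy as np

def detect_blobs(binary, min_area=10, max_elongation=2.0):
    """Return centroids of roughly circular white blobs in a binary IR image."""
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    candidates = []
    for i in range(1, n):                                    # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        w, h = stats[i, cv2.CC_STAT_WIDTH], stats[i, cv2.CC_STAT_HEIGHT]
        if area < min_area:                                  # reject regions that are too small
            continue
        if max(w, h) / max(min(w, h), 1) > max_elongation:   # reject elongated regions
            continue
        candidates.append(centroids[i])                      # centroid as a 2D candidate point
    return np.array(candidates, np.float32)
```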
Establishing Point Correspondences
The candidate 2D points in the two images can be related using
epipolar lines.
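As an illustration, the sketch below pairs candidate points across the two views by their distance to the corresponding epipolar line, assuming the fundamental matrix F between the calibrated cameras is already known from calibration; the distance threshold is an example value.

```python
import numpy as np

def match_by_epipolar_distance(pts1, pts2, F, max_dist=1.5):
    """Pair candidate blobs across two views by point-to-epipolar-line distance."""
    matches = []
    for i, p in enumerate(pts1):
        # Epipolar line in image 2 for point p in image 1: l = F * p_homogeneous
        l = F @ np.array([p[0], p[1], 1.0])
        # Normalize so that |a*x + b*y + c| is the point-to-line distance
        l /= np.hypot(l[0], l[1])
        d = np.abs(pts2 @ l[:2] + l[2])
        j = int(np.argmin(d))
        if d[j] < max_dist:
            matches.append((i, j))
    return matches
```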
Triangulation from Two Cameras
The rays through the camera centers c_i and the image-plane coordinates p_i may not intersect exactly in space. We therefore take the midpoint of the shortest segment connecting the two rays as the triangulated 3D point.
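A small sketch of this midpoint triangulation, assuming each camera provides its center c_i and a viewing-ray direction d_i obtained by back-projecting the 2D blob through the calibrated camera model:

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between two rays c_i + s_i * d_i."""
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    b = c2 - c1
    a, e, f = d1 @ d1, d1 @ d2, d2 @ d2
    denom = a * f - e * e                      # near zero if the rays are almost parallel
    s1 = (f * (d1 @ b) - e * (d2 @ b)) / denom
    s2 = (e * (d1 @ b) - a * (d2 @ b)) / denom
    p1 = c1 + s1 * d1                          # closest point on ray 1
    p2 = c2 + s2 * d2                          # closest point on ray 2
    return 0.5 * (p1 + p2)
```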
Matching Targets Consisting of Spherical
Markers
In Augmented Reality (AR) systems utilizing multiple-camera infrared (IR)
tracking, spherical markers are often used as targets for tracking.
These spherical markers, typically reflective or emitting infrared light, are
placed on objects or users.
The cameras capture these markers to determine the position and orientation
of the target in the 3D space.
Why this Marker?
Uniform Appearance: Spherical markers provide a uniform appearance from
any angle, making them easy to track with multiple cameras.
Simplified Detection: The circular shape simplifies the detection process in
each camera’s 2D view, even under different lighting conditions.
Absolute Orientation
• It requires at least three points.
• The centroid of the three points can be used to
determine the translation from the reference coordinate
system to the measurement coordinate system.
• The rotation is computed in two parts.
• First, we define a rotation from the measurement coordinate system into an intermediate coordinate system defined by the points q_i.
• Second, we do the same for the points r_i.
• Finally, we concatenate the two rotations to obtain R.
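One standard way to compute such an absolute orientation between the measured points q_i and the target points r_i is the SVD-based least-squares solution sketched below; this is a common alternative formulation, not necessarily the exact two-rotation construction described above.

```python
import numpy as np

def absolute_orientation(q, r):
    """Find R, t such that r_i ≈ R @ q_i + t, in the least-squares sense (SVD-based)."""
    q_mean, r_mean = q.mean(axis=0), r.mean(axis=0)
    Q, Rr = q - q_mean, r - r_mean            # center both point sets
    H = Q.T @ Rr                              # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = r_mean - R @ q_mean                   # translation follows from the centroids
    return R, t
```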
3. Natural Feature Tracking by Detection
• This case study considers monocular tracking with a single camera.
• The wide availability of mobile devices with built-in
cameras makes this the preferred tracking hardware for
mobile AR.
• Using more than one camera increases the hardware
cost and computational demands.
• The restriction to a single camera implies that the
objective of tracking is to determine correspondences
between 2D points in the camera image and known 3D
points in the world. We can try to determine such 2D–3D
correspondences either densely or sparsely.
• Dense matching means that we try to find a
correspondence for every pixel in an image.
• Objects with poor texture, repetitive structures, and reflective surfaces, such as metal, can be handled by dense matching.
• Sparse matching means that we try to find a
correspondence for a small but sufficient
number of salient interest points selected
from the image.
Pipeline
A typical pipeline for tracking by detection of sparse interest points consists of five stages:
1. Interest point detection
2. Descriptor creation
3. Descriptor matching
4. Perspective-n-Point camera pose determination
5. Robust pose estimation
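A compact sketch of this pipeline with ORB features and OpenCV is given below. It assumes a reference target whose descriptors ref_desc and corresponding 3D point positions ref_pts3d are known in advance, plus the camera intrinsics K; all names are illustrative.

```python
import cv2
import numpy as np

def track_by_detection(frame_gray, ref_desc, ref_pts3d, K):
    """Detect interest points, match descriptors, and robustly estimate the camera pose."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp, desc = orb.detectAndCompute(frame_gray, None)           # stages 1-2: detection + descriptors
    if desc is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(ref_desc, desc)                      # stage 3: descriptor matching
    if len(matches) < 6:
        return None
    obj = np.float32([ref_pts3d[m.queryIdx] for m in matches])   # known 3D points
    img = np.float32([kp[m.trainIdx].pt for m in matches])       # observed 2D points
    # Stages 4-5: PnP inside a RANSAC loop for robustness against outlier matches
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj, img, K, None,
                                                 reprojectionError=4.0)
    return (rvec, tvec, inliers) if ok else None
```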
Augmentation Through Annotations

Annotations in AR: These are digital labels or pieces of information (e.g., text, images, or audio) that are linked to physical objects or locations in the user's real-world view.
AR applications bring abstract information to life by associating it with real-world geometry or appearance.
These annotations help users better understand and remember their surroundings.
Collaborative and Social Annotations:
Modern AR browsers allow users to contribute and
share annotations.
For example, someone can add a note to a landmark or
object, which other users can see when they view the
same location.
This collaborative aspect turns AR into a social
computing tool, where users participate in the creation
and exchange of knowledge about the environment.
The idea is called "augment-able reality," coined by
Rekimoto et al. (1998), referring to the ability of users
to actively contribute to the augmentation of reality
with annotations rather than being passive consumers.
Recognition and Annotation
Recognizing Objects in AR: To place an annotation on an object,
the AR system must first recognize the object in the user’s view.
Once an object is recognized, the associated digital information
(annotations) is presented to the user. This might involve
displaying text, images, or even playing audio clips that provide
more context about the object.
User Contributions: Users are not limited to consuming the
information presented by AR systems. They can actively
contribute by adding their own annotations, which are stored
on a server and indexed by location. This shared information
then becomes accessible to other users who interact with the
same objects or locations.
Challenges in Annotation Placement

Need for 3D Models or Image-Based Representations:
In the real world, it is not always guaranteed that AR systems will have pre-existing tracking models for the objects that users want to annotate. Before placing an annotation on an object, the system needs a way to reliably recognize and track the object later.
SLAM (Simultaneous Localization and
Mapping) Techniques
What is SLAM?:
SLAM is a technique used by AR systems to build a map of an environment
while simultaneously tracking the system’s location within that map.
SLAM helps in capturing and understanding the environment, enabling the
AR system to place annotations at precise locations.

SLAM for Annotation Placement: In AR applications, especially in indoor environments, SLAM can create a sparse map of the environment.
This map contains basic information about the layout and geometry of the space, which the AR system can use to register and place annotations.
The sparse map makes it easier to locate objects or features for annotation without needing a full, detailed reconstruction of the environment.
Applications in AR
• Enhancing the precision of AR applications for
gaming, training simulations, and interactive
exhibits.

• Used in motion capture systems and AR displays to track the user's movements or external objects in the augmented space.
Navigation
AR navigation can enhance exploration of the real world, facilitate
wayfinding, and support viewpoint control during the execution of
real world tasks.
Navigation, that is, moving in one's environment, encompasses travel, wayfinding, and exploration.
• Travel is the motor activity necessary to control one’s position
and orientation.
• Wayfinding is the higher-level cognition of a user, such as
understanding one’s current location, planning a path to another
location, and updating a mental map of the environment.
• Exploration is concerned with understanding and surveying an
unknown environment and its artifacts.
Contd….
Wayfinding and exploration require acquiring spatial knowledge and structuring it into a mental map.
Spatial knowledge can be acquired from various sources.
Primary and secondary sources
The environment itself is the primary source: Humans
continuously extract spatial information from their
observations of the environment.
All other sources, such as maps, pictures, and videos,
are secondary sources.
Spatial knowledge
Spatial knowledge can be organized into the following categories:
• Landmarks are distinctive objects that help a person understand the environment's structure and their own location.
• Routes are sequences of actions needed to navigate from a given start point to a given end point.
• Nodes are decision-making points, where users can choose among paths; route planning and wayfinding decisions are usually made in relation to nodes.
• Districts are larger areas in the environment, such as parks or shopping areas.
• Edges partition the environment. For example, a road or a river requires special means or locations for crossing: a pedestrian will classify a street as an edge, whereas a driver will classify a street as a route.
• Survey knowledge primarily consists of global spatial relationships between landmarks and routes.
