Foresight Redefining Stereo Ebook 2022


eBook

Redefining Stereo
Table of Contents
Introduction 3

A Picture is Worth a Thousand Words - How Much is a Point Cloud Worth? 4

Why Single-View Cameras are Not Sufficient 5

What is Stereo Vision? 7

Stereo Vision: The Foresight Way 9

The Benefits of Breaking the Rules 11

eBook | Redefining Stereo 2


Introduction
Autonomous driving technology is not just for passenger cars. It is a technology that is
crossing verticals and changing the face of commercial automobiles like taxis and trucks
and also has applications in agriculture and other heavy machinery. As autonomous
vehicles become more prominent wherever they are - on the road, on farms, on
construction sites, etc. - the need to ensure the safety of the drivers, passengers, and the
surrounding environment is paramount.

The vehicles of the future are chock full of sensors, cameras, and technology that enable
the car to “see” what is on the road ahead and the surrounding area in order to alert the
driver or automatically brake or swerve to avoid an obstacle. Even the smallest inaccuracy
can result in disaster, so it is crucial to ensure that the obstacle detection systems can
accurately pinpoint any potential obstacle - from an object on the road to another vehicle
to an animal or human being - regardless of the lighting or weather conditions. Many
existing solutions rely on a combination of radar, LiDAR, cameras, and other sensors, but a
magic bullet has yet to be found. Foresight’s patented use of stereoscopic technology
combines the best of all worlds, allowing for the use of either visible-light or thermal long-
wave infrared cameras to provide accurate detection of both classified and unclassified
objects. Traditional stereo vision solutions are limited by the placement of the cameras
and the need for them to be in perfect parallel alignment. With Foresight’s technology, this
requirement is eliminated, opening up a huge array of new possibilities.

This cutting-edge technology offers a cost-effective solution that provides highly accurate
object detection even at long-range distances regardless of the lighting and weather
conditions.



A Picture is Worth a Thousand Words -
How Much is a Point Cloud Worth?
Whether for a robot, a drone, an agricultural tool, or a passenger car, in order to accurately detect
objects in the way of an autonomously driven vehicle, the system has to create a detailed mapping of
the surrounding area. This means locating the object in real life and pinpointing its location accurately
on a virtual map. Solutions abound that make it easy to do this on a horizontal and vertical axis, but
when it comes to estimating the distance and depth of each point, that’s when things get complicated.

Existing solutions add an additional LiDAR or radar component to estimate the distance when objects
are detected by a camera. This not only adds cost but, if disparate components are used to capture
an image and measure distance, the results need to be “fused” together. This is a complex technical
process that brings with it a host of challenges and complications.

A standard two-dimensional static image simply does not provide enough information for an
autonomous driving system to be able to successfully and accurately avoid hitting an obstacle every
time. For example, an accident in an autonomous vehicle was attributed to the vehicle being unable
to distinguish between a white car on the road and a cloud. The autonomous vehicle hit the other car,
causing a fatal accident that could have been avoided if the obstacle detection system had been able
to provide a higher level of accuracy.

As the systems in autonomous vehicles will need to make more decisions, faster, and without
human intervention, they must transition from relying on static images to using a 3D
point cloud. A 3D point cloud is a depth map that represents the x, y, and z coordinates of an object,
accounting for the depth of the object and making it possible to detect objects more accurately as well
as to conduct terrain analysis and sensor fusion.
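As a concrete illustration of what a point cloud is, the back-projection from a dense depth map to x, y, and z coordinates can be sketched with the standard pinhole camera model. This is a generic sketch; the intrinsics fx, fy, cx, cy are assumed calibration values, not parameters of any particular system:

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a dense depth map into a 3D point cloud.

    Standard pinhole-camera geometry: a pixel (u, v) with depth z
    maps to x = (u - cx) * z / fx and y = (v - cy) * z / fy.
    `depth` is a list of rows of per-pixel depths in meters.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip pixels with no valid depth measurement
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points
```

Each resulting (x, y, z) triple is one point of the cloud, which is what enables terrain analysis and fusion with other sensors.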

Creating a 3D point cloud requires cameras. While some solutions use single-view cameras, the
level of accuracy they enable is not high enough. Foresight’s technology uses stereo vision (or
two cameras) to overcome many of the accuracy challenges and to be able to detect even smaller
obstacles at larger distances.



Why Single-View Cameras
are Not Sufficient
Single-view depth refers to methods that rely on just one
image in order to estimate the depth of an object. This
means that one camera is placed on the vehicle and then
the system uses prior knowledge such as known sizes of
objects in the real world or typical scenes like a residential
street or highway in order to estimate the depth or the
distance of an object. In some cases, this can work well.
For example, if a particular car model is detected and the
system knows that the model is two meters tall, basic
geometry can be used to calculate the distance to the car
from the object's height in pixels in the image.
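The "basic geometry" here is similar triangles. A minimal sketch, with an assumed focal length in pixels (an illustrative value, not taken from any real system):

```python
def distance_from_known_height(focal_px, real_height_m, pixel_height):
    """Estimate distance to an object of known real-world height.

    Similar triangles: pixel_height / focal_px = real_height / distance,
    so distance = focal_px * real_height / pixel_height.

    Note the failure mode described in the text: a 0.1 m toy car
    spanning the same pixels as a 2 m car is indistinguishable; the
    formula simply assumes the known height.
    """
    if pixel_height <= 0:
        raise ValueError("object must span at least one pixel")
    return focal_px * real_height_m / pixel_height
```

With an assumed 1000 px focal length, a two-meter car spanning 50 pixels would be estimated at 40 meters.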

There are challenges with this method, however. It does not offer a way to distinguish between an object that is
small and close as opposed to an object that is larger
but farther away. If, for example, a miniature toy car is
put in front of the camera, the algorithm will think it is a
normal-sized car that is far away. Recent advances have
offered partial solutions to this challenge by leveraging
the ability of neural networks to generalize about the
pictured scene, filling in the gaps between known
objects to create a dense depth map. As long as the
image at hand is a typical street view or another image
that only includes known objects, this solution works. But,
as soon as any unknown objects that the system was
not trained to recognize come into view, there will be a
problem. Because it is impossible to enumerate in advance
every potential obstacle an autonomous vehicle might
encounter, this type of system cannot offer the required
level of safety for autonomous vehicles.



In order to truly estimate depth in a useful and safe way, two different perspectives on the
same image are needed. This can be done using one camera that changes position and
captures images at different times. Key points that are identified in multiple images are
matched to each other and compared to the relative position of each camera at the time
the image was taken. This information is then used to determine the location of the image
in the real world in an approach called Structure from Motion (SFM).
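The geometric core of this matching step, locating a point from two viewpoints, can be sketched in 2D. This is a simplified illustration; real Structure from Motion works in 3D with camera poses that are themselves estimated, not known:

```python
import math

def triangulate_2d(cam1, angle1, cam2, angle2):
    """Locate a world point from two camera positions and the bearing
    at which each camera observes it - the core triangulation step of
    Structure from Motion, reduced to 2D for illustration.

    Each observation defines a ray; the point is the rays' intersection.
    Solves cam1 + t1*d1 = cam2 + t2*d2 for t1.
    """
    d1 = (math.cos(angle1), math.sin(angle1))
    d2 = (math.cos(angle2), math.sin(angle2))
    # 2D cross product of the ray directions; zero means parallel rays.
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; no unique intersection")
    dx, dy = cam2[0] - cam1[0], cam2[1] - cam1[1]
    t1 = (dx * d2[1] - dy * d2[0]) / denom
    return (cam1[0] + t1 * d1[0], cam1[1] + t1 * d1[1])
```

If the point moves between the two exposures, the two rays no longer intersect at its true position, which is exactly the moving-object weakness described next.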

While SFM works well for creating static maps, it is not a reliable source for estimating the
distance of moving objects such as cars and pedestrians. This is because a moving object
will be in a different position in each of the two frames that were captured at different
times, making it difficult to estimate its actual position or depth. The result is an inaccurate
distance measurement that can put the safety of autonomous car drivers at risk.

To get the most accurate object detection and depth map, stereo vision is required.

The setup of the general structure-from-motion problem: a single camera takes snapshots of the same
object from different angles, and the world positions of the points are then reconstructed. The major
difficulty is avoiding points from moving objects. From the course notes for Stanford’s CS231A course on
computer vision: https://github.com/kenjihata/cs231a-notes



What is Stereo Vision?
Similar to the way humans have two eyes that see the same object at the same time in order to provide
depth perception, stereo vision cameras use two cameras to capture the same object at the same
time. The major advantages of using two cameras rather than one are that both cameras
photograph the same scene from different angles simultaneously, eliminating the moving-objects
challenge described above, and that the relative position of the cameras to each other remains the same
throughout the entire capture process and can be measured exactly, unlike in a single-camera
setup, where the relative position is only an estimate.

When using stereo vision, two cameras are set up facing the same direction at an accurately-measured
distance from each other known as the baseline. The cameras are located in a way that maximizes
their overlapping field of view in order to minimize the computational power needed to match up the
points located in each image and accurately estimate the depth of each object. In traditional stereo
vision solutions, the cameras must be positioned on the same horizontal axis, parallel to each other.
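Once the two views are matched, depth follows directly from the disparity (the horizontal pixel shift of a point between the two images) via the standard rectified-stereo relation Z = f * B / d. This is the generic textbook formula, not anything proprietary; the numbers below are illustrative:

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Standard rectified-stereo depth formula: Z = f * B / d.

    focal_px     - focal length in pixels
    baseline_m   - distance between the two cameras in meters
    disparity_px - horizontal shift of the point between the images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (point at infinity)")
    return focal_px * baseline_m / disparity_px
```

For example, with an assumed 700 px focal length and a 12 cm baseline, a 7-pixel disparity corresponds to a depth of 12 meters.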

Choosing the baseline - the distance between the cameras - is an important decision and involves
a tradeoff. A larger baseline, meaning a longer distance between the two cameras, will improve the
accuracy of the distance estimation in the long range. This larger baseline, however, will also impact
the ability to accurately estimate the depth of closer range objects because of occlusions and different
view perspectives. To illustrate this tradeoff with extreme examples: if one were to send a satellite
to observe Earth from outer space, it would make sense to have as large a baseline as possible. If, on
the other hand, the objective is to construct a robot that needs to see 5 meters (16 feet) ahead, the
baseline only has to be a few centimeters. The purpose of the images must be taken into consideration
when determining the baseline.
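The tradeoff can be quantified with the standard first-order stereo depth-error approximation (a generic estimate derived from Z = f * B / d, with illustrative numbers, not figures from any vendor):

```python
def depth_error(z_m, focal_px, baseline_m, disparity_err_px=1.0):
    """Approximate depth uncertainty of a rectified stereo pair.

    Differentiating Z = f * B / d gives |dZ| ~= Z**2 * dd / (f * B):
    the error grows quadratically with distance and shrinks linearly
    as the baseline widens - the long-range side of the tradeoff.
    """
    return z_m ** 2 * disparity_err_px / (focal_px * baseline_m)
```

With an assumed 1000 px focal length and one pixel of matching error, widening the baseline from 20 cm to 1 m cuts the error at 50 m from 12.5 m down to 2.5 m.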



While stereo vision provides a better and more accurate rendering of a real-world
environment and any obstacles that may appear, most solutions on the market today are
limited by their use of only visible-light cameras and are restricted by their positioning.
Visible-light cameras can only be used in well-lit areas and are susceptible to challenges
caused by glare. Because autonomous vehicles must be able to be used when it is
dark outside and/or in extreme weather conditions, they cannot rely only on visible-light
cameras. In addition, the positioning requirements are such that most solutions have to use
a small baseline, which degrades distance accuracy at long range. Autonomous
vehicles need a solution that will allow them to “see” obstacles even at farther distances.

This risk of inaccuracy poses a challenge for using stereo vision to ensure the safety of
drivers (and people and animals in the vicinity) of autonomous vehicles. Foresight has
developed a revolutionary approach to stereo vision, solving the positioning challenge and
creating a highly accurate 3D image of obstacles, visibility, and terrain.

Stereo vision setup. Two pinhole cameras facing approximately forward, situated on a mutual baseline. O and O’
represent the focal points of the cameras. e and e’ denote the epipoles, the projections of the other camera’s focal
point. Rectification (a homographic transformation) is performed to bring the epipolar lines (marked in red dashes)
parallel (marked by red lines). From the course notes for Stanford’s CS231A course on computer vision:
https://github.com/kenjihata/cs231a-notes



Stereo Vision: The Foresight Way
Most stereo vision solutions require the cameras to be set up along the same straight line (or horizontal
axis). Although each camera has a slightly different view of the scene, only a relatively simple algorithm
is needed to match up the pixels from the two images and create an accurate 3D
rendering. Because the cameras are parallel to each other, the algorithm only has to search in one
direction - along the horizontal x-axis - to match the pixels from one image to the other.
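That one-dimensional search can be illustrated with a toy sum-of-absolute-differences matcher over a single rectified scanline (a sketch of the classic approach, not Foresight's algorithm):

```python
def best_disparity(left_row, right_row, x, patch=2, max_disp=16):
    """For pixel x on a rectified scanline, search only along the
    horizontal axis of the right image for the best-matching patch,
    scored by sum of absolute differences (SAD).

    Because the cameras are parallel, the match for a left-image pixel
    must lie at some x - d in the right image, so a 1D scan suffices.
    """
    lo, hi = x - patch, x + patch + 1
    target = left_row[lo:hi]
    best_d, best_cost = 0, float("inf")
    for d in range(0, max_disp + 1):
        if lo - d < 0:  # candidate window would fall off the image
            break
        cand = right_row[lo - d:hi - d]
        cost = sum(abs(a - b) for a, b in zip(target, cand))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

When the cameras are not parallel, this cheap scan no longer works and the full 2D search described next is required.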

If, however, the cameras were to be moved and were no longer in parallel alignment, then a process
called optical flow comes into play. In this process, the entire image has to be searched in order
to match up the pixels and identify the same objects in each image. This presents a challenge and
requires complex and costly computational processes.

In the past, attempts were made to overcome this challenge. One approach was to calculate optical flow
based on variations of the brightness constancy assumption. In this method, the system would look for
small patches in the source image that looked the same as in the target image, only shifted. The problem,
however, was the inability to handle scaling, rotation, morphing, and illumination changes,
resulting in inaccurate renderings. To compensate, another method was developed involving key-point
algorithms. Using this process, key points were chosen from the two images, and then feature vectors
were extracted using one of a variety of feature-detecting methods. Then, the feature vectors from the
points of one image were matched to the set of points from the other image. Any points with a single
close neighbor were paired and then triangulation was performed on the matched pairs in order to
calculate the distance, producing a sparse depth map. Unfortunately, because the resulting depth map
is sparse, it is still not accurate enough to fulfill the safety requirements of autonomous vehicles. And
this is among the reasons why OEMs and Tier 1s have been hesitant to adopt stereo vision technology.
The good news is that a new age is dawning with Foresight at the forefront.
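The key-point pairing step described above - keeping only points with a single close neighbor - can be sketched as a nearest-neighbor search with a ratio test. The descriptors and the 0.8 threshold here are illustrative assumptions; real systems use feature vectors from detectors such as SIFT or ORB:

```python
def match_features(desc_a, desc_b, ratio=0.8):
    """Match feature vectors between two images.

    For each descriptor in image A, find its nearest and second-nearest
    neighbors in image B and keep the pair only when the nearest is
    clearly closer than the runner-up (a ratio test) - i.e. the point
    has a single close neighbor. Returns (index_a, index_b) pairs,
    which would then be triangulated into a sparse depth map.
    """
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    matches = []
    for i, da in enumerate(desc_a):
        ranked = sorted(range(len(desc_b)), key=lambda j: dist(da, desc_b[j]))
        best, second = ranked[0], ranked[1]
        if dist(da, desc_b[best]) < ratio * dist(da, desc_b[second]):
            matches.append((i, best))
    return matches
```

Because only a handful of key points survive this filtering, the resulting depth map is sparse, which is the accuracy limitation the text identifies.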

Leveraging recent advances in neural networks, Foresight has been able to revolutionize the
way optical flow is calculated, making it possible to produce a dense depth map even when the two
cameras are not on a parallel axis and regardless of whether the cameras are visible-light or thermal
long-wave.



The neural optical flow pipeline builds on several components:

- A feature extractor that is semantic and agnostic to scale, illumination, rotation, etc. Regardless of the camera placement, similar features are extracted from both images, allowing for good matching between them.

- A correlation volume that looks for similarities in the images, pixel by pixel. For example, if the system identifies a person in image one, it will automatically look for the same person in image two.

- Image-level features that help determine which pixels are part of one full item in the first image (e.g. a human being or an automobile) and therefore should also appear as one item in the second image. This solves the issue of occlusions - for example, if one camera captures a vehicle in its frame and the second camera only captures part of the vehicle, the system will recognize that a whole vehicle is actually present in the scene.

- Refinement layers such as CNNs, GRUs, Transformers, etc. that offer refined and clear results, steps above the coarse renderings of the past.
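The correlation volume idea can be sketched in miniature: an all-pairs similarity table over per-pixel feature vectors, in the spirit of modern optical-flow networks. This is a hypothetical illustration, not Foresight's implementation:

```python
def correlation_volume(feat_a, feat_b):
    """All-pairs correlation volume between two feature maps.

    Entry [i][j] is the dot product between feature vector i of image
    one and feature vector j of image two; a high value means the two
    pixels likely show the same thing. Feature maps are given here as
    flattened lists of per-pixel feature vectors.
    """
    return [[sum(a * b for a, b in zip(fa, fb)) for fb in feat_b]
            for fa in feat_a]
```

In a real network the features come from the learned extractor above, and the refinement layers turn this coarse table into a dense, clean flow map.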

Foresight harnesses these state-of-the-art optical flow techniques to create a dense pixel-wise optical
flow map even under challenging circumstances. A vehicle’s existing cameras can be used, whether
they are visible-light or thermal long-wave and regardless of where on the automobile they are
positioned. Foresight’s patented methodology makes it possible to capture depth perception and
obtain a clear 3D view at any distance, no matter how the cameras are positioned.

The solution offers an array of placement options that can be fully customized and dynamically
adapted to suit the user’s needs and ensure that users will always get the most accurate depth map
regardless of the conditions. The system automatically calibrates itself, so if one camera is moved,
the system is recalibrated without the need for manual intervention.

Foresight has leveraged the inherent benefits of stereo vision and has applied patented technological
solutions to upgrade this method, creating a highly accurate depth map that can be used to detect
any object - known or unknown - and indicate its size, location, and distance.



The Benefits of Breaking
the Rules
Foresight has gone where no one else has dared - taking
existing technology and methods and using them in
unique ways to solve a key challenge shared across the
autonomous vehicle world. Any pair of cameras with
overlapping fields of view can be used to generate an
accurate depth map. The cameras can be placed in any
number of locations or positions, including ones that
would prohibit the use of classic stereo algorithms. With a
combination of stereoscopic tech and deep neural network
object recognition, the system is always learning and can
recognize both known and unknown obstacles, adding to
the accuracy of the map.

The biggest benefit is to the vehicle manufacturers who are facing the demands of a new generation of consumers who
want autonomous vehicles. In the past, even if automakers
did incorporate stereo vision, its usefulness was limited
by the width of the car. The camera-position requirements
meant that the cameras would not be able to “see” beyond
the width of the car itself. With Foresight, the cameras can
be positioned optimally to provide the best view – the
cameras’ positioning can leverage the height of the vehicle
as well as its width to increase the baseline, expand the
distance that can be captured, and increase the ability to
detect obstacles and prevent collisions.

And this is not good news only for the passenger car
industry. The autonomous driving demand extends to
commercial transportation, agriculture, drones, and more.
Manufacturers across verticals can incorporate Foresight’s
stereo vision capabilities - with full design flexibility to
determine where to place the cameras - ensuring that their
vehicles will perform the way consumers expect without
compromising on safety.
