Chapter 3
A Simple Vision System
The goal of this chapter is to embrace the optimism of the 1960s and to hand-design an end-to-end visual system. During this process, we will cover some of the main concepts that will be developed in the rest of the book.
In 1966, Seymour Papert wrote a proposal for building a vision system as a summer project [Papert 1966]. The abstract of the proposal starts by stating a simple goal: “The summer vision project is an attempt to use our summer workers effectively in the construction of a significant part of a visual system.” The report then continues by dividing the tasks (most of which are also common parts of modern computer vision approaches) among a group of MIT students. This project reflected the optimism of the early days of computer vision. However, the task proved to be harder than anybody expected.
In this chapter we will discuss several of the main topics that we will cover in this book. We will do this in the framework of a real, although somewhat artificial, vision problem. Vision has many different goals (object recognition, scene interpretation, 3D interpretation, etc.), but in this chapter we focus only on the task of 3D interpretation.
Figure 2.1: A world of simple objects. Print, cut and build your own blocks world!
Figure 2.2: a) Close-up picture taken without zoom. Note that near edges appear larger than far edges, and parallel lines in 3D are not parallel in the image. b) Picture taken from far away but using zoom. This creates an image that can be approximately described by parallel projection.
One way of generating images that can be described by parallel projection is to use the camera zoom. If we increase the distance between the camera and the object while zooming in, we can keep approximately the same image size of the objects, but with reduced perspective effects (fig. 2.2.b). Note how, in fig. 2.2.b, parallel 3D lines in the world are almost parallel in the image (some weak perspective effects remain).
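As a rough sketch of why this works (using a focal length f and an object distance Z_0, symbols that are not used elsewhere in this chapter): under perspective projection a point projects to x = f X/Z. If the depth variation within the object is small compared with its distance Z_0 from the camera, then x ≈ (f/Z_0) X, which is a parallel projection scaled by the constant f/Z_0. Moving the camera away (increasing Z_0) while zooming in (increasing f) keeps the scale f/Z_0 roughly constant while making the parallel approximation more accurate.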
The first step is to characterize how a point in world coordinates (X, Y, Z) projects into the image plane. Figure 2.3.a shows our parameterization of the world and the camera. The camera center is inside the plane X = 0, and the horizontal axis of the camera (x) is parallel to the ground plane (Y = 0). The camera is tilted so that the line connecting the origin of the world coordinate system and the image center is perpendicular to the image plane. The angle θ is the angle between this line and the Z axis. The image is parametrized by coordinates (x, y). In this simple projection model, the origin of the world coordinates projects onto the origin of the image coordinates; therefore, the world point (0, 0, 0) projects into (0, 0). The resolution of the image (the number of pixels) will also affect the transformation from world coordinates to image coordinates via a constant factor α. For now we assume that pixels are square and that this constant is α = 1 (we will see a more general form in section 36). Taking into account all these assumptions, the transformation between world coordinates and image coordinates can be written as:
x = X (2.1)
y = cos(θ)Y − sin(θ)Z (2.2)
For this particular parametrization, the world coordinates Y and Z are mixed into the single image coordinate y.
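As a concrete illustration, here is a minimal sketch of this projection in Python (the function name and the use of NumPy are our own choices; they are not part of the chapter's code):

import numpy as np

def project_parallel(points_xyz, theta):
    """Project 3D world points (X, Y, Z) into image coordinates (x, y)
    using the parallel projection of equations (2.1) and (2.2),
    with alpha = 1 and the world origin mapping to the image origin."""
    X, Y, Z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    x = X                                        # equation (2.1)
    y = np.cos(theta) * Y - np.sin(theta) * Z    # equation (2.2)
    return np.stack([x, y], axis=1)

# Example: the world origin projects to the image origin, and a point displaced
# along +Z projects onto the image y axis (with a sign flip and scaling), just like
# a point displaced along the Y axis.
pts = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(project_parallel(pts, theta=np.pi / 6))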
Figure 2.3: A simple projection model. a) World axes and camera plane. b) Visualization of the world axes projected into the camera plane with parallel projection. Note that the Z axis does not look like it is pointing toward the observer. Instead, the projected Z axis is identical to the projected Y axis up to a sign change and a scaling. A point moving parallel to the Z axis will be indistinguishable from a point moving parallel to the Y axis.
Figure 2.4: Image as a surface. The vertical axis corresponds to image intensity. For clarity
here, I have reversed the vertical axis. Dark values are shown higher than lighter values.
Figure 2.5: Edges denote image regions where there are sharp changes of the image intensities. Those variations can be due to a multitude of scene factors (occlusion boundaries, changes in surface orientation, changes in surface albedo, shadows, etc.). The edge types labeled in the figure are: occlusion, horizontal 3D edge, change of surface orientation, vertical 3D edge, contact edge, and shadow boundary.
• Shadow edges: detecting these can be harder than it seems. In this simple world, shadows are soft, creating slow transitions between dark and light.
We will also consider two types of object boundaries:
• Contact edges: this is a boundary between two objects that are in physical contact. Therefore, there is no depth discontinuity across a contact edge.
• Occlusion boundaries: these happen when an object is partially in front of another. Occlusion boundaries generally produce depth discontinuities. In this simple world, we will position the objects in such a way that they do not occlude each other, but they will occlude the background.
Despite the apparent simplicity of this task, in most natural scenes this classification is very hard and requires interpreting the scene at different levels. In other chapters we will see how to build better edge classifiers (for instance, by propagating information along boundaries, analyzing junctions, inferring light sources, etc.).
In this simple world, we will assume that objects do not occlude each other (this can be relaxed) and that the only occlusion boundaries are the boundaries between the objects and the ground. However, as we describe next, not all boundaries between the objects and the ground correspond to depth discontinuities.
• Collinearity: a straight 3D line will project into a straight line in the image.
• Cotermination: if two or more 3D lines terminate at the same point, the corresponding
projections will also terminate at a common point.
Note that these invariances refer to the process of going from world coordinates to image coordinates. The converse is not necessarily true. For instance, a straight line in the image could correspond to a curved line in the 3D world that happens to be precisely aligned with respect to the viewer's point of view so as to appear as a straight line. Also, two lines that intersect in the image plane could be disjoint in 3D space.
However, some of these properties (not all), while not always true, can nonetheless be used to reliably infer something about the 3D world from a single 2D image. For instance, if two lines coterminate in the image, then one can conclude that it is very likely that they also touch each other in 3D. If the 3D lines did not touch each other, it would require a very specific alignment between the observer and the lines for them to appear to coterminate in the image. Therefore, one can safely conclude that the lines most likely also touch in 3D.
These properties are called non-accidental properties [Lowe 1985], as they will only be observed in the image if they also exist in the world, or else only as a result of accidental alignments between the observer and the scene structure. Under a generic view, non-accidental properties will be shared by the image and the 3D world [Freeman 1994].
Let's see how this idea applies to our simple world. In this simple world, all 3D edges are either vertical or horizontal. Under parallel projection, and with the camera's horizontal axis parallel to the ground, we know that vertical 3D lines will project into vertical 2D lines in the image. On the other hand, horizontal lines will, in general, project into oblique lines. Therefore, we will assume that any vertical line in the image is also a vertical line in the world. As shown in figure 2.7, in the case of the cube there is a particular viewpoint that will make a horizontal line project into a vertical line, but this requires an accidental alignment between the cube and the line of sight of the observer. Nevertheless, this is a weak property; accidental alignments such as this one can occur, and a more general algorithm will need to account for that. For the purposes of this chapter, however, we will consider only images with generic views.
Figure 2.7: Generic views of the cube allow 2D vertical edges to be classified as 3D vertical
edges, and collinear 2D edge fragments to be grouped into a common 3D edge. Those
assumptions break down only for the center image, where the cube happens to be precisely
aligned with the camera axis of projection.
In figure 2.6 we show the edges classified as vertical or horizontal using the edge angle: anything that deviates from 2D verticality by more than 15 degrees is labeled as 3D horizontal.
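A minimal sketch of this classification rule in Python, assuming the edges are given as a list of 2D orientation angles measured from the image vertical (the representation, function name, and angle convention are our own choices):

import numpy as np

def classify_edges(angles_deg, threshold_deg=15.0):
    """Label each 2D edge as a 3D 'vertical' or 'horizontal' edge.

    angles_deg: edge orientations in the image, in degrees, measured from the
    image vertical (y) axis. Any edge deviating from 2D verticality by more
    than the threshold is labeled 3D horizontal."""
    angles = np.mod(np.asarray(angles_deg), 180.0)
    # Deviation from vertical, accounting for the 180-degree ambiguity of edge orientation.
    deviation = np.minimum(angles, 180.0 - angles)
    return np.where(deviation <= threshold_deg, "vertical", "horizontal")

print(classify_edges([2.0, 40.0, 170.0, 90.0]))  # ['vertical' 'horizontal' 'vertical' 'horizontal']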
We can now translate the inferred 3D edge orientations into linear constraints on the global 3D structure. We will formulate these constraints in terms of Y(x, y). Once Y(x, y) is recovered, we can also recover Z(x, y) from equation 2.2.
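In particular, rearranging equation 2.2 gives the depth at each pixel directly from the recovered height and the image coordinate y:

Z(x, y) = (cos(θ) Y(x, y) − y) / sin(θ)

which is valid whenever sin(θ) ≠ 0.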
In a 3D vertical edge, using the projection equations, the derivative of Y along the edge will be:
∂Y/∂y = 1/cos(θ) (2.10)
In a 3D horizontal edge, the coordinate Y will not change. Therefore, the derivative along
the edge should be zero:
∂Y/∂t = 0 (2.11)
where the vector t denotes the direction tangent to the edge, t = (−n_y, n_x), and n = (n_x, n_y) is the normal to the edge. We can write this derivative as a function of derivatives along the x and y image coordinates:

∂Y/∂t = t · ∇Y = −n_y ∂Y/∂x + n_x ∂Y/∂y = 0 (2.12)

Inside the flat regions of the image (the planar faces of the objects and the ground), Y(x, y) should vary linearly with image position, so its second derivatives should be zero:

∂²Y/∂x² = 0 (2.13)
∂²Y/∂y² = 0 (2.14)
∂²Y/∂y∂x = 0 (2.15)
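In the discrete image, these derivative constraints can be approximated with finite differences over neighboring pixels; for instance, one possible discretization (our choice of stencil) of equation 2.13 at pixel (x, y) is

Y(x+1, y) − 2Y(x, y) + Y(x−1, y) = 0

and similarly for equations 2.14 and 2.15. Each constraint then becomes one linear equation in the unknown values of Y.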
If some constraints are more important than others, it is possible to also add a weight w_i:

J = Σ_i w_i (a_i Y − b_i)² (2.18)
Our formulation has resulted in a large system of linear constraints (there are more equations than there are pixels in the image). It is convenient to write the system of equations in matrix form:
AY = b (2.19)
where row i of the matrix A contains the constraint coefficients a_i. The system of equations is overdetermined (A has more rows than columns). We can use the pseudoinverse to compute the solution:

Y = (AᵀA)⁻¹Aᵀb (2.20)
This problem can be solved efficiently as the matrix A is very sparse (most of the elements
are zero).
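To illustrate the kind of computation involved, here is a minimal sketch in Python that assembles a small sparse system of weighted linear constraints and solves it in the least-squares sense. The constraint set is a toy one-dimensional analogue (a few point measurements plus smoothness constraints), not the full set of constraints described in this chapter:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

# Toy example: recover a 1D "height" signal from a few point measurements
# plus second-derivative (smoothness) constraints, analogous to equation (2.18).
n = 100                              # number of unknowns (think: pixels)
rows, cols, vals, b = [], [], [], []

def add_constraint(indices, coeffs, target, weight):
    """Append one linear constraint a_i Y = b_i with weight w_i.
    Rows are scaled by sqrt(w_i) so that least squares minimizes equation (2.18)."""
    r = len(b)
    s = np.sqrt(weight)
    for j, c in zip(indices, coeffs):
        rows.append(r); cols.append(j); vals.append(s * c)
    b.append(s * target)

# A few "measurement" constraints (playing the role of the edge constraints).
for idx, value in [(0, 0.0), (50, 5.0), (99, 0.0)]:
    add_constraint([idx], [1.0], value, weight=10.0)

# Smoothness constraints Y[i-1] - 2 Y[i] + Y[i+1] = 0 (a discrete analogue of equation 2.13).
for i in range(1, n - 1):
    add_constraint([i - 1, i, i + 1], [1.0, -2.0, 1.0], 0.0, weight=1.0)

# Assemble the sparse matrix A and solve A Y = b in the least-squares sense.
A = sp.csr_matrix((vals, (rows, cols)), shape=(len(b), n))
Y = lsqr(A, np.array(b))[0]
print(Y[0], Y[50], Y[99])            # approximately recovers the anchored values 0, 5, 0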
2.5.7 Results
Figure 2.8 shows the resulting world coordinates X(x, y), Y (x, y), Z(x, y) for each pixel.
The world coordinates are shown here as images with the gray level coding the value of each
coordinate (black represents 0).
A few things to reflect on:
• It works. At least it seems to work pretty well. Knowing how well it works will require
having some way of evaluating performance. This will be important.
• But it cannot possibly work all the time. We have made many assumptions that hold only in this simple world. The rest of the book will involve upgrading this approach to apply to more general input images.
Although this approach will not work on general images, many of its general ideas will carry over to more sophisticated solutions (e.g., gathering and propagating local evidence). One thing to think about is: if information about 3D orientation is only given at the edges, how does that information propagate to the flat areas of the image in order to produce the correct solution in those regions?
Evaluation of performance is a very important topic. Here, one simple way to visually verify that the solution is correct is to render the objects under new viewpoints. Figure 2.9 shows the reconstructed 3D scene rendered under different viewpoints.
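A minimal sketch of such a visual check, assuming the recovered coordinates are available as arrays X, Y, and Z of equal shape; here we only vary the camera tilt θ, whereas a fully general viewpoint change would apply a 3D rotation to the points before projecting (the function name and plotting choices are our own):

import numpy as np
import matplotlib.pyplot as plt

def render_new_view(X, Y, Z, theta_new):
    """Re-project the recovered per-pixel world coordinates with equations (2.1)-(2.2)
    under a new camera tilt theta_new, as a simple point-cloud rendering."""
    x_new = np.ravel(X)
    y_new = np.cos(theta_new) * np.ravel(Y) - np.sin(theta_new) * np.ravel(Z)
    plt.scatter(x_new, y_new, s=1)
    plt.gca().set_aspect("equal")
    plt.show()

# Hypothetical usage, with X, Y, Z the arrays of world coordinates recovered above:
# render_new_view(X, Y, Z, theta_new=np.pi / 4)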
Figure 2.8: The solution to our vision problem: for the input image of fig 1.7, the assumptions and inference scheme described above lead to these estimates of the 3D world coordinates at each image location. The panels show, from left to right, X, Y (height), and Z (depth); note that the world coordinates are shown as images.
Figure 2.9: To show that the algorithm for 3D interpretation gives reasonable results, we can re-render the inferred 3D structure from different viewpoints. The rendered images show that the 3D structure has been accurately captured.
2.6 Generalization
One desired property of any vision system is its ability to generalize outside of the domain for which it was designed to operate. Out-of-domain generalization refers to the ability of a system to operate outside the domain for which it was designed. In the approach presented in this chapter, the domain of images in which the algorithm is expected to work is defined by the assumptions described in section 2.4. Later in the book, when we describe learning-based approaches, the training dataset will specify the domain. What happens when the assumptions are violated, or when the images contain configurations that we did not consider while designing the algorithm?
Let's run the algorithm on shapes that do not satisfy the assumptions that we have made for the simple world. Figure 2.10 shows the result when the input image violates several assumptions we made (shadows are not soft and the green cube occludes the red one) and has one object on top of the other, a configuration that we did not consider while designing the algorithm. In this case the result happens to be good, but it could have gone wrong.
Figure 2.11 shows the impossible steps figure from [Adelson 2000]. On the left side, the shape looks rectangular, with the stripes appearing to be painted on the surface. On the right side, the shape looks as if it has steps, with the stripes corresponding to shading due to the surface orientation. In the middle, the shape is ambiguous. Figure 2.11 also shows the reconstructed 3D scene for this unlikely image. The system has tried to approximately satisfy the constraints, as for this shape it is not possible to satisfy all of them exactly.
Figure 2.10: Reconstruction of an out of domain example. The input image violates several
assumptions we made (shadows are not soft and the green cube occludes the red one) and
has one object on top of the other, a configuration that we did not consider while designing
the algorithm.
Figure 2.11: Reconstruction of an impossible figure (inspired by [Adelson 2000]): the algorithm does the best it can. Note how the reconstruction seems to agree with how we perceive the impossible image.
A different approach from the one discussed here is model-based scene interpretation, in which we would have a set of predefined models of the objects that can be present in the scene, and the system would try to decide whether each of them is present in the image and recover its parameters (pose, color, etc.).
Recognition allows indexing properties that are not directly available in the image. For
instance, we can infer the shape of the invisible surfaces. Recognizing each geometric figure
also implies extracting the parameters of each figure (pose, size, ...).
Given a detailed representation of the world, we could render the image back, or at least some aspects of it. We can check that we are understanding things correctly if we can make verifiable predictions, such as: what would you see if you looked behind the object? Closing the loop between interpretation and the input will be useful at some point.
In this chapter, besides introducing some useful vision concepts, we also made use of
mathematical tools from algebra and calculus that you should be familiar with.
Exercises
We provide a Python notebook with the code to be completed. You can run it locally or in Colab. To use Colab, upload the notebook to Google Drive and select 'Open in Colab', which will allow you to complete the problems without setting up your own environment. Once you have finished, copy the code sections that you have completed as screenshots into the report.
Consider the following generalization of the projection equations:

x = αX + x_0 (2.21)
y = α(cos(θ)Y − sin(θ)Z) + y_0 (2.22)

which relate the coordinates of a point in the 3D world to the image coordinates of an orthographic camera rotated by θ about the X-axis.
Show that the equations emerge naturally from a series of transformations applied to the
3D world coordinates (X, Y, Z), of the form:
(x, y)ᵀ = α · P · R_x(θ) · (X, Y, Z)ᵀ + (x_0, y_0)ᵀ (2.23)
Please make sure to also include your answers for vertical edges, horizontal edges, and your formulations for A_ij and b for the different constraints in the report.