
Chapter 2

A Simple Vision System

The goal of this chapter is to embrace the optimism of the 1960s and to hand-design an end-to-end visual system. During this process, we will cover some of the main concepts that will be developed in the rest of the book.
In 1966, Seymour Papert wrote a proposal for building a vision system as a summer project [Papert 1966]. The abstract of the proposal starts by stating a simple goal: “The summer vision project is an attempt to use our summer workers effectively in the construction of a significant part of a visual system.” The report then continues by dividing all the tasks (most of which are also common parts of modern computer vision approaches) among a group of MIT students. This project was a reflection of the optimism existing in the early days of computer vision. However, the task proved to be harder than anybody expected.
In this first chapter we will discuss several of the main topics that we will cover in this book. We will do this in the framework of a real, although a bit artificial, vision problem. Vision has many different goals (object recognition, scene interpretation, 3D interpretation, etc.), but in this chapter we focus just on the task of 3D interpretation.

2.1 A simple world: The blocks world


As the visual world is too complex, we will start by simplifying it enough that we will be able to build a simple visual system right away. This was the strategy used by some of the first scene interpretation systems. L. G. Roberts [Roberts 1963] introduced the blocks world, a world composed of simple 3D geometrical figures. (The blocks world comes from Larry Roberts' PhD thesis, June 1963.)
For the purposes of this chapter, let's think of a world composed of a very simple (yet varied) set of objects. These simple objects are composed of flat surfaces which can be horizontal or vertical. These objects will be resting on a white horizontal ground plane. We can build these objects by cutting, folding and gluing together some pieces of colored paper as shown in figure 2.1. Here, we will not assume that we know the exact geometry of these objects in advance.

2.2 A simple image formation model


One of the simplest forms of projection is parallel (or orthographic) projection. In this image formation model, the light rays travel parallel to each other and perpendicular to the camera plane. This type of projection produces images in which objects do not change size as they move closer to or farther from the camera, and parallel lines in 3D remain parallel in the 2D image. This is different from perspective projection (to be discussed in 5.2), where the image is formed by the convergence of the light rays into a single point (the focal point). If we do not take special care, most pictures taken with a camera will be better described by perspective projection (fig. 2.2.a).


Figure 2.1: A world of simple objects. Print, cut and build your own blocks world!

Figure 2.2: a) Close-up picture without zoom. Note that near edges are larger than far edges, and parallel lines in 3D are not parallel in the image. b) Picture taken from far away but using zoom. This creates an image that can be approximately described by parallel projection.

One way of generating images that can be described by parallel projection is to use the camera zoom. If we increase the distance between the camera and the object while zooming in, we can keep approximately the same image size of the objects, but with reduced perspective effects (fig. 2.2.b). Note how, in fig. 2.2.b, 3D parallel lines in the world are almost parallel in the image (some weak perspective effects remain).
The first step is to characterize how a point in world coordinates (X, Y, Z) projects into the image plane. Figure 2.3.a shows our parameterization of the world and camera. The camera center is inside the plane X = 0, and the horizontal axis of the camera (x) is parallel to the ground plane (Y = 0). The camera is tilted so that the line connecting the origin of the world coordinate system and the image center is perpendicular to the image plane. The angle θ is the angle between this line and the Z axis. The image is parametrized by coordinates (x, y). In this simple projection model, the origin of the world coordinates projects onto the origin of the image coordinates; therefore, the world point (0, 0, 0) projects into (0, 0). The resolution of the image (the number of pixels) will also affect the transformation from world coordinates to image coordinates via a constant factor α; for now we assume that pixels are square (we will see a more general form in section 36) and that this constant is α = 1. Taking into account all these assumptions, the transformation between world coordinates and image coordinates can be written as:

x = X    (2.1)
y = cos(θ)Y − sin(θ)Z    (2.2)

For this particular parametrization, the world coordinates Y and Z are mixed.
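To make the projection model concrete, here is a minimal Python sketch of equations (2.1) and (2.2). It is not part of the chapter's notebook; the function name and the example values are illustrative only.

import numpy as np

def project_parallel(X, Y, Z, theta):
    """Parallel projection of a world point (X, Y, Z) onto the image plane
    for a camera tilted by theta (radians), following eqs. (2.1)-(2.2)."""
    x = X
    y = np.cos(theta) * Y - np.sin(theta) * Z
    return x, y

# Example: with theta = 30 degrees, motion along +Y and motion along -Z both
# move the projected point along the image y axis (Y and Z are mixed).
theta = np.deg2rad(30)
print(project_parallel(0.0, 1.0, 0.0, theta))   # (0.0, 0.866...)
print(project_parallel(0.0, 0.0, -1.0, theta))  # (0.0, 0.5)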


Figure 2.3: A simple projection model. a) World axes and camera plane. b) Visualization of the world axes projected into the camera plane with parallel projection. Note that the Z axis does not look like it is pointing towards the observer. Instead, the Z axis is identical to the Y axis up to a sign change and a scaling. A point moving parallel to the Z axis will be indistinguishable from a point moving parallel to the Y axis.

2.3 A simple goal


Part of the simplification of the vision problem resides in simplifying its goals. In this chapter we will focus on recovering the world coordinates of all the pixels seen by the camera.
Besides recovering the 3D structure of the scene, there are many other possible goals that we will not consider in this chapter. For instance, one goal (that might seem simpler but is not) is to recover the actual color of the surface seen by each pixel (x, y). This would require discounting illumination effects, as the color of the pixel is a combination of the surface albedo and the illumination (the color of the light sources and inter-reflections).

2.4 From images to edges and useful features


The observed image is a function, f, that takes as input a location, (x, y), and outputs the intensity at that location:

f(x, y)    (2.3)

In this representation, the image is an array of intensity values (color values) indexed by location. This representation is great if we are interested in knowing the light intensity coming from each direction of space and striking the camera plane, as this information is explicitly represented. However, here we are interested in interpreting the 3D structure of the scene and the objects within it. Therefore, it will be useful to transform the image into a representation that makes more explicit some of the important parts of the image that carry information about the boundaries between objects, changes in surface orientation, and so on.
(Question: are there animal eyes that produce a different initial representation than ours? It seems that geckos have a pupil with four diamond-shaped pinhole apertures, which could allow them to encode distance to a target in the retinal image [Murphy and Howland 1986].)
The representation will not be unique. Different levels of the scene understanding process will rely on different representations. For instance, the array of pixel intensities f(x, y) is a reasonable representation for the early stages of visual processing since, although we do not know the distance of surfaces in the world, the direction of each light ray in the world is well defined. However, other initial representations could be used by the visual system (e.g., images could be coded in the Fourier domain, or pixels could combine light coming from different directions).
There are several representations that can be used as an initial step for scene interpretation. Images can be represented as collections of small image patches, regions of uniform properties, edges, etc.

Figure 2.4: Image as a surface. The vertical axis corresponds to image intensity. For clarity
here, I have reversed the vertical axis. Dark values are shown higher than lighter values.

Figure 2.5: Edges denote image regions where there are sharp changes of the image intensities. Those variations can be due to a multitude of scene factors (occlusion boundaries, changes in surface orientation, changes in surface albedo, shadows, etc.). Labels in the figure: occlusion, horizontal 3D edge, change of surface orientation, vertical 3D edge, contact edge, shadow boundary.

2.4.1 A catalog of edges


Edges denote image regions where there are strong changes of the image with respect to location. Those variations can be due to a multitude of scene factors (e.g., occlusion boundaries, changes in surface orientation, changes in surface albedo, shadows, etc.).
One of the tasks that we will solve first is to classify image edges according to their most probable cause. We will use the following classification of image boundaries (fig. 2.5):
• Object boundaries: indicate pixels that delineate the boundaries of any object. Boundaries between objects generally correspond to changes in surface color, texture, orientation, and so on.
• Changes in surface orientation: indicate locations where there are strong image variations due to changes in surface orientation. A change in surface orientation produces changes in the image intensity, as intensity is a function of the angle between the surface and the incident light.

• Shadow edges: detecting these can be harder than it seems. In this simple world, shadows are soft, creating slow transitions between dark and light.
We will also consider two types of object boundaries:
• Contact edges: this is a boundary between two objects that are in physical contact. Therefore, there is no depth discontinuity.
• Occlusion boundaries: occlusion boundaries happen when an object is partially in front of another. Occlusion boundaries generally produce depth discontinuities. In this simple world, we will position the objects in such a way that they do not occlude each other, but they will occlude the background.
Despite the apparent simplicity of this task, in most natural scenes this classification is very hard and requires interpreting the scene at different levels. In other chapters we will see how to make better edge classifiers (for instance by propagating information along boundaries, junction analysis, inferring light sources, etc.).

2.4.2 Extracting edges from images


The first step will consist in detecting candidate edges in the image. Here we will start by making use of some notions from differential geometry. If we think of the image f(x, y) as a function of two (continuous) variables (fig. 2.4), we can measure the degree of variation using the gradient:

∇f = (∂f/∂x, ∂f/∂y)    (2.4)

The direction of the gradient indicates the direction in which the variation of intensities is largest. If we are on top of an edge, the direction of largest variation will be perpendicular to the edge.
However, the image is not a continuous function, as we only know the values of f(x, y) at discrete locations (pixels). Therefore, we will approximate the partial derivatives by:

∂f/∂x ≈ f(x, y) − f(x−1, y)    (2.5)
∂f/∂y ≈ f(x, y) − f(x, y−1)    (2.6)

A better behaved approximation of the partial image derivative can be computed by combining the image pixels around (x, y) with the weights:

(1/4) × [ −1  0  1
          −2  0  2
          −1  0  1 ]

We will discuss these approximations in detail in chapter 17.
From the image gradient, we can extract a number of interesting quantities:

e(x, y) = |∇f(x, y)|    (edge strength)    (2.7)
θ(x, y) = ∠∇f = arctan( (∂f/∂y) / (∂f/∂x) )    (edge orientation)    (2.8)

The unit-norm vector perpendicular to an edge is:

n = ∇f / |∇f|    (2.9)

The first decision that we will make is which pixels correspond to edges (regions of the image with sharp intensity variations) and which ones belong to uniform regions (flat surfaces). We will do this by simply thresholding the edge strength e(x, y). At the pixels with edges, we can also measure the edge orientation θ(x, y). Figure 2.6 visualizes the edges and the normal vector on each edge.
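The following Python sketch summarizes this step: it approximates the gradient with the 3×3 kernel above, computes the edge strength and orientation, thresholds the strength, and forms the unit normals. It is a sketch under the stated assumptions (a grayscale image stored as a 2D float array), not the chapter's reference notebook code, and the threshold value is an illustrative choice.

import numpy as np
from scipy.ndimage import convolve

def edge_features(f, threshold=0.1):
    """f: grayscale image as a 2D float array. Returns an edge mask, the edge
    strength e, the edge orientation theta, and the unit normal (nx, ny)."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float) / 4.0   # derivative along x (columns)
    ky = kx.T                                        # derivative along y (rows)
    dfdx = convolve(f, kx)
    dfdy = convolve(f, ky)
    e = np.hypot(dfdx, dfdy)          # edge strength |grad f|, eq. (2.7)
    theta = np.arctan2(dfdy, dfdx)    # edge orientation, eq. (2.8)
    edges = e > threshold             # keep only pixels with sharp variations
    nx = np.where(edges, dfdx / (e + 1e-12), 0.0)   # unit normal, eq. (2.9)
    ny = np.where(edges, dfdy / (e + 1e-12), 0.0)
    return edges, e, theta, nx, ny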

Figure 2.6: Gradient and edge types. Panels: input image; gradient (magnitude and orientation); edges; 3D orientation; depth discontinuities; contact edges.

2.5 From edges to surfaces


We want to recover the world coordinates X(x, y), Y(x, y) and Z(x, y) for each image location (x, y). Given the simple image formation model described before, recovering the X world coordinates is trivial, as they are directly observed: for each pixel with image coordinates (x, y), the corresponding world coordinate is X(x, y) = x. Recovering Y and Z will be harder, as we only observe a mixture of the two world coordinates (one dimension is lost due to the projection from the 3D world onto the image plane). Here we have written the world coordinates as functions of image location (x, y) to make explicit that we want to recover the 3D locations of the visible points.
In this simple world, we will formulate this problem as a set of linear equations.
2.5.1 Figure/ground segmentation

Segmentation of an image into figure and ground is a classical problem in human perception and computer vision that was introduced by Gestalt psychology. (The classical visual illusion “two faces or a vase” is an example of a figure-ground segmentation problem.)
In this simple world, deciding whether a pixel belongs to one of the foreground objects or to the background can be done by simply looking at the color values of each pixel. Bright pixels that have low saturation (similar values of the R, G, and B components) correspond to the white ground plane; the rest of the pixels are likely to belong to the colored blocks that compose our simple world. In general, the problem of segmenting an image into distinct objects is a very challenging task.
Once we have classified pixels as ground or figure, if we assume that the background corresponds to a horizontal ground plane, then for all pixels that belong to the ground we can set Y(x, y) = 0. For pixels that belong to objects we will have to measure additional image properties before we can deduce any geometric scene constraints.
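A minimal sketch of this color-based figure/ground rule might look as follows. The function name and the brightness and saturation thresholds are illustrative assumptions, not values from the text.

import numpy as np

def segment_ground(img_rgb, brightness_thresh=0.6, saturation_thresh=0.1):
    """img_rgb: H x W x 3 float array with values in [0, 1]. Bright pixels with
    low saturation are labeled as the white ground plane; the rest as figure."""
    brightness = img_rgb.mean(axis=2)
    saturation = img_rgb.max(axis=2) - img_rgb.min(axis=2)   # spread of R, G, B
    ground = (brightness > brightness_thresh) & (saturation < saturation_thresh)
    return ground, ~ground

For every pixel labeled as ground, the constraint Y(x, y) = 0 can then be added to the linear system assembled in section 2.5.6.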

2.5.2 Occlusion edges


An occlusion boundary separates two different surfaces at different distances from the observer. Along an occlusion edge, it is also important to know which object is in front, as this will be the one owning the boundary. (Border ownership: the foreground object is the one that owns the common edge.) Knowing who owns the boundary is important because an edge provides cues about the 3D geometry, but those cues only apply to the surface that owns the boundary.

In this simple world, we will assume that objects do not occlude each other (this can be
relaxed) and that the only occlusion boundaries are the boundaries between the objects and
the ground. However, as we next describe, not all boundaries between the objects and the
ground correspond to depth gradients.

2.5.3 Contact edges


Contact edges are boundaries between two distinct objects where there is no depth discontinuity. Although there is no depth discontinuity, there is still an occlusion here (as one surface is hidden behind another), and the edge shape is owned by only one of the two surfaces.
In this simple world, if we assume that all the objects rest on the ground plane, then we can set Y(x, y) = 0 on the contact edges. Contact edges can be detected as transitions between the object (above) and the ground (below). In the simple world, only horizontal edges can be contact edges. We will discuss next how to classify edges according to their 3D orientation.
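A possible sketch of this contact-edge test is given below. It assumes image arrays with row 0 at the top, and takes as inputs the ground mask from the previous section and a mask of edges already labeled as horizontal; both the function name and this interface are assumptions made for illustration.

import numpy as np

def contact_edges(horizontal_edges, ground):
    """horizontal_edges, ground: boolean H x W masks (row 0 at the top).
    A horizontal edge pixel is a contact edge if the pixel above belongs to an
    object and the pixel below belongs to the ground."""
    above_is_figure = ~np.roll(ground, 1, axis=0)    # row i looks at row i-1 (above)
    below_is_ground = np.roll(ground, -1, axis=0)    # row i looks at row i+1 (below)
    return horizontal_edges & above_is_figure & below_is_ground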

2.5.4 Generic view and non-accidental scene properties


Although in the projection from world coordinates to image coordinates we have lost a great deal of information, there are a number of properties that remain invariant and can help us interpret the image. Here is a list of some of those invariant properties:

• Collinearity: a straight 3D line will project into a straight line in the image.

• Cotermination: if two or more 3D lines terminate at the same point, the corresponding
projections will also terminate at a common point.

• Smoothness: a smooth 3D curve will project into a smooth 2D curve.

Note that those invariances refer to the process of going from world coordinates to image coordinates. The opposite might not be true. For instance, a straight line in the image could correspond to a curved line in the 3D world that happens to be precisely aligned with respect to the viewer's point of view so as to appear as a straight line. Also, two lines that intersect in the image plane could be disjoint in 3D space.
However, some of these properties (not all), while not always true, can nonetheless be used to reliably infer something about the 3D world using a single 2D image as input. For instance, if two lines coterminate in the image, then one can conclude that it is very likely that they also touch each other in 3D. If the 3D lines did not touch each other, it would require a very specific alignment between the observer and the lines for them to appear to coterminate in the image. Therefore, one can safely conclude that the lines most likely also touch in 3D.
These properties are called non-accidental properties [Lowe 1985] as they will only
be observed in the image if they also exist in the world or by accidental alignments between
the observer and scene structures. Under a generic view, non-accidental properties will be
shared by the image and the 3D world [Freeman 1994].
Let's see how this idea applies to our simple world. In this simple world all 3D edges are either vertical or horizontal. Under parallel projection, we will assume that 2D vertical edges are also 3D vertical edges. Under parallel projection, and with the camera having its horizontal axis parallel to the ground, we know that vertical 3D lines will project into vertical 2D lines in the image. On the other hand, horizontal lines will, in general, project into oblique lines. Therefore, we can assume that any vertical line in the image is also a vertical line in the world. As shown in figure 2.7, in the case of the cube, there is a particular viewpoint that will make a horizontal line project into a vertical line, but this will require an accidental alignment between the cube and the line of sight of the observer. Nevertheless, this is a weak property: accidental alignments such as this one can occur, and a more general algorithm will need to account for that. But for the purposes of this chapter we will consider images with generic views only.

Figure 2.7: Generic views of the cube allow 2D vertical edges to be classified as 3D vertical edges, and collinear 2D edge fragments to be grouped into a common 3D edge. Those assumptions break down only for the center image, where the cube happens to be precisely aligned with the camera axis of projection.
In figure 2.6 we show the edges classified as vertical or horizontal using the edge angle. Anything that deviates from 2D verticality by more than 15 degrees is labeled as 3D horizontal.
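One way to implement this rule is sketched below. It assumes the edge orientation map θ(x, y) computed in section 2.4.2 (recall that the gradient is perpendicular to the edge, so a 2D-vertical edge has a near-horizontal gradient). The 15-degree tolerance follows the text; the function name and everything else are illustrative assumptions.

import numpy as np

def classify_edges_3d(edges, theta, tol_deg=15.0):
    """edges: boolean edge mask; theta: gradient orientation in radians.
    Returns masks of edges labeled as 3D vertical and 3D horizontal."""
    grad_angle = np.degrees(np.abs(theta))                   # in [0, 180]
    grad_angle = np.minimum(grad_angle, 180.0 - grad_angle)  # fold to [0, 90]
    vertical_2d = edges & (grad_angle < tol_deg)   # near-horizontal gradient -> 2D vertical edge
    vertical_3d = vertical_2d                      # generic-view assumption: 2D vertical => 3D vertical
    horizontal_3d = edges & ~vertical_2d
    return vertical_3d, horizontal_3d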
We can now translate the inferred 3D edge orientation into linear constraints on the
global 3D structure. We will formulate these constraints in terms of Y (x, y). Once Y (x, y)
is recovered we can also recover Z(x, y) from equation 2.2.
In a 3D vertical edge, using the projection equations, the derivative of Y along the edge will be:

∂Y/∂y = 1/cos(θ)    (2.10)

In a 3D horizontal edge, the coordinate Y will not change. Therefore, the derivative along the edge should be zero:

∂Y/∂t = 0    (2.11)

where the vector t denotes the direction tangent to the edge, t = (−ny, nx), with n = (nx, ny) the edge normal. We can write this derivative as a function of derivatives along the x and y image coordinates:

∂Y/∂t = ∇Y · t = −ny ∂Y/∂x + nx ∂Y/∂y    (2.12)
When the edges coincide with occlusion edges, special care should be taken so that these
constraints are only applied to the surface that owns the boundary.

2.5.5 Constraint propagation


Most of the image consists of flat regions where we do not have such edge constraints, and we thus don't have enough local information to infer the surface orientation. Therefore, we need some criteria in order to propagate information from the boundaries, where we do have information about the 3D structure. This problem is common in many visual domains.
In this case we will assume that the object faces are planar. Thus, flat image regions impose these constraints on the local 3D structure:

∂²Y/∂x² = 0    (2.13)
∂²Y/∂y² = 0    (2.14)
∂²Y/∂y∂x = 0    (2.15)

The approximation to the second derivative can be obtained by applying twice the first-order derivative approximated by [−1 1]. The result is [−1 2 −1], which corresponds to ∂²Y/∂x² ≈ 2Y(x, y) − Y(x+1, y) − Y(x−1, y), and similarly for ∂²Y/∂y².
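A quick numerical check of this statement, convolving the first-order filter with itself:

import numpy as np

d1 = np.array([-1.0, 1.0])   # first-order derivative filter
d2 = np.convolve(d1, d1)     # applying it twice gives [ 1., -2.,  1.]
print(d2)                    # equal to -[-1, 2, -1]; the zero constraint is the same up to sign

# The planarity constraint d2Y/dx2 = 0 on a row of Y therefore reads
#   -Y(x-1, y) + 2*Y(x, y) - Y(x+1, y) = 0,
# and analogously for the y direction and the mixed derivative.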

2.5.6 A simple inference scheme


All the different constraints described before can be written as an overdetermined system of linear equations. Each equation will have the form:

a_i Y = b_i    (2.16)

where Y is a vectorized version of the image Y (that is, all rows of pixels have been concatenated into a flat vector). Note that there might be many more equations than there are image pixels.
We can translate all the constraints described in the previous sections into this form. For instance, if the index i corresponds to one of the pixels inside one of the planar faces of a foreground object, then the planarity constraint can be written as a_i = [0, ..., 0, −1, 2, −1, 0, ..., 0], b_i = 0.
We can solve the system of equations by minimizing the following cost function:

J = Σ_i (a_i Y − b_i)²    (2.17)

If some constraints are more important than others, it is also possible to add a weight w_i:

J = Σ_i w_i (a_i Y − b_i)²    (2.18)

Our formulation has resulted in a large system of linear constraints (there are more equations than there are pixels in the image). It is convenient to write the system of equations in matrix form:

A Y = b    (2.19)

where row i of the matrix A contains the constraint coefficients a_i. The system of equations is overdetermined (A has more rows than columns). We can use the pseudoinverse to compute the least-squares solution:

Y = (AᵀA)⁻¹ Aᵀ b    (2.20)

This problem can be solved efficiently because the matrix A is very sparse (most of its elements are zero).
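In practice, one would build A as a sparse matrix and use a sparse least-squares solver rather than forming the pseudoinverse explicitly. The sketch below assumes the constraints have already been collected as (row, column, value) triplets; it illustrates the approach and is not the chapter's notebook code.

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import lsqr

def solve_constraints(rows, cols, vals, b, num_pixels):
    """rows, cols, vals: nonzero entries of the constraint matrix A (one row
    per constraint, one column per pixel). b: right-hand side. Returns the
    least-squares estimate of the vectorized height map Y."""
    A = coo_matrix((vals, (rows, cols)), shape=(len(b), num_pixels)).tocsr()
    Y = lsqr(A, np.asarray(b, dtype=float))[0]
    return Y

# Per-constraint weights w_i (eq. 2.18) can be applied by scaling row i of A
# and entry b_i by sqrt(w_i) before solving.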

2.5.7 Results
Figure 2.8 shows the resulting world coordinates X(x, y), Y (x, y), Z(x, y) for each pixel.
The world coordinates are shown here as images with the gray level coding the value of each
coordinate (black represents 0).
A few things to reflect on:
• It works. At least it seems to work pretty well. Knowing how well it works will require
having some way of evaluating performance. This will be important.
• But it cannot possibly work all the time. We have made lots of assumptions that will work only in this simple world. The rest of the book will involve upgrading this approach to apply to more general input images.
Although this approach will not work on general images, many of the general ideas will carry over to more sophisticated solutions (e.g., gather and propagate local evidence). One thing to think about is: if information about 3D orientation is given only at the edges, how does that information propagate to the flat areas of the image in order to produce the correct solution in those regions?
Evaluation of performance is a very important topic. Here, one simple way to visually verify that the solution is correct is to render the objects under new viewpoints. Figure 2.9 shows the reconstructed 3D scene rendered under different viewpoints.

Figure 2.8: The solution to our vision problem: for the input image of fig. 1.7, the assumptions and inference scheme described above lead to these estimates of the 3D world coordinates at each image location. Note that the world coordinates X, Y (height), and Z (depth) are shown as images.

Figure 2.9: To show that the algorithm for 3D interpretation gives reasonable results, we can re-render the inferred 3D structure from different viewpoints. The rendered images show that the 3D structure has been accurately captured.

2.6 Generalization

One desired property of any vision system is its ability to generalize outside of the domain for which it was designed to operate. (Out-of-domain generalization refers to the ability of a system to operate outside the domain for which it was designed.) In the approach presented in this chapter, the domain of images the algorithm is expected to work on is defined by the assumptions described in section 2.4. Later in the book, when we describe learning-based approaches, the training dataset will specify the domain. What happens when the assumptions are violated, or when the images contain configurations that we did not consider while designing the algorithm?
Let's run the algorithm on shapes that do not satisfy the assumptions we have made for the simple world. Figure 2.10 shows the result when the input image violates several assumptions we made (shadows are not soft and the green cube occludes the red one) and has one object on top of the other, a configuration that we did not consider while designing the algorithm. In this case the result happens to be good, but it could have gone wrong.
Figure 2.11 shows the impossible steps figure from [Adelson 2000]. On the left side, the shape looks rectangular, with the stripes appearing to be painted on the surface. On the right side, the shape looks like it has steps, with the stripes corresponding to shading due to the surface orientation. In the middle, the shape is ambiguous. Figure 2.11 also shows the reconstructed 3D scene for this unlikely image. The system has tried to approximate the constraints, as for this shape it is not possible to exactly satisfy all of them.

2.7 Concluding remarks


Despite having a 3D representation of the structure of the scene, the system is still unaware of the fact that the scene is composed of a set of distinct objects. For instance, as the system lacks a representation of which objects are actually present in the scene, we cannot visualize the occluded parts. The system cannot do simple tasks like counting the number of cubes.

Figure 2.10: Reconstruction of an out-of-domain example. The input image violates several assumptions we made (shadows are not soft and the green cube occludes the red one) and has one object on top of the other, a configuration that we did not consider while designing the algorithm. Panels: edges (3D orientations); depth discontinuities (occlusions); contact edges.

Figure 2.11: Reconstruction of an impossible figure (inspired by [Adelson 2000]): the algorithm does the best it can. Note how the reconstruction seems to agree with how we perceive the impossible image.

A different approach to the one discussed here is model-based scene interpretation, where we would have a set of predefined models of the objects that can be present in the scene, and the system would try to decide whether they are present or not in the image and recover their parameters (pose, color, etc.).
Recognition allows indexing properties that are not directly available in the image. For instance, we can infer the shape of the invisible surfaces. Recognizing each geometric figure also implies extracting the parameters of each figure (pose, size, etc.).
Given a detailed representation of the world, we could render the image back, or at least some aspects of it. We can check that we are understanding things correctly if we can make verifiable predictions, such as: what would we see if we looked behind the object? Closing the loop between interpretation and the input will be valuable at some point.
In this chapter, besides introducing some useful vision concepts, we also made use of
mathematical tools from algebra and calculus that you should be familiar with.

Exercises
We provide a python notebook with the code to be completed. You can run it locally or in
Colab. To use Colab, upload it to Google Drive and select ‘open in colab’, which will allow
you to complete the problems without setting up your own environment. Once you have
finished, copy the code sections that you have completed as screenshots to the report.

Problem 1 Perspective and orthographic projections


The goal of this first exercise is to take images with different settings of a camera to create pictures with perspective projection and with orthographic projection. Both pictures should cover the same piece of the scene. You can take pictures of real places or objects (e.g., your furniture).
To create pictures with orthographic projection you can do two things: 1) use the zoom of the camera, 2) crop the central part of a picture. You will have to play with the distance between the camera and the scene, and with the zoom or cropping, so that both images look as similar as possible, only differing in the type of projection (similar to figure 2.2).
Submit the two pictures and clearly label parts of the images that reveal their projection types.

Problem 2 Orthographic projection equations


Recall the parallel projection equations:

x = αX + x0    (2.21)
y = α(cos(θ)Y − sin(θ)Z) + y0    (2.22)

which relate the coordinates of a point in the 3D world to the image coordinates of an orthographic camera rotated by θ about the X-axis.
Show that the equations emerge naturally from a series of transformations applied to the 3D world coordinates (X, Y, Z), of the form:

(x, y)ᵀ = α · P · Rx(θ) · (X, Y, Z)ᵀ + (x0, y0)ᵀ    (2.23)

where Rx(θ) is a 3 × 3 matrix corresponding to a rotation about the X axis, P is a 2 × 3 matrix corresponding to the orthographic projection, and α is a scaling factor to account for the size of the camera sensor, which is a single scalar when the pixels are square (assumed in this case).
Then, find α, x0 and y0 when the world point (0, 0, 0) projects onto (0, 0) (which corresponds to the center of the image) and the point (1, 0, 0) projects onto (3, 0).

Problem 3 Edge and surface constraints


In Sect. 2.5.5, we have written down the constraints for Y (x, y). Briefly derive the
constraints for Z(x, y) along vertical edges, horizontal edges, and flat surfaces.

Problem 4 Complete the code


Fill in the missing lines in the notebook: Ch2.ipynb, and include them in the report as
screenshots. First, find a way to classify edges as vertical or horizontal edges. Next, fill in
the rest of the conditions of the constraint matrix. The constraints for when the pixel is on
the ground have already been done for you as an example. Put the kernel in Aij and the
value you expect in b (the conversion to a linear system is done for you later so you don’t
need to worry about that part).
You only need to modify the locations marked with a TODO comment.

Please make sure to also include your answers for vertical edges and horizontal edges, and your formulations of Aij and b for the different constraints, in the report.

Problem 5 Run the code


Select some of the images included with the code and show some new viewpoints for
them.
Optional: You can also try with new images taken by you if you decide to create your
own simple world.

Problem 6 Violating simple world assumptions (1 point)


Find one example from the four images provided with the problem set (img1.jpg, ..., img4.jpg) where the recovery of 3D information fails. Include the image and the reconstruction in your writeup, and explain why it fails.

Research problem The real world [optional]


A research problem is a question for which we do not know the answer. In fact, there
might not even be an answer. This question is optional and could be extended into a larger
course project.
The goal of this problem is to test the 3D reconstruction code with real images. A number of the assumptions we have made will not hold when the input consists of real images of more complex scenes. For instance, the simple strategy for differentiating between foreground and background will not work with other scenes.
Try taking pictures of real-world scenes and propose modifications to the scheme proposed in this chapter so that you can get better 3D reconstructions. The goal is not to build a general system, but to be able to handle a few more situations.
