
Course Notes 1: Camera Models∗

1 Introduction
The camera is one of the most essential tools in computer vision. It is the mechanism by
which we can record the world around us and use its output - photographs - for various
applications. Therefore, one question we must ask in introductory computer vision is:
how do we model a camera?

2 Pinhole cameras


Figure 1: A simple working camera model: the pinhole camera model.

Let’s design a simple camera system – a system that can record an image of an object
or scene in the 3D world. This camera system can be designed by placing a barrier with
a small aperture between the 3D object and a photographic film or sensor. As Figure 1
shows, each point on the 3D object emits multiple rays of light outwards. Without a
barrier in place, every point on the film will be influenced by light rays emitted from
every point on the 3D object. Due to the barrier, only one (or a few) of these rays of
light passes through the aperture and hits the film. Therefore, we can establish a one-
to-one mapping between points on the 3D object and the film. The result is that the
film gets exposed by an “image” of the 3D object by means of this mapping. This simple
model is known as the pinhole camera model.
A more formal construction of the pinhole camera is shown in Figure 2. In this
construction, the film is commonly called the image or retinal plane. The aperture

∗ Most contents are from:
- R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision (2nd Edition)
- K. Hata and S. Savarese. Course notes of Stanford CS231A



Figure 2: A formal construction of the pinhole camera model.

is referred to as the pinhole O or center of the camera. The distance between the
image plane and the pinhole O is the focal length f . Sometimes, the retinal plane is
placed between O and the 3D object at a distance f from O. In this case, it is called the
virtual image or virtual retinal plane. Note that the projection of the object in the
image plane and the image of the object in the virtual image plane are identical up to a
scale (similarity) transformation.
Now, how do we use pinhole cameras? Let P = (x, y, z)^T be a point on some 3D
object visible to the pinhole camera. P will be mapped or projected onto the image
plane Π′, resulting in point¹ P′ = (x′, y′)^T. Similarly, the pinhole itself can be projected
onto the image plane, giving a new point C′.
Here, we can define a coordinate system (i, j, k) centered at the pinhole O such that
the axis k is perpendicular to the image plane and points toward it. This coordinate sys-
tem is often known as the camera reference system or camera coordinate system.
The line defined by C ′ and O is called the optical axis of the camera system.
Recall that point P ′ is derived from the projection of 3D point P on the image plane
Π′ . Therefore, if we derive the relationship between 3D point P and image plane point P ′ ,
we can understand how the 3D world imprints itself upon the image taken by a pinhole
camera. Notice that triangle P ′ C ′ O is similar to the triangle formed by P , O and (0, 0, z).
Therefore, using the law of similar triangles we find that:
$$P' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} f\,\frac{x}{z} \\ f\,\frac{y}{z} \end{bmatrix} \tag{1}$$
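As an illustration (not part of the original notes), here is a minimal NumPy sketch of this perspective projection; the function name and numeric values are hypothetical:

```python
import numpy as np

def project_pinhole(P, f):
    """Project a 3D point P = (x, y, z), given in camera coordinates,
    onto the image plane of a pinhole camera with focal length f (Equation 1)."""
    x, y, z = P
    return np.array([f * x / z, f * y / z])

# Example: a point 2 m in front of the camera, with f = 0.05 m.
print(project_pinhole(np.array([0.4, 0.1, 2.0]), f=0.05))  # -> [0.01  0.0025]
```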
Notice that one large assumption we make in this pinhole model is that the aperture
is a single point. In most real world scenarios, however, we cannot assume the aperture
can be infinitely small. Thus, what is the effect of varying aperture size?
As the aperture size increases, the number of light rays that pass through the barrier
increases. With more light rays passing through, each point on the film may be affected
by light rays from multiple points in 3D space, blurring the image. Although we may be
inclined to make the aperture as small as possible, recall that a smaller aperture lets
fewer light rays pass through, resulting in crisper but darker images. Therefore, we arrive
at the fundamental problem presented by the
pinhole formulation: can we develop cameras that take crisp and bright images?
¹ Throughout the course notes, let the prime superscript (e.g. P′) indicate that this point is a projected
or complementary point to the non-superscript version. For example, P′ is the projected version of P.


Figure 3: The effects of aperture size on the image. As the aperture size decreases, the
image gets sharper, but darker.

3 Cameras and lenses


Figure 4: A setup of a simple lens model. Notice how the rays of the top point on the
tree converge nicely on the film. However, a point at a different distance away from the
lens results in rays not converging perfectly on the film.

In modern cameras, the above conflict between crispness and brightness is mitigated
by using lenses, devices that can focus or disperse light. If we replace the pinhole with
a lens that is both properly placed and sized, then it satisfies the following property: all
rays of light that are emitted by some point P are refracted by the lens such that they
converge to a single point P′ in the image plane. Therefore, the problem of most light
rays being blocked by a small aperture is removed (Figure 4). However, please note that
this property does not hold for all 3D points, but only for some specific point P. Take
another point Q which is closer to or farther from the image plane than P. Its
corresponding projection into the image will be blurred or out of focus. Thus, lenses
have a specific distance for which objects are “in focus”. This property is also related
to a photography and computer graphics concept known as depth of field, which is the
effective range at which cameras can take clear photos.
Camera lenses have another interesting property: they focus all light rays traveling



Figure 5: Lenses focus light rays parallel to the optical axis into the focal point. Fur-
thermore, this setup illustrates the paraxial refraction model, which helps us find the
relationship between points in the image plane and the 3D world in cameras with lenses.

parallel to the optical axis to one point known as the focal point (Figure 5). The
distance between the focal point and the center of the lens is commonly referred to as the
focal length f . Furthermore, light rays passing through the center of the lens are not
deviated. We thus can arrive at a similar construction to the pinhole model that relates
a point P in 3D space with its corresponding point P ′ in the image plane.
$$P' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} z'\,\frac{x}{z} \\ z'\,\frac{y}{z} \end{bmatrix} \tag{2}$$
The derivation for this model is outside the scope of the class. However, please notice
that in the pinhole model z ′ = f , while in this lens-based model, z ′ = f +z0 . Additionally,
since this derivation takes advantage of the paraxial or “thin lens” assumption², it is called
the paraxial refraction model.
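A hedged sketch of Equation 2 under the same conventions (the notes give no code; here z0 denotes the distance between the focal point and the film, so that z′ = f + z0):

```python
import numpy as np

def project_thin_lens(P, f, z0):
    """Paraxial refraction model (Equation 2): same form as the pinhole
    projection, but with the image-plane distance z' = f + z0."""
    x, y, z = P
    z_prime = f + z0
    return np.array([z_prime * x / z, z_prime * y / z])
```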


Figure 6: Demonstrating how pincushion and barrel distortions affect images.

Because the paraxial refraction model approximates using the thin lens assumption,
a number of aberrations can occur. The most common one is referred to as radial dis-
tortion, which causes the image magnification to decrease or increase as a function of
the distance to the optical axis. We classify the radial distortion as pincushion distor-
tion when the magnification increases and barrel distortion³ when the magnification
² For the angle θ that incoming light rays make with the optical axis of the lens, the paraxial assumption
substitutes θ wherever sin(θ) is used. This approximation of sin θ by θ holds as θ approaches 0.
³ Barrel distortion typically occurs when one uses fish-eye lenses.


decreases. Radial distortion is caused by the fact that different portions of the lens have
differing focal lengths.
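The notes do not specify a distortion formula. As a hedged illustration, a commonly used polynomial radial model scales normalized image coordinates by a radius-dependent factor; the coefficients k1 and k2 below are hypothetical:

```python
import numpy as np

def apply_radial_distortion(xy, k1, k2):
    """Distort a normalized image point xy = (x, y).
    Positive coefficients increase magnification with radius (pincushion);
    negative coefficients decrease it (barrel)."""
    r2 = xy[0] ** 2 + xy[1] ** 2
    scale = 1.0 + k1 * r2 + k2 * r2 ** 2
    return scale * xy

# Barrel distortion example: points far from the optical axis are pulled inward.
print(apply_radial_distortion(np.array([0.5, 0.5]), k1=-0.2, k2=0.0))  # -> [0.45 0.45]
```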

4 Going to digital image space


In this section, we will discuss the details of the parameters we must account for when
modeling the projection from 3D space to the digital images we know. All the results
derived will use the pinhole model, but they also hold for the paraxial refraction model.
As discussed earlier, a point P in 3D space can be mapped (or projected) into a 2D
point P ′ in the image plane Π′ . This R3 → R2 mapping is referred to as a projective
transformation. This projection of 3D points into the image plane does not directly
correspond to what we see in actual digital images for several reasons. First, points in
the digital images are, in general, in a different reference system than those in the image
plane. Second, digital images are divided into discrete pixels, whereas points in the image
plane are continuous. Finally, the physical sensors can introduce non-linearity such as
distortion to the mapping. To account for these differences, we will introduce a number
of additional transformations that allow us to map any point from the 3D world to pixel
coordinates.

4.1 The Camera Matrix Model and Homogeneous Coordinates


4.1.1 Introduction to the Camera Matrix Model
The camera matrix model describes a set of important parameters that affect how a world
point P is mapped to image coordinates P ′ . As the name suggests, these parameters will
be represented in matrix form. First, let’s introduce some of those parameters.
The first parameters, cx and cy , describe how image plane and digital image coordi-
nates can differ by a translation. Image plane coordinates have their origin C ′ at the
image center where the k axis intersects the image plane. On the other hand, digital
image coordinates typically have their origin at the lower-left corner of the image. Thus,
2D points in the image plane and 2D points in the image are offset by a translation vector
(cx, cy)^T. To accommodate this change of coordinate systems, the mapping now becomes:
$$P' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} f\,\frac{x}{z} + c_x \\ f\,\frac{y}{z} + c_y \end{bmatrix} \tag{3}$$
The second effect we must account for is that the points in digital images are expressed
in pixels, while points in the image plane are represented in physical measurements (e.g.
centimeters). In order to accommodate this change of units, we must introduce two
new parameters k and l. These parameters, whose units would be something like pixels/cm,
correspond to the change of units in the two axes of the image plane. Note that k and l
may be different because the aspect ratio of a pixel is not guaranteed to be one. If k = l,
we often say that the camera has square pixels. We adjust our previous mapping to be
$$P' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} f k\,\frac{x}{z} + c_x \\ f l\,\frac{y}{z} + c_y \end{bmatrix} = \begin{bmatrix} \alpha\,\frac{x}{z} + c_x \\ \beta\,\frac{y}{z} + c_y \end{bmatrix} \tag{4}$$
Is there a better way to represent this projection from P → P′? If this projection were a
linear transformation, then it could be represented as a product of a matrix and the input
vector (in this case, P). However, from Equation 4, we see that this projection


P → P ′ is not linear, as the operation divides one of the input parameters (namely z).
Still, representing this projection as a matrix-vector product would be useful for future
derivations. Therefore, can we represent our transformation as a matrix-vector product
despite its nonlinearity? Homogeneous coordinates are the solution.

4.1.2 Homogeneous Coordinates


One way to solve this problem is to change the coordinate systems. For example, we
introduce a new coordinate, such that any point P ′ = (x′ , y ′ ) becomes (x′ , y ′ , 1). Sim-
ilarly, any point P = (x, y, z) becomes (x, y, z, 1). This augmented space is referred to
as the homogeneous coordinate system. As demonstrated previously, to convert a
Euclidean vector (v1 , ..., vn ) to homogeneous coordinates, we simply append a 1 in a new
dimension to get (v1 , ..., vn , 1). Note that the equality between a vector and its homo-
geneous coordinates only occurs when the final coordinate equals one. Therefore, when
converting back from arbitrary homogeneous coordinates (v1, ..., vn, w), we get Euclidean
coordinates (v1/w, ..., vn/w). Using homogeneous coordinates, we can formulate
$$P'_h = \begin{bmatrix} \alpha x + c_x z \\ \beta y + c_y z \\ z \end{bmatrix} = \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} P_h \tag{5}$$
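A minimal sketch of these conversions and of Equation 5 (all numeric values are hypothetical):

```python
import numpy as np

def to_homogeneous(v):
    """(v1, ..., vn) -> (v1, ..., vn, 1)."""
    return np.append(v, 1.0)

def from_homogeneous(vh):
    """(v1, ..., vn, w) -> (v1/w, ..., vn/w)."""
    return vh[:-1] / vh[-1]

# Equation 5: the projection becomes a single matrix-vector product.
alpha, beta, cx, cy = 800.0, 800.0, 320.0, 240.0
M = np.array([[alpha,  0.0,  cx, 0.0],
              [0.0,   beta,  cy, 0.0],
              [0.0,    0.0, 1.0, 0.0]])
P = np.array([0.4, 0.1, 2.0])                    # 3D point in camera coordinates
print(from_homogeneous(M @ to_homogeneous(P)))   # -> [480. 280.]
```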

From this point on, assume that we will work in homogeneous coordinates, unless
stated otherwise. We will drop the h index, so any point P or P ′ can be assumed to be
in homogeneous coordinates. As seen from Equation 5, we can represent the relationship
between a point in 3D space and its image coordinates by a matrix vector relationship:
% '
% ′' % ' x % '
x α 0 cx 0 ) * α 0 cx 0
y* &
P ′ = & y ′ ( = & 0 β cy 0 ( ) (
& z ( = 0 β cy 0 P = M P (6)
z 0 0 1 0 0 0 1 0
1

We can decompose this transformation a bit further into


$$P' = MP = \begin{bmatrix} \alpha & 0 & c_x \\ 0 & \beta & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} I & 0 \end{bmatrix} P = K \begin{bmatrix} I & 0 \end{bmatrix} P \tag{7}$$

The matrix K is often referred to as the camera matrix.
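A short sketch of this factorization, reusing the same hypothetical intrinsic values as above:

```python
import numpy as np

alpha, beta, cx, cy = 800.0, 800.0, 320.0, 240.0   # hypothetical intrinsics
K = np.array([[alpha,  0.0,  cx],
              [0.0,   beta,  cy],
              [0.0,    0.0, 1.0]])
I0 = np.hstack([np.eye(3), np.zeros((3, 1))])      # the 3x4 matrix [I 0]
M = K @ I0                                         # same 3x4 projection matrix as in Equation 6
```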

4.1.3 The Complete Camera Matrix Model


The camera matrix K contains some of the critical parameters that describe a camera's
characteristics and its model, including the cx, cy, k, and l parameters discussed above.
Two parameters are currently missing from this formulation: skewness and distortion. We
often say that an image is skewed when the camera coordinate system is skewed, meaning
that the angle between the two axes is slightly larger or smaller than 90 degrees. Most
cameras have zero skew, but some degree of skewness may occur because of sensor manu-
facturing errors. Deriving the new camera matrix accounting for skewness is outside the


scope of this class and we give it to you below:


$$K = \begin{bmatrix} \alpha & -\alpha \cot\theta & c_x \\ 0 & \frac{\beta}{\sin\theta} & c_y \\ 0 & 0 & 1 \end{bmatrix} \tag{8}$$

Most methods that we introduce in this class ignore distortion effects; therefore, the
camera matrix K used in this class has 5 degrees of freedom: 2 for focal length, 2 for offset, and 1 for
skewness. These parameters are collectively known as the intrinsic parameters, as
they are unique and inherent to a given camera and relate to essential properties of the
camera, such as its manufacturing.

4.2 Extrinsic Parameters


So far, we have described a mapping between a point P in the 3D camera reference
system to a point P ′ in the 2D image plane using the intrinsic parameters of a camera
described in matrix form. But what if the information about the 3D world is available
in a different coordinate system? Then, we need to include an additional transformation
that relates points from the world reference system to the camera reference system. This
transformation is captured by a rotation matrix R and translation vector T . Therefore,
given a point in a world reference system Pw , we can compute its camera coordinates as
follows:
$$P = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} P_w \tag{9}$$
Substituting this in equation (7) and simplifying gives
$$P' = K \begin{bmatrix} R & T \end{bmatrix} P_w = M P_w \tag{10}$$

These parameters R and T are known as the extrinsic parameters because they
are external to and do not depend on the camera.

This completes the mapping from a 3D point P in an arbitrary world reference system
to the image plane. To reiterate, we see that the full projection matrix M consists of
the two types of parameters introduced above: intrinsic and extrinsic parameters.
All parameters contained in the camera matrix K are the intrinsic parameters, which
change as the type of camera changes. The extrinsic parameters include the rotation and
translation, which do not depend on the camera’s build. Overall, we find that the 3 × 4
projection matrix M has 11 degrees of freedom: 5 from the intrinsic camera matrix, 3
from extrinsic rotation, and 3 from extrinsic translation.
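Putting the pieces together, here is a hedged end-to-end sketch of Equation 10; the intrinsic and extrinsic values are made up for illustration:

```python
import numpy as np

def projection_matrix(K, R, T):
    """Assemble the 3x4 projection matrix M = K [R T] of Equation 10."""
    return K @ np.hstack([R, T.reshape(3, 1)])

def project(M, P_w):
    """Map a 3D world point to pixel coordinates, with the homogeneous division."""
    p = M @ np.append(P_w, 1.0)
    return p[:2] / p[2]

K = np.array([[800.0,   0.0, 320.0],
              [0.0,   800.0, 240.0],
              [0.0,     0.0,   1.0]])
R = np.eye(3)                      # camera axes aligned with the world axes
T = np.array([0.0, 0.0, 2.0])      # world origin lies 2 units in front of the camera
M = projection_matrix(K, R, T)
print(project(M, np.array([0.0, 0.0, 0.0])))  # -> [320. 240.], the image center
```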

5 Appendix A: Rigid Transformations


The basic rigid transformations are rotation, translation, and scaling. This appendix will
cover them for the 3D case, as they are commonly used in this class.
Rotating a point in 3D space can be represented by rotating around each of the three
coordinate axes respectively. When rotating around the coordinate axes, common conven-
tion is to rotate in a counter-clockwise direction. One intuitive way to think of rotations
is how much we rotate around each degree of freedom, which is often referred to as Euler
angles. However, this methodology can result in what is known as singularities, or


gimbal lock, in which certain configurations result in a loss of a degree of freedom for
the rotation.
One way to prevent this is to use rotation matrices, which are a more general form of
representing rotations. Rotation matrices are square, orthogonal matrices with determi-
nant one. Given a rotation matrix R and a vector v, we can compute the resulting vector
v ′ as
v ′ = Rv
Since rotation matrices are a very general representation of rotations, we can represent
a rotation by angles α, β, γ around each of the respective axes as follows:
$$R_x(\alpha) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix}$$
$$R_y(\beta) = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}$$
$$R_z(\gamma) = \begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Due to the convention of matrix multiplication, the rotation achieved by first rotating
around the z-axis, then y-axis, then x-axis is given by the matrix product Rx Ry Rz .
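A brief sketch of these rotation matrices and the composition order (illustrative only):

```python
import numpy as np

def Rx(a):
    return np.array([[1, 0, 0],
                     [0, np.cos(a), -np.sin(a)],
                     [0, np.sin(a),  np.cos(a)]])

def Ry(b):
    return np.array([[ np.cos(b), 0, np.sin(b)],
                     [0, 1, 0],
                     [-np.sin(b), 0, np.cos(b)]])

def Rz(g):
    return np.array([[np.cos(g), -np.sin(g), 0],
                     [np.sin(g),  np.cos(g), 0],
                     [0, 0, 1]])

# Rotate first around z, then y, then x: compose right-to-left.
R = Rx(0.1) @ Ry(0.2) @ Rz(0.3)
assert np.isclose(np.linalg.det(R), 1.0)  # rotation matrices have determinant one
```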
Translations, or displacements, are used to describe the movement in a certain di-
rection. In 3D space, we define a translation vector t with 3 values: the displacements
in each of the 3 axes, often denoted as tx , ty , tz . Thus, given some point P which is
translated to some other point P ′ by t, we can write it as:
$$P' = P + t = \begin{bmatrix} P_x + t_x \\ P_y + t_y \\ P_z + t_z \end{bmatrix}$$
In matrix form, translations can be written using homogeneous coordinates. If we
construct a translation matrix as
$$T = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
then we see that P ′ = T P is equivalent to P ′ = P + t.
If we want to combine translation with our rotation matrix multiplication, we can
again use homogeneous coordinates to our advantage. If we want to rotate a vector v by
R and then translate it by t, we can write the resulting vector v ′ as:
$$\begin{bmatrix} v' \\ 1 \end{bmatrix} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} v \\ 1 \end{bmatrix}$$
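A minimal sketch of this combined rotate-then-translate transform in homogeneous coordinates (values are illustrative):

```python
import numpy as np

def rigid_transform(R, t):
    """Build the 4x4 homogeneous matrix with block structure [[R, t], [0, 1]]."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

v = np.array([1.0, 0.0, 0.0, 1.0])                        # a point in homogeneous coordinates
T = rigid_transform(np.eye(3), np.array([0.0, 0.0, 5.0]))
print(T @ v)                                               # -> [1. 0. 5. 1.]
```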
Finally, if we want to scale the vector in certain directions by some amount Sx , Sy , Sz ,
we can construct a scaling matrix
$$S = \begin{bmatrix} S_x & 0 & 0 \\ 0 & S_y & 0 \\ 0 & 0 & S_z \end{bmatrix}$$


Therefore, if we want to scale a vector, then rotate, then translate, our final transfor-
mation matrix would be:
$$T = \begin{bmatrix} RS & t \\ 0 & 1 \end{bmatrix}$$
Note that all of these types of transformations would be examples of affine transformations.
Recall that projective transformations occur when the final row of T is not [0 0 0 1].

6 Appendix B: Different Camera Models


We will now describe a simple model known as the weak perspective model. In the
weak perspective model, points are first projected to the reference plane using orthogonal
projection and then projected to the image plane using a projective transformation.

Figure 7: The weak perspective model: orthogonal projection onto reference plane

As Figure 7 shows, given a reference plane Π at a distance zo from the center of the
camera, the points P, Q, R are first projected onto the plane Π using an orthogonal projec-
tion, generating points P′, Q′, R′. This is a reasonable approximation when deviations
in depth from the plane are small compared to the distance of the camera.

Figure 8: The weak perspective model: projection onto the image plane


Figure 8 illustrates how the points P′, Q′, R′ are then projected to the image plane using
a regular projective transformation to produce the points p′, q′, r′. Notice, however, that
because we have approximated the depth of each point by zo, the projection has been
reduced to a simple, constant magnification. The magnification is equal to the focal
length f′ divided by zo, leading to
$$x' = \frac{f'}{z_o}\,x \qquad y' = \frac{f'}{z_o}\,y$$
This model also simplifies the projection matrix:
$$M = \begin{bmatrix} A & b \\ 0 & 1 \end{bmatrix}$$
As we see, the last row of M is [0 0 0 1] in the weak perspective model, compared to
[v 1] in the normal camera model. We do not prove this result and leave it to you as
an exercise. The simplification is clearly demonstrated when mapping the 3D points to
the image plane.
$$P' = MP = \begin{bmatrix} m_1 \\ m_2 \\ m_3 \end{bmatrix} P = \begin{bmatrix} m_1 P \\ m_2 P \\ 1 \end{bmatrix} \tag{11}$$
Thus, we see that the image plane point ultimately becomes a magnification of the orig-
inal 3D point, irrespective of depth. The nonlinearity of the projective transformation
disappears, making the weak perspective transformation a mere magnifier.
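As a hedged sketch (not from the notes), the difference between the full perspective projection and the weak perspective magnifier can be seen numerically; the point and plane distance below are made up:

```python
import numpy as np

def perspective(P, f):
    """Full pinhole projection: divide by each point's own depth."""
    return f * P[:2] / P[2]

def weak_perspective(P, f, z_o):
    """Weak perspective: every point is magnified by the constant factor f / z_o."""
    return (f / z_o) * P[:2]

P = np.array([0.3, 0.2, 10.2])                 # slightly behind the reference plane at z_o = 10
print(perspective(P, f=1.0))                   # -> [0.0294... 0.0196...]
print(weak_perspective(P, f=1.0, z_o=10.0))    # -> [0.03 0.02]; close, since the depth deviation is small
```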

Figure 9: The orthographic projection model

Further simplification leads to the orthographic (or affine) projection model. In
this case, the optical center is located at infinity. The projection rays are now perpen-
dicular to the retinal plane. As a result, this model ignores depth altogether. Therefore,
x′ = x
y′ = y
Orthographic projection models are often used for architecture and industrial design.
Overall, weak perspective models result in much simpler math, at the cost of being
somewhat imprecise. However, they often yield results that are very accurate when the
object is small and distant from the camera.

