CS231A Course Notes 1: Camera Models
Kenji Hata and Silvio Savarese
1 Introduction
The camera is one of the most essential tools in computer vision. It is the
mechanism by which we can record the world around us and use its output -
photographs - for various applications. Therefore, one question we must ask
in introductory computer vision is: how do we model a camera?
2 Pinhole cameras
Let’s design a simple camera system – a system that can record an image
of an object or scene in the 3D world. This camera system can be designed
by placing a barrier with a small aperture between the 3D object and a
photographic film or sensor. As Figure 1 shows, each point on the 3D object
emits multiple rays of light outwards. Without a barrier in place, every point
on the film will be influenced by light rays emitted from every point on the
3D object. Due to the barrier, only one (or a few) of these rays of light passes
through the aperture and hits the film. Therefore, we can establish a one-
to-one mapping between points on the 3D object and the film. The result is
that the film gets exposed by an “image” of the 3D object by means of this
mapping. This simple model is known as the pinhole camera model.
Figure 2: The pinhole camera geometry, showing the pinhole O, the camera axes i, j, k, the focal length f, the image plane Π′, a 3D point P, its projection P′, and the projected pinhole C′.
A 3D point P is projected onto the image plane Π′, resulting in the point P′; similarly, the pinhole itself can be projected onto the image plane, giving a new point C′.
Here, we can define a coordinate system (i, j, k) centered at the pinhole O such that the axis k is perpendicular to the image plane and points toward it. This coordinate system is often known as the camera reference system or camera coordinate system. The line defined by C′ and O is called the optical axis of the camera system.
Recall that point P′ is derived from the projection of 3D point P on the image plane Π′. Therefore, if we derive the relationship between 3D point P and image plane point P′, we can understand how the 3D world imprints itself upon the image taken by a pinhole camera. Notice that triangle P′C′O is similar to the triangle formed by P, O and (0, 0, z). Therefore, using the law of similar triangles we find that:
$$P' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} f\,\frac{x}{z} \\[4pt] f\,\frac{y}{z} \end{bmatrix} \qquad (1)$$

¹Throughout the course notes, let the prime superscript (e.g. P′) indicate that this point is a projected or complementary point to the non-superscripted version. For example, P′ is the projected version of P.
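As a quick numerical illustration of Equation 1, here is a minimal sketch in Python/NumPy (the function name and sample values are ours, not part of the notes) that projects a camera-frame point onto the image plane of a pinhole camera with focal length f.

```python
import numpy as np

def pinhole_project(P, f):
    """Project a 3D point P = (x, y, z), given in camera coordinates,
    onto the image plane of a pinhole camera with focal length f (Eq. 1)."""
    x, y, z = P
    return np.array([f * x / z, f * y / z])

# Example: a point 2 units in front of the camera, f = 0.05 in the same units.
print(pinhole_project(np.array([0.4, 0.1, 2.0]), f=0.05))  # -> [0.01  0.0025]
```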
Notice that one large assumption we make in this pinhole model is that
the aperture is a single point. In most real world scenarios, however, we
cannot assume the aperture can be infinitely small. Thus, what is the effect
of varying aperture size?
Figure 3: The effects of aperture size on the image. As the aperture size
decreases, the image gets sharper, but darker.
As the aperture size increases, the number of light rays that pass
through the barrier consequently increases. With more light rays passing
through, each point on the film may be affected by light rays from
multiple points in 3D space, blurring the image. Although we may be in-
clined to try to make the aperture as small as possible, recall that a smaller
aperture size allows fewer light rays to pass through, resulting in crisper but
darker images. Therefore, we arrive at the fundamental problem presented by
the pinhole formulation: can we develop cameras that take crisp and bright
images?
Figure 4: A setup of a simple lens model. Notice how the rays of the top
point on the tree converge nicely on the film. However, a point at a different
distance away from the lens results in rays not converging perfectly on the
film.
A lens is able to focus multiple light rays emitted from a point in the scene onto a single point in the image plane. Therefore, the problem of most light rays being blocked by a small aperture is removed (Figure 4). However, please note that this property does not hold for all 3D points, but only for some specific point P. Take another point Q which is closer to or farther from the image plane than P. The corresponding projection into the image will be blurred or out of focus. Thus, lenses have a specific distance for which objects are “in focus”. This property is also related to a photography and computer graphics concept known as depth of field, which is the effective range at which cameras can take clear photos.
Figure 5: Lenses focus light rays parallel to the optical axis into the fo-
cal point. Furthermore, this setup illustrates the paraxial refraction model,
which helps us find the relationship between points in the image plane and
the 3D world in cameras with lenses.
Camera lenses have another interesting property: they focus all light rays
traveling parallel to the optical axis to one point known as the focal point
(Figure 5). The distance between the focal point and the center of the lens
is commonly referred to as the focal length f . Furthermore, light rays
passing through the center of the lens are not deviated. We thus can arrive
at a similar construction to the pinhole model that relates a point P in 3D
space with its corresponding point P′ in the image plane.
$$P' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} z'\,\frac{x}{z} \\[4pt] z'\,\frac{y}{z} \end{bmatrix} \qquad (2)$$
The derivation for this model is outside the scope of the class. However,
please notice that in the pinhole model z′ = f, while in this lens-based model,
z′ = f + z₀. Additionally, since this derivation takes advantage of the paraxial
or “thin lens” assumption, it is called the paraxial refraction model.
Because the paraxial refraction model approximates using the thin lens
assumption, a number of aberrations can occur. The most common one is
referred to as radial distortion, which causes the image magnification to
decrease or increase as a function of the distance to the optical axis. We
classify the radial distortion as pincushion distortion when the magnification increases and barrel distortion when the magnification decreases.
Radial distortion is caused by the fact that different portions of the lens have
differing focal lengths.
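The notes do not give an explicit formula for radial distortion, but a common low-order polynomial model (an assumption here, not something derived in the text) scales each image point by a factor that depends on its squared distance from the optical axis; a positive coefficient makes the magnification grow with radius (pincushion-like), while a negative one makes it shrink (barrel-like). A rough sketch:

```python
import numpy as np

def radial_distort(xy, k1, k2=0.0):
    """Apply a simple polynomial radial distortion model (an illustrative
    assumption, not derived in these notes): each normalized image point is
    scaled by (1 + k1*r^2 + k2*r^4), where r is its distance from the optical
    axis. k1 > 0 behaves like pincushion distortion, k1 < 0 like barrel."""
    xy = np.asarray(xy, dtype=float)
    r2 = np.sum(xy**2, axis=-1, keepdims=True)
    return xy * (1.0 + k1 * r2 + k2 * r2**2)

print(radial_distort([[0.5, 0.5]], k1=-0.2))  # barrel: point pulled toward the axis
```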
As discussed earlier, a point P in 3D space can be mapped (or projected)
into a 2D point P′ in the image plane Π′. This R³ → R² mapping is referred
to as a projective transformation. This projection of 3D points into the
image plane does not directly correspond to what we see in actual digital
images for several reasons. First, points in the digital images are, in general,
in a different reference system than those in the image plane. Second, digital
images are divided into discrete pixels, whereas points in the image plane are
continuous. Finally, the physical sensors can introduce non-linearity such as
distortion to the mapping. To account for these differences, we will introduce
a number of additional transformations that allow us to map any point from
the 3D world to pixel coordinates.
$$P' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} f k\,\frac{x}{z} + c_x \\[4pt] f l\,\frac{y}{z} + c_y \end{bmatrix} = \begin{bmatrix} \alpha\,\frac{x}{z} + c_x \\[4pt] \beta\,\frac{y}{z} + c_y \end{bmatrix} \qquad (4)$$
Is there a better way to represent this projection from P → P′? If this
projection is a linear transformation, then it can be represented as a product
of a matrix and the input vector (in this case, it would be P). However, from
Equation 4, we see that this projection P → P′ is not linear, as the operation
divides by one of the input parameters (namely z). Still, representing this
projection as a matrix-vector product would be useful for future derivations.
Therefore, can we represent our transformation as a matrix-vector product
despite its nonlinearity? Homogeneous coordinates are the solution.
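As a small illustration of the homogeneous-coordinate convention (the helper names below are ours, not from the notes): a Euclidean point gains an appended 1, and a homogeneous point is mapped back by dividing by its last coordinate.

```python
import numpy as np

def to_homogeneous(p):
    """Append a 1 to a Euclidean point to get its homogeneous version."""
    return np.append(p, 1.0)

def from_homogeneous(p_h):
    """Divide by the last coordinate to recover the Euclidean point."""
    return p_h[:-1] / p_h[-1]

P = np.array([1.0, 2.0, 4.0])                        # 3D point (x, y, z)
print(to_homogeneous(P))                             # [1. 2. 4. 1.]
print(from_homogeneous(np.array([8.0, 4.0, 2.0])))   # image point [4. 2.]
```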
From this point on, assume that we will work in homogeneous coordinates,
unless stated otherwise. We will drop the h index, so any point P or P′ can
be assumed to be in homogeneous coordinates. As seen from Equation 5,
we can represent the relationship between a point in 3D space and its image
coordinates by a matrix-vector relationship:
$$P' = \begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} P = MP \qquad (6)$$
We can decompose this transformation a bit further into
$$P' = MP = \begin{bmatrix} \alpha & 0 & c_x \\ 0 & \beta & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} I & 0 \end{bmatrix} P = K \begin{bmatrix} I & 0 \end{bmatrix} P \qquad (7)$$
Most methods that we introduce in this class ignore distortion effects; therefore, our camera matrix K has 5 degrees of freedom: 2 for focal length, 2 for offset, and 1 for skewness. These parameters are collectively known as the intrinsic parameters, as they are unique and inherent to a given camera and relate to essential properties of the camera, such as its manufacturing.
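As a hedged sketch of Equations 6–7 in Python/NumPy (function names and sample intrinsic values are illustrative, not from the notes): build K, form M = K[I 0], and project a homogeneous camera-frame point. The skew term, the fifth intrinsic degree of freedom, would occupy the (0, 1) entry of K; Equation 7 writes the zero-skew case.

```python
import numpy as np

def intrinsic_matrix(alpha, beta, cx, cy, skew=0.0):
    """Build the camera matrix K of Equation 7. The skew term (the fifth
    intrinsic degree of freedom) sits in the (0, 1) entry; it is zero in the
    form written in Equation 7."""
    return np.array([[alpha, skew, cx],
                     [0.0,   beta, cy],
                     [0.0,   0.0,  1.0]])

def project(K, P_h):
    """Project a homogeneous camera-frame point with M = K [I 0] (Eq. 7)."""
    M = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    p = M @ P_h
    return p[:2] / p[2]

K = intrinsic_matrix(alpha=800, beta=800, cx=320, cy=240)
print(project(K, np.array([0.5, 0.25, 2.0, 1.0])))  # -> [520. 340.]
```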
Substituting this in equation (7) and simplifying gives
$$P' = K \begin{bmatrix} R & T \end{bmatrix} P_w = M P_w \qquad (10)$$
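Continuing the sketch, Equation 10 composes the intrinsics with the extrinsics. A minimal example follows (illustrative values; taking R = I and a translation along the optical axis is an assumption made only for the example):

```python
import numpy as np

def projection_matrix(K, R, T):
    """Form M = K [R T] of Equation 10, mapping homogeneous world points
    to homogeneous image points."""
    return K @ np.hstack([R, T.reshape(3, 1)])

def project_world_point(M, Pw):
    """Project a Euclidean 3D world point and dehomogenize the result."""
    p = M @ np.append(Pw, 1.0)
    return p[:2] / p[2]

K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1.0]])
R = np.eye(3)                    # camera axes aligned with the world frame
T = np.array([0.0, 0.0, 2.0])    # world origin 2 units in front of the camera
M = projection_matrix(K, R, T)
print(project_world_point(M, np.array([0.0, 0.0, 0.0])))  # -> [320. 240.]
```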
5 Camera Calibration
Precisely knowing the transformation from the real, 3D world into digital
images requires prior knowledge of many of the camera's intrinsic parameters.
If given an arbitrary camera, we may or may not have access to these
parameters. We do, however, have access to the images the camera takes.
Therefore, can we find a way to deduce them from images? This problem of
estimating the extrinsic and intrinsic camera parameters is known as camera
calibration.
A calibration rig usually consists of a simple pattern (e.g. a checkerboard) with known dimensions. Furthermore, the rig defines our world reference frame with origin O_w and axes i_w, j_w, k_w. From the rig's known pattern, we have known points in the world reference frame P_1, ..., P_n. Finding these points in the image we take from the camera gives corresponding points in the image p_1, ..., p_n.
We set up a linear system of equations from n correspondences such
that for each correspondence P_i, p_i and camera matrix M whose rows are
m_1, m_2, m_3:
$$p_i = \begin{bmatrix} u_i \\ v_i \end{bmatrix} = M P_i = \begin{bmatrix} \dfrac{m_1 P_i}{m_3 P_i} \\[8pt] \dfrac{m_2 P_i}{m_3 P_i} \end{bmatrix} \qquad (11)$$
As we see from the above equation, each correspondence gives us two
equations and, consequently, two constraints for solving the unknown pa-
rameters contained in m. From before, we know that the camera matrix
has 11 unknown parameters. This means that we need at least 6 correspon-
dences to solve this. However, in the real world, we often use more, as our
measurements are often noisy. To explicitly see this, we can derive a pair of
equations that relate ui and vi with Pi .
ui (m3 Pi ) − m1 Pi = 0
vi (m3 Pi ) − m2 Pi = 0
Given n of these corresponding points, the entire linear system of equa-
tions becomes
u1 (m3 P1 )−m1 P1 = 0
v1 (m3 P1 )−m2 P1 = 0
..
.
un (m3 Pn )−m1 Pn = 0
vn (m3 Pn )−m2 Pn = 0
If there were some other m that were a nonzero solution, then km would also
be a solution for all k ∈ R. Therefore, to constrain our solution, we complete
the following minimization:
$$\begin{aligned} \underset{m}{\text{minimize}} \quad & \|Pm\|^2 \\ \text{subject to} \quad & \|m\|^2 = 1 \end{aligned} \qquad (13)$$
To solve this minimization problem, we simply use singular value decomposition.
If we let P = UDVᵀ, then the solution to the above minimization is
to set m equal to the last column of V. The derivation for this solution is
outside the scope of this class; you may refer to Section 5.3 of Hartley &
Zisserman (pages 592–593) for more details.
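A compact sketch of this calibration procedure in Python/NumPy, assuming a non-degenerate point configuration (the function name `calibrate_dlt` is ours): stack the two constraints per correspondence into a 2n × 12 matrix and take the right singular vector associated with the smallest singular value.

```python
import numpy as np

def calibrate_dlt(Pw, puv):
    """Estimate the 3x4 camera matrix M (up to scale) from n >= 6
    correspondences. Pw: (n, 3) world points, puv: (n, 2) image points."""
    rows = []
    for (X, Y, Z), (u, v) in zip(Pw, puv):
        Pi = np.array([X, Y, Z, 1.0])
        rows.append(np.hstack([-Pi, np.zeros(4), u * Pi]))  # u(m3 Pi) - m1 Pi = 0
        rows.append(np.hstack([np.zeros(4), -Pi, v * Pi]))  # v(m3 Pi) - m2 Pi = 0
    A = np.vstack(rows)                  # shape (2n, 12)
    _, _, Vt = np.linalg.svd(A)
    m = Vt[-1]                           # unit-norm minimizer of ||A m||
    return m.reshape(3, 4)
```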
After reformatting the vector m into the matrix M , we now want to
explicitly solve for the extrinsic and intrinsic parameters. Our SVD-solved M
is known only up to scale, which means that the true values of the camera
matrix are some scalar multiple of M:
$$\rho M = \begin{bmatrix} \alpha r_1^T - \alpha \cot\theta\, r_2^T + c_x r_3^T & \alpha t_x - \alpha \cot\theta\, t_y + c_x t_z \\[6pt] \dfrac{\beta}{\sin\theta}\, r_2^T + c_y r_3^T & \dfrac{\beta}{\sin\theta}\, t_y + c_y t_z \\[6pt] r_3^T & t_z \end{bmatrix} \qquad (14)$$
Here, r_1^T, r_2^T, and r_3^T are the three rows of R. Dividing by the scaling parameter gives
$$M = \frac{1}{\rho}\begin{bmatrix} \alpha r_1^T - \alpha \cot\theta\, r_2^T + c_x r_3^T & \alpha t_x - \alpha \cot\theta\, t_y + c_x t_z \\[6pt] \dfrac{\beta}{\sin\theta}\, r_2^T + c_y r_3^T & \dfrac{\beta}{\sin\theta}\, t_y + c_y t_z \\[6pt] r_3^T & t_z \end{bmatrix} = \begin{bmatrix} A & b \end{bmatrix} = \begin{bmatrix} a_1^T & b_1 \\ a_2^T & b_2 \\ a_3^T & b_3 \end{bmatrix}$$
Solving for the intrinsics gives
$$\begin{aligned} \rho &= \pm\frac{1}{\|a_3\|} \\ c_x &= \rho^2 (a_1 \cdot a_3) \\ c_y &= \rho^2 (a_2 \cdot a_3) \\ \theta &= \cos^{-1}\!\left( -\frac{(a_1 \times a_3)\cdot(a_2 \times a_3)}{\|a_1 \times a_3\|\cdot\|a_2 \times a_3\|} \right) \\ \alpha &= \rho^2 \|a_1 \times a_3\| \sin\theta \\ \beta &= \rho^2 \|a_2 \times a_3\| \sin\theta \end{aligned} \qquad (15)$$
The extrinsics are
$$\begin{aligned} r_1 &= \frac{a_2 \times a_3}{\|a_2 \times a_3\|} \\ r_2 &= r_3 \times r_1 \\ r_3 &= \rho a_3 \\ T &= \rho K^{-1} b \end{aligned} \qquad (16)$$
We leave the derivations as a class exercise or you can refer to Section 1.3.1
of the Forsyth & Ponce textbook.
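The following sketch implements Equations 15 and 16 in Python/NumPy (our code, not the textbook's; the positive sign of ρ is chosen arbitrarily here, whereas in practice it is picked so the scene lies in front of the camera, i.e. t_z > 0):

```python
import numpy as np

def decompose_projection(M):
    """Recover intrinsics and extrinsics from a 3x4 camera matrix M, known
    up to scale, following Equations 15 and 16. M = [A | b] with rows a1..a3."""
    A, b = M[:, :3], M[:, 3]
    a1, a2, a3 = A
    rho = 1.0 / np.linalg.norm(a3)        # Eq. 15 allows either sign of rho
    cx = rho**2 * a1.dot(a3)
    cy = rho**2 * a2.dot(a3)
    c13, c23 = np.cross(a1, a3), np.cross(a2, a3)
    theta = np.arccos(-c13.dot(c23) / (np.linalg.norm(c13) * np.linalg.norm(c23)))
    alpha = rho**2 * np.linalg.norm(c13) * np.sin(theta)
    beta = rho**2 * np.linalg.norm(c23) * np.sin(theta)
    # Extrinsics (Eq. 16)
    r1 = c23 / np.linalg.norm(c23)
    r3 = rho * a3
    r2 = np.cross(r3, r1)
    R = np.vstack([r1, r2, r3])
    # Intrinsic matrix with skew, following the structure of Equation 14.
    K = np.array([[alpha, -alpha / np.tan(theta), cx],
                  [0.0,    beta / np.sin(theta),  cy],
                  [0.0,    0.0,                   1.0]])
    T = rho * np.linalg.solve(K, b)       # T = rho * K^{-1} b
    return K, R, T
```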
With the calibration procedure complete, we warn against degenerate
cases. Not all sets of n correspondences will work. For example, if the
points P_i lie on the same plane, then the system cannot be solved. These
unsolvable configurations of points are known as degenerate configurations.
More generally, degenerate configurations have points that lie on the
intersection curve of two quadric surfaces. Although this is outside the
scope of the class, you can find more information in Section 1.3 of the
Forsyth & Ponce textbook.
$$u_i\, q_3 P_i = q_1 P_i$$
$$v_i\, q_3 P_i = q_2 P_i$$
This system, however, is no longer linear, and we require the use of non-
linear optimization techniques, which are covered in Section 22.2 of Forsyth
& Ponce. We can simplify the nonlinear optimization of the calibration prob-
lem if we make certain assumptions. In radial distortion, we note that the
ratio between two coordinates ui and vi is not affected. We can compute this
ratio as
$$\frac{u_i}{v_i} = \frac{m_1 P_i / m_3 P_i}{m_2 P_i / m_3 P_i} = \frac{m_1 P_i}{m_2 P_i} \qquad (18)$$
From Equation 18, each correspondence gives one constraint in the following system
of linear equations:
$$\begin{aligned} v_1 (m_1 P_1) - u_1 (m_2 P_1) &= 0 \\ &\;\;\vdots \\ v_n (m_1 P_n) - u_n (m_2 P_n) &= 0 \end{aligned}$$
Similar to before, this gives a matrix-vector product that we can solve via SVD:
$$Ln = \begin{bmatrix} v_1 P_1^T & -u_1 P_1^T \\ \vdots & \vdots \\ v_n P_n^T & -u_n P_n^T \end{bmatrix} \begin{bmatrix} m_1^T \\ m_2^T \end{bmatrix} \qquad (19)$$
Once m_1 and m_2 are estimated, m_3 can be expressed as a nonlinear function
of m_1, m_2, and λ. This requires solving a nonlinear optimization problem
that is much simpler than the original one.
$$R_y(\beta) = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}$$
$$R_z(\gamma) = \begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Due to the convention of matrix multiplication, the rotation achieved by
first rotating around the z-axis, then y-axis, then x-axis is given by the matrix
product Rx Ry Rz .
Translations, or displacements, are used to describe the movement in a
certain direction. In 3D space, we define a translation vector t with 3 values:
the displacements in each of the 3 axes, often denoted as tx , ty , tz . Thus,
given some point P which is translated to some other point P 0 by t, we can
write it as:
$$P' = P + t = \begin{bmatrix} P_x + t_x \\ P_y + t_y \\ P_z + t_z \end{bmatrix}$$
In matrix form, translations can be written using homogeneous coordi-
nates. If we construct a translation matrix as
$$T = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
then we see that P′ = T P is equivalent to P′ = P + t.
If we want to combine translation with our rotation matrix multiplication,
we can again use homogeneous coordinates to our advantage. If we want to
rotate a vector v by R and then translate it by t, we can write the resulting
vector v′ as:
$$\begin{bmatrix} v' \\ 1 \end{bmatrix} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} v \\ 1 \end{bmatrix}$$
Finally, if we want to scale the vector in certain directions by some amount
Sx , Sy , Sz , we can construct a scaling matrix
$$S = \begin{bmatrix} S_x & 0 & 0 \\ 0 & S_y & 0 \\ 0 & 0 & S_z \end{bmatrix}$$
Therefore, if we want to scale a vector, then rotate, then translate, our
final transformation matrix would be:
$$T = \begin{bmatrix} RS & t \\ 0 & 1 \end{bmatrix}$$
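A small sketch that assembles this scale-then-rotate-then-translate matrix and applies it to a homogeneous point (the scale, rotation, and translation values below are made up for illustration):

```python
import numpy as np

def compose_srt(S_diag, R, t):
    """Build the 4x4 homogeneous matrix that scales, then rotates, then
    translates: T = [[R S, t], [0, 1]]."""
    S = np.diag(S_diag)
    T = np.eye(4)
    T[:3, :3] = R @ S
    T[:3, 3] = t
    return T

R = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1.0]])   # 90-degree turn about z
T = compose_srt([2.0, 2.0, 2.0], R, t=np.array([1.0, 0.0, 0.0]))
v = np.array([1.0, 0.0, 0.0, 1.0])                    # homogeneous point
# Scale -> (2,0,0), rotate -> (0,2,0), translate -> (1,2,0):
print(T @ v)                                          # [1. 2. 0. 1.]
```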
Note that all of these types of transformations are examples of affine
transformations. Recall that projective transformations occur when the final
row of T is not $\begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}$.
Figure 9: The weak perspective model: projection onto the image plane
As we see, the last row of M is $\begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}$ in the weak perspective model,
compared to $\begin{bmatrix} v & 1 \end{bmatrix}$ in the normal camera model. We do not prove this result
and leave it to you as an exercise. The simplification is clearly demonstrated
when mapping the 3D points to the image plane.
$$P' = MP = \begin{bmatrix} m_1 \\ m_2 \\ m_3 \end{bmatrix} P = \begin{bmatrix} m_1 P \\ m_2 P \\ 1 \end{bmatrix} \qquad (20)$$
Thus, we see that the image plane point ultimately becomes a magnification
of the original 3D point, irrespective of depth. The nonlinearity of the projec-
tive transformation disappears, making the weak perspective transformation
a mere magnifier.
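A tiny illustration of Equation 20 (the magnification and offset values are made up): because the last row of the weak perspective M is [0 0 0 1], the projected coordinates come out of a single affine map with no division by depth.

```python
import numpy as np

# A weak perspective camera matrix: the last row is [0 0 0 1], so projection
# is affine and no per-point division by the depth z is needed (Equation 20).
M_weak = np.array([[100.0, 0.0, 0.0, 320.0],
                   [0.0, 100.0, 0.0, 240.0],
                   [0.0, 0.0, 0.0, 1.0]])

P = np.array([0.1, -0.2, 5.0, 1.0])   # homogeneous world point
p = M_weak @ P
print(p[:2] / p[2])                    # depth never enters: [330. 220.]
```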
In the orthographic projection model, light rays are perpendicular to the retinal plane. As a result, this model ignores depth altogether. Therefore,
$$x' = x$$
$$y' = y$$
Orthographic projection models are often used for architecture and industrial
design.
Overall, weak perspective models result in much simpler math, at the
cost of being somewhat imprecise. However, they often yield results that are
very accurate when the object is small and distant from the camera.