21ai601 CV LM9 2
1. IMAGE PYRAMIDS
Image information occurs over many different spatial scales. Image pyramids (multi-resolution
representations of images) are a useful data structure for analyzing and manipulating images
over a range of spatial scales. Here we'll discuss three different ones, in a progression of
complexity. The first is the Gaussian pyramid, which creates versions of the input image at
multiple resolutions. This is useful for analysis across different spatial scales, but it doesn't
separate the image into different frequency bands. The Laplacian pyramid provides that extra
level of analysis, breaking the image into different isotropic spatial frequency bands. The
steerable pyramid provides a clean separation of the image into different scales and
orientations. There are various other differences between these pyramids, which we'll describe
below.
As a motivating example, let's assume we want to detect the birds in figure ?? using the
normalized correlation approach. If we have a template of a bird, normalized correlation will
detect only the birds whose image size is similar to that of the template. To introduce scale
invariance, one possible solution is to resize the template to cover a wide range of possible
sizes and apply each resized template to the image. The ensemble of templates will then be able
to detect birds of different sizes. The disadvantage of this approach is that it is
computationally expensive: detecting large birds requires convolutions with big kernels, which
is very slow.
An alternative is to change the image size instead, which results in a multiscale image pyramid.
In this example, the original image has a resolution of 848 × 643 pixels. Each image in the
pyramid is obtained by scaling down the image from the previous level, making it 25% smaller.
This operation is called downsampling and we will study it in detail in this chapter. Now we can
use the pyramid to detect birds of different sizes using a single template. The red box in the
figure denotes the size of the template used. The figure shows how birds of different sizes
become detectable in at least one level of the pyramid. This method is more efficient because
the template can be kept small, so the convolutions remain cheap to compute.
Each image is 25% smaller than the previous one. The red box indicates the size of a template used for
detecting flying birds. As the size of the template is fixed, it will only be able to detect the birds that
fit tightly inside the box. Birds that are smaller or larger will not be detected within a single scale. By
running the same template across many levels of this pyramid, different bird instances are detected at
different scales.
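The efficiency argument can be checked with a rough operation count. The sketch below is illustrative only: it assumes that "25% smaller" means each side shrinks to 75% of the previous level, uses a hypothetical 32-pixel square template, and counts the multiply-adds of a dense spatial correlation while ignoring borders. The names level_sizes and correlation_cost are made up for this example.

```python
def level_sizes(width, height, num_levels, scale=0.75):
    """Return the (width, height) of each pyramid level, level 0 first."""
    return [(round(width * scale**k), round(height * scale**k))
            for k in range(num_levels)]


def correlation_cost(img_w, img_h, tmpl_w, tmpl_h):
    """Multiply-adds of a dense spatial correlation (borders ignored)."""
    return img_w * img_h * tmpl_w * tmpl_h


W, H = 848, 643   # image size quoted in the text
T = 32            # hypothetical side of the square template
SCALE = 0.75      # assumed per-level scaling of each side

sizes = level_sizes(W, H, num_levels=6, scale=SCALE)
costs = []
for k, (w, h) in enumerate(sizes):
    # Option A: small fixed template on the k-th pyramid level.
    pyramid_cost = correlation_cost(w, h, T, T)
    # Option B: template enlarged by the same factor, on the full image.
    big_t = round(T / SCALE**k)
    template_cost = correlation_cost(W, H, big_t, big_t)
    costs.append((pyramid_cost, template_cost))
    print(f"level {k}: {w}x{h}  pyramid={pyramid_cost}  "
          f"resized template={template_cost}")
```

Under these assumptions, the cost on the pyramid falls with every level while the cost of the enlarged template grows, so the gap widens roughly with the fourth power of the scale factor.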
Multiscale image processing and image pyramids have many applications beyond scale-invariant object
detection.
1.1 Linear image transforms
Let's first look at some general properties of linear image transforms. For an input image x of N
pixels, a linear transform is:
r = P^T x
where r is the vector of M representation coefficients and P is an N × M matrix. The inverse
transform is:
x = B r
The columns of B = [B_0, B_1, …, B_{M-1}] are the basis vectors. The input signal x can be reconstructed
as a linear combination of the basis vectors B_i weighted by the representation coefficients r_i. The
transform P is complete, encoding all image structure, if it is invertible. If the transform is critically
sampled (i.e., M = N) and complete, then B = (P^T)^{-1}. If it is overcomplete (oversampled and complete),
then the inverse can be obtained using the pseudoinverse B = (P P^T)^{-1} P.
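Both inverse formulas can be verified numerically. This is a minimal sketch that assumes the forward transform r = P^T x with P of size N × M, and uses random matrices as stand-ins for a real transform; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
x = rng.standard_normal(N)          # stand-in for an image of N pixels

# Critically sampled case (M = N): B is the plain inverse of P^T.
P = rng.standard_normal((N, N))     # random stand-in for a transform
r = P.T @ x                         # representation coefficients
B = np.linalg.inv(P.T)
x_rec = B @ r

# Overcomplete case (M > N): B is the pseudoinverse (P P^T)^{-1} P.
M = 6
P_over = rng.standard_normal((N, M))
r_over = P_over.T @ x
B_over = np.linalg.inv(P_over @ P_over.T) @ P_over
x_rec_over = B_over @ r_over

print(np.allclose(x_rec, x), np.allclose(x_rec_over, x))
```

In the overcomplete case B has N rows and M columns, so x is still a weighted sum of the M columns of B.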
The Gaussian pyramid is computed recursively as g_{k+1} = G_k g_k, where D_k is the downsampling
operator, B_k is the convolution with the 4-th binomial filter, and G_k = D_k B_k is the
blur-and-downsample operator for level k. We call the sequence of images g_0, g_1, …, g_N the
Gaussian pyramid. The first level of the Gaussian pyramid is the input image: g_0 = x.
It is useful to check a concrete example. If x is a 1D signal of length 8, and if we assume zero boundary
conditions, the matrices for computing g_1 are:
The resulting level of the Gaussian pyramid, g_1, is a signal of length 4. Applying the recursion,
we can write the output of each level as a function of the input x: g_2 = G_1 G_0 x, g_3 =
G_2 G_1 G_0 x, and so on. For 2D images the operations are analogous. Figure 3.3 shows the
Gaussian pyramid of an image.
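The operators for the length-8 example can be built explicitly. The sketch below assumes that the binomial filter is the kernel [1, 4, 6, 4, 1]/16 and that downsampling keeps the samples at even indices; both choices, and the helper names blur_matrix and downsample_matrix, are assumptions made for illustration.

```python
import numpy as np


def blur_matrix(n, kernel):
    """n x n matrix for a same-size convolution with zero boundary."""
    half = len(kernel) // 2
    B = np.zeros((n, n))
    for i in range(n):
        for j, w in enumerate(kernel):
            col = i + j - half
            if 0 <= col < n:        # taps outside the signal read zeros
                B[i, col] = w
    return B


def downsample_matrix(n):
    """(n // 2) x n matrix that keeps samples 0, 2, 4, ..."""
    D = np.zeros((n // 2, n))
    D[np.arange(n // 2), 2 * np.arange(n // 2)] = 1.0
    return D


kernel = np.array([1, 4, 6, 4, 1]) / 16.0   # assumed binomial kernel

B0 = blur_matrix(8, kernel)         # 8 x 8 blur
D0 = downsample_matrix(8)           # 4 x 8 downsampling
G0 = D0 @ B0                        # 4 x 8 blur-and-downsample

x = np.arange(8, dtype=float)       # any length-8 signal
g1 = G0 @ x                         # length 4

# Next level: the same construction on the shorter signal.
G1 = downsample_matrix(4) @ blur_matrix(4, kernel)
g2 = G1 @ g1                        # equals G1 @ G0 @ x, length 2

print(G0.shape, g1, g2)
```

Each row of G0 is a copy of the kernel shifted by two samples and truncated at the borders, which is why one matrix product both blurs and halves the signal.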
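Each level of the Laplacian pyramid is the difference between a level of the Gaussian pyramid and
the upsampled, blurred version of the next coarser level. For instance, for a 1D input x of
length 8, and assuming zero boundary conditions, the operators to compute the first level of the
Laplacian pyramid are:
The factor of 2 in the upsampling operator is necessary because inserting zeros decreases the
average value of the signal g_{k+1} by a factor of 2.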
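The role of the factor of 2 can be seen by writing out the operators for the length-8 case. This sketch assumes that upsampling inserts zeros between samples (the transpose of downsampling), blurs with the kernel [1, 4, 6, 4, 1]/16, and multiplies by 2; the symbols U0, F0, and L0 are illustrative names, not necessarily those of the original notes.

```python
import numpy as np


def blur_matrix(n, kernel):
    """n x n matrix for a same-size convolution with zero boundary."""
    half = len(kernel) // 2
    B = np.zeros((n, n))
    for i in range(n):
        for j, w in enumerate(kernel):
            col = i + j - half
            if 0 <= col < n:
                B[i, col] = w
    return B


def downsample_matrix(n):
    """(n // 2) x n matrix that keeps samples 0, 2, 4, ..."""
    D = np.zeros((n // 2, n))
    D[np.arange(n // 2), 2 * np.arange(n // 2)] = 1.0
    return D


n = 8
kernel = np.array([1, 4, 6, 4, 1]) / 16.0   # assumed binomial kernel
B0 = blur_matrix(n, kernel)
D0 = downsample_matrix(n)
G0 = D0 @ B0                 # blur and downsample: 8 -> 4

U0 = D0.T                    # zero insertion: 4 -> 8
F0 = 2.0 * B0 @ U0           # insert zeros, blur, restore the mean

x = np.array([3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0])
g1 = G0 @ x                  # next Gaussian level
l0 = x - F0 @ g1             # first Laplacian level

L0 = np.eye(n) - F0 @ G0     # the same level as one operator on x

# Zero insertion doubles the length but keeps the sum, halving the mean.
print((U0 @ g1).mean(), g1.mean() / 2)
```

Adding F0 @ g1 back to l0 returns x exactly, which is the one-level version of the reconstruction property.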
The Laplacian pyramid is an overcomplete representation (more coefficients than
pixels): the dimensionality of the representation is higher than the dimensionality of the
input.
Note that the reconstruction property of the Laplacian pyramid does not depend on the filters
used for subsampling and upsampling. Even if we used random filters the reconstruction
property would still hold.
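This claim is easy to test numerically. The sketch below builds a two-level pyramid for a length-8 signal with random matrices in place of the blur-and-downsample and upsampling operators (an extreme choice made only to illustrate the point), and also counts the coefficients to show the overcompleteness. The function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


def build_laplacian(x, down_ops, up_ops):
    """Return the band-pass levels and the final low-pass residual."""
    g = x
    bands = []
    for G, F in zip(down_ops, up_ops):
        g_next = G @ g                 # coarser level
        bands.append(g - F @ g_next)   # what the coarser level misses
        g = g_next
    return bands, g


def reconstruct(bands, residual, up_ops):
    """Undo build_laplacian, from the coarsest level to the finest."""
    g = residual
    for band, F in zip(reversed(bands), reversed(up_ops)):
        g = band + F @ g
    return g


sizes = [8, 4, 2]
# Random stand-ins for the downsampling and upsampling operators.
down_ops = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(2)]
up_ops = [rng.standard_normal((sizes[k], sizes[k + 1])) for k in range(2)]

x = rng.standard_normal(8)
bands, residual = build_laplacian(x, down_ops, up_ops)
x_rec = reconstruct(bands, residual, up_ops)

n_coeffs = sum(len(b) for b in bands) + len(residual)
print(np.allclose(x_rec, x), n_coeffs)
```

Reconstruction is exact because each band stores precisely the error made by the upsampling step, whatever that step is. The representation uses 8 + 4 + 2 = 14 coefficients for an 8-sample input.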
Making a sharp transition from one image to another gives an artificially sharp
image boundary (see the straight edge of the apple/orange). Using the Laplacian pyramid,
we can transition from one image to the next over many different spatial scales to make
a gradual transition between the two images. First, we build the Laplacian pyramid for
the two input images. In this example we use 7 levels, and we also keep the last low-pass
residual:
We also build the Gaussian pyramid of the mask, with levels m0, m1, …, m7 (note that we use
8 levels, one level more than for the Laplacian pyramid).
Now we combine the three pyramids to compute the Laplacian pyramid of the blended
image. The Laplacian pyramid of the blended image is obtained as:
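A minimal 1D sketch of the whole blending procedure follows. It assumes the usual per-level rule, in which each level of the blended pyramid is m_k * la_k + (1 - m_k) * lb_k, with la and lb the Laplacian pyramids of the two inputs and m the Gaussian pyramid of the mask. These names, the kernel [1, 4, 6, 4, 1]/16, and the signal length are assumptions made for illustration.

```python
import numpy as np

KERNEL = np.array([1, 4, 6, 4, 1]) / 16.0   # assumed binomial kernel


def down(g):
    """Blur with zero boundary, then keep every other sample."""
    return np.convolve(g, KERNEL, mode="same")[::2]


def up(g, n):
    """Insert zeros to reach length n, blur, and multiply by 2."""
    z = np.zeros(n)
    z[::2] = g
    return 2.0 * np.convolve(z, KERNEL, mode="same")


def gaussian_pyramid(x, levels):
    pyr = [x]
    for _ in range(levels - 1):
        pyr.append(down(pyr[-1]))
    return pyr


def laplacian_pyramid(x, levels):
    """`levels` band-pass signals followed by the low-pass residual."""
    g = gaussian_pyramid(x, levels + 1)
    bands = [g[k] - up(g[k + 1], len(g[k])) for k in range(levels)]
    return bands + [g[-1]]


def collapse(pyr):
    """Reconstruct a signal from band-pass levels plus residual."""
    g = pyr[-1]
    for band in reversed(pyr[:-1]):
        g = band + up(g, len(band))
    return g


def blend(a, b, mask, levels):
    la = laplacian_pyramid(a, levels)
    lb = laplacian_pyramid(b, levels)
    m = gaussian_pyramid(mask, levels + 1)   # one level more, as in the text
    blended = [mk * ak + (1.0 - mk) * bk for mk, ak, bk in zip(m, la, lb)]
    return collapse(blended)


n = 64
a = np.ones(n)                        # stand-in for the first image
b = np.zeros(n)                       # stand-in for the second image
mask = (np.arange(n) < n // 2) * 1.0  # 1 on the left half, 0 on the right

out = blend(a, b, mask, levels=3)
print(np.round(out[24:40], 2))        # gradual transition around the seam
```

Because the mask is blurred more at coarser levels, low frequencies are mixed over a wide region and high frequencies over a narrow one, which is what hides the seam.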
To ensure that the image can be reconstructed from the steerable filter transform
coefficients, the filters must be designed so that their sums of squared magnitudes “tile”
in the frequency domain. We reconstruct by applying each filter a second time to the
steerable filter representation, and we want the final system frequency response to be flat,
for perfect reconstruction.
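The tiling condition can be illustrated with a toy example. The sketch below is not a steerable pyramid: it is a 1D, two-band filter bank defined directly in the frequency domain, with responses cos(w/2) and |sin(w/2)| chosen only because their squares sum to 1. All names are illustrative.

```python
import numpy as np

n = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(n)

omega = 2 * np.pi * np.fft.fftfreq(n)   # frequencies in [-pi, pi)
low = np.cos(omega / 2)                 # low-pass response
high = np.abs(np.sin(omega / 2))        # high-pass response

# Tiling: the squared magnitudes sum to 1 at every frequency.
tiling = low**2 + high**2

# Analysis: apply each filter once to get the subbands.
X = np.fft.fft(x)
low_band = np.fft.ifft(low * X).real
high_band = np.fft.ifft(high * X).real

# Synthesis: apply each filter a second time and add the results.
x_rec = np.fft.ifft(low * np.fft.fft(low_band)
                    + high * np.fft.fft(high_band)).real

# Counterexample: dropping one band leaves a non-flat response.
x_low_only = np.fft.ifft(low * np.fft.fft(low_band)).real

print(np.allclose(tiling, 1.0), np.allclose(x_rec, x),
      np.allclose(x_low_only, x))
```

Each band passes through its filter twice, so the overall response is the sum of the squared magnitudes. That sum is flat here, which is why x_rec matches x while the single-band reconstruction does not.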
One block of the Steerable pyramid computation
The following block diagram shows the steps to build a two-level steerable pyramid
and the reconstruction of the input. The architecture has two parts: 1) the analysis
network (or encoder), which transforms the input image x into a representation
composed of r = [b_{0,0}, …, b_{0,n}, b_{1,0}, …, b_{1,n}, …, b_{k−1,0}, …, b_{k−1,n}] and the
low-pass residual g_{k−1}; and 2) the synthesis network (or decoder), which reconstructs the
input from the representation r.
Steps to Build Steerable Pyramid