Computer Vision
Introduction
Computer vision is a field of artificial intelligence (AI) that enables computers and systems to
derive meaningful information from digital images, videos and other visual inputs — and take
actions or make recommendations based on that information. If AI enables computers to think,
computer vision enables them to see, observe and understand.
Computer vision works much the same as human vision, except humans have a head start.
Human sight has the advantage of lifetimes of context to train how to tell objects apart, how
far away they are, whether they are moving and whether there is something wrong in an image.
Computer vision trains machines to perform these functions, but it has to do it in much less
time with cameras, data and algorithms rather than retinas, optic nerves and a visual cortex.
Because a system trained to inspect products or watch a production asset can analyze thousands
of products or processes a minute, noticing imperceptible defects or issues, it can quickly
surpass human capabilities.
Computer vision needs lots of data. It runs analyses of that data over and over until it discerns
distinctions and ultimately recognizes images. For example, to train a computer to recognize
automobile tires, it needs to be fed vast quantities of tire images and tire-related items to learn
the differences and recognize a tire, especially one with no defects.
Two essential technologies are used to accomplish this: a type of machine learning called deep
learning and a convolutional neural network (CNN).
Machine learning uses algorithmic models that enable a computer to teach itself about the
context of visual data. If enough data is fed through the model, the computer will “look” at the
data and teach itself to tell one image from another. Algorithms enable the machine to learn by
itself, rather than someone programming it to recognize an image.
A CNN helps a machine learning or deep learning model “look” by breaking images down into
pixels that are given tags or labels. It uses the labels to perform convolutions (a mathematical
operation on two functions to produce a third function) and makes predictions about what it is
“seeing.” The neural network runs convolutions and checks the accuracy of its predictions in a
series of iterations until the predictions start to come true. It is then recognizing or seeing
images in a way similar to humans.
Much like a human making out an image at a distance, a CNN first discerns hard edges and
simple shapes, then fills in information as it runs iterations of its predictions. A CNN is used
to understand single images. A recurrent neural network (RNN) is used in a similar way for
video applications to help computers understand how pictures in a series of frames are related
to one another.
The human visual system makes scene interpretation seem easy. We can look out of a window
and can make sense of even a very complex scene. This process is very difficult for a machine.
As with natural language interpretation, it is a problem of ambiguity. The orientation and
position of an object changes its appearance, as does different lighting or colour. In addition,
objects are often partially hidden by other objects.
In order to interpret an image, we need both low-level information, such as texture and shading,
and high-level information, such as context and world knowledge. The former allows us to
identify the object, the latter to interpret it according to our expectations.
Because of these multiple levels of information, most computer vision is based on a hierarchy
of processes, starting with the raw image and working toward a high-level model of the world.
1. Digitization:
The aim of computer vision is to understand some scene in the outside world. This may be
captured using a video camera, but may come from a scanner. It will be easier to digitise
photographs than to work with real-time video. Also, it is not necessary that images come from
visible light. For the purposes of exposition, we will assume that we are capturing a visible
image with a video camera. This image will need to be digitized so that it can be processed by
a computer and also “cleaned up” by signal processing software.
Digitizing Images:
For use in computer vision, the image must be represented in a form which the machine can
read. The analog video image is converted into a digital image. The digital image is basically
a stream of numbers, each corresponding to a small region of the image, a pixel. The number
is a measure of the light intensity of the pixel, and is called a grey level. The range of possible
grey levels is called a grey scale (hence grey-scale image). If the grey scale consists of just two
levels (black or white) the image is a binary image.
Figure 14.2 shows an image (a) and its digitised form (b). There are ten grey levels, from 0
(white) to 9 (black). More typically there will be 16 or 256 grey levels rather than ten, and often
0 is black (no light); however, the single digits 0-9 fit better into the picture.
Most of the algorithms used in computer vision work on simple grey-scale images. However,
sometimes colour images are used. In this case, there are usually three or four values stored for
each pixel, corresponding to either primary colours (red, blue and green) or some other colour
representation system.
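To make this concrete, here is a minimal sketch (in Python, with NumPy assumed; the grey levels are invented for illustration) of a digitised image as nothing more than an array of numbers, one per pixel:

```python
import numpy as np

# A tiny 4x4 grey-scale image: each entry is one pixel's grey level.
# Here 0 is white and 9 is black, mirroring the ten-level scale in the text;
# real digitisers more typically give 16 or 256 levels, with 0 as black.
grey = np.array([
    [0, 0, 1, 2],
    [0, 1, 7, 8],
    [1, 6, 8, 9],
    [2, 7, 9, 9],
])
print(grey.shape)            # (rows, columns) of pixels

# A colour image stores several values per pixel, e.g. red, green and blue.
colour = np.zeros((4, 4, 3), dtype=np.uint8)
colour[0, 0] = (255, 0, 0)   # the top-left pixel is pure red
```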
The blurring of edges and other effects conspire to make the grey-scale image inaccurate.
Some cameras may not generate parallel lines of pixels, the pixels may be rectangular rather
than square (the aspect ratio), or the relationship between darkness and the recorded grey scale
may not be linear. However, the most persistent problem is noise: inaccurate readings of
individual pixels due to electronic fluctuations, dust on the lens or even a foggy day!
Thresholding:
Given a grey-scale image, the simplest thing we can do is to threshold it; that is, select all pixels
whose greyness exceeds some value. This may select key significant features from the image.
Thresholding can be used to recognise objects. For example, faults in electrical plugs can be
detected using multiple threshold levels. At some levels the wires are selected, allowing us to
check that the wiring is correct, at others the presence of the fuse can be verified.
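A minimal sketch of thresholding such an array (the grey levels and threshold values below are invented for illustration; they are not taken from the plug example):

```python
import numpy as np

grey = np.array([
    [0, 4, 4, 2],
    [0, 1, 7, 8],
    [1, 6, 8, 9],
    [2, 7, 9, 9],
])

# Select all pixels whose greyness exceeds the chosen level.
threshold = 5
selected = grey > threshold          # Boolean mask of the significant pixels
print(selected.astype(int))

# Multiple threshold levels pick out different features, as in the plug example:
for level in (3, 5, 8):
    print(level, int((grey > level).sum()), "pixels selected")
```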
We can also use thresholding to obtain a simple form of edge detection: we can simply follow
round the edge of a thresholded image. This can be done without actually performing the
thresholding, as we can simply follow the pixels where the grey level crosses the desired value.
This is called contour following, and it gives a good start for image understanding. More robust
approaches will instead use the rate of change in intensity (slope rather than height) to detect
edges.
2. Signal Processing:
i. Digital Filters:
We have noted some of the problems of noise, blurring and lighting effects, all of which make
image interpretation difficult. Various signal processing techniques can be applied to the image
in order to remove some of the effects of noise or to enhance other features, such as edges. The
application of such techniques is also called digital filtering.
Thresholding is a simple form of digital filter. Whereas thresholding processes each pixel
independently, more sophisticated filters also use neighbouring pixels. Some filters go beyond
this, and potentially each pixel's filtered value is dependent on the whole image. However, all
the filters we will consider operate on a finite window, a fixed-size group of pixels surrounding
the current pixel.
Many filters are linear. These work by having a series of weights for each pixel in the window.
For any point in the image, the surrounding pixels are multiplied by the relevant weights and
added together to give the final filtered pixel value.
In Fig. 14.3., we see the effect of applying a filter with a 3 x 3 window. The filter weights are
shown at the top right. The initial image grey levels are at the top left. For a particular pixel the
nine pixel values in the window are extracted. These are then multiplied by the corresponding
weights, giving in this case the new value 1. This value is placed in the appropriate position in
the new filtered image (bottom left).
The pixels around the edge of the filtered image have been left blank. This is because we cannot
position a 3×3 window of pixels centred on the edge pixels. So either the filtered image must
be smaller than the initial image, or some special action is taken at the edges.
Moreover, some of the filtered pixels have negative values associated with them. Obviously
this can only arise if some of the weights are negative. This is not a problem for subsequent
computer processing, but the values after this particular filter cannot easily be interpreted as
grey levels.
A related problem is that the values in the final image may be bigger than the original range of
values. For example, with the above weights, a zero pixel surrounded by nines would give rise
to a filtered value of 36. Again, this is not too great a problem, but if the result is too large or
too small (negative) then it may be too large to store (an overflow problem). Usually, the weights
will be scaled to avoid this.
So, in the example above, the result of applying the filter would be divided by 8 in order to
bring the output values within a similar range to the input grey scales. The coefficients are often
chosen to add up to a power of 2, as dividing can then be achieved using bit shifts, which are
far faster.
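As a minimal sketch of such a linear filter (in Python, with NumPy assumed; the weights and divisor below are illustrative only and are not the weights of Fig. 14.3), each output pixel is the weighted sum of its 3 x 3 window, scaled back towards the grey-level range:

```python
import numpy as np

def filter3x3(image, weights, divisor=1):
    """Apply a 3x3 linear filter: each output pixel is the weighted sum of the
    3x3 window of input pixels around it, divided by `divisor` for scaling.
    Edge pixels are left out, so the output is smaller than the input."""
    rows, cols = image.shape
    out = np.zeros((rows - 2, cols - 2))
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = image[r - 1:r + 2, c - 1:c + 2]
            out[r - 1, c - 1] = (window * weights).sum() / divisor
    return out

# Hypothetical weights purely for illustration; the divisor is chosen to match
# their sum so the output stays in roughly the same range as the input.
weights = np.array([[1, 1, 1],
                    [1, 0, 1],
                    [1, 1, 1]])
grey = np.random.randint(0, 10, (6, 6))
print(filter3x3(grey, weights, divisor=8))
```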
ii. Smoothing:
The simplest type of filter is for smoothing an image. That is, surrounding pixels are averaged
to give the new value of a pixel. Fig. 14.4., shows a simple 2×2 smoothing filter applied to an
image. The filter window is drawn in the middle, and its pivot cell (the one which overlays the
pixel to which the window is applied) is at the top left.
The filter values are all ones, and so it simply adds the pixel and its three neighbours to the left
and below and averages the four (÷ 4). The image clearly consists of two regions, one to the
left with high (7 or 8) grey-scale values and one to the right with low (0 or 1) values.
However, the image also has some noise in it. Two of the pixels on the left have low values
and one on the right a high value. Applying the filter has removed these anomalies, leaving the
two regions far more uniform, and hence suitable for thresholding or other further analysis.
Because only a few pixels are averaged with the 2×2 filter, the result is still susceptible to noise:
applying the filter only reduces the magnitude of a noisy pixel's contribution by a factor of 4. Larger windows are used
if there is more noise, or if later analysis requires a cleaner image. A larger filter will often
have an uneven distribution of weights, giving more importance to pixels near the chosen one
and less to those far away.
There are disadvantages to smoothing, especially when using large filters: the boundary
between the two regions becomes blurred (Fig. 14.4). There is a line of pixels which are at an
average value between the high and low regions. Thus, the edge can become harder to trace.
Furthermore, fine features such as thin lines may disappear altogether.
The Gaussian filter is a special smoothing filter based on the bell-shaped Gaussian curve, well
known in statistics as the ‘normal’ distribution. We imagine a window of infinite size, where
the weight, w(x, y), assigned to the pixel at position x, y from the centre is

w(x, y) = exp( -(x² + y²) / (2σ²) )

The constant σ is a measure of the spread of the window, that is, how much the image will be
smeared by the filter. A small value of σ means that the weights in the filter will be small for
distant pixels, whereas a large value allows more distant pixels to affect the new value of the
current pixel. If noise affects groups of pixels together then we would choose a large value of σ.
Although the window for a Gaussian filter is theoretically infinite, the weights become small
rapidly and so, depending on the value of σ, we can ignore those outside a certain area and so
make a finite windowed version. For example, Fig. 14.5., shows a Gaussian filter with a 5 x 5
window; it is symmetric and so the weights decrease towards the edge. This filter has weights
totaling 256, but this took some effort! The theoretical weights are not integers, and the
rounding errors mean that in general the sum of weights will not be a nice number.
One big advantage of Gaussian filters is that the parameter σ can be set to any value, yielding
finer or coarser smoothing. Simple smoothing methods tend only to have versions getting
‘bigger’ at fixed intervals (3 x 3, 5 x 5 etc.). The Gaussian with σ = 0.7 would also fit on a 5 x
5 window, but would be weighted more towards the centre (less smoothing).
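A minimal sketch of constructing a finite, windowed Gaussian filter for a given σ (here the weights are normalised to sum to 1 rather than rounded to integers totalling 256 as in Fig. 14.5):

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Build a size x size window of Gaussian weights
    w(x, y) = exp(-(x^2 + y^2) / (2 * sigma^2)), normalised to sum to 1."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    w = np.exp(-(xs**2 + ys**2) / (2.0 * sigma**2))
    return w / w.sum()

# A 5x5 window: a larger sigma spreads weight to more distant pixels (more smoothing),
# a smaller sigma concentrates it near the centre (less smoothing).
print(np.round(gaussian_kernel(5, 1.0), 3))
print(np.round(gaussian_kernel(5, 0.7), 3))
```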
3. Edge Detection:
Edge detection is central to most computer vision. There is also substantial evidence that edges
form a key part of human visual understanding. A few lines are able to evoke a full two- or
three-dimensional image. Edge detection consists of two sub-processes. First of all, potential
edge pixels are identified by looking at their grey level compared with surrounding pixels. Then
these individual edge pixels are traced to form the edge lines. Some of the edges may form
closed curves, while others will terminate or form a junction with another edge. Some of the
pixels detected by the first stage may not be able to join up with others to form true edges.
These may correspond to features too small to recognise properly, or simply be the result of
noise.
The grey-level image is an array of numbers (grey levels) representing the intensity value of
the pixels. It can be viewed as a description of a hilly landscape where the numbers are
altitudes. So a high number represents a peak and a low number a valley. Edge detection
involves identifying ridges, valleys and cliffs. These are the edges in the image. We can use
gradient operators to perform edge detection by identifying areas with high gradients. A high
gradient (that is, a sudden change in intensity) indicates an edge. There are a number of
different gradient operators in use.
Gradient Operators: If we subtract a pixel’s grey level from the one immediately to its right,
we get a simple measure of the horizontal gradient of the image. This two-point filter is shown
in Fig. 14.6 (i), together with two alternatives: a four-point filter (ii), which uses a 2 x 2 window,
and a six-point filter (iii), which uses a 3 x 3 window. The vertical version of the six-point filter
is also shown (iv).
These operators can be useful if edges at a particular orientation are important, in which case
we can simply threshold the filtered image and treat pixels with large gradients as edges.
However, none of these operators on its own detects both horizontal and vertical edges.
Sobel’s Operator: Sobel uses a slightly larger 3×3 window, which makes it somewhat less
affected by noise. Fig. 14.10., labels the grey levels of the nine pixels.
We can see the operator as composed of two terms, a horizontal and a vertical gradient:
H = (c + 2f + i) - (a + 2d + g)
V = (g + 2h + i) - (a + 2b + c)
G = |H| + |V|
The first term, H, compares the three pixels to the right of e with those to the left. The second,
V, compares those below the pixel with those above. In fact, if we look back at the six-point
gradient filters in Fig. 14.6., we will see that V and H are precisely the absolute values of the
outputs of those filters.
An edge running across the image will have a large value of V, one running up the image a
large value of H. So, once we have decided that a pixel represents an edge point, we can give
the edge an orientation using the ratio between H and V. Although we could follow edges
simply by looking for adjacent edge pixels, it is better to use edge directions.
Note further that Sobel’s operator uses each pixel value twice, either multiplying it by two (the
side pixels f, d, h and b) or including it in both terms (the corner pixels a, c, g and i). However,
an error in one of the corner pixels might cancel out, whereas one in the side pixels would
always affect the result. The corresponding formula without the double weighting, built directly
from the two six-point filters, is

G = |(c + f + i) - (a + d + g)| + |(g + h + i) - (a + b + c)|
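A minimal sketch of applying Sobel’s operator with the formulas above (the window labelling a to i follows Fig. 14.10 as described in the text; the small test image is invented):

```python
import numpy as np

def sobel(image):
    """Return the Sobel gradient magnitude G = |H| + |V| for each interior pixel.
    The window is labelled  a b c / d e f / g h i  around the centre pixel e."""
    rows, cols = image.shape
    G = np.zeros((rows, cols))
    for r in range(1, rows - 1):
        for c_ in range(1, cols - 1):
            a, b, c = image[r - 1, c_ - 1], image[r - 1, c_], image[r - 1, c_ + 1]
            d, f = image[r, c_ - 1], image[r, c_ + 1]
            g, h, i = image[r + 1, c_ - 1], image[r + 1, c_], image[r + 1, c_ + 1]
            H = (c + 2 * f + i) - (a + 2 * d + g)   # horizontal gradient
            V = (g + 2 * h + i) - (a + 2 * b + c)   # vertical gradient
            G[r, c_] = abs(H) + abs(V)
    return G

grey = np.array([[1, 1, 1, 9, 9],
                 [1, 1, 1, 9, 9],
                 [1, 1, 1, 9, 9],
                 [1, 1, 1, 9, 9]])
print(sobel(grey))   # large values mark the vertical edge between the 1s and 9s
```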
Laplacian Operator:
The Laplacian is a second-derivative operator defined on continuous images. For digital image
processing, linear filters are used which approximate the true Laplacian; approximations are
shown in Fig. 14.11 for a 2 x 2 grid and a 3 x 3 grid.
To see how they work, we will use a one-dimensional equivalent of the Laplacian, which filters
a one-dimensional series of grey levels using the weights (1, -2, 1). The effect of this is shown
in Fig. 14.12. We can see how the edge between the nines and ones is converted into little
peaks and troughs. The actual edge detection then involves looking for zero crossings: places
where the Laplacian’s values change between positive and negative.
Notice that in Fig. 14.12 the boundary between the nines and the ones is a 5; the one-
dimensional image is slightly blurred. When Sobel’s operator encounters such an edge it is
likely to register several possible edge pixels on either side of the actual edge. The Laplacian
will register a single pixel in the middle of the blurred edge.
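A minimal sketch of the one-dimensional (1, -2, 1) filter and the zero-crossing test (the grey levels echo the blurred edge described above):

```python
# A blurred one-dimensional edge: the boundary pixel between the nines and ones is a 5.
greys = [9, 9, 9, 9, 5, 1, 1, 1, 1]

# Apply the (1, -2, 1) filter at each interior position.
lap = [greys[i - 1] - 2 * greys[i] + greys[i + 1] for i in range(1, len(greys) - 1)]
print(lap)   # [0, 0, -4, 0, 4, 0, 0]: a trough and a peak either side of the edge

# The edge lies at the zero crossing: here, where the sign changes from negative to
# positive (an edge sloping the other way would flip the signs).
last_sign = 0
for i, value in enumerate(lap):
    sign = (value > 0) - (value < 0)
    if sign != 0:
        if last_sign < 0 and sign > 0:
            print("zero crossing near filtered position", i)
        last_sign = sign
```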
The Laplacian also has the advantage that it is a linear filter and can thus be easily combined
with other filters. A frequent combination is to use a Gaussian filter to smooth the image and
then follow this with a Laplacian. Because both are linear filters, the two can be merged into a
single filter called the Laplacian-of-Gaussian (LoG) filter. The Laplacian, however, does not
give any indication of orientation. If this is required then some additional method must be used
once an edge has been detected.
Successive Refinement:
The images are very large and hence calculations over the whole image take a long time. One
way to avoid this is to operate initially on coarse versions of the image and then successively
use more detailed images to examine potentially interesting features.
For example, we could divide a 512 x 512 image into 8 x 8 cells and then calculate the average
grey level over the cell. Treating each cell as a big ‘pixel’, we get a much smaller 64 x 64
image. Edge detection is then applied to this image using one of the methods suggested above.
If one of the cells is registered as an edge then the pixels comprising it are investigated
individually. Assuming that only a small proportion of the cells are potential edges then the
savings in computation are enormous—the only time we have to visit all the pixels is when the
cell averages are computed. This method of successive refinement can be applied to other parts
of the image processing process, such as edge following and region detection.
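A minimal sketch of the coarse first pass (NumPy assumed; the image here is random noise purely to show the shapes involved):

```python
import numpy as np

def coarsen(image, cell=8):
    """Average cell x cell blocks so that each block becomes one big 'pixel'."""
    rows, cols = image.shape
    return image.reshape(rows // cell, cell, cols // cell, cell).mean(axis=(1, 3))

image = np.random.randint(0, 256, (512, 512))
small = coarsen(image)          # a 64 x 64 image of cell averages
print(small.shape)

# Edge detection (e.g. Sobel) is run on `small`; only the cells it flags as edges
# need their 64 underlying pixels examined individually.
```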
Once we have identified the pixels which lie on the edges of objects, the next step is to
string those pixels together to make lines, that is, to identify which groups of pixels make up
particular edges. The basic rule of thumb is that if two suspected edge pixels are connected then
they belong to a single line.
In practice, edges may have gaps in them, some suspected edge pixels may be spurious (the
result of noise), and edges may meet at junctions. The first means that we may have to look
more than one pixel ahead to find the next edge point. The other two mean that we have to use
the edge orientation information in order to reject spurious edges or detect junctions.
1. Choose any suspected edge pixel which has not already been used.
4. If the orientation of the pixel is not too different then accept it.
7. If no acceptable pixel is found repeat the process for the other direction.
The pixels found during a pass of this algorithm are regarded as forming a single edge. The
whole process is repeated until all edge pixels have been considered.
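The numbered steps above are reproduced only in part, so the sketch below fills in a generic edge-linking loop (pick an unused edge pixel, chain on neighbours whose orientation is similar, then repeat from the start point in the other direction). The names link_edges and angle_diff, the 30-degree tolerance and the use of 8-connected neighbours are assumptions for illustration, not the exact algorithm of the text:

```python
import math

def angle_diff(a, b):
    """Difference between two edge orientations (defined only modulo 180 degrees)."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)

def link_edges(edge_pixels, orientations, max_angle=math.radians(30)):
    """Group suspected edge pixels into edge segments.
    edge_pixels: set of (row, col); orientations: dict mapping each edge pixel to its angle."""
    unused = set(edge_pixels)
    edges = []
    while unused:
        start = unused.pop()                      # step 1: any unused suspected edge pixel
        segment = [start]
        for direction in (1, -1):                 # follow the edge one way, then the other
            current = start
            while True:
                nxt = None
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        p = (current[0] + dr, current[1] + dc)
                        if p in unused and angle_diff(orientations[p], orientations[current]) < max_angle:
                            nxt = p               # accept: connected and of similar orientation
                            break
                    if nxt is not None:
                        break
                if nxt is None:
                    break                         # an end point: a termination or a junction
                unused.remove(nxt)
                if direction == 1:
                    segment.append(nxt)
                else:
                    segment.insert(0, nxt)
                current = nxt
        edges.append(segment)
    return edges
```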
The output of this algorithm is a collection of edges, each of which consists of a set of pixels.
The end points of each edge segment will also have been detected at step 7. If the end point is
isolated then it is a termination; if several lie together, or if it lies on another edge, then the end
point is at a junction. This resulting set of edges and junctions will be used by Waltz’s algorithm
to infer three-dimensional properties of the image.
However, before passing these data on to more knowledge-rich parts of the process, some
additional cleaning up is possible. For example, very short edges may be discarded as they are
likely either to be noisy or to be unimportant in the final image. Also, we can look for edges
which terminate close to one another.
If they are collinear and there are no intervening edges then we may join them up to form a
longer edge. Also, if two edges with different orientation terminate close together, or an edge
terminates near the middle of another edge, then this can be regarded as a junction.
One problem with too much guessing at lower levels is that it may confuse higher levels (the
source of optical illusions). One solution is to annotate edges and junctions with certainty
figures. Higher levels of processing can then use Bayesian-style inferencing and accept or
reject these guesses depending on higher-level semantic information.
4. Region Detection:
In contrast to a line drawing, an oil painting will not have lines drawn at the edges, but will
consist of areas of different colours. An alternative to edge detection is therefore to concentrate
on the regions composing the image.
Region Growing: A region can be regarded as a connected group of pixels whose intensity is
almost the same. Region detection aims to identify the main shapes in an image. This can be
done by identifying clusters of similar intensities:
1. Merge identical neighbouring pixels to form the initial regions.
2. Examine the boundaries between these regions; if the difference in intensity is lower than a
threshold, merge the regions.
This process is demonstrated in Fig. 14.13. The first image (i) shows the original grey levels.
Identical pixels are merged, giving the initial regions in (ii). The boundaries between these are
examined and in (iii) those where the difference in intensity is less than 3 are marked for
merging. The remainder, those where the difference in intensity is more than 2, are retained,
giving the final regions in (iv).
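A minimal region-growing sketch along these lines (a simplification: it merges across a boundary by comparing the two pixels either side of it rather than whole-region statistics; the grey levels are invented, and the union-find bookkeeping is one possible implementation choice):

```python
import numpy as np

def grow_regions(image, threshold=3):
    """Group pixels into regions: first merge identical neighbours, then merge across
    boundaries where the intensity difference is below `threshold`."""
    rows, cols = image.shape
    parent = {(r, c): (r, c) for r in range(rows) for c in range(cols)}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path compression
            p = parent[p]
        return p

    def union(p, q):
        parent[find(p)] = find(q)

    # Pass 1: merge neighbouring pixels with identical grey levels (limit 1).
    # Pass 2: merge neighbouring pixels whose grey levels differ by less than threshold.
    for limit in (1, threshold):
        for r in range(rows):
            for c in range(cols):
                for dr, dc in ((0, 1), (1, 0)):
                    nr, nc = r + dr, c + dc
                    if nr < rows and nc < cols and abs(int(image[r, c]) - int(image[nr, nc])) < limit:
                        union((r, c), (nr, nc))

    labels = np.zeros((rows, cols), dtype=int)
    roots = {}
    for r in range(rows):
        for c in range(cols):
            labels[r, c] = roots.setdefault(find((r, c)), len(roots))
    return labels

grey = np.array([[7, 8, 8, 1, 0],
                 [8, 7, 7, 0, 1],
                 [7, 8, 8, 1, 0]])
print(grow_regions(grey))   # two regions: the high values on the left, the low on the right
```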
Texture can cause problems with all types of image analysis, but region growing has some
special problems. If the image is unprocessed then a textured surface will have pixels of many
different intensities. This may lead to many small island regions within each large region.
Alternatively, the texture may ‘feather’ the edges of the regions so that different regions get
merged.
The obvious response is to smooth the image so that textured areas become more uniform greys.
However, if the feathering is bad, then sufficient smoothing to remove the texture will also blur
the edge sufficiently that the regions will be merged anyway.
In a controlled environment, where lighting levels can be adjusted, we may be able to adjust
the various parameters (level of smoothing, threshold for merging) so that recognition is
possible, but where such control is not easily possible region merging may fail completely.
Representing Regions-Quad-Trees:
Representing a region inside a computer program is not so straightforward. The simplest
representation would be to keep a list of all the pixels in each region. However, this would take
an enormous amount of storage.
There are various alternatives to reduce this overhead. One popular representation is the quad-
tree. Quad-trees make use of the fact that images often have large areas with the same value,
which is precisely the case with regions. We will describe the algorithm in terms of storing a
binary image and then show how it can be used for recording regions.
Start off with a square image whose width in pixels is some power of 2. Examine the image: is
it all black or all white? If so, stop. If not, divide the image into four quarters and look at each
quarter. If a quarter is all black or all white, then leave it alone, but if it is mixed, then split it
into quarters in turn. This continues until either each block is of one colour, or else we reach
individual pixels, which must be one colour by definition. This process is illustrated in
Fig. 14.14.
The first part (i) shows the original image, perhaps part of a black circle. This is then divided
and subdivided into quarters in (ii). Finally, in (iii) we see how this can be stored in the
computer as a tree data structure. Note how the 64 pixels of the image are stored in
five tree nodes. Of course the tree nodes are more complicated than simple bitmaps, so for an
image of this size a quad-tree is of little benefit, but for larger images the saving can be enormous.
Quad-tree representation can be used to record regions in two ways. Each region can be stored
as its own quad-tree, where black means that the pixel is part of the region. Alternatively, we can
use a multi-coloured version of a quad-tree where each region is coded as a different colour. In
either case, regions can easily be merged using the quad-tree representation.
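A minimal sketch of quad-tree construction for a binary image (the node representation, a plain value for a uniform block or a list of four subtrees, is an illustrative choice):

```python
def quad_tree(image):
    """Build a quad-tree for a square binary image (side a power of 2).
    A node is either a single value (0 or 1, a uniform block) or a list of
    four subtrees [top-left, top-right, bottom-left, bottom-right]."""
    first = image[0][0]
    if all(pixel == first for row in image for pixel in row):
        return first                            # uniform block: store one value
    half = len(image) // 2
    quarters = [
        [row[:half] for row in image[:half]],   # top-left
        [row[half:] for row in image[:half]],   # top-right
        [row[:half] for row in image[half:]],   # bottom-left
        [row[half:] for row in image[half:]],   # bottom-right
    ]
    return [quad_tree(q) for q in quarters]

# An 8x8 binary image: 1s mark part of a region, 0s the background.
img = [[1 if r + c >= 8 else 0 for c in range(8)] for r in range(8)]
print(quad_tree(img))
```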
5. Reconstructing Objects:
Edge and region detection identify parts of an image. We need to establish the objects which
the parts depict. We can use constraint satisfaction algorithms to determine what possible
objects can be constructed from the lines given. First, we need to label the lines in the image
to distinguish between concave edges, convex edges and obscuring edges. An obscuring edge
occurs where a part of one object lies in front of another object or in front of a different part of
the same object.
The convention is to use a ‘+’ to label a convex edge, a ‘-‘ for a concave edge and an arrow for
an obscuring edge. The arrow has the object the edge is ‘attached’ to on its right and the
obscured object on its left. Fig. 14.15., shows an object with the lines in the image suitably
labelled.
How do we decide which labels to use for each line? Lines meet each other at vertices. If we
assume that certain degenerate cases do not occur, then we need only worry about trihedral
vertices (in which exactly three lines meet at a vertex). There are four types of such vertices,
called L, T, fork (or Y) and arrow (or W).
There are 208 possible labellings using the four labels available, but only 18 of these are
physically possible (see Figs. 14.16 -14.19.). We can therefore use these to constrain our line
labelling. Waltz proposed a method for line labelling using these constraints.
The Waltz Algorithm
Waltz’s algorithm basically starts at the outside edges of the objects and works inward using
the constraints. The outside edges must always be obscuring edges (where it is the background
that is obscured). Therefore, these can always be labelled with clockwise arrows.
1. Label the outside (boundary) lines of the objects with clockwise obscuring arrows.
2. Find vertices where the currently labelled lines are sufficient to determine the type of the
vertex.
3. Use the type of each such vertex to label its remaining lines.
Steps 2 and 3 are repeated either until there are no unlabelled lines (success), or until there are
no remaining vertices which are completely determined (failure). We will follow through the
steps of this algorithm attempting to label the object in Fig. 14.15. We start by naming the
vertices and labelling the boundary lines. This gives the labelling in Fig. 14.20(i).
We now perform the first pass of steps 2 and 3. Vertices a, c, f and h are arrow vertices with
their two side arms labelled as boundaries (‘>’). Only type A6 matches this, so the remaining
line attached to each of these vertices must be convex (‘+’). Similarly, the T vertex d must be of
type T4, hence the line d-k is a boundary. Vertices e and i are already fully labelled, and so add
no new information. The results of this pass are shown in (ii).
On the second pass of steps 2 and 3 we concentrate on vertices j, k and t. Unfortunately, vertex
k is not determined yet; it might be of type L1 or L5 and we have to wait until we have more
information. However, vertices j and t are more helpful: they are forks with one concave line.
If one line at a fork is concave it must be of type F1, and so all the lines from it are concave.
These are marked in (iii).
As we start the third pass, we see that k is still not determined, but m is an arrow with two
concave arms. It is therefore of type A3 and the remaining edge is concave. This also finally
determines that k is of type L5. The fully labelled object (iv) now agrees with the original
labelling in Fig. 14.15.
Problems with Labelling:
Waltz’s algorithm will always find the unique correct labelling if one exists. However, there
are scenes for which there are multiple labellings, or for which no labelling can be found. Fig.
14.21 shows a scene with an ambiguous labelling. The first labelling corresponds to the upper
block being attached to the lower one; in the second labelling the upper block is ‘floating’
above the lower one.
If there were a third block between the other two we would be able to distinguish the two, but
with no further information we cannot do so. With this scene, Waltz’s algorithm would come
to an impasse at stage 2, when it would have unlabelled vertices remaining, but none which are
determined from the labelled edges. At this stage, we could make a guess about edge labelling,
but whereas the straightforward algorithm never needs to backtrack, we might need to change
our guesses as we search for a consistent labelling.
Fig. 14.21., shows the other problem, a scene which cannot be labelled consistently. In this
case Waltz’s algorithm would get stuck at step 3. Two different vertices would each try to label
the same edge differently. The problem edge is the central diagonal. Reasoning from the lower
arm, the algorithm thinks it is convex, but reasoning from the other two arms it thinks it is
concave. To be fair, the algorithm is having exactly the same problem as we have with this
image. It is locally sensible, but there is no reasonable interpretation of the whole scene.
Given only the set of vertex labelling from Figs. 14.16 – 14.19., there are also sensible scenes
which cannot be labelled. A pyramid which has faces meeting at the top cannot be labelled
using trihedral vertices. Even worse, a piece of folded cloth may have a cusp, where a fold line
disappears completely. These problems can be solved by extending the set of vertex types, but
as we take into account more complex vertices and edges, the number of cases does increase
dramatically.
We may also note that the algorithm starts with the premise that lines and vertices have been
identified correctly, which is not necessarily a very robust assumption. If the edge detection is
not perfect, then we might need to use uncertain reasoning while building up objects. Consider
Fig. 14.21 – a valid scene that can be labelled consistently.
However, if the image is slightly noisy at the top right vertex it might be uncertain whether it
is a T, an arrow or a Y vertex. If we choose the last of these, it would have the same problems
as with the first, inconsistent figure. If the edge detection algorithm instead gave probabilities,
we could use these with Bayesian reasoning to get the most likely labelling. However, the
search process would be somewhat more complicated than Waltz’s algorithm.
Edge detection simply uses lines of rapid change, but discards the properties of the regions
between the lines. However, there is a lot of information in these regions which can be used to
understand the image or to identify objects in the image. For example, in Fig. 14.22 it is likely
that the two regions labelled there are parts of the same object, partly obscured by the darker
object. We might have guessed this from the alignment of the two regions, but if they are of the
same colour this reinforces the conclusion.
Also, the position and nature of highlights and shadows can help to determine the position and
orientation of objects. If we have concluded that an edge joins two parts of the same object,
then we can use the relative brightness of the two faces to determine which is facing the light.
Of course, this depends on the assumption that the faces are all of similar colour and shade.
Such heuristics are often right, but can sometimes lead us to misinterpret an image, which is
precisely why we can see a two-dimensional picture as if it had depth.
Once we know the position of the light source, we can work out which regions represent
possible shadows of other objects and hence connect them to the face to which they belong.
For example, in Fig. 14.24., we can see from the different shadings that the light is coming
from above, behind and slightly to the left. It is then obvious that the black region is the shadow
of the upper box and so is part of the top face of the lower box.
Shadows and lighting can also help us to disambiguate images. If one object casts a shadow on
another then it must lie between that object and the light. Also, the shape of a shadow may be
able to tell us about the distance of objects from one another and whether they are touching.
6. Identifying Objects:
Finally, having extracted various features from an image, we need to establish what the various
objects are. The output of this will be some sort of symbolic representation at the semantic
level.
Using Bitmaps:
The simplest form of object identification is just to take the bitmap, suitably thresholded, and
match it against various templates of known objects. We can then simply count the number of
pixels which disagree and use this as a measure of fit. The best match is chosen, and so long as
its match exceeds a certain threshold it is accepted.
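A minimal sketch of this kind of bitmap template matching (the templates, the observed bitmap and the acceptance limit are all invented for illustration):

```python
import numpy as np

def match_score(bitmap, template):
    """Count the pixels on which a thresholded bitmap and a template disagree;
    a lower count means a better fit."""
    return int((bitmap != template).sum())

def identify(bitmap, templates, max_disagreement):
    """Return the name of the best-matching template, or None if even the best
    match disagrees on too many pixels."""
    best_name, best_score = None, None
    for name, template in templates.items():
        score = match_score(bitmap, template)
        if best_score is None or score < best_score:
            best_name, best_score = name, score
    return best_name if best_score <= max_disagreement else None

# Hypothetical 3x3 templates and a slightly noisy observed bitmap.
templates = {
    "bar":   np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "block": np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]]),
}
observed = np.array([[0, 1, 0], [0, 1, 1], [0, 1, 0]])
print(identify(observed, templates, max_disagreement=2))   # "bar"
```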
This form of matching can work well where we can be sure that shapes are not occluded and
where lighting levels can be chosen to ensure clean thresholded images. However, in many
situations the match will be partial, either because of noise, or because the object is partly
obscured by another object. Simply reducing the threshold for acceptability will not work.
Consider the two images in Fig. 14.25. They have a similar number of pixels in common with
the template, but the first is clearly a triangle like the template whereas the second is not.
Neural network techniques can be used to deal with noisy pattern matching. Several different
types could be used, but most work by taking a series of examples of images of the different
objects and ‘learning’ how to recognise them. Depending on the particular technique, the
examples may be of perfect or of noisy images.
After training, when the network is presented with an image it identifies the object it thinks it
matches, sometimes with an indication of certainty. Neural networks can often give accurate
results even when there is a large amount of noise, without some of the unacceptable
spurious matches produced by crude template matching. One reason for this is that many nets
effectively match significant features, such as the corners and edges of the triangle. This is not
because they have any particular knowledge built in, but simply because of the low-level way
in which they learn.
One problem with both template matching and neural networks is that they are looking for the
object at a particular place in the image. They have problems when the object is at a different
location or orientation than the examples with which they are taught. One solution is to use lots
of examples at different orientations.
For template matching this increases the cost dramatically. For neural nets, the way in which
the patterns are stored reduces this cost to some extent, but if too many patterns are taught
without increasing the size of the network, the accuracy will eventually decay.
An alternative approach is to move the object so that it is in the expected location. If we are
able to identify which region of the image represents an object, then the object can be moved so
that it lies at the bottom left-hand corner of the image, and then matched in this standard position.
This process is called normalisation. A few stray pixels at the bottom or left of the object can
upset this process, but alternative normalisation methods are less susceptible to noise, for
example moving the centre of gravity of the object to the centre of the image.
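A minimal sketch of the centre-of-gravity normalisation just described (the pixel coordinates and image size are invented; only the translation step is shown, not the rotation or scaling discussed next):

```python
import numpy as np

def normalise_translation(pixels, size):
    """Shift an object's pixels so that its centre of gravity lies at the centre of a
    size x size image. `pixels` is a list of (row, col) coordinates."""
    coords = np.array(pixels, dtype=float)
    row_shift, col_shift = np.array([size / 2.0, size / 2.0]) - coords.mean(axis=0)
    return [(int(round(r + row_shift)), int(round(c + col_shift))) for r, c in pixels]

# A small object sitting in the top-right of a 16x16 image, moved to the centre.
object_pixels = [(1, 12), (1, 13), (2, 12), (2, 13), (3, 13)]
print(normalise_translation(object_pixels, 16))
```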
Similar methods can be used to standardise the orientation and size of the object. The general
idea is to find a co-ordinate system relative to the object and then use this to transform the
object into the standard co-ordinate system used for the matching.
ii. Choose the direction in which the object is ‘widest’; make this the x-axis.
iv. Scale the two axes so that the object ‘fits’ within the unit square.
The definitions of ‘widest’ and ‘fit’ in steps ii and iv can use the simple extent of the object,
but are more often based on measures which are less sensitive to noise. The process is illustrated
in Fig. 14.25. The resulting x- and y-axes are called an object-centred coordinate system.
Obviously all the example images must be transformed in a similar fashion so that they match!