Module - 2 - Computer Vision For Robotics Systems
6.0 OBJECTIVES
• Where is it?
• Is it the one I am looking for?
• Is it defective or is it OK?
• How far away is it?
• Is it right side up?
• Is it being interfered with by another object of the same type or a different type?
• What is the angle of the object relative to my hand?
• What color is it?
• Imaging components
• Image representation
• Hardware considerations
• Picture coding
• Object recognition and categorization
• Software considerations
• Need for vision training and adaptations
• Review of existing systems
6.1 MOTIVATION
One measure of this additional time penalty is the UPH or "units per hour" factor. For two sequential processes, such as "assembly" and "vision," one can define the overall UPH as follows:

UPH = 1 / (1/assembly UPH + 1/vision UPH)
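To make the throughput relation concrete, here is a minimal sketch (not from the original text; the 2000 and 18,000 units-per-hour figures are assumed purely for illustration) that computes the combined UPH of sequential assembly and vision stages.

```python
def combined_uph(*stage_uph):
    """Combined units-per-hour of several sequential stages.

    Every unit must pass through each stage, so the per-unit times
    (the reciprocals of the UPH values) add.
    """
    return 1.0 / sum(1.0 / uph for uph in stage_uph)

# Assumed example figures: a 1.8 s assembly cycle (2000 UPH) and a
# 200 ms vision cycle (18,000 UPH).
print(combined_uph(2000.0, 18000.0))   # 1800.0 -- a 10% throughput loss
```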
6.2 IMAGING COMPONENTS
The imaging component, the "eye" or the sensor, is the first link in the vision
chain. Numerous sensors may be used to observe the world. All the vision com-
ponents have the property that they are "remote sensing" or "noncontact" measurement devices.
Vision sensors can be categorized in many different ways. For convenience,
we will categorize them according to their dimensionality, although they could also
be classified by their wavelength sensitivity (i.e., do they respond to shades of
black and white, or colors, or infrared, x-ray, ultraviolet, or the normal spectrum
of human vision?). Vision sensors may be conveniently divided into the following
dimensional categories or classes:
• Point sensors
• Line sensors
• Planar sensors
• Volume sensors
6.2.1 Point Sensors
The point sensors may be similar to "electric eyes," being either some type of
photomultiplier or, more commonly, a phototransistor (see Section 5.8.2.1). In
either case, the sensor is capable of measuring the light only at a single point in
space. For this reason they are referred to as "point sensors." These sensors
may be coupled with a light source (e.g., a light-emitting diode) and used as a
noncontact "feeler," as shown in Figure 6.2.1.
The "feeler" essentially monitors the light in a small "acceptance aperture."
If an object falls in this acceptance aperture, light will be reflected from the object's
surface and will be received by the sensor. If the acceptance aperture is clear, no
light will be reflected into the sensor and it will not "feel" anything.
The point sensor may be used to create a higher-dimensional set of vision
information by scanning across a field of view by employing some ancillary mech-
anism. For example, an orthogonal set of scanning mirrors (see Figure 6.2.2) or
an x-y table can be employed to execute the scanning of the scene.
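As a rough illustration of how a point sensor plus a scanning mechanism builds up an image, the sketch below raster-scans an x-y table and fills a two-dimensional array one sample at a time. The functions `read_point_sensor()` and `move_xy_table()` are hypothetical stand-ins for whatever hardware drivers are actually available.

```python
import numpy as np

def read_point_sensor() -> float:
    """Placeholder for the real sensor driver: returns reflected-light intensity."""
    return np.random.rand()          # stand-in value

def move_xy_table(x_mm: float, y_mm: float) -> None:
    """Placeholder for the real x-y table driver."""
    pass

def scan_scene(rows: int, cols: int, step_mm: float) -> np.ndarray:
    """Build a 2-D image by stepping the table and sampling the point sensor."""
    image = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            move_xy_table(c * step_mm, r * step_mm)
            image[r, c] = read_point_sensor()
    return image

img = scan_scene(rows=64, cols=64, step_mm=0.5)
print(img.shape)        # (64, 64)
```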
Figure 6.2.1. Noncontact feeler-point (i.e., proximity) sensor. The object is sensed only if its location falls between the near and far sensing limits; points located beyond the far limit are out of the sensing range of the device.
Figure 6.2.2. Image scanning using a point sensor and oscillating deflecting mirrors (an illuminator and point sensor viewing the object via horizontal and vertical scanning mirrors).
6.2.2 Line Sensors

Line sensors are one-dimensional devices and may be used to collect vision infor-
mation from a scene in the real world. The sensor most frequently used is a "line
array" of photodiodes or charge-coupled-device components. These devices are
similar in operation, both being the equivalent of "analog shift registers" that
produce a sequential, synchronized output of electrical signals corresponding to
the light intensity falling on an integrating light-collecting cell. See Figure 6.2.3
for a schematic representation. The light output from these arrays is available
sequentially (i.e., the individual cell outputs are not available in parallel or even
on demand). The consequence of this is that the light intensity from the scene is
available only in an ordered sequence and not at random on demand by the user.
This has some consequences with regard to the time required for accessing a desired
point intensity. The arrays may also be obtained in other than straight lines (e.g.,
circular arrays or crossed arrays are available; see Figure 6.2.4).
By proper scanning, line arrays may be used to image a scene. For example,
by fixing the position of a straight-line sensor and moving an object orthogonally
to the orientation of the array, one may scan the entire object of interest. Figure
6.2.5 is an example of such an application in a robot system.
Figure 6.2.3. Schematic representation of line scanning arrays. The analog output signal is a sequential representation of the intensity of the light collected by the integrating cells.
Figure 6.2.5. An automated robot sorting system using a line scan camera to view parts on a moving conveyor belt (line scan camera, illumination, video processing system, TV monitor, conveyor rate information, robot and robot control).
In some applications this time penalty is not severe, but in others it is almost intolerable. In most robotic applications
cycle times of 1 to 10 s are commonplace, and the additional time needed for robot
vision processing would probably be acceptable. Most currently available vision
systems provide about five vision cycles per second, so the extra 200 ms is not
significant. However, in a small-parts assembly where the manipulator portion of
the assembly cycle may be on the order of 1 s, the additional 200 ms for vision
processing begins to have a significant effect. In an application such as semicon-
ductor assembly, a 200-ms penalty would be prohibitive, since the additional time
would reduce the process yield to unprofitable levels.
Although a vidicon camera sensor is not inherently a raster-scan device, the
raster-scan format creates economies in manufacture (i.e., inexpensive cameras on
the order of $200) and of course provides a simple mechanism for viewing the
scene using ordinary television monitors, also
relatively inexpensive.
Random-access scanning photomultipliers (image dissectors) are also available, but the photoelectric device is more expensive to manufacture (approximately
$1000 for the tube itself) and the control circuitry is more complex because of the
random-access requirements. Since this tube does not rely on the conventional
raster scan, interesting variations such as spiral scans or radial scans may be im-
plemented. Viewing the output of such a device would require a more costly
monitor that is capable of accepting both raster and random-access inputs, with
some type of mode switching required.
In addition to vidicon transducers, several types of solid-state cameras are
available (e.g., photodiode and charge-coupled-device cameras). The solid-state
camera is manufactured in a fashion similar to large-scale integrated circuits. The
sensor elements themselves are very different from the photosensitive elements in
a vidicon camera, but the arrays are still accessed in a serial or raster fashion. For
this reason the solid-state cameras have no access-time advantage over the vidicon
tube cameras. The solid-state arrays are inherently less noisy than the vidicon
cameras, but are also considerably more expensive. This price/performance trade-
off between the camera types must be carefully considered before final selection
of a photo-optical transducer is made for a particular application. Many appli-
cations require the solid-state sensors because of weight and noise factors. This
would be particularly important if it were necessary to mount the camera near or
on the end effector of a robot.
Two-dimensional arrays (similar to line arrays) may be formed using either
CCD (charge coupled device) or CID (charge injected device) technology. Both
of these sensors are based on MOS (metal oxide semiconductor) transistor technology. Since they are discrete in nature, these devices will have a finite number
of cells in both the horizontal and vertical direction. The solid-state array sensor
is also an integrating detector and thus it is apparent that its sensitivity is propor-
tional to exposure time.
It is important to understand how video information from the two-dimensional
array is acquired. In the case of the CCD the most popular topology used is the
frame transfer (FT) structure. FT technology makes use of an imaging area which
is exposed to light and generates charges proportional to the integral of the light
intensity. There is also a storage area having the same number of cells as the
imaging area. During the "frame time" when the image area is exposed, charge
is accumulated at various cell locations in the array. After proper exposure, but
before the next frame, this charge is clocked up to the corresponding cell in the
storage area. (Note that the storage area is shielded from light.) A transfer
register that permits each row of data to be moved in a serial manner is utilized
in the operation. While the imaging area is being exposed for acquisition of the
next frame of data, the charge pattern of the previous frame can be read out from
the storage area by means of the readout register that operates in a manner similar
to the transfer register.
In contrast to the CCD device, the CID camera may be thought of as consisting of a matrix of photosensitive cells arranged in rows and columns. As opposed to the CCD technology, each of the cells can be addressed randomly.
The transfer characteristic of the photoconductive sensor can be written as

I/I_r = (E/E_r)^γ

where I denotes the signal current from the sensor, E the illumination on the photoconductive device, and I_r and E_r are, respectively, the values to which the signal and illumination are referenced.
For non-unity gamma, the contrast in the darker part of the picture is compressed while the lighter portion is exaggerated. Most television cameras are designed with γ = 0.45. This is the result of the natural characteristic of the CCTV monitors used to view the output. Normally, such monitors have a gamma of about 2.2. Thus if the camera is adjusted for the inverse (i.e., γ = 0.45), the picture on the monitor is pleasing to the eye. Unfortunately, while the image may appear pleasing to a human being, the signal provided to a computer will not contain the correct information. This is why it is extremely important to use a camera of unity gamma for imaging a scene for a vision application.
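The effect of a non-unity camera gamma, and the correction back to a linear signal, can be sketched as follows. This is only a numerical illustration of the relation quoted above; the normalized illumination values are chosen arbitrarily.

```python
import numpy as np

GAMMA_CAMERA = 0.45     # typical camera gamma quoted in the text

def camera_response(E_norm: np.ndarray, gamma: float = GAMMA_CAMERA) -> np.ndarray:
    """Signal produced by a camera with the given gamma, I/I_r = (E/E_r)^gamma."""
    return E_norm ** gamma

def linearize(signal: np.ndarray, gamma: float = GAMMA_CAMERA) -> np.ndarray:
    """Undo the camera gamma so a computer sees values proportional to illumination."""
    return signal ** (1.0 / gamma)

E = np.array([0.1, 0.25, 0.5, 1.0])          # normalized scene illumination
I = camera_response(E)                        # dark-end contrast is compressed
print(np.round(I, 3))                         # [0.355 0.536 0.732 1.   ]
print(np.round(linearize(I), 3))              # recovers [0.1 0.25 0.5 1.]
```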
6.2.3.2 Raster scan
Previously, it was mentioned that the electron beam in a vidicon is scanned across the photosensitive element in a raster fashion. Figure 6.2.6 illustrates this process.
Figure 6.2.6. (a) Raster scan process on a television monitor; (b) magnified view of an interlaced raster scan showing odd and even field lines; (c) two lines of analog video data, showing the horizontal and vertical synchronization pulses and the intensities of the first and last lines.
TABLE 6.2.2  UNITED STATES MONOCHROME TELEVISION STANDARD RS-170 SPECIFICATIONS FOR RASTER SCAN

Parameter                        Value
Aspect ratio (width to height)   4/3
Lines per frame                  525
Line frequency                   15.75 kHz
Line time                        63.5 µs
Horizontal retrace time          10 µs
Field frequency                  60 Hz
Vertical retrace                 20 lines per field
The image capture time extends from the instant when the object is motionless to the time when the RS-170 video signal representing the scene
contains both the odd and even fields (with valid data). To this must be added
the time that it takes to store the image in whatever type of vision system is being
used.
Image capture time is dependent on the type of sensor and lighting. Consider
a CCD camera fixed in space and looking at an area into which an object is moved
by means of some mechanical device (such as an x-y table or conveyor). The
camera, by its nature (integration of light levels) must be continually scanning.
Since there will be a settling time for all mechanical motions, the instant when the
motion of the object ceases will not be synchronized with the start of a field.
Additionally, and as described previously, a frame transfer CCD camera transfers
its image to a storage area which can be read out only when the next image area
is scanned. Referring to Figure 6.2.7, it can be seen that the worst-case time for
image capture with a CCD is 83.3 ms. It should also be apparent that if the object
is motionless for at least three field times (enough to read out any invalid video
data from the frame transfer system), then the image can be captured in two field times (or 33.3 ms). That is, the data coming from the camera will be valid after three field times, and the capture time is that needed to obtain both an odd and an even field.
The same type of analysis can be performed for a vidicon, where the lag time must be taken into account. The CID camera does not have a frame transfer architecture, and therefore once the image is stationary, the video is available immediately.

Figure 6.2.7. Image capture for a CCD camera at the time when an object stops moving. The worst-case time from the end of motion to the signal being received is 83.3 ms; each field lasts 16.7 ms. E_n denotes even field n, O_n denotes odd field n, and * marks invalid video. Note: 33.3 ms must be allowed for integration of each field.
6.2.4 Volume Sensors

Volume sensors, providing general three-dimensional information, are not yet available on the market as a standard item. There are mechanisms that may be used to measure three-dimensional shape and orientation properties of solid objects. Stereo imaging using multiple two-dimensional arrays to image the object may be used. Three-dimensional information from solid objects may also be obtained using structured-light techniques.
Figure 6.3.2. (a) Structured light imaging for the NBS robot vision system; (b) example objects and the line segment pattern formed by a plane of light for a box and an object with both a raised and a depressed surface. In all instances, a dark line represents reflected light and is the only image "seen" by the camera.
Figure 6.3.3. Digital picture representation and data readout from an array of light-integrating cells. The signal from the output register is shown for row 5 or 6, with the white and black levels indicated relative to the pixel clock.
situations. Since commercial television cameras are serial-access devices, and since most vision system processing will require random access to the picture information, one commonly encountered problem is the storage of pixels for future reference. This is a serious issue in
practical robotic applications. Even if one took 10 samples per line in order to match the instruction times of a typical microprocessor, we would still need to return to the same line 25 times, so it would require about 25/30 ≈ 0.83 s to acquire an image, which is still too long for most practical applications. The generally applied solution
to the time problem is to use a frame buffer that is capable of storing an entire
image. These frame buffers, which usually have a built-in digitizer to convert the
data to digital form, are available off the shelf in a variety of configurations that
match the bus specifications of almost any minicomputer or microprocessor.
Using
such a device then permits higher-level languages and general-purpose image anal
ysis algorithms to be applied to what is essentially a large array of data points.
In many robotic applications where a manipulator is to grasp an object for
placement elsewhere, the silhouette of the part is very often sufficient to permit
orientation of the manipulator. In these situations, the image required for use is
said to be binary; that is, the pixels describing the object need only one bit to
describe the presence or absence of light intensity with respect to the portion of the object being viewed at that instant. Since it is inefficient to use an 8-bit byte to store binary pixels, binary images are often stored in a packed format. Packing allows 8 pixels to be stored into each byte, therefore requiring only 8192 bytes of memory for each field. (Verification of this is left as an exercise for the reader.)

Figure 6.4.1. A/D subsystem for converting analog video signals to digital gray levels (video in, sample and hold, A/D converter, pixel clock, and digital gray-level data with timing sent to the computer).
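The 8192-byte figure can be checked with a short sketch: packing a 256 × 256 binary field at 8 pixels per byte. numpy's `packbits` is used here purely for illustration and is not something the text prescribes.

```python
import numpy as np

# A 256 x 256 binary field (values 0 or 1), e.g. from thresholding.
binary_field = (np.random.rand(256, 256) > 0.5).astype(np.uint8)

packed = np.packbits(binary_field)    # 8 pixels per byte
print(binary_field.size)              # 65536 pixels
print(packed.nbytes)                  # 8192 bytes, as stated above

# Unpacking restores the original bit image.
restored = np.unpackbits(packed).reshape(256, 256)
assert np.array_equal(restored, binary_field)
```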
For gray-scale images one may require 32 kilobytes to 64 kilobytes of storage for 4-bit (16 gray levels) or 8-bit (256 gray levels) images. Even with the use of frame storage buffers, image-processing algorithms for robot vision must still be kept very simple if software techniques are to be used to process the images, because so much data must be processed. Even moderately complex algorithms require considerable amounts of hardware to achieve sufficient preprocessing for general-purpose software to be applied. In some cases the hardware algorithm enhancements are so sophisticated that there actually is no need to store the image itself; rather, only a set of reduced picture parameters is accessed by the computer.
In addition to the considerations for image acquisition, the illumination of
the scene itself influences the hardware considerations. In the simple cases, the
lighting may be matched to the color or surface reflectivity of the objects under
consideration. It may be necessary to use either colored light or colored filters
to bring out certain features to match the color sensitivity of the camera. This
often permits certain simple processing to be accomplished at the speed of light
before the data ever reach the photosensitive target. One should remember that
the speed of light is finite, and that light has a speed of about 1 ft/ns. By using
oblique as well as normal perpendicular lighting, one may take advantage of shad-
ows and surface anomalies, or one may choose to eliminate these effects. It is
also common to use shutters and/or stroboscopic sources to "freeze" motion. Since
the vidicon target is scanned at 60 Hz, motion blurring of moving objects is common, and a strobed light source can often be of great use. The principle in use is related
to the fact that the vidicon will store the electrical equivalent of an image for a
short time while the scanning is being completed. The strobe light essentially
freezes the object's position while the camera stores its reflected light pattern. The
scanning circuitry then transters the electrical signal a short time later.
The appropriate use of transmitted and reflected illumination must be made to more easily satisfy the picture-processing requirements of the computing components, or there may be no feasible solution to the vision-processing task at hand in the limited time available. Again, the structured-light approach of the National Bureau of Standards is a good illustration of this principle in a robotics application (see Section 6.9.5.2).
6.5 PICTURE CODING

The representation of pictures has been briefly discussed previously. In this section we treat the topic more extensively and present the following coding concepts that represent the most commonly used methods in present practice:
• Gray-scale images
• Binary images
• Run-length coding
• Differential-delta modulation
6.5.1 Gray-Scale Images
Perhaps the best place to begin the discussion of picture coding is with the simplest distributional representation scheme, the gray-level histogram, which is a one-dimensional array containing the distribution of intensities from the image. Figure 6.5.1 shows a picture and its gray-level distribution. Assuming an 8-bit gray scale,
the number of gray levels will be 256. One can see that the gray-level histogram
destroys all geometrical information. This is illustrated simply by noting that if
the image is rotated by any arbitrary angle, the gray-level distribution will remain
the same. In a sense, the gray-level histogram is really an image transformation
of a very simple type, so it is often very useful in evaluating imagery because of
the enormous concomitant data reduction. For example, the original image may require 65 kilobytes of storage, while the histogram would require only 256 × 16 bits or 512 bytes, a saving of over 99% (512/65,536).
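A minimal sketch of the gray-level histogram and the data reduction it brings, assuming a 256 × 256, 8-bit image; the random image is only a stand-in for real data.

```python
import numpy as np

image = np.random.randint(0, 256, size=(256, 256), dtype=np.uint8)  # stand-in 8-bit image

# 256-bin gray-level histogram: counts of each intensity, all geometry discarded.
hist = np.bincount(image.ravel(), minlength=256)

print(image.nbytes)            # 65536 bytes to store the image
print(hist.size * 2)           # 512 bytes if each of the 256 counts is held in 16 bits

# Rotating the image changes the pixel arrangement but not the histogram.
assert np.array_equal(hist, np.bincount(np.rot90(image).ravel(), minlength=256))
```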
The gray-level histogram is related to the probability of occurrence of gray-level information. Using this interpretation, there are many useful ideas that
evolve naturally. The average gray level, in conjunction with the minimum and
maximum gray levels, can be inspected to decide whether or not the picture has
an adequate contrast throughout the scene. This could, of course, be done au-
tomatically and the illumination increased or the sensitivity of the camera increased
by opening the aperture setting on the camera.
Other picture properties may be deduced by measuring properties of the
histogram, such as variance. The variance of a histogram is a measure of the
"spread" of the distribution of gray values. The gray-level histogram function
may also be used to determine a contrast enhancement function, so that the overall
image quality is improved. For example, a simple linear contrast enhancement
may be specified by amplifying the video signal and altering the offset. The
mathematical expression for linear contrast enhancement is given as follows:
new gray level = K (old gray level) + B

where

K = amplification or attenuation factor
B = bias or offset gray level
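A sketch of the linear enhancement above, applied to an 8-bit image. The values of K and B are arbitrary, and the results are clipped to the 0-255 range, a practical detail not spelled out in the text.

```python
import numpy as np

def linear_contrast(image: np.ndarray, K: float, B: float) -> np.ndarray:
    """new gray level = K * (old gray level) + B, clipped to the 8-bit range."""
    stretched = K * image.astype(np.float64) + B
    return np.clip(stretched, 0, 255).astype(np.uint8)

# Stand-in for a poor-contrast image: gray levels bunched between 100 and 150.
poor = np.random.randint(100, 151, size=(256, 256), dtype=np.uint8)

enhanced = linear_contrast(poor, K=5.0, B=-500.0)   # maps 100..150 onto 0..250
print(poor.min(), poor.max())          # ~100 150
print(enhanced.min(), enhanced.max())  # ~0 250
```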
The results of linear contrast enhancement are illustrated in Figures 6.5.1 and 6.5.2.
Figure 6.5.1(a) shows an original poor-contrast image with the corresponding histogram shown in Figure 6.5.1(b). The results of linear contrast enhancement and
the corresponding histogram are indicated in Figure 6.5.2(a) and (b), respectively.
If one inspects the equation for linear contrast enhancement, it is easy to see
Figure 6.5.1. (a) Original poor-contrast image; (b) its gray-level histogram.

Figure 6.5.2. (a) Image after linear contrast enhancement; (b) its gray-level histogram.
However, the memory storage requirements are greater than for binary images, and the algorithms for processing gray-scale images are generally much more complicated and more time consuming than those used for binary image processing. This will become clearer in the following sections.
6.5.2 Binary Images

Binary image coding requires that each pixel in the original image be coded into one bit; hence the term binary image. In its simplest form, a fixed threshold
may be applied over the entire scene. Figure 6.5.5 shows an example of such a
process. Although this binary image appears visually pleasing, and a human being
is easily able to recognize its content, in point of fact the image itself has been
poorly digitized since there is an uneven or disproportionate distribution of black-and-white regions. In this figure the original subject matter was illuminated from
one side, so that a significant shadow appears in the image and the selected binary
image threshold is inadequate. In the case of a more realistic scene for robotic
vision, this effect causes part of the object to extend past its actual limits, and part
of the image to be eroded.
In the case of the binary portrait, it is important to appreciate the fact that
no single binary threshold would be adequate under the lighting conditions used
to present the image to the camera. Given this illumination, one could attempt
to use an adaptive threshold that would adapt to the region locally surrounding
the pixel to be binarized. One such adaptation technique is to use the local average
intensity as the threshold. Figure 6.5.6 shows such a method applied to an image.
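A sketch of local-average thresholding in the spirit of Figure 6.5.6. The window size and the uniform-filter implementation are my own choices, not taken from the text.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_average_threshold(image: np.ndarray, window: int = 15) -> np.ndarray:
    """Binarize an image against the mean intensity of a local neighborhood.

    Each pixel is compared with the average of the window x window region
    around it, so no single global threshold is needed.
    """
    local_mean = uniform_filter(image.astype(np.float64), size=window)
    return (image > local_mean).astype(np.uint8)   # 1 = white, 0 = black

# Stand-in scene: a bright square on a background whose brightness ramps
# from left to right, defeating any single fixed threshold.
gradient = np.tile(np.linspace(40, 200, 256), (256, 1))
scene = gradient.copy()
scene[96:160, 96:160] += 60
binary = local_average_threshold(scene)
print(binary.shape, binary.dtype, binary.max())    # (256, 256) uint8 1
```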
Because of the reduced memory requirements for binary image coding, as
well as the reduced arithmetic requirements when dealing with a 1-bit pixel, many
of the commercial vision products now manufactured frequently use binary images
and have, to date, not generally used gray-scale imagery for object identification,
location, and so on. This implies that shape and geometry factors, rather than
gray-level textural parameters, are the most used in present-day robotic vision
systems.
6.5.3 Run-Length Coding

Gray-scale and binary coding of images are direct methods for image coding, in that both systems maintain a map of the (x, y) coordinates and the corresponding intensity information. In the simplest form, this might be an array of intensity
values, the array being as long as the number of pixels in the image. The term
"data structure" refers to a representation of data in a structured manner useful for implementation and management by a computer system. The data structure for gray-level and binary images will be a continuous array in memory, whose index value and contents are directly related to a pixel location and intensity value.
Figure 6.5.7 shows such a data structure for an 8-bit intensity-mapped image. For this data structure, the location of the pixel under consideration is used to compute the index in memory associated with that pixel. In this example, the entry index is computed by taking the row number of the pixel and adding 256 multiplied by the column number of the pixel. (The row index ranges from 0 to 255, while the column index ranges from 1 to 256.) The actual entry value in the array is the digitized CCTV electrical value, which will range from 0 to 255.
Figure 6.5.4. Computer-generated gray wedges (scales): (a) 1-bit gray-scale resolution; (b) 2-bit gray-scale resolution; (c) 4-bit gray-scale resolution; (d) 6-bit gray-scale resolution.

Figure 6.5.6. Binarized images using the local average as a binary threshold: (a) original gray-level image; (b) original image with a 3 × 3 pixel local average subtracted; (c) binarized image with gray values of 124 to 132 mapped to black (all other values are white).
For binary imagery, the data structure will require only 1 bit per pixel instead of 8 bits per pixel.
These direct coding and simple packing schemes are quite simple but do not take advantage of image structure. If we look at the data structures above, especially for binary images, it is clear that there will generally exist long strings of pixels of the same binary value.
Figure 6.5.7. Data structure for an 8-bit intensity-mapped image: each entry index holds an 8-bit entry value (the digitized gray level), with successive 256-entry blocks of the array holding the first row, the second row, and so on up to the 256th row.
In this case, a separate entry for each pixel is a waste of computer memory. If one simply stores the transition points and the string length, a (potentially) more efficient data structure can be applied to the images. Figure 6.5.8 illustrates such a "run-length coded" data structure.
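A sketch of run-length coding for one row of a binary image. The (value, length) pair format used here is one reasonable data structure; the book's Figure 6.5.8 may differ in detail.

```python
from typing import List, Tuple

def run_length_encode(row: List[int]) -> List[Tuple[int, int]]:
    """Encode a row of binary pixels as (pixel value, run length) pairs."""
    runs = []
    current, length = row[0], 1
    for pixel in row[1:]:
        if pixel == current:
            length += 1
        else:
            runs.append((current, length))
            current, length = pixel, 1
    runs.append((current, length))
    return runs

def run_length_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Rebuild the row of pixels from its run-length code."""
    return [value for value, length in runs for _ in range(length)]

row = [0] * 12 + [1] * 40 + [0] * 12
code = run_length_encode(row)
print(code)                                   # [(0, 12), (1, 40), (0, 12)]
assert run_length_decode(code) == row
```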
For certain types of images that have a lot of "blobs," the memory requirements may be considerably reduced by the use of run-length coding. However, images with numerous small features may require more memory storage than the direct method. In many robotic vision applications, the use of run-length coding does offer a considerable saving. Gray-scale images may also be run-length coded, using a data structure similar to that shown above. Ordinarily, however, the run lengths are not as long, and memory savings are usually not as great as for binary images.
6.5.4 Differential-Delta Coding
Differential-delta coding (DDC) may also be used to code gray-scale images more efficiently. This coding technique uses the difference between the intensity of a pixel and the previous pixel. In ordinary scenes, the difference between subsequent pixels is not very large (on the order of 25% of the range or less), so the number
of bits to encode the difference is two less than that required to encode the entire
number representing the intensity. The data structure for this encoding is given
in Figure 6.5.9. Figure 6.5.7 provides the data for Figure 6.5.9.
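A sketch of differential-delta coding along a row of gray levels. The sample values are arbitrary; the point is that small neighbor-to-neighbor differences need fewer bits than the full intensity.

```python
from typing import List

def ddc_encode(row: List[int]) -> List[int]:
    """Store the first pixel, then only the difference from the previous pixel."""
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]

def ddc_decode(code: List[int]) -> List[int]:
    """A running sum of the differences recovers the original gray levels."""
    row = [code[0]]
    for delta in code[1:]:
        row.append(row[-1] + delta)
    return row

row = [100, 102, 103, 106, 105, 101, 101, 99]      # slowly varying gray levels
code = ddc_encode(row)
print(code)                        # [100, 2, 1, 3, -1, -4, 0, -2] -- small deltas
assert ddc_decode(code) == row
```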
6.6 OBJECT RECOGNITION AND CATEGORIZATION

The preceding section dealt with the coding of images in a blind fashion, without
intelligently addressing the issues of analysis of imagery from robotic applications.
Although coding techniques are important, it is more important to understand the
need for partitioning or segmenting the imagery from the real world. As was
stated in the beginning of this chapter, several basic tasks are commonly encoun-
tered in robotic vision systems such as tracking, identification, and inspection. All
these tasks require that the particular portion of the scene to be operated on can
be extracted from extraneous picture information and can be analyzed efficiently.
This naturally leads us to the two topics that will make the realization of these
goals more likely:
• Dimensionality reduction
• Segmentation of images
6.6.1 Dimensionality Reduction
The direct coding of images using binary or gray-scale intensity coding results in fixed and somewhat large storage requirements (generally from 8 to 65 kilobytes and more). Run-length coding and DDC generally reduce those figures when used for appropriate imagery. These types of storage are still rather unintelligent in that they are coding the pixel values as they are found; there is no intelligence embedded into the numbers stored to represent the picture. In a sense, the images have been memorized, a photographic memory if you will. We all know that those with photographic memories may not possess the wisdom to use the information in an intelligent fashion. In a sense, the coding scheme makes up for the lack of intelligent coding by brute force. Once all the data of an image have been acquired, use must be made of the pixels to represent the content of the image rather than just the arrangement of the pixels. The description of the content of an image will generally yield a simpler description of the objects within the image field of view. For example, one may describe an egg's orientation on a table as: the egg is located at (x₀, y₀).
Figure 6.6.1. Local sub-image definition: a 3 × 3 neighborhood of pixels labeled A, B, C (top row), D, E, F (middle row), and G, H, I (bottom row), with E the central pixel.
Edge4 = |G − E| + |E − C|
Figure 6.6.2. Examples of one-dimensional edge operators on two images: (a) original egg; (b) original integrated circuit; (c) Edge2 horizontal difference; (d) Edge3 vertical difference.
Figure 6.6.3. Examples of Sobel and modified Sobel operators: (a) image of an egg; (b) Sobel difference operator applied to the image in part (a); (c) modified Sobel operator applied to the image in part (a); (d) difference between the images in parts (b) and (c).
Sobel = [(A + 2B + C − G − 2H − I)² + (A + 2D + G − C − 2F − I)²]^(1/2)

This operator computes a weighted-average intensity function along the borders of the subimage, and then forms two edge measurements perpendicular to each other. The two edge properties are then combined in a "quadrature" measurement. The Sobel operator generally enhances edges in an acceptable fashion. Note that it also has built-in smoothing of the local region. The squaring and square-root operations will be very time consuming, so one normally would modify this operator by replacing the squares with absolute values, and eliminating the square root in toto. This is referred to as the modified Sobel operator. Applications of the Sobel and modified Sobel operators are shown in Figure 6.6.3.
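A sketch of the Sobel and modified Sobel operators using the A through I labeling of Figure 6.6.1 (E is the central pixel). Plain Python loops are used for clarity rather than speed.

```python
import numpy as np

def sobel(image: np.ndarray, modified: bool = False) -> np.ndarray:
    """Edge magnitude at each interior pixel of a gray-level image.

    With modified=False: sqrt(gx**2 + gy**2) (quadrature combination).
    With modified=True:  |gx| + |gy| (squares and square root avoided).
    """
    img = image.astype(np.float64)
    rows, cols = img.shape
    out = np.zeros_like(img)
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            A, B, C = img[r - 1, c - 1], img[r - 1, c], img[r - 1, c + 1]
            D, _, F = img[r, c - 1],     img[r, c],     img[r, c + 1]
            G, H, I = img[r + 1, c - 1], img[r + 1, c], img[r + 1, c + 1]
            gx = A + 2 * B + C - G - 2 * H - I        # horizontal edge measure
            gy = A + 2 * D + G - C - 2 * F - I        # vertical edge measure
            out[r, c] = abs(gx) + abs(gy) if modified else np.hypot(gx, gy)
    return out

# Stand-in image: a dark and a bright half produce a strong vertical edge.
test = np.zeros((16, 16))
test[:, 8:] = 200
print(sobel(test).max(), sobel(test, modified=True).max())   # 800.0 800.0
```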
This operator compares the intensity of the central pixel to the 8 pixels surrounding it. Here, only the difference in contrast between the central pixel and its neighbors
• If the current point is outside the region (i.e., both test pixels are outside), turn right until the region is entered.
• If the current point is indeterminate (i.e., one test pixel is inside and the other is outside the region), go straight.

This procedure is graphically illustrated below. (Note that "x" marks pixels interior to the region and "o" marks pixels exterior to the region. Also, the "x" and the "o" are the only two pixels under consideration at the present time.)
The chain code specifies, for each boundary point, one of eight directions that may be traversed in reaching the next point on the outline of the object. Such a coding scheme is illustrated in Figure 6.6.6 for a simple object. In this figure, each "*" represents a point on the outline.

Chain code = <e, e, e, e, e, e, e, se, s, s, w, w, w, w, w, w, w, nw, n, n>

where e = east, se = southeast, s = south, w = west, nw = northwest, and n = north.
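A sketch of decoding the chain code above back into boundary coordinates. The coordinate convention is assumed here (x increases to the east, y to the north); a closed outline returns to its starting point.

```python
# Displacement for each of the eight compass directions (dx, dy),
# with x increasing to the east and y increasing to the north.
STEP = {"e": (1, 0), "ne": (1, 1), "n": (0, 1), "nw": (-1, 1),
        "w": (-1, 0), "sw": (-1, -1), "s": (0, -1), "se": (1, -1)}

def decode_chain(start, chain):
    """Turn a chain code into the list of boundary points it traces out."""
    points = [start]
    x, y = start
    for direction in chain:
        dx, dy = STEP[direction]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points

chain = ["e"] * 7 + ["se"] + ["s"] * 2 + ["w"] * 7 + ["nw"] + ["n"] * 2
outline = decode_chain((0, 0), chain)
print(outline[0], outline[-1])     # (0, 0) (0, 0)  -- the outline closes on itself
```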
• Size or area
• Range of object projected onto x and y axes
• Ratio of the extent of the object in the x and y directions
• Center of gravity of gray-scale or binary rendition of object
• Geometrical moment description
• Number of holes in the object
Figure 6.6.7 illustrates some shape and geometry features commonly used in
object description. These features clearly reduce the dimensionality required to
characterize the objects and will reduce the computation time for object identifi-
cation. Since the dimensionality of the descriptions is reduced, we should expect
to trade off something in return. In fact, we will find that these features do not uniquely characterize the object, so there exist an unlimited number of different objects that share the same feature values.
Figure 6.6.7. Commonly used shape and geometry features: area excluding holes, interior area (area of holes), longest dimension, Feret's diameter, breadth, perimeter, convex perimeter, projected length, maximum horizontal chord, and examples of derived size measurements such as an equivalent cylindrical volume. Redrawn with permission of Dynatech Laboratories, Inc., Imaging Products.
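A sketch of a few of the simpler features listed above, computed from a binary image: area, center of gravity, the projections onto the x and y axes, and their extent ratio. Hole counting would need a connected-components pass and is omitted here.

```python
import numpy as np

def simple_shape_features(binary: np.ndarray) -> dict:
    """Area, center of gravity, x/y extents and their ratio for a binary object."""
    ys, xs = np.nonzero(binary)                 # coordinates of object (white) pixels
    x_extent = xs.max() - xs.min() + 1          # range of the object projected on x
    y_extent = ys.max() - ys.min() + 1          # range of the object projected on y
    return {
        "area": int(binary.sum()),
        "centroid": (float(xs.mean()), float(ys.mean())),
        "x_extent": int(x_extent),
        "y_extent": int(y_extent),
        "extent_ratio": x_extent / y_extent,
    }

obj = np.zeros((64, 64), dtype=np.uint8)
obj[20:40, 10:50] = 1                            # a 20 x 40 rectangular blob
print(simple_shape_features(obj))
# {'area': 800, 'centroid': (29.5, 29.5), 'x_extent': 40, 'y_extent': 20, 'extent_ratio': 2.0}
```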
G(x₀, y₀) = Σ |T(x − x₀, y − y₀) − t(x, y)|

where

T(x, y) = reference sub-image
t(x, y) = test image

thereby representing the sum of the absolute differences of the reference and test image as a function of the spatial location of the reference (x₀, y₀) with respect to the test image. This technique is also known as a "nearest neighbor classifier," since the nearest neighbor to T(x, y) in t(x, y) will have the lowest score and will be the closest "relative." The summation is carried out over the image region in which the sub-image is coincident with a portion of the test image. In general, the sub-image is displaced spatially (in both x and y) so that it is placed over the entire test image.
Figure 6.6.8 shows an original binary image, a template, and the resultant image after the binary correlation process has been applied. As can be seen in part (c) of this figure, the locations where the best match occurs have the highest value. Various methods can be used to provide more image data in cases when the template extends beyond the image. One possibility is to add additional zeros. This was done in the example shown in Figure 6.6.8. Two columns of zeros were added (but not shown) to the right of the image and two rows of zeros were added at the bottom. Thus the top left edge of the (3 × 3) template could be placed on each pixel of the image.
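A sketch of binary correlation implemented as the summed absolute difference G defined above. Note that with this difference-based score a perfect match gives zero (the minimum), whereas the output shown in Figure 6.6.8 is arranged so that the best match has the highest value; the small arrays below are made up for illustration.

```python
import numpy as np

def binary_correlation(test: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Sum of absolute differences between the template and every placement in the test image.

    Output cell (r, c) scores the template with its top-left corner at (r, c);
    the lowest score marks the best match (0 for a perfect one).
    """
    th, tw = template.shape
    H, W = test.shape
    scores = np.zeros((H - th + 1, W - tw + 1), dtype=int)
    for r in range(scores.shape[0]):
        for c in range(scores.shape[1]):
            window = test[r:r + th, c:c + tw]
            scores[r, c] = np.abs(window - template).sum()
    return scores

image = np.random.randint(0, 2, size=(15, 20))
template = np.array([[0, 1, 1],
                     [1, 1, 0],
                     [0, 1, 0]])
image[5:8, 9:12] = template                # plant an exact copy of the template
scores = binary_correlation(image, template)
print(scores[5, 9])                        # 0 -- perfect match at the planted location
print(scores.min() == 0)                   # True
```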
While binary template matching is quite simple to implement (in hardware as well as software), some problems exist. Recall that generally information about an image is lost when it is binarized, and pixels that constitute an edge must either be white or black. Thus accurate positional information may be lost. Additionally, since both the image and template are binary, it is important to have approximately the same threshold and lighting conditions when the template is defined and when it is used.
Figure 6.6.8. Example of template matching applied to a binary image: (a) binary image array (15 × 20); (b) binary template; (c) output of binary correlation (14 × 19). The dashed box in part (a) indicates that a perfect template match has been achieved.
edge. For example, if both templates shown in Figure 6.6.9 are applied to an image and the square root of the sum of the squares of both process outputs at each (image) pixel is taken, the edges of the resultant image are enhanced. These templates perform the same function as the Sobel operator described previously. In this case the reference location of the template would be the center pixel.
6.6.3.3 Correlation for gray-level images
The same concept used for binary template matching can be extended to a gray-level image and model (or reference sub-image). However, in this case it should be apparent that a high correlation could occur for features in the image that look nothing like the model, due to the fact that they may be brighter than the pixels that match the model. Thus the method of cross correlation may not be adequate to uniquely define the feature (the part of the image that best matches the reference) that is being sought.
If the Fisher statistical correlation coefficient is used, the best match of a gray-level model in a gray-level image can be found, independent of variations in brightness.
Figure 6.6.9. Templates (first and second operators) used to implement the Sobel operator (described in Section 6.6.2.2) for edge detection.
The correlation coefficient is given by

ρ(x₀, y₀) = [N Σ IM − (Σ I)(Σ M)] / {[N Σ I² − (Σ I)²][N Σ M² − (Σ M)²]}^(1/2)

where

I = a sub-image of the test image which is dimensionally the same as the model (or template)
M = the model or reference sub-image
N = the total number of pixels in the model
(x₀, y₀) = the spatial location of the model with respect to the test image

Note that the model can be any size (q × r) with the provision that it is smaller than the test image.
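A sketch of the correlation coefficient above, evaluated for a single placement of the model. Library routines for normalized correlation exist, but the formula is written out here so the terms stay visible; the numeric model is invented for illustration.

```python
import numpy as np

def correlation_coefficient(subimage: np.ndarray, model: np.ndarray) -> float:
    """Normalized correlation between a model and an equally sized sub-image.

    rho = [N*sum(I*M) - sum(I)*sum(M)] /
          sqrt([N*sum(I^2) - sum(I)^2] * [N*sum(M^2) - sum(M)^2])
    """
    I = subimage.astype(np.float64).ravel()
    M = model.astype(np.float64).ravel()
    N = I.size
    num = N * np.dot(I, M) - I.sum() * M.sum()
    den = np.sqrt((N * np.dot(I, I) - I.sum() ** 2) *
                  (N * np.dot(M, M) - M.sum() ** 2))
    return num / den

model = np.array([[10, 20, 30],
                  [40, 50, 60],
                  [70, 80, 90]])

print(correlation_coefficient(model, model))            # 1.0  (perfect match)
print(correlation_coefficient(3 * model + 25, model))   # 1.0  (gain/offset has no effect)
print(correlation_coefficient(255 - model, model))      # -1.0 (inverted image)
```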
The template or model is moved across the image as described previously. However, in this case, besides multiplying the corresponding gray levels of the template and model, both the average value of the sub-image and the model (and the variance of each) are needed. Note that the average value of the model and its variance are constant. Variations on the equation above can simplify computation. For example, the correlation coefficient can be squared to remove the burden of taking the square root in the denominator.
A few interesting conclusions can be inferred from the use of the correlation coefficient. First, a value of 1 indicates a perfect match, and if values less than zero are ignored, the maximum value obtained over the image defines the location of the best match. Of course, the higher the correlation coefficient, the better the match. Second, if a feature in the image is partially degraded, the correlation coefficient will still generally identify it and its location. Third, all the locations in an image that matched the model could be found by simply looking for all the correlation coefficients above some specified threshold. Finally, and as stated previously, the correlation coefficient is independent of linear changes in brightness. That is, assuming all the pixels of the image (or the model for that matter) are modified by the function

I(x, y)_new = a I(x, y) + b

for a > 0 and any b, the correlation coefficient is unchanged. The parameters a and b can be considered as gain and offset for either the image or the acquisition device.
If one chooses to make a template that defines some specific feature of an object, such as a corner or edge, it is possible to use the gray-scale correlation technique just described to find these specific features. In this case, imagine a (q × r) model with the left half, q × r/2, entirely white and the right half black. Using this template (or model) with gray-level correlation enables one to find all the areas of the image that resemble an edge having a light-to-dark transition.
the logical OR function for the pixel mapping rule. When it is applied to a binary image, the size of a white area is increased. Figure 6.6.10(a) shows an unprocessed binary image. Figure 6.6.10(b) shows the results of the white dilation operation performed on Figure 6.6.10(a), while Figure 6.6.10(c) shows the results of a white erosion (or black dilation) performed on the image of Figure 6.6.10(b).
If one performs erosion operations followed by dilations, small bridges between white objects will be broken. This cascaded operation is called an opening.
Figure 6.6.10. Example of morphological image processing: (a) original binary image; (b) result of white dilation of the image in part (a); (c) result of white erosion of the image in part (a).
The operation of dilation followed by erosion has the opposite effect and closes
up the spaces between adjacent regions. This cascaded operation is called a closing.
Many other operators exist which allow filling in partially missing edges to
restore the original shape of the image, measuring the extent of an image, and
actually isolating objects and reducing their dimensionality to count specifically shaped and sized objects in a scene.
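A sketch of white dilation, erosion, and the cascaded opening and closing described above. scipy's binary morphology routines are used, and the 3 × 3 structuring element is an assumption, not something specified in the text.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

SE = np.ones((3, 3), dtype=bool)          # 3 x 3 structuring element (assumed)

def opening(img):
    """Erosion followed by dilation: breaks thin bridges between white regions."""
    return binary_dilation(binary_erosion(img, SE), SE)

def closing(img):
    """Dilation followed by erosion: closes small gaps between adjacent regions."""
    return binary_erosion(binary_dilation(img, SE), SE)

# Two white blobs joined by a one-pixel-wide bridge.
scene = np.zeros((20, 40), dtype=bool)
scene[5:15, 3:13] = True
scene[5:15, 27:37] = True
scene[9, 13:27] = True                    # the thin bridge

opened = opening(scene)
print(scene[9, 20], opened[9, 20])        # True False -- the bridge is removed by the opening
```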
6.7 SOFTWARE CONSIDERATIONS

The techniques used for vision in robotic applications may rarely, if ever, be implemented totally in software. Virtually all vision systems implement some algorithms in hardware. The software considerations lie mainly in the ease with which the vision systems may be used. Most of the present vision systems are, in
essence, peripheral devices to the main robot controller and are invoked through
a command/data structure that is rather simple. Basically, the peripheral (vision)
device is given a string or stack of commands, and the peripheral returns data to
the main processor. Some vision systems have their own user languages that range
from cumbersome to friendly. For the most part, then, language and software
considerations lie mainly in control of the vision peripheral, not in the actual
implementation of the algorithms.
6.8 NEED FOR VISION TRAINING AND ADAPTATIONS

Although one might initially have believed that the definition of a prototypical
imaged part is trivial, by now the reader should be aware that the amount of data
required to define an object may indeed be huge in comparison to other digital
data-processing applications. When considering the variety of degrees of freedom
required to describe an object fully, it is evident that vision system training will be
needed so that a reasonable amount of data may be retained by the vision system.
It is for these reasons that dimensionality reduction, as mentioned earlier, is
so important. Specifically, it allows for the efficient representation of the visual
data, usually in an independent manner. Another important consideration has to
do with the potential for dealing with objects and parts that may be changing over
time. For example, a conveyor belt carrying a part may not operate at a carefully
controlled constant speed. As a consequence, parts will not arrive at known or
precise intervals of time. In such an instance, tracking of the "trend" regarding
parts arrival may be very useful in efficiently acquiring images and in processing
the data.
Another case where this is true is in semiconductor assembly, where a die
may be attached to a substrate by a die attach machine. If this machine has a
slight but consistent drift, the imaging of the die for later bonding of wires to the
lead frame and chip may be subject to placement errors, due to variable placement
of the die. This type of adaptive updating of object positions is often necessary
for efficient part handling.
6.9 REVIEW OF EXISTING SYSTEMS

In this section we review the major types of commercial vision systems currently
available. The discussion of techniques used in these systems will be restricted
Somewhat, since robotic vision requirements may be very diverse. The major
systems can be classified into the following general categories:
• Binary vision systems (e.g., those based on the SRI algorithms)
• Gray-level vision systems
• Structured light systems
• Character recognition systems
• Ad hoc special-purpose systems
6.9.1 Binary Vision Systems
Binary vision systems are those that use only two levels of image information
They are so-called silhouette systems, since very controlled lighting must be used
to image objects
reliably. Backlighting of parts is frequently selected so that the
objects to be inspected stand apart from the background. The binary vision systems
are used primarily for:
• Parts recognition
• Parts location
• Parts inspection
From a visual perspective, a binary vision device may be thought of as being
able to operate on a part as if an inspector had picked up the part and held it up
to a light source for backlighted inspection. One can see that there is a limited
but useful class of information that one can glean from this procedure.
The SRI collection of algorithms is an example of a binary vision system. As
typically implemented, it will permit arbitrary angular alignment of the part. A
run-length-coded image is often produced because of the speed enhancements
possible (see Section 6.9.5.3).
For objects where angular alignment is not an issue because of some prior
orientation stage, but where the translational position is unknown, binary corre
lation techniques are often used. Binary correlation permits the object to be
located with the value of correlation at the best match point often used as a measure
of part quality. For arbitrary parts orientation this technique is not practical yet,
because of the need to perform correlation in three dimensions (two translational,
one rotational). Other binary vision systems frequently use so-called "pixel count
of
ing" for inspection parts. These systems usually require spatial windowing or
data and then counting pixels in those windows. This scheme typically requires
a
significant application effort to choose the correct
then requires
windows, and
large degree of ad hoe adjustment to determine the significance of the pixel count
6.9.2 Gray-Level Vision Systems

Gray-level systems generally capture 4-, 6-, or 8-bit images, and then apply very tailored algorithms designed for a specific application. Gray-level template matching techniques, for example, may be used to locate parts in nonsilhouetted environments. In many instances, highly controlled lighting may not be permitted, or the surface of the object has variable reflectivities that are useful for inspecting the object. Gray-level template comparisons may be used to locate objects that are angularly aligned, with the amount of template difference serving as a measure of part quality.
It is often desirable to read labels or characters from parts, packages, and so on.
Where bar codes may be placed on the parts to be identified, identification may be accomplished by simple bar-code readers. Alphanumeric codes are a different matter entirely, since recognition of arbitrary character sets has until recently been
a very difficult image-processing task. Several systems available today are able to
read a wide variety of character sets (after initial training) at high speeds (15 to
30 characters/s).
A number of vision systems were developed specifically for use with robots prior
to 1980. These include the GM Consight System, the one developed by the
National Bureau of Standards, and also a system developed by SRI. We consider
each of these in turn.
Figure 6.9.1. Layout of the Consight system: a solid-state line camera and two light sources view the conveyor belt; belt position/speed measurements and the camera data go to a PDP 11/45 computer, which communicates with the Stanford robot arm through a robot interface.
the robot's work station. In addition, information about the conveyor's speed is
sent to the computer. In the time it takes a part to move from the vision system's
location to that of the robot, the computer utilizes the visual and speed data to
determine the location, orientation, and type of part and sends this information
to the robot controller via an interface. With this knowledge, the robot is able
to successfully approach and pick up the part while the latter is still moving on the
belt. It is important to note from Figure 6.9.1 that the vision system is not mounted
on the robot itself and, therefore, does not reduce the robot's useful
payload.
It is necessary to monitor the conveyor's velocity continuously for several
reasons. First, typical moving belts that are found in factories generally do not
have velocity servos controlling their speed. Thus it is
expected that the speed
will fluctuate due to a variety of causes including load changes, line voltage variations, and wear of rotating parts (i.e., increased friction). Since the robot must accurately know when the part arrives at its work station, the instantaneous belt speed must be available to the computer. Second, keeping track of any belt speed
variations has to do with the method used to acquire the two-dimensional visual
scene. This is discussed next.
As can be seen in Figure 6.9.2, the linear array camera scans the belt in a one-dimensional manner (e.g., in the y direction), and this is perpendicular to the conveyor's motion (e.g., the x direction). Note that the camera will record 128 equally spaced points across the width of the belt. The two-dimensional image is formed by instructing the camera to wait until the part has moved a specified distance down the conveyor before recording the next line of the image. It should be clear that fluctuations in belt speed can produce distortion in the recorded image. This undesirable phenomenon is avoided by speed monitoring, which permits the
time interval between acquiring two successive line images to be varied in order to compensate for any non-uniformity in the belt speed.

Figure 6.9.2. Camera and light source configuration in the Consight I system. The basic lighting principle is illustrated.
How does the Consight System avoid the problem of poor lighting conditions
without resorting to contrast enhancing techniques? How does the system *know"
if a part is present or not? The answer to both of these questions is through the
use of structured light. With respect to Figures 6.9.3 and 6.9.4, it is observed that
a light source, consisting ofa long slender tungsten filament bulb and a cylindrical
lens, which projects a linear (and fairly intense) beam across the belt's width is
positioned downstream of the camera which is placed in a position so that it can
sense this line of light. When no object is within the field of the camera. an
unbroken line of light results. See Figure 6.9.4(a). However, when a part is
present, the three-dimensional nature of the objJect causes a portion of the light
beam to be intercepted before it reaches the camera position. When viewed from
above by the camera, this part of the line of light that is deflected by the part appears to be displaced (downstream), as shown in Figure 6.9.4(b). Thus the camera will sense a black image wherever there is an object and will record a light region where there is no part. As the part moves down the conveyor, the region of black will change in length (i.e., in y). The two-dimensional binary image recorded by the camera will, therefore, consist of regions of black (wherever there is a part) and white where there is none.

Figure 6.9.3. The Consight light source: a cylindrical lens focuses the light into a line across the belt.
One potential problem with this procedure is shadowing, as
illustrated by the
dotted lines in Figure 6.9.5. Here, it is observed that the system will detect the
leading edge of the part before it actually arrives at the camera position.
This
problem can be solved through the use of two or more linear light sources focused
at the camera location on the belt as shown in
Figure 6.9.5. The reader will
observe that the second light beam prevents this
position on the conveyor from
becoming dark until the part is actually at the location.
The Consight System uses a run-length coding scheme for storing the two-dimensional binary image. Since the camera has 128 elements, only 7 bits are needed for this purpose. The remaining bit (usually the most significant) in any run-length "word" is used to indicate whether the transition is light to dark or vice versa.
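A sketch of the 8-bit run-length "word" described above: 7 bits of run length plus a transition-direction flag in the most significant bit. The exact bit assignment and flag meaning in the original Consight implementation are assumptions here, not documented in the text.

```python
LIGHT_TO_DARK = 1          # assumed meaning of the flag bit
DARK_TO_LIGHT = 0

def pack_run_word(run_length: int, transition: int) -> int:
    """Pack a run length (0-127 fits in 7 bits) and a transition flag into one byte."""
    if not 0 <= run_length < 128:
        raise ValueError("run length must fit in 7 bits for a 128-element camera")
    return (transition & 1) << 7 | run_length

def unpack_run_word(word: int) -> tuple[int, int]:
    """Recover (run_length, transition flag) from a packed byte."""
    return word & 0x7F, word >> 7

word = pack_run_word(53, LIGHT_TO_DARK)
print(word, bin(word))             # 181 0b10110101
print(unpack_run_word(word))       # (53, 1)
```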
Figure 6.9.5. A second light source, with both the first and second sources focused on the line viewed by the camera, eliminates the shadowing effect.
For a given object, comparing these and other simply computed features with those
stored in the computer (for the entire "world" of permitted objects) allows the
part to be recognized.
Orientation, a descriptor that is usually part specific, can be found in a number
of ways including selecting the moment axis direction that points nearest to the
maximum radius point measured from the centroid to the boundary. This de-
scriptor and the belt speed are then used to inform the robot when the object is
within the workspace and where and how to grasp the part. In some instances,
it may be necessary to stop the conveyor for a period of time to permit the ma-
nipulator to acquire the object.
A major problem with the Consight system is that it cannot handle parts that
are touching one another. If such a situation occurs, the number of scan lines will
usually be greater than for any single part. Alternatively, no match between all
the features of the two touching parts and those stored in the vision system's memory
will occur (e.g., the overall area will generally be larger for the "compound object"). In either instance, the objects that cannot be "identified" are permitted to run off the end of the conveyor and into a reject bin where they can be recycled.
It is important to understand that any object that can assume more than one stable
position on the conveyor will require a separate set of features to be stored for each
one.
Although the GM Consight was developed over ten years ago, it is still used commercially in a somewhat modified form today by the GMF Corporation and
also by the Adept Corporation under a licensing agreement.
Having considered the various features of a vision system that was developed
by a large private company, we next consider a robotic vision system developed
with Federal money at the National Bureau of Standards.
6.9.5.2 National Bureau of Standards vision system [19]
In the 1970s, the United States Congress charged the National Bureau of Standards (NBS) with developing a fully automated machine shop by the latter part of the 1980s. As part of their effort to achieve coordinated control over robots and other less sophisticated machine tools (e.g., lathes, punch presses, milling machines, etc.), a need was perceived for a vision system that could be used in such an environment and could be interfaced with robots.
The research effort was undertaken by Dr. James Albus and his colleagues
and produced a workable system in the late 1970s. The NBS Vision System, as
it was first introduced, was able to process picture information in less than 100 ms
and was estimated to cost about $8000. Since then, the system response time has
been improved and some of the hardware has been modified. However, the basic
operation technique has not changed appreciably as described next.
The major hardware elements of the NBS Vision System consist of three
components: (1) a solid state camera capable of producing 16K pixels (128 x 128);
(2) an electronic stroboscopic light that emits a plane of light and whose flash
intensity can be modified digitally; and (3) a "picture processing" unit. To understand the operation of the system, consider Figure 6.9.6. [The reader should
understand that the robot manipulator (in particular, its wrist) is not shown here.
In actual operation, the camera and (structured) light source would be mounted
on the robot's wrist whereas the processing unit would be in or near the robot's
Figure 6.9.6. Structured light imaging for the NBS robot vision system. The camera and the flash (strobe) unit are mounted on opposite sides of the robot's wrist (not shown) such that the plane of light is parallel to the fingers of the gripper. The presence of an object causes one or more line segments to be seen by the camera. With permission of the National Institute of Standards and Technology (formerly National Bureau of Standards).
Figure 6.9.7. Example objects (such as a box) and the line segment patterns formed by the plane of light as seen by the camera for the NBS vision system. With permission of the National Institute of Standards and Technology (formerly National Bureau of Standards).
The strobe unit produces a plane of light that is projected parallel to the wrist plane (determined by the approach vector and y). The camera is mounted above the light source and is tilted down (i.e., so as to intersect the light plane). Its 36-degree field of view covers the region extending from inside the fingers of the gripper out to a distance of one meter. If the projected light strikes an object in this region, a pattern of line segments is formed on the object. See Figure 6.9.7.
Figure 6.9.7. As the robot gripper moves closer to the object, these line segments
will grow in size and will move down in the camera's field. Qualitatively, the
reader should be able to conclude that the nearer the bottom of the image, the
Closer the object being scanned will be to the robot's end effector. However, how
does this system provide quantitative information that will permit the robot to
acquire the object?
To answer this question, consider the calibration chart (derived from simple geometric considerations) shown in Figure 6.9.8. Observe that the top and right axes are calibrated in pixels whereas the bottom and left axes are calibrated in centimeters, representing the x and y distances between the camera and object, respectively. For example, if the camera viewed the horizontal line shown in Figure 6.9.9, extending from pixel element (32, 64) to element (96, 64), the object producing this line would be located about 13 centimeters from the gripper and be about 10 centimeters in width.
The information contained in Figure 6.9.8 is actually stored in the vision system's memory.
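As a rough illustration of how such stored calibration data might be used, the following Python sketch interpolates a small lookup table that maps an image row to the x distance (range) in centimeters and to a per-row scale factor that converts a pixel span into a width. The table values and helper names are invented for this example rather than taken from the actual NBS calibration, although they are chosen so that the sample segment from (32, 64) to (96, 64) yields roughly the 13 cm range and 10 cm width quoted above.

# Hedged sketch: converting a detected line segment (in pixel coordinates)
# into approximate metric information using a stored calibration table.
# The table entries below are invented for illustration only.
import bisect

# (image row) -> (x distance to object in cm, cm per pixel column at that row)
CALIBRATION = [
    (16, 45.0, 0.55),
    (32, 25.0, 0.30),
    (64, 13.0, 0.156),
    (96, 7.0, 0.09),
    (128, 4.0, 0.05),
]
ROWS = [r for r, _, _ in CALIBRATION]

def interpolate(row):
    """Linearly interpolate range and scale for an arbitrary image row."""
    i = bisect.bisect_left(ROWS, row)
    i = max(1, min(i, len(ROWS) - 1))
    (r0, d0, s0), (r1, d1, s1) = CALIBRATION[i - 1], CALIBRATION[i]
    t = (row - r0) / (r1 - r0)
    return d0 + t * (d1 - d0), s0 + t * (s1 - s0)

def describe_segment(col_start, col_end, row):
    distance_cm, cm_per_px = interpolate(row)
    width_cm = abs(col_end - col_start) * cm_per_px
    return distance_cm, width_cm

# The horizontal segment from (32, 64) to (96, 64) in the worked example:
print(describe_segment(32, 96, 64))   # roughly (13.0, 10.0)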
Figure 6.9.8. The calibration chart for the NBS vision system. The x and y distances are measured in the coordinate system of the fingers. The x axis passes through the two finger tips and the y axis is parallel to the wrist axis. The slight tilt in the figure is due to a misalignment of the chip in the camera. With permission of the National Institute of Standards and Technology (formerly National Bureau of Standards).
Figure 6.9.9. Calibration chart with example line segment (image) from (32, 64) to (96, 64) shown. With permission of the National Institute of Standards and Technology (formerly National Bureau of Standards).
1. The robot's arm is initially positioned near one of the corners of the workspace. The table is then scanned (by firing the strobe unit) in a plane that is approximately parallel to its surface. The illuminated object appears in the image as a series of line segments. See, for example, Figure 6.9.7. Generally, this step in the process yields coarse range information.
2. The estimate obtained from step 1 is then used to move the robot's arm closer
to the object. The flash unit is again fired and more accurate range infor-
mation is obtained (recall that the resolution improves as the camera comes
closer to the object because we are now operating in a higher resolution part
of the calibration chart).
3. Based on the better range estimate obtained in step 2, the arm is moved
above (or in front of) the object. The strobe is triggered a third time and
the system makes fine positional and orientational corrections. The robot
can then be commanded to grasp the object.
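The three-step procedure amounts to a simple coarse-to-fine control loop. The Python sketch below shows only that control flow; the callables it accepts (capture, estimate_pose, move_toward, grasp) are hypothetical stand-ins for the camera, the calibration lookup, and the robot controller rather than any part of the NBS software.

# Hedged sketch of the coarse-to-fine acquisition sequence (three strobe firings).
def acquire_object(capture, estimate_pose, move_toward, grasp):
    """Fire the strobe three times, refining the pose estimate each time,
    then command a grasp. All callables are hypothetical driver hooks."""
    for standoff_cm in (50.0, 20.0, 5.0):   # progressively closer standoffs
        segments = capture()                # fire the strobe, grab one frame
        pose = estimate_pose(segments)      # range/orientation from line segments
        move_toward(pose, standoff_cm)      # resolution improves as we close in
    grasp(pose)

if __name__ == "__main__":
    # Trivial stubs so the sketch runs as written.
    acquire_object(
        capture=lambda: "segments",
        estimate_pose=lambda s: {"x_cm": 13.0, "angle_deg": 0.0},
        move_toward=lambda pose, d: print("move to standoff", d, "cm; pose", pose),
        grasp=lambda pose: print("grasp at", pose),
    )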
In the earlier versions of the system, there was a perceptible pause at each of the illumination points. A later version eliminated this delay, thereby producing an extremely smooth motion. In addition, the processing speed was now so rapid that it was actually possible for the system to track a moving object.
It should be noted that the NBS system is fundamentally quite different from other machine vision devices because it is not "looking" all the time. Consequently, it is much more time efficient, since only three picture scenes need to be processed (i.e., for the line segment information) during an object acquisition sequence.
Besides ranging and orientation data, it is possible to extract information on the structure of the object. This is also illustrated in Figure 6.9.7. For example, it is observed that an object with a raised portion of its surface will produce three disconnected horizontal line segments. This happens because the segment associated with the elevated section will be lower than the other two. In a similar manner, objects with depressed surfaces can be detected by the nature and number of disconnected line segments. The figure also indicates that obliquely viewed objects produce line segments that are connected but have different slopes (i.e., there is a cusp at their intersection). In addition, it can be shown that objects with cylindrical surfaces will produce curved line segments when illuminated by a plane of light.
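A very rough way to automate this qualitative reasoning is sketched below in Python. Each detected stripe is represented as a list of (column, row) points, and the code simply counts disconnected pieces and checks for an abrupt slope change (a cusp). The representation and the slope threshold are assumptions made for illustration and do not reproduce the NBS implementation.

# Hedged sketch: crude interpretation of the light-stripe pattern.
# Each segment is a list of (column, row) pixel coordinates, left to right.
def slope(seg):
    (c0, r0), (c1, r1) = seg[0], seg[-1]
    return (r1 - r0) / (c1 - c0) if c1 != c0 else float("inf")

def interpret(segments, slope_jump=0.3):
    """Return a coarse description of the illuminated surface."""
    if len(segments) >= 3:
        return "raised or depressed surface (disconnected stripe pieces)"
    if len(segments) == 2:
        if abs(slope(segments[0]) - slope(segments[1])) > slope_jump:
            return "obliquely viewed surfaces meeting at a cusp"
        return "two separate surfaces"
    return "single flat surface"

# Example: two adjoining pieces with clearly different slopes form a cusp.
left  = [(c, 60 + c // 4) for c in range(0, 40)]
right = [(c, 70 - (c - 40) // 4) for c in range(40, 80)]
print(interpret([left, right]))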
Besides speed, another advantage of the NBS system over others is that the contrast problem is eased, or even eliminated, by the use of stroboscopic illumination. Even if the surface of the object is rather dull, it is possible to compensate for the "threshold" problem by increasing the flash duration under computer control. With the system used by NBS, the strobe time can be as short as 6.4 μs and as long as 1.6 ms. (This range is divided into 256 values.) The ambient light problem is solved by frame-to-frame differencing, whereby data in the flash frame is compared with that from a nonflash frame at the same location. This does, however, require a "frame buffer" and hence additional memory.
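The flash/no-flash differencing step can be illustrated with a few lines of NumPy. The sketch below subtracts an ambient (non-flash) frame from the flash frame and thresholds the result, and it lengthens the strobe time for the next firing when too few stripe pixels survive. The threshold, pixel count, and function names are assumptions chosen to mirror the description above, not the actual NBS firmware.

# Hedged sketch: suppressing ambient light by flash/non-flash differencing.
import numpy as np

def stripe_mask(flash_frame, ambient_frame, threshold=30):
    """Return a binary mask of pixels illuminated by the strobe only."""
    diff = flash_frame.astype(np.int16) - ambient_frame.astype(np.int16)
    return diff > threshold          # True where the plane of light landed

def choose_longer_exposure(current_us, mask, min_pixels=50, max_us=1600):
    """If too little of the stripe was detected (a dull surface), lengthen
    the strobe time for the next firing, up to the maximum duration."""
    if mask.sum() < min_pixels:
        return min(current_us * 2, max_us)
    return current_us

# Tiny demonstration with synthetic 128 x 128 frames.
ambient = np.full((128, 128), 40, dtype=np.uint8)
flash = ambient.copy()
flash[80, 20:90] += 100              # a bright horizontal stripe
mask = stripe_mask(flash, ambient)
print(mask.sum(), choose_longer_exposure(6.4, mask))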
At each step in the illumination process, a one-pass line-following algorithm that looks for corners and gaps is utilized. For each image scan line produced, the system calculates (in hardware) a run length of 8 bits and an intensity of 8 bits. (Note that gray-level information is utilized here, unlike the GM system described in the previous section.) Based on this information, the possible run length of the next scan line is predicted (from the slope and curvature information of the "working" line). If the actual run length (RL) is within a specified ε of the prediction, the point is added to the working line and the next run length is predicted (a short sketch of this prediction step appears after the case list below). If, however, the difference exceeds ε, there are three cases:
1. RL = 128, which implies that a gap exists. The system will then begin to
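The sketch below illustrates the prediction-and-match step in Python. The next run length is extrapolated from the last two accepted values (a crude stand-in for the slope and curvature of the working line), and each new scan-line reading is either appended to the working line, flagged as a gap (the full-width value of 128), or rejected. The tolerance and data layout are assumptions for illustration; the remaining cases of the hardware logic are not reproduced here.

# Hedged sketch: one-pass line following with predicted run lengths.
FULL_WIDTH = 128          # run length reported when no transition is found (a gap)

def predict(working):
    """Extrapolate the next run length from the last two accepted values."""
    if len(working) < 2:
        return working[-1]
    return working[-1] + (working[-1] - working[-2])   # constant-slope prediction

def follow_line(run_lengths, eps=4):
    working, events = [run_lengths[0]], []
    for rl in run_lengths[1:]:
        if rl == FULL_WIDTH:
            events.append("gap")                 # no stripe on this scan line
        elif abs(rl - predict(working)) <= eps:
            working.append(rl)                   # point accepted onto the working line
        else:
            events.append("corner or new line")  # prediction badly missed
    return working, events

print(follow_line([20, 22, 24, 27, 128, 60]))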
Classification of objects
Materials and/or parts handling by a robot
Visual inspection of parts
In addition, these objectives were to be done for parts moving on a conveyor belt.
In the following sections, we describe four different aspects of the system
that permit these objectives to be realized. They are:
Color Filters. Another technique that can be used to enhance the image is to mount a color filter in front of the camera lens. Clearly, this is only useful when the colors of either the object, the background, or both are known. In particular, when such a technique is combined with the fluorescent system mentioned above, a red filter (matched to the spectral response of the fluorescent belt) further enhances the image.
Special Lighting Arrangement. By illuminating an object in different ways,
it is possible to control shadows and highlights. For example, directional lighting
can be used to enhance shadows whereas multidirectional lighting can be employed
to reduce them. Highlights (or reflections from shiny surfaces) can be enhanced
by placing the illuminating source near the camera. Conversely, oblique lighting
tends to reduce such highlights. Such techniques are generally referred to as
structured lighting and are an important component of the SRI System.
Imaging Hardware
The reader should understand that with memory so inexpensive now, this would not pose much of a problem. Also, it is important to note that storing the entire image has the advantage of having it available for processing at any time. Thus, features can be extracted anywhere in the process, permitting considerable flexibility in the image recognition algorithms.
A camera consisting of 128 light-sensitive diodes was used instead. With this device, one line of data, consisting of 128 one-bit samples, was obtained each time the array was scanned.
To determine or recognize the part on the conveyor that is being scanned by the diode array camera, it is necessary to extract (compute) a set of features that permit the recognition to be performed. Before this can be done, however, it is necessary to obtain the outline of the object. In the SRI system, a connectivity analyzer is used to examine overlaps between lines in successive rows of the image, which, in turn, permits connected components to be ascertained. In addition, holes located within the boundaries of the object are also found in this manner. In both cases, standard edge detection algorithms are utilized.
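The row-overlap idea can be written compactly in Python. In the sketch below, each binary image row is reduced to runs of foreground pixels, and runs in successive rows whose column ranges overlap are merged with a small union-find structure, which yields the connected components (applying the same logic to background runs would locate interior holes). This is a generic run-based routine written to illustrate the description above; it is not SRI's code.

# Hedged sketch: connected components from run overlaps between successive rows.
def runs(row):
    """Return (start, end) column pairs of consecutive 1s in a binary row."""
    out, start = [], None
    for c, v in enumerate(row + [0]):          # sentinel closes a trailing run
        if v and start is None:
            start = c
        elif not v and start is not None:
            out.append((start, c - 1))
            start = None
    return out

def connected_components(image):
    parent = {}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    def union(a, b):
        parent[find(a)] = find(b)

    labeled_rows, previous = [], []
    for r, row in enumerate(image):
        current = []
        for (s, e) in runs(row):
            label = (r, s)
            parent[label] = label
            for (ps, pe, plabel) in previous:   # does it overlap the row above?
                if s <= pe and ps <= e:
                    union(label, plabel)
            current.append((s, e, label))
        labeled_rows.append(current)
        previous = current
    # Collapse labels so each run reports its component's representative.
    return {find(lab) for row in labeled_rows for (_, _, lab) in row}

image = [[0, 1, 1, 0, 0, 1],
         [0, 1, 0, 0, 0, 1],
         [0, 1, 1, 1, 0, 0]]
print(len(connected_components(image)))   # two connected components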
Once the part outline is determined, the initial or simple features are obtained. In the SRI system, these include:
With the above features now available, other features can then be determined. For example, the SRI system computes a radius function. That is, for each point on the perimeter of the object, the square of the distance from that point to the center of gravity (centroid) is found. Then, the maximum, minimum, and average radii become the features of interest.
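A minimal version of this radius function is shown below in Python. It assumes the perimeter has already been extracted as a list of (x, y) points and that the centroid is known; the maximum, minimum, and mean of the squared distances then become three scalar features. The sample outline is illustrative.

# Hedged sketch: the squared-radius function and its summary features.
def radius_features(perimeter, centroid):
    cx, cy = centroid
    squared = [(x - cx) ** 2 + (y - cy) ** 2 for (x, y) in perimeter]
    return max(squared), min(squared), sum(squared) / len(squared)

# Example: a coarse outline of a 10 x 4 rectangle centred at the origin.
outline = [(x, y) for x in range(-5, 6) for y in (-2, 2)] + \
          [(x, y) for x in (-5, 5) for y in range(-2, 3)]
print(radius_features(outline, (0.0, 0.0)))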
Localizing features can also be used to determine the angular orientation of the object being viewed. They also permit attention to be directed to a specific portion of the part's outline. For this purpose, one can utilize OA, the longest radius vector, and OB, the vector formed from the CG to the nearest hole B (see Figure 6.9.10).
Figure 6.9.10. Binary image of a water
pump showing the hole B and the long-
est radius OA.
For the water pump shown, utilizing these two vectors permits measurement of the position of the movable pump handle.
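The use of these two vectors can be illustrated with a short Python function. Given the centroid (CG), the boundary points, and the hole centers, it returns the directions of OA (toward the farthest boundary point) and OB (toward the nearest hole) and the angle between them, which for the water pump would track the handle position. The helper and the sample data are assumptions made for the example.

# Hedged sketch: orientation from the longest radius vector OA and the
# vector OB from the centroid to the nearest hole.
import math

def oa_ob_angles(centroid, boundary, holes):
    cx, cy = centroid
    far = max(boundary, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    near_hole = min(holes, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    oa = math.atan2(far[1] - cy, far[0] - cx)
    ob = math.atan2(near_hole[1] - cy, near_hole[0] - cx)
    relative = (ob - oa) % (2 * math.pi)
    return math.degrees(oa), math.degrees(ob), math.degrees(relative)

# Illustrative data: boundary of a blob, one hole offset from the centroid.
boundary = [(12, 0), (0, 5), (-6, 0), (0, -5)]
holes = [(3, 3)]
print(oa_ob_angles((0.0, 0.0), boundary, holes))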
In the actual experiments conducted by SRI, 200 lines were required to delineate the pump, and with a PDP-11/40, image processing took about one second. It should be clear that with the faster computers now available, this time could probably be reduced by quite a bit.
Object rotation
Object translation
Lighting variations
Camera noise
Quantization errors
*One refers to a system that can reliably operate under a wide range of conditions as robust.
Once the operations above are performed, the decision tree can be constructed.
[Figure: Example parts used in constructing the decision tree (Classes C1 through C5): a connecting rod (conrod); a cylinder head lying on its side and a cylinder head standing upright (Head 1 and Head 2); and a brake caliper lying flat and lying on its side (Caliper 1 and Caliper 2).]
As far as the system is concerned, there are seven possible objects to identify. Moreover, for this example, seven different features are used to construct the decision tree.
[Figure: The decision tree used to classify the example parts. At each node a feature measure X is compared against a threshold (for example, X > 1.66, X > 1.80, X > 1.34, X > 5.20, X > 18.45, X > 3.78); depending on the yes/no outcome, either another measurement is made or the part is classified (e.g., as a conrod, head, caliper, or sleeve).]
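To show how such a decision tree might be executed in software, the Python sketch below encodes a small tree of feature/threshold tests as nested tuples and walks it for a given feature vector. The tree shape is only loosely suggested by the thresholds visible in the figure and should be treated as illustrative rather than as the trained SRI classifier.

# Hedged sketch: evaluating a feature-threshold decision tree.
# A node is (feature_name, threshold, subtree_if_yes, subtree_if_no);
# a leaf is simply a class name string. The tree below is illustrative.
TREE = ("x1", 1.66,
            ("x2", 1.80, "head", "conrod"),
            ("x3", 1.34,
                ("x4", 5.20, "caliper", "sleeve"),
                "head"))

def classify(tree, features):
    while not isinstance(tree, str):
        name, threshold, yes_branch, no_branch = tree
        tree = yes_branch if features[name] > threshold else no_branch
    return tree

print(classify(TREE, {"x1": 2.0, "x2": 1.5, "x3": 0.9, "x4": 0.0}))   # conrod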
6.10 SUMMARY
This chapter illustrated many of the major aspects of machine vision, especially as
applied to robots. The concepts presented permit the general understanding of
the components, hardware, software, and algorithms that are often required in a
vision or remote sensing application.
The reader should understand that the techniques presented here, while use-
ful, will very often need to be modified before practical implementation is achieved.
These modifications may be in the nature of using selected regions of interest,
using computational approximations that provide for efficient implementation, or
other similar techniques that permit the transfer of the theoretical or academic
techniques into the real world of engineering and manufacturing implementation.
Furthermore, the techniques presented in this chapter should be used as "vectors"
that will point the implementer toward a good direction but will not give all the
details required. For example, one may choose a horizontal edge operator for
enhancing edges, but one also has to select the coarseness of the operator.
The algorithm itself may be obtained from the literature, but the specific imple-
mentation parameters must be selected by the user.
Another issue of interest is the fact that the same problem may be solved in
more than one way, and the method for selecting the most appropriate approach
will not be found in any textbook. The authors have sought to present many of
the important techniques, and must regrettably leave the selection of the appropriate technique in any given environment to the user. The example in Chapter
Nine serves to illustrate some of the problems associated with vision systems in
robotics.
For example, Automatix has developed and incorporated a special-purpose language called "RAIL," which permits the vision system to be more readily interfaced to the robot. In addition,
modern 32-bit microprocessors have been utilized, which allow both binary and gray-level data to be
processed.