
AZERBAIJAN STATE UNIVERSITY OF ECONOMICS

DEPARTMENT: DIGITAL TECHNOLOGIES AND APPLIED INFORMATICS

TEACHER: PROFESSOR SHAHNAZ SHAHBAZOVA

SUBJECT: THE FUNDAMENTALS OF ARTIFICIAL INTELLIGENCE

LECTURE 13. Communicating, perceiving, and acting


Perception
Content
1. Image Formation
2. Early Image-Processing Operations
3. Object Recognition by Appearance
4. Reconstructing the 3D World
5. Object Recognition from Structural Information
6. Using Vision

13.1. IMAGE FORMATION

Imaging distorts the appearance of objects. For example, a picture taken looking down a long straight set
of railway tracks will suggest that the rails converge and meet. As another example, if you hold your hand
in front of your eye, you can block out the moon, which is not smaller than your hand. As you move your
hand back and forth or tilt it, your hand will seem to shrink and grow in the image, but it is not doing so
in reality (Figure 24.1). Models of these effects are essential for both recognition and reconstruction.

Images without lenses: The pinhole camera


Image sensors gather light scattered from objects in a scene and create a two-dimensional image. In the
eye, the image is formed on the retina, which consists of two types of cells: about 100 million rods,
which are sensitive to light at a wide range of wavelengths, and 5
million cones. Cones, which are essential for color vision, are of three main types, each of which is
sensitive to a different set of wavelengths. In cameras, the image is formed on an image plane, which can
be a piece of film coated with silver halides or a rectangular grid of a few million photosensitive pixels,
each a complementary metal-oxide semiconductor (CMOS) or charge-coupled device (CCD). Each
photon arriving at the sensor produces an effect, whose strength depends on the wavelength of the photon.
The output of the sensor is the sum of all effects due to photons observed in some time window, meaning
that image sensors report a weighted average of the intensity of light arriving at the sensor.
To see a focused image, we must ensure that all the photons from approximately the same spot in the
scene arrive at approximately the same point in the image plane. The simplest way to form a focused
image is to view stationary objects with a pinhole camera, which consists of a pinhole opening, O, at the
front of a box, and an image plane at the back of the box (Figure 24.2). Photons from the scene must pass
through the pinhole, so if it is small enough then nearby photons in the scene will be nearby in the image
plane, and the image will be in focus.
The geometry of scene and image is easiest to understand with the pinhole camera. We use a
three-dimensional coordinate system with the origin at the pinhole, and consider a point P in the scene,
with coordinates (X, Y, Z). P gets projected to the point P′ in the image plane, with coordinates (x, y).
If f is the distance from the pinhole to the image plane, then by similar triangles we can derive the
following equations:

    −x/f = X/Z,   −y/f = Y/Z,   so that   x = −fX/Z,   y = −fY/Z.

These equations define an image-formation process known as perspective projection. Note that
the Z in the denominator means that the farther away an object is, the smaller its image
will be. Also, note that the minus signs mean that the image is inverted, both left–right and up–down,
compared with the scene.
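To make the arithmetic concrete, here is a minimal Python sketch of perspective projection (added for illustration; the focal distance and scene points are assumed values, not from the lecture):

def project(X, Y, Z, f=0.05):
    """Project scene point (X, Y, Z) onto the image plane of a pinhole camera."""
    x = -f * X / Z   # the minus signs invert the image left-right and up-down
    y = -f * Y / Z
    return x, y

# Doubling the depth Z halves the size of the image of the point:
print(project(1.0, 2.0, 10.0))   # (-0.005, -0.01)
print(project(1.0, 2.0, 20.0))   # (-0.0025, -0.005)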
Under perspective projection, distant objects look small. This is what allows you to cover the
moon with your hand (Figure 24.1). An important result of this effect is that parallel lines
converge to a point on the horizon. (Think of railway tracks, Figure 24.1.) A line in the scene in
the direction (U, V, W ) and passing through the point (X0, Y0, Z0) can be described
as the set of points (X0 + λU, Y0 + λV, Z0 + λW ), with λ varying between −∞ and +∞.
Different choices of (X0, Y0, Z0) yield different lines parallel to one another. The projection
of a point Pλ from this line onto the image plane is given by

    pλ = ( f (X0 + λU) / (Z0 + λW),  f (Y0 + λV) / (Z0 + λW) ).

As λ → ∞ or λ → −∞, this becomes p∞ = (fU/W, fV/W) provided W ≠ 0. This means that


two parallel lines leaving different points in space will converge in the image—for large λ, the image
points are nearly the same, whatever the value of (X0, Y0, Z0) (again, think railway tracks, Figure
24.1). We call p∞ the vanishing point associated with the family of straight lines with direction
(U, V, W). Lines with the same direction share the same vanishing point.
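A short numerical sketch (illustrative values only) confirms that parallel lines with a common direction (U, V, W) project toward the same vanishing point (fU/W, fV/W):

f = 0.05
U, V, W = 1.0, 0.0, 2.0                        # shared direction of the lines
starts = [(0.0, 0.0, 5.0), (3.0, -1.0, 5.0)]   # two different base points

for X0, Y0, Z0 in starts:
    lam = 1e6                                  # a point very far along the line
    X, Y, Z = X0 + lam * U, Y0 + lam * V, Z0 + lam * W
    print(f * X / Z, f * Y / Z)                # both print values very close to (0.025, 0.0)

print(f * U / W, f * V / W)                    # the vanishing point itself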

Lens systems
The drawback of the pinhole camera is that we need a small pinhole to keep the image in focus. But the
smaller the pinhole, the fewer photons get through, meaning the image will be dark. We can gather more
photons by keeping the pinhole open longer, but then we will get motion blur—objects in the scene that
move will appear blurred because they send photons to multiple locations on the image plane. If we can’t
keep the pinhole open longer, we can try to make it bigger. More light will enter, but light from a small
patch of object in the scene will now be spread over a patch on the image plane, causing a blurred image.
Vertebrate eyes and modern cameras use a lens system to gather sufficient light while keeping the
image in focus. A large opening is covered with a lens that focuses light from nearby object locations
down to nearby locations in the image plane. However, lens systems have a limited depth of field:
they can focus light only from points that lie within a range of depths (centered around a focal plane).
Objects outside this range will be out of focus in the image. To move the focal plane, the lens in the
eye can change shape (Figure 24.3); in a camera, the lenses move back and forth.

13.2. EARLY IMAGE-PROCESSING OPERATIONS


We have seen how light reflects off objects in the scene to form an image consisting of, say, five million
3-byte pixels. With all sensors there will be noise in the image, and in any case there is a lot of data to
deal with. So how do we get started on analyzing this data?
In this section we will study three useful image-processing operations: edge detection, texture
analysis, and computation of optical flow. These are called “early” or “low-level” operations because they
are the first in a pipeline of operations. Early vision operations are characterized by their local nature
(they can be carried out in one part of the image without regard for anything more than a few pixels
away) and by their lack of knowledge: we can perform these operations without consideration of the
objects that might be present in the scene. This makes the low-level operations good candidates for
implementation in parallel hardware—either in a graphics processing unit (GPU) or an eye. We will then
look at one mid-level operation: segmenting the image into regions.
Edge detection
Edges are straight lines or curves in the image plane across which there is a “significant” change in
image brightness. The goal of edge detection is to abstract away from the messy, multimegabyte
image and toward a more compact, abstract representation, as in Figure 24.6. The motivation is that
edge contours in the image correspond to important scene contours. In the figure we have three
examples of depth discontinuity, labeled 1; two surface-normal discontinuities, labeled 2; a
reflectance discontinuity, labeled 3; and an illumination discontinuity (shadow), labeled 4. Edge
detection is concerned only with the image, and thus does not distinguish between these different
types of scene discontinuities; later processing will.
Figure 24.7(a) shows an image of a scene containing a stapler resting on a desk, and
(b) shows the output of an edge-detection algorithm on this image. As you can see, there is a
difference between the output and an ideal line drawing. There are gaps where no edge appears, and
there are “noise” edges that do not correspond to anything of significance in the scene. Later stages of
processing will have to correct for these errors. How do we detect edges in an image? Consider the
profile of image brightness along a one-dimensional cross-section perpendicular to an edge—for
example, the one between the left edge of the desk and the wall. It looks something like what is shown in
Figure 24.8 (top). Edges correspond to locations in images where the brightness undergoes a sharp
change, so a naive idea would be to differentiate the image and look for places where the magnitude of
the derivative I′(x) is large. That almost works. In Figure 24.8 (middle), we see that there is indeed a
peak at x = 50, but there are also subsidiary peaks at other locations (e.g., x = 75). These arise because of
the presence of noise in the image. If we smooth the image first, the spurious peaks are diminished, as we
see in the bottom of the figure.
The measurement of brightness at a pixel in a CCD camera is based on a physical process involving the
absorption of photons and the release of electrons; inevitably there will be statistical fluctuations of
the measurement—noise. The noise can be modeled with a Gaussian probability distribution, with
each pixel independent of the others. One way to smooth an image is to assign to each pixel the
average of its neighbors. This tends to cancel out extreme values. But how many neighbors should we
consider—one pixel away, or two, or more? One good answer is a weighted average that weights the
nearest pixels the most, then gradually decreases the weight for more distant pixels.
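The sketch below (an illustration, not part of the original text; the noise level and Gaussian width σ are assumed values) applies this recipe to a one-dimensional brightness profile: smooth with Gaussian weights, differentiate, and keep locations where the derivative magnitude is large:

import numpy as np

# Synthetic 1-D brightness profile: a step edge at x = 50 plus noise.
rng = np.random.default_rng(0)
I = np.where(np.arange(100) < 50, 40.0, 180.0) + rng.normal(0, 5, 100)

# Gaussian smoothing: weight the nearest pixels most, distant pixels less.
sigma = 2.0
radius = 3 * int(sigma)
xs = np.arange(-radius, radius + 1)
g = np.exp(-xs**2 / (2 * sigma**2))
g /= g.sum()
I_smooth = np.convolve(I, g, mode="same")

# Differentiate the smoothed profile and mark large-magnitude peaks.
dI = np.gradient(I_smooth)
edges = np.flatnonzero(np.abs(dI) > 0.5 * np.abs(dI).max())
print(edges)   # indices clustered around the true edge at x = 50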
Texture
In everyday language, texture is the visual feel of a surface—what you see evokes what the surface
might feel like if you touched it (“texture” has the same root as “textile”). In computational vision,
texture refers to a spatially repeating pattern on a surface that can be sensed visually. Examples
include the pattern of windows on a building, stitches on a sweater, spots on a leopard, blades of grass
on a lawn, pebbles on a beach, and people in a stadium. Sometimes the arrangement is quite periodic,
as in the stitches on a sweater; in other cases, such as pebbles on a beach, the regularity is only
statistical.
Whereas brightness is a property of individual pixels, the concept of texture makes sense only for a
multipixel patch. Given such a patch, we could compute the orientation at each pixel, and then
characterize the patch by a histogram of orientations. The texture of bricks in a wall would have two
peaks in the histogram (one vertical and one horizontal), whereas the texture of spots on a leopard’s
skin would have a more uniform distribution of orientations.
Figure 24.9 shows that orientations are largely invariant to changes in illumination. This makes
texture an important clue for object recognition, because other clues, such as edges, can yield
different results in different lighting conditions.
In images of textured objects, edge detection does not work as well as it does for smooth objects.
This is because the most important edges can be lost among the texture elements. Quite literally, we
may miss the tiger for the stripes. The solution is to look for differences in texture properties, just the
way we look for differences in brightness. A patch on a tiger and a patch on the grassy background
will have very different orientation histograms, allowing us to find the boundary curve between them.
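As an illustration of the orientation-histogram idea, here is a toy sketch on a synthetic "brick wall" patch (the patch, the bin count, and the use of gradient orientation as a stand-in for local orientation are all assumptions):

import numpy as np

# Synthetic "brick wall" patch: horizontal and vertical dark mortar lines.
patch = np.ones((64, 64)) * 200.0
patch[::16, :] = 50.0     # horizontal lines
patch[:, ::16] = 50.0     # vertical lines

# Gradient orientation at each pixel.
gy, gx = np.gradient(patch)
mag = np.hypot(gx, gy)
theta = np.arctan2(gy, gx)            # radians, in (-pi, pi]

# Magnitude-weighted histogram of orientations over the whole patch.
hist, _ = np.histogram(theta, bins=8, range=(-np.pi, np.pi), weights=mag)
print(hist)   # mass concentrates in a few bins: the two dominant edge orientations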

Segmentation of images
Segmentation is the process of breaking an image into regions of similar pixels. Each image
pixel can be associated with certain visual properties, such as brightness, color, and texture.
Within an object, or a single part of an object, these attributes vary relatively little, whereas
across an inter-object boundary there is typically a large change in one or more of these attributes.
There are two approaches to segmentation, one focusing on detecting the boundaries
of these regions, and the other on detecting the regions themselves (Figure 24.11).
A boundary curve passing through a pixel (x, y) will have an orientation θ, so one way to
formalize the problem of detecting boundary curves is as a machine learning classification
problem. Based on features from a local neighborhood, we want to compute the probability
Pb(x, y, θ) that indeed there is a boundary curve at that pixel along that orientation. Consider
a circular disk centered at (x, y), subdivided into two half disks by a diameter oriented at θ. If
there is a boundary at (x, y, θ) the two half disks might be expected to differ significantly in
their brightness, color, and texture. Martin, Fowlkes, and Malik (2004) used features based
on differences in histograms of brightness, color, and texture values measured in these two
half disks, and then trained a classifier. For this they used a data set of natural images where
humans had marked the “ground truth” boundaries, and the goal of the classifier was to mark
exactly those boundaries marked by humans and no others.
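The half-disc comparison can be sketched as follows (an illustration only; a real Pb detector combines brightness, color, and texture histograms and feeds the differences to a trained classifier, whereas this sketch uses brightness alone):

import numpy as np

def half_disc_difference(image, x, y, theta, radius=10, bins=16):
    """Chi-squared difference between the brightness histograms of the two
    half discs at (x, y) split by a diameter oriented at theta.
    A large value suggests a boundary; this is only one ingredient of Pb."""
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disc = xs**2 + ys**2 <= radius**2
    # Side of the diameter (direction theta) each pixel falls on.
    side = (-xs * np.sin(theta) + ys * np.cos(theta)) > 0
    win = image[y - radius:y + radius + 1, x - radius:x + radius + 1]
    h1, _ = np.histogram(win[disc & side], bins=bins, range=(0, 256))
    h2, _ = np.histogram(win[disc & ~side], bins=bins, range=(0, 256))
    h1 = h1 / max(h1.sum(), 1)
    h2 = h2 / max(h2.sum(), 1)
    return 0.5 * np.sum((h1 - h2)**2 / (h1 + h2 + 1e-8))

# Example: a vertical step edge gives the maximal difference for theta = pi/2.
img = np.zeros((50, 50)); img[:, 25:] = 255.0
print(half_disc_difference(img, 25, 25, np.pi / 2))   # 1.0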
Boundaries detected by this technique turn out to be significantly better than those found
using the simple edge-detection technique described previously. But still there are two
limitations. (1) The boundary pixels formed by thresholding Pb(x, y, θ) are not guaranteed
to form closed curves, so this approach doesn't deliver regions, and (2) the decision making
exploits only local context and does not use global consistency constraints.
The alternative approach is based on trying to “cluster” the pixels into regions based on their
brightness, color, and texture. Shi and Malik (2000) set this up as a graph partitioning problem. The
nodes of the graph correspond to pixels, and edges to connections between pixels. The weight Wij on
the edge connecting a pair of pixels i and j is based on how similar the two pixels are in brightness,
color, texture, etc. Partitions that minimize a normalized cut criterion are then found. Roughly
speaking, the criterion for partitioning the graph is to minimize the sum of weights of connections
across the groups of pixels and maximize the sum of weights of connections within the groups.
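A toy sketch of this idea on an eight-pixel, one-dimensional "image" (the affinity formula, neighborhood size, and brightness values are assumptions; real implementations use sparse matrices, richer affinities, and 2-D images). It uses the standard spectral relaxation: threshold the eigenvector with the second-smallest eigenvalue of the normalized Laplacian.

import numpy as np

# Four dark pixels followed by four bright pixels.
brightness = np.array([10, 12, 11, 9, 200, 205, 198, 202], dtype=float)
n = len(brightness)

# Affinity W_ij: high when pixels are similar in brightness and nearby.
W = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j and abs(i - j) <= 2:          # local connections only
            W[i, j] = np.exp(-((brightness[i] - brightness[j]) ** 2) / (2 * 30.0 ** 2))

D = np.diag(W.sum(axis=1))

# Spectral relaxation of the normalized cut: solve (D - W) y = lambda D y.
Dinv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_sym = Dinv_sqrt @ (D - W) @ Dinv_sqrt
vals, vecs = np.linalg.eigh(L_sym)
y = Dinv_sqrt @ vecs[:, 1]                      # second-smallest eigenvector
labels = (y > np.median(y)).astype(int)
print(labels)   # the two brightness groups separate, e.g. [0 0 0 0 1 1 1 1]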
Segmentation based purely on low-level, local attributes such as brightness and color cannot be
expected to deliver the final correct boundaries of all the objects in the scene. To reliably find object
boundaries we need high-level knowledge of the likely kinds of objects in the scene. Representing
this knowledge is a topic of active research. A popular strategy is to produce an over-segmentation of
an image, containing hundreds of homogeneous regions known as superpixels. From there,
knowledge-based algorithms can take over; they will find it easier to deal with hundreds of
superpixels rather than millions of raw pixels. How to exploit high-level knowledge of objects is the
subject of the next section.
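For example, one widely used way to obtain such an over-segmentation is the SLIC algorithm; the call below uses scikit-image as one possible implementation (the image and the parameter values are placeholders):

import numpy as np
from skimage.segmentation import slic

# A synthetic color image; in practice this would be a photograph.
image = np.random.rand(120, 160, 3)

# Roughly 200 compact, homogeneous regions (superpixels).
segments = slic(image, n_segments=200, compactness=10)
print(segments.shape, segments.max() + 1)   # label map and region count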

13.3. OBJECT RECOGNITION BY APPEARANCE

Appearance is shorthand for what an object tends to look like. Some object categories—for
example, baseballs—vary rather little in appearance; all of the objects in the category look about the
same under most circumstances. In this case, we can compute a set of features describing each class
of images likely to contain the object, then test it with a classifier. Other object categories—for
example, houses or ballet dancers—vary greatly. A house can have different size, color, and shape
and can look different from different angles. A dancer looks different in each pose, or when the stage
lights change colors. A useful abstraction is to say that some objects are made up of local patterns
which tend to move around with respect to one another. We can then find the object by looking at
local histograms of detector responses, which expose whether some part is present but suppress the
details of where it is.
Testing each class of images with a learned classifier is an important general
recipe. It works extremely well for faces looking directly at the camera, because at low resolution
and under reasonable lighting, all such faces look quite similar. The face is round, and quite bright
compared to the eye sockets; these are dark, because they are sunken, and the mouth is a dark slash,
as are the eyebrows. Major changes of illumination can cause some variations in this pattern, but the
range of variation is quite manageable. That makes it possible to detect face positions in an image
that contains faces. Once a computational challenge, this feature is now commonplace in even
inexpensive digital cameras.
For the moment, we will consider only faces where the nose is oriented vertically;
we will deal with rotated faces below. We sweep a round window of fixed size over the image,
compute features for it, and present the features to a classifier. This strategy is sometimes called the
sliding window. Features need to be robust to shadows and to changes in brightness caused by
illumination changes. One strategy is to build features out of gradient orientations. Another is to
estimate and correct the illumination in each image window. To find faces of different sizes, repeat
the sweep over larger or smaller versions of the image. Finally, we postprocess the responses across
scales and locations to produce the final set of detections.
Postprocessing is important, because it is unlikely that we have chosen a window
size that is exactly the right size for a face (even if we use multiple sizes). Thus, we will likely have
several overlapping windows that each report a match for a face. However, if we use a classifier that
can report strength of response (for example, logistic regression or a support vector machine) we can
combine these partial overlapping matches at nearby locations to yield a single high-quality match.
That gives us a face detector that can search over locations and scales. To search rotations as well, we
use two steps. We train a regression procedure to estimate the best orientation of any face present in
a window. Now, for each window, we estimate the orientation, reorient the window, then test whether
a vertical face is present with our classifier. All this yields a system whose architecture is sketched in
Figure 24.12.
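A schematic sketch of the sliding-window-plus-postprocessing pipeline (the window size, stride, threshold, and the stand-in scoring function are all assumptions; a real detector would use a trained classifier, features robust to illumination, and an image pyramid for multiple scales):

import numpy as np

def sliding_window_detect(image, score_window, size=32, stride=8, threshold=0.5):
    """Sweep a fixed-size window over the image, score each window with a
    classifier that reports strength of response, and keep strong responses.
    To handle scale, the same sweep would be repeated on resized copies."""
    detections = []
    H, W = image.shape
    for y in range(0, H - size + 1, stride):
        for x in range(0, W - size + 1, stride):
            s = score_window(image[y:y + size, x:x + size])
            if s > threshold:
                detections.append((s, x, y, size))
    return detections

def non_max_suppression(detections, min_dist=16):
    """Combine overlapping matches: keep the strongest response and drop
    weaker detections at nearby locations."""
    kept = []
    for s, x, y, size in sorted(detections, reverse=True):
        if all(abs(x - kx) > min_dist or abs(y - ky) > min_dist
               for _, kx, ky, _ in kept):
            kept.append((s, x, y, size))
    return kept

# Usage with a dummy scorer (bright windows count as "faces"):
img = np.zeros((128, 128)); img[40:72, 40:72] = 1.0
dets = sliding_window_detect(img, lambda w: w.mean())
print(non_max_suppression(dets))   # a single combined detection near (40, 40)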
Training data is quite easily obtained. There are several data sets of marked-up
face images, and rotated face windows are easy to build (just rotate a window from a training data
set). One trick that is widely used is to take each example window, then produce new examples by
changing the orientation of the window, the center of the window, or the scale very slightly. This is
an easy way of getting a bigger data set that reflects real images fairly well; the trick usually
improves performance significantly. Face detectors built along these lines now perform very well for
frontal faces (side views are harder).
13.4 RECONSTRUCTING THE 3D WORLD
In this section we show how to go from the two-dimensional image to a three-dimensional
representation of the scene. The fundamental question is this: Given that all points in the scene that fall
along a ray to the pinhole are projected to the same point in the image, how do we recover three-
dimensional information? Two ideas come to our rescue:
• If we have two (or more) images from different camera positions, then we can triangulate to
find the position of a point in the scene.
• We can exploit background knowledge about the physical scene that gave rise to the image. Given an
object model P(Scene) and a rendering model P(Image | Scene), we can compute a posterior
distribution P(Scene | Image).
There is as yet no single unified theory for scene reconstruction. We survey several commonly used
visual cues: motion, binocular stereopsis, multiple views, texture, shading, contour, and familiar
objects.

Motion parallax

If the camera moves relative to the three-dimensional scene, the resulting apparent motion in the image,
optical flow, can be a source of information for both the movement of the camera and depth in the scene.
To understand this, we state (without proof) an equation that relates the optical flow to the viewer’s
translational velocity T and the depth in the scene. The components of the optical flow field are

    vx(x, y) = (−Tx + x·Tz) / Z(x, y),    vy(x, y) = (−Ty + y·Tz) / Z(x, y),

where Z(x, y) is the z-coordinate of the point in the scene corresponding to the point in the image
at (x, y).
Note that both components of the optical flow, vx(x, y) and vy(x, y), are zero at the point x =
Tx/Tz, y = Ty/Tz. This point is called the focus of expansion of the flow field. Suppose we
change the origin in the x–y plane to lie at the focus of expansion; then
the expressions for optical flow take on a particularly simple form. Let (x′, y′) be the new
coordinates defined by x′ = x − Tx/Tz, y′ = y − Ty/Tz. Then

    vx(x′, y′) = x′·Tz / Z(x′, y′),    vy(x′, y′) = y′·Tz / Z(x′, y′).

Note that there is a scale-factor ambiguity here. If the camera was moving twice as fast, and every
object in the scene was twice as big and at twice the distance to the camera, the optical flow field
would be exactly the same. But we can still extract quite useful information.
1. Suppose you are a fly trying to land on a wall and you want to know the time-to-contact at the
current velocity. This time is given by Z/Tz. Note that although the instantaneous optical flow field
cannot provide either the distance Z or the velocity component Tz, it can provide the ratio of the two
and can therefore be used to control the landing approach. There is considerable experimental
evidence that many different animal species exploit this cue.
2. Consider two points at depths Z1, Z2, respectively. We may not know the absolute value of either of
these, but by considering the inverse of the ratio of the optical flow magnitudes at these points, we can
determine the depth ratio Z1/Z2. This is the cue of motion parallax, one we use when we look out of
the side window of a moving car or train and infer that the slower moving parts of the landscape are
farther away.
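A small numerical sketch of both uses of purely translational flow (the velocity, depths, and image point are made-up values):

import numpy as np

def flow(x, y, Z, T):
    """Optical flow at image point (x, y) for a scene point at depth Z,
    under pure camera translation T = (Tx, Ty, Tz), in f = 1 units."""
    Tx, Ty, Tz = T
    return np.array([(-Tx + x * Tz) / Z, (-Ty + y * Tz) / Z])

T = (0.0, 0.0, 2.0)        # camera moving straight ahead at 2 units/s

# 1. Time to contact for a surface point at depth Z = 10: Z / Tz = 5 s.
print(10.0 / T[2])

# 2. Motion parallax: with the origin at the focus of expansion, flow
#    magnitude is inversely proportional to depth, so the ratio of the
#    magnitudes gives the inverse depth ratio.
Z1, Z2 = 5.0, 20.0
v1 = flow(0.1, 0.0, Z1, T)
v2 = flow(0.1, 0.0, Z2, T)
print(np.linalg.norm(v2) / np.linalg.norm(v1))   # = Z1 / Z2 = 0.25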

Binocular stereopsis

Most vertebrates have two eyes. This is useful for redundancy in case of a lost eye, but it helps in other
ways too. Most prey have eyes on the side of the head to enable a wider field of vision. Predators
have the eyes in the front, enabling them to use binocular stereopsis. The idea is similar to motion
parallax, except that instead of using images over time, we use two (or more) images separated in
space. Because a given feature in the scene will be in a different place relative to the z-axis of each
image plane, if we superpose the two images, there will be a disparity in the location of the image
feature in the two images. You can see this in Figure 24.16, where the nearest point of the pyramid is
shifted to the left in the right image and to the right in the left image.
Note that to measure disparity we need to solve the correspondence problem, that is, determine for a
point in the left image, the point in the right image that results from the projection of the same scene
point. This is analogous to what one has to do in measuring optical flow, and the most simple-minded
approaches are somewhat similar and based on comparing blocks of pixels around corresponding
points using the sum of squared differences. In practice, we use much more sophisticated algorithms,
which exploit additional constraints.
Assuming that we can measure disparity, how does this yield information about depth in the scene?
We will need to work out the geometrical relationship between disparity and depth. First, we will
consider the case when both the eyes (or cameras) are looking forward with their optical axes
parallel. The relationship of the right camera to the left camera is then just a displacement along the
x-axis by an amount b, the baseline. We can use the optical flow equations from the previous section,
if we think of this as resulting from a translation

vector T acting for time δt, with Tx = b/δt and Ty = Tz = 0. The horizontal and vertical disparities are
given by the optical flow components, multiplied by the time step δt, H = vx δt, V = vy δt. Carrying
out the substitutions, we get the result that H = b/Z, V = 0. In words, the horizontal disparity is equal
to the ratio of the baseline to the depth, and the vertical disparity is zero. Given that we know b, we
can measure H and recover the depth Z.
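In code, inverting this relation recovers depth from a measured disparity (a minimal sketch; the pixel-scaled variant Z = f·b/d mentioned in the comment is an assumption about how disparity is measured in practice):

def depth_from_disparity(H, b=0.06):
    """Depth from horizontal disparity, with image coordinates in units of
    the focal length (f = 1), so that H = b / Z inverts to Z = b / H.
    With disparity d in pixels and focal length f in pixels, Z = f * b / d.
    b is the baseline in metres (about 6 cm for human eyes)."""
    return b / H

print(depth_from_disparity(0.06))    # 1.0  -> the point is 1 metre away
print(depth_from_disparity(0.006))   # 10.0 -> ten times farther away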
Under normal viewing conditions, humans fixate; that is, there is some point in the scene at which the
optical axes of the two eyes intersect. Figure 24.17 shows two eyes fixated at a point P0, which is at a
distance Z from the midpoint of the eyes. For convenience, we will compute the angular disparity,
measured in radians. The disparity at the point of fixation P0 is zero. For some other point P in the
scene that is δZ farther away, we can compute the angular displacements of the left and right images
of P , which we will call PL and PR, respectively. If each of these is displaced by an angle δθ/2
relative to P0, then the displacement between PL and PR, which is the disparity of P , is just δθ. From
Figure 24.17, the disparity works out to δθ = b·δZ/Z².
In humans, b (the baseline distance between the eyes) is about 6 cm. Suppose that Z is about 100 cm.
If the smallest detectable δθ (corresponding to the pixel size) is about 5 seconds of arc, this gives a
δZ of 0.4 mm. For Z = 30 cm, we get the impressively small value δZ = 0.036 mm. That is, at a
distance of 30 cm, humans can discriminate depths that differ by as little as 0.036 mm, enabling us to
thread needles and the like.
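The two numbers quoted above follow from rearranging δθ = b·δZ/Z² to δZ = δθ·Z²/b; a quick check:

import math

b = 0.06                                   # baseline: 6 cm, in metres
dtheta = (5 / 3600) * (math.pi / 180)      # 5 seconds of arc, in radians

for Z in (1.0, 0.30):                      # viewing distances of 100 cm and 30 cm
    dZ = dtheta * Z ** 2 / b               # from  dtheta = b * dZ / Z**2
    print(Z, round(dZ * 1000, 3), "mm")    # ~0.404 mm and ~0.036 mm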

Texture

Texture provides another cue to scene geometry. If a surface is covered with roughly uniform, repeating texture elements (texels), their appearance in the image varies for two main reasons:
1. Differences in the distances of the texels from the camera. Distant objects appear smaller by a scaling
factor of 1/Z.
2. Differences in the foreshortening of the texels. If all the texels are in the ground plane, then distant
ones are viewed at an angle that is farther off the perpendicular, and so are more foreshortened. The
magnitude of the foreshortening effect is proportional to cos σ, where σ is the slant, the angle
between the Z-axis and n, the surface normal to the texel.
Researchers have developed various algorithms that try to exploit the variation in the appearance of the
projected texels as a basis for determining surface normals. However, the accuracy and applicability
of these algorithms are nowhere near as general as those of methods based on multiple views.

13.5 OBJECT RECOGNITION FROM STRUCTURAL INFORMATION
Putting a box around pedestrians in an image may well be enough to avoid driving
into them. We have seen that we can find a box by pooling the evidence provided by
orientations, using histogram methods to suppress potentially confusing spatial detail.
If we want to know more about what someone is doing, we will need to know where
their arms, legs, body, and head lie in the picture. Individual body parts are quite
difficult to detect on their own using a moving window method, because their color
and texture can vary widely and because they are usually small in images. Often,
forearms and shins are as small as two to three pixels wide. Body parts do not usually
appear on their own, and representing what is connected to what could be quite
powerful, because parts that are easy to find might tell us where to look for parts that
are small and hard to detect.
Inferring the layout of human bodies in pictures is an important task
in vision, because the layout of the body often reveals what people are doing. A
model called a deformable template can tell us which configurations are acceptable:
the elbow can bend but the head is never joined to the foot. The simplest deformable
template model of a person connects lower arms to upper arms, upper arms to the
torso, and so on. There are richer models: for example we could represent the fact that left
and right upper arms tend to have the same color and texture, as do left and right legs. These richer
models remain difficult to work with, however.

The geometry of bodies: Finding arms and legs

For the moment, we assume that we know what the person’s body parts look like
(e.g., we know the color and texture of the person’s clothing). We can model the geometry of the
body as a tree of eleven segments (upper and lower left and right arms and legs respectively, a torso, a
face, and hair on top of the face) each of which is rectangular. We assume that the position and
orientation (pose) of the left lower arm is independent of all other segments given the pose of the left
upper arm; that the pose of the left upper arm is independent of all segments given the pose of the
torso; and extend these assumptions in the obvious way to include the right arm and the legs, the face,
and the hair. Such models are often called “cardboard people” models. The model forms a tree, which
is usually rooted at the torso. We will search the image for the best match to this cardboard person
using inference methods for a tree-structured Bayes net (see Chapter 14).
There are two criteria for evaluating a configuration. First, an image rectangle
should look like its segment. For the moment, we will remain vague about precisely what that means,
but we assume we have a function φi that scores how well an image rectangle matches a body
segment. For each pair of related segments, we have another function ψ that scores how well relations
between a pair of image rectangles match those to be expected from the body segments. The
dependencies between segments form a tree, so each segment has only one parent, and we could
write ψi,pa(i). All the functions will be larger if the match is better, so we can think of them as being
like a log probability. The cost of a particular match that allocates image rectangle mi to body
segment i is then

    Σi φi(mi) + Σi ψi,pa(i)(mi, mpa(i)),

where the second sum runs over all segments i that have a parent pa(i) in the tree.
Dynamic programming can find the best match, because the relational model is a
tree.
It is inconvenient to search a continuous space, and we will discretize the space of image rectangles.
We do so by discretizing the location and orientation of rectangles of fixed size (the sizes may be
different for different segments). Because ankles and knees are different, we need to distinguish
between a rectangle and the same rectangle rotated by 180°. One could visualize the result as a set of
very large stacks of small rectangles of image, cut out at different locations and orientations. There is
one stack per segment. We must now find the best allocation of rectangles to segments. This will be
slow, because there are many image rectangles and, for the model we have given, choosing the right
torso will be O(M⁶) if there are M image rectangles. However, various speedups are available for an
appropriate choice of ψ, and the method is practical (Figure 24.23). The model is usually known as a
pictorial structure model.
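The tree-structured optimization can be sketched as follows (a toy illustration with only three segments; the random φ and ψ scores stand in for learned matching functions, and the segment names and candidate counts are placeholders). Because the functions are larger for better matches, the sketch maximises the total score bottom-up over the tree:

import numpy as np

rng = np.random.default_rng(0)

parent = {"torso": None, "upper_arm": "torso", "lower_arm": "upper_arm"}
children = {"torso": ["upper_arm"], "upper_arm": ["lower_arm"], "lower_arm": []}
M = 5                                            # candidate rectangles per segment

phi = {s: rng.normal(size=M) for s in parent}                      # rectangle-vs-segment score
psi = {s: rng.normal(size=(M, M)) for s in parent if parent[s]}    # psi[s][m_parent, m_s]

def best_below(seg):
    """For each candidate rectangle of seg's parent, the best total score
    achievable in the subtree rooted at seg (computed bottom-up)."""
    score = phi[seg].copy()
    for c in children[seg]:
        score += best_below(c)
    return (psi[seg] + score).max(axis=1)        # maximise over seg's own rectangle

root_score = phi["torso"].copy()
for c in children["torso"]:
    root_score += best_below(c)
print("best torso rectangle:", int(root_score.argmax()))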
Recall our assumption that we know what we need to know about what the person looks like. If we
are matching a person in a single image, the most useful feature for scoring segment matches turns
out to be color. Texture features don’t work well in most cases, because folds on loose clothing
produce strong shading patterns that overlay the image texture. These
patterns are strong enough to disrupt the true texture of the cloth. In current work, ψ typically reflects
the need for the ends of the segments to be reasonably close together, but there are usually no
constraints on the angles. Generally, we don’t know what a person looks like, and must build a model
of segment appearances. We call the description of what a person looks like the appearance model.
If we must report the configuration of a person in a single image, we can start with a poorly tuned
appearance model, estimate configuration with this, then re-estimate appearance, and so on. In video,
we have many frames of the same person, and this will reveal their appearance.

13.6 USING VISION

If vision systems could analyze video and understand what people are doing, we
would be able to: design buildings and public places better by collecting and using data about what
people do in public; build more accurate, more secure, and less intrusive surveillance systems; build
computer sports commentators; and build human-computer interfaces that watch people and react to
their behavior. Applications for reactive interfaces range from computer games that make a player get
up and move around to systems that save energy by managing heat and light in a building to match
where the occupants are and what they are doing.
Some problems are well understood. If people are relatively small in the video
frame, and the background is stable, it is easy to detect the people by subtracting a background image
from the current frame. If the absolute value of the difference is large, this background subtraction
declares the pixel to be a foreground pixel; by linking foreground blobs over time, we obtain a track.
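A minimal background-subtraction sketch (the threshold and the synthetic frames are assumed; a real system would update the background model over time and link the resulting blobs across frames to form tracks):

import numpy as np

def foreground_mask(frame, background, threshold=25):
    """Background subtraction: a pixel is foreground if its value differs
    from the background image by more than a threshold (an assumed value)."""
    return np.abs(frame.astype(float) - background.astype(float)) > threshold

# Toy example: a stable background with one bright moving blob.
background = np.full((60, 80), 100, dtype=np.uint8)
frame = background.copy()
frame[20:30, 40:50] = 200                 # the "person"
mask = foreground_mask(frame, background)
print(mask.sum())                         # 100 foreground pixels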
Structured behaviors like ballet, gymnastics, or tai chi have specific vocabularies of actions. When
performed against a simple background, videos of these actions are easy to deal with. Background
subtraction identifies the major moving regions, and we can build HOG (histogram of oriented gradients) features (keeping track of
flow rather than orientation) to present to a classifier. We can detect consistent patterns of action with
a variant of our pedestrian detector, where the orientation features are collected into histogram
buckets over time as well as space (Figure 24.25).
More general problems remain open. The big research question is to link
observations of the body and the objects nearby to the goals and intentions of the moving people. One
source of difficulty is that we lack a simple vocabulary of human behavior. Behavior is a lot like color,
in that people tend to think they know a lot of behavior names but can’t produce long lists of such
words on demand. There is quite a lot of evidence that behaviors combine— you can, for example,
drink a milkshake while visiting an ATM—but we don’t yet know what the pieces are, how the
composition works, or how many composites there might be. A second source of difficulty is that we
don’t know what features expose what is happening. For example, knowing someone is close to an
ATM may be enough to tell that they’re visiting the ATM. A third difficulty is that the usual
reasoning about the relationship between training and test data is untrustworthy. For example, we
cannot argue that a pedestrian detector is safe simply because it performs well on a large data set,
because that data set may well omit important, but rare, phenomena (for example, people mounting
bicycles). We wouldn’t want our automated driver to run over a pedestrian who happened to do
something unusual.
