
Artificial Intelligence: A Modern Approach

Fourth Edition, Global Edition

Chapter 27

Computer Vision



Outline
♦ Introduction

♦ Image Formation

♦ Simple Image Features

♦ Classifying Images

♦ Detecting Objects

♦ The 3D World

♦ Using Computer Vision



Introduction
• Most agents that use vision use passive sensing

• A feature is a number obtained by applying simple
  computations to an image.

• The model-based approach to vision uses two kinds of models:
  • object model
    • Example: a precise geometric model produced by
      computer-aided design systems
  • rendering model
    • Describes the physical, geometric, and statistical
      processes that produce the stimulus from the world

• The two core problems of computer vision are:
  • reconstruction, where an agent builds a model of the world
    from an image or a set of images
  • recognition, where an agent draws distinctions among the
    objects it encounters based on visual and other information



Image Formation
Images without lenses: The pinhole camera
• Image sensors gather light scattered from objects in a scene
and create a 2D image
• we can treat the whole image plane as a sensor, but each pixel is an
  individual tiny sensor—usually a charge-coupled device (CCD) or
  complementary metal-oxide semiconductor (CMOS)
• focused image: all the photons arriving at a sensor come from
  approximately the same spot on the object in the world
Pinhole camera:
• Pinhole opening, O, at the front of a box, and an image plane at the back of
  the box
• the opening is called the aperture; f is the focal length—the distance from
  the pinhole to the image plane

• a moving object stays focused if it moves only a short distance
  within the sensor's time window
• if a moving object is defocused in time, the result is motion blur
• projection of a point: a scene point P = (X, Y, Z) projects to the image
  point (x, y) = (−f X/Z, −f Y/Z); the minus signs mean the image is inverted
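A minimal sketch of that projection equation in Python (assuming a camera at the origin looking along the Z axis, with a hypothetical focal length f):

```python
import numpy as np

def pinhole_project(points, f=0.05):
    """Project 3D scene points (X, Y, Z) to 2D image-plane points.

    Pinhole model: x = -f*X/Z, y = -f*Y/Z; the minus signs reflect
    the inverted image formed behind the pinhole.
    """
    points = np.asarray(points, dtype=float)
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([-f * X / Z, -f * Y / Z], axis=1)

# A nearby small tower and a distant large one project to the same point,
# which is exactly the size/distance ambiguity shown in the figure below:
print(pinhole_project([[1, 2, 10], [10, 20, 100]]))  # identical rows
```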



Image Formation

Each light-sensitive element at the back of a pinhole camera receives light that passes
through the pinhole from a small range of directions. If the pinhole is small enough,
the result is a focused image behind the pinhole. The process of projection means
that large, distant objects look the same as smaller, nearby objects—the point P′ in
the image plane could have come from a nearby toy tower at point P or from a
distant real tower at point Q.



Image Formation
Lens systems
• A lens can focus light only from points that lie within a limited
  range of Z depths from the lens.
• The center of this range—where focus is sharpest—is called the focal plane.
• The range of depths for which focus remains sharp enough is called the depth
  of field.
• The larger the lens aperture (opening), the smaller the depth of field.



Image Formation
Scaled orthographic projection
• Distant objects whose depth variation is small relative to their
  distance can be handled with a simplified model called scaled
  orthographic projection, rather than perspective projection

• The depth Z of all points on an object falls within the range Z0 ± ∆Z, with ∆Z ≪ Z0

• The perspective scaling factor f/Z can then be approximated by a constant s = f/Z0

• The equations for projection from the scene coordinates (X, Y, Z) to the image
  plane become x = sX and y = sY

• Foreshortening still occurs in the scaled orthographic projection
  model, because it is caused by the object tilting away from the
  viewer
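A minimal numeric sketch (with made-up values for f, Z0, and ∆Z) showing how small the approximation error is when ∆Z ≪ Z0:

```python
import numpy as np

f, Z0 = 0.05, 100.0           # hypothetical focal length and object distance
s = f / Z0                    # constant scale used by scaled orthography

# Scene points spread over a shallow depth range Z0 ± dZ, with dZ << Z0:
X = np.array([1.0, 2.0, 3.0])
Z = np.array([99.0, 100.0, 101.0])

x_perspective = f * X / Z     # exact perspective projection
x_orthographic = s * X        # scaled orthographic approximation
print(np.abs(x_perspective - x_orthographic))  # errors ~1% of x
```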



Image Formation
Light and shading
The brightness of a pixel in the image is a function of the
brightness of the surface patch in the scene that projects to the
pixel

Pixel brightness is ambiguous, because three factors contribute to
the amount of light reaching the camera:
• the overall intensity of ambient light;
• whether the point is facing the light or is in shadow; and
• the amount of light reflected from the point.

Diffuse reflection scatters light evenly across the directions
leaving a surface, so the brightness of a diffuse surface doesn't
depend on the viewing direction.

Specular reflection causes incoming light to leave a surface in a
lobe of directions that is determined by the direction the light
arrived from.

When illumination rays all travel parallel to one another in a
known direction, we can model this behavior with a distant point
light source.
Image Formation

This photograph illustrates a variety of illumination effects. There are
specularities on the stainless steel cruet. The onions and carrots are bright
diffuse surfaces because they face the light direction. The shadows appear
at surface points that cannot see the light source at all. Inside the pot are
some dark diffuse surfaces where the light strikes at a tangential angle.
(There are also some shadows inside the pot.)
Image Formation

Two surface patches are illuminated by a distant point source, whose rays are shown as
light arrows. Patch A is tilted away from the source (θ is close to 90◦) and collects less
energy, because it cuts fewer light rays per unit surface area. Patch B, facing the source
(θ is close to 0◦), collects more energy.
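The geometry in this figure is Lambert's cosine law: a diffuse patch with albedo ρ lit by a source of intensity I0 at angle θ has brightness I = ρ I0 cos θ. A minimal sketch (all numeric values below are hypothetical):

```python
import numpy as np

def diffuse_brightness(albedo, light_intensity, normal, light_dir):
    """Lambert's cosine law: I = albedo * I0 * cos(theta), where theta is
    the angle between the surface normal and the direction to the light.
    Points facing away from the light receive no light."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    cos_theta = max(np.dot(n, l), 0.0)   # no negative light
    return albedo * light_intensity * cos_theta

# Patch B faces the source (theta near 0), patch A is tilted (theta near 90°):
print(diffuse_brightness(0.7, 1.0, normal=[0, 0, 1], light_dir=[0, 0, 1]))
print(diffuse_brightness(0.7, 1.0, normal=[0, 0, 1], light_dir=[1, 0, 0.05]))
```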



Image Formation
Color

• Principle of trichromacy
  • the visual appearance of any spectral energy density, however
    complex, can be matched by mixing appropriate amounts of just
    three primaries.

• Three primaries
  • no mixture of any two will match the third.
  • one red primary, one green, and one blue, abbreviated as RGB

• Color constancy
  • estimate the color the surface would have under white light,
    ignoring the effects of differently colored illumination
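Trichromacy makes color matching a linear problem: if we know how each primary stimulates the three cone types, the mixing weights for a target stimulus come from solving a 3×3 linear system. A minimal sketch with made-up sensitivity numbers:

```python
import numpy as np

# Hypothetical 3x3 matrix: column j = response of the three cone types
# (L, M, S) to one unit of primary j (R, G, B). Numbers are illustrative.
primaries = np.array([[0.80, 0.30, 0.05],
                      [0.40, 0.90, 0.10],
                      [0.05, 0.10, 0.95]])

target = np.array([0.6, 0.7, 0.3])    # cone response to some complex light
weights = np.linalg.solve(primaries, target)
# Note: in real color matching a weight can come out negative, meaning
# that primary must be mixed into the target side instead.
print(weights)  # amounts of R, G, B that produce the same appearance
```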



Simple Image Features
Edges
Edges are straight lines or curves in the image plane across which
there is a “significant” change in image brightness.
• Edge detection: abstract away from the messy, multi-megabyte
image and towards a more compact, abstract representation

Several effects in the scene produce large changes in image
intensity, and hence edges:
• depth discontinuities
• changes in the surface normal
• changes in surface reflectance
• discontinuities in illumination (shadows)

Finding edges requires care:
• Intensity differences alone can cause mistaken edge detections,
  because of noise.
• Noise: changes to the value of a pixel that have nothing to do
  with an edge.



Simple Image Features

Different kinds of edges: (1) depth discontinuities; (2) surface
orientation discontinuities; (3) reflectance discontinuities; (4)
illumination discontinuities (shadows).



Simple Image Features

Smoothing involves using surrounding pixels to suppress noise:
predict the value of a pixel as a weighted sum of nearby pixels,
with more weight for the closest pixels.

The weighting is expressed as a convolution of two functions f and g
(denoted h = f ∗ g). A Gaussian filter uses the Gaussian
Gσ(d) = 1/(2πσ²) e^(−d²/(2σ²)) as the weights:
• Applying a Gaussian filter means replacing the intensity I(x0, y0) with the
  sum, over all (x, y) pixels, of I(x, y) Gσ(d), where d is the distance from
  (x0, y0) to (x, y).
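A minimal sketch of Gaussian smoothing by direct convolution (a small hand-built kernel; real systems use optimized separable filters):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Build a normalized 2D Gaussian kernel G_sigma(d)."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return g / g.sum()

def smooth(image, sigma=1.0):
    """Replace each pixel with a Gaussian-weighted sum of its neighbors."""
    radius = int(3 * sigma)            # weights are ~0 beyond 3 sigma
    k = gaussian_kernel(sigma, radius)
    h, w = image.shape
    padded = np.pad(image, radius, mode="edge")
    out = np.zeros_like(image, dtype=float)
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 2*radius + 1, x:x + 2*radius + 1]
            out[y, x] = np.sum(window * k)
    return out

noisy = np.random.rand(32, 32)
print(smooth(noisy, sigma=2.0).shape)
```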



Simple Image Features

Top: Intensity profile I(x) along a one-dimensional section across a step edge. Middle: The
derivative of intensity, I′(x). Large values of this function correspond to edges, but the function
is noisy. Bottom: The derivative of a smoothed version of the intensity. The noisy candidate
edge at x = 75 has disappeared.



Simple Image Features

(a) Photograph of a stapler. (b) Edges computed from (a).

The output is not perfect: there are gaps where no edge
appears, and there are “noise” edges that do not correspond
to anything of significance in the scene. Later stages of
processing will have to correct for these errors.



Simple Image Features
Texture
• texture refers to a pattern on a surface that can be sensed
visually

• the usual rough model of texture is a repetitive pattern of
  elements, sometimes called texels

• texture is a property of an image patch, rather than a pixel in isolation

• it does not change when the lighting changes

• it changes in a sensible way when the patch rotates

• it is useful for identifying objects and matching patches

• basic construction for a texture representation (see the sketch below):
  • compute the gradient orientation at each pixel in the patch
  • characterize the patch by a histogram of orientations
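A minimal sketch of that construction (numpy only; the choice of 8 bins is arbitrary):

```python
import numpy as np

def orientation_histogram(patch, bins=8):
    """Describe a patch by a histogram of its gradient orientations."""
    gy, gx = np.gradient(patch.astype(float))
    orientation = np.arctan2(gy, gx)      # gradient angle at each pixel
    magnitude = np.hypot(gx, gy)          # weight strong gradients more
    hist, _ = np.histogram(orientation, bins=bins,
                           range=(-np.pi, np.pi), weights=magnitude)
    return hist / (hist.sum() + 1e-9)     # normalize for comparability

patch = np.random.rand(16, 16)
print(orientation_histogram(patch))
```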



Simple Image Features
Optical flow
Optical flow: apparent motion whenever there is relative
movement between the camera and one or more objects in the
scene

It describes the direction and speed of motion of features in the
image as a result of relative motion between the viewer and the
scene.

One measure of similarity is the sum of squared differences (SSD):

SSD(Dx, Dy) = Σ(x, y) ( I(x, y, t) − I(x + Dx, y + Dy, t + Dt) )²

Here, (x, y) ranges over pixels in the block centered at (x0, y0). We find the (Dx,
Dy) that minimizes the SSD. The optical flow at (x0, y0) is then (vx, vy) = (Dx/Dt,
Dy/Dt).
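A minimal block-matching sketch of this search (brute force over a small displacement range; the window size and search radius are arbitrary choices):

```python
import numpy as np

def flow_at(frame0, frame1, x0, y0, half=7, search=4):
    """Find the displacement (Dx, Dy) minimizing the SSD between a block
    centered at (x0, y0) in frame0 and a shifted block in frame1."""
    block = frame0[y0-half:y0+half+1, x0-half:x0+half+1]
    best, best_ssd = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = frame1[y0+dy-half:y0+dy+half+1, x0+dx-half:x0+dx+half+1]
            ssd = np.sum((block.astype(float) - cand) ** 2)
            if ssd < best_ssd:
                best, best_ssd = (dx, dy), ssd
    return best  # divide by the frame interval Dt to get velocity

f0 = np.random.rand(64, 64)
f1 = np.roll(f0, (2, 3), axis=(0, 1))   # frame shifted by (dx, dy) = (3, 2)
print(flow_at(f0, f1, 32, 32))          # recovers (3, 2)
```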

The method requires some texture in the scene, resulting in windows
containing a significant variation in brightness among the pixels.
Simple Image Features

Two frames of a video sequence and the optical flow field
corresponding to the displacement from one frame to the
other. Note how the movement of the tennis racket and the
front leg is captured by the directions of the arrows.



Simple Image Features
Segmentation of natural images

Segmentation is the process of breaking an image into groups
of similar pixels.

The focus can be on detecting the boundaries between groups, or on
the groups themselves (regions).

Boundary detection as a classification problem (see the sketch below):
• a boundary curve at pixel location (x, y) has an orientation θ. An image
  neighborhood centered at (x, y) looks roughly like a disk, cut into two
  halves by a diameter oriented at θ.
• compute the probability Pb(x, y, θ) that there is a boundary curve at that
  pixel along that orientation by comparing features in the two halves.
• train a machine learning classifier using a data set of
  natural images in which humans have marked the ground
  truth boundaries
• the goal of the classifier is to mark exactly those boundaries
  marked by humans and no others
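A minimal sketch of the half-disc comparison at the heart of Pb (intensity histograms compared with a chi-squared distance; the feature choice and score normalization here are illustrative, not the published algorithm):

```python
import numpy as np

def pb_score(image, x, y, theta, radius=8, bins=16):
    """Compare the two half-discs around (x, y) split at angle theta.
    A large histogram difference suggests a boundary there."""
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disc = xs**2 + ys**2 <= radius**2
    side = (xs * np.sin(theta) - ys * np.cos(theta)) > 0   # which half
    patch = image[y - radius:y + radius + 1, x - radius:x + radius + 1]

    h1, _ = np.histogram(patch[disc & side], bins=bins, range=(0, 1))
    h2, _ = np.histogram(patch[disc & ~side], bins=bins, range=(0, 1))
    h1, h2 = h1 / max(h1.sum(), 1), h2 / max(h2.sum(), 1)
    return 0.5 * np.sum((h1 - h2)**2 / (h1 + h2 + 1e-9))   # chi-squared

img = np.zeros((32, 32)); img[:, 16:] = 1.0     # vertical step edge
print(pb_score(img, 16, 16, theta=np.pi / 2))   # high on the edge
print(pb_score(img, 8, 16, theta=np.pi / 2))    # low in a uniform region
```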

An alternative approach is based on trying to “cluster” the pixels.



Simple Image Features

(a) Original image.
(b) Boundary contours, where the higher the Pb value, the darker the contour.
(c) Segmentation into regions, corresponding to a fine partition of the image.
    Regions are rendered in their mean colors.
(d) Segmentation into regions, corresponding to a coarser partition of the image,
    resulting in fewer regions.



Classifying Images
Modern systems classify images using appearance (i.e., color
and texture, as opposed to geometry)

There are two difficulties:

• different instances of the same class can look different—
  some cats are black, others orange

• the same instance can look different at different times,
  depending on several effects:
  • Lighting
  • Foreshortening
  • Aspect
  • Occlusion
  • Deformation

We deal with these problems by learning representations and
classifiers from very large quantities of training data, using a
convolutional neural network.



Classifying Images
Image classification with convolutional neural networks (CNN)
With enough training data and enough training ingenuity, CNNs
produce very successful classification systems

• Images can have small alterations without changing the identity.

• Local patterns can be quite informative.

• Spatial relations between local patterns are informative.

Convolution followed by a ReLU activation function acts as a
local pattern detector (see the sketch below):
• convolution measures how much each local window of the
  image looks like the kernel pattern
• ReLU sets low-scoring windows to zero, and emphasizes high-
  scoring windows

Convolution with multiple kernels finds multiple patterns, and
composite patterns can be detected by applying another layer to
the output of the first layer.
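A minimal numpy sketch of one convolution-plus-ReLU pattern detector (a hand-built 3×3 vertical-edge kernel; learned CNNs use many such kernels with trained weights):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Score every local window by its match to the kernel pattern."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def relu(x):
    """Zero out low-scoring windows, keep high-scoring ones."""
    return np.maximum(x, 0.0)

vertical_edge = np.array([[-1., 0., 1.],
                          [-1., 0., 1.],
                          [-1., 0., 1.]])
img = np.zeros((8, 8)); img[:, 4:] = 1.0       # dark-to-bright step
response = relu(conv2d_valid(img, vertical_edge))
print(response.max())                          # strong response at the edge
```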



Classifying Images
Data set augmentation: training examples are copied and modified
slightly, because images can have small alterations without
changing the identity (see the sketch below).

• randomly shift, rotate, or stretch an image by a small amount,
  or randomly shift the hue of the pixels by a small amount

• CNN-based classifiers are good at ignoring patterns that aren't
  discriminative

• Context: patterns that lie off the object might be discriminative
  • e.g., a cat toy, a collar with a little bell, or a dish of cat
    food might actually help tell that we are looking at a cat
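A minimal augmentation sketch (integer pixel shifts plus a crude per-channel color jitter standing in for a hue shift; real pipelines use richer transforms):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, max_shift=2, max_jitter=0.05):
    """Return a slightly shifted, color-jittered copy of an HxWx3 image."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(image, (dy, dx), axis=(0, 1))            # small translation
    out = out + rng.uniform(-max_jitter, max_jitter, size=3)  # channel shift
    return np.clip(out, 0.0, 1.0)                          # keep valid range

img = rng.random((32, 32, 3))
copies = [augment(img) for _ in range(4)]   # four extra training examples
print(len(copies), copies[0].shape)
```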



Detecting Objects
Object detectors find multiple objects in an image, reporting what
class each object is and where each object is, by giving a
bounding box around the object.

Building an object detector (see the sketch after the list below):
• look at a small sliding window onto the larger image—a
  rectangle
• at each spot, classify what we see in the window, using a
  CNN classifier

Details:
• Decide on a window shape
• Build a classifier for windows
• Decide which windows to look at
• Choose which windows to report
• Report precise locations of objects using these windows
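A minimal sketch of the sliding-window loop (with a placeholder classifier; the window size, stride, and threshold are arbitrary):

```python
import numpy as np

def classify_window(window):
    """Placeholder for a CNN classifier: returns (class, score).
    Here mean brightness is a stand-in for a learned score."""
    score = float(window.mean())
    return ("object" if score > 0.7 else "background"), score

def detect(image, win=16, stride=8, threshold=0.7):
    """Slide a win x win window over the image; report hits as boxes."""
    detections = []
    h, w = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            label, score = classify_window(image[y:y + win, x:x + win])
            if label == "object" and score >= threshold:
                detections.append((x, y, x + win, y + win, score))
    return detections

img = np.zeros((64, 64)); img[20:40, 20:40] = 1.0   # one bright "object"
print(detect(img))   # overlapping hits; non-maximum suppression prunes them
```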

A network that finds regions likely to contain objects is called a
region proposal network (RPN):
• Faster RCNN encodes a large collection of bounding boxes as a
  map of fixed size
Detecting Objects
Construct a 3D block encoding the candidate boxes: each box is
indexed by two dimensions for its center point and one dimension
for the type of box.

Any box with a good enough objectness score is called a region of
interest (ROI).

ROIs are made to have the same number of features by sampling the
pixels to extract features, a process called ROI pooling.

A greedy algorithm decides which windows to report, called non-
maximum suppression: sort the boxes whose scores exceed a
threshold, repeatedly choose the highest-scoring box, and discard
boxes that overlap it heavily (see the sketch below).

Bounding box regression trims the window down to a proper
bounding box.
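A minimal sketch of non-maximum suppression (intersection-over-union included; the 0.5 thresholds are common but arbitrary choices):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Keep the highest-scoring boxes; drop boxes overlapping a kept one."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_thresh]
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return kept

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (80, 80, 120, 120)]
scores = [0.9, 0.85, 0.75]
print(non_max_suppression(boxes, scores))  # [0, 2]: near-duplicate dropped
```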



Detecting Objects

Faster RCNN uses two networks:

• the first network computes “objectness” scores of candidate
  image boxes, called “anchor boxes,” centered at grid
  points
• the second network is a feature stack that computes a
  representation of the image suitable for classification
3D World

Two pictures of objects in a 3D world are better than one:

• with two images of the same scene taken from different
  viewpoints, and enough knowledge about the two cameras, the 3D
  structure of the scene can be recovered

• the same holds given two views of enough points, if we know
  which point in the first view corresponds to which point in the
  second view

The key problem is to establish which point in the first view
corresponds to which in the second view.

There are two ways of getting multiple views of a scene:
• use two cameras
• move the camera



3D World

Translating a camera parallel to the image plane causes
image features to move in the camera plane. The disparity in
positions that results is a cue to depth. If we superimpose
left and right images, as in (b), we see the disparity.



3D World

The relation between disparity and depth in stereopsis. The centers of projection
of the two eyes are distance b apart, and the optical axes intersect at the fixation
point P0. The point P in the scene projects to points PL and PR in the two eyes. In
angular terms, the disparity between these is δθ (the diagram shows two angles
of δθ/2).
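Working out the geometry gives the standard relation: for baseline b and fixation distance Z, a point offset δZ in depth produces an angular disparity of roughly δθ = b·δZ/Z². A minimal numeric sketch (b, Z, and δZ are made-up values):

```python
# Angular disparity for a point at depth Z + dZ when fixating at depth Z,
# with interocular baseline b (small-angle approximation).
b, Z, dZ = 0.065, 2.0, 0.1     # meters: ~human eye spacing, fixating 2 m away

delta_theta = b * dZ / Z**2     # radians of disparity
print(delta_theta)              # ~0.0016 rad, a few arc-minutes

# Inverting the relation estimates depth change from measured disparity:
print(delta_theta * Z**2 / b)   # recovers dZ = 0.1
```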



Using Computer Vision

Understanding what people are doing

One difficulty is how to link observations of the body and the
objects nearby to the goals and intentions of the moving people.

Another difficulty is caused by time scale: what someone is doing
depends quite strongly on the time scale considered.

Often several unrelated behaviors are going on at the same time.

Learned classifiers are guaranteed to behave well only if the
training and test data come from the same distribution. For
activity data, the relationship between training and test data is
more untrustworthy, because people do so many things in so many
contexts.



Using Computer Vision

Reconstructing humans from a single image is now
practical. Each row shows a reconstruction of 3D body
shape obtained using a single image.
Using Computer Vision

The same action can look very different, and different
actions can look similar.



Using Computer Vision

What you call an action depends on the time scale. The single frame
at the top is best described as opening the fridge (you don’t gaze at
the contents when you close a fridge). But if you look at a short clip
of video (indicated by the frames in the center row), the action is
best described as getting milk from the fridge. If you look at a long
clip (the frames in the bottom row), the action is best described as
fixing a snack.
Using Computer Vision

Linking pictures and words

Tagging systems tag images with relevant words.

Tags aren't a comprehensive description of what is happening
in an image.

Captioning systems write a caption of one or more sentences
describing the image.

Current methods for captioning use detectors to find a set of
words that describe the image, and provide those words to a
sequence model that is trained to generate a sentence.

To establish whether a system has a good representation of
what is happening in an image, we can use a visual question
answering (VQA) system or a visual dialog system.



Using Computer Vision

Automated image captioning systems produce some good results
and some failures.



Using Computer Vision

Visual question-answering systems produce answers
(typically chosen from a multiple-choice set) to natural-
language questions about images.



Using Computer Vision

GAN-generated images of lung X-rays: results of a test asking
radiologists, given a pair of X-rays as seen on the left, to tell
which is the real X-ray.
Using Computer Vision
Making pictures



Using Computer Vision

Mobileye’s camera-based sensing for autonomous vehicles



Summary

• Representations of images capture edges, texture, optical
  flow, and regions.

• Convolutional neural networks produce accurate image
  classifiers that use learned features.

• Image classifiers can be turned into object detectors.

• With more than one view of a scene, it is possible to recover
  the 3D structure of the scene and the relationship between
  views.

