CVlecture 4

The document discusses computer vision techniques for object recognition, including a history of the challenges in recognition from the 1980s to present. It covers four important recognition problems (recognition, detection, segmentation, pose estimation), and describes datasets like PASCAL VOC and COCO that are used to evaluate recognition algorithms. Deep learning approaches have helped advance recognition by learning features from large amounts of labeled training data rather than relying on hand-crafted features.

Uploaded by

David B

COMPUTER VISION

Learning and higher-level computer vision

Andrew French
Today
We will start to look at some higher-level processing in computer vision
• “Higher level” = Understanding what we see in images.
• In future weeks we will look at tracking, which can also be higher-level
• To understand image content, we first need to know what is in an
image
• How to recognise objects

• Here we present a history in the lead up to deep learning…

http://cocodataset.org
PART 1
The recognition challenge
Recognition

Recognition is hard
• There is very large variation in the visual information
• Requires learning from “prior experience”
[An Invitation to 3D Vision Y. Ma, S. Soatto, J. Kosecka, S. Sastry]
Recognition

4 important recognition problems:


1. Recognition (identify the main (foreground) object in an image)
2. Detection (find the location of all objects)
3. Segmentation (assign all pixels to objects)
4. Pose (find the location of the object parts)
Detection
• Find the location of all objects in the scenes in terms of
providing a bounding box
Semantic Image Segmentation
• Process of partitioning the image into “meaningful” segments
• Group pixels based on “common” properties
• Recently: semantic image segmentation
Pose estimation
• We assume that the object class and the location of the object
in terms of a bounding box is known
• The aim is to localise the locations of the object parts
Object recognition

• Recognition: “does the image contain any instances of a
particular object class?” (cars, people, dogs, etc.)
• In the 1980s–90s: identify specific, known objects in an image
• Example: detect a particular stapler, screwdriver, etc.
Object recognition

• This problem was solved e.g. using SIFT [Lowe 2004]


• Observe that the objects (train, frog) in the image are allowed to vary in 3D
(e.g. scale, rotation) and can be partially occluded
• However, they are literally the same objects as the templates…
Object (class) recognition

• In the late 2000s/2010s: specific object recognition became object class recognition
• Large variation in object appearance (e.g. see the chairs above)
• Real world images with background clutter
• System needs to be robust to large variations in object pose, illumination and occlusion
The VOC Object Recognition Challenge
Classification Task:
• For each of the 20 object classes, predict the
presence/absence of at least one object of that class in a
test image
• Participants are required to provide a real-valued confidence
of the object’s presence for each test image so that a
precision/recall curve can be drawn

Detection Task:
• For each of the 20 classes, predict the bounding boxes of
each object of that class in a test image (if any), with
associated real-valued confidence.

Segmentation Task:
• For each test image, predict the object class of each pixel,
or “background” if the pixel does not belong to one of the 20
specified classes
The VOC Object Recognition Challenge
• A prediction (for an algorithm) is made in terms of “comparing”
a test image with a model for a particular object class.

Define
• True Positive (TP) = the algorithm makes a correct prediction
about the presence of an object in an image
• False Positive (FP) = the algorithm predicts the presence of an
object, but that object is not present in the image
• False Negative (FN) = the algorithm misses an object

(also see Evaluation tutorial later)


The VOC Object Recognition Challenge
• Assume that the algorithm is tested on N test images
• For these images we know the “Ground Truth” i.e. the
classes of all the objects in those images
• Hence, we can measure all true detections (TP), false
detections (FP), and missed detections (FN) for a
particular value of a detection threshold

For every threshold value measure


• Precision = TP/(TP+FP)
• Recall = TP/(TP+FN)

Performance Measure
• Draw Precision vs Recall Curve
The PASCAL Visual Object Classes Challenge: A Retrospective. Int J Comput Vis
(2015) 111:98–136
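The precision and recall definitions above can be sketched in a few lines of Python. This is a minimal illustration for the classification task, assuming one real-valued confidence per test image and a boolean ground-truth label; the function name `precision_recall_curve` and the toy scores are hypothetical, not part of the VOC toolkit.

```python
def precision_recall_curve(scores, labels):
    """Return (threshold, precision, recall) for every distinct score threshold.

    scores: real-valued confidences, one per test image.
    labels: ground truth, True if the class is present in that image.
    """
    curve = []
    for thresh in sorted(set(scores), reverse=True):
        # An image is predicted positive if its confidence reaches the threshold.
        predicted = [s >= thresh for s in scores]
        tp = sum(p and l for p, l in zip(predicted, labels))
        fp = sum(p and not l for p, l in zip(predicted, labels))
        fn = sum((not p) and l for p, l in zip(predicted, labels))
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        curve.append((thresh, precision, recall))
    return curve


# Toy data (made up for illustration): five test images.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [True, False, True, True, False]
for thresh, p, r in precision_recall_curve(scores, labels):
    print(f"threshold={thresh:.1f} precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold can only keep or increase recall, while precision may move either way, which is why the whole curve is reported rather than a single operating point.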
Precision-recall curves
Recognition vs Detection
• Aeroplanes

• Best achievable result:


• Detection is much harder, but much more useful!
Other challenges
• COCO – Common Objects In
Context

• 330k images
• 1,500,000 object instances
• http://cocodataset.org/
Other challenges
• ImageNet
• 14,000,000 images
• 1000 object classes (in the widely used ILSVRC challenge subset)
• http://image-net.org/
ImageNet privacy update
• A 2021 research paper studied obfuscating people’s faces in ImageNet
• Although people are rarely the focus of a category, they often appear in the background
• By annotating and blurring faces in ImageNet, the team demonstrated that accuracy falls by only 0.68%, a very small drop
• Paves the way for privacy-aware recognition
Reminder: Talk about papers from tutorial! ☺
An aside: How to collect the data?
E.g.
https://www.zooniverse.org/projects/meredithspalmer/snapshot-mountain-zebra/classify
Other recognition problems
• Detect fine-grained facial attributes
• Very fine representation of face: e.g. Bald, curly hair,
glasses, moustache, makeup, etc.
PART 2
Learning the recognition models
Object (Class) Recognition
Aim
• Build a model for recognising a specific object class
e.g. aeroplanes.
We need 3 things:
• Data:
• Images containing objects from that class and
images from all other classes
• Feature extraction:
• We will not work with pixels but with features
extracted from them
• Machine Learning:
• From the features extracted, we will learn a model
that recognises this particular object class
Data
We can use PASCAL VOC for this (20 classes)
• Each object is cropped out and rescaled to a fixed
resolution
• $I_a$, $a = 1, \dots, A$: images containing objects from that class
• $I_d$, $d = 1, \dots, D$: images from all other classes
Feature extraction

• Pixel intensities are not good features as they vary a lot
depending on illumination and viewpoint
• Plus there are millions of pixels!
• Replace pixels with features extracted from them
• For all images compute $f_a = f(I_a) \in \mathbb{R}^D$ and $f_d = f(I_d) \in \mathbb{R}^D$,
i.e. every image becomes a fixed-length feature vector
• $f(\cdot)$ is a function for computing features
• This could be, e.g., HOG features (we’ll see an example
next, as these have been introduced earlier in the course),
but it could be many different kinds of features.
HOG Features
• Divide image into a grid of cells (e.g. 8x8)
• Compute edges and their orientation for every
pixel location
• Compute histogram of gradient orientations in
each cell
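The three HOG steps above can be sketched as follows. This is a simplified illustration, not the full HOG descriptor: it assumes cells of 8×8 pixels, 9 unsigned orientation bins and magnitude-weighted votes, and it leaves out the block normalisation used in practice. The function name `hog_cells` is hypothetical.

```python
import math


def hog_cells(image, cell=8, bins=9):
    """Histogram of unsigned gradient orientations for each cell of a grayscale image.

    image: list of rows of numbers. Returns {(cell_row, cell_col): histogram}.
    """
    h, w = len(image), len(image[0])
    hists = {}
    for y in range(h):
        for x in range(w):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(angle / (180.0 / bins)), bins - 1)
            # Each pixel votes into its cell's histogram, weighted by edge strength.
            hist = hists.setdefault((y // cell, x // cell), [0.0] * bins)
            hist[b] += magnitude
    return hists


# Toy 16x16 image with a vertical edge down the middle (dark left, bright right).
image = [[0.0] * 8 + [1.0] * 8 for _ in range(16)]
for cell_index, hist in sorted(hog_cells(image).items()):
    print(cell_index, hist)
```

For this toy image all gradient energy falls into the first orientation bin, because the only edge is vertical and so every non-zero gradient points horizontally.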
Inverting Features
Inverse HOG:
• From HOG features try to reconstruct the image
• The amazingly-titled “HOGgles” ☺ (HOG Goggles)
• http://www.cs.columbia.edu/~vondrick/ihog/
• https://ieeexplore.ieee.org/document/6751109

More generally:
• https://www.robots.ox.ac.uk/~vedaldi/assets/pubs/mahendran15understanding.pdf
(a 2015 paper about inverting features, inc. CNNs)
Bags of Features
• Some features are obviously good
representations of some objects e.g.
HoGs and people
• Sometimes it’s not clear what
features should be used
• Bag of Features methods analyse
the large set of very specific features
generated by a training set of images
and identify a small set of useful,
more generic features
Origin 1: Bag-of-words models
• Orderless document representation: frequencies of words from a dictionary
• Repetition of words suggests importance?

US Presidential Speeches Tag Cloud


Origin 2: Texture recognition
Texture is characterized by the repetition of basic elements or textons
• For stochastic textures (sand, dirt etc), it is the identity of the textons, not their
spatial arrangement, that matters
[Figure: textures represented as histograms over a universal texton dictionary]


Bags of features for object recognition
• First, take a bunch of images
• Extract features, and build up a “dictionary” or “visual vocabulary”
– a list of common features
• Given a new image, extract features
• For each feature, find the closest visual word in the dictionary
• Build a histogram to represent the image
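The pipeline above can be sketched in plain Python. This is a minimal illustration, assuming the features are short numeric vectors, the vocabulary is learned with k-means, and the cluster centres are initialised naively from the first k features; the function names and the 2D toy data are hypothetical.

```python
import math


def nearest(feature, words):
    """Index of the visual word closest to a feature (Euclidean distance)."""
    return min(range(len(words)), key=lambda k: math.dist(feature, words[k]))


def build_vocabulary(features, k, iterations=10):
    """Plain k-means: the cluster centres become the visual words."""
    words = [list(f) for f in features[:k]]  # naive initialisation (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[nearest(f, words)].append(f)
        for i, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster ends up empty
                words[i] = [sum(col) / len(members) for col in zip(*members)]
    return words


def bag_of_features(features, words):
    """Histogram counting how often each visual word is the nearest match."""
    hist = [0] * len(words)
    for f in features:
        hist[nearest(f, words)] += 1
    return hist


# Toy 2D "features" pooled from a training set, forming two obvious groups.
training = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
words = build_vocabulary(training, k=2)

# Features extracted from a new image are mapped to a histogram over the vocabulary.
new_image_features = [(0.0, 0.0), (5.0, 5.0), (5.1, 5.1)]
print(bag_of_features(new_image_features, words))
```

The histogram discards where in the image each feature occurred, which is the "orderless" property inherited from bag-of-words document models.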

face, flowers, building


Learning the visual vocabulary
Visual vocabulary

Clustering

Slide credit: Josef Sivic


Viola-Jones Recognition
• Developed for face detection, but the approach is general
• Basic idea: slide a window across image and
evaluate a face model at every location
• Sliding window detector must evaluate tens of thousands of
location/scale combinations
• Faces are rare: 0–10 per image
• Key ideas
• Integral images for fast feature
evaluation
• Boosting for feature selection
• Attentional cascade for fast rejection
of non-face windows
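To get a feel for why the sliding-window search is expensive, the enumeration of location/scale combinations can be sketched as below. The base window of 24 pixels follows the sub-window size used later in these slides; the scale step of 1.25, the stride of 4 pixels and the function name `sliding_windows` are assumptions made only for this illustration.

```python
def sliding_windows(width, height, base=24, scale_step=1.25, stride=4):
    """Yield (x, y, size) for every square window position at every scale."""
    size = base
    while size <= min(width, height):
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                yield x, y, size
        # Grow the window for the next scale.
        size = int(round(size * scale_step))


# Count how many windows a detector would have to evaluate on a 320x240 image.
n_windows = sum(1 for _ in sliding_windows(320, 240))
print(n_windows)
```

A smaller stride or a finer scale step increases the count quickly, which is what motivates the three key ideas listed above.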
Features
• Four basic types
• Easy to calculate
• The sum of the pixels in the white areas is subtracted from the sum of the pixels in the black areas
• A novel representation - the integral image - makes feature
extraction faster, and allows consideration of more features.
Integral Images
• The integral image computes a value at each pixel
(x,y) that is the sum of the pixel values above and to
the left of (x,y), inclusive
• This can quickly be computed in one pass through the image

• Cumulative row sum:
$s(x, y) = s(x-1, y) + i(x, y)$, with $s(-1, y) = 0$
• Integral image:
$ii(x, y) = ii(x, y-1) + s(x, y)$, with $ii(x, -1) = 0$
• Here $i(x, y)$ is the original pixel value at $(x, y)$
Integral Images
• Pixel values can be summed over arbitrary rectangles
quickly
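Both steps can be sketched in Python: building the integral image in one pass with the two recurrences, then summing any rectangle with four lookups. This is a minimal illustration using nested lists; the function names `integral_image` and `rect_sum` are hypothetical.

```python
def integral_image(img):
    """ii[y][x] = sum of img over all pixels above and to the left of (x, y), inclusive."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        s = 0  # cumulative row sum s(x, y) = s(x-1, y) + i(x, y)
        for x in range(w):
            s += img[y][x]
            # ii(x, y) = ii(x, y-1) + s(x, y), with ii(x, -1) = 0
            ii[y][x] = s + (ii[y - 1][x] if y > 0 else 0)
    return ii


def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the inclusive rectangle [x0..x1] x [y0..y1] using four lookups."""
    def at(x, y):
        return ii[y][x] if x >= 0 and y >= 0 else 0
    return at(x1, y1) - at(x0 - 1, y1) - at(x1, y0 - 1) + at(x0 - 1, y0 - 1)


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # bottom-right 2x2 block: 5 + 6 + 8 + 9
```

The cost of `rect_sum` does not depend on the rectangle's size, which is why rectangle features can be evaluated at any scale in constant time.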
Feature Extraction
• Features are extracted from sub windows of a sample image.
• The base size for a sub window is 24 by 24 pixels.
• Each of the four feature types is scaled and shifted across all possible
combinations
• In a 24 pixel by 24 pixel sub window there are ~160,000 possible features to be calculated.
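The size of the feature pool can be checked by brute-force counting. The exact total depends on which shapes and scalings are enumerated, so the sketch below is only one possible convention: it assumes five base shapes (the three-rectangle type counted in both orientations) scaled by integer factors in each direction. Under that assumption the total is 162,336, consistent with the ~160,000 quoted above. The shape names are illustrative.

```python
def count_placements(window, base_w, base_h):
    """Number of ways to place a base shape, scaled by integer factors, in a square window."""
    total = 0
    for w in range(base_w, window + 1, base_w):        # every scaled width
        for h in range(base_h, window + 1, base_h):    # every scaled height
            total += (window - w + 1) * (window - h + 1)  # every top-left position
    return total


# Assumed base shapes as (width, height) in pixels; names are hypothetical.
BASE_SHAPES = {
    "two_rect_horizontal": (2, 1),
    "two_rect_vertical": (1, 2),
    "three_rect_horizontal": (3, 1),
    "three_rect_vertical": (1, 3),
    "four_rect": (2, 2),
}

counts = {name: count_placements(24, w, h) for name, (w, h) in BASE_SHAPES.items()}
total = sum(counts.values())
print(counts)
print(total)
```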
Feature Selection
• Faces are complex and variable – we need a lot of features to capture
all possible examples
• We can’t possibly use all 160,000

• Can we create a good classifier using just a small subset of all possible features?

• How to select such a subset?

• Boosting is a classification scheme that works by combining weak
learners into a more accurate ensemble classifier
• A weak learner need only do better than chance
• Training consists of multiple boosting rounds


Boosting
• Need a training set of labelled (object/non-object) examples
• Start with all examples equally weighted
• Learn a series of recognition rules (classifiers)
• Re-weight examples so that an incorrect recognition by the nth classifier makes
that example more important to the (n+1)th
Boosting

• No single rule/classifier can separate complex objects from complex
backgrounds: but a combination can
Boosting

• Weights are determined automatically
• Details of the weight learning algorithm are beyond the scope of
this module
Viola-Jones Version
• Weak classifiers threshold a single feature $x_i$:

$$o(x_i) = \begin{cases} 1 & \text{if } x_i > \text{thresh} \\ -1 & \text{otherwise} \end{cases}$$

• At each stage of boosting
• Given re-weighted data from previous stage
• Train all K (160,000) single-feature classifiers
• Select the single best classifier at this stage
• Combine it with the other previously selected classifiers
• Re-weight the data
• Learn all K classifiers again, select the best, combine, reweight
• Repeat until you have T classifiers selected
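The selection loop above can be sketched with a small boosting implementation. The slides leave the weight update out of scope, so this sketch assumes the standard discrete AdaBoost formulas for the classifier weight and the example re-weighting; it also adds a polarity flag so a weak classifier can fire on either side of its threshold. The function names and toy data are hypothetical, and a real detector would search ~160,000 features rather than two.

```python
import math


def train_adaboost(X, y, rounds):
    """X: list of feature vectors, y: labels in {+1, -1}.

    Returns a list of (alpha, feature_index, threshold, polarity) tuples.
    """
    n = len(X)
    weights = [1.0 / n] * n  # start with all examples equally weighted
    ensemble = []
    for _ in range(rounds):
        best = None
        # Train every single-feature threshold classifier and keep the best one.
        for j in range(len(X[0])):
            for thresh in sorted({x[j] for x in X}):
                for polarity in (1, -1):
                    preds = [polarity if x[j] > thresh else -polarity for x in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, j, thresh, polarity, preds)
        err, j, thresh, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak classifier
        ensemble.append((alpha, j, thresh, polarity))
        # Re-weight: misclassified examples matter more in the next round.
        weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote of all selected weak classifiers."""
    score = sum(a * (pol if x[j] > th else -pol) for a, j, th, pol in ensemble)
    return 1 if score > 0 else -1


# Toy data: only feature 0 is informative; feature 1 is noise.
X = [[0, 5], [1, 3], [2, 8], [3, 1]]
y = [-1, -1, 1, 1]
ensemble = train_adaboost(X, y, rounds=2)
print([predict(ensemble, x) for x in X])
```

Because each round picks exactly one feature, running T rounds doubles as feature selection: the final classifier only ever needs the T selected features to be computed.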
Cascading Classifiers
• A 200-feature ensemble classifier can achieve 95% correct results
• Not good enough
• Learn simple (few feature) classifiers that can reject obviously non-face regions
• Focus effort (classifiers with more features) on harder regions
• Days to train, but very fast once built

Example two-feature stage 1 classifier.


Aim is to minimize false negatives.
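The early-rejection idea can be sketched as a loop over stages. This is a minimal illustration, assuming each stage is a scoring function with a threshold and that stages are ordered from cheapest to most expensive; the toy scoring functions (contrast and mean brightness) are hypothetical stand-ins for boosted classifiers.

```python
def cascade_classify(window, stages):
    """stages: list of (score_fn, threshold), ordered from cheapest to most expensive.

    A window is rejected as soon as any stage scores it below that stage's threshold.
    Returns (accepted, number_of_stages_evaluated).
    """
    for evaluated, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window) < threshold:
            return False, evaluated  # rejected early, later stages never run
    return True, len(stages)


# Hypothetical two-stage cascade over a flat list of pixel values.
stages = [
    (lambda w: max(w) - min(w), 10),   # stage 1: cheap contrast check
    (lambda w: sum(w) / len(w), 50),   # stage 2: stand-in for a larger classifier
]

print(cascade_classify([5, 5, 5, 5], stages))        # flat patch, rejected at stage 1
print(cascade_classify([0, 20, 10, 5], stages))      # passes stage 1, rejected at stage 2
print(cascade_classify([0, 200, 100, 40], stages))   # passes every stage
```

Since most windows in an image are not faces, most of them exit at the first stage, so the average cost per window stays close to the cost of the cheapest classifier. Early stages must therefore keep false negatives low, because a rejected window is never reconsidered.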
Learned Features are Task-Specific
Face Detection
Learned Features are Task-Specific
Profile Detection
Results
Classical learning in Vision
The classic approach, then, applies learned operations to user-defined features

1. Design/choose features (HoG, LBP, histograms, etc.)
2. Design/choose a classifier (e.g. SVM)
3. Train the classifier
[Figure: a feature vector extracted from an image is passed to an SVM, which outputs “Not a root tip”]
Classical learning in Vision
• Designing features can become a trial and error process

[Figure: examples labelled “Root tip” and “Root crossover”]

• Learning will fail if the user limits it to the wrong features

• Some approaches try to reduce reliance on the user
• Bag of Words clusters the results of applying the user-defined set of feature-detection
operators to form a more generic visual vocabulary
• Viola-Jones selects from a much larger set of user-defined features
Deep Learning – the future?
• Deep learning does not use any pre-computed features
• Feature detection and classification are integrated
• Deep methods learn:
1. Which features are needed to make classification possible
2. How to do the classification given those features

[Figure: Input image → Deep network → “Not a root tip”]

Next time….!
Summary
• In previous lectures we have looked at segmentation and pixel-level information
• What does it mean to have higher level understanding of images?
• We looked at Recognition problems:
• Recognising the main object in an image
• Detecting all instances of an object
• Segmenting all pixels within an object (semantic segmentation)
• Pose: locating all components of an object
Note: although we talk about objects we really mean classes
E.g. happy faces versus sad faces, mountain bikes versus racing bikes?
