CV Lecture 4
Andrew French
Today
We will start to look at some higher-level processing in computer vision
• “Higher level” = Understanding what we see in images.
• In future weeks we will look at tracking, which can also be higher-level
• To understand image content, we first need to know what is in an
image
• How to recognise objects
http://cocodataset.org
PART 1
The recognition challenge
Recognition
Recognition is hard
• There is very large variation in the visual information
• Requires learning from “prior experience”
[An Invitation to 3D Vision Y. Ma, S. Soatto, J. Kosecka, S. Sastry]
Recognition
• In the late 2000s/2010s: specific object recognition became object class recognition
• Large variation in object appearance (e.g. see the chairs above)
• Real world images with background clutter
• The system needs to be robust to large variation in object pose, illumination and occlusion
The VOC Object Recognition Challenge
Classification Task:
• For each of the 20 object classes, predict the
presence/absence of at least one object of that class in a
test image
• Participants are required to provide a real-valued confidence
of the object’s presence for each test image so that a
precision/recall curve can be drawn
Detection Task:
• For each of the 20 classes, predict the bounding boxes of
each object of that class in a test image (if any), with
associated real-valued confidence.
Segmentation Task:
• For each test image, predict the object class of each pixel,
or “background” if the pixel does not belong to one of the 20
specified classes
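For the detection task, each predicted bounding box has to be compared with a ground-truth box. A common overlap measure, and the one PASCAL VOC uses with a threshold of 0.5, is intersection over union (IoU). A minimal sketch, assuming boxes are given as (x_min, y_min, x_max, y_max) in continuous coordinates; the function name is illustrative only:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes that share a 1x1 corner: IoU = 1 / (4 + 4 - 1).
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
is_match = overlap > 0.5  # would not count as a correct detection here
```

Under this criterion a detection only counts as a true positive when it overlaps the ground-truth box by more than half of their combined area.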
The VOC Object Recognition Challenge
• A prediction (for an algorithm) is made in terms of “comparing”
a test image with a model for a particular object class.
Define
• True Positive (TP) = the algorithm makes a correct prediction
about the presence of an object in an image
• False Positive (FP) = the algorithm predicts the presence of an
object, but that object is not present in the image
• False Negative (FN) = the algorithm misses an object
Performance Measure
• Draw Precision vs Recall Curve
The PASCAL Visual Object Classes Challenge: A Retrospective. Int J Comput Vis
(2015) 111:98–136
The VOC Object Recognition Challenge
• Assume that the algorithm is tested on N test images
• For these images we know the “Ground Truth” i.e. the
classes of all the objects in those images
• Hence, we can measure all true detections (TP), false
detections (FP), and missed detections (FN) for a
particular value of a detection threshold
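Sweeping the detection threshold gives one (precision, recall) point per threshold, which is how the curve is drawn. The sketch below uses the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN); the function name and the toy scores are assumptions for illustration, not part of the VOC toolkit:

```python
def precision_recall_curve(scores, labels):
    """Return (threshold, precision, recall) for each distinct score.

    scores: real-valued confidence per test image.
    labels: ground truth, True if the class is present in that image.
    """
    n_positive = sum(1 for y in labels if y)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted "present".
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fn = n_positive - tp
        precision = tp / (tp + fp)  # tp + fp >= 1 because threshold is a score
        recall = tp / (tp + fn) if n_positive else 0.0
        points.append((threshold, precision, recall))
    return points


scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [True, False, True, True, False]
curve = precision_recall_curve(scores, labels)
```

Lowering the threshold can only keep or raise recall, while precision may move in either direction; that trade-off is what the curve shows.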
Precision-recall curves
Recognition vs Detection
• Aeroplanes
• 330k images
• 1,500,000 object instances
• http://cocodataset.org/
Other challenges
• Image-Net
• 14,000,000 images
• 1000 object classes
• http://image-net.org/
ImageNet privacy update
• A 2021 research paper on
obfuscating people’s faces in
ImageNet
• Although people aren’t often the
focus of a category, they often
appear in the background
• By annotating and blurring faces in
ImageNet, the team demonstrates
that accuracy falls by only 0.68%,
a very small drop
• Paves the way for privacy-aware
recognition
Reminder: Talk about papers from tutorial! ☺
An aside: How to collect the data?
E.g.
https://www.zooniverse.org/projects/meredithspalmer/snapshot-mountain-zebra/classify
Other recognition problems
• Detect fine-grained facial attributes
• Very fine representation of face: e.g. Bald, curly hair,
glasses, moustache, makeup, etc.
PART 2
Learning the recognition models
Object (Class) Recognition
Aim
• Build a model for recognising a specific object class
e.g. aeroplanes.
We need 3 things:
• Data:
• Images containing objects from that class and
images from all other classes
• Feature extraction:
• We will not work with pixels but with features
extracted from them
• Machine Learning:
• From the features extracted, we will learn a model
that recognises this particular object class
Data
We can use PASCAL VOC for this (20 classes)
• Each object is cropped out and rescaled to a fixed
resolution
• I_a, a = 1, …, A: images containing objects from that class
• I_d, d = 1, …, D: images from all other classes
Feature extraction
http://www.cs.columbia.edu/~vondrick/ihog/
https://ieeexplore.ieee.org/document/6751109
https://www.robots.ox.ac.uk/~vedaldi/assets/pubs/mahendran15understanding.pdf (a 2015 paper about inverting features, inc. CNNs)
Bags of Features
• Some features are obviously good
representations of some objects e.g.
HoGs and people
• Sometimes it’s not clear what
features should be used
• Bag of Features methods analyse
the large set of very specific features
generated by a training set of images
and identify a small set of useful,
more generic features
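One common way to find the small set of generic features is to cluster the training descriptors and treat each cluster centre as a "visual word"; an image is then described by how many of its descriptors fall into each cluster. The sketch below assumes k-means as the clustering method and 2-D toy descriptors (real descriptors such as HoG cells would have many more dimensions). All function names are hypothetical:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centres):
    """Index of the cluster centre closest to the given descriptor."""
    return min(range(len(centres)), key=lambda j: sq_dist(point, centres[j]))


def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic start (evenly spaced samples)."""
    step = len(points) // k
    centres = [list(points[i * step]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centres)].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centre if a cluster ends up empty
                centres[j] = [sum(c) / len(group) for c in zip(*group)]
    return centres


def bag_of_features(descriptors, centres):
    """Histogram of visual-word counts for one image."""
    hist = [0] * len(centres)
    for d in descriptors:
        hist[nearest(d, centres)] += 1
    return hist


# Toy training descriptors forming two obvious groups.
training = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
vocabulary = kmeans(training, k=2)

# Descriptors from a new image: one near the first group, three near the second.
image_descriptors = [(0.5, 0.5), (9, 9), (10, 10), (11, 11)]
histogram = bag_of_features(image_descriptors, vocabulary)
```

The resulting histogram has a fixed length regardless of how many descriptors the image produced, which is what makes it convenient as input to a classifier.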
Origin 1: Bag-of-words models
• Orderless document representation: frequencies of words from a dictionary
• Repetition of words suggests importance?
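The document version of the idea is easy to sketch: ignore word order and count how often each dictionary word occurs. The dictionary and the sentence below are made up for illustration:

```python
from collections import Counter


def bag_of_words(text, dictionary):
    """Orderless representation: frequency of each dictionary word in the text."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    counts = Counter(words)
    return [counts[word] for word in dictionary]


dictionary = ["image", "object", "recognition", "tracking"]
text = "Object recognition assigns an object class to an image. Recognition is hard."
vector = bag_of_words(text, dictionary)
```

Two documents with the same words in a different order get the same vector, which is exactly the "orderless" property the visual version inherits.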
Clustering
Integral Images
[Figure: integral image diagram, with labels i(x, y) and s(x-1, y)]
• Pixel values can be summed over arbitrary rectangles
quickly
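The reason rectangle sums are fast is that, once the integral image is built, any rectangle needs only four lookups. A minimal sketch, assuming the image is a list of rows and using an integral image padded with one extra row and column of zeros; the function names are illustrative:

```python
def integral_image(img):
    """ii[y][x] holds the sum of all pixels above and to the left of (x, y)."""
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0  # running sum along the current row
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels with x0 <= x < x1 and y0 <= y < y1, using four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(img)
whole = rect_sum(ii, 0, 0, 3, 3)         # all nine pixels
bottom_right = rect_sum(ii, 1, 1, 3, 3)  # 5 + 6 + 8 + 9
```

Building the table costs one pass over the image; after that the cost of a rectangle sum does not depend on the rectangle's size.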
Feature Extraction
• Features are extracted from sub windows of a sample image.
• The base size for a sub window is 24 by 24 pixels.
• Each of the four feature types is scaled and shifted across all possible
positions in the sub window
• In a 24 pixel by 24 pixel sub window there are ~160,000 possible features to be calculated.
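The size of that number can be checked by brute-force counting: for each rectangle-feature shape, try every scale that fits in the window and every position at that scale. The sketch below assumes five base shapes (two-rectangle and three-rectangle features in both orientations, plus a four-rectangle feature), given as (width, height) in pixels at the smallest scale; this exact set of shapes is an assumption and is not stated on the slide:

```python
def count_features(window=24, base_shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count every scaled and shifted placement of each base shape in the window."""
    total = 0
    for base_w, base_h in base_shapes:
        # Widths and heights grow in whole multiples of the base shape.
        for w in range(base_w, window + 1, base_w):
            for h in range(base_h, window + 1, base_h):
                # Number of positions where a w-by-h feature fits in the window.
                total += (window - w + 1) * (window - h + 1)
    return total


n_features = count_features()
```

With these assumptions the count is 162,336, which matches the figure of roughly 160,000 and makes clear why evaluating every feature for every sub window is not practical.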
Feature Selection
• Faces are complex and variable – we need a lot of features to capture
all possible examples
• We can’t possibly use all 160,000
• Can we create a good classifier using just a small subset of all possible features?
Classical learning in Vision
1. Design/choose features (HoG, LBP, histograms, etc.)
2. Design/choose a classifier
3. Train the classifier
• Designing features can become a trial and error process
[Figure: an input image is converted to a feature vector, which an SVM classifies, here as “Not a root tip”]
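The three steps of the classical pipeline can be sketched end to end on toy data. Here the hand-designed feature is a normalised grey-level histogram, and a simple perceptron stands in for the SVM purely to keep the example dependency-free; the images, labels and function names are all made up for illustration:

```python
def histogram_feature(image, bins=4, max_value=256):
    """Step 1: hand-designed feature, a normalised grey-level histogram."""
    hist = [0] * bins
    for row in image:
        for value in row:
            hist[value * bins // max_value] += 1
    n_pixels = sum(hist)
    return [count / n_pixels for count in hist]


def train_perceptron(features, labels, epochs=50):
    """Steps 2 and 3: choose a linear classifier and train it (labels are +1 / -1)."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified: nudge the boundary towards x
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def predict(weights, bias, x):
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1


# Toy training set: dark images are the negative class, bright ones the positive.
dark = [[[10, 20], [30, 40]], [[0, 5], [50, 60]]]
bright = [[[200, 210], [220, 230]], [[250, 255], [195, 240]]]
train_x = [histogram_feature(im) for im in dark + bright]
train_y = [-1, -1, 1, 1]

weights, bias = train_perceptron(train_x, train_y)
label = predict(weights, bias, histogram_feature([[15, 25], [35, 45]]))
```

If the histogram did not separate the two classes, the fix would be to go back to step 1 and try a different feature, which is the trial and error the slide refers to.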
Next time….!
Summary
• In previous lectures we have looked at segmentation and pixel-level information
• What does it mean to have higher level understanding of images?
• We looked at Recognition problems:
• Recognising the main object in an image
• Detecting all instances of an object
• Segmenting all pixels within an object (semantic segmentation)
• Pose: locating all components of an object
Note: although we talk about objects we really mean classes
E.g. happy faces versus sad faces, mountain bikes versus racing bikes?