CVTutorial 2

This document discusses various methods for evaluating computer vision algorithms. It begins by looking at the history of computer vision evaluation and how it has evolved from early demonstrations on single images to more rigorous testing on real datasets. Common evaluation approaches are then examined for tasks like stereo, recognition, and tracking. Key aspects covered include the use of ground truth data, metrics like precision and recall, and potential issues to watch out for such as class imbalance and the quality of human-generated annotations. Overall evaluation is presented as an important part of advancing computer vision research and comparing different algorithms.


COMPUTER VISION

Evaluating Computer Vision Methods

Andrew French
Today
• Today we’ll look at different ways you can evaluate computer vision algorithms:
• The history
• Common evaluation approaches, and why we do them
• We'll look briefly at evaluating:
• Stereo
• Recognition
• Tracking
• Some evaluation “gotchas” – things to watch out for
Performance Evaluation
• Computer Vision is (only) 50 years old
• Early papers demonstrated output from programs applied to one or two images
• “Look Ma, no hands”
• As the field developed
• theories/algorithms began to compete
• research began to target real images and problems
• Evaluation became more important
• when & how well does method A work?
• is method B better than method C?
Datasets and Challenges abound
What are we evaluating?
• Measurement of real world properties
• Depth via stereo
• Velocity via optic flow, tracking
• Numerical values
• etc.

• Segmentation/detection/classification
• Success vs failure in images

The boundary is not always clear-cut
• Is a tracker detecting the target in the image, or measuring its real-world velocity?
• May need to combine evaluation methods
Ground Truth & Test Data: how to get it
1. Alternative/Competing Sensors
• e.g. depth from stereo vs. depth from a laser scanner
• Valid if alternatives perform well given similar environments/tasks
• More common with measurement than classification methods

2. Artificial Images
• E.g. Blender?
• Perfect GT, BUT it’s not real, won’t contain the outlier events and event combinations that occur in genuine data
• Confidence in evaluation = confidence in simulation
• The problem has only moved
• Do we trust the simulation?
• Is it really like the real world?
Ground Truth & Test Data
3. Real Images
• Evaluation addresses the real problem
• Most journals require real data
• Datasets are now being shared, so comparison with other methods is possible
BUT
• Obtaining ground truth can be difficult
• Automatic methods may have errors
• Manual methods are slow, subjective and also error prone
• Is your algorithm wrong, or your ground truth?
• What if standard sets don’t have the properties you want to evaluate on?
Evaluating Stereo
• The Middlebury Stereo Dataset & Evaluation
• Started in 2001, 30+ scenes, varying illumination, resolution, etc.
• Ground truth disparity maps produced by calibrating cameras and laser scanner; laser-measured depth converted to disparity
• 100+ algorithms tested, various image distance measures used to compare disparity maps
• Online database of results
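A minimal Python sketch of one common way to compare an estimated disparity map against ground truth, the percentage of “bad” pixels whose error exceeds a threshold (the function name, threshold, and mask handling here are illustrative assumptions, not the exact Middlebury scoring code):

```python
import numpy as np

def bad_pixel_percentage(disp_est, disp_gt, valid_mask, threshold=2.0):
    """Percentage of valid pixels whose disparity error exceeds a threshold.

    disp_est, disp_gt: 2-D arrays of disparities (same shape).
    valid_mask: boolean array marking pixels with usable ground truth.
    threshold: allowed absolute disparity error in pixels (e.g. 0.5, 1, 2).
    """
    err = np.abs(disp_est - disp_gt)
    bad = (err > threshold) & valid_mask
    return 100.0 * bad.sum() / valid_mask.sum()
```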
Evaluating Stereo: Real Images
• Middlebury
• Structured light used to provide unambiguous feature matching
• Ground truth pixel correspondences

High-Accuracy Stereo Depth Maps Using Structured Light. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), volume 1, pages 195–202, Madison, WI, June 2003.
High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth. German Conference on Pattern Recognition (GCPR 2014), Münster, Germany, September 2014.
Evaluating Recognition
We want to:
• Compare the true class/identity of a particular image/object with that predicted by the algorithm
• Define
• True Positive (TP) = the algorithm makes a correct prediction about the presence of an object in an image
• False Positive (FP) = the algorithm predicts the presence of an object, but that object is not present in the image
• False Negative (FN) = the algorithm misses an object
• Then
• Precision = TP/(TP+FP) // fraction of positive responses that were correct
• Recall = TP/(TP+FN) // fraction of actual positives that were identified
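A minimal Python sketch of these definitions (the counts in the usage example are made up for illustration):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: 80 correct detections, 20 false alarms, 10 missed objects
print(precision_recall(80, 20, 10))  # (0.8, 0.888...)
```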
Evaluating Recognition
• Tables of precision, recall, etc. statistics are hard to interpret: precision-recall curves are a valuable visualisation tool (closely related to Receiver Operating Characteristic, ROC, curves, which plot true-positive rate against false-positive rate)
• Plot of precision against recall as some parameter is varied
• The parameter is the threshold used to decide if the model and image are similar enough to be considered a match
• Increasing the threshold imposes a tighter requirement on matching
• This reduces FP, but increases FN
• So Precision goes up, but Recall goes down

We want both to be high
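A minimal Python sketch of how such a curve is traced by sweeping the matching threshold (the score/label arrays and function name are hypothetical, not tied to any particular detector):

```python
import numpy as np

def precision_recall_curve(scores, labels, thresholds):
    """Trace precision and recall as the acceptance threshold is varied.

    scores: detector/matcher confidence per candidate (higher = more confident).
    labels: 1 if the candidate is a true match, 0 otherwise.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    curve = []
    for t in thresholds:
        accepted = scores >= t
        tp = np.sum(accepted & (labels == 1))
        fp = np.sum(accepted & (labels == 0))
        fn = np.sum(~accepted & (labels == 1))
        prec = tp / (tp + fp) if (tp + fp) else 1.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        curve.append((t, prec, rec))
    return curve
```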


Evaluating Recognition
• Sometimes we are interested in whether a result is in the top X of the returned results, rather than whether the single best result is correct

• For example, if the classification results for different classes are:

Class      Score
Car        0.60
Bus        0.08
Van        0.30
Motorbike  0.02

• Top-1 result is Car (the single best)
• Top-2 results are Car, Van

So if the actual image is of a Van, we get a Top-2 result, but not Top-1.
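A minimal Python sketch of Top-k evaluation using the scores above (the helper name is illustrative):

```python
def top_k_classes(scores, k):
    """Return the k class labels with the highest scores."""
    return [c for c, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]]

scores = {"Car": 0.60, "Bus": 0.08, "Van": 0.30, "Motorbike": 0.02}
print(top_k_classes(scores, 1))           # ['Car']
print(top_k_classes(scores, 2))           # ['Car', 'Van']
print("Van" in top_k_classes(scores, 2))  # True: a Top-2 hit, but not Top-1
```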
Evaluating Recognition
• For classification we can look at a confusion matrix, which shows which category images are confused with which others
• We did this in the lab, e.g. for digit classification:

Which digit images are most confused with which others
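A minimal Python sketch of building a confusion matrix (the digit labels below are a toy example, not the lab data):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples whose true class is i and predicted class is j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy digit example: one image of a 3 is confused with a 5
y_true = [3, 3, 5, 5, 8]
y_pred = [3, 5, 5, 5, 8]
print(confusion_matrix(y_true, y_pred, n_classes=10))
```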


Evaluating Tracking
• Ground truth is a set of manually-drawn bounding boxes
• Elements of measurement and recognition
• Accuracy – is the target where the tracker says it is?
• Robustness – is the tracker associated with the target?

Precision plot: measures the percentage of frames whose estimated location is within a given threshold distance of the ground truth.
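A minimal Python sketch of that computation, assuming per-frame predicted and ground-truth target centres (names and thresholds are illustrative):

```python
import numpy as np

def precision_plot(pred_centres, gt_centres, thresholds):
    """For each distance threshold, the percentage of frames whose predicted
    target centre lies within that distance of the ground-truth centre."""
    pred = np.asarray(pred_centres, dtype=float)  # shape (n_frames, 2)
    gt = np.asarray(gt_centres, dtype=float)
    dists = np.linalg.norm(pred - gt, axis=1)
    return [(t, 100.0 * np.mean(dists <= t)) for t in thresholds]
```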
Bounding boxes
• Error measure is the overlap between the ground truth box and the predicted box
• In tracking: the Success plot measures the percentage of frames for which the overlap (intersection) of the predicted and ground truth bounding boxes, divided by their union, exceeds a threshold that varies from 0 to 1
• In instance segmentation, we could look for the closest detected bounding box and count it as a “hit” if the overlap is high enough

|A ∩ B| / |A ∪ B| = Jaccard Index (“intersection over union”)
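A minimal Python sketch of intersection over union for two axis-aligned boxes, assuming an (x_min, y_min, x_max, y_max) convention (conventions vary between datasets):

```python
def iou(box_a, box_b):
    """Jaccard index (intersection over union) of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # width of the intersection
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # height of the intersection
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...: half of each box overlaps
```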
Related: segmentation accuracy
• Jaccard is closely related to the F1 score, a.k.a. the Dice coefficient

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)
or
Dice = 2J / (1 + J)

• Matlab can compute this given two binary images (test and ground truth)
https://uk.mathworks.com/help/images/ref/dice.html
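A minimal Python/NumPy equivalent of that computation, assuming binary masks as inputs (a sketch, not the MATLAB implementation itself):

```python
import numpy as np

def dice_coefficient(mask_test, mask_gt):
    """Dice / F1 score of two binary masks: 2|A∩B| / (|A| + |B|)."""
    a = np.asarray(mask_test, dtype=bool)
    b = np.asarray(mask_gt, dtype=bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def dice_from_jaccard(j):
    """Equivalent conversion from the Jaccard index: Dice = 2J / (1 + J)."""
    return 2.0 * j / (1.0 + j)
```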
Quantitative versus Qualitative evaluation
• i.e. numerical measures versus presenting output pictures

Both are important!

The best approach is to present numerical data, then examine some good and bad images to try to understand why the numerical results are the way they are.
For example:
• Classification: see which images are in the wrong categories
• Bounding boxes: what is causing a low score? Poor position or amount of overlap?
• Segmentation: see which images have a low score – is the segmentation completely off, or is the boundary just not perfect?
• Etc.
“GOTCHAS”
– what to watch for when evaluating computer vision
What about true negatives, TN?

• In some calculations we might use True Negatives as well


• E.g. suppose we are segmenting an object. A True Negative result is a correctly labelled background pixel, not part of the object.
• Many background pixels may be easy to categorise, e.g. a white tabletop
• Be careful here, as TN is related to the size of the image (a larger image of the same object contains many more easy true negatives)
• This could also be true in classification, if we have many more instances in a background/negative class
• This is a specific example of class imbalance
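A minimal numerical illustration of this gotcha, with made-up counts: accuracy, which includes TN, looks excellent simply because the background dominates, while precision and recall do not.

```python
# Hypothetical segmentation result on a 1000x1000 image where the object
# covers only 1% of the pixels; the numbers are invented for illustration.
tp, fp, fn = 8_000, 1_000, 2_000        # object pixels: 10,000 in total
tn = 1_000_000 - (tp + fp + fn)         # everything else is easy background

accuracy = (tp + tn) / 1_000_000        # 0.997 -- looks almost perfect
precision = tp / (tp + fp)              # 0.889
recall = tp / (tp + fn)                 # 0.800
print(accuracy, precision, recall)
```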
Tracking – accuracy versus robustness
• It only makes sense to measure the accuracy of the track if we are still tracking the target

If the orange target drifts off onto the bus stop, when do we consider it “lost” and no longer part of the evaluation?
Tracking – consistent error
• Average error – is this enough?

The green target may be consistently off-centre by 5 pixels. Average error would be 5 pixels.
But another tracker may be on and off the target throughout the sequence.
Which is better, and how to differentiate?
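A minimal numerical illustration with hypothetical per-frame errors: both trackers have the same average error, but a threshold-based success measure separates them.

```python
import numpy as np

# Hypothetical per-frame centre errors (pixels) for two trackers.
consistent = np.full(100, 5.0)                      # always 5 px off-centre
intermittent = np.concatenate([np.full(80, 1.0),    # usually spot-on...
                               np.full(20, 21.0)])  # ...but drifts badly at times

print(consistent.mean(), intermittent.mean())                   # both average 5.0 px
print(np.mean(consistent <= 10), np.mean(intermittent <= 10))   # 1.0 vs 0.8 within 10 px
```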
Segmentation accuracy

How good is the human-produced ground truth?

(We’ll look at some existing datasets regarding annotation quality, e.g. the MS COCO paper https://arxiv.org/pdf/1405.0312.pdf)
Learning

With algorithms that learn (e.g. Viola-Jones, and Deep Learning) we must be careful how we use the training set
• It must be representative of the data
• It must not be too specific
• We must not use training data in the evaluation of performance!!!
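A minimal sketch of keeping training and test data disjoint (a random split with a fixed seed; the fraction, seed, and function name are arbitrary choices for illustration):

```python
import numpy as np

def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Disjoint random train/test index sets; test images never touch training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    return idx[n_test:], idx[:n_test]   # train indices, test indices

train_idx, test_idx = train_test_split(1000)
assert set(train_idx).isdisjoint(test_idx)  # evaluation must not see training data
```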
Conclusion
• Performance evaluation is a key part of any computer vision project
• Quantitative, objective assessment is a requirement of publication
• Many test sets and challenges exist, spanning generic and application-specific tasks
• Vision processes produce measurements or classifications, which are evaluated differently
• Task-based evaluation may be necessary
• Choose your evaluation methods wisely
• Make sure it is really evaluating what you think it is evaluating
• Doing it wrong can give misleading results
• Be careful with training and testing data
