CVTutorial 2
Andrew French
Today
• Today we’ll look at different ways you can evaluate computer vision
algorithms:
• The history
• Common evaluation approaches, and why we do them
• We'll look briefly at evaluating:
• Stereo
• Recognition
• Tracking
• Some evaluation “gotchas” – things to watch out for
Performance Evaluation
• Computer Vision is (only) 50 years old
• Early papers demonstrated output
from programs applied to one or two
images
• “Look Ma, no hands”
• As the field developed
• theories/algorithms began to compete
• research began to target real images and
problems
• Evaluation became more important
• when & how well does method A work?
• is method B better than method C?
Datasets and Challenges abound
What are we evaluating?
• Measurement of real world properties
• Depth via stereo
• Velocity via optic flow, tracking
• Numerical values
• etc.
• Segmentation/detection/classification
• Success vs failure in images
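• To make this distinction concrete, here is a minimal sketch (Python/NumPy, made-up numbers; the specific metric choices are assumptions, not from the slides) contrasting a continuous error measure for measurement tasks with a discrete accuracy measure for classification tasks:

```python
import numpy as np

# Measurement tasks (depth, velocity, ...): compare continuous values
# against ground truth, e.g. with root-mean-square error.
def rmse(estimates, ground_truth):
    estimates = np.asarray(estimates, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return np.sqrt(np.mean((estimates - ground_truth) ** 2))

# Classification/detection tasks: compare discrete labels against
# ground truth, e.g. with accuracy (fraction of correct decisions).
def accuracy(predicted_labels, true_labels):
    predicted_labels = np.asarray(predicted_labels)
    true_labels = np.asarray(true_labels)
    return np.mean(predicted_labels == true_labels)

# Illustrative (made-up) numbers:
print(rmse([1.2, 0.9, 2.1], [1.0, 1.0, 2.0]))    # continuous measurement error
print(accuracy(["cat", "dog"], ["cat", "cat"]))  # classification accuracy: 0.5
```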
Ground Truth & Test Data
2. Artificial Images
• E.g. Blender?
• Perfect ground truth, BUT it’s not real: it won’t contain the outlier events
and event combinations that occur in genuine data
• Confidence in evaluation = confidence in simulation
• The problem has only moved
• Do we trust the simulation?
• Is it really like the real world?
Ground Truth & Test Data
3. Real Images
• Evaluation addresses the real problem
• Most journals require real data
• Datasets are now being shared, so comparison with other
methods is possible
BUT
• Obtaining ground truth can be difficult
• Automatic methods may have errors
• Manual methods are slow, subjective and also error prone
• Is your algorithm wrong, or your ground truth?
• What if standard sets don’t have the properties you want to
evaluate on?
Evaluating Stereo
• The Middlebury Stereo Dataset & Evaluation
• Started in 2001; 30+ scenes with varying illumination, resolution, etc.
• Ground-truth disparity maps produced by calibrating the cameras and a
laser scanner; laser-measured depth is converted to disparity
• 100+ algorithms tested, various image distance measures used to
compare disparity maps
• Online database of results
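• As an illustration, a minimal sketch (Python/NumPy; the error threshold and the convention that pixels without ground truth are marked 0 are assumptions) of a “bad pixel” style distance measure between an estimated and a ground-truth disparity map:

```python
import numpy as np

def bad_pixel_rate(disparity_est, disparity_gt, threshold=1.0, valid_mask=None):
    """Fraction of pixels whose disparity error exceeds `threshold` pixels."""
    disparity_est = np.asarray(disparity_est, dtype=float)
    disparity_gt = np.asarray(disparity_gt, dtype=float)
    error = np.abs(disparity_est - disparity_gt)
    if valid_mask is None:
        # Assumption: pixels with no ground-truth disparity are stored as 0.
        valid_mask = disparity_gt > 0
    # Evaluate only where ground truth exists.
    return np.mean(error[valid_mask] > threshold)
```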
Evaluating Stereo: Real Images
• Middlebury
• Structured light used to provide
unambiguous feature matching
• Ground truth pixel correspondences
Evaluating Recognition
• Example classifier scores for a single image:

Class       Score
Car         0.60
Bus         0.08
Van         0.30
Motorbike   0.02

So if the actual image is of a Van, we get a Top-2 result, but not Top-1.
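• A minimal sketch of how such a Top-k check could be computed (Python; the helper name in_top_k and the example dictionary are hypothetical):

```python
# Classifier scores for a single image (from the table above).
scores = {"Car": 0.60, "Bus": 0.08, "Van": 0.30, "Motorbike": 0.02}
true_class = "Van"

def in_top_k(scores, true_class, k):
    """Return True if the true class is among the k highest-scoring classes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_class in ranked[:k]

print(in_top_k(scores, true_class, k=1))  # False: Car ranks first
print(in_top_k(scores, true_class, k=2))  # True:  Van ranks second
```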
Evaluating Recognition
• For classification we can look at a confusion matrix, which shows which
category images are confused with which others
• We did this in the lab [example confusion matrices shown on the slide]
(We’ll look at some existing datasets regarding annotation quality, e.g. the MS COCO paper
https://arxiv.org/pdf/1405.0312.pdf )
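• A minimal sketch of how a confusion matrix can be built from true and predicted labels (Python/NumPy; the row/column convention and the example labels below are illustrative assumptions):

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows = true class, columns = predicted class (a common convention)."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t], index[p]] += 1
    return matrix

# Made-up example: one "dog" image is confused with "cat".
classes = ["cat", "dog", "bird"]
true_lbls = ["cat", "dog", "dog", "bird", "dog"]
pred_lbls = ["cat", "cat", "dog", "bird", "dog"]
print(confusion_matrix(true_lbls, pred_lbls, classes))
```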
Learning
With algorithms that learn (e.g. Viola-Jones and deep learning) we must be
careful how we use the training set:
• It must be representative of the data
• It must not be too specific
• We must not use training data in the evaluation of performance!!!
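• As a sketch, one simple way to keep training and test data disjoint (Python/NumPy; the split fraction and the helper name train_test_split are assumptions for illustration):

```python
import numpy as np

def train_test_split(num_samples, test_fraction=0.2, seed=0):
    """Randomly partition sample indices into disjoint train and test sets,
    so no training example is reused when measuring performance."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)
    n_test = int(round(num_samples * test_fraction))
    return order[n_test:], order[:n_test]  # train indices, test indices

train_idx, test_idx = train_test_split(num_samples=100)
assert len(set(train_idx) & set(test_idx)) == 0  # evaluate on unseen data only
```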
Conclusion
• Performance evaluation is a key part of any computer vision project
• Quantitative, objective assessment is a requirement of publication
• Many test sets and challenges exist, spanning generic and application-specific
tasks
• Vision processes produce measurements or classifications, which are
evaluated differently
• Task-based evaluation may be necessary
• Choose your evaluation methods wisely
• Make sure it is really evaluating what you think it is evaluating
• Doing it wrong can give misleading results
• Be careful with training and testing data