CVTutorial 2

This document discusses various methods for evaluating computer vision algorithms. It begins by looking at the history of computer vision evaluation and how it has evolved from early demonstrations on single images to more rigorous testing on real datasets. Common evaluation approaches are then examined for tasks like stereo, recognition, and tracking. Key aspects covered include the use of ground truth data, metrics like precision and recall, and potential issues to watch out for such as class imbalance and the quality of human-generated annotations. Overall evaluation is presented as an important part of advancing computer vision research and comparing different algorithms.


COMPUTER VISION

Evaluating Computer Vision Methods

Andrew French
Today
• Today we’ll look at different ways you can evaluate computer vision algorithms:
• The history
• Common evaluation approaches, and why we do them
• We'll look briefly at evaluating:
• Stereo
• Recognition
• Tracking
• Some evaluation “gotchas” – things to watch out for
Performance Evaluation
• Computer Vision is (only) 50 years old
• Early papers demonstrated output from programs applied to one or two images
• “Look Ma, no hands”
• As the field developed
• theories/algorithms began to compete
• research began to target real images and problems
• Evaluation became more important
• when & how well does method A work?
• is method B better than method C?
Datasets and Challenges abound
What are we evaluating?
• Measurement of real world properties
• Depth via stereo
• Velocity via optic flow, tracking
• Numerical values
• etc.

• Segmentation/detection/classification
• Success vs failure in images

The boundary is not always clear-cut
• Is a tracker detecting the target in the image, or measuring its real-world velocity?
• May need to combine evaluation methods
Ground Truth & Test Data: how to get it
1. Alternative/Competing Sensors
• e.g. depth from stereo vs. depth from a laser scanner
• Valid if alternatives perform well given similar environments/tasks
• More common with measurement than classification methods

2. Artificial Images
• E.g. Blender?
• Perfect GT, BUT it’s not real, won’t contain the outlier events and event combinations that occur in genuine data
• Confidence in evaluation = confidence in simulation
• The problem has only moved
• Do we trust the simulation?
• Is it really like the real world?
Ground Truth & Test Data
3. Real Images
• Evaluation addresses the real problem
• Most journals require real data
• Datasets are now being shared, so comparison with other methods is possible
BUT
• Obtaining ground truth can be difficult
• Automatic methods may have errors
• Manual methods are slow, subjective and also error prone
• Is your algorithm wrong, or your ground truth?
• What if standard sets don’t have the properties you want to evaluate on?
Evaluating Stereo
• The Middlebury Stereo Dataset & Evaluation
• Started in 2001, 30+ scenes, varying illumination, resolution, etc.
• Ground truth disparity maps produced by calibrating cameras and laser scanner; laser-measured depth converted to disparity
• 100+ algorithms tested, various image distance measures used to compare disparity maps
• Online database of results
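A minimal Python sketch of one common way to compare an estimated disparity map against ground truth, the percentage of “bad” pixels whose error exceeds a threshold (the function name, threshold, and mask handling here are illustrative assumptions, not the exact Middlebury scoring code):

```python
import numpy as np

def bad_pixel_percentage(disp_est, disp_gt, valid_mask, threshold=2.0):
    """Percentage of valid pixels whose disparity error exceeds a threshold.

    disp_est, disp_gt: 2-D arrays of disparities (same shape).
    valid_mask: boolean array marking pixels with usable ground truth.
    threshold: allowed absolute disparity error in pixels (e.g. 0.5, 1, 2).
    """
    err = np.abs(disp_est - disp_gt)
    bad = (err > threshold) & valid_mask
    return 100.0 * bad.sum() / valid_mask.sum()
```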
Evaluating Stereo: Real Images
• Middlebury
• Structured light used to provide unambiguous feature matching
• Ground truth pixel correspondences

High-Accuracy Stereo Depth Maps Using Structured Light. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), volume 1, pages 195–202, Madison, WI, June 2003.
High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth. German Conference on Pattern Recognition (GCPR 2014), Münster, Germany, September 2014.
Evaluating Recognition
We want to:
• Compare the true class/identity of a particular image/object with that predicted by the algorithm
• Define
• True Positive (TP) = the algorithm makes a correct prediction about the presence of an object in an image
• False Positive (FP) = the algorithm predicts the presence of an object, but that object is not present in the image
• False Negative (FN) = the algorithm misses an object
• Then
• Precision = TP/(TP+FP) // fraction of positive responses that were correct
• Recall = TP/(TP+FN) // fraction of actual positives that were identified
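A minimal Python sketch of these definitions (the counts in the usage example are made up for illustration):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: 80 correct detections, 20 false alarms, 10 missed objects
print(precision_recall(80, 20, 10))  # (0.8, 0.888...)
```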
Evaluating Recognition
• Tables of precision, recall, etc. statistics are hard to interpret: precision-recall curves are a valuable visualisation tool (closely related to Receiver Operating Characteristic, ROC, curves, which plot true-positive rate against false-positive rate)
• Plot of precision against recall as some parameter is varied
• The parameter is the threshold used to decide if the model and image are similar enough to be considered a match
• Increasing the threshold imposes a tighter requirement on matching
• This reduces FP, but increases FN
• So Precision goes up, but Recall goes down

We want both to be high
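A minimal Python sketch of how such a curve is traced by sweeping the matching threshold (the score/label arrays and function name are hypothetical, not tied to any particular detector):

```python
import numpy as np

def precision_recall_curve(scores, labels, thresholds):
    """Trace precision and recall as the acceptance threshold is varied.

    scores: detector/matcher confidence per candidate (higher = more confident).
    labels: 1 if the candidate is a true match, 0 otherwise.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    curve = []
    for t in thresholds:
        accepted = scores >= t
        tp = np.sum(accepted & (labels == 1))
        fp = np.sum(accepted & (labels == 0))
        fn = np.sum(~accepted & (labels == 1))
        prec = tp / (tp + fp) if (tp + fp) else 1.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        curve.append((t, prec, rec))
    return curve
```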


Evaluating Recognition
• Sometimes we are interested in whether a result is in the top X of the returned results, rather than whether the single best result is correct

• For example, if the classification results for different classes are:

Class      Score
Car        0.60
Bus        0.08
Van        0.30
Motorbike  0.02

• Top-1 result is Car (the single best)
• Top-2 results are Car, Van

So if the actual image is of a Van, we get a Top-2 result, but not Top-1.
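A minimal Python sketch of Top-k evaluation using the scores above (the helper name is illustrative):

```python
def top_k_classes(scores, k):
    """Return the k class labels with the highest scores."""
    return [c for c, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]]

scores = {"Car": 0.60, "Bus": 0.08, "Van": 0.30, "Motorbike": 0.02}
print(top_k_classes(scores, 1))           # ['Car']
print(top_k_classes(scores, 2))           # ['Car', 'Van']
print("Van" in top_k_classes(scores, 2))  # True: a Top-2 hit, but not Top-1
```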
Evaluating Recognition
• For classification we can look at a confusion matrix, which shows which category images are confused with which others
• We did this in the lab, e.g. for digit classification:

Which digit images are most confused with which others
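A minimal Python sketch of building a confusion matrix (the digit labels below are a toy example, not the lab data):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples whose true class is i and predicted class is j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy digit example: one image of a 3 is confused with a 5
y_true = [3, 3, 5, 5, 8]
y_pred = [3, 5, 5, 5, 8]
print(confusion_matrix(y_true, y_pred, n_classes=10))
```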


Evaluating Tracking
• Ground truth is a set of manually-drawn bounding boxes
• Elements of measurement and recognition
• Accuracy – is the target where the tracker says it is?
• Robustness – is the tracker associated with the target?

Precision plot: measures the percentage of frames whose estimated location is within a given threshold distance of the ground truth.
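A minimal Python sketch of that computation, assuming per-frame predicted and ground-truth target centres (names and thresholds are illustrative):

```python
import numpy as np

def precision_plot(pred_centres, gt_centres, thresholds):
    """For each distance threshold, the percentage of frames whose predicted
    target centre lies within that distance of the ground-truth centre."""
    pred = np.asarray(pred_centres, dtype=float)  # shape (n_frames, 2)
    gt = np.asarray(gt_centres, dtype=float)
    dists = np.linalg.norm(pred - gt, axis=1)
    return [(t, 100.0 * np.mean(dists <= t)) for t in thresholds]
```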
Bounding boxes
• Error measure is the overlap between the ground truth box and the predicted box
• In tracking: the Success plot measures the percentage of frames for which the overlap (intersection) of the predicted and ground truth bounding boxes, divided by their union, exceeds a threshold that varies from 0 to 1
• In instance segmentation, we could look for the closest detected bounding box and count it as a “hit” if the overlap is high enough

|A ∩ B| / |A ∪ B| = Jaccard Index (“intersection over union”)
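A minimal Python sketch of intersection over union for two axis-aligned boxes, assuming an (x_min, y_min, x_max, y_max) convention (conventions vary between datasets):

```python
def iou(box_a, box_b):
    """Jaccard index (intersection over union) of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # width of the intersection
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # height of the intersection
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...: half of each box overlaps
```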
Related: segmentation accuracy
• Jaccard is closely related to the F1 score, a.k.a. the Dice coefficient

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)
or
Dice = 2J / (1 + J)

• Matlab can compute this given two binary images (test and ground truth)
https://uk.mathworks.com/help/images/ref/dice.html
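A minimal Python/NumPy equivalent of that computation, assuming binary masks as inputs (a sketch, not the MATLAB implementation itself):

```python
import numpy as np

def dice_coefficient(mask_test, mask_gt):
    """Dice / F1 score of two binary masks: 2|A∩B| / (|A| + |B|)."""
    a = np.asarray(mask_test, dtype=bool)
    b = np.asarray(mask_gt, dtype=bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def dice_from_jaccard(j):
    """Equivalent conversion from the Jaccard index: Dice = 2J / (1 + J)."""
    return 2.0 * j / (1.0 + j)
```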
Quantitative versus Qualitative evaluation
• i.e. numerical measures versus presenting output pictures

Both are important!

The best approach is to present numerical data, then examine some good and bad images to try to understand why the numerical results are the way they are.
For example:
• Classification: see which images are in the wrong categories
• Bounding boxes: what is causing a low score? Poor position or amount of overlap?
• Segmentation: see which images have a low score – is the segmentation completely off, or is the boundary just not perfect?
• Etc.
“GOTCHAS”
– what to watch for when evaluating computer vision
What about true negatives, TN?

• In some calculations we might use True Negatives as well


• E.g. suppose we are segmenting an object. A True Negative result is a correctly labelled background pixel, not part of the object.
• Many background pixels may be easy to categorise, e.g. a white tabletop
• Be careful here, as TN is related to the size of the image (a larger image of the same object contains many more easy true negatives)
• This could also be true in classification, if we have many more instances in a background/negative class
• This is a specific example of class imbalance
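A minimal numerical illustration of this gotcha, with made-up counts: accuracy, which includes TN, looks excellent simply because the background dominates, while precision and recall do not.

```python
# Hypothetical segmentation result on a 1000x1000 image where the object
# covers only 1% of the pixels; the numbers are invented for illustration.
tp, fp, fn = 8_000, 1_000, 2_000        # object pixels: 10,000 in total
tn = 1_000_000 - (tp + fp + fn)         # everything else is easy background

accuracy = (tp + tn) / 1_000_000        # 0.997 -- looks almost perfect
precision = tp / (tp + fp)              # 0.889
recall = tp / (tp + fn)                 # 0.800
print(accuracy, precision, recall)
```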
Tracking – accuracy versus robustness
• It only makes sense to measure the accuracy of the track if we are still tracking the target

If the orange target drifts off onto the bus stop, when do we consider it “lost” and no longer part of the evaluation?
Tracking – consistent error
• Average error – is this enough?

The green target may be consistently off-centre by 5 pixels. Average error would be 5 pixels.
But another tracker may be on and off the target throughout the sequence.
Which is better, and how to differentiate?
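A minimal numerical illustration with hypothetical per-frame errors: both trackers have the same average error, but a threshold-based success measure separates them.

```python
import numpy as np

# Hypothetical per-frame centre errors (pixels) for two trackers.
consistent = np.full(100, 5.0)                      # always 5 px off-centre
intermittent = np.concatenate([np.full(80, 1.0),    # usually spot-on...
                               np.full(20, 21.0)])  # ...but drifts badly at times

print(consistent.mean(), intermittent.mean())                   # both average 5.0 px
print(np.mean(consistent <= 10), np.mean(intermittent <= 10))   # 1.0 vs 0.8 within 10 px
```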
Segmentation accuracy

How good is the human-produced ground truth?

(We’ll look at some existing datasets regarding annotation quality, e.g. the MS COCO paper https://arxiv.org/pdf/1405.0312.pdf)
Learning

With algorithms that learn (e.g. Viola-Jones, and Deep Learning) we must be careful how we use the training set
• It must be representative of the data
• It must not be too specific
• We must not use training data in the evaluation of performance!!!
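A minimal sketch of keeping training and test data disjoint (a random split with a fixed seed; the fraction, seed, and function name are arbitrary choices for illustration):

```python
import numpy as np

def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Disjoint random train/test index sets; test images never touch training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    return idx[n_test:], idx[:n_test]   # train indices, test indices

train_idx, test_idx = train_test_split(1000)
assert set(train_idx).isdisjoint(test_idx)  # evaluation must not see training data
```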
Conclusion
• Performance evaluation is a key part of any computer vision project
• Quantitative, objective assessment is a requirement of publication
• Many test sets and challenges exist, spanning generic and application-specific tasks
• Vision processes produce measurements or classifications, which are evaluated differently
• Task-based evaluation may be necessary
• Choose your evaluation methods wisely
• Make sure it is really evaluating what you think it is evaluating
• Doing it wrong can give misleading results
• Be careful with training and testing data
