Lecture 4 - Computer Vision Notes
2
3
4
Source: IBM Watson Visual Recognition - https://www.youtube.com/watch?v=n3_oGnXkMAE&t=4s
Notes:
Watch the video and discuss remarks and notes with the rest of the class
5
Source: A Practical Guide to Building Enterprise Applications, by Tom Markiewicz and Josh
Zheng (O’Reilly), February 2018, IBM Copyright - https://tmarkiewicz.com/getting-started-with-artificial-intelligence
Notes:
Examine the room around you and take a quick inventory of what you see. Perhaps a desk,
some chairs, bookshelves, and maybe even your laptop. Identifying these items is an
effortless process for a human, even a young child.
Speaking of children, it’s quite easy to teach them the difference between multiple objects.
Over time, parents show them objects or pictures and then repeat the name or description.
Show them a picture of an apple, and then repeat the word apple. In the kitchen, hand them
an apple, and then repeat the word apple. Eventually, through much repetition, the child
learns what an apple is along with its many color and shape variations—red, green, yellow.
Over time, we provide information as to what is a correct example and what isn’t. But how
does this translate to machines? How can we train computers to recognize patterns visually,
as the human brain does?
Training computer vision models is done in much the same way as teaching children about
objects visually. Instead of a person being shown physical items and having them identified,
however, the computer vision algorithms are provided many examples of images that have
been tagged with their contents. In addition to these positive examples, negative examples
are also added to the training. For example, if we’re training for images of cars, we may also
include negative examples of airplanes, trucks, and even boats.
6
Source: A Practical Guide to Building Enterprise Applications, by Tom Markiewicz and Josh
Zheng (O’Reilly), February 2018, IBM Copyright - https://tmarkiewicz.com/getting-started-with-artificial-intelligence
Notes:
Computer vision’s most fundamental capability, general image tagging and classification, allows
users to understand the content of an image. While you’ll often see image classification and
tagging used interchangeably, it’s best to consider classification as assigning an image to one
or more categories, while tagging is an assignment of a single word (or multiple words)
describing the image. When an image is processed, various keyword tags or classes are
returned, describing the image with varying confidence levels. Based on an application’s
needs, these can be used to identify the image contents. For example, you may need to find
images with contents of “male playing soccer outside” or organize images into visual themes
such as cars, sports, or fruits.
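To make this concrete, here is a minimal sketch of general image tagging using a publicly available pretrained model (Keras ResNet50) rather than any particular vendor service; the file name example.jpg is a placeholder.

```python
# A minimal sketch of general image tagging with a pretrained classifier
# (Keras ResNet50 here; any pretrained model or hosted API would behave similarly).
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = ResNet50(weights="imagenet")          # pretrained on ImageNet classes

img = image.load_img("example.jpg", target_size=(224, 224))  # placeholder file name
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
for _, label, score in decode_predictions(preds, top=5)[0]:
    print(f"{label}: {score:.2f}")            # keyword tags with confidence levels
```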
Words, phrases, and other text are frequently part of images. In our day-to-day lives, we
come across street signs, documents, and advertisements. Humans see the text, read it, and
comprehend it rather easily. For machines, this is an entirely different challenge. Via optical
character recognition (OCR), computers can extract text from an image, enabling a wide
range of potential applications. From language translation to mobile apps that assist the
vision impaired, computer vision algorithms equip users to pull words out of a picture into
readily usable text in applications.
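As a small illustration, OCR can also be run locally with the open-source Tesseract engine; this sketch assumes pytesseract and the Tesseract binary are installed, and sign.jpg is a placeholder file.

```python
# A minimal OCR sketch: extract text from an image with Tesseract via pytesseract.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("sign.jpg"))
print(text)  # extracted words, ready for translation, search, or other applications
```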
In addition to general image tagging, computer vision can be used for more specific tasks.
Some of the more common are the ability to detect logos in images as well as food items.
Another frequent application of computer vision is in facial detection. With training, computer
vision algorithms allow developers to recognize faces, sometimes getting even more
specialized to detect celebrities.
7
Source: A Practical Guide to Building Enterprise Applications, by Tom Markiewicz and Josh
Zheng (O’Reilly), February 2018, IBM Copyright - https://tmarkiewicz.com/getting-started-with-artificial-intelligence
Notes:
Another capability of computer vision is object localization. Sometimes your application’s
requirements will include not just classifying what is in the image, but also understanding the
position of the particular object in relation to everything else. This is where object localization
comes into play. Localization finds a specific object’s location within an image, displaying the
results as a bounding box around the specified object. Similarly, object detection then
identifies a varying number of objects within the image. An example is the ability to recognize faces
or vehicles in an image. Figure 4-1 shows an example of object detection with dogs.
There are some challenges associated with object localization, however. Often, objects in
an image overlap, making it difficult to ascertain their specific boundaries. Another challenge
is visually similar items. When the colors or patterns of an object match its background in
an image, it can again be difficult to distinguish the objects.
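Bounding boxes are usually represented as corner coordinates, and the overlap between two boxes is commonly measured with intersection over union (IoU); the sketch below assumes (x_min, y_min, x_max, y_max) tuples and made-up coordinates.

```python
# A minimal sketch: bounding boxes as (x_min, y_min, x_max, y_max) tuples,
# and intersection-over-union (IoU), a standard measure of how much two boxes overlap.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # overlap rectangle (zero area if the boxes do not intersect)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14: two partially overlapping boxes
```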
8
Sources:
A Practical Guide to Building Enterprise Applications, by Tom Markiewicz and Josh Zheng
(O’Reilly), February 2018, IBM Copyright - https://tmarkiewicz.com/getting-started-with-artificial-intelligence
Notes:
Custom Classifiers
Most of the time, you don’t need to recognize everything. If you’re looking to identify or
classify only a small set of objects, custom classifiers could be the right tool. Most of the
large third-party platforms provide some mechanism for building custom visual classifiers,
allowing you to train the computer vision algorithms to recognize specific content within your
images. Custom classifiers extend general tagging to meet your application’s particular
needs to recognize your visual content, and primarily exist to gain higher accuracy by
reducing the search space of your visual classifier.
At a high level, when creating a custom classifier, you’ll need to have a collection of images
that are identified as positive and negative examples. For example, if you were training a
custom classifier on fruits, you’d want to have positive training images of apples, bananas,
and pears. For negative examples, you could have pictures of vegetables, cheese, or meat
(see Figure 4-2).
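As a sketch of how such positive and negative examples might be organized for training (the folder names and image size here are assumptions, not a prescribed layout):

```python
# A minimal sketch: positive ("fruit") and negative ("not_fruit") example folders
# loaded as a labeled dataset for training a custom binary classifier.
# Assumed directory layout:
#   training/fruit/apple1.jpg, banana2.jpg, ...
#   training/not_fruit/cheese1.jpg, carrot2.jpg, ...
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "training",                # folder names become the class labels
    image_size=(128, 128),
    batch_size=32,
)
```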
Most computer vision uses deep learning algorithms (discussed in Chapter 1), specifically
convolutional neural networks (CNNs). If you’re interested in building CNNs from scratch and
going more in depth with deep learning for computer vision, there are a wide variety of
resources available. We recommend Andrew Ng’s Deep Learning specialization on Coursera
as well as fast.ai. Additionally, the following resources dive deeper into relevant computer
vision topics:
• Programming Computer Vision with Python (Jan Erik Solem, O’Reilly)
• Learning Path: Deep Learning Models and Computer Vision with TensorFlow (ed. Shannon
Cutt, O’Reilly)
• Learning OpenCV 3 (Gary Bradski and Adrian Kaehler, O’Reilly)
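Building on the CNN discussion above, a minimal CNN-based custom classifier in Keras might look like the sketch below; the layer sizes are illustrative choices, not a recommended architecture, and train_ds refers to the dataset loaded in the earlier sketch.

```python
# A minimal convolutional neural network (CNN) for a two-class custom classifier,
# e.g. "fruit" vs. "not fruit"; layer sizes are illustrative only.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, epochs=10)   # train_ds from the directory-loading sketch above
```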
9
Sources:
1. A Practical Guide to Building Enterprise Applications, by Tom Markiewicz and Josh Zheng
(O’Reilly), February 2018, IBM Copyright - https://tmarkiewicz.com/getting-started-with-artificial-intelligence
2. https://www.popsci.com/how-watson-supercomputer-can-see-water-waste-in-drought-stricken-california/
Notes:
When a drought in California reached crisis level in April 2015, Governor Jerry Brown issued
the state’s first mandatory water restrictions. All cities and towns were instructed to reduce
water usage by 25% in 10 months. Achieving this required measures more effective than just
asking residents to use less water. Specifically, the state needed to run ad campaigns
targeted to property owners who were using more water than necessary. Unfortunately, the
state didn’t even have water consumption data on such a granular level.
Scientists at OmniEarth came up with the idea of analyzing aerial images to identify these
property owners. They first trained IBM Watson’s Visual Recognition service on a set of aerial
images containing individual homes with different topographical features, including pools,
grass, turf, shrubs, and gravel. They then fed a massive amount of similar aerial imagery to
Watson for classification. Partnering with water districts, cities, and counties, the scientists at
OmniEarth could then quickly identify precisely which land parcels needed to reduce water
consumption, and by how much. For example, they identified swimming pools in 150,000
parcels in just 12 minutes.
Armed with this knowledge, OmniEarth helped water districts make specific
recommendations to property owners and governments. Such recommendations included
replacing a patch or percentage of turf with mulch, rocks, or a more drought-tolerant species,
or draining and filling a swimming pool less frequently.
10
Sources:
1. A Practical Guide to Building Enterprise Applications, by Tom Markiewicz and Josh Zheng
(O’Reilly), February 2018, IBM Copyright - https://tmarkiewicz.com/getting-started-with-artificial-intelligence
2. https://www.popsci.com/how-watson-supercomputer-can-see-water-waste-in-drought-stricken-california/
Notes:
The proliferation of cameras in recent years has led to an explosion in video data. Though
videos contain numerous insights, these are hard to extract using computers. In many cases,
such as home surveillance videos, the only viable solution is still human monitoring. That is, a
human sits in an operations center 24/7, watching screens and raising an alert when
something happens, or reviewing dozens or even hundreds of hours of past footage to
identify key events.
BlueChasm is a company looking to tackle this problem using computer vision. The founder,
Ryan VanAlstine, believes that if successful, video can be a new form of sensor where
traditional detection and human inspection fail. BlueChasm’s product, VideoRecon, can
watch and listen to videos, identifying key objects, themes, or events within the footage. It
will then tag and timestamp those events, and then return the metadata to the end user.
The industry the company plans to focus on first is law enforcement. Ryan explains:
“Imagine you work in law enforcement, and you knew there was a traffic camera on a street
where a burglary occurred last night. You could use VideoRecon to review the footage from
that camera and create a tag whenever it detected a person going into the burgled house.
Within a few minutes, you would be able to review all the sections of the video that had been
tagged and find the footage of the break-in, instead of having to watch hours of footage
yourself.” Once a video is uploaded, IBM Watson’s Visual Recognition is used to analyze the
video footage and identify vehicles, weapons, and other objects of interest.
BlueChasm is also looking to help media companies analyze movies and TV shows. It wants
to track the episodes, scenes, and moments where a particular actor appears, a setting is
shown, or a line of dialog is spoken. This metadata could help viewers find a classic scene
from a show they’d like to rewatch by simply typing in some search terms and navigating
straight to the relevant episode. More generally, for any organization that manages an
extensive video archive, the ability to search and filter by content is an enormous time-saver.
11
12
Notes:
13
Notes:
• Node (neuron)
• Connection (synapse)
• Inputs
• Weights
• Predicted output
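A minimal sketch tying these terms together for a single node, with made-up numbers:

```python
# A single node (neuron): weighted inputs plus a bias, passed through an activation.
import numpy as np

inputs = np.array([0.5, -1.2, 3.0])      # inputs arriving over connections (synapses)
weights = np.array([0.8, 0.1, -0.4])     # one weight per connection
bias = 0.2

z = np.dot(weights, inputs) + bias        # weighted sum of the inputs
predicted_output = 1 / (1 + np.exp(-z))   # sigmoid activation -> predicted output
print(predicted_output)
```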
14
Notes:
15
Notes:
In a network used for binary classification, the output layer will have two nodes.
The activation function used in the nodes can be a sigmoid, a hyperbolic tangent, or a rectified
linear unit (ReLU).
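For reference, the three activation functions mentioned above can be written directly in NumPy:

```python
# The three common activation functions, as plain NumPy functions.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```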
16
Notes:
The bias value (c) allows the activation function to be shifted to the left or to the right to
better fit the data. Changes to the weight alter the steepness of the curve; the bias, on the
other hand, shifts the entire curve so it fits the data better.
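A quick numeric illustration of this point, using a sigmoid activation with made-up values of the weight w and bias c:

```python
# The weight changes the steepness of the sigmoid; the bias c shifts the whole curve.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([-2.0, 0.0, 2.0])
w, c = 1.0, 0.0
print(sigmoid(w * x + c))        # baseline curve
print(sigmoid(3.0 * x + c))      # larger weight -> steeper transition
print(sigmoid(w * x + 2.0))      # positive bias -> curve shifted to the left
```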
17
Notes:
Step 1. A picture of your face is captured from a photo or video. Your face might appear alone
or in a crowd. Your image may show you looking straight ahead or nearly in profile.
Step 2. A CNN approach reads the geometry of your face. Key factors include the distance
between your eyes and the distance from forehead to chin. The software identifies 68 facial
landmarks, or features, which together make up your facial signature.
Step 3. Your facial signature is compared against a database of known faces.
Step 4. A determination is made. Your faceprint may match that of an image in a facial
recognition system database.
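A minimal sketch of what the comparison and determination steps might look like numerically, with made-up faceprint vectors and an assumed similarity threshold:

```python
# Faceprints as feature vectors, compared with cosine similarity against a small
# "database"; all vectors and the threshold are made-up values for illustration.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

probe = np.array([0.2, 0.9, -0.3, 0.5])              # faceprint from the new image
database = {
    "person_a": np.array([0.1, 0.8, -0.2, 0.6]),
    "person_b": np.array([-0.7, 0.1, 0.9, -0.4]),
}

threshold = 0.8                                       # assumed decision threshold
for name, faceprint in database.items():
    score = cosine_similarity(probe, faceprint)
    print(name, round(score, 3), "match" if score > threshold else "no match")
```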
18
Notes:
The model does not interpret these images as words: dog, cat, car.
Instead, the inputs are encoded so they can take the form of an integer or a vector.
One type of encoding that is used for classification is called 1-hot encoding.
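For example, the labels dog, cat, and car could be 1-hot encoded as the following vectors:

```python
# A minimal sketch of 1-hot encoding the class labels dog / cat / car as vectors.
import numpy as np

classes = ["dog", "cat", "car"]
one_hot = {label: np.eye(len(classes))[i] for i, label in enumerate(classes)}

print(one_hot["dog"])   # [1. 0. 0.]
print(one_hot["cat"])   # [0. 1. 0.]
print(one_hot["car"])   # [0. 0. 1.]
```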
19
Notes:
20
Notes:
1-Hot Encoding:
• Create a matrix of 0’s and 1’s
• Make each category a column in a table
• Warning: encode only (n-1) of the n categories in the variable
• Otherwise, you will have perfect multicollinearity and encounter mathematical issues
Cons:
• This makes the data very large and sparse
• There are better methods for text data
• Does not scale well to big data
Solutions:
• Bag of words / TF-IDF for text data
• Advanced algorithms such as neural networks with embedding layers
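A small sketch of 1-hot encoding with pandas, illustrating the (n-1) rule via drop_first (the fruit column is a made-up example):

```python
# 1-hot encoding a categorical column with pandas.
# drop_first=True keeps (n-1) columns and avoids the perfect-multicollinearity trap.
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "banana", "pear", "apple"]})
encoded = pd.get_dummies(df["fruit"], prefix="fruit", drop_first=True)
print(encoded)
# Only fruit_banana and fruit_pear remain; "apple" is implied when both are 0.
```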
21
22
Notes:
Eye fatigue is a huge problem… Watson will be the best computer in the world at image
analytics. Through intelligent anomaly detection, Medical Sieve will sift through large image
collections, highlight disease-depicting ones, and offer image-guided decision support from
searches of similar studies in databases. This could lead to reduced costs, overall
improvement in quality of care, and avoidance of unnecessary radiation to patients.
23
Source: http://blog.genesys.com/3-reasons-cognitive-systems-will-revolutionize-customer-experience/
Notes:
Intrinsic identity Attributes:
• Age, gender, Ethnicity
Expression:
• Happy, Sad, Angry, Disgusted, Surprise, Fear
• Micro expressions
Extrinsic Attributes:
• Hair color, Hat, Glasses, Bald, Sunglasses
24
• Hospital: Add speech/text/video with facial expression analysis to better understand the
patient’s illness, distress state, and condition.
• Autonomous Systems: A broader ’understanding’ of the implications of one’s decision can
help prevent major disasters – e.g. flash-crash of 2010 – or ethical dilemmas – e.g.
autonomous cars. In the future, autonomous systems should be embedded with a broader
perspective on human contexts, ethics, values. Creation of sentic/empathetic machines.
24
Source: https://www.ibm.com/blogs/research/2019/01/diversity-in-faces/
Notes:
IBM Research Releases ‘Diversity in Faces’ Dataset to Advance Study of Fairness in Facial
Recognition Systems
The challenge in training AI is manifested in a very apparent and profound way with facial
recognition technology. There can be difficulties in making facial recognition systems that
meet fairness expectations. As shown by Joy Buolamwini and Timnit Gebru in Gender Shades
in 2018, facial recognition systems in commercial use performed better for lighter individuals
and males and worse for darker females [1]. The heart of the problem is not with the AI
technology itself, per se, but with how the AI-powered facial recognition systems are trained.
For the facial recognition systems to perform as desired – and the outcomes to become
increasingly accurate – training data must be diverse and offer a breadth of coverage, as
shown in our prior work [2]. For example, the training data sets must be large enough and
different enough that the technology learns all the ways in which faces differ to accurately
recognize those differences in a variety of situations. The images must reflect the distribution
of features in faces we see in the world.
How do we measure and ensure diversity for human faces? On one hand, we are familiar
with how faces differ by age, gender, and skin tone, and how different faces can vary across
some of these dimensions. Much of the focus on facial recognition technology has been on
how well it performs within these attributes. But, as prior studies have shown, these
attributes are just a piece of the puzzle and not entirely adequate for characterizing the full
diversity of human faces. Dimensions like face symmetry, facial contrast, the pose the face
is in, the length or width of the face’s attributes (eyes, nose, forehead, etc.) are also
important.
25
Source: https://techcrunch.com/2019/01/29/ibm-builds-a-more-diverse-million-face-dataset-to-help-reduce-bias-in-ai/
Notes:
Encoding biases into machine learning models, and in general into the constructs we refer to as AI, is
nearly inescapable — but we can sure do better than we have in past years. IBM is hoping that a new
database of a million faces more reflective of those in the real world will help.
Facial recognition is being relied on for everything from unlocking your phone to your front door, and is
being used to estimate your mood or likelihood to commit criminal acts — and we may as well admit many
of these applications are bunk. But even the good ones often fail simple tests like working adequately
with people of certain skin tones or ages.
This is a multi-layered problem, and of course a major part of it is that many developers and creators of
these systems fail to think about, let alone audit for, a failure of representation in their data.
That’s something everyone needs to work harder at, but the actual data matters, as well. How can you
train a computer vision algorithm to work well with all people if there’s no set of data that has all people
in it?
Every set will necessarily be limited, but building one that has enough of everyone in it that no one is
effectively systematically excluded is a worthwhile goal. And with its new million-image Diversity in Faces
(DiF) set, that’s what IBM has attempted to create. As the paper introducing the set reads:
For face recognition to perform as desired – to be both accurate and fair – training data must provide
sufficient balance and coverage. The training data sets should be large enough and diverse enough to
learn the many ways in which faces inherently differ. The images must reflect the diversity of features in
faces we see in the world.
The faces are sourced from a huge 100-million-image data set (Flickr Creative Commons), through which
another machine learning system prowled and found as many faces as it could. These were then isolated
and cropped, and that’s when the real work started.
26
Source: https://www.youtube.com/watch?v=rBe6KY5Mv-o&feature=youtu.be
https://arxiv.org/abs/1901.10436
Notes:
As we rely on data-driven methods to create face recognition technology, we need to ensure
necessary balance and coverage in training data. However, there are still scientific questions
about how to represent and extract pertinent facial features and quantitatively measure
facial diversity. Towards this goal, Diversity in Faces (DiF) provides a data set of one million
annotated human face images for advancing the study of facial diversity. The annotations are
generated using ten well-established facial coding schemes from the scientific literature. The
facial coding schemes provide human-interpretable quantitative measures of facial features.
We believe that by making the extracted coding schemes available on a large set of faces, we
can accelerate research and development towards creating more fair and accurate facial
recognition systems.
27
Sources:
• https://www.youtube.com/watch?time_continue=3&v=ND2HfNnss3M
• https://www.youtube.com/watch?v=YRhxdVk_sIs
• Ethically Bound white papers: https://arxiv.org/pdf/1812.03980.pdf
28
29
Source: https://www.ibm.com/watson/services/visual-recognition/demo/
Notes:
Quickly and accurately tag, classify and train visual content using machine learning. View the
AI services demo on visual recognition.
30
Source: https://analog-ai-demo.mybluemix.net/hardware#publications
Notes:
In-memory computing hardware increases the speed and energy-efficiency needed for the
next steps in AI. The Fusion chip implements an artificial neural network in a piece of in-
memory computing hardware. The hardware exploits the storage capability and physical
attributes of phase-change memory (PCM) devices to implement an artificial neural network.
With PCM, when an electrical pulse is applied to the material, it changes the conductance of
the device by switching the material between amorphous and crystalline phases. A low
electrical pulse will make the PCM device more crystalline (less resistance). A high electrical
pulse will make the device more amorphous (more resistance). Therefore, instead of
recording a 0 or 1 like in the digital world, it records the states as a continuum of values
between the two–the analog world.
PCM devices have the ability to store synaptic weights in their analog conductance state.
When PCM devices are arranged in a crossbar configuration, an analog matrix-vector
multiplication can be performed in a single time step, exploiting the advantages of multi-level
storage capability and Kirchhoff’s circuit laws.
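A toy numeric illustration of that idea (the conductance and voltage values are made up): the weights are stored as conductances G, the input vector is applied as voltages V, and by Ohm’s law and Kirchhoff’s current law the currents collected on each output line equal the matrix-vector product.

```python
# Toy model of analog matrix-vector multiplication in a PCM crossbar:
# conductances G hold the synaptic weights, voltages V encode the input vector,
# and the summed output-line currents are I = G @ V in a single "time step".
import numpy as np

G = np.array([[0.8, 0.1, 0.3],      # conductances (made-up values, arbitrary units)
              [0.2, 0.9, 0.4]])
V = np.array([1.0, 0.5, -0.2])      # input voltages encoding the activation vector

I = G @ V                           # output currents = matrix-vector product
print(I)
```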
31
Notes:
The presentation mode has a link that will allow the students to experiment.
32
33
34
35
36
37
38