Fundamentals of Computer Vision With QA
Introduction
Computer vision is one of the core areas of artificial intelligence (AI), and
focuses on creating solutions that enable AI applications to "see" the
world and make sense of it.
Of course, computers don't have biological eyes that work the way ours
do, but they can process images, whether from a live camera feed or from
digital photographs and videos. This ability to process images is the key
to creating software that can emulate human visual perception.
Images and image processing
To a computer, an image is just an array of numeric pixel values. For
example, consider the following array:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
The array consists of seven rows and seven columns, representing the
pixel values for a 7x7 pixel image (which is known as the
image's resolution). Each pixel has a value between 0 (black) and 255
(white); with values between these bounds representing shades of gray.
The image represented by this array is a black square with a white 3x3
pixel square at its center.
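To make this concrete, here's a minimal sketch (assuming a Python
environment with NumPy) that represents the same 7x7 grayscale image as
an array of pixel values:

```python
import numpy as np

# 7x7 grayscale image: 0 = black, 255 = white
image = np.zeros((7, 7), dtype=np.uint8)
image[2:5, 2:5] = 255  # the white 3x3 square in the center

print(image.shape)               # (7, 7) -> the image's resolution
print(image.min(), image.max())  # 0 255 -> black background, white center
```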
Grayscale images need only one value per pixel. Color images are usually
represented as multidimensional arrays, with one layer of pixel values for
each of the red, green, and blue (RGB) color channels. For example,
consider a color image whose channel arrays are as follows:
Red:
150 150 150 150 150 150 150
150 150 150 150 150 150 150
150 150 255 255 255 150 150
150 150 255 255 255 150 150
150 150 255 255 255 150 150
150 150 150 150 150 150 150
150 150 150 150 150 150 150
Green:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Blue:
255 255 255 255 255 255 255
255 255 255 255 255 255 255
255 255 0 0 0 255 255
255 255 0 0 0 255 255
255 255 0 0 0 255 255
255 255 255 255 255 255 255
255 255 255 255 255 255 255
The resulting image is a purple square with a yellow square at its center.
The purple areas of the image are created by combining the following
channel values:
Red: 150
Green: 0
Blue: 255
The yellow squares in the center are created by combining:
Red: 255
Green: 255
Blue: 0
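Here's a similar sketch, again assuming NumPy, that stacks the three
channel arrays into a single 7x7x3 color image and reads back the
combined pixel values:

```python
import numpy as np

# Channel arrays for the 7x7 color image described above
red = np.full((7, 7), 150, dtype=np.uint8)
red[2:5, 2:5] = 255
green = np.zeros((7, 7), dtype=np.uint8)
green[2:5, 2:5] = 255
blue = np.full((7, 7), 255, dtype=np.uint8)
blue[2:5, 2:5] = 0

# Stack the channels into a single (7, 7, 3) RGB image array
color_image = np.dstack((red, green, blue))

print(color_image[0, 0])  # [150   0 255] -> purple
print(color_image[3, 3])  # [255 255   0] -> yellow
```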
Using filters to process images
A common way to perform image processing tasks is to apply filters that
modify the pixel values of an image to create a visual effect. A filter is
defined by one or more arrays of pixel values, called filter kernels. For
example, you could define a filter with a 3x3 kernel like this:
-1 -1 -1
-1 8 -1
-1 -1 -1
The kernel is then convolved across the image, calculating a weighted sum
for each 3x3 patch of pixels. To see how this works, let's apply the
kernel to the grayscale image from earlier:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
First, we apply the filter kernel to the top left patch of the image,
multiplying each pixel value by the corresponding weight value in the
kernel and adding the results:
(0 x -1) + (0 x -1) + (0 x -1) +
(0 x -1) + (0 x 8) + (0 x -1) +
(0 x -1) + (0 x -1) + (255 x -1) = -255
The result (-255) becomes the first value in a new array. Then we move
the filter kernel along one pixel to the right and repeat the operation:
(0 x -1) + (0 x -1) + (0 x -1) +
(0 x -1) + (0 x 8) + (0 x -1) +
(0 x -1) + (255 x -1) + (255 x -1) = -510
Again, the result is added to the new array, which now contains two
values:
-255 -510
The process is repeated until the filter has been convolved across the
entire image, as shown in this animation:
The filter is convolved across the image, calculating a new array of values.
Some of the values might fall outside of the 0 to 255 pixel value range,
so they're adjusted to fit within that range. Because of the shape of the
filter, values can't be calculated for the pixels at the outside edge of
the image, so a padding value (usually 0) is applied to them. The
resulting array represents a new image in which the filter has transformed
the original image. In this case, the filter has had the effect of
highlighting the edges of shapes in the image.
To see the effect more clearly, you can apply the same filter to a real
photograph: the filtered result shows the edges of the objects in the
image as highlighted outlines on a dark background.
Because the filter is convolved across the image, this kind of image
manipulation is often referred to as convolutional filtering. The filter
used in this example is a particular type of filter (called a Laplace
filter) that highlights the edges of objects in an image. There are many
other kinds of filters that you can use to create blurring, sharpening,
color inversion, and other effects.
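The following sketch (again assuming NumPy) convolves the Laplace kernel
across the 7x7 grayscale image; the first two values it calculates match
the worked example above (-255 and -510):

```python
import numpy as np

# The 7x7 grayscale image and the 3x3 Laplace kernel from the example above
image = np.zeros((7, 7), dtype=np.int32)
image[2:5, 2:5] = 255

kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# Convolve the kernel across every 3x3 patch of the image
result = np.zeros((5, 5), dtype=np.int32)
for row in range(5):
    for col in range(5):
        patch = image[row:row + 3, col:col + 3]
        result[row, col] = np.sum(patch * kernel)

print(result[0, 0], result[0, 1])  # -255 -510, as in the worked example

# Clip values into the 0-255 range and pad the uncalculated edge with 0
filtered = np.pad(np.clip(result, 0, 255), 1).astype(np.uint8)
print(filtered)
```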
Machine learning for computer vision
Tip
This unit assumes you are familiar with the fundamental principles of
machine learning, and that you have conceptual knowledge of deep
learning with neural networks. If you are new to machine learning,
consider completing the Fundamentals of machine learning module
on Microsoft Learn.
One of the most common machine learning model architectures for computer
vision is a convolutional neural network (CNN). A CNN uses filter kernels
to extract numeric feature maps from images, and then feeds those features
into a deep learning network to predict a label. In an image
classification scenario, the label represents the main subject of the
image; for example, a model might be trained to classify images of fruit
such as apples, bananas, and oranges.
During the training process for a CNN, filter kernels are initially
defined using randomly generated weight values. Then, as the training
process progresses, the model's predictions are evaluated against known
label values, and the filter weights are adjusted to improve accuracy.
Eventually, the trained fruit image classification model uses the filter
weights that best extract features that help identify different kinds of
fruit.
During training the output probabilities are compared to the actual class
label - for example, an image of a banana (class 1) should have the value
[0.0, 1.0, 0.0]. The difference between the predicted and actual class
scores is used to calculate the loss in the model, and the weights in the
fully connected neural network and the filter kernels in the feature
extraction layers are modified to reduce the loss.
The training process repeats over multiple epochs until an optimal set of
weights has been learned. Then, the weights are saved and the model can
be used to predict labels for new images for which the label is unknown.
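As a purely illustrative sketch, assuming PyTorch and a hypothetical
three-class fruit dataset, the following code shows how learnable filter
kernels, a fully connected layer, a loss function, and an optimizer fit
together in a single training step:

```python
import torch
from torch import nn

# Minimal CNN: convolutional filter kernels (learned during training)
# feed a fully connected classification layer for 3 fruit classes.
class FruitCNN(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),  # 8 learnable 3x3 filter kernels
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 32 * 32, num_classes),  # assumes 64x64 input images
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = FruitCNN()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step on a stand-in batch of 64x64 RGB images
images = torch.rand(4, 3, 64, 64)    # random placeholder for real fruit images
labels = torch.tensor([1, 0, 2, 1])  # known classes, e.g. class 1 = banana

optimizer.zero_grad()
logits = model(images)         # predicted class scores
loss = loss_fn(logits, labels) # compare predictions to actual labels
loss.backward()                # compute gradients
optimizer.step()               # adjust filter and FC weights to reduce the loss
```

Repeating this step over many batches and epochs is what gradually tunes
the filter weights as described above.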
Note
CNNs have been at the core of computer vision solutions for many years.
While they're commonly used to solve image classification problems as
described previously, they're also the basis for more complex computer
vision models. For example, object detection models combine CNN feature
extraction layers with the identification of regions of interest in images to
locate multiple classes of object in the same image.
Transformers
Most advances in computer vision over the decades have been driven by
improvements in CNN-based models. However, in another AI discipline,
natural language processing (NLP), a different type of neural network
architecture, called a transformer, has enabled the development of
sophisticated models for language. Transformers work by processing huge
volumes of data and encoding language tokens (representing individual
words or phrases) as vector-based embeddings (arrays of numeric values).
You can think of an embedding as representing a set of dimensions that
each capture some semantic attribute of the token. The embeddings are
created such that tokens that are commonly used in the same context are
closer together dimensionally than unrelated words.
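For example, here's a small sketch with invented (purely illustrative)
embedding values, showing how cosine similarity can be used to measure how
close two token embeddings are:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (values invented purely for illustration)
embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Measure how closely two embedding vectors point in the same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high: related tokens
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # lower: unrelated tokens
```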
Multi-modal models
The success of transformers as a way to build language models has led AI
researchers to apply the same approach to image data, resulting in
multi-modal models that are trained using a large volume of captioned
images, with no fixed labels. The Microsoft Florence model is just such a
model. Trained with huge volumes of captioned images from the Internet, it
includes both a language encoder and an image encoder. Florence is an
example of a foundation model: a pre-trained general model on which you
can build multiple adaptive models for specialist tasks. For example, you
can use Florence as a foundation model for adaptive models that perform:
Image classification: identifying to which category an image belongs.
Object detection: locating individual objects within an image.
Captioning: generating appropriate descriptions of images.
Tagging: compiling a list of relevant text tags for an image.
Azure AI Vision
While you can train your own machine learning models for computer vision,
the architecture for computer vision models can be complex, and you
require significant volumes of training images and compute power to
perform the training process. Microsoft's Azure AI Vision service provides
prebuilt computer vision models, built on a pre-trained foundation model,
that let you create solutions quickly while retaining the ability to train
custom models with relatively few of your own images.
To use Azure AI Vision, you need to create a resource for it in your Azure
subscription. You can use either of the following resource types:
Azure AI Vision: A specific resource type for the Azure AI Vision service.
Use this if you don't intend to use any other Azure AI services, or if you
want to track utilization and costs for Azure AI Vision separately.
Azure AI services: A general resource type that includes Azure AI Vision
along with many other Azure AI services, such as Azure AI Language. Use
this if you plan to use multiple Azure AI services and want to simplify
administration and development.
Optical character recognition (OCR)
Azure AI Vision can detect and read text in images. For example, when used
to analyze an image of a nutrition facts label, the service returns the
text it finds, similar to this:
Nutrition Facts Amount Per Serving
Serving size:1 bar (40g)
Serving Per Package: 4
Total Fat 13g
Saturated Fat 1.5g
Amount Per Serving
Trans Fat 0g
calories 190
Cholesterol 0mg
ories from Fat 110
Sodium 20mg
ntDaily Values are based on
Vitamin A 50
calorie diet
Tip
You can explore Azure AI Vision's OCR capabilities further in the Read
text with Azure AI Vision module on Microsoft Learn.
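As one way to call this capability from code, here's a minimal sketch
using the Azure AI Vision Image Analysis client library for Python (the
azure-ai-vision-imageanalysis package); the endpoint, key, and image file
name are placeholders you'd replace with your own values:

```python
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key from your Azure AI Vision or Azure AI services resource
client = ImageAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

# Read the text in a local image file (file name is a placeholder)
with open("nutrition-label.jpg", "rb") as f:
    result = client.analyze(image_data=f.read(), visual_features=[VisualFeatures.READ])

# Print each line of text found in the image
if result.read is not None:
    for block in result.read.blocks:
        for line in block.lines:
            print(line.text)
```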
Azure AI Vision has the ability to analyze an image, evaluate the objects
that are detected, and generate a human-readable phrase or sentence
that can describe what was detected in the image. For example, consider
an image of a skateboarder performing a jump. Azure AI Vision returns a
caption that describes the main subject of the image.
Azure AI Vision can also identify thousands of common objects in images,
returning a bounding box and a confidence score for each detected object.
For example, the objects detected in the skateboarder image include:
Skateboard (90.40%)
Person (95.5%)
Azure AI Vision can suggest tags for an image based on its contents.
These tags can be associated with the image as metadata that
summarizes attributes of the image and can be useful if you want to index
an image along with a set of key terms that might be used to search for
images with specific attributes or contents.
For example, the tags returned for the skateboarder image (with
associated confidence scores) include:
sport (99.60%)
person (99.56%)
footwear (98.05%)
skating (96.27%)
boardsport (95.58%)
skateboarding equipment (94.43%)
clothing (94.02%)
wall (93.81%)
skateboarding (93.78%)
skateboarder (93.25%)
individual sports (92.80%)
street stunts (90.81%)
balance (90.81%)
jumping (89.87%)
sports equipment (88.61%)
extreme sport (88.35%)
kickflip (88.18%)
stunt (87.27%)
skateboard (86.87%)
stunt performer (85.83%)
knee (85.30%)
sports (85.24%)
longboard (84.61%)
longboarding (84.45%)
riding (73.37%)
skate (67.27%)
air (64.83%)
young (63.29%)
outdoor (61.39%)
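Here's a similar sketch (same client library, same placeholder endpoint,
key, and file name) that requests a caption, tags, and objects for an
image in a single call:

```python
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

with open("skateboarder.jpg", "rb") as f:  # placeholder file name
    result = client.analyze(
        image_data=f.read(),
        visual_features=[VisualFeatures.CAPTION, VisualFeatures.TAGS, VisualFeatures.OBJECTS],
    )

if result.caption is not None:
    print(f"Caption: {result.caption.text} ({result.caption.confidence:.2%})")

if result.tags is not None:
    for tag in result.tags.list:
        print(f"Tag: {tag.name} ({tag.confidence:.2%})")

if result.objects is not None:
    for obj in result.objects.list:
        print(f"Object: {obj.tags[0].name} at {obj.bounding_box}")
```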
If the built-in models provided by Azure AI Vision don't meet your needs,
you can use the service to train a custom model for image
classification or object detection. Azure AI Vision builds custom models on
the pre-trained foundation model, meaning that you can train
sophisticated models by using relatively few training images.
Image classification: a custom model trained to predict the category, or
class, of an image based on its overall contents.
Object detection: a custom model trained to detect and locate specific
classes of object in an image, returning a bounding box for each detected
object.
Details of how to use Azure AI Vision to train a custom model are beyond
the scope of this module. You can find information about custom model
training in the Azure AI Vision documentation.
Exercise: Analyze images in Vision Studio
In this exercise, you will use Vision Studio to analyze images using the
built-in try-it-out experiences. Suppose the fictitious retailer Northwind
Traders has decided to implement a “smart store”, in which AI services
monitor the store to identify customers requiring assistance, and direct
employees to help them. By using Azure AI Vision, images taken by
cameras throughout the store can be analyzed to provide meaningful
descriptions of what they depict.
Create an Azure AI services resource
You can use Azure AI Vision’s image analysis capabilities with an Azure AI
services multi-service resource. If you haven’t already done so, create
an Azure AI services resource in your Azure subscription.
1. In another browser tab, open Vision Studio
(https://portal.vision.cognitive.azure.com).
2. Sign in with your account, making sure you are using the same directory
as the one where you created your Azure AI services resource.
3. On the Vision Studio home page, select View all resources under
the Getting started with Vision heading.
4. On the Select a resource to work with page, hover your mouse
cursor over the resource you created above in the list and then
check the box to the left of the resource name, then select Select
as default resource.
Note: If your resource is not listed, you may need to Refresh the page.
5. Close the settings page by selecting the “x” at the top right of the
screen.
Now you are ready to use Vision Studio to analyze images taken by a
camera in the Northwind Traders store.
Generate captions for an image
4. Select https://aka.ms/mslearn-images-for-analysis to download
image-analysis.zip. Open the folder on your computer and locate the file
named store-camera-1.jpg.
5. Upload the store-camera-1.jpg image by dragging it to the Drag
and drop files here box, or by browsing to it on your file system.
8. Hover over one of the captions in the Detected attributes list and
observe what happens within the image.
Move your mouse cursor over the other captions in the list, and
notice how the bounding box shifts in the image to highlight the
portion of the image used to generate the caption.
Tagging images
The next feature you will try is the Extract Tags functionality. Extract
tags is based on thousands of recognizable objects, including living
beings, scenery, and actions.
1. Return to the home page of Vision Studio, then select the Extract
common tags from images tile under the Image analysis tab.
2. Under Choose the model you want to try out, leave Prebuilt
product vs. gap model selected. Under Choose your language, select
English or a language of your preference.
3. Open the folder containing the images you downloaded and locate
the file named store-camera-2.jpg.
4. Upload the store-camera-2.jpg file.
5. Review the list of tags extracted from the image and the confidence
score for each in the detected attributes panel. Here the confidence
score is the likelihood that the text for the detected attribute
describes what is actually in the image. Notice in the list of tags that
it includes not only objects, but actions, such as shopping, selling,
and standing.
Object detection
In this task, you use the Object detection feature of Image Analysis.
Object detection detects and extracts bounding boxes based on
thousands of recognizable objects and living beings.
1. Return to the home page of Vision Studio, then select the Detect
common objects in images tile under the Image analysis tab.
2. Under Choose the model you want to try out, leave Prebuilt
product vs. gap model selected.
3. Open the folder containing the images you downloaded and locate
the file named store-camera-3.jpg.
4. Upload the store-camera-3.jpg file.
Clean up
If you don’t intend to do more exercises, delete any resources that you no
longer need. This avoids accruing any unnecessary costs.
1. Open the Azure portal and select the resource group that contains the
resource you created.
2. Select the resource and select Delete and then Yes to confirm. The
resource is then deleted.
Learn more
To learn more about what you can do with this service, see the Azure AI
Vision page.
Knowledge check
1. You want to use Azure AI Vision and Azure AI Language from a single
resource. Which kind of resource should you create?
Azure AI Vision
Azure AI services
Correct. An Azure AI Services resource supports both Azure AI Vision and
Azure AI Language.
Azure OpenAI service
2. Which Azure AI Vision feature identifies items in an image and returns
a bounding box to indicate their location?
Objects
Correct. Azure AI Vision returns objects with a bounding box to indicate
their location in the image.
Visual Tags
Dense Captions