COMPUTER VISION
Computer Vision is a subfield of Artificial Intelligence that empowers machines to see,
recognize, and process images much as humans do. Like other types of AI, computer
vision seeks to perform and automate tasks that replicate human capabilities; in this case,
it replicates both the way humans see and the way humans make sense of what they see.
The main objective of computer vision is to understand the content of digital images.
Unlike humans, computers have no innate gift of vision or perception, so this
understanding has to be built from data and algorithms.
Machine learning is a subfield of artificial intelligence (AI) that focuses on developing
algorithms and statistical models that enable computers to perform tasks without explicit
instructions: they learn patterns and make decisions from data, improving their
performance over time.
Deep learning is a subset of machine learning based on neural networks with many layers
(deep neural networks), which are designed to automatically learn and extract features
from raw data, allowing them to model complex patterns and representations.
HOW COMPUTER VISION WORKS
Computer vision applications use input from sensing devices, artificial intelligence,
machine learning, and deep learning to replicate the way the human vision system works.
Computer vision applications run on algorithms that are trained on massive amounts of
visual data or images in the cloud. They recognize patterns in this visual data and use those
patterns to determine the content of other images.
COMPUTER VISION APPLICATIONS
Computer vision can be combined with many types of applications and sensing devices to
support a number of practical use cases. Here are just a few different types of computer
vision applications:
1. Content organization: Computer vision can be used to identify people or objects in
photos and organize them based on that identification. Photo recognition
applications like this are commonly used in photo storage and social media
applications.
2. Text extraction: Optical Character Recognition (OCR), also referred to as text
recognition or text extraction, is a set of techniques for extracting printed or
handwritten text from images such as posters, street signs, and product labels, as
well as from documents like articles, reports, forms, and invoices. The text is
typically extracted as words, text lines, and paragraphs or text blocks, enabling
access to a digital version of the scanned text. This eliminates or significantly reduces
the need for manual data entry.
3. Augmented reality: Physical objects are detected and tracked in real-time with
computer vision through Object Detection and Object Tracking. Object detection is
used to recognize objects in visual data while Object tracking is used to understand
movement, count people and objects. This information is then used to realistically
place virtual objects in a physical environment.
4. Agriculture: Images of crops taken from satellites, drones, or planes can be
analyzed to monitor harvests, detect weed emergence, or identify crop nutrient
deficiency.
5. Autonomous vehicles: Self-driving cars use real-time object identification and
tracking to gather information about what's happening around a car and route the
car accordingly.
6. Healthcare: Photos or images captured by medical devices can be analyzed to help
doctors identify problems and make diagnoses more quickly and accurately.
7. Sports: Object detection and tracking are used for player and strategy analysis.
Computer vision technology in sports helps process the vast amounts of visual data
generated throughout games to support match-related decisions in real time,
develop new training schemes, and much more. These computer vision solutions are
extremely helpful for data collection and sports analytics.
8. Manufacturing: Computer vision can monitor manufacturing machinery for
maintenance purposes. It can also be used to monitor product quality and packaging
on a production line.
9. Spatial analysis: The system identifies people or objects, such as cars, in a space
and tracks their movement within that space. For example, you can use Azure AI
Vision Spatial Analysis to detect the presence and movements of people in video.
The service can do things like count the number of people entering a space or
measure compliance with face mask and social distancing guidelines.
10. Face recognition: Computer vision can be applied to identify individuals. Facial
recognition can detect and identify individual faces in an image containing one or
many people's faces, in both frontal and side profiles (a minimal detection sketch
follows this list).
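As a concrete illustration of the face recognition use case above, here is a minimal sketch of the detection stage using OpenCV's bundled Haar cascade classifier. It assumes the opencv-python package is installed; the file name people.jpg is a placeholder, and the scaleFactor/minNeighbors values are illustrative defaults rather than tuned settings.

import cv2

# Load the pre-trained frontal-face Haar cascade that ships with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("people.jpg")                      # placeholder input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect faces; scaleFactor and minNeighbors trade off sensitivity vs. false positives.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw a rectangle around each detected face and save the result.
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("people_faces.jpg", image)
print(f"Detected {len(faces)} face(s)")

Detection of this kind is only the first stage of face recognition; identifying who each face belongs to requires an additional recognition model trained on known individuals.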
COMPARISON OF HUMAN VISUAL SYSTEMS WITH COMPUTER VISION SYSTEMS
1. Sensory Input and Data Acquisition
Human Visual System:
Sensory Organs: Eyes capture light and convert it into neural signals using
photoreceptors (rods and cones).
Dynamic Range: The human eye has a high dynamic range, adapting to various
light conditions.
Resolution: Humans have a varying resolution across the visual field, with the
highest resolution at the fovea (central vision).
Color Perception: Humans perceive color through three types of cones
(trichromatic vision) sensitive to red, green, and blue wavelengths.
Computer Vision System:
Sensors: Cameras capture images using digital sensors, converting light into
electronic signals.
Dynamic Range: Modern cameras have high dynamic ranges but are generally less
adaptable than human eyes.
Resolution: Cameras can have very high resolution, limited by the sensor’s
megapixel count.
Color Perception: Cameras typically use RGB filters to capture color, similar to
human trichromatic vision, but can also include additional filters (e.g., infrared).
2. Processing Architecture
Human Visual System:
Parallel Processing: Processes multiple aspects of visual information (color,
motion, depth) simultaneously in different brain areas.
Neural Networks: Utilizes biological neural networks in the brain, particularly the
visual cortex, for processing.
Adaptability: Highly adaptive to changes and capable of learning from experience
and context.
Computer Vision System:
Sequential and Parallel Processing: Can use both sequential algorithms and
parallel processing techniques, especially in GPU-based systems.
Artificial Neural Networks: Utilizes artificial neural networks (e.g., CNNs) to
process visual information, inspired by biological networks (a minimal CNN sketch
follows this list).
Training and Adaptability: Requires extensive training on large datasets to adapt
and improve performance.
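To make the point about convolutional neural networks concrete, here is a minimal sketch of a tiny CNN classifier in PyTorch. It assumes PyTorch is installed; the layer sizes, the 32x32 RGB input, and the 10 output classes are illustrative choices, not tied to any particular dataset or production architecture.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A deliberately small CNN: two convolution blocks followed by a linear classifier."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learn 16 local filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # downsample by 2
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # learn 32 higher-level filters
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)            # (N, 32, 8, 8) for 32x32 inputs
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = TinyCNN()
dummy = torch.randn(1, 3, 32, 32)       # one fake 32x32 RGB image
print(model(dummy).shape)               # torch.Size([1, 10])

In practice such a network is trained on a large labeled dataset, which is the "extensive training" requirement noted above.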
3. Perception and Interpretation
Human Visual System:
Contextual Understanding: Highly contextual, using prior knowledge, experiences,
and expectations to interpret visual information.
Depth Perception: Achieved through binocular vision and various monocular cues
(e.g., perspective, occlusion).
Attention and Focus: Capable of focusing on specific details while filtering out
irrelevant information, driven by cognitive processes.
Computer Vision System:
Algorithmic Interpretation: Relies on algorithms and models trained on labeled
data to interpret images.
Depth Perception: Achieved through techniques like stereo vision, depth sensors,
and structure from motion.
Attention Mechanisms: Implemented through attention models in neural
networks, allowing the system to focus on relevant parts of the image.
DIGITAL IMAGE FORMATION AND REPRESENTATION
Image formation in computer vision refers to the process of capturing and representing
visual information in the form of digital images. It involves the conversion of real-world
scenes into a digital format that can be processed and analyzed by computer algorithms.
Understanding the image formation process is crucial for various computer vision tasks
such as object detection, recognition, and tracking.
The image formation process typically involves the following stages:
1. Scene: The process begins with a real-world scene that contains objects, lighting
conditions, and a viewing perspective. The scene can be indoor or outdoor and may
involve various objects, textures, colors, and shapes.
2. Illumination: Illumination refers to the light sources present in the scene. It plays a
vital role in determining the appearance of objects in an image. The lighting
conditions, such as brightness, color, and direction, affect how the objects and their
features are represented in the image.
3. Reflectance: Reflectance describes the interaction between light and the objects in
the scene. When light interacts with an object’s surface, it can be absorbed,
transmitted, or reflected. The reflectance properties of objects, such as their colors,
textures, and materials, influence how they appear in the image.
4. Optics: The next step involves the interaction of light with optical components, such
as lenses, that focus the light onto a photosensitive sensor or film. The lens captures
the light rays from different parts of the scene and projects them onto the sensor.
5. Sensor: In modern digital cameras, an image sensor is used to capture the incoming
light and convert it into an electrical signal. The most common type of image sensor
is a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor
(CMOS) sensor. These sensors consist of an array of pixels that record the intensity
of light falling on them.
6. Sampling: The continuous analog signal captured by the image sensor needs to be
converted into discrete digital values. This process is known as sampling. The
sensor divides the image into a grid of pixels, and each pixel measures the intensity
of light falling on it. The intensity values are then quantized into discrete levels
based on the bit depth of the sensor, resulting in a digital representation of the
image.
7. Quantization: Quantization is the process of assigning discrete intensity levels to
the continuous measurements obtained from the sensor. The number of intensity
levels depends on the bit depth of the sensor. For example, an 8-bit sensor can
represent 256 discrete levels of intensity (0 to 255); a short numerical sketch of this
step appears below.
8. Digital Image: The final result of the image formation process is a digital image,
which is a two-dimensional grid of pixels, each having a specific intensity value. The
image represents a visual representation of the scene captured by the camera. The
resolution of the image depends on the number of pixels in the sensor and the
quality of the optics.
It’s important to note that image formation can also be affected by various factors such as
camera settings (e.g., exposure time, aperture), noise, lens aberrations, and post-processing
techniques. Understanding the image formation process helps computer vision algorithms
interpret and extract meaningful information from digital images.
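As promised above, here is a toy numerical sketch of the sampling and quantization steps: a small grid of simulated analog intensities is mapped onto the 256 discrete levels of an 8-bit sensor. It assumes NumPy is installed; the 4x4 grid and the random values are made up purely for illustration.

import numpy as np

# Simulated analog intensities in [0.0, 1.0) on a tiny 4x4 pixel grid.
rng = np.random.default_rng(0)
analog = rng.random((4, 4))

bit_depth = 8
levels = 2 ** bit_depth                  # 256 levels for an 8-bit sensor

# Quantize: scale to [0, 255] and round each sample to the nearest integer level.
digital = np.clip(np.round(analog * (levels - 1)), 0, levels - 1).astype(np.uint8)

print(digital)                           # a 4x4 grid of integers in [0, 255]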
Image representation in computer vision refers to the process of converting an image
into a numerical or symbolic form that can be easily understood and processed by a
computer. Images are typically represented as a collection of pixels, where each pixel
corresponds to a specific color or intensity value. The goal of image representation is to
extract relevant features and information from the image, enabling the computer to
perform various tasks, such as object recognition, image classification, and image
segmentation.
There are several common techniques for image representation in computer vision, and
the choice of representation depends on the specific task at hand. Different
representations may be more suitable for different applications, and selecting the most
appropriate representation significantly impacts the performance and accuracy of the
computer vision system. Effective image representation matters for several reasons:
Efficient storage and processing: Raw images are typically high-dimensional data with
pixel values, making them cumbersome to handle. Image representation techniques help in
compressing the information into more compact and meaningful formats, reducing the
memory requirements and computational complexity.
Feature extraction: Image representation methods extract relevant features and patterns
from the images. These features highlight the important aspects of the image, such as
edges, textures, shapes, and colors, which are essential for understanding and recognizing
objects and scenes.
Generalization: Effective image representations help in generalizing the learning from one
set of images to others. When a model learns meaningful features, it becomes capable of
recognizing similar patterns in new and unseen data, improving its performance on diverse
images.
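As a small illustration of image representation, the sketch below contrasts the raw pixel-array form of an image with a compact color-histogram feature vector. It assumes opencv-python is installed; scene.jpg is a placeholder file name and the choice of 8 bins per channel is arbitrary.

import cv2

image = cv2.imread("scene.jpg")               # H x W x 3 array of uint8 pixel values
print("Raw representation:", image.shape, image.dtype)

# Compact representation: a 3-D color histogram with 8 bins per channel (8^3 = 512 numbers).
hist = cv2.calcHist([image], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
hist = cv2.normalize(hist, hist).flatten()

print("Histogram feature vector:", hist.shape)   # (512,)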
IMAGE PROCESSING
Image processing is the process of transforming an image into a digital form and
performing certain operations to get some useful information from it. Image processing
algorithms are used to extract information from images, restore and compress image and
video data, and build new experiences in virtual and augmented reality. Computer vision
uses image processing to recognize and categorize image data.
Image processing is commonly used for purposes such as:
Sharpening and restoration - Create an enhanced image from the original image
Pattern recognition - Measure the various patterns around the objects in the image
Retrieval - Browse and search a large database of digital images for images similar to
the original image
Image Acquisition
Image acquisition is the first step in image processing. This step is also known as
preprocessing in image processing. It involves retrieving the image from a source, usually a
hardware-based source.
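A minimal sketch of image acquisition with OpenCV, assuming opencv-python is installed: one frame is read from the default camera and one image is read from disk; the device index 0 and the file name sample.jpg are placeholders.

import cv2

# Acquire a single frame from a hardware source (the default camera, index 0).
camera = cv2.VideoCapture(0)
ok, frame = camera.read()
camera.release()

# Acquire an image from a file on disk.
image = cv2.imread("sample.jpg")

print("Camera frame captured:", ok)
print("File image loaded:", image is not None)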
Image Enhancement
Image enhancement is the process of bringing out and highlighting features of interest
that have been obscured in an image. This can involve changing the brightness,
contrast, etc.
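A minimal sketch of such an adjustment, assuming opencv-python is installed: brightness and contrast are changed with a simple linear transform; input.jpg and the alpha/beta values are illustrative.

import cv2

image = cv2.imread("input.jpg")

# Linear enhancement: new_pixel = alpha * pixel + beta
# (alpha > 1 increases contrast, beta > 0 increases brightness).
enhanced = cv2.convertScaleAbs(image, alpha=1.3, beta=20)

cv2.imwrite("enhanced.jpg", enhanced)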
Image Restoration
Image restoration is the process of improving the appearance of an image. However, unlike
image enhancement, image restoration is done using certain mathematical or probabilistic
models.
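A minimal sketch of restoration by denoising, assuming opencv-python is installed: unlike the simple linear enhancement above, non-local means denoising is a model-based method that estimates the clean image from similar patches; noisy.jpg and the filter strengths are illustrative.

import cv2

noisy = cv2.imread("noisy.jpg")

# Non-local means denoising: averages similar patches across the image rather
# than simply stretching brightness or contrast.
restored = cv2.fastNlMeansDenoisingColored(noisy, None, h=10, hColor=10,
                                           templateWindowSize=7, searchWindowSize=21)

cv2.imwrite("restored.jpg", restored)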
Color Image Processing
Color image processing includes a number of color modeling techniques in a digital domain.
This step has gained prominence due to the significant use of digital images over the
internet.
Wavelets and Multiresolution Processing
Wavelets are used to represent images at various degrees of resolution. The images are
subdivided into wavelets, or smaller regions, for data compression and for pyramidal
representation.
Compression
Compression is a process used to reduce the storage required to save an image or the
bandwidth required to transmit it. This is done particularly when the image is for use on
the Internet.
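A minimal sketch of lossy compression, assuming opencv-python is installed: the image is re-encoded as JPEG at a reduced quality setting to shrink its size; the file names and the quality value of 60 are illustrative.

import cv2

image = cv2.imread("photo.png")

# JPEG quality runs from 0 to 100; lower values give smaller files and more artifacts.
cv2.imwrite("photo_compressed.jpg", image, [cv2.IMWRITE_JPEG_QUALITY, 60])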
Morphological Processing
Morphological processing is a set of operations that process images based on their
shapes, such as dilation and erosion.
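A minimal sketch of these operations, assuming opencv-python and NumPy are installed: erosion, dilation, and opening are applied to a binary mask; mask.png and the 3x3 structuring element are illustrative.

import cv2
import numpy as np

mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)      # placeholder binary mask
kernel = np.ones((3, 3), np.uint8)                       # 3x3 structuring element

eroded = cv2.erode(mask, kernel, iterations=1)           # shrinks white regions, removes specks
dilated = cv2.dilate(mask, kernel, iterations=1)         # grows white regions, fills small holes
opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # erosion followed by dilation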
Segmentation
Segmentation is one of the most difficult steps of image processing. It involves partitioning
an image into its constituent parts or objects.
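A minimal sketch of segmentation by thresholding, assuming opencv-python is installed: Otsu's method picks a global threshold from the image histogram to split foreground from background; cells.png is a placeholder, and real segmentation pipelines are usually far more elaborate.

import cv2

gray = cv2.imread("cells.png", cv2.IMREAD_GRAYSCALE)

# Otsu's method chooses the threshold automatically from the image histogram.
_, segmented = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("cells_segmented.png", segmented)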
Representation and Description
After an image is segmented into regions in the segmentation process, each region is
represented and described in a form suitable for further computer processing.
Representation deals with the image's characteristics and regional properties. Description
deals with extracting quantitative information that helps differentiate one class of objects
from another.
Recognition
Recognition assigns a label to an object based on the descriptors produced in the
preceding steps, for example labeling a segmented region as a vehicle or a person.
Image Reconstruction
Image processing can be used to recover and fill in the missing or corrupt parts of an
image. This involves using image processing systems that have been trained extensively
on existing photo datasets to create newer versions of old and damaged photos.
Face Detection
One of the most common applications of image processing that we use today is face
detection. It relies on deep learning algorithms in which the machine is first trained on the
characteristic features of human faces, such as the shape of the face and the distance
between the eyes. Once trained, the machine can detect objects in an image that resemble
a human face. Face detection is a vital tool used in security, biometrics, and even the
filters available on most social media apps today.
Feature Extraction in Image Processing
Feature extraction is a critical step in image processing and computer vision, involving
the identification and representation of distinctive structures within an image. This process
transforms raw image data into numerical features that can be processed while
preserving the essential information. These features are vital for various downstream
tasks such as object detection, classification, and image matching.
1. Edge detection
Edge detection is a fundamental technique in image processing used to identify boundaries
within an image. It is crucial for tasks like object detection, image segmentation, and feature
extraction. Essentially, edge detection algorithms aim to locate points where the intensity
of an image changes abruptly, which typically signifies the presence of an edge or boundary
between different objects or regions. Common edge detection methods are listed below
(a short OpenCV sketch follows the list):
I. Sobel, Prewitt, and Roberts Operators: These methods are based on calculating
the gradient of the image intensity. They operate by convolving the image with a
small, predefined kernel that highlights the intensity changes in horizontal and
vertical directions. By computing the gradient magnitude and direction at each pixel,
these operators can identify edges where the intensity changes are significant. The
Sobel operator, for example, uses a 3×3 kernel to compute gradients, while Prewitt
and Roberts operators use similar principles but with different kernel designs.
II. Canny Edge Detector: Unlike the previous methods, the Canny Edge Detector is a
multi-stage algorithm that provides more refined edge detection. The Canny Edge
Detector is known for its ability to detect a wide range of edges while suppressing
noise and minimizing false detections. It comprises several steps:
Gaussian Smoothing: The input image is convolved with a Gaussian kernel
to reduce noise and smooth out the image.
Gradient Calculation: Sobel operators are applied to compute the gradient
magnitude and direction at each pixel.
Non-maximum Suppression: This step helps thin the detected edges by
retaining only local maxima of the gradient magnitude along the direction of
the gradient.
Double Thresholding: Pixels are classified as strong, weak, or non-edge
pixels based on their gradient magnitudes. A high threshold determines
strong edge pixels, while a low threshold identifies weak edge pixels.
Edge Tracking by Hysteresis: Weak edge pixels that are adjacent to strong
edge pixels are considered as part of the edge. This helps in connecting
discontinuous edge segments.
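The sketch referred to above, assuming opencv-python is installed: Sobel gradients are computed explicitly, and the Canny detector is run in a single call; street.jpg, the kernel size, and the thresholds are illustrative.

import cv2

gray = cv2.imread("street.jpg", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)            # smooth first to suppress noise

# Sobel gradients in x and y, combined into a gradient magnitude image.
gx = cv2.Sobel(blurred, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(blurred, cv2.CV_64F, 0, 1, ksize=3)
magnitude = cv2.magnitude(gx, gy)

# Canny: gradients, non-maximum suppression and hysteresis thresholding in one call.
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)

cv2.imwrite("street_edges.png", edges)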
2. Corner detection
Corner detection is another important technique in image processing, particularly in
computer vision and pattern recognition. It aims to identify points in an image where the
intensity changes significantly in multiple directions, indicating the presence of corners or
junctions between edges. Corners are valuable features because they often correspond to
keypoints that can be used for tasks like image alignment, object tracking, and 3D
reconstruction.
Common corner detection methods are:
I. Harris Corner Detector: The Harris Corner Detector is a classic method for corner
detection. It works by analyzing local intensity variations in different directions
using the concept of the auto-correlation matrix. Specifically, it measures the
variation in intensity for a small displacement of a window in all directions. By
calculating the eigenvalues of the auto-correlation matrix, the algorithm identifies
corners as points where the eigenvalues are large in both directions. The Harris
Corner Detector typically uses a Gaussian window function to weight the intensity
values within the window, which helps in making the detector more robust to noise.
II. Shi-Tomasi Corner Detector: The Shi-Tomasi Corner Detector is an enhancement
over the Harris Corner Detector. It uses a similar approach but introduces a
different criterion for corner detection. Instead of relying solely on the eigenvalues
of the auto-correlation matrix, Shi-Tomasi proposed using the minimum eigenvalue
of the matrix as a corner measure. This modification leads to better performance,
especially in cases where there are multiple corners in close proximity or when the
corners have varying degrees of contrast.
Both the Harris Corner Detector and the Shi-Tomasi Corner Detector are widely used in
computer vision applications. They are crucial for tasks like feature-based registration,
image stitching, object recognition, and motion tracking. Corner detection plays a
fundamental role in extracting meaningful information from images and enabling high-
level analysis and interpretation.
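A minimal sketch of both detectors, assuming opencv-python and NumPy are installed: the Harris response map is thresholded, while the Shi-Tomasi detector returns corner coordinates directly; building.jpg and the parameter values are illustrative.

import cv2
import numpy as np

gray = cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE)
gray_f = np.float32(gray)

# Harris: a response map; large values indicate corner-like points.
harris = cv2.cornerHarris(gray_f, blockSize=2, ksize=3, k=0.04)
strong = harris > 0.01 * harris.max()

# Shi-Tomasi: returns up to maxCorners corner coordinates directly.
corners = cv2.goodFeaturesToTrack(gray, maxCorners=100, qualityLevel=0.01, minDistance=10)

print("Harris response pixels above threshold:", int(strong.sum()))
print("Shi-Tomasi corners found:", 0 if corners is None else len(corners))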
3. Blob detection
Blob detection is a technique used in image processing to identify regions within an image
that exhibit significant differences in properties such as brightness, color, or texture
compared to their surroundings. These regions are often referred to as “blobs,” and
detecting them is useful for tasks such as object recognition, image segmentation, and
feature extraction.
Common blob detection methods:
1. Laplacian of Gaussian (LoG): The LoG method is a popular technique for blob
detection. It involves convolving the image with a Gaussian kernel to smooth it and
then applying the Laplacian operator to highlight regions of rapid intensity change.
The Laplacian operator computes the second derivative of the image, emphasizing
areas where the intensity changes sharply. By detecting zero-crossings in the
resulting Laplacian image, the LoG method identifies potential blob locations. This
approach is effective at detecting blobs of various sizes but can be computationally
expensive due to the convolution with the Gaussian kernel.
2. Difference of Gaussians (DoG): The DoG method is an approximation of the LoG
method and offers a computationally efficient alternative. It involves subtracting
two blurred versions of the original image, each smoothed with a Gaussian filter of
different standard deviations. By subtracting the blurred images, the DoG method
highlights areas of rapid intensity change, which correspond to potential blob
locations. Similar to the LoG method, the DoG approach also detects blobs by
identifying zero-crossings in the resulting difference image.
3. Determinant of Hessian: The Determinant of Hessian method is another blob
detection technique that relies on the Hessian matrix, which describes the local
curvature of an image. By computing the determinant of the Hessian matrix at each
pixel, this method identifies regions where the intensity changes significantly in
multiple directions, indicating the presence of a blob. The determinant of the
Hessian measures the strength of blob-like structures, enabling robust blob
detection across different scales.
These blob detection methods are valuable tools in image analysis and computer vision
applications. They allow for the identification and localization of objects or regions of
interest within an image, even in the presence of noise or variations in lighting conditions.
By detecting blobs, these methods facilitate subsequent processing steps, such as object
tracking, segmentation, and recognition.
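A minimal sketch of Difference-of-Gaussians blob detection, assuming opencv-python and NumPy are installed: the image is blurred at two scales and the results subtracted, so blob-like structures near the in-between scale give strong responses; dots.png, the sigma values, and the threshold are illustrative.

import cv2
import numpy as np

gray = cv2.imread("dots.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Blur at two scales and subtract: blob-like structures between the two scales
# produce strong positive or negative responses in the difference image.
blur_small = cv2.GaussianBlur(gray, (0, 0), sigmaX=2)
blur_large = cv2.GaussianBlur(gray, (0, 0), sigmaX=4)
dog = blur_small - blur_large

# Keep only the strongest responses as candidate blob regions.
candidates = np.abs(dog) > 0.1 * np.abs(dog).max()
print("Candidate blob pixels:", int(candidates.sum()))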
4. Texture Analysis
Texture analysis is a vital aspect of image processing and computer vision that focuses on
quantifying and characterizing the spatial arrangement of pixel intensities within an image.
Understanding the texture of an image can be crucial for tasks like classification,
segmentation, and recognition, particularly when dealing with images containing repetitive
patterns or complex structures.
Common texture analysis methods:
1. Gray-Level Co-occurrence Matrix (GLCM): GLCM is a statistical method used to
capture the spatial relationships between pixels in an image. It measures the
frequency of occurrence of pairs of pixel values at specified distances and
orientations within an image. By analyzing these pixel pairs, GLCM can extract
texture features such as contrast, correlation, energy, and homogeneity, which
provide information about the texture properties of the image. GLCM is particularly
effective for analyzing textures with well-defined patterns and structures.
2. Local Binary Patterns (LBP): LBP is a simple yet powerful method for texture
description. It operates by comparing each pixel in an image with its neighboring
pixels and assigning a binary code based on whether the neighbor’s intensity is
greater than or less than the central pixel’s intensity. These binary patterns are then
used to encode the texture information of the image. LBP is robust to changes in
illumination and is computationally efficient, making it suitable for real-time
applications such as face recognition, texture classification, and object detection.
3. Gabor Filters: Gabor filters are a set of bandpass filters that are widely used for
texture analysis and feature extraction. They are designed to mimic the response of
human visual system cells to different spatial frequencies and orientations. By
convolving an image with a bank of Gabor filters at various scales and orientations,
Gabor features can be extracted, which capture information about the texture’s
spatial frequency and orientation characteristics. Gabor filters are particularly
effective for analyzing textures with varying scales and orientations, making them
suitable for tasks such as texture segmentation, classification, and recognition.
These texture analysis methods offer valuable insights into the spatial structure and
patterns present in an image, enabling more robust and informative analysis for various
computer vision applications. By extracting relevant texture features, these methods
facilitate tasks such as image understanding, object recognition, and scene
understanding in diverse domains including medical imaging, remote sensing, and
industrial inspection.
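A minimal sketch of GLCM and LBP texture features, assuming NumPy and scikit-image are installed (the function names follow recent scikit-image releases; older releases spell the GLCM functions greycomatrix/greycoprops); fabric.png and the parameter choices are illustrative.

import numpy as np
from skimage import io
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern

gray = io.imread("fabric.png", as_gray=True)           # float image in [0, 1]
gray_u8 = (gray * 255).astype(np.uint8)

# GLCM at distance 1 in the horizontal direction, then two texture properties.
glcm = graycomatrix(gray_u8, distances=[1], angles=[0], levels=256,
                    symmetric=True, normed=True)
print("Contrast:", graycoprops(glcm, "contrast")[0, 0])
print("Homogeneity:", graycoprops(glcm, "homogeneity")[0, 0])

# LBP with 8 neighbors at radius 1; the histogram of codes is the texture descriptor.
lbp = local_binary_pattern(gray_u8, P=8, R=1, method="uniform")
hist, _ = np.histogram(lbp, bins=np.arange(0, 11), density=True)
print("LBP histogram:", hist)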
Applications of Feature Extraction for Image Processing
Object Recognition: Edge features help separate objects from the background, while
texture and shape features help distinguish between different objects within an
image.
Facial Recognition: Factors such as facial symmetry and convexity, face shape and
size, the distance between the eyes, the width across the base of the nose, forehead
size and width, cheek and cheekbone size and spacing, the vertical proportions of the
face below and between the eyes, jaw size and shape, nose size and shape, and lip
size all affect face categorisation.
Medical Imaging: Features extracted from MRI or CT images can be used to detect
and analyze anomalies such as tumors caused by disease with a high probability of
success.
Remote Sensing: Features such as vegetation indices, water bodies, and urban areas
extracted from satellite imagery are very valuable for environmental mapping.
Content-Based Image Retrieval (CBIR): Retrieving images from a database based
on the content of the images rather than metadata.
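A minimal sketch of the CBIR idea above, assuming opencv-python is installed: each image is summarized by a normalized color histogram and database images are ranked by histogram correlation with the query; all file names are placeholders, and real CBIR systems use much richer descriptors.

import cv2

def color_hist(path):
    """Summarize an image's content as a normalized 3-D color histogram."""
    image = cv2.imread(path)
    hist = cv2.calcHist([image], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

query = color_hist("query.jpg")
database = ["img1.jpg", "img2.jpg", "img3.jpg"]          # placeholder image collection

# Higher correlation means more similar content; sort most-similar first.
scores = [(path, cv2.compareHist(query, color_hist(path), cv2.HISTCMP_CORREL))
          for path in database]
for path, score in sorted(scores, key=lambda s: s[1], reverse=True):
    print(path, round(score, 3))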